Most recent edit on 2008-07-28 06:17:48 by UdenSherpa
Additions:
Speech signals can be represented as waveforms. In a sample of waveform covering an initial consonant, a vowel and a final consonant, the vowel portion is periodic: the same pattern repeats at a regular interval. Consonants can be either periodic (voiced) or non-periodic (unvoiced). The waveform is converted from the time domain to the frequency domain by means of the Fourier transform, and the resulting representations are called spectra. A spectrum can be represented compactly by the Mel-Cepstral Coefficients (MCEP), also known as the spectral parameter. This parameter, together with other components such as the excitation parameter log F0 (the logarithm of the fundamental frequency, which carries the pitch and so much of the emotion in speech), is used to model speech.
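As a rough illustration of the time-domain to frequency-domain step, the sketch below builds a synthetic periodic, vowel-like frame and computes its magnitude spectrum with a naive discrete Fourier transform. The sample rate, fundamental frequency, frame length and harmonic amplitudes are assumed values chosen for the example, not figures from this page, and a real system would use an FFT library and go on to derive mel-cepstral coefficients from the spectrum.

```python
import cmath
import math


def dft_magnitude(samples):
    """Return the magnitude spectrum of a real signal using a naive DFT."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        acc = 0j
        for t, x in enumerate(samples):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        spectrum.append(abs(acc))
    return spectrum


SAMPLE_RATE = 8000  # Hz (assumed for illustration)
F0 = 125.0          # Hz, fundamental frequency (assumed)
N = 256             # samples per frame (assumed)

# Periodic frame: a fundamental plus two weaker harmonics.
frame = [
    math.sin(2 * math.pi * F0 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 2 * F0 * t / SAMPLE_RATE)
    + 0.25 * math.sin(2 * math.pi * 3 * F0 * t / SAMPLE_RATE)
    for t in range(N)
]

magnitudes = dft_magnitude(frame)

# The strongest bin corresponds to the fundamental frequency.
peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / N
print(f"strongest component at {peak_hz} Hz, log F0 = {math.log(peak_hz):.3f}")
```

Because the frame is periodic, its energy is concentrated in a few narrow peaks at the fundamental and its harmonics. An unvoiced consonant would instead spread energy across many bins.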
A speech signal is made up of a series of syllables, and each syllable consists of an arrangement of phonemes, or phones. To model phones using HMMs, each phone is represented by a three-state machine. The phones in a word or syllable occur in the order: initial consonant, vowel and final consonant. The sequence of phones in a speech signal is represented as vectors of speech parameters, which are modelled as three-state machines and simply concatenated to reproduce the speech. For each speech utterance, only the HMM with the maximum probability is selected.
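The three-state machines and their concatenation can be sketched as follows. This is only a structural illustration: the phone names, the self-loop probability of 0.6 and the helper names (left_to_right_hmm, concatenate, transition_matrix) are hypothetical, and in a trained system the transition and output distributions would be estimated from data rather than fixed by hand.

```python
def left_to_right_hmm(name, self_loop=0.6, n_states=3):
    """Build a left-to-right HMM as a list of (label, stay_prob, advance_prob)."""
    return [(f"{name}_{i + 1}", self_loop, 1.0 - self_loop) for i in range(n_states)]


def concatenate(phone_hmms):
    """Chain phone HMMs end to end into one longer left-to-right model."""
    states = []
    for hmm in phone_hmms:
        states.extend(hmm)
    return states


def transition_matrix(states):
    """Return an n x (n + 1) matrix; the extra last column is the exit transition."""
    n = len(states)
    matrix = [[0.0] * (n + 1) for _ in range(n)]
    for i, (_, stay, advance) in enumerate(states):
        matrix[i][i] = stay         # remain in the same state
        matrix[i][i + 1] = advance  # move on to the next state (or exit)
    return matrix


# Hypothetical syllable: initial consonant, vowel, final consonant.
phones = ["k", "a", "m"]
syllable = concatenate([left_to_right_hmm(p) for p in phones])
matrix = transition_matrix(syllable)

print([label for label, _, _ in syllable])
```

Each state can only stay where it is or advance, so the last state of one phone hands over directly to the first state of the next. Sentence-level models can be built the same way by chaining syllable models.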
Oldest known version of this page was edited on 2008-07-28 06:14:09 by UdenSherpa
Page view:
Synthesis
To elaborate on HMM-based speech synthesis: the system starts from a collection of trained HMMs. When text is passed to the text analyzer, the resulting label sequence is passed to the HTS-engine. The HMMs with the highest probability are selected, and the syllable HMMs are concatenated to form a sentence HMM, from which the speech parameters are generated. These parameters are then passed to the MLSA (Mel Log Spectrum Approximation) filter to generate the final speech.