
Electroacoustic Models
In the late nineteenth century, before tools like the spectrogram were available for studying the speech signal, H.L.F. von Helmholtz and other scientists studied the relationship between the spectrum of a sound and the sound that results. They postulated that speechlike sounds can be produced by carefully controlling the relative loudness of different regions of the spectrum and that, therefore, speech could be generated by electrical means instead of by mechanically replicating the vocal tract. Helmholtz also studied the influence of the shape of different cavities on their resonance frequencies. Early in the twentieth century, J. Q. Stewart, among others, built a device to test these theories. Stewart's machine consisted of two coupled resonators excited by periodic electrical impulses. By tuning these resonators to different frequencies, he produced different vowel-like sounds.
An electrical analog of Kempelen's machine was constructed by H. Dudley, R. Riesz, and S. Watkins in the 1930s. This machine, the voder, was displayed at the 1939 World's Fair. Like its mechanical predecessors, the voder was operated manually, but its operator used a keyboard to control the relative loudness of the different regions of the spectrum -- instead of changing the shape of an artificial vocal tract, as in earlier machines. An electrical sound generator excited the spectral shaping apparatus. The voder, the first electronic machine capable of producing speech, is the basis for today's acoustic synthesizers (see figure 6.2).
The voder generated speech sounds but was not a true speaking machine, since a human operator controlled it. A genuine speaking machine creates speech from a given text (text-to-speech) or -- as in the case of HAL -- generates speech to communicate its thoughts (concept-to-speech).
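The idea behind Stewart's machine, two resonators excited by periodic impulses, can be approximated in software. The sketch below is a digital stand-in, not a model of Stewart's actual circuit: it assumes a standard two-pole digital resonator, a cascade (series) connection, an 80 Hz bandwidth, a 100 Hz pitch, and a 16 kHz sample rate. The function names are hypothetical, and apart from the 850 Hz first formant for / a /, the formant values are illustrative assumptions.

```python
import math

def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients (a, b, c) of y[n] = a*x[n] + b*y[n-1] + c*y[n-2].

    The poles sit at the requested resonance frequency, and a is chosen
    so that the gain at 0 Hz is exactly 1.
    """
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c
    return a, b, c

def resonate(signal, freq_hz, bandwidth_hz, sample_rate):
    """Pass a list of samples through one two-pole resonator."""
    a, b, c = resonator_coeffs(freq_hz, bandwidth_hz, sample_rate)
    y1 = y2 = 0.0  # previous two output samples
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_train(pitch_hz, duration_s, sample_rate):
    """Periodic unit impulses standing in for the electrical excitation."""
    period = round(sample_rate / pitch_hz)
    n = round(duration_s * sample_rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def vowel(f1_hz, f2_hz, pitch_hz=100.0, duration_s=0.3,
          sample_rate=16000, bandwidth_hz=80.0):
    """Excite two resonators in series (an assumed topology)."""
    source = impulse_train(pitch_hz, duration_s, sample_rate)
    first = resonate(source, f1_hz, bandwidth_hz, sample_rate)
    return resonate(first, f2_hz, bandwidth_hz, sample_rate)

# Assumed, illustrative formant settings for two vowels.
a_like = vowel(850.0, 1100.0)   # roughly / a / as in father
ee_like = vowel(280.0, 2250.0)  # roughly / ee / as in beet
```

Retuning the two resonance frequencies is the only difference between the two calls, which mirrors how different vowel-like sounds were obtained from the same apparatus. The resulting sample lists could be scaled and written to a sound file with Python's standard `wave` module for listening.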
We have explained that speech is made up of a combination of different sounds, or phonemes, and that we can generate speechlike sounds with electronic resonators that simulate the formants of the speech signal. As a specific configuration of formants can simulate a given phoneme, we should be able to synthesize speech by configuring the frequencies of a set of resonances to produce the sequence of phonemes that makes up a given speech signal. Could we, in fact, produce a complete speech utterance by simply connecting the different phonemes? The process turns out to be tricky. When we utter the sound of a phoneme, we move our articulators (lips, tongue, etc.) to shape the vocal tract to produce the desired sound. To say the vowel / ee / in the word beet, we move our tongue forward and raise it so it almost touches the roof of the mouth; when we say / a /, as in father, the tongue recedes to the back of the mouth and is lowered, along with the jaw. When we want to say an / a / followed by an / ee /, we produce a smooth transition from the / a / configuration of the articulators to the / ee / configuration by raising the jaw and moving the tongue forward and up. The motion of the tongue and the jaw is not instantaneous; there is an interval between the vowels in which the sound is neither / a / nor / ee / but something in between.
This can also be explained by observing the formants in a spectrogram. The first formant for / a / is quite high (850 Hz) within the typical first-formant range of 250 to 900 Hz, and the second formant is low (just above the first formant). For / ee /, the first formant is extremely low, while the second formant is extremely high. Thus, when / a / is followed by / ee /, the first formant descends while the second formant rises. During the transition period, when the formants are moving from one configuration to the other, the sound is a mixture of the preceding and following sounds.
This mix is clearly visible in the spectrogram for the word error (see figure 6.1), where the formants move smoothly between the different phonemes of the word. To synthesize an / a / followed by an / ee /, therefore, we have to model the motion of the articulators or the formants very accurately.
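The / a / to / ee / transition described above can be sketched as a formant track: a sequence of per-frame formant targets that holds the first vowel, glides, and then holds the second. This is only a minimal illustration. The straight-line glide is an assumption (real articulator motion is not linear), the function name is hypothetical, and apart from the 850 Hz first formant for / a /, the formant values are assumed for the example.

```python
def formant_track(start, end, steady_frames, transition_frames):
    """Return per-frame (F1, F2) targets: hold start, glide, hold end.

    The glide is a straight line between the two configurations,
    a simplifying assumption rather than a model of the articulators.
    """
    track = [start] * steady_frames
    for k in range(1, transition_frames + 1):
        w = k / (transition_frames + 1)  # 0 < w < 1 across the glide
        track.append(tuple((1.0 - w) * s + w * e for s, e in zip(start, end)))
    track += [end] * steady_frames
    return track

# Assumed formant targets in Hz: (first formant, second formant).
A_VOWEL = (850.0, 1100.0)   # / a / as in father
EE_VOWEL = (280.0, 2250.0)  # / ee / as in beet

track = formant_track(A_VOWEL, EE_VOWEL, steady_frames=2, transition_frames=3)
for f1, f2 in track:
    print(f"F1 = {f1:7.1f} Hz   F2 = {f2:7.1f} Hz")
```

In the printed track the first formant descends while the second rises, and the intermediate frames are neither / a / nor / ee /. A synthesizer would retune its resonators frame by frame along such a track rather than jumping from one vowel configuration to the next.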
Another difficulty with configuring the articulators or formants for
each phoneme arises when we utter very short vowels and the
articulators are not able to move quickly enough to form the
appropriate vocal-tract shape of the vowel. This can be seen by
observing the short vowel /i/ in two different syllables,
bil and dic (see figure 6.3). In the syllable
dic, the second formant moves only slightly from the
surrounding consonants; however, in the syllable bil the
second formant has to rise from the /b/ to the /i/
and fall again for the /l/. Since the vowel is short, the
second formant is not able to rise fast enough before it begins
falling again; so it never reaches the position of the second formant
in dic. The spectrogram of the two syllables demonstrates
that the vowel in the two syllables is indeed different. In the
context of the syllable bil, listeners do not notice that the