
A History of Speech Synthesis
Approximately a hundred years later, Sir Charles Wheatstone built an improved version of von Kempelen's machine. R. Riesz constructed a more sophisticated mechanical talking machine in 1937. Using a similar arrangement of air flow through a reed, this machine possessed the ability to change the reed length to create the intonation or melody of speech. The user employed finger-controlled sliders to modify the shape of the tube simulating the vocal tract. Although mechanical talkers like these are still occasionally constructed, they are generally used as measurement tools rather than as talking machines.
Current attempts to generate machine speech focus on electronic
methods and, more recently, electronic simulation by digital computer.
Before discussing synthetic voices in more detail, we need to introduce certain basic concepts used in research on speech production. We begin by defining speech as a sound signal used for language communication. Superficially, the speech signal is similar to a sound produced by a musical instrument, although it is more flexible and varied.
When we speak, we push air from our lungs through the vocal cords, sometimes tightening the cords to make them vibrate as the air passes over them -- like the reed of a musical instrument such as a clarinet. In the clarinet, the pitch of the sound is changed by closing and opening holes in the body of the instrument, which causes the column of air in the instrument to become longer or shorter. When we speak, however, we change the pitch by loosening and tightening our vocal cords. We also have the ability to completely relax our vocal cords to produce voiceless sounds such as / s / or / sh /. The capacity to produce both pitched (or voiced) sounds and noiselike (or voiceless) sounds with a single instrument is not generally available to musical instruments.
Our greatest flexibility, however, comes from the innate ability to vary the shape of our instrument, the vocal tract. Most musical instruments are rigid structures and so produce a sound with a unique color or timbre associated with their particular class of instruments; thus a clarinet has a sound that is distinct from the sound of a trumpet or a violin. The descriptive words color and timbre refer to the sound quality rather than the pitch range or loudness of instruments. We humans, by contrast, can change the shape of our oral cavity by moving our tongue, lips, and jaw, thus creating a variety of sound colors. For example, the sound of / oo / in the word "boot" is "dark" and muffled compared to the sound of the / ee / in a word like "beet", which has a bright sound.
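The distinction between pitched (voiced) and noiselike (voiceless) sounds can be illustrated with a rough sketch. The code below is not a model of the vocal cords: it assumes an idealized pulse train for the voiced source and uniform random noise for the voiceless one, and the sample rate, pitch values, and function names are all made up for illustration. It checks how strongly each signal repeats itself after one pitch period.

```python
import random

SAMPLE_RATE = 8000  # samples per second (assumed value for this sketch)

def voiced_source(pitch_hz, n_samples, rate=SAMPLE_RATE):
    """Idealized voiced source: one pulse per cycle of vibration."""
    period = round(rate / pitch_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def voiceless_source(n_samples, seed=0):
    """Idealized voiceless source: random noise with no repeating pattern."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

def periodicity(signal, lag):
    """Normalized autocorrelation at `lag` samples.

    Values near 1 mean the signal repeats every `lag` samples;
    values near 0 mean it does not.
    """
    num = sum(signal[n] * signal[n + lag] for n in range(len(signal) - lag))
    den = sum(x * x for x in signal)
    return num / den

lag = round(SAMPLE_RATE / 100)      # one period of a 100 Hz pitch
low = voiced_source(100, 1600)      # lower pitch: pulses farther apart
high = voiced_source(200, 1600)     # higher pitch: pulses closer together
hiss = voiceless_source(1600)       # noiselike, as in / s / or / sh /

print("voiced, 100 Hz:", round(periodicity(low, lag), 2))
print("voiced, 200 Hz:", round(periodicity(high, lag // 2), 2))
print("voiceless:     ", round(periodicity(hiss, lag), 2))
```

In this sketch, raising the pitch simply shortens the spacing between pulses, while the noise source has no preferred spacing at all, which is why its periodicity score stays close to zero.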
In addition to / oo / and / ee /, two of the vowel sounds, there are consonant sounds such as / l /, / r /, and / m /. This human facility to produce a variety of sounds is the basis for our ability to speak. By combining a small number of sounds to produce a large number of words, we can produce an unlimited number of sentences. We call the different sounds that make up language phonemes.
A speech signal and its constituent phonemes can be given visual form with a sound spectrogram, commonly known as a voiceprint. (A voiceprint is used in 2001 to verify the identity of Dr. Floyd.) The term voiceprint, coined by a manufacturer of the machines used to display spectrograms, was intended to associate them with fingerprints, which are uniquely reliable means of identification. In the 1970s, police departments bought spectrogram machines and used them for forensic purposes. Speech scientists, however, opposed this practice, because they believed the spectrogram was not reliable legal evidence. Eventually the judicial and forensic use of spectrograms disappeared. Today, computers can reliably perform voice verification, not by using a spectrogram but with techniques borrowed from Automatic Speech Recognition (see chapter 7). Although spectrograms are extremely useful for visualizing speech events, they are still too complex for computers to extract the appropriate information from them.
The sound spectrogram in figure 6.1 shows many aspects of the
speech signal. The light blue regions corresponding to the / k /,
/ p /, and / b / show that the vocal tract is
completely closed to pronounce the stop phonemes. In the
vowel regions -- / a /, / i /, / o / and / e /
-- as well as in the / r / and / l / regions,
the repeated vertical lines indicate segments in which the vocal
cords vibrate, causing a voiced speech signal. These segments
contrast with the region corresponding to the latter part of the
/ k /, where such lines are not apparent, indicating that the
sounds are voiceless. A number of thick colored horizontal lines also
appear in the voiced sections; they show the loudness of different
frequencies at different times and represent the frequencies at which
the sound is reinforced by the vocal tract. These resonances of the
vocal tract are known as speech formants. The different
configurations of the colored regions represent differences in the
color or timbre of the sounds. When the colored regions appear in the
higher frequencies (the higher areas of the spectrogram), such as
during the vowel / a / in the word "and" or / i / in the
word "hit", the sound is brighter, while segments
devoid of energy in the higher frequencies -- such as during the
/ l / or / o / -- tend to sound more
muffled. You can simulate this effect by turning down the treble
control on your stereo amplifier and observing the reduction of energy
at higher frequencies.
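The treble-control experiment can also be approximated in a few lines of code. The sketch below is only an illustration under stated assumptions: it stands in for a real speech signal with a mix of two sine tones (300 Hz and 2500 Hz, both arbitrary choices), and it stands in for the amplifier's treble control with a simple one-pole low-pass filter. The function names, the sample rate, and the `alpha` value are hypothetical.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed value for this sketch)

def tone_mix(freqs, n_samples, rate=SAMPLE_RATE):
    """Sum of equal-amplitude sine waves at the given frequencies (Hz)."""
    return [sum(math.sin(2 * math.pi * f * n / rate) for f in freqs)
            for n in range(n_samples)]

def energy_at(signal, freq, rate=SAMPLE_RATE):
    """Energy of the signal at a single frequency (one DFT bin)."""
    re = sum(x * math.cos(2 * math.pi * freq * n / rate)
             for n, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq * n / rate)
             for n, x in enumerate(signal))
    return (re * re + im * im) / len(signal) ** 2

def turn_down_treble(signal, alpha=0.3):
    """One-pole low-pass filter: y[n] = y[n-1] + alpha * (x[n] - y[n-1])."""
    out, y = [], 0.0
    for x in signal:
        y += alpha * (x - y)
        out.append(y)
    return out

signal = tone_mix([300, 2500], 800)   # a "dark" tone plus a "bright" tone
muffled = turn_down_treble(signal)

low_kept = energy_at(muffled, 300) / energy_at(signal, 300)
high_kept = energy_at(muffled, 2500) / energy_at(signal, 2500)
print(f"fraction of 300 Hz energy kept:  {low_kept:.2f}")
print(f"fraction of 2500 Hz energy kept: {high_kept:.2f}")
```

The filter leaves most of the low-frequency energy in place while removing most of the high-frequency energy, which corresponds to the more muffled sound described above. A smaller `alpha` cuts the treble more aggressively.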