
Synthesis from Stored Segments
An alternative method of producing computer speech stores small segments of speech and retrieves them when they are needed. Storing whole sentences or phrases is impractical, and even storing words is not feasible: there are too many of them, and new ones are constantly being added to the language. Storing words would also leave unsolved the problem of connecting the individual words; although a word is a linguistic unit, acoustically there are no apparent breaks between words and only unclear delineations of word boundaries.
There are, however, certain applications with limited vocabulary needs in which whole words can be the unit of synthesis. Telephone directory assistance is one such application. Even though the speech in this case consists of a string of ten digits, the stored vocabulary must be larger than ten words, because the first digit in a string of ten is spoken differently from the third or the tenth. A store of one hundred words -- each of the ten digits in each of the ten positions -- covers all the possibilities. Even so, the speech sounds like a series of isolated digits; it lacks the continuous flow of human speech.
Storing syllables is also impractical, for there are approximately fifteen thousand syllables in English, and an adequate system would have to provide for smooth connections among them. Nor, as mentioned earlier, can phonemes serve as units for synthesis; their acoustic manifestations do not exist as independent entities and, besides, they are affected by the coarticulatory influence of neighboring sounds.
In 1958, G. Peterson, W. Wang, and E. Sivertsen experimented with using diphones to produce synthetic speech. These units are small speech segments that start in the middle of one phoneme and end in the middle of the next. The authors theorized that phonemes are most stable in the middle and that the segments between phoneme midpoints contain the necessary information about the transition from one phoneme to the next.
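To make the diphone unit concrete, the following minimal sketch lists the diphones needed for a short phoneme sequence. The function name, the informal phoneme labels, and the "#" silence marker are assumptions chosen for illustration; they are not taken from the 1958 experiment.

```python
def to_diphones(phonemes, boundary="#"):
    """Return the diphone labels needed to speak a phoneme sequence.

    Each diphone runs from the middle of one phoneme to the middle of the
    next. Silence (the assumed "#" marker) pads the start and the end.
    """
    padded = [boundary, *phonemes, boundary]
    # Pair every unit with its right-hand neighbor.
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


# "cat" in an informal, assumed phoneme notation
print(to_diphones(["k", "ae", "t"]))  # ['#-k', 'k-ae', 'ae-t', 't-#']
```

An utterance of N phonemes therefore needs N + 1 stored diphones, and every cut falls in the middle of a phoneme rather than at a transition.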
Splicing the speech in the middle of each phoneme, therefore, should generate a smoother speech signal. The researchers did not attempt to construct a full system of diphones to produce all the possible speech-sound combinations of a given language (American English in this case). Instead, they selected several diphones and spliced them together to create phoneme sequences for a few utterances. Although the experiment showed that the method was viable, there were some obvious problems. When speech segments are joined, discontinuities in loudness, pitch, or spectrum at the junctures are audible, usually as clicks or other undesirable sounds. Splicing segments cut from different utterances does not prevent such discontinuities. Because they spliced tape to connect the diphones, Peterson and his colleagues had to carefully select diphones with similar acoustic characteristics at the junctures. In a system that includes all possible combinations of phonemes in the language, it would be impractical to use only diphones that match at the boundaries. Instead, we would have to smooth the connections between segments, which can be done only when the speech is parametrized, that is, stored as parameters such as formant tracks rather than as a recorded waveform. The first such system, which generated synthesized speech from stylized stored formant-track parameters, was demonstrated in 1967.
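The reason parametrized speech can be smoothed at a juncture may be easier to see in a small sketch. Here each segment is assumed to be stored as one parameter value per frame (for example, a formant frequency in Hz), and the mismatch at the join is replaced by a linear ramp. The function name, the `span` parameter, and the use of linear interpolation are all illustrative assumptions; this is not the method of the 1967 system.

```python
def smooth_join(left, right, span=3):
    """Concatenate two parameter tracks and interpolate across the join.

    left, right: per-frame values of one parameter (e.g. a formant in Hz).
    span: number of frames on each side of the join to overwrite
          (hypothetical tuning knob).
    """
    # Keep at least one untouched frame at the far end of each segment.
    span = min(span, len(left) - 1, len(right) - 1)
    track = list(left) + list(right)
    if span < 1:
        return track  # segments too short to smooth

    start = len(left) - span - 1   # last untouched frame on the left
    end = len(left) + span         # first untouched frame on the right
    a, b = track[start], track[end]
    for i in range(start + 1, end):
        t = (i - start) / (end - start)
        track[i] = a + t * (b - a)  # linear ramp from a to b
    return track


# Two segments whose stored formant values disagree by 200 Hz at the join
print(smooth_join([500.0] * 4, [700.0] * 4, span=2))
```

In this example the 200 Hz jump is spread over several frames instead of occurring between two adjacent ones. No comparable operation is available when the segments are pieces of recorded tape.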
The foregoing section describes the history of the talking machine
prior to the making of 2001 in the late 1960s. Although
research on talking machines had been under way for a long time, it
was still in its infancy at that time. Computers were able to utter
speechlike sounds, but they lacked the eloquence of HAL. In fact, the
computer-generated speechlike sounds of the era were almost
unintelligible, whether produced through synthesis by rule or
synthesis from stored data.
In the 1970s, however, researchers made great advances in speech synthesis, mainly because of the wealth of data on spoken utterances and improved computational power. The best system of rules for synthesizing speech, developed by D. H. Klatt, utilized a digital implementation of an electroacoustic synthesizer. The spectral shaping module (see figure 6.2) consisted of a complicated network of resonances with different branches for producing vowels, nasal consonants, fricatives, and stop consonants. By recording and observing the formant motions, Klatt was able to create speech synthesis of high quality. One derivative of his system, Digital's DECtalk, has been used by the noted physicist Stephen Hawking.
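A network of resonances of this kind can be built from second-order digital resonators, one per formant. The sketch below is a heavily simplified illustration, not Klatt's system: it cascades three resonators, excites them with a bare impulse train, and omits the parallel branch, the noise sources, and all time variation. The formant frequencies, bandwidths, sampling rate, and function names are assumed values chosen only for the example.

```python
import math


def resonator(signal, freq_hz, bw_hz, sample_rate):
    """Second-order digital resonator: y[n] = A*x[n] + B*y[n-1] + C*y[n-2]."""
    T = 1.0 / sample_rate
    C = -math.exp(-2.0 * math.pi * bw_hz * T)
    B = 2.0 * math.exp(-math.pi * bw_hz * T) * math.cos(2.0 * math.pi * freq_hz * T)
    A = 1.0 - B - C  # normalizes the gain to 1 at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = A * x + B * y1 + C * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, duration_s, sample_rate):
    """Crude stand-in for a voicing source: one unit pulse per pitch period."""
    period = int(round(sample_rate / f0_hz))
    n = int(round(duration_s * sample_rate))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(formants, f0_hz=100.0, duration_s=0.05, sample_rate=10000):
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for freq_hz, bw_hz in formants:
        signal = resonator(signal, freq_hz, bw_hz, sample_rate)
    return signal


# Assumed, roughly vowel-like formant values in Hz: (frequency, bandwidth)
samples = synthesize_vowel([(700, 60), (1200, 90), (2600, 150)])
print(len(samples), "samples")
```

A rule system would change the formant frequencies over time according to the observed formant motions; here they are held fixed, so the output is a static, buzzy vowel-like sound at best.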
During the same decade, progress in the synthesis of speech from
stored data was aided by research in speech coding and creation of new
methods of speech analysis, and of resynthesizing speech from analysis
parameters. Like synthesis by rule, synthesis from stored data can use
different kinds of parameters; however, because the method is
data-driven, the parameters do not need to be as intuitive. They need
only be able to produce high-quality speech when previously analyzed
speech segments are resynthesized.