Chapter 6














After concluding the computation involved in language analysis, the computer -- whether reading text or generating speech -- has information about the hierarchical structure of the text, the focus or stress of the different segments, and the correct pronunciation, including lexical stress, of the words in the utterance. The result of the analysis is a string of phonemes annotated with several levels of stress marking and different levels of phrase marking. Once these linguistic units are generated, the computer is ready to synthesize speech.
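As a rough illustration of what such an annotated phoneme string might look like inside a program, here is a minimal sketch in Python. The class name `Phoneme`, the ARPAbet-like symbols, and the numeric scales for stress and phrase marking are all assumptions made for this example; the chapter does not specify a representation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Phoneme:
    """One unit of the analysis output (hypothetical layout)."""
    symbol: str            # phoneme label, e.g. "h", "ae", "d" (assumed notation)
    stress: int = 0        # assumed scale: 0 = unstressed, 1 = lexical stress, 2 = focus
    phrase_break: int = 0  # assumed scale: 0 = none, 1 = phrase boundary, 2 = sentence end


def describe(utterance: List[Phoneme]) -> str:
    """Render the annotated string in a compact, readable form."""
    parts = []
    for p in utterance:
        mark = p.symbol + (str(p.stress) if p.stress else "")
        if p.phrase_break == 1:
            mark += " |"    # phrase boundary after this phoneme
        elif p.phrase_break == 2:
            mark += " ||"   # sentence boundary after this phoneme
        parts.append(mark)
    return " ".join(parts)


# The single word "had" spoken as a complete utterance.
utterance = [Phoneme("h"), Phoneme("ae", stress=1), Phoneme("d", phrase_break=2)]
print(describe(utterance))
```

A real system would carry more levels of stress and phrasing than this, but the idea is the same: every phoneme travels with the marks that the timing and pitch rules later consult.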

Synthesis from Linguistic Units

It would seem a trivial task to synthesize speech, by either rule or stored data, once the desired sequence of phonemes is known. However, the computer still lacks information about the timing and pitch of the utterance. These factors may seem unimportant as long as the computer can pronounce the phonemes correctly. Nonetheless, mistakes in timing and pitch are likely to result in unintelligible speech or, at best, the perception that the speaker is a non-native speaker.

We become aware of the role of pitch when actors impersonating a computer in a television commercial or science fiction movie try to speak in a monotone. Notice that I said "try," because they are not really talking in a monotone; if they were, it would sound more like singing than speaking. They do, however, severely restrict the range of the pitch. Humans normally talk with the timing and intonation appropriate to their native language, which they acquired as children by imitating adult speakers. The computer, of course, does not learn by imitation; for the computer to speak correctly, we have to develop the rules for pitch and timing and program it to use them.

The timing of speech events is very complicated. First, phonemes have inherent durations; for example, the vowel in the word "had" is much longer than the vowel in "pit." Yet the durations of phonemes are not fixed. They are affected by the position of the phoneme's syllable in the phrase, the degree of stress on the syllable, the influence of neighboring phonemes, and other factors. For example, the vowel in "had" is much longer than the vowel in "hat," because of the difference between the following consonants, /d/ and /t/. At Bell Laboratories recently we devised a statistics-based analysis scheme that measures the contribution of various factors to phoneme durations and creates algorithms to compute them.
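To make the interaction of these factors concrete, here is a toy rule-based sketch in Python. It is not the Bell Laboratories scheme, which is statistical and is not described in detail here; it simply starts from an inherent duration and scales it by one multiplier per factor. The function name, the millisecond values, and every multiplier are invented for illustration only.

```python
# Illustrative inherent vowel durations in milliseconds (made-up values,
# not measurements).
INHERENT_MS = {"ae": 160.0, "ih": 90.0}

# A small set of voiceless consonants, enough for the example.
VOICELESS = {"p", "t", "k", "f", "s"}


def vowel_duration_ms(vowel, next_consonant=None, stressed=True, phrase_final=False):
    """Scale a vowel's inherent duration by context-dependent factors.

    All multipliers below are assumptions chosen for the example.
    """
    duration = INHERENT_MS[vowel]
    if next_consonant in VOICELESS:
        duration *= 0.7   # shorter before a voiceless consonant ("hat" vs. "had")
    if not stressed:
        duration *= 0.8   # unstressed syllables are shortened
    if phrase_final:
        duration *= 1.3   # syllables at the end of a phrase are lengthened
    return duration


had = vowel_duration_ms("ae", next_consonant="d")
hat = vowel_duration_ms("ae", next_consonant="t")
pit = vowel_duration_ms("ih", next_consonant="t")
print(had, hat, pit)
```

With these assumed numbers the vowel of "had" comes out longer than that of "hat," which in turn is longer than that of "pit," matching the ordering described above. A statistics-based approach would estimate such factors from measured speech rather than fix them by hand.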

To program rules for the pitch contour of speech, we must first understand how intonation provides information about the sentence type, sentence structure, sentence focus, and lexical stress of a speech signal. We are aware, for example, that the pitch is lower at the end of a declarative sentence, whereas in many interrogative sentences it rises at the end. At the end of phrases, nonterminal sentences, and parenthetical statements, we indicate that we will continue speaking by lowering the pitch and reducing the range. We also express focus and stress by large pitch variations. All of these phenomena must be programmed to make the computer deliver a message effectively.
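A very small sketch in Python shows how two of these rules might be written down: the fall or rise at the end of a sentence, and the pitch excursion that marks focus. The function name, the one-target-per-syllable simplification, and all frequency values are assumptions for this example; real intonation rules are far richer.

```python
def pitch_contour_hz(n_syllables, sentence_type="declarative", focus_index=None,
                     base_hz=120.0, focus_boost_hz=30.0, final_shift_hz=20.0):
    """Return one pitch target (in Hz) per syllable.

    Simplified, assumed rules:
      - every syllable starts at a flat baseline,
      - the focused syllable (if any) gets a large upward excursion,
      - the last syllable falls for a declarative sentence and rises
        for an interrogative one.
    """
    if n_syllables < 1:
        raise ValueError("an utterance needs at least one syllable")

    contour = [base_hz] * n_syllables

    if focus_index is not None:
        contour[focus_index] += focus_boost_hz

    if sentence_type == "interrogative":
        contour[-1] += final_shift_hz
    else:
        contour[-1] -= final_shift_hz

    return contour


statement = pitch_contour_hz(4)
question = pitch_contour_hz(4, sentence_type="interrogative")
focused = pitch_contour_hz(4, focus_index=1)
print(statement, question, focused)
```

The same skeleton could be extended with further rules, for example one that narrows the pitch range at a phrase boundary, but each such rule would again have to be worked out from observations of human speech.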


Feeling and Singing

So far, we have concentrated on aspects of speech synthesis that convey linguistic information by analyzing the acoustics of speech sounds, as well as the manifestations of timing and pitch. Another dimension of human speech, the emotional state of the speaker, is as important as the linguistic content of the message. I do not explore computer feelings in this chapter (see chapter 13 for a discussion of this topic). In 1974, however, my interest in computer music led me to write a computer opera dealing with the intriguing subject of computer emotion. The opera featured a singing computer.