Chapter 6














Text-to-Speech Synthesis at Bell Labs

At present, two types of parameters are used for the data-driven method of synthesis: stored waveforms, and a small set of spectral parameters that is mathematically derived from the speech signal. The latter are called LPC (linear predictive coding) parameters because one of their forms predicts the next set of speech-waveform values from a small set of previously computed waveform values. Although waveform parameters produce high-quality speech, it is impossible to control the spectrum of the stored speech waveforms independently. Synthesizing with these parameters, therefore, lacks flexibility for altering the speech spectrum. The LPC parameters also produce high-quality speech, although it is somewhat mechanical-sounding. Their flexibility makes it easy to alter them to produce connected speech.
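The prediction idea behind LPC can be illustrated with a short sketch. It is not the analysis used in any particular synthesizer; it assumes the widely used autocorrelation method with the Levinson-Durbin recursion, a predictor of order 2, and a toy decaying signal. All function names (`synth_ar2`, `autocorrelation`, `levinson_durbin`, `predict_next`) are illustrative.

```python
def synth_ar2(n, a1=1.3, a2=-0.6):
    """Toy signal (an assumption): s[k] = a1*s[k-1] + a2*s[k-2], started by an impulse."""
    s = [1.0, a1]
    while len(s) < n:
        s.append(a1 * s[-1] + a2 * s[-2])
    return s[:n]

def autocorrelation(signal, max_lag):
    """r[lag] = sum over i of signal[i] * signal[i + lag]."""
    n = len(signal)
    return [sum(signal[i] * signal[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Find coefficients a[1..order] so that s[k] is approximated by
    sum_j a[j] * s[k-j], given autocorrelation values r[0..order]."""
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this stage of the recursion.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= (1.0 - k * k)
    return a[1:], error

def predict_next(history, coeffs):
    """Predict the next sample from the most recent len(coeffs) samples."""
    return sum(c * history[-(j + 1)] for j, c in enumerate(coeffs))

signal = synth_ar2(200)
coeffs, residual = levinson_durbin(autocorrelation(signal, 2), 2)
print("estimated predictor coefficients:", coeffs)
print("predicted:", predict_next(signal[:10], coeffs), "actual:", signal[10])
```

Because the toy signal was itself generated by a two-term recursion, the estimated coefficients should come out very close to the 1.3 and -0.6 used to build it. Real speech needs a higher predictor order and frame-by-frame analysis.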

When I began working in speech synthesis shortly after the discovery of LPC parameters, I was attracted by their ability to reproduce high-quality speech. My early research involved constructing a synthesizer, using words as the unit of synthesis. By using twelve hundred common words I was able to synthesize many paragraphs of text. Because I used parametrized speech, I could smooth the connection between words and impose an intonation over the utterance to make the speech sound continuous. However, the synthesizer was limited -- too many words were not in my inventory.

I then turned to the methods introduced by Peterson and his colleagues. The speech synthesizer I currently use at Bell Laboratories generates speech from stored short utterances of analyzed speech, using LPC-derived parameters. It is not a simple system of diphones, but a complex system that contains many segments larger than diphones -- to accommodate phonemes with complex coarticulation effects. For example, to synthesize the word incapable spoken by HAL and shown in figure 6.1, we first transcribe the word into a phonetic notation. Incapable becomes

where /*/ represents silence, /1/ is the neutral vowel schwa, and /U/ is the vowel a as in the word able. The synthesizer then attempts to match the largest string of phonemes from the word to a string in its databank. If two adjacent phonemes do not interact -- that is, there is little coarticulation between them, as is the case for /n/ followed by a /k/ -- the synthesizer will not find a diphone. In this case, it will add a silence element of zero duration. When the phoneme is greatly influenced by its neighbors, as in the case of a schwa, a triplet of phonemes will be stored in the database. Thus the word incapable will be synthesized from the following elements:

The resultant speech is intelligible, although it sounds mechanical and would never be mistaken for a human voice.
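The matching procedure described above can be sketched as a greedy longest-match search. This is a hedged illustration, not the Bell Laboratories implementation: the phoneme symbols and the `inventory` contents are made up for the example, the names `segment` and `SILENCE` are hypothetical, and the assumption that adjacent units share their boundary phoneme is mine.

```python
SILENCE = "*"

def segment(phonemes, inventory):
    """Cover a phoneme sequence with the longest stored units available.

    Assumptions (not from the source): units are tuples of two or more
    phonemes, and neighboring units overlap on their boundary phoneme.
    """
    seq = list(phonemes)
    units = []
    i = 0
    while i < len(seq) - 1:
        # Try the longest candidate starting at position i first.
        for end in range(len(seq), i + 1, -1):
            candidate = tuple(seq[i:end])
            if candidate in inventory:
                units.append(candidate)
                i = end - 1  # the last phoneme starts the next unit
                break
        else:
            # No stored unit joins seq[i] to seq[i + 1]: treat the pair as
            # non-interacting and place a zero-duration silence between them.
            if SILENCE in (seq[i], seq[i + 1]):
                raise KeyError(f"no stored unit starts with {seq[i]!r}")
            seq.insert(i + 1, SILENCE)
    return units

# Hypothetical toy inventory: no ("n", "k") unit, but a stored triplet.
inventory = {
    ("*", "a"), ("a", "n"), ("n", "*"), ("*", "k"),
    ("k", "e"), ("k", "e", "b"), ("e", "b"), ("b", "*"),
}
units = segment(["*", "a", "n", "k", "e", "b", "*"], inventory)
print(units)
```

In this toy run the missing ("n", "k") unit causes a silence to be inserted between the two phonemes, and the stored triplet ("k", "e", "b") is preferred over the two shorter units that could also cover it.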


Speech Generation and Text-to-Speech Conversion

Thus far, we have described a system capable of synthesizing speech from phonemic input. Given a sequence of phonemes, scientists can now generate a signal that sounds speechlike. This was a very important task and the main preoccupation of researchers for a long time. But is that all there is to speech?

Speech, a subset of language, is one method humans use to communicate with each other. The most direct form of language communication happens when one human, the generator, speaks to one or more humans, the receptors. This mode of communication is easy for the generator; he or she needs only to choose the proper words to represent an idea and produce the speech sounds that represent the words. Barring such problems as a noisy environment or language differences, receptors will usually understand the idea the generator is trying to transmit.