Chapter 6














Before creating rules to control a speaking machine, it was necessary to develop methods of reproducing a human speech signal. At the turn of the twentieth century, two devices could convert an acoustic speech signal (air vibrations) into an electrical signal with a microphone and change the electrical signal back into an acoustic signal through a loudspeaker. The two devices -- the telephone for speech transmission and the phonograph for speech or sound storage and playback -- could store or transmit the signal but could not manipulate or alter it much. Operators could distort the signal, or equalize it by boosting or reducing the bass or treble, but they could not convert the sound of one phoneme into another or change the pitch without also changing the speed of the speech and its spectrum. To attain this flexibility, it was necessary to have independent control over the excitation of the signal, the pitch, the loudness, and the spectrum.

In 1951, an attempt to recreate human speech with an electrical device that controlled a variable spectrum used a black-and-white version of the sound spectrogram. Because the spectrogram is a visual recording of a sound signal, F.S. Cooper, A.M. Liberman, and J.M. Borst used light to generate the speech. Their machine consisted of a light source and a rotating wheel with fifty concentric circles of variable density, which generated the different harmonics of the source signal. The light beams representing the different harmonics were aimed at the appropriate regions of a sound spectrogram. As the spectrogram moved through the beams, the intensity differences of its lines varied the amount of light transmitted. The transmitted light was converted to an electric current, which a loudspeaker then converted to an acoustic signal. In this way, the machine was able to speak the information encoded in the sound spectrogram.
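The principle of the spectrogram-reading machine can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a model of the original hardware: the function name `playback`, the 120 Hz fundamental, the 10 ms frame length, and the 16 kHz sample rate are all choices made for the example. Each frame of the spectrogram supplies one amplitude per harmonic, and the output is the sum of the harmonics weighted by those amplitudes.

```python
import math

def playback(spectrogram, f0=120.0, frame_dur=0.01, sample_rate=16000):
    """Turn a spectrogram into sound samples by summing weighted harmonics.

    spectrogram: list of frames; each frame is a list of amplitudes (0..1),
                 one per harmonic (fifty in the machine described above).
    f0, frame_dur, sample_rate: assumed values, chosen for illustration.
    """
    samples_per_frame = round(frame_dur * sample_rate)
    samples = []
    for frame_index, frame in enumerate(spectrogram):
        for n in range(samples_per_frame):
            t = (frame_index * samples_per_frame + n) / sample_rate
            value = 0.0
            # Harmonic k has frequency k * f0; its amplitude plays the role
            # of the light transmitted through the spectrogram at that height.
            for k, amplitude in enumerate(frame, start=1):
                if amplitude:
                    value += amplitude * math.sin(2 * math.pi * k * f0 * t)
            samples.append(value)
    return samples

# Usage: one frame of silence followed by one frame with only harmonic 5 lit.
silent = [0.0] * 50
tone = [0.0] * 50
tone[4] = 1.0
signal = playback([silent, tone])
```

The sketch also makes the drawback visible: every frame needs fifty numbers, one per harmonic, which is the control problem discussed next.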

The device proved that speech could be generated electrically by a machine using time-varying parameters to control a spectral filter (as in figure 6.2). However, the ultimate aim of the project to build a speaking machine was to generate speech by defining a set of rules, not just to reproduce previously spoken utterances. A major obstacle to using a spectrogram-reading machine as a component of such a system is its need for fifty different control parameters to reproduce speech. Writing rules to control fifty different parameters is too complex. We needed a simpler model for controlling the time-varying spectrum component of a synthesizer.

Experiments had shown that it is possible to produce speechlike sounds with a system of coupled resonances. J. Holmes experimented with recreating speech by controlling the frequencies of the resonances. First, he carefully analyzed short speech segments and manually determined the formant values for each segment. He then used the data gathered from his analysis to recreate the speech signal. Holmes's experiment demonstrated that if we could predict how the formants change over time for a desired phoneme sequence, we could program a machine to speak. Since then, several researchers have introduced techniques for automatically analyzing the parameters that control the time-varying spectral filter. These techniques are extremely useful for encoding speech at a reduced storage and transmission rate, and they have provided a basis for studying methods of creating rules for generating speech by machine.
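A minimal sketch of the resonance idea follows, assuming a standard two-pole digital resonator for each formant and an impulse train as the excitation. It is not Holmes's apparatus; the function names, the 16 kHz sample rate, and the formant frequencies and bandwidths in the usage lines are illustrative values only. The point is that the pitch (the spacing of the impulses) and the spectrum (the resonator settings) are controlled independently.

```python
import math

def resonator(signal, freq, bandwidth, sample_rate):
    """Two-pole resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    T = 1.0 / sample_rate
    c = -math.exp(-2 * math.pi * bandwidth * T)
    b = 2 * math.exp(-math.pi * bandwidth * T) * math.cos(2 * math.pi * freq * T)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to 1
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_train(pitch_hz, duration, sample_rate):
    """Excitation: one unit impulse per pitch period, zeros elsewhere."""
    period = round(sample_rate / pitch_hz)
    count = round(duration * sample_rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(count)]

def synthesize_vowel(formants, pitch_hz=100.0, duration=0.2, sample_rate=16000):
    """Pass the excitation through one resonator per (frequency, bandwidth) pair."""
    signal = impulse_train(pitch_hz, duration, sample_rate)
    for freq, bandwidth in formants:
        signal = resonator(signal, freq, bandwidth, sample_rate)
    return signal

# Usage: the same (assumed) formant settings at two different pitches.
vowel = [(730, 60), (1090, 90), (2440, 120)]
low_voice = synthesize_vowel(vowel, pitch_hz=100.0)
high_voice = synthesize_vowel(vowel, pitch_hz=200.0)
```

Three pairs of numbers per frame replace the fifty harmonic amplitudes of the spectrogram-reading approach, which is what makes rule writing tractable.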

Once this theoretical groundwork was established, we could begin to conceive of ways to generate speech by machines. Early work had consisted of creating specialized circuitry to control synthesizers. When digital computers became available, however, research progressed rapidly. These computers made it possible to program a machine with independent control of the pitch, loudness, and spectrum and to compute time-varying parameters to control them. Researchers constructing talking machines then faced two issues: what parameters to use, and how to generate these parameters for a given sequence of phonemes. They investigated two ways to generate the synthesis parameters: one method employs rules to generate the parameters, while the other uses stored data.

Synthesis by Rules

The choice of parameters is extremely important to developing rules for speech synthesis. Some scientists hold that the best approach to developing rules is to model the geometry of the human vocal tract itself. There is a good deal of information about the articulators and their movements during speech, because both are subject to physical constraints. Some researchers have studied the geometry of the vocal tract, especially the tongue, through X-ray movies of people speaking. However, the danger of prolonged exposure to X-rays, even X-ray microbeams, means that only a limited number of such films is available.

Other researchers have tried to map the geometry of the vocal tract by analyzing the speech signal itself. This is still a topic of ongoing research, and no satisfactory solution has yet been formulated. Because of the complex geometry of the vocal tract, the air flow through it is still not fully understood. In addition, the walls of the vocal tract (particularly the cheeks and soft palate) are not rigid, which adds to the difficulty of computing the air flow.

Still other researchers have attempted to apply ad hoc rules and simplified geometries of the vocal tract. Although they have been able to produce machine speech, its quality is lower than that yielded by other methods of synthesis.

Finally, one group of speech scientists has worked to formulate rules for synthesizing speech by using more accessible parameters, in particular the resonances of the vocal tract (the formants). By observing spectrograms or computing the formant frequencies of spoken utterances, these researchers have derived rules for synthesizing the phonemes within their contextual dependencies and for creating the transitions between the phonemes. So far, using the formant frequencies as the parameters for synthesis has been the most successful approach.
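To make the idea of formant rules concrete, here is a deliberately simplified sketch. It assumes one fixed target per phoneme and a straight-line glide between neighboring targets; real rule systems also account for context, which this example ignores. The table `TARGETS`, the function `formant_tracks`, the frame counts, and the formant values (rough figures for three vowels, in Hz) are all hypothetical choices made for illustration.

```python
# Hypothetical rule table: phoneme -> (F1, F2, F3) targets in Hz.
TARGETS = {
    "a": (730, 1090, 2440),
    "i": (270, 2290, 3010),
    "u": (300, 870, 2240),
}

def formant_tracks(phonemes, steady_frames=10, transition_frames=5):
    """Generate one (F1, F2, F3) tuple per frame for a phoneme sequence.

    Rule 1: hold each phoneme's target for `steady_frames` frames.
    Rule 2: between two phonemes, glide linearly over `transition_frames`.
    """
    tracks = []
    for index, phoneme in enumerate(phonemes):
        target = TARGETS[phoneme]
        tracks.extend([target] * steady_frames)
        if index + 1 < len(phonemes):
            following = TARGETS[phonemes[index + 1]]
            for step in range(1, transition_frames + 1):
                weight = step / (transition_frames + 1)
                tracks.append(tuple(
                    (1 - weight) * current + weight * upcoming
                    for current, upcoming in zip(target, following)
                ))
    return tracks

# Usage: formant tracks for the vowel sequence "a" then "i".
tracks = formant_tracks(["a", "i"])
```

Each tuple in `tracks` is a set of time-varying parameters of the kind a formant synthesizer would consume, frame by frame.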