Chapter 7








For example, we want to know that a certain segment of sound contains a broad noiselike band of frequencies that might represent the sound of rushing air, as in the sound /h/ in HAL; another segment contains two or three resonant frequencies in a certain ratio that might represent the vowel sound /a/ in HAL. One way to accomplish this labeling is to store examples of such sounds and attempt to match incoming time slices against these templates. Usually, the attempt to categorize slices of sound uses a much finer classification system than the approximately forty phonemes of English. We typically use a set of 256 or even 1,024 possible classifications in a process called vector quantization.
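As a rough illustration of vector quantization, the sketch below assigns each time slice to the nearest entry in a stored codebook. The two-dimensional features, the four-entry codebook, and the function name `quantize` are assumptions made for readability. A real system would use richer spectral features and the 256 or 1,024 entries mentioned above.

```python
import math

def quantize(frame, codebook):
    """Return the index of the codebook vector closest to `frame`."""
    best_index = 0
    best_dist = math.inf
    for i, code in enumerate(codebook):
        # Squared Euclidean distance between the frame and this template.
        dist = sum((a - b) ** 2 for a, b in zip(frame, code))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index

# Hypothetical toy codebook: four templates in a two-dimensional feature space.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Hypothetical feature vectors for four consecutive time slices.
frames = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.7, 0.9)]

# Each slice is replaced by the label of its nearest template.
labels = [quantize(f, CODEBOOK) for f in frames]
print(labels)
```

After this step, each slice of sound is reduced to a single class label, and the later recognition stages work with those labels rather than with the raw signal.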

Once we have classified these time slices of sound, we can use one of several competing approaches to recognizing words. One of them develops statistical models for words or portions of words by analyzing massive amounts of prerecorded speech data. Markov modeling and neural nets are examples of this approach. Another approach tries to detect the underlying string of phonemes (or possibly other types of subword units) and then match them to the words spoken.

At Kurzweil Applied Intelligence (KAI), rather than selecting a single optimal approach, we implemented seven or eight different modules, or "experts," then programmed another software module, the "expert manager," which knows the strengths and weaknesses of the different software experts. In this decision-by-committee approach, the expert manager is the chief executive officer and makes the final decisions.

In the KAI systems, some of the expert modules are based not on the sound of the words but on rules and the statistical behavior of word sequences. This is a variation of the hypothesis-and-test paradigm in which the system expects to hear certain words, according to what the speaker has already said. Each of the modules in the system has a great deal of built-in knowledge. The acoustic experts contain knowledge of the sound structure of words or of such subword units as phonemes. The language experts know how words are strung together. The expert manager can judge which experts are more reliable in particular situations.
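One simple way to picture the decision-by-committee idea is a reliability-weighted vote. The sketch below is not the KAI implementation. The weighted sum, the function name `expert_manager`, and every score and weight are assumptions chosen only to show how an acoustic expert and a language expert might be combined.

```python
def expert_manager(expert_scores, reliability):
    """Pick the word with the highest reliability-weighted total score."""
    totals = {}
    for expert, scores in expert_scores.items():
        weight = reliability[expert]
        for word, score in scores.items():
            totals[word] = totals.get(word, 0.0) + weight * score
    best = max(totals, key=totals.get)
    return best, totals

# Hypothetical scores: the acoustic expert finds the three words similar,
# while the language expert expects "HAL" given what was said so far.
expert_scores = {
    "acoustic": {"HAL": 0.40, "hill": 0.35, "hall": 0.25},
    "language": {"HAL": 0.70, "hill": 0.10, "hall": 0.20},
}

# Hypothetical reliabilities the manager assigns in this situation.
reliability = {"acoustic": 0.6, "language": 0.4}

best_word, totals = expert_manager(expert_scores, reliability)
print(best_word)
```

In this sketch the manager's knowledge of each expert's strengths lives entirely in the `reliability` table. Changing those weights for different situations is one way to model the manager trusting different experts at different times.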

The system as a whole begins with generic knowledge of speech and language, then adapts these knowledge structures based on what it observes in a particular speaker. In the film, Dave and Frank frequently invoke HAL's name. Even today's speech-recognition systems would quickly learn to recognize the word HAL and would not mistake it for hill or hall, at least not after being corrected once or twice.

In continuous speech, a speech-recognition system needs to deal with the additional ambiguity of when words start and end. Its attempts to match the classified time slices and recognized subword units against actual word hypotheses could result in a combinatorial explosion. A vocabulary of, say, sixty thousand words could produce 3.6 billion possible two-word sequences, 216 trillion three-word sequences, and so on. Because we cannot examine even a tiny fraction of these possibilities, search constraints based on the system's knowledge of language are crucial.
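The figures above follow directly from raising the vocabulary size to the length of the sequence. The short sketch below reproduces them and also shows how quickly the count shrinks under a language constraint. The limit of 100 plausible successors per word is a purely hypothetical figure, and `sequence_count` is an illustrative helper, not part of any real system.

```python
def sequence_count(vocabulary_size, length, successors_per_word=None):
    """Count candidate word sequences of the given length.

    With no constraint, any word may follow any other word. With a
    constraint, each word after the first is limited to a fixed number
    of plausible successors.
    """
    if successors_per_word is None:
        return vocabulary_size ** length
    return vocabulary_size * successors_per_word ** (length - 1)

VOCABULARY = 60_000

two_word = sequence_count(VOCABULARY, 2)    # 3.6 billion
three_word = sequence_count(VOCABULARY, 3)  # 216 trillion

# Hypothetical constraint: only 100 words are plausible after any given word.
three_word_constrained = sequence_count(VOCABULARY, 3, successors_per_word=100)

print(f"{two_word:,}")                # 3,600,000,000
print(f"{three_word:,}")              # 216,000,000,000,000
print(f"{three_word_constrained:,}")  # 600,000,000
```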


Moore's Law

The other major ingredient needed to achieve the holy grail (i.e., a system that can understand fully continuous speech with high accuracy, with relatively unrestricted vocabulary and domain, and with no previous exposure to the speaker) is a more powerful computer. We already have systems that can combine continuous speech, very large vocabularies, and speaker independence; their only limitation is that the domain is restricted to business English. But these systems require more than 100 megabytes of RAM and run much slower than real time on powerful workstations. Even though computational power is critical to developing speech recognition and understanding, no one in the field is worried about obtaining it in the near future. We know we will not have to wait long to achieve the requisite computational power because of Moore's law.