Chapter 7








A system armed with this type of information would know that flies are not similar to arrows and would thus knock out the third interpretation. Often there is more than one way to resolve language ambiguities. The third interpretation could also be resolved syntactically by noting that "like" in the sense of "similar to" ordinarily requires number agreement between the two objects compared. Such a system would also note that, as there are no such things as "time flies," the fourth interpretation too is wrong. The system would also need such tidbits of knowledge as the fact that flies have never shown a fondness for arrows, and that arrows cannot and do not time anything, much less flies, to select the first interpretation as the only plausible one. The ambiguity of language, however, is far greater than this example suggests. In a language-parsing project at the MIT Speech Lab, Ken Church found a sentence with over two million syntactically correct interpretations.

Often the tidbits of knowledge we need have to do with the specific situation of speakers and listeners. If I walk into my business associate's office and say "rook to king one," I am likely to get a response along the lines of "excuse me?" Even if my words were understood, their meaning would still be unclear; my associate would probably interpret them as a sarcastic remark implying that I think he regards himself as a king. In the context of a chess game, however, not only is the meaning clear, but the words are easy to recognize. Indeed, our contemporary speech-recognition systems do a very good job when the domain of discourse is restricted to something as narrow as a chess game. So HAL, too, has little trouble understanding when Frank says "rook to king one" during one of their chess matches.


Lesson 2: The Unpredictability of Human Speech

A second lesson for building our computer system is that it must be capable of understanding the variability of human speech. We can, of course, build in pictures of human speech called spectrograms, which plot the intensity of different frequencies (or pitches, in human perceptual terms) as they change over time. What is interesting and, for those of us developing speech-recognition machines, daunting is that spectrograms of two people saying the same word can look dramatically different. Even the same person pronouncing the same word at different times can produce quite different spectrograms.
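To make the idea of a spectrogram concrete, here is a minimal sketch in Python of how one might be computed: the signal is cut into short overlapping frames, and the intensity at each frequency is measured for every frame. The function name, the frame size, the hop size, and the test tone are all assumptions chosen for illustration; a real speech system would use a fast Fourier transform library rather than the slow direct transform shown here.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Return one list of frequency-bin magnitudes per time frame.

    Illustrative sketch only: a direct (slow) discrete Fourier transform
    applied to short, overlapping, Hann-windowed frames of the signal.
    """
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # Taper the frame edges with a Hann window to reduce spectral leakage.
        windowed = [
            s * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        magnitudes = []
        # Only bins 0..frame_size/2 carry distinct information for real input.
        for k in range(frame_size // 2 + 1):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


# Assumed test signal: a pure 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
frame_size = 64
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]

spec = spectrogram(tone, frame_size=frame_size, hop=32)
# Find the strongest frequency bin in the first frame and convert it to hertz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / frame_size
print(len(spec), "frames; strongest frequency in frame 0:", peak_hz, "Hz")
```

Each inner list is one vertical slice of the picture: time runs across the frames, frequency runs along the bins, and the magnitude is the darkness of the ink. Real speech, unlike this steady tone, produces slices that change from one frame to the next.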

Look at the two spectrogram pictures of Dave and Frank saying the word HAL (figure 7.2). It would be difficult to know that they are saying the same word from the pictures alone. Yet the spectrograms present all the salient information in the speech signals.

Still, there must be something about these different sound pictures that is the same; otherwise we humans and HAL, as a human-level machine, would be unable to identify them as two examples of the same spoken word. Thus, one key to building automatic speech recognition (ASR) machines is the search for these invariant features. We note, for example, that vowel sounds (e.g., the a sound in HAL, which may be denoted as æ or /a/) involve certain resonant frequencies called formants that are sustained over some tens of milliseconds. We tend to find these formants in a certain mathematical relationship whenever /a/ is spoken. The same is true of the other sustained vowels. (Although the relationship is not a simple one, we observe that the ratio of the frequency of the second formant to that of the first formant for a particular vowel falls within a certain range, with some overlap between the ranges for different vowels.) Speech recognition systems frequently include a search function for finding these relationships, which are sometimes called features.
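The formant relationship described above can be sketched as a simple lookup: compute the ratio of the second formant to the first and see which vowels' ranges contain it. Everything specific in this sketch is an assumption: the function name, the vowel labels, and above all the numeric ranges, which are made-up placeholders chosen to show overlapping ranges and are not measured values.

```python
# HYPOTHETICAL ranges of the F2/F1 ratio for a few vowels.
# These numbers are placeholders for illustration, not measurements.
RATIO_RANGES = {
    "ae": (2.0, 3.2),   # the a sound in HAL
    "u": (2.6, 3.5),    # deliberately overlaps "ae"
    "a": (1.2, 1.9),
    "i": (6.0, 10.0),
}


def candidate_vowels(f1_hz, f2_hz, ranges=RATIO_RANGES):
    """Return the vowels whose assumed F2/F1 range contains the observed ratio."""
    if f1_hz <= 0:
        raise ValueError("first formant frequency must be positive")
    ratio = f2_hz / f1_hz
    return sorted(
        vowel for vowel, (low, high) in ranges.items() if low <= ratio <= high
    )


print(candidate_vowels(700, 1050))   # ratio 1.5 falls in one range only
print(candidate_vowels(600, 1800))   # ratio 3.0 falls where two ranges overlap
print(candidate_vowels(500, 2500))   # ratio 5.0 matches none of the ranges
```

Because the ranges overlap, the ratio alone sometimes leaves more than one candidate, which is one reason a recognizer cannot rely on a single feature.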

By studying spectrograms, we also note that certain changes do not convey any information; that is, there are types of changes we should filter out and ignore. An obvious one is loudness. When Dave shouts HAL's name in the pod in space, HAL realizes it is still his name. HAL infers some meaning from Dave's volume, but it is relatively unimportant for identifying the words being spoken. We apply, therefore, a process called normalization, in which we make all words the same loudness so as to eliminate this noninformative source of variability.
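One simple way to carry out loudness normalization, shown here as a sketch rather than as the method any particular recognizer uses, is to rescale every utterance to the same root-mean-square (RMS) level. The function names, the target level of 0.1, and the test signals are assumptions made for illustration.

```python
import math


def rms(samples):
    """Root-mean-square level of a signal, a common measure of loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_loudness(samples, target_rms=0.1):
    """Scale the samples so their RMS level equals target_rms.

    Empty or completely silent input is returned unchanged, since there
    is no level to rescale.
    """
    if not samples:
        return []
    level = rms(samples)
    if level == 0:
        return list(samples)
    gain = target_rms / level
    return [s * gain for s in samples]


# Assumed test signals: the same waveform spoken quietly and shouted.
quiet = [0.01 * math.sin(2 * math.pi * t / 20) for t in range(200)]
loud = [50 * s for s in quiet]

norm_quiet = normalize_loudness(quiet)
norm_loud = normalize_loudness(loud)
# After normalization the two versions agree to within rounding error.
largest_gap = max(abs(a - b) for a, b in zip(norm_quiet, norm_loud))
print("largest difference after normalization:", largest_gap)
```

After this step the whisper and the shout are, for the purposes of word identification, the same signal; whatever meaning the volume carried would have to be captured separately before it is discarded.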