
A system armed with this type of information would know that flies are
not similar to arrows and would thus knock out the third
interpretation. Often there is more than one way to resolve language
ambiguities. The third interpretation could be syntactically resolved
by noting that "like" in the sense of "similar to"
ordinarily requires number agreement between the two objects
compared. Such a system would also note that, as there are no such
things as "time flies," the fourth interpretation too is
wrong. The system would also need such tidbits of knowledge as the
fact that flies have never shown a fondness for arrows, and that
arrows cannot and do not time anything -- much less flies
-- to select the first interpretation as the only plausible
one. The ambiguity of language, however, is far greater than this
example suggests. In a language-parsing project at the MIT
Speech Lab, Ken Church found a sentence with over two million
syntactically correct interpretations.
Often the tidbits of knowledge we need have to do with the specific
situation of speakers and listeners. If I walk into my business
associate's office and say "rook to king one," I am likely to
get a response along the lines of "excuse me?" Even if my words
were understood, their meaning would still be unclear; my associate
would probably interpret them as a sarcastic remark implying that I
think he regards himself as a king. In the context of a chess game,
however, not only is the meaning clear, but the words are easy to
recognize. Indeed, our contemporary speech-recognition systems do a
very good job when the domain of discourse is restricted to something
as narrow as a chess game. So, HAL, too, has little trouble
understanding when Frank says "rook to king one" during one of
their chess matches.
Look at the two spectrogram pictures of Dave and Frank saying the word
HAL (figure 7.2). It would be difficult to know from the pictures alone
that they are saying the same word. Yet the spectrograms present all the
salient information in the speech signals, so there must be something
about these different sound pictures that is the same; otherwise we
humans and HAL, as a human-level machine, would be unable to identify
them as two examples of the same spoken word. Thus, one key to building
automatic speech recognition (ASR) machines is the search for these
invariant features. We note, for example, that vowel sounds (e.g., the a
sound in HAL, which may be denoted as æ or /a/) involve certain resonant
frequencies called formants that are sustained over some tens of
milliseconds. We tend to find these formants in a certain mathematical
relationship whenever /a/ is spoken. The same is true of the other
sustained vowels. (Although the relationship is not a simple one, we
observe that the relationship of the frequency of the second formant to
that of the first formant for a particular vowel falls within a certain
range, with some overlap between the ranges for different vowels.)
Speech recognition systems frequently include a search function for
finding these relationships, sometimes called features.
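
To make the idea of an invariant feature concrete, here is a minimal
sketch in Python. It assumes the formant frequencies have already been
measured, and it reduces the relationship between the second and first
formants to a simple ratio, which is only one possible reading of that
relationship. The names RATIO_RANGES and candidate_vowels, the vowel
labels, and every numeric range are hypothetical values chosen for
illustration, not measurements.

```python
# Hypothetical F2/F1 ratio ranges for three vowels. The numbers are
# illustrative assumptions, not measured data. The ranges for "ae" and
# "aa" overlap on purpose, since the ranges for different vowels overlap.
RATIO_RANGES = {
    "ae": (2.0, 3.2),   # the a sound in "HAL"
    "iy": (6.0, 10.0),  # the vowel in "heed"
    "aa": (1.2, 2.2),   # the vowel in "father"
}


def candidate_vowels(f1_hz, f2_hz):
    """Return every vowel whose ratio range contains F2/F1.

    More than one vowel can come back because the ranges overlap, so a
    real recognizer would need further features to choose among them.
    """
    if f1_hz <= 0 or f2_hz <= 0:
        raise ValueError("formant frequencies must be positive")
    ratio = f2_hz / f1_hz
    return [
        vowel
        for vowel, (low, high) in RATIO_RANGES.items()
        if low <= ratio <= high
    ]


# Two speakers with different absolute formant frequencies but a similar
# ratio are mapped to the same vowel candidate.
speaker_one = candidate_vowels(660.0, 1720.0)
speaker_two = candidate_vowels(730.0, 1900.0)
```

Under these assumed ranges, both speakers yield the same single
candidate even though their absolute frequencies differ, while a ratio
that falls in the overlap between two ranges yields two candidates and
leaves the decision to other features.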
By studying spectrograms, we also note that certain changes do not
convey any information; that is, there are types of changes we should
filter out and ignore. An obvious one is loudness. When Dave shouts
HAL's name in the pod in space, HAL realizes it is still his name. HAL
infers some meaning from Dave's volume, but it is relatively
unimportant for identifying the words being spoken. We apply,
therefore, a process called normalization, in which we make
all words the same loudness so as to eliminate this noninformative
source of variability.
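
One simple way to sketch loudness normalization in Python is to scale
each utterance to a common root-mean-square (RMS) level. This is an
assumption made for illustration; the text does not say which measure of
loudness is used. The function names and the target level of 0.1 are
hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of audio samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_loudness(samples, target_rms=0.1):
    """Scale samples so that their RMS level equals target_rms.

    Silent or empty input is returned unchanged, because no gain can
    raise a signal with zero level.
    """
    level = rms(samples)
    if level == 0.0:
        return list(samples)
    gain = target_rms / level
    return [s * gain for s in samples]


# The same waveform spoken quietly and shouted (ten times the amplitude).
quiet = [0.01 * math.sin(2 * math.pi * k / 50) for k in range(200)]
shout = [10 * s for s in quiet]

quiet_normalized = normalize_loudness(quiet)
shout_normalized = normalize_loudness(shout)
```

After normalization the quiet and the shouted versions are numerically
the same signal, up to floating-point rounding, so any later stage sees
only the differences that carry information about the word.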