Chapter 7








The test involves an acoustic matching process, but the hypothesis has nothing to do with sound at all -- nor even with language -- but rather relates to knowledge on a multiplicity of levels. As many of the chapters in this book point out, knowledge goes far beyond mere facts and data. For information to become knowledge, it must incorporate the relationships between ideas. And for knowledge to be useful, the links describing how concepts interact must be easily accessed, updated, and manipulated. Human intelligence is remarkable in its ability to perform all these tasks. Ironically, it is also remarkably weak at reliably storing the information on which knowledge is based. The natural strengths of today's computers are roughly the opposite. They have, therefore, become powerful allies of the human intellect because of their ability to reliably store and rapidly retrieve vast quantities of information. Conversely, they have been slow to master true knowledge. Modeling the knowledge needed to understand the highly ambiguous and variable phenomenon of human speech has been a primary key to making progress in the field of automatic speech recognition (ASR).


Lesson 1: Knowledge Is a Many-Layered Thing

Thus lesson number one for constructing a computer system that can understand human speech is to build in knowledge at many levels: the structure of speech sounds, the way speech is produced by our vocal apparatus, the patterns of speech sounds that comprise dialects and languages, the complex (and not fully understood) rules of word usage, and, the greatest difficulty of all, general knowledge of the subject matter being spoken about.

Each level of analysis provides useful constraints that can limit our search for the right answer. For example, the basic building blocks of speech, called phonemes, cannot appear in just any order. Indeed, many sequences are impossible to articulate (try saying ptkee). More important, only certain phoneme sequences correspond to a word or word fragment in the language. The set of phonemes used is similar (though not identical) from one language to another, but contextual factors differ dramatically. English, for example, has over ten thousand possible syllables, whereas Japanese has only a hundred and twenty.
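To make the idea of a phoneme-level constraint concrete, here is a minimal sketch in Python. It assumes a tiny invented lexicon and informal phoneme symbols (LEXICON, is_viable, and matching_words are hypothetical names chosen for illustration, not part of any real recognizer). A candidate phoneme sequence is kept only if it could still grow into a known word, so an unpronounceable string such as ptkee is discarded at once.

```python
# Toy lexicon mapping words to phoneme sequences.
# The words and symbols are illustrative assumptions, not a real dictionary.
LEXICON = {
    "tea": ("T", "IY"),
    "key": ("K", "IY"),
    "keep": ("K", "IY", "P"),
    "peak": ("P", "IY", "K"),
}


def build_prefixes(lexicon):
    """Collect every phoneme prefix that begins at least one word."""
    prefixes = set()
    for phonemes in lexicon.values():
        for end in range(1, len(phonemes) + 1):
            prefixes.add(phonemes[:end])
    return prefixes


PREFIXES = build_prefixes(LEXICON)


def is_viable(sequence):
    """True if the sequence is a prefix of (or equal to) some word."""
    return tuple(sequence) in PREFIXES


def matching_words(sequence):
    """Words whose full pronunciation equals the sequence."""
    target = tuple(sequence)
    return sorted(word for word, phonemes in LEXICON.items() if phonemes == target)


if __name__ == "__main__":
    print(is_viable(["K", "IY"]))            # a word, and the start of "keep"
    print(is_viable(["P", "T", "K", "IY"]))  # "ptkee": no word starts this way
    print(matching_words(["K", "IY"]))
```

A real system would weigh such constraints probabilistically rather than accept or reject outright, but even this crude filter shows how one level of knowledge narrows the search.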

On a higher level, the syntax and semantics of the language put constraints on possible word orders. Resolving homonym ambiguities can require multiple levels of knowledge. One type of technology frequently used in speech recognition and understanding systems is a sentence parser, which builds sentence diagrams like those we learned in elementary school (see figure 7.1). One of the first such systems, developed in 1963 by Susumu Kuno of Harvard (around the time Kubrick and Clarke began work on 2001), revealed the depth of ambiguity in English. Kuno asked his computerized parser what the sentence "Time flies like an arrow" means. In what has become a famous response, the computer replied that it was not quite sure. It might mean

1. That time passes as quickly as an arrow passes.

2. Or maybe it is a command telling us to time the flies the same way that an arrow times flies; that is, Time flies like an arrow would.

3. Or it could be a command telling us to time only those flies that are similar to arrows; that is, Time flies that are like an arrow.

4. Or perhaps it means that a type of flies known as time flies have a fondness for arrows: Time-flies like (i.e., appreciate) an arrow.

It became clear from this and other syntactical ambiguities that understanding language, spoken or written, requires knowledge both of the relationships between words and of the concepts underlying words. It is impossible to understand the sentence about time (or even to understand that the sentence is indeed talking about time and not flies) without mastery of the knowledge structures that represent what we know about time, flies, arrows, and how these concepts relate to one another.
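The ambiguity in "Time flies like an arrow" can be reproduced with a toy grammar. The sketch below is not Kuno's parser. It is a small chart parser (the CYK algorithm) over a hypothetical hand-written grammar in which time and flies may each be a noun or a verb and like may be a preposition or a verb. Commands have no subject, so the imperative readings appear as whole-sentence verb phrases (VP) rather than as sentences (S).

```python
from collections import defaultdict

# Hypothetical lexicon: each word lists every category it might belong to.
LEXICON = {
    "time": {"N", "NP", "V"},
    "flies": {"N", "NP", "V"},
    "like": {"P", "V"},
    "an": {"Det"},
    "arrow": {"N", "NP"},
}

# Hypothetical binary rules: (parent, left child, right child).
RULES = [
    ("S", "NP", "VP"),    # declarative sentence
    ("NP", "Det", "N"),   # "an arrow"
    ("NP", "N", "N"),     # compound noun: "time flies"
    ("NP", "NP", "PP"),   # "flies like an arrow" (flies resembling arrows)
    ("PP", "P", "NP"),    # "like an arrow"
    ("VP", "V", "NP"),    # "time flies", "like an arrow"
    ("VP", "V", "PP"),    # "flies like an arrow"
    ("VP", "VP", "PP"),   # "time flies" + "like an arrow"
]


def count_parses(words):
    """Return {category: number of distinct parse trees} for the whole input."""
    n = len(words)
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for category in LEXICON[word]:
            chart[(i, i + 1)][category] = 1
    # Build longer spans from every pair of adjacent shorter spans.
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for split in range(start + 1, end):
                left = chart[(start, split)]
                right = chart[(split, end)]
                for parent, b, c in RULES:
                    if b in left and c in right:
                        chart[(start, end)][parent] += left[b] * right[c]
    return dict(chart[(0, n)])


if __name__ == "__main__":
    counts = count_parses("time flies like an arrow".split())
    print(counts)
```

Under these assumptions the chart holds two parses rooted in S (readings 1 and 4 above) and two rooted in VP (the commands of readings 2 and 3). The grammar finds all four structures with equal ease; nothing in it knows that time flies are not a kind of insect. Choosing among the readings takes the higher levels of knowledge this lesson describes.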