Chapter 7








Of course, the limitation to discrete speech is no minor exception. When will our computers be capable of recognizing fully continuous speech? Recently, ARPA has funded a new round of research aimed at "holy grail" systems that combine all three capabilities -- handling continuous speech with very large vocabularies and speaker independence. As in the earlier ARPA SUR projects, there are no restrictions on memory or real-time performance. Restricting the task to understanding "business English," ARPA contractors -- including Philips, Bolt, Beranek and Newman, Dragon Systems, Inc., and others -- have reported word accuracies of around 97 percent or higher. Moore's law will take care of achieving real-time performance on affordable machines, so we should see such systems available commercially by, perhaps, early 1998.

Expanding the domain of recognition -- not to mention understanding -- to the humanlike flexibility HAL displays will take a far greater mastery of the many levels of knowledge represented in spoken language. I would expect that by the year 2001 -- remembering that in the movie HAL became intelligent much earlier -- we will have systems able to recognize speech well enough to produce a written transcription of the movie from the sound track. Even then, the error rate will be far higher than HAL's (who, of course, claims he has never made a mistake).

In 1997 we appreciate that speech recognition does not exist in a vacuum but has to be integrated with other levels and sources of knowledge. Kurzweil Applied Intelligence, Inc., for example, has integrated its large-vocabulary speech recognition capability with an expert system that has extensive knowledge about the preparation of medical reports; the Kurzweil VoiceMED can guide doctors through the reporting process and help them comply with the latest regulations (see figure 7.7). If you find yourself in a hospital emergency room, there is a 10-percent chance your attending physician will dictate his or her report to one of our speech-recognition systems. We recently began adding the ability to understand natural language commands spoken in continuous speech. If, for example, you say, "go to the second paragraph on the next page; select the second sentence; capitalize every word in this sentence; underline it ..." the system is likely to follow this series of commands. If you say "Open the pod bay doors," it will probably respond "Command not understood."


How to Build a Speech Recognizer

Software today is not an isolated field, but one that encompasses and codifies every other field of endeavor. Everyone -- librarians, musicians, magazine publishers, doctors, graphic artists, architects, researchers of every kind -- is digitizing their knowledge bases, methods, and expressions of their work. Those of us working on speech understanding are experiencing the same rapid change, as hundreds of scientists and engineers build increasingly elaborate data bases and structures to describe our knowledge of speech sounds, phonetics, linguistics, syntax, semantics, and pragmatics -- in accordance with lesson 1.

A speech-recognition system operates in phases, with each new phase using increasingly sophisticated knowledge about the next higher level of language. At the front end, the system converts the time-varying air pressure we call sound into an electrical signal, as Bell did a hundred years ago with his crude microphones. Then, a device called an analog-to-digital converter changes the signal into a series of numbers. The numbers may be modified to normalize for loudness levels and possibly to eliminate background noise and distortion. The signal, which is now a digital stream of numbers, is usually converted into multiple streams, each of which represents a different frequency band. These multiple streams are then compressed, using a variety of mathematical techniques that reduce the amount of information and emphasize those features of the speech signal important for recognizing speech.
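The front-end stages just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in any particular recognizer: the function names, the 8-bit quantizer, the 64-sample frame, the four equal-width frequency bands, and the use of log band energies as the "compression" step are all hypothetical choices made for the example.

```python
import cmath
import math

def digitize(signal, levels=256):
    """Analog-to-digital step: round samples in [-1, 1] to integer codes."""
    half = levels // 2
    return [max(-half, min(half - 1, round(s * half))) for s in signal]

def normalize(samples):
    """Loudness normalization: scale so the loudest sample has magnitude 1."""
    peak = max(abs(s) for s in samples)
    return [s / peak for s in samples] if peak else [0.0] * len(samples)

def band_energies(frame, n_bands):
    """Sum one frame's DFT power spectrum into n_bands equal-width bands."""
    n = len(frame)
    half = n // 2
    power = []
    for k in range(half):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        power.append(abs(coeff) ** 2)
    width = half // n_bands
    return [sum(power[b * width:(b + 1) * width]) for b in range(n_bands)]

def front_end(signal, frame_len=64, n_bands=4):
    """Return one short list of log band energies per frame of input."""
    samples = normalize(digitize(signal))
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies = band_energies(frame, n_bands)
        # Log compression (assumed here): 64 samples shrink to 4 numbers.
        features.append([math.log(e + 1e-10) for e in energies])
    return features

# A 1000 Hz test tone sampled at 8000 Hz stands in for microphone input.
rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 1000 * t / rate) for t in range(256)]
features = front_end(tone)
```

With these settings each band spans 1000 Hz, so the test tone should show up as the largest value in the second band of every frame. Real systems use far more bands, overlapping frames, and more elaborate compression, but the flow of data is the same.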

