
Of course, the limitation to discrete speech is no minor exception. When will our computers be capable of recognizing fully continuous speech? Recently, ARPA has funded a new round of research aimed at "holy grail" systems that combine all three capabilities -- handling continuous speech with very large vocabularies and speaker independence. As in the earlier ARPA SUR projects, there are no restrictions on memory or real-time performance. Restricting the task to understanding "business English," ARPA contractors -- including Philips; Bolt Beranek and Newman; Dragon Systems, Inc.; and others -- have reported word accuracies around 97 percent or higher. Moore's law will take care of achieving real-time performance on affordable machines, so we should see such systems available commercially by, perhaps, early 1998. Expanding the domain of recognition -- not to mention understanding -- to the humanlike flexibility HAL displays will take a far greater mastery of the many levels of knowledge represented in spoken language. I would expect that by the year 2001 -- remembering that in the movie HAL became intelligent much earlier -- we will have systems able to recognize speech well enough to produce a written transcription of the movie from the sound track. Even then, the error rate will be far higher than HAL's (who, of course, claims he has never made a mistake).

In 1997 we appreciate that speech recognition does not exist in a
vacuum but has to be integrated with other levels and sources of
knowledge. Kurzweil Applied Intelligence, Inc., for example, has
integrated its large-vocabulary speech recognition capability with an
expert system that has extensive knowledge about the preparation of
medical reports; the Kurzweil VoiceMED can guide doctors through the
reporting process and help them comply with the latest
regulations (see figure 7.7). If you find yourself in a hospital
emergency room, there is a 10-percent chance your attending physician
will dictate his or her report to one of our speech-recognition
systems. We recently began adding the ability to understand natural
language commands spoken in continuous speech. If, for example, you
say, "go to the second paragraph on the next page; select the
second sentence; capitalize every word in this sentence; underline it
..." the system is likely to follow this series of commands. If you
say "Open the pod bay doors," it will probably respond
"Command not understood."

A speech-recognition system operates in phases, with each new phase
using increasingly sophisticated knowledge about the next higher level
of language. At the front end, the system converts the time-varying
air pressure we call sound into an electrical signal, as Bell did a
hundred years ago with his crude microphones. Then, a device called an
analog-to-digital converter changes the signal into a series of
numbers. The numbers may be modified to normalize for loudness levels
and possibly to eliminate background noise and distortion. The signal,
which is now a digital stream of numbers, is usually converted into
multiple streams, each of which represents a different frequency
band. These multiple streams are then compressed, using a variety of
mathematical techniques that reduce the amount of information and
emphasize those features of the speech signal important for
recognizing speech.
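
The front-end steps just described can be illustrated with a small sketch. It is not the processing used in any particular product; the sample rate, frame size, number of bands, and every function name below are assumptions chosen for illustration. It quantizes a synthetic signal the way an analog-to-digital converter might, normalizes for loudness, splits each short frame into a few frequency bands with a naive Fourier transform, and compresses each band to a single log-energy value.

```python
import cmath
import math

# All constants here are illustrative assumptions, not values from the text.
SAMPLE_RATE = 8000   # samples per second
FRAME_SIZE = 64      # samples per analysis frame
NUM_BANDS = 4        # number of frequency bands per frame


def digitize(signal, levels=256):
    """Mimic an analog-to-digital converter: map values in [-1, 1] to integer codes."""
    half = levels // 2
    return [max(-half, min(half - 1, round(x * half))) for x in signal]


def normalize(samples):
    """Scale samples so the loudest one has magnitude 1.0."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return [0.0 for _ in samples]
    return [s / peak for s in samples]


def band_energies(frame, num_bands=NUM_BANDS):
    """Split one frame into frequency bands; return one log-compressed energy per band."""
    n = len(frame)
    half = n // 2
    # Naive discrete Fourier transform: power in each bin below the Nyquist frequency.
    power = []
    for k in range(half):
        coeff = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(coeff) ** 2)
    width = half // num_bands
    # Summing bins and taking a logarithm reduces each band to a single number.
    return [math.log1p(sum(power[b * width:(b + 1) * width])) for b in range(num_bands)]


def front_end(signal):
    """Run the whole chain: digitize, normalize, frame, and extract band features."""
    samples = normalize(digitize(signal))
    starts = range(0, len(samples) - FRAME_SIZE + 1, FRAME_SIZE)
    return [band_energies(samples[i:i + FRAME_SIZE]) for i in starts]


# A 500 Hz test tone stands in for the microphone signal.
tone = [0.5 * math.sin(2 * math.pi * 500 * t / SAMPLE_RATE) for t in range(256)]
features = front_end(tone)
```

Each entry of `features` is one frame reduced from 64 samples to 4 numbers, which is the kind of information reduction the paragraph above describes. A practical system would use a fast Fourier transform and perceptually motivated band spacing rather than the equal-width bands assumed here.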