Chapter 7








Great strides were made in normalizing the speech signal to filter out variability (lesson 2). F. Itakura, a Japanese scientist, and H. Sakoe and S. Chiba introduced dynamic programming to compute optimal nonlinear time alignments, a technique that quickly became the standard. Jim Baker and IBM's Fred Jelinek introduced a statistical method called Markov modeling; it provided a powerful mathematical tool for finding the invariant information in the speech signal.
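As a rough illustration of what a dynamic-programming time alignment does, the sketch below compares two sequences that differ only in timing. It is a minimal, generic form of the idea, not a reconstruction of the Itakura or Sakoe-Chiba systems: the function name `dtw_distance`, the use of single numbers in place of spectral frames, and the absolute-difference local cost are all assumptions made for the example.

```python
def dtw_distance(a, b):
    """Return the cost of the cheapest nonlinear time alignment of a with b.

    a and b are sequences of numbers standing in for per-frame speech
    features (an assumption; real systems compare spectral vectors).
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # hypothetical frame distance
            # Extend the best of three moves: stretch a, stretch b, or
            # advance both sequences together.
            cost[i][j] = local + min(
                cost[i - 1][j],
                cost[i][j - 1],
                cost[i - 1][j - 1],
            )
    return cost[n][m]


# The same "word" spoken more slowly: the middle sound is held longer.
fast = [1, 2, 3]
slow = [1, 2, 2, 3]
print(dtw_distance(fast, slow))
```

Because the alignment may stretch either sequence, the slowly spoken version matches the fast one at no cost, whereas a frame-by-frame comparison would penalize the difference in timing.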

Lesson 3, about breaking the speech signal into its frequency components, had already been established before the ARPA SUR projects, some of which developed systems that could adapt to aspects of the speaker's voice (lesson 4). Lesson 5 was anticipated by allowing ARPA SUR researchers to use as much computer memory as they could afford to buy and as much computer time as they had the patience to wait for. An underlying, and accurate, expectation was that Moore's law (see chapter 3 and below) would ultimately provide whatever computing platform the algorithms required.


The 1970s

The 1970s were notable for other significant research efforts. In addition to introducing dynamic programming, Itakura developed an influential analysis of spectral-distance measures, a way to compute how similar two different sounds are. His system demonstrated an impressive 97.3 percent accuracy on two hundred Japanese words spoken over the telephone. Bell Labs also achieved significant success (97.1 percent accuracy) with speaker-independent systems -- that is, systems that understand voices they have not heard before. IBM concentrated on the Markov modeling statistical technique and demonstrated systems that could recognize a large vocabulary.

By the end of the 1970s, numerous commercial speech-recognition products were available. They ranged from Heuristics' $259 H-2000 Speech Link to $100,000 speaker-independent systems from Verbex and Nippon. Other companies, including Threshold, Scott, Centigram, and Interstate, offered systems with sixteen-channel filter banks at prices between $2,000 and $15,000. Such products could recognize small vocabularies spoken with pauses between words.


The 1980s

The 1980s saw the commercial field of ASR split into two fairly distinct market segments. One group -- which included Verbex, Voice Processing Corporation, and several others -- pursued reliable speaker-independent recognition of small vocabularies for telephone transaction processing. The other group, which included IBM and two new companies -- Jim and Janet Baker's Dragon Systems, and my Kurzweil Applied Intelligence -- pursued large-vocabulary ASR for creating written documents by voice.

Important work on large-vocabulary continuous speech (i.e., speech with no pauses between words) was also conducted at Carnegie Mellon University by Kai-Fu Lee, who subsequently left the university to head Apple's speech-recognition efforts.

By 1991, revenues for the ASR industry were in the low eight figures and were increasing substantially every year. A buyer could choose any one (but not two) of the characteristics on the following menu: large vocabulary, speaker independence, or continuous speech. HAL, of course, could do all three.


The State of the Art

So where are we today? We now, finally, have inexpensive personal computers that can support high-performance ASR software. Buyers can now choose any two (but not all three) capabilities from the menu listed above. For example, my company's Kurzweil VOICE for Windows can recognize a sixty-thousand-word vocabulary spoken discretely (i.e., with brief pauses between each word). Another experimental system can handle a thousand-word, command-and-control vocabulary with continuous speech (i.e., no pauses). Both systems provide speaker independence; that is, they can recognize words spoken by your voice even if they've never heard it before. Systems in this product category are also made by Dragon Systems and IBM.

