
For example, we want to know that a certain segment of sound contains
a broad noiselike band of frequencies that might represent the sound
of rushing air, as in the sound /h/ in HAL; another segment
contains two or three resonant frequencies in a certain ratio that
might represent the vowel sound /a/ in HAL. One way to
accomplish this labeling is to store examples of such sounds and
attempt to match incoming time slices against these
templates. Usually, the attempt to categorize slices of sound uses a
much finer classification system than the approximately forty phonemes
of English. We typically use a set of 256 or even 1,024 possible
classifications in a process called vector quantization.
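The template-matching step can be sketched as a nearest-neighbor lookup. This is a minimal illustration, not the system described here: the two-dimensional feature vectors, the four-entry codebook, and the function names `nearest_codeword` and `quantize` are all assumptions made for the example, and a real codebook would hold 256 or 1,024 entries over many more features.

```python
import math

def nearest_codeword(frame, codebook):
    """Return the index of the codebook entry closest to `frame`.

    Closeness is plain Euclidean distance, an assumption for this sketch.
    """
    best_index, best_dist = 0, math.inf
    for index, codeword in enumerate(codebook):
        dist = math.dist(frame, codeword)
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def quantize(frames, codebook):
    """Label every time slice with the index of its nearest template."""
    return [nearest_codeword(frame, codebook) for frame in frames]

# Toy codebook with 4 entries (hypothetical values).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Four incoming time slices, each reduced to a two-number feature vector.
frames = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.95, 0.9)]

labels = quantize(frames, codebook)
print(labels)
```

Each slice is replaced by a single small integer, so later stages work with a sequence of class labels rather than raw acoustic measurements.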
Once we have classified these time slices of sound, we can use one of several competing approaches to recognizing words. One develops statistical models for words or portions of words by analyzing massive amounts of prerecorded speech data; Markov modeling and neural nets are examples of this approach. Another tries to detect the underlying string of phonemes (or possibly other types of subword units) and then match them to the words spoken. At Kurzweil Applied Intelligence (KAI), rather than select one optimal approach, we implemented seven or eight different modules, or "experts," then programmed another software module, the "expert manager," which knows the strengths and weaknesses of the different software experts. In this decision-by-committee approach, the expert manager is the chief executive officer and makes the final decisions. In the KAI systems, some of the expert modules are based not on the sound of the words but on rules and the statistical behavior of word sequences. This is a variation of the hypothesis-and-test paradigm, in which the system expects to hear certain words according to what the speaker has already said.

Each of the modules in the system has a great deal of built-in knowledge. The acoustic experts contain knowledge of the sound structure of words or of such subword units as phonemes. The language experts know how words are strung together. The expert manager can judge which experts are more reliable in particular situations. The system as a whole begins with generic knowledge of speech and language, then adapts these knowledge structures based on what it observes in a particular speaker. In the film, Dave and Frank frequently invoke HAL's name. Even today's speech-recognition systems would quickly learn to recognize the word HAL and would not mistake it for hill or hall, at least not after being corrected once or twice.
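One simple way to picture the decision-by-committee idea is a weighted vote. The sketch below is an assumption-laden illustration and does not reproduce the KAI expert manager: the expert names, the scores, the reliability weights, and the function `combine_experts` are all hypothetical, and a weighted sum is only one of many ways such a manager could reconcile its experts.

```python
def combine_experts(expert_scores, reliability):
    """Combine per-expert word scores into one decision.

    expert_scores: {expert_name: {word: score}}
    reliability:   {expert_name: weight} (hypothetical trust in each expert)
    Returns the winning word and the combined score of every candidate.
    """
    totals = {}
    for expert, scores in expert_scores.items():
        weight = reliability.get(expert, 0.0)
        for word, score in scores.items():
            totals[word] = totals.get(word, 0.0) + weight * score
    best_word = max(totals, key=totals.get)
    return best_word, totals

# Hypothetical scores: the acoustic expert finds the three words nearly
# indistinguishable, while the language expert expects a name in context.
expert_scores = {
    "acoustic": {"HAL": 0.40, "hill": 0.35, "hall": 0.25},
    "language": {"HAL": 0.70, "hill": 0.10, "hall": 0.20},
}

# Hypothetical reliability weights held by the manager.
reliability = {"acoustic": 0.6, "language": 0.4}

best_word, totals = combine_experts(expert_scores, reliability)
print(best_word, totals)
```

In this toy setup the acoustic evidence alone barely separates the candidates, and the language expert's expectation tips the decision. Adapting to a particular speaker could be modeled by adjusting the scores or the weights after each correction.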
In continuous speech, a speech-recognition system needs to deal with
the additional ambiguity of when words start and end. Its attempts to
match the classified time slices and recognized subword units against
actual word hypotheses could result in a combinatorial explosion. A
vocabulary of, say, sixty thousand words could produce 3.6 billion
possible two-word sequences, 216 trillion three-word sequences, and so
on. Because we cannot examine even a tiny fraction of these
possibilities, search constraints based on the system's knowledge of
language are crucial.
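The arithmetic behind these figures is simply the vocabulary size raised to the length of the sequence. The short sketch below reproduces the numbers and then shows, under a purely hypothetical assumption, how much a language constraint could shrink the search. The figure of 100 plausible followers per word is invented for illustration and does not come from any measurement.

```python
VOCABULARY_SIZE = 60_000

def sequence_count(vocabulary_size, length):
    """Number of distinct word sequences of the given length, unconstrained."""
    return vocabulary_size ** length

def constrained_count(vocabulary_size, avg_followers, length):
    """Sequences left if each word allows only `avg_followers` successors.

    `avg_followers` is a hypothetical stand-in for what a language model
    might permit; it is not a figure from the text.
    """
    return vocabulary_size * avg_followers ** (length - 1)

two_word = sequence_count(VOCABULARY_SIZE, 2)
three_word = sequence_count(VOCABULARY_SIZE, 3)
pruned_three_word = constrained_count(VOCABULARY_SIZE, 100, 3)

print(f"{two_word:,}")           # 3,600,000,000
print(f"{three_word:,}")         # 216,000,000,000,000
print(f"{pruned_three_word:,}")  # 600,000,000
```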