
The mechanisms described above for creating speech sounds -- vocal
cord vibrations, the noise of rushing air, articulatory gestures of
the mouth, teeth and tongue, the shaping of the vocal and nasal
cavities -- produce different rates of vibration. Physicists measure
these rates of vibration as frequencies; we perceive them as
pitches. Though we normally think of speech as a single time-varying
sound, it is actually a composite of many different sounds, each with
its own frequency. Using this insight, most ASR researchers, starting
in the late 1960s, began by breaking up the speech waveform into a
number of frequency bands. A typical commercial or research ASR system
will produce between a few and several dozen frequency bands. The
front end of the human auditory system does exactly the same thing:
each nerve ending in the cochlea (inner ear) responds to different
frequencies and emits a pulsed digital signal when activated by an
appropriate pitch. The cochlea differentiates several thousand
overlapping bands of frequency, which gives the human auditory system
its extremely high degree of sensitivity to frequency. Experiments
have shown that increasing the number of overlapping frequency bands
of an ASR system (thus making it more like the human auditory system)
increases the ability of that system to recognize human speech.
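To make the idea of frequency bands concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the front end of any particular ASR system: the function names (`band_energies`, `synthetic_signal`), the 8000 Hz sample rate, and the choice of eight equal-width, non-overlapping bands are all hypothetical. The cochlea and practical systems use overlapping bands, which this sketch leaves out for brevity. It computes a naive discrete Fourier transform of a short waveform and sums the energy that falls into each band.

```python
import cmath
import math


def band_energies(samples, sample_rate, num_bands):
    """Sum spectral energy into equal-width bands from 0 Hz to sample_rate / 2."""
    n = len(samples)
    half = n // 2  # bins above the Nyquist limit mirror the ones below it
    energies = [0.0] * num_bands
    for k in range(half):
        # Naive discrete Fourier transform: one complex coefficient per bin.
        coeff = sum(
            samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        # Integer arithmetic maps bin k onto one of the num_bands bands.
        band = k * num_bands // half
        energies[band] += abs(coeff) ** 2
    return energies


def synthetic_signal(freqs_hz, sample_rate, num_samples):
    """Stand-in for speech: a sum of pure tones at the given frequencies."""
    return [
        sum(math.sin(2 * math.pi * f * t / sample_rate) for f in freqs_hz)
        for t in range(num_samples)
    ]


SAMPLE_RATE = 8000  # assumed sample rate in Hz
signal = synthetic_signal([250.0, 2500.0], SAMPLE_RATE, 256)
energies = band_energies(signal, SAMPLE_RATE, 8)

# Indices of the two bands that carry the most energy.
strongest = sorted(sorted(range(len(energies)), key=lambda b: energies[b])[-2:])
print(strongest)
```

With eight bands covering 0 to 4000 Hz, each band is 500 Hz wide, so the 250 Hz tone lands in band 0 and the 2500 Hz tone in band 5. A composite sound is thus reduced to a short list of per-band energies, which is the kind of representation the paragraph above describes.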
Lesson 4: Learn While You Listen

A fourth lesson emphasizes the importance of learning. At each stage of processing, a system must adapt to the individual characteristics of the talker. Learning to do this has to take place at several levels: those of the frequency and time relationships characterizing each phoneme, the dialect (pronunciation) patterns of each word, and the syntactic patterns of possible phrases and sentences. At the highest cognitive level, a person or machine understanding speech learns a great deal about what a particular talker tends to talk about and how that talker phrases his or her thoughts.
HAL learns a great deal about his human crew mates by listening to the
sound of their voices, what they talk about, and how they put
sentences together. He also watches what their mouths do when they
articulate certain phrases (chapter 11). HAL gathers so much knowledge
about them that he can understand them even when some of the
information is obscured -- for example, when he has to rely solely
on his visual observation of Dave and Frank's lips.
We now know that 1997, when HAL reportedly became intelligent, is too soon. We won't have the quantity of computing, in terms of speed and memory, needed to build a HAL. And we won't be there in 2001 either.
Let's keep these lessons in mind as we examine the roots and future
prospects of building machines that can duplicate HAL's ability to
understand speech.