
[August 1, 2001]
Atomistic Parallel Listening: The Next
Step In Speech Recognition
BY WALTER ROLANDI, Ph.D.
Everyone agrees that speech recognition technologies have advanced
significantly over the past ten years. Under favorable conditions,
recognition rates for both grammar-based and dictation systems approach a
human-like ability to recognize words and phrases. The dramatic growth of
various speech recognition companies attests to the stability of the
technology.
But if the technology is so advanced and sound, why isn't everyone
everywhere already talking to computers? The answer depends on what is
meant by "talking to computers."
There are indeed a great many people who interact with voice portals on
a daily basis. These applications are typically "hidden menu"
systems, or voice response units (VRUs), where users are prompted in some
way or another to say one of a set of available choices. In these
circumstances, the number of things a user can say is limited and the
possibilities are usually reduced to a set of relatively short but
phonetically distinct utterances. Recognition of one item either leads the
user to content or to other menu choices that are required before content
can be delivered to the user. "Talking to computers," in this
respect, is well supported by current technologies. When a grammar is
limited and its elements are sufficiently distinct, commercial recognizers
will understand user utterances at rates close to 100 percent.
But human conversation is not limited to proffering and selecting among
menu choices. Conversational dialog is highly variable and indeterminate.
Being able to recognize utterances, while certainly an important
component, is merely one of many capacities that are called into play when
humans converse with each other. Thus, if "talking to computers"
is to convey any significant similarity to "talking among
humans," existing recognition technology is obviously inadequate.
What's Missing From Today's Speech Recognition
Speech recognizers do not currently account for semantic information
conveyed by intonation, volume, pitch, pace, and other prosodic
characteristics of conversational speech acts. Nor do they provide any
means to model user intentions.
Additionally, and paraphrasing speech industry guru Bruce Balentine,
speech recognizers have no memory: they have no way to relate what was
recognized last to what is being recognized now. Each and every utterance
is treated as an independent and unrelated event.
Commercial vendors of applications using conversational dialog
interfaces typically compensate for these shortcomings through skillful
design of their dialogs. They know that speech recognizers are designed to
be the "ears" of their applications, and that they are
responsible for designing the "brains."
Speech recognizers will probably advance to account for factors such as
intonation, volume, and pitch in the not too distant future. On the other
hand, recognizers are not likely to incorporate user models or any
significant sense of memory anytime soon.
Additionally, there are limitations to what can be overcome by creative
dialog designers. Not even the world's most gifted dialog designer could
create an interactive system capable of continuously compensating for the
absence of "brains." What is still missing in speech recognition
today is its ability to adapt. In order to progress towards human-like
conversational interaction, speech systems will have to be able to
dynamically learn from their "experiences," and adapt their
behavior appropriately.
A New Approach: Atomistic Parallel Listening
Atomistic parallel listening is a promising new approach in speech
recognition. Basically, parallel listening entails using a conventional
primary grammar in conjunction with a secondary grammar that looks for
"atomistic" components of speech. Think of an atom as a smaller unit of
speech: an individual word, a syllable, or even a phoneme can serve as an
atom. In contrast to the primary
grammar that anticipates entire utterances, the secondary grammar looks
for constituent atoms within the utterances.
The basic idea is to use the secondary grammar to extract atoms from a
user's utterances that are predictive of his intent. Here is how it works:
Each time the user says something that is recognized on the primary
grammar, the recognizer returns a natural language tag. These tags
represent the "meaning" that the user intended to convey.
Behaviorally, tags determine the consequences of the user's utterance: they
cause the system to take some appropriate action in response to the
recognition result.
As such, each correct recognition result on the primary grammar
represents an opportunity for the system to learn by example. While the
primary grammar is processing the utterance, the secondary grammar is
doing so as well. Learning takes place by correlating what the secondary
grammar "hears" with the natural language tags returned by the
primary grammar. Correlations are made for each atom that is returned by
the secondary grammar, both individually and collectively.
Consider an example in which syllables are used as atoms, and the
following recognition results are obtained on the primary and secondary
grammars:
Primary:
Schedule an appointment for tomorrow
Secondary:
Sked joule an up point ment for to mar row
Again, each time the user says something that is recognized on the
primary grammar, the system has an opportunity to learn. If the user
accepts the subsequent action taken by the system (that is, the user does
not cancel it), the atoms detected by the secondary grammar can be
assumed to be predictors of that action. The atoms returned by the
secondary grammar, both individually and collectively, provide cumulative
evidence of what the user intends the system to do.
Thus, whatever actions the application might take as a consequence of
recognizing "Schedule an appointment for tomorrow" become predicted
by the individual atoms "Sked," "joule," "an," "up," "point,"
"ment," "for," "to," "mar," and "row."
Furthermore, since atoms such as "Sked" and "joule" are strongly
correlated, the presence of both of them provides additional evidence that
the "Schedule an appointment for tomorrow" tags are indicated.
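The learning step can be pictured as a simple co-occurrence counter. The Python sketch below is only an illustration, not the author's implementation: the class name AtomTagCounts, the tag string SCHEDULE_APPOINTMENT, and the use of raw counts as the measure of correlation are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


class AtomTagCounts:
    """Hypothetical store of how often atoms co-occur with primary-grammar tags."""

    def __init__(self):
        self.atom_tag = defaultdict(Counter)  # atom -> Counter of tags
        self.pair_tag = defaultdict(Counter)  # (atom, atom) -> Counter of tags

    def observe(self, atoms, tag, accepted=True):
        """Record one successful primary recognition and its secondary atoms.

        Nothing is learned when the user cancelled the resulting action.
        """
        if not accepted:
            return
        unique = sorted(set(atoms))
        # Individual evidence: each atom predicts the tag.
        for atom in unique:
            self.atom_tag[atom][tag] += 1
        # Collective evidence: each pair of atoms heard together predicts the tag.
        for pair in combinations(unique, 2):
            self.pair_tag[pair][tag] += 1


counts = AtomTagCounts()
counts.observe(
    "sked joule an up point ment for to mar row".split(),
    "SCHEDULE_APPOINTMENT",
)
```

After this single observation, every syllable, and every pair of syllables such as ("joule", "sked"), carries one count in favor of the scheduling tag. Repeated successes would accumulate further counts.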
Advantages Of Parallel Listening
There is a facetious saw in the speech industry: Grammar-based systems
never fail, as long as you know what the user is saying. The problem, of
course, is that one can never anticipate everything a user might say, or
even the highly variable ways he might say the same thing. Speech
recognizers do a very good job under sound studio-like conditions and when
speakers only say things that are represented in the grammars. But in the
real world, particularly in telephony applications where cell phones come
into play, recognizers are doing well if they reach 80 percent recognition
rates. Even if recognition accuracy rates were at 95 percent, one out of
every twenty utterances would fail. And this sort of behavior, over
repeated interactions, makes for very tiresome conversation.
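To see why a seemingly high accuracy still wears on users, consider the chance that a conversation contains at least one failure. The short Python sketch below assumes, purely for illustration, that each utterance fails independently of the others; the function name is hypothetical.

```python
def chance_of_any_failure(accuracy, utterances):
    """Probability that at least one utterance is misrecognized,
    assuming independent failures (an illustrative simplification)."""
    return 1.0 - accuracy ** utterances


for accuracy in (0.95, 0.80):
    p = chance_of_any_failure(accuracy, 10)
    print(f"accuracy {accuracy:.0%}: {p:.0%} chance of a failure in 10 utterances")
```

Under that assumption, a ten-utterance exchange has roughly a 40 percent chance of containing at least one failure at 95 percent accuracy, and roughly an 89 percent chance at 80 percent accuracy.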
The greatest advantage of this parallel listening approach, therefore,
is that it provides a way to compensate for recognition failures, at least
when they occur on the primary grammar. Each time there is a failure on
the primary grammar, the application can use the atoms detected by the
secondary grammar in order to formulate a guess as to what the user
intended by his utterance. The best guess is experientially indicated by
whatever natural language tags these atoms have reliably predicted in the
past.
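One way to picture the "best guess" is as a vote among the atoms heard. The Python sketch below is an assumption-laden illustration rather than the method the article describes in detail: the function names, the tag strings, the scoring rule (each atom contributes the fraction of its past occurrences that went with each tag), and the min_score threshold are all hypothetical.

```python
from collections import Counter, defaultdict


def train(examples):
    """Build atom -> Counter(tag) from (atoms, tag) pairs of past successes."""
    counts = defaultdict(Counter)
    for atoms, tag in examples:
        for atom in set(atoms):
            counts[atom][tag] += 1
    return counts


def guess_tag(counts, atoms, min_score=1.0):
    """Guess the tag best supported by the atoms, or None if evidence is weak."""
    scores = Counter()
    for atom in set(atoms):
        seen = counts.get(atom)
        if not seen:
            continue  # this atom has never accompanied a successful recognition
        total = sum(seen.values())
        for tag, n in seen.items():
            scores[tag] += n / total  # share of this atom's history with the tag
    if not scores:
        return None
    tag, score = scores.most_common(1)[0]
    return tag if score >= min_score else None


history = [
    ("sked joule an up point ment for to mar row".split(), "SCHEDULE_APPOINTMENT"),
    ("give me the weath er re port for san jo se".split(), "WEATHER_REPORT"),
]
counts = train(history)

# The primary grammar failed on this utterance, but several atoms are familiar.
guess = guess_tag(counts, "sked joule a meet ing to mar row".split())
```

Here the atoms "sked," "joule," "to," "mar," and "row" have only ever accompanied the scheduling tag, so the guess is SCHEDULE_APPOINTMENT. An utterance made up entirely of unfamiliar atoms yields None, and the application would have to reprompt.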
Initially, a parallel listening application has no atomistic knowledge.
It must acquire this knowledge by observing the user. It relies on the
user to say things that are in the primary grammar in order to return
actionable tags and perform some task. But every time there is a
recognition success in the primary grammar, the parallel listening
application compiles additional atomistic evidence that can later be used
to infer the user's intent.
Another desirable advantage of this approach is that it allows users to
change the subject more naturally, even within a directed dialog. Failures
on the primary grammar will always defer to the results of the secondary
grammar. Thus, a user can change the subject of conversation and still
expect the application to "follow along." While the
application's primary grammar is listening for some particular,
predefined input, the atoms obtained from the secondary grammar will
always point to an action that has been appropriate in the past.
Another example:
User: Schedule a meeting for September 15th.
App: At what time? (Application must obtain the time
of day to proceed with the task; directed dialog primary grammar is
listening only for expressions of time.)
User: Give me the weather report for San Jose. (User
changes subject and primary grammar fails; application determines the
tags historically predicted by atoms of the utterance obtained, and
proceeds.)
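The turn-taking logic in this exchange can be sketched as a small fallback routine. This Python example is a simplified illustration under stated assumptions: the function names, the tag strings, and the tiny LEARNED table standing in for accumulated atomistic evidence are hypothetical.

```python
# Hypothetical atom-to-tag evidence learned from earlier successful turns.
LEARNED = {
    "weath": "WEATHER_REPORT",
    "er": "WEATHER_REPORT",
    "port": "WEATHER_REPORT",
}


def guess_from_atoms(atoms):
    """Return the tag most often suggested by the known atoms, or None."""
    votes = [LEARNED[a] for a in atoms if a in LEARNED]
    return max(set(votes), key=votes.count) if votes else None


def handle_turn(primary_tag, atoms):
    """Prefer the primary grammar; on failure, defer to the secondary atoms."""
    if primary_tag is not None:
        return primary_tag, "primary"
    guessed = guess_from_atoms(atoms)
    if guessed is not None:
        return guessed, "secondary"
    return None, "reprompt"


# The time-of-day grammar fails on the weather request (primary_tag is None),
# so the application falls back on the atoms heard by the secondary grammar.
result = handle_turn(None, "give me the weath er re port for san jo se".split())
```

In this sketch the result is the pair ("WEATHER_REPORT", "secondary"), so the application can follow the change of subject instead of repeating "At what time?"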
Is "Talking To Computers" The Next Step?
So, what is the next breakthrough in speech recognition? The answer may be
surprising because this is actually a trick question.
Speech recognition engines will continue to evolve and improve. Their
developers are likely to expand their recognition abilities to more
prosodic components of verbal behavior like those mentioned above.
However, advancements in the field of speech recognition technology alone
are unlikely to promote breakthroughs in conversational dialog systems.
Such breakthroughs can only result from advancements in adaptive systems.
When it comes to "talking to computers," expectations have
been set for many by C-3PO and HAL. These are our visions of true
man-machine conversational dialog. But conversational dialog is not a
speech recognition problem -- it is a machine learning problem. And while
methods such as parallel listening are important advances in the right
direction, the next breakthrough in speech recognition cannot truly begin
until people both in and out of the speech industry come to accept that
recognizing utterances is just a single step.
Walter Rolandi, Ph.D., is
director of Applied Research for Conita
Technologies, Inc. Conita (pronounced kah-night-ah) is a provider of
voice-driven technology for mobile productivity. Conita’s Personal
Virtual Assistant (PVA) software provides just-in-time knowledge,
awareness and transactions for mobile professionals by voice from any
telephone, enabling access to business application software, enterprise
messaging and communication systems.