TMCnet - The World's Largest Communications and Technology Community

[August 1, 2001]

Atomistic Parallel Listening: The Next Step In Speech Recognition

BY WALTER ROLANDI, Ph.D.

Everyone agrees that speech recognition technologies have advanced significantly over the past ten years. Under favorable conditions, recognition rates for both grammar-based and dictation systems approach a human-like ability to recognize words and phrases. The dramatic growth of various speech recognition companies attests to the stability of the technology.

But if the technology is so advanced and sound, why isn't everyone everywhere already talking to computers? The answer depends on what is meant by "talking to computers."

There are indeed a great many people who interact with voice portals on a daily basis. These applications are typically "hidden menu" systems, or voice response units (VRUs), where users are prompted in some way or another to say one of a set of available choices. In these circumstances, the number of things a user can say is limited and the possibilities are usually reduced to a set of relatively short but phonetically distinct utterances. Recognition of one item leads the user either to content or to other menu choices that are required before content can be delivered. "Talking to computers," in this respect, is well supported by current technologies. When a grammar is limited and its elements are sufficiently distinct, commercial recognizers will understand user utterances at rates close to 100 percent.

But human conversation is not limited to proffering and selecting among menu choices. Conversational dialog is highly variable and indeterminate. Being able to recognize utterances, while certainly an important component, is merely one of many capacities that are called into play when humans converse with each other. Thus, if "talking to computers" is to convey any significant similarity to "talking among humans," existing recognition technology is obviously inadequate.

What's Missing From Today's Speech Recognition
Speech recognizers do not currently account for semantic information conveyed by intonation, volume, pitch, pace, and other prosodic characteristics of conversational speech acts. Nor do they provide any means to model user intentions.

Additionally, and paraphrasing speech industry guru Bruce Balentine, speech recognizers have no memory: they have no way to relate what was recognized last to what is being recognized now. Each and every utterance is treated as an independent and unrelated event.

Commercial vendors of applications using conversational dialog interfaces typically compensate for these shortcomings through skillful design of their dialogs. They know that speech recognizers are designed to be the "ears" of their applications, and that they are responsible for designing the "brains."

Speech recognizers will probably advance to account for factors such as intonation, volume, and pitch in the not too distant future. On the other hand, recognizers are not likely to incorporate user models or any significant sense of memory anytime soon.

Additionally, there are limitations to what can be overcome by creative dialog designers. Not even the world's most gifted dialog designer could create an interactive system capable of continuously compensating for the absence of "brains." What is still missing in speech recognition today is its ability to adapt. In order to progress towards human-like conversational interaction, speech systems will have to be able to dynamically learn from their "experiences," and adapt their behavior appropriately.

A New Approach: Atomistic Parallel Listening
Atomistic parallel listening is a promising new approach in speech recognition. Parallel listening entails using a conventional primary grammar in conjunction with a secondary grammar that looks for "atomistic" components of speech. Think of an atom as a smaller unit of speech: an individual word, a syllable, or even a phoneme. In contrast to the primary grammar, which anticipates entire utterances, the secondary grammar looks for constituent atoms within the utterances.

The basic idea is to use the secondary grammar to extract atoms from a user's utterances that are predictive of his intent. Here is how it works:

Each time the user says something that is recognized on the primary grammar, the recognizer returns a natural language tag. These tags represent the "meaning" that the user intended to convey. Behaviorally, tags lead to the consequences of the user's utterance: they cause the system to take some appropriate action as a consequence of the recognition result.

As such, each correct recognition result on the primary grammar represents an opportunity for the system to learn by example. While the primary grammar is processing the utterance, the secondary grammar is doing so as well. Learning takes place by correlating what the secondary grammar "hears" with the natural language tags returned by the primary grammar. Correlations are made for each atom that is returned by the secondary grammar, both individually and collectively.

Consider an example in which syllables are used as atoms, and the following recognition results are obtained on the primary and secondary grammars:

Primary:
Schedule an appointment for tomorrow

Secondary:
Sked joule an up point ment for to mar row

Again, each time the user says something that is recognized on the primary grammar, the system has an opportunity to learn. If the user accepts the subsequent action taken by the system (and the user does not cancel the action), the atoms detected by the secondary grammar can be assumed to be predictors of that action. The atoms returned by the secondary grammar, both individually and collectively, provide cumulative evidence of what the user intends the system to do.

Thus, whatever actions the application might take as a consequence of recognizing Schedule an appointment for tomorrow become predicted by the individual atoms Sked, joule, an, up, point, ment, for, to, mar and row. Furthermore, since atoms such as Sked and joule are strongly correlated, the presence of both of them provides additional evidence that the Schedule an appointment for tomorrow tags are indicated.
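
The learning step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in any product: the class name AtomTagModel, the tag SCHEDULE_APPOINTMENT, and the choice of plain co-occurrence counts as the measure of correlation are all hypothetical. Single atoms stand in for "individual" evidence and pairs of atoms for "collective" evidence.

```python
from collections import Counter, defaultdict
from itertools import combinations


class AtomTagModel:
    """Hypothetical store of atom/tag co-occurrence counts."""

    def __init__(self):
        self.single = defaultdict(Counter)  # atom -> tag -> count
        self.pair = defaultdict(Counter)    # (atom_a, atom_b) -> tag -> count

    def observe(self, atoms, tag, accepted=True):
        """Record one primary-grammar success, unless the user cancelled."""
        if not accepted:
            return  # a cancelled action is not treated as evidence
        unique = sorted(set(atoms))
        for atom in unique:
            self.single[atom][tag] += 1
        # Pairs of atoms heard together capture "collective" evidence.
        for a, b in combinations(unique, 2):
            self.pair[(a, b)][tag] += 1


model = AtomTagModel()
atoms = "sked joule an up point ment for to mar row".split()
model.observe(atoms, "SCHEDULE_APPOINTMENT")
# A second attempt that the user cancels leaves the counts unchanged.
model.observe(atoms, "SCHEDULE_APPOINTMENT", accepted=False)
```

After the single accepted observation, each of the ten atoms and each pair of atoms (Sked with joule, for instance) has been counted once toward the tag. Repeated successes would make those counts grow, which is what "cumulative evidence" amounts to in this sketch.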

Advantages Of Parallel Listening
There is a facetious saw in the speech industry: Grammar-based systems never fail, as long as you know what the user is saying. The problem, of course, is that one can never anticipate everything a user might say, or even the highly variable ways he might say the same thing. Speech recognizers do a very good job under sound studio-like conditions and when speakers only say things that are represented in the grammars. But in the real world, particularly in telephony applications where cell phones come into play, recognizers are doing well if they reach 80 percent recognition rates. Even if recognition accuracy rates were at 95 percent, one out of every twenty utterances would fail. And this sort of behavior, over repeated interactions, makes for very tiresome conversation.
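
To see why such rates become tiresome over repeated interactions, consider a rough back-of-the-envelope calculation. It assumes, purely for illustration, that recognition errors on successive utterances are independent, and it uses a hypothetical dialog length of ten turns; neither assumption comes from measured data.

```python
def clean_dialog_probability(accuracy, turns):
    """Chance that every one of `turns` utterances is recognized,
    assuming independent errors (an idealization)."""
    return accuracy ** turns


for accuracy in (0.95, 0.80):
    p = clean_dialog_probability(accuracy, turns=10)
    print(f"{accuracy:.0%} accuracy: {p:.1%} chance of 10 error-free turns")
```

Under these assumptions, a ten-turn exchange gets through without a single failure about 60 percent of the time at 95 percent accuracy, and only about 11 percent of the time at 80 percent accuracy.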

The greatest advantage of this parallel listening approach, therefore, is that it provides a way to compensate for recognition failures, at least when they occur on the primary grammar. Each time there is a failure on the primary grammar, the application can use the atoms detected by the secondary grammar in order to formulate a guess as to what the user intended by his utterance. The best guess is experientially indicated by whatever natural language tags these atoms have reliably predicted in the past.
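
One simple way to formulate such a guess is to let each detected atom "vote" for the tags it has predicted in the past. The sketch below uses a plain weighted vote, which is a stand-in technique chosen for brevity rather than the method of any particular system. The counts, the tag names SCHEDULE and WEATHER, and the min_score threshold are all made-up values for illustration.

```python
from collections import Counter

# Hypothetical evidence: atom -> {tag: times the atom accompanied that tag}
ATOM_TAG_COUNTS = {
    "sked": Counter({"SCHEDULE": 12}),
    "joule": Counter({"SCHEDULE": 11}),
    "for": Counter({"SCHEDULE": 12, "WEATHER": 9}),
    "weth": Counter({"WEATHER": 9}),
    "er": Counter({"WEATHER": 9, "SCHEDULE": 1}),
}


def guess_tag(atoms, counts=ATOM_TAG_COUNTS, min_score=1.0):
    """Return the tag best supported by the atoms, or None if evidence is weak."""
    scores = Counter()
    for atom in atoms:
        seen = counts.get(atom)
        if not seen:
            continue  # an atom never heard before carries no evidence
        total = sum(seen.values())
        for tag, n in seen.items():
            # Each atom contributes the fraction of its history tied to the tag.
            scores[tag] += n / total
    if not scores:
        return None
    tag, score = scores.most_common(1)[0]
    return tag if score >= min_score else None


best = guess_tag(["weth", "er", "for"])
```

Here the atoms weth and er point almost exclusively to WEATHER, while for is ambiguous, so WEATHER wins the vote. An utterance made up entirely of unfamiliar atoms yields no guess at all, which would leave the application to reprompt.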

Initially, a parallel listening application has no atomistic knowledge. It must acquire this knowledge by observing the user. It relies on the user to say things that are in the primary grammar in order to return actionable tags and perform some task. But every time there is a recognition success in the primary grammar, the parallel listening application compiles additional atomistic evidence that can later be used to infer the user's intent.

Another advantage of this approach is that it allows users to change the subject more naturally, even within a directed dialog. Failures on the primary grammar will always defer to the results of the secondary grammar. Thus, a user can change the subject of conversation and still expect the application to "follow along." While the application's primary grammar is listening for some particular, predefined input, the atoms obtained from the secondary grammar will always point to an action that has been appropriate in the past.

Another example:

User: Schedule a meeting for September 15th.

App: At what time? (Application must obtain the time of day to proceed with the task; directed dialog primary grammar is listening only for expressions of time.)

User: Give me the weather report for San Jose. (User changes subject and primary grammar fails; application determines the tags historically predicted by atoms of the utterance obtained, and proceeds.)
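
The control flow in this exchange might look something like the following sketch. Everything named here is hypothetical: handle_turn, toy_guess, the tag names, and the syllable breakdown of the weather request are invented for illustration, and toy_guess is a trivial lookup standing in for whatever learned atom evidence a real application would hold.

```python
def handle_turn(primary_tag, atoms, guess_from_atoms):
    """Decide how to act on one user turn.

    primary_tag is None when the primary grammar failed to match.
    Returns a (source, tag) pair.
    """
    if primary_tag is not None:
        return ("primary", primary_tag)
    guessed = guess_from_atoms(atoms)  # defer to the secondary grammar
    if guessed is not None:
        return ("secondary", guessed)
    return ("reprompt", None)  # nothing usable was heard


def toy_guess(atoms):
    """Hypothetical stand-in for learned atom evidence."""
    learned = {"weth": "WEATHER_REPORT", "sked": "SCHEDULE_MEETING"}
    for atom in atoms:
        if atom in learned:
            return learned[atom]
    return None


# The primary grammar was listening only for a time of day, so it fails.
weather_atoms = "give me the weth er re port for san jo se".split()
result = handle_turn(None, weather_atoms, toy_guess)
```

In this sketch the weather request falls through to the secondary grammar and is routed to the weather action, while an in-grammar answer such as a time of day would be handled by the primary grammar as usual.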

Is "Talking To Computers" The Next Step?
So, what is the next breakthrough in speech recognition? The answer may be surprising because this is actually a trick question.

Speech recognition engines will continue to evolve and improve. Their developers are likely to extend recognition to more of the prosodic components of verbal behavior, like those mentioned above. However, advancements in speech recognition technology alone are unlikely to produce breakthroughs in conversational dialog systems. Such breakthroughs can only result from advancements in adaptive systems.

When it comes to "talking to computers," expectations have been set for many by C-3PO and HAL. These are our visions of true man-machine conversational dialog. But conversational dialog is not a speech recognition problem -- it is a machine learning problem. And while methods such as parallel listening are important advances in the right direction, the next breakthrough in speech recognition cannot truly begin until people both in and out of the speech industry come to accept that recognizing utterances is just a single step.

Walter Rolandi, Ph.D. is director of Applied Research for Conita Technologies, Inc. Conita (pronounced kah-night-ah) is a provider of voice-driven technology for mobile productivity. Conita’s Personal Virtual Assistant (PVA) software provides just-in-time knowledge, awareness and transactions for mobile professionals by voice from any telephone, enabling access to business application software, enterprise messaging and communication systems.

