September 1998


The New Talking Cure, Or ... The Voice User Interface

The spoken word not only reveals much about the speaker; it can also elicit much from the hearer, even (or especially) when the hearer is a computer. But for a computer to hear properly, and to yield what it holds, whether raw information or processing capability, it may need a voice user interface, or VUI: the aural equivalent of the visually oriented graphical user interface.

If the voice user interface is to stimulate more fruitful encounters between users and computer systems, it would benefit us to consult those who’ve studied it. That’s why we spoke to Dr. William Meisel, a noted speech recognition analyst. We had the opportunity to interview Dr. Meisel following the release of his new market study, "The Telephony Voice User Interface."

CTI: Dr. Meisel, what is the most compelling trend in speech recognition that you see?

Meisel: The most basic trend is the replacement of the touch-tone pad as an interface to automated systems with speech recognition — the development of a telephony voice user interface. This is a fundamental change, comparable to the replacement of a keyboard interface on the PC with a graphical user interface.

CTI: If speech rec is becoming more accessible to users, is it also becoming more accessible to developers?

Meisel: Yes. As a matter of fact, that’s another key trend — the improvement of speech recognition application development tools. They are constantly getting easier to use and requiring less speech-specific expertise on the part of the developer.

CTI: Could you describe how speech rec technology is eventually translated into products?

Meisel: There are two aspects to a speech recognition product — the development of the application and running it once it is developed. Let’s start at the end and talk about how a speech recognition system runs once the application has been developed.

A telephone call is handled as in any automated telephone application, using a telephone interface card and software that sequences the call and plays voice responses. The digitized speech from the call is passed on to speech recognition software, which processes the speech and determines its content and the action associated with that content. The speech recognition software determines the flow of the conversation and what can be understood, the "voice user interface."

There are several possible architectures for running the speech recognition software:

First, the speech recognition software can run as "software only" on the host computer that holds the telephone interface card. This is the least expensive option, but is usually limited to medium-vocabulary applications and 2-8 telephone lines.

Second, speech recognition can handle more lines if additional processing is provided by a "resource card," a board with multiple DSPs or RISC processors on it. Typically, the resource card sits in the same PC or workstation as the telephone interface card. The speech recognition can run entirely on the resource card, or partially on the resource card and partially on the host.

Third, an architecture can use multiple workstations or PCs on a LAN. The host computer handles the telephone interface, while one or more servers on the LAN perform speech recognition simultaneously for several callers. In this architecture, there may still be a resource card in the host or in the servers to do part of the processing.
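
To make the trade-off concrete, here is a minimal sketch in Python of how a deployment might choose among these three architectures. The thresholds and field names are illustrative assumptions, not figures from Dr. Meisel's study:

    from dataclasses import dataclass

    @dataclass
    class Deployment:
        lines: int             # concurrent telephone lines to support
        vocabulary: int        # approximate active vocabulary size
        has_resource_card: bool
        has_lan_servers: bool

    def choose_architecture(d: Deployment) -> str:
        # Host-only "software only" recognition: least expensive, but per
        # the interview it tops out around medium vocabularies and 2-8
        # lines (the 1,000-word cutoff here is an invented placeholder).
        if d.lines <= 8 and d.vocabulary <= 1000 and not d.has_resource_card:
            return "host-only"
        # A DSP/RISC resource card in the same chassis handles more lines.
        if d.has_resource_card and not d.has_lan_servers:
            return "resource-card"
        # Beyond that, spread recognition across servers on a LAN.
        return "lan-servers"

    print(choose_architecture(Deployment(4, 200, False, False)))    # host-only
    print(choose_architecture(Deployment(24, 5000, True, False)))   # resource-card
    print(choose_architecture(Deployment(96, 20000, True, True)))   # lan-servers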

CTI: And, presumably, developers need to be aware of these architectural issues. Do these issues relate to the complexity of development?

Meisel: Well, before the speech recognition application can be run on any architecture, it must be developed. But the complexity of the development process depends largely on the type of application. Some applications, such as voice-activated automated attendants (which direct a call by a person’s or department’s name), can be purchased as a turnkey product. Tailoring to a particular company’s personnel list is part of the installation process.

Some applications are sold as services that don't require any hardware on the buyer's premises at all. These include some telecommunications assistants: products that help handle telephone communications by taking messages and, in some cases, providing information.

Most applications, however, require some custom development. Some simple applications — for example, those that only recognize digits, yes, no, and a few other words — are supported by conventional telephone application generators, and can be created using familiar toolkits. There are several development options available, some of which automate much of the process of application development. (See the sidebar, "Types Of Tools For Speech Recognition Application Development.")

CTI: The availability of tools suggests convenience for the developer, even ease of development. However, are there any special speech rec issues that developers should know about before embracing these tools?

Meisel: A caution is certainly in order. Developing a speech interface is different from developing a touch-tone interface. For example, the voice prompts strongly affect the likelihood that the caller will say what is expected (and, hence, the likelihood that what is said will be recognized).

The voice user interface of an application is critical to its success. This issue can be partly addressed by tools which allow testing a speech recognition application without any speech recognition hardware. With these tools, a person simulates the recognition system, triggering pre-recorded prompts so that the caller thinks they are talking to a computer. This "Wizard of Oz" approach helps test the feasibility of an application, while refining the user interaction early in the process.
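
A Wizard of Oz test can be as simple as a console through which a hidden operator triggers pre-recorded prompts. The sketch below, in Python, is purely illustrative; the prompt text and key bindings are invented for this example:

    # In a real test the "wizard" would hear the live caller and trigger
    # telephony prompt playback; print() simulates playback here.
    PROMPTS = {
        "1": "Welcome. Please say the name of the person you are calling.",
        "2": "I'm sorry, I didn't catch that. Please say the name again.",
        "3": "Connecting you now. One moment, please.",
    }

    def run_wizard_session():
        print("Wizard keys:", ", ".join(f"{k}: {v[:30]}..." for k, v in PROMPTS.items()))
        while True:
            key = input("wizard> ").strip()
            if key == "q":
                break
            if key in PROMPTS:
                # Stands in for playing the pre-recorded prompt to the caller.
                print("[to caller]", PROMPTS[key])
            else:
                print("[no such prompt]")

    if __name__ == "__main__":
        run_wizard_session()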

CTI: What would happen if the developer didn’t refine user interaction?

Meisel: The developer could find that merely using speech recognition doesn’t assure a "natural" interface. A poorly designed voice interface doesn’t seem like speaking with a person at all. One example of this is a common fault of early speech recognition systems — they would just say "please repeat" when the score of an utterance was too low, implying that all would be well if the person would just speak more clearly. In many cases, the problem is not how the caller is speaking, but what the caller is saying. Repeating a phrase that is not in the system’s vocabulary will just result in another "please repeat." The system should indicate (at least the second time) what kind of response it is expecting.
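
The escalating strategy Dr. Meisel describes might look like the following sketch, where the confidence threshold and prompt wording are illustrative assumptions:

    from typing import Optional

    REJECT_THRESHOLD = 0.45  # invented confidence cutoff

    def reprompt(score: float, attempt: int, expected: str) -> Optional[str]:
        if score >= REJECT_THRESHOLD:
            return None  # utterance accepted; no reprompt needed
        if attempt == 1:
            return "I'm sorry, please repeat that."
        # From the second failure on, tell the caller what kind of
        # response the system is expecting, not just "please repeat."
        return "I'm sorry. Please say " + expected + "."

    print(reprompt(0.20, 1, "the name of a person or department"))
    print(reprompt(0.20, 2, "the name of a person or department"))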

CTI: What other problems should developers anticipate? Could you describe the principles of voice user interface design?

Meisel: It is possible to write a book on voice user interface design, so I will just mention a few practices that I think are important.

First, testing is critical. No matter how smart the developer is, callers will behave in unexpected ways. For example, in one test of a voice-activated auto attendant, callers had a natural (but unanticipated) response to an error. When the user asked for "Fred Steele," and was prompted, "connecting to Ted Steele," the natural response was "No, Fred Steele." If the system responded, "Please repeat," or, "Name, please," the same error would be likely to occur again. There is enough information in the interchange for the system to act on that phrase by recognizing the "no" and the name following it, eliminating the mis-recognized name from consideration. In this case, the developers decided that many callers would react this way, and consequently designed the system to react in the way the caller expected.
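
As a rough illustration of that repair, the sketch below strips the leading "no," removes the misrecognized name from consideration, and rescores. The directory and the use of difflib as a stand-in for acoustic scoring are assumptions made for this example:

    import difflib

    DIRECTORY = ["Ted Steele", "Fred Steele", "Ed Steele"]

    def best_match(utterance, rejected):
        # difflib stands in for acoustic scoring against the name grammar.
        candidates = [n for n in DIRECTORY if n not in rejected]
        return max(candidates, key=lambda n: difflib.SequenceMatcher(
            None, utterance.lower(), n.lower()).ratio())

    rejected = set()
    first = "Ted Steele"  # suppose the recognizer misheard "Fred Steele"
    print("connecting to", first)

    reply = "no, fred steele"  # the caller's natural correction
    if reply.startswith("no"):
        rejected.add(first)  # eliminate the misrecognized name
        corrected = best_match(reply.split(",", 1)[1].strip(), rejected)
        print("connecting to", corrected)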

It is best to uncover interface issues before going to the field. The Wizard of Oz approach I mentioned earlier is a good way to start. But the Wizard of Oz approach doesn’t indicate what recognition errors will tend to occur, so testing with the automated system is also necessary. Developers should plan on more than one cycle of testing and revision of the interface.

Second, consistency is important. By making similar actions work consistently throughout an application, one can avoid confusing the user. When the recognition system is unsure of what is said, for example, it should use the same process for clarification throughout the application.

Third, the user should always have a way of finding out what they can say or returning to a familiar place. They should be able to say "Start over," "What can I say?," or "Help" at any time.

Fourth, consider the different requirements of a novice and expert user. A novice wants longer, explicit prompts; a repeat user wants the shortest prompts possible. This conflict can be addressed in different ways. For example, the initial instructions could include "say ‘help’ for more instructions at any time," and then the system could use short prompts with context-sensitive help available. Or, if the user evidences confusion through the need for repetition or clarification, prompts could automatically get longer.
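
One possible shape for that adaptive behavior is sketched below; the prompts and the one-error threshold are illustrative:

    class PromptManager:
        def __init__(self, short: str, long: str):
            self.short, self.long = short, long
            self.confusion = 0  # repetitions, clarifications, "help" requests

        def note_confusion(self):
            self.confusion += 1

        def prompt(self) -> str:
            # Fall back to the explicit novice prompt once the caller has
            # shown confusion; otherwise keep it brief for repeat users.
            return self.long if self.confusion >= 1 else self.short

    pm = PromptManager("Name, please.",
                       "Please say the full name of the person you are calling, "
                       "or say 'help' at any time for instructions.")
    print(pm.prompt())    # short prompt first
    pm.note_confusion()   # caller said "help" or had to repeat
    print(pm.prompt())    # longer, explicit prompt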

CTI: We’ve heard some people debate the advantages and disadvantages of small vocabularies. Do you have your own take on this subject?

Meisel: For some environments, such as wireless systems, the speech quality can limit recognition accuracy. In such systems, there seem to be two points of view on how best to proceed. One camp suggests that the caller should be given limited choices at any one time.

For example, one can require the customer to say, "Call," wait for a prompt, "Call who?," and then say, "John Jones." The advantages of this structured approach are two-fold. First, the customer knows that they must first say a command and wait, so they get less confused about acceptable ways to say things. Second, the speech recognition system has to deal with much less variability at each step of the conversation, so it can be more accurate.
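
The accuracy gain comes from the small active vocabulary at each step. A toy sketch of such per-state grammars, with invented states and names, might look like this:

    from typing import Optional

    GRAMMARS = {
        "command": {"call", "messages", "directory"},
        "callee":  {"john jones", "mary smith", "fred steele"},
    }

    def recognize(utterance: str, state: str) -> Optional[str]:
        # Stand-in for a recognizer constrained to the state's grammar:
        # anything outside the small active vocabulary is rejected.
        u = utterance.lower().strip()
        return u if u in GRAMMARS[state] else None

    assert recognize("Call", "command") == "call"
    assert recognize("Call John Jones", "command") is None  # out of grammar here
    assert recognize("John Jones", "callee") == "john jones"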

CTI: If one approach is highly structured, then is the other less so? Would a less structured approach be more natural, more conversational?

Meisel: There’s a trade-off. The higher accuracy achieved by the structured approach compensates for its less fluent conversation. A conversation will not be fluent if the recognizer constantly makes mistakes or interrupts for clarification.

All the same, a structured interaction can negate the advantage of speech. A caller must remember very specific ways to say things, and the structure can make it take a painfully long time to achieve an objective. In the example above, a caller should be able to simply say, "Call John Jones." The structure can also make it easy for the caller to get confused about what can be said at a given time.

Both sides are correct. In practice, some limitations on conversational style are necessary, and some structuring is inevitable. Users want to have some hint as to what they can say; they don’t react well to completely open interfaces, such as "How can I help you?"

The conversational method falls apart if the error rate is too high, and the structured method falls apart if it is overly difficult to use. There is a continuum, not a chasm, between the two approaches. As speech recognition accuracy increases, less structure is necessary.

CTI: Is accuracy improving? How?

Meisel: One of the limits on accuracy has been the telephone channel itself. It reduces bandwidth and adds noise, and performing speech recognition on the resulting degraded speech limits the achievable accuracy.

One way to remedy this deficiency is with distributed speech recognition, which involves pre-processing the speech in the telephone or other local device (in effect, compressing it in a form suitable for speech recognition), and then sending it digitally to the remote server for recognition. This process can retain the quality of the speech at its source. The accuracy is then no longer limited by the channel.
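
A rough sketch of the client side of such a scheme appears below. The per-frame energy and zero-crossing "features" are a toy stand-in for a real recognition front end, and the frame size is simply the common 20 ms at 8 kHz:

    from array import array

    FRAME = 160  # 20 ms at 8 kHz

    def extract_features(samples):
        # Computed in the handset, on full-quality audio, before any
        # lossy telephone channel can degrade the signal.
        feats = []
        for i in range(0, len(samples) - FRAME + 1, FRAME):
            frame = samples[i:i + FRAME]
            energy = sum(s * s for s in frame) / FRAME
            zero_crossings = sum(1 for a, b in zip(frame, frame[1:])
                                 if (a < 0) != (b < 0))
            feats.append((energy, zero_crossings))
        return feats

    # The client sends `feats` digitally to the recognition server rather
    # than the audio itself; absent a standard, both ends must agree on
    # this format.
    audio = array("h", [0, 120, -80, 300, -200] * 64)  # 320 fake samples
    print(extract_features(audio))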

Unfortunately, there are currently no standards for compressing the speech for speech recognition purposes. Thus, a device that uses distributed speech recognition could only call a compatible system. But we may eventually see standards (or de facto standards) evolve.

CTI: Speaking of de facto standards, we’ve heard that Microsoft will make speech rec a part of the operating system. When do you suppose that might happen?

Meisel: Microsoft seems committed to making speech recognition a part of the operating system, but they are not rushing to do it. They want the technology for interactive dialog (conversation with the computer) to improve, so that the interaction is like working with an assistant. This requires improvements in natural-language interpretation that go beyond speech recognition alone.

My best guess as to when they will make speech a part of the operating system is in three to five years. They might move faster if they perceive that there is danger someone else will set the standard for a voice user interface on a desktop or palm-sized PC.

CTI: What are the prospects for SAPI support?

Meisel: SAPI (Microsoft's Speech API) is most relevant for PC-type applications now, rather than telephony, but I understand that Microsoft plans some extensions that better support telephony. Some vendors do support a SAPI interface, but today proprietary interfaces seem to be necessary to use all the functions available in telephone speech recognition engines.

CTI: Do you anticipate that speech rec capabilities will appear not only in new products, but will also be added to existing products? Will it be difficult to achieve "retro-fits"?

Meisel: Retro-fits are not difficult to implement if the hardware architecture of an existing system is at all flexible. For example, voice-activated auto attendants are usually added to an existing PBX by simply simulating an extension of the PBX. If a system uses "computer telephony" components, speech recognition can also usually be added through additional modules. Retro-fits can result in a poor voice user interface, however, if the speech recognition simply emulates an existing touch-tone interface.

CTI: What, in speech rec applications, are the advantages/disadvantages of host-based processing vs. DSP-based processing? What sorts of applications would require an embedded approach?

Meisel: For a small number of telephone lines, host-based systems that don’t use DSP resource cards are cost-effective. They use capacity already available in the host computer and result in a much lower cost to add speech recognition. Once one gets beyond the number of lines that can be handled by host processing, however, one must decide whether to get the additional processing power required through resource cards or through a LAN-server architecture. Most resource cards today are DSP-based, although some now use multiple RISC processors.

DSPs are particularly suited for "front-end" tasks that occur before speech recognition processing, e.g., spectral processing or echo-canceling to allow interrupting a voice response. These front-end tasks don’t require much memory (which can be expensive on DSP boards).

On the other hand, the "back-end" processing, comparing the speech to stored acoustic word models, is memory-intensive and has characteristics that are not efficiently handled by a DSP. As the vocabulary size grows, the front-end tasks become a small part of the overall processing, and DSPs lose their advantage. Thus, DSPs are efficient for many lines of digit recognition, but they become an expensive, inflexible solution as the vocabulary grows.

Many vendors have adopted a hybrid solution. They use DSPs for front-end processing and the host computer or a server for back-end processing. This architecture can be an efficient way to handle many lines of larger-vocabulary recognition.
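
A toy sketch of that division of labor: a small, fixed-cost front end (the DSP's role) feeding a memory-hungry back end (the host's or server's role). The word models and distance measure are invented for illustration:

    import math

    WORD_MODELS = {  # back-end data: grows with vocabulary, lives in host RAM
        "yes": [(0.9, 0.1), (0.8, 0.2)],
        "no":  [(0.2, 0.9), (0.1, 0.8)],
    }

    def front_end(frames):
        # DSP territory: regular arithmetic over fixed-size frames,
        # needing little memory.
        return [(sum(f) / len(f), max(f) - min(f)) for f in frames]

    def back_end(features):
        # Host territory: compare against every stored model; memory and
        # search cost scale with vocabulary size.
        def dist(model):
            return sum(math.dist(a, b) for a, b in zip(features, model))
        return min(WORD_MODELS, key=lambda w: dist(WORD_MODELS[w]))

    print(back_end(front_end([[0.95, 0.85], [0.75, 0.85]])))  # -> yes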

CTI: Will increasingly powerful host processors obviate the use of DSPs for all but high-end applications?

Meisel: Eventually, as host processors get faster and PC memory cheaper, the economics of a separate DSP board may become questionable. However, DSPs tend to use considerably less power per MIPS than other types of processors, so it is easier to pack a large number in a small place. And DSPs continue to get faster, as well. They will probably be around for a long time in many-line applications.

CTI: Before wrapping up, we’d like to ask about speech rec’s business prospects.

Meisel: There is growing interest within the investment community. Several public companies featuring speech recognition have done well on the stock market, and venture capitalists are viewing speech recognition as an area that may produce some big winners. This is important for users in that it means that the technology, tools, and applications will receive the investment they need for continued rapid improvement.

CTI: Could you give us an idea of the current size (and the future size) of the speech rec market?

Meisel: In 1997, revenues for telephone equipment and services using speech recognition were about $400 million. I estimated in a recent market study that the market will almost double each year.

CTI: Who are the leading technology providers? Who are the most notable companies licensing the technology?

Meisel: Companies licensing basic technology tend to have different strengths and features. Some have established strong positions in handling cellular calls for voice dialing by name or number, for example. Others have delivered large-vocabulary applications, such as stock quotes, effectively. Leading telephone speech recognition vendors include Applied Language Technology, BBN Technologies (now part of GTE Internetworking), Dragon Systems, Lernout & Hauspie, Lucent Technologies, Northern Telecom, Nuance Communications, Philips Speech Processing, and Voice Control Systems. IBM and Voice Control Systems have teamed to deliver IBM telephone speech recognition to market.

Dr. Meisel’s latest publication, "The Telephony Voice User Interface," analyzes 140 companies active in the speech recognition industry, and examines the many different technologies these companies deploy. In addition to providing contact and product information for each company, the study describes each company’s background, distribution, and typical customers. Competition is analyzed, and market size is projected for each of the many product categories. For more information, contact the publisher of the study, TMA Associates. Call TMA at 818-708-0962 or visit the company’s Web site at www.tmaa.com.


Types Of Tools For Speech Recognition Application Development

Application Suites
Some companies offer complete applications that can be combined, some of which use speech recognition. For example, a voice-dialing application may be available as part of a suite that also includes voice mail or single-number modules. Application development in this case is minimal, although some integration may be required.

Telephone Application Generators
High-level design tools are available for certain types of applications, particularly IVR and call center applications. They allow developers to specify the flow of complete applications, including the call flow. Some application generators have support for speech recognition. The support varies in its generality and flexibility.

Dialog Manager
A dialog manager operates a level above toolkits that only let the developer create unconnected grammars: it manages sets of grammars. The developer can specify which grammars and prompts to use within an interactive system, and when to switch them, based on an event or on the user's response. Dialog managers are usually part of a speech recognition vendor's toolkit, but they are being added to some application generators.
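
A minimal sketch of the idea, with invented states, prompts, and grammars:

    STATES = {
        "main":  {"prompt": "Say 'call' or 'messages'.",
                  "grammar": {"call": "who", "messages": "inbox"}},
        "who":   {"prompt": "Call who?",
                  "grammar": {"john jones": "done", "mary smith": "done"}},
        "inbox": {"prompt": "You have no new messages.", "grammar": {}},
        "done":  {"prompt": "Connecting.", "grammar": {}},
    }

    def step(state: str, heard: str) -> str:
        grammar = STATES[state]["grammar"]
        # Unrecognized input keeps the caller in place for clarification,
        # handled the same way everywhere in the application.
        return grammar.get(heard.lower(), state)

    s = "main"
    for utterance in ["call", "john jones"]:
        print(STATES[s]["prompt"])
        s = step(s, utterance)
    print(STATES[s]["prompt"])  # "Connecting."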

Dialog Subroutines
Modules that automate certain common interactive tasks at the user interface level, such as asking a yes-no question and dealing with the response (including the possibility of a recognition error).
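
For example, a yes-no subroutine might be sketched as follows, with keyboard input standing in for the recognizer and the synonym sets invented for illustration:

    from typing import Optional

    YES = {"yes", "yeah", "correct", "right"}
    NO = {"no", "nope", "wrong"}

    def ask_yes_no(question: str, max_tries: int = 3) -> Optional[bool]:
        prompt = question
        for _ in range(max_tries):
            print("[system]", prompt)
            heard = input("caller> ").lower().strip()
            if heard in YES:
                return True
            if heard in NO:
                return False
            # Recognition-error path: be explicit about what is expected.
            prompt = question + " Please answer yes or no."
        return None  # give up and let the application escalate

    if __name__ == "__main__":
        print(ask_yes_no("Did you say Fred Steele?"))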

Speech Recognition Software Development Kits
A set of software tools that allows developers to program applications using a vendor's speech engine. Some kits support ActiveX controls or other high-level interfaces to standard programming tools.

Speech Application Programming Interfaces (APIs)
Speech recognition "engines" constitute the core software that accomplishes the basic recognition. These are usually modular, so that the application code can interact with the engine through an API. The API exposes the engine's detailed capabilities; it is the most flexible, but also the most detailed, level of programming.

 






