The New Talking Cure, Or ... The Voice User Interface
The spoken word may not only reveal much about the speaker;
it may also elicit much from the hearer, even (or especially) if the hearer is a computer.
However, for a computer to hear properly, and yield that which abides within it, be it raw
information or processing capability, the computer may need a voice user interface, or
VUI, the aural equivalent of the visually-oriented graphical user interface.
If the voice user interface is to stimulate more fruitful encounters between users and
computer systems, it would benefit us to consult those who've studied it. That's
why we spoke to Dr. William Meisel, a noted speech recognition analyst. We had the
opportunity to interview Dr. Meisel following the release of his new market study,
"The Telephony Voice User Interface."
CTI: Dr. Meisel, what is the most compelling trend in
speech recognition that you see?
Meisel: The most basic trend is the replacement of the touch-tone pad
as an interface to automated systems with speech recognition: the development of a
telephony voice user interface. This is a fundamental change, comparable to the
replacement of a keyboard interface on the PC with a graphical user interface.
CTI: If speech rec is becoming more accessible to users, is it also
becoming more accessible to developers?
Meisel: Yes. As a matter of fact, that's another key trend:
the improvement of speech recognition application development tools. They are constantly
getting easier to use and requiring less speech-specific expertise on the part of the
developer.
CTI: Could you describe how speech rec technology is eventually translated
into products?
Meisel: There are two aspects to a speech recognition product:
the development of the application and running it once it is developed. Let's start
at the end and talk about how a speech recognition system runs once the application has
been developed.
A telephone call is handled as in any automated telephone application, using a
telephone interface card and software that sequences the call and plays voice responses.
The digitized speech from the call is passed on to speech recognition software, which
processes the speech and determines its content and the action associated with that
content. The speech recognition software determines the flow of the conversation and what
can be understood: the "voice user interface."
There are several possible architectures for running the speech recognition software:
First, the speech recognition software can run as "software only" on the host
computer that holds the telephone interface card. This is the least expensive option, but
is usually limited to medium-vocabulary applications and 2-8 telephone lines.
Second, the speech recognition can handle more lines if there is more processing
provided by a "resource card," a board that has multiple DSPs or RISC processors
on it. Typically, the resource card is in the same PC or workstation as the telephone
interface card. The speech recognition can run entirely on the resource card or partially
on the resource card and the host.
Third, another architecture uses multiple workstations or PCs on a LAN. The host
computer handles the telephone interface. One server or several servers on the LAN do the
speech recognition simultaneously for several callers. In this architecture, there may
still be a resource card in the host or in the server to do part of the processing.
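The LAN-server architecture amounts to routing each incoming call's audio to whichever recognition server has spare capacity. The sketch below is a hypothetical illustration of that routing, not any vendor's API; the class names, capacities, and fallback behavior are all assumptions.

```python
# Hypothetical sketch of the LAN-server architecture described above:
# the host handles the telephone interface, while recognition servers
# on the LAN each serve several simultaneous callers.

class RecognitionServer:
    def __init__(self, name, capacity):
        self.name = name          # e.g., a hostname on the LAN
        self.capacity = capacity  # simultaneous callers it can handle
        self.active = 0

    def has_room(self):
        return self.active < self.capacity

def assign_call(servers, call_id):
    """Route a new call to the least-loaded server with spare capacity."""
    candidates = [s for s in servers if s.has_room()]
    if not candidates:
        return None  # all recognizers busy; e.g., fall back to touch-tone
    server = min(candidates, key=lambda s: s.active)
    server.active += 1
    return server.name

servers = [RecognitionServer("rec-1", 2), RecognitionServer("rec-2", 2)]
print(assign_call(servers, "call-A"))  # rec-1
print(assign_call(servers, "call-B"))  # rec-2 (now the least loaded)
```

A real deployment would also release capacity on hangup and move part of the processing to a resource card, as described above.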
CTI: And, presumably, developers need to be aware of these
architectural issues. Do these issues relate to the complexity of development?
Meisel: Well, before the speech recognition application can be run on
any architecture, it must be developed. But the complexity of the development process
depends largely on the type of application. Some applications, such as voice-activated
automated attendants (which direct a call by a person's or department's name),
can be purchased as a turnkey product. Tailoring to a particular company's personnel
list is part of the installation process.
Some applications are sold as services that don't require hardware in the
buyer's premises at all. These include some telecommunications assistants, products
that help handle telephone communications, taking messages and in some cases providing
information.
Most applications, however, require some custom development. Some simple applications
(for example, those that only recognize digits, "yes," "no," and a few other words)
are supported by conventional telephone application generators, and can be created
using familiar toolkits. There are several development options available, some of which
automate much of the process of application development. (See the sidebar entitled "Types Of Tools For Speech Recognition Application Development.")
CTI: The availability of tools suggests convenience for the developer, even
ease of development. However, are there any special speech rec issues that developers
should know about before embracing these tools?
Meisel: A caution is certainly in order. That is, developing a speech
interface is different from developing a touch-tone interface. For example, the voice
prompts strongly affect the likelihood that the caller will say what is expected (and,
hence, the likelihood that what is said is recognized).
The voice user interface of an application is critical to its success. This issue can
be partly addressed by tools which allow testing a speech recognition application without
any speech recognition hardware. With these tools, a person simulates the recognition
system, triggering pre-recorded prompts so that the caller thinks they are talking to a
computer. This "Wizard of Oz" approach helps test the feasibility of an
application, while refining the user interaction early in the process.
CTI: What would happen if the developer didn't refine
user interaction?
Meisel: The developer could find that merely using speech recognition
doesn't assure a "natural" interface. A poorly designed voice interface
doesn't seem like speaking with a person at all. One example of this is a common
fault of early speech recognition systems: they would just say "please
repeat" when the score of an utterance was too low, implying that all would be well
if the person would just speak more clearly. In many cases, the problem is not how the
caller is speaking, but what the caller is saying. Repeating a phrase that is not in the
system's vocabulary will just result in another "please repeat." The system
should indicate (at least the second time) what kind of response it is expecting.
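That policy, re-prompting generically once but naming the expected response on later failures, can be sketched as a small retry rule. The confidence threshold and prompt wording below are illustrative assumptions, not values from any real system.

```python
# Sketch of an escalating re-prompt policy: on the first low-confidence
# result, ask again briefly; from the second failure on, tell the caller
# what kind of response the system expects. Threshold and wording are
# illustrative assumptions.

CONFIDENCE_THRESHOLD = 0.6

def next_prompt(score, attempt, expected="the name of a person or department"):
    """Return None when the utterance is accepted, else the re-prompt text."""
    if score >= CONFIDENCE_THRESHOLD:
        return None  # accept the recognition result
    if attempt == 1:
        return "Sorry, please repeat."
    # Second and later failures: say what kind of response is expected.
    return f"Sorry. Please say {expected}."

print(next_prompt(0.9, 1))  # None: accepted
print(next_prompt(0.3, 1))  # generic re-prompt
print(next_prompt(0.3, 2))  # names the expected response
```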
CTI: What other problems should developers anticipate?
Could you describe the principles of voice user interface design?
Meisel: It is possible to write a book on voice user interface design,
so I will just mention a few practices that I think are important.
First, testing is critical. No matter how smart the developer is, callers will behave
in unexpected ways. For example, in one test of a voice-activated auto attendant, callers
had a natural (but unanticipated) response to an error. When the user asked for "Fred
Steele," and was prompted, "connecting to Ted Steele," the natural response
was "No, Fred Steele." If the system responded, "Please repeat," or,
"Name, please," the same error would be likely to occur again. There is enough
information in the interchange for the system to act on that phrase by recognizing the
"no" and the name following it, eliminating the mis-recognized name from
consideration. In this case, the developers decided that many callers would react this
way, and consequently designed the system to react in the way the caller expected.
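One way to act on a correction like "No, Fred Steele" is to recognize the leading "no," strip it, and re-match the rest of the utterance with the previously misrecognized name removed from consideration. In the sketch below, the directory and the fuzzy string matching are toy stand-ins for a real recognizer and its vocabulary.

```python
# Toy sketch of handling a "No, <name>" correction: drop the rejected
# hypothesis and re-match the remainder of the utterance against the
# directory. difflib stands in for real acoustic matching.

import difflib

DIRECTORY = ["Ted Steele", "Fred Steele", "Jane Doe"]

def best_match(utterance, exclude=()):
    candidates = [n for n in DIRECTORY if n not in exclude]
    matches = difflib.get_close_matches(utterance, candidates, n=1, cutoff=0.0)
    return matches[0] if matches else None

def handle_reply(reply, last_guess):
    """If the caller says 'No, ...', retry without the rejected name."""
    text = reply.strip()
    if text.lower().startswith("no"):
        corrected = text.split(",", 1)[-1].strip()
        return best_match(corrected, exclude={last_guess})
    return last_guess  # no correction; keep the current guess

last_guess = "Ted Steele"  # simulate the misrecognition from the example
print(handle_reply("No, Fred Steele", last_guess))  # Fred Steele
```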
It is best to uncover interface issues before going to the field. The Wizard of Oz
approach I mentioned earlier is a good way to start. But the Wizard of Oz approach
doesnt indicate what recognition errors will tend to occur, so testing with the
automated system is also necessary. Developers should plan on more than one cycle of
testing and revision of the interface.
Second, consistency is important. By making similar actions work consistently
throughout an application, one can avoid confusing the user. When the recognition system
is unsure of what is said, for example, it should use the same process for clarification
throughout the application.
Third, the user should always have a way of finding out what they can say or returning
to a familiar place. They should be able to say "Start over," "What can I
say?," or "Help" at any time.
Fourth, consider the different requirements of a novice and expert user. A novice wants
longer, explicit prompts; a repeat user wants the shortest prompts possible. This conflict
can be addressed in different ways. For example, the initial instructions could include
"say help for more instructions at any time," and then the system
could use short prompts with context-sensitive help available. Or, if the user evidences
confusion through the need for repetition or clarification, prompts could automatically
get longer.
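The second of those approaches, prompts that lengthen automatically when the caller shows confusion, can be sketched in a few lines. The prompt texts and the threshold of two confusion events are illustrative assumptions.

```python
# Sketch of adaptive prompting: start with short, expert-style prompts
# and switch to longer, explicit ones after repeated clarifications.
# The texts and the two-event threshold are illustrative assumptions.

class PromptManager:
    CONFUSION_LIMIT = 2

    def __init__(self, short, long):
        self.short = short
        self.long = long
        self.confusion_events = 0

    def note_confusion(self):
        """Call when the caller asks for help or needs a repeat."""
        self.confusion_events += 1

    def prompt(self):
        if self.confusion_events >= self.CONFUSION_LIMIT:
            return self.long  # novice-style, explicit prompt
        return self.short     # expert-style, terse prompt

pm = PromptManager("Name?",
                   "Please say the first and last name of the person "
                   "you are calling. Say 'help' at any time.")
print(pm.prompt())   # short prompt
pm.note_confusion()
pm.note_confusion()
print(pm.prompt())   # long prompt after repeated confusion
```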
CTI: Weve heard some people debate the advantages
and disadvantages of small vocabularies. Do you have your own take on this subject?
Meisel: For some environments, such as wireless systems, the speech
quality can limit recognition accuracy. In such systems, there seem to be two points of
view on how best to proceed. One camp suggests that the caller should be given limited
choices at any one time.
For example, one can require the customer to say, "Call," wait for a prompt,
"Call who?," and then say, "John Jones." The advantages of this
structured approach are two-fold. First, the customer knows that they must first say a
command and wait, so they get less confused about acceptable ways to say things. Second,
the speech recognition system has to deal with much less variability at each step of the
conversation, so it can be more accurate.
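That structured exchange is essentially a small state machine with a narrow vocabulary active in each state, which is exactly why the recognizer can be more accurate at each step. The states, vocabularies, and action strings below are made up to match the example.

```python
# Minimal state machine for the structured "Call" dialog above: only a
# small vocabulary is active in each state. All names are illustrative.

STATES = {
    "top": {"vocabulary": ["call", "messages"], "prompt": "Main menu."},
    "call_who": {"vocabulary": ["john jones", "jane doe"], "prompt": "Call who?"},
}

def step(state, utterance):
    """Advance the dialog one turn; return (next_state, action)."""
    word = utterance.lower()
    if word not in STATES[state]["vocabulary"]:
        return state, "reprompt"     # out-of-vocabulary: stay put, ask again
    if state == "top" and word == "call":
        return "call_who", "prompt"  # ask "Call who?"
    if state == "call_who":
        return "top", f"dial:{word}" # place the call
    return "top", "prompt"

state, action = step("top", "Call")
print(action)                        # prompt
state, action = step(state, "John Jones")
print(action)                        # dial:john jones
```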
CTI: If one approach is highly structured, then is the other less so? Would
a less structured approach be more natural, more conversational?
Meisel: There's a trade-off. The higher accuracy achieved by the
structured approach compensates for its less fluent conversation. A conversation will not
be fluent if the recognizer constantly makes mistakes or interrupts for clarification.
All the same, a structured interaction sacrifices some of the advantage of speech. A caller must
remember very specific ways to say things, and the structure can make it take a painfully
long time to achieve an objective. In the example above, a caller should be able to simply
say, "Call John Jones." The structure can also make it easy for the caller to
get confused as to what can be said at a given time.
Both sides are correct. In practice, some limitations on conversational style are
necessary, and some structuring is inevitable. Users want to have some hint as to what
they can say; they don't react well to completely open interfaces, such as "How
can I help you?"
The conversational method falls apart if the error rate is too high, and the structured
method falls apart if it is overly difficult to use. There is a continuum, not a chasm,
between the two approaches. As speech recognition accuracy increases, less structure is
necessary.
CTI: Is accuracy improving? How?
Meisel: One of the limits on accuracy has been the telephone channel.
It reduces bandwidth and adds noise, and performing speech recognition on the
resulting degraded speech limits the possible accuracy.
One way to remedy this deficiency is with distributed speech recognition, which
involves pre-processing the speech in the telephone or other local device (in effect,
compressing it in a form suitable for speech recognition), and then sending it digitally
to the remote server for recognition. This process can retain the quality of the speech at
its source. The accuracy is then no longer limited by the channel.
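The client side of that split can be sketched simply: the local device frames the clean captured samples and reduces each frame to a compact feature, and only those features cross the network. The log-energy feature below is a simplified stand-in for the real front-end features (such as cepstra) that deployed systems compute.

```python
# Simplified sketch of the client side of distributed speech recognition:
# frame the locally captured samples and reduce each frame to a compact
# feature (log energy here; real front ends send cepstral features).
# Only these features are sent to the server, not the degraded audio.

import math

FRAME_SIZE = 160  # e.g., 20 ms of samples at 8 kHz

def extract_features(samples):
    """Turn raw samples into one log-energy value per frame."""
    features = []
    for start in range(0, len(samples) - FRAME_SIZE + 1, FRAME_SIZE):
        frame = samples[start:start + FRAME_SIZE]
        energy = sum(s * s for s in frame) / FRAME_SIZE
        features.append(math.log(energy + 1e-10))
    return features

# A fake one-second "signal": the feature stream is far smaller than the audio.
samples = [((i % 50) - 25) / 25.0 for i in range(8000)]
features = extract_features(samples)
print(len(samples), "->", len(features))  # 8000 -> 50
```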
Unfortunately, there are currently no standards for compressing the speech for speech
recognition purposes. Thus, a device that uses distributed speech recognition could only
call a compatible system. But we may eventually see standards (or de facto standards)
evolve.
CTI: Speaking of de facto standards, we've heard that
Microsoft will make speech rec a part of the operating system. When do you suppose that
might happen?
Meisel: Microsoft seems committed to making speech recognition a part
of the operating system, but they are not rushing to do it. They want the technology for
interactive dialog (conversation with the computer) to improve, so that the interaction is
like working with an assistant. This requires some improvements in natural-language
interpretation that go beyond speech recognition alone.
My best guess as to when they will make speech a part of the operating system is in
three to five years. They might move faster if they perceive that there is danger someone
else will set the standard for a voice user interface on a desktop or palm-sized PC.
CTI: What are the prospects for SAPI support?
Meisel: SAPI (Microsoft's Speech API) is most relevant for
PC-type applications now, rather than telephony, but I understand that Microsoft plans
some extensions that better support telephony. Some vendors do support a SAPI interface,
but today proprietary interfaces seem to be necessary to use all the functions available
in telephone speech recognition.
CTI: Do you anticipate that speech rec capabilities will
appear not only in new products, but also be added to existing ones? Will it be
difficult to achieve "retro-fits"?
Meisel: Retro-fits are not difficult to implement if the hardware
architecture of an existing system is at all flexible. For example, voice-activated auto
attendants are usually added to an existing PBX by simply simulating an extension of the
PBX. If a system uses "computer telephony" components, speech recognition can
also usually be added through additional modules. Retro-fits can result in a poor voice
user interface, however, if the speech recognition simply emulates an existing touch-tone
interface.
CTI: What, in speech rec applications, are the
advantages/disadvantages of host-based processing vs. DSP-based processing? What sorts of
applications would require an embedded approach?
Meisel: For a small number of telephone lines, host-based systems that
don't use DSP resource cards are cost-effective. They use capacity already available
in the host computer and result in a much lower cost to add speech recognition. Once one
gets beyond the number of lines that can be handled by host processing, however, one must
decide whether to get the additional processing power required through resource cards or
through a LAN-server architecture. Most resource cards today are DSP-based, although some
now use multiple RISC processors.
DSPs are particularly suited for "front-end" tasks that occur before speech
recognition processing, e.g., spectral processing or echo-canceling to allow interrupting
a voice response. These front-end tasks don't require much memory (which can be
expensive on DSP boards).
On the other hand, the "back-end" processing, comparing the speech to stored
acoustic word models, is memory-intensive and has characteristics that are not efficiently
handled by a DSP. As the vocabulary size grows, the front-end tasks become a small part of
the overall processing, and DSPs lose their advantage. Thus, DSPs are efficient for many
lines of digit recognition, but become an expensive, inflexible solution as the vocabulary
grows.
Many vendors have adopted a hybrid solution. They use DSPs for front-end processing and
the host computer or a server for back-end processing. This architecture can be an
efficient way to handle many lines of larger-vocabulary recognition.
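The hybrid split can be sketched as two functions: a cheap per-frame front end (the kind of work a DSP does well) and a memory-hungry back end that compares the feature sequence to stored word models (the part better left to the host or a server). The templates and distance measure below are toy stand-ins for real acoustic models.

```python
# Toy sketch of the hybrid architecture: front_end() is the light,
# per-frame work suited to a DSP; back_end() compares features against
# stored word templates, the memory-intensive part run on the host.
# Templates and the distance measure are illustrative stand-ins.

def front_end(frames):
    """Cheap per-frame processing, e.g., average level per frame."""
    return [sum(abs(s) for s in frame) / len(frame) for frame in frames]

# Stored "acoustic word models": one feature sequence per word (toy data).
TEMPLATES = {
    "yes": [0.1, 0.8, 0.2],
    "no":  [0.7, 0.1, 0.1],
}

def back_end(features):
    """Pick the stored template closest to the features (memory lives here)."""
    def distance(word):
        return sum((f - t) ** 2 for f, t in zip(features, TEMPLATES[word]))
    return min(TEMPLATES, key=distance)

frames = [[0.1, 0.1], [0.8, 0.8], [0.2, 0.2]]  # fake framed audio
print(back_end(front_end(frames)))  # yes
```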
CTI: Will increasingly powerful host processors obviate
the use of DSPs for all but high-end applications?
Meisel: Eventually, as host processors get faster and PC memory
cheaper, the economics of a separate DSP board may become questionable. However, DSPs tend
to use considerably less power per MIPS than other types of processors, so it is easier to
pack a large number in a small place. And DSPs continue to get faster, as well. They will
probably be around for a long time in many-line applications.
CTI: Before wrapping up, we'd like to ask about
speech rec's business prospects.
Meisel: There is growing interest within the investment community.
Several public companies featuring speech recognition have done well on the stock market,
and venture capitalists are viewing speech recognition as an area that may produce some
big winners. This is important for users in that it means that the technology, tools, and
applications will receive the investment they need for continued rapid improvement.
CTI: Could you give us an idea of the current size (and
the future size) of the speech rec market?
Meisel: In 1997, the revenues for telephone equipment and services
using speech recognition were about $400 million. I estimated in a recent market study that
the market will almost double each year.
CTI: Who are the leading technology providers? Who are the most notable
companies licensing the technology?
Meisel: Companies licensing basic technology tend to have different
strengths and features. Some have established strong positions in handling cellular calls
for voice dialing by name or number, for example. Others have delivered large-vocabulary
applications, such as stock quotes, effectively. Leading telephone speech recognition
vendors include Applied Language Technology, BBN Technologies (now part of GTE
Internetworking), Dragon Systems, Lernout & Hauspie, Lucent Technologies, Northern
Telecom, Nuance Communications, Philips Speech Processing, and Voice Control Systems. IBM
and Voice Control Systems have teamed to deliver IBM telephone speech recognition to
market.
Dr. Meisel's latest publication, "The Telephony Voice User
Interface," analyzes 140 companies active in the speech recognition industry, and
examines the many different technologies these companies deploy. In addition to providing
contact and product information for each company, the study describes each company's
background, distribution, and typical customers. Competition is analyzed, and market size
is projected for each of the many product categories. For more information, contact the
publisher of the study, TMA Associates. Call TMA at 818-708-0962 or visit the
companys Web site at www.tmaa.com.