While they were intended as conveniences, early speech recognition applications
occasioned more than a little frustration. Indeed, disgruntled users found these
applications nearly as balky as ordinary, non-speech-enabled auto-attendants. The reason?
An auto-attendant, speech-enabled or not, puts all the convenience on the computer side of
any human-computer interaction.
Inputting information through a menu-driven process creates a sensation much like the
one we feel when we sit behind the wheel of an idling car, while we wait for a break in
traffic. We fancy the traffic is inspired by some perverse agency bent on vexing us, and
we resign ourselves to fitting in however we can, as stressful as that may be.
Fortunately, the old, "fit in" model of speech recognition is giving way to a
new approach: natural language recognition. This new approach lets the caller "own
the road," as it were. The caller speaks naturally, and the speech recognition
application parses and channels the input on behalf of the computer. The computer no
longer requires the caller to parse his or her utterances according to some infernal
script or menu.
Now, if the new speech recognition lets callers own the road, the question remains: Who
is going to build the road? Developers who take advantage of advanced speech recognition
engines. One such engine, the Nuance Speech Recognition System, is discussed in this
review. So, buckle your seatbelts, and let's get started.
INSTALLATION
The Nuance system, which ships on a single CD, installs on selected platforms, including
Windows NT, Solaris, AIX, and SCO OpenServer 5. We installed the product on a Pentium 200
MMX computer with 64 MB of RAM running Windows NT.
The installation involved little more than inserting the CD in the computer's CD-ROM
drive. Once we had the CD in place, it ran automatically, so we did not have to execute a
setup file. From this point, right until the very end of the installation, the entire
operation was self-propelled. The system never once asked for a change in directories or
for help in finding any specific Windows files. This ease and rapidity compelled us to
give the Nuance system the highest allowable score for installation.
DOCUMENTATION
We received two documents with the Nuance system. One document, a thin booklet, was
entitled Getting Started. The other document, a 400-page tome, constituted the Developer's
Manual. The Getting Started booklet provided information on the platforms supported by
Nuance, as well as the audio setups for those platforms. It also reviewed some issues
specific to the Solaris operating system. The remainder of the booklet described the
commands necessary to run applications on the system. The Developer's Manual covered many
aspects of developing applications, from compiling recognition packages to run-time
support to special topics.
We didn't stop at reviewing the literature supplied by Nuance. We also explored the
Nuance home page. We noticed that the section of the Nuance site devoted to Nuance6 was
quite thorough. In this section, topics such as descriptions, features, supported
platforms, the developer's toolkit, and developer training were covered. In addition, new
features of the Nuance system, such as the Speech Verifier, were illustrated.
Perhaps the most impressive portion of the site was the Nuance demo section, which
includes four demos that any visitor can access. Visitors can use these demos to test
the accuracy and usability of the system. We encourage anyone
to go to the Nuance Web site and try these very useful demos.
FEATURES
Core Client/Server Software
- Speech recognition, which eliminates the need for complicated and frustrating touch-tone
call attendants.
- Natural language capabilities, which allow the system to recognize speech patterns
rather than just single words. (Thus, the system can pick out key words. If the system
fails to recognize a particular word, it does not reject the entire phrase.)
- Telephony control, which allows the Speech Server to control a variety of telephony
processes.
- An easy-to-use, powerful API, which facilitates the creation of speech recognition
applications.
- SQL query integration, which gives developers access to a wide variety of database
functions.
- Barge-in capabilities, which allow callers to use the system without having to insert
unnatural pauses.
- Confidence scores, which indicate the reliability of the matches the system makes
between the input it receives and the words in its vocabulary. (Confidence scores are
configurable to create systems geared towards speed or accuracy.)
Nuance Verifier
- Speech recognition as a means of instituting security.
- Simultaneous recognition and authentication of speech, which allows for very fast,
real-time verification of speakers.
Developer's Toolkit
- Grammar specification.
- Natural language specification.
- SpokenSQL, which can be used to generate a database query from speech.
- Xwavedit, which is a means of recording prompts. (Allows the developer to edit prompts
for maximum recognition accuracy and efficiency.)
OPERATIONAL TESTING
Although actually developing our own application was beyond the scope of our review, we
acquainted ourselves with the Nuance system by running a sample application, and by
reviewing a few of the demonstration programs we found on Nuance's home page.
Sample Application
This application, working from a specified vocabulary, prompts the user for input,
compares the input to the words in the vocabulary, and returns the word that it deems the
closest match, along with a confidence score. We experimented quite a bit - so much, in
fact, that we started to feel at one with our microphone-equipped headset. In the course
of our work, we took careful note of the confidence scores, which displayed such variation
that we satisfied ourselves that the system was acting properly.
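To make the sample application's behavior concrete, here is a minimal sketch in Python. It is not Nuance's API. The vocabulary, the function names closest_match and accept, and the 0.6 threshold are our own assumptions, and text similarity from Python's standard difflib module merely stands in for the acoustic confidence score a real engine would compute.

```python
from difflib import SequenceMatcher

# Hypothetical vocabulary; the review does not list the sample application's words.
VOCABULARY = ["checking", "savings", "transfer", "balance"]


def closest_match(heard, vocabulary=VOCABULARY):
    """Return (word, score) for the vocabulary word most similar to the input.

    The score lies in [0, 1]. It is a text-similarity stand-in for a
    confidence score, not an acoustic measure.
    """
    heard = heard.lower()
    scored = [(word, SequenceMatcher(None, heard, word).ratio()) for word in vocabulary]
    return max(scored, key=lambda pair: pair[1])


def accept(heard, threshold=0.6):
    """Return the best match only if its score clears the (assumed) threshold."""
    word, score = closest_match(heard)
    return word if score >= threshold else None


if __name__ == "__main__":
    print(closest_match("savins"))  # a slightly garbled input still maps to a word
    print(accept("zzzz"))           # nothing in the vocabulary is close enough
```

In a sketch like this, raising the threshold rejects more doubtful matches, while lowering it accepts more of them at the risk of more errors.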
Demonstration Programs
Two of the demonstrations, Travel Plan and Better Bank, gave us a chance to evaluate
Nuance's natural language capabilities. Also, Travel Plan let us check out Nuance's
barge-in feature. Another feature, speech verification, was displayed to advantage in the
Nuance Verifier demo. And, finally, the Stock Quotes demo let us work with a system that
boasted an enormous vocabulary.
Travel Plan: This demo, which was an interactive, real-time travel
planner, showed off how a Nuance-based application could recognize alternative names for
airports. (We need hardly add that the ability to cope with alternative names is one of
the manifestations of natural language recognition.)
We started by referring to an airport first as JFK, then as Kennedy. The system
responded appropriately, recognizing that both names corresponded to the same airport. So,
we decided to try something a little trickier. We decided to refer to Dulles, for Dulles
International Airport. We supposed the system might mistake Dulles for Dallas, which
sounds much the same. We supposed wrong, however. The system did in fact recognize Dulles,
and we had to admit we were impressed.
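One simple way an application might handle alternative names, once the engine has turned speech into text, is an alias table. The Python sketch below is our own illustration, not a description of how Travel Plan works. The table contents and the name resolve_airport are assumptions, and the hard part (telling Dulles from Dallas acoustically) is left to the recognition engine.

```python
# Hypothetical alias table: each airport code maps to the spoken names
# a caller might use for it.
AIRPORT_ALIASES = {
    "JFK": ["jfk", "kennedy", "john f kennedy"],
    "IAD": ["dulles", "washington dulles"],
    "DFW": ["dallas", "dallas fort worth"],
}

# Invert the table so that every spoken form points at its airport code.
SPOKEN_TO_CODE = {
    alias: code
    for code, aliases in AIRPORT_ALIASES.items()
    for alias in aliases
}


def resolve_airport(phrase):
    """Return the airport code for a recognized phrase, or None if unknown."""
    return SPOKEN_TO_CODE.get(phrase.strip().lower())


if __name__ == "__main__":
    for spoken in ("JFK", "Kennedy", "Dulles", "Dallas"):
        print(spoken, "->", resolve_airport(spoken))
```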
Before we moved on to the next demo, we made sure to check out the system's barge-in
capability, that is, the system's ability to accept user input even if the user supplies
it before hearing the appropriate prompt. This capability has its advantages. It lets
callers familiar with the menu move through it more quickly. However, it can increase
the potential for errors. Hence, good barge-in functionality accommodates impatient
callers without sacrificing accuracy.
Since we're naturally impatient, we had no difficulty testing the barge-in
functionality. For example, after we had heard only two of three flight options, we knew
which option we wanted. So, we barged in, not caring to hear the third option. The system
responded exactly as it should have. It stopped reading the flight information, and it
asked us if we would like to obtain pricing information on that particular flight.
Better Bank: In this demo, Nuance's Natural Grammar and Natural
Language capabilities are displayed. These capabilities go beyond the recognition of
individual words. The idea is to recognize alternative word combinations, sparing
the user the challenge of speaking information in any particular predefined format.
The few tests we tried here gave us favorable responses. When asked how much we wanted
to transfer from checking into savings, we responded, "Fifty-seven dollars and
thirty-two cents." Later, in response to the same question, we said,
"Fifty-seven thirty-two." In both cases, the demo transferred the correct
amount.
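To show what accepting both phrasings involves, here is a minimal Python sketch that maps either form to the same amount. It is our own illustration, not Nuance's grammar format. The function name parse_amount_cents is an assumption, and the sketch handles only dollar and cent values below one hundred.

```python
import re

UNITS = {word: value for value, word in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {word: 10 * value for value, word in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
FILLER = {"dollar", "dollars", "cent", "cents", "and"}


def parse_amount_cents(utterance):
    """Convert a spoken amount to whole cents.

    Number words are grouped into values of 0 to 99. The first group is read
    as dollars and an optional second group as cents.
    """
    tokens = [t for t in re.findall(r"[a-z]+", utterance.lower()) if t not in FILLER]
    groups = []
    open_tens = False  # True while the last group is a bare tens word
    for token in tokens:
        if token in TENS:
            groups.append(TENS[token])
            open_tens = True
        elif token in UNITS:
            if open_tens and 1 <= UNITS[token] <= 9:
                groups[-1] += UNITS[token]  # "fifty" + "seven" -> 57
            else:
                groups.append(UNITS[token])
            open_tens = False
        else:
            raise ValueError(f"unrecognized word: {token!r}")
    if not 1 <= len(groups) <= 2:
        raise ValueError("expected dollars, optionally followed by cents")
    dollars = groups[0]
    cents = groups[1] if len(groups) == 2 else 0
    return dollars * 100 + cents


if __name__ == "__main__":
    print(parse_amount_cents("Fifty-seven dollars and thirty-two cents."))
    print(parse_amount_cents("Fifty-seven thirty-two."))
```

Working in whole cents rather than fractional dollars keeps the arithmetic exact.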
We did have one problem with the demo, however. When we indicated we wanted to make a
payment on our American Express bill, the system asked us to indicate the amount. We said,
"Pay in full." However, the system told us it didn't understand. Then, we tried
other, equivalent phrases, to no avail.
Finally, we tried a question: "How much do I owe?" The system then asked us
if it was correct that we wanted to pay twelve dollars. Well, that wasn't the balance.
Perhaps "twelve" was the closest match the system could produce. If so, it would
appear the demo's programming simply didn't anticipate the sort of input we provided. We
don't suppose, however, that our problem had anything to do with the underlying speech
recognition engine.
Nuance Verifier: We called into the demo and enrolled by giving our
seven-digit phone number. The system then asked one of our engineers to repeat the phrase
"My voice is my password" three times until it was satisfied that it could
recognize his speech patterns.
Then, to proceed with the demo, we had a second engineer attempt to log into the first
engineer's account. When the system prompted the second engineer to say, "My voice is
my password," he complied, but the system refused to let him access the account.
Thus, we confirmed the new speech verification feature actually worked.
Stock Quotes: Finally, we came to what is probably the best-known
application of the Nuance system, the Stock Quotes demo. In 1996, Nuance developed a
system for Charles Schwab & Co. This system, which lets the company offer stock quotes
to its clients, needs an enormous vocabulary, for there are thousands of stocks clients
might ask about. In fact, there are over thirteen thousand stocks, mutual funds, and
market indicators.
To make matters even more complicated, many of these stocks may be referenced in
multiple ways, all of which have to be recognized by the system. That such a large system
should work so well is impressive, for as more words are added into any system's
vocabulary, the confidence levels for any matches delivered by the system invariably
decline. Thus, in such a system, it is often a good idea to read the caller's input back
to the caller, so the caller knows whether the information the system provides really is
pertinent to the original input. Also, it might help to read back some ancillary
information. For example, in the case of a company name, it might help to read back the
company's city and state, the better to distinguish any given company from sound-alike
companies.
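The read-back idea can be sketched in a few lines of Python. This is our own illustration under stated assumptions: the Listing type, the read_back function, and the two fictional sound-alike companies are invented for the example and have nothing to do with the Schwab system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    name: str
    city: str
    state: str


# Fictional sound-alike companies, used only for illustration.
LISTINGS = [
    Listing("Acme Tool", "Dayton", "Ohio"),
    Listing("Acme Tulle", "Macon", "Georgia"),
]


def read_back(candidates):
    """Build a prompt that reads each match, with city and state, back to the caller."""
    if not candidates:
        return "Sorry, I did not find that company. Please say the name again."
    described = [f"{c.name} of {c.city}, {c.state}" for c in candidates]
    if len(described) == 1:
        return f"I heard {described[0]}. Is that correct?"
    return "I found more than one match: " + "; ".join(described) + ". Which one did you mean?"


if __name__ == "__main__":
    print(read_back(LISTINGS[:1]))
    print(read_back(LISTINGS))
```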
ROOM FOR IMPROVEMENT
It appears Nuance believes (and rightly so) that its role is to provide the basis for
voice-based systems, and that creating working speech recognition applications is up to
developers. Thus, Nuance concentrates on refining its speech recognition engine. And,
while Nuance doesn't neglect to give developers some guidance (with the Developer's
Toolkit, for example), it hasn't gone so far as to release a full application generator.
We would like to see a development tool of this sort specifically aimed at creating a
Nuance-compliant application. Such a tool would be a convenience to developers, and it
would (more than incidentally) promote Nuance's interests.
Nuance could look after its own interests in yet another way, and yet again extend a
convenience to other parties, by eliminating whatever problem caused our difficulties with
the Better Bank demo. (We suppose the problem is a limited vocabulary.) Of course, this
problem doesn't raise any issues with the speech recognition engine itself. We just feel
the demo should do justice to the engine. So, you might consider our suggestion a
backhanded compliment.
CONCLUSION
The continued evolution of speech recognition seems assured. This evolution is driven, of
course, by the need for more natural human-computer interfaces. With the right interfaces,
people will no longer need to adapt to the computer's way of working. Instead, it will be
the computer that does the adapting; interfaces will acquire whatever attributes are
needed to maximize user convenience. One new attribute that is already facilitating
human-computer interactions is, as we've seen, natural language recognition. This
attribute will soon enhance many applications, thanks to tools such as those provided by
Nuance and other companies.