Understanding Speech Technology
By Bill Ledingham
As speech recognition and natural language understanding technologies begin to mature
and gain commercial acceptance, they have captured a great deal of attention in both the
trade and mainstream press. While speech technology still has some maturing ahead, it can
be used effectively today for a range of CTI applications. To gain a better
appreciation of speech understanding technology and its current capabilities, it is
necessary to trace its heritage and delve into its inner workings.
RESEARCH BACKGROUND
For over a decade, ARPA (the United States Government's Advanced Research Projects
Agency) has supported a significant amount of research directed at improving the
capabilities of computer speech recognition and understanding systems. This research has
taken place at the Massachusetts Institute of Technology (MIT) and several other
institutions. Great emphasis has been placed on systems that do not require
speaker-specific training, can operate with large vocabularies, and can understand
continuous speech. During the early and mid-90s, the ARPA program concentrated on
combining speech recognition with natural language understanding to create systems which
are able to conduct interactive dialogues with users in order to complete transactions
within specific application domains.
RECENT ADVANCES
This research has led to significant advances in the field of speech recognition. Two
meaningful measures of these advances are accuracy and overall task complexity. Accuracy
is measured by the error rate (how many mistakes the speech recognition software
makes) for a given task, while task complexity is typically characterized by the size
of the vocabulary (how many words can the system look for at the same time). Error rates
for a given vocabulary size have continued to decline while vocabulary sizes have
continued to increase.
Until very recently, most deployed speech recognition applications could only handle
very small vocabularies (less than two dozen words). Functionality was thus limited to
tasks such as "Please press or say one" or "Will you accept the charges?
Please say yes or no." With today's technology, vocabulary sizes are now
exceeding 25,000 words. For example, stock quote applications containing active vocabulary
sizes of 40,000 words have been deployed commercially and are fielding thousands of calls
daily.
In addition, through ongoing improvements to the software algorithms, error rates for a
given task have been declining by 30 percent per year over the past five years. For small
vocabulary recognition tasks, the error rate is now on the order of 1 to 2 percent
(accuracy of 98 to 99 percent). Even on large vocabulary recognition tasks of over
25,000 words, the accuracy rates can exceed 90 percent. With this level of accuracy, along
with a well-crafted user interface, it is now feasible to use speech recognition for a
range of applications.
In addition to improving recognition accuracy and increasing vocabulary sizes,
significant progress has been made in reducing the computational needs of the speech
recognition software. The continued improvement in the recognition algorithms,
coupled with the seemingly unending growth in microprocessor power, now allows
recognition to occur in real-time (Figure 2). For example, a factor of 4,000
increase in speed (system performance) has been achieved over the past five years. As
recently as six years ago, it took approximately 20 minutes to process a speech utterance
using a 350-word vocabulary, and it required specialized signal processing hardware.
Current speech recognition software can run with no noticeable delay (less than
one-second response time) using vocabularies of tens of thousands of words. Most
importantly, this can now be achieved in software (with no specialized hardware) on
Pentium-based PCs. In addition, the software algorithms can be segmented such that a
portion of the processing can be distributed to run on digital signal processors (DSPs),
thereby resulting in a low-cost, high-volume (24 phone lines or more) platform for call
processing. Thus, there have been significant advances in the field of speech recognition
over the past several years. Despite being over-hyped for a number of years,
speech recognition is now commercially viable for a wide variety of applications.
SPEECH RECOGNITION
The goal of speech recognition technology is to convert human speech into a string of text
that represents what the person is saying. This is actually a very complex task and
requires a thorough understanding of many different disciplines including digital signal
processing, electrical engineering, statistics, and linguistics. The process of converting
speech into text requires a number of different steps. Two of the most prevalent
approaches for performing speech recognition are HMM-based (Hidden Markov Model) and
phonetic segment-based. While these approaches differ somewhat, they use the same basic
techniques for recognizing speech.
Waveform Capture And Digitization
The first step of the process involves capturing the utterance (speech waveform)
from the caller. For over-the-telephone speech recognition, this step includes
the capture of the speech signal by the microphone on the telephone,
conversion into an analog waveform, and transmission of the waveform over the telephone
network and into the speech recognition system. Once the utterance has been captured by
the system, the analog acoustic signal is digitized by the system.
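As a concrete (and simplified) illustration of the digitization step: telephone channels are conventionally sampled 8,000 times per second and compressed to 8 bits per sample using mu-law companding, as standardized in ITU-T G.711. The sketch below implements the textbook mu-law formula in Python; it is a toy encoder, not any particular vendor's front end.

```python
import math

MU = 255  # mu-law companding parameter used in North American telephony (G.711)

def mu_law_encode(sample: float) -> int:
    """Compress one sample in [-1.0, 1.0] to an 8-bit value (0..255)."""
    sample = max(-1.0, min(1.0, sample))
    # Logarithmic compression: small amplitudes get more resolution.
    magnitude = math.log(1 + MU * abs(sample)) / math.log(1 + MU)
    signed = math.copysign(magnitude, sample)
    # Map [-1.0, 1.0] onto the 8-bit range 0..255.
    return int(round((signed + 1.0) / 2.0 * 255))

# A telephone channel is sampled 8,000 times per second, so a one-second
# utterance (here, a 440 Hz tone) becomes 8,000 of these 8-bit values.
tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
encoded = [mu_law_encode(s) for s in tone]
```

The logarithmic curve is why an 8-bit telephone sample carries roughly the dynamic range of a 13-bit linear sample.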
Spectral Representation
After the waveform has been digitized, it is then converted to a representation
that can be used by the other components of the software. Digital signal processing is
done at this point to normalize variations in the input signal due to telephone system
differences, noise, and the like. In addition, signal processing may be used to enhance
signal features to make it easier to identify spoken words.
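The article does not name a specific enhancement technique, but one common front-end step that fits this description is a pre-emphasis filter, which boosts high-frequency energy that the telephone channel and the human vocal tract attenuate. A minimal sketch:

```python
def pre_emphasize(samples, alpha=0.97):
    """Apply the first-order filter y[n] = x[n] - alpha * x[n-1].
    This boosts high frequencies, sharpening consonant energy such as
    the hiss of an "s" relative to the low-frequency vowel energy."""
    return [samples[0]] + [samples[n] - alpha * samples[n - 1]
                           for n in range(1, len(samples))]
```

After a steady (constant) stretch of signal, the filter's output is near zero; it reacts mainly to change, which is what distinguishes one speech sound from the next.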
Segmentation
The segmentation process involves splitting the speech waveform into distinct
sounds or segments. Each of these segments corresponds to a specific sound, such as
a consonant or vowel sound ("s," "p," "e," and so on).
These speech sounds will vary in terms of duration. The segmentation process, therefore,
must be able to demarcate between the different sounds. The way the process works is to
hypothesize possible boundaries and determine the likelihood of each possible combination.
The output of the segmentation process will be a listing of these possible boundaries and
their associated probabilities. This output will then be used to phonetically classify the
various sounds.
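The boundary-hypothesis idea described above can be sketched in a few lines. In this toy version each frame of speech is reduced to a single feature value (real systems use multidimensional spectral features), and a large change between adjacent frames is treated as evidence of a segment boundary; the threshold and probability squashing are illustrative assumptions, not any production algorithm.

```python
def hypothesize_boundaries(frames, threshold=0.5):
    """Score each inter-frame position by how much the (toy, 1-D)
    feature changes; large changes are likely segment boundaries.
    Returns (position, probability) pairs above the threshold."""
    candidates = []
    for i in range(1, len(frames)):
        change = abs(frames[i] - frames[i - 1])
        # Squash the raw change into a pseudo-probability in [0, 1).
        prob = change / (1.0 + change)
        if prob >= threshold:
            candidates.append((i, round(prob, 3)))
    return candidates

# A feature track with two abrupt changes, as if the signal moved from
# silence to a vowel and then to a fricative.
frames = [0.1, 0.1, 3.0, 3.1, 7.0, 7.1]
boundaries = hypothesize_boundaries(frames)
```

The output, a list of candidate boundaries with probabilities, is exactly the shape of input the phonetic classification stage expects.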
Phonetic Classification
After the segmentation is complete, the speech recognition software attempts to
classify each of the sounds. For example, all the sounds in the English language can be
matched to one of 44 basic phonemes. Phonetic classification involves determining possible
matches between the sound segments and their phonetic representations (trying to match an
"s" sound to the phoneme /s/). This is accomplished by statistically
comparing the segments to acoustic models for the various phonemes. These acoustic models
of the different phonemes are based on training data of a number of different people
speaking words and phrases that contain the various phonemes. System accuracy is directly
correlated with the amount of training data that has been collected since this helps to
statistically normalize the acoustic models and handle the multitude of ways in which
various people speak each of the sounds. The output of this stage is a network, or
matrix, of phonemes: each sound segment will have a list of possible phonemes and an
associated probability.
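The statistical comparison against acoustic models can be sketched as follows. Here each phoneme's "acoustic model" is a single one-dimensional Gaussian (a mean and variance, as if estimated from training speech); real systems model multidimensional spectral features with far richer distributions, so treat this purely as an illustration of the scoring idea.

```python
import math

# Toy one-dimensional acoustic models: (mean, variance) of a single
# feature per phoneme, standing in for models trained on many speakers.
MODELS = {"s": (7.0, 1.0), "p": (0.5, 0.5), "e": (3.0, 1.0)}

def classify_segment(feature):
    """Return (phoneme, probability) pairs for one segment, best first.
    Each phoneme's Gaussian gives a likelihood; the likelihoods are
    normalized so the scores over the candidate set sum to 1."""
    likes = {}
    for phoneme, (mean, var) in MODELS.items():
        likes[phoneme] = (math.exp(-(feature - mean) ** 2 / (2 * var))
                          / math.sqrt(2 * math.pi * var))
    total = sum(likes.values())
    return sorted(((p, l / total) for p, l in likes.items()),
                  key=lambda pair: -pair[1])

# One row of the phoneme matrix per sound segment.
lattice = [classify_segment(f) for f in [7.1, 0.4, 3.2]]
```

Each row of `lattice` is one segment's ranked phoneme candidates with probabilities, which is the network the search stage then walks.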
Search And Matching
The final step of the speech recognition process involves searching for the word
or phrase that most closely matches what the caller said. This process involves matching
the network of possible phonemes and their associated probabilities to a lexical network
that incorporates the word vocabulary, language models or grammars, and other potential
sources of constraints, such as databases.
In other words, this step involves mapping the sets of possible phonemes to the words
or phrases that form the vocabulary for the recognizer. Each word or phrase consists of
one or more phonemes. The recognizer compares various paths through the phonetic network
to the phonetic representation of the words or phrases in its vocabulary. For each
possible word or phrase, a "confidence score," or probability measure, is
generated. There are potentially millions of calculations that occur in this step because
of the many possible paths through each network. Thus, the use of constraints
is a key consideration to help reduce some of the complexity of the task. For example, by
using language models that determine the probability of one word following another,
various word combinations can be discarded. The use of constraints therefore improves
recognition accuracy by reducing the scope and variability of the task.
The output of this stage is an n-best list of the most likely word or phrase matches to
the spoken utterance. The corresponding confidence score for each word or phrase measures
the probability that it is the correct answer. The n-best list and associated confidence
scores provide the raw material for constructing conversational systems. Using natural
language capabilities and comprehensive user interface design techniques, versatile,
robust applications can then be developed.
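The interplay of acoustic scores, a language-model constraint, and the n-best output can be condensed into a small sketch. The word candidates, their scores, and the bigram probabilities below are invented for illustration; a real recognizer searches a vastly larger space with dynamic programming rather than by enumerating every path.

```python
from itertools import product

# Acoustic scores: how well each candidate word matched the phoneme
# network at each position in the utterance (illustrative values).
acoustic = [{"I": 0.9, "eye": 0.6}, {"want": 0.8, "won't": 0.5}]

# Bigram language model: the probability of one word following another.
# This is the constraint that lets unlikely combinations be discarded.
bigram = {("I", "want"): 0.3, ("I", "won't"): 0.1,
          ("eye", "want"): 0.01, ("eye", "won't"): 0.01}

def n_best(n=3):
    """Score every path (acoustic score times language-model probability)
    and return the n most likely word sequences with confidence scores."""
    scored = []
    for w1, w2 in product(acoustic[0], acoustic[1]):
        score = acoustic[0][w1] * acoustic[1][w2] * bigram[(w1, w2)]
        scored.append(((w1, w2), round(score, 4)))
    scored.sort(key=lambda item: -item[1])
    return scored[:n]
```

Even though "eye" scores respectably on acoustics alone, the bigram constraint pushes every "eye"-initial path far down the n-best list, which is precisely how constraints improve accuracy.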
NATURAL LANGUAGE CAPABILITIES
Natural language technology can be used to augment speech recognition to provide speech
understanding within specific application domains. Natural language processing is a
technology for taking a string of words and parsing out the vital elements such that the
computer can extract meaning from the words. Simply put, speech recognition software
attempts to discern the specific words that the caller said, whereas natural language
software attempts to understand what the caller meant. Unconstrained natural language
systems those that provide the ability to speak freeform to the computer are
still a number of years from being commercially realized. However, a number of natural
language techniques are currently being applied to extend the functionality and usability
of telephone-based speech applications. These techniques include natural language
modeling, natural language shortcuts, and discourse management.
Natural Language Modeling
Different callers will undoubtedly respond differently when prompted with the
same question. Even with a simple yes/no question, there are roughly 30 ways in which
people will typically respond "yes" ("yes," "yup,"
"uh-huh," "yes, please," "yeah," "correct,"
"okay," and the list goes on) and 20 ways in which people will respond
"no." By using natural language modeling, the developer of the speech
recognition application can provide callers with flexibility in how they respond while
improving the overall level of understanding of the application. Advanced speech
recognition systems can model explicit grammars (e.g., a BNF
grammar) for the recognition context that the software uses to interpret a response
from a caller.
[Note: BNF, originally "Backus Normal Form" and later renamed "Backus-Naur
Form," is a formal metasyntax used to express context-free grammars. It is one of the
most commonly used metasyntactic notations for specifying the syntax of programming
languages, command sets, and the like.]
Simply put, the recognition vocabulary is the set of words against which the recognition
engine attempts to match what the caller says, while the recognition grammar defines how
those words are arranged in a phrase or sentence. For example, in a reservations application,
the system might ask the caller to speak their destination. The caller might respond in a
number of different ways: they might simply say "Boston," or they might
embed the keyword "Boston" in a phrase, for example, "I want to go to Boston" or
"I'm flying to Boston." Through the explicit modeling of the different ways
in which callers typically respond, the software achieves higher recognition accuracy and
understanding than can be gained through the use of simple "word spotting"
techniques. By matching the caller's speech to a likely response defined by a
grammar, there is less likelihood that various words can be misrecognized or
misunderstood.
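A grammar that models these carrier phrases can be sketched compactly. The version below uses Python regular expressions as a stand-in for a BNF grammar, and the city list is a placeholder for the application's real vocabulary; the phrasings are the ones from the reservations example above.

```python
import re

# Stand-in vocabulary; a real application would load its full city list.
CITIES = ["Boston", "San Francisco", "Chicago"]

# Two "grammar rules": a carrier phrase followed by a city, or a bare city.
PATTERNS = [
    r"^(?:i want to go to|i'm flying to|my destination is)\s+({c})$",
    r"^({c})$",
]

def match_destination(utterance):
    """Return the destination city if the utterance fits the grammar,
    or None if it falls outside the modeled responses."""
    city_alt = "|".join(re.escape(c) for c in CITIES)
    for pattern in PATTERNS:
        m = re.match(pattern.format(c=city_alt), utterance, re.IGNORECASE)
        if m:
            return m.group(1)
    return None
```

Because the grammar only accepts a city in a modeled position, a stray word elsewhere in the utterance cannot be mistaken for a destination, which is the advantage over naive word spotting.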
Natural Language Shortcuts
The use of complex grammars also affords the ability for callers to speak full
phrases or sentences and impart multiple pieces of information to the system. Rather than
stepping through a transaction, experienced callers often want to take shortcuts through
the various steps of the dialogue and fill in several fields with a single sentence as
they might when speaking with a live operator. Again, using a reservations application as
an example, a caller might say, "I want to travel from Boston to San Francisco leaving
at 4 P.M. tomorrow." With a complex phrase such as this, there are a number of items
that the natural language software needs to parse to extract meaning. After the speech
recognition engine has output a (hypothesized) string of text representing the phrase, the
natural language software attempts to break the phrase down into low-level items on which
it can take action:
- "I" maps to [caller] (established elsewhere in the dialogue);
- "want to travel" maps to [request type];
- "from Boston" maps to [origin] is "Boston";
- "to San Francisco" maps to [destination] is "San Francisco";
- "leaving at 4 P.M." maps to [departure time] is "4 P.M.";
- "tomorrow" maps to [travel date].
Natural language shortcuts thus provide a nice extension to a directed dialogue process
and allow callers, especially experienced ones, to complete transactions more
quickly and efficiently.
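The slot-filling parse described above can be sketched with pattern matching. The slot names and patterns below are illustrative assumptions, not the fields of any real reservations system; production parsers use grammars rather than regular expressions, but the input and output shapes are the same.

```python
import re

def parse_request(sentence):
    """Pull travel slots out of a request sentence. Slot names and
    patterns here are hypothetical, for illustration only."""
    slots = {}
    patterns = {
        "origin": r"from ([A-Z][\w ]*?)(?= to | leaving|$)",
        "destination": r"to ([A-Z][\w ]*?)(?= leaving|$)",
        "departure_time": r"leaving at ([\w:. ]+?)(?= tomorrow|$)",
        "date": r"(tomorrow|today)",
    }
    for slot, pattern in patterns.items():
        m = re.search(pattern, sentence)
        if m:
            slots[slot] = m.group(1).strip()
    return slots

request = parse_request(
    "I want to travel from Boston to San Francisco leaving at 4 P.M. tomorrow")
```

A single sentence thus fills four fields at once, which is exactly the shortcut that spares an experienced caller four separate prompts.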
Discourse Management
Discourse management techniques help to provide contextual understanding to a
conversational application. They establish a frame of reference for the application by
determining what pieces of information have been gathered and what remains to be gathered.
This is especially important in applications where natural language shortcuts are being
combined with directed dialogue. For example, in a stock trading application, a caller
might respond to the prompt "Do you want to buy or sell?" with any of the
following: "Buy," "Buy 100 shares," "Buy 100 shares at 20 and
½," or "Buy 100 shares at 20 and ½, good until close." Under this
scenario, the discourse manager needs to track what information has been captured and
carry on a dialogue with the caller to capture the remaining pieces of information (the
number of shares, limit price, or time limit) to complete the stock purchase order.
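A discourse manager for this stock-trading scenario reduces to tracking which fields are filled and which question to ask next. The field names and prompts below are illustrative, not those of any deployed system.

```python
# Fields the order needs before it can be executed (illustrative names).
REQUIRED = ["shares", "limit_price", "time_limit"]

class DiscourseManager:
    """Track which order fields have been captured so far and decide
    what the application should ask the caller next."""
    def __init__(self):
        self.filled = {}

    def absorb(self, slots):
        """Merge newly understood slots into the conversation state."""
        self.filled.update(slots)

    def next_prompt(self):
        """Return the question for the first missing field, or None
        when the order is complete."""
        prompts = {"shares": "How many shares?",
                   "limit_price": "At what price?",
                   "time_limit": "Good until when?"}
        for field in REQUIRED:
            if field not in self.filled:
                return prompts[field]
        return None

dm = DiscourseManager()
dm.absorb({"action": "buy", "shares": 100})  # caller: "Buy 100 shares"
```

At this point `dm.next_prompt()` asks for the limit price; after the caller supplies the price and time limit, it returns `None` and the order can be confirmed. Shortcuts fall out for free: a caller who says everything up front skips every prompt.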
User Interface Design
Successful deployment of speech understanding technology involves mapping the
capabilities of the technology to the requirements of the task being performed. With this
in mind, effort needs to be focused on user interface design and dialogue management. The
goal is to discern meaning from the user while providing unobtrusive
interaction. A successful user interface design is one in which the questions and expected
responses are designed for high recognition accuracy, the user finds the system dialogue
pleasant, and the transactions get completed in a timely manner.
An effective approach for aiding understanding accuracy is to design user interfaces
that combine directed dialogue with natural language extensions. Taking the reservations
application again as an example, the system would not prompt the caller with a phrase such
as "How may I help you?" Rather, the system will prompt the caller to solicit
specific pieces of information with each question, such as, "What is your
destination?" Natural language modeling is used to handle a range of responses (e.g.,
"My destination is Boston," "I'd like to go to Boston,"
"I'm traveling to Boston," "Boston," "uh, Boston,"
etc.) to allow callers to speak naturally. Likewise, natural language shortcuts allow the
caller to input multiple pieces of information to speed up the transaction.
CONCLUSION
It is important to note that speech recognition technology will never achieve 100 percent
accuracy. Even humans rarely achieve 100 percent accuracy in conversational dialogue over
the telephone. In a typical conversation, there is often a fair amount of clarification
and confirmation needed, such as, "I'm sorry, what did you say?" or "I
didn't quite catch that, would you please repeat it?" Thus, even though there
are often a number of recognition errors that occur, people are very adept at recovering
from these sorts of errors and keeping the conversation afloat. A well-designed speech
user interface mimics some of the qualities of human-to-human interaction by gracefully
recovering from errors, providing a consistent form of interaction, and by appearing
cooperative to the caller.
In summary, through the combination of speech recognition, natural language techniques,
and careful user interface design, speech understanding is now possible for a range of
telephone-based applications and services. Computers do not yet have the capability of
conversing free-form with humans. However, within specific applications such as
reservations, stock quotes and trading, or order entry, the dialogue capabilities and
functional utility can indeed be very impressive.
Bill Ledingham is vice president of product development for Applied Language
Technologies, Inc. (ALTech), a leading vendor of conversational speech recognition
software. ALTech's SpeechWorks software provides a comprehensive platform for
speech-enabling telephone-based transactions and services. For more information, please
visit ALTech's Web site at www.altech.com