February 1998


Understanding Speech Technology

BY BILL LEDINGHAM

As speech recognition and natural language understanding technologies begin to mature and gain commercial acceptance, they have captured a great deal of attention in both the trade and mainstream press. While speech technology still has some maturing ahead, it can be used effectively today for a range of CTI applications. To gain a better appreciation of speech understanding technology and its current capabilities, it is worth tracing its heritage and delving more deeply into its inner workings.

RESEARCH BACKGROUND
For over a decade, ARPA (the United States Government’s Advanced Research Projects Agency) has supported a significant amount of research directed at improving the capabilities of computer speech recognition and understanding systems. This research has taken place at the Massachusetts Institute of Technology (MIT) and several other institutions. Great emphasis has been placed on systems that do not require speaker-specific training, can operate with large vocabularies, and can understand continuous speech. During the early and mid-’90s, the ARPA program concentrated on combining speech recognition with natural language understanding to create systems which are able to conduct interactive dialogues with users in order to complete transactions within specific application domains.

RECENT ADVANCES
This research has led to significant advances in the field of speech recognition. Two meaningful measures of these advances are accuracy and overall task complexity. Accuracy is measured by the error rate (how many mistakes the speech recognition software makes) for a given task, while task complexity is typically characterized by the size of the vocabulary (how many words the system can listen for at the same time). Error rates for a given vocabulary size have continued to decline while vocabulary sizes have continued to increase.
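To make the accuracy measure concrete, the sketch below computes a word error rate by aligning a recognized hypothesis against a reference transcript, using the standard edit-distance formulation. It is an illustrative Python example, not code from any particular recognizer:

# Minimal word error rate (WER) computation via edit distance.
# Illustrative only; real evaluations use standard scoring tools.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table of edit distances between word prefixes.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + cost)    # match/substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in a five-word utterance -> 0.2 (20 percent error rate).
print(word_error_rate("please say yes or no", "please say yet or no"))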

Until very recently, most deployed speech recognition applications could handle only very small vocabularies (fewer than two dozen words). Functionality was thus limited to tasks such as “Please press or say one” or “Will you accept the charges? Please say yes or no.” With today’s technology, vocabulary sizes now exceed 25,000 words. For example, stock quote applications with active vocabularies of 40,000 words have been deployed commercially and are fielding thousands of calls daily.

In addition, through ongoing improvements to the software algorithms, error rates for a given task have been declining by 30 percent per year over the past five years. For small vocabulary recognition tasks, the error rate is now on the order of 1 to 2 percent (accuracy of 98–99 percent). Even on large vocabulary recognition tasks of over 25,000 words, the accuracy rates can exceed 90 percent. With this level of accuracy, along with a well-crafted user interface, it is now feasible to use speech recognition for a range of applications.

In addition to improving recognition accuracy and increasing vocabulary sizes, significant progress has been made in reducing the computational needs of the speech recognition software. Continued improvement in the recognition algorithms, coupled with the seemingly unending growth in microprocessor power, now allows recognition to occur in real-time (Figure 2). A factor of 4,000 increase in speed (system performance) has been achieved over the past five years. As recently as six years ago, it took approximately 20 minutes to process a speech utterance using a 350-word vocabulary, and it required specialized signal processing hardware.

Current speech recognition software can run with no noticeable delay (less than one-second response time) using vocabularies of tens of thousands of words. Most importantly, this can now be achieved in software (with no specialized hardware) on Pentium-based PCs. In addition, the software algorithms can be segmented such that a portion of the processing can be distributed to run on digital signal processors (DSPs), thereby resulting in a low-cost, high-volume (24 phone lines or more) platform for call processing. Thus, there have been significant advances in the field of speech recognition over the past several years. Despite being “over-hyped” for a number of years, speech recognition is now commercially viable for a wide variety of applications.

SPEECH RECOGNITION
The goal of speech recognition technology is to convert human speech into a string of text that represents what the person is saying. This is actually a very complex task and requires a thorough understanding of many different disciplines, including digital signal processing, electrical engineering, statistics, and linguistics. The process of converting speech into text requires a number of different steps. Two of the most prevalent approaches to speech recognition are Hidden Markov Model (HMM)-based and phonetic segment-based. While these approaches differ somewhat, they use the same basic techniques for recognizing speech.

Waveform Capture And Digitization
The first step of the process involves capturing the utterance (speech waveform) from the caller. For over-the-telephone speech recognition, this step includes the capture of the speech signal by the microphone on the telephone, conversion into an analog waveform, and transmission of the waveform over the telephone network and into the speech recognition system. Once the utterance has been captured by the system, the analog acoustic signal is digitized.
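As a rough illustration of the digitization step, the sketch below samples an analog waveform at the 8 kHz telephone rate and quantizes each sample to 8 bits. The sine-wave “utterance” and the helper name digitize are assumptions for the example; in deployed systems the network typically delivers an already-digitized signal:

import numpy as np

# Toy digitization: sample an analog waveform at the telephone rate of
# 8 kHz and quantize each sample to signed 8-bit integers. The sine-wave
# input stands in for a real utterance (an assumption for this sketch).

SAMPLE_RATE = 8000           # samples per second (telephone standard)
BITS = 8                     # bits per sample

def digitize(analog, duration_s):
    """Sample a callable analog(t) and quantize to signed 8-bit integers."""
    t = np.arange(0, duration_s, 1.0 / SAMPLE_RATE)    # sampling instants
    x = np.clip([analog(ti) for ti in t], -1.0, 1.0)   # amplitudes in [-1, 1]
    levels = 2 ** (BITS - 1) - 1                       # 127 quantization steps
    return np.round(x * levels).astype(np.int8)

# A half-second, 1 kHz tone standing in for a captured utterance.
samples = digitize(lambda t: 0.5 * np.sin(2 * np.pi * 1000.0 * t), 0.5)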

Spectral Representation
After the waveform has been digitized, it is then converted to a representation that can be used by the other components of the software. Digital signal processing is done at this point to normalize variations in the input signal due to telephone system differences, noise, and the like. In addition, signal processing may be used to enhance signal features to make it easier to identify spoken words.
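Continuing from the digitization sketch above, a minimal sketch of one common spectral representation follows: short overlapping frames, a Hamming window, and log-magnitude spectra. The 25-millisecond frame and 10-millisecond hop are typical choices, not values specified in this article, and real systems apply much more elaborate normalization and feature enhancement:

import numpy as np

# Sketch of a spectral representation: short overlapping frames, a
# Hamming window, and log-magnitude spectra. At 8 kHz, 200 samples is a
# 25 ms frame and an 80-sample hop is 10 ms (typical, assumed values).

def spectral_features(samples, frame_len=200, hop=80):
    """Return one log-magnitude spectrum per overlapping frame."""
    x = np.asarray(samples, dtype=np.float64)
    x = x - x.mean()                                 # crude normalization
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * window  # windowed frame
        spectrum = np.abs(np.fft.rfft(frame))        # magnitude spectrum
        frames.append(np.log(spectrum + 1e-10))      # log compression
    return np.array(frames)

features = spectral_features(samples)   # samples from the sketch above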

Segmentation
The segmentation process involves splitting the speech waveform into distinct sounds or segments, each corresponding to a specific consonant or vowel sound, such as “s,” “p,” or “e.” These speech sounds vary in duration, so the segmentation process must be able to demarcate the different sounds. The process works by hypothesizing possible boundaries and determining the likelihood of each possible combination. The output of the segmentation process is a listing of these possible boundaries and their associated probabilities, which is then used to phonetically classify the various sounds.
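The toy sketch below hypothesizes boundaries where the spectral features change most sharply and assigns each a pseudo-probability. A real segment-based recognizer scores complete segmentation hypotheses jointly rather than ranking individual frames, so treat this only as an illustration of the idea:

import numpy as np

# Toy boundary hypothesis: score each frame-to-frame spectral change and
# treat the largest changes as likely segment boundaries. This ranking of
# individual frames is a deliberate simplification.

def boundary_hypotheses(features, top_k=10):
    """Return (frame_index, pseudo-probability) pairs for likely boundaries."""
    # Euclidean distance between consecutive spectral frames.
    change = np.linalg.norm(np.diff(features, axis=0), axis=1)
    # Normalize the changes into pseudo-probabilities in [0, 1].
    probs = change / (change.max() + 1e-10)
    ranked = sorted(enumerate(probs, start=1), key=lambda p: p[1], reverse=True)
    return sorted(ranked[:top_k])            # boundaries in time order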

Phonetic Classification
After the segmentation is complete, the speech recognition software attempts to classify each of the sounds. For example, all the sounds in the English language can be matched to one of 44 basic phonemes. Phonetic classification involves determining possible matches between the sound segments and their phonetic representations (trying to match an “s” sound to the phoneme “s”). This is accomplished by statistically comparing the segments to acoustic models for the various phonemes. These acoustic models of the different phonemes are based on training data of a number of different people speaking words and phrases that contain the various phonemes. System accuracy is directly correlated with the amount of training data that has been collected since this helps to statistically normalize the acoustic models and handle the multitude of ways in which various people speak each of the sounds. The output of this stage is a network or matrix of phonemes — each sound segment will have a list of possible phonemes and an associated probability.
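As an illustration of the statistical comparison described above, the sketch below scores a segment's average feature vector against a diagonal-Gaussian acoustic model per phoneme and normalizes the results into the phoneme/probability list the article mentions. The Gaussian form and the models table are assumptions for the example; the models themselves would be estimated from training data:

import numpy as np

# Sketch of phonetic classification: score a segment's average feature
# vector against a diagonal-Gaussian acoustic model per phoneme, then
# normalize into probabilities. The Gaussian form and the model table are
# assumptions; real models are trained on large amounts of speech data.

def classify_segment(segment_features, models):
    """models: {phoneme: (mean_vector, variance_vector)}, trained elsewhere."""
    v = np.asarray(segment_features).mean(axis=0)   # one vector per segment
    phonemes, scores = [], []
    for phoneme, (mean, var) in models.items():
        # Log likelihood of the segment under a diagonal Gaussian.
        ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (v - mean) ** 2 / var)
        phonemes.append(phoneme)
        scores.append(ll)
    scores = np.array(scores)
    probs = np.exp(scores - scores.max())           # softmax for stability
    probs /= probs.sum()
    # Most likely phonemes first, each with its probability.
    return sorted(zip(phonemes, probs), key=lambda p: p[1], reverse=True)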

Search And Matching
The final step of the speech recognition process involves searching for the word or phrase that most closely matches what the caller said. This process involves matching the network of possible phonemes and their associated probabilities to a lexical network that incorporates the word vocabulary, language models or grammars, and other potential sources of constraints, such as databases.

In other words, this step involves mapping the sets of possible phonemes to the words or phrases that form the vocabulary for the recognizer. Each word or phrase consists of one or more phonemes. The recognizer compares various paths through the phonetic network to the phonetic representation of the words or phrases in its vocabulary. For each possible word or phrase, a “confidence score” or probability measure is generated. There are potentially millions of calculations that occur in this step because of the many possible paths through each network. Thus, the use of “constraints” is a key consideration to help reduce some of the complexity of the task. For example, by using language models that determine the probability of one word following another, various word combinations can be discarded. The use of constraints therefore improves recognition accuracy by reducing the scope and variability of the task.

The output of this stage is an n-best list of the most likely word or phrase matches to the spoken utterance. The corresponding confidence score for each word or phrase measures the probability that it is the correct answer. The n-best list and associated confidence scores provide the raw material for constructing conversational systems. Using natural language capabilities and comprehensive user interface design techniques, versatile, robust applications can then be developed.
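A toy version of the search stage appears below: it scores vocabulary words against a phoneme lattice (one phoneme-to-probability table per segment) and returns an n-best list with log-probability confidence scores. The lexicon, the phoneme symbols, and the exact-length alignment are simplifying assumptions; a real search explores word sequences under language model constraints as well:

import math

# Toy search stage: match a phoneme lattice against a small lexicon and
# return an n-best list. The lexicon and phoneme symbols are hypothetical.

LEXICON = {
    "boston": ["b", "aa", "s", "t", "ax", "n"],
    "austin": ["aa", "s", "t", "ax", "n"],
}

def n_best(lattice, lexicon, n=2):
    """Score each word against the lattice; higher log scores are better."""
    scored = []
    for word, phones in lexicon.items():
        if len(phones) != len(lattice):
            continue                 # a real search allows alignment slack
        score = sum(math.log(segment.get(phone, 1e-6))
                    for phone, segment in zip(phones, lattice))
        scored.append((word, score))
    scored.sort(key=lambda ws: ws[1], reverse=True)
    return scored[:n]                # the n most likely hypotheses

lattice = [{"b": 0.7, "p": 0.2}, {"aa": 0.8}, {"s": 0.9},
           {"t": 0.9}, {"ax": 0.6}, {"n": 0.8}]
print(n_best(lattice, LEXICON))      # -> [('boston', <log score>)]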

NATURAL LANGUAGE CAPABILITIES
Natural language technology can be used to augment speech recognition to provide speech understanding within specific application domains. Natural language processing is a technology for taking a string of words and parsing out the vital elements such that the computer can extract meaning from the words. Simply put, speech recognition software attempts to discern the specific words that the caller said, whereas natural language software attempts to understand what the caller meant. Unconstrained natural language systems — those that provide the ability to speak freeform to the computer — are still a number of years from being commercially realized. However, a number of natural language techniques are currently being applied to extend the functionality and usability of telephone-based speech applications. These techniques include natural language modeling, natural language shortcuts, and discourse management.

Natural Language Modeling
Different callers will undoubtedly respond differently when prompted with the same question. Even with a simple yes/no question, there are roughly 30 ways in which people will typically respond “yes” (“yes,” “yup,” “uh-huh,” “yes, please,” “yeah,” “correct,” “okay,” and the list goes on) and 20 ways in which people will respond “no.” By using natural language modeling, the developer of the speech recognition application can give callers flexibility in how they respond while improving the overall level of understanding of the application. Advanced speech recognition systems can model explicit grammars (e.g., a “BNF grammar”) that define the recognition context the software uses to interpret a response from a caller.

[Note: BNF, originally “Backus Normal Form” and later renamed “Backus-Naur Form,” is a formal metasyntax used to express context-free grammars. It is one of the most commonly used metasyntactic notations for specifying the syntax of programming languages, command sets, and the like.]

Simply put, the recognition vocabulary is the set of words against which the recognition engine attempts to match what the caller is saying, while the recognition grammar defines how those words may be arranged in a phrase or sentence. For example, in a reservations application, the system might ask the caller to speak their destination. The caller might respond in a number of different ways — they might simply say “Boston,” or they might embed the keyword Boston in a phrase, for example, “I want to go to Boston” or “I’m flying to Boston.” Through the explicit modeling of the different ways in which callers typically respond, the software achieves higher recognition accuracy and understanding than can be gained through simple “word spotting” techniques. By matching the caller’s speech to a likely response defined by a grammar, there is less likelihood that words will be misrecognized or misunderstood.
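A toy rendering of such a grammar appears below, written as Python rules rather than actual BNF so the example can run as-is. The rule names and the small city list are assumptions; production systems compile grammars like this into the recognizer's search network rather than expanding every accepted phrase:

# Toy BNF-style grammar for the destination prompt, written as Python
# rules. The rule names and city list are illustrative assumptions.

GRAMMAR = {
    "<request>": [["<preamble>", "<city>"], ["<city>"]],
    "<preamble>": [["i", "want", "to", "go", "to"],
                   ["i'm", "flying", "to"],
                   ["my", "destination", "is"]],
    "<city>": [["boston"], ["san", "francisco"], ["chicago"]],
}

def expand(symbol):
    """Yield every word sequence the grammar accepts for a symbol."""
    if symbol not in GRAMMAR:                    # terminal word
        yield [symbol]
        return
    for alternative in GRAMMAR[symbol]:
        partials = [[]]
        for sym in alternative:
            partials = [p + e for p in partials for e in expand(sym)]
        yield from partials

accepted = list(expand("<request>"))             # all 12 accepted phrases
print(["i'm", "flying", "to", "boston"] in accepted)   # -> True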

Natural Language Shortcuts
The use of complex grammars also allows callers to speak full phrases or sentences and impart multiple pieces of information to the system. Rather than stepping through a transaction, experienced callers often want to take shortcuts through the various steps of the dialogue and fill in several fields with a single sentence, as they might when speaking with a live operator. Again, using a reservations application as an example, a caller might say “I want to travel from Boston to San Francisco leaving at 4 P.M. tomorrow.” With a complex phrase such as this, there are a number of items that the natural language software needs to parse to extract meaning. After the speech recognition engine has output a (hypothesized) string of text representing the phrase, the natural language software attempts to break the phrase down into low-level items on which it can take action (a toy version of this step is sketched after the list):

  • “I” maps to <caller> (established elsewhere in the dialogue);
  • “want to travel” maps to <request: reservation>;
  • “from Boston” maps to <origin> is Boston;
  • “to San Francisco” maps to <destination> is San Francisco;
  • “leaving at 4 P.M.” maps to <departure time> is 4 P.M.;
  • “tomorrow” maps to <departure date> is tomorrow.
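The sketch below shows one way such slot extraction might look in Python. The city list, slot names, and patterns are hypothetical stand-ins; a real natural language component works from the parse produced by the recognition grammar rather than raw string matching:

import re

# Toy slot extraction over the recognized text. The city list, slot
# names, and patterns are hypothetical stand-ins for a real parser.

CITIES = ["boston", "san francisco", "chicago"]

def extract_slots(text):
    """Map fragments of a recognized utterance to action-item slots."""
    text = text.lower()
    slots = {}
    for city in CITIES:
        if "from " + city in text:
            slots["origin"] = city               # departure city
        if "to " + city in text:
            slots["destination"] = city          # arrival city
    match = re.search(r"at (\d{1,2}(?::\d{2})?\s?[ap]\.?m\.?)", text)
    if match:
        slots["departure_time"] = match.group(1)
    for word in ("today", "tomorrow"):
        if word in text:
            slots["departure_date"] = word
    return slots

print(extract_slots(
    "I want to travel from Boston to San Francisco leaving at 4 P.M. tomorrow"))
# -> {'origin': 'boston', 'destination': 'san francisco',
#     'departure_time': '4 p.m.', 'departure_date': 'tomorrow'}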

Natural language shortcuts thus provide a nice extension to a directed dialogue process and allow callers — especially experienced ones — to complete transactions more quickly and efficiently.

Discourse Management
Discourse management techniques help to provide contextual understanding to a conversational application. They establish a frame of reference for the application by determining what pieces of information have been gathered and what remains to be gathered. This is especially important in applications where natural language shortcuts are being combined with directed dialogue. For example, in a stock trading application, a caller might respond to the prompt “Do you want to buy or sell?” with any of the following: “Buy,” “Buy 100 shares,” “Buy 100 shares at 20 and 1/2,” or “Buy 100 shares at 20 and 1/2 good until close.” Under this scenario, the discourse manager needs to track what information has been captured and carry on a dialogue with the caller to capture the remaining pieces of information (the number of shares, limit price, or time limit) to complete the stock purchase order.
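A minimal discourse manager along these lines is sketched below. The slot names, prompt wording, and class shape are assumptions for the example, not a description of any particular product:

# Minimal discourse manager for the stock-trading dialogue: track which
# slots the caller has filled and prompt only for what is still missing.

REQUIRED = ["action", "shares", "limit_price", "time_limit"]
PROMPTS = {
    "action":      "Do you want to buy or sell?",
    "shares":      "How many shares?",
    "limit_price": "At what price?",
    "time_limit":  "Good until when?",
}

class DiscourseManager:
    def __init__(self):
        self.slots = {}                          # information gathered so far

    def update(self, new_slots):
        """Merge whatever the caller's last utterance provided."""
        self.slots.update(new_slots)

    def next_prompt(self):
        """Ask for the first missing piece, or complete the transaction."""
        for slot in REQUIRED:
            if slot not in self.slots:
                return PROMPTS[slot]
        return "Placing your order."

# "Buy 100 shares" fills two slots in one turn; the manager then asks
# only for the remaining pieces of information.
manager = DiscourseManager()
manager.update({"action": "buy", "shares": 100})
print(manager.next_prompt())                     # -> "At what price?"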

User Interface Design
Successful deployment of speech understanding technology involves mapping the capabilities of the technology to the requirements of the task being performed. With this in mind, effort needs to be focused on user interface design and dialogue management. The goal is to be able to discern meaning from the user while providing unobtrusive interaction. A successful user interface design is one in which the questions and expected responses are designed for high recognition accuracy, the user finds the system dialogue pleasant, and the transactions are completed in a timely manner.

An effective approach for aiding understanding accuracy is to design user interfaces that combine directed dialogue with natural language extensions. Taking the reservations application again as an example, the system would not prompt the caller with an open-ended phrase such as “How may I help you?” Rather, it would solicit specific pieces of information with each question, such as “What is your destination?” Natural language modeling is used to handle a range of responses (e.g., “My destination is Boston,” “I’d like to go to Boston,” “I’m traveling to Boston,” “Boston,” “uh, Boston,” etc.) so that callers can speak naturally. Likewise, natural language shortcuts allow the caller to supply multiple pieces of information to speed up the transaction.

CONCLUSION
It is important to note that speech recognition technology will never achieve 100 percent accuracy. Even humans rarely achieve 100 percent accuracy in conversational dialogue over the telephone. In a typical conversation, there is often a fair amount of clarification and confirmation needed, such as, “I’m sorry, what did you say?” or “I didn’t quite catch that, would you please repeat it?” Even though recognition errors often occur, people are very adept at recovering from them and keeping the conversation afloat. A well-designed speech user interface mimics some of the qualities of human-to-human interaction by gracefully recovering from errors, providing a consistent form of interaction, and appearing cooperative to the caller.

In summary, through the combination of speech recognition, natural language techniques, and careful user interface design, speech understanding is now possible for a range of telephone-based applications and services. Computers do not yet have the capability of conversing free-form with humans. However, within specific applications such as reservations, stock quotes and trading, or order entry, the dialogue capabilities and functional utility can indeed be very impressive.

Bill Ledingham is vice president of product development for Applied Language Technologies, Inc. (ALTech), a leading vendor of conversational speech recognition software. ALTech’s SpeechWorks software provides a comprehensive platform for speech-enabling telephone-based transactions and services. For more information, please visit ALTech’s Web site at www.altech.com.






