TMCnet - The World's Largest Communications and Technology Community

Feature Article
March 2000

 

Online Exclusive
Talk About Progress: The New Sound Of Automatic Speech Recognition

BY PAM RAVESI


First, the bad news. The bewildering array of vendors offering interactive voice response (IVR) and speech recognition applications has grown wider than ever. But you knew that. The good news is that new technology in this rapidly evolving market is about to narrow the field.

Applications have arrived that combine improved text-to-speech (TTS) and automatic speech recognition (ASR) engines with natural language processing. ASR converts the user's speech to a text sentence of distinct words. TTS converts a text sentence into computer-generated speech. Mediating between them, natural language technology enables the computer to understand what the user is saying.

The combination of these technologies makes it possible for applications to interact with humans through spoken text, eliminating the need for prerecorded voice files or manual input devices. What's more, TTS has passed a milestone: The newest speech synthesis engines combine advances in concatenation (where prerecorded segments of actual speech are knitted together) with new synthesis algorithms. The result is an end to the robotic synthesized voices produced by traditional TTS engines. These engines are well suited to industries such as telephony that require high-quality voice and can support a large-footprint engine. They will find natural applications in banking, telecommunications, airline flight booking, and other industries that use phone systems to offer customers dynamic information retrieval from their databases. This advance also clears the first hurdle to true natural language dialogue systems that enable two-way conversations with computers.

Previously, developers defined TTS's major stumbling block as inadequate "naturalness," meaning the robotic, unfriendly synthesized voice. The TTS engines dominant in applications throughout the 1980s and early 1990s relied on a technology called "formant synthesis," where a processor generates a waveform, and then runs it through a variety of filters that modify it into a speech wave. Despite the ability to vary word pitch and duration, the sound was decidedly synthetic and hard to listen to. Therefore, practical applications were limited.
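As a rough illustration of the generate-then-filter idea described above, the following Python sketch produces a buzz-like pulse train and passes it through two resonant filters. The formant frequencies, bandwidths, pitch, and function names are illustrative assumptions, not values taken from any particular engine.

```python
import math

def pulse_train(pitch_hz, duration_s, sample_rate):
    """Source signal: one unit pulse per pitch period, zeros elsewhere."""
    period = int(sample_rate / pitch_hz)
    count = int(round(duration_s * sample_rate))
    return [1.0 if i % period == 0 else 0.0 for i in range(count)]

def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole filter that emphasizes energy near freq_hz (one formant)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, below 1
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    a2 = -r * r
    gain = 1.0 - r  # rough scaling so the output stays small
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = gain * x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(pitch_hz=120.0,
                     formants=((700.0, 110.0), (1200.0, 120.0)),  # assumed values
                     duration_s=0.1, sample_rate=8000):
    """Run the pulse train through one resonator per formant, in series."""
    wave = pulse_train(pitch_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        wave = resonator(wave, freq_hz, bandwidth_hz, sample_rate)
    return wave

samples = synthesize_vowel()
```

Changing `pitch_hz` or `duration_s` varies pitch and length, but every sound still comes from the same small set of filters, which is one way to see why the result sounded synthetic.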

As processors and memory have grown in capacity and dropped in price, developers have used larger voice segments that make it easier to develop more natural-sounding speech. At the same time, developers have broken new ground in the ability to join these voice segments effectively to create a smoother, more natural-sounding synthetic voice.

The combination of more voice segments and better ways to link them, plus improved processing and in-depth linguistic rules, provides intelligent and human-sounding pronunciation of variable text input. Add in the ability to generate speech on the fly, and concatenation algorithms are opening the door to a truly interactive IVR.

For call centers, the convergence of TTS and ASR means two things. First, the improved TTS will help to expand users' acceptance of the technology due to the more human-sounding voice. Second, the combination of the more human-sounding TTS with high-quality speech recognizers will enable computers and humans to engage in true dialogues, in which the computer is able to comprehend what a person is saying and ask questions to clarify anything it does not understand.

Was That "Nevada" Or "Nirvana?"
The future of the voice interface in general hinges on computers' ability to interact with users the way a human would. That means computers must generate questions to clarify what they've heard, just as humans do. While pre-recording solved the problem of a realistic voice interface, it restricted the computer to repeating only what the developer anticipated it would need to say, precluding a truly interactive dialogue. That's what's changed.

The newest synthesizers, combined with new ASR technology, enable the computer to generate any question necessary to clarify spoken input. Boosted by the advances in TTS voice quality, developers are turning their attention to creating new natural language dialogue systems that combine TTS with natural language ASR. A natural language dialogue system enables a computer to behave like "Human 2" in the following dialogue:

Human 1: "I would like a ticket to <mumble> on Friday the seventeenth."
Human 2: "What was the destination?"
Human 1: "Boston." <muffled by cell phone interference>
Human 2: "Was that Austin with an 'A' or Boston with a 'B'?"
Human 1: "Boston with a 'B.'"
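
The exchange above can be sketched as a simple decision rule: ask an open question when nothing was heard clearly, ask a disambiguating question when two candidates score nearly the same, and otherwise accept the best guess. This is only a sketch in Python; the candidate list, the scores, and the two thresholds are hypothetical and do not reflect any real recognizer's interface.

```python
def article_for(letter):
    """Pick 'a' or 'an' based on how the letter's name is pronounced."""
    return "an" if letter.upper() in "AEFHILMNORSX" else "a"

def clarify(candidates, margin=0.15, floor=0.3):
    """Decide how to respond to a recognized destination.

    candidates: list of (word, score) pairs from a hypothetical recognizer,
    with scores between 0 and 1. Returns ("accept", word) or ("ask", question).
    """
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    # Nothing usable was heard: ask an open question.
    if not ranked or ranked[0][1] < floor:
        return ("ask", "What was the destination?")
    # Two candidates are too close to call: ask the user to choose.
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
        first, second = ranked[0][0], ranked[1][0]
        question = (
            f"Was that {first} with {article_for(first[0])} '{first[0].upper()}' "
            f"or {second} with {article_for(second[0])} '{second[0].upper()}'?"
        )
        return ("ask", question)
    return ("accept", ranked[0][0])

print(clarify([("Austin", 0.48), ("Boston", 0.45)]))
```

Because the question is assembled from the candidates at run time, it could not have been prerecorded in advance, which is why this style of dialogue depends on TTS.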

New Life For Older Technology
Basic speech synthesis is a two-step process. First, standard text is converted into a phonetic representation with markers for stress and other pronunciation guides. Then, the voice is created through a synthesis process, via a digital signal processor (DSP), a microprocessor, or both. The phonetic representation then becomes spoken sound.
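
The first of those two steps can be sketched in Python with a tiny hand-written lexicon. The phoneme symbols below are loosely modeled on ARPAbet-style notation, with a trailing digit marking stress; the entries, the symbols chosen, and the function name are illustrative assumptions. A production engine would rely on a large dictionary plus letter-to-sound rules rather than three words.

```python
import re

# Hypothetical mini-lexicon: word -> phonemes, digits mark stress (1 = stressed).
LEXICON = {
    "boston": ["B", "AO1", "S", "T", "AH0", "N"],
    "austin": ["AO1", "S", "T", "AH0", "N"],
    "friday": ["F", "R", "AY1", "D", "EY0"],
}

def to_phonetic(text):
    """Step one of synthesis: map each word to a phonetic representation.

    Returns a list of (word, phonemes) pairs; phonemes is None for words
    that are missing from the lexicon.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return [(word, LEXICON.get(word)) for word in words]

for word, phonemes in to_phonetic("Boston on Friday"):
    print(word, phonemes)
```

The second step, turning these symbols into sound, is where the formant and concatenation approaches differ.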

The new ASR engines use natural language understanding, an artificial intelligence-based technology, to understand speech. The technology augments traditional speech recognition (converting spoken sounds to digital symbols) with grammar-based language understanding software. The computer can then create a version of the abstract meaning of the spoken words.

Speech recognition software applies basic grammatical rules to parse the sentence into its parts: subject, verb, object, etc. The ASR engine applies natural language understanding to determine the meaning of the sentence, and formats a question as a series of commands that the system can understand. Once these commands have been processed into a sentence, the speech synthesizer converts the sentence into spoken words.
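
To make the idea of turning a recognized sentence into system commands more concrete, here is a deliberately small Python sketch. A single regular expression stands in for grammar-based understanding, which is a large simplification; the pattern, the `book_ticket` command name, and the field names are all assumptions made for illustration.

```python
import re

# One hand-written pattern standing in for a real grammar (an assumption).
TICKET_PATTERN = re.compile(
    r"ticket to (?P<destination>[A-Za-z]+)(?: on (?P<date>[A-Za-z ]+))?",
    re.IGNORECASE,
)

def understand(sentence):
    """Map recognized text to a command dictionary, or None if no match."""
    match = TICKET_PATTERN.search(sentence)
    if match is None:
        return None
    date = (match.group("date") or "").strip()
    return {
        "command": "book_ticket",          # hypothetical command name
        "destination": match.group("destination"),
        "date": date or None,              # None means the date was not heard
    }

print(understand("I would like a ticket to Boston on Friday the seventeenth."))
```

A field left as `None` is the signal that the system still needs to ask the caller a question before it can act.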

Vendors hope that a more human dialogue system will open the door to a wealth of new network services, including remote e-mail, remote database access, voice mail, and faxing. The natural fit between speech recognition and the call center is being played out in the rising popularity of these and other emerging applications. As ASR and TTS continue to evolve, industry observers see continued growth and new speech-enabled applications and services in the future.

Easy As A-B-C
Of the two main TTS technologies -- formant and concatenation synthesis -- it's the latter, with its process of splicing processed speech fragments into recognizable human speech, that is leading the way in TTS. Concatenation systems use chips to store tiny segments of actual recorded human speech -- fragments and combinations of the irreducible units of sound that make up words in all languages. The challenge to incorporating this technology in call center applications was two-fold.

The first challenge was balancing speech quality against the limitations of computer memory. Developers realized that the larger the segments of speech they used, the more natural the voice would sound. But storing and accessing these segments required more memory than the technology of the time would practically allow.

Second, because of the nature of phonetic speech, joining the speech segments together in a natural way was also difficult. Developers refer to the fluid contours of continuous human speech as intonation, melody, and prosody. Without them, computer-generated speech sounds uneven, disjointed, and obviously artificial -- the major shortfalls of previous TTS engines.

Developers have taken advantage of cheaper, more powerful processors to use larger voice segments that make it easier to develop more natural-sounding speech. At the same time, they have broken new ground in the algorithms used to join these voice segments effectively. A new generation of better TTS engines is now hitting the market. Many developers are satisfied they have effectively removed the barrier to a workable, truly conversational interface by generating natural-sounding speech. This is what is driving the industry on to its next stage.
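
One simple way to picture the joining problem is a linear crossfade, where the end of one recorded segment is blended into the start of the next instead of being butted against it. The Python sketch below shows that idea only; it is not the algorithm used by any particular engine, and the function names and the overlap length are assumptions.

```python
def crossfade_join(left, right, overlap):
    """Join two lists of samples, blending `overlap` samples at the seam."""
    overlap = min(overlap, len(left), len(right))
    if overlap == 0:
        return list(left) + list(right)
    blended = []
    for i in range(overlap):
        weight = (i + 1) / (overlap + 1)       # ramps up across the seam
        a = left[len(left) - overlap + i]      # tail of the left segment
        b = right[i]                           # head of the right segment
        blended.append((1.0 - weight) * a + weight * b)
    return list(left[:len(left) - overlap]) + blended + list(right[overlap:])

def concatenate(segments, overlap=4):
    """Chain any number of segments, crossfading at every join."""
    out = []
    for segment in segments:
        out = crossfade_join(out, segment, overlap) if out else list(segment)
    return out

joined = crossfade_join([1.0] * 4, [0.0] * 4, overlap=2)
print(joined)
```

A crossfade smooths the waveform at the seam but does nothing for pitch or timing, so it addresses only part of what the article calls prosody.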

Dialing In To The Future
The achievement of a truly natural-sounding human voice is already making current TTS and ASR applications much more compelling. But the future of the voice interface hinges on the computer's ability to interact with users conversationally, like a human would.

The growth of computer processing power will eventually enable developers to go beyond the natural-sounding voice itself, to create applications that speak as naturally as any expressive and perceptive reader. They will assume voices for the two sides of a dialogue, and will anticipate the cause and effect of various events. A person reading aloud can appreciate tone and meaning, and express humor, irony, or the contextual meaning of a narrative's elements. Computers will have the intelligence to add a high level of understanding and contextualization to the prosody of synthetic speech, and will be able to formulate and ask any question.

E-mail, unified messaging systems, data access, security systems, text-based sales and services of all kinds, navigation systems, personal computer-based agents, server-based telephony, voice mail systems, and new telephone directory services are just a few places to look for TTS and ASR in the near future, where actual dialogue will replace cumbersome keypad menus. Consumers can already easily retrieve information from automated systems, where a perfectly natural-sounding voice reads their e-mail, account information, news headlines, stock quotes, or Web pages. Some technology watchers predict the future will be filled with devices that converse with us, from our houses and cars to our wristwatches and cellular phones. Whether we see those futuristic applications of ASR or not, one thing is certain: This technology is coming soon to a call center near you.

Pam Ravesi is the senior director of product management for Lernout & Hauspie (L&H). L&H is a global leader in advanced speech and language solutions for vertical markets, computers, automobiles, telecommunications, embedded products, consumer goods and the Internet. The company is making the speech user interface (SUI) the keystone of simple, convenient interaction between humans and technology, and is using advanced translation technology to break down language barriers. The company provides a wide range of offerings, including: customized solutions for corporations; core speech technologies marketed to OEMs; end user and retail applications for continuous speech products in horizontal and vertical markets; and document creation, human and machine translation services, Internet translation offerings, and linguistic tools.

