Feature Article
March 2000

 

Online Exclusive
Talk About Progress: The New Sound Of Automatic Speech Recognition

BY PAM RAVESI


First, the bad news. The bewildering array of vendors offering interactive voice response (IVR) and speech recognition applications has grown wider than ever. But you knew that. The good news is that new technology in this rapidly evolving market is about to narrow the field.

Applications have arrived that combine improved text-to-speech (TTS) and automatic speech recognition (ASR) engines with natural language processing. ASR converts the user's speech to a text sentence of distinct words. TTS converts a text sentence into computer-generated speech. Mediating between them, natural language technology enables the computer to understand what the user is saying.

The combination of these technologies makes it possible for applications to interact with humans through spoken text, eliminating the need for prerecorded voice files or manual input devices. What's more, TTS has passed a milestone: The newest speech synthesis engines combine advances in concatenation (where prerecorded segments of actual speech are knitted together) with new synthesis algorithms. The result is an end to the robotic synthesized voices produced by traditional TTS engines. These engines will be ideal for industries such as telephony that require high-quality voice and can support a large-footprint engine. They will find ideal applications in banking, telecommunications, airline flight booking, and other industries that use phone systems to offer customers dynamic information retrieval from their databases. This advance also represents the clearing of the first hurdle to true natural language dialogue systems that enable two-way conversations with computers.

Previously, developers defined TTS's major stumbling block as inadequate "naturalness," meaning the robotic, unfriendly synthesized voice. The TTS engines dominant in applications throughout the 1980s and early 1990s relied on a technology called "formant synthesis," where a processor generates a waveform, and then runs it through a variety of filters that modify it into a speech wave. Despite the ability to vary word pitch and duration, the sound was decidedly synthetic and hard to listen to. Therefore, practical applications were limited.
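
The formant approach described above can be sketched in a few lines. This is a minimal illustration, not any vendor's engine: it assumes a bare pulse train as the source waveform and a cascade of second-order resonators as the filters, and the formant frequencies and bandwidths are rough values chosen for illustration. The function names are hypothetical.

```python
import math

def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Second-order resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    c = -r * r
    b = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    a = 1.0 - b - c  # unity gain at 0 Hz
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(formants, pitch_hz=120.0, duration_s=0.3, sample_rate=8000):
    """Generate a source waveform, then shape it with one filter per formant."""
    n = round(duration_s * sample_rate)
    period = int(sample_rate / pitch_hz)
    # Source: one pulse per pitch period (a crude stand-in for the glottal source).
    wave = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq_hz, bandwidth_hz in formants:
        wave = resonator(wave, freq_hz, bandwidth_hz, sample_rate)
    peak = max(abs(s) for s in wave) or 1.0
    return [s / peak for s in wave]  # normalize to the range -1..1

# Assumed, illustrative (frequency, bandwidth) pairs for an "ah"-like vowel.
AH_FORMANTS = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]
samples = synthesize_vowel(AH_FORMANTS)
```

Because every sample is computed from a handful of filter parameters, the result is compact and flexible, but nothing in it comes from a real human voice, which is one way to understand the synthetic sound described above.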

As processors and memory have grown in capacity and dropped in price, developers have been able to use larger voice segments, which make it easier to produce natural-sounding speech. At the same time, developers have broken new ground in joining these voice segments effectively to create a smoother, more natural-sounding synthetic voice.

The combination of more voice segments and better ways to link them, plus improved processing and in-depth linguistic rules, provides intelligent and human-sounding pronunciation of variable text input. Add in the ability to generate speech on the fly, and concatenation algorithms are opening the door to a truly interactive IVR.

For call centers, the convergence of TTS and ASR means two things. First, the improved TTS will help to expand users' acceptance of the technology due to the more human-sounding voice. Second, the combination of the more human-sounding TTS with high-quality speech recognizers will enable computers and humans to engage in true dialogues, in which the computer is able to comprehend what a person is saying and ask questions to clarify anything it does not understand.

Was That "Nevada" Or "Nirvana?"
The future of the voice interface in general hinges on computers' ability to interact with users the way a human would. That means computers must generate questions to clarify what they've heard, just like humans do. While pre-recording solved the problem of a realistic voice interface, it restricted the computer to repeating only what the developer anticipated it would need to say, precluding a truly interactive dialogue. That's what's changed.

The newest synthesizers, combined with new ASR technology, enable the computer to generate any question necessary to clarify spoken input. Boosted by the advances in TTS voice quality, developers are turning their attention to creating new natural language dialogue systems that combine TTS with natural language ASR. A natural language dialogue system enables a computer to behave like "Human 2" in the following dialogue:

Human 1: "I would like a ticket to <mumble> on Friday the seventeenth."
Human 2: "What was the destination?"
Human 1: "Boston." <muffled by cell phone interference>
Human 2: "Was that Austin with an 'A' or Boston with a 'B'?"
Human 1: "Boston with a 'B.'"
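
The behavior of "Human 2" in the exchange above can be sketched as a small decision rule. This is only an illustration under stated assumptions: the recognizer is assumed to return a list of candidate words with confidence scores, the thresholds are arbitrary, and the function names are hypothetical rather than part of any real product.

```python
# Letters whose spoken names begin with a vowel sound take "an".
AN_LETTERS = set("AEFHILMNORSX")

def indefinite_article(letter):
    return "an" if letter.upper() in AN_LETTERS else "a"

def next_prompt(hypotheses, accept=0.80, margin=0.15):
    """Pick the next system utterance from (word, confidence) pairs.

    The hypotheses are assumed recognizer output; thresholds are illustrative.
    """
    if not hypotheses:
        # Nothing usable was heard: ask an open question.
        return "What was the destination?"
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    best_word, best_score = ranked[0]
    if len(ranked) > 1:
        second_word, second_score = ranked[1]
        if best_score - second_score < margin:
            # Two candidates are too close to call: ask the caller to choose.
            a, b = sorted([best_word, second_word])
            return (f"Was that {a} with {indefinite_article(a[0])} '{a[0]}' "
                    f"or {b} with {indefinite_article(b[0])} '{b[0]}'?")
    if best_score >= accept:
        return f"Booking a ticket to {best_word}."
    return "What was the destination?"

print(next_prompt([("Boston", 0.48), ("Austin", 0.45)]))
```

The point of the sketch is that the question is composed at run time from whatever the recognizer heard, which is exactly what a fixed set of prerecorded prompts cannot do.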

New Life For Older Technology
Basic speech synthesis is a two-step process. First, standard text is converted into a phonetic representation with markers for stress and other pronunciation guides. Then, the voice is created through a synthesis process, via a digital signal processor (DSP), a microprocessor, or both. The phonetic representation then becomes spoken sound.
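
As a rough sketch of the two steps just described, the code below converts text to a phonetic representation with stress markers and then turns that representation into a simple timing plan. Everything here is assumed for illustration: the three-word lexicon is hand-written, the stress digits (1 for stressed, 0 for unstressed) are one common notation, and the timing plan is only a stand-in for the real synthesis stage, which would run on a DSP or microprocessor.

```python
import re

# Hypothetical toy lexicon: word -> phonetic symbols with stress digits.
TOY_LEXICON = {
    "ticket": ["T", "IH1", "K", "AH0", "T"],
    "to": ["T", "UW1"],
    "boston": ["B", "AO1", "S", "T", "AH0", "N"],
}

def text_to_phonemes(text):
    """Step 1: standard text -> phonetic representation with stress markers."""
    words = re.findall(r"[a-z']+", text.lower())
    phonemes = []
    for word in words:
        if word not in TOY_LEXICON:
            raise KeyError(f"no pronunciation for {word!r}")
        phonemes.extend(TOY_LEXICON[word])
        phonemes.append("|")  # word boundary marker
    return phonemes[:-1]  # drop the trailing boundary (no-op on an empty list)

def phonemes_to_plan(phonemes, base_ms=80, stress_ms=40, pause_ms=30):
    """Step 2 (stand-in): give each sound a duration; stressed vowels run longer."""
    plan = []
    for p in phonemes:
        if p == "|":
            plan.append(("pause", pause_ms))
        else:
            extra = stress_ms if p.endswith("1") else 0
            plan.append((p, base_ms + extra))
    return plan

phonemes = text_to_phonemes("Ticket to Boston")
plan = phonemes_to_plan(phonemes)
```

A real engine would replace the lexicon with pronunciation rules plus a large dictionary, and the plan with actual waveform generation.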

The new ASR engines use natural language understanding, an artificial intelligence-based technology, to understand speech. The technology augments traditional speech recognition (converting spoken sounds to digital symbols) with grammar-based language understanding software. The computer can then create a version of the abstract meaning of the spoken words.

Speech recognition software applies basic grammatical rules to parse the sentence into its parts: subject, verb, object, etc. The ASR engine then applies natural language understanding to determine the meaning of the sentence, and formats the user's question as a series of commands that the system can understand. Once the system has processed these commands and composed a response sentence, the speech synthesizer converts that sentence into spoken words.
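
A toy version of this parse-then-command step might look like the following. It is a sketch under heavy assumptions: a single regular expression stands in for the grammar, it covers only one sentence shape, and the command format and names are hypothetical.

```python
import re

# One hard-coded sentence pattern standing in for grammar-based parsing.
BOOKING_PATTERN = re.compile(
    r"(?P<subject>i) (?P<verb>would like|want|need) (?P<object>a ticket)"
    r" to (?P<destination>[a-z ]+?) on (?P<day>[a-z]+)\b"
)

def understand(utterance):
    """Map recognized text to a command the back-end system could act on."""
    text = utterance.lower().strip(" .")
    match = BOOKING_PATTERN.match(text)
    if match is None:
        # The sentence did not fit the grammar: fall back to a clarifying question.
        return {"action": "clarify", "prompt": "Sorry, what would you like to do?"}
    return {
        "action": "book_ticket",
        "destination": match["destination"].title(),
        "day": match["day"].title(),
    }

command = understand("I would like a ticket to Boston on Friday the seventeenth.")
```

Here `command` carries the abstract meaning of the request (an action plus its parameters) rather than the words themselves, which is what lets the rest of the system act on it.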

Vendors hope that a more human dialogue system will open the door to a wealth of new network services, including remote e-mail, remote database access, voice mail, and faxing. The natural fit between speech recognition and the call center is being played out in the rising popularity of these and other emerging applications. As ASR and TTS continue to evolve, industry observers see continued growth and new speech-enabled applications and services in the future.

Easy As A-B-C
Of the two main TTS technologies -- formant and concatenation synthesis -- it's the latter, with its process of splicing processed speech fragments into recognizable human speech, that is leading the way in TTS. Concatenation systems use chips to store tiny segments of actual recorded human speech -- fragments and combinations of the irreducible units of sound that make up words in all languages. The challenge of incorporating this technology in call center applications was twofold.

The first challenge was in balancing speech quality with the limitations of computer memory. Developers realized that the larger the segments of speech they used, the more natural the voice would sound. But storing and accessing these larger segments required more memory than the processing technology of the time would practically allow.
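
The memory side of this tradeoff is simple arithmetic. The figures below are assumptions chosen only to show the scale of the difference (unit counts, unit lengths, and an 8 kHz, 16-bit telephone-quality recording format); they are not taken from any particular engine.

```python
def inventory_bytes(unit_count, unit_ms, sample_rate_hz=8000, bytes_per_sample=2):
    """Uncompressed storage needed for an inventory of recorded speech units."""
    samples_per_unit = sample_rate_hz * unit_ms // 1000
    return unit_count * samples_per_unit * bytes_per_sample

# Assumed small inventory of short units: 1,500 units of 100 ms each.
small = inventory_bytes(unit_count=1500, unit_ms=100)

# Assumed larger inventory of longer units: 10,000 units of 250 ms each.
large = inventory_bytes(unit_count=10000, unit_ms=250)

print(small, large)  # 2400000 40000000
```

Under these assumptions, moving to longer units multiplies the storage requirement by more than sixteen, which illustrates why larger segments had to wait for cheaper memory.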

Second, because of the nature of phonetic speech, joining the speech segments together in a natural way was also difficult. Developers refer to the fluid contours of continuous human speech as intonation, melody, and prosody. Without them, computer-generated speech sounds uneven, disjointed, and obviously artificial -- previous TTS engines' major shortfalls.

Developers have taken advantage of cheaper, more powerful processors to use larger voice segments that make it easier to develop more natural-sounding speech. At the same time, they have broken new ground in the algorithms used to join these voice segments effectively. A new generation of better TTS engines is now hitting the market. Many developers are satisfied they have effectively removed the barrier to a workable, truly conversational interface by generating natural-sounding speech. This is what is driving the industry on to its next stage.
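
To make the idea of joining voice segments concrete, here is a minimal sketch that blends the end of one segment into the start of the next. A plain linear crossfade is used as the simplest possible stand-in; the article does not say which joining algorithms the newer engines use, and production systems are considerably more sophisticated. The function name is hypothetical.

```python
def crossfade_join(left, right, overlap):
    """Join two sample lists, blending `overlap` samples across the seam."""
    if overlap < 0 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must fit inside both segments")
    if overlap == 0:
        return list(left) + list(right)  # hard splice: prone to audible clicks
    head = list(left[:-overlap])
    tail = list(right[overlap:])
    blend = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight ramps from left toward right
        blend.append((1.0 - w) * left[len(left) - overlap + i] + w * right[i])
    return head + blend + tail

# Two toy "segments" at different levels, joined with a 3-sample blend.
joined = crossfade_join([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], overlap=3)
print(joined)  # [1.0, 0.75, 0.5, 0.25, 0.0]
```

With no overlap the signal would jump straight from 1.0 to 0.0 at the seam; the blend replaces that jump with a gradual transition, which is the basic intuition behind smoother-sounding concatenated speech.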

Dialing In To The Future
The achievement of a truly natural-sounding human voice is already making current TTS and ASR applications much more compelling. But the future of the voice interface hinges on the computer's ability to interact with users conversationally, like a human would.

The growth of computer processing power will eventually enable developers to go beyond the natural-sounding voice itself, to create applications that speak as naturally as any expressive and perceptive reader. They will assume voices for the two sides of a dialogue, and will anticipate the cause and effect of various events. A person reading aloud can appreciate tone and meaning, and express humor, irony, or the contextual meaning of a narrative's elements. Computers will have the intelligence to add a high level of understanding and contextualization to the prosody of synthetic speech, and will be able to formulate and ask any question.

E-mail, unified messaging systems, data access, security systems, text-based sales and services of all kinds, navigation systems, personal computer-based agents, server-based telephony, voice mail systems, and new telephone directory services are just a few places to look for TTS and ASR in the near future, where actual dialogue will replace cumbersome keypad menus. Consumers can already easily retrieve information from automated systems, where a perfectly natural-sounding voice reads their e-mail, account information, news headlines, stock quotes, or Web pages. Some technology watchers predict the future will be filled with devices that converse with us, from our houses and cars to our wristwatches and cellular phones. Whether we see those futuristic applications of ASR or not, one thing is certain: This technology is coming soon to a call center near you.

Pam Ravesi is the senior director of product management for Lernout & Hauspie (L&H). L&H is a global leader in advanced speech and language solutions for vertical markets, computers, automobiles, telecommunications, embedded products, consumer goods, and the Internet. The company is making the speech user interface (SUI) the keystone of simple, convenient interaction between humans and technology, and is using advanced translation technology to break down language barriers. The company provides a wide range of offerings, including: customized solutions for corporations; core speech technologies marketed to OEMs; end user and retail applications for continuous speech products in horizontal and vertical markets; and document creation, human and machine translation services, Internet translation offerings, and linguistic tools.
