First, the bad news. The bewildering array of vendors offering
interactive voice response (IVR) and speech recognition applications has
grown wider than ever. But you knew that. The good news is that new
technology in this rapidly evolving market is about to narrow the field.
Applications have arrived that combine improved text-to-speech (TTS)
and automatic speech recognition (ASR) engines with natural language
processing. ASR converts the user's speech to a text sentence of distinct
words. TTS converts a text sentence into computer-generated speech.
Mediating between them, natural language technology enables the computer
to understand what the user is saying.
The combination of these technologies makes it possible for
applications to interact with humans through spoken text, eliminating the
need for prerecorded voice files or manual input devices. What's more, TTS
has passed a milestone: The newest speech synthesis engines combine
advances in concatenation (where prerecorded segments of actual speech are
knitted together) with new synthesis algorithms. The result is an end to
the robotic synthesized voices produced by traditional TTS engines. These
engines will be ideal for industries such as telephony that require
high-quality voice and can support a large-footprint engine. They will
find ideal applications in banking, telecommunications, airline flight
booking, and other industries that use phone systems to offer customers
dynamic information retrieval from their databases. This advance also
represents the clearing of the first hurdle to true natural language
dialogue systems that enable two-way conversations with computers.
Previously, developers defined TTS's major stumbling block as
inadequate "naturalness," meaning the robotic, unfriendly
synthesized voice. The TTS engines dominant in applications throughout the
1980s and early 1990s relied on a technology called "formant
synthesis," where a processor generates a waveform, and then runs it
through a variety of filters that modify it into a speech wave. Despite
the ability to vary word pitch and duration, the sound was decidedly
synthetic and hard to listen to. Therefore, practical applications were
limited.
As processors and memory continue to grow in capacity and drop in
price, developers have used larger voice segments that make it easier to
develop more natural-sounding speech. At the same time, developers have
broken new ground in the ability to join these voice segments effectively
to create a smoother, more natural sounding synthetic voice.
The combination of more voice segments and better ways to link them,
plus improved processing and in-depth linguistic rules, provides
intelligent and human-sounding pronunciation of variable text input. Add
in the ability to generate speech on the fly, and concatenation algorithms
are opening the door to a truly interactive IVR.
For call centers, the convergence of TTS and ASR means two things.
First, the improved TTS will help to expand users' acceptance of the
technology due to the more human-sounding voice. Second, the combination
of the more human-sounding TTS with high-quality speech recognizers will
enable computers and humans to engage in true dialogues, in which the
computer is able to comprehend what a person is saying and ask questions
to clarify anything it does not understand.
Was That "Nevada" Or "Nirvana"?
The future of the voice interface in general hinges on computers' ability
to interact with users the way a human would. That means computers must
generate questions to clarify what they've heard, just like humans do.
While pre-recording solved the problem of a realistic voice interface, it
restricted the computer to repeating only what the developer anticipated
it would need to say, precluding a truly interactive dialogue. That's
what's changed.
The newest synthesizers, combined with new ASR technology, enable the
computer to generate any question necessary to clarify spoken input.
Boosted by the advances in TTS voice quality, developers are turning their
attention to creating new natural language dialogue systems that combine
TTS with natural language ASR. A natural language dialogue system enables
a computer to behave like "Human 2" in the following dialogue:
Human 1: "I would like a ticket to <mumble> on Friday the seventeenth."
Human 2: "What was the destination?"
Human 1: "Boston." <muffled by cell phone interference>
Human 2: "Was that Austin with an 'A' or Boston with a 'B'?"
Human 1: "Boston with a 'B.'"
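The clarification behavior in the exchange above can be sketched as a small slot-filling routine. Everything here is illustrative: the `CONFUSABLE` table, the confidence threshold, and the function name are assumptions for the sketch, not any vendor's actual API.

```python
# A minimal sketch of a slot-filling dialogue manager that asks
# clarifying questions, as in the Human 1 / Human 2 exchange above.

CONFUSABLE = {"boston": ["austin"]}  # acoustically similar city names

def next_prompt(slots, hypothesis=None, confidence=1.0):
    """Return the system's next question, or None when booking is complete."""
    if slots.get("destination") is None:
        if hypothesis is None:
            return "What was the destination?"
        if confidence < 0.8 and hypothesis.lower() in CONFUSABLE:
            # Low-confidence recognition of a confusable word: ask, don't guess.
            alt = CONFUSABLE[hypothesis.lower()][0]
            return (f"Was that {alt.title()} with an '{alt[0].upper()}' "
                    f"or {hypothesis.title()} with a '{hypothesis[0].upper()}'?")
        slots["destination"] = hypothesis
    if slots.get("date") is None:
        return "What day would you like to travel?"
    return None  # all slots filled

slots = {"destination": None, "date": "Friday the seventeenth"}
q1 = next_prompt(slots)                            # asks for the destination
q2 = next_prompt(slots, "Boston", confidence=0.4)  # asks to disambiguate
q3 = next_prompt(slots, "Boston", confidence=0.95) # fills the slot; done
```

A real dialogue system would drive the two prompts through TTS and feed the caller's answers back through ASR, but the control logic is the same shape.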
New Life For Older Technology
Basic speech synthesis is a two-step process. First, standard text is
converted into a phonetic representation with markers for stress and other
pronunciation guides. Then, the voice is created through a synthesis
process, via a digital signal processor (DSP), a microprocessor, or both.
The phonetic representation then becomes spoken sound.
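The two-step process can be illustrated with a toy pipeline. The tiny lexicon and the stubbed-out synthesis back end are stand-ins, not a real engine; a production system would use a full pronunciation dictionary plus letter-to-sound rules, and the second step would drive a DSP or software synthesizer.

```python
# Toy sketch of the two-step TTS pipeline described above:
# (1) text -> phonetic representation with stress markers,
# (2) phonetic representation -> audio via a synthesis back end.

LEXICON = {
    "hello": "HH AH0 L OW1",  # ARPAbet-style phonemes; digits mark stress
    "world": "W ER1 L D",
}

def text_to_phonemes(text):
    """Step 1: look each word up in a pronunciation lexicon."""
    return [LEXICON.get(word.lower(), "?") for word in text.split()]

def phonemes_to_audio(phoneme_seq):
    """Step 2 (stub): a real engine would render a waveform here,
    via a DSP, a microprocessor, or both."""
    return f"<waveform for: {' | '.join(phoneme_seq)}>"

phonemes = text_to_phonemes("hello world")
audio = phonemes_to_audio(phonemes)
```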
The new ASR engines use natural language understanding, an artificial
intelligence-based technology, to understand speech. The technology
augments traditional speech recognition (converting spoken sounds to
digital symbols) with grammar-based language understanding software. The
computer can then create a version of the abstract meaning of the spoken
words.
Speech recognition software applies basic grammatical rules to parse
the sentence into its parts: subject, verb, object, and so on. The ASR
engine then applies natural language understanding to determine the
meaning of the sentence and formats the result as a series of commands
the system can act on. When the system needs to respond, it composes its
reply as a text sentence, and the speech synthesizer converts that
sentence into spoken words.
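The NLU step described above, recognized text in, structured command out, can be sketched with a toy grammar. The single regular-expression pattern and the `book_ticket` command name are assumptions for illustration; real systems use far richer grammars.

```python
# A minimal sketch of the NLU step: parse a recognized sentence and
# format its abstract meaning as a command the system can act on.
import re

# Toy grammar covering one sentence shape from the ticket-booking dialogue.
PATTERN = re.compile(
    r"i would like a ticket to (?P<destination>\w+)"
    r"(?: on (?P<date>.+))?",
    re.IGNORECASE)

def understand(sentence):
    """Map a recognized sentence to a structured command, or None."""
    match = PATTERN.match(sentence.strip().rstrip("."))
    if not match:
        return None  # outside the grammar; a dialogue system would ask again
    return {"action": "book_ticket",
            "destination": match.group("destination"),
            "date": match.group("date")}

cmd = understand("I would like a ticket to Boston on Friday the seventeenth")
```

Here `cmd` carries the abstract meaning (`action`, `destination`, `date`) rather than the surface words, which is what lets the rest of the system decide whether it has enough information or must generate a clarifying question.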
Vendors hope that a more human dialogue system will open the door to a
wealth of new network services, including remote e-mail, remote database
access, voice mail, and faxing. The natural fit between speech recognition
and the call center is being played out in the rising popularity of these
and other emerging applications. As ASR and TTS continue to evolve,
industry observers see continued growth and new speech-enabled
applications and services in the future.
Easy As A-B-C
Of the two main TTS technologies -- formant and concatenation synthesis --
it's the latter, with its process of splicing processed speech fragments
into recognizable human speech, that is leading the way in TTS.
Concatenation systems use chips to store tiny segments of actual recorded
human speech -- fragments and combinations of the irreducible units of
sound that make up words in all languages. The challenge to incorporating
this technology in call center applications was two-fold.
The first challenge was in balancing speech quality with the
limitations of computer memory. Developers realized that the larger the
segments of speech they used, the more natural the voice would sound.
But storing and accessing those larger segments demanded more memory
than processing technology could practically provide.
Second, because of the nature of phonetic speech, joining the speech
segments together in a natural way was also difficult. Developers refer to
the fluid contours of continuous human speech as intonation, melody, and
prosody. Without these qualities, computer-generated speech sounds uneven, disjointed
and obviously artificial -- previous TTS engines' major shortfalls.
Developers have taken advantage of cheaper, more powerful processors to
use larger voice segments that make it easier to develop more
natural-sounding speech. At the same time, they have broken new ground in
the algorithms used to join these voice segments effectively. A new
generation of better TTS engines is now hitting the market. Many
developers are satisfied they have effectively removed the barrier to a
workable, truly conversational interface by generating natural-sounding
speech. This is what is driving the industry on to its next stage.
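The splicing at the heart of concatenative synthesis can be illustrated with a toy example. The stored "segments" here are short lists of fake sample amplitudes and the crossfade is a simple average; a real engine stores recorded diphones or larger units and uses far more sophisticated join algorithms.

```python
# A toy illustration of concatenative synthesis: stored segments of
# "recorded" speech are looked up and spliced, with a short crossfade
# at each join to smooth the boundary between segments.

SEGMENTS = {  # hypothetical recorded units, as sample amplitudes
    "h-e": [0.0, 0.2, 0.4],
    "e-l": [0.4, 0.5, 0.3],
    "l-o": [0.3, 0.1, 0.0],
}

def concatenate(units, overlap=1):
    """Splice stored segments, blending `overlap` samples at each join."""
    out = list(SEGMENTS[units[0]])
    for unit in units[1:]:
        segment = SEGMENTS[unit]
        for i in range(overlap):  # crossfade the boundary samples
            out[-overlap + i] = (out[-overlap + i] + segment[i]) / 2
        out.extend(segment[overlap:])
    return out

wave = concatenate(["h-e", "e-l", "l-o"])
```

Because each stored unit ends near where the next begins, the blended joins avoid the audible discontinuities that made earlier synthetic voices sound disjointed.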
Dialing In To The Future
The achievement of a truly natural-sounding human voice is already making
current TTS and ASR applications much more compelling. But the future of
the voice interface hinges on the computer's ability to interact with
users conversationally, like a human would.
The growth of computer processing power will eventually enable
developers to go beyond the natural-sounding voice itself, to create
applications that speak as naturally as any expressive and perceptive
reader. They will assume voices for the two sides of a dialogue, and will
anticipate the cause and effect of various events. A person reading aloud
can appreciate tone and meaning, and express humor, irony, or the
contextual meaning of a narrative's elements. Computers will have the
intelligence to add a high level of understanding and contextualization to
the prosody of synthetic speech, and will be able to formulate and ask any
question.
E-mail, unified messaging systems, data access, security systems,
text-based sales and services of all kinds, navigation systems, personal
computer-based agents, server-based telephony, voice mail systems, and new
telephone directory services are just a few places to look for TTS and ASR
in the near future, where actual dialogue will replace cumbersome key pad
menus. Consumers can already easily retrieve information from automated
systems, where a perfectly natural-sounding voice reads their e-mail,
account information, news headlines, stock quotes, or Web pages. Some
technology watchers predict the future will be filled with devices that
converse with us, from our houses and cars to our wristwatches and
cellular phones. Whether we see those futuristic applications of ASR or
not, one thing is certain: This technology is coming soon to a call center
near you.
Pam Ravesi is the senior director of product management for Lernout
& Hauspie (L&H). L&H is a global leader in advanced speech
and language solutions for vertical markets, computers, automobiles,
telecommunications, embedded products, consumer goods and the Internet.
The company is making the speech user interface (SUI) the keystone of
simple, convenient interaction between humans and technology, and is using
advanced translation technology to break down language barriers. The
company provides a wide range of offerings, including: customized
solutions for corporations; core speech technologies marketed to OEMs; end
user and retail applications for continuous speech products in horizontal
and vertical markets; and document creation, human and machine translation
services, Internet translation offerings, and linguistic tools.