Speech Technologies For The 21st Century
By Rob Kassel, SpeechWorks International, Inc.
Almost every vision of technology in the 21st century includes
interacting with devices via speech. A computer might be asked to search a
database, display a star chart or activate machinery. Just by speaking a
name, communications are established no matter where the other person may
be. In books, film and television, we constantly see speech as the most
common way to interact with our artificial environment.
While flying cars and orbiting hotels remain a distant dream, practical
speech technology for input and output is a reality today. In fact,
speaking to devices and hearing their spoken responses is rapidly becoming
commonplace. The reason is simple: recent advances in speech technology
have enabled new applications that offer dramatic return on investment.
Today, we can place a telephone call to check flight arrival times, track
packages, transfer bank funds or purchase office supplies without ever
speaking to a human agent. Despite the relatively low cost of such
systems, they are designed to deliver high customer satisfaction and do so
all day, every day. Speech technology is changing the way we conduct
business over the telephone, and soon will be changing the way we control
devices and access information no matter where we may be.
Two separate but related speech technologies are responsible for this
revolution. Both have been available commercially for decades, but only
through remarkable progress in the past few years have they matured
sufficiently for mainstream applications. Because speech is so natural to
us, it may seem that it would be easy for a computer to manage. In fact,
speech is remarkably complex in subtle and surprising ways that have been
discerned only through clever experimentation and analysis. Humans are
adept at speech because, through millions of years of evolution, we have
developed brain, vocal tract and auditory specializations that enhance our
ability to listen and speak. In a much shorter time we have been able to
invent similar mechanisms so that computers can now also listen and speak.
Speech recognition is the technology that enables computers to hear by
determining which words were spoken. It starts by capturing an audio
signal and processing it through sophisticated algorithms that mimic some
of the processing performed by the human ear. The sounds are evaluated to
determine which phonemes, the basic constituents of speech, they might
represent. The possible strings of phonemes are then compared against a
grammar of allowable words and phrases to determine what was most likely
said, resulting in a textual representation of what was spoken. Based on
these words, the computer can conduct further actions such as placing
calls or buying stocks.
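To make that final matching step concrete, here is a toy sketch in Python. The grammar, the phrases and the phoneme spellings are invented for this illustration; a real recognizer scores thousands of competing hypotheses statistically rather than with a simple similarity measure.

```python
# Toy illustration: candidate phoneme strings from the acoustic front
# end are scored against a small grammar of allowable phrases, and the
# closest match becomes the recognized text.

from difflib import SequenceMatcher

# Hypothetical grammar: each allowable phrase with a phoneme spelling.
GRAMMAR = {
    "check flight status": "ch eh k f l ay t s t ae t ah s",
    "track a package":     "t r ae k ah p ae k ih jh",
    "transfer funds":      "t r ae n s f er f ah n d z",
}

def recognize(phonemes: str) -> str:
    """Return the grammar phrase whose pronunciation best matches
    the phoneme string produced by the acoustic front end."""
    def similarity(phrase_phones: str) -> float:
        return SequenceMatcher(None, phonemes.split(),
                               phrase_phones.split()).ratio()
    return max(GRAMMAR, key=lambda phrase: similarity(GRAMMAR[phrase]))

# Even a noisy phoneme string ("f l ay d" for "flight") still lands on
# the intended phrase, because the grammar constrains the answer.
print(recognize("ch eh k f l ay d s t ae t ah s"))
# -> check flight status
```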
Older speech recognition technology required greatly constraining the
task to simplify the computation needed, perhaps by understanding only a
single speaker's voice, requiring pauses between words or limiting the
vocabulary to a handful of carefully selected phrases. However, today's
commercial speech recognition products are far less restrictive. They
understand almost everyone's speech no matter how strong the accent, allow
callers to speak naturally and spontaneously, reject background noise and
line distortions, and support virtually unlimited vocabularies. These
properties allow speech recognition to handle tasks that previously could
be assigned only to a human agent, such as entering names or addresses,
while delivering accuracy that rivals human listening performance.
Telephony applications, with or without speech recognition, often
respond to callers by playing recorded speech. This works well when the
possible responses can be enumerated in advance and are relatively few in
number, but it cannot be used to read an e-mail message or provide an
address listing.
Speech synthesis, also called 'text-to-speech' or 'TTS,' is the
technology that enables computers to speak arbitrary phrases. It starts by
analyzing the text to be spoken, converting strings such as '$3.50'
into 'three dollars and fifty cents' and determining how each word is
pronounced. This conversion needs to be sophisticated enough to know when
the abbreviation 'Dr.' is pronounced as 'doctor' and when it is
pronounced 'drive,' or when 'read' is pronounced like 'red' or
'reed.' Appropriate pitch, timing and emphasis must be assigned to
words in each sentence to avoid producing a grating monotone. Only then
can an audio stream be generated.
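The flavor of that text-normalization step can be sketched with a few toy rules in Python. The regular expressions and word table below are assumptions for illustration only; production front ends rely on far richer dictionaries and context models, and homographs like 'read' need sentence-level analysis that is omitted here.

```python
import re

# Toy number names; a real normalizer handles arbitrary numbers.
NUMBER_WORDS = {"3": "three", "50": "fifty"}

def normalize(text: str) -> str:
    # Expand currency strings such as "$3.50" into words.
    def currency(m: re.Match) -> str:
        dollars = NUMBER_WORDS.get(m.group(1), m.group(1))
        cents = NUMBER_WORDS.get(m.group(2), m.group(2))
        return f"{dollars} dollars and {cents} cents"
    text = re.sub(r"\$(\d+)\.(\d\d)", currency, text)

    # Disambiguate "Dr." with crude local context. The street rule
    # runs first, because street names are capitalized too.
    text = re.sub(r"(?<=[a-z]\s)Dr\.", "Drive", text)     # "Sunset Dr."
    text = re.sub(r"\bDr\.(?=\s+[A-Z])", "Doctor", text)  # "Dr. Smith"
    return text

print(normalize("Dr. Smith paid $3.50 at the cafe on Sunset Dr."))
# -> Doctor Smith paid three dollars and fifty cents at the cafe
#    on Sunset Drive
```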
You may have heard phone numbers read back as a series of concatenated
digit recordings. Today's leading speech synthesis technique
takes a similar approach but on a much finer scale, dissecting recorded
speech into tiny segments and reassembling them to form the requested
phrases. The output is so natural-sounding that it can be difficult to
distinguish from an original recording, a startling contrast to earlier
approaches that were highly intelligible but had a mechanical quality. The
synthetic output sounds so much like the original speaker's voice that
it is possible to smoothly blend natural and synthetic speech into a
single, spoken response without noticeable transitions.
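A simplified sketch of the concatenation idea appears below, using whole-digit recordings rather than the much finer sub-word units a real synthesizer selects. The file names, sample rate and audio format are assumptions for illustration.

```python
# Concatenate one recorded unit per digit into a single waveform.
# A unit-selection synthesizer does the same with thousands of tiny
# phoneme-sized segments, choosing units whose edges join smoothly.

import wave
import numpy as np

def load_pcm(path: str) -> np.ndarray:
    """Read 16-bit mono PCM samples from a WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.readframes(wav.getnframes())
    return np.frombuffer(frames, dtype=np.int16)

def speak_number(digits: str, rate: int = 8000) -> None:
    # Assumes recordings digits/0.wav ... digits/9.wav at a common
    # sample rate (8 kHz, typical for telephony).
    units = [load_pcm(f"digits/{d}.wav") for d in digits]
    audio = np.concatenate(units)
    with wave.open("out.wav", "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)       # 16-bit samples
        out.setframerate(rate)
        out.writeframes(audio.tobytes())

speak_number("5551234")  # writes out.wav from the digit recordings
```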
Bringing together high-performance speech recognition and
natural-sounding speech synthesis allows developers to create applications
that engage callers with dialog. Doing so effectively is not easy, as
designers must draw on their experience to combine art and science into a
plan covering each step in the conversation. Should prompts be friendly or
formal? How much guidance should be offered and when? What if the caller
makes a mistake? Even brief exchanges may harbor hidden complexity that is
best exposed through observing the behavior of actual callers, and subtle
changes can result in dramatic differences in usability.
Through careful wording of prompts and anticipation of possible
responses, designers can create an experience that allows almost every
caller to complete tasks efficiently by answering a series of directed
questions. The result is a much lower cost per call compared to live
agents without the frustration induced by lengthy touch-tone menus. In
fact, thanks to shorter hold queues and the ability of seasoned callers to
interrupt prompts, speech-driven applications often earn higher caller
satisfaction than either touch-tone menus or live agents.
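The directed-dialog pattern itself is simple: ask one focused question, accept only the answers the designer anticipated and re-prompt on anything else. In the illustrative Python sketch below, typed input stands in for the recognizer and printed prompts stand in for recorded audio; the prompts and grammars are invented for this example.

```python
def recognize(allowed: set[str]) -> str:
    """Placeholder: a real system would run speech recognition
    constrained to the 'allowed' grammar rather than read text."""
    return input().strip().lower()

def ask(prompt: str, allowed: set[str]) -> str:
    """Ask one directed question until an anticipated answer arrives."""
    while True:
        print(prompt)                         # a real system plays audio
        answer = recognize(allowed)
        if answer in allowed:
            return answer
        print("Sorry, I didn't catch that.")  # gentle re-prompt

task = ask("Would you like arrivals or departures?",
           {"arrivals", "departures"})
city = ask("Which city?", {"boston", "chicago", "denver"})
print(f"Looking up {task} for {city.title()}...")
```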
Today's speech recognition and synthesis technologies depend on
copious processing power and memory, made affordable by the plummeting
cost of computation and storage but usually available only in server
systems installed within contact centers. However, the same advances we have seen
in server computing are beginning to appear in handheld devices, too.
While size, weight and battery life restrictions dictate that handheld
devices will never be as capable as their server brethren, compelling
speech recognition and synthesis technology embedded in consumer devices
is now possible.
The convergence of several other technology trends is creating exciting
new possibilities for applications on handheld devices. Color flat panel
displays are becoming inexpensive and lightweight with dazzling image
quality, making it possible to view photos, intricate diagrams and even
video. Battery capacity is increasing while device power consumption is
decreasing, reducing device weight while extending the time between
charges. Wireless data networking is becoming pervasive for both long- and
short-distance connections, enabling devices to access the Internet and
work cooperatively. In addition, location tracking is becoming less
expensive and more accurate, enabling a new category of location-based
services.
The Natural Interface For Handheld Computers
Emerging handheld devices are powerful computing platforms that combine
and transcend the capabilities of existing PDAs and mobile phones,
delivering functions traditionally reserved for full-fledged computers.
Yet, these devices are too small for practical keyboards, making extensive
data entry complicated and error prone. How will consumers tap the
tremendous power of these devices? How will they access valuable
information no matter where they are and what they are doing? With speech,
of course.
Speech provides an intuitive interface for these complex devices that
allows users to concentrate on what needs to be done rather than on how it
can be accomplished.
The speech technologies used to control a device cannot be located in
the network because accessing them would consume too much power, have
excessively long latencies and be highly unreliable and costly. Instead,
the speech technologies must be embedded in the device itself, tapping
network resources only when needed to handle the most complicated tasks.
Using speech recognition to dial a phone number in your personal address
book might be accomplished using embedded technology. Using speech
recognition to search for a phone number in the city directory might be
accomplished using network-based technology. Both embedded speech
recognition and embedded speech synthesis are required to produce a true
conversational interface.
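That division of labor can be sketched as a simple routing rule, as below. The vocabulary-size threshold and function names are illustrative assumptions; the point is only that small personal vocabularies stay on the device while large shared directories require the network.

```python
EMBEDDED_LIMIT = 500  # assumed vocabulary size an embedded engine handles

def route_recognition(vocab_size: int, connected: bool) -> str:
    """Pick the engine for a task: address books and command words fit
    on the device; a city-directory lookup needs the server."""
    if vocab_size <= EMBEDDED_LIMIT:
        return "embedded"
    if connected:
        return "network"
    raise RuntimeError("large-vocabulary task needs a network connection")

print(route_recognition(vocab_size=120, connected=False))     # embedded
print(route_recognition(vocab_size=500_000, connected=True))  # network
```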
Although speech can provide an intuitive interface, it alone is not a
complete solution. Speech is poorly suited to data that demand privacy and
could be overheard, such as passwords or PINs. Speech is also
ill-advised in situations, such as business meetings, where courtesy
demands silence. The true answer lies in multimodal interfaces, accepting
text, pointing and speech as input and producing text, graphics and speech
as output. Together these methods create a universal interface that can
provide information to anyone, anytime, anywhere, even when driving an
automobile.
The coming generation of handheld devices will be produced by many
manufacturers and will offer capabilities to access information and
applications from a diverse array of sources. It must be possible for a
device to load any content the user might need, regardless of who created
that content. Similarly, a content provider will want to ensure its wares
are available on any device, regardless of the manufacturer. The only way
such an ecosystem can develop is through agreed-upon standards for
representing data and controlling the device's multimodal interface
capabilities. This will allow a map vendor to specify the action taken
when a user points to a particular location or says 'zoom out.'
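As a rough illustration of such a binding, the Python sketch below routes spoken commands and pointer events to the same application actions. It is not SALT syntax; the command table and handler names are invented for this example.

```python
# Hypothetical bindings a map vendor might declare: voice and pointing
# inputs both dispatch to the same application actions.

def zoom_out() -> None:
    print("zooming out")

def zoom_in() -> None:
    print("zooming in")

def center_on(x: float, y: float) -> None:
    print(f"centering map on ({x}, {y})")

# Spoken phrases the recognizer may return, bound to actions.
VOICE_BINDINGS = {
    "zoom out": zoom_out,
    "zoom in": zoom_in,
}

def on_speech(utterance: str) -> None:
    action = VOICE_BINDINGS.get(utterance.lower())
    if action:
        action()

def on_tap(x: float, y: float) -> None:
    center_on(x, y)      # pointing input reaches the same handlers

on_speech("Zoom out")    # -> zooming out
on_tap(42.36, -71.06)    # -> centering map on (42.36, -71.06)
```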
Thanks to the efforts of Web developers, excellent standards exist
today for text, graphics and other media distributed to millions of
desktop browsers. That addresses most of what is needed for tomorrow's
multimodal applications, but a common means for controlling the spoken
interface is missing. The leading approach to fill this void, known as the
SALT (Speech Application Language Tags) specification, is currently under
development by an industrywide consortium. The SALT specification builds
on the strong base of Web standards, harmoniously adding just what is
necessary to control speech input and output. This approach works well
with existing Web development tools, making it easier to voice-enable Web
content and thereby accelerating adoption of multimodal applications.
Within a few years, standards-based multimodal interfaces will make it
possible to retrieve a map of your current location, find a nearby
restaurant, read a review of its menu and call to reserve a table, all
with a few spoken commands uttered into a single device that fits
comfortably in your hand.
The complex artificial intelligence behind HAL, as envisioned by Arthur
C. Clarke in 2001: A Space Odyssey, is still only science fiction. Yet
artificial spoken communication is available today and offers a compelling
business case to drive its adoption. It will not be long before each of
us places at least one telephone call per day that is answered by a computer,
and soon we will all carry a device that allows us to effortlessly tap the
vast information resources of the Internet. Whereas today's children
expect television to provide a color image and stereo sound, tomorrow's
children will find it difficult to imagine technology incapable of
conversation.
Rob Kassel is product marketing manager of Emerging Technologies for
SpeechWorks International, Inc. (www.speechworks.com).