Excellent interface design and world-class audio
production dramatically impact the success of
automated voice applications, as measured by cost
savings and customer satisfaction. Traditional
touch-tone IVR applications can be maddening, and
while voice recognition makes it possible to produce
outstanding interfaces that quickly and efficiently
deliver self-service access to information and
services, achieving this is still a complex art that
demands specialized expertise and a deep commitment to
quality. Uniting a proven design methodology with
iterative usability testing and world-class sound
design and creative direction can significantly reduce
call length, improve access to richer features and
help businesses save millions of dollars while
dramatically improving CRM.
WHY BUILD VOICE APPLICATIONS?
More than ever, companies must deliver world-class
customer service in the face of unrelenting cost
pressure. Consumers expect more while offering less
loyalty, and the Web has taught them to demand
self-service access to information and services at all
times. Voice applications allow callers to quickly
express their choices as a natural conversation rather
than painstakingly wading through long series of
arbitrary menus. This makes it possible to automate
broad new classes of rich customer service
applications, as compared to traditional touch-tone IVR.
Of course, the buying criteria for voice solutions
are centered on return on investment. U.S.
corporations spend more than $30 billion each year
answering their phones for customer service, and on
average it costs $1.75 per minute to keep a live
operator on the phone. Voice solutions slash call
center costs by providing automated self-service for
common tasks, which sharply reduces reliance on live
operators. Because high-volume voice solutions cost
closer to 10 to 20 cents per minute or less, large
contact centers can easily save hundreds of thousands
or even millions of dollars each year.
Thus, the fundamental driver of these savings is
the automation rate a given solution achieves. In
large call centers, every one percent in additional
automation can easily translate to tens or hundreds of
thousands of dollars in annual savings. For this
reason, it is critical that companies looking to
invest in voice take steps to coax every last
percentage point of automation out of their solution.
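The arithmetic behind these figures can be sketched in a few lines. The per-minute costs are the ones quoted above; the annual call volume of 10 million minutes and the function name `annual_savings` are assumptions chosen only for illustration.

```python
# Per-minute costs quoted in the article; the call volume below is hypothetical.
LIVE_COST_PER_MIN = 1.75       # live operator, dollars per minute
AUTOMATED_COST_PER_MIN = 0.20  # upper end of the 10-to-20-cent range


def annual_savings(total_minutes, automation_rate,
                   live_cost=LIVE_COST_PER_MIN,
                   automated_cost=AUTOMATED_COST_PER_MIN):
    """Dollars saved per year when a fraction of call minutes is automated."""
    automated_minutes = total_minutes * automation_rate
    return automated_minutes * (live_cost - automated_cost)


if __name__ == "__main__":
    minutes = 10_000_000  # assumed annual call minutes for a large center
    for rate in (0.10, 0.11):
        print(f"{rate:.0%} automation: ${annual_savings(minutes, rate):,.0f}")
    one_point = annual_savings(minutes, 0.11) - annual_savings(minutes, 0.10)
    print(f"Value of one extra point: ${one_point:,.0f}")
```

Under these assumed inputs, one additional percentage point of automation is worth roughly $155,000 a year, which is consistent with the range described above.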
WHY GREAT VOICE APPLICATIONS ARE RARE
Speaking is the most natural form of human
communication, so it's tempting to assume that building
great voice applications would be at least as easy as
designing a Web site. However, it turns out that
crafting a voice interface that is easy to use and
pleasing to customers is deceptively difficult. Great
voice applications help companies deliver exceptional
customer service at reasonable costs, but great voice
applications are rare because they are very hard to
build and deploy successfully.
Conversations are different from touch-tone menus or Web
pages. Even if voice recognition technology were
perfect (which it isn't), designing voice applications
would remain fundamentally harder than building Web
sites or touch-tone IVR. Consider the following:
Speaking is slower than reading. Just
think about how much time it takes to read a grocery
list aloud versus quickly reading it in print or on a
Web page. On the phone, options have to be listed one
at a time, and this can become frustrating very
quickly. By contrast, Web pages can display hundreds
of choices at the same time.
People quickly forget what they just heard.
On a Web page, people can carefully browse through
screens to find the option they're looking for. On the
phone, information is "gone" as soon as it's spoken,
and callers have to remember everything because they
can no longer see the choices.
It's not clear what you can't say.
Web pages and touch-tone IVR applications are bounded.
There are simply a fixed number of links to click or
keys to press. Conversations are unbounded because
people can say anything in response to a given
question. Minute differences in the way prompts and
menus are structured have a dramatic impact on
customer satisfaction, because callers need to be
gently and clearly directed to "say the right things"
and apologetically led back on track when they get
lost or confused.
RECOGNITION QUALITY DEPENDS UPON SPECIALIZED
DESIGN AND TUNING
Today's leading voice recognition software packages
are outstanding tools. However, achieving optimal
performance from this software is a lot like coaxing
performance out of enterprise database packages such
as Oracle. Companies universally employ specialized
Oracle DBAs, because without them performance can be
catastrophic. The same holds true for voice
applications; they require specialized design and
tuning by qualified experts to be successful. Consider
the following issues:
People will always say unexpected things.
People are accustomed to having "real" conversations
over the phone; they immediately assume voice
applications can understand whatever they say and can
quickly get frustrated when their expectations aren't
met. Voice applications can only "understand" the
specific things they are trained for -- much like a
traveler relying on a phrasebook in a foreign
country. For example, consider a simple menu of
keywords that includes "movies" and "restaurants."
Callers who say "moving" without knowing that it is
not a valid choice are likely to be routed to
"movies" every time and become very frustrated. For this
reason, applications must use clear, concise prompting
and learn from usability tests to take into account
the unexpected things people tend to say. Minute
shifts in prompt wording or the underlying "grammars"
can have dramatic effects on usability and ultimately
the automation rate and ROI of voice applications.
"Grammars" must be tuned. Voice
recognition technology works by comparing what the
caller said to a specific list of expected choices.
These "grammars" are required to make it
computationally feasible to do speaker-independent
voice recognition in real time. Large grammars, such
as the 10,000+ companies on U.S. stock exchanges, can
work very well in production today, but doing so
requires careful attention by both application
designers and speech scientists "tuning" the
underlying recognition engine.
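The idea of matching an utterance against a fixed grammar, and the "moving" versus "movies" confusion described earlier, can be illustrated with a toy sketch. Everything here is an assumption for illustration: string similarity from Python's `difflib` stands in for acoustic scoring (a real recognizer scores audio, not spelling), and the `recognize` function and its threshold values are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical menu grammar: the only phrases the application expects.
GRAMMAR = ["movies", "restaurants"]


def recognize(utterance, grammar, threshold):
    """Return the best-scoring grammar entry, or None if it scores below threshold.

    String similarity is a stand-in for acoustic similarity; this is a toy
    model, not how a production recognition engine works.
    """
    scored = [(SequenceMatcher(None, utterance.lower(), entry).ratio(), entry)
              for entry in grammar]
    best_score, best_entry = max(scored)
    return best_entry if best_score >= threshold else None


if __name__ == "__main__":
    # A lenient threshold accepts the out-of-grammar word as "movies";
    # a stricter one rejects it so the application can reprompt instead.
    for threshold in (0.6, 0.8):
        print(threshold, recognize("moving", GRAMMAR, threshold))
```

In this toy model, tuning amounts to choosing the grammar entries and the rejection threshold so that valid choices are accepted while near misses trigger a reprompt rather than a wrong turn.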
People pronounce the same words differently.
Pronunciations for words and phrases can vary widely
across regions of a given country. Proper names
further complicate the matter -- consider, for
example, how to pronounce "Qantas Airways" or "Worcester
Court." Voice recognition engines rely on built-in
dictionaries that specify each of the ways callers may
say each word and common phrase. Especially because
voice recognition is rapidly being deployed in new
industries for new applications, it is critical to
ensure that all relevant pronunciations are in the
"Acoustic models" must be continually
refined. Voice recognizers use "acoustic
models" to decide whether a caller has said something
that matches a given grammar. Acoustic models are
essentially a mathematical representation of how a
wide variety of people sound when they say each of the
building blocks of words (e.g., "buh" or "ing").
Acoustic models are built by analyzing millions of
diverse recordings of real people actually speaking
over the telephone. The more data that get used to "train"
these acoustic models, the better recognition quality
Noisy environments are problematic.
Phone conversations -- particularly on mobile phones --
often have a lot of background noise. Consider how
difficult it is sometimes even for "real people" to
distinguish the actual conversation from background
noise; the problem is compounded for voice recognition
engines because they have far less intelligent context
about how to differentiate sounds and speakers' voices
from one another. Voice applications, and voice
recognition platforms, must be carefully designed to
accommodate and minimize the difficulties presented by
Hundreds of thousands of calls must be
transcribed by hand. To compile the data needed
to address most of the problems listed above, someone
must manually compare what callers "actually"
say with what the voice recognition software "thought"
they said. Large numbers of calls must be manually
transcribed so speech scientists can determine how
accurately each grammar is performing. This
labor-intensive process is critical to provide
designers with the information they need to optimize
applications and maximize automation rates.
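Once transcripts exist, the comparison itself is simple bookkeeping. The sketch below assumes a hypothetical log format of (grammar name, human transcript, recognizer result) tuples; the sample records and the function name `accuracy_by_grammar` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical log records: (grammar name, human transcript, recognizer result).
CALL_LOG = [
    ("main_menu", "movies", "movies"),
    ("main_menu", "moving", "movies"),
    ("main_menu", "restaurants", "restaurants"),
    ("city", "boston", "boston"),
    ("city", "austin", "boston"),
]


def accuracy_by_grammar(records):
    """Fraction of utterances per grammar where the recognizer matched the transcript."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for grammar, transcript, recognized in records:
        totals[grammar] += 1
        if transcript == recognized:
            correct[grammar] += 1
    return {grammar: correct[grammar] / totals[grammar] for grammar in totals}


if __name__ == "__main__":
    for grammar, accuracy in accuracy_by_grammar(CALL_LOG).items():
        print(f"{grammar}: {accuracy:.0%}")
```

With the sample log above, the main_menu grammar is correct on two of three utterances and the city grammar on one of two. A report like this is what tells designers which grammar to tune first.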
AUDIO PRODUCTION DRAMATICALLY IMPACTS CUSTOMER
SATISFACTION
Automated voice applications allow companies to use
distinctive voice talent and sound design to convey
the unique richness of their brand identity and
customer service philosophy. This opportunity directly
translates to customer satisfaction and can make the
difference when customers are selecting companies with
which they would like to do business.
People love applications that sound natural
and "feel" good. People simply appreciate and
respond more favorably to applications that sound
professional and engaging. Poor recording quality, bad
music and robotic-sounding synthesized speech are some
of the key reasons people tend to hate traditional IVR
systems so viscerally. By contrast, companies can use
great design to deliver a very compelling experience
that callers enjoy and associate positively with their
brand and its commitment to customer service.
Crafting the optimal voice and sound is an
art. Crafting the optimal audio experience for
voice applications is a unique art form that demands
specific experience and talent. Challenges include
choosing the "right" voice talent (e.g., the optimal
voice for stock trading is useless for selling
children's games). One of the greatest technical
challenges is properly producing prompts for "concatenative
speech." Concatenative speech is a technique whereby
short bits of prerecorded audio are quickly played in
sequence to form longer phrases and sentences.
Concatenative speech makes it possible for voice
applications to sound very "human" and natural, even
when delivering dynamic data such as stock prices or
flight information. Without great concatenative
speech, applications must resort to robotic-sounding
synthesized speech for dynamic data, because
prerecording all possible combinations is
Tackling all of the challenges of building great
voice applications is as necessary as it is daunting.
It is absolutely possible to deliver great voice
applications that delight callers, reduce costs and
drive revenue. Companies that wish to achieve these
benefits should carefully consider the challenges, and
their core competencies, when formulating their voice
Jeff Kunins is senior manager, technical
marketing, for Tellme
To The October 2001 Table Of Contents ]
VoiceXML In The Real
BY ZIV KARP, COMPOSIT COMMUNICATIONS
Several speech technologies have made giant leaps
forward in both deployment and acceptance in recent
years. Automatic speech recognition (ASR) and
text-to-speech (TTS) conversion, for example, have
become accepted technologies for the infrastructure of
any speech implementation. In the meantime, VoiceXML,
the markup language that facilitates the deployment of voice
technology in the Internet, wireless and telephony
worlds, has become the agreed-upon standard for
Internet voice implementation. Additionally, there is
the voice browser, which serves as a pure vocal
interface to the World Wide Web.
The arena of TTS and ASR tools has always been
dominated by a small number of specialized players and
has not been regarded as an interesting playground for
most IT organizations. These products were viewed as
black boxes activated through proprietary APIs; they were
very expensive and, until recently, of inadequate
quality. Voice browsers, although positioned much
higher in the technology food chain, share the same
characteristics in that they represent highly
sophisticated infrastructure technology, where most
software manufacturers cannot trespass.
By contrast, the establishment of the VoiceXML
Forum, which brought an end to the frustrating voice
standard competition of the '90s, opened the road for
development of voice applications of all kinds. The
most attractive attribute of VoiceXML is the way it
emerged from, and contributed to, several
- VoiceXML relies on 20-year-old TTS and ASR
- It is riding the Internet, as a natural
extension to the HTTP-HTML world.
- As a native XML-based specification, it benefits
from the relatively mature and very rich world of
XML standards and implementations.
VoiceXML appeared at just the right time. Although
Internet technologies had become ubiquitous, they still
required heavy equipment and some technical knowledge.
The Internet exposed huge amounts of information, but
offered no natural human interface. Communication devices, on
the other hand, whether fixed or wireless, did provide the
means to communicate, but almost nothing more. It is
with the convergence of these two revolutions -- the
information revolution and the communication
revolution -- that voice technology becomes an
absolute must. VoiceXML is a great beginning.
But if everything is so promising, why does voice
technology still lag? The answer is simple: exciting
technology is not enough. General acceptance in the IT
world, understanding of the potential and several good
products have failed to make the breakthrough. Until
last year, some in the IT arena believed that a
visionary idea with Internet implications was all one
needed for that great
technology breakthrough. But then came the time to
turn dream into reality. Voice technology is of
limited use, unless it can be tied to the real world.
In other words, now that the computer can talk, it had
better have something to say.
Voice applications can be no more than the user
interface of content containers. Very simple
applications, like reading out price lists, weather or
flight information over the phone, may require almost
no effort. Implementing such an application requires
an interface to the data (e.g., ODBC) and an interface
to the voice telephony system.
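A minimal sketch of such a simple application follows: it reads one record from a data source and wraps it in a VoiceXML document for a voice platform to speak. Several things here are assumptions made for illustration: an in-memory SQLite database stands in for the ODBC data source, the `flights` table and the `flight_status_vxml` function are hypothetical, and the element names follow the W3C VoiceXML 2.0 specification, which may differ from the VoiceXML version the author had in mind.

```python
import sqlite3
import xml.etree.ElementTree as ET


def flight_status_vxml(conn, flight_number):
    """Build a minimal VoiceXML document that reads one flight's status aloud."""
    row = conn.execute(
        "SELECT status FROM flights WHERE flight_number = ?", (flight_number,)
    ).fetchone()
    message = (f"Flight {flight_number} is {row[0]}." if row
               else f"Sorry, I have no information for flight {flight_number}.")

    # One form with one block whose prompt the voice platform speaks.
    vxml = ET.Element("vxml", version="2.0", xmlns="http://www.w3.org/2001/vxml")
    form = ET.SubElement(vxml, "form")
    block = ET.SubElement(form, "block")
    ET.SubElement(block, "prompt").text = message
    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for an ODBC data source
    conn.execute("CREATE TABLE flights (flight_number TEXT, status TEXT)")
    conn.execute("INSERT INTO flights VALUES ('101', 'on time')")
    print(flight_status_vxml(conn, "101"))
```

The two interfaces named above map onto the two halves of the function: the SQL query is the interface to the data, and the generated document is what would be handed to the voice telephony system.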
More sophisticated systems may communicate with the
customer via a Web browser, involving a graphical user
interface with the ability to play voice and video on
the customer's desktop. Developing this type of
application is possible using Web application
development tools that can spread out from the HTML
user interface to the voice user interface.
Alas, for most service and information providers,
much more is required -- provisioning and
personalization, security and authentication,
connection to heterogeneous data sources and
intelligent business rules in the heart of this
process. Examples include Web-based financial
services. Another sophisticated application is a
customer contact center system. This would require, in
addition to the above, multiple customer access
channels including phone, mail, fax, chat, VoIP and
As many components are involved, the
inter-component interface strategy deserves a few
words. When dealing with relatively old software, we
have to yield to traditional approaches. Proprietary
APIs, COM interfaces, direct TCP/IP connections and
more are used for these components.
It's a different story for newer components. XML
has already become the common denominator for any
interface. This makes sense: if the application server
talks to the voice world through VoiceXML, why shouldn't
this same method be used everywhere? In e-business,
for example, several vendors have formed popular XML
standards. As for CRM and ERP tools, several have
already developed XML interfaces. Databases can be
handled as a container of XML information. For other
systems, we will have to be patient, as they too will
eventually adapt to XML.
So far, we have described the components and
connectivity issues involved. These provide the
skeleton and muscles, yet one thing is missing: the
soul. To make everything work together, business
logic, defined by sets of business rules, must exist.
There are several methods for building the logic,
- Third-generation languages like Java, C++, C or
VB, combined with server-side and client-side
- Visual languages, in which a minimal amount of
code is written and business rules are defined
through an intuitive visual interface. Microsoft
Visio and other Visio-like tools have become trendy for
this task, and not without reason: programming
becomes planning, and business-oriented experts
can replace technology-oriented programmers.
- Abstract environments, in which there is a
certain separation between the business rules as
written and the actual implementation. Both the
technical details and specific interfaces are
hidden during development and become known only at
run time. For example, why should the expert
planning a telephony application make the choice
between voice over PSTN (traditional
telephony), voice over IP and voice over VoiceXML
Making VoiceXML a true story requires an underlying
development and execution environment with the
- Connectivity to a wide range of components,
machines and software packages that together mold
the enterprise. These components include database
access, networking and data exchange with
data-stores on mainframes and other platforms.
- Interoperability with complementary tools,
including CRM, ERP, e-marketplace, recording
systems and authentication mechanisms.
- Rich communication media -- telephony, mail,
fax, chat, IP telephony and, of course, the
- A simple yet powerful development environment,
where business rules are defined and tedious
scripting is minimal.
- High abstraction level, to allow decisions about
implementation details to be made at run time. Let
the same application run once with an Oracle
database running on Unix and communication over
PSTN, and in another configuration on Microsoft SQL
Server, with AS/400 as back end and VoiceXML in the
Ziv Karp is the vice president of research and
development for Composit
Communications International. He oversees the
development of Composit's contact center software
solution and manages its programming team.