Innovative Management Information
October 2001

New Speech Technologies End The Madness Of Traditional IVR


Excellent interface design and world-class audio production dramatically impact the success of automated voice applications, as measured by cost savings and customer satisfaction. Traditional touch-tone IVR applications can be maddening, and while voice recognition makes it possible to produce outstanding interfaces that quickly and efficiently deliver self-service access to information and services, achieving this is still a complex art that demands specialized expertise and a deep commitment to quality. Uniting a proven design methodology with iterative usability testing and world-class sound design and creative direction can significantly reduce call length, improve access to richer features and help businesses save millions of dollars while dramatically improving CRM.

More than ever, companies must deliver world-class customer service in the face of unrelenting cost pressure. Consumers expect more while offering less loyalty, and the Web has taught them to demand self-service access to information and services at all times. Voice applications allow callers to quickly express their choices as a natural conversation rather than painstakingly wading through long series of arbitrary menus. This makes it possible to automate broad new classes of rich customer service applications, as compared to traditional touch-tone IVR.

Of course, the buying criteria for voice solutions are centered on return on investment. U.S. corporations spend more than $30 billion each year answering their phones for customer service, and on average it costs $1.75 per minute to keep a live operator on the phone. Voice solutions slash call center costs by providing automated self-service for common tasks, which sharply reduces reliance on live operators. Because high-volume voice solutions cost roughly 10 to 20 cents per minute or less, large contact centers can easily save hundreds of thousands or even millions of dollars each year.

Thus, the fundamental driver of these savings is the automation rate a given solution achieves. In large call centers, every one percent in additional automation can easily translate to tens or hundreds of thousands of dollars in annual savings. For this reason, it is critical that companies looking to invest in voice take steps to coax every last percentage point of automation out of their solution.
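To make the arithmetic concrete, here is a rough sketch of what one percentage point of automation is worth. The per-minute costs come from the figures above; the call volume of 2 million minutes per year is a hypothetical number chosen for illustration, and the model assumes an automated call lasts as long as the live call it replaces.

```python
def annual_savings(minutes_per_year, automation_rate,
                   live_cost=1.75, automated_cost=0.20):
    """Yearly savings from moving a share of call minutes to automation.

    live_cost and automated_cost are dollars per minute, taken from the
    article's figures (the automated cost uses the upper 20-cent bound).
    """
    automated_minutes = minutes_per_year * automation_rate
    return automated_minutes * (live_cost - automated_cost)


# Hypothetical call center handling 2 million call minutes per year.
MINUTES_PER_YEAR = 2_000_000

per_point = annual_savings(MINUTES_PER_YEAR, 0.01)
print(f"Each 1% of automation saves about ${per_point:,.0f} per year")
```

Under these assumptions, a single point of automation is worth about $31,000 a year, and the figure scales linearly with call volume.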

Speaking is the most natural form of human communication, so it's natural to assume that building great voice applications would be at least as easy as designing a Web site. However, it turns out that crafting a voice interface that is easy to use and pleasing to customers is deceptively difficult. Great voice applications help companies deliver exceptional customer service at reasonable costs, but great voice applications are rare because they are very hard to build and deploy successfully.

Conversations are different from touch-tone menus or Web pages. Even if voice recognition technology were perfect (which it isn't), designing voice applications would remain fundamentally harder than building Web sites or touch-tone IVR. Consider the following issues:

Speaking is slower than reading. Just think about how much time it takes to read a grocery list aloud versus quickly reading it in print or on a Web page. On the phone, options have to be listed one at a time, and this can become frustrating very quickly. By contrast, Web pages can display hundreds of choices at the same time.

People quickly forget what they just heard. On a Web page, people can carefully browse through screens to find the option they're looking for. On the phone, information is "gone" as soon as it's spoken, and callers have to remember everything because they can no longer see the choices.

It's not clear what you can't say. Web pages and touch-tone IVR applications are bounded. There are simply a fixed number of links to click or keys to press. Conversations are unbounded because people can say anything in response to a given question. Minute differences in the way prompts and menus are structured have a dramatic impact on customer satisfaction, because callers need to be gently and clearly directed to "say the right things" and apologetically led back on track when they get lost or confused.

Today's leading voice recognition software packages are outstanding tools. However, achieving optimal performance from this software is a lot like coaxing performance out of enterprise database packages such as Oracle. Companies routinely employ specialized Oracle DBAs, because without them performance can be catastrophic. The same holds true for voice applications; they require specialized design and tuning by qualified experts to be successful. Consider the following issues:

People will always say unexpected things. People are accustomed to having "real" conversations over the phone; they immediately assume voice applications can understand whatever they say and can quickly get frustrated when their expectations aren't met. Voice applications can only "understand" the specific things they are trained for -- similar to when people first bring their phrasebook to a foreign country. For example, consider a simple menu of keywords that includes "movies" and "restaurants." Callers who say, "moving" without knowing that it is not a valid choice are likely to consistently get thrown into "movies" and be very frustrated. For this reason, applications must use clear, concise prompting and learn from usability tests to take into account the unexpected things people tend to say. Minute shifts in prompt wording or the underlying "grammars" can have dramatic effects on usability and ultimately the automation rate and ROI of voice applications.
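The "moving" versus "movies" problem can be illustrated with a toy sketch. Note the loud assumption: a real recognizer scores audio against acoustic models, not spelling, so the text similarity from Python's difflib used here is only a stand-in, and the function name and threshold are hypothetical. The point is the design choice: a recognizer that must always pick the closest grammar entry will misroute callers, while one that rejects low-confidence matches gives the application a chance to re-prompt.

```python
from difflib import SequenceMatcher

# The two keywords from the example above.
GRAMMAR = ["movies", "restaurants"]


def recognize(utterance, grammar, reject_below=0.0):
    """Return the closest grammar entry, or None if the best score is too low.

    Text similarity stands in for the acoustic confidence score a real
    recognition engine would produce.
    """
    scored = [(SequenceMatcher(None, utterance, choice).ratio(), choice)
              for choice in grammar]
    score, best = max(scored)
    return best if score >= reject_below else None


# With no rejection threshold, the out-of-grammar word is forced into a match.
print(recognize("moving", GRAMMAR))
# With a threshold, the same word is rejected so the caller can be re-prompted.
print(recognize("moving", GRAMMAR, reject_below=0.8))
```

Choosing the threshold is exactly the kind of tuning that usability tests inform: set it too low and callers land in the wrong place, too high and valid answers are rejected.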

"Grammars" must be tuned. Voice recognition technology works by comparing what the caller said to a specific list of expected choices. These "grammars" are required to make it computationally feasible to do speaker-independent voice recognition in real time. Large grammars, such as the 10,000+ companies on U.S. stock exchanges, can work very well in production today, but doing so requires careful attention by both application designers and speech scientists "tuning" the underlying recognition engine.

People pronounce the same words differently. Pronunciations for words and phrases can vary widely across regions of a given country. Proper names further complicate the matter -- consider, for example, how to pronounce "Qantas Airways" or "Worcester Court." Voice recognition engines rely on built-in dictionaries that specify each of the ways callers may say each word and common phrase. Especially because voice recognition is rapidly being deployed in new industries for new applications, it is critical to ensure that all relevant pronunciations are in the dictionaries.

"Acoustic models" must be continually refined. Voice recognizers use "acoustic models" to decide whether a caller has said something that matches a given grammar. Acoustic models are essentially a mathematical representation of how a wide variety of people sound when they say each of the building blocks of words (e.g., "buh" or "ing"). Acoustic models are built by analyzing millions of diverse recordings of real people actually speaking over the telephone. The more data that get used to "train" these acoustic models, the better recognition quality becomes.

Noisy environments are problematic. Phone conversations -- particularly mobile phones -- often have a lot of background noise. Consider how difficult it is sometimes even for "real people" to distinguish the actual conversation from background noise; the problem is compounded for voice recognition engines because they have far less intelligent context about how to differentiate sounds and speakers' voices from one another. Voice applications, and voice recognition platforms, must be carefully designed to accommodate and minimize the difficulties presented by background noise.

Hundreds of thousands of calls must be transcribed by hand. To compile the necessary data to address most of the problems listed above, it is necessary to manually compare what callers "actually" say with what the voice recognition software "thought" they said. Large numbers of calls must be manually transcribed so speech scientists can determine how accurately each grammar is performing. This labor-intensive process is critical to provide designers with the information they need to optimize applications and maximize automation rates.
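As a minimal sketch of what that comparison produces, the following tallies per-grammar accuracy from hand-transcribed calls. The log rows, grammar names and function name are hypothetical and invented for illustration; a production analysis would also separate out-of-grammar utterances from true misrecognitions.

```python
from collections import defaultdict

# Hypothetical log rows: (grammar name, human transcription, recognizer result)
calls = [
    ("main_menu", "movies", "movies"),
    ("main_menu", "moving", "movies"),
    ("main_menu", "restaurants", "restaurants"),
    ("stock_name", "qantas airways", "qantas airways"),
]


def accuracy_by_grammar(rows):
    """Fraction of calls per grammar where the recognizer matched the transcript."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for grammar, transcribed, recognized in rows:
        totals[grammar] += 1
        if transcribed == recognized:
            correct[grammar] += 1
    return {grammar: correct[grammar] / totals[grammar] for grammar in totals}


for grammar, accuracy in accuracy_by_grammar(calls).items():
    print(f"{grammar}: {accuracy:.0%}")
```

A grammar whose accuracy lags the others is the natural place to start reworking prompts, adding pronunciations or retuning.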

Automated voice applications allow companies to use distinctive voice talent and sound design to convey the unique richness of their brand identity and customer service philosophy. This opportunity directly translates to customer satisfaction and can make the difference when customers are selecting companies with which they would like to do business.

People love applications that sound natural and "feel" good. People simply appreciate and respond more favorably to applications that sound professional and engaging. Poor recording quality, bad music and robotic-sounding synthesized speech are some of the key reasons people tend to hate traditional IVR systems so viscerally. By contrast, companies can use great design to deliver a very compelling experience that callers enjoy and associate positively with their brand and its commitment to customer service.

Crafting the optimal voice and sound is an art. Crafting the optimal audio experience for voice applications is a unique art form that demands specific experience and talent. Challenges include choosing the "right" voice talent (e.g., the optimal voice for stock trading is useless for selling children's games). One of the greatest technical challenges is properly producing prompts for "concatenative speech." Concatenative speech is a technique whereby short bits of prerecorded audio are quickly played in sequence to form longer phrases and sentences. Concatenative speech makes it possible for voice applications to sound very "human" and natural, even when delivering dynamic data such as stock prices or flight information. Without great concatenative speech, applications must resort to robotic-sounding synthesized speech for dynamic data, because prerecording all possible combinations is prohibitively expensive.
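The mechanics of concatenative speech can be sketched in a few lines. The clip file names and the function below are hypothetical; the sketch only shows how a small set of prerecorded pieces is sequenced to speak a dynamic value such as a stock price, and it assumes a non-negative price read digit by digit. The hard part described above, recording clips whose intonation sounds natural when joined, is not something code can show.

```python
# One prerecorded clip per digit (hypothetical file names).
DIGIT_CLIPS = {str(d): f"digit_{d}.wav" for d in range(10)}


def price_clips(company_clip, price):
    """Return the ordered clips that speak '<company> is trading at D D point D D'."""
    dollars, cents = f"{price:.2f}".split(".")
    clips = [company_clip, "is_trading_at.wav"]
    clips += [DIGIT_CLIPS[ch] for ch in dollars]
    clips.append("point.wav")
    clips += [DIGIT_CLIPS[ch] for ch in cents]
    return clips


print(price_clips("acme.wav", 42.5))
```

Twelve generic recordings plus one clip per company name cover every possible price, which is why the technique avoids both robotic synthesis and the cost of prerecording every combination.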

Tackling all of the challenges of building great voice applications is as necessary as it is daunting. It is absolutely possible to deliver great voice applications that delight callers, reduce costs and drive revenue. Companies that wish to achieve these benefits should carefully consider the challenges, and their core competencies, when formulating their voice application strategy.

Jeff Kunins is senior manager, technical marketing, for Tellme Networks, Inc.


VoiceXML In The Real World


Several speech technologies have made giant leaps forward in both deployment and acceptance in recent years. Automatic speech recognition (ASR) and text-to-speech (TTS) conversion, for example, have become accepted technologies for the infrastructure of any speech implementation. In the meantime, VoiceXML, the apparatus that facilitates the deployment of voice technology in the Internet, wireless and telephony world, has become the agreed-upon standard for Internet voice implementation. Additionally, there is the voice browser, which serves as a pure vocal interface to the World Wide Web.

The arena of TTS and ASR tools has always been dominated by a small number of specialized players and has not been regarded as an interesting playground for most IT organizations. These products were viewed as black boxes, activated through proprietary APIs, were very expensive and, until recently, were of inadequate quality. Voice browsers, although positioned much higher in the technology food chain, share the same characteristics in that they represent highly sophisticated infrastructure technology, where most software manufacturers cannot trespass.

By contrast, the establishment of the VoiceXML Forum, which brought an end to the frustrating voice standard competition of the '90s, opened the road for development of voice applications of all kinds. The most attractive attribute of VoiceXML is the way it emerged from, and contributed to, several state-of-the-art technologies:

  • VoiceXML relies on 20-year-old TTS and ASR technologies.
  • It is riding the Internet, as a natural extension to the HTTP-HTML world.
  • As a native XML-based specification, it benefits from the relatively mature and very rich world of XML standards and implementations.

VoiceXML appeared at just the right time. Although Internet technologies had become ubiquitous, they still required heavy equipment and some technical knowledge. The Internet exposed huge amounts of information, but offered no natural human interface. Communication devices, on the other hand, whether fixed or wireless, did provide the means to communicate, but almost nothing more. It is with the convergence of these two revolutions -- the information revolution and the communication revolution -- that voice technology becomes an absolute must. VoiceXML is a great beginning.

But if everything is so promising, why does voice technology linger? The answer is simple: exciting technology is not enough. General acceptance in the IT world, understanding of the potential and several good products have failed to make the breakthrough. Until last year, some in the IT arena believed that a visionary idea with Internet implications was all one needed for that great technology breakthrough. But then came the time to turn dream into reality. Voice technology is of limited use, unless it can be tied to the real world. In other words, now that the computer can talk, it had better have something to say.

Voice applications can be no more than the user interface of content containers. Very simple applications, like reading out price lists, weather or flight information over the phone, may require almost no effort. Implementing such an application requires an interface to the data (e.g., ODBC) and an interface to the voice telephony system.
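A minimal sketch of such a simple application follows: it looks up a forecast and wraps it in a VoiceXML document for a voice browser to speak. The in-memory dictionary is a stand-in for the ODBC query, and the function name, city and form id are hypothetical; only the vxml, form, block and prompt elements come from VoiceXML itself, and the version attribute is an assumption that should match whatever the target platform supports.

```python
import xml.etree.ElementTree as ET

# Stand-in for a database lookup (an ODBC query in a real deployment).
FORECASTS = {"Boston": "sunny with a high of 65"}


def weather_vxml(city):
    """Build a one-prompt VoiceXML document that reads out a city's forecast."""
    vxml = ET.Element("vxml", version="1.0")
    form = ET.SubElement(vxml, "form", id="weather")
    block = ET.SubElement(form, "block")
    prompt = ET.SubElement(block, "prompt")
    prompt.text = f"The forecast for {city} is {FORECASTS[city]}."
    return ET.tostring(vxml, encoding="unicode")


print(weather_vxml("Boston"))
```

The application server would return this document over HTTP, much as it would return an HTML page to a Web browser.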

More sophisticated systems may communicate with the customer via a Web browser, involving a graphical user interface with the ability to play voice and video on the customer's desktop. Developing this type of application is possible using Web application development tools that can spread out from the HTML user interface to the voice user interface.

Alas, for most service and information providers, much more is required -- provisioning and personalization, security and authentication, connection to heterogeneous data sources and intelligent business rules in the heart of this process. Examples include Web-based financial services. Another sophisticated application is a customer contact center system. This would require, in addition to the above, multiple customer access channels including phone, mail, fax, chat, VoIP and other media.

As many components are involved, the inter-component interface strategy deserves a few words. When dealing with relatively old software, we have to yield to traditional approaches. Proprietary APIs, COM interfaces, direct TCP/IP connections and more are used for these components.

It's a different story for newer components. XML has already become the common denominator for any interface. This makes sense: if the application server talks to the voice world through VoiceXML, why shouldn't this same method be used everywhere? In e-business, for example, several vendors have formed popular XML standards. As for CRM and ERP tools, several have already developed XML interfaces. Databases can be handled as a container of XML information. For other systems, we will have to be patient, as they too will eventually adapt to XML.

So far, we have described the components and connectivity issues involved. These provide the skeleton and muscles, yet one thing is missing: the soul. To make everything work together, business logic, defined by sets of business rules, must exist.

There are several methods for building the logic, such as:

  • Third-generation languages like Java, C++, C or VB, combined with server-side and client-side scripts.
  • Visual languages, in which a minimal amount of code is written and business rules are defined through an intuitive visual interface. Microsoft Visio and other Visio-like tools have become trendy for this task, and not without reason; programming becomes planning, and business-oriented experts can replace technology-oriented programmers.
  • Abstract environments, in which there is a certain separation between the business rules as written and the actual implementation. Both the technical details and specific interfaces are hidden during development and become known only at run time. For example, why should the expert planning a telephony application make the choice between the voice over PSTN (traditional telephony), voice over IP and voice over VoiceXML environments?

Making VoiceXML a true story requires an underlying development and execution environment with the following abilities:

  • Connectivity to a wide range of components, machines and software packages that together mold the enterprise. These components include database access, networking and data exchange with data-stores on mainframes and other platforms.
  • Interoperability with complementing tools, including CRM, ERP, e-marketplace, recording systems and authentication means.
  • Rich communication media -- telephony, mail, fax, chat, IP telephony and, of course, the emerging VoiceXML.
  • A simple yet powerful development environment, where business rules are defined and tedious scripting is minimal.
  • High abstraction level, to allow decisions about implementation details to be made at run time. Let the same application run once with an Oracle database on Unix and communication over PSTN, and in another configuration with Microsoft SQL Server, an AS/400 as the back end and VoiceXML in the front.
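The high-abstraction requirement can be sketched as follows. Every class, function and configuration key here is hypothetical, and the channels simply return strings rather than driving real telephony. What the sketch shows is the separation itself: the business rule is written once against two small interfaces, and the concrete data source and channel are chosen from configuration at run time.

```python
from typing import Protocol


class DataSource(Protocol):
    def lookup(self, key: str) -> str: ...


class Channel(Protocol):
    def say(self, text: str) -> str: ...


class DictSource:
    """Stand-in for a real database back end (Oracle, SQL Server and so on)."""

    def __init__(self, rows):
        self.rows = rows

    def lookup(self, key):
        return self.rows[key]


class PstnChannel:
    def say(self, text):
        return f"[PSTN audio] {text}"


class VoiceXmlChannel:
    def say(self, text):
        # Real code would also escape XML special characters in the text.
        return f"<prompt>{text}</prompt>"


CHANNELS = {"pstn": PstnChannel, "voicexml": VoiceXmlChannel}


def flight_status(source: DataSource, channel: Channel, flight: str) -> str:
    """The business rule: it knows nothing about the back end or the channel."""
    return channel.say(f"Flight {flight} is {source.lookup(flight)}.")


def run(config, flight):
    """Pick the implementations from configuration at run time."""
    channel = CHANNELS[config["channel"]]()
    source = DictSource(config["rows"])
    return flight_status(source, channel, flight)


rows = {"417": "on time"}
print(run({"channel": "pstn", "rows": rows}, "417"))
print(run({"channel": "voicexml", "rows": rows}, "417"))
```

Switching from PSTN to VoiceXML changes one configuration value and leaves the business rule untouched.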

Ziv Karp is the vice president of research and development for Composit Communications International. He oversees the development of Composit's contact center software solution and manages its programming team.
