
Call Center/CRM Management Scope
July 2002

Speech Technologies For The 21st Century

By Rob Kassel, SpeechWorks International, Inc.

Almost every vision of technology in the 21st century includes interacting with devices via speech. A computer might be asked to search a database, display a star chart or activate machinery. Just by speaking a name, communications are established no matter where the other person may be. In books, film and television, we constantly see speech as the most common way to interact with our artificial environment.

While flying cars and orbiting hotels remain a distant dream, practical speech technology for input and output is a reality today. In fact, speaking to devices and hearing their spoken responses is rapidly becoming commonplace. The reason is simple: recent advances in speech technology have enabled new applications that offer dramatic return on investment. Today, we can place a telephone call to check flight arrival times, track packages, transfer bank funds or purchase office supplies without ever speaking to a human agent. Despite the relatively low cost of such systems, they are designed to deliver high customer satisfaction and do so all day, every day. Speech technology is changing the way we conduct business over the telephone, and soon will be changing the way we control devices and access information no matter where we may be.

Two separate but related speech technologies are responsible for this revolution. Each has been available commercially for decades but only through remarkable progress in the past few years have they matured sufficiently for mainstream applications. Because speech is so natural to us, it may seem that it would be easy for a computer to manage. In fact, speech is remarkably complex in subtle and surprising ways that have been discerned only through clever experimentation and analysis. Humans are adept at speech because, through millions of years of evolution, we have developed brain, vocal tract and auditory specializations that enhance our ability to listen and speak. In a much shorter time we have been able to invent similar mechanisms so that computers can now also listen and speak.

Speech recognition is the technology that enables computers to hear by determining which words were spoken. It starts by capturing an audio signal and processing it through sophisticated algorithms that mimic some of the processing performed by the human ear. The sounds are evaluated to determine which phonemes, the basic constituents of speech, they might represent. The possible strings of phonemes are then compared against a grammar of allowable words and phrases to determine what was most likely said, resulting in a textual representation of what was spoken. Based on these words, the computer can conduct further actions such as placing calls or buying stocks.
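
To make that final matching step concrete, here is a toy Python sketch. Every phoneme symbol, word and scoring rule below is invented for illustration; real recognizers use probabilistic acoustic and language models, not a simple position count:

```python
# Toy sketch of the final stage of speech recognition: scoring candidate
# phoneme strings against a small grammar of allowable phrases.

# A tiny pronunciation lexicon: words mapped to phoneme strings (ARPAbet-style).
LEXICON = {
    "call": "K AO L",
    "home": "HH OW M",
    "office": "AO F AH S",
}

# The grammar: the phrases this application is prepared to accept.
GRAMMAR = ["call home", "call office"]

def phonemes(phrase):
    """Expand a phrase into its phoneme string using the lexicon."""
    return " ".join(LEXICON[w] for w in phrase.split())

def best_match(hypothesis):
    """Pick the grammar phrase whose phonemes best match a hypothesis.

    Scoring is a crude count of matching positions; real recognizers
    use probabilistic models of acoustics and language instead.
    """
    def score(phrase):
        want, got = phonemes(phrase).split(), hypothesis.split()
        return sum(1 for a, b in zip(want, got) if a == b)
    return max(GRAMMAR, key=score)

print(best_match("K AO L HH OW M"))  # -> call home
```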

Older speech recognition technology required greatly constraining the task to simplify the computation needed, perhaps by understanding only a single speaker's voice, requiring pauses between words or limiting the vocabulary to a handful of carefully selected phrases. However, today's commercial speech recognition products are far less restrictive. They understand nearly everyone's speech no matter how strong the accent, allow callers to speak naturally and spontaneously, reject background noise and line distortions, and support virtually unlimited vocabularies. These properties allow speech recognition to handle tasks that previously could be assigned only to a human agent, such as entering names or addresses, while delivering accuracy that rivals human listening performance.

Telephony applications, with or without speech recognition, often respond to callers by playing recorded speech. This works well when the possible responses can be enumerated in advance and are relatively few in number, but it cannot be used to read an e-mail message or provide an address listing.

Speech synthesis, also called 'text-to-speech' or 'TTS,' is the technology that enables computers to speak arbitrary phrases. It starts by analyzing the text to be spoken, converting strings such as '$3.50' into 'three dollars and fifty cents' and determining how each word is pronounced. This conversion needs to be sophisticated enough to know when the abbreviation 'Dr.' is pronounced as 'doctor' and when it is pronounced 'drive,' or when 'read' is pronounced like 'red' or 'reed.' Appropriate pitch, timing and emphasis must be assigned to words in each sentence to avoid producing a grating monotone. Only then can an audio stream be generated.
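
The kind of conversion described here can be sketched in a few lines of Python. The rules below are deliberately simplistic assumptions for illustration; a production TTS front end applies far richer linguistic analysis than these two cases:

```python
import re

# A minimal sketch of the text-normalization step in speech synthesis.

ONES = ["zero", "one", "two", "three", "four", "five",
        "six", "seven", "eight", "nine"]

def spell_number(n):
    """Spell out a small integer (0-99) in words; toy implementation."""
    TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty",
            6: "sixty", 7: "seventy", 8: "eighty", 9: "ninety"}
    TEENS = {10: "ten", 11: "eleven", 12: "twelve", 13: "thirteen",
             14: "fourteen", 15: "fifteen", 16: "sixteen",
             17: "seventeen", 18: "eighteen", 19: "nineteen"}
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])

def expand_currency(text):
    """Rewrite '$D.CC' amounts as spoken words."""
    def repl(m):
        dollars, cents = int(m.group(1)), int(m.group(2))
        return (f"{spell_number(dollars)} dollars"
                f" and {spell_number(cents)} cents")
    return re.sub(r"\$(\d+)\.(\d\d)", repl, text)

def expand_dr(text):
    """Disambiguate 'Dr.': 'doctor' before a capitalized name,
    'drive' otherwise. A crude positional heuristic."""
    text = re.sub(r"\bDr\.\s+(?=[A-Z])", "doctor ", text)
    return text.replace("Dr.", "drive")

print(expand_currency("That will be $3.50 today."))
# -> That will be three dollars and fifty cents today.
```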

You may have heard phone numbers read to you by concatenating recordings of each digit. Today's leading speech synthesis technique takes a similar approach but on a much finer scale, dissecting recorded speech into tiny stretches and reassembling them to form the requested phrases. The output is so natural sounding that it can be difficult to distinguish from an original recording, a startling contrast to earlier approaches that were highly intelligible but had a mechanical quality. The synthetic output sounds so much like the original speaker's voice that it is possible to smoothly blend natural and synthetic speech into a single, spoken response without noticeable transitions.
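
A toy sketch of the concatenation idea, with short lists of numbers standing in for recorded audio (real systems splice sub-phoneme units and smooth the joins; all values here are placeholders):

```python
# Toy concatenative synthesis: recorded audio is cut into labeled units
# and re-spliced to form new utterances. Each "unit" below is just a
# short list of audio samples standing in for a recorded digit.

# Hypothetical unit inventory: digit -> audio samples (placeholder values).
UNITS = {
    "1": [0.1, 0.2, 0.1],
    "5": [0.4, 0.3, 0.4],
    "9": [0.7, 0.6, 0.7],
}

def synthesize(digits):
    """Concatenate stored units to 'speak' a digit string."""
    audio = []
    for d in digits:
        audio.extend(UNITS[d])
    return audio

waveform = synthesize("915")
print(len(waveform))  # 9 samples: three 3-sample units spliced together
```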

Bringing together high-performance speech recognition and natural-sounding speech synthesis allows developers to create applications that engage callers with dialog. Doing so effectively is not easy, as designers must draw on their experience to combine art and science into a plan covering each step in the conversation. Should prompts be friendly or formal? How much guidance should be offered and when? What if the caller makes a mistake? Even brief exchanges may harbor hidden complexity that is best exposed through observing the behavior of actual callers, and subtle changes can result in dramatic differences in usability.

Through careful wording of prompts and anticipation of possible responses, designers can create an experience that allows almost every caller to complete tasks efficiently by answering a series of directed questions. The result is a much lower cost per call compared to live agents, without the frustration induced by lengthy touch-tone menus. In fact, thanks to shorter hold queues and the ability of seasoned callers to interrupt prompts, speech-driven applications often earn higher caller satisfaction than any touch-tone alternative.

Today's speech recognition and synthesis technologies depend on copious processing power and memory, usually available only in server systems installed within contact centers, and enabled by the plummeting cost of computation and storage. However, the same advances we have seen in server computing are beginning to appear in handheld devices, too. While size, weight and battery life restrictions dictate that handheld devices will never be as capable as their server brethren, compelling speech recognition and synthesis technology embedded in consumer devices is now possible.

The convergence of several other technology trends is creating exciting new possibilities for applications on handheld devices. Color flat panel displays are becoming inexpensive and lightweight with dazzling image quality, making it possible to view photos, intricate diagrams and even video. Battery capacity is increasing while device power consumption is decreasing, reducing device weight while extending the time between charges. Wireless data networking is becoming pervasive for both long- and short-distance connections, enabling devices to access the Internet and work cooperatively. In addition, location tracking is becoming less expensive and more accurate, enabling a new category of location-based services.

The Natural Interface For Handheld Computers
Emerging handheld devices are powerful computing platforms that combine and transcend the capabilities of existing PDAs and mobile phones, delivering functions traditionally reserved for full-fledged computers. Yet, these devices are too small for practical keyboards, making extensive data entry complicated and error prone. How will consumers tap the tremendous power of these devices? How will they access valuable information no matter where they are and what they are doing? With speech, of course.

Speech provides an intuitive interface for these complex devices that allows users to concentrate on what needs to be done rather than on how it can be accomplished.

The speech technologies used to control a device cannot be located in the network because accessing them would consume too much power, have excessively long latencies and be highly unreliable and costly. Instead, the speech technologies must be embedded in the device itself, tapping network resources only when needed to handle the most complicated tasks. Using speech recognition to dial a phone number in your personal address book might be accomplished using embedded technology. Using speech recognition to search for a phone number in the city directory might be accomplished using network-based technology. Both embedded speech recognition and embedded speech synthesis are required to produce a true conversational interface.
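
The division of labor described above might be sketched as follows. The vocabulary threshold and every name here are illustrative assumptions, not any vendor's actual design:

```python
# A sketch of the hybrid architecture: a hypothetical device routes small,
# personal recognition tasks to an embedded engine and large-vocabulary
# tasks to a network service.

EMBEDDED_VOCAB_LIMIT = 1_000  # assumed capacity of the on-device engine

def choose_engine(vocabulary_size, network_available):
    """Decide where a recognition task should run."""
    if vocabulary_size <= EMBEDDED_VOCAB_LIMIT:
        return "embedded"     # fast, private, works offline
    if network_available:
        return "network"      # big vocabularies need server power
    return "unavailable"      # graceful failure when offline

# Dialing from a 200-entry personal address book stays on-device;
# searching a city directory of a million listings goes to the network.
print(choose_engine(200, network_available=False))        # -> embedded
print(choose_engine(1_000_000, network_available=True))   # -> network
```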

Although speech can provide an intuitive interface, it alone is not a complete solution. Speech is poorly suited to data that demand privacy and could be overheard, such as passwords or PINs. Speech is also ill-advised in situations, such as business meetings, where courtesy demands silence. The true answer lies in multimodal interfaces, which accept text, pointing and speech as input and produce text, graphics and speech as output. Together these methods create a universal interface that can provide information to anyone, anytime, anywhere, even when driving an automobile.

The coming generation of handheld devices will be produced by many manufacturers and will offer capabilities to access information and applications from a diverse array of sources. It must be possible for a device to load any content the user might need, regardless of who created that content. Similarly, a content provider will want to ensure its wares are available on any device, regardless of the manufacturer. The only way such an ecosystem can develop is through agreed-upon standards for representing data and controlling the device's multimodal interface capabilities. This will allow a map vendor to specify the action taken when a user points to a particular location or says 'zoom out.'

Thanks to the efforts of Web developers, excellent standards exist today for text, graphics and other media distributed to millions of desktop browsers. That addresses most of what is needed for tomorrow's multimodal applications, but a common means for controlling the spoken interface is missing. The leading approach to fill this void, known as the SALT (Speech Application Language Tags) specification, is currently under development by an industrywide consortium. The SALT specification builds on the strong base of Web standards, harmoniously adding just what is necessary to control speech input and output. This approach works well with existing Web development tools, making it easier to voice-enable Web content and thereby accelerating adoption of multimodal applications. Within a few years, standards-based multimodal interfaces will make it possible to retrieve a map of your current location, find a nearby restaurant, read a review of its menu and call to reserve a table, all with a few spoken commands uttered into a single device that fits comfortably in your hand.

The complex artificial intelligence behind HAL, as envisioned by Arthur C. Clarke in 2001: A Space Odyssey, is still only science fiction. Yet artificial spoken communication is available today and offers a compelling business case to drive its adoption. It will not be long before we each place a phone call at least once per day that is answered by a computer, and soon we will all carry a device that allows us to effortlessly tap the vast information resources of the Internet. Whereas today's children expect television to provide a color image and stereo sound, tomorrow's children will find it difficult to imagine technology incapable of conversation.

Rob Kassel is product marketing manager of Emerging Technologies for SpeechWorks International, Inc. (www.speechworks.com).


Leveraging Web Infrastructure For Speech Applications

By Jim Seidman, Verascape, Inc.

VoiceXML is a language designed for writing speech-based applications. Where HTML is designed for interaction via a computer screen, VoiceXML is designed for interaction via a telephone, and VoiceXML solutions use many of the same architectural components as HTML-based ones.

From a functional perspective, it is helpful to imagine a VoiceXML platform as conceptually similar to an IVR system, only much more flexible. VoiceXML platforms use open standards and are designed to be interoperable with components from multiple vendors. As a result, these platforms make it easier to deploy voice-based solutions.

The term 'VoiceXML platform' refers to the collection of telephony interface, speech recognition, text-to-speech synthesis and VoiceXML interpreter software necessary to actually process a call. The VoiceXML interpreter takes a VoiceXML script and uses it to guide the caller's interaction with the system. The platform then retrieves these scripts using HTTP, the same protocol Web browsers use to access HTML pages.

The result is that the architecture of a voice system using VoiceXML looks very similar to that of an HTML-based solution. There is a Web server that responds to the HTTP requests. The Web server interfaces with an application server that runs the business logic and creates dynamic content. The application server, in turn, will 'talk' to the back-end system containing the customer information. The major difference occurs on the side closest to the user. A customer accessing the HTML interface uses a browser to view data from the Web server. By contrast, a voice caller dials the VoiceXML platform over the telephone, and the platform speaks the data from the Web server aloud. The platform 'listens' to the caller's responses and interacts appropriately with the Web server.

VoiceXML platforms are currently appealing as they are usually more cost-effective than proprietary IVR platforms. The cost per telephony port for the actual equipment is typically lower for VoiceXML platforms. Also, because VoiceXML is a standard language, it tends to be easier to find developers. This is in sharp contrast to proprietary systems, for which the only practical source of development is the vendor's own professional services department. The breadth of adoption of the language also means that there is a variety of tools available to aid with content creation, which saves production time. The biggest source of cost savings, though, comes from the ability to reuse existing Web infrastructure.

There are four major components of creating a voice solution. The first is the design of the voice interface, deciding what both the system and the user will say at each point of the interaction. Then comes the production phase, in which the various prompts, scripts and grammars are created. The back-end integration work is the next major piece of work, and will typically be done concurrently with the production. Last is the ongoing maintenance, in which grammars are tuned, capacity is monitored and incremental script improvements are made.

The biggest cost savings come in the back-end integration work, because VoiceXML scripts are served from standard Web servers. Since the platform retrieves these scripts using HTTP, many of the tools and components developed to give HTML pages access to back-end systems can be used for VoiceXML as well. Integration with back-end systems often consumes a very large portion of a total implementation effort, since it is typically unique to each implementation. However, if this back-end work has already been done for HTML pages, then implementing similar voice functionality may require no additional integration work.

A good example is an order tracking system. A company might have an HTML-based system that allows a customer, via a Web browser, to query whether or not an order has shipped. Typically, an application server will dynamically create the HTML and deliver it via a Web server. The code running on the application server will, in turn, need to communicate with a legacy system to retrieve the order information.

An IVR system attempting to replicate this functionality would typically not be able to use the Web application server, and therefore not be able to leverage the code that was written to support the HTML system. Providing callers with functionality equivalent to that enjoyed by those using HTML browsers would likely require integrating from scratch, probably involving the vendor's professional services department.

This is where the value of VoiceXML really shines. The voice application can run on the same application server and use exactly the same code to access the legacy system. This minimizes not only development effort, but also ongoing maintenance. The ability to reuse the same application servers also reduces the need for engineering training and can reduce software licensing costs.
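
As a minimal Python sketch of that reuse, one shared back-end lookup can feed both an HTML page and a VoiceXML document. All function names and the in-memory order table below are hypothetical; a real application server would call a legacy system instead:

```python
# One back-end lookup serves both an HTML page and a VoiceXML document.

ORDERS = {"1234": "shipped", "5678": "processing"}  # stand-in legacy data

def get_order_status(order_id):
    """Shared business logic: the only code that touches the back end."""
    return ORDERS.get(order_id, "not found")

def render_html(order_id):
    """Presentation layer for browser users."""
    return f"<p>Order {order_id} is {get_order_status(order_id)}.</p>"

def render_vxml(order_id):
    """Presentation layer for callers: the same logic, spoken aloud."""
    return (
        '<?xml version="1.0"?>\n'
        '<vxml version="2.0">\n'
        "  <form>\n"
        f"    <block>Order {order_id} is "
        f"{get_order_status(order_id)}.</block>\n"
        "  </form>\n"
        "</vxml>"
    )

print(render_html("1234"))  # -> <p>Order 1234 is shipped.</p>
```

Because both renderers call the same `get_order_status`, a change to the back-end access code is made once and benefits both channels.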

VoiceXML can make the ongoing maintenance of a voice solution easier, as well. Monitoring of the load on the Web and application servers can be done with the same tools the IT department is already using for the HTML solution. In fact, depending on the network design, the same physical systems can potentially be used for both the HTML and voice solutions, allowing the computing resources to be shared between them. In a properly configured load-balancing cluster, this approach can allow the two solutions to run with less total server hardware than if they were separate, especially if the peak loads for the two systems occur at different times of the day.

Creating a voice solution is still a major undertaking. The design of the voice interface and the grammars to support interpretation of the users' speech requires a great deal of effort by experienced specialists. VoiceXML platforms, while cheaper than IVR platforms, still represent a significant investment. However, a well-designed system can divert a large percentage of calls away from live operators to self-service. Recovery of the system cost can often occur in under a year. These factors, coupled with the reuse of the Web infrastructure, create very attractive economics and are a large part of the reason for the tremendous success of VoiceXML as a technology for voice solutions.

Jim Seidman is vice president of engineering for Verascape, Inc. He has worked on developing the standards for several Internet technologies, including HTML, HTTP, WAP and VoiceXML. Verascape is a provider of integrated VoiceXML platforms.
