TMCnet - World's Largest Communications and Technology Community




May 1999

Listen Up! Eight Criteria For Selecting A Speech Recognition Vendor For Your Call Center


If you have been following trends in over-the-telephone speech recognition, you may be convinced that this once "futuristic" technology has much to contribute to your call center today. Automated speech recognition (ASR) makes possible the widest range of self-service e-commerce applications through the most ubiquitous device -- the telephone -- and the most natural interface of all -- the spoken word. ASR technology provides an extremely cost-effective way to offer friendly, personalized customer service 24 hours a day, while sparing callers the tedium of "press 1," "press 2" command trees.

Early adopters of over-the-telephone speech recognition, including United Airlines, E*TRADE, FedEx, BellSouth and Hewlett-Packard, are already realizing the benefits of ASR. With innovative applications, they are taking customer service to the next level, offering advanced solutions that were not possible a few years ago. Many other companies across industries such as health care, manufacturing, insurance and banking are following suit.

Selecting The Right Solution
As with many tasks, the first step is the most difficult. When thinking about getting started in speech, the challenge is to understand the basic technology and the key criteria to consider when reviewing the products of various vendors. The following pages present an overview of eight key criteria that should be evaluated in a vendor selection process. If you do your job right, you will find a solutions partner that will not only deliver a finished application that your callers will love, but will also support you through a development process that is easier than you ever thought possible.

Let's get started. What criteria should you assess, and what results should you demand?

State-Of-The-Art Accuracy. The best recognition engines on the market today are recognizing simple commands (like numbers and yes/no responses) with more than 98 percent accuracy. Even complex and larger vocabularies are handled accurately more than 95 percent of the time in some real-world applications. Today's state-of-the-art engines recognize speaker-independent, continuous speech with vocabularies of more than 50,000 words and understand the more than 1 billion ways people combine them. Leading vendors are supporting their engines with teams of speech scientists who are continually incorporating new techniques to improve recognition accuracy for today's over-the-phone applications.

Accuracy may seem like a straightforward concept, but it is often difficult to compare across products. Many vendors test and report recognition accuracy using different approaches and assumptions. Be sure to compare apples to apples and probe for details when making your own comparisons.
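
To see why measurement method matters, consider one common metric: word error rate (WER), from which word accuracy is derived as 1 minus WER. The sketch below is a generic formulation, not any particular vendor's scoring method; vendors may weight substitutions, insertions and deletions differently, which is exactly why like-for-like comparison is essential.

```python
def word_error_rate(reference, hypothesis):
    """WER as word-level edit distance, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1] / len(ref)

# One misrecognized word out of five:
wer = word_error_rate("flight to boston on thursday",
                      "flight to austin on thursday")
print(f"word accuracy: {1 - wer:.0%}")  # word accuracy: 80%
```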

In addition to accuracy ratings, vendors should also provide measures of "transaction completion." This rating lets you know the percentage of callers who successfully complete their transactions using the ASR interface, without transferring to a customer service representative (CSR) or hanging up in frustration. Transaction completion rates, which should reach levels of 98 percent, are a good measure of accuracy and will reflect a sound application design and an effective user interface.
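
The metric itself is simple arithmetic: the share of calls finished entirely within the ASR interface. The outcome labels in this sketch are hypothetical, invented for illustration rather than drawn from any vendor's reporting tools.

```python
def transaction_completion_rate(outcomes):
    """Fraction of calls completed in the ASR interface, without a
    transfer to a CSR or a hang-up."""
    completed = sum(1 for o in outcomes if o == "completed")
    return completed / len(outcomes)

# 100 hypothetical calls: 96 completed, 3 transferred, 1 abandoned.
calls = ["completed"] * 96 + ["transferred_to_csr"] * 3 + ["hung_up"]
rate = transaction_completion_rate(calls)
print(f"Transaction completion: {rate:.0%}")  # Transaction completion: 96%
```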

Natural Language Processing. Natural language processing (NLP) allows your callers to speak complex commands in complete sentences or phrases. Demonstrated in research labs for years, NLP is now being used in many real-world applications.

NLP should be used to enhance the user friendliness of the speech interface. For example, an experienced travel customer should be able to say, "I want to fly from Boston to San Francisco next Thursday morning." The most advanced speech systems enable this capability.

On the other hand, if most users of a speech system are typically going to be "newcomers" (such as in a package rate-finding application), then NLP may not be appropriate at all, and a "directed dialog" approach (prompt/response, prompt/response) may be the most comfortable option.

When selecting a vendor, evaluate both its NLP technology and its user interface design expertise, which will enable you to use NLP most effectively.

Proven Techniques For "Barge-In". Experienced callers will not tolerate speech systems that force them to listen to prompts in full, or "wait for the beep" before responding. Like IVR applications that let callers "type ahead," "barge-in" lets users interrupt the system whenever they are ready and know what they want to do next.

Barge-in is a challenging technology to implement. Discuss this concept with your vendors, and make sure your callers will be able to take advantage of this advanced capability.

Advanced Development Tools. To facilitate and speed implementation, look for well-designed, proven development tools that are available to you and/or your systems integrator. "Building block" components allow developers to create applications by linking objects and setting parameters, with little or no coding required.

The most advanced toolkits are also integrated into the development environments of other IVR vendors, including graphical user interfaces (GUIs), enabling speech services to be built quickly and easily using the tools with which your developers are most familiar.

Be sure to query vendors about how extensively their tools are being used today by their customers and partners, and how much code they really save developers from writing.

A Well-Designed And Documented User Interface Design Process. In the world of over-the-telephone ASR, you will hear a great deal about the speech user interface -- far more than you did in the days of touch-tone IVR. Consider this: with IVR, the end-user has a universe of 12 options, 0-9, # and *. If the caller makes an invalid entry, you can easily reroute them to a basic menu structure.

Speech recognition is a more powerful interface, allowing callers to route themselves from topic to topic more freely and speak multiple inputs in one sentence, such as "buy 100 shares at the market price."

Because of this power and flexibility, the user interface must be carefully crafted. Callers can say "anything" and your system needs to respond accurately, effectively and pleasantly at all times. In addition, call center managers and vendors have all learned from the past. Although touch-tone systems offered many benefits, they were not generally hailed for the friendly experience they provided to callers. (What are your "transfer to CSR" figures?) ASR gives us all an opportunity to do better -- to give your most important constituents the highest level of customer service satisfaction.

The key to good user interface design is a proven development process including research, design and user testing to match your application needs to your caller population. In addition, look for tools to aid the testing and tuning process. Do the vendors you are considering have human factors specialists on staff? Do they have in-house capabilities for testing and prompt development? Most important, check their track records of satisfied customers.

Can You Build A Prototype? Because speech recognition is a new technology, many corporations are looking for low-risk ways to explore its capabilities. By building an application prototype, companies can "hear" just how their own application will sound and can obtain feedback from trial users.

By building application prototypes, you and your team can gain a lot of experience very quickly about all aspects of the speech development process. In addition, you have a model system to share with your management team. When you are ready to move forward, the building blocks of a prototype can be rolled into a full-scale, deployed system.

Reliable, Scalable And Efficient Systems. When comparing speech vendors, you and your technical advisors must evaluate the various system configuration approaches that will be proposed. A central server or server farm approach, for example, may be vulnerable to single points of failure. An "n+1" architecture, on the other hand, in which the system is set up as a number of independent, identical units of processing power, provides high reliability since even if a single unit fails, there are extra processors to keep the system up and running.
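
The value of the spare unit can be made concrete with a simple independent-failure model. The unit counts and the 99 percent per-unit availability figure below are illustrative assumptions for the sketch, not vendor data.

```python
from math import comb

def capacity_availability(n, spares, unit_availability):
    """Probability that at least n of n+spares identical units are up,
    assuming units fail independently (an illustrative model only)."""
    total = n + spares
    return sum(
        comb(total, k) * unit_availability**k
        * (1 - unit_availability)**(total - k)
        for k in range(n, total + 1)
    )

# Eight units of capacity, each 99% available, with and without a spare:
print(f"no spare: {capacity_availability(8, 0, 0.99):.4f}")  # 0.9227
print(f"n+1:      {capacity_availability(8, 1, 0.99):.4f}")  # 0.9966
```

Under these assumptions, the single ninth unit cuts the chance of degraded capacity by roughly a factor of 20, which is why the n+1 layout is favored for high-availability telecom services.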

Be sure to ask how speech suppliers plan to scale your system upwards as your market expands. What additional costs will be involved? Again, some system setups require complicated load-balancing across multiple systems and networks, whereas the n+1 approach used in large-scale telecom services can scale up almost without limit.

Finally, as you look at cost structures, get a feel for vendors' speech processing efficiency. The most efficient system configurations will clearly give you more "bang for the buck," and it is important to understand the options available to you. The "Moore's Law" trends in processing power are having a huge impact on the viability of speech applications for today's call centers -- and the amount of speech recognition possible on single, open systems platforms. You should make sure you are working with a provider whose software is designed to benefit from these trends.

The Availability Of Ongoing Support. Supporting a new system -- and constantly improving it -- is an important part of the entire process. Unlike other applications your call center may have implemented, speech recognition systems can be improved dramatically through careful analysis of usage patterns and results.

Even if you have done internal and pilot group testing, the feedback from early users "in the field" is critical. Ask vendors how they manage this process and find out what tuning tools they have to track results and make improvements quickly and easily.

By exploring these eight criteria, you will soon understand the capabilities needed to develop a speech application quickly and easily. Then, you can launch your own speech-activated application, which your call center customers will enjoy using and will return to happily, again and again.

Lauren Richman is director of marketing communications at Boston, Massachusetts-based SpeechWorks (formerly Applied Language Technologies). SpeechWorks (www.speechworks.com) enables people to talk to computers over the telephone in a natural way. The company's solution is based on research conducted at and licensed from the Massachusetts Institute of Technology. SpeechWorks simplifies and speeds the development of speech-activated services for information delivery and e-commerce through its patent-pending DialogModules. The company also provides system integration and support services to clients and distributes its solution through a network of resellers and integrators worldwide.

Speech Technologies: Choosing The Right Architecture For Call Center Apps


Speech technologies are revolutionizing the call center computer-telephony integration environment. Adding automatic speech recognition (ASR) and text-to-speech (TTS) greatly increases the effectiveness of interactive voice response (IVR) systems. Speech technologies are used in a wide range of applications. For callers without dual-tone multi-frequency (DTMF) capability, speech technology provides the ability to "press or say one." Speech-enabled auto-attendants allow a caller to be connected by simply speaking the name of an individual or department. Financial, travel reservation and sophisticated personal assistant applications -- some with vocabularies of more than 50,000 words -- offer the convenience of fully automated transactions.

Two viable architectures for implementing speech technologies are software-only and embedded. In a software-only architecture, speech technologies run on the same host central processing unit (CPU) as the computer-telephony integration (CTI) application, requiring no specialized hardware. In embedded architectures, the speech technologies run on dedicated digital signal processing (DSP) hardware. For some applications, a hybrid of these two architectures works best. To decide which architecture is appropriate for a specific call center application, many factors must be considered.

Software-Only, Hybrid And Embedded Architectures
Software-only speech technology systems are a cost-effective solution since they do not require additional hardware to run the speech technology. This type of architecture leverages advances in CPU price/performance, as well as the sheer CPU processing power now available on PCs. Moreover, according to Moore's Law, CPU processing power should continue to double every 18 months -- for even better price/performance in the future. Besides the cost savings, software-only architectures reduce any maintenance that might be associated with speech-technology-specific hardware.

Embedded systems also offer advantages: scalability, easier problem isolation and diagnosis, and more deterministic system behavior for resource provisioning. By moving the processing load for the speech technology away from the host CPU, embedded speech technology systems based on DSP boards offer greater scalability, providing the ability to create high-density, speech-enabled applications. In this architecture, you can add channels of speech technology by adding more DSP boards -- without worrying about adding to the CPU load, as you would with a software-only system.

Embedded systems isolate the speech technology processing load, making it easier for the developer to determine system resource requirements. By localizing the speech technology to a specific board or group of boards, this architecture better insulates the CPU from any problems that might occur with the speech technology. If errors do occur, it is also easier to isolate and correct them. This is obviously extremely important for call centers, given their high system availability/reliability requirements.

Finally, the DSPs used in embedded systems are more efficient for some aspects of speech technologies, especially those associated with ASR processing.

A third architecture, which is a hybrid of the embedded and software-only models, is proving to be very efficient for extremely large-vocabulary phonetic ASR call center CTI systems. This efficiency results from distributing components of the ASR processing to the place where each is best handled.

For instance, since DSPs are designed for signal processing, they are very efficient at performing what is referred to in ASR as front-end processing. An example of front-end processing is barge-in, or the ability of the speech recognition system to recognize an utterance by the caller, even during an outbound voice prompt by the system. Permitting speech input during a prompt is important to users familiar with the system who do not need the same prompting as a novice user to complete their transactions. By performing barge-in on a DSP board, the input does not need to be processed by the host until the caller actually speaks. This can dramatically increase the ASR densities achievable on the host.
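
A highly simplified model of the front-end decision involved: monitor the energy of incoming audio frames while the prompt plays, and cut the prompt as soon as caller speech appears. Real systems also perform echo cancellation on the DSP so the outbound prompt is not mistaken for caller speech; the threshold and frame values here are invented for illustration.

```python
def detect_barge_in(frames, threshold):
    """Return the index of the first audio frame whose mean energy
    exceeds the threshold, or None if the caller never speaks."""
    for i, frame in enumerate(frames):
        energy = sum(s * s for s in frame) / len(frame)
        if energy > threshold:
            return i
    return None

silence = [[0, 1, -1, 0]] * 5           # near-zero energy while prompt plays
speech = [[200, -180, 150, -160]]       # caller starts speaking
print(detect_barge_in(silence + speech, threshold=100.0))  # 5
```

Only after this cheap energy test fires does the host need to run the expensive recognition search, which is the density win described above.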

By contrast, the host computer, with its large and inexpensive memory, is better suited to tasks like storing very large vocabularies and searching for a specific utterance within a vocabulary. Also, adding sophisticated grammar processing to speech systems, enabling fully automated systems to understand caller input (for example, "I want to make a reservation for a flight from Newark to Miami this Thursday"), is most efficiently done on the host. As a result, the hybrid architecture -- with front-end processing on a DSP board and vocabulary look-up and grammar processing on the host -- is an optimized combination that allocates specific tasks to those environments (DSP or host) where they are handled most efficiently.

Matching The Architecture To The Call Center
When you are deploying speech-enabled CTI apps in a call center, there are several issues to consider in choosing the right architecture.

A small call center system that only requires a few channels of TTS and ASR may save money by opting for a software-only solution, with no speech-technology-specific hardware required. Using current technology, it is possible to reach densities of about eight channels per single Pentium 200, depending upon the type of technology (ASR or TTS) and the CPU processing power required by the CTI application. To ensure reliable performance with this architecture, the system designer must carefully consider the tasks the speech technologies need to accomplish. It is essential to determine how much of the time a recognizer is expected to be active, as well as how much CPU processing power would be used by all of the recognizers simultaneously. This must be followed by actual benchmarking tests to ensure a reliable, effective solution. Call centers should be sure to ask their dealers if such benchmarking is included in the development costs.
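
The sizing reasoning above amounts to a back-of-the-envelope calculation. The per-channel CPU cost and duty cycle in this sketch are placeholder assumptions; as the article stresses, only benchmarking on your own hardware yields real numbers.

```python
def max_channels(cpu_budget_pct, per_channel_pct, duty_cycle):
    """Channels supportable given the average CPU cost of one active
    recognizer and the fraction of time each channel's recognizer is
    actually active."""
    avg_load_per_channel = per_channel_pct * duty_cycle
    return int(cpu_budget_pct // avg_load_per_channel)

# Assume 80% of the CPU is reserved for speech, each active recognizer
# costs 20% CPU, and recognizers are active half the time on average:
print(max_channels(80, 20, 0.5))  # 8 -- in line with the density cited above
```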

In larger call center environments, where high-density, high-availability systems are crucial, embedded architectures provide a more scalable, maintainable solution. Based upon the specific type of technology used, current densities can be in the area of 12 ASR or 24 TTS resources per board. In these environments, there is great flexibility in being able to confidently scale the speech technology component of a system by adding boards that do not load the host CPU. It is also crucial to be able to quickly replace a defective board with a new one to bring a system back online.
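
Provisioning an embedded deployment then becomes simple arithmetic over per-board densities. This sketch uses the densities cited above; the one-spare-board policy is an assumption added for the quick-replacement point.

```python
from math import ceil

def boards_needed(asr_channels, tts_channels,
                  asr_per_board=12, tts_per_board=24, spares=1):
    """DSP boards required for the target channel counts, plus spare
    boards kept on hand for quick swap-out of a defective unit."""
    return (ceil(asr_channels / asr_per_board)
            + ceil(tts_channels / tts_per_board)
            + spares)

# 96 ASR + 48 TTS channels: 8 ASR boards + 2 TTS boards + 1 spare.
print(boards_needed(96, 48))  # 11
```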

In summary, for low-density call center applications where cost is a primary concern, software-only, host-based technologies are most appropriate. This configuration requires more work in terms of benchmarking to ensure adequate system performance and may lead to greater difficulty in problem isolation and diagnosis. Host-based products can also serve as a relatively inexpensive testing ground or prototyping platform for speech-enabled call center applications.

For high-density, high-availability applications, embedded systems are more appropriate. They offer greater scalability and more deterministic behavior for provisioning. Also, embedded systems lend themselves to easier problem isolation and diagnosis and can quickly be brought back online.

Finally, for sophisticated, large-vocabulary, fully automated transaction applications -- which are increasingly common in call centers -- a DSP/host-CPU hybrid architecture provides the most efficient platform. This hybrid architecture can move a specific aspect of speech recognition to the environment (DSP or host) where it is handled most effectively.

Gene Eagle is speech product line manager for Dialogic Corporation (www.dialogic.com), a manufacturer of high-performance, standards-based CT components. Dialogic products are used in voice, fax, data, voice recognition, speech synthesis, call center management and IP telephony applications in both the CPE and public network environments. The company is headquartered in Parsippany, New Jersey, with regional headquarters in Tokyo, Brussels and Buenos Aires and sales offices worldwide.
