TMCnet - World's Largest Communications and Technology Community




[March 10, 2003]

A Closer Look At Voice XML (Part 2 of 3)


Anyone familiar with current products and technology for automated telephone customer service or messaging applications has surely heard that Voice XML (Voice Extensible Markup Language) is The Next Big Thing. It's supposed to leverage Web technology and make these applications, well... better! But what is it, exactly? What are its advantages and drawbacks? When does it make sense? And what skills and tools are needed to develop Voice XML applications?

In the first article in this three-part series, we gave an overview of Voice XML (or VXML) and of another emerging voice application standard, Speech Application Language Tags (SALT). We mentioned the enormous advantages of implementing voice applications in a Web environment, including dependable system design using proven architectures; a wealth of established supporting standards, technologies, and products; and large pools of experienced people. Here we look at VXML in more detail: What it attempts to do, its components, and some issues about how to make it work in the real world.

VXML is promoted by a non-profit industry group, the Voice XML Forum. Most current VXML-based commercial products support Version 1.0 of the specification. Voice XML Version 2.0 was accepted in January 2003 as a Candidate Recommendation by the World Wide Web Consortium (W3C), and is rapidly being adopted by VXML product vendors.

VXML consists of software elements that are used to construct automated telephone-based voice applications. (It may also be applied, in principle, to non-telephone voice applications, but SALT may be a better choice in those cases.) It conforms to Web programming paradigms and is designed to fit the typical three-tier Web system architecture: Web servers contain business logic and "pages" of content; they exchange data with back-end data stores; and they serve content pages to browsers, which present them to end users.

VXML pages are downloaded from Web servers to "voice browsers." They contain code, in the form of tags in XML syntax with their associated attributes and data, that's interpreted by the voice browser to orchestrate audio dialogs with end users. It's analogous to HTML, where tags specify text and graphics to be displayed in desktop Web browsers. For example, the HTML tag <p> instructs Web browsers to render the indicated text as a paragraph; the VXML tag <prompt> instructs voice browsers to render the indicated data as audio output. VXML browsers are typically located in data centers and accessed through the public telephone network.

Borrowing from HTML, VXML uses the "form" as its basic building block; each form governs a bit of dialog between the caller and the application. VXML forms, like HTML forms, contain "fields," each to be filled in by the caller with a particular piece of data. The dialog proceeds by prompts ("Please say your account number") and responses, given either as speech, which is processed with speech recognition software, or as touch-tones. Each field must have an assigned grammar, specifying the verbal phrases the speech recognition engine will understand, the allowed touch-tone sequences, or both. The various VXML elements allow developers to create forms and fields, specify prompts and grammars, define variables for storing various pieces of data, and perform other chores like handling cases where no response is heard or the caller's speech isn't recognized. As with the Web, the HTTP protocol is used for communication with the server.
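As a rough illustration of the form, field, prompt, and grammar structure described above, the following Python sketch embeds a small VXML 2.0 page and lists what each field asks for. The page content, the "account" field name, and the http://example.com/balance URL are assumptions made up for this example. The field relies on the built-in "digits" grammar type, which a given voice browser may or may not support.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"

# Hypothetical page: one form with one field, plus handlers for the cases
# where nothing is heard (noinput) or the speech isn't recognized (nomatch).
# The field name and the submit URL are illustrative only.
PAGE = """<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="balance">
    <field name="account" type="digits">
      <prompt>Please say your account number.</prompt>
      <noinput>Sorry, I did not hear you. <reprompt/></noinput>
      <nomatch>Sorry, I did not understand. <reprompt/></nomatch>
    </field>
    <block>
      <submit next="http://example.com/balance" namelist="account"/>
    </block>
  </form>
</vxml>"""


def describe_fields(page):
    """Return the name, grammar type, and prompt text of every field."""
    root = ET.fromstring(page)
    ns = {"v": VXML_NS}
    fields = []
    for field in root.findall(".//v:field", ns):
        prompt = field.find("v:prompt", ns)
        text = (prompt.text or "").strip() if prompt is not None else ""
        fields.append({
            "name": field.get("name"),
            "type": field.get("type"),
            "prompt": text,
        })
    return fields


if __name__ == "__main__":
    for info in describe_fields(PAGE):
        print(info)
```

The Python wrapper is only there to make the page easy to inspect; a voice browser would fetch the same markup over HTTP and interpret it directly.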

VXML also offers a set of program logic elements, like "GO TO" and "IF-THEN-ELSE" constructs, to control portions of dialogs from within VXML pages. Since VXML coexists with the normal Web infrastructure, other browser-side programming resources, such as ECMAScript (the standardized form of JavaScript), can be used in similar ways. As explained in the previous article, the introduction of program logic in VXML pages makes a lot of sense from a system design perspective. But it creates much greater complexity in program design, development, and testing than conventional voice application approaches, since application logic is split among various VXML pages and the Web server.
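To make the page-side logic concrete, here is a minimal sketch in Python that embeds a VXML 2.0 page using the <if>, <else/>, and <goto> elements, then lists the branches that live in the page rather than on the server. The page, its form ids, and the helper function names are hypothetical; the "boolean" field type is a built-in grammar that a given voice browser may or may not provide.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"

# Hypothetical page: the yes/no answer is tested inside the page itself,
# so the branch is taken without a round trip to the server.
PAGE = """<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="ask">
    <field name="wants_balance" type="boolean">
      <prompt>Would you like to hear your balance?</prompt>
      <filled>
        <if cond="wants_balance">
          <goto next="#balance"/>
        <else/>
          <goto next="#goodbye"/>
        </if>
      </filled>
    </field>
  </form>
  <form id="balance">
    <block>Your balance is not available in this demo.</block>
  </form>
  <form id="goodbye">
    <block>Goodbye.</block>
  </form>
</vxml>"""


def goto_targets(page):
    """List every <goto> target in the page, in document order."""
    root = ET.fromstring(page)
    return [g.get("next") for g in root.iter(f"{{{VXML_NS}}}goto")]


def missing_forms(page):
    """Return in-page targets ("#id") that name no form in this page."""
    root = ET.fromstring(page)
    form_ids = {f.get("id") for f in root.iter(f"{{{VXML_NS}}}form")}
    return [t for t in goto_targets(page)
            if t.startswith("#") and t[1:] not in form_ids]


if __name__ == "__main__":
    print(goto_targets(PAGE))
    print(missing_forms(PAGE))
```

A check like missing_forms is one small way to cope with logic that is spread across pages: it catches a branch that points at a dialog the page no longer contains.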

The Web execution paradigm creates further intricacies. When, during a dialog, some data must be submitted to the server (say, a request for the caller's account balance), execution of the VXML page stops. The server processes the request and, keeping track of the current state of the dialog, sends a new VXML page back to the browser, which then executes it to continue the dialog. As with Web applications, technologies like Java Server Pages (JSP), Active Server Pages (ASP), and Java Servlets enable servers to generate VXML pages on the fly, customizing them based on business logic and data specific to the user session. But even so, VXML pages have their program logic and data fixed when they're created. Then they're on their own. Any further interactions with the server will require new pages to be generated.
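The following Python sketch shows, under simplifying assumptions, what "generating a VXML page on the fly" can look like on the server. The function name, the caller name, and the balance value are hypothetical stand-ins for whatever the business logic and back-end data store would supply; a real deployment would use a server technology such as those named above.

```python
from xml.sax.saxutils import escape


def render_balance_page(caller_name, balance_cents):
    """Build a one-prompt VXML page customized for this caller's session."""
    dollars, cents = divmod(balance_cents, 100)
    text = f"{caller_name}, your balance is {dollars} dollars and {cents} cents."
    # escape() protects the markup if the session data contains &, <, or >.
    return (
        '<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">\n'
        '  <form id="balance">\n'
        f'    <block><prompt>{escape(text)}</prompt></block>\n'
        '  </form>\n'
        '</vxml>'
    )


if __name__ == "__main__":
    # The values are fixed into the page at generation time; once the page
    # is sent, the server cannot change them without sending a new page.
    print(render_balance_page("Pat", 12345))
```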

For example, a voice mail application allows callers to browse through their voice mail messages. When a caller logs in, the server creates and downloads a VXML "message browsing dialog" page. It accepts commands for navigating among the messages, and has a variable array containing the locations of the audio files where each message is stored, so they can be retrieved and played. When the caller wants to delete a message, a request must be submitted back to the server. Execution of the "message browsing dialog" page stops, and the server deletes the indicated message file. To allow message browsing to continue, the server must then generate a new page with the same program logic, but where the variable array is now initialized with the remaining message files.
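The delete-and-regenerate cycle in this example can be sketched in Python as follows. The function names, the "messages" and "current" variable names, and the file names are all hypothetical, and the navigation dialog itself is left as a placeholder comment; the point is only that the same page template is rendered again with a shorter array.

```python
import json
from xml.sax.saxutils import quoteattr


def render_browse_page(message_urls):
    """Build the "message browsing dialog" page for the given audio files."""
    # A JSON array of strings is also a valid ECMAScript array literal,
    # so it can initialize the page's variable array directly.
    array_literal = json.dumps(list(message_urls))
    return (
        '<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">\n'
        f'  <var name="messages" expr={quoteattr(array_literal)}/>\n'
        '  <var name="current" expr="0"/>\n'
        '  <!-- navigation forms (next, previous, delete) would follow -->\n'
        '</vxml>'
    )


def delete_and_rerender(message_urls, index):
    """Drop one message, then regenerate the page with the remaining ones."""
    remaining = [u for i, u in enumerate(message_urls) if i != index]
    return remaining, render_browse_page(remaining)


if __name__ == "__main__":
    urls = ["msg1.wav", "msg2.wav", "msg3.wav"]
    remaining, page = delete_and_rerender(urls, 1)
    print(page)
```

In this sketch the server only forgets the entry; deleting the underlying audio file and tracking the caller's position in the list are left out.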

Organizing these various bits of logic and data on the server and different VXML pages presents new challenges for application design and testing. Clearly, care is needed so that the server, static VXML pages, and dynamic VXML pages work together in a clean and maintainable way.

What kinds of knowledge, tools, and equipment are needed for VXML application development? The convergence in VXML of formerly distinct Web and voice technologies means that VXML application development demands teams with an uncommon set of disparate skills: Telephony hardware and software, voice user-interface design, speech recognition and text-to-speech technology (grammars, pronunciations), and audio editing, together with Web technologies like HTTP, scripting languages, Java, servlets, JSP, ASP, and Web server configuration. In the last couple of years, many companies have rushed to narrow this skills gap with VXML development tools and services.

As with most software development aids, VXML tools aspire to yield good results without requiring in-depth knowledge of the underlying technologies and programming languages. However, as with all such tools, there are trade-offs between power and ease-of-use. And successful voice application development demands some knowledge of telephony and speech technologies that development tools can't provide.

Most VXML development tools feature graphical interfaces for drag-and-drop editing. Some also allow direct editing of the VXML code, so those with sufficient knowledge can create functionality beyond what the graphical editor supports. To further simplify development, many tools come with libraries of canned dialogs that perform common tasks, like collecting spoken telephone numbers or times and dates. While these prebuilt dialogs can be very useful, most are implemented in proprietary code, so they can't be easily modified or extended, and they aren't portable to other tools.

Testing and debugging voice and touch-tone-based telephone applications present unique requirements. Testers must call, listen to, and speak to the application, and perhaps use touch-tone keys. Some development tools use the desktop computer's audio card as an interface to the voice browser and simulate a telephone environment, including a graphical telephone keypad. Others rely on actual telephone interfaces, requiring the installation of telephone line cards in the test machines. Line cards can be tricky to get working right, especially for the "barge-in" functionality (letting callers interrupt a prompt by speaking or pressing a key), which is crucial to most voice applications.

A third approach is offered by hosting service providers. Customers develop their VXML applications remotely via the Web using the service provider's development tools. The applications then run in the service provider's data centers. Testing is done by simply calling the application using a designated telephone number. This approach is very simple, with virtually no up-front investment in equipment or software. But the application may be limited to features offered by the host, and is locked in to the host's development and run-time environment.

The voice browser itself also has some influence on application development. At present, not all voice browsers support every VXML feature. And most offer proprietary extensions and other features to differentiate their products. The most important concern is that VXML doesn't address call control functions: Answer, transfer, conference, etc. The W3C is in the process of formulating a Call Control XML (CCXML) standard, but for the moment, voice browsers implement these functions in proprietary ways.

In the final article of this series, we'll take a closer look at SALT: The motives behind its creation, its advantages and limitations, and its issues for the system designer and application developer.

Mark Levinson is president of VoxMedia Consulting. He has over 15 years of telecom industry experience, including more than five years managing the design, development, and deployment of real-world speech applications. He can be reached at 781-259-0404 or [email protected].
