TMCnet - World's Largest Communications and Technology Community




[March 3, 2003]

Making Sense of Voice XML and SALT (Part 1 of 3)


A couple of years ago, when the dot-com boom was at its height, it seemed every area of information technology was being transformed by a tsunami of Web-derived architectures and standards: HTTP, XML, XSL, SOAP and many others. Those working in voice applications found something called Voice Extensible Markup Language (Voice XML, or VXML) had washed ashore. Now the boom has gone bust, but the Web is here to stay. So, too, is VXML, which is being vigorously embraced by many in the voice industry. And it's been joined by yet another Web-derived voice standard, Speech Application Language Tags (SALT).

What are VXML and SALT? How do they differ? Are they really a big improvement over previous languages and architectures? And how has their Web heritage affected their applicability and usefulness for voice systems?

VXML and SALT are software standards that address the unique user-interface requirements of humans interacting with computers by listening and speaking. As extensions of Web technology, they're designed to fit the common Web three-tier information architecture: The presentation tier, where the end user interacts with a "browser"; the middle tier consisting of the Web server and the application's business logic; and the back-end data storage tier. The primary difference between purely text/graphical Web applications and those with voice lies in the presentation tier, where the "browsers" employ voice and audio rather than, or in addition to, text and graphics.

VXML was developed under the auspices of the Voice XML Forum and the World Wide Web Consortium (W3C), while the more recent SALT standard is supported by the SALT Forum.

Both standards specify language elements in XML syntax that are interpreted by VXML- or SALT-capable "voice browsers" to execute dialogs where audio prompts are played to elicit users' speech or touch-tone input. They rely on other Web protocols, like HTTP, to handle non-voice-specific functions.
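To make "language elements in XML syntax" concrete, here is a minimal sketch in Python that holds a small VXML-style page as text and walks it the way a voice browser might before running the dialog. The page is illustrative only: the form and field names are made up, and the XML namespace and speech grammars a real page would need are left out for brevity. The helper list_prompts is hypothetical, not part of any voice browser.

```python
import xml.etree.ElementTree as ET

# Illustrative VXML-style page (assumption: simplified, no namespace or grammars).
VXML_PAGE = """
<vxml version="2.0">
  <form id="trip">
    <field name="departure">
      <prompt>What is your departure city?</prompt>
    </field>
    <field name="destination">
      <prompt>Now please tell me your destination city.</prompt>
    </field>
  </form>
</vxml>
"""

def list_prompts(page):
    """Return (field name, prompt text) pairs in document order."""
    root = ET.fromstring(page.strip())
    return [
        (field.get("name"), field.findtext("prompt").strip())
        for field in root.iter("field")
    ]

for name, prompt in list_prompts(VXML_PAGE):
    print(name, "->", prompt)
```

A real voice browser would go on to play each prompt and listen for speech or touch-tone input; this sketch only shows that the page is ordinary XML that any standard parser can read.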

VXML is primarily for telephone applications. It runs on voice browsers, typically located in data centers, that callers access through the public telephone network. SALT can likewise be used for telephone-only applications, but is designed to be more general purpose, supporting multi-modal voice/text/graphic applications via SALT-enabled Web browsers and other devices like cell phones and personal digital assistants (PDAs).

OK, more Web technology -- great. What's the advantage over existing formats? One happy fact about the World Wide Web is that its genesis among academic and government researchers has meant that from the very beginning its logical structure and protocols were open and grew rapidly into extensible standards. The subsequent explosion of Web usage was made possible in part because these standards allowed many different organizations to produce innovations and products that enhanced its power and ease of use.

The major intent of VXML and SALT is to apply the strengths of Web standards and architectures to voice applications. Open standards, allowing portability and interoperability, mean that software components from different vendors should run on a variety of hardware platforms, and work successfully with each other. So any voice-enabled browser should correctly render voice content from any Web server.

This is a huge win for the end user, since different vendors should (in principle) compete to offer the best-performing, lowest-cost products. The proven three-tier Web architecture greatly simplifies system design and maintenance, and the costs of switching between vendors should be vastly reduced. A critical mass of users supporting a given set of standards naturally creates a market, attracting many vendors, not only for the core components, but also for application development and maintenance tools. And it produces a large pool of skilled workers.

Sounds like a slam-dunk winner, right? Well, there are some downsides, too. The technology is not quite mature. Neither standard addresses some essential tasks like call control (answer, transfer, conference, etc.). The W3C is currently considering standards for call control, as well as related functions not covered in VXML and SALT, like how to specify speech recognition grammars and text-to-speech parameters. But for now, most currently available products supporting VXML and SALT handle these things in proprietary ways.

More significantly, VXML and SALT Web-based architectures introduce new complexities into application design, development and maintenance. Conventional IVR and voice applications are developed using popular software languages, like C and C++, or proprietary graphical development tools, to create stand-alone programs that express the application logic and access various telephony, speech and database functions. But VXML and SALT force application logic to be organized quite differently.

Using Web architectures for voice applications to a certain extent "shoehorns" the dialog-based voice user interface into a mold designed for text- and graphics-centric interactions. Graphical interactions typically involve large amounts of information presented at one time, which users can absorb and respond to over a period of time. So Web servers download whole pages of information at one time. Program logic on the server determines which pages and data are presented.

Voice applications, in contrast, typically feature rapid exchanges of many small pieces of information in prompt-and-response dialogs: "What is your departure city?" "Orlando." "Now please tell me your destination city," etc. Dialogs aren't deterministic, but will vary according to how the interactions evolve: If the user's speech isn't recognized, for instance, she may be asked to repeat it.

Voice user interfaces must have some form of procedural code to control these interactions. It would waste network bandwidth to send each of the many prompts and responses back and forth between the server and browser. It makes more sense to have a large portion of the dialog-controlling logic reside on the browser where the user interaction takes place.
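The kind of procedural control described here can be sketched in a few lines of Python. This is a rough illustration, not how any particular voice browser works: run_field is a hypothetical helper, recognize stands in for the speech recognizer, and the retry limit of three is an arbitrary assumption.

```python
def run_field(prompt, recognize, valid_answers, max_tries=3):
    """Play a prompt, listen, and reprompt when the answer isn't recognized.

    recognize is a stand-in for the speech recognizer (assumption): a
    function that returns whatever text was heard on each call.
    Returns (answer or None, list of prompts that were played).
    """
    played = []
    for attempt in range(max_tries):
        if attempt == 0:
            played.append(prompt)
        else:
            played.append("Sorry, I didn't understand. " + prompt)
        heard = recognize()
        if heard in valid_answers:
            return heard, played
    return None, played  # give up after max_tries, e.g. hand off to an agent


# Usage: simulate a caller whose first answer isn't recognized.
answers = iter(["mumble", "Orlando"])
city, played = run_field(
    "What is your departure city?",
    lambda: next(answers),
    {"Orlando", "Boston"},
)
print(city, len(played))
```

Note that the whole retry loop runs without a single trip to the server, which is the point of keeping this logic on the browser side.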

VXML tackles this problem by providing its own simple set of program logic elements, such as IF-THEN-ELSE constructs. Web scripting languages such as JavaScript (standardized as ECMAScript) can also be used. SALT is more straightforward: It addresses only voice functions and relies on script languages and other Web infrastructure for everything else. But in both cases, splitting program logic between the browser and server complicates design, coding, testing and maintenance.

Mixing program logic with content creates additional complexities. Rather than arranging program components by functional groupings, their organization is governed in part by the requirements of communication between server and browser. The server downloads a "page" containing VXML or SALT code to the browser. As the logic on that page runs the dialog, there may be a need to submit some data back to the server (let's say, an account number to be verified). When the data is submitted, the page ceases execution. After processing the submission, the server must remember the current state of the dialog and download a new page to pick up where it left off. Clearly, organizing and testing an application's dialog components can be a challenge.
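The submit-and-resume cycle can be sketched from the server's side in Python. Everything here is a simplified assumption for illustration: DialogServer is a hypothetical class, the page names are invented, and a real middle tier would return actual VXML or SALT pages over HTTP rather than page names.

```python
class DialogServer:
    """Toy middle tier that remembers where each caller is in the dialog."""

    PAGES = ["ask_account", "ask_pin", "main_menu"]  # hypothetical page names

    def __init__(self, accounts):
        self.accounts = accounts  # account number -> PIN
        self.sessions = {}        # session id -> saved dialog state

    def start(self, session_id):
        """Begin a call: create state and send the first page."""
        self.sessions[session_id] = {"step": 0, "account": None}
        return self.PAGES[0]

    def submit(self, session_id, value):
        """Handle data submitted by a page and choose the next page.

        The submitting page has stopped running, so the saved state is
        the only record of how far the dialog has progressed.
        """
        state = self.sessions[session_id]
        page = self.PAGES[state["step"]]
        if page == "ask_account" and value in self.accounts:
            state["account"] = value
            state["step"] = 1
        elif page == "ask_pin" and self.accounts[state["account"]] == value:
            state["step"] = 2
        # On a failed check the step is unchanged, so the same page is re-sent.
        return self.PAGES[state["step"]]


server = DialogServer({"1234": "0000"})
print(server.start("call-1"))           # first page goes to the browser
print(server.submit("call-1", "1234"))  # account verified, dialog moves on
```

Even in this toy version the dialog flow is spread across two places: the pages that talk to the caller and the server code that decides which page comes next.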

Another design constraint arises from Web security features. Browsers are severely restricted in their access to the machine on which they're running. Only certain constrained kinds of code -- Web applets and scripts -- are allowed to execute, and access to local files is limited to certain temporary Internet files and "cookies." This makes perfect sense in the world of the Internet, where an open network lies between the browser and server, but not for typical telephone applications, where voice browsers sit securely behind firewalls in data centers. A voice mail application on the browser can record voice mail messages, for example, but must upload them to the server for storage. And they must be downloaded again in order to be played to the recipient. Browser vendors are free to offer proprietary platform features, like the ability to store recordings locally, which will differentiate their products. But these features will reduce the advantages of the standards.

Despite these issues, it seems clear that the advantages of VXML and SALT far outweigh the shortcomings. The task that lies ahead is to grow a pool of software designers, developers and administrators who are familiar with the components and architecture, and can successfully work with these resources. Support for this goal is appearing in the many VXML development tools now coming to market, and Microsoft is offering a new SALT-based Speech SDK as part of its .NET framework.

In the next two articles of this three-part series, we take more in-depth looks at VXML and SALT: Their capabilities and limitations, what you need to know to use them, what skills are required and which development tools are available.

Mark Levinson is president of VoxMedia Consulting. He has over 15 years of telecom industry experience, including more than five years managing the design, development, and deployment of real-world speech applications. He can be reached at 781-259-0404 or [email protected].
