A couple of years ago, as the dot-com boom was at its height, it seemed
every area of information technology was being transformed by a tsunami of
Web-derived architectures and standards: HTTP, XML, XSL, SOAP and many
others. Those working in voice applications found something called Voice
Extensible Markup Language (Voice XML, or VXML) had washed ashore. Now the
boom has gone bust, but the Web is here to stay. So, too, is VXML, which is
being vigorously embraced by many in the voice industry. And it's been
joined by yet another Web-derived voice standard, Speech Application
Language Tags (SALT).
What are VXML and SALT? How do they differ? Are they really a big
improvement over previous languages and architectures? And how has their
Web heritage affected their applicability and usefulness for voice
systems?
VXML and SALT are software standards that address the unique
user-interface requirements of humans interacting with computers by
listening and speaking. As extensions of Web technology, they're designed
to fit the common Web three-tier information architecture: The
presentation tier, where the end user interacts with a
"browser"; the middle tier consisting of the Web server and the
application's business logic; and the back-end data storage tier. The
primary difference between purely text/graphical Web applications and
those with voice lies in the presentation tier, where the
"browsers" employ voice and audio rather than, or in addition
to, text and graphics.
VXML was developed under the auspices of the Voice XML Forum and the World Wide Web Consortium (W3C), while the
more recent SALT standard is supported by the SALT Forum.
Both standards specify language elements in XML syntax that are
interpreted by VXML- or SALT-capable "voice browsers" to execute
dialogs where audio prompts are played to elicit users' speech or
touch-tone input. They rely on other Web protocols, like HTTP, to handle
non-voice-specific functions.
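To make the idea of XML dialog markup concrete, here is a minimal sketch in Python. The embedded page borrows element names from VoiceXML (form, field, prompt), but it is deliberately simplified: it omits grammars and the XML namespace a real page would carry. The helper list_prompts is hypothetical; it only imitates the first thing a voice browser does with a page, which is to walk the fields and find the prompts to play.

```python
import xml.etree.ElementTree as ET

# A simplified, VoiceXML-style page (illustrative only; a real page would
# also declare speech grammars and the VoiceXML namespace).
PAGE = """
<vxml version="2.0">
  <form id="trip">
    <field name="departure">
      <prompt>What is your departure city?</prompt>
    </field>
    <field name="destination">
      <prompt>Now please tell me your destination city.</prompt>
    </field>
  </form>
</vxml>
"""

def list_prompts(page_xml):
    """Return (field name, prompt text) pairs in document order."""
    root = ET.fromstring(page_xml)
    pairs = []
    for field in root.iter("field"):
        prompt = field.find("prompt")
        text = prompt.text.strip() if prompt is not None and prompt.text else ""
        pairs.append((field.get("name"), text))
    return pairs

for name, text in list_prompts(PAGE):
    print(name, "->", text)
```

The point is only that the dialog is described as data in a page, which the browser interprets, rather than compiled into a stand-alone program.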
PRIMARY USES
VXML is primarily for telephone applications. It runs on voice
browsers, typically located in data centers, that callers access through
the public telephone network. SALT can likewise be used for telephone-only
applications, but is designed to be more general purpose, supporting
multi-modal voice/text/graphic applications via SALT-enabled Web browsers
and other devices like cell phones and personal digital assistants (PDAs).
OK, more Web technology -- great. What's the advantage over existing
formats? One happy fact about the World Wide Web is that its genesis among
academic and government researchers has meant that from the very beginning
its logical structure and protocols were open and grew rapidly into
extensible standards. The subsequent explosion of Web usage was made
possible in part because these standards allowed many different
organizations to produce innovations and products that enhanced its power
and ease of use.
The major intent of VXML and SALT is to apply the strengths of Web
standards and architectures to voice applications. Open standards,
allowing portability and interoperability, mean that software components
from different vendors should run on a variety of hardware platforms, and
work successfully with each other. So any voice-enabled browser should
correctly render voice content from any Web server.
This is a huge win for the end user, since different vendors should (in
principle) compete to offer the best-performing, lowest-cost products. The
proven three-tier Web architecture greatly simplifies system design and
maintenance, and the costs of switching between vendors should be vastly
reduced. A critical mass of users supporting a given set of standards
naturally creates a market, attracting many vendors, not only for the core
components, but also for application development and maintenance tools.
And it produces a large pool of skilled workers.
THE DRAWBACKS
Sounds like a slam-dunk winner, right? Well, there are some downsides,
too. The technology is not quite mature. Neither standard addresses some
essential tasks like call control (answer, transfer, conference, etc.).
The W3C is currently considering standards for call control, as well as
related functions not covered in VXML and SALT, like how to specify speech
recognition grammars and text-to-speech parameters. But for now, most
products supporting VXML and SALT handle these things in proprietary
ways.
More significantly, the Web-based architectures of VXML and SALT introduce new
complexities into application design, development and maintenance.
Conventional IVR and voice applications are developed using popular
software languages, like C and C++, or proprietary graphical development
tools, to create stand-alone programs that express the application logic
and access various telephony, speech and database functions. But VXML and
SALT force application logic to be organized quite differently.
Using Web architectures for voice applications to a certain extent
"shoehorns" the dialog-based voice user interface into a mold
designed for text- and graphics-centric interactions. Graphical
interactions typically involve large amounts of information presented at
one time, which users can absorb and respond to over a period of time. So
Web servers download whole pages of information at one time. Program logic
on the server determines which pages and data are presented.
Voice applications, in contrast, typically feature rapid exchanges of
many small pieces of information in prompt-and-response dialogs:
"What is your departure city?" "Orlando." "Now
please tell me your destination city," etc. Dialogs aren't
deterministic, but will vary according to how the interactions evolve: If
the user's speech isn't recognized, for instance, she may be asked to
repeat it.
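The non-deterministic character of such a dialog can be sketched in a few lines of Python. Everything here is an assumption made for illustration: collect_field, toy_recognizer and the two-city vocabulary are hypothetical stand-ins, and the caller's speech is simulated with plain strings rather than audio.

```python
def collect_field(prompt, recognize, utterances, max_attempts=3):
    """Ask for one value, re-asking when the speech is not recognized.

    `recognize` is a hypothetical stand-in for a speech recognizer: it
    returns the recognized text, or None when nothing matches.
    Returns (value, list of prompts actually played).
    """
    played = []
    for attempt in range(max_attempts):
        if attempt == 0:
            played.append(prompt)
        else:
            played.append("Sorry, I didn't catch that. " + prompt)
        heard = next(utterances, None)  # simulated caller speech
        if heard is None:               # caller said nothing more
            break
        value = recognize(heard)
        if value is not None:
            return value, played
    return None, played

CITIES = {"orlando": "Orlando", "boston": "Boston"}  # toy vocabulary

def toy_recognizer(audio):
    """Pretend recognizer: matches only the cities listed above."""
    return CITIES.get(audio.strip().lower())

value, played = collect_field(
    "What is your departure city?", toy_recognizer, iter(["mumble", "Orlando"])
)
```

In this run the first utterance is not recognized, so two prompts are played before the value "Orlando" comes back. The number of turns depends on what the caller says, which is exactly what makes the dialog hard to lay out as a fixed sequence of pages.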
Voice user interfaces must have some form of procedural code to control
these interactions. It would waste network bandwidth to send each of the
many prompts and responses back and forth between the server and browser.
It makes more sense to have a large portion of the dialog-controlling
logic reside on the browser where the user interaction takes place.
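A back-of-the-envelope sketch in Python shows the size of the saving. The turn counts are invented for illustration, and both function names are hypothetical.

```python
def round_trips_per_turn(turns):
    """Server drives every prompt: one request/response per dialog turn."""
    return turns

def round_trips_per_page(turns, turns_per_page):
    """Browser runs a page of dialog locally, then submits once per page."""
    return -(-turns // turns_per_page)  # ceiling division

turns = 12  # assumed length of a call, in prompt-and-response turns
print(round_trips_per_turn(turns))      # 12 round trips
print(round_trips_per_page(turns, 6))   # 2 round trips
```

With these assumed figures, letting the browser handle six turns per page cuts the server traffic from twelve exchanges to two.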
PUSHING SOME PROGRAM LOGIC TO THE BROWSER
VXML tackles this problem by providing its own simple set of program
logic elements, such as IF-THEN-ELSE constructs. Web scripting languages
like JavaScript and ECMAScript can also be used. SALT is more
straightforward: It addresses only voice functions and relies on script
languages and other Web infrastructure for everything else. But in both
cases, splitting program logic between the browser and server complicates
design, coding, testing and maintenance.
Mixing program logic with content creates additional complexities.
Rather than arranging program components by functional groupings, their
organization is governed in part by the requirements of communication
between server and browser. The server downloads a "page"
containing VXML or SALT code to the browser. As the logic on that page
runs the dialog, there may be a need to submit some data back to the
server (let's say, an account number to be verified). When the data is
submitted, the page ceases execution. After processing the submission, the
server must remember the current state of the dialog and download a new
page to pick up where it left off. Clearly, organizing and testing an
application's dialog components can be a challenge.
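The bookkeeping this imposes on the server can be sketched in Python. This is a minimal illustration under stated assumptions, not how any particular product works: the page names, the in-memory SESSIONS table and the is_valid callback (a stand-in for the business-logic check) are all hypothetical.

```python
import uuid

SESSIONS = {}  # session id -> dialog state the server keeps between pages

PAGES = ["ask_account", "ask_pin", "main_menu"]  # hypothetical page names

def start_call():
    """Create server-side state and return (session id, first page)."""
    session_id = uuid.uuid4().hex
    SESSIONS[session_id] = {"step": 0, "data": {}}
    return session_id, PAGES[0]

def handle_submit(session_id, field, value, is_valid):
    """Process data submitted by a page and pick the next page to download.

    `is_valid` stands in for the business-logic check, such as verifying
    an account number against the back-end data tier.
    """
    state = SESSIONS[session_id]
    if not is_valid(field, value):
        return PAGES[state["step"]]  # send the same page again to re-ask
    state["data"][field] = value
    state["step"] = min(state["step"] + 1, len(PAGES) - 1)
    return PAGES[state["step"]]
```

Because the submitting page stops running, the only record of where the caller is in the dialog is the entry in SESSIONS. Each new page has to be generated from that stored state.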
Another design constraint arises from Web security features. Browsers
are severely restricted in access to the machine on which they're running.
Only certain constrained kinds of code -- Web applets and scripts -- are
allowed to execute, and access to local files is limited to certain
temporary Internet files and "cookies." This makes perfect sense
in the world of the Internet, where an open network lies between the
browser and server, but not for typical telephone applications, where
voice browsers sit securely behind firewalls in data centers. A voice mail
application on the browser can record voice mail messages, for example,
but must upload them to the server for storage. And they must be
downloaded again in order to be played to the recipient. Browser vendors
are free to offer proprietary platform features, like the ability to store
recordings locally, which will differentiate their products. But these
features will reduce the advantages of the standards.
Despite these issues, it seems clear that the advantages of VXML and
SALT far outweigh the shortcomings. The task that lies ahead is to grow a
pool of software designers, developers and administrators who are familiar
with the components and architecture, and can successfully work with these
resources. Support for this goal is appearing in the many VXML development
tools now coming to market, and Microsoft is offering a new SALT-based
Speech SDK as part of its .NET framework.
In the next two articles of this three-part series, we take more
in-depth looks at VXML and SALT: Their capabilities and limitations, what
you need to know to use them, what skills are required and which
development tools are available.
Mark Levinson is president of VoxMedia Consulting. He has
over 15 years of telecom industry experience, including more than five
years managing the design, development, and deployment of real-world
speech applications. He can be reached at 781-259-0404 or [email protected].