Anyone familiar with current products and technology for automated
telephone customer service or messaging applications has surely heard that
Voice XML (Voice Extensible Markup Language) is The Next Big Thing. It's
supposed to leverage Web technology and make these applications, well...
better! But what is it, exactly? What are its advantages and drawbacks?
When does it make sense? And what skills and tools are needed to develop
Voice XML applications?
In the first article in this three-part series, we gave an overview of
Voice XML (or VXML) and another emerging voice application standard, Speech
Application Language Tags (SALT). We mentioned the enormous advantages of
implementing voice applications in a Web environment, including dependable
system design using proven architectures; a wealth of established
supporting standards, technologies, and products; and large pools of
experienced people. Here we look at VXML in more detail: What it attempts
to do, its components, and some issues about how to make it work in the
real world.
VXML is promoted by a non-profit industry group, the Voice XML Forum. Most current
VXML-based commercial products support Version 1.0 of the specification.
Voice XML Version 2.0 was accepted in January 2003 as a Candidate
Recommendation by the World Wide Web Consortium
(W3C), and is rapidly being adopted by VXML product vendors.
VOICE XML BASICS
VXML consists of software elements that are used to construct automated
telephone-based voice applications. (It may also be applied, in principle,
to non-telephone voice applications, but SALT may be a better choice in
those cases.) It conforms to Web programming paradigms and is designed to
fit the typical three-tier Web system architecture: Web servers contain
business logic and "pages" of content; they exchange data with
back-end data stores; and they serve content pages to browsers, which
present them to end users.
VXML pages are downloaded from Web servers to "voice
browsers." They contain code, in the form of tags in XML syntax with
their associated attributes and data, that's interpreted by the voice
browser to orchestrate audio dialogs with end users. It's analogous to
HTML, where tags specify text and graphics to be displayed in desktop Web
browsers. For example, the HTML tag <p> instructs Web browsers to
render the indicated text as a paragraph; the VXML tag <prompt>
instructs voice browsers to render the indicated data as audio output.
VXML browsers are typically located in data centers and accessed through
the public telephone network.
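To make the HTML analogy concrete, here is a minimal sketch of what a voice browser might download. The page is written in Version 1.0 style, and its greeting text is invented for illustration. The short Python wrapper is only a stand-in for a real voice browser: it parses the page and pulls out the text that would be rendered as audio.

```python
import xml.etree.ElementTree as ET

# A minimal page in Version 1.0 style. The greeting text is invented
# for illustration and is not taken from any real application.
HELLO_PAGE = """<?xml version="1.0"?>
<vxml version="1.0">
  <form id="greeting">
    <block>
      <prompt>Welcome to the account information line.</prompt>
    </block>
  </form>
</vxml>
"""


def prompts_in(page):
    """Return the text of every <prompt> element in a VXML page."""
    root = ET.fromstring(page)
    return [p.text.strip() for p in root.iter("prompt") if p.text]


print(prompts_in(HELLO_PAGE))
```

A real voice browser would hand the prompt text to a text-to-speech engine or play a recorded audio file, rather than print it.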
Borrowing from HTML, the basic building block of VXML is the
"form." Each form governs a bit of dialog between the caller
and application. VXML forms, like HTML forms, contain "fields,"
each to be filled in by the caller with a particular piece of data. The
dialog proceeds by prompts ("Please say your account number")
and responses, either as speech, which is processed with speech
recognition software, or touch-tones. Each field must have an assigned
grammar, specifying the verbal phrases the speech recognition engine will
understand and/or allowed touch-tone sequences. The various VXML elements
allow developers to create forms and fields, specify prompts and grammars,
define variables for storing various pieces of data, and perform other
chores like handling cases where no response is heard or the caller's
speech isn't recognized. As with the Web, the HTTP protocol is used for
communication with the server.
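The pieces described above might fit together as in the following sketch: one form, one field, a prompt, a grammar, and handlers for silence and unrecognized speech. The grammar file name and the server URL are hypothetical placeholders. The Python helper is an assumed design-time check, not part of VXML itself. It flags any field that has neither a grammar nor a built-in type.

```python
import xml.etree.ElementTree as ET

# One form with one field. "account_number.gram" and the example.com URL
# are hypothetical placeholders, not real resources.
ACCOUNT_FORM = """<vxml version="1.0">
  <form id="get_account">
    <field name="account_number">
      <prompt>Please say your account number.</prompt>
      <grammar src="account_number.gram"/>
      <noinput>Sorry, I didn't hear anything. <reprompt/></noinput>
      <nomatch>Sorry, I didn't understand. <reprompt/></nomatch>
      <filled>
        <submit next="http://example.com/balance" namelist="account_number"/>
      </filled>
    </field>
  </form>
</vxml>
"""


def fields_missing_grammar(page):
    """Name every field that has no <grammar> child and no built-in type."""
    root = ET.fromstring(page)
    return [
        field.get("name")
        for field in root.iter("field")
        if field.find("grammar") is None and field.get("type") is None
    ]


print(fields_missing_grammar(ACCOUNT_FORM))
```

The <noinput> and <nomatch> elements cover the two chores mentioned above: no response heard, and speech not recognized. The <submit> element sends the collected value to the server over HTTP.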
PROGRAM LOGIC AND APPLICATION DESIGN
VXML also offers a set of program logic elements, such as <goto> and
<if>/<elseif>/<else> (the equivalents of "GO TO" and "IF-THEN-ELSE"
constructs), to control portions of dialogs from within VXML pages. Since
VXML coexists with the normal Web infrastructure, browser-side scripting
in ECMAScript (the standardized form of JavaScript) can be used in similar
ways. As explained in
the previous article, the introduction of program logic in VXML pages
makes a lot of sense from a system design perspective. But it creates much
greater complexity in program design, development, and testing than
conventional voice application approaches, since application logic is
split among various VXML pages and the Web server.
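A small sketch may help show how logic ends up split. The page below uses <if> and <goto> to branch on the caller's choice: one branch stays inside the page, the other fetches a different page from the server. The form names, grammar file, and page name are all hypothetical. The Python helper simply sorts the transitions into those two groups, the kind of inventory a designer might want when tracing where the application logic lives.

```python
import xml.etree.ElementTree as ET

# A menu whose branches lead to different places. All names here
# (menu.gram, balance_dialog, operator.vxml) are hypothetical.
MENU_PAGE = """<vxml version="1.0">
  <form id="main_menu">
    <field name="choice">
      <prompt>Say balance or operator.</prompt>
      <grammar src="menu.gram"/>
      <filled>
        <if cond="choice == 'balance'">
          <goto next="#balance_dialog"/>
        <else/>
          <goto next="operator.vxml"/>
        </if>
      </filled>
    </field>
  </form>
  <form id="balance_dialog">
    <block><prompt>Looking up your balance.</prompt></block>
  </form>
</vxml>
"""


def goto_targets(page):
    """Split <goto> targets into in-page jumps and fetches of other pages."""
    root = ET.fromstring(page)
    in_page, other_pages = [], []
    for goto in root.iter("goto"):
        target = goto.get("next", "")
        if target.startswith("#"):
            in_page.append(target)
        else:
            other_pages.append(target)
    return in_page, other_pages


print(goto_targets(MENU_PAGE))
```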
The Web execution paradigm creates further intricacies. When, during a
dialog, some data must be submitted to the server (say, a request for the
caller's account balance), execution of the VXML page stops. The server
processes the request and, keeping track of the current state of the
dialog, sends a new VXML page back to the browser, which then executes it
to continue the dialog. As with Web applications, technologies like
JavaServer Pages (JSP), Active Server Pages (ASP), and Java servlets enable
servers to generate VXML pages on the fly, customizing them based on
business logic and data specific to the user session. But even so, VXML
pages have their program logic and data fixed when they're created. Then
they're on their own. Any further interactions with the server will
require new pages to be generated.
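As a rough illustration of on-the-fly generation, the sketch below plays the role that a JSP, ASP, or servlet would play, written here in Python for brevity. The function name, the prompt wording, and the "main_menu.vxml" page are assumptions made for this example. It also assumes a non-negative balance held in cents.

```python
import xml.etree.ElementTree as ET


def render_balance_page(balance_cents):
    """Build a VXML page that speaks one caller's balance.

    Assumes a non-negative balance expressed in cents.
    """
    dollars, cents = divmod(balance_cents, 100)
    vxml = ET.Element("vxml", version="1.0")
    form = ET.SubElement(vxml, "form", id="balance")
    block = ET.SubElement(form, "block")
    prompt = ET.SubElement(block, "prompt")
    prompt.text = f"Your balance is {dollars} dollars and {cents} cents."
    # Hand control back to a hypothetical main menu page.
    ET.SubElement(block, "goto", next="main_menu.vxml")
    return ET.tostring(vxml, encoding="unicode")


print(render_balance_page(123456))
```

Note that the balance is baked into the page as plain text. Once the page reaches the voice browser, nothing in it can change unless the browser goes back to the server for a new page.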
For example, a voice mail application allows callers to browse through
their voice mail messages. When a caller logs in, the server creates and
downloads a VXML "message browsing dialog" page. It accepts
commands for navigating among the messages, and has a variable array
containing the locations of the audio files where each message is stored,
so they can be retrieved and played. When the caller wants to delete a
message, a request must be submitted back to the server. Execution of the
"message browsing dialog" page stops, and the server deletes the
indicated message file. To allow message browsing to continue, the server
must then generate a new page with the same program logic, but where the
variable array is now initialized with the remaining message files.
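The regeneration step in this example might look something like the sketch below. It is a simplified illustration, not a complete voice mail application: the navigation commands and the actual file deletion are left out, and the function names and file names are invented. The point is that the same page-building logic runs again with a shorter list.

```python
import json
import xml.etree.ElementTree as ET


def render_message_browser(message_urls):
    """Build the message browsing page for the given audio file locations."""
    vxml = ET.Element("vxml", version="1.0")
    # The variable array from the example: one audio location per message.
    # A JSON list of strings is also a valid ECMAScript array literal.
    ET.SubElement(vxml, "var", name="messages", expr=json.dumps(message_urls))
    form = ET.SubElement(vxml, "form", id="browse")
    block = ET.SubElement(form, "block")
    prompt = ET.SubElement(block, "prompt")
    prompt.text = f"You have {len(message_urls)} messages."
    return ET.tostring(vxml, encoding="unicode")


def delete_message(message_urls, index):
    """Drop one message, then regenerate the page from what remains."""
    remaining = message_urls[:index] + message_urls[index + 1:]
    return remaining, render_message_browser(remaining)


urls = ["msg1.wav", "msg2.wav", "msg3.wav"]
urls, page = delete_message(urls, 1)
print(page)
```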
Organizing these various bits of logic and data on the server and
different VXML pages presents new challenges for application design and
testing. Clearly, care is needed so that the server, static VXML pages,
and dynamic VXML pages work together in a clean and maintainable way.
DEVELOPMENT TOOLS AND SKILLS
What kinds of knowledge, tools, and equipment are needed for VXML
application development? The convergence in VXML of formerly distinct Web
and voice technologies means that VXML application development demands
development teams with an uncommon set of disparate skills: Telephony
hardware and software, voice user-interface design, speech recognition and
text-to-speech technology (grammars, pronunciations), and audio editing,
together with Web technologies like HTTP, scripting languages, Java,
servlets, JSP, ASP, and Web server configuration. In the last couple of
years, many companies have rushed to narrow the resulting skills gap with
VXML development tools and services.
As with most software development aids, VXML tools aspire to yield good
results without requiring in-depth knowledge of the underlying
technologies and programming languages. However, as with all such tools,
there are trade-offs between power and ease-of-use. And successful voice
application development demands some knowledge of telephony and speech
technologies that development tools can't provide.
Most VXML development tools feature graphical interfaces for
drag-and-drop editing. Some also allow direct editing of the VXML code, so
those with sufficient knowledge can create functionality beyond the
features supported by the graphical editor. To further simplify
development, many tools come with libraries of canned dialogs that perform
common tasks, like collecting spoken telephone numbers or times and dates.
While these prebuilt dialogs can be very useful, most are implemented in
proprietary code, so they can't be easily modified or extended, and they
aren't portable to other tools.
Testing and debugging voice and touch-tone-based telephone applications
presents unique requirements. Testers must call, listen to, and speak to
the application, and perhaps use touch-tone keys. Some development tools
use the desktop computer's audio card as an interface to the voice browser
and simulate a telephone environment, including a graphical telephone
keypad. Others rely on actual telephone interfaces, requiring the
installation of telephone line cards in the test machines. Line cards can
be tricky to get working right, especially for the "barge-in"
functionality, which is crucial to most voice applications.
A third approach is offered by hosting service providers. Customers
develop their VXML applications remotely via the Web using the service
provider's development tools. The applications then run in the service
providers' data centers. Testing is done by simply calling the application
using a designated telephone number. This approach is very simple, with
virtually no up-front investment in equipment or software. But the
application may be limited to features offered by the host, and is
locked in to the host's development and run-time environment.
The voice browser itself also has some influence on application
development. At present, not all voice browsers support every VXML
feature. And most offer proprietary extensions and other features to
differentiate their products. The most important concern is that VXML
barely addresses call control: beyond a basic <transfer> element, functions
such as answering and conferencing calls are left undefined.
The W3C is in the process of formulating a Call Control XML (CCXML) standard,
but for the moment, voice browsers implement these functions in
proprietary ways.
In the final article of this series, we'll take a closer look at SALT:
The motives behind its creation, its advantages and limitations, and its
issues for the system designer and application developer.
Mark Levinson is president of VoxMedia Consulting. He has
over 15 years of telecom industry experience, including more than five
years managing the design, development, and deployment of real-world
speech applications. He can be reached at 781-259-0404 or [email protected].