Modern technology has a remarkable habit of making once-futuristic
concepts come to life in ways not so far removed from their original
depictions in novels and films. A case in point is the idea of information
and communication devices you control by talking to them. Evoking images
stretching back more than 30 years, like HAL in the film 2001: A Space
Odyssey or the Star Trek communicator, the voice-command navigation
systems now offered in some luxury automobiles give you maps and driving
directions. Before long, they'll allow voice control of the radio and
climate systems as well. And
voice control is coming to wireless devices like PDAs and cell phones,
which are morphing into pocket-sized communication and information
management appliances.
These devices feature multimodal voice/text/graphic interfaces. The World Wide Web Consortium (W3C) is
working on standards that could be applied in this area, including
VoiceXML and the Multimodal Interaction Activity. But the standard that seems
to be gaining the most traction at present, in terms of interest and
industry support, is Speech Application Language Tags (SALT). Its initial
specification was developed by the SALT Forum, and
submitted to the W3C for its stamp of approval in August 2002.
In the first two articles in this series (Part 1, Part 2), we gave an
overview of VoiceXML and SALT and then took a more detailed look at
VoiceXML. In this final
installment, we consider SALT in greater depth -- the motives for its
creation, its components, and issues of using it in the real world.
The primary intent of the SALT standard is to provide functionality
for building a wide range of voice applications, including multimodal ones
that mix text and voice command input with text, graphic and audio output.
The most prominent applications people seem to have in mind are those
mentioned above: easy-to-use PDAs, where voice commands can replace
stylus handwriting and miniature keyboards, and embedded devices, like
vehicle navigation and control systems, where keyboard input would be out
of the question. However, SALT's creators have made it generic enough to
support voice-only applications, too, like telephone interactive voice
response (IVR) systems for call routing and customer self-service.
WHAT IT DOES AND DOESN'T DO
SALT is a Web technology. It's implemented as XML code in Web pages that
tell SALT-enabled browsers how to conduct voice interactions with end
users. Its elegant, lightweight design comprises only those functions
directly related to executing voice dialogs. It doesn't specify the
formats of dialogs' associated data, like speech recognition grammars
(rules denoting allowable sequences of words), but it's hoped that SALT
browser vendors will adopt related standards, like the Speech Recognition Grammar
Specification, Natural Language
Semantic Markup Language (for specifying how meaning is extracted from
word sequences) and Speech
Synthesis Markup Language (for controlling the output of
text-to-speech resources). Non-voice tasks are accomplished using common
Web resources, such as HTTP, HTML, and browser-side scripting languages.
Browser platform functions like telephony control (answer, transfer,
conference, hang-up, etc.) can be addressed by standards such as Call Control Extensible Markup Language
(CCXML) or the XML
Protocol for Computer Supported Telecommunications Applications
(ECMA-323).
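To make this concrete, here is a minimal sketch of what a SALT-enabled
Web page might look like. The element names, events and methods follow
the SALT 1.0 specification, but the grammar file, prompt wording and
function names are hypothetical, and details can vary by browser
platform.

  <html xmlns:salt="http://www.saltforum.org/2002/SALT">
  <body onload="runDialog()">
    <!-- Audio output, rendered by text-to-speech or recorded audio -->
    <salt:prompt id="askCity">Which city would you like the weather for?</salt:prompt>

    <!-- Speech input, constrained by an SRGS grammar (hypothetical file) -->
    <salt:listen id="recoCity" onreco="procCity()" onnoreco="runDialog()">
      <salt:grammar src="city.grxml"/>
    </salt:listen>

    <script type="text/javascript">
      function runDialog() {
        askCity.Start();   // play the prompt
        recoCity.Start();  // begin listening
      }
      function procCity() {
        // recognition result handling would go here
      }
    </script>
  </body>
  </html>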
As discussed in our previous articles, there are huge advantages to
building applications with Web architectures and technologies, including a
large and expanding toolkit of established standards and products, and
ready availability of skilled people. But for voice applications, Web
architectures make code design, development and maintenance fairly
complex, because the program logic controlling voice dialogs must be
split among various Web pages and the server. In keeping with its
minimalist design philosophy, SALT, unlike
VoiceXML, doesn't include any program logic elements, but leaves them to
scripting languages like JavaScript and other ECMAScript dialects.
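As an illustration of where that logic ends up, the sketch below keeps
all flow control in script; SALT merely fires events. The function and
element names (and the three-tries policy) are assumptions carried over
from the page sketch above.

  <script type="text/javascript">
    var tries = 0;

    // Wired to a <salt:listen> element's onnoreco event
    function handleNoReco() {
      tries++;
      if (tries < 3) {
        askCity.Start();   // reprompt the caller
        recoCity.Start();  // and listen again
      } else {
        document.forms[0].submit(); // give up; let the server take over
      }
    }
  </script>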
For embedded devices like PDAs, the question arises of whether the
overhead of the Web client-server architecture assumed by SALT even makes
sense. The SALT specification provides a reduced feature set for
"downlevel" browsers on devices with limited processing power
and memory. But it may be preferable in many cases to avoid client-server
altogether and implement voice applications as monolithic programs running
entirely on the individual devices. The choice of local program vs.
client-server will be decided to a large extent by the need for the
application to access external data. Take, for example, a cell phone
voice-dialing application. If the phone numbers to be dialed are merely
those stored in the user's personal phone directory, then a local
application makes the most sense. But if the requirement is to access a
group directory, like a company phone list, then it would be better to
maintain that data on a central server and download it as needed via a
client-server application.
On the other hand, SALT might be a good choice as an application
development language even for purely local programs. Many of its
advantages as a standard, XML-based language encompassing a comprehensive
set of voice functionality would be gained by designing embedded devices
to interpret SALT code (and possibly other Web languages like HTML and
JavaScript), whether or not they employ client-server architectures.
SALT COMPONENTS
There are four top-level SALT elements:
- <prompt> specifies how audio output is played to users, either
from recorded audio files or generated on-the-fly by text-to-speech
engines;
- <listen> specifies how speech input will be processed;
- <dtmf> provides for touch-tone input in telephone
applications; and
- <smex>, "simple messaging extension," is a
general-purpose method of communication with the browser platform that
supports new features and allows applications to control
platform-specific functions like logging and telephony call control.
Of special interest is the <bind> element, which, as part of the
<listen> functionality, provides wide flexibility in determining how
speech input is to be used. It allows speech recognition results to be
sent directly back to the Web server, for example. Or recognition results
can be attached to HTML form fields, so they act exactly like text input
to the field, allowing multimodal applications to accept either speech or
text as input to the same field.
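The sketch below shows that pattern, following the example style used in
the SALT specification; the grammar file and the XPath into the
recognition result are hypothetical.

  <input name="txtCity" type="text"/>

  <salt:listen id="recoCityField">
    <salt:grammar src="city.grxml"/>
    <!-- copy the "city" item from the recognition result into the field -->
    <salt:bind targetelement="txtCity" value="//city"/>
  </salt:listen>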
SALT specifies an extensive set of events and methods for the
<prompt>, <listen> and <dtmf> elements. Especially for
prompts, they give very fine-grained control over interactions with users.
Sequences of audio can be queued to create composite prompts, as might be
desired when reading back an account number by stringing together
recordings of individual digits. Events are generated, for example, when
each prompt element in the queue has completed playing and when users
interrupt prompts by speaking ("barge-in"). There are also
methods for pausing and resuming prompts, and changing speed and volume,
among others.
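A sketch of how such an account-number readback might be assembled from
queued segments appears below. Queue(), Start(), onbargein and oncomplete
come from the SALT specification, but the argument passed to Queue() and
the handler names are assumptions; exact queuing semantics vary by
platform.

  <salt:prompt id="acctPrompt" onbargein="handleBarge()" oncomplete="nextStep()">
  </salt:prompt>

  <script type="text/javascript">
    function readAccount(digits) {
      // queue one segment per digit, then play the whole sequence
      for (var i = 0; i < digits.length; i++) {
        acctPrompt.Queue(digits.charAt(i) + ", ");
      }
      acctPrompt.Start();
    }
  </script>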
These capabilities can support very sophisticated user-interface
designs. Knowing that a user has interrupted a prompt near the beginning,
for instance, might indicate that she's familiar with the application. If
so, subsequent prompts can be more abbreviated versions, giving quicker
interactions than would be appropriate for first-time users. Or it might
indicate that the graphical display she's viewing has given her all the
information she needs in deciding what to say.
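One hypothetical way to exploit that signal: record when the prompt
starts playing, and if the caller barges in within the first second or
so, switch the rest of the session to terser prompts. All names below
are assumptions.

  <script type="text/javascript">
    var promptStart = 0;
    var expertUser = false;

    function startMainPrompt() {
      promptStart = new Date().getTime();
      mainPrompt.Start(); // a <salt:prompt id="mainPrompt"> elsewhere on the page
    }

    // Wired to the prompt's onbargein event
    function handleBarge() {
      var elapsed = new Date().getTime() - promptStart;
      if (elapsed < 1000) {
        expertUser = true; // early barge-in: use abbreviated prompts from now on
      }
    }
  </script>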
CAN YOU HEAR ME NOW?
So can we just start rolling out SALT applications? Not quite yet. For one
thing, not many devices and browsers support the standard yet, although a
base of support seems to be building and more SALT-enabled devices should
appear before too long. There's also a need for application development
and testing tools. Fortunately, Microsoft is supporting SALT in a big
way, offering development tools in the form of a Speech SDK for use with
Visual Studio and the .NET framework. Other vendors are starting to
offer their own SALT development tools, and this trend will likely
accelerate as SALT becomes more widely accepted. However, as previously
noted in our discussion about VoiceXML, these development tools, as
helpful as they may be, don't provide the basic knowledge of voice
technology and voice user-interface design necessary to create
high-quality applications.
Another issue is the maturity of voice/graphic user-interface design.
There are very few multimodal applications now in production, and
there's been very little real-world experience in how to design these
kinds of user interfaces. This is a fascinating area with tremendous
potential for creative approaches. But it's still somewhat experimental.
Nevertheless, SALT appears to be the right standard at the right time
to help create a whole new class of novel, easy-to-use devices that will
soon be as commonplace as the desktop Web browser is today.
Mark Levinson is president of VoxMedia Consulting. He has
over 15 years of telecom industry experience, including more than five
years managing the design, development, and deployment of real-world
speech applications. He can be reached at 781-259-0404 or [email protected].