SALT: The Light In Speech Mark-Up
BY ALBERT KOOIMAN & DR. KUAN SAN WANG
In October 2001, Cisco, Comverse, Intel, Microsoft, Philips, and
SpeechWorks established the SALT Forum to create a standard mark-up
language for multimodal and telephony speech applications. The initiative
has gained momentum since its inception; in the last few months alone,
43 companies have joined the Forum. Because the SALT Forum names the
telephony market as one of its targets, the industry has speculated that
the SALT (Speech Application Language Tags) specification is competing with
VoiceXML 2.0, which, if all goes well, will receive World Wide Web
Consortium (W3C) Recommendation status in spring 2003. VoiceXML and the SALT
specification need not be seen as competing standards; each is a viable
standard in its own right that can be chosen for deployment on its own
merits.
Web, What Web?
For several years now, the buzz about a voice-enabled Web has created
tremendous excitement and promise in the industry. Although this development
is still in its infancy, considerable momentum has been building, and
hundreds of companies have embraced VoiceXML.
A key driver of the voice-enabled Web is the expectation that the future of
commerce will be based on Internet standards and that mobile users will
access Web content through voice channels. The notion that adding a voice
channel for mobile users will grow business will drive this even further.
Whereas VoiceXML addresses the voice-only case, more and more people are
expected to choose how they access an application on the basis of what is
most convenient: keyboard and mouse, a stylus, or speech.
The SALT Forum was created on that very principle: the ability to
simply add a speech access channel to an existing GUI-based Web application.
The SALT specification builds on a few Internet paradigms as the cornerstone
for its success:
- Clean separation of representation, logic, and data;
- Event-driven, object-oriented programming approach to speech interface design;
- Extensible XML model.
Moreover, the SALT specification allows for the integration of different
modalities into one unified execution model; developers do not need to
learn different languages for the respective modalities.
Multimodal is Different
Applications based on the SALT specification can either use the speech
channel and be accessed only by voice, or they can be multimodal and give
visual feedback. The user can choose to speak, click a mouse, or enter text
via a keyboard, stylus, or any other way at his disposal. SALT-based
applications can run on a server in the network, on a handheld device, or
even on a mix of the two. The SALT specification defines profiles for
each of these ways of using the specification. One key element in
this is that SALT is event-driven. In the multimodal world, you never know
what will happen next: where the user will click on the page, which field
he will fill first, whether he will give more than one piece of information
at once, and so on. In this respect it resembles the telephony world, where
most events are asynchronous as well.
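As a rough illustration of this event-driven style, the Python sketch below models a form that accepts input events in whatever order they arrive, from any modality. It is not SALT markup or any SALT API; `InputEvent`, `MultimodalForm`, and the field names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class InputEvent:
    """One user action, e.g. from "speech", "mouse", or "keyboard"."""
    modality: str
    values: Dict[str, str]  # field name -> value; one utterance may fill several


class MultimodalForm:
    """Hypothetical form that reacts to events instead of dictating their order."""

    def __init__(self, field_names: List[str]) -> None:
        self.fields: Dict[str, Optional[str]] = {name: None for name in field_names}
        self.on_complete: Optional[Callable[[Dict[str, Optional[str]]], None]] = None

    def handle(self, event: InputEvent) -> None:
        # Accept any known field the event carries, whatever its source.
        for name, value in event.values.items():
            if name in self.fields:
                self.fields[name] = value
        # Fire the completion event once nothing is missing.
        if self.on_complete and not self.missing():
            self.on_complete(dict(self.fields))

    def missing(self) -> List[str]:
        return [name for name, value in self.fields.items() if value is None]


form = MultimodalForm(["origin", "destination", "date"])
completed: List[Dict[str, Optional[str]]] = []
form.on_complete = completed.append

# The user first clicks a date, then speaks two fields in a single utterance.
form.handle(InputEvent("mouse", {"date": "2002-09-15"}))
after_click = form.missing()
form.handle(InputEvent("speech", {"origin": "Boston", "destination": "Seattle"}))
```

The form never asks for a particular field. It only records what each event delivers and signals completion when every field has a value, which is why the order of the click and the utterance does not matter.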
Stairway To Heaven
This is all possible because SALT concentrates on the basics:
providing a doorway for input and output to Web applications, and giving the
programmer fine control over implementing the user interface that makes
the most sense for the application by using a scripting language like
ECMAScript. SALT applies an event-driven, object-oriented programming model
to speech interface design.
This programming model has proven itself to be a powerful and flexible
paradigm that meets the most demanding requirements in creating
sophisticated user interfaces, most notably, for GUI or Web applications.
The programming model is also already familiar to the developer community at
large. Software engineers will be able to apply their immense experience and
best practices directly to developing applications with the SALT
specification. By following widely used programming models and using
existing programming languages, the SALT Forum believes that speech
programming will become much more mainstream.
The SALT object model is fairly simple, with only a few objects, including
"listen" and "prompt" for speech input and output processing,
respectively. Underneath this simple cover, however, lies the rich
functionality the SALT specification has to offer. The key to making the
SALT specification versatile, yet simple, is to separate data and operations
in the most logical way. Object models often become unnecessarily
complicated because too many function calls must be used to manage complex
data structures. The SALT specification's design avoids this problem by
using XML to represent complicated data. As a result, SALT objects need only
a few methods, such as starting and stopping audio streams, and a few events
reporting synthesis progress and recognition results.
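The shape of such an object model can be sketched in a few lines of Python. This is a toy stand-in, not the SALT object model itself: the class names `Prompt` and `Listen`, the `on_complete` and `on_reco` callbacks, the `simulate_recognition` hook, and the XML result layout are all assumptions made for illustration. The point is only that each object needs little more than start and stop methods plus an event, because the structured data travels as XML.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Optional


class Prompt:
    """Toy speech-output object: start/stop plus one completion event."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.playing = False
        self.on_complete: Optional[Callable[[], None]] = None

    def start(self) -> None:
        self.playing = True
        # A real engine would synthesize audio here; this toy finishes at once.
        self.stop()
        if self.on_complete:
            self.on_complete()

    def stop(self) -> None:
        self.playing = False


class Listen:
    """Toy speech-input object: start/stop plus one recognition event."""

    def __init__(self) -> None:
        self.active = False
        self.on_reco: Optional[Callable[[ET.Element], None]] = None

    def start(self) -> None:
        self.active = True

    def stop(self) -> None:
        self.active = False

    def simulate_recognition(self, result_xml: str) -> None:
        """Hypothetical test hook standing in for a real recognizer."""
        if not self.active:
            return
        self.stop()
        if self.on_reco:
            # The complex data arrives as XML, not through extra method calls.
            self.on_reco(ET.fromstring(result_xml))


booking: Dict[str, Optional[str]] = {}


def store_result(result: ET.Element) -> None:
    for child in result:
        booking[child.tag] = child.text


prompt = Prompt("Where would you like to fly?")
listen = Listen()
prompt.on_complete = listen.start  # begin listening when the prompt ends
listen.on_reco = store_result

prompt.start()
listen.simulate_recognition(
    "<result><origin>Boston</origin><destination>Seattle</destination></result>"
)
```

Because the recognition result is a small XML tree, the handler can read any number of values from it without the `Listen` class growing a method per data item.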
Who Cares Who Is Talking?
The SALT specification is designed to be extensible, and it defines
standard ways to add functionality not covered by its current 1.0
version. For example, the SALT specification incorporates a mechanism to
extend the "listen" object for speaker identification and verification.
The specification also allows interoperability with a wide range of
input/output devices, such as instant messaging, Internet chat, VoIP or
general telephony, global positioning systems for location-aware
applications, and text telephones (TTY) or Braille devices for users with
hearing or visual impairments. In addition to physical I/O devices, the same
mechanism also makes the SALT specification Web Services ready, in the sense
that SALT-based documents can have simple and secure links to the Web
Services available on the Internet. These extensions also enable SALT to
interface easily with legacy infrastructures so that existing investments
can be leveraged. In other words, SALT extension standards simply take
advantage of XML and realize its benefits to the fullest. The SALT
specification empowers developers to introduce extensions, while using XML
to ensure that these extensions do not sacrifice application portability and
A <Programmer> Knows Brackets
The strength of the SALT specification is that it concentrates on the
basics and does not invent a new programming language with a new browser.
VoiceXML has become a standalone programming language with a lot of
brackets, which combines procedural and declarative approaches, and has
several limitations in this respect. However, VoiceXML has created something
very useful: a universal interactive voice response (IVR) scripting
language that can be run on many IVR systems. This language addresses
customer demand for interoperability, which is the prime benefit standards
provide.
VoiceXML, like IVR, is based on fixed, menu-driven turns with a
synchronous execution model. It uses menus or forms, whose processing is
governed by a Form Interpretation Algorithm (FIA) that synchronizes speech
input and output. This FIA can get in the developer's way. The
synchronous execution model makes it difficult to integrate asynchronous
modalities or to use VoiceXML together with those technologies. In addition,
the FIA makes the browser so "heavy" that it has to reside in the network.
These are major issues that will be discussed by the W3C Voice Browser group
in the framework of VoiceXML 3.0. Combining VoiceXML with other modalities,
such as handling speech and point-and-click simultaneously, will
raise technological issues that are already solved in the execution
environments of SALT.
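To make the contrast concrete, here is a heavily simplified Python sketch of a synchronous, turn-by-turn form loop. It is written in the spirit of the description above and is not the Form Interpretation Algorithm as specified in VoiceXML; `run_sync_form`, its `ask` callback, and the `max_turns` guard are hypothetical names introduced for this example.

```python
from typing import Callable, Dict, List, Optional


def run_sync_form(
    field_names: List[str],
    ask: Callable[[str], Optional[str]],
    max_turns: int = 10,
) -> Dict[str, str]:
    """Fill fields one at a time, in a fixed order, one blocking turn each."""
    filled: Dict[str, str] = {}
    for _ in range(max_turns):
        pending = [name for name in field_names if name not in filled]
        if not pending:
            break
        current = pending[0]      # select: the first field still empty
        answer = ask(current)     # collect: prompt, then wait for the reply
        if answer is not None:    # process: store it, or ask again next turn
            filled[current] = answer
    return filled


# A scripted caller: answers, stays silent once, then answers again.
scripted_answers = iter(["Boston", None, "Seattle"])
asked: List[str] = []


def scripted_caller(field_name: str) -> Optional[str]:
    asked.append(field_name)
    return next(scripted_answers)


result = run_sync_form(["origin", "destination"], scripted_caller)
```

The loop itself decides which field comes next and waits inside `ask` until that turn is over. An input arriving from another channel, such as a mouse click on a different field, has no place to enter while the loop is waiting, which is the integration difficulty described above.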
Voice Browsing Standards
There is a perception in the industry that a "standards war" is
going on between SALT and the upcoming VoiceXML 2.0 specification, which
is currently approaching Candidate Recommendation status at the W3C. In
this context, it is worth looking at how much the SALT specification uses
the work done by the W3C thus far. What many people do not realize is that
the Voice Browser Group of the W3C, in addition to creating VoiceXML, has
developed other standard specifications that are used by SALT, including the
Speech Recognition Grammar Specification and the Speech Synthesis Markup Language,
as well as markup languages for, among others, call control and semantic interpretation.
In the SALT Forum, interoperability is a design principle of great value.
Therefore, it is recommended that all SALT-based browsers use these W3C
standards as a common denominator. This leverages the W3C Voice Browser
Group's work within the SALT specification.
All On The Same Page?
The founders of the SALT Forum all have a vested interest in promoting
speech to continue the success of the Internet. Most of them choose to
continue investing in VoiceXML in addition to investing in SALT. In the W3C,
the first discussions on VoiceXML 3.0 have started, which offer the
opportunity to take the best of the VoiceXML 2.0 and SALT specifications. In
parallel, the W3C has established a multimodal working group, called
Multimodal Interaction Activity. This working group has a charter for the
coming two years, and will take its time to find common ground on the
requirements and to develop a common standard for multimodal applications.
In the meantime, the industry can use both approaches: VoiceXML when a
voice-only, server-based solution is needed, or SALT when an application
may need to offer multimodality as well and builds on an existing Web
application.
Albert Kooiman is director of business development at Philips Speech
Processing. Dr. Kuan San Wang is a researcher with Microsoft Corp. Microsoft
and Philips are both founding members of the SALT Forum, which brings
together a diverse group of companies sharing a common interest in
developing and promoting speech technologies for multimodal and telephony
applications. For more information, visit them online at www.saltforum.org.
[ To The September 2002 Table Of Contents ]