



[March 17, 2003]

SALT Sets The Standard For Web-Based Voice Applications (Part 3 of 3)


Modern technology has a remarkable habit of making once-futuristic concepts come to life in ways not so far removed from their original depictions in novels and films. A case in point is the idea of information and communication devices you control by talking to them. Evoking images stretching back more than 30 years, like HAL in the film 2001 or the Star Trek communicator, some luxury automobiles now offer voice-command navigation systems that give you maps and driving directions. Before long, they'll allow voice control of the radio and climate systems as well. And voice control is coming to wireless devices like PDAs and cell phones, which are morphing into pocket-sized communication and information management appliances.

These devices feature multimodal voice/text/graphic interfaces. The World Wide Web Consortium (W3C) is working on standards that could be applied in this area, including Voice XML and the Multimodal Interaction Activity. But the standard that seems to be gaining the most traction at present, in terms of interest and industry support, is Speech Application Language Tags (SALT). Its initial specification was developed by the SALT Forum, and submitted to the W3C for their stamp of approval in August 2002.

In the first two articles in this series (Part 1, Part 2), we gave an overview of Voice XML and SALT and then took a more detailed look at Voice XML. In this final installment, we consider SALT in greater depth -- the motives for its creation, its components, and issues of using it in the real world.

The primary intent of the SALT standard is to provide functionality for building a wide range of voice applications, including multimodal ones that mix text and voice-command input with text, graphic, and audio output. The most prominent applications people seem to have in mind are those mentioned above: easy-to-use PDAs, where voice commands can replace stylus handwriting and miniature keyboards, and embedded devices, like vehicle navigation and control systems, where keyboard input would be out of the question. However, SALT's creators have made it generic enough to support voice-only applications, too, like telephone interactive voice response (IVR) systems for call routing and customer self-service.

SALT is a Web technology. It's implemented as XML code in Web pages that tells SALT-enabled browsers how to conduct voice interactions with end users. Its elegant, lightweight design comprises only those functions directly related to executing voice dialogs. It doesn't specify the formats of dialogs' associated data, like speech recognition grammars (rules denoting allowable sequences of words), but it's hoped that SALT browser vendors will adopt related standards, like the Speech Recognition Grammar Specification, the Natural Language Semantics Markup Language (for specifying how meaning is extracted from word sequences) and the Speech Synthesis Markup Language (for controlling the output of text-to-speech resources). Non-voice tasks are accomplished using common Web resources, such as HTTP, HTML, and browser-side scripting languages. Browser platform functions like telephony control (answer, transfer, conference, hang-up, etc.) can be addressed by standards such as Call Control Extensible Markup Language (CCXML) or the XML Protocol for Computer Supported Telecommunications Applications (ECMA-323).
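To make the grammar idea concrete, here is a rough sketch of what a grammar in the W3C Speech Recognition Grammar Specification's XML form might look like; the rule name and the list of cities are invented for illustration:

```xml
<!-- Illustrative SRGS grammar (XML form): accepts one of three city names.
     The root rule "city" and the city names are hypothetical examples. -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="city">
  <rule id="city">
    <one-of>
      <item>Boston</item>
      <item>Chicago</item>
      <item>San Francisco</item>
    </one-of>
  </rule>
</grammar>
```

A SALT browser that adopts this standard could reference such a grammar file from within its speech-input markup.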

As discussed in our previous articles, there are huge advantages to building applications with Web architectures and technologies, including a large and expanding toolkit of established standards and products, and the ready availability of skilled people. For voice applications, however, Web architectures lead to fairly complex code design, development, and maintenance, because the program logic controlling voice dialogs must be split among various Web pages and the server. In keeping with its minimalist design philosophy, SALT, unlike Voice XML, doesn't include any program logic elements, but leaves them to scripting languages like JavaScript/ECMAScript.

For embedded devices like PDAs, the question arises of whether the overhead of the Web client-server architecture assumed by SALT even makes sense. The SALT specification provides a reduced feature set for "downlevel" browsers on devices with limited processing power and memory. But it may be preferable in many cases to avoid client-server altogether and implement voice applications as monolithic programs running entirely on the individual devices. The choice of local program vs. client-server will be decided to a large extent by the need for the application to access external data. Take, for example, a cell phone voice-dialing application. If the phone numbers to be dialed are merely those stored in the user's personal phone directory, then a local application makes the most sense. But if the requirement is to access a group directory, like a company phone list, then it would be better to maintain that data on a central server and download it as needed via a client-server application.

On the other hand, SALT might be a good choice as an application development language even for purely local programs. Many of its advantages as a standard, XML-based language encompassing a comprehensive set of voice functionality would be gained by designing embedded devices to interpret SALT code (and possibly other Web languages like HTML and JavaScript), whether or not they employ client-server architectures.

There are four top-level SALT elements:

  • <prompt> specifies how audio output is played to users, either from recorded audio files or generated on-the-fly by text-to-speech engines;
  • <listen> specifies how speech input will be processed;
  • <dtmf> provides for touch-tone input in telephone applications; and
  • <smex>, "simple messaging extension," is a general-purpose method of communication with the browser platform that supports new features and allows applications to control platform-specific functions like logging and telephony call control.
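To give a feel for how these elements sit inside an ordinary Web page, here is a rough sketch in the style of the SALT 1.0 draft; the element names come from the specification, but the ids, event handlers, prompt wording, and grammar file are hypothetical:

```xml
<!-- Sketch of a SALT-enabled HTML page: play a prompt, then listen for a reply.
     Ids, handlers, and the grammar URL are illustrative, not from any real app. -->
<html xmlns:salt="http://www.saltforum.org/2002/SALT">
  <body onload="askCity.Start();">
    <salt:prompt id="askCity" oncomplete="getCity.Start();">
      Which city would you like the forecast for?
    </salt:prompt>
    <salt:listen id="getCity" onreco="handleCity();" onnoreco="askCity.Start();">
      <salt:grammar src="cities.grxml"/>
    </salt:listen>
  </body>
</html>
```

Note how the flow control (starting the listen when the prompt completes, re-prompting on a failed recognition) lives in event-handler script rather than in SALT itself, consistent with its minimalist design.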

Of special interest is the <bind> element, which, as part of the <listen> functionality, provides wide flexibility in determining how speech input is to be used. It allows speech recognition results to be sent directly back to the Web server, for example. Or recognition results can be attached to HTML form fields, so they act exactly like text input to the field, allowing multimodal applications to accept either speech or text as input to the same field.
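As an illustration of <bind>, the hypothetical fragment below attaches a recognition result to an HTML text field, so speech fills the field just as typing would; the element ids and the XPath into the recognition result are invented:

```xml
<!-- Sketch: <bind> copies the recognized city into the txtCity form field.
     The XPath //city assumes the grammar returns a <city> node; purely illustrative. -->
<input type="text" id="txtCity" name="txtCity"/>
<salt:listen id="getCity">
  <salt:grammar src="cities.grxml"/>
  <salt:bind targetelement="txtCity" targetattribute="value" value="//city"/>
</salt:listen>
```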

SALT specifies an extensive set of events and methods for the <prompt>, <listen> and <dtmf> elements. Especially for prompts, they give very fine-grained control over interactions with users. Sequences of audio can be queued to create composite prompts, as might be desired when reading back an account number by stringing together recordings of individual digits. Events are generated, for example, when each prompt element in the queue has completed playing and when users interrupt prompts by speaking ("barge-in"). There are also methods for pausing and resuming prompts, and changing speed and volume, among others.
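The account-number example might be sketched as follows, using the Queue() and Start() methods described in the SALT draft; the prompt id, handler name, and wording are hypothetical:

```xml
<!-- Sketch: build a composite prompt by queuing pieces, then play the queue.
     oncomplete fires when the whole queue finishes playing. -->
<salt:prompt id="acctPrompt" oncomplete="nextStep();"></salt:prompt>
<script>
  // Queue the lead-in and the individual digits, then start playback.
  acctPrompt.Queue("Your account number is");
  acctPrompt.Queue("four");
  acctPrompt.Queue("two");
  acctPrompt.Queue("seven");
  acctPrompt.Start();
</script>
```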

These capabilities can support very sophisticated user-interface designs. Knowing that a user has interrupted a prompt near the beginning, for instance, might indicate that she's familiar with the application. If so, subsequent prompts can be more abbreviated versions, giving quicker interactions than would be appropriate for first-time users. Or it might indicate that the graphical display she's viewing has given her all the information she needs in deciding what to say.

So can we just start rolling out SALT applications? Not quite yet. For one thing, not many devices and browsers support the standard yet, although a base of support seems to be building and more SALT-enabled devices should appear before too long. There's also a need for application development and testing tools. Fortunately, Microsoft is supporting SALT in a big way. They now offer SALT development tools in the form of a Speech SDK for use with Visual Studio and the .NET framework. Other vendors are starting to offer their own SALT development tools, and this trend will likely accelerate as SALT becomes more widely accepted. However, as previously noted in our discussion about Voice XML, these development tools, as helpful as they may be, don't provide the basic knowledge of voice technology and voice user-interface design necessary to create high-quality applications.

Another issue is the maturity of voice/graphic user-interface design. There are extremely few multimodal applications now in production, and there's been very little real-world experience in how to design these kinds of user interfaces. This is a fascinating area with tremendous potential for creative approaches. But it's still somewhat experimental.

Nevertheless, SALT appears to be the right standard at the right time to help create a whole new class of novel, easy-to-use devices that will soon be as commonplace as the desktop Web browser is today.

Mark Levinson is president of VoxMedia Consulting. He has over 15 years of telecom industry experience, including more than five years managing the design, development, and deployment of real-world speech applications. He can be reached at 781-259-0404 or [email protected].
