Call Center/CRM Management Scope
March 2003


Beyond SALT Versus VoiceXML: Coping With The Wealth Of Standards In Speech And Multimodal Self-Service Applications

By K. W. (Bill) Scholz, Ph.D., Unisys Corp.

Standards serve as the foundation for growth within an industry. As a new technology is spawned and begins to pique the interest of developers and consumers, its initial growth is typically haphazard and devoid of structure. As the technology reaches adolescence, however, its leaders develop standards that guide growth and interoperability, and its haphazard evolution fades. As the technologies enabling speech and multimodal self-service applications mature, many standards have emerged and combined to enable the field to approach mainstream status. The growth of standards is not without its cost, however; because of the complexity of the underlying technologies, the standards documents themselves have grown to span thousands of pages, and as a consequence constitute an overwhelming obstacle to a developer's mastery of the technology.

Furthermore, this past year has seen considerable press devoted to the so-called 'conflict' between the two key standards in our industry: SALT and VoiceXML (VXML). Claims of conflict have pressured some developers into making premature 'choices' between them, while intimidating others into inactivity as they wait for the industry to choose the 'right' one. In fact, there are over a dozen distinct standards designed to guide the development and execution of speech and multimodal applications; they occasionally compete with one another, but more frequently operate in harmony, each guiding a distinct component of the application's architecture.

Deployment Architecture
Figure 1 illustrates the deployment architecture for a speech or multimodal application. The major components in the architecture and their functions are as follows:
Application Server. The central component is the application server, the platform and software responsible for managing the execution of the application. The application server's principal responsibilities include management of the dialog with the end user and management of the business transaction processor, which implements the application's business functionality.

Business transaction processor. This term describes the software and (optionally) the platform responsible for execution of the business transactions (for example, a travel reservation system, a retail banking database, a regional or national weather repository, or a securities transaction database, to name a few).

Voice gateway. During execution, the application server exchanges information with the voice gateway; this information is coded in a markup language and is conveyed using the familiar Internet delivery paradigm. The voice gateway includes:
• A markup language interpreter,
• An automatic speech recognizer (ASR),
• A text-to-speech (TTS) generator, and
• A telephone network interface (tele interface). The tele interface mediates the connection through the circuit-switched or packet-switched telephone network to the end user. The network connection will use either a direct digital interface to the circuit-switched network or voice-over-IP (VoIP) through a media gateway to the telephone network.

Voice user interface. This is an end user interface using speech over wireless or wireline telephones.

Graphics user interface. This is an end user interface using desktop PCs, PDAs, cell phones with digital visual displays, or other screen-oriented devices.

Figure 1

Standards
The principal standards and standardized APIs (application program interfaces) that guide the operation and interaction of the components in the architecture are shown in Figure 1, and are listed and described below. The agency responsible for each standard or API is shown in parentheses after the standard's name.

CCXML (W3C). Call Control eXtensible Markup Language is designed to provide telephony call control support for dialog systems. CCXML is intended to serve as an adjunct language for use with a VXML, SALT or other dialog implementation platform.
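As an illustration, a minimal CCXML document can answer an inbound call and hand it to a VoiceXML dialog. This is a hedged sketch, not production code; the dialog URI "main.vxml" is a placeholder.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal CCXML sketch: accept an incoming call, then start a
     VoiceXML dialog. "main.vxml" is a placeholder URI. -->
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <eventprocessor>
    <!-- An inbound call is ringing: answer it -->
    <transition event="connection.alerting">
      <accept/>
    </transition>
    <!-- The call is up: start the voice dialog -->
    <transition event="connection.connected">
      <dialogstart src="'main.vxml'"/>
    </transition>
    <!-- The caller hung up: end the CCXML session -->
    <transition event="connection.disconnected">
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>
```

Note the separation of concerns: CCXML handles only call control events, leaving the conversation itself to the dialog language.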

HTTP (IETF). Hypertext Transfer Protocol is an application-level protocol for distributed, collaborative, hypermedia information systems. It is a generic, stateless protocol which can be used for many tasks beyond its use for hypertext, such as name servers and distributed object management systems, through extension of its request methods, error codes and headers.

H.323 (ITU). H.323 is a standard that specifies the components, protocols and procedures that provide multimedia communication services (real-time audio, video and data communications) over packet networks, including Internet protocol (IP)-based networks. H.323 is part of a family of recommendations that provide multimedia communication services over a variety of networks.

JDBC (Sun Microsystems). Java Database Connectivity is an API that lets developers access virtually any tabular data source from the Java programming language. It provides cross-DBMS connectivity to a wide range of SQL databases and, with the JDBC API, it also provides access to other tabular data sources, such as spreadsheets or flat files.

ODBC (Microsoft). Open Database Connectivity is a widely accepted API for database access. It is based on the Call-Level Interface (CLI) specifications from X/Open and ISO/IEC for database APIs and uses Structured Query Language (SQL) as its database access language.

SALT (SALT Forum). Speech Application Language Tags is a platform-independent specification, developed by the SALT Forum and contributed to the W3C, that makes possible multimodal and telephony-enabled access to information, applications and Web services from PCs, telephones, tablet PCs and wireless PDAs (personal digital assistants). The specification extends existing mark-up languages such as HTML, XHTML and XML.
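Because SALT extends existing markup, its tags sit directly inside an ordinary HTML page. The sketch below is illustrative and assumes SALT 1.0 element names; the grammar URI "city.grxml" and the event wiring are placeholders.

```xml
<!-- Sketch of SALT tags embedded in an HTML page. Element names follow
     the SALT 1.0 spec; "city.grxml" is a placeholder grammar URI. -->
<html xmlns:salt="http://www.saltforum.org/2002/SALT">
  <body onload="askCity.Start()">
    <input name="city" type="text"/>
    <!-- Play a prompt, then start recognition when it finishes -->
    <salt:prompt id="askCity" onComplete="recoCity.Start()">
      Which city would you like the weather for?
    </salt:prompt>
    <salt:listen id="recoCity">
      <salt:grammar src="city.grxml"/>
      <!-- Copy the recognized value into the HTML form field -->
      <salt:bind targetelement="city" value="//city"/>
    </salt:listen>
  </body>
</html>
```

The same page thus serves both the visual and the spoken modality, which is the core of SALT's multimodal claim.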

SIP, RTP, MGCP (IETF). SIP (Session Initiation Protocol) is a signaling protocol for Internet conferencing, telephony, presence, events notification and instant messaging. RTP (Real-time Transport Protocol) is a protocol for the transport of real-time data, including audio and video. MGCP/MEGACO (Media Gateway Control Protocol) addresses the relationship between the media gateway, which converts circuit-switched voice to packet-based traffic, and the media gateway controller (sometimes called a softswitch), which dictates the service logic of that traffic. 

SRGS (W3C). Speech Recognition Grammar Specification defines the syntax for grammar representation intended for use by speech recognizers and other grammar processors so that developers can specify the words and patterns of words to be listened for by a speech recognizer.
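A small SRGS grammar in its XML form makes the idea concrete: the developer enumerates the words and phrases the recognizer should listen for. The city names below are arbitrary examples.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- SRGS (XML form) grammar that listens for a handful of city names,
     with optional polite filler words around them -->
<grammar version="1.0" xml:lang="en-US" root="city"
         xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="city" scope="public">
    <item repeat="0-1">I'd like</item>
    <one-of>
      <item>Boston</item>
      <item>Philadelphia</item>
      <item>Seattle</item>
    </one-of>
    <item repeat="0-1">please</item>
  </rule>
</grammar>
```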

SSML (W3C). Speech Synthesis Markup Language is a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. Its essential role is to give authors of synthesizable content a standard way to control aspects of speech output such as pronunciation, volume, pitch, rate, etc., across different synthesis-capable platforms.
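A brief SSML fragment shows how an author controls pacing and emphasis of synthesized output in a platform-independent way; the prompt wording is illustrative.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- SSML prompt controlling emphasis, pausing and speaking rate -->
<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  Your account balance is
  <emphasis>forty-two dollars</emphasis>.
  <break time="500ms"/>
  <prosody rate="slow">
    Please have your account number ready.
  </prosody>
</speak>
```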

SS7/ISUP (ITU-T). Signaling System 7 is an architecture for performing out-of-band signaling in support of the call-establishment, billing, routing and information-exchange functions of the PSTN (public switched telephone network). It identifies functions to be performed by a signaling-system network and a protocol to enable their performance. ISUP (ISDN User Part) defines the messages and protocol used in the establishment and teardown of voice and data calls over the PSTN, and to manage the trunk network on which they rely.

VoiceXML (W3C). VoiceXML (Voice eXtensible Markup Language) is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony and mixed-initiative conversations. Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications.
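The 'Web-based development' claim is easiest to see in a minimal VoiceXML form, which collects one field and submits it back to the application server exactly as an HTML form would. The URIs "/weather" and "city.grxml" are placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal VoiceXML 2.0 form: ask for a city, then submit it to the
     application server. "/weather" and "city.grxml" are placeholders. -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="weather">
    <field name="city">
      <prompt>Which city would you like the weather for?</prompt>
      <grammar src="city.grxml" type="application/srgs+xml"/>
      <noinput>Sorry, I didn't hear you.</noinput>
      <nomatch>Sorry, I didn't catch that city.</nomatch>
      <filled>
        <submit next="/weather" namelist="city"/>
      </filled>
    </field>
  </form>
</vxml>
```

Note how the grammar reference ties VoiceXML to SRGS, and prompts may in turn embed SSML: the standards interlock rather than compete.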

WAP / WML (OMA). Wireless Application Protocol and Wireless Markup Language refer to a markup language based on XML which is intended for use in specifying content and user interface for narrow band devices, including cellular phones and pagers.
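A one-deck WML example shows the card metaphor used for narrow-band displays; the weather content is illustrative.

```xml
<?xml version="1.0"?>
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN"
  "http://www.wapforum.org/DTD/wml_1.1.xml">
<!-- A two-card WML deck for a narrow-band handset display -->
<wml>
  <card id="main" title="Weather">
    <p>Boston: 54F, partly cloudy</p>
    <p><a href="#forecast">Five-day forecast</a></p>
  </card>
  <card id="forecast" title="Forecast">
    <p>Mon 56F, Tue 51F, Wed 49F</p>
  </card>
</wml>
```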

XHTML (W3C). eXtensible HyperText Markup Language is a family of current and future document types and modules that reproduce, subset and extend HTML 4. The XHTML document types are XML-based and ultimately are designed to work in conjunction with XML-based user agents.

XML (W3C). eXtensible Markup Language is a simple, very flexible text format derived from SGML (Standard Generalized Markup Language). Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere.

X+V (W3C). XHTML + Voice brings spoken interaction to standard Web content by integrating a set of mature Web technologies such as XHTML and XML Events with XML vocabularies developed as part of the W3C Speech Interface Framework. The profile includes voice modules that support speech synthesis, speech dialogs, command and control, speech grammars and the ability to attach voice event handlers.
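The following hedged sketch shows the X+V pattern: a VoiceXML form carried in the XHTML head, attached to the page's load event through XML Events. Prefixes and content are illustrative.

```xml
<!-- X+V sketch: an XHTML page whose load event triggers a VoiceXML
     form held in the head. Namespace prefixes follow the X+V profile;
     the content is illustrative. -->
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <title>Multimodal welcome</title>
    <vxml:form id="sayHello">
      <vxml:block>
        <vxml:prompt>Welcome. Please touch or say a menu option.</vxml:prompt>
      </vxml:block>
    </vxml:form>
  </head>
  <body ev:event="load" ev:handler="#sayHello">
    <p>Welcome. Choose a menu option.</p>
  </body>
</html>
```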

Application Creation
It is clear that if an application developer were required to attend specifically to the details of every standard during the development process, application creation would become prohibitively complex. Yet it is equally clear that the evolution of standards plays a vital role in facilitating inter-vendor operability and modularization, and has become the lifeblood of growth in our industry. The solution to this problem is found in today's collection of development tool suites and service creation environments. In recent years, these have grown in sophistication to the point that the developer is shielded from the intricacies of standards conformity or enforcement, yet can derive the full benefit of standard conformance. 

The retail shelves are lined with a collection of sophisticated tool suites and SCEs (Service Creation Environments) designed to address these problems. Developers produce speech and multimodal applications using a selected subset of these tools. Figure 2 illustrates how one can combine a carefully selected subset of these tools and packaged application delivery components to shield the developer from the need to explicitly master each of the standards inherent in the architecture. 

Figure 2

Application Development And Deployment Using The 'Right' Tools
The following description summarizes our application development process with special emphasis on the tools and delivery components used in each phase, and how standards are addressed without the need for specific focus on each.

Planning and discovery. The development process starts with 'planning and discovery' where the project management team interviews the customer to analyze the problem in detail to identify the application's purpose and methodology. 

Dialog design and evaluation. A user interface layout tool is used to express the application's methodology as an ordered collection of dialog 'states,' where each state includes a prompt, expected responses to the prompt and lists of actions associated with each response. The same tool manages testing, where the application's execution is simulated for candidate end users using operator-guided call flow.

Grammar and prompt design. Once dialog design is completed, evaluated and modified as required, the detailed grammars are entered using a feature that employs a spreadsheet metaphor to refine the responses in each dialog state by entering anticipated words and phrases. Additionally a prompt design tool is used to structure the verbal output for each dialog state to use any mixture of recordings and synthesized speech. 

Business transaction integration. Integration with the business transaction processor is performed by building a 'connector.' The tool supports creation of connectors to databases, legacy mainframe applications, and to any Web-based resource or site. Output from a connector consists of an XML or XHTML stream which is integrated into the code using J2EE conventions.
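A connector's output stream might look like the hypothetical fragment below; the element names are illustrative only and are not drawn from any particular product.

```xml
<!-- Hypothetical XML stream returned by a weather connector; element
     names are illustrative, not from any particular product -->
<weatherReport>
  <city>Boston</city>
  <conditions>partly cloudy</conditions>
  <temperature units="F">54</temperature>
</weatherReport>
```

Because the stream is plain XML, the dialog layer can consume it without knowing whether it originated in a database, a mainframe screen or a Web site.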

Voice gateway integration. A voice gateway is selected that best meets the customer's needs, and the runtime engine is conditioned to produce the markup language stream (VoiceXML or SALT) appropriate to the selected platform. The voice gateway provider's tools are used to integrate the gateway into customer-specific circuit-switched or packet-switched networks.

Application testing, tuning and delivery. Tuning and testing are performed using a combination of locally developed tools and tools provided by the speech recognizer and voice gateway vendors. Tuning is followed by piloting, beta testing and phased rollout as dictated by customer contracts.

The past two or three years have seen outstanding growth in the speech application industry, and the start of an expansion into the adjacent multimodal application industry. No single factor is more important in stimulating this growth than the creation of cross-vendor and cross-industry standards. Yet the very abundance of new and maturing standards has led to an increased incentive to hide their arcane complexity in tools to facilitate service creation without the requirement to master details of each relevant standard. Fortunately, service creation tools and deployment platforms have also matured significantly and, because of the very standards they encapsulate, inter-operate to permit cross-vendor life cycle support for speech and multimodal applications. It is the growth of standards that makes this blossoming inter-operability possible, and provides the foundation for our industry to grow to maturity.

At Unisys, Dr. Scholz managed the development of two large scale expert systems. Starting in 1991 as R&D manager, he managed business development for government service contracts. In 1994 he co-founded the NL Speech Solutions business unit and since then has been directing efforts to integrate speech recognition and natural language processing in the creation of Spoken Language Understanding systems. He is a frequent speaker at professional trade shows and was selected as one of the Top Ten Leaders in speech by Speech Technology magazine in 2001. His commitment to standards is demonstrated by his participation as the Unisys representative to the SALT Forum, the VoiceXML Forum, and the W3C Voice Browser Working Group. 


