
Reality Check
December 2002

Robert Vahid Hashemian
Pouring SALT on .NET


Making computers intelligent has been one of the foremost ambitions of many computer scientists. While creating a machine such as Data (the android in Star Trek: The Next Generation) is still out of reach, making computers speak (at least in a parrot-like fashion) was achieved years ago. I still have fond memories of a younger me connecting a speaker to a voice chip I bought from Radio Shack and hearing the synthesized voice when I sent it text commands from my 286's serial port.

Times have changed, and software has now replaced hardware and firmware as the main tool for generating sounds and generally conversing with the computer. SAPI (Speech API) from Microsoft is one such tool. It allows Windows programmers to speech-enable their Windows applications and write programs that can understand users' verbal requests and respond to them. The two major engines of SAPI are TTS (Text-To-Speech) and SR (Speech Recognition). With SAPI, the system can be trained to understand its user's voice and produce meaningful output. The more the user trains the system, the more tuned it becomes to that particular user's voice nuances.

As good as SAPI is, it is written for the previous version of Microsoft's development platform. That means there are no native .NET SAPI modules yet. Of course, the .NET Framework has plenty of interoperability modules (mainly in the System.Runtime.InteropServices namespace) to allow .NET developers to take full advantage of SAPI's functionality.

Then I heard about the .NET Speech SDK. Naturally I thought that this was the actual SAPI.NET. Turns out, it's not. The .NET Speech SDK is a set of tools to speech-enable Web applications based on SALT (Speech Application Language Tags) extensions. SALT is backed by a number of companies, and its specification can be found on the dedicated Web site, http://www.saltforum.org/. Essentially, SALT consists of a set of HTML-like tags that can be mingled with the rest of the HTML code on a Web page, thereby speech-enabling the page. So what does speech-enabling a Web page mean, and how does it benefit users?

A speech-enabled Web page can interact with a user in two distinct modes. The multimodal mode allows the user to interact with the page while sitting in front of the browser. Rather than typing the information into several text boxes or drop-down lists, the user can click a button and speak her choices into the microphone. The page recognizes the user's choices and promptly fills in the controls on the page. It also gives the user vocal feedback (prompts) on the ongoing status of the interaction. This mode is perfect for times when users simply do not care to use the keyboard, or find it cumbersome, such as on a PDA device.

In the telephony mode, the application becomes fully speech-driven. All interaction between the user and the page is carried out using voice. In many cases the user interacts with the page over the phone, without any visual feedback. The page's voice would normally make an introduction and then interact with the user through a set of questions and answers. Once the choices are confirmed, the page is submitted and the user is done.

Just like HTML tags, SALT tags are commands processed on the client side. Plain devices such as telephones can be fronted with a telephony application gateway (a SALT gateway, if you will) to handle the job. In a Windows environment, Internet Explorer must be speech-enabled before it can work with SALT pages coming from a Web server; this is easily accomplished by installing the speech client software. After that, SALT Web pages make use of client-side JavaScript and the speech objects to become conversant.
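To give a flavor of that client-side glue, here is a sketch of a hypothetical page fragment. The Start() method and the onreco/onnoreco events come from the SALT specification; the element IDs, handler names, and page layout are made up for illustration:

```html
<!-- Hypothetical fragment: clicking the button starts recognition
     on the SALT listen element declared below. -->
<input type="button" value="Speak" onclick="document.all.car.Start();" />

<salt:listen id="car" onreco="handleReco()" onnoreco="handleNoReco()">
    <salt:grammar src="./make.xml" />
</salt:listen>

<script type="text/javascript">
    // Fired by the listen object when recognition succeeds;
    // the text property holds the recognized utterance.
    function handleReco() {
        alert("Recognized: " + document.all.car.text);
    }
    function handleNoReco() {
        alert("Sorry, I did not understand that.");
    }
</script>
```

Only a speech-enabled browser (or a SALT gateway, for telephony) understands these objects; an ordinary browser would simply ignore the salt: tags.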

SALT itself has a very basic syntax. The main tags are <listen> for speech recognition, <prompt> for TTS speech synthesis and/or real voice prompts, and <dtmf> for dial-tone recognition. A simple SALT section might look like this:

<salt:listen id="car">
    <salt:grammar src="./make.xml" />
    <salt:bind targetElement="textbox_car" value="/result/car_make" />
</salt:listen>
This code listens for the user's input for the make of a car. It then consults the appropriate XML file, make.xml, which contains match rules collectively called a grammar. If a match is found, an XML fragment called SML (Speech Markup Language) is generated with a tag named car_make, whose value is accessed using XPath and placed in the textbox_car text box.
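For illustration, a minimal make.xml might look like the following. The W3C SRGS XML grammar format shown here is what SALT grammars are typically written in; the rule name and the list of makes are hypothetical:

```xml
<!-- Hypothetical make.xml: a small grammar matching a few car makes. -->
<grammar xmlns="http://www.w3.org/2001/06/grammar" root="car_make">
    <rule id="car_make">
        <one-of>
            <item>Ford</item>
            <item>Honda</item>
            <item>Toyota</item>
        </one-of>
    </rule>
</grammar>
```

If the user says "Honda," the recognizer might return an SML fragment along the lines of <result><car_make>Honda</car_make></result>, from which the bind element's XPath expression /result/car_make extracts the value for the text box.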

One of the advantages of Microsoft's .NET Speech SDK is that all the complexities of SALT are abstracted from the developer through the graphical approach of Visual Studio. Using Visual Studio .NET, the developer can create an ASP.NET Web application and drop a few server-side speech controls called QA (Question-Answer) onto a page along with other text-based controls. He can then design the grammar XML files (used for matching spoken words to a list of values) and record the prompts using Visual Studio tools, wire up the QA controls to the grammar and prompt files, and have a simple speech-enabled page up and running.

I really enjoyed working with the prompt editor. It is an intelligent product that can recognize spoken words and word-align them to the corresponding text that I type. I could then create a variety of prompts using pieces of the recorded phrases. When the ASP.NET page is accessed, the server side generates the content, including the SALT and HTML tags, and sends it off to the browser. The speech-enabled browser, having received the page, processes the SALT tags and becomes conversant. The .NET Speech SDK also comes with a decent debugger that displays a dialog box from which the user can interact with the page using text or speech, test a variety of inputs, and observe outputs such as the SML fragment generated by speech recognition.
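To give a rough idea of the server-side markup behind a QA control, a speech-enabled ASP.NET page might contain something like the sketch below. This is an approximation based on the beta SDK; the tag prefix, attribute names, control IDs, and file names here are assumptions and may differ from what your version of the designer actually emits:

```aspx
<%-- Hypothetical fragment of a speech-enabled ASP.NET page. --%>
<form runat="server">
    <asp:TextBox id="textbox_car" runat="server" />

    <speech:QA id="qaCarMake" runat="server">
        <Prompt InlinePrompt="What make of car are you looking for?" />
        <Reco>
            <Grammars>
                <speech:Grammar Src="make.xml" />
            </Grammars>
        </Reco>
    </speech:QA>
</form>
```

At run time the control renders the corresponding <salt:prompt> and <salt:listen> tags into the page, so the developer rarely has to write raw SALT by hand.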

The .NET Speech SDK (which is currently in beta) does have some shortcomings. For example, the telephony mode only runs in emulation on a browser, and one cannot experience a real interaction with the application using a telephone. The TTS engine is hard-coded to one synthesized voice, although developers are encouraged to record human voice prompts for a richer user experience. Also, while client-side code is abstracted from the developer, sometimes it is necessary to edit pieces of the code in the HTML editor or write client-side JavaScript modules to handle some desired functionality, such as dynamically choosing prompts to play based on user entries on the page. But all in all, the .NET Speech SDK is a good first step in allowing developers to craft speech-enabled SALT pages using a more simplified graphical approach. Speak back to me with your comments at rhashemian@tmcnet.com.

Robert Vahid Hashemian provides us with a healthy dose of reality every other month in his Reality Check column. Robert is Webmaster for TMCnet.com -- your online resource for CTI, Internet telephony, and call center solutions. He is also the author of the recently published Financial Markets For The Rest Of Us. He can be reached at rhashemian@tmcnet.com.

