Making computers intelligent has been one of the foremost ambitions of
many computer scientists. While creating a machine such as Data (the
android in Star Trek: The Next Generation) is still out of reach, making
computers speak (at least in a parrot-like fashion) was achieved years
ago. I still have fond memories of a younger me connecting a speaker to a
voice chip I bought from Radio Shack and hearing the synthesized voice
when I sent it text commands from my 286's serial port. Times have
changed, and software has now replaced hardware and firmware as the main
tool for generating sounds and generally conversing with the computer.
SAPI (Speech API) from Microsoft is
one such tool. It allows Windows programmers to speech-enable their
Windows applications and write programs that can understand users' verbal
requests and respond to them. The two major engines of SAPI are TTS
(Text-To-Speech) and SR (Speech Recognition). With SAPI, the system can be
trained to understand its user's voice and produce meaningful output. The
more the user trains the system, the more attuned it becomes to that
particular user's voice nuances.
As good as SAPI is, it is written for the previous version of
Microsoft's development platform. That means there are no native .NET
SAPI modules yet. Of course, the .NET Framework has plenty of
interoperability modules (mainly in the System.Runtime.InteropServices
namespace) to allow .NET developers to take full advantage of SAPI's
functionality.
Then I heard about the .NET Speech SDK. Naturally I thought that this
was the actual SAPI.NET. Turns out, it's not. The .NET Speech SDK is a set
of tools to speech-enable Web applications based on SALT (Speech Application
Language Tags) extensions. SALT is backed by a number of companies and its
specifications can be found on the dedicated Web site,
http://www.saltforum.org/.
Essentially, SALT extensions consist of a set of HTML-based tags that can
be mingled with the rest of the HTML code on a Web page, thereby
speech-enabling the page. So what does speech-enabling a Web page mean,
and how does it benefit users?
A speech-enabled Web page can interact with a user in two distinct
modes. The multimodal mode allows the user to interact with the page while
sitting in front of the browser. Rather than typing the information into
several text boxes or drop-down lists, the user can click a certain button
and speak her choices into the microphone. The page would recognize the
user's choices and promptly fill in the controls on the page. It would
also give the user vocal feedback (prompts) on the ongoing status of the
interaction. This mode is perfect for times when users simply do not care
to use the keyboard or find it cumbersome, such as on a PDA.
In the telephony mode, the application becomes fully speech driven. All
interaction between the user and the page is carried out using voice. In
many cases the user will interact with the page over the phone, without
any visual feedback. The page's voice would normally make an introduction
and then interact with the user through a set of questions and answers.
Once the choices are confirmed, the page is submitted and the user is
done.
Just like HTML tags, SALT tags are commands processed on the client
side. Plain devices such as telephones could be fronted with a telephony
application gateway (SALT gateway, if you will) to handle the job. In a
Windows environment, Internet Explorer would need to be speech-enabled
before it can work with SALT pages coming from a Web server. This is
easily accomplished by installing the speech client software. After that,
SALT Web pages make use of client-side JavaScript and the speech object to
become conversant.
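For instance, a page can kick off recognition from script by calling the
Start() method on a listen element. A minimal sketch, assuming a SALT
listen element with the id car (like the one shown below):

<!-- Clicking the button activates the SALT listen element named "car" -->
<input type="button" value="Speak" onclick="car.Start();" />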
SALT itself has a very basic syntax. The main tags are <listen> for
speech recognition, <prompt> for TTS speech synthesis and/or real voice
prompts, and <dtmf> for touch-tone (DTMF) digit recognition. A simple SALT
section could look like this:
<salt:listen id="car">
  <salt:grammar src="./make.xml" />
  <salt:bind targetElement="textbox_car" value="/result/car_make" />
</salt:listen>
This code listens for the user's input for the make of a car. It then
consults the appropriate XML file, make.xml, which contains match rules
(collectively called a grammar). If a match is found, an XML fragment
called SML (Speech Markup Language) is generated with a tag named
car_make, whose value is accessed using XPath and placed in the
textbox_car text box.
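To make this concrete, here is a hypothetical make.xml grammar, written
in W3C-style grammar markup (the exact schema the beta SDK expects may
differ), along with the kind of SML fragment a successful recognition
might produce:

<!-- make.xml: a hypothetical grammar listing the car makes to match -->
<grammar root="car_make" xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="car_make">
    <one-of>
      <item> Honda </item>
      <item> Toyota </item>
      <item> Ford </item>
    </one-of>
  </rule>
</grammar>

<!-- A hypothetical SML result; /result/car_make is the XPath that the
     bind element above uses to extract the recognized value -->
<result>
  <car_make>Honda</car_make>
</result>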
One of the advantages of Microsoft's .NET Speech SDK is that all the
complexities of SALT are abstracted from the developer through the
graphical approach of Visual Studio. Using Visual Studio .NET, the
developer can create an ASP.NET Web application and drop a few server-side
speech controls called QA (Question-Answer) onto the page along with other
text-based controls (a sketch of one appears below). He can then design
the grammar XML files (used for
matching spoken words to a list of values) and record the prompts using
Visual Studio tools, wire up the QA controls to the grammar and prompt
files, and have a simple speech-enabled page up and running. I really
enjoyed working with the prompt editor. It is an intelligent product that
can recognize spoken words and word-align them to the corresponding text
that I type. Then I could create a variety of prompts using pieces of the
recorded phrases. When the ASP.NET page is accessed, the server side
generates the contents and SALT and HTML tags and sends them off to the
browser. The speech-enabled browser, having received the page, processes
the SALT tags and becomes conversant. The .NET Speech SDK also comes with
a decent debugger: a dialog box from which the user can interact with the
page using text or speech to test a variety of inputs and observe the
outputs, such as the SML fragment generated by speech recognition.
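To give a feel for those controls, here is a hypothetical sketch of a QA
control on an ASP.NET page. The tag and attribute names are
approximations of the beta SDK's syntax, not its exact API:

<!-- Hypothetical QA control markup (names approximate); at run time
     the server renders it into SALT and HTML for the browser -->
<speech:QA id="qaCarMake" runat="server">
  <Prompt InlinePrompt="What make of car would you like?" />
  <Reco>
    <Grammars>
      <speech:Grammar Src="./make.xml" />
    </Grammars>
  </Reco>
  <Answers>
    <speech:Answer XpathTrigger="/result/car_make" />
  </Answers>
</speech:QA>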
The .NET Speech SDK (which is currently in beta) does have some
shortcomings. For example, the telephony mode only runs in emulation on a
browser, and one cannot experience a real interaction with the application
over a telephone. The TTS engine is hard-coded to one synthesized voice,
although developers are encouraged to record human voice prompts for a
richer user experience. Also, while client-side code is abstracted from
the developer, sometimes it is necessary to edit pieces of the code in the
HTML editor or write client-side JavaScript modules to handle some desired
functionality, such as dynamically choosing prompts to play based on user
entries on the page (a sketch of one such module follows).
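For example, a hypothetical client-side function for picking a prompt at
run time might look like this (the function and element names are
illustrative, not the SDK's API):

// Hypothetical prompt-selection module: returns different prompt text
// depending on what the user has already entered on the page.
function choosePrompt() {
    var box = document.getElementById("textbox_car");
    if (box != null && box.value != "") {
        return "You chose " + box.value + ". Is that correct?";
    }
    return "What make of car would you like?";
}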
But all in all, the .NET Speech SDK is a good first step in allowing
developers to craft speech-enabled SALT pages through a simpler graphical
approach. Speak back to me with your comments at rhashemian@tmcnet.com.
Robert Vahid Hashemian provides us with a healthy dose of reality
every other month in his Reality Check column. Robert is Webmaster for
TMCnet.com -- your online resource for CTI, Internet telephony, and call
center solutions. He is also the author of the recently published
Financial Markets For The Rest Of Us. He can be reached at rhashemian@tmcnet.com.