It's all too easy to imagine a computer that can recognize and translate human
speech. We see it on television all the time. Take Star Trek, for example. Nearly every
episode of that program depends on the universal translator. The galaxy may be
a veritable Babel, what with all those aliens threatening each other in strange tongues,
but everyone is confident that the universal translator will make sure every statement is
heard the way the speaker intended. Never, no matter how strange the alien creature, is a
pronouncement such as, "I will accept your gracious invitation and all your
thoughtful suggestions," misinterpreted as, say, "I will obliterate your
ridiculous planet and all its noxious inhabitants."
Confronted with the science fiction vision of speech recognition and translation,
people with technical knowledge are almost embarrassed to admit the true state of
today's technology, especially if there are Trekkies around. Which is too bad.
Although having that e-mail about your next appointment read to you over the phone
isn't the stuff of high adventure, it is useful, and far from trivial as a technical
accomplishment. Further, today's technology keeps improving. How long will it be
before technological fact catches up with science fiction?
PRIMITIVE BEGINNINGS
I remember when having your computer convert text to speech was a novelty. I
couldn't wait to add text-to-speech capability to my first real computer, a Tandy
TRS-80 Color Computer (COCO), by installing RealTalker, a cartridge with a special chip
inside. RealTalker plugged right into the side of the computer.
I invited my friends over and showed them how I could make my computer talk. They were,
to my disappointment, unimpressed. But I knew my audience, and I instructed the computer
to say things that would land any Star Trek crewman in the brig. Then, when COCO started
swearing a blue streak, eyes widened and jaws dropped. My friends finally agreed that
computers were cool after all. They all wanted a computer just like mine. (If only
marketing text-to-speech were so easy today!)
I have to admit, though, the voice wasn't all that great. The computer's
voice was just like the one used by the computer in the movie WarGames. Basically, the
computer sounded like a computer, not a human. The technology for making a computer sound
like a human just wasn't there yet.
If text-to-speech was primitive 15 years ago,
speech recognition was even worse. As a matter of fact, getting a computer to recognize
the human voice is much more complex than getting a computer to perform text-to-speech.
Yet the complexity of the task hasn't daunted researchers. The Advanced Research
Projects Agency (ARPA) has supported research at several institutions, including the
Massachusetts Institute of Technology (MIT), to foster the development of speech
recognition.
Why is speech recognition so important that the government would see fit to give it a
helping hand? It could become the ultimate human/computer interface. No more clunky
keyboard or carpal-tunnel-syndrome-inducing mouse! No more telephone keypad!
Just tell the computer what you want, and the computer responds. One day, we may, like
Star Trek's Scotty, be amazed that anyone would communicate with a computer any other
way. (In one of the Star Trek feature films, Scotty traveled back in time and was
enjoined to use a mouse. So he picked up the mouse and spoke into it, announcing,
"Computer: I want you to...")
PRACTICAL ADVANTAGES OF SPEECH RECOGNITION
Speech recognition is an interesting niche in the CTI industry. Everyone has been
predicting its explosive growth, particularly in the IVR industry. Several IVR systems
give you the convenience of being able to speak digits to traverse the IVR's menu
tree rather than using DTMF digits. Although DTMF has been the traditional interface for
telephone users, this interface has several drawbacks. Using DTMF is time-consuming and
often frustrating. Resorting to speech, on the other hand, is natural and much more
powerful. In addition, many would-be DTMF users have rotary phones. Outside the U.S., the
percentage of people using rotary phones is even higher.
Here's another example of
how speech recognition can be advantageous. How many times have you called someone without
knowing what that person's extension was? You could, of course, go to the company
directory and enter the person's name. But this has three disadvantages. First, you
may misspell the person's name. Second, it takes a long time to key in a
person's name. Third, you will run up your phone bill keying in people's names
(or, in the case of 800 numbers, run up the phone bill of the person you're calling).
Does the technology exist today for implementing speech recognition into an
IVR/auto-attendant system? You bet! I've seen at least one company (Dialogic, in
their sales department) implement speech recognition into their Interactive Voice Response
system. So, it's only a matter of time before other companies see the benefits of
implementing speech recognition into their phone systems.
The latest continuous speech recognition systems allow callers to exchange much more
information using complete phrases and sentences in a single response. Today, callers need
not restrict themselves to simple, short phrases or numeric responses. Instead, they can
give detailed verbal instructions to the system. For instance, if you are traveling and
wish to remotely retrieve e-mail with a particular date, you can just say, "Retrieve
e-mail from November 9, 1997." Another example might be, "Read all voice mail
from John Smith." Accessing features in this way was impossible, or at the least very
difficult, through the traditional DTMF interface. Thus, the latest generation of
automatic speech recognition technology is helping today's highly mobile business
professional conduct business while traveling.
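To make the idea concrete, here is a minimal Python sketch of what an application might do once a recognizer has turned the caller's sentence into text. Everything in it is an assumption made for illustration: the function name parse_command, the two patterns, and the shape of the result are hypothetical and are not taken from any vendor's product. A real system would use the recognizer's own grammar facilities rather than regular expressions.

```python
import re
from datetime import date

# Month names used to turn a spoken date into a calendar date.
MONTHS = ["january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"]

# Hypothetical patterns for the two spoken commands used as examples in the text.
EMAIL_RE = re.compile(
    r"^retrieve e-?mail from ([a-z]+) (\d{1,2}),? (\d{4})$", re.IGNORECASE)
VOICE_MAIL_RE = re.compile(r"^read all voice ?mail from (.+)$", re.IGNORECASE)


def parse_command(transcript):
    """Map a recognized sentence to a structured request, or None if unknown."""
    text = transcript.strip().rstrip(".")

    match = EMAIL_RE.match(text)
    if match:
        month_name, day, year = match.groups()
        if month_name.lower() not in MONTHS:
            return None
        month = MONTHS.index(month_name.lower()) + 1
        try:
            when = date(int(year), month, int(day))
        except ValueError:
            # The words parsed, but the date does not exist (e.g. November 31).
            return None
        return {"action": "retrieve_email", "date": when}

    match = VOICE_MAIL_RE.match(text)
    if match:
        return {"action": "read_voice_mail", "sender": match.group(1)}

    return None
```

Calling parse_command("Retrieve e-mail from November 9, 1997.") yields a request carrying the action and the date in one step, which is the kind of single-response exchange that would take many keypresses over DTMF.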
VENDOR COMMITMENT TO SPEECH RECOGNITION
Several vendors are active in speech recognition. One such company, Dragon Systems, is
discussed in this issue. (See our review of Dragon Systems' NaturallySpeaking.) In addition, several vendors who produce application generators have integrated
speech recognition technology into their software. For instance, Artisoft's Visual
Voice application generator has integrated text-to-speech as well as speech recognition
capabilities.
Brooktrout has entered the speech recognition arena by taking advantage of technology
from Voice Control Systems. Brooktrout has used this technology to add a module to their
Show N Tel product.
Speech Solutions (a subsidiary of Global Intellicom, Inc.) has developed speech
recognition ActiveX custom controls. These ActiveX controls allow programmers to add
speech recognition capabilities simply by dropping the controls into applications built
with tools that support ActiveX, such as Visual Basic. (For more information,
check out their Web site at www.speechsolutions.com.) These are just a few examples
of how speech recognition is being embraced by software vendors.
What about the hardware side? Every voice processing board manufacturer has been
scrambling to add speech recognition to its product line or to partner with a leading
speech recognition vendor. For instance, Dialogic has their Antares line for speech
recognition.
Natural MicroSystems, like Brooktrout, uses technology from Voice Control Systems.
NMS's recently released NaturalRecognition 2.0 leverages the speech recognition
technology from Voice Control Systems to provide speaker-independent and speaker-dependent
speech recognition capabilities. NMS's NaturalRecognition supports 16 channels of
speech recognition in a single PC slot that also provides fax and call processing
capabilities.
THE MICROSOFT ANGLE
One other interesting tidbit in the speech recognition industry is that Microsoft recently
took a $45 million stake in Belgian speech technology firm Lernout & Hauspie, which
analysts say is about 5 to 7 percent of L&H's capital. Microsoft's entry
into the speech recognition field certainly bodes well for this industry and only
reinforces my belief that speech recognition will become increasingly important. Rumor
also has it that Microsoft wishes to embed speech recognition technology into the
operating system. This is very interesting. How this will affect other speech recognition
vendors remains to be seen.
However, in my opinion, hardware-based speech recognition doesn't have much to
worry about. The reason is that hardware-based speech recognition uses DSP technology,
which allows for much more scalability. Without hardware, you are relying on the computer
processor for speech recognition, which, as you probably already know, chews up
a lot of CPU cycles. Therefore, I firmly believe that L&H's technology, once
embedded into Windows, will be targeted toward the low end of speech recognition.
I should note, however, that with the increasing power of computer processors, this
"low-end" scenario may not hold true forever. Still, in my opinion, DSPs offer
more power per price point than a computer processor, and thus there will still be a need
for hardware-based speech recognition using DSP technology for higher-end applications.
Having speech recognition embedded into the next Microsoft operating system will most
likely affect software-based speech recognition vendors first. Thus, companies such as
Dragon Systems and IBM would be wise to keep their eyes open. IBM has a software-based
speech recognition product called ViaVoice. This product is similar to Dragon
Systems' NaturallySpeaking in that ViaVoice is a continuous speech recognition
software product. I believe that L&H's technology will first be used for entering
voice commands into Windows rather than (or in addition to) using the mouse, or for
dictation into Microsoft Office applications. We'll find out in 1998 when the newest
version of Windows is released!
UNDERLYING DRIVERS
Several forces are driving speech recognition, making it a practical and useful
technology. First, the algorithms behind speech recognition are getting better. Second,
the technology continues to improve and become more efficient in the use of computer
resources. Third, there are more software development tools, such as application
generators, for writing speech recognition applications. These software development tools
make it possible to quickly and easily develop speech recognition applications; simply
drag-and-drop a speech recognition block from a palette of blocks and edit the properties.
We should also remember that the processing power in computers continues to increase in
accordance with Moore's Law, which, as popularly stated, holds that processing power
doubles every 18 months. Indeed, Moore's Law has been holding true for at least the past
15 years. As computer processors become faster and faster, and as speech recognition uses
more efficient and accurate algorithms, the factors listed above will together greatly
enhance speech recognition technology.
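As a rough worked example of what that doubling rate implies, the short Python sketch below computes the cumulative growth over a span of years. It assumes the popular "doubles every 18 months" formulation quoted above and treats it as exact, which real hardware never is; the function name growth_factor is made up for this illustration.

```python
def growth_factor(years, doubling_months=18):
    """Return how many times processing power multiplies over `years`,
    assuming it doubles once every `doubling_months` months."""
    doublings = years * 12 / doubling_months
    return 2 ** doublings


# Fifteen years is 180 months, or ten doubling periods of 18 months each.
print(growth_factor(15))  # 1024.0
```

Under that assumption, the 15 years mentioned above amount to ten doublings, or roughly a thousandfold increase in processing power available to speech recognition software.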
A HYPOTHETICAL KILLER APP
We all know that text-to-speech technology has been around for a long time, certainly as
far back as 15 years ago, when I played with my first text-to-speech product, RealTalker.
But now, I'd like to describe a hypothetical "killer app" that should not, given
improvements in speech recognition, remain hypothetical for long.
Suppose we take a text-to-speech product, similar to RealTalker, but much more advanced.
Suppose the software has very advanced knowledge and intelligence, including grammar rules
and other language rules, which give it the capability to translate text from one language
to text in another language, similar to what most of us did in high school when we took
Spanish 101 or French 101. You could then convert the translated text to speech using
text-to-speech. Thus, any written text in English could be converted to text in French,
German, Russian, or whatever language you choose.
Now consider this possibility. Many speech recognition products support multiple
languages and can perform speech-to-text conversion. Take my previous example of
converting text in one language to text in another language. Now, using speech recognition,
you could, theoretically, talk to anyone in the world over the telephone, even if that
person speaks a completely different language.
Here's how it could be done. Speech recognition would first convert the speech
to text. Next, a special text-to-text converter using grammar rules and the like would
translate the text to another language. Finally, text-to-speech at the other end of the
phone would speak the translated text in the native language of the person to whom you are
talking. Universal language recognition: not since the destruction of the Tower of
Babel has universal language communication been possible.
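The three stages can be sketched as a simple chain of functions. The Python below is only an illustration of the plumbing, not of the hard parts: all names are hypothetical, the "recognizer" and "synthesizer" merely decode and encode bytes, and the "translator" is a toy word-for-word lookup that ignores the grammar rules a real text-to-text converter would need.

```python
def make_translator_pipeline(recognize, translate, synthesize):
    """Chain the three stages: speech -> text -> translated text -> speech."""
    def relay(audio):
        text = recognize(audio)           # speech recognition
        translated = translate(text)      # text-to-text translation
        return synthesize(translated)     # text-to-speech
    return relay


# Stand-in stages, assumed purely for illustration.
def fake_recognize(audio):
    # Pretend the audio bytes are already the spoken words.
    return audio.decode("utf-8")


TOY_DICTIONARY = {"hello": "bonjour", "friend": "ami"}


def fake_translate(text):
    # Word-for-word substitution; real translation needs grammar and context.
    return " ".join(TOY_DICTIONARY.get(word, word) for word in text.lower().split())


def fake_synthesize(text):
    # Pretend that encoding the text produces audio for the listener.
    return text.encode("utf-8")


relay = make_translator_pipeline(fake_recognize, fake_translate, fake_synthesize)
```

Because each stage is passed in as a function, any one of them could in principle be swapped for a real speech recognition, translation, or text-to-speech product without changing the other two.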