For those who aren’t familiar with the phrase, the title of this article is arguably the best-known idiom in speech research. That’s because it neatly demonstrates the confusion that can wreak havoc on automated speech recognition (ASR) or text-to-speech (TTS) systems that confuse “How to Wreck a Nice Beach” with “How to Recognize Speech.”
For researchers at MIT, the way out of that confusion was evident. They have developed a freely available knowledge database that analyzes speech patterns using common sense – that is, you sing calm incense. Such databases could tackle problems like sending NY- or DC-based business travelers flight times and airfares for Austin instead of Boston when they ask about “shuttle flights.”
The project, dubbed ConceptNet, is a semantic network automatically generated from the 700,000 sentences of the MIT Media Lab’s Open Mind Common Sense Project, a World Wide Web collaboration built over the past five years by more than 14,000 contributing authors.
“In the past few years, we have started to reach the point where these more sophisticated kinds of [linguistic models] have shown improvements on some tasks,” said Jim Glass, principal research scientist at the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT. “Most people's intuition has been that linguistic constraints should help. It's just been a challenge to make it work.”
But the scientific community has its fair share of skeptics, and not everyone is convinced that MIT’s work will lead to advances in speech recognition. “Frankly, this is in its infancy,” said David Nahamoo, Department Group Manager of Human Language Technologies at IBM Research.
“The evidence that they actually help a lot in speech recognition still has to be shown,” said Nahamoo, referring to linguistics-based modeling. “There is a lot more to be gained from acoustic-modeling than linguistic-modeling.”
And just as IBM steadfastly refuses to support Microsoft’s SALT standard for enabling voice systems integration, the IBM researcher sees little need to explore a methodology that he believes has yet to be proven.
“Today, if I were a betting person, which I am in my scientific activities, I am betting more on the acoustic modeling than the linguistic modeling,” Nahamoo said.
Yet while ConceptNet is really more about enabling artificial intelligence, it lends itself easily to ASR/TTS systems because speech applications, although more narrowly targeted, are nearly as complex when it comes to communicating with us.
IBM’s approach to speech technology closely follows the Hidden Markov Model (HMM) approach. An HMM-based system uses a small amount of language modeling but depends more on statistical analysis of acoustic speech patterns, broken down into their smallest parts: phonemes (the basic building blocks) and allophones (variations of a given phoneme as it is spoken).
The English language, for example, has only 44 phonemes but, because of cadence or intonation, more than 2,000 allophones. And that doesn’t even take into account accents, drawls, emotions and noise. As you can see, human speech is a lot for a machine to understand. Add the statistical combinations that arise from social or psychological context, and that might be enough to scare an IT director back to traditional IVRs.
With acoustic modeling, speech is recognized by estimating the likelihood of phonemes arranged in particular sequences (sometimes referred to as “concatenating”), and a search procedure determines which sequences have the highest probability. The obvious drawback is that this isn’t even remotely close to how the human brain functions.
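To make the "search for the most probable sequence" idea concrete, here is a minimal sketch of Viterbi decoding, the standard search procedure used with HMMs. It is an illustration only, not IBM's implementation: the two phoneme states ("s" and "iy"), the two acoustic frame labels ("hiss" and "tone") and every probability below are invented for the example.

```python
import math

def to_log(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: (to_log(v) if isinstance(v, dict) else math.log(v))
            for k, v in table.items()}

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (log probability, state path) of the most likely state sequence."""
    # best[s] holds the best-scoring path that ends in state s so far.
    best = {s: (log_start[s] + log_emit[s][observations[0]], [s])
            for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the predecessor that gives the highest score on entering s.
            score, path = max((best[p][0] + log_trans[p][s], best[p][1])
                              for p in states)
            new_best[s] = (score + log_emit[s][obs], path + [s])
        best = new_best
    return max(best.values())

# Toy model (all numbers are assumptions made up for this sketch).
states = ["s", "iy"]
log_start = to_log({"s": 0.6, "iy": 0.4})
log_trans = to_log({"s": {"s": 0.5, "iy": 0.5},
                    "iy": {"s": 0.2, "iy": 0.8}})
log_emit = to_log({"s": {"hiss": 0.9, "tone": 0.1},
                   "iy": {"hiss": 0.2, "tone": 0.8}})

score, path = viterbi(["hiss", "tone", "tone"], states,
                      log_start, log_trans, log_emit)
print(path)  # ['s', 'iy', 'iy']
```

Working in log-probabilities avoids numeric underflow on long utterances. A production recognizer would use thousands of context-dependent states and prune the search, which this sketch does not attempt.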
“We need to do a better job of modeling how the brain interprets speech,” said Paul Hosom, assistant professor at Oregon Health and Science University’s Center for Spoken Language Understanding (CSLU). “The way HMMs work is very dissimilar to the way humans recognize speech.”
ConceptNet, on the other hand, maps surface linguistic variations to real-world text. By doing so, researchers at MIT hope to help more systems with natural-language processing. They have even released an API for a processing engine written in Python and Java (English only).
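As a rough illustration of how common-sense knowledge could help a recognizer choose between acoustically similar transcriptions, the sketch below scores each candidate by how many of its word pairs are linked in a tiny hand-made network. It does not use the ConceptNet API or data: the `RELATED` links, the stopword list and both function names are hypothetical, and a real system would combine such a score with acoustic and language-model scores.

```python
from itertools import combinations

# Hypothetical toy network of related concepts, invented for this sketch.
RELATED = {
    frozenset(("recognize", "speech")),
    frozenset(("speech", "microphone")),
    frozenset(("recognize", "microphone")),
    frozenset(("beach", "sand")),
    frozenset(("wreck", "ship")),
}

STOPWORDS = {"how", "to", "a", "the"}

def coherence(sentence, context=()):
    """Count the pairs of content words (plus context words) linked in RELATED."""
    words = {w for w in sentence.lower().split() if w not in STOPWORDS}
    words.update(context)
    return sum(frozenset(pair) in RELATED for pair in combinations(words, 2))

def pick_transcription(candidates, context=()):
    """Return the candidate whose words fit together best."""
    return max(candidates, key=lambda c: coherence(c, context))

candidates = ["how to wreck a nice beach", "how to recognize speech"]
best = pick_transcription(candidates, context=("microphone",))
print(best)  # how to recognize speech
```

The `context` argument stands in for whatever the application already knows about the conversation, here the single word "microphone". With a different context and a richer network, the beach reading could just as well win.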
“One of the things that researchers have been trying to do for many years is to incorporate more linguistic structure ... into the language model,” MIT’s Glass explained.
“When we learn our grammar in school we learn about proper syntax but it has been hard to figure out how to incorporate it effectively into ASR.”
To discover more about speech technologies and how your business can benefit from them, be sure to attend TMC's Speech-World Expo May 24-26, 2005, in Dallas. Both Speech-World and TMCnet are owned by Technology Marketing Corporation.
Robert Liu is executive editor at TMCnet. Previously, he was executive editor at Jupitermedia and has also written for CNN, A&E, Dow Jones and Bloomberg. He can be reached at [email protected].