As robots become more and more humanlike, researchers have observed a sudden dip in appeal, a feeling of revulsion, close to the most humanlike end of the spectrum. They call this the uncanny valley, since a near-perfect simulation can come out the other side.
Your effortless, world-grasping pattern-recognition capabilities instantly lock on to the most subtle dissonances. Conversely, one of the many breakthroughs in the iPhone when it first came out was the physics-inspired behavior of the screen transitions, with inertia, friction and elasticity. This subtle innovation helped to make the iPhone comfortable at a subconscious level. It is telling that while iOS 7 has abandoned many other skeuomorphic features of its predecessors, the physics-based screen behavior remains.
Siri, on the other hand, demonstrates that speech recognition (ASR) and speech synthesis (TTS) remain obstinately on the other side of the uncanny valley. There are many successful implementations of ASR and TTS, but they are in narrowly constrained applications. Siri is particularly infuriating because of the general-purpose claims that its behavior implies. When you have spent 10 minutes struggling to get Siri to dial a person in your contact list, you tend to be less impressed by its ability to respond to questions like: "How good are you at speech recognition, Siri?"
As others have pointed out, the confident expectations of decades past were similar for artificial intelligence and for speech technologies, and they have been dashed for similar reasons. Listening to a real person talking, it is easy to think of the cadences and emphases that differentiate between question and command, or sincerity and irony, as being analogous to the accelerations and inertias of physical objects. It is just as easy to surmise that there must be some physics of expression that could be applied to them the way we apply the physics of rocks and springs to on-screen elements. As with so many intuitive judgments, this turns out to be monumentally wrong.
I am confident that one day a computer will be able to speak and listen as well as a person can, but it will not be through a small set of yet-to-be-discovered physics-like rules. It will be when the computer can reliably detect meta-communications like irony or confidence – a task that challenges the most verbally adept of us.
Michael Stanford has been an entrepreneur and strategist in VoIP for more than a decade. (Visit his blog at www.wirevolution.com.)
Edited by Stefania Viscusi