March 19, 2010
Posted: 08:53 AM ET
Michael Cohen, Google’s voice-technology guru, is in the business of knowing what you’ll say - before you say it.
That may sound creepy, but that’s essentially how voice-recognition technology works these days.
In a conference room at Google’s Mountain View, California, headquarters, Cohen recently explained this complicated process by scribbling a mess of circles, words and arrows in blue ink on a whiteboard.
He ended up drawing a sort of assembly line:
On the left, words go in. Arrows direct them to various circles and boxes, which process the speech using complex equations. Some boxes analyze the sounds in the words. If “the cat” goes through this whiteboard factory, the sounds for “th” and “c” may get picked up because of how they look in digital format.
But the process doesn’t stop there. It would be far too time-consuming and computer-intensive to analyze every sound. So Google’s computers start guessing at what comes next, Cohen said.
What word would likely come after the phrase, “the cat”? Well, a verb, probably. “Is,” “sat” or “jumped” would be good bets. So, based on the fact that other people have used verbs after the phrase “the cat” before, Google’s computers start guessing what word is likely to come next.
“You can think of the whole thing as just circles and arrows, and if you’re in this circle, there’s a certain probability that you’ll go to this next one,” Cohen said. Google's computers draw out these paths, based on statistics, and then spit out the text that goes with the correct path.
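For the curious, here is a rough sketch of that circles-and-arrows idea in Python. It is not Google's system, just a toy: the four training sentences and the function names are made up for illustration, and a real recognizer works with far more data and with sounds as well as words. Each word is a circle, and the arrows out of it carry the probability of the word that follows.

```python
from collections import Counter, defaultdict

# Toy training text (made up for illustration; not Google's data).
SENTENCES = [
    "the cat sat on the mat",
    "the cat is on the mat",
    "the cat sat by the door",
    "the cat jumped on the bed",
]

def build_transitions(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probabilities(counts, word):
    """Turn the raw follow-up counts for `word` into probabilities."""
    followers = counts[word]
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}

transitions = build_transitions(SENTENCES)
probs = next_word_probabilities(transitions, "cat")
best_guess = max(probs, key=probs.get)

print(probs)       # probability of each word seen after "cat"
print(best_guess)  # the single most likely next word
```

With this tiny sample, "sat" follows "cat" in two of the four sentences, so it comes out as the best guess. More text would shift the odds, which is the point: the guesses are only as good as the statistics behind them.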
This guessing game works for sounds, words and sentences. It's not that the computer really understands what you're saying; it's just that it can often guess what you'll say and how you might say it.
This only gets trickier the more you think about it, and the more questions you ask.
What about accents, for example?
Cohen has a noticeable Brooklyn, New York, accent because that's where he's from. Instead of saying "car," he says "caah." Instead of "human," he says "you-muhn." If you're a person, it's still obvious what he's saying. But to a computer, that's really confusing. All of us talk differently, and we don't always say words the same way twice. So Google's computers have to work hard to understand these differences, and they use context and statistics to do so.
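Here is one way to picture how context and statistics might rescue an accented "caah." This is only a sketch under loud assumptions: every number below is invented, the candidate words and the "park the" context are hypothetical, and multiplying a sound score by a context score is a common textbook approach, not something Cohen described as Google's exact method.

```python
# Hypothetical numbers, for illustration only.
# How well the heard sound "caah" matches each candidate word (0 to 1).
SOUND_MATCH = {"car": 0.40, "cow": 0.35, "caw": 0.25}

# How likely each word is after the context "park the" (0 to 1).
CONTEXT_PROB = {"car": 0.90, "cow": 0.08, "caw": 0.02}

def best_word(sound_match, context_prob):
    """Pick the word with the highest combined (sound x context) score."""
    scores = {
        word: sound_match[word] * context_prob.get(word, 0.0)
        for word in sound_match
    }
    return max(scores, key=scores.get), scores

word, scores = best_word(SOUND_MATCH, CONTEXT_PROB)
print(word)    # the candidate that wins once context is counted
print(scores)  # combined score for every candidate
```

On sound alone, the three candidates are close. Once the context is counted, "car" wins easily, even though the speaker never pronounced the "r."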
"Luckily, there are a lot of people from Brooklyn, so it recognizes me well," he said.
Google's equations account for 10,000 different kinds of sounds. That's obviously way more than the 26 letters in the English alphabet, but if you think about it, it makes some sense. When you say "map," you make a different "a" sound than when you say "tap." That's because your lips come together differently, Cohen said.
To further complicate things, every time the computers make a guess about what a person is saying, they have literally trillions of sound combinations to choose from, he said.
As I wrote on CNN.com today, voice technology has been around for a while, but it's seeing a sort of renaissance on mobile phones. Google has been demoing a number of speech products lately, including one for the Nexus One phone that lets people who speak different languages talk to each other. Bing and Google both have voice search functions for mobile phones, in case it's easier to say what you're looking for than to type it on a mobile keyboard.
What do you think? Is speech technology good enough for people to start using it regularly? Do you use voice search on your phone? And also let me know if you have further questions about how the technology works.
I've started using voice search on my phone from time to time. And, after talking to Cohen and several other researchers like him, each time I do, I'm amazed that any words come back - whether they're right or not.