Michael Cohen, Google’s voice-technology guru, is in the business of knowing what you’ll say - before you say it.
That may sound creepy, but that’s essentially how voice-recognition technology works these days.
In a conference room at Google’s Mountain View, California, headquarters, Cohen recently explained this complicated process by scribbling a mess of circles, words and arrows in blue ink on a whiteboard.
He ended up drawing a sort of assembly line:
On the left, words go in. Arrows direct them to various circles and boxes, which process the speech using complex equations. Some boxes analyze the sounds in the words. If “the cat” goes through this whiteboard factory, the sounds for “th” and “c” may get picked up because of how they look in digital format.
But the process doesn’t stop there. It would be far too time-consuming and computer-intensive to analyze every sound. So Google’s computers start guessing at what comes next, Cohen said.
What word would likely come after the phrase, “the cat”? Well, a verb, probably. "Is," “sat” or “jumped” would be good bets. So, based on the fact that other people have used verbs after the phrase “the cat” before, Google’s computers start guessing what word is likely to come next.
“You can think of the whole thing as just circles and arrows, and if you’re in this circle, there’s a certain probability that you’ll go to this next one,” Cohen said. Google's computers draw out these paths, based on statistics, and then spit out the text that goes with the correct path.
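The "circles and arrows" picture can be sketched as a tiny bigram model: count which word followed which in past text, then turn those counts into probabilities. This is a minimal illustration, not Google's system; the toy corpus and the function names `train_bigrams` and `next_word_probabilities` are made up for the example.

```python
from collections import Counter, defaultdict

# Toy corpus (an assumption for illustration, not real usage data).
CORPUS = [
    "the cat sat on the mat",
    "the cat is asleep",
    "the cat sat down",
    "the cat jumped off the sofa",
]

def train_bigrams(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts

def next_word_probabilities(counts, word):
    """Turn raw follow-up counts for `word` into probabilities."""
    followers = counts[word]
    total = sum(followers.values())
    return {w: n / total for w, n in followers.items()}

counts = train_bigrams(CORPUS)
probs = next_word_probabilities(counts, "cat")
best_guess = max(probs, key=probs.get)
print(probs)       # probability of each word seen after "cat"
print(best_guess)  # the single most likely next word
```

In this toy corpus "sat" wins because it followed "cat" most often. A real recognizer would draw on far more text and longer stretches of context.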
This guessing game works for sounds, words and sentences. It's not that the computer really understands what you're saying; it's just that it can often guess what you'll say and how you might say it.
This only gets trickier the more you think about it, and the more questions you ask.
What about accents, for example?
Cohen has a noticeable Brooklyn, New York, accent because that's where he's from. Instead of saying "car," he says "caah." Instead of "human," he says "you-muhn." If you're a person, it's still obvious what he's saying. But to a computer, that's really confusing. All of us talk differently, and we don't always say words the same way twice. So Google's computers have to work hard to understand these differences, and they use context and statistics to do so.
"Luckily, there are a lot of people from Brooklyn, so it recognizes me well," he said.
Google's equations account for 10,000 different kinds of sounds. That's obviously way more than the 26 letters in the English alphabet, but if you think about it, it makes some sense. When you say "map," you make a different "a" sound than when you say "tap." That's because your lips come together differently, Cohen said.
To further complicate things, every time the computers make a guess about what a person is saying, they have literally trillions of sound combinations to choose from, he said.
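With that many combinations, checking every possible sequence one by one is not practical. One standard way to find the most probable path through a set of "circles and arrows" without listing them all is the Viterbi algorithm. The article does not say which search method Google uses, so the sketch below is only an illustration; the candidate words, the scores and the name `best_path` are all invented for the example.

```python
import math

# Hypothetical candidates per time step: word -> how well it matches the sound.
ACOUSTIC = [
    {"the": 0.9, "duh": 0.1},
    {"cat": 0.5, "cap": 0.5},
    {"sat": 0.6, "sad": 0.4},
]

# Hypothetical chance of moving from one word to the next (the "arrows").
TRANSITION = {
    ("the", "cat"): 0.6, ("the", "cap"): 0.4,
    ("duh", "cat"): 0.5, ("duh", "cap"): 0.5,
    ("cat", "sat"): 0.8, ("cat", "sad"): 0.2,
    ("cap", "sat"): 0.3, ("cap", "sad"): 0.7,
}

def best_path(acoustic, transition):
    """Return the most probable word sequence and its log-probability."""
    # best[word] = (log-probability of the best path ending in word, that path)
    best = {w: (math.log(p), [w]) for w, p in acoustic[0].items()}
    for step in acoustic[1:]:
        new_best = {}
        for word, sound_p in step.items():
            # Keep only the best way of arriving at `word`.
            score, path = max(
                (prev_score + math.log(transition[(prev, word)]), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[word] = (score + math.log(sound_p), path + [word])
        best = new_best
    score, path = max(best.values())
    return path, score

path, log_prob = best_path(ACOUSTIC, TRANSITION)
total_sequences = math.prod(len(step) for step in ACOUSTIC)
print(path)             # most probable word sequence
print(total_sequences)  # how many full sequences exist in this toy lattice
```

The loop keeps just one best path per candidate word at each step, so the work grows with the number of steps and candidates, not with the number of complete sequences (a handful here, trillions in Cohen's description).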
As I wrote on CNN.com today, voice technology has been around for a while, but it's seeing a sort of Renaissance on mobile phones. Google has been demoing a number of speech products lately, including one for the Nexus One phone that lets people who speak different languages talk to each other. Bing and Google both have voice search functions for mobile phones, in case it's easier to say what you're looking for than to type it on a mobile keyboard.
What do you think? Is speech technology good enough for people to start using it regularly? Do you use voice search on your phone? And also let me know if you have further questions about how the technology works.
I've started using voice search on my phone from time to time. And, after talking to Cohen and several other researchers like him, each time I do, I'm amazed that any words come back - whether they're right or not.
Posted by: John D. Sutter -- CNN.com writer/producer
Filed under: Google
this will make humans so much more lazy. but it is very interesting
We love the convenience of what voice search does for many areas of life, especially when you're shopping...multi-tasking and hunting for items throughout a store. Google does an amazing job with this and this will change all our lives in the next three years. http://www.facebook.com/pages/aisle411/369008343660
[...] has spread by word of mouth in India. But the real buzz is using your voice to search. According to CNN, Google's Nexis One can break down language barriers. And you can use your voice to search [...]
It's a fabulous attempt. It is the first step to new world.
I think I can help you on this project. How can I send my ideas to you? Please let me know.
It's way past time for HAL9000.
Using Dragon Naturally Speaking at work, I can dictate 250 words per minute with 98% accuracy to my computer vs. my crappy 65 wpm typing skills, with about the same accuracy. That is, as long as you don't throw anything weird at me like a phone number :D
When given the option of entering info over the phone by voice or by keypad, I always pick keypad. For example on Delta I have to quickly enter my FF number....or when calling about my Amex card I use the phone keypad again over voice (mostly due to speed and habit). In part that is because I am very fast with the keypad and in part because I don't trust the voice. My only complaint about these systems is that they always assume I am wrong....when they should assume I am always correct (they just need to add an Undo button, so when I do make a mistake and I always know when I mistype...then I get to start over quickly!!). Don't repeat it back to me....and waste my time...just give me an undo button. After all....the computer knows if I enter the wrong number as there is no match for my Amex card number or FF number and since they are both password protected....no fear if I guess someone else's account or card# or FF# as the odds of also knowing their PIN or password are nil.
I see voice used to enter data into dental charting systems, and these systems are known to be very frustrating. Even when you take time to train the computer to know your voice and your pitch and inflection the computer still fails regularly. These "voice activation" systems often claim success rates up to 95%. Let me break that down for you... a dental chart for gum disease is based on 32 teeth, each being measured at 6 places per tooth. You can easily have 3-5 pieces of data per point. That totals roughly 600 to 1,000 pieces of data per chart. The hygienist often has 6 to 10 minutes to enter that data. 95% success (if you get that) equates to a 5% failure rate...so on 1,000 pieces of data that is 50 errors per chart (all done in front of the patient....oops!). Imagine making 50 errors per chart, 8 patients per day. These dental charting systems also have the advantage of a limited voice vocabulary (not the 50,000-word dictionary many of these Dragon and other systems need), only 100 words or less, and still the 5% failure rates! Every year for the last 10 years the dental companies claim, "this year they work" and they never do. I am sure one day they will get better... exhibit A from a small company MS http://video.google.com/videoplay?docid=-1123221217782777472#
...I am a big fan of Google and think if anyone could throw enough money at the "voice activation" problem to fix it, then they are the ones to do it! I also like the idea that using voice for controlling your phone or small device might have a limited set of commands and thus get better accuracy rates. I like that they are looking at how the human brain does this already as a guide....think what word is coming next...maybe Moore's law will help us there and speed up computers so that when they get the algorithms figured out then we can run them in a hundredth of a second to put voice to use in a way that speeds up use of hand held products. However, that white board shown with Michael Cohen seems to look a lot like what my device is thinking when I say something simple like "call Mike" ....I end up with "buy milk" instead.....LOL
My frustration with the big expensive voice systems in use by companies like Delta and Amex (who must use huge computers and complicated algorithms) is that they don't listen when they talk. They should be always listening for info. Remember the old speaker phones that cut off any ability to hear when you were talking...? Only one side could talk at a time....so they fixed that and it was an important improvement. Every conversation has a natural ebb and flow and I hope Google can improve it just like they are improving Chat like in their Google Wave product. I like that as you type live chat you see every single character, so I can formulate my reply as you type vs. wait for the entire message to be finished and sent. Voice systems need that extra half second to speed it up in the same way (always be listening, even when the voice system is talking). just my "two cents"...
...maybe Michael should ask Ray Kurzweil about working on voice.... I think ray got smart and sold off the part of his company that worked on that project.... LOL http://www.kurzweiltech.com/kai.html
if there are trillions of words to choose from why not make it state-based?
How does it recognize the correct word? For example, the KNIGHT of the round table vs. the NIGHT of the round table, or I SEE the account vs. I SEA the account.
JESUS HATES YOU ALL
NOMNOMNOMNOM IM FATTY
i think that i love you
my name is poopsicle people call me poop for short......i am 5 and i am pregnet with puppys.....there so damn cute they always kick....i like to suck on lollypops
I like you too poopsicle
The man who teaches computers to listen.. Retweeted it :)
Thank you sir! I am delighted to read the information about our school and Ghandruk. I hope that you will give further general information. My heartfelt thanks to you for your hard work towards our village as well as our school. Yours faithfully, former student: Ram Bahadur Gurung/2046 From. ENGLAND READING