
Voice recognition starts by picking up on the sounds that regularly correspond to letters of the alphabet. These sounds are called “phonemes”, and they are the idealized units of sound we recognize in an abstract way. The sounds we actually produce are variations on those phonemes, called “phones”, and they are the real building blocks of speech. Being able to differentiate between a “b” and a “v” sound is key for any recognition system to work properly, and that’s just the start. Once a program has a range it can use to decide which “phones” were spliced together, it still has a whole slew of other problems to deal with. Programmers have to decide how to handle voice inconsistency, background noise, similar-sounding words (such as there, their, they’re), and much more. These problems are handled in a number of ways, and the quality of a specific voice recognition system depends on how well this is done. After all of that is taken care of, it’s usually a very complex pattern-matching algorithm that cleans up the errors that would look ridiculous to us.
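To make the "similar-sounding words" problem a bit more concrete, here is a minimal sketch in Python of one way a cleanup step could choose between there, their, and they’re by looking at the neighboring words. This is only an illustration, not how any particular product does it: the bigram counts are made up, and the names `BIGRAM_COUNTS`, `pick_homophone`, and `fix_sentence` are hypothetical.

```python
# Toy bigram counts. These numbers are invented for illustration only;
# a real system would estimate them from a large text corpus.
BIGRAM_COUNTS = {
    ("over", "there"): 50,
    ("over", "their"): 1,
    ("there", "is"): 40,
    ("their", "house"): 30,
    ("they're", "going"): 25,
    ("lost", "their"): 20,
}

# Words the recognizer cannot tell apart by sound alone.
HOMOPHONES = {"there", "their", "they're"}


def score(prev_word, candidate, next_word):
    """Add up how often the candidate was seen next to its neighbors."""
    left = BIGRAM_COUNTS.get((prev_word, candidate), 0)
    right = BIGRAM_COUNTS.get((candidate, next_word), 0)
    return left + right


def pick_homophone(prev_word, next_word):
    """Return the homophone that best fits between the two neighbors.

    Candidates are sorted so that ties are broken the same way every run.
    """
    return max(sorted(HOMOPHONES), key=lambda w: score(prev_word, w, next_word))


def fix_sentence(words):
    """Replace each homophone in the word list with the best-scoring spelling."""
    fixed = list(words)
    for i, word in enumerate(fixed):
        if word in HOMOPHONES:
            prev_word = fixed[i - 1] if i > 0 else None
            next_word = fixed[i + 1] if i + 1 < len(fixed) else None
            fixed[i] = pick_homophone(prev_word, next_word)
    return fixed


print(fix_sentence(["they", "lost", "there", "house"]))
print(fix_sentence(["over", "their", "is"]))
```

With these toy counts, the first sentence ends up with "their" and the second with "there". Real systems use far richer language models, but the idea of letting context settle what the sound alone cannot is the same.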
Specifically for Google Now, a neural network of sorts is used. Basically, it takes speech and then sends what it hears off to different parts of the "neural network" based on what it heard originally. Each level of the network requires more specific information to be sent to it, which reduces errors. The Google Knowledge Graph provides a database of entities that aids in this process; text-to-speech (TTS) is then used to respond in a way that is easier for a human to understand than raw computer output.