Accurate lip reading is notoriously hard under ordinary circumstances and even harder in noisy conditions. But one team of researchers is putting artificial intelligence and machine learning to work to improve accuracy, in the hope that it will one day benefit forensic experts and the hearing-impaired.
Ordinary people lip read all the time in noisy conditions, pairing audible words with lip movements. But their lip reading “is really quite unreliable,” said Richard Harvey of Britain’s University of East Anglia (UEA), whose team is one of the few groups around the world trying to teach a deep neural network to interpret visual features more accurately.
Until recently, improving the accuracy of automatic lip reading was deemed a hopeless endeavor. Even when that research began nearly a decade ago, Harvey doubted it would yield better accuracy than current estimates of 30 to 40 percent, given numerous hurdles. One is the significant “gap in performance” between automatic lip reading as done by computers and speech recognition as done by people in noisy conditions.
UEA’s system watches people’s lips to determine what they are saying against a noisy backdrop, relying on visual features rather than acoustic signals. The raw video is converted into sets of numbers called “features” that are easier to learn from, and these are fed to the computer so that it can work out which words the features correspond to.
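The general idea — reduce each video clip to a feature vector, then learn which word each vector corresponds to — can be illustrated with a deliberately tiny sketch. Everything here is invented for illustration (the brightness-based features, the nearest-centroid classifier, the function names); the UEA system’s real features and neural-network model are far more sophisticated.

```python
import statistics

def extract_features(frames):
    """Collapse a clip (a list of frames, each a grid of pixel values) into a
    two-number feature vector: the mean and spread of per-frame brightness."""
    per_frame = [sum(sum(row) for row in f) / (len(f) * len(f[0])) for f in frames]
    return (statistics.mean(per_frame), statistics.pstdev(per_frame))

def train(clips, labels):
    """Toy 'model': the average feature vector for each word label."""
    model = {}
    for word in set(labels):
        feats = [extract_features(c) for c, w in zip(clips, labels) if w == word]
        model[word] = tuple(sum(v) / len(v) for v in zip(*feats))
    return model

def predict(model, clip):
    """Guess the word whose stored features lie closest to the clip's."""
    fx, fy = extract_features(clip)
    return min(model, key=lambda w: (model[w][0] - fx) ** 2 + (model[w][1] - fy) ** 2)
```

The point of the sketch is the pipeline shape, not the features themselves: raw pixels go in, a compact numeric summary comes out, and the classifier only ever sees the summary.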
Using a 1,000-word vocabulary, the UEA team trained and tested on two separate sets of speakers to keep test and training data apart. The experiment yielded 60 percent word accuracy, meaning a 40 percent word error rate.
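The evaluation protocol described above — disjoint speaker sets and a word-accuracy score — can be sketched in a few lines. The data layout and speaker names are invented, and this version aligns words by position, whereas real speech benchmarks align by edit distance, so treat it as a simplification.

```python
def split_by_speaker(samples, test_speakers):
    """Keep training and test data disjoint by speaker, so the model is
    never scored on a speaker it has already seen."""
    train = [s for s in samples if s["speaker"] not in test_speakers]
    test = [s for s in samples if s["speaker"] in test_speakers]
    return train, test

def word_accuracy(references, hypotheses):
    """Fraction of test words recognized correctly (position-aligned)."""
    correct = sum(r == h for r, h in zip(references, hypotheses))
    return correct / len(references)
```

Under this scoring, 60 percent word accuracy simply means six of every ten reference words were recognized correctly.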
“The user would find that objectionable,” said Harvey. “It’s far too many errors. But that’s not bad at all. And when we compared that to professional forensic lip readers tested on the same sequence – we tested six such people; they are very difficult to find – we beat five of them. I was very surprised by that result because these are professional forensic lip readers.”
The results are still better than current accuracy rates, which make lip-reading evidence inadmissible in court in most cases.
“I have not seen an efficacy of lip reading to a point where anyone could claim it is greater than 50 percent,” said Steven Becker, photogrammetry, imagery analysis and acoustics expert at Pennsylvania’s Robson Forensic.
Noise, a beard or mustache, or an object blocking the view can get in the way of both acoustic speech recognition and visual lip reading. A machine, Becker added, “may be faster at processing the mouth movements and able to capture more changes.” But he cautioned that mouth movements do not represent all speech.
Ruth Campbell, a retired experimental psychologist and neuropsychologist, also believes computer engineering advances might make automatic lip reading work “quite well.”
“But I suspect that it will need powerful mixed models, combining a range of different approaches to make a machine that can lip read accurately a reality,” she said in an email. “I don’t think we yet have … machines that can accurately decode very noisy auditory speech – and lip reading is analogous to that.”
She said that would be a notable advance for forensic lip readers “because it could be claimed that the machine interpretation is likely to be less prone to [human] confirmation error.”
The improved accuracy might also help people with hearing impairments determine the context of a conversation. And instead of a person having to remember the visual sequences needed to lip read, the computer can do that for them.
“The automated AI system would not have that problem,” said Becker, “as the video would convert the movements to dimensions and the sequence would be digitized. There are [fewer] losses that way, but the AI system would need to make speech rules for … words that do not fit.”