Correction choices key for speech software

By Kimberly Patch, Technology Research News

One broad trend in getting humans to interact more easily with computers is multimodal input -- essentially giving us the breadth of choices we are used to rather than restricting communications to tapping on keys and positioning a cursor.

Researchers from Carnegie Mellon University have found that giving users who are not experts several ways to correct speech recognition errors makes them considerably more successful in using speech recognition software, substantially lowering the method's steep learning curve.

This is true even though some ways of correcting are clearly more efficient than others.

"The general goal of the research was to see how much better one could do using multimodal correction," said Brad Myers, a senior research scientist at Carnegie Mellon University.

The results also point out the importance of being able to quickly and easily correct the words the computer mishears, a capability that is generally underrated, said Myers. "Error correction really needs to be one of the fundamental things you take into account when you're designing a speech system, because they're never going to work perfectly and the ways that are available for people to correct the errors have an enormous impact on their usability," he said.

The researchers tested the abilities of three different types of users -- novice, average and skilled -- to correct speech-generated text in three different ways: using only the keyboard, using the keyboard and speech as in conventional dictation systems, and multimodal correction, which allowed users to choose among keyboard, speech and handwriting correction methods.

In general, using speech to correct speech recognition errors allowed all three types of users to create text faster than using only typing to correct speech errors, according to the research. There was also a considerable learning curve, with experts doing much better than average users, and average users doing much better than novices.

Also, all three types of users dictated somewhat more slowly than is possible with commercial speech dictation systems because the researchers' system does not adapt to a user's voice. This lowered dictation accuracy from 90 percent or more to about 75 percent, which in turn increased the need to correct.

Correcting by typing only, experts produced about 40 words per minute, average users 30, and novices about 11. Using typing or speech to correct errors, experts produced about 55 words per minute, average users 40 and novices 22.

There are three basic ways to correct speech using speech: saying a word again, choosing from a list of possibilities, and spelling a word. All of these methods are used by commercial speech recognition systems.

The researchers found that the most instinctive way for humans to correct mistakes using speech was the computer's worst. "The most obvious way to correct when the system mishears is to say it again. But it turns out that most speech systems... actually do much worse when you try that," said Myers.

There are several reasons for this, he said. First, a misheard word is likely to be a difficult one to understand in the first place. Also, speech recognition systems are designed to understand a word in context, but users are more likely to say a single word at a time when correcting a mistake. "When you say a word in isolation it actually sounds totally different," said Myers. In addition, when forced to say something over again, "people tend to hyper-articulate -- it works when you try to make someone else understand [but it] sounds different" to a computer, he said.

The problem with allowing users to correct a word by choosing from a list of the computer's top 10 or so possibilities is that the correct word or phrase is often not listed, and when this happens, it slows correction down considerably, said Myers.

There's also a catch to using speech to spell a word in order to correct it. "The problem with that is [the system] doesn't do a very good job of recognizing the switch to spelling," said Myers. So, as in today's commercial speech recognition systems, the researchers had to provide a command so the users could tell the system that they were spelling, which slowed the correction down. However, once users learned to make the switch, spelling was the best of the speech modes of correcting, Myers said.

The researchers found that the experts tended to recognize that spelling words was the most efficient way to correct, and used it consistently. Beginners, however, tended to use the least effective and most frustrating speech method: saying words over again.

One hypothesis going into the study was that people would eventually try to pick the technology that worked best, said Myers. "That was only partially supported" by the research, he said, because novices kept trying to correct an error by repeating the word even after many unsuccessful tries. This is probably because repetition works well for communicating with other people and so is a hard habit to break when talking to a machine, he said.

This loop became even less successful with time because "as you get more emotional your enunciation changes," said Myers.

Giving users the ability to correct using handwriting as well as speech increased the correction speed of novices and average users considerably. "Handwriting with a pen-based interface in general worked pretty well," Myers said.

Using the multimodal system, novices' dictation speed nearly doubled to about 40 words a minute; average users' speed increased slightly to 44 words a minute.

The multimodal system didn't help experts who were already proficient in the spelling correction method. In fact, it slowed them down a little, from 55 to 48 words a minute.
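The reported rates can be compared side by side with a little arithmetic. The following is a minimal sketch that uses only the approximate words-per-minute figures quoted in this article; the table layout and the function names (percent_change, multimodal_gain) are illustrative assumptions, not anything taken from the researchers' paper.

```python
# Approximate words per minute as quoted in the article, by correction
# method and user type. The dictionary layout is an illustrative assumption.
RATES = {
    "keyboard only": {"novice": 11, "average": 30, "expert": 40},
    "keyboard + speech": {"novice": 22, "average": 40, "expert": 55},
    "multimodal": {"novice": 40, "average": 44, "expert": 48},
}


def percent_change(before, after):
    """Relative change from before to after, in percent, to one decimal."""
    return round(100 * (after - before) / before, 1)


def multimodal_gain(user):
    """Change in speed when moving from keyboard + speech to multimodal."""
    return percent_change(
        RATES["keyboard + speech"][user], RATES["multimodal"][user]
    )


if __name__ == "__main__":
    for user in ("novice", "average", "expert"):
        print(f"{user}: {multimodal_gain(user):+.1f}%")
```

On these figures, multimodal correction is a large gain for novices, a modest one for average users, and a small loss for experts, which matches the pattern the researchers describe.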

The main message of the study is that the error rate and techniques for correcting errors have to be taken into account in order to improve usability of speech systems, Myers said. "Allowing multiple choices for how to correct errors really makes a big difference in the success of the system," he said.

Speech systems should take into account the ingrained tendency to simply say a word again when it is not heard correctly, said Myers. "Our recommendation would be that speech systems... have different language models... in correction mode, [that account for] hyper articulating and the same words in isolation as compared to context -- that might make the correction more successful," said Myers.

This type of research is not only very relevant and applicable to existing interfaces, but will prove more important in future interfaces, said Matthew Turk, an associate professor of computer science at the University of California at Santa Barbara. It's one piece of a larger movement towards interfaces that accommodate people in all kinds of situations, he said.

As electronics grow even more widespread, these types of interfaces will ensure that "we're not going to have to spend hours and hours learning detailed computer systems and different systems for different purposes," said Turk. "We can just use whatever we're comfortable with. If someone does have a particular skill like... good voice input, they can take advantage of it, but if they don't, they have these other alternatives."

There are no technical barriers to implementing multimodal speech correction in today's products, said Myers. "It would require some engineering work to tune the parameters, but there is reasonably good speech, handwriting and [even] gesture recognition already on the market. [They're] just not integrated into the same system," he said.

The Carnegie Mellon researchers are currently working to more fully meld different ways of interacting with a computer -- for instance using a combination of speech and gesture to evoke a command -- to make communicating with computers more natural for humans, said Myers. The trick is figuring out how to get the different input mechanisms, or recognizers, to cooperate at a more basic level than they usually do, he said.

Myers' research colleagues were Bernhard Suhm, a former Carnegie Mellon University graduate student who is now at BBN Technologies, and Alex Waibel of Carnegie Mellon University and Karlsruhe University in Germany. The researchers published their work in the March 2001 issue of the journal ACM Transactions on Computer-Human Interaction. The research was funded by the Defense Advanced Research Projects Agency (DARPA).

Timeline:   Now
Funding:   Government
TRN Categories:  Human-Computer Interaction
Story Type:   News
Related Elements:  Technical paper "Multimodal Error Correction for Speech User Interfaces", ACM Transactions on Computer-Human Interaction, March 2001


September 5, 2001

© Copyright Technology Research News, LLC 2000-2006. All rights reserved.