choices key for speech software
Technology Research News
One broad trend in getting humans to interact
more easily with computers is multimodal input -- essentially giving us
the breadth of choices we are used to rather than restricting communications
to tapping on keys and positioning a cursor.
Researchers from Carnegie Mellon University have found that giving users
who are not experts several ways to correct speech recognition errors
makes them considerably more successful in using speech recognition software,
substantially lowering the method's steep learning curve.
This is true even though some ways of correcting are clearly more efficient than others.
"The general goal of the research was to see how much better one could
do using multimodal correction," said Brad Myers, a senior research scientist
at Carnegie Mellon University.
The results also point out the importance of being able to quickly and
easily correct the words the computer mishears, a capability that is generally
underrated, said Myers. "Error correction really needs to be one of the
fundamental things you take into account when you're designing a speech
system, because they're never going to work perfectly and the ways that
are available for people to correct the errors have an enormous impact
on their usability," he said.
The researchers tested the abilities of three different types of users
-- novice, average and skilled -- to correct speech-generated text in
three different ways: using only the keyboard, using the keyboard and
speech as in conventional dictation systems, and multimodal correction,
which allowed users to choose among keyboard, speech and handwriting correction.
In general, using speech to correct speech recognition errors allowed
all three types of users to create text faster than using only typing
to correct speech errors, according to the research. There was also a
considerable learning curve, with experts doing much better than average
users, and average users doing much better than novices.
Also, all three types of users dictated somewhat more slowly than is possible
with commercial speech dictation systems because the researchers' system
does not adapt to a user's voice. This decreased dictation accuracy from
90 percent or more to about 75 percent, which also increased the need for correction.
Correcting by typing only, experts produced about 40 words per minute,
average users 30, and novices about 11. Using typing or speech to correct
errors, experts produced about 55 words per minute, average users 40 and
There are three basic ways to correct speech using speech: saying a word
again, choosing from a list of possibilities, and spelling a word. All
of these methods are used by commercial speech recognition systems.
The researchers found that the most instinctive way for humans to correct
mistakes using speech was the computer's worst. "The most obvious way
to correct when the system mishears is to say it again. But it turns out
that most speech systems... actually do much worse when you try that," said Myers.
There are several reasons for this, he said. First, a misheard word is
likely to be a difficult one to understand in the first place. Also, speech
recognition systems are designed to understand a word in context, but
users are more likely to say a single word at a time when correcting a
mistake. "When you say a word in isolation it actually sounds totally
different," said Myers. In addition, when forced to say something over
again, "people tend to hyper-articulate -- it works when you try to make
someone else understand [but it] sounds different," to a computer, said Myers.
The problem with allowing users to correct a word by choosing from a list
of the computer's top 10 or so possibilities is that the correct word
or phrase is often not listed, and when this happens, it slows correction
down considerably, said Myers.
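
The cost of a missing list entry can be illustrated with a short sketch. It is an illustration only, not the researchers' system: the function name `correct_from_list`, the sample candidates and the `fallback` callback (standing in for a switch to spelling, handwriting or typing) are all hypothetical.

```python
from typing import Callable, Sequence, Tuple


def correct_from_list(
    candidates: Sequence[str],
    intended: str,
    fallback: Callable[[], str],
    list_size: int = 10,
) -> Tuple[str, bool]:
    """Try to correct a word by picking it from the recognizer's top guesses.

    Returns (corrected_word, used_fallback). The fallback is only invoked
    when the intended word is not among the first `list_size` candidates,
    which is the slow case described in the article.
    """
    shown = list(candidates[:list_size])
    if intended in shown:
        # Fast path: the user simply selects the word from the list.
        return intended, False
    # Slow path: the list did not help, so another correction method is needed.
    return fallback(), True


# Hypothetical recognizer guesses for a misheard word.
guesses = ["wreck", "rack", "wrack"]
print(correct_from_list(guesses, "rack", fallback=lambda: "rack"))
print(correct_from_list(guesses, "recognize", fallback=lambda: "recognize"))
```

In the second call the intended word is absent from the list, so the user pays for reading the list and then for a second correction attempt.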
There's also a catch to using speech to spell a word in order to correct
it. "The problem with that is [the system] doesn't do a very good job
of recognizing the switch to spelling," said Myers. So, as in today's
commercial speech recognition systems, the researchers had to provide
a command so the users could tell the system that they were spelling,
which slowed the correction down. However, once users learned to make
the switch, spelling was the best of the speech modes of correcting, Myers said.
The researchers found that the experts tended to recognize that spelling
words was the most efficient way to correct, and so did so consistently.
Beginners, however, tended to use the least effective and most frustrating
speech method of saying words over again.
One hypothesis going into the study was that people would eventually try
to pick the technology that worked best, said Myers. "That was only partially
supported" by the research, said Myers, because novices kept trying to
correct an error by repeating the word even after many unsuccessful tries.
This is probably because it works well for communicating with other people
and so is a hard habit to break when talking to a machine, said Myers.
This loop became even less successful with time because "as you get more
emotional your enunciation changes," said Myers.
Giving users the ability to correct using handwriting as well as speech
increased the correction speed of novices and average users considerably.
"Handwriting with a pen-based interface in general worked pretty well," said Myers.
Using the multimodal system, novices' dictation speed nearly doubled to
about 40 words a minute; average users' speed increased slightly to 44
words a minute.
The multimodal system didn't help experts who were already proficient
in the spelling correction method. In fact, it slowed them down a little,
from 55 to 48 words a minute.
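
As a rough check on these figures, the relative change for each group can be worked out from the words-per-minute numbers reported above. The sketch below is an illustration only: the helper `percent_change` is hypothetical, and novices are left out because the article does not give their speed with typing-or-speech correction.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Words per minute as reported in the article:
# (typing-or-speech correction, multimodal correction)
reported = {"average": (40, 44), "expert": (55, 48)}

changes = {
    group: round(percent_change(before, after), 1)
    for group, (before, after) in reported.items()
}
print(changes)  # {'average': 10.0, 'expert': -12.7}
```

On these numbers, average users gained about 10 percent from the multimodal system, while experts lost roughly 13 percent.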
The main message of the study is that the error rate and techniques for
correcting errors have to be taken into account in order to improve usability
of speech systems, Myers said. "Allowing multiple choices for how to correct
errors really makes a big difference in the success of the system," he said.
Speech systems should take into account the ingrained tendency to simply
say a word again when it is not heard correctly, said Myers. "Our recommendation
would be that speech systems... have different language models... in correction
mode, [that account for] hyper articulating and the same words in isolation
as compared to context -- that might make the correction more successful," he said.
This type of research is not only very relevant and applicable to existing
interfaces, but will prove more important in future interfaces, said Matthew
Turk, an associate professor of computer science at the University of
California at Santa Barbara. It's one piece of a larger movement towards
interfaces that accommodate people in all kinds of situations, he said.
As electronics grow even more widespread, these types of interfaces will
ensure that "we're not going to have to spend hours and hours learning
detailed computer systems and different systems for different purposes,"
said Turk. "We can just use whatever we're comfortable with. If someone
does have a particular skill like... good voice input, they can take advantage
of it, but if they don't, they have these other alternatives."
There are no technical barriers to implementing multimodal speech correction
in today's products, said Myers. "It would require some engineering
work to tune the parameters, but there is reasonably good speech, handwriting
and [even] gesture recognition already on the market. [They're] just not
integrated into the same system," he said.
The Carnegie Mellon researchers are currently working to more fully meld
different ways of interacting with a computer -- for instance using a
combination of speech and gesture to evoke a command -- to make communicating
with computers more natural for humans, said Myers. The trick is figuring
out how to get the different input mechanisms, or recognizers, to cooperate
at a more basic level than they usually do, he said.
Myers' research colleagues were Bernhard Suhm, a former Carnegie Mellon
University graduate student who is now at BBN Technologies, and Alex Waibel
of Carnegie Mellon University and Karlsruhe University in Germany. The
researchers published their work in the March 2001 issue of the journal
ACM Transactions on Computer-Human Interaction. The research was funded
by the Defense Advanced Research Projects Agency (DARPA).
TRN Categories: Human-Computer Interaction
Story Type: News
Related Elements: Technical paper "Multimodal Error
Correction for Speech User Interfaces", ACM Transactions on Computer-Human
Interaction, March 2001