Interface gets the point
Technology Research News
Tone of voice can mean a lot. Your colleague
can be giving you a compliment or an insult depending on how she inflects
the phrase "great work." Gestures can be just as expressive.
Communicating with computers is much more basic. Try to insult an uncooperative
speech recognition system by telling it where to go, and, assuming your
diction is clear, it will simply show the words on-screen without gleaning
anything about your dark mood. Adding an appropriate gesture would make
things very clear to even a tone-deaf human, but computers are generally
gesture-blind as well.
Researchers from Pennsylvania State University and Advanced Interface
Technologies are trying to change that. They are working to untangle the
relationships between prosody -- the loudness, pitch, and timing of speech
-- and gestures in an attempt to improve the way computers recognize human
gestures.
The research could eventually be applied to many different situations
where humans try to get information across to computers, including computer
games, surgical applications, crisis management software, and security
systems, according to Rajeev Sharma, an associate professor of computer
science and engineering at Penn State University and president of Advanced
Interface Technologies, Inc.
Although it's child's play for humans, getting a computer to recognize
gestures is difficult, said Sharma. Gestures "do not exhibit one-to-one
mapping of form to meaning," he said. "The same gesture... can exhibit
different meanings when associated with a different spoken context; at
the same time, a number of gesture forms can be used to express the same
[meaning]."
In previous work, the researchers analyzed hours of tape of meteorologists
giving weather forecasts in order to link prosody to gestures.
The researchers increased their understanding of the phenomenon by plugging
speech pitch and hand velocity into a hidden Markov model, a statistical
model that breaks information into very small pieces and makes predictions
about a given piece of information based on what comes before and after it.
Such models are commonly used to predict words in commercial speech
recognition systems.
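As a rough illustration of how such a model can label a sequence, the
sketch below runs the standard Viterbi algorithm over a toy hidden Markov
model. Everything in it is an assumption made for illustration: the three
state names, the four quantized pitch-and-hand-speed symbols, and all of
the probabilities are invented, not taken from the researchers' system.

```python
import math

# Hypothetical hidden states, loosely named after the gesture classes in the article.
STATES = ["auxiliary", "point", "contour"]

# Hypothetical observations: pitch and hand speed, each quantized to two levels.
# Symbols: "lowpitch_slow", "lowpitch_fast", "highpitch_slow", "highpitch_fast".

# All probabilities below are invented for illustration only.
START = {"auxiliary": 0.6, "point": 0.2, "contour": 0.2}
TRANS = {
    "auxiliary": {"auxiliary": 0.6, "point": 0.2, "contour": 0.2},
    "point": {"auxiliary": 0.3, "point": 0.6, "contour": 0.1},
    "contour": {"auxiliary": 0.3, "point": 0.1, "contour": 0.6},
}
EMIT = {
    "auxiliary": {"lowpitch_slow": 0.7, "lowpitch_fast": 0.1,
                  "highpitch_slow": 0.1, "highpitch_fast": 0.1},
    "point": {"lowpitch_slow": 0.1, "lowpitch_fast": 0.6,
              "highpitch_slow": 0.1, "highpitch_fast": 0.2},
    "contour": {"lowpitch_slow": 0.05, "lowpitch_fast": 0.15,
                "highpitch_slow": 0.2, "highpitch_fast": 0.6},
}


def viterbi(observations):
    """Return the most likely hidden-state sequence for the observations."""
    # best[s] holds (log probability of the best path ending in s, that path).
    best = {
        s: (math.log(START[s]) + math.log(EMIT[s][observations[0]]), [s])
        for s in STATES
    }
    for obs in observations[1:]:
        nxt = {}
        for s in STATES:
            # Pick the best predecessor state for s.
            score, path = max(
                (best[p][0] + math.log(TRANS[p][s]), best[p][1]) for p in STATES
            )
            nxt[s] = (score + math.log(EMIT[s][obs]), path + [s])
        best = nxt
    # The highest-scoring final entry carries the full decoded path.
    return max(best.values())[1]
```

Calling viterbi(["lowpitch_slow", "highpitch_fast", "highpitch_fast"]) returns
one state label per observation. Because each label depends on the whole
sequence, neighboring observations influence how any single one is read.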
The researchers used the system to help detect speech segments that commonly
occur along with a particular class of gesture. "We [combined] visual
and speech signals for continuous gesture recognition," said Sharma. "The
basic idea... is to detect emphasized parts of speech and align them with
the velocity of the moving hand."
For instance, a pointing gesture commonly precedes these emphasized segments
of speech, a contour-type gesture is more likely to occur at the same
time as an emphasized speech segment, and auxiliary gestures, which include
preparation and retraction movements, tend not to include emphasized speech
segments at all, according to Sharma.
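The timing pattern Sharma describes can be sketched as a simple rule that
compares a gesture's time span with the emphasized speech segments around
it. This is only an illustration under stated assumptions: the function
name, the half-second max_gap threshold, and the hard-coded mapping to
gesture classes are hypothetical, and a fixed rule stands in for whatever
statistical co-analysis the researchers actually use.

```python
def timing_relation(gesture, emphases, max_gap=0.5):
    """Describe how a gesture interval sits relative to emphasized speech.

    gesture  -- (start, end) of the hand movement, in seconds
    emphases -- list of (start, end) emphasized speech segments, in seconds
    max_gap  -- hypothetical limit on how far ahead of an emphasis a gesture
                may end and still count as preceding it
    """
    g_start, g_end = gesture
    # Overlap takes priority: the gesture and an emphasis share some time.
    for e_start, e_end in emphases:
        if g_start < e_end and e_start < g_end:
            return "overlaps"
    # Otherwise check whether an emphasis begins shortly after the gesture ends.
    for e_start, e_end in emphases:
        if 0 <= e_start - g_end <= max_gap:
            return "precedes"
    return "none"


# Gesture class that each timing relation is most consistent with,
# following the tendencies reported in the article.
LIKELY_CLASS = {
    "precedes": "pointing",
    "overlaps": "contour",
    "none": "auxiliary",
}
```

For example, a hand movement from 1.0 to 1.4 seconds followed by an emphasized
word starting at 1.6 seconds is labeled "precedes", which is consistent with
a pointing gesture.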
The researchers are using the method in a geographical information system
prototype that uses a large screen display, microphones attached to the
ceiling and cameras that track users' gestures.
The state of the art in continuous gesture recognition is still far from
meeting the naturalness criteria of a true multimodal human-computer interface,
said Sharma. Computers have achieved accuracies of up to 95 percent in
interpreting isolated gestures, but recognizing significant gestures from
a full range of movements is much harder, he said.
Taking prosody into consideration when trying to interpret gestures, however,
increased the accuracy of gesture recognition from about 72 percent to
about 84 percent, Sharma said.
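To put those two figures in perspective, the short calculation below
treats the approximate 72 and 84 percent values as exact, which is an
assumption, and works out what the change means for the error rate. The
variable names are made up for this example.

```python
# Approximate accuracies quoted in the article, treated as exact here.
baseline_accuracy = 72  # percent, gestures alone
prosody_accuracy = 84   # percent, gestures plus prosody

# Absolute gain in percentage points.
gain_points = prosody_accuracy - baseline_accuracy

# Error rates are whatever the system got wrong.
baseline_error = 100 - baseline_accuracy
prosody_error = 100 - prosody_accuracy

# Relative reduction in errors, as a percentage of the original error rate.
error_reduction = 100 * (baseline_error - prosody_error) / baseline_error

print(gain_points, baseline_error, prosody_error, round(error_reduction, 1))
```

On these numbers the gain is 12 percentage points, and misrecognized
gestures fall from 28 percent to 16 percent, a reduction of roughly 43
percent.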
One of the challenges of putting together the system was to define when
the visual and audio signals corresponded, said Sharma. "Although speech
and gesture... complement each other, the production of gesture and speech
involve different psychological and neural systems," he said.
Further complicating things, speech contains both phonological information,
the basic sounds that make up words, and intonational characteristics,
such as saying some words louder than others and raising the pitch at the
end of a question. The system had to accurately pick up changes in intonation
amidst the phonological variation in the speech signal, Sharma said.
Modeling and understanding prosody in systems that combine speech and
gesture is important in the long run to help transition from a low-level,
or syntax-based, to a high-level, or semantics-based, understanding of
communication, said Matthew Turk, an associate professor of computer science
at the University of California at Santa Barbara.
The field has applications in "just about every human-computer interaction
scenario, and in many computer-mediated human-to-human communication scenarios
[like] remote meetings," Turk said.
The researchers are currently working on incorporating the prosody-based
framework into a system to manipulate large displays. The researchers'
next step is to run a series of laboratory environment studies to investigate
how it works with real people, according to Sharma.
The researchers are ultimately aiming for an environment where a user
can interact with the gestures he is accustomed to in everyday life rather
than artificially designed gestural signs, said Sharma.
The system could eventually enable more natural human-computer interfaces
in applications like crisis management, surgery and video games, Sharma
said.
Another possibility is using the method in reverse for biometric authentication,
said Sharma. "This research [could] enable a novel way to identify a person
from [a] video sequence... since a multimodal dynamic signal would be
very hard to fake," he said.
Understanding how humans and computers can interact using several different
types of communication will become increasingly important "as we deal
with the need to interact with computing devices... embedded in our environment,"
The first products that incorporate the prosody-based system could be
ready within two years, said Sharma.
Sharma's research colleagues were Sanshzar Kettebekov and Muhammad Yeasin.
The research was funded by the National Science Foundation (NSF) and Advanced
Interface Technologies, Inc.
Timeline: > 2 years
Funding: Corporate, Government
TRN Categories: Human-Computer Interaction
Story Type: News
Related Elements: Technical paper, "Prosody Based Co-Analysis
for Continuous Recognition of Co-Verbal Gestures," posted at the computing
research repository at arXiv.org/archive/cs/intro.html.