Interface lets you point and speak
Technology Research News
One of the reasons speech recognition software
remains inferior to human speech recognition is that computers can't read hands.
Humans convey a surprising amount of information through the gestural
cues that accompany speech. We point things out, convey concepts like
'big' or 'small', get across metaphorical ideas, and provide a sort of
beat that directs conversational flow.
No matter how often or how vigorously you shake your fist at your computer
screen, however, it won't help the computer tune in to your mood.
Researchers from Pennsylvania State University are working on a human-computer
interface that goes a step toward allowing a computer to glean contextual
information from our hands. The software allows a computer to see where
a human is pointing and uses that information to interpret the mixed speech
and gestural directions that are a familiar part of human-to-human communications.
These pointing, or deictic, gestures are commonly mixed with speech when
talking about things like directions, for example, saying "from here to
here," while pointing at a map.
The researchers used Weather Channel video to glean a database of deictic
gestures, which include directly pointing to something, circling an area,
or tracing a contour. "Looking at the weather map we were able to classify
pieces of gestures, then say which pieces we can interpret, and what kind
of gestures would be useful. We came up with algorithms [that] extract
those gestures from just the video," said researcher Sanshzar Kettebekov,
a Pennsylvania State University computer science and engineering graduate student.
The researchers used this database to create a pair of applications designed
for large screens that allow the computer to interpret what people mean
when they use a mix of speech and pointing gestures.
One application, dubbed IMAP, is a campus map that responds to pointing
and spoken queries. "It brings the computer into the loop with the human,"
said Kettebekov. For example, if a person asks the map for a good restaurant
in an area she is circling with her hand, the computer will reply based
on the spoken request for a restaurant and the gestural request for a
location, according to Kettebekov.
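The kind of fusion IMAP performs can be illustrated with a toy sketch (this is not the researchers' code): a spoken keyword supplies the category, the circled region supplies the location, and a point-in-polygon test combines the two. The place names and coordinates below are invented for illustration.

```python
# Illustrative sketch of speech/gesture fusion: filter a hypothetical
# map database by a spoken keyword and a gestured (circled) region.

def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon (list of vertices)?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal ray at y
            if x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                inside = not inside
    return inside

# Hypothetical map database: name, category, map coordinates.
PLACES = [
    ("Corner Diner", "restaurant", (2.0, 2.0)),
    ("Campus Library", "library", (2.5, 2.5)),
    ("Hilltop Grill", "restaurant", (8.0, 8.0)),
]

def answer_query(keyword, circled_region):
    """Return places matching the spoken keyword inside the gestured region."""
    return [name for name, category, (x, y) in PLACES
            if category == keyword and point_in_polygon(x, y, circled_region)]

region = [(0, 0), (5, 0), (5, 5), (0, 5)]  # square standing in for a circled area
print(answer_query("restaurant", region))  # → ['Corner Diner']
```

Neither modality alone resolves the query: the keyword matches two restaurants, and the region contains two places; only their combination picks out one answer.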
The second application is a battlefield planning or city crisis management
simulation that allows a person standing in front of a large screen to
direct vehicles around a battlefield or city. "A person has limited resources
[and there are] alarms going off all over the city. The person is using...
a 50-inch display... to direct the resources to where the alarm is going
[off]," said Kettebekov.
Even though it seems easy to us, giving a computer the ability to sense
and make sense of gestures in a verbal context is a complicated problem
that involves several steps, according to Kettebekov. The computer must
be able to track the user's hands, recognize meaningful gestures, and
interpret those gestures.
The first problem is tracking. "We have a vision algorithm that tracks
a person and tries to follow a person's hand," Kettebekov said. The second
stage is picking out the pointing gestures. "You're trying to delimit
gestures from a continuous stream of frames where the hands are just moving
-- saying 'from here to here was this gesture'," he said. "The third stage
is interpretation when you really associate [the gesture you have isolated]
with parts of speech and try to extract meaning," he said.
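The three stages Kettebekov describes can be sketched in miniature (a hedged illustration, not the Penn State system): tracking yields per-frame hand positions, segmentation delimits gestures as runs of movement, and interpretation associates each gesture with the nearest spoken keyword. The thresholds and data are invented.

```python
# Toy three-stage pipeline: track -> segment -> interpret.
# Stage 1 output is assumed: a list of per-frame (x, y) hand positions.

def speeds(track):
    """Per-frame hand speed (Manhattan distance between successive frames)."""
    return [abs(x2 - x1) + abs(y2 - y1)
            for (x1, y1), (x2, y2) in zip(track, track[1:])]

def segment_gestures(track, threshold=1.0):
    """Stage 2: delimit gestures as runs of frames where the hand is moving."""
    segments, start = [], None
    for i, s in enumerate(speeds(track)):
        if s > threshold and start is None:
            start = i                      # gesture begins
        elif s <= threshold and start is not None:
            segments.append((start, i))    # gesture ends
            start = None
    if start is not None:
        segments.append((start, len(track) - 1))
    return segments

def interpret(segments, keywords):
    """Stage 3: pair each gesture with the keyword whose frame index
    is closest to the gesture's midpoint. keywords: word -> frame index."""
    return {min(keywords, key=lambda w: abs(keywords[w] - (a + b) / 2)): (a, b)
            for a, b in segments}

track = [(0, 0), (0, 0), (2, 0), (4, 0), (6, 0), (6, 0), (6, 0)]
gestures = segment_gestures(track)
print(gestures)                          # → [(1, 4)]
print(interpret(gestures, {"here": 3}))  # → {'here': (1, 4)}
```

The hard part in practice is stage 2: real hand motion has no clean pauses between gestures, which is exactly the delimiting problem Kettebekov describes.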
Multimodal human-computer interaction is an active research topic with
a long history, said Jie Yang, a research scientist at Carnegie Mellon
University. "Coordination of speech and gestures is an old but still open
problem," he said, noting that there was a paper published 20 years ago
on a computer system that integrated speech and gesture, and there have
been many studies on the advantages of using speech and gesture. "Yet,
we cannot naturally interact with a computer using speech and gesture
without constraints today."
When all the difficult computer problems have been worked out, however,
systems that recognize speech and gesture will allow a person to "efficiently
manipulate multimedia information regardless of whether the person is
communicating with a computer or with another human," he said.
The Penn State researchers are working on improving their gesture recognition
algorithms by adding an understanding of the prosodic information that
lends speech its subtle shades of meaning, said Kettebekov. "We're working
on using prosodic information in speech: tone of voice, stresses, pauses...
to improve gesture recognition and interpretation," he said.
The toughest of the three gesture problems is improving gesture recognition,
said Kettebekov. Currently the system identifies keywords and tries to
correlate them with gestures. Adding prosodic information would help the
system to both recognize gestures and interpret them, he said.
For example, when a TV meteorologist wants to emphasize a keyword, he
raises the tone of his voice, said Kettebekov. "If I want you to pay attention
I not only point, but my voice would change so that I would attract more
attention to that concrete point," he said. "You can extract those most
prominent parts of speech, and those parts of speech nicely relate with
the gestures -- in this case it was pointing," he said.
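The alignment Kettebekov describes can be sketched as follows (an invented illustration, not the researchers' method): given time-stamped words with pitch values, pick the prosodically prominent word nearest a detected pointing gesture. The forecast data and numbers are hypothetical.

```python
# Hedged sketch: align the highest-pitch (most prominent) word
# with a pointing gesture's stroke time.

def prominent_word(words, gesture_time, window=1.0):
    """words: list of (word, time_sec, pitch_hz). Return the highest-pitch
    word spoken within `window` seconds of the gesture, or None."""
    nearby = [(w, t, p) for w, t, p in words if abs(t - gesture_time) <= window]
    if not nearby:
        return None
    return max(nearby, key=lambda wtp: wtp[2])[0]

# Invented weathercast fragment: the speaker raises pitch on "HERE"
# while pointing at the map.
forecast = [("cold", 0.2, 180.0), ("front", 0.5, 190.0),
            ("HERE", 1.1, 260.0), ("tonight", 1.8, 175.0)]
print(prominent_word(forecast, gesture_time=1.0))  # → 'HERE'
```

The pitch peak singles out the word the gesture accompanies, which is the extra cue prosody would give the system's keyword-gesture correlation.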
The researchers may eventually turn their sights to iconic, metaphoric
and beat gestural information, but there is a lot of work to be done in
the deictic area first, said Kettebekov. In addition, understanding what
these subtler gestures mean from a linguistics point of view "is not there
yet -- so there's not enough theoretical basis," to use to give that understanding
to computers, he said.
Kettebekov's research colleague was Rajeev Sharma of Pennsylvania State
University. They presented the research at the Engineering for Human-Computer
Interaction conference in Toronto in May, 2001. The research was funded
by the Army Research Laboratory and the National Science Foundation (NSF).
TRN Categories: Human-Computer Interaction; Computer Vision
and Image Processing
Story Type: News
Related Elements: Technical paper, "Toward Natural Gesture/Speech
Control of a Large Display," presented at the Engineering for Human-Computer
Interaction conference in Toronto, May 11-14, 2001.
July 25, 2001