Interface gets the point

By Kimberly Patch, Technology Research News

Tone of voice can mean a lot. Your colleague can be giving you a compliment or an insult depending on how she inflects the phrase "great work." Gestures can be just as expressive.

Communicating with computers is much more basic. Try to insult an uncooperative speech recognition system by telling it where to go, and, assuming your diction is clear, it will simply show the words on-screen without gleaning anything about your dark mood. Adding an appropriate gesture would make things very clear to even a tone-deaf human, but computers are generally gesture-blind as well.

Researchers from Pennsylvania State University and Advanced Interface Technologies are trying to change that. They are working to untangle the relationships between prosody -- the loudness, pitch, and timing of speech -- and gestures in an attempt to improve the way computers recognize human gestures.

The research could eventually be applied to many different situations where humans try to get information across to computers, including computer games, surgical applications, crisis management software, and security systems, according to Rajeev Sharma, an associate professor of computer science and engineering at Penn State University and president of Advanced Interface Technologies, Inc.

Although it's child's play for humans, getting a computer to recognize gestures is difficult, said Sharma. Gestures "do not exhibit one-to-one mapping of form to meaning," he said. "The same gesture... can exhibit different meanings when associated with a different spoken context; at the same time, a number of gesture forms can be used to express the same meaning."

In previous work, the researchers analyzed hours of tape of meteorologists giving weather forecasts in order to link prosody to gestures.

The researchers increased their understanding of the phenomenon by plugging speech pitch and hand velocity into a hidden Markov model, which breaks information into very small pieces and makes predictions about a given piece of information based on what comes before and after it. The model is commonly used to predict words in commercial speech recognition systems.
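The researchers' actual model and parameters are not published in the article, but the idea can be sketched with a toy hidden Markov model: the hidden states stand for gesture phases, the observations are coarsely discretized pitch-and-velocity readings, and Viterbi decoding recovers the most likely state sequence. The state names, observation symbols, and probabilities below are all illustrative placeholders, not the researchers' values.

```python
import math

# Toy HMM: hidden gesture phases, observed pitch/velocity symbols.
# All states, symbols, and probabilities are illustrative placeholders.
states = ["rest", "stroke"]
start = {"rest": 0.8, "stroke": 0.2}
trans = {"rest": {"rest": 0.7, "stroke": 0.3},
         "stroke": {"rest": 0.4, "stroke": 0.6}}
# "lo" = low pitch and slow hand, "hi" = raised pitch and fast hand.
emit = {"rest": {"lo": 0.9, "hi": 0.1},
        "stroke": {"lo": 0.2, "hi": 0.8}}

def viterbi(obs):
    """Return the most likely hidden-state path for an observation sequence."""
    # Log-probabilities avoid numerical underflow on long sequences.
    v = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: v[-1][p] + math.log(trans[p][s]))
            col[s] = v[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][o])
            ptr[s] = prev
        v.append(col)
        back.append(ptr)
    # Trace the best path backward from the most likely final state.
    path = [max(states, key=lambda s: v[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["lo", "lo", "hi", "hi", "lo"]))
# → ['rest', 'rest', 'stroke', 'stroke', 'rest']
```

The decoder labels the two high-pitch, fast-hand frames as a gesture stroke even though each frame is judged in the context of its neighbors, which is the property the article alludes to when it says the model predicts a piece of information from what comes before and after it.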

The researchers used the system to help detect speech segments that commonly occur along with a particular class of gesture. "We [combined] visual and speech signals for continuous gesture recognition," said Sharma. "The basic idea... is to detect emphasized parts of speech and align them with the velocity of the moving hand."
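Sharma's description — detect emphasized speech, then align it with hand velocity — can be sketched with two small functions: one that flags frames where pitch rises well above its running average, and one that reports the offset between the hand-speed peak and the nearest emphasized frame. The z-score threshold and the frame-based timeline are assumptions for illustration, not the researchers' actual detector.

```python
def emphasized_frames(pitch, threshold=1.0):
    """Frames whose pitch exceeds the sequence mean by > threshold standard
    deviations -- a crude stand-in for a prosodic-emphasis detector."""
    mean = sum(pitch) / len(pitch)
    std = (sum((p - mean) ** 2 for p in pitch) / len(pitch)) ** 0.5
    return [i for i, p in enumerate(pitch) if p - mean > threshold * std]

def align(pitch, speed, threshold=1.0):
    """Offset in frames between the hand-speed peak and the nearest
    emphasized speech frame; negative means the hand peaked first."""
    emph = emphasized_frames(pitch, threshold)
    if not emph:
        return None
    peak = max(range(len(speed)), key=speed.__getitem__)
    nearest = min(emph, key=lambda i: abs(i - peak))
    return peak - nearest

# The hand-speed peak lands one frame before the pitch emphasis:
print(align([100, 100, 100, 180, 100, 100], [0, 0.2, 0.9, 0.4, 0.1, 0]))
# → -1
```

A negative offset like this — hand motion peaking just before the emphasized syllable — is exactly the pattern the next paragraph attributes to pointing gestures.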

For instance, a pointing gesture commonly precedes these emphasized segments of speech, a contour-type gesture is more likely to occur at the same time as an emphasized speech segment, and auxiliary gestures, which include preparation and retraction movements, tend not to include emphasized speech segments at all, according to Sharma.
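Those three co-occurrence patterns suggest a simple rule-based labeler: a gesture ending just before an emphasized segment looks like a point, one overlapping it looks like a contour, and one with no emphasized speech nearby looks auxiliary. The span representation and the two-frame lead window below are illustrative assumptions, not the researchers' classifier.

```python
def label_gesture(gesture_span, emphasis_spans, lead=2):
    """Label a gesture by its timing relative to emphasized speech spans,
    following the co-occurrence patterns Sharma describes. Spans are
    (start, end) frame pairs; the lead window is an illustrative choice."""
    g_start, g_end = gesture_span
    for e_start, e_end in emphasis_spans:
        if 0 <= e_start - g_end <= lead:
            return "point"      # gesture ends just before emphasized speech
        if g_start < e_end and e_start < g_end:
            return "contour"    # gesture overlaps emphasized speech
    return "auxiliary"          # no emphasis: preparation or retraction
```

For example, a gesture spanning frames 0-3 followed by emphasis at frames 4-6 is labeled a point, while a gesture spanning 4-7 that overlaps emphasis at 5-6 is labeled a contour.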

The researchers are using the method in a geographical information system prototype that uses a large-screen display, microphones attached to the ceiling, and cameras that track users' gestures.

The state-of-the-art in continuous gesture recognition is still far from meeting the naturalness criteria of a true multimodal human-computer interface, said Sharma. Computers have achieved accuracies of up to 95 percent in interpreting isolated gestures, but recognizing significant gestures from a full range of movements is much harder, he said.

Taking into consideration prosody when trying to interpret gestures, however, increased the accuracy of gesture recognition from about 72 percent to about 84 percent, Sharma said.

One of the challenges of putting together the system was to define when the visual and audio signals corresponded, said Sharma. "Although speech and gesture... complement each other, the production of gesture and speech involve different psychological and neural systems," he said.

Further complicating things, speech contains both phonological information -- the basic sounds that make up words -- and intonational characteristics, such as making some words louder than others or raising the pitch at the end of a question. The system had to accurately pick up changes in intonation amidst the phonological variation in the speech signal, Sharma said.

Modeling and understanding prosody in systems that combine speech and gesture is important in the long run to help transition from a low-level, or syntax-based, to a high-level, or semantics-based understanding of communication, said Matthew Turk, an associate professor of computer science at the University of California at Santa Barbara.

The field has applications in "just about every human-computer interaction scenario, and in many computer-mediated human-to-human communication scenarios [like] remote meetings," Turk said.

The researchers are currently working on incorporating the prosody-based framework into a system to manipulate large displays. The researchers' next step is to run a series of laboratory environment studies to investigate how it works with real people, according to Sharma.

The researchers are ultimately aiming for an environment where a user can interact using the gestures he or she is accustomed to in everyday life rather than artificially designed gestural signs, said Sharma.

The system could eventually enable more natural human-computer interfaces in applications like crisis management, surgery and video games, Sharma said.

Another possibility is using the method in reverse for biometric authentication, said Sharma. "This research [could] enable a novel way to identify a person from [a] video sequence... since a multimodal dynamic signal would be very hard to fake," he said.

Understanding how humans and computers can interact using several different types of communication will become increasingly important "as we deal with the need to interact with computing devices... embedded in our environment," said Sharma.

The first products that incorporate the prosody-based system could be ready within two years, said Sharma.

Sharma's research colleagues were Sanshzar Kettebekov and Muhammad Yeasin. The research was funded by the National Science Foundation (NSF) and Advanced Interface Technologies, Inc.

Timeline:  > 2 years
Funding:   Corporate, Government
TRN Categories:   Human-Computer Interaction
Story Type:   News
Related Elements:  Technical paper, "Prosody Based Co-Analysis for Continuous Recognition of Co-Verbal Gestures," posted at the Computing Research Repository


January 1/8, 2003


© Copyright Technology Research News, LLC 2000-2006. All rights reserved.