Speech recognition to sort Holocaust tapes
By Kimberly Patch, Technology Research News
October 31, 2001
When Steven Spielberg established the Shoah
Foundation to record eyewitness accounts of Holocaust survivors and rescuers
seven years ago, speech recognition software that took dictation was barely
usable.
Now, after videotaping 52,000 eyewitness accounts in 57 countries and
32 languages, the foundation is looking to speech recognition software
-- which has also come a long way in the past seven years -- to help with
the arduous task of indexing the 116,000 hours of interviews.
The foundation is currently indexing the material manually according to
a thesaurus of keywords. "Annotators mark down... codes from the thesaurus
as they watch the interviews," said Bill Byrne, an associate research
professor of electrical and computer engineering at Johns Hopkins University.
The process is very time-consuming: it would take 40 years of 8-hour days
to simply watch the entire collection. "It's also difficult to determine
beforehand how to annotate the data so that subsequent searchers can find
exactly what they're looking for," he said.
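To make the indexing bottleneck concrete, the sketch below shows the kind of
time-coded, thesaurus-keyed record an annotator produces by hand. The schema,
field names and three-word vocabulary are hypothetical illustrations, not the
foundation's actual cataloging system.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    tape_id: str      # which videotaped interview
    start_sec: float  # where in the tape the topic begins
    end_sec: float    # where the topic ends
    code: str         # a controlled keyword from the thesaurus

# A tiny controlled vocabulary standing in for the real thesaurus.
THESAURUS = {"LIBERATION", "HIDING", "RESCUE"}

def annotate(tape_id, start_sec, end_sec, code):
    """Record one manually assigned thesaurus code for a tape segment."""
    if code not in THESAURUS:
        raise ValueError(f"{code!r} is not a thesaurus keyword")
    return Annotation(tape_id, start_sec, end_sec, code)

index = [annotate("tape-00042", 310.0, 475.0, "HIDING")]

# Searchers can only find segments whose topics an annotator thought to
# code -- the limitation Byrne describes.
print([a for a in index if a.code == "HIDING"])
```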
Teams of researchers from IBM, Johns Hopkins University, the University
of Maryland, and the Shoah Foundation will take several approaches over
the next five years in an attempt to automate the process and make the
material more accessible to historians and teachers, said Byrne.
"We hope to be able to use speech recognition and a cross-lingual information
retrieval technique to both speed up the annotation so it will be easier
for the skilled translators to annotate and also to, at some point, make
it possible for people to be able to search these data collections directly
without the need of human annotation at all," said Byrne.
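A rough sketch of that two-stage idea -- noisy automatic transcripts, then
cross-lingual retrieval so a query in one language can find passages spoken in
another -- assuming a toy bilingual lexicon in place of a real translation
module and made-up transcripts in place of recognizer output:

```python
# Toy English-to-Czech lexicon standing in for a real translation module.
EN_TO_CS = {"liberation": "osvobozeni", "camp": "tabor"}

def translate_query(terms, target_lang):
    """Map English query terms into the language of a transcript."""
    if target_lang == "cs":
        return [EN_TO_CS.get(t, t) for t in terms]
    return terms

def search(query_terms, transcripts):
    """Return tapes whose transcript contains any (translated) query term."""
    hits = []
    for t in transcripts:
        terms = translate_query(query_terms, t["lang"])
        if any(term in t["text"].split() for term in terms):
            hits.append(t["tape_id"])
    return hits

# Hypothetical recognizer output in two languages.
transcripts = [
    {"tape_id": "tape-17", "lang": "cs", "text": "den osvobozeni tabora"},
    {"tape_id": "tape-09", "lang": "en", "text": "we heard of the liberation"},
]
print(search(["liberation"], transcripts))  # finds both tapes
```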
Current speech recognition software, which works fairly well for a single
trained user, is still not up to the task of transcribing taped, emotional
testimony from many speakers in many languages. The nature of the job,
however, makes it an excellent research project, said Byrne.
Speech recognition systems work well when people are speaking specifically
to be understood, like dictating directly to a computer or professionally
announcing the news, said Byrne. This is why the real-time transcription
systems used to subtitle news or sports broadcasts in loud bars work fairly
well.
In contrast, in the Shoah Foundation material, "people are speaking to
an interviewer... and their speech is highly emotional and about topics
that are something out of the general realm of experience. They're heavily
accented in the English collections. And the speakers are also elderly.
Children and elderly people [have] a lot more variability in their speech,
[which] makes it hard to recognize as well," he said.
Another challenge is the acoustics. In contrast to newscasts, the videotaping
"was not done in a sound booth... there's just a microphone in the camera
several feet away from the speaker," said Byrne.
It's a difficult project, said Alex Waibel, a professor of computer science
at Carnegie Mellon University. "The biggest challenges are that the recorded
speech is conversational, not read, and therefore presents greater variability,
leading to higher error rates, [it is in] multiple languages, [and it
involves the] expression of emotion, which makes recognition harder."
Usually speech recognition systems address the multiple language problem
by individually training recognizers for each language, said Waibel. The
project is an obvious fit for an alternative approach that has already
shown some promise -- multilingual speech recognition models, he said.
Multilingual models proposed five years ago by Carnegie Mellon and University
of Karlsruhe researcher Tanja Schultz showed that a single multilingual
recognizer can do as well as separate recognizers trained for individual
languages, said Waibel.
The approach, presaged by the fictional Star Trek universal translator, uses
a speech model that "can essentially be used for any new language with little
adaptation data," Waibel said.
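The sketch below illustrates the core of the multilingual idea under stated
assumptions: instead of keeping a separate phone inventory per language,
acoustically similar phones from different languages map onto one pooled set,
so training data from every language updates the same units and a new language
needs only a mapping plus a little adaptation data. The phone symbols and the
mapping are invented for illustration.

```python
# Per-language phones mapped onto a shared, language-independent set.
# (Hypothetical symbols; real inventories are much larger.)
SHARED_PHONES = {
    ("en", "ah"): "A", ("cs", "a"): "A",   # similar vowels share a unit
    ("en", "sh"): "S", ("cs", "s'"): "S",  # similar fricatives share a unit
}

def to_shared(lang, phones):
    """Map a language-specific phone sequence onto the pooled inventory."""
    return [SHARED_PHONES[(lang, p)] for p in phones]

# Training examples from two languages now update the *same* units, which
# is what lets one model cover a new language with little data.
pooled = to_shared("en", ["sh", "ah"]) + to_shared("cs", ["s'", "a"])
print(pooled)  # ['S', 'A', 'S', 'A']
```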
The researchers are taking several different tacks, said Byrne.
The IBM researchers are adapting an English recognition module using 100
hours of tape from the collection that will be transcribed by people,
essentially giving the module the answers for the first 100 hours of speech.
It is not possible to do this much work with each of the 32 languages,
however, so the researchers will next use about 20 hours of transcribed
Czech to adapt the Czech module, said Byrne. "We're going to see if we
can develop techniques that allow us to train systems with much less data,"
he said.
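One standard way to train with much less data is to treat the existing model
as a prior and nudge it toward the small adaptation set, as in maximum a
posteriori (MAP) adaptation. The toy example below shows MAP-style adaptation
of a single Gaussian mean on made-up numbers; it is a sketch of the general
technique, not necessarily the method the teams will use.

```python
import numpy as np

def map_adapt_mean(prior_mean, adaptation_frames, tau=10.0):
    """Shrink the adapted mean toward the prior when data is scarce.

    tau controls how many frames of evidence it takes to move away from
    the prior model -- with 20 hours of transcripts instead of 100, the
    prior carries more weight.
    """
    n = len(adaptation_frames)
    sample_mean = np.mean(adaptation_frames, axis=0)
    return (tau * prior_mean + n * sample_mean) / (tau + n)

prior = np.array([0.0, 0.0])                 # mean from the original model
frames = np.array([[1.0, 2.0], [1.2, 1.8]])  # a little in-domain data
print(map_adapt_mean(prior, frames))         # pulled only partway toward data
```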
Speech recognition is just part of the project, he added. Its goal is
finding information, and the speech recognizers will be embedded in a
much larger search and retrieval system, he said. The idea is to "make
this data usable by historians and educators and teachers... they're going
to want to search through the material to find discussion of certain events
or themes... related to their research or the classroom material," he
said.
The advantage of this goal is "the speech recognition systems don't need
to work perfectly to be useful for searching archives. Returning good
answers to the user's query [is] what we're really after," he said. The
researchers will also concentrate on retrieval lexicons, which are lists
of words used by search engines. "We will try to make sure that we do
a very good job on these words, because these are the words [search engines
are] looking for," he said.
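A minimal sketch of why imperfect transcripts can still support search: an
inverted index built over noisy recognizer output ranks tapes by how often
the query words -- the retrieval-lexicon terms -- were recognized, so an
occasional misrecognition costs a single hit rather than breaking retrieval.
The transcripts and tape names below are made up.

```python
from collections import defaultdict

def build_index(transcripts):
    """Map each recognized word to the tapes it appears in, with counts."""
    index = defaultdict(lambda: defaultdict(int))
    for tape_id, text in transcripts.items():
        for word in text.split():
            index[word][tape_id] += 1
    return index

def rank(index, query_words):
    """Score tapes by how many query-word occurrences they contain."""
    scores = defaultdict(int)
    for w in query_words:
        for tape_id, count in index[w].items():
            scores[tape_id] += count
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Even with recognition errors ("libration"), repeated correct hits win.
transcripts = {
    "tape-03": "the liberation of the camp liberation day",
    "tape-11": "libration of the town",  # a misrecognized token is just a miss
}
index = build_index(transcripts)
print(rank(index, ["liberation"]))  # tape-03 ranks first
```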
Under the general plan, the Maryland researchers will work on information
retrieval and interaction with users; the Johns Hopkins researchers will work
on speech recognition and the problems of working with multiple languages;
the IBM researchers will focus on transcribing English; and the Shoah
Foundation researchers will concentrate on cataloging and adapting the
approaches to their specific needs, said Byrne. "All these efforts fit
together tightly."
The project is scheduled to last five years. Improved access to the Shoah
Foundation archives is likely to be available sooner, said Byrne. "We
could start seeing initial results from the effect of our work within
a year or so," he said.
Byrne's research colleagues are Frederick Jelinek, Sanjeev Khudanpur and
David Yarowsky from Johns Hopkins University; Douglas Oard, Bruce Dearstyne,
David Doermann, Bonnie Dorr, Philip Resnik and Dagobert Soergel of the
University of Maryland; Bhuvana Ramabhadran and Michael Picheny from IBM
T. J. Watson Research; and Sam Gustman, Douglas Greenberg and Ella Thompson
of the Survivors of the Shoah Visual History Foundation. The research
is funded by the National Science Foundation (NSF).
Timeline: 5 years
Funding: Government
TRN Categories: Human-Computer Interaction
Story Type: News
Related Elements: None.