Summarizer gets the idea
Technology Research News
The flow of a document, including the topics
covered and the ways those topics relate to each other, is clear to people.
It would be useful if computer systems that process documents -- like
search engines and programs that generate summaries of news articles --
could also learn to consider topic information.
Teaching a computer to discern a document's topics and create
a summary that puts the topics in the correct order is a bit like teaching
it how to put together the pieces of a jigsaw puzzle. Current methods
focus on finding the right match for a given piece.
Researchers from the Massachusetts Institute of Technology and
Cornell University have developed a system that does the equivalent of
putting pieces that show parts of a mountain and pieces that show parts
of the sky into separate groups, and putting the sky pieces above the
mountain pieces, said Lillian Lee, an associate professor of computer
science at Cornell University.
The researchers' automatic classification algorithm, or content
model, is trained on subject-specific sets of documents and document summaries.
It can then extract the topic structure of a group of related documents.
The system selects and orders topics to generate a summary.
The researchers put together a prototype system that can automatically
create capsule summaries of, for example, movies from a movie information
database. Once the content model is trained on movie reviews, the system
can determine appropriate ways to present the information, said Lee.
The content model could eventually be used to make search engines
more precise, said Lee. Today's search engines "don't take the internal
topic structure into account in any but a very coarse way," she said.
The researchers' system would allow a search engine to determine
the overall topic and domain of discourse of a Web page, call up the appropriate
content model to analyze the page's topic structure, and then return only
on-topic pages, said Lee. It could also allow a search engine to present
the user with just those parts of a document that were relevant to a query.
The researchers' content model algorithm is based on the hidden
Markov model, a method commonly used to delineate words in speech recognition
programs and genes in computational biology.
A set of movie reviews, for example, usually contains several
common topics: director, plot, actors, previous movies by the same director,
and the reviewer's opinion of the movie, said Lee. The reviewer chooses
an order in which to present some or all of the topics, she said. For
example, the reviewer might begin by giving an overall opinion about the
plot before discussing the director.
The hidden Markov model can specify mathematically that a likely
sequence of topics within a review is opinion/plot/director/director's
previous films/opinion rather than actors/opinion/director's previous
films/director/actors/plot, said Lee. There are also techniques that allow
systems to automatically learn the relevant probabilities just from examining
samples of sequences, she said.
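The idea can be sketched with a tiny Markov chain over topic states. The topics come from the movie-review example above, but the transition probabilities here are invented for illustration; they are not the values the researchers' system learned.

```python
# Sketch: scoring topic orderings with a Markov chain. States are
# topics; transition probabilities (assumed values, for illustration
# only) make some orderings far more likely than others.

# Assumed transition probabilities P(next_topic | current_topic).
TRANSITIONS = {
    "opinion":        {"plot": 0.5, "director": 0.2, "actors": 0.2,
                       "opinion": 0.05, "previous_films": 0.05},
    "plot":           {"director": 0.4, "actors": 0.3, "opinion": 0.2,
                       "plot": 0.05, "previous_films": 0.05},
    "director":       {"previous_films": 0.5, "actors": 0.2, "opinion": 0.2,
                       "plot": 0.05, "director": 0.05},
    "previous_films": {"opinion": 0.5, "actors": 0.2, "plot": 0.2,
                       "director": 0.05, "previous_films": 0.05},
    "actors":         {"opinion": 0.4, "plot": 0.3, "director": 0.2,
                       "actors": 0.05, "previous_films": 0.05},
}
# Assumed probability of each topic opening a review.
START = {"opinion": 0.5, "plot": 0.2, "director": 0.15,
         "actors": 0.1, "previous_films": 0.05}

def sequence_probability(topics):
    """Probability of a whole topic ordering under the Markov chain."""
    prob = START[topics[0]]
    for prev, cur in zip(topics, topics[1:]):
        prob *= TRANSITIONS[prev][cur]
    return prob

likely = ["opinion", "plot", "director", "previous_films", "opinion"]
unlikely = ["actors", "opinion", "director", "previous_films",
            "director", "actors", "plot"]
print(sequence_probability(likely) > sequence_probability(unlikely))  # True
```

With these assumed numbers, the natural opinion/plot/director/previous-films/opinion ordering scores orders of magnitude higher than the scrambled one, which is exactly the distinction the model exploits.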
The researchers adapted standard hidden Markov model techniques
in several ways, said Lee. "We did not want to specify the set of topics
ahead of time, but rather wanted the system to automatically decide on
a set of topics itself," she said. The system clusters sentences that
have similar patterns, then treats the clusters as representations of topics.
This is useful because it is automatic and because computers can
pick up subtle patterns in documents that humans are not consciously aware of.
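A minimal sketch of that clustering step, under simplifying assumptions (bag-of-words cosine similarity and a greedy single-link pass, not necessarily the researchers' actual algorithm): sentences that share vocabulary land in the same cluster, and each cluster then stands in for one topic state.

```python
# Sketch: inducing topics by clustering sentences with similar wording.
# Assumed mechanics: bag-of-words cosine similarity, greedy single-link
# assignment with a hand-picked threshold.

from collections import Counter
import math

def bow(sentence):
    """Bag-of-words representation of a sentence."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster(sentences, threshold=0.3):
    """Join each sentence to the first cluster containing a
    sufficiently similar sentence; otherwise start a new cluster."""
    clusters = []  # each cluster is a list of (sentence, bag) pairs
    for s in sentences:
        b = bow(s)
        for c in clusters:
            if max(cosine(b, cb) for _, cb in c) >= threshold:
                c.append((s, b))
                break
        else:
            clusters.append([(s, b)])
    return [[s for s, _ in c] for c in clusters]

reviews = [
    "The director shapes the film",
    "The director loves the film",
    "A twisting plot with rivals",
    "This plot has twisting turns",
]
for topic_id, members in enumerate(cluster(reviews)):
    print(topic_id, members)
```

On this toy input the director sentences and the plot sentences fall into two separate clusters, which the model would then treat as two topics.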
The tricky part, however, is dealing with digression. "We humans
understand the phenomenon of digression, [but] digressions can really
confuse computers, [which] rely on statistical regularities," said Lee.
"The computer sees complete chaos and doesn't understand the meta-pattern
of off-topic commentary."
To deal with digression, the researchers incorporated a mathematical
model of previously unseen topics.
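One simple way to picture such a model (a toy version under assumed mechanics, not the paper's exact formulation) is a catch-all off-topic state: a sentence is assigned to a learned topic only if some topic explains it better than a near-uniform background distribution, so digressions do not pollute the topic clusters.

```python
# Toy illustration of a catch-all "digression" state. The topic
# language models, smoothing constant, and vocabulary size below are
# all assumed values for illustration.

import math

VOCAB_SIZE = 10_000  # assumed vocabulary size for the uniform background

# Assumed per-topic unigram language models (tiny, for illustration).
TOPIC_MODELS = {
    "plot":     {"plot": 0.2, "story": 0.15, "twist": 0.1},
    "director": {"director": 0.25, "directed": 0.1, "film": 0.1},
}
DEFAULT = 1e-6  # smoothing for words a topic model has never seen

def log_likelihood(words, model):
    return sum(math.log(model.get(w, DEFAULT)) for w in words)

def assign(sentence):
    """Pick the best-fitting topic, or 'digression' if even the best
    topic explains the sentence worse than a uniform background."""
    words = sentence.lower().split()
    background = len(words) * math.log(1.0 / VOCAB_SIZE)
    best_topic = max(TOPIC_MODELS,
                     key=lambda t: log_likelihood(words, TOPIC_MODELS[t]))
    best_score = log_likelihood(words, TOPIC_MODELS[best_topic])
    return best_topic if best_score > background else "digression"

print(assign("the plot twist"))        # fits a learned topic
print(assign("my popcorn was stale"))  # fits no topic: off-topic state
```

The off-topic state absorbs the "complete chaos" Lee describes, leaving the regular topic states statistically clean.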
Modeling document content from a global perspective turned out
to be an advantage, according to the researchers' tests. In one experiment,
the method outperformed a state-of-the-art sentence-level method by 79
percent, according to Lee.
The method requires relatively formulaic domains and requires
a sample of documents and corresponding summaries for training. "The domain
of discourse needs to be formulaic enough for a computer to be able to
find patterns of language use," said Lee. Fortunately, "many domains of
interest to us have this property: for example, news articles about specific
types of events tend to be written in rather stereotypical ways," she said.
It is possible to use the model to do capsule summaries in restricted
domains now. Adapting the model to provide better search engine results
could take 10 years, said Lee.
Lee's research colleague was Regina Barzilay from the Massachusetts
Institute of Technology. The researchers presented the work at the North
American Chapter of the Association for Computational Linguistics Human
Language Technology (HLT/NAACL) 2004 conference in Boston, Massachusetts,
May 2 to 7. The research was funded by the National Science Foundation
(NSF) and the Alfred P. Sloan Foundation.
Timeline: Now, > 10 years
Funding: Government; Private
TRN Categories: Natural Language Processing; Databases and
Story Type: News
Related Elements: Technical paper, "Catching the Drift:
Probabilistic Content Models, with Applications to Generation and Summarization,"
North American Chapter of the Association for Computational Linguistics
Human Language Technology Conference (HLT/NAACL) 2004, May 2-7, Boston, Massachusetts
July 28/August 4, 2004