Summarizer ranks sentences

By Kimberly Patch, Technology Research News

Because computers don't understand the meanings of words and sentences, automating the seemingly simple task of summarizing a news story using several sources is a major computer science challenge.

Key to meeting the challenge is finding a way to identify the most important sentences from a set of documents on the same subject.

Researchers from the University of Michigan have developed a multi-document summarization technique that compares sentences so that, in effect, the sentences vote for the most important among them.

The method, dubbed LexRank, combines the content-sorting concepts of prestige and lexical similarity to find the most important sentences in a group of documents on the same subject.

Algorithms that use prestige to sort information have been around since the '90s. It is possible to find the most prestigious, or popular, member of a network by analyzing the relationships among network members. In a social network, for example, the most prestigious individual can be identified by analyzing the social relations among all pairs of members of the group.

The PageRank algorithm that powers Google takes advantage of this concept. It assigns a Web site a prestige score based on the number of other sites that point to it and the prestige of those sites. This works to rank Web pages on a network because the pages are connected. In the case of multi-document summarization, however, such hyperlinks are not available, said Dragomir Radev, an assistant professor of information, electrical engineering and computer science and linguistics at the University of Michigan.

Instead, the algorithm uses the similarities among sentences.

The researchers' lexical centrality algorithm compares the lexical similarity of sentences. "Lexical similarity can be thought of as a measure of the word overlap between two sentences," said Radev. "For example, 'Bush went to China' and 'George Bush visited China' are fairly similar in a lexical way [but] 'Bush visited China' and 'Blair is the prime minister of the United Kingdom' have no overlap at all," he said.
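Radev's two examples can be checked with a rough sketch of word overlap. This version measures overlap as the cosine between raw word counts, which is an assumption made for illustration; the article does not spell out the researchers' exact formula, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def overlap_similarity(a, b):
    """Cosine similarity between the word-count vectors of two sentences."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(c * c for c in ca.values()))
    norm_b = math.sqrt(sum(c * c for c in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


similar = overlap_similarity("Bush went to China", "George Bush visited China")
unrelated = overlap_similarity(
    "Bush visited China",
    "Blair is the prime minister of the United Kingdom",
)
print(similar)    # 0.5
print(unrelated)  # 0.0
```

The first pair shares two of its four words, "Bush" and "China", and scores 0.5; the second pair shares none and scores zero.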

The algorithm lets the researcher set a similarity threshold; two sentences whose similarity reaches the threshold are treated as similar.
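How a threshold turns similarity scores into connections can be sketched as follows. The similarity values and both cutoffs are made up for illustration, and the function name is hypothetical; the article does not say which threshold the researchers use.

```python
def similarity_graph(sim, threshold):
    """Link sentence i to sentence j when their similarity reaches the threshold."""
    n = len(sim)
    return {
        i: [j for j in range(n) if j != i and sim[i][j] >= threshold]
        for i in range(n)
    }


# Hypothetical pairwise similarities for four sentences (symmetric matrix).
sim = [
    [1.0, 0.5, 0.3, 0.0],
    [0.5, 1.0, 0.2, 0.0],
    [0.3, 0.2, 1.0, 0.05],
    [0.0, 0.0, 0.05, 1.0],
]

loose = similarity_graph(sim, 0.1)   # more pairs count as similar
strict = similarity_graph(sim, 0.4)  # only the closest pair survives

print(loose)
print(strict)
```

Raising the threshold thins out the graph: with the looser cutoff the first three sentences are all linked to one another, while the stricter one keeps only the link between the first two.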

There are many possible factors that can be used to assess the lexical similarity of a pair of sentences, said Radev. "We chose to weigh the contribution of each word... by its relative informativeness," he said. "Rare words like 'Igor', 'Taha' and 'disarmament' are more informative than common ones like 'today', 'between', and 'November'."
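One common way to measure informativeness is inverse document frequency (IDF), which gives a word less weight the more sentences it appears in. The sketch below assumes IDF weighting inside a cosine similarity; the article does not give the exact formula, and the four sample sentences and function names are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a sentence and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_table(sentences):
    """Map each word to log(N / number of sentences containing it)."""
    n = len(sentences)
    doc_freq = Counter()
    for s in sentences:
        doc_freq.update(set(tokenize(s)))
    return {w: math.log(n / c) for w, c in doc_freq.items()}


def weighted_similarity(a, b, idf):
    """Cosine similarity in which each word count is scaled by the word's IDF."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[w] * cb[w] * idf.get(w, 0.0) ** 2 for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum((c * idf.get(w, 0.0)) ** 2 for w, c in ca.items()))
    norm_b = math.sqrt(sum((c * idf.get(w, 0.0)) ** 2 for w, c in cb.items()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sentences: "today" appears in all of them, "disarmament" in two.
sentences = [
    "Talks on disarmament resumed today",
    "Officials met today in the capital",
    "A vote is expected today",
    "Disarmament talks stalled again today",
]
idf = idf_table(sentences)

print(idf["today"], idf["disarmament"])
print(weighted_similarity(sentences[1], sentences[2], idf))
print(weighted_similarity(sentences[0], sentences[3], idf))
```

Because "today" occurs in every sample sentence, its weight drops to zero, so the second and third sentences, which share only that word, no longer count as similar. The first and fourth still do, because they share the rarer words "disarmament" and "talks".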

The researchers' system considers a sentence important if it is similar to many other sentences and if those other sentences are themselves important. "In a sense, sentences vote for each other just by virtue of being similar to each other," said Radev. "The sentences with the highest scores... are considered to contain the gist of the document and are presented as the multi-document summary," he said.
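The voting idea can be sketched as a PageRank-style iteration over a sentence graph. This is an illustration rather than the researchers' implementation: the four-sentence graph, the damping value of 0.85, the iteration count and the function name are all assumptions.

```python
def centrality_scores(adj, damping=0.85, iterations=100):
    """Repeatedly let each sentence pass its score to the sentences it is similar to."""
    n = len(adj)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Every sentence starts each round with a small baseline share.
        new_scores = [(1.0 - damping) / n] * n
        for i, row in enumerate(adj):
            links = sum(row)
            for j in range(n):
                # A sentence with no neighbors spreads its vote evenly.
                share = row[j] / links if links else 1.0 / n
                new_scores[j] += damping * scores[i] * share
        scores = new_scores
    return scores


# Hypothetical graph: 1 means two sentences are similar enough to be linked.
# Sentence 2 is linked to every other sentence; sentence 3 only to sentence 2.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

scores = centrality_scores(adj)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores)
print(ranking[0])  # 2
```

Sentence 2 comes out on top because it collects votes from all three of the others, while sentence 3, with a single link, ranks last.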

In contrast, the state-of-the-art method -- dubbed Centroid -- calculates a pseudo sentence that is the average of all the sentences in a set of documents, and calculates how similar each sentence is to this "centroid" sentence.
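For comparison, the centroid idea as the article describes it can be sketched by averaging word counts across all sentences and scoring each sentence against that average. This is a simplification under stated assumptions: it uses raw word counts and cosine similarity, the function names are hypothetical, and the fourth sample sentence is invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a sentence and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(u, v):
    """Cosine similarity between two sparse word-weight mappings."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid_scores(sentences):
    """Score each sentence by its similarity to the average of all sentences."""
    vectors = [Counter(tokenize(s)) for s in sentences]
    totals = Counter()
    for vec in vectors:
        totals.update(vec)
    centroid = {word: count / len(vectors) for word, count in totals.items()}
    return [cosine(vec, centroid) for vec in vectors]


sentences = [
    "Bush went to China",
    "George Bush visited China",
    "Bush visited China",
    "Blair leads Britain",  # invented outlier sentence
]
centroid_result = centroid_scores(sentences)
print(centroid_result)
```

Here each sentence is compared once with a single averaged pseudo sentence, rather than with every other sentence as in the voting approach. In this small sample, "Bush visited China" sits closest to the average and the unrelated fourth sentence sits farthest from it.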

The researchers have applied the method to a prototype of their news clustering Web site. "For each cluster of related stories, we compute the pairwise similarity between all sentence pairs, then apply the [lexical] centrality algorithm," to parse out the important sentences, which become the summary of the document cluster, said Radev.

The most important realization in doing the research was that the patterns of language in the multi-document summarization task are similar to seemingly unrelated natural phenomena such as the patterns of links among Web pages, social interactions and electrical components, said Radev.

The researchers are planning to incorporate the method into their NewsInEssence Web site, which crawls the Web for news stories, clusters them into topical groups, and summarizes each group.

The researchers are also looking for other uses of the lexical centrality algorithm. Possibilities include automatic translation and question answering, said Radev. The method could potentially find the sentences likeliest to contain the answer to a given natural language question or, in the biomedical domain, the sentences most likely to contain important facts like particular protein interactions, he said.

The researchers' experiments show that the method has the potential to yield summaries as good as those of state-of-the-art summarization systems, said Lillian Lee, an associate professor of computer science at Cornell University. "This general field of investigation is one that seems very promising," she said.

Radev's research colleague was Gunes Erkan. The researchers published the work in the July 2004 to January 2005 issue of the Journal of Artificial Intelligence Research (JAIR). The research was funded by the National Science Foundation (NSF).

Timeline:  6-18 months
Funding: Government
TRN Categories:  Natural Language Processing; Databases and Information Retrieval; Internet
Story Type:  News
Related Elements:  Technical paper, "LexRank: Graph-based centrality as salience in text summarization," Journal of Artificial Intelligence Research (JAIR), July 2004-January 2005; Demo


April 20/27, 2005

© Copyright Technology Research News, LLC 2000-2006. All rights reserved.