Software paraphrases sentences

By Kimberly Patch, Technology Research News

We paraphrase all the time, often without thinking about it. Try to give a computer the means to reword a sentence, however, and it becomes apparent that figuring out how to say it differently is complicated.

Researchers at Cornell University have tapped a pair of unlike sources -- on-line journalism and computational biology -- to make it possible to automatically paraphrase whole sentences. The researchers used gene comparison techniques to identify word patterns from different news sources that described the same event.

The method could eventually allow computers to more easily process natural language, produce paraphrases that could be used in machine translation, and help people who have trouble reading certain types of sentences.

Two ideas led to the system, said Regina Barzilay, one of the Cornell researchers who is now an assistant professor of computer science at the Massachusetts Institute of Technology.

First, there is a lot of duplication online, which is potentially useful fodder for systems trying to learn to paraphrase. When two reporters describe the same news event, for instance, they may use different details, but they tend to report the same basic facts, said Barzilay. "This redundancy can help us to learn ways to paraphrase the same information," she said. "If we have a lot of [different sources] we can clean up the noise and identify the pieces of information where they say the same thing."

Even given similar writing styles, sentence-level paraphrasing is more than simply recognizing synonyms. The researchers' example of a pair of business journalism paraphrases makes this clear:

After the latest Fed rate cut, stocks rose across the board.
Winners strongly outpaced losers after Greenspan cut interest rates again.

Second, to sort out sentence similarities, the researchers borrowed techniques from computational biology that determine how closely related organisms are by finding similarities among genes. "In computational biology... you have genes which started from the same kind of seed, and then they change during evolution [but] there is some similarity," said Barzilay.

Key to the technique is comparing news sources that cover the same events but employ slightly different styles. Because the sources are writing about the same events, their articles contain the same facts, or arguments, said Barzilay. "This gives us patterns which are kind of the same -- and this is the core of the paraphrasing technique."

The researchers tested the system by comparing articles produced in English between September 2000 and August 2002 by Agence France-Presse (AFP) and Reuters news agencies.

The researchers' system performs two types of comparison. The first is across articles from the same source, said Barzilay.

The researchers' system uses word-based clustering methods to identify sets of text that have a high degree of overlapping words, said Barzilay. Using this method, the researchers identified articles that described individual acts of violence in Israel and army raids on the Palestinian Territories.
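
To make the idea of word-overlap grouping concrete, here is a minimal sketch in Python. It is not the researchers' code: the similarity measure (Jaccard overlap), the 0.5 threshold, the greedy grouping strategy and the function names are all assumptions chosen for illustration, and the sample sentences are made up.

```python
def word_set(sentence):
    """Lowercase a sentence and return its set of words."""
    return set(sentence.lower().split())


def jaccard(a, b):
    """Share of words two sentences have in common (0.0 to 1.0)."""
    wa, wb = word_set(a), word_set(b)
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def cluster_by_overlap(sentences, threshold=0.5):
    """Greedy grouping (an assumed strategy): put each sentence into the
    first cluster whose first member it overlaps by at least `threshold`,
    otherwise start a new cluster."""
    clusters = []
    for s in sentences:
        for c in clusters:
            if jaccard(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters


# Made-up sample sentences, for illustration only.
sentences = [
    "a suicide bomber blew himself up in Jerusalem on Sunday",
    "a suicide bomber blew himself up in Haifa on Monday",
    "stocks rose across the board after the rate cut",
]
clusters = cluster_by_overlap(sentences)
```

In this toy run the two bombing reports land in one cluster and the business sentence in another. A real system would need a more robust clustering method and far more text.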

They then employed computational biology techniques to identify sentence templates, or lattices. Lattices are made up of words or parallel sets of words that occur across several examples, and arguments, or slots, where names, dates or number of people hurt or killed occur.
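
The following sketch shows, in simplified form, how aligning two sentences can expose a shared template. It aligns only a pair of sentences using Python's standard difflib module, which is a stand-in for, not an implementation of, the multiple-sequence alignment named in the researchers' paper. The function name and the sample sentences are hypothetical.

```python
from difflib import SequenceMatcher


def pairwise_template(sent_a, sent_b):
    """Align two sentences word by word. Words shared by both are kept;
    every stretch where they differ is replaced by the marker SLOT."""
    a, b = sent_a.split(), sent_b.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    template = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            template.extend(a[i1:i2])
        else:
            # replace, insert or delete: the sentences disagree here
            template.append("SLOT")
    return " ".join(template)


# Made-up sample sentences, for illustration only.
template = pairwise_template(
    "A bomber blew himself up in Jerusalem on Sunday killing 5 people",
    "A bomber blew himself up in Haifa on Monday killing 12 people",
)
print(template)
```

Note that this sketch treats every difference as a slot. It makes no attempt to separate differences in wording from differences in subject, which is the harder part of the problem.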

The challenge is to identify which sentence differences are due to lexical variability and which are due to different subjects, said Barzilay.

The technique allowed the researchers to identify common templates that journalists use to describe similar events, said Barzilay. Journalists "use a similar style, but then change it -- add one word, remove words. With this technique we can still identify this common pattern," she said.

One pattern, or lattice, read: Palestinian suicide bomber blew himself up in NAME on DATE killing NUMBER (other) people and injuring/wounding NUMBER. In addition to the injuring/wounding variable, there are several variables within the name argument: settlement of, coastal resort of, center of, southern city, or garden cafe.
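
One way to picture such a lattice is as a pattern with named slots and optional or alternative words. The sketch below encodes the pattern quoted above as a Python regular expression and pulls the slot values out of a sentence. The regular-expression encoding, the slot names and the sample sentence are assumptions for illustration; the researchers' system represents lattices as alignment graphs, not regular expressions.

```python
import re

# The lattice quoted in the article, written as a regular expression.
# "(other)" is optional and "injuring/wounding" are alternatives.
LATTICE = re.compile(
    r"Palestinian suicide bomber blew himself up in (?P<name>.+?) "
    r"on (?P<date>.+?) killing (?P<killed>\d+) (?:other )?people "
    r"and (?:injuring|wounding) (?P<hurt>\d+)"
)


def extract_slots(sentence):
    """Return the slot values as a dict, or None if the sentence
    does not fit the lattice."""
    m = LATTICE.search(sentence)
    return m.groupdict() if m else None


# Made-up sentence with placeholder values, for illustration only.
slots = extract_slots(
    "Palestinian suicide bomber blew himself up in southern city "
    "on Monday killing 2 other people and wounding 7"
)
```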

The system found 43 AFP lattices and 32 Reuters lattices. Once these were identified, the researchers did a cross-comparison.

The researchers compared the lattices from the two sources by comparing the slot values of articles written on the same days. They used a statistical technique to identify patterns that tend to take the same arguments in both sources, said Barzilay.
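
The article does not say which statistical technique was used, so the sketch below substitutes a deliberately simple one: for each pair of lattices, it averages the overlap of slot values over the days both lattices were used, and keeps pairs that score above a threshold. The data layout, function names, threshold and sample values are all hypothetical.

```python
def slot_overlap(values_a, values_b):
    """Fraction of slot values shared by two sentences (0.0 to 1.0)."""
    a, b = set(values_a), set(values_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def match_lattices(source_a, source_b, min_score=0.5):
    """Each source maps a lattice id to {day: slot values seen that day}.
    Returns (id_a, id_b, score) for lattice pairs whose slot values tend
    to agree on the days both were used."""
    matches = []
    for id_a, by_day_a in source_a.items():
        for id_b, by_day_b in source_b.items():
            shared_days = set(by_day_a) & set(by_day_b)
            if not shared_days:
                continue
            score = sum(
                slot_overlap(by_day_a[d], by_day_b[d]) for d in shared_days
            ) / len(shared_days)
            if score >= min_score:
                matches.append((id_a, id_b, score))
    return matches


# Made-up slot values, for illustration only.
source_a = {"A1": {"day1": ["CityX", "5"], "day2": ["CityY", "2"]}}
source_b = {
    "B1": {"day1": ["CityX", "5"], "day2": ["CityY", "3"]},
    "B2": {"day1": ["CityZ", "9"]},
}
matches = match_lattices(source_a, source_b)
```

Here lattice A1 pairs with B1, whose slots mostly agree on both days, and not with B2, which shares a day but no values.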

Twenty-five lattices from each source matched up. Taking into account the variables contained within the lattices, there were 6,534 template pairs.

Given a sentence to paraphrase, the system finds the closest match among one set of lattices, then uses the matching lattice from the second source to fill in the argument values of the original sentence to create paraphrases. The sentence can be paraphrased in perhaps as many as 20 ways using different variables, according to Barzilay.
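
The generation step can be sketched as filling slots and walking every path through a lattice. In the example below a lattice is a list of segments, each offering one or more alternative wordings, with slots written as Python format fields. This representation, the function name and the slot values are assumptions for illustration; the wording variants are the ones quoted in the lattice above.

```python
from itertools import product


def realize(lattice, slots):
    """Produce one sentence per path through the lattice.
    `lattice` is a list of segments; each segment is a list of
    alternative strings (an empty string means the words are optional).
    Fields such as {name} are filled from `slots`."""
    sentences = []
    for choice in product(*lattice):
        text = " ".join(part for part in choice if part)
        sentences.append(text.format(**slots))
    return sentences


lattice = [
    ["Palestinian suicide bomber blew himself up in {name} on {date} killing {killed}"],
    ["other", ""],
    ["people and"],
    ["injuring", "wounding"],
    ["{hurt}"],
]
# Made-up slot values, for illustration only.
slots = {"name": "southern city", "date": "Monday", "killed": "2", "hurt": "7"}
paraphrases = realize(lattice, slots)
```

Two optional or alternative segments yield four variants of the sentence. Lattices with more variation yield correspondingly more.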

The researchers' ultimate goal is to enable computers to paraphrase like humans, and to understand paraphrases, "but that's very far [off]," said Barzilay.

Their next step is to find ways to put paraphrased sentences together in order to paraphrase whole documents. Barzilay's previous work, which used a different technique to paraphrase at the level of words and phrases rather than sentences, is part of the Columbia News Blaster project, which summarizes news stories.

The sentence-based paraphrasing system could improve machine translation, according to Barzilay. "Today a majority of machine translation is statistical... you have large amounts of data in English and in French and then the system learns how to translate that," she said.

These systems work best when they have many different translations of a given sentence, however. "To create such a corpus where 10 people are to translate huge amounts of French text is very expensive," said Barzilay. The researchers' system has the potential to accomplish the same thing by taking one human translation and creating 10 paraphrases of it automatically, she said.

Sentence paraphrasing is also useful for people with certain disabilities, said Barzilay. The system could be used to produce paraphrases based on a specific model, for example, for aphasic readers, who find it difficult to read certain types of phrases, she said.

The system also produced a couple of insights into reporters' writing, said Barzilay. It showed that the writing was very formulaic, and it pointed out bias, she said.

For example, the system learned incorrectly that "Palestinian suicide bomber" and "suicide bomber" were the same, and that "killing 20 people" is the same as "killing 20 Israelis", said Barzilay. These mistakes made by the system are "due to how reporters are reporting," she said. "In some sense... the teacher here is what the reporter writes," she said.

Barzilay's research colleague was Lillian Lee. The researchers presented the work at the Human Language Technology Conference held in Edmonton, Canada, May 27 to June 1, 2003. The research was funded by the Sloan Foundation and the National Science Foundation (NSF).

Timeline:   Now
Funding:   Government, Private
TRN Categories:  Databases and Information Retrieval; Natural Language Processing
Story Type:   News
Related Elements:  Technical paper, "Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment," posted on the Computer Research Repository (CoRR) and presented at the Human Language Technology Conference, Edmonton, Canada, May 27-June 1, 2003


December 3/10, 2003

© Copyright Technology Research News, LLC 2000-2006. All rights reserved.