Software paraphrases sentences
Technology Research News
We paraphrase all the time, often without
thinking about it. Try to give a computer the means to reword a sentence,
however, and it becomes apparent that figuring out how to say it differently
is difficult.
Researchers at Cornell University have tapped a pair of unlike sources
-- on-line journalism and computational biology -- to make it possible to
automatically paraphrase whole sentences. The researchers used gene comparison
techniques to identify word patterns from different news sources that described
the same event.
The method could eventually allow computers to more easily process
natural language, produce paraphrases that could be used in machine translation,
and help people who have trouble reading certain types of sentences.
Two ideas led to the system, said Regina Barzilay, one of the Cornell
researchers who is now an assistant professor of computer science at the
Massachusetts Institute of Technology.
First, there is a lot of duplication online, which is potentially
useful fodder for systems trying to learn to paraphrase. When two reporters
describe the same news event, for instance, they may use different details,
but they tend to report about the same basic facts, said Barzilay. "This
redundancy can help us to learn ways to paraphrase the same information,"
she said. "If we have a lot of [different sources] we can clean up the noise
and identify the pieces of information where they say the same thing."
Even given similar writing styles, sentence-level paraphrasing is
more than simply recognizing synonyms. The researchers' example of a pair
of business journalism paraphrases makes this clear:
After the latest Fed rate cut, stocks rose across the board.
Winners strongly outpaced losers after Greenspan cut interest rates again.
Second, to sort out sentence similarities, the researchers borrowed techniques
from computational biology that determine how closely related organisms
are by finding similarities among genes. "In computational biology... you
have genes which started from the same kind of seed, and then they change
during evolution [but] there is some similarity," said Barzilay.
Key to the technique is comparing news sources that cover the same
events but employ slightly different styles. Because the sources are writing
about the same events, they contain the same facts, or arguments, said Barzilay.
"This gives us patterns which are kind of the same -- and this is the core
of the paraphrasing technique."
The researchers tested the system by comparing articles produced
in English between September 2000 and August 2002 by Agence France-Presse
(AFP) and Reuters news agencies.
The researchers' system performs two types of comparison. The first
is across articles from the same source, said Barzilay.
The researchers' system uses word-based clustering methods to identify
sets of text that have a high degree of overlapping words, said Barzilay.
Using this method, the researchers identified articles that described individual
acts of violence in Israel and army raids on the Palestinian Territories.
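The word-overlap grouping described above can be illustrated with a minimal
sketch. The article does not say which overlap measure or clustering procedure
the researchers used, so the Jaccard measure, the greedy single-pass grouping,
the 0.5 threshold and the sample sentences are all assumptions made for
illustration.

```python
import re


def tokens(sentence):
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def overlap(a, b):
    """Jaccard overlap between two token sets, from 0.0 to 1.0."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(sentences, threshold=0.5):
    """Greedy single-pass clustering (an assumed, simplified procedure).

    Each sentence joins the first cluster whose seed sentence shares
    enough words with it; otherwise it starts a new cluster.
    """
    clusters = []  # list of (seed_tokens, member_sentences)
    for sentence in sentences:
        words = tokens(sentence)
        for seed, members in clusters:
            if overlap(seed, words) >= threshold:
                members.append(sentence)
                break
        else:
            clusters.append((words, [sentence]))
    return [members for _, members in clusters]


# Made-up sentences, not taken from the AFP or Reuters data.
sentences = [
    "A fire broke out in Northtown on Tuesday injuring four people",
    "A fire broke out in Southport on Friday injuring nine people",
    "Stocks rose across the board after the latest rate cut",
]
groups = cluster(sentences)
```

Here the two fire reports share 8 of 14 distinct words and land in one
group, while the stock market sentence shares none and forms its own.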
They then employed computational biology techniques to identify
sentence templates, or lattices. Lattices are made up of words or parallel
sets of words that occur across several examples, and arguments, or slots,
where names, dates or the number of people hurt or killed appear.
The challenge is to identify which sentence differences are due
to lexical variability and which are due to different subjects, said Barzilay.
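A rough sketch of how alignment separates shared wording from slots follows.
The researchers' paper uses multiple-sequence alignment over many sentences;
this example swaps in a simpler pairwise alignment based on the longest common
subsequence, and the function names and sample sentences are hypothetical.

```python
def align(a, b):
    """Word-level pairwise alignment via longest common subsequence.

    Returns a list of (word_a, word_b) pairs; None marks a gap.
    """
    n, m = len(a), len(b)
    # lcs[i][j] = length of the longest common subsequence of a[i:] and b[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((a[i], b[j]))
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            pairs.append((a[i], None))
            i += 1
        else:
            pairs.append((None, b[j]))
            j += 1
    pairs += [(word, None) for word in a[i:]]
    pairs += [(None, word) for word in b[j:]]
    return pairs


def template(a, b):
    """Keep shared words; collapse each run of differing words into SLOT."""
    out = []
    for word_a, word_b in align(a, b):
        if word_a == word_b:
            out.append(word_a)
        elif not out or out[-1] != "SLOT":
            out.append("SLOT")
    return out


# Made-up sentences for illustration only.
a = "a fire broke out in Northtown on Tuesday injuring four people".split()
b = "a fire broke out in Southport on Friday injuring nine people".split()
result = " ".join(template(a, b))
# result == "a fire broke out in SLOT on SLOT injuring SLOT people"
```

A pairwise alignment like this cannot tell a slot from a synonym, since both
show up as mismatches. Resolving that ambiguity is the challenge Barzilay
describes.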
The technique allowed the researchers to identify common templates
that journalists use to describe similar events, said Barzilay. Journalists
"use a similar style, but then change it -- add one word, remove words.
With this technique we can still identify this common pattern," she said.
One pattern, or lattice, read: Palestinian suicide bomber blew himself
up in NAME on DATE killing NUMBER (other) people and injuring/wounding NUMBER.
In addition to the injuring/wounding variable, there are several variables
within the name argument: settlement of, coastal resort of, center of, southern
city, or garden cafe.
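One way to picture such a lattice is as a sequence of positions, each holding
one or more alternatives. The sketch below encodes the pattern quoted above in
that form and lists every wording it allows. The list-of-lists representation
is an assumption for illustration, not the data structure used by the
researchers.

```python
from itertools import product

# Each position lists its alternatives; "" marks an optional word.
# NAME, DATE and NUMBER are the slots from the quoted pattern.
LATTICE = [
    ["Palestinian suicide bomber blew himself up in"],
    ["NAME"],
    ["on"],
    ["DATE"],
    ["killing"],
    ["NUMBER"],
    ["other", ""],
    ["people and"],
    ["injuring", "wounding"],
    ["NUMBER"],
]


def expand(lattice):
    """Return every wording the lattice can produce."""
    return [
        " ".join(word for word in choice if word)
        for choice in product(*lattice)
    ]


variants = expand(LATTICE)
# Two choices for the optional word times two verbs gives four wordings.
```

Each extra alternative multiplies the number of wordings, which is why a
modest number of lattices can yield a large number of templates.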
The system found 43 AFP lattices and 32 Reuters lattices. Once these
were identified, the researchers did a cross-comparison.
The researchers compared the lattices from the two sources by comparing
the slot values of articles written on the same days. They used a statistical
technique to identify patterns that tend to take the same arguments in both
sources, said Barzilay.
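The cross-source comparison can be sketched as follows. The article does not
name the statistical technique, so this example assumes a simple average of
same-day slot-value overlap. The lattice names, day labels, slot values and
0.5 threshold are hypothetical.

```python
def slot_overlap(values_a, values_b):
    """Jaccard overlap between two sets of slot values."""
    if not values_a or not values_b:
        return 0.0
    return len(values_a & values_b) / len(values_a | values_b)


def match_score(lattice_a, lattice_b):
    """Average slot-value overlap over the days both lattices were used.

    Each lattice maps a day label to the set of slot values seen that day.
    """
    shared_days = lattice_a.keys() & lattice_b.keys()
    if not shared_days:
        return 0.0
    total = sum(slot_overlap(lattice_a[d], lattice_b[d]) for d in shared_days)
    return total / len(shared_days)


def best_matches(source_a, source_b, threshold=0.5):
    """Pair each lattice in source_a with its best match in source_b."""
    pairs = {}
    for name_a, lattice_a in source_a.items():
        scored = [
            (match_score(lattice_a, lattice_b), name_b)
            for name_b, lattice_b in source_b.items()
        ]
        if scored:
            score, name_b = max(scored)
            if score >= threshold:
                pairs[name_a] = name_b
    return pairs


# Made-up data: two lattices per source, keyed by day.
source_a = {
    "a_fire": {"day-1": {"town-x", "3"}, "day-2": {"town-y", "5"}},
    "a_flood": {"day-3": {"town-z"}},
}
source_b = {
    "b_fire": {"day-1": {"town-x", "3"}, "day-2": {"town-y", "6"}},
    "b_market": {"day-1": {"index", "2"}},
}
pairs = best_matches(source_a, source_b)
```

In this toy data only the two fire lattices take the same arguments on the
same days, so they are the only pair that matches.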
Twenty-five lattices from each source matched up. Taking into account
the variables contained within the lattices, there were 6,534 template pairs.
Given a sentence to paraphrase, the system finds the closest match
among one set of lattices, then uses the matching lattice from the second
source to fill in the argument values of the original sentence to create
paraphrases. The sentence can be paraphrased in perhaps as many as 20 ways
using different variables, according to Barzilay.
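The fill-in step might look like the sketch below. It assumes exact template
matching with regular expressions in place of the closest-match search the
article describes, single-word slot values, and slot names that appear once
per template. The two templates and the sample sentence are made up.

```python
import re

# Hypothetical matched templates from two sources; uppercase words are slots.
TEMPLATE_A = "a fire broke out in NAME on DATE injuring NUMBER people"
TEMPLATE_B = "NUMBER people were hurt on DATE when a fire started in NAME"


def to_regex(template):
    """Turn a template into a regex with one named group per slot.

    Assumes each slot name occurs only once in the template.
    """
    parts = []
    for word in template.split():
        if word.isupper():
            parts.append(f"(?P<{word}>\\S+)")
        else:
            parts.append(re.escape(word))
    return re.compile(r"\s+".join(parts))


def paraphrase(sentence, template_a, template_b):
    """Pull slot values out with template_a and reinsert them in template_b.

    Returns None when the sentence does not fit template_a.
    """
    match = to_regex(template_a).fullmatch(sentence)
    if match is None:
        return None
    values = match.groupdict()
    return " ".join(values.get(word, word) for word in template_b.split())


sentence = "a fire broke out in Northtown on Tuesday injuring 4 people"
reworded = paraphrase(sentence, TEMPLATE_A, TEMPLATE_B)
# reworded == "4 people were hurt on Tuesday when a fire started in Northtown"
```

Running the same sentence through each wording of the matched lattice is what
would produce multiple paraphrases of one input.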
The researchers' ultimate goal is to use the system to allow computers
to paraphrase like humans, and to understand paraphrases, "but
that's very far [off]," said Barzilay.
Their next step is to find ways to put paraphrased sentences together
in order to paraphrase whole documents. Barzilay's previous work, which
used a different technique to paraphrase at the level of words and phrases
rather than sentences, is part of the Columbia News Blaster project, which
summarizes news stories.
The sentence-based paraphrasing system could improve machine translation,
according to Barzilay. "Today a majority of machine translation is statistical...
you have large amounts of data in English and in French and then the system
learns how to translate that," she said.
These systems work best when they have many different translations
of a given sentence, however. "To create such a corpus where 10 people are
to translate huge amounts of French text is very expensive," said Barzilay.
The researchers' system has the potential to accomplish the same thing by
taking one human translation and creating 10 paraphrases of it automatically.
Sentence paraphrasing is also useful for people with certain disabilities,
said Barzilay. The system could be used to produce paraphrases based on
a specific model, for example, for aphasic readers, who find it difficult
to read certain types of phrases, she said.
The system also produced a couple of insights into reporters' writing,
said Barzilay. It showed that the writing was very formulaic, and it pointed
out bias, she said.
For example, the system learned incorrectly that "Palestinian suicide
bomber" and "suicide bomber" were the same, and that "killing 20 people"
is the same as "killing 20 Israelis", said Barzilay. These mistakes made
by the system are "due to how reporters are reporting," she said. "In some
sense... the teacher here is what the reporter writes," she said.
Barzilay's research colleague was Lillian Lee. The researchers presented
the work at the Human Language Technology Conference held in Edmonton, Canada,
May 27 to June 1, 2003. The research was funded by the Sloan Foundation
and the National Science Foundation (NSF).
Funding: Government, Private
TRN Categories: Databases and Information Retrieval; Natural
Story Type: News
Related Elements: Technical paper, "Learning to Paraphrase:
An Unsupervised Approach Using Multiple-Sequence Alignment," posted on the
Computer Research Repository (CoRR) at arxiv.org/abs/cs.CL/0304006 and presented
at the Human Language Technology Conference, Edmonton, Canada, May 27-June
1, 2003
December 3/10, 2003