Software paraphrases sentences
By Kimberly Patch, Technology Research News
 We paraphrase all the time, often without 
      thinking about it. Try to give a computer the means to reword a sentence, 
      however, and it becomes apparent that figuring out how to say it differently 
      is complicated.
 
 Researchers at Cornell University have tapped a pair of unlike sources 
      -- on-line journalism and computational biology -- to make it possible to 
      automatically paraphrase whole sentences. The researchers used gene comparison 
      techniques to identify word patterns from different news sources that described 
      the same event.
 
 The method could eventually allow computers to more easily process 
      natural language, produce paraphrases that could be used in machine translation, 
      and help people who have trouble reading certain types of sentences.
 
 Two ideas led to the system, said Regina Barzilay, one of the Cornell 
      researchers who is now an assistant professor of computer science at the 
      Massachusetts Institute of Technology.
 
 First, there is a lot of duplication online, which is potentially 
      useful fodder for systems trying to learn to paraphrase. When two reporters 
      describe the same news event, for instance, they may use different details, 
      but they tend to report about the same basic facts, said Barzilay. "This 
      redundancy can help us to learn ways to paraphrase the same information," 
      she said. "If we have a lot of [different sources] we can clean up the noise 
      and identify the pieces of information where they say the same thing."
 
 Even given similar writing styles, sentence-level paraphrasing is 
      more than simply recognizing synonyms. The researchers' example of a pair 
      of business journalism paraphrases makes this clear:
  After the latest Fed rate cut, stocks rose across the board.
  Winners strongly outpaced losers after Greenspan cut interest rates again.

 Second, to sort out sentence similarities, the researchers borrowed techniques 
      from computational biology that determine how closely related organisms 
      are by finding similarities among genes. "In computational biology... you 
      have genes which started from the same kind of seed, and then they change 
      during evolution [but] there is some similarity," said Barzilay.
 
 Key to the technique is comparing news sources that cover the same 
      events but employ slightly different styles. Because the sources are writing 
      about the same events, they contain the same facts, or arguments, said Barzilay. 
      "This gives us patterns which are kind of the same -- and this is the core 
      of the paraphrasing technique."
 
 The researchers tested the system by comparing articles produced 
      in English between September 2000 and August 2002 by Agence France-Presse 
      (AFP) and Reuters news agencies.
 
 The researchers' system performs two types of comparison. The first 
      is across articles from the same source, said Barzilay.
 
 The researchers' system uses word-based clustering methods to identify 
      sets of text that have a high degree of overlapping words, said Barzilay. 
      Using this method, the researchers identified articles that described individual 
      acts of violence in Israel and army raids on the Palestinian Territories.
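
 As a rough illustration of word-based clustering, and not the researchers' 
      actual code, the following Python sketch groups sentences by how many words 
      they share. The function names, the Jaccard overlap measure, the greedy 
      grouping strategy, the 0.5 threshold and the sample sentences are all 
      assumptions made for this example.

```python
def word_set(text):
    """Lowercase a sentence and split it into a set of words."""
    return set(text.lower().split())


def overlap(a, b):
    """Jaccard overlap between two word sets, from 0.0 to 1.0."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def cluster_by_overlap(sentences, threshold=0.5):
    """Greedy single pass: a sentence joins the first cluster whose
    first member it overlaps with by at least `threshold`."""
    clusters = []
    for sentence in sentences:
        for cluster in clusters:
            if overlap(word_set(sentence), word_set(cluster[0])) >= threshold:
                cluster.append(sentence)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([sentence])
    return clusters


# Hypothetical sample sentences, not taken from the AFP or Reuters data.
sentences = [
    "the central bank cut rates in March by 25 points",
    "the central bank cut rates in June by 50 points",
    "stocks rose across the board",
]
clusters = cluster_by_overlap(sentences)
```

 In this sketch the two rate-cut sentences land in one cluster and the 
      stock sentence in another. A real system would likely need a more careful 
      similarity measure and clustering method.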
 
 They then employed computational biology techniques to identify 
      sentence templates, or lattices. Lattices are made up of words or parallel 
      sets of words that occur across several examples, and arguments, or slots, 
      where names, dates or number of people hurt or killed occur.
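
 The idea of a template with slots can be sketched in a few lines of 
      Python. This is a simplification: the researchers' paper describes multiple-sequence 
      alignment across many sentences, while the sketch below aligns only two 
      sentences using Python's standard difflib module. It also treats every 
      difference as a slot, so it does not separate word-choice variation from 
      true arguments. The function name, the SLOT label and the sample sentences 
      are assumptions made for this example.

```python
from difflib import SequenceMatcher


def pairwise_template(sent_a, sent_b):
    """Align two sentences word by word. Shared words are kept and each
    differing region is replaced by the placeholder SLOT."""
    a, b = sent_a.split(), sent_b.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    template = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            template.extend(a[i1:i2])
        else:
            # "replace", "insert" and "delete" all mark a point of variation.
            template.append("SLOT")
    return " ".join(template)


# Hypothetical sample sentences, not taken from the researchers' data.
template = pairwise_template(
    "the central bank cut rates in March by 25 points",
    "the central bank cut rates in June by 50 points",
)
print(template)
```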
 
 The challenge is to identify which sentence differences are due 
      to lexical variability and which are due to different subjects, said Barzilay.
 
 The technique allowed the researchers to identify common templates 
      that journalists use to describe similar events, said Barzilay. Journalists 
      "use a similar style, but then change it -- add one word, remove words. 
      With this technique we can still identify this common pattern," she said.
 
 One pattern, or lattice, read: Palestinian suicide bomber blew himself 
      up in NAME on DATE killing NUMBER (other) people and injuring/wounding NUMBER. 
      In addition to the injuring/wounding variable, there are several variables 
      within the name argument: settlement of, coastal resort of, center of, southern 
      city, or garden cafe.
 
 The system found 43 AFP lattices and 32 Reuters lattices. Once these 
      were identified, the researchers did a cross-comparison.
 
 The researchers compared the lattices from the two sources by comparing 
      the slot values of articles written on the same days. They used a statistical 
      technique to identify patterns that tend to take the same arguments in both 
      sources, said Barzilay.
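
 The article does not name the statistical technique used. As a stand-in, 
      the Python sketch below scores two lattices by how much their slot values 
      overlap on days that both sources covered. The Jaccard measure, the data 
      layout (a dictionary from date to a set of slot values), the lattice labels 
      and the sample values are all assumptions made for this example.

```python
def slot_overlap(values_a, values_b):
    """Average Jaccard overlap of slot values over the days that
    appear in both lattices. Returns 0.0 if no days are shared."""
    shared_days = set(values_a) & set(values_b)
    if not shared_days:
        return 0.0
    total = 0.0
    for day in shared_days:
        a, b = values_a[day], values_b[day]
        union = a | b
        total += len(a & b) / len(union) if union else 0.0
    return total / len(shared_days)


# Hypothetical slot values for one lattice from the first source.
source_a = {
    "2002-03-01": {"Boston", "25"},
    "2002-03-09": {"Denver", "50"},
}

# Hypothetical candidate lattices from the second source.
candidates_b = {
    "B1": {"2002-03-01": {"Boston", "25"}, "2002-03-09": {"Denver", "40"}},
    "B2": {"2002-03-01": {"Miami", "10"}},
}

scores = {name: slot_overlap(source_a, vals) for name, vals in candidates_b.items()}
best = max(scores, key=scores.get)
```

 Here the candidate labeled B1 scores highest because it takes nearly 
      the same values on the same days, which is the kind of signal the cross-comparison 
      relies on.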
 
 Twenty-five lattices from each source matched up. Taking into account 
      the variables contained within the lattices, there were 6,534 template pairs.
 
 Given a sentence to paraphrase, the system finds the closest match 
      among one set of lattices, then uses the matching lattice from the second 
      source to fill in the argument values of the original sentence to create 
      paraphrases. The sentence can be paraphrased in perhaps as many as 20 ways 
      using different variables, according to Barzilay.
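
 The fill-in step can be sketched in Python as follows. This is a simplified 
      illustration under stated assumptions, not the researchers' implementation: 
      it requires an exact match against a single template rather than finding 
      the closest lattice, it assumes each slot name appears only once per template, 
      and the templates, slot names and sample sentence are invented for the 
      example.

```python
import re


def template_to_regex(template):
    """Turn a template such as 'cut rates in NAME' into a regular
    expression. All-uppercase tokens are treated as named slots."""
    parts = []
    for token in template.split():
        if token.isupper():
            parts.append(r"(?P<%s>.+?)" % token)
        else:
            parts.append(re.escape(token))
    return re.compile(r"^" + r"\s+".join(parts) + r"$")


def paraphrase(sentence, source_template, target_templates):
    """Extract slot values from `sentence` using the source template,
    then fill them into each paired target template."""
    match = template_to_regex(source_template).match(sentence)
    if match is None:
        return []
    values = match.groupdict()
    results = []
    for target in target_templates:
        words = [values.get(token, token) for token in target.split()]
        results.append(" ".join(words))
    return results


# Hypothetical template pair standing in for matched lattices.
source_template = "the bank cut rates in NAME on DATE by NUMBER points"
target_templates = ["on DATE the bank in NAME lowered rates by NUMBER points"]

paraphrases = paraphrase(
    "the bank cut rates in Boston on Monday by 25 points",
    source_template,
    target_templates,
)
print(paraphrases)
```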
 
 The researchers' ultimate goal is to enable computers to paraphrase 
      like humans, and to understand paraphrases, "but that's very far [off]," 
      said Barzilay.
 
 Their next step is to find ways to put paraphrased sentences together 
      in order to paraphrase whole documents. Barzilay's previous work, which 
      used a different technique to paraphrase at the level of words and phrases 
      rather than sentences, is part of the Columbia News Blaster project, which 
      summarizes news stories.
 
 The sentence-based paraphrasing system could improve machine translation, 
      according to Barzilay. "Today a majority of machine translation is statistical... 
      you have large amounts of data in English and in French and then the system 
      learns how to translate that," she said.
 
 These systems work best when they have many different translations 
      of a given sentence, however. "To create such a corpus where 10 people are 
      to translate huge amounts of French text is very expensive," said Barzilay. 
      The researchers' system has the potential to accomplish the same thing by 
      taking one human translation and creating 10 paraphrases of it automatically, 
      she said.
 
 Sentence paraphrasing is also useful for people with certain disabilities, 
      said Barzilay. The system could be used to produce paraphrases based on 
      a specific model, for example, for aphasic readers, who find it difficult 
      to read certain types of phrases, she said.
 
 The system also produced a couple of insights into reporters' writing, 
      said Barzilay. It showed that the writing was very formulaic, and it pointed 
      out bias, she said.
 
 For example, the system learned incorrectly that "Palestinian suicide 
      bomber" and "suicide bomber" were the same, and that "killing 20 people" 
      is the same as "killing 20 Israelis", said Barzilay. These mistakes made 
      by the system are "due to how reporters are reporting," she said. "In some 
      sense... the teacher here is what the reporter writes," she said.
 
 Barzilay's research colleague was Lillian Lee. The researchers presented 
      the work at the Human Language Technology Conference held in Edmonton, Canada, 
      May 27 to June 1, 2003. The research was funded by the Sloan Foundation 
      and the National Science Foundation (NSF).
 
 Timeline:   Now
 Funding:   Government, Private
 TRN Categories:  Databases and Information Retrieval; Natural 
      Language Processing
 Story Type:   News
 Related Elements:  Technical paper, "Learning to Paraphrase: 
      An Unsupervised Approach Using Multiple-Sequence Alignment," posted on the 
      Computer Research Repository (CoRR) at arxiv.org/abs/cs.CL/0304006 and presented 
      at the Human Language Technology Conference, Edmonton, Canada, May 27-June 
      1, 2003
 
 
 
 
 December 3/10, 2003
 