 Summarizer gets the idea
 By Kimberly Patch, Technology Research News
 The flow of a document, including the topics 
        covered and the ways those topics relate to each other, is clear to people. 
        It would be useful if computer systems that process documents -- like 
        search engines and programs that generate summaries of news articles -- 
        could also learn to consider topic information.
 
 Teaching a computer to discern a document's topics and create 
        a summary that puts the topics in the correct order is a bit like teaching 
        it how to put together the pieces of a jigsaw puzzle. Current methods 
        focus on finding the right match for a given piece.
 
 Researchers from the Massachusetts Institute of Technology and 
        Cornell University have developed a system that does the equivalent of 
        putting pieces that show parts of a mountain and pieces that show parts 
        of the sky into separate groups, and putting the sky pieces above the 
        mountain pieces, said Lillian Lee, an associate professor of computer 
        science at Cornell University.
 
 The researchers' automatic classification algorithm, or content 
        model, is trained on subject-specific sets of documents and document summaries. 
        It can then extract the topic structure of a group of related documents. 
        The system selects and orders topics to generate a summary.
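
 As a rough illustration of selecting topics from training pairs, the 
        sketch below assumes every sentence already carries a topic label. It 
        estimates how often each topic shows up in the training summaries and 
        keeps the sentences whose topics are most summary-worthy, in document 
        order. The function names, the sample data and the selection rule are 
        all hypothetical; the article does not describe the researchers' actual 
        selection procedure.

```python
from collections import Counter


def summary_rates(training_pairs):
    """For each topic, the fraction of documents containing it whose summary also has it.

    training_pairs: list of (document_topics, summary_topics) lists.
    """
    in_doc, in_summary = Counter(), Counter()
    for doc_topics, summary_topics in training_pairs:
        for topic in set(doc_topics):
            in_doc[topic] += 1
            if topic in summary_topics:
                in_summary[topic] += 1
    return {topic: in_summary[topic] / in_doc[topic] for topic in in_doc}


def summarize(tagged_sentences, rates, max_sentences=2):
    """Keep the sentences with the highest-rated topics, in document order.

    tagged_sentences: list of (topic, sentence) pairs in document order.
    """
    ranked = sorted(
        range(len(tagged_sentences)),
        key=lambda i: rates.get(tagged_sentences[i][0], 0.0),
        reverse=True,
    )
    keep = sorted(ranked[:max_sentences])  # restore document order
    return [tagged_sentences[i][1] for i in keep]


# Hypothetical training data: topics in each review and in its summary.
training_pairs = [
    (["opinion", "plot", "director"], ["plot", "opinion"]),
    (["plot", "actors", "opinion"], ["plot"]),
    (["plot", "director", "actors"], ["plot", "actors"]),
]
rates = summary_rates(training_pairs)

review = [
    ("director", "The director is a first-timer."),
    ("opinion", "It is a pleasant surprise."),
    ("plot", "Two brothers drive across the country."),
    ("actors", "The leads are unknowns."),
]
capsule = summarize(review, rates)
```

 A real system would also have to assign the topic labels itself 
        rather than receive them as input.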
 
 The researchers put together a prototype system that can automatically 
        create capsule summaries of, for example, movies from a movie information 
        database. Once the content model is trained on movie reviews, the system 
        can determine appropriate ways to present the information, said Lee.
 
 The content model could eventually be used to make search engines 
        more precise, said Lee. Today's search engines "don't take the internal 
        topic structure into account in any but a very coarse way," she said.
 
 The researchers' system would allow a search engine to determine 
        the overall topic and domain of discourse of a Web page, call up the appropriate 
        content model to analyze the page's topic structure, and then return only 
        on-topic pages, said Lee. It could also allow a search engine to present 
        the user with just those parts of a document that were relevant to a query, 
        she said.
 
 The researchers' content model algorithm is based on the hidden 
        Markov model, a method commonly used to delineate words in speech recognition 
        programs and genes in computational biology.
 
 A set of movie reviews, for example, usually contains several 
        common topics: director, plot, actors, previous movies by the same director, 
        and the reviewer's opinion of the movie, said Lee. The reviewer chooses 
        an order in which to present some or all of the topics, she said. For 
        example, the reviewer might begin by giving an overall opinion about the 
        plot before discussing the director.
 
 The hidden Markov model can specify mathematically that a likely 
        sequence of topics within a review is opinion/plot/director/director's 
        previous films/opinion rather than actors/opinion/director's previous 
        films/director/actors/plot, said Lee. There are also techniques that allow 
        systems to automatically learn the relevant probabilities just from examining 
        samples of sequences, she said.
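
 The idea of learning sequence probabilities from samples can be 
        sketched with a plain Markov chain over topic labels. This is a 
        simplification: the topics here are given as visible labels, whereas in 
        a hidden Markov model they are hidden and must be inferred from the 
        sentences. The sample sequences, the add-one smoothing and the function 
        names are assumptions made for illustration.

```python
import math
from collections import defaultdict

START, END = "<start>", "<end>"


def train_transitions(sequences, smoothing=1.0):
    """Estimate log P(next topic | current topic) from sample topic sequences."""
    topics = sorted({topic for seq in sequences for topic in seq})
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        path = [START] + list(seq) + [END]
        for prev, nxt in zip(path, path[1:]):
            counts[prev][nxt] += 1
    targets = topics + [END]
    log_prob = {}
    for prev in [START] + topics:
        # Smoothing keeps unseen transitions possible but unlikely.
        total = sum(counts[prev].values()) + smoothing * len(targets)
        log_prob[prev] = {
            nxt: math.log((counts[prev][nxt] + smoothing) / total)
            for nxt in targets
        }
    return log_prob


def sequence_log_prob(log_prob, seq):
    """Log probability of a whole topic sequence under the trained chain."""
    path = [START] + list(seq) + [END]
    return sum(log_prob[prev][nxt] for prev, nxt in zip(path, path[1:]))


# Hypothetical topic sequences taken from three reviews.
samples = [
    ["opinion", "plot", "director", "previous films", "opinion"],
    ["opinion", "plot", "actors", "director", "opinion"],
    ["plot", "actors", "director", "previous films", "opinion"],
]
model = train_transitions(samples)

likely = ["opinion", "plot", "director", "previous films", "opinion"]
unlikely = ["actors", "opinion", "previous films", "director", "actors", "plot"]
print(sequence_log_prob(model, likely) > sequence_log_prob(model, unlikely))
```

 With these samples the first ordering scores higher than the second, 
        because every one of its transitions was observed in training.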
 
 The researchers adapted standard hidden Markov model techniques 
        in several ways, said Lee. "We did not want to specify the set of topics 
        ahead of time, but rather wanted the system to automatically decide on 
        a set of topics itself," she said. The system clusters sentences that 
        have similar patterns, then treats the clusters as representations of 
        different topics.
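
 A minimal sketch of grouping sentences with similar wording follows. 
        It assumes a greedy single-pass clustering on word-overlap cosine 
        similarity, with a small stop-word list and a hand-picked threshold. 
        The article does not say which clustering method or sentence features 
        the researchers used, so every name and parameter here is hypothetical.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "in", "on", "of", "to", "with", "this"}


def bag_of_words(sentence):
    """Lowercased word counts with stop words removed."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_sentences(sentences, threshold=0.3):
    """Put each sentence in the most similar cluster, or start a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [str]}
    for sentence in sentences:
        vec = bag_of_words(sentence)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [sentence]})
        else:
            best["centroid"].update(vec)  # add the word counts to the centroid
            best["members"].append(sentence)
    return [cluster["members"] for cluster in clusters]


sentences = [
    "The director keeps the camera moving.",
    "The plot follows two brothers on a road trip.",
    "The director also worked with this camera crew before.",
    "The plot twists arrive late in the road trip.",
]
for group in cluster_sentences(sentences):
    print(group)
```

 Each resulting group can then stand in for one topic, without anyone 
        naming the topics in advance.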
 
 This is useful because it is automatic and because computers can 
        pick up subtle patterns in documents that humans are not consciously aware 
        of.
 
 The tricky part, however, is dealing with digression. "We humans 
        understand the phenomenon of digression, [but] digressions can really 
        confuse computers, [which] rely on statistical regularities," said Lee. 
        "The computer sees complete chaos and doesn't understand the meta-pattern 
        of off-topic commentary."
 
 The researchers incorporated a mathematical model of previously unseen 
        topics to deal with digression.
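
 One simple way to get a comparable effect is to add a catch-all state 
        that competes with the known topics. In the sketch below each known 
        topic is a smoothed unigram word model, and the catch-all gives every 
        word the same flat probability, so it wins only when no known topic 
        explains a sentence well. This is an assumed stand-in, not the 
        researchers' formulation, which the article does not spell out; the 
        training sentences and names are hypothetical.

```python
import math
from collections import Counter

OTHER = "<digression>"


def tokenize(text):
    return text.lower().split()


def train_unigram_models(examples, smoothing=0.1):
    """examples: dict mapping topic -> list of sentences on that topic."""
    vocab = {w for sents in examples.values() for s in sents for w in tokenize(s)}
    vocab_size = len(vocab) + 1  # one extra slot stands in for unseen words
    models = {}
    for topic, sents in examples.items():
        counts = Counter(w for s in sents for w in tokenize(s))
        total = sum(counts.values()) + smoothing * vocab_size
        models[topic] = (counts, total)
    return models, vocab_size


def label_sentence(sentence, models, vocab_size, smoothing=0.1):
    """Return the best-scoring topic, or OTHER if the flat model scores higher."""
    words = tokenize(sentence)
    scores = {
        topic: sum(math.log((counts[w] + smoothing) / total) for w in words)
        for topic, (counts, total) in models.items()
    }
    # The catch-all state assigns every word the same probability.
    scores[OTHER] = len(words) * math.log(1.0 / vocab_size)
    return max(scores, key=scores.get)


examples = {
    "plot": ["the story follows two brothers", "the story ends with a twist"],
    "director": [
        "the director frames every shot carefully",
        "the director made two thrillers",
    ],
}
models, vocab_size = train_unigram_models(examples)

print(label_sentence("the story ends with two brothers", models, vocab_size))
print(label_sentence("my popcorn was stale", models, vocab_size))
```

 A sentence made mostly of words no topic has seen falls to the 
        catch-all state instead of being forced into the nearest known topic.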
 
 Modeling document content from a global perspective turned out 
        to be an advantage, according to the researchers' tests. In one experiment, 
        the method outperformed a state-of-the-art sentence-level method by 79 
        percent, according to Lee.
 
 The method works only in relatively formulaic domains and needs 
        a sample of documents and corresponding summaries for training. "The domain 
        of discourse needs to be formulaic enough for a computer to be able to 
        find patterns of language use," said Lee. Fortunately, "many domains of 
        interest to us have this property: for example, news articles about specific 
        types of events tend to be written in rather stereotypical ways," she 
        said.
 
 The model can already be used to produce capsule summaries in restricted 
        domains. Adapting the model to provide better search engine results 
        could take 10 years, said Lee.
 
 Lee's research colleague was Regina Barzilay from the Massachusetts 
        Institute of Technology. The researchers presented the work at the North 
        American Chapter of the Association for Computational Linguistics Human 
        Language Technology (HLT/NAACL) 2004 conference in Boston, Massachusetts, 
        May 2 to 7. The research was funded by the National Science Foundation 
        (NSF) and the Alfred P. Sloan Foundation.
 
 Timeline:   Now, > 10 years
 Funding:   Government; Private
 TRN Categories:  Natural Language Processing; Databases and 
        Information Retrieval
 Story Type:   News
 Related Elements:  Technical paper, "Catching the Drift: 
        Probabilistic Content Models, with Applications to Generation and Summarization," 
        North American Chapter of the Association for Computational Linguistics 
        Human Language Technology Conference (HLT/NAACL) 2004, May 2-7, Boston, 
        Massachusetts
 
 
 
 
 July 28/August 4, 2004
 