Webs within Web boost searches
        
By Kimberly Patch, Technology Research News
November 13/20, 2002
Internet search engines routinely use information about the text contained in pages and the links between pages to return relevant search results. The approach works reasonably well, but less is known about why these relationships between text and links exist.
         
A researcher from the University of Iowa has expanded the utility of using text and links in search engines with a mathematical model that divides a large network like the Web into small local webs.
         
        A Web crawler designed to completely traverse a small Web will provide 
        more comprehensive coverage of a topic than typical search engines, according 
        to Filippo Menczer, an assistant professor of management sciences at the 
        University of Iowa. "My result shows that it is possible to design efficient 
        Web crawling algorithms -- crawlers that can quickly locate any related 
        page among the billions of unrelated pages in the Web," he said.  
         
        Menczer's earlier work showed how similarities in pages' text related 
        to the Web's link structure.  
         
        His latest work has expanded the concept by looking at a large number 
        of pairs of pages from the entire Web and studying the relationships between 
        three measures of similarity -- text, links and meaning -- across those 
        pages. "A better understanding of the relationships between the cues available 
        to us -- such as words and links -- about the meaning of Web pages is 
        essential in designing better ranking and crawling algorithms, which determine 
        how well a search engine works," Menczer said.  
         
The brute-force approach gave Menczer enough data to uncover power-law
        relationships between textual content and Web page popularity and between 
        semantic, or categorical, distance and Web page popularity. "From a sample 
        of 150,000 pages taken from all top-level categories in the Open Directory, 
        I considered every possible pair of pages, resulting in almost 4 billion 
        pairs," said Menczer. The pattern would have been difficult to notice 
        with smaller or nonrandom samples, he said.  
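As a rough illustration of this kind of pairwise measurement, the sketch below computes cosine similarity between the word-frequency vectors of every pair of pages in a toy sample. The page texts are invented stand-ins for Open Directory pages; this is not Menczer's actual measurement code, and it covers only the textual cue, not links or categories.

    from collections import Counter
    from itertools import combinations
    from math import sqrt

    def cosine_similarity(text_a, text_b):
        """Cosine similarity between two pages' word-frequency vectors."""
        a, b = Counter(text_a.split()), Counter(text_b.split())
        dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
        norm = sqrt(sum(v * v for v in a.values())) * \
               sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    # Toy stand-ins for the 150,000 Open Directory pages in the study.
    pages = {
        "page1": "quantum computing research news and results",
        "page2": "quantum physics research papers and news",
        "page3": "soccer league scores and match reports",
    }

    # Every possible pair, as in the brute-force analysis described above.
    for (id_a, text_a), (id_b, text_b) in combinations(pages.items(), 2):
        print(id_a, id_b, round(cosine_similarity(text_a, text_b), 3))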
         
Menczer used the data in a mathematical model that predicts Web growth, and showed that the model accurately predicted the way links are distributed across the Web. "The Web growth model based on local content predicts the link... distribution," he said.
         
        The model is based on the idea that Web page authors link to the most 
        popular or important pages in their subject areas, said Menczer. The question 
is how they do this in practice without global knowledge of page popularity.
        Many existing models simply assume that a Web page author has knowledge 
        of every Web site.  
         
        Menczer's model uses local content as a way to determine the probable 
        distribution of links in a network. "In this sense the new model is more 
        realistic because it is based on behavior that matches our intuition of 
        what authors do," he said.  
         
The model is relatively simple, Menczer said. "When you look at a new page, you link it to related pages which you know about with probability proportional to their... popularity," he said. The probability of a link between two given pages decreases as the text similarity between them decreases, he said, and the decrease follows a power law.
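A minimal sketch of a growth model in this spirit appears below. It is not Menczer's published model: the one-dimensional topic coordinates, the similarity function, and the exponent ALPHA are all illustrative assumptions. Each new page links to pages it "knows about" with probability proportional to their popularity (in-degree), discounted by a power-law decay in lexical distance.

    import random

    # Illustrative parameters, not values from the paper.
    ALPHA = 2.0          # assumed power-law exponent for the similarity decay
    LINKS_PER_PAGE = 3   # links added by each new page
    N_PAGES = 1000

    def similarity(topic_a, topic_b):
        """Toy lexical similarity: pages on nearby 'topics' look more alike."""
        return 1.0 / (1.0 + abs(topic_a - topic_b))

    def grow_web(n_pages):
        topics = [random.random()]   # one topic coordinate per page
        in_degree = [1]              # seed page starts with unit weight
        links = []
        for new in range(1, n_pages):
            topics.append(random.random())
            in_degree.append(1)
            # Link weight = popularity of the target times a power-law
            # decay in lexical distance (similarity ** ALPHA).
            weights = [in_degree[old] *
                       similarity(topics[new], topics[old]) ** ALPHA
                       for old in range(new)]
            # Targets sampled with replacement, for simplicity.
            for target in random.choices(range(new), weights=weights,
                                         k=min(LINKS_PER_PAGE, new)):
                links.append((new, target))
                in_degree[target] += 1
        return links, in_degree

    links, in_degree = grow_web(N_PAGES)
    print("max in-degree:", max(in_degree))

Because popular pages keep attracting links, such popularity-driven models tend to produce the heavy-tailed in-degree distributions observed on the real Web.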
         
         
        The model, based on local knowledge, sees the Web as clusters of smaller 
        webs of sites with similar topics. This bodes well for search engine developers, 
        who can design Web crawlers to use textual and categorical cues to completely 
        traverse a small Web in order to provide comprehensive coverage on a certain 
        topic, according to Menczer.  
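The sketch below shows a best-first topical crawler of the kind this picture motivates: the frontier is ranked by a crude lexical score against the topic, so the crawl tends to stay inside the local web around that topic. The fetch_page and extract_links callables are hypothetical stand-ins for a real fetcher and link extractor, and the scoring is deliberately simplistic.

    import heapq

    def topical_crawl(seed_urls, topic_terms, fetch_page, extract_links,
                      budget=100):
        """Best-first crawl: expand the most on-topic page next."""
        topic = set(topic_terms)
        frontier = [(-1.0, url) for url in seed_urls]  # max-heap via negation
        heapq.heapify(frontier)
        visited, results = set(), []
        while frontier and len(results) < budget:
            _, url = heapq.heappop(frontier)
            if url in visited:
                continue
            visited.add(url)
            text = fetch_page(url)                   # hypothetical fetcher
            words = set(text.lower().split())
            score = len(words & topic) / len(topic)  # crude topical score
            results.append((url, score))
            # Outlinks of on-topic pages get high priority, keeping the
            # crawl inside the local web around the topic.
            for link in extract_links(text, url):    # hypothetical extractor
                if link not in visited:
                    heapq.heappush(frontier, (-score, link))
        return results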
         
The research should allow for better ranking and crawling algorithms and more scalable search engines "where most pages of interest to a community of users can be located, indexed, and the semantic needs of users can be mapped into algorithms to distill the most related pages," Menczer said.
         
         
Menczer's research group is designing and evaluating topical Web crawlers,
        Menczer said. In addition, "we have some ideas on how to induce natural 
        collaborative activities in communities of users that can emerge spontaneously 
        in peer networks," he said. "Such activities will provide crawlers and 
        indexers with rich contexts to improve their performance," he added.  
         
        Some progress in crawling and ranking is possible within a few years, 
but a full understanding of the complex interrelationships between all
        sorts of information available on the Web will take longer to map out, 
        he said.  
         
        Menczer is working on visual maps that will allow for a better interpretation 
        of the relationships between text, links and the meaning of Web pages. 
         
         
        The work is useful and novel, said Shlomo Havlin, a physics professor 
        at Bar-Ilan University in Israel. "It extends previous work on networks 
        to [quantify] correlations between neighboring nodes. Such correlations 
        have been found in realistic social and computer networks," he said.  
         
        The research adds to network models information that could improve researchers' 
        understanding of aspects of networks like stability and immunization against 
        software viruses, Havlin said. "This work extends the general body of 
        research to include realistic features," he said.  
         
        Menczer published the research in the October 7, 2002 issue of Proceedings 
        of the National Academy of Sciences. The research was funded by the National 
        Science Foundation (NSF).  
         
Timeline: > 3 years
Funding: Government
TRN Categories: Internet
Story Type: News
Related Elements: Technical paper, "Growing and Navigating the Small World Web by Local Content," Proceedings of the National Academy of Sciences, October 7, 2002.