Webs within Web boost searches
Technology Research News
Internet search engines routinely use information
about the text contained in pages and the links between pages to return
relevant search results. The approach works reasonably well, but
less is known about why these textual and link relationships exist.
A researcher from the University of Iowa has expanded the utility of using
text and links in search engines with a mathematical model that divides
a large network like the Internet into small local Webs.
A Web crawler designed to completely traverse a small Web will provide
more comprehensive coverage of a topic than typical search engines, according
to Filippo Menczer, an assistant professor of management sciences at the
University of Iowa. "My result shows that it is possible to design efficient
Web crawling algorithms -- crawlers that can quickly locate any related
page among the billions of unrelated pages in the Web," he said.
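The article does not describe Menczer's crawling algorithm in detail. As a rough illustration of the general idea, a minimal best-first topical crawler can prioritize frontier pages by their textual similarity to a topic; the function names, the Jaccard similarity measure, and the toy in-memory link graph below are all invented for illustration:

```python
import heapq

def text_similarity(a, b):
    # Jaccard similarity over word sets -- a simple stand-in for
    # the lexical similarity cues discussed in the article.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def topical_crawl(start, query, pages, links, budget=10):
    """Best-first crawl: always expand the frontier page whose
    text is most similar to the query (hypothetical sketch)."""
    frontier = [(-text_similarity(pages[start], query), start)]
    visited, seen = [], {start}
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                sim = text_similarity(pages[nxt], query)
                heapq.heappush(frontier, (-sim, nxt))
    return visited

# Toy "small Web" standing in for a topical cluster.
pages = {
    "a": "search engines crawl the web",
    "b": "web crawlers index pages by topic",
    "c": "cooking recipes and kitchen tips",
}
links = {"a": ["b", "c"], "b": [], "c": []}
order = topical_crawl("a", "topical web crawlers", pages, links)
```

Because the crawler expands the most on-topic page first, the related page "b" is visited before the unrelated page "c".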
Menczer's earlier work showed how similarities in pages' text related
to the Web's link structure.
His latest work has expanded the concept by looking at a large number
of pairs of pages from the entire Web and studying the relationships between
three measures of similarity -- text, links and meaning -- across those
pages. "A better understanding of the relationships between the cues available
to us -- such as words and links -- about the meaning of Web pages is
essential in designing better ranking and crawling algorithms, which determine
how well a search engine works," Menczer said.
The brute force approach gave Menczer enough data to uncover power-law
relationships between textual content and Web page popularity and between
semantic, or categorical, distance and Web page popularity. "From a sample
of 150,000 pages taken from all top-level categories in the Open Directory,
I considered every possible pair of pages, resulting in almost 4 billion
pairs," said Menczer. The pattern would have been difficult to notice
with smaller or nonrandom samples, he said.
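As a small-scale sketch of this brute-force pairwise comparison: the article does not specify the textual similarity measure used, so cosine similarity over word-frequency vectors is an assumed, commonly used choice here, and the names are illustrative only:

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two word-frequency vectors -- an
    assumed stand-in for the study's textual similarity measure."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ["web search engines", "web crawlers search pages", "quantum physics"]
# Consider every unordered pair, as in the brute-force study.
sims = {(i, j): cosine(docs[i], docs[j])
        for i, j in combinations(range(len(docs)), 2)}
```

The two pages about the Web score well above zero against each other, while the off-topic page scores zero against both.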
Menczer used the data in a mathematical model that predicts Web growth,
and showed that the model accurately predicted the way links are distributed
in the Internet. "The Web growth model based on local content predicts
the link... distribution," he said.
The model is based on the idea that Web page authors link to the most
popular or important pages in their subject areas, said Menczer. The question
is how they do this practically without a global knowledge of page popularity.
Many existing models simply assume that a Web page author has knowledge
of every Web site.
Menczer's model uses local content as a way to determine the probable
distribution of links in a network. "In this sense the new model is more
realistic because it is based on behavior that matches our intuition of
what authors do," he said.
The model is relatively simple, Menczer said. "When you look at a new
page, you link it to related pages which you know about with probability
proportional to their... popularity," he said. The probability of linking
between given pages decreases as the text similarity between them decreases,
he said. The relationship between link probability and text similarity
follows a power law.
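The growth rule described here -- link to known related pages with probability proportional to their popularity -- can be sketched as a content-biased preferential-attachment process. The weighting of degree by similarity, the seed convention, and all names below are assumptions for illustration, not the model's exact form:

```python
import random

def grow_network(n, similarity, seed=0):
    """Grow a link network where each new page links to an existing
    page with probability proportional to (degree * similarity) --
    a hedged sketch, not Menczer's published model."""
    rng = random.Random(seed)
    degree = {0: 1}  # seed page given degree 1 so it can attract links
    links = []
    for new in range(1, n):
        weights = [degree[old] * similarity(new, old) for old in range(new)]
        total = sum(weights)
        if total == 0:
            target = rng.randrange(new)  # no related page known: pick at random
        else:
            r = rng.uniform(0, total)
            target = 0
            for old, w in enumerate(weights):
                r -= w
                if r <= 0:
                    target = old
                    break
        links.append((new, target))
        degree[new] = degree.get(new, 0) + 1
        degree[target] += 1
    return links, degree
```

With a similarity function that favors same-topic pages, links concentrate within topical clusters, giving the "webs within the Web" picture described below.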
The model, based on local knowledge, sees the Web as clusters of smaller
webs of sites with similar topics. This bodes well for search engine developers,
who can design Web crawlers to use textual and categorical cues to completely
traverse a small Web in order to provide comprehensive coverage on a certain
topic, according to Menczer.
The research should allow for better ranking and crawling algorithms and more
scalable search engines "where most pages of interest to a community of
users can be located, indexed, and the semantic needs of users can be
mapped into algorithms to distill the most related pages," Menczer said.
Menczer's research group is designing and evaluating topical Web crawlers,
Menczer said. In addition, "we have some ideas on how to induce natural
collaborative activities in communities of users that can emerge spontaneously
in peer networks," he said. "Such activities will provide crawlers and
indexers with rich contexts to improve their performance," he added.
Some progress in crawling and ranking is possible within a few years,
but a full understanding of the complex inter-relationships between all
sorts of information available on the Web will take longer to map out, he said.
Menczer is working on visual maps that will allow for a better interpretation
of the relationships between text, links and the meaning of Web pages.
The work is useful and novel, said Shlomo Havlin, a physics professor
at Bar-Ilan University in Israel. "It extends previous work on networks
to [quantify] correlations between neighboring nodes. Such correlations
have been found in realistic social and computer networks," he said.
The research adds information to network models that could improve researchers'
understanding of network properties such as stability and immunization against
software viruses, Havlin said. "This work extends the general body of
research to include realistic features," he said.
Menczer published the research in the October 7, 2002 issue of Proceedings
of the National Academy of Sciences. The research was funded by the National
Science Foundation (NSF).
Timeline: > 3 years
TRN Categories: Internet
Story Type: News
Related Elements: Technical paper, "Growing and Navigating
the Small World Web by Local Content," Proceedings of the National Academy
of Sciences, October 7, 2002.