Web pages cluster by content type

By Kimberly Patch, Technology Research News

What makes the Web so useful is the vast amount of information it spans. And this is also what makes it so frustrating.

The challenge is indexing the Web in a way that allows people to find information quickly and painlessly. Scientists struggling with this problem have found that the Internet harbors more correlations among the types of information it holds than was at first apparent.

Information retrieval methods have long counted on correlations between word matches and meaning to find pages that are similar to each other.

A University of Iowa researcher has confirmed that there are also correlations between link distance and content, and link distance and meaning. "If two pages are separated by [only] a few links, then they are also similar in content and in meaning," said Filippo Menczer, an assistant professor of management sciences at the University of Iowa.

Untangling the correlations that exist among different aspects of the Web could be one key to better organizing its vast reaches.

The idea is that there are many notions of distance on the Web, and studying the relationships among these types of distance will provide cues to the relationships among Web pages, said Menczer. "It's like using cues in a physical environment. Suppose you are at a picnic in a park and you have to find the apple pie with your eyes closed. When the smell gets stronger you know you're getting closer. So the strength of the smell signal is correlated with a physical distance," he said.

To verify the link-content correlation he measured the similarity of the words of many pairs of pages and the number of links that must be clicked to get from one to another. He also measured the link distances between pages that human experts had determined were similar in meaning.
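The two quantities compared here, word similarity and link distance, can be illustrated with a small sketch. It is not the method used in the paper: the toy pages, the link table, the choice of cosine similarity over raw word counts, and the function names are all assumptions made for illustration.

```python
from collections import Counter, deque
import math

# Hypothetical toy data (not from the study): page text and outgoing links.
PAGES = {
    "a": "apple pie recipe with cinnamon",
    "b": "apple pie baking tips",
    "c": "baking bread at home",
    "d": "used car prices and loans",
}
LINKS = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}


def cosine_similarity(text1, text2):
    """Cosine similarity between the word-count vectors of two texts."""
    v1 = Counter(text1.lower().split())
    v2 = Counter(text2.lower().split())
    dot = sum(v1[w] * v2[w] for w in v1)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


def link_distance(links, source, target):
    """Fewest clicks from source to target (breadth-first search), or None."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        page, dist = queue.popleft()
        if page == target:
            return dist
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # target cannot be reached by following links


def similarity_by_distance(pages, links, source):
    """Map each reachable page to (link distance, similarity to source)."""
    result = {}
    for page in pages:
        if page == source:
            continue
        dist = link_distance(links, source, page)
        if dist is not None:
            result[page] = (dist, cosine_similarity(pages[source], pages[page]))
    return result


if __name__ == "__main__":
    for page, (dist, sim) in sorted(similarity_by_distance(PAGES, LINKS, "a").items()):
        print(page, dist, round(sim, 3))
```

In this toy data the page one click away shares words with the starting page and the pages farther away do not, which is the kind of pattern the study looked for across many real page pairs.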

"My results show that links... tell us a lot about the content and meaning of pages. This helps [us] understand why algorithms like Google's PageRank... work so well. They use links to estimate the meaning of pages," he said.

The strength of the correlations between links, text and meaning was surprising, said Menczer. "I found that beyond four or five links away, the probability [of finding] a relevant page is reduced to random chance," he said.

Menczer also found that the results varied depending on the type of domain he was measuring. "If you are browsing through Web sites of educational institutions, the signals are significantly more reliable than if you are surfing commercial sites," meaning the probability of finding a relevant page drops faster when you click away from commercial sites, he said. "In other words, you can get lost in cyberspace much faster when you're shopping online than when you are browsing a class syllabus," he said.

Taken together with two other recent findings in Web structure, the results could help build Web crawlers that do a better job of indexing, and cover more of the Web.

The Web is a small-world network: most of its pages sit in tightly linked clusters with a regular topology, but enough random links act as tunnels between clusters to reduce the average number of links between pages. This is the reason for the six degrees of separation phenomenon, the idea that any person in the United States, or any Web page, can be reached from any other through a short chain of connections, about six in the classic version, among people who know people or among pages that are linked.
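The effect of a few tunnel links on average distance can be seen in a toy graph. The sketch below is an illustration under stated assumptions, not a model of the real Web: the 20-node ring, the two hand-picked shortcuts, and the function names are all hypothetical.

```python
from collections import deque


def ring_lattice(n):
    """Undirected ring: each node is linked only to its two neighbours."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def add_shortcut(graph, a, b):
    """Add one undirected long-range link (a 'tunnel') between a and b."""
    graph[a].add(b)
    graph[b].add(a)


def average_path_length(graph):
    """Mean shortest-path length over all ordered pairs of distinct nodes.

    Assumes the graph is connected; distances come from breadth-first search.
    """
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# Regular topology only: every page links just to its neighbours.
graph = ring_lattice(20)
before = average_path_length(graph)

# Two hypothetical shortcuts across the ring.
add_shortcut(graph, 0, 10)
add_shortcut(graph, 5, 15)
after = average_path_length(graph)

if __name__ == "__main__":
    print(before, after)
```

Adding links can never lengthen a shortest path, so `after` is at most `before`; here it is strictly smaller because nodes 0 and 10, previously ten steps apart, become direct neighbours.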

At the same time, it has become clear that finding these short paths to information is sometimes very difficult. The new correlations may help.

"The research I'm doing might shed light on this problem and help us understand whether it is theoretically possible to build efficient Web crawlers -- agents that can find target pages in a reasonable time through local lexical and link cues," said Menczer.

Measuring and documenting the relationships between the structure of the Web and its content is clearly important, said Soumen Chakrabarti, an assistant professor of computer science at the Indian Institute of Technology in Bombay. "It has also been measured before, but not as systematically as in Menczer's paper," he said.

"Menczer takes an important step of modeling the coupling formally" and his model treats the link-content relation more deeply than past research efforts, Chakrabarti added.

Menczer is working on Web crawlers that will take advantage of these topological findings. "The crawlers that now build a search engine's index... do not use knowledge about what the users are interested in," he said. Menczer's prototype Web crawler, dubbed MySpiders, is designed to better harness the clues in links and to integrate them with information from Web page content, he said.
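One generic way a crawler can combine link cues with page content is best-first search: links found on relevant pages are followed before links found on irrelevant ones. The sketch below shows that general technique only. It is not the MySpiders algorithm, which the article does not describe; the in-memory `WEB` table, the word-overlap `relevance` score, and the function names are assumptions for illustration.

```python
import heapq

# Hypothetical in-memory "Web": page -> (text, outgoing links).
WEB = {
    "home": ("university home page", ["courses", "store"]),
    "courses": ("course list algorithms syllabus", ["algo", "history"]),
    "store": ("campus store shirts mugs", ["mugs"]),
    "algo": ("algorithms syllabus graph search", []),
    "history": ("history syllabus", []),
    "mugs": ("coffee mugs sale", []),
}


def relevance(text, query):
    """Fraction of query words that appear in the page text."""
    words = set(text.split())
    terms = query.split()
    if not terms:
        return 0.0
    return sum(t in words for t in terms) / len(terms)


def best_first_crawl(web, seed, query, budget):
    """Visit up to `budget` pages, expanding the most promising link first.

    A link's priority is the relevance of the page it was found on, so the
    crawler relies on the assumption that relevant pages link to relevant pages.
    """
    frontier = [(0.0, seed)]  # (negated priority, page); heapq is a min-heap
    seen = {seed}
    visited = []
    while frontier and len(visited) < budget:
        _, page = heapq.heappop(frontier)
        text, links = web[page]
        score = relevance(text, query)
        visited.append((page, score))
        for nxt in links:
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score, nxt))
    return visited


if __name__ == "__main__":
    for page, score in best_first_crawl(WEB, "home", "algorithms syllabus", 4):
        print(page, score)
```

With a budget of four pages and the query "algorithms syllabus", this toy crawl spends its visits on the course pages and never reaches the store pages, whose parent page scored zero.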

This type of search engine could technically be ready for practical use within one or two years, said Menczer. The research was funded by the University of Iowa.

Timeline: 1-2 years
Funding: University
TRN Categories: Internet
Story Type: News
Related Elements: Technical paper, "Links Tell Us about Lexical and Semantic Web Content," posted on the arXiv physics archive at http://xxx.lanl.gov/abs/cs.IR/0108004. MySpiders Web crawler site: myspiders.biz.uiowa.edu

January 16, 2002