Ties that bind boost searches

By Kimberly Patch, Technology Research News

Search for Michael Jordan on the Web and chances are you'll have a hard time finding the scientist of that name.

If you happen to know specifics about Jordan's branch of science, you can augment the search with a term like "Bayesian networks" in order to divert the inevitable flood of references to the popular athlete. But if you don't already know something about the scientist, you'll have to slog through many results to find the right guy.

Scientists from NEC Corporation's Research Institute are attempting to solve this basic Web search problem with an algorithm that takes advantage of the linking structure of the Web. "The solution... is to find an efficient and effective means of partitioning the Web in a manner that is independent of text content," said Gary William Flake, a research scientist at NEC Research.

The algorithm could lead to more precise search engines, according to Flake.

Organizing the Web in a way that is not dependent on the language contained within web pages is important because language is ambiguous. "If we tried to cluster occurrences of 'Michael Jordan' by text content, we could run into many difficulties related to the ambiguity of languages," said Flake. In addition, "there may be many more 'Michael Jordans' out there, so we can't just arbitrarily choose to separate athletes from scientists," he said.

One way to organize the Web is to introduce some type of centralized hierarchy by hand, said Flake. Portals like Yahoo, for instance, employ people to index portions of the Web. But using people to comb through Web pages runs into a limit "related to how many humans you can get to do the dirty work," he said. "I suspect that there will never be a portal that indexes more than 10 million pages simply because it is too labor-intensive to do so."

Between automatic search engines and portals that use humans to organize content there is an implicit trade-off between reach and precision, said Flake. "On a search engine, you get back many results, but a lot of them are garbage. On portals, you get a small set of high-quality results, but with a lot of stuff missing," he said.

The algorithm attempts to span both approaches by providing a way to automatically organize the information based on the structure imposed on the Web by the way connections grow among pages.

Although it may seem random, the Internet actually contains a tremendous amount of structure, said Flake. The Internet is scale-free, meaning a few nodes, or pages, have many links to other nodes, while most nodes have just a few. It is also a small-world network, meaning there are enough shortcuts between the heavily linked nodes, or hubs, that it is possible to get from any one node to any other in just a few hops. The same phenomenon is responsible for the six degrees of separation found in social networks.
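A toy graph makes the small-world property concrete: two high-degree hubs joined by a single shortcut keep every page within a few hops of every other. This is an illustrative sketch, not the researchers' code.

```python
from collections import deque

# Toy "scale-free" graph: two hubs (h1, h2) with several spoke pages each,
# joined by one shortcut link between the hubs.
graph = {"h1": ["a1", "a2", "a3", "h2"],
         "h2": ["b1", "b2", "b3", "h1"]}

# Treat links as bidirectional for path-finding.
adj = {}
for u, vs in graph.items():
    for v in vs:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

def hops(src, dst):
    """Breadth-first search: number of links on a shortest path."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == dst:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None

print(hops("a1", "b3"))  # spoke -> hub -> hub -> spoke: 3 hops
```

Even though most pages have only one link, the hubs keep all of them within three hops of each other.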

The algorithm connects the Web's linking structure to communities like science, finance, health, education, or recreation. A Web community has more links among community members than it does outside of the community. There are several algorithms that track links to find sets of related pages, but they're either inefficient or cover only a portion of the Web, according to Flake.
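That membership criterion can be read page by page: each community page links to more members than non-members. A minimal sketch of that test, using made-up toy data rather than anything from the paper:

```python
def is_community(members, links):
    """True if every member page has more links inside the set than outside.
    links maps a page to the set of pages it links to."""
    members = set(members)
    for page in members:
        inside = sum(1 for p in links.get(page, ()) if p in members)
        outside = len(links.get(page, ())) - inside
        if inside <= outside:
            return False
    return True

# Toy link data: pages a, b, c link densely to each other,
# with only an occasional link out to pages x and y.
links = {"a": {"b", "c", "x"},
         "b": {"a", "c"},
         "c": {"a", "b", "y"}}

print(is_community({"a", "b", "c"}, links))  # True: each page links mostly inward
print(is_community({"a", "b"}, links))       # False: a links out more than in
```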

The key to coming up with an efficient algorithm for identifying communities was to start with a set of hand-picked seed sites, he said. "The entries within a portal category typically make excellent seeds."

The algorithm looks to the Web's structure to calculate a community that contains the seed sites, said Flake. "In this way, we can improve the [reach] of a portal while preserving the precision," he said.
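The researchers' paper formulates this as a maximum-flow problem: attach a virtual source to the seed pages, pick a sink on the far side of the graph, and the minimum cut that limits the flow separates the community from the rest of the Web. The sketch below is a simplified illustration of that idea on a toy graph with unit capacities, not their implementation; the actual algorithm's capacity assignments and iterative crawling steps are omitted.

```python
from collections import deque

def max_flow_community(edges, seeds, sink):
    """Seed-based community extraction sketch: Edmonds-Karp max flow from a
    virtual source attached to the seeds, then return the source side of the
    minimum cut (the pages still reachable in the residual graph)."""
    # Residual capacities; links are treated as undirected with unit capacity.
    res = {}
    def add(u, v, c):
        res.setdefault(u, {}).setdefault(v, 0)
        res.setdefault(v, {}).setdefault(u, 0)
        res[u][v] += c
    for u, v in edges:
        add(u, v, 1)
        add(v, u, 1)
    for s in seeds:
        add("SOURCE", s, float("inf"))

    def bfs_path():
        # Shortest augmenting path from SOURCE to sink in the residual graph.
        parent = {"SOURCE": None}
        q = deque(["SOURCE"])
        while q:
            u = q.popleft()
            for v, c in res.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    if v == sink:
                        return parent
                    q.append(v)
        return None

    while (parent := bfs_path()):
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[u][w] for u, w in path)
        for u, w in path:
            res[u][w] -= bottleneck
            res[w][u] += bottleneck

    # Community = pages still reachable from the virtual source.
    seen, q = {"SOURCE"}, deque(["SOURCE"])
    while q:
        u = q.popleft()
        for v, c in res.get(u, {}).items():
            if c > 0 and v not in seen:
                seen.add(v)
                q.append(v)
    seen.discard("SOURCE")
    return seen

# Pages a, b, c are densely linked; d and e sit outside, near the sink z.
community = max_flow_community(
    [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
     ("d", "z"), ("d", "e"), ("e", "z")],
    seeds=["a"], sink="z")
print(sorted(community))  # ['a', 'b', 'c']
```

Starting from the single seed page a, the cut falls on the lone link between c and d, so the dense cluster is recovered without reading any page text.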

The algorithm then does a popularity-like ranking within the community.
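The article doesn't specify how that ranking works, so as a purely illustrative stand-in, one simple popularity-style score counts the in-links each page receives from fellow community members:

```python
def rank_by_inlinks(community, links):
    """Rank community pages by how many other members link to them:
    a crude stand-in for the popularity-style score the article mentions."""
    members = set(community)
    score = {p: 0 for p in members}
    for page in members:
        for target in links.get(page, ()):
            if target in members and target != page:
                score[target] += 1
    return sorted(members, key=lambda p: score[p], reverse=True)

# Toy data: both a and c link to b, so b ranks first.
links = {"a": {"b"}, "c": {"b"}, "b": {"a"}}
print(rank_by_inlinks({"a", "b", "c"}, links))  # ['b', 'a', 'c']
```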

The group's continuing work crosses three distinct areas, said Flake. "On the mathematical front, we [are] refining our algorithm to make it more efficient and more accurate." The group is also working to mathematically characterize the properties of communities, he said.

"On the engineering front, my group is working on improving our hardware and software infrastructure so that our algorithms will scale up to the case where we're dealing with billions of Web pages," he said.

On the scientific front, the researchers are studying how large-scale communities on the Web relate to one another and how new communities emerge, said Flake.

"For the Web, I'm hoping that we can turn this into niche search engines, automatic portals, smart content filters, and even user agents," said Flake.

The community algorithm may eventually also be applicable to other types of networks, said Flake. "I would like to see if we can effectively apply this to other data sets that have a network-like structure, such as those found in biology," he said.

The researchers have found a novel way to infer content from links without using textual information, said Filippo Menczer, an assistant professor at the University of Iowa. The work "pushes the envelope with respect to just how much useful information we can extract from the self-organizational Web hypertext -- the fact that people selectively link their Web pages to other, probably related pages," he said.

The method could prove useful for augmenting search engines, said Menczer. "For example, Google burned the competition when it first introduced link analysis into the realm of commercial search engines. There are still many clues hidden in the Web that can be exploited, both in text and link information," he said. "It's an incredibly rich environment and we have only begun its exploration."

One issue with this particular method is how much time the algorithm takes. Because the Web is so large, it may be impractical to crawl the Web to the depth the algorithm requires, Menczer said. "An interesting direction for this work is to explore efficient ways to apply the idea. While the algorithm does not make any use of text information, it is possible that using text to guide the preliminary crawls" may speed community identification, he said.

The researchers are aiming to make a viable content filter that they can demonstrate within a year, said Flake. "Later, on the order of two years from now, we may introduce a niche search engine that ultimately customizes itself to individual users," he said.

Flake's research colleagues were Steve Lawrence, C. Lee Giles and Frans M. Coetzee. They published the research in the March 2002 issue of the journal IEEE Computer. The research was funded by NEC.

Timeline:   1 year, 2 years
Funding:   Corporate
TRN Categories:   Internet
Story Type:   News
Related Elements:  Technical paper, "Self-Organization of the Web and Identification of Communities," IEEE Computer, March 2002.


March 13, 2002


© Copyright Technology Research News, LLC 2000-2006. All rights reserved.