Queries guide Web crawlers

By Kimberly Patch, Technology Research News

Only a small percentage of the Internet's vast collection of information is indexed by search engines, which makes it important to improve the way search engines find what they do index.

Researchers from Contraco Consulting and Software Ltd., T-Online International and Siegen University in Germany have written an algorithm that improves Internet search results by factoring in what people are looking for. The researchers took their cue from the audience analysis that drives format and programming changes in television.

The algorithm, dubbed Vox Populi, picks up trends by analyzing patterns in people's Web searching behavior, then directs search engine crawlers to more thoroughly index relevant sites, according to Andreas Schaale, a partner at Contraco Consulting and Software. For instance, "if we see that the amount of queries about soccer is growing before entering the World Cup, this algorithm would give more resources for... soccer sites," he said.

The algorithm analyzes the queries people use to ask for information to find those that represent what the average user is searching for, then sends these to the Web crawler component of an existing search system with instructions to give the relevant domains more crawler resources. The algorithm determines how much more attention each domain should gain. Web crawlers travel around the Web making the raw indexes of Web pages that search engines use.
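The step described above, turning query demand into crawler resources per domain, can be sketched roughly as follows. This is a minimal illustration, not the researchers' formula, which the article does not give. The function name, the keyword-to-domain mapping, and the `base_share` parameter are all assumptions made for the example.

```python
from collections import Counter


def allocate_crawl_budget(queries, topic_domains, total_budget, base_share=0.2):
    """Split a crawl budget across domains in proportion to query demand.

    queries: query strings taken from a search log
    topic_domains: hypothetical mapping of a topic keyword to its domains
    total_budget: number of page fetches available in this crawl cycle
    base_share: fraction of the budget spread evenly so no domain is starved
    """
    domains = sorted({d for ds in topic_domains.values() for d in ds})
    if not domains:
        return {}

    # Count how many queries mention a topic tied to each domain.
    demand = Counter()
    for query in queries:
        words = set(query.lower().split())
        for topic, topic_sites in topic_domains.items():
            if topic in words:
                for domain in topic_sites:
                    demand[domain] += 1

    total_demand = sum(demand.values())
    even_part = total_budget * base_share / len(domains)
    weighted_part = total_budget * (1 - base_share)

    budget = {}
    for domain in domains:
        # With no matching queries at all, fall back to an even split.
        share = demand[domain] / total_demand if total_demand else 1 / len(domains)
        budget[domain] = even_part + weighted_part * share
    return budget


if __name__ == "__main__":
    log = ["soccer scores", "soccer world cup", "stock prices"]
    sites = {"soccer": ["soccer.example"], "stock": ["finance.example"]}
    print(allocate_crawl_budget(log, sites, total_budget=1000))
```

In this sketch a rise in soccer queries shifts fetches toward the soccer domain, while the even share keeps less popular domains from dropping out of the index entirely.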

Internet searching has gone through several changes in the past decade. The first search engines, like AltaVista, ranked purely on relevancy. Today the major search engines use static rank algorithms, which also consider domain popularity. Google introduced this method in 1997.

Web crawlers have evolved as well. Focused crawlers index pages related to specific topics, and adaptive crawlers reorder their lists of uncrawled pages based on the relevancy of the pages they have crawled.
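The reordering that adaptive crawlers perform can be pictured as a priority queue of uncrawled pages. The sketch below is a simplified illustration under stated assumptions: the `Frontier` class is hypothetical, relevance scores are assumed to lie between 0 and 1, and each uncrawled link simply inherits the best relevance score of any crawled page that points to it.

```python
import heapq


class Frontier:
    """Queue of uncrawled URLs, ordered by the relevance of linking pages."""

    def __init__(self):
        self._heap = []      # entries are (-score, url); heapq pops the smallest
        self._best = {}      # url -> best score seen so far
        self._done = set()   # URLs already handed out for crawling

    def add_links(self, parent_relevance, links):
        """Record links found on a crawled page with the given relevance."""
        for url in links:
            if url in self._done:
                continue
            if parent_relevance > self._best.get(url, -1.0):
                # A more relevant page links here, so move the URL up the queue.
                self._best[url] = parent_relevance
                heapq.heappush(self._heap, (-parent_relevance, url))

    def pop(self):
        """Return the most promising uncrawled URL, or None if none remain."""
        while self._heap:
            neg_score, url = heapq.heappop(self._heap)
            if url in self._done or self._best.get(url) != -neg_score:
                continue  # stale entry left behind by an earlier, lower score
            self._done.add(url)
            del self._best[url]
            return url
        return None


if __name__ == "__main__":
    frontier = Frontier()
    frontier.add_links(0.2, ["a.example/1", "b.example/1"])
    frontier.add_links(0.9, ["b.example/1", "c.example/1"])
    print(frontier.pop())
```

Pushing a fresh entry and skipping stale ones on the way out avoids having to search the heap to update a score in place.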

Vox Populi also takes into account the subjects the average user is searching for. The algorithm "answers the question 'What are most of the people searching for?'" said Schaale. Vox Populi does not replace the existing ranking algorithms, which retrieve their results from an index, he said.

The need for directing crawlers based on feedback from queries is driven by economics; data storage and handling are a growing cost, said Schaale. "A shop owner orders his products [depending on] what his customers ask for," he said. "Vox Populi does basically the same." This type of ranking is only necessary because search engines are not nearly powerful enough to crawl all Internet content in real time, he said. The Google crawler, for instance, does its main crawl to update its index of the Web about once a month.

The researchers' scheme also includes methods to suppress spam, or unwanted content. Spam suppression is especially important in this method because "in 'most wanted' topic areas like free downloads, adult content, and shopping, the amount of spam is clearly above-average," said Schaale.

The main challenge to making the method work is not related to the algorithm, but the filtering, Schaale added. "The spammers and the search engine optimizers... adapt fast to new methods of filtering. This is a challenge for each search engine," he said.

The basic idea of improving searching by incorporating user context, including queries, has a lot of potential and is an active research area, said Filippo Menczer, an associate professor of informatics and computer science at Indiana University. The researchers' idea of improving a search engine by modifying its crawling and ranking algorithms to capture the preferences inferred from user queries is interesting, but its mathematical framework is incomplete, he said.

The researchers' algorithm can be used in combination with the ranking methods used by search engines, according to Schaale. It could be used in vertical information systems that search by subject and personalized searches that take into account a user's topics of interest, he said.

The method could be ready within a year, said Schaale.

Schaale's research colleagues were Carsten Wulf-Mathies from T-Online International AG in Germany and Sönke Lieberam-Schmidt from Siegen University in Germany. The research was funded by Contraco Consulting and Software.

Timeline:  > 1 year
Funding:   Corporate
TRN Categories:  Internet; Databases and Information Retrieval
Story Type:   News
Related Elements:  Technical paper, "A New Approach to Relevancy in Internet Searching - the 'Vox Populi Algorithm'," posted in the Computing Research Repository (CoRR) at arxiv.org/abs/cs.DS/0308039




October 22/29, 2003

© Copyright Technology Research News, LLC 2000-2006. All rights reserved.