July 27/August 3, 2005
engines find content via spiders that go through the pages on a Web site
by following the links among pages. This information is stored in an index
that is used to match query terms. There are two challenges to providing Web
search this way. The first is dealing with the sheer size of
the Internet. The second is presenting a user with a reasonably
whittled-down number of useful links.
The Internet is big by any measure:
The Internet Systems Consortium pegged the number of Internet hosts at 233 million as of January 2004.
Global Reach counted 729 million users online as of March 2004. And a University of California at Berkeley study showed that in 2002, 532,897 terabytes of new data flowed across the Internet, 440,606 terabytes of email were sent, and the Web contained 167 terabytes of data accessible to all users, plus another 91,850 terabytes in the deep Web, where access is controlled.
A terabyte is 1,000 gigabytes, or 1,000,000 megabytes, or the amount of information that can be stored on 213 DVDs, or one-tenth the amount of information stored in the entire Library of Congress print collection.
This is a lot of information, and it lives in a world in which computers are only so fast and hold only so much information. There is simply not enough time and compute power for spiders to crawl all the information in anything like a timely manner, or for even the tens of thousands of servers deployed by the major search engine companies to index and cache it.
To get around the problem, today’s search engines cover only 10 to 20 percent of the Web, and even then, spiders take weeks to finish a single crawl of just that portion. Search engines often crawl popular sites more often to keep them more up to date, but in general, when you search the Web or access a search engine’s cached copy of a page, you are working with a snapshot that is days or weeks old.
Link structure already plays an important role in the second challenge for search engines: presenting links that are relevant. And it is starting to play a more important role in the first challenge: covering more of the Web.
Perhaps the best known example of using link structure to determine link relevance is Google’s PageRank algorithm, which orders search results using an algorithm that measures a page’s popularity based on the number and status of pages that link to it.
PageRank assigns a value to a page by adding up the values of its inbound links. A link’s value is determined by the originating page’s value divided by the number of its outbound links. The algorithm aims to identify authoritative sources and use their authority to evaluate other sources. Because pages determine each other’s rankings, the algorithm has to run many times before it converges on a reasonable value for a given page.
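The iterative scheme described above can be sketched in a few lines of Python. This is a minimal illustration of the general idea, not Google's actual implementation; the tiny three-page link graph and the damping factor of 0.85 are assumptions for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively score pages by the value of their inbound links.

    links: dict mapping each page to the list of pages it links to.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal values
    for _ in range(iterations):
        # every page keeps a small base value regardless of links
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outbound in links.items():
            if not outbound:
                continue
            # a link's value is the source page's value
            # divided by its number of outbound links
            share = damping * rank[page] / len(outbound)
            for target in outbound:
                new_rank[target] += share
        rank = new_rank  # repeat until the values settle
    return rank

# Hypothetical three-page Web: A links to B and C, B to C, C back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(graph)
```

In this toy graph, page C ends up with the highest score: it has two inbound links, including one carrying B's full value, which is why repeated iteration rather than a single pass is needed for the mutually dependent values to converge.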
More recently, researchers have been using link structure to categorize the Internet by subject in order to identify portions of the Web that are more manageable than the entire thing. Given that pages are likely to link to related pages, search algorithms can be tuned to find densely interconnected communities of interest.
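One simple signal such community-finding algorithms can use is link density: what fraction of a candidate group's links stay inside the group. The sketch below and its cooking-pages example are hypothetical, intended only to show the measure, not any particular research system.

```python
def link_density(links, community):
    """Fraction of the community's outgoing links that stay inside it.

    links: dict mapping each page to the set of pages it links to.
    community: set of pages hypothesized to form a community of interest.
    """
    internal = external = 0
    for page in community:
        for target in links.get(page, ()):
            if target in community:
                internal += 1  # link stays within the candidate community
            else:
                external += 1  # link leaves it
    total = internal + external
    return internal / total if total else 0.0

# Hypothetical graph: three cooking pages that mostly cite each other,
# plus one unrelated news page.
graph = {
    "soup": {"bread", "stew"},
    "bread": {"soup"},
    "stew": {"soup", "news"},
    "news": set(),
}
density = link_density(graph, {"soup", "bread", "stew"})
```

Here four of the cooking pages' five links stay inside the group, a density of 0.8, which is the kind of dense interconnection a tuned search algorithm would look for.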
How It Works: Internet Structure
© Copyright Technology Research News, LLC 2000-2006. All rights reserved.