Finding Information on the 'Net

September 5, 2005
Although the reach of today's search engines seems impressive, collectively they have indexed only about half of all publicly available Web pages. Those publicly available pages, in turn, make up less than one percent of the Web.

The challenges to improving today's search methods include indexing a larger portion of the Web, searching the much larger amount of information available in non-public Web pages and databases, restructuring the Web so that search engines can tap into the meaning of Web content, and correlating information across Web pages.

Searching the Web

The search engine programs that serve up Internet information are really a constellation of three types of software.

A spider, or crawler, finds all the pages on a Web site by mapping out the link structure of pages within a site. Spiders make return trips periodically to find changes.

The index, or catalog, stores information from Web pages found by the spider in a database. Once the information is indexed, it can be accessed by the actual search engine, which looks through the database to find entries that match a query and ranks relevant entries.

Because it takes time to crawl and index Web pages, Web searches are actually searches of a database of pages found by the crawler at some point in the past, usually days or weeks earlier.
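As a rough illustration of how the three pieces fit together, here is a minimal sketch in Python, assuming a hypothetical seed site at example.com; a real engine adds politeness rules, far better ranking, and periodic re-crawling.

    import re
    import urllib.request
    from collections import defaultdict
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkAndTextParser(HTMLParser):
        """Collects hyperlinks and visible text from one HTML page."""
        def __init__(self):
            super().__init__()
            self.links, self.text = [], []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)

        def handle_data(self, data):
            self.text.append(data)

    def crawl(seed, max_pages=10):
        """Spider: follow links within the seed site, return {url: page text}."""
        seen, queue, pages = set(), [seed], {}
        while queue and len(pages) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
            except Exception:
                continue
            parser = LinkAndTextParser()
            parser.feed(html)
            pages[url] = " ".join(parser.text)
            for link in parser.links:
                full = urljoin(url, link)
                if full.startswith(seed):      # stay within the site
                    queue.append(full)
        return pages

    def build_index(pages):
        """Catalog: inverted index mapping each word to the pages containing it."""
        index = defaultdict(set)
        for url, text in pages.items():
            for word in re.findall(r"\w+", text.lower()):
                index[word].add(url)
        return index

    def search(index, query):
        """Search engine: rank pages by how many of the query words they contain."""
        scores = defaultdict(int)
        for word in query.lower().split():
            for url in index.get(word, set()):
                scores[url] += 1
        return sorted(scores, key=scores.get, reverse=True)

    if __name__ == "__main__":
        pages = crawl("https://example.com")   # hypothetical seed site
        index = build_index(pages)
        print(search(index, "example domain"))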

The Deep Web

And you won’t find everything contained in the World Wide Web using today’s search engines.

Some Web pages are deliberately excluded from search engines by metatags that tell spiders not to index them; others remain unsearched because spiders simply can't gain access. Pages that are not indexed by the major search engines are, collectively, the Deep Web.

These pages include thousands of specialized databases that can be accessed over the Web. Such databases generate Web pages from search results on-the-fly. The Deep Web is probably several hundred times bigger than the surface Web.

The Semantic Web

The Semantic Web initiative is poised to make searches more accurate and enable increasingly sophisticated information services like intelligent agents that can find products and services, schedule appointments and make purchases. The initiative includes a sort of grammar and vocabulary that provide information about a document’s components; this information will enable Web software to act on the meaning of Web content.

Semantic Web software includes a special set of Extensible Markup Language (XML) tags that includes Uniform Resource Identifiers (URIs), a Resource Description Framework (RDF), and a Web Ontology Language (OWL).

The Extensible Markup Language tags provide information about a document’s components. The Uniform Resource Identifiers contained in the XML tags expand the concept of Uniform Resource Locators (URLs) by adding IDs for objects, concepts and values that are not dependent on location.

The Resource Description Framework is a set of rules for describing objects like Web pages, people and products by their properties and relationships to other objects. There are three elements to a Resource Description Framework object definition: the object, the object’s properties, and the values of these properties. For example, the object could be a car that has the property of color with a value of blue. Objects, properties and values are all identified by Uniform Resource Identifiers.

Properties can also be relationships to other objects, like employment and authorship. In the case of a Prof. Stevens who teaches at State University, the object is Prof. Stevens, the property is employment, and the property value is State University, which is also an object that can have its own properties. And in the case of Bill Johnson who is the composer of State University’s school song, the object is Bill Johnson, the property is authorship, and the property value is school song, which is also an object.
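As a sketch of how such descriptions look in practice, the triples for the car and for Prof. Stevens can be written with the rdflib Python library; the example.org namespace and the identifiers are made up here purely for illustration.

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")   # illustrative namespace for the URIs
    g = Graph()

    # Object, property, value: a car whose color is blue.
    g.add((EX.car, EX.color, Literal("blue")))

    # Properties can also be relationships to other objects;
    # StateUniversity is itself an object that can carry its own properties.
    g.add((EX.ProfStevens, EX.employment, EX.StateUniversity))
    g.add((EX.BillJohnson, EX.authorship, EX.StateUniversitySchoolSong))

    # Render the same graph in the Turtle notation.
    print(g.serialize(format="turtle"))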

The Web Ontology Language is a tool for building vocabularies that define specific sets of objects. The vocabularies are expressed and interpreted through the Resource Description Framework.

Semantic Web software makes it possible for an intelligent agent to carry out the request “show me the opticians in the neighborhood” even if there is no explicit list, because it knows that “neighborhood” has the property “location” with the value “Bellevue,” and in searching a directory of opticians it knows to skip Dr. Smith, whose location value is “Springfield”, but include Dr. Jones, whose location value is “Bellevue.”
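A small sketch of that reasoning, continuing with rdflib and invented identifiers: the agent looks up the location value of "neighborhood" and keeps only the opticians whose location value matches.

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()

    # The user's neighborhood has a location value.
    g.add((EX.neighborhood, EX.location, Literal("Bellevue")))

    # A directory of opticians, each with a location property.
    g.add((EX.DrSmith, EX.profession, Literal("optician")))
    g.add((EX.DrSmith, EX.location, Literal("Springfield")))
    g.add((EX.DrJones, EX.profession, Literal("optician")))
    g.add((EX.DrJones, EX.location, Literal("Bellevue")))

    # Resolve "neighborhood" to its location, then filter the directory.
    where = g.value(EX.neighborhood, EX.location)
    local_opticians = [
        s for s, _, _ in g.triples((None, EX.profession, Literal("optician")))
        if g.value(s, EX.location) == where
    ]
    print(local_opticians)   # only Dr. Jones's URI remains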

Comparing Text

The DataSpace project is aimed at making it possible not only to search for Web information but also to correlate it automatically.

The project includes four major pieces of software that enable this:

· Data Space Transfer Protocol (DSTP) - a protocol for transferring columns of data that includes universal correlation keys, which are analogous to a database's primary key attributes

· Predictive Model Markup Language (PMML) - an XML-based language that allows users to define data mining and statistical models, then mark different parts of data according to those models so different sets of data can be compared

· Open source client and server software that allows computers to exchange data
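The protocol and markup details are beyond the scope here, but the idea of a universal correlation key can be sketched in a few lines of Python with invented data: two columns published separately are lined up on the shared key so their values can be compared, much as a database join lines up rows on a primary key.

    # Two columns of data published by different sites, each carrying the same
    # universal correlation key (here, a hypothetical station ID).
    temperature = {"ST-001": 18.2, "ST-002": 21.5, "ST-003": 16.9}   # degrees C
    rainfall    = {"ST-001": 3.1,  "ST-003": 0.0,  "ST-004": 7.4}    # millimeters

    # Correlate: keep only the keys present in both columns and pair their values.
    joined = {
        key: (temperature[key], rainfall[key])
        for key in temperature.keys() & rainfall.keys()
    }
    print(joined)   # e.g. {'ST-001': (18.2, 3.1), 'ST-003': (16.9, 0.0)}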

