Software sifts text to sort Web sites
By Ted Smalley Bowen, Technology Research News
Although the World Wide Web is a multimedia network, the job of classifying sites is still largely a matter of interpreting textual information. In an attempt to make that process quicker and more accurate, a research team has developed a method of automatically sifting and categorizing the various forms of text found on the Web.
The researchers devised a spider program that crawls, or systematically examines, the various types of text that make up a Web site, and classification software that organizes the information the spider finds, creating a sort of card catalog system for parts of the Web.
Although sophisticated methods of searching pictures, video and audio are under development, text-based categorization promises a more immediate improvement in traversing and making sense of the Web, said John Pierre, a member of the technical staff at Interwoven, Inc. Pierre and his research colleague, Bill Wohler, developed the categorization system while employed by Metacode Technologies, Inc.
They have used the software to group English language sites into business categories. The scheme could apply to other languages and categories as well, Pierre said.
The software ferrets out meaningful text from three distinct sources within a Web site to categorize the site: the words that make up the site's Hypertext Markup Language (HTML) meta tags, the words within its HTML body tags, and the readable text on the site.
HTML meta tags are key words contained in the hidden code of a Web site that summarize the type of information a Web site contains. HTML body tags, also hidden, are page layout instructions that affect the look of the site.
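The three text sources the article describes can be pulled from a page with a standard HTML parser. This is an illustrative sketch, not the researchers' code; it uses Python's built-in html.parser to collect meta-tag keywords and descriptions separately from the page's readable text:

```python
# Sketch: separating meta-tag text from visible page text,
# the two main sources the classifier draws on.
from html.parser import HTMLParser

class SiteTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta = []       # content of <meta name="keywords"/"description">
        self.visible = []    # readable text on the page
        self._skip = 0       # depth inside <script>/<style>, which hold no prose

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() in ("keywords", "description"):
            self.meta.append(attrs.get("content", ""))
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.visible.append(data.strip())

parser = SiteTextExtractor()
parser.feed('<html><head><meta name="keywords" content="banking, loans">'
            '</head><body><h1>First National Bank</h1></body></html>')
print(parser.meta)     # ['banking, loans']
print(parser.visible)  # ['First National Bank']
```

A real spider would also record title-tag words and link targets, but the split above is the essential one: hidden descriptive metadata versus text a visitor actually sees.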
The term metadata, which generally refers to information about information, has many different meanings in computer science. Pierre's scheme uses subject-based metadata, or data about subject categories to group Web sites in much the way a library card catalog groups books. "You can think of a library catalog card, where you have author and title, and number of pages in the book, but you would also have other fields like subject or Dewey decimal code," said Pierre.
Other organizational systems could be tapped to make use of other types of metadata fields as well, he said. "This type of system, in the larger picture of metadata creation, can really serve as a driver for processing [content] beyond keyword searching to more reason-based searching," he said.
The drawback to using descriptive metadata, however, is that there is not enough of it. "People are a little resistant or lazy about deploying metadata. It's a tedious task that nobody really wants to do," he said.
Web developers must enter meta tags into their sites in order for Web search engines to use them in ranking search results. Despite this incentive, however, less than a third of Web sites use meta-tag key words and descriptions.
The program also looks for text in HTML body tags, which usually contain page layout commands but often also include useful information like Web addresses.
Based on an examination of 19,195 Web domains, Pierre found that while most had words in title tags, which allow Web browsers to title each page, the information was of limited use in classifying sites because there were few words and they often consisted of generic terms like "homepage."
The scheme also addresses pages without words -- some pages have only frame sets, images or software plug-ins, and do not lend themselves to accurate classification, Pierre said. Frame sets are organizational elements that divide the browser's window into multiple frames. Software plug-ins are programs that give a larger program additional functions, like the ability to play movies.
The spider program searches first for text in HTML meta tags and titles, then follows links for frame sets and hyperlinks. The program searches body text only if no meta tag information is found, according to Pierre.
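The crawl order Pierre describes amounts to a simple fallback policy. The sketch below is my reconstruction of that policy, not the actual spider; the page representation is a hypothetical dictionary:

```python
# Reconstruction of the described crawl-order policy: prefer meta-tag
# and title text, follow frame-set sources recursively, and fall back
# to body text only when no metadata turns up anywhere.
def gather_site_text(page):
    """page: dict with optional keys 'meta', 'title', 'frames', 'body'."""
    text = []
    text.extend(page.get("meta", []))
    text.extend(page.get("title", []))
    for framed in page.get("frames", []):   # follow frame-set sources
        text.extend(gather_site_text(framed))
    if not text:                            # no metadata found: use body text
        text.extend(page.get("body", []))
    return text

# A frame-set page with no metadata of its own falls through to the
# framed page's body text.
print(gather_site_text({"frames": [{"body": ["welcome to our store"]}]}))
# ['welcome to our store']
```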
Once the spider gathers the information, it passes it to a Latent Semantic Indexing (LSI) information retrieval engine, which identifies matches based on concepts rather than single words, but does not have the heavy computational requirements of a true natural language processor, he said.
"It provides a certain level of concept-based matching, without any specialized knowledge base or rules, and it does that along with a complete framework for matching terms in documents -- similarity matching," he said.
Next, the information is fed into a classification engine, which, for the sake of performance, uses shallow parsing, according to Pierre.
"It's a way of understanding and extracting some limited subset of the data without worrying about the complete structure of it. For example, you could assign and extract proper names in sentences without having to understand every word, diagram all parts of speech, and understand the full meaning of the sentence," he said.
The scheme works most accurately using meta tags as the only source of text, while classifications based partly or entirely on body text are less accurate, according to Pierre.
The categorization system pulls together various ways of retrieving information on the Web, said Jon Kleinberg, an assistant professor of computer science at Cornell University. "In a sense, it's collecting a sequence of techniques which have been widely used in the information retrieval and machine learning community [and] grouping them into a single architecture for [Web] classification tasks."
The big question for the system is whether metadata will ultimately be more widely developed, said Kleinberg. "It remains to be seen to what level metadata will be adopted, and whether there's a standard that will somehow achieve widespread use," he said, since metadata is not visible to end-users and can be labor-intensive to create.
The scheme is also competing with other general classification schemes that build automatic taxonomies of Web pages, as well as with existing, more focused classification systems, Kleinberg added.
In general, research schemes like this need to be more accurate. "Ultimately, we need to develop a better method to combine a natural language processor and statistical [analysis software]," said Pierre.
Some elements of the classification system are already in commercial use. The spider program could be in wide use in three to five years, according to Pierre. The research was funded by Network Solutions.
Timeline: Now, 3-5 years
TRN Categories: Internet
Story Type: News
Related Elements: Technical paper, "On the Automated Classification of Web Sites," posted at http://xxx.lanl.gov/abs/cs.IR/0102002
February 21, 2001
© Copyright Technology Research News, LLC 2000-2006. All rights reserved.