Page layout drives Web search
Technology Research News
A typical Web search engine indexes the Web by crawling Web pages, extracting
text and links, and using the information to construct a Web graph that
reflects the relative importance of individual Web pages. The method relies
heavily on analyzing links, or the way pages connect.
Researchers from the University of Chicago and Microsoft Research
Asia have devised a system that analyzes content at the level of blocks
of information on a page rather than the coarser page-level. This allows
for a model of the relationships between Web pages that shows the intrinsic
semantic structure of the Web, said Deng Cai, who was with Microsoft Research
Asia and Tsinghua University in China when the research was done, but is
now at the University of Illinois at Urbana-Champaign.
The research could eventually lead to more accurate search engines,
according to Cai.
Link-based Web algorithms, including Google's PageRank, are based
on a pair of assumptions, said Cai. First, that the links convey human endorsement,
meaning that if page "A" links to page "B" and the two pages were authored
by different people, the author of "A" found "B" valuable. Second,
that if one page links to two other pages, those two pages are likely to
contain related subject matter.
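The endorsement assumption is what classic PageRank formalizes: each link spreads a share of its source page's importance to its target. A minimal power-iteration sketch (not Google's production algorithm, just the textbook formulation) makes the idea concrete:

```python
import numpy as np

def pagerank(links, damping=0.85, iters=50):
    """Minimal PageRank power iteration over an adjacency list.

    links[i] lists the pages that page i links to; each link is treated
    as an endorsement, spreading page i's score evenly to its targets.
    """
    n = len(links)
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = np.full(n, (1.0 - damping) / n)
        for i, targets in enumerate(links):
            if targets:  # spread rank of page i evenly across its outlinks
                for j in targets:
                    new[j] += damping * rank[i] / len(targets)
            else:        # dangling page: spread its rank uniformly
                new += damping * rank[i] / n
        rank = new
    return rank

# Page 0 links to pages 1 and 2; both link back to page 0,
# so page 0 ends up with the highest rank.
ranks = pagerank([[1, 2], [0], [0]])
```

Note that this formulation counts every link equally, which is exactly the premise the block-level work revises.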
The assumptions don't hold in many cases, however, said Cai. A single
page often contains sections, and hyperlinks in different sections of the
page often point to pages that have different topics. Many links exist only
for navigation and advertisement, for instance.
To correct this problem, search engines should analyze content in
units smaller than pages, said Cai.
The researchers' prototype consists of a pair of search algorithms
that work with their previously developed method of segmenting Web pages
into topic-based blocks.
The researchers used their Vision-Based Page Segmentation algorithm
to delineate the different parts of a Web page based on how a human views
a page, said Cai. Pages are segmented by horizontal and vertical lines,
and blocks of content are weighted by page position. Links from advertisements,
for example, count for less than links from central content blocks.
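The weighting idea can be sketched in a few lines. The scoring function below is illustrative only, not the formula from the researchers' segmentation algorithm: it assumes a block's importance rises with its share of page area and its closeness to the page center, so a corner banner ad scores far below a central content block.

```python
# Hedged sketch: weight a link by the size and centrality of the block
# it appears in. The block geometry fields (x, y, w, h) and the scoring
# formula are assumptions for illustration, not the VIPS algorithm.

def block_weight(block, page_w, page_h):
    """Score a block by relative area and distance from page center."""
    bx, by, bw, bh = block["x"], block["y"], block["w"], block["h"]
    area = (bw * bh) / (page_w * page_h)
    cx, cy = bx + bw / 2, by + bh / 2
    # Normalized distance of the block's center from the page's center.
    dist = ((cx / page_w - 0.5) ** 2 + (cy / page_h - 0.5) ** 2) ** 0.5
    return area * (1.0 - dist)

page_w, page_h = 1000, 2000
content = {"x": 200, "y": 300, "w": 600, "h": 1200}   # large central block
ad = {"x": 850, "y": 0, "w": 150, "h": 200}           # small corner banner
assert block_weight(content, page_w, page_h) > block_weight(ad, page_w, page_h)
```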
In theory, other visual aspects of Web pages like background color
and font could also be used to segment and weight blocks, according to Cai.
Also, learning algorithms like neural networks could be trained for the
task using examples chosen by people, he said.
The researchers' prototype ranks Web pages by extracting page-to-block
and block-to-page relationships, then using the information to construct
a page graph and a block graph. Page-to-block relationships are determined
by analyzing the layout of a page, and block-to-page relationships are determined
by the probability of a block linking to a given page.
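One plausible way to realize the construction described above (the exact matrix definitions are in the researchers' paper; the numbers here are made up for illustration) is to express the page-to-block relationships as one matrix and the block-to-page relationships as another, then compose them in each order to get the two graphs:

```python
import numpy as np

# Hedged sketch: X holds page-to-block importance weights (each row a
# page, distributed over its blocks) and Z holds block-to-page link
# probabilities. Composing them yields a page graph and a block graph.

# 2 pages, 3 blocks: page 0 holds blocks 0 and 1, page 1 holds block 2.
X = np.array([[0.7, 0.3, 0.0],   # page 0: block 0 weighted higher
              [0.0, 0.0, 1.0]])  # page 1: only block 2
Z = np.array([[0.0, 1.0],        # block 0 links to page 1
              [1.0, 0.0],        # block 1 links back to page 0
              [1.0, 0.0]])       # block 2 links to page 0

W_page = X @ Z    # page-to-page weights, routed through blocks
W_block = Z @ X   # block-to-block weights, routed through pages
```

Under this construction, a link from an important block contributes more page-to-page weight than a link from a minor one, which is the behavior the article describes.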
The information is fed to the researchers' link-analysis algorithms
-- Block Level PageRank, and Block Level Hypertext-Induced Topic Selection
(HITS) -- which assign an importance value to each page based on the type
of blocks that link to it. "Based on these, we can build our search engine
from the block level," rather than the coarser-grained page level, Cai said.
This means doing block-level link analysis and block-based Web search.
The link analysis algorithms are able to extract the intrinsic semantic
structure of the Web from this information, according to Cai.
This is in some ways similar to the World Wide Web Consortium's Semantic
Web project, which aims to give search engines and other software the means
to interpret Web page content. The block-level search technique does not
provide the concrete semantic information that the Semantic Web promises,
but also does not require widespread adoption of tags and other software
to parse Web pages.
The approaches are different because "we try to extract the [semantic]
structure of the Web automatically from the existing Web," said Cai.
The method also allowed the researchers to compute a BlockRank at
the block level similar to a page-level PageRank.
In a comparison of their search algorithms with page-based versions
of the PageRank and HITS algorithms using a standard information-retrieval
research data set, the block-based algorithms performed better most of the
time, according to Cai.
The block-level analysis of the Web could also lead to a better
understanding of the network in general, said Cai.
In a practical search system, the block-level PageRank function
would not burden the system because it can be calculated offline, said Cai.
The researchers are currently working to improve the page segmentation
algorithm, and to construct Web graphs that more accurately reflect the
semantic structure of the Web, said Cai. "We ultimately aim [to build] a
better search engine," he said. The researchers previously used the technique
to cluster similar Web images.
The technique could be ready for commercial use in a general search
engine within two years, said Cai.
Cai's research colleagues were Xiaofei He from the University of
Chicago and Microsoft Research Asia, and Ji-Rong Wen and Wei-Ying Ma from
Microsoft Research Asia. The researchers presented the work at the Association
for Computing Machinery (ACM) Special Interest Group on Information Retrieval
(SIGIR) 2004 conference in Sheffield, England, July 25-29. The research was
funded by Microsoft Research Asia.
Timeline: 1-2 years
TRN Categories: Internet; Databases and Information Retrieval
Story Type: News
Related Elements: Technical paper, "Block-level Link Analysis,"
presented at the Association for Computing Machinery (ACM) Special Interest
Group on Information Retrieval (SIGIR) 2004 conference in Sheffield, England
October 6/13, 2004