Search tool builds encyclopedia
Technology Research News
The best part about the Internet is having
so much information at your fingertips. You type in a word or phrase,
hit “search” and wait for your hits. Then you hope for the best as you
click on a description to see if the site contains what you need.
A pair of researchers at the University of Library and Information Science
in Tsukuba, Japan has come up with a system that streamlines Internet
searching by indexing the Web as a sort of open encyclopedia. Instead of
showing a list of a thousand Web sites that might contain answers, the
system extracts the information and its reference links and organizes
them in the form of an encyclopedic entry.
“The interface has two fundamental modes: keyword and concept input,”
said Atsushi Fujii, a postdoctoral research assistant at the University.
If you type a word such as ‘pipeline,’ which could be either a means of
conveying liquids and gases or a computer processing method, the application
distinguishes between the two usage domains, and then shows the various
entries describing each usage, Fujii said. The resulting page looks much
like it came out of a paper dictionary or an encyclopedia, except each
description has a hyperlink to its source page.
In the concept input mode, users can type in sentences rather than keywords,
such as, ‘What infects computer files by way of e-mails?’ Fujii said.
To answer the question, the system generates a list of candidate keywords,
such as ‘microvirus’ and ‘computer virus,’ he said. Users select one of
the keywords to see its description page, essentially switching back to
the keyword input mode.
The system culls entries from Web pages and stores them in a database.
Because the system uses the Google search engine to retrieve candidate
pages, the raw material it works with is what anyone would get from
searching on a term like microvirus.
The system deletes layout information and links and retains only the sentence
fragments surrounding a key term. It uses a statistical language model
and a morphological analyzer to prevent the output from resembling garbled
strings of words. The morphological analyzer segments the input sentences
into words; the statistical language model is “a set of probabilities
that each word appears in a given context,” Fujii said.
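
The clean-up step described above can be sketched roughly as follows. This is a minimal illustration, not the researchers' code: the class TextOnly, the function fragments_around and the window parameter are hypothetical names, and splitting on whitespace stands in for the morphological analyzer that Japanese text would require.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects visible text and drops tags, including link markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def fragments_around(html, term, window=5):
    """Return word windows around each occurrence of a one-word term.

    Assumption: words are separated by whitespace. The system in the
    article segments Japanese text with a morphological analyzer instead.
    """
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    words = " ".join(parser.chunks).split()
    hits = []
    for i, word in enumerate(words):
        if term.lower() in word.lower():
            start = max(0, i - window)
            hits.append(" ".join(words[start:i + window + 1]))
    return hits
```

Calling fragments_around(page_html, "pipeline") on a fetched page would return the short runs of text around each mention of the term, with tags and link markup already removed.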
Using the two preceding words as context, the statistical language model
extracts three-word patterns, or tri-grams, such as "go to school," that
are typical of term descriptions. “Given a fragment extracted from a
Web page, our method extracts all the possible tri-grams from the fragment,
and computes a combined probability for them,” Fujii explained. The result
is very readable, and quite accurate, he said.
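
A toy version of tri-gram scoring might look like the sketch below. The article does not say how the probabilities are estimated or combined, so the add-one smoothing and the averaged log-probability used here are assumptions, and the names TrigramModel, prob and score are hypothetical.

```python
import math
from collections import Counter


def trigrams(words):
    """All consecutive three-word patterns in a list of words."""
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


class TrigramModel:
    """Counts tri-grams in example descriptions and scores new fragments."""

    def __init__(self, sentences):
        self.tri = Counter()  # counts of (w1, w2, w3)
        self.bi = Counter()   # counts of the two-word context (w1, w2)
        self.vocab = set()
        for sentence in sentences:
            words = sentence.lower().split()
            self.vocab.update(words)
            for t in trigrams(words):
                self.tri[t] += 1
                self.bi[t[:2]] += 1

    def prob(self, t):
        # P(w3 | w1, w2) with add-one smoothing (an assumption, chosen so
        # that unseen tri-grams get a small nonzero probability).
        return (self.tri[t] + 1) / (self.bi[t[:2]] + len(self.vocab))

    def score(self, fragment):
        """Average log-probability over all tri-grams in the fragment."""
        ts = trigrams(fragment.lower().split())
        if not ts:
            return float("-inf")
        return sum(math.log(self.prob(t)) for t in ts) / len(ts)
```

A model trained on sentences that read like definitions should give a fragment such as "a pipeline is a method" a higher score than the same words in scrambled order, which is the kind of signal that lets a system keep readable fragments and discard garbled ones.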
To test accuracy, the researchers generated an encyclopedia from 96 test
terms collected from the Japanese IT Engineers Examinations. The method
generated appropriate descriptions for 90 percent of the test terms. The
answers from the generated encyclopedia were comparable to those from an
existing hand-compiled computer encyclopedia, said Fujii.
The system is better than encyclopedias and dictionaries that are unable
to keep up with new developments and information, said Fujii. “Our method
facilitates searching the Web for encyclopedic knowledge related to input
terms. Consequently, users can easily obtain knowledge associated with
new or technical terms unlisted in existing encyclopedias,” he said.
Once an encyclopedia has been generated for a search term, it is stored
in a database. The database is updated periodically, Fujii said. If the
search term has already been indexed, it takes only a few seconds to find
an entry. Terms that are not indexed in the encyclopedia are processed
in real-time, which can take up to a couple of minutes, he said.
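
The look-up behavior described above amounts to a cache in front of a slow generator. The sketch below assumes an SQLite table and a caller-supplied generate function. Both are illustrative choices, since the article does not say which database the system uses, and the names open_store and lookup are hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a small store of generated entries."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries (term TEXT PRIMARY KEY, body TEXT)"
    )
    return conn


def lookup(conn, term, generate):
    """Return the stored entry for term, generating and saving it if absent."""
    row = conn.execute(
        "SELECT body FROM entries WHERE term = ?", (term,)
    ).fetchone()
    if row is not None:
        return row[0]          # fast path: term is already indexed
    body = generate(term)      # slow path: build the entry on demand
    conn.execute(
        "INSERT OR REPLACE INTO entries (term, body) VALUES (?, ?)",
        (term, body),
    )
    conn.commit()
    return body
```

The first request for a term pays the full generation cost. Later requests for the same term are answered from the table until a periodic update replaces the stored entry.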
“On the whole it is promising, but the current system is too premature
to be practically interesting just yet,” said John Prager, a research
staff member at IBM’s T.J. Watson Research Center. If a user wanted to
research a technical subject, “this could be an interesting front-end
to a traditional search engine such as Google, but as a Question-Answering
system it is well below the state of the art,” he said.
Its drawbacks are that it deals only with “what-is” questions of a multiple
choice nature, for which the correct answers are already supplied. Its
performance on these questions is also no better than that of existing
systems, Prager said.
The researchers are planning to use a parallel PC cluster to speed up
the process since each description can be processed independently, Fujii
said. They also plan to expand the system to answer “how” and “why” questions
along with “what” questions, he said.
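
Because each description can be scored on its own, the work divides cleanly among workers. The sketch below uses a thread pool on a single machine as a stand-in for the PC cluster the researchers plan to use. The function score_all and its parameters are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def score_all(fragments, score, workers=4):
    """Apply score to every fragment in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score, fragments))
```

No fragment depends on another, so the same pattern would carry over to separate processes or machines without changing the scoring function.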
The system is currently used for Japanese text only, but it could be used
for several other languages, according to Fujii. It will be ready for
practical application in two years, he said.
Fujii’s research colleague was Tetsuya Ishikawa. They presented their
research at the 39th Annual Meeting of the Association for Computational
Linguistics (ACL2001), held in Toulouse, France from July 6-11, 2001.
The research was funded by the University of Library and Information Science,
Timeline: >2 years
TRN Categories: Natural Language Processing; Databases
and Information Retrieval; Internet
Story Type: News
Related Elements: Technical paper, "Organizing Encyclopedic
Knowledge based on the Web and its Application to Question Answering,"
scheduled to be presented at the 39th Annual Meeting of the Association
for Computational Linguistics (ACL2001), July 6-11 2001, Toulouse, France.