Search tool finds answers before queriesBy Eric Smalley, Technology Research News
Using keywords to search a large database like the Web is as tedious as casting a net into crowded waters and then sifting through everything that was too big to slip through. A question answering system being developed at IBM Research, however, could make the experience more like questioning an expert.
IBM's GuruQA system will answer a query with "a name or a place or a number, or [a] short passage containing the answer," said John Prager, who is the lead researcher on the project. The answer will also include links to the documents where the answer was found.
GuruQA is actually an attempt to find a middle ground between the sophisticated text understanding performed by natural language processing systems and the simpler but more practical information retrieval techniques commonly used in search engines, Prager said. Natural language processing is computationally expensive and has generally been limited to specific subjects, while information retrieval techniques do little more than apply statistics to matching keywords.
IBM's system combines three technologies: predictive annotation, query analysis and answer selection. Predictive annotation analyzes documents looking for words, numbers and phrases that are likely to be answers to questions. Then it labels them with tokens that indicate the type of questions they would likely answer. The tokens include place, person, duration, date and length. (See chart)
The query analysis component analyzes the question using 400 standard question types for comparison and replaces certain types of words with appropriate tokens. For example, the question "How tall is the Matterhorn" would be converted to "LENGTH$ is the Matterhorn." LENGTH$ is the token denoting distance.
The system also uses a search engine to rank passages within documents and an answer selection component to rank the best answer phrases from within the selected passages.
"The search engine might typically return 10 passages, each with its own selection of answers, some of which may agree with answers [in] other passages, some of which may not," said Prager. The answer selection component evaluates the different possible answers and ranks them by looking at how well the type of answer matches the type of entity asked for, how close certain words in the answer passage are to each other and what the search engine's score for each passage is, he said.
The IBM researchers' approach is somewhat unusual, said Ellen M. Voorhees, Text Retrieval Conference project manager at the National Institute of Standards and Technology. "The Guru system is basically indexing the whole collection on the basis of these [tokens]. Therefore… their actual matching is done at the indexing stage." It's too soon to determine whether this approach will provide a significant advantage over other question answering systems, she said.
Despite performing some text analysis, GuruQA has many of the same limitations as other question answering systems. The system has difficulty with synonyms and instances where phrases like "the club" and "the company" replace proper nouns in the answer phrases. The system also has difficulty with questions about method and reason.
"When you get into questions of how and why that gets a lot harder. You're starting to get into interpretation," said Stephanie W. Haas, an associate professor at the School of Information and Library Science at the University of North Carolina.
The researchers intend to expand GuruQA by adding part-of-speech tagging, a thesaurus, deeper natural language processing and other technologies to improve performance. However, it is already practical technology for questions seeking factual answers, Prager said.
The IBM team has run a successful prototype of GuruQA with a well-known encyclopedia, he said.
Prager will present a paper titled "Question-Answering by Predictive Annotation" at the SIGIR 2000 conference in Athens next week. The paper was co-authored by Eric Brown and Anni Coden of IBM Research and Dragomir Radev of the University of Michigan. The research was funded by IBM.
TRN Categories: Databases and Information Retrieval
Story Type: News
Related Elements: Chart
July 19, 2000
Hearing between the lines
Search tool finds answers before queries
Scatter could boost fiber capacity
Software makes data really sing
Magnetic microscope recovers damaged data
Research News Roundup
Research Watch blog
View from the High Ground Q&A
How It Works
News | Blog | Books
Buy an ad link
Ad links: Clear History
Buy an ad link
© Copyright Technology Research News, LLC 2000-2006. All rights reserved.