Software sorts out subjectivity

By Kimberly Patch, Technology Research News

One of the fundamental challenges in getting computers to sort and analyze text is finding ways to automatically classify information.

Applications like search engines that group similar documents do so using topic-based categories. Sentiment analysis techniques add another dimension by determining the author's attitude about a topic rather than just identifying a topic.

Existing techniques tend to concentrate on finding words, phrases and patterns that indicate sentiment. This has proven difficult, however. "This laptop is a great deal", for instance, shows strong sentiment, but contains the same words as the neutral sentence "The release of this new laptop drew a great deal of media attention."

In this example, it's not just the presence of a cue word like "great" that matters, but also its meaning in context.

People can easily tell the difference between the phrases because they understand the meaning of the words. Enabling computers to deal with meaning is an extremely difficult challenge, however.

Researchers from Cornell University have devised a way to improve sentiment classification that sidesteps having to deal with meaning by instead concentrating on context. Their method weeds out neutral sentences. "Getting rid of neutral sentences like 'The release of this new laptop drew a great deal of media attention' [makes] the overall sentiment more obvious," said Lillian Lee, an associate professor of computer science at Cornell University.

The method improved sentiment classification accuracy from 82.8 percent to 86.4 percent, a highly statistically significant gain, according to Lee. The method could eventually be used to maintain review-aggregator Web sites, to filter search results by viewpoint, and to track attitudes toward a given topic, she said.

It is not readily apparent that classifying text as subjective or objective is any easier than classifying text as positive or negative, said Lee. But it turned out to be easier simply because people tend to switch between objective and subjective statements less often than they switch between positive and negative phrases.

A movie reviewer, for instance, may begin with several sentences of objective text concerning a movie's plot before switching to a subjective statement about how good the movie was, said Lee. "If the sentence appears in the context of a block of other obviously objective sentences, there's a good chance that it is also objective," she said.

To take advantage of this clustering, the researchers represented text as a network, or graph. "Imagine that each sentence is represented by a network point, or node," said Lee. To model contextual information between each pair of sentence nodes, the researchers added a link whose strength represented how much the two sentences deserved the same label -- objective or subjective -- based on criteria including how close the sentences are in the text, and whether they are separated by a paragraph boundary.
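As a rough illustration of such a link, the sketch below scores a pair of sentences by how far apart they sit and whether a paragraph break separates them. The function name, the inverse-square decay, the distance cutoff and the penalty factor are all assumptions made for this example, not values taken from the researchers' system.

```python
def link_strength(i, j, paragraph_of, max_distance=3, cross_paragraph_penalty=0.5):
    """Hypothetical score for how much sentences i and j deserve the same label.

    i, j: positions of the two sentences in the text (0-based).
    paragraph_of: list mapping each sentence position to its paragraph number.
    All constants here are illustrative assumptions.
    """
    distance = abs(i - j)
    if distance == 0 or distance > max_distance:
        return 0.0  # no link to itself or to far-away sentences
    strength = 1.0 / (distance ** 2)  # nearby sentences attract more strongly
    if paragraph_of[i] != paragraph_of[j]:
        strength *= cross_paragraph_penalty  # weaker link across a paragraph boundary
    return strength


# Three sentences: the first two share a paragraph, the third starts a new one.
paragraphs = [0, 0, 1]
adjacent_same_paragraph = link_strength(0, 1, paragraphs)
adjacent_across_break = link_strength(1, 2, paragraphs)
```

In this toy setup, adjacent sentences in the same paragraph get the strongest link, and the link weakens with distance or when a paragraph boundary intervenes.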

The model also took into consideration the evidence within a sentence that the sentence is subjective or objective. Possible evidence that a sentence is subjective, for example, includes the presence of a word like 'wonderful', or 'terrible', said Lee.

Each sentence was also linked, strongly or weakly, to two special nodes, one subjective and one objective, depending on how much evidence within the sentence indicated that it was subjective or objective.

The sentences are then clustered into subjective and objective camps based on the strength of the links. This is a graph partitioning problem known as finding the minimum cut, and it can be solved exactly by a quick, efficient algorithm, said Lee.

One way to visualize how this works is to picture someone taking the special subjective node in one hand and the special objective node in the other hand, and pulling them in opposite directions so that the weaker links snap until the network is broken into two pieces, said Lee. "Two sentences that prefer to be in the same class will tend to be in the same piece because they had a strong link between them, but they could still be separated if they have very strong links to opposite special nodes," she said.
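The pulling-apart picture can be sketched in a few lines of code. The example below is a minimal illustration, not the researchers' implementation: it treats the special subjective node as the source and the special objective node as the sink, finds a minimum cut with a textbook breadth-first augmenting-path (Edmonds-Karp) max-flow routine, and labels whatever stays attached to the source as subjective. The function name, the input format and the toy scores are assumptions.

```python
from collections import deque


def min_cut_labels(subj_score, obj_score, links):
    """Label sentences via a minimum cut between two special nodes.

    subj_score[i], obj_score[i]: link strength from sentence i to the special
        subjective node (source) and objective node (sink).
    links: {(i, j): strength} for sentence pairs that prefer the same label.
    """
    n = len(subj_score)
    source, sink = n, n + 1
    cap = [[0.0] * (n + 2) for _ in range(n + 2)]
    for i in range(n):
        cap[source][i] = subj_score[i]
        cap[i][sink] = obj_score[i]
    for (i, j), weight in links.items():
        cap[i][j] += weight  # links between sentences pull in both directions
        cap[j][i] += weight

    def reachable_parents():
        # Breadth-first search over edges that still have spare capacity.
        parent = [-1] * (n + 2)
        parent[source] = source
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in range(n + 2):
                if parent[v] == -1 and cap[u][v] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable_parents()
        if parent[sink] == -1:
            break  # no path left: the network has been pulled into two pieces
        bottleneck = float("inf")
        v = sink
        while v != source:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = sink
        while v != source:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u

    # Sentences still reachable from the source sit on the subjective side.
    return ["subjective" if parent[i] != -1 else "objective" for i in range(n)]


# Toy example: the middle sentence leans slightly subjective on its own,
# but it is strongly linked to two clearly objective neighbors.
subj = [0.1, 0.55, 0.1]
obj = [0.9, 0.45, 0.9]
with_context = min_cut_labels(subj, obj, {(0, 1): 1.0, (1, 2): 1.0})
without_context = min_cut_labels(subj, obj, {})
```

With the links in place, the middle sentence ends up on the objective side along with its neighbors; with the links removed, it is judged on its own evidence and lands on the subjective side.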

Once the subjective vs. objective classification is done, the researchers use standard pattern recognition techniques to classify each document as positive or negative based just on the portions identified as subjective.
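The two-stage idea can be illustrated with a toy example. The sketch below assumes a small bag-of-words Naive Bayes classifier as the stand-in for "standard pattern recognition techniques"; the training words, the review sentences and every function name are invented for illustration and are not the researchers' data or code.

```python
import math
from collections import Counter


def train_naive_bayes(labeled_docs):
    """Train on (list_of_words, label) pairs; returns a simple model dict."""
    word_counts, doc_counts, vocab = {}, Counter(), set()
    for words, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def classify(model, words):
    """Return the most probable label, using add-one smoothing."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, counts in model["word_counts"].items():
        score = math.log(model["doc_counts"][label] / total_docs)
        total_words = sum(counts.values())
        for word in words:
            if word in model["vocab"]:  # ignore words never seen in training
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def polarity_of_review(model, sentences, labels):
    """Classify a review using only the sentences labeled subjective."""
    kept = [
        word
        for sentence, label in zip(sentences, labels)
        if label == "subjective"
        for word in sentence.lower().split()
    ]
    return classify(model, kept)


# Invented toy data.
model = train_naive_bayes([
    (["great", "fun", "wonderful"], "positive"),
    (["dull", "terrible", "boring"], "negative"),
])
review = [
    "The plot follows a dull man through a terrible and boring storm",
    "The film is wonderful fun",
]
filtered = polarity_of_review(model, review, ["objective", "subjective"])
unfiltered = polarity_of_review(model, review, ["subjective", "subjective"])
```

In this toy case the plot summary is full of negative-sounding words, so the unfiltered review is classified as negative, while the version that keeps only the subjective sentence is classified as positive.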

The researchers found that seemingly empty words and phrases can turn out to be unexpectedly informative when it comes to sentiment classification. In the context of movie reviews, for example, the word "good" provides less evidence for positive sentiment than the word "still" followed by a comma, said Lee. "This makes sense in retrospect -- a typical use would be something like 'still, this film is worth seeing' -- but illustrates how subtle the sentiment problem can be."

The researchers are working on improving their method for estimating the affinity sentences have for being classified the same way, said Lee. "We used very simple cues like distance... but more sophisticated information ought to be incorporated," she said.

Longer-term, they are aiming to develop methods that can handle variations in language, said Lee. "This is very important in dealing with on-line text, since Internet sources can vary widely in form, tenor and even grammaticality," she said. "One can get reviews from the highly-edited New York Times or from a stream-of-consciousness personal Web log."

The ultimate aim is to be able to handle rhetorical devices like irony and sarcasm, said Lee. "Given that even humans are occasionally misled by such rhetorical devices, this is going to be very challenging," she said.

People are incredibly creative at expressing negative opinions, said Lee. For example, this sentence not only contains no obviously negative words, but has a lot of potentially positive words: "If you think this laptop is a great deal, I've got a nice bridge you might be interested in."

The system could be deployed now for domains that have fairly consistent language and training data that the system can use to learn what cues work in that domain, said Lee.

It will take at least a decade before the system can readily handle unrestricted texts containing arbitrary rhetorical devices, she said.

The method could be used to automate the maintenance of review-aggregation sites, said Lee. "A system could crawl on-line information sources and automatically extract ratings, even from documents like New York Times book reviews that don't include explicit scores," she said.

It could be used by search engines to sort or filter results by viewpoint to, for instance, help users distinguish between objective and biased Web sites, said Lee.

It could also be used to track changes in attitudes toward a given topic by, for instance, analyzing press articles, she said. "An analyst might desire a summary of the international press's reaction to a particular act of political violence, as well as a list of which countries approve of the act and which condemn it," she said.

And companies could use the system to gather business intelligence such as finding out what people think of their products or the products of their competitors. "A computer company might crawl blogs to find out whether or not people like its latest laptop model," said Lee.

Lee's research colleague was Bo Pang. The research is published in the Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, held July 21 to 26, 2004 in Barcelona, Spain.

The research was funded by the National Science Foundation (NSF), the Alfred P. Sloan Research Foundation, and the Cornell cognitive studies program.

Timeline: 10 years
Funding: Government, Private, University
TRN Categories: Natural Language Processing; Databases and Information Retrieval
Story Type: News
Related Elements: Technical paper, "A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts," published in the Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, held July 21 to 26, 2004 in Barcelona, Spain and posted at arxiv.org/abs/cs.CL/0409058

November 17/24, 2004
