Teamed filters catch more spam
Technology Research News
How much unsolicited email do you find
cluttering your inbox every morning? Even when Internet service providers
block junk email, spam creeps in, disguised, for example, as innocuous
messages from people who seem to have only first names. At the same time,
spam filters sometimes block legitimate messages.
A group of researchers in Greece has come up with a method that could
solve both problems.
Most spam filters work by doing two things: they block known spammers
who have been blacklisted, and they follow general rules, such as blocking
messages that contain the word ‘adult’ in the subject header, said Ion
Androutsopoulos, a research fellow at the Demokritos National Center for
Scientific Research (NCSR) in Greece.
But spammers frequently forge email addresses to get around the blacklists,
and those filters that use keyword-specific blocking might also nix that
funny anecdote about your brother’s kids that contains the word ‘nude.’
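Both weaknesses show up in a minimal Python sketch of such a rule-based filter. The blacklist entry, the function name and the sample messages are invented for illustration; only the two keywords come from the examples above.

```python
import re

# Hypothetical blacklist entry; a real list would hold known spammer addresses.
BLACKLIST = {"bulk@spammer.example"}
# Keyword rules taken from the two examples in the article.
BLOCKED_WORDS = {"adult", "nude"}


def rule_based_is_spam(sender, subject, body):
    """Return True if the sender is blacklisted or a blocked word appears."""
    if sender.lower() in BLACKLIST:
        return True
    words = set(re.findall(r"[a-z]+", (subject + " " + body).lower()))
    return bool(words & BLOCKED_WORDS)


# A spammer who forges the sender address and avoids the keywords.
forged = rule_based_is_spam(
    "anna@friendly.example", "hi", "Click here for a great deal")
# A harmless family anecdote that happens to contain a blocked word.
anecdote = rule_based_is_spam(
    "sis@family.example", "the kids", "He ran through the yard nude!")
print(forged, anecdote)
```

In this toy run the forged spam is let through and the family anecdote is blocked, which are the two failure modes described above.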
The NCSR spam program creates custom filters for each user that learn
what is spam and what is not, said Androutsopoulos. The filters learn
to tell the two apart by looking through a user’s legitimate email and
comparing it with lots of spam collected by the researchers, he said.
The key to the process is using several filters that work together. The
researchers found that they could bolster accuracy by combining filters
based on different learning algorithms that individually made different
types of errors, said Androutsopoulos.
The program analyzes the user’s existing mail using Natural Language Processing
algorithms to build the set of anti-spam filters, said Androutsopoulos.
It calculates the probabilities of certain words appearing in spam versus
legitimate messages and classifies incoming messages by comparing them
with previously analyzed email.
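The article does not say which formula the program uses to combine word probabilities. One common choice is a naive Bayes classifier, sketched below as an assumption rather than as the researchers' method; the class name, the add-one smoothing and the tiny training set are illustrative only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case a message and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())


class WordProbabilityFilter:
    """Toy naive Bayes filter; expects at least one message of each kind."""

    def __init__(self, spam_msgs, legit_msgs):
        self.counts = {"spam": Counter(), "legit": Counter()}
        for msg in spam_msgs:
            self.counts["spam"].update(tokenize(msg))
        for msg in legit_msgs:
            self.counts["legit"].update(tokenize(msg))
        self.vocab = set(self.counts["spam"]) | set(self.counts["legit"])
        total = len(spam_msgs) + len(legit_msgs)
        self.log_prior = {
            "spam": math.log(len(spam_msgs) / total),
            "legit": math.log(len(legit_msgs) / total),
        }

    def _log_score(self, label, words):
        counts = self.counts[label]
        denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
        score = self.log_prior[label]
        for word in words:
            if word in self.vocab:  # words never seen in training are skipped
                score += math.log((counts[word] + 1) / denom)
        return score

    def spam_probability(self, text):
        words = tokenize(text)
        diff = self._log_score("legit", words) - self._log_score("spam", words)
        diff = max(-700.0, min(700.0, diff))  # keep math.exp from overflowing
        return 1.0 / (1.0 + math.exp(diff))


# Illustrative training data, standing in for a user's mail and collected spam.
spam_filter = WordProbabilityFilter(
    spam_msgs=["win free money now", "free offer click now"],
    legit_msgs=["lunch meeting tomorrow", "project notes from the meeting"],
)
print(spam_filter.spam_probability("free money"))
print(spam_filter.spam_probability("meeting tomorrow"))
```

A message is treated as spam when the returned probability exceeds a chosen threshold; raising the threshold trades missed spam for fewer blocked legitimate messages.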
“The individual filters are treated as members of a committee presided
[over] by a higher-level classifier, which is trained to learn when to
trust each of the members,” said Androutsopoulos. When a new message arrives,
the committee members cast their votes on whether the message is spam.
“The president of the committee then makes the final decision by taking
into consideration the opinions of the members, the message itself, and
its previous experience regarding when to trust each member,” he explained.
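The committee idea can be sketched in a few lines of Python. This is a simplified illustration, not the researchers' system: the two members are hand-written stand-ins for learned filters, the president sees only the votes (not the message itself), and all names and sample messages are hypothetical.

```python
from collections import Counter, defaultdict


def keyword_member(msg):
    """Committee member that votes spam (1) if the message mentions 'free'."""
    return int("free" in msg.lower())


def shouting_member(msg):
    """Committee member that votes spam (1) if the message contains '!'."""
    return int("!" in msg)


MEMBERS = [keyword_member, shouting_member]


def member_votes(msg):
    return tuple(member(msg) for member in MEMBERS)


class President:
    """Learns, for each pattern of member votes, which label was usually right."""

    def __init__(self):
        self.table = {}
        self.default = 0

    def train(self, labeled_msgs):
        tallies = defaultdict(Counter)
        for msg, label in labeled_msgs:
            tallies[member_votes(msg)][label] += 1
        self.table = {votes: tally.most_common(1)[0][0]
                      for votes, tally in tallies.items()}
        overall = Counter(label for _, label in labeled_msgs)
        self.default = overall.most_common(1)[0][0]

    def classify(self, msg):
        # Fall back to the most common label for unseen vote patterns.
        return self.table.get(member_votes(msg), self.default)


# Illustrative training messages: 1 = spam, 0 = legitimate.
TRAINING = [
    ("Free pills, order now!", 1),
    ("FREE vacation!!!", 1),
    ("Are you free for lunch on Friday?", 0),
    ("Feel free to call me", 0),
    ("Happy birthday!", 0),
    ("Great news, we got the grant!", 0),
    ("Minutes from Monday's meeting", 0),
]

president = President()
president.train(TRAINING)
print(president.classify("Free tickets inside!"))
print(president.classify("Free to chat later?"))
```

With this training data the president learns to call a message spam only when both members agree, overruling each member in the cases where it tends to be wrong on its own. When the members are themselves trained, the votes used to train the president are normally produced on messages the members did not see during their own training.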
The stacked spam filter is more accurate than keyword-based spam filters,
Androutsopoulos said. It identifies about 90 percent of junk email accurately,
and mistakes a legitimate email for spam about 1 percent of the time,
he said. The accuracy could be increased further by returning messages
classified as spam to their senders with a request to use another address,
he said. If the email is legitimate, the sender can resend it
to a different, unfiltered address.
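To put the reported rates in concrete terms, the short Python calculation below applies them to a hypothetical mailbox. The mailbox size and the function name are assumptions; only the 90 percent and 1 percent figures come from the article.

```python
def expected_outcomes(spam_count, legit_count,
                      spam_caught_pct=90, legit_blocked_pct=1):
    """Apply the reported rates to a mailbox; integer division rounds down."""
    spam_caught = spam_count * spam_caught_pct // 100
    legit_blocked = legit_count * legit_blocked_pct // 100
    return {
        "spam_caught": spam_caught,
        "spam_missed": spam_count - spam_caught,
        "legit_blocked": legit_blocked,
        "legit_delivered": legit_count - legit_blocked,
    }


# Hypothetical mailbox: 100 spam messages and 200 legitimate ones.
outcome = expected_outcomes(spam_count=100, legit_count=200)
print(outcome)
```

For this mailbox, 90 spam messages are caught, 10 slip through, and 2 legitimate messages are wrongly blocked; those 2 are the ones the resend request is meant to recover.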
“Training the filter takes a few minutes per user, depending on the number
of training messages. Classifying an incoming message is almost instantaneous,”
said Androutsopoulos. When the filter is configured separately for each
user, it could be installed either on the end user’s desktop or on the
ISP’s server. “In the latter case, the ISP would run the user’s filter
on behalf of the user before downloading the messages to [a] desktop,
saving bandwidth wasted by spam messages,” he said.
The same configuration of the filter can be applied to all users on a
network, said Androutsopoulos, “but I would expect the accuracy of the
filter to be worse than when using filters especially configured for each
user.” The training time would also go up because more training messages
would be needed for a pan-network filter, he said.
Better spam filters are definitely needed, said Ben Gross, a visiting
scholar at Berkeley, and a coordinator of the Digital Libraries Initiative
Phase Two for the National Science Foundation (NSF). “Spam remains a nearly
intractable problem for most users [and] better Natural Language Processing
techniques for spam could certainly improve the current state of technology,”
he said.
An important variable the researchers did not discuss, which may bear
on the scheme’s use in large networks, is time. “For a system to be viable
for large scale deployment with email it must be highly efficient,” said
Gross. Still, if a spam filter’s performance were to prove inadequate,
it could be deployed at the users’ desktops, he said.
The stacked spam filter could be used by firewall makers, listserv moderators,
newsgroups, ISPs and individual users, said Androutsopoulos. It will
be ready for such use within a year, he said.
The researchers’ next step is to improve the filters by evaluating more
thoroughly how the filters work and improving the system’s learning algorithms,
according to Androutsopoulos. The researchers would like to make the system’s
training period faster, he said.
Androutsopoulos’s research colleagues were George Sakkis and Panagiotis
Stamatopoulos at the University of Athens and Georgios Paliouras, Vangelis
Karkaletsis, and Constantine D. Spyropoulos at the Demokritos National
Center for Scientific Research. They presented the research at the 6th
Conference on Empirical Methods in Natural Language Processing held in
Pittsburgh, PA on June 3 and 4, 2001. The research was funded by the universities.
Timeline: <1 year
TRN Categories: Natural Language Processing; Internet
Story Type: News
Related Elements: Technical paper, "Stacking Classifiers
for Anti-Spam Filtering of E-mail," in the Proceedings of the 6th Conference
on Empirical Methods in Natural Language Processing (EMNLP 2001), held in
Pittsburgh, PA on June 3 and 4, 2001.