A Neural Network Classifier for Junk E-Mail Ian Stuart, Sung-Hyuk Cha, and Charles Tappert CSIS Student/Faculty Research Day May 7, 2004


Spam, spam, spam, …
Fighting spam

Several commercial applications exist
– Server-side: expensive
– Client-side: time-consuming

No approach is 100% effective
– Spammers are aggressive and adaptable
– Best solutions are typically hybrids of
different approaches and criteria
Common approaches

Simple filters
– Common words or phrases
– Unusual punctuation or capitalization

Blacklisting: “just say NO” (if you can)
– Reject e-mail from known spammers

Whitelisting: “friends only, please”
– Accept e-mail only from known correspondents

Classifiers: examine each e-mail and decide
– Only a few publications on spam classifiers
Naïve Bayesian classifiers


Used in commercial classifiers
Assumes recognition features are independent
– Max likelihood = product of likelihoods of features

E-mail classifier – examines each word
– Training assigns a probability to each word
– Look up each word/probability in a dictionary
– If the product of the probabilities exceeds a given
threshold, it is spam
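
The word-by-word scoring described above can be sketched as follows. This is a minimal illustration, not the method of the published classifiers: it assumes one common formulation, in which each word has an estimated likelihood under spam and under legitimate mail, and the product of the likelihood ratios is compared with a threshold. The products are computed as sums of logarithms to avoid numeric underflow. The names `spam_prob`, `legit_prob`, `default`, and `threshold` are hypothetical, and the value assigned to unseen words is an arbitrary assumption.

```python
import math

def spam_score(words, spam_prob, legit_prob, default=1e-3):
    """Log of prod P(word | spam) / prod P(word | legitimate).

    Words missing from a dictionary get the same assumed default
    likelihood on both sides, so they leave the score unchanged.
    """
    score = 0.0
    for w in words:
        score += math.log(spam_prob.get(w, default))
        score -= math.log(legit_prob.get(w, default))
    return score

def is_spam(words, spam_prob, legit_prob, threshold=1.0):
    """Flag the message when the likelihood ratio exceeds the threshold."""
    return spam_score(words, spam_prob, legit_prob) > math.log(threshold)

# Toy dictionaries; the numbers are invented for illustration only.
spam_prob = {"refinance": 0.05, "tennis": 0.001}
legit_prob = {"refinance": 0.001, "tennis": 0.02}
print(is_spam(["refinance", "now"], spam_prob, legit_prob))
print(is_spam(["tennis", "schedule"], spam_prob, legit_prob))
```

The sketch makes the dictionary problem visible: every decision depends on which words are in the two tables and on the value chosen for words that are not.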


Challenge – creating the “dictionary”
We compare our Neural Network against two
published Naïve Bayesian classifiers
Naïve Bayesian classifier issues





How many features (words), which ones?
How is degradation avoided as spammers’
vocabulary changes?
What values are assigned to new words?
What are the thresholds?
How to avoid “sabotage” of classifier?
Which one isn’t spam?
(subject headers)

5 Be a mighty warrior in bed! vcrhwt ygjztyjjh

Money Back Guarantee_HGH

kindle life pddez liw mzac

v a l i u m - D i a z e p a m used to relieve anxiety

Fairfield tennis schedule

:Dramatic E,nhancement fo=r .Men = f"fumqid

,Refina'nce now. Don't wait
Spammers make patterns



The more they try to hide, the easier it
is to see them
Therefore, we use common spammer
patterns (instead of vocabulary) as
features for classification
Learn these patterns with a Neural
Network
Neural Network features

Total of 17 features
– 6 from the subject header
– 2 from priority and content-type headers
– 9 from the e-mail body
Features from subject header
1. Number of words with no vowels
2. Number of words with at least two of letters J, K, Q, X, Z
3. Number of words with at least 15 characters
4. Number of words with non-English characters, special characters such as punctuation, or digits at beginning or middle of word
5. Number of words with all letters in uppercase
6. Binary feature indicating 3 or more repeated characters
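
The six subject-header features could be extracted along the following lines. This is a sketch under stated assumptions, not the authors' parser: words are split on whitespace, only a, e, i, o, u count as vowels, a "no vowels" word must contain at least one letter, trailing punctuation is ignored for feature 4, and "repeated characters" is read as three identical characters in a row. The function name `subject_features` is hypothetical.

```python
import re
import string

VOWELS = set("aeiou")
RARE_LETTERS = set("jkqxz")

def subject_features(subject):
    """Return the six subject-header features as a list of numbers."""
    words = subject.split()
    lowered = [w.lower() for w in words]
    # 1. Words that contain letters but none of a, e, i, o, u.
    no_vowels = sum(1 for w in lowered
                    if any(c.isalpha() for c in w) and not (set(w) & VOWELS))
    # 2. Words with at least two occurrences of J, K, Q, X, Z.
    rare = sum(1 for w in lowered if sum(c in RARE_LETTERS for c in w) >= 2)
    # 3. Words of 15 characters or more.
    long_words = sum(1 for w in words if len(w) >= 15)
    # 4. Words with a non-ASCII-letter character before any trailing punctuation.
    odd_chars = 0
    for w in words:
        core = w.rstrip(string.punctuation)
        if any(not (c.isascii() and c.isalpha()) for c in core):
            odd_chars += 1
    # 5. Words whose letters are all uppercase.
    all_caps = sum(1 for w in words if w.isupper())
    # 6. Binary: any non-space character repeated three times in a row.
    repeated = int(re.search(r"(\S)\1\1", subject) is not None)
    return [no_vowels, rare, long_words, odd_chars, all_caps, repeated]

print(subject_features("5 Be a mighty warrior in bed! vcrhwt ygjztyjjh"))
print(subject_features("Fairfield tennis schedule"))
```

Under these assumptions the random-letter padding in the first subject shows up in features 1, 2, and 4, while the tennis subject scores zero on every feature.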
Features from priority and
content-type headers
1. Binary feature indicating whether the priority had been set to any level besides normal or medium
2. Binary feature indicating whether a content-type header appeared within the message headers or whether the content type had been set to “text/html”
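
A rough sketch of the two header features, using Python's standard `email` package. The slides do not say which header carries the priority, so reading `X-Priority` and then `Priority`, and treating the value `3` as normal, are assumptions. The second feature follows the wording above literally; under that reading the “text/html” test is already covered by the presence test, since an HTML content type can only be declared in a content-type header. The function name `header_features` is hypothetical.

```python
from email import message_from_string

# Values treated as "normal or medium"; "3" is the usual X-Priority code for normal.
NORMAL_PRIORITIES = {"normal", "medium", "3"}

def header_features(raw_message):
    """Return (unusual_priority, content_type_flag) as 0/1 values."""
    msg = message_from_string(raw_message)
    priority = (msg.get("X-Priority") or msg.get("Priority") or "").strip().lower()
    # Keep the leading token only, so "1 (Highest)" becomes "1".
    token = priority.split()[0] if priority else ""
    unusual_priority = int(bool(token) and token not in NORMAL_PRIORITIES)
    has_header = msg.get("Content-Type") is not None
    is_html = msg.get_content_type() == "text/html"
    return unusual_priority, int(has_header or is_html)

sample = "X-Priority: 1 (Highest)\nContent-Type: text/html\nSubject: hi\n\nbody"
print(header_features(sample))
```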
Features from message body
1. Proportion of alphabetic words with no vowels and at least 7 characters
2. Proportion of alphabetic words with at least two of letters J, K, Q, X, Z
3. Proportion of alphabetic words at least 15 characters long
4. Binary feature indicating whether the strings “From:” and “To:” were both present
5. Number of HTML opening comment tags
6. Number of hyperlinks (“href=”)
7. Number of clickable images represented in HTML
8. Binary feature indicating whether a text color was set to white
9. Number of URLs in hyperlinks with digits or “&”, “%”, or “@”
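
The nine body features might be computed roughly as below. The regular expressions are simplifying assumptions rather than the authors' parsing rules: an "alphabetic word" is a whitespace-separated token made only of letters, a clickable image is an `<img>` tag directly inside an `<a>` tag, and white text is detected by a `color` attribute or style set to `white`, `#fff`, or `#ffffff` (which would also match a white background color). The function name `body_features` and the dictionary keys are hypothetical.

```python
import re

VOWELS = set("aeiou")
RARE_LETTERS = set("jkqxz")

def body_features(body):
    """Return the nine message-body features in a dictionary."""
    alpha = [w.lower() for w in body.split() if w.isalpha()]
    total = len(alpha) or 1  # avoid division by zero for empty bodies
    # URL targets of hyperlinks, with or without surrounding quotes.
    urls = re.findall(r"""href\s*=\s*["']?([^"'\s>]+)""", body, flags=re.I)
    white = re.search(r"""color\s*[=:]\s*["']?\s*(?:white|#fff(?:fff)?)\b""",
                      body, flags=re.I)
    return {
        "p_no_vowels": sum(len(w) >= 7 and not (set(w) & VOWELS) for w in alpha) / total,
        "p_rare": sum(sum(c in RARE_LETTERS for c in w) >= 2 for w in alpha) / total,
        "p_long": sum(len(w) >= 15 for w in alpha) / total,
        "from_and_to": int("From:" in body and "To:" in body),
        "n_comments": body.count("<!--"),
        "n_hrefs": len(re.findall(r"href\s*=", body, flags=re.I)),
        "n_clickable_images": len(re.findall(r"<a\b[^>]*>\s*<img\b", body, flags=re.I)),
        "white_text": int(white is not None),
        "n_odd_urls": sum(1 for u in urls if re.search(r"[0-9&%@]", u)),
    }

sample = ('From: a To: b <!-- x --> <a href="http://192.0.2.1/a?b=1&c=2">'
          '<img src="x.gif"></a> <font color="#FFFFFF">hidden</font> rhythmsxz hello')
print(body_features(sample))
```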
Neural Network spam classifier

3-layer, feed-forward network (Perceptron)
– 17 input units, variable # hidden layer units, 1 output unit

Data – 1,654 e-mails: 854 spam, 800 legitimate

Use half of each (spam/non-spam) for training,
the other half for testing

Test with variations of hidden nodes (4 to 14)
and epochs (100 to 500)
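
A network of this shape can be sketched in plain Python as follows. The slides give only the layer sizes and the ranges of hidden nodes and epochs; the sigmoid activations, squared-error loss, per-example gradient updates, learning rate, and weight initialization used here are all assumptions, and the class name `TinyMLP` is hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyMLP:
    """Feed-forward net: n_in inputs -> n_hidden sigmoid units -> 1 sigmoid output."""

    def __init__(self, n_in=17, n_hidden=12, seed=0):
        rng = random.Random(seed)
        # One row of weights per unit; the last entry of each row is the bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
                  for row in self.w_hidden]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden)) + self.w_out[-1])
        return hidden, out

    def predict(self, x):
        return self.forward(x)[1]

    def train(self, samples, epochs=500, lr=0.5):
        """Backpropagation with one weight update per training example."""
        for _ in range(epochs):
            for x, target in samples:
                hidden, out = self.forward(x)
                # Gradient of the squared error through the output sigmoid.
                delta_out = (out - target) * out * (1.0 - out)
                for j, h in enumerate(hidden):
                    # Uses the output weight before it is updated below.
                    delta_h = delta_out * self.w_out[j] * h * (1.0 - h)
                    for i, v in enumerate(x):
                        self.w_hidden[j][i] -= lr * delta_h * v
                    self.w_hidden[j][-1] -= lr * delta_h
                    self.w_out[j] -= lr * delta_out * h
                self.w_out[-1] -= lr * delta_out

# Toy usage with two made-up feature vectors (1.0 = spam, 0.0 = legitimate).
net = TinyMLP(n_in=17, n_hidden=12, seed=0)
toy = [([0.0] * 17, 0.0), ([1.0] * 17, 1.0)]
net.train(toy, epochs=500, lr=0.5)
print(net.predict([1.0] * 17), net.predict([0.0] * 17))
```

Sweeping `n_hidden` over 4 to 14 and `epochs` over 100 to 500, as in the experiment, would amount to constructing and training one such network per setting and comparing the results on the held-out half of the data.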
Definitions used for classifier
success measures
nSS = number of spam classified as spam
nSL = number of spam classified as legitimate
nLL = number of legitimate classified
as legitimate
nLS = number of legitimate classified as spam
Measure of success: precision
Precision: the percentage of labeled
spam/legitimate e-mail correctly classified
$$\text{Spam precision (SP)} = \frac{\text{correct spams}}{\text{all labeled spam}}$$
$$\text{Legitimate precision (LP)} = \frac{\text{correct legitimate}}{\text{all labeled legitimate}}$$
In terms of the counts defined above:
$$\text{SP} = \frac{\text{nSS}}{\text{nSS} + \text{nLS}}$$
$$\text{LP} = \frac{\text{nLL}}{\text{nLL} + \text{nSL}}$$
Measure of success: accuracy
Accuracy (recall): the percentage of actual
spam/legitimate e-mail correctly classified
$$\text{Spam recall (SR)} = \frac{\text{correct spams}}{\text{all actual spam}}$$
$$\text{Legitimate recall (LR)} = \frac{\text{correct legitimate}}{\text{all actual legitimate}}$$
In terms of the counts defined above:
$$\text{SR} = \frac{\text{nSS}}{\text{nSS} + \text{nSL}}$$
$$\text{LR} = \frac{\text{nLL}}{\text{nLL} + \text{nLS}}$$
Neural Network results

Best overall results with 12 hidden nodes at
500 epochs
– Spam Precision: 92.45%
– Legitimate Precision: 91.32%
– Spam Accuracy: 91.80%
– Legitimate Accuracy: 92.00%
35 spams misclassified: 8.20%
32 legitimates misclassified: 8.00%
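
The reported percentages can be checked against the precision and accuracy (recall) definitions given earlier. The sketch below assumes the test set is exactly half of each class, that is 427 of the 854 spam and 400 of the 800 legitimate e-mails, and derives the four counts from the 35 and 32 misclassifications. The function name `success_measures` is hypothetical.

```python
def success_measures(n_ss, n_sl, n_ll, n_ls):
    """Return spam/legitimate precision and recall as percentages."""
    return {
        "SP": 100.0 * n_ss / (n_ss + n_ls),  # spam precision
        "LP": 100.0 * n_ll / (n_ll + n_sl),  # legitimate precision
        "SR": 100.0 * n_ss / (n_ss + n_sl),  # spam accuracy (recall)
        "LR": 100.0 * n_ll / (n_ll + n_ls),  # legitimate accuracy (recall)
    }

# Assumed test set: 427 spam and 400 legitimate e-mails.
n_sl, n_ls = 35, 32  # spam seen as legitimate, legitimate seen as spam
measures = success_measures(n_ss=427 - n_sl, n_sl=n_sl,
                            n_ll=400 - n_ls, n_ls=n_ls)
for name, value in measures.items():
    print(f"{name}: {value:.2f}%")
# SP: 92.45%
# LP: 91.32%
# SR: 91.80%
# LR: 92.00%
```

With these counts (nSS = 392, nSL = 35, nLL = 368, nLS = 32) all four values agree with the figures on the slide.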

Misclassified e-mails


Most spam misclassified as legitimate
were short in length, with few hyperlinks
Most legitimate e-mails misclassified as
spam had unusual features for personal
e-mail (that is, they were “spam-like” in
appearance)
Comparing Neural Network and
Naïve Bayesian Classifiers



Accuracy of the NN classifier is comparable to
that reported for Naïve Bayesian classifiers
NN classifier required fewer features (17 versus
100 in one study and 500 in another)
NN classifier uses descriptive qualities of words
and messages similar to those used by human
readers
Blacklisting Experiment

Manually entered IP addresses of e-mails
misclassified by the NN classifier into a website
that sends IP addresses to 173 working spam
blacklists and returns the number of hits:
http://www.declude.com/junkmail/support/ip4r.htm
– Entered first (original) IP address and, when present,
second IP address (e.g., mail server or ISP)


Counted an e-mail as spam only when its hit count
was greater than one, since single-list hits were
considered to be anomalies
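
For context, DNS-based blacklists are conventionally queried by reversing the octets of the IPv4 address and appending the blacklist's zone name. The sketch below shows only how such a query name is formed, together with the "more than one hit" rule described above; it performs no network lookups. The zone `bl.example.org` is a placeholder, not one of the 173 lists used in the experiment, and both function names are hypothetical.

```python
def dnsbl_query_name(ip, zone):
    """Build the DNS name checked for an IPv4 address on a given blacklist zone."""
    octets = ip.split(".")
    if len(octets) != 4 or not all(o.isdigit() and 0 <= int(o) <= 255 for o in octets):
        raise ValueError(f"not an IPv4 address: {ip!r}")
    return ".".join(reversed(octets)) + "." + zone

def flagged_as_spam(hit_count, min_hits=2):
    """Treat a single-list hit as an anomaly; require at least two hits."""
    return hit_count >= min_hits

print(dnsbl_query_name("192.0.2.1", "bl.example.org"))
print(flagged_as_spam(1), flagged_as_spam(2))
```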
Blacklisting Experimental Results



Of the 32 legitimate e-mails misclassified
by the NN, 53% were identified as spam
Of the 35 spam e-mails misclassified by
the NN, 97% were identified as spam
These poor results indicate that the
blacklisting strategy, at least for these
databases, is inadequate
Conclusions



NN is competitive with Naïve Bayesian studies
despite using a much smaller feature set
Room for refinement of parsing for features
Use of descriptive, more human-like
features makes NN less subject to
degradation than Naïve Bayesian
Conclusions (cont.)



Neural Network approach is useful and
accurate, but too many legitimate -> spam
Should be powerful when used in
conjunction with a whitelist to reduce
legitimate -> spam (nLS), increasing spam
precision and legitimate accuracy
Blacklisting strategy is not very helpful
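
The whitelist combination suggested above can be sketched as a simple override: mail from a known correspondent is accepted without consulting the classifier, so it can never become a legitimate -> spam error. The decision threshold of 0.5 on the network output, the address format, and the function name `classify` are assumptions made for illustration.

```python
def classify(sender, nn_score, whitelist, threshold=0.5):
    """Combine a whitelist with a neural-network spam score in [0, 1]."""
    if sender.lower() in whitelist:
        return "legitimate"  # known correspondent: skip the classifier
    return "spam" if nn_score >= threshold else "legitimate"

whitelist = {"coach@example.org"}
print(classify("Coach@example.org", 0.9, whitelist))
print(classify("offers@example.net", 0.9, whitelist))
```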