Learning to Detect Phishing Emails
Presenter: 鄭志欣
Advisor: Hsing-Kuo Pao
I. Fette, N. Sadeh, and A. Tomasic. Learning to detect phishing emails. In
Proceedings of the International World Wide Web Conference (WWW), pages
649–656, 2007.
Outline
Introduction
Method
Empirical evaluation
Conclusion
Introduction
Phishing (Spoofed websites)
Stealing account information
Logon credentials
Identity information
Phishing Problem – Hard
Method
PILFER: a machine-learning-based approach to classification.
Two classes: phishing emails and ham (good) emails
Feature Set
Features as used in email classification
IP-based URLs:
http://192.168.0.1/paypal.cgi?fix_account
Some phishing attacks are hosted on compromised PCs, which are typically referenced by IP address rather than by domain name.
This feature is binary.
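A minimal sketch of how this binary feature could be computed, assuming the links have already been extracted from the email as a list of URL strings. The function name `has_ip_based_url` is hypothetical and not taken from the paper.

```python
import ipaddress
from urllib.parse import urlsplit

def has_ip_based_url(urls):
    """Return True if any URL uses a literal IP address as its host."""
    for url in urls:
        host = urlsplit(url).hostname  # None when the URL has no network location
        if host is None:
            continue
        try:
            ipaddress.ip_address(host)  # raises ValueError for ordinary host names
            return True
        except ValueError:
            continue
    return False
```

For the example above, `has_ip_based_url(["http://192.168.0.1/paypal.cgi?fix_account"])` flags the email.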
Age of linked-to domain names
Legitimate-sounding domain name
playpal.com
paypal-update.com
These domains often have a limited life.
A WHOIS query returns the date the domain was registered.
If that date is within 60 days of the date the email was sent, the domain is considered “fresh”.
This is a binary feature.
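The 60-day check can be sketched as below. This assumes the registration date has already been obtained from a WHOIS lookup (the lookup itself is not shown), and it assumes "within 60 days" means registered at most 60 days before the email was sent. The name `is_fresh_domain` is hypothetical.

```python
from datetime import date, timedelta

FRESH_WINDOW = timedelta(days=60)  # threshold stated on the slide

def is_fresh_domain(registered_on, email_sent_on):
    """True if the domain was registered at most 60 days before the email was sent."""
    if registered_on is None:
        # Assumption: when WHOIS yields no date, the feature is left unset.
        return False
    age = email_sent_on - registered_on
    return timedelta(0) <= age <= FRESH_WINDOW
```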
Nonmatching URLs
This is a case of a link that says paypal.com but
actually links to badsite.com.
Such a link looks like <a href="badsite.com">
paypal.com</a>.
This is a binary feature.
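One possible way to detect such a link is sketched below using Python's standard HTML parser. The class and function names are hypothetical, and the test for "link text that looks like a host name" (no spaces, contains a dot) is a rough heuristic assumed for illustration; host names are compared literally, so `www.paypal.com` and `paypal.com` would count as different.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class AnchorCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a href=...> element."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def host_of(value):
    """Host name of a URL, tolerating a missing scheme such as 'paypal.com'."""
    if "//" not in value:
        value = "//" + value
    return (urlsplit(value).hostname or "").lower()

def has_nonmatching_url(html):
    parser = AnchorCollector()
    parser.feed(html)
    for href, text in parser.links:
        # Only compare when the visible text itself looks like a host name.
        if " " in text or "." not in text:
            continue
        if host_of(text) != host_of(href):
            return True
    return False
```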
“Here” links to non-modal domain
“Click here to restore your account access”
The “modal domain” is the domain most frequently linked to in the email.
The email is flagged if a link with the text “link”, “click”, or “here” points to a domain other than this modal domain.
This is a binary feature.
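A rough sketch of this check follows. It assumes the links are available as (href, visible text) pairs, treats the full host name as the domain (a simplification), and matches the trigger words anywhere in the link text. All names are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

TRIGGER_WORDS = {"link", "click", "here"}

def has_here_link_to_nonmodal_domain(links):
    """links: (href, visible_text) pairs already extracted from one email."""
    pairs = [((urlsplit(href).hostname or "").lower(), text) for href, text in links]
    hosts = [host for host, _ in pairs if host]
    if not hosts:
        return False
    modal_domain = Counter(hosts).most_common(1)[0][0]  # most frequently linked host
    for host, text in pairs:
        words = set(re.findall(r"[a-z]+", text.lower()))
        if host and host != modal_domain and words & TRIGGER_WORDS:
            return True
    return False
```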
HTML emails
Emails are sent as either plain text, HTML, or a
combination of the two - multipart/alternative
format.
To launch an attack without using HTML is difficult.
This is a binary feature.
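As an illustration, the feature could be derived with Python's standard `email` package. This sketch assumes that "HTML email" means the message contains at least one MIME part of type text/html, which also covers multipart/alternative messages; the function name is hypothetical.

```python
from email import message_from_string

def is_html_email(raw_message):
    """True if any MIME part of the message is declared as text/html."""
    msg = message_from_string(raw_message)
    # walk() visits the message itself and every nested MIME part.
    return any(part.get_content_type() == "text/html" for part in msg.walk())
```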
Number of links
The number of links present in an email.
Links are counted as <a> tags in the HTML.
This is a continuous feature.
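Counting links might look like the sketch below. Requiring an `href` attribute on the `<a>` tag is an assumption made here so that named anchors are not counted; the names `LinkCounter` and `count_links` are hypothetical.

```python
from html.parser import HTMLParser

class LinkCounter(HTMLParser):
    """Count <a> start tags that carry an href attribute."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.count += 1

def count_links(html):
    counter = LinkCounter()
    counter.feed(html)
    return counter.count
```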
Number of domains
Take the domain names previously extracted from all of the links and count the number of distinct domains.
Only the “main” part of a domain (the part that was actually registered) is counted:
https://www.cs.university.edu/ → university.edu
http://www.company.co.jp/ → company.co.jp
This is a continuous feature.
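The "main part" extraction and the distinct count can be sketched as follows. The suffix set here is a tiny hand-picked sample assumed for illustration only; a real system would need a complete list of public suffixes, and raw IP addresses would need separate handling. Function names are hypothetical.

```python
from urllib.parse import urlsplit

# Illustrative sample of two-label public suffixes, not a complete list.
TWO_LABEL_SUFFIXES = {"co.jp", "co.uk", "com.au"}

def main_domain(url):
    """Return the registered part of the host, e.g. 'university.edu'."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    labels = host.split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])

def count_domains(urls):
    """Number of distinct main domains among the given links."""
    return len({main_domain(u) for u in urls if main_domain(u)})
```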
Number of dots
Subdomains like
http://www.my-bank.update.data.com.
Redirection script, such as
http://www.google.com/url?q=http://www.badsite.com
This feature is simply the maximum number of dots ('.') contained in any of the links present in the email, and is a continuous feature.
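In code, this feature reduces to a one-line maximum over the links. The sketch below assumes the links are given as URL strings and counts dots across the whole URL, which is what makes redirection URLs score high; the name `max_dots` is hypothetical.

```python
def max_dots(urls):
    """Largest number of '.' characters found in any single link; 0 if no links."""
    return max((url.count(".") for url in urls), default=0)
```

Both example URLs above contain four dots, while a plain link such as http://paypal.com contains one.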
Contains javascript
Attackers can use JavaScript to hide information
from the user, and potentially launch sophisticated
attacks.
An email is flagged with the “contains javascript”
feature if the string “javascript” appears in the
email, regardless of whether it is actually in a
<script> or <a> tag
This is a binary feature.
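Because the check is a plain substring search, a sketch is very short. Matching case-insensitively is an assumption made here, and the function name is hypothetical.

```python
def contains_javascript(raw_email):
    """True if the string 'javascript' appears anywhere in the email text."""
    # Assumption: the match ignores case, so 'JavaScript:' also counts.
    return "javascript" in raw_email.lower()
```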
Spam-filter output
This is a binary feature: the output of the trained version of SpamAssassin with the default rule weights and threshold.
Each message is labeled “ham” or “spam”.
Empirical Evaluation
Machine-Learning Implementation
Testing SpamAssassin
Datasets
Additional Challenges
False Positives vs. False Negatives
Machine-Learning Implementation: PILFER
First, run a set of scripts to extract all the
features listed.
Second, we train and test a classifier using 10-fold
cross validation.
Random Forest (classifier)
Random forests create a number of decision trees and
each decision tree is made by randomly choosing an
attribute to split on at each level, and then pruning the
tree.
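The 10-fold procedure can be illustrated with a small index splitter. This is only a sketch of the data partitioning, not of the random forest itself; the function name, the shuffling, and the fixed seed are assumptions and not details from the paper.

```python
import random

def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    folds = [indices[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

In each round the classifier would be trained on the feature vectors indexed by `train` and evaluated on those indexed by `test`, so every email is used for testing exactly once.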
We use a random forest as the classifier.
Testing SpamAssassin
SpamAssassin is a widely-deployed freely-available
spam filter that is highly accurate in classifying
spam emails.
We classify the exact same dataset using
SpamAssassin version 3.1.0, using the default
thresholds and rules.
SpamAssassin is tested both “untrained” and “trained” (using 10-fold cross validation).
Datasets
Two publicly available datasets.
Ham corpora from the SpamAssassin project: 6950 non-phishing, non-spam emails
Phishingcorpus: approximately 860 email messages
Additional Challenges
The age of the dataset.
Phishing websites are short-lived.
Some of our features therefore cannot be extracted from older emails, which makes testing difficult.
Example: the age of the linked-to domain.
Result
Conclusion
It is possible to detect phishing emails with high
accuracy by using a specialized filter, using
features that are more directly applicable to
phishing emails than those employed by general
purpose spam filters.
Reference
I. Fette, N. Sadeh, and A. Tomasic. Learning to
detect phishing emails. In Proceedings of the
International World Wide Web Conference
(WWW), pages 649–656, 2007.
www.ics.uci.edu/.../Learning%20to%20Detect%20Phishing%20Emails.pptx
http://armorize-cht.blogspot.com/2010/01/phishing-mail.html