Learning to Detect Phishing Emails

Presenter: 鄭志欣
Advisor: Hsing-Kuo Pao
I. Fette, N. Sadeh, and A. Tomasic. Learning to detect phishing emails. In
Proceedings of the International World Wide Web Conference (WWW), pages
649–656, 2007.
Outline
 Introduction
 Method
 Empirical evaluation
 Conclusion
Introduction
 Phishing (Spoofed websites)
 Stealing account information
 Logon credentials
 Identity information
 Phishing is a hard problem: the emails are crafted to look legitimate
Method
 PILFER: a machine-learning-based approach to classification
 Two classes: phishing emails vs. ham (good) emails
 Feature Set
 Features as used in email classification
Features as used in email classification
 IP-based URLs:
 http://192.168.0.1/paypal.cgi?fix_account
 Some phishing attacks are hosted on compromised PCs, which may have no DNS entry, so the link refers to them by IP address.
 This feature is binary.
 Age of linked-to domain names
 Phishers register legitimate-sounding domain names, such as
 playpal.com
 paypal-update.com
 These domains often have a limited life
 A WHOIS query gives the registration date of each linked-to domain
 If that date is within 60 days of the date the email was sent, the domain is considered “fresh”
 This is a binary feature
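The 60-day test can be sketched as below. This assumes the registration date has already been obtained from a WHOIS lookup (the lookup itself is not shown), and it assumes that only registrations on or before the send date count. `is_fresh_domain` is a hypothetical name.

```python
from datetime import date, timedelta

FRESH_WINDOW = timedelta(days=60)  # threshold stated on the slide

def is_fresh_domain(registered_on, email_sent_on):
    """Return 1 if the domain was registered at most 60 days before the email was sent."""
    age = email_sent_on - registered_on
    # Assumption: a registration date after the send date is not treated as fresh.
    return int(timedelta(0) <= age <= FRESH_WINDOW)
```

The feature for an email would then be 1 if any linked-to domain is fresh.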
 Nonmatching URLs
 A link whose text says paypal.com but which actually points to badsite.com.
 Such a link looks like <a href="badsite.com">paypal.com</a>.
 This is a binary feature.
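One possible way to detect such a link, sketched with the standard-library HTML parser. The class and function names are hypothetical, and the rule "the visible text looks like a host name" is a simplifying assumption, not the paper's exact test.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

def host_of(text):
    """Host of a URL-like string; tolerates a missing scheme such as 'paypal.com'."""
    text = text.strip()
    if "://" not in text:
        text = "http://" + text
    return (urlparse(text).hostname or "").lower()

class LinkCollector(HTMLParser):
    """Collects (href, visible text) pairs for every <a> element."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def has_nonmatching_url(html):
    """Return 1 if some link text names one host while its href points to another."""
    parser = LinkCollector()
    parser.feed(html)
    for href, text in parser.links:
        # Assumption: only text without spaces and with a dot is treated as a URL.
        looks_like_url = "." in text and " " not in text
        if looks_like_url and host_of(text) != host_of(href):
            return 1
    return 0
```

This sketch compares full host names, so www.paypal.com and paypal.com would count as different hosts.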
 “Here” links to non-modal domain
 “Click here to restore your account access”
 A link whose text contains “link”, “click”, or “here” and that points to a domain other than the “modal domain” (the domain most frequently linked to in the email)
 This is a binary feature.
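A rough sketch of this check, assuming the links have already been extracted as (href, visible text) pairs. The function name is hypothetical. For brevity it compares full host names and matches the trigger words as substrings, both of which are simplifications.

```python
from collections import Counter
from urllib.parse import urlparse

TRIGGER_WORDS = ("link", "click", "here")  # words listed on the slide

def host_of(url):
    return (urlparse(url).hostname or "").lower()

def here_link_to_non_modal_domain(links):
    """links: list of (href, visible_text) pairs taken from one email."""
    hosts = [host_of(href) for href, _ in links]
    hosts_present = [h for h in hosts if h]
    if not hosts_present:
        return 0
    # The modal domain is the host that appears most often among the links.
    modal = Counter(hosts_present).most_common(1)[0][0]
    for (_, text), host in zip(links, hosts):
        lowered = text.lower()
        if host and host != modal and any(w in lowered for w in TRIGGER_WORDS):
            return 1
    return 0
```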
 HTML emails
 Emails are sent as plain text, HTML, or a combination of the two (multipart/alternative format).
 Launching an attack without HTML is difficult, because a plain-text email has virtually no way to disguise the URL a link leads to.
 This is a binary feature.
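A small sketch of this feature using Python's standard `email` package, assuming the raw message text is available. Treating any `text/html` part (including one inside a multipart message) as "HTML email" is an assumption of this sketch, and `is_html_email` is a hypothetical name.

```python
from email import message_from_string

def is_html_email(raw_message):
    """Return 1 if the message, or any of its MIME parts, has type text/html."""
    msg = message_from_string(raw_message)
    # walk() yields the message itself and then every nested MIME part.
    return int(any(part.get_content_type() == "text/html" for part in msg.walk()))
```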
 Number of links
 The number of links present in an email.
 Links are counted as <a> tags in the HTML
 This is a continuous feature.
 Number of domains
 Take the domain names extracted from all of the links and count the distinct ones.
 Only the “main” part of a domain is considered (the part a person would pay to register)
 https://www.cs.university.edu/ counts as university.edu
 http://www.company.co.jp/ counts as company.co.jp
 This is a continuous feature.
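The counting step might look like the sketch below. The rule for finding the "main" part is a crude heuristic chosen for illustration (a tiny hand-written set of second-level labels under two-letter country codes); it is an assumption, not the paper's method, and a production system would more likely rely on a maintained public-suffix list.

```python
from urllib.parse import urlparse

# Assumption: a small illustrative set of labels used under country-code TLDs.
SECOND_LEVEL_LABELS = {"co", "com", "ac", "org", "net", "gov", "edu"}

def main_domain(url):
    """Approximate the registrable part of the host, e.g. company.co.jp."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    if len(labels) >= 3 and len(labels[-1]) == 2 and labels[-2] in SECOND_LEVEL_LABELS:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])

def count_domains(urls):
    """Number of distinct main domains among the given URLs."""
    return len({main_domain(u) for u in urls if urlparse(u).hostname})
```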
 Number of dots
 Attackers use extra dots to build legitimate-looking URLs
 Subdomains, such as http://www.my-bank.update.data.com
 Redirection scripts, such as http://www.google.com/url?q=http://www.badsite.com
 The feature is the maximum number of dots ('.') contained in any of the links present in the email.
 This is a continuous feature.
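As a short illustration, the feature reduces to a maximum over per-link dot counts. The sketch assumes the links are already available as strings, and `max_dots` is a hypothetical name.

```python
def max_dots(urls):
    """Maximum number of '.' characters in any single link; 0 if there are no links."""
    return max((url.count(".") for url in urls), default=0)
```

Both example URLs on the slide contain four dots, compared with one dot in a plain link such as http://paypal.com.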
 Contains javascript
 Attackers can use JavaScript to hide information
from the user, and potentially launch sophisticated
attacks.
 An email is flagged with the “contains javascript”
feature if the string “javascript” appears in the
email, regardless of whether it is actually in a
<script> or <a> tag
 This is a binary feature.
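Because the test is a plain string search, a sketch is very short. Matching case-insensitively is an assumption of this sketch, and `contains_javascript` is a hypothetical name.

```python
def contains_javascript(raw_email):
    """Return 1 if the string 'javascript' appears anywhere in the email text."""
    # Assumption: the match ignores case, so 'JavaScript:' in an href also counts.
    return int("javascript" in raw_email.lower())
```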
 Spam-filter output
 The class assigned to the email by a spam filter: “ham” or “spam”
 Uses the trained version of SpamAssassin with the default rule weights and threshold
 This is a binary feature.
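One way this feature could be read off, assuming the message has already been passed through SpamAssassin and therefore carries an `X-Spam-Status` header beginning with "Yes" or "No". Running the filter is not shown, and `spam_filter_output` is a hypothetical name.

```python
from email import message_from_string

def spam_filter_output(raw_message):
    """Return 1 ("spam") if the X-Spam-Status header starts with 'Yes', else 0 ("ham")."""
    status = message_from_string(raw_message).get("X-Spam-Status", "")
    # Assumption: a missing header is treated as ham.
    return int(status.strip().lower().startswith("yes"))
```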
Empirical Evaluation
 Machine-Learning Implementation
 Testing SpamAssassin
 Datasets
 Additional Challenges
 False Positives vs. False Negatives
 Machine-Learning Implementation: PILFER
 First, run a set of scripts to extract all the features listed.
 Second, train and test a classifier using 10-fold cross validation.
 Random Forest (classifier)
 We use a random forest as the classifier.
 Random forests create a number of decision trees; each tree is built by randomly choosing an attribute to split on at each level and then pruning the tree.
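The 10-fold procedure can be sketched in plain Python as below. To keep the example self-contained, the classifier is a trivial majority-class stand-in and NOT a random forest; any learner exposing the same `fit` interface could be swapped in. All names here are hypothetical.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(features, labels, fit, k=10, seed=0):
    """fit(train_X, train_y) must return a predict(x) callable; returns overall accuracy."""
    correct = 0
    for test_idx in k_fold_indices(len(labels), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        # Train on the other k-1 folds, then score on the held-out fold.
        predict = fit([features[i] for i in train_idx], [labels[i] for i in train_idx])
        correct += sum(predict(features[i]) == labels[i] for i in test_idx)
    return correct / len(labels)

def fit_majority(train_X, train_y):
    """Stand-in classifier: always predicts the most common training label."""
    majority = max(set(train_y), key=train_y.count)
    return lambda x: majority
```

Each email is used for testing exactly once and for training in the other nine rounds.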
 Testing SpamAssassin
 SpamAssassin is a widely-deployed freely-available
spam filter that is highly accurate in classifying
spam emails.
 We classify the exact same dataset using
SpamAssassin version 3.1.0, using the default
thresholds and rules.
 “Untrained” SpamAssassin: the default installation
 “Trained” SpamAssassin: trained using 10-fold cross validation
 Datasets
 Two publicly available datasets.
 ham corpora from the SpamAssassin project
 6950 non-phishing non-spam emails
 Phishingcorpus
 approximately 860 email messages
 Additional Challenges
 The age of the dataset.
 Phishing websites are short-lived.
 Some features therefore cannot be extracted from older emails, which makes testing difficult (e.g., the age of the linked-to domain).
Result
Conclusion
 It is possible to detect phishing emails with high accuracy by using a specialized filter whose features are more directly applicable to phishing emails than those employed by general-purpose spam filters.
Reference
 I. Fette, N. Sadeh, and A. Tomasic. Learning to detect phishing emails. In Proceedings of the International World Wide Web Conference (WWW), pages 649–656, 2007.
 www.ics.uci.edu/.../Learning%20to%20Detect%20Phishing%20Emails.pptx
 http://armorize-cht.blogspot.com/2010/01/phishing-mail.html