PhishDef: URL Names Say It All

Anh Le, Athina Markopoulou (University of California, Irvine, USA)
Michalis Faloutsos (University of California, Riverside, USA)
What is Phishing?
• Social engineering and technical means to steal consumers’ personal identity, data, etc.
• Causes billions of dollars in losses annually
Anh Le - UC Irvine - PhishDef
Most Targeted Industry Sectors, 2nd Quarter 2010 (source: Antiphishing.org)
• Payment Services: 37.9%
• Financial: 33.1%
• Classifieds: 6.6%
• Auction: 5.5%
• Gaming: 4.6%
• Retail/Service: 3.6%
• Other: 3.4%
• Social Networking: 2.8%
• Government: 1.3%
• ISP: 1.2%
Example of a Phishing Site
Current Protection
• Google Safe Browsing
• Microsoft SmartScreen
• Third-party tools
Current Protection Model
Google Safe Browsing
Motivation:
Blacklist-based protection is reactive: it cannot protect against zero-day phishing.
Outline
o Phishing Background
o Motivation
o Our proposal
o New Protection Model
o Learning Algorithms
o Dataset
o Feature Selection
o Evaluation Results
o Concluding Remarks
Our Proposed Protection Model
• Main challenges: accuracy and classification latency
• Which classification algorithm works best?
• Which set of features works best?
Prior Work (Server-Side Classification)
o Whittaker et al. [NDSS ’10]
    o Google Safe Browsing
o Ma et al. [SIGKDD ’09]
    o Batch-based classification
o Ma et al. [ICML ’09]
    o Batch-based vs. online learning
Main Contributions
o New protection model:
    o Client-side classification
o Propose using Adaptive Regularization of Weights (AROW)
    o High accuracy
    o Resilient to noise
o Set of lexical features
    o Fast to extract at client side
    o Obfuscation resistant
Machine Learning Algorithms
• Batch-based Support Vector Machine
• Online Perceptron
• Confidence-Weighted (CW) [Dredze et al., ICML 2008]
• Adaptive Regularization of Weights (AROW) [Crammer et al., NIPS 2009]
Online Classification
• Maintain a weight vector and use it for classification
• Client side: classify with a weight vector trained beforehand and features extracted in real time
• Server side: train the weight vector (e.g., Online Perceptron)
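The slide's equations did not survive extraction, so the following is only a minimal sketch of the standard online perceptron, not the authors' implementation. The sparse dictionary representation, the label convention (+1 phishing, -1 benign), and the function names are assumptions made for illustration.

```python
from typing import Dict

# Sparse feature vector: feature name -> value (representation is an assumption).
Features = Dict[str, float]


def score(w: Features, x: Features) -> float:
    """Dot product between the weight vector and a sparse feature vector."""
    return sum(w.get(name, 0.0) * value for name, value in x.items())


def predict(w: Features, x: Features) -> int:
    """Classify with the current weights: +1 (phishing) or -1 (benign)."""
    return 1 if score(w, x) > 0 else -1


def perceptron_update(w: Features, x: Features, y: int, lr: float = 1.0) -> None:
    """Standard perceptron rule: change the weights only when (x, y) is misclassified."""
    if y * score(w, x) <= 0:
        for name, value in x.items():
            w[name] = w.get(name, 0.0) + lr * y * value
```

Under this sketch, `predict` is the only part that would need to run at the client, since it uses nothing but the weight vector and the features of the URL being visited; `perceptron_update` would run wherever labeled URLs are available.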
Online Classification
• Confidence-Weighted (CW): make the minimum change that is enough to correct the last mistake
• Adaptive Regularization of Weights (AROW): trade off minimum change, a penalty for mistakes, and increasing confidence
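As a rough illustration of AROW, the sketch below follows the closed-form update from Crammer et al. (NIPS 2009), using a diagonal approximation of the covariance matrix to keep it short. The class name, the sparse dictionary representation, and the default `r = 1.0` are assumptions, not details taken from the slides.

```python
from typing import Dict

Features = Dict[str, float]  # sparse feature vector: name -> value (assumption)


class DiagonalAROW:
    """AROW with a diagonal covariance approximation (illustrative sketch)."""

    def __init__(self, r: float = 1.0) -> None:
        self.r = r                           # regularization parameter, r > 0
        self.mu: Dict[str, float] = {}       # mean weights, implicitly 0.0
        self.sigma: Dict[str, float] = {}    # per-feature variance, implicitly 1.0

    def margin(self, x: Features) -> float:
        return sum(self.mu.get(k, 0.0) * v for k, v in x.items())

    def variance(self, x: Features) -> float:
        # x^T Sigma x for a diagonal Sigma: high when the features are rarely seen.
        return sum(self.sigma.get(k, 1.0) * v * v for k, v in x.items())

    def predict(self, x: Features) -> int:
        return 1 if self.margin(x) > 0 else -1

    def update(self, x: Features, y: int) -> None:
        loss = max(0.0, 1.0 - y * self.margin(x))   # hinge loss: penalty for mistake
        if loss == 0.0:
            return
        beta = 1.0 / (self.variance(x) + self.r)
        alpha = loss * beta
        for k, v in x.items():
            s = self.sigma.get(k, 1.0)
            # Move the mean further along low-confidence (high-variance) features.
            self.mu[k] = self.mu.get(k, 0.0) + alpha * y * s * v
            # Shrink the variance: confidence in this feature increases.
            self.sigma[k] = s - beta * s * s * v * v
```

Because the step size is scaled by `beta` rather than forced to fully correct the current example, a single mislabeled URL moves the weights less aggressively than a hard-constraint update would.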
Dataset
o Phishing URLs
    o PhishTank (4,082)
    o MalwarePatrol (2,001)
o Benign URLs
    o Open Directory (4,012)
    o Yahoo Directory (4,143)
o Time period: June 2010
Feature Selection
o Lexical Features
o External Features
    o Country, AS number, registration date, registrant, registrar, etc.
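The slides do not enumerate the lexical features, so the sketch below only illustrates the general idea: features computed from the URL string alone, with no remote lookups. The specific features (lengths, dot count, hostname and path tokens), the delimiter set, and the name `lexical_features` are assumptions for illustration.

```python
import re
from typing import Dict
from urllib.parse import urlsplit

# Characters used to split hostname and path into tokens (assumed delimiter set).
_DELIMS = re.compile(r"[/?.=\-_&]+")


def lexical_features(url: str) -> Dict[str, float]:
    """Extract features from the URL text only; no DNS, WHOIS, or page fetch."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    path = parts.path + ("?" + parts.query if parts.query else "")

    feats: Dict[str, float] = {
        "len:url": float(len(url)),
        "len:host": float(len(host)),
        "dots:url": float(url.count(".")),
    }
    # Bag-of-words tokens, kept separate for hostname and path.
    for tok in _DELIMS.split(host):
        if tok:
            feats["host:" + tok] = 1.0
    for tok in _DELIMS.split(path):
        if tok:
            feats["path:" + tok] = 1.0
    return feats
```

Everything here is plain string processing, which is why extraction of this kind can run at the client within milliseconds, whereas external features such as registration date require a query to a remote server.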
Evaluation Results: Lexical vs. Full Features
Full features relative to lexical features alone:
(+) ~1%
(-) Dependency on remote server
(-) Avg. latency: 1.64 s
Lexical features alone are better suited than full features for client-side phishing classification
Evaluation Results:
CW vs. AROW
AROW is more resilient to noise than CW
Conclusion: PhishDef
o Client-side phishing classification system
o Proactive, on-the-fly classification of zero-day phishing URLs
o Low client-side delay (ms), high accuracy (97%)
o Resilient to noisy data
o Future Work:
    o Develop an add-on for Firefox
o Questions
Example of a Phishing Site
Phishing URL: http://pilety.ru/c548c205d7660ed0628b467d7d5aa54c9c3a7124/image/taxrefund.htm
Legitimate URL: http://www.hmrc.gov.uk/intro-income-tax.htm
Evaluation Results: Batch-Based vs. Online Learning
Online learning outperforms batch-based learning for phishing classification
Chrome 11 > Firefox 4