Bayesian Filtering, Anti-Phishing Toolbar Benefits


Bayesian Filtering
Anti-Phishing Toolbar Benefits
P. Likarish, E. Jung,
D. Dunbar, T. E. Hansen, and J.-P. Hourcade
12/04/07
presented by EJ Jung
Phishing
Why study phishing?
Identity Theft*
• One of the fastest-growing crimes
• ~15 million Americans/year, $2.8 billion
*Gartner, Inc. 2007 press release. http://www.gartner.com/it/page.jsp?id=501912, March 2007
**Phishing report. http://apwg.org
Phishing leads to malware
**Phishing report. Trojans and keyloggers. http://apwg.org
Phishing and botnets feed into the black market (Franklin et al., 2007)
6 months of IRC logs
… and into a national security threat
FBI director Robert Mueller says:
• Younis Tsouli, and his colleagues stole thousands of
credit card accounts through phishing schemes. They
ran up charges of more than $3 million for items they
thought fellow extremists might need, from night
vision goggles to GPS devices.
• Botnets are the Swiss Army knives of hackers
Phishing attack
Anti-Phishing Tools
Client or server side?
• server side protection is limited
• server-client cooperation
– hash of system
Client side is more common
• web browser toolbar
• password management
Early Efforts
Largely heuristics-based
• Set of rules developed by experts
• Still used by most anti-phishing tools
Examples:
• IE7 phishing filter
• SpoofGuard
SpoofGuard*
IE6 toolbar
Developed by Chou, Ledesma, Teraguchi,
Boneh, Mitchell at Stanford
Heuristics+whitelist
*N. Chou, R. Ledesma, Y. Teraguchi, D. Boneh, and J. C. Mitchell. Client-side defense against web-based identity theft. In NDSS '04: Proceedings of the 11th Annual Network and Distributed System Security Symposium, February 2004
Stateless Heuristics
URL check
• Suspicious URLs: @, IP, hex
Image check
• Hashed image database
– Image hashing
– Produces same hash for similar images
Link check
• Fails if >¼ of links fail URL check
Password check
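The URL check and link check above can be sketched in a few lines. This is only an illustration, not SpoofGuard's code: the function names are made up here, and restricting the hex test to the host part of the URL is an assumption (percent-escapes are common in legitimate paths).

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Percent-encoded byte such as %77 (hypothetical pattern for the "hex" rule).
HEX_ESCAPE = re.compile(r"%[0-9a-fA-F]{2}")


def url_is_suspicious(url: str) -> bool:
    """Flag URLs that contain '@', a raw IP address host, or a hex-escaped host."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if "@" in parts.netloc:
        return True  # e.g. http://www.paypal.com@192.0.2.7/
    try:
        ipaddress.ip_address(host)
        return True  # host is a literal IP address
    except ValueError:
        pass
    # Assumption: only hex escapes in the host part count as suspicious.
    return bool(HEX_ESCAPE.search(parts.netloc))


def page_fails_link_check(links: list[str]) -> bool:
    """Page fails if more than 1/4 of its links fail the URL check."""
    if not links:
        return False
    failing = sum(url_is_suspicious(link) for link in links)
    return failing / len(links) > 0.25


links = [
    "https://www.example.com/help",
    "http://192.0.2.7/login",
    "http://www.paypal.com@192.0.2.7/login",
    "https://www.example.com/about",
]
print(page_fails_link_check(links))
```

Both checks are stateless: they look only at the page being visited and need no browsing history.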
Stateful Heuristics
Domain check
• Hamming distance to known domains
Referrals
• From email site?
• May require DNS lookup
Image-domain association
• Extension of hashed image heuristic
• <image, URL> tuples
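A minimal sketch of the domain check, assuming the list of known domains comes from the user's browsing history and that a distance of 1 or 2 counts as a look-alike. Both the threshold and the function names are assumptions for illustration. Hamming distance is only defined for strings of equal length, so domains of different lengths are skipped here.

```python
from typing import Optional


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def lookalike_of(domain: str, known_domains: list[str],
                 max_distance: int = 2) -> Optional[str]:
    """Return the known domain that `domain` imitates, or None.

    max_distance is a hypothetical threshold, not a value from the slides.
    """
    domain = domain.lower()
    for known in known_domains:
        known = known.lower()
        if len(known) != len(domain):
            continue  # Hamming distance is undefined for unequal lengths
        distance = hamming_distance(domain, known)
        if 0 < distance <= max_distance:
            return known  # close to, but not the same as, a known domain
    return None


history = ["paypal.com", "example.com"]
print(lookalike_of("paypa1.com", history))
print(lookalike_of("paypal.com", history))
```

This check is stateful because it depends on what the user has visited before.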
Scoring
TSS = Total Spoof Score
\[
\mathrm{TSS} = \sum_{i=1}^{n} w_i P_i + \sum_{i,j=1}^{n} w_{i,j} P_i P_j + \sum_{i,j,k=1}^{n} w_{i,j,k} P_i P_j P_k + \cdots
\]
Ex: P_1 = URL check (0 if page passes, 1 if it fails), w_1 = 0.2
Source: N. Chou, R. Ledesma, Y. Teraguchi, D. Boneh, and J. C. Mitchell. Client-side defense against web-based identity theft. In NDSS
'04: Proceedings of the 11th Annual Network and Distributed System Security Symposium, February 2004
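The score can be computed directly from its definition. In this sketch each weight is keyed by the tuple of checks it multiplies, and any combination without an entry has weight 0. Only w1 = 0.2 for the URL check comes from the slide; the other weights and the check names are hypothetical.

```python
from math import prod


def total_spoof_score(results: dict[str, int],
                      weights: dict[tuple[str, ...], float]) -> float:
    """Sum of weight * product of the check results it applies to.

    results maps a check name to 0 (page passes) or 1 (page fails).
    weights maps a tuple of check names to the weight of that term.
    """
    return sum(
        weight * prod(results[name] for name in names)
        for names, weight in weights.items()
    )


results = {"url": 1, "image": 0, "link": 1}
weights = {
    ("url",): 0.2,          # w1 from the slide
    ("image",): 0.3,        # hypothetical
    ("link",): 0.1,         # hypothetical
    ("url", "link"): 0.25,  # hypothetical pairwise term
}
print(round(total_spoof_score(results, weights), 2))
```

A pairwise term only contributes when both of its checks fail, which lets combinations of weak signals count for more than each signal alone.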
Drawbacks to Heuristics
Difficult to develop accurate rules*
Large number of false positives and negatives**
Heuristics don’t evolve—phishing sites do.
*M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. A Bayesian approach to filtering junk e-mail. In AAAI Workshop on Learning for Text Categorization, July 1998.
**Y. Zhang, J. I. Hong, and L. F. Cranor. CANTINA: a content-based approach to detecting phishing web sites. In WWW '07: Proceedings of the 16th international conference on World Wide Web, pages 639–648, New York, NY, USA, 2007. ACM Press.
Next: Blacklist/Whitelist
~2004-current
Largely blacklist-based
• rely on phishing site reports
• still used by most anti-phishing tools
Examples:
• IE7 phishing filter
• Firefox 2 phishing protection & Google safe-browsing
• Netcraft* Toolbar
*Netcraft Ltd. http://toolbar.netcraft.com
Drawbacks to Blacklist/Whitelist
Need reliable and timely sources for reports
Window of vulnerability
• after site launch, before being blacklisted
• avg lifetime of a phishing site: 3 days
• avg lifetime after being blacklisted: 22 hours
• cost of undoing identity theft: priceless
Adapt classification methods
• CANTINA, B-APT
*Y. Zhang, J. I. Hong, and L. F. Cranor. CANTINA: a content-based approach to detecting phishing web sites. In WWW '07: Proceedings
of the 16th international conference on World Wide Web, pages 639–648, New York, NY, USA, 2007. ACM Press.
CANTINA*
Technique
• TF-IDF + Robust Hyperlinks
• Domain name
• Heuristics
*Y. Zhang, J. I. Hong, and L. F. Cranor. CANTINA: a content-based approach to detecting phishing web sites. In WWW '07: Proceedings
of the 16th international conference on World Wide Web, pages 639–648, New York, NY, USA, 2007. ACM Press.
TF-IDF
Text classification technique
• Information retrieval
• Term Frequency-Inverse Document Frequency
Importance of a word in a document in a given corpus
• Document = website
• Corpus = English language
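As a rough illustration of the idea, the sketch below scores each word on a page by how often it appears there (TF) times how rare it is across the corpus (IDF). The exact weighting CANTINA uses is not given in these slides; the log form and the +1 smoothing are assumptions, and the corpus statistics are made-up numbers.

```python
import math
from collections import Counter


def tf_idf(document_tokens: list[str],
           corpus_doc_freq: dict[str, int],
           corpus_size: int) -> dict[str, float]:
    """TF-IDF score per term of one document.

    corpus_doc_freq[term] = number of corpus documents containing the term.
    The +1 in the denominator is an assumed smoothing for unseen terms.
    """
    counts = Counter(document_tokens)
    total = len(document_tokens)
    scores = {}
    for term, count in counts.items():
        tf = count / total
        idf = math.log(corpus_size / (1 + corpus_doc_freq.get(term, 0)))
        scores[term] = tf * idf
    return scores


# Hypothetical corpus of 1000 documents.
doc_freq = {"the": 999, "login": 99, "paypal": 9}
scores = tf_idf(["paypal", "login", "the", "the"], doc_freq, corpus_size=1000)
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(term, round(score, 3))
```

Common words such as "the" score near zero even when frequent on the page, while a brand name that is rare in the corpus scores high.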
Robust Hyperlinks
Phelps and Wilensky
TF-IDF on all words on page
Lexical signature
• 5 words with highest TF-IDF scores
• Almost uniquely identifies a page among 1,000,000,000 pages…
TF-IDF + Hyperlinks in CANTINA
Calculate lexical signature
Google search on signature
• If domain name is within top 30 hits, site is legitimate
• Otherwise, it is phishing
Results:
• 94% true positives : 30% false positives
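The decision rule above can be sketched as follows. The `search` argument is a hypothetical stand-in for a search engine query that returns result URLs in rank order; no real search API is assumed. TF-IDF scores are taken as given, and comparing full host names rather than registered domains is a simplification.

```python
from typing import Callable, Mapping
from urllib.parse import urlsplit


def lexical_signature(tfidf_scores: Mapping[str, float], size: int = 5) -> list[str]:
    """The `size` terms with the highest TF-IDF scores (ties broken alphabetically)."""
    ranked = sorted(tfidf_scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:size]]


def looks_legitimate(page_url: str,
                     tfidf_scores: Mapping[str, float],
                     search: Callable[[str], list[str]],
                     top_n: int = 30) -> bool:
    """True if the page's host appears among the top_n hits for its signature."""
    query = " ".join(lexical_signature(tfidf_scores))
    page_host = (urlsplit(page_url).hostname or "").lower()
    hits = search(query)[:top_n]
    return any((urlsplit(hit).hostname or "").lower() == page_host for hit in hits)


def fake_search(query: str) -> list[str]:
    """Stand-in for a real search engine, used only for this example."""
    return ["https://www.paypal.com/signin", "https://www.example.org/paypal-help"]


page_scores = {"paypal": 1.2, "login": 0.6, "account": 0.5,
               "password": 0.4, "secure": 0.3, "the": 0.0}
print(looks_legitimate("https://www.paypal.com/", page_scores, fake_search))
print(looks_legitimate("http://paypa1.example.net/login", page_scores, fake_search))
```

A phishing page copies the content of the real site, so its signature leads the search back to the real domain rather than to the phishing host.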
Improving on TF-IDF
Add domain name to Google search
• 97% → 67% t.p.
• 30% → 10% f.p.
TF-IDF + Zero results-Means-Phishing + domain name
• 97% t.p. : 10% f.p.
Adding heuristics to CANTINA
Heuristics from SpoofGuard and other sources
Trade-off
• Reduces true positive accuracy
– 97% → 89% t.p.
• Reduces false positive rate
– 10% → 1% f.p.
Drawbacks to CANTINA
Relies on outside sources for information
• Google
Requires heuristics to reduce false positives
• Reduces accuracy…
Language-specific
• Different corpus for each foreign language
• Difficulties with East Asian languages
Unacceptable false positive rate
• Misclassifications undermine user confidence in tool
B-APT: Bayesian Anti-Phishing Toolbar
Firefox browser toolbar
• will extend to other browsers
• goals: detect, communicate, and educate
Bayesian filtering + whitelist
• similar to spam filtering
• different from spam filtering
– phishing sites mirror legitimate sites
– hard to find training set (inbox vs. blacklist database)
• comprehensive whitelist
Innovative UI
• no known effective security indicators for warning users of phishing sites (Dhamija, 2006; Wu, 2007)
Bayesian classification
Bayes’ law on conditional probability
Pros
• easy to compute
• training and tailoring
Cons
• assumes independence among words
• Bayesian poisoning
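To make the idea concrete, here is a generic naive Bayes text classifier of the kind used in spam filtering. It is not B-APT's implementation, which these slides do not describe; the class name, the two labels, and the add-one smoothing are all assumptions. The independence assumption listed under Cons shows up as the simple sum of per-token log probabilities.

```python
import math
from collections import Counter

LABELS = ("phish", "legit")


class NaiveBayesPageClassifier:
    """Toy two-class naive Bayes over page tokens (illustrative only)."""

    def __init__(self) -> None:
        self.token_counts = {label: Counter() for label in LABELS}
        self.page_counts = Counter()

    def train(self, label: str, tokens: list[str]) -> None:
        self.token_counts[label].update(tokens)
        self.page_counts[label] += 1

    def log_score(self, label: str, tokens: list[str]) -> float:
        """log P(label) + sum of log P(token | label), with add-one smoothing.

        Requires at least one training page per label.
        """
        counts = self.token_counts[label]
        vocabulary = set().union(*self.token_counts.values())
        total = sum(counts.values())
        score = math.log(self.page_counts[label] / sum(self.page_counts.values()))
        for token in tokens:
            # Tokens are treated as independent given the label.
            score += math.log((counts[token] + 1) / (total + len(vocabulary)))
        return score

    def classify(self, tokens: list[str]) -> str:
        return max(LABELS, key=lambda label: self.log_score(label, tokens))


clf = NaiveBayesPageClassifier()
clf.train("phish", ["verify", "account", "password", "urgent"])
clf.train("legit", ["news", "weather", "sports", "account"])
print(clf.classify(["verify", "password"]))
print(clf.classify(["news", "sports"]))
```

Bayesian poisoning would correspond to an attacker padding a phishing page with tokens that are common in legitimate pages, which pushes the "legit" score up.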
Implementation details
Training on phishing pages and legitimate pages
• Phishtrack: HTML of phishing pages*
• 1200+ phishing sites = 160+ unique sites
• Alexa top 500: most popular websites**
• same amount of data (in KB) as the phishing sites (17k vs 64k tokens)
*http://www.dslreports.com/phishtrack
**http://www.alexa.com/
B-APT detecting phishing sites
Anti-phishing tools tested on 60 phishing sites

Tool                  blocked    warned     no action
B-APT                 100%(60)   0%(0)      0%(0)
Firefox 2.0           55%(33)    0%(0)      45%(27)
Internet Explorer 7   42%(25)    22%(13)    36%(22)
Netcraft              88%(53)    10%(6)     2%(1)
SpoofGuard            63%(38)    27%(16)    10%(6)
B-APT detecting legitimate sites
Anti-phishing tools tested on 60 legitimate sites

Tool                  blocked    warned     no action
B-APT                 3%(2)      0%(0)      97%(58)
Firefox 2.0           0%(0)      0%(0)      100%(60)
Internet Explorer 7   0%(0)      0%(0)      100%(60)
Netcraft              0%(0)      0%(0)      100%(60)
SpoofGuard            10%(6)     25%(15)    65%(39)
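The two result sets can be combined into one detection rate and one false-alarm rate per tool. The counts below are copied from the slides. Counting "warned" together with "blocked" as a flagged page is an assumption of this sketch, not something the slides state.

```python
# (blocked, warned, no action) counts out of 60 sites, copied from the slides.
ON_PHISHING = {
    "B-APT": (60, 0, 0),
    "Firefox 2.0": (33, 0, 27),
    "Internet Explorer 7": (25, 13, 22),
    "Netcraft": (53, 6, 1),
    "SpoofGuard": (38, 16, 6),
}
ON_LEGITIMATE = {
    "B-APT": (2, 0, 58),
    "Firefox 2.0": (0, 0, 60),
    "Internet Explorer 7": (0, 0, 60),
    "Netcraft": (0, 0, 60),
    "SpoofGuard": (6, 15, 39),
}


def flagged_rate(counts: tuple[int, int, int]) -> float:
    """Fraction of sites that were blocked or warned about (assumed grouping)."""
    blocked, warned, no_action = counts
    return (blocked + warned) / (blocked + warned + no_action)


def summarize(tool: str) -> tuple[float, float]:
    """(detection rate on phishing sites, false-alarm rate on legitimate sites)."""
    return flagged_rate(ON_PHISHING[tool]), flagged_rate(ON_LEGITIMATE[tool])


for tool in ON_PHISHING:
    detected, false_alarms = summarize(tool)
    print(f"{tool:20s} detected {detected:.0%}  false alarms {false_alarms:.0%}")
```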
Summary
Classification + heuristics do well
• B-APT has no false negatives, some false positives
• working on communicating false positives
• detect, communicate, and educate
Use of any toolbar is better than none
• the lowest block rate was IE7's 42%
• blacklist-based ones get better as time passes
Beware of malware
• Badware.org with Google
(Zhang, 2007)