Searching the Searchers with SearchAudit

JOHN P. JOHN
FANG YU
YINGLIAN XIE
MARTÍN ABADI
ARVIND KRISHNAMURTHY
PRESENTATION BY SAM KLOCK
Motivation
We can find this via a Google search
Motivation (cont’d)
 Search engines open opportunities for attackers
 Construct clever queries
 Find vulnerable sites
 Plant malware; spam (e.g., MyDoom)
 Do so stealthily and cheaply
 Mitigation strategy: identify malicious queries
 May be able to deny results to user
 Identify attackers (probably bots)
 Interpret strategy, then anticipate and prevent
 The question: how to do so
Proposed Approach
 SearchAudit
 Framework for generating malicious queries
 Input:
   Seed set of known malicious queries
   Search logs
 Output:
   Large set of suspicious queries
   Regular expressions matching queries
Figure: SearchAudit takes a seed set and search logs as input and produces an expanded set of queries and regular expressions.
Seed set (example queries):
inurl:gotoURL.asp?url=
filetype:asp inurl:"shopdisplayproducts.asp"
ext:pl inurl:cgi intitle:"FormMail *" -"*Referrer" -"* Denied" -sourceforge -error -cvs -input
filetype:cgi inurl:tseekdir.cgi
...
Expanded set: shown on the slide as a stack of query lists like the seed set
Regular expressions (examples):
"/includes/joomla\.php" site:\.[a-zA-Z]{2,3}
"/includes/class_item\.php" site:[^?=#+@;&:]{2,4}
"php-nuke" site:[^?=#+@;&:]{2,4}
"modules\.php\?op=modload" site:\.[a-zA-Z0-9]{2,6}
Proposed Approach (cont’d)
 Needed to implement:
 Seed set: milw0rm.com
 Search logs: Microsoft Research  Bing
 Way to expand seed set into more queries
 Way to infer regular expressions
 Intended benefits:
 Harvesting lots of information
   Three months: ~1.2 TB of logs
   Interpret relationship between queries and attacks
   Use queries to find potential victims
   Stop attacks
SearchAudit
 Query identification
 Query analysis
Query Identification: Expansion
 Basic idea: bootstrap on seed set
   Search logs for exact matches to seed queries
   Record IPs of hosts making seed queries
   Add other queries from those IPs to set
   Intuition: make one malicious query, will probably make more
 Account for DHCP
Figure labels: Seed queries; Log search; IP addresses; Queries made by IPs; Queries made on same day
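The expansion step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Dryad/DryadLINQ implementation: the `LogEntry` record and the `expand_seed_set` function are hypothetical names, and the same-day restriction is one simple way to account for DHCP reassigning an IP address to a different host.

```python
from typing import Iterable, NamedTuple


class LogEntry(NamedTuple):
    """Hypothetical log record; real search logs carry many more fields."""
    ip: str     # client IP address
    day: str    # calendar day of the query, e.g. "2009-02-14"
    query: str  # raw query string


def expand_seed_set(logs: Iterable[LogEntry], seeds: set[str]) -> set[str]:
    """Return every query issued by an IP on a day when that IP also
    issued an exact seed query (the matched seeds are included)."""
    entries = list(logs)
    # Step 1: (ip, day) pairs that made at least one exact seed query.
    suspicious_hosts = {(e.ip, e.day) for e in entries if e.query in seeds}
    # Step 2: all queries made by those IPs on the same day.
    return {e.query for e in entries if (e.ip, e.day) in suspicious_hosts}
```

Keying on the (IP, day) pair rather than the IP alone keeps a query from a later day, when the address may belong to someone else, out of the expanded set.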
Query Identification: Regular Expressions
 Goals:
 Account for variation in queries
 Take advantage of scripting
 See paper for generation algorithm
 Compute score for generated expressions
   Lower score: more specific
   Goal: discard overly general expressions (score > 0.6)
 Consolidate to avoid overlap
 Avoid proxies, public NAT for performance
 Loopback for more queries
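A minimal sketch of the filtering and matching steps follows. It assumes the expressions and their scores have already been produced by the generation algorithm in the paper, which is not reproduced here; the function names and the example scores are made up for illustration, and using `search` rather than a full-string match is also an assumption.

```python
import re
from typing import Iterable

SCORE_CUTOFF = 0.6  # from the slides: scores above this are treated as too general


def keep_specific(scored: dict[str, float],
                  cutoff: float = SCORE_CUTOFF) -> list[re.Pattern]:
    """Compile only the expressions whose precomputed score does not exceed the cutoff."""
    return [re.compile(expr) for expr, score in scored.items() if score <= cutoff]


def match_queries(patterns: Iterable[re.Pattern], queries: Iterable[str]) -> set[str]:
    """Return the queries matched by at least one retained expression."""
    patterns = list(patterns)
    return {q for q in queries if any(p.search(q) for p in patterns)}


# Illustrative input: the first expression appears on the slides, the scores do not.
scored = {
    r'"php-nuke" site:[^?=#+@;&:]{2,4}': 0.2,  # assumed score, specific
    r'.*': 0.95,                               # assumed score, overly general
}
suspicious = match_queries(keep_specific(scored),
                           ['"php-nuke" site:com', 'weather seattle'])
```

Queries found this way could be fed back in as a new seed set, which is the loopback mentioned in the last bullet.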
Query Identification: Results
 Data from Bing and milw0rm
 500 queries
 Logs for Feb. 2009, Dec. 2009, Jan. 2010
   ~2 billion views per month
 System implemented on Dryad/DryadLINQ
 Initial observations:
 Using specificity scores < 0.6 seems to be effective
   Based on cookie heuristic
 Proxy elimination does not limit results
Query Identification: Results (cont’d)
 Query expansion:
 122 of 500 queries matched in logs: 174 unique IPs
 Expanded to 800 unique queries, 264 IPs
 Regular expressions matched 3,560 queries, 1,001 IPs
 Incomplete seeds
 Tried with subsets of original set
 Coverage still good
Query Identification: Results (cont’d)
 Loopback:
   Multiple loopbacks got more results
   One iteration is good enough
 Overall statistics
   10,000s IPs each month
   100,000s unique queries each month
   Dec. 09: set of unusual attacker IPs cause spike
Query Identification: Verification
 Want to show queries are malicious
 Sometimes easy: 73% of queries associated with security/hacker sites
 What about others?
 No ground truth exists
 So: look for bot-like features
 Individual level (one IP)
 Group level (multiple IPs)
 Individual bots
 New cookie
 Whether a link was clicked
 Groups of bots
 Data often fixed by botnets: user agent string, metadata for requests
 Tendencies dictated by scripts: pages viewed per query, time between queries
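The two individual-level features can be computed per IP as simple rates. The sketch below is only an illustration under assumed inputs: the `Request` record and its field names are hypothetical, and the slides do not say how the rates were thresholded.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Request:
    """Hypothetical per-query log record."""
    ip: str
    new_cookie: bool  # True if the request arrived without a previously issued cookie
    clicked: bool     # True if the user clicked any result link


def individual_features(requests: Iterable[Request]) -> dict[str, tuple[float, float]]:
    """Map each IP to (fraction of queries with a new cookie,
    fraction of queries followed by a click)."""
    counts: dict[str, tuple[int, int, int]] = {}
    for r in requests:
        total, cookies, clicks = counts.get(r.ip, (0, 0, 0))
        counts[r.ip] = (total + 1, cookies + int(r.new_cookie), clicks + int(r.clicked))
    return {ip: (cookies / total, clicks / total)
            for ip, (total, cookies, clicks) in counts.items()}
```

A host that presents a new cookie on nearly every query and almost never clicks a result looks scripted, whereas a browser operated by a person tends to do the opposite.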

Query Identification: Verification (cont’d)
Substantial variation between host behavior for
normal queries and suspicious queries
Observations on Stage One
 Regular expressions can become obsolete
 Just need fresh logs and a new seed to get new ones
 Attacker awareness of technique yields adaptation
 Example: mix in normal user queries
   Goal: trick SearchAudit into identifying the attacker as a proxy
 Hard to do: needs to be appropriate to time and place
 Anyway: proxy elimination is optimization only
 Injecting randomness also possible, but makes querying less productive
 Could obviate cookie heuristic, but it is replaceable
 All attackers need to be careful to succeed
Query Analysis
 42,000 IPs gave suspicious queries globally
 U.S., Russia, China contribute almost 50%
 10% of IPs gave 90% of queries
 Found 200 regular expressions
 Reveal three kinds of attack-related queries:
 Vulnerable web sites
 Forum spamming
 Phishing on Windows Live Messenger
Queries for Vulnerable Websites
 Queries look for exploitable server vulnerabilities
   GET variables embedded in URL (for SQL injection)
   Server software with known vulnerabilities (e.g., status pages)
 Example query: inurl:index.php?content=X
 Example URL: http://www.example.com/index.php?content=X'%20OR%20'1'%20OR%20'1=1'
 SearchAudit as a defense:
 Pull suspicious queries for vulnerabilities
 Run queries; gather results
 Inspect results for vulnerabilities
 Notify sites of vulnerabilities
Queries for Vulnerable Websites (cont’d)
 With identified queries:
 Sampled 5,000 queries
 Obtained 80,490 URLs from 39,475 sites
 Compared to malware/phishing lists:
   3-4% on anti-phishing lists
   1.5% on anti-malware lists
 SQL injection vulnerability:
 Add a single-quote to variable in URL
 Look for SQL error
 12% of examined URLs showed an error
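The single-quote check can be sketched as two small helpers: one builds the probe URL and one inspects a response body. The sketch makes no network requests, and the error markers are illustrative assumptions rather than the list used in the study; `add_quote` and `looks_like_sql_error` are hypothetical names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Illustrative substrings of common database error messages (an assumption,
# not the markers used by the authors).
SQL_ERROR_MARKERS = ("sql syntax", "unclosed quotation mark", "sqlstate")


def add_quote(url: str, param: str) -> str:
    """Return the URL with a single quote appended to the value of one GET variable."""
    parts = urlsplit(url)
    pairs = [(key, value + "'" if key == param else value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)]
    return urlunsplit(parts._replace(query=urlencode(pairs)))


def looks_like_sql_error(body: str) -> bool:
    """Report whether a response body contains a known SQL error marker."""
    lowered = body.lower()
    return any(marker in lowered for marker in SQL_ERROR_MARKERS)


probe = add_quote("http://www.example.com/index.php?content=X", "content")
# probe == "http://www.example.com/index.php?content=X%27"
```

Fetching `probe` and passing the response text to `looks_like_sql_error` would complete the check; that step is left out here on purpose.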
Queries for Forum Spamming
 Query motivation:
 Find scriptable forums
 Good for spam, PageRank
 Found 46 applicable regular expressions
 Most IPs show transient behavior: probably bots
   All regular expression groups show at least one group similarity feature
 IPs got less aggressive over time: more stealthy
Queries for Forum Spamming (cont’d)
 Validation
 Project Honey Pot
   Dynamically generate email address for each visiting IP
 E-mail received: must be spam
   12% of all IPs listed (vs. 0.5% for normal IPs)
 Applications
 Use queries to find and clean targeted pages
 Deny results to malicious queries
Phishing via Windows Live Messenger
 Queries triggered by normal users
   Victim receives message from a contact
   Follow link for party photos
   Taken to fake WLM login
   After giving credentials, redirected to Bing search for “party”
 Bing search to avoid costs of hosting
Phishing via WLM (cont’d)
 Detect via query referral field (source page)
   Found two regular expressions for referrals
   Both expressions: victim username embedded in URL
 Over 180 phishing domains for 12 IPs detected
 Compromised accounts show different login behaviors
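Referral-based detection might look like the sketch below. The slides do not give the two regular expressions, so `REFERRER_PATTERN` is a made-up placeholder that only mimics the stated property (a username embedded in the referring URL); the function name and the example domains are likewise hypothetical.

```python
import re
from typing import Iterable
from urllib.parse import urlsplit

# Placeholder pattern, NOT one of the expressions found by SearchAudit:
# a referring page whose URL carries a username in a "user" parameter.
REFERRER_PATTERN = re.compile(r"^http://[^/]+/\?user=[^&]+$")


def phishing_domains(referrers: Iterable[str]) -> set[str]:
    """Collect the host names of referring pages that match the pattern."""
    domains = set()
    for ref in referrers:
        if REFERRER_PATTERN.match(ref):
            host = urlsplit(ref).hostname
            if host:
                domains.add(host)
    return domains


found = phishing_domains([
    "http://party-pics.example/?user=alice",
    "http://fun-photos.example/?user=bob",
    "http://news.example/story",
])
```

Because each victim's redirect carries the phishing page as its referrer, the set of matching hosts grows as more victims are tricked, which is what makes the domains visible in the search logs.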
Conclusion
 Presented framework for finding suspicious queries
 Input: search logs, small set of seed queries
 Output: regular expressions, millions of suspicious queries
 Analyzed suspicious queries
 Identified possible attacks
 Suggested means of prevention
 Generally: attempted to demonstrate relationship between suspicious queries and the possibility of attack
