Searching the Searchers with SearchAudit
JOHN P. JOHN
FANG YU
YINGLIAN XIE
MARTÍN ABADI
ARVIND KRISHNAMURTHY
PRESENTATION BY SAM KLOCK
Motivation
We can find this via a Google search
Motivation (cont’d)
Search engines open opportunities for attackers
Construct clever queries
Find vulnerable sites
Plant malware; spam (e.g., MyDoom)
Do so stealthily and cheaply
Mitigation strategy: identify malicious queries
May be able to deny results to user
Identify attackers (probably bots)
Interpret strategy, then anticipate and prevent
The question: how to do so
Proposed Approach
SearchAudit
Framework for generating malicious queries
Input:
Seed set of known malicious queries
Search logs
Output:
Large set of suspicious queries
Regular expressions matching queries
[Figure: Seed set and Search logs feed SearchAudit, which produces Expanded set and Regular expressions]
Seed set (examples):
inurl:gotoURL.asp?url=
filetype:asp inurl:"shopdisplayproducts.asp"
ext:pl inurl:cgi intitle:"FormMail *" -"*Referrer" -"* Denied" -sourceforge -error -cvs -input
filetype:cgi inurl:tseekdir.cgi
...
Regular expressions (examples):
"/includes/joomla\.php" site:\.[a-zA-Z]{2,3}
"/includes/class_item\.php" site:[^?=#+@;&:]{2,4}
"php-nuke" site:[^?=#+@;&:]{2,4}
"modules\.php\?op=modload" site:\.[a-zA-Z0-9]{2,6}
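A minimal sketch of how regular expressions like those on the slide could be applied to log queries. The four patterns are taken from the slide, reading the extracted "a-zAZ" as "a-zA-Z"; the sample queries are invented for illustration, and matching the whole query string (rather than a substring) is an assumption, not something the slides specify.

```python
import re

# Patterns from the slide; "a-zA-Z" is assumed where the extracted text reads "a-zAZ".
PATTERNS = [
    r'"/includes/joomla\.php" site:\.[a-zA-Z]{2,3}',
    r'"/includes/class_item\.php" site:[^?=#+@;&:]{2,4}',
    r'"php-nuke" site:[^?=#+@;&:]{2,4}',
    r'"modules\.php\?op=modload" site:\.[a-zA-Z0-9]{2,6}',
]
COMPILED = [re.compile(p) for p in PATTERNS]


def matching_patterns(query):
    """Return the patterns that match the entire query string."""
    return [p.pattern for p in COMPILED if p.fullmatch(query)]


# Hypothetical log queries, invented for this example.
queries = [
    '"/includes/joomla.php" site:.de',
    '"php-nuke" site:org',
    'cheap flights to boston',
]
suspicious = [q for q in queries if matching_patterns(q)]
```

Here the first two sample queries are flagged and the ordinary query is not, which is the variation-tolerant matching the regular expressions are meant to provide.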
Proposed Approach (cont’d)
Needed to implement:
Seed set: milw0rm.com
Search logs: Microsoft Research Bing
Way to expand seed set into more queries
Way to infer regular expressions
Intended benefits:
Harvesting lots of information
Three months: ~1.2 TB of logs
Interpret relationship between queries and attacks
Use queries to find potential victims
Stop attacks
SearchAudit
Query identification
Query analysis
Query Identification: Expansion
Basic idea: bootstrap on seed set
Search logs for exact matches to seed queries
Record IPs of hosts making seed queries
Add other queries from those IPs to set
Intuition: a host that makes one malicious query will probably make more
Account for DHCP
[Figure labels: Seed queries; Log search; IP addresses; Queries made on same day; Queries made by IPs]
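The expansion step can be sketched roughly as follows. The log record layout `(day, ip, query)` and the function name `expand_seed` are assumptions made for this example, and restricting expansion to queries made on the same day is one simple reading of "Account for DHCP" on the slide; the IP addresses are documentation addresses, not real data.

```python
def expand_seed(log, seed_queries):
    """Expand a seed set using (day, ip, query) log records.

    A query is added only if the same IP issued an exact seed query on the
    same day, so an address reassigned by DHCP on another day is not blamed.
    """
    seeds = set(seed_queries)
    # Step 1: find (day, ip) pairs that issued an exact seed query.
    flagged = {(day, ip) for day, ip, query in log if query in seeds}
    # Step 2: collect every query made by a flagged IP on the flagged day.
    expanded = {query for day, ip, query in log if (day, ip) in flagged}
    return expanded, {ip for _, ip in flagged}


# Hypothetical log records, invented for this example.
log = [
    ("2009-02-01", "192.0.2.1", "inurl:gotoURL.asp?url="),
    ("2009-02-01", "192.0.2.1", "filetype:cgi inurl:tseekdir.cgi"),
    ("2009-02-02", "192.0.2.1", "weather forecast"),
    ("2009-02-01", "192.0.2.7", "weather forecast"),
]
expanded, ips = expand_seed(log, ["inurl:gotoURL.asp?url="])
```

In this toy log the second query from 192.0.2.1 joins the expanded set, while the next-day query from the same address and the query from the other address do not.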
Query Identification: Regular Expressions
Goals:
Account for variation in queries
Take advantage of scripting
See paper for generation algorithm
Compute score for generated expressions
Lower score: more specific
Goal: discard overly general expressions (score > 0.6)
Consolidate to avoid overlap
Avoid proxies, public NAT for performance
Loopback for more queries
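A small sketch of the filtering step only, assuming each generated expression already has a score. How the score is computed is left to the paper; the scores below are invented, and `keep_specific` is a hypothetical helper name. Expressions scoring above 0.6 are discarded, as on the slide.

```python
def keep_specific(scored_patterns, threshold=0.6):
    """Drop overly general expressions (score above threshold).

    Returns the surviving patterns, most specific (lowest score) first.
    """
    kept = [(score, pattern)
            for pattern, score in scored_patterns.items()
            if score <= threshold]
    return [pattern for score, pattern in sorted(kept)]


# Scores here are made up for illustration; they are not from the paper.
scored = {
    r'"php-nuke" site:[^?=#+@;&:]{2,4}': 0.2,
    r'.*': 0.95,
    r'"/includes/joomla\.php" site:\.[a-zA-Z]{2,3}': 0.1,
}
specific = keep_specific(scored)
```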
Query Identification: Results
Data from Bing and milw0rm
500 queries
Logs for Feb. 2009, Dec. 2009, Jan. 2010
~2 billion views per month
System implemented on Dryad/DryadLINQ
Initial observations:
Using specificity scores < 0.6 seems to be effective
Based on cookie heuristic
Proxy elimination does not limit results
Query Identification: Results (cont’d)
Query expansion:
122 of 500 queries matched in logs: 174 unique IPs
Expanded to 800 unique queries, 264 IPs
Regular expressions matched 3,560 queries, 1,001 IPs
Incomplete seeds
Tried with subsets of original set
Coverage still good
Query Identification: Results (cont’d)
Loopback:
Multiple loopbacks got more results
One iteration is good enough
Overall statistics
10,000s IPs each month
100,000s unique queries each month
Dec. 09: set of unusual attacker IPs cause spike
Query Identification: Verification
Want to show queries are malicious
Sometimes easy: 73% of queries associated with security/hacker sites
What about others?
Individual bots
Groups of bots
No ground truth exists
Individual level (one IP)
Group level (multiple IPs)
Data often fixed by botnets
User agent string
Metadata for requests
So: look for bot-like features
New cookie
Whether a link was clicked
Tendencies dictated by scripts
Pages viewed per query
Time between queries
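Two of the bot-like features named above, new cookies and link clicks, can be summarized per IP along these lines. The `Request` record and its field names are hypothetical, and the slides give no thresholds, so this sketch only computes the fractions and leaves their interpretation open.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    ip: str
    new_cookie: bool  # request arrived without a previously seen cookie
    clicked: bool     # a result link was clicked after the query


def bot_features(requests):
    """Return {ip: (new_cookie_fraction, click_fraction)}."""
    counts = defaultdict(lambda: [0, 0, 0])  # total, new cookies, clicks
    for r in requests:
        c = counts[r.ip]
        c[0] += 1
        c[1] += r.new_cookie
        c[2] += r.clicked
    return {ip: (n / t, k / t) for ip, (t, n, k) in counts.items()}


# Hypothetical requests, invented for this example.
requests = [
    Request("192.0.2.1", True, False),
    Request("192.0.2.1", True, False),
    Request("192.0.2.7", False, True),
    Request("192.0.2.7", True, True),
]
features = bot_features(requests)
```

An IP that always presents a new cookie and never clicks a result, like 192.0.2.1 here, behaves more like a script than a person.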
Query Identification: Verification (cont’d)
Substantial variation between host behavior for normal queries and suspicious queries
Observations on Stage One
Regular expressions can become obsolete
Just need fresh logs and a new seed to get new ones
Attacker awareness of technique yields adaptation
Example: mix in normal user queries
Goal: trick SearchAudit into identifying as proxy
Hard to do: needs to be appropriate to time and place
Anyway: proxy elimination is optimization only
Injecting randomness also possible, but makes querying less productive
Could obviate cookie heuristic, but it is replaceable
All attackers need to be careful to succeed
Query Analysis
42,000 IPs gave suspicious queries globally
U.S., Russia, China contribute almost 50%
10% of IPs gave 90% of queries
Found 200 regular expressions
Reveal three kinds of attack-related queries:
Vulnerable web sites
Forum spamming
Phishing on Windows Live Messenger
Queries for Vulnerable Websites
Queries look for exploitable server vulnerabilities
GET variables embedded in URL (for SQL injection)
Server software with known vulnerabilities (e.g., status pages)
Example query: inurl:index.php?content=X
Example URL: http://www.example.com/index.php?content=X’%20OR%20’1’%20OR%20‘1=1’
SearchAudit as a defense:
Pull suspicious queries for vulnerabilities
Run queries; gather results
Inspect results for vulnerabilities
Notify sites of vulnerabilities
Queries for Vulnerable Websites (cont’d)
With identified queries:
Sampled 5,000 queries
Obtained 80,490 URLs from 39,475 sites
Compared to malware/phishing lists:
3-4% on anti-phishing lists
1.5% on anti-malware lists
SQL injection vulnerability:
Add a single-quote to variable in URL
Look for SQL error
12% of examined URLs showed an error
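The single-quote check can be sketched as two small helpers: one builds the probe URLs, the other inspects a response body. No request is sent here. The error fragments in `SQL_ERROR_HINTS` and both function names are assumptions for illustration; real database error text varies by server, and the slides do not say which strings were used.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical error fragments; not taken from the slides.
SQL_ERROR_HINTS = ("sql syntax", "unclosed quotation mark", "odbc")


def probe_urls(url):
    """Yield one URL per query variable, with a single quote appended to it."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    for i, (name, value) in enumerate(params):
        probed = params[:i] + [(name, value + "'")] + params[i + 1:]
        # urlencode percent-encodes the quote as %27.
        yield urlunsplit(parts._replace(query=urlencode(probed)))


def looks_like_sql_error(body):
    """Return True if the response body contains a known error fragment."""
    text = body.lower()
    return any(hint in text for hint in SQL_ERROR_HINTS)


probes = list(probe_urls("http://www.example.com/index.php?content=X"))
```

Fetching each probe URL and passing the response text to `looks_like_sql_error` would complete the check described on the slide.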
Queries for Forum Spamming
Query motivation:
Find scriptable forums
Good for spam, PageRank
Found 46 applicable regular expressions
Most IPs show transient behavior: probably bots
All regular expression groups show at least one group similarity feature
IPs got less aggressive over time: more stealthy
Queries for Forum Spamming (cont’d)
Validation
Project Honey Pot
Dynamically generate email address for each visiting IP
E-mail received: must be spam
12% of all IPs listed (vs. 0.5% for normal IPs)
Applications
Use queries to find and clean targeted pages
Deny results to malicious queries
Phishing via Windows Live Messenger
Queries triggered by normal users
Victim receives message from a contact
Follow link for party photos
Taken to fake WLM login
After giving credentials, redirected to Bing search for “party”
Bing search to avoid costs of hosting
Phishing via WLM (cont’d)
Detect via query referral field (source page)
Found two regular expressions for referrals
Both expressions: victim username embedded in URL
Over 180 phishing domains for 12 IPs detected
Compromised accounts show different login behaviors
Conclusion
Presented framework for finding suspicious queries
Input: search logs, small set of seed queries
Output: regular expressions, millions of suspicious queries
Analyzed suspicious queries
Identified possible attacks
Suggested means of prevention
Generally: attempted to demonstrate relationship between suspicious queries and the possibility of attack