Detecting Data Leakage


Tagging with Queries: How and Why?
Ioannis Antonellis
Hector Garcia-Molina
Jawed Karim
Content on the Web
[Figure: text sources that describe a web page: back link text, search queries, page text, forward link text. Example terms shown: Cnn, Critics, Obama, news]
Stanford Infolab
How?
• Basic observation: the HTTP referrer field contains the search query
1) Extract queries from web access log
Web Access Log
a997c1950718d75c03f22ca8715e50b3 [28/Feb/2007:23:45:47 -0800] /group/svsa/cgibin/www/officers.php http://www.google.com/search?sourceid=navclient&ie=UTF8&rls=HPIB,HPIB:2006-47,HPIB:en&q=sexy+random+facts
a64344ffd6638d0f6fb2a0284f98b28b [28/Feb/2007:23:45:49 -0800] /group/King/ "http://www.google.com.au/search?hl=en&q=Martin+Luther+King&meta="
413fa663474b2288c1661882e7e62aea [28/Feb/2007:23:46:02 -0800] /group/pandegroup/folding/results.html "http://www.google.com/search?sourceid=navclient-menuext&ie=UTF-8&q=RESULTS"
3d2edd4dfa7778da92875ee67a319433 [28/Feb/2007:23:46:03 -0800] /group/vpge/sgsi/entrepreneurship/ "http://www.google.com/search?hl=en&q=summer+institute+of+entrepreneurship"
ac49793239a6c490023e460fd4863a48 [28/Feb/2007:23:46:06 -0800] / "http://www.google.com/search?sourceid=navclient&hl=ko&ie=UTF8&rlz=1T4SUNA_ko___KR209&q=stanford"
1c9893680
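A minimal sketch of step 1, assuming each log entry has the layout seen in the sample above (visitor hash, bracketed timestamp, requested path, referrer URL with optional quotes). The function names and the rule "a search referrer has a /search path and a q parameter" are assumptions for illustration, not the authors' implementation.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse, parse_qs

# Assumed entry layout, inferred from the sample log:
# <visitor-hash> [<timestamp>] <requested-path> <referrer, optionally quoted>
LOG_LINE = re.compile(r'^(\S+)\s+\[([^\]]+)\]\s+(\S+)\s+"?([^"\s]+)"?\s*$')


def extract_query_tag(line):
    """Return (path, query) when the referrer is a search URL, else None."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None  # malformed or truncated entry
    _visitor, _timestamp, path, referrer = match.groups()
    parsed = urlparse(referrer)
    # Assumption: search referrers use a "/search" path and a "q" parameter.
    if not parsed.path.startswith("/search"):
        return None
    queries = parse_qs(parsed.query).get("q")
    if not queries:
        return None
    # parse_qs already decodes "+" and percent-escapes; normalise the case.
    return path, queries[0].strip().lower()


def collect_query_tags(lines):
    """Group the extracted queries by requested path: {path: {query, ...}}."""
    tags = defaultdict(set)
    for line in lines:
        result = extract_query_tag(line)
        if result is not None:
            path, query = result
            tags[path].add(query)
    return dict(tags)
```

Entries that do not match the assumed layout, such as a truncated line, are skipped rather than guessed at.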
2) Embed JavaScript code in web pages to capture search queries
Embeddable code
• Convince the server administrator/page owner
Query tags
Information value of Query Tags
• Datasets:
• WebBase
• Stanford Query Logs: 360,000 URLs, 900,000 query tags
• Delicious@Stanford: 3,000 URLs, 5,500 tags
Experiments - Summary
• URL coverage
• Query vs Delicious Tags
• Query/Delicious Tags vs Pagetext
URL coverage
• Query logs provide tags for ~110 times more URLs than Delicious
• 13% of Delicious URLs (380 URLs) are tagged only by Delicious
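The two coverage figures can be read as simple set arithmetic over the tagged URLs. The sketch below is a hypothetical illustration of that arithmetic, assuming each source is available as a collection of URLs; `coverage_summary` and its keys are names invented here.

```python
def coverage_summary(query_urls, delicious_urls):
    """Compare which URLs each tag source covers."""
    query_urls = set(query_urls)
    delicious_urls = set(delicious_urls)
    if not delicious_urls:
        raise ValueError("delicious_urls must not be empty")
    # URLs that have Delicious tags but no query tags at all.
    only_delicious = delicious_urls - query_urls
    return {
        # How many times more URLs the query logs cover.
        "ratio": len(query_urls) / len(delicious_urls),
        "only_delicious": len(only_delicious),
        "only_delicious_share": len(only_delicious) / len(delicious_urls),
    }
```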
Query Tags
• Query logs provide 42 query tags per URL on average
Delicious Tags
• Delicious provides 3 tags per URL on average
Tags for common URLs
• Query logs provide 250 query tags per URL on average for common URLs
• Delicious provides 5 tags per URL on average for common URLs
Query Tags vs Page Text
• For every URL, 1 out of 3 query tags is not present in the page text
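One way such a fraction might be measured for a single URL is sketched below. The slides do not define "present", so this assumes a tag counts as present only when every word of the tag occurs somewhere in the page text; the function names are hypothetical.

```python
import re


def words(text):
    """Lower-cased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def fraction_not_in_page(tags, page_text):
    """Fraction of tags with at least one word missing from the page text."""
    tags = list(tags)
    if not tags:
        return 0.0
    page_words = words(page_text)
    # Assumption: a tag is "present" when all of its words occur in the page.
    missing = [tag for tag in tags if not words(tag) <= page_words]
    return len(missing) / len(tags)
```

Averaging this value over all URLs would give a figure comparable to the one on the slide, under the stated assumption.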
Delicious Tags vs Page Text
• For every URL, 1 out of 2 Delicious tags is not present in the page text
Tags for common URLs
• For common URLs, 1 out of 2 query/Delicious tags is not present in the page text
Conclusions
Query tags:
• Can be extracted in a distributed fashion
• Are a promising new source of information
• Can provide many new tags for a large fraction of the Web
Thank You!
(DEMO)
http://tags.stanford.edu