N-Gram-based Dynamic Web Page Defacement Validation

Woonyon Kim
Aug. 23, 2004
NSRI, Korea
Contents
Introduction
Related Works
N-Gram Frequency Index
N-Gram-based Index Distance
Experiments
Conclusions
Introduction
Defacement of Web Sites
 CSI/FBI 2001
 38 % of web sites were hacked.
 21% of hacked sites were not aware of their own
defacements.
 Zone-h
 The defaced web pages are rapidly increased year
by year. (.kr domain : about 200% increase)
Current solutions
 Hash-based detection systems for minimizing damage
 Intrusion-tolerant systems for continuous service
Problems of current solutions
 Current solutions use a hash code as the validation metric. A hash code changes whenever any part of a page changes, so it cannot handle dynamic content.
Introduction
N-Gram-based Index Distance (NGID)
 A validation metric for dynamically changing web pages
 The sum of the absolute differences between the frequency probabilities of N-Grams found in both indexes.
 NGID represents the similarity of two web pages: the lower the NGID, the more similar the pages.
 NGID can be used to validate both static web pages and pages with dynamic components.
Related Works
Hash-based validation system
 Detects web page defacement by comparing two hash codes.
 A hash code is a useful metric for large, static web pages.
 A hash code does not work properly on dynamically changing web pages.
Intrusion-tolerant system
 A hash code is used to validate web pages.
 It has the same limitation on dynamic web pages.
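The limitation described above can be illustrated with a minimal sketch. It assumes MD5 as the hash function (the digest named later in the experiments); the page snapshots and the function name `hash_is_valid` are hypothetical and only serve the illustration.

```python
import hashlib


def hash_is_valid(reference: bytes, current: bytes) -> bool:
    """Return True when the MD5 digests of the two page snapshots match."""
    return hashlib.md5(reference).digest() == hashlib.md5(current).digest()


# Hypothetical snapshots of the same legitimate page, taken a few minutes apart.
reference = b"<html><body>Top story A. Visitors today: 1041</body></html>"
refreshed = b"<html><body>Top story A. Visitors today: 1042</body></html>"

print(hash_is_valid(reference, reference))  # True: identical bytes
print(hash_is_valid(reference, refreshed))  # False: one changed digit is enough
```

A legitimate content update and a defacement look the same to this check, which is why a hash code raises false alarms on dynamic pages.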
N-Gram Frequency Index (1)
N-Gram
 An N-character slice of a string
 For example, the 2-Grams of “TEXT” are TE, EX, and XT.
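As a small illustration of the definition, the sketch below slides a window of N characters over a string. The function name `ngrams` is a hypothetical choice for this example, not something taken from the slides.

```python
from typing import List


def ngrams(text: str, n: int) -> List[str]:
    """Return every N-character slice of `text`, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(ngrams("TEXT", 2))  # ['TE', 'EX', 'XT']
```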
N-Gram Frequency Index
 An index file in which N-Grams are sorted from the most frequent to the least frequent
 N-Grams ranked below a particular cutoff are dropped, so minor changes are ignored. This property lets the N-Gram Frequency Index tolerate dynamic content.
N-Gram Frequency Index (2)
How to generate
 Count the frequencies of all N-Grams in a web page.
 Sort the N-Grams from the most frequent to the least frequent.
 Cut off the N-Grams below a particular rank.
 Sum up the frequencies of the remaining N-Grams.
 Compute the probability of each remaining N-Gram (its frequency divided by that sum).
 Save the N-Grams, their frequencies, and their probabilities in an index file.
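The generation steps can be sketched as follows. This is a minimal illustration under stated assumptions: the slides do not give the value of N, the cutoff rank, or the index file format, so the parameters `n` and `cutoff`, the JSON layout, and the names `build_index` and `save_index` are all hypothetical.

```python
import json
from collections import Counter
from typing import Dict, List


def build_index(text: str, n: int, cutoff: int) -> List[Dict]:
    """Build an N-Gram frequency index for one page (assumed in-memory layout)."""
    # Step 1: count every N-Gram in the page.
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    # Steps 2-3: sort from most to least frequent and keep the top `cutoff` ranks.
    top = counts.most_common(cutoff)
    # Step 4: sum the frequencies of the remaining N-Grams.
    total = sum(freq for _, freq in top)
    # Step 5: probability of each N-Gram = its frequency / that sum.
    return [{"ngram": g, "freq": f, "prob": f / total} for g, f in top]


def save_index(index: List[Dict], path: str) -> None:
    """Step 6: save the index to a file (JSON is an assumption of this sketch)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(index, fh, ensure_ascii=False, indent=2)


# Toy input: AA occurs 6 times, AB twice, BA and BC once each.
index = build_index("AAAAAAABABC", n=2, cutoff=2)
print(index)
# [{'ngram': 'AA', 'freq': 6, 'prob': 0.75}, {'ngram': 'AB', 'freq': 2, 'prob': 0.25}]
```

Because the probabilities are computed after the cutoff, they sum to 1 over the N-Grams that are kept, and the rare N-Grams BA and BC have no influence on the index.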
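N-Gram-based Index Distance (NGID)
The sum of the absolute differences between the frequency probabilities of the same N-Grams found in both web pages.
A metric for detecting whether or not a web page is defaced.
Example (entries are listed from most frequent to least frequent; Dij is the distance between the N-Gram at rank i of the Normal Index and the same N-Gram at rank j of the Target Index, with j = 0 where the N-Gram is absent from the Target Index):
Rank   Normal Index   Target Index   Distance
1      AB  0.037      AB  0.036      D11 = 0.01
2      BC  0.023      BD  0.028      D24 = 0.08
3      CD  0.019      CD  0.019      D33 = 0.00
4      EF  0.017      BC  0.015      D45 = 0.07
5      BF  0.017      EF  0.010      D50 = 0.017
...    ...            ...
NGID = 0.177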
N-Gram-based Index Distance
A page is evaluated by comparing its NGID to a validation threshold.
Evaluation
 Valid : NGID <= Validation Threshold
 Invalid : NGID > Validation Threshold
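The distance and the threshold test can be sketched as below. The sketch makes two assumptions. First, each index is represented as a plain mapping from N-Gram to probability. Second, an N-Gram of the normal index that is missing from the target index contributes its full probability, which is one reading of the D50 entry in the worked example. The default threshold of 0.1 is the value used in the experiments; the toy indexes and the names `ngid` and `is_valid` are hypothetical.

```python
from typing import Dict


def ngid(normal: Dict[str, float], target: Dict[str, float]) -> float:
    """Sum of |p_normal - p_target| over the N-Grams of the normal index.

    An N-Gram absent from the target index is treated as probability 0
    (an assumption of this sketch).
    """
    return sum(abs(p - target.get(gram, 0.0)) for gram, p in normal.items())


def is_valid(normal: Dict[str, float], target: Dict[str, float],
             threshold: float = 0.1) -> bool:
    """Valid when NGID <= threshold, invalid otherwise."""
    return ngid(normal, target) <= threshold


# Hypothetical indexes: a page with a minor update and a replaced page.
normal = {"AB": 0.5, "BA": 0.3, "AC": 0.2}
updated = {"AB": 0.48, "BA": 0.32, "AC": 0.2}
defaced = {"XY": 0.6, "AB": 0.1, "YZ": 0.3}

print(is_valid(normal, updated))  # True: the distance is about 0.04
print(is_valid(normal, defaced))  # False: the distance is about 0.9
```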
Experiments
Assumptions
 Select 100 web pages:
   Category     No. of Web Sites
   News Paper   38
   Broadcast    15
   Portal       14
   Public       33
   Total        100
 Use 0.1 as the validation threshold of NGID.
Procedure for false positive
 Connect to one selected web page at a time from a remote location.
 Download the page and save it to a file.
 Validate it using NGID.
 Validate it using the hash code.
 Repeat the above four steps every 30 minutes throughout a day.
Experiments
False Positive
                               News Paper   Broadcast   Portal   Public   Total
No. of Web Sites               38           15          14       33       100
No. of False Positive (MD5)    29           14          12       8        63
No. of False Positive (NGID)   1            1           0        0        2
Experiments
False Positive
[Figure: No. of False Positive (0 to 70) over Time, for NGID and HASH]
Experiments
NGID value over time
[Figure: NGID (0 to 0.12) over Time; labels A, B, 1, and 2; annotation: the time of contents update]
Experiments
Procedure for false negative
 Collect 50 web pages from Zone-h, with both their normal and hacked versions.
 Validate them using NGID.
 Validate them using the hash code.
Result of Hash code
 All 50 web pages are detected as defaced.
 The number of false negatives is 0.
Experiments
False Negative
[Figure: NGID (0 to 1.2) by Web Page Index, plotted against the Threshold (0.1)]
Conclusions
N-Gram-based Index Distance
 A metric to evaluate dynamic web page defacement.
 NGID can validate dynamically changing web pages.
Future Works
 A learning model is needed to determine the validation threshold for each web page.
 A feedback mechanism for the normal index is needed.