N-Gram-based Dynamic Web Page Defacement Validation
Woonyon Kim
Aug. 23, 2004
NSRI, Korea
Contents
Introduction
Related Works
N-Gram Frequency Index
N-Gram-based Index Distance
Experiments
Conclusions
Introduction
Defacement of Web Sites
CSI/FBI 2001
38% of web sites were hacked.
21% of hacked sites were not aware of their own defacements.
Zone-h
The number of defaced web pages is increasing rapidly year by year (about a 200% increase for the .kr domain).
Current solutions
Hash-based detection system for minimizing damage
Intrusion-tolerant system for continuous service
Problems of current solutions
Current solutions use a hash code as the validation metric, but a hash code cannot accommodate dynamically changing content.
Introduction
N-Gram-based Index Distance (NGID)
A validation metric for dynamically changing web pages
The sum of the absolute differences of the frequency probabilities of N-Grams found in both indexes
NGID represents the similarity of two web pages.
NGID can be used to validate web pages with either dynamic or static components.
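Spelled out (the notation here is assumed for illustration), with p_N(g) and p_T(g) the frequency probabilities of N-Gram g in the normal and target indexes:

    NGID = Σ_g |p_N(g) − p_T(g)|

In the worked example later in the talk, an N-Gram missing from one index is treated as having probability 0 there.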
Related Works
Hash-based validation system
Detects web page defacements by comparing two hash codes
A hash code is a useful metric for large, static web pages.
Hash codes do not work properly on dynamically changing web pages.
Intrusion-tolerant system
A hash code is used to validate web pages.
It has the same limitation on dynamic web pages.
N-Gram Frequency Index (1)
N-Gram
An N-character slice of a string
For example, "TEXT"
2-Grams: TE, EX, XT (sketched in code below)
N-Gram Frequency Index
An index file sorted from the most frequent N-Grams to the least frequent ones
N-Grams below a particular rank are cut off, so minor changes are ignored; this feature of the N-Gram Frequency Index accommodates dynamic content.
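A minimal sketch of the 2-Gram slicing above (Python is assumed throughout; the talk itself shows no code):

    def ngrams(text, n=2):
        # Return all N-character slices of the string.
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    print(ngrams("TEXT"))  # ['TE', 'EX', 'XT']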
N-Gram Frequency Index (2)
How to generate
Count all N-Gram frequencies in a web page.
Sort the N-Grams from most frequent to least frequent.
Cut off the N-Grams below a particular rank.
Sum up the frequencies of the remaining N-Grams.
Compute the frequency probability of each N-Gram.
Save the N-Grams, their frequencies, and their probabilities into an index file.
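These steps can be sketched as follows; the N size, the cutoff rank, and the in-memory dictionary (in place of an index file) are assumptions for illustration:

    from collections import Counter

    def build_index(page_text, n=2, cutoff_rank=100):
        # Count all N-Gram frequencies in the page.
        counts = Counter(page_text[i:i + n]
                         for i in range(len(page_text) - n + 1))
        # Sort from most to least frequent and cut off below the rank.
        top = counts.most_common(cutoff_rank)
        # Sum up the frequencies of the remaining N-Grams.
        total = sum(freq for _, freq in top)
        # Map each remaining N-Gram to its frequency probability.
        return {gram: freq / total for gram, freq in top}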
N-Gram-based Index Distance (NGID)
The sum of the absolute differences of the frequency probabilities of the same N-Grams found in both web pages
A metric for detecting whether a web page is defaced or not

Worked example (Dij is the distance for the N-Gram at rank i in the normal index and rank j in the target index; rank 0 means not found):

    Normal Index              Target Index
    (most → least frequent)   (most → least frequent)
    1  AB  0.037              1  AB  0.036     D11 = 0.001
    2  BC  0.023              2  BD  0.028     D24 = 0.008
    3  CD  0.019              3  CD  0.019     D33 = 0.000
    4  EF  0.017              4  BC  0.015     D45 = 0.007
    5  BF  0.017              5  EF  0.010     D50 = 0.017
    ...                       ...              ...
                                               NGID = 0.177

(The NGID total sums the distances over all rows, including those elided above.)
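A sketch of the distance computation, assuming (as in the D50 row of the example above) that an N-Gram found in only one index contributes its full probability:

    def ngid(normal_index, target_index):
        # Sum of absolute differences of frequency probabilities over
        # every N-Gram seen in either index (missing ones count as 0).
        grams = set(normal_index) | set(target_index)
        return sum(abs(normal_index.get(g, 0.0) - target_index.get(g, 0.0))
                   for g in grams)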
N-Gram-based Index Distance
Evaluation is done by comparing NGID to a validation threshold.
Valid: NGID <= Validation Threshold
Invalid: NGID > Validation Threshold
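As a one-line sketch on top of the ngid function above, using the 0.1 threshold chosen in the experiments:

    def validate(normal_index, target_index, threshold=0.1):
        # Valid if NGID <= threshold; otherwise flag possible defacement.
        return ngid(normal_index, target_index) <= threshold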
Experiments
Assumptions
100 web pages were selected:

    Category      News Paper  Broadcast  Portal  Public  Total
    No. of Pages      38          15       14      33     100

The validation threshold of NGID is set to 0.1.
Procedure for false positive
Connect to each selected web page from a remote site.
Download the page and save it to a file.
Validate it using NGID.
Validate it using a hash code.
The above four steps are repeated every 30 minutes for a day.
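A sketch of this monitoring loop, reusing the build_index and ngid sketches above; the URL handling and output format are assumptions:

    import hashlib
    import time
    import urllib.request

    def monitor(url, normal_index, normal_md5, threshold=0.1):
        while True:
            # Connect to the selected web page and download it.
            page = urllib.request.urlopen(url).read()
            text = page.decode("utf-8", errors="replace")
            # Validate it using NGID.
            ngid_valid = ngid(normal_index, build_index(text)) <= threshold
            # Validate it using a hash code (MD5, as in the experiments).
            md5_valid = hashlib.md5(page).hexdigest() == normal_md5
            print(f"NGID valid: {ngid_valid}, MD5 valid: {md5_valid}")
            # Repeat every 30 minutes.
            time.sleep(30 * 60)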
Experiments
False Positive

    Category                        News Paper  Broadcast  Portal  Public  Total
    No. of Web Sites                    38         15        14      33     100
    No. of False Positives (MD5)        29         14        12       8      63
    No. of False Positives (NGID)        1          1         0       0       2
Experiments
False Positive
[Chart: number of false positives over time, comparing NGID and HASH; y-axis 0 to 70]
Experiments
NGID values over time
[Chart: NGID (0 to 0.12) over time for two web pages, A and B; markers 1 and 2 indicate the times of content updates]
Experiments
Procedure for false negative
Collect 50 web pages from Zone-h, consisting of normal pages and their hacked versions.
Validate each using NGID.
Validate each using a hash code.
Result of hash code
All 50 web pages are detected as defaced.
The number of false negatives is 0.
Experiments
False Negative
[Chart: NGID value for each web page index, with the validation threshold (0.1) marked; y-axis 0 to 1.2]
Conclusions
N-Gram-based Index Distance
A metric to evaluate dynamic web page defacement.
NGID can validate dynamically changing web pages.
Future Work
A learning model is needed to determine the validation threshold for each web page.
A feedback mechanism is needed to update the normal index.