On Network-Aware Clustering of Web Clients

Balachander Krishnamurthy
AT&T Labs-Research, Florham Park, NJ, USA
Jia Wang
Cornell University, Ithaca, NY, USA
Outline
• Introduction
• Simple approaches to clustering
• Network-aware approach
• Applications of client clustering
• Conclusion and future work
ACM SIGCOMM'2000
Introduction
• Original goal: identify the group of clients that are
responsible for a significant portion of a Web
site’s requests
• Cluster
– Non-overlapping
– Topologically close
– Under common administrative control
• But, identifying clusters requires knowledge that
is not available to anyone outside the
administrative entities.
• Network-aware approach – BGP based
Simple approaches
• Two approaches
1. Use traditional Class A, Class B and Class C networks
2. Assume prefix length is 24 bits
• They are simple, but do not give good results
(~50% accuracy).
• Counterexample
IP address      Name                                    Prefix/netmask
151.198.194.17  client-151-198-194-17.bellatlantic.net  151.198.194.16/28
151.198.194.34  mailsrv1.wakefern.com                   151.198.194.32/28
151.198.194.50  firewall.commonhealthusa.com            151.198.194.48/28
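To see why the counterexample defeats both simple approaches, the following is a minimal Python sketch that applies each rule to the three addresses in the table. The function names `classful_prefix` and `fixed_24_prefix` are hypothetical and chosen for illustration; only the addresses and the /28 prefixes come from the slide.

```python
import ipaddress

def classful_prefix(ip: str) -> ipaddress.IPv4Network:
    """Network implied by the traditional Class A/B/C rules (hypothetical helper)."""
    first_octet = int(ip.split(".")[0])
    if first_octet < 128:
        length = 8    # Class A
    elif first_octet < 192:
        length = 16   # Class B
    else:
        length = 24   # Class C (Class D/E are lumped in here for brevity)
    return ipaddress.ip_network(f"{ip}/{length}", strict=False)

def fixed_24_prefix(ip: str) -> ipaddress.IPv4Network:
    """Network implied by assuming every prefix is 24 bits long (hypothetical helper)."""
    return ipaddress.ip_network(f"{ip}/24", strict=False)

# The three clients from the counterexample table.
clients = ["151.198.194.17", "151.198.194.34", "151.198.194.50"]

classful = {classful_prefix(ip) for ip in clients}
fixed24 = {fixed_24_prefix(ip) for ip in clients}
# The prefixes listed in the table are all /28.
actual = {ipaddress.ip_network(f"{ip}/28", strict=False) for ip in clients}

print("classful clusters:", sorted(map(str, classful)))
print("/24 clusters:     ", sorted(map(str, fixed24)))
print("table prefixes:   ", sorted(map(str, actual)))
```

Both simple rules place the three clients in a single cluster, while the table lists three distinct /28 prefixes belonging to three different organizations.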
Network-aware approach
• Use BGP routing and forwarding table snapshots
• Routing table entries → clusters
• Example snapshot of BGP routing table
Prefix        Prefix description                     Next hop           AS path          Peer AS description
6.0.0.0/8     Army Information System Center         cs.nynap.vbns.net  7170 1455 (IGP)  AT&T Government Markets
12.0.48.0/20  Harvard University                     cs.cht.vbns.net    1742 (IGP)       Harvard University
18.0.0.0/8    Massachusetts Institute of Technology  cs.cht.vbns.net    3 (IGP)          Massachusetts Institute of Technology
Automated process
Clustering process
• Source of IP addresses → IP address extraction → IP addresses
• BGP routing tables → Prefix extraction, unification, merging → Prefix table
• IP addresses + Prefix table → Client cluster identification → Raw client clusters
• Raw client clusters → Validation (optional), examining impact of network dynamics, self-correction and adaptation → Client clusters
Network prefix extraction
• Prefix entry extraction (BGP tables from 14 places
via automated scripts)
AADS, MAE-EAST, MAE-WEST, PACBELL, PAIX, ARIN,
AT&T-Forw, AT&T-BGP, CANET, CERFNET, NLANR,
OREGON, SINGAREN, and VBNS.
• Prefix format unification and merging
• Three formats:
x1.x2.x3.x4/k1.k2.k3.k4
x1.x2.x3.x4/m
x1.x2.x3.0
• Assembled a total of 391,497 unique prefix entries
(412,109 entries by 7/24/2000)
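The unification and merging step can be sketched in Python as follows. This is an illustration, not the scripts used in the study: `implied_length` and `unify_prefix` are hypothetical names, and the rule for the bare form (`x1.x2.x3.0`) is an assumption, since the slide does not say how its mask is recovered. Here the mask length is inferred from the number of trailing zero octets.

```python
import ipaddress

def implied_length(address: str) -> int:
    """ASSUMPTION: for a bare address, drop 8 bits per trailing zero octet."""
    length = 32
    for octet in reversed(address.split(".")):
        if int(octet) != 0:
            break
        length -= 8
    return length

def unify_prefix(entry: str) -> str:
    """Convert any of the three formats to the x1.x2.x3.x4/m form."""
    entry = entry.strip()
    if "/" in entry:
        # ip_network accepts both a dotted netmask and a mask length.
        return str(ipaddress.ip_network(entry, strict=False))
    return str(ipaddress.ip_network(f"{entry}/{implied_length(entry)}", strict=False))

# Entries as they might appear in two different routing table dumps.
tables = [
    ["12.0.48.0/255.255.240.0", "18.0.0.0"],
    ["12.0.48.0/20", "6.0.0.0/8"],
]

# Merging: unify every entry, then keep each prefix once.
merged = sorted({unify_prefix(entry) for table in tables for entry in table})
print(merged)
```

Using a set for the merge means the same prefix seen at several vantage points is stored once, which is how a count of unique entries would be obtained.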
Client cluster identification
• Methodology
• Extract the client IP address from the server log
• Perform longest prefix matching on each client IP address
• Classify all the client IP addresses which have the same
longest matched prefix into a client cluster
• Experiments
• Experiments on a wide range of Web server logs
• Results
• > 99% of clients can be grouped into clusters
• ~90% of sampled clusters passed our validation tests
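The three methodology steps can be sketched in Python as below. This is a simple illustration using the standard `ipaddress` module, not the implementation used in the experiments; a production version would more likely use a trie. The helper names, the extra `12.0.0.0/8` entry, and all client addresses are hypothetical; the other three prefixes come from the example BGP snapshot.

```python
import ipaddress
from collections import defaultdict

def build_table(prefixes):
    """Index prefixes by mask length so the longest lengths can be tried first."""
    table = defaultdict(set)
    for prefix in prefixes:
        net = ipaddress.ip_network(prefix, strict=False)
        table[net.prefixlen].add(net)
    return table

def longest_match(ip, table):
    """Return the longest prefix in the table that contains ip, or None."""
    for length in sorted(table, reverse=True):
        candidate = ipaddress.ip_network(f"{ip}/{length}", strict=False)
        if candidate in table[length]:
            return candidate
    return None

def cluster_clients(client_ips, prefixes):
    """Group clients that share the same longest matched prefix."""
    table = build_table(prefixes)
    clusters = defaultdict(set)
    for ip in client_ips:
        clusters[longest_match(ip, table)].add(ip)  # unmatched clients go under None
    return dict(clusters)

prefixes = ["6.0.0.0/8", "12.0.48.0/20", "18.0.0.0/8", "12.0.0.0/8"]
clients = ["12.0.50.1", "12.0.60.9", "12.1.2.3", "18.7.22.69", "203.0.113.5"]

for prefix, members in cluster_clients(clients, prefixes).items():
    print(prefix, sorted(members))
```

Clients that match both `12.0.0.0/8` and `12.0.48.0/20` are assigned to the /20, which is the point of longest prefix matching; clients with no matching prefix are collected under `None` rather than dropped.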
Server logs used in our experiments
Log     Description                Date                Duration (days)  # requests  # clients  # clusters
Apache  Apache site                10/1/99 - 11/18/99  49               3,461,361   51,536     35,563
Ew3     AT&T content hosting site  7/1/99 - 7/31/99    31               1,199,276   21,519     7,754
Nagano  1998 Winter Olympic Games  2/13/98             1                11,665,713  59,582     9,853
Sun     Sun Microsystems site      9/30/97 - 10/9/97   9                13,871,352  219,528    33,468
Example: Nagano server log
Validation of clustering
• Validation - fundamentally difficult problem
• A client cluster may be mis-identified by being too large or too small
• Two approaches
• nslookup-based test
• Optimized traceroute-based test
• Results on a 1% sample of client clusters
• A client cluster counts as mis-identified if even one client in the cluster
does not share the same name suffix as the others.
• Error rate of network-aware approach: ~10%
• Error rate of simple approach: ~50%
• Possible reasons for mis-clustering: route aggregation,
national gateway proxies
• Effect of BGP prefix changes: < 3% (during 2 weeks)
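The suffix rule used in the nslookup-based test can be sketched in Python as follows. The slide does not define how long the shared suffix must be, so comparing the last two labels of each name is an assumption, and `name_suffix` and `cluster_passes_suffix_test` are hypothetical helpers. Names are assumed to have been resolved already (for example with nslookup), so the sketch runs offline.

```python
def name_suffix(hostname: str, labels: int = 2) -> str:
    """Last `labels` labels of a host name (ASSUMPTION: two labels is enough)."""
    return ".".join(hostname.lower().rstrip(".").split(".")[-labels:])

def cluster_passes_suffix_test(hostnames, labels: int = 2) -> bool:
    """A cluster fails if even one client has a different name suffix."""
    suffixes = {name_suffix(name, labels) for name in hostnames}
    return len(suffixes) <= 1

# The cluster a fixed /24 rule would form from the earlier counterexample.
simple_cluster = [
    "client-151-198-194-17.bellatlantic.net",
    "mailsrv1.wakefern.com",
    "firewall.commonhealthusa.com",
]
# A made-up cluster whose members all belong to one organization.
consistent_cluster = ["a.cs.example.edu", "b.ee.example.edu"]

print(cluster_passes_suffix_test(simple_cluster))      # False
print(cluster_passes_suffix_test(consistent_cluster))  # True
```

A two-label suffix would mis-handle names under country-code hierarchies such as `.co.uk`, so a real test would need a more careful definition of the suffix.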
Applications
• Web caching, content distribution, server
replication, traffic management and load
balancing, Internet map discovery, etc.
• Example: Web caching
• Client classification: Normal client, proxy, and spider
• Identifying spiders/proxies based on access patterns
Detecting proxy/spider
[Figure: Histogram of the requests in Sun server log. Panels: "A client cluster containing a proxy" and "A client cluster containing a spider". Axes: Time (in hours), Percentage of requests, Number of requests issued.]
Thresholding client clusters
• Metric: number of requests issued from within a
client cluster
• Threshold: busy clusters are those that together account for
70% of the total requests in the server log
Log     # requests  # clients  # clusters  # busy clusters  Accuracy
Apache  3,461,361   51,536     35,563      2,869            92%
Ew3     1,199,276   21,519     7,754       1,600            96%
Nagano  11,665,713  59,582     9,853       717              90%
Sun     13,871,352  219,528    33,468      2,536            91%
• Web caching simulation
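The thresholding step can be sketched in Python as follows. It assumes that the threshold means "take clusters in decreasing order of request count until 70% of all requests are covered", which is one reading of the slide; `busy_clusters` is a hypothetical helper and the request counts are made up for illustration.

```python
def busy_clusters(requests_per_cluster: dict, share: float = 0.70) -> list:
    """Smallest set of top-ranked clusters covering `share` of all requests."""
    total = sum(requests_per_cluster.values())
    ranked = sorted(requests_per_cluster.items(), key=lambda kv: kv[1], reverse=True)
    busy, covered = [], 0
    for cluster, count in ranked:
        if covered >= share * total:
            break  # the clusters selected so far already reach the threshold
        busy.append(cluster)
        covered += count
    return busy

# Made-up request counts per cluster (1,000 requests in total).
counts = {"A": 500, "B": 250, "C": 150, "D": 60, "E": 40}
print(busy_clusters(counts))  # ['A', 'B']
```

With these counts, clusters A and B already cover 750 of 1,000 requests, so the remaining three clusters fall below the threshold. The same effect appears in the table above, where a small fraction of the clusters is marked busy.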
New dataset
• Altavista server log containing 60,011,458 requests
issued by 2,503,974 clients all over the world.
• # clusters: 100,091
• # busy clusters: 242
• Accuracy: 91%
• Clustering works on large, general portal site data.
• Thanks to Altavista for sharing data with us. The data
included only client IP addresses with no personally
identifiable information.
Conclusion and future work
• Network-aware client clustering
– Based on BGP routing table snapshots
– Ability to cluster >99% of clients in the server logs
– Error rate is 10% (~ 50% for the simple approach)
– Immune to BGP dynamics
– Variety of applications
• Ongoing work
– Online algorithm
– Super/sub clustering
– Server clustering
– Server replication application
• Future work
– Better validation
– Lower error rate
– Other applications
Acknowledgement
Thanks to the following people for helping us in this
project.
Jennifer Rexford
Tim Griffin
Vern Paxson
Thomas Narten
Emden Gansner
S. Keshav
Anja Feldmann
Bill Manning
Craig Labovitz
Steven Bellovin
Nick Duffield
Walter Willinger