
CDNs Content Outsourcing via Generalized Communities

Dimitrios Katsaros, Ph.D.
Dept. of Computer & Communication Engineering, University of Thessaly
Dept. of Informatics, Aristotle University
Heraklion, March 20th, 2008
Outline of the talk
• A summary of my research
• Latest results: "CDNs Content Outsourcing via Generalized Communities" (IEEE Transactions on Knowledge & Data Engineering)
• PRIMITIVE: Community Identification
• METHOD: Content Outsourcing for CDNs
• GOAL: Access Latency Reduction & Robustness
Research areas: Ultimately ???
[Diagram of research areas: Mobile/Pervasive Computing; Web; Overlay Nets; Caching & Air-Indexing; Ad Hoc / Pervasive / Peer-to-Peer Networks; Content-Based MIR; Web Broadcasting & Data Dissemination; Cooperative Caching, Sensor Node Clustering, Distributed Indexing, Coverage/Connectivity, Flash Storage & Sensors; Web Ranking & Search Engines; Social Network Analysis; Information Retrieval]
Content Outsourcing
• The problem: flash crowds
• The solution: CDNs
• Reactive vs. proactive solutions
• Community identification
• The CiBC algorithm
• Evaluation
A problem…
• Feb 3, 2004: Google linked its banner to "Julia fractals"
• Users who clicked were directed to an Australian university's web site
• …the university's network link was overloaded, and the web server was temporarily taken down…
The problem strikes again!
• Feb 4, 2004: Slashdot ran the story about Google
• …Site taken down temporarily…again
The response from down under…
• Later, Paul Bourke asks:
"They have hundreds (thousands?) of servers worldwide that distribute their traffic load. If even a small percentage of that traffic is directed to a single server … what chance does it have?"
→ Help him ←
Existing approaches
• Client-side proxying
  • Squid, Summary Cache, hierarchical caches, CoDeeN, Squirrel, Backslash, PROOFS, …
  • Problem: not 100% coverage
• Throw money at the problem
  • Load-balanced servers, fast network connections
  • Problem: can't afford it, or don't anticipate the need
• Content Distribution Networks (CDNs)
  • Akamai, Digital Island, Mirror Image, …
From Internet Mazes to …
[Figure: many end users scattered across the Internet, all fetching content directly from a single origin server]
Content distribution
[Figure: world map of a CDN's surrogate servers in cities such as Seattle, San Jose, Los Angeles, Denver, Dallas, Chicago, Toronto, New York, Boston, Washington D.C., Atlanta, Miami, London, Paris, Amsterdam, Frankfurt, Zurich, Stockholm, Tokyo, Hong Kong, Singapore, and Sydney]
Content Distribution Networks (CDNs)
[Figure: CDN architecture]
Types of CDNs

                 pull       push
cooperative      Coral      first proposed @ IEEE JSAC'03, and what is described here today
uncooperative    Akamai     X
Comparison

Outsourcing policy     Replication redundancy   Commun. cost   Update cost   Temporal coherency
Uncooperative pull     High                     High           High          Low
Cooperative pull       Low                      High           Medium        Medium
Uncooperative push     High                     Low            Medium        Medium
Cooperative push       Low                      Medium         Low           High
Cooperative push
• What to push?
  • Frequently accessed content (IEEE JSAC'03)
  • Hard to predict what will be popular!
  • Popularity changes rapidly, too!
  • Request statistics? A reactive approach
  • Can we devise a proactive solution?
• Where to store the pushed content?
  • Easy; there are plenty of replica placement algorithms
Communities as “attractors”
Web-site communities DO exist
[Figure: communities in the hollins.edu Web-site graph]
Antonis Sidiropoulos et al., WWW Journal, 11(1), 2008
"Hard" (max-flow) communities
• COMMUNITY: a subset of the nodes of a graph with the property that, for each node of the community, the number of links to other nodes belonging to the community is larger than the number of links to nodes NOT belonging to the community
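For reference, a compact restatement of this "strong" definition (the notation below is mine, not the slide's): for a graph G = (V, E) and a candidate community C ⊆ V, let k_v^in(C) be the number of v's links to nodes inside C and k_v^out(C) the number of its links to nodes outside C. Then C is a community in the strong sense when

    \forall v \in C: \quad k_v^{\mathrm{in}}(C) \;>\; k_v^{\mathrm{out}}(C)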
“Hard”, but inefficient
Generalized communities …
• COMMUNITY: a subset of the nodes of a graph with the property that the sum of all degrees within the community is larger than the sum of all degrees toward the rest of the graph
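In the same (my own) notation, this generalized definition corresponds to the familiar "weak" community condition, which constrains the subset as a whole rather than every node individually:

    \sum_{v \in C} k_v^{\mathrm{in}}(C) \;>\; \sum_{v \in C} k_v^{\mathrm{out}}(C)

Every strong community satisfies this inequality (summing the per-node condition over C), but not vice versa, which is why the generalized notion admits more communities and avoids the expensive max-flow machinery behind the "hard" ones.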
Social Network Analysis
• A social network is a social structure used to describe social relations (Wikipedia)
• The history of social network analysis is older than everybody in this room (more than 100 years – Cooley 1909, Durkheim 1893)
[book: Stanley Wasserman & Katherine Faust]
1. Mathematical Representation
2. Structural & Locational Properties
   1. Centrality
      • Betweenness centrality
3. Roles & Positions
4. Dyadic & Triadic Methods
Betweenness Centrality
• σ_uw = σ_wu : the number of shortest paths from u ∈ V to w ∈ V (σ_uu = 0)
• σ_uw(v) : the number of shortest paths from u to w that some vertex v ∈ V lies on
• The Betweenness Centrality NI(v) of a vertex v is:

    NI(v) = \sum_{u \neq v \neq w \in V} \frac{\sigma_{uw}(v)}{\sigma_{uw}}
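A minimal sketch of how such values can be computed in practice (this code and the use of the networkx library are my illustration, not part of the talk; the graph is a made-up example, and raw, unnormalized counts are used to stay in the spirit of the NI values listed on the sample-graph slides):

    # Sketch: computing betweenness centrality for a small undirected graph.
    # Assumes the networkx library; the graph below is a made-up example,
    # not the talk's sample graph.
    import networkx as nx

    G = nx.Graph()
    G.add_edges_from([
        (1, 2), (2, 3), (3, 4), (4, 5),   # a chain ...
        (3, 6), (3, 7),                   # ... with a hub at node 3
    ])

    # normalized=False keeps raw shortest-path counts
    bc = nx.betweenness_centrality(G, normalized=False)
    for node, score in sorted(bc.items(), key=lambda kv: -kv[1]):
        print(node, score)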
Betweenness Centrality in sample graphs
[Figure: two sample graphs – one with numbered nodes (1–20) and one with lettered nodes (A, B, C, P, Q, R, T, U, V, W, X, Y)]
Betweenness Centrality in sample graphs
[Figure: the same two graphs annotated with NI values – numbered graph: 14 (233), 7 (156), 16 (131), 18 (97), 4 (96), 3 (68), 8 (26), 17 (1), all other nodes 0; lettered graph: U (54), P (41), B (13), R (9.33), Q (8), A (6.67), W (3.33), T (1.33), V (1.33), C/X/Y (0)]
Nodes with large NI:
• Articulation nodes (in bridges), e.g., 3, 4, 7, 16, 18
• Nodes with large fanout, e.g., 14, 8, U
Betweenness centrality in …
• [WEB] Performing graph clustering and recognizing communities in Web-site graphs
CiBC Method
• Target: find node subsets C for which d_out(C) ≤ d_in(C) holds
• CiBC method:
  • Build "cliques" and clusters around representative (pole) nodes (with low C_B)
CiBC Method
Phase 1: NI computation – O(nm)
Phase 2: Initialization of cliques – O(n)
[Figure: example graph with nodes 0–11]

ID    NI index
10    20.68
 2    19.61
 6    11.38
 1    10.28
 7     2.06
 0     1.73
 9     0.99
 8     0.99
 4     0.75
 5     0.00
11     0.00
CiBC Method
Phase 3: Clique merging & creation of communities – complexity O(l²), where l is the number of cliques
[Figure: the example graph with initial cliques A, B, C, D and their clique-to-clique connection matrix]

      A   B   C   D
A     3   3   0   0
B     3   3   1   1
C     0   1   3   4
D     0   1   4   3
CiBC Method
Phase 3 (continued): the most strongly connected cliques (C and D) are merged and the matrix is updated
[Figure: the example graph with cliques A, B, C]

      A   B   C
A     3   3   0
B     3   3   2
C     0   2  10
CiBC Method
Phase 3 (continued): cliques A and B are merged, leaving two communities
[Figure: the example graph with the remaining communities A and C]

      A   C
A     9   2
C     2  10

Phase 4: Check constraints
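A rough sketch of the four phases as Python code (my own reconstruction from these slides, not the authors' implementation; the clique-initialization and merging rules below are simplified guesses):

    # Rough sketch of the four CiBC phases described on the slides.  This is my
    # own reconstruction for illustration, not the authors' code: the
    # clique-initialization and merging criteria are simplified guesses.
    import networkx as nx

    def cibc_sketch(G):
        # Phase 1: NI (betweenness centrality) computation, O(nm) via Brandes
        ni = nx.betweenness_centrality(G, normalized=False)

        # Phase 2: initialize groups ("cliques") around pole nodes; per the
        # appendix slides, poles are the nodes with the lowest NI.
        unassigned = set(G.nodes())
        cliques = []
        for v in sorted(G.nodes(), key=lambda n: ni[n]):
            if v not in unassigned:
                continue
            members = {v} | (set(G.neighbors(v)) & unassigned)
            cliques.append(members)
            unassigned -= members

        def inter(a, b):   # number of edges between two groups
            return sum(1 for u in a for w in b if G.has_edge(u, w))

        def intra(a):      # number of edges inside a group
            return sum(1 for u in a for w in a if u < w and G.has_edge(u, w))

        # Phase 3: merge groups whose mutual connections are at least as strong
        # as their internal cohesion (a guessed rule that reproduces the merge
        # order on the example slides); O(l^2) in the number of groups l.
        changed = True
        while changed and len(cliques) > 1:
            changed = False
            for i in range(len(cliques)):
                for j in range(i + 1, len(cliques)):
                    e = inter(cliques[i], cliques[j])
                    if e > 0 and e >= min(intra(cliques[i]), intra(cliques[j])):
                        cliques[i] |= cliques.pop(j)
                        changed = True
                        break
                if changed:
                    break

        # Phase 4: check the generalized-community constraint d_out(C) <= d_in(C),
        # where d_in/d_out are degree sums toward the inside/outside of C.
        def d_in(C):
            return sum(1 for u in C for w in G.neighbors(u) if w in C)

        def d_out(C):
            return sum(1 for u in C for w in G.neighbors(u) if w not in C)

        return [C for C in cliques if d_out(C) <= d_in(C)]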
Evaluation …
We need:
• Web-site graphs
• A CDN
  • Topology
  • Networking issues
• Request streams
  • Roaming over the site graph
It is impossible to find real data for all of these …
• Simulators for each of them
  • To compensate for the lack of any of the above
Simulators
• Web-site graphs
  • Simulating the growth process of the Web
• Request streams
  • Random surfer (following links + teleportation)
• CDN
  • CDNSim (http://oswinds.csd.auth.gr/~cdnsim/)
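As an illustration of the request-stream generator, here is a small sketch of the standard random-surfer model the slide names (my own code; the 0.15 teleportation probability and the stream length are assumed values, not taken from the talk):

    # Sketch of a random-surfer request stream over a Web-site graph.
    import random
    import networkx as nx

    def random_surfer_stream(G, n_requests=10_000, teleport=0.15, seed=0):
        rng = random.Random(seed)
        pages = list(G.nodes())
        current = rng.choice(pages)
        stream = []
        for _ in range(n_requests):
            stream.append(current)               # the surfer "requests" this page
            if G.is_directed():
                out_links = list(G.successors(current))
            else:
                out_links = list(G.neighbors(current))
            if not out_links or rng.random() < teleport:
                current = rng.choice(pages)      # teleport to a random page
            else:
                current = rng.choice(out_links)  # follow a random hyperlink
        return stream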
Competing methods
• Communities-based methods
  • Clique Percolation Method (CPM)
  • Correlation Clustering Communities identification method (C3i)
• Simple Web caching (LRU)
• No CDN (only the origin server)
• Full Replication
Metrics
• Mean Response Time (MRT): the expected time for a request to be satisfied
• Response time CDF: the Cumulative Distribution Function (CDF) gives the probability of a response time lower than or equal to a given value
• Replica Factor (RF): the percentage of replica objects stored across the whole CDN infrastructure with respect to the total number of outsourced objects
• Byte Hit Ratio (BHR)
• Independent parameters:
  a) surrogates' cache size
  b) graph assortativity
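A minimal sketch of how these metrics could be computed from simulation output (my own illustration; the per-request record fields are assumptions and do not correspond to an actual CDNSim output format):

    # Sketch: computing MRT, the response-time CDF, RF and BHR from a list of
    # per-request records.  Record fields are assumed for illustration only.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Request:
        response_time: float   # seconds
        bytes_served: int
        byte_hit: bool         # served by a surrogate rather than the origin

    def mean_response_time(reqs: List[Request]) -> float:
        return sum(r.response_time for r in reqs) / len(reqs)

    def response_time_cdf(reqs: List[Request], t: float) -> float:
        # probability of a response time lower than or equal to t
        return sum(r.response_time <= t for r in reqs) / len(reqs)

    def byte_hit_ratio(reqs: List[Request]) -> float:
        served = sum(r.bytes_served for r in reqs)
        hit = sum(r.bytes_served for r in reqs if r.byte_hit)
        return hit / served

    def replica_factor(replicas_per_surrogate: List[int], outsourced: int) -> float:
        # percentage of replicas across the CDN w.r.t. the outsourced objects
        return 100.0 * sum(replicas_per_surrogate) / outsourced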
Situations examined
• Regular traffic
  • Network delay dominates the other components
• Flash crowd event
  • TCP setup delay + network delay dominate the other components
Regular traffic: MRT vs. community strength [figure]
Regular traffic: BHR vs. community strength [figure]
Regular traffic: MRT vs. cache size [figure]
Surge of requests: CiBC [figure]
Surge of requests: CPM [figure]
Surge of requests: C3i [figure]
Surge of requests: LRU [figure]
Discussion
• CDNs: strong industrial interest
• Content outsourcing: a significant issue
• Proactive content outsourcing:
  • Discovery of communities
  • Placement on surrogate servers
• CiBC prevails
References
Our work
• D. Katsaros, G. Pallis, K. Stamos, A. Sidiropoulos, A. Vakali, Y. Manolopoulos. "CDNs Content Outsourcing via Generalized Communities". IEEE Transactions on Knowledge and Data Engineering, 2008.
State-of-the-art competing method
• [CPM community identification method] G. Palla, I. Derenyi, I. Farkas, and T. Vicsek. "Uncovering the overlapping community structure of complex networks in nature and society". Nature, 435(7043):814–818, 2005.
Thanks to my collaborators at A.U.Th.
Thank you for your attention!
Questions?
The CiBC Algorithm (1/4)
• The Web site is represented by a Web graph G = (V, E), where the nodes are the Web pages and the edges depict the hyperlinks among Web pages
• Input: the Web-site graph
• Output: a set of Web-page communities; these communities constitute the set of objects which are outsourced to the surrogate servers
• Phase I: Computation of Betweenness Centrality
  • the concept of Betweenness Centrality (BC) is used to select the pole nodes
  • the pole nodes are the nodes with the lowest Betweenness Centrality
• Phase II: Nodes Accumulation around Pole Nodes
  • nodes are accumulated around the identified pole nodes by making use of Web-graph properties
  • a set of Web-page communities is created
The CiBC Algorithm (2/4)
• Betweenness Centrality (BC) reflects the amount of control exerted by a given Web page over the interactions between the other Web pages in the Web server content structure.
Performance Evaluation
Examined Methods
• Clique Percolation Method (CPM): the outsourced objects obtained by CPM correspond to k-clique percolation clusters in the network.
  • A k-clique percolation cluster is a sub-graph containing k-cliques (complete sub-graphs of size k) that can all reach each other through chains of k-clique adjacency, where two k-cliques are said to be adjacent if they share k-1 nodes.
  • Experiments have shown that this method is quite efficient when applied to large graphs.
• Web caching scheme (LRU): the objects are stored reactively on proxy cache servers. We consider that each proxy cache server follows the LRU (Least Recently Used) replacement policy, since this is the typical policy of popular proxy servers (e.g., Squid).
• No Replication (W/O CDN): all the objects are placed on the origin server; there is no CDN and no proxy servers. This policy represents the "worst-case" scenario.
• Full Replication (FR): all the objects are outsourced to all the CDN's surrogate servers. This (unrealistic) policy represents the "optimal-case" scenario.
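For concreteness, a minimal sketch of extracting k-clique percolation clusters with an off-the-shelf routine (my illustration of the CPM idea using networkx, not the implementation used in the paper; the small graph is invented):

    # Sketch: k-clique percolation communities (the CPM idea) via networkx.
    import networkx as nx
    from networkx.algorithms.community import k_clique_communities

    G = nx.Graph()
    G.add_edges_from([
        (1, 2), (1, 3), (2, 3), (3, 4),         # a triangle attached to ...
        (4, 5), (4, 6), (5, 6), (6, 7), (5, 7)  # ... two overlapping triangles
    ])

    # k = 3: communities are unions of triangles that share k-1 = 2 nodes
    for community in k_clique_communities(G, 3):
        print(sorted(community))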
Performance evaluation parameters
[Table: simulation testbed parameters]
Content Replication Problem
• Lat-cdn: the outsourced objects are placed on surrogate servers with respect to the total network latency, without taking into account the objects' popularity (LA-Web 2005)
• il2p: the outsourced objects are placed on surrogate servers by integrating both the network latency and the objects' load (ICDE Workshops 2006)
The Lat-cdn Algorithm: The Flowchart
[Flowchart: inputs are the CDN infrastructure and the outsourced objects]
1. All the "outsourced objects" are stored on the origin server and all the CDN's surrogate servers are empty.
2. For each outsourced object, we find the best surrogate server on which to place it (the one that produces the minimum network latency).
3. From all the object-to-surrogate-server pairs that occurred in the previous step, we select the one which produces the largest network latency, and place this object on that surrogate server.
4. If the surrogate servers have become full, the placement is final; otherwise, go back to step 2.
The il2p Algorithm: The Flowchart
[Flowchart: inputs are the CDN infrastructure and the outsourced objects]
1. All the "outsourced objects" are stored on the origin server and all the CDN's surrogate servers are empty.
2. For each outsourced object, we find the best surrogate server on which to place it (the one that produces the minimum network latency).
3. From all the object-to-surrogate-server pairs that occurred in the previous step, we select the one with the maximum utility value, and place this object on that surrogate server.
4. If the surrogate servers have become full, the placement is final; otherwise, go back to step 2.
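A compact sketch of the greedy placement loop shared by the two flowcharts (my reconstruction; latency() and utility() are placeholder callables, and il2p's actual utility definition combines latency and object load in a way these slides do not fully specify):

    # Sketch of the greedy placement loop shared by Lat-cdn and il2p.
    # Reconstruction from the flowcharts; latency() and utility() are
    # placeholder callables, not the papers' exact definitions.
    def greedy_placement(objects, surrogates, latency, utility=None, capacity=10):
        placement = {s: [] for s in surrogates}   # objects start on the origin only
        remaining = list(objects)

        while remaining and any(len(v) < capacity for v in placement.values()):
            # Step 2: best (minimum-latency) surrogate for every remaining object
            candidates = []
            for obj in remaining:
                open_srv = [s for s in surrogates if len(placement[s]) < capacity]
                best = min(open_srv, key=lambda s: latency(obj, s))
                candidates.append((obj, best))

            # Step 3: Lat-cdn picks the pair with the largest latency;
            # il2p picks the pair with the maximum utility value instead.
            if utility is None:   # Lat-cdn behaviour
                obj, srv = max(candidates, key=lambda p: latency(p[0], p[1]))
            else:                 # il2p behaviour (utility is a placeholder)
                obj, srv = max(candidates, key=lambda p: utility(p[0], p[1]))

            placement[srv].append(obj)
            remaining.remove(obj)

        return placement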