Transcript of Document 7394773

A Server-less Architecture for
Building Scalable, Reliable, and
Cost-Effective Video-on-demand
Systems
Presented by: Raymond Leung Wai Tak
Supervisor: Prof. Jack Lee Yiu-bun
Department of Information Engineering
The Chinese University of Hong Kong
Contents

1. Introduction
2. Challenges
3. Server-less Architecture
4. Reliability Analysis
5. Performance Modeling
6. System Dimensioning
7. Multiple Parity Groups
8. Conclusion
1. Introduction
Traditional Client-Server Architecture

- Clients connect to the server and request videos
- Server capacity limits the system capacity
- Cost increases with system scale
1. Introduction
Server-less Architecture

- Motivated by the availability of powerful user devices
- Each user node (set-top box, STB) serves both as a client and as a mini-server
- Each user node contributes to the system:
  - Memory
  - Processing power
  - Network bandwidth
  - Storage
- Costs are shared by the users
1. Introduction
Architecture Overview

- The system is composed of autonomous clusters of STBs

[Figure: autonomous clusters of STBs; each STB both plays back video locally and serves the other nodes in its cluster]
2. Challenges

- Video Data Storage Policy
- Retrieval and Transmission Scheduling
- Fault Tolerance
- Distributed Directory Service
- Heterogeneous User Nodes
- System Adaptation (node joining/leaving)
3. Server-less Architecture
Storage Policy

- Video data is divided into fixed-size blocks of Q bytes
- Data blocks are distributed among the nodes in the cluster (data striping)
- Low per-node storage requirement and load balancing
- Capable of fault tolerance using redundant blocks (discussed later)
3. Server-less Architecture
Retrieval and Transmission Scheduling

- Round-based scheduler
- Grouped Sweeping Scheduling¹ (GSS)
  - Composed of macro rounds and micro rounds
  - Tradeoff between disk efficiency and buffer requirement

¹ P.S. Yu, M.S. Chen & D.D. Kandlur, "Grouped Sweeping Scheduling for DASD-based Multimedia Storage Management", ACM Multimedia Systems, vol. 1, pp. 99-109, 1993
3. Server-less Architecture
Retrieval and Transmission Scheduling

- Data retrieved in the current micro round is transmitted immediately in the next micro round
- Each retrieval block is divided into b transmission blocks for transmission
- Transmission block size: U = Q / b
- Transmission lasts for one macro round

[Figure: GSS timeline showing disk retrieval of Q-byte blocks (groups 1 and 2, one per micro round Tg) and transmission of U-byte blocks over macro rounds Tf (rounds 0-2 of group 0)]
3. Server-less Architecture
Retrieval and Transmission Scheduling

- Macro round length
  - Defined as the time required by all nodes to transmit one retrieval block
  - Number of requests served: N
  - Macro round length: Tf = NQ / Rv
- Micro round length
  - Each macro round is divided into g micro rounds
  - Number of requests served per micro round: N/g
  - Micro round length: Tg = Tf / g = NQ / (g·Rv)
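As a sanity check, both round lengths follow directly from N, Q, g, and Rv. The sketch below uses illustrative numbers (not values from the slides):

```python
def macro_round(N, Q, Rv):
    """Macro round length Tf = N*Q/Rv: time for a node to serve one
    retrieval block of Q units to each of the N concurrent streams."""
    return N * Q / Rv

def micro_round(N, Q, g, Rv):
    """Micro round length Tg = Tf/g under GSS with g groups."""
    return macro_round(N, Q, Rv) / g

# Illustrative: 100 nodes, 64 KiB retrieval blocks, 4 Mb/s video bitrate.
N, Q_bits, g, Rv = 100, 64 * 1024 * 8, 8, 4e6   # Q in bits, Rv in bit/s
Tf = macro_round(N, Q_bits, Rv)                 # about 13.1 s
Tg = micro_round(N, Q_bits, g, Rv)              # Tf / 8
```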
3. Server-less Architecture
Modification in Storage Policy

- Since the retrieval blocks are divided into transmission blocks for transmission, video data is striped across transmission blocks instead of retrieval blocks
3. Server-less Architecture
Fault Tolerance

- Must recover not only from a single node failure, but from multiple simultaneous node failures as well
- Redundancy by a Forward Error Correction (FEC) code, e.g. the Reed-Solomon Erasure Code (REC)
3. Server-less Architecture
Impact of Fault Tolerance on Block Size

- Tolerate up to h simultaneous failures
- To maintain the same amount of video data transmitted in each macro round, the block size is increased to Qr:

  Qr = Q · N / (N − h)

- Similarly, the transmission block size is increased to Ur:

  Ur = Qr / b = U · N / (N − h)
4. Reliability Analysis
Reliability Analysis

- Find the system mean time to failure (MTTF)
- Assuming independent node failure/repair rates
- Tolerate up to h failures by redundancy
- Analysis by a Markov chain model

[Figure: Markov chain with states 0, 1, 2, ..., h (number of failed nodes), failure transitions λ0, ..., λh and repair transitions μ1, ..., μh; state h+1 is system failure]
4. Reliability Analysis
Reliability Analysis

- With the assumption of independent failure and repair rates:

  λi = (N − i)λ
  μi = iμ

- Let Ti be the expected time the system takes to reach state h+1 from state i:

  T0 = 1/λ0 + T1
  Ti = 1/(λi + μi) + λi/(λi + μi) · Ti+1 + μi/(λi + μi) · Ti−1,  for 1 ≤ i ≤ h
  Th+1 = 0
4. Reliability Analysis
Reliability Analysis

- By solving the above set of equations, the system MTTF (T0) is

  T0 = Σ_{i=0}^{h} Σ_{j=0}^{i} ( Π_{k=j+1}^{i} μk ) / ( Π_{k=j}^{i} λk )

- With a target system MTTF, we can find the redundancy (h) required
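This closed form can be cross-checked numerically. The sketch below is an illustration with assumed per-node failure and repair rates; it evaluates T0 and searches for the smallest h that meets a target MTTF:

```python
def mttf(N, h, lam, mu):
    """System MTTF T0 for a cluster of N nodes tolerating h failures,
    using the birth-death closed form with lambda_i = (N-i)*lam (failure)
    and mu_i = i*mu (repair)."""
    T0 = 0.0
    for i in range(h + 1):
        for j in range(i + 1):
            num = 1.0
            for k in range(j + 1, i + 1):
                num *= k * mu                # mu_k
            den = 1.0
            for k in range(j, i + 1):
                den *= (N - k) * lam         # lambda_k
            T0 += num / den
    return T0

def redundancy_required(N, lam, mu, target_mttf):
    """Smallest h (if any) whose system MTTF reaches the target."""
    for h in range(N):
        if mttf(N, h, lam, mu) >= target_mttf:
            return h
    return None

# Assumed rates: node MTTF 1,000 h (lam = 1e-3 per hour), 10 h repair.
h_needed = redundancy_required(200, 1e-3, 0.1, 10_000)
```

With h = 0 the formula collapses to 1/(Nλ), the expected time to the first failure, which is a quick correctness check.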
4. Reliability Analysis
Redundancy Level

- Defined as the proportion of nodes serving redundant data (h/N)
- Redundancy level versus number of nodes for achieving the target system MTTF

[Figure: required redundancy level (roughly 0.1 to 0.25) versus number of nodes (50 to 500), for target MTTFs of 1,000, 10,000, and 100,000 hours]
5. Performance Modeling

- Storage Requirement
- Network Bandwidth Requirement
- Buffer Requirement
- System Response Time
- Assumptions:
  - Zero network delay
  - Zero processing delay
  - Bounded clock jitter among nodes
5. Performance Modeling
Storage Requirement

- Let SA be the combined size of all video titles to be stored in the cluster
- With redundancy h, additional storage is required
- The storage requirement per node (SN):

  SN = SA / (N − h)
5. Performance Modeling
Bandwidth Requirement

- Assume a video bitrate of Rv bps
- Without redundancy, each node transmits (N − 1) streams of video data to the other nodes in the cluster, each stream consuming a bitrate of Rv/N bps
- With redundancy h, additional bandwidth is required
- The bandwidth requirement per node (CR):

  CR = (N − 1)/(N − h) · Rv
5. Performance Modeling
Buffer Requirement

- Composed of the sender buffer requirement and the receiver buffer requirement
- Sender Buffer Requirement
  - Under GSS scheduling:

    Bs,r = (1 + 1/g) · N·Qr = (1 + 1/g) · N²Q / (N − h)
5. Performance Modeling
Receiver Buffer Requirement

- Stores the data temporarily before playback
- Absorbs the deviations in data arrival time caused by clock jitter (bounded by τ):

  Br,r = 2(1 + ⌈τb/Tf⌉) · N·Ur

- Total Buffer Requirement
  - One data stream is for local playback rather than transmission
  - Buffer sharing for this local playback stream: subtract b buffer blocks of size Ur from the receiver buffer

    Bt,r = Bs,r + Br,r − b·Ur = (1 + 1/g) · N²Q/(N − h) + (2(1 + ⌈τb/Tf⌉)N − b) · NQ / (b(N − h))
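A minimal sketch of the buffer formulas; τ (the clock-jitter bound) and all numeric values below are assumptions for illustration:

```python
import math

def buffer_requirements(N, h, Q, g, b, tau, Rv):
    """Per-node sender, receiver, and total buffer (same unit as Q).

    Bs = (1 + 1/g) * N * Qr
    Br = 2 * (1 + ceil(tau*b/Tf)) * N * Ur
    Bt = Bs + Br - b * Ur   (b blocks shared with the local playback stream)
    """
    Qr = Q * N / (N - h)              # enlarged retrieval block
    Ur = Qr / b                       # enlarged transmission block
    Tf = N * Q / Rv                   # macro round length
    jitter = math.ceil(tau * b / Tf)  # extra blocks absorbed by jitter
    Bs = (1 + 1 / g) * N * Qr
    Br = 2 * (1 + jitter) * N * Ur
    return Bs, Br, Bs + Br - b * Ur

# Illustrative: 200 nodes, h = 20, 64 KiB blocks (bytes), g = 8, b = 16,
# 50 ms jitter bound, 0.5 MB/s video bitrate.
Bs, Br, Bt = buffer_requirements(200, 20, 65536, 8, 16, 0.05, 5e5)
```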
5. Performance Modeling
System Response Time

- Time required from sending out the request until playback begins
- Scheduling delay + prefetch delay
- Scheduling delay under GSS
  - Time required from sending out the request until data retrieval starts
  - Can be analyzed using an urns model; detailed derivation available in Lee's work²

[Figure: a new request arriving during a micro round Tg is scheduled for disk retrieval in a later round within the macro round Tf]

² Lee, J.Y.B., "Concurrent push - A scheduling algorithm for push-based parallel video servers", IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, no. 3, April 1999, pp. 467-477
5. Performance Modeling
Prefetch Delay

- Time required from the start of data retrieval until playback begins
- One micro round to retrieve a data block, plus the buffering time to fill up the prefetch buffer of the receiver
- Additional delay is incurred due to clock jitter among nodes:

  Dp = (1/g + (1/b)(1 + ⌈τb/Tf⌉)) · Tf
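The prefetch delay formula above can be sketched directly; the numbers used are illustrative assumptions:

```python
import math

def prefetch_delay(N, Q, g, b, tau, Rv):
    """D_p = (1/g + (1/b) * (1 + ceil(tau*b/Tf))) * Tf.

    One micro round (Tf/g) of disk retrieval, plus (1 + jitter margin)
    transmission-block times (Tf/b each) to fill the prefetch buffer.
    """
    Tf = N * Q / Rv
    return (1 / g + (1 / b) * (1 + math.ceil(tau * b / Tf))) * Tf

# Illustrative: 100 nodes, 64 KiB blocks (bits), g = b = 8, no jitter.
Dp = prefetch_delay(100, 524288, 8, 8, 0.0, 4e6)   # 0.25 * Tf
```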
6. System Dimensioning
Storage Requirement

- What is the minimum number of nodes required to store a given amount of video data?
- For example:
  - Video bitrate: 4 Mb/s
  - Video length: 2 hours
  - Storage required for 100 videos: 351.6 GB
- If each node can allocate 2 GB for video storage, then
  - 176 nodes are needed (without redundancy); or
  - 209 nodes are needed (with 33 nodes added for redundancy)
- This sets the lower limit on the cluster size
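The 351.6 GB figure can be reproduced; note that it comes out only if the bitrate uses binary megabits (2²⁰ bits), which is an assumption about the slides' arithmetic:

```python
import math

def storage_dimensioning(n_videos, hours, mbps, node_gb, h=0):
    """Total library size (GB) and minimum cluster size.

    Assumes 1 Mb = 2**20 bits and 1 GB = 2**30 bytes, which reproduces
    the slides' 351.6 GB for 100 two-hour 4 Mb/s videos.
    """
    bytes_per_video = mbps * 2**20 / 8 * hours * 3600
    total_gb = n_videos * bytes_per_video / 2**30
    nodes = math.ceil(total_gb / node_gb) + h
    return total_gb, nodes

total_gb, nodes = storage_dimensioning(100, 2, 4, 2)       # 351.56 GB, 176 nodes
_, nodes_r = storage_dimensioning(100, 2, 4, 2, h=33)      # 209 nodes
```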
6. System Dimensioning
Network Capacity

- How many nodes can be connected given a certain network switching capacity?
- For example, if the network switching capacity is 32 Gb/s, assuming 60% utilization and a video bitrate of 4 Mb/s:
  - Up to 2412 nodes (without redundancy)
- Network switching capacity is not a bottleneck
6. System Dimensioning
Disk Access Bandwidth

- Determine the values of Q and g to evaluate the buffer requirement and the system response time
- The finite disk access bandwidth limits the values of Q and g
- Disk Model of the Disk Service Time
  - Time required to retrieve data blocks for transmission
  - Depends on the seeking overhead, rotational latency, and data block size
  - Suppose there are k requests per GSS group
  - The maximum round service time in the worst-case scenario:

    t_round(k, Qr) = k·α + t_seek^max(k) + k·(W⁻¹ + Qr/r_min)

    where
    α: fixed overhead
    t_seek^max(k): maximum seek time for k requests
    W⁻¹: rotational latency
    r_min: minimum transfer rate
    Qr: data block size
6. System Dimensioning
Constraint for Smooth Data Flow

- The disk service round must finish before transmission begins
- The disk service time must be shorter than the micro round length:

  t_round(N/g, Qr) ≤ Tf / g
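A sketch of the constraint check; the disk parameters (α, maximum seek time, W⁻¹, r_min) are assumed values, not from the slides, and the maximum seek time is treated as a given constant:

```python
def t_round(k, Qr, alpha, t_seek_max, w_inv, r_min):
    """Worst-case disk service time for k requests of Qr bytes:
    t_round = k*alpha + t_seek_max + k*(w_inv + Qr/r_min)."""
    return k * alpha + t_seek_max + k * (w_inv + Qr / r_min)

def smooth_flow(N, g, Qr, Tf, alpha, t_seek_max, w_inv, r_min):
    """True if the N/g requests of a micro round finish within Tg = Tf/g."""
    return t_round(N / g, Qr, alpha, t_seek_max, w_inv, r_min) <= Tf / g

# Assumed disk: 1 ms fixed overhead, 15 ms max seek, 4 ms rotational
# latency, 20 MB/s minimum transfer rate.
ok = smooth_flow(N=200, g=8, Qr=65536, Tf=26.2,
                 alpha=1e-3, t_seek_max=15e-3, w_inv=4e-3, r_min=20e6)
```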
6. System Dimensioning
Buffer Requirement

- Decrease the block size (Qr) and increase the number of groups (g) to achieve the minimum system response time, provided that the smooth data flow constraint is satisfied

[Figure: sender, receiver, and total buffer requirement (0 to 20 MB) versus number of nodes (0 to 500)]
6. System Dimensioning
System Response Time

- System response time versus number of nodes in the cluster

[Figure: scheduling delay, prefetch delay, and overall system response time (0 to 15 s) versus number of nodes (0 to 500)]
6. System Dimensioning

- Scheduling Delay
  - Relatively constant while the system scales up
- Prefetch Delay
  - Time required to receive the first group of blocks from all nodes
  - Increases linearly with system scale - not scalable
  - Ultimately limits the cluster size
- What is the solution? Multiple parity groups
7. Multiple Parity Groups
Primary Limit on Cluster Scalability

- The prefetch delay in the system response time
- Multiple Parity Groups
  - Instead of a single parity group, the redundancy is encoded over multiple parity groups
  - Decreases the number of blocks that must be received before playback
  - Playback begins after receiving the data of the first parity group
  - Reduces the prefetch delay
7. Multiple Parity Groups
Multiple Parity Groups

- The transmissions of different parity groups are staggered

[Figure: transmission timelines of nodes 0 through N−1 over rounds 0-2; parity group 1 and parity group 2 transmissions are staggered across the nodes]
7. Multiple Parity Groups
Impact on Performance

- Buffer requirement
- System response time
- Redundancy requirement

- Buffer Requirement
  - The number of blocks within the same parity group is reduced
  - The receiver buffer requirement is reduced:

    Br,p = 2(1 + ⌈τb_p/Tf⌉) · N·Ur / p

    where p is the number of parity groups and b_p the number of transmission blocks per parity group
7. Multiple Parity Groups
System Response Time

- Playback begins after receiving the data of the first parity group
- The system response time is reduced:

  Dp,p = (1/g + (1/b_p)(1 + ⌈τb_p/Tf⌉)) · Tf
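Under the stated formulas, splitting the redundancy into p parity groups scales the receiver buffer by 1/p and shrinks the prefetch term. A sketch, with b_p treated as an independent per-parity-group transmission block count (an assumption about the notation):

```python
import math

def receiver_buffer_p(N, Ur, b_p, Tf, tau, p):
    """B_{r,p} = 2 * (1 + ceil(tau*b_p/Tf)) * N * Ur / p."""
    return 2 * (1 + math.ceil(tau * b_p / Tf)) * N * Ur / p

def prefetch_delay_p(Tf, g, b_p, tau):
    """D_{p,p} = (1/g + (1/b_p) * (1 + ceil(tau*b_p/Tf))) * Tf."""
    return (1 / g + (1 / b_p) * (1 + math.ceil(tau * b_p / Tf))) * Tf
```

With zero jitter, doubling p halves the receiver buffer, and a larger b_p shrinks the prefetch term of the delay.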
7. Multiple Parity Groups
Redundancy Requirement

- The cluster is divided into parity groups with fewer nodes each
- A higher redundancy level is needed to maintain the same system MTTF
- Tradeoff between response time and redundancy level

[Figure: required redundancy level (roughly 0.1 to 0.5) versus number of nodes (50 to 500), for a target MTTF of 10,000 hrs with p = 1, 2, 3, 4, and 5 parity groups]
7. Multiple Parity Groups
Performance Evaluation

- Buffer requirement and system response time versus redundancy level at a cluster size of 1500 nodes
- Both the system response time and the buffer requirement decrease with more redundancy (i.e. more parity groups)

[Figure: system response time (sec) and total buffer (MB), both on 0-60 scales, versus redundancy level from 0.1 to 0.5]
7. Multiple Parity Groups
Cluster Scalability

- What are the system configurations if the system
  a. achieves an MTTF of 10,000 hours, and
  b. keeps under a response time constraint of 5 seconds, and
  c. keeps under a buffer requirement of 8/16 MB?

[Figure: redundancy level (0.15 to 0.35) and system response time (1 to 5 s) versus number of nodes (0 to 1600), for the 8 MB and 16 MB buffer constraints]
7. Multiple Parity Groups
Cluster Scalability

- The cluster is divided into more parity groups if it exceeds either
  - the response time constraint, or
  - the buffer constraint
- The redundancy level stays relatively constant: the increased cluster size improves redundancy efficiency, which compensates for the extra redundancy overhead incurred by the multiple parity group scheme (e.g. under the 16 MB buffer constraint)
7. Multiple Parity Groups
Shifted Bottleneck in Cluster Scalability

- The transmission (sender) buffer increases linearly with the cluster scale and cannot be reduced by the multiple parity group scheme
- The system is forced to divide into more parity groups to reduce the receiver buffer requirement and stay within the buffer constraint
- The redundancy overhead increases sharply while the system response time drops sharply (e.g. under the 8 MB buffer constraint)
- Eventually the total buffer requirement exceeds the buffer constraint even when the cluster is divided into more parity groups
- Scalability Bottleneck Shifted to the Buffer Requirement
  - The system can be further scaled up by forming autonomous clusters
8. Conclusion
Server-less Architecture

- Scalable
  - Acceptable redundancy level to achieve a reasonable response time within a cluster
  - Further scale-up by forming new autonomous clusters
- Reliable
  - Fault tolerance by redundancy
  - Reliability comparable to a high-end server, by the Markov chain analysis
- Cost-Effective
  - The dedicated server is eliminated
  - Costs are shared by all users
8. Conclusion
Future Work

- Distributed Directory Service
- Heterogeneous User Nodes
- Dynamic System Adaptation
  - Node joining/leaving
  - Data re-distribution
End of Presentation
Thank you
Question & Answer Session.