Transcript WDAS 2004

Workshop on Distributed Data & Structures
July 2004
Design & Implementation of LH*RS: a Highly Available Distributed Data Structure
Rim Moussa
[email protected]
http://ceria.dauphine.fr/rim/rim.html
Thomas J.E. Schwarz
[email protected]
http://www.cse.scu.edu/~tschwarz/homepage/thomas_schwarz.html
Objective
LH*RS
Design
Implementation
Performance Measurements
Factors of interest are:
Parity Overhead
Recovery Performance
Overview
1. Motivation
2. Highly-available schemes
3. LH*RS
4. Architectural Design
5. Hardware testbed
6. File Creation
 Scenario Description
7. High Availability
 Performance Results
8. Recovery
9. Conclusion
10. Future Work
Motivation
Information volume grows by 30% per year
Disk access and CPUs are bottlenecks
Failures are frequent & costly
Business Operation | Industry | Average Hourly Financial Impact
Brokerage (Retail) operations | Financial | $6.45 million
Credit Card Sales Authorization | Financial | $2.6 million
Airline Reservation Centers | Transportation | $89,500
Cellular (new) Service Activation | Communication | $41,000
Source: Contingency Planning Research, 1996
Requirements
Need: Highly Available Networked Data Storage Systems
Scalability
High Throughput
High Availability
Scalable & Distributed Data Structure
Dynamic file growth
[Diagram: multiple Clients send Inserts over the Network to the Data Buckets (DBs); the Coordinator manages bucket splits, and records are transferred to new buckets as the file grows.]
SDDS (Ctnd.)
No Centralized Directory Access
[Diagram: a Client sends a Query over the Network; an addressed Data Bucket may forward the Query to the correct bucket, and the Client then receives an Image Adjustment Message.]
Solutions towards High Availability
Data Replication
(+) Good response time, since mirrors are queried
(-) High storage cost (factor of n for n replicas)
Parity Calculus
Erasure-resilient codes are evaluated with regard to:
 Coding Rate (parity volume / data volume)
 Update Penalty
 Group Size used for Data Reconstruction
 Complexity of Coding & Decoding
Fault-Tolerant Schemes
1 server failure
Simple XOR parity calculus: RAID Systems [Patterson et al., 88],
The SDDS LH*g [Litwin et al., 96]
More than 1 server failure
Binary linear codes: [Hellerstein et al., 94]
Array Codes: EVENODD [Blaum et al., 94], X-code [Xu et al., 99], RDP scheme [Corbett et al., 04]
 Tolerate just 2 failures
Reed-Solomon Codes:
IDA [Rabin, 89], RAID X [White, 91], FEC [Blomer et al., 95], Tutorial [Plank, 97], LH*RS [Litwin & Schwarz, 00], …
 Tolerate a large number of failures
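As an illustration of the simple XOR parity calculus behind the single-failure schemes above, here is a minimal sketch (my own, not the LH*RS code): one parity bucket holds the byte-wise XOR of the group's data buckets, and any single lost bucket is rebuilt by XOR-ing the survivors with the parity.

```python
# Minimal sketch of 1-available XOR parity (illustrative, not the LH*RS code).
from functools import reduce

def xor_parity(buckets: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized bucket contents."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*buckets))

def recover_single(survivors: list[bytes], parity: bytes) -> bytes:
    """Rebuild the one missing bucket from the surviving buckets + the parity."""
    return xor_parity(survivors + [parity])

data = [b"AAAA", b"BBBB", b"CCCC"]        # a group of 3 data buckets
parity = xor_parity(data)                 # the single parity bucket
assert recover_single([data[0], data[2]], parity) == data[1]   # bucket 1 lost
```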
A Highly Available & Distributed Data Structure: LH*RS
[Litwin & Schwarz, 00]
[Litwin, Moussa & Schwarz, sub.]
LH*RS
SDDS
Data Distribution scheme based on Linear Hashing:
LH*LH [Karlsson et al., 96] applied to the key field
Scalability
High Throughput
Parity Calculus:
Reed-Solomon Codes [Reed & Solomon, 60]
High Availability
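A hedged sketch of the linear-hashing address computation underlying the LH* data distribution (the image pair (i, n) and the function name are illustrative, not the LH*RS source):

```python
# LH* client/server addressing sketch: the client's file image is (i, n),
# where i is the file level and n the split pointer (illustrative names).
def lh_address(key: int, i: int, n: int) -> int:
    a = key % (2 ** i)            # h_i(key)
    if a < n:                     # bucket a was already split at this level
        a = key % (2 ** (i + 1))  # so use h_{i+1}(key)
    return a

# A client whose image (i, n) is outdated may hit the wrong Data Bucket; the
# bucket then forwards the query and the client receives an Image Adjustment
# Message that refreshes its local (i, n).
print(lh_address(key=12345, i=3, n=2))   # -> 9
```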
LH*RS File Structure
[Figure: an inserted key is assigned a rank r in its Data Bucket. Data Buckets store records as (Key, Data Field); each Parity Bucket of the group stores, per rank, a record (Rank, [Key List], Parity Field) computed over the data records of that rank across the group's Data Buckets.]
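To make the per-rank parity record concrete, here is a hedged sketch (the type and function names are mine): the parity field is computed symbol-wise over the data fields of the records sharing a rank across the group's Data Buckets. Plain XOR is shown; LH*RS applies Reed-Solomon coefficients for the second and later Parity Buckets.

```python
# Sketch of building one parity record (Rank, [Key List], Parity Field) from
# the data records of the same rank across the group (illustrative only).
from dataclasses import dataclass

@dataclass
class ParityRecord:
    rank: int
    key_list: list          # keys of the covered data records
    parity_field: bytes

def make_parity_record(rank: int, records: list[tuple[int, bytes]]) -> ParityRecord:
    keys = [key for key, _ in records]
    size = max(len(data) for _, data in records)
    parity = bytearray(size)
    for _, data in records:
        for j, byte in enumerate(data.ljust(size, b"\0")):   # pad short fields
            parity[j] ^= byte                                 # XOR column-wise
    return ParityRecord(rank, keys, bytes(parity))

print(make_parity_record(0, [(101, b"alpha"), (202, b"beta"), (303, b"gamma")]))
```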
Architectural Design of LH*RS
Communication
Use of UDP (for speed):
Individual Insert/Update/Delete/Search Queries
Record Recovery
Service and Control Messages
Use of TCP/IP (better performance & reliability than UDP):
New PB Creation
Large Update Transfer (DB split)
Bucket Recovery
Bucket Architecture
[Diagram: each bucket owns a Multicast Listening Port, a Receive UDP Port, a Send UDP Port and a TCP/IP Port; incoming messages go through message queues to message-processing threads working over a process buffer with free zones; a window of messages waiting for acknowledgement (the Sending Credit) keeps not-yet-acknowledged messages until delivery.]
Architectural Design
Enhancements to the SDDS-2000 [Bennour, 00] [Diène, 01] Bucket Architecture
TCP/IP Connection Handler
TCP/IP connections are passive OPEN (RFC 793 [ISI, 81]); TCP/IP implementation under the Windows 2000 Server O.S. [MacDonald & Barkley, 00]
Ex.: recovery of 1 DB: SDDS-2000 architecture: 6.7 s, new architecture: 2.6 s   60% improvement
(Hardware config.: 733 MHz machines, 100 Mbps network)
Flow Control and Acknowledgement Mgmt.
Principle of "Sending Credit + Message conservation until delivery" [Jacobson, 88] [Diène, 01]
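A hedged sketch of the "Sending Credit + message conservation until delivery" principle (class and parameter names are mine, not the SDDS-2000/LH*RS code): the sender never has more than `credit` unacknowledged messages in flight, and it keeps every message until its acknowledgement arrives, retransmitting on time-out.

```python
# Illustrative sending-credit flow control for an unreliable (UDP) channel.
import time

class CreditSender:
    def __init__(self, credit: int, send, timeout: float = 0.5):
        self.credit = credit            # max messages awaiting acknowledgement
        self.send = send                # send(msg_id, payload) callable
        self.timeout = timeout
        self.pending = {}               # msg_id -> (payload, last_send_time)

    def submit(self, msg_id: int, payload: bytes) -> bool:
        if len(self.pending) >= self.credit:
            return False                # credit exhausted: caller must wait
        self.pending[msg_id] = (payload, time.monotonic())
        self.send(msg_id, payload)
        return True

    def on_ack(self, msg_id: int) -> None:
        self.pending.pop(msg_id, None)  # delivery confirmed: credit released

    def resend_expired(self) -> None:
        now = time.monotonic()
        for msg_id, (payload, t) in list(self.pending.items()):
            if now - t > self.timeout:  # conserved until delivered: retransmit
                self.send(msg_id, payload)
                self.pending[msg_id] = (payload, now)
```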
Architectural Design (Ctnd.)
Dynamic IP@ Structure
Updated when adding new/spare Buckets (PBs/DBs) through a Multicast Probe
[Diagram: the Coordinator's Multicast Component probes the Blank PBs Multicast Group and the Blank DBs Multicast Group; the resulting DBs/PBs addresses replace a pre-defined & static IP@s table.]
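A hedged illustration of how a blank bucket can sit on a multicast group waiting for the Coordinator's probe; the group address, port and message strings below are assumptions, not values from the talk.

```python
# Illustrative sketch: a blank Data Bucket joins the "Blank DBs" multicast
# group and answers the Coordinator's probe (addresses are hypothetical).
import socket
import struct

BLANK_DB_GROUP, PORT = "239.1.2.3", 5000   # hypothetical multicast group/port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))
# Join the multicast group: the bucket now receives probes such as
# "Wanna be a Spare DB?" and can volunteer.
mreq = struct.pack("4sl", socket.inet_aton(BLANK_DB_GROUP), socket.INADDR_ANY)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

probe, coordinator = sock.recvfrom(1024)   # blocks until the Coordinator probes
sock.sendto(b"I would", coordinator)       # volunteer reply, as in the scenario
```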
Hardware Testbed
 5 Machines (Pentium IV: 1.8 GHz, RAM: 512 MB)
 Ethernet Network: max bandwidth of 1 Gbps
 Operating System: Windows 2000 Server
 Tested configuration:
1 Client
A group of 4 Data Buckets
k Parity Buckets, k ∈ {0, 1, 2}
LH*RS
File Creation
File Creation
Client Operation
Propagation of each Insert/Update/Delete of a data record to the Parity Buckets
Data Bucket Split
Splitting Data Bucket  PBs: (records that remain) N deletes from the old rank & N inserts at the new rank; (records that move) N deletes
New Data Bucket  PBs: N inserts (moved records)
All updates are gathered in the same buffer and transferred (TCP/IP) simultaneously to the respective Parity Buckets of the splitting DB & the new DB (see the sketch below).
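A hedged sketch of the parity updates a Data Bucket split generates, following the description above (the helper callables and the record layout are illustrative, not the LH*RS code):

```python
# Sketch of the parity-update buffers produced by a Data Bucket split.
def split_updates(records, stays, new_rank_old, new_rank_new):
    """records: list of (key, data) indexed by their current rank.
    stays(key) -> True if the record remains in the splitting bucket.
    new_rank_old/new_rank_new(key) -> rank after the split in each bucket."""
    to_old_group_pbs, to_new_group_pbs = [], []
    for old_rank, (key, data) in enumerate(records):
        if stays(key):
            # Remaining record: delete at its old rank, insert at its new rank.
            to_old_group_pbs.append(("delete", old_rank, key))
            to_old_group_pbs.append(("insert", new_rank_old(key), key, data))
        else:
            # Moving record: delete from the splitting DB, insert into the new DB.
            to_old_group_pbs.append(("delete", old_rank, key))
            to_new_group_pbs.append(("insert", new_rank_new(key), key, data))
    # Each buffer is then transferred over TCP/IP to the Parity Buckets of the
    # splitting DB's group and of the new DB's group, respectively.
    return to_old_group_pbs, to_new_group_pbs
```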
File Creation Perf.
Experimental Set-up
File of 25,000 data records; 1 data record = 104 B
Client Sending Credit = 1
Client Sending Credit = 5
[Chart (PB Overhead): File Creation Time (sec) vs. number of inserted keys (0 to 25,000). Final times: k = 0: 7.896 s, k = 1: 9.990 s, k = 2: 10.963 s.]
k = 0 to k = 1   Perf. Degradation of 20%
k = 1 to k = 2   Perf. Degradation of 8%
File Creation Perf.
Experimental Set-up
File of 25,000 data records; 1 data record = 104 B
Client Sending Credit = 1
Client Sending Credit = 5
[Chart (PB Overhead): File Creation Time (sec) vs. number of inserted keys (0 to 25,000). Final times: k = 0: 4.349 s, k = 1: 6.940 s, k = 2: 7.720 s.]
k = 0 to k = 1   Perf. Degradation of 37%
k = 1 to k = 2   Perf. Degradation of 10%
LH*RS
Parity Bucket Creation
PB Creation Scenario
Searching for a new PB
[The Coordinator multicasts "Wanna join group g?" [Sender IP@ + Entity#, Your Entity#] to the PBs connected to the Blank PBs Multicast Group.]
PB Creation Scenario
Waiting for Replies
[Each candidate PB replies "I would", starts UDP listening, TCP listening and its working threads, then waits for confirmation; if the time-out elapses, it cancels everything.]
PB Creation Scenario
PB Selection
[The Coordinator sends "You are Selected" <UDP> to the chosen PB, which disconnects from the Blank PBs Multicast Group; the other candidates receive a Cancellation.]
PB Creation Scenario
Auto-creation – Query phase
[The new PB asks each Data Bucket of its group: "Send me your contents!" <UDP>.]
PB Creation Scenario
Auto-creation – Encoding phase
[Each Data Bucket of the group sends the requested buffer <TCP> to the new PB, which encodes it into parity records.]
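A hedged sketch of the encoding phase: the new Parity Bucket combines, symbol by symbol, the buffers received from the group's Data Buckets, multiplying each one by its Reed-Solomon coefficient. GF(2^8) with polynomial 0x11D is used here for brevity (LH*RS works in GF(2^16)), and the coefficients are illustrative; a coefficient column of all 1s degenerates into plain XOR, as exploited by the first Parity Bucket.

```python
# Illustrative Reed-Solomon-style encoding of one parity buffer (not LH*RS code).
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) with the primitive polynomial x^8+x^4+x^3+x^2+1."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return p

def encode_parity(buffers: list[bytes], coeffs: list[int]) -> bytes:
    """Parity buffer = sum over data buckets of coeff_j * buffer_j in GF(2^8)."""
    parity = bytearray(len(buffers[0]))
    for buf, c in zip(buffers, coeffs):
        for i, byte in enumerate(buf):
            parity[i] ^= gf_mul(c, byte)
    return bytes(parity)

# With coefficients all equal to 1, this is exactly the XOR encoding.
print(encode_parity([b"abcd", b"efgh", b"ijkl", b"mnop"], [1, 1, 1, 1]).hex())
```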
PB Creation Perf.
Experimental Set-up
Bucket Size: 5,000 .. 50,000 records; Bucket Contents = 0.625 * Bucket Size
File Size: 2.5 * Bucket Size records
XOR Encoding:
Bucket Size | Total Time (sec) | Processing Time (sec) | Communication Time (sec) | Encoding Rate (MB/sec)
5000 | 0.190 | 0.140 | 0.029 | 0.608
10000 | 0.429 | 0.304 | 0.066 | 0.686
25000 | 1.007 | 0.738 | 0.144 | 0.640
50000 | 2.062 | 1.484 | 0.322 | 0.659
As Bucket Size grows, PT ≈ 74% of TT
PB Creation Perf.
Experimental Set-up
Bucket Size: 5,000 .. 50,000 records; Bucket Contents = 0.625 * Bucket Size
File Size: 2.5 * Bucket Size records
RS Encoding:
Bucket Size | Total Time (sec) | Processing Time (sec) | Communication Time (sec) | Encoding Rate (MB/sec)
5000 | 0.193 | 0.149 | 0.035 | 0.618
10000 | 0.446 | 0.328 | 0.059 | 0.713
25000 | 1.053 | 0.766 | 0.153 | 0.674
50000 | 2.103 | 1.531 | 0.322 | 0.673
As Bucket Size grows, PT ≈ 74% of TT
PB Creation Perf.
Comparison:
For Bucket Size = 50000
XOR Encoding Rate: 0.66 MB/sec
RS Encoding Rate: 0.673 MB/sec
XOR provides a performance gain of 5% in Processing Time (0.02% in the Total Time)
LH*RS
Bucket Recovery
Buckets’ Recovery
Failure Detection
[The Coordinator sends "Are You Alive?" <UDP> to the Parity Buckets and Data Buckets of the group; in this scenario, two buckets have failed.]
Buckets’ Recovery
Waiting for Replies…
[The alive Parity and Data Buckets reply "I am Alive" <UDP> to the Coordinator; the failed buckets do not answer.]
Buckets’ Recovery
Searching for 2 Spare DBs…
[The Coordinator multicasts "Wanna be a Spare DB?" [Sender IP@, Your Entity#] to the DBs connected to the Blank DBs Multicast Group.]
Buckets’ Recovery
Waiting for Replies …
[Each candidate DB replies "I would", starts UDP listening, TCP listening and its working threads, then waits for confirmation; if the time-out elapses, it cancels everything.]
Buckets’ Recovery
Spare DBs Selection
[The Coordinator sends "You are Selected" <UDP> to the chosen spare DBs, which disconnect from the Blank DBs Multicast Group; the other candidates receive a Cancellation.]
Buckets’ Recovery
Recovery Manager Determination
[The Coordinator sends "Recover Buckets" [Spares IP@s + Entity#s; …] to one of the Parity Buckets, which becomes the Recovery Manager.]
Buckets’ Recovery
Query Phase
[The Recovery Manager asks the alive Data and Parity Buckets participating in the recovery: "Send me Records of rank in [r, r+slice-1]" <UDP>.]
Buckets’ Recovery
Reconstruction Phase
[The alive Data and Parity Buckets participating in the recovery send the requested buffers <TCP> to the Recovery Manager; its decoding process reconstructs the lost records, which are sent <TCP> to the Spare DBs as Recovered Records.]
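A hedged sketch of the Recovery Manager's slice-by-slice loop. The decode step is shown only for the simple XOR case of one failed Data Bucket; the actual LH*RS decoding inverts a Reed-Solomon sub-matrix to recover several buckets at once. The RPC helpers are hypothetical.

```python
# Illustrative slice-based bucket recovery loop (not the LH*RS implementation).
def xor_decode_slice(surviving_slices: list[list[bytes]]) -> list[bytes]:
    """Recover one bucket's records for a slice: rank-wise XOR of the
    surviving Data Buckets and the XOR parity bucket."""
    recovered = []
    for records_at_rank in zip(*surviving_slices):
        size = max(len(r) for r in records_at_rank)
        rec = bytearray(size)
        for r in records_at_rank:
            for i, byte in enumerate(r.ljust(size, b"\0")):
                rec[i] ^= byte
        recovered.append(bytes(rec))
    return recovered

def recover_bucket(bucket_size: int, slice_size: int, request_slice, send_to_spare):
    """request_slice(r, count) -> list of per-bucket slices (hypothetical RPC);
    send_to_spare(r, records) ships recovered records to the spare (hypothetical)."""
    for r in range(0, bucket_size, slice_size):
        count = min(slice_size, bucket_size - r)
        # "Send me records of rank in [r, r+slice-1]"
        surviving = request_slice(r, count)
        send_to_spare(r, xor_decode_slice(surviving))
```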
DBs Recovery Perf.
Experimental Set-up
File: 125,000 recs; Bucket: 31,250 recs (3.125 MB)
XOR Decoding:
Slice | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
1250 | 0.750 | 0.291 | 0.433
3125 | 0.693 | 0.249 | 0.372
6250 | 0.667 | 0.260 | 0.360
15625 | 0.755 | 0.255 | 0.458
31250 | 0.734 | 0.271 | 0.448
 TT ≈ 0.72 sec
Slice (from 4% to 100% of Bucket contents)   TT doesn't vary a lot
DBs Recovery Perf.
Experimental Set-up
File: 125,000 recs; Bucket: 31,250 recs (3.125 MB)
RS Decoding:
Slice | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
1250 | 0.870 | 0.390 | 0.443
3125 | 0.867 | 0.375 | 0.375
6250 | 0.828 | 0.385 | 0.303
15625 | 0.854 | 0.375 | 0.433
31250 | 0.854 | 0.375 | 0.448
 TT ≈ 0.85 sec
Slice (from 4% to 100% of Bucket contents)   TT doesn't vary a lot
DBs Recovery Perf.
Experimental Set-up
File: 125,000 recs; Bucket: 31,250 recs (3.125 MB)
Comparison:
1 DB Recovery Time – XOR: 0.720 sec
1 DB Recovery Time – RS: 0.855 sec
XOR provides a performance gain of 15% in Total Time
DBs Recovery Perf.
Experimental Set-up
File: 125,000 recs; Bucket: 31,250 recs (3.125 MB)
Recover 2 DBs:
Slice | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
1250 | 1.234 | 0.590 | 0.519
3125 | 1.172 | 0.599 | 0.400
6250 | 1.172 | 0.598 | 0.365
15625 | 1.146 | 0.609 | 0.443
31250 | 1.088 | 0.599 | 0.442
 TT ≈ 1.2 sec
Slice (from 4% to 100% of Bucket contents)   TT doesn't vary a lot
DBs Recovery Perf.
Experimental Set-up
File: 125,000 recs; Bucket: 31,250 recs (3.125 MB)
Recover 3 DBs:
Slice | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
1250 | 1.589 | 0.922 | 0.522
3125 | 1.599 | 0.928 | 0.383
6250 | 1.541 | 0.907 | 0.401
15625 | 1.578 | 0.891 | 0.520
31250 | 1.468 | 0.906 | 0.495
 TT ≈ 1.6 sec
Slice (from 4% to 100% of Bucket contents)   TT doesn't vary a lot
Perf. Summary of Bucket Recovery
1 DB (3.125 MB) in 0.7 sec (XOR)   4.46 MB/sec
1 DB (3.125 MB) in 0.85 sec (RS)   3.65 MB/sec
2 DBs (6.250 MB) in 1.2 sec (RS)   5.21 MB/sec
3 DBs (9.375 MB) in 1.6 sec (RS)   5.86 MB/sec
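These throughput figures are simply the recovered volume divided by the recovery time:

```latex
\frac{3.125\,\mathrm{MB}}{0.7\,\mathrm{s}} \approx 4.46\,\mathrm{MB/s},\quad
\frac{3.125}{0.85} \approx 3.65,\quad
\frac{6.25}{1.2} \approx 5.21,\quad
\frac{9.375}{1.6} \approx 5.86\,\mathrm{MB/s}
```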
Conclusion
The conducted experiments show that:
Encoding/Decoding Optimization and the Enhanced Bucket Architecture
 Impact on performance
Good Recovery Performance
Finally, we improved the processing time of the RS decoding process by 4% to 8%
1 DB is recovered in half a second
Conclusion
LH*RS
Mature Implementation
Many Optimization Iterations
Only SDDS with Scalable Availability
Future Work
Better Parity Update Propagation Strategy to PBs
Investigation of faster Encoding/Decoding processes
References
[Patterson et al., 88] D. A. Patterson, G. Gibson & R. H. Katz, A Case for Redundant Arrays of Inexpensive Disks, Proc. of ACM SIGMOD Conf., pp. 109-116, June 1988.
[ISI, 81] Information Sciences Institute, RFC 793: Transmission Control Protocol (TCP) – Specification, Sept. 1981, http://www.faqs.org/rfcs/rfc793.html
[MacDonald & Barkley, 00] D. MacDonald, W. Barkley, MS Windows 2000 TCP/IP Implementation Details, http://secinf.net/info/nt/2000ip/tcpipimp.html
[Jacobson, 88] V. Jacobson, M. J. Karels, Congestion Avoidance and Control, Computer Communication Review, Vol. 18, No. 4, pp. 314-329.
[Xu et al., 99] L. Xu & J. Bruck, X-Code: MDS Array Codes with Optimal Encoding, IEEE Trans. on Information Theory, 45(1), pp. 272-276, 1999.
[Corbett et al., 04] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong, S. Sankar, Row-Diagonal Parity for Double Disk Failure Correction, Proc. of the 3rd USENIX Conf. on File and Storage Technologies, April 2004.
[Rabin, 89] M. O. Rabin, Efficient Dispersal of Information for Security, Load Balancing and Fault Tolerance, Journal of the ACM, Vol. 36, No. 2, April 1989, pp. 335-348.
[White, 91] P. E. White, RAID X tackles design problems with existing design RAID schemes, ECC Technologies, ftp://members.aol.com.mnecctek.ctr1991.pdf
[Blomer et al., 95] J. Blomer, M. Kalfane, R. Karp, M. Karpinski, M. Luby & D. Zuckerman, An XOR-Based Erasure-Resilient Coding Scheme, ICSI Tech. Rep. TR-95-048, 1995.
References (Ctnd.)
[Litwin & Schwarz, 00] W. Litwin & T. Schwarz, LH*RS: A High-Availability Scalable Distributed Data Structure using Reed Solomon Codes, Proc. of ACM SIGMOD 2000, pp. 237-248.
[Karlsson et al., 96] J. Karlsson, W. Litwin & T. Risch, LH*LH: A Scalable High Performance Data Structure for Switched Multicomputers, EDBT 96, Springer Verlag.
[Reed & Solomon, 60] I. Reed & G. Solomon, Polynomial Codes over Certain Finite Fields, Journal of the Society for Industrial and Applied Mathematics, 1960.
[Plank, 97] J. S. Plank, A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems, Software – Practice & Experience, 27(9), Sept. 1997, pp. 995-1012.
[Diène, 01] A. W. Diène, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD Thesis, Nov. 2001, Université Paris Dauphine.
[Bennour, 00] F. Sahli Bennour, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD Thesis, June 2000, Université Paris Dauphine.
[Moussa] http://ceria.dauphine.fr/rim/rim.html
More references: http://ceria.dauphine.fr/rim/biblio.pdf
End
Parity Calculus
Galois Field
GF[2^8]  1 symbol is 1 byte || GF[2^16]  1 symbol is 2 bytes
(+) GF[2^16] vs. GF[2^8] halves the number of symbols, and consequently the number of operations in the field
(-) Larger multiplication table sizes
New Generator Matrix
1st column of '1's  the 1st parity bucket executes XOR calculus instead of RS calculus  encoding performance gain of 20%
1st line of '1's  each PB executes XOR calculus for any update from the 1st DB of any group  performance gain of 4% measured for PB creation
Encoding & Decoding Hints
Encoding: log pre-calculus of the P matrix coefficients  improvement of 3.5%
Decoding: log pre-calculus of the H^-1 matrix coefficients and the b vector for multiple-bucket recovery  improvement from 4% to 8%
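A hedged sketch of the "log pre-calculus" idea: with log/antilog tables for the field, each multiplication by a fixed parity-matrix coefficient becomes one addition plus a table lookup. GF(2^8) with the primitive polynomial 0x11D is shown for brevity; LH*RS uses GF(2^16), where the same construction applies with larger tables.

```python
# Illustrative GF(2^8) log/antilog tables (polynomial 0x11D, generator x = 2).
EXP, LOG = [0] * 512, [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11D              # reduce modulo x^8 + x^4 + x^3 + x^2 + 1
for i in range(255, 512):
    EXP[i] = EXP[i - 255]       # doubled table avoids a modulo in gf_mul

def gf_mul(a: int, b: int) -> int:
    """Field multiplication as one table lookup: a*b = exp(log a + log b)."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

# Pre-computing LOG of the fixed parity-matrix coefficients turns each
# per-symbol multiplication into a single addition plus a table lookup.
assert gf_mul(0x57, 0x13) == 0xE0   # example product in GF(2^8) with 0x11D
```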