Transcript WDAS 2004
Workshop in Distributed Data & Structures, July 2004

Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Rim Moussa, [email protected], http://ceria.dauphine.fr/rim/rim.html
Thomas J.E. Schwarz, [email protected], http://www.cse.scu.edu/~tschwarz/homepage/thomas_schwarz.html

Objective
The LH*RS design, its implementation, and performance measurements. The factors of interest are parity overhead and recovery performance.

Overview
1. Motivation
2. Highly-available schemes
3. LH*RS
4. Architectural design
5. Hardware testbed
6. File creation: scenario description
7. High-availability performance results
8. Recovery
9. Conclusion
10. Future work

Motivation
Information volume grows by about 30% per year, disk access and CPUs are bottlenecks, and failures are frequent and costly:

Business Operation                    Industry        Average Hourly Financial Impact
Brokerage (retail) operations         Financial       $6.45 million
Credit card sales authorization       Financial       $2.6 million
Airline reservation centers           Transportation  $89,500
Cellular (new) service activation     Communication   $41,000
Source: Contingency Planning Research, 1996.

Requirements
We need highly available networked data storage systems: scalability, high throughput, and high availability.

Scalable & Distributed Data Structure (SDDS)
Dynamic file growth. [Diagram: clients send inserts; the coordinator triggers bucket splits; records are transferred over the network between data buckets (DBs).]

SDDS (Cont'd)
No centralized directory access. [Diagram: a client's query may reach a wrong data bucket, which forwards it; the client is then corrected by an Image Adjustment Message.]

Solutions towards High Availability
Data replication:
(+) good response time, since mirrors can be queried;
(-) high storage cost (a factor of n for n replicas).
Parity calculus. Erasure-resilient codes are evaluated with respect to: coding rate (parity volume / data volume), update penalty, group size used for data reconstruction, and complexity of coding & decoding.

Fault-Tolerant Schemes
One server failure: simple XOR parity calculus, as in RAID systems [Patterson et al., 88] and the SDDS LH*g [Litwin et al., 96].
More than one server failure:
- Binary linear codes [Hellerstein et al., 94] and array codes (EVENODD [Blaum et al., 94], X-code [Xu et al., 99], the RDP scheme [Corbett et al., 04]) tolerate just 2 failures.
- Reed-Solomon codes (IDA [Rabin, 89], RAID X [White, 91], FEC [Blomer et al., 95], tutorial [Plank, 97], LH*RS [Litwin & Schwarz, 00], ...) tolerate a large number of failures.

A Highly Available & Distributed Data Structure: LH*RS
[Litwin & Schwarz, 00] [Litwin, Moussa & Schwarz, sub.]
LH*RS is an SDDS whose data distribution scheme is based on linear hashing (LH*LH [Karlsson et al., 96], applied to the key field), giving scalability and high throughput, combined with parity calculus using Reed-Solomon codes [Reed & Solomon, 60] for high availability.

LH*RS File Structure
Data buckets store the records: a key and a data field. Each key inserted into a bucket group is assigned a rank r (0, 1, 2, ...). The group's parity buckets store, for each rank, the list of keys having that rank together with a parity field computed over the corresponding data records.
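To make the addressing concrete, below is a minimal Python sketch of the classic linear-hashing address computation that LH*-style schemes apply to the key field. The initial bucket count N0, the function name, and the handling of the client image are illustrative assumptions, not the actual LH*RS code.

    # Minimal sketch (assumed names) of the classic LH address rule applied
    # to the record key, with h_i(key) = key mod (N0 * 2^i).
    N0 = 1  # initial number of data buckets (assumption)

    def lh_address(key: int, level: int, split: int) -> int:
        """Map a key to a data-bucket number for file state (level, split)."""
        a = key % (N0 * 2 ** level)            # h_level(key)
        if a < split:                          # bucket a has already split,
            a = key % (N0 * 2 ** (level + 1))  # so use h_{level+1}(key)
        return a

A client evaluates this rule against its own, possibly outdated, image of (level, split); a wrongly addressed bucket forwards the query, and the client is corrected by an Image Adjustment Message, as in the SDDS slides above.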
Architectural Design of LH*RS
Communication:
- UDP is used for individual insert/update/delete/search queries, record recovery, and service and control messages.
- TCP/IP is used for new PB creation, large update transfers (DB splits), and bucket recovery: for speed, it gives better performance and reliability than UDP.

Bucket Architecture
[Diagram: each bucket listens on a multicast port, a TCP/IP port, and send/receive UDP ports. Incoming messages pass through message queues and a process buffer with free zones, consumed by message-processing threads. A window holds the messages waiting for acknowledgement: the sending credit bounds the number of not-yet-acknowledged messages.]

Architectural Design: Enhancements to SDDS2000 [B00, D01]
Bucket architecture, TCP/IP connection handler: TCP/IP connections use the passive OPEN of RFC 793 [ISI, 81], with the TCP/IP implementation under the Windows 2000 Server OS [McDonald & Barkley, 00].
Example, recovery of 1 DB: SDDS2000 architecture, 6.7 s; new architecture, 2.6 s; an improvement of 60% (hardware configuration: 733 MHz machines, 100 Mbps network).
Flow control and acknowledgement management follow the principle of "sending credit + message conservation until delivery" [Jacobson, 88] [Diène, 01], sketched below.
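A minimal Python sketch of that principle, under assumed names and message framing: at most `credit` messages may be outstanding, and every sent message is conserved in a window until its acknowledgement arrives so that it can be resent. The real SDDS2000-based sender's timers, message formats, and threading differ.

    # Sketch of "sending credit + message conservation until delivery":
    # at most `credit` unacknowledged messages; each one is kept in the
    # window until its ack arrives, and resent on timeout.
    from collections import OrderedDict

    class CreditSender:
        def __init__(self, sock, dest, credit=5):
            self.sock, self.dest = sock, dest
            self.credit = credit            # max outstanding messages
            self.window = OrderedDict()     # msg_id -> payload, until ack'ed
            self.next_id = 0

        def try_send(self, payload: bytes) -> bool:
            if len(self.window) >= self.credit:
                return False                # no credit left, caller must wait
            msg_id, self.next_id = self.next_id, self.next_id + 1
            self.window[msg_id] = payload   # conserve until delivery confirmed
            self.sock.sendto(msg_id.to_bytes(4, "big") + payload, self.dest)
            return True

        def on_ack(self, msg_id: int) -> None:
            self.window.pop(msg_id, None)   # frees one unit of credit

        def on_timeout(self, msg_id: int) -> None:
            if msg_id in self.window:       # still undelivered: resend
                self.sock.sendto(msg_id.to_bytes(4, "big") + self.window[msg_id],
                                 self.dest)

Raising the credit lets more messages overlap in the network, which is exactly what the file-creation experiments below vary (sending credit 1 versus 5).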
Architectural Design (Cont'd)
Dynamic IP address structure, updated when adding new/spare buckets (PBs/DBs) through a multicast probe; it replaces a pre-defined and static table of IP addresses. The multicast component maintains a blank-PBs multicast group and a blank-DBs multicast group, from which the coordinator recruits new PBs and DBs.

Hardware Testbed
5 machines (Pentium IV, 1.8 GHz, 512 MB RAM); Ethernet network with a maximum bandwidth of 1 Gbps; operating system: Windows 2000 Server. Tested configuration: 1 client, a group of 4 data buckets, and k parity buckets, k ∈ {0, 1, 2}.

LH*RS File Creation
Client operation: each insert/update/delete of a data record is propagated to the parity buckets.
Data bucket split: for the records that remain, the splitting DB sends its PBs N deletes at the old rank and N inserts at the new rank; for the records that move, it sends N deletes. The new DB sends its PBs N inserts (the moved records). All updates are gathered in the same buffer and transferred (TCP/IP) simultaneously to the respective parity buckets of the splitting DB and of the new DB.

File Creation Performance
Experimental set-up: file of 25,000 data records; 1 data record = 104 B; client sending credit of 1 or 5.
With a sending credit of 1, file creation time grows with k: 7.896 s (k = 0), 9.990 s (k = 1), 10.963 s (k = 2); a performance degradation of 20% from k = 0 to k = 1 and of 8% from k = 1 to k = 2.
With a sending credit of 5: 4.349 s (k = 0), 6.940 s (k = 1), 7.720 s (k = 2); a degradation of 37% from k = 0 to k = 1 and of 10% from k = 1 to k = 2.

LH*RS Parity Bucket Creation: PB Creation Scenario
1. Searching for a new PB: the coordinator multicasts "Wanna join group g?" [sender IP address + entity #, your entity #] to the PBs connected to the blank-PBs multicast group.
2. Waiting for replies: candidate PBs answer "I would"; each starts UDP listening, TCP listening, and its working threads, then waits for confirmation; if the time-out elapses, it cancels everything.
3. PB selection: the coordinator sends "You are selected" (UDP) to one candidate and cancellations to the others; the selected PB disconnects from the blank-PBs multicast group.
4. Auto-creation, query phase: the new PB asks the data bucket group, "Send me your contents!" (UDP).
5. Auto-creation, encoding phase: each DB of the group ships the requested buffer (TCP) to the new PB, which encodes it, as sketched below.
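The encoding step is easiest to illustrate for the first parity bucket, which, thanks to the generator matrix's first column of '1's (see the parity-calculus appendix at the end), computes a plain XOR across the group. A minimal Python sketch with an assumed in-memory layout and records padded to a common length:

    # Sketch of XOR encoding for the first parity bucket: the parity record
    # of rank r is the byte-wise XOR of the rank-r records of all data
    # buckets in the group (records padded to equal length).
    def xor_encode_group(buckets: list[list[bytes]]) -> list[bytes]:
        """buckets[i][r] is the padded record of rank r in data bucket i."""
        parity = []
        for records in zip(*buckets):       # all records sharing one rank r
            acc = bytearray(len(records[0]))
            for rec in records:
                for j, b in enumerate(rec):
                    acc[j] ^= b             # byte-wise XOR
            parity.append(bytes(acc))
        return parity

The remaining parity buckets compute Reed-Solomon parity symbols over a Galois field instead of a plain XOR; the appendix sketches the table-driven field arithmetic involved.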
PB Creation Performance
Experimental set-up: bucket size from 5,000 to 50,000 records; bucket contents = 0.625 × bucket size; file size = 2.5 × bucket size records.

XOR encoding:
Bucket Size   Total Time (s)   Processing Time (s)   Communication Time (s)   Encoding Rate (MB/s)
 5,000        0.190            0.140                 0.029                    0.608
10,000        0.429            0.304                 0.066                    0.686
25,000        1.007            0.738                 0.144                    0.640
50,000        2.062            1.484                 0.322                    0.659
Whatever the bucket size, processing time is about 74% of total time.

RS encoding:
Bucket Size   Total Time (s)   Processing Time (s)   Communication Time (s)   Encoding Rate (MB/s)
 5,000        0.193            0.149                 0.035                    0.618
10,000        0.446            0.328                 0.059                    0.713
25,000        1.053            0.766                 0.153                    0.674
50,000        2.103            1.531                 0.322                    0.673
Here too, processing time is about 74% of total time.

Comparison, for bucket size = 50,000: XOR encoding rate of 0.66 MB/s versus RS encoding rate of 0.673 MB/s; XOR provides a performance gain of 5% in processing time (0.02% in total time).

LH*RS Bucket Recovery: Buckets' Recovery Scenario
1. Failure detection: the coordinator polls the parity and data buckets, "Are you alive?" (UDP).
2. Waiting for replies: surviving buckets answer "I am alive" (UDP).
3. Searching for spare DBs (here, 2): the coordinator multicasts "Wanna be a spare DB?" [sender IP address, your entity #] to the DBs connected to the blank-DBs multicast group.
4. Waiting for replies: candidate DBs answer "I would"; each starts UDP listening, TCP listening, and its working threads, then waits for confirmation; if the time-out elapses, it cancels everything.
5. Spare DB selection: the coordinator sends "You are selected" (UDP) to the chosen candidates and cancellations to the others; the selected spares disconnect from the blank-DBs multicast group.
6. Recovery manager determination: the coordinator sends "Recover buckets [spares' IP addresses + entity #s; ...]" to one of the parity buckets, which becomes the recovery manager.
7. Query phase: the recovery manager asks the alive buckets participating in the recovery, "Send me records of rank in [r, r + slice - 1]" (UDP).
8. Reconstruction phase: the alive parity and data buckets ship the requested buffers (TCP) to the recovery manager, which runs the decoding process and sends the recovered records (TCP) to the spare DBs. (A sketch of the single-failure decoding step follows.)
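When a single data bucket is lost and the XOR parity bucket survives, the decoding step reduces to XORing, rank by rank over the requested slice, the parity record with the surviving data records. A minimal Python sketch with assumed names; recovering several buckets instead inverts a submatrix H of the Reed-Solomon generator (the H⁻¹ of the appendix), which is not shown:

    # Sketch of single-DB reconstruction with the XOR parity bucket: for each
    # rank r in the slice [r0, r0+slice_len-1], the missing record is the XOR
    # of the parity record and the surviving data records of the same rank.
    def xor_recover_slice(parity: list[bytes], alive: list[list[bytes]],
                          r0: int, slice_len: int) -> list[bytes]:
        recovered = []
        for r in range(r0, r0 + slice_len):
            acc = bytearray(parity[r])
            for bucket in alive:            # surviving data buckets
                for j, b in enumerate(bucket[r]):
                    acc[j] ^= b             # cancel out the known records
            recovered.append(bytes(acc))    # what the failed DB held at rank r
        return recovered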
DB Recovery Performance
Experimental set-up: file of 125,000 records; bucket of 31,250 records, i.e. 3.125 MB; slice varied from 4% to 100% of the bucket contents.

XOR decoding, 1 DB:
Slice    Total Time (s)   Processing Time (s)   Communication Time (s)
 1,250   0.750            0.291                 0.433
 3,125   0.693            0.249                 0.372
 6,250   0.667            0.260                 0.360
15,625   0.755            0.255                 0.458
31,250   0.734            0.271                 0.448
Total time varies little with the slice, staying around 0.72 s.

RS decoding, 1 DB:
Slice    Total Time (s)   Processing Time (s)   Communication Time (s)
 1,250   0.870            0.390                 0.443
 3,125   0.867            0.375                 0.375
 6,250   0.828            0.385                 0.303
15,625   0.854            0.375                 0.433
31,250   0.854            0.375                 0.448
Total time varies little with the slice, staying around 0.85 s.

Comparison: recovery time for 1 DB is 0.720 s with XOR decoding versus 0.855 s with RS decoding; XOR provides a performance gain of 15% in total time.

Recovering 2 DBs (RS):
Slice    Total Time (s)   Processing Time (s)   Communication Time (s)
 1,250   1.234            0.590                 0.519
 3,125   1.172            0.599                 0.400
 6,250   1.172            0.598                 0.365
15,625   1.146            0.609                 0.443
31,250   1.088            0.599                 0.442
Total time varies little with the slice, staying around 1.2 s.

Recovering 3 DBs (RS):
Slice    Total Time (s)   Processing Time (s)   Communication Time (s)
 1,250   1.589            0.922                 0.522
 3,125   1.599            0.928                 0.383
 6,250   1.541            0.907                 0.401
15,625   1.578            0.891                 0.520
31,250   1.468            0.906                 0.495
Total time varies little with the slice, staying around 1.6 s.

Performance Summary of Bucket Recovery
1 DB (3.125 MB) in 0.7 s (XOR): 4.46 MB/s
1 DB (3.125 MB) in 0.85 s (RS): 3.65 MB/s
2 DBs (6.250 MB) in 1.2 s (RS): 5.21 MB/s
3 DBs (9.375 MB) in 1.6 s (RS): 5.86 MB/s

Conclusion
The conducted experiments show the impact on performance of the encoding/decoding optimizations and of the enhanced bucket architecture, and they demonstrate good recovery performance. Finally, we improved the processing time of the RS decoding process by 4% to 8%; one DB is recovered in well under a second.

Conclusion (Cont'd)
LH*RS is a mature implementation, refined through many optimization iterations, and the only SDDS with scalable availability.

Future Work
A better strategy for propagating parity updates to the PBs, and investigation of faster encoding/decoding processes.

References
[Patterson et al., 88] D. A. Patterson, G. Gibson & R. H. Katz, A Case for Redundant Arrays of Inexpensive Disks (RAID), Proc. of the ACM SIGMOD Conf., pp. 109-116, June 1988.
[ISI, 81] Information Sciences Institute, RFC 793: Transmission Control Protocol (TCP) Specification, Sept. 1981, http://www.faqs.org/rfcs/rfc793.html
[McDonald & Barkley, 00] D. MacDonald & W. Barkley, MS Windows 2000 TCP/IP Implementation Details, http://secinf.net/info/nt/2000ip/tcpipimp.html
[Jacobson, 88] V. Jacobson & M. J. Karels, Congestion Avoidance and Control, Computer Communication Review, Vol. 18, No. 4, pp. 314-329, 1988.
[Xu et al., 99] L. Xu & J. Bruck, X-Code: MDS Array Codes with Optimal Encoding, IEEE Trans. on Information Theory, 45(1), pp. 272-276, 1999.
[Corbett et al., 04] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong & S. Sankar, Row-Diagonal Parity for Double Disk Failure Correction, Proc. of the 3rd USENIX Conf. on File and Storage Technologies, April 2004.
[Rabin, 89] M. O. Rabin, Efficient Dispersal of Information for Security, Load Balancing and Fault Tolerance, Journal of the ACM, Vol. 36, No. 2, April 1989, pp. 335-348.
[White, 91] P. E. White, RAID X tackles design problems with existing design RAID schemes, ECC Technologies, ftp://members.aol.com.mnecctek.ctr1991.pdf
[Blomer et al., 95] J. Blomer, M. Kalfane, R. Karp, M. Karpinski, M. Luby & D. Zuckerman, An XOR-Based Erasure-Resilient Coding Scheme, ICSI Tech. Rep. TR-95-048, 1995.
[Litwin & Schwarz, 00] W. Litwin & T. Schwarz, LH*RS: A High-Availability Scalable Distributed Data Structure using Reed Solomon Codes, Proc. of the ACM SIGMOD 2000, pp. 237-248.
[Karlsson et al., 96] J. Karlsson, W. Litwin & T. Risch, LH*LH: A Scalable High Performance Data Structure for Switched Multicomputers, EDBT 96, Springer Verlag.
[Reed & Solomon, 60] I. Reed & G. Solomon, Polynomial Codes over Certain Finite Fields, Journal of the Society for Industrial and Applied Mathematics, 1960.
[Plank, 97] J. S. Plank, A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems, Software: Practice & Experience, 27(9), Sept. 1997, pp. 995-1012.
[Diène, 01] A. W. Diène, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD thesis, Université Paris Dauphine, Nov. 2001.
[Bennour, 00] F. Sahli Bennour, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD thesis, Université Paris Dauphine, June 2000.
[Moussa] http://ceria.dauphine.fr/rim/rim.html
More references: http://ceria.dauphine.fr/rim/biblio.pdf

End

Parity Calculus
Galois field: in GF(2^8) one symbol is 1 byte; in GF(2^16) one symbol is 2 bytes. GF(2^16) versus GF(2^8): (+) it halves the number of symbols, and consequently the number of operations in the field; (-) the multiplication tables are larger.
New generator matrix:
- A first column of '1's: the first parity bucket executes XOR calculus instead of RS calculus, a performance gain in encoding of 20%.
- A first line of '1's: each PB executes XOR calculus for any update coming from the first DB of any group, a gain of 4% measured for PB creation.
Encoding and decoding hints:
- Encoding: log pre-calculus of the P matrix coefficients, an improvement of 3.5%.
- Decoding: log pre-calculus of the H⁻¹ matrix coefficients and of the b vector for multiple-bucket recovery, an improvement from 4% to 8%. (The log-table arithmetic is sketched below.)
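To illustrate the log pre-calculus, here is a minimal Python sketch of GF(2^8) multiplication with log/antilog tables. The generator polynomial 0x11D is a common choice and an assumption here; the slides do not state which polynomial LH*RS uses.

    # Log/antilog tables for GF(2^8) with generator polynomial 0x11D
    # (x^8 + x^4 + x^3 + x^2 + 1, assumed). Multiplication then costs two
    # log lookups, one integer addition, and one antilog lookup.
    GF_POLY = 0x11D
    EXP = [0] * 512          # antilog table, doubled to avoid a mod 255
    LOG = [0] * 256

    x = 1
    for i in range(255):
        EXP[i] = EXP[i + 255] = x
        LOG[x] = i
        x <<= 1
        if x & 0x100:        # reduce modulo the generator polynomial
            x ^= GF_POLY

    def gf_mul(a: int, b: int) -> int:
        if a == 0 or b == 0:
            return 0
        return EXP[LOG[a] + LOG[b]]   # index <= 508, within the doubled table

Pre-computing LOG[p] once for every coefficient p of the parity matrix P, and likewise for H⁻¹ and the b vector during recovery, removes one log lookup per multiplication, which is consistent with the few-percent improvements reported above.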