HBase MTTR, Stripe Compaction and Hoya
Ted Yu ([email protected])

About myself
• Been working on HBase for 3 years
• Became Committer & PMC member June 2011

Outline
• Overview of HBase recovery
• HDFS issues
• Stripe compaction
• HBase-on-YARN
• Q&A

We're in a distributed system
• It is hard to distinguish a slow server from a dead server
• Everything, or nearly everything, is based on timeouts
• Smaller timeouts mean more false positives
• HBase works well with false positives, but they always have a cost
• The lower the timeouts, the better

HBase components for recovery

Recovery in action

Recovery process
• Failure detection: ZooKeeper heartbeats the servers and expires the session when a server does not reply
• Region assignment: the master reallocates the regions to the other servers
• Failure recovery: read the WAL and rewrite the data again
• The client drops the connection to the dead server and goes to the new one
[Diagram: recovery steps and the components involved – ZK heartbeat (Master, RS, ZK), region assignment, data recovery (Region Servers, DataNodes), and the client]

Failure detection
• Failure detection
  – Set the ZooKeeper timeout to 30s instead of the old 180s default
  – Beware of the GC, but lower values are possible
  – ZooKeeper detects errors sooner than the configured timeout
• 0.96
  – HBase scripts clean the ZK node when the server is kill -9'ed
  – => Detection time becomes 0
  – Can be used by any monitoring tool

With faster region assignment
• Detection: from 180s to 30s
• Data recovery: around 10s
• Reassignment: from tens of seconds to seconds

DataNode crash is expensive!
• One replica of the WAL edits is on the crashed DN
  – 33% of the reads during the region server recovery will go to it
• Many writes will go to it as well (the smaller the cluster, the higher that probability)
• The NameNode re-replicates the data (maybe TBs) that was on this node to restore the replica count
  – The NameNode does this work only after a long timeout (10 minutes by default)

HDFS – Stale mode
• Live: as today, used for reads & writes, using locality; marked stale after 30 seconds without heartbeat (can be less)
• Stale: not used for writes, used only as a last resort for reads; marked dead after 10 minutes (don't change this)
• Dead: as today, not used – and it's actually better to do the HBase recovery before HDFS re-replicates the TBs of data that were on this node

Results
• Do more reads/writes to HDFS during the recovery
• Multiple failures are still possible
  – Stale mode will still play its role
  – Also set dfs.timeout to 30s
  – This limits the effect of two failures in a row; the cost of the second failure is 30s if you were unlucky

Here is the client

The client
• You want the client to be patient
• Retrying when the system is already loaded is not good
• You want the client to learn about region servers dying, and to be able to react immediately
• You want the solution to be scalable

Scalable solution
• The master notifies the client
  – A cheap multicast message with the "dead servers" list, sent 5 times for safety
  – Off by default
  – On reception, the client immediately stops waiting on the TCP connection
  – You can now enjoy a large hbase.rpc.timeout
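As a rough illustration of the timeouts above: the ZooKeeper session timeout, the HDFS stale-DataNode settings and hbase.rpc.timeout are ordinary Hadoop/HBase configuration keys, normally set in hbase-site.xml and hdfs-site.xml. The sketch below sets them programmatically in Java; the key names are from the Hadoop 2.x / HBase 0.96 era and the values are examples only, so verify both against your release.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class RecoveryTimeouts {
        public static Configuration tuneForFastDetection() {
            // Normally these live in hbase-site.xml / hdfs-site.xml; shown here
            // programmatically purely as an illustration.
            Configuration conf = HBaseConfiguration.create();

            // Failure detection: 30s ZooKeeper session timeout instead of the old 180s default.
            conf.setInt("zookeeper.session.timeout", 30000);

            // HDFS stale mode: don't write to stale DataNodes, read from them only
            // as a last resort; a node becomes stale after 30s without a heartbeat.
            conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
            conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
            conf.setLong("dfs.namenode.stale.datanode.interval", 30 * 1000L);

            // With the dead-server notification the client no longer has to rely on
            // a short RPC timeout, so a generous value is fine (example value only).
            conf.setInt("hbase.rpc.timeout", 300000);

            return conf;
        }
    }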
Faster recovery (HBASE-7006)
• Previous algorithm
  – Read the WAL files
  – Write new HFiles
  – Tell the region server it got new HFiles
• Puts pressure on the NameNode
  – Remember: avoid putting pressure on the NameNode
• New algorithm
  – Read the WAL
  – Write to the region server
  – We're done (we have seen great improvements in our tests)
  – TBD: assign the WAL to a RegionServer local to a replica

[Diagram: distributed log splitting – RegionServers read the WAL files (<region:edit> entries) from HDFS and write one split-log file per region (splitlog-file-for-region1/2/3) back to HDFS]
[Diagram: distributed log replay – RegionServers read the WAL files from HDFS and replay the edits to the RegionServers now hosting the regions, producing recovered files per region (recovered-file-for-region1/2/3)]

Write during recovery
• Concurrent writes are allowed during the WAL replay
  – The same MemStore serves both
• Events stream: your new recovery time is the failure detection time: max 30s, likely less!
• Caveat: HBASE-8701 – WAL edits need to be applied in receiving order
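In the 0.95/0.96 era, distributed log replay is controlled by a single switch, hbase.master.distributed.log.replay; the minimal sketch below simply turns it on. Treat the property name and default as version-dependent and verify them against your release.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class EnableDistributedLogReplay {
        public static Configuration enable() {
            Configuration conf = HBaseConfiguration.create();
            // Replay WAL edits directly into the re-opened regions (HBASE-7006)
            // instead of writing intermediate per-region split files first.
            conf.setBoolean("hbase.master.distributed.log.replay", true);
            return conf;
        }
    }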
MemStore flush
• Real life: some tables are updated at a given moment, then left alone
  – With a non-empty MemStore
  – More data to recover
• It's now possible to guarantee that we don't have a MemStore with old data
• Improves real-life MTTR
• Helps online snapshots

.META.
• .META.
  – There is no -ROOT- table in 0.95/0.96
  – But .META. failures are critical
• A lot of small improvements
  – The server now tells the client when a region has moved (the client can avoid going to meta)
• And a big one
  – The .META. WAL is managed separately to allow an immediate recovery of .META.
  – With the new MemStore flush, this ensures a quick recovery

Data locality post recovery
• HBase performance depends on data locality
• After a recovery, you've lost it
  – Bad for performance
• Here come region groups
• Assign 3 favored RegionServers to every region
• On failure, assign the region to one of the secondaries
• The data-locality issue is minimized on failures

Discoveries from cluster testing
• HDFS-5016: Heartbeating thread blocks under some failure conditions, leading to loss of DataNodes
• HBASE-9039: Parallel assignment and distributed log replay during recovery
• Region splitting during distributed log replay may hinder recovery

Compactions example
• The MemStore fills up, files are flushed
• When enough files accumulate, they are compacted
[Diagram: writes go to the MemStore, flushes produce HFiles in HDFS, and several HFiles are compacted into one]

But, compactions cause slowdowns
• Looks like lots of I/O for no apparent benefit
[Chart: example effect on reads (note the better average) – read latency in ms over load-test time in seconds]

Key ways to improve compactions
• Read from fewer files
  – Separate files by row key, version, time, etc.
  – Allows a large number of files to be present, uncompacted
• Don't compact the data you don't need to compact
  – For example, old data in OpenTSDB-like systems
  – Obviously results in less I/O
• Make compactions smaller
  – Without too much I/O amplification or too many files
  – Results in fewer compaction-related outages
• HBase works better with a few large regions; however, large compactions cause unavailability

Stripe compactions (HBASE-7667)
• Somewhat like LevelDB: partition the keys inside each region/store
• But only 1 level (plus an optional L0)
• Compared to regions, partitioning is more flexible
  – The default is a number of ~equal-sized stripes
• To read, just read the relevant stripes, plus L0 if present
[Diagram: a region's row-key range (start key ccc, stripe boundaries eee and ggg, end key iii) divided into stripes of HFiles plus an L0; a get of 'hbase' only reads the stripe covering that key, plus L0]

Stripe compactions – writes
• Data is flushed from the MemStore into several files
• Each stripe compacts separately most of the time
[Diagram: MemStore flushes producing per-stripe HFiles in HDFS]

Stripe compactions – other
• Why Level 0?
  – Bulk-loaded files go to L0
  – Flushes can also go into single L0 files (to avoid tiny files)
  – Several L0 files are then compacted into striped files
• Can drop deletes if compacting one entire stripe + L0
  – No need for major compactions, ever
• Compact 2 stripes together – rebalance if unbalanced
  – Very rare, however; unbalanced stripes are not a huge deal
• Boundaries could be used to improve region splits in the future

Stripe compactions – performance
• EC2, c1.xlarge, preload; then measure random read performance
  – LoadTestTool + deletes + overwrites; measure random reads
[Chart: random gets per second over test time (sec), 30-second moving average, default vs. stripe compactions]
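The talk doesn't show the configuration, but stripe compactions are typically enabled by switching the store engine for a table or column family. The sketch below is a hedged illustration: the table name "usertable", the family "cf" and the numeric values are made up, and the stripe-specific keys should be checked against your release's reference guide.

    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;

    public class StripeCompactionTable {
        public static HTableDescriptor stripedTable() {
            // Hypothetical table/family names, for illustration only.
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("usertable"));
            HColumnDescriptor family = new HColumnDescriptor("cf");

            // Switch this column family's store to the stripe store engine (HBASE-7667).
            family.setConfiguration("hbase.hstore.engine.class",
                "org.apache.hadoop.hbase.regionserver.StripeStoreEngine");
            // Stripes keep more files per store, so raise the blocking threshold.
            family.setConfiguration("hbase.hstore.blockingStoreFiles", "100");
            // Optional stripe sizing knob (verify the key name for your release).
            family.setConfiguration("hbase.store.stripe.initialStripeCount", "8");

            table.addFamily(family);
            return table;
        }
    }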
HBase on YARN
• Hoya is a YARN application
• All components are YARN services
• Input is a cluster specification, persisted as a JSON document on HDFS
• HDFS and ZooKeeper are shared by multiple cluster instances
• A cluster can also be stopped and later resumed

Hoya Architecture
• Hoya Client: parses the command line, executes local operations, talks to the HoyaMasterService
• HoyaMasterService: AM service, deploys the HBase master locally
• HoyaRegionService: installs and executes the region server

HBase Master Service Deployment
• The HoyaMasterService is requested to create a cluster
• A local HBase dir is chosen for the expanded image
• The user-supplied config dir overwrites conf files in the conf directory
• The HBase conf is patched with the hostname of the master
• The HoyaMasterService monitors reporting from the RM

Failure Handling
• RegionService failures trigger new RS instances
• MasterService failures do not trigger a restart
• The RegionService monitors the ZK node for the master
• The MasterService monitors the state of the HBase master

Runtime classpath dependencies

Q&A

Thanks!