Transcript: Study of HBase
Evaluation of HBase Read/Write
(A study of HBase and its benchmarks)
By Vaibhav Nachankar and Arvind Dwarakanath
Recap of HBase
HBase is an open-source, distributed, column-oriented, sorted-map data store.
It is the Hadoop database; it sits on top of HDFS.
HBase supports reliable storage of, and efficient access to, huge amounts of structured data.
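The "sorted-map" description above can be sketched as a toy in-memory table. This is illustrative only (our own class names, not HBase's API); real HBase also versions each cell by timestamp, groups columns into physical column families, and persists to HDFS:

```python
class ToyHBaseTable:
    """Minimal sketch of HBase's data model: a map sorted by row key,
    where each row holds {"family:qualifier": value} cells."""

    def __init__(self):
        self.rows = {}  # row_key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        return self.rows.get(row_key, {}).get(column)

    def scan(self, start_row=None, stop_row=None):
        # Like an HBase scan, rows come back in sorted row-key order.
        for key in sorted(self.rows):
            if start_row is not None and key < start_row:
                continue
            if stop_row is not None and key >= stop_row:
                break
            yield key, self.rows[key]

table = ToyHBaseTable()
table.put("row-2", "info:msg", "hello")
table.put("row-1", "info:msg", "world")
```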
HBase Architecture
Recap of HBase (contd.)
Modeled after Google's BigTable.
Integrates with Hadoop MapReduce; optimized for real-time queries.
No single point of failure.
Random-access performance comparable to MySQL.
Application: Facebook's messaging database.
HBase Benchmark Techniques
'Hadoop Hbase-0.20.2 Performance Evaluation' by D. Carstoiu, A. Cernian, and A. Olteanu, University of Bucharest.
STRATEGY: uses random reads and writes to benchmark Hadoop with HBase.
HBase Benchmark Techniques (contd.)
'Hadoop Hbase-0.20.2 Performance Evaluation' by Kareem Dana at Duke University. It presents a varied set of test cases for exercising HBase.
STRATEGY: tests column families, columns, sorts, and interspersed reads/writes.
Yahoo! Cloud Serving Benchmark (YCSB)
‘Benchmarking Cloud Serving Systems with YCSB’ by Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, Russell Sears.
This paper/project is designed to benchmark existing and emerging cloud storage systems.
So far the benchmark has been run on HBase, Cassandra, MongoDB, Project Voldemort, and MySQL.
YCSB
The benchmark tool uses workload files, which users can customize.
For example, you can specify a 50/50 read/write mix, a 95/5 mix, and so on.
The code for the project is available on GitHub:
https://github.com/brianfrankcooper/YCSB.git
Example of a Workload
# Yahoo! Cloud System Benchmark
# Workload A: Update heavy workload
#   Application example: Session store recording recent actions
#
#   Read/update ratio: 50/50
#   Default data size: 1 KB records (10 fields, 100 bytes each, plus key)
#   Request distribution: zipfian
recordcount=1000
operationcount=1000
workload=com.yahoo.ycsb.workloads.CoreWorkload
readallfields=true
readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0
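The proportion settings above drive how the client picks an operation for each of the `operationcount` requests. A minimal Python sketch of that selection, under our own assumptions (this is not YCSB's actual code):

```python
import random

def choose_operation(proportions, rng):
    """Pick one operation, weighted by workload proportions,
    e.g. {"read": 0.5, "update": 0.5} as in Workload A."""
    r = rng.random()
    cumulative = 0.0
    for op, p in proportions.items():
        cumulative += p
        if r < cumulative:
            return op
    return op  # guard against floating-point slack in the proportions

# Simulate Workload A: 50/50 read/update, operationcount=1000
rng = random.Random(0)
workload_a = {"read": 0.5, "update": 0.5}
ops = [choose_operation(workload_a, rng) for _ in range(1000)]
```

With these proportions, roughly half of the 1000 operations come out as reads; changing the file to `readproportion=0.95` / `updateproportion=0.05` gives the Workload B mix.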
Example of a Workload
# Yahoo! Cloud System Benchmark
# Workload B: Read mostly workload
#   Application example: photo tagging; adding a tag is an update, but most operations are to read tags
#
#   Read/update ratio: 95/5
#   Default data size: 1 KB records (10 fields, 100 bytes each, plus key)
#   Request distribution: zipfian
recordcount=1000
operationcount=1000
workload=com.yahoo.ycsb.workloads.CoreWorkload
readallfields=true
readproportion=0.95
updateproportion=0.05
scanproportion=0
insertproportion=0
Our Project
Install HBase and get Hadoop to interface with it; study benchmark techniques.
Build a suite of programs and run them on Hadoop/HBase.
Include basic get, put, and scan operations.
Extend Word Count's MapReduce job to write its output to HBase.
Compare with Brisk (Cassandra).
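The Word Count extension can be sketched in plain Python (not the Hadoop Java API); here `table` is a dict standing in for an HBase table, and the function and column names are our own illustration:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) for every word in the input.
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    # Reduce: sum the counts for each word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["hbase sits on hdfs", "hbase runs on hadoop"]
table = {}  # stand-in for an HBase table keyed by word
for word, count in reduce_phase(map_phase(lines)).items():
    # In the real job, the reducer would Put this cell into HBase
    # (e.g. via HBase's TableReducer) instead of writing to HDFS.
    table[word] = {"count:n": count}
```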
About Brisk
Cassandra is a NoSQL, BigTable-based database.
DataStax built Brisk to interface Hadoop with Cassandra: Hadoop + Cassandra = Brisk!
Brisk Architecture
Challenges Faced
Configuring HBase is a tedious job! Not for the weak of will!
Successive HBase releases do not keep their APIs consistent, so we ran into many 'deprecated API' error messages.
Hadoop's compatibility with HBase has to be verified before proceeding with installation.
Challenges Faced (contd.)
Very few documents cover HBase installation in detail.
Even fewer cover Brisk!
Performance for Word Count (2 nodes / 2 cores each)
[Chart: time in secs over 5 readings, 1 mapper / 3 reducers. Average = 45.484 s]
Performance for Word Count (contd.)
[Chart: time in secs over 5 readings, 2 mappers / 3 reducers. Average = 49.664 s]
Performance for Word Count (contd.)
[Chart: time in secs over 5 readings, 2 mappers / 2 reducers. Average = 43.7008 s]
Performance for a simple get/put/scan (2 nodes / 2 cores)
[Chart: time in secs over 5 readings. Averages: get = 1.84 s, scan = 1.6266 s, put = 1.71 s]
Performance for Word Count (3 nodes/2 cores each)
[Chart: time in secs over 5 readings, 1 mapper / 3 reducers. Average = 34.047 s]
Performance for Word Count (contd.)
[Chart: time in secs over 5 readings, 2 mappers / 3 reducers. Average = 36.1012 s]
Performance for Word Count (contd.)
[Chart: time in secs over 5 readings, 2 mappers / 2 reducers. Average = 37.4358 s]
Conclusions
Brisk seems the more promising tool, as it integrates Cassandra and Hadoop without much ado.
HBase/Hadoop APIs should be made consistent; with standardization, they would be easier to work with.
HBase reads are faster than writes.
Thank You
Questions?