Study of Hbase


Evaluation of Hbase Read/Write

(A study of HBase and its benchmarks)

By Vaibhav Nachankar and Arvind Dwarakanath

Recap of HBase

 HBase is an open-source, distributed, column-oriented, sorted-map data store.

 It is the Hadoop database; it sits on top of HDFS.

 HBase supports reliable storage of, and efficient access to, huge amounts of structured data.
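Conceptually, HBase's sorted map goes from row key to column (family:qualifier) to value, with rows kept in key order so range scans are cheap. A minimal Python sketch of that model (an illustration only, not the HBase API; real HBase also versions each cell by timestamp and persists to HDFS, and the table and column names here are made up):

```python
class ToyHBaseTable:
    """Toy model of HBase's sorted map: row key -> 'family:qualifier' -> value."""

    def __init__(self):
        self.rows = {}  # row key -> {"family:qualifier": value}

    def put(self, row, column, value):
        self.rows.setdefault(row, {})[column] = value

    def get(self, row, column):
        return self.rows.get(row, {}).get(column)

    def scan(self, start_row, stop_row):
        # Rows are ordered by key, so a scan is just a range over sorted keys.
        for key in sorted(self.rows):
            if start_row <= key < stop_row:
                yield key, self.rows[key]


t = ToyHBaseTable()
t.put("row2", "cf:count", 5)
t.put("row1", "cf:count", 3)
print(t.get("row1", "cf:count"))               # -> 3
print([k for k, _ in t.scan("row1", "row3")])  # -> ['row1', 'row2']
```

The get, put, and scan operations here mirror the three basic operations benchmarked later in this project.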

HBase Architecture

Recap of HBase (contd.)

 Modeled after Google's BigTable.

 Supports map/reduce with Hadoop; optimized for real-time queries.

 No single point of failure.

 Random-access performance comparable to MySQL.

 Application: Facebook's messaging database.

HBase Benchmark Techniques

‘Hadoop Hbase-0.20.2 Performance Evaluation’ by D. Carstoiu, A. Cernian, and A. Olteanu, University of Bucharest.

 STRATEGY: uses random reads and writes to test and benchmark Hadoop with HBase.

HBase Benchmark Techniques (contd.)

‘Hadoop Hbase-0.20.2 Performance Evaluation’ by Kareem Dana at Duke University. It presents a varied set of test cases for exercising HBase.

 STRATEGY: tests column families, columns, sorting, and interspersed reads/writes.

Yahoo! Cloud Serving Benchmark (YCSB)

‘Benchmarking Cloud Serving Systems with YCSB’ by Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, Russell Sears.

 This paper/project is designed to benchmark both existing and newer cloud storage systems.

 So far the benchmark has been run on HBase, Cassandra, MongoDB, Project Voldemort, and SQL databases.

YCSB

 The benchmark tool uses workload files, which users can customize.

 For example, you can specify a 50/50 read/write mix, 95/5 r/w, and so on.

 The code for the project is available on Github.

https://github.com/brianfrankcooper/YCSB.git
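The read/update/scan/insert proportions in a workload file determine how the YCSB client picks each operation it issues. A rough sketch of that weighted selection (our own simplification, not YCSB's actual code, which also draws record keys from distributions such as zipfian):

```python
import random

def choose_operation(proportions, rng):
    """Pick one operation name, weighted by the workload proportions."""
    r = rng.random()  # uniform draw in [0, 1)
    cumulative = 0.0
    for op, p in proportions.items():
        cumulative += p
        if r < cumulative:
            return op
    return list(proportions)[-1]  # guard against floating-point rounding

# Workload A's mix: 50/50 read/update
workload_a = {"read": 0.5, "update": 0.5, "scan": 0.0, "insert": 0.0}

rng = random.Random(42)
ops = [choose_operation(workload_a, rng) for _ in range(10_000)]
reads = ops.count("read") / len(ops)
print(f"read fraction ~ {reads:.2f}")  # close to 0.50
```

Changing `readproportion`/`updateproportion` in the workload file shifts this mix, which is how the 95/5 Workload B below differs from Workload A.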

Example of a Workload

# Yahoo! Cloud System Benchmark
# Workload A: Update heavy workload
# Application example: Session store recording recent actions
#
# Read/update ratio: 50/50
# Default data size: 1 KB records (10 fields, 100 bytes each, plus key)
# Request distribution: zipfian

recordcount=1000
operationcount=1000
workload=com.yahoo.ycsb.workloads.CoreWorkload

readallfields=true

readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0

Example of a Workload

# Yahoo! Cloud System Benchmark
# Workload B: Read mostly workload
# Application example: photo tagging; add a tag is an update, but most operations are to read tags
#
# Read/update ratio: 95/5
# Default data size: 1 KB records (10 fields, 100 bytes each, plus key)
# Request distribution: zipfian

recordcount=1000
operationcount=1000
workload=com.yahoo.ycsb.workloads.CoreWorkload

readallfields=true

readproportion=0.95
updateproportion=0.05
scanproportion=0
insertproportion=0

Our Project

 Install HBase and get Hadoop to interface with it; study benchmarking techniques.

 Build a suite of test programs and run it on Hadoop/HBase.

 Include the basic get, put, and scan operations.

 Extend the Word Count map-reduce job to write its results into HBase.

 Compare with Brisk (Cassandra).
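The Word Count extension can be pictured as an ordinary map/shuffle/reduce pass whose reducer output is written to an HBase table instead of HDFS. A pure-Python sketch of that data flow (the dict here merely stands in for an HBase put of row=word, cf:count=total; all names are illustrative):

```python
from collections import defaultdict

def wordcount_map(line):
    # Map phase: emit (word, 1) for every word in the line.
    return [(word.lower(), 1) for word in line.split()]

def wordcount_reduce(word, counts):
    # Reduce phase: sum the partial counts for one word.
    return word, sum(counts)

def run_wordcount(lines, table):
    shuffled = defaultdict(list)  # shuffle: group emitted values by key
    for line in lines:
        for word, one in wordcount_map(line):
            shuffled[word].append(one)
    for word, counts in shuffled.items():
        key, total = wordcount_reduce(word, counts)
        # In the real job this final step would be an HBase Put.
        table[key] = total

table = {}
run_wordcount(["the quick brown fox", "the lazy dog"], table)
print(table["the"])  # -> 2
```

The point of the exercise is that only the output step changes: the map and reduce logic are untouched, and the reducer's emit becomes a put against the table.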

About Brisk

 Cassandra is a NoSQL, BigTable-based database.

 DataStax built Brisk to interface Hadoop with Cassandra: Hadoop + Cassandra = Brisk!

Brisk Architecture

Challenges Faced

 Configuring HBase is a tedious job! Not for the weak of will!

 Successive HBase releases do not keep their APIs consistent, so we ran into many ‘deprecated API’ error messages.

 Hadoop's compatibility with HBase has to be verified before proceeding with installation.

Challenges Faced (contd.)

 Very little documentation exists on installing HBase.

 Even less for Brisk!

Performance for Word Count (2 nodes / 2 cores each)

[Chart: time in secs over 5 readings, 1 mapper / 3 reducers]

Average = 45.484 s

Performance for Word Count (contd.)

[Chart: time in secs over 5 readings, 2 mappers / 3 reducers]

Average = 49.664 s

Performance for Word Count (contd.)

[Chart: time in secs over 5 readings, 2 mappers / 2 reducers]

Average = 43.7008 s

Performance for a simple get/put/scan (2 nodes / 2 cores each)

[Chart: time in secs over 5 readings for get, scan, and put]

Averages: get = 1.84, scan = 1.6266, put = 1.71

Performance for Word Count (3 nodes / 2 cores each)

[Chart: time in secs over 5 readings, 1 mapper / 3 reducers]

Average = 34.047 s

Performance for Word Count (contd.)

[Chart: time in secs over 5 readings, 2 mappers / 3 reducers]

Average = 36.1012 s

Performance for Word Count (contd.)

[Chart: time in secs over 5 readings, 2 mappers / 2 reducers]

Average = 37.4358 s

Conclusions

 Brisk seems the more promising tool, as it integrates Cassandra and Hadoop without much ado.

 HBase/Hadoop APIs need to be made consistent; with standardization, they would be easier to work with.

 HBase reads are faster than writes.

Thank You

Questions??