HBase: An Introduction
Christopher Shain
Software Development Lead
Tresata
Hadoop for Financial Services
First completely Hadoop-powered analytics application
Widely recognized as “Big Data Startup to Watch”
Winner of the 2011 NCTA Award for Emerging Tech Company of the Year
Based in Charlotte, NC
We are hiring! [email protected]
Software Development Lead at Tresata
Background in Financial Services IT
End-User Applications
Data Warehousing
ETL
Email: [email protected]
Twitter: @chrisshain
What is HBase?
From: http://hbase.apache.org:
▪ “HBase is the Hadoop database.”
I think this is a confusing statement
‘Database’, to many, means:
Transactions
Joins
Indexes
HBase has none of these
More on this later
HBase is a data storage platform designed to work hand-in-hand with Hadoop
Distributed
Failure-tolerant
Semi-structured
Low latency
Strictly consistent
HDFS-Aware
“NoSQL”
Need for a low-latency, distributed datastore with unlimited horizontal scale
Hadoop (MapReduce) doesn’t provide low latency
Traditional RDBMSs don’t scale out horizontally
November 2006: Google BigTable whitepaper published: http://research.google.com/archive/bigtable.html
February 2007: Initial HBase prototype
October 2007: First ‘usable’ HBase
January 2008: HBase becomes an Apache subproject of Hadoop
March 2009: HBase 0.20.0
May 10th, 2010: HBase becomes an Apache Top Level Project
Web Indexing
Social Graph
Messaging (Email etc.)
HBase is written almost entirely in Java
JVM clients are first-class citizens
[Diagram: basic deployment — JVM clients connect directly to the HBase Master and the RegionServers; non-JVM clients connect through a proxy (Thrift or REST).]
All data is stored in Tables
Table rows have exactly one Key, and all rows in a table are physically ordered by key
Tables have a fixed number of Column Families (more on this later!)
Each row can have many Columns in each column family
Each column has a set of values, each with a timestamp
Each row:family:column:timestamp combination represents coordinates for a Cell
Column Families are defined by the Table
A Column Family is a group of related columns with its own name
All columns must be in a column family
Each row can have a completely different set of columns for a column family
Example: rows Chris, Bob, and James each carry their own set of columns within the Friends column family (columns such as Friends:Bob, Friends:Chris, and Friends:James).
Not exactly the same as rows in a traditional RDBMS
Key: a byte array (usually a UTF-8 String)
Data: Cells, qualified by column family, column, and timestamp (not shown here)
Row Key | Column Family (defined by the Table) | Column (defined by the Row; may vary between rows) | Cell (created with Columns)
Chris   | Attributes | Attributes:Age    | 30
Chris   | Attributes | Attributes:Height | 68
Chris   | Friends    | Friends:Bob       | 1 (Bob’s a cool guy)
Chris   | Friends    | Friends:Jane      | 0 (Jane and I don’t get along)
All cells are created with a timestamp
The column family defines how many versions of a cell to keep
Updates always create a new cell
Deletes create a tombstone (more on that later)
Queries can include an “as-of” timestamp to return point-in-time values
HBase deletes are a form of write called a “tombstone”
Indicates that “beyond this point any previously written value is dead”
Old values can still be read using point-in-time queries
Timestamp | Write Type      | Resulting Value | Point-In-Time Value “as of” T+1
T+0       | PUT (“Foo”)     | “Foo”           | “Foo”
T+1       | PUT (“Bar”)     | “Bar”           | “Bar”
T+2       | DELETE          | <none>          | “Bar”
T+3       | PUT (“Foo Too”) | “Foo Too”       | “Bar”
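The versioning behavior in this table can be sketched as a timestamp-ordered map per cell; the class and method names here are illustrative, not the HBase API:

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch (not the HBase API): one cell's version history as a
// timestamp-ordered map, with deletes stored as tombstone markers.
public class VersionedCell {
    private final TreeMap<Long, String> versions = new TreeMap<>();

    public void put(long ts, String value) {
        versions.put(ts, value);
    }

    public void delete(long ts) {
        versions.put(ts, null); // tombstone: hides all earlier writes
    }

    // "As-of" read: the newest write at or before asOfTs.
    // A tombstone at or before asOfTs makes the cell read as deleted.
    public String getAsOf(long asOfTs) {
        Map.Entry<Long, String> e = versions.floorEntry(asOfTs);
        return e == null ? null : e.getValue();
    }
}
```

Replaying the table: after the PUTs at T+0 and T+1, the DELETE at T+2, and the PUT at T+3, a read “as of” T+1 still returns “Bar”, a read at T+2 sees the tombstone, and a read at T+3 returns “Foo Too”.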
Requirement: Store real-time stock tick data

Ticker | Timestamp    | Sequence | Bid    | Ask
IBM    | 09:15:03:001 | 1        | 179.16 | 179.18
MSFT   | 09:15:04:112 | 2        | 28.25  | 28.27
GOOG   | 09:15:04:114 | 3        | 624.94 | 624.99
IBM    | 09:15:04:155 | 4        | 179.18 | 179.19
Requirement: Accommodate many simultaneous readers & writers
Requirement: Allow for reading of the current price for any ticker at any point in time
Historical Prices:

Column          | Data Type | Keys
Ticker          | Varchar   | PK
Timestamp       | DateTime  | PK
Sequence_Number | Integer   | PK
Bid_Price       | Decimal   |
Ask_Price       | Decimal   |

Latest Prices:

Column    | Data Type | Keys
Ticker    | Varchar   | PK
Bid_Price | Decimal   |
Ask_Price | Decimal   |
Row Key                                        | Family:Column
[Ticker].[Rev_Timestamp].[Rev_Sequence_Number] | Prices:Bid, Prices:Ask

HBase throughput will scale linearly with # of nodes
No need to keep a separate “latest price” table
A scan starting at “ticker” will always return the latest price row
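The reversed fields of that row key can be built by subtracting from Long.MAX_VALUE and zero-padding to a fixed width, so the newest tick sorts first lexicographically. A minimal sketch (class name and string layout are illustrative choices):

```java
// Sketch of the composite row key [Ticker].[Rev_Timestamp].[Rev_Sequence_Number]:
// values are subtracted from Long.MAX_VALUE and zero-padded to a fixed width
// so that newer ticks sort lexicographically before older ones.
public class TickRowKey {
    private static String reverse(long v) {
        return String.format("%019d", Long.MAX_VALUE - v); // fixed 19-digit width
    }

    public static String of(String ticker, long timestampMillis, long sequence) {
        return ticker + "." + reverse(timestampMillis) + "." + reverse(sequence);
    }
}
```

With this layout, a scan starting at "IBM." reaches the most recent IBM row first.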
HBase scales horizontally
Needs to split data over many RegionServers
Regions are the unit of scale
All HBase tables are broken into 1 or more regions
Regions have a start row key and an end row key
Each Region lives on exactly one RegionServer
RegionServers may host many Regions
When RegionServers die, the Master detects this and assigns their Regions to other RegionServers
“Users” Table:

Row Keys in Region “Aaron” – “George”: “Aaron”, “Bob”, “Chris”
Row Keys in Region “George” – “Matthew”: “George”
Row Keys in Region “Matthew” – “Zachary”: “Matthew”, “Nancy”, “Zachary”

-META- Table:

Table | Region                | Server
Users | “Aaron” – “George”    | Node01
Users | “George” – “Matthew”  | Node02
Users | “Matthew” – “Zachary” | Node01
Deceptively simple
[Diagram: full deployment — a ZooKeeper cluster coordinates the active HBase Master and a backup Master; JVM clients connect directly to the Master and the RegionServers, while non-JVM clients go through a proxy (Thrift or REST).]
ZooKeeper
Keeps track of which server is the current HBase Master
HBase Master
Keeps track of the Region/RegionServer mapping
Manages the -ROOT- and .META. tables
Responsible for updating ZooKeeper when these change
RegionServer
Stores table regions
Clients
Need to be smarter than RDBMS clients
First connect to ZooKeeper to get the RegionServer for a given Table/Region
Then connect directly to the RegionServer to interact with the data
All connections are over Hadoop RPC – non-JVM clients use a proxy (Thrift or REST (Stargate))
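The client-side lookup can be pictured as a sorted map from region start keys to servers: the region serving a row is the one with the greatest start key at or below it. A simplified sketch (names are illustrative, not the HBase client API):

```java
import java.util.Map;
import java.util.TreeMap;

// Simplified sketch of client-side region location: regions are keyed by
// their start row key, and the region serving a given row is found with a
// floor lookup on that key.
public class RegionLocator {
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    public void addRegion(String startKey, String server) {
        regionsByStartKey.put(startKey, server);
    }

    public String serverFor(String rowKey) {
        Map.Entry<String, String> e = regionsByStartKey.floorEntry(rowKey);
        return e == null ? null : e.getValue();
    }
}
```

Using the earlier Users example: regions starting at “Aaron” (Node01), “George” (Node02), and “Matthew” (Node01) route a lookup for “Chris” to Node01.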
-ROOT- Table:

Row Key        | Columns
.META.[region] | info:regioninfo, info:server, info:serverstartcode

Each row points to the RegionServer hosting a .META. region.

.META. Table:

Row Key                                | Columns
[table],[region start key],[region id] | info:regioninfo, info:server, info:serverstartcode

Each row points to the RegionServer hosting a regular user table region.
The HBase Master is not necessarily a single point of failure (SPOF)
Multiple masters can be running
The current ‘active’ Master is controlled via ZooKeeper
Make sure you have enough ZooKeeper nodes!
The Master is not needed for client connectivity
Clients connect directly to ZooKeeper to find Regions
Everything the Master does can be put off until one is elected
ZooKeeper Quorum
[Diagram: three ZooKeeper nodes form the quorum, coordinating the current HBase Master and two standby Masters.]
HBase tolerates RegionServer failure when running on HDFS
Data is replicated by HDFS (dfs.replication setting)
Lots of issues around fsync and failure before data is flushed - some probably still not fixed
Thus, data can still be lost if a node fails after a write
The HDFS NameNode is still a SPOF, even for HBase
Similar to the log in many RDBMSs
All operations are by default written to the log before being considered ‘committed’ (can be overridden for ‘disposable fast writes’)
The log can be replayed when a region is moved to another RegionServer
One WAL per RegionServer
[Diagram: each write goes to the WAL, which is flushed periodically (10s by default), and to the MemStore, which is flushed to a new HFile when it gets too big.]
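The MemStore-to-HFile flush can be sketched as a sorted in-memory buffer that is sealed into an immutable file once it crosses a size threshold. The threshold and names below are illustrative, not HBase defaults or APIs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sketch of the write path: each write lands in an in-memory sorted buffer
// (the MemStore analogue); when the buffer exceeds a size threshold it is
// sealed as one immutable sorted file (the HFile analogue).
public class WritePath {
    private final int flushThreshold;
    private TreeMap<String, String> memStore = new TreeMap<>();
    private final List<TreeMap<String, String>> flushedFiles = new ArrayList<>();

    public WritePath(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    public void put(String rowKey, String value) {
        memStore.put(rowKey, value); // in real HBase, the WAL is written first
        if (memStore.size() >= flushThreshold) {
            flush();
        }
    }

    private void flush() {
        flushedFiles.add(memStore); // becomes an immutable sorted file
        memStore = new TreeMap<>();
    }

    public int fileCount() { return flushedFiles.size(); }
    public int memStoreSize() { return memStore.size(); }
}
```

Each flush produces one new file, which is why compaction (below) is eventually needed to keep the file count down.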
[Diagram: RegionServer internals — the RegionServer holds a Log (WAL) and one or more Regions; each Region holds one Store per column family; each Store holds a MemStore and one or more StoreFiles (HFiles); HFile blocks are written through the HDFS Client onto the HDFS DataNodes.]
A RegionServer is not guaranteed to be on the same physical node as its data
Compaction causes the RegionServer to write preferentially to the local node
But this is a function of the HDFS Client, not HBase
All data is in memory initially (memstore)
HBase is a write-only system: modifications and deletes are just writes with later timestamps
A function of HDFS being append-only
Eventually old writes need to be discarded
2 types of compactions:
Minor
Major
All HBase edits are initially stored in memory (memstore)
Flushes occur when the memstore reaches a certain size
By default 67,108,864 bytes (64 MB)
Controlled by the hbase.hregion.memstore.flush.size configuration property
Each flush creates a new HFile
Minor compactions are triggered when a certain number of HFiles have been created for a given Region Store (+ some other conditions)
By default 3 HFiles
Controlled by the hbase.hstore.compactionThreshold configuration property
Compacts the most recent HFiles into one
By default, uses the RegionServer-local HDFS node
Does not eliminate deletes
Only touches the most recent HFiles
NOTE: All column families are compacted at once (this might change in the future)
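The two configuration properties named above might be tuned in hbase-site.xml like this (the values shown are the defaults quoted in the slides):

```xml
<!-- hbase-site.xml: tuning flush and minor-compaction thresholds -->
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>67108864</value> <!-- flush the MemStore to a new HFile at 64 MB -->
</property>
<property>
  <name>hbase.hstore.compactionThreshold</name>
  <value>3</value> <!-- minor-compact once a Store accumulates 3 HFiles -->
</property>
```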
Major compactions are triggered every 24 hours (with a random offset) or manually
Large HBase installations usually leave this to operators to trigger manually
Re-writes all HFiles into one
Processes deletes
Eliminates tombstones
Erases earlier entries
HBase does not have transactions
However:
Row-level modifications are atomic: all modifications to a row will succeed or fail as a unit
Gets are consistent for a given point in time
▪ But Scans may return 2 rows from different points in time
All data read has been ‘durably stored’
▪ Does NOT mean flushed to disk - it can still be lost!
DO: Design your schema for linear range scans on your most common queries.
Scans are the most efficient way to query a lot of rows quickly
DON’T: Use more than 2 or 3 column families.
Some operations (flushing and compacting) operate on the whole row
DO: Be aware of the relative cardinality of column families
Wildly differing cardinality leads to sparsity and bad scanning results.
DO: Be mindful of the size of your row and column keys
They are used in indexes and queries, and can be quite large!
DON’T: Use monotonically increasing row keys
Can lead to hotspots on writes
DO: Store timestamp keys in reverse
Rows in a table need to be read in order, and usually you want the most recent first
DO: Query single rows using exact-match on key (Gets), or Scans for multiple rows
Scans allow efficient I/O vs. multiple Gets
DON’T: Use regex-based or non-prefix column filters
Very inefficient
DO: Tune the scan cache and batch size parameters
Drastically improves performance when returning lots of rows
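Besides reversing timestamps, another common way to break up monotonically increasing keys is salting: prefixing a small deterministic bucket so consecutive writes land on different regions. A sketch (the bucket count and key format are illustrative choices):

```java
// Sketch of salting a monotonically increasing key: a small deterministic
// bucket prefix spreads consecutive writes across regions instead of
// hammering the single region holding the highest keys.
public class SaltedKey {
    private static final int BUCKETS = 16; // illustrative bucket count

    public static String of(long monotonicId) {
        int bucket = (int) Math.floorMod(monotonicId, (long) BUCKETS);
        return String.format("%02d-%019d", bucket, monotonicId);
    }
}
```

The trade-off: ordered reads must now scan all bucket prefixes and merge the results.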
Requirement: Store an arbitrary set of preferences for all users
Requirement: Each user may choose to store a different set of preferences
Requirement: Preferences may be of different data types (Strings, Integers, etc.)
Requirement: Developers will add new preference options all the time, so we shouldn’t need to modify the database structure when adding them
One possible RDBMS solution:
Key/Value table
All values as strings
Flexible, but wastes space

Column          | Data Type | Keys
UserID          | Int       | PK
PreferenceName  | Varchar   | PK
PreferenceValue | Varchar   |
Store all preferences in the Preferences column family
Preference name as column name, preference value as a (serialized) byte array
The HBase client library provides methods for serializing many common data types

Row Key | Family      | Column    | Value
Chris   | Preferences | Age       | 30
Chris   | Preferences | Hometown  | “Mineola, NY”
Joe     | Preferences | Birthdate | 11/13/1987
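Serializing mixed-type preference values to byte arrays can be sketched with plain JDK classes; the real HBase client ships a Bytes utility for this, and the method names below are illustrative, not that API:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative sketch of serializing typed preference values to byte arrays
// for storage as HBase cell values (not the org.apache.hadoop.hbase.util.Bytes API).
public class PrefBytes {
    public static byte[] fromInt(int v) {
        return ByteBuffer.allocate(4).putInt(v).array(); // big-endian 4 bytes
    }

    public static int toInt(byte[] b) {
        return ByteBuffer.wrap(b).getInt();
    }

    public static byte[] fromString(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static String toStr(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }
}
```

Because every value is just bytes, each row can mix types freely: Chris can store an integer Age while Joe stores a string Birthdate, with no schema change.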