Database overview


Part II NoSQL Database
(BigTable and HBase)
Yuan Xue
Introduction


BigTable Background
  • Development began in 2004 at Google (paper published in 2006)
  • Motivated by the need to store and handle large amounts of (semi-)structured data
  • Many Google projects store data in BigTable:
      • Google’s web crawl
      • Google Earth
      • Google Analytics
HBase Background
  • Open-source implementation of BigTable, built on top of HDFS
  • Initial HBase prototype in 2007
  • Hadoop became an Apache top-level project, with HBase as a subproject, in 2008
Road Map
Database User/Application Developer: How to use?
  • (Logical) data model and CRUD operations
Database System Designer: How to design?
  • Under the hood: (physical) data model and distribution algorithm
Database Designer: How to link application needs with database design
  • Schema design
Data Model
A sparse, distributed, persistent multidimensional sorted map
Map indexed by a row key, column key, and a timestamp
  • (row:string, column:string, time:int64) → uninterpreted byte array
Rows maintained in sorted lexicographic order based on row key
  • A row key is an arbitrary string
  • Every read or write of data under a single row is atomic
  • Row ranges are dynamically partitioned into tablets
      • A tablet is the unit of distribution and load balancing
  • Applications can exploit this property for efficient row scans
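A minimal sketch of why sorted row keys make row scans cheap, using a plain java.util.TreeMap as a stand-in for the table. The class name, the row keys, and the column names are hypothetical, and this is not the Bigtable or HBase API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy stand-in for a table whose rows are kept in lexicographic row-key order.
class SortedRowsSketch {
    // row key -> (column key -> value)
    private final TreeMap<String, TreeMap<String, String>> rows = new TreeMap<>();

    void put(String rowKey, String columnKey, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(columnKey, value);
    }

    // Row scan over [startRow, stopRow); assumes startRow <= stopRow.
    // Because keys are sorted, the matching rows form one contiguous range.
    List<String> scan(String startRow, String stopRow) {
        return new ArrayList<>(rows.subMap(startRow, stopRow).keySet());
    }

    public static void main(String[] args) {
        SortedRowsSketch table = new SortedRowsSketch();
        table.put("user:carol", "info:name", "Carol");
        table.put("video:1", "info:title", "Intro");
        table.put("user:alice", "info:name", "Alice");
        // All rows whose key starts with "user:" (';' is the character right after ':')
        System.out.println(table.scan("user:", "user;"));
    }
}
```

Choosing row keys so that related rows share a prefix is what turns "fetch all related rows" into a single range scan.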
Data Model
Columns grouped into column families
  • Column key = family:qualifier
  • A column family must be created before data can be stored under any column key in that family
  • Column families provide locality hints
  • Unbounded number of columns
Data Model
Timestamps
  • 64-bit integers, assigned by:
      • Bigtable: real time in microseconds, or
      • the client application, when unique timestamps are a necessity
  • Items in a cell are stored in decreasing timestamp order
  • The application specifies how many versions (n) of a data item are kept in a cell
  • Bigtable garbage-collects obsolete versions
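The versioning rules above can be sketched with a single in-memory cell. This is an illustration only: the class and method names are hypothetical, and real systems garbage-collect old versions during compaction rather than on every write.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// One cell holding up to maxVersions values, newest timestamp first.
class VersionedCellSketch {
    private final int maxVersions;
    // timestamp -> value, iterated in decreasing timestamp order
    private final TreeMap<Long, String> versions = new TreeMap<>(Comparator.reverseOrder());

    VersionedCellSketch(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    void put(long timestamp, String value) {
        versions.put(timestamp, value);
        // Drop the oldest versions once more than maxVersions are stored
        while (versions.size() > maxVersions) {
            versions.pollLastEntry(); // last entry = smallest timestamp
        }
    }

    // The newest version, or null if the cell is empty
    String getLatest() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    // Up to n versions, newest first
    List<String> getVersions(int n) {
        List<String> out = new ArrayList<>();
        for (String v : versions.values()) {
            if (out.size() == n) break;
            out.add(v);
        }
        return out;
    }

    public static void main(String[] args) {
        VersionedCellSketch cell = new VersionedCellSketch(2);
        cell.put(100L, "a");
        cell.put(300L, "c");
        cell.put(200L, "b");
        System.out.println(cell.getLatest());
        System.out.println(cell.getVersions(5));
    }
}
```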
Data Model – MiniTwitter Example
View as a Map of Maps
Operations & APIs in HBase
Create and delete tables and column families; modify metadata
Operations are based on row keys
Single-row operations:
  • Put
  • Get
  • Delete
Multi-row operations:
  • Scan
  • MultiPut
Atomic read-modify-write (R-M-W) sequences on data stored under a single row key (no support for transactions across multiple rows)
No built-in joins
  • Can be done in the application
      • Using scan() and get() operations
      • Using MapReduce
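A rough sketch of a join done in the application with a scan over one table and a get per matching row in another. Both "tables" are plain in-memory sorted maps, and their names and layout are assumptions made for illustration, not the MiniTwitter schema from the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Application-side join: scan one table, then issue a get per referenced row.
class ClientSideJoinSketch {
    // Hypothetical "follows" table: row key = follower id, value = ids of followed users
    final TreeMap<String, List<String>> follows = new TreeMap<>();
    // Hypothetical "users" table: row key = user id, value = display name
    final TreeMap<String, String> users = new TreeMap<>();

    // For every follower row in [startRow, stopRow), look up each followed user's name.
    List<String> join(String startRow, String stopRow) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> row
                : follows.subMap(startRow, stopRow).entrySet()) { // plays the role of scan()
            for (String followedId : row.getValue()) {
                String name = users.get(followedId);              // plays the role of get()
                if (name != null) {
                    out.add(row.getKey() + " follows " + name);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        ClientSideJoinSketch db = new ClientSideJoinSketch();
        db.users.put("u1", "Alice");
        db.users.put("u2", "Bob");
        db.follows.put("u1", List.of("u2"));
        db.follows.put("u2", List.of("u1"));
        System.out.println(db.join("u1", "u3"));
    }
}
```

Each get is a separate lookup, so this pattern costs one round trip per joined row against a real cluster; that cost is part of what motivates denormalized schemas.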
Creating a Table
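import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

// Classic (pre-1.0) HBase admin API; config is assumed to come from HBaseConfiguration.create()
Configuration config = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(config);
// One descriptor per column family; HBase 0.20 and later reject ':' in family names
HColumnDescriptor[] column = new HColumnDescriptor[2];
column[0] = new HColumnDescriptor("columnFamily1");
column[1] = new HColumnDescriptor("columnFamily2");
HTableDescriptor desc = new HTableDescriptor(Bytes.toBytes("MyTable"));
desc.addFamily(column[0]);
desc.addFamily(column[1]);
admin.createTable(desc);  // creates "MyTable" with both column families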
Altering a Table
Disable the table before changing the schema
Single-row operations: Put()
Insert a new record (with a new key), or insert a record for an existing key
  • Implicit version number (timestamp)
  • Explicit version number
Put() in MiniTwitter
Update information
Single-row operations: Get()
  • Given a key, return the corresponding record
  • For each value, return the highest version
  • Can control the number of versions returned
Get() in MiniTwitter
Single-row operations: Delete()
Marking table cells as deleted
Multiple levels
  • Can mark an entire column family as deleted
  • Can mark all column families of a given row as deleted
// Delete an entire row
Delete d = new Delete(Bytes.toBytes("rowkey"));
userTable.delete(d);

// Delete all versions of a single column (cf:attr) in a row
Delete d = new Delete(Bytes.toBytes("rowkey"));
d.deleteColumns(
    Bytes.toBytes("cf"),
    Bytes.toBytes("attr"));
userTable.delete(d);
Multi-row operations: Scan()
Road Map
Database User/Application Developer: How to use?
  • (Logical) data model and CRUD operations
Database System Designer: How to design?
  • Under the hood: (physical) data model and distribution algorithm
      • Single node: write, read, delete
      • Distributed system
Database Designer: How to link application needs with database design
  • Schema design
Basics
Terms
BigTable        HBase
SSTable         HFile
memtable        MemStore
tablet          region
tablet server   RegionServer
HFile/SSTable
Basic building block of Bigtable
  • Persistent, ordered, immutable map from keys to values
  • Sequence of blocks on disk plus an index for block lookup
  • Stored in GFS (HDFS)
  • Can be completely mapped into memory
Supported operations:
  • Look up the value associated with a key
  • Iterate over key/value pairs within a key range
[Figure: an SSTable shown as a sequence of 64K blocks plus an index]
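To make the "blocks plus an index" layout concrete, here is a small in-memory sketch. It assumes the index maps the first key of each block to that block's position, which is one simple way to support both operations; the class name and the tiny block size are hypothetical, and real SSTable/HFile formats are considerably more involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Immutable sorted key/value file split into blocks, with an index over the blocks.
class SSTableSketch {
    private final List<TreeMap<String, String>> blocks = new ArrayList<>();
    // Index: first key of each block -> block number
    private final TreeMap<String, Integer> index = new TreeMap<>();

    // Built once from sorted data; nothing is modified afterwards.
    SSTableSketch(TreeMap<String, String> sortedData, int entriesPerBlock) {
        TreeMap<String, String> current = null;
        for (Map.Entry<String, String> e : sortedData.entrySet()) {
            if (current == null || current.size() >= entriesPerBlock) {
                current = new TreeMap<>();
                index.put(e.getKey(), blocks.size());
                blocks.add(current);
            }
            current.put(e.getKey(), e.getValue());
        }
    }

    // Point lookup: the index picks the only block that can hold the key.
    String get(String key) {
        Map.Entry<String, Integer> slot = index.floorEntry(key);
        return slot == null ? null : blocks.get(slot.getValue()).get(key);
    }

    // Range iteration over [startKey, stopKey); assumes startKey <= stopKey.
    List<String> scan(String startKey, String stopKey) {
        List<String> out = new ArrayList<>();
        Map.Entry<String, Integer> slot = index.floorEntry(startKey);
        int first = (slot == null) ? 0 : slot.getValue();
        for (int b = first; b < blocks.size(); b++) {
            TreeMap<String, String> block = blocks.get(b);
            if (block.firstKey().compareTo(stopKey) >= 0) break; // past the range
            out.addAll(block.subMap(startKey, stopKey).values());
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, String> data = new TreeMap<>();
        for (String k : new String[] {"a", "b", "c", "d", "e"}) data.put(k, "value-" + k);
        SSTableSketch table = new SSTableSketch(data, 2);
        System.out.println(table.get("c"));
        System.out.println(table.scan("b", "e"));
    }
}
```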
HDFS: Hadoop Distributed File System
How is an HFile/SSTable stored? On HDFS
HDFS as a reliable storage layer for HBase
  • Handles checksums, replication, and failover
  • Each file has three copies by default
Design
  • The client requests metadata about a file from the NameNode
  • Data is served directly from the DataNodes

File Read/Write in HDFS
[Figure: components shown are the HDFS client (client JVM on the client node), DistributedFileSystem, FSDataInputStream/FSDataOutputStream, the NameNode, and the DataNodes]
File Read
  1. open
  2. get block locations
  3. read
  4. read from the closest node
  5. read from the 2nd closest node
  6. close
File Write
  1. create
  2. create
  3. write
  4. get a list of 3 data nodes
  5. write packet
  6. ack packet
  7. close
  8. complete
If a data node crashes, the crashed node is removed, the current block receives a new ID so that the partial data on the crashed node can be deleted later, and the NameNode allocates another node.
HBase: Logical Storage vs Physical Storage
Tablet (Region)
  • Dynamically partitioned range of rows
  • Built from multiple SSTables
  • Column-family-oriented storage
[Figure: a tablet covering rows from Start: Alice00 to End: Dave11, built from two SSTables, each a sequence of 64K blocks plus an index]
Table (HTable)
  • Multiple tablets make up the table
  • The entire BigTable is split into tablets of contiguous ranges of rows
      • Approximately 100 MB to 200 MB each
  • Tablets are split as their size grows
  • SSTables can be shared
[Figure: an HTable made up of two tablets (row keys Alice00, Dave11, Darth, and Emily) and their SSTables]
•Each column family is stored in a separate file
•Key & Version numbers are replicated with each column family
•Empty cells are not stored
Source: Graphic from slides by Erik Paulson
Table to Region
Physical Storage: MiniTwitter Example
Write Path in HBase
  • HLog: append-only write-ahead log (WAL) on HDFS, one per RegionServer (RS)
Read Path in HBase
Deletion and Compaction in HBase
Delete() will mark the record for deletion
  • A new “tombstone” record is written for that value
BigTable             HBase
Merging compaction   Minor compaction
Minor compaction     flush
Major compaction     Major compaction
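A simplified sketch of how a tombstone hides a value until a major compaction physically removes it. Everything here is illustrative: the tombstone marker, the class, and the method names are hypothetical, timestamps are ignored, and the merging (minor) compaction step is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// MemStore plus immutable flushed files; deletes are written as tombstones.
class TombstoneSketch {
    private static final String TOMBSTONE = "\u0000TOMBSTONE"; // hypothetical marker value

    private TreeMap<String, String> memStore = new TreeMap<>();
    private final List<TreeMap<String, String>> files = new ArrayList<>(); // oldest first

    void put(String key, String value) { memStore.put(key, value); }

    // Delete does not touch older files; it only records a tombstone.
    void delete(String key) { memStore.put(key, TOMBSTONE); }

    // Flush: the MemStore becomes a new immutable file.
    void flush() {
        if (memStore.isEmpty()) return;
        files.add(memStore);
        memStore = new TreeMap<>();
    }

    // Read: newest data wins; a tombstone makes the key look absent.
    String get(String key) {
        String v = memStore.get(key);
        for (int i = files.size() - 1; v == null && i >= 0; i--) {
            v = files.get(i).get(key);
        }
        return TOMBSTONE.equals(v) ? null : v;
    }

    // Major compaction: merge all files into one and drop deleted entries for good.
    void majorCompact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> f : files) merged.putAll(f); // newer files overwrite older
        merged.values().removeIf(TOMBSTONE::equals);
        files.clear();
        if (!merged.isEmpty()) files.add(merged);
    }

    // Number of physically stored entries, tombstones included
    int storedEntries() {
        int n = memStore.size();
        for (TreeMap<String, String> f : files) n += f.size();
        return n;
    }

    public static void main(String[] args) {
        TombstoneSketch store = new TombstoneSketch();
        store.put("k", "v");
        store.flush();
        store.delete("k");
        System.out.println(store.get("k") + " / stored: " + store.storedEntries());
        store.flush();
        store.majorCompact();
        System.out.println(store.get("k") + " / stored: " + store.storedEntries());
    }
}
```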
Data Distribution and Serving -- Big Picture
Placement of Tablets and Data Serving
A tablet is assigned to one tablet server at a time.
Metadata for tablet locations and start/end rows is stored in a special Bigtable cell
Master maintains:
  • The set of live tablet servers
  • The current assignment of tablets to tablet servers (including the unassigned ones)
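Because each tablet covers a contiguous row range, finding the server for a row key reduces to a lookup over sorted start rows. The sketch below assumes the location metadata can be modelled as a map from each tablet's start row to its server; the class, the server names, and the use of the empty string as the first start row are all hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Maps a row key to the server holding the tablet whose range contains it.
class TabletLocatorSketch {
    // start row of each tablet -> server currently serving that tablet
    private final TreeMap<String, String> startRowToServer = new TreeMap<>();

    void assign(String startRow, String server) {
        startRowToServer.put(startRow, server);
    }

    // The owning tablet is the one with the greatest start row <= rowKey.
    String locate(String rowKey) {
        Map.Entry<String, String> e = startRowToServer.floorEntry(rowKey);
        return e == null ? null : e.getValue();
    }

    public static void main(String[] args) {
        TabletLocatorSketch meta = new TabletLocatorSketch();
        meta.assign("", "server-1");      // first tablet starts at the empty key
        meta.assign("Darth", "server-2"); // second tablet starts at "Darth"
        System.out.println(meta.locate("Alice00"));
        System.out.println(meta.locate("Emily"));
    }
}
```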
Region Servers – Physical Layout
RegionServer and DataNode
Interacting with HBase
HBase Schema Design







How many column families should the table have?
What data goes into what column family?
How many columns should be in each column family?
What should the column names be? Although column names don’t
have to be defined on table creation, you need to know them when
you write or read data.
What information should go into the cells?
How many versions should be stored for each cell?
What should the row key structure be, and what should it contain?
MiniTwitter Review
Read operations
  • Whom does TheFakeMT follow?
  • Does TheFakeMT follow TheRealMT?
  • Who follows TheFakeMT?
  • Does TheRealMT follow TheFakeMT?
Write operations
  • A user follows someone
  • A user unfollows someone
MiniTwitter – Version 1
Version 2
Read operation
  • How many people does a user follow?
Atomic operation!
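The "Atomic operation!" note presumably concerns keeping a per-user follow count correct when several clients update it at once; that reading is an assumption. The sketch below shows the difference an atomic read-modify-write makes, using ConcurrentHashMap.merge as a stand-in for a single-row atomic increment. HBase exposes a comparable single-row primitive (incrementColumnValue on its table interface), but the class and method names here are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.IntStream;

// Per-user follow counter updated with an atomic read-modify-write.
class FollowCounterSketch {
    // row key (user id) -> number of users this user follows
    private final ConcurrentHashMap<String, Long> counts = new ConcurrentHashMap<>();

    // merge() reads, adds, and writes back as one atomic step per key,
    // so concurrent callers cannot lose each other's updates.
    long increment(String userId, long delta) {
        return counts.merge(userId, delta, Long::sum);
    }

    long get(String userId) {
        return counts.getOrDefault(userId, 0L);
    }

    public static void main(String[] args) {
        FollowCounterSketch counter = new FollowCounterSketch();
        // Many concurrent "follow" events for the same user
        IntStream.range(0, 1000).parallel().forEach(i -> counter.increment("TheFakeMT", 1));
        System.out.println(counter.get("TheFakeMT"));
    }
}
```

A client that instead does a separate get, adds one, and then puts the result can silently drop updates under concurrency, which is the problem an atomic increment avoids.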
Version 3
Get rid of the counter
Problem
  • Row access overhead
Key Cardinality
Version 4
Wide table vs tall table
Version 4 – client code
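A minimal sketch of what client logic against a tall table could look like, assuming a composite row key of the form follower + separator + followed. The separator character, the class, and the method names are hypothetical, user ids are assumed not to contain the separator, and a sorted in-memory map stands in for the table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Tall-table layout: one row per (follower, followed) relationship.
class TallFollowsSketch {
    private static final char SEP = '+'; // hypothetical separator, must not occur in user ids

    // row key "follower+followed" -> followed user id
    private final TreeMap<String, String> rows = new TreeMap<>();

    void follow(String follower, String followed) {
        rows.put(follower + SEP + followed, followed);   // a single put
    }

    void unfollow(String follower, String followed) {
        rows.remove(follower + SEP + followed);          // a single delete
    }

    // "Does A follow B?" becomes a single-row lookup.
    boolean follows(String follower, String followed) {
        return rows.containsKey(follower + SEP + followed);
    }

    // "Whom does A follow?" becomes a scan over the key prefix "A+".
    List<String> followedBy(String follower) {
        String start = follower + SEP;
        String stop = follower + (char) (SEP + 1); // first key past the prefix
        return new ArrayList<>(rows.subMap(start, stop).values());
    }

    public static void main(String[] args) {
        TallFollowsSketch table = new TallFollowsSketch();
        table.follow("TheFakeMT", "TheRealMT");
        table.follow("TheFakeMT", "Olivia");
        System.out.println(table.follows("TheFakeMT", "TheRealMT"));
        System.out.println(table.followedBy("TheFakeMT"));
    }
}
```

With this layout no counter or per-row column bookkeeping is needed: following and unfollowing each touch exactly one row.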
Version 5

Trick with hash code
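One common reading of the hash-code trick, assumed here, is to hash each user id before composing the row key so that every key has the same length regardless of how long the ids are. The sketch uses MD5 from the Java standard library; the choice of MD5, the key layout, and all names are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Fixed-length composite row keys built from hashed user ids.
class HashedRowKeySketch {
    static byte[] md5(String s) {
        try {
            return MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Row key = md5(follower) followed by md5(followed): 16 + 16 = 32 bytes.
    static byte[] rowKey(String follower, String followed) {
        byte[] a = md5(follower);
        byte[] b = md5(followed);
        byte[] key = new byte[a.length + b.length];
        System.arraycopy(a, 0, key, 0, a.length);
        System.arraycopy(b, 0, key, a.length, b.length);
        return key;
    }

    public static void main(String[] args) {
        byte[] k1 = rowKey("TheFakeMT", "TheRealMT");
        byte[] k2 = rowKey("TheFakeMT", "Olivia");
        System.out.println(k1.length);
        // Same follower, so the first 16 bytes match and prefix scans still work
        System.out.println(Arrays.equals(Arrays.copyOf(k1, 16), Arrays.copyOf(k2, 16)));
    }
}
```

The trade-off is that the original ids can no longer be read back from the key, so they would need to be stored in the cell values if the application wants them.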
Normalization vs Denormalization
Additional Reading and Reference