CS 292 Special Topics on Big Data
Part II: NoSQL Database (Overview)
Yuan Xue ([email protected])
Outline

• From SQL to NoSQL
  – Motivation
  – Challenges
  – Approaches
  – Notable NoSQL systems
• Database User/Application Developer: How to use?
  – (Logical) data model and CRUD operations
• Database System Designer: How to design?
  – Under the hood: (physical) data model and distribution algorithm
• Database Designer: How to link application needs with database design
  – Schema design
• Summary and Moving Forward
  – Summary of NoSQL data modeling techniques
  – Summary of NoSQL data distribution models/algorithms
  – Limits of NoSQL
  – NewSQL
[Graphic: NoSQL data models – column family, key-value, document]
From SQL to NoSQL

Relational Database Review
• Persistent data storage
• Transaction support
  – ACID
  – Concurrency control + recovery
• Standard data interface for data sharing
  – SQL

[Diagram: design vs. operation]
• Design: Conceptual Design (Entity/Relationship model) → Logical Design (data model mapping → logical schema) → Normalization (normalized schema) → Physical Design (physical/internal schema)
• Operation: SQL queries
SQL Review -- Putting Things Together

[Diagram: users and application programs/queries → query processing → data access (database engine) → meta-data and data, all inside the DBMS]
http://www.dbinfoblog.com/post/24/the-query-processor/
From SQL to NoSQL

Motivation I: Scaling (up/out) a SQL Database
• Web-based application with a SQL database as the backend
  – High web traffic → large volume of transactions
  – More users → large amount of data
• Solution 1: Cache
  – E.g., memcached
  – Only handles read traffic
• Solution 2: Scale up (vertically)
  – Add more resources to a single node
• Solution 3: Scale out (horizontally)
  – Add more nodes to the DB system
  – Data distribution among multiple nodes → non-trivial
Scaling Out a SQL Database – Techniques and Challenges
• Two techniques: replication and sharding

Replication
• Master-slave
  – All writes go to the master
  – All reads are performed against the replicated slave databases
  – Duplication facilitates reads, but a data-consistency problem arises
  – Critical reads may be incorrect, as writes may not have propagated down yet
  – Large data sets can pose problems, as the master needs to duplicate data to the slaves
• Peer-to-peer
  – Writes can happen at any node
  – Inconsistent writes (which can be persistent)
  – SQL and multi-node clusters do not go well together

Sharding (partitioning the dataset across multiple nodes)
• Scales well for both reads and writes
• Not transparent: the application needs to be partition-aware
• Can no longer have relationships/joins across partitions
• Loss of referential integrity across shards
• Partitioning facilitates writes and reads, but transaction support is lost across partitions
From SQL to NoSQL

Motivation II: Limits of the Relational Data Model
• Impedance mismatch
  – The difference between in-memory data structures and the relational model
• Predefined schema
• Join operations
• Not appropriate for
  – Graph data
  – Geographical data
  – Unstructured data
From SQL to NoSQL
• Google: search engine
  – Store billions of documents → Bigtable + Google File System
• Amazon: online shopping
  – Shopping cart management → Dynamo

These became the foundation for:
• Open-source DBMSs: HBase + HDFS, Cassandra, Riak, Redis, MongoDB, …
  – Supported by many social media sites with large data needs (e.g., Facebook, Twitter)
• Cloud-hosted managed DBMSs: Amazon DynamoDB, Amazon SimpleDB, …
  – Utilized by companies such as IMDb and startups
What is NoSQL?
• Stands for Not Only SQL: not a relational database
• Umbrella term for many different types of data stores (databases)
• Different types of data model:
  – Key-value DB
  – Column-family DB
  – Document DB
  – Graph DB
• Just as there are different programming languages, NoSQL provides different data storage tools in the toolbox
  – Polyglot persistence
What is the Magic?
• Logical data model: from table to aggregate
  – Diverse data models: column family, key-value, document, graph
  – Aggregate-oriented
  – No predefined schema: an attribute/field can be added at run time
  – Still need to consider how to define the "key" and the "column family"
  – Gives up built-in joins
• Physical data handling: from ACID to BASE (CAP theorem coming up)
  – No full transaction support; support only at the aggregate level
  – Supports both replication and sharding (automatically)
  – Relaxes one or more of the ACID properties
NoSQL Database Classification -- Data Model View
• Key-value store
  – Dynamo
  – Riak
  – Redis
  – Memcached (in memory)
• Column family
  – HBase (BigTable)
  – Cassandra
• Document
  – MongoDB
  – Terrastore
• Graph
  – FlockDB
  – Neo4j
Transaction Review
• ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantees that database transactions are processed reliably.
• Atomicity: an "all-or-nothing" proposition
  – Each unit of work performed in the database must either complete in its entirety or have no effect whatsoever
• Consistency: conformance to database constraints
  – Each client, in each transaction, can assume all constraints hold when the transaction begins, and must guarantee all constraints hold when the transaction ends
• Isolation: serial equivalence
  – Operations may be interleaved, but execution must be equivalent to some sequential (serial) order of all transactions
• Durability: durable storage
  – If the system crashes after a transaction commits, all effects of the transaction remain in the database
ACID and Transaction Support in a Distributed Environment
• Recall: scaling out a database
  – Distributed environment with multiple nodes
  – Data are distributed across nodes via replication and sharding
• Concerns from the distributed networking environment
  – Message loss/delay
  – Network partition
• Can the ACID properties still hold for a database in a distributed environment?
  – The CAP theorem comes as a guideline
CAP Theorem
Start with three properties for distributed systems:
• Consistency: all nodes see the same data at the same time
• Availability: every request to a non-failing node in the system returns a response about whether it was successful or failed
• Partition tolerance: system properties (consistency and/or availability) hold even when the system is partitioned (communication is lost)
CAP Theorem: Consistency – Atomic Data Object
As in ACID:
1. In a distributed environment, multiple copies of a data item may exist on different nodes.
2. Consistency requires that all operations on a data item are executed atomically, as if they were performed on a single copy at a single instant.
[Diagram: clients accessing two copies of data item X on different nodes]
CAP Theorem: Availability – Available Data Object
Requests to data (read/write) always succeed:
1. All (non-failing) nodes remain able to read and write even when the network is partitioned.
2. A system that keeps some, but not all, of its nodes able to read and write is not available in the CAP sense, even if it remains available to clients and satisfies its SLAs for high availability.
Reference: https://foundationdb.com/white-papers/the-cap-theorem
CAP Theorem: Partition Tolerance
1. The network is allowed to lose arbitrarily many messages sent from one node to another.
2. When the network is partitioned, all messages from one component to another are lost.

Under partition tolerance:
• The consistency requirement implies that every data operation must be atomic, even though arbitrary messages may get lost.
• The availability requirement implies that every node receiving a request from a client must respond, even though arbitrary messages may get lost.
CAP Theorem
• You can have at most two of these three properties for any shared-data system: consistency, availability, and partition tolerance.
• To scale out, you have to support partition tolerance.
• NoSQL: either consistency or availability to choose from under a network partition.
[Diagram: CAP triangle – SQL sits on the consistency + availability edge; NoSQL must pick one side under partition tolerance]
NoSQL Database Classification -- View from the CAP Theorem
• Consistency + Availability: relational databases (MySQL, PostgreSQL)
• Availability + Partition tolerance: Dynamo and its derivatives (Cassandra, Riak)
• Consistency + Partition tolerance: BigTable and its derivatives (HBase, Redis, MongoDB)
More on Consistency
• Question: in an AP system, if the consistency property cannot hold, what property can be claimed?
• Example
  – Data item X is replicated on nodes M and N
  – Client A writes X to node N
  – Some period of time t elapses
  – Client B reads X from node M
  – Does client B see the write from client A?
• From the client's perspective, there are two kinds of consistency:
  – Strong consistency (as C in CAP): any subsequent access is guaranteed to return the updated value
  – Weak consistency: a subsequent access is not guaranteed to return the updated value
• Inconsistency window: the period between the update and the moment when it is guaranteed that any observer will always see the updated value → consistency is a continuum with tradeoffs
• Multiple consistency models
Eventual Consistency
• Eventual consistency: a specific form of weak consistency
  – When no updates occur for a long period of time, eventually all updates propagate through the system and all the nodes become consistent → all accesses return the last updated value
  – For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service
• Based on CAP:
  – SQL → NoSQL
  – ACID → BASE (Basically Available, Soft state, Eventual consistency)
    – Basically Available: the system seems to work all the time
    – Soft State: it doesn't have to be consistent all the time
    – Eventually Consistent: it becomes consistent at some later time
References and Additional Reading for CAP
• http://en.wikipedia.org/wiki/CAP_theorem
• Formal proof of the CAP theorem: "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services" by Seth Gilbert and Nancy Lynch
• Graphical illustration of the CAP theorem: http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
• Recent post from Brewer: "CAP Twelve Years Later: How the 'Rules' Have Changed"
• "Eventually Consistent" by Werner Vogels
BigTable and HBase

Introduction
• BigTable background
  – Development began in 2004 at Google (published in 2006)
  – Need to store/handle large amounts of (semi-)structured data
  – Many Google projects store data in BigTable: Google's web crawl, Google Earth, Google Analytics
• HBase background
  – Open-source implementation of BigTable, built on top of HDFS
  – Initial HBase prototype in 2007
  – Hadoop became an Apache top-level project and HBase became a subproject in 2008
Road Map
• Database User/Application Developer: How to use?
  – (Logical) data model and CRUD operations
• Database System Designer: How to design?
  – Under the hood: (physical) data model and distribution algorithm
• Database Designer: How to link application needs with database design
  – Schema design
Data Model
• A sparse, distributed, persistent, multidimensional sorted map
• The map is indexed by a row key, a column key, and a timestamp
  – (row:string, column:string, time:int64) → uninterpreted byte array
• Rows are maintained in sorted lexicographic order by row key
  – A row key is an arbitrary string
  – Every read or write of data under a single row is atomic
  – Row ranges are dynamically partitioned into tablets, the unit of distribution and load balancing
  – Applications can exploit this property for efficient row scans
Data Model (cont.)
• Columns are grouped into column families
  – Column key = family:qualifier
  – A column family must be created before data can be stored under a column key
  – Column families provide locality hints
  – Unbounded number of columns
Data Model (cont.)
• Timestamps
  – 64-bit integers, assigned by:
    – Bigtable: real time in microseconds
    – The client application: when unique timestamps are a necessity
  – Items in a cell are stored in decreasing timestamp order
  – The application specifies how many versions (n) of a data item are maintained in a cell
  – Bigtable garbage-collects obsolete versions
Data Model – MiniTwitter Example
• View as a map of maps (a sketch follows)
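Concretely, the map-of-maps view can be sketched in Java as a nested sorted map, keyed by row, column family, column qualifier, and timestamp. This is only an in-memory illustration of the logical model; the MiniTwitter names below are illustrative assumptions:

import java.util.Comparator;
import java.util.SortedMap;
import java.util.TreeMap;

public class MapOfMaps {
    public static void main(String[] args) {
        // rowKey -> columnFamily -> qualifier -> timestamp -> value
        SortedMap<String, SortedMap<String, SortedMap<String, SortedMap<Long, byte[]>>>> table =
            new TreeMap<>();

        // Record that TheFakeMT follows TheRealMT (illustrative names);
        // timestamps sort in decreasing order, mirroring Bigtable's cell versioning
        table.computeIfAbsent("TheFakeMT", k -> new TreeMap<>())
             .computeIfAbsent("follows", k -> new TreeMap<>())
             .computeIfAbsent("TheRealMT", k -> new TreeMap<>(Comparator.reverseOrder()))
             .put(System.currentTimeMillis(), new byte[] {1});
    }
}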
Operations & APIs in HBase
• Create and delete tables and column families; modify metadata
• Operations are based on row keys
• Single-row operations:
  – Put
  – Get
  – Delete
• Multi-row operations:
  – Scan
  – MultiPut
• Atomic read-modify-write sequences on data stored under a single row key (no support for transactions across multiple rows)
• No built-in joins
  – Can be done in the application, using scan() and get() operations or using MapReduce
Creating a Table

// Create a table with two column families via the HBase admin API
// Assumes: Configuration config = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(config);

HColumnDescriptor[] column = new HColumnDescriptor[2];
column[0] = new HColumnDescriptor("columnFamily1");
column[1] = new HColumnDescriptor("columnFamily2");

HTableDescriptor desc = new HTableDescriptor(Bytes.toBytes("MyTable"));
desc.addFamily(column[0]);
desc.addFamily(column[1]);

admin.createTable(desc);
Altering a Table
• Disable the table before changing the schema (see the sketch below)
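A minimal sketch of the disable/alter/enable cycle, again using the classic HBaseAdmin API; the table and family names are illustrative assumptions:

// Assumes: Configuration config = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(config);

admin.disableTable("MyTable");                          // take the table offline
admin.addColumn("MyTable",                              // schema change:
    new HColumnDescriptor("columnFamily3"));            // add a new column family
admin.enableTable("MyTable");                           // bring the table back online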
Single-row operations: Put()
• Insert a new record (with a new key), or
• Insert a record for an existing key
• The version number (timestamp) can be implicit or explicit
Put() in MiniTwitter
• Update user information (a sketch follows)
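A minimal sketch of such a Put, assuming a "users" table with an "info" column family (all names and values are illustrative):

// Assumes: Configuration config = HBaseConfiguration.create();
HTable usersTable = new HTable(config, "users");

// Implicit version: the server assigns the timestamp
Put p = new Put(Bytes.toBytes("TheRealMT"));
p.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Mark Twain"));
usersTable.put(p);

// Explicit version: the client supplies its own timestamp (here, version 1)
Put p2 = new Put(Bytes.toBytes("TheRealMT"));
p2.add(Bytes.toBytes("info"), Bytes.toBytes("email"), 1L, Bytes.toBytes("twain@example.com"));
usersTable.put(p2);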
Single-row operations: Get()
• Given a key → return the corresponding record
• For each value, the highest (most recent) version is returned by default
• You can control the number of versions you want
Get() in MiniTwitter
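A minimal sketch of such a Get, with the same illustrative table and family names as the Put example above:

// Assumes: Configuration config = HBaseConfiguration.create();
HTable usersTable = new HTable(config, "users");

Get g = new Get(Bytes.toBytes("TheRealMT"));
g.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"));
g.setMaxVersions(3);                        // ask for up to 3 versions, not just the latest

Result r = usersTable.get(g);
byte[] email = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));  // highest version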
Single-row operations: Delete()
• Marks table cells as deleted
• Multiple levels
  – Can mark an entire column family as deleted
  – Can mark all column families of a given row as deleted

// Delete the entire row
Delete d = new Delete(Bytes.toBytes("rowkey"));
userTable.delete(d);

// Delete only one column (all its versions) within the row
Delete d2 = new Delete(Bytes.toBytes("rowkey"));
d2.deleteColumns(Bytes.toBytes("cf"), Bytes.toBytes("attr"));
userTable.delete(d2);
Multi-row operations: Scan()
• Iterate over a range of rows in sorted row-key order (a sketch follows)
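A minimal sketch of a typical scan over a row-key range; the table, family, and boundary keys are illustrative assumptions:

// Assumes: Configuration config = HBaseConfiguration.create();
HTable usersTable = new HTable(config, "users");

Scan s = new Scan(Bytes.toBytes("Alice00"), Bytes.toBytes("Dave11"));  // [start, stop) row range
s.addFamily(Bytes.toBytes("info"));

ResultScanner scanner = usersTable.getScanner(s);
try {
    for (Result r : scanner) {
        // rows arrive in lexicographic row-key order
        System.out.println(Bytes.toString(r.getRow()));
    }
} finally {
    scanner.close();
}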
Road Map
• Database User/Application Developer: How to use?
  – (Logical) data model and CRUD operations
• Database System Designer: How to design?
  – Under the hood: (physical) data model and distribution algorithm
    – Single node: write, read, delete
    – Distributed system
• Database Designer: How to link application needs with database design
  – Schema design
Basics
Terms:

BigTable      | HBase
SSTable       | HFile
memtable      | MemStore
tablet        | region
tablet server | RegionServer
HFile/SSTable
• The basic building block of Bigtable
• A persistent, ordered, immutable map from keys to values
• A sequence of 64K blocks on disk, plus an index for block lookup
  – Stored in GFS
  – Can be completely mapped into memory
• Supported operations (see the sketch below):
  – Look up the value associated with a key
  – Iterate over key/value pairs within a key range
[Diagram: an SSTable as a sequence of 64K blocks followed by an index]
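The SSTable interface is essentially that of an immutable sorted map. A minimal in-memory analogy in Java, showing the two supported operations; this is an illustration, not the real block-based on-disk implementation:

import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class SSTableSketch {
    public static void main(String[] args) {
        // Ordered map from keys to values, analogous to one SSTable
        TreeMap<String, byte[]> sstable = new TreeMap<>();
        sstable.put("Alice00", new byte[] {1});
        sstable.put("Bob05",   new byte[] {2});
        sstable.put("Dave11",  new byte[] {3});

        // Operation 1: look up the value associated with a key
        byte[] v = sstable.get("Bob05");

        // Operation 2: iterate key/value pairs within a key range
        SortedMap<String, byte[]> range = sstable.subMap("Alice00", "Dave11");
        for (Map.Entry<String, byte[]> e : range.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().length + " bytes");
        }
    }
}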
HDFS: Hadoop Distributed File System
• The client requests metadata about a file from the NameNode
• Data is served directly from the DataNodes

File Read/Write in HDFS

File read (HDFS client, via DistributedFileSystem and FSDataInputStream):
1. open
2. get block locations from the NameNode
3. read
4. read from the closest DataNode
5. read from the 2nd closest DataNode (for subsequent blocks)
6. close

File write (HDFS client, via DistributedFileSystem and FSDataOutputStream):
1. create
2. create the file at the NameNode
3. write
4. get a list of 3 DataNodes
5. write packet (pipelined through the DataNodes)
6. ack packet
7. close
8. complete (reported to the NameNode)

If a DataNode crashes, the crashed node is removed, the current block receives a newer id (so that the partial data on the crashed node can be deleted later), and the NameNode allocates another node.
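As a rough client-side illustration of these steps, a sketch using the standard Hadoop FileSystem API; the file path is an illustrative assumption:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);    // contacts the NameNode

        // Write: create() registers the file with the NameNode; data is
        // pipelined in packets to a chain of (by default) 3 DataNodes
        FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"));
        out.write("hello hdfs".getBytes("UTF-8"));
        out.close();

        // Read: open() fetches block locations from the NameNode; bytes are
        // then read directly from the closest DataNode holding each block
        FSDataInputStream in = fs.open(new Path("/tmp/example.txt"));
        byte[] buf = new byte[1024];
        int n = in.read(buf);
        in.close();
        System.out.println(new String(buf, 0, n, "UTF-8"));
    }
}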
HBase: Logical Storage vs. Physical Storage
Region/Tablet
• A dynamically partitioned range of rows
• Built from multiple SSTables
• Column-family-oriented storage
[Diagram: a tablet covering the row range Start: Alice00 to End: Dave11, built from two SSTables, each a sequence of 64K blocks plus an index]
Table (HTable)
• Multiple tablets make up a table
• The entire BigTable is split into tablets covering contiguous ranges of rows
  – Approximately 100MB to 200MB each
• Tablets are split as their size grows
• SSTables can be shared between tablets
• Each column family is stored in a separate file
• Key and version numbers are replicated with each column family
• Empty cells are not stored
[Diagram: an HTable split into Tablet1 (rows Alice00 … Darth) and Tablet2 (rows Dave11 … Emily), with an SSTable shared between the tablets]
Source: graphic from slides by Erik Paulson
Table to Region
[Diagram: mapping from an HTable to its regions]

Physical Storage: MiniTwitter Example
[Diagram: the MiniTwitter HTable laid out in physical storage]
Write Path in HBase
• Writes are first appended to the HLog (an append-only write-ahead log on HDFS, one per RegionServer) and then applied to the in-memory MemStore; when the MemStore fills, it is flushed to a new HFile.
Read Path in HBase
• Reads merge the in-memory MemStore with the on-disk HFiles to produce the current view of a row.
Deletion and Compaction in HBase
• Delete() marks the record for deletion
  – A new "tombstone" record is written for that value
  – The data is physically removed later, during compaction
• Compaction terminology:

BigTable           | HBase
minor compaction   | flush
merging compaction | minor compaction
major compaction   | major compaction
Announcement
• Lab 1 due
• Lab 2 released (team up)
• Project team-up
• Quiz 1 graded
Data Distribution and Serving -- Big Picture
Placement of Tablets and Data Serving
• A tablet is assigned to one tablet server at a time
• Metadata for tablet locations and start/end rows are stored in a special Bigtable cell
• The master maintains:
  – The set of live tablet servers
  – The current assignment of tablets to tablet servers (including the unassigned ones)

RegionServer and DataNode
[Diagram: RegionServers co-located with HDFS DataNodes]
Interacting with HBase
HBase Schema Design
• How many column families should the table have?
• What data goes into what column family?
• How many columns should be in each column family?
• What should the column names be? Although column names don't have to be defined at table creation, you need to know them when you write or read data.
• What information should go into the cells?
• How many versions should be stored for each cell?
• What should the rowkey structure be, and what should it contain?
MiniTwitter Review
• Read operations
  – Whom does TheFakeMT follow?
  – Does TheFakeMT follow TheRealMT?
  – Who follows TheFakeMT?
  – Does TheRealMT follow TheFakeMT?
• Write operations
  – A user follows someone
  – A user unfollows someone

MiniTwitter – Version 1
Version 2
• Read operation: how many people does a user follow?
• Maintain a count, updated with an atomic operation (see the sketch below)
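A minimal sketch of maintaining such a count with HBase's atomic single-row counter increment; the table, family, and column names are illustrative assumptions:

// Assumes: Configuration config = HBaseConfiguration.create();
HTable usersTable = new HTable(config, "users");

// Atomically add 1 to TheFakeMT's count of followed users; the operation is
// atomic because the counter lives in a single row
long updated = usersTable.incrementColumnValue(
    Bytes.toBytes("TheFakeMT"),        // row key: the user
    Bytes.toBytes("info"),             // column family
    Bytes.toBytes("follows_count"),    // counter column
    1L);                               // amount to add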
Version 3
• Get rid of the counter
• Problem: row access overhead
Version 4
• Wide table vs. tall table

Version 4 – client code (a sketch follows)
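A minimal sketch of tall-table client code, assuming a "follows" table whose row key concatenates follower and followed with a "+" delimiter; all names are illustrative:

// Assumes: Configuration config = HBaseConfiguration.create();
HTable followsTable = new HTable(config, "follows");

// Tall-table design: row key = <follower>+<followed>, one relation per row.
// (A wide-table design would instead keep one row per user, with one column
// per followed user.)
Put p = new Put(Bytes.toBytes("TheFakeMT" + "+" + "TheRealMT"));
p.add(Bytes.toBytes("f"), Bytes.toBytes("followed"), Bytes.toBytes("TheRealMT"));
followsTable.put(p);

// "Whom does TheFakeMT follow?" becomes a prefix scan over the row keys;
// ',' is the byte after '+', so this bounds the scan to the prefix
Scan s = new Scan(Bytes.toBytes("TheFakeMT+"), Bytes.toBytes("TheFakeMT,"));
ResultScanner scanner = followsTable.getScanner(s);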
Version 5
• Trick with hash codes (a sketch follows)
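One common form of this trick is to replace the variable-length user IDs in the row key with fixed-length MD5 hashes, which makes every row key the same length (so no delimiter is needed) and spreads keys more evenly; treat the details below as assumptions:

import java.security.MessageDigest;

// Assumes: Configuration config = HBaseConfiguration.create();
MessageDigest md5 = MessageDigest.getInstance("MD5");
byte[] follower = md5.digest(Bytes.toBytes("TheFakeMT"));   // 16 bytes, fixed length
byte[] followed = md5.digest(Bytes.toBytes("TheRealMT"));   // digest() resets md5 each call

// Row key = MD5(follower) + MD5(followed): fixed 32-byte keys
byte[] rowKey = new byte[follower.length + followed.length];
System.arraycopy(follower, 0, rowKey, 0, follower.length);
System.arraycopy(followed, 0, rowKey, follower.length, followed.length);

Put p = new Put(rowKey);
p.add(Bytes.toBytes("f"), Bytes.toBytes("followed"), Bytes.toBytes("TheRealMT"));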
Normalization vs. Denormalization

Course:
CourseID | CourseName                 | Hour | Description
CS292    | Special Topics on Big Data | 3    | large-scale data processing
CS283    | Computer Networks          | 3    | networking technology

ClassSchedule:
ClassID   | CourseID | Semester | InstructorID | Classroom | Time
2014CS292 | CS292    | S2014    | xuey1        | FGH134    | Tue/Th 1:10-2:25
2014CS283 | CS283    | S2014    | jmatt        | FGH236    | Tue/Th 1:10-2:25

Registration:
ClassID   | StudentID | Grade
2014CS292 | balice1   | NULL

VandyUser:
VUNetID | FirstName | LastName | Email
xuey1   | Yuan      | Xue      | Yuan.xue
balice1 | Alice     | Burch    | Alice.burch

ClassSchedule (alternative form):
eID | SectionID | Semester | InstructorID | Classroom | Time
2   | 01        | S2014    | xuey1        | FGH134    | Tue/Th 1:10-2:25
3   | 01        | S2014    | jmatt        | FGH236    | Tue/Th 1:10-2:25