PowerPoint Template - Renmin University of China
Download
Report
Transcript PowerPoint Template - Renmin University of China
MVCC on Flash Memory
Company
LOGO
Fan Yulei,
Lab of WAMDM,
School of Information,
Renmin University of China,
Beijing, China,
2009-06-13
Outline
Motivation
MVCC
Berkeley DB
PostgreSQL
Future work
Motivation
Characteristics
Not In-Place Update
HDD
Flash
Motivation
Kinds of Lock
Log File & Data File
Checkpoint: D & S
1st : Lock
2nd : Release Lock
Multiple Version
2PL
ReadRecovery
Log file Undo & Redo
Backup Database
Hot-standby : mirrored media
CC
Directed Acycling
Graph
Timestamp Ordering
•2PL
•MVCC
Index : B+-Tree
•Conflict graph
•Timestamp
•Index CC
Read Log file Undo & Redo
Log
•
Transaction
Transaction
•
Media
•
MVCC
Snapshot Isolation
MVCC
Monoversion Schedule
Conflict cycle: t1,t2
s = r1(x) w1(x) r2(x) w2(y) r1(y) w1(z) c1 c2
s’ = r1(x) w1(x) r2(x) r1(y) w2(y) w1(z) c1 c2
Multiversion Schedule & Monoversion Schedule
Multiversion Schedule
• m = r1(x0) w1(x1) r2(x1) w2(y2) r1(y0) w1(z1) c1 c2
• h(ri(x))=wj(x) & h(wi(x))=wi(x): version function
Monoversion Schedule
• m = r1(x0) w1(x1) r2(x1) w2(y2) r1(y2) w1(z1) c1 c2
• s = r1(x) w1(x) r2(x) w2(y) r1(y) w1(z) c1 c2
• Monoversion Schedule is a special case of Multiversion Schedule
MVCC
Traditional Conflict
s = w0(x) c0 w1(x) c1 r2(x) w2(y) c2
m = w0(x0) c0 w1(x1) c1 r2(x0) w2(y2) c2
View Equivalent
Reads-From Relationship
• RF(m) := {(ti, x, tj) | rj(xi) ∈OP(m) & ti, tj ∈trans(m)}
View Equivalent
• trans(m) = trans(m’) and RF(m) = RF(m’)
Example
• m = w0(x0) w0 (y0) c0 r3(x0) w3(x3) c3 w1(x1) c1 r2(x1) w2(y2) c2
• m’ = w0(x0) w0 (y0) c0 w1(x1) c1 r2(x1) r3(x0) w2(y2) w3(x3) c3 c2
MVCC
Multiversion View Serializability
Serializable but not View Equivalent
• m = w0(x0) w0(y0)c0 r1(x0) r1(y0) w1(x1) w1(y1)c1 r2(x0) r2(y1)c2
• s = w0(x) w0(y)c0 r1(x) r1(y) w1(x) w1(y)c1 r2(x) r2(y)c2
MVSR
• m’ is a serialized monoversion schedule
• trans(m) = trans(m’)
• m and m’ are view equivalent
Example
• m = w0(x0) w0 (y0) c0 w1(x1) c1 r2(x1) r3(x0) w3(x3) c3 w2(y2) c2
• m’ = w0(x0) w0 (y0) c0 r3(x0) w3(x3) c3 w1(x1) c1 r2(x1) w2(y2) c2
• s = w0(x) w0 (y) c0 r3(x) w3(x) c3 w1(x) c1 r2(x) w2(y) c2
MVCC
Conflict Graph G(m) = (V , E)
r2(x0)r2(y1)
r2(x1)r2(y0)
V = trans(m) ;
E = {(ti, tj) | rj(xi) ∈OP(m) & ti, tj ∈trans(m)}}
m and m’ are View Equivalent => G(m) = G(m’)
Version Oder
m = w0(x0) w0 (y0) w0 (z0) c0 r1(x0) r2(x0) r2(z0) r3(z0) w1(y1) w2(x2)
w3(y3) w3(z3) c1c2c3 r4(x2) r4(y3) r4(z3) c4
T1
Version Oder = {x0«x2, y0«y1«y3, z0«z3}
MVSG
T3
T0
M
MVSG = G(m) + Version Order
rk(xj) and wi(xi), k≠i≠j
If xi « xj then (ti, tj) ∈E; else (tk, ti) ∈E
∈ MVSR iff MVSG(m, «) have no cycle
T2
T4
MVCC
Multiversion Conflict
ri(xj) and wk(xk) and ri(xj) < wk(xk)
Multiversion Conflict Serializability
m’ is a serialized monoversion schedule
trans(m) = trans(m’)
Pair of operations with conflict: same ordering
Multiversion Conflict Graph
all
E={(ti, tk) | ri(xj) < wk(xk) }
MVSR
MCSR
M ∈ MVCR iff MSVG(m, «) have
VSR no cycle
CSR
MVCC
Limit the number of version: k=2
w0(x0) c0 r1(x0) w3(x3) c3 w1(x1) c1 r2(x1) w2(x2) c2
w0(x0) c0 r1(x0) w1(x1) c1 r2(x1) w2(x2) c2 w3(x3) c3
w0(x0) c0 r1(x0) w1(x1) c1 w3(x3) c3 r2(x3) w2(x2) c2
w0(x0) c0 r2(x0) w2(x2) c2 r1(x2) w1(x1) c1 w3(x3) c3
w0(x0) c0 r2(x0) w2(x2) c2 w3(x3) c3 r1(x3) w1(x1) c1
w0(x0) c0 w3(x3) c3 r1(x3) w1(x1) c1 r2(x1) w2(x2) c2
w0(x0) c0 w3(y3) c3 r2(x3) w2(x2) c2 r1(x2) w1(x1) c1
K-version view serializability (kVSR):
Serializable
View equivalent
k newest/nearest version
Hierarchy Relationship
VSR 1VSR 2VSR 3VSR MVSR
x1,x2
x2,x3
x2,x3
x1,x3
x1,x3
x1,x2
x1,x2
MVCC
MVCC Protocol
MVTO (multiversion timestamp ordering)
MV2PL : 2VPL
• three kinds of kinds: rl, wl, cl
MVSGT
ROMV
• Read-only transaction
Berkeley DB
Five components
a standalone utility
Deadlock detection
one or more library interfaces
• db_deadlock
• DB_ENV->lock_detect, DB_ENV->set_lk_detect
Checkpoints
• db_checkpoint
• DB_ENV->txn_checkpoint
Database and log file archival
• db_archive
• DB_ENV->log_archive
Log file removal
• db_archive
• DB_ENV->log_archive
Recovery procedures
• db_recover
• DB_ENV->open
Berkeley DB
Transaction API
Transaction Subsystem and Related Methods Description
• DB_ENV->txn_checkpoint,
DB_ENV->txn_stat
• DB_ENV->open
DB_ENV->remove
DB_ENV->txn_recover
DB_ENV->close
Transaction Subsystem Configuration
• DB_ENV->set_timeout
DB_ENV->set_tx_timestamp
DB_ENV->set_tx_max
Transaction Operations
• DB_ENV->txn_begin
DB_TXN->commit
DB_TXN->id
DB_TXN->set_name
DB_TXN->abort
DB_TXN->discard
DB_TXN->prepare
DB_TXN->set_timeout
Berkeley DB
2PL In Berkeley DB
Locks are released
• during DB_TXN->abort or DB_TXN->commit.
Guidelines:
• If possible, use nested transactions to protect the parts of
your transaction most likely to deadlock
Transaction limits
Transaction IDs: 31-bit unsigned integer (OX80000000)
Cursors: can not span more transactions, must be opened
and closed within a single transaction
Multiple Threads of Control:
Berkeley DB
Several filesystem operations on Berkeley DB
Disk seek to database file,
Database file read,
Disk seek to log file,
Log file write,
Disk seek to update log file metadata, Log metadata write,
Flush log file information to disk,
Flush log file metadata to disk
Ways to increase transactional throughput
Berkeley DB software support group commit
Additional tuning parameters
• Tune the size of the database cache
• Put the database and the log files on different disks
• Set the filesystem configuration
• Upgrade your hardware
• Turn on DB_TXN_WRITE_NOSYNC or DB_TXN_NOSYNC flags
– ACI, but not D
PostgreSQL
PG: a sanpshot of data
Reading never blocks writing
Writing never blocks reading
Three undesirable phenomena
dirty reads, non-repeatable reads, phantom read
SQL Transaction Isolation Levels
Isolation Level
Dirty Read
Non-Repeatable Read
Phantom Read
Read uncommitted
Possible
Possible
Possible
Read committed
Not possible
Possible
Possible
Repeatable read
Not possible
Not possible
Possible
Serializable
Not possible
Not possible
Not possible
PostgreSQL
Read Committed Isolation Level
the default isolation level
A SELECT query sees only data committed
The SELECT does see the effects of previous updates
executed within this same transaction
Two successive SELECTs can see different data
• Other transactions commit changes during executions
NOT adequate for many applications that do complex
queries and updates
Serializable Isolation Level
This level emulates serial transaction execution.
PostgreSQL
Data consistency checks at the application level
Readers in PostgreSQL don't lock data
To ensure the current existence of a row and protect it
against concurrent updates one must use SELECT FOR
UPDATE or an appropriate LOCK TABLE statement.
(SELECT FOR UPDATE locks just the returned rows
against concurrent updates, while LOCK TABLE protects
the whole table.)
Lock and Tables
Table-level Lock
Row-level : when rows are being updated
Lock and Index
Gist and R-tree : released after statement is done
Hash Index : released after page is processed
B-Tree : released immediately after each index tuple is
fetched/inserted
ASL
RSL
REL
SUEL
SL
SREL
EL
AEL
AccessShareLock
√
√
√
√
√
√
√
×
RowShareLock
√
√
√
√
√
√
×
×
RowExclusiveLock
√
√
√
√
×
×
×
×
ShareUpdateExclusiveLock
√
√
√
×
×
×
×
×
ShareLock
√
√
×
×
√
×
×
×
ShareRowExclusiveLock
√
√
×
×
×
×
×
×
ExclusiveLock
√
×
×
×
×
×
×
×
AccessExclusiveLock
×
×
×
×
×
×
×
×
SR
DR
IR
UR
AT
DT
CI
LT
√
√
√
√
√
√
√
√
AccessShareLock
RowShareLock
RowExclusiveLock
√
√
√
√
√
√
√
ShareUpdateExclusiveLock
√
ShareLock
√
ShareRowExclusiveLock
√
ExclusiveLock
√
AccessExclusiveLock
√
√
√
Future work
Experiment
BDB & PG Code
Transaction on Flash Memory
Concurrency Control
• MVCC
Recovery
• Log
Company
LOGO