Transcript: Flash Memory Database Systems and In-Page Logging (IPL)
Flash Talk
Flash Memory Database Systems and In-Page Logging
Bongki Moon
Department of Computer Science, University of Arizona, Tucson, AZ 85721, U.S.A.
In collaboration with Sang-Won Lee (SKKU), Chanik Park (Samsung)
KOCSEA’09, Las Vegas, December 2009
Magnetic Disk vs Flash SSD
Champion for 50 years: Seagate ST340016A (40 GB, 7200 RPM)
New challengers: Intel X25-M Flash SSD (80 GB, 2.5-inch); Samsung Flash SSD (128 GB, 2.5/1.8-inch)
Past Trend of Disk
• From 1983 to 2003 [Patterson, CACM 47(10) 2004]:
  Capacity increased about 2,500 times (0.03 GB → 73.4 GB)
  Bandwidth improved 143.3 times (0.6 MB/s → 86 MB/s)
  Latency improved 8.5 times (48.3 ms → 5.7 ms)
Year   Product             Capacity   RPM     Bandwidth (MB/s)   Media diameter (in)   Latency (ms)
1983   CDC 94145-36        0.03 GB    3600    0.6                5.25                  48.3
1990   Seagate ST41600     1.4 GB     5400    4                  5.25                  17.1
1994   Seagate ST15150     4.3 GB     7200    9                  3.5                   12.7
1998   Seagate ST39102     9.1 GB     10000   24                 3.0                   8.8
2003   Seagate ST373453    73.4 GB    15000   86                 2.5                   5.7
I/O Crisis in OLTP Systems
• I/O becomes the bottleneck in OLTP systems, which process a large number of small random I/O operations.
• Common practice to close the gap:
  Use a large disk farm to exploit I/O parallelism: tens or hundreds of disk drives per processor core (e.g., the IBM Power 596 Server uses 172 15k-RPM disks per processor core).
  Adopt short-stroking to reduce disk latency: use only the outer tracks of the disk platters.
• Other concerns are raised too: wasted capacity of disk drives and an increased amount of energy consumption.
• Then, what happens 18 months later? To catch up with Moore’s law and keep CPU and I/O balanced (Amdahl’s law), must the number of spindles be doubled yet again? (A rough sizing sketch follows.)
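To make the spindle-count arithmetic concrete, here is a back-of-the-envelope sketch in Python using the random-I/O figures quoted later in this talk (about 450 IOPS per 15k-RPM disk, 10,500 read IOPS for an enterprise-class SSD); the function name and rounding are illustrative assumptions, not measurements from this work.

import math

DISK_RANDOM_IOPS = 450        # one 15k-RPM disk, small random I/O
EC_SSD_READ_IOPS = 10_500     # enterprise-class SSD, random reads

def spindles_needed(target_iops, per_disk_iops=DISK_RANDOM_IOPS):
    """Number of disk spindles needed to deliver target_iops of small random I/O."""
    return math.ceil(target_iops / per_disk_iops)

print(spindles_needed(EC_SSD_READ_IOPS))        # ~24 disks to match one SSD on random reads
print(spindles_needed(2 * EC_SSD_READ_IOPS))    # doubling the I/O demand doubles the disk farm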
Flash News in the Market
• Sun Oracle Exadata Storage Server [Sep 2009]: each Exadata cell comes with 384 GB of flash cache.
• MySpace dumped disk drives [Oct 2009]: went all-flash, saving 99% on power.
• Google Chrome OS ditched disk drives [Nov 2009]: an SSD is the key to its 7-second boot time.
• Gordon at UCSD/SDSC [Nov 2009]: 64 TB RAM, 256 TB flash, 4 PB disk.
• IBM hooked up with Fusion-io [Dec 2009]: an SSD storage appliance for the System x server line.
Flash for Database, Really?
• Immediate benefit for some DB operations:
  Reduce commit-time delay with fast logging
  Reduce read time for multi-versioned data
  Reduce query processing time (sort, hash)
• What about the big, fat tables? Random, scattered I/O is very common in OLTP. Can a flash SSD, with its slow random writes, handle this?
Transactional Log
[Diagram: SQL queries flow through the system buffer cache to the database table space, transaction (redo) log, temporary table space, and rollback segments; this slide focuses on the transaction (redo) log]
Commit-time Delay by Logging
• Write-Ahead Logging (WAL): a committing transaction force-writes its log records, which makes the latency hard to hide.
• With a separate disk for logging: no seek delay, but still half a revolution of the spindle on average: 4.2 msec (7200 RPM), 2.0 msec (15k RPM).
• With a flash SSD: about 0.4 msec.
[Diagram: transactions T1 … Tn issue SQL through the buffer; commit forces the log buffer out to the LOG device alongside the DB]
• Commit-time delay remains a significant overhead. Group commit helps, but the delay does not go away altogether.
• How much commit-time delay? On average, 8.2 msec (HDD) vs. 1.3 msec (SSD), a 6-fold reduction, in a TPC-B benchmark with 20 concurrent users. (A quick latency calculation follows.)
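The rotational-delay figures above follow directly from the spindle speed; a minimal sketch (naming is mine, purely illustrative):

def avg_rotational_delay_ms(rpm):
    """Average rotational latency: time for half a revolution, in milliseconds."""
    ms_per_revolution = 60_000.0 / rpm
    return ms_per_revolution / 2.0

print(round(avg_rotational_delay_ms(7_200), 1))    # ~4.2 ms, as quoted above
print(round(avg_rotational_delay_ms(15_000), 1))   # ~2.0 ms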
Rollback Segments
[Diagram: SQL queries flow through the system buffer cache to the database table space, transaction (redo) log, temporary table space, and rollback segments; this slide focuses on the rollback segments]
MVCC Rollback Segments
• Multi-Version Concurrency Control (MVCC):
  An alternative to traditional lock-based concurrency control
  Supports read consistency and snapshot isolation
  Used by Oracle, PostgreSQL, Sybase, SQL Server 2005, MySQL
• Rollback segments:
  Each transaction is assigned to a rollback segment.
  When an object is updated, its current value is recorded in the rollback segment sequentially (in append-only fashion).
  To fetch the correct version of an object, check whether it has been updated by other transactions. (A sketch of this mechanism follows.)
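As a rough illustration of this mechanism, the Python sketch below models an append-only version chain and a read-consistent lookup; the class and field names are my own assumptions, not any vendor's rollback-segment implementation, and the example values mirror the version-chain figure on the MVCC read performance slide.

from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Version:
    txn_id: int                       # transaction that created this version
    value: Any
    prev: Optional["Version"] = None  # older version, kept in some rollback segment

def record_update(current, txn_id, new_value):
    """Append-only update: the old version stays where it is; the new one chains to it."""
    return Version(txn_id=txn_id, value=new_value, prev=current)

def read_consistent(current, snapshot_txn):
    """Walk the version chain until a version visible to the reader's snapshot is found.
    Each hop may be a random read into a different rollback segment."""
    v = current
    while v is not None and v.txn_id > snapshot_txn:
        v = v.prev
    return v.value if v is not None else None

# Example: T0 wrote 200, T1 wrote 100, T2 wrote 50 (not yet visible to this reader).
a = Version(txn_id=0, value=200)
a = record_update(a, txn_id=1, new_value=100)
a = record_update(a, txn_id=2, new_value=50)
print(read_consistent(a, snapshot_txn=1))   # 100: the T2 update is skipped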
MVCC Write Pattern
• Write requests from a TPC-C workload:
  Concurrent transactions generate multiple streams of append-only traffic in parallel (approximately 1 MB apart).
  The HDD moves its disk arm very frequently; the SSD is unaffected, since the append-only traffic never runs into flash's no-in-place-update limitation.
MVCC Read Performance
[Diagram: version chain for object A: the current copy, last written by T2, points back to a rollback-segment version 100 written by T1, which points back to a version 200 written by T0]
• To support multi-version read consistency, I/O activity increases:
  A long chain of old versions may have to be traversed for each access to a frequently updated object.
  Read requests are scattered randomly: old versions of an object may be stored in several rollback segments.
• With an SSD, a 10-fold read-time reduction was not surprising.
Database Table Space
[Diagram: SQL queries flow through the system buffer cache to the database table space, transaction (redo) log, temporary table space, and rollback segments; this slide focuses on the database table space]
Workload in Table Space
• TPC-C workload:
  Exhibits little locality and sequentiality: a mix of small/medium/large read-write and read-only (join) requests.
  Highly skewed: 84% (75%) of accesses go to 20% of the tuples (pages).
  Write caching is not as effective as read caching: the physical read/write ratio is much lower than the logical read/write ratio.
• All bad news for a flash memory SSD, due to no in-place update and asymmetric read/write speeds.
• Hence the In-Page Logging (IPL) approach [SIGMOD’07].
In-Page Logging (IPL)
• Key ideas of the IPL approach:
  Changes are written to a log instead of being updated in place, avoiding frequent write and erase operations.
  Log records are co-located with their data pages: there is no need to write them sequentially to a separate log region, and current data can be read more efficiently than with sequential logging.
  The DBMS buffer and storage managers work together.
Design of the IPL
• Logging on a per-page basis, in both memory and flash:
  Database buffer: an in-memory data page (8 KB, updated in place) plus an in-memory log sector (512 B).
  Flash memory: an erase unit of 128 KB holds 15 data pages (8 KB each) and a log area (8 KB, 16 sectors); the log area is shared by all the data pages in the erase unit.
  An in-memory log sector can be associated with a buffer frame in memory; it is allocated on demand when a page becomes dirty.
  An in-flash log segment is allocated in each erase unit. (See the layout sketch below.)
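A minimal layout sketch of one erase unit under the sizes just listed (constant names are mine):

KB = 1024

ERASE_UNIT      = 128 * KB
DATA_PAGE_SIZE  = 8 * KB
PAGES_PER_UNIT  = 15              # data pages per erase unit
LOG_AREA_SIZE   = 8 * KB          # shared log area per erase unit
LOG_SECTOR_SIZE = 512
LOG_SECTORS     = LOG_AREA_SIZE // LOG_SECTOR_SIZE   # 16 sectors

# 15 data pages plus the shared 8 KB log area exactly fill one 128 KB erase unit.
assert PAGES_PER_UNIT * DATA_PAGE_SIZE + LOG_AREA_SIZE == ERASE_UNIT
assert LOG_SECTORS == 16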
IPL Write
• Data pages in memory are updated in place, and physiological log records are written to the page's in-memory log sector.
• The in-memory log sector is written to the in-flash log segment when the data page is evicted from the buffer pool, or when the log sector becomes full.
• When a dirty page is evicted, its content is not written to flash memory; the previous version remains intact there. Data pages and their log records are physically co-located in the same erase unit. (A sketch of this write path follows.)
[Diagram: an update/insert/delete causes an update-in-place plus a physiological log record in the buffer pool (sector: 512 B, page: 8 KB, block: 128 KB), which flows to the data block area of flash memory]
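A minimal sketch of this write path, assuming a toy page representation; class and method names are my own illustration, not the SIGMOD'07 implementation:

class FlashLogSegments:
    """Stands in for the per-erase-unit log areas on flash (16 x 512 B sectors each)."""
    PAGES_PER_UNIT = 15

    def __init__(self):
        self.log = {}                                   # erase-unit id -> list of log sectors

    def append_log_sector(self, page_id, records):
        unit = page_id // self.PAGES_PER_UNIT           # the page's erase unit
        self.log.setdefault(unit, []).append(list(records))

class IplBufferPage:
    def __init__(self, page_id, data):
        self.page_id = page_id
        self.data = data
        self.log_sector = None                          # allocated on demand when the page gets dirty

    def update(self, key, value):
        self.data[key] = value                          # update-in-place in memory
        if self.log_sector is None:
            self.log_sector = []                        # the 512 B in-memory log sector
        self.log_sector.append(("set", key, value))     # physiological log record

    def evict(self, flash):
        # Only the log records are flushed; the stale page image on flash stays intact.
        if self.log_sector:
            flash.append_log_sector(self.page_id, self.log_sector)
            self.log_sector = None

flash = FlashLogSegments()
page = IplBufferPage(page_id=3, data={})
page.update("x", 42)
page.evict(flash)        # the log sector lands in page 3's erase-unit log area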
IPL Read
• When a page Pi is read from flash, its current version is computed on the fly:
  Read the original copy of Pi and all log records belonging to Pi from flash (I/O overhead).
  Apply the physiological actions to the copy read from flash to re-construct the current in-memory copy (CPU overhead). (A sketch follows.)
[Diagram: the buffer pool re-constructs Pi from the flash data area (120 KB: 15 pages) and the log area (8 KB: 16 sectors)]
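A matching read-path sketch, reusing the toy representation from the write-path sketch above (names are illustrative):

def ipl_read(flash_image, log_sectors):
    """Re-construct the current page from its stale on-flash image plus all of its
    log records found in the erase unit's log area."""
    page = dict(flash_image)              # I/O: read the original 8 KB copy
    for sector in log_sectors:            # I/O: read the co-located log sectors
        for op, key, value in sector:     # CPU: replay the physiological log records
            if op == "set":
                page[key] = value
    return page

# The stale on-flash image was empty; the log area records that x was set to 42.
print(ipl_read({}, [[("set", "x", 42)]]))     # {'x': 42}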
IPL Merge
• When all free log sectors in an erase unit are consumed:
  Log records are applied to the corresponding data pages, and the current data pages are copied into a new erase unit.
  A merge consumes, erases, and releases only one erase unit. (See the sketch below.)
[Diagram: physical flash block B_old (15 data pages plus a full 8 KB log area of 16 sectors) is merged into B_new, which holds 15 up-to-date data pages and a clean log area; B_old can then be erased]
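A merge sketch in the same style (illustrative structures, not the actual implementation): apply each page's log records, copy the up-to-date pages into a fresh erase unit, and erase the old one.

def ipl_merge(old_unit):
    """old_unit: {'pages': {page_id: stale_image}, 'log': {page_id: [log records]}}"""
    new_unit = {"pages": {}, "log": {}}                       # clean log area in the new unit
    for page_id, stale in old_unit["pages"].items():
        current = dict(stale)
        for op, key, value in old_unit["log"].get(page_id, []):
            if op == "set":                                   # replay this page's log records
                current[key] = value
        new_unit["pages"][page_id] = current                  # up-to-date page copied over
    # old_unit can now be erased and released: exactly one erase per merge.
    return new_unit

old = {"pages": {3: {}}, "log": {3: [("set", "x", 42)]}}
print(ipl_merge(old))    # {'pages': {3: {'x': 42}}, 'log': {}}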
Industry Response
• Common in enterprise-class SSDs:
  Multi-channel, inter-command parallelism: throughput rather than raw bandwidth, and write-followed-by-read patterns.
  Command queuing (SATA-II NCQ).
  Large RAM buffer (with super-capacitor backup), even up to 1 MB per GB, used for write-back caching and controller data (mapping, wear leveling).
  Fat provisioning (up to ~20% of capacity).
• Impressive improvement:

Prototype/Product   Read (IOPS)   Write (IOPS)
EC SSD              10,500        2,500
X25-M               20,000        1,200
15k-RPM Disk        450           450
EC-SSD Architecture
• Parallel/interleaved operations:
  8 channels, 2 packages per channel, 4 chips per package.
  Two-plane page write, block erase, and copy-back operations.
[Diagram: host interface (SATA-II) connects to the main controller (ARM9, ECC, 128 MB DRAM), which drives flash controllers and NAND flash packages over 8 channels]
Concluding Remarks
• Recent advances cope with random I/O better:
  Write IOPS are 100x higher than in early SSD prototypes.
  TPS is 1.3~2x higher than a RAID-0 array of 8 HDDs for a read-write TPC-C workload, with much less energy consumed.
• Writes still lag behind: IOPS_Disk < IOPS_SSD-Write << IOPS_SSD-Read, with IOPS_SSD-Read / IOPS_SSD-Write = 4 ~ 17.
• A lot more issues to investigate: flash-aware buffer replacement, I/O scheduling, energy, fluctuation in performance, tiered storage architectures, virtualization, and much more.
Question?
• For more of our flash memory work:
  In-Page Logging [SIGMOD’07]
  Logging, Sort/Hash, Rollback [SIGMOD’08]
  SSD Architecture and TPC-C [SIGMOD’09]
  In-Page Logging for Indexes [CIKM’09]
• Even more? www.cs.arizona.edu/~bkmoon