Flash Memory Database Systems and IPL


Flash Talk

Flash Memory Database Systems and In-Page Logging

Bongki Moon

Department of Computer Science, University of Arizona, Tucson, AZ 85721, U.S.A.

[email protected]

In collaboration with Sang-Won Lee (SKKU), Chanik Park (Samsung)

KOCSEA’09, Las Vegas, December 2009

Magnetic Disk vs Flash SSD

Champion for 50 years: Seagate ST340016A, 40 GB, 7200 RPM

New challengers: Intel X25-M Flash SSD, 80 GB, 2.5 inch; Samsung Flash SSD, 128 GB, 2.5/1.8 inch

Past Trend of Disk

From 1983 to 2003 [Patterson, CACM 47(10) 2004]

Capacity increased about 2500 times (0.03 GB → 73.4 GB)

Bandwidth improved 143.3 times (0.6 MB/s → 86 MB/s)

Latency improved 8.5 times (48.3 ms → 5.7 ms)

Year   Product            Capacity   RPM     Bandwidth (MB/s)   Media diameter (inch)   Latency (ms)

1983   CDC 94145-36       0.03 GB    3600    0.6                5.25                    48.3
1990   Seagate ST41600    1.4 GB     5400    4                  5.25                    17.1
1994   Seagate ST15150    4.3 GB     7200    9                  3.5                     12.7
1998   Seagate ST39102    9.1 GB     10000   24                 3.0                     8.8
2003   Seagate ST373453   73.4 GB    15000   86                 2.5                     5.7

I/O Crisis in OLTP Systems

I/O becomes the bottleneck in OLTP systems, which process a large number of small random I/O operations

Common practice to close the gap:

Use a large disk farm to exploit I/O parallelism: tens or hundreds of disk drives per processor core (e.g., an IBM Power 596 Server with 172 15k-RPM disks per processor core)

Adopt short-stroking to reduce disk latency: use only the outer tracks of the disk platters

Other concerns are raised too: wasted capacity of disk drives and increased energy consumption

Then, what happens 18 months later? To catch up with Moore's law and keep CPU and I/O balanced (Amdahl's law), must the number of spindles be doubled again?


Flash News in the Market

Sun Oracle Exadata Storage Server [Sep 2009]: each Exadata cell comes with 384 GB of flash cache

MySpace dumped disk drives [Oct 2009]: went all flash, cutting power consumption by 99%

Google Chrome ditched disk drives [Nov 2009]: SSD is the key to the 7-second boot time

Gordon at UCSD/SDSC [Nov 2009]: 64 TB RAM, 256 TB flash, 4 PB disk

IBM hooked up with Fusion-io [Dec 2009]: SSD storage appliance for the System x server line


Flash for Database, Really?

Immediate benefit for some DB operations

Reduce commit-time delay by fast logging

Reduce read time for multi-versioned data

Reduce query processing time (sort, hash)

What about the Big Fat Tables?

Random scattered I/O is very common in OLTP

Can flash SSDs, with their slow random writes, handle this?


Transactional Log

(Diagram: SQL queries, system buffer cache, database table space, transaction (redo) log, temporary table space, rollback segments)

Commit-time Delay by Logging

Write-Ahead Logging (WAL): a committing transaction force-writes its log records, which makes it hard to hide the latency

With a separate disk for logging there is no seek delay, but half a revolution of the spindle is needed on average: 4.2 msec at 7200 RPM, 2.0 msec at 15k RPM

With a flash SSD: about 0.4 msec

(Diagram: transactions T1 ... Tn, SQL layer, buffer pool with page p_i, log buffer; writes go to the DB and LOG devices)

Commit-time delay remains a significant overhead; group commit helps, but the delay does not go away altogether

How much commit-time delay? On average, 8.2 msec (HDD) vs 1.3 msec (SSD) on the TPC-B benchmark with 20 concurrent users: a 6-fold reduction
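To make the commit-time force-write concrete, here is a minimal sketch in Python (hypothetical names, not the system measured in the talk) of a WAL commit path: each committer appends its records and then waits for a forced write, and a later committer's fsync can cover earlier ones, which is the essence of group commit.

    import os, threading

    # Sketch only: commit() appends log records and then forces them to stable
    # storage; if another transaction's fsync already covered our records, the
    # forced write is skipped (a simple form of group commit).
    class WriteAheadLog:
        def __init__(self, path="wal.log"):
            self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
            self.mutex = threading.Lock()
            self.appended = 0        # bytes appended to the log so far
            self.durable = 0         # bytes known to be on stable storage

        def commit(self, records):
            with self.mutex:
                for rec in records:
                    os.write(self.fd, rec)
                    self.appended += len(rec)
                my_lsn = self.appended
            self._wait_durable(my_lsn)   # the commit-time delay happens here

        def _wait_durable(self, lsn):
            with self.mutex:
                if self.durable >= lsn:
                    return               # a group-mate's fsync already made us durable
                os.fsync(self.fd)        # one forced write covers the whole log tail
                self.durable = self.appended

On a disk each forced write costs roughly half a revolution of the spindle; on a flash SSD the same call returns in a fraction of a millisecond, which is where the 6-fold commit-time reduction above comes from.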


Rollback Segments

(Diagram: SQL queries, system buffer cache, database table space, transaction (redo) log, temporary table space, rollback segments)

MVCC Rollback Segments

Multi-version Concurrency Control (MVCC): an alternative to traditional lock-based concurrency control that supports read consistency and snapshot isolation (Oracle, PostgreSQL, Sybase, SQL Server 2005, MySQL)

Rollback Segments

Each transaction is assigned to a rollback segment

When an object is updated, its current value is recorded in the rollback segment sequentially (in append-only fashion)

To fetch the correct version of an object, check whether it has been updated by other transactions; a sketch of this version lookup follows below
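To illustrate that lookup, here is a minimal sketch in Python (hypothetical record layout and names, not any particular engine): if the current row was written by a transaction the reader must not see, the chain of older values stored in the rollback segments is walked until a visible version is found.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class UndoRecord:
        writer_txn: int                  # transaction that produced this older value
        value: object
        prev: Optional["UndoRecord"]     # next-older version, possibly in another segment

    @dataclass
    class Row:
        writer_txn: int                  # last writer of the current version
        value: object
        undo: Optional[UndoRecord]       # head of the chain in the rollback segments

    def read_consistent(row, snapshot_txn, is_visible):
        """Return the newest value whose writer is visible to snapshot_txn."""
        if is_visible(row.writer_txn, snapshot_txn):
            return row.value             # current version is already visible
        rec = row.undo
        while rec is not None:           # each hop may be a random read from a segment
            if is_visible(rec.writer_txn, snapshot_txn):
                return rec.value
            rec = rec.prev
        raise LookupError("no visible version for this snapshot")

Every hop in the while loop can land in a different rollback segment, which is why access to a frequently updated object turns into scattered random reads, as the next slides note.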


MVCC Write Pattern

Write requests from the TPC-C workload: concurrent transactions generate multiple streams of append-only traffic in parallel (roughly 1 MB apart)

An HDD has to move the disk arm very frequently; an SSD sees no penalty from its no-in-place-update limitation, since the traffic is append-only


MVCC Read Performance

(Diagram: transactions T0, T1, T2 and successive versions of object A (e.g., 200, 100, 50) chained through rollback segments)

To support multi-version read consistency, I/O activity increases: a long chain of old versions may have to be traversed for each access to a frequently updated object, the read requests are scattered randomly, and the old versions of an object may be stored in several rollback segments

With an SSD, a 10-fold reduction in read time was not surprising


Database Table Space

(Diagram: SQL queries, system buffer cache, database table space, transaction (redo) log, temporary table space, rollback segments)

Workload in Table Space

TPC-C workload:

Exhibits little locality and sequentiality

Mix of small/medium/large read-write and read-only (join) transactions

Highly skewed: 84% (75%) of accesses go to 20% of the tuples (pages)

Write caching is not as effective as read caching: the physical read/write ratio is much lower than the logical read/write ratio

All bad news for a flash memory SSD, due to the no-in-place-update limitation and the asymmetric read/write speeds

This motivates the In-Page Logging (IPL) approach [SIGMOD'07]


In-Page Logging (IPL)

Key Ideas of the IPL Approach

Changes are written to a log instead of being updated in place, avoiding frequent write and erase operations

Log records are co-located with data pages:

No need to write them sequentially to a separate log region

Current data can be read more efficiently than with sequential logging

The DBMS buffer and storage managers work together


Design of the IPL

Logging on a per-page basis in both memory and flash

(Diagram: database buffer with an 8 KB in-memory data page, updated in place, and a 512 B in-memory log sector; flash memory with 128 KB erase units, each holding 15 data pages of 8 KB and an 8 KB log area of 16 sectors)

The log area is shared by all the data pages in an erase unit

An in-memory log sector can be associated with a buffer frame in memory; it is allocated on demand when a page becomes dirty

An in-flash log segment is allocated in each erase unit

This layout is sketched in code below.
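A minimal sketch of that layout in Python (the constants come from the slide; the class and field names are hypothetical):

    ERASE_UNIT_SIZE = 128 * 1024     # one flash erase unit: 128 KB
    PAGE_SIZE = 8 * 1024             # data page: 8 KB
    LOG_SECTOR_SIZE = 512            # log sector: 512 B
    PAGES_PER_UNIT = 15              # 15 data pages per erase unit (120 KB)
    LOG_AREA_SIZE = ERASE_UNIT_SIZE - PAGES_PER_UNIT * PAGE_SIZE   # 8 KB
    LOG_SECTORS_PER_UNIT = LOG_AREA_SIZE // LOG_SECTOR_SIZE        # 16 sectors

    class EraseUnit:
        """One 128 KB erase unit: 15 data pages plus a log area shared by all of them."""
        def __init__(self):
            self.data_pages = [bytearray(PAGE_SIZE) for _ in range(PAGES_PER_UNIT)]
            self.log_sectors = []            # filled as in-memory log sectors are flushed

        def log_area_full(self):
            return len(self.log_sectors) >= LOG_SECTORS_PER_UNIT   # triggers a merge

    class BufferFrame:
        """An 8 KB page in the DBMS buffer with its on-demand in-memory log sector."""
        def __init__(self, page_no):
            self.page_no = page_no
            self.page = bytearray(PAGE_SIZE)
            self.log_sector = None           # allocated when the page first becomes dirty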

IPL Write

Data pages in memory are updated in place, and physiological log records are written to the page's in-memory log sector

The in-memory log sector is written to the in-flash log segment when the data page is evicted from the buffer pool, or when the log sector becomes full

When a dirty page is evicted, its content is not written to flash memory; the previous version remains intact

Data pages and their log records are physically co-located in the same erase unit

A sketch of this write path follows the diagram below.

(Diagram: an update/insert/delete is applied in place in the buffer pool and also generates a physiological log record; sector = 512 B, page = 8 KB, block = 128 KB; flash memory data block area)
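A minimal sketch of the write path in Python (hypothetical names; flash I/O is left as a callback, so this is only an illustration of the idea, not the authors' implementation):

    LOG_SECTOR_SIZE = 512
    PAGE_SIZE = 8 * 1024

    class BufferFrame:
        def __init__(self, page_no):
            self.page_no = page_no
            self.page = bytearray(PAGE_SIZE)
            self.log_sector = None           # allocated on demand at the first update
            self.dirty = False

    def apply_update(frame, offset, new_bytes, flush_log_sector):
        """Update the page in place and log the change; flush the sector if it fills up."""
        frame.page[offset:offset + len(new_bytes)] = new_bytes       # update-in-place
        if frame.log_sector is None:
            frame.log_sector = bytearray()
        # a physiological log record: which page, where, and the new bytes
        frame.log_sector += b"UPD:%d:%d:" % (frame.page_no, offset) + new_bytes
        frame.dirty = True
        if len(frame.log_sector) >= LOG_SECTOR_SIZE:                 # log sector became full
            flush_log_sector(frame)                                  # write it to the in-flash log segment
            frame.log_sector = bytearray()

    def evict(frame, flush_log_sector):
        """On eviction only the log sector goes to flash; the page itself is not rewritten."""
        if frame.dirty and frame.log_sector:
            flush_log_sector(frame)
        frame.log_sector = None
        frame.dirty = False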


IPL Read

When a page P_i is read from flash, the current version is computed on the fly:

Read from flash the original copy of P_i and all log records belonging to P_i (I/O overhead)

Apply the physiological actions to the copy read from flash to reconstruct the current in-memory copy (CPU overhead)

(Diagram: buffer pool above a flash erase unit with a 120 KB data area of 15 pages and an 8 KB log area of 16 sectors)

A sketch of this read path follows below.
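A minimal, self-contained sketch of the read path in Python (the FlashStub and record layout are hypothetical stand-ins for one erase unit):

    from dataclasses import dataclass

    PAGE_SIZE = 8 * 1024

    @dataclass
    class LogRecord:
        page_no: int
        offset: int
        payload: bytes

    class FlashStub:
        """Stands in for one erase unit: 15 stale data pages plus their log area."""
        def __init__(self):
            self.pages = {pno: bytearray(PAGE_SIZE) for pno in range(15)}
            self.log = []                       # physiological log records, oldest first

        def read_data_page(self, page_no):
            return bytes(self.pages[page_no])   # I/O: the original (stale) copy

        def read_log_records(self, page_no):
            return [r for r in self.log if r.page_no == page_no]   # I/O: its log records

    def read_page(flash, page_no):
        image = bytearray(flash.read_data_page(page_no))
        for rec in flash.read_log_records(page_no):      # CPU: replay each logged action
            image[rec.offset:rec.offset + len(rec.payload)] = rec.payload
        return image                                     # current in-memory copy

    # Usage: one logged update to page 3 is visible in the reconstructed copy.
    flash = FlashStub()
    flash.log.append(LogRecord(page_no=3, offset=100, payload=b"new value"))
    assert bytes(read_page(flash, 3)[100:109]) == b"new value"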

IPL Merge

When all free log sectors in an erase unit are consumed:

The log records are applied to the corresponding data pages

The current data pages are copied into a new erase unit

The merge consumes, erases, and releases only one erase unit

(Diagram: merge of physical flash block B_old, whose 8 KB log area of 16 sectors is full, into a new block B_new holding 15 up-to-date data pages and a clean log area; B_old can then be erased)

A sketch of the merge follows below.
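A minimal sketch of the merge in Python (hypothetical names; allocating and erasing physical flash blocks are left as callbacks):

    def merge(old_unit, allocate_unit, erase_unit):
        """Apply old_unit's log records to its pages, copy them to a fresh unit, erase the old one.

        old_unit is assumed to have .data_pages (a list of bytearray) and .log_records,
        where each record carries page_index, offset, and payload.
        """
        new_unit = allocate_unit()                       # clean erase unit, empty log area
        for i, page in enumerate(old_unit.data_pages):
            current = bytearray(page)
            for rec in old_unit.log_records:             # replay this page's log records in order
                if rec.page_index == i:
                    current[rec.offset:rec.offset + len(rec.payload)] = rec.payload
            new_unit.data_pages[i] = current             # up-to-date page needs no log records
        new_unit.log_records = []                        # clean log area
        erase_unit(old_unit)                             # only one erase unit is erased and released
        return new_unit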

Industry Response

Common in Enterprise Class SSDs

Multi-channel, inter-command parallelism

Throughput matters more than raw bandwidth; write-followed-by-read patterns

Command queuing (SATA-II NCQ)

Large RAM Buffer (with super-capacitor backup)

Even up to 1 MB per GB

Write-back caching, controller data (mapping, wear leveling)

Fat provisioning (up to ~20% of capacity)

Impressive improvement

Prototype/Product   Read (IOPS)   Write (IOPS)

EC SSD              10500         2500
X25-M               20000         1200
15k-RPM Disk        450           450


EC-SSD Architecture

Parallel/interleaved operations

8 channels, 2 packages/channel, 4 chips/package

Two-plane page write, block erase, copy-back operations

(Block diagram: SATA-II host interface, ARM9 main controller, DRAM (128 MB), flash controller with ECC, and 8 channels of NAND flash packages)


Concluding Remarks

Recent advances cope with random I/O better

Write IOPS is 100x higher than in early SSD prototypes

TPS is 1.3~2x higher than a RAID-0 array of 8 HDDs for the read-write TPC-C workload, with much less energy consumed

Write still lags behind:

IOPS(Disk) < IOPS(SSD write) << IOPS(SSD read), with IOPS(SSD read) / IOPS(SSD write) = 4 ~ 17

A lot more issues to investigate:

Flash-aware buffer replacement, I/O scheduling, Energy

Fluctuation in performance, Tiered storage architecture

Virtualization, and much more …


Questions?

For more of our flash memory work:

In-Page Logging [SIGMOD'07]

Logging, Sort/Hash, Rollback [SIGMOD'08]

SSD Architecture and TPC-C [SIGMOD'09]

In-Page Logging for Indexes [CIKM'09]

Even More?

www.cs.arizona.edu/~bkmoon
