Transcript Document

Lecture 5:
Wrap-up RAID
Flash memory
Prof. Shahram Ghandeharizadeh
Computer Science Department
University of Southern California
Mental Block from Last Lecture

Question: Why?
Level 4 and Level 5 from last lecture's discussion.

With RAID 4, why is the performance of
small writes D/2G?

To write block b (see the sketch after the figure):

1. Read the old block b and the old parity block ECC 1.
2. Compute the new parity using the old block b, the new block b, and the old parity:
   New parity ECC 1 = (old block b xor new block b) xor old parity ECC 1
3. Write the new block b and the new parity block ECC 1.
[Figure: RAID 4 layout. Disks 1-4 store Blocks a-d; a dedicated Parity disk stores ECC 1.]
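A minimal sketch (not from the lecture; the block size and contents below are made up) of the read-modify-write steps above, treating blocks as Python bytes and parity as byte-wise XOR:

# Sketch of a RAID 4 small write: update block b and the parity block ECC 1
# without touching blocks a, c, and d.  Block size and contents are hypothetical.

def xor_blocks(x: bytes, y: bytes) -> bytes:
    """Byte-wise XOR of two equal-sized blocks."""
    return bytes(i ^ j for i, j in zip(x, y))

BLOCK_SIZE = 8  # toy block size

# Current contents of the four data blocks and the parity block.
a = bytes([1] * BLOCK_SIZE)
b = bytes([2] * BLOCK_SIZE)
c = bytes([3] * BLOCK_SIZE)
d = bytes([4] * BLOCK_SIZE)
ecc1 = xor_blocks(xor_blocks(a, b), xor_blocks(c, d))  # parity = a xor b xor c xor d

# Small write of block b:
new_b = bytes([9] * BLOCK_SIZE)
old_b, old_ecc1 = b, ecc1                                   # 1. read old b and old ECC 1
new_ecc1 = xor_blocks(xor_blocks(old_b, new_b), old_ecc1)   # 2. compute the new parity
b, ecc1 = new_b, new_ecc1                                   # 3. write new b and new ECC 1

# The incrementally maintained parity matches a full recomputation.
assert ecc1 == xor_blocks(xor_blocks(a, b), xor_blocks(c, d))
print("parity updated with 2 reads and 2 writes")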
Small Write with RAID 4
[Figure: timeline of a small write to block b. Disk 2 (block b's disk): Read old b, then Write new b. Parity disk: Read old ECC 1, then Write new ECC 1. In between: new parity ECC 1 = (old b xor new b) xor old ECC 1. Disks 1, 3, and 4 are idle.]
Note that a write to a block on Disk 3 cannot proceed in parallel
because the Parity disk is busy!
A single disk would perform the write of block b without reading it first.
With 1 group, Level 4 RAID performs ½ the number of small write events when compared with 1 disk.
With nG groups, Level 4 RAID performs nG/2 times the number of small write events of 1 disk, where nG = D/G. This yields D/2G. (A worked example follows.)
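A small worked example of the D/2G figure, using hypothetical values for D and G:

# Hypothetical configuration: D data disks organized into groups of G data disks,
# each group with its own parity disk (RAID 4).
D = 8          # total data disks
G = 4          # data disks per group
nG = D / G     # number of groups = 2

# Each group's parity disk serializes that group's writes, so a group sustains
# half the small writes of a single disk; groups work independently.
small_writes_vs_one_disk = nG / 2           # = D / (2 * G)
print(small_writes_vs_one_disk)             # 1.0, i.e. D/2G = 8/(2*4)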
RAID 4

Two groups may perform write
operations independently.
RAID 5: Resolve the Bottleneck

With Level 5 RAID, different disks may perform
different small write operations simultaneously.
[Figure: RAID 5 layout with the parity blocks rotated across five disks.]
Disk 1    Disk 2    Disk 3    Disk 4    Disk 5
Block a   Block b   Block c   Block d   ECC 1
Block e   Block f   Block g   ECC 2     Block h
Block i   Block j   ECC 3     Block k   Block l
Block m   ECC 4     Block n   Block o   Block p
ECC 5     Block q   Block r   Block s   Block t
RAID 5

Example: Write blocks a and f simultaneously and initiate a part of the write for block j.

[Figure: Disk 1: Read a, then Write a. Disk 2: Read f, then Write f. Disk 3: Read ECC 3 (the first step of the write for block j). Disk 4: Read ECC 2, then Write ECC 2. Disk 5: Read ECC 1, then Write ECC 1. In between, compute the new parity blocks for a and f.]
All disks are busy reading data and parity blocks.
A write requires 4 I/Os.
RAID 5



A small write with Level 5 RAID performs 4 times as many I/Os as a write to one disk (2 reads and 2 writes instead of 1 write).
To compare with one disk, divide the total number of operations supported by the data disks by 4.
Total number of small writes for 1 group:
  D/4 + ¼ (for the 1 check disk)
With nG groups, there are nG check disks:
  D/4 + nG*C/4, where nG = D/G
  = D/4 + (D/4 * C/G)
Level 5 RAID

D/4 + (D/4 * C/G)
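A quick check of the formula above with hypothetical values of D, C, and G:

# RAID 5 small writes relative to a single disk: D/4 + (D/4 * C/G).
# Parity is rotated, so all disks contribute, and each small write costs 4 I/Os.
D, C, G = 8, 1, 4
nG = D / G                                  # number of groups = 2
small_writes = D / 4 + nG * C / 4           # = D/4 + (D/4 * C/G)
assert small_writes == D / 4 + (D / 4) * (C / G)
print(small_writes)                         # 2.5 times the small writes of one disk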
A Comment

The definitions may appear somewhat arbitrary and far-fetched!

However, they are applied consistently.
Flash Memory

Goetz Graefe. The Five-Minute Rule
Twenty Years Later, and How Flash
Memory Changes the Rules. DaMoN
2007.
Alternative Storage Mediums

Magnetic disk drive

Flash memory

Dynamic Random Access Memory
(DRAM)
Flash Memory [Kim et al. 02]



Nonvolatile storage medium: stored data is retained after power is turned off.
Supports random access to data.
Comes in two types:

NOR: can read/erase/write 1 byte individually.
NAND: optimized for mass storage and supports read/erase/write of a block.

A block consists of multiple pages. A page is typically 512 bytes. A block is somewhere between 4 KB and 128 KB.
Write performance of NAND is an order of magnitude higher than that of NOR.
Flash Storage

Comes with different interfaces:

UFD: USB Flash Disk. Throughput is price dependent; typically quoted at:
  Read throughput of 8 to 16 MBps.
  Write throughput of 6 to 12 MBps.

Flash memory card:
  Accessed as memory.
  Typically byte-accessible.

Flash disk:
  Accessed through a disk interface.
  Block-accessible.

Focus of this paper is on the flash disk.
Flash Memory



Reads are faster than writes because a
write of a page (512 bytes) requires a
block to be erased.
Sequential writes are fast because the
interface has a cache and manages
write operations intelligently.
Random write operations are slow
because of the erase operations and a
small cache.
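A toy simulation (hypothetical 4-page blocks; not how any particular flash translation layer actually behaves) of why updating a single page in place is expensive on NAND flash:

# Toy NAND block: a page may only be programmed when it is in the erased state;
# returning a programmed page to the erased state requires erasing the whole block.
PAGES_PER_BLOCK = 4
ERASED = 0xFF

class ToyNandBlock:
    def __init__(self):
        self.pages = [ERASED] * PAGES_PER_BLOCK
        self.erase_count = 0

    def erase(self):
        self.pages = [ERASED] * PAGES_PER_BLOCK
        self.erase_count += 1

    def program(self, page_no, value):
        # Programming is only allowed on an erased page (real NAND can only clear bits).
        assert self.pages[page_no] == ERASED, "page must be erased first"
        self.pages[page_no] = value

    def update_page_in_place(self, page_no, value):
        # An in-place update forces: read the surviving pages, erase, reprogram all.
        survivors = list(self.pages)
        self.erase()
        for i, old in enumerate(survivors):
            self.program(i, value if i == page_no else old)

block = ToyNandBlock()
for i in range(PAGES_PER_BLOCK):
    block.program(i, i)            # initial sequential fill: no erases needed
block.update_page_in_place(2, 42)  # one small random write
print(block.pages, "erases:", block.erase_count)   # [0, 1, 42, 3] erases: 1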
Flash: Sequential Reads/Write [Gray’08]



Read/Write performance is sensitive to request size.
Read performance is significantly better than write performance.
Throughput plateaus at 53 MBps for reads and 35 MBps for writes.

Note the higher throughput for flash disk when compared with UFD.
Flash: Random Read/Write [Gray'08]



Read performance is comparable to sequential reads.
Write performance is very poor, 216 KBps with 8KB writes (27
requests per second).
Poor performance of random writes is being addressed – might have
been addressed already! (A fast moving field!)
Disk & Flash [Gray 08]


Disk provides a higher bandwidth with sequential reads/writes.
With random reads, flash blows disk away!


Why?
When one considers power consumption, IOPS/watt of flash is
very impressive!
Flash

Reliability of flash suffers after 100,000
to 1,000,000 erase-and-write cycles.

Less reliable (a lower MTTF) than
magnetic disk assuming a write intensive
workload.
Characteristics


RAM is faster than the other two storage
mediums.
Flash disk consumes less power than disk
because there are no moving parts.
Disk and DRAM

Question: When does it make economic sense to make a piece of data resident in DRAM, and when does it make sense to have it resident on disk, where it must be moved to main memory prior to reading or writing it?

Assumptions:

1. Fixed-size disk pages, say 4 Kilobytes.
2. A 250 GB disk costs $80 and supports 83 page reads per second. So the price per page read per second is about $1.
3. 1 MB of DRAM holds 256 disk pages and costs $0.047 per megabyte. So, the cost of a disk page occupying DRAM is $0.000184.
4. If making a page memory resident saves 1 page access per second (a/s), then it saves $1. A good deal. If it saves 0.1 page a/s, then it saves 10 cents; still a good deal.
5. The break even point is an access every $1/$0.000184 ≈ 5,435 seconds, which is roughly 90 minutes.
6. In 1987, this break even point was 2 minutes.

(See the worked calculation below.)
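The break-even arithmetic above as a short script, using the 2007 figures quoted on the slide:

# Break-even reference interval for keeping a 4 KB page DRAM-resident
# versus fetching it from a magnetic disk (2007 numbers from the slide).
disk_price = 80.0            # $ per 250 GB disk
disk_page_reads_per_sec = 83
dram_price_per_mb = 0.047    # $ per MB of DRAM
pages_per_mb = 256           # 4 KB pages in 1 MB

price_per_page_read_per_sec = disk_price / disk_page_reads_per_sec   # ~ $1
price_of_page_in_dram = dram_price_per_mb / pages_per_mb             # ~ $0.000184

break_even_seconds = price_per_page_read_per_sec / price_of_page_in_dram
print(round(break_even_seconds / 60), "minutes")    # roughly 90 minutes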
Disk and DRAM: Moral of the story



In 2007, pages referenced every 90 minutes should be DRAM resident.
In 1987, pages referenced every 5 minutes should be DRAM resident (the original five-minute rule, computed for 1 KB records).
Key observation:


Focus is on memory space and disk
bandwidth!
Is something missing from this analysis?
Assumed Page Size Matters

A larger page size enhances the
throughput of a magnetic disk drive.

How?
Assumed Page Size Matters

A larger page size enhances the
throughput of a magnetic disk drive.

With small page sizes (1 KB), seek and
rotational latency result in a lower disk
throughput, and a higher cost per a/s.
Flash and DRAM

Question: When does it make economic sense to make a piece of data resident in DRAM, and when does it make sense to have it resident on Flash, where it must be moved to main memory prior to reading or writing it?

Assumptions:

1. Fixed-size disk pages, say 4 Kilobytes.
2. A 32 GB Flash disk costs $999 and supports 6,200 page reads per second. So the price per page read per second is about $0.16.
3. 1 MB of DRAM holds 256 disk pages and costs $0.047 per megabyte. So, the cost of a disk page occupying DRAM is $0.000184.
4. If making a page memory resident saves 1 page a/s, then it saves $0.16. A good deal. If it saves 0.1 page a/s, then it saves $0.016; still a good deal.
5. The break even point is an access every $0.16/$0.000184 ≈ 870 seconds, which is roughly 15 minutes.
6. If the price of flash drops to $400, the break even point is 6 minutes.

(See the worked calculation below.)
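The same arithmetic for the Flash and DRAM case, again with the slide's figures:

# Break-even interval for keeping a 4 KB page DRAM-resident versus
# fetching it from the $999, 32 GB flash disk quoted on the slide.
flash_price = 999.0
flash_page_reads_per_sec = 6200
price_of_page_in_dram = 0.047 / 256                 # ~ $0.000184, as before

price_per_page_read_per_sec = flash_price / flash_page_reads_per_sec   # ~ $0.16
break_even_seconds = price_per_page_read_per_sec / price_of_page_in_dram
print(round(break_even_seconds / 60), "minutes")    # roughly 15 minutes

# If the flash disk price drops to $400:
print(round((400.0 / flash_page_reads_per_sec) / price_of_page_in_dram / 60),
      "minutes")                                    # roughly 6 minutes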
Flash and DRAM: Moral of the story




With the 2007 price of $999, pages referenced every 15 minutes should be DRAM resident.
With the anticipated price of $400, pages referenced every 6 minutes should be DRAM resident.
Focus is on DRAM space and Flash bandwidth!
Is something missing from this
analysis?
What is Missing?

Page size matters (the same discussion as for Disk and DRAM).

With a flash disk, the throughput of reads and writes is asymmetrical, even with sequential reads and writes.

A 32 GB Flash disk costs $999 and
supports 30 page writes per second. So
the price per page write per second is
about $33. (For reads, it is 16 cents.)
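A quick check of the read/write cost asymmetry, using the slide's figures:

# Price per page access per second on the $999 flash disk, reads versus writes.
flash_price = 999.0
print(round(flash_price / 6200, 2))   # ~ $0.16 per page read per second
print(round(flash_price / 30, 2))     # ~ $33.30 per page write per second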
Disk and Flash Memory


With Flash memory, the available flash is accessible in the same manner as DRAM.
The read and write performance of Flash memory is different from that of DRAM.
One may repeat the analysis to establish a Δ-Minute rule for magnetic disk and flash memory; see the discussion of Table 3 in the Graefe paper.
Possible Software Architectures?



Extended buffer pool: Flash is an extension of DRAM.
Extended disk: Flash is an extension of disk.
Treat DRAM, Flash, and magnetic disk independently using a new cache management technique:

  Trojan storage manager.

This paper focuses on the first two possibilities, using LRU to manage their content (a toy sketch of the extended buffer pool follows).
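A toy sketch (hypothetical class and parameters, not the paper's design) of the extended buffer pool option: flash holds pages evicted from DRAM, both levels are managed with LRU, and a miss checks flash before going to the magnetic disk:

# Toy "extended buffer pool": flash caches pages evicted from DRAM.
from collections import OrderedDict

class ExtendedBufferPool:
    def __init__(self, dram_pages, flash_pages, disk):
        self.dram = OrderedDict()   # page_id -> contents, kept in LRU order
        self.flash = OrderedDict()
        self.dram_cap, self.flash_cap = dram_pages, flash_pages
        self.disk = disk            # dict acting as the magnetic disk

    def read(self, page_id):
        if page_id in self.dram:                     # DRAM hit
            self.dram.move_to_end(page_id)
            return self.dram[page_id]
        if page_id in self.flash:                    # flash hit: promote to DRAM
            contents = self.flash.pop(page_id)
        else:                                        # miss: fetch from disk
            contents = self.disk[page_id]
        self._install(page_id, contents)
        return contents

    def _install(self, page_id, contents):
        if len(self.dram) >= self.dram_cap:          # evict DRAM's LRU page to flash
            victim, v_contents = self.dram.popitem(last=False)
            if len(self.flash) >= self.flash_cap:    # evict flash's LRU page entirely
                self.flash.popitem(last=False)
            self.flash[victim] = v_contents
        self.dram[page_id] = contents

pool = ExtendedBufferPool(dram_pages=2, flash_pages=4,
                          disk={p: f"page {p}" for p in range(10)})
for p in [0, 1, 2, 0, 3]:
    pool.read(p)
print(list(pool.dram), list(pool.flash))   # DRAM holds the 2 most recently read pages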
Architecture Choice

Choice of an architecture depends on the pattern of usage. This study claims:

File systems and operating systems prefer the “extended buffer pool” architecture.
DBMSs prefer the “extended disk” architecture.
Why?
Usage Pattern

File system/OS:

  Pointer pages maintain data pages or runs of contiguous pages.
  Movement of a page requires writing the page and the entire pointer page.
  During recovery, the file system checks the entire storage.
  Many random I/Os!
  Extended buffer pool architecture.

DBMS, assuming logging with immediate database modification:

  Data is stored in B-tree indexes.
  Writing a page requires appending a few bytes to the log file.
  The log file is flushed using large sequential write operations.
  During recovery, the DBMS plays log records sequentially.
  Large I/Os!
  Extended disk architecture.
LOG-BASED RECOVERY

[Figure: transaction T1 transfers 50 from A to B with immediate database modification. Initially A=1000 and B=10.]
(1) T1 starts; <T1, start> is appended to the log.
(2) Read(A) brings A=1000 into main memory.
(3) A=A-50; Write(A) appends <T1, A, 1000, 950> to the log and sets A=950.
(4) Read(B) brings B=10 into main memory.
(5) B=B+50, so B=60.
(6) Write(B) appends <T1, B, 10, 60> to the log.
(7) Commit appends <T1, commit> to the log.
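A minimal sketch (not from the lecture) of how a recovery manager could replay a log of <T, item, old value, new value> records such as the one above: redo the transactions that committed, undo the ones that did not:

# Toy log for the example above: T1 transfers 50 from A to B and commits.
log = [
    ("T1", "start"),
    ("T1", "A", 1000, 950),
    ("T1", "B", 10, 60),
    ("T1", "commit"),
]

def recover(log, database):
    committed = {t for t, *rest in log if rest == ["commit"]}
    started = {t for t, *rest in log if rest == ["start"]}
    # Redo committed transactions in log order using the new values.
    for rec in log:
        if len(rec) == 4 and rec[0] in committed:
            _, item, old, new = rec
            database[item] = new
    # Undo uncommitted transactions in reverse log order using the old values.
    for rec in reversed(log):
        if len(rec) == 4 and rec[0] in (started - committed):
            _, item, old, new = rec
            database[item] = old
    return database

print(recover(log, {"A": 1000, "B": 10}))   # {'A': 950, 'B': 60}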
Checkpointing

Motivation: In the presence of failures, the system consults the log file to determine which transactions should be redone and which should be undone. There are two major difficulties:

1) the search process is time consuming, and
2) most transactions are okay since their updates have made it to the database (the system performs wasteful work by searching through and redoing these transactions).

Approach: perform a checkpoint that requires the following operations (see the sketch below):

1. output all log records from main memory to the disk,
2. output all modified (dirty) pages in the buffer pool to the disk,
3. output a log record <checkpoint> onto the log file on disk.
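A toy sketch (hypothetical classes) of the three checkpoint steps listed above:

# Toy checkpoint: flush in-memory log records, flush dirty buffer-pool pages,
# then append a <checkpoint> record to the on-disk log.
class ToyDBMS:
    def __init__(self):
        self.log_buffer = []      # log records still in main memory
        self.log_on_disk = []
        self.buffer_pool = {}     # page_id -> (contents, dirty?)
        self.disk_pages = {}

    def checkpoint(self):
        # 1. output all log records from main memory to the disk
        self.log_on_disk.extend(self.log_buffer)
        self.log_buffer.clear()
        # 2. output all modified (dirty) pages in the buffer pool to the disk
        for page_id, (contents, dirty) in self.buffer_pool.items():
            if dirty:
                self.disk_pages[page_id] = contents
                self.buffer_pool[page_id] = (contents, False)
        # 3. output a <checkpoint> log record onto the log file on disk
        self.log_on_disk.append("<checkpoint>")

db = ToyDBMS()
db.log_buffer.append("<T1, A, 1000, 950>")
db.buffer_pool["A"] = (950, True)
db.checkpoint()
print(db.log_on_disk, db.disk_pages)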
Checkpointing (Cont…)


Dirty pages and log records stored on flash storage persist during a failure.
There is no need to flush them to the disk drive.

If the DBMS assumes the extended buffer pool architecture, the checkpoint operation will flush data to disk unnecessarily!

This is a motivation for the extended disk architecture with transaction-processing (xact) applications!
Checkpointing

Unsure about the following argument:
B+-TREE

A B+-tree is a multi-level, tree-structured directory.

[Figure: a B+-tree with a Root, Internal Nodes, and Leaf Nodes pointing to the Data File.]


A node is a page. A larger node has a higher fan-out, reducing the depth of the tree.
The utility of a node is measured by the logarithm of the number of records in the node. A larger node has a higher utility.
B+-Tree



Using the Flash-Disk hardware combination, a page size of 256 KB or 512 KB maximizes the utility/time value (see the calculation below).
Note that access time does not change as a function of page size.
The 3rd column (utility) is a log function of the 2nd column (records per page).
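A small calculation in the spirit of the paper's utility/time table; the access latency, transfer rate, and record size below are made-up values, not the paper's:

import math

# Hypothetical disk parameters; records assumed to be 100 bytes each.
ACCESS_LATENCY = 0.015        # seconds of seek + rotation per page access
TRANSFER_RATE = 100 * 2**20   # bytes per second
RECORD_SIZE = 100

for page_kb in [4, 16, 64, 128, 256, 512, 1024]:
    page_bytes = page_kb * 1024
    records = page_bytes // RECORD_SIZE
    utility = math.log2(records)                    # log of the number of records
    time = ACCESS_LATENCY + page_bytes / TRANSFER_RATE
    print(f"{page_kb:5} KB  utility {utility:5.2f}  "
          f"time {time*1000:6.2f} ms  utility/time {utility/time:7.1f}")
# With these made-up numbers, utility/time peaks around a few hundred KB,
# because the fixed access latency dominates the transfer time.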
B+-tree

Using the DRAM-Flash combination, a small page size (2 KB) provides the highest utility.
Summary


With an extended-disk architecture that requires a page to migrate from DRAM to Flash and then to disk, different B+-tree page sizes should be used with Flash and Disk.
SB-trees [O'Neil 1992] support the concept of extents and different page sizes.
EXTERNAL SORTING

Sort a 20-page relation assuming a five-page buffer pool.
Merge sort
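A quick sketch of the run and pass arithmetic for this example (N = 20 pages, B = 5 buffer pages):

import math

# Two-phase external merge sort: pass 0 produces ceil(N/B) sorted runs of B pages;
# each later pass merges up to B-1 runs at a time.
N, B = 20, 5
runs = math.ceil(N / B)                          # 4 initial runs of 5 pages each
merge_passes = math.ceil(math.log(runs, B - 1))  # 1: a single 4-way merge suffices
total_passes = 1 + merge_passes
io_pages = 2 * N * total_passes                  # each pass reads and writes N pages
print(runs, total_passes, io_pages)              # 4 runs, 2 passes, 80 page I/Os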
EXTERNAL SORTING

Use flash to store intermediate runs:


Large sequential reads/writes to flash memory.
More energy efficient!
Merge sort
References



Gray and Fitzgerald. Flash Disk Opportunity for Server Applications. ACM Queue, July 2008.
Kim et al. A Space-Efficient Flash Translation Layer for CompactFlash Systems. IEEE Transactions on Consumer Electronics, Vol. 48, No. 2, May 2002.
O'Neil, P. The SB-Tree: An Index-Sequential Structure for High-Performance Sequential Access. Acta Informatica, 29(3), 1992.