Outline - University of Florida


CDA 3101 – Fall 2013
Computer Storage: Practical Aspects
6, 13 November 2013
Copyright © 2011 Prabhat Mishra
Storage Systems
• Introduction
• Disk Storage
• Dependability and Reliability
• I/O Performance
• Server Computers
• Conclusion
Case for Storage
• Shift in focus from computation to communication and storage of information
  – “The Computing Revolution” (1960s to 1980s): IBM, Control Data Corp., Cray Research
  – “The Information Age” (1990 to today): Google, Yahoo, Amazon, …
• Storage emphasizes reliability and scalability as well as cost-performance
• A program crash is frustrating, but data loss is unacceptable → dependability is the key concern
• Which software determines HW features?
  – Operating system for storage
  – Compiler for processor
Cost vs. Access Time in DRAM/Disk
• DRAM is 100,000 times faster than disk, and costs 30–150 times more per gigabyte

Flash Storage
• Nonvolatile semiconductor storage
  – 100×–1000× faster than disk
  – Smaller, lower power, more robust
  – But more $/GB (between disk and DRAM)
Hard Disk Drive
Seek Time Is Not Linear in Distance
• Example from the figure: servicing 4 reads (sectors 26, 100, 724, 9987) in the order issued requires 3 revolutions; reordering them by position requires just 3/4 of a revolution
• RULE OF THUMB: average seek time is the time to access 1/3 of the total number of cylinders
  – Seek time is not linear in distance: the arm must accelerate, pause, decelerate, and then wait for settle time
  – The average also works poorly in practice because of locality in the access pattern
Dependability
• Fault: failure of a component
  – May or may not lead to system failure
• Service accomplishment: service delivered as specified
• Service interruption: deviation from the specified service
• A failure takes the system from service accomplishment to service interruption; a restoration takes it back
Dependability Measures
• Reliability: mean time to failure (MTTF)
• Service interruption: mean time to repair (MTTR)
• Mean time between failures (MTBF)
  – MTBF = MTTF + MTTR
• Availability = MTTF / (MTTF + MTTR) (see the sketch below)
• Improving availability
  – Increase MTTF: fault avoidance, fault tolerance, fault forecasting
  – Reduce MTTR: improved tools and processes for diagnosis and repair
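A minimal numeric sketch of these formulas in C, using illustrative values (MTTF = 1,000,000 hours, MTTR = 24 hours) rather than real vendor figures:

/* Sketch of the dependability formulas above.
   The MTTF/MTTR values are made-up illustration values, not vendor data. */
#include <stdio.h>

int main(void) {
    double mttf_hours = 1000000.0;   /* mean time to failure (assumed) */
    double mttr_hours = 24.0;        /* mean time to repair (assumed)  */

    double mtbf = mttf_hours + mttr_hours;        /* MTBF = MTTF + MTTR       */
    double availability = mttf_hours / mtbf;      /* MTTF / (MTTF + MTTR)     */

    printf("MTBF         = %.0f hours\n", mtbf);
    printf("Availability = %.6f\n", availability);
    return 0;
}
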
Disk Access Example
• Given: 512 B sector, 15,000 rpm, 4 ms average seek time, 100 MB/s transfer rate, 0.2 ms controller overhead, idle disk
• Average read time:
    4 ms seek time
  + 0.5 rotation / (15,000 / 60 rotations per second) = 2 ms rotational latency
  + 512 B / 100 MB/s = 0.005 ms transfer time
  + 0.2 ms controller delay
  = 6.2 ms
• If the actual average seek time is 1 ms, the average read time is 3.2 ms
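The same arithmetic written out as a small C sketch, so each term is visible (the parameters are exactly those given on the slide):

/* Sketch of the disk access-time arithmetic from this slide
   (512 B sector, 15,000 rpm, 4 ms seek, 100 MB/s, 0.2 ms controller). */
#include <stdio.h>

int main(void) {
    double seek_ms       = 4.0;
    double rpm           = 15000.0;
    double sector_bytes  = 512.0;
    double transfer_MBps = 100.0;
    double controller_ms = 0.2;

    double rotation_ms = 0.5 * 60000.0 / rpm;                           /* half a rotation = 2 ms */
    double transfer_ms = sector_bytes / (transfer_MBps * 1e6) * 1000.0; /* ~0.005 ms              */

    double read_ms = seek_ms + rotation_ms + transfer_ms + controller_ms;
    printf("Average read time = %.3f ms\n", read_ms);                   /* ~6.2 ms                */
    return 0;
}
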
Use Arrays of Small Disks?
• Can smaller disks be used to close the gap in performance between disks and CPUs?
• Improves throughput; latency may not improve
• Figure: conventional storage spans 4 disk designs (3.5”, 5.25”, 10”, 14”) from low end to high end, while a disk array uses just 1 disk design (3.5”)
Array Reliability
• Reliability of N disks = reliability of 1 disk ÷ N
  – 50,000 hours ÷ 70 disks ≈ 700 hours
  – Disk system MTTF drops from about 6 years to 1 month!
• Arrays (without redundancy) are too unreliable to be useful
• Hot spares support reconstruction in parallel with access: very high media availability can be achieved
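A small C sketch of this calculation, using the 50,000-hour per-disk MTTF and 70 disks from the slide:

/* Sketch of the array-MTTF arithmetic: MTTF of N disks = MTTF of one disk / N. */
#include <stdio.h>

int main(void) {
    double disk_mttf_hours = 50000.0;   /* per-disk MTTF from the slide */
    int    num_disks       = 70;

    double array_mttf = disk_mttf_hours / num_disks;    /* ~714 hours */
    printf("One disk : %.0f hours (about %.1f years)\n",
           disk_mttf_hours, disk_mttf_hours / (24 * 365));
    printf("70 disks : %.0f hours (about %.1f months)\n",
           array_mttf, array_mttf / (24 * 30));
    return 0;
}
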
Redundant Arrays of (Inexpensive) Disks
• Files are “striped” across multiple disks
• Redundancy yields high data availability
  – Availability: service is still provided to the user even if some components have failed
• Disks will still fail
  – Contents are reconstructed from data redundantly stored in the array
  – Capacity penalty to store redundant information
  – Bandwidth penalty to update redundant information
RAID 1: Disk Mirroring/Shadowing
• Each disk in a recovery group is fully duplicated onto its “mirror”
  – Very high availability can be achieved
• Bandwidth sacrifice on write: a logical write = two physical writes
• Reads may be optimized (served by either copy)
• Most expensive solution: 100% capacity overhead
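A minimal sketch of the mirroring idea in C, using two toy in-memory “disks” in place of real device drivers (sizes and helpers are purely illustrative):

#include <string.h>

#define NSECT 1024
#define SECT  512
static unsigned char disk[2][NSECT][SECT];     /* two toy in-memory "disks" */

void mirrored_write(long sector, const void *buf) {
    memcpy(disk[0][sector], buf, SECT);        /* logical write = two physical writes */
    memcpy(disk[1][sector], buf, SECT);
}

void mirrored_read(long sector, void *buf) {
    static int next = 0;                       /* reads may be optimized: either copy works */
    memcpy(buf, disk[next][sector], SECT);
    next ^= 1;                                 /* here we simply alternate between copies   */
}
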
RAID 10 vs. RAID 01
• Striped mirrors (RAID 1+0)
  – For example, four pairs of mirrored disks holding four disks’ worth of data
• Mirrored stripes (RAID 0+1)
  – For example, a pair of four-disk stripes holding four disks’ worth of data
RAID 2
• Memory-style error-correcting codes across disks
• Not used anymore; other RAID organizations are more attractive
RAID 3: Parity Disk
• A logical record (e.g., 10010011 11001101 10010011 …) is striped across the data disks as physical records, with one extra disk P holding parity
• P contains the sum of the other disks per stripe, mod 2 (“parity”), i.e., the XOR of the data disks
• If a disk fails, “subtract” P from the sum of the other disks to find the missing information
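A minimal C sketch of the parity idea, assuming 4 data disks and 512-byte blocks purely for illustration: P is the bytewise XOR (sum mod 2) of the data blocks, and a lost block is rebuilt by XOR-ing P with the surviving blocks.

#include <string.h>

#define NDATA 4            /* data disks per stripe (illustration only) */
#define BLK   512          /* block size in bytes                       */

/* P = XOR of all data blocks in the stripe ("sum mod 2"). */
void compute_parity(unsigned char data[NDATA][BLK], unsigned char parity[BLK]) {
    memset(parity, 0, BLK);
    for (int d = 0; d < NDATA; d++)
        for (int i = 0; i < BLK; i++)
            parity[i] ^= data[d][i];
}

/* Rebuild the block of a failed disk from parity and the surviving disks. */
void rebuild(unsigned char data[NDATA][BLK], const unsigned char parity[BLK],
             int failed, unsigned char out[BLK]) {
    memcpy(out, parity, BLK);
    for (int d = 0; d < NDATA; d++)
        if (d != failed)
            for (int i = 0; i < BLK; i++)
                out[i] ^= data[d][i];
}
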
Inspiration for RAID 4
• RAID 3 relies on the parity disk to discover errors on a read
• But every sector already has an error-detection field
• To catch errors on a read, rely on the per-sector error-detection field rather than the parity disk
• This allows independent reads to different disks to proceed simultaneously
RAID 4: High I/O Rate Parity
• Blocks are striped across 5 disks with a dedicated parity disk; each row is a stripe, and logical disk addresses increase across a stripe and then down the columns:

  D0   D1   D2   D3   P
  D4   D5   D6   D7   P
  D8   D9   D10  D11  P
  D12  D13  D14  D15  P
  D16  D17  D18  D19  P
  D20  D21  D22  D23  P
  ...

• Example: a small read touches only D0 and D5; a large write covers D12–D15 (plus its parity block)
Inspiration for RAID 5
• RAID 4 works well for small reads
• Small writes (a write to one disk) have two options:
  – Option 1: read the other data disks, create the new sum, and write it to the parity disk
  – Option 2: since P already holds the old sum, compare the old data to the new data and add the difference to P (sketched after this slide)
• Small writes are limited by the parity disk: writes to D0 and to D5 must both also write to the P disk
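A minimal C sketch of Option 2 (the read-modify-write parity update): new P = old P XOR old data XOR new data, so a small write touches only the old data block and the parity block instead of reading every data disk.

#define BLK 512

/* Fold the difference between old and new data into the parity block. */
void update_parity(const unsigned char old_data[BLK],
                   const unsigned char new_data[BLK],
                   unsigned char parity[BLK]) {
    for (int i = 0; i < BLK; i++)
        parity[i] ^= old_data[i] ^ new_data[i];   /* new P = old P ^ old D ^ new D */
}
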
RAID 5: Distributed Parity
• N + 1 disks
• Like RAID 4, but the parity blocks are distributed across the disks, rotating from stripe to stripe
  – Avoids the parity disk becoming a bottleneck
• Widely used
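As a sketch of “parity blocks distributed across disks”, here is one possible block-placement rule in C; the rotation scheme is an assumption for illustration only, since real controllers use several different layouts.

#include <stdio.h>

/* Map a logical block to (stripe, data disk), with the parity disk rotating per stripe. */
void raid5_map(long logical_block, int ndisks,
               long *stripe, int *data_disk, int *parity_disk) {
    long per_stripe = ndisks - 1;                       /* data blocks per stripe       */
    *stripe      = logical_block / per_stripe;
    *parity_disk = (int)(*stripe % ndisks);             /* parity rotates each stripe   */
    int idx      = (int)(logical_block % per_stripe);
    *data_disk   = (idx < *parity_disk) ? idx : idx + 1; /* skip over the parity disk   */
}

int main(void) {
    for (long b = 0; b < 12; b++) {
        long stripe; int dd, pd;
        raid5_map(b, 5, &stripe, &dd, &pd);
        printf("block %2ld -> stripe %ld, data disk %d, parity on disk %d\n",
               b, stripe, dd, pd);
    }
    return 0;
}
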
RAID 6: Recovering from 2 Failures
• Why recover from more than 1 failure?
  – An operator may accidentally replace the wrong disk during a failure
  – Disk bandwidth is growing more slowly than disk capacity, so the MTTR of a disk is increasing
  – A longer repair increases the chance of a 2nd failure during the repair; a 500 GB SATA disk could take 3 hours just to read sequentially
  – Reading much more data during reconstruction also raises the chance of an uncorrectable media failure, which would result in data loss
  – Increasing numbers of disks, and the use of ATA disks (slower and larger than SCSI disks), make this worse
RAID 6: Row-Diagonal Parity (RAID-DP)
• Network Appliance’s row-diagonal parity, or RAID-DP
• Like the standard RAID schemes, it uses redundant space based on a parity calculation per stripe
• Since it protects against a double failure, it adds two check blocks per stripe of data
  – With p + 1 disks in total, p – 1 disks hold data
• The row parity disk is just as in RAID 4: even parity across the other data blocks in its stripe
• Each block of the diagonal parity disk contains the even parity of the blocks in the same diagonal

Example: p = 5
• Row-diagonal parity starts by recovering one of the 4 blocks on a failed disk using diagonal parity
  – Since each diagonal misses one disk, and all diagonals miss a different disk, 2 diagonals are missing only 1 block each
• Once the data for those blocks is recovered, the standard RAID recovery scheme can be used to recover two more blocks in the standard RAID 4 stripes
• The process continues until both failed disks are restored
I/O - Introduction
• I/O devices can be characterized by
  – Behavior: input, output, storage
  – Partner: human or machine
  – Data rate: bytes/sec, transfers/sec
• I/O bus connections
I/O System Characteristics
• Dependability is important
  – Particularly for storage devices
• Performance measures
  – Latency (response time)
  – Throughput (bandwidth)
• Desktops & embedded systems
  – Primary focus is response time & diversity of devices
• Servers
  – Primary focus is throughput & expandability of devices
Typical x86 PC I/O System
I/O Register Mapping
• Memory-mapped I/O
  – Registers are addressed in the same space as memory
  – An address decoder distinguishes between them
  – The OS uses the address translation mechanism to make them accessible only to the kernel
• I/O instructions
  – Separate instructions to access I/O registers
  – Can only be executed in kernel mode
  – Example: x86
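A minimal C sketch of the memory-mapped style: a device register is just an address accessed through a volatile pointer, so an ordinary load or store reaches the device. The base address and register layout below are invented for illustration, not taken from a real device.

#include <stdint.h>

#define DEV_BASE     0x40000000u                       /* assumed, platform-specific      */
#define DEV_CONTROL  (*(volatile uint32_t *)(DEV_BASE + 0x00))
#define DEV_STATUS   (*(volatile uint32_t *)(DEV_BASE + 0x04))

void device_enable(void) {
    DEV_CONTROL = 0x1;          /* ordinary store; the address decoder routes it to the
                                   device instead of DRAM                                 */
    (void)DEV_STATUS;           /* ordinary load reads the device's status register       */
}
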
Polling
• Periodically check the I/O status register
  – If the device is ready, do the operation
  – If there is an error, take action
• Common in small or low-performance real-time embedded systems
  – Predictable timing, low hardware cost
• In other systems, it wastes CPU time
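A minimal C sketch of a polling loop, using hypothetical status/data registers and status bits (none of these names come from a real device):

#include <stdint.h>

#define STATUS_READY 0x1u
#define STATUS_ERROR 0x2u

extern volatile uint32_t *dev_status;    /* assumed device status register */
extern volatile uint32_t *dev_data;      /* assumed device data register   */

int poll_read(uint32_t *out) {
    for (;;) {                           /* CPU spins: cheap hardware, wasted cycles */
        uint32_t s = *dev_status;        /* periodically check the status register   */
        if (s & STATUS_ERROR)
            return -1;                   /* error: take action                       */
        if (s & STATUS_READY) {
            *out = *dev_data;            /* device ready: do the operation           */
            return 0;
        }
    }
}
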
Interrupts
• When a device is ready or an error occurs
  – The controller interrupts the CPU
• An interrupt is like an exception
  – But not synchronized to instruction execution
  – Can invoke the handler between instructions
  – Cause information often identifies the interrupting device
• Priority interrupts
  – Devices needing more urgent attention get higher priority
  – Can interrupt the handler for a lower-priority interrupt
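A minimal C sketch of the interrupt-driven style on a made-up controller: the handler is invoked by hardware between instructions, reads a cause register to identify the interrupting device, and services the higher-priority device first. All names here are illustrative, not a real API.

#include <stdint.h>

extern volatile uint32_t *intr_cause;    /* assumed: one bit per interrupting device */

void disk_service(void);
void network_service(void);

void interrupt_handler(void) {           /* invoked by hardware, not called directly */
    uint32_t cause = *intr_cause;        /* cause information identifies the device  */
    if (cause & 0x1)
        disk_service();                  /* higher-priority device serviced first    */
    if (cause & 0x2)
        network_service();
}
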
I/O Data Transfer
• Polling and interrupt-driven I/O
  – The CPU transfers data between memory and the I/O data registers
  – Time-consuming for high-speed devices
• Direct memory access (DMA)
  – The OS provides a starting address in memory
  – The I/O controller transfers to/from memory autonomously
  – The controller interrupts on completion or error
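A minimal C sketch of how an OS might program a DMA transfer on a hypothetical controller: it writes the starting address and length, starts the transfer, and then relies on the completion (or error) interrupt. The register block below is invented for illustration.

#include <stdint.h>

struct dma_regs {                 /* assumed controller register block            */
    volatile uint32_t addr;       /* starting physical address in memory          */
    volatile uint32_t length;     /* number of bytes to transfer                  */
    volatile uint32_t control;    /* bit 0 = start                                */
    volatile uint32_t status;     /* bit 0 = done, bit 1 = error                  */
};

void dma_start_read(struct dma_regs *dma, uint32_t phys_addr, uint32_t nbytes) {
    dma->addr    = phys_addr;     /* OS provides the starting address             */
    dma->length  = nbytes;
    dma->control = 0x1;           /* controller now moves the data by itself; the
                                     CPU is free until the completion or error
                                     interrupt arrives                             */
}
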
Server Computers
• Applications are increasingly run on servers
  – Web search, office apps, virtual worlds, …
• Requires large data-center servers
  – Multiple processors, network connections, massive storage
  – Space and power constraints
• Server equipment is built for 19” racks
  – Multiples of 1.75” (1U) high
Rack-Mounted Servers
• Example: Sun Fire x4150 1U server
  – 4 cores each
  – 16 × 4 GB = 64 GB DRAM
Concluding Remarks
• I/O performance measures
  – Throughput, response time
  – Dependability and cost are also important
• Buses are used to connect the CPU, memory, and I/O controllers
  – Polling, interrupts, DMA
• RAID
  – Improves performance and dependability
• Please read Sections 6.1–6.10, P&H 4th Ed.

THINK: Weekend!!
“The best way to predict the future is to create it.” – Peter Drucker