Designing for 20TB Disk Drives and "enterprise storage"
Jim Gray, Microsoft Research


Disk Evolution
(Figure: capacity scale from Kilo through Mega, Giga, Tera, Peta, Exa, Zetta to Yotta)
Capacity: 100x in 10 years
1 TB 3.5" drive in 2005
20 TB in 2012?! (sketch below)
System on a chip
High-speed SAN
Disk replacing tape
Disk is super computer!
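As a rough check of the growth claim (a back-of-the-envelope sketch; the interpolation is mine, not from the slide):

    rate = 100 ** (1 / 10)            # 100x in 10 years ~ 1.58x per year
    tb_2012 = 1.0 * rate ** 7         # a 1 TB drive in 2005, extrapolated 7 years
    print(round(rate, 2), round(tb_2012))   # ~1.58, ~25 TB: in line with "20 TB in 2012?!"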
Disks are becoming computers
Smart drives
Camera with micro-drive
Replay / Tivo / Ultimate TV
Phone with micro-drive
MP3 players
Tablet
Xbox
Many more…
(Diagram: the smart drive runs Applications (Web, DBMS, Files) on an OS, on a disk controller with a 1 GHz CPU and 1 GB RAM; Comm: Infiniband, Ethernet, radio…)
Intermediate Step: Shared Logic
Brick with 8-12 disk drives
  200 mips/arm (or more)
  2 x Gbps Ethernet
  General purpose OS
  10k$/TB to 100k$/TB
Shared: sheet metal, power, support/config, security, network ports
These bricks could run applications (e.g. SQL or Mail or ..)
Examples: Snap ~360GB (10x36GB NAS), Maxstor ~.5TB (8x70GB NAS), NetApp ~1TB (12x80GB NAS), IBM TotalStorage ~2TB (12x160GB NAS)
Hardware
Homogeneous machines lead to quick response through reallocation
HP desktop machines: 320MB RAM, 3U high, 4 x 100GB IDE drives
$4k/TB (street), 2.5 processors/TB, 1GB RAM/TB
3 weeks from ordering to operational
Slide courtesy of Brewster Kahle, Archive.org
Disk as Tape
Tape is unreliable, specialized, slow, low density, not improving fast, and expensive
Using removable hard drives to replace tape's function has been successful
When a "tape" is needed, the drive is put in a machine and it is online. No need to copy from tape before it is used.
Portable, durable, fast, media cost = raw tapes, dense. Unknown longevity: suspected good.
Slide courtesy of Brewster Kahle, Archive.org
Disk as Tape: What format?
Today I send NTFS/SQL disks.
But that is not a good format for Linux.
Solution: ship NFS/CIFS/ODBC servers (not disks)
  Plug "disk" into LAN.
  DHCP, then file or DB server via standard interface.
  Web Service in the long term.
State is Expensive
Stateless clones are easy to manage
  App servers are the middle tier
  Cost goes to zero with Moore's law.
  One admin per 1,000 clones.
  Good story about scaleout.
Stateful servers are expensive to manage
  1TB to 100TB per admin
  Storage cost is going to zero (2k$ to 200k$).
  Cost of storage is the management cost.
Databases (== SQL)
VLDB survey (Winter Corp): 10 TB to 100TB DBs.
  Size doubling yearly
  Riding disk Moore's law
  10,000 disks at 18GB is 100TB cooked.
Mostly DSS and data warehouses.
Some media managers.
Interesting facts
No DBMSs beyond 100TB.
Most bytes are in files.
  The web is file centric.
  eMail is file centric.
  Science (and batch) is file centric.
But… SQL performance is better than CIFS/NFS (CISC vs RISC).
BaBar: the biggest DB
500 TB
Uses Objectivity™
SLAC events
Linux cluster scans DB looking for patterns
Hotmail / Yahoo: 300 TB (cooked)
Clone front ends: ~10,000 @ Hotmail.
Application servers: ~100 @ Hotmail
  Get mail box
  Get/put mail
  Disk bound
~30,000 disks
~20 admins
AOL (msn) (1PB?)
10 B transactions per day (10% of that)
Huge storage
Huge traffic
Lots of eye candy
DB used for security/accounting.
GUESS: AOL is a petabyte (40M x 10MB = 400 x 10^12 bytes)
Google
1.5PB as of last spring
8,000 no-name PCs
  Each 1/3U, 2 x 80 GB disks, 2 cpu, 256MB RAM
1.4 PB online.
2 TB RAM online
8 TeraOps
Slice price is 1k$, so 8M$.
15 admins (!) (== 1 per 100TB).
Astronomy
I’ve been trying to apply DB to astronomy
Today they are at 10TB per data set
Heading for Petabytes
Using Objectivity
Trying SQL (talk to me offline)
Scale Out: Buy Computing by the Slice
709,202 tpmC! == 1 billion transactions/day (arithmetic below)
Slice: 8 cpu, 8 GB RAM, 100 disks (= 1.8TB)
20 ktpmC per slice, ~300k$/slice
Clients and 4 DTC nodes not shown
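A quick sanity check of the headline number (a minimal sketch; the rounding is mine):

    tpmC = 709_202
    print(tpmC * 60 * 24)             # ~1.02 billion transactions per day
    slices = tpmC / 20_000            # at ~20 ktpmC per slice
    print(round(slices), "slices, ~", round(slices * 300_000 / 1e6), "M$")   # ~35 slices, ~11 M$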
ScaleUp: A Very Big System!
UNISYS Windows 2000 Data Center Limited Edition
32 cpus, 32 GB of RAM, and 1,061 disks (15.5 TB) on 24 fiber channel connections
Will be helped by 64-bit addressing
Hardware
8 Compaq DL360 "Photon" Web Servers
4 Compaq ProLiant 8500 DB Servers
One SQL database per rack; each rack contains 4.5 TB
261 total drives / 13.7 TB total
Fiber SAN switches
Meta data stored on 101 GB of "fast, small disks" (18 x 18.2 GB)
Imagery data stored on 4 x 339 GB of "slow, big disks" (15 x 73.8 GB)
To add 90 x 72.8 GB disks in Feb 2001 to create an 18 TB SAN
(Diagram: drive volumes E through S spread across SQL\Inst1 and SQL\Inst2)
Amdahl's Balance Laws
Parallelism law: if a computation has a serial part S and a parallel component P, then the maximum speedup is (S+P)/S.
Balanced system law: a system needs a bit of IO per second per instruction per second: about 8 MIPS per MBps.
Memory law: alpha = 1: the MB/MIPS ratio (called alpha) in a balanced system is 1.
IO law: programs do one IO per 50,000 instructions.
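A minimal sketch of how the four rules fit together, using a hypothetical 1,000 MIPS processor (the example numbers are mine, not from the slide):

    def max_speedup(serial, parallel):
        # parallelism law: speedup is bounded by (S + P) / S
        return (serial + parallel) / serial

    print(max_speedup(1, 9))          # 10% serial work caps speedup at 10x

    mips = 1_000                      # hypothetical 1,000 MIPS processor
    io_mbps = mips / 8                # balanced system law: ~8 MIPS per MBps -> 125 MBps of IO
    ram_mb = 1 * mips                 # memory law: alpha = MB/MIPS = 1 -> 1,000 MB of RAM
    ios_per_s = mips * 1e6 / 50_000   # IO law: one IO per 50,000 instructions -> 20,000 IO/s
    print(io_mbps, ram_mb, ios_per_s)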
Amdahl's Laws Valid 35 Years Later?
Parallelism law is algebra: so SURE!
Balanced system laws?
  Look at TPC results (TPC-C, TPC-H) at http://www.tpc.org/
  Some imagination needed:
    What's an instruction (CPI varies from 1-3)? RISC, CISC, VLIW, … clocks per instruction, …
    What's an I/O?
TPC systems
Normalize for CPI (clocks per instruction)
  TPC-C has about 7 ins/byte of IO
  TPC-H has 3 ins/byte of IO
TPC-H needs ½ as many disks, sequential vs random
Both use 9GB 10 krpm disks (need arms, not bytes)

                       MHz/cpu   CPI   mips   KB/IO   IO/s/disk   Disks   Disks/cpu   MB/s/cpu   Ins/Byte IO
  Amdahl                   1       1      1       6           -       -           -          -             8
  TPC-C = random         550     2.1    262       8         100     397          50         40             7
  TPC-H = sequential     550     1.2    458      64         100     176          22        141             3
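The table's columns hang together arithmetically; a small sketch re-deriving the right-hand columns from the left (my arithmetic, rounded as on the slide):

    # mips = MHz / CPI; MB/s per cpu = disks/cpu * IO/s/disk * KB/IO; ins/byte = mips / MB/s
    for name, mhz, cpi, kb_io, ios_disk, disks_cpu in [
        ("TPC-C random",     550, 2.1,  8, 100, 50),
        ("TPC-H sequential", 550, 1.2, 64, 100, 22),
    ]:
        mips = mhz / cpi
        mb_s = disks_cpu * ios_disk * kb_io / 1000
        print(name, round(mips), round(mb_s), round(mips / mb_s))
    # -> TPC-C: ~262 mips, ~40 MB/s, ~7 ins/byte; TPC-H: ~458 mips, ~141 MB/s, ~3 ins/byte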
TPC systems: What's alpha (= MB/MIPS)?
Hard to say:
  Intel 32 bit addressing (= 4GB limit). Known CPI.
  IBM, HP, Sun have 64 GB limit. Unknown CPI.
  Look at both, guess CPI for IBM, HP, Sun
Alpha is between 1 and 6

              Mips                  Memory   Alpha
  Amdahl      1                     1        1
  tpcC Intel  8 x 262 = 2 Gips      4 GB     2
  tpcH Intel  8 x 458 = 4 Gips      4 GB     1
  tpcC IBM    24 cpus ?= 12 Gips    64 GB    6
  tpcH HP     32 cpus ?= 16 Gips    32 GB    2
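The alpha column is just memory divided by instruction rate; a one-screen check (my arithmetic, using the slide's Gips estimates):

    # alpha = MB of memory per MIPS
    for name, gips, gb in [("tpcC Intel", 2, 4), ("tpcH Intel", 4, 4),
                           ("tpcC IBM", 12, 64), ("tpcH HP", 16, 32)]:
        alpha = gb * 1024 / (gips * 1000)
        print(name, round(alpha, 1))   # ~2, ~1, ~5.5, ~2 -- "alpha is between 1 and 6"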
Performance (on current SDSS data)
Run times: on 15k$ COMPAQ Server (2 cpu, 1 GB RAM, 8 disk)
  Some take 10 minutes
  Some take 1 minute
  Median ~ 22 sec.
GHz processors are fast!
  ~1,000 IO/cpu sec
  ~64 MB IO/cpu sec (10 mips/IO, 200 ins/byte)
  2.5 m rec/s/cpu
(Charts: IO count vs CPU seconds, and cpu vs elapsed time by query ID, Q01-Q20)
How much storage do we need?
Soon everything can be recorded and indexed!
Most bytes will never be seen by humans.
Data summarization, trend detection, anomaly detection are key technologies.
(Figure: storage scale from Kilo to Yotta, with markers such as A Book, A Photo, A Movie, All LoC books (words), All Books MultiMedia, and Everything Recorded!)
See Mike Lesk: How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html
See Lyman & Varian: How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/
(Small prefixes: 10^-24 yocto, 10^-21 zepto, 10^-18 atto, 10^-15 femto, 10^-12 pico, 10^-9 nano, 10^-6 micro, 10^-3 milli)
Standard Storage Metrics
Capacity:
  RAM:  MB and $/MB: today at 512MB and 200$/GB
  Disk: GB and $/GB: today at 80GB and 70k$/TB
  Tape: TB and $/TB: today at 40GB and 10k$/TB (nearline)
Access time (latency):
  RAM:  100 ns
  Disk: 15 ms
  Tape: 30 second pick, 30 second position
Transfer rate:
  RAM:  1-10 GB/s
  Disk: 10-50 MB/s (arrays can go to 10GB/s)
  Tape: 5-15 MB/s (arrays can go to 1GB/s)
New Storage Metrics: Kaps, Maps, SCAN
Kaps: how many kilobyte objects served per second
  The file server, transaction processing metric
  This is the OLD metric.
Maps: how many megabyte objects served per second
  The multi-media metric
SCAN: how long to scan all the data
  The data mining and utility metric
And: Kaps/$, Maps/$, TBscan/$
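A minimal sketch of the three metrics for a single hypothetical drive (the drive parameters are assumptions, roughly the 100 GB / 30 MB/s disk of the next slide):

    capacity_gb, seek_ms, transfer_mbps = 100, 15, 30   # assumed drive parameters
    ms_per_mb = 1000 / transfer_mbps                    # ~33 ms to move one megabyte
    kaps = 1000 / (seek_ms + ms_per_mb / 1024)          # kilobyte objects served per second
    maps = 1000 / (seek_ms + ms_per_mb)                 # megabyte objects served per second
    scan_hours = capacity_gb * 1024 / transfer_mbps / 3600
    print(round(kaps), round(maps, 1), round(scan_hours, 2))   # ~66 Kaps, ~20.7 Maps, ~0.95 hour SCAN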
More Kaps and Kaps/$ but….
Disk accesses got much less expensive
  Better disks
  Cheaper disks!
But: disk arms are expensive, the scarce resource
  A 100 GB disk at 30 MB/s is a 1 hour scan, vs 5 minutes in 1990
(Chart: Kaps/disk and Kaps/$ over time, 1970-2000)
Data on Disk Can Move to RAM in 10 Years
(Chart: Storage Price vs Time, megabytes per kilo-dollar, 1980-2000; the RAM and disk curves are about 100:1 apart, i.e. about 10 years at current improvement rates)
The "Absurd" 10x (= 4 year) Disk
1 TB, 100 MB/s, 200 Kaps (arithmetic below)
  2.5 hr scan time (poor sequential access)
  1 aps / 5 GB (VERY cold data)
It's a tape!
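The scan-time and coldness numbers follow directly from the drive parameters (a quick check; the arithmetic is mine):

    capacity_gb, mb_per_s, accesses_per_s = 1000, 100, 200
    scan_hours = capacity_gb * 1000 / mb_per_s / 3600   # ~2.8 hours, the slide's "2.5 hr scan time"
    gb_per_access = capacity_gb / accesses_per_s        # 5 GB per access/second ("1 aps / 5 GB")
    print(round(scan_hours, 1), gb_per_access)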
It's Hard to Archive a Petabyte
It takes a LONG time to restore it.
  At 1 GBps it takes 12 days! (arithmetic below)
Store it in two (or more) places online (on disk?). A geo-plex.
Scrub it continuously (look for errors).
On failure,
  use other copy until failure repaired,
  refresh lost copy from safe copy.
Can organize the two copies differently (e.g.: one by time, one by space).
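The restore-time claim is simple arithmetic (sketch; the rounding is mine):

    petabyte_gb = 1_000_000
    days = petabyte_gb / 1.0 / 86_400   # restore at 1 GB per second
    print(round(days, 1))               # ~11.6 days, the slide's "12 days"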
Auto Manage Storage
1980 rule of thumb:
  A DataAdmin per 10GB, a SysAdmin per mips
2000 rule of thumb:
  A DataAdmin per 5TB
  A SysAdmin per 100 clones (varies with app).
Problem:
  5TB is 50k$ today, 5k$ in a few years.
  Admin cost >> storage cost !!!! (sketch below)
Challenge:
  Automate ALL storage admin tasks
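To see why admin cost dominates, assume a fully-burdened admin cost of 100 k$/year; the salary figure is my assumption, the storage prices are from the slide:

    admin_cost_per_year = 100_000   # assumed cost of one DataAdmin, NOT from the slide
    storage_today = 50_000          # 5 TB today (slide)
    storage_soon = 5_000            # 5 TB in a few years (slide)
    print(admin_cost_per_year / storage_today,    # admin costs ~2x the storage it manages today
          admin_cost_per_year / storage_soon)     # ~20x in a few years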
How to cool disk data:
Cache data in main memory
  See the 5 minute rule later in the presentation
Fewer, larger transfers
  Larger pages (512B -> 8KB -> 256KB)
Sequential rather than random access (see the sketch below)
  Random 8KB IO is 1.5 MBps
  Sequential IO is 30 MBps (the 20:1 ratio is growing)
RAID1 (mirroring) rather than RAID5 (parity).
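A sketch of why bigger pages and sequential access help: effective bandwidth is page size divided by positioning time plus transfer time. The ~5 ms positioning time and 30 MB/s media rate are assumed values chosen to roughly reproduce the slide's 1.5 MBps random-8KB figure, not numbers from the slide:

    seek_ms, media_mbps = 5.0, 30.0          # assumed positioning overhead and media rate
    for page_kb in (8, 64, 256, 1024):
        ms_per_io = seek_ms + page_kb / 1024 / media_mbps * 1000
        mbps = page_kb / 1024 * (1000 / ms_per_io)
        print(page_kb, "KB pages ->", round(mbps, 1), "MB/s")
    # 8 KB -> ~1.5 MB/s, 256 KB -> ~19 MB/s, 1 MB -> ~26 MB/s; pure sequential approaches 30 MB/s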
Data delivery costs 1$/GB today
Rent for "big" customers: 300$/megabit per second per month
  Improved 3x in last 6 years (!).
  That translates to 1$/GB at each end.
You can mail a 160 GB disk for 20$.
  3 x 160 GB ~ ½ TB
  That's 16x cheaper (arithmetic below)
  If overnight, it's 4 MBps.
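The 1$/GB and 16x figures follow from the quoted rent (a sketch; the 30-day month and the comparison against both ends of the wire are my reading of the slide):

    bytes_per_month = 1e6 / 8 * 30 * 86_400        # one megabit/s for 30 days ~ 324 GB
    net_cost_per_gb = 300 / (bytes_per_month / 1e9)
    print(round(net_cost_per_gb, 2))               # ~0.93 $/GB, the slide's ~1 $/GB at each end

    mail_cost_per_gb = 20 / 160                    # one 160 GB disk mailed for 20 $ (slide)
    print(round(2 * net_cost_per_gb / mail_cost_per_gb))   # ~15, roughly the slide's "16x cheaper"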