
Building PetaByte Servers

Jim Gray Microsoft Research [email protected]

http://www.Research.Microsoft.com/~Gray/talks

Kilo  10^3
Mega  10^6
Giga  10^9
Tera  10^12
Peta  10^15
Exa   10^18

today, we are here

Outline

• The challenge: Building GIANT data stores
  – for example, the EOS/DIS 15 PB system
• Conclusion 1
  – Think about MOX and SCANS
• Conclusion 2
  – Think about Clusters
  – SMP report
  – Cluster report

The Challenge -- EOS/DIS

• Antarctica is melting - 77% of fresh water liberated
  – sea level rises 70 meters
  – Chico & Memphis are beach-front property
  – New York, Washington, SF, LA, London, Paris
• Let’s study it!

Mission to Planet Earth

• EOS: Earth Observing System (17B$ => 10B$)
  – 50 instruments on 10 satellites 1997-2001
  – Landsat (added later)
• EOS DIS: Data Information System:
  – 3-5 MB/s raw, 30-50 MB/s processed.
  – 4 TB/day, 15 PB by year 2007

The Process Flow

• Data arrives and is pre-processed.
  – instrument data is calibrated, gridded, averaged
  – Geophysical data is derived
• Users ask for stored data OR to analyze and combine data.
• Can make the pull-push split dynamically

[Diagram: Pull Processing / Other Data / Push Processing]

Designing EOS/DIS

• Expect that millions will use the system (online)
• Three user categories:
  – NASA 500 - funded by NASA to do science
  – Global Change 10 k - other dirt bags
  – Internet 20 m - everyone else: grain speculators, Environmental Impact Reports, new applications
    => discovery & access must be automatic
• Allow anyone to set up a peer-node (DAAC & SCF)
• Design for Ad Hoc queries, Not Standard Data Products
  – If push is 90%, then 10% of data is read (on average).
    => A failure: no one uses the data; in DSS, push is 1% or less.
    => computation demand is enormous (pull:push is 100:1)

Obvious Points: EOS/DIS will be a cluster of SMPs

• It needs 16 PB storage
  – = 1 M disks in current technology
  – = 500K tapes in current technology
• It needs 100 TeraOps of processing
  – = 100K processors (current technology)
  – and ~ 100 Terabytes of DRAM
• 1997 requirements are 1000x smaller
  – smaller data rate
  – almost no re-processing work
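To see how those counts fall out of 16 PB and 100 TeraOps, here is a minimal back-of-envelope sketch; the per-disk, per-tape, and per-processor figures are assumptions implied by the slide's totals (roughly 1997 technology), not numbers stated in the deck.

    # Back-of-envelope EOS/DIS sizing. Assumed unit capacities (~16 GB/disk,
    # ~32 GB/tape, ~1 GigaOps/processor) are 1997-era guesses, not slide data.
    PB = 10**15

    storage_need  = 16 * PB        # bytes
    disk_capacity = 16e9           # ~16 GB per disk (assumption)
    tape_capacity = 32e9           # ~32 GB per tape (assumption)
    ops_need      = 100e12         # 100 TeraOps
    cpu_speed     = 1e9            # ~1 GigaOps per processor (assumption)

    print(f"disks:      {storage_need / disk_capacity:,.0f}")   # ~1,000,000
    print(f"tapes:      {storage_need / tape_capacity:,.0f}")   # ~500,000
    print(f"processors: {ops_need / cpu_speed:,.0f}")           # ~100,000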

The architecture

• 2+N data center design
• Scaleable OR-DBMS
• Emphasize Pull vs Push processing
• Storage hierarchy
• Data Pump
• Just in time acquisition

2+N data center design

• duplex the archive (for fault tolerance)
• let anyone build an extract (the +N)
• Partition data by time and by space (store 2 or 4 ways).
• Each partition is a free-standing OR-DBMS (similar to Tandem, Teradata designs).
• Clients and Partitions interact via standard protocols
  – OLE-DB, DCOM/CORBA, HTTP, ...

Hardware Architecture

• 2 Huge Data Centers
• Each has 50 to 1,000 nodes in a cluster
  – Each node has about 25...250 TB of storage:
      SMP               .5 Bips to 50 Bips       20K$
      DRAM              50 GB to 1 TB            50K$
      100 disks         2.3 TB to 230 TB        200K$
      10 tape robots    25 TB to 250 TB         200K$
      2 Interconnects   1 GBps to 100 GBps       20K$
• Node costs 500K$
• Data Center costs 25M$ (capital cost)
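A quick roll-up of the component costs reconstructed above; the 50-node data-center size is one end of the slide's 50-to-1,000 range, chosen because it reproduces the 25M$ figure.

    # Node and data-center cost roll-up (component costs in K$ as listed above).
    node_components = {
        "SMP":              20,
        "DRAM":             50,
        "100 disks":       200,
        "10 tape robots":  200,
        "2 interconnects":  20,
    }
    node_cost_k   = sum(node_components.values())   # ~490 K$, i.e. ~500 K$
    data_center_m = 50 * node_cost_k / 1000         # 50 nodes -> ~24.5 M$
    print(f"node ~{node_cost_k} K$, 50-node data center ~{data_center_m:.1f} M$")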

Scaleable OR-DBMS

• Adopt cluster approach (Tandem, Teradata, VMScluster, DB2/PE, Informix, ...)
• System must scale to many processors, disks, links
• OR DBMS based on standard object model
  – CORBA or DCOM (not vendor specific)
• Grow by adding components
• System must be self-managing

Storage Hierarchy

• Cache hot 10% (1.5 PB) on disk.

• Keep cold 90% on near-line tape.

• Remember recent results on speculation

[Hierarchy diagram: 10-TB RAM (500 nodes); 1 PB of Disk (10,000 drives); 15 PB of Tape Robot]
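A small sketch of the hot/cold split described above; the per-drive figure simply divides the hot set across the 10,000 drives in the diagram.

    # Storage hierarchy split: hot 10% on disk, cold 90% on near-line tape.
    PB = 10**15
    total = 15 * PB

    hot_disk  = 0.10 * total            # ~1.5 PB cached on disk
    cold_tape = 0.90 * total            # ~13.5 PB on near-line tape
    drives    = 10_000                  # drive count from the hierarchy diagram

    print(f"hot on disk:  {hot_disk / PB:.1f} PB "
          f"(~{hot_disk / drives / 1e9:.0f} GB per drive)")
    print(f"cold on tape: {cold_tape / PB:.1f} PB")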

Data Pump

• Some queries require reading ALL the data (for reprocessing)
• Each Data Center scans the data every 2 weeks.
  – Data rate 10 PB/day = 10 TB/node/day = 120 MB/s
• Compute on demand small jobs
  – less than 1,000 tape mounts
  – less than 100 M disk accesses
  – less than 100 TeraOps
  – (less than 30 minute response time)
• For BIG JOBS scan entire 15 PB database
• Queries (and extracts) “snoop” this data pump.
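Checking the per-node bandwidth claim (10 TB/node/day is roughly 120 MB/s sustained):

    # 10 TB per node per day expressed as a sustained transfer rate.
    TB = 10**12
    seconds_per_day = 24 * 3600

    rate_mb_s = 10 * TB / seconds_per_day / 1e6
    print(f"{rate_mb_s:.0f} MB/s per node")   # ~116 MB/s, i.e. roughly 120 MB/s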

Just-in-time acquisition

• Hardware prices decline 20%-40%/year
• So buy at last moment
• Buy best product that day: commodity
• Depreciate over 3 years so that facility is fresh.
  – (after 3 years, cost is 23% of original)
• 60% decline peaks at 10M$

[Chart: EOS DIS Disk Storage Size and Cost, assuming 40% price decline/year; Data Need (TB) and Storage Cost (M$) vs. year, 1994-2008]
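A one-line check of the depreciation remark: at roughly a 40%/year price decline (the rate assumed in the chart), an equivalent unit of capacity costs about a fifth of today's price after three years, which is where the slide's ~23% figure comes from.

    # Residual cost of 3-year-old capacity at a ~40%/year price decline.
    decline = 0.40
    print(f"{(1 - decline) ** 3:.1%}")   # ~21.6%, close to the slide's 23%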

Problems

• HSM
• Design and Meta-data
• Ingest
• Data discovery, search, and analysis
• reorg-reprocess
• disaster recovery
• cost

What's a Terabyte

1 Terabyte =
  1,000,000,000 business letters      150 miles of bookshelf
  100,000,000 book pages               15 miles of bookshelf
  50,000,000 FAX images                 7 miles of bookshelf
  10,000,000 TV pictures (mpeg)        10 days of video
  4,000 LandSat images

Library of Congress (in ASCII) is 25 TB

1980: 200 M$ of disc (10,000 discs); 5 M$ of tape silo (10,000 tapes)

1994: 1 M$ of magnetic disc (120 discs); 500 K$ of optical disc robot (250 platters); 50 K$ of tape silo (50 tapes)

Terror Byte !! -- and only .1% of a PetaByte!

The Cost of Storage & Access

• File Cabinet:
    cabinet (4 drawer)        250$
    paper (24,000 sheets)     250$
    space (2x3 @ 10$/ft2)     180$
    total                     700$     3.0 ¢/sheet

• Disk:
    disk (9 GB =)           2,000$
    ASCII: 5 m pages                   0.04 ¢/sheet (100x cheaper than paper)

• Image:
    200 k pages                        1 ¢/sheet (similar to paper)
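The ¢/sheet figures follow directly from the totals above; a minimal check:

    # Cost per sheet for paper vs. disk (ASCII and image), from the slide's totals.
    cabinet_total_usd = 250 + 250 + 180          # cabinet + paper + space ~= 700$
    sheets            = 24_000
    disk_usd          = 2_000
    ascii_pages       = 5_000_000
    image_pages       = 200_000

    print(f"paper:        {cabinet_total_usd / sheets * 100:.1f} cents/sheet")  # ~3
    print(f"disk (ASCII): {disk_usd / ascii_pages * 100:.2f} cents/sheet")      # 0.04
    print(f"disk (image): {disk_usd / image_pages * 100:.0f} cents/sheet")      # 1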

Standard Storage Metrics

• Capacity:
  – RAM: MB and $/MB: today at 100 MB & 10 $/MB
  – Disk: GB and $/GB: today at 10 GB and 200 $/GB
  – Tape: TB and $/TB: today at .1 TB and 100 k$/TB (nearline)
• Access time (latency)
  – RAM: 100 ns
  – Disk: 10 ms
  – Tape: 30 second pick, 30 second position
• Transfer rate
  – RAM: 1 GB/s
  – Disk: 5 MB/s --- arrays can go to 1 GB/s
  – Tape: 3 MB/s --- not clear that striping works

New Storage Metrics: KOXs, MOXs, GOXs, SCANs?

• KOX: How many kilobyte objects served per second
  – the file server, transaction processing metric
• MOX: How many megabyte objects served per second
  – the Mosaic metric
• GOX: How many gigabyte objects served per hour
  – the video & EOSDIS metric
• SCANS: How many scans of all the data per day
  – the data mining and utility metric
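As an illustration (my own, not from the deck), here is how a single 1997-era disk from the previous slide (10 ms access, 5 MB/s transfer, 9 GB capacity) scores on these metrics; a real server would multiply these by the degree of parallelism.

    # KOX / MOX / GOX / SCANS for one disk: 10 ms access, 5 MB/s, 9 GB (assumed).
    access_s, rate_bps, capacity_b = 0.010, 5e6, 9e9

    def objects_per_second(object_bytes):
        # One random access plus a sequential transfer per object.
        return 1.0 / (access_s + object_bytes / rate_bps)

    kox   = objects_per_second(1e3)               # KB objects per second
    mox   = objects_per_second(1e6)               # MB objects per second
    gox   = objects_per_second(1e9) * 3600        # GB objects per hour
    scans = rate_bps * 86_400 / capacity_b        # full scans per day

    print(f"KOX ~{kox:.0f}/s, MOX ~{mox:.1f}/s, "
          f"GOX ~{gox:.0f}/h, SCANS ~{scans:.0f}/day")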

Summary (of new ideas)

• Storage accesses are the bottleneck
• Accesses are getting larger (MOX, GOX, SCANS)
• Capacity and cost are improving
• BUT
• Latencies and bandwidth are not improving much
• SO
• Use parallel access (disk and tape farms)

How To Get Lots of MOX, GOX, SCANS

• Parallelism: use many little devices in parallel
  – divide a big problem into many smaller ones to be solved in parallel
  – At 10 MB/s: 1.2 days to scan 1 Terabyte
  – 1,000 x parallel: 1.5 minute SCAN
• Beware of the media myth
• Beware of the access time myth
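The scan-time arithmetic behind the "1.2 days vs. about a minute and a half" contrast:

    # Time to scan 1 TB at 10 MB/s, serially and with 1,000-way parallelism.
    TB, rate = 10**12, 10e6                      # bytes, bytes/second

    serial_days      = TB / rate / 86_400
    parallel_minutes = TB / (1_000 * rate) / 60
    print(f"serial: {serial_days:.1f} days, "
          f"1,000x parallel: {parallel_minutes:.1f} min")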

Meta-Message: Technology Ratios Are Important

• If everything gets faster & cheaper at the same rate, then nothing really changes.
• Some things are getting MUCH BETTER:
  – communication speed & cost   1,000x
  – processor speed & cost         100x
  – storage size & cost            100x
• Some things are staying about the same:
  – speed of light (more or less constant)
  – people (10x worse)
  – storage speed (only 10x better)

Outline

• The challenge: Building GIANT data stores
  – for example, the EOS/DIS 15 PB system
• Conclusion 1
  – Think about MOX and SCANS
• Conclusion 2
  – Think about Clusters
  – SMP report
  – Cluster report

Scaleable Computers BOTH SMP and Cluster

• Grow Up with SMP: 4xP6 is now standard
• Grow Out with Cluster: cluster has inexpensive parts (a cluster of PCs)

[Diagram: Personal System -> Departmental Server -> SMP Super Server (grow up); Cluster of PCs (grow out)]

TPC-C Current Results

Best Performance is 30,390 tpmC @ $305/tpmC (Oracle/DEC)

Best Price/Perf. is 7,693 tpmC @ $43.5/tpmC ( MS SQL/Dell)

Graphs show
  – UNIX high price
  – UNIX scaleup diseconomy

[Charts: tpmC vs $/tpmC, full range and low end, for DB2, Informix, MS SQL Server, Oracle, and Sybase]

Compare SMP Performance

[Charts: tpmC vs CPUs, and SMP Scaleability (tpmC vs cpus), comparing SUN/Sybase and SQL Server]

TPC C improved fast

• 250%/year improvement!
• 40% hardware, 100% software, 100% PC Technology

[Charts: $/tpmC vs time and tpmC vs time, Mar-94 through Jun-97]

Where the money goes

[Chart: TPC Price/tpmC broken down into processor, disk, software, and net for eight systems: Oracle on DEC Unix, Oracle on UltraSparc/Solaris, Oracle on Compaq/NT, Sybase on Compaq/NT, Microsoft on Compaq with Visigenics, Microsoft on Intergraph with IIS, Microsoft on Compaq with IIS, and Microsoft on Dell with IIS]

What does this mean?

• PC Technology is 3x cheaper than high-end SMPs
• PC node performance is 1/2 of high-end SMPs
  – 4xP6 vs 20xUltraSparc
• Peak performance is a cluster
  – Tandem 100-node cluster
  – DEC Alpha 4x8 cluster
• Commodity solutions WILL come to this market

Cluster: Shared What?

• Shared Memory Multiprocessor
  – Multiple processors, one memory
  – all devices are local
  – DEC, SGI, Sun, Sequent: 16..64 nodes
  – easy to program, not commodity
• Shared Disk Cluster
  – an array of nodes, all sharing common disks
  – VAXcluster + Oracle
• Shared Nothing Cluster
  – each device local to a node
  – ownership may change
  – Tandem, SP2, Wolfpack

Clusters being built

• Teradata: 1,500 nodes + 24 TB disk (50k$/slice)
• Tandem, VMScluster: 150 nodes (100k$/slice)
• Intel: 9,000 nodes @ 55M$ (6k$/slice)
• Teradata, Tandem, DEC moving to NT + low slice price
• IBM: 512 nodes @ 100M$ (200k$/slice)
• PC clusters (bare handed) at dozens of nodes: web servers (msn, PointCast, ...), DB servers
• KEY TECHNOLOGY HERE IS THE APPS.
  – Apps distribute data
  – Apps distribute execution
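The quoted slice prices are just total system cost over node count; a quick check for the two systems where both figures appear:

    # Price per slice = system cost / node count.
    intel = 55e6 / 9_000        # ~6.1 k$ per slice
    ibm   = 100e6 / 512         # ~195 k$ per slice
    print(f"Intel ~{intel / 1e3:.0f} k$/slice, IBM ~{ibm / 1e3:.0f} k$/slice")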

Cluster Advantages

• Clients and Servers made from the same stuff.
  – Inexpensive: built with commodity components
• Fault tolerance:
  – Spare modules mask failures
• Modular growth
  – grow by adding small modules
• Parallel data search
  – use multiple processors and disks

Clusters are winning the high end

• You saw that a 4x8 cluster has best TPC-C performance
• This year, a 95xUltraSparc cluster won the MinuteSort Speed Trophy (see NOWsort at www.now.cs.berkeley.edu)
• Ordinal 16x on SGI Origin is close (but the loser!).

[Chart: Sort Records/second vs Time, 1985-2000; systems include M68000, Tandem, Cray YMP, Sequent, Intel Hyper, Hardware Sorter, IBM 3090, IBM RS6000, Alpha, SGI, Ordinal + SGI, and NOW (95 nodes)]

Clusters (Plumbing)

• Single system image
  – naming
  – protection/security
  – management/load balance
• Fault Tolerance
  – Wolfpack Demo
• Hot-pluggable hardware & software

So, What’s New?

• When slices cost 50k$, you buy 10 or 20.

• When slices cost 5k$, you buy 100 or 200.

• Manageability, programmability, usability become key issues (total cost of ownership).

• PCs are MUCH easier to use and program

[Diagram: MPP Vicious Cycle -- New MPP & new OS -> New App -> ... -> No Customers!]

[Diagram: CP/Commodity Virtuous Cycle -- Standards allow progress and investment protection: Apps, Standard OS & Hardware, Customers]

Windows NT Server Clustering

High Availability On Standard Hardware

• Standard API for clusters on many platforms; no special hardware required.
• Resource Group is the unit of failover
  – Typical resources: shared disk, printer, IP address, NetName, Service (Web, SQL, File, Print, Mail, MTS, ...)
• API to define resource groups, dependencies, resources; GUI administrative interface
• A consortium of 60 HW & SW vendors (everybody who is anybody)
• 2-Node Cluster in beta test now; available 97H1
  – >2 node is next
  – SQL Server and Oracle demo on it today
• Key concepts
  – System: a node
  – Cluster: systems working together
  – Resource: hard/soft-ware module
  – Resource dependency: resource needs another
  – Resource group: fails over as a unit
  – Dependencies: do not cross group boundaries
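To make the key concepts concrete, here is an illustrative-only Python sketch of the data model; this is not the actual Wolfpack API, and the resource and node names are taken loosely from these slides.

    # Toy model of clustering concepts: resources, in-group dependencies, and
    # whole-resource-group failover. Not the real Wolfpack/NT cluster API.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Resource:
        name: str
        depends_on: List["Resource"] = field(default_factory=list)  # same group only

    @dataclass
    class ResourceGroup:                       # the unit of failover
        name: str
        resources: List[Resource]
        owner_node: str

        def fail_over(self, to_node: str) -> None:
            # Dependencies never cross group boundaries, so the group moves whole.
            self.owner_node = to_node

    ip   = Resource("IP address")
    net  = Resource("NetName", depends_on=[ip])
    disk = Resource("Shared disk")
    sql  = Resource("SQL Server service", depends_on=[net, disk])

    group = ResourceGroup("SQL group", [ip, net, disk, sql], owner_node="Alice")
    group.fail_over("Betty")                   # node names from the Wolfpack slide
    print(group.owner_node)                    # Betty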

Wolfpack NT Clusters 1.0

• Two-node file and print failover
• GUI admin interface

[Diagram: clients connected to two nodes, Alice and Betty, each with private disks and shared SCSI disk strings between them]

Where We Are Today

• Clusters moving fast
  – OLTP
  – Sort
  – WolfPack
• Technology ahead of schedule
  – cpus, disks, tapes, wires, ...
• OR Databases are evolving
• Parallel DBMSs are evolving
• HSM still immature

Outline

• The challenge: Building GIANT data stores
  – for example, the EOS/DIS 15 PB system
• Conclusion 1
  – Think about MOX and SCANS
• Conclusion 2
  – Think about Clusters
  – SMP report
  – Cluster report

Building PetaByte Servers

Jim Gray Microsoft Research [email protected]

http://www.Research.Microsoft.com/~Gray/talks

Kilo  10^3
Mega  10^6
Giga  10^9
Tera  10^12
Peta  10^15
Exa   10^18

today, we are here