
ISTORE-1 Update

David Patterson

University of California at Berkeley

[email protected]

UC Berkeley IRAM Group UC Berkeley ISTORE Group

[email protected]

July 2000

Slide 1

Perspective on Post-PC Era

Post-PC Era will be driven by 2 technologies:

1) “Gadgets”: Tiny Embedded or Mobile Devices

– ubiquitous: in everything – e.g., successor to PDA, cell phone, wearable computers

2) Infrastructure to Support such Devices

– e.g., successor to Big Fat Web Servers, Database Servers

Slide 2

Outline

• Motivation for the ISTORE project
  – AME: Availability, Maintainability, Evolutionary growth
• ISTORE’s research principles & techniques
  – Introspection
  – SON: Storage-Oriented Node in Cluster
  – RAIN: Redundant Array of Inexpensive Network switches
  – Benchmarks for AME
• A Case for SON vs. CPUs
• Applications, near term and future
• Conclusions and future work

Slide 3

Lampson: Systems Challenges

• Systems that work
  – Meeting their specs
  – Always available
  – Adapting to changing environment
  – Evolving while they run
  – Made from unreliable components
  – Growing without practical limit
• Credible simulations or analysis
• Writing good specs
• Testing
• Performance
  – Understanding when it doesn’t matter

“Computer Systems Research – Past and Future”
Keynote address, 17th SOSP, Dec. 1999
Butler Lampson, Microsoft

Slide 4

Hennessy: What Should the “New World” Focus Be?

• Availability
  – Both appliance & service
• Maintainability
  – Two functions:
    » Enhancing availability by preventing failure
    » Ease of SW and HW upgrades
• Scalability
  – Especially of service
• Cost
  – per device and per service transaction
• Performance
  – Remains important, but it’s not SPECint

“Back to the Future: Time to Return to Longstanding Problems in Computer Systems?”
Keynote address, FCRC, May 1999
John Hennessy, Stanford

Slide 5

The real scalability problems: AME

• Availability
  – systems should continue to meet quality of service goals despite hardware and software failures
• Maintainability
  – systems should require only minimal ongoing human administration, regardless of scale or complexity: today, cost of maintenance = 10X cost of purchase
• Evolutionary Growth
  – systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
• These are problems at today’s scales, and will only get worse as systems grow

Slide 6

Principles for achieving AME (1)

• No single points of failure
• Redundancy everywhere
• Performance robustness is more important than peak performance
  – “performance robustness” implies that real-world performance is comparable to best-case performance
• Performance can be sacrificed for improvements in AME
  – resources should be dedicated to AME
    » compare: biological systems spend > 50% of resources on maintenance
  – can make up performance by scaling system

Slide 7

Principles for achieving AME (2)

• Introspection
  – reactive techniques to detect and adapt to failures, workload variations, and system evolution
  – proactive techniques to anticipate and avert problems before they happen

Slide 8

Hardware Techniques (1): SON

• SON: Storage Oriented Nodes (in clusters)
• Distribute processing with storage
  – If AME really important, provide resources!
  – Most storage servers limited by speed of CPUs!!
  – Amortize sheet metal, power, cooling, network for disk to add processor, memory, and a real network?
  – Embedded processors: 2/3 perf, 1/10 cost and power?
  – Serial lines, switches also growing with Moore’s Law; less need today to centralize vs. bus-oriented systems
• Advantages of cluster organization
  – Truly scalable architecture
  – Architecture that tolerates partial failure
  – Automatic hardware redundancy

Slide 9

Hardware techniques (2)

• Heavily instrumented hardware
  – sensors for temp, vibration, humidity, power, intrusion
  – helps detect environmental problems before they can affect system integrity
• Independent diagnostic processor on each node
  – provides remote control of power, remote console access to the node, selection of node boot code
  – collects, stores, and processes environmental data for abnormalities
  – non-volatile “flight recorder” functionality
  – all diagnostic processors connected via independent diagnostic network

Slide 10

Hardware techniques (3)

• On-demand network partitioning/isolation
  – Internet applications must remain available despite failures of components, so a subset of nodes can be isolated for preventative maintenance
  – Allows testing and repair of the online system
  – Managed by diagnostic processor and network switches via diagnostic network

Slide 11

Hardware techniques (4)

• Built-in fault injection capabilities (see the sketch after this list)
  – Power control to individual node components
  – Injectable glitches into I/O and memory busses
  – Managed by diagnostic processor
  – Used for proactive hardware introspection
    » automated detection of flaky components
    » controlled testing of error-recovery mechanisms
  – Important for AME benchmarking (see next slide)
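As a concrete illustration of how such fault injection might be scripted over the diagnostic network, here is a minimal sketch; the DiagnosticProcessor class and its methods are invented for illustration and are not the actual ISTORE diagnostic-processor interface.

```python
# Hypothetical sketch of driving built-in fault injection from the diagnostic
# network; the DiagnosticProcessor API below is assumed, not the real one.
import random
import time


class DiagnosticProcessor:
    """Stand-in for one node's diagnostic processor (hypothetical API)."""

    def __init__(self, node_id):
        self.node_id = node_id

    def power_cycle(self, component):
        print(f"node {self.node_id}: power-cycling {component}")

    def inject_bus_glitch(self, bus, duration_us):
        print(f"node {self.node_id}: glitching {bus} bus for {duration_us}us")

    def node_recovered(self):
        # A real system would poll heartbeat/environmental data here.
        return True


def exercise_recovery_paths(nodes, rounds=3):
    """Periodically inject faults into isolated nodes and check recovery."""
    for _ in range(rounds):
        dp = random.choice(nodes)
        dp.inject_bus_glitch(bus="memory", duration_us=10)
        time.sleep(0.1)                      # give recovery code time to run
        assert dp.node_recovered(), f"node {dp.node_id} failed to recover"


if __name__ == "__main__":
    exercise_recovery_paths([DiagnosticProcessor(i) for i in range(4)])
```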

Slide 12

ISTORE-1 hardware platform

• 80-node x86-based cluster, 1.4 TB storage
• cluster nodes are plug-and-play, intelligent, network-attached storage “bricks”
  » a single field-replaceable unit to simplify maintenance
  – each node is a full x86 PC w/ 256MB DRAM, 18GB disk
• more CPU than NAS; fewer disks/node than cluster

ISTORE Chassis
• 80 nodes, 8 per tray
• 2 levels of switches: 20 100 Mbit/s, 2 1 Gbit/s
• Environment monitoring: UPS, redundant PS, fans, heat and vibration sensors...

Intelligent Disk “Brick”
• Portable PC CPU: Pentium II/266 + DRAM
• Redundant NICs (4 100 Mb/s links)
• Diagnostic Processor
• Disk
• Half-height canister

Slide 13

ISTORE-1 Status

• 10 nodes manufactured; 60 boards fabbed, 25 to go
• Boots OS
• Diagnostic Processor interface SW complete
• PCB backplane: not yet designed
• Finish 80-node system: Summer 2000

Slide 14

A glimpse into the future?

• System-on-a-chip enables computer, memory, and redundant network interfaces without significantly increasing the size of the disk
• ISTORE HW in 5-7 years:
  – building block: 2006 MicroDrive integrated with IRAM
    » 9GB disk, 50 MB/sec from disk
    » connected via crossbar switch
  – If low power, 10,000 nodes fit into one rack!
• O(10,000) scale is our ultimate design point

Slide 15

Hardware Technique (5): RAIN

• Switches for ISTORE-1 are a substantial fraction of space, power, and cost, with just 80 nodes!
• Redundant Array of Inexpensive Disks (RAID): replace large, expensive disks by many small, inexpensive disks, saving volume, power, cost
• Redundant Array of Inexpensive Network switches (RAIN): replace large, expensive switches by many small, inexpensive switches, saving volume, power, cost?
  – ISTORE-1: replace 2 16-port 1-Gbit switches by a fat tree of 8 8-port switches, or 24 4-port switches?

Slide 16

“Hardware” techniques (6)

• Benchmarking
  – One reason for the 1000X improvement in processor performance was the ability to measure (vs. debate) which design is better
    » e.g., which is most important to improve: clock rate, clocks per instruction, or instructions executed?
  – Need AME benchmarks
    » “what gets measured gets done”
    » “benchmarks shape a field”
    » “quantification brings rigor”

Slide 17

Availability benchmark methodology

• Goal: quantify variation in QoS metrics as events occur that affect system availability (see the sketch after this list)
• Leverage existing performance benchmarks
  – to generate fair workloads
  – to measure & trace quality of service metrics
• Use fault injection to compromise system
  – hardware faults (disk, memory, network, power)
  – software faults (corrupt input, driver error returns)
  – maintenance events (repairs, SW/HW upgrades)
• Examine single-fault and multi-fault workloads
  – the availability analogues of performance micro- and macro-benchmarks
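The methodology can be sketched as a small harness. This is a rough sketch only; run_workload_for and inject_fault are placeholders for whatever workload generator and fault injector a real setup would use, not ISTORE tooling.

```python
# Sketch of an availability micro-benchmark run: trace a QoS metric over time
# while injecting a single fault partway through. All interfaces here
# (run_workload_for, inject_fault) are placeholders for real tooling.
import statistics


def run_workload_for(seconds):
    """Placeholder: drive a performance benchmark and return hits/sec."""
    raise NotImplementedError


def inject_fault(kind):
    """Placeholder: e.g. fail a disk, corrupt input, kill a daemon."""
    raise NotImplementedError


def availability_run(duration_min, fault_at_min, fault_kind, interval_s=60):
    """Trace QoS (hits/sec) each minute; inject one fault at fault_at_min."""
    trace = []
    for minute in range(duration_min):
        if minute == fault_at_min:
            inject_fault(fault_kind)
        trace.append(run_workload_for(interval_s))
    return trace


def normal_band(no_fault_traces, z=2.576):
    """99% confidence band for 'normal' behavior from fault-free runs."""
    samples = [qos for trace in no_fault_traces for qos in trace]
    mean = statistics.mean(samples)
    half_width = z * statistics.stdev(samples) / len(samples) ** 0.5
    return mean - half_width, mean + half_width
```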

Slide 18

Benchmark Availability?

Methodology for reporting results

• Results are most accessible graphically
  – plot change in QoS metrics over time
  – compare to “normal” behavior
    » 99% confidence intervals calculated from no-fault runs (see the formula below)
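One way to compute that band, assuming the no-fault runs give n samples of the QoS metric with sample mean x̄ and sample standard deviation s:

```latex
\text{normal band (99\% conf.)} \;=\; \bar{x} \;\pm\; z_{0.995}\,\frac{s}{\sqrt{n}},
\qquad z_{0.995} \approx 2.576
```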

[Figure: QoS metric over time, with the normal-behavior band (99% conf.), an injected disk failure, and the reconstruction period marked]

Slide 19

Example single-fault result

[Figure: Hits/sec and # failures tolerated vs. time (minutes) during RAID reconstruction after an injected disk failure, for Linux and Solaris]

• Compares Linux and Solaris reconstruction
  – Linux: minimal performance impact but longer window of vulnerability to second fault
  – Solaris: large perf. impact but restores redundancy fast

Slide 20

Software techniques

• Fully-distributed, shared-nothing code
  – centralization breaks as systems scale up to O(10,000)
  – avoids single-point-of-failure front ends
• Redundant data storage
  – required for high availability, simplifies self-testing
  – replication at the level of application objects
    » application can control consistency policy
    » more opportunity for data placement optimization

Slide 21

Software techniques (2)

• “River” storage interfaces
  – NOW Sort experience: performance heterogeneity is the norm
    » e.g., disks: outer vs. inner track (1.5X), fragmentation
    » e.g., processors: load (1.5-5X)
  – So, demand-driven delivery of data to apps
    » via distributed queues and graduated declustering
    » for apps that can handle unordered data delivery
  – Automatically adapts to variations in performance of producers and consumers (see the sketch after this list)
  – Also helps with evolutionary growth of cluster
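A toy illustration of the demand-driven idea: consumers pull from a shared queue, so fast and slow producers and consumers balance automatically. This is not the River implementation; a single in-process queue stands in for distributed queues and graduated declustering.

```python
# Toy sketch of demand-driven data delivery: producers of different speeds
# feed one queue and consumers pull at their own rate, so performance
# heterogeneity is absorbed automatically. Not the real River code.
import queue
import threading
import time

done = object()                      # sentinel marking end of stream


def producer(q, blocks, delay_s):
    for block in blocks:
        time.sleep(delay_s)          # model a fast vs. slow disk (outer/inner track)
        q.put(block)


def consumer(q, results):
    while True:
        block = q.get()
        if block is done:
            break
        results.append(block)        # order is not guaranteed; apps must tolerate this


if __name__ == "__main__":
    q, results = queue.Queue(maxsize=8), []
    producers = [
        threading.Thread(target=producer, args=(q, range(0, 50), 0.001)),   # fast
        threading.Thread(target=producer, args=(q, range(50, 75), 0.003)),  # slow
    ]
    consumers = [threading.Thread(target=consumer, args=(q, results)) for _ in range(2)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()
    for _ in consumers:
        q.put(done)
    for t in consumers:
        t.join()
    print(len(results), "blocks delivered, unordered")
```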

Slide 22

Software techniques (3)

• Reactive introspection
  – Use statistical techniques to identify normal behavior and detect deviations from it (a minimal sketch follows this list)
  – Policy-driven automatic adaptation to abnormal behavior once detected
    » initially, rely on a human administrator to specify policy
    » eventually, the system learns to solve problems on its own by experimenting on isolated subsets of the nodes
      • one candidate: reinforcement learning
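A minimal sketch of the "learn normal behavior, flag deviations" step, using an online mean/variance estimate; the metric, threshold, and hand-off to a policy engine are illustrative assumptions, not the project's actual statistical machinery.

```python
# Minimal sketch of reactive introspection: learn a baseline for a metric
# (e.g. per-node request latency) and flag large deviations for a policy
# engine to act on. Thresholds and the metric are illustrative only.
import math


class DeviationDetector:
    """Online mean/variance (Welford's method) with a z-score alarm."""

    def __init__(self, threshold_sigmas=3.0):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.threshold = threshold_sigmas

    def observe(self, x):
        """Return True if x deviates from learned normal behavior."""
        if self.n >= 30:                       # enough history to judge
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                return True                    # abnormal: hand off to policy
        # otherwise fold the sample into the baseline
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return False


if __name__ == "__main__":
    d = DeviationDetector()
    for latency_ms in [10, 11, 9, 10, 12] * 10 + [80]:
        if d.observe(latency_ms):
            print("abnormal latency:", latency_ms)
```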

Slide 23

Software techniques (4)

• Proactive introspection
  – Continuous online self-testing of HW and SW
    » in deployed systems!
    » goal is to shake out “Heisenbugs” before they’re encountered in normal operation
    » needs data redundancy, node isolation, fault injection
  – Techniques:
    » fault injection: triggering hardware and software error-handling paths to verify their integrity/existence
    » stress testing: pushing HW/SW to their limits
    » scrubbing: periodic restoration of potentially “decaying” hardware or software state (see the sketch after this list)
      • self-scrubbing data structures (like MVS)
      • ECC scrubbing for disks and memory
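A small sketch of the scrubbing idea for stored data, assuming blocks carry checksums and a good replica exists elsewhere; the storage layout and the repair_from_replica hook are assumptions, not ISTORE's actual mechanism.

```python
# Sketch of periodic data scrubbing: walk stored blocks, verify checksums,
# and repair "decayed" blocks from redundant copies.
import hashlib
import time


def checksum(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def repair_from_replica(block_id):
    """Placeholder: fetch a good copy from another node and rewrite it."""
    print(f"repairing block {block_id} from replica")


def scrub(store, period_s=3600, rounds=1):
    """store maps block_id -> (data, stored_checksum)."""
    for _ in range(rounds):
        for block_id, (data, stored) in store.items():
            if checksum(data) != stored:
                repair_from_replica(block_id)
        time.sleep(period_s)          # in a deployed system this runs continuously


if __name__ == "__main__":
    good = b"payload"
    store = {
        1: (good, checksum(good)),
        2: (b"bit-rotted payload", checksum(good)),   # simulated decay
    }
    scrub(store, period_s=0)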

Slide 24

A Case for Storage Oriented Nodes

Advantages of SON:
• 1 vs. 2 networks
• Physical repair/maintenance
• Die size vs. clock rate, complexity
• Silicon die cost ~ Area^4; cooling ~ (Watts/chip)^N
• Size, power
• Cost of system vs. cost of disks
• Cluster advantages: dependability, scalability

Advantages of CPU:
• Apps don’t parallelize, so 1 very fast CPU is much better in practice than N fast CPUs
• Leverage desktop MPU investment
• Software maintenance: 1 large system with several CPUs is easier to install SW on than several small computers

Slide 25

SON: 1 vs. 2 networks

• Current computers all have LAN + disk interconnect (SCSI, FCAL)
  – LAN is improving fastest: most investment, most features
  – SCSI, FCAL have poor network features, are improving slowly, and are relatively expensive for switches and bandwidth
  – Two sets of cables, wiring?
• Why not a single network based on the best HW/SW technology?

Slide 26

SON: Physical Repair

• Heterogeneous system with server components (CPU, backplane, memory cards, interface cards, power supplies, ...) and disk array components (disks, cables, controllers, array controllers, power supplies, ...)
  – Keep all components available somewhere as FRUs
• vs. homogeneous modules based on a hot-pluggable interconnect (LAN), with Field Replaceable Units: node, power supplies, network cables
  – Replace a node (disk, CPU, memory, NI) if any part fails
  – Preventative maintenance via isolation, fault insertion

Slide 27

SON: Complexity v. Perf

• Complexity increase:
  – HP PA-8500: issues 4 instructions per clock cycle, 56-instruction out-of-order execution, 4-Kbit branch predictor, 9-stage pipeline, 512 KB I-cache, 1024 KB D-cache (> 80M transistors just in caches)
  – Intel SA-110: 16 KB I$, 16 KB D$, 1 instruction per clock, in-order execution, no branch prediction, 5-stage pipeline
• Complexity costs in development time, development power, die size, cost:
  – 440 MHz HP PA-8500: 477 mm², 0.25 micron/4M, $330, > 40 Watts
  – 233 MHz Intel SA-110: 50 mm², 0.35 micron/3M, $18, 0.4 Watts

Slide 28

Cost of System v. Disks

• Examples show the cost of the way we build current systems (CPU, 2 networks, many disks/CPU, ...):

  System                    Date    Cost    Disks   Disks/CPU
  – NCR WorldMark:          10/97   $8.3M   1312    10.2
  – Sun Enterprise 10k:     3/98    $5.2M   668     10.4
  – Sun Enterprise 10k:     9/99    $6.2M   1732    27.0
  – IBM Netfinity Cluster:  7/00    $7.8M   7040    55.0

• And these database apps are CPU bound!!!
• Also potential savings in space, power
  – ISTORE-1: with big switches, it’s 2-3 racks for 80 CPUs/disks (3/8 rack unit per CPU/disk themselves)
  – ISTORE-2: 4X density improvement?

Slide 29

SON: Cluster Advantages

• Truly scalable architecture
• Architecture that tolerates partial failure
• Automatic hardware redundancy

Slide 30

SON: Cooling Cost vs. Peak Power

• What is the relationship?
  – Fan failure?
  – Feet per second of air flow?
  – Packaging costs?

Slide 31

The Case for CPU

Advantages of CPU:
• Apps don’t parallelize, so N very fast CPUs are much better in practice than 2N fast CPUs
• Leverage desktop MPU investment
• Software installation: 1 large system with several CPUs is easier to keep SW up-to-date on than several small computers

But:
• Assume apps that do parallelize: WWW services, vision, graphics
• Leverage investment in embedded MPUs, System on a Chip
• Improved maintenance is a research target: e.g., many disks lower reliability, but RAID is better

Slide 32

Initial Applications

• ISTORE is not one super-system that demonstrates all these techniques!
  – Initially provide middleware, library to support AME goals
• Initial application targets
  – cluster web/email servers
    » self-scrubbing data structures, online self-testing
    » statistical identification of normal behavior
  – information retrieval for multimedia data
    » self-scrubbing data structures, structuring performance-robust distributed computation

Slide 33

ISTORE Successor does Human Quality Vision?

• Malik at UCB thinks vision research is at a critical juncture: we have about the right algorithms and are awaiting faster computers to test them
• 10,000 nodes with System-On-A-Chip + Microdrive + network
  – 1 to 10 GFLOPS/node => 10,000 to 100,000 GFLOPS
  – High-bandwidth network
  – 1 to 10 GB of disk storage per node => can replicate images per node
  – Need AME advances to keep 10,000 nodes useful

Slide 34

Conclusions: ISTORE

• Availability, Maintainability, and Evolutionary growth are key challenges for server systems
  – more important even than performance
• ISTORE is investigating ways to bring AME to large-scale, storage-intensive servers
  – via clusters of network-attached, computationally enhanced storage nodes running distributed code
  – via hardware and software introspection
  – we are currently performing application studies to investigate these techniques
• Availability benchmarks a powerful tool?
  – revealed undocumented design decisions affecting SW RAID availability on Linux and Windows 2000

Slide 35

Backup Slides

Slide 36

State of the Art Cluster: NCR WorldMark

[Figure: 32 nodes, each with 4 processors, memory, and PCI bridges to SCSI disk strings, connected by a BYNET switched network]

• TPC-D, TD V2, 10/97
  – 32 nodes x 4 200 MHz CPUs, 1 GB DRAM, 41 disks per node (128 CPUs, 32 GB, 1312 disks, 5.4 TB)
  – CPUs, DRAM, enclosures, boards, power: $5.3M
  – Disks + controllers: $2.2M
  – Disk shelves: $0.7M
  – Cables: $0.1M
  – HW total: $8.3M

source: www.tpc.org

Slide 37

State of the Art SMP: Sun E10000

[Figure: 16 boards, each with 4 processors, memory, and bus bridges to SCSI disk strings; 4 address buses and a data crossbar switch connect the boards]

• TPC-D, Oracle 8, 3/98
  – SMP: 64 336 MHz CPUs, 64 GB DRAM, 668 disks (5.5 TB)
  – Disks, shelves: $2.1M
  – Boards, enclosures: $1.2M
  – CPUs: $0.9M
  – DRAM: $0.8M
  – Power: $0.1M
  – Cables, I/O: $0.1M
  – HW total: $5.2M

source: www.tpc.org

Slide 38

State of the Art SMP: Sun E10000

[Figure: same E10000 organization, with FC-AL disk strings instead of SCSI]

• TPC-C, Oracle 8i, 9/99
  – SMP: 64 400 MHz CPUs, 64 GB DRAM, 1732 disks (15.5 TB)
  – Disks, shelves: $3.6M
  – CPUs: $0.9M
  – Boards, enclosures: $0.9M
  – DRAM: $0.6M
  – Power: $0.1M
  – Cables, I/O: $0.1M
  – HW total: $6.2M

source: www.tpc.org

Slide 39

State of the Art Cluster: IBM Netfinity

[Figure: 32 nodes, each with 4 processors, memory, and PCI bridges to SCSI disk strings, connected by Giganet 1-Gbit switched Ethernet]

• TPC-C, DB2, 7/00
  – 32 nodes x 4 700 MHz CPUs, 0.5 GB DRAM, 220 disks per node (128 CPUs, 16 GB, 7040 disks, 116 TB)
  – CPUs: $0.6M
  – Caches: $0.5M
  – DRAM: $0.6M
  – Disks: $3.8M
  – Disk shelves: $1.6M
  – Disk controllers: $0.4M
  – Racks: $0.1M
  – Cables: $0.1M
  – Switches: $0.1M
  – HW total: $7.8M

source: www.tpc.org

Slide 40

Attacking Computer Vision

• Analogy: computer vision recognition in 2000 is like computer speech recognition in 1985
  – Pre-1985, the community was searching for good algorithms: classic AI vs. statistics?
  – By 1985, consensus was reached on statistics
  – The field focuses and makes progress, uses special hardware
  – Systems become fast enough that they can be trained rather than preloaded with information, which accelerates progress
  – By 1995, speech recognition systems were starting to deploy
  – By 2000, widely used, available on PCs

Slide 41

Computer Vision at Berkeley

• Jitendra Malik believes he has an approach that is very promising
• 2-step process:
  1) Segmentation: divide the image into regions of coherent color, texture, and motion
  2) Recognition: combine regions and search an image database to find a match
• Algorithms for 1) work well, just slowly (300 seconds per image using a PC)
• Algorithms for 2) are being tested this summer using hundreds of PCs; this will determine accuracy

Slide 42

Human Quality Computer Vision

• Suppose the algorithms work: what would it take to match human vision?
• At 30 images per second: segmentation
  – Convolution and vector-matrix multiply of sparse matrices (10,000 x 10,000, 10% nonzero/row)
  – 32-bit floating point
  – 300 seconds on a PC (assuming 333 MFLOPS) => 100 GFLOP/image
  – 30 Hz => a 3000 GFLOPS machine to do segmentation (arithmetic below)
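The arithmetic behind the 3000 GFLOPS figure, taking the slide's 300-second PC time and assumed 333 MFLOPS at face value:

```latex
300\,\mathrm{s} \times 333\,\mathrm{MFLOPS} \approx 10^{11}\ \mathrm{FLOP}
  = 100\ \mathrm{GFLOP\ per\ image},
\qquad
30\ \tfrac{\mathrm{images}}{\mathrm{s}} \times 100\ \mathrm{GFLOP}
  \approx 3000\ \mathrm{GFLOPS}.
```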

Slide 43

Human Quality Computer Vision

• At 1 image per second: object recognition
  – Humans can remember 10,000 to 100,000 objects per category (e.g., 10k faces, 10k Chinese characters, a high-school vocabulary of 50k words, ...)
  – To recognize a 3D object, need ~10 2D views
  – 100 x 100 x 8 bits (or fewer) per view => 10,000 x 10 x 100 x 100 bytes, or 10^9 bytes of candidate object images
  – Use storage to reduce computation?
  – Pruning using color and texture and by organizing shapes into an index reduces shape matches to 1000
  – Compare 1000 candidate merged regions with 1000 candidates
  – If 10 hours on a PC (333 MFLOPS) => 12,000 GFLOPS (see the arithmetic below)
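Likewise, the storage and compute figures on this slide work out as follows, using the slide's own assumptions:

```latex
10{,}000\ \text{objects} \times 10\ \text{views} \times 100 \times 100\ \text{bytes}
  = 10^{9}\ \text{bytes of candidate images},
\qquad
10\ \mathrm{h} \times 3600\ \tfrac{\mathrm{s}}{\mathrm{h}} \times 333\ \mathrm{MFLOPS}
  \approx 1.2 \times 10^{13}\ \mathrm{FLOP}
  \;\Rightarrow\; 12{,}000\ \mathrm{GFLOPS}\ \text{to do it in 1 second}.
```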

Slide 44