Computers for the Post-PC Era David Patterson University of California at Berkeley [email protected] UC Berkeley IRAM Group UC Berkeley ISTORE Group [email protected] February 2000 Slide 1
Computers for the Post-PC Era
David Patterson
University of California at Berkeley
UC Berkeley IRAM Group UC Berkeley ISTORE Group
February 2000
Slide 1
Perspective on Post-PC Era
• PostPC Era will be driven by 2 technologies:
1) “Gadgets”:Tiny Embedded or Mobile Devices
– ubiquitous: in everything – e.g., successor to PDA, cell phone, wearable computers
2) Infrastructure to Support such Devices
– e.g., successor to Big Fat Web Servers, Database Servers
Slide 2
Outline
1) Example microprocessor for PostPC gadgets
2) Motivation and the ISTORE project vision
– AME: Availability, Maintainability, Evolutionary growth
– ISTORE’s research principles
– Proposed techniques for achieving AME
– Benchmarks for AME
• Conclusions and future work
Slide 3
New Architecture Directions
• “…media processing will become the dominant force in computer arch. and microprocessor design.”
• “...new media-rich applications ... involve media streams, and make heavy use of vectors of packed 8-, 16-, 32-bit integer and Fl. Pt.”
• Needs include real-time response, continuous media data types, fine grain parallelism, coarse grain parallelism, memory bandwidth
– “How Multimedia Workloads Will Change Processor Design”, Diefendorff & Dubey, IEEE Computer (9/97)
Slide 4
Intelligent RAM: IRAM
Microprocessor & DRAM on a single chip:
– 10X capacity vs. SRAM
– on-chip memory latency 5-10X, bandwidth 50-100X
– improve energy efficiency 2X-4X (no off-chip bus)
– serial I/O 5-10X v. buses
– smaller board area/volume
IRAM advantages extend to:
– a single chip system
– a building block for larger systems
[Figure: block diagram showing Proc, L2 $, buses, I/O, and DRAM, with logic fab and DRAM fab labels]
Slide 5
Revive Vector Architecture
• Cost: $1M each? => Single-chip CMOS MPU/IRAM
• Low latency, high BW memory system? => IRAM
• Code density? => Much smaller than VLIW
• Compilers? => For sale, mature (>20 years) (We retarget Cray compilers)
• Performance? => Easy scale speed with technology
• Power/Energy? => Parallel to save energy, keep performance
• Limited to scientific applications? => Multimedia apps vectorizable too: N*64b, 2N*32b, 4N*16b
Slide 6
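The "N*64b, 2N*32b, 4N*16b" note can be read as a packing rule: the same N 64-bit element slots hold twice as many 32-bit values or four times as many 16-bit values. A minimal sketch of that arithmetic follows; the function name `packed_elements` and the choice N = 4 are illustrative assumptions, not part of the talk.

```python
def packed_elements(n_64bit_slots: int, element_bits: int) -> int:
    """Count how many elements fit when 64-bit slots hold narrower data."""
    if element_bits not in (16, 32, 64):
        raise ValueError("element width must be 16, 32, or 64 bits")
    # Each 64-bit slot is split into 64 // element_bits subwords.
    return n_64bit_slots * (64 // element_bits)


N = 4  # assumed number of 64-bit slots, for illustration only
for bits in (64, 32, 16):
    print(f"{packed_elements(N, bits)} x {bits}b")
```

With N = 4 this gives the same 4 x 64, 8 x 32, 16 x 16 combinations that appear on the V-IRAM1 slide.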
V-IRAM1: Low Power v. High Perf.
[Block diagram: 2-way; Vector Instruction Queue; arithmetic units (+, x, ÷) and Load/Store operating as 4 x 64 or 8 x 32 or 16 x 16; Vector Registers; 16K I cache; 16K D cache; Serial I/O; Memory Crossbar Switch (4 x 64 paths) connected to DRAM banks (M); I/O]
Slide 7
VIRAM-1: System on a Chip
Prototype scheduled for tape-out mid 2000
• 0.18 um EDL process
• 16 MB DRAM, 8 banks
• MIPS Scalar core and caches @ 200 MHz
• 4 64-bit vector unit pipelines @ 200 MHz
• 4 100 MB parallel I/O lines
• 17x17 mm, 2 Watts
• 25.6 GB/s memory (6.4 GB/s per direction and per Xbar)
• 1.6 Gflops (64-bit), 6.4 GOPs (16-bit)
[Floorplan: Memory (64 Mbits / 8 MBytes); Xbar; 4 Vector Pipes/Lanes; Memory (64 Mbits / 8 MBytes); CPU+$; I/O]
Slide 8
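The peak figures on this slide are consistent with simple arithmetic over the lane count and clock rate. The sketch below reproduces them under two stated assumptions that the slide does not spell out: each lane completes 2 operations per cycle, and the 25.6 GB/s total comes from 2 directions times 2 crossbars at 6.4 GB/s each.

```python
LANES = 4                    # 64-bit vector unit pipelines (from the slide)
CLOCK_MHZ = 200              # vector unit clock (from the slide)
OPS_PER_LANE_PER_CYCLE = 2   # assumption: two operations per lane per cycle


def peak_gops(element_bits: int) -> float:
    """Peak billions of operations per second for a given element width."""
    subwords_per_lane = 64 // element_bits
    ops_per_cycle = LANES * subwords_per_lane * OPS_PER_LANE_PER_CYCLE
    return ops_per_cycle * CLOCK_MHZ / 1000


def memory_bandwidth_gb_s(per_link_gb_s: float = 6.4,
                          directions: int = 2,
                          crossbars: int = 2) -> float:
    """Aggregate bandwidth, assuming 2 directions and 2 crossbars."""
    return per_link_gb_s * directions * crossbars


for bits in (64, 32, 16):
    print(f"{bits}-bit peak: {peak_gops(bits)} G ops/s")
print(f"memory: {memory_bandwidth_gb_s()} GB/s")
```

Under these assumptions the 32-bit case works out to 3.2, which is the peak quoted for several kernels on the next slide.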
Media Kernel Performance
Kernel                 Peak Perf.    Sustained Perf.   % of Peak
Image Composition      6.4 GOPS      6.40 GOPS         100.0%
iDCT                   6.4 GOPS      1.97 GOPS         30.7%
Color Conversion       3.2 GOPS      3.07 GOPS         96.0%
Image Convolution      3.2 GOPS      3.16 GOPS         98.7%
Integer MV Multiply    3.2 GOPS      2.77 GOPS         86.5%
Integer VM Multiply    3.2 GOPS      3.00 GOPS         93.7%
FP MV Multiply         3.2 GFLOPS    2.80 GFLOPS       87.5%
FP VM Multiply         3.2 GFLOPS    3.19 GFLOPS       99.6%
AVERAGE                                                86.6%
Slide 9
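The "% of Peak" column is sustained performance divided by peak, and the average is the mean of those ratios. A small sketch that recomputes the column from the slide's numbers is shown below; the variable names are illustrative, and individual values can differ from the slide in the last digit because the slide's rounding rule is not stated.

```python
# (kernel, peak, sustained) in GOPS or GFLOPS, copied from the slide.
KERNELS = [
    ("Image Composition", 6.4, 6.40),
    ("iDCT", 6.4, 1.97),
    ("Color Conversion", 3.2, 3.07),
    ("Image Convolution", 3.2, 3.16),
    ("Integer MV Multiply", 3.2, 2.77),
    ("Integer VM Multiply", 3.2, 3.00),
    ("FP MV Multiply", 3.2, 2.80),
    ("FP VM Multiply", 3.2, 3.19),
]

# Percent of peak for each kernel.
percent_of_peak = {name: 100.0 * sustained / peak
                   for name, peak, sustained in KERNELS}
average = sum(percent_of_peak.values()) / len(percent_of_peak)
worst = min(percent_of_peak, key=percent_of_peak.get)

for name, pct in percent_of_peak.items():
    print(f"{name:22s} {pct:6.1f}%")
print(f"{'AVERAGE':22s} {average:6.1f}%")
```

The kernel with the lowest ratio (`worst`) is the one that pulls the average well below the other entries.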
Base-line system comparison
Kernels: Image Composition, iDCT, Color Conversion, Image Convolution
VIRAM: 0.13, 1.18, 0.78, 5.49
MMX, VIS: 3.75 (3.2x), 8.00 (10.2x), 2.22 (17.0x), 5.49 (4.5x), 6.19 (5.1x)
TMS320C82: 5.70 (7.6x), 6.50 (5.3x)
• All numbers in cycles/pixel
• MMX and VIS results assume all data in L1 cache
Slide 10
IRAM Chip Challenges
• Merged Logic-DRAM process Cost: Cost of wafer, Impact on yield, testing cost of logic and DRAM
• Price: on-chip DRAM v. separate DRAM chips?
• Delay in transistor speeds, memory cell sizes in Merged process vs. Logic only or DRAM only
• DRAM block: flexibility via DRAM “compiler” (vary size, width, no. subbanks) vs. fixed block
• Apps: advantages in memory bandwidth, energy, system size to offset challenges?
Slide 11
Other examples: IBM “Blue Gene”
• 1 PetaFLOPS in 2005 for $100M?
• Application: Protein Folding
• Blue Gene Chip
– 32 Multithreaded RISC processors + ??MB Embedded DRAM + high speed Network Interface on single 20 x 20 mm chip
– 1 GFLOPS / processor
• 2’ x 2’ Board = 64 chips (2K CPUs)
• Rack = 8 Boards (512 chips, 16K CPUs)
• System = 64 Racks (512 boards, 32K chips, 1M CPUs)
• Total 1 million processors in just 2000 sq. ft.
Slide 12
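The Blue Gene packaging numbers multiply out level by level, which is an easy way to check the 1M CPU and 1 PetaFLOPS claims. The sketch below does that multiplication using only figures from the slide; the constant names are illustrative.

```python
CPUS_PER_CHIP = 32      # multithreaded RISC processors per chip
CHIPS_PER_BOARD = 64
BOARDS_PER_RACK = 8
RACKS = 64
GFLOPS_PER_CPU = 1

boards = BOARDS_PER_RACK * RACKS          # boards in the full system
chips = CHIPS_PER_BOARD * boards          # chips in the full system
cpus = CPUS_PER_CHIP * chips              # processors in the full system
peak_pflops = cpus * GFLOPS_PER_CPU / 1_000_000

print(f"boards={boards}, chips={chips}, cpus={cpus}")
print(f"peak ~ {peak_pflops:.2f} PFLOPS")
```

The "2K", "16K", "32K", and "1M" labels on the slide are powers of two, so the peak lands slightly above 1 PFLOPS.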
Other examples: Sony Playstation 2
• Emotion Engine: 6.2 GFLOPS, 75 million polygons per second (Microprocessor Report, 13:5)
– Superscalar MIPS core + vector coprocessor + graphics/DRAM
– Claim: “Toy Story” realism brought to games
Slide 13
Outline
1) Example microprocessor for PostPC gadgets
2) Motivation and the ISTORE project vision
– AME: Availability, Maintainability, Evolutionary growth
– ISTORE’s research principles
– Proposed techniques for achieving AME
– Benchmarks for AME
• Conclusions and future work
Slide 14
The problem space: big data
• Big demand for enormous amounts of data
– today: high-end enterprise and Internet applications
» enterprise decision-support, data mining databases
» online applications: e-commerce, mail, web, archives
– future: infrastructure services, richer data
» computational & storage back-ends for mobile devices
» more multimedia content
» more use of historical data to provide better services
• Today’s SMP server designs can’t easily scale
• Bigger scaling problems than performance!
Slide 15
Lampson: Systems Challenges
• Systems that work
– Meeting their specs
– Always available
– Adapting to changing environment
– Evolving while they run
– Made from unreliable components
– Growing without practical limit
• Credible simulations or analysis
• Writing good specs
• Testing
• Performance
– Understanding when it doesn’t matter
“Computer Systems Research: Past and Future”
Keynote address, 17th SOSP, Dec. 1999
Butler Lampson, Microsoft
Slide 16
Hennessy: What Should the “New World” Focus Be?
• Availability
– Both appliance & service
• Maintainability
– Two functions:
» Enhancing availability by preventing failure
» Ease of SW and HW upgrades
• Scalability
– Especially of service
• Cost
– per device and per service transaction
• Performance
– Remains important, but it’s not SPECint
“Back to the Future: Time to Return to Longstanding Problems in Computer Systems?”
Keynote address, FCRC, May 1999
John Hennessy, Stanford
Slide 17
The real scalability problems: AME
• Availability
– systems should continue to meet quality of service goals despite hardware and software failures
• Maintainability
– systems should require only minimal ongoing human administration, regardless of scale or complexity
• Evolutionary Growth
– systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
• These are problems at today’s scales, and will only get worse as systems grow
Slide 18
The ISTORE project vision
• Our goal: develop principles and investigate hardware/software techniques for building storage-based server systems that:
– are highly available
– require minimal maintenance
– robustly handle evolutionary growth
– are scalable to O(10000) nodes
Slide 19
Principles for achieving AME (1)
• No single points of failure
• Redundancy everywhere
• Performance robustness is more important than peak performance
– “performance robustness” implies that real-world performance is comparable to best-case performance
• Performance can be sacrificed for improvements in AME
– resources should be dedicated to AME
» compare: biological systems spend > 50% of resources on maintenance
– can make up performance by scaling system
Slide 20
Principles for achieving AME (2)
• Introspection
– reactive techniques to detect and adapt to failures, workload variations, and system evolution
– proactive techniques to anticipate and avert problems before they happen
Slide 21
Outline
1) Example microprocessor for PostPC gadgets
2) Motivation and the ISTORE project vision
– AME: Availability, Maintainability, Evolutionary growth
– ISTORE’s research principles
– Proposed techniques for achieving AME
– Benchmarks for AME
• Conclusions and future work
Slide 22
Hardware techniques
• Fully shared-nothing cluster organization
– truly scalable architecture
– architecture that tolerates partial failure
– automatic hardware redundancy
Slide 23
Hardware techniques (2)
• No Central Processor Unit: distribute processing with storage
– Serial lines, switches also growing with Moore’s Law; less need today to centralize vs. bus oriented systems
– Most storage servers limited by speed of CPUs; why does this make sense?
– Why not amortize sheet metal, power, cooling infrastructure for disk to add processor, memory, and network?
– If AME is important, must provide resources to be used to help AME: local processors responsible for health and maintenance of their storage
Slide 24
Hardware techniques (3)
• Heavily instrumented hardware
– sensors for temp, vibration, humidity, power, intrusion
– helps detect environmental problems before they can affect system integrity
• Independent diagnostic processor on each node
– provides remote control of power, remote console access to the node, selection of node boot code
– collects, stores, processes environmental data for abnormalities
– non-volatile “flight recorder” functionality
– all diagnostic processors connected via independent diagnostic network
Slide 25
Hardware techniques (4)
• On-demand network partitioning/isolation
– Internet applications must remain available despite failures of components, therefore can isolate a subset for preventative maintenance
– Allows testing, repair of online system
– Managed by diagnostic processor and network switches via diagnostic network
Slide 26
Hardware techniques (5)
• Built-in fault injection capabilities
– Power control to individual node components
– Injectable glitches into I/O and memory busses
– Managed by diagnostic processor
– Used for proactive hardware introspection
» automated detection of flaky components
» controlled testing of error-recovery mechanisms
– Important for AME benchmarking (see next slide)
Slide 27
“Hardware” techniques (6)
• Benchmarking
– One reason for 1000X processor performance was ability to measure (vs. debate) which is better
» e.g., Which most important to improve: clock rate, clocks per instruction, or instructions executed?
– Need AME benchmarks
“what gets measured gets done”
“benchmarks shape a field”
“quantification brings rigor”
Slide 28
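The three factors named on this slide (clock rate, clocks per instruction, instructions executed) combine in the standard CPU time relation: time = instructions x CPI / clock rate. The sketch below applies it to a made-up baseline to show how a measurement, rather than a debate, can rank candidate improvements. All numbers and variant names are hypothetical.

```python
def cpu_time_seconds(instructions: float, cpi: float, clock_hz: float) -> float:
    """CPU time = instructions executed x clocks per instruction / clock rate."""
    return instructions * cpi / clock_hz


# Hypothetical baseline: 1e9 instructions, CPI of 2.0, 200 MHz clock.
baseline = cpu_time_seconds(1e9, 2.0, 200e6)

# Hypothetical design alternatives, each changing one factor.
variants = {
    "25% faster clock": cpu_time_seconds(1e9, 2.0, 250e6),
    "CPI reduced to 1.5": cpu_time_seconds(1e9, 1.5, 200e6),
    "10% fewer instructions": cpu_time_seconds(0.9e9, 2.0, 200e6),
}

for name, seconds in variants.items():
    print(f"{name:24s} speedup = {baseline / seconds:.2f}x")
```

An AME benchmark would need an analogous agreed-upon metric so that availability and maintainability techniques can be compared the same way.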
ISTORE-1 hardware platform
• 80-node x86-based cluster, 1.4TB storage
– cluster nodes are plug-and-play, intelligent, network-attached storage “bricks”
» a single field-replaceable unit to simplify maintenance
– each node is a full x86 PC w/256MB DRAM, 18GB disk
– more CPU than NAS; fewer disks/node than cluster
ISTORE Chassis
– 80 nodes, 8 per tray
– 2 levels of switches
» 20 at 100 Mbit/s
» 2 at 1 Gbit/s
– Environment Monitoring: UPS, redundant PS, fans, heat and vibration sensors...
Intelligent Disk “Brick”
– Portable PC CPU: Pentium II/266 + DRAM
– Redundant NICs (4 100 Mb/s links)
– Diagnostic Processor
– Disk
– Half-height canister
Slide 29
A glimpse into the future?
• System-on-a-chip enables computer, memory, redundant network interfaces without significantly increasing size of disk
• ISTORE HW in 5-7 years:
– building block: 2006 MicroDrive integrated with IRAM
» 9GB disk, 50 MB/sec from disk
» connected via crossbar switch
– 10,000 nodes fit into one rack!
• O(10,000) scale is our ultimate design point
Slide 30
Software techniques
• Fully-distributed, shared-nothing code
– centralization breaks as systems scale up to O(10000) nodes
– avoids single-point-of-failure front ends
• Redundant data storage
– required for high availability, simplifies self-testing
– replication at the level of application objects
» application can control consistency policy
» more opportunity for data placement optimization
Slide 31
Software techniques (2)
• “River” storage interfaces
– NOW Sort experience: performance heterogeneity is the norm
» e.g., disks: outer vs. inner track (1.5X), fragmentation
» e.g., processors: load (1.5-5x)
– So demand-driven delivery of data to apps
» via distributed queues and graduated declustering
» for apps that can handle unordered data delivery
– Automatically adapts to variations in performance of producers and consumers
– Also helps with evolutionary growth of cluster
Slide 32
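To see why demand-driven delivery tolerates performance heterogeneity, the sketch below compares it with a static even split of the work. It is a deliberately simplified model, not River itself: a single shared queue stands in for the distributed queues, graduated declustering is not modeled, and the function names, item count, and service times are all hypothetical.

```python
import heapq


def demand_driven(num_items, service_times):
    """Each consumer pulls the next item as soon as it becomes free."""
    # Heap of (time the consumer becomes free, consumer index).
    free_at = [(0.0, i) for i in range(len(service_times))]
    heapq.heapify(free_at)
    counts = [0] * len(service_times)
    finish = 0.0
    for _ in range(num_items):
        now, i = heapq.heappop(free_at)
        done = now + service_times[i]
        counts[i] += 1
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish, counts


def static_partition(num_items, service_times):
    """Every consumer gets an equal share up front (assumes even division)."""
    share = num_items // len(service_times)
    return max(share * s for s in service_times)


# Hypothetical cluster: three normal consumers and one that is 3x slower.
times = [1.0, 1.0, 1.0, 3.0]
finish, counts = demand_driven(120, times)
print("static split finishes at   ", static_partition(120, times))
print("demand-driven finishes at  ", finish, "with per-consumer counts", counts)
```

In the static split the slow consumer determines the finish time; with pull-based delivery it simply ends up processing fewer items.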
Software techniques (3)
• Reactive introspection
– Use statistical techniques to identify normal behavior and detect deviations from it
– Policy-driven automatic adaptation to abnormal behavior once detected
» initially, rely on human administrator to specify policy
» eventually, system learns to solve problems on its own by experimenting on isolated subsets of the nodes
• one candidate: reinforcement learning
Slide 33
Software techniques (4)
• Proactive introspection
– Continuous online self-testing of HW and SW
» in deployed systems!
» goal is to shake out “Heisenbugs” before they’re encountered in normal operation
» needs data redundancy, node isolation, fault injection
– Techniques:
» fault injection: triggering hardware and software error handling paths to verify their integrity/existence
» stress testing: push HW/SW to their limits
» scrubbing: periodic restoration of potentially “decaying” hardware or software state
• self-scrubbing data structures (like MVS)
• ECC scrubbing for disks and memory
Slide 34
Applications
• ISTORE is not one super-system that demonstrates all these techniques!
– Initially provide library to support AME goals
• Initial application targets
– cluster web/email servers
» self-scrubbing data structures, online self-testing
» statistical identification of normal behavior
– decision-support database query execution system
» River-based storage, replica management
– information retrieval for multimedia data
» self-scrubbing data structures, structuring performance-robust distributed computation
Slide 35
Outline
1) Example microprocessor for PostPC gadgets
2) Motivation and the ISTORE project vision
– AME: Availability, Maintainability, Evolutionary growth
– ISTORE’s research principles
– Proposed techniques for achieving AME
– Benchmarks for AME
• Conclusions and future work
Slide 36
Availability benchmark methodology
• Goal: quantify variation in QoS metrics as events occur that affect system availability
• Leverage existing performance benchmarks
– to generate fair workloads
– to measure & trace quality of service metrics
• Use fault injection to compromise system
– hardware faults (disk, memory, network, power)
– software faults (corrupt input, driver error returns)
– maintenance events (repairs, SW/HW upgrades)
• Examine single-fault and multi-fault workloads
– the availability analogues of performance micro- and macro-benchmarks
Slide 37
Methodology: reporting results
• Results are most accessible graphically
– plot change in QoS metrics over time
– compare to “normal” behavior?
» 99% confidence intervals calculated from no-fault runs
[Figure: QoS metric (axis from 160 to 210) vs. Time (2-minute intervals), annotated with injected disk failure, reconstruction, and the normal behavior (99% conf) band]
• Graphs can be distilled into numbers?
Slide 38
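One way to distill such a graph into numbers is to build the "normal behavior" band from no-fault runs and then summarize how a faulted run departs from it. The sketch below is an assumption-laden illustration: it uses a normal approximation (mean plus or minus z times the sample standard deviation) for the 99% band, which may differ from how the talk's intervals were computed, and the function names and sample data are made up.

```python
from statistics import NormalDist, mean, stdev


def normal_band(no_fault_samples, confidence=0.99):
    """Band for normal behavior, assuming roughly normal no-fault data."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    m, s = mean(no_fault_samples), stdev(no_fault_samples)
    return m - z * s, m + z * s


def summarize(trace, band, interval_minutes=2):
    """Reduce a QoS trace to a few numbers relative to the normal band."""
    low, _high = band
    below = [x for x in trace if x < low]
    return {
        "minutes_below_normal": len(below) * interval_minutes,
        "worst_value": min(trace),
    }


# Made-up hits/second samples, one per 2-minute interval.
no_fault = [200, 202, 198, 201, 199, 200]
faulted = [200, 199, 180, 185, 195, 200]

band = normal_band(no_fault)
summary = summarize(faulted, band)
print("normal band:", band)
print("summary:", summary)
```

Other summary numbers, such as time to return to the band after the fault, could be added in the same style.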
Example results: software RAID-5
• Test systems: Linux/Apache and Win2000/IIS
– SpecWeb ’99 to measure hits/second as QoS metric
– fault injection at disks based on empirical fault data
» transient, correctable, uncorrectable, & timeout faults
• 15 single-fault workloads injected per system
– only 4 distinct behaviors observed
(A) no effect
(B) system hangs
(C) RAID enters degraded mode
(D) RAID begins reconstruction onto spare disk
– both systems hung (B) on simulated disk hangs
– Linux exhibited (D) on all other errors
– Windows exhibited (A) on transient errors and (C) on uncorrectable, sticky errors
Slide 39
Example results: multiple-faults
[Figure: Windows 2000/IIS, QoS metric vs. Time (2-minute intervals); annotations: data disk faulted, reconstruction (manual), spare faulted, disks replaced, normal behavior (99% conf)]
[Figure: Linux/Apache, QoS metric vs. Time (2-minute intervals); annotations: data disk faulted, reconstruction (automatic), spare faulted, reconstruction (automatic), disks replaced, normal behavior (99% conf)]
• Windows reconstructs ~3x faster than Linux
• Windows reconstruction noticeably affects application performance, while Linux reconstruction does not
Slide 40
Conclusions (1): Benchmarks
• Linux and Windows take opposite approaches to managing benign and transient faults
– Linux is paranoid and stops using a disk on any error
– Windows ignores most benign/transient faults
– Windows is more robust except when disk is truly failing
• Linux and Windows have different reconstruction philosophies
– Linux uses idle bandwidth for reconstruction
– Windows steals app. bandwidth for reconstruction
– Windows rebuilds fault-tolerance more quickly
• Win2k favors fault-tolerance over performance; Linux favors performance over fault-tolerance
Slide 41
Conclusions (2): ISTORE
• Availability, Maintainability, and Evolutionary growth are key challenges for server systems
– more important even than performance
• ISTORE is investigating ways to bring AME to large-scale, storage-intensive servers
– via clusters of network-attached, computationally enhanced storage nodes running distributed code
– via hardware and software introspection
– we are currently performing application studies to
• Availability benchmarks a powerful tool?
– revealed undocumented design decisions affecting SW RAID availability on Linux and Windows 2000
Slide 42
Conclusions (3)
• IRAM attractive for two Post-PC applications because of low power, small size, high memory bandwidth
– Gadgets: Embedded/Mobile devices
– Infrastructure: Intelligent Storage and Networks
• PostPC infrastructure requires
– New Goals: Availability, Maintainability, Evolution
– New Principles: Introspection, Performance Robustness
– New Techniques: Isolation/fault insertion, Software scrubbing
– New Benchmarks: measure, compare AME metrics
Slide 43
Berkeley Future work
• IRAM: fab and test chip
• ISTORE
– implement AME-enhancing techniques in a variety of Internet, enterprise, and info retrieval applications
– select the best techniques and integrate into a generic runtime system with “AME API”
– add maintainability benchmarks
» can we quantify administrative work needed to maintain a certain level of availability?
– Perhaps look at data security via encryption?
– Even consider denial of service?
Slide 44
The UC Berkeley IRAM/ISTORE Projects: Computers for the PostPC Era
For more information:
http://iram.cs.berkeley.edu/istore
Slide 45
Backup Slides
(mostly in the area of benchmarking) Slide 46
Case study
• Software RAID-5 plus web server
– Linux/Apache vs. Windows 2000/IIS
• Why software RAID?
– well-defined availability guarantees
» RAID-5 volume should tolerate a single disk failure
» reduced performance (degraded mode) after failure
» may automatically rebuild redundancy onto spare disk
– simple system
– easy to inject storage faults
• Why web server?
– an application with measurable QoS metrics that depend on RAID availability and performance
Slide 47
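The guarantee that a RAID-5 volume tolerates a single disk failure rests on XOR parity: the parity block is the XOR of the data blocks in a stripe, so any one missing block equals the XOR of the survivors. The sketch below shows this for a single stripe with made-up block contents; it ignores parity rotation across disks and is not taken from either operating system's implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe across three data disks (contents are arbitrary examples).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the second disk and rebuilding it from the survivors.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print("rebuilt block:", rebuilt)
```

Serving reads this way while a disk is missing is the degraded mode mentioned above; writing the rebuilt blocks to a spare restores redundancy.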
Benchmark environment: metrics
• QoS metrics measured
– hits per second
» roughly tracks response time in our experiments
– degree of fault tolerance in storage system
• Workload generator and data collector
– SpecWeb99 web benchmark
» simulates realistic high-volume user load
» mostly static read-only workload; some dynamic content
» modified to run continuously and to measure average hits per second over each 2-minute interval
Slide 48
Benchmark environment: faults
• Focus on faults in the storage system (disks)
• How do disks fail?
– according to Tertiary Disk project, failures include:
» recovered media errors
» uncorrectable write failures
» hardware errors (e.g., diagnostic failures)
» SCSI timeouts
» SCSI parity errors
– note: no head crashes, no fail-stop failures
Slide 49
Disk fault injection technique
• To inject reproducible failures, we replaced one disk in the RAID with an emulated disk
– a PC that appears as a disk on the SCSI bus
– I/O requests processed in software, reflected to local disk
– fault injection performed by altering SCSI command processing in the emulation software
• Types of emulated faults:
– media errors (transient, correctable, uncorrectable)
– hardware errors (firmware, mechanical)
– parity errors
– power failures
– disk hangs/timeouts
Slide 50
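The idea of injecting faults by altering command processing in an emulated disk can be illustrated with a toy model. Everything in the sketch below is hypothetical: the `EmulatedDisk` class, its methods, and the fault names are invented for illustration and do not reflect the actual emulator or the ASC VirtualSCSI library interface.

```python
class EmulatedDisk:
    """Toy in-memory disk whose read path can be told to fail (hypothetical)."""

    def __init__(self, num_blocks, block_size=512):
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]
        # block number -> [fault kind, remaining count]; count None = sticky
        self.faults = {}

    def inject(self, block, kind, count=None):
        """Schedule a fault on a block; count=1 models a transient error."""
        self.faults[block] = [kind, count]

    def write(self, block, payload):
        self.blocks[block] = payload

    def read(self, block):
        fault = self.faults.get(block)
        if fault is not None:
            kind, count = fault
            if count is not None:
                fault[1] = count - 1
                if fault[1] == 0:
                    del self.faults[block]  # transient fault has cleared
            raise IOError(kind)
        return self.blocks[block]


def try_read(disk, block):
    """Return 'ok' or the injected error text, for easy inspection."""
    try:
        disk.read(block)
        return "ok"
    except IOError as err:
        return str(err)


disk = EmulatedDisk(num_blocks=8)
disk.inject(3, "transient media error", count=1)
disk.inject(5, "uncorrectable media error")  # sticky: fails on every read
print([try_read(disk, 3), try_read(disk, 3), try_read(disk, 5), try_read(disk, 5)])
```

Because the fault schedule is explicit, the same failure can be replayed against different systems, which is the reproducibility the slide asks for.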
System configuration
[Figure: system configuration. Server: AMD K6-2-333, 64 MB DRAM, Linux or Win2000, IDE system disk, Adaptec 2940 adapters, IBM 18 GB 10k RPM disks. Disk Emulator: AMD K6-2-350, Windows NT 4.0, ASC VirtualSCSI lib., SCSI system disk, AdvStor ASC-U2W, Adaptec 2940, IBM 18 GB 10k RPM emulator backing disk; presents the Emulated Disk and Emulated Spare Disk. Connections are Fast/Wide SCSI buses, 20 MB/sec.]
• RAID-5 Volume: 3GB capacity, 1GB used per disk
– 3 physical disks, 1 emulated disk, 1 emulated spare disk
• 2 web clients connected via 100Mb switched Ethernet
Slide 51
Results: single-fault experiments
• One exp’t for each type of fault (15 total)
– only one fault injected per experiment
– no human intervention
– system allowed to continue until stabilized or crashed
• Four distinct system behaviors observed
(A) no effect: system ignores fault
(B) RAID system enters degraded mode
(C) RAID system begins reconstruction onto spare disk
(D) system failure (hang or crash)
Slide 52
State of the Art: Ultrastar 72ZX
Latency = Queuing Time + Controller time + Seek Time + Rotation Time + Size / Bandwidth
[Figure: disk drive diagram labeled Track, Sector, Cylinder, Head, Platter, Arm, Track Buffer; braces mark the per access and per byte terms of the latency equation]
– 73.4 GB, 3.5 inch disk
– 2¢/MB
– 16 MB track buffer
– 11 platters, 22 surfaces
– 15,110 cylinders
– 7 Gbit/sq. in. areal density
– 17 watts (idle)
– 0.1 ms controller time
– 5.3 ms avg. seek (seek 1 track => 0.6 ms)
– 3 ms = 1/2 rotation
– 37 to 22 MB/s to media
source: www.ibm.com; www.pricewatch.com; 2/14/00
Slide 53
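The latency equation on this slide can be evaluated directly with the drive parameters listed. The sketch below does so under stated assumptions: queuing time is taken as zero, MB means 10^6 bytes, and the 64 KB request size is an arbitrary example rather than a figure from the talk.

```python
def access_latency_ms(size_bytes, bandwidth_mb_s, queuing_ms=0.0,
                      controller_ms=0.1, seek_ms=5.3, rotation_ms=3.0):
    """Queuing + controller + seek + rotation + size / bandwidth, in ms."""
    # Assumes 1 MB = 1,000,000 bytes when converting the media rate.
    transfer_ms = size_bytes / (bandwidth_mb_s * 1_000_000) * 1000
    return queuing_ms + controller_ms + seek_ms + rotation_ms + transfer_ms


REQUEST_BYTES = 64 * 1024  # hypothetical request size

outer = access_latency_ms(REQUEST_BYTES, 37)  # fastest media rate on the slide
inner = access_latency_ms(REQUEST_BYTES, 22)  # slowest media rate on the slide
print(f"64 KB at 37 MB/s: {outer:.2f} ms")
print(f"64 KB at 22 MB/s: {inner:.2f} ms")
```

For a request of this size the fixed per-access terms (controller, seek, rotation) account for most of the total, which is why the media rate changes the result only modestly.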