Tortola: Addressing Tomorrow’s Computing Challenges through Hardware/Software Symbiosis Kim Hazelwood


Tortola: Addressing Tomorrow’s Computing Challenges through Hardware/Software Symbiosis Kim Hazelwood September 29, 2006

Modern Computing Challenges

• Performance
• Power
  – Energy consumption, max instantaneous power, di/dt
• Temperature
  – Total heat output, “hot spots”
• Reliability
  – Neutron strikes, alpha particles, MTBF, design flaws
• Approaches: Circuit, microarchitecture, compiler
• Constraint: Fixed HW-SW interface (e.g., x86)

Typical Approaches

• Optimize using SW or HW techniques in isolation
• Performance
  – SW: Compile-time optimizations
  – HW: Architectural improvements, VLSI technology
• Reliability: Code/data duplication (HW or SW)
• Power & Temperature
  – HW control mechanisms
  – Profile + recompile cycle

[Diagram: SW and HW layers]

Modern Design Constraints

• Compilers
  – “Compile once, run anywhere”
  – Cannot ship “MS Office for 1Q05 batch of Pentium-4 3GHz, > 1GB RAM, BrandX power supply, located in high altitudes…”
• Microarchitecture
  – Limited window of application knowledge (past must predict the future)
• VLSI
  – Guaranteed correctness, reliability

We currently must optimize for the common case (but must design for the worst case)

The Power of Virtualization

• A HW-SW interface layer

[Diagram: SW Applications, Binary Modifier, HW. Interface labels: x86 and x86 (initially); SWI and HWI (eventually)]

Dynamic Binary Modification

• Creates a modified code image at run time

[Diagram: EXE, Profile, Transform, Execute, Code Cache]

Examples:
• Dynamo (HP)
• DAISY/BOA (IBM)
• CMS (Transmeta)
• Mojo (Microsoft)
• Strata (UVa)
• Pin (Intel)
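The profile, transform, execute, and code cache stages can be illustrated with a toy model. This is only a sketch under stated assumptions: the two-field "instructions", the `BinaryModifier` class, the `transform` rule (dropping no-ops), and the `hot_threshold` parameter are all hypothetical and do not describe how Pin, DynamoRIO, or any of the systems listed above are implemented.

```python
from typing import Dict, List, Tuple

# Hypothetical toy "instruction": (opcode, operand). Not a real ISA.
Instr = Tuple[str, int]
Block = List[Instr]


def transform(block: Block) -> Block:
    """Toy transformation (assumption): drop no-ops from the block."""
    return [ins for ins in block if ins[0] != "nop"]


def execute(block: Block, acc: int) -> int:
    """Interpret a block against a single accumulator."""
    for op, arg in block:
        if op == "add":
            acc += arg
        elif op == "mul":
            acc *= arg
        elif op != "nop":
            raise ValueError(f"unknown opcode: {op}")
    return acc


class BinaryModifier:
    """Runs blocks from the original image until they become hot,
    then runs a transformed copy kept in a code cache."""

    def __init__(self, program: Dict[str, Block], hot_threshold: int = 2):
        self.program = program                   # original image (the "EXE")
        self.code_cache: Dict[str, Block] = {}   # modified code image
        self.exec_counts: Dict[str, int] = {}    # profile data
        self.hot_threshold = hot_threshold       # hypothetical tuning knob

    def run_block(self, name: str, acc: int) -> int:
        # Profile: count executions of this block.
        self.exec_counts[name] = self.exec_counts.get(name, 0) + 1
        # Transform: build the cached copy once the block is hot.
        hot = self.exec_counts[name] >= self.hot_threshold
        if hot and name not in self.code_cache:
            self.code_cache[name] = transform(self.program[name])
        # Execute: prefer the code cache, fall back to the original.
        block = self.code_cache.get(name, self.program[name])
        return execute(block, acc)


program = {"loop_body": [("add", 2), ("nop", 0), ("mul", 3)]}
dbm = BinaryModifier(program)
acc = 1
for _ in range(3):
    acc = dbm.run_block("loop_body", acc)
```

The original image is never altered; only the copy in the code cache is modified, which is what allows the transformation to be applied (or revised) at run time.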

Dynamic Instrumentation Demo

• Pin
  – Four architectures: IA32, EM64T, IPF, XScale
  – Four OSes: Linux, FreeBSD, MacOS, Windows
  – http://rogue.colorado.edu/pin/

Dynamic Optimization Demo

• DynamoRIO
  – Windows and Linux for IA32
  – http://www.cag.lcs.mit.edu/dynamorio/

Dynamic Binary Modification

• Creates a modified code image at run time

[Diagram: EXE, Profile, Transform, Execute, Code Cache]

Examples:
• Dynamo (HP)
• DAISY/BOA (IBM)
• CMS (Transmeta)
• Mojo (Microsoft)
• Strata (UVa)
• Pin (Intel)

• Always triggered by software events … until now

Tortola: Symbiotic Optimization

• Enable HW/SW Communication

[Diagram: SW Applications, Binary Modifier, HW]

Simulation Methodology

• SimpleScalar 4.0 for x86
• Wattch 1.02 power extensions
• Pin dynamic instrumentation system (x86/Linux version)

[Diagram: SW Application = Benchmarks; Binary Modifier = Pin; HW = Wattch & SimpleScalar/x86]

Tortola Applications

• Combine global program information with run time feedback
  – System-specific power usage
  – Application-specific heat anomalies
  – Workload/input specific performance optimization
• Reduce hardware complexity
  – No more backwards compatibility warts
  – Fix bugs after shipment
  – Reduce time to market for new architectures
• One such application: The di/dt problem

The Di/dt Problem

• Voltage stability is important for reliability, performance
• Low-power techniques have a negative side effect: current variation
• Dips (undershoots) in supply voltage
  – can cause incorrect values to be calculated or stored
• Spikes (overshoots) in supply voltage
  – can cause reliability problems

The Di/dt Problem

• ITRS cites noise management as a Grand Challenge for 5-10 year time frame
• Several trends are aggravating the issue:
  – Voltage is scaling down with technology
  – Current draw is increasing
  – Package impedance is not scaling as quickly
  – Aggressive clock gating causes large swings in processor current draw (di/dt)

Di/dt Solutions

• Software: Compiler Optimizations
• Co-Designed MicroArch & SW: Binary Modifier
• MicroArch: Sensor/Actuator Mechanisms
• Circuit-Level: Decoupling capacitors; more Vdd/Gnd pins on package

Sensor-Actuator Mechanisms

• On-chip voltage sensors detect abnormally high/low voltage levels
• On-chip actuator then attempts to quickly raise/lower the processor’s current draw
  – Phantom firing
    • increases current (at the expense of power)
  – Resource throttling
    • reduces current (at the expense of performance)

Detecting Imminent Emergencies

[Figure: supply voltage with levels marked at 1.05V, 1.03V, 1V, 0.97V, and 0.95V; labels: Soft Emergency, Hard Emergency, Control Threshold]
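The sensor/actuator behavior described on the last two slides can be sketched as a simple threshold check. The mapping of voltage levels to labels is an assumption here, since the figure did not survive extraction: this sketch treats 0.97V and 1.03V as the control thresholds and 0.95V and 1.05V as the hard-emergency limits. The names `Action`, `actuate`, and `is_hard_emergency` are hypothetical.

```python
from enum import Enum

# Assumed mapping of the slide's voltage levels (not confirmed by the source).
CONTROL_LOW, CONTROL_HIGH = 0.97, 1.03   # actuator engages at or beyond these
HARD_LOW, HARD_HIGH = 0.95, 1.05         # hard emergency beyond these


class Action(Enum):
    NONE = "none"
    THROTTLE = "resource throttling"    # reduces current, costs performance
    PHANTOM_FIRE = "phantom firing"     # increases current, costs power


def actuate(voltage: float) -> Action:
    """Pick the actuator response for one sensed supply voltage."""
    if voltage <= CONTROL_LOW:
        # Voltage is dipping: lower the current draw.
        return Action.THROTTLE
    if voltage >= CONTROL_HIGH:
        # Voltage is spiking: raise the current draw.
        return Action.PHANTOM_FIRE
    return Action.NONE


def is_hard_emergency(voltage: float) -> bool:
    """True once the voltage has left the assumed safe band entirely."""
    return voltage < HARD_LOW or voltage > HARD_HIGH
```

Placing the control thresholds inside the hard limits gives the actuator a margin in which to react before the voltage leaves the safe band.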

Targeting Mid-Frequency Di/dt

• Problematic: wide current spike
• Worst case: pulse at the resonant frequency

[Figure: two plots of voltage vs. time (cycles), marking maximum and minimum voltage, annotated with 60 cycles and 20 cycles. From: Joseph et al., HPCA-9]
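Why a pulse train at the resonant frequency is the worst case can be seen in a generic damped second-order model. This is a sketch only: the model, the resonant period of 100 cycles, the damping ratio, and the function name `peak_deviation` are assumptions chosen for illustration, not parameters from the talk or from Joseph et al.

```python
import math


def peak_deviation(pulse_period: int, resonant_period: int = 100,
                   damping: float = 0.05, cycles: int = 5000) -> float:
    """Largest swing (arbitrary units) of a damped oscillator driven by a
    zero-mean square wave with the given period, one step per cycle."""
    omega0 = 2.0 * math.pi / resonant_period
    x = 0.0      # deviation from the nominal level
    v = 0.0      # rate of change of the deviation
    peak = 0.0
    for t in range(cycles):
        # Current alternates between a high and a low phase of equal length.
        drive = 0.5 if (t % pulse_period) < pulse_period // 2 else -0.5
        # Semi-implicit Euler step of x'' = drive - 2*z*w0*x' - w0^2*x.
        v += drive - 2.0 * damping * omega0 * v - omega0 * omega0 * x
        x += v
        peak = max(peak, abs(x))
    return peak


at_resonance = peak_deviation(pulse_period=100)
off_resonance = peak_deviation(pulse_period=30)
```

In this model the swing keeps building when each pulse arrives in step with the oscillation, so `at_resonance` comes out several times larger than `off_resonance` even though both drives have the same amplitude.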

A Di/dt Stressmark

BEGIN_LOOP:
    …
    ldt    $f1, ($4)
    divt   $f1, $f2, $f3
    divt   $f3, $f2, $f3
    stt    $f3, 8($4)
    ldq    $7, 8($4)
    cmovne $31, $7, $3
    stq    $3, ($4)
    stq    $3, ($4)
    stq    $3, ($4)
    …
    stq    $3, ($4)
    …
    JUMP BEGIN_LOOP

But… the actuator engages every loop iteration, degrading performance.

Why not correct the problem in the code?

Proposed Solution

• Leverage our additional software layer to supplement existing solutions
• Microarchitecture provides feedback to our software-based virtual layer

[Diagram: SW, VL, and HW layers; labels: Altered Executable Binary, Sensor+Actuator Ext, Microprocessor]

Required Investigations

• Characterizing emergencies
  – How often do we see di/dt emergency loops?
• Communication between the microarchitecture and the virtual layer
  – What information should be passed to virtual layer during an emergency?
• Fixing di/dt via binary modification
  – Will existing techniques help?
  – New algorithms?

Static vs. Dynamic Instances

Data suggests modifying a few code sequences will eliminate many voltage emergencies

[Figure: chart with series Distinct and Total on a logarithmic axis from 1 to 1000000]
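The distinction between distinct (static) and total (dynamic) instances can be made concrete with a small counting sketch. The trace below is invented for illustration and is not data from the talk; `summarize` and `coverage_of_top` are hypothetical helper names, and each trace entry stands for the code address at which an emergency was observed.

```python
from collections import Counter
from typing import List, Tuple


def summarize(trace: List[int]) -> Tuple[int, int]:
    """Return (distinct, total): static locations vs. dynamic occurrences."""
    counts = Counter(trace)
    return len(counts), sum(counts.values())


def coverage_of_top(trace: List[int], top_n: int) -> float:
    """Fraction of all dynamic emergencies that would be removed by
    fixing the top_n most frequent static locations."""
    counts = Counter(trace)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in counts.most_common(top_n))
    return covered / total


# Invented trace: three code locations, one of them dominant.
trace = [0x400A10] * 900 + [0x400B24] * 90 + [0x400C00] * 10
distinct, total = summarize(trace)
top1 = coverage_of_top(trace, 1)
```

When a few locations account for most occurrences, as in this invented trace, modifying only those code sequences removes most of the dynamic emergencies.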

Possible Compiler Optimizations

• Our goal is to
  – Smooth out current profile, or
  – Knock pulses off of the resonant frequency
• Some existing options
  – Software pipelining, code motion, instruction padding

[Diagram: Executable, Binary Modifier (Apply Optimizations), Altered Executable, Microprocessor with Sensor+Actuator Ext’ns]

Loop Unrolling & SW Pipelining

• Problematic loop: A A B B
• Unrolled loop: A A A A B B B B
• Loop unrolling disrupts resonance pulse
• Software pipelining smooths profile

[Figure: current profiles for the problematic loop (A A B B A A B B A A B B), the unrolled loop, and software-pipelined iterations 1 to 3]
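The effect of the unrolled schedule on the pulse period can be checked with a short sketch. It assumes A and B stand for high-current and low-current phases, and that the unrolled loop groups like phases together as shown on the slide; `unroll_grouped` and `pulse_period` are hypothetical helper names.

```python
from itertools import groupby


def unroll_grouped(body: str, factor: int) -> str:
    """Unroll a loop body and group like phases, so each run of
    identical operations becomes `factor` times longer."""
    return "".join(op * (len(list(run)) * factor) for op, run in groupby(body))


def pulse_period(body: str) -> int:
    """Length of the shortest pattern that, repeated, reproduces the body.
    This is the period of the current pulse when the loop runs forever."""
    n = len(body)
    for p in range(1, n + 1):
        if n % p == 0 and body == body[:p] * (n // p):
            return p
    return n


problematic = "AABB"
unrolled = unroll_grouped(problematic, 2)     # "AAAABBBB"
naive = problematic * 2                       # "AABBAABB", no regrouping

period_before = pulse_period(problematic)     # 4
period_after = pulse_period(unrolled)         # 8
period_naive = pulse_period(naive)            # 4
```

Doubling the period halves the pulse frequency, which is how the regrouped schedule can move the pulse away from resonance. Simply duplicating the body without regrouping leaves the period unchanged.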

Unrolling the Di/dt Stressmark

[Figure: supply voltage (axis from 0.97V to 1.02V) before loop unrolling (H, L) and after loop unrolling (H1, H2, L1, L2)]

Summary

• Symbiotic program optimization is a powerful approach
• The di/dt problem – well suited for a symbiotic solution
• The Tortola design can also target power reduction, temperature reduction, reliability, etc.

http://www.tortolaproject.com/