Tortola: Addressing Tomorrow’s Computing Challenges through Hardware/Software Symbiosis
Kim Hazelwood
September 29, 2006
Modern Computing Challenges
• Performance
• Power
  – Energy consumption, max instantaneous power, di/dt
• Temperature
  – Total heat output, “hot spots”
• Reliability
  – Neutron strikes, alpha particles, MTBF, design flaws
• Approaches: Circuit, microarchitecture, compiler
• Constraint: Fixed HW-SW interface (e.g., x86)
Typical Approaches
• Optimize using SW or HW techniques in isolation
• Performance
  – SW: Compile-time optimizations
  – HW: Architectural improvements, VLSI technology
• Reliability: Code/data duplication (HW or SW)
• Power & Temperature
  – HW control mechanisms
  – Profile + recompile cycle
[Figure: SW layer above HW layer]
Modern Design Constraints
• Compilers
  – “Compile once, run anywhere”
  – Cannot ship “MS Office for 1Q05 batch of Pentium-4 3GHz, > 1GB RAM, BrandX power supply, located in high altitudes…”
• Microarchitecture
  – Limited window of application knowledge (past must predict the future)
• VLSI
  – Guaranteed correctness, reliability
We currently must optimize for the common case (but must design for the worst case)
The Power of Virtualization
• A HW-SW interface layer
[Figure: stack of SW Applications, Binary Modifier, and HW. Initially both interfaces are labeled x86; eventually they are labeled SWI and HWI]
Dynamic Binary Modification
• Creates a modified code image at run time
[Figure: EXE, Profile, Transform, Execute, Code Cache]
Examples:
• Dynamo (HP)
• DAISY/BOA (IBM)
• CMS (Transmeta)
• Mojo (Microsoft)
• Strata (UVa)
• Pin (Intel)
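The profile, transform, execute, and code cache stages named on this slide can be illustrated with a toy model. This is a minimal sketch, not how any of the listed systems is implemented: the block names, the `HOT_THRESHOLD` value, and the `transform` step (dropping `nop` instructions) are all hypothetical, and real binary modifiers work on machine code rather than lists of strings.

```python
from collections import Counter

HOT_THRESHOLD = 3  # hypothetical: executions before a block is transformed


def transform(block):
    """Stand-in for an optimization pass: drop no-op instructions."""
    return [ins for ins in block if ins != "nop"]


def run(program, trace):
    """Execute blocks in trace order, caching transformed copies of hot blocks."""
    code_cache = {}      # block id -> transformed block
    profile = Counter()  # block id -> execution count
    executed = []
    for block_id in trace:
        profile[block_id] += 1
        if block_id not in code_cache and profile[block_id] >= HOT_THRESHOLD:
            code_cache[block_id] = transform(program[block_id])
        # Prefer the modified image when one exists, else the original code.
        executed.extend(code_cache.get(block_id, program[block_id]))
    return executed, code_cache


# Hypothetical program: one loop body containing a removable nop.
program = {"loop": ["add", "nop", "mul"], "exit": ["ret"]}
executed, cache = run(program, ["loop"] * 4 + ["exit"])
```

In this toy run the loop block executes unmodified twice, is transformed on its third execution, and runs from the code cache from then on; the exit block never becomes hot and is never cached.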
Dynamic Instrumentation Demo
• Pin
  – Four architectures: IA32, EM64T, IPF, XScale
  – Four OSes: Linux, FreeBSD, MacOS, Windows
  – http://rogue.colorado.edu/pin/
Dynamic Optimization Demo
• DynamoRIO
  – Windows and Linux for IA32
  – http://www.cag.lcs.mit.edu/dynamorio/
Dynamic Binary Modification
• Creates a modified code image at run time
[Figure: EXE, Profile, Transform, Execute, Code Cache]
Examples:
• Dynamo (HP)
• DAISY/BOA (IBM)
• CMS (Transmeta)
• Mojo (Microsoft)
• Strata (UVa)
• Pin (Intel)
• Always triggered by software events … until now
Tortola: Symbiotic Optimization
• Enable HW/SW Communication
[Figure: stack of SW Applications, Binary Modifier, and HW]
Simulation Methodology
• SimpleScalar 4.0 for x86
• Wattch 1.02 power extensions
• Pin dynamic instrumentation system (x86/Linux version)
[Figure: layer mapping]
• SW Application: Benchmarks
• Binary Modifier: Pin
• HW: Wattch & SimpleScalar/x86
Tortola Applications
• Combine global program information with run time feedback
  – System-specific power usage
  – Application-specific heat anomalies
  – Workload/input specific performance optimization
• Reduce hardware complexity
  – No more backwards compatibility warts
  – Fix bugs after shipment
  – Reduce time to market for new architectures
• One such application: The di/dt problem
The Di/dt Problem
• Voltage stability is important for reliability, performance
• Low-power techniques have a negative side effect: current variation
• Dips (undershoots) in supply voltage
  – can cause incorrect values to be calculated or stored
• Spikes (overshoots) in supply voltage
  – can cause reliability problems
The Di/dt Problem
• ITRS cites noise management as a Grand Challenge for the 5-10 year time frame
• Several trends are aggravating the issue:
  – Voltage is scaling down with technology
  – Current draw is increasing
  – Package impedance is not scaling as quickly
  – Aggressive clock gating causes large swings in processor current draw (di/dt)
Di/dt Solutions
• Software: Compiler optimizations
• Co-designed MicroArch & SW: Binary modifier
• MicroArch: Sensor/actuator mechanisms
• Circuit-level: Decoupling capacitors, more Vdd/Gnd pins on package
Sensor-Actuator Mechanisms
• On-chip voltage sensors detect abnormally high/low voltage levels
• On-chip actuator then attempts to quickly raise/lower the processor’s current draw
  – Phantom firing: increases current (at the expense of power)
  – Resource throttling: reduces current (at the expense of performance)
Detecting Imminent Emergencies
[Figure: supply-voltage levels at 1.05V, 1.03V, 1V, 0.97V, and 0.95V, with regions labeled Soft Emergency, Hard Emergency, and Control Threshold]
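The sensor/actuator decision described above can be sketched as a simple threshold check. This is an illustration under stated assumptions: it reads the figure as placing the control thresholds at 0.97V and 1.03V and the hard-emergency limits at 0.95V and 1.05V, which is only one interpretation of the labels, and the names `Action`, `choose_action`, and `is_hard_emergency` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    PHANTOM_FIRE = "phantom_fire"  # raise current draw (costs power)
    THROTTLE = "throttle"          # lower current draw (costs performance)


def choose_action(voltage, low=0.97, high=1.03):
    """Pick an actuator response once the sensed voltage crosses a control threshold.

    The default thresholds are an assumed reading of the slide's figure.
    """
    if voltage > high:
        # Overshoot: draw more current to pull the supply voltage back down.
        return Action.PHANTOM_FIRE
    if voltage < low:
        # Undershoot: draw less current to let the supply voltage recover.
        return Action.THROTTLE
    return Action.NONE


def is_hard_emergency(voltage, hard_low=0.95, hard_high=1.05):
    """True when the voltage is outside the assumed hard-emergency limits."""
    return voltage < hard_low or voltage > hard_high
```

The gap between the control threshold and the hard limit is what gives the actuator time to react before values are corrupted.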
Targeting Mid-Frequency Di/dt
• Problematic: wide current spike
• Worst case: pulse at the resonant frequency
[Figure: two plots of maximum and minimum voltage vs. time (cycles), labeled 60 cycles and 20 cycles. From: Joseph et al., HPCA-9]
A Di/dt Stressmark
BEGIN_LOOP:
  …
  ldt $f1, ($4)
  divt $f1, $f2, $f3
  divt $f3, $f2, $f3
  stt $f3, 8($4)
  ldq $7, 8($4)
  cmovne $31, $7, $3
  stq $3, ($4)
  stq $3, ($4)
  stq $3, ($4)
  …
  stq $3, ($4)
  …
  JUMP BEGIN_LOOP
But… the actuator engages on every loop iteration, degrading performance
Why not correct the problem in the code?
Proposed Solution
• Leverage our additional software layer to supplement existing solutions
• Microarchitecture provides feedback to our software-based virtual layer
[Figure: layers labeled SW, VL, and HW; boxes labeled Altered Executable, Binary, Sensor+Actuator Ext, and Microprocessor]
Required Investigations
• Characterizing emergencies
  – How often do we see di/dt emergency loops?
• Communication between the microarchitecture and the virtual layer
  – What information should be passed to the virtual layer during an emergency?
• Fixing di/dt via binary modification
  – Will existing techniques help?
  – New algorithms?
Static vs. Dynamic Instances
Data suggests modifying a few code sequences will eliminate many voltage emergencies
[Figure: log-scale chart (1 to 1000000) comparing Distinct and Total counts]
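The distinction between distinct (static) and total (dynamic) instances can be made concrete with a small counting sketch. The emergency log below is invented purely for illustration and is not data from the talk; `summarize_emergencies` and `coverage` are hypothetical helper names.

```python
from collections import Counter


def summarize_emergencies(emergency_pcs):
    """Return (distinct, total, ranked) for a log of code addresses.

    Each entry is the address of the code sequence that was running
    when a voltage emergency was observed.
    """
    counts = Counter(emergency_pcs)
    total = sum(counts.values())
    distinct = len(counts)
    return distinct, total, counts.most_common()


def coverage(ranked, total, k):
    """Fraction of all emergencies attributed to the k most frequent sequences."""
    return sum(n for _, n in ranked[:k]) / total


# Hypothetical log: three code sequences, one of them dominant.
log = [0x400] * 90 + [0x480] * 9 + [0x500]
distinct, total, ranked = summarize_emergencies(log)
top1 = coverage(ranked, total, 1)
```

When a handful of sequences account for most of the total, as in this made-up log, patching those few sequences in the code cache would remove most emergencies.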
Possible Compiler Optimizations
• Our goal is to
  – Smooth out current profile, or
  – Knock pulses off of the resonant frequency
• Some existing options
  – Software pipelining, code motion, instruction padding
[Figure: Executable, Binary Modifier (Apply Optimizations), Altered Executable, Sensor+Actuator Ext’ns, Microprocessor]
Loop Unrolling & SW Pipelining
• Problematic loop: A A B B
• Unrolled loop: A A A A B B B B
• Software pipelining smooths the current profile
  [Figure: current vs. time for Iteration=1, Iteration=2, Iteration=3]
• Loop unrolling disrupts the resonance pulse
  [Figure: current vs. time for the sequence A A B B A A B B A A B B]
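One way to see why unrolling can knock a pulse off the resonant frequency is to measure how much of the current waveform repeats at the resonant period. The sketch below is a toy model under stated assumptions: A and B are idealized as high (1.0) and low (0.0) current levels, the resonant period of 4 cycles is hypothetical, and `component_at_period` is simply a single-frequency discrete Fourier transform, not a technique taken from the talk.

```python
import cmath


def component_at_period(current, period):
    """Normalized magnitude of the waveform's DFT component at the given period (in cycles)."""
    n = len(current)
    acc = sum(
        x * cmath.exp(-2j * cmath.pi * t / period)
        for t, x in enumerate(current)
    )
    return abs(acc) / n


RESONANT_PERIOD = 4  # hypothetical resonant period, in cycles
A, B = 1.0, 0.0      # idealized high- and low-current instructions

original = [A, A, B, B] * 16             # problematic loop, 64 cycles
unrolled = [A, A, A, A, B, B, B, B] * 8  # unrolled loop, 64 cycles

before = component_at_period(original, RESONANT_PERIOD)
after = component_at_period(unrolled, RESONANT_PERIOD)
```

With these toy values the original pattern has a component of roughly 0.35 at the assumed resonant period, while the unrolled pattern has essentially none. The unrolled loop does gain a component at a period of 8 cycles, so the choice of unrolling factor has to take the actual resonant frequency into account.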
Unrolling the Di/dt Stressmark
[Figure: supply voltage (0.97V to 1.02V) before loop unrolling (H L) and after loop unrolling (H1 H2 L1 L2)]
Summary
• Symbiotic program optimization is a powerful approach
• The di/dt problem is well suited for a symbiotic solution
• The Tortola design can also target power reduction, temperature reduction, reliability, etc.
http://www.tortolaproject.com/