Power Management
Lecture notes S. Yalamanchili and S. Mukhopadhyay
Technology Scaling
[Figure: MOSFET cross-section with gate, source, drain, and body; gate-oxide thickness $t_{ox}$ and channel length $L$ are labeled]
• 30% scaling down in dimensions → transistor density doubles
$P = \alpha C V_{dd}^2 f + V_{dd} I_{st} + V_{dd} I_{leak}$
• Power per transistor: $V_{dd}$ scaling → lower power
• Transistor delay = $C_{gate} V_{dd} / I_{SAT}$: $C_{gate}$, $V_{dd}$ scaling → lower delay
Moore’s Law
Goal: Sustain Performance Scaling
• Performance scaled with number of transistors
• Dennard scaling*: power scaled with feature size
From wikipedia.org
*R. Dennard, et al., “Design of ion-implanted MOSFETs with very small physical dimensions,” IEEE Journal of Solid State Circuits, vol. SC-9, no. 5, pp. 256-268, Oct. 1974.
Parallelism and Power
[Figures: IBM Power5 (Source: IBM); AMD Trinity (Source: forwardthinking.pcmag.com)]
• How much of the chip area is devoted to compute?
• Run many cores slower. Why does this reduce power?
The Power Wall
$P = \alpha C V_{dd}^2 f + V_{dd} I_{st} + V_{dd} I_{leak}$
• Power per transistor scales with frequency but also scales with $V_{dd}$
• Lower $V_{dd}$ can be compensated for with increased pipelining to keep throughput constant
• Power per transistor is not the same as power per area → power density is the problem!
• Multiple units can be run at lower frequencies to keep throughput constant, while saving power
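The last point can be checked numerically. The following is a minimal sketch, assuming the power equation has the usual form with an activity factor (alpha) on the switching term. Every numeric value is hypothetical, and the short-circuit and leakage currents are held fixed for simplicity even though they also vary with supply voltage in a real device.

```python
def total_power(alpha, c, vdd, f, i_st, i_leak):
    """P = alpha*C*Vdd^2*f + Vdd*I_st + Vdd*I_leak, in watts."""
    dynamic = alpha * c * vdd ** 2 * f      # switching power
    short_circuit = vdd * i_st              # short-circuit power
    leakage = vdd * i_leak                  # static (leakage) power
    return dynamic + short_circuit + leakage


# Hypothetical unit: 1 nF switched capacitance, 20% activity factor.
one_fast = total_power(alpha=0.2, c=1e-9, vdd=1.0, f=2e9, i_st=0.01, i_leak=0.05)

# Two copies at half the frequency and a lower Vdd: same aggregate
# throughput (2 units * f/2), assuming the work divides evenly.
two_slow = 2 * total_power(alpha=0.2, c=1e-9, vdd=0.8, f=1e9, i_st=0.01, i_leak=0.05)

print(f"one fast unit:  {one_fast:.3f} W")
print(f"two slow units: {two_slow:.3f} W")
```

With these made-up numbers the two slower units draw less total power than the single fast one, because the switching term falls with the square of the voltage.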
What is the Problem?
Mukhopadhyay and Yalamanchili (2009)
• Based on scaling using Pentium-class cores
• While Moore’s Law continues, scaling phenomena have changed
• Power densities are increasing with each generation
ITRS Roadmap for Logic Devices
From: “ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems,” P. Kogge, et al., 2008
Power Management Basics
What are my Options?
1. Better technology
  - Manufacturing
  - Better devices (FinFET)
  - New devices: non-CMOS?
  [Slide callouts: "Not this course"; "this is the future"]
2. Be more efficient: activity management
  - Clock gating: dynamic energy/power
  - Power gating: static energy/power
  - Power state management: both
3. Improved architecture
  - Simpler pipelines
4. Parallelism
Activity Management
Clock Gating
[Figure: combinational logic block whose clock input (clk) is gated by a condition signal (cond)]
• Turn off clock to a block of logic
• Eliminate unnecessary transitions/activity
• Reduces clock distribution power
Power Gating
[Figure: Core 0 and Core 1, each connected to $V_{dd}$ through a power gate transistor]
• Turn off power to a block of logic, e.g., core
• No leakage
Multiple Voltage Frequency Domains
[Figure: Intel Sandy Bridge Processor. From E. Rotem et al., Hot Chips 2011]
• Cores and ring in one DVFS domain
• Graphics unit in another DVFS domain
• Cores and portion of cache can be gated off
Processor Power States
• Performance States (P-states)
  - Operate at different voltage/frequencies
    o Recall delay-voltage relationship
  - Lower voltage → lower leakage
  - Lower frequency → lower power (not the same as energy! → longer execution time)
• Idle States (C-states)
  - Sleep states
  - Differ in how much state is saved
• SW or HW managed transitions between states!
Example of P-states
AMD Trinity A10-5800 APU: 100W TDP

Category          CPU P-state   Voltage (V)   Freq (MHz)
HW Only (Boost)   Pb0           1             2400
HW Only (Boost)   Pb1           0.875         1800
SW Visible        P0            0.825         1600
SW Visible        P1            0.812         1400
SW Visible        P2            0.787         1300
SW Visible        P3            0.762         1100
SW Visible        P4            0.75          900

• Software Managed Power States
• Changing Power States is not free
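The voltage and frequency pairs for the A10-5800 show why lower power is not the same as lower energy. The sketch below assumes dynamic power scales as V² · f and that a fixed amount of compute-bound work takes time proportional to 1/f. Static power and transition costs are ignored, and the function name is illustrative only.

```python
# (state, voltage in V, frequency in MHz) for the A10-5800 P-states.
P_STATES = [
    ("Pb0", 1.0, 2400), ("Pb1", 0.875, 1800), ("P0", 0.825, 1600),
    ("P1", 0.812, 1400), ("P2", 0.787, 1300), ("P3", 0.762, 1100),
    ("P4", 0.75, 900),
]


def relative_to(base, states=P_STATES):
    """Dynamic power, run time, and dynamic energy relative to state `base`."""
    table = {name: (v, f) for name, v, f in states}
    v0, f0 = table[base]
    rows = []
    for name, v, f in states:
        power = (v / v0) ** 2 * (f / f0)    # dynamic power ~ V^2 * f
        time = f0 / f                       # fixed cycle count: time ~ 1/f
        rows.append((name, power, time, power * time))  # energy = power * time
    return rows


for name, power, time, energy in relative_to("P0"):
    print(f"{name:>3}  power {power:5.2f}x  time {time:5.2f}x  energy {energy:5.2f}x")
```

Under these assumptions, dropping from P0 to P4 cuts dynamic power by more than half, but the work takes longer, so dynamic energy falls only with the square of the voltage ratio. Once static power is included, the longer run time adds leakage energy, which is why the lowest P-state is not automatically the most energy efficient.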
Example of P-states
From: http://www.intel.com/content/www/us/en/processors/core/2nd-gen-core-family-mobile-vol-1-datasheet.html
Management Knobs
• Each core can be in any one of multiple states
• How do I decide what state to set each core to?
  - Who decides? HW? SW?
• How do I decide when I can turn off a core?
• What am I saving? Static energy or dynamic energy?
Power Management
• Software controlled power management
  - Optimize power and/or energy
  - Orchestrated by the operating system or application libraries
  - Industry standard interfaces for power management
    o Advanced Configuration and Power Interface (ACPI)
      https://www.acpica.org/
      http://www.acpi.info/
• Hardware power management
  - Optimized power/energy
  - Failsafe operation, e.g., protect against thermal emergencies
Power Management
• Performance and energy efficiency depend on utilization of power and thermal headroom
• Convert thermal headroom to higher performance through boost
[Figure: CPU DVFS state versus time. HW-only boost states (Pb0, Pb1) sit above the SW-visible states (P0, P1, P2, ..., Pmin); thermal headroom is marked over time]
Boosting
• Exploit package physics
  - Temperature changes on the order of milliseconds
• Use the thermal headroom
[Figure: Intel Sandy Bridge power versus time, with labels: Low power (build up thermal credits), Turbo boost region (10s of seconds), Max Power, TDP Power]
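One way to see how thermal credits set the length of a boost is a lumped RC thermal model. This is an illustrative assumption rather than the algorithm any particular processor uses: the package is treated as a single thermal resistance and capacitance, and all parameter values below are hypothetical.

```python
import math


def boost_duration(t_start, t_max, t_ambient, power, r_th, c_th):
    """Seconds until a lumped RC thermal model heats from t_start to t_max.

    Temperatures in deg C, power in W, r_th in K/W, c_th in J/K.
    """
    tau = r_th * c_th                       # thermal time constant (s)
    t_steady = t_ambient + power * r_th     # temperature if `power` is held forever
    if t_steady <= t_max:
        return math.inf                     # the limit is never reached
    if t_start >= t_max:
        return 0.0                          # no headroom left
    return tau * math.log((t_steady - t_start) / (t_steady - t_max))


# Hypothetical package: 0.5 K/W and 20 J/K give a 10 s time constant.
cool_start = boost_duration(60, 80, 40, power=100, r_th=0.5, c_th=20)
warm_start = boost_duration(70, 80, 40, power=100, r_th=0.5, c_th=20)
print(f"boost from 60 C: {cool_start:.1f} s, from 70 C: {warm_start:.1f} s")
```

A part that has been running at low power starts cooler, so in this model it can hold the boost power longer before reaching the temperature limit.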
Power Gating
[Figure: Intel Sandy Bridge Processor]
• Turn off components that are not being used
  - Lose all state information
• Costs of powering down
• Costs of powering up
• Smart shutdown
  - Models to guide decisions
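A simple model for the shutdown decision is a break-even time: gating pays off only if the leakage energy saved while the block is off exceeds the energy spent powering down and back up. The sketch below uses hypothetical names and numbers, and it ignores wake-up latency and the cost of restoring lost state.

```python
def break_even_time(e_power_down, e_power_up, p_leak):
    """Idle time in seconds beyond which power gating saves energy.

    Energies in joules, leakage power in watts.
    """
    return (e_power_down + e_power_up) / p_leak


def should_gate(predicted_idle, e_power_down, e_power_up, p_leak):
    """Gate only when the predicted idle period exceeds the break-even time."""
    return predicted_idle > break_even_time(e_power_down, e_power_up, p_leak)


# Hypothetical core: 2 mJ to power down, 3 mJ to power up, 0.5 W leakage.
t_be = break_even_time(2e-3, 3e-3, 0.5)
print(f"break-even idle time: {t_be * 1e3:.1f} ms")
print(should_gate(50e-3, 2e-3, 3e-3, 0.5))   # long idle period
print(should_gate(5e-3, 2e-3, 3e-3, 0.5))    # short idle period
```

The hard part in practice is predicting the idle period, which is where the models mentioned on the slide come in.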
Parallelism
• Concurrency + lower frequency → greater energy efficiency
[Figure: array of core + cache tiles]
• Example
  - 4X #cores
  - 0.75x voltage
  - 0.5x frequency
  - 1X power
  - 2X in performance
$P = \alpha C V_{dd}^2 f + V_{dd} I_{st} + V_{dd} I_{leak}$
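The example numbers can be reproduced from the switching term of the power equation alone. This sketch assumes power is dominated by dynamic power (proportional to cores · V² · f) and that the workload parallelizes perfectly; the function name is illustrative.

```python
def scale(cores, voltage, frequency):
    """Relative dynamic power and throughput versus a one-core baseline."""
    power = cores * voltage ** 2 * frequency    # dynamic power ~ N * V^2 * f
    performance = cores * frequency             # assumes perfectly parallel work
    return power, performance


power, perf = scale(cores=4, voltage=0.75, frequency=0.5)
print(f"power {power:.3f}x, performance {perf:.1f}x")
# prints "power 1.125x, performance 2.0x"
```

The exact power ratio is 1.125, which the slide rounds to 1X; throughput doubles at roughly the same power.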
Simplify Core Design
[Figures: AMD Bulldozer Core; ARM A7 Core (arm.com)]
• Support for branch prediction, schedulers, etc. consumes more energy per instruction
• Can fit many more simpler cores on a die
Metrics
• Power efficiency
  - MIPS/watt
  - Ops/watt
• Energy efficiency
  - Joules/instruction
  - Joules/op
• Composite
  - Energy-delay product
  - Energy-delay² product
Why are these useful?
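A small calculation suggests one answer to why the composite metrics are useful. The sketch compares two hypothetical runs of the same program; the instruction count, run times, and energies are invented for illustration.

```python
def metrics(instructions, seconds, joules):
    """Efficiency metrics for one run of a program."""
    mips = instructions / seconds / 1e6
    watts = joules / seconds
    return {
        "mips_per_watt": mips / watts,
        "joules_per_instruction": joules / instructions,
        "edp": joules * seconds,             # energy-delay product
        "ed2p": joules * seconds ** 2,       # energy-delay^2 product
    }


fast = metrics(instructions=1e9, seconds=1.0, joules=20.0)
slow = metrics(instructions=1e9, seconds=2.0, joules=12.0)
for key in fast:
    print(f"{key:>24}: fast {fast[key]:.3g}  slow {slow[key]:.3g}")
```

The slower run wins on MIPS/watt and Joules/instruction, yet it loses on both energy-delay metrics. The composite metrics penalize designs that buy efficiency simply by running slowly, and energy-delay² weights performance more heavily still.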
Modeling
Microarchitectural Level Models
• How can we study power consumption without building circuits? Models
• Models are available at multiple levels of abstraction.
  - We are interested in microarchitectural models
Processor Microarchitecture
[Figure: processor block diagram with stage labels Fetch, Decode, Execute/Writeback, Memory, and Network. Components: Instruction Cache, Instruction TLB, Branch Prediction, Fetch Queue, Instruction Decoder, Instruction Queue, Register Files, ALU, MUL, FPU, LD, ST, Data TLB, L1 Data Cache, L2 Data Cache, NoC Router, On-Chip Network]
Energy/Power Calculation
• How do we calculate energy or power dissipation for a given microarchitecture?
• Energy/Power varies between:
  - Different ISAs: ARM vs Intel x86
  - Different microarchitectures: in-order vs out-of-order
  - Different applications: memory-bound vs compute-bound
  - Different technologies: 90nm vs 22nm technology
  - Different operating conditions: frequency, temperature
Architecture Activity (1)
[Figure: processor block diagram, as on the Processor Microarchitecture slide]
Activity 1: Instruction Fetch
icache.read++; fbuffer.write++;
• Collect activity counts of each architecture component (through simulation or measurement).
• List of components differs between microarchitectures.
• Activity counts at each component differ between applications.
Architecture Activity (2)
[Figure: processor block diagram, as on the Processor Microarchitecture slide]
Activity 2: Instruction Decode
fbuffer.read++; idecoder.logic++;
• Read/write accesses to caches, buffers, etc.
• Logical accesses to logic blocks such as decoder, ALUs, etc.
• Tradeoff of differentiating more access types (accuracy) vs simulation speed (complexity).
Power and Architecture Activity
• For example, at the n-th clock cycle, the collected counters are:
  - Data cache:
    o read = 20, write = 12;
    o per-read energy = 0.5nJ;
    o per-write energy = 0.6nJ;
    o Read energy = read*per-read energy = 10nJ
    o Write energy = write*per-write energy = 7.2nJ
    o Total activity energy = read+write energies = 17.2nJ
• If n = 50th clock cycle and clock frequency = 2GHz,
  Total activity power = energy*clock_freq/n = 688mW
*Note: n/clock_freq = n clock periods in sec; power = time average of energy
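The data-cache example translates directly into a few lines of code. This is a minimal sketch: the dictionary layout and function name are illustrative, not the interface of any particular simulator.

```python
NJ = 1e-9  # one nanojoule, in joules


def activity_power(counts, per_access_energy, cycles, clock_hz):
    """Return (energy in J, average power in W) for one component."""
    energy = sum(counts[kind] * per_access_energy[kind] for kind in counts)
    elapsed = cycles / clock_hz             # n clock periods, in seconds
    return energy, energy / elapsed         # power = time average of energy


counts = {"read": 20, "write": 12}
per_access = {"read": 0.5 * NJ, "write": 0.6 * NJ}
energy, power = activity_power(counts, per_access, cycles=50, clock_hz=2e9)
print(f"energy = {energy / NJ:.1f} nJ, power = {power * 1e3:.0f} mW")
# prints "energy = 17.2 nJ, power = 688 mW"
```

Summing the same calculation over every component in the microarchitecture gives the total activity power for the sampling window.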
Things to consider (1)
1. How do we calculate per-read/write energies?
• Per-access energies can be estimated from circuit-level designs and analyses.
• There are various open-source tools for this.
[Figure: Architecture Specification + Technology Parameters → Circuit-level Estimation Tool → Estimation Results: Area, Energy, Timing, etc.]
Things to consider (2)
2. Is per-access energy always the same?
• Per-access energy in fact depends on:
  - how many bits are switching
  - how they are switching (0 → 1 or 1 → 0)
• It is reasonable to assume constant per-access energy in long-term observation (e.g., n = 1M clock cycles); the number of switching bits is averaged (e.g., 50% of bits are switching).
• Most architecture simulators do not capture bit-level details due to simulation complexity.
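The 50% figure can be illustrated by counting bit flips between consecutive values written to a 64-bit location. The sketch assumes uniformly random data, which is a convenient stand-in rather than a model of real program behavior.

```python
import random


def switching_fraction(values, width):
    """Average fraction of bits that flip between consecutive values."""
    flips = sum(bin(a ^ b).count("1") for a, b in zip(values, values[1:]))
    return flips / (width * (len(values) - 1))


rng = random.Random(0)                      # fixed seed so the run is repeatable
words = [rng.getrandbits(64) for _ in range(10_000)]
print(f"random data:   {switching_fraction(words, 64):.3f}")
print(f"constant data: {switching_fraction([0xFF] * 100, 64):.3f}")
```

Random data averages close to half the bits switching per access, while a value that never changes switches none. Real workloads fall somewhere in between, which is why a constant per-access energy is only a long-term approximation.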
Things to consider (3)
3. If a register file didn’t have read/write accesses but held data, what is the energy dissipation?
• Energy (or power) is largely comprised of dynamic and static dissipations.
• Dynamic (or switching) energy refers to energy dissipation due to switching activities.
• Static (or leakage) energy is dissipation to keep the electronic system turned on.
• In this case, the register file has no dynamic energy dissipation but consumes static energy.
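The register-file case can be written as a two-term energy model. The per-access energy and leakage power below are hypothetical values chosen only to make the arithmetic visible.

```python
def component_energy(accesses, per_access_energy, leakage_power, seconds):
    """Return (dynamic, static) energy in joules for one component."""
    dynamic = accesses * per_access_energy  # switching energy from activity
    static = leakage_power * seconds        # leakage energy while powered on
    return dynamic, static


# Idle register file: no accesses during a 1 ms window, 10 mW of leakage.
dynamic, static = component_energy(
    accesses=0, per_access_energy=0.2e-9, leakage_power=0.01, seconds=1e-3
)
print(f"dynamic = {dynamic * 1e6:.1f} uJ, static = {static * 1e6:.1f} uJ")
```

The dynamic term is zero, but the static term keeps growing for as long as the register file stays powered, which is the energy that power gating targets.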
Thermal Issues
Thermal Issues
• Heat can cause damage to the chip
  - Need failsafe operation
• Thermal fields change the physical characteristics
  - Leakage current and therefore power increases
  - Delay increases
  - Device degradation becomes worse
• Cooling solution determines the permitted power dissipation
Thermal Design Power (TDP)
[Figure: AMD Trinity APU]
• This is the maximum power at which the part is designed to operate
  - Dictates the design of the cooling system
    o Max temperature $T_{jmax}$
  - Typically fixed by worst case workload
• Parts are typically operating below the TDP
• Opportunities for turbo mode?
http://ecs.vancouver.wsu.edu/thermofluids-research
Heat Sink Limits on Performance
• Thermal design power (TDP)
  - Determines the cooling solution & package limits
• Performance depends on effective utilization of this thermal headroom
• Convert thermal headroom to higher performance through boosting
[Figure: image from www.legitreviews.com; plot over time with labels Workload, Thermal Headroom, Boost power, HW Boost states, SW visible states]
Trinity TDP
Source: http://www.anandtech.com/show/6347/amd-a10-5800k-a8-5600k-review-trinity-on-the-desktop-part-2
Issues
• Cooling chips is now an issue for computer architects!
• Co-design the cooling system and the processor
• Some very “cool” new technologies
  - E.g., microfluidics!
Electrical and Fluidic I/Os
Courtesy L. Zheng (ECE) and Professor Muhannad Bakir (ECE)
• Fluid flow through the microchannels carries heat out to an external heat exchanger (e.g., heat sink)
Fabrication Examples
Courtesy L. Zheng (ECE) and Professor Muhannad Bakir (ECE)
[Figure: Micropin-fins (150 µm diameter and 225 µm diameter) and vias]
[Figure: Electrical and fluidic microbumps, fluidic vias and fine wires]
Conclusions
• Power/energy is the leading driver of modern architecture design
• Power and energy management is key to scalability
• Need integrated power/energy, performance, thermal management in fielded systems
• What about energy/power efficient algorithms?
Study Guide
• Explain the difference between energy dissipation and power dissipation
• Distinguish between static power dissipation and dynamic power dissipation
• Explain dynamic voltage frequency scaling
  - What are power states?
  - Why is this an advantage?
  - What is the impact of DVFS on i) energy, ii) execution time, and iii) power?
• Distinguish between clock gating and power gating
Study Guide (cont.)
• Define thermal design power (TDP)
• Name two schemes for preventing the chip from exceeding TDP. Explain how they achieve this goal
• What does boosting achieve?
• What is the difference between C-states and P-states?
• Name one power management technique that will save static power.
• How does using many slower, simpler cores improve power efficiency?
Study Guide (cont.)
• How is thermal design power (TDP) calculated?
• When using boost algorithms, what determines the duration of the high frequency operation?
• How does a power virus work?
• Describe how throttling works
• Know the power dissipation in some modern processor-memory systems drawn from the embedded, server, and high performance computing segments
Glossary
• Boosting
• C-states
• Dynamic Power and Energy
• Power Gating
• P-states
• Static Power and Energy
• Time constant
• Thermal Design Point
• Throttling