CS252 Graduate Computer Architecture Lecture 12 Branch Prediction

CS252
Graduate Computer Architecture
Lecture 12
Branch Prediction
Possible Projects
October 8th, 2003
Prof. John Kubiatowicz
http://www.cs.berkeley.edu/~kubitron/courses/cs252-F03
CS252 Projects
• Teams of two people from this class
– Projects can overlap with other classes
– Exceptions to the two person requirement need to be OK’d
• Amount of work: three solid weeks
– Spread over the remainder of the term
• Should be a miniature research project
– State of the art (can’t redo something that others have done)
– Should be publishable work
– Must have solid methodology!
• Elements:
– Base architecture to measure against
– Simulation or other analysis against some application set
– Several variations on a theme
CS252 Projects
• DynaCOMP related (or Introspective Computing)
• OceanStore related
• Smart Dust/NEST
• ROC Related Projects
• BRASS project related
• Benchmarking Related (Yelick)
DynaCOMP:
Introspective Computing
• Biological Analogs for computer systems:
– Continuous adaptation
– Insensitivity to design flaws
» Both hardware and software
» Necessary if we can never be sure that all components are working properly…
[Diagram: the introspective cycle -- Compute -> Monitor -> Adapt]
• Examples:
– ISTORE -- applies introspective computing to disk storage
– DynaComp -- applies introspective computing at the chip level
» Compiler always running and part of execution!
DynaCOMP Vision Statement
• Modern microprocessors gather profile information in hardware
in order to generate predictions: Branches, dependencies, and
values.
• Processors such as the Pentium-II employ a primitive form of
“compilation” to translate x86 operations into internal RISC-like
micro-ops.
• So, why not do all of this in software? Make use of a
combination of explicit monitoring, dynamic compilation
technology, and genetic algorithms to:
– Simplify hardware, possibly using large on-chip multiprocessors built from
simple processors.
– Improve performance through feedback-driven optimization. Continuous:
Execution, Monitoring, Analysis, Recompilation
– Generate design complexity automatically so that designers are not required
to, using explicit proof-verification techniques to verify that code
generation is correct.
• This is aptly called Introspective Computing
• Related idea: use of continuous observation to reduce power on
buses!
The Thermodynamic Analogy
• Large Systems have a variety of latent order
– Connections between elements
– Mathematical structure (erasure coding, etc)
– Distributions peaked about some desired behavior
• Permits “Stability through Statistics”
– Exploit the behavior of aggregates (redundancy)
• Subject to Entropy
– Servers/components fail, attacks happen, the system changes
• Requires continuous repair
– Apply energy (i.e. through servers) to reduce entropy
– Introspection restores distributions
ThermoSpective
[Diagram: the introspective cycle again -- Compute -> Monitor -> Adapt]
• Many Redundant Components (Fault Tolerance)
• Continuous Repair (Entropy Reduction)
• What about NanoComputing Domain?
– How will you build reliable systems from unreliable components?
OceanStore Vision
Ubiquitous Devices -> Ubiquitous Storage
• Consumers of data move, change from one device to
another, work in cafes, cars, airplanes, the office,
etc.
• Properties REQUIRED for Endeavour storage
substrate:
– Strong Security: data must be encrypted whenever in the
infrastructure; resistance to monitoring
– Coherence: too much data for naïve users to keep coherent “by
hand”
– Automatic replica management and optimization: huge quantities of
data cannot be managed manually
– Simple and automatic recovery from disasters: probability of failure
increases with size of system
– Utility model: world-scale system requires cooperation across
administrative boundaries
Utility-based Infrastructure
[Diagram: confederation of providers -- Canadian OceanStore, Sprint, AT&T,
Pac Bell, IBM]
• Service provided by confederation of companies
– Monthly fee paid to one service provider
– Companies buy and sell capacity from each other
Preliminary Smart Dust Mote
Brett Warneke, Bryan Atwood, Kristofer Pister
Berkeley Sensor and Actuator Center
Dept. of Electrical Engineering and Computer Sciences
University of California, Berkeley
Smart Dust
[Diagram: a 1-2mm x 1-2mm mote combining a passive transmitter with
corner-cube retroreflector, an active transmitter with laser diode and beam
steering, a receiver with photodetector, sensors, analog I/O/DSP/control,
a power capacitor, a solar cell, and a thick-film battery]
COTS Dust
GOAL:
• Get our feet wet
RESULT:
• Cheap, easy, off-the-shelf RF systems
• Fantastic interest in cheap, easy RF:
– Industry
– Berkeley Wireless Research Center
– Center for the Built Environment (IUCRC)
– PC Enabled Toys (Intel)
– Endeavour Project (UCB)
• Optical proof of concept
Smart Dust/Micro Server Projects
• David Culler and Kris Pister collaborating
• What is the proper operating system for devices of
this nature?
– Linux or Windows is not appropriate!
– State machine execution model is much simpler!
» Assume that little device is backed by servers in net.
» Questions of hardware/software tradeoffs
• What is the high-level organization of zillions of
dust motes in the infrastructure???
• What type of computational/communication ability
provides the right tradeoff between functionality
and power consumption???
A glimpse into the future?
• System-on-a-chip enables computer, memory,
redundant network interfaces without significantly
increasing size of disk
• ISTORE HW in 5-7 years:
– 2006 brick: System On a Chip
integrated with MicroDrive
» 9GB disk, 50 MB/sec from disk
» connected via crossbar switch
» From brick to “domino”
– If low power, 10,000 nodes fit into one
rack!
• O(10,000) scale is our ultimate
design point
ROC vision:
Storage System of the Future
• Availability, Maintainability, and Evolutionary growth
key challenges for storage systems
– Maintenance Cost ~ >10X Purchase Cost per year,
– Even 2X purchase cost for 1/2 maintenance cost wins
– AME improvement enables even larger systems
• ISTORE has cost-performance advantages
– Better space, power/cooling costs ($ @ colocation site)
– More MIPS, cheaper MIPS, no bus bottlenecks
– Compression reduces network $, encryption protects
– Single interconnect, supports evolution of technology
• Match to future software storage services
– Future storage service software targets clusters
Is Maintenance the Key?
• Rule of Thumb: Maintenance costs 10X to 100X the HW
– so over a 5-year product life, ~95% of cost is maintenance
(at 10X, HW is 1 part in 11; at 100X, 1 in 101)
• VAX crashes ‘85, ‘93 [Murp95]; extrap. to ‘01
• Sys. Man.: N crashes/problem, SysAdmin action
– Actions: set params bad, bad config, bad app install
• HW/OS 70% in ‘85 to 28% in ‘93. In ‘01, 10%?
Availability benchmark methodology
• Goal: quantify variation in QoS metrics as events
occur that affect system availability
• Leverage existing performance benchmarks
– to generate fair workloads
– to measure & trace quality of service metrics
• Use fault injection to compromise system
– hardware faults (disk, memory, network, power)
– software faults (corrupt input, driver error returns)
– maintenance events (repairs, SW/HW upgrades)
• Examine single-fault and multi-fault workloads
– the availability analogues of performance micro- and macrobenchmarks
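
To make the methodology concrete, here is a minimal C sketch of a
single-fault run (all names here are illustrative, not from the papers; a
real harness would drive a live system under a standard workload):

```c
#include <stdio.h>

typedef double (*qos_fn)(void);     /* samples the current quality of service */
typedef void   (*fault_fn)(void);   /* injects one fault (disk, memory, ...) */

/* Single-fault availability benchmark: trace QoS before, during, and
   after one injected fault, mirroring the methodology above. */
static void run_single_fault(qos_fn measure, fault_fn inject,
                             int total_steps, int inject_at) {
    for (int t = 0; t < total_steps; t++) {
        if (t == inject_at)
            inject();                        /* compromise the system */
        printf("%d\t%.3f\n", t, measure());  /* time vs. QoS trace */
    }
}

/* Stand-ins so the sketch runs; real code would probe the system under test. */
static int faulted = 0;
static double fake_qos(void)   { return faulted ? 0.6 : 1.0; }
static void   fake_fault(void) { faulted = 1; }

int main(void) {
    run_single_fault(fake_qos, fake_fault, /*total_steps=*/10, /*inject_at=*/4);
    return 0;
}
```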
Quantum Architecture:
Use of “Spin” for QuBits
[Diagram: a spin-1/2 particle (proton/electron) with its North/South poles
up or down; representation: |0> or |1>]
• Quantum effect gives “1” and “0”:
– Either spin is “UP” or “DOWN”; nothing in between
• Superposition: Mix of “1” and “0”:
– Written as: |ψ> = C0|0> + C1|1>
– An n-qubit register can hold 2^n values simultaneously! For n = 3:
|ψ> = C000|000> + C001|001> + C010|010> + C011|011> +
      C100|100> + C101|101> + C110|110> + C111|111>
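
For reference, the general n-qubit state can be written compactly (standard
textbook notation, not from the slide):

```latex
% General n-qubit register state: a superposition over all 2^n basis
% strings x, with complex amplitudes C_x normalized to unit probability.
\[
  |\psi\rangle = \sum_{x \in \{0,1\}^n} C_x \,|x\rangle ,
  \qquad
  \sum_{x \in \{0,1\}^n} |C_x|^2 = 1 .
\]
```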
Skinner-Kane Si-based computer
[Diagram: phosphorus ions embedded in a silicon substrate beneath a row of
alternating A-gates and S-gates; donor electrons shuttle between sites in a
global magnetic field B, and SETs at the edge perform measurement]
• Silicon substrate
• Phosphorus ion spin + donor electron spin = qubit
• S-gate
– Electron shuttling
• A-gate
– Hyperfine interaction
– Electron-ion spin swap
• Global magnetic field
– 0 <-> 1 qubit flip
• Single-electron transistors (SETs)
– Qubit readout
Interesting Ubiquitous Component:
The Entropy Exchange Unit
[Diagram: garbage in (#!$**#), zeros out (000000)]
• Possibilities for cooling:
– Spin-polarized photons -> spin-polarized electrons -> spin-polarized nucleons
– Simple thermal cooling of some sort
• Two material domains:
– One material in contact with environment
• Analysis of properties of such a system
Swap cell
[Diagram: a chain of S-A-S gate triples; electron e1 is shuttled in from one
end and e2 from the other, each stepping site to site until it performs an
electron-ion spin swap at a P ion]
• A lot of steps for two qubits!
Swap Cell Control Complexity
[Diagram: control signals on electrodes A1, A2, and S1-S4 over a ~24-step
pulse sequence shuttling e1 and e2 through the cell; at one point the
electrons are too close, and the sequence culminates in an electron-ion
spin swap]
• What a mess! Long pulse sequence…
Single-electron transistors (SETs)
[Diagram: SET schematic -- a control input gates an island between two
tunnel junctions, loaded by VDD and CLOAD; after Y. Takahashi et al.]
• Electrons move one-by-one through tunnel junction
onto quantum dot and out other side
• Work well at low temperatures
• Low drive current (~5nA) and voltage swing (~40mV)
Swap control circuit
[Diagram: an S-gate pulse cascade (S1 on -> S2 on -> S3 on -> S4 on -> A on)
built from chained flip-flops, sequenced by a 5-bit counter and an 8-bit
counter with a reset; the A-gate pulse repeats 24 times with a 2:254
on-off ratio]
• Can this even be built with SETs?
In SIMD we trust?
• Large control circuit/small swap cell ratio = SIMD
• Like clock distribution network
• Clock skew at 11.3GHz?
• Error correction?
[Diagram: one Swap Control block fanned out to many identical columns of
S/A gate cells, like a clock tree]
BRASS Vision Statement
• The emergence of high-capacity reconfigurable devices
is igniting a revolution in general-purpose processing.
It is now becoming possible to tailor and dedicate
functional units and interconnect to take advantage of
application-dependent dataflow. Early research in this
area of reconfigurable computing has shown
encouraging results in a number of spot areas, including
cryptography, signal processing, and searching,
achieving 10-100x computational density and reduced
latency over more conventional processor solutions.
• BRASS: Microprocessor & FPGA on a single chip:
– use some of the millions of transistors to customize HW dynamically to
the application
Architecture Target
[Diagram: reconfigurable array of 128-LUT compute pages paired with 2Mbit
memory blocks]
• Integrated RISC core + memory system + reconfigurable array.
• Combined RAM/Logic structure.
• Rapid reconfiguration with many contexts.
• Large local data memories and buffers.
• These capabilities enable:
– hardware virtualization
– on-the-fly specialization
SCORE: Stream-oriented computation model
Goal: Provide view of reconfigurable hardware which exposes
strengths while abstracting physical resources.
• Computations are expressed as
data-flow graphs.
• Graphs are broken up into compute
pages.
• Compute pages are linked together
in a data-flow manner with
streams.
• A run-time manager allocates and
schedules pages for computations
and memory.
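
A toy C sketch of the page/stream idea (types and names are ours, not
SCORE's actual API): compute pages are operators linked by bounded FIFO
streams, and a trivial "run-time manager" fires any page whose input is
ready:

```c
#include <stdio.h>

#define STREAM_CAP 16

/* A bounded FIFO stream linking two compute pages. */
typedef struct { int buf[STREAM_CAP]; int head, tail, count; } stream_t;

static int  stream_ready(stream_t *s)      { return s->count > 0; }
static void stream_put(stream_t *s, int v) { s->buf[s->tail++ % STREAM_CAP] = v; s->count++; }
static int  stream_get(stream_t *s)        { s->count--; return s->buf[s->head++ % STREAM_CAP]; }

/* Two "compute pages" expressed as dataflow operators. */
static void page_square(stream_t *in, stream_t *out) {
    int v = stream_get(in);
    stream_put(out, v * v);
}
static void page_print(stream_t *in) {
    printf("%d\n", stream_get(in));
}

int main(void) {
    stream_t a = {0}, b = {0};
    for (int i = 1; i <= 5; i++) stream_put(&a, i);   /* source data */

    /* Trivial run-time manager: fire any page whose input stream is ready. */
    while (stream_ready(&a) || stream_ready(&b)) {
        if (stream_ready(&a)) page_square(&a, &b);
        if (stream_ready(&b)) page_print(&b);
    }
    return 0;
}
```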
Ok. Back to Branch Prediction
Review: Problem: “Fetch” unit
[Diagram: Instruction Fetch with Branch Prediction feeds a stream of
instructions to execute into the Out-of-Order Execution Unit, which returns
correctness feedback on branch results]
• Instruction fetch decoupled from execution
• Often issue logic (+ rename) included with Fetch
Branches must be resolved quickly for loop overlap!
• In our loop-unrolling example, we relied on the fact that branches
were under control of “fast” integer unit in order to get overlap!
Loop: LD    F0, 0(R1)
      MULTD F4, F0, F2
      SD    F4, 0(R1)
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
• What happens if the branch depends on the result of the MULTD?
– We completely lose all of our advantages!
– Need to be able to “predict” branch outcome.
– If we were to predict that branch was taken, this would be
right most of the time.
• Problem much worse for superscalar machines!
Review: Predicated Execution
• Avoid branch prediction by turning branches
into conditionally executed instructions:
if (x) then A = B op C else NOP
– If false, then neither store result nor cause exception
– Expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have
conditional move; PA-RISC can annul any following
instruction
– IA-64: 64 1-bit predicate fields allow conditional
execution of any instruction
– This transformation is called “if-conversion”
[Diagram: condition x guards the operation A = B op C]
• Drawbacks to conditional instructions
– Still takes a clock even if “annulled”
– Stall if condition evaluated late
– Complex conditions reduce effectiveness;
condition becomes known late in pipeline
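
To make if-conversion concrete, here is a minimal C sketch (the '+' op and
function names are ours): the second form computes the result
unconditionally and then selects, which is what a conditional-move
instruction implements. Note that real predication also suppresses
exceptions from the annulled path, which C itself cannot express:

```c
#include <stdint.h>

/* Branching form of "if (x) then A = B op C else NOP":
   the compiler emits a conditional branch that must be predicted. */
int32_t with_branch(int x, int32_t a, int32_t b, int32_t c) {
    if (x)
        a = b + c;          /* '+' stands in for "op" */
    return a;
}

/* If-converted form: compute unconditionally, then select.
   On Alpha/MIPS/PowerPC/SPARC the select maps to a conditional move,
   so no branch is needed -- but the op must be safe to execute always. */
int32_t if_converted(int x, int32_t a, int32_t b, int32_t c) {
    int32_t t = b + c;      /* always executes, even when x is false */
    return x ? t : a;       /* branch-free select (cmov-style) */
}
```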
Dynamic Branch Prediction Problem
[Diagram: incoming branches { Address } enter a Branch Predictor holding
history information; it emits a fast stream of predictions
{ Address, Value }, and corrections { Address, Value } flow back from the
pipeline]
• Incoming stream of addresses
• Fast outgoing stream of predictions
• Correction information returned from pipeline
Review: Branch Target Buffer
• Branch Target Buffer (BTB): address of branch used as index to get
prediction AND branch target address (if taken)
– Note: must check for a branch match, since we can't use the wrong
branch's address (Figure 4.22, p. 273)
[Diagram: the PC of the instruction in FETCH indexes a table of
(Branch PC, Predicted PC) pairs; an =? tag match on Branch PC determines
whether to predict taken or untaken]
• Return instruction addresses predicted with stack
• Remember branch folding (Crisp processor)?
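
A minimal C sketch of the BTB lookup/update just described (the size and
names are ours; a real BTB is a small associative memory probed every
fetch cycle):

```c
#include <stdint.h>
#include <stdbool.h>

#define BTB_ENTRIES 512   /* hypothetical size */

/* One BTB entry: tag of the branch PC plus its predicted target. */
typedef struct {
    uint64_t branch_pc;   /* full PC kept as the tag ("must check for match") */
    uint64_t target_pc;   /* predicted next PC if taken */
    bool     valid;
} btb_entry_t;

static btb_entry_t btb[BTB_ENTRIES];

/* Lookup at fetch: returns true and the predicted target on a hit. */
bool btb_lookup(uint64_t pc, uint64_t *predicted) {
    btb_entry_t *e = &btb[(pc >> 2) % BTB_ENTRIES];
    if (e->valid && e->branch_pc == pc) {  /* tag check avoids wrong target */
        *predicted = e->target_pc;
        return true;
    }
    return false;
}

/* Update when a taken branch resolves. */
void btb_update(uint64_t pc, uint64_t target) {
    btb_entry_t *e = &btb[(pc >> 2) % BTB_ENTRIES];
    e->branch_pc = pc;
    e->target_pc = target;
    e->valid = true;
}
```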
Branch (Pattern?) History Table
[Diagram: Branch PC indexes a table of predictors
(Predictor 0, Predictor 1, …, Predictor 7)]
• BHT is a table of “Predictors”
– Usually 2-bit, saturating counters
– Indexed by PC address of Branch – without tags
• In Fetch state of branch:
– BTB identifies branch
– Predictor from BHT used to make prediction
• When branch completes
– Update corresponding Predictor
Review: Dynamic Branch Prediction
(Jim Smith, 1981)
• “Predictor”: 2-bit scheme where change prediction
only if get misprediction twice
[Diagram: 2-bit saturating-counter state machine -- two "Predict Taken"
states and two "Predict Not Taken" states; each taken (T) outcome moves
toward Predict Taken, each not-taken (NT) outcome toward Predict Not Taken]
• Red: stop, not taken
• Green: go, taken
• Adds hysteresis to decision making process
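
Putting the last two slides together, here is a minimal C sketch of a
tagless BHT of 2-bit saturating counters (table size and names are ours):

```c
#include <stdint.h>
#include <stdbool.h>

#define BHT_ENTRIES 4096                 /* e.g. the 4,096-entry table below */

/* 2-bit saturating counters: 0,1 = predict not taken; 2,3 = predict taken.
   Flipping the prediction takes two consecutive mispredictions (hysteresis). */
static uint8_t bht[BHT_ENTRIES];

/* Index by low-order PC bits -- no tags, so unrelated branches may alias. */
static inline unsigned bht_index(uint64_t pc) {
    return (pc >> 2) % BHT_ENTRIES;
}

bool bht_predict(uint64_t pc) {
    return bht[bht_index(pc)] >= 2;      /* MSB of the counter */
}

void bht_update(uint64_t pc, bool taken) {
    uint8_t *ctr = &bht[bht_index(pc)];
    if (taken) { if (*ctr < 3) (*ctr)++; }   /* saturate at 3 */
    else       { if (*ctr > 0) (*ctr)--; }   /* saturate at 0 */
}
```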
Correlating Branches
• Hypothesis: recent branches are correlated; that is, behavior of
recently executed branches affects prediction of current branch
• Two possibilities; Current branch depends on:
– Last m most recently executed branches anywhere in program
Produces a “GA” (for “global adaptive”) in the Yeh and Patt
classification (e.g. GAg)
– Last m most recent outcomes of same branch.
Produces a “PA” (for “per-address adaptive”) in same classification
(e.g. PAg)
• Idea: record m most recently executed branches as taken or not
taken, and use that pattern to select the proper branch history
table entry
– A single history table shared by all branches (appends a “g” at end),
indexed by history value.
– Address is used along with history to select table entry (appends a
“p” at end of classification)
– If only a portion of the address is used, an “s” is often appended to
indicate “set-indexed” tables (i.e., GAs)
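
A minimal C sketch of an (m,2) correlating predictor in the spirit of the
1,024-entry (2,2) scheme discussed later (names are ours): m global history
bits select one of 2^m 2-bit counters within the entry chosen by the
branch address:

```c
#include <stdint.h>
#include <stdbool.h>

#define HIST_BITS 2                    /* m = 2 most recent branch outcomes */
#define TABLE_ENTRIES 1024             /* e.g. the 1,024-entry (2,2) scheme */

static uint8_t counters[TABLE_ENTRIES][1 << HIST_BITS];  /* 2-bit counters */
static unsigned ghr;                   /* global history register */

bool predict(uint64_t pc) {
    unsigned idx = (pc >> 2) % TABLE_ENTRIES;
    return counters[idx][ghr] >= 2;
}

void update(uint64_t pc, bool taken) {
    unsigned idx = (pc >> 2) % TABLE_ENTRIES;
    uint8_t *ctr = &counters[idx][ghr];
    if (taken) { if (*ctr < 3) (*ctr)++; } else { if (*ctr > 0) (*ctr)--; }
    /* Shift the new outcome into the m-bit global history. */
    ghr = ((ghr << 1) | (taken ? 1u : 0u)) & ((1u << HIST_BITS) - 1);
}
```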
Discussion of Yeh and Patt classification
[Diagram: GAg = a global branch history register (GBHR) indexing a global
pattern history table (GPHT); PAg = per-address branch history registers
(PABHR) indexing a GPHT; PAp = PABHR indexing per-address pattern history
tables (PAPHT)]
• GAg: Global History Register, Global History Table
• PAg: Per-Address History Register, Global History Table
• PAp: Per-Address History Register, Per-Address History Table
Other Global Variants:
Try to Avoid Aliasing
[Diagram: GAs = GBHR concatenated with address bits indexing a PAPHT;
GShare = GBHR XORed with the branch address indexing a GPHT]
• GAs: Global History Register,
Per-Address (Set Associative) History Table
• Gshare: Global History Register, Global History Table with
Simple attempt at anti-aliasing
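
For Gshare the only change from the sketch above is the index computation
(parameter names are ours):

```c
#include <stdint.h>

/* Gshare anti-aliasing: XOR the global history into the PC bits so that
   the same branch with different histories (and different branches with
   the same history) tend to land on different GPHT counters. */
unsigned gshare_index(uint64_t pc, unsigned ghr, unsigned table_bits) {
    unsigned mask = (1u << table_bits) - 1;
    return ((unsigned)(pc >> 2) ^ ghr) & mask;
}
```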
What are Important Metrics?
• Clearly, Hit Rate matters
– Even 1% can be important when above 90% hit rate
• Speed: Does this affect cycle time?
• Space: Clearly Total Space matters!
– Papers which do not try to normalize across different
options are playing fast and loose with the data
– Try to get best performance for the cost
Accuracy of Different Schemes
(Figure 4.21, p. 272)
[Figure: frequency of mispredictions (0%-18%) on SPEC89 benchmarks (nasa7,
matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, li) for
three schemes: a 4,096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a
1,024-entry (2,2) correlating BHT. The two 2-bit schemes perform nearly
identically, while the smaller correlating predictor matches or beats both,
e.g. cutting eqntott from 18% to 6%]
Discussion of Papers
• A Comparative Analysis of Schemes for Correlated
Branch Prediction
– Cliff Young, Nicolas Gloy and Michael D. Smith
• An Analysis of Correlation and Predictability: What
Makes Two-Level Branch Predictors Work?
– Marius Evers, Sanjay J. Patel, Robert S. Chappel, and Yale N. Patt
Summary #1
Dynamic Branch Prediction
• Prediction is becoming an important part of scalar
execution.
– Prediction is exploiting “information compressibility” in execution
• Branch History Table: 2 bits for loop accuracy
• Correlation: Recently executed branches correlated
with next branch.
– Either different branches (GA)
– Or different executions of same branches (PA).
• Branch Target Buffer: include branch address &
prediction
• Predicated Execution can reduce number of
branches, number of mispredicted branches
Summary #2
• Prediction, prediction, prediction!
– Over next couple of lectures, we will explore prediction of
everything! Branches, Dependencies, Data
• The high prediction accuracies will cause us to ask:
– Is the deterministic Von Neumann model the right one???