Exploring Efficient SMT Branch Predictor Design
Matt Ramsay, Chris Feucht & Mikko H. Lipasti
University of Wisconsin-Madison
PHARM Team
www.ece.wisc.edu/~pharm
WCED: June 7, 2003
Introduction & Motivation

Two main performance limitations:
- Memory stalls
- Pipeline flushes due to incorrect speculation

In SMTs:
- Multiple threads help hide these problems
- However, multiple threads make speculation harder because of interference with shared prediction resources
- This interference can cause more branch mispredicts and thus limit potential performance
Introduction & Motivation

We study:
- Providing each thread with its own pieces of the branch predictor, to eliminate interference between threads
- Applying these changes to different branch prediction schemes to evaluate their performance

We hypothesize:
- Eliminating thread interference in the branch predictor will improve prediction accuracy
- Thread-level parallelism in an SMT makes branch prediction accuracy much less important than in a single-threaded processor
Talk Outline
- Introduction & Motivation
- SMT Overview
- Branch Prediction Overview
- Test Methodology
- Results
- Conclusions
SMT Overview

Simultaneous Multithreading:
- Machines often have more resources than can be used by one thread
- SMT allows TLP along with ILP
- [Figure: 4-wide issue example]
Tested Predictors

Static Predictors (in paper):
- Always Taken
- Backward-Taken-Forward-Not-Taken

2-Bit Predictor:
- Branch History Table (BHT) indexed by the PC of the branch instruction
- Allows significant aliasing between branches that share low bits of the PC
- Does not take advantage of global branch history information

Gshare Predictor (see the sketch after this list):
- BHT indexed by the XOR of the branch PC and the global branch history
- Hashing reduces aliasing
- Correlates the prediction with global branch behavior
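To make the indexing concrete, below is a minimal C sketch of a Gshare lookup with 2-bit saturating counters; it is illustrative only, and the table size, shift amounts, and names are assumptions rather than details taken from the paper.

/* Illustrative sketch (not from the paper): Gshare lookup into a table of
 * 2-bit saturating counters.  Sizes and names are assumptions. */
#include <stdint.h>

#define BHT_ENTRIES 4096                 /* 2^12 entries, one per 12-bit index */

static uint8_t  bht[BHT_ENTRIES];        /* 2-bit saturating counters (0..3) */
static uint16_t global_history;          /* most recent branch outcomes, 1 bit each */

static unsigned gshare_index(uint32_t pc) {
    /* XOR the branch PC with the global history to spread aliases apart */
    return ((pc >> 2) ^ global_history) & (BHT_ENTRIES - 1);
}

int predict_taken(uint32_t pc) {
    return bht[gshare_index(pc)] >= 2;   /* upper half of the counter means "taken" */
}

void update_predictor(uint32_t pc, int taken) {
    unsigned i = gshare_index(pc);
    if (taken  && bht[i] < 3) bht[i]++;  /* saturate at strongly taken */
    if (!taken && bht[i] > 0) bht[i]--;  /* saturate at strongly not taken */
    global_history = ((global_history << 1) | (taken & 1)) & (BHT_ENTRIES - 1);
}

The plain 2-bit predictor would index with (pc >> 2) & (BHT_ENTRIES - 1) instead, which is why branches that share low PC bits alias in that scheme.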
YAGS Predictor
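The original slide is a block diagram. As a rough orientation only, the sketch below follows the YAGS organization described in the branch-prediction literature: a choice table indexed by the PC supplies a per-branch bias, and two small tagged "exception" caches, indexed with the global history, override that bias on a tag hit. The sizes, tag width, and names here are assumptions, not details from this paper.

/* Hedged sketch of a YAGS-style lookup; structure sizes and names are assumptions. */
#include <stdint.h>

#define CHOICE_ENTRIES 4096
#define CACHE_ENTRIES  1024
#define TAG_BITS       6

struct yags_entry { uint8_t tag; uint8_t ctr; };   /* short PC tag + 2-bit counter */

static uint8_t  choice[CHOICE_ENTRIES];            /* 2-bit bias counters, PC-indexed */
static struct yags_entry t_cache[CACHE_ENTRIES];   /* exceptions when the bias is "not taken" */
static struct yags_entry nt_cache[CACHE_ENTRIES];  /* exceptions when the bias is "taken" */
static uint16_t ghist;                             /* global branch history */

int yags_predict(uint32_t pc) {
    unsigned ci  = (pc >> 2) & (CHOICE_ENTRIES - 1);
    unsigned di  = ((pc >> 2) ^ ghist) & (CACHE_ENTRIES - 1);
    uint8_t  tag = (pc >> 2) & ((1u << TAG_BITS) - 1);

    if (choice[ci] >= 2) {                         /* bias says taken */
        if (nt_cache[di].tag == tag)               /* recorded exception? */
            return nt_cache[di].ctr >= 2;
        return 1;
    } else {                                       /* bias says not taken */
        if (t_cache[di].tag == tag)
            return t_cache[di].ctr >= 2;
        return 0;
    }
}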
Indirect Branch Predictor
- Predicts the target of jump-register (JR) instructions
- Prediction table holds full target addresses
- Larger table entries lead to more aliasing, since fewer entries fit in the same storage budget
- Indexed like the Gshare branch predictor (see the sketch below)
- A split indirect predictor caused little change in branch prediction accuracy or overall performance (in paper)
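A minimal sketch (not from the paper) of a Gshare-indexed indirect-target table; the entry count, history width, names, and update policy are illustrative assumptions.

/* Illustrative sketch: indirect-target table indexed Gshare-style. */
#include <stdint.h>

#define IT_ENTRIES 1024

static uint32_t indirect_target[IT_ENTRIES];  /* each entry holds a full target PC */
static uint16_t indirect_history;             /* recent branch outcomes, 1 bit each */

uint32_t predict_jr_target(uint32_t pc) {
    unsigned i = ((pc >> 2) ^ indirect_history) & (IT_ENTRIES - 1);
    return indirect_target[i];                /* may belong to an aliasing JR site */
}

void train_jr_target(uint32_t pc, uint32_t actual_target) {
    unsigned i = ((pc >> 2) ^ indirect_history) & (IT_ENTRIES - 1);
    indirect_target[i] = actual_target;       /* simple always-replace policy (assumption) */
}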
Talk Outline
- Introduction & Motivation
- SMT Overview
- Branch Prediction Overview
- Test Methodology
- Results
- Conclusions
Simulation Environment

Multithreaded version of SimpleScalar developed by Craig Zilles at UW

Machine Configuration:
- # of Threads = 4
- # of Address Spaces = 4
- # Bits in Branch History = 12
- # of BHT Entries = 4096
- # Bits in Indirect History = 10
- # of IT Entries = 1024
- Machine Width = 4
- Pipeline Depth = 15
- Max Issue Window = 64
- # of Physical Registers = 512
- # Instructions Simulated = ~40M
- L1 Latency = 1 cycle
- L2 Latency = 10 cycles
- Mem Latency = 200 cycles
- L1 Size = 32 KB
- L1 Associativity = direct-mapped
- L1 Block Size = 64 B
- L2 Size = 1 MB
- L2 Associativity = 4-way
- L2 Block Size = 128 B
Benchmarks Tested

From SPEC CPU2000:
- INT: crafty, gcc
- FP: ammp, equake

Benchmark Configurations:
- Heterogeneous Threads: each thread runs one of the listed benchmarks, to simulate a multi-tasking environment
- Homogeneous Threads: each thread runs a separate copy of the same benchmark (crafty), to simulate a multithreaded server environment
Shared Configuration

[Diagram: Threads 0-3 all index one shared history register and one shared predictor table]
Split Branch Configuration

[Diagram: each of Threads 0-3 has its own private history register and its own private predictor table]

Predictor block retains original size when duplicated
Split Branch Table Configuration

[Diagram: Threads 0-3 share one history register; the thread ID selects one of four private predictor tables]
Split History Configuration

[Diagram: the thread ID selects one of four private history registers (History 0-3); all threads share one predictor table]

A code sketch comparing the four configurations follows.
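To summarize the four organizations side by side, here is a hedged C sketch of how a Gshare-style lookup might select the history register and predictor table in each configuration. The array layout and names are assumptions; the slides give only block diagrams. Per-thread tables are kept full size here, in line with the note that duplicated predictor blocks retain their original size.

/* Hedged sketch: history/table selection in each configuration. */
#include <stdint.h>

#define THREADS     4
#define BHT_ENTRIES 4096

static uint8_t  bht[THREADS][BHT_ENTRIES];   /* bht[0] used alone when the table is shared   */
static uint16_t history[THREADS];            /* history[0] used alone when history is shared */

static unsigned hash(uint32_t pc, uint16_t hist) {
    return ((pc >> 2) ^ hist) & (BHT_ENTRIES - 1);
}

/* Shared: one history, one table, for all threads. */
uint8_t *lookup_shared(uint32_t pc, int tid) {
    (void)tid;
    return &bht[0][hash(pc, history[0])];
}

/* Split branch: private history and private table per thread. */
uint8_t *lookup_split_branch(uint32_t pc, int tid) {
    return &bht[tid][hash(pc, history[tid])];
}

/* Split branch table: shared history, thread ID selects a private table. */
uint8_t *lookup_split_table(uint32_t pc, int tid) {
    return &bht[tid][hash(pc, history[0])];
}

/* Split history: private history per thread, one shared table. */
uint8_t *lookup_split_history(uint32_t pc, int tid) {
    return &bht[0][hash(pc, history[tid])];
}

Each function returns a pointer to the 2-bit counter that would be read and later updated for the branch, so the only difference between configurations is which history register and which table the thread ID selects.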
Talk Outline
- Introduction & Motivation
- SMT Overview
- Branch Prediction Overview
- Test Methodology
- Results
- Conclusions
Split Branch Predictor Accuracy

[Chart: % mispredicts (0-60%) for the YAGS, Gshare, and 2-bit predictors on ammp, crafty, equake, and gcc]

Full predictor split: the predictors act as expected, as they would in a single-threaded environment
Shared Branch Predictor Accuracy

[Chart: % mispredicts (0-60%) for the YAGS, Gshare, and 2-bit predictors on ammp, crafty, equake, and gcc]

Shared predictor: accuracy suffers because of interference from other threads (especially for Gshare)
Prediction Accuracy: Heterogeneous Threads

[Chart: Gshare % mispredicts (0-60%) on ammp, crafty, equake, and gcc for the Shared, Split Branch, Split Branch Table, and Split History configurations]

YAGS & Gshare:
- Sharing the history register performs very poorly
- The split history configuration performs almost as well as the split branch configuration while using significantly fewer resources

2-Bit: splitting the predictor also performs better; mispredicts drop from 9.52% to 8.35%
Prediction Accuracy: Homogeneous Threads

[Chart: Gshare % mispredicts (0-60%) for four copies of crafty under the Shared, Split Branch, Split Branch Table, and Split History configurations]

YAGS & Gshare:
- The configurations perform similarly to the heterogeneous-thread case
- The split history configuration comes even closer to the split branch configuration because of positive aliasing in the BHT
- Surprisingly, splitting portions of the predictor still performs better even when every thread runs the same program
Per Thread CPI: Heterogeneous Threads

[Chart: Gshare per-thread CPI (1.0-3.0) on ammp, crafty, equake, and gcc for the Shared, Split Branch, Split Branch Table, and Split History configurations]

- Sharing the history register with Gshare has a significant negative effect on performance (nearly 50% mispredicts)
- The split history configuration delivers almost the same performance as the split branch configuration while using significantly fewer resources
Per Thread CPI: Homogeneous Threads

[Chart: Gshare per-thread CPI (1.0-3.0) for four copies of crafty under the Shared, Split Branch, Split Branch Table, and Split History configurations]

Per-thread performance is worse in the homogeneous-thread configuration because crafty has the highest number of cache misses among the benchmarks tested
Performance Across Predictors

[Chart: Split Branch configuration CPI, normalized to YAGS (0.80-1.20), for the YAGS, Gshare, and 2-bit predictors across the four crafty threads]

- The branch prediction scheme has little effect on performance
- Only 2.75% and 5% CPI increases when the Gshare and 2-bit predictors are used instead of the much more expensive YAGS
- The increases are 6% and 11% in a single-threaded machine
- The heterogeneous thread configuration performs similarly
Performance Across Predictors

[Chart: Split History configuration CPI, normalized to YAGS (0.80-1.20), for the YAGS, Gshare, and 2-bit predictors across the four crafty threads]

- The split history configuration still lets performance hold up with simpler schemes
- 4% and 6.25% CPI increases for the Gshare and 2-bit schemes compared to YAGS
- Simpler schemes allow for reduced cycle time and power consumption
- The CPI numbers are only close estimates because the simulations are not deterministic
Talk Outline
- Introduction & Motivation
- SMT Overview
- Branch Prediction Overview
- Test Methodology
- Results
- Conclusions
Conclusions
- Multithreaded execution interferes with branch prediction accuracy
- Prediction accuracy trends are similar across both the homogeneous and heterogeneous thread test cases
- Splitting only the branch history gives the best branch prediction accuracy and performance per unit of resources
- Performance (CPI) is relatively stable, even when the branch prediction structure is simplified