NC STATE UNIVERSITY
FabScalar
Niket K. Choudhary, Salil V. Wadhavkar, Tanmay A. Shah,
Sandeep S. Navada, Hashem H. Najaf-abadi, Eric Rotenberg
Center for Efficient, Scalable and Reliable Computing (CESR)
Department of Electrical and Computer Engineering
North Carolina State University
High-Performance Superscalar Processor
 Generic pipeline configuration
↑ Good performance on wide range of applications
↓ Not highest-performing for any given application
↓ Power inefficient
Eric Rotenberg © 2009
WARP’09 6/20/09
Application-Specific Superscalar Processor
[Figure: App. X running on a generic superscalar processor vs. on an application-specific superscalar processor.]
Propagation Delay
[Figure: propagation delay (ns) of a 2-way vs. a 4-way superscalar.]
2-way to 4-way:
– Increase sizes of ILP-extracting units to expose and exploit more ILP
– Hide increase in propagation delays with deeper pipelining
– Except: worsened propagation delays not hidden for inter-instruction dependences
[Figure: execution time of App. 1 and App. 2 on 2-way and 4-way cores, split into dependences and independences.]
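The tradeoff on this slide can be made concrete with a toy model (my own construction, not from the talk): deeper sub-pipelining shortens the clock period, but back-to-back dependent instructions then wait extra cycles for their operands, so dependence-heavy code blunts the frequency gain.

```cpp
#include <cassert>

// Toy model: execution time of n instructions on a superscalar core.
// base_cycles assumes full-width issue; each dependent instruction pair
// then pays (exec_depth - 1) bubble cycles waiting for its operand.
double exec_time_ns(long n, double frac_dependent,
                    int issue_width, int exec_depth, double clock_ns) {
    double base_cycles = static_cast<double>(n) / issue_width;
    double bubbles = n * frac_dependent * (exec_depth - 1);
    return (base_cycles + bubbles) * clock_ns;
}
```

With independence-heavy code, the deeper pipeline's faster clock wins; with many inter-instruction dependences, the shallow pipeline wins, matching the App. 1 / App. 2 contrast in the figure.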
Heterogeneous Multi-core
[Figure: App. 1, App. 2, …, App. N, each mapped to its own core.]
Customize each core to an application, a class of applications, or a class of application behavior.
Challenge
 Customization captures interplay between
program, microarchitecture, and technology
 Need real superscalar designs …
 … and need many of them
Need to try out many real superscalar designs.
Need tool for automatically composing physical
designs of arbitrary superscalar processors.
Target both R & D
 Research:
High fidelity designs improve discovery
 Development:
Designs should be product strength
Canonical Superscalar Processor
 Different superscalar processors have the same canonical pipeline stages
Fetch
Decode
Rename
Dispatch
Issue
Reg. Read
Execute
Load/Store Unit
Writeback
Retire
 Their canonical stages differ in terms of:
• Complexity
 Width, i.e., number of superscalar “ways”
 Sizes of stage-specific structures
• Sub-pipelining
 How deeply pipelined a canonical stage is
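As an illustration only (the struct and function names are mine, not FabScalar's), the canonical pipeline and its two axes of variation can be written down as data:

```cpp
#include <cassert>
#include <string>
#include <vector>

// The ten canonical stages named on the slide, each parameterized by the
// two axes of variation the slide lists: complexity (width) and
// sub-pipelining (depth). Depths of 1 are placeholders.
struct CanonicalStage {
    std::string name;
    int width;   // number of superscalar "ways"
    int depth;   // how deeply the stage is sub-pipelined
};

std::vector<CanonicalStage> canonical_pipeline(int width) {
    return {
        {"Fetch", width, 1},    {"Decode", width, 1},
        {"Rename", width, 1},   {"Dispatch", width, 1},
        {"Issue", width, 1},    {"RegRead", width, 1},
        {"Execute", width, 1},  {"LoadStoreUnit", width, 1},
        {"Writeback", width, 1},{"Retire", width, 1},
    };
}
```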
FabScalar
1) Define composable interfaces of canonical pipeline
stages, so that they can be stitched together to
compose an overall superscalar processor.
2) Pre-design multiple versions of each canonical pipeline stage that differ in their width and stage-specific structure sizes (complexity) and depth (sub-pipelining).
3) Develop a high-level superscalar synthesis tool that
can automatically compose an arbitrary superscalar
processor based on processor-level and stage-level
constraints (frequency, power, and area), and output
multiple representations (verilog, cycle-accurate C++,
netlist, and physical design) of the processor.
SSL and Composability
[Figure: SSL stage versions and their composability: fetch units in scalar and 2-way superscalar variants, each sub-pipelined into 1 to 3 stages, stitched to matching decode and rename stages.]
Status
 Designed synthesizable verilog for a baseline
superscalar processor
• Starting point for populating SSL with pipeline stage designs
Niket
Stage: Description
Fetch: 4-wide, 512-entry BTB, 128-entry bimodal branch predictor, 8-entry RAS, 16-instruction fetch buffer
Decode: 4-wide, ISA = PISA (MIPS-like)
Rename: 4-wide, 32-entry rename map table with 8 read and 4 write ports, 4 shadow map tables (checkpoints)
Dispatch: 4-wide
Issue: 4-wide issue, 32-entry issue queue
Register Read: 4-wide, 128-entry physical register file with 8 read ports and 4 write ports
Execute: 1 simple ALU, 1 complex ALU, 1 branch ALU, 1 AGEN + 1 port to load-store unit
Load-Store Unit: 16-entry load queue, 16-entry store queue
Writeback: 4-wide
Retire: 4-wide, 128-entry active list with 4 read and 4 write ports, arch. map table with 4 read and 4 write ports
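For reference, the baseline parameters from the table collected into one C++ struct (values are taken directly from the slide; the struct and field names are illustrative, not FabScalar's):

```cpp
#include <cassert>

// Baseline superscalar configuration, per the table above.
struct BaselineConfig {
    int width           = 4;    // fetch through retire are all 4-wide
    int btb_entries     = 512;
    int ras_entries     = 8;
    int fetch_buffer    = 16;   // instructions
    int rename_map_rows = 32;   // 8 read / 4 write ports
    int checkpoints     = 4;    // shadow map tables
    int issue_queue     = 32;
    int phys_regs       = 128;  // 8 read / 4 write ports
    int load_queue      = 16;
    int store_queue     = 16;
    int active_list     = 128;  // 4 read / 4 write ports
};
```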
Status (cont.)
Niket
Status (cont.)
[Five delay plots: (Fetch-2 Delay) vs (Fetch Width); (Rename Delay) vs (Rename Width); (RegRead Delay) vs (Register File Size); (Wakeup Delay) vs (Issue Queue Size); (Select Logic Delay) vs (Issue Queue Size). Delays in ns; the register-file and issue-queue plots show separate curves for issue widths 2, 4, 6, and 8.]
Niket
Status (cont.)
 Developed cycle-accurate C++ simulator and
verilog/C++ co-simulation environment
• Cycle-accurate at pipeline stage level
[Figure 1. Flexible simulation options: (a) tightly integrated C++ & verilog, with each stage's C++ model checked (==) against its verilog counterpart, both driven by a common functional simulator; (b) standalone C++; (c) standalone verilog.]
Salil
IPC: gap 0.45, gcc 0.45, gzip 0.54, twolf 0.44, vortex 0.52, vpr 0.48
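A schematic of the per-stage "==" checks in option (a) of Figure 1, with both sides stubbed in C++ here, since real co-simulation drives the verilog through a PLI/VPI-style interface (all names below are stand-ins, not the actual environment's API):

```cpp
#include <cassert>

// Per-cycle outputs of one pipeline stage, compared across the two models.
struct StageOutput { unsigned pc; unsigned insn_count; };

// Stand-in for the cycle-accurate C++ model of the fetch stage.
struct CppFetchModel {
    unsigned pc = 0;
    StageOutput step() { StageOutput o{pc, 4}; pc += 16; return o; }
};

// Stand-in for the verilog side; in reality this would sample RTL signals.
struct RtlFetchStub {
    unsigned pc = 0;
    StageOutput step() { StageOutput o{pc, 4}; pc += 16; return o; }
};

// Step both models one cycle at a time and compare stage outputs,
// mirroring the "==" checks in Figure 1(a).
bool cosim_agree(int n_cycles) {
    CppFetchModel cpp;
    RtlFetchStub rtl;
    for (int c = 0; c < n_cycles; ++c) {
        StageOutput a = cpp.step(), b = rtl.step();
        if (a.pc != b.pc || a.insn_count != b.insn_count) return false;
    }
    return true;
}
```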
Status (cont.)
 Developed register file compiler
• 16R8W bitcell layout
Superscalar processor has many specialized and highly-ported RAM-based structures.
Tanmay
Status (cont.)
 Begun sub-pipelining key stages: fetch and issue
 Block-ahead pipelining [Seznec et al.]
[Figure: fetching blocks A, B, C, D under three schemes: Unpipelined Fetch (throughput = 1); Pipelined Fetch, no block-ahead (throughput = 1); Pipelined Fetch, with block-ahead (throughput = 2).]
Jayneel
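A toy throughput model (my framing, not Seznec et al.'s formulation) for the three fetch schemes in the figure: splitting fetch into sub-stages shortens the cycle, but without block-ahead the next fetch cannot start until the current block's target is known, so pipelining alone buys nothing; block-ahead prediction lets a new block start every sub-cycle.

```cpp
#include <cassert>

// Throughput in blocks per original (unpipelined) fetch cycle.
// depth = number of fetch sub-stages; each sub-cycle is 1/depth as long.
// Without block-ahead, fetches start every `depth` sub-cycles (the next
// address is known only at the end); with block-ahead, every sub-cycle.
double fetch_throughput(int depth, bool block_ahead) {
    if (depth <= 1) return 1.0;  // unpipelined fetch
    int start_interval = block_ahead ? 1 : depth;  // in sub-cycles
    return static_cast<double>(depth) / start_interval;
}
```

For the figure's 2-deep fetch this reproduces the slide's numbers: 1 without block-ahead, 2 with it.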
Example Applications
 Superscalar customization, fast design-space exploration
[Table: one customized core per benchmark group (bzip; crafty/gap/vpr; parser/perl/twolf; gcc; gzip; vortex; mcf), giving each core's clock (0.35 to 0.6 ns), RF size (32 to 512), IQ and LQ/SQ sizes, widths (f/d, i/c), I$/D$/L2$ sizes, and per-stage depths (f, d, i, rr, ex, m1, m2), plus a benchmark-by-core performance matrix over bzip, crafty, gap, vpr, parser, perl, twolf, gcc, gzip, vortex, and mcf.]
Sandeep
Example Applications (cont.)
 Core-Selectability in Chip Multiprocessors
Hashem
Configure the parallel processor for the parallel workload at hand.
[Figure: Tiled Het. Multi-cores.]
Example Applications (cont.)
 Revisit microarchitecture techniques
 Techniques discarded for limited applicability
may be valuable in workload-customized cores
Example Applications (cont.)
 Conventional methodology flawed
• Arbitrarily pick a baseline (perhaps rules-of-thumb)
• Add gadget to baseline
• Speedup: (baseline+gadget) / (baseline)
• Influence of gadget depends on choice of baseline
• Example: Value prediction more important with undersized IQ
 OK methodology
• Baseline = custom core for each benchmark
• Add gadget to this baseline, per benchmark
• Speedup: (baseline+gadget) / (baseline)
 Better methodology
• Baseline = custom core for each benchmark
• Recustomize core with gadget in place (new global optimum)
• Speedup: (recustomized core) / (customized core)
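The slide's warning can be illustrated with made-up IPC numbers (mine, purely for illustration): the same gadget looks like a 20% win against an arbitrary baseline but only a 4% win against a per-benchmark custom core, and recustomizing around the gadget recovers part of the benefit.

```cpp
#include <cassert>

// Speedup as the ratio of two IPCs, as on the slide:
// (baseline+gadget) / (baseline), or (recustomized) / (customized).
double speedup(double ipc_with_gadget, double ipc_baseline) {
    return ipc_with_gadget / ipc_baseline;
}
```

With illustrative IPCs of 1.0/1.2 (arbitrary baseline, with gadget), 1.5/1.56 (custom core, with gadget), and 1.62 (recustomized with gadget), the three methodologies report speedups of 1.20, 1.04, and 1.08 for the very same gadget.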
Summary
 Customizing superscalar cores has value in application-specific designs and heterogeneous multi-core chips
 Customization captures interplay among program,
microarchitecture, and technology
 FabScalar enables the composition of arbitrary
superscalar processors, inclusive of technology
 Enabled by canonical view of superscalar pipeline, and
a lot of “pre-fab” by students who aren’t paid enough
Supported by NSF and IBM.
(accepting donations)
http://www.tinker.ncsu.edu/ericro/research/fabscalar.htm