Research with MacSim

MacSim Tutorial (In ISCA-39, 2012)
Front-end
• Thread fetch policies
• Branch predictor
• Software and hardware prefetcher

Memory System
• Cache studies (sharing, inclusion)
• DRAM scheduling
• Interconnection studies

Misc.
• Power model
MacSim pipeline: Trace Generator (PIN, GPUOcelot) → Frontend → Memory System

Software prefetch instructions
• PTX: prefetch, prefetchu
• x86: prefetcht0, prefetcht1, prefetchnta

Hardware prefetcher
• Hardware prefetch requests: stream, stride, GHB, …

• Many-thread Aware Prefetching Mechanism [Lee et al. MICRO-43, 2010]
• When Prefetching Works, When It Doesn't, and Why [Lee et al. ACM TACO, 2012]
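The stride prefetcher named above can be sketched as a small table indexed by PC. This is an illustrative model under assumed parameters (table layout, confidence threshold, prefetch degree), not MacSim's actual implementation:

```python
# Minimal sketch of a per-PC stride prefetcher (one of the hardware
# prefetcher styles listed above: stream, stride, GHB). All names and
# thresholds here are illustrative assumptions.

class StridePrefetcher:
    def __init__(self, degree=2):
        self.table = {}        # pc -> (last_addr, last_stride, confidence)
        self.degree = degree   # prefetches issued once a stride is confirmed

    def access(self, pc, addr):
        """Observe a demand access; return a list of prefetch addresses."""
        prefetches = []
        if pc in self.table:
            last_addr, last_stride, conf = self.table[pc]
            stride = addr - last_addr
            if stride == last_stride and stride != 0:
                conf = min(conf + 1, 3)
                if conf >= 2:  # stride confirmed twice: issue prefetches
                    prefetches = [addr + stride * i
                                  for i in range(1, self.degree + 1)]
            else:
                conf = 0
            self.table[pc] = (addr, stride, conf)
        else:
            self.table[pc] = (addr, 0, 0)
        return prefetches

pf = StridePrefetcher()
for a in [100, 164, 228, 292]:   # accesses with a constant stride of 64
    out = pf.access(pc=0x400, addr=a)
print(out)  # after the stride is confirmed: [356, 420]
```

GHB-based prefetchers generalize this by keeping a global history buffer of misses instead of a single last address per PC.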
| Cache studies – sharing, inclusion property
| On-chip interconnection studies

(Figure: private caches ($) connected through an interconnection to a shared cache)

• TLP-Aware Cache Management Policy [Lee and Kim, HPCA-18, 2012]
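The inclusion property studied above can be illustrated with a toy two-level hierarchy: in an inclusive design, evicting a line from the shared cache must back-invalidate any private copies. A minimal sketch, with hypothetical names and sets standing in for real cache structures:

```python
# Toy model of the inclusion property: the shared cache holds a superset
# of all private-cache contents, so a shared-level eviction triggers
# back-invalidation. Names and structure are illustrative assumptions.

class InclusiveHierarchy:
    def __init__(self, num_cores):
        self.private = [set() for _ in range(num_cores)]  # per-core lines
        self.shared = set()                               # shared-cache lines

    def fill(self, core, line):
        self.shared.add(line)        # inclusion: shared level also holds it
        self.private[core].add(line)

    def evict_shared(self, line):
        self.shared.discard(line)
        for p in self.private:       # back-invalidate to preserve inclusion
            p.discard(line)

h = InclusiveHierarchy(num_cores=2)
h.fill(0, 0xA0)
h.fill(1, 0xA0)          # line 0xA0 shared by both cores
h.evict_shared(0xA0)
print(0xA0 in h.private[0], 0xA0 in h.private[1])  # False False
```

An exclusive or non-inclusive hierarchy would skip the back-invalidation loop, which is exactly the design-space dimension such cache studies explore.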
| Heterogeneous link configuration
C0
C1
C2
G0
G1
G2
M1
M0
L3
L3
L3
L3
C0
G0
C2
G1
C1
G2
M1
M0
L3
L3
L3
L3
CPU
GPU
Ring Network
MC
Different topologies
L3
•
C
C
M
M
C
C
M
M
C
C
G
G
C
C
G
G
On-chip Interconnection for CPU-GPU Heterogeneous Architecture [Lee et al. under review]
MacSim Tutorial (In ISCA-39, 2012)
6/8
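One way to see why node placement on the ring matters: the average hop count from cores to the memory controllers changes with where the MCs sit. A rough sketch with made-up layouts (the node names follow the figure's C/G/M convention, but the specific placements are examples, not the paper's configurations):

```python
# Average hops from each core to its nearest memory controller on a
# bidirectional ring, for two hypothetical placements. Illustrative only.

def ring_hops(ring, src, dst):
    """Shortest distance between two nodes on a bidirectional ring."""
    i, j = ring.index(src), ring.index(dst)
    d = abs(i - j)
    return min(d, len(ring) - d)

def avg_hops_to_mc(ring):
    mcs = [n for n in ring if n.startswith("M")]
    cores = [n for n in ring if not n.startswith("M")]
    return sum(min(ring_hops(ring, c, m) for m in mcs)
               for c in cores) / len(cores)

mcs_clustered = ["M0", "M1", "C0", "C1", "C2", "G0", "G1", "G2"]
mcs_spread    = ["M0", "C0", "C1", "C2", "M1", "G0", "G1", "G2"]
print(round(avg_hops_to_mc(mcs_clustered), 2),
      round(avg_hops_to_mc(mcs_spread), 2))  # 2.0 1.33
```

Spreading the memory controllers around the ring lowers the average distance to memory, which is the kind of trade-off a heterogeneous link-configuration study quantifies (alongside link widths and bandwidth asymmetry between CPU and GPU traffic).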
Trace Generator (GPUOcelot) → Frontend → Execution → DRAM
• Frontend fetch policies: RR, ICOUNT, FAIR, LRF, …
• DRAM scheduling policies: FCFS, FRFCFS, FAIR, …

• Effect of Instruction Fetch and Memory Scheduling on GPU Performance [Lakshminarayana and Kim, LCA-GPGPU, 2010]
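Two of the fetch policies listed above are easy to sketch: RR picks the next ready warp in round-robin order, while ICOUNT favors the ready warp with the fewest in-flight instructions. These are textbook definitions written as illustrative functions, not MacSim's scheduler code:

```python
# Illustrative fetch-policy sketches: RR (round-robin) and ICOUNT
# (fewest in-flight instructions). Warp state is assumed, not MacSim's.

def rr_pick(ready, last):
    """Round-robin: first ready warp after the one fetched last."""
    n = len(ready)
    for off in range(1, n + 1):
        w = (last + off) % n
        if ready[w]:
            return w
    return None  # no warp is ready this cycle

def icount_pick(ready, inflight):
    """ICOUNT: ready warp with the fewest instructions in the pipeline."""
    cands = [w for w, r in enumerate(ready) if r]
    return min(cands, key=lambda w: inflight[w]) if cands else None

ready    = [True, True, False, True]
inflight = [7, 2, 0, 5]              # per-warp in-flight instruction counts
print(rr_pick(ready, last=0))        # 1
print(icount_pick(ready, inflight))  # 1 (fewest in-flight among ready warps)
```

FAIR and LRF extend this idea with fairness counters and least-recently-fetched ordering; the LCA-GPGPU paper above compares how such choices interact with DRAM scheduling.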
DRAM scheduling based on a potential function

Each core has per-DRAM-bank request queues (W0–W3) whose entries are row hits (RH) and row misses (RM). The DRAM controller computes

Potential of requests from Core-0 = |W0|^α + |W1|^α + |W2|^α + |W3|^α   (α < 1)

Reduction in potential if:
• a row hit from a queue of length L is serviced next: L^α − (L − 1)^α
• a row miss from a queue of length L is serviced next: L^α − (L − 1/m)^α,
  where m = cost of servicing a row miss / cost of servicing a row hit

Since Tolerance(Core-0) < Tolerance(Core-1), Core-0 is selected. Servicing a row hit from W1 (of Core-0) results in the greatest reduction in potential, so row hits from W1 are serviced next.

• DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function [Lakshminarayana et al. IEEE CAL, 2011]
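The potential computation above can be worked through numerically. The queue lengths and the values of α and m below are example numbers, and this sketch ranks only row hits by queue length (the full policy also weighs row misses and per-core tolerance):

```python
# Worked sketch of the potential-function computation described above.
# ALPHA, M, and the queue lengths are assumed example values.

ALPHA = 0.5   # α < 1: concave potential rewards draining queues
M = 4.0       # m = (cost of a row miss) / (cost of a row hit)

def potential(queue_lengths, alpha=ALPHA):
    # Potential = |W0|^α + |W1|^α + ... over a core's per-bank queues
    return sum(L ** alpha for L in queue_lengths)

def reduction_row_hit(L, alpha=ALPHA):
    # Servicing a row hit removes one request: L^α − (L − 1)^α
    return L ** alpha - (L - 1) ** alpha

def reduction_row_miss(L, alpha=ALPHA, m=M):
    # A row miss costs m row-hit times, so the per-cost reduction
    # uses 1/m of a request: L^α − (L − 1/m)^α
    return L ** alpha - (L - 1 / m) ** alpha

queues = [4, 3, 5, 2]                 # example |W0|..|W3| for one core
print(round(potential(queues), 3))    # 7.382
best = max(range(len(queues)),
           key=lambda i: reduction_row_hit(queues[i]))
print(best)  # index 3: for α < 1 the shortest queue gives the largest drop
```

Because a row miss only shrinks the effective queue by 1/m per unit cost, its reduction is always smaller than a row hit's from the same queue, which is why the scheduler prefers row hits.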
| Verifying simulator and GTX580
| Modeling x86-CPU power
| Modeling GPU power

(Figure: modeled vs. measured IPC across GTX580 microbenchmarks)

(Figure: GPU power breakdown – Fetch 3%, Decode 1%, Schedule 3%, RF 5%, MMU 0%, L1 27%, EX_alu 6%, EX_LD/ST 3%, EX_SFU 1%, EX_fpu 48%, ConstCache 1%, TextureCache 1%, SharedMem 1%)

Still on-going research (2012 ~ 2013)
• OpenGL Program
• ARM Architecture
• Mobile Platform
• Power/Energy Model