Transcript [PPT]

Efficient Execution of Memory Access Phases Using Dataflow Specialization
Chen-Han Ho, Sung Jin Kim, and Karthikeyan Sankaralingam
Department of Computer Science
Memory Access Phase
• A dynamic portion of a program where its instruction stream is predominantly for memory accesses and address generation.

SSE:
for (f=0; f<FSIZE; f+=4) {
  __m128 xmm_in_r = _mm_loadu_ps(in_r+p+f);
  __m128 xmm_in_i = _mm_loadu_ps(in_i+p+f);
  __m128 xmm_mul_r = _mm_mul_ps(xmm_in_r, xmm_coef);
  __m128 xmm_mul_i = _mm_mul_ps(xmm_in_i, xmm_coef);
  xmm_accum_r = _mm_add_ps(xmm_accum_r, _mm_sub_ps(xmm_mul_r, xmm_mul_i));
  xmm_accum_i = _mm_add_ps(xmm_accum_i, _mm_add_ps(xmm_mul_r, xmm_mul_i));
}

Aggregation:
for (i=0; i<v_size; ++i) {
  A[K[i]] += V[i];
}

Matrix multiply:
for (i=0; i<8; ++i) {
  for (j=0; j<8; ++j) {
    float sum = 0;
    for (k=0; k<8; ++k) {
      sum += matAT[i*matAcol+k] * matB[j*matBrow+k];
    }
    matCT[i*matBcol+j] += sum;
  }
}

Image processing (NPU):
for (int y = 0; y < srcImg.height; ++y) {
  for (int x = 0; x < srcImg.width; ++x) {
    p = srcImg.build3x3Window(x, y);
    NPU_SEND(p[0][0]); NPU_SEND(p[0][1]); NPU_SEND(p[0][2]);
    NPU_SEND(p[1][0]); NPU_SEND(p[1][1]); NPU_SEND(p[1][2]);
    NPU_SEND(p[2][0]); NPU_SEND(p[2][1]); NPU_SEND(p[2][2]);
    NPU_RECEIVE(pixel);
    dstImg.setPixel(x, y, pixel);
  }
}

Aggregation, matrix multiply, image processing…
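The aggregation kernel above is tiny, yet nearly every instruction it generates is address arithmetic or a load/store. A scalar C sketch, annotated per step (array sizes here are illustrative, not from the slides):

```c
#include <assert.h>

/* A[K[i]] += V[i]: of the work in each iteration, only the single
   add is "real" computation; the rest is index/control arithmetic
   and memory traffic. */
static void aggregate(float *A, const int *K, const float *V, int n) {
    for (int i = 0; i < n; ++i) {   /* i++, i<n: control + address gen */
        int idx = K[i];             /* load K[i]: address gen + load   */
        A[idx] += V[i];             /* load V[i], load A[idx],         */
    }                               /*   one add, then store A[idx]    */
}
```

Per iteration that is three loads, one store, and several address/index operations around a single floating-point add, which is what makes this a memory access phase.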
Execution Model
[Figure: three ways a memory access phase can execute]
– Natural memory access phase: executed in the core (in-order, OOO2, OOO4), accessing the D$ through the memory pipe.
– Induced memory access phase: executed in the core, feeding an accelerator (DySER, SSE, NPU, or C-Cores) through the D$.
– With MAD, the access phase is executed in MAD with the host core off: read the D$, send to the accelerator, write the D$; only little computation remains.
[Speedup over in-order: OOO2 ≈ 1.5–1.7×, OOO4 ≈ 2.2–2.9× across the D$, DySER, SSE, NPU, and C-Cores configurations.]
Core becomes bottleneck (Power)
[Figure: core power ranges from ~2.3 W to ~6.0 W, yet address computation + data access account for < 40% of it.]
Goal: access memory more efficiently, obtaining OOO's performance without its power overheads
Memory Access Dataflow
• A specialized dataflow architecture to access memory (processor pipeline turned off)
• Big idea: exposing the concept of triggering events & actions
[Figure: the MAD block sits between the core (off), the cache, and the accelerator]
What does the core do?
[Figure: a conventional core vs. MAD, each serving an accelerator]
– Conventional core (Fetch, Decode & Dispatch, Instruction Queue, Branch Predictor/History, ROB, RF, FU, Bypass, AGU, LSQ, Issue, Execute, WB): computes the addresses and control variables, accesses memory with loads and stores, and feeds the accelerator through a control interface.
– MAD: specialized hardware that creates events (address ready, control variable resolved, value returned from cache) and reacts to them through a scheduler, event queues, and a special RF.
MAD ISA Primitives
• Dataflow Graph Nodes
– Analogous to compute instructions & register state
• Actions
– Analogous to LD/ST and move instructions
• Events
– Analogous to program counter sequencing

Arch. primer: a conventional RISC/CISC ISA provides
1) register state,
2) compute instructions,
3) LD/ST instructions, and
4) a program counter and control flow.
Transforming ISA
Pseudo program:
for (i=0; i<n; ++i) {
  a[i] = accel(a[i], b[i]);
}

RISC ISA (named registers; computation, data movement, and branch/PC all expressed as instructions):
.L0:
  ld   $r0+$r1 -> $acc0
  ld   $r2+$r1 -> $acc1
  st   $acc2   -> $r0+$r1
  addi $r1, 1  -> $r1
  ble  $r4, $r1, .L0

[Figure: the corresponding dataflow graph: base A, base B, the constant 1, and n feed +, +, +, and < nodes around the index i, with ports to the accelerator]
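The pseudo program in plain C, for reference; accel() is a hypothetical stand-in (here, addition), since the slides leave the accelerator computation abstract:

```c
/* Hypothetical accelerator computation; the slides leave it abstract. */
static int accel(int x, int y) { return x + y; }

/* Each iteration maps onto the RISC sequence on the slide: two loads
   (a[i], b[i]), one store, an index increment (addi), and a backward
   branch (ble).  Every instruction in that sequence is address
   generation, data movement, or control; only accel() itself is
   compute, and it runs on the accelerator. */
static void run(int *a, const int *b, int n) {
    for (int i = 0; i < n; ++i)
        a[i] = accel(a[i], b[i]);
}
```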
Transforming ISA
Dataflow-graph nodes write their results to named event queues:
# Dataflow Graph Nodes
N0: $eq7 + base A -> $eq0,$eq2  #Addr A
N1: $eq7 + base B -> $eq4       #Addr B
N2: $eq7 + 1      -> $eq6       #i++
N3: $eq7 < n      -> $eq8       #i<n
MAD ISA (cont'd)
• Static dataflow: computations
• Dynamic dataflow: Event-Condition-Action (ECA) rules
on Event if Condition do Action
– Event: a combination of primitive dataflow events (the arrival of data)
– Condition: data states
– Action: loads, stores, or moves
Transforming ISA
MAD ISA: named event queues, EQ states (conditions), data movement, and computation.
# ECA Rules
On $eq0,       if,             do A0: ld, $eq0 -> $eq1
On $eq2∧$eq3,  if,             do A1: st, $eq3 -> $eq2
On $eq4,       if,             do A2: ld, $eq4 -> $eq5
On $eq8∧$eq6,  if $eq8(true),  do A3: mv, $eq6 -> $eq7, $eq8 ->
# Dataflow-Graph Nodes
N0: $eq7 + base A -> $eq0,$eq2  #Addr A
N1: $eq7 + base B -> $eq4       #Addr B
N2: $eq7 + 1      -> $eq6       #i++
N3: $eq7 < n      -> $eq8       #i<n
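Read together, the nodes and rules implement the original loop. A C sketch of one interpretation (my reading of the slide, with accel() stubbed as addition; real MAD execution is event-driven and parallel, while this sketch serializes one iteration at a time and models base A / base B with the array pointers themselves):

```c
#include <stdbool.h>

/* Hypothetical stand-in for the accelerator between $eq1/$eq5 and $eq3. */
static int accel(int x, int y) { return x + y; }

/* Serialized interpretation of nodes N0-N3 and rules A0-A3 for
   a[i] = accel(a[i], b[i]); each local models one event queue. */
static void mad_run(int *a, const int *b, int n) {
    int eq7 = 0;                      /* $eq7 seeded with the initial index */
    for (;;) {
        bool eq8 = (eq7 < n);         /* N3: $eq7 < n -> $eq8 */
        if (!eq8) break;              /* A3 fires only while $eq8 is true */
        int eq0 = eq7, eq2 = eq7;     /* N0: $eq7 + base A -> $eq0,$eq2 */
        int eq4 = eq7;                /* N1: $eq7 + base B -> $eq4 */
        int eq1 = a[eq0];             /* A0: ld, $eq0 -> $eq1 */
        int eq5 = b[eq4];             /* A2: ld, $eq4 -> $eq5 */
        int eq3 = accel(eq1, eq5);    /* accelerator: $eq1,$eq5 -> $eq3 */
        a[eq2] = eq3;                 /* A1: on $eq2 and $eq3, st $eq3 -> $eq2 */
        int eq6 = eq7 + 1;            /* N2: $eq7 + 1 -> $eq6 */
        eq7 = eq6;                    /* A3: mv, $eq6 -> $eq7 */
    }
}
```

Note there is no program counter anywhere: sequencing falls out of which queues have data, which is exactly the "events replace the PC" analogy from the primitives slide.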
Microarchitecture
[Figure: MAD microarchitecture]
– Event Block: I/O event queues to/from the accelerator; a comparator array matches events against queue states.
– Action Block: ECA rule table, decoder, and arbiter; emits actions as a bit vector and moves data (to the LSQ).
– Computation Block: a grid of FUs and switches driven by the action table (Action 0..3); data-driven computation.
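One way to picture the comparator array and arbiter: each ECA rule owns a bit mask over the event queues, and a rule is ready when every queue it waits on is occupied. A C sketch under that assumption (the mask encoding and fixed-priority arbitration are illustrative, not the paper's exact design):

```c
#include <stdint.h>

/* Comparator array + arbiter sketch: queue_occupancy has bit q set
   when event queue q holds data; event_mask[r] has bit q set when
   rule r waits on queue q.  Returns the index of the selected rule,
   or -1 if no rule has all of its awaited events present. */
static int select_rule(uint32_t queue_occupancy,
                       const uint32_t *event_mask, int num_rules) {
    for (int r = 0; r < num_rules; ++r)  /* fixed priority: lowest index wins */
        if ((queue_occupancy & event_mask[r]) == event_mask[r])
            return r;
    return -1;
}
```

For the four rules of the running example, the masks would be bit 0 for A0 ($eq0), bits 2 and 3 for A1 ($eq2∧$eq3), bit 4 for A2 ($eq4), and bits 6 and 8 for A3 ($eq8∧$eq6).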
MAD Execution
[Figure: code generation targets the MAD ISA; MAD executes the access phase for the accelerator while the processor is off]
Evaluation Methodology
• Baseline: in-order, OOO2, and OOO4
• MAD integration:
– 256 dataflow nodes, 64 event queues
– Integrated into OOO2/OOO4's LSU
• Natural and induced memory access phases
– Accelerators: DySER, SIMD, NPU, C-Cores
• Reproduce/reuse benchmarks relevant to each accelerator
Evaluation & Analysis
• Performance
– Explicit static & dynamic dataflow, larger instruction window, less speculative
– Can MAD match OOO2/OOO4?
• MAD should consume less energy/power
Summary - Performance
• MAD's performance is similar to OOO4's
– MAD utilizes OOO2's LSU better, so MAD+OOO2 outperforms OOO2; paired with OOO4's LSU, MAD can even outperform OOO4
– In DySER programs, OOO4 has more opportunities to speculatively execute memory instructions
Summary - Energy
• ~Half the energy compared to OOO2
– Compared to in-order, OOO2 delivers better performance but does not save energy
• ~30% energy compared to OOO4
Power: Natural Phases
[Figure: per-stage power breakdown for OOO2, OOO4, MAD2, MAD4]
• MAD < sum(Fetch, Decode, Dispatch, Issue, Execute, WriteBack)
– LSU: more power than OOO2's, similar to OOO4's
Summary
• MAD is a novel and useful customization for memory access phases
• Performance improvement and power reduction
• Flexible & effective for accelerators
Questions