Enabling Multithreading on CGRAs

Aviral Shrivastava1, Jared Pager1, Reiley Jeyapaul1,
Mahdi Hamzeh1,2, Sarma Vrudhula2
1Compiler Microarchitecture Lab,
2VLSI Electronic Design Automation Laboratory,
Arizona State University, Tempe, Arizona, USA
Need for High Performance Computing

Applications that need high performance computing:
- Weather and geophysical simulation
- Genetic engineering
- Multimedia streaming

[Figure: projected computing demand, from petaflop to zettaflop scale]
Web page: aviral.lab.asu.edu
Need for Power-efficient Performance

Power requirements limit the aggressive scaling trends in processor
technology. In high-end servers:
- Power consumption doubles every 5 years
- Cooling costs increase at a similar rate
(ITRS 2010)

[Figure: data centers account for 2.3% of US electricity consumption,
about $4 billion in electricity charges]
Accelerators can help achieve Power-efficient Performance

Power-critical computations can be off-loaded to accelerators:
- Perform application-specific operations
- Achieve high throughput without loss of CPU programmability

Existing examples:
- Hardware accelerator: Intel SSE
- Reconfigurable accelerator: FPGA
- Graphics accelerator: nVIDIA Tesla (Fermi GPU)
CGRA: Power-efficient Accelerator

Pros:
- Flexible programming
- High performance
- Power-efficient computing

Cons:
- Compiling a program for a CGRA is difficult
- Not all applications can be compiled
- No standard CGRA architecture
- Requires extensive compiler support for general-purpose computing

[Figure: a PE contains an FU and an RF, with inputs from and outputs
to neighboring PEs and memory]

Distinguishing characteristics:
- PEs communicate through an inter-connect network

[Figure: 4x4 array of PEs with local instruction memory and local
data memory, backed by main system memory]
Mapping a Kernel onto a CGRA

1. Given the kernel's DDG, mark source and destination nodes
2. Assume a CGRA architecture
3. Place all nodes on the PE array
   1. Dependent nodes are placed close to their sources
   2. Ensure dependent nodes have interconnects connecting them to
      their sources
4. Map time-slots for each PE execution
   1. Dependent nodes cannot execute before their source nodes
Loop:
    t1 = (a[i]+b[i])*c[i]
    d[i] = ~t1 & 0xFFFF

[Figure: data-dependency graph with nodes 1-9 for the loop body,
spatially mapped and temporally scheduled onto the PE array]
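The placement and scheduling rules above can be sketched in Python. This is only an illustrative greedy placer for the example loop, not the mapping algorithm used in the paper; the node names and the 4x4 mesh interconnect are assumptions made for the sketch.

```python
from collections import deque

# Toy DDG for: t1 = (a[i]+b[i])*c[i]; d[i] = ~t1 & 0xFFFF
# (node names are illustrative, not the slide's node numbers)
edges = {
    "load_a": ["add"], "load_b": ["add"], "load_c": ["mul"],
    "add": ["mul"], "mul": ["not"], "not": ["and"],
    "const": ["and"], "and": ["store"], "store": [],
}

ROWS, COLS = 4, 4  # 4x4 PE array, mesh interconnect assumed

def neighbors(pe):
    r, c = pe
    return [(r + dr, c + dc) for dr, dc in ((0, 1), (0, -1), (1, 0), (-1, 0))
            if 0 <= r + dr < ROWS and 0 <= c + dc < COLS]

def topo_order(edges):
    indeg = {n: 0 for n in edges}
    for dsts in edges.values():
        for d in dsts:
            indeg[d] += 1
    q = deque(n for n, deg in indeg.items() if deg == 0)
    order = []
    while q:
        n = q.popleft()
        order.append(n)
        for d in edges[n]:
            indeg[d] -= 1
            if indeg[d] == 0:
                q.append(d)
    return order

def place(edges):
    preds = {n: [] for n in edges}
    for s, dsts in edges.items():
        for d in dsts:
            preds[d].append(s)
    placement, schedule = {}, {}
    free = [(r, c) for r in range(ROWS) for c in range(COLS)]
    for node in topo_order(edges):
        if not preds[node]:                  # source node: any free PE
            cands = free
        else:                                # dependent node: next to a source
            near = {p for s in preds[node] for p in neighbors(placement[s])}
            cands = [p for p in free if p in near] or free  # fallback if full
        pe = cands[0]
        free.remove(pe)
        placement[node] = pe
        # a node may only fire after all of its source nodes have fired
        schedule[node] = 1 + max((schedule[s] for s in preds[node]), default=0)
    return placement, schedule

placement, schedule = place(edges)
for n in topo_order(edges):
    print(n, placement[n], "cycle", schedule[n])
```

The fallback to any free PE stands in for routing through intermediate PEs, which a real mapper would have to handle explicitly.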
Mapped Kernel Executed on the CGRA

Loop:
    t1 = (a[i]+b[i])*c[i]
    d[i] = ~t1 & 0xFFFF

- Each node executes in its assigned time slot (or cycle)
- After cycle 6, one iteration of the loop completes execution every
  cycle
- The entire kernel can be mapped onto the CGRA by unrolling 6 times
- The Iteration Interval (II) is a measure of mapping quality; here
  the Iteration Interval = 1

[Figure: pipelined execution of the mapped data-dependency graph
across execution time slots 0-7]
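The relationship between II and throughput on this slide reduces to a one-line formula, sketched below. The pipeline-depth value of 6 comes from the slide; the function name is my own.

```python
def pipelined_cycles(n_iters, depth=6, ii=1):
    """Total cycles to run n_iters iterations of a kernel whose
    schedule is `depth` cycles deep and restarts every `ii` cycles.
    Once the pipeline is full (after `depth` cycles), one iteration
    completes every II cycles -- which is why a smaller II means a
    better mapping."""
    if n_iters == 0:
        return 0
    return depth + (n_iters - 1) * ii

for n in (1, 10, 100):
    print(n, pipelined_cycles(n))       # II = 1: steady-state 1 iter/cycle
print(pipelined_cycles(100, ii=2))      # a worse mapping with II = 2
```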
Traditional Use of CGRAs

- An application is mapped onto the CGRA
- System inputs are given to the application
- Power-efficient application execution is achieved
- Generally used for streaming applications
- Examples: ADRES, MorphoSys, KressArray, RSPA, DART

[Figure: application input streamed through the PE array (E0-E15) to
produce the application output]
Envisioned Use of CGRAs

- Specific kernels in a thread can be power/performance critical
- Such a kernel can be mapped and scheduled for execution on the CGRA
- Using the CGRA as a co-processor (accelerator):
  - Power-consuming processor execution is saved
  - Better thread performance is realized
  - Overall throughput is increased

[Figure: a processor runs the program thread and off-loads the kernel
to be accelerated onto the CGRA co-processor]
CGRA as an Accelerator

Application: single thread
- The entire CGRA is used to schedule each kernel of the thread
- Only a single thread is accelerated at a time
- Not all PEs are used in each schedule

Application: multiple threads
- The entire CGRA is used to accelerate each individual kernel
- If multiple threads require simultaneous acceleration:
  - Threads must be stalled
  - Kernels are queued to be run on the CGRA
- Thread-stalls create a performance bottleneck
Proposed Solution: Multi-threading on the CGRA

Through program compilation and scheduling:
- Schedule an application onto a subset of PEs, not the entire CGRA
- Enable dynamic multi-threading without re-compilation
- Facilitate multiple schedules executing simultaneously

This can increase total CGRA utilization:
- Reduces overall power consumption
- Increases multi-threaded system throughput

[Figure: schedules shrink-to-fit to admit more threads, and expand to
maximize CGRA utilization and performance]
Our Multithreading Technique

1. Static compile-time constraints to enable fast run-time
   transformations
   - Minimal effect on performance (II)
   - Increases compile-time
2. Fast dynamic transformations
   - Take linear time with respect to the kernel's II
   - All schedules are treated independently

Features:
- Dynamic multithreading enabled in linear runtime
- No additional hardware modifications; only supporting PE
  inter-connects in the CGRA topology are required
- Works with current mapping algorithms, provided the algorithm
  allows for custom PE interconnects
Hardware Abstraction: CGRA Paging

- Page: a conceptual group of PEs
- A page has symmetrical connections to each of its neighboring pages
- Page-level interconnects follow a ring topology
- No additional hardware 'feature' is required

[Figure: the 4x4 PE array (e0-e15) grouped into pages P0-P3, with
local instruction memory, local data memory, and main system memory]
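The paging abstraction is compact enough to state directly in code. A minimal sketch, assuming the slide's grouping of 16 PEs into four 4-PE pages; the function names are illustrative.

```python
PES_PER_PAGE = 4
N_PAGES = 4

def page_of(pe):
    """Map a PE index (e0..e15) to its page (P0..P3):
    e0-e3 -> P0, e4-e7 -> P1, e8-e11 -> P2, e12-e15 -> P3."""
    return pe // PES_PER_PAGE

def pages_adjacent(p, q):
    """Page-level interconnects follow a ring: P0-P1-P2-P3-P0,
    so a page may exchange data only with its two ring neighbors."""
    return (p - q) % N_PAGES in (1, N_PAGES - 1)

print(page_of(5))            # e5 lives in P1
print(pages_adjacent(0, 3))  # the ring wraps around: True
print(pages_adjacent(0, 2))  # P0 and P2 are not neighbors: False
```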
Step 1: Compiler Constraints assumed during Initial Mapping

Compile-time assumptions:
- The CGRA is a collection of pages
- Each page can interact with only one topologically neighboring page
- Inter-PE connections within a page are unmodified

These assumptions:
- In most cases will not affect mapping quality
- May help improve CGRA resource usage

A naïve mapping could result in under-used CGRA resources; our paging
methodology helps improve CGRA resource usage.

[Figure: a mapping constrained to pages P0-P3 of the PE array]
Step 2: Dynamic Transformation enabling multiple schedules

Transformation procedure:
1. Split pages
2. Arrange pages in time order
3. Mirror pages to facilitate shrinking
   - Ensures inter-node dependencies are respected
4. Shrunk pages are executed on altered time-schedules

Example:
- An application mapped to 3 pages is shrunk to execute on 2 pages

Constraints:
- Inter-page dependencies must be maintained

[Figure: the paged schedule before and after the shrinking
transformation]
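The core idea of the shrinking step, time-multiplexing a schedule's pages onto fewer physical pages while visiting them in time order, can be sketched as follows. This is only a sketch of the resource/time trade-off, not the paper's transformation algorithm, and it omits the page-mirroring detail.

```python
from math import ceil

def shrink(virtual_pages, n_physical):
    """Time-multiplex a schedule that occupies len(virtual_pages)
    pages onto n_physical pages. Pages are visited in time order,
    so inter-page dependencies are preserved; the schedule's
    iteration interval grows by the same factor the resources
    shrink."""
    n_virtual = len(virtual_pages)
    slowdown = ceil(n_virtual / n_physical)
    # virtual page -> (physical page, extra time offset)
    assignment = {vp: (t % n_physical, t // n_physical)
                  for t, vp in enumerate(virtual_pages)}
    return assignment, slowdown

# Example from the slide: an application mapped to 3 pages,
# shrunk to execute on 2 physical pages.
assignment, slowdown = shrink(["P0", "P1", "P2"], 2)
print(assignment)   # {'P0': (0, 0), 'P1': (1, 0), 'P2': (0, 1)}
print(slowdown)     # 2
```

Expanding is the inverse: when physical pages free up, the offsets collapse back and the original II is restored.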
Experiment 1: Compiler Constraints are Liberal

- Mapping quality is measured in Iteration Intervals (II); a smaller
  II is better
- Constraints can degrade individual benchmark performance by
  limiting the compiler search space
- Constraints can, ironically, also improve individual benchmark
  performance
- On average, performance is minimally impacted

[Figure: normalized II per benchmark for 2, 4, and 8 PEs/page]
Experimental Setup

- CGRA configurations used: 4x4, 6x6, 8x8
- Page configurations: 2, 4, 8 PEs per page
- Number of threads in the system: 1, 2, 4, 8, 16
  - Each thread has a kernel to be accelerated

Experiments:
- Single-threaded CGRA
  - Each thread arrives at its "kernel" and is stalled until the
    kernel executes
  - Only ONE kernel is serviced at a time
- Multi-threaded CGRA
  - The CGRA accelerates kernels as and when they arrive
  - MULTIPLE threads are serviced; no thread is stalled

[Figure: four CPU cores sharing one CGRA, each running a thread with
a kernel to accelerate]
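The two experimental configurations can be contrasted with a toy queueing model. This is a rough illustration under idealized assumptions (all kernels ready at time zero, perfect page packing, no transformation overhead), not the paper's simulation infrastructure.

```python
def single_threaded(kernels):
    """One kernel at a time owns the whole CGRA; later threads stall,
    so service time is the sum of all kernel run times."""
    return sum(cycles for _, cycles in kernels)

def multi_threaded(kernels, total_pages):
    """Kernels run concurrently as long as free pages remain."""
    t, running, finish = 0, [], 0   # running: (end_time, pages) pairs
    for pages, cycles in kernels:
        # wait until enough pages are free to admit this kernel
        while sum(p for e, p in running if e > t) + pages > total_pages:
            t = min(e for e, p in running if e > t)
        running.append((t + cycles, pages))
        finish = max(finish, t + cycles)
    return finish

# Four threads, each with a kernel needing 2 pages for 100 cycles,
# on a CGRA with 4 pages (numbers are illustrative).
kernels = [(2, 100)] * 4
print(single_threaded(kernels))               # 400: threads stall
print(multi_threaded(kernels, total_pages=4)) # 200: two run at once
```

Even this crude model shows why thread-stalls are the bottleneck: unused pages in the single-threaded case are pure waste.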
Multithreading Improves System Performance

- As the number of threads increases, multithreading provides better
  utilization and therefore better performance
- As we increase CGRA size, multithreading provides increasing
  performance
- For the set of benchmarks tested, the optimal number of PEs per
  page is either 2 or 4

[Figure: performance improvement vs. number of threads (1-16), across
CGRA sizes (4x4, 6x6, 8x8 at 4 PEs/page) and across 2, 4, 8 PEs/page
on a 6x6 CGRA]
Summary

- Power-efficient performance is the need of the future
- CGRAs can be used as accelerators
  - Power-efficient performance can be achieved
  - Usability is limited by compiling difficulties
  - With multi-threaded applications, multi-threading capabilities
    are needed in the CGRA
- We propose a two-step dynamic methodology
  - Non-restrictive compile-time constraints to schedule an
    application into pages
  - A dynamic transformation procedure to shrink/expand the
    resources used by a schedule
- Features:
  - No additional hardware required
  - Improved CGRA resource usage
  - Improved system performance
Future Work

- Using CGRAs as accelerators in systems with inter-thread
  communication
- Studying the impact of compiler constraints on compute-intensive
  and memory-bound benchmark applications
- Possible use of thread-level scheduling to improve overall
  performance
Thank you!
State-of-the-art Multi-threading on CGRAs

Polymorphic Pipeline Arrays [Park 2009]:
- A collection of schedules makes up a kernel
- Some schedules can be given more resources than other schedules
- Enables dynamic scheduling

Limitations:
- The collection of schedules must be known at compile-time
- Schedules are assumed to be 'pipelining' stages in a single kernel

[Figure: a polymorphic pipeline array of eight cores and four memory
banks running filter stages (Filter 1-3) over data sets]