Mapping Task Graphs to Processors in Large Multiprocessor Systems
Kurt Keutzer and the MESCAL Team, especially Yujia Jin, Kaushik Ravindran, and N. R. Satish
Design Space Exploration Flow
[Figure: design space exploration flow. Inputs: an application description (a Click graph with FromDevice(0) to FromDevice(15), IPVerify, DecIPTTL, Lookup IPRoute, ToDevice(0) to ToDevice(15) and Discard elements) and a multiprocessor platform (MicroBlaze soft PEs, co-PEs for hardware acceleration, on-chip BRAM, off-chip SDRAM, OPB and PLB buses, FSL, Ethernet peripheral). Stages: task graph + profiles, platform and scheduling constraints -> allocation/scheduling -> HW/SW generation -> implementation -> performance analysis -> performance numbers.]
11/6/2015
Investigative Approach
Demonstrate network applications on FPGA-based soft multiprocessors
Tomahawk exploration framework
Automated task allocation and scheduling
Extend framework to large multiprocessor systems
1000’s-10,000’s of tasks
100’s-1000’s of PE’s
RAMP
What Is an FPGA-based Soft Multiprocessor System?
A network of architecture building blocks on an FPGA
Multiprocessor architecture customized for target application
Number of processors
Interconnection network
Memory hierarchy
Custom co-processors
Productivity gains due to software abstraction
Cost reduction by avoiding custom silicon
Architecture Building Blocks
Processing Element: MicroBlaze (soft), PowerPC (hard)
Co-Processor: Hash engine, Crypto engine
Bus: OPB, PLB
Queue: FSL
Memory: BRAM (on-chip), SDRAM (off-chip)
Target devices: Xilinx Virtex-II Pro and Virtex-IV families of FPGAs
[Figure: Multiprocessor Configuration. MicroBlaze (soft) and PowerPC (hard) PEs with co-PEs for hardware acceleration, local memories in on-chip BRAM, off-chip SDRAM, OPB and PLB buses, FSL queues, and an Ethernet peripheral.]
Obstacles to Their Adoption: Hard to design
Complex micro-architecture design space
Processor choices
Memory hierarchy
Communication topology
Difficult mapping decisions
assigning computation to processing elements
data to exposed heterogeneous memories
To unlock the potential of these systems, tools enabling efficiency and productivity are needed
Example: Design Difficulty
Application Task Graph: R1, L1, T1, R2, L2, T2
Architecture Model: processors P1 and P2 connected by a queue
Profile: Execution Time (cycles)
Task   P1   P2
R      20   10
L      20   20
T      30   10
[Figure: schedules of the tasks on P1 and P2 for Design A, Design B and the optimal design found by exploration. The makespan values 70, 60 and 80 appear on the slide; the transcript does not preserve which design each value belongs to.]
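The difficulty in this example can be made concrete with a small brute-force search. The following is a minimal sketch, not the method used in the talk: it takes the execution times from the profile table, and it assumes two independent chains R -> L -> T, a fixed scheduling order, and a queue delay of 10 cycles whenever a task and its predecessor sit on different processors (the delay value and the chain structure are assumptions).

```python
from itertools import product

# Execution time (cycles) per task type and processor, from the profile table.
PROFILE = {"R": {"P1": 20, "P2": 10},
           "L": {"P1": 20, "P2": 20},
           "T": {"P1": 30, "P2": 10}}
QUEUE_DELAY = 10  # assumption: cycles to pass data between processors

# Assumption: two independent chains R -> L -> T.
TASKS = ["R1", "R2", "L1", "L2", "T1", "T2"]  # a fixed topological order
PREDS = {"R1": [], "R2": [], "L1": ["R1"], "L2": ["R2"],
         "T1": ["L1"], "T2": ["L2"]}
PROCS = ["P1", "P2"]


def makespan(alloc):
    """Schedule tasks in the fixed order TASKS; return the last finish time."""
    proc_free = {p: 0 for p in PROCS}
    finish = {}
    for t in TASKS:
        p = alloc[t]
        ready = 0
        for q in PREDS[t]:
            delay = 0 if alloc[q] == p else QUEUE_DELAY
            ready = max(ready, finish[q] + delay)
        start = max(ready, proc_free[p])
        finish[t] = start + PROFILE[t[0]][p]  # t[0] is the task type R, L or T
        proc_free[p] = finish[t]
    return max(finish.values())


def best_allocation():
    """Try all 2^6 allocations and keep the one with the smallest makespan."""
    best = None
    for choice in product(PROCS, repeat=len(TASKS)):
        alloc = dict(zip(TASKS, choice))
        m = makespan(alloc)
        if best is None or m < best[0]:
            best = (m, alloc)
    return best
```

Calling `best_allocation()` returns the best makespan and allocation under these assumptions. Enumeration is feasible here only because there are 64 allocations; the count doubles with every added task, which is why larger instances need the solver-based techniques discussed later.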
Tomahawk: Network Applications onto Soft MPs
[Figure: Tomahawk flow. A Click description of the application (FromDevice(0) to FromDevice(15), IPVerify, DecIPTTL, Lookup IPRoute, ToDevice(0) to ToDevice(15) and Discard elements) is turned into a task graph (S1, S2, R1, L1, T1, R2, L2, T2). Automated mapping places the tasks on processors P1 and P2 and memory M1. Automated microarchitecture configuration produces C programs and a microarchitecture specification for a Xilinx 2VP50 FPGA with MicroBlaze (soft) PEs, co-PEs for hardware acceleration, on-chip BRAM, off-chip SDRAM, OPB and PLB buses, FSL, and an Ethernet peripheral.]
Possible Approaches for Automated Exploration
Randomized algorithms
probabilistic bounds, simulated annealing
Heuristic methods
list scheduling, force directed scheduling
Exact methods
enumeration and tabu search, branch-and-bound
Limitations of these approaches
Specific implementation constraints are hard to enforce
Most approaches require per-instance tuning and are hard to generalize, and are therefore poor for design space exploration
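As an illustration of the heuristic category, the following is a minimal sketch of greedy list scheduling, assuming tasks arrive in a topological order and each is placed on the processor where it would finish earliest. The function name, its arguments and the uniform communication delay are hypothetical choices made for this sketch, not part of the talk.

```python
def list_schedule(order, preds, cost, procs, comm_delay=0):
    """Greedy list scheduling.

    order:      tasks in a topological order
    preds:      task -> list of predecessor tasks
    cost:       task -> {processor: execution time}
    comm_delay: delay added when a predecessor ran on another processor
    Returns (placement, makespan), placement maps task -> (processor, finish).
    """
    proc_free = {p: 0 for p in procs}
    placed = {}
    for task in order:
        best = None
        for p in procs:
            # Earliest time all inputs are available on processor p.
            ready = 0
            for q in preds.get(task, []):
                q_proc, q_finish = placed[q]
                ready = max(ready, q_finish + (0 if q_proc == p else comm_delay))
            finish = max(ready, proc_free[p]) + cost[task][p]
            if best is None or finish < best[1]:
                best = (p, finish)
        placed[task] = best
        proc_free[best[0]] = best[1]
    return placed, max(f for _, f in placed.values())


# Usage: a two-task chain where the second task is much faster on P2.
placement, span = list_schedule(
    order=["a", "b"],
    preds={"b": ["a"]},
    cost={"a": {"P1": 5, "P2": 10}, "b": {"P1": 5, "P2": 1}},
    procs=["P1", "P2"],
    comm_delay=2,
)
```

The sketch also shows the limitation named above: a rule such as "tasks a and b must not share a processor" has no natural place in the greedy loop and would need ad hoc changes to it.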
Constraint Optimization Techniques for Automated Exploration
Constraint solver technologies
Integer linear programming (ILP) solvers
0-1 Boolean reasoning solvers (SAT, PB-SAT)
Advantages
Constraint formulations are a formal, yet natural way to capture a mathematical optimization problem
Implementation constraints specific to a problem can be incorporated easily
Constraint solvers can exhaustively cover a search space without enumerating all solutions
Key strategies to improve solver performance:
Decomposition methods
Variable ordering
Improved lower and upper bounds
Symmetry representation
ILP Formulation
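The formulation shown on this slide did not survive in the transcript. As a rough sketch only, and not the authors' formulation, a generic ILP for static allocation and scheduling might look as follows. All symbols are assumptions introduced here: $x_{t,p} \in \{0,1\}$ is 1 when task $t$ runs on processor $p$, $w_{t,p}$ is the profiled execution time, $d_t$ is the resulting duration, $s_t$ is the start time, $E$ is the set of task graph edges, $c_{uv}$ is the communication delay on edge $(u,v)$ when its endpoints are on different processors, $y_{uv}$ and $o_{uv}$ are auxiliary binary variables, $B$ is a sufficiently large constant, and $M$ is the makespan.

```latex
\begin{align*}
\min\ & M \\
\text{s.t.}\ & \sum_{p} x_{t,p} = 1 && \forall t \\
& d_t = \sum_{p} w_{t,p}\, x_{t,p} && \forall t \\
& y_{uv} \ge x_{u,p} - x_{v,p} && \forall (u,v) \in E,\ \forall p \\
& s_v \ge s_u + d_u + c_{uv}\, y_{uv} && \forall (u,v) \in E \\
& s_v \ge s_u + d_u - B\,(3 - x_{u,p} - x_{v,p} - o_{uv}) && \forall u < v,\ \forall p \\
& s_u \ge s_v + d_v - B\,(2 - x_{u,p} - x_{v,p} + o_{uv}) && \forall u < v,\ \forall p \\
& M \ge s_t + d_t,\quad s_t \ge 0 && \forall t \\
& x_{t,p},\ y_{uv},\ o_{uv} \in \{0,1\}
\end{align*}
```

The first constraint assigns every task to exactly one processor. The $y_{uv}$ constraints force a communication delay when an edge crosses processors. The two big-$B$ constraints keep tasks that share a processor from overlapping, with $o_{uv}$ choosing which of the two runs first.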
Example Application: IPv4 Packet Forwarding
Data plane of IPv4 packet forwarding (RFC-1812)
Campus network router, Home router
Medium sized route table (5,000 entries or less)
Route table small enough to fit in on-chip memory
Lookup: inspect destination address and find next hop
Longest prefix match
Implementation determined by route distribution, memory and performance constraints
[Figure: packet flow from ingress to egress. Receive IPv4 packet -> verify version, checksum and TTL -> lookup next hop (prefix match) in the route table -> update checksum and TTL -> transmit IPv4 packet. Labels distinguish the header, which these steps operate on, from the payload.]
Target platform
Xilinx Virtex-II Pro 2VP50 FPGA
Architecture Library
MicroBlazes, PowerPC, on-chip Block RAM, IBM CoreConnect buses, queue
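The header-processing steps named above (verify, lookup, update) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the implementation that runs on the MicroBlaze processors: it handles only 20-byte headers without options, performs the longest prefix match by linear scan over a hypothetical list of (prefix, next hop) pairs, and recomputes the checksum from scratch after decrementing the TTL.

```python
import ipaddress
import struct


def checksum(header: bytes) -> int:
    """One's-complement of the one's-complement sum of 16-bit words."""
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def verify(header: bytes) -> bool:
    """Check version, checksum and TTL of a 20-byte header (no options)."""
    version = header[0] >> 4
    ttl = header[8]
    # A valid header sums to zero when its checksum field is included.
    # Assumption: packets with TTL <= 1 are not forwarded.
    return version == 4 and checksum(header) == 0 and ttl > 1


def lookup(table, dst: str):
    """Longest prefix match by linear scan; returns the next hop or None."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, next_hop in table:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, next_hop)
    return None if best is None else best[1]


def decrement_ttl(header: bytes) -> bytes:
    """Decrement the TTL and recompute the header checksum."""
    h = bytearray(header)
    h[8] -= 1
    h[10:12] = b"\x00\x00"  # zero the checksum field before recomputing
    h[10:12] = struct.pack("!H", checksum(bytes(h)))
    return bytes(h)


# Usage with a hypothetical three-entry route table.
ROUTES = [("10.0.0.0/8", "port0"), ("10.1.0.0/16", "port1"), ("0.0.0.0/0", "port2")]
next_hop = lookup(ROUTES, "10.1.2.3")
```

A linear scan is the simplest correct form of longest prefix match. As the slide notes, a real implementation would choose its lookup structure according to route distribution, memory and performance constraints.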
Hand-tuned Multiprocessor Design for IPv4 Forwarding
[Figure: hand-tuned multiprocessor design. Labels: From source, To source, MicroBlaze 1, MicroBlaze 2, Route Table, Verify (ver & ttl, checksum), Lookup1, Lookup2. Key: MicroBlaze, Block RAM, Bus, Queue.]
Achieved 1.8 Gbps throughput for header processing using 12 MicroBlaze processors
Improved Design after Automated Exploration
[Figure: design after automated exploration. Labels: From source, To source, MicroBlaze 1, MicroBlaze 2, Route Table, Verify (ver & ttl), Verify (checksum), Lookup1, Lookup2, Lookup3. Key: MicroBlaze, Block RAM, Bus, Queue.]
Resulting design achieved 2.0 Gbps throughput, surpassing the 1.8 Gbps hand-tuned design while using one fewer MicroBlaze processor
The improvement was due to a less regular configuration and a balanced workload of tasks across the processors
Justifying constraint optimization techniques
Our constraint optimization method can handle instances of the representative allocation and scheduling problem with up to 100’s of tasks onto 10’s of PE’s
Implementation constraints can be easily incorporated
Task groupings
Multiprocessor topology restrictions
Preferred allocations
Memory assignments
Mutual exclusion
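To show why such constraints fit naturally into a constraint formulation, here is a hedged sketch of how some of them could be written as linear constraints. The notation is an assumption made for this sketch, not taken from the talk: $x_{t,p} \in \{0,1\}$ is 1 when task $t$ is allocated to processor $p$, $E$ is the set of task graph edges, and $\mathcal{L}$ is the set of processor pairs joined by a communication link.

```latex
\begin{align*}
\text{Task grouping of } u, v:\quad & x_{u,p} = x_{v,p} && \forall p \\
\text{Mutual exclusion of } u, v:\quad & x_{u,p} + x_{v,p} \le 1 && \forall p \\
\text{Preferred allocation of } t \text{ to } p^{*}:\quad & x_{t,p^{*}} = 1 \\
\text{Topology restriction:}\quad & x_{u,p} + x_{v,q} \le 1 && \forall (u,v) \in E,\ \forall p \ne q,\ (p,q) \notin \mathcal{L}
\end{align*}
```

Each requirement becomes one family of linear constraints on the allocation variables, with no change to the solver. Memory assignments could be handled the same way with a second set of binary variables that map data items to memories.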
Following Moore’s Law
Extend to more complex applications
1000’s-10,000’s of tasks
Extend to bigger multiprocessor systems
100’s-1000’s of PE’s
[Figure: on-chip network connecting processing elements (PE) and memories (M) through network interfaces (NI).]
What can we do for RAMP?
Challenges in deploying concurrent applications on a RAMP system
Task allocation and scheduling across 100’s – 1000’s of PEs
Fast mapping step to enable efficient design space exploration
Our optimization techniques for static task allocation and scheduling are a first step to address these challenges
A “compile-time” tool to guide the designer to explore efficient mappings
Flexible formulation to target diverse multiprocessors
Research in progress to extend our techniques to work on problems in the scale of RAMP systems
Backup Slides
Example
Optimal design found in less than 6 seconds on 400 MHz Sparc II
[Figure: an application explored onto a 2VP50 architecture. Architecture labels: MicroBlazes and PowerPC (P11, P21, P12), BRAMs (M1), communication through FSLs and a bus. The optimal design is marked.]
Following Moore’s Law
Extend to more complex applications
1000’s-10,000’s of tasks
DSLAM
Extend to bigger multiprocessor systems
100’s-1000’s of PE’s
RAMP
Challenges in Automated Exploration
Higher exploration complexity
Increases by 2 orders of magnitude
More emphasis on communication
Arbitration modeling
Routing constraints due to network topology
Statistical cost model for dynamic behavior
Potential Approaches to Address these Challenges
Additional constraints can be easily added to incorporate new features
Constraint solver performance will slow down and thus become the bottleneck
Some strategies to improve constraint solver performance
Task graph based structural decompositions
Relaxation heuristics
Symmetry representation
Cutting planes and valid inequalities
[Figure: on-chip network of processors, memories and network interfaces joined by an interconnect. Key: PE = Processor, M = Memory, NI = Network Interface.]