Mapping Task Graphs to Processors in Large Multiprocessor Systems
Kurt Keutzer and the MESCAL Team, especially Yujia Jin, Kaushik Ravindran, and N. R. Satish
Design Space Exploration Flow
[Figure: exploration flow. An application description (a Click element graph built from FromDevice(0) to FromDevice(15), IPVerify, DecIPTTL, Lookup IPRoute, Discard, and ToDevice(0) to ToDevice(15)) and a multiprocessor platform description (MicroBlaze soft PEs, co-PEs for hardware acceleration, on-chip BRAM, off-chip SDRAM, OPB and PLB buses, FSL queues, Ethernet peripheral) yield a task graph with profiles (tasks S1, S2, R1, L1, T1, R2, L2, T2), platform constraints, and scheduling constraints. These feed Allocation/Scheduling, followed by HW/SW generation, Implementation, and Performance Analysis, which produces Performance Numbers.]
11/6/2015
Investigative Approach
• Demonstrate network applications on FPGA-based soft multiprocessors
  - Tomahawk exploration framework
  - Automated task allocation and scheduling
• Extend framework to large multiprocessor systems
  - 1000’s-10,000’s of tasks
  - 100’s-1000’s of PE’s
  - RAMP
What Is an FPGA-based Soft Multiprocessor System?
• A network of architecture building blocks on an FPGA
• Multiprocessor architecture customized for target application
  - Number of processors
  - Interconnection network
  - Memory hierarchy
  - Custom co-processors
• Productivity gains due to software abstraction
• Cost reduction by avoiding custom silicon
Architecture Building Blocks
  Processing Element: MicroBlaze (soft), PowerPC (hard)
  Co-Processor: Hash engine, Crypto engine
  Bus: OPB, PLB
  Queue: FSL
  Memory: BRAM (on-chip), SDRAM (off-chip)
[Figure: Multiprocessor Configuration. MicroBlaze (soft) and PowerPC (hard) PEs with co-PEs for hardware acceleration, connected by FSL queues and OPB/PLB buses to on-chip BRAM, off-chip SDRAM, and an Ethernet peripheral. Target devices: Xilinx Virtex-II Pro and Virtex-IV family of FPGAs.]
Obstacles to Their Adoption: Hard to design
• Complex micro-architecture design space
  - Processor choices
  - Memory hierarchy
  - Communication topology
• Difficult mapping decisions
  - assigning computation to processing elements
  - assigning data to exposed heterogeneous memories
• To unlock the potential of these systems, tools enabling efficiency and productivity are needed
Example: Design Difficulty
Architecture Model: processors P1 and P2 connected by a queue
Application Task Graph: tasks R1, L1, T1 and R2, L2, T2
Profile: Execution Time (cycles)
  Task   P1   P2
  R      20   10
  L      20   20
  T      30   10
[Figure: exploration of candidate mappings of the task graph onto P1 and P2 (Design A, Design B, and the optimal design), each annotated with its makespan (total time) in cycles.]
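The difficulty in this example can be reproduced with a small exhaustive search. The sketch below is illustrative only: the execution times come from the profile table, but the dependency chains (R -> L -> T for each packet stream) and the `comm` parameter for queue transfer cost are assumptions, since the figure does not survive in text form. It tries every allocation of the six tasks to P1 and P2 and every dependency-respecting order, scheduling each task as soon as possible.

```python
from itertools import permutations, product

# Execution time in cycles per task type on each processor (profile table).
PROFILE = {"R": {"P1": 20, "P2": 10},
           "L": {"P1": 20, "P2": 20},
           "T": {"P1": 30, "P2": 10}}
TASKS = ["R1", "L1", "T1", "R2", "L2", "T2"]
# Assumed dependencies: two independent chains R -> L -> T.
PRED = {"R1": None, "L1": "R1", "T1": "L1",
        "R2": None, "L2": "R2", "T2": "L2"}
PROCS = ["P1", "P2"]


def makespan(alloc, order, comm=0):
    """Schedule tasks as soon as possible in the given order; return finish time."""
    proc_free = {p: 0 for p in PROCS}
    finish = {}
    for t in order:
        p = alloc[t]
        ready = 0
        pred = PRED[t]
        if pred is not None:
            # Hypothetical transfer cost when producer and consumer differ.
            ready = finish[pred] + (comm if alloc[pred] != p else 0)
        start = max(proc_free[p], ready)
        finish[t] = start + PROFILE[t[0]][p]
        proc_free[p] = finish[t]
    return max(finish.values())


def topological_orders():
    """Yield every task order that respects the assumed chain dependencies."""
    for order in permutations(TASKS):
        pos = {t: i for i, t in enumerate(order)}
        if all(PRED[t] is None or pos[PRED[t]] < pos[t] for t in TASKS):
            yield order


def explore(comm=0):
    """Search all allocations and orders; return (best makespan, allocation)."""
    orders = list(topological_orders())
    best = None
    for choice in product(PROCS, repeat=len(TASKS)):
        alloc = dict(zip(TASKS, choice))
        span = min(makespan(alloc, o, comm) for o in orders)
        if best is None or span < best[0]:
            best = (span, alloc)
    return best


if __name__ == "__main__":
    print(explore())
```

Even with six tasks and two processors this evaluates 64 allocations against 20 orders each, which hints at why enumeration does not scale to the task counts discussed later in the talk.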
Tomahawk: Network Applications onto Soft MPs
[Figure: Tomahawk flow. A Click application description (FromDevice, IPVerify, DecIPTTL, Lookup IPRoute, Discard, and ToDevice elements for ports 0 to 15) is converted to a task graph (S1, S2, R1, L1, T1, R2, L2, T2). Automated microarchitecture configuration and automated mapping place the tasks on processors P1 and P2 with memory M1. The output is C programs and a micro-architecture specification for a soft multiprocessor (MicroBlaze PEs, co-PEs for hardware acceleration, FSL, OPB, PLB, on-chip BRAM, off-chip SDRAM, Ethernet peripheral) on a Xilinx 2VP50 FPGA.]
Possible Approaches for Automated Exploration
• Randomized algorithms
  - probabilistic bounds, simulated annealing
• Heuristic methods
  - list scheduling, force directed scheduling
• Exact methods
  - enumeration and tabu search, branch-and-bound
• Limitations of these approaches
  - Specific implementation constraints are hard to enforce
  - Most approaches require per-instance tuning and are hard to generalize, which makes them poor for design space exploration
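As a concrete instance of the heuristic family, the following is a minimal list-scheduling sketch. It is not the scheduler used in the talk: the priority function (longest remaining path at best-case execution times), the earliest-finish processor choice, and the name-based tie-breaking are all assumptions made for illustration. The usage data reuses the profile from the design-difficulty example with assumed chain dependencies.

```python
def list_schedule(tasks, preds, cost, procs):
    """Greedy list scheduling.

    tasks: task names, given in a topological order
    preds: dict task -> list of predecessor tasks
    cost:  dict (task, proc) -> execution time
    procs: list of processor names
    Returns (makespan, allocation dict).
    """
    succs = {t: [] for t in tasks}
    for t in tasks:
        for q in preds.get(t, []):
            succs[q].append(t)
    # Priority: longest path to a sink using best-case execution times.
    rank = {}
    for t in reversed(tasks):
        best_case = min(cost[t, p] for p in procs)
        rank[t] = best_case + max((rank[s] for s in succs[t]), default=0)

    proc_free = {p: 0 for p in procs}
    finish, alloc = {}, {}
    remaining = set(tasks)
    while remaining:
        ready = [t for t in remaining
                 if all(q in finish for q in preds.get(t, []))]
        t = max(ready, key=lambda x: (rank[x], x))  # highest priority first
        est = max((finish[q] for q in preds.get(t, [])), default=0)
        # Choose the processor on which this task finishes earliest.
        p = min(procs, key=lambda q: (max(proc_free[q], est) + cost[t, q], q))
        finish[t] = max(proc_free[p], est) + cost[t, p]
        proc_free[p] = finish[t]
        alloc[t] = p
        remaining.remove(t)
    return max(finish.values()), alloc


# Usage: profile table from the design-difficulty example; chain edges assumed.
tasks = ["R1", "R2", "L1", "L2", "T1", "T2"]
preds = {"L1": ["R1"], "T1": ["L1"], "L2": ["R2"], "T2": ["L2"]}
times = {"R": (20, 10), "L": (20, 20), "T": (30, 10)}
cost = {(t, p): times[t[0]][i]
        for t in tasks for i, p in enumerate(["P1", "P2"])}
span, alloc = list_schedule(tasks, preds, cost, ["P1", "P2"])
print(span, alloc)
```

The limitation named on the slide shows up directly: a requirement such as "these two tasks must not share a processor" has no natural place in this loop and would need ad hoc changes to the priority or selection rule.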
Constraint Optimization Techniques for Automated Exploration
• Constraint solver technologies
  - Integer linear programming (ILP) solvers
  - 0-1 Boolean reasoning solvers (SAT, PB-SAT)
• Advantages
  - Constraint formulations are a formal, yet natural way to capture a mathematical optimization problem
  - Implementation constraints specific to a problem can be incorporated easily
  - Constraint solvers can exhaustively cover a search space without enumerating all solutions
• Key strategies to improve solver performance:
  - Decomposition methods
  - Variable ordering
  - Improved lower and upper bounds
  - Symmetry representation
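How a solver can cover a search space without visiting every solution is easiest to see in a toy branch-and-bound. The sketch below is a hand-written illustration, not an ILP or SAT solver and not the method from the talk. It assumes a deliberately simplified problem (independent tasks, identical processors, minimize the largest processor load) and shows three of the strategies listed above: variable ordering, lower and upper bounds, and symmetry handling.

```python
def branch_and_bound(durations, num_procs):
    """Minimize makespan of independent tasks on identical processors.

    Returns (best_makespan, nodes_visited).
    """
    tasks = sorted(durations, reverse=True)   # variable ordering: largest first
    total = sum(tasks)
    balanced = -(-total // num_procs)         # ceil(total / num_procs)
    best = [total]                            # upper bound: all tasks on one processor
    nodes = [0]
    loads = [0] * num_procs

    def search(i):
        nodes[0] += 1
        current = max(loads)
        # Lower bound: no completion of this partial assignment can beat
        # the current maximum load or a perfectly balanced split.
        if max(current, balanced) >= best[0]:
            return                            # prune this whole subtree
        if i == len(tasks):
            best[0] = current                 # new incumbent
            return
        seen_empty = False
        for p in range(num_procs):
            if loads[p] == 0:
                if seen_empty:
                    continue                  # symmetry: empty processors are interchangeable
                seen_empty = True
            loads[p] += tasks[i]
            search(i + 1)
            loads[p] -= tasks[i]

    search(0)
    return best[0], nodes[0]


if __name__ == "__main__":
    print(branch_and_bound([5, 4, 3, 3, 3], 2))
```

Once the incumbent reaches the balanced lower bound, every remaining subtree is pruned at its root, so optimality is established without enumerating the remaining assignments.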
ILP Formulation
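For orientation, a common textbook way to pose static allocation and scheduling as an ILP is sketched here. It is an illustration under stated assumptions, not necessarily the formulation used in the talk, and every symbol is introduced for this sketch: $T$ is the task set, $P$ the processor set, $E$ the task-graph edges, $w_{t,p}$ the profiled execution time of task $t$ on processor $p$, $x_{t,p}$ a binary allocation variable, $s_t$ and $f_t$ the start and finish times, $y_{u,v}$ a binary ordering variable, $M$ a sufficiently large constant, and $C$ the makespan. Communication cost is ignored.

```latex
\begin{aligned}
\min\ & C \\
\text{s.t. } & \sum_{p \in P} x_{t,p} = 1 && \forall t \in T \\
& f_t = s_t + \sum_{p \in P} w_{t,p}\, x_{t,p} && \forall t \in T \\
& s_v \ge f_u && \forall (u,v) \in E \\
& s_v \ge f_u - M\,(3 - x_{u,p} - x_{v,p} - y_{u,v}) && \forall u < v,\ p \in P \\
& s_u \ge f_v - M\,(2 - x_{u,p} - x_{v,p} + y_{u,v}) && \forall u < v,\ p \in P \\
& C \ge f_t && \forall t \in T \\
& x_{t,p},\ y_{u,v} \in \{0,1\},\quad s_t \ge 0
\end{aligned}
```

The two big-$M$ rows only bind when tasks $u$ and $v$ share processor $p$: with $y_{u,v} = 1$ the first forces $u$ to finish before $v$ starts, and with $y_{u,v} = 0$ the second forces the opposite order.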
Example Application: IPv4 Packet Forwarding
• Data plane of IPv4 packet forwarding (RFC-1812)
  - Campus network router, home router
  - Medium sized route table (5,000 entries or less)
  - Route table small enough to fit in on-chip memory
[Figure: forwarding pipeline. Ingress -> receive IPv4 packet -> verify version, checksum and TTL (header) -> lookup next hop by prefix match in the route table -> update checksum and TTL (header) -> transmit IPv4 packet -> Egress. The figure distinguishes header and payload paths.]
• Lookup: inspect destination address and find next hop
  - Longest prefix match
  - Implementation determined by route distribution, memory and performance constraints
• Target platform
  - Xilinx Virtex-II Pro 2VP50 FPGA
• Architecture Library
  - MicroBlazes, PowerPC, on-chip Block RAM, IBM CoreConnect buses, queue
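The header-processing steps named on this slide (verify, lookup, update) can be sketched in software as follows. This is a minimal illustration, not the MicroBlaze implementation from the talk. It assumes a 20-byte header without options, recomputes the checksum from scratch instead of updating it incrementally, and uses a linear scan for longest prefix match where a real design would pick a structure suited to its route distribution and memory. The function names and the `(prefix, next_hop)` table layout are hypothetical.

```python
import ipaddress
import struct


def checksum(header: bytes) -> int:
    """Ones'-complement checksum over the header's 16-bit words."""
    total = sum(word for (word,) in struct.iter_unpack("!H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def verify(header: bytes) -> bool:
    """Check version, header checksum, and that TTL survives a decrement."""
    version = header[0] >> 4
    ttl = header[8]
    # A header with a correct checksum field sums to zero.
    return version == 4 and checksum(header) == 0 and ttl > 1


def lookup(route_table, addr):
    """Longest prefix match by linear scan; returns the next hop or None."""
    best = None
    for prefix, next_hop in route_table:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, next_hop)
    return best[1] if best else None


def decrement_ttl(header: bytes) -> bytes:
    """Decrement TTL and recompute the header checksum."""
    h = bytearray(header)
    h[8] -= 1
    h[10:12] = b"\x00\x00"                       # zero the field before summing
    h[10:12] = struct.pack("!H", checksum(bytes(h)))
    return bytes(h)


def forward(header: bytes, route_table):
    """Return (next_hop, updated_header), or None if the packet is dropped."""
    if not verify(header):
        return None
    dst = ipaddress.ip_address(header[16:20])    # destination address field
    hop = lookup(route_table, dst)
    if hop is None:
        return None
    return hop, decrement_ttl(header)
```

Each function corresponds to one stage of the pipeline, which is what makes the application easy to express as a task graph: the stages can be placed on different processors and connected by queues.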
Hand-tuned Multiprocessor Design for IPv4 Forwarding
[Figure: hand-tuned design. Packets arrive from two source MicroBlazes and pass through MicroBlaze stages performing Verify (version, TTL, checksum), Lookup1, and Lookup2 against route tables held in Block RAM, then return to the source MicroBlazes. Key: MicroBlaze, Block RAM, bus, queue.]
• Achieved 1.8 Gbps throughput for header processing
  - using 12 MicroBlaze processors
Improved Design after Automated Exploration
[Figure: design found by automated exploration. Packets arrive from two source MicroBlazes; verification is split into Verify (version and TTL) and Verify (checksum) tasks, and lookup into Lookup1, Lookup2, and Lookup3 tasks against three route tables held in Block RAM; results return to the source MicroBlazes. Key: MicroBlaze, Block RAM, bus, queue.]
• Resulting design achieved 2.0 Gbps throughput
  - surpassing the performance of the 1.8 Gbps hand-tuned design
  - using one fewer MicroBlaze processor
• The improvement was due to a less regular configuration and a balanced workload of tasks across the processors
Justifying constraint optimization techniques
• Our constraint optimization method can handle instances of the representative allocation and scheduling problem with up to 100’s of tasks mapped onto 10’s of PE’s
• Implementation constraints can be easily incorporated
  - Task groupings
  - Multiprocessor topology restrictions
  - Preferred allocations
  - Memory assignments
  - Mutual exclusion
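To make "easily incorporated" concrete, here is how each kind of constraint might look as linear rows over a binary allocation variable. These are illustrative encodings, not taken from the talk, and all symbols are assumptions: $x_{t,p} = 1$ when task $t$ runs on PE $p$, $z_{d,m} = 1$ when data object $d$ is placed in memory $m$, $P_t$ is the set of PEs allowed for task $t$, $N$ is the set of PE pairs with no connecting link, $E$ is the set of task-graph edges, and $\mathrm{size}_d$ and $\mathrm{cap}_m$ are object sizes and memory capacities.

```latex
\begin{aligned}
\text{Task grouping } (a, b \text{ on the same PE}): \quad & x_{a,p} = x_{b,p} && \forall p \in P \\
\text{Mutual exclusion } (a, b \text{ on different PEs}): \quad & x_{a,p} + x_{b,p} \le 1 && \forall p \in P \\
\text{Preferred allocation}: \quad & \sum_{p \in P_t} x_{t,p} = 1 \\
\text{Topology restriction}: \quad & x_{u,p} + x_{v,q} \le 1 && \forall (u,v) \in E,\ (p,q) \in N \\
\text{Memory assignment}: \quad & \sum_{m} z_{d,m} = 1, \quad \sum_{d} \mathrm{size}_d\, z_{d,m} \le \mathrm{cap}_m && \forall d,\ \forall m
\end{aligned}
```

Each requirement adds rows to the model without changing the solver or the rest of the formulation, which is the contrast with the heuristic approaches listed earlier.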
Following Moore’s Law
• Extend to more complex applications
  - 1000’s-10,000’s of tasks
• Extend to bigger multiprocessor systems
  - 100’s-1000’s of PE’s
[Figure: tiled multiprocessor in which processing elements (PE) and memories (M) attach through network interfaces (NI) to an on-chip network.]
What can we do for RAMP?
• Challenges in deploying concurrent applications on a RAMP system
  - Task allocation and scheduling across 100’s-1000’s of PEs
  - Fast mapping step to enable efficient design space exploration
• Our optimization techniques for static task allocation and scheduling are a first step to address these challenges
  - A “compile-time” tool to guide the designer to explore efficient mappings
  - Flexible formulation to target diverse multiprocessors
  - Research in progress to extend our techniques to work on problems at the scale of RAMP systems
Backup Slides
Example
• Optimal design found in less than 6 seconds on a 400 MHz Sparc II
[Figure: the application is explored against an architecture on the 2VP50 consisting of MicroBlazes, a PowerPC, BRAMs, and FSL and bus communication (labels P11, P21, P12, M1), yielding the optimal design.]
Following Moore’s Law
• Extend to more complex applications
  - 1000’s-10,000’s of tasks
  - DSLAM
• Extend to bigger multiprocessor systems
  - 100’s-1000’s of PE’s
  - RAMP
Challenges in Automated Exploration
• Higher exploration complexity
  - Increases by 2 orders of magnitude
• More emphasis on communication
  - Arbitration modeling
  - Routing constraints due to network topology
• Statistical cost model for dynamic behavior
Potential Approaches to Address these Challenges
• Additional constraints can be easily added to incorporate new features
• Constraint solver performance will slow down and thus become the bottleneck
• Some strategies to improve constraint solver performance
  - Task graph based structural decompositions
  - Relaxation heuristics
  - Symmetry representation
  - Cutting planes and valid inequalities
[Figure: tiled multiprocessor with an on-chip network as the processor interconnect. Key: PE = processor, M = memory, NI = network interface.]
