Adaptive Latency-Aware Parallel Resource Mapping:
Task Graph Scheduling → Heterogeneous Network Topology
Liwen Shih, Ph.D.
Computer Engineering
U of Houston – Clear Lake
[email protected]
ADAPTIVE PARALLEL TASK TO
NETWORK TOPOLOGY MAPPING
Latency-adaptive:
• Topology
• Traffic
• Bandwidth
• Workload
• System hierarchy
Thread partition:
• Coarse
• Medium
• Fine
Fine-Grained Mapping System
[Shih 1988]
• Parallel Mapping
– Compiler- vs. run- time
• Task migration
– Vertical vs. Horizontal
• Domain decomposition
– Data vs. Function
• Execution order
– Eager data-driven
vs. Lazy demand-driven
PRIORITIZE TASK DFG NODES
Task priority factors:
1. Level depth
2. Critical Paths
3. In/Out degree
Data flow partial order:
{(n7 → n5), (n7 → n4), (n6 → n4), (n6 → n3),
(n5 → n1), (n4 → n2), (n3 → n2), (n2 → n1)}
⇒ Total task priority order:
{n1 > n2 > n4 > n3 > n5 > n6 > n7}
⇒ P2 thread: {n1 > n2 > n4 > n3 > n6}
P3 thread: {n5 > n7}
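
A minimal Python sketch of the three priority factors on the seven-node example DFG. The function name `priority_factors`, the edge direction (producer to consumer), and the definitions used here (level = longest hop count to an output node; critical path = a longest input-to-output path) are assumptions for illustration, not the author's stated formulas. The sketch only computes the factors; it does not reproduce the tie-breaking behind the total priority order on the slide.

```python
from collections import defaultdict

# Data-flow edges from the slide, read as (producer, consumer). Direction is assumed.
EDGES = [("n7", "n5"), ("n7", "n4"), ("n6", "n4"), ("n6", "n3"),
         ("n5", "n1"), ("n4", "n2"), ("n3", "n2"), ("n2", "n1")]

def priority_factors(edges):
    """Return (level depth, critical-path count, (in, out) degree) per node."""
    succ, pred, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))

    # Topological order (Kahn's algorithm): producers before consumers.
    indeg = {n: len(pred[n]) for n in nodes}
    ready = sorted(n for n in nodes if indeg[n] == 0)
    topo = []
    while ready:
        n = ready.pop()
        topo.append(n)
        for s in succ[n]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)

    # Pass 1, outputs toward inputs: depth[n] is the longest hop count to an
    # output; down[n] is the number of such longest paths below n.
    depth, down = {}, {}
    for n in reversed(topo):
        depth[n] = max((depth[s] + 1 for s in succ[n]), default=0)
        down[n] = sum(down[s] for s in succ[n] if depth[s] + 1 == depth[n]) or 1

    # Pass 2, inputs toward outputs: the mirror image of pass 1.
    height, up = {}, {}
    for n in topo:
        height[n] = max((height[p] + 1 for p in pred[n]), default=0)
        up[n] = sum(up[p] for p in pred[n] if height[p] + 1 == height[n]) or 1

    # A node lies on a critical path only if its longest path above plus its
    # longest path below equals the overall longest path length.
    longest = max(depth.values())
    critical = {n: up[n] * down[n] if height[n] + depth[n] == longest else 0
                for n in nodes}
    degree = {n: (len(pred[n]), len(succ[n])) for n in nodes}
    return depth, critical, degree

depth, critical, degree = priority_factors(EDGES)
for n in sorted(depth):
    print(n, "level", depth[n], "critical paths", critical[n], "in/out", degree[n])
```

Under these definitions n5 lies on no critical path, while n1 and n2 lie on all three.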
SHORTEST-PATH NETWORK ROUTING
Shortest-path latencies and
routes are updated
after each task-to-processor allocation.
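
A hedged sketch of a latency/routing table pair. The slides do not name the shortest-path algorithm; Floyd-Warshall is swapped in here for brevity, and the function names, the symmetric link latencies, and the four-processor ring are all hypothetical. For simplicity the sketch recomputes both tables in full after a change, rather than updating only the affected entries.

```python
import math

def latency_routing_tables(num_procs, channels):
    """channels: {(p, q): latency} for undirected links (assumed symmetric).
    Returns (dist, nxt): shortest latency and next-hop routing tables."""
    dist = [[0 if i == j else math.inf for j in range(num_procs)]
            for i in range(num_procs)]
    nxt = [[j if i == j else None for j in range(num_procs)]
           for i in range(num_procs)]
    for (p, q), lat in channels.items():
        if lat < dist[p][q]:
            dist[p][q] = dist[q][p] = lat
            nxt[p][q], nxt[q][p] = q, p
    # Floyd-Warshall relaxation over every intermediate processor k.
    for k in range(num_procs):
        for i in range(num_procs):
            for j in range(num_procs):
                via = dist[i][k] + dist[k][j]
                if via < dist[i][j]:
                    dist[i][j] = via
                    nxt[i][j] = nxt[i][k]
    return dist, nxt

def route(nxt, src, dst):
    """Follow next-hop entries from src to dst; empty list if unreachable."""
    if nxt[src][dst] is None:
        return []
    path = [src]
    while src != dst:
        src = nxt[src][dst]
        path.append(src)
    return path

# Hypothetical 4-processor ring with unit link latency.
ring = {(0, 1): 1, (1, 2): 1, (2, 3): 1, (3, 0): 1}
dist, nxt = latency_routing_tables(4, ring)
print(route(nxt, 0, 2), dist[0][2])

# Suppose an allocation loads link 0-1; rebuild the tables with the new latency.
ring[(0, 1)] = 5
dist, nxt = latency_routing_tables(4, ring)
print(route(nxt, 0, 2), dist[0][1])
```

After the update, traffic from processor 0 to processor 2 is routed around the loaded link through processor 3.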
Adaptive A* Parallel Processor Scheduler
• Given a directed acyclic task DFG G(V, E), with task
vertex set V connected by data-flow edge set E, and a
processor network topology N(P, C), with processor
node set P connected by channel link set C
• Find a processor assignment and schedule
S: V(G) → P(N)
such that S minimizes the total parallel computation time of G.
• A* heuristic mapping reduces scheduling complexity
from NP-hard (exact search) to polynomial time (heuristic search)
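
One common way to write this objective, given as a sketch only. The symbols s(v) (start time), f(v) (finish time), w (task computation time on a processor), and λ (shortest-path latency between two processors) are notation introduced here for illustration and do not appear in the slides.

```latex
\begin{aligned}
\min_{S : V \to P} \quad & T(S) = \max_{v \in V} f(v) \\
\text{subject to} \quad
  & f(v) = s(v) + w\bigl(v, S(v)\bigr), && v \in V, \\
  & s(v) \ge f(u) + \lambda\bigl(S(u), S(v)\bigr), && (u, v) \in E, \\
  & \lambda(p, p) = 0, && p \in P .
\end{aligned}
```

Tasks mapped to the same processor must additionally not overlap in time.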
Demand-Driven Task-Topology mapping
• STEP 1 – assign a level to each task node (vertex) in G.
• STEP 2 – count the critical paths passing through each DFG edge
and node with a two-pass graph traversal: bottom-up, then
top-down.
• STEP 3 – initially load all deepest-level task nodes that produce
outputs onto the working task node list, in priority order.
• STEP 4 – WHILE the working task node list is not empty, schedule
the best processor for the top-priority task, and replace the task
with its parent task nodes, inserted into the working task node
priority list.
Demand-Driven Processor Scheduling
STEP 4 – WHILE the working task node list is not empty:
BEGIN
– STEP 4.1 – initialize (first time) or update the inter-processor
shortest-path latency/routing table pair affected by the last
task-processor allocation.
– STEP 4.2 – assign a nearby capable processor to minimize thread
computation time for the highest-priority task node at the top of the
remaining prioritized working list.
– STEP 4.3 – remove the newly scheduled task node and replace it with its
parent nodes, which are inserted into the working list (demand-driven)
by priority. Tie-breaker rules, together with node level depth, estimate
the time cost of the entire computation thread involved.
END{WHILE}
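
The loop above can be sketched in Python under several loudly stated assumptions: priorities are supplied by the caller (here the total order from the earlier priority slide); a parent enters the working list once all of its consumers are scheduled; the latency table is static, so STEP 4.1 is omitted; and time is measured backward from the end of the run. The name `demand_driven_schedule` and all parameters are hypothetical, and this is not the author's A* implementation.

```python
import heapq
from collections import defaultdict

def demand_driven_schedule(edges, cost, latency, priority):
    """edges: (producer, consumer) pairs; cost: task -> compute time;
    latency: shortest-path latency matrix with a zero diagonal;
    priority: task -> sortable key, smaller is scheduled first."""
    producers, consumers = defaultdict(list), defaultdict(list)
    for u, v in edges:
        producers[v].append(u)
        consumers[u].append(v)
    pending = {t: len(consumers[t]) for t in cost}  # consumers not yet scheduled

    # STEP 3: the working list starts with the output-producing tasks.
    work = [(priority[t], t) for t in cost if pending[t] == 0]
    heapq.heapify(work)

    proc_of, remaining = {}, {}   # remaining[t]: time from start of t to end of run
    busy = [0.0] * len(latency)   # time already committed on each processor
    order = []
    while work:                                        # STEP 4
        _, t = heapq.heappop(work)
        best = None
        for p in range(len(latency)):                  # STEP 4.2
            # t must finish before everything already placed on p, and early
            # enough for its result to reach each consumer's processor.
            after = max([busy[p]] + [remaining[c] + latency[p][proc_of[c]]
                                     for c in consumers[t]])
            total = cost[t] + after
            if best is None or total < best[0]:
                best = (total, p)
        remaining[t], proc_of[t] = best
        busy[proc_of[t]] = remaining[t]
        order.append(t)
        for u in producers[t]:                         # STEP 4.3
            pending[u] -= 1
            if pending[u] == 0:
                heapq.heappush(work, (priority[u], u))
    return order, proc_of, max(remaining.values())

# Toy run: the seven-node example DFG, unit task costs, two processors that
# are one latency unit apart (all assumed values).
edges = [("n7", "n5"), ("n7", "n4"), ("n6", "n4"), ("n6", "n3"),
         ("n5", "n1"), ("n4", "n2"), ("n3", "n2"), ("n2", "n1")]
names = ["n1", "n2", "n4", "n3", "n5", "n6", "n7"]
priority = {n: i for i, n in enumerate(names)}
cost = {n: 1 for n in names}
order, proc_of, makespan = demand_driven_schedule(
    edges, cost, [[0, 1], [1, 0]], priority)
print(order, proc_of, makespan)
```

Under these toy assumptions the sketch places n1, n2, n4, n3, and n6 on one processor and n5 and n7 on the other, for a makespan of 5 time units against 7 when run sequentially.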
QUANTIFY SW/HW MAPPING QUALITY
• Example 1 – Latency-Adaptive Tree-Task to
Tree-Machine Mapping
• Example 2 – Scaling to Larger Tree-to-Tree
Mapping
• Example 3 – Select the Best Processor
Topology Match for an Irregular Task Graph
Example 1 – Latency-Adaptive Tree-Task
to Tree-Machine Mapping
K-th Largest Selection
Will the tree algorithm [3]
match the tree machine [4]?
Example 1 – Latency-Adaptive Tree-Task
to Tree-Machine Mapping
Adaptive mapping
moves toward
sequential
processing when the
inter/intra
communication
latency ratio
increases.
Example 1 – Latency-Adaptive Tree-Task
to Tree-Machine Mapping
Adaptive Mapper
allocates fewer
processors and channels
with fewer hops.
Example 1 – Latency-Adaptive Tree-Task
to Tree-Machine Mapping
Adaptive Mapper
achieves higher
speedups consistently.
(Bonus! 25.7+ pipeline
processing speedup can
be extrapolated when
inter/intra communication
latency ratio < 1)
Example 1 – Latency-Adaptive Tree-Task
to Tree-Machine Mapping
Adaptive Mapper
results in better
efficiencies consistently.
(Bonus! 428.3+% pipeline
processing efficiency can
be extrapolated when
inter/intra communication
latency ratio <1)
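
For reference, a short sketch of the usual definitions behind these two metrics: speedup is sequential time over parallel time, and efficiency is speedup per processor. The slides do not spell out their formulas, so these definitions, the function names, and the sample timings are assumptions.

```python
def speedup(t_sequential, t_parallel):
    """Speedup: how many times faster the parallel schedule finishes."""
    return t_sequential / t_parallel

def efficiency(speedup_value, num_processors):
    """Efficiency: speedup per allocated processor (1.0 means 100%)."""
    return speedup_value / num_processors

# Hypothetical timings: 7 time units sequentially, 5 units on 2 processors.
s = speedup(7.0, 5.0)
e = efficiency(s, 2)
print(f"speedup {s:.2f}, efficiency {e:.0%}")

# Consistency check on the two bonus figures quoted on the slides.
print(f"{efficiency(25.7, 6):.1%}")
```

The two bonus figures are numerically consistent with this definition if six processors are assumed (25.7 / 6 ≈ 4.283, or 428.3%). The slides do not state the processor count, so this is an inference only.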
Example 2 – Scaling to Larger Tree-to-Tree Mapping
Adaptive Mapper
achieves sub-optimal
speedups as tree
sizes scale larger,
still trailing
fixed tree-to-tree
mapping closely.
Example 2 – Scaling to Larger Tree-to-Tree Mapping
Adaptive Mapper is
always more cost-efficient,
using fewer resources, with
comparable sub-optimal
speedups to fixed tree-to-tree
mapping as tree sizes scale.
Example 3 – Select the Best Processor
Topology Match for an Irregular Task Graph
No clues to a matching
topology for the
irregularly shaped
Robot Elbow
Manipulator [5] task graph
• 105 task nodes
• 161 data flow edges
• 29 node levels
Example 3 – Select the Best Processor
Topology Match for an Irregular Task Graph
• Candidate topologies
• Compare schedules for
each topology
• Farther processors may
not be selected
– Linear Array
– Tree
Example 3 – Select the Best Processor
Topology Match for an Irregular Task Graph
Best network topology performers (# channels)
• Complete (28)
• Mesh (12)
• Chordal ring (16)
• Systolic array (16)
• Cube (12)
Example 3 – Select the Best Processor
Topology Match for an Irregular Task Graph
Fewer processors
selected for
higher diameter
networks
• Tree
• Linear Array
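
To make "higher diameter" concrete, the sketch below builds a few candidate topologies and measures their diameter (worst-case hop count) with breadth-first search. The builder names and the choice of 8 processors are assumptions. The slides do not state the processor count, although 28 channels for the complete network and 12 for the cube are consistent with 8 nodes.

```python
from collections import deque
from itertools import combinations

def linear_array(n):
    return [(i, i + 1) for i in range(n - 1)]

def binary_tree(n):
    # Heap-style numbering: the parent of node i is (i - 1) // 2.
    return [((i - 1) // 2, i) for i in range(1, n)]

def hypercube(dim):
    # Link nodes whose labels differ in exactly one bit; list each link once.
    return [(i, i ^ (1 << b)) for i in range(1 << dim)
            for b in range(dim) if i < (i ^ (1 << b))]

def complete(n):
    return list(combinations(range(n), 2))

def diameter(n, links):
    """Largest shortest-path hop count between any two nodes (connected graph)."""
    adj = {i: [] for i in range(n)}
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    worst = 0
    for src in range(n):
        hops = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in hops:
                    hops[v] = hops[u] + 1
                    queue.append(v)
        worst = max(worst, max(hops.values()))
    return worst

topologies = {
    "Linear array": linear_array(8),
    "Tree": binary_tree(8),
    "Cube": hypercube(3),
    "Complete": complete(8),
}
for name, links in topologies.items():
    print(f"{name}: {len(links)} channels, diameter {diameter(8, links)}")
```

With 8 nodes, the linear array and the tree have the largest diameters of the four. This fits the observation that farther processors in those networks are less likely to be selected.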
Example 3 – Select the Best Processor
Topology Match for an Irregular Task Graph
Deducing network
switch hops
• Low multi-hop data exchanges: < 10%
• Moderate 0-hop: 30% to 50%
• High near-neighbor direct 1-hop: 50% to 70%
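
A small sketch of how such hop shares could be tallied from a finished mapping. The function name `hop_shares`, the hop table, and the assumption that P2 and P3 are directly linked are hypothetical. The task placement reuses the P2/P3 thread split from the earlier seven-node example, not the robot elbow manipulator graph.

```python
from collections import Counter

def hop_shares(edges, proc_of, hops):
    """Fraction of data-flow edges that cross 0, 1, or more switch hops."""
    counts = Counter()
    for producer, consumer in edges:
        h = hops[proc_of[producer]][proc_of[consumer]]
        counts["0-hop" if h == 0 else "1-hop" if h == 1 else "multi-hop"] += 1
    total = sum(counts.values())
    return {kind: counts[kind] / total for kind in ("0-hop", "1-hop", "multi-hop")}

# Seven-node example DFG and its P2/P3 thread split from the earlier slide.
edges = [("n7", "n5"), ("n7", "n4"), ("n6", "n4"), ("n6", "n3"),
         ("n5", "n1"), ("n4", "n2"), ("n3", "n2"), ("n2", "n1")]
proc_of = {"n1": "P2", "n2": "P2", "n4": "P2", "n3": "P2", "n6": "P2",
           "n5": "P3", "n7": "P3"}
# Assumed hop table: P2 and P3 are direct neighbors.
hops = {"P2": {"P2": 0, "P3": 1}, "P3": {"P2": 1, "P3": 0}}
shares = hop_shares(edges, proc_of, hops)
print(shares)
```

In this toy case six of the eight data-flow edges stay on one processor and two cross a single link.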
Future Speed/Memory/Power
Optimization
• Latency-adaptive
– Topology
– Traffic
– Bandwidth
– Workload
– System hierarchy
• Thread partition
– Coarse
– Mid
– Fine
• Latency/Routing tables
– Neighborhood
– Network hierarchy
– Worm-hole
– Dynamic mobile network routing
– Bandwidth
– Heterogeneous system
• Algorithm-specific network topology
References
Q & A?
Liwen Shih, Ph.D.
Professor in Computer Engineering
University of Houston – Clear Lake
[email protected]
xScale13 paper
Thank You!