Optimization techniques for the development of embedded applications on single-chip multiprocessor platforms

Michela Milano
[email protected]
DEIS, Università di Bologna
Digital Convergence – Mobile Example

[Diagram: one mobile device converging Communication, Computing, Imaging, Entertainment, Broadcasting, and Telematics]

• One device, multiple functions
• Center of a ubiquitous media network
• Smart mobile device: the next driver for the semiconductor industry
SoC: Enabler for Digital Convergence

[Chart: from today's SoCs to future SoCs, a >100× increase in performance, low power, complexity, and storage]
Design as optimization

• Design space: the set of "all" possible design choices
• Constraints: solutions that we are not willing to accept
• Cost function: a property we are interested in (execution time, power, reliability…)
MOTIVATION & CONTEXT
Embedded system design

• Embedded MPSoCs
• Design flow
• Exploit application and platform parallelism to achieve real-time performance
[Diagram: platform with processors P1, P2 and RAM; task graph with nodes 1–5; Gantt chart of the tasks scheduled on P1 and P2 over time t]

• Given a platform description
• and an application abstraction
• compute an allocation & schedule
• verify results & perform the changes

Crucial role of the allocation & scheduling algorithm
PROBLEM DESCRIPTION
MPSoC platform

• Identical processing elements (PE)
• Local storage devices
• Remote on-chip memory
• Shared bus

Resources:
• PEs
• Local memory devices
• Shared BUS
• PE frequencies (DVS)

Constraints:
• Local device capacity
• Bus bandwidth (additive resource)
• Architecture-dependent constraints
PROBLEM DESCRIPTION
Application Task Graph

[Diagram: task graph with nodes 1–6; nodes are tasks/processes, arcs are data communication]

Each task:
• Reads data for each ingoing arc
• Performs some computation
• Writes data for each outgoing arc

[Diagram: task phases RD, EXEC, WR on a PE at a given frequency; program data and communication buffers allocated locally or remotely (LOC/REM)]
PROBLEM DESCRIPTION
Application Task Graph

Durations depend on:
• Memory allocation: remote memory is slower than the local ones
• Execution frequency
• Different phases have different bus requirements

[Diagram: task graph with nodes 1–6; task phases RD, EXEC, WR with program data and communication buffers allocated locally or remotely (LOC/REM)]
PROBLEM VARIANTS
Problem variants

We focused on problem variants with different
• Objective function (O.F.)
• Graph features (G.F.)

G.F. \ O.F.                  Bus traffic   Energy consumption   Makespan
Pipelined                    no DVS        DVS                  no DVS
Generic                      no DVS        DVS                  no DVS
Generic with cond. branches  no DVS        DVS                  no DVS
PROBLEM VARIANTS
Objective function

Bus traffic (allocation dependent):
• Tasks produce traffic when they access the bus
• Completely depends on the memory allocation

Makespan (schedule dependent):
• Completely depends on the computed schedule

Energy (DVS):
• The higher the frequency, the higher the energy consumption
• Time & energy cost for frequency switching

[Diagram: Gantt chart of tasks 1–4 over time t with frequency switches]
PROBLEM VARIANTS
Graph Structure

• Pipelined: typical of stream processing applications
• Generic
• Generic with cond. branches: stochastic problem, stochastic O.F. (expected value)

[Diagrams: a pipelined chain 1-2-3; a generic graph with nodes 1–6; a conditional graph with branch conditions a/!a and b/!b and branch probabilities 0.3, 0.5, 0.7]
Application Development Flow

Application Development Support:
• Characterization Phase: a simulator produces the CTG and the application profiles
• Optimization Phase: the optimizer computes the optimal SW application implementation
• Platform Execution
When & Why Offline Optimization?

• Plenty of design-time knowledge
  • Applications pre-characterized at design time
  • Dynamic transitions between different pre-characterized scenarios
• Aggressive exploitation of system resources
  • Reduces overdesign (lowers cost)
  • Strong performance guarantees
• Applicable to many embedded applications
Questions

• Complete or incomplete solver?
• Can I solve the instances with a complete solver?
  • Look at the structure of the problem and the average instance dimension
  • If yes, which technique should be used?
  • If no, what is the quality of the proposed heuristic solution?
Optimization in system design

• Complete solvers find the optimal solution and prove its optimality
  • The System Design community uses Integer Programming techniques for every optimization problem, regardless of the structure of the problem itself
  • Scheduling is poorly handled by IP
• Incomplete solvers: lack of an estimated optimality gap
  • Decomposition of the problem and sequential solution of each subproblem
  • Local search/metaheuristic algorithms
  • Require a lot of tuning
Optimization techniques

• We will consider two techniques:
  • Constraint Programming
  • Integer Programming
• Two aspects of a problem:
  • Feasibility
  • Optimality
• Merging the two, one could obtain better results
Constraint Programming

• Quite recent programming paradigm
• Declarative paradigm
• Inherits from
  • logic programming
  • operations research
  • software engineering
  • AI constraint solving
Constraint Programming

• Problem model
  • variables
  • domains
  • constraints
• Problem solving
  • constraint propagation
  • search
Mathematical Constraints

• Example 1
  • X::[1..10], Y::[5..15], X > Y
  • Arc-consistency: for each value v of X, there should be a value in the domain of Y consistent with v
  • X::[6..10], Y::[5..9] after propagation
• Example 2
  • X::[1..10], Y::[5..15], X = Y
  • X::[5..10], Y::[5..10] after propagation
• Example 3
  • X::[1..10], Y::[5..15], X ≠ Y
  • No propagation
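The bounds reasoning in the examples above can be sketched in a few lines; the function names and interval representation are illustrative, not from the slides:

```python
# Minimal sketch of bounds propagation for the example constraints
# X > Y and X = Y over integer intervals (lo, hi).

def propagate_gt(x, y):
    """X > Y: raise min(X) above min(Y), lower max(Y) below max(X)."""
    xmin, xmax = x
    ymin, ymax = y
    return (max(xmin, ymin + 1), xmax), (ymin, min(ymax, xmax - 1))

def propagate_eq(x, y):
    """X = Y: intersect the two intervals."""
    lo = max(x[0], y[0]); hi = min(x[1], y[1])
    return (lo, hi), (lo, hi)

# Example 1: X::[1..10], Y::[5..15], X > Y
print(propagate_gt((1, 10), (5, 15)))   # ((6, 10), (5, 9))
# Example 2: X = Y
print(propagate_eq((1, 10), (5, 15)))   # ((5, 10), (5, 10))
# Example 3: X != Y prunes nothing here, since every value keeps a support.
```

The results match the slide: X::[6..10], Y::[5..9] for the first constraint and X::[5..10], Y::[5..10] for the second.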
Constraint Interaction

• Every variable is involved in many constraints: each change in the domain of a variable could affect many constraints
• Agent perspective
• Example:
    X::[1..5], Y::[1..5], Z::[1..5]
    X = Y + 1,  Y = Z + 1,  X = Z - 1
Constraint Interaction

X::[1..5], Y::[1..5], Z::[1..5]
X = Y + 1,  Y = Z + 1,  X = Z - 1

• First propagation of X = Y + 1 leads to
    X::[2..5], Y::[1..4], Z::[1..5]
    X = Y + 1 suspended
Constraint Interaction

• Second propagation of Y = Z + 1 leads to
    X::[2..5], Y::[2..4], Z::[1..3]
    Y = Z + 1 suspended
• The domain of Y has changed, so X = Y + 1 is awakened
    X::[3..5], Y::[2..4], Z::[1..3]
    X = Y + 1 suspended
Constraint Interaction

• Third propagation of X = Z - 1 leads to
    X::[], Y::[2..4], Z::[1..3]
    FAIL
• The order in which constraints are considered does not affect the result BUT can influence the efficiency
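The fixpoint behaviour described above (re-run every constraint until nothing changes, fail on an empty domain) can be replayed with a small illustrative propagator; the names and structure are assumptions, not the slides' solver:

```python
# Replays the example: X = Y + 1, Y = Z + 1, X = Z - 1 over [1..5] domains.

def prune(dom, keep):
    return {v for v in dom if keep(v)}

def propagate(domains, constraints):
    """Re-run every constraint until no domain changes; empty domain = FAIL."""
    changed = True
    while changed:
        changed = False
        for c in constraints:
            new = c(domains)
            if new != domains:
                domains, changed = new, True
            if any(len(d) == 0 for d in domains.values()):
                return domains  # FAIL: a domain became empty

    return domains

def eq_plus(a, b, k):          # the constraint a = b + k as an agent
    def run(d):
        d = dict(d)
        d[a] = prune(d[a], lambda v: v - k in d[b])
        d[b] = prune(d[b], lambda v: v + k in d[a])
        return d
    return run

doms = {v: set(range(1, 6)) for v in "XYZ"}
cs = [eq_plus("X", "Y", 1), eq_plus("Y", "Z", 1), eq_plus("X", "Z", -1)]
result = propagate(doms, cs)
print(result)  # a domain is empty: FAIL, the three constraints are inconsistent
```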
Global Constraints

cumulative([S1,..,Sn], [D1,..,Dn], [R1,..,Rn], L)
• S1,...,Sn activity starting times (domain variables)
• D1,...,Dn durations (domain variables)
• R1,...,Rn resource consumptions (domain variables)
• L resource capacity
• Given the interval [min, max] where min = min_i {S_i} and max = max_i {S_i + D_i} - 1, the cumulative constraint holds iff

    max_{i ∈ [min, max]}  Σ_{j | S_j ≤ i < S_j + D_j}  R_j  ≤  L
Global Constraints

• cumulative([1,2,4], [4,2,3], [1,2,2], 3)

[Diagram: resource-usage profile over time 1–7; the cumulated consumption never exceeds the capacity L = 3]
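The example can be checked by brute force: a ground cumulative constraint holds iff the peak usage never exceeds the capacity. A sketch (function name assumed) that walks every time point:

```python
# Check of the example above: cumulative([1,2,4],[4,2,3],[1,2,2],3).

def cumulative_holds(starts, durs, reqs, capacity):
    """Ground cumulative check: peak resource usage over all time points."""
    lo = min(starts)
    hi = max(s + d for s, d in zip(starts, durs)) - 1
    usage = {t: 0 for t in range(lo, hi + 1)}
    for s, d, r in zip(starts, durs, reqs):
        for t in range(s, s + d):   # activity occupies [s, s+d-1]
            usage[t] += r
    return max(usage.values()) <= capacity

print(cumulative_holds([1, 2, 4], [4, 2, 3], [1, 2, 2], 3))  # True
```

The peak usage is exactly 3 (at times 2–4), so the constraint holds with capacity L = 3.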
Propagation

• One of the propagation algorithms used in resource constraints is based on obligatory parts

[Diagram: an activity placed at its earliest start (Smin) and at its latest start (Smax); the overlap of the two placements is the obligatory part]
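The obligatory part is the interval an activity occupies in every schedule: from its latest start to its earliest end, when that interval is non-empty. A minimal sketch with hypothetical names:

```python
# Obligatory part of an activity with start window [s_min, s_max].

def obligatory_part(s_min, s_max, duration):
    """Return the (start, end) interval covered whether the activity starts
    as early (s_min) or as late (s_max) as possible; None if empty."""
    start = s_max                # latest possible start
    end = s_min + duration       # earliest possible end
    return (start, end) if start < end else None

print(obligatory_part(2, 5, 6))   # (5, 8): time points 5..7 are always busy
print(obligatory_part(0, 10, 4))  # None: too much slack, no obligatory part
```

Resource propagators aggregate these fixed intervals into a lower-bound usage profile and prune start times that would push the profile over the capacity.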
Propagation

• Another propagation algorithm is based on edge finding [Baptiste, Le Pape, Nuijten, IJCAI95]
• Consider a unary resource R and three activities that should be scheduled on R

[Diagram: S1 with duration 6 and window [0, 17]; S2 with duration 4 and window [1, 11]; S3 with duration 3 and window [1, 12]]
Propagation

[Diagram: after edge finding, the window of S1 (duration 6) starts at 8; S2 (duration 4, window [1, 11]) and S3 (duration 3, window [1, 12]) run first]

• We can deduce that the earliest start time of S1 is 8.
• In fact, S1 must be executed after both S2 and S3.
• Global reasoning: if S1 were executed before the two, there would be no room left for executing both S2 and S3.
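The deduction can be reproduced with the necessary condition behind edge finding: if running S1 first would leave no room for {S2, S3} before their deadlines, S1 must follow them. A sketch under the (duration, release, deadline) encoding assumed here:

```python
# Edge-finding deduction on the example: S1 must follow {S2, S3}, so its
# earliest start is lifted to release({S2,S3}) + total duration({S2,S3}).

def must_follow(act, others):
    """True if 'act' cannot precede the set 'others' on a unary resource."""
    dur, rel, _dl = act
    group_dur = sum(d for d, _, _ in others)
    group_rel = min(r for _, r, _ in others)
    group_dl = max(dd for _, _, dd in others)
    # If act ran first, the group could only start after act finishes.
    return max(rel + dur, group_rel) + group_dur > group_dl

S1, S2, S3 = (6, 0, 17), (4, 1, 11), (3, 1, 12)
if must_follow(S1, [S2, S3]):
    new_est = min(r for _, r, _ in [S2, S3]) + sum(d for d, _, _ in [S2, S3])
    print("earliest start of S1 =", new_est)  # 8
```

With the slide's data: 0 + 6 + 4 + 3 = 13 > 12, so S1 cannot go first, and est(S1) = 1 + 7 = 8.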
CP solving process

• The solution process interleaves propagation and search
  • Constraints propagate as much as possible
  • When no more propagation can be performed, either the process fails because one domain is empty, or search is performed
• The search heuristic chooses
  • which variable to select
  • which value to assign to it
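The interleaving can be sketched as a toy solver: propagate to a fixpoint, then branch on a chosen variable/value pair and backtrack on failure. The CSP (X = Y + 1 and Y = Z + 1 over [1..5]) and all names are illustrative:

```python
# Toy CP solving loop: propagation to fixpoint + backtracking search.

def propagate(doms):
    """Keep only values with support in X = Y + 1 and Y = Z + 1."""
    changed = True
    while changed:
        old = {v: set(d) for v, d in doms.items()}
        doms["X"] &= {y + 1 for y in doms["Y"]}
        doms["Y"] &= {x - 1 for x in doms["X"]} & {z + 1 for z in doms["Z"]}
        doms["Z"] &= {y - 1 for y in doms["Y"]}
        changed = doms != old
    return all(doms.values())          # False iff some domain emptied

def search(doms):
    if not propagate(doms):
        return None                    # FAIL -> backtrack
    if all(len(d) == 1 for d in doms.values()):
        return {v: min(d) for v, d in doms.items()}
    # Variable heuristic: smallest remaining domain; value: smallest first.
    var = min((v for v in doms if len(doms[v]) > 1), key=lambda v: len(doms[v]))
    for val in sorted(doms[var]):
        child = {v: set(d) for v, d in doms.items()}
        child[var] = {val}
        sol = search(child)
        if sol:
            return sol
    return None

print(search({v: set(range(1, 6)) for v in "XYZ"}))  # {'X': 3, 'Y': 2, 'Z': 1}
```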
Optimization problems

• Solved via a sequence of feasibility problems with constraints on the objective-function variable
Pros and Cons

+ Declarative programming: the user states the constraints; the solver takes care of propagation and search
+ Strong on the feasibility side
+ Constraints are symbolic and mathematical: expressivity
+ Adding a constraint helps the solution process: flexibility
- Weak optimality pruning if the link between the problem decision variables and the objective function is loose
- No use of relaxations
Integer Programming

• Standard form of a Combinatorial Optimization Problem (IP):

    min z = Σ_{j=1..n} c_j x_j
    subject to  Σ_{j=1..n} a_ij x_j = b_i,   i = 1..m
                x_j ≥ 0,                     j = 1..n
                x_j integer   ← may make the problem NP-complete

• An inequality y ≥ 0 is recast as y - s = 0
• Maximization is expressed by negating the objective function
0-1 Integer Programming

• Many Combinatorial Optimization Problems can be expressed in terms of 0-1 variables (IP):

    min z = Σ_{j=1..n} c_j x_j
    subject to  Σ_{j=1..n} a_ij x_j = b_i,   i = 1..m
                x_j ∈ {0, 1}   ← may make the problem NP-complete
Linear Relaxation

    min z = Σ_{j=1..n} c_j x_j
    subject to  Σ_{j=1..n} a_ij x_j = b_i,   i = 1..m
                x_j ≥ 0,                     j = 1..n
                x_j integer   ← removed
Linear Relaxation

• The linear relaxation is solvable in POLYNOMIAL TIME
• The SIMPLEX ALGORITHM is the technique of choice, even if it is exponential in the worst case
Geometric Properties

• The set of constraints defines a polytope
• The optimal solution is located on one of its vertices

    min z = Σ_{j=1..n} c_j x_j
    subject to  Σ_{j=1..n} a_ij x_j = b_i,   i = 1..m
                x_j ≥ 0,                     j = 1..n

• The simplex algorithm starts from one vertex and moves to an adjacent one with a better value of the objective function

[Diagram: polytope with the objective-function direction and the optimal solution on a vertex]
IP solving process

• The optimal LP solution is in general fractional: it violates the integrality constraints but provides a bound on the solution of the overall problem
• The solution process interleaves branch and bound:
  • relaxation
  • search
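A minimal branch-and-bound sketch in this spirit, on an illustrative 0-1 knapsack instance (whose LP relaxation has a closed-form greedy solution), pruning with the relaxation bound:

```python
# Branch and bound: at each node solve the LP relaxation (for a knapsack,
# Dantzig's greedy fractional solution), prune when the bound cannot beat
# the incumbent, and branch on an unfixed variable. Data is illustrative
# (maximize value under a capacity of 10).

values, weights, capacity = [10, 13, 7, 8], [4, 6, 3, 5], 10

def lp_bound(fixed):
    """Fractional upper bound with some variables fixed to 0/1."""
    cap = capacity - sum(weights[i] for i, v in fixed.items() if v == 1)
    if cap < 0:
        return -1  # infeasible node
    val = sum(values[i] for i, v in fixed.items() if v == 1)
    free = [i for i in range(len(values)) if i not in fixed]
    for i in sorted(free, key=lambda i: values[i] / weights[i], reverse=True):
        take = min(1.0, cap / weights[i])
        val += take * values[i]
        cap -= take * weights[i]
    return val

best = [0]
def branch(fixed):
    bound = lp_bound(fixed)
    if bound <= best[0]:
        return                         # pruned by the LP bound
    if len(fixed) == len(values):
        best[0] = max(best[0], bound)  # all-integer leaf
        return
    i = next(i for i in range(len(values)) if i not in fixed)
    branch({**fixed, i: 1})
    branch({**fixed, i: 0})

branch({})
print(best[0])  # 23: optimal value of the 0-1 problem (items 0 and 1)
```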
Pros and Cons

+ Declarative programming: the user states the constraints; the solver takes care of relaxation and search
+ Strong on the optimality side
+ The structure of many real problems has been deeply studied
- Only linear constraints can be used
- If sophisticated techniques are used, we lose flexibility
- No pruning on feasibility (only some preprocessing)
Resource-Efficient Application Mapping for MPSoCs

MULTIMEDIA APPLICATIONS. Given a platform:
1. Achieve a specified throughput
2. Minimize the usage of shared resources
Allocation and scheduling

• Given:
  • A hardware platform with processors, (local and remote) storage devices, and a communication channel
  • A pre-characterized task graph representing a functional abstraction of the application we should run
• Find:
  • An allocation and a scheduling of tasks to resources respecting
    • Real-time constraints
    • Task deadlines
    • Precedences among tasks
    • Capacity of all resources
  • Such that the communication cost is minimized
Allocation and scheduling

• The platform is a multi-processor system with N nodes
  • Each node includes a processor and a scratchpad memory
  • The bus is a shared communication channel
  • In addition we have a remote memory of unlimited capacity (a realistic assumption for our application, but easily generalizable)
• The task graph has a pipeline workload
  • Real-time video graphics, processing the pixels of a digital image
  • Task dependencies, i.e., arcs between tasks
  • Computation, communication, and storage requirements on the graph
MOTIVATION & CONTEXT
Embedded system design: allocation and scheduling

[Diagram: design-flow recap, with the platform (P1, P2, RAM), the task graph 1–5, and the resulting schedule on P1 and P2 over time t]

• Given a platform description
• and an application abstraction
• compute an allocation & schedule
PROBLEM DESCRIPTION
MPSoC platform

• Identical processing elements (PE)
• Local storage devices
• Remote on-chip memory
• Shared bus

Resources:
• PEs: unary resources
• Local memory devices: limited capacity
• Shared BUS: limited bandwidth
• Remote memory: assumed to be infinite

Constraints:
• Local device capacity
• Bus bandwidth (additive resource)
• Architecture-dependent constraints
Problem structure

• As a whole, it is a scheduling problem with alternative resources: a very tough problem
• It smoothly decomposes into allocation and scheduling
• Allocation is better handled with IP techniques
  • Not with CP, due to the complex objective function
• Scheduling is better handled with CP techniques
  • Not with IP, since we would have to model each possible starting time of each task with a 0/1 variable
• INTERACTION REGULATED VIA CUTTING PLANES
Logic Based Benders Decomposition

Allocation: INTEGER PROGRAMMING
• Obj. function: communication cost
• Memory constraints
    ↓ valid allocation    ↑ no good: linear constraint
Scheduling: CONSTRAINT PROGRAMMING
• Timing constraints

Decomposes the problem into 2 sub-problems:
• Allocation → IP
  • Objective function: communication cost of the tasks
• Scheduling → CP
  • Secondary objective function: makespan
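The decomposition loop can be sketched abstractly; solve_allocation and solve_scheduling are hypothetical stand-ins for the IP master and the CP subproblem:

```python
# Logic-based Benders loop: the IP master proposes an allocation, the CP
# subproblem checks schedulability, and each failure is returned to the
# master as a no-good cut forbidding that allocation.

def benders(solve_allocation, solve_scheduling):
    cuts = []                                  # no-goods from failed schedules
    while True:
        alloc = solve_allocation(cuts)         # IP master: min communication
        if alloc is None:
            return None                        # master infeasible: no solution
        schedule = solve_scheduling(alloc)     # CP subproblem: timing feasible?
        if schedule is not None:
            return alloc, schedule             # first feasible pair is optimal
        cuts.append(alloc)                     # forbid this allocation

# Toy instantiation: allocations are tuples, only (1, 0) is schedulable.
candidates = [(0, 0), (0, 1), (1, 0)]          # in increasing communication cost
master = lambda cuts: next((a for a in candidates if a not in cuts), None)
sub = lambda a: "ok" if a == (1, 0) else None
print(benders(master, sub))  # ((1, 0), 'ok')
```

Because the master enumerates allocations in order of increasing cost, the first allocation that the subproblem accepts is optimal for the combined problem.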
Master Problem model

• Assignment of tasks and memory slots (master problem):
  • T_ij = 1 if task i executes on processor j, 0 otherwise
  • Y_ij = 1 if task i allocates its program data on processor j's memory, 0 otherwise
  • Z_ij = 1 if task i allocates its internal state on processor j's memory, 0 otherwise
  • X_ij = 1 if task i executes on processor j and task i+1 does not, 0 otherwise
• Each process should be allocated to exactly one processor:
    Σ_j T_ij = 1,   for all i
• Link between variables X and T: X_ij = |T_ij - T_{i+1,j}| for all i and j (can be linearized)
• If a task is NOT allocated to a processor, neither are its required memories:
    T_ij = 0 ⇒ Y_ij = 0 and Z_ij = 0
Objective function

i
j
memi (Tij - Yij) + statei (Tij - Yij) + datai Xij /2
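The linearization mentioned above for X_ij = |T_ij - T_{i+1,j}| is the standard one for 0/1 variables; a brute-force check over all cases (the helper name is assumed):

```python
# |T_a - T_b| linearizes into four inequalities for binary T_a, T_b.

def abs_linearization_ok(t_a, t_b):
    """X >= T_a - T_b, X >= T_b - T_a, X <= T_a + T_b, X <= 2 - T_a - T_b
    force X = |T_a - T_b| when all variables are binary."""
    feasible_x = [x for x in (0, 1)
                  if x >= t_a - t_b and x >= t_b - t_a
                  and x <= t_a + t_b and x <= 2 - t_a - t_b]
    return feasible_x == [abs(t_a - t_b)]

# All four 0/1 combinations leave exactly one feasible X, the absolute value.
print(all(abs_linearization_ok(a, b) for a in (0, 1) for b in (0, 1)))  # True
```

The first two inequalities give X ≥ |T_a - T_b|; the last two cap X from above so the solver cannot inflate it.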
Improvement of the model

• With the proposed model, the allocation problem solver tends to pack all tasks on a single processor and all required memory on the local memory, so as to have ZERO communication cost: a TRIVIAL SOLUTION
• To improve the model, we add a relaxation of the subproblem to the master problem: for each set S of consecutive tasks whose sum of durations exceeds the real-time requirement, we impose that they cannot all run on the same processor:

    Σ_{i ∈ S} WCET_i > RT  ⇒  Σ_{i ∈ S} T_ij ≤ |S| - 1,   for all j
Sub-Problem model

Task scheduling with static resource assignment (subproblem):
• We have to schedule tasks, so we have to decide when they start
• Activity starting times: Start_i :: [0..Deadline_i]
• Precedence constraints: Start_i + Dur_i ≤ Start_j
• Real-time constraint: for all activities running on the same processor, Σ_i (Start_i + Dur_i) ≤ RT
• Cumulative constraints on resources:
  • processors are unary resources: cumulative([Start], [Dur], [1], 1)
  • memories are additive resources: cumulative([Start], [Dur], [MR], C)
• What about the bus?
Bus model

Additive bus model: bandwidth (bit/sec) over time, bounded by the maximum bus bandwidth.

[Diagram: bus bandwidth over time for task0 and task1 — task0 reads its state, accesses input data with BW = MaxBW/NoProc, spreads its program-data traffic over TaskExecTime, then writes its state]

• The model does not hold under heavy bus congestion
• Bus traffic has to be minimized
Results

Algorithm search time: the combined approach dominates, and its higher complexity shows up only for simple system configurations.
Energy-Efficient Application Mapping for MPSoCs

MULTIMEDIA APPLICATIONS. Given a platform:
1. Achieve a specified throughput
2. Minimize power consumption
Logic Based Benders Decomposition

Allocation & Freq. Assignment: INTEGER PROGRAMMING
• Obj. function: communication cost & energy consumption
• Memory constraints
    ↓ valid allocation    ↑ no goods and cutting planes
Scheduling: CONSTRAINT PROGRAMMING
• Timing constraints

Decomposes the problem into 2 sub-problems:
• Allocation & assignment (& frequency setting) → IP
  • Objective function: minimizing energy consumption during execution
• Scheduling → CP
  • Objective function: e.g., minimizing energy consumption during frequency switching
Allocation problem model

The objective function: minimize the energy consumption associated with task execution and communication:

    OF = EnComp + EnRead + EnWrite

Variables:
• X_tfp = 1 if task t executes on processor p at frequency f
• W_ijfp = 1 if tasks i and j run on different cores and task i, on core p, writes data to j at frequency f
• R_ijfp = 1 if tasks i and j run on different cores and task j, on core p, reads data from i at frequency f

Constraints:
• Each task executes on exactly one processor at one frequency:
    Σ_{p=1..P} Σ_{f=1..M} X_tfp = 1,   for all t
• Communication between tasks can execute only once per execution:
    Σ_{p=1..P} Σ_{f=1..M} W_ijfp ≤ 1,   for all i, j ∈ T
    Σ_{p=1..P} Σ_{f=1..M} R_ijfp ≤ 1,   for all i, j ∈ T
• One write corresponds to one read:
    Σ_{p=1..P} Σ_{f=1..M} (W_ijfp - R_ijfp) = 0,   for all i, j ∈ T
Scheduling problem model

The duration of task i is now fixed, since its mode is fixed (INPUT, EXEC, OUTPUT phases).

• Tasks running on the same processor at the same frequency:
    Start_i + duration_i ≤ Start_j
• Tasks running on the same processor at different frequencies (T_i accounts for the frequency switch):
    Start_i + duration_i + T_i ≤ Start_j
• Tasks running on different processors:
    Start_i + duration_i + dWrite_ij + dRead_ij ≤ Start_j

• Processors are modelled as unary resources
• The bus is modelled as an additive resource
• The objective function: minimize the energy consumption associated with frequency switching
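The kind of frequency assignment the IP computes can be illustrated on a toy instance; all numbers and the f² energy model are assumptions for the sketch, not the slides' data. One frequency is picked per task so that total energy, including a switching cost, is minimal while the pipeline meets its deadline:

```python
# Exhaustive search over per-task frequency choices: minimize energy
# (dynamic energy grows with f^2 per cycle, plus a cost per frequency
# switch between consecutive tasks) subject to a pipeline deadline.

from itertools import product

cycles = [8, 6, 4]                  # work per task, arbitrary units
freqs = [1, 2, 4]                   # available frequency settings
switch_energy, deadline = 1.0, 12.0

def energy(freq_choice):
    e = sum(c * f ** 2 for c, f in zip(cycles, freq_choice))
    e += switch_energy * sum(f1 != f2
                             for f1, f2 in zip(freq_choice, freq_choice[1:]))
    return e

feasible = [fc for fc in product(freqs, repeat=len(cycles))
            if sum(c / f for c, f in zip(cycles, fc)) <= deadline]
best = min(feasible, key=energy)
print(best, energy(best))  # (2, 1, 2) 56.0
```

Running every task at the lowest frequency misses the deadline, while running everything fast wastes energy; the optimum slows down the middle task and pays two switch costs.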
Computational efficiency

Search time for an increasing number of tasks (P4 2 GHz, 512 MB RAM; professional solving tools, ILOG)

• A similar plot holds for an increasing number of processors
• Standalone CP and IP proved not comparable even on a simpler problem
• We varied the RT constraint:
  • tight deadline: few feasible solutions
  • very loose deadline: trivial solution
  • search time stays within 1 order of magnitude
Conditional Task Graphs

• Conditional task graphs: the problem becomes stochastic, since the outcomes of the conditions labeling the arcs are known only at execution time
  • We only know the probability distribution
• Minimize the expected value of the objective function
  • Min communication cost: easier
  • Min makespan: much more complicated
• Promising results
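The expected-value objective can be illustrated by enumerating branch scenarios; the probabilities and costs below are invented for the sketch (loosely echoing the a/!a, b/!b branches shown earlier):

```python
# Expected communication cost of a conditional task graph: enumerate the
# branch outcomes with their probabilities and weight each scenario's cost.

from itertools import product

branch_probs = {"a": 0.3, "b": 0.7}           # P(condition is true)

def scenario_cost(outcome):
    """Hypothetical communication cost of the tasks active in a scenario."""
    cost = 10                                  # tasks always executed
    cost += 4 if outcome["a"] else 6           # branch a vs !a
    cost += 5 if outcome["b"] else 2           # branch b vs !b
    return cost

expected = 0.0
for bits in product([True, False], repeat=len(branch_probs)):
    outcome = dict(zip(branch_probs, bits))
    p = 1.0
    for cond, true_p in branch_probs.items():
        p *= true_p if outcome[cond] else 1 - true_p
    expected += p * scenario_cost(outcome)
print(round(expected, 2))  # 19.5
```

For the communication-cost objective the expectation decomposes branch by branch (10 + 0.3·4 + 0.7·6 + 0.7·5 + 0.3·2 = 19.5), which is why that variant is easier than the expected makespan, where scenarios interact through the schedule.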
Other problems

• What if the task durations are not known and only the worst and best cases are available?
  • Scheduling only the worst case is not enough
  • Scheduling anomalies
• Change platform: we are targeting the CELL BE processor architecture
  • Model synchronous data flow applications
  • Model communication-intensive applications
Challenge: the Abstraction Gap

[Diagram: the optimization flow (platform modelling, optimization analysis, optimal solution) and the development flow (starting implementation, final implementation, platform execution), separated by the abstraction gap]

• The abstraction gap between high-level optimization tools and standard application programming models can introduce unpredictable and undesired behaviours.
• Programmers must be aware of the simplifying assumptions made by the optimization tools.
Validation of optimizer solutions: Throughput

The optimal allocation & schedule computed by the optimizer is validated on a virtual platform (250 instances).

[Histogram: probability distribution of the throughput difference (%) between the optimizer's prediction and the virtual-platform execution, ranging from -5% to +11%]

• MAX error lower than 10%
• AVG error equal to 4.51%, with a standard deviation of 1.94
• All deadlines are met
Validation of optimizer solutions: Power

The optimal allocation & schedule computed by the optimizer is validated on a virtual platform (250 instances).

[Histogram: probability distribution of the energy consumption difference (%) between the optimizer's prediction and the virtual-platform execution, ranging from -5% to +11%]

• MAX error lower than 10%
• AVG error equal to 4.80%, with a standard deviation of 1.71
GSM Encoder

Task graph:
• 10 computational tasks
• 15 communication tasks
• Throughput required: 1 frame / 10 ms

With 2 processors and 4 possible frequency & voltage settings:
• Without optimizations: 50.9 μJ
• With optimizations: 17.1 μJ (-66.4%)
Challenge: programming environment

• A software development toolkit to help programmers in software implementation:
  • a generic customizable application template → OFFLINE SUPPORT
  • a set of high-level APIs → ONLINE SUPPORT in an RT-OS (RTEMS)
• The main goals are:
  • predictable application execution after the optimization step
  • guarantees on high performance and constraint satisfaction
• Starting from a high-level task and data-flow graph, software developers can easily and quickly build their application infrastructure
• Programmers can intuitively translate the high-level representation into C code using our facilities and libraries