Optimization techniques for the development
of embedded applications on single-chip
multiprocessor platforms
Michela Milano
[email protected]
DEIS Università di Bologna
Digital Convergence – Mobile Example
Communication
Computing
Imaging
Entertainment
Broadcasting
Telematics
One device, multiple functions
Center of ubiquitous media network
Smart mobile devices: the next driver for the semiconductor industry
SoC: Enabler for Digital Convergence
[Figure: from today's SoC to future SoCs, >100X growth in
performance, low power, complexity, and storage]
Design as optimization
Design space
The set of “all” possible design choices
Constraints
Solutions that we are not willing to
accept
Cost function
A property we are interested in
(execution time, power, reliability…)
MOTIVATION & CONTEXT
Embedded MPSoCs: system design
Design flow
  Exploit application and platform parallelism to achieve
  real-time performance
  Given a platform description and an application abstraction,
  compute an allocation & schedule
  Verify results & perform the changes
Crucial role of the allocation & scheduling algorithm
[Figure: platform with processors P1, P2 and a shared RAM; a task
graph with tasks 1-5; a Gantt chart of the tasks scheduled on P1
and P2 over time t]
PROBLEM DESCRIPTION
MPSoC platform
Resources:
  Identical processing elements (PEs)
  Local storage devices
  Remote on-chip memory
  Shared bus
Constraints:
  PE frequencies (DVS)
  Local device capacity
  Bus bandwidth (additive resource)
  Architecture-dependent constraints
PROBLEM DESCRIPTION
Application
Application Task Graph
  Nodes are tasks/processes
  Arcs are data communications
Each task:
  Reads data for each ingoing arc
  Performs some computation
  Writes data for each outgoing arc
[Figure: task graph with tasks 1-6; each task runs on a PE at
frequency FREQ through the phases RD (program data, local/remote)
-> EXEC -> WR (communication buffer, local/remote)]
PROBLEM DESCRIPTION
Application
Application Task Graph
Durations depend on:
  Memory allocation: remote memory is slower than the local ones
  Execution frequency
  Different phases have different bus requirements
[Figure: same task graph and task phases (RD, EXEC, WR) as above]
PROBLEM VARIANTS
Problem variants
We focused on problem variants with different objective
functions (O.F.) and graph features (G.F.):

  G.F. \ O.F.                  Bus traffic   Energy consumption   Makespan
  Pipelined                    no DVS        DVS                  no DVS
  Generic                      no DVS        DVS                  no DVS
  Generic with cond. branches  no DVS        DVS                  no DVS
PROBLEM VARIANTS
Objective function
Objective function
Bus traffic
  Tasks produce traffic when they access the bus
  Completely depends on memory allocation (allocation dependent)
Makespan
  Completely depends on the computed schedule (schedule dependent)
Energy (DVS)
  The higher the frequency, the higher the energy consumption
  Cost for frequency switching (time & energy cost)
[Figure: tasks 1-4 on a timeline t, with frequency switches
between them]
PROBLEM VARIANTS
Graph structure
Graph Structure
Pipelined: typical of stream processing applications
Generic
Generic with cond. branches: arcs leaving a branch are labelled
with the probability of each outcome (e.g. a: 0.3, !a: 0.7,
b: 0.5, !b: 0.5)
  Stochastic problem
  Stochastic O.F. (expected value)
[Figure: a pipelined graph (1 -> 2 -> 3); a generic graph with
tasks 1-6; a conditional graph whose branches a/!a and b/!b carry
the probabilities above]
Application Development Flow
Application development support:
  Characterization Phase: a simulator produces the CTG and the
  application profiles
  Optimization Phase: the optimizer computes the optimal SW
  application implementation
  Platform Execution
When & Why Offline Optimization?
Plenty of design-time knowledge
Applications pre-characterized at design time
Dynamic transitions between different pre-characterized scenarios
Aggressive exploitation of system resources
Reduces overdesign (lowers cost)
Strong performance guarantees
Applicable for many embedded applications
Question?
Complete or incomplete solver?
Can I solve the instances with a complete solver?
  Look at the structure of the problem and the average instance size
If yes, which technique should be used?
If no, what is the quality of the proposed heuristic solution?
Optimization in system design
Complete solvers find the optimal solution and prove its optimality
  The System Design community uses Integer Programming techniques
  for every optimization problem, regardless of the structure of
  the problem itself
  Scheduling is poorly handled by IP
Incomplete solvers: lack of an estimated optimality gap
  Decomposition of the problem and sequential solution of each
  subproblem
  Local search/metaheuristic algorithms
  Require a lot of tuning
Optimization techniques
We will consider two techniques:
  Constraint Programming
  Integer Programming
Two aspects of a problem:
  Feasibility
  Optimality
Merging the two, one could obtain better results
Constraint Programming
A relatively recent, declarative programming paradigm
Inherits from
logic programming
operations research
software engineering
AI constraint solving
Constraint Programming
Problem model
variable
domains
constraints
Problem solving
constraint propagation
search
Mathematical Constraints
Example 1
X::[1..10], Y::[5..15], X>Y
Arc-consistency: for each value v of X, there should be a value in
the domain of Y consistent with v
X::[6..10], Y::[5..9] after propagation
Example 2
X::[1..10], Y::[5..15], X=Y
X::[5..10], Y::[5..10] after propagation
Example 3
X::[1..10], Y::[5..15], X≠Y
No propagation (every value still has a support in the other domain)
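The bounds propagation in Examples 1 and 2 can be sketched in a few lines of Python (a toy illustration with domains kept as [min, max] intervals, not a real arc-consistency engine):

```python
def prop_gt(x, y):
    # X > Y: X must exceed min(Y), Y must stay below max(X)
    xl, xu = x
    yl, yu = y
    return (max(xl, yl + 1), xu), (yl, min(yu, xu - 1))

def prop_eq(x, y):
    # X = Y: both domains shrink to their intersection
    xl, xu = x
    yl, yu = y
    lo, hi = max(xl, yl), min(xu, yu)
    return (lo, hi), (lo, hi)

print(prop_gt((1, 10), (5, 15)))  # ((6, 10), (5, 9)), as in Example 1
print(prop_eq((1, 10), (5, 15)))  # ((5, 10), (5, 10)), as in Example 2
```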
Constraint Interaction
Every variable is involved in many constraints: each change in
the domain of a variable can affect many constraints
Agent perspective, example:
  Y = Z + 1
  X = Y + 1
  X = Z - 1
  X::[1..5], Y::[1..5], Z::[1..5]
Constraint Interaction
Y = Z + 1, X = Y + 1, X = Z - 1
X::[1..5], Y::[1..5], Z::[1..5]

First propagation of X = Y + 1 leads to
  X::[2..5], Y::[1..4], Z::[1..5]
X = Y + 1 suspended
Constraint Interaction
Second propagation of Y = Z + 1 leads to
  X::[2..5], Y::[2..4], Z::[1..3]
Y = Z + 1 suspended

The domain of Y has changed and X = Y + 1 is awakened
  X::[3..5], Y::[2..4], Z::[1..3]
X = Y + 1 suspended
Constraint Interaction
Third propagation of X = Z - 1 leads to
  X::[], Y::[2..4], Z::[1..3]
FAIL
The order in which constraints are considered does not
affect the result BUT can influence the efficiency
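This fixpoint computation can be sketched as a simple agenda loop (a toy interval propagator, not a real CP engine; a real solver would only re-run the constraints whose variables changed):

```python
def c_eq_plus(a, b, k):
    """Constraint A = B + k: shrink both interval domains."""
    def f(d):
        (al, au), (bl, bu) = d[a], d[b]
        return {a: (max(al, bl + k), min(au, bu + k)),
                b: (max(bl, al - k), min(bu, au - k))}
    return f

def propagate(domains, constraints):
    """Re-run every constraint until a fixpoint or an empty domain."""
    changed = True
    while changed:
        changed = False
        for f in constraints:
            for v, dom in f(domains).items():
                if dom[0] > dom[1]:
                    return None  # FAIL: a domain became empty
                if dom != domains[v]:
                    domains[v] = dom
                    changed = True
    return domains

# X = Y + 1, Y = Z + 1, X = Z - 1 with all domains [1..5]
cons = [c_eq_plus('X', 'Y', 1), c_eq_plus('Y', 'Z', 1),
        c_eq_plus('X', 'Z', -1)]
print(propagate({'X': (1, 5), 'Y': (1, 5), 'Z': (1, 5)}, cons))  # None: FAIL
```

Dropping the third constraint yields exactly the suspended state of the slides: X::[3..5], Y::[2..4], Z::[1..3].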
Global Constraints
cumulative([S1,..,Sn], [D1,..,Dn], [R1,..,Rn], L)
  S1,...,Sn activity starting times (domain variables)
  D1,...,Dn durations (domain variables)
  R1,...,Rn resource consumptions (domain variables)
  L resource capacity
Given the interval [min, max], where min = min_i {S_i} and
max = max_i {S_i + D_i} - 1, the cumulative constraint holds iff

  max_{i in [min, max]}  sum_{j : S_j <= i < S_j + D_j}  R_j  <=  L
Global Constraints
cumulative([1,2,4],[4,2,3],[1,2,2],3)
[Figure: resource usage profile over time 1-7; the profile never
exceeds the capacity L = 3]
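The feasibility of this ground instance can be checked by building the resource usage profile (a sketch only: time is discrete here, and an activity starting at S with duration D occupies instants S..S+D-1):

```python
def cumulative_ok(starts, durs, reqs, cap):
    """Check a ground cumulative constraint: peak usage <= capacity."""
    horizon = max(s + d for s, d in zip(starts, durs))
    usage = [0] * horizon
    for s, d, r in zip(starts, durs, reqs):
        for t in range(s, s + d):  # instants occupied by this activity
            usage[t] += r
    return max(usage) <= cap

# The example from the slide: peak usage is exactly 3, so it holds
print(cumulative_ok([1, 2, 4], [4, 2, 3], [1, 2, 2], 3))  # True
```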
Propagation
One of the propagation algorithms used in resource
constraints is the one based on obligatory parts
[Figure: an activity with earliest start Smin and latest start
Smax; the interval covered by every possible placement is its
obligatory part]
Propagation
Another propagation technique is based on edge finding
[Baptiste, Le Pape, Nuijten, IJCAI95]
Consider a unary resource R and three activities that should be
scheduled on R:
  S1: duration 6, window [0, 17]
  S2: duration 4, window [1, 11]
  S3: duration 3, window [1, 12]
Propagation
We can deduce that the earliest start time of S1 is 8.
In fact, S1 should be executed after both S2 and S3.
Global reasoning: if S1 were executed before the two, there would
be no space left for executing both S2 and S3 (S1 would end at
time 6 at the earliest, and 6 + 4 + 3 = 13 exceeds the latest
deadline 12).
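This deduction can be reproduced numerically (a sketch of the single edge-finding step above, not a full edge-finding algorithm; the windows are those read off the figure, with est the earliest start and lct the latest completion):

```python
def must_follow(a, others):
    """Edge finding on a unary resource: if activity a cannot fit
    before the set `others`, push its earliest start after them."""
    est_a, lct_a, dur_a = a
    est_o = min(est for est, _, _ in others)
    lct_o = max(lct for _, lct, _ in others)
    dur_o = sum(dur for _, _, dur in others)
    # if a ran first, a plus the whole set would overrun the set's deadline
    if est_a + dur_a + dur_o > lct_o:
        return est_o + dur_o  # new earliest start time of a
    return est_a

S1, S2, S3 = (0, 17, 6), (1, 11, 4), (1, 12, 3)
print(must_follow(S1, [S2, S3]))  # 8, as deduced on the slide
```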
CP solving process
The solution process interleaves propagation and search
  Constraints propagate as much as possible
  When no more propagation can be performed, either the process
  fails since one domain is empty, or search is performed
The search heuristic chooses
  Which variable to select
  Which value to assign to it
Optimization problems
  Solved via a sequence of feasibility problems with constraints
  on the objective function variable
Pros and Cons
Declarative programming: the user states the constraints,
the solver takes care of propagation and search
Strong on the feasibility side
Constraints are symbolic and mathematical: expressivity
Adding a constraint helps the solution process: flexibility
Weak optimality pruning if the link between the problem
decision variables and the objective function is loose
No use of relaxations
Integer Programming
Standard form of a Combinatorial Optimization Problem (IP):

  min z = sum_{j=1..n} c_j x_j
  subject to
    sum_{j=1..n} a_ij x_j = b_i    i = 1..m
    x_j >= 0                       j = 1..n
    x_j integer

The integrality requirement may make the problem NP-complete
An inequality y >= 0 is recast as y - s = 0, with s >= 0 a slack
variable
Maximization is expressed by negating the objective function
0-1 Integer Programming
Many Combinatorial Optimization Problems can be expressed in
terms of 0-1 variables (IP):

  min z = sum_{j=1..n} c_j x_j
  subject to
    sum_{j=1..n} a_ij x_j = b_i    i = 1..m
    x_j :: [0,1]

The 0-1 restriction may make the problem NP-complete
Linear Relaxation

  min z = sum_{j=1..n} c_j x_j
  subject to
    sum_{j=1..n} a_ij x_j = b_i    i = 1..m
    x_j >= 0                       j = 1..n
    x_j integer   <- removed
Linear Relaxation
The linear relaxation is solvable in POLYNOMIAL TIME
The SIMPLEX ALGORITHM is the technique of choice
even if it is exponential in the worst case
Geometric Properties
The set of constraints defines a polytope
The optimal solution is located on one of its vertices

  min z = sum_{j=1..n} c_j x_j
  subject to
    sum_{j=1..n} a_ij x_j = b_i    i = 1..m
    x_j >= 0                       j = 1..n

The simplex algorithm starts from one vertex and moves to an
adjacent one with a better value of the objective function
[Figure: polytope with the objective function direction and the
optimal solution on a vertex]
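The vertex property can be illustrated on a tiny 2-variable LP by brute-force vertex enumeration (a didactic sketch, not how the simplex algorithm works internally; the instance, max 3x + 2y subject to x + y <= 4 and x <= 3, is invented):

```python
from itertools import combinations

# constraints a·(x, y) <= b; non-negativity written as -x <= 0, -y <= 0
A = [(1, 1), (1, 0), (-1, 0), (0, -1)]
b = [4, 3, 0, 0]

def vertices():
    # a vertex of a 2-D polytope is a feasible intersection of two
    # constraint lines (Cramer's rule solves the 2x2 system)
    for (a1, b1), (a2, b2) in combinations(zip(A, b), 2):
        det = a1[0] * a2[1] - a1[1] * a2[0]
        if det == 0:
            continue  # parallel lines, no intersection point
        x = (b1 * a2[1] - a1[1] * b2) / det
        y = (a1[0] * b2 - b1 * a2[0]) / det
        if all(ax * x + ay * y <= bb + 1e-9 for (ax, ay), bb in zip(A, b)):
            yield (x, y)

best = max(vertices(), key=lambda v: 3 * v[0] + 2 * v[1])
print(best)  # (3.0, 1.0): the optimum of max 3x+2y lies on a vertex
```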
IP solving process
The optimal LP solution is in general fractional: it violates
the integrality constraint but provides a bound on the solution
of the overall problem
The solution process, branch and bound, interleaves
  relaxation
  search
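The interleaving of relaxation and search can be sketched on a 0-1 knapsack, where the fractional (greedy) relaxation plays the role the LP relaxation plays in a MIP solver (the instance is invented):

```python
def bound(items, cap):
    """Fractional relaxation: fill greedily by value density; an
    optimistic (upper) bound on any 0-1 completion."""
    total = 0.0
    for v, w in sorted(items, key=lambda i: i[0] / i[1], reverse=True):
        take = min(w, cap)
        total += v * take / w  # the last item may be taken fractionally
        cap -= take
        if cap == 0:
            break
    return total

def branch_and_bound(items, cap):
    best = 0

    def search(i, cap, value):
        nonlocal best
        best = max(best, value)
        if i == len(items) or cap == 0:
            return
        # prune: the relaxation bounds every completion of this node
        if value + bound(items[i:], cap) <= best:
            return
        v, w = items[i]
        if w <= cap:
            search(i + 1, cap - w, value + v)  # branch: take item i
        search(i + 1, cap, value)              # branch: skip item i

    search(0, cap, 0)
    return best

items = [(60, 10), (100, 20), (120, 30)]  # (value, weight)
print(branch_and_bound(items, 50))  # 220
```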
Pros and Cons
Declarative programming: the user states the constraints,
the solver takes care of relaxation and search
Strong on the optimality side
The structure of many real problems has been deeply studied
Only linear constraints can be used
If sophisticated techniques are used, we lose flexibility
No pruning on feasibility (only some preprocessing)
Resource-Efficient
Application mapping for MPSoCs
MULTIMEDIA
APPLICATIONS
Given a platform
1. Achieve a specified throughput
2. Minimize usage of shared resources
Allocation and scheduling
Given:
  A hardware platform with processors, (local and remote) storage
  devices, and a communication channel
  A pre-characterized task graph representing a functional
  abstraction of the application we should run
Find:
  An allocation and a schedule of tasks to resources respecting
    Real-time constraints (task deadlines)
    Precedences among tasks
    Capacity of all resources
  Such that the communication cost is minimized
Allocation and scheduling
The platform is a multi-processor system with N nodes
Each node includes a processor and a scratchpad memory
The bus is a shared communication channel
In addition we have a remote memory of unlimited capacity
(a realistic assumption for our application, but easily
generalizable)
The task graph has a pipeline workload
  Real-time video graphics: processing the pixels of a digital
  image
Task dependencies, i.e., arcs between tasks
Computation, communication, and storage requirements annotated
on the graph
MOTIVATION & CONTEXT
Embedded system design
Allocation and scheduling
Design flow
  Given a platform description and an application abstraction,
  compute an allocation & schedule
[Figure: platform with P1, P2 and RAM; task graph with tasks 1-5;
Gantt chart of the schedule on P1 and P2 over time t]
PROBLEM DESCRIPTION
MPSoC platform
Resources:
  Identical processing elements (PEs): unary resources
  Local storage devices: limited capacity
  Shared bus: limited bandwidth
  Remote on-chip memory: assumed to be infinite
Constraints:
  Local device capacity
  Bus bandwidth (additive resource)
  Architecture-dependent constraints
PROBLEM DESCRIPTION
Application
Application Task Graph
  Nodes are tasks/processes
  Arcs are data communications
Each task:
  Reads data for each ingoing arc
  Performs some computation
  Writes data for each outgoing arc
[Figure: task graph with tasks 1-6; task phases RD -> EXEC -> WR
as above]
PROBLEM DESCRIPTION
Application
Application Task Graph
Durations depend on:
  Memory allocation: remote memory is slower than the local ones
  Different phases have different bus requirements
[Figure: same task graph and task phases (RD, EXEC, WR) as above]
Problem structure
As a whole it is a scheduling problem with alternative resources:
a very tough problem
  Not solvable with CP alone due to the complex objective function
  Not solvable with IP alone since we should model, for each task,
  all its possible starting times with a 0/1 variable
It smoothly decomposes into allocation and scheduling
  Allocation better handled with IP techniques
  Scheduling better handled with CP techniques
INTERACTION REGULATED VIA CUTTING PLANES
Logic Based Benders Decomposition
[Diagram: the Allocation master problem (INTEGER PROGRAMMING, with
memory constraints; obj. function: communication cost) sends a
valid allocation to the Scheduling subproblem (CONSTRAINT
PROGRAMMING, with timing constraints); an infeasible schedule
returns a no-good (linear constraint) to the master]
Decomposes the problem into 2 sub-problems:
  Allocation -> IP
    Objective function: communication cost of tasks
  Scheduling -> CP
    Secondary objective function: makespan
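The master/subproblem interaction can be illustrated with a toy logic-based Benders loop. Brute-force enumeration stands in for the IP master and a per-processor load check stands in for the CP scheduler; all data (durations, communication costs, the period RT) are invented:

```python
from itertools import product

durs = [4, 3, 3, 4]   # task durations (pipeline of 4 tasks)
comm = [5, 1, 4]      # cost paid if tasks i and i+1 sit on different processors
RT, PROCS = 8, 2      # real-time period and number of processors

def master(cuts):
    """Stand-in for the IP master: minimum-communication allocation
    among those not forbidden by a no-good cut."""
    best = None
    for alloc in product(range(PROCS), repeat=len(durs)):
        if alloc in cuts:
            continue
        cost = sum(c for i, c in enumerate(comm) if alloc[i] != alloc[i + 1])
        if best is None or cost < best[0]:
            best = (cost, alloc)
    return best

def schedulable(alloc):
    """Stand-in for the CP subproblem: tasks on one processor must fit in RT."""
    return all(sum(d for d, a in zip(durs, alloc) if a == p) <= RT
               for p in range(PROCS))

cuts = set()
while True:
    cost, alloc = master(cuts)
    if schedulable(alloc):
        break
    cuts.add(alloc)  # no-good: never propose this allocation again

print(cost, alloc)  # the zero-cost "all tasks on one processor"
                    # allocations get cut; a cost-1 split survives
```

A real LBBD implementation returns stronger cuts than single-allocation no-goods, but the feedback loop has exactly this shape.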
Master Problem model
Assignment of tasks and memory slots (master problem)
  Tij = 1 if task i executes on processor j, 0 otherwise
  Yij = 1 if task i allocates program data on processor j's memory,
  0 otherwise
  Zij = 1 if task i allocates the internal state on processor j's
  memory, 0 otherwise
  Xij = 1 if task i executes on processor j and task i+1 does not,
  0 otherwise
Each process should be allocated to exactly one processor:
  sum_j Tij = 1 for all i
Link between variables X and T:
  Xij = |Tij - T(i+1)j| for all i and j (can be linearized)
If a task is NOT allocated to a processor, neither are its
required memories:
  Tij = 0 implies Yij = 0 and Zij = 0
Objective function:
  sum_i sum_j [ memi (Tij - Yij) + statei (Tij - Zij) + datai Xij / 2 ]
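One standard way to linearize Xij = |Tij - T(i+1)j| uses four inequalities, which are exact on 0/1 variables; a quick brute-force check over all binary combinations confirms it (the slide only says "can be linearized", so this particular linearization is an illustrative choice):

```python
from itertools import product

def linearized_x(a, b):
    # feasible X values under: X >= a-b, X >= b-a, X <= a+b, X <= 2-a-b
    return [x for x in (0, 1)
            if x >= a - b and x >= b - a and x <= a + b and x <= 2 - a - b]

# on binaries the four inequalities force X = |a - b| exactly
for a, b in product((0, 1), repeat=2):
    assert linearized_x(a, b) == [abs(a - b)]
print("linearization is exact on 0/1 values")
```

For a minimization objective with non-negative data costs, the two lower-bounding inequalities alone already suffice, since the solver pushes Xij down.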
Improvement of the model
With the proposed model, the allocation solver tends to pack all
tasks on a single processor and all the required memory on the
local memory, so as to have a ZERO communication cost: a TRIVIAL
SOLUTION
To improve the model we add a relaxation of the subproblem to the
master problem model:
For each set S of consecutive tasks whose sum of durations exceeds
the real-time requirement, we impose that they cannot all run on
the same processor:
  if sum_{i in S} WCETi > RT then sum_{i in S} Tij <= |S| - 1
  for all j
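Generating these cuts amounts to sliding a window over the consecutive tasks (a sketch; the WCET values and RT are invented, and each cut is returned as the window's task indices plus the right-hand side |S| - 1):

```python
def relaxation_cuts(wcet, rt):
    """For every window S of consecutive tasks with sum(WCET) > RT,
    emit the cut sum_{i in S} Tij <= |S| - 1 as (indices, rhs)."""
    cuts = []
    n = len(wcet)
    for lo in range(n):
        for hi in range(lo + 1, n):
            S = tuple(range(lo, hi + 1))
            if sum(wcet[i] for i in S) > rt:
                cuts.append((S, len(S) - 1))
    return cuts

print(relaxation_cuts([4, 5, 3, 6], 10))
# [((0, 1, 2), 2), ((0, 1, 2, 3), 3), ((1, 2, 3), 2)]
```

In practice only the minimal windows matter: a cut on a superset (like (0, 1, 2, 3) above) is dominated by the cut on its subset.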
Sub-Problem model
Task scheduling with static resource assignment (subproblem)
We have to schedule tasks, so we have to decide when they start
  Activity starting time: Starti :: [0..Deadlinei]
  Precedence constraints: Starti + Duri <= Startj
  Real-time constraint: for all activities running on the same
  processor, sum_i Duri <= RT
  Cumulative constraints on resources:
    processors are unary resources: cumulative([Start], [Dur], [1], 1)
    memories are additive resources: cumulative([Start], [Dur], [MR], C)
What about the bus?
Bus model
Additive bus model
[Figure: bus bandwidth (bit/sec) over time, bounded by the max bus
bandwidth; task0 reads its state, accesses input data with
BW = MaxBW/NoProc (size of program data / task execution time),
executes, and writes its state, while task1 runs concurrently]
The model does not hold under heavy bus congestion
Bus traffic has to be minimized
Results
Algorithm search time
The combined approach dominates; its higher complexity shows up
only on simple system configurations
Energy-Efficient
Application mapping for MPSoCs
MULTIMEDIA
APPLICATIONS
Given a platform
1. Achieve a specified throughput
2. Minimize power consumption
Logic Based Benders Decomposition
[Diagram: the Allocation & frequency assignment master problem
(INTEGER PROGRAMMING, with memory constraints; obj. function:
communication cost & energy consumption) sends a valid allocation
to the Scheduling subproblem (CONSTRAINT PROGRAMMING, with timing
constraints); no-goods and cutting planes are returned to the
master]
Decomposes the problem into 2 sub-problems:
  Allocation & assignment (& freq. setting) -> IP
    Objective function: minimizing energy consumption during
    execution
  Scheduling -> CP
    Objective function: e.g. minimizing energy consumption during
    frequency switching
Allocation problem model
The objective function: minimize the energy consumption
associated with task execution and communication
  Xtfp = 1 if task t executes on processor p at frequency f
  Wijfp = 1 if tasks i and j run on different cores and task i,
  on core p, writes data for j at frequency f
  Rijfp = 1 if tasks i and j run on different cores and task j,
  on core p, reads data from i at frequency f

  sum_{p=1..P} sum_{f=1..M} Xtfp = 1   for all t
  (each task executes on exactly one processor at one frequency)

  sum_{p=1..P} sum_{f=1..M} Wijfp <= 1   for all i, j in T
  sum_{p=1..P} sum_{f=1..M} Rijfp <= 1   for all i, j in T
  sum_{p=1..P} sum_{f=1..M} (Wijfp - Rijfp) = 0   for all i, j in T
  (communication between tasks executes at most once per
  execution, and one write corresponds to one read)

  OF = EnComp + EnRead + EnWrite
Scheduling problem model
The duration of each task i is now fixed, since its mode is
fixed: INPUT + EXEC + OUTPUT
  Tasks running on the same processor at the same frequency:
    Starti + durationi <= Startj
  Tasks running on the same processor at different frequencies:
    Starti + durationi + Ti <= Startj
    (Ti accounts for the frequency switching time)
  Tasks running on different processors:
    Starti + durationi + dWriteij + dReadij <= Startj
Processors are modelled as unary resources
The bus is modelled as an additive resource
The objective function: minimize the energy consumption
associated with frequency switching
Computational efficiency
Search time for an increasing number of tasks
(P4 2GHz, 512 MB RAM; professional solving tools (ILOG))
  Similar plot for an increasing number of processors
  Standalone CP and IP proved not comparable, even on a simpler
  problem
  We varied the RT constraint:
    tight deadline: few feasible solutions
    very loose deadline: trivial solution
    search time stays within 1 order of magnitude
Conditional Task Graphs
With conditional task graphs the problem becomes stochastic,
since the outcomes of the conditions labeling the arcs are known
only at execution time.
We only know their probability distribution
Minimize the expected value of the objective function
  Min communication cost: easier
  Min makespan: much more complicated
Promising results
Other problems
What if the task durations are not known and only the worst and
best cases are available?
  Scheduling only the worst case is not enough
  Scheduling anomalies
Change platform:
  We are targeting the Cell BE processor architecture
  Model synchronous data flow applications
  Model communication-intensive applications
Challenge: the Abstraction Gap
[Diagram: the Optimization side (platform modelling ->
optimization analysis -> optimal solution) and the Development
side (starting implementation -> final implementation -> platform
execution), separated by the abstraction gap]
The abstraction gap between high-level optimization tools and
standard application programming models can introduce
unpredictable and undesired behaviours.
Programmers must be aware of the simplifying assumptions made by
optimization tools.
Validation of optimizer solutions
Throughput
Optimizer -> optimal allocation & schedule -> virtual platform
validation
[Figure: histogram of the throughput difference (%) between the
optimizer prediction and the virtual platform, over 250 instances,
ranging from -5% to 11%]
MAX error lower than 10%
AVG error equal to 4.51%, with standard deviation of 1.94
All deadlines are met
Validation of optimizer solutions
Power
Optimizer -> optimal allocation & schedule -> virtual platform
validation
[Figure: histogram of the energy consumption difference (%)
between the optimizer prediction and the virtual platform, over
250 instances, ranging from -5% to 11%]
MAX error lower than 10%
AVG error equal to 4.80%, with standard deviation of 1.71
GSM Encoder
Task Graph:
  10 computational tasks
  15 communication tasks
Throughput required: 1 frame/10 ms
With 2 processors and 4 possible frequency & voltage settings:
  Without optimizations: 50.9 μJ
  With optimizations: 17.1 μJ (-66.4%)
Challenge: programming environment
A software development toolkit to help programmers with the
software implementation. The main goals are:
  a generic customizable application template (OFFLINE SUPPORT);
  a set of high-level APIs (ONLINE SUPPORT) in an RT-OS (RTEMS);
  predictable application execution after the optimization step;
  guarantees on high performance and constraint satisfaction.
Starting from a high-level task and data flow graph, software
developers can easily and quickly build their application
infrastructure.
Programmers can intuitively translate the high-level
representation into C code using our facilities and libraries.