Toward the Predictable Integration
of Real-Time COTS Based Systems
Marco Caccamo
Department of Computer Science
University of Illinois at Urbana-Champaign
Acknowledgement
• Part of this research is joint work with Prof. Lui Sha.
• This presentation covers selected research sponsored by:
◦ National Science Foundation
◦ Lockheed Martin Corporation
• Graduate students who led these research efforts:
◦ Rodolfo Pellizzoni
◦ Bach D. Bui
References
• R. Pellizzoni, B.D. Bui, M. Caccamo and L. Sha, "Coscheduling of CPU and I/O Transactions in COTS-Based Embedded Systems," to appear in Proceedings of the IEEE Real-Time Systems Symposium, Barcelona, December 2008.
• R. Pellizzoni and M. Caccamo, "Toward the Predictable Integration of Real-Time COTS-Based Systems," Proceedings of the IEEE Real-Time Systems Symposium, Tucson, Arizona, December 2007.
COTS HW & RT Embedded Systems
• Embedded systems are increasingly built using Commercial Off-The-Shelf (COTS) components to reduce cost and time-to-market.
• This trend holds even for companies in the safety-critical avionics market, such as Lockheed Martin Aeronautics, Boeing and Airbus.
• COTS components usually provide better performance:
◦ SAFEbus, used in the Boeing 777, transfers data at up to 60 Mbps, while a COTS interconnect such as PCI Express reaches transfer speeds over three orders of magnitude higher.
• COTS components are mainly optimized for average-case performance, not for the worst-case scenario.
I/O Bus Transactions & WCETs
• Experiment based on an Intel platform running at a typical embedded-system speed.
• PCI-X at 133 MHz, 64 bit, fully loaded.
• The task suffers continuous cache misses.
• Result: up to a 44% wcet increase.
This is a big problem!!!
ARINC 653 and unpredictable COTS behaviors
• According to the ARINC 653 avionics standard, different computational components should be put into isolated partitions (cyclic time slices of the CPU).
• ARINC 653 does not provide any isolation from the effects of I/O bus traffic: a peripheral is free to interfere with cache fetches while any partition (even one not using that peripheral) is executing on the CPU.
• To provide true temporal partitioning, enforceable specifications must address the complex dependencies among all interacting resources.
• See Aeronautical Radio Inc., ARINC 653 Specification; it defines the Avionics Application Standard Software Interface.
Peripheral Integration: Problem Scenario
[Figure: CPU running Task A and Task B, connected over the Front Side Bus to DDRAM and, through a Host-PCI Bridge, to a PCI bus hosting master and slave peripherals.]
• Cache-peripheral conflict:
◦ A master peripheral is working for Task B.
◦ Task A suffers a cache miss.
◦ Processor activity can be stalled due to interference at the FSB level.
• How relevant is the problem?
◦ Four high-performance network cards, saturated bus.
◦ Up to 49% increased wcet for memory-intensive tasks.
This effect MUST be considered in wcet computation!!
(S. Schonberg, "Impact of PCI-Bus Load on Applications in a PC Architecture," RTSS 2003.)
Goal: End-to-End Temporal Isolation on COTS
• To achieve end-to-end temporal isolation, shared resources (CPU, bus, cache, peripherals, etc.) should either support strong isolation or make temporal interference quantifiable.
• Highly pessimistic assumptions are often made to compensate for the lack of end-to-end temporal isolation on COTS:
◦ one example is accounting for the effect of all peripheral traffic in the wcet of real-time tasks (up to a 44% increment in task wcet)!
• The lack of end-to-end temporal isolation dramatically raises integration costs and is a source of serious concern during the development of safety-critical embedded systems:
◦ at integration time (the last phase of the design cycle), testing can reveal unexpected deadline misses, causing expensive design rollbacks.
Goal: End-to-End Temporal Isolation on COTS
• It is mandatory to take a closer look at HW behavior and its integration with the OS, middleware, and applications.
• We aim at analyzing the temporal interference caused by COTS integration:
◦ if the analyzed performance is not satisfactory, we search for alternative (non-intrusive) HW solutions → see Peripheral Gate.
Main Contributions
• We introduced an analytical technique that computes safe bounds on the I/O-induced task delay (D).
• To control I/O interference over task execution, we introduced a coscheduling technique for the CPU & I/O peripherals.
• We designed a COTS-compatible peripheral gate and hardware server to enable/disable I/O peripherals (the hw server is in progress!).
The cache-peripheral interference problem
• COTS components are inherently unpredictable due to:
◦ pipelined, cached CPUs;
◦ master (DMA) peripherals;
◦ etc.
• Modern COTS-based embedded architectures are multi-master platforms.
• We assume a shared-memory architecture with single-port RAM.
• We will show safe bounds for cache-peripheral interference at the main memory level.
[Figure: example COTS architecture: a PowerPC clocked at 1000 MHz with a multilevel cache, 256 MB DDR SDRAM on a 64-bit memory bus clocked at 125 MHz, and a system controller bridging to 32/64-bit PCI and PCI-X buses (33 to 100 MHz) that host Ethernet, Fibre Channel and IEEE 1394 network interfaces, digital video, a graphics processor, MPEG compression, RS-485 and discrete I/O.]
Peripheral Burstiness Bound
• Similar to a network calculus approach.
• E(t): maximum cumulative bus time required by peripherals in any interval of length t.
• How to compute:
◦ measurement (see the sketch below);
◦ knowledge of distributed traffic.
• Assumptions:
◦ maximum non-preemptive transaction length L';
◦ no buffering in bridges (the analysis was extended to the case with buffering too!).
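As a concrete illustration of the measurement option, here is a minimal sketch (ours, not the paper's code) of how E(t) could be estimated from a recorded bus trace. The trace values, the txn_t layout, and the simplifying rule that a transaction's bus time is charged to the window containing its start are all illustrative assumptions.

```c
/* Sketch: estimating the burstiness bound E(t) from a measured trace.
 * Transactions are (start, bus_time) pairs sorted by start time; we
 * slide a window of length t and keep the worst cumulative demand. */
#include <stdio.h>

typedef struct { double start; double bus_time; } txn_t;

/* Maximum cumulative bus time demanded in any interval of length t. */
double burstiness_E(const txn_t *txn, int n, double t)
{
    double worst = 0.0;
    for (int i = 0; i < n; i++) {          /* window opens at txn[i]  */
        double sum = 0.0;
        for (int j = i; j < n && txn[j].start < txn[i].start + t; j++)
            sum += txn[j].bus_time;
        if (sum > worst)
            worst = sum;
    }
    return worst;
}

int main(void)
{
    /* Hypothetical trace: 2us DMA bursts, with one back-to-back pair. */
    txn_t trace[] = { {0, 2}, {10, 2}, {12, 2}, {20, 2}, {30, 2} };
    printf("E(15) = %.1f us\n", burstiness_E(trace, 5, 15.0)); /* 6.0 */
    return 0;
}
```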
Cache Miss Profile
• c(t): cumulative bus time required to fetch/replace cache lines in [0, t].
• Note: this is not an upper bound!
• Assumption:
◦ the CPU is stalled while waiting for a level-2 cache line fetch (no hyperthreading).
• How to compute:
◦ static analysis;
◦ profiling.
• Profiling yields multiple traces; we run the delay analysis on all of them (a sketch of one such profile follows below).
[Figure: c(t) as bus time up to the wcet: flat segments while the CPU executes, slope-1 segments while the CPU is stalled during a cache line fetch.]
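A minimal sketch of how a profiled trace could be stored and evaluated as c(t). The fetch_start values, the single fetch length l, and the assumption that profiled fetches are sorted and non-overlapping are ours for illustration.

```c
/* Sketch: a cache miss profile c(t) kept as the sorted list of fetch
 * start times recorded by profiling (with peripherals disabled), each
 * fetch occupying the bus for one cache-line time l. c(t) is flat
 * while the CPU executes and grows with slope 1 during a fetch. */
#include <stdio.h>

#define N_MISSES 3
static const double fetch_start[N_MISSES] = { 2.0, 9.0, 15.0 };
static const double l = 1.0;            /* bus time per line fetch */

/* Cumulative bus time used by cache fetches in [0, t]. */
double c_of_t(double t)
{
    double total = 0.0;
    for (int i = 0; i < N_MISSES; i++) {
        if (t <= fetch_start[i])
            break;                          /* fetch not started yet */
        double done = t - fetch_start[i];   /* time spent in fetch   */
        total += (done >= l) ? l : done;    /* slope-1 segment       */
    }
    return total;
}

int main(void)
{
    printf("c(10.5) = %.2f\n", c_of_t(10.5));  /* 2.00: two fetches done */
    return 0;
}
```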
Cache Delay Analysis
[Figure: timeline of CPU cache misses up to the wcet (no I/O interference), the PCI peripheral burstiness bound, and the resulting delayed cache misses showing the wcet increment D.]
• The proposed analysis computes the worst-case increase (D) in task computation time due to cache delays caused by FSB interference.
• Main idea: treat the FSB + CPU cache logic as a switch that multiplexes accesses to system memory.
◦ Inputs: cache line misses over time and peripheral bandwidth.
◦ Output: a curve representing the delayed cache misses.
• Bus arbitration is assumed RR or FP; transactions are non-preemptive.
Analysis: Intuition (1/2)
• Worst-case situation: a PCI transaction is accepted just before a CPU cache miss.
[Figure: each cache miss (line length l) preceded by a peripheral transaction of maximum length L'.]
• Worst-case interference: min( CM, PT/L' ) * L' (a worked sketch follows below)
◦ CM: number of cache misses.
◦ PT: total peripheral traffic during task execution.
◦ Assuming RR bus arbitration.
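The intuition bound is simple enough to state directly as code; the numbers in main() are made up for illustration.

```c
/* Sketch of the coarse intuition bound: under round-robin arbitration
 * each cache miss can be blocked by at most one non-preemptive
 * peripheral transaction of length L', and the peripherals cannot
 * block more often than their total traffic PT allows. */
#include <stdio.h>

double coarse_delay(double CM, double PT, double Lp)
{
    double by_misses  = CM;        /* one blocking txn per miss      */
    double by_traffic = PT / Lp;   /* txns the traffic can pay for   */
    double n = (by_misses < by_traffic) ? by_misses : by_traffic;
    return n * Lp;                 /* min(CM, PT/L') * L'            */
}

int main(void)
{
    /* e.g. 1000 misses, 300us of peripheral traffic, L' = 0.5us:
     * min(1000, 600) * 0.5us = 300us of worst-case delay. */
    printf("D <= %.1f us\n", coarse_delay(1000, 300, 0.5));
    return 0;
}
```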
Analysis: Intuition (2/2)
• The bound above is pessimistic, because cache misses exhibit bursty behavior.
• Example: assume one peripheral transaction every T time units.
[Figure: bursty CPU memory accesses against peripheral transactions arriving every T: the accesses inside a burst cannot all be delayed, and the peripheral transactions arriving while the CPU computes cannot delay the CPU.]
• Real analysis: compute the exact interference pattern based on the burstiness of cache misses and peripheral transactions.
Worst Case Interference Scenario
• Worst-case situation: a peripheral transaction of length L' is accepted just before each CPU cache miss.
[Figure: cache access function c(t) with fetch start times t1..t5 up to the wcet; the fetch start times in c(t) are unmodified by peripheral activity.]
Bound: Cache Misses
[Figure: the same schedule with each cache miss preceded by one peripheral transaction, stretching the wcet by D.]
• Cache bound: the maximum number of interfering peripheral transactions equals the number of cache misses.
• Let CM be the number of cache misses. Then D ≤ CM · L'.
Bound: Peripheral Load
[Figure: the same schedule; the peripheral traffic that can interfere with fetches f1..f5 is confined to a window of length t5 − t1 + D.]
• Peripheral bound: the maximum interference D is bounded by the maximum bus time requested by peripherals in an interval of length t5 − t1 + D:
D ≤ E(t5 − t1 + D)
• Let Ē(t) = max{ x | x ≤ E(t + x) }. Then, equivalently: D ≤ Ē(t5 − t1).
• In general, given a set of fetches {fi,…,fj} with start times {ti,…,tj}:
D ≤ Ē(tj − ti)
Some Insights about Peripheral Bound
• There is a circular dependency between the amount of peripheral load that interferes with {fi,…,fj} and the delay D(fi, fj).
• When peripheral traffic is injected on the FSB, the start time of each fetch is delayed. In turn, this increases the time interval between fi and fj, and therefore more peripheral traffic can interfere with those fetches.
• Our key idea is that we do not need to modify the start times {ti,…,tj} of the fetches when we take the I/O traffic injected on the FSB into account. Instead, we account for it through the equation that defines Ē(t) (a code sketch follows below).
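A minimal sketch of evaluating Ē(t) by fixed-point iteration, assuming E is non-decreasing with a long-run slope below 1 (the peripherals do not saturate the bus), so the iteration converges from below. The linear arrival curve (sigma, rho) is our toy example, not measured data.

```c
/* Sketch: the pseudo-inverse E_bar(t) = max{ x | x <= E(t+x) },
 * computed by iterating x <- E(t + x) from x = E(t). */
#include <stdio.h>

static const double sigma = 4.0, rho = 0.25;  /* toy arrival curve */

double E(double t) { return (t <= 0) ? 0 : sigma + rho * t; }

double E_bar(double t)
{
    double x = E(t);
    for (int i = 0; i < 1000; i++) {          /* iterate to fixpoint */
        double next = E(t + x);
        if (next - x < 1e-9)
            return next;
        x = next;
    }
    return x;              /* give up (cannot happen for rho < 1) */
}

int main(void)
{
    /* For this linear E the fixpoint is (sigma + rho*t) / (1 - rho). */
    printf("E_bar(12) = %.3f (expect %.3f)\n",
           E_bar(12.0), (sigma + rho * 12.0) / (1.0 - rho));
    return 0;
}
```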
Some Insights about Peripheral Bound
[Figure: the fetches in the interval [0, 36].]
• D represents both the maximum delay suffered by fetches within [0, 36] and the increase in the time interval available to interfering traffic.
• Max interference: D ≤ Ē(36).
The Intersection is not Tight!
[Figure: cache access function c(t) with five fetches; E(t5 − t1 + D) = 14, yet one of the peripheral transactions cannot interfere.]
• D ≤ min( CM · L', Ē(t5 − t1) ) = 14
• The real worst-case delay is 13!
• Reason: the cache is too bursty; the interference from one peripheral transaction is "lost" while the cache is not in use.
The Intersection is not Tight!
[Figure: the same c(t), split into the fetch groups f1..f3 and f4..f5; Ē(t3) = 7.]
• D1,3 ≤ E(t3 + D1,3), hence D1,3 ≤ Ē(t3) = 7
• D4,5 ≤ 2 · L' = 6
• Solution: split into multiple intervals. D1,3 + D4,5 = 13.
• How many intervals do we need to consider?
Delay Algorithm
• An iterative algorithm evaluates all N(N+1)/2 intervals: [t1, t1], [t1, t2], [t2, t2], [t1, t3], [t2, t3], [t3, t3], [t1, t4], …
• Each interval is computed in O(1); the overall complexity is O(N²) (a simplified sketch follows below).
• The bound is tight (see RTSS'07).
[Figure: triangular table of intervals, computing the maximum delay u_k for each miss k from the per-interval delays y_i = D(t_i, t_{k+1}).]
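The following is a simplified sketch of the interval-splitting idea, not a verbatim transcription of the RTSS'07 algorithm: it bounds each contiguous group of misses by the smaller of its cache and peripheral bounds and picks the best partition by dynamic programming, visiting each of the N(N+1)/2 intervals once in O(1). The miss times, L', and the linear Ē below are illustrative assumptions.

```c
/* Sketch: combine per-interval bounds
 *   b(i,j) = min( (j-i+1)*L', E_bar(t_j - t_i) )
 * over a partition of the misses into contiguous groups, minimized
 * by an O(N^2) dynamic program. */
#include <stdio.h>

#define N 5
static const double t[N] = { 0, 3, 6, 30, 33 };  /* fetch start times */
static const double Lp   = 3.0;                  /* max txn length L' */

/* Toy pseudo-inverse of a linear arrival curve (see earlier sketch). */
static const double sigma = 4.0, rho = 0.25;
double E_bar(double dt) { return (sigma + rho * dt) / (1.0 - rho); }

double delay_bound(void)
{
    double best[N + 1];
    best[0] = 0.0;
    for (int j = 1; j <= N; j++) {               /* bound misses 1..j */
        best[j] = 1e300;
        for (int i = 1; i <= j; i++) {           /* last group: i..j  */
            double cache  = (j - i + 1) * Lp;             /* cache bound */
            double periph = E_bar(t[j - 1] - t[i - 1]);   /* periph bound */
            double b = (cache < periph) ? cache : periph;
            if (best[i - 1] + b < best[j])
                best[j] = best[i - 1] + b;
        }
    }
    return best[N];
}

int main(void)
{
    /* Prints D <= 13.33: splitting {f1..f3},{f4,f5} beats the
     * single-interval bound of 15, as in the slide's example. */
    printf("D <= %.2f\n", delay_bound());
    return 0;
}
```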
Multitasking analysis
[Figure: a task's control flow graph partitioned into sequential superblocks S1..S6.]
• Multitasking analysis using a cyclic executive (it was extended to EDF with a restricted-preemption model):
1. Analyze the task's control flow graph.
2. Build a set of sequential superblocks.
3. The schedule is an interleaving of slots composed of superblocks.
4. Algorithm: compute the number of superblocks in each slot.
5. Account for additional cache misses due to inter-task cache interference.
Great! But c(t) is hard to get... and 44% is awful
• The proposed analysis makes a fairly restrictive assumption: it must know the exact time of each cache miss.
• I/O interference is significant: when added to the wcet of all tasks, the system can suffer a huge waste of bandwidth!
• Key idea: let's coschedule the CPU & I/O peripherals.
• Goal: allow as much peripheral traffic as possible at run time while using CPU reservations that do NOT include the I/O interference (D).
Cache Miss Profile is Hard to Get
• Problem: obtaining an exact cache miss pattern is very hard.
◦ CPU simulation requires simulating all peripherals.
◦ Static analysis scales poorly.
◦ In practice, testing is often the preferred way.
• Our solution:
◦ Split the tasks into intervals.
◦ Insert a checkpoint at the end of each interval (see the instrumentation sketch below).
◦ Measure the wcet and the worst-case number of cache misses for each interval (with no peripheral traffic).
◦ Checkpoints should not break loops or branches (they fall on sequential macroblock boundaries).
[Figure: task control flow graph S1..S6 with a checkpoint after each superblock.]
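A sketch of what the per-interval profiling instrumentation could look like. read_cycles(), read_misses(), and profile[] are hypothetical names standing in for the platform's counter access (the APMCs discussed later); run the task repeatedly with peripherals disabled and keep the per-interval maxima.

```c
/* Sketch: a checkpoint at each superblock boundary records the cycles
 * and cache misses since the previous checkpoint, keeping the maxima
 * observed across runs as wcet_i and CM_i. */
#include <stdint.h>

extern uint64_t read_cycles(void);   /* hypothetical counter reads */
extern uint64_t read_misses(void);

#define MAX_CKPT 64
struct interval_profile { uint64_t wcet; uint64_t cm; };
static struct interval_profile profile[MAX_CKPT];

static uint64_t last_cycles, last_misses;

void checkpoint(int i)               /* call at each superblock end */
{
    uint64_t c = read_cycles(), m = read_misses();
    uint64_t dc = c - last_cycles, dm = m - last_misses;
    if (dc > profile[i].wcet) profile[i].wcet = dc;   /* keep maxima */
    if (dm > profile[i].cm)   profile[i].cm   = dm;
    last_cycles = c;
    last_misses = m;
}
```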
CPU & I/O coscheduling: HOW TO
• A coscheduling technique for COTS peripherals:
1. Divide each task into a series of sequential superblocks.
2. Run off-line profiling for each task, collecting information on the wcet and the number of cache misses in each superblock (without I/O interference).
3. Compute a safe (wcet+D) bound (which includes I/O interference) for each superblock by assuming a "critical cache miss pattern".
4. Design a peripheral gate (p-gate) to enable/disable I/O peripherals.
5. Design a new peripheral (on an FPGA board), the reservation controller, which executes the coscheduling algorithm and controls all p-gates.
6. Use the profiling information at run time to coschedule tasks and I/O transactions.
Analysis with Interval Information
• Input: a set of intervals, each with a wcet and a number of cache misses (wcet_i, CM_i).
• Since we do not know when each cache miss happens within each interval, we need to identify a worst-case pattern.
[Figure: a critical placement of the CM_i = 4 misses within an interval of length wcet_i; this is actually the worst-case pattern!]
• If the peripheral load curve is concave, we obtain a tight bound for the delay D (details are in a technical report).
• If the peripheral load curve is not concave, the bound for the delay D is not tight.
• Simulations showed that the upper bound is within 0.2% of the real worst-case delay.
On-line Coscheduling Algorithm
• The on-line algorithm (a code sketch and a worked example follow below):
◦ Non-safety-critical tasks have a CPU reservation equal to their wcet (D is NOT included!).
◦ At the beginning of each job, the p-gates are closed.
◦ At run time, at each checkpoint, the OS sends the APMC count of CPU cycles (exec_i) to the reservation controller.
◦ The reservation controller keeps track of the accumulated slack time. If the slack Σ_i (wcet_i − exec_i) covers the delay D for the next interval, it opens the p-gate.
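The rule maps directly to code. This is a minimal sketch: pgate_set() and the wcet/D tables are hypothetical stand-ins for the reservation controller's interface and the profiled data.

```c
/* Sketch: the reservation controller accumulates slack at each
 * checkpoint and opens the p-gate only when the slack covers the
 * worst-case I/O delay D of the next interval. */
#include <stdint.h>
#include <stdbool.h>

extern void pgate_set(bool open);       /* hypothetical p-gate control */

#define N_INTERVALS 5
static const uint64_t wcet[N_INTERVALS] = { 100, 80, 120, 90, 60 };
static const uint64_t D[N_INTERVALS]    = {  30, 25,  35, 28, 20 };

static uint64_t slack;

void job_start(void)                    /* p-gates closed at release */
{
    slack = 0;
    pgate_set(false);
}

void at_checkpoint(int i, uint64_t exec_i)   /* end of interval i */
{
    slack += wcet[i] - exec_i;          /* task ran ahead of its wcet */
    if (i + 1 < N_INTERVALS)
        pgate_set(slack >= D[i + 1]);   /* open only if D is covered  */
}
```

If the gate is open during interval i+1, exec_{i+1} includes the extra I/O delay, so the next slack update absorbs it automatically.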
Coscheduling algorithm: an example
• A task with five intervals wcet1..wcet5:
1. At job release, the initial slack is 0 ⇒ the p-gate is closed.
2. After interval 1: slack += wcet1 − exec1. Slack < D2 ⇒ the p-gate stays closed (interval 2 would cost up to wcet2 + D2 if open).
3. After interval 2: slack += wcet2 − exec2. Slack ≥ D3 ⇒ the p-gate opens (wcet3 + D3 fits within the accumulated slack).
System Integration: example for the avionic domain
• The system is composed of tasks/partitions with different criticalities; each task/partition uses different I/O peripherals:
◦ Class A: safety critical (e.g., flight control)
◦ Class B: mission critical (e.g., radar processing)
◦ Class C: non critical (e.g., display)
• The right action depends on the task/partition criticality (see the sketch below):
◦ Class A: block all non-relevant peripheral traffic (reservation = wcet + D).
◦ Class B: coschedule tasks and peripherals to maximize I/O traffic (reservation = wcet).
◦ Class C: all I/O peripherals are enabled.
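The per-class policy can be summarized as a small dispatch the reservation controller could apply at each partition switch. The names are ours; the A/B/C mapping itself is from the slide.

```c
/* Sketch: criticality-based I/O policy at a partition switch. */
typedef enum { CLASS_A, CLASS_B, CLASS_C } criticality_t;

extern void pgate_set_all(int open);    /* hypothetical controls */
extern void run_coscheduler(void);

void apply_class_policy(criticality_t c)
{
    switch (c) {
    case CLASS_A: pgate_set_all(0);  break;  /* reservation = wcet+D */
    case CLASS_B: run_coscheduler(); break;  /* reservation = wcet   */
    case CLASS_C: pgate_set_all(1);  break;  /* I/O always enabled   */
    }
}
```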
Peripheral Gate
• We designed the peripheral gate (p-gate for short) for the PCI/PCI-X bus: it allows us to control peripheral access to the bus.
• The peripheral gate is compatible with COTS devices: its use does not require any modification to either the peripheral or the motherboard.
Peripheral Gate
[Figure: CPU and RAM on the FSB; the reservation controller receives the CPU schedule from the kernel and commands a peripheral gate on each peripheral bus. In the example, processes #1, #2 and #3 all belong to Class A, and the gates track which process is executing over time.]
• The reservation controller commands the peripheral gates (p-gates).
• The kernel sends scheduling information to the reservation controller.
• Only a minimal kernel modification is needed (send the PID and exec time of the executing process).
• Class A task: block all non-relevant peripheral traffic.
• Class B task: the reservation controller runs the coscheduling algorithm.
Current Prototype
• The testbed uses a standard Intel platform.
• The reservation controller is implemented on an FPGA; the p-gate uses a PCI extender card plus discrete logic.
• A logic analyzer is used for debugging and measurement.
[Photo: the p-gate on a PCI extender card, a Gigabit Ethernet NIC, and the reservation controller on a Xilinx FPGA.]
Kernel Implementation
• Getting the run-time information (exec_i, cache misses) requires support from the CPU and the OS.
• We used the Architectural Performance Monitor Counters (APMCs) of the Intel Core 2 microarchitecture, but other manufacturers (e.g., IBM) have similar support (the implementation is specific, the lesson is general).
• Two APMCs are configured to count cache misses and CPU cycles in user space.
• The task descriptor is extended with execution-time and cache-miss fields.
• At context switch, the APMCs are saved/restored in the descriptors like any other task-specific CPU registers (see the sketch below).
• Implemented under Linux/RK.
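A sketch of the descriptor extension and context-switch hook. apmc_read()/apmc_write() and the field names are hypothetical; the actual implementation was done in Linux/RK.

```c
/* Sketch: the two APMC values are saved and restored at context
 * switch like any other task-specific register. */
#include <stdint.h>

extern uint64_t apmc_read(int idx);          /* hypothetical APMC access */
extern void     apmc_write(int idx, uint64_t v);

struct task_acct {                    /* added to the task descriptor */
    uint64_t user_cycles;             /* APMC0: cycles in user space  */
    uint64_t cache_misses;            /* APMC1: cache misses          */
};

void context_switch_acct(struct task_acct *prev, struct task_acct *next)
{
    prev->user_cycles  = apmc_read(0);       /* save outgoing task    */
    prev->cache_misses = apmc_read(1);
    apmc_write(0, next->user_cycles);        /* restore incoming task */
    apmc_write(1, next->cache_misses);
}
```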
Other Coscheduling Algorithms
• We compared our adaptive heuristic with other algorithms.
• Assumption: at the beginning of each interval, the algorithm chooses whether to open or close the switch for that interval.
1. Slack-only: baseline comparison; it uses only the slack time remaining when the task has finished.
2. Predictive (see the sketch below):
◦ also uses measured average execution times;
◦ "predicts" the slack time in the future and optimizes the open intervals at each step;
◦ computing an optimal allocation is NP-hard, so it uses a fast greedy heuristic instead.
3. Optimal:
◦ clairvoyant (not implementable);
◦ provides an upper bound on the performance of any run-time, predictive algorithm.
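One plausible reading of the predictive heuristic, sketched under our own assumptions (the paper's exact greedy may differ): project slack forward using measured average execution times, then greedily mark future intervals open, cheapest worst-case delay D first, as long as the projection stays feasible; only the decision for the next interval is actually applied.

```c
/* Sketch: a greedy predictive decision. All tables are illustrative. */
#include <stdbool.h>

#define N 5
static const double wcet[N] = { 100, 80, 120, 90, 60 };
static const double avg [N] = {  70, 60,  90, 70, 45 };
static const double D   [N] = {  30, 25,  35, 28, 20 };

/* Project slack forward with average exec times, charging D[i] up
 * front for every interval planned open; check it never goes short. */
static bool feasible(const bool *open, int cur, double slack)
{
    for (int i = cur; i < N; i++) {
        if (open[i]) {
            if (slack < D[i]) return false;
            slack -= D[i];              /* worst-case delay budgeted */
        }
        slack += wcet[i] - avg[i];      /* expected slack gained     */
    }
    return true;
}

bool decide_next(int cur, double slack)     /* true = open p-gate */
{
    bool open[N] = { false }, tried[N] = { false };
    for (;;) {
        int best = -1;                      /* cheapest untried D */
        for (int i = cur; i < N; i++)
            if (!tried[i] && (best < 0 || D[i] < D[best]))
                best = i;
        if (best < 0) break;
        tried[best] = true;
        open[best] = true;
        if (!feasible(open, cur, slack))
            open[best] = false;             /* revert infeasible pick */
    }
    return open[cur];
}
```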
The Test
• All run-time algorithms were implemented on a Xilinx ML505 FPGA.
• The optimal was computed using the Matlab optimization tool.
• We used an MPEG decoder as the benchmark:
◦ as a trend, video processing is increasingly used in the avionic domain for mission control;
◦ it simulates a Class B application subject to heavy I/O traffic.
• The task misses its deadline by up to 30% if I/O traffic is always allowed!
• Results in terms of the % of time the p-gate is open:
Slack-only: 4.89% | Run-time: 31.21% | Predictive: 36.65% | Optimal: 40.85%
• The run-time algorithm is already close to the optimal; there is not much to gain with the improved heuristic.
Simulation Results
• We performed synthetic simulations to better understand the performance of the run-time algorithm.
• 20 superblocks per task, where:
◦ α is the variation between the wcet and the average computation time;
◦ β is the % of time the task is stalled due to cache misses.
[Figure: simulation results as a function of α and β.]
Improving the P-Gate: Hardware Server (in progress)
• Problem: blocking the peripheral reduces its maximum throughput.
◦ This is acceptable only if critical tasks/partitions run for a limited amount of time.
• Better solution: implement a hardware server with buffering on an SoC.
◦ Transactions are queued in the hw server's memory during non-relevant partitions.
◦ Interrupts/DMA transfers are delivered only during the execution of the interested tasks/partitions (see the sketch below).
◦ Similar to real-time aperiodic servers: a hw server permits aperiodic I/O requests to be analyzed as if they followed a predictable (periodic) pattern.
• FPGA-based SoC design with Linux device drivers; currently in development.
[Figure: Xilinx FPGA SoC with a CPU, memory, interrupt controller and PCI interface on the OPB bus, bridged into the host's PCI bus between the peripheral and the PCI host bridge/DDRAM.]
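A software sketch of the buffering discipline described above; the real design is FPGA hardware, and all names (struct txn, forward_to_bus) are ours.

```c
/* Sketch: queue peripheral transactions while the running partition
 * is not interested in the device, replay them when an interested
 * partition is dispatched. */
#include <stdbool.h>

#define QLEN 256
struct txn { unsigned addr; unsigned data; };

static struct txn queue[QLEN];
static int head, tail;
static bool partition_interested;        /* set at partition switch */

extern void forward_to_bus(struct txn t);    /* hypothetical hw path */

void on_transaction(struct txn t)
{
    if (partition_interested) {
        forward_to_bus(t);               /* deliver immediately */
    } else if ((tail + 1) % QLEN != head) {
        queue[tail] = t;                 /* buffer in server memory */
        tail = (tail + 1) % QLEN;
    }
    /* a full queue would throttle the peripheral (back-pressure) */
}

void hw_server_partition_switch(bool interested)
{
    partition_interested = interested;
    while (interested && head != tail) { /* replay deferred traffic */
        forward_to_bus(queue[head]);
        head = (head + 1) % QLEN;
    }
}
```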
Conclusions
• A major issue in peripheral integration is the task delay due to cache-peripheral contention at the main memory level.
• We proposed a framework to: 1) analyze the delay due to cache-peripheral contention; 2) control task execution times.
• The proposed coscheduling technique was tested with the PCI/PCI-X bus; the hw server will be ready soon.
• Future work: extend to multi-processor and distributed systems.