
Adaptive Partition Scheduling
Part 1: Why we did it
Cool stuff from QNX
A.Danko
July 16, 2015
Yet another thread scheduler.
Why?

The story begins with a customer:

“We can use QNX! We need ARINC653!!!!!! HELP!”
Why?
Shiny New Toy

Partition scheduler (ARINC 653)
> Very popular in fixed military systems
> Each partition is guaranteed a percentage of CPU
> Priorities are only meaningful within a partition
[Figure: ARINC 653 Partition Scheduler and “special” IPC. Partitions labeled OTHER, JAVA, and POSIX; CPU shares shown: 50%, 20%, 30%]

Shortcomings include
> Detailed rate-monotonic analysis (RMA) required to verify system
> Overload of IPC FIFO input queue
  - Failures include denial of service and CPU quota exhaustion
> Monolithic design within one partition
> Hard to retrofit to existing 1-CPU applications
> Inefficient use of total CPU: runs idle when tasks are ready
> Increased interrupt latency
> Does not address shared entities such as a file system
> Restrictive programming model: no DMA
Why?
Real-world examples of partitioning for QNX customers

Selling a portion of throughput
[Figure: a router split between Customer 1 and Customer 2, each with application, TCP/IP, and protocol blocks; shares shown: 50% and 50%]
> Customer 2’s network load cannot hurt customer 1

Security: Untrusted Applications
[Figure: a car system with NAV, radio, etc. … alongside a 3rd-party application (malware?); shares shown: 80% and 20%]
> Downloaded applications from the Web cannot hurt the system

Locked System Recovery
[Figure: HOG App at 90%, bash at 10%]
> Emergency recovery shell. Hard-wall scheduler not required.
Do we need any new scheduler?
Why?
Evolution of schedulers
Timeline, each step followed by its “Yes, but:”
> Priority pre-emptive (SCHED_FIFO). Yes, but: system locks up.
> Timeslicing (SCHED_RR). Yes, but: backhoes and Mother’s Day.
> Time-varying priority. Yes, but: untuneable for more than 1 application.
> Really clever time-varying (SCHED_SPORADIC). Yes, but: US Military Satcom.
> Fair Share scheduling. Yes, but: hard to manage share interactions.
> Adaptive configuration. Not invented – until now.
Why?
Evolution: Lessons learned

Numerical priorities are chosen by applications but system
scheduling behavior must be designed globally

Degradation and overload: Priorities are not constants.
Importance of work depends on circumstances.
> Modes: normal operation, restart, emergency maintenance

Scheduling strategy needs to be based on unit of work, but
what we have is communicating threads.

Must measure real-time behavior.
> 0.1 % accuracy

Want to specify shares as global percentages
> Applications don’t get to pick their importance or shares. System engineers
do.

Need to throttle cpu usage without losing realtime latencies.
Design
What is Partitioning?
General Answer
> Separation of work
> To isolate:
  - CPU usage
  - memory usage
  - system resource usage
  - failures

QNX Answer
> Separation of work based on “working for common purpose”
> Adaptive Partition Scheduling
  - A global hard real-time scheduler with overload protection and CPU guarantees
> Runtime typed memory and kernel object guarantees and limits
  - With full inheritance and accounting for all children
> Persistent storage (file system) guarantees and limits
> Process model for fault isolation
> Dynamic configuration
> POSIX-compatible design which can be applied to existing systems with little or no recoding
Design
Principles
Scheduler must not trigger an overload
> Overhead may not increase with # of threads

Real-time during underload
> Same behavior as today

Real-time during overload
> At least for interrupt handling

Must also be a fair-share scheduler
> Global scheduler algorithm
> Globally configured

Must mesh with current QNX architecture
• Preemptive priority, individual thread scheduling
• Heavy use of message passing
> Easy to drop onto existing applications
> Can’t be a “bag on the side”

Simple enough for customers to use
> Engineerable
> Reconfigure on the fly

[Figure: throughput plotted against offered load]
[Placeholder on slide: insert picture of juggling watermelons here]
Overconstrained problem?
Nope:
> Implemented in QNX 6.3.2
> Actually works
See “How it Works” in Part 2.
Design
Adaptive Partition Scheduling

Part 2: How it works.

What it does:
> Counting time
> Who’s got time
> Real time
> Out of time
> Free time
> Borrowed time
> Equal time

How it does it
API
Why is it secure?
Why is it cool?
Design
Counting time

What does 14% CPU mean?
> CPU usage is calculated over a sliding window.
> “windowsize” is configurable as an argument to kernel at boot
[Figure: sliding window running from T = -100 ms to T = now]

Accuracy:
> Counting ticks is not enough. “Micro-billing” is used to track actual CPU utilization even when threads don’t use their whole timeslice.
> Micro- and nanosecond resolution
> Threads are billed based on real usage, not statistics
> Tradeoff maximum READY-state latency with accuracy of CPU budgeting
  - 100 ms window -> 1% accuracy or better.
> Internal arithmetic accurate to 0.5% or better

Partition usage
> ns of CPU time executed during last sliding window, expressed as percentage

Partition budget
> Guaranteed percentage of CPU time, balanced over sliding window
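
The definition of partition usage above can be illustrated with a small model. This is a sketch only, not the kernel's implementation: the class name, method names, and the list of run intervals are assumptions made for illustration, and intervals are assumed to be billed in chronological order.

```python
from collections import deque


class SlidingWindowUsage:
    """Hypothetical model of CPU time billed to one partition over a sliding window."""

    def __init__(self, window_ns=100_000_000):  # 100 ms window, as on the slide
        self.window_ns = window_ns
        self.runs = deque()  # (start_ns, end_ns) intervals actually executed

    def bill(self, start_ns, end_ns):
        """Record a run of any length, not only whole ticks (micro-billing)."""
        self.runs.append((start_ns, end_ns))

    def usage_percent(self, now_ns):
        """CPU time executed in [now - window, now], expressed as a percentage."""
        window_start = now_ns - self.window_ns
        # Forget intervals that ended before the window began.
        while self.runs and self.runs[0][1] <= window_start:
            self.runs.popleft()
        # Count only the part of each interval that lies inside the window.
        used = sum(min(end, now_ns) - max(start, window_start)
                   for start, end in self.runs if start < now_ns)
        return 100.0 * used / self.window_ns


usage = SlidingWindowUsage()
usage.bill(0, 14_000_000)                    # ran for 14 ms
at_100ms = usage.usage_percent(100_000_000)  # 14.0: "14% CPU"
at_110ms = usage.usage_percent(110_000_000)  # 4.0: the first 10 ms slid out
```

Because the window slides, the same 14 ms of execution counts for less as it ages, which is why a budget is described as balanced over the window and not per tick.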
Who’s got time: Partition Membership

QNX Scheduler Partition
> Set of threads working for a common purpose
  - Set of initial processes/threads designated by customer
    • + all subsequent children
  - Guest members
    • Server’s CPU time billed to client
    • Resmgr threads temporarily join partition of sender thread
> Not locked to a static set of code.
> OS services are part of whatever partition they need to be.

Hence the name “adaptive partition”
Design
Who’s got time: Partition Inheritance
[Figure: a file system process whose receive threads handle messages from Adaptive Partition 1 (Multi-media) and Adaptive Partition 2 (Java application); threads are labeled by priority and each partition shows “CPU budget available”]

Resource manager threads work on behalf of sender
> Priority and adaptive partition are inherited on receive

Execution time in server billed to client’s partition
> This allows proper accounting for shared resources
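
The inheritance rule can be sketched as a toy model. Everything named here (the `Thread` type, `receive`, `run`, the thread names, and the "System" partition label) is hypothetical and stands in for kernel behavior that the slide only describes in words.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    name: str
    partition: str
    priority: int


def receive(server: Thread, sender: Thread) -> Thread:
    """Model of a receive: the server thread takes on the sender's partition and priority."""
    return Thread(server.name, sender.partition, sender.priority)


# CPU time billed per partition, in nanoseconds.
billed_ns = {"Multi-media": 0, "Java application": 0, "System": 0}


def run(thread: Thread, duration_ns: int) -> None:
    """Bill execution time to whatever partition the thread currently belongs to."""
    billed_ns[thread.partition] += duration_ns


fs_worker = Thread("fs-worker", "System", 10)
player = Thread("player", "Multi-media", 11)

serving = receive(fs_worker, player)  # file system now works on the player's behalf
run(serving, 2_000_000)               # 2 ms of file system work billed to Multi-media
```

In this model the file system never needs its own CPU share: its cost shows up in the budget of whichever client asked for the work.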
Design
Real time: Behavior under normal load
[Figure: Blocked, Ready, and Running threads, labeled by priority, in Adaptive Partition 1 (Multi-media) and Adaptive Partition 2 (Java application); both partitions show “CPU budget available”]

Hard real-time scheduler under normal load
> Running thread selected as highest priority READY thread
> No delay on scheduling if adaptive partition has budget
Design
Out of time: Behavior under overload
[Figure: Blocked, Ready, and Running threads, labeled by priority, in Adaptive Partition 1 (Multi-media) and Adaptive Partition 2 (Java application); one partition shows “CPU budget exceeded”, the other “CPU budget available”]

Highest priority READY thread in partition with budget runs
> No delay on scheduling if adaptive partition has budget
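
The normal-load and overload rules differ only in which partitions still have budget. A minimal sketch, assuming a hypothetical `pick_thread` helper and made-up thread names, with READY threads given as (priority, partition, name) tuples:

```python
def pick_thread(ready, has_budget):
    """Pick the highest-priority READY thread among partitions that still have budget.

    ready: list of (priority, partition, name) tuples.
    has_budget: dict mapping partition name to True/False.
    """
    eligible = [t for t in ready if has_budget[t[1]]]
    return max(eligible, key=lambda t: t[0]) if eligible else None


ready = [
    (10, "Multi-media", "decoder"),
    (9, "Java application", "gc"),
    (4, "Java application", "ui"),
]

# Normal load: every partition has budget, so plain priority scheduling applies.
normal = pick_thread(ready, {"Multi-media": True, "Java application": True})

# Overload: Multi-media has used up its budget, so its priority-10 thread waits.
overload = pick_thread(ready, {"Multi-media": False, "Java application": True})
```

Here `normal` is the priority-10 decoder thread and `overload` is the priority-9 thread from the Java partition.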
Design
Free Time: Behavior with unused CPU
[Figure: Blocked and Running threads, labeled by priority, in Adaptive Partition 1 (Multi-media), Adaptive Partition 2 (Java application), and Adaptive Partition 3; partitions 1 and 2 show “CPU budget exceeded”, partition 3 shows “CPU budget available”]

If no partitions with remaining budget have READY threads, highest priority READY thread is selected to run from other partitions
> This allows “free” time to be given based upon priority
  - “Free” time is still accounted and may have to be paid back (for example, if partition 3 becomes ready within 1 averaging window)
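
The free-time rule adds a fallback when no partition with budget has anything to run. This is a hedged sketch with a hypothetical function name and thread names; it models only the selection, and assumes the time handed out is still billed to the running thread's partition as the slide states.

```python
def pick_with_free_time(ready, has_budget):
    """Return (thread, is_free_time) or None.

    ready: list of (priority, partition, name) tuples.
    Partitions with budget are preferred; if none of them has a READY thread,
    the highest-priority READY thread from any partition runs on "free" time.
    """
    in_budget = [t for t in ready if has_budget[t[1]]]
    pool = in_budget or ready
    if not pool:
        return None
    return max(pool, key=lambda t: t[0]), not in_budget


budgets = {"Multi-media": False, "Java application": False, "Partition 3": True}

# Partition 3 has budget but nothing READY, so the others compete by priority.
ready = [(10, "Multi-media", "decoder"), (9, "Java application", "gc")]
chosen, is_free = pick_with_free_time(ready, budgets)

# As soon as partition 3 has a READY thread, it wins despite its lower priority.
ready.append((6, "Partition 3", "logger"))
chosen_later, is_free_later = pick_with_free_time(ready, budgets)
```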
Design
Borrowed Time: Critical Threads
[Figure: Blocked, Ready, and Running threads, labeled by priority, in Adaptive Partition 1 (Multi-media) and Adaptive Partition 2 (Air Bag Control); a critical thread at priority 30 is running; one partition shows “CPU budget exceeded”, the other “CPU budget available”]

Critical threads still run (based on priority) even if partition has no budget
> Critical threads provide deterministic scheduling even in overload
> Critical threads are given critical budget and can go into short-term debt
  - Critical time is accounted and has to be repaid
  - Exceeding critical budget is considered an error and causes notification/action
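
A rough model of critical-budget accounting follows. The class, its fields, and the choice to represent the notification as a simple `overrun` flag are assumptions for illustration; the slide does not say how the kernel stores or reports these values.

```python
class Partition:
    """Hypothetical accounting for one partition within one averaging window."""

    def __init__(self, budget_ms, critical_budget_ms):
        self.budget_ms = budget_ms
        self.critical_budget_ms = critical_budget_ms
        self.used_ms = 0           # all time billed, including critical time
        self.critical_used_ms = 0  # time run while out of normal budget
        self.overrun = False       # stands in for the notification/action

    def can_run(self, is_critical):
        """Non-critical threads need normal budget; critical threads may borrow."""
        return self.used_ms < self.budget_ms or is_critical

    def bill(self, ms, is_critical):
        if self.used_ms >= self.budget_ms and is_critical:
            # Running on borrowed time: charge the critical budget as well.
            self.critical_used_ms += ms
            self.overrun = self.critical_used_ms > self.critical_budget_ms
        # Always accounted, so the debt is repaid out of later budget.
        self.used_ms += ms


airbag = Partition(budget_ms=10, critical_budget_ms=3)
airbag.bill(10, is_critical=False)     # normal budget is now exhausted
blocked = not airbag.can_run(False)    # ordinary threads must wait
allowed = airbag.can_run(True)         # a critical thread may still run
airbag.bill(2, is_critical=True)       # within the 3 ms critical budget
within_limit = not airbag.overrun
airbag.bill(2, is_critical=True)       # 4 ms of critical time: over the limit
```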
Design
Equal time.

How to choose between partitions of equal priority
> Unimportant?
> Many threads run at default priority, therefore equal priority

Possible algorithms:
> Round robin
> Favor partition with most free time
> Favor longest waiter

Requirement:
> Minimize latencies during underload
> Would be nice (WBN): divide free time by % CPU share.

Solution:
> Interleave partitions by ratio of partition shares
> We found a clever way to do that, so it’s in the patent.
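
The patented method is not described on the slide, so the sketch below does not reproduce it. It swaps in a well-known technique, smooth weighted round-robin, purely to show what interleaving by ratio of shares can look like. The function name and the partition names A, B, and C are hypothetical.

```python
from collections import Counter


def interleave(shares, slots):
    """Smooth weighted round-robin (a stand-in, not the patented QNX algorithm).

    shares: dict mapping partition name to its share (for example, percent).
    Returns the order in which partitions are chosen over `slots` turns.
    """
    current = {p: 0 for p in shares}
    total = sum(shares.values())
    order = []
    for _ in range(slots):
        # Every partition earns credit in proportion to its share.
        for p in shares:
            current[p] += shares[p]
        # The partition with the most credit runs and pays back one full round.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        order.append(chosen)
    return order


order = interleave({"A": 50, "B": 30, "C": 20}, slots=10)
counts = Counter(order)  # A gets 5 turns, B gets 3, C gets 2
```

The turns are spread out instead of being handed out in blocks, which is the property the requirement about latency is asking for.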
How it does it
[Figure: the microkernel (uKernel) with the libmod_aps.a module, showing process creation, messaging, per-partition ready queues, and the scheduler entry points: clock interrupt handler, ready(), block(), select_thread()]

Scheduler:
For all partitions p, define
    m(p) -> (bud(p) || crit(p), prio(p), run_t/wsize/bud(p))
Then schedule partition ps, defined by
    ps -> rdy(ps) and (m(ps) < m(pi)) for all i != s
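
One possible reading of the merit function, written out as a sketch. The slide does not define its terms, so the following are assumptions: bud(p) means the partition has budget left, crit(p) means it has a critical thread able to run, prio(p) is the priority of its best READY thread, and run_t/wsize/bud(p) is the fraction of its budget already consumed. The tuple is arranged so that a smaller value is better, to match m(ps) < m(pi).

```python
def merit(p, window_ns):
    """Hypothetical merit tuple for partition p; smaller compares as better."""
    has_time = p["has_budget"] or p["critical_ready"]
    used_fraction = p["run_ns"] / window_ns / p["budget_fraction"]
    # False sorts before True, so partitions that may run come first;
    # priority is negated so that higher priority compares as smaller.
    return (not has_time, -p["top_prio"], used_fraction)


def select_partition(partitions, window_ns=100_000_000):
    """Choose the READY partition with the best (smallest) merit, or None."""
    ready = [p for p in partitions if p["ready"]]
    return min(ready, key=lambda p: merit(p, window_ns), default=None)


def partition(name, prio, run_ms, budget, has_budget=True, critical=False):
    """Helper to build an example partition record."""
    return {"name": name, "ready": True, "top_prio": prio,
            "run_ns": run_ms * 1_000_000, "budget_fraction": budget,
            "has_budget": has_budget, "critical_ready": critical}


a = partition("A", prio=10, run_ms=10, budget=0.5)   # used 20% of its budget
b = partition("B", prio=10, run_ms=10, budget=0.2)   # used 50% of its budget
c = partition("C", prio=30, run_ms=40, budget=0.3, has_budget=False)

first = select_partition([a, b, c])["name"]    # "A": equal priority, less budget used
c["critical_ready"] = True
second = select_partition([a, b, c])["name"]   # "C": critical thread at priority 30
```

The first tuple element decides most comparisons on its own, which is consistent with the short-circuit evaluation of the merit function mentioned under overhead.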
Algorithm summary
- A partition sees real-time behaviour when under budget
  - Only limited when another partition must get its guarantee
- Fair-share scheduling at or over budget
- Equal prio partitions are interleaved
  - Budgets balanced in much less than windowsize
- Free time (above budget) is given out:
  - By default: in real-time mode
  - Optionally: by ratio of budgets
- Critical threads run even if out of budget
  - Criticality is inherited
Overhead: Fancy, but is it fast?

Scheduling overhead increases with:
> number of partitions
> number of messages/sec
> number of clock interrupts/sec, i.e. ClockPeriod()
> * does not increase with number of threads *

Free or almost free operations:
> Inheriting partition as part of message receive
> Joining a thread to a partition
> Dynamically changing budgets

Computational requirements
> 32 bit multiply, 64bit add
> *no floating point* *no divides* *no address space swapping*
*short-circuit calculation of merit function* *no inter-cpu msging on
SMP* *history-less algorithm*

Overhead typically 1% of total cpu
Design
APIs


Control of Adaptive Partitioning Scheduler is done through a
kernel API
API allows associating a thread with a partition
> Used to launch processes within a partition
> Children inherit parent’s partition

Dynamic capabilities part of design
> Budgets may be changed at run time – instant effect
> Threads may join/unjoin partitions freely


APIs to attach event triggered on critical budget overrun
Selectable security
> API is restricted to privileged processes (root)
> Must be called from within default (system) partition
> Partitions are created with budget (normal and possibly critical)

API provided to “lock down” partition configuration
> Prevent creation of new partitions or modification of budgets
API 2: Launching applications

1. Build File
> schedaps MyPartition 20
> [schedaps=MyPartition] /bin/myApp

2. Command line
> aps create -b20 MyPartition
> on -Xaps=MyPartition /bin/myApp

3. Momentics IDE4
> Drag and drop

4. #include <sys/sched_aps.h>
> Full programmatic interface: configure, get stats, launch, secure
Why is AP Secure?

AP enforces budgets every clock interrupt

Root can be required to do configuration changes

Partition creation by subdivision of parent
> It’s not possible to create a sub-partition greater than a parent
> Not even root can violate this rule

Configuration can be locked
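
The subdivision and lock rules can be modeled in a few lines. This is an illustrative sketch, not the kernel API: the class, its method names, and the use of "System" as the name of the default partition are assumptions.

```python
class PartitionTable:
    """Hypothetical model: budgets are percentages carved out of a parent."""

    def __init__(self):
        self.budgets = {"System": 100}  # the default partition starts with everything
        self.locked = False

    def create(self, name, percent, parent="System"):
        if self.locked:
            raise PermissionError("partition configuration is locked")
        if not 0 <= percent <= self.budgets[parent]:
            # A child can never be larger than what its parent still holds.
            raise ValueError("sub-partition would exceed its parent")
        self.budgets[parent] -= percent
        self.budgets[name] = percent

    def lock(self):
        """After this, no partitions can be created (budget changes would be refused too)."""
        self.locked = True


table = PartitionTable()
table.create("MyPartition", 20)    # System keeps 80%

try:
    table.create("Greedy", 90)     # more than the parent's remaining 80%
    rejected = False
except ValueError:
    rejected = True

table.lock()
total = sum(table.budgets.values())  # still 100: shares are only ever subdivided
```

Because a new budget is always taken from an existing one, the total can never exceed 100%, whoever makes the request.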
Design
Why is this cool?: Engineerable
• Identifying units of work: Partition Inheritance
  • Identify code that starts up applications
  • Inheritance figures out the rest
    > Filesystems etc. do not require separately engineered CPU share
• Global share management: % CPU
  • CPU shares defined in units customers are used to: percentage
  • Gets us off the hook for accounting for different clock speeds.
• Realtime when you need it: Critical Threads
  • Customer need not analyze budgets for OS components
  • Interrupts and important events still get handled on time.
• Secure
  > Budgets, especially critical budgets, are set globally by root, not by applications

“to err is human, but …”
Adaptive Partition Scheduling

Part 3. The Slick Demo