Adaptive Partition Scheduling
Part 1: Why we did it
Cool stuff from QNX
A.Danko
July 16, 2015

Yet another thread scheduler. Why?
The story begins with a customer:
“We can use QNX! We need ARINC653!!!!!! HELP!”

Why? Shiny New Toy
Partition scheduler (ARINC 653)
  > Very popular in fixed military systems
  > Each partition is guaranteed a percentage of CPU
  > Priorities are only meaningful within a partition
[Figure: ARINC 653 Partition Scheduler and “special” IPC; partitions OTHER, JAVA, POSIX; shares 50%, 20%, 30%]
Shortcomings include
  > Detailed RMA required to verify system
  > Overload of IPC FIFO input queue
  > Failures include denial of service and CPU quota exhaustion
  > Monolithic design within one partition
  > Hard to retrofit to existing 1-cpu applications.
  > Inefficient use of total CPU. Runs idle when tasks are ready.
  > Increased interrupt latency
  > Does not address shared entities such as a file system
  > Restrictive programming model. No DMA

Why? Real-world examples of partitioning for QNX customers
Selling a portion of throughput
  [Figure: Router with Customer 1 and Customer 2 TCP/IP protocol stacks; shares 50% and 50%]
  > Customer 2’s network load cannot hurt customer 1
Security: Untrusted Applications
  [Figure: Car with NAV, Radio, etc … applications and a 3rd party (malware?) application; shares 80% and 20%]
  > Downloaded applications from the WEB cannot hurt the system
Locked System Recovery
  [Figure: HOG App 90%, bash 10%]
  > Emergency recovery shell
Hard-wall scheduler not required.
Do we need any new scheduler?

Why? Evolution of schedulers
Timeline: Yes, but:
  > priority pre-emptive (SCHED_FIFO): System locks up
  > Timeslicing (SCHED_RR): Backhoes and Mother’s day
  > Time-varying priority (SCHED_SPORADIC): Untuneable for more than 1 application.
  > Really clever time-varying: US Military Satcom
  > Fair Share scheduling: Hard to manage share interactions.
  > Adaptive configuration: Not invented, until now.

Why?
Evolution: Lessons learned
Numerical priorities are chosen by applications, but system scheduling behavior must be designed globally
Degradation and overload: Priorities are not constants. Importance of work depends on circumstances.
  > Modes: normal operation, restart, emergency maintenance
Scheduling strategy needs to be based on unit of work, but what we have is communicating threads.
Must measure real-time behavior.
  > 0.1 % accuracy
Want to specify shares as global percentages
  > Applications don’t get to pick their importance or shares. System engineers do.
Need to throttle cpu usage without losing realtime latencies.

Design: What is Partitioning?
General Answer
  Separation of work
  To isolate:
    > cpu usage
    > memory usage
    > system resource usage
    > Failures
QNX Answer
  Separation of work based on “working for common purpose”
  POSIX compatible design which can be applied to existing systems with little or no recoding
  Adaptive Partition Scheduling
    > A global hard real-time scheduler with overload protection and CPU guarantees
  Runtime typed memory and kernel object guarantees and limits
    > With full inheritance and accounting for all children
  Persistent storage (file system) guarantees and limits
  Process model for fault isolation
  Dynamic configuration

Design Principles
Scheduler must not trigger an overload
  > Overhead may not increase with # of threads
Real-time during underload
  > Same behavior as today
Real-time during overload
  > At least for interrupt handling
Must also be a fair-share scheduler
  > global scheduler algorithm
  > globally configured
Must mesh with current QNX architecture
  > Preemptive priority, individual thread scheduling
  > Heavy use of message passing
  > Easy to drop onto existing applications
  > Can’t be a “bag on the side”
Simple enough for customers to use
  > Engineerable
  > Reconfigure on the fly
[Figure: Throughput plotted against Offered load]
[Insert picture of Juggling Watermelons here]

Overconstrained problem?
Nope:
Implemented in QNX 6.3.2
Actually Works
See “How it Works” in Part 2.

Design: Adaptive Partition Scheduling
Part 2: How it works.
What it does:
  > Counting time
  > Who’s got time
  > Real time
  > Out of time
  > Free time
  > Borrowed time
  > Equal time
How it does it
API
Why is it secure?
Why is it cool?

Design: Counting time
What does 14% cpu mean?
  > CPU usage is calculated over a sliding window.
  > “windowsize” is configurable as an argument to kernel at boot
  [Figure: sliding window from T = -100ms to T = now]
Accuracy:
  > Tradeoff maximum READY-state latency with accuracy of CPU budgeting
  > 100ms window -> 1% accuracy or better.
  > Internal arithmetic accurate to 0.5% or better
  > Counting ticks is not enough. “Micro-billing” is used to track actual CPU utilization even when threads don’t use their whole timeslice.
  > micro- and nano-second resolution
  > Threads are billed based on real usage, not statistics
Partition usage
  > ns cpu time executed, during last sliding window, expressed as percentage
Partition budget
  > Guaranteed percentage of cpu time, balanced over sliding window

Who’s got time: Partition Membership
QNX Scheduler Partition
  > Set of threads working for a common purpose
  > Set of initial processes/threads designated by customer
    • + all subsequent children
  > Guest members
    • Server’s cpu time billed to client
    • Resmgr threads temporarily join partition of sender thread
  > Not locked to a static set of code.
  > OS services are part of whatever partition they need to be.
Hence the name “adaptive partition”.

Design: Who’s got time: Partition Inheritance
[Figure: threads in Adaptive Partition 1 (Multi-media) and Adaptive Partition 2 (Java application), both with CPU budget available, send messages to the receive threads of a File System process; threads are labeled by priority]
Resource manager threads work on behalf of sender
Priority and adaptive partition are inherited on receive
  > Execution time in server billed to client’s partition
This allows proper accounting for shared resources

Design: Real time: Behavior under normal load
[Figure: Blocked, Ready, and Running threads, labeled by priority, in Adaptive Partition 1 (Multi-media) and Adaptive Partition 2 (Java application); both partitions have CPU budget available]
Hard real-time scheduler under normal load
Running thread selected as highest priority READY thread
No delay on scheduling if adaptive partition has budget

Design: Out of time: Behavior under overload
[Figure: Blocked, Ready, and Running threads, labeled by priority, in Adaptive Partition 1 (Multi-media) and Adaptive Partition 2 (Java application); one partition is marked “CPU budget exceeded”, the other “CPU budget available”]
Highest priority READY thread in Partition with budget runs
No delay on scheduling if adaptive partition has budget

Design: Free Time: Behavior with unused CPU
[Figure: Blocked and Running threads, labeled by priority; Adaptive Partition 1 (Multi-media) and Adaptive Partition 2 (Java application) have exceeded their CPU budget, Adaptive Partition 3 has CPU budget available]
If no partitions with remaining budget have READY threads, highest priority READY thread is selected to run from other partitions
This allows “free” time to be given based upon priority
  > “Free” time is still accounted and may have to be paid back (for example, if partition 3 becomes ready within 1 averaging window)

Design: Borrowed Time: Critical Threads
[Figure: Blocked, Ready, and Running threads, labeled by priority, including a Critical Thread, in Adaptive Partition 1 (Multi-media) and Adaptive Partition 2 (Air Bag Control); one partition is marked “CPU budget exceeded”, the other “CPU budget available”]
Critical threads still run (based on priority) even if partition has no budget
Critical threads provide deterministic scheduling even in overload
Critical threads are given critical budget and can go into short-term debt
  > Critical time is accounted and has to be repaid
  > Exceeding critical budget is considered an error and causes notification/action

Design: Equal time.
How to choose between partitions of equal priority
  > Unimportant?
  > Many threads run at default priority, therefore equal priority
Possible algorithms:
  > round robin
  > favor partition with most free time
  > favor longest waiter
Requirement:
  > Minimize latencies during underload
  > WBN: divide free time by % cpu share.
Solution:
  • Interleave partitions by ratio of partition shares
  • We found a clever way to do that, so it’s in the patent.

How it does it
[Figure: block diagram of the uKernel and libmod_aps.a; labels: Process creation, messaging, Scheduler, clock intr handler, Per-partition Ready Q, ready(), block(), select_thread()]
For all partitions p:
  Def m(p) -> (bud(p) || crit(p), prio(p), run_t/wsize/bud(p))
Then schedule p_s:
  Def p_s -> rdy(p_s) and (m(p_s) < m(p_i)) for all i != s

Algorithm summary
- A partition sees real-time behaviour when under budget
  - Only limited when another partition must get its guarantee
- Fair-share scheduling at or over budget
  - Equal prio partitions are interleaved
  - Budgets balanced in much less than windowsize
- Free time (above budget) is given out:
  - By default: in real-time mode
  - Optionally: by ratio of budgets
- Critical threads run even if out of budget
  - Criticality is inherited

Overhead: Fancy, but is it fast?
Scheduling overhead increases with:
  > number of partitions
  > number of messages/sec
  > number of clock interrupts/sec, i.e.
    ClockPeriod()
  > *does not increase with number of threads*
Free or almost free operations:
  > Inheriting partition as part of message receive
  > Joining a thread to a partition
  > Dynamically changing budgets
Computational requirements
  > 32 bit multiply, 64 bit add
  > *no floating point*
  > *no divides*
  > *no address space swapping*
  > *short-circuit calculation of merit function*
  > *no inter-cpu msging on SMP*
  > *history-less algorithm*
Overhead typically 1% of total cpu

Design: APIs
Control of Adaptive Partitioning Scheduler is done through a kernel API
API allows associating a thread with a partition
  > Used to launch processes within a partition
  > Children inherit parent’s partition
Dynamic capabilities part of design
  > Budgets may be changed at run time (instant effect)
  > Threads may join/unjoin partitions freely
APIs to attach event triggered on critical budget overrun
Selectable security
  > API is restricted to privileged processes (root)
  > Must be called from within default (system) partition
  > Partitions are created with budget (normal and possibly critical)
API provided to “lock down” partition configuration
  > Prevent creation of new partitions or modification of budgets

API 2: Launching applications
1. Build File
  > schedaps MyPartition 20
  > [schedaps=MyPartition] /bin/myApp
2. Command line
  > aps create -b20 MyPartition
  > on -Xaps=MyPartition /bin/myApp
3. Momentics IDE4
  > Drag and drop
4. #include <sys/sched_aps.h>
  > Full programmatic interface: configure, get stats, launch, secure

Why is AP Secure?
AP enforces budgets every clock interrupt
Root can be required to do configuration changes
Partition creation by subdivision of parent
  > It’s not possible to create a sub-partition greater than a parent
  > Not even root can violate this rule
Configuration can be locked

Design: Why is this cool?: Engineerable
Identifying units of work: Partition Inheritance
  > Identify code that starts up applications
  > Inheritance figures out the rest
  > Filesystems etc. do not require separately engineered cpu share
  > Customer need not analyze budgets for OS components
Global share management: % cpu
  > cpu shares defined in units customers are used to: Percentage
  > gets us off the hook for accounting for different clock speeds.
Realtime when you need it: Critical Threads
  > Interrupts and important events still get handled on time.
Secure
  > Budgets, especially critical budgets, are set globally by root, not by applications
“to err is human, but …”

Adaptive Partition Scheduling
Part 3. The Slick Demo
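
Recap sketch: the selection rule from Part 2 (the merit tuple on the “How it does it” slide and the algorithm summary) can be modeled in a few lines of Python. This is only an illustration under stated assumptions, not QNX code. The names Thread, Partition, merit and WINDOW_NS are hypothetical, and select_thread here merely borrows the name shown on the slide. The sketch orders the tuple so that a larger value wins, which is one possible reading of the comparison on the slide. It uses floating point and division for readability, whereas the slides state that the real scheduler uses neither. The caller supplies the time already billed in the current window, so the sliding of the window itself is not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

WINDOW_NS = 100_000_000  # assumed 100 ms averaging window, as on the "Counting time" slide


@dataclass
class Thread:
    name: str
    priority: int
    critical: bool = False  # critical threads may run even when the partition is out of budget


@dataclass
class Partition:
    name: str
    budget_pct: float                 # guaranteed share of the window, in percent
    used_ns: int = 0                  # CPU time billed to this partition in the current window
    ready: List[Thread] = field(default_factory=list)

    def budget_used(self) -> float:
        """Fraction of the budget consumed (run_t / wsize / bud on the slide)."""
        budget_ns = WINDOW_NS * self.budget_pct / 100.0
        return float("inf") if budget_ns == 0 else self.used_ns / budget_ns

    def top_ready(self) -> Optional[Thread]:
        """Highest-priority READY thread of this partition, if any."""
        return max(self.ready, key=lambda t: t.priority, default=None)


def merit(p: Partition) -> Tuple[bool, int, float]:
    """Merit tuple for a partition that has at least one READY thread.

    Compared left to right: (has budget or is critical, priority,
    least budget consumed). Larger is better in this sketch.
    """
    t = p.top_ready()
    return (p.budget_used() < 1.0 or t.critical, t.priority, -p.budget_used())


def select_thread(partitions: List[Partition]) -> Optional[Tuple[Partition, Thread]]:
    """Pick the partition with the best merit and return its top READY thread."""
    runnable = [p for p in partitions if p.ready]
    if not runnable:
        return None
    best = max(runnable, key=merit)
    return best, best.top_ready()


if __name__ == "__main__":
    mm = Partition("multimedia", 50, used_ns=60_000_000, ready=[Thread("decode", 11)])
    java = Partition("java", 50, used_ns=10_000_000, ready=[Thread("gc", 9)])
    # Overload: multimedia is over budget, so the lower-priority java thread is chosen.
    print(select_thread([mm, java])[1].name)
```

Because the first tuple element is false for every partition that is out of budget and not critical, the same comparison also covers the “free time” case: when no partition with budget has a READY thread, priority alone decides.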