Real-Time Performance and Middleware for Multiprocessor and Multicore Linux Platforms*

Yuanfang Zhang, Christopher Gill, and Chenyang Lu
Department of Computer Science and Engineering, Washington University, St. Louis, MO, USA
{yfzhang, cdgill, lu}@cse.wustl.edu

15th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA 2009)
August 24-26, 2009, Beijing, China

*This research was supported in part by NSF grants CCF-0615341 (EHS), CCF-0448562 (CAREER), and CNS-0448554 (CAREER)

Motivation and Contributions

 Trend towards multi-processor and multi-core platforms affects both OS and middleware
  » Techniques designed for uni-processors need revisiting
 This research makes 3 main contributions to real-time systems on multi-processor platforms
  » A performance evaluation of relevant Linux features
  » MC-ORB middleware designed for MC/MP platforms
  » Evaluation of MC-ORB's multi-core aware RT performance


Background and Related Work

 Linux 2.6 introduced SMP and multi-core support
  » Linux 2.6.23 added the Completely Fair Scheduler (CFS)
  » However, many deployed platforms predate 2.6.23
  » We studied Linux 2.6.17 as a representative compromise
 Related research: modifying Linux, RT middleware
  » We assume unmodified COTS Linux as our middleware design point, for highly portable real-time performance
  » The differing trade-offs for uni-processor vs. multi-processor platforms motivate new middleware designs


Linux Performance: Clock Differences I

 We first evaluated clock differences between cores
  » How well do the platform and Linux maintain synchronization?
  » We used the RDTSC instruction to record clock ticks on each core
 We bounced a message back and forth between two cores
  » Used arrival TSCs (x, y, z) to measure round trip delay (RTD)
  » The results show that the cores' frequencies were well matched
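A minimal sketch of the timestamping step, assuming GCC or Clang on x86-64 Linux (our illustration; the paper's measurement harness is not shown):

    /* Read the CPU's time-stamp counter via RDTSC.
     * RDTSC places the 64-bit counter in EDX:EAX. */
    #include <stdint.h>

    static inline uint64_t read_tsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }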


Linux Performance: Clock Differences II

 We then estimated the cores' temporal offsets as δ₀ = 2y₁ − x₀ − z₀ and δ₁ = 2y₀ − x₁ − z₁
  » Figures on the right show the calculated results
     Upper: as measured at each core
     Lower: signs reversed for core 0 (shows consistent views of the offset)
 Insight 1
  » Though frequencies matched well, the average offset was ~1.3 μs
  » Motivates measuring offsets in our subsequent analyses
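A minimal sketch of that estimate (our illustration, with our own function name): x and z are the TSCs recorded on one core when the message departs and returns, and y is the TSC recorded on the other core on arrival; on perfectly aligned clocks with symmetric message delays the result would be zero.

    #include <stdint.h>

    /* delta = 2y - x - z, per the slide's formula; a nonzero result
     * reflects a temporal offset between the two cores' counters. */
    static inline int64_t offset_delta(uint64_t x, uint64_t y, uint64_t z)
    {
        return 2 * (int64_t)y - (int64_t)x - (int64_t)z;
    }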


Linux Performance: Load Balancing

Tasks   Utilization   Imbalances          Overhead per imbalance (ns)    Overhead
                      detected in 5 min   Min      Mean      Max         (total μs)
10      0.6           211                 405      983       1899        207
30      0.6           210                 566      1178      2120        247
10      1.0           588                 536      854       1463        509
30      1.0           596                 671      1124      2069        670

 Can thread affinity thwart (bad) Linux rebalancing?
  » We ran sets of 10 vs. 30 tasks (all bound to one core to prevent rebalancing), with total utilizations of 0.6 vs. 1.0

 Insight 2
  » Though the overhead is small and amortized, compiling the kernel with rebalancing off appears preferable
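A minimal sketch of the binding used in this experiment, via the standard Linux affinity API (our illustration, not the paper's harness):

    #define _GNU_SOURCE
    #include <sched.h>

    /* Restrict the calling thread to a single core so the kernel's
     * load balancer cannot move it; pid 0 means "this thread". */
    static int pin_to_core(int core)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(core, &mask);
        return sched_setaffinity(0, sizeof(mask), &mask);
    }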


Linux Performance: Migration Strategies

 Two key migration strategies
  » A thread migrates itself
  » A separate manager thread migrates it
 Mechanisms/cost
  » The affinity mask is always updated
  » For a running thread, the change updates run queues and may invoke the scheduler
 Three cases (thread states shown across cores 0-3 in the figures)
  » Case 1: a running thread modifies its own affinity
  » Case 2: a separate manager thread modifies a running thread's affinity
  » Case 3: a separate manager thread modifies a sleeping thread's affinity
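All three cases reduce to an affinity-mask update; a minimal sketch in terms of the pthreads API (our illustration; the measured mechanisms sit below this API in the kernel):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Move `target` to `core` by rewriting its affinity mask.
     * Case 1: target == pthread_self() (self-migration).
     * Cases 2/3: called from a separate manager thread on a
     * running or sleeping target, respectively. */
    static int migrate_to_core(pthread_t target, int core)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(core, &mask);
        return pthread_setaffinity_np(target, sizeof(mask), &mask);
    }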



Linux Performance: Migration Costs

 Insight 3
  » Every strategy risks a non-negligible thread migration cost
     manager migrates running thread: ~18 to 36 μs
     self-migration: ~16 to 45 μs
     manager migrates sleeping thread: ~4 to 10 μs
  » Motivates binding task threads into core-specific thread pools
  » Motivates an ORB architecture with a separate manager thread (next)



Conventional Middleware Architecture

 Traditional single-CPU approaches benefit from patterns like leader/followers to reduce costly hand-offs
  » E.g., TAO, nORB
 However, multiple cores increase the risk of migration:
  1. The leader invokes TA (and AC) for the task
  2. The leader picks a new leader
  3. The new leader may need to move the old leader
  4. The old leader runs the task (on the appropriate core)


MC-ORB Middleware Architecture

 In contrast, MC-ORB's threading architecture leverages hand-offs to avoid thread migrations
  » Key trade-off: copying/locking costs vs. migration costs
  1. The request is queued
  2. The manager thread reads requests in priority order
  3. The manager invokes TA w/AC
  4. The manager picks a thread from the pool
  5. That thread runs the task
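A simplified, self-contained sketch of this five-step flow (our reconstruction for illustration, not MC-ORB's source; all names are ours, one pinned worker per core stands in for the core-specific thread pools, and the allocation step is faked):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NCORES 2

    typedef struct { int priority; int id; } request_t;

    /* One pinned worker per core, fed through a one-slot mailbox;
     * hand-off happens under a lock instead of migrating a thread. */
    static struct core_worker {
        pthread_mutex_t m;
        pthread_cond_t  cv;
        request_t req;
        int pending;
    } pool[NCORES];

    static void *worker(void *arg)
    {
        long core = (long)arg;
        struct core_worker *w = &pool[core];
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET((int)core, &mask);
        /* Pool threads pin themselves once, so no later migration occurs. */
        pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
        for (;;) {
            pthread_mutex_lock(&w->m);
            while (!w->pending)
                pthread_cond_wait(&w->cv, &w->m);
            request_t r = w->req;
            w->pending = 0;
            pthread_mutex_unlock(&w->m);
            /* 5. The pool thread runs the task on its own core. */
            printf("task %d (prio %d) ran on core %ld\n", r.id, r.priority, core);
        }
        return NULL;
    }

    /* 4. Hand the request to the chosen core's pool thread (no migration). */
    static void dispatch(int core, request_t r)
    {
        struct core_worker *w = &pool[core];
        pthread_mutex_lock(&w->m);
        w->req = r;
        w->pending = 1;
        pthread_cond_signal(&w->cv);
        pthread_mutex_unlock(&w->m);
    }

    int main(void)
    {
        pthread_t tid;
        for (long c = 0; c < NCORES; c++) {
            pthread_mutex_init(&pool[c].m, NULL);
            pthread_cond_init(&pool[c].cv, NULL);
            pthread_create(&tid, NULL, worker, (void *)c);
        }
        /* 1.-3. In MC-ORB the manager drains a priority-ordered request
         * queue and runs task allocation (TA) with admission control (AC);
         * here the "allocation" is faked as id % NCORES. */
        request_t reqs[] = { {2, 0}, {1, 1}, {3, 2} };
        for (int i = 0; i < 3; i++)
            dispatch(reqs[i].id % NCORES, reqs[i]);
        sleep(1); /* let workers run before the process exits */
        return 0;
    }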


Real-Time ORB Performance Evaluation

 To gauge the performance costs of our middleware architecture we examined four key issues
  » Allocate on same vs. other core (as manager thread)
  » Thread available vs. migration needed
  » Reallocation is vs. is not required to allocate the task
  » New task is admitted vs. rejected
 We evaluated our middleware architecture both with (MC-ORB) and without (MC-ORB*) rejection
  » MC-ORB* compared to nORB (designed for uniprocessors)
  » Varied utilization granularity & magnitude (10 task sets)
  » We measured how many of the task sets missed a deadline


Overheads for MC-ORB’s Extensions (μs)

Scenario   Minimum   Mean   Maximum
1          43        55     109
2          42        58     111
3          50        64     121
4          222       235    289
5          39        50     107

Scenarios used for overhead evaluation:
  1. New task on same core as manager
  2. New task on different core (similar cost to 1)
  3. Sleeping thread moved from other core to run the new task
  4. All running tasks reallocated to make room for the new task
  5. The new task is rejected (low cost, but it's pure overhead)


Fraction of Workloads w/ Deadline Misses

[Table: fraction of the 10 workloads with deadline misses, for nORB vs. MC-ORB* at total utilizations 1.4, 1.5, and 1.6, and balance factors 0.2 through 0.5]

 With rejection, >94% of tasks were admitted by MC-ORB, and all admitted tasks met all deadlines
 Without rejection (where "+" marks cases showing the need for AC), MC-ORB*
  » Outperformed nORB in 6 cases (green)
  » Performed the same as nORB in 4 cases (grey)
  » Underperformed nORB in 2 cases (red)
  » Less balanced workloads emphasize MC-ORB*'s improvement over nORB


Concluding Remarks

 COTS OS evaluations
  » Measurement on specific target platforms is crucial
  » Behaviors of hardware and OS mechanisms are important
 Middleware architectures
  » OS evaluations establish design trade-off parameters
  » Prior design decisions may be reversed on new platforms
 Performance evaluations bear out our new design
  » Even without admission control, the MC-ORB architecture helps
  » With AC, MC-ORB admitted high utilization and met all deadlines
 MC-ORB open-source download & build instructions
  » http://www.cse.wustl.edu/~yfzhang/MC-ORB.html
