Transcript Document
Exploiting Scratchpad-aware Scheduling on VLIW Architectures for High-Performance Real-Time Systems
Yu Liu and Wei Zhang
Department of Electrical and Computer Engineering, Virginia Commonwealth University
HPEC'12, Waltham, MA

Overview
• A time-predictable two-level SPM (scratchpad memory) based architecture is proposed for single-core VLIW (Very Long Instruction Word) microprocessors.
• An ILP-based static memory-object assignment algorithm is extended to support multi-level SPMs without compromising the time predictability of SPMs.
• An SPM-aware scheduling technique is developed to improve the performance of the proposed VLIW architecture.

2-Level Cache-based Architecture
• Two separate L1 caches store instructions and data, isolating the interference between them.
• One unified L2 cache, slower than the L1 caches but larger, trades off speed against size.
• Hierarchy: Microprocessor → L1 I-Cache / L1 D-Cache → L2 Unified Cache → Main Memory.

2-Level SPM-based Architecture
• Two separate L1 SPMs store instructions and data, and one unified L2 SPM is larger but slower.
• Hierarchy: Microprocessor → L1 I-SPM / L1 D-SPM → L2 Unified SPM → Main Memory.
• No replacement occurs in any higher-level memory of this architecture.

ILP-based Static Allocation
• The ILP-based static allocation method is used to allocate memory objects to the multi-level SPMs, since it completely guarantees time predictability.
• The objective function maximizes the execution time saved, while the constraints are the sizes of the SPMs.
• The ILP-based method is applied three times, once for each of the three SPMs; all instruction and data objects not selected for the L1 SPMs are considered as candidates for the L2 SPM.
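The allocation step above can be sketched as a knapsack optimization: maximize the execution time saved subject to the SPM's capacity. The talk solves this with an ILP solver; for a single capacity constraint a dynamic-programming 0/1 knapsack reaches the same optimum, so the sketch below uses that instead. All object names, sizes, and cycle savings are hypothetical.

```python
# Sketch of one allocation pass (e.g., for the L1 I-SPM) as a
# 0/1 knapsack: maximize total cycles saved subject to capacity.
# The paper formulates this as an ILP; a DP knapsack is an
# equivalent solver for a single size constraint.

def allocate(objects, capacity):
    """objects: list of (name, size_bytes, cycles_saved).
    Returns (sorted selected names, total cycles saved)."""
    # best[c] = (savings, chosen names) achievable within capacity c
    best = [(0, frozenset())] * (capacity + 1)
    for name, size, save in objects:
        for c in range(capacity, size - 1, -1):  # reverse: use each object once
            cand = best[c - size][0] + save
            if cand > best[c][0]:
                best[c] = (cand, best[c - size][1] | {name})
    savings, chosen = best[capacity]
    return sorted(chosen), savings

# Hypothetical instruction objects: basic blocks and a function.
objs = [("bb1", 64, 120), ("bb2", 96, 150), ("func_f", 128, 200)]
selected, saved = allocate(objs, capacity=128)
# Objects not selected for the L1 SPM become L2 SPM candidates.
l2_candidates = [o for o in objs if o[0] not in selected]
```

Applying the pass three times, with the leftovers of the L1 passes feeding the L2 pass, mirrors the three-step allocation described above.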
Background on Load-Sensitive Scheduling
• In a cache-based architecture, it is generally hard to know the latency of each load operation statically.
• An optimistic scheduler assumes a load always hits in the cache: too aggressive, and the processor must be stalled (use-stall cycles) when a miss occurs.
• A pessimistic scheduler assumes a load always misses in the cache: leads to poor performance.

Scratchpad-Aware Scheduling
• Whenever possible, schedule a load operation with a large memory latency earlier and schedule its use operation later.
• This shortens use-stall cycles while preserving time predictability.

Memory Objects
• The instruction objects consist of basic blocks, functions, and combinations of consecutive basic blocks.
• The data objects consist of global scalar and non-scalar variables.

ILP for L1 Instruction SPM

Scratchpad-Aware Scheduling
• The load/store latencies are known in the SPM-based architecture.
• Instruction scheduling can be enhanced by exploiting the predictable load/store latencies.
• This is known as Load-Sensitive Scheduling for VLIW architectures [Hardnett et al., GIT, 2001].

Evaluation Methodology
• We evaluate the performance and energy consumption of our SPM-based architecture compared to the cache-based architecture.
• Eight real-time benchmarks are selected for this evaluation.
• We simulate the proposed two-level SPM-based architecture on a VLIW processor based on the HPL-PD architecture.

Cache and SPM Configurations

Evaluation Framework
• The framework of our two-level SPM-based architecture for the single-core CPU evaluation.

Results (SPMs vs. Caches)
• The WCET comparison (L1 size: 128 bytes, L2 size: 256 bytes), normalized to SPM.
• The energy consumption comparison (L1 size: 128 bytes, L2 size: 256 bytes), normalized to SPM.

Sensitivity Study

  Level            Setting 1 (S1)   Setting 2 (S2)   Setting 3 (S3)
  L1 Instruction   128              256              512
  L1 Data          128              256              512
  L2 Shared        256              512              1024

Sensitivity WCET Results
• The WCET comparison among the SPMs with different size settings.
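The use-stall reasoning above can be made concrete: a use must stall whenever its distance from the producing load is shorter than the load's latency, and in the SPM-based architecture that latency is known statically, so the scheduler can hoist the load far enough to cover it. The latency value and cycle numbers below are hypothetical.

```python
# Sketch of why hoisting a long-latency load removes use-stall
# cycles. The stall count is the part of the load latency not
# covered by the load-to-use distance in the schedule.

def use_stall_cycles(load_cycle, use_cycle, load_latency):
    """Stall cycles inserted before the use when the load's result
    is not yet ready (distance shorter than the latency)."""
    distance = use_cycle - load_cycle
    return max(0, load_latency - distance)

L2_SPM_LATENCY = 4  # hypothetical L2 SPM access latency, in cycles

# Naive schedule: load at cycle 5, use at cycle 6 -> 3 stall cycles.
naive = use_stall_cycles(5, 6, L2_SPM_LATENCY)
# SPM-aware schedule: load hoisted to cycle 2, use at 6 -> 0 stalls.
aware = use_stall_cycles(2, 6, L2_SPM_LATENCY)
```

Because the latency is a fixed, known constant rather than a hit-or-miss guess, this transformation never trades away time predictability.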
• The WCET comparison among the caches with different size settings.

Sensitivity Energy Results
• The energy consumption comparison among the SPMs with different size settings.
• The energy consumption comparison among the caches with different size settings.

Why Two Levels?
• Why do we need two levels of SPMs instead of one?
• The L2 SPM is important for mitigating access latency: without it, objects not allocated to the L1 SPMs must be fetched from main memory. (Figure: the one-level SPM architecture.)

Results (One-Level vs. Two-Level)
• The timing performance comparison, normalized to the two-level SPM-based architecture.
• The energy consumption comparison, normalized to the two-level SPM-based architecture.

Scratchpad-Aware Scheduling
• The maximum improvement in computation cycles is about 3.9%, and the maximum improvement in use-stall cycles is about 10%.

Thank You and Questions!

Backup Slides – SPM Access Latencies

Backup Slide – Priority Function in SSS
• Priority function of the default Critical Path Scheduling.
• In our Scratchpad-Sensitive Scheduling, we consider two factors related to the load-to-use distance: the memory latency of a load op (curLat) and the memory latency of the related load op for a use op (preLat).
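One plausible way to combine the two backup-slide factors with the default critical-path priority is sketched below: a load's own latency (curLat) raises its priority so it is issued earlier, while the latency of the load feeding a use (preLat) lowers the use's priority so it is issued later, stretching the load-to-use distance. The exact weighting in the talk's formula is not reproduced here, so this additive form is an assumption.

```python
# Hedged sketch of a Scratchpad-Sensitive Scheduling priority.
# Base term: critical-path length, as in default Critical Path
# Scheduling. curLat/preLat adjust loads and uses in opposite
# directions; the additive combination is an assumption, not the
# talk's exact formula.

def priority(cp_length, cur_lat=0, pre_lat=0):
    """cp_length: op's critical-path length (default CPS priority).
    cur_lat: memory latency if the op is a load, else 0.
    pre_lat: latency of the producing load if the op is a use, else 0."""
    return cp_length + cur_lat - pre_lat

# Hypothetical ops on the same critical path (length 10), with a
# load from the L2 SPM (latency 4 cycles) feeding a use:
load_prio = priority(10, cur_lat=4)   # load rises in the ready list
use_prio = priority(10, pre_lat=4)    # its use sinks in the ready list
```

Under this sketch the load is picked before same-path ops and its use after them, which is exactly the hoist-load/delay-use behavior the scheduling slides describe.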