Slide 1 – Title
Efficient Execution of Augmented Reality Applications on Mobile Programmable Accelerators
Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke
University of Michigan
December 10, 2013

Slides 2–4 – Augmented Reality
• Physical world + computer-generated inputs
  – Example domains: commerce, information, games
• Compared to multimedia applications, augmented reality is
  1. User interactive
  2. Computationally intensive

Slide 5 – Application Characteristics
[Chart: execution time (%) split into data-parallel, software-pipelinable, and remaining code for the feature extraction kernels, the virtual object rendering kernel, and video conferencing with virtual object manipulation]
• 69% of execution time is in data-parallel loops (DLP loops)
  => SIMD or Coarse-Grained Reconfigurable Architecture (CGRA)
• 15% is in software-pipelinable loops (SWP loops)
  => CGRA
(The DLP/SWP distinction is illustrated in the C sketch after Slide 9.)

Slide 6 – SIMD vs. CGRA
• SIMD
  – Identical lanes
  – Shared instruction fetch (same schedule across PEs)
  – SIMD memory access
[Diagram: one instruction stream feeding a 4x4 array of SIMD lanes, PE0–PE15]

Slide 7 – SIMD vs. CGRA
• Homogeneous CGRA
  – Identical units
  – Mesh-like interconnects
  – Software pipelining
• Heterogeneous CGRA
  – More energy efficient than homogeneous CGRA
  – Less performance compared to homogeneous CGRA
[Diagram: 4x4 PE array, PE0–PE15, mixing processing elements with multipliers, with memory units, with all units, and without complex units]

Slide 8 – SIMD vs. CGRA
[Charts: execution time and energy of SIMD, homogeneous CGRA, and heterogeneous CGRA, normalized, for SWP loops, DLP loops, and the total]
1. In DLP loops, SIMD outperforms CGRA.
2. In total execution time and energy, CGRA outperforms SIMD (due to the SWP loops).
3. In energy consumption, heterogeneous CGRA beats homogeneous CGRA (20% less energy with only a 4% performance loss).

Slide 9 – Adding SIMD Support for CGRA
• Heterogeneous CGRA
  – Group multiple PEs to form an identical SIMD core
[Diagram: 4x4 PE array, PE0–PE15, with groups of four PEs forming SIMD cores]
1. How do we obtain the efficiency of single instruction fetch?
2. How do we achieve the efficiency of SIMD memory access?
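The DLP/SWP split from Slides 5–8 is what the rest of the talk builds on, so here is a small illustrative pair of loops. This is a hypothetical sketch: the kernel names and loop bodies are invented for illustration and are not taken from the benchmarked AR applications.

/* Hypothetical kernels illustrating the two loop classes from the slides.
 * The names and loop bodies are invented and not from the AR benchmarks. */
#include <stdio.h>

#define N 1024

/* DLP loop: every iteration is independent, so iterations can be spread
 * across SIMD lanes (or across grouped PEs running in SIMD mode). */
void scale_pixels(const int *src, int *dst, int gain) {
    for (int i = 0; i < N; i++)
        dst[i] = src[i] * gain;              /* no loop-carried dependence */
}

/* SWP loop: acc depends on the previous iteration, so the iterations cannot
 * simply run in parallel lanes; a CGRA can still overlap them by software
 * pipelining (modulo scheduling) the loop body. */
int running_filter(const int *src, int *dst) {
    int acc = 0;
    for (int i = 0; i < N; i++) {
        acc = (acc >> 1) + src[i];           /* loop-carried dependence on acc */
        dst[i] = acc;
    }
    return acc;
}

int main(void) {
    static int src[N], dst[N];
    for (int i = 0; i < N; i++)
        src[i] = i;
    scale_pixels(src, dst, 3);
    printf("scaled dst[10] = %d, filter tail = %d\n",
           dst[10], running_filter(src, dst));
    return 0;
}

The first loop maps directly onto SIMD lanes; the second carries acc across iterations, so it cannot be vectorized directly but can still be modulo-scheduled on a CGRA. This is why SIMD wins on the DLP loops while the CGRA wins on total execution time and energy.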
Slide 10 – Efficient Instruction Fetch
• Fetch each instruction from memory only once
  – Pass the instruction around a ring to the next SIMD core
  – The last SIMD core stores the instruction in a recycle buffer
  (A behavioral C sketch of this scheme follows the backup slides.)
[Diagram: SIMD Cores 0–3 connected in a ring, each running different loop iterations]

Slide 11 – SIMD Memory Access
• Single memory request, multiple responses
  – Split transaction
    • Enables forwarding (request ID)
  – SIMD mode flag and stride information carried with the request
[Diagram: MemUnits 0–3 connected to memory Banks 0–3]

Slide 12 – Experimental Setup
• Baseline
  – Heterogeneous CGRA with 16 PEs
    • 4 PEs with memory units, 4 PEs with multipliers
• Our solution
  – Baseline + SIMD support
    • 1-cycle-latency ring network, 16-entry recycle buffer
• Compiler
  – IMPACT frontend compiler
  – Edge-centric modulo scheduler
  – ADRES framework
• Power
  – 65 nm technology @ 200 MHz / 1 V
  – CACTI

Slide 13 – Evaluation for DLP Loops
Comparison: 16 cores (SIMD) vs. 4 SIMD cores exploiting ILP within the loops (our solution)
[Charts: normalized execution time and energy for the baseline and the proposed SIMD support]
- Our solution is 14.1% slower than SIMD.
- Our solution achieves nearly the same energy efficiency as SIMD.

Slide 14 – Evaluation for Total Execution
[Charts: normalized execution time and energy for the baseline and the proposed SIMD support]
- Our solution achieves a 17.6% speedup with 16.9% less energy compared to the baseline heterogeneous CGRA.

Slide 15 – Conclusion
• Best performing / most energy-efficient solution
  – DLP loops: SIMD
  – Whole application: CGRA
• Two techniques implement SIMD support efficiently on a CGRA
  – Efficient instruction fetch: ring network + recycle buffer
  – SIMD memory access: split transaction + stride information in the header
  – Results in a 3.4% power saving
• A CGRA with SIMD support improves overall performance by 17.6% with 16.9% less energy.

Slide 16 – Questions?
For more information: http://cccp.eecs.umich.edu
[email protected]

Slide 17 (backup) – CGRA Memory Access
• Resolve bank conflicts through buffering
  – The compiler accounts for the additional buffering delay
[Diagram: MemUnits 0–3 connected to memory Banks 0–3]

Slide 18 (backup) – Compilation Flow
Program → Loop Classification (DLP / SWP / ILP)
  – DLP loops: matching
  – SWP loops and high-ILP loops: modulo scheduling
  – Low-ILP loops: acyclic scheduling
→ Code Generation → Executable

Slide 19 (backup) – Power Analysis
[Chart: power breakdown into FU, RF, control, and memory for CGRA mode (baseline) vs. SIMD mode]
• (-) Savings from memory
• (+) Overheads from the ring network, recycle buffer, and SIMD memory access
- SIMD mode further saves power by 3.4%.

Slide 20 (backup) – Resource Utilization in DLP Loops
[Chart: resource utilization in CGRA mode vs. SIMD mode]
- SIMD mode can utilize 13.6% more resources in DLP loops.
- The compiler generates a more efficient schedule with fewer resources (less routing, less exploration).

Slide 21 (backup)
[Diagram: 4x4 PE array, PE0–PE15]
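To make the fetch scheme from Slide 10 concrete, below is a minimal behavioral sketch in C. It keeps the parameters from the experimental setup (4 SIMD cores, a 16-entry recycle buffer); the instruction encoding, the fetch_from_memory() stub, and the driver loop are invented for the sketch and are not the authors' hardware or ISA.

/* Minimal behavioral sketch of the ring-based instruction fetch with a
 * recycle buffer. Assumes 4 SIMD cores and a 16-entry buffer as in the
 * experimental setup; everything else is illustrative. */
#include <stdint.h>
#include <stdio.h>

#define NUM_CORES      4
#define RECYCLE_DEPTH 16

typedef uint32_t insn_t;

static insn_t recycle_buf[RECYCLE_DEPTH];   /* filled by the last SIMD core */
static int    recycle_len = 0;

/* Stub standing in for a real instruction-memory access. */
static insn_t fetch_from_memory(int pc) { return 0xC0DE0000u | (insn_t)pc; }

/* Issue one loop instruction to all SIMD cores.
 * First pass over the loop body: fetch once from memory, hand the instruction
 * to each core in turn (the hardware forwards it over the ring, one hop per
 * cycle), and let the last core capture it into the recycle buffer.
 * Later passes: replay from the recycle buffer with no memory fetch at all. */
static void issue(int pc, int first_pass) {
    insn_t insn = first_pass ? fetch_from_memory(pc)
                             : recycle_buf[pc % RECYCLE_DEPTH];

    for (int core = 0; core < NUM_CORES; core++)
        printf("core %d executes insn %08x (pc %d)\n",
               core, (unsigned)insn, pc);    /* each core runs its own iterations */

    if (first_pass && recycle_len < RECYCLE_DEPTH)
        recycle_buf[recycle_len++] = insn;   /* stored by the last core in the ring */
}

int main(void) {
    const int loop_body_len = 4;             /* fits in the 16-entry buffer */
    for (int pass = 0; pass < 3; pass++)     /* the DLP loop body repeats */
        for (int pc = 0; pc < loop_body_len; pc++)
            issue(pc, pass == 0);
    return 0;
}

On the first pass over the loop body each instruction is fetched from memory exactly once and captured by the last core in the ring; every later pass replays it from the recycle buffer, which is where the instruction-fetch energy saving comes from.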