Efficient Performance Scaling of Future CGRAs for Mobile Applications
Yongjun Park, Jason Jong Kyu Park, and Scott Mahlke
December 11, 2012
University of Michigan, Ann Arbor
University of Michigan Electrical Engineering and Computer Science
Convergence of Functionalities
[Figure: anatomy of an iPhone 4; labeled functions: 4G wireless, audio, video, 3D, navigation; callout: flexible accelerator.]
The convergence of functionalities demands a flexible solution because of design cost and the need for programmability.
CGRA: Attractive Alternative to ASICs
• Array of PEs connected by a mesh-like interconnect
• High throughput with a large number of resources
• Distributed hardware offers low cost and low power consumption
• High flexibility with dynamic reconfiguration
Bridging the Gap Between Market Demand and Computation Power
[Figure: series CPU, Audio, and Video plotted by year, 2009 to 2015; y-axis range 0 to 2000.]
How can performance scale while retaining energy efficiency?
[Canali, Internet Computing Magazine, IEEE, 2009]
Agenda: Scaling the Energy Efficiency of CGRAs
• Investigate the key factors and their feasibility in terms of performance and power efficiency
  – Hardware scalability vs. hardware flexibility
    • Interconnection topology
    • Complex PE vs. simple PE
    • Vector memory operation support
    • Homogeneity vs. heterogeneity
Experimental Setup
• Target applications
  – Media benchmarks: AAC decoder, H.264 decoder, and 3D rendering
  – Game physics benchmarks: line of sight, convolution, and conjugate
• Target architecture: various types of CGRAs
  – 16 to 64 heterogeneous/homogeneous resources
• IMPACT frontend compiler + edge-centric modulo scheduler
• Power measurement
  – IBM 65 nm technology at 200 MHz, 1 V
Q1: Interconnection Topology
• Overview
  – Routing overhead limits performance as the size of the CGRA increases
  – Common solution: clustering
  – What is the optimal interconnection topology?
• Methodology
  – Compare the performance of three different clustering schemes
    • Baseline
    • Fixed partition: CGRAs are physically split into multiple partitions
    • Flexible partition: the number of partitions can be changed dynamically from 1 to 8
  – Total number of PEs: 4 to 128
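The flexible-partition scheme can be pictured as a per-loop choice of partition count. The sketch below is a minimal illustration, not the authors' mapping algorithm. It assumes that partition counts are powers of two between 1 and 8, that a no-DLP loop runs on the unpartitioned array, and that a DLP loop takes as many partitions as possible while keeping a minimum partition size. The names `partition_options` and `choose_partitions` and the parameter `min_pes_per_partition` are hypothetical.

```python
def partition_options(total_pes, max_partitions=8):
    """List (partitions, PEs per partition) pairs for a flexible CGRA.

    Assumption: partition counts are powers of two that divide the array evenly.
    """
    options = []
    count = 1
    while count <= max_partitions and count <= total_pes:
        if total_pes % count == 0:
            options.append((count, total_pes // count))
        count *= 2
    return options


def choose_partitions(total_pes, has_dlp, min_pes_per_partition=4):
    """Pick a partition count for one loop.

    Assumption: a no-DLP loop uses the whole array as a single partition,
    while a DLP loop takes the largest partition count that still leaves
    each partition at least `min_pes_per_partition` PEs.
    """
    if not has_dlp:
        return 1
    feasible = [count for count, size in partition_options(total_pes)
                if size >= min_pes_per_partition]
    return max(feasible) if feasible else 1
```

Under these assumptions, a 64-PE array would run a DLP loop as 8 partitions of 8 PEs and fall back to a single 64-PE partition for a no-DLP loop.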
Q1: Interconnection Topology
[Figure labels: Application (no-DLP loops, DLP loops); Baseline; Fixed partition; Flexible mapping.]
Performance Comparison (Base, Fixed, Flex)
[Figure: two panels, Media and Game; x-axis: Architecture (4, 8, 16, 32, 64, 128 PEs); series: Base, Fixed, Flex.]
• Fixed partitioning doesn’t always show better performance.
• Flexible architectures show the best performance and retain scalability.
Q2: Complex PEs vs. Simple PEs
• Overview
  – CGRAs with complex PEs have been introduced
    • Two-level interconnect
    • Number of RFs can decrease
    • Multiple instructions can be chained
  – Challenge: resource utilization
  – Goal: determine the viability of complex PEs in terms of energy consumption
• Methodology
  – Compare the energy consumption of different PE styles
    • Number of FUs inside a PE: 1 to 6
    • Uniform vs. optimized
PE Designs
[Figure labels: simple integer ALU; simple integer + complex ALU; register file.]
Energy Consumption
[Figure: energy consumption vs. number of FUs per PE (1 to 6); series: Media uniform, Game uniform, Media optimized, Game optimized; y-axis range 0.5 to 4; annotation: 1.5x energy.]
• Energy consumption does not increase dramatically as the number of FUs per PE grows
• Within a 1.5x energy budget, complex PEs with 2 to 3 FUs can also be suitable solutions
Q3: SIMD Memory Support
• Overview
  – SIMD memory support lowers power and reduces the number of instructions
  – Challenge: degree of DLP
  – Goal: determine the viability of SIMD memory access in terms of energy consumption
• Methodology
  – Compare the energy consumption at different SIMD widths: 1 to 16
Relative Energy Consumption
[Figure: relative power per access, relative number of accesses, and relative total energy vs. vector width (1, 2, 4, 8, 16); y-axis range 0 to 14.]
• Total energy consumption at wider vector widths can be similar to that of a scalar memory unit
  – A high degree of spatial locality can compensate for the power overhead
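The three plotted quantities are related by a product: relative total energy is the relative cost per access times the relative number of accesses. The sketch below works through that relationship under a simple model that is an assumption, not something taken from the slides: a fraction `dlp_fraction` of the scalar accesses can be packed into vector accesses of width `width`, and the relative per-access cost is supplied by the caller. All function and parameter names are hypothetical.

```python
def relative_accesses(width, dlp_fraction):
    """Number of memory accesses relative to a scalar memory unit.

    Assumption: vectorizable accesses shrink by a factor of `width`;
    the remaining accesses stay scalar.
    """
    if width < 1 or not 0.0 <= dlp_fraction <= 1.0:
        raise ValueError("width must be >= 1 and dlp_fraction in [0, 1]")
    return (1.0 - dlp_fraction) + dlp_fraction / width


def relative_total_energy(width, dlp_fraction, cost_per_access):
    """Relative total energy = relative cost per access * relative # of accesses."""
    return cost_per_access * relative_accesses(width, dlp_fraction)


def break_even_cost(width, dlp_fraction):
    """Largest relative per-access cost at which the vector unit still matches
    the scalar unit's total energy (relative total energy equal to 1)."""
    return 1.0 / relative_accesses(width, dlp_fraction)
```

Under this model, the more accesses that can be vectorized, the higher the per-access overhead a wide memory unit can absorb before its total energy exceeds the scalar case.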
• Flexible partitioning should be supported to further improve performance.
• Complex PEs can be more energy efficient even at low resource utilization.
• Wide SIMD memory support can be realistic given the characteristics of mobile applications.
Questions?
For more information: http://cccp.eecs.umich.edu
Q1: Homogeneity vs. Heterogeneity
• Overview
  – Heterogeneous CGRAs are common
  – No experiments on the effect of heterogeneity compared with homogeneity
• Methodology
  – Start from a 16-PE homogeneous CGRA (integer ALU, complex ALU, memory unit)
  – Decrease the number of PEs supporting the complex ALU and memory unit
  – Performance goal: 80% of the homogeneous CGRA's performance
How about performance?
Performance Degradation
[Figure: two panels, Media and Game; y-axis range 0 to 1.]
• The performance degradation is not substantial
  – Performance is normally not constrained by complex instructions
• Performance degradation depends much more on memory operations
• For 80% of the baseline performance, the number of both complex and memory units can be decreased by up to 75%.
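The sweep described in the methodology can be sketched as a loop that removes specialized units until the performance target is missed. This is an illustration under stated assumptions, not the authors' experimental flow: complex-ALU and memory-unit counts are reduced together, and the caller supplies `relative_performance`, a hypothetical stand-in for compiling and scheduling the benchmarks on each configuration. The function names are also hypothetical.

```python
def smallest_config(total_pes, relative_performance, target=0.8):
    """Return the fewest complex/memory units that still meet the target.

    Assumptions: complex-ALU and memory-unit counts are reduced together, and
    `relative_performance(n)` reports performance with n such units relative
    to the homogeneous CGRA (where n == total_pes).
    """
    best = total_pes
    for n in range(total_pes, 0, -1):
        if relative_performance(n) < target:
            break  # stop at the first configuration that misses the target
        best = n
    return best


def reduction(total_pes, units):
    """Fraction of complex/memory units removed relative to the homogeneous array."""
    return 1.0 - units / total_pes
```

On a 16-PE array, a 75% reduction corresponds to 4 PEs keeping each specialized unit, since `reduction(16, 4)` evaluates to 0.75.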
• Heterogeneous FU organization is highly effective.
• Flexible partitioning should be supported to further improve performance.
• Complex PEs can be more energy efficient even at low resource utilization.
• Wide SIMD memory support can be realistic given the characteristics of mobile applications.
CGRA: Attractive Alternative to ASICs
• Suitable for running multimedia applications for future embedded systems
• High throughput, low power consumption, high flexibility
• Morphosys: 8x8 array with RISC processor
• SiliconHive: hierarchical systolic array
• ADRES: 4x4 array with tightly coupled VLIW
[Figure labels: Morphosys, SiliconHive, ADRES; Viterbi at 80 Mbps; H.264 at 30 fps; 50-60 MOps/mW.]