Processor Acceleration Through Automated Instruction Set Customization
Nathan Clark, Hongtao Zhong, Scott Mahlke
Advanced Computer Architecture Lab, University of Michigan, Ann Arbor
December 3, 2003
University of Michigan, Electrical Engineering and Computer Science

Motivation
• Cell phones, PDAs, digital cameras, etc. are everywhere
  – High performance yet low power design point
• General core + ASIC solution
  – Limited post-programmability
  [Diagram: CPU alongside an ASIC]
• General core + application specific instructions (CFUs)
  [Diagram: CFU attached to the CPU]

What is a CFU?
• Combine multiple primitive operations
  – Smaller code size, fewer RF reads
  – Increases performance
[Figure: a DFG of primitive operations (&, <<, |, ^, +, *) with groups of nodes collapsed into CFU 1 and CFU 2]

Automation is Key
• This is ¼ of the DFG for a single basic block of blowfish
[Figure: DFG excerpt; legible node labels include 159 XOR, 164 SHR, 173 AND]

Related Work
• Tensilica Xtensa
  – Commercial example
  – MIPS core + manually constructed CFU
• Automatic instruction set synthesis is mature field
  – See paper for comparison of techniques
• Our contributions
  – Novel technique for automatic CFU creation
  – System to utilize CFUs in multiple applications
  – Analysis of how effectively CFUs for one application apply to other applications in the same domain

System Overview
• Synthesis
  – Subgraph identification
    • Discover candidates for CFUs
    • Weed out what shouldn’t be picked
  – Selection
    • Determine which candidates to use as CFUs
• Compilation
  – Subgraph replacement
    • Make use of the CFUs in a range of applications

Subgraph Identification
• Grow subgraphs from seed nodes
  – All nodes are seeds
  – Most directions don’t make sense
• How to decide where to grow?
  – Making decisions using factors similar to an architect
  – Take 4 factors into consideration
    • Criticality, Latency, Area, Input/Output
• Sum of these factors determines value of each direction
  – NOT picking CFUs
[Figure: example DFG (%, +, *, &, ^, <<, |); the list of CFU candidates grows as subgraphs are explored from a seed node]

Critical Path
• Combining operations on the critical path will shrink the longer dependence chains
  – Maximize potential performance gain
• $Wt = \frac{10}{\mathit{slack} + 1}$
  – Slack is # cycles off longest dependence path
• Examples from the figure: 10/(0+1) = 10 and 10/(2+1) = 3.33

Latency
• Growing toward low latency operations allows combination of more nodes in a cycle
  – Maximize DFG compression
• $Wt = \frac{10 \times \mathit{old\_latency}}{\mathit{new\_latency}}$
• Examples from the figure: 10*0.3 / 0.36 = 8.33 and 10*0.3 / 0.6 = 5

  Opcode    Area    Cycles
  +         1.00    0.30
  &         0.12    0.06
  <<, >>    0.01    ~0.00
  ^         0.16    0.09

Area
• Want the most benefit for the least area
• $Wt = \frac{10 \times \mathit{old\_area}}{\mathit{new\_area}}$
• Area is the sum of macrocell areas
• Examples from the figure: 10*0.5/0.5 = 10 and 10*0.5/1.5 = 3.33
• Same opcode table as on the Latency slide

Input/Output
• Want CFUs to use as few RF ports as possible
  – Smaller encoding
  – Allow growth of larger candidates
• $Wt = \min\left(\frac{10 \times \mathit{old\_ports}}{\mathit{new\_ports} + 1},\ 10\right)$
• Examples from the figure: 10*2/(4+1) = 4 and 10*2/(2+1) = 6.67

Example
[Figure sequence: the example DFG (^, &, >>, +, <<) is grown one node at a time; on the first three steps the candidate directions are labeled with summed weights ranging from 28.5 to 40]

Finished – Met External Constraints
[Figure: the final subgraph on the example DFG]

Set of Candidates
[Figure: the candidate subgraphs collected from the example, built from <<, &, ^ and + nodes]
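The four weights on the Critical Path, Latency, Area and Input/Output slides, and their sum as the value of a growth direction, can be written down as a minimal Python sketch. The function and parameter names are illustrative assumptions, not the authors' implementation; only the formulas, the opcode table and the worked numbers come from the slides.

```python
# Minimal sketch of the guide-function weights described on the slides.
# All function and parameter names are hypothetical; only the formulas
# and the opcode table are taken from the presentation.

# Relative macrocell costs copied from the slide's opcode table.
MACROCELLS = {
    "+": {"area": 1.00, "cycles": 0.30},
    "&": {"area": 0.12, "cycles": 0.06},
    "<<": {"area": 0.01, "cycles": 0.00},  # slide lists ~0.00
    ">>": {"area": 0.01, "cycles": 0.00},  # slide lists ~0.00
    "^": {"area": 0.16, "cycles": 0.09},
}


def criticality_weight(slack):
    """Wt = 10 / (slack + 1); slack is cycles off the longest path."""
    return 10.0 / (slack + 1)


def latency_weight(old_latency, new_latency):
    """Wt = 10 * old_latency / new_latency."""
    return 10.0 * old_latency / new_latency


def area_weight(old_area, new_area):
    """Wt = 10 * old_area / new_area."""
    return 10.0 * old_area / new_area


def io_weight(old_ports, new_ports):
    """Wt = min(10 * old_ports / (new_ports + 1), 10)."""
    return min(10.0 * old_ports / (new_ports + 1), 10.0)


def direction_value(slack, old_latency, new_latency,
                    old_area, new_area, old_ports, new_ports):
    """Value of growing in one direction: the sum of the four weights."""
    return (criticality_weight(slack)
            + latency_weight(old_latency, new_latency)
            + area_weight(old_area, new_area)
            + io_weight(old_ports, new_ports))


if __name__ == "__main__":
    # Slide example: growing an add (0.30 cycles) toward an AND (0.06 cycles).
    grown = MACROCELLS["+"]["cycles"] + MACROCELLS["&"]["cycles"]
    print(round(latency_weight(MACROCELLS["+"]["cycles"], grown), 2))
    # Slide example: a node 2 cycles off the critical path.
    print(round(criticality_weight(2), 2))
```

Because each weight is capped at or scaled around 10, no single factor dominates the sum by construction; how the real system balances or tunes the factors is not stated on the slides.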
Avoids Exponential Explosion
[Figure: Number of Candidates (K), log scale from 1 to 100000, and Speedup, from 1.00 to 1.50, versus Cost Constraint (Adders), from 1 to 5; legend entries: Exponential, Intelligent, Performance]

Greedy Selection Heuristic
• Use estimates of performance improvement / cost

  First snapshot of the candidate table:

  Subgraph Number    Value    Cost    Ops
  1                  20       4       (3,4),(6,8)
  2                  6        1       (1,3,7)
  …                  …        …       …
  N                  9        5       (1,7)

  Second snapshot of the candidate table:

  Subgraph Number    Value    Cost    Ops
  1                  10       4       (6,8)
  2                  6        1       (1,3,7)
  …                  …        …       …
  N                  0        5

Compiler Replacement
• Multiple applications can utilize CFUs
• Vflib pattern matcher [Cor ’99]
[Figure: Instruction Synthesis produces a CFU Description that is passed to the Compiler; DFG nodes numbered 1 to 6 are matched against the CFU pattern]

Experimental Setup
• Implemented in the Trimaran toolset
• Baseline machine: 1 Int, 1 Flt, 1 Br, 1 Mem/Cycle
  – CFUs use Int issue slot
• CFU latency/area generated as sum of each individual macrocell
  – Pipeline latches were added if CFU latency >1 clock cycle
  – 300 MHz clock assumed
  – No branch or memory instructions in CFUs
• Four application domains tested
  – Audio, Encryption, Image, Network

Native Encryption Results
[Figure: Speedup (1.0 to 2.0) versus Cost Budget (Adders), 0 to 16, for blowfish, rijndael and sha]

Encryption Cross Compile
[Figure: Speedup (1.0 to 2.0) versus Cost Budget (Adders), 0 to 16, for blowfish-rijndael, blowfish-sha, rijndael-blowfish, rijndael-sha, sha-blowfish and sha-rijndael]

Generalizing CFUs
• Subsumed (Multiple Paths)
• Wildcards (Multiple Nodes)
[Figure: an example CFU over inputs IN_1 and IN_2 built from >>, | and + with constants 0x8 and 0xF; the generalized versions show alternatives labeled "0x8, 0x0", "0xF, 0x0", "|,&" and "+,-"]

Effects of Generalization
[Figure: Speedup (1.0 to 2.0); series: CFUs, Subsumed Subgraphs]

Conclusions
• Developed two phase instruction set synthesis system
  – Guide function removes bad candidates
  – Greedy selection heuristic
• Substantial speedups can be attained with very little die impact

  Domain        Ave. Speedup
  Encryption    1.61
  Network       1.38
  Image         1.16
  Audio         1.66

• Subsumed subgraphs and wildcarding increase cross-application effectiveness

Questions?
http://cccp.eecs.umich.edu

Backup slides

Individual Factors - Blowfish
[Figure: series IO, Latency, Area, Criticality and All; vertical axis from 1 to 1.7, horizontal axis from 0 to 16]

Individual Factors - Djpeg
[Figure: series IO, Latency, Area, Criticality and All; vertical axis from 1 to 1.25, horizontal axis from 0 to 16]

Selection
• Uses estimates of performance improvement
• Greedy Heuristic used
[Figure: the example DFG (^, &, >>, +, <<)]
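The greedy selection step can be illustrated with a short Python sketch using the numbers from the Greedy Selection Heuristic table. This is one plausible reading of that slide, not the authors' code: it assumes each candidate's value is split evenly across its listed occurrences, that an occurrence stops counting once any of its ops is covered by an already selected CFU, and that selection is limited by a cost budget. The names `SLIDE_CANDIDATES`, `candidate_value` and `greedy_select`, and the budget of 5, are hypothetical.

```python
# Sketch of a value/cost greedy selection over CFU candidates.
# Assumptions (not stated on the slides): per-occurrence values,
# overlap invalidates an occurrence, and a fixed cost budget.

# name -> (cost, [(ops covered by one occurrence, value of that occurrence)])
SLIDE_CANDIDATES = {
    "1": (4, [((3, 4), 10), ((6, 8), 10)]),
    "2": (1, [((1, 3, 7), 6)]),
    "N": (5, [((1, 7), 9)]),
}


def candidate_value(occurrences, covered):
    """Sum the value of occurrences whose ops are all still uncovered."""
    return sum(value for ops, value in occurrences
               if covered.isdisjoint(ops))


def greedy_select(candidates, budget):
    """Repeatedly pick the affordable candidate with the best value/cost."""
    covered = set()   # ops already claimed by selected CFUs
    chosen = []       # candidate names in selection order
    remaining = dict(candidates)
    while remaining:
        best_name, best_ratio = None, 0.0
        for name, (cost, occurrences) in remaining.items():
            if cost > budget:
                continue  # does not fit in what is left of the budget
            ratio = candidate_value(occurrences, covered) / cost
            if ratio > best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            break  # nothing affordable still has positive value
        cost, occurrences = remaining.pop(best_name)
        budget -= cost
        chosen.append(best_name)
        for ops, _ in occurrences:
            if covered.isdisjoint(ops):
                covered.update(ops)
    return chosen, covered


if __name__ == "__main__":
    chosen, covered = greedy_select(SLIDE_CANDIDATES, budget=5)
    print(chosen, sorted(covered))
```

Under these assumptions, subgraph 2 has the best ratio (6/1) and is picked first. Its ops 1, 3 and 7 then remove the (3,4) occurrence of subgraph 1 and the only occurrence of subgraph N, leaving values of 10 and 0, which is consistent with the second snapshot of the table.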