Processor Acceleration Through
Automated Instruction Set Customization
Nathan Clark, Hongtao Zhong, Scott Mahlke
Advanced Computer Architecture Lab
University of Michigan, Ann Arbor
December 3, 2003
University of Michigan
Electrical Engineering and Computer Science
Motivation
• Cell phones, PDAs, digital cameras, etc. are everywhere
– High performance yet low power design point
• General core + ASIC solution
– Limited post-programmability
[Figure: CPU paired with an ASIC]
• General core + application-specific instructions, executed on custom function units (CFUs)
[Figure: CPU with an attached CFU]
What is a CFU?
• Combine multiple primitive operations
– Smaller code size, fewer register file (RF) reads
– Increases performance
[Figure: a dataflow graph of primitive operations (&, <<, |, ^, +, *) partitioned into two subgraphs labeled CFU 1 and CFU 2]
Automation is Key
• This is ¼ of the dataflow graph (DFG) for a single basic block of blowfish
[Figure: excerpt of the blowfish DFG, including nodes labeled 159 XOR, 164 SHR, and 173 AND]
Related Work
• Tensilica Xtensa
– Commercial example
– MIPS core + manually constructed CFU
• Automatic instruction set synthesis is a mature field
– See paper for comparison of techniques
• Our contributions
– Novel technique for automatic CFU creation
– System to utilize CFUs in multiple applications
– Analysis of how effectively CFUs for one application apply to other applications in the same domain
System Overview
• Synthesis
– Subgraph identification
• Discover candidates for CFUs
• Weed out what shouldn’t be picked
– Selection
• Determine which candidates to use as CFUs
• Compilation
– Subgraph replacement
• Make use of the CFUs in a range of applications
Subgraph Identification
• Grow subgraphs from seed nodes
– All nodes are seeds
– Most directions don’t make sense
• How to decide where to grow?
– Make decisions using factors similar to those an architect would weigh
– Take 4 factors into consideration
• Criticality, Latency, Area, Input/Output
• Sum of these factors determines the value of each direction
– This step only identifies candidates; it is NOT picking CFUs
[Figure: example DFG of +, *, &, ^, %, <<, and | nodes, with the CFU candidates found so far (built from &, +, and << nodes) listed beside it]
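The growth procedure on this slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the graph encoding, the `score` callback, the `threshold` cutoff, and the `max_size` limit are all hypothetical names introduced here. The slides only say that the four factor weights are summed to give each direction a value.

```python
from typing import Callable, Dict, FrozenSet, List, Set

Graph = Dict[str, List[str]]  # node -> adjacent nodes (producers and consumers)


def grow_candidates(
    graph: Graph,
    score: Callable[[FrozenSet[str], str], float],
    threshold: float,
    max_size: int,
) -> Set[FrozenSet[str]]:
    """Enumerate candidate subgraphs by growing from every node.

    `score(subgraph, node)` stands in for the summed factor weights
    (criticality, latency, area, input/output) of growing `subgraph`
    toward `node`. Directions scoring below `threshold` (a hypothetical
    cutoff) are not explored.
    """
    candidates: Set[FrozenSet[str]] = set()
    worklist = [frozenset([n]) for n in graph]  # all nodes are seeds
    while worklist:
        sub = worklist.pop()
        if sub in candidates:
            continue  # already reached from another seed
        candidates.add(sub)
        if len(sub) >= max_size:
            continue
        frontier = {m for n in sub for m in graph[n]} - sub
        for node in frontier:
            if score(sub, node) >= threshold:
                worklist.append(sub | {node})
    return candidates


def toy_score(sub: FrozenSet[str], node: str) -> float:
    # Made-up values: any growth that involves node "c" is a poor direction.
    return 5.0 if "c" in sub | {node} else 30.0


# Toy DFG: a - b - c
dfg: Graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
found = grow_candidates(dfg, toy_score, threshold=20.0, max_size=3)
```

Here every node still appears as a single-node candidate, but low-valued directions are never expanded, which is what keeps the candidate count from growing with every possible subgraph.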
Critical Path
• Combining operations on the critical path will shrink the longer dependence chains
– Maximize potential performance gain
• $\mathit{Wt} = \dfrac{10}{\mathit{slack} + 1}$
– Slack is the number of cycles off the longest dependence path
[Figure: example DFG of ^, &, >>, +, and << nodes; a direction with slack 0 scores 10/(0+1) = 10, and a direction with slack 2 scores 10/(2+1) = 3.33]
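As a rough illustration of the criticality factor, the sketch below computes slack for each node of a small DAG and applies the weight 10/(slack + 1). The graph encoding, the unit latencies, and the function names are assumptions made for this example; the slides do not say how slack is computed in the actual tool.

```python
from functools import lru_cache
from typing import Dict, List


def slacks(succs: Dict[str, List[str]], latency: Dict[str, int]) -> Dict[str, int]:
    """Slack of each node: how many cycles the longest path through it
    falls short of the longest dependence path in the DAG."""
    preds: Dict[str, List[str]] = {n: [] for n in succs}
    for n, outs in succs.items():
        for m in outs:
            preds[m].append(n)

    @lru_cache(maxsize=None)
    def depth_to(n: str) -> int:  # longest path ending at n, inclusive
        return latency[n] + max((depth_to(p) for p in preds[n]), default=0)

    @lru_cache(maxsize=None)
    def depth_from(n: str) -> int:  # longest path starting at n, inclusive
        return latency[n] + max((depth_from(s) for s in succs[n]), default=0)

    critical = max(depth_to(n) for n in succs)
    return {n: critical - (depth_to(n) + depth_from(n) - latency[n]) for n in succs}


def criticality_weight(slack: int) -> float:
    return 10 / (slack + 1)


# Chain a -> b -> c -> d is the critical path; e -> d is a short side branch.
dag = {"a": ["b"], "b": ["c"], "c": ["d"], "d": [], "e": ["d"]}
unit = {n: 1 for n in dag}
node_slack = slacks(dag, unit)
weights = {n: round(criticality_weight(s), 2) for n, s in node_slack.items()}
```

In this toy graph the chain nodes have slack 0 and weight 10, while `e` has slack 2 and weight 3.33, matching the two values shown on the slide.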
Latency
• Growing toward low-latency operations allows combination of more nodes in a cycle
– Maximize DFG compression
• $\mathit{Wt} = \dfrac{10 \cdot \text{old\_latency}}{\text{new\_latency}}$

Opcode    Area    Cycles
+         1.00    0.30
&         0.12    0.06
<<, >>    0.01    ~0.00
^         0.16    0.09

[Figure: example DFG of ^, &, >>, +, and << nodes; one direction scores 10*0.3 / 0.36 = 8.33, another scores 10*0.3 / 0.6 = 5]
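The latency factor can be tried out with the Cycles column of the macrocell table. The sketch below is illustrative only: it assumes the grown operations execute back to back, so their latencies add, and it treats the "~0.00" shift latency as exactly 0.0. Growing a lone `+` toward an `&` or toward another `+` is one reading that reproduces the slide's two numbers.

```python
from typing import Dict, Sequence

# Cycles column of the macrocell table, as fractions of a clock cycle.
# Shifts are listed as "~0.00" on the slide; 0.0 is an approximation.
MACROCELL_CYCLES: Dict[str, float] = {
    "+": 0.30, "&": 0.06, "<<": 0.0, ">>": 0.0, "^": 0.09,
}


def chain_latency(opcodes: Sequence[str]) -> float:
    """Latency of operations executed one after another (assumed serial)."""
    return sum(MACROCELL_CYCLES[op] for op in opcodes)


def latency_weight(old_latency: float, new_latency: float) -> float:
    return 10 * old_latency / new_latency


old = chain_latency(["+"])
toward_and = round(latency_weight(old, chain_latency(["+", "&"])), 2)  # 8.33
toward_add = round(latency_weight(old, chain_latency(["+", "+"])), 2)  # 5.0
```

The cheaper `&` keeps the weight high, so growth is steered toward operations that still fit in the same cycle.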
Area
• Want the most benefit for the least area
• $\mathit{Wt} = \dfrac{10 \cdot \text{old\_area}}{\text{new\_area}}$
• Area is the sum of macrocell areas

Opcode    Area    Cycles
+         1.00    0.30
&         0.12    0.06
<<, >>    0.01    ~0.00
^         0.16    0.09

[Figure: example DFG of ^, &, >>, +, and << nodes; one direction scores 10*0.5/0.5 = 10, another scores 10*0.5/1.5 = 3.33]
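A small sketch of the area factor, using the Area column of the macrocell table. The helper names are hypothetical, and the starting subgraph (`^` feeding `&`) is chosen only for illustration. The slide's own examples use an old area of 0.5, which the first two calls reproduce directly.

```python
from typing import Dict, Sequence

# Area column of the macrocell table (an adder has area 1.00).
MACROCELL_AREA: Dict[str, float] = {
    "+": 1.00, "&": 0.12, "<<": 0.01, ">>": 0.01, "^": 0.16,
}


def subgraph_area(opcodes: Sequence[str]) -> float:
    """Area is the sum of the macrocell areas in the subgraph."""
    return sum(MACROCELL_AREA[op] for op in opcodes)


def area_weight(old_area: float, new_area: float) -> float:
    return 10 * old_area / new_area


# The slide's two examples: old area 0.5 growing to 0.5 or to 1.5.
slide_cheap = area_weight(0.5, 0.5)             # 10.0
slide_adder = round(area_weight(0.5, 1.5), 2)   # 3.33

# Same idea driven from the table: grow a ^/& subgraph by a shift or an adder.
base = subgraph_area(["^", "&"])
grow_shift = area_weight(base, base + MACROCELL_AREA["<<"])
grow_adder = area_weight(base, base + MACROCELL_AREA["+"])
```

Adding a near-free shift barely changes the weight, while adding a full adder cuts it sharply.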
Input/Output
• Want CFUs to use as few RF ports as possible
– Smaller encoding
– Allow growth of larger candidates
• $\mathit{Wt} = \min\left(\dfrac{10 \cdot \text{old \# ports}}{\text{new \# ports} + 1},\ 10\right)$
[Figure: example DFG of ^, &, >>, +, and << nodes; one direction scores 10*2/(4+1) = 4, another scores 10*2/(2+1) = 6.67]
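The input/output factor can be illustrated as below. How ports are counted is an assumption here: the sketch counts distinct values entering the subgraph plus nodes whose result is used outside it. The slides only give the weight formula and its two example values, so `port_count` and its graph encoding are hypothetical.

```python
from typing import Dict, FrozenSet, List


def port_count(
    sub: FrozenSet[str],
    preds: Dict[str, List[str]],
    succs: Dict[str, List[str]],
) -> int:
    """Register-file ports a subgraph would need (assumed: inputs + outputs)."""
    inputs = {p for n in sub for p in preds[n] if p not in sub}
    outputs = {n for n in sub if any(s not in sub for s in succs[n])}
    return len(inputs) + len(outputs)


def io_weight(old_ports: int, new_ports: int) -> float:
    return min(10 * old_ports / (new_ports + 1), 10)


# n1 = x ^ y; n2 = n1 & z; n3 consumes n2. x, y, z are live-in values.
preds = {"n1": ["x", "y"], "n2": ["n1", "z"], "n3": ["n2"]}
succs = {"n1": ["n2"], "n2": ["n3"], "n3": []}

before = port_count(frozenset({"n1"}), preds, succs)        # x, y in; n1 out
after = port_count(frozenset({"n1", "n2"}), preds, succs)   # x, y, z in; n2 out
grow_weight = io_weight(before, after)

# The slide's two examples.
slide_four_ports = io_weight(2, 4)             # 4.0
slide_two_ports = round(io_weight(2, 2), 2)    # 6.67
```

The `min(..., 10)` cap keeps this factor on the same 0 to 10 scale as the other three when growth reduces the port count.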
Example
[Figure sequence: a candidate subgraph is grown step by step across an example DFG of ^, &, >>, +, and << nodes. The possible growth directions are annotated with their values: 35, 37.5, 28.5, 30.8, 37.5, 28.5 in the first frame; 35, 28.5, 30.8, 40, 28.5, 33.5 in the second; 35, 28.5, 30.8, 28.5, 36, 36 in the third. The remaining frames continue the growth without annotated values.]
Finished – Met External Constraints
[Figure: the same example DFG with the final grown candidate]
Set of Candidates
[Figure: the resulting set of candidate subgraphs, built from <<, &, ^, and + operations]
Avoids Exponential Explosion
[Figure: number of candidates (K), with axis ticks from 10 to 100000, and speedup, with axis ticks from 1.00 to 1.50, plotted against cost constraint (1 to 5 adders); legend entries: Exponential, Intelligent, Performance]
Greedy Selection Heuristic
• Use estimates of performance improvement / cost
Candidate table before a selection step:

Subgraph Number    Value    Cost    Ops
1                  20       4       (3,4),(6,8)
2                  6        1       (1,3,7)
…                  …        …       …
N                  9        5       (1,7)

Candidate table after the step, with values and ops updated:

Subgraph Number    Value    Cost    Ops
1                  10       4       (6,8)
2                  6        1       (1,3,7)
…                  …        …       …
N                  0        5
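One reading of the numbers on this slide is that subgraph 2 has the best value/cost ratio (6/1), so it is picked first, and every other occurrence that shares an operation with it (ops 1, 3, 7) is then dropped, which lowers subgraph 1 from 20 to 10 and subgraph N from 9 to 0. The sketch below implements that reading. The `Candidate` class, the per-occurrence value split, and the `budget` parameter are assumptions for illustration, not details stated in the slides.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Candidate:
    name: str
    value_per_use: float                 # assumed: estimated gain per occurrence
    cost: float                          # area cost
    occurrences: List[Tuple[int, ...]]   # op ids covered by each occurrence

    @property
    def value(self) -> float:
        return self.value_per_use * len(self.occurrences)


def greedy_select(cands: List[Candidate], budget: float) -> List[str]:
    """Repeatedly pick the affordable candidate with the best value/cost.

    Note: this updates `occurrences` on the candidates passed in.
    """
    chosen: List[str] = []
    remaining = list(cands)
    while True:
        affordable = [c for c in remaining if c.cost <= budget and c.value > 0]
        if not affordable:
            return chosen
        best = max(affordable, key=lambda c: c.value / c.cost)
        chosen.append(best.name)
        budget -= best.cost
        remaining.remove(best)
        used: Set[int] = {op for occ in best.occurrences for op in occ}
        for c in remaining:
            # Ops already covered by a chosen CFU cannot be covered again.
            c.occurrences = [o for o in c.occurrences if used.isdisjoint(o)]


def slide_candidates() -> List[Candidate]:
    return [
        Candidate("1", 10, 4, [(3, 4), (6, 8)]),
        Candidate("2", 6, 1, [(1, 3, 7)]),
        Candidate("N", 9, 5, [(1, 7)]),
    ]


cands = slide_candidates()
first_pick = greedy_select(cands, budget=1)        # only subgraph 2 fits
values_after = [c.value for c in cands]            # values after that one step
full_run = greedy_select(slide_candidates(), budget=5)
```

With a budget of 5, the first pick is still subgraph 2 on ratio alone, after which subgraph 1 is the only remaining candidate with nonzero value that fits.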
Compiler Replacement
• Multiple applications can utilize CFUs
• Vflib pattern matcher [Cor ’99]
[Figure: flow from Instruction Synthesis through a CFU Description to the CFU Compiler, shown with DFGs whose nodes are numbered 1 to 6]
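To make the replacement step concrete, here is a naive backtracking search that finds where a CFU pattern occurs in an application DFG. It is not the vflib algorithm named on the slide, only a small stand-in with the same goal; the graph encoding and function name are hypothetical, and operand order and commutativity are ignored.

```python
from typing import Dict, Iterator, List, Set, Tuple

Edges = Set[Tuple[str, str]]  # (producer, consumer)


def find_matches(
    pat_ops: Dict[str, str],
    pat_edges: Edges,
    dfg_ops: Dict[str, str],
    dfg_edges: Edges,
) -> Iterator[Dict[str, str]]:
    """Yield every injective map from pattern nodes to DFG nodes that
    preserves opcodes and maps each pattern edge onto a DFG edge."""
    order: List[str] = list(pat_ops)

    def extend(mapping: Dict[str, str]) -> Iterator[Dict[str, str]]:
        if len(mapping) == len(order):
            yield dict(mapping)
            return
        p = order[len(mapping)]
        for d in dfg_ops:
            if dfg_ops[d] != pat_ops[p] or d in mapping.values():
                continue
            mapping[p] = d
            edges_ok = all(
                (mapping[a], mapping[b]) in dfg_edges
                for a, b in pat_edges
                if a in mapping and b in mapping
            )
            if edges_ok:
                yield from extend(mapping)
            del mapping[p]  # backtrack

    yield from extend({})


# CFU pattern: an & feeding a <<.
cfu_ops = {"p1": "&", "p2": "<<"}
cfu_edges = {("p1", "p2")}

# Application DFG: n1(&) -> n2(<<) -> n3(|), and n4(&) -> n5(+).
app_ops = {"n1": "&", "n2": "<<", "n3": "|", "n4": "&", "n5": "+"}
app_edges = {("n1", "n2"), ("n2", "n3"), ("n4", "n5")}

matches = list(find_matches(cfu_ops, cfu_edges, app_ops, app_edges))
```

Each match identifies a group of primitive operations that a compiler could replace with a single CFU instruction, which is how one set of CFUs can serve several applications.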
Experimental Setup
• Implemented in the Trimaran toolset
• Baseline machine issues 1 integer, 1 floating-point, 1 branch, and 1 memory operation per cycle
– CFUs use the integer issue slot
• CFU latency and area computed as the sum over the CFU's individual macrocells
– Pipeline latches were added if CFU latency exceeded 1 clock cycle
– 300 MHz clock assumed
– No branch or memory instructions in CFUs
• Four application domains tested
– Audio, Encryption, Image, Network
Native Encryption Results
[Figure: speedup (1.0 to 2.0) versus cost budget (0 to 16 adders) for blowfish, rijndael, and sha]
Encryption Cross Compile
[Figure: speedup (1.0 to 2.0) versus cost budget (0 to 16 adders) for the pairs blowfish-rijndael, blowfish-sha, rijndael-blowfish, rijndael-sha, sha-blowfish, and sha-rijndael]
Generalizing CFUs
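[Figure: a CFU over inputs IN_1 and IN_2, built from >>, |, and + nodes with constants 0x8 and 0xF, generalized in two ways. Subsumed (Multiple Paths): inputs are annotated with alternative constants such as "0x8, 0x0" and "0xF, 0x0". Wildcards (Multiple Nodes): nodes are annotated with alternative opcodes such as "|,&" and "+,-".]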
Effects of Generalization
[Figure: speedup (1.0 to 2.0) for two series, CFUs and Subsumed Subgraphs]
Conclusions
• Developed two-phase instruction set synthesis system
– Guide function removes bad candidates
– Greedy selection heuristic
• Substantial speedups can be attained with very little die impact

Domain        Ave. Speedup
Encryption    1.61
Network       1.38
Image         1.16
Audio         1.66

• Subsumed subgraphs and wildcarding increase cross-application effectiveness
Questions?
http://cccp.eecs.umich.edu
Backup slides
Individual Factors - Blowfish
[Figure: blowfish speedup (1 to 1.7) with x-axis ticks from 0 to 16, one series each for IO, Latency, Area, Criticality, and All]
Individual Factors - Djpeg
[Figure: djpeg speedup (1 to 1.25) with x-axis ticks from 0 to 16, one series each for IO, Latency, Area, Criticality, and All]
Selection
• Uses estimates of performance improvement
• Greedy heuristic used
[Figure: the example DFG of ^, &, >>, +, and << nodes]