A Framework for Studying Effects of
VLIW Instruction Encoding and
Decoding Schemes
Anup Gangwar
November 28, 2001
Embedded Systems Group
IIT Delhi
Overview
• The VLIW code size expansion problem
• What should such a framework support?
• Trimaran compiler infrastructure
• The HPL-PD architecture
• Extensions to the various modules of Trimaran
• Results
• Future work
• Acknowledgements
Choices for exploiting ILP
• The architectural choices for utilizing ILP
– Superscalar processors
• Try to extract ILP at run time
• Complex hardware
• Limited clock speeds and high power dissipation
• Not suited for embedded applications
– VLIW processors
• Compiler has a lot of knowledge about the hardware
• Compiler extracts ILP statically
• Simplified hardware
• Possible to attain higher clock speeds
Problems with VLIW processors
• Complex compiler required to extract ILP from the application program
• Requires adequate support in hardware for compiler-controlled execution
• Code size expansion due to explicit NOPs if
– The application does not contain enough parallelism
– The compiler is not able to extract parallelism from the application
• Need for good instruction encoding and NOP compression schemes
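To make the code size expansion concrete, here is a minimal Python sketch that counts the explicit NOPs in an uncompressed schedule. The five-slot issue width, the 32-bit operation field and the example schedule are assumptions chosen for illustration; they are not figures from the talk.

```python
SLOTS = 5      # assumed issue width
OP_BITS = 32   # assumed size of one operation field


def canonical_size_bits(schedule):
    """Size when every cycle emits all SLOTS fields, NOPs included."""
    return len(schedule) * SLOTS * OP_BITS


def nop_count(schedule):
    """Number of explicit NOPs needed to pad each cycle to SLOTS fields."""
    return sum(SLOTS - len(cycle) for cycle in schedule)


# Hypothetical schedule: each inner list holds the ops issued in one cycle.
schedule = [["ADD_W", "L_W_C1_C1"], ["ADD_W"], ["ADD_W", "ADD_W", "ADD_W"]]
useful_ops = sum(len(cycle) for cycle in schedule)
print(canonical_size_bits(schedule), nop_count(schedule), useful_ops)
```

In this made-up schedule more than half of the emitted fields are NOPs, which is the overhead an encoding scheme tries to remove.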
What should such a framework support?
• The framework should have quick retargetability
• Studying the effect of a particular instruction
encoding and decoding scheme on processor
performance
• Studying the code size minimization due to a
particular instruction encoding scheme
• Studying memory bandwidth requirements imposed
by a particular instruction decoding scheme.
Trimaran Compiler Infrastructure
C Program → IMPACT → Bridge Code → ELCOR → ELCOR IR → SIMULATOR → STATISTICS
• IMPACT
– ANSI C parsing
– Code profiling
– Classical machine independent optimizations
– Basic block formation
• ELCOR
– Machine dependent code optimizations
– Code scheduling
– Register allocation
• SIMULATOR
– ELCOR IR to low level C files
– HPL-PD virtual machine
– Cache simulation
– Performance statistics
• STATISTICS
– Compute and stall cycles
– Cache stats
– Spill code info
• HMDES Machine Description
Various modules of Trimaran - 1
• IMPACT
– Developed by UIUC’s IMPACT group
– Trimaran uses only the IMPACT front-end
– Classical machine independent optimizations
– Outputs a low level IR, Trimaran bridge code
• ELCOR
– Developed by HPL’s CAR group
– It is the compiler backend
– Performs register allocation and code scheduling
– Parameterized by HMDES machine description
– Outputs ELCOR IR with annotated HPL-PD assembly
Various modules of Trimaran - 2
• HMDES
– Developed by UIUC’s IMPACT group
– Specifies resource usage and latency information for an arch.
– Input is translated to a low level representation
– Has efficient mechanisms for querying the database
– Does not specify instruction format information
• HPL-PD Simulator
– Developed by NYU’s REACT-ILP group
– Converts ELCOR’s annotated IR to low level C representation
– Processor performance and cache simulation
– Generates statistics and execution trace
Various modules of Trimaran - 3
Example ELCOR Operation in IR
Op 7 ( ADD_W [ br<11 :I gpr 14>] [br<27 :I gpr 14> I<1> ]
p<t> s_time( 3 ) s_opcode( ADD_W.0 ) attr(lc ^52) flags( sched ) )
Various modules of Trimaran - 4
• HMDES Sections
– Field_Type e.g. REG, Lit etc.
– Resource e.g. Slot0, Slot1 etc.
– Resource_Usage e.g. RU_slot0 time( 0 )
– Reservation_Table e.g. RT_slot0 use( Slot0 )
– Operation_Latency e.g. lat1 ( time( 1 ) )
– Scheduling_Alternative e.g. (format(std1) resv(RT1) latency(lat1) )
– Operation e.g. ADD_W.0 ( Alt_1 Alt_2 )
– Elcor_Operation e.g. ADD_W( op( “ADD_W.0” “ADD_W.1” ) )
Various modules of Trimaran - 5
HPL-PD Simulator in detail
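• Inputs to the Code Processor: REBEL, HMDES
• Output of the Code Processor: Low level C files
• Inputs to the Native Compiler: Low level C files, Emulation Library, C libraries
• Output of the Native Compiler: Executable for the host platform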
Various modules of Trimaran - 7
HPL-PD Simulator in detail
• HPL-PD Virtual Machine
– Fetch Next Instruction
– Fetch Data
– Execute Instruction
• Instruction Accesses and Data Accesses are passed to the Dinero IV Cache Simulator
– Level I Instruction-Cache
– Level I Data-Cache
– Level II Unified Cache
The HPL-PD architecture
• Parameterized ILP architecture from HP Labs
• Possible to vary,
– Number and types of FUs
– Number and types of registers
– Width of instruction words
– Instruction latencies
• Predicated instruction execution
• Compiler visible cache hierarchy
• Result multicast is supported for predicate registers
• Run time memory disambiguation instructions
The HPL-PD memory hierarchy
• Registers
• L1 Cache
• Data Prefetch Cache
– Independent of L1 Cache
– Used to store large amount of cache polluting data
– Doesn’t require sophisticated cache replacement mechanism
• L2 Cache
• Main Memory
The Framework
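[Figure: block diagram of the framework]
• Inputs to TRIMARAN: HMDES, Decoder Model
• Outputs of TRIMARAN: Perf. Stats, Cache Stats
• ASSEMBLER (using NJMC): produces the Obj. File, which gives the Code Size
• DISASSEMBLER (using NJMC): receives the Instruction Address or Next Instr Request and reports the Bytes Fetched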
Studying impact on performance
• The HMDES modeling of decompressor,
– Add a new resource with latency of decoder
– Add a new resource usage section for this decoder
– Add this resource usage to all the HPL-PD operations
• In the results there are two decompressor units with
latency = 1
• The latency of decompressor should be estimated or
generated using actual simulation.
Studying code size minimization - 1
A simple template based instruction encoding scheme
Issue Slots: IALU.0 | IALU.1 | FALU.0 | MU.0 | BU.0
MUL_OP Format: MUL_OP | OPCODE & OPERANDS | OPCODE & OPERANDS
Example (ADD_W and L_W_C1_C1): 00010 | IOP ; Sgpr1, Slit1, Dgpr2 | MemOP ; Sgpr1, Dgpr1 | …..
• Multi-ops are decided after profiling the generated assembly code.
• Multi-op field encodes:
– Size and position of each Uni-op
– Number, size and position of operands of each Uni-op
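As a rough illustration of how a template field can stand in for explicit NOPs, the sketch below expands a compressed multi-op back into one entry per issue slot. The template table, and in particular the assignment of template 00010 to the IALU.0 and MU.0 slots, is a hypothetical choice made for this example only.

```python
SLOTS = ["IALU.0", "IALU.1", "FALU.0", "MU.0", "BU.0"]

# Hypothetical template table: template id -> slots that carry a uni-op.
TEMPLATES = {
    0b00010: ["IALU.0", "MU.0"],  # assumed mapping for ADD_W with L_W_C1_C1
}


def expand(template_id, uni_ops):
    """Expand a compressed multi-op to full issue width, filling NOPs."""
    used = TEMPLATES[template_id]
    if len(used) != len(uni_ops):
        raise ValueError("uni-op count does not match the template")
    by_slot = dict(zip(used, uni_ops))
    return [by_slot.get(slot, "NOP") for slot in SLOTS]


print(expand(0b00010, ["ADD_W", "L_W_C1_C1"]))
```

Only the template id and the two uni-ops are stored; the three NOPs are implied by the template and reappear during decoding.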
Studying code size minimization - 2
• Instrumenting ELCOR to generate assembly code
1. Arrange all the ops in the IR in forward control order
2. Choose the next basic block and initialize cycle to 0
3. Walk the ops of this BB and dump those with s_time = cycle; increment cycle and repeat until all ops of the BB are dumped
4. If BBs are left, go to step 2
5. Dump the global data
• Actual instruction encoding is done using procedures created by NJMC
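The dumping steps above can be sketched in a few lines of Python. The `Op` record, the textual output format and the choice to print `NOP` for an empty cycle are assumptions made for illustration; ELCOR's real IR classes are not modeled here.

```python
from collections import namedtuple

# Hypothetical stand-in for an ELCOR op: a name plus its scheduled cycle.
Op = namedtuple("Op", "name s_time")


def dump_assembly(basic_blocks, global_data):
    """basic_blocks: list of (label, [Op]) already in forward control order."""
    lines = []
    for label, ops in basic_blocks:
        lines.append(f"{label}:")
        last_cycle = max((op.s_time for op in ops), default=-1)
        # The cycle counter restarts at 0 for every basic block.
        for cycle in range(last_cycle + 1):
            issued = [op.name for op in ops if op.s_time == cycle]
            lines.append("  " + (" ; ".join(issued) if issued else "NOP"))
    lines.extend(global_data)  # global data is dumped last
    return lines


blocks = [("BB1", [Op("ADD_W", 0), Op("L_W_C1_C1", 0), Op("ADD_W", 2)])]
for line in dump_assembly(blocks, [".data x"]):
    print(line)
```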
Studying code size minimization - 3
The New Jersey Machine Code Toolkit
• Deals with bits at symbolic level
• Can be used to write assemblers, disassemblers etc.
• Supports concatenation to emit large binary data
• Representation is specified in SLED
• Has been used to write assemblers for Sparc, i486 etc.
• VLIW instructions need to be broken up into tokens of at most 32 bits
• Emitted binary data must end on an 8-bit boundary
Studying code size minimization - 4
Machine specifications in SLED
bit 0 is least significant
fields of TOK32 (32) Dgpr_1 0:3 Slit_1_part1 4:31
fields of TOK8 (16) Slit_1_part2 0:3 Sgpr_1 4:7 IOP 8:11 tmpl 12:14
patterns IOP_pats is any of [ ADD MUL SUB ],
which is tmpl = 1 & IOP = { 0 to 2 }
constructors
IOP_pats Sgpr_1, Slit_1, Dgpr_1 is
IOP_pats & Sgpr_1 & Slit_1_part2 = Slit_1 @[28:31];
Slit_1_part1 = Slit_1 @[0:27] & Dgpr_1
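To show what the field layout above amounts to at the bit level, here is a minimal Python sketch that packs one IOP instruction into a 16-bit token followed by a 32-bit token and unpacks it again. It follows the bit ranges in the two `fields` declarations, with bit 0 least significant; the function names are hypothetical and this is not the code the toolkit generates.

```python
def encode_iop(iop, sgpr_1, slit_1, dgpr_1, tmpl=1):
    """Pack the fields into (16-bit token, 32-bit token)."""
    part2 = (slit_1 >> 28) & 0xF      # Slit_1 @[28:31]
    part1 = slit_1 & 0x0FFFFFFF       # Slit_1 @[0:27]
    # 16-bit token: Slit_1_part2 0:3, Sgpr_1 4:7, IOP 8:11, tmpl 12:14
    tok16 = part2 | ((sgpr_1 & 0xF) << 4) | ((iop & 0xF) << 8) | ((tmpl & 0x7) << 12)
    # 32-bit token: Dgpr_1 0:3, Slit_1_part1 4:31
    tok32 = (dgpr_1 & 0xF) | (part1 << 4)
    return tok16, tok32


def decode_iop(tok16, tok32):
    """Recover the fields from the two tokens."""
    return {
        "tmpl": (tok16 >> 12) & 0x7,
        "iop": (tok16 >> 8) & 0xF,
        "sgpr_1": (tok16 >> 4) & 0xF,
        "slit_1": ((tok16 & 0xF) << 28) | (tok32 >> 4),
        "dgpr_1": tok32 & 0xF,
    }


ADD = 0  # first name in the IOP_pats list, so IOP = 0
tokens = encode_iop(ADD, sgpr_1=3, slit_1=0xABCDEF12, dgpr_1=5)
print([hex(t) for t in tokens], decode_iop(*tokens))
```

The two tokens total 48 bits, so the emitted data ends on an 8-bit boundary as the toolkit requires.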
Studying code size minimization - 5
Toolkit encoder output
ADD( unsigned Sgpr_1, unsigned Slit_1, unsigned Dgpr_1 );
MUL( unsigned Sgpr_1, unsigned Slit_1, unsigned Dgpr_1 );
SUB( unsigned Sgpr_1, unsigned Slit_1, unsigned Dgpr_1 );
Specifying matcher for disassembler
match
| ADD( Sgpr_1, Slit_1, Dgpr_1 ) => //Do something
| MUL( Sgpr_1, Slit_1, Dgpr_1 ) => //Do something
| SUB( Sgpr_1, Slit_1, Dgpr_1 ) => //Do something
endmatch
Studying code size minimization - 6
• The matcher application needs functions for fetching data
• Bit ordering is different on little and big endian machines
• The matcher fails when a large number of complex templates are given
• Breaking large sized multi-ops across 32 bit tokens makes
the representation messy and error prone
• Specifying addresses for forward branches requires two
passes
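The need for two passes comes from variable instruction sizes: the address of a forward label is unknown until every instruction before it has been sized. A minimal Python sketch of the idea follows; the program representation, mnemonics and sizes are hypothetical.

```python
def assemble(program):
    """program items: ("label", name) or ("op", mnemonic, size_bytes, target_or_None)."""
    # Pass 1: walk the program once to assign an address to every label.
    addresses, pc = {}, 0
    for item in program:
        if item[0] == "label":
            addresses[item[1]] = pc
        else:
            pc += item[2]
    # Pass 2: emit each op with its branch target resolved to an address.
    out, pc = [], 0
    for item in program:
        if item[0] == "op":
            _, mnemonic, size, target = item
            resolved = addresses[target] if target is not None else None
            out.append((pc, mnemonic, resolved))
            pc += size
    return out


program = [
    ("op", "BRU", 4, "done"),   # forward branch: target not yet seen
    ("op", "ADD_W", 6, None),
    ("label", "done"),
    ("op", "ADD_W", 4, None),
]
print(assemble(program))
```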
Studying impact on memory bandwidth - 1
The Typical VLIW Pipeline
[Figure: pipeline stages] Instruction Fetch → Instruction Decode (Align, Decompress, Decode) → DF/AG → Execute → Store Results
Studying impact on memory bandwidth - 2
• The cache simulation requires the generation of,
– Instruction address
– No. of bytes to fetch
• Instruction address can be generated by disassembling
the instructions at run time and keeping track of jumps
• The matcher application returns the number of bytes
required to disassemble an instruction
• The disassembled instruction can be compared with the
instruction issued to check correctness
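A minimal Python sketch of generating the (instruction address, bytes to fetch) pairs that drive the cache simulation is given below. In the framework the byte counts come from the matcher and the jumps from the simulated execution; here both are hypothetical dictionaries supplied by hand.

```python
def fetch_trace(sizes, taken_jumps, entry, max_steps=1000):
    """Return the list of (address, nbytes) instruction fetches.

    sizes: address -> bytes needed to disassemble the instruction there.
    taken_jumps: address -> target of a taken jump (absent: fall through).
    """
    trace, pc = [], entry
    for _ in range(max_steps):
        if pc not in sizes:      # ran off the end of the program
            break
        nbytes = sizes[pc]
        trace.append((pc, nbytes))
        pc = taken_jumps.get(pc, pc + nbytes)
    return trace


sizes = {0: 6, 6: 4, 10: 6, 16: 2}
trace = fetch_trace(sizes, taken_jumps={6: 16}, entry=0)
print(trace, sum(n for _, n in trace))
```

Each pair in the trace corresponds to one instruction-cache access, and the sum of the byte counts gives the total bytes fetched.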
Studying impact on memory bandwidth - 3
• Run time verification of disassembled instructions can be
turned off for faster simulation
• Due to the restricted size of the matcher, results could not be obtained for larger programs
• Memory access addresses and bytes to fetch have been
generated by hand for SumToN application
Results - Impact on code size (Strcpy)
[Figure: bar chart comparing X86, Sparc and HPL-PD; bar labels 370, 280 and 207; axis from 0 to 400]
Results - Impact on code size (SumToN)
[Figure: bar chart comparing X86, Sparc and HPL-PD; bar labels 159, 97 and 59; axis from 0 to 200]
Results - Size of SLED specification for various archs.
[Figure: bar chart comparing X86, Sparc and HPL-PD; bar labels 15553, 11500 and 13199; axis from 0 to 20000]
Results - Cache performance comparison (SumToN)
[Figure: bar chart comparing Canonical and Encoded for cases 1 and 2; bar labels 320, 256, 196 and 160; axis from 0 to 350]
Future work
• Need for automation in most parts of the framework
• Better representation for VLIW instructions than SLED
– Unlimited token size
– Facility to bind one field with multiple patterns
• Methodology for predicting latency for decompressor
• Framework for finding the optimal instruction formats
Acknowledgements
• Prof. M. Balakrishnan and Prof. Anshul Kumar
• Rodric M. Rabbah, Georgia Institute of Technology
• Shail Aditya, HP Labs
• All the friends at Philips Lab. for stimulating discussions