Transcript Slide 1

Xtensa C and C++ Compiler
Ding-Kai Chen
Tensilica, Inc
[email protected]
Presentation Outline
XCC history
XCC target -- Xtensa configurable processor
XCC details with examples
– User defined C types
– Operator overloading
– VLIW scheduling
– Auto-SIMD vectorization
– Operation fusion
– SWP Changes
Copyright © 2010, Tensilica, Inc.
XCC History
Got the first version of SGI Pro64 in May 2000
First customer release, August 2001
Release with IPA, August 2002
Release with SWP, Feedback, VLIW, September 2004
Release with GCC 4.2 Front End, October 2009
Supports C and C++ applications
– Other languages are not as important for embedded
applications
Xtensa Processor
32-bit RISC processor targeting embedded dataplane applications
16 32-bit general registers (AR)
24-bit base instructions
Configurable at design-time (not at run-time)
[Figure: Xtensa Core Architecture]
Xtensa Configuration Options
Many pre-defined options to choose from
– Endianness
– Windowed vs non-windowed register file
– Narrow (16-bit) instructions
– Multipliers
– Coprocessors (HiFi, Vectra, BBE, FP)
– Specialized (e.g., MAX) instructions, etc.
[Figure: Configuration Options layered on the Xtensa Core Architecture]
Targeting XCC to Base Xtensa and Tensilica
Configurations
As part of retargeting to Xtensa, we used/added
– Code-generator generator tool Olive for WHIRL to CGIR
translation
• Handles a lot of configuration specific code
– Support for Xtensa zero-overhead loop instructions
– CG Code-size optimization that commonizes instructions
from control-flow predecessors
– Feedback-directed speed vs code-size tradeoff
– Support for flexible VLIW formats
• Formats of different bit width and different number of issue slots
Tensilica Instruction Extension (TIE)
TIE is a language to describe new custom:
– Register files up to 512 bits wide
– Instructions up to 128 bits
– VLIW formats up to 15 slots
– C types mapped to custom register files
– Vectorization rules
– Fusion patterns
– Operator overloading
[Figure: Custom TIE layered on Configuration Options and the Xtensa Architecture]
XCC Challenges
Custom extensions in TIE are written at the customer site and cannot be configured at XCC build time
Design goals:
– Separation of config-independent code and config-dependent libraries
– Re-targeting in minutes after TIE is designed or modified by the processor architect at the customer site
– Programming new HW extensions as native C types/operations
Xtensa - Full Development Automation
[Diagram: two inputs feed the Xtensa Processor Generator*, which produces two outputs in minutes]
Inputs:
– Processor Configuration: 1. Select from menu
– Processor Extensions: 2. Explicit instruction description (TIE)
Outputs:
– Complete Hardware Design: source pre-verified RTL, EDA scripts, test suite. Use standard ASIC/COT design techniques and libraries for any IC fabrication process.
– Customized Software Tools: C/C++ compiler, debuggers, simulators, RTOSes
Iterate…
* US Patent: 6,477,697
TIE register file and operation
in TIE:
// new register file for int32x4 vectorization
regfile v 128 16
// a new C type based on <v> regfile
// and has 128-bit size and 128-bit alignment
ctype int32x4 128 128 v
operation add_v { out v vout, in v va, in v vb } {} {
  assign vout = {
    va[127:96] + vb[127:96],
    va[95:64] + vb[95:64],
    va[63:32] + vb[63:32],
    va[31:0] + vb[31:0] };
}
in C:
void vsum() {
  int i;
  int32x4* va = (int32x4*)a;
  int32x4* vb = (int32x4*)b;
  int32x4* vc = (int32x4*)c;
  for (i=0; i<VSIZE; i++) {
    // C intrinsic call
    vc[i] = add_v(va[i], vb[i]);
  }
}
add_v is an intrinsic call in C
In WHIRL, it is an intrinsic_op, which makes it optimizer friendly
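The lane-wise meaning of add_v can be checked on a host machine with a small C model. This is only a sketch: the names int32x4_model, add_v_model and vsum_model are hypothetical stand-ins, not part of XCC or TIE, and the struct merely imitates a 128-bit value made of four 32-bit lanes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side stand-in for the TIE ctype int32x4:
   four 32-bit lanes, 128 bits in total. */
typedef struct { int32_t lane[4]; } int32x4_model;

/* Models add_v: four independent 32-bit additions.
   The sum is formed in unsigned arithmetic so that it wraps like a
   32-bit hardware adder instead of overflowing a signed int in C. */
static int32x4_model add_v_model(int32x4_model va, int32x4_model vb)
{
    int32x4_model vout;
    for (int k = 0; k < 4; k++) {
        vout.lane[k] = (int32_t)((uint32_t)va.lane[k] + (uint32_t)vb.lane[k]);
    }
    return vout;
}

/* Same loop shape as vsum(): one add_v per group of four elements. */
static void vsum_model(const int32x4_model *va, const int32x4_model *vb,
                       int32x4_model *vc, int vsize)
{
    for (int i = 0; i < vsize; i++) {
        vc[i] = add_v_model(va[i], vb[i]);
    }
}
```

Calling vsum_model on arrays of int32x4_model plays the role of the intrinsic loop in vsum(); on real hardware each add_v_model call would be a single add_v instruction.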
TIE C type support
Each TIE C type maps to a new WHIRL mtype
Each TIE regfile maps to an ISA_REGCLASS
GCC FE declares new C types and new intrinsics (added new
TIE_TYPE tree code)
WGEN translates TIE C type references to WHIRL loads/stores
Olive tool adds dynamic rules to handle new types and WHIRL
opcodes
Added TN_mtype() for register spills/reloads
Made BE optimizations (CSE, ebo, etc) work
TIE example – generated code
#<loop> Loop body line 28, nesting depth: 1, iterations: 8
#<loop> unrolled 4 times
load_v v0,a2,0      # [0*II+0] id:20 b+0x0
load_v v1,a3,0      # [0*II+1] id:19 a+0x0
load_v v2,a2,16     # [0*II+2] id:20 b+0x0
load_v v3,a3,16     # [0*II+3] id:19 a+0x0
load_v v4,a2,32     # [0*II+4] id:20 b+0x0
load_v v5,a3,32     # [0*II+5] id:19 a+0x0
load_v v6,a2,48     # [0*II+6] id:20 b+0x0
load_v v7,a3,48     # [0*II+7] id:19 a+0x0
addi a2,a2,64       # [0*II+8]
addi a3,a3,64       # [0*II+9]
addi a4,a4,64       # [0*II+10]
add_v v0,v1,v0      # [0*II+11]
add_v v1,v3,v2      # [0*II+12]
add_v v2,v5,v4      # [0*II+13]
add_v v3,v7,v6      # [0*II+14]
store_v v0,a4,-64   # [0*II+15] id:21 c+0x0
store_v v1,a4,-48   # [0*II+16] id:21 c+0x0
store_v v2,a4,-32   # [0*II+17] id:21 c+0x0
store_v v3,a4,-16   # [0*II+18] id:21 c+0x0
Total 19/4 = 4.75 cycles per iteration
TIE updating ld/st
// pre-increment load/store
operation load_vu { out v vout, inout AR base, in simm8 offset } { out VAddr, in MemDataIn128 } {
  assign VAddr = base + offset;
  assign vout = MemDataIn128;
  assign base = base + offset;
}
operation store_vu { in v vin, inout AR base, in simm8 offset } { out VAddr, out MemDataOut128 } {
  assign VAddr = base + offset;
  assign MemDataOut128 = vin;
  assign base = base + offset;
}
proto int32x4_loadiu { out int32x4 vout, inout int32x4* base, in immediate offset } {} {
  load_vu vout, base, offset;
}
proto int32x4_storeiu { in int32x4 vin, inout int32x4* base, in immediate offset } {} {
  store_vu vin, base, offset;
}
TIE updating ld/st
#<loop> Loop body line 28, nesting depth: 1, iterations: 32
load_vu v0,a2,16    # [0*II+0] id:20 b+0x0
load_vu v1,a3,16    # [0*II+1] id:19 a+0x0
store_vu v2,a4,16   # [1*II+2] id:21 c+0x0
add_v v2,v1,v0      # [0*II+3]
Total 4 cycles per iteration
XCC identifies updating ld/st operations
Pre-bias ld/st bases to work with pre-increment
Combine ld/st with addi in CG
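The pre-biasing step can be illustrated with a host-side C model of the pre-increment semantics. This is a sketch under stated assumptions: load_vu_model, store_vu_model and vsum_updating are hypothetical names, and the address register is modeled as a signed byte offset into a buffer so that the pre-biased value of -16 never has to exist as an out-of-range C pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical host model of a 128-bit vector of four 32-bit lanes. */
typedef struct { int32_t lane[4]; } int32x4_model;

/* Models load_vu: VAddr = base + offset, the value is read from VAddr,
   and base is updated to VAddr (pre-increment). Offsets are in bytes. */
static int32x4_model load_vu_model(const unsigned char *mem,
                                   ptrdiff_t *base, ptrdiff_t offset)
{
    int32x4_model v;
    *base += offset;
    memcpy(&v, mem + *base, sizeof v);
    return v;
}

/* Models store_vu with the same address update rule. */
static void store_vu_model(unsigned char *mem, ptrdiff_t *base,
                           ptrdiff_t offset, int32x4_model v)
{
    *base += offset;
    memcpy(mem + *base, &v, sizeof v);
}

/* c[i] = a[i] + b[i] over n vectors using only updating accesses.
   Each base starts one vector before the data (the pre-bias), so the
   first pre-increment lands exactly on element 0. */
static void vsum_updating(const int32x4_model *a, const int32x4_model *b,
                          int32x4_model *c, int n)
{
    const ptrdiff_t step = (ptrdiff_t)sizeof(int32x4_model); /* 16 bytes */
    ptrdiff_t pa = -step, pb = -step, pc = -step;
    for (int i = 0; i < n; i++) {
        int32x4_model x = load_vu_model((const unsigned char *)a, &pa, step);
        int32x4_model y = load_vu_model((const unsigned char *)b, &pb, step);
        int32x4_model s;
        for (int k = 0; k < 4; k++) {
            s.lane[k] = (int32_t)((uint32_t)x.lane[k] + (uint32_t)y.lane[k]);
        }
        store_vu_model((unsigned char *)c, &pc, step, s);
    }
}
```

Because the address update is folded into each access, the loop body needs no separate addi instructions, which is where the saving over the non-updating version comes from.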
TIE operator overloading
in TIE:
// map "+" operator to add_v for
// type int32x4
operator "+" add_v
in C:
void vsum_op() {
  int i;
  int32x4* va = (int32x4*)a;
  int32x4* vb = (int32x4*)b;
  int32x4* vc = (int32x4*)c;
  for (i=0; i<VSIZE; i++) {
    // more natural using C "+" syntax
    vc[i] = va[i] + vb[i];
  }
}
Check for TIE type operands and operator overloading in build_binary_op in c-typeck.c of GCC
Build proper call to mapped TIE intrinsic
TIE VLIW scheduling
format flix0 64 {slot0,slot1} // add 2-slots 64-bit VLIW format
slot_opcodes slot0 { load_v, store_v, load_vu, store_vu, add_v }
slot_opcodes slot1 { load_v, store_v, load_vu, store_vu, add_v }
---------------------------------- .s output ----------------------------------
#<loop> unrolled 2 times
{
  # format flix0
  load_vu v3,a2,32    # [0*II+0] id:20 b+0x0
  add_v v5,v4,v3      # [1*II+0]
}
{
  # format flix0
  load_v v0,a2,-16    # [0*II+1] id:20 b+0x0
  add_v v2,v1,v0      # [1*II+1]
}
{
  # format flix0
  load_v v1,a3,16     # [0*II+2] id:19 a+0x0
  load_vu v4,a3,32    # [0*II+2] id:19 a+0x0
}
{
  # format flix0
  store_v v2,a4,16    # [1*II+3] id:21 c+0x0
  store_vu v5,a4,32   # [1*II+3] id:21 c+0x0
}
Total 4/2 = 2 cycles per iteration
TIE VLIW scheduling
XCC initialization includes analysis of TIE VLIW formats
Create resources that model bundling constraints
– Consider a simpler case: 1 slot is allowed for each opcode
– Each VLIW slot in a format is viewed as a resource
• Different formats are treated separately
– Each opcode consumes the resource of the slot it is allowed in
– For a group of operations, if the total resource usage is within the limit, they can be scheduled in the same cycle
– It gets complicated when multiple slots are allowed for opcodes
Resource reservation modeling allows de-coupling of scheduling and slot assignment in CG
Extended resource reservation word type SI_RRW to arbitrary length bit-vectors
TI_RES_RES_Resources_Available() also checks for compatible formats
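The simple case (one allowed slot per opcode, one format) can be sketched with a bit mask per cycle. This is an illustration only: slot_mask, op_model, reserve and can_bundle are hypothetical names and do not reflect the actual SI_RRW or TI_RES_RES interfaces, and a single 32-bit word stands in for the arbitrary-length bit-vectors mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One bit per VLIW slot of a single format; a set bit means "slot taken". */
typedef uint32_t slot_mask;

/* Simple case: every opcode may go in exactly one slot, so its
   resource requirement is a single bit. */
typedef struct {
    const char *name;   /* illustrative opcode name */
    slot_mask   needs;  /* the one slot this opcode is allowed in */
} op_model;

/* Try to reserve the slot an op needs in the current cycle.
   Fails without side effects if the slot is already taken. */
static bool reserve(slot_mask *cycle, const op_model *op)
{
    if (*cycle & op->needs) {
        return false;
    }
    *cycle |= op->needs;
    return true;
}

/* A group of ops fits in one cycle if every reservation succeeds,
   i.e. no slot resource is used more than once. */
static bool can_bundle(const op_model *ops, int n)
{
    slot_mask cycle = 0;
    for (int i = 0; i < n; i++) {
        if (!reserve(&cycle, &ops[i])) {
            return false;
        }
    }
    return true;
}
```

The scheduler only asks whether the resources fit; which op ends up in which slot can be decided later, which is the de-coupling of scheduling and slot assignment the slide describes. Opcodes that are legal in several slots need a richer model than this one-bit-per-op sketch.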
TIE auto-SIMD vectorization
property vector_ctype {int32x4, int32, 4}
property vector_proto {add_v, xt_add, 4}
in C:
for (i=0; i<SIZE; i++) {
  c[i] = a[i] + b[i];
}
with -O3 -LNO:simd -clist, in .w2c:
int32x4 V_00;
int32x4 V_;
int32x4 V_0;
int32x4 V_4;
_INT32 i;
for(i = 0; i <= 127; i = i + 4)
{
  V_00 = *(int32x4 *)(&a[i]);
  V_ = *(int32x4 *)(&b[i]);
  V_0 = add_v(V_00, V_);
  V_4 = V_0;
  * (int32x4 *)(&c[i]) = V_4;
}
TIE auto-SIMD vectorization
Developed independently of (and before) the Open64 vectorizer
Integrated into Phase 2 of LNO
Scans all loops in a nest
Checks for presence of vectorized versions of each op in the loop
Checks for stride-1 or invariant memory references
Support for loads and stores with addresses not aligned to the vector type
– Pre-load once before the vector loop
– Subsequent loads in the vector loop combine with the prior loads
Support for spatial reuse within a vector using select instruction
– E.g. a[i] + a[i+1] in the scalar loop
• Pre-load once before the vector loop
• Only a single load is needed now for each iteration
• Select instructions shuffle data from loads of consecutive iterations
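The spatial-reuse case a[i] + a[i+1] can be modeled in plain C as follows. This is a sketch, not XCC output: int32x4_model, load4, select_shift1 and adjacent_sum are hypothetical names, the select is modeled as a fixed one-lane shuffle, and the input array is assumed to have at least n + 4 readable elements with n a multiple of 4.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical host model of a vector of four 32-bit lanes. */
typedef struct { int32_t lane[4]; } int32x4_model;

/* Load four consecutive elements starting at p. */
static int32x4_model load4(const int32_t *p)
{
    int32x4_model v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Models a select that takes lanes 1..3 of lo followed by lane 0 of hi,
   i.e. the vector that starts one element after lo. */
static int32x4_model select_shift1(int32x4_model lo, int32x4_model hi)
{
    int32x4_model r = {{ lo.lane[1], lo.lane[2], lo.lane[3], hi.lane[0] }};
    return r;
}

/* out[i] = a[i] + a[i+1] for i in [0, n). */
static void adjacent_sum(const int32_t *a, int32_t *out, int n)
{
    int32x4_model cur = load4(a);               /* pre-load before the loop */
    for (int i = 0; i < n; i += 4) {
        int32x4_model next = load4(a + i + 4);  /* the single load per iteration */
        int32x4_model shifted = select_shift1(cur, next);
        for (int k = 0; k < 4; k++) {
            out[i + k] = (int32_t)((uint32_t)cur.lane[k] +
                                   (uint32_t)shifted.lane[k]);
        }
        cur = next;                             /* reuse in the next iteration */
    }
}
```

Without the select, each iteration would need two loads, one of them unaligned; here the vector for a[i+1] is assembled from data that is already in registers.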
TIE operation fusion
imap add_shift_v { out v vout, in v va, in v vb, in immediate amount }
{ {}
  { // the output pattern
    add_shift_v vout, va, vb, amount;
  }
}
{ { v v_temp }
  { // the input pattern
    add_v v_temp, va, vb;
    shift_v vout, v_temp, amount;
  }
}
Combine multiple operations to one
E.g., combines an add followed by a shift to one add_shift operation
Performed in CG
Build dataflow graphs from input patterns
Repeatedly search for matches in BBs
Peephole optimization with custom patterns
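A much simplified version of the matching step can be sketched in C. This is an assumption-laden illustration, not the XCC implementation: ir_op, used_later and fuse_add_shift are hypothetical names, only adjacent add/shift pairs are matched (XCC builds dataflow graphs instead), and temporaries are assumed not to be live out of the basic block.

```c
#include <assert.h>

enum opcode { OP_ADD_V, OP_SHIFT_V, OP_ADD_SHIFT_V, OP_NOP };

/* Hypothetical IR op for one basic block. */
typedef struct {
    enum opcode opc;
    int dst, src1, src2;  /* virtual register numbers; src2 unused by shift */
    int imm;              /* shift amount for shift_v and add_shift_v */
} ir_op;

/* True if register r is read by any op in ops[from..n).
   Conservative: later redefinitions of r are not taken into account. */
static int used_later(const ir_op *ops, int n, int from, int r)
{
    for (int i = from; i < n; i++) {
        if (ops[i].opc == OP_NOP) continue;
        if (ops[i].src1 == r) return 1;
        if (ops[i].opc != OP_SHIFT_V && ops[i].src2 == r) return 1;
    }
    return 0;
}

/* Rewrite each adjacent add_v/shift_v pair matching the input pattern
   into one add_shift_v. Returns the number of fusions performed. */
static int fuse_add_shift(ir_op *ops, int n)
{
    int fused = 0;
    for (int i = 0; i + 1 < n; i++) {
        ir_op *add = &ops[i];
        ir_op *sh = &ops[i + 1];
        if (add->opc != OP_ADD_V || sh->opc != OP_SHIFT_V) continue;
        if (sh->src1 != add->dst) continue;   /* shift must consume the sum */
        /* The temporary must have no other reader, unless the shift
           overwrites it anyway. */
        if (sh->dst != add->dst && used_later(ops, n, i + 2, add->dst)) continue;
        sh->opc = OP_ADD_SHIFT_V;             /* the output pattern */
        sh->src1 = add->src1;
        sh->src2 = add->src2;
        add->opc = OP_NOP;                    /* the add is now dead */
        fused++;
    }
    return fused;
}
```

The check on later readers of the temporary is what keeps the rewrite safe: if another op still needs the plain sum, the add cannot be removed.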
TIE operation fusion
Example C code:
for (i=0; i<VSIZE; i++) {
  vc[i] = (va[i] + vb[i]) << 2;
}
Original schedule is 5 cycles / 2 iter = 2.5 cycles per iteration
New schedule with operation fusion is 4 cycles / 2 iter = 2 cycles per iteration
XCC SWP scheduler
Xtensa has no rotating registers, so we added 2 register allocators: simple and coloring. The simple allocator is used first to get a tighter bound, then coloring is tried.
Performance is critical: added back-tracking for the following
– Unrolling (hard to guess best unrolling)
– Different priority heuristics for choosing candidates
– Different initial op orderings
– Register allocation failures
Runs slightly longer but complements the original IA-64 based SWP algorithm well
Conclusion
Open64 is versatile in providing optimized performance for embedded applications.
XCC experience shows that many of the optimizations can be retargeted quickly to ISA extensions.
Sample Performance Data:
– EEMBC Consumer benchmark gained 6x speedup with automatic vectorization + VLIW scheduling + operation fusion
XCC solution is not final. It is still evolving with new HW features offered by Tensilica.
Want to explore new ways in TIE to describe HW that supports optimizations.
Tensilica is looking for new talent
to join the compiler team.
http://www.tensilica.com
[email protected]