Transcript (Slide 1)
Writing software for embedded systems Part I: Introduction, design flow Part II: The role of compilers
Alessandro Dalla Torre, Andrea Marongiu {alessandro.dallatorre, andrea.marongiu3}@.unibo.it
Introduction Writing software for embedded systems
Writing software for embedded systems
• Traditional method: microprocessor emulation.
  The software engineer:
  1. develops his code on a PC or workstation;
  2. uses the emulator as a window into the system.
• Alternative approach: ready-built prototype board;
  – a way to download and debug the code on the target board has to be supplied.
The Compilation Process: Middleware and OS integration
• The Toolchain: a set of computer programs (tools) that are used to create a product (typically another program).
Native versus cross-compilers.
Run-time libraries.
– Processor/architecture dependent (system calls)
  • Memory allocation, task control, semaphores.
– IO dependent.
Writing and linking additional libraries.
– Embedded OS like RTEMS.
– Communication libraries, for instance Message Passing support library like MP-Queue.
Downloading binary code into the target platform
• Serial lines, parallel ports:
  1. The host is connected to the onboard debugger via a serial comms port.
  2. A download command sends the file to the target debugger.
  3. The target debugger converts the ASCII format back into binary and loads it at the correct location.
• Alternative: burn the program into EPROM or other forms of non-volatile memory like FLASH.
• If we are simulating on a virtual platform (emulator), the cross-compiled binary file has to be put into the correct path. From that location it will be picked up and the code loaded at bootstrap by the simulator:
  – Possibly a different binary may be bound to a different virtual core.
Instruction Set Simulator (ISS)
• An Instruction Set Simulator (ISS) is a simulation model, usually coded in a high-level language.
• It mimics the behavior of another hardware device or microprocessor by:
  – "reading" instructions;
  – maintaining internal variables which represent the processor's registers.
• The number of instructions needed to perform the basic "loop":
  – Fetch
  – Execute
  – Calculate new address
  depends on the hardware, but simulating a single target instruction always requires multiple local host instructions.
Debugging techniques
• High level language simulation
  – Directly on the host machine used for developing, by means of Linux threads/processes and IPC facilities.
• Task level debugging
  – The operating system may provide breakpointing facilities on system circumstances, like:
    • Events,
    • Messages,
    • Interrupt Routines.
• Low level simulation
  – ISS simulation: slower but more accurate (even cycle accurate).
• Onboard debugging
  – Code has to be downloaded into the target evaluation board;
  – A remote terminal can be attached and run on the host machine to monitor the program execution.
Cross Compiler
• A compiler capable of creating executable code for a platform other than the one on which the compiler is run.
• Its fundamental use is that of separating the build environment from the target environment.
  – Embedded systems have limited resources, often not powerful enough to run a compiler or a development environment (debugging).
  – A single build environment can be set up to compile for different (multiple) targets.
GNU Toolchain
• The set of programming tools used for developing both applications and the operating system.
• A vital component in Linux kernel development.
• A standard tool when developing software for embedded systems.
• Projects included in the GNU toolchain are: – GNU make – GNU Compiler Collection (GCC) – GNU Binutils
Compiling code

source code (#include header files) → pre-processor → compiler → assembler source (.s) → assembler → object code → linker (binutils) → executable file
Compiling code
• The pre-processor handles – Macros (#define) – Inclusions (#include) – Conditional code inclusion (#ifdef, #if) – Language extensions (#pragma).
• The compiler processes source code and turns it into assembler modules.
• The assembler converts them into (hexadecimal) object code.
• The linker takes the object files and searches library files to find the routines it calls. It calculates the address references and incorporates any symbolic information to create an executable file format.
Runtime Libraries
• Compilers only generate a small subset of a high-level language's facilities and commands from built-in routines.
• The compiler relies on libraries to provide the full range of functions that the language offers:
  – Processor dependent: mathematical functions, string manipulation and similar features that use the processor and don't need to communicate with peripherals;
  – I/O dependent: defines the hardware that the software needs to access. The library routine either drives the hardware directly or calls the operating system to perform its task;
  – System calls: typical routines are those which dynamically allocate memory, task control commands, use semaphores, etc.;
  – Exit routines: used to terminate programs and free up the memory used by the application.
Part II: The role of compilers
Translation: Program written in a Programming Language → Compiler → Assembly Language
High-level View of a Compiler
Source code → Compiler → Machine code (+ Errors)
• Must recognize legal (and illegal) programs
  – Understand and preserve the meaning of the source program.
• Must generate correct code
  – Map the functionality of the source program to the target (usually the ISA of some computer system).
• Must introduce optimizations on the original code
  1. Performance/Speed
  2. Code size
  3. Power consumption
Traditional Two-pass Compiler
Source code → Front End → IR → Back End → Machine code (+ Errors)
• Use an intermediate representation ( IR ) • Front end maps legal source code into IR • Back end maps IR into target machine code
Multiple front and back ends
Fortran, C/C++, Java, PASCAL → (one front end per language) → common IR → (one back end per target) → Target 1, Target 2, Target 3
• Must encode all language specific knowledge in each front end • Must encode all features in a single IR • Must encode all target specific knowledge in each back end
Anatomy of a Compiler
Program (character stream) → Lexical Analyzer (Scanner) → Token Stream → Syntax Analyzer (Parser) → Parse Tree → Semantic Analyzer → Intermediate Representation → Optimized Intermediate Representation → Assembly code
Lexical Analyzer (Scanner)
Character stream: 2 3 4 * ( 1 1 + - 2 2 )
Token stream: Num(234) mul_op lpar_op Num(11) add_op Num(-22) rpar_op

Error examples: 18..23 (not a number), val#ue (variable names cannot have the '#' character).
Syntax Analyzer (Parser)
E.g. x + 2 - y

Parse tree: goal → expr → expr op term, with leaves <id,x> + <number,2> - <id,y>.
This contains a lot of unneeded information.

Grammar:
1. goal → expr
2. expr → expr op term
3.      | term
4. term → number
5.      | id
6. op   → +
7.      | -
Syntax Analyzer (Parser)
int * foo(i, j, k))
int i; int j;
{
  for(i=0; i<j) {
    fi(i>j)
      return j;
}

The parser flags: the extra parenthesis in the function header, the for loop's missing increment, a clause that is not an expression, and "fi", which is not a keyword.
Semantic Analyzer
int * foo(i, j, k)
int i; int j;
{
  int x;
  x = x + j + N;
  return j;
}

The semantic analyzer flags: the type of k is not declared, the return type is mismatched, x is used uninitialized, and N is an undeclared variable.
Anatomy of a Compiler
Program (character stream) Lexical Analyzer (Scanner) Token Stream Syntax Analyzer (Parser) Parse Tree Semantic Analyzer Intermediate Representation Optimized Intermediate Representation Assembly code
Constant Propagation
int i, x, y;
x = 0; y = 0;
for(i = 0; i <= N; i++) {
  x = x + (4*a/b)*i + (i+1)*(i+1);
}
return x;
Uses of variables initialized with a constant value are replaced with the value itself
Algebraic Simplification
int i, x, y;
x = 0; y = 0;
for(i = 0; i <= N; i++) {
  x = x + (4*a/b)*i + (i+1)*(i+1);
  x = x + b*0;   →   x = x;
}
return x;
Simple algebraic expressions are evaluated and replaced with the resulting value
Copy Propagation
int i, x, y;
x = 0; y = 0;
for(i = 0; i <= N; i++) {
  x = x + (4*a/b)*i + (i+1)*(i+1);
  x = x;
}
return x;
Targets of direct assignments of the form y = x are replaced with their values. Another example: y = x; z = 3 + y; becomes y = x; z = 3 + x;
Common Subexpression Elimination
int i, x, y, t;
x = 0; y = 0;
for(i = 0; i <= N; i++) {
  t = i+1;
  x = x + (4*a/b)*i + t * t;
}
return x;
A subexpression that occurs more than once is replaced with the use of a temporary variable initialized with the subexpression
Dead Code Elimination
x = 0; y = 0;   →   x = 0;
for(i = 0; i <= N; i++) {
  t = i+1;
  x = x + (4*a/b)*i + t * t;
}
return x;
Code that is never reached during execution, or assignments to variables that are never used are considered dead and removed
Loop Invariant Removal
int i, x, t, u;
x = 0;
u = (4*a/b);
for(i = 0; i <= N; i++) {
  t = i+1;
  x = x + u*i + t * t;
}
return x;
Expressions within a loop that are independent from the iteration count are moved outside the loop body and computed only once
Anatomy of a Compiler
Program (character stream) Lexical Analyzer (Scanner) Token Stream Syntax Analyzer (Parser) Parse Tree Semantic Analyzer Intermediate Representation Code Optimizer Optimized Intermediate Representation Assembly code
Code Generator
• The first role of the backend is that of mapping the optimized IR:

int i, x, t, u;
x = 0;
u = (4*a/b);
for(i = 0; i <= N; i++) {
  t = i+1;
  x = x + u * i + t * t;
}

onto instructions of the target machine ISA:

sumcalc:
    xorl  %r8d, %r8d
    movl  %edx, %r9d
    jg    .L7
    sall  $2, %edi
.L5:
    movl  %edi, %eax
    idivl %esi
    leal  1(%rcx), %edx
    movl  %eax, %r10d
    movl  %edx, %ecx
    leal  (%r10,%rcx),
    addl  %eax, %r8d
    jle   .L5
.L7:
    movl  %r8d, %eax
    ret

• ..but it is not the only one! Some optimizations (register allocation, instruction scheduling) need knowledge of some details of the target architecture. They are implemented in the backend.
Back-end optimizations: Instruction scheduling
• Many pipeline stages:
  – Pentium: 5
  – Pentium Pro: 10
  – Pentium IV (130nm): 20
  – Pentium IV (90nm): 31
• Different instructions take different amounts of time to execute
• Most modern processors have multiple execution units (superscalar)
  – If the instruction sequence is correct, multiple operations will happen in the same cycle
  – Even more important to have the right instruction sequence
• Reorder instructions so that pipeline stalls are minimized
Data Dependency between Instructions
• If two instructions access the same variable, they can be dependent
• Kinds of dependencies:
  – True: write → read
  – Anti: read → write
  – Output: write → write
• What to do if two instructions are dependent?
  – The order of execution cannot be reversed
  – This reduces the possibilities for scheduling
Representing Dependencies
• Using a dependence DAG (Directed Acyclic Graph), one per basic block • Nodes are instructions, edges represent dependencies
1: r2 = *(r1 + 4)
2: r3 = *(r1 + 8)
3: r4 = r2 + r3
4: r5 = r2 - 1

• Each edge is labeled with its latency: the loads 1 and 2 have latency 2 on the edges to their uses in 3 and 4.
Example
5: add %r10, 8(%rsp) results in 4 cycles
7: imul %rax, %rbx results in 3 cycles

Schedule: 1 2 3 4 st st 5 6 st st st 7 st st

14 cycles
List Scheduling Algorithm
• Create a dependence DAG of a basic block
• Topological Sort
  – READY = nodes with no predecessors
  – Loop until READY is empty:
    • Consider when an instruction can be scheduled without causing a stall
    • Schedule the instruction if it causes no stall and all its predecessors are already scheduled
• Heuristics for selecting from the READY list:
  – pick the node with the longest path to a leaf in the dependence graph
  – pick the node with the most immediate successors
Example
Each node of the dependence DAG is annotated with:
– d: the longest path to a leaf node
– f: the number of immediate successors
(e.g. 3: inc %r11; 7: imul %rax, %rbx; 9: lea var_b, %rax; interior nodes carry values like d=7 f=1, d=5 f=1, d=4 f=1, d=3 f=1; leaves have d=0 f=0)

Evolution of the READY set as instructions are scheduled:
{6, 1, 4, 3} {1, 4, 3} {2, 4, 3} {4, 3} {4, 7, 3} {4, 3} {7, 3} {7, 3, 5} {3, 5} {3, 5, 8, 9} {5, 8, 9} {8, 9} {9} {}
Example
Applying the heuristics to the annotated DAG (d = longest path to a leaf, f = number of immediate successors) yields the schedule: 1 2 4 7 3 5 8 9 6
Example
1: lea var_a, %rax
2: add $4, %rax
3: inc %r11
4: mov 4(%rsp), %r10
7: imul %rax, %rbx
8: mov %rbx, 16(%rsp)
9: lea var_b, %rax

Original schedule: 1 2 3 4 st st 5 6 st st st 7 st st
New schedule: 1 2 4 7 3 5 8 9 6

14 cycles vs 9 cycles
Anatomy of a Compiler
Program (character stream) → Lexical Analyzer (Scanner) → Token Stream → Syntax Analyzer (Parser) → Parse Tree → Semantic Analyzer → Intermediate Representation → Optimized Intermediate Representation → Assembly code
GCC Internals
• Common intermediate representation
• Retargetable!
IR – Control Flow Graph
• Most analysis/optimization passes inside a compiler are performed on CFGs.
• A CFG is a directed graph which models the flow of control in the program.
• Each node corresponds to a basic block, i.e. a sequence of non-branch instructions.
• Edges correspond to possible transfers of control between blocks.
Control Flow Graph
add(n, k) {
  s = 0; a = 4; i = 0;
  if (k == 0) b = 1;
  else b = 2;
  while (i < n) {
    s = s + a*b;
    i = i + 1;
  }
  return s;
}

is turned into the CFG:

[s = 0; a = 4; i = 0;] → [k == 0] → [b = 1;] or [b = 2;] → [i < n] → [s = s + a*b; i = i + 1;] (back to [i < n]) or [return s;]
Basic Block Construction
• Start with the instruction-level control-flow graph (each basic block contains a single instruction)
• Visit all edges in the graph
• Merge adjacent nodes if:
  – there is only one edge out of the first node
  – there is only one edge into the second node
E.g. the single-instruction blocks [s = 0;] and [a = 4;] merge into [s = 0; a = 4;].

For the add(n, k) example this produces the basic blocks:
[s = 0; a = 4; i = 0;], [k == 0], [b = 1;], [b = 2;], [i < n], [s = s + a*b; i = i + 1;], [return s;]
Optimizing for parallel architectures Automatic Loop Parallelization
Types of Parallelism
• Instruction Level Parallelism (ILP): scheduling and hardware
• Task Level Parallelism (TLP): mainly by hand
• Loop Level Parallelism (LLP) or Data Parallelism: generated by hand or by the compiler
• Pipeline Parallelism: hardware or streaming
• Divide and Conquer Parallelism: recursive functions
Loop parallelization
• Why loops?
– 90% of the execution time is spent in 10% of the code
  • Mostly in loops
– If parallel, can get good performance
  • Load balancing
– Relatively easy to analyze

• How to automatically parallelize loops?
  – Find FORALL loops out of FOR loops
    • Data dependence analysis
  – Definition: a loop-carried dependence is a dependence that crosses a loop iteration boundary
  – If there are no loop-carried dependences, the loop is parallelizable
Programmer Defined Parallel Loop
• FORALL
  – No "loop-carried dependences"
  – Fully parallel
• FORACROSS
  – Some "loop-carried dependences"
Loop Splitting
• Example
FORPAR I = 0 to N A[I] = A[I] + 1
• Block Distribution: Program gets mapped into
Iters = ceiling(N/NUMPROC); FOR P = 0 to NUMPROC-1 FOR I = P*Iters to MIN((P+1)*Iters, N) A[I] = A[I] + 1
Code transformation

• The PARALLELIZING COMPILER turns sequential code:

int main() {
  ...
  for (i=0; i<N; i++)
    for (j=i; j<N; j++)
      ...
}

into parallel code:

void parallel_routine() {
  for (i=N*cpuID/nprocs; i<N*(cpuID+1)/nprocs; i++)
    for (j=i; j<N; j++)
      ...
}

int start() {
  // sequential code
  doall();
  // sequential code
}

plus a runtime library:

void initenv() {
  if (cpuID == MASTER) {
    start();
  } else {
    // spin until work provided
    parallel_routine();
    // spin until work provided
  }
}

void doall() {
  // release workers
  parallel_routine();
  // gather workers on barrier
}

// Synchronization facilities
// Lock implementation
// Barrier implementation

• The compiler may not be able to do the parallelization in the way you would like to see it:
  – A loop is not parallelized: the data dependency analysis is not able to determine whether it is safe to parallelize or not;
  – The granularity is not high enough: the compiler lacks information to parallelize at the highest possible level.
• This is when explicit parallelization through OpenMP directives comes into the picture.

Cooperative approaches: OpenMP

• Language extensions for shared memory concurrency (#pragma)
• Supports C, C++ and Fortran
• Embedded directives specify:
  – Parallelism
  – Data sharing semantics
  – Work sharing semantics
• Standard and increasingly popular
• A typical OpenMP implementation relies on the pthreads library.

OpenMP programming model

• Based on fork-join semantics:
  – The master thread spawns teams of children threads
  – Allows sequential and parallel execution
• With MPSoCs the SPMD paradigm is also used

OpenMP and GCC

Supporting OpenMP in GCC

• Recognize OpenMP pragmas
  – Within the parser in each frontend
• Lower into the GIMPLE IR
  – Augment it with new OMP nodes
• Identify parallel regions
  – Outline the body of parallel regions into functions that are used as arguments to the thread creation routines
• Identify data sharing
  – The master thread creates a local structure which contains all the data items marked for sharing and passes its address to the slaves
• Identify work sharing
  – Split computation between different cores (omp for)
• Runtime support to parallelization