
Writing software for embedded systems
Part I: Introduction, design flow
Part II: The role of compilers

Alessandro Dalla Torre, Andrea Marongiu {alessandro.dallatorre, andrea.marongiu3}@unibo.it

Introduction: Writing software for embedded systems

• Traditional method: microprocessor emulation. The software engineer:
  1. develops his code on a PC or workstation;
  2. uses the emulator as a window into the system.

• Alternative approach: a ready-built prototype board; a way to download and debug the code on the target board has to be supplied.

The Compilation Process: Middleware and OS integration

• The toolchain: a set of computer programs (tools) that are used to create a product (typically another program).

• Native versus cross-compilers.

• Run-time libraries:
  – processor/architecture dependent (system calls): memory allocation, task control, semaphores;
  – I/O dependent.

• Writing and linking additional libraries:
  – an embedded OS such as RTEMS;
  – communication libraries, for instance message-passing support libraries like MP-Queue.

Downloading binary code into the target platform

• Serial lines, parallel ports:
  1. The host is connected to the onboard debugger via a serial comms port.
  2. A download command sends the file to the target debugger.
  3. The target debugger converts the ASCII format back into binary and loads it at the correct location.

• Alternative: burn the program into EPROM or another form of non-volatile memory such as FLASH.

• If we are simulating on a virtual platform (emulator), the cross-compiled binary file has to be put into the correct path. From that location it will be picked up and the code loaded at bootstrap by the simulator:
  – possibly, a different binary may be bound to a different virtual core.

Instruction Set Simulator (ISS)

• An Instruction Set Simulator (ISS) is a simulation model, usually coded in a high-level language.

• It mimics the behavior of another hardware device or microprocessor by:
  – "reading" instructions;
  – maintaining internal variables which represent the processor's registers.

• The number of host instructions needed to perform the basic loop (fetch, execute, calculate the new address) depends on the hardware, but simulating a single target instruction always takes multiple host instructions.
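The loop above can be sketched in a few lines of C. The toy ISA below (opcodes, register count, memory layout) is invented purely for illustration; a real ISS models a specific processor's encoding, flags and timing.

    /* Minimal ISS sketch: fetch, decode/execute, advance the PC.
       The toy ISA (opcodes, 4 registers, 256 bytes of memory) is invented. */
    #include <stdint.h>
    #include <stdio.h>

    enum { OP_HALT = 0, OP_LOAD, OP_ADD, OP_STORE };   /* hypothetical opcodes */

    int main(void)
    {
        uint8_t mem[256] = {
            /* LOAD r0,[16]; ADD r0,[17]; STORE r0,[18]; HALT */
            OP_LOAD, 0, 16,  OP_ADD, 0, 17,  OP_STORE, 0, 18,  OP_HALT,
            [16] = 5, [17] = 2
        };
        uint32_t regs[4] = {0};   /* internal variables modelling the registers */
        uint32_t pc = 0;

        for (;;) {
            uint8_t op = mem[pc];                               /* fetch          */
            switch (op) {                                       /* decode/execute */
            case OP_LOAD:  regs[mem[pc+1]]  = mem[mem[pc+2]];           pc += 3; break;
            case OP_ADD:   regs[mem[pc+1]] += mem[mem[pc+2]];           pc += 3; break;
            case OP_STORE: mem[mem[pc+2]] = (uint8_t)regs[mem[pc+1]];   pc += 3; break;
            case OP_HALT:  printf("mem[18] = %d\n", mem[18]);   return 0;
            }
            /* each simulated instruction costs many host instructions here */
        }
    }

Note how each case of the switch spends several host instructions (array indexing, arithmetic, a branch) to model one target instruction, which is why ISS-based debugging is accurate but slow.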

Debugging techniques

• High-level language simulation
  – Directly on the host machine used for development, by means of Linux threads/processes and IPC facilities.

• Task-level debugging
  – The operating system may provide breakpointing facilities on system circumstances, such as: events, messages, interrupt routines.

• Low-level simulation
  – ISS simulation: slower, but more accurate (even cycle-accurate).

• Onboard debugging
  – Code has to be downloaded into the target evaluation board;
  – a remote terminal can be attached and run on the host machine to monitor the program execution.

Cross Compiler

• A compiler capable of creating executable code for a platform other than the one on which the compiler is run.

• Its fundamental use is that of separating the build environment from the target environment:
  – embedded systems have limited resources, often not powerful enough to run a compiler or a development environment (debugging);
  – a single build environment can be set up to compile for different (multiple) targets.

GNU Toolchain

• The set of programming tools used for developing both applications and the operating system.

• A vital component in Linux kernel development.

• A standard tool when developing software for embedded systems.

• Projects included in the GNU toolchain are:
  – GNU make
  – GNU Compiler Collection (GCC)
  – GNU Binutils

Compiling code

source code:

    #include <stdio.h>                       /* handled by the pre-processor */
    #define TRUE 1                           /* handled by the pre-processor */
    #define FALSE 0

    int main()
    {
        int i;
        i = 5 * 2;                           /* handled by the compiler      */
        printf("5 times 2 is %d.\n", i);     /* handled by library routines  */
        printf("TRUE is %d.\n", TRUE);
        printf("FALSE is %d.\n", FALSE);
        return 0;
    }

Compiling code

source code → pre-processor (header files) → compiler → assembler file (.s) → assembler → object code → linker (binutils) → executable file

Compiling code

• The pre-processor handles:
  – Macros (#define)
  – Inclusions (#include)
  – Conditional code inclusion (#ifdef, #if)
  – Language extensions (#pragma).
  (See the sketch after this list.)

• The compiler processes source code and turns it into assembler modules.

• The assembler converts them into (hexadecimal) object code.

• The linker takes the object files and searches the library files to find the routines they call. It calculates the address references and incorporates any symbolic information to create an executable file.
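As a small illustration of the pre-processor's role, the fragment below uses a made-up DEBUG macro and LOG() helper; after pre-processing, the compiler proper never sees the #define/#if lines, only their expansion.

    /* Illustration only: DEBUG and LOG are invented names. */
    #include <stdio.h>

    #define DEBUG 1                      /* macro: handled by the pre-processor */

    #if DEBUG                            /* conditional code inclusion          */
    #  define LOG(msg) printf("dbg: %s\n", msg)
    #else
    #  define LOG(msg) ((void)0)         /* compiled away when DEBUG is 0       */
    #endif

    int main(void)
    {
        LOG("entering main");            /* expands to a printf() call, or to nothing */
        return 0;
    }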

Runtime Libraries

• Compilers only generate a small subset of a high-level language's facilities and commands from built-in routines.

• The compiler relies on libraries to provide the full range of functions that the language offers:
  – Processor dependent: mathematical functions, string manipulation and similar features that use the processor and don't need to communicate with peripherals;
  – I/O dependent: defines the hardware that the software needs to access; the library routine either drives the hardware directly or calls the operating system to perform its task (a sketch of such a routine follows this list);
  – System calls: typical routines are those which dynamically allocate memory, task control commands, semaphores, etc.;
  – Exit routines: used to terminate programs and free up the memory used by the application.
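A common shape for an I/O-dependent routine is the low-level output stub that a C library's printf() eventually reaches. The sketch below retargets it to a memory-mapped UART; the stub name (_write) follows the newlib convention, and the UART address and register layout are invented for illustration.

    /* Sketch of an I/O-dependent runtime routine: low-level character output
       retargeted to a memory-mapped UART. Addresses and bit masks are invented;
       the _write name follows the newlib convention and may differ per C library. */
    #include <stdint.h>

    #define UART_TX       (*(volatile uint32_t *)0x40001000u)  /* hypothetical TX register */
    #define UART_STATUS   (*(volatile uint32_t *)0x40001004u)  /* hypothetical status reg  */
    #define UART_TX_READY 0x1u

    int _write(int fd, const char *buf, int len)
    {
        (void)fd;                                   /* single output stream assumed */
        for (int i = 0; i < len; i++) {
            while (!(UART_STATUS & UART_TX_READY))  /* drive the hardware directly  */
                ;                                   /* busy-wait until TX is free   */
            UART_TX = (uint8_t)buf[i];
        }
        return len;
    }

On a system with an operating system, the same stub would instead issue a system call rather than touch the hardware itself.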

Part II: The role of compilers

Program written in a programming language → Compiler → Assembly language translation

High-level View of a Compiler

Source code → Compiler → Machine code (+ errors)

• Must recognize legal (and illegal) programs
  – Understand and preserve the meaning of the source program.

• Must generate correct code
  – Map the functionality of the source program to the target (usually the ISA of some computer system).

• Must introduce optimizations on the original code:
  1. performance/speed
  2. code size
  3. power consumption

Traditional Two-pass Compiler

Source code → Front End → IR → Back End → Machine code (+ errors)

• Use an intermediate representation (IR)
• Front end maps legal source code into IR
• Back end maps IR into target machine code

Multiple front and back ends

[Diagram: Fortran, C/C++, Java and PASCAL front ends all produce the same IR, which feeds separate back ends for Target 1, Target 2 and Target 3.]

• Must encode all language-specific knowledge in each front end
• Must encode all features in a single IR
• Must encode all target-specific knowledge in each back end

Anatomy of a Compiler

Program (character stream) → Lexical Analyzer (Scanner) → Token Stream → Parse Tree → Semantic Analyzer → Intermediate Representation → Optimized Intermediate Representation → Assembly code

Lexical Analyzer (Scanner)

Character stream:  2 3 4 * ( 1 1 + - 2 2 )
Token stream:      Num(234) mul_op lpar_op Num(11) add_op Num(-22) rpar_op

Errors caught by the scanner:
  18..23   → not a number
  val#ue   → variable names cannot have the '#' character
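A scanner for this little expression language fits in a few lines of C; the token names below mirror the slide, and the code is only a sketch (no lookahead table, no symbol interning).

    /* Minimal scanner sketch: turns a character stream into the token stream above. */
    #include <ctype.h>
    #include <stdio.h>

    int main(void)
    {
        const char *p = "234*(11+-22)";
        while (*p) {
            if (isdigit((unsigned char)*p) ||
                (*p == '-' && isdigit((unsigned char)p[1]))) {   /* (signed) number */
                int sign = (*p == '-') ? (p++, -1) : 1;
                int val = 0;
                while (isdigit((unsigned char)*p))
                    val = val * 10 + (*p++ - '0');
                printf("Num(%d) ", sign * val);
            } else {
                switch (*p++) {                                  /* one-character operators */
                case '*': printf("mul_op ");  break;
                case '+': printf("add_op ");  break;
                case '(': printf("lpar_op "); break;
                case ')': printf("rpar_op "); break;
                default:  printf("error ");   break;             /* e.g. '#' is rejected */
                }
            }
        }
        printf("\n");  /* prints: Num(234) mul_op lpar_op Num(11) add_op Num(-22) rpar_op */
        return 0;
    }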

Syntax Analyzer (Parser)

Example: x + 2 - y

[Parse tree for x + 2 - y, built from the grammar below: goal → expr → expr op term, with leaves <id, x>, <number, 2>, <id, y> and the operators + and -. The tree contains a lot of unneeded information.]

Grammar:
  1. goal → expr
  2. expr → expr op term
  3.       | term
  4. term → number
  5.       | id
  6. op   → +
  7.       | -
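A hand-written parser for this grammar would first remove the left recursion in rule 2 (expr → expr op term), a standard rewrite into iteration. The sketch below only accepts or rejects input strings; it builds no tree and treats identifiers as single letters for brevity.

    /* Recursive-descent sketch for the grammar above.
       expr -> expr op term  is rewritten as  expr -> term (op term)* . */
    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    static const char *p;                    /* cursor into the input */

    static void term(void)                   /* term -> number | id */
    {
        if (isdigit((unsigned char)*p)) {
            while (isdigit((unsigned char)*p)) p++;
        } else if (isalpha((unsigned char)*p)) {
            p++;
        } else {
            printf("syntax error at '%c'\n", *p);
            exit(1);
        }
    }

    static void expr(void)                   /* expr -> term (op term)* */
    {
        term();
        while (*p == '+' || *p == '-') {     /* op -> + | - */
            p++;
            term();
        }
    }

    int main(void)
    {
        p = "x+2-y";                         /* the example above */
        expr();                              /* goal -> expr */
        puts(*p == '\0' ? "accepted" : "error: trailing input");
        return 0;
    }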

Syntax Analyzer (Parser)

    int * foo(i, j, k))             /* extra parenthesis                */
    int i;
    int j;
    {
        for (i = 0; i j) {          /* "i j" is not an expression;
                                       the increment is missing         */
            fi (i > j)              /* 'fi' is not a keyword            */
                return j;
        }
    }

Semantic Analyzer

    int * foo(i, j, k)              /* type of k not declared                 */
    int i;
    int j;
    {
        int x;
        x = x + j + N;              /* x used uninitialized; N undeclared     */
        return j;                   /* mismatched return type (int vs int *)  */
    }

Anatomy of a Compiler

Program (character stream) → Lexical Analyzer (Scanner) → Token Stream → Syntax Analyzer (Parser) → Parse Tree → Semantic Analyzer → Intermediate Representation → Optimized Intermediate Representation → Assembly code

Constant Propagation

    int i, x, y;
    x = 0;
    y = 0;
    for (i = 0; i <= N; i++) {
        x = x + (4*a/b)*i + (i+1)*(i+1);
    }
    return x;

Uses of variables initialized with a constant value are replaced with the value itself

Algebraic Simplification

    int i, x, y;
    x = 0;
    y = 0;
    for (i = 0; i <= N; i++) {
        x = x + (4*a/b)*i + (i+1)*(i+1);
        x = x + b*0;        →        x = x;
    }
    return x;

Simple algebraic expressions are evaluated and replaced with the resulting value

Copy Propagation

    int i, x, y;
    x = 0;
    y = 0;
    for (i = 0; i <= N; i++) {
        x = x + (4*a/b)*i + (i+1)*(i+1);
        x = x;
    }
    return x;

Targets of direct assignments of the form y = x are replaced with their values: later uses of y become uses of x. Another example:

    y = x;
    z = 3 + y;        →        z = 3 + x;

Common Subexpression Elimination

    int i, x, y, t;
    x = 0;
    y = 0;
    for (i = 0; i <= N; i++) {
        t = i+1;
        x = x + (4*a/b)*i + t * t;
    }
    return x;

A subexpression that occurs more than once is replaced with the use of a temporary variable initialized with the subexpression

Dead Code Elimination

    x = 0;
    y = 0;                              /* never used afterwards: dead, removed */
    for (i = 0; i <= N; i++) {
        t = i+1;
        x = x + (4*a/b)*i + t * t;
    }
    return x;

Code that is never reached during execution, or assignments to variables that are never used are considered dead and removed

Loop Invariant Removal

    int i, x, t, u;
    x = 0;
    u = (4*a/b);                        /* loop invariant, hoisted out of the loop */
    for (i = 0; i <= N; i++) {
        t = i+1;
        x = x + u*i + t * t;
    }
    return x;

Expressions within a loop that are independent from the iteration count are moved outside the loop body and computed only once

Anatomy of a Compiler

Program (character stream) → Lexical Analyzer (Scanner) → Token Stream → Syntax Analyzer (Parser) → Parse Tree → Semantic Analyzer → Intermediate Representation → Code Optimizer → Optimized Intermediate Representation → Assembly code

Code Generator

The first role of the backend is that of mapping the optimized intermediate representation onto instructions of the target machine ISA:

    int i, x, t, u;
    x = 0;
    u = (4*a/b);
    for (i = 0; i <= N; i++) {
        t = i+1;
        x = x + u * i + t * t;
    }

[Partial x86-64 listing of the generated sumcalc routine: sall $2 / idivl for 4*a/b before the loop, leal/addl arithmetic inside the .L5 loop, and movl %r8d, %eax / ret at .L7.]

..but it is not the only one! Some optimizations (register allocation, instruction scheduling) need knowledge of some details of the target architecture; they are implemented in the backend.

Back-end optimizations: Instruction scheduling

• Many pipeline stages:
  – Pentium: 5
  – Pentium Pro: 10
  – Pentium IV (130nm): 20
  – Pentium IV (90nm): 31

• Different instructions take different amounts of time to execute.

• Most modern processors have multiple execution units (superscalar):
  – if the instruction sequence is right, multiple operations happen in the same cycle;
  – so it is even more important to have the right instruction sequence.

• Reorder instructions so that pipeline stalls are minimized.

Data Dependency between Instructions

• If two instructions access the same variable, they can be dependent.

• Kinds of dependencies:
  – true: write → read
  – anti: read → write
  – output: write → write

• What to do if two instructions are dependent?
  – The order of execution cannot be reversed;
  – this reduces the possibilities for scheduling.
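The three kinds can be seen in a few lines of straight-line C (the function and variable names are only for illustration):

    /* The three dependence kinds on the same location. Reordering any dependent
       pair below would change the observable result. */
    void deps(int *a, int *b, int *c)
    {
        *a = 1;        /* S1 */
        *b = *a + 2;   /* S2: true dependence on S1 (write -> read of *a)         */
        *a = 3;        /* S3: anti dependence on S2 (read -> write of *a),
                              and output dependence on S1 (write -> write of *a)  */
        *c = *b;       /* S4: true dependence on S2 */
    }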

Representing Dependencies

• Using a dependence DAG (Directed Acyclic Graph), one per basic block.
• Nodes are instructions, edges represent dependencies.

    1: r2 = *(r1 + 4)
    2: r3 = *(r1 + 8)
    3: r4 = r2 + r3
    4: r5 = r2 - 1

[Dependence DAG: edges 1 → 3 and 2 → 3 (the loads feed the add) and 1 → 4 (r2 feeds the subtract).]

• Each edge is labeled with the latency of the dependence it represents.

Example

[Result of executing the block in the original order: instruction 5 (add %r10, 8(%rsp)) takes 4 cycles and instruction 7 (imul %rax, %rbx) takes 3 cycles, so the issue pattern is 1 2 3 4 st st 5 6 st st st 7 st st ('st' = stall), for a total of 14 cycles.]

List Scheduling Algorithm

• Create a dependence DAG of a basic block:
  – do a topological sort of the dependence DAG;
  – consider when an instruction can be scheduled without causing a stall;
  – schedule the instruction if it causes no stall and all its predecessors are already scheduled.

• Topological sort:
  – READY = nodes with no predecessors; loop until READY is empty.

• Heuristics for selecting from the READY list (a compact sketch of this selection loop follows):
  – pick the node with the longest path to a leaf in the dependence graph;
  – pick a node with the most immediate successors.
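The selection loop can be sketched in C. The dependence DAG below is a made-up five-node example (not the DAG of the slides), latencies are attached to edges, and only the first heuristic (longest latency path to a leaf) is used.

    /* List-scheduling ordering sketch: topological sort of a dependence DAG,
       always picking the READY node with the longest latency path to a leaf. */
    #include <stdio.h>

    #define NODES 5
    /* lat[i][j] > 0 means "j depends on i" with the given latency. */
    static const int lat[NODES][NODES] = {
        /*        0  1  2  3  4 */
        /* 0 */ { 0, 3, 3, 0, 0 },
        /* 1 */ { 0, 0, 0, 2, 0 },
        /* 2 */ { 0, 0, 0, 2, 0 },
        /* 3 */ { 0, 0, 0, 0, 1 },
        /* 4 */ { 0, 0, 0, 0, 0 },
    };

    static int path_to_leaf(int i)              /* longest latency path from i to a leaf */
    {
        int best = 0;
        for (int j = 0; j < NODES; j++)
            if (lat[i][j]) {
                int d = lat[i][j] + path_to_leaf(j);
                if (d > best) best = d;
            }
        return best;
    }

    int main(void)
    {
        int unsched[NODES], preds[NODES] = {0};
        for (int i = 0; i < NODES; i++) {
            unsched[i] = 1;
            for (int j = 0; j < NODES; j++)
                if (lat[j][i]) preds[i]++;      /* count unscheduled predecessors */
        }
        for (int done = 0; done < NODES; done++) {
            int pick = -1;
            for (int i = 0; i < NODES; i++)     /* READY = unscheduled, no pending preds */
                if (unsched[i] && preds[i] == 0 &&
                    (pick < 0 || path_to_leaf(i) > path_to_leaf(pick)))
                    pick = i;
            printf("schedule node %d (d=%d)\n", pick, path_to_leaf(pick));
            unsched[pick] = 0;
            for (int j = 0; j < NODES; j++)     /* successors lose one pending predecessor */
                if (lat[pick][j]) preds[j]--;
        }
        return 0;
    }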

Example

[Worked example: each DAG node is annotated with d, its longest path to a leaf, and f, its number of immediate successors (e.g. 3: inc %r11, 7: imul %rax, %rbx, 9: lea var_b, %rax). At every step the scheduler removes the best node from the READY set and inserts the successors that become ready; the set evolves from { 6, 1, 4, 3 } through states such as { 2, 4, 3 }, { 4, 7, 3 }, { 7, 3, 5 }, { 3, 5, 8, 9 }, { 5, 8, 9 }, { 8, 9 } and { 9 } down to { }.]

Example

[The same dependence DAG with its d/f annotations and, underneath, the order in which the list scheduler picks the nodes.]

Example

The basic block being scheduled (partial listing):

    1: lea var_a, %rax
    2: add $4, %rax
    3: inc %r11
    4: mov 4(%rsp), %r10
    5: add %r10, 8(%rsp)
    7: imul %rax, %rbx
    8: mov %rbx, 16(%rsp)
    9: lea var_b, %rax

14 cycles vs 9 cycles

[Timelines compared: the original order needs 14 cycles including the stalls; the list-scheduled order completes in 9 cycles with no stalls.]

Anatomy of a Compiler

Program (character stream) → Lexical Analyzer (Scanner) → Token Stream → Syntax Analyzer (Parser) → Parse Tree → Semantic Analyzer → Intermediate Representation → Code Optimizer → Optimized Intermediate Representation → Assembly code

GCC Internals

[GCC architecture diagram: the different language front ends all lower to a common intermediate representation, which the back ends consume. The common IR is what makes GCC retargetable.]

IR – Control Flow Graph

• Most analysis/optimization passes inside a compiler are performed on CFGs.

• A CFG is a directed graph which models the flow of control in the program.

• Each node corresponds to a basic block, i.e. a sequence of non-branch instructions.

• Edges correspond to possible transfers of control between blocks.

Control Flow Graph

    add(n, k) {
        s = 0; a = 4; i = 0;
        if (k == 0)
            b = 1;
        else
            b = 2;
        while (i < n) {
            s = s + a*b;
            i = i + 1;
        }
        return s;
    }

is translated into a CFG whose blocks are:

    [ s = 0; a = 4; i = 0; k == 0 ] → [ b = 1; ] or [ b = 2; ] → [ i < n ] → [ s = s + a*b; i = i + 1; ] (back to the test) or [ return s; ]

Basic Block Construction

• Start with the instruction-level control-flow graph (each basic block contains a single instruction).

• Visit all edges in the graph.

• Merge adjacent nodes if:
  – there is only one edge out of the first node;
  – there is only one edge into the second node.

  Example: the single-instruction blocks [ s = 0; ] and [ a = 4; ] are merged into one block [ s = 0; a = 4; ].

[Resulting CFG for add(): the merged entry block s = 0; a = 4; i = 0; with the test k == 0, the two blocks b = 1; and b = 2;, the loop test i < n, the loop body s = s + a*b; i = i + 1;, and the exit block return s;]

Optimizing for parallel architectures: Automatic Loop Parallelization

Types of Parallelism

• Instruction Level Parallelism (ILP) → scheduling and hardware
• Task Level Parallelism (TLP) → mainly by hand
• Loop Level Parallelism (LLP) or Data Parallelism → generated by hand or by the compiler
• Pipeline Parallelism → hardware or streaming
• Divide and Conquer Parallelism → recursive functions

Loop parallelization

• Why loops?
  – 90% of the execution time is spent in 10% of the code, mostly in loops;
  – if parallel, loops can give good performance and load balancing;
  – they are relatively easy to analyze.

• How to automatically parallelize loops?
  – Find FORALL loops among the FOR loops, using data dependence analysis.
  – Definition: a loop-carried dependence is a dependence that crosses a loop iteration boundary.
  – If there are no loop-carried dependences, the loop is parallelizable (see the example below).
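A minimal illustration of the difference (array names are arbitrary):

    /* The first loop has no loop-carried dependence and is a FORALL candidate;
       the second one carries a dependence from iteration i-1 to iteration i. */
    void example(int *a, const int *b, int n)
    {
        for (int i = 1; i < n; i++)
            a[i] = b[i] + 1;          /* independent iterations: parallelizable */

        for (int i = 1; i < n; i++)
            a[i] = a[i-1] + 1;        /* needs the previous iteration's result:
                                         loop-carried dependence, not parallelizable */
    }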

Programmer Defined Parallel Loop

• FORALL – No “loop carried dependences” – Fully parallel • FORACROSS – Some “loop carried dependences”

Loop Splitting

• Example

    FORPAR I = 0 to N
        A[I] = A[I] + 1

• Block Distribution: Program gets mapped into

    Iters = ceiling(N/NUMPROC);
    FOR P = 0 to NUMPROC-1
        FOR I = P*Iters to MIN((P+1)*Iters, N)
            A[I] = A[I] + 1
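The same block distribution written out in C, with the outer P loop standing in for "what processor P executes"; in a real runtime each P would run concurrently (NUMPROC and the half-open loop bounds are choices made here for illustration):

    /* Block distribution of a FORPAR loop: each processor P gets a contiguous
       chunk of ceiling(N/NUMPROC) iterations. */
    #define NUMPROC 4

    void forpar_block(int *A, int N)
    {
        int iters = (N + NUMPROC - 1) / NUMPROC;        /* ceiling(N / NUMPROC)  */
        for (int P = 0; P < NUMPROC; P++) {             /* conceptually parallel */
            int lo = P * iters;
            int hi = (P + 1) * iters < N ? (P + 1) * iters : N;  /* MIN(..., N)  */
            for (int I = lo; I < hi; I++)
                A[I] = A[I] + 1;
        }
    }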

Code transformation

sequential code:

    int main() {
        ...
        for (i = 0; i < N; i++)
            for (j = i; j < N; j++)
                ...
    }

        |
        |  PARALLELIZING COMPILER
        v

parallel code:

    void parallel_routine() {
        for (i = N*cpuID/nprocs; i < N*(cpuID+1)/nprocs; i++)
            for (j = i; j < N; j++)
                ...
    }

    int start() {
        ...
        for (i = 0; i < N; i++)
            for (j = i; j < N; j++)
                ...
    }

Runtime support to parallelization

parallel code:

    void main() {
        initenv();
        if (cpuID == MASTER) {
            start();
        } else {
            // spin until work provided
            parallel_routine();
            // spin until work provided
        }
    }

    void doall() {
        // release workers
        parallel_routine();
        // gather workers on barrier
    }

    void parallel_routine() {
        for (i = N*cpuID/nprocs; i < N*(cpuID+1)/nprocs; i++)
            for (j = i; j < N; j++)
                ...
    }

    int start() {
        // sequential code
        for (i = 0; i < N; i++)
            for (j = i; j < N; j++)
                ...
        // sequential code
    }

runtime library:

    // Synchronization facilities
    // Lock implementation
    // Barrier implementation

Cooperative approaches: OpenMP

• The compiler may not be able to do the parallelization in the way you would like to see it:
  – a loop is not parallelized, because the data dependence analysis cannot determine whether it is safe to parallelize it or not;
  – the granularity is not high enough, because the compiler lacks the information needed to parallelize at the highest possible level.

• This is when explicit parallelization through OpenMP directives comes into the picture.

OpenMP and GCC

• Language extensions for shared memory concurrency (#pragma)
• Supports C, C++ and Fortran
• Embedded directives specify:
  – Parallelism
  – Data sharing semantics
  – Work sharing semantics
• Standard and increasingly popular
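For the FORPAR example above, the explicit OpenMP version is a one-line directive; GCC accepts it when OpenMP support is enabled (the -fopenmp flag).

    /* Minimal OpenMP version of the FORPAR loop: the directive states the
       parallelism and data sharing, and the compiler/runtime split the work. */
    #include <omp.h>

    void forpar_omp(int *A, int N)
    {
        #pragma omp parallel for shared(A) schedule(static)
        for (int i = 0; i < N; i++)
            A[i] = A[i] + 1;      /* no loop-carried dependence: safe to parallelize */
    }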

OpenMP programming model

• A typical OpenMP implementation relies on the pthreads library.
• With MPSoCs the SPMD paradigm is also used.
• Based on fork-join semantics:
  – the master thread spawns teams of child threads;
  – this allows alternating sequential and parallel execution.

Supporting OpenMP in GCC

• Recognize OpenMP pragmas
  – Within the parser in each frontend
• Lower into GIMPLE IR
  – Augment it with new OMP nodes
• Identify data sharing
  – The master thread creates a local structure which contains all the data items marked for sharing and passes its address to the slaves
• Identify work sharing
  – Split computation between different cores (omp for)
• Identify parallel regions
  – Outline the body of parallel regions into functions that are used as arguments to the thread creation routines
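A conceptual sketch of the outlining and data-sharing steps described above is shown below. The structure and function names are invented, and GCC's actual GIMPLE lowering and libgomp entry points differ in detail, but the shape is the same: shared data is packed into a structure, the region body becomes a function, and its address (plus the data) is handed to the thread-creation routine.

    /* Conceptual sketch only: outlining of a parallel region. Names are invented;
       real GCC lowering targets libgomp rather than raw pthreads. */
    #include <pthread.h>

    struct omp_data_s {               /* data items marked for sharing */
        int *A;
        int  N;
    };

    struct task_s {                   /* per-thread view: shared data + thread id */
        struct omp_data_s *shared;
        int tid, nthreads;
    };

    static void *parallel_region_fn(void *arg)    /* outlined body of the region */
    {
        struct task_s *t = arg;
        int N  = t->shared->N;
        int lo = t->tid * N / t->nthreads;        /* static work sharing (omp for) */
        int hi = (t->tid + 1) * N / t->nthreads;
        for (int i = lo; i < hi; i++)
            t->shared->A[i] = t->shared->A[i] + 1;
        return 0;
    }

    void run_parallel(int *A, int N)
    {
        struct omp_data_s shared = { A, N };      /* master packs the shared items */
        struct task_s t0 = { &shared, 0, 2 };     /* master's share                */
        struct task_s t1 = { &shared, 1, 2 };     /* slave's share                 */
        pthread_t worker;
        pthread_create(&worker, 0, parallel_region_fn, &t1);  /* address passed to slave */
        parallel_region_fn(&t0);                  /* master joins in the work      */
        pthread_join(worker, 0);
    }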