PPT - ECE 751 Embedded Computing Systems


Lecture 18: Hardware/Software Codesign

Embedded Computing Systems

Mikko Lipasti, adapted from M. Schulte. Based on slides and the textbook from Wayne Wolf, High Performance Embedded Computing, © 2007 Elsevier.

Topics

 Platforms.

 Performance analysis.

 Design representations.

 Hardware/software partitioning.

 Co-synthesis for general multiprocessors.

 Optimization concepts.

 Simulation.

Design platforms

 Different levels of integration:

 PC + board.

 Custom board with CPU + FPGA or ASIC.

 Platform FPGA.

 System-on-chip.


CPU/accelerator architecture

 CPU is sometimes called the host.

 Accelerators communicate with the CPU via shared memory.

 May use DMA to communicate.

[Figure: CPU, memory, and accelerator connected by a shared bus]

Example: Xilinx Virtex-4

 System-on-chip:

 FPGA fabric.

 PowerPC.

 On-chip RAM.

 Specialized I/O devices.

 FPGA fabric is connected to the PowerPC bus.

 MicroBlaze CPU can be added in FPGA fabric.


Example: WILDSTAR II Pro


Performance analysis

 Must analyze accelerator performance to determine system speedup.

 High-level synthesis helps:

 Use as an estimator for accelerator performance.

 Use to implement accelerator.
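As a rough, hypothetical illustration of that speedup calculation (the times, and the assumption that input transfer, accelerator execution, and output transfer serialize, are invented for this sketch):

#include <stdio.h>

/* Hedged sketch: system speedup from using an accelerator, including
 * communication overhead. All times are hypothetical. */
double accel_time(double t_in, double t_acc, double t_out) {
    return t_in + t_acc + t_out;   /* blocking: transfers serialize */
}

int main(void) {
    double t_sw = 100e-6;                         /* kernel in software */
    double t_hw = accel_time(10e-6, 20e-6, 5e-6); /* in + run + out */
    printf("speedup = %.2f\n", t_sw / t_hw);      /* ~2.86x */
    return 0;
}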


Data path/controller architecture

 Data path performs regular operations and stores data in registers.

 Controller provides the required sequencing.

[Figure: controller driving the data path]

High-level synthesis

 High-level synthesis creates a register-transfer description from a behavioral description.

 Schedules and allocates:

 Operators.

 Variables.

 Connections.

 A control step (time step) is one cycle of the system controller.

 Components may be selected from a technology library.


Models

 Model the computation as a data flow graph.

 The critical path is the set of nodes that determines the schedule length; a minimal sketch follows.
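A minimal sketch of that measurement: the longest path through a small, invented data flow graph, computed in topological order (nodes are numbered so that every edge points from a lower to a higher index):

#include <stdio.h>

#define N 5  /* operations in a toy data flow graph */

/* adj[i][j] = 1 if operation i feeds operation j. */
static const int adj[N][N] = {
    {0,1,1,0,0},
    {0,0,0,1,0},
    {0,0,0,1,0},
    {0,0,0,0,1},
    {0,0,0,0,0},
};

int main(void) {
    int len[N];   /* len[i] = longest path (in nodes) ending at node i */
    int best = 0;
    for (int i = 0; i < N; i++) {
        len[i] = 1;
        for (int j = 0; j < i; j++)
            if (adj[j][i] && len[j] + 1 > len[i])
                len[i] = len[j] + 1;
        if (len[i] > best) best = len[i];
    }
    printf("critical path: %d nodes\n", best);  /* 4, e.g. 0->1->3->4 */
    return 0;
}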


Accelerator estimation

 How do we use high-level synthesis, etc. to estimate the performance of an accelerator?

 We have a behavioral description of the accelerator function.

 Need an estimate of the number of clock cycles.

 Need to evaluate a large number of candidate accelerator designs.

 Can’t afford to synthesize them all.
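One cheap alternative, sketched below, is a standard scheduling lower bound rather than full synthesis: a schedule can be no shorter than the critical path (dependence bound) or than the operation count divided by the number of functional units (resource bound). The numbers here are hypothetical:

#include <stdio.h>

/* Lower bound on accelerator cycle count. */
unsigned est_cycles(unsigned n_ops, unsigned n_units, unsigned crit_path) {
    unsigned res_bound = (n_ops + n_units - 1) / n_units;  /* ceiling */
    return res_bound > crit_path ? res_bound : crit_path;
}

int main(void) {
    /* 40 operations, 4 functional units, critical path of 12 cycles */
    printf("at least %u cycles\n", est_cycles(40, 4, 12));  /* 12 */
    return 0;
}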


Estimation methods

 Hermann et al. used numerical methods.

 Estimated incremental costs due to adding blocks to the accelerator.

 Henkel and Ernst used path-based scheduling.

 Cut the CDFG into subgraphs: reduce loop iteration counts; cut at large joins; divide into equal-sized pieces.

 Schedule each subgraph independently.

 Vahid and Gajski estimate controller and data path costs incrementally.


Single- vs. multi-threaded

 One critical factor is available parallelism:

 Single-threaded/blocking: CPU waits for the accelerator.

 Multithreaded/non-blocking: CPU continues to execute along with the accelerator.

 To multithread, CPU must have useful work to do.

 But software must also support multithreading.


Total execution time

[Figure: execution timelines contrasting single-threaded operation, where processes P1–P4 and accelerator A1 run back-to-back, with multi-threaded operation, where A1 overlaps CPU processes]

Execution time analysis

 Single-threaded: sum the execution times of all component processes.

 Multi-threaded: find the longest path through the execution; the sketch below contrasts the two.
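A toy comparison with invented process times, assuming accelerator A1 can overlap CPU processes P2 and P3 in the multi-threaded case:

#include <stdio.h>

int main(void) {
    /* Hypothetical times for CPU processes P1..P4 and accelerator A1. */
    double P1 = 4, P2 = 3, P3 = 5, P4 = 2, A1 = 6;

    /* Single-threaded: the CPU blocks on A1, so everything adds up. */
    double single = P1 + P2 + P3 + P4 + A1;

    /* Multi-threaded: A1 runs concurrently with P2 and P3, so that
     * phase costs max(A1, P2 + P3) instead of A1 + P2 + P3. */
    double overlap = (A1 > P2 + P3) ? A1 : P2 + P3;
    double multi = P1 + overlap + P4;

    printf("single = %.0f, multi = %.0f\n", single, multi);  /* 20 vs 14 */
    return 0;
}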


Hardware-software partitioning

 Partitioning methods usually allow more than one ASIC.

 Typically ignore CPU memory traffic in bus utilization estimates.

 Typically assume that the CPU process blocks while waiting for the ASIC.

[Figure: CPU, memory, and two ASICs sharing a bus]


Synthesis tasks

 Scheduling: make sure that data is available when it is needed.

 Allocation: make sure that processes don’t compete for the same PE.

 Partitioning: break operations into separate processes to increase parallelism; put serial operations in one process to reduce communication.

 Mapping: take PE and communication link characteristics into account.


Scheduling and allocation

 Must schedule/allocate:

 Computation.

 Communication.

 Performance may vary greatly with the allocation choice, as the sketch below illustrates.

[Figure: processes P1–P3 mapped onto CPU1 and ASIC1]
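A hypothetical illustration for a three-process chain P1 -> P2 -> P3: moving P2 and P3 to the ASIC wins despite the added communication. All times are invented:

#include <stdio.h>

int main(void) {
    double cpu[3]  = {5, 8, 8};   /* P1..P3 execution times on CPU1 */
    double asic[3] = {9, 2, 2};   /* P1..P3 execution times on ASIC1 */
    double comm = 3;              /* cost of one CPU<->ASIC transfer */

    /* Allocation A: everything on CPU1, no communication needed. */
    double all_cpu = cpu[0] + cpu[1] + cpu[2];

    /* Allocation B: P1 on CPU1, P2 and P3 on ASIC1, one transfer. */
    double split = cpu[0] + comm + asic[1] + asic[2];

    printf("all-CPU = %.0f, split = %.0f\n", all_cpu, split);  /* 21 vs 12 */
    return 0;
}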


Problems in scheduling/allocation

 Can multiple processes execute concurrently?

 Is the performance granularity of available components fine enough to allow efficient search of the solution space?

 Do computation and communication requirements conflict?

 How accurately can we estimate performance?

 Software.

 Custom ASICs.

Partitioning example

Before:

r = p1(a,b); s = p2(c,d); z = r + s;

After: the same computation partitioned so that r = p1(a,b) and s = p2(c,d) run as separate, concurrent processes, with z = r + s computed once both complete (see the sketch below).
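A software-only sketch of the "after" partition using POSIX threads; the bodies of p1 and p2 and their inputs are placeholders:

#include <pthread.h>
#include <stdio.h>

/* Placeholder computations standing in for p1 and p2. */
static int p1(int a, int b) { return a * b; }
static int p2(int c, int d) { return c + d; }

static int r, s;
static void *run_p1(void *arg) { (void)arg; r = p1(2, 3); return NULL; }
static void *run_p2(void *arg) { (void)arg; s = p2(4, 5); return NULL; }

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, run_p1, NULL);  /* p1 and p2 run... */
    pthread_create(&t2, NULL, run_p2, NULL);  /* ...concurrently */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    int z = r + s;                            /* serial combine step */
    printf("z = %d\n", z);                    /* 6 + 9 = 15 */
    return 0;
}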

Problems in partitioning

 At what level of granularity must partitioning be performed?

 How well can you partition the system without an allocation?

 How does communication overhead figure into partitioning?


Problems in mapping

 Mapping and allocation are strongly connected when the components vary widely in performance.

 Software performance depends on bus configuration as well as CPU type.

 Mappings of PEs and communication links are closely related.


Program representations

 CDFG: single-threaded, executable, can extract some parallelism.

 Task graph: task-level parallelism, no operator-level detail.

 TGFF (Task Graphs For Free) generates random task graphs.

 UNITY: based on a parallel programming language.


Platform representations

 Technology table describes PE and channel characteristics:

 CPU time.

 Communication time.

 Cost.

 Power.

 Multiprocessor connectivity graph describes PEs and channels.

Example technology table:

Type    Speed   Cost
ARM 7   50E6    10
MIPS    50E6    8

[Figure: connectivity graph of PE 1, PE 2, PE 3]
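As a sketch, a co-synthesis tool might carry such a table as an array of records; the struct layout is an assumption, and the values are the ones in the table above:

#include <stdio.h>

struct pe_type {
    const char *type;   /* processor name */
    double      speed;  /* clock rate, Hz */
    int         cost;   /* arbitrary cost units */
};

static const struct pe_type tech_table[] = {
    { "ARM 7", 50e6, 10 },
    { "MIPS",  50e6,  8 },
};

int main(void) {
    for (unsigned i = 0; i < 2; i++)
        printf("%-6s speed=%.0e cost=%d\n",
               tech_table[i].type, tech_table[i].speed, tech_table[i].cost);
    return 0;
}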

Hardware/software partitioning assumptions

 CPU type is known.

 Can determine software performance.

 Number of processing elements is known.

 Simplifies system-level performance analysis.

 Only one processing element can multi-task.

 Simplifies system-level performance analysis.


Two early HW/SW partitioning systems

 Vulcan:

 Start with all tasks on the accelerator.

 Move tasks to CPU to reduce cost.

 COSYMA:

 Start with all functions on the CPU.

 Move functions to accelerator to improve performance.
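A toy, COSYMA-flavored greedy loop (not the actual tool's algorithm): everything starts in software, and the most profitable function that still fits the hardware budget moves to the accelerator. All numbers are invented:

#include <stdio.h>

#define N 4

int main(void) {
    double gain[N] = {12, 3, 7, 5};  /* cycles saved if moved to HW */
    int    cost[N] = { 6, 2, 5, 4};  /* hardware area cost */
    int    in_hw[N] = {0};
    int    budget = 10;              /* total area available */

    for (;;) {
        int best = -1;
        for (int i = 0; i < N; i++)
            if (!in_hw[i] && cost[i] <= budget &&
                (best < 0 || gain[i] > gain[best]))
                best = i;
        if (best < 0) break;         /* nothing affordable remains */
        in_hw[best] = 1;
        budget -= cost[best];
        printf("move f%d to HW (gain %.0f)\n", best, gain[best]);
    }
    return 0;
}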


Additional Co-synthesis Approaches

 Vahid: binary constraint search.

 CoWare: communicating processes model.

 Simulated annealing and tabu search heuristics [Ele96].

 LYCOS: CDFG representation [Mad97].

 Several others in the book (skim).

Multi-objective optimization

 Operations research provides notions for optimizing functions with multiple objectives.

 Pareto optimality: a solution is Pareto-optimal if no objective can be improved without making another objective worse; the sketch below filters a set of design points.
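A minimal dominance check over invented (cost, power) design points, both objectives to be minimized:

#include <stdio.h>

struct point { double cost, power; };

/* a dominates b if a is no worse in every objective and strictly
 * better in at least one. */
int dominates(struct point a, struct point b) {
    return a.cost <= b.cost && a.power <= b.power &&
           (a.cost < b.cost || a.power < b.power);
}

int main(void) {
    struct point pts[] = {{10, 3}, {8, 7}, {12, 6}, {8, 4}};
    int n = 4;
    for (int i = 0; i < n; i++) {
        int optimal = 1;
        for (int j = 0; j < n; j++)
            if (j != i && dominates(pts[j], pts[i])) optimal = 0;
        if (optimal)
            printf("(%g, %g) is Pareto-optimal\n",
                   pts[i].cost, pts[i].power);   /* (10,3) and (8,4) */
    }
    return 0;
}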


Large search space: Genetic algorithms

 Modeled as:

 Genes = strings of symbols.

 Mutations = changes to strings.

 Types of moves:

 Reproduction makes a copy of a string.

 Mutation changes a string.

 Crossover interchanges parts of two strings.
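The three moves sketched on an invented two-symbol alphabet with a fixed string length:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LEN 8
static const char SYMS[] = "01";

static void reproduce(char *dst, const char *src) {
    memcpy(dst, src, LEN + 1);            /* copy a string unchanged */
}

static void mutate(char *g) {
    g[rand() % LEN] = SYMS[rand() % 2];   /* change one symbol */
}

static void crossover(char *a, char *b) {
    int cut = rand() % LEN;               /* swap tails after a cut point */
    for (int i = cut; i < LEN; i++) {
        char t = a[i]; a[i] = b[i]; b[i] = t;
    }
}

int main(void) {
    char a[LEN + 1] = "00000000", b[LEN + 1] = "11111111", c[LEN + 1];
    srand(1);
    reproduce(c, a);   /* c is a copy of a... */
    mutate(c);         /* ...with one changed symbol */
    crossover(a, b);   /* a and b exchange tails */
    printf("a=%s b=%s c=%s\n", a, b, c);
    return 0;
}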


Hardware/software co-simulation

 Must connect models with different models of computation and different time scales.

 A simulation backplane manages communication.

 Becker et al. used the PLI in Verilog-XL to add C code that communicates with software models, and UNIX networking to connect to the hardware simulator.
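Glue of that general kind might look like the sketch below, which ships one bus transaction to a peer simulator over a loopback TCP socket. The routine name, message format, and port are invented for illustration, not Becker et al.'s actual interface; a real setup would register such a routine through the simulator's foreign-language interface:

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

struct bus_txn { uint32_t addr, data; uint8_t is_write; };

/* Send one transaction to a co-simulator listening on localhost:5000. */
int send_txn(uint32_t addr, uint32_t data, int is_write) {
    struct bus_txn t = { addr, data, (uint8_t)is_write };
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    struct sockaddr_in sa;
    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_port = htons(5000);                  /* arbitrary port */
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    if (connect(fd, (struct sockaddr *)&sa, sizeof sa) < 0) {
        close(fd);
        return -1;
    }
    ssize_t n = write(fd, &t, sizeof t);        /* one message per txn */
    close(fd);
    return n == (ssize_t)sizeof t ? 0 : -1;
}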


Mentor Graphics Seamless

 Hardware modules described using standard HDLs.

 Software can be loaded as C or binary.

 Bus interface module connects hardware models to processor instruction set simulator.

 Coherent memory server manages shared memory.


Summary

 Platforms.

 Performance analysis.

 Design representations.

 Hardware/software partitioning.

 Co-synthesis for general multiprocessors.

 Optimization concepts.

 Simulation.