Design versus Programming in Custom Computing: Experiences using the ASM Method

James P. Davis, Ph.D.

Heather Wake, Conor Leehaug, Nirish Namilae

University of South Carolina, Department of Computer Science and Engineering

MAPLD-04

Davis, Wake, Leehaug & Namilae 1 227-MAPLD04

Outline

• Process
– Custom Computing for Mobile Computing
– Algorithm to Architecture
– Language-based versus Model-based formulation

• Method
– Algorithmic State Machine (ASM)

• Examples
– Reed-Solomon coding for wireless networks

• Lessons Learned


Custom Computing Development Process

Figure: development flow. Systems Analysis and/or Algorithm Analysis; HW-SW Partitioning; software path (Software Programming, Program Compilation & Debug, Program Execution & Profiling); hardware path (RTL Design of Custom Logic, Logic Synthesis & Simulation, Design Place & Route); HW-SW Integration & Eval.

Predominant process for hybrid architectures: may start with a host platform to which a server CCM “fabric” is connected (hybrid high-performance computing, HPC, platforms).

Most likely scenario: distributed, mobile “tetherless” computing (MC), where power consumption, resource utilization, and performance optimization against power & resource budgets are paramount concerns.

Certain tasks of the process for MC involve separated development activities, as opposed to many unified processes for HPC (e.g., Star Bridge, SRC Computers).


Paths from Algorithm to Architecture

Language-driven approach: Use programming language as medium for algorithm expression.

Goal: hide the details of the platform from the programmer.

Rely on technology embodied in tools: (1) programming idioms; (2) compiler technology; (3) abstraction libraries.

Model-driven approach: Use explicit modeling language, providing algorithm constructs, and “cues” for constructing viable architectures.

Goal: allow designer to use analogical reasoning, planning, & configuration skills to propose and refine solutions in the design space.

Graphical Modeling versus Programming

• Human mind works more effectively with visual and spatial information.

– Learning, retention, manipulation of artifacts, & communicating ideas.

– During human evolution, we spent more time using pictures to convey ideas than writing text (e.g., cave paintings at Lascaux, France).

– We use graphical representation of design artifacts annotated with textual components.

• Graphical notations more effective in “chunking” design information.

– A few changes in graphical model imply larger number of changes in code.

– More compact representation in graphics.

– The “links” between constructs carry much information.

• Design consists of planning and configuration tasks, which are easier to perform with diagrammatic representations than textual ones.

• Graphics allows designers to keep focus on artifacts of architecture.

– Analogical focus makes possible better trade-offs.

– Allows more “agile” exploration of design space.


Algorithm to Architecture Description-1

Figure: transformation flow. Inputs: Algorithm Spec (Text or Math); Software Algorithm (C code). Steps: Control Flow modeling (algorithmic structure); Data Flow modeling (operation ordering); Create Ordered Sequence of Operations; Overlay Operation Sequence onto Control Structure; Add Hardware Semantics (clocking, operation scheduling, parallelism, resource binding).

• Sources for transforming to architecture model:

– May start with mathematical formulation (e.g., $GF(2^m)$).

– May start with algorithm & abstract data type formulation.

– May start with existing architecture pattern (e.g., LFSR).


Algorithm to Architecture Description-2

Figure: refinement flow from Algorithm Specification through Platform Independent Model and Candidate Architecture Space (Iterative Architectural Refinement, Knowledge of Device Architecture) to Platform Specific Model and Target Logic Device. Platforms named: Annapolis Micro, Star Bridge Systems, SRC Computers. Devices named: Xilinx Virtex-E, Xilinx Virtex-II.

• Algorithm Specification:

– Describe problem solving in algorithm steps and abstract data type (ADT).

• Platform Independent Model:

– Mapping algorithm and ADT onto initial architecture choice.

– Use behavioral & architectural patterns: (1) polling, (2) handshaking, (3) arbitration, (4) rendezvous, (5) pipelining, (6) iteration & repetition (systolic).

– Explore architecture space: serial to increasingly parallel.

• Platform Specific Model:

– Refine PIM onto target platform to create PSM.

– Account for device resources, number of devices, device interconnect.

• Bounding the search space:

– Use of estimators (cf. Quan et al., 2004).

– Characterize key points on architecture continuum, converge to “satisficing” point.
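The idea of walking the architecture continuum from serial to increasingly parallel, and stopping at a “satisficing” point, can be sketched as below. This is only an illustration: the `estimate` function, its linear area model, and all numbers are hypothetical stand-ins, not the estimators of Quan et al. or anything measured in this work.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Candidate:
    units: int   # number of parallel functional units (hypothetical)
    area: int    # estimated area in made-up resource units
    cycles: int  # estimated latency in clock cycles


def estimate(units, ops=64, area_per_unit=100, overhead=50):
    """Toy estimator: area grows linearly, latency shrinks with parallelism."""
    return Candidate(units, overhead + units * area_per_unit, ceil(ops / units))


def satisfice(area_budget, cycle_budget, max_units=64):
    """Walk from serial to more parallel; return the first candidate that fits."""
    units = 1
    while units <= max_units:
        cand = estimate(units)
        if cand.area > area_budget:
            return None  # further parallelism would only cost more area
        if cand.cycles <= cycle_budget:
            return cand  # good enough: stop searching
        units *= 2
    return None


if __name__ == "__main__":
    print(satisfice(area_budget=1000, cycle_budget=10))
```

The search stops at the first candidate that meets both budgets rather than looking for an optimum, which is the sense of “satisficing” used above.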


Algorithmic State Machine Method-1

• Using Algorithmic State Machine Method

– We use Executable ASM diagrams to model state machine behavior and datapath operations for the hardware.

– Executable ASM models have graphical symbol set that looks like a flowchart.

– Algorithm structure can be easily modeled using the ASM graphics.

– The diagrams are annotated with register transfer notation (RTN) expressions for operations and events.

– ASM models are executed in Nimbus™: they are compiled into a simulation model, then translated into VHDL code for circuit synthesis.


Algorithmic State Machine Method-2

Explicit reset and soft priority interrupts.

Discrete states and operations.

Poll for new 16-bit word in receiver stream.

Conditions for next state and output decoding.

Coordination patterns (polling, handshaking).

Discrete looping, loop control.

Case construct for multiway branching.

Test if new word is first word of a new frame. Our sequencing choice depends on first frame word.

We’ll assume it’s Frame Control Header if we have a new frame.

Enable decoding of target block select, based on current state of Frame Sequencer.

Conditional outputs and data operations.
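The callouts above describe an ASM chart whose diagram is not reproduced here. As a rough sketch of how such a chart reads when written out, the following models a three-state sequencer with an explicit reset, a polling loop, a conditional register transfer, and a case-style branch. The state names, signal names, and block-select encoding are all hypothetical; they are not taken from the authors' Frame Sequencer model.

```python
from enum import Enum


class State(Enum):
    POLL = 0        # wait for a new 16-bit word in the receiver stream
    TEST_FRAME = 1  # is it the first word of a new frame?
    DECODE = 2      # enable decoding of the target block select


def asm_step(state, regs, word_valid, word_in, frame_start, reset=False):
    """Advance the hypothetical sequencer one clock; return (next_state, outputs)."""
    outputs = {"block_select": None}
    if reset:  # explicit reset takes priority over everything else
        regs.update(word=0, field="NONE")
        return State.POLL, outputs
    if state is State.POLL:
        if word_valid:  # polling pattern: stay here until data arrives
            regs["word"] = word_in & 0xFFFF
            return State.TEST_FRAME, outputs
        return State.POLL, outputs
    if state is State.TEST_FRAME:
        # conditional register transfer, in the spirit of RTN: field <- ...
        regs["field"] = "FRAME_CONTROL" if frame_start else "PAYLOAD"
        return State.DECODE, outputs
    # State.DECODE: case construct for multiway branching
    outputs["block_select"] = {"FRAME_CONTROL": 0, "PAYLOAD": 1}[regs["field"]]
    return State.POLL, outputs


if __name__ == "__main__":
    regs = {"word": 0, "field": "NONE"}
    state = State.POLL
    for valid, word, first in [(False, 0, False), (True, 0x1ABCD, True),
                               (False, 0, True), (False, 0, False)]:
        state, out = asm_step(state, regs, valid, word, first)
        print(state.name, regs, out)
```

Each call corresponds to one clock cycle, so the state returned is the next state and the outputs are those asserted in the current state.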


Reed-Solomon R-S(n,k) Model

Figure: transmit path (MAC Layer CRC-32 coding of data stream; PHY Layer, incl. CRC-16 coding; BBP) and receive path (MAC Layer CRC-32 decoding of data stream; PHY Layer, CRC-16 decoding; BBP), labeled 2.4 GHz. Callouts: Can we replace with RS(n,k) coding scheme? What impact on the throughput of the channel?

• Background

– Wireless communications in noisy environments can consume much bandwidth on frame retransmission after timeout.

– What if we could not just detect errors, but correct certain error bursts, at speed?

– We are interested in alternate error coding and correction schemes to evaluate tradeoffs in code strength, codeword overhead, channel error rate, and channel capacity.

– We use IEEE 802.11b protocol as experimental platform: (1) construct circuit models for MAC/PHY layers, (2) collect station timing data from logic model, (3) correlate against network model in ns-2 simulator.

– Replace CRC-32 with R-S(n,k) in 802.11b MAC Layer?

Reed-Solomon R-S(n,k) Model

Figure: an n-symbol codeword made up of k data symbols plus 2t parity symbols; each symbol is m bits.

$0 < k < n < 2^m + 2$

$(n, k) = (2^m - 1,\ 2^m - 1 - 2t)$

Generator polynomial for R-S(7,3): $g(X) = \alpha^3 + \alpha^1 X + \alpha^0 X^2 + \alpha^3 X^3 + X^4$

• Structure

– We have a symbol stream to which we want to append coding bits to form a “codeword”.

– m-bit dataword sequence, with k data words encoded to form an n-word codeword with 2t parity words appended.

– The symbol error correcting capability of the code is up to t words.
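As a quick check of the relation $(n, k) = (2^m - 1,\ 2^m - 1 - 2t)$, the sketch below computes the code parameters from the symbol width m and the correction capability t. The function names are illustrative only.

```python
def rs_parameters(m, t):
    """Return (n, k) for m-bit symbols and t correctable symbol errors."""
    n = 2**m - 1
    k = n - 2 * t
    if not 0 < k < n:
        raise ValueError("t is out of range for this symbol width")
    return n, k


def correctable_symbols(n, k):
    """Symbol-error correcting capability t of an R-S(n,k) code."""
    return (n - k) // 2


if __name__ == "__main__":
    for m, t in [(3, 2), (6, 2), (8, 4)]:
        print(m, t, rs_parameters(m, t))
```

For m = 3 and t = 2 this gives R-S(7,3), the code used in the examples; m = 6, t = 2 and m = 8, t = 4 give R-S(63,59) and R-S(255,247).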

• Mathematics

– The construction of a code is done within a “finite field” that is closed under addition and multiplication.

– Non-binary cyclic coding has better performance than binary, so we require an extension field $GF(2^m)$, whose order is not a prime but a power of a prime ($2^m$).

– We characterize the field, and its coding patterns, according to polynomial expressions.

– We carry out finite field arithmetic (addition, multiplication) on a complete symbol.
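A minimal sketch of symbol-wide arithmetic in $GF(2^3)$ follows. It assumes the primitive polynomial $X^3 + X + 1$ (the slides do not state which one is used) and stores each 3-bit symbol as an integer whose bit i is the coefficient of $X^i$. Addition is a bitwise XOR; multiplication uses log/antilog tables, similar in spirit to a table lookup.

```python
PRIM_POLY = 0b1011  # X^3 + X + 1, an assumed primitive polynomial
M = 3


def build_tables(m=M, prim=PRIM_POLY):
    """Return (exp, log): exp[i] = alpha^i, log[alpha^i] = i."""
    size = (1 << m) - 1
    exp, log = [0] * size, {}
    x = 1
    for i in range(size):
        exp[i] = x
        log[x] = i
        x <<= 1              # multiply by alpha
        if x & (1 << m):     # reduce modulo the primitive polynomial
            x ^= prim
    return exp, log


EXP, LOG = build_tables()


def gf_add(a, b):
    """Field addition (and subtraction) is bitwise XOR."""
    return a ^ b


def gf_mul(a, b):
    """Field multiplication via exponent addition modulo 2^m - 1."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % len(EXP)]


if __name__ == "__main__":
    print(EXP)
    print(gf_add(EXP[0], EXP[1]) == EXP[3])  # alpha^0 + alpha^1 = alpha^3
```

With this polynomial the powers of alpha run through all seven nonzero symbols before repeating, which is what makes the exponent table usable for multiplication.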


R-S(7,3) – Codeword Generation

• Linear Feedback Shift Register (LFSR)

– Means of formulating $GF(2^3)$ polynomial code generator circuit.

– Shift Register, (n-k) stages: k clock cycles to shift in m-bit input message words.

– Input message words simultaneously moved to LFSR and output register Reg4 (uncoded data portion of codeword).

– LFSR, modulo arithmetic performed on message words to generate parity words.

– The remaining (n-k) cycles shift out the parity words, which are appended to form the complete codeword for transmission.

Figure: LFSR encoder for R-S(7,3), with stages Reg0 to Reg3 (taps $\alpha^3$, $\alpha^1$, $\alpha^0$, $\alpha^3$ at $X^0$ to $X^3$, feedback at $X^4$), adders between stages, output register Reg4, a MUX, and the input and output message streams. Generator polynomial for R-S(7,3): $g(X) = \alpha^3 + \alpha^1 X + \alpha^0 X^2 + \alpha^3 X^3 + X^4$. Source: Sklar, © 2001, Prentice-Hall Publishers, Inc.
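The LFSR steps listed above can be sketched in software as follows. This is not the authors' ASM model: it assumes the primitive polynomial $X^3 + X + 1$, an integer encoding of symbols (bit i is the coefficient of $X^i$), and that the highest-order message symbol enters the register first. The generator coefficients are those of $g(X)$ given above.

```python
# alpha^0 .. alpha^6 in GF(2^3), assuming primitive polynomial X^3 + X + 1
EXP = [1, 2, 4, 3, 6, 7, 5]
LOG = {v: i for i, v in enumerate(EXP)}

# g(X) = alpha^3 + alpha^1 X + alpha^0 X^2 + alpha^3 X^3 + X^4 (taps g0..g3)
G = [EXP[3], EXP[1], EXP[0], EXP[3]]


def gf_mul(a, b):
    """Multiply two GF(2^3) symbols using log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 7]


def rs73_encode(message):
    """Encode k=3 symbols (lowest-order first) into an n=7 systematic codeword.

    Returns coefficients of X^0..X^6: four parity symbols, then the message.
    """
    reg = [0, 0, 0, 0]  # Reg0..Reg3
    for sym in reversed(message):  # k clock cycles, highest-order symbol first
        fb = sym ^ reg[3]          # feedback = input + Reg3 (GF addition is XOR)
        reg = [
            gf_mul(G[0], fb),
            reg[0] ^ gf_mul(G[1], fb),
            reg[1] ^ gf_mul(G[2], fb),
            reg[2] ^ gf_mul(G[3], fb),
        ]
    return reg + list(message)     # remaining (n-k) cycles shift out the parity


if __name__ == "__main__":
    codeword = rs73_encode([EXP[1], EXP[3], EXP[5]])
    print([LOG[s] for s in codeword])  # exponents of alpha, X^0 first
```

For the illustrative message $\alpha^1, \alpha^3, \alpha^5$, the register ends up holding the parity symbols $\alpha^0, \alpha^2, \alpha^4, \alpha^6$, so the printed exponents are `[0, 2, 4, 6, 1, 3, 5]`.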

R-S(7,3) – Syndrome Computation-1

• First pass modeling:

– Define logical array to store received codeword, and obtain each symbol from the received bitstream in parallel (using a register file).

– Syndrome values computed as result of parity check on the received (possibly corrupted) codeword polynomial: $r(X) = U(X) + e(X)$.

$r(X) = \alpha^0 + \alpha^2 X + \alpha^4 X^2 + \alpha^0 X^3 + \alpha^6 X^4 + \alpha^3 X^5 + \alpha^5 X^6$

– We run computation for MUL, then ADD, in order to check that $r(X) = 0$ at each root of the generator polynomial $g(X)$, which indicates a valid codeword.

– Enable the MULs and ADDs for syndrome equations as register-based table lookups, in parallel for each syndrome symbol.

– Broadcast active low ‘go_mul’ and ‘go_add’ signals to all concurrent ASM threads.
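The MUL-then-ADD syndrome check can be sketched sequentially as below (the ASM model runs the per-syndrome work in parallel threads; this sketch does not). It assumes the primitive polynomial $X^3 + X + 1$, the integer symbol encoding with bit i as the coefficient of $X^i$, and that the roots of $g(X)$ are $\alpha^1$ through $\alpha^4$.

```python
# alpha^0 .. alpha^6 in GF(2^3), assuming primitive polynomial X^3 + X + 1
EXP = [1, 2, 4, 3, 6, 7, 5]
LOG = {v: i for i, v in enumerate(EXP)}


def gf_mul(a, b):
    """Multiply two GF(2^3) symbols using log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 7]


def syndromes(received, n_parity=4):
    """Evaluate r(X) at alpha^1..alpha^(n-k); received[j] is the X^j coefficient."""
    out = []
    for i in range(1, n_parity + 1):
        s = 0
        for j, coeff in enumerate(received):
            term = gf_mul(coeff, EXP[(i * j) % 7])  # MUL: r_j * (alpha^i)^j
            s ^= term                               # ADD: accumulate with XOR
        out.append(s)
    return out


def is_valid(received):
    """A codeword is accepted when every syndrome is zero."""
    return all(s == 0 for s in syndromes(received))


if __name__ == "__main__":
    # r(X) from the slide, as exponents of alpha for X^0..X^6
    r = [EXP[e] for e in (0, 2, 4, 0, 6, 3, 5)]
    print(syndromes(r), is_valid(r))
```

For the $r(X)$ shown above, the syndromes come out as $\alpha^3, \alpha^5, \alpha^6, 0$ under these assumptions, so the word is flagged as corrupted.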


R-S(7,3) – Syndrome Computation-2

• Multiplication thread structure:

– On enable, ASM thread uses a Case selection for coefficient lookup.

– For small R-S(7,3) code, this is likely most efficient.

– For larger codewords, such as R-S(63,59) or R-S(255,247), use of modulo arithmetic unit may be required to increase throughput.


R-S(7,3) – Syndrome Computation-3

• Addition thread structure:

– Outer control loop for iterating on the S-alpha terms.

– Lookup of S-alpha (q) to form offset for addition table (addbits).

– Assignment for higher coefficients is mod(m), as multiple Case paths make the same assignment.

– Check at the end of thread for syndrome = 0.

– Logic for Error localization not shown.

– Syndrome ADD logic debugged, then replicated into multiple, concurrent threads to increase parallelism.


R-S(n,k) Modular Multiplication-1

• Montgomery multiplier algorithm breaks logically into three parts:

– convert to Montgomery numbers (within closed GF(n) field).

– do the multiplication.

– convert back to Integer numbers.

• Each conversion requires both Multiplication and Modulo operations.

– Conversion to/from modular operands shown in ASM thread.

– Actual MUL operations on modular operands abstracted to a sub-flow within scope of ASM thread.


R-S(n,k) Modular Multiplication-2

Results of assignments used as inputs for later macro computations.

• Montgomery modular MUL steps

– Upon conversion into modular operands, carry out the 3 MUL operations.

– No opportunity to increase parallelism in these MULs, because of dependencies.

– However, we could modify the thread structure so that data path is pipelined.

– Instead of having this as sub-flow, make it separate thread, controlled with handshaking signals.
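The three-part structure (convert in, multiply, convert out) can be sketched with the textbook integer form of Montgomery multiplication. This assumes an odd modulus n and a radix $R = 2^{bits} > n$; the slides do not give the number format used in the ASM model, so every name and parameter here is illustrative. The three multiplications inside `mont_mul` depend on one another, which is one plausible reading of the dependency remark above.

```python
def montgomery_setup(n, bits):
    """Return (R, n_prime) with R = 2**bits and n * n_prime = -1 (mod R)."""
    r = 1 << bits
    n_prime = (-pow(n, -1, r)) % r  # modular inverse; needs Python 3.8+
    return r, n_prime


def to_mont(a, n, r):
    """Convert an integer to Montgomery form: a * R mod n."""
    return (a * r) % n


def mont_mul(a_bar, b_bar, n, r, n_prime, bits):
    """Montgomery product a_bar * b_bar * R^-1 mod n, without dividing by n."""
    t = a_bar * b_bar                        # MUL 1
    m = ((t & (r - 1)) * n_prime) & (r - 1)  # MUL 2, needs t
    u = (t + m * n) >> bits                  # MUL 3, needs m
    return u - n if u >= n else u


def from_mont(a_bar, n, r, n_prime, bits):
    """Convert back to an ordinary integer by multiplying with 1."""
    return mont_mul(a_bar, 1, n, r, n_prime, bits)


def modmul(a, b, n, bits=16):
    """a * b mod n via the three parts: convert, multiply, convert back."""
    r, n_prime = montgomery_setup(n, bits)
    prod = mont_mul(to_mont(a, n, r), to_mont(b, n, r), n, r, n_prime, bits)
    return from_mont(prod, n, r, n_prime, bits)


if __name__ == "__main__":
    print(modmul(12345, 54321, 65521) == (12345 * 54321) % 65521)
```

The conversions are expensive relative to a single product, so the scheme pays off when many multiplications are chained between one conversion in and one conversion out.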


Sub-Multiplication – Parallel Partials

Source: Carpinelli, © 2002 Pearson Publishing, Inc.


Faster Addition – Multi-operand Trees

Source: Leehaug & Davis, 2004

Adder delay for a Carry Save Adder (CSA) architecture is much lower than that of the other architecture compared.

CSA has two-stage logic. It takes three operands, putting one on the Carry In, and generates two outputs, on the Sum and Carry Out.

The (3,2) reduction allows multi-operand addition to be done, which is faster than repeated 2-operand addition.

Here, all 16 partials are added at the same time.
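A bit-parallel sketch of the (3,2) reduction is given below: each carry-save stage turns three operands into a sum word and a carry word with no carry propagation, and stages are repeated until two operands remain for one final ordinary addition. This models the arithmetic only, not the gate-level tree in the figure, and the function names are illustrative.

```python
def csa(a, b, c):
    """(3,2) carry-save adder: returns (sum, carry) with sum + carry == a + b + c."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def wallace_sum(operands):
    """Reduce operands three at a time until two remain, then add them."""
    ops = list(operands)
    while len(ops) > 2:
        nxt = []
        for i in range(0, len(ops) - 2, 3):
            nxt.extend(csa(ops[i], ops[i + 1], ops[i + 2]))
        nxt.extend(ops[len(ops) - len(ops) % 3:])  # pass leftovers to next level
        ops = nxt
    return sum(ops)  # single carry-propagate addition at the end


def multiply_16x16(x, y):
    """16x16 unsigned multiply: form 16 partial products, then reduce them."""
    partials = [(x << i) if (y >> i) & 1 else 0 for i in range(16)]
    return wallace_sum(partials)


if __name__ == "__main__":
    print(multiply_16x16(0xFFFF, 0xFFFF) == 0xFFFF * 0xFFFF)
```

With 16 partial products, the operand count in this sketch shrinks as 16, 11, 8, 6, 4, 3, 2, so only one carry-propagate addition is needed after six carry-save levels.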


ASM Model of 16x16 Wallace Multiplier


Exploration Process - Methods & Tools

Figure: design flow. Steps and decision points: Start; Capture Design; Compile & Checking; Correct Entry?; Cycle-based Simulation?; Behavioral Simulation; Correct Behavior?; HDL Simulation Required?; Functional Simulation; Correct Function?; Logic Synthesis; Gate-level Timing Analysis; Correct Timing?; Partition, Place & Route; Area & Speed?; Fabricate Device.

Tools shown: KBS flowHDL™ and blockHDL™; Exsedia Nimbus™; Synopsys SGE™, Synopsys VSS™, Synopsys Design Compiler™, HDL Compiler™, FPGA Compiler™, DesignWare™, Design Analyzer™, Timing Analyzer; Xilinx ISE™.

Design Approach - "stepwise refinement", with "iterative enhancement".

Create design "skeleton", with core functions and cycle-level timing information specified.

Iterate the design through synthesis, checking key area and timing constraints.

Return to the top of the process to make corrections, and to enhance the design description.

Integrate completed behavioral block with other blocks for HDL "system" simulation.

Designer Productivity – Effort Distribution

In the modeling exercise, three separate logic circuits of the larger architecture were explored: R-S(n,k) coding, modular MUL, Wallace MUL units within the Mod-MUL.

Figure: Effort Distribution (preliminary). Bar chart by Design Component (R-S Coding, Modular MUL, Wallace MUL); axis ticks from 0 to 16 (units not recoverable); series: System definition & partitioning, Design verification & debugging, Layout, Graphical entry, Logic synthesis & estimation.

R-S(n,k) circuit modeling of (7,3) and one other scheme.

Montgomery MUL modeled for 16, 32, 64, 128-bit operands.

Montgomery decomposed into 3 separate Integer units, realized using Wallace-tree MUL/ADDer units.

Xilinx Spartan® FPGAs targeted for cost, power, performance and area tradeoffs.

Lessons Learned

• Design versus programming: “subjects” can be taught to explore a search space of candidate architectures to realize algorithms in programmable logic/custom computing platforms.

– Design involves planning and configuration tasks in a state-space search.

– Requisite knowledge burden can be minimized through use of model-based problem representation with a graphical notation as mediating interface.

• Designer productivity: effort distribution data is consistent with earlier studies (Joshi et al., 1997, Jawchinda et al., 1999).

– High productivity and design performance are possible without VHDL expertise.

• Designer productivity versus design performance: with careful analysis, search space can be significantly “pruned”.

– Playing with problem formulation identified better “organizing principles” for architecture, changing shape of search space.

– However, there is no substitute for experience (heuristic knowledge).

– Furthermore, codifying heuristics and using automated “estimators” to guide selection of architecture candidates overcomes limitations in current language-based architecture compilers.


Custom Computing for Mobile Computing

• Executable algorithmic state machines:

– Both control and datapath operations specified in “time” (cycle scheduling) and “space” (binding operations to resource types).

– Basic data and memory operations supported in ASM method using datapath macro-functions and memory arrays.

– Notation is directly executable in the tool set, hence, “executable” ASM.

– Designer doesn’t give up design exploration or decision-making to a language compiler.

• Using ASM diagrams:

– A “thinking aid” for defining the structure and sequencing behavior of Finite State Machines.

– Used in 3 different ways: (1) definition/specification of sequential systems, (2) analysis of sequential circuits, (3) design of combinational and sequential circuits behaviorally.

– Use UML diagrams for specification (using IBM’s Rational Rose®) together with architecture modeling in ASM (using Exsedia’s Nimbus™ and IPalette™).

– Designers were able to carry out complete design effort without knowledge of HDLs.

• Algorithms, Patterns and Protocols

– Directly support mapping of algorithm onto candidate architectures.

– Directly support exploration of protocol implementations distributed across many concurrent threads of execution.


Acknowledgements

• This project was made possible with a software grant from Exsedia for use of their Nimbus software.

• Additional tools from Synopsys and Xilinx were provided under a research grant from the Department of Defense.

• Formulation of the R-S(7,3) is based on the coding model presented in Sklar, B., Digital Communications: Fundamentals and Applications, 2nd ed., Prentice-Hall Publishers, Inc., 2001.
