Structured Hardware Design
Ian Pratt
University of Cambridge
Computer Laboratory
[email protected]
Designing Hardware Systems
A good design should work first time
 Simulation
 Verification
 Testing
Top-down methodology
 Decompose into modules
Modules
 Well-defined functions and interfaces
 Often different technologies
 Using pre-existing modules desirable
Broadside components
Bus:
 parallel signals carrying a binary number
 Represented with thick lines
Broadside components:
 Building block is instantiated once for each wire in
the bus
 Building block inputs and outputs connected to the
corresponding members of the buses
 Control connections are wired in parallel
Registers, buffers, multiplexors
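
As a rough illustration of the broadside idea, the sketch below builds a bus-wide 2:1 multiplexor by instantiating a one-bit building block once per wire, with the select line wired in parallel. The names mux2 and broadside_mux2 are invented for this example, and buses are modelled simply as lists of bits.

```python
def mux2(sel: int, a: int, b: int) -> int:
    """One-bit 2:1 multiplexor building block: a when sel is 0, otherwise b."""
    return b if sel else a


def broadside_mux2(sel: int, bus_a: list[int], bus_b: list[int]) -> list[int]:
    """Instantiate mux2 once per bus wire; sel is shared by every instance."""
    if len(bus_a) != len(bus_b):
        raise ValueError("buses must have the same width")
    return [mux2(sel, a, b) for a, b in zip(bus_a, bus_b)]


bus_a = [1, 0, 1, 1]
bus_b = [0, 0, 1, 0]
print(broadside_mux2(0, bus_a, bus_b))  # selects bus_a
print(broadside_mux2(1, bus_a, bus_b))  # selects bus_b
```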
Read-Only Memories
Non-volatile, but typically slow
 Mask programmable
• Cheapest in mass production by far
 One-time programmable (PROM)
 UV Erasable (EPROM)
 Electrically re-programmable (e.g. FLASH)
• Expensive, but many rewrite cycles possible
• ‘Field upgrades’ possible
Choose technology based on #units required
and #rewrite cycles expected
DRAM
Each bit stored in a small capacitor (1T)
 Needs refreshing periodically
 ‘Recovery time’ required after reads
Bits arranged in a square array
 Accessed by row, column (multiplexed address bus)
 Typically 1,4,8 bits wide
E.g.: 8Mbx8 (64Mbit) 50ns access time
New parts have synchronous interface
 SDRAM / DDR / RAMBUS (still same core)
Modules, e.g.: 8Mbx64 100MHz SDRAM
 Made from eight 8Mbx8 parts on a PCB (DIMM)
SRAM
Transparent latch per bit (6T)
 Not as dense as DRAM, more expensive
Fast (7-50ns) access times
 Used in caches
Easy to use – no refresh to worry about
Non-multiplexed address bus
Modern parts have synchronous interfaces
 Pipelined design
E.g.: 256Kbx32 (8Mb) 10ns
Clock generation
RC oscillators rather inaccurate, but cheap
Quartz crystal oscillators commonplace
 Require a little care to make work
 Accurate to ~50ppm
 Clock multiplication
 Phase Locked Loop (PLL)
 E.g.: 133MHz x 7.5 = 997.5MHz (Pentium III)
 Clock distribution trees
 Buffers, or PLLs to get zero propagation delay
Miscellaneous
Power-on reset
 Release reset after power stable
 Get all flip-flops into known state
 (manual reset by shorting capacitor)
Relays can be used to switch large loads
 (alternative is to use power transistors)
 Must protect the transistor driving the relay coil with a diode across the coil (back-EMF)
Mechanical switches ‘bounce’ when switching
 Use a 2-pole switch and RS latch
ALUs
Combinatorial logic implementation
 Takes two N-bit inputs and function selector
 Propagation delay typically determined by carry chain
Typically twos-complement representation
ADD, ADC, SUB, NOT, AND, OR, BIC,…
Flags: Carry-out, Negative, Overflow, Zero
Output will typically be latched, along with flag
status results
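
A minimal sketch of how an N-bit adder might derive the four flags listed above, assuming an 8-bit default width and the usual twos-complement definitions. The function names alu_add and alu_sub are illustrative only; SUB is modelled as A + NOT(B) + 1.

```python
def alu_add(a: int, b: int, carry_in: int = 0, n: int = 8) -> tuple[int, dict[str, int]]:
    """Add two n-bit values and return (result, flags)."""
    mask = (1 << n) - 1
    sign = 1 << (n - 1)
    a &= mask
    b &= mask
    total = a + b + carry_in
    result = total & mask
    flags = {
        "C": int(total > mask),                            # carry out of the top bit
        "N": int(bool(result & sign)),                     # result negative
        "V": int(bool(~(a ^ b) & (a ^ result) & sign)),    # signed overflow
        "Z": int(result == 0),                             # result zero
    }
    return result, flags


def alu_sub(a: int, b: int, n: int = 8) -> tuple[int, dict[str, int]]:
    """Subtract by adding the ones-complement of b with carry-in set."""
    mask = (1 << n) - 1
    return alu_add(a, ~b & mask, 1, n)


print(alu_add(0x7F, 0x01))  # signed overflow case
print(alu_sub(5, 5))        # zero result
```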
Microprocessors
Simple microprocessor control signals:
 Inputs: Clock, Reset
 Output: Request, Read/nWrite, Addr<0..N>
 InOut: Data<0..M>
Read cycles to fetch instructions and load data
Write cycles when updating memory
Begins execution by fetching from reset location
 PC incremented unless branch/jump instruction
Address decoding
Devising a memory map for a design
 Address that memory/peripherals are available at
Non-volatile memory typically mapped at the
reset location
Use combinatorial function of high-order
address bits to generate enable signals
Devise memory map for decoding convenience
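
The decoding step can be sketched as below. The memory map is hypothetical: a 16-bit address bus, reset location 0x0000, and the top two address bits selecting ROM, RAM or I/O. The signal names rom_en, ram_en and io_en are assumptions made for the example.

```python
def decode(addr: int) -> dict[str, bool]:
    """Generate enable signals from the two high-order address bits.

    Assumed map: 0x0000-0x3FFF ROM (holds the reset location),
                 0x4000-0x7FFF RAM, 0x8000-0xBFFF I/O, 0xC000-0xFFFF unused.
    """
    a15 = (addr >> 15) & 1
    a14 = (addr >> 14) & 1
    return {
        "rom_en": (a15, a14) == (0, 0),
        "ram_en": (a15, a14) == (0, 1),
        "io_en": (a15, a14) == (1, 0),
    }


for addr in (0x0000, 0x4ABC, 0x8001, 0xC000):
    print(hex(addr), decode(addr))
```

Aligning each region on a power-of-two boundary is what keeps the decoder down to a function of two bits.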
The PC as a component
Motherboard cost ~£30-100
 4+ wiring layers in PCB
 CPU, DRAM, keyboard, USB, VGA, IDE,
floppy, serial, parallel, audio, IRDA
 Cheap general purpose platform for
supporting other hardware
System-on-a-chip (SOC)
implementations available soon
Interconnecting Modules
How much data in bps needs to flow?
Will the connection be synchronous or async?
Is flow-control needed to limit the flow?
How long do the wires need to reach?
Is the topology fixed at design time?
Is hot-plugging needed?
Can we use an existing design?
PC Parallel Port
8 data wires, 3 control wires
Unidirectional in its most basic form
Flow-control mechanism
Master drives data then asserts strobe_bar
Slave asserts acknowledge
Slave optionally asserts busy
When both busy and acknowledge are
deasserted master can send another byte
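
A toy model of this flow control, assuming the slave raises both acknowledge and busy when it latches a byte and drops them together when it has finished. Real ports have specific timing requirements that are not modelled here, and the class and method names are invented for the sketch.

```python
class Slave:
    """Toy receiver with the acknowledge and busy outputs described above."""

    def __init__(self) -> None:
        self.ack = False
        self.busy = False
        self.received: list[int] = []

    def strobe(self, data: int) -> None:
        # Master asserted strobe_bar: latch the byte, acknowledge it,
        # and flag busy while it is being processed.
        self.received.append(data)
        self.ack = True
        self.busy = True

    def finish(self) -> None:
        # Processing done: deassert acknowledge and busy.
        self.ack = False
        self.busy = False


def send(data: bytes, slave: Slave) -> int:
    """Send each byte using the handshake; return how often the master had to wait."""
    waits = 0
    for byte in data:
        while slave.ack or slave.busy:  # wait until both are deasserted
            waits += 1
            slave.finish()
        slave.strobe(byte)  # drive data, then pulse strobe_bar
    return waits


slave = Slave()
print(send(b"Hi", slave), slave.received)
```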
RS232 Serial Ports
Asynchronous bit stream
One wire for each direction plus ground
Start, data, parity, stop
 Start bits assist clock recovery
Baud rate (e.g. 300, 1200, 9600, 115200)
Various flow-control schemes
 s/w: XOn/XOff characters
 h/w: CTS/RTS signals
Excellent for simple debugging support
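
The framing can be sketched as follows, assuming 8 data bits sent least-significant bit first, a start bit of 0 and stop bits of 1. The helper names frame_byte and frame_time_s are made up for this example.

```python
def frame_byte(byte: int, parity: str = "even", stop_bits: int = 1) -> list[int]:
    """Return the bit sequence for one character: start, data, parity, stop."""
    data = [(byte >> i) & 1 for i in range(8)]  # LSB first
    bits = [0] + data                           # start bit is 0
    if parity == "even":
        bits.append(sum(data) % 2)              # make the number of 1s even
    elif parity == "odd":
        bits.append(1 - sum(data) % 2)          # make the number of 1s odd
    elif parity != "none":
        raise ValueError("parity must be 'even', 'odd' or 'none'")
    bits += [1] * stop_bits                     # stop bits are 1 (idle level)
    return bits


def frame_time_s(bits: list[int], baud: int) -> float:
    """Time on the wire for one frame at the given baud rate."""
    return len(bits) / baud


frame = frame_byte(0x41)
print(frame, frame_time_s(frame, 9600))
```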
Finite State Machines
Building everything from FSMs
 Avoid generated clocks / async resets
 Avoid loops in combinatorial logic
• Current CAD tools only work with FSMs
Timing specifications:
 Tck_to_out, Tsetup, Thold, Tprop
 Beware of long Thold’s
Use Moore outputs between modules
 Easier to characterize delay into next module
Critical path is longest logic path ending in an FF
 Determines maximum clock speed
Johnson Counters
Traditional binary counters require long
logic paths for high-order bits
 Limit clock frequency
Johnson counters are based on shift
registers with feedback
E.g. using a NOR gate for a /5 with 3FFs
 Clock prescalers – easy clock output
PRBS counter (XOR feedback): $2^n - 1$ states with n FFs
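
Both counters can be simulated in a few lines. The wiring below is one possible choice rather than necessarily the one on the slide: the /5 counter feeds NOR(Q1, Q2) into the first flip-flop of a 3-bit shift register, and the PRBS counter XORs the tapped bits of an n-bit shift register (taps 4 and 3 for n = 4).

```python
def nor_shift_counter(steps: int) -> list[tuple[int, int, int]]:
    """3-FF shift register with D0 = NOR(Q1, Q2); cycles through 5 states."""
    q = (0, 0, 0)
    states = []
    for _ in range(steps):
        states.append(q)
        q = (int(not (q[1] or q[2])), q[0], q[1])
    return states


def lfsr_period(n: int, taps: tuple[int, ...], seed: int = 1) -> int:
    """Count clocks until an n-bit XOR-feedback shift register repeats its seed.

    taps are 1-based bit positions and should include n.
    """
    mask = (1 << n) - 1
    start = seed & mask
    state = start
    count = 0
    while True:
        feedback = 0
        for t in taps:
            feedback ^= (state >> (t - 1)) & 1
        state = ((state << 1) | feedback) & mask
        count += 1
        if state == start:
            return count


print(nor_shift_counter(6))
print(lfsr_period(4, (4, 3)))
```

The all-zero state is the one excluded from the PRBS sequence, which is why the count is one short of a binary counter with the same number of flip-flops.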
One Hot Coding
FSM encoding using 1FF per state
 Single FF set, others all clear
Uses more FFs than necessary, but:
 Only very simple decode logic required
• High clock speeds
Particularly useful in FPGAs
Pipelining
Split combinatorial logic into stages separated
by FFs
Enables increased clock speed
 Improved throughput
but, increases delay:
 Tsetup + Tclock_to_out of each FF
 Unbalanced pipeline stages
Feedback paths can make life tricky…
CAD tools can help distribute FFs
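
The trade-off can be put into numbers with a small sketch. The delays below are invented for illustration: 12ns of logic split into unbalanced 5, 4 and 3ns stages, with an assumed 1ns clock-to-out and 0.5ns setup time per flip-flop.

```python
def min_period_ns(stage_delays_ns: list[float], t_ck_to_out_ns: float, t_setup_ns: float) -> float:
    """Clock period is set by the slowest stage plus the FF overhead."""
    return max(stage_delays_ns) + t_ck_to_out_ns + t_setup_ns


def latency_ns(stage_delays_ns: list[float], t_ck_to_out_ns: float, t_setup_ns: float) -> float:
    """Every stage takes a full clock period, however short its own logic is."""
    period = min_period_ns(stage_delays_ns, t_ck_to_out_ns, t_setup_ns)
    return len(stage_delays_ns) * period


for stages in ([12.0], [5.0, 4.0, 3.0]):
    period = min_period_ns(stages, 1.0, 0.5)
    print(stages, "period", period, "ns,",
          "fmax", round(1000.0 / period, 1), "MHz,",
          "latency", latency_ns(stages, 1.0, 0.5), "ns")
```

With these assumed figures the pipelined version clocks roughly twice as fast, but a result takes longer to emerge because each stage pays the flip-flop overhead and is padded out to the slowest stage.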
Gated & Guarded Clocks
Clock Enable ‘safer’ than derived clocks
 Internal multiplexor selects between Din and Q
But, power is proportional to clock freq, so
in some designs it is necessary to:
 Gate lower frequency clocks
 Turn off clocks to currently idle units
When necessary, create clock by OR’ing
clock with synchronised enable_bar
Clock and Data Skew
Skew: when the same signal arrives at
different places at slightly different times
The enemy of synchronous design…
Clock signals are especially vulnerable
 Early clock can cause setup time violation on
critical paths
 Late clock can allow output of previous stage
to race into this one (hold time violation)
Take special care routing clocks!
Crossing Clock Domains
Setup/hold time violations unavoidable
 Metastability can occur, but typically only briefly
• Allow extra time for setup into next FF
• Or, use 2FFs for safety
 Synchronize each signal at a single point
Can use guard signal for buses
 Guard indicates when bus is safe to sample
Or, FIFOs with separate read/write clocks
FSM clocks derived from another FSM
When it’s necessary to use derived clocks:
Use a Moore output to clock the slave
 Function should be hazard free
Be careful to avoid races with other
outputs connected to slave
 Mustn’t change at same time as clock
Outputs from slave back to master may
restrict max clock rate
Integrated Circuits
Si or GaAs substrate with implants
200/300mm wafers, 0.3mm thick
 Only the top few microns ‘active’
Ion implant and etching steps, controlled via
stencils created by exposing a photoresist
coating to UV / X-rays through a mask generated by
CAD tools
7-30+ different masks used
Masks stepped over wafer for each die
 4-500mm² die size
CMOS Technology
nMOS, CMOS, ECL (Bipolar)
 CMOS most popular (and best supported)
Feature size – reduces at 10-20% p.a.
 Smaller ⇒ faster, lower power, higher density
 0.5, 0.35, 0.25, 0.18, 0.15, 0.13μm
Max die size increasing at 10-25% p.a.
 Number of available transistors increasing at 60-80% p.a.
2-7 metal wiring layers. Al (or now Cu)
Separate processes for DRAM, logic, analog
Pads and IO
Pad ring around edge of die
 Pads are typically 50 micron square
 Contain high-power drive outputs and ESD
protection circuitry
 Power / ground ring around pads
Gold bond wires connect to package pins
 Up to 1000+ pins (with expensive packaging)
 Packaging eases handling and dissipates heat
Core bound vs. Pad bound designs
Chip costs
Non Recurring Expenditure (NRE)
 Design costs (labour, tools, overheads...)
 Mask making costs
Per device costs
 Raw wafer, Processing, Testing, Packaging
 Influenced by yield
 $P(\text{die defect free}) \approx K^{\text{die area}}$
• K is probability that any given mm² is defect free
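
A short sketch of how yield feeds into per-device cost, taking the probability of a defect-free die as K raised to the die area in mm². The wafer cost, die counts and K = 0.99 are made-up figures for illustration, and the function names are hypothetical.

```python
def die_yield(k_per_mm2: float, area_mm2: float) -> float:
    """Probability that a die is defect free: K raised to the die area."""
    return k_per_mm2 ** area_mm2


def cost_per_good_die(wafer_cost: float, dies_per_wafer: int,
                      k_per_mm2: float, area_mm2: float) -> float:
    """Spread the processed-wafer cost over the dies expected to work."""
    return wafer_cost / (dies_per_wafer * die_yield(k_per_mm2, area_mm2))


# Assumed numbers: a 1000-unit wafer cost, K = 0.99 per mm2,
# and four times fewer dies per wafer when the die is four times larger.
for area, dies in ((50, 1200), (200, 300)):
    print(area, "mm2:",
          "yield", round(die_yield(0.99, area), 3),
          "cost per good die", round(cost_per_good_die(1000.0, dies, 0.99, area), 2))
```

Because yield falls exponentially with area, the cost per good die rises much faster than the area itself.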
Taxonomy of ICs
Standard parts (off-the-shelf, datasheet available)
Full-custom ASICs
 For best performance, but greatest NRE
 CPUs, memory, DSPs
Semi-custom standard cell ASICs
 Designed from a library of standard gates/cores
Semi-custom gate array ASICs
 Only a few masks required, but inefficient
Field programmable parts
 FPGAs, PALs
Field Programmable Gate Arrays
Volatile, re-programmable & OTP types
 All programmable in situ
Array of Configurable Logic Blocks (CLBs) and
switch matrices (configurable wiring with buffers)
 IO Blocks (IOBs) around edge of die
CLB typically consists of LookUp Tables (LUTs), 1-2
FFs and programmable MUXs
 16x1 LUT (SRAM) implements any fn of 4 variables
 Allowing writes to LUT enables use as RAM
Switch matrices provide hierarchical routing
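
To see why a 16x1 LUT can implement any function of 4 variables, the sketch below fills a 16-bit table from an arbitrary Python function and then reads it back using the four inputs as the address. The names make_lut4 and lut4_read are invented for the example.

```python
from typing import Callable


def make_lut4(fn: Callable[[int, int, int, int], int]) -> int:
    """Evaluate fn for all 16 input combinations and pack the results into 16 bits."""
    contents = 0
    for addr in range(16):
        a, b, c, d = ((addr >> i) & 1 for i in range(4))
        if fn(a, b, c, d):
            contents |= 1 << addr
    return contents


def lut4_read(contents: int, a: int, b: int, c: int, d: int) -> int:
    """The four inputs form the address of the stored bit."""
    addr = a | (b << 1) | (c << 2) | (d << 3)
    return (contents >> addr) & 1


parity = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
print(hex(parity), lut4_read(parity, 1, 0, 1, 1))
```

Writing new contents into the same 16 bits at run time is what lets the LUT double as a small RAM.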
Field Programmable Gate Arrays
Different families use different CLB sizes
 Xilinx 4K series : 2x 4 input LUTs and 2x FFs
 Others more or less fine grained
Very low NRE, rapid turnaround
 Only requires a ‘place and route’ tool run
Great for prototypes, but parts typically cost 10x
more than equivalent gate array
 SRAM/Flash parts enable field upgrades
 Switch to gate arrays in mature designs
Programmable Array Logic
Devices (PALs)
Programmable sum of products array feeding
macrocells
 Good for simple FSMs and glue logic
Macrocell enables combinatorial or registered
output, usually tristateable
 More complex devices also contain buried
macrocells, and may organise macrocells into clusters
with separate clock sources; these are sometimes called CPLDs
(Complex Programmable Logic Devices)
New parts in-circuit-programmable, while others
require a special programmer
 JEDEC description file
Delay and Power
Si/CMOS
 nmos/pmos unipolar transistors, generally small
 Power proportional to frequency
Si/BiCMOS
 CMOS augmented with bipolar for driving large loads
Si/ECL
 Bipolar transistors, kept unsaturated
 x3 performance, but large static current
GaAs/MESFET/Bipolar
 x10 performance, but yield generally poor
Upcoming technologies: SOI, SiGe
Fanout and delay
Output stage speed decreases with load
Dominant aspect of load is capacitance
 Proportional to area of output conductor
 Sum of input capacitances of devices driven
delay = intrinsic delay + (output load x derating
factor) + propagation delay
Gate specification includes intrinsic delay, input
loads and output derating figures
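
The delay formula above can be evaluated directly. In this sketch the output load is taken as the sum of the driven input loads plus a wiring load, all in arbitrary load units; the figures are invented, not taken from any datasheet.

```python
def gate_delay_ns(intrinsic_ns: float, input_loads: list[float], wire_load: float,
                  derating_ns_per_load: float, propagation_ns: float = 0.0) -> float:
    """delay = intrinsic delay + (output load x derating factor) + propagation delay."""
    output_load = sum(input_loads) + wire_load
    return intrinsic_ns + output_load * derating_ns_per_load + propagation_ns


# Assumed gate: 0.5 ns intrinsic delay, 0.25 ns per unit load,
# driving three inputs (loads 1, 1 and 2) over wiring worth 1 unit load.
print(gate_delay_ns(0.5, [1, 1, 2], 1.0, 0.25, propagation_ns=0.25))
```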
Design Partitioning: h/w vs s/w
Hardware
 Use where high throughput required, but
 Harder to design and debug
 Harder to modify
Software
 Running on CPU(s) or microcontroller(s)
• A whole PC; on a PCB; embedded on an ASIC
 Better support for complexity
• Field upgrades
 Can help debug hardware
Hardware partitioning
Partitioning logic over chips motivated by:
 Availability of standard parts
• Use existing parts wherever possible, especially for
prototypes or low volume designs
 Speed required by different function units
• Use exotic technologies as sparingly as possible
 Interconnection speed and width required
• External interconnects much slower than on-chip
and have limited pin count
 ASIC size, pin count, power
Logic Synthesis & Layout
Complex functions expressed
algorithmically, then synthesized to gates
 Good at ‘mechanical’ tasks on relatively small
sections of a design
 Critical sections of a design still done by hand
Place tool attempts to lay out gates to
minimize wiring paths
Route tool attempts to wire gates
Tools are continually improving
 More feedback and integration between tools
The Cambridge Fast Ring
100MHz ECL chip implements:
 Transceivers and serial de/modulator
• ECL has good high-power line driving characteristics
 Serial to parallel and parallel to serial
 Byte alignment
CMOS chip, 50x more logic than ECL chip:
 Media access control protocol / CRC generation
 Small buffer memory / Host processor interface
 Ring monitoring and maintenance
DRAM, VCO, PALs for glue logic to host iface
External Modem
Analogue frontend to telephone line
 Isolation, surge suppression, off-hook relay
Digital Signal Processor as Codec
 Dedicated to a single task
Microcontroller for control
 Talking to host, processing commands etc.
 External NVRAM e.g. Flash to store state
RS232 Line drivers (+/- 12V)
 Requires special fabrication process
Scan multiplexing
Scan multiplexing saves wires (and thus pins)
 Used for LEDs and switches (keyboards)
LED matrix
 Drive column high, write pattern on row
 Scan at >50Hz to avoid flicker
 Drive LEDs hard to make bright
 Pseudo dual porting enables pixel RAM to be updated
Keyboard matrix of push-to-make switches
 Drive column high, read row
 Pull down resistors keep row wires normally low
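
A minimal sketch of the keyboard scan loop, modelling the matrix as a set of (row, column) positions that are currently pressed. It ignores switch bounce and the ghosting that can occur when several keys are held at once, and the function name is illustrative.

```python
def scan_keyboard(pressed: set[tuple[int, int]], n_rows: int, n_cols: int) -> set[tuple[int, int]]:
    """Drive one column high at a time and read every row wire."""
    found = set()
    for col in range(n_cols):
        # With only this column driven high, a row reads 1 if the switch
        # joining it to the column is closed; pull-downs give 0 otherwise.
        row_levels = [1 if (row, col) in pressed else 0 for row in range(n_rows)]
        for row, level in enumerate(row_levels):
            if level:
                found.add((row, col))
    return found


keys = scan_keyboard({(0, 1), (2, 3)}, n_rows=3, n_cols=4)
print(sorted(keys), "using", 3 + 4, "wires for", 3 * 4, "switches")
```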
Audio delay unit
Sample clock of 44.1kHz sufficient for audio
Single counter provides fixed delay
 Read cycle followed by write to same location
Two counters (one loadable) and a mux enables
variable delays
 The lead the write counter has over the read counter sets the delay
 Could use LFSR counters, but no need here
 Could use DRAM, but SRAM easier and dense enough
• Accesses unlikely to be to same page, hence slow
– Could use small staging FIFOs to enable burst reads & writes
Audio so slow, we could use a microcontroller
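
As the last point suggests, the delay unit is simple enough to model in software. The sketch below shows both schemes, with the RAM modelled as a Python list; the class names and buffer sizes are chosen for illustration only.

```python
class FixedDelay:
    """Single counter: read a location, then write the new sample to it."""

    def __init__(self, length: int) -> None:
        self.ram = [0] * length
        self.addr = 0

    def step(self, sample: int) -> int:
        out = self.ram[self.addr]          # read cycle
        self.ram[self.addr] = sample       # write to the same location
        self.addr = (self.addr + 1) % len(self.ram)
        return out


class VariableDelay:
    """Two counters: the write counter leads the read counter by `lead` samples."""

    def __init__(self, size: int, lead: int) -> None:
        if not 0 < lead < size:
            raise ValueError("lead must be between 1 and size - 1")
        self.ram = [0] * size
        self.rd = 0
        self.wr = lead                     # the loadable counter

    def step(self, sample: int) -> int:
        out = self.ram[self.rd]
        self.ram[self.wr] = sample
        self.rd = (self.rd + 1) % len(self.ram)
        self.wr = (self.wr + 1) % len(self.ram)
        return out


fixed = FixedDelay(3)
print([fixed.step(s) for s in range(1, 7)])
variable = VariableDelay(size=8, lead=2)
print([variable.step(s) for s in range(1, 6)])
print("delay for 44100 samples at 44.1kHz:", 44100 / 44100.0, "s")
```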
Network Camera Device :1
Standard parts for:
 Video frontend and resizer, Audio digitizer
 JPEG compression engine
 100Mb/s Network SERDES (de/serializer)
Three 8KBx8 SRAMs for scanline to tile conversion,
controlled by PAL
Three 256KBx8 DRAM FIFOs for framebuffer
PAL for colour conversion / muxing (non compressed)
Network Camera Device :2
FPGA for assembling audio/video/CPU cells for TX
 2KBx8 dual ported SRAM acting as small 3 channel FIFO
FPGA for network interface control
 MAC and CRC generation
 Determines stream priority and reads cell out of SRAM
and feeds it to SERDES (CoDec)
EPROM microcontroller
 Communicates over network with management software
 Co-ordinates frame capture and compression