Design Productivity Crisis


Lecture 11:
Interfaces, I/O and
Configurable Processors
Professor Kurt Keutzer
Computer Science 252
Spring 2000
With contributions from Prof. David Patterson
Niraj Shah, Scott Weber
Kurt Keutzer
1
Embedded Systems vs. General Purpose Computing - 1
Embedded System
• Runs a few applications, often known at design time
• Not end-user programmable
• Operates in fixed run-time constraints; additional performance may not be useful/valuable
General purpose computing
• Intended to run a fully general set of applications
• End-user programmable
• Faster is always better
Embedded Systems vs. General Purpose Computing - 2
Embedded System
Differentiating features:
• power
• cost
• speed (must be predictable)
General purpose computing
Differentiating features:
• speed (need not be fully predictable)
• speed
• did we mention speed?
• cost (largest component power)
Configurability and Embedded Systems
Advantages of configuration:
• Pay (in power, design time, area) only for what you use
• Gain additional performance by adding features tailored to your application
Particularly for embedded systems:
• Principally in embedded controller microprocessor applications
• Some use in DSP
What to Configure?
What parts of the microcontroller/microprocessor system
to configure?
Easy answers:
• Memory and Cache Sizes - get precisely the sizes your application needs
• Register file sizes
• Interrupt handling and addresses
Harder answers:
• Peripherals
• Instructions
But first we need more context
I/O Interrupts
An I/O interrupt is just like the exception handlers except:
• An I/O interrupt is asynchronous
• Further information needs to be conveyed
An I/O interrupt is asynchronous with respect to instruction execution:
• I/O interrupt is not associated with any instruction
• I/O interrupt does not prevent any instruction from completion
  - You can pick your own convenient point to take an interrupt
An I/O interrupt is more complicated than an exception:
• Needs to convey the identity of the device generating the interrupt
• Interrupt requests can have different urgencies:
  - Interrupt request needs to be prioritized
Example: Device Interrupt
User program:
    add   $r1,$r2,$r3
    subi  $r4,$r1,#4
    slli  $r4,$r4,#2
          Hiccup(!)  <- External Interrupt
    lw    $r2,0($r4)
    lw    $r3,4($r4)
    add   $r2,$r2,$r3
    sw    8($r4),$r2
“Interrupt Handler”:
    Raise priority
    Reenable All Ints
    Save registers
    lw    $r1,20($r0)
    lw    $r2,0($r1)
    addi  $r3,$r0,#5
    sw    $r3,0($r1)
    Restore registers
    Clear current Int
    Disable All Ints
    Restore priority
    RTI
Advantage:
• User program progress is only halted during actual transfer
Disadvantage: special hardware is needed to:
• Cause an interrupt (I/O device)
• Detect an interrupt (processor)
• Save the proper states to resume after the interrupt (processor)
Interrupt Driven Data Transfer
[Figure: CPU, Memory, IOC and device. (1) I/O interrupt; (2) save PC; (3) interrupt service addr; (4) the interrupt service routine in memory (read, store, ..., rti) runs in place of the user program (add, sub, and, or, nop).]
User program progress only halted during actual transfer
1000 transfers at 1 ms each:
• 1000 interrupts @ 2 µsec per interrupt
• 1000 interrupt service @ 98 µsec each = 0.1 CPU seconds
Device xfer rate = 10 MBytes/sec => 0.1 x 10^-6 sec/byte => 0.1 µsec/byte
=> 1000 bytes = 100 µsec
1000 transfers x 100 µsecs = 100 ms = 0.1 CPU seconds
Still far from device transfer rate! 1/2 in interrupt overhead
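The arithmetic on the interrupt-driven transfer slide can be checked with a short script. This is a minimal sketch: the variable and function names are illustrative, and reading the "1/2" as interrupt overhead divided by overhead plus raw transfer time is an assumption about what the slide intends.

```python
# Figures taken from the slide; all names are illustrative only.
TRANSFERS = 1000
INTERRUPT_US = 2                        # cost of taking one interrupt (microseconds)
SERVICE_US = 98                         # cost of one interrupt service routine (microseconds)
BYTES_PER_TRANSFER = 1000
DEVICE_RATE_BYTES_PER_S = 10_000_000    # 10 MBytes/sec


def interrupt_overhead_s():
    """CPU seconds spent taking and servicing all interrupts."""
    return TRANSFERS * (INTERRUPT_US + SERVICE_US) * 1e-6


def raw_transfer_s():
    """Seconds needed to move the data at the device transfer rate."""
    return TRANSFERS * BYTES_PER_TRANSFER / DEVICE_RATE_BYTES_PER_S


overhead = interrupt_overhead_s()
raw = raw_transfer_s()
# Assumed reading of "1/2 in interrupt overhead".
fraction = overhead / (overhead + raw)

print(f"interrupt overhead: {overhead:.3f} s")
print(f"raw transfer time:  {raw:.3f} s")
print(f"overhead fraction:  {fraction:.2f}")
```

Both terms come out at about 0.1 seconds, so the interrupt path costs as much CPU time as the data movement itself.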
Better Way to Handle Interrupts?
Handling all interrupts with the CPU could bring it to a halt in a real-time system
Isn’t there a better way?
Hint: remember the trickle-down theory of embedded processor architecture.
Trickle Down Theory of Embedded Architectures
Mainframe/supercomputers
High-end servers/workstations
High-end personal computers
Personal computers
Laptops/palmtops
Gadgets
Features tend to trickle down:
• #bits: 4->8->16->32->64
• ISAs
• Floating point support
• Dynamic scheduling
• Caches
• I/O controllers/processors
• LIW/VLIW
• Superscalar
I/O Interface
Independent I/O bus:
[Figure: CPU connected to Memory over the memory bus and to the peripheral interfaces over an independent I/O bus.]
• Separate I/O instructions (in, out)
Common memory & I/O bus:
[Figure: CPU, Memory and the peripheral interfaces share one bus.]
• Lines distinguish between I/O and memory transfers
• VME bus, Multibus-II, Nubus
• 40 Mbytes/sec optimistically
• 10 MIPS processor completely saturates the bus!
Delegating I/O Responsibility from the CPU: IOP
[Figure: CPU, IOP and Mem on the main memory bus; devices D1, D2, ..., Dn on the I/O bus behind the IOP.]
(1) CPU issues instruction to IOP:
    OP | Device | Address
    Device: target device
    Address: where cmnds are
(2) IOP looks in memory for commands:
    OP | Addr | Cnt | Other
    OP: what to do
    Addr: where to put data
    Cnt: how much
    Other: special requests
(3) Device to/from memory transfers are controlled by the IOP directly. IOP steals memory cycles.
(4) IOP interrupts CPU when done
Memory Mapped I/O
• Single Memory & I/O Bus
• No Separate I/O Instructions
[Figure: CPU on a single bus with ROM, RAM, Memory and the peripheral interfaces (I/O).]
[Figure: CPU with $ and L2 $ on the Memory Bus with Memory; a Bus Adaptor connects the Memory Bus to the I/O bus.]
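To make "no separate I/O instructions" concrete, here is a minimal sketch of a bus model in which ordinary loads and stores reach either RAM or a device register, depending only on the address. The address map, the `Bus` class and the pretend transmit register `UART_TX` are all hypothetical and are not taken from any real part.

```python
# Hypothetical address map, for illustration only.
RAM_BASE, RAM_SIZE = 0x0000, 0x1000
UART_TX = 0x8000          # assumed address of a device transmit register


class Bus:
    """Toy single memory-and-I/O bus: the address alone selects the target."""

    def __init__(self):
        self.ram = bytearray(RAM_SIZE)
        self.uart_output = []          # bytes "sent" by the pretend device

    def store(self, addr, value):
        value &= 0xFF
        if RAM_BASE <= addr < RAM_BASE + RAM_SIZE:
            self.ram[addr - RAM_BASE] = value      # ordinary memory write
        elif addr == UART_TX:
            self.uart_output.append(value)         # same store, but it hits the device
        else:
            raise ValueError(f"unmapped address {addr:#06x}")

    def load(self, addr):
        if RAM_BASE <= addr < RAM_BASE + RAM_SIZE:
            return self.ram[addr - RAM_BASE]
        raise ValueError(f"unmapped address {addr:#06x}")


bus = Bus()
bus.store(0x0010, 42)            # lands in RAM
bus.store(UART_TX, ord("A"))     # lands in the device register
print(bus.load(0x0010), bus.uart_output)
```

The same `store` operation serves both cases, which is the point of memory mapping: the instruction set needs no `in`/`out` pair.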
Delegating I/O Responsibility from the CPU: DMA
Direct Memory Access (DMA):
• External to the CPU
• Acts as a master on the bus
• Transfers blocks of data to or from memory without CPU intervention
CPU sends a starting address, direction, and length count to DMAC. Then issues "start".
[Figure: CPU, Memory, DMAC and IOC on the bus; the device attaches to the IOC.]
DMAC provides handshake signals for Peripheral Controller, and Memory Addresses and handshake signals for Memory.
Direct Memory Access
Time to do 1000 xfers at 1 msec each:
• 1 DMA set-up sequence @ 50 µsec
• 1 interrupt @ 2 µsec
• 1 interrupt service sequence @ 48 µsec
.0001 second of CPU time
CPU sends a starting address, direction, and length count to DMAC. Then issues "start".
[Figure: CPU, Memory, DMAC, IOC and device; memory map from address 0 to n containing ROM, RAM, Peripherals (memory mapped I/O) and DMAC.]
DMAC provides handshake signals for Peripheral Controller, and Memory Addresses and handshake signals for Memory.
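The DMA figure can be checked the same way and set against the 0.1 CPU seconds quoted earlier for interrupt-driven transfer. A minimal sketch; the names are illustrative, and the comparison ratio is derived here rather than stated on the slide.

```python
# Figures from the slides; names are illustrative only.
DMA_SETUP_US = 50               # one DMA set-up sequence
INTERRUPT_US = 2                # one completion interrupt
SERVICE_US = 48                 # one interrupt service sequence
INTERRUPT_DRIVEN_CPU_S = 0.1    # CPU time quoted for the interrupt-driven case

dma_cpu_seconds = (DMA_SETUP_US + INTERRUPT_US + SERVICE_US) * 1e-6
ratio = INTERRUPT_DRIVEN_CPU_S / dma_cpu_seconds

print(f"DMA CPU time: {dma_cpu_seconds:.4f} s")
print(f"interrupt-driven / DMA: about {ratio:.0f}x")
```

The CPU pays the set-up and completion cost once for the whole block instead of once per transfer, which is where the roughly thousandfold saving comes from.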
68332 Family
68K was the most successful embedded controller in
history
CISC instruction set - good code density
Table lookup for compressed tables
Time processing unit - breakthrough in modular peripheral
handling!
MC68332 - Top level
[Figure: CPU32, RAM, serial I/O and the time processing unit (TPU, I/O channels 0 to 15) connected by the inter module bus (IMB), with IMB control.]
Designed for automotive applications with a mixture of computation-intensive tasks and complex I/O functions
Idea: off-load CPU from frequent I/O interactions to make use of computation performance: TPU
68332 CPU Block Diagram
Addressing Modes in 68332
Seven modes
• Register direct
• Register indirect
• Register indirect with index
• Program counter indirect with displacement
• Program counter indirect with Index
• Absolute
• Immediate
Why so many modes? Antiquated architectural feature?
Complex addressing modes allow for denser code … but …
MCore - Motorola’s embedded microcontroller rewrite uses simple DLX-like Load Store instructions - code size impact?
MC68332 Time Processing Unit
• Independent programmable timer channels: single-shot "capture & compare"
• Channel coupling and sequence control with control processor
[Figure: TPU block diagram. Labels: Host Interface, Control, IMB, System Configuration, Development Support and Test, Service Requests, Scheduler, Timer Channels (Channel 0 to Channel 15), time base, Pins, Microengine, Channel Control, Data, Parameter RAM, Control Store, Execution Unit.]
TPU: time processing unit: peripheral coprocessor
Time Processing Unit
Semi-autonomous microcontroller
Operates concurrently with CPU
• Schedules tasks
• Processes ROM instructions
• Accesses shared data with CPU
• Performs Input/Output
Uses of Time Processing Unit
Programmable series of two operations
• Match
• Capture
Each operation is called an ``event’’
A pre-programmed series of events is called a ``function’’
Pre-programmed functions
• Input capture/input transition counter
• Output compare
• Period measurement with additional/missing transition detect
• Position synchronized pulse-generator
• Period/pulse-width accumulator
Time Bases
Two sixteen-bit counters provide time bases for all channels
Pre-scalers controlled by CPU via bit fields in the TPU module configuration register, TPUMCR
Current values accessible via TCR1 and TCR2 registers
TCR1, TCR2 can be read/written by TPU microcode - not available to CPU
TCR1 qualified by system clock
TCR2 qualified by system clock or external clock
Timer Channels
Sixteen channels - each one connected to an MCU pin
Each channel has symmetric hardware:
• Event register
  - 16-bit capture register
  - 16-bit compare/match register
  - 16-bit comparator
• Pin control logic - pin direction determined by TPU microengine
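The event register hardware listed for each timer channel can be pictured with a toy model. This is a sketch under stated assumptions, not the TPU's actual behavior: the class and method names are hypothetical, the match uses a plain equality compare, and pin control logic is reduced to a boolean flag.

```python
MASK16 = 0xFFFF


class TimerChannel:
    """Toy model of one channel's event register (all names are illustrative)."""

    def __init__(self):
        self.capture = 0      # 16-bit capture register
        self.match = None     # 16-bit compare/match register; None means not armed

    def arm_match(self, value):
        """Load the match register, truncated to 16 bits."""
        self.match = value & MASK16

    def step(self, tcr, pin_transition=False):
        """Process one time-base value and return the events that occurred."""
        tcr &= MASK16
        events = []
        # 16-bit comparator: assumed equality compare against the time base.
        if self.match is not None and tcr == self.match:
            events.append("match")
            self.match = None             # single-shot: disarm after firing
        if pin_transition:
            self.capture = tcr            # latch the time base on a pin transition
            events.append("capture")
        return events


ch = TimerChannel()
ch.arm_match(0x10005)                     # truncates to 0x0005
log = [(t, ch.step(t, pin_transition=(t == 7))) for t in range(10)]
fired = [(t, ev) for t, ev in log if ev]
print(fired, ch.capture)
```

A match event fires once at time 5 and a capture event latches time 7, matching the single-shot "capture & compare" description of the channels.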
Scheduler
Determines which of sixteen channels is serviced by the microengine
Channel can request service for one of four reasons:
• host service
• link to another channel
• match event
• capture event
Host system assigns to each channel a priority:
• high
• middle
• low
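A scheduler of this kind can be sketched as a selection over pending service requests. The slide only says that each channel has a high, middle or low priority; the strict-priority rule and the lowest-channel-number tie-break below are assumptions made for illustration, and the real TPU scheduling policy may differ. The function name `pick_channel` is hypothetical.

```python
PRIORITY_RANK = {"high": 0, "middle": 1, "low": 2}


def pick_channel(priorities, requests):
    """Return the requesting channel to service next, or None if none is pending.

    priorities: dict mapping channel number -> "high" / "middle" / "low"
    requests:   dict mapping channel number -> reason for the service request
    """
    pending = [ch for ch in requests if ch in priorities]
    if not pending:
        return None
    # Assumed policy: strict priority, ties broken by lowest channel number.
    return min(pending, key=lambda ch: (PRIORITY_RANK[priorities[ch]], ch))


priorities = {0: "low", 3: "high", 7: "middle", 9: "high"}
requests = {0: "match event", 7: "capture event", 9: "host service"}
print(pick_channel(priorities, requests))
```

Channel 3 has high priority but no pending request, so the high-priority requester, channel 9, is serviced first.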
Microengine
Another Motorola Microprocessor
Concepts so far ...
• Interrupts
• Memory Mapping of I/O
• Time Processing Unit / Peripheral Processor
Other configurable elements:
• Peripherals
• Instructions
Configurability in ARM Processor
ARM allows for configurability via the AMBA bus
Offers ``PrimeCell’’ peripherals which hook into the AMBA Advanced Peripheral Bus (APB)
• UART
• Real Time Clock
• Audio Codec Interface
• Keyboard and mouse interface
• General purpose I/O
• Smart card interface
• Generic IR interface
http://www.arm.com/Pro+Peripherals/PrimeCell/index.html
ARM7 core
ARM’s AMBA open standard
Advanced System Bus (ASB) - high performance, CPU, DMA, external
Advanced Peripheral Bus (APB) - low speed, low power, parallel I/O, UARTs
External interface
http://www.arm.com/Documentation/Overviews/AMBA_Intro/#intro
Ex 1: ARM Infrared (IR) Interface
Ex 2: ARM Smart Card Interface
Ex 3: Audio Codec
Another Kind of Configurability
[Figure: synthesis flow. HDL -> RTL Synthesis -> netlist -> logic optimization (using Library) -> netlist -> physical design -> layout.]
Synthesis of a processor core from an RTL description allows for:
• full range of other types of configurability
• additional degrees of freedom in quality of implementation
Examples:
• ARM7
• Motorola ColdFire
• Tensilica Xtensa
Quality of Results Tradeoffs
Synthesizable implementation allows for exploration of a wide range of implementations
[Figure: plot of Delay versus Area.]
ARM Core7 Thumb Embedded
Ultimate configurability: the Tensilica solution
Tensilica Viterbi Implementation
Niraj Shah
Scott Weber
290A Final Presentation
Tensilica Flow
[Figure: flow diagram. Labels: .c sources, TIE, Tensilica Processor Generator, uArch Designer, generated (gen) xt-gcc and xt-run, .o.]
Xtensa Architecture
Xtensa Core is a 32 bit configurable RISC processor
TIE Extensions:
• single cycle
• state free
• no new exceptions
• no stalls
• typeless data
Rs, Rt, Rr are 32 bit regs
I is the instruction controlling the TIE unit
[Figure: the Xtensa Core passes Rs, Rt and I to the TIE unit, which returns Rr.]
Viterbi Architecture
[Figure: block diagram. Blocks: Init, ADC I/O Device, RAM, ACS, TraceBack. Annotation: "Measured Performance Here".]
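TIE SetupBMreg (ACS)
[Figure: datapath. Inputs: Rs carrying I and Rt carrying Q in their low-order bits, and the constant 0x7F. Add and subtract units, selected by the instruction (Control), produce bm0, bm1, bm2 and bm3, packed into the four bytes of Rr.]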
ACS TIE Extension (ACS)
[Figure: datapath. Rs holds path metrics (pm); Rt holds bm3, bm2, bm1 and bm0 in its four bytes. Adders feed a subtractor whose msb (=1?) gives the decision bit; Rr holds the selected pm, the decision bit and 0's. Instructions: ACS03 || ACS12 || ACS30 || ACS21.]
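The add-compare-select (ACS) step that these TIE instructions accelerate can be written out in a few lines. This is a generic sketch of ACS, not a model of the TIE datapath: the function name, the argument layout and the choice to keep the smaller metric are assumptions, and bit packing into Rs, Rt and Rr is ignored.

```python
def acs(pm_a, bm_a, pm_b, bm_b):
    """Add-compare-select for one trellis state.

    pm_a, pm_b: path metrics of the two predecessor states
    bm_a, bm_b: branch metrics of the two incoming branches
    Returns (new_path_metric, decision_bit).
    """
    cand_a = pm_a + bm_a                 # add
    cand_b = pm_b + bm_b
    diff = cand_b - cand_a               # compare: the sign of the difference decides
    decision = 1 if diff < 0 else 0      # 1 means branch b survives (assumed convention)
    new_pm = cand_b if decision else cand_a   # select the smaller candidate
    return new_pm, decision


print(acs(10, 3, 8, 2))   # candidates 13 and 10
print(acs(4, 1, 8, 2))    # candidates 5 and 10
```

The decision bits are what a traceback stage later reads to recover the surviving path.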
ACS TIE Extension with State (ACS)
[Figure: datapath with two add-compare-select paths. Rs holds path metrics (pm); Rt holds bm3, bm2, bm1 and bm0 in its four bytes. Each path adds, subtracts, and tests the msb (=1?) to produce a decision bit under instruction Control; Rr holds two pm values, and the decision bits are also produced.]
TIE Zmask (TraceBack)
[Figure: datapath. Inputs Rs and Rt; a shift left by one (<<1), AND masks with 0x7F and 0x3F, and an OR, selected by the instruction (Control), produce Rr.]
Designs
All designs had a BER of 0.000095 after 10 million iterations
Design 1:  100 MHz, 48 mW, 1K DCache, 1K ICache, TIE
Design 1+: 222 MHz, 144 mW, 1K DCache, 1K ICache, TIE
Design 2-: 100 MHz, 69 mW, 16K DCache, 16K ICache, TIE
Design 2:  222 MHz, 191 mW, 16K DCache, 16K ICache, TIE
Design 3:  222 MHz, 191 mW, 16K DCache, 16K ICache, TIE with state
Performance
[Figure: bar chart of throughput in Kb/s per design, with Cache and Perfect Cache series.]
Design   Cache   Perfect Cache
1        118     409
1+       263     909
2-       357     409
2        793     966
3        909     1142
Energy Dissipation
[Figure: bar chart of energy in uJ/bit per design, with Cache and Perfect Cache series.]
Design   Cache   Perfect Cache
1        0.4     0.12
1+       0.54    0.16
2-       0.19    0.17
2        0.24    0.2
3        0.21    0.17
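The energy-per-bit figures follow from dividing each design's power by its throughput. A minimal sketch: power values are from the Designs slide and throughput values from the Performance chart (Cache series); pairing each throughput bar with a design is an assumption, chosen because it reproduces the charted uJ/bit values to within rounding. The function name is illustrative.

```python
# (power in mW, throughput in Kb/s); the design-to-throughput pairing is assumed.
DESIGNS = {
    "1":  (48, 118),
    "1+": (144, 263),
    "2-": (69, 357),
    "2":  (191, 793),
    "3":  (191, 909),
}


def energy_uj_per_bit(power_mw, throughput_kbps):
    # mW / (Kb/s) = (1e-3 J/s) / (1e3 bit/s) = 1e-6 J/bit, i.e. uJ/bit
    return power_mw / throughput_kbps


energies = {name: round(energy_uj_per_bit(p, t), 2)
            for name, (p, t) in DESIGNS.items()}
for name, e in energies.items():
    print(f"Design {name}: {e} uJ/bit")
```

The faster clock of Design 1+ raises power more than throughput when the caches are small, which is why it shows the worst energy per bit.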
n(s*J)/Bit
[Figure: bar chart of n(s*J)/Bit per design, with Cache and Perfect Cache series.]
Design   Cache   Perfect Cache
1        3.39    0.293
1+       2.05    0.176
2-       0.532   0.416
2        0.315   0.207
3        0.231   0.148
Die Area
[Figure: bar chart of die area in mm2 for Designs 1, 1+, 2-, 2 and 3, with Cache and Perfect Cache series; the two series are equal for every design. Bar values shown: 2.1, 2.37, 6.14, 6.7 and 6.7.]
Summary: Levels of Configurability
Configurability is highly desirable in embedded applications
There are many levels of configuration:
• Memory and Cache Sizes - get precisely the sizes your application needs
• Register file sizes
• Interrupt handling and addresses
• Peripherals
• Instructions
• Physical implementation