Transcript 2 - SKKU

Processor Selection
1
Introduction
• General-Purpose Processor
- Processor designed for a variety of computation tasks
- Low unit cost, in part because manufacturer spreads NRE over large
numbers of units
- Motorola sold half a billion 68HC05 microcontrollers in 1996 alone
- Carefully designed since higher NRE is acceptable
- Can yield good performance, size and power
- Low NRE cost, short time-to-market/prototype, high flexibility
- User just writes software; no processor design
- a.k.a. “microprocessor” – “micro” used when they were implemented
on one or a few chips rather than entire rooms
2
Basic Architecture
• Control unit and datapath
- Note similarity to single-purpose processor
• Key differences
- Datapath is general
- Control unit doesn't store the algorithm; the algorithm is "programmed" into the memory
[Figure: processor containing a control unit (controller, control/status, PC, IR) and a datapath (ALU, registers), connected to I/O and memory]
3
Datapath Operations
• Load
- Read memory location into register
• ALU operation
- Input certain registers through ALU, store back in register
• Store
- Write register to memory location
[Figure: datapath example in which the value 10 is loaded from memory into a register, incremented (+1) by the ALU to 11, and stored back to memory]
4
Control Unit
• Control unit: configures the datapath operations
- Sequence of desired operations ("instructions") stored in memory: the "program"
• Instruction cycle: broken into several sub-operations, each one clock cycle, e.g.:
- Fetch: Get next instruction into IR
- Decode: Determine what the instruction means
- Fetch operands: Move data from memory to datapath register
- Execute: Move data through the ALU
- Store results: Write data from register to memory
[Figure: processor (control unit with PC and IR; datapath with ALU and registers R0, R1), I/O, and memory holding the program below, with M[500] = 10]
100 load R0, M[500]
101 inc R1, R0
102 store M[501], R1
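The instruction cycle above can be sketched in C. This is only an illustrative model, not how a real control unit is built: the `Instr` and `Cpu` types, the opcode names, and `run_slide_program` are hypothetical names introduced here, and the five sub-operations are folded into one `step` function rather than taking one clock cycle each.

```c
#include <assert.h>

/* Hypothetical opcodes for the three instructions shown on the slide. */
enum { OP_LOAD, OP_INC, OP_STORE };

typedef struct { int op; int dst; int src; int addr; } Instr;

typedef struct {
    int pc;         /* program counter: address of the next instruction */
    Instr ir;       /* instruction register: the fetched instruction */
    int reg[2];     /* R0 and R1 */
    int mem[1024];  /* data memory */
} Cpu;

/* One instruction cycle. `base` is the address of program[0]. */
void step(Cpu *cpu, const Instr *program, int base) {
    assert(cpu->pc >= base);
    cpu->ir = program[cpu->pc - base];  /* Fetch: next instruction into IR */
    cpu->pc += 1;                       /* PC again points to the next instruction */
    switch (cpu->ir.op) {               /* Decode */
    case OP_LOAD:                       /* Fetch operands: memory -> register */
        cpu->reg[cpu->ir.dst] = cpu->mem[cpu->ir.addr];
        break;
    case OP_INC:                        /* Execute: move data through the ALU */
        cpu->reg[cpu->ir.dst] = cpu->reg[cpu->ir.src] + 1;
        break;
    case OP_STORE:                      /* Store results: register -> memory */
        cpu->mem[cpu->ir.addr] = cpu->reg[cpu->ir.src];
        break;
    }
}

/* Runs the slide's three-instruction program and returns M[501]. */
int run_slide_program(void) {
    static const Instr program[] = {
        { OP_LOAD,  0, 0, 500 },  /* 100: load R0, M[500]  */
        { OP_INC,   1, 0, 0   },  /* 101: inc R1, R0       */
        { OP_STORE, 0, 1, 501 },  /* 102: store M[501], R1 */
    };
    Cpu cpu = {0};
    cpu.pc = 100;
    cpu.mem[500] = 10;
    for (int i = 0; i < 3; i++)
        step(&cpu, program, 100);
    return cpu.mem[501];
}
```

Calling `run_slide_program()` walks PC through 100, 101 and 102 and leaves 11 in M[501], matching the values in the slide's figure.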
5
Control Unit Sub-Operations
• Fetch
- Get next instruction into IR
- PC: program counter, always points to next instruction
- IR: holds the fetched instruction
[Figure: processor state during fetch: PC = 100, IR = load R0, M[500]; memory holds the program at 100-102 and M[500] = 10]
6
Control Unit Sub-Operations
• Decode
- Determine what the instruction means
[Figure: processor state during decode: PC = 100, IR = load R0, M[500]; M[500] = 10]
7
Control Unit Sub-Operations
• Fetch operands
- Move data from memory to datapath register
[Figure: processor state during operand fetch: PC = 100, IR = load R0, M[500]; the value 10 moves from M[500] into the registers]
8
Control Unit Sub-Operations
• Execute
- Move data through the ALU
- This particular instruction does nothing during this sub-operation
[Figure: processor state during execute: PC = 100, IR = load R0, M[500]; register value 10, M[500] = 10]
9
Control Unit Sub-Operations
• Store results
- Write data from register to memory
- This particular instruction does nothing during this sub-operation
[Figure: processor state during store results: PC = 100, IR = load R0, M[500]; register value 10, M[500] = 10]
10
Instruction Cycles
[Figure: timing diagram for the instruction at PC=100 (load R0, M[500]): Fetch, Decode, Fetch ops, Exec., Store results, one clock (clk) cycle each; afterwards the register holds 10]
11
Instruction Cycles
[Figure: timing diagram for the instruction at PC=100 followed by the instruction at PC=101 (inc R1, R0), each with Fetch, Decode, Fetch ops, Exec., Store results; the ALU adds 1, so the registers hold 10 and 11]
12
Instruction Cycles
[Figure: timing diagram for the instructions at PC=100, PC=101 and PC=102 (store M[501], R1), each with Fetch, Decode, Fetch ops, Exec., Store results; the registers hold 10 and 11, and memory ends with M[500] = 10 and M[501] = 11]
13
Architectural Considerations
• N-bit processor
- N-bit ALU, registers, buses, memory data interface
- Embedded: 8-bit, 16-bit, 32-bit common
- Desktop/servers: 32-bit, even 64
• PC size determines address space
14
Architectural Considerations
• Clock frequency
- Inverse of clock period
- Clock period must be longer than the longest register-to-register delay in the entire processor
- Memory access is often the longest
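The relationship between delay and clock frequency can be worked through in a few lines of C. This is a sketch under simple assumptions: the three delay parameters and their example values are hypothetical, and real timing analysis also accounts for setup time and clock skew.

```c
#include <assert.h>

/* Shortest usable clock period (ns): the longest register-to-register delay. */
double clock_period_ns(double reg_delay_ns, double alu_delay_ns, double mem_delay_ns) {
    assert(reg_delay_ns > 0 && alu_delay_ns > 0 && mem_delay_ns > 0);
    double longest = reg_delay_ns;
    if (alu_delay_ns > longest) longest = alu_delay_ns;
    if (mem_delay_ns > longest) longest = mem_delay_ns;
    return longest;
}

/* Clock frequency is the inverse of the period: 1 / (period in ns) GHz = 1000 / period MHz. */
double max_clock_mhz(double period_ns) {
    assert(period_ns > 0);
    return 1000.0 / period_ns;
}
```

With hypothetical delays of 2 ns, 5 ns and 10 ns (the last being a memory access), `max_clock_mhz(clock_period_ns(2.0, 5.0, 10.0))` gives 100 MHz: the slowest path sets the clock for the whole processor.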
15
Pipelining: Increasing Instruction
Throughput
[Figure: non-pipelined vs. pipelined dish cleaning (Wash, Dry) for dishes 1-8, and pipelined instruction execution (Fetch-instr., Decode, Fetch ops., Execute, Store res.) for instructions 1-8, each plotted against time]
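The throughput gain in the figure can be made concrete with a small calculation. This sketch assumes an ideal pipeline in which every stage takes one cycle and nothing stalls; the function names are illustrative only.

```c
#include <assert.h>

/* Cycles for n items when each item passes through all k stages before the next starts. */
long non_pipelined_cycles(long n, long k) {
    assert(n >= 0 && k > 0);
    return n * k;
}

/* Ideal k-stage pipeline: k cycles for the first item, then one item completes per cycle. */
long pipelined_cycles(long n, long k) {
    assert(n >= 0 && k > 0);
    return (n == 0) ? 0 : k + (n - 1);
}
```

For the eight dishes with two stages (wash, dry), this gives 16 time units without pipelining and 9 with it; for eight instructions through five stages, 40 cycles versus 12.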
16
Superscalar and VLIW Architectures
• Performance can be improved by:
- Faster clock (but there’s a limit)
- Pipelining: slice up instruction into stages, overlap stages
- Multiple ALUs to support more than one instruction stream
- Superscalar
- Scalar: non-vector operations
- Fetches instructions in batches, executes as many as possible
- May require extensive hardware to detect independent
instructions
- VLIW: each word in memory has multiple independent
instructions
- Relies on the compiler to detect and schedule instructions
- Currently growing in popularity
17
Cache Memory
• Memory access may be slow
• Cache is small but fast memory close to processor
- Holds copy of part of memory
- Hits and misses
[Figure: processor, cache (fast/expensive technology, usually on the same chip), and memory (slower/cheaper technology, usually on a different chip)]
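Hits and misses can be illustrated with a very small cache model. The slides do not say how the cache is organized, so this sketch assumes a direct-mapped cache with four one-word lines; `NUM_LINES`, the `Cache` type and the address trace are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_LINES 4  /* hypothetical cache size, in one-word lines */

typedef struct {
    bool valid[NUM_LINES];
    unsigned tag[NUM_LINES];
    unsigned hits, misses;
} Cache;

/* Look up one address. On a miss, the line is refilled from memory. Returns true on a hit. */
bool cache_access(Cache *c, unsigned addr) {
    unsigned line = addr % NUM_LINES;  /* which cache line this address maps to */
    unsigned tag  = addr / NUM_LINES;  /* identifies which memory word the line holds */
    if (c->valid[line] && c->tag[line] == tag) {
        c->hits++;
        return true;
    }
    c->valid[line] = true;
    c->tag[line] = tag;
    c->misses++;
    return false;
}

/* Replays a short hypothetical address trace and returns the number of hits. */
unsigned demo_hits(void) {
    static const unsigned trace[] = { 0, 1, 0, 1, 4, 0 };
    Cache c = { {false}, {0}, 0, 0 };
    for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++)
        cache_access(&c, trace[i]);
    return c.hits;
}
```

In the trace, the repeated accesses to addresses 0 and 1 hit, while address 4 maps to the same line as address 0 and evicts it, so the final access to 0 misses again.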
18
Programmer’s View
• Programmer doesn't need detailed understanding of architecture
- Instead, needs to know what instructions can be executed
• Two levels of instructions:
- Assembly level
- Structured languages (C, C++, Java, etc.)
• Most development today done using structured languages
- But, some assembly level programming may still be necessary
- Drivers: portion of program that communicates with and/or controls (drives) another device
- Often have detailed timing considerations, extensive bit manipulation
- Assembly level may be best for these
19
Assembly-Level Instructions
Instruction 1 | opcode | operand1 | operand2
Instruction 2 | opcode | operand1 | operand2
Instruction 3 | opcode | operand1 | operand2
Instruction 4 | opcode | operand1 | operand2
...
• Instruction Set
- Defines the legal set of instructions for that processor
- Data transfer: memory/register, register/register, I/O, etc.
- Arithmetic/logical: move register through ALU and back
- Branches: determine next PC value when not just PC+1
20
A Simple (Trivial) Instruction Set
Assembly instruct. | First byte (opcode, operand) | Second byte (operands) | Operation
MOV Rn, direct | 0000 Rn | direct | Rn = M(direct)
MOV direct, Rn | 0001 Rn | direct | M(direct) = Rn
MOV @Rn, Rm | 0010 Rn | Rm | M(Rn) = Rm
MOV Rn, #immed. | 0011 Rn | immediate | Rn = immediate
ADD Rn, Rm | 0100 Rn | Rm | Rn = Rn + Rm
SUB Rn, Rm | 0101 Rn | Rm | Rn = Rn - Rm
JZ Rn, relative | 0110 Rn | relative | PC = PC + relative (only if Rn is 0)
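As an illustration of the two-byte encoding in the table, the helpers below pack an instruction's fields into bytes. The function and constant names are hypothetical. Placing Rm in the upper four bits of the second byte is an assumption, chosen to match the simulator and the `rm IR[7..4]` alias shown later in these slides.

```c
#include <assert.h>
#include <stdint.h>

/* Opcodes from the table (upper four bits of the first byte). */
enum {
    OP_MOV_LOAD  = 0x0,  /* MOV Rn, direct  */
    OP_MOV_STORE = 0x1,  /* MOV direct, Rn  */
    OP_MOV_INDIR = 0x2,  /* MOV @Rn, Rm     */
    OP_MOV_IMM   = 0x3,  /* MOV Rn, #immed. */
    OP_ADD       = 0x4,
    OP_SUB       = 0x5,
    OP_JZ        = 0x6
};

/* First byte: opcode in the high nibble, Rn in the low nibble. */
uint8_t first_byte(unsigned opcode, unsigned rn) {
    assert(opcode <= 0xF && rn <= 0xF);
    return (uint8_t)((opcode << 4) | rn);
}

/* Second byte for register-register forms: Rm in the high nibble (assumed layout). */
uint8_t second_byte_reg(unsigned rm) {
    assert(rm <= 0xF);
    return (uint8_t)(rm << 4);
}
```

For example, `MOV R1, #10` would encode as the bytes 0x31 0x0A, and `ADD R0, R1` as 0x40 0x10.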
21
Addressing Modes
Addressing mode | Operand field | Register-file contents | Memory contents
Immediate | Data | (not used) | (not used)
Register-direct | Register address | Data | (not used)
Register indirect | Register address | Memory address | Data
Direct | Memory address | (not used) | Data
Indirect | Memory address | (not used) | Memory address, which in turn points to Data
22
Sample Programs
C program

int total = 0;
for (int i=10; i!=0; i--)
    total += i;
// next instructions...

Equivalent assembly program

0       MOV R0, #0;     // total = 0
1       MOV R1, #10;    // i = 10
2       MOV R2, #1;     // constant 1
3       MOV R3, #0;     // constant 0
Loop:   JZ R1, Next;    // Done if i=0
5       ADD R0, R1;     // total += i
6       SUB R1, R2;     // i--
7       JZ R3, Loop;    // Jump always
Next:   // next instructions...

• Try some others
- Handshake: Wait until the value of M[254] is not 0, set M[255] to 1, wait until M[254] is 0, set M[255] to 0 (assume those locations are ports).
- (Harder) Count the occurrences of zero in an array stored in memory locations 100 through 199.
23
Programmer Considerations
• Program and data memory space
- Embedded processors often very limited
- e.g., 64 Kbytes program, 256 bytes of RAM (expandable)
• Registers: How many are there?
- Only a direct concern for assembly-level programmers
• I/O
- How to communicate with external signals?
• Interrupts
24
Example: parallel port driver
LPT Connection Pin | I/O Direction | Register Address
1 | Output | 0th bit of register #2
2-9 | Output | 0th-7th bit of register #0
10,11,12,13,15 | Input | 6,7,5,4,3th bit of register #1
14,16,17 | Output | 1,2,3th bit of register #2

[Figure: PC parallel port with a switch connected to Pin 13 (input) and an LED connected to Pin 2 (output)]

• Using assembly language programming we can configure a PC parallel port to perform digital I/O
- Write and read to three special registers to accomplish this; the table provides the list of parallel port connector pins and corresponding register locations
- Example: parallel port monitors the input switch and turns the LED on/off accordingly
25
Parallel Port Example
; This program consists of a sub-routine that reads
; the state of the input pin, determining the on/off state
; of our switch and asserts the output pin, turning the LED
; on/off accordingly
        .386

CheckPort   proc
        push    ax              ; save the content
        push    dx              ; save the content
        mov     dx, 3BCh + 1    ; base + 1 for register #1
        in      al, dx          ; read register #1
        and     al, 10h         ; mask out all but bit # 4 (Pin 13)
        cmp     al, 0           ; is it 0?
        jne     SwitchOn        ; if not, we need to turn the LED on

SwitchOff:
        mov     dx, 3BCh + 0    ; base + 0 for register #0
        in      al, dx          ; read the current state of the port
        and     al, 0FEh        ; clear first bit (masking); the slide's f7h would clear bit 3 instead
        out     dx, al          ; write it out to the port
        jmp     Done            ; we are done

SwitchOn:
        mov     dx, 3BCh + 0    ; base + 0 for register #0
        in      al, dx          ; read the current state of the port
        or      al, 01h         ; set first bit (masking)
        out     dx, al          ; write it out to the port

Done:
        pop     dx              ; restore the content
        pop     ax              ; restore the content
        ret                     ; return to the caller (missing on the slide)
CheckPort   endp

extern "C" void CheckPort(void);    // defined in assembly

int main(void) {
    while( 1 ) {
        CheckPort();
    }
}
26
Operating System
• Optional software layer providing low-level services to a program
(application).
- File management, disk access
- Keyboard/display interfacing
- Scheduling multiple programs for execution
- Or even just multiple threads from one program
- Program makes system calls to the OS
DB file_name "out.txt"  -- store file name

MOV R0, 1324            -- system call "open" id
MOV R1, file_name       -- address of file-name
INT 34                  -- cause a system call
JZ R0, L1               -- if zero -> error

. . . read the file
JMP L2                  -- bypass error cond.
L1:
. . . handle the error
L2:
27
Development Environment
• Development processor
- The processor on which we write and debug our programs
- Usually a PC
• Target processor
- The processor that the program will run on in our embedded system
- Often different from the development processor
[Figure: development processor and target processor]
28
Software Development Process
• Compilers
- Cross compiler
- Runs on one processor, but generates code for another
• Assemblers
• Linkers
• Debuggers
• Profilers
[Figure: implementation phase: C files pass through the compiler and asm. files through the assembler to produce binary files, which the linker combines with a library into an exec. file; verification phase: the exec. file is run under the debugger and profiler]
29
Running a Program
• If development processor is different than target, how can we run
our compiled code? Two options:
- Download to target processor
- Simulate
• Simulation
- One method: Hardware description language
- But slow, not always available
- Another method: Instruction set simulator (ISS)
- Runs on development processor, but executes instructions of target
processor
30
Instruction Set Simulator For A Simple
Processor
#include <stdio.h>

typedef struct {
    unsigned char first_byte, second_byte;
} instruction;

instruction program[1024];   // instruction memory
unsigned char memory[256];   // data memory

// Not shown on the slide; a minimal version that dumps data memory, 16 bytes per row.
void print_memory_contents(void) {
    for (int i = 0; i < 256; i++)
        printf("%02x%c", memory[i], (i % 16 == 15) ? '\n' : ' ');
}

// Executes num_bytes bytes of program. Returns 0 on success, -1 on an unknown opcode.
int run_program(int num_bytes) {
    int pc = -1;
    unsigned char reg[16] = {0}, fb, sb;

    while( ++pc < (num_bytes / 2) ) {
        fb = program[pc].first_byte;
        sb = program[pc].second_byte;
        switch( fb >> 4 ) {
            case 0: reg[fb & 0x0f] = memory[sb]; break;
            case 1: memory[sb] = reg[fb & 0x0f]; break;
            case 2: memory[reg[fb & 0x0f]] = reg[sb >> 4]; break;
            case 3: reg[fb & 0x0f] = sb; break;
            case 4: reg[fb & 0x0f] += reg[sb >> 4]; break;
            case 5: reg[fb & 0x0f] -= reg[sb >> 4]; break;
            // JZ: jump only if Rn is 0; offset treated as signed so loops can jump backward
            case 6: if (reg[fb & 0x0f] == 0) pc += (signed char)sb; break;
            default: return -1;
        }
    }
    return 0;
}

int main(int argc, char *argv[]) {
    FILE* ifs;

    if( argc != 2 || (ifs = fopen(argv[1], "rb")) == NULL ) {
        return -1;
    }
    if (run_program((int)fread(program, 1, sizeof(program), ifs)) == 0) {
        print_memory_contents();
        return(0);
    }
    else return(-1);
}
31
Testing and Debugging
• ISS
- Gives us control over time: set breakpoints, look at register values, set values, step-by-step execution, ...
- But, doesn't interact with real environment
• Download to board
- Use device programmer
- Runs in real environment, but not controllable
• Compromise: emulator
- Runs in real environment, at speed or near
- Supports some controllability from the PC
[Figure: (a) implementation phase and verification phase both on the development processor, using a debugger/ISS; (b) implementation phase on the development processor, verification phase using external tools: emulator and programmer]
32
Application-Specific Instruction-Set
Processors (ASIPs)
• General-purpose processors
- Sometimes too general to be effective in demanding application
- e.g., video processing – requires huge video buffers and operations
on large arrays of data, inefficient on a GPP
- But single-purpose processor has high NRE, not programmable
• ASIPs – targeted to a particular domain
- Contain architectural features specific to that domain
- e.g., embedded control, digital signal processing, video processing,
network processing, telecommunications, etc.
- Still programmable
33
A Common ASIP: Microcontroller
• For embedded control applications
- Reading sensors, setting actuators
- Mostly dealing with events (bits): data is present, but not in huge amounts
- e.g., VCR, disk drive, digital camera (assuming SPP for image compression), washing machine, microwave oven
• Microcontroller features
- On-chip peripherals
- Timers, analog-digital converters, serial communication, etc.
- Tightly integrated for programmer, typically part of register space
- On-chip program and data memory
- Direct programmer access to many of the chip's pins
- Specialized instructions for bit-manipulation and other low-level operations
34
Another Common ASIP: Digital Signal
Processors (DSP)
• For signal processing applications
- Large amounts of digitized data, often streaming
- Data transformations must be applied fast
- e.g., cell-phone voice filter, digital TV, music synthesizer
• DSP features
- Several instruction execution units
- Multiply-accumulate single-cycle instruction, other instrs.
- Efficient vector operations – e.g., add two arrays
- Vector ALUs, loop buffers, etc.
35
Trend: Even More Customized ASIPs
• In the past, microprocessors were acquired as chips
• Today, we increasingly acquire a processor as Intellectual Property (IP)
- e.g., synthesizable VHDL model
• Opportunity to add custom datapath hardware and a few custom instructions, or delete a few instructions
- Can have significant performance, power and size impacts
• Problem: need compiler/debugger for customized ASIP
- Remember, most development uses structured languages
- One solution: automatic compiler/debugger generation
- e.g., www.tensillica.com
- Another solution: retargetable compilers
- e.g., www.improvsys.com (customized VLIW architectures)
36
Selecting a Microprocessor
• Issues
- Technical: speed, power, size, cost
- Other: development environment, prior expertise, licensing, etc.
• Speed: how to evaluate a processor's speed?
- Clock speed, but instructions per cycle may differ
- Instructions per second, but work per instr. may differ
- Dhrystone: Synthetic benchmark, developed in 1984. Dhrystones/sec.
- MIPS: 1 MIPS = 1757 Dhrystones per second (based on Digital's VAX 11/780). A.k.a. Dhrystone MIPS. Commonly used today.
- So, 750 MIPS = 750*1757 = 1,317,750 Dhrystones per second
- SPEC: set of more realistic benchmarks, but oriented to desktops
- EEMBC: EDN Embedded Benchmark Consortium, www.eembc.org
- Suites of benchmarks: automotive, consumer electronics, networking, office automation, telecommunications
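The Dhrystone MIPS conversion above is simple enough to capture in two helper functions. The names are illustrative; the only fact used is the slide's figure of 1757 Dhrystones per second for a 1 MIPS VAX 11/780.

```c
#include <assert.h>

/* Dhrystone loops per second executed by the reference 1 MIPS VAX 11/780. */
#define VAX_DHRYSTONES_PER_SEC 1757L

/* Convert a Dhrystone MIPS rating to Dhrystones per second. */
long mips_to_dhrystones_per_sec(long mips) {
    assert(mips >= 0);
    return mips * VAX_DHRYSTONES_PER_SEC;
}

/* Convert a measured Dhrystones-per-second score to Dhrystone MIPS. */
double dhrystones_per_sec_to_mips(double dhrystones_per_sec) {
    return dhrystones_per_sec / (double)VAX_DHRYSTONES_PER_SEC;
}
```

`mips_to_dhrystones_per_sec(750)` reproduces the slide's 1,317,750 Dhrystones per second.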
37
General Purpose Processors
Processor | Clock speed | Periph. | Bus Width | MIPS | Power | Trans. | Price
General Purpose Processors
Intel PIII | 1GHz | 2x16 K L1, 256K L2, MMX | 32 | ~900 | 97W | ~7M | $900
IBM PowerPC 750X | 550 MHz | 2x32 K L1, 256K L2 | 32/64 | ~1300 | 5W | ~7M | $900
MIPS R5000 | 250 MHz | 2x32 K 2 way set assoc. | 32/64 | NA | NA | 3.6M | NA
StrongARM SA-110 | 233 MHz | None | 32 | 268 | 1W | 2.1M | NA
Microcontroller
Intel 8051 | 12 MHz | 4K ROM, 128 RAM, 32 I/O, Timer, UART | 8 | ~1 | ~0.2W | ~10K | $7
Motorola 68HC811 | 3 MHz | 4K ROM, 192 RAM, 32 I/O, Timer, WDT, SPI | 8 | ~.5 | ~0.1W | ~10K | $5
Digital Signal Processors
TI C5416 | 160 MHz | 128K, SRAM, 3 T1 Ports, DMA, 13 ADC, 9 DAC | 16/32 | ~600 | NA | NA | $34
Lucent DSP32C | 80 MHz | 16K Inst., 2K Data, Serial Ports, DMA | 32 | 40 | NA | NA | $75
Sources: Intel, Motorola, MIPS, ARM, TI, and IBM Website/Datasheet; Embedded Systems Programming, Nov. 1998
38
Designing a General Purpose
Processor
• Not something an embedded system designer normally would do
- But instructive to see how simply we can build one top down
- Remember that real processors aren't usually built this way
- Much more optimized, much more bottom-up design

FSMD
Declarations:
bit PC[16], IR[16];
bit M[64k][16], RF[16][16];
Aliases:
op IR[15..12]
rn IR[11..8]
rm IR[7..4]
dir IR[7..0]
imm IR[7..0]
rel IR[7..0]
States:
Reset: PC=0; (to Fetch)
Fetch: IR=M[PC]; PC=PC+1 (to Decode)
Decode: branch on op to one of the states below
Mov1 (op = 0000): RF[rn] = M[dir] (to Fetch)
Mov2 (op = 0001): M[dir] = RF[rn] (to Fetch)
Mov3 (op = 0010): M[rn] = RF[rm] (to Fetch)
Mov4 (op = 0011): RF[rn] = imm (to Fetch)
Add (op = 0100): RF[rn] = RF[rn]+RF[rm] (to Fetch)
Sub (op = 0101): RF[rn] = RF[rn]-RF[rm] (to Fetch)
Jz (op = 0110): PC = (RF[rn]=0) ? rel : PC (to Fetch)
39
Benchmarks-The Myth of MIPS
• We think that if my processor (AMD) benchmarks at 1.5 MIPS, it has better performance than your processor (Motorola) which benchmarks at 0.8 MIPS
- Millions of Instructions per Second (MIPS) also means: Meaningless Indicator of Performance for Salesmen
• MIPS is actually VAX 11/780 MIPS
- The first machine to run 1 MIPS
• A VAX 11/780 could execute 1757 loops through the Dhrystone benchmark in 1 second
- A simple C program which compiles to about 2000 lines of assembly code
- Independent of O/S services
• If your processor executes 1757 Dhrystone loops per second, it is a 1 MIPS machine
• Enter compiler optimizations...
• Bright idea!
- Let's optimize for benchmark performance and screw the real world!
40
According to Dr. Mann
There is a big demand from engineers to “just see what the performance
numbers are”. Consequently, Dhrystone is a popular benchmark. However,
synthetic benchmarks, such as Dhrystone, can easily become misleading. It
is difficult to project real system performance from unrealistic synthetic
benchmarks….
Unfortunately, all too frequently benchmark programs used for processor
evaluation are relatively small and can have high instruction cache hit ratios.
Programs such as Dhrystone have this characteristic. They also do not
exhibit the large data movement activities typical of many real applications.
Daniel Mann
AMD Fellow
41
Benchmarking proves my point*
• The EEMBC benchmark for a C-based autocorrelation algorithm was tested on a Texas Instruments TMS320C64X Digital Signal Processor (DSP) running at 720 MHz clock speed.
• Column A: The C program, without any compiler optimization, scored 19.5
• Column B: With aggressive optimizations for this architecture, the score was 379.1
- Represents almost a 20X performance improvement.
- Comparable to speeding up the clock to 14.4 GHz without optimizations
• Column C: With all-out performance optimizations and hand-crafting the code in assembly language, the benchmark score was 628
- Roughly a 1.7X performance boost (628 / 379.1) over the best compiler optimizations
[Figure: bar chart of the benchmark scores for columns A, B and C]
* Code Efficiency and Compiler Directed Feedback, Jackie Brenner and Markus Levy, Dr. Dobb's Journal, #355, December 2003, Pg. 59
42
Benchmarking
• Real benchmarking involves careful balancing of system requirements
and variables
• How a processor will run in your application may be very different
from performance in a different application
• Considerations:
- Overall HW design: Memory bandwidth, caches, ASICs
- Software design: Assembly, C, C++, Java, libraries
- Compiler: Brand, optimizations
- RTOS: Optimized for certain architectures
• There is no easy answer to predicting performance
• Companies have nearly gone under due to benchmarking errors
43
EDN Embedded Microprocessor
Benchmark Consortium (EEMBC)
Markus Levy, Director EEMBC
Technical Editor, EDN Magazine
www.eembc.org
44
EEMBC Benchmarking Examples
• Consumer products suite
- Compress JPEG
- Decompress JPEG
- High-pass grayscale filter
• Office Automation Suite
- Bezier-curve calculation (image rotation)
- Dithering (text processing)
• Automotive/industrial
- Angle to time conversion (rotational control)
- Basic floating point
- Bit manipulation
- Infinite Impulse Response (IIR) filter (tooth to spark calculation)
45
Value of EEMBC benchmarks
• Provide a consistent set of algorithms that can be used for relative
performance measurements
• Values are variable subject to:
- Compiler used
- Compiler optimization level used
- Cache utilization
- Evaluation board performance ("hot boards")
• At the very least system designers will have a significantly more
relevant code suite for predicting processor performance in a given
application
• Other factors may negate results:
- RTOS issues
- Processor task utilization
46
How would you benchmark your code?
• Evaluation boards (single-board computers) are provided by most embedded processor manufacturers for performance evaluation
- Priced just high enough to eliminate most hobbyists
- May also showcase tool support partners.
• Execute code on "representative" platform
- Measure execution time for critical modules
- What if platform is not "representative enough"?
[Figure: AMD 186em Evaluation Board, with a 3.5" dimension marked]
47
Benchmarking techniques
• Run standard benchmarking programs:
- Measure time it takes to execute a fixed number of iterations
- Measure number of iterations obtained for a fixed time interval
- Compare results for different:
- Processors and evaluation boards
- Compilers/Libraries
• Run representative benchmarks
- Typical of the type of code that is commonly used in your application
- EEMBC
• Perform benchmarks taken from your code base
- Define (create) comparative metrics to perform evaluations
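The first technique, timing a fixed number of iterations, might look like the sketch below on a hosted system. It uses the standard C `clock()` function, which measures CPU time at a fairly coarse resolution; `sample_work` is a hypothetical stand-in for a critical module from your own code base, and an embedded target would more likely read a hardware timer.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Convert a pair of clock() readings to seconds. */
double elapsed_seconds(clock_t start, clock_t end) {
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Time a fixed number of calls to work() and return the total CPU seconds used. */
double time_fixed_iterations(void (*work)(void), long iterations) {
    assert(work != NULL && iterations > 0);
    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        work();
    return elapsed_seconds(start, clock());
}

/* Hypothetical workload; volatile keeps the compiler from optimizing the loop away. */
static volatile unsigned long sink;
void sample_work(void) {
    for (unsigned i = 0; i < 1000; i++)
        sink += i;
}
```

Dividing the iteration count by the returned time gives iterations per second, which can then be compared across processors, boards, compilers and libraries. The iteration count should be large enough that the elapsed time is well above the timer's resolution.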
48
Timing code execution
• Hardware methods
- Use hardware timing tools to measure execution times for critical modules
- Example 1: Add small code segment to pulse I/O pin on processor at function entry and exit points
- Use an oscilloscope to watch the I/O pin pulse
- Code segment should be a small perturbation on the overall timing measurement
- Example 2: Use a logic analyzer
- Code segment triggers analyzer
- Data writes to fixed memory
- See next slide
49
Using a logic analyzer
• Logic analyzer monitors the
state of all processor bus
signals
• Captures all data writes to a
specific address
• Bus activity is timed so
function entry and exit
intervals can be accurately
timed
50
Software methods
• Operating systems provide time stamping functions as part of libraries
• Replace write to memory location with O/S function call
• Significantly more overhead than hardware method
- Easier to implement
- Possibly a significant perturbation to function measurement accuracy
• If no O/S is present, then can substitute a C library routine or write own function
- Usually an in-line assembly language function
• May or may not have sufficient memory space to record statistics of
benchmarks
51
Benchmarking errors
• Function timing errors
- A function may not take the same amount of time to execute each time it is called
- Recursive functions
- Data processing times may vary from run to run
- Different interrupt frequencies and overhead
- O/S overheads
• Generally must gather statistics on function execution times
- Record sufficient data to calculate min, max and average execution times
• O/S issues
- Other O/S tasks may influence apparent execution times
• Other influences
- Caches on/off, Compiler optimizations, memory system parameters
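Gathering min, max and average execution times needs only a small running record, which matters when memory for statistics is limited. The `TimingStats` type, its functions and the sample values below are hypothetical, offered as one possible shape for such a record.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double min, max, sum;
    size_t count;
} TimingStats;

/* Fold one measured execution time into the running statistics. */
void stats_record(TimingStats *s, double t) {
    if (s->count == 0 || t < s->min) s->min = t;
    if (s->count == 0 || t > s->max) s->max = t;
    s->sum += t;
    s->count++;
}

/* Average of all recorded times. */
double stats_average(TimingStats s) {
    assert(s.count > 0);
    return s.sum / (double)s.count;
}

/* Records three hypothetical measurements (in milliseconds) of one function. */
TimingStats demo_stats(void) {
    TimingStats s = { 0.0, 0.0, 0.0, 0 };
    stats_record(&s, 4.0);
    stats_record(&s, 2.0);
    stats_record(&s, 6.0);
    return s;
}
```

A wide gap between min and max is itself a finding: it suggests interrupts, O/S activity, cache state or data-dependent paths are affecting the function being measured.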
52
Benchmarking errors
• Unfair comparisons
- Evaluation board #1 has high-speed memory, evaluation board #2 has low-speed memory
- Different clock speeds
- Different performance parameters for other on-chip or on-board peripheral devices
- Different compilers used to test different hardware platforms
• Bottom line:
- Do not accept any benchmarking results unless you can independently verify the appropriateness of the comparison or the measurement
• Example:
- Code execution on evaluation platform has sufficient headroom for anticipated application
- HW designer uses slower system clock in order to save money and reduce the need for extra RFI shielding
53
Benchmarking matrix
54
Benchmarking results-automotive(1)
55
Benchmarking results-consumer
56
Benchmarking results-consumer(2)
57
Benchmarking results-automotive(2)
58
Benchmarks-Telecom
59
How do we choose the right uP or uC?
[Figure: factors in the choice: Cost of Goods, Real-time Constraints, Legacy Code, Power Budget, Performance, Time to Market, Tool Support, Landmines]
60
Selecting an embedded microprocessor
• Issue 1: Performance requirements
• Width of data path
- Performance ~ (Width of Data Path)^2
- The most general categorization of processor performance
- Typical data bus widths: 4, 8, 16, 32, 64, 128 bits wide
- Wider data busses -> greater data processing capability
- Data bus width trade-off, the wider data path:
- Is more complex to design
- Takes up more room on PC boards
- Generates greater amounts of RF energy
- Requires more costly memory designs
- Is not compatible with existing hardware
CSS427- Introduction to Embedded Systems
61
More on data path width
• Data path width generally determines functionality
- 4,8 bits - Appliances, modems, simple applications
- 16 bits - Industrial controllers, automotive
- 32 bits - Telecomm, laser printers, high-performance apps
- 64 bits - PC's, UNIX workstations, games
- 128, 256 bits (VLIW) - Next generation
• Internal and external data paths may differ in size
- Narrower memory is more economical
- MC68000: 32-bit internal/16-bit external
- MC68008: 32-bit internal/8-bit external
- 80C188: 16-bit internal/8-bit external
• Remember: An 8-bit processor can do most everything a 64-bit processor can do, it will just take longer to accomplish
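The remark that a narrow processor can do the same work, only more slowly, can be illustrated with multi-byte addition. The sketch below adds two 32-bit values using only 8-bit additions and a carry, much as an 8-bit ALU with an add-with-carry instruction would; the function name is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Add two 32-bit values one byte at a time, propagating the carry between bytes. */
uint32_t add32_with_8bit_alu(uint32_t a, uint32_t b) {
    uint32_t result = 0;
    unsigned carry = 0;
    for (int byte = 0; byte < 4; byte++) {
        unsigned x = (a >> (8 * byte)) & 0xFFu;
        unsigned y = (b >> (8 * byte)) & 0xFFu;
        unsigned sum = x + y + carry;  /* at most 0x1FF */
        carry = sum >> 8;              /* carry into the next byte */
        result |= (uint32_t)(sum & 0xFFu) << (8 * byte);
    }
    return result;  /* final carry is dropped, matching 32-bit wraparound */
}
```

What a 32-bit datapath does in one addition takes four additions here, plus the shifting and masking a real 8-bit processor would handle with separate loads and stores.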
62
Selecting an embedded microprocessor-2
• Clock speed: RAW MIPS
- Brute force method of improving performance
- Bottleneck could be in software design or compiler itself!
- Faster isn’t always better
- Performance ~ Clock speed (unless throughput bottleneck is
somewhere else )
- Trade-off:
- As clock speed increases, RF interference energy increases
- Compliance Engineering (CE issues)
- Electrical shielding costs can be significant
- Memory costs increase
- Other peripheral devices will cost more
- We’ll be revisiting this issue many times
63
Selecting an embedded microprocessor- 3
• Processor architecture issues
- On-chip instruction cache, how big?
- On-chip data cache, how big?
- Pipelines
- Superscalar
- Trade-off -> high performance costs money and power
- Address bus design
- Address bus width: 16 - 36 bits
- Multiplexed, synchronous, asynchronous
- Processor type: CISC, RISC, DSP
- What is the nature of the algorithm to implement?
- Control rich: CISC
- Data rich: RISC
- Data transforms and mathematical processing: DSP
64
More on address bus width
• The amount of externally accessible memory is defined as the
Address Space of the processor
• Can vary from 1KB for simple microcontrollers to over 60 GB in high
performance processors
• Size of the address space doesn’t mean that you have that much
memory, it only means that the capabilities exist to directly access it
• Processors with smaller address spaces can still manipulate larger
memory arrays with techniques such as Paging
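The size of the address space follows directly from the address bus width, and paging extends it by supplying extra high-order bits from a separate register. In this sketch `address_space` is the standard power-of-two calculation; `paged_address` shows one hypothetical scheme (an 8-bit page register in front of a 16-bit address) and does not describe any particular processor.

```c
#include <assert.h>
#include <stdint.h>

/* Number of directly addressable locations for a given address bus width. */
uint64_t address_space(unsigned address_bits) {
    assert(address_bits < 64);
    return (uint64_t)1 << address_bits;
}

/* Hypothetical paging scheme: an 8-bit page register selects one of 256 pages of 64K. */
uint32_t paged_address(uint8_t page, uint16_t offset) {
    return ((uint32_t)page << 16) | offset;
}
```

A 16-bit address bus reaches 65,536 locations and a 36-bit bus reaches 68,719,476,736 (64 GB with byte addressing), consistent with the "over 60 GB" figure above. With the paging sketch, a processor with 16-bit addresses could reach 16 MB, one 64K page at a time.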
65
Processor architecture: CISC, RISC and DSP
• CISC - Complex Instruction Set Computer
- Characterized by many instructions which can perform involved
operations
- CISC code is compact
- Can be many clock cycles per instruction
- Large silicon area -> Higher cost per die
- Von Neumann Architecture
- Same memory space services instruction and data
• RISC - Reduced Instruction Set Computer
- More modern architecture
- One instruction executed per clock cycle -> Very fast
- Harvard Architecture
- Separate memory spaces for instructions and data
66
Processor architecture: CISC, RISC and DSP
• DSP - Digital Signal Processor
- Specialized type of uP
- Designed for real time mathematical manipulation of data streams
- Radar image processing, audio/voice processing, ultrasound and
photographic image processing
- Includes instructions designed for multiplication and accumulation
67
Characteristics of a RISC Processor
• A RISC Architecture has many of the following characteristics:
- Instructions are conceptually simple
- Memory/Register transfers are exclusively LOADS and STORES
- LOAD = memory to register
- STORE = register to memory
- All arithmetic instructions are between registers
- Instructions are uniform in length
- Instructions use one ( or very few ) instruction formats
- Instruction sets are orthogonal
- Little or no overlapping functionality of instructions
- One ( or very few ) addressing mode
- Almost all instructions execute in 1 clock cycle
- Optimized for speed
68
Characteristics of a RISC Processor (2)
• RISC processors tend to have many registers
- (29K has 256 registers)
- Extremely useful for compiler optimization not to have to store
intermediate results back to external memory
• RISC architectures now dominate most high performance processors
- Multiple instruction pipes mean that more than one instruction can be executed per clock cycle
• Examples of modern RISC processors:
- Motorola, IBM: PowerPC 8XX, 7XX, 6XX, 4XX
- Sun: SPARC
- MIPS: RXXXX
- ARM: ARM7,9
- HP: PA-RISC
- Hitachi: SHX
- AMD: 29K
69
Comparing CISC and RISC
• Isn’t CISC dead yet?
- 90% of the embedded processors still sold today (2003) are CISC
- Until 1999 the 680X0 family was the most popular 32-bit architecture
- Was surpassed by the ARM licensees
• Today, ARM-based processors outsell Pentiums 3:1
• RISC processors can go much faster because they have fewer
instructions and addressing modes
- Corollary: You do a lot less per instruction, but you can do it faster
• Faster doesn’t always mean better
- RFI problems go up, memory costs go up, power consumption goes up
- System costs will generally scale with speed
70
Comparing CISC and RISC
• RISC-based code images are usually twice the size of comparable
CISC algorithms
• Example: Add a constant number to a memory variable located at
memory location 0x00004000
- CISC:
    MOVE  #constant,D0
    ADD   D0,$00004000
- RISC:
    LD    #constant,R0
    LD    #$00004000,R2
    LD    (R2),R1
    ADD   R0,R1
    ST    R1,(R2)
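The RISC sequence above can be traced with a toy load/store machine. This is only a sketch: the tuple encoding, the `LDI` name for the load-immediate form, and the sample values (constant 5, initial memory value 10) are assumptions made for illustration, not part of any real instruction set.

```python
def run(program, memory):
    """Execute (op, src, dst) tuples on a toy load/store machine.

    Returns the number of instructions executed.
    """
    regs = {}
    for op, src, dst in program:
        if op == "LDI":      # load immediate: constant -> register
            regs[dst] = src
        elif op == "LD":     # load: memory[address held in src] -> register
            regs[dst] = memory[regs[src]]
        elif op == "ADD":    # register-to-register add, result in dst
            regs[dst] = regs[dst] + regs[src]
        elif op == "ST":     # store: register -> memory[address held in dst]
            memory[regs[dst]] = regs[src]
        else:
            raise ValueError(f"unknown op {op!r}")
    return len(program)


CONSTANT = 5          # hypothetical value of #constant
ADDRESS = 0x4000      # memory location from the slide
memory = {ADDRESS: 10}  # hypothetical initial contents

risc_program = [
    ("LDI", CONSTANT, "R0"),  # LD  #constant,R0
    ("LDI", ADDRESS, "R2"),   # LD  #$00004000,R2
    ("LD", "R2", "R1"),       # LD  (R2),R1
    ("ADD", "R0", "R1"),      # ADD R0,R1
    ("ST", "R1", "R2"),       # ST  R1,(R2)
]

instruction_count = run(risc_program, memory)
print(memory[ADDRESS], instruction_count)
```

Memory is touched only by the `LD` and `ST` steps; the add itself works purely on registers, which is why the RISC version needs five instructions where the CISC version needs two.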
71
Comparing CISC and RISC
• Lower code density (larger code images) means increased bus activity
- Bottlenecks
- Pipeline stalls
- Adding caches increases processor complexity and introduces potential
side-effects due to subtle compiler optimizations
• RISC chips may consume more power than CISC chips
- In modern processors, power consumption is proportional to speed
- More bus activity requires more energy
• Question: Is RISC or CISC better in embedded applications?
- Answer: Who knows? System design considerations and constraints will
dictate the proper choice for the application
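The claim above that power consumption is proportional to speed can be made concrete with the standard first-order model of CMOS dynamic power. The slides do not give a formula, so the model and its symbols are assumptions: $\alpha$ is the switching activity factor, $C$ the switched capacitance, $V_{dd}$ the supply voltage and $f$ the clock frequency.

```latex
P_{\text{dyn}} \approx \alpha \, C \, V_{dd}^{2} \, f
```

At a fixed supply voltage, doubling $f$ roughly doubles dynamic power, and extra bus activity raises $\alpha$ and $C$, which is consistent with both sub-points on the slide.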
72
Digital Signal Processors
• Basically, DSPs do math instead of control or data manipulation
• Traditionally, DSPs were classic CISC processors with several
architectural enhancements to speed up the execution of special
categories of mathematical operations
• Typical additions were barrel shifters and multiply/accumulate (MAC)
instructions
- Example:
- Execute an inner loop
- Fetch an X constant and a Y variable
- Multiply them together and accumulate (SUM) the result
- Check if loop is finished
- This is numerical integration
- A DSP accomplishes in 1 instruction what takes a CISC processor 8 or
more instructions
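The inner loop described above can be sketched in software as follows. The function name and sample data are assumptions for illustration; on a DSP, the fetch, multiply, accumulate and loop test in the body would collapse into a single MAC instruction.

```python
def mac_loop(constants, variables):
    """Multiply each X constant by its Y variable and accumulate the sum."""
    acc = 0
    for x, y in zip(constants, variables):  # fetch X and Y, check for end of loop
        acc += x * y                        # multiply and accumulate
    return acc


result = mac_loop([1, 2, 3], [4, 5, 6])
print(result)
```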
73
Numerical integration
• Algorithms requiring complex mathematical solutions in real time are
becoming more and more important in embedded systems programming
- Image processing - MPEG, JPEG
- Communications - Echo cancellation, data encryption
- Automotive control
- Fly by wire
• Calculating an integral requires rapid multiplication and summation

$$ Y = \int_a^b F(x)\,dx $$

$$ Y = \sum_{x=a}^{b} \tfrac{1}{2}\bigl(F(x) + F(x + \Delta x)\bigr)\,\Delta x $$

[Figure: plot of F(x) between a and b, divided into strips of width Δx; annotations: Multiply, Accumulate]
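The trapezoid sum above is a multiply-accumulate loop, and can be sketched as below. The function name, the strip count `n`, and the test function x² are assumptions for illustration; the loop sums `n` strips of width Δx, starting at a and ending at b.

```python
def trapezoid(f, a, b, n):
    """Approximate the integral of f from a to b using n trapezoidal strips."""
    dx = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + i * dx
        # Multiply: area of one strip. Accumulate: add it to the running sum.
        total += 0.5 * (f(x) + f(x + dx)) * dx
    return total


area = trapezoid(lambda x: x * x, 0.0, 1.0, 1000)
print(area)  # close to the exact value 1/3
```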
74
Domain of DSPs
• DSPs process data in real time
• Analog information is converted to a digital representation,
processed, then reconverted to analog
75
Domain of DSPs - 2
76
Domain of DSPs - 3
• DSPs will typically be used in multiprocessor arrays to improve
throughput
• Designed with high-speed, inter-processor communications
channels
77
TMS320C6201 32-bit Floating Point VLIW DSP
78
Features of TMS320C6201
• 8-Way Very Long Instruction Word (VLIW - 256 bits)
• RISC-like Load/Store architecture
• 8 functional execution units (2 multipliers, 6 ALUs)
• Dual pipeline
• On-chip 512K bits instruction memory and 512K bits data memory
• External 24-bit address, 32-bit data buses
• 167 MHz clock
• Single and double precision floating point math
• Can execute 2 MACs/clock cycle
• Can issue 8 instructions per cycle
• 1336 peak MIPS
• $196 (25,000)
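The peak figures in this list follow from the clock rate and issue width. The short calculation below is a sketch using only numbers from the slide; the variable names are illustrative, and "peak" assumes every issue slot is filled on every cycle.

```python
CLOCK_MHZ = 167               # clock rate from the slide
INSTRUCTIONS_PER_CYCLE = 8    # 8-way VLIW issue
VLIW_BITS = 256               # width of one very long instruction word
MACS_PER_CYCLE = 2            # one MAC per multiplier per cycle

# Peak MIPS: every slot of every VLIW packet holds a useful instruction.
peak_mips = CLOCK_MHZ * INSTRUCTIONS_PER_CYCLE

# Each of the 8 slots in a packet is one fixed-width instruction.
bits_per_instruction = VLIW_BITS // INSTRUCTIONS_PER_CYCLE

# Peak multiply-accumulate rate, in millions of MACs per second.
peak_mmacs = CLOCK_MHZ * MACS_PER_CYCLE

print(peak_mips, bits_per_instruction, peak_mmacs)
```

Sustained throughput will be lower whenever the compiler cannot fill all eight slots.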
79
Consumer uses of DSPs
• DSPs have always been prevalent in industrial and military
applications
• The Killer App for DSPs (16-bit fixed point) has been the PC
modem and sound cards
- In 1998, 70 to 80 million PCs were sold
- Every modem and high-end sound card has a DSP
• New Killer App was predicted to be voice encoding/decoding
- Voice Over Internet Protocol (VOIP)
- Speech recognition
• Real Killer App has been the digital camera
- A 5-megapixel image can be captured, stored, processed, compressed and
converted to a JPEG file in under 5 seconds
80
Selecting an embedded microprocessor- 4
• Single or multiple processors
- Combine CISC, RISC and DSP in a single design
- Tight coupling or loose coupling
- Architecture
- Code design, compiler capabilities
- Debug tool availability
- System simulation tools
81
Selecting an embedded microprocessor- 5
• Issue 2: Integration of functions
- Microprocessor or microcontroller?
- Review:
- A microprocessor contains the basic CPU functionality, and little
more
- A microcontroller combines the CPU core with peripheral devices
- The microprocessor is usually the leading edge of performance
- Lowest level of integration
- Highest cost
- Higher levels of integration imply
- Lower system costs
- Greater reliability
- As a microprocessor (uP) matures, its core moves into the microcontroller (uC) families
82
Marketing Hype from Motorola
83
Selecting an embedded microprocessor-6
- Higher level of integration ( continued )
- Less power
- Faster
- Higher processor costs
• Issue 3: Use a microcontroller
- Peripheral choices (timers, ports, serial comm., A/D, etc.)
- On-chip RAM, ROM, Flash
- Power requirements
- Sleep modes
- Commercially available or build to order (Motorola 683XX)
- See Diagram, next slide
84
Build a microcontroller to order
[Diagram: a CPU core surrounded by selectable on-chip blocks: Coprocessor, DMA, FLASH, Real-time Clock, RAM, Cache, ROM, A/D Converter, Serial Ports, Timers, Ethernet, Watchdog, Parallel Ports, LCD Controller, PCI Bus Bridge. Label: $100K NRE]
85
Selecting an embedded microprocessor-7
• Microcontroller vs System-on-Silicon
- Application Specific Integrated Circuit (ASIC)
- Processor is soft: it exists as an encoded HDL description
- Licensed foundries can fabricate the CPU core design into actual silicon
- Intellectual Property from ARM, MIPS, Motorola Mcore, ARC
- Companies that design chips but do not fabricate them are called fabless vendors
- Customizable CPUs
- Multiple CPU cores
- Mix RISC and DSPs
- Designs are out with 64 (and more) 32-bit RISC and DSP cores
86
Selecting an embedded microprocessor- 8
• Issue 4: New design or use a Commercial Off-The-Shelf (COTS) design
- Industry-standard buses: PCI, STD bus, VXI bus, PC-104
- Whole industries have standardized on commercially available target
boards
- Mil/Aerospace, Telecom: VXI bus
- Industrial automation: STD bus
- High energy physics: NIM modules
• Issues 5 and up: Software considerations
- Legacy code base for existing architecture
- C code may or may not be portable
- Assembly code is definitely not portable
- Instruction set architecture issues
- Certain ISAs may be better for certain problems
- Engineers may be more familiar with certain instruction sets
87
Selecting an embedded microprocessor-9
• Not so obvious considerations:
- Compatibility with existing tools
- Processor vendor’s roadmap for the future and long-term support
- Design assistance, availability of IP
- Legacy code
- “C is portable”, assembly code is not
- Pricing and availability
- Availability of third-party tools (emulators, debuggers)
- Power consumption (power budget)
• And… There are lots to choose from!
- Currently there are over 100 32-bit embedded processors
- About 1000 different total devices
88