Processor Design

Lesson 5: Processor Design
Topic 1 – Methods and Concepts
EE37E
2005
1
Introduction
References:
- Modern Processor Design book (pp. 1–16)
- Computer Organization and Design book (pp. 54–89)
• While introducing this topic we will focus on these points:
– Evolution of microprocessors
– Instruction set processor design
– Principles
• Microprocessors are Instruction set processors (ISPs).
• An ISP executes instructions from a predefined
instruction set.
• A microprocessor’s functionality is fully characterized by
the instruction set it is capable of executing.
• This predefined instruction set is also called the
instruction set architecture.
• An ISA serves as an interface between software
and hardware.
• In terms of processor design methodology, an
ISA is the specification of the design while the
microprocessor or ISP is the implementation of a
design.
Computer System Components
• CPU:
– 1000 MHz - 3 GHz (a multiple of system bus speed)
– Pipelined (7-21 stages)
– Superscalar (max ~4 instructions/cycle), single-threaded
– Dynamically scheduled or VLIW
– Dynamic and static branch prediction
• Caches: L1, L2, L3
• System Bus:
– Examples: Alpha, AMD K7: EV6, 400 MHz
– Intel PII, PIII: GTL+, 133 MHz
– Intel P4: 800 MHz
– Support for one or more CPUs
• Memory (reached through the Memory Controller and Memory Bus):
– SDRAM PC100/PC133: 100-133 MHz, 64-128 bits wide, 2-way interleaved, ~900 MBytes/sec
– Double Data Rate (DDR) SDRAM PC3200: 400 MHz (effective 200x2), 64-128 bits wide, 4-way interleaved, ~3.2 GBytes/sec (second half 2002)
– RAMbus DRAM (RDRAM) PC800, PC1060: 400-533 MHz (DDR), 16-32 bits wide channel, ~1.6 - 3.2 GBytes/sec (per channel)
• I/O Buses (with controllers, adapters, and NICs):
– Example: PCI-X, 133 MHz
– PCI: 33-66 MHz, 32-64 bits wide, 133-1024 MBytes/sec
• I/O Devices: Disks, Displays, Keyboards
• Networks: Fast Ethernet, Gigabit Ethernet, ATM, Token Ring ..
• Chipset: North Bridge, South Bridge
Computer System Components
Enhanced CPU Performance & Capabilities:
• Support for Simultaneous Multithreading (SMT): Alpha EV8.
• VLIW & intelligent compiler techniques: Intel/HP EPIC IA-64.
• More advanced branch prediction techniques.
• Chip Multiprocessors (CMPs): The Hydra Project, IBM Power4 and Power5.
• Vector processing capability: Vector Intelligent RAM (VIRAM), or multimedia ISA extension.
• Digital Signal Processing (DSP) capability in system.
• Re-configurable computing hardware capability in system.
Memory Latency Reduction:
• Conventional & block-based trace cache.
Integrate memory controller & a portion of main memory with CPU: Intelligent RAM
Integrated memory controller: AMD Opteron, IBM Power5
[Diagram: CPU (SMT, CMP) with L1, L2, and L3 caches; System Bus; Chipset (North Bridge, South Bridge); Memory Controller and Memory Bus to Memory; I/O Buses with controllers, adapters, and NICs; I/O Devices: Disks (RAID), Displays, Keyboards, Networks]
Recent Trends in Computer Design
• The cost/performance ratio of computing systems has seen a steady decline due to advances in:
– Integrated circuit technology: decreasing feature size
• Clock rate improves roughly in proportion to the improvement in feature size.
• Number of transistors improves in proportion to the square of that improvement (or faster).
– Architectural improvements in CPU design.
• Microprocessor systems directly reflect IC improvement in terms of a yearly 35 to 55% improvement in performance.
• Assembly language has been mostly eliminated and replaced by other alternatives such as C or C++.
• Standard operating systems (UNIX, NT) lowered the cost of introducing new architectures.
• Emergence of RISC architectures and RISC-core architectures.
• Adoption of quantitative approaches to computer design based on empirical performance observations.
Microprocessor Architecture Trends
• CISC Machines
– instructions take variable times to complete
• RISC Machines (microcode)
– simple instructions, optimized for speed
• RISC Machines (pipelined)
– same individual instruction latency
– greater throughput through instruction "overlap"
• Superscalar Processors
– multiple instructions executing simultaneously
• Multithreaded Processors
– additional HW resources (regs, PC, SP)
– each context gets processor for x cycles
• VLIW
– "Superinstructions" grouped together
– decreased HW control complexity
• Single Chip Multiprocessors (CMPs)
– duplicate entire processors (tech soon due to Moore's Law)
• Simultaneous Multithreading (SMT)
– multiple HW contexts (regs, PC, SP)
– each cycle, any context may execute
• SMT/CMPs (e.g. IBM Power5 in 2004)
Evolution of microprocessors
[Figure 1: Evolution of microprocessors. Log-scale plot of transistors (1,000 to 100,000,000) against year (1970 to 2000), tracking Moore's Law. Plotted processors: i4004, i8080, i8086, i80286, i80386, i80486, Pentium. The "Graduation Window" marks Alpha 21264 (15 million), Alpha 21164 (9.3 million), PowerPC 620 (6.9 million), Pentium Pro (5.5 million), and Sparc Ultra (5.2 million). CMOS improvements: die size doubles every 3 years; line width halves every 4-7 years.]
• Three decades of the history of microprocessors tell a truly remarkable story of advances in the computer industry (Table 1).

                     1970–1980     1980–1990     1990–2000        2000–2010
Transistor count     2K – 100K     100K – 1M     1M – 100M        100M – 2B
Clock frequency      0.1 – 3 MHz   3 – 30 MHz    30 MHz – 1 GHz   1 – 15 GHz
Instructions/cycle   0.1           0.1 – 0.9     0.9 – 1.9        1.9 – 2.9

Table 1. The amazing decades of the evolution of microprocessors
Hierarchy of Computer Architecture
[Diagram: levels of abstraction, from software at the top to hardware at the bottom]
Software: Application, Operating System, Compiler
Program forms: High-Level Language Programs, Assembly Language Programs, Machine Language Program
Software/Hardware Boundary: Instruction Set Architecture
Firmware: Microprogram
Hardware: Instr. Set Proc., I/O system; Datapath & Control; Digital Design; Circuit Design; Layout
Representations: Register Transfer Notation (RTN), Logic Diagrams, Circuit Diagrams
Instruction Set Processor Design
• Critical to an ISP is the instruction set architecture, which specifies the functionality that must be implemented by the instruction set processor (ISP).
The Design Process
• "To Design Is To Represent"
– Design activity yields description/representation of an
object
• Traditional craftsman does not distinguish between the
conceptualization and the artifact
• Separation comes about because of complexity
• Concept is captured in one or more representation
languages
– This process IS design
• Design Begins With Requirements
– Functional Capabilities: what it will do
– Performance Characteristics: Speed, Power, Area,
Cost, . . .
Design Process (cont.)
• Design Finishes As Assembly
– Design understood in terms of components and how they have been assembled
– Top-down decomposition of complex functions (behaviors) into more primitive functions
– Bottom-up composition of primitive building blocks into more complex assemblies
• Design is a "creative process," not a simple method
[Diagram: CPU decomposes into Datapath and Control; Datapath into ALU, Regs, and Shifter; and so on down to the Nand Gate]
Design as Search
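[Diagram: Problem A can be attacked with Strategy 1 or Strategy 2; each strategy breaks the problem into subproblems (SubProb 1, SubProb2, SubProb3), which are in turn solved with building blocks BB1, BB2, BB3, ..., BBn]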
Design involves educated guesses and verification
-- Given the goals, how should these be prioritized?
-- Given alternative design pieces, which should be selected?
-- Given design space of components & assemblies, which part will yield
the best solution?
Feasible (good) choices vs. Optimal choices
Instruction Set Architecture
(subset of Computer Architecture)
“... the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.”
– Amdahl, Blaauw, and Brooks, 1964
• Organization of Programmable Storage
• Data Types & Data Structures: Encodings & Representations
• Instruction Set
• Instruction Formats
• Modes of Addressing and Accessing Data Items and Instructions
• Exceptional Conditions
The Instruction Set: a Critical Interface
software
instruction set
hardware
Figure 2: ISA
Dynamic Static Interface
• We have discussed two critical roles played by the ISA:
– Contract between software and hardware, which facilitates the development of programs and machines
– Specification for microprocessor design
• The third role is an associated definition of an interface that separates what is done statically at compile time from what is done dynamically at run time. This interface is called the "Dynamic-Static Interface" (DSI).
[Figure 3: The dynamic-static feature. From top to bottom: Program (Software); Compiler complexity, exposed to software ("Static"); Architecture (DSI); Hardware complexity, hidden in hardware ("Dynamic"); Machine (Hardware)]
Computer Architecture Topics
[Diagram: topic map of computer architecture]
• Input/Output and Storage: Disks, WORM, Tape; RAID
• DRAM: Emerging Technologies, Interleaving, Bus protocols
• Memory Hierarchy: L1 Cache, L2 Cache; Coherence, Bandwidth, Latency
• VLSI
• Instruction Set Architecture: Addressing, Protection, Exception Handling
• Pipelining and Instruction Level Parallelism: Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, DSP
Principles of Processor Performance
Definitions
• Performance is in units of things per sec
– bigger is better
• If we are primarily concerned with response time:
$$\text{performance}(x) = \frac{1}{\text{execution\_time}(x)}$$
• "X is n times faster than Y" means:
$$n = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = \frac{\text{Execution\_time}(Y)}{\text{Execution\_time}(X)}$$
Cycles Per Instruction
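IC = Instruction Count
CPI = Clock cycles Per Instruction
$$\text{CPU time} = \text{Number of clock cycles} \times \text{Clock cycle time}$$
$$\text{CPU time} = \frac{\text{Number of clock cycles}}{\text{Clock frequency}}$$
$$\text{CPI} = \frac{\text{Number of clock cycles}}{IC}$$
$$\text{CPU time} = IC \times \text{CPI} \times \text{Clock cycle time}$$
$$\text{CPU time} = \frac{IC \times \text{CPI}}{\text{Clock rate}}$$
$$\text{CPU time} = \text{Cycle time} \times \sum_{j=1}^{n} \text{CPI}_j \times IC_j$$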
Cycles Per Instruction
We may separate the contribution of each type of instruction to the execution time by defining:
$$\text{Number of clock cycles} = \sum_{j=1}^{n} \text{CPI}_j \times IC_j$$
where $IC_j$ is the number of times that instruction $j$ is executed, and $\text{CPI}_j$ is the average number of clocks required to execute instruction $j$.
Processor pipelining and memory interactions limit the accuracy of this approach, but it's a good first guess. For accuracy, it is necessary to simulate the instructions of an entire program with issue, pipeline and memory interactions.
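The per-class cycle sum can be sketched in a few lines of Python. This is only an illustration: the function names and the instruction mix are hypothetical, and the numbers are made up rather than measured.

```python
def total_cycles(mix):
    """Sum CPI_j * IC_j over all instruction classes.

    mix is a list of (instruction_count, cpi) pairs, one per class.
    """
    return sum(ic * cpi for ic, cpi in mix)


def average_cpi(mix):
    """Overall CPI = total clock cycles / total instruction count."""
    instruction_count = sum(ic for ic, _ in mix)
    return total_cycles(mix) / instruction_count


# Hypothetical instruction mix (illustrative numbers only).
example_mix = [
    (50_000, 1),  # ALU operations, 1 cycle each
    (30_000, 2),  # loads/stores, 2 cycles each
    (20_000, 3),  # branches, 3 cycles each
]

cycles = total_cycles(example_mix)  # 50,000 + 60,000 + 60,000 cycles
cpi = average_cpi(example_mix)      # cycles divided by 100,000 instructions
```

As the slide notes, this is a first guess only; it ignores pipeline and memory interactions.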
Aspects of CPU Performance (CPU Law)
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
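A minimal sketch of the CPU law, combined with the "n times faster" definition from earlier. The two machines compared here are hypothetical, and their instruction counts, CPIs, and clock rates are invented for illustration.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds = instructions * cycles/instruction / (cycles/second)."""
    return instruction_count * cpi / clock_rate_hz


def times_faster(exec_time_x, exec_time_y):
    """Return n such that X is n times faster than Y."""
    return exec_time_y / exec_time_x


# Hypothetical machines running the same 1,000,000-instruction program.
time_a = cpu_time(1_000_000, cpi=2.0, clock_rate_hz=1e9)  # 1 GHz, CPI 2.0
time_b = cpu_time(1_000_000, cpi=1.5, clock_rate_hz=5e8)  # 500 MHz, CPI 1.5

n = times_faster(time_a, time_b)  # how many times faster A is than B
```

Machine B has the lower CPI, but machine A still finishes first because its clock rate is twice as high. No single factor of the CPU law determines performance on its own.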
Amdahl's Law
Speedup due to enhancement E:
$$\text{Speedup}(E) = \frac{\text{ExecTime w/o } E}{\text{ExecTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E}$$
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected.
E.g. special instructions, memory, I/O, parallel processing
Amdahl’s Law
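$$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right]$$
$$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$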
Amdahl’s Law
• Example: Floating point instructions improved to run 2X; but only 10% of actual instructions are FP
$$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - 0.1) + \frac{0.1}{2} \right] = \text{ExTime}_{old} \times 0.95$$
$$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{old} \times 0.95} = \frac{1}{0.95} \approx 1.053$$
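The floating-point example can be checked with a short Python sketch of Amdahl's Law. The function name and argument checks are illustrative choices, not part of the original notes.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of the task is sped up by a given factor."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    unaffected = 1.0 - fraction_enhanced
    return 1.0 / (unaffected + fraction_enhanced / speedup_enhanced)


# The slide's example: 10% of instructions are FP and they run 2X faster.
fp_example = amdahl_speedup(0.1, 2.0)  # 1 / 0.95, about 1.053

# Even an enormous FP speedup cannot push the overall speedup past 1 / 0.9,
# because the other 90% of the task is unaffected.
upper_bound = amdahl_speedup(0.1, 1e12)
```

The second call shows the main lesson of the law: the unenhanced fraction limits the overall gain.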
Topic 2: Instruction Set Architecture
Design
Adapted from Prof. Jerry Breecher’s Notes + my CS21Q Notes
(http://babbage.clarku.edu/~jbreecher/arch/arch.html)
Introduction
7.1 Introduction
7.2 Classifying Instruction Set Architectures
7.3 Memory Addressing
7.4 Operations in the Instruction Set
7.5 Type and Size of Operands
7.6 Encoding and Instruction Set
7.7 The Role of Compilers
7.8 The MIPS Architecture and Bonus
7.9 Endianness
Introduction
The Instruction Set Architecture is that portion of the machine visible to the
assembly level programmer or to the compiler writer.
software
instruction set
hardware
Questions:
- What are the advantages and disadvantages of various
instruction set alternatives?
- How do languages and compilers affect ISA?
Classifying Instruction Set
Architectures
Classifications can be by:
1. Stack/accumulator/register
2. Number of memory operands.
3. Number of total operands.
Instruction Set Architectures
Basic ISA Classes

Accumulator:
  1 address       add A          acc <- acc + mem[A]
  1+x address     addx A         acc <- acc + mem[A + x]
Stack:
  0 address       add            tos <- tos + next
General Purpose Register:
  2 address       add A B        EA(A) <- EA(A) + EA(B)
  3 address       add A B C      EA(A) <- EA(B) + EA(C)
  ALU Instructions can have two or three operands.
Load/Store:
  0 Memory        load R1, Mem1
                  load R2, Mem2
                  add R1, R2
  1 Memory        add R1, Mem2
  ALU Instructions can have 0, 1, 2, 3 operands. Shown here are cases of 0 and 1.
Instruction Set Architectures
Basic ISA Classes
The results of the different address classes are easiest to see with the examples here, all of which implement the sequence for C = A + B.

Stack     Accumulator   Register (register-memory)   Register (load-store)
Push A    Load A        Load R1, A                   Load R1, A
Push B    Add B         Add R1, B                    Load R2, B
Add       Store C       Store C, R1                  Add R3, R1, R2
Pop C                                                Store C, R3

Registers are the class that won out. The more registers on the CPU, the better.
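To see that the stack and load-store sequences for C = A + B compute the same result, here is a minimal Python sketch of two toy interpreters. The tuple-based instruction format and the function names are assumptions made for this illustration and do not model any real machine.

```python
def run_stack(program, mem):
    """Toy stack machine: operands are implicit (top of stack)."""
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(mem[args[0]])
        elif op == "add":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "pop":
            mem[args[0]] = stack.pop()
        else:
            raise ValueError(f"unknown stack op: {op}")
    return mem


def run_load_store(program, mem):
    """Toy load-store machine: only load/store touch memory."""
    regs = {}
    for op, *args in program:
        if op == "load":     # load Rd, Mem
            regs[args[0]] = mem[args[1]]
        elif op == "add":    # add Rd, Rs1, Rs2
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "store":  # store Mem, Rs
            mem[args[0]] = regs[args[1]]
        else:
            raise ValueError(f"unknown load-store op: {op}")
    return mem


stack_program = [("push", "A"), ("push", "B"), ("add",), ("pop", "C")]
load_store_program = [
    ("load", "R1", "A"),
    ("load", "R2", "B"),
    ("add", "R3", "R1", "R2"),
    ("store", "C", "R3"),
]

stack_result = run_stack(stack_program, {"A": 2, "B": 3})
load_store_result = run_load_store(load_store_program, {"A": 2, "B": 3})
```

Both programs have four instructions, but the stack version names no registers while the load-store version names its operands explicitly in every instruction.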
Instruction Set Architectures
Intel 80x86 Integer Registers

GPR0   EAX      Accumulator
GPR1   ECX      Count register; string, loop
GPR2   EDX      Data Register; multiply, divide
GPR3   EBX      Base Address Register
GPR4   ESP      Stack Pointer
GPR5   EBP      Base Pointer – for base of stack seg.
GPR6   ESI      Index Register
GPR7   EDI      Index Register
       CS       Code Segment Pointer
       SS       Stack Segment Pointer
       DS       Data Segment Pointer
       ES       Extra Data Segment Pointer
       FS       Data Seg. 2
       GS       Data Seg. 3
PC     EIP      Instruction Counter
       Eflags   Condition Codes
Memory Addressing
Sections Include:
Interpreting Memory Addresses
Addressing Modes
Displacement Address Mode
Immediate Address Mode
Memory Addressing
Interpreting Memory Addresses
What object is accessed as a function of the address and length?
Objects have byte addresses – an address refers to the number of bytes counted from the beginning of memory.
Little Endian – puts the byte whose address is xx00 at the least significant position in the word.
Big Endian – puts the byte whose address is xx00 at the most significant position in the word.
Alignment – data must be aligned on a boundary equal to its size. Misalignment typically results in an alignment fault that must be handled by the Operating System.
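The alignment rule amounts to a single check on the byte address. The helper below is a hypothetical illustration in Python, not an interface from any real system.

```python
def is_aligned(address, size_bytes):
    """True if an object of size_bytes starts on a boundary equal to its size."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return address % size_bytes == 0


word_ok = is_aligned(0x100, 4)       # a 4-byte word at 0x100 is aligned
word_bad = is_aligned(0x102, 4)      # a 4-byte word at 0x102 is misaligned
halfword_ok = is_aligned(0x102, 2)   # a 2-byte halfword at 0x102 is aligned
```

Because common object sizes are powers of two, the same check can be done in hardware by testing that the low address bits are zero.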
Memory Addressing
Addressing Modes
This table shows the most common modes. A more complete set is in Figure 2.6.

Addressing Mode     Example Instruction   Meaning                             When Used
Register            Add R4, R3            R[R4] <- R[R4] + R[R3]              When a value is in a register.
Immediate           Add R4, #3            R[R4] <- R[R4] + 3                  For constants.
Displacement        Add R4, 100(R1)       R[R4] <- R[R4] + M[100 + R[R1]]     Accessing local variables.
Register Deferred   Add R4, (R1)          R[R4] <- R[R4] + M[R[R1]]           Using a pointer or a computed address.
Absolute            Add R4, (1001)        R[R4] <- R[R4] + M[1001]            Used for static data.
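The "Meaning" column can be read as a recipe for fetching the second operand of the Add. The Python sketch below follows that reading; the mode names, the dictionary-based register file and memory, and the sample values are all assumptions made for illustration.

```python
def fetch_operand(mode, regs, mem, reg=None, const=None):
    """Return the source operand selected by an addressing mode.

    regs maps register names to values; mem maps addresses to values.
    """
    if mode == "register":           # Add R4, R3       -> R[R3]
        return regs[reg]
    if mode == "immediate":          # Add R4, #3       -> 3
        return const
    if mode == "displacement":       # Add R4, 100(R1)  -> M[100 + R[R1]]
        return mem[const + regs[reg]]
    if mode == "register_deferred":  # Add R4, (R1)     -> M[R[R1]]
        return mem[regs[reg]]
    if mode == "absolute":           # Add R4, (1001)   -> M[1001]
        return mem[const]
    raise ValueError(f"unknown addressing mode: {mode}")


# Made-up machine state for the five example instructions in the table.
regs = {"R1": 200, "R3": 7, "R4": 1}
mem = {200: 13, 300: 11, 1001: 17}

operands = {
    "register": fetch_operand("register", regs, mem, reg="R3"),
    "immediate": fetch_operand("immediate", regs, mem, const=3),
    "displacement": fetch_operand("displacement", regs, mem, reg="R1", const=100),
    "register_deferred": fetch_operand("register_deferred", regs, mem, reg="R1"),
    "absolute": fetch_operand("absolute", regs, mem, const=1001),
}
```

Register deferred is the same as displacement with a constant of zero, which is one reason a load-store ISA can get by with displacement alone.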
Memory Addressing
Displacement Addressing Mode
How big should the displacement be?
For addresses that do fit in the displacement size:
Add R4, 10000 (R0)
For addresses that don’t fit in the displacement size, the compiler must do the following:
Load R1, address
Add R4, 0 (R1)
How big the field should be depends on the displacements that typical programs use.
On DLX, the space allocated is 16 bits. IA-32 encodes displacements of 8 or 32 bits (16 bits in 16-bit addressing mode).
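Whether an address fits is a range check against the displacement field. The Python sketch below assumes a signed (sign-extended) 16-bit field, as on DLX or MIPS; the helper name is hypothetical.

```python
def fits_in_signed_field(value, bits=16):
    """True if value is representable in a two's complement field of this width."""
    low = -(1 << (bits - 1))
    high = (1 << (bits - 1)) - 1
    return low <= value <= high


fits_small = fits_in_signed_field(10000)   # the slide's example displacement
fits_limit = fits_in_signed_field(32767)   # largest positive 16-bit value
fits_large = fits_in_signed_field(32768)   # one too large: needs the two-instruction sequence
```

When the check fails, the compiler falls back to loading the address into a register first, as in the Load/Add sequence above.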
Memory Addressing
Immediate Address Mode
Used where we want to get to a numerical value in an instruction.

At high level:      At Assembler level:
a = b + 3;          Load    R2, 3
                    Add     R0, R1, R2
if ( a > 17 )       Load    R2, 17
                    CMPBGT  R1, R2
goto Addr           Load    R1, Address
                    Jump    (R1)

So how would you get a 32 bit value into a register?
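One common answer on machines with 16-bit immediates is to build the constant in two steps: load the upper 16 bits, then OR in the lower 16 bits. MIPS does this with its `lui` and `ori` instructions. The Python sketch below only models the arithmetic, and its function names are hypothetical.

```python
def split_constant(value):
    """Split a 32-bit constant into (upper 16 bits, lower 16 bits)."""
    value &= 0xFFFFFFFF
    return value >> 16, value & 0xFFFF


def build_constant(upper, lower):
    """Rebuild the constant the way a two-instruction sequence would."""
    register = (upper << 16) & 0xFFFFFFFF  # step 1: upper half, low bits zero (like lui)
    register |= lower & 0xFFFF             # step 2: OR in the lower half (like ori)
    return register


upper, lower = split_constant(0x12345678)
rebuilt = build_constant(upper, lower)
```

The OR in step 2 must zero-extend its immediate; with a sign-extending add, the upper half would need adjusting whenever bit 15 of the lower half is set.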
Operations In The Instruction Set
Sections Include:
Detailed information about types of instructions.
Instructions for Control Flow (conditional branches, jumps)
Operations In The Instruction Set
Operator Types

Arithmetic and logical   and, add
Data transfer            move, load
Control                  branch, jump, call
System                   system call, traps
Floating point           add, mul, div, sqrt
Decimal                  add, convert
String                   move, compare
Multimedia               2D, 3D? e.g., Intel MMX and Sun VIS
Operations In The Instruction Set
Control Instructions
Conditional branches are 20% of all instructions!!
Control Instructions Issues:
– taken or not
– where is the target
– link return address
– save or restore
Instructions that change the PC:
– (conditional) branches, (unconditional) jumps
– function calls, function returns
– system calls, system returns
Type And Size of Operands
The type of the operand is usually encoded in the Opcode – a LDW
implies loading of a word.
Common sizes are:
Character (1 byte)
Half word (16 bits)
Word (32 bits)
Single Precision Floating Point (1 Word)
Double Precision Floating Point (2 Words)
Integers are two’s complement binary.
Floating point is IEEE 754.
Some languages (like COBOL) use packed decimal.
The MIPS Architecture
MIPS is very RISC oriented.
The MIPS Architecture
MIPS Characteristics
There’s MIPS-32 that we learned in CS140:
• 32-bit byte addresses, aligned
• Load/store, only displacement addressing
• Standard datatypes
• 3 fixed-length formats
• 32 32-bit GPRs (r0 = 0)
• 16 64-bit (32 32-bit) FPRs
• FP status register
• No Condition Codes
There’s MIPS-64, the current arch.:
• Standard datatypes
• 4 fixed-length formats (8, 16, 32, 64)
• 32 64-bit GPRs (r0 = 0)
• 32 64-bit FPRs
Addressing Modes
• Immediate
• Displacement
• (Register Mode used only for ALU)
Data transfer
• load/store word, load/store byte/halfword signed?
• load/store FP single/double
• moves between GPRs and FPRs
ALU
• add/subtract signed? immediate?
• multiply/divide signed?
• and, or, xor immediate?; shifts: ll, rl, ra immediate?
• sets immediate?
The MIPS Architecture
MIPS Characteristics
Control
• branches == 0, <> 0
• conditional branch testing FP bit
• jump, jump register
• jump & link, jump & link register
• trap, return from exception
Floating Point
• add/sub/mul/div
• single/double
• fp converts, fp set
The MIPS Architecture
The MIPS Encoding

Register-Register
Bits:     31-26   25-21   20-16     15-11   10-6, 5-0
Fields:   Op      Rs1     Rs2       Rd      Opx

Register-Immediate
Bits:     31-26   25-21   20-16     15-0
Fields:   Op      Rs1     Rd        immediate

Branch
Bits:     31-26   25-21   20-16     15-0
Fields:   Op      Rs1     Rs2/Opx   immediate

Jump / Call
Bits:     31-26   25-0
Fields:   Op      target
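Packing the Register-Register format is a matter of shifting each field into place. The Python sketch below assumes that bits 10-6 hold a 5-bit shift amount and bits 5-0 hold the 6-bit Opx (function) code, as in the real MIPS32 R-type layout; the slide itself does not label that split, and the function names are hypothetical.

```python
def encode_register_register(op, rs1, rs2, rd, shamt, opx):
    """Pack the six fields into one 32-bit instruction word."""
    fields = [(op, 6), (rs1, 5), (rs2, 5), (rd, 5), (shamt, 5), (opx, 6)]
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode_register_register(word):
    """Unpack a 32-bit word back into its fields."""
    return {
        "op": (word >> 26) & 0x3F,
        "rs1": (word >> 21) & 0x1F,
        "rs2": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "opx": word & 0x3F,
    }


# MIPS32 "add $3, $1, $2": op = 0, function code = 0x20.
add_word = encode_register_register(op=0, rs1=1, rs2=2, rd=3, shamt=0, opx=0x20)
add_fields = decode_register_register(add_word)
```

Because every format keeps Op in bits 31-26 and the first source register in bits 25-21, the hardware can start reading registers before it has fully decoded the instruction.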
Byte Ordering
• How should bytes within multi-byte word be
ordered in memory?
• Conventions
– Sun’s, Mac’s are “Big Endian” machines
• Least significant byte has highest address
– Alphas, PC’s are “Little Endian” machines
• Least significant byte has lowest address
Byte Ordering Example
• Big Endian
– Least significant byte has highest address
• Little Endian
– Least significant byte has lowest address
• Example
– Variable x has 4-byte representation 0x01234567
– Address given by &x is 0x100
                0x100   0x101   0x102   0x103
Big Endian      01      23      45      67
Little Endian   67      45      23      01
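The same layout can be reproduced with Python's built-in `int.to_bytes`. The `memory_layout` helper is a hypothetical convenience for showing which byte lands at which address.

```python
value = 0x01234567

big_endian = value.to_bytes(4, "big")        # most significant byte first
little_endian = value.to_bytes(4, "little")  # least significant byte first


def memory_layout(raw, base=0x100):
    """Map each address, starting at base, to the byte stored there (as hex)."""
    return {base + offset: f"{byte:02x}" for offset, byte in enumerate(raw)}


big_layout = memory_layout(big_endian)
little_layout = memory_layout(little_endian)
```

Reading the bytes back with the wrong convention, for example `int.from_bytes(little_endian, "big")`, gives 0x67452301 instead of 0x01234567. This is why byte order matters when binary data moves between machines.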
Machine-Level Code Representation
• Encode Program as Sequence of Instructions
– Each simple operation
• Arithmetic operation
• Read or write memory
• Conditional branch
– Instructions encoded as bytes
• Alpha’s, Sun’s, Mac’s use 4 byte instructions
– Reduced Instruction Set Computer (RISC)
• PC’s use variable length instructions
– Complex Instruction Set Computer (CISC)
– Different instruction types and encodings for different machines
• Most code not binary compatible
• Programs are Byte Sequences Too!
Classification of Processors
• We can classify processors according to the areas in which they are mostly used.
• We can identify four different groups of processors:
– General purpose processors, which are used in building computers
– Digital signal processors, which are designed specifically for signal processing
– Microcontrollers, which are small microcomputers that integrate on the same chip a core processor plus I/O elements and a small amount of memory
– Application-specific processors, which are designed to perform a specific function (e.g., network processors)
General Purpose Processors
• These processors are used to build major computer platforms.
• We can name:
– Intel / AMD based computers, also called IBM compatible
– Macintosh computers built using PowerPC processors
– Sun machines that use UltraSPARC processors
Examples of General Purpose Processors
Type of Computer   Processors Used                                Technology
Macintosh          PowerPC (IBM, Motorola)                        Superscalar
Sun                UltraSPARC (Sun)                               RISC
IBM Compatible     Intel processors; Athlon, Duron (AMD); Cyrix   Superscalar
DSP
• Digital Signal Processing (DSP) is used in a wide variety of applications, and it is hard to find a good definition that is general.
• We can start by dictionary definitions of the words:
– Digital
* operating by the use of discrete signals to represent data in the form of numbers
– Signal
* a variable parameter by which information is conveyed through an electronic circuit
– Processing
* to perform operations on data according to programmed instructions
• Which leads us to a simple definition of Digital Signal Processing:
* changing or analyzing information which is measured as discrete sequences of numbers
• Note two unique features of Digital Signal processing as opposed to
plain old ordinary digital processing:
– signals come from the real world - this intimate connection with the real
world leads to many unique needs such as the need to react in real time
and a need to measure signals and convert them to digital numbers
– signals are discrete - which means the information in between discrete
samples is lost
• The advantages of DSP are common to many digital systems and
include:
– Versatility:
• digital systems can be reprogrammed for other applications (at least where
programmable DSP chips are used)
• digital systems can be ported to different hardware (for example a different
DSP chip or board level product)
– Repeatability:
• digital systems can be easily duplicated
• digital systems do not depend on strict component tolerances
• digital system responses do not drift with temperature
– Simplicity:
• some things can be done more easily digitally than with analogue
systems
• DSP is used in a very wide variety of applications.
• But most share some common features:
– they use a lot of math (multiplying and adding signals)
– they deal with signals that come from the real world
– they require a response in a certain time
• Where general purpose DSP processors are concerned, most applications deal with signal frequencies that are in the audio range.