Microprocessor Design 2002

Advanced Computer Architecture
5MD00 / 5Z032
Instruction Set Design
Henk Corporaal
www.ics.ele.tue.nl/~heco
TUEindhoven
2007
Lecture overview
• ISA and Evolution
• Architecture classes
• Addressing
• Operands
• Operations
• Encoding
• RISC
• SIMD extensions
7/21/2015
ACA H.Corporaal
Instruction Set Architecture
• The instruction set architecture serves as the
interface between software and hardware
• It provides the mechanism by which the software tells
the hardware what should be done
• Architecture definition:
“the architecture of a system/processor is (a minimal
description of) its behavior as observed by its
immediate users”
[Figure: layered diagram with software on top, the instruction set architecture in the middle, and hardware at the bottom]
Instruction Set Design Issues
• Where are operands stored?
– registers, memory, stack, accumulator
• How many explicit operands are there?
– 0, 1, 2, or 3
• How is the operand location specified?
– register, immediate, indirect, . . .
• What type & size of operands are supported?
– byte, int, float, double, string, vector. . .
• What operations are supported?
– add, sub, mul, move, compare . . .
Operands
• How are operands designated?
– fixed – always in the same place
– by opcode – always the same for groups of instructions
– by a field in the instruction – requires decode first
• What is the format of the data?
– binary
– character
– decimal (packed and unpacked)
– floating-point – IEEE 754 (others used less and less)
– size – 8-, 16-, 32-, 64-, 128-bit
• What is the influence on ISA?
Operand Locations
Classifying ISAs
Accumulator (before 1960):
  1 address    add A             acc <- acc + mem[A]
Stack (1960s to 1970s):
  0 address    add               tos <- tos + next
Memory-Memory (1970s to 1980s):
  2 address    add A, B          mem[A] <- mem[A] + mem[B]
  3 address    add A, B, C       mem[A] <- mem[B] + mem[C]
Register-Memory (1970s to present):
  2 address    add R1, A         R1 <- R1 + mem[A]
               load R1, A        R1 <- mem[A]
Register-Register (Load/Store) (1960s to present):
  3 address    add R1, R2, R3    R1 <- R2 + R3
               load R1, R2       R1 <- mem[R2]
               store R1, R2      mem[R1] <- R2
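To make the differences between these classes concrete, the following Python sketch computes C = A + B in each style and counts the instructions needed. It is a toy model only: the function names, the dictionary-based memory, and the mnemonics in the comments are assumptions made for illustration and do not describe any real machine.

```python
# Toy model: compute mem["C"] = mem["A"] + mem["B"] in each ISA class.
# Each function returns the number of instructions it "executed".
# All names and mnemonics here are illustrative assumptions.

def run_accumulator(mem):
    acc = mem["A"]                  # load A     acc <- mem[A]
    acc = acc + mem["B"]            # add B      acc <- acc + mem[B]
    mem["C"] = acc                  # store C    mem[C] <- acc
    return 3

def run_stack(mem):
    stack = []
    stack.append(mem["A"])          # push A
    stack.append(mem["B"])          # push B
    tos = stack.pop()               # add        tos <- tos + next
    stack.append(stack.pop() + tos)
    mem["C"] = stack.pop()          # pop C
    return 4

def run_memory_memory(mem):
    mem["C"] = mem["A"] + mem["B"]  # add C, A, B (3 address)
    return 1

def run_register_memory(mem):
    r1 = mem["A"]                   # load R1, A
    r1 = r1 + mem["B"]              # add R1, B
    mem["C"] = r1                   # store R1, C
    return 3

def run_load_store(mem):
    r1 = mem["A"]                   # load R1, A
    r2 = mem["B"]                   # load R2, B
    r3 = r1 + r2                    # add R3, R1, R2
    mem["C"] = r3                   # store R3, C
    return 4
```

The instruction count alone does not decide which class is better: the memory-memory version needs a single instruction, but that instruction carries three memory addresses and makes three memory accesses, while the load/store version uses more but simpler instructions.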
Evolution of Architectures
Single Accumulator (EDSAC 1950)
Accumulator + Index Registers
(Manchester Mark I, IBM 700 series 1953)
Separation of Programming Model
from Implementation
High-level Language Based
(B5000 1963)
Concept of a Family
(IBM 360 1964)
General Purpose Register Machines
Complex Instruction Sets
(VAX, Intel 8086 1977-80)
Load/Store Architecture
(CDC 6600, Cray 1 1963-76)
RISC
(MIPS, SPARC, 88000, IBM RS6000, ... 1987+)
Addressing Modes
• Types
– Register – data in a register
– Immediate – data in the instruction
– Memory – data in memory
• Calculation of Effective Address
– Direct – address in instruction
– Indirect – address in register
– Displacement – address = register or PC + offset
– Indexed – address = register + register
– Memory Indirect – address at address in register
• What is the influence on ISA?
Types of Addressing Mode (VAX)
    Addressing Mode      Example                 Action
 1. Register direct      Add R4, R3              R4 <- R4 + R3
 2. Immediate            Add R4, #3              R4 <- R4 + 3
 3. Displacement         Add R4, 100(R1)         R4 <- R4 + M[100 + R1]
 4. Register indirect    Add R4, (R1)            R4 <- R4 + M[R1]
 5. Indexed              Add R4, (R1 + R2)       R4 <- R4 + M[R1 + R2]
 6. Direct               Add R4, (1000)          R4 <- R4 + M[1000]
 7. Memory Indirect      Add R4, @(R3)           R4 <- R4 + M[M[R3]]
 8. Autoincrement        Add R4, (R2)+           R4 <- R4 + M[R2]
                                                 R2 <- R2 + d
 9. Autodecrement        Add R4, (R2)-           R4 <- R4 + M[R2]
                                                 R2 <- R2 - d
10. Scaled               Add R4, 100(R2)[R3]     R4 <- R4 + M[100 + R2 + R3*d]
• Studies by [Clark and Emer] indicate that modes 1-4 account for 93% of all
operands on the VAX
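The actions in the table can be sketched as a single operand-fetch routine. The Python below is a minimal illustration, not VAX-accurate code: the function name fetch_operand, its parameters, the dictionary-based register file and memory, and the default operand size d = 4 are all assumptions. The autoincrement and autodecrement cases follow the order of actions given in the table (use the address, then update the register).

```python
def fetch_operand(mode, regs, mem, base=None, index=None, const=0, d=4):
    """Return the second source operand for one addressing mode (toy model).

    regs maps register names to values, mem maps addresses to values,
    const is the immediate, displacement, or absolute address,
    and d is the operand size in bytes (assumed to be 4 here).
    """
    if mode == "register":           # Add R4, R3
        return regs[base]
    if mode == "immediate":          # Add R4, #3
        return const
    if mode == "displacement":       # Add R4, 100(R1)
        return mem[const + regs[base]]
    if mode == "register_indirect":  # Add R4, (R1)
        return mem[regs[base]]
    if mode == "indexed":            # Add R4, (R1 + R2)
        return mem[regs[base] + regs[index]]
    if mode == "direct":             # Add R4, (1000)
        return mem[const]
    if mode == "memory_indirect":    # Add R4, @(R3)
        return mem[mem[regs[base]]]
    if mode == "autoincrement":      # Add R4, (R2)+
        value = mem[regs[base]]
        regs[base] += d
        return value
    if mode == "autodecrement":      # use M[R2], then R2 <- R2 - d
        value = mem[regs[base]]
        regs[base] -= d
        return value
    if mode == "scaled":             # Add R4, 100(R2)[R3]
        return mem[const + regs[base] + regs[index] * d]
    raise ValueError(f"unknown addressing mode: {mode}")
```

Note how only the last few modes need more than one addition or more than one memory access, which is one reason the rarely used modes are expensive to support in hardware.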
Operations
• Types
– ALU – Integer arithmetic and logical functions
– Data transfer – Loads/stores
– Control – Branch, jump, call, return, traps, interrupts
– System – O/S calls, virtual memory management
– Floating point – Floating point arithmetic
– Decimal – Decimal arithmetic
– String – moves, compares, search, etc.
– Graphics – Pixel/vertex operations
– Vector – Vector (SIMD) functions
• Addressing
– Which addressing modes for which operands are supported?
80x86 Instruction Frequency
Rank    Instruction      Frequency
1       load             22%
2       branch           20%
3       compare          16%
4       store            12%
5       add              8%
6       and              6%
7       sub              5%
8       register move    4%
9       call             1%
10      return           1%
Total                    96%
Relative Frequency of
Control Instructions
Operation      SPECint92    SPECfp92
Call/Return    13%          11%
Jumps          6%           4%
Branches       81%          87%
• Design hardware to handle branches
quickly, since these occur most frequently
Frequency of Operand Sizes
on 32-bit Load-Store Machines
Size       SPECint92    SPECfp92
64 bits    0%           69%
32 bits    74%          31%
16 bits    19%          0%
8 bits     19%          0%
• For floating point, we want good performance for 64-bit operands.
• For integer operations, we want good performance for 32-bit operands.
• Recent architectures also support 64-bit integers.
Instruction Encoding
• Variable
– Instruction length varies based on opcode and address
specifiers
– For example, VAX instructions vary between 1 and 53
bytes, while x86 instructions vary between 1 and 17 bytes.
– Good code density, but difficult to decode and pipeline
• Fixed
– Only a single size for all instructions
– For example, MIPS, PowerPC, and SPARC all have 32-bit
instructions
– Not as good code density, but easier to decode and pipeline
• Hybrid
– Have multiple format lengths specified by the opcode
– For example, IBM 360/370
– Compromise between code density and ease of decode
Example: MIPS
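As an illustration of why a fixed 32-bit format is easy to decode, the sketch below splits a MIPS R-type instruction word into its fields with a few shifts and masks. The field layout (6-bit opcode, three 5-bit register numbers, 5-bit shift amount, 6-bit function code) is the standard MIPS32 R-type format; the function names encode_r_type and decode_r_type are hypothetical helpers introduced here.

```python
def encode_r_type(rs, rt, rd, shamt, funct, opcode=0):
    """Assemble a 32-bit MIPS R-type word from its fields."""
    return ((opcode & 0x3F) << 26 | (rs & 0x1F) << 21 | (rt & 0x1F) << 16
            | (rd & 0x1F) << 11 | (shamt & 0x1F) << 6 | (funct & 0x3F))

def decode_r_type(word):
    """Split a 32-bit MIPS R-type word into its fields.

    Every field sits at a fixed bit position, so all of them can be
    extracted in parallel without first examining the opcode.
    """
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs":     (word >> 21) & 0x1F,
        "rt":     (word >> 16) & 0x1F,
        "rd":     (word >> 11) & 0x1F,
        "shamt":  (word >> 6) & 0x1F,
        "funct":  word & 0x3F,
    }

# add $t0, $s1, $s2: rs = 17 ($s1), rt = 18 ($s2), rd = 8 ($t0), funct = 0x20
ADD_T0_S1_S2 = encode_r_type(rs=17, rt=18, rd=8, shamt=0, funct=0x20)
```

With a variable-length encoding, by contrast, the decoder must examine the opcode and address specifiers before it even knows where the next instruction starts.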
Compilers and ISA
• Compiler Goals
– All correct programs compile correctly
– Most compiled programs execute quickly
– Most programs compile quickly
– Achieve small code size
– Provide debugging support
• Multiple Source Compilers
– Same compiler can compile different languages
• Multiple Target Compilers
– Same compiler can generate code for different machines
Compiler Phases
Compilers use phases to manage complexity:
• Front end
– Convert language to intermediate form
• High level optimizer
– Procedure inlining and loop transformations
• Global optimizer
– Global and local optimization, plus register allocation
• Code generator (and assembler)
– Dependency elimination, instruction selection, scheduling
Designing ISA to Improve Compilation
• Provide enough general-purpose registers to ease
register allocation (more than 16)
• Provide regular instruction sets by keeping the
operations, data types, and addressing modes
orthogonal
• Provide primitive constructs rather than trying to map
to a high-level language
• Allow compilers to help make the common case fast
A "Typical" RISC
• 32-bit fixed format instruction (few formats)
• 32 32-bit GPR
• 3-address, reg-reg arithmetic instruction
• Single address mode for load/store:
base + displacement
– no indirection
• Simple branch conditions
• Pipelined implementation
• Separate Instruction and Data level-1 caches
• Delayed branch ?
Comparison MIPS with 80x86
• How would you expect the x86 and MIPS
architectures to compare on the following:
– CPI on SPEC benchmarks
– Ease of design and implementation
– Ease of writing assembly language & compilers
– Code density
– Overall performance
• What other advantages/disadvantages are there
to the two architectures?
Graphics and Multimedia
Instruction Set Extensions
• Support graphics and multimedia applications
– Intel’s MMX Technology
– Intel’s Internet Streaming SIMD Extensions
– AMD’s 3DNow! Technology
– Sun’s Visual Instruction Set
– Motorola’s and IBM’s AltiVec Technology
• These extensions improve the performance of
– Computer-aided design
– Internet applications
– Computer visualization
– Video games
– Speech recognition
MMX Data Types
MMX Technology supports operations on
the following 64-bit integer data types:
Packed byte (eight 8-bit elements)
Packed word (four 16-bit elements)
Packed double word (two 32-bit elements)
Packed quad word (one 64-bit element)
SIMD Operations
• MMX Technology allows a Single Instruction to work
on Multiple pieces of Data (SIMD)
  A3       A2       A1       A0
+ B3       B2       B1       B0
= A3+B3    A2+B2    A1+B1    A0+B0
PADD[W]: Packed add word
• In the above example, 4 parallel adds are performed on
16-bit elements
• Most MMX instructions only require a single cycle
Saturating Arithmetic
• Both wrap-around and saturating adds are
supported
• With saturating arithmetic, results that
overflow/underflow are set to the
largest/smallest value
PADD[W]: Packed wrap-around add
PADDUS[W]: Packed saturating add
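The difference between the two kinds of add can be sketched in Python by treating a 64-bit value as four unsigned 16-bit lanes. The function names mirror the MMX mnemonics PADDW and PADDUSW, but the code is only a software model written for illustration; the helper names unpack_words and pack_words are assumptions.

```python
LANES, BITS = 4, 16
MASK = (1 << BITS) - 1  # 0xFFFF, the largest unsigned 16-bit value

def unpack_words(q):
    """Split a 64-bit integer into four 16-bit lanes, lane 0 lowest."""
    return [(q >> (BITS * i)) & MASK for i in range(LANES)]

def pack_words(words):
    """Combine four 16-bit lanes into one 64-bit integer."""
    q = 0
    for i, w in enumerate(words):
        q |= (w & MASK) << (BITS * i)
    return q

def paddw(a, b):
    """Wrap-around add: each lane is reduced modulo 2**16."""
    return pack_words([(x + y) & MASK
                       for x, y in zip(unpack_words(a), unpack_words(b))])

def paddusw(a, b):
    """Unsigned saturating add: each lane is clamped at 0xFFFF."""
    return pack_words([min(x + y, MASK)
                       for x, y in zip(unpack_words(a), unpack_words(b))])
```

For pixel data, saturation is usually the wanted behaviour: adding brightness to an almost white pixel should leave it white, whereas a wrap-around add would turn it nearly black.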
Pack and Unpack Instructions
• Pack and unpack instructions provide
conversion between standard data types
and packed data types
PACKSS[DW]: Pack signed, with saturation, double word to packed word
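A minimal software model of this packing step is sketched below. It works on Python lists of signed integers rather than on real 64-bit registers, and the helper name saturate_s16 is an assumption; the lane order (destination values in the low words, source values in the high words) follows the documented behaviour of PACKSSDW.

```python
S16_MIN, S16_MAX = -32768, 32767

def saturate_s16(x):
    """Clamp a signed integer to the signed 16-bit range."""
    return max(S16_MIN, min(S16_MAX, x))

def packssdw(dest, src):
    """Pack two signed 32-bit values from dest and two from src
    into four signed 16-bit values, saturating each one."""
    return [saturate_s16(x) for x in list(dest) + list(src)]
```

Values that already fit in 16 bits pass through unchanged; only out-of-range values are clamped to the nearest representable value.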
Multiply-Add Operations
• Many graphics applications require multiply-accumulate operations
– Vector Dot Products
– Matrix Multiplies
– Fast Fourier Transforms (FFTs)
– Filter implementations
PMADDWD: Packed multiply-add word to double
Vector Dot Product
• A dot product on an 8-element vector can
be performed using 9 MMX instructions
– Without MMX 40 instructions are required
[Figure: 0 | a0*c0+..+a3*c3 added to 0 | a4*c4+..+a7*c7 gives a0*c0+..+a7*c7]
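One possible way to build the 8-element dot product from packed multiply-add steps is sketched below in Python. It models PMADDWD and PADDD on lists of signed integers; it is not the exact 9-instruction sequence referred to above, which would also include loads, shifts, and a store. The helper wrap32 and the function dot8 are hypothetical names.

```python
def wrap32(x):
    """Reduce an integer to the signed 32-bit range (wrap-around)."""
    return (x + 2**31) % 2**32 - 2**31

def pmaddwd(a, b):
    """Multiply four pairs of signed 16-bit values and add adjacent
    products, giving two signed 32-bit sums."""
    return [wrap32(a[0] * b[0] + a[1] * b[1]),
            wrap32(a[2] * b[2] + a[3] * b[3])]

def paddd(x, y):
    """Packed wrap-around add of two 32-bit lanes."""
    return [wrap32(x[0] + y[0]), wrap32(x[1] + y[1])]

def dot8(a, c):
    """Dot product of two 8-element vectors of signed 16-bit values."""
    low = pmaddwd(a[0:4], c[0:4])     # a0*c0+a1*c1, a2*c2+a3*c3
    high = pmaddwd(a[4:8], c[4:8])    # a4*c4+a5*c5, a6*c6+a7*c7
    partial = paddd(low, high)        # two partial sums
    return wrap32(partial[0] + partial[1])  # final horizontal add
```

Each pmaddwd call stands for one instruction that performs four multiplications and two additions, which is where most of the instruction-count saving comes from.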
Packed Compare Instructions
• Packed compare instructions allow a bit mask
to be set or cleared
• This is useful when images with certain
qualities need to be extracted
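The mask idea can be sketched as follows, using a colour-key overlay as the example: every foreground pixel equal to a key colour is replaced by the background pixel. The function pcmpeqw models the packed compare-equal on 16-bit lanes; select is a hypothetical helper that stands for the usual combination of PAND, PANDN, and POR. Pixels are plain Python integers here, which is an assumption made to keep the example short.

```python
LANE_MASK = 0xFFFF

def pcmpeqw(a, b):
    """Per lane: all ones if the 16-bit values are equal, else all zeros."""
    return [LANE_MASK if x == y else 0 for x, y in zip(a, b)]

def select(mask, if_set, if_clear):
    """Per lane: take if_set where the mask is all ones, if_clear elsewhere.
    Corresponds to (mask AND if_set) OR (NOT mask AND if_clear)."""
    return [(m & s) | (~m & LANE_MASK & c)
            for m, s, c in zip(mask, if_set, if_clear)]

def overlay(foreground, background, key):
    """Replace foreground pixels that match the key colour."""
    mask = pcmpeqw(foreground, [key] * len(foreground))
    return select(mask, background, foreground)
```

Because the comparison produces a mask instead of setting condition flags, the selection needs no branches, so all lanes are processed in the same straight-line instruction sequence.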
MMX Instructions
• MMX Technology adds 57 new instructions to the x86
architecture.
• Some of these instructions include
– PADD(b, w, d)          Packed addition
– PSUB(b, w, d)          Packed subtraction
– PCMPEQ(b, w, d)        Packed compare equal
– PMULLw                 Packed word multiply low
– PMULHw                 Packed word multiply high
– PMADDwd                Packed word multiply-add
– PSRL(w, d, q)          Packed shift right logical
– PACKSS(wb, dw)         Pack data
– PUNPCK(bw, wd, dq)     Unpack data
– PAND, POR, PXOR        Packed logical operations
Performance Comparison
• The following shows the performance of
Pentium processors with and without MMX
Technology
Application         Without MMX    With MMX    Speedup
Video               155.52         268.70      1.72
Image Processing    159.03         743.90      4.67
3D geometry         161.52         166.44      1.03
Audio               149.80         318.90      2.13
Overall             156.00         255.43      1.64
MMX Technology Summary
• MMX technology extends the Intel x86 architecture to improve
the performance of multimedia and graphics applications.
• It provides a speedup of 1.5 to 2.0 for certain applications.
• MMX instructions are hand-coded in assembly or implemented
as libraries to achieve high performance.
• MMX data types use the x86 floating point registers to avoid
adding state to the processor.
– Makes it easy to handle context switches
– Makes it hard to perform MMX and floating point
instructions at the same time
• MMX increases the chip area by only about 5%.
Questions on MMX
• What are the strengths and weaknesses of MMX
Technology?
• How could MMX Technology potentially be
improved?
• How did the developers of MMX preserve backward
compatibility with the x86 architecture?
– Why was this important?
– What are the disadvantages of this approach?
• What restrictions/limitations are there on the use of
MMX Technology?
Internet Streaming SIMD
Extensions
• Intel’s Internet Streaming SIMD Extensions (ISSE)
– Help improve the performance of video and 3D applications
– Are designed for streaming data, which is used once and then
discarded.
– 70 new instructions beyond MMX Technology
– Adds new 128-bit registers
– Provide the ability to perform parallel floating point operations
• Four parallel operations on 32-bit numbers
• Reciprocal and reciprocal square root instructions - normalization
• Packed average instruction – Motion compensation
– Provide data prefetch instructions
– Make certain applications 1.5 to 2.0 times faster