ppt - ECE Users Pages
ALU Architecture and ISA
Extensions
Lecture notes from MKP, H. H. Lee and S. Yalamanchili
Reading
• Sections 3.2-3.5 (only those elements covered in class)
• Sections 3.6-3.8
• Appendix B.5
• Goal: Understand the ISA view of the core microarchitecture
Organization of functional units and register files into basic data paths
(2)
Overview
• Instruction Set Architectures have a purpose
Applications dictate what we need
• We only have a fixed number of bits
Impact on accuracy
• More is not better
We cannot afford everything we want
• Basic Arithmetic Logic Unit (ALU) Design
Addition/subtraction, multiplication, division
(3)
Reminder: ISA
[Diagram: the programmer-visible state — the register file (addresses 0x00 through 0x1F) and the program counter — and programmer-invisible state (instruction register, kernel registers), connected through processor-internal buses and a memory interface to a byte-addressed memory. The memory map runs from the reserved region and text segment through the static data segment and dynamic data up to the stack near 0xFFFFFFFF. The ALU sits in the core data path.]
Who sees what?
(4)
Arithmetic for Computers
• Operations on integers
Addition and subtraction
Multiplication and division
Dealing with overflow
• Operations on floating-point real numbers
Representation and operations
• Let us first look at integers
(5)
Integer Addition (3.2)
• Example: 7 + 6
• Overflow if result out of range
Adding +ve and –ve operands: no overflow
Adding two +ve operands: overflow if result sign is 1
Adding two –ve operands: overflow if result sign is 0
(6)
Integer Subtraction
• Add negation of second operand
• Example: 7 – 6 = 7 + (–6)
+7: 0000 0000 … 0000 0111
–6: 1111 1111 … 1111 1010   (2's complement representation)
+1: 0000 0000 … 0000 0001
• Overflow if result out of range
Subtracting two +ve or two –ve operands: no overflow
Subtracting +ve from –ve operand: overflow if result sign is 0
Subtracting –ve from +ve operand: overflow if result sign is 1
(7)
ISA Impact
• Some languages (e.g., C) ignore overflow
Use MIPS addu, addiu, subu instructions
• Other languages (e.g., Ada, Fortran) require
raising an exception
Use MIPS add, addi, sub instructions
On overflow, invoke exception handler
o Save PC in exception program counter (EPC) register
o Jump to predefined handler address
o mfc0 (move from coprocessor register) instruction can retrieve EPC value, to return after corrective action (more later)
• ALU Design leads to many solutions. We look
at one simple example
(8)
Integer ALU (Arithmetic Logic Unit) (B.5)
• Build a 1-bit ALU, and use 32 of them (bit-slice)
[Diagram: a 1-bit ALU slice with inputs a, b and an operation select producing a result; 32 such slices compose the 32-bit ALU.]
(9)
Single Bit ALU
Implements only AND and OR operations
[Diagram: a 2-to-1 multiplexer driven by Operation selects between A AND B (0) and A OR B (1) to produce Result.]
(10)
Adding Functionality
• We can add additional operators (to a point)
• How about addition?
cout = a·b + a·cin + b·cin
sum = a ⊕ b ⊕ cin
[Diagram: a full adder with inputs a, b, CarryIn and outputs Sum, CarryOut.]
• Review full adders from digital design
(11)
Building a 32-bit ALU
[Diagram: a 32-bit ALU built from 32 one-bit slices ALU0–ALU31. Each slice takes ai, bi, a CarryIn, and the shared Operation select, produces Resulti, and ripples its CarryOut into the next slice's CarryIn.]
(12)
Subtraction (a – b) ?
• Two's complement approach: just negate b and add 1
• How do we negate?
• A clever solution: add a Binvert control line; a 2-to-1 mux in each slice selects b or its complement, and Binvert also feeds the CarryIn of ALU0, so the chain computes a + ~b + 1 = a – b
[Diagram: the 32-bit ALU with Binvert and the per-slice b-input multiplexer, slices ALU0–ALU31 chained by the carry as before.]
(13)
Tailoring the ALU to the MIPS
• Need to support the set-on-less-than instruction(slt)
remember: slt is an arithmetic instruction
produces a 1 if rs < rt and 0 otherwise
use subtraction: (a-b) < 0 implies a < b
• Need to support test for equality (beq $t5, $t6, Label)
use subtraction: (a-b) = 0 implies a = b
(14)
What is Result31 when (a – b) < 0?
[Diagram: the 32-bit ALU extended for slt. Each slice gains a Less input to its result multiplexer (tied to 0 in slices 1–31); the adder output of ALU31 is brought out as Set and fed back into the Less input of ALU0; ALU31 also produces the Overflow signal.]
Unsigned vs. signed support
(15)
Test for equality
• Notice control lines (Bnegate plus 2-bit Operation):
000 = and
001 = or
010 = add
110 = subtract
111 = slt
• Note: Zero is a 1 when the result is zero!
[Diagram: the complete 32-bit ALU (ALU0–ALU31). Every Resulti output feeds a wide NOR gate producing Zero; ALU31 produces Set and Overflow. Note the test for overflow!]
(16)
ISA View
[Diagram: CPU/Core containing the register file ($0, $1, …, $31) and the ALU.]
• Register-to-Register data path
• We want this to be as fast as possible
(17)
Multiplication (3.3)
• Long multiplication
      1000    multiplicand
    × 1001    multiplier
      1000
     0000
    0000
   1000
  1001000     product
Length of product is the sum of operand lengths
(18)
A Multiplier
• Uses multiple adders
Cost/performance tradeoff
Can be pipelined
Several multiplications performed in parallel
(19)
MIPS Multiplication
• Two 32-bit registers for product
HI: most-significant 32 bits
LO: least-significant 32-bits
• Instructions
mult rs, rt / multu rs, rt
o 64-bit product in HI/LO
mfhi rd / mflo rd
o Move from HI/LO to rd
o Can test HI value to see if product overflows 32 bits
mul rd, rs, rt
o Least-significant 32 bits of product → rd
Study Exercise: Check out signed and
unsigned multiplication with QtSPIM
(20)
Division (3.4)
• Long division example: 1001010 ÷ 1000
           1001      quotient
  1000 ) 1001010     (divisor 1000, dividend 1001010)
        -1000
           10
           101
           1010
          -1000
             10      remainder
• Check for 0 divisor
• Long division approach
If divisor ≤ dividend bits
o 1 bit in quotient, subtract
Otherwise
o 0 bit in quotient, bring down next dividend bit
• Restoring division
Do the subtract, and if remainder goes < 0, add divisor back
• n-bit operands yield n-bit quotient and remainder
• Signed division
Divide using absolute values
Adjust sign of quotient and remainder as required
(21)
Faster Division
• Can’t use parallel hardware as in multiplier
Subtraction is conditional on sign of remainder
• Faster dividers (e.g. SRT division) generate
multiple quotient bits per step
Still require multiple steps
• Customized implementations for high
performance, e.g., supercomputers
(22)
MIPS Division
• Use HI/LO registers for result
HI: 32-bit remainder
LO: 32-bit quotient
• Instructions
div rs, rt / divu rs, rt
o No overflow or divide-by-0 checking
o Software must perform checks if required
Use mfhi, mflo to access result
Study Exercise: Check out signed
and unsigned division with QtSPIM
(23)
ISA View
[Diagram: CPU/Core with the register file $0–$31, the ALU, and separate multiply and divide units writing the Hi and Lo registers.]
• Additional function units and registers (Hi/Lo)
• Additional instructions to move data to/from
these registers
mfhi, mflo
• What other instructions would you add? Cost?
(24)
Floating Point(3.5)
• Representation for non-integral numbers
Including very small and very large numbers
• Like scientific notation
–2.34 × 10^56      (normalized)
+0.002 × 10^–4     (not normalized)
+987.02 × 10^9     (not normalized)
• In binary
±1.xxxxxxx₂ × 2^yyyy
• Types float and double in C
(25)
IEEE 754 Floating-point Representation
Single Precision (32-bit)
bit 31: S (1 bit) | bits 30–23: exponent (8 bits) | bits 22–0: significand (23 bits)
(–1)^S × (1 + fraction) × 2^(exponent – 127)
Double Precision (64-bit)
bit 63: S (1 bit) | bits 62–52: exponent (11 bits) | bits 51–32: significand (20 bits)
bits 31–0: significand (continued, 32 bits)
(–1)^S × (1 + fraction) × 2^(exponent – 1023)
(26)
Floating Point Standard
• Defined by IEEE Std 754-1985
• Developed in response to divergence of
representations
Portability issues for scientific code
• Now almost universally adopted
• Two representations
Single precision (32-bit)
Double precision (64-bit)
(27)
FP Adder Hardware
• Much more complex than integer adder
• Doing it in one clock cycle would take too long
Much longer than integer operations
Slower clock would penalize all instructions
• FP adder usually takes several cycles
Can be pipelined
Example: FP Addition
(28)
FP Adder Hardware
[Diagram: the FP adder data path in four steps — align the significands, add, normalize, round.]
(29)
FP Arithmetic Hardware
• FP multiplier is of similar complexity to FP
adder
But uses a multiplier for significands instead of an
adder
• FP arithmetic hardware usually does
Addition, subtraction, multiplication, division,
reciprocal, square-root
FP integer conversion
• Operations usually take several cycles
Can be pipelined
(30)
ISA Impact
• FP hardware is coprocessor 1
Adjunct processor that extends the ISA
• Separate FP registers
32 single-precision: $f0, $f1, … $f31
Paired for double-precision: $f0/$f1, $f2/$f3, …
o Release 2 of MIPS ISA supports 32 × 64-bit FP reg's
• FP instructions operate only on FP registers
Programs generally do not perform integer ops on FP
data, or vice versa
More registers with minimal code-size impact
(31)
ISA View: The Co-Processor
[Diagram: CPU/Core (registers $0–$31, ALU, multiply/divide units with Hi/Lo) alongside Co-Processor 1 (FP registers $0–$31 and the FP ALU) and Co-Processor 0 (BadVaddr, Status, Cause, EPC — covered later).]
• Floating point operations access a separate set
of 32-bit registers
Pairs of 32-bit registers are used for double precision
(32)
ISA View
• Distinct instructions operate on the floating
point registers (pg. A-73)
Arithmetic instructions
o add.d fd, fs, ft (double precision) and add.s fd, fs, ft (single precision)
• Data movement to/from floating point coprocessors
mfc1 rt, fs and mtc1 rd, fs
• Note that the ISA design implementation is
extensible via co-processors
• FP load and store instructions
lwc1, ldc1, swc1, sdc1
o e.g., ldc1 $f8, 32($sp)
Example: DP Mean
(33)
Associativity
• Floating point arithmetic is not associative
• Parallel programs may interleave operations in unexpected orders
Assumptions of associativity may fail
Example: x = –1.50E+38, y = 1.50E+38, z = 1.0
(x + y) + z = 0.00E+00 + 1.0 = 1.00E+00
x + (y + z) = –1.50E+38 + 1.50E+38 = 0.00E+00
Need to validate parallel programs under varying
degrees of parallelism
(34)
Performance Issues
• Latency of instructions
Integer instructions can take a single cycle
Floating point instructions can take multiple cycles
Some (FP Divide) can take hundreds of cycles
• What about energy (we will get to that shortly)
• What other instructions would you like in
hardware?
Would some applications change your mind?
• How do you decide whether to add new
instructions?
(35)
Multimedia (3.6, 3.7, 3.8)
• Lower dynamic range and precision
requirements
Do not need 32-bits!
• Inherent parallelism in the operations
(36)
Vector Computation
• Operate on multiple data elements (vectors) at
a time
• Flexible definition/use of registers
Registers hold integers, floats (SP), doubles (DP)
128-bit Register:
1 × 128-bit integer
2 × 64-bit double precision
4 × 32-bit single precision
8 × 16-bit short integers
(37)
Processing Vectors
• When is this more efficient?
[Diagram: vectors streamed between memory and the vector registers.]
• When is this not efficient?
• Think of 3D graphics, linear algebra and media
processing
(38)
Case Study: Intel Streaming SIMD
Extensions
• Eight 128-bit XMM registers
x86-64 adds 8 more registers: XMM8–XMM15
• 8, 16, 32, 64 bit integers (SSE2)
• 32-bit (SP) and 64-bit (DP) floating point
• Signed/unsigned integer operations
• IEEE 754 floating point support
• Reading Assignment:
http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
http://neilkemp.us/src/sse_tutorial/sse_tutorial.html#I
(39)
Instruction Categories
• Floating point instructions
Arithmetic, movement
Comparison, shuffling
Type conversion, bit level
• Integer
• Other
e.g., cache management
• ISA extensions!
• Advanced Vector
Extensions (AVX)
Successor to SSE
(40)
Arithmetic View
• Graphics and media processing operates on vectors of 8-bit and 16-bit data
Use 64-bit adder, with partitioned carry chain
o Operate on 8×8-bit, 4×16-bit, or 2×32-bit vectors
SIMD (single-instruction, multiple-data)
• Saturating operations
On overflow, result is largest representable value
o c.f. 2's-complement modulo arithmetic
E.g., clipping in audio, saturation in video
[Diagram: a partitioned adder operating on 4×16-bit and 2×32-bit vectors.]
(41)
SSE Example
// A 16byte = 128bit vector struct
struct Vector4
{
    float x, y, z, w;
};

// Add two constant vectors and return the resulting vector
Vector4 SSE_Add ( const Vector4 &Op_A, const Vector4 &Op_B )
{
    Vector4 Ret_Vector;
    __asm
    {
        MOV EAX, Op_A              // Load pointers into CPU regs
        MOV EBX, Op_B
        MOVUPS XMM0, [EAX]         // Move unaligned vectors to SSE regs
        MOVUPS XMM1, [EBX]
        ADDPS XMM0, XMM1           // Add vector elements
        MOVUPS [Ret_Vector], XMM0  // Save the return vector
    }
    return Ret_Vector;
}

More complex example (matrix multiply) in Section 3.8 – using AVX
From http://neilkemp.us/src/sse_tutorial/sse_tutorial.html#I
(42)
Characterizing Parallelism
• Characterization due to M. Flynn*
[Diagram: a 2×2 grid of instruction streams × data streams: SISD (today's serial computing cores, the von Neumann model), SIMD (single instruction, multiple data stream computing, e.g., SSE), MISD, and MIMD (today's multicore).]
*M. Flynn (September 1972). "Some Computer Organizations and Their Effectiveness". IEEE Transactions on Computers, C–21 (9): 948–960.
(43)
Parallelism Categories
From http://en.wikipedia.org/wiki/Flynn%27s_taxonomy
(44)
Data Parallel vs. Traditional Vector
Vector Architecture
[Diagram: vector registers A and B feed a pipelined functional unit that writes vector register C.]
Data Parallel Architecture
[Diagram: an array of registers; process each square in parallel – data parallel computation.]
(45)
ISA View
[Diagram: CPU/Core with registers $0–$31, the ALU, multiply/divide units with Hi/Lo, plus a vector ALU operating on the SIMD registers XMM0–XMM15.]
• Separate core data path
• Can be viewed as a co-processor with a distinct
set of instructions
(46)
Domain Impact on the ISA: Example
Scientific Computing
Scientific Computing
• Floats
• Double precision
• Massive data
• Power constrained

Embedded Systems
• Integers
• Lower precision
• Streaming data
• Security support
• Energy constrained
(47)
Summary
• ISAs support operations required of application
domains
Note the differences between embedded and
supercomputers!
Signed, unsigned, FP, SIMD, etc.
• Bounded precision effects
Software must be careful how the hardware is used, e.g., associativity
Need standards to promote portability
• Avoid “kitchen sink” designs
There is no free lunch
Impact on speed and energy (we will get to this later)
(48)
Study Guide
• Perform 2’s complement addition and subtraction
(review)
• Add a few more instructions to the simple ALU
Add an XOR instruction
Add an instruction that returns the max of its inputs
Make sure all control signals are accounted for
• Convert real numbers to single precision floating
point (review) and extract the value from an
encoded single precision number (review)
• Execute the SPIM programs (class website) that
use floating point numbers. Study the
memory/register contents via single step
execution
(49)
Study Guide (cont.)
• Write a few simple SPIM programs for
Multiplication/division of signed and unsigned numbers
o Use numbers that produce >32-bit results
o Move to/from HI and LO registers (find the instructions for doing so)
Addition/subtraction of floating point numbers
• Try to write a simple SPIM program that
demonstrates that floating point operations are
not associative (this takes some thought and
review of the range of floating point numbers)
• Look up additional SIMD instruction sets and
compare
ARM NEON, AltiVec, AMD 3DNow!
(50)
Glossary
• Co-processor
• Data parallelism
• Data parallel computation vs. vector computation
• Instruction set extensions
• Overflow
• MIMD
• Precision
• SIMD
• Saturating arithmetic
• Signed arithmetic support
• Unsigned arithmetic support
• Vector processing
(51)