Computer Structure
0368-2159
Lecture 1
Introduction
Nathan Intrator and Yehuda Afek
Teaching assistants:
Hillel Avni
Noa Ben-Amos
1/ 75
What is computer structure?
Hardware - transistors
Logic circuits
Computer architecture
2/ 75
What we will cover today:
Introduction : Computer Architecture
Administrative Matters
History
From conductors and electricity to basic binary operations in the computer
• Electrical voltage
• Conductors
• Silicon: a semiconductor
• Transistor
• Binary operations in electronic components
3/ 77
Computing Devices Then…
EDSAC, University of Cambridge, UK, 1949
4/ 77
Computing Devices Now
Sensor nets
Cameras
Set-top boxes
Games
Media players
Laptops
Servers
Routers
Smartphones
Automobiles
Robots
Supercomputers
5/ 77
Computer structure: what is it?
6/ 77
7/ 77
Mother board
8/ 77
9/ 77
The paradigm (Patterson)
Every computer scientist should master the “AAA”:
• Architecture
• Algorithms
• Applications
10/ 77
Computer Architecture: GOAL
Fast, Effective and Cheap
The goal of Computer Architecture
To build “cost-effective systems”
• How do we calculate the cost of a system?
• How do we evaluate the effectiveness of the system?
To optimize the system
• What are the optimization points?
Fact: most computer systems still use the von Neumann principle of operation, even though internally they are very different from the computers of that time.
11/ 77
Anatomy: 5 components of any Computer (since 1946)
Personal Computer
• Processor
  - Control (“brain”)
  - Datapath (“brawn”)
• Memory (where programs, data live when running)
• Devices
  - Input: Keyboard, Mouse
  - Output: Display, Printer
  - Disk (where programs, data live when not running)
12/ 77
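Computer System Structure
[Figure: block diagram of a computer system. Components shown: CPU, Cache, CPU bus, Bridge, Memory bus, Memory, I/O bus. Adapters on the I/O bus: SCSI/IDE adapter (SCSI bus, hard disk), LAN adapter (LAN), USB hub (keyboard, mouse, scanner), graphics adapter (video buffer).]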
13/ 77
The Instruction Set: a Critical Interface
software
instruction set
hardware
14/ 77
What is “Computer Architecture”?
Computer Architecture =
Instruction Set Architecture +
Machine Organization + …
= architecture + engineering
15/ 77
What are “Machine Structures”?
Software
• Application (ex: browser)
• Compiler, Assembler
• Operating System (Linux, Win, ..)
Instruction Set Architecture
Hardware
• Processor, Memory, I/O system
• Datapath & Control
• Digital Design
• Circuit Design
• Transistors
• Physics
Computer Structure: coordination of many levels (layers) of abstraction
16/ 77
Levels of Representation
High Level Language Program:
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
Compiler
Assembly Language Program:
    lw $15, 0($2)
    lw $16, 4($2)
    sw $16, 0($2)
    sw $15, 4($2)
Assembler
Machine Language Program:
    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111
Machine Interpretation
Control Signal Specification:
    ALUOP[0:3] <= InstReg[9:11] & MASK
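As a rough illustration of what the four assembly instructions on this slide do, here is a minimal Python sketch of a toy machine. The `run_swap` helper, the dictionary-based register file and memory, and the base address used in the demo are assumptions made for illustration; only the instruction sequence comes from the slide.

```python
def run_swap(memory, base):
    """Execute the slide's four MIPS instructions on a toy machine.

    memory: dict mapping byte addresses to words (hypothetical model).
    base:   byte address of v[k], assumed to be held in register $2.
    """
    regs = {2: base, 15: 0, 16: 0}

    regs[15] = memory[regs[2] + 0]   # lw $15, 0($2)  -> $15 = v[k]
    regs[16] = memory[regs[2] + 4]   # lw $16, 4($2)  -> $16 = v[k+1]
    memory[regs[2] + 0] = regs[16]   # sw $16, 0($2)  -> v[k]   = old v[k+1]
    memory[regs[2] + 4] = regs[15]   # sw $15, 4($2)  -> v[k+1] = old v[k]
    return memory


if __name__ == "__main__":
    mem = {100: 7, 104: 9}           # v[k] = 7, v[k+1] = 9 at an arbitrary address
    print(run_swap(mem, base=100))   # the two words end up exchanged
```

The two loads followed by two stores play the role of the `temp` variable in the high-level version: both old values sit in registers before either memory word is overwritten.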
17/ 77
Computer Architecture’s Changing Definition
1950s to 1960s Computer Architecture Course
• Computer Arithmetic
1970s to mid 1980s Computer Architecture Course
• Instruction Set Design, especially ISA appropriate for
compilers
1990s Computer Architecture Course
• Design of CPU, memory system, I/O system, Multiprocessors, Networks
2000s Computer Architecture Course:
• Special purpose architectures, Functionally reconfigurable,
Special considerations for low power/mobile processing
2005 – future (?): Multiprocessors, Parallelism
• Synchronization, Speed-up, How to Program ??? !!!
18/ 77
Forces on Computer Architecture
• Technology
• Programming Languages
• Applications
• Operating Systems
• History
• Cleverness
19/ 77
Computers in the News: Sony Playstation 2000
The Playstation 3 will deliver nearly 2 teraflops
overall performance, said Ken Kutaragi, president
and group CEO of Sony Computer Entertainment
As reported in Microprocessor Report, Vol 13, No. 5:
• Emotion Engine: 6.2 GFLOPS, 75 million polygons per second
• Graphics Synthesizer: 2.4 Billion pixels per second
• Claim: Toy Story realism brought to games!
20/ 77
Where are We Going??
[Figure: course roadmap with “Computer Structure” at the center, surrounded by the topics Arithmetic (a Booth-encoded multiplier datapath with a 34-bit ALU and HI/LO registers), Single/multicycle Datapaths, Pipelining (IFetch, Dcd, Exec, Mem, WB stages), Memory Systems, I/O, and Performance. The Performance panel plots the processor-memory performance gap over time, 1980 to 2000: µProc 60%/yr. (2X/1.5 yr, “Moore’s Law”) versus DRAM 9%/yr. (2X/10 yrs); the gap grows 50% / year.]
21/ 77
A slide from one of the lectures toward the end of the semester
22/ 77
Course Administration
Instructors:
Yehuda Afek ([email protected])
Nathan Intrator ([email protected])
TA:
Hillel Avni ([email protected] )
Noa Ben Amos ([email protected])
http://cs.tau.ac.il/~nin/Courses/CompStruct/CompStruct.htm
http://virtual.tau.ac.il
Books:
1. V. C. Hamacher, Z. G. Vranesic, S. G. Zaky, Computer Organization. McGraw-Hill, 1982
2. H. Taub, Digital Circuits and Microprocessors. McGraw-Hill, 1982
3. Digital Systems, published by the Open University
4. Patterson and Hennessy, Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann, 1998
23/ 77
Grading
Grade:
Final exam: 80%
Exercises: 20%
7 exercises
24/ 77
Architecture & Microarchitecture Elements
Architecture:
• Register data width (8/16/32/64)
• Instruction set
• Addressing modes
• Addressing methods (Segmentation, Paging, etc...)
Microarchitecture:
• Physical memory size
• Cache size and structure
• Number of execution units, number of execution pipelines
• Branch prediction
• TLB
Timing is considered microarchitecture (though it is user visible!)
Processors with the same architecture may have different microarchitectures
25/ 77
Compatibility
Backward compatibility
– New hardware can run existing software
– Example: Pentium 4 can run software originally written for
Pentium III, Pentium II, Pentium, 486, 386, 286
Forward compatibility
– New software can run on existing (old) hardware
– Example: new software written with MMX™ must still run on
older Pentium processors which do not support MMX™
– Less important than backward compatibility
New ideas: architecture independent
– JIT – just in time compiler: Java and .NET
– Binary translation
26/ 77
How do we compare different systems?
27/ 77
Benchmarks – Programs for Evaluating
Processor Performance
Toy Benchmarks
– 10-100 line programs
– e.g.: sieve, puzzle, quicksort
Synthetic Benchmarks
– Attempt to match average frequencies of real workloads
– e.g., Winstone, Dhrystone
Real programs
– e.g., gcc, spice
SPEC: System Performance Evaluation Cooperative
– SPECint (8 integer programs)
– and SPECfp (10 floating point)
28/ 77
CPI – to compare systems with the same instruction set architecture (ISA)
The CPU is synchronous - it works according to a clock signal.
• Clock cycle is measured in nsec (10^-9 of a second).
• Clock rate (= 1/clock cycle) is measured in MHz (10^6 cycles/second).
CPI - cycles per instruction
• Average #cycles per instruction (in a given program)
$$\text{CPI} = \frac{\#\text{cycles required to execute the program}}{\#\text{instructions executed in the program}}$$
• IPC (= 1/CPI): instructions per cycle
Clock rate is mainly affected by technology, CPI by the architecture
CPI breakdown: how many cycles (on average) the program spends on different causes; e.g., executing, memory, I/O, etc.
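The CPI definition and the idea of a CPI breakdown can be sketched in a few lines of Python. The function names and the cycle counts in the demo are hypothetical and only serve to show how the quantities relate.

```python
def cpi(total_cycles, instruction_count):
    """Average cycles per instruction for one program run."""
    return total_cycles / instruction_count


def ipc(total_cycles, instruction_count):
    """Instructions per cycle, the reciprocal of CPI."""
    return instruction_count / total_cycles


def cpi_breakdown(cycles_by_cause, instruction_count):
    """Split CPI into per-cause contributions that sum to the total CPI."""
    return {cause: cycles / instruction_count
            for cause, cycles in cycles_by_cause.items()}


if __name__ == "__main__":
    # Hypothetical measurements for a program of 1,000,000 instructions.
    cycles = {"execute": 1_000_000, "memory": 400_000, "io": 100_000}
    n = 1_000_000
    total = sum(cycles.values())
    print("CPI:", cpi(total, n))
    print("IPC:", ipc(total, n))
    print("breakdown:", cpi_breakdown(cycles, n))
```

Because every cycle is attributed to exactly one cause, the per-cause contributions add up to the overall CPI.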
29/ 77
CPU Time
– The time required by the CPU to execute a given program:
$$\text{CPU Time} = \text{clock cycle} \times \#\text{cyc} = \text{clock cycle} \times \text{CPI} \times \text{IC}$$
Our goal: minimize CPU Time
– Minimize clock cycle: more MHz (process, circuit, microarchitecture)
– Minimize CPI: microarchitecture (e.g.: more execution units)
– Minimize IC: architecture (e.g.: MMX™ technology)
Speedup due to enhancement E
$$\text{Speedup}_E = \frac{\text{ExTime}_{\text{w/o } E}}{\text{ExTime}_{\text{w/ } E}} = \frac{\text{Performance}_{\text{w/ } E}}{\text{Performance}_{\text{w/o } E}}$$
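A small Python sketch of the CPU time product and the speedup ratio follows. The function names and the numbers in the demo (a 1 ns clock cycle, the two CPI values, the instruction count) are assumptions chosen only to exercise the formulas.

```python
def cpu_time(clock_cycle_s, cpi, instruction_count):
    """CPU Time = clock cycle x CPI x IC, in seconds."""
    return clock_cycle_s * cpi * instruction_count


def speedup(time_without, time_with):
    """Speedup of an enhancement: execution time without it over time with it."""
    return time_without / time_with


if __name__ == "__main__":
    # Hypothetical machine: 1 ns clock cycle, one billion instructions.
    base = cpu_time(1e-9, 2.0, 1_000_000_000)      # baseline CPI of 2.0
    enhanced = cpu_time(1e-9, 1.6, 1_000_000_000)  # enhancement lowers CPI to 1.6
    print("baseline time (s):", base)
    print("enhanced time (s):", enhanced)
    print("speedup:", speedup(base, enhanced))
```

With the clock cycle and instruction count unchanged, the speedup reduces to the ratio of the two CPI values.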
31/ 77
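Amdahl’s Law
Suppose that enhancement E accelerates a fraction F of the
task by a factor S, and the remainder of the task is
unaffected, then:
$$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[(1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}\right]$$
$$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$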
32/ 77
Amdahl’s Law: Example
• Floating point instructions improved to run 2X faster; but only 10% of actual instructions are FP
$$\text{ExTime}_{new} = \text{ExTime}_{old} \times (0.9 + 0.1/2) = 0.95 \times \text{ExTime}_{old}$$
$$\text{Speedup}_{overall} = \frac{1}{0.95} = 1.053$$
Corollary:
Make The Common Case Fast
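The floating point example can be checked with a short Python sketch of Amdahl's law. The function name `amdahl_speedup` is an illustrative choice, not something defined in the lecture.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of the task is accelerated by a factor."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


if __name__ == "__main__":
    # The slide's example: 10% of instructions are FP, FP runs 2X faster.
    print(round(amdahl_speedup(0.1, 2), 3))
    # Even an arbitrarily large FP speedup cannot beat 1 / 0.9.
    print(amdahl_speedup(0.1, 1e12))
```

The second call shows why the corollary matters: speeding up a rare case without limit still leaves the overall speedup bounded by the untouched 90%.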
33/ 77
Instruction Set Design
software
The ISA is what the user
and the compiler see
instruction set
The ISA is what the
hardware needs to
implement
hardware
34/ 77
Why is the ISA important?
Code size
• Long instructions may take more time to be fetched
• Requires large memory (important in small devices, e.g., cell phones)
Number of instructions (IC)
• Reducing IC reduces execution time (assuming same CPI and frequency)
Code “simplicity”
• Simple HW implementation, which leads to higher frequency and lower power
• Code optimization can better be applied to “simple code”
35/ 77
The impact of the ISA
RISC vs CISC
36/ 77
CISC Processors
CISC - Complex Instruction Set Computer
The idea: a high level machine language
Characteristics
• Many instruction types, with many addressing modes
• Some of the instructions are complex:
- Perform complex tasks
- Require many cycles
• ALU operations directly on memory
- Usually uses a limited number of registers
• Variable length instructions
- Common instructions get short encodings, which saves code length
Example: x86
37/ 77
CISC Drawbacks
Compilers do not take advantage of the complex instructions and
the complex indexing methods
Implementing complex instructions and complex addressing modes:
• complicates the processor
• slows down the simple, common instructions
• contradicts Amdahl’s law corollary: Make The Common Case Fast
Variable length instructions are a real pain in the neck:
• It is difficult to decode several instructions in parallel
- As long as an instruction is not decoded, its length is unknown
- It is unknown where the instruction ends
- It is unknown where the next instruction starts
• An instruction may not fit into the “right behavior” of the
memory hierarchy (will be discussed in the next lectures)
Examples: VAX, x86 (!?!)
38/ 77
RISC Processors
RISC - Reduced Instruction Set Computer
The idea: simple instructions enable fast hardware
Characteristics
• A small instruction set, with only a few instruction formats
• Simple instructions
- execute simple tasks
- require a single cycle (with pipeline)
• A few indexing methods
• ALU operations on registers only
- Memory is accessed using Load and Store instructions only.
- Many orthogonal registers
- Three address machine:
Add dst, src1, src2
• Fixed length instructions
Examples: MIPS™, SPARC™, Alpha™, PowerPC™
39/ 77
RISC Processors (Cont.)
Simple architecture ⇒ simple microarchitecture
• Simple, small and fast control logic
• Simpler to design and validate
• Room for on-die caches: instruction cache + data cache
- Parallelize data and instruction access
• Shorten time-to-market
Using a smart compiler
• Better pipeline usage
• Better register allocation
Existing RISC processors are not “pure” RISC
• e.g., support division, which takes many cycles
40/ 77
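RISC and Amdahl’s Law (Example)
Compared to the CISC architecture:
• 10% of the static code, which accounts for 90% of the dynamic instructions, has the same CPI
• 90% of the static code, which is only 10% of the dynamic instructions, has its CPI increased by 60%
• The number of instructions executed increases by 50%
• The speed of the processor is doubled
- This was true at the time the RISC processors were invented
We get (with Fraction_enhanced = 0.1 and Speedup_enhanced = 1/1.6, i.e. a slowdown)
$$\frac{\text{CPI}_{new}}{\text{CPI}_{old}} = (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} = 0.9 + 0.1 \times 1.6 = 1.06$$
And then
$$\text{Speedup}_{overall} = \frac{\text{CPU Time}_{old}}{\text{CPU Time}_{new}} = \frac{\text{clock cycle}_{old}}{\text{clock cycle}_{new}} \times \frac{\text{CPI}_{old}}{\text{CPI}_{new}} \times \frac{\text{IC}_{old}}{\text{IC}_{new}} = \frac{2}{1.06 \times 1.5} = 1.26$$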
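The arithmetic of this example can be reproduced with a short Python sketch. The helper names are hypothetical; the inputs (90% of dynamic instructions unchanged, a 60% CPI increase on the rest, 50% more instructions, a doubled clock) are the slide's figures.

```python
def cpi_ratio(unchanged_fraction, cpi_growth_on_rest):
    """CPI_new / CPI_old when part of the dynamic instructions keeps its CPI
    and the remainder has its CPI multiplied by cpi_growth_on_rest."""
    return unchanged_fraction + (1.0 - unchanged_fraction) * cpi_growth_on_rest


def overall_speedup(clock_speedup, cpi_new_over_old, ic_new_over_old):
    """CPU Time_old / CPU Time_new from the three ratios of the CPU time product."""
    return clock_speedup / (cpi_new_over_old * ic_new_over_old)


if __name__ == "__main__":
    ratio = cpi_ratio(0.9, 1.6)
    print(round(ratio, 2))
    print(round(overall_speedup(2.0, ratio, 1.5), 2))
```

Each argument corresponds to one factor of CPU Time = clock cycle × CPI × IC, so the doubled clock is partly cancelled by the higher CPI and instruction count.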
41/ 77
So, which is better, RISC or CISC?
Today CISC architectures (x86) run as fast as RISC (or even faster)
The main reasons are:
• CISC instructions are translated into RISC instructions (ucode)
• CISC architectures use a “RISC-like engine”
We will discuss this kind of solution later on in this course.
42/ 77
Technology Trends: Microprocessor Complexity
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year (1970 to 2000), tracing i4004, i8080, i8086, i80286, i80386, i80486 and Pentium]
2X transistors/chip every 1.5 years, called “Moore’s Law”
• Itanium 2: 410 million
• Athlon (K7): 22 million
• Alpha 21264: 15 million
• Pentium Pro: 5.5 million
• PowerPC 620: 6.9 million
• Alpha 21164: 9.3 million
• Sparc Ultra: 5.2 million
43/ 77
44/ 77
45/ 77
Technology Trends: Processor Performance
[Figure: performance measure (0 to 900) versus year (1987 to 1997), growing 1.54X/yr. Labeled points: IBM POWER 100, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, DEC Alpha 21264/600, Intel P4 2000 MHz (Fall 2001)]
46/ 77
Technology Trends: Memory Capacity (Single-Chip DRAM)
[Figure: DRAM size in bits (log scale, 1,000 to 1,000,000,000) versus year (1970 to 2000)]
• Now 1.4X/yr, or 2X every 2 years.
• 8000X since 1980!
year    size (Mbit)
1980    0.0625
1983    0.25
1986    1
1989    4
1992    16
1996    64
1998    128
2000    256
2002    512
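The growth figures quoted on this slide can be recomputed from the year/size table. The sketch below is only a sanity check; the dictionary name and helper functions are illustrative, while the data points are the slide's.

```python
# Single-chip DRAM capacity in Mbit, by year, as listed on the slide.
DRAM_MBIT = {
    1980: 0.0625, 1983: 0.25, 1986: 1, 1989: 4, 1992: 16,
    1996: 64, 1998: 128, 2000: 256, 2002: 512,
}


def total_growth(start, end):
    """Capacity ratio between two years from the table."""
    return DRAM_MBIT[end] / DRAM_MBIT[start]


def annual_growth(start, end):
    """Geometric-mean growth factor per year between two table entries."""
    return total_growth(start, end) ** (1.0 / (end - start))


if __name__ == "__main__":
    print("1980 to 2002, total:", total_growth(1980, 2002))
    print("1980 to 1992, per year:", round(annual_growth(1980, 1992), 2))
    print("1996 to 2002, per year:", round(annual_growth(1996, 2002), 2))
```

The total ratio of 8192 matches the rounded “8000X since 1980”. The early entries grow by 4X every three years (about 1.59X per year), while the 1996 to 2002 entries double every two years (about 1.41X per year), in line with the “Now 1.4X/yr” bullet.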
47/ 77
Technology Trends Imply Dramatic Change
Processor
• Logic capacity: about 30% per year
• Clock rate: about 20% per year
Memory
• DRAM capacity: about 60% per year (4x every 3 years)
• Memory speed: about 10% per year
• Cost per bit: improves about 25% per year
Disk
• Capacity: about 60% per year
• Total data use: 100% per 9 months!
Network Bandwidth
• Bandwidth increasing more than 100% per year!
48/ 77
1980-2003, CPU-DRAM Speed Gap
[Figure: performance (1/latency) versus year. CPU: 60% per yr (2X in 1.5 yrs), up to the power wall. DRAM: 9% per yr (2X in 10 yrs). The gap grew 50% per year.]
Q. How do architects address this gap?
A. Put smaller, faster “cache” memories between CPU and DRAM.
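The “gap grew 50% per year” figure follows, approximately, from the two growth rates on this slide: (1.60 / 1.09) − 1 is roughly 0.47. A minimal Python sketch of that calculation, with hypothetical function names and the slide's rates as defaults:

```python
def gap_growth_per_year(cpu_rate=0.60, dram_rate=0.09):
    """Yearly relative growth of the CPU/DRAM performance ratio."""
    return (1.0 + cpu_rate) / (1.0 + dram_rate) - 1.0


def gap_after(years, cpu_rate=0.60, dram_rate=0.09):
    """CPU/DRAM performance ratio after some years, starting from parity."""
    return ((1.0 + cpu_rate) / (1.0 + dram_rate)) ** years


if __name__ == "__main__":
    print("gap growth per year:", round(gap_growth_per_year(), 2))
    for years in (5, 10, 20):
        print(years, "years:", round(gap_after(years), 1))
```

Because the gap compounds every year, even a modest difference in yearly rates turns into orders of magnitude over two decades, which is the motivation for caches.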
49/ 77
Dimensions
[Figure: logarithmic length scale from 1 cm down to 1 Å, with the following markers]
• Chip size: 1 cm
• Diameter of human hair: 25 µm
• 1996 devices: 0.35 µm
• Deep UV wavelength: 0.248 µm
• 2001 devices: 0.18 µm
• 2005: 0.12 × 10^-6 m = 1.2 × 10^-7 m
• 2006: 0.04 × 10^-6 m
• 2007 devices: 0.01 µm
• X-ray wavelength: 0.6 nm
• Silicon atom radius: 1.17 Å
50/ 77
Computer architecture in the coming years
Past: energy / power consumption was a non-issue.
Today: Power Wall: power is expensive, transistors are free.
Past: performance improved through instruction-level parallelism, smart compilers, and single-CPU architectures (pipelining, superscalar, out-of-order execution, speculation).
Today: ILP Wall: hardware enhancements for better performance no longer pay off.
Past: multiplication was slow, memory access was fast.
Today: Memory Wall: multiplication is fast, memory accesses are slow (200 clock cycles for DRAM, 4 cycles for a multiply).
Past: single-processor performance 2X every 1.5 years.
Today: all of the above: perhaps 2X every 5 years??
But 2X processors (cores) every two years. Today: 4 to 40 cores per processor.
51/ 77
Physics / Transistor’s History
1906: Audion (Triode), Lee De Forest
1947: First point contact transistor (germanium), John Bardeen and Walter Brattain, Bell Laboratories
52/ 77
History
1958: First integrated circuit (germanium), Jack S. Kilby, Texas Instruments
• Contained five components of three types: transistors, resistors and capacitors
1997: Intel Pentium II
• Clock: 233 MHz
• Number of transistors: 7.5 M
• Gate length: 0.35 µm
53/ 77
Annual Sales
10^18 transistors manufactured in 2003 alone
• 100 million for every human on the planet
[Figure: global semiconductor billings (billions of US$, 0 to 200) versus year (1982 to 2002)]
54/ 77
55/ 77
56/ 77
57/ 77
58/ 77
Integrated Circuits (2003 state-of-the-art)
Bare Die
Chip in Package
Primarily Crystalline Silicon
1mm - 25mm on a side
2003 - feature size ~ 0.13 µm = 0.13 × 10^-6 m
100 - 400M transistors
(25 - 100M “logic gates")
3 - 10 conductive layers
“CMOS” (complementary metal oxide
semiconductor) - most common.
Package provides:
• spreading of chip-level signal paths to board-level
• heat dissipation.
Ceramic or plastic with gold wires.
59/ 77
Printed Circuit Boards
fiberglass or ceramic
1-20 conductive layers
1-20in on a side
IC packages are soldered down.
60/ 77
nMOS Transistor
Four terminals: gate, source, drain, body
Gate – oxide – body stack looks like a capacitor
• Gate and body are conductors
• SiO2 (oxide) is a very good insulator
• Called metal – oxide – semiconductor (MOS) capacitor
• Even though gate is no longer made of metal
[Figure: nMOS transistor, 3D view and cross-section. Polysilicon gate over SiO2 gate oxide (good insulator, ε_ox = 3.9 ε_0, thickness t_ox); n+ source and drain regions in a p-type body (bulk Si); channel width W and length L]
61/ 77
nMOS Operation
Body is commonly tied to ground (0 V)
When the gate is at a low voltage:
• P-type body is at low voltage
• Source-body and drain-body diodes are OFF
• No current flows, transistor is OFF
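[Figure: nMOS cross-section with the gate at 0. Source and drain are n+ regions in the p-type bulk Si, under a polysilicon gate and SiO2; no channel forms, transistor is OFF]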
62/ 77
nMOS Operation Cont.
When the gate is at a high voltage:
• Positive charge on gate of MOS capacitor
• Negative charge attracted to body
• Inverts a channel under gate to n-type
• Now current can flow through n-type silicon from source
through channel to drain, transistor is ON
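[Figure: nMOS cross-section with the gate at 1. An n-type channel forms under the SiO2 between the n+ source and drain in the p-type bulk Si; transistor is ON]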
63/ 77
pMOS Transistor
Similar, but doping and voltages reversed
• Body tied to high voltage (VDD)
• Gate low: transistor ON
• Gate high: transistor OFF
• Bubble indicates inverted behavior
[Figure: pMOS cross-section. Polysilicon gate over SiO2; p+ source and drain regions in an n-type body (bulk Si)]
64/ 77
65/ 77
Example: Inverter
66/ 77
Example: NAND3
Horizontal N-diffusion and p-diffusion strips
Vertical polysilicon gates
Metal1 VDD rail at top
Metal1 GND rail at bottom
32 λ by 40 λ
67/ 77
68/ 77
69/ 77
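CMOS Inverter
A | Y
0 |
1 |
[Figure: CMOS inverter schematic and symbol. Input A drives both transistor gates; output Y sits between the pMOS connected to VDD and the nMOS connected to GND]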
70/ 77
CMOS Inverter
A | Y
0 |
1 | 0
[Figure: CMOS inverter with A=1. The pMOS (to VDD) is OFF, the nMOS (to GND) is ON, so Y=0]
71/ 77
CMOS Inverter
A | Y
0 | 1
1 | 0
[Figure: CMOS inverter with A=0. The pMOS (to VDD) is ON, the nMOS (to GND) is OFF, so Y=1]
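The inverter slides can be summarized as a switch-level model: an nMOS conducts when its gate is high, a pMOS when its gate is low, and exactly one of the two connects Y to a supply rail. The Python sketch below is an idealized illustration with hypothetical function names; it ignores all analog behavior.

```python
def nmos_conducts(gate):
    """Idealized nMOS switch: ON when the gate is at logic 1."""
    return gate == 1


def pmos_conducts(gate):
    """Idealized pMOS switch: ON when the gate is at logic 0."""
    return gate == 0


def cmos_inverter(a):
    """Y is pulled to VDD through the pMOS or to GND through the nMOS."""
    pull_up = pmos_conducts(a)     # pMOS between VDD and Y
    pull_down = nmos_conducts(a)   # nMOS between Y and GND
    # In a static CMOS gate exactly one network conducts for a valid input.
    assert pull_up != pull_down
    return 1 if pull_up else 0


if __name__ == "__main__":
    for a in (0, 1):
        print("A =", a, "-> Y =", cmos_inverter(a))
```

The internal assertion mirrors the property shown on the slides: the two transistors are never ON at the same time, so the output is always driven and there is no direct path from VDD to GND.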
72/ 77
73/ 77
74/ 77
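Multiplexers
2:1 multiplexer chooses between two inputs
S  D1  D0 | Y
0  X   0  |
0  X   1  |
1  0   X  |
1  1   X  |
[Figure: 2:1 multiplexer symbol with select S, data inputs D0 (input 0) and D1 (input 1), and output Y]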
75/ 77
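Multiplexers
2:1 multiplexer chooses between two inputs
S  D1  D0 | Y
0  X   0  | 0
0  X   1  | 1
1  0   X  | 0
1  1   X  | 1
[Figure: 2:1 multiplexer symbol with select S, data inputs D0 (input 0) and D1 (input 1), and output Y]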
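The truth table corresponds to the Boolean expression Y = (¬S ∧ D0) ∨ (S ∧ D1). A minimal Python sketch follows, with both a behavioral and a gate-level version; the function names are illustrative only.

```python
from itertools import product


def mux2(s, d0, d1):
    """Behavioral 2:1 multiplexer: select D0 when S = 0, D1 when S = 1."""
    return d1 if s else d0


def mux2_gates(s, d0, d1):
    """Gate-level form: Y = (NOT S AND D0) OR (S AND D1), on 0/1 values."""
    return ((1 - s) & d0) | (s & d1)


if __name__ == "__main__":
    print("S D1 D0 | Y")
    for s, d1, d0 in product((0, 1), repeat=3):
        print(s, d1, d0, "|", mux2(s, d0, d1))
```

Enumerating all eight input combinations expands the “X” (don't care) rows of the slide's table: the unselected input never affects Y.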
76/ 77
Transmission Gate Mux
Nonrestoring mux uses two transmission gates
• Only 4 transistors
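[Figure: two transmission gates, one between D0 and Y and one between D1 and Y, controlled by S and its complement]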
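A transmission gate either passes its input or leaves its output floating, and the mux works because S and its complement enable exactly one of the two gates. The sketch below models a floating output as `None`; that representation and the function names are assumptions made for illustration, and "nonrestoring" effects (signal degradation) are not modeled.

```python
def transmission_gate(enable, data):
    """Pass data when enabled; otherwise the output floats (modeled as None)."""
    return data if enable else None


def tgate_mux(s, d0, d1):
    """2:1 mux built from two transmission gates sharing the output node Y."""
    upper = transmission_gate(1 - s, d0)   # conducts when S = 0
    lower = transmission_gate(s, d1)       # conducts when S = 1
    drivers = [v for v in (upper, lower) if v is not None]
    # Exactly one gate drives Y, so there is no contention and no floating node.
    assert len(drivers) == 1
    return drivers[0]


if __name__ == "__main__":
    for s in (0, 1):
        for d1 in (0, 1):
            for d0 in (0, 1):
                print("S =", s, "D1 =", d1, "D0 =", d0, "-> Y =", tgate_mux(s, d0, d1))
```

Each transmission gate is one nMOS and one pMOS in parallel, which is where the slide's count of four transistors comes from.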
77/ 77
78/ 77
What we learned today
Computer Architecture: integrates several levels,
from programming languages to logic design.
Instruction Set Architecture (ISA)
Amdahl’s law
Moore’s law
Processor (CPU) --- Memory speed gap
History
Transistors. What, and how.
From transistors to logic design
79/ 77