Parallel Processing Systems


Chapter 1 Uniprocessor Architecture
Overview
2015-07-16
1.1 A uniprocessor model
Structure of a typical uniprocessor computer system.
– Memory
– ALU
– CU (control unit)
– I/O unit
Von-Neumann Architecture
Memory Unit
• A single port device
– MAR
– MBR (or MDR)
– Word: the unit of data that can be read or written.
CPU
• ALU
– ACC
• CU
– PC
– IR
– A set of registers
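The interplay of the registers listed above can be sketched as a toy accumulator machine. This is a minimal illustration only: the four opcodes, the (opcode, operand) word format, and the function name `run` are assumptions made for the example and do not come from the slides.

```python
# Toy accumulator machine. The opcodes and the (opcode, operand) word
# format are hypothetical, chosen only to show how the registers interact.
LOAD, ADD, STORE, HALT = range(4)


def run(memory):
    """Execute from address 0 until HALT and return the final ACC value.

    `memory` is one list of words holding both instructions (tuples)
    and data (ints), as in a single-memory von Neumann machine.
    """
    pc = 0   # PC: address of the next instruction
    acc = 0  # ACC: accumulator attached to the ALU
    while True:
        # Fetch: MAR <- PC, MBR <- memory[MAR], IR <- MBR, PC <- PC + 1
        mar = pc
        mbr = memory[mar]
        ir = mbr
        pc += 1

        # Decode and execute
        opcode, operand = ir
        if opcode == HALT:
            return acc
        mar = operand
        if opcode == LOAD:
            mbr = memory[mar]
            acc = mbr
        elif opcode == ADD:
            mbr = memory[mar]
            acc = acc + mbr
        elif opcode == STORE:
            mbr = acc
            memory[mar] = mbr
        else:
            raise ValueError(f"unknown opcode {opcode}")


# Words 0-3 hold the program, words 4-6 hold data.
program = [(LOAD, 4), (ADD, 5), (STORE, 6), (HALT, 0), 2, 3, 0]
result = run(program)
```

Every memory access goes through MAR and MBR, which is what makes the memory a single-port device in this model.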
General Computer Structure (Figure 1.2)
General Computer Structure
• It shows a more generalized computer system structure.
– Address bus
– Data bus
– Control bus
– Device interfaces
• Multiple bus structure vs. single bus structure
– To allow simultaneous operations on the buses
• Higher throughput
• Complexity of structure
– Speed-for-cost tradeoff
1.1 A uniprocessor model (continued)
• The characteristics of von Neumann model
– Programs and data are stored in a single sequential memory.
– There is no explicit distinction between data and instruction
representation in the memory.
– The memory is a one-dimensional array, so data structures such as
multidimensional arrays must be linearized for representation.
– The data representation does not retain any information on the type
of data.
• Semantic gap: the mismatch between high-level language constructs and
the machine model, which requires excessive mapping by the compiler.
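Linearization of a multidimensional array can be sketched as follows. The example assumes row-major order (one common convention; column-major is equally valid), and the function name `row_major_offset` is hypothetical.

```python
def row_major_offset(indices, shape):
    """Map a multidimensional index to an offset in one-dimensional memory.

    Assumes row-major order: the last index varies fastest.
    """
    if len(indices) != len(shape):
        raise ValueError("indices and shape must have the same length")
    offset = 0
    for index, extent in zip(indices, shape):
        if not 0 <= index < extent:
            raise IndexError("index out of range")
        offset = offset * extent + index
    return offset


# Element [1][2] of a 3 x 4 array lands at word 1 * 4 + 2 = 6.
offset_2d = row_major_offset((1, 2), (3, 4))
```

This index arithmetic is part of the mapping work the compiler must generate, because the memory itself knows nothing about array shape.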
1.2 Enhancements to the uniprocessor model
• The Harvard architecture
– Separate storage for data and instructions
– Current Harvard architectures
• Do not use separate storage for data and instructions
• Have separate paths and buffers to access data and instructions
Harvard Architecture
Major performance parameters
• Arithmetic Logic Unit
– functionality
– the speed of operations
• Memory
– the access speed
– the capacity
– the cost
• Control unit
– speed
– complexity
– flexibility
1.2.1 ALU
• Enhancements of ALU
– Faster algorithms for ALU operations
– Use of a large number of general-purpose registers
– Stack-based ALUs
– Pipelining
– Multiple functional units
– Multiple ALUs
1.2.2 Memory
• Enhancements of Memory subsystem
– wider word fetch
– blocking (interleaved and banked organization)
• Low-order interleaving
• High-order interleaving(banking)
– instruction/data buffers
– cache memories
– virtual memories
– multiport memories
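The difference between low-order and high-order interleaving is in which address bits select the bank. A minimal sketch, assuming word addressing and banks of equal size; the function names are hypothetical:

```python
def low_order_interleave(address, num_banks):
    """Low-order interleaving: the low address bits select the bank,
    so consecutive addresses fall in consecutive banks."""
    return address % num_banks, address // num_banks  # (bank, offset)


def high_order_interleave(address, num_banks, words_per_bank):
    """High-order interleaving (banking): the high address bits select
    the bank, so consecutive addresses stay in the same bank."""
    if not 0 <= address < num_banks * words_per_bank:
        raise IndexError("address out of range")
    return address // words_per_bank, address % words_per_bank


# Banks touched by addresses 0..3 with 4 banks of 8 words each.
low_banks = [low_order_interleave(a, 4)[0] for a in range(4)]
high_banks = [high_order_interleave(a, 4, 8)[0] for a in range(4)]
```

With low-order interleaving, a sequential block fetch spreads across all banks and can overlap their access times; with high-order interleaving, it stays in one bank, which leaves the other banks free for independent accesses.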
1.2.3 Control unit
• Two popular implementations of the control unit
– hardwired: for speed
– microprogrammed: for flexibility
1.2.4 I/O subsystem
• The popular I/O structures
– programmed I/O
– interrupt mode I/O
– DMA (direct memory access)
– Channels
• Selector channel
• Multiplexed channel
– I/O processors
– Front-end processors
1.2.5 Interconnection structures
• Bandwidth: major performance measure of the bus
structure
– bus width
– speed of interface hardware
– bus protocols
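As a rough illustration of how these three factors combine, bus bandwidth can be estimated from the width, the clock rate, and the number of clock cycles the protocol needs per transfer. The formula is a simplification, and the function name and the numbers below are assumptions for the example, not figures from the slides.

```python
def bus_bandwidth_bytes_per_s(width_bits, clock_hz, cycles_per_transfer):
    """Estimate bus bandwidth in bytes per second.

    width_bits          -- bus width
    clock_hz            -- clock rate of the interface hardware
    cycles_per_transfer -- clock cycles the bus protocol needs per transfer
    """
    if cycles_per_transfer <= 0:
        raise ValueError("cycles_per_transfer must be positive")
    return (width_bits / 8) * clock_hz / cycles_per_transfer


# Hypothetical 64-bit bus at 40 MHz needing 2 cycles per transfer.
bandwidth = bus_bandwidth_bytes_per_s(64, 40_000_000, 2)
```

Doubling the width or the clock doubles the estimate, while a protocol that needs more cycles per transfer reduces it proportionally.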
1.2.6 System considerations
• Large instruction sets, large numbers of general purpose
registers, large memories
• The availability of low-cost processors
• Multiple processors
– Two multiple processor structures
• each processor is dedicated for a specialized function.
• all processors in the system can operate
simultaneously (parallel processing).
1.3 Two architecture styles
• Two processor architecture styles that try to reduce the
semantic gap
– RISCs
– HLL architectures
1.3.1 HLL Architectures
• Figure 1.4 shows the evolution starting from the
compilation.
– Compilation
– Interpretation
– Two-level interpretation
– Direct execution
Direct Execution Language
• SYMBOL machine
– Directly executes the SYMBOL programming language
– Iowa State University, 1971
• Advantage:
– Very high translate-load speed
– It has not been shown that such machines offer execution speeds higher
than conventional architectures.
• Disadvantage:
– Only one source language can be used in programming the machine.
– HLL architectures have never been successful commercially.
• Texas Instruments' Explorer and Symbolics' 3600 series: LISP machines for
symbolic processing applications.
1.3.2 RISC
• The characteristics of RISC architectures
– A relatively small number of instructions
– A small number of addressing modes
– A small number of instruction formats
– Fast execution of all instructions
– Minimized memory access
– Support for the most frequently used operations and for optimizing
compilers
• Examples: the Berkeley RISC, the Stanford University MIPS, the IBM
801, SPARC, ...
1.4 Performance Evaluation
• The performance of a system is measured by the bandwidth provided
by its memory, processor, and I/O subsystems.
• The most common measures: MIPS, MOPS, MFLOPS, MLIPS.
– TFLOPS machines are being built.
• Performance ratings: peak rate, average rate,
comparative rate
• Factors other than performance for evaluating architectures:
generality, ease of use, expandability (or scalability),
openness, cost, etc.
• A practical method for estimating the performance: using
benchmarks.
1.4.1 Benchmarks
• Benchmarks are useful in evaluating hardware as well as software and
single processor as well as multiprocessor systems.
• Common benchmarks
– Kernel Benchmarks: Linpack, Lawrence Livermore loops
– Local Benchmarks
– Partial Benchmarks
– Recursive Benchmarks
– Unix Utility and Application Benchmarks: SPECmarks
– Synthetic Benchmarks: Dhrystone, Whetstone
– Parallel Benchmarks: NIST recommended several suites.
– Stanford Small Programs
– PERFECT: PERFormance Evaluation for Cost-Effective Transformations
– SLALOM: for measuring parallel computer performance
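The idea behind a kernel benchmark can be sketched by timing a small numerical loop and converting the known operation count into MFLOPS. The kernel here is a plain-Python `daxpy` (y = a*x + y), used only as a stand-in; it is not one of the Linpack or Livermore codes, and the numbers it produces mostly reflect interpreter overhead. The function names and default sizes are assumptions.

```python
import time


def daxpy(a, x, y):
    """Return a*x + y element-wise: one multiply and one add per element."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def measure_mflops(n=100_000, repeats=5):
    """Time the kernel several times and report the best (fastest) rate."""
    x = [1.0] * n
    y = [2.0] * n
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        daxpy(3.0, x, y)
        best = min(best, time.perf_counter() - start)
    flops = 2 * n                    # operation count is known in advance
    best = max(best, 1e-9)           # guard against a zero timer reading
    return flops / best / 1e6


if __name__ == "__main__":
    print(f"daxpy: {measure_mflops():.1f} MFLOPS (best of 5 runs)")
```

Reporting the best of several runs approximates a sustained rate for this one kernel; it says nothing about peak rate or about other workloads, which is why suites combine many kernels.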
1.5 Cost Factors
• The cost of a computer system: a composite of its software
and hardware cost.
• The cost of hardware has fallen rapidly as the hardware
technology progressed.
• The software costs are steadily rising as the software
complexity grows.
• The cost is dependent on two factors: an upfront
development cost and a per unit manufacturing cost
• The life spans of systems are getting shorter.
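The two cost factors above combine in a simple way: the upfront development cost is spread over the number of units built, and the manufacturing cost is paid for each unit. A minimal sketch with made-up numbers; the function name is hypothetical.

```python
def cost_per_unit(development_cost, unit_manufacturing_cost, units):
    """Amortized cost of one system: development share plus manufacturing."""
    if units <= 0:
        raise ValueError("units must be positive")
    return development_cost / units + unit_manufacturing_cost


# Hypothetical figures: 1,000,000 to develop, 50 per unit to manufacture.
small_run = cost_per_unit(1_000_000, 50, 10_000)
large_run = cost_per_unit(1_000_000, 50, 100_000)
```

A shorter life span usually means fewer units over which to amortize development, which raises the first term.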
1.6 Example systems
• DEC Alpha
– Table 1.1
– Figure 1.5
– Figure 1.6
– Figure 1.7
DEC Alpha
• The DEC Alpha, also known as the AXP, is a RISC microprocessor
originally developed and fabricated by DEC. DEC used it in its own
line of workstations and servers. Designed as a successor to the VAX
line of computers, it supported the VMS operating system as well as
the DEC flavor of UNIX. Later, open-source operating systems also ran
on the Alpha, notably certain BSD systems. Microsoft supported the
processor in earlier versions of Windows NT.
• The 64-bit processor was introduced in 1992 running at 200 MHz. It
was designed as a 64-bit architecture with a super-pipelined,
superscalar design. At the time, DEC touted it as the world's fastest
processor. In July 1996 it was clocked at 500 MHz (the 21164PC), in
March 1998 at 666 MHz, and in May 2000 at 731 MHz (the 21264PC).
Parts running at 1 GHz and faster were announced in 2001 (the 21364PC
or EV7) and have been available since 2003 at 1.1 GHz and upwards. Around
500,000 Alpha-based systems had been sold by the end of 2000.
DEC Alpha(continued)
• The production of Alpha chips was licensed to Samsung Electronics
Company. Following the purchase of Digital by Compaq, many of the
Alpha products were placed with API NetWorks, Inc. (previously
Alpha Processor Inc.), a private company funded by Samsung and
Compaq. In October 2001 Microway became the exclusive sales and
service provider of API NetWorks' Alpha-based product line.
• Compaq announced that computers using Alpha would be phased out
by 2004 in favour of Intel's Itanium. Windows NT support was halted
with NT4 SP6 following the Compaq takeover. HP, the new owner of
Compaq, announced that it would support the Alpha series for a few more
years, including a new EV79 chip, but that this would mark the end of the
line. The IA-64 is intended to replace this series.
• Ironically, in mid-2003, with the Alpha about to be phased out, the
fastest computer in the U.S., and second fastest in the world, was a
cluster of 4096 Alpha processors.
1.6 Example systems (continued)
• Intel i860
– Figure 1.8
– Figure 1.9
– Figure 1.10
– Table 1.2
Intel i860
64-Bit Microprocessor
• The Intel i860 (N10) microprocessor delivers
supercomputer performance in a single VLSI component.
The 64-bit design of the i860 balances integer, floating
point, and graphic performance. The Intel i860 has features
of both a digital signal processor and a data processor.
However, because of its speed in doing typical DSP
operations, it has been extensively used in the DSP role. Its
architecture also makes it suitable for other applications
including engineering workstations, scientific computing,
3-D graphics workstations, and multi-user systems. The
i860 is used as the data processor in Intel's massively
parallel Touchstone and Paragon supercomputers.
Intel i860
64-Bit Microprocessor(continued)
Features
• Parallel architecture that supports up to three operations per clock
• One integer or control instruction per clock
• Up to two floating-point results per clock
• High performance design
• 33.3/40 MHz clock rates
• 64-bit external data bus
• 64-bit internal instruction cache bus
• 128-bit internal data cache bus
• High level of integration on one chip
• 32-bit integer and control unit
• 32/64-bit pipelined floating point adder and multiplier units
• 64-bit 3-D graphic unit
Intel i860
64-Bit Microprocessor(continued)
Performance
• 80 peak single precision MFLOPS (40MHz i860)
• 60 peak double precision MFLOPS (40MHz i860)
• 80 peak double precision MFLOPS (40MHz i860XR)
• 42 SPECmark (40MHz i860XR)
• The i860XP (N11) is an extension to i860, with MP
support (enable physical snooping), new process, and
better performance.
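The peak single-precision figure above follows directly from the clock rate and the "up to two floating-point results per clock" feature. A minimal sketch of that arithmetic; the function name is hypothetical, and the 1.5 results per clock used for the 60 MFLOPS double-precision figure is back-computed from the listed number rather than stated in the slides.

```python
def peak_mflops(clock_mhz, results_per_clock):
    """Peak rate: every clock delivers the maximum number of FP results."""
    return clock_mhz * results_per_clock


# 40 MHz clock, up to two floating-point results per clock.
single_precision_peak = peak_mflops(40, 2)

# Assumption: 1.5 results per clock, inferred from the 60 MFLOPS figure.
double_precision_peak = peak_mflops(40, 1.5)
```

A peak rate assumes the pipelines never stall, so it is an upper bound; measured rates on real programs, such as the SPECmark figure, are lower.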
Intel i860
64-Bit Microprocessor(continued)
Functional Description
The i860 microprocessor consists of 9 units:
1. Core Execution Unit
2. Floating-Point Control Unit
3. Floating-Point Adder Unit
4. Floating-Point Multiplier Unit
5. Graphics Unit
6. Paging Unit
7. Instruction Cache
8. Data Cache
9. Bus and Cache Control Unit
Intel i860
64-Bit Microprocessor(continued)
Functional Description
• The core execution unit controls overall operation of the i860
microprocessor. A set of 32 x 32-bit general-purpose registers is
provided for the manipulation of integer data.
• The floating-point hardware is connected to a separate set of
floating-point registers, which can be accessed as 16 x 64-bit registers
or 32 x 32-bit registers.
• The floating-point control unit controls both the floating-point adder
and the floating-point multiplier, issuing instructions, handling all
source and result exceptions, and updating status bits in the
floating-point status register.
• The floating-point adder performs addition, subtraction, comparison,
and conversions on 64- and 32-bit floating-point values.
• The floating-point multiplier performs floating-point and integer
multiply and floating-point reciprocal operations on 64- and 32-bit
floating-point values.
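The dual view of the floating-point register file (16 x 64-bit or 32 x 32-bit) can be modeled as two ways of indexing the same 128 bytes of storage. This is a sketch of the idea only: the class name, the little-endian byte order, and the pairing of 32-bit registers 2k and 2k+1 into 64-bit register k are assumptions for illustration, not details taken from the slides.

```python
import struct


class FPRegisterFile:
    """One 128-byte store viewed as 16 x 64-bit or 32 x 32-bit registers.

    Assumption: little-endian layout, so 32-bit register 2k holds the
    low half of 64-bit register k and register 2k+1 the high half.
    """

    def __init__(self):
        self._store = bytearray(16 * 8)

    def write64(self, index, value):
        struct.pack_into("<Q", self._store, index * 8, value)

    def read64(self, index):
        return struct.unpack_from("<Q", self._store, index * 8)[0]

    def write32(self, index, value):
        struct.pack_into("<I", self._store, index * 4, value)

    def read32(self, index):
        return struct.unpack_from("<I", self._store, index * 4)[0]


regs = FPRegisterFile()
regs.write64(1, 0x1122334455667788)
low_half = regs.read32(2)    # low 32 bits of 64-bit register 1
high_half = regs.read32(3)   # high 32 bits of 64-bit register 1
```

Because both views share storage, writing a 64-bit value overwrites a pair of 32-bit registers, which is why code mixing single- and double-precision values has to allocate registers with the pairing in mind.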
Intel i860
64-Bit Microprocessor(continued)
Features
• Paging unit with translation lookaside buffer
• 32x32-bit integer register file
• 16x64-bit FPU register file
• 4 Kbyte instruction cache
• 8 Kbyte data cache
• Compatible with industry standards
• On-chip debug register
• Assembler, Linker, Simulator, Debugger, C and FORTRAN Compilers,
FORTRAN Vectorizer, Scalar and Vector Math Libraries for both OS/2
and UNIX environments
1.6 Example systems(continued)
• MIPS R4000
– Table 1.3
– Figure 1.11-1.15
– Table 1.4
MIPS R4000
• MIPS Technologies, Inc. designs, develops, and licenses reduced instruction
set computer (RISC) microprocessors and compilers. It is a wholly-owned
subsidiary of Silicon Graphics, Inc. and operates as an independent unit.
MIPS is the successor to the processor business of MIPS Computer Systems,
which was founded in 1984 and merged with Silicon Graphics on 29 June 1992.
• MIPS Technologies developed the world's first RISC VLSI
microprocessors (1985) (or was it the ARM?) and the first commercial
64-bit microprocessor (MIPS R4000, 1992), and announced the MIPS R4300i,
the first 64-bit RISC processor designed for interactive consumer
applications (April 1995). They announced the MIPS R10000, the
next-generation general-purpose MIPS microprocessor and the most
powerful processor in the world (October 1994).
MIPS R4000 (continued)
• MIPS' semiconductor company partners participate in the
design and development of MIPS processors and software
and then produce, market, and support the processors.
MIPS itself does not fabricate or sell products. MIPS'
semiconductor partners are: Integrated Device Technology,
LSI Logic Corporation, NEC Corporation, NKK
Corporation, Philips Semiconductors, Siemens AG, and
Toshiba Corporation.
MIPS R4000 (continued)
MIPS' products
• R4000 - 100 MHz, 1.35M transistors, primary i/d cache 8KB/8KB,
SPECint92 58.3, SPECfp92 61.4.
• R4300i - 133 MHz, 1.35M transistors, primary i/d cache 16KB/8KB,
SPECint92 80, SPECfp92 60.
• R4400 - 250 MHz, 2.3M transistors, primary i/d cache 16KB/16KB,
SPECint92 175.8, SPECfp92 164.4.
• R4600 - 133 MHz, 1.9M transistors, primary i/d cache 16KB/16KB,
SPECint92 85, SPECfp92 75.
• R8000/R8010 - 90 MHz, 2.6M/0.83M transistors, primary i/d cache
16KB/16KB, SPECint92 132, SPECfp92 396.
• R10000 - 200 MHz, 6.7M transistors, primary i/d cache 32KB/32KB,
SPECint92 >300, SPECfp92 >600.
• MIPS' processor chips were used in the DEC 3100 series of
workstations.
Intel Research - Microprocessor
• We research advanced microarchitecture and system
architecture concepts and techniques for future
generation IA-32 and IA-64 designs. We are located
at Intel's centers for microprocessor development
including Santa Clara (California), Hillsboro (Oregon),
Haifa (Israel) and Barcelona (Spain). We work side
by side with engineers developing current and next
generation microprocessors.
Intel Research – Microprocessor(continued)
Research areas
• Multithread Microarchitecture
Research into various flavors of multithreading from CMP (chip
multiprocessor), SMT (simultaneous multithreading) to DMT
(dynamic multithread).
• Memory Hierarchy
Research into multilevel caches, prefetching, multiprocessor cache
behavior, and external memory bandwidth and latency bottlenecks.
• Improving Instruction Level Parallelism
Research areas include improving ILP through novel instruction
supply and prediction techniques, techniques for bypassing
memory latency and improving memory hierarchy organization.
• Low Power Architecture and Microarchitectures
In the low-power area, we investigate techniques for cutting local
and global power and novel architecture design for low power.
• Simulators
We are also investigating IA-32 and IA-64 based simulation
frameworks to evaluate design and performance characteristics.