
Computing Environment
• The computing environment is rapidly evolving – you need to know not only the methods, but also:
  • How and when to apply them,
  • Which computers to use,
  • What type of code to write,
  • What kind of CPU time and memory requirements your jobs will have,
  • What tools (e.g., visualization software) to use to analyze the data.
Definitions – Clock Cycles
• A computer chip operates at discrete intervals called clock cycles, often measured in nanoseconds (ns) or megahertz (MHz)
  • 500 MHz (Pentium III) -> 2 ns
  • 100 MHz (Cray J90) -> 10 ns
• May take 4 clocks to do one multiplication
• May take 30 clocks to start a procedure
• May take 2 clocks to access memory
• MHz is not the only measure of speed
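The two conversions above follow directly from the reciprocal relation between clock period and clock frequency:

    T = \frac{1}{f}, \qquad
    \frac{1}{500 \times 10^6\ \mathrm{Hz}} = 2\ \mathrm{ns}, \qquad
    \frac{1}{100 \times 10^6\ \mathrm{Hz}} = 10\ \mathrm{ns}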
Definitions – FLOPS
• Floating-point Operations Per Second
• Mflops – million FLOPS
• A good measure of code performance – typically one add is one flop, and one multiplication is also one flop
• Cray J90 peak = 200 Mflops; most codes achieve only about 1/3 of peak
• Cray T90 peak = 3.2 Gflops
• Earth Simulator (NEC SX-5) = 8 Gflops
• Fastest workstation processor (DEC Alpha) ~ 1 Gflops
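As a minimal sketch of how such a rate can be estimated in practice – the array size, the 2-flops-per-iteration count, and the use of the standard cpu_time intrinsic are illustrative assumptions, and an optimizing compiler may remove or reorder the loop:

      program mflops_estimate
      ! Minimal sketch: time a multiply-add loop and report Mflops.
      implicit none
      integer, parameter :: n = 10000000
      real, allocatable :: x(:), y(:)
      real :: a, t1, t2
      integer :: i
      allocate(x(n), y(n))
      x = 1.0
      y = 2.0
      a = 0.5
      call cpu_time(t1)
      do i = 1, n
         y(i) = a*x(i) + y(i)   ! one multiply + one add = 2 flops
      end do
      call cpu_time(t2)
      ! 2 flops per iteration, n iterations; t2-t1 may be tiny on fast CPUs
      print *, 'Mflops ~', 2.0*real(n)/(t2 - t1)/1.0e6
      print *, y(1)            ! keep the result live
      end program mflops_estimate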
MIPS
• Million instructions per second – also a measure of computer speed, used mostly in the old days
Bandwidth
• The speed at which data flows across a network or wire
• 56K modem = 56 kilobits/sec
• T1 link = 1.544 Mbits/sec
• T3 link = 45 Mbits/sec
• FDDI = 100 Mbits/sec
• Fibre Channel = 800 Mbits/sec
• 100BaseT (Fast) Ethernet = 100 Mbits/sec
• Brain system = 3 Gbits/sec
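For a sense of scale, the time to move a 1 MB (8 Mbit) file follows directly from these rates, ignoring protocol overhead:

    t_{T1} = \frac{8\ \mathrm{Mbit}}{1.544\ \mathrm{Mbit/s}} \approx 5.2\ \mathrm{s},
    \qquad
    t_{56K} = \frac{8000\ \mathrm{kbit}}{56\ \mathrm{kbit/s}} \approx 143\ \mathrm{s}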
Hardware Evolution
• Mainframe computers
• Supercomputers
• Workstations
• Microcomputers / Personal Computers
• Desktop Supercomputers
• Workstation Super Clusters
• Handheld, Palmtop, Calculators, et al.
Types of Processors
• Scalar (Serial)
• One operation per clock cycle
• Vector
• Multiple (tens to hundreds) operations per clock cycle.
Typically achieved at the loop level where the instructions
are the same or similar for each loop index
• Superscalar
• Several instructions per clock cycle
Types of Computer Systems
• Single Processor Scalar (e.g., ENIAC, IBM704, IBM-PC)
• Single Processor Vector (CDC7600, Cray-1)
• Multi-Processor Vector (e.g., Cray X-MP, Cray C90, Cray J90, NEC SX-5)
• Single Processor Super-scalar (IBM RS/6000 such as Bluesky)
• Multi-processor scalar (e.g., Multi-processor Pentium PC)
• Multi-processor super-scalar (e.g., DEC Alpha based Cray T3E,
RS/6000 based IBM SP-2, SGI Origin 2000)
• Clusters of the above (e.g., Linux clusters, Earth Simulator –
Cluster of multiple vector processor nodes)
Memory Architectures
• Shared Memory Systems
• Memory can be accessed and addressed
uniformly by all processors
• Fast/expensive CPU, Memory, and networks
• Easy to use
• Difficult to scale to many (> 32) processors
• Distributed Memory Systems
• Each processor has its own memory
• Others can access its memory only via
network communications
• Often off-the-shelf components,
therefore low cost
• Hard to use; explicit user specification of
communications often needed (see the MPI
sketch after this list)
• Single CPU slow. Not suitable for
inherently serial codes
• High-scalability - largest current system
has nearly 10K processors
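For example, on a distributed-memory system even a simple neighbor exchange must be spelled out by the programmer. The sketch below assumes an MPI library and uses illustrative names; each process passes one boundary value around a ring of processes:

      program halo_exchange
      ! Minimal sketch: explicit communication on a distributed-memory
      ! system (assumes MPI is available; names are illustrative).
      use mpi
      implicit none
      integer :: rank, nprocs, left, right, ierr
      integer :: status(MPI_STATUS_SIZE)
      real :: send_val, recv_val
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      left  = mod(rank - 1 + nprocs, nprocs)   ! periodic neighbors
      right = mod(rank + 1, nprocs)
      send_val = real(rank)
      ! Send our boundary value right, receive our neighbor's from the left
      call MPI_Sendrecv(send_val, 1, MPI_REAL, right, 0, &
                        recv_val, 1, MPI_REAL, left,  0, &
                        MPI_COMM_WORLD, status, ierr)
      print *, 'rank', rank, 'received', recv_val
      call MPI_Finalize(ierr)
      end program halo_exchange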
Memory Architectures
• Multi-level memory (cache and main memory) architectures
  • Cache – fast and expensive memory
  • Typical L1 cache size in current-day microprocessors ~ 32 KB
  • L2 size ~ 256 KB to 8 MB
  • Main memory – a few MB to many GB
  • Try to reuse the content of the cache as much as possible before it is replaced by new data or instructions (see the blocking sketch below)
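A standard way to encourage such reuse is loop blocking (tiling). The sketch below transposes a matrix in small tiles so each tile of both arrays stays in cache while it is used; the array size n and block size nb are illustrative assumptions:

      program cache_blocking
      ! Minimal sketch of cache blocking (tiling); n and nb illustrative.
      implicit none
      integer, parameter :: n = 1024, nb = 64
      real, allocatable :: a(:,:), b(:,:)
      integer :: i, j, ii, jj
      allocate(a(n,n), b(n,n))
      call random_number(a)
      ! Transpose in nb x nb tiles so each tile stays cache-resident
      do jj = 1, n, nb
         do ii = 1, n, nb
            do j = jj, min(jj+nb-1, n)
               do i = ii, min(ii+nb-1, n)
                  b(j,i) = a(i,j)
               end do
            end do
         end do
      end do
      print *, b(1,n)   ! keep the result live
      end program cache_blocking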
Issues with Parallel Computing
• Load balance / Synchronization
  • Try to give an equal amount of workload to each processor
  • Try to give processors that finish first more work to do (load rebalancing; a minimal sketch follows this list)
  • The goal is to keep all processors as busy as possible
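One common way to let processors that finish early pick up extra work on a shared-memory machine is dynamic loop scheduling. The OpenMP directive below is a minimal sketch of the idea; OpenMP itself and the cost model of the dummy function are assumptions, not something these slides prescribe:

      program dynamic_balance
      ! Minimal sketch: iterations with unequal cost are handed out in
      ! chunks, so processors that finish early grab more work.
      implicit none
      integer, parameter :: n = 1000
      real :: work(n)
      integer :: i
!$omp parallel do schedule(dynamic, 10)
      do i = 1, n
         work(i) = expensive(i)   ! cost grows with i
      end do
!$omp end parallel do
      print *, sum(work)
      contains
      real function expensive(i)
      integer, intent(in) :: i
      integer :: k
      expensive = 0.0
      do k = 1, i                 ! more iterations for larger i
         expensive = expensive + sin(real(k))
      end do
      end function expensive
      end program dynamic_balance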
• Communication / Locality
  • Inter-processor communications are typically the biggest overhead on MPP platforms, because the network is slow relative to CPU speed
  • Try to keep data access local
  • E.g., the 2nd-order finite difference

        \left.\frac{\partial u}{\partial x}\right|_j = \frac{u_{j+1} - u_{j-1}}{2\Delta x}

    requires data at 3 points; the 4th-order finite difference

        \left.\frac{\partial u}{\partial x}\right|_j = \frac{4}{3}\,\frac{u_{j+1} - u_{j-1}}{2\Delta x} - \frac{1}{3}\,\frac{u_{j+2} - u_{j-2}}{4\Delta x}

    requires data at 5 points; the spectral expansion method

        u_k = \frac{1}{2N+1}\sum_{j=1}^{2N+1} u_j \exp(ik x_j)

    requires data from the entire grid.
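In code, the 2nd-order difference above touches only nearest neighbors, which is what keeps its data access local. A minimal sketch, where n, dx, and the sample field u(x) = x**2 are illustrative assumptions:

      program central_diff
      ! Minimal sketch: 2nd-order centered difference, 3-point stencil.
      implicit none
      integer, parameter :: n = 100
      real :: u(n), dudx(n), dx
      integer :: j
      dx = 0.01
      do j = 1, n
         u(j) = (j*dx)**2                          ! sample field u = x**2
      end do
      do j = 2, n-1
         dudx(j) = (u(j+1) - u(j-1)) / (2.0*dx)    ! local access only
      end do
      print *, dudx(n/2)                           ! ~ 2x at x = 0.5
      end program central_diff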
A Few Simple Rules for Writing Efficient Code
• Use multiplies instead of divides whenever possible (see the sketch after this list)
• Make the innermost loop the longest
  Slower loop:

      Do 100 i=1,1000
        Do 10 j=1,10
          a(i,j)=…
 10     continue
 100  continue

  Faster loop:

      Do 100 j=1,10
        Do 10 i=1,1000
          a(i,j)=…
 10     continue
 100  continue

• For a short loop like Do I=1,3, write out the associated expressions explicitly, since the loop startup cost may be very high
• Avoid complicated logic (IFs) inside DO loops
• Avoid subroutine and function calls inside DO loops
• Vectorizable code typically also runs faster on RISC-based superscalar processors
• Keep it simple.
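A minimal sketch of the first rule: hoist a division out of the loop and multiply by the reciprocal instead. The array size and values here are illustrative assumptions:

      program divide_vs_multiply
      implicit none
      integer, parameter :: n = 1000000
      real :: a(n), dx, rdx
      integer :: i
      dx = 0.1
      call random_number(a)
      ! Slower: one divide per iteration
      !   do i = 1, n
      !      a(i) = a(i) / dx
      !   end do
      ! Faster: one divide total, one multiply per iteration
      rdx = 1.0/dx
      do i = 1, n
         a(i) = a(i)*rdx
      end do
      print *, a(1)   ! keep the result live
      end program divide_vs_multiply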
Transition in Computing Architectures
This chart depicts major NCAR SCD computers from the 1960s onward, along with the
sustained gigaflops (billions of floating-point calculations per second) attained by the SCD
machines from 1986 to the end of fiscal year 1999. Arrows at right denote the machines
that will be operating at the start of FY00. The division is aiming to bring its collective
computing power to 100 Gflops by the end of FY00, 200 Gflops in FY01, and 1 teraflop by
FY03. (Source: http://www.ucar.edu/staffnotes/9909/IBMSP.html)