O2000 Architecture - Texas A&M University


Origin System Architecture: Hardware and Software Environment
Scalar Architecture
[Diagram: a processor containing a register file and a functional unit (multiply, add), backed by a cache (~2 GB/s, ~10-cycle access) and by memory (~500 MB/s, ~100-cycle access).]
Reduced Instruction Set Computer (RISC) architecture:
• load/store instructions refer to memory
• functional units operate on items in the register file
• the scalar architecture has a memory hierarchy:
– most recently used items are captured in the cache
– access to the cache is much faster than access to memory (a small cache-reuse sketch follows below)
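The following C sketch is illustrative and not part of the original slides: a working set small enough to stay in the cache pays the long memory latency only on the first pass, and later passes run at cache speed.

    /* Cache-reuse sketch: a[] (32 KB) fits in the cache, so only the first
     * pass over it pays the ~100-cycle memory latency; the remaining passes
     * hit the ~10-cycle cache. */
    #include <stdio.h>

    #define N 4096                    /* 4096 doubles = 32 KB */
    #define PASSES 100

    int main(void)
    {
        static double a[N];
        double sum = 0.0;
        int i, p;

        for (i = 0; i < N; i++)       /* first pass: loads come from memory */
            a[i] = (double)i;

        for (p = 0; p < PASSES; p++)  /* later passes: loads hit the cache */
            for (i = 0; i < N; i++)
                sum += a[i];          /* functional unit works on register operands */

        printf("sum = %f\n", sum);
        return 0;
    }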
Vector Architecture
Vector Operation

      DO i=1,n
        DO k=1,n
          C(i,1:n) = C(i,1:n) + A(i,k)*B(k,1:n)
        ENDDO
      ENDDO

[Diagram: C(i,1:n) is accumulated in a vector register; the scalar A(i,k) and the vector B(k,1:n) are fed from memory through the functional unit (multiply, add).]

The inner update maps onto a short vector instruction sequence:

      loadf   f2,(r3)        load scalar A(i,k)
      loadv   v3,(r3)        load vector B(k,1:n)
      mpyvs   v3,v3,v2       calculate A(i,k)*B(k,1:n)
      addvv   v4,v4,v3       update C(i,1:n)
• Vectors are loaded from memory with the loadv instruction
• Performance is determined by memory bandwidth
• Optimization takes the vector register length (64 words) into account (see the strip-mining sketch below)
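A hedged sketch of that optimization, written in C with illustrative names (not from the slides): a long loop is strip-mined into chunks of the 64-word vector length so that each chunk maps onto one vector-register load, multiply/add, and store.

    /* Strip-mining sketch: process VL = 64 elements at a time so each strip
     * corresponds to one vector register's worth of work (loadv / mpyvs /
     * addvv / store). Function and variable names are illustrative. */
    #define VL 64

    void saxpy_strips(int n, float a, const float *x, float *y)
    {
        int i, j, len;

        for (i = 0; i < n; i += VL) {            /* one strip per vector register */
            len = (n - i < VL) ? (n - i) : VL;   /* last strip may be shorter     */
            for (j = 0; j < len; j++)            /* body a vectorizer would turn  */
                y[i + j] += a * x[i + j];        /* into vector instructions      */
        }
    }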
Multiprocessor Architecture
[Diagram: two processors, each with its own register file, functional unit (multiply, add), cache, and cache coherency unit, sharing a single memory.]
The cache coherency unit intervenes if two or more processors attempt to update the same cache line.
• All memory (and I/O) is shared by all processors
• Read/write conflicts between processors on the same memory location are resolved by the cache coherency unit
• The programming model is an extension of the single-processor programming model (see the OpenMP sketch below)
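As a minimal sketch of that extension (assuming an OpenMP compiler; OpenMP is not named on this slide), the serial loop is unchanged except for one directive, and the hardware keeps the shared array coherent across the processors' caches:

    /* Shared-memory sketch: one directive parallelizes the loop; x is shared
     * and the cache coherency hardware keeps all cached copies consistent. */
    #include <omp.h>

    void scale(int n, double alpha, double *x)
    {
        int i;

    #pragma omp parallel for              /* iterations divided among processors */
        for (i = 0; i < n; i++)
            x[i] *= alpha;                /* each element written by exactly one CPU */
    }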
Multicomputer Architecture
[Diagram: two complete nodes, each with its own main memory, register file, functional unit (multiply, add), and cache; the nodes are connected only through an interconnect.]
• All memory and I/O paths are independent
• Data movement across the interconnect is “slow”
• The programming model is based on message passing (see the MPI sketch below)
– processors explicitly engage in communication by sending and receiving data
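A hedged illustration of that model using MPI (MPI itself is not named on the slide; any message-passing library would serve): each process owns its data and moves it with explicit send and receive calls. Run with at least two processes.

    /* Message-passing sketch: data moves only through explicit send/receive.
     * Error handling is omitted for brevity. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* explicit send */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                             /* explicit receive */
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }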
Origin 2000 Node Board
Basic Building Block
[Diagram: a node board with two R1x000 processors and their caches, a Hub ASIC, main memory with its directory (an extended directory is used in systems above 32 processors), and connections to XIO and CrayLink.]

Node board:
• 2 x R12000 processors
• 64 MB to 4 GB main memory

Hub bandwidth peaks:
• 780 MB/s [625] --- CPUs
• 780 MB/s [683] --- memory
• 1.56 GB/s [1.25] --- XIO link
• 1.56 GB/s [1.25] --- CrayLink
O2000 Node Board
[Diagram: two R1x000 processors, each with a 1, 4, or 8 MB L2 cache, share the processor interface of the HUB ASIC; the memory interface drives up to 4 GB of main memory per node (SDRAM, 144 bits @ 50 MHz = 800 MB/s) plus directory SDRAM; the CrayLink interface is a duplex connection to other nodes (2 x 23 bits @ 400 MHz, 2 x 800 MB/s); input/output is available on every node at 2 x 800 MB/s.]

HUB crossbar ASIC (950K gates, 100 MHz, 64 bit, BTE, 64 counters per 4 KB page):
• A single chip integrates all 4 interfaces:
– processor interface: two R1x000 processors multiplex on the same bus
– memory interface, integrating the memory controller and the (directory) cache coherency
– interface to the CrayLink interconnect to other nodes in the system
– interface to the I/O devices with XIO-to-PCI bridges
• Memory access characteristics:
– read bandwidth, single processor: 460 MB/s sustained
– average access latency: 315 ns to restart the processor pipeline
Origin 2000 Switch Technology
[Diagram: each node board's Hub connects to an XBOW switch with 6 ports to XIO and to a router leading to other node boards; node boards (N) and routers (R) are linked into a ccNUMA hypercube.]
O2000 Scalability Principle
[Diagram: two node boards, each with two R1x000 processors (1, 4, or 8 MB L2 caches), a HUB with processor, memory, I/O, and link interfaces, and main memory with directory SDRAM, joined by a crossbar router network.]
Distributed switch does scale:
– Network of crossbars allows for full remote bandwidth
– The switch components are distributed and modular
Origin 2000 Module
System Building Block
Module Features:
• Up to 8 R12000 CPUs (1-4 nodes)
• Up to 16 GB physical memory
• Up to 12 XIO slots
• 2 XBOW switches
• 2 router switches
• 64-bit internal PCI bus (optional)
• Up to 2.5 [3.1] GB/sec system bandwidth
• Up to 5.0 [6.2] GB/sec I/O bandwidth
Origin 2000 Deskside System (SGI 2100 / 2200):
• 2-8 CPUs
• 16 GB memory
• 12 XIO slots
[Diagram: deskside topology of node boards (N) and routers (R).]
Origin 2000 Single Rack
Single Rack System
• 2-16 CPUs
• 32GB Memory
• 24 XIO slots
SGI 2400
[Diagram: single-rack topology of node boards (N) and routers (R).]
Origin 2000 Multi-Rack
Multi-Rack System
• 17-32 CPUs
• 64GB Memory
• 48 XIO slots
• 32-processor hypercube building block
[Diagram: multi-rack topology; node boards (N) and routers (R) form a 32-processor hypercube.]
Origin 2000 Large Systems
Large Multi-Rack Systems
• up to 512 CPUs
• up to 1 TB Memory
• 384+ XIO slots
SGI 2800
[Diagram: multiple racks combined into one large system.]
Scalable Node Product Concept
Address diverse customer requirements:
• Independent scaling of CPU, I/O, and storage…tailor ratios to suit the application
• Large dynamic range of product configurations
• RAS via component isolation
• Independent evolution and upgrade of system components
• Maximize leverage of engineering and technology development efforts
[Diagram: a modular architecture built on interface and form-factor standards, with separate I/O subsystems.]
Origin 3000 Hardware Modules (BRICKS)
• G-brick: graphics expansion
• C-brick: CPU module
• R-brick: router interconnect
• I-brick: base I/O module
• P-brick: PCI expansion
• X-brick: XIO expansion
• D-brick: disk storage
Origin 3000 MIPS Node
• 128 nodes / 512 CPUs per system (max)
• Memory interface: 4x O2K bandwidth (200 MHz, 3200 MB/sec); 60% of O2K latency (180 ns local); up to 8 GB DDR SDRAM per node
• Two independent SysAD interfaces, each 2x O2K bandwidth (200 MHz, 1600 MB/sec each)
• NUMAlink3 network port: 2x O2K bandwidth (800 MHz, 1600 MB/sec, bi-directional)
• XIO+ port: 1.5x O2K bandwidth (600 MHz, 1200 MB/sec, bi-directional)
[Diagram: four R1x000 processors with L2 caches connect in pairs to the Bedrock ASIC, which also attaches the memory/directory, the NUMAlink3 network port, and the XIO+ port.]
Origin 3000 CPU Brick (C-brick)
• 3U high x 28” deep
• Four MIPS or IA64 CPUs
• 1 - 4 DIMM pairs: 256 MB, 512 MB, 1024 MB (premium)
• 48V DC power input
• N+1 redundant, hot-plug cooling
• Independent power on/off
• Each CPU module can support one I/O brick
Origin 3000 BEDROCK Chip
SGI Origin 3000 Bandwidth
Theoretical vs. measured (MB/s)
[Chart: theoretical vs. measured per-link bandwidths between CPUs, Hub, and memory for an Origin 3000 node and an Origin 2000 node; figures shown include 1600, 1150, 900, 2100, 3200 (2x1600), and 2x1250 MB/s.]
STREAMS Copy Benchmark
Measured copy bandwidth (MB/s) versus number of CPUs:

                                 1 CPU    2 CPUs   4 CPUs   8 CPUs
  Origin 2000 R12KS 400 MHz      380.0     381.0    820.0   1538.0
  Origin 3000 R12KS 400 MHz      623.0     777.0   1406.0   2855.0
  Origin 3000 R14K  500 MHz      685.0     778.0   1401.0   2823.0
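For reference, the copy figures above come from the STREAM benchmark; the kernel being timed is essentially the loop below. This is a simplified sketch, not the actual benchmark source; the array size and timer choice are assumptions.

    /* STREAM-style copy sketch: bandwidth = (bytes read + bytes written) per
     * second. The arrays must be far larger than the L2 cache. */
    #include <stdio.h>
    #include <time.h>

    #define N 8000000                         /* 64 MB per array */

    static double a[N], c[N];

    int main(void)
    {
        int j;
        clock_t t0, t1;
        double seconds, mbytes;

        for (j = 0; j < N; j++)
            a[j] = 1.0;

        t0 = clock();
        for (j = 0; j < N; j++)
            c[j] = a[j];                      /* the copy kernel */
        t1 = clock();

        seconds = (double)(t1 - t0) / CLOCKS_PER_SEC;
        mbytes  = 2.0 * N * sizeof(double) / 1.0e6;   /* one read + one write */
        printf("copy: %.1f MB/s\n", mbytes / seconds);
        return 0;
    }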
Origin 3000 Router Brick (r/R-brick)

• 2U high x 25” deep
• Replaces the system mid-plane
• Multiple implementations:
– r-brick…6-port (up to 32 CPUs)
– R-brick…8-port (up to 128 CPUs)
– metarouter…(128 to 512 CPUs)
• 48V DC power input
• N+1 redundant, hot-plug cooling
• Independent power on/off
• 8 NUMAlink™ 3 network ports, each port 3.2 GB/s (2x O2K bandwidth)
• Latency 50% of Origin 2000: the NUMAlink™ 3 router has a 45 ns roundtrip latency
SGI Origin 3000 Measured Bandwidth
[Diagram: measured bandwidth through a router: 2 x 2500 MB/s = 5000 MB/s.]
SGI NUMA 3 Scalable Architecture (16p - 1 hop)
[Diagram: sixteen R1x000 processors reached in one hop: four groups of four processors, each group attached to its own Bedrock ASIC, and all four Bedrocks connected to a single 8-port router whose remaining ports go to other routers.]
Origin 3000 I/O Bricks
I-brick: Base I/O Module
• Base system I/O: system disk, CD-ROM, 5 PCI slots
• No need to duplicate starting I/O infrastructure

P-brick: PCI Expansion
• 12 industry-standard, 64-bit, 66 MHz slots
• Supports almost all system peripherals
• All slots are hot-swap

X-brick: XIO Expansion
• Highest performance I/O expansion
• Supports HIPPI, GSN, VME, HDTV
• 4 XIO slots per brick
New I/O bricks (e.g., PCI-X) can be attached via same XIO+ port
Types of Computer Architecture, characterised by memory access

MIMD systems:
• Multiprocessors (single address space, shared memory)
– UMA (central memory):
PVP (SGI/Cray T90)
SMP (Intel SHV, SUN E10000, DEC 8400, SGI Power Challenge, IBM R60, etc.)
– COMA (KSR-1, DDM)
– NUMA (distributed memory):
CC-NUMA (SGI Origin2000, Origin3000, Cray T3E, HP Exemplar, Sequent NUMA-Q, Data General)
NCC-NUMA (Cray T3D, IBM SP3)
• Multicomputers (multiple address spaces, NORMA: no remote memory access)
– Cluster (loosely coupled, multiple OS): IBM SP2, DEC TruCluster, Microsoft Wolfpack, “Beowulf”, etc.
– “MPP” (tightly coupled & single OS): Intel TFLOPS, TM-5

Glossary:
  MIMD       Multiple Instruction, Multiple Data
  UMA        Uniform Memory Access
  NUMA       Non-Uniform Memory Access
  NORMA      No-Remote Memory Access
  MPP        Massively Parallel Processor
  PVP        Parallel Vector Processor
  SMP        Symmetric Multi-Processor
  COMA       Cache Only Memory Architecture
  CC-NUMA    Cache-Coherent NUMA
  NCC-NUMA   Non-Cache Coherent NUMA
Origin DSM-ccNUMA Architecture
Distributed Shared Memory
[Diagram: two Origin 3000 nodes, each with four processors and their caches, a Bedrock ASIC with directory, local main memory, and an XIO+ port, connected through NUMAlink3 and R-bricks.]
Distributed Shared Memory Architecture (DSM)
[Diagram: two nodes, each with its own main memory, register file, functional unit (multiply, add), cache, and cache coherency unit, joined by an interconnect.]
• Local memory and an independent path to memory, as in the Multicomputer Architecture
• The memory of all nodes is organized as one logical “shared memory”
• Non-uniform memory access (NUMA):
– “local memory” access is faster than “remote memory” access
• The programming model is (almost) the same as for the Shared Memory Architecture
– data distribution is available for optimization
• Scalability properties are similar to the Multicomputer Architecture
Origin DSM-ccNUMA Architecture
Directory-Based Scalable Cache Coherence
[Diagram: the same two-node picture as above; each node's Bedrock ASIC holds the directory used for cache coherence.]
Origin Cache Coherency
• A memory page is divided into data blocks of 32 words (128 bytes) each, the L2 cache line size
• Each data request transfers one data block (128 bytes)
• Each data block has associated presence and state information in the directory:
[Diagram: for every 128-byte data block (cache line) in a page, the directory holds 64 presence bits and 8 state bits.]
Block states:
– Unowned: no copies
– Shared: read-only copies
– Exclusive: one read-write copy
– Busy: state in transition
• Each L2 cache line contains 4 sub-blocks of 8 words (32 bytes) each, the L1 data cache line size
• If a node (HUB) requests a data block, the corresponding presence bit is set and the state of that cache line is recorded
• The HUB runs the cache coherency protocol, updating the state of the data block and notifying the nodes whose presence bits are set (a small directory-entry sketch follows below)
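To make the bookkeeping concrete, here is an illustrative C sketch of such a directory entry and of the update performed on a read request. This is not SGI's implementation; the structure and function names are invented for the example.

    /* Illustrative directory entry: 64 presence bits (one per node) plus a
     * state field, one entry per 128-byte data block. */
    #include <stdint.h>

    enum dir_state { UNOWNED, SHARED, EXCLUSIVE, BUSY };

    struct dir_entry {
        uint64_t presence;            /* bit i set => node i holds a copy */
        uint8_t  state;               /* one of the four block states     */
    };

    /* Conceptually what the HUB records when node 'node' reads a block:
     * mark the new sharer and move an unowned block to the shared state. */
    static void note_read(struct dir_entry *e, int node)
    {
        e->presence |= (uint64_t)1 << node;
        if (e->state == UNOWNED)
            e->state = SHARED;
    }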
CC-NUMA Architecture: Programming
[Diagram: C = A x B with the columns of each matrix distributed across processors 1, 2, and 3.]
C     every processor holds a column of each matrix:
C$distribute A(*,block), B(*,block), C(*,block)
C$omp parallel do
      DO i=1,n
        DO j=1,n
          DO k=1,n
            C(i,j) = C(i,j) + A(i,k)*B(k,j)
          ENDDO
        ENDDO
      ENDDO
• All data is shared
• Additional optimization places data close to the processor that will do most of the computations on that data
• Automatic (compiler) optimizations apply for both single-processor and parallel performance
• The data access (data exchange) is implicit in the algorithm
• Except for the additional data placement directives, the source is the same as for a shared-memory (SMP) program (a first-touch placement sketch follows below)
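Where the distribution directives above are not available, a common alternative on Origin is to rely on IRIX's default first-touch page placement. The C/OpenMP sketch below is an assumption-laden illustration of that idea (column-major layout, static scheduling, invented function name), not the code from the slide.

    /* First-touch placement sketch: each thread initializes the C columns it
     * will later compute, so those pages are allocated on its own node.
     * Static scheduling keeps the same columns with the same thread. */
    #include <omp.h>

    void matmul(int n, const double *A, const double *B, double *C)
    {
        int i, j, k;

    #pragma omp parallel for schedule(static) private(i)
        for (j = 0; j < n; j++)               /* first touch of column j */
            for (i = 0; i < n; i++)
                C[i + j * n] = 0.0;

    #pragma omp parallel for schedule(static) private(i, k)
        for (j = 0; j < n; j++)               /* same thread computes column j */
            for (k = 0; k < n; k++)
                for (i = 0; i < n; i++)
                    C[i + j * n] += A[i + k * n] * B[k + j * n];
    }

In practice A and B would be initialized the same way, so that their pages also land near the threads that read them most.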
Problems of CC-NUMA Architecture
• SMP programming style + data placement techniques (directives)
• The “SMP programming cliff”: the remote memory latency jump of roughly 3-5x requires correct data placement
• Example figures (based on a 1 GB/s SCI link with latency/hop ~ 500 ns): on a 64-128 processor O2000, t_access(remote)/t_access(local) is ~3-5, hence the need for correct data placement
DSM-ccNUMA Memory
[Diagram:
– shared-memory systems (SMP): easy to program, hard to scale
– massively parallel systems (MPP): easy to scale, hard to program
– distributed shared memory systems (ccNUMA): easy to program and easy to scale]
SGI 3200 (2-8p)
• Router-less configurations in a deskside form factor
• Short rack (17U configuration space)
[Diagram: system topology. Minimum (2p) system: one C-brick with an I-brick on its XIO+ port and a power bay. Maximum (8p) system: C-bricks connected router-less through their network ports, with an I-brick and further P-, I-, or X-bricks on the XIO+ ports, plus power bays.]
SGI 3400 (4-32p)
• Full-size rack (39U configuration space)
[Diagram: system topology. Minimum (4p) system: one C-brick, an I-brick, and a power bay. Maximum (32p) system: eight C-bricks connected through 6-port router r-bricks, with I-, P-, or X-bricks on the XIO+ ports and power bays.]
SGI 3800 (16-128p)
[Diagram: system topology. Minimum (16p) system: four C-bricks, two 8-port router R-bricks, an I-brick, and power bays. Maximum (128p) system: four racks, each holding eight C-bricks and two R-bricks; the 128-processor system can also be viewed as eight 16-processor building blocks.]
SGI 3800 (32-512p)
[Diagram: one quadrant of a 512p system: C-bricks, R-bricks, P-, I-, or X-bricks, and power bays spread over several racks.]

512p power estimates (no I/O or storage included; premium memory required):
• MIPS = 77 KW
• Itanium = 150 KW
• McKinley = 231 KW
Router-to-Router Connections for 256 Processor Systems
[Diagram: router interconnect topology for 256-processor systems.]

512 Processor Systems
[Diagram: router interconnect topology for 512-processor systems.]
R1xK Family of Processors
The MIPS R1x000 is an out-of-order, dynamically scheduled superscalar processor with non-blocking caches:
• Supports the 64-bit MIPS IV ISA
• 4-way superscalar
• Five separate execution units
• 2 floating-point results / cycle
• 4-way deep speculative execution of branches
• Out-of-order execution (48-instruction window)
• Register renaming
• Two-way set-associative non-blocking caches
– up to 4 outstanding memory read requests
– prefetching of data
– 1 MB to 8 MB secondary data cache
• Four user-accessible event counters
MIPS Processor Roadmap
[Roadmap chart, 1999-2003:]
• R10000: 250 MHz, 500 MFlops
• R12000: 300 MHz, 600 MFlops
• R12000A: 400 MHz, 800 MFlops
• R14000(A): 500+ MHz, 1000+ MFlops, 8 MB DDR SRAM L2 @ 250+ MHz
• R16000: xxx MHz, xxx GFlops
• R18000: xxx MHz, xxx GFlops
Chart annotations: Origin 2000 (8 MB @ 266 MHz); O3K-MIPS (8 MB @ 200 MHz, 4 MB @ 250 MHz)
R14000 Cache Interfaces
Memory Hierarchy
[Chart: the memory hierarchy plotted as speed of access (1/clock) versus device capacity: 64 registers (~2-3 cycles), 32 KB L1 cache (~10 cycles), 8 MB L2 cache, memory (~100-300 cycles, NUMA), and disk (~4000 cycles, ~1 to 100s of GB).]
[Chart: remote latency (ns) for Origin 2000 versus Origin 3000 as the system grows from 2p to 512p; the data points shown range from 175 ns up to 1169 ns, with the Origin 3000 curve well below the Origin 2000 curve.]
Effects of Memory Hierarchy
[Chart: the effect of the 32 KB L1 cache and of 1 MB, 2 MB, and 4 MB L2 caches on memory performance.]
Instruction Latencies (R12K)
Integer units                              latency    repeat rate
  ALU 1
    add, sub, logic ops, shift, br          1          1
  ALU 2
    add, sub, logic ops                     1          1
    signed multiply (32/64 bit)             6/10       6/10
      (unsigned multiply: +1 cycle)
    divide (32/64 bit)                      35/67      35/67
  Address unit
    load integer                            2          1
    load floating point                     3          1
    store                                              1
    atomic LL, ADD, SC sequence             6          6

Floating point units                       latency    repeat rate
  FPU 1
    add, sub, compare, convert              2          1
  FPU 2
    multiply                                2          1
    multiply-add (madd)                     4          1
  FPU 3
    divide, reciprocal (32/64 bit)          12/19      14/21
    sqrt (32/64 bit)                        18/33      20/35
    rsqrt (32/64 bit)                       30/52      34/56

A repeat rate of 1 means that, after pipelining, the processor can complete 1 operation per cycle. Thus the peak rates are 2 integer operations/cycle and 2 fp operations/cycle; for the R14000 @ 500 MHz that is 4 x 500 MHz = 2000 MIPS and 2 x 500 MHz = 1000 Mflop/s.

The compiler has this table built in. The goal of compiler scheduling is to find instructions that can be executed in parallel to fill all the slots: ILP, Instruction Level Parallelism.
Instruction Latencies: DAXPY Example
      DO I=1,n
        Y(I) = Y(I) + A*X(I)
      ENDDO

Loop parallelism (per single loop iteration):
• 2 loads, 1 store
• 1 multiply-add (madd)
• 2 address increments
• 1 loop-end test
• 1 branch

Processor parallelism (per processor cycle):
• 1 load or store
• 1 ALU1 instruction
• 1 ALU2 instruction
• 1 FP add
• 1 FP multiply

– There are 2 loads (x, y) and 1 store (y) = 3 memory operations
– There are 2 fp operations (+, *), which can be done with 1 madd
• The 3 memory operations require at least 3 cycles (the processor can do 1 memory op per cycle)
• Theoretically, in 3 cycles the processor can do 6 fp operations, but only 2 fp operations are available in the code
• The maximum processor speed on this code is therefore 2 fp / 6 fp = 1/3 of peak; i.e. for the R12000 @ 300 MHz, 600/3 = 200 Mflop/s
DAXPY Example: Schedules
      DO I=1,n
        Y(I) = Y(I) + A*X(I)
      ENDDO

Simple schedule (8 cycles per iteration):
  instructions issued: ld x, ld y, x++, madd, y++, st y, br
  2 fp / (8 cycles x 2 fp/cycle) = 1/8 of peak; R12000 @ 300 MHz ~ 75 Mflop/s

Unrolled by 2:

      DO I=1,n-1,2
        Y(I+0) = Y(I+0) + A*X(I+0)
        Y(I+1) = Y(I+1) + A*X(I+1)
      ENDDO

Schedule (9 cycles per 2 iterations):
  instructions issued: ld x0, ld x1, ld y0, x+=4, ld y1, madd0, madd1, st y0, st y1, y+=4, br
  4 fp / (9 cycles x 2 fp/cycle) = 2/9 of peak; ~133 Mflop/s
DAXPY Example: Software Pipelining
• Software pipelining is the way to fill all processor slots by mixing iterations
• The replication count gives how many iterations are mixed
• The number of replications depends on the distance (in cycles) between the load and the calculation
#<swp> replication 0                                 #cy
  ld x0   ldc1   $f0,0($1)                           #[0]
  ld x1   ldc1   $f1,-8($1)                          #[1]
  st y2   sdc1   $f3,-8($3)                          #[2]
  st y3   sdc1   $f5,0($3)                           #[3]
  y+=2    addiu  $3,$2,16                            #[3]
          madd.d $f5,$f2,$f0,$f4                     #[4]
  ld y0   ldc1   $f0,-8($2)                          #[4]
          madd.d $f3,$f0,$f1,$f4                     #[5]
  x+=2    addiu  $1,$1,16                            #[5]
          beq    $2,$4,.BB21.daxpy                   #[5]
  ld y3   ldc1   $f2,0($3)                           #[5]

#<swp> replication 1                                 #cy
  ld x3   ldc1   $f1,0($1)                           #[0]
  ld x2   ldc1   $f0,-8($1)                          #[1]
  st y1   sdc1   $f3,-8($2)                          #[2]
  st y0   sdc1   $f5,0($2)                           #[3]
  y+=2    addiu  $2,$3,16                            #[3]
          madd.d $f5,$f2,$f1,$f4                     #[4]
  ld y3   ldc1   $f1,-8($3)                          #[4]
          madd.d $f3,$f1,$f0,$f4                     #[5]
  x+=2    addiu  $1,$1,16                            #[5]
  ld y0   ldc1   $f2,0($2)                           #[5]

• DAXPY 6-cycle schedule with 4 fp ops: 4 fp / (6 cy x 2 fp/cy) = 1/3 of peak
DAXPY SWP: Compiler Messages
f77 -mips4 -O3 -LNO:prefetch=0 -S daxpy.f

• With the -S switch the compiler produces the file daxpy.s with assembler instructions and comments about the software pipelining schedules:

#<swps>  Pipelined loop line 6 steady state
#<swps>    50 estimated iterations before pipelining
#<swps>    2 unrolling before pipelining
#<swps>    6 cycles per 2 iterations
#<swps>    4 flops          ( 33% of peak) (madds count as 2 fp)
#<swps>    2 flops          ( 16% of peak) (madds count as 1 fp)
#<swps>    2 madds          ( 33% of peak)
#<swps>    6 mem refs       (100% of peak)
#<swps>    3 integer ops    ( 25% of peak)
#<swps>   11 instructions   ( 45% of peak)
#<swps>    2 short trip threshold
#<swps>    7 ireg registers used.
#<swps>    6 fgr registers used.

• The schedule reaches the maximum of 1/3 of peak processor performance, as expected
• Note: it is necessary to switch off prefetching to attain the maximal schedule
Multiple Outstanding Mem Refs
• The processor can support 4 outstanding memory requests
[Diagram: with a “sequential” cache miss, execution stalls waiting for each load in turn; with “parallel” cache misses, independent instructions between the loads allow several outstanding misses to overlap their wait times.]
Timing linked-list references: while(x) x=x->p;

  outstanding refs     time per pointer fetch
  1                    230 ns   (480 ns)
  2                    160 ns   (250 ns)
  4                    110 ns   (240 ns)
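An expanded, runnable version of that timing loop might look like the following C sketch; the array size, padding, and use of clock() are assumptions for illustration, and a serious measurement would randomize the chain to defeat prefetching.

    /* Pointer-chase sketch: every load depends on the previous one, so a
     * single chain exposes the full "sequential" miss latency. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    struct node { struct node *p; char pad[120]; };   /* ~one 128-byte line */

    #define N 1000000

    int main(void)
    {
        struct node *nodes = malloc(N * sizeof *nodes);
        struct node *x;
        clock_t t0, t1;
        long i, steps = 0;

        for (i = 0; i < N - 1; i++)            /* build one long chain */
            nodes[i].p = &nodes[i + 1];
        nodes[N - 1].p = NULL;

        t0 = clock();
        for (x = nodes; x != NULL; x = x->p)   /* while(x) x = x->p; */
            steps++;
        t1 = clock();

        printf("%.1f ns per pointer fetch\n",
               1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)steps);
        free(nodes);
        return 0;
    }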
Origin 3000 Memory Latency
                     ORIGIN (O2K)                O3K
  Local              320 ns                      180 ns
  NI to NI           165 ns                      50 ns
  Per router         105 ns                      45 ns
  Remote             485 ns + #hops*105 ns       230 ns + #hops*45 ns

  32 CPU O3K max latency: 315 ns
Remote Memory Latency: SGI™ 3000 Family vs. SGI™ 2000 Series
[Chart: worst-case round-trip remote latency (ns) versus node size (2p to 1024p) for the Origin 2000 series and the Origin 3000 series (SN hypercube); the Origin 3000 curve stays well below the Origin 2000 curve.]
R1x000 Event Counters
The R1x000 processor family allows extensive performance monitoring with counters that can be triggered by 32 events:
• The R10000 has 2 event counters
• The R12000 has 4 event counters
A counter is incremented when the event selected by the user (e.g. a cache miss) happens in the processor.
The first counter can be triggered by events 0-15; the second counter is incremented in response to events 16-31.
The R12000 has 2 additional counters that allow monitoring of conditional events (i.e. events based on previous events).
User access to the counters is through a software library or shell-level tools provided by the IRIX OS.
Origin Address Space
• Physically, the memory is distributed and is not contiguous
• Physical address (40 bits, 1 TB max): bits 39-32 hold the node id (8 bits, assigned at boot time) and bits 31-0 hold the node offset (32 bits, at most 4 GB of memory per node); see the decoding sketch below
• Logically, memory is a single shared contiguous address space; the virtual address space is 44 bits (16 TB)
• The program (compiler) uses the virtual address space
• Translation from the virtual to the physical address space is done by the CPU (TLB = Translation Look-aside Buffer)
[Diagram: virtual pages (Page 0, 1, 2, …, n) map through the TLB onto physical pages scattered over the node memories (node ids 0, 1, 2, 3, 4, …); the physical address space contains empty slots where no memory is present.]
• The page size is configurable as 16 KB (default), 64 KB, 256 KB, 1 MB, 4 MB, or 16 MB
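A small illustrative C sketch of that physical address split (the example address is made up, and this is not an IRIX interface):

    /* Decode a 40-bit Origin physical address: bits 39..32 = node id,
     * bits 31..0 = offset within that node's local memory. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t paddr   = 0x0000002080001000ULL;      /* hypothetical address */
        unsigned node_id = (unsigned)((paddr >> 32) & 0xFF);
        unsigned offset  = (unsigned)(paddr & 0xFFFFFFFFULL);

        printf("node %u, offset 0x%08x\n", node_id, offset);  /* node 32, offset 0x80001000 */
        return 0;
    }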
Process Scheduling
IRIX is a symmetric multiprocessing operating system:
• Processes and processors are independent
• Parallel programs are executed as jobs with multiple processes
• The scheduler allocates processes to processors

The priority range is 0 to 255:
  0        weightless (batch)
  1-40     time share (interactive) (TS)
  90-239   system (daemons and interrupts)
  1-255    real-time processes (FIFO & RR)
System Monitoring Commands
  uptime(1)        returns information about system usage and user load
  w(1)             who is on the system and what are they doing?
  sysmon           system log viewer
  ps(1)            a "snapshot" of the process table
  top, gr_top      process table dynamic display
  osview           system usage statistics
  sar              system activity reporter
  gr_osview        system usage statistics in graphical form
  gmemusage        graphical memory usage monitor
  sysconf          system limits, options, and parameters
System Monitoring Commands
  ecstats -C           R10K counter monitor
  ja                   job accounting statistics
  oview                Performance Co-Pilot (bundled with IRIX)
  pmchart              Performance Co-Pilot (licensed software)
  nstats, linkstat     CrayLink connection statistics (man refcnt(5))
  bufview              system buffer statistics
  par                  process activity report
  numa_view, dlook     process memory placement information
  limit [-h]           displays system soft [hard] limits
System Monitoring Commands
  hinv                 hardware inventory
  topology             system interconnect description
Summary: Origin Properties
• Single machine image
– it behaves like a fat workstation: same compilers, time sharing
– all your old code will run
– the OS schedules all the hardware resources on the machine
• Processor scalability: 2-512 CPUs
• I/O scalability: 2-300 GB/s
• All memory and I/O devices are directly addressable
– no limitation on the size of a single program; it can use all the available memory
– no limitation on the location of the data; all disks can be used in a single file system
• 64-bit operating system and file system
– HPC features: Checkpoint/Restart, DMF, NQE/LSF, TMF, Miser, job limits, cpusets, enhanced accounting
• Machine stability