Cray XE7 Architecture Slides
Node Characteristics
Number of Cores: 32
Peak Performance (2.3 GHz): 294 Gflops
Memory Size: 32-128 GB per node
Memory Bandwidth: 102 GB/sec
[Figure: node diagram with X, Y, and Z axis labels]
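
The peak figure above can be cross-checked with a short calculation. This is only a sketch, and it rests on one assumption not stated on this slide: each core contributes 4 floating-point results per clock, which is half of the 8 FP results per clock that a later slide quotes for one core module.

```python
# Cross-check of the node peak and its memory balance.
# ASSUMPTION: 4 flops per clock per core (see the note above).
# All other numbers are taken from the slide.
CORES_PER_NODE = 32
CLOCK_GHZ = 2.3
FLOPS_PER_CLOCK_PER_CORE = 4  # assumed, not stated on this slide
MEM_BW_GB_PER_S = 102

peak_gflops = CORES_PER_NODE * CLOCK_GHZ * FLOPS_PER_CLOCK_PER_CORE
bytes_per_flop = MEM_BW_GB_PER_S / peak_gflops

print(f"peak: {peak_gflops:.1f} Gflops")
print(f"balance: {bytes_per_flop:.2f} bytes/flop")
```

Under that assumption the product rounds to the quoted 294 Gflops, and the node offers roughly a third of a byte of memory bandwidth per peak flop.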

Composed of 8 “core modules”
A core module has shared and dedicated components
    Two independent integer schedulers and a shared 2x128-bit FP resource
    A single “thread” can use the entire FP resource
New AVX instruction set
    AVX = Advanced Vector eXtensions
    Both 128 and 256 bit “vector length” instructions
Flexible architecture
[Figure: core module block diagram. Dedicated components: two integer schedulers, each with its own integer functional unit. Shared at the module level: fetch, decode, FP scheduler, two 128-bit FMACs, 2 Mbyte L2 cache. Shared at the chip level: 8 Mbyte L3 cache and NB.]
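
As a small illustration of the two vector lengths, the sketch below counts how many elements fit in one vector. The 32-bit and 64-bit element sizes are the usual single and double precision widths; they are not figures from the slide.

```python
# Elements ("lanes") per vector for each AVX vector length.
VECTOR_BITS = (128, 256)                        # from the slide
ELEMENT_BITS = {"float32": 32, "float64": 64}   # assumed element widths

lanes = {
    (vbits, name): vbits // ebits
    for vbits in VECTOR_BITS
    for name, ebits in ELEMENT_BITS.items()
}

for (vbits, name), n in sorted(lanes.items()):
    print(f"{vbits}-bit vector holds {n} x {name}")
```

A 256-bit vector therefore spans the full width of the shared 2x128-bit FP resource, while a 128-bit vector fits in one of its two halves.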

1 MPI Rank or Thread per Core Module
Each MPI rank has exclusive access to the 2x128-bit FP unit and is capable of 8 FP results per clock cycle
Maximizes memory per core and memory per rank
Larger L2/L3 cache per MPI rank
The peak of the chip is not reduced
Better with well vectorized code
[Figure: core module block diagram marking which components are active and which are idle when a single rank or thread runs on the module]
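
The figure of 8 FP results per clock is consistent with the count below. It assumes 64-bit operands and counts a fused multiply-accumulate as two floating-point operations; both assumptions are illustrative, since the slide only states the total.

```python
# One way to arrive at "8 FP results per clock" (a sketch under the
# assumptions named above, not a statement of how the slide derived it).
FMAC_UNITS = 2          # two 128-bit FMACs per core module (from the slide)
FMAC_WIDTH_BITS = 128   # from the slide
OPERAND_BITS = 64       # ASSUMPTION: double precision operands
OPS_PER_FMAC_LANE = 2   # ASSUMPTION: multiply and add counted separately

lanes_per_unit = FMAC_WIDTH_BITS // OPERAND_BITS
results_per_clock = FMAC_UNITS * lanes_per_unit * OPS_PER_FMAC_LANE
print(results_per_clock)
```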

2 MPI Ranks or Threads per Core Module
Each unit has exclusive access to an integer scheduler, integer pipelines and L1 Dcache
The 2x128-bit FP unit and the L2 cache are shared between the two units
256-bit AVX instructions are dynamically executed as two 128-bit instructions, using either or both 128-bit FMACs
Best for highly parallel integer or mostly scalar applications
[Figure: core module block diagram with MPI Rank 1 and MPI Rank 2 each on its own integer scheduler and integer functional unit; the shared components include the FP scheduler, both 128-bit FMACs, and the 2 Mbyte L2 cache]
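
The trade-off between the two placements can be tabulated from numbers already quoted in these slides (32 cores per node, 2.3 GHz, 8 FP results per clock per module, 2 Mbyte L2 per module). The sketch below makes three assumptions of its own: two cores per module (hence 16 modules per node), a 32 GB node (the low end of the quoted range), and an even split of shared resources between ranks, which is a simplification.

```python
# Per-rank resources for 1 vs. 2 MPI ranks per core module.
MODULES_PER_NODE = 16            # ASSUMPTION: 32 cores / 2 cores per module
CLOCK_GHZ = 2.3
FLOPS_PER_CLOCK_PER_MODULE = 8
L2_MB_PER_MODULE = 2
NODE_MEMORY_GB = 32              # ASSUMPTION: smallest quoted configuration


def per_rank(ranks_per_module):
    """Per-rank figures, splitting shared resources evenly (simplified)."""
    ranks = MODULES_PER_NODE * ranks_per_module
    return {
        "ranks_per_node": ranks,
        "peak_gflops": CLOCK_GHZ * FLOPS_PER_CLOCK_PER_MODULE / ranks_per_module,
        "l2_mb": L2_MB_PER_MODULE / ranks_per_module,
        "memory_gb": NODE_MEMORY_GB / ranks,
    }


for n in (1, 2):
    print(n, per_rank(n))
```

With one rank per module each rank sees twice the peak, L2 cache, and memory of the two-rank placement, while the node total stays the same, which matches the remark that the peak of the chip is not reduced.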

Node Characteristics
Number of X86 Cores: 16
X86 Peak: 147 Gflops
Accelerator Peak: ~1 Tflop
X86 Memory: 16 or 32 GB capacity at 51 GB/sec
Accelerator Memory: 12 GB capacity at 225 GB/sec
[Figure: node diagram with X, Y, and Z axis labels]

“Kepler” accelerator
Peak: ~1 Tflop (64-bit)
Memory: 12 GB
Memory BW: ~225 GB/sec
Several architectural improvements over Fermi generation
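
To compare the two sides of the accelerated node, the sketch below computes the peak ratio and the memory bandwidth available per peak flop from the figures quoted in these slides. The "~1 Tflop" accelerator peak is treated as exactly 1000 Gflops, which is an approximation.

```python
# Compare the x86 and accelerator sides of the node using the slide figures.
X86_PEAK_GFLOPS = 147
X86_MEM_BW_GB_S = 51
ACC_PEAK_GFLOPS = 1000   # "~1 Tflop", rounded for this sketch
ACC_MEM_BW_GB_S = 225

peak_ratio = ACC_PEAK_GFLOPS / X86_PEAK_GFLOPS
x86_balance = X86_MEM_BW_GB_S / X86_PEAK_GFLOPS   # bytes per flop
acc_balance = ACC_MEM_BW_GB_S / ACC_PEAK_GFLOPS   # bytes per flop

print(f"accelerator/x86 peak ratio: {peak_ratio:.1f}")
print(f"x86 balance: {x86_balance:.2f} bytes/flop")
print(f"accelerator balance: {acc_balance:.2f} bytes/flop")
```

On these numbers the accelerator has several times the peak of the x86 side but somewhat less memory bandwidth per flop.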

MPI Support
    ~1.2 μs latency
    ~15M independent messages/sec/NIC
    BTE (Block Transfer Engine) for large messages
    FMA (Fast Memory Access) stores for small messages
    One-sided MPI
    Small, scalable memory footprint
Advanced Synchronization and Communication Features
    Globally addressable memory
    Atomic memory operations
    Pipelined global loads and stores
    ~25M (65M) independent (indexed) Puts/sec/NIC
    Efficient support for UPC, CAF, and Global Arrays
Embedded high-performance router
    Adaptive routing
    Scales to over 100,000 endpoints
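
To put the latency and message-rate figures side by side, here is a back-of-the-envelope model. It reads the latency as about 1.2 microseconds and assumes that independent small messages are issued back to back at the quoted rate, so that only the first one pays the full latency. That pipelining model and the function name are assumptions for illustration, not something the slide states.

```python
# Simple pipelined-issue model for n small independent messages from one NIC.
LATENCY_US = 1.2          # ~1.2 microseconds (from the slide)
MSG_RATE_PER_S = 15e6     # ~15M independent messages/sec/NIC (from the slide)


def time_for_messages_us(n):
    """Estimated microseconds to deliver n messages under the assumed model."""
    if n <= 0:
        return 0.0
    gap_us = 1e6 / MSG_RATE_PER_S   # spacing between successive messages
    return LATENCY_US + (n - 1) * gap_us


for n in (1, 100, 10_000):
    print(n, round(time_for_messages_us(n), 1))
```

In this model the per-message gap, not the latency, dominates once many independent messages are in flight.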

Globally addressable memory provides efficient support for UPC, Co-array Fortran, SHMEM and Global Arrays
    Cray Programming Environment will target this capability directly
Pipelined global loads and stores
    Allow for fast irregular communication patterns
Atomic memory operations
    Provide fast synchronization needed for one-sided communication models
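
The programming style these features support can be illustrated with a toy model. The sketch below is not the Cray, SHMEM, or UPC API: `ToyGlobalMemory` and its methods are hypothetical names, Python threads stand in for processing elements (PEs), and a lock stands in for hardware atomicity. It only shows the semantics of one-sided puts and an atomic fetch-and-add on memory owned by another PE.

```python
import threading


class ToyGlobalMemory:
    """Toy model: one array per PE, addressable by any PE as (pe, index)."""

    def __init__(self, n_pes, words_per_pe):
        self._mem = [[0] * words_per_pe for _ in range(n_pes)]
        self._lock = threading.Lock()   # stands in for hardware atomicity

    def put(self, pe, index, value):
        """One-sided store into another PE's memory."""
        with self._lock:
            self._mem[pe][index] = value

    def get(self, pe, index):
        """One-sided load from another PE's memory."""
        with self._lock:
            return self._mem[pe][index]

    def fetch_add(self, pe, index, amount):
        """Atomic fetch-and-add; returns the value before the update."""
        with self._lock:
            old = self._mem[pe][index]
            self._mem[pe][index] = old + amount
            return old


def run(n_pes=4, increments=1000):
    gm = ToyGlobalMemory(n_pes, words_per_pe=2)

    def pe_main(me):
        right = (me + 1) % n_pes
        gm.put(right, 1, me)          # write my id into my neighbor's word 1
        for _ in range(increments):
            gm.fetch_add(0, 0, 1)     # every PE bumps a counter owned by PE 0

    threads = [threading.Thread(target=pe_main, args=(pe,)) for pe in range(n_pes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return gm


gm = run()
print(gm.get(0, 0))   # total number of increments across all PEs
```

No PE ever posts a matching receive: the target's memory is updated directly, and the atomic update keeps the shared counter consistent without any message exchange between the PEs.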