Cray XE7 Architecture Slides


Node Characteristics
Number of Cores: 32
Peak Performance (2.3 GHz): 294 Gflops
Memory Size: 32-128 GB per node
Memory Bandwidth: 102 GB/sec
[Figure: node diagram with axes labeled X, Y, Z]
• Composed of 8 “core modules”
• A core module has shared and dedicated components
  - Two independent integer schedulers and a shared 2x128-bit FP resource
• Flexible architecture
  - A single “thread” can use the entire FP resource
• New AVX instruction set
  - AVX = Advanced Vector eXtensions
  - Both 128- and 256-bit “vector length” instructions
[Diagram: core module. Dedicated components: two integer schedulers, each with an integer functional unit. Shared at the module level: fetch, decode, FP scheduler, two 128-bit FMACs, 2 Mbyte L2 cache. Shared at the chip level: 8 Mbyte L3 cache and NB.]



• 1 MPI Rank or Thread per Core Module
  - Each MPI rank has exclusive access to the 2x128-bit FP unit and is capable of 8 FP results per clock cycle
  - Maximize memory/core and memory/rank
  - Larger L2/L3 cache per MPI rank
  - The peak of the chip is not reduced
  - Better with well vectorized code
[Diagram: core module (fetch, decode, two integer schedulers and functional units, FP scheduler, two 128-bit FMACs, shared 2 Mbyte L2 cache) with components marked active or idle]
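
The claim that the chip peak is not reduced with one rank per module can be checked against the node figures quoted earlier. The sketch below is only an arithmetic illustration: the module count of 16 is an assumption derived from 32 cores at two integer cores per module, and the function names are hypothetical.

```python
# Figures quoted on the slides. The module count is an assumption:
# 32 cores per node at two integer cores per core module.
CLOCK_GHZ = 2.3
CORES_PER_NODE = 32
CORES_PER_MODULE = 2
FLOPS_PER_MODULE_PER_CYCLE = 8  # "8 FP results per clock cycle"

MODULES_PER_NODE = CORES_PER_NODE // CORES_PER_MODULE


def node_peak_gflops(clock_ghz=CLOCK_GHZ):
    """Peak scales with the number of FP units (modules), not with ranks per module."""
    return MODULES_PER_NODE * FLOPS_PER_MODULE_PER_CYCLE * clock_ghz


def memory_per_rank_gb(node_memory_gb, ranks_per_module):
    """Memory available to each MPI rank for a given placement."""
    return node_memory_gb / (MODULES_PER_NODE * ranks_per_module)


if __name__ == "__main__":
    print(f"node peak: {node_peak_gflops():.1f} Gflops")
    for ranks in (1, 2):
        gb = memory_per_rank_gb(32, ranks)
        print(f"{ranks} rank(s) per module: {gb} GB per rank on a 32 GB node")
```

Under these assumptions the result matches the quoted node peak of about 294 Gflops, and halving the ranks per module doubles the memory per rank.
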

• 2 MPI Ranks or Threads per Core Module
  - Each unit has exclusive access to an integer scheduler, integer pipelines and L1 Dcache
  - The 2x128-bit FP unit and the L2 cache are shared between units
  - AVX instructions are dynamically executed as two 128-bit instructions utilizing either or both FP units
  - Best for highly parallel integer or mostly scalar applications
[Diagram: core module with MPI Rank 1 and MPI Rank 2 each on its own integer scheduler and functional unit; fetch, decode, FP scheduler, two 128-bit FMACs and the 2 Mbyte L2 cache marked as shared components]

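
The way a 256-bit operation is executed as two 128-bit instructions can be pictured with a toy model. This is a rough sketch rather than a description of the hardware scheduler: the function names are hypothetical, lanes are modeled as Python floats, and the multiply-add here rounds twice, unlike a true fused operation.

```python
def fmac_128(a, b, c):
    """Toy model of one 128-bit FMAC pass: multiply-add on two 64-bit lanes."""
    if not (len(a) == len(b) == len(c) == 2):
        raise ValueError("a 128-bit operand holds two 64-bit lanes")
    return [x * y + z for x, y, z in zip(a, b, c)]


def vector256_multiply_add(a, b, c):
    """Toy model of a 256-bit operation (four 64-bit lanes) split into two halves.

    Each half is handled by a 128-bit FMAC. In hardware the two halves may go
    to both units at once or to one unit in turn; the model ignores timing.
    """
    if not (len(a) == len(b) == len(c) == 4):
        raise ValueError("a 256-bit operand holds four 64-bit lanes")
    low = fmac_128(a[:2], b[:2], c[:2])    # lanes 0-1
    high = fmac_128(a[2:], b[2:], c[2:])   # lanes 2-3
    return low + high


if __name__ == "__main__":
    print(vector256_multiply_add([1.0, 2.0, 3.0, 4.0], [2.0] * 4, [1.0] * 4))
```

The split is invisible to the program: the result is the same whichever FMAC handles each half, which is why the FP unit can be shared between two ranks.
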
Node Characteristics
Number of X86 Cores: 16
X86 Peak: 147 Gflops
Accelerator Peak: ~1 Tflop
X86 Memory: 16 or 32 GB capacity at 51 GB/sec
Accelerator Memory: 12 GB capacity at 225 GB/sec
[Figure: node diagram with axes labeled X, Y, Z]

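
One way to compare the two sides of this node is the ratio of memory bandwidth to peak floating-point rate. The sketch below uses only the figures listed above; the accelerator peak is quoted as approximately 1 Tflop, so its ratio is approximate, and the function names are hypothetical.

```python
def bytes_per_flop(mem_bw_gb_s, peak_gflops):
    """Memory bytes that can be moved per peak floating-point operation."""
    return mem_bw_gb_s / peak_gflops


def flops_per_byte_to_reach_peak(mem_bw_gb_s, peak_gflops):
    """Arithmetic intensity a kernel needs before peak compute, not memory, is the limit."""
    return peak_gflops / mem_bw_gb_s


# Figures from the node table; the accelerator peak is approximate ("~1 Tflop").
X86 = {"mem_bw_gb_s": 51.0, "peak_gflops": 147.0}
ACCELERATOR = {"mem_bw_gb_s": 225.0, "peak_gflops": 1000.0}

if __name__ == "__main__":
    for name, part in (("x86", X86), ("accelerator", ACCELERATOR)):
        print(
            f"{name}: {bytes_per_flop(**part):.3f} bytes/flop, "
            f"{flops_per_byte_to_reach_peak(**part):.2f} flops/byte to reach peak"
        )
```

With these numbers the x86 side offers roughly 0.35 bytes per flop and the accelerator roughly 0.23, so the accelerator needs more arithmetic per byte of memory traffic to approach its peak.
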
• “Kepler” accelerator
  - Peak: ~1 Tflop (64-bit)
  - Memory: 12 GB
  - Memory BW: ~225 GB/sec
  - Several architectural improvements over the Fermi generation

• MPI Support
  - ~1.2 µs latency
  - ~15M independent messages/sec/NIC
  - BTE (Block Transfer Engine) for large messages
  - FMA (Fast Memory Access) stores for small messages
  - One-sided MPI
  - Small, scalable memory footprint
• Advanced Synchronization and Communication Features
  - Globally addressable memory
  - Atomic memory operations
  - Pipelined global loads and stores
  - ~25M (65M) independent (indexed) Puts/sec/NIC
  - Efficient support for UPC, CAF, and Global Arrays
• Embedded high-performance router
  - Adaptive routing
  - Scales to over 100,000 endpoints

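
The latency and message-rate figures can be related with Little's law (items in flight = rate x time in system). The sketch below assumes the quoted latency applies to each message while the NIC sustains the quoted rate, which the slide does not state; it is meant only to show why many messages must overlap to reach the peak rate.

```python
LATENCY_S = 1.2e-6         # "~1.2 µs latency"
MESSAGE_RATE_PER_S = 15e6  # "~15M independent messages/sec/NIC"


def messages_in_flight(rate_per_s, latency_s):
    """Little's law: average number of messages outstanding at a given rate."""
    return rate_per_s * latency_s


def issue_interval_ns(rate_per_s):
    """Average time between message injections at the given rate."""
    return 1e9 / rate_per_s


if __name__ == "__main__":
    print(f"in flight: {messages_in_flight(MESSAGE_RATE_PER_S, LATENCY_S):.1f}")
    print(f"issue interval: {issue_interval_ns(MESSAGE_RATE_PER_S):.1f} ns")
```

Under that assumption about 18 messages are outstanding on average, so a rank that waits for each message to complete before sending the next cannot approach the quoted rate.
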
• Globally addressable memory provides efficient support for UPC, Co-array Fortran, SHMEM and Global Arrays
  - Cray Programming Environment will target this capability directly
• Pipelined global loads and stores
  - Allow for fast irregular communication patterns
• Atomic memory operations
  - Provide fast synchronization needed for one-sided communication models
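
The one-sided model described here can be illustrated with a toy, single-process simulation. `ToyGlobalArray` and `count_with_atomics` are hypothetical names invented for this sketch; they are not part of any Cray, SHMEM, UPC or Global Arrays API, and Python threads stand in for ranks.

```python
import threading


class ToyGlobalArray:
    """Toy model of globally addressable memory partitioned across ranks."""

    def __init__(self, n_ranks, words_per_rank):
        self.words_per_rank = words_per_rank
        self._segments = [[0] * words_per_rank for _ in range(n_ranks)]
        self._locks = [threading.Lock() for _ in range(n_ranks)]

    def _locate(self, global_index):
        # Map a global index to (owning rank, offset within that rank's segment).
        return divmod(global_index, self.words_per_rank)

    def put(self, global_index, value):
        """One-sided store: the owning rank takes no action."""
        rank, offset = self._locate(global_index)
        with self._locks[rank]:
            self._segments[rank][offset] = value

    def get(self, global_index):
        """One-sided load."""
        rank, offset = self._locate(global_index)
        with self._locks[rank]:
            return self._segments[rank][offset]

    def atomic_fetch_add(self, global_index, delta):
        """Atomic read-modify-write; returns the value before the update."""
        rank, offset = self._locate(global_index)
        with self._locks[rank]:
            old = self._segments[rank][offset]
            self._segments[rank][offset] = old + delta
            return old


def count_with_atomics(n_workers=4, increments=1000):
    """Several workers bump one shared counter without any receive-side code."""
    ga = ToyGlobalArray(n_ranks=4, words_per_rank=8)
    counter = 13  # global index 13 lives on rank 1 at offset 5

    def worker():
        for _ in range(increments):
            ga.atomic_fetch_add(counter, 1)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ga.get(counter)


if __name__ == "__main__":
    print(count_with_atomics())
```

The point of the model is that synchronization comes from the atomic update itself, not from a matching receive on the owning rank, which is what one-sided communication models rely on.
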