Scientific Computations on Modern Parallel Vector Systems
Leonid Oliker
Computer Staff Scientist
Future Technologies Group
Computational Research Division
Lawrence Berkeley National Laboratory
[email protected]
http://crd.lbl.gov/~oliker/paperlinks.html
Overview
• Superscalar cache-based architectures dominate the US HPC market
• Leading architectures are commodity-based SMPs due to cost effectiveness and generality
• The growing gap between peak & sustained performance is well known in scientific computing
• Modern parallel vector systems may bridge this gap for many important applications
• In April 2002, the Earth Simulator (ES) became operational:
  – Peak 40 TF, LINPACK 35.9 TF (90% of peak)
  – ES performance > all DOE and DOD systems combined
  – ES achieved over 26 TF (65% of peak) on an atmospheric general circulation model
• Currently conducting an evaluation study of DOE applications on modern parallel vector architectures: second year of a three-year project
• In September 2003, a Memorandum of Understanding (MOU) between NERSC and ES was completed
• Visited the ES center December 8th-17th, 2003
  – First international team to conduct a performance evaluation study at ES
Vector Paradigm
• High memory bandwidth
  – Allows the system to effectively feed the ALUs (high byte-to-flop ratio)
• Flexible memory addressing modes
  – Support fine-grained strided and irregular data access
• Vector registers
  – Hide memory latency via deep pipelining of memory loads/stores
• Vector ISA
  – A single instruction specifies a large number of identical operations
• Vector architectures allow for:
  – Reduced control complexity
  – Efficient utilization of a large number of computational resources
  – Potential for automatic discovery of parallelism
• However: only effective if sufficient regularity is discoverable in the program structure
  – Performance suffers greatly even if a small % of the code is non-vectorizable (Amdahl's Law); see the sketch below
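To make the regularity requirement concrete, here is a minimal Fortran 90 sketch (not from the talk; array names and sizes are illustrative) contrasting a loop that vectorizes with one that does not:

    program vector_example
      implicit none
      integer, parameter :: n = 1000000
      real(8) :: a(n), b(n), c(n), s
      integer :: i

      call random_number(b); call random_number(c); s = 2.0d0

      ! Vectorizable: iterations are independent, so a single vector
      ! instruction can cover many identical operations.
      do i = 1, n
         a(i) = b(i) + s*c(i)
      end do

      ! Not vectorizable: a(i) depends on a(i-1), forcing scalar execution;
      ! even a small fraction of such code limits speedup (Amdahl's Law).
      do i = 2, n
         a(i) = a(i-1) + b(i)
      end do

      print *, a(n)
    end program vector_example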
ES Processor Overview
• 8 Gflops per CPU; 8 CPUs per SMP node (Earth Simulator: 640 nodes)
• 8-way replicated vector pipes; divide unit
• 72 vector registers of 256 64-bit words each
• 32 GB/s pipe to main memory
• 4-way superscalar out-of-order scalar unit @ 1 Gflop, 64 KB I$ & D$
• Memory: ES uses newly developed FPLRAM (Full Pipelined RAM); the SX6 uses 128/256 Mb DDR-SDRAM
• Interconnect: ES uses the IN, 12.3 GB/s bidirectional between any 2 nodes, 640 nodes; the SX6 uses the IXS, 8 GB/s bidirectional between any 2 nodes, max 128 nodes
Earth Simulator Overview
• Machine type: 640 nodes, each node an 8-way SMP of vector processors (5120 total procs)
• Peak: 40 TF/s (processor peak 8 GF/s)
• OS: extended version of Super-UX, a 64-bit Unix OS based on System V R3
• Connection structure: a single-stage crossbar network (1500 miles of cable, 83,000 copper cables)
  – 7.9 TB/s aggregate switching capacity
  – 12.3 GB/s bidirectional between any two nodes
• Global Barrier Counter within the interconnect allows global barrier synch in < 3.5 usec
• Storage: 480 TB disk, 1.5 PB tape
• Compilers: Fortran 90, HPF, ANSI C, C++
• Batch: similar to NQS, PBS
• Parallelization: vectorization at the processor level; OpenMP, Pthreads, MPI, HPF
Earth Simulator Cost
Approximate costs:
• Development: $400M
• Building: $70M
• Maintenance: $50M/year
• Electricity: $8M/year
ES Programming Environment
• Only benchmarking-size runs were submitted (no production runs)
• ES is not connected to the Internet
• Interactive, S cluster, L cluster (2 nodes, 14 nodes, 624 nodes)
  – Files required/generated on the L cluster must be staged in/out (job script)
• Using >10 nodes requires a minimum vectorization ratio of 95% and a parallelization efficiency of 50%
  – Examples of the required parallelization ratio (as defined by Amdahl's Law); see the worked example below:
    16 nodes 99.21%; 64 nodes 99.80%; 256 nodes 99.95%
• All codes were ported/vectorized on a single-node SX6 at ARSC (SC2003)
• Multi-node vector runs were first attempted at the ES center
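A short worked example (not from the slides) of how these thresholds follow from Amdahl's Law, assuming P total processors (8 per node) and the 50% parallel-efficiency requirement:

    S(P) = \frac{1}{(1-f) + f/P}, \qquad
    E(P) = \frac{S(P)}{P} \ge \tfrac{1}{2}
    \;\Longrightarrow\;
    f \ge \frac{1 - 2/P}{1 - 1/P}

With 8 processors per node, 16 nodes gives P = 128 and f >= (1 - 2/128)/(1 - 1/128) ≈ 0.9921, i.e. 99.21%; P = 512 and P = 2048 give 99.80% and 99.95%, matching the thresholds listed above.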
Cray X1 Overview
• SSP: 3.2 GF computational core
  – VL = 64, dual vector pipes (800 MHz)
  – 2-way scalar unit, 0.4 GF (400 MHz)
• MSP: 12.8 GF, combines 4 SSPs
  – Shares a 2 MB data cache (unique among vector systems)
• Node: 4 MSPs with flat shared memory
• Interconnect: modified 2D torus
  – Fewer links than a full crossbar, at the cost of smaller bisection bandwidth
• Globally addressable: processors can directly read/write remote memory
• Parallelization:
  – Vectorization (SSP)
  – Multistreaming (MSP)
  – Shared memory (OpenMP, Pthreads)
  – Inter-node (MPI-2, CAF, UPC)
Altix3000 Overview
• Itanium2 @ 1.5 GHz (peak 6 GF/s): 128 FP registers, 32 KB L1, 256 KB L2, 6 MB L3
  – Cannot store FP values in L1
• EPIC instruction bundles: bundles are processed in-order, instructions within a bundle in parallel
• Consists of "C-bricks": 4 Itanium2 processors, memory, and 2 controller ASICs called SHUB
• Uses the high-bandwidth, low-latency NUMAlink3 interconnect (fat-tree)
• Implements the CC-NUMA protocol in hardware
  – A cache miss causes data to be communicated/replicated via the SHUB
• Uses 64-bit Linux with a single system image (256 processors / a few reserved for OS services)
• Scalability to large numbers of processors?
Architectural Comparison
Node Type  Where  CPU/Node  Clock(MHz)  Peak(GFlop/s)  Mem BW(GB/s)  Peak(byte/flop)  Netwk BW(GB/s/P)  Bisect BW(byte/flop)  MPI Latency(usec)  Network Topology
Power3     NERSC     16        375          1.5            1.0            0.47             0.13                0.087               16.3             Fat-tree
Power4     ORNL      32       1300          5.2            2.3            0.44             0.13                0.025                7.0             Fat-tree
Altix      ORNL       2       1500          6.0            6.4            1.1              0.40                0.067                2.8             Fat-tree
ES         ESC        8        500          8.0           32.0            4.0              1.5                 0.19                 5.6             Crossbar
X1         ORNL       4        800         12.8           34.1            2.7              6.3                 0.088                7.3             2D-torus

Custom vector architectures have:
• High memory bandwidth relative to peak
• Superior interconnect: latency, point-to-point, and bisection bandwidth
Overall, ES appears to be the most balanced architecture, while the Altix shows the best architectural balance among the superscalar architectures.
A key ‘balance point’ for vector systems is the scalar:vector ratio
Memory Performance
Triad memory test: A(i) = B(i) + s*C(i) (a microbenchmark sketch appears after the results below)
[Figure: left - Triad bandwidth (MB/s, log scale) vs. stride (0-500); right - Triad gather/scatter bandwidth (MB/s, log scale) vs. data size (100 bytes to 13 MB); curves for Power3, Power4, SX-6, and X1. NO machine-specific optimizations were applied.]
• For strided access, the SX6 achieves roughly 10X, 100X, 1000X improvement over the X1, Power4, Power3
• For gather/scatter, the SX6 and X1 show similar performance, exceeding the scalar systems at larger data sizes
• All machines' performance can be improved via architecture-specific optimizations
  – Example: on the X1, using the non-cacheable and unroll(4) pragmas improves strided bandwidth by 20X
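A minimal Fortran 90 sketch of the strided triad microbenchmark behind these plots (an illustrative reconstruction, not the actual test code; sizes, strides, and the timing approach are assumptions):

    program triad_stride
      implicit none
      integer, parameter :: n = 8*1024*1024
      real(8), allocatable :: a(:), b(:), c(:)
      real(8) :: s, t0, t1, mbytes
      integer :: i, stride, nelem

      allocate(a(n), b(n), c(n))
      b = 1.0d0; c = 2.0d0; s = 0.5d0

      do stride = 1, 512, 51
         call cpu_time(t0)
         do i = 1, n, stride          ! strided triad: A(i) = B(i) + s*C(i)
            a(i) = b(i) + s*c(i)
         end do
         call cpu_time(t1)
         nelem  = (n + stride - 1) / stride
         mbytes = 3.0d0 * 8.0d0 * nelem / 1.0d6   ! three 8-byte accesses per element
         print '(a,i4,a,f12.1)', 'stride ', stride, '  MB/s ~ ', mbytes / max(t1-t0, 1d-9)
      end do
      ! The gather/scatter variant replaces i with an index taken from a
      ! (random or permuted) index array, e.g. a(idx(i)) = b(idx(i)) + s*c(idx(i)).
    end program triad_stride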
Analysis using the 'Architectural Probe'
• Developed an architectural probe that allows us to stress the balance points of a processor design (PMEO-04)
• Tunable parameters mimic the behavior of important scientific kernels:
  – What % of memory accesses can be random before performance decreases by half (left)
  – How much computational intensity is required to hide the penalty of all-random access (right)
[Figure: left - % indirection at which performance drops by 50% (scale 0% to 100%); right - computational intensity (CI) required to hide indirection (values 9.3, 18.7, 74.7, 149.3); bars for Itanium2, Opteron, Power3, Power4.]
• Gather/scatter is expensive on commodity cache-based systems
  – Power4 is only slightly better, halving at 1.6% indirection (1 in 64 accesses)
  – Itanium2 is much less sensitive: 25% (1 in 4)
• A huge amount of computation may be required to hide the overhead of irregular data access
  – Itanium2 requires a CI of about 9 flops/word; Power4 requires a CI of almost 75!
• Interested in developing application-driven architectural probes for the evaluation of emerging petascale systems (a sketch of such a probe kernel follows)
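A minimal sketch of the kind of tunable kernel such a probe uses (the real probe is described in the PMEO-04 paper; the parameters frac and ci and the loop structure here are illustrative assumptions):

    program arch_probe
      implicit none
      integer, parameter :: n = 1000000
      real(8) :: a(n), b(n), r(n), s, x
      real(8) :: frac
      integer :: idx(n), i, k, ci

      frac = 0.016d0        ! fraction of accesses that are indirect (e.g. 1.6%)
      ci   = 8              ! computation steps per loaded word
      call random_number(b)
      call random_number(r)
      do i = 1, n
         if (r(i) < frac) then
            idx(i) = int(r(i)/frac * (n-1)) + 1   ! random location
         else
            idx(i) = i                            ! unit-stride location
         end if
      end do

      s = 1.000001d0
      do i = 1, n
         x = b(idx(i))              ! possibly indirect load
         do k = 1, ci               ! ~2*ci dependent flops on the loaded value
            x = x * s + 0.5d0
         end do
         a(i) = x
      end do
      print *, sum(a)
    end program arch_probe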
Sample Kernel Performance
[Figure: GFlops/s vs. processors (1-64) for NPB FT Class B (left) and N-body Barnes-Hut (right); curves for Power3, Power4, SX-6, X1.]
• FFT is computationally intensive with data-parallel operations
  – Well suited for vectorization: approximately 17X and 4X faster than Power3/Power4
  – The fixed cost of interprocessor communication hurts scalability
• N-body requires control irregularity and unstructured data access/communication
  – Poorly suited for vectorization: approximately 2X and 5X slower than Power3/Power4
• Vector architectures are not general-purpose machines
• Interested in exploring advanced kernel optimizations on the X1 system; preliminary work described in CUG04
Applications studied
Applications chosen with the potential to run at ultrascale:
• CACTUS    Astrophysics       100,000 lines   grid based
  – Solves Einstein's equations of general relativity
• PARATEC   Material Science    50,000 lines   Fourier space/grid
  – Density Functional Theory electronic structure code
• LBMHD     Plasma Physics       1,500 lines   grid based
  – Lattice Boltzmann approach to magneto-hydrodynamics
• GTC       Magnetic Fusion      5,000 lines   particle based
  – Particle-in-cell method for the gyrokinetic Vlasov-Poisson equation
• MADCAP    Cosmology            5,000 lines   dense linear algebra
  – Extracts key data from the Cosmic Microwave Background Radiation
Astrophysics: CACTUS
• Numerical solution of Einstein's equations from the theory of general relativity
• Among the most complex in physics: a set of coupled nonlinear hyperbolic & elliptic systems with thousands of terms
• CACTUS evolves these equations to simulate high gravitational fluxes, such as the collision of two black holes
• Uses the ADM formulation: the domain is decomposed into 3D hypersurfaces for different slices of space along the time dimension
• Exciting new field about to be born: Gravitational Wave Astronomy
  – Fundamentally new information about the Universe
  – What are gravitational waves? Ripples in spacetime curvature, caused by matter motion, causing distances to change
• Developed at the Max Planck Institute, vectorized by John Shalf
[Figure: visualization of a grazing collision of two black holes]
CACTUS: Parallelism
• Cactus is designed around a distributed-memory model: each thorn is passed a section of the global grid
• The parallel driver (implemented in a thorn) can use multiple methods to decompose the grid across processors and exchange ghost-zone information
• The standard driver distributed with Cactus (PUGH) is for a parallel unigrid and uses MPI as the communication layer
• PUGH can do custom processor decomposition and/or static load balancing
• Expect high parallel efficiency (a ghost-zone exchange sketch follows)
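For illustration, a minimal MPI ghost-zone exchange in the spirit of PUGH, assuming a 1D decomposition of a 3D grid with one ghost plane on each side (a sketch, not the actual Cactus driver code):

    program ghost_exchange
      use mpi
      implicit none
      integer, parameter :: nx=80, ny=80, nz=80, ng=1
      real(8), allocatable :: u(:,:,:)
      integer :: ierr, rank, nprocs, left, right
      integer :: status(MPI_STATUS_SIZE)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      allocate(u(1-ng:nx+ng, ny, nz))
      u = real(rank, 8)

      ! neighbours in a non-periodic 1D decomposition
      left  = rank - 1; if (left  < 0)       left  = MPI_PROC_NULL
      right = rank + 1; if (right >= nprocs) right = MPI_PROC_NULL

      ! send rightmost interior plane right, receive left ghost plane from left
      call MPI_Sendrecv(u(nx,:,:), ny*nz, MPI_DOUBLE_PRECISION, right, 0, &
                        u(0,:,:),  ny*nz, MPI_DOUBLE_PRECISION, left,  0, &
                        MPI_COMM_WORLD, status, ierr)
      ! send leftmost interior plane left, receive right ghost plane from right
      call MPI_Sendrecv(u(1,:,:),    ny*nz, MPI_DOUBLE_PRECISION, left,  1, &
                        u(nx+1,:,:), ny*nz, MPI_DOUBLE_PRECISION, right, 1, &
                        MPI_COMM_WORLD, status, ierr)

      call MPI_Finalize(ierr)
    end program ghost_exchange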
CACTUS: Performance
Problem Size   P    Power3          Power4          Altix           ES              X1
                    Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak
80x80x80       16   0.31     21%    0.58     11%    0.89     15%    1.5      18%    0.54     4%
80x80x80       64   0.22     14%    0.50     10%    0.70     12%    1.4      17%    0.43     3%
80x80x80      256   0.22     14%    0.48      9%    --       --     1.4      17%    0.41     3%
250x80x80      16   0.10      6%    0.56     11%    0.51      9%    2.8      35%    0.81     6%
250x80x80      64   0.08      6%    --       --     0.42      7%    2.7      34%    0.72     6%
250x80x80     256   0.07      5%    --       --     --       --     2.7      34%    0.68     5%

• Vector performance is related to the x-dimension (vector length)
• Scalar performance is better on the smaller problem size (cache effects)
• Excellent scaling on ES using fixed data size per processor (weak scaling)
• ES achieves the fastest performance to date: 45X faster than Power3!
• X1 surprisingly poor (4X slower than ES) - low scalar:vector ratio
  – The unvectorized boundary required 15-20% of runtime on ES (30+% on X1), vs. < 5% for the scalar version: unvectorized code can quickly dominate cost!
  – Note: the boundary was vectorized for the X1 but not on ES, giving the X1 an advantage
Material Science: PARATEC
• PARATEC performs first-principles quantum mechanical total energy calculations using pseudopotentials & a plane-wave basis set
• Density Functional Theory (DFT) is used to calculate the structure & electronic properties of new materials
  – DFT calculations are among the largest consumers of supercomputer cycles in the world
• Uses an all-band CG approach to obtain the wavefunctions of the electrons
• Roughly 33% 3D FFT, 33% BLAS3, 33% hand-coded F90
• Part of the calculation is in real space, the rest in Fourier space
  – Uses a specialized 3D FFT to transform the wavefunctions
• Computationally intensive - generally obtains a high percentage of peak
• Developed with Louie and Cohen's groups (UC Berkeley, LBNL), A. Canning
[Figure: induced current and charge density in crystallized glycine]
PARATEC: Wavefunction Transpose
[Figure: panels (a)-(f) illustrate the data layout before and after each transpose]
– Transpose from Fourier to real space
– The 3D FFT is done via 3 sets of 1D FFTs and 2 transposes
– Most communication is in the global transpose, (b) to (c); there is little communication from (d) to (e)
– Many FFTs are done at the same time to avoid latency issues
– Only non-zero elements are communicated/calculated
– Much faster than the vendor 3D FFT
(A serial sketch of the 1D-FFT-based scheme follows.)
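A serial Fortran 90 sketch of the 1D-FFT-plus-transpose idea (fft1d_inplace is an assumed 1D transform interface, not a PARATEC routine; in the real code the middle transpose is a global MPI operation over distributed data, and only the non-zero sphere of coefficients is moved):

    subroutine fft3d_by_1d(a, n1, n2, n3)
      implicit none
      integer, intent(in) :: n1, n2, n3
      complex(8), intent(inout) :: a(n1, n2, n3)
      integer :: i, j, k

      do k = 1, n3                 ! pass 1: 1D transforms along dimension 1
         do j = 1, n2
            call fft1d_inplace(a(:, j, k), n1)
         end do
      end do
      do k = 1, n3                 ! pass 2: 1D transforms along dimension 2
         do i = 1, n1              ! (in the parallel code a transpose makes
            call fft1d_inplace(a(i, :, k), n2)   !  this direction local/contiguous)
         end do
      end do
      do j = 1, n2                 ! pass 3: 1D transforms along dimension 3
         do i = 1, n1
            call fft1d_inplace(a(i, j, :), n3)
         end do
      end do
    end subroutine fft3d_by_1d

Batching many such FFTs together, as the slide notes, amortizes the latency of the global transposes.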
PARATEC: Performance
Data Size   P    Power3          Power4          Altix           ES              X1
                 Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak
432 Atom     32  0.95     63%    2.0      39%    3.7      62%    4.7      60%    3.0      24%
432 Atom     64  0.85     57%    1.7      33%    3.2      54%    4.7      59%    2.6      20%
432 Atom    128  0.74     49%    1.5      29%    --       --     4.7      59%    1.9      15%
432 Atom    256  0.57     38%    1.1      21%    --       --     4.2      52%    --       --
686 Atom    128  --       --     --       --     --       --     4.9      62%    3.0      24%
686 Atom    256  --       --     --       --     --       --     4.6      57%    1.3      10%

• ES achieves the fastest performance to date! Over 2 Tflop/s on 1024 processors
  – Plan to run a larger problem size on the next ES visit
• Scalar architectures generally perform well due to the high computational intensity
  – Power3 8X slower than ES
  – Power4 4X slower - Federation has increased speed 2X compared with Colony
  – Altix 1.5X slower - high memory and interconnect bandwidth, low-latency switch
• X1 3.5X slower than ES (although its peak is 50% higher)
  – Non-vectorizable code can be much more expensive on the X1 (32:1 vs 8:1 scalar:vector speed ratio)
  – Lower bisection bandwidth to computation ratio
  – Limited scalability due to the increasing cost of the global transpose and the reduced vector length
PARATEC Scaling: ES vs. Power3
[Figure: aggregate GFlops (log scale) vs. processors (32-1024): 309-atom QD and 432-atom Si on Power3; 432-atom Si and 686-atom Si on ES; ideal scaling shown for the 309-atom QD case]
– ES can run the same system about 10 times faster than the IBM SP (on any number of processors)
– The main advantage of ES for these types of codes is the fast communication network
– Fast processors require less fine-grained parallelism in the code to get the same performance as on RISC machines
– Vector architectures allow the opportunity to simulate systems not possible on scalar platforms
Plasma Physics: LBMHD
• LBMHD uses a Lattice Boltzmann method to model magneto-hydrodynamics (MHD)
• Performs a 2D simulation of high-temperature plasma
• The 2D spatial grid is coupled to an octagonal streaming lattice
• Block-distributed over a 2D processor grid
• Main computational components:
  – Collision: requires coefficients for the local gridpoint only, no communication
  – Stream: values at gridpoints are streamed to neighbors; at cell boundaries information is exchanged via MPI
  – Interpolation: step required between the spatial and stream lattices
• Developed by George Vahala's group at the College of William and Mary; ported by Jonathan Carter
[Figure: current density decay of two cross-shaped structures - current contours after decay from a simple starting vortex]
LBMHD: Porting Details
[Figure: (left) octagonal streaming lattice coupled with the square spatial grid; (right) example of a diagonal streaming vector updating three spatial cells]
• Collision routine rewritten (see the loop-interchange sketch below):
  – For ES, the loop ordering was switched so that the gridpoint loop (~1000 iterations) is innermost rather than the velocity or magnetic-field loops (~10 iterations)
  – The X1 compiler made this transformation automatically: multistreaming the outer loop and vectorizing (via strip mining) the inner loop
  – Temporary arrays were padded to reduce bank conflicts
• Stream routine performs well:
  – Array shift operations, block copies, 3rd-degree polynomial evaluation
• Boundary value exchange:
  – MPI_Isend, MPI_Irecv pairs
  – Further work: plan to use ES "global memory" to remove message copies
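A hedged sketch of the loop interchange described above, using a generic lattice-Boltzmann-style relaxation (array names, extents, and the update formula are illustrative, not the actual LBMHD collision terms):

    subroutine collision_reordered(f, feq, ngrid, nvel, tau)
      implicit none
      integer, intent(in)    :: ngrid, nvel
      real(8), intent(in)    :: feq(ngrid, nvel), tau
      real(8), intent(inout) :: f(ngrid, nvel)
      integer :: i, iv

      ! Original ordering put the ~10-iteration velocity loop innermost,
      ! which is too short to fill the vector pipes. Interchanging makes the
      ! ~1000-iteration gridpoint loop the inner (vector) loop.
      do iv = 1, nvel
         do i = 1, ngrid
            f(i, iv) = f(i, iv) - (f(i, iv) - feq(i, iv)) / tau
         end do
      end do
    end subroutine collision_reordered

On ES this interchange was done by hand, while the X1 compiler performed the equivalent transformation automatically, multistreaming the short outer loop and strip-mining/vectorizing the long inner one.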
LBMHD: Performance
Data Size      P     Power3          Power4          Altix           ES              X1
                     Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak
4096x4096      16    0.11     7%     0.28     5%     0.60     10%    4.6      58%    4.3      34%
4096x4096      64    0.14     9%     0.30     6%     0.62     10%    4.3      54%    4.4      34%
4096x4096     256    0.14     9%     0.28     5%     --       --     3.2      40%    --       --
8192x8192      64    0.11     7%     0.27     5%     0.65     11%    4.6      58%    4.5      35%
8192x8192     256    0.12     8%     0.28     5%     --       --     4.3      53%    2.7      21%
8192x8192    1024    0.11     7%     --       --     --       --     3.3      41%    --       --

• ES achieves the highest performance to date: over 3.3 Tflops for P=1024
  – ES is 44X, 16X, and 7X faster than Power3, Power4, and Altix
• X1 comparable in absolute speed up to P=64 (at a lower % of peak)
  – But performs 1.5X slower at P=256 (decreased scalability)
  – CAF improved the X1 to slightly exceed ES (up to 4.70 Gflops/P)
• Low computational intensity and a high memory requirement (30 GB) hurt scalar performance
  – Altix is the best scalar platform due to its high memory bandwidth and fast interconnect
LBMHD on X1: MPI vs. CAF

Data Size    P     X1-MPI          X1-CAF
                   Gflops/P %peak  Gflops/P %peak
4096^2       16    4.32     34%    4.55     36%
4096^2       64    4.35     34%    4.26     33%
8192^2       64    4.48     35%    4.70     37%
8192^2      256    2.70     21%    2.91     23%

• X1 is well suited to one-sided parallel languages (globally addressable memory)
  – MPI hinders this feature and requires scalar tag matching
• CAF allows much simpler coding of the boundary exchange (array subscripting):
  – feq(ista-1,jsta:jend,1) = feq(iend,jsta:jend,1)[iprev,myrankj]
  – MPI requires non-contiguous data copies into a buffer, unpacked at the destination (see the sketch below)
• Since communication is about 10% of LBMHD, only slight improvements overall
• However, for P=64 on the 4096^2 grid, performance degrades. Tradeoffs:
  – CAF reduced total message volume 3X (eliminates user and system buffer copies)
  – But CAF used more numerous and smaller messages
• Interested in research on CAF and UPC performance and optimization
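For comparison with the one-line CAF assignment above, a sketch of what the equivalent MPI boundary exchange looks like (variable and rank names are hypothetical; the real code exchanges several lattice components):

    subroutine exchange_mpi(feq, ista, iend, jsta, jend, iprev_rank, inext_rank)
      use mpi
      implicit none
      integer, intent(in)    :: ista, iend, jsta, jend, iprev_rank, inext_rank
      real(8), intent(inout) :: feq(ista-1:iend+1, jsta:jend, 9)
      real(8) :: sendbuf(jsta:jend), recvbuf(jsta:jend)
      integer :: req(2), ierr
      integer :: stats(MPI_STATUS_SIZE, 2)

      sendbuf = feq(iend, jsta:jend, 1)        ! pack non-contiguous boundary column
      call MPI_Irecv(recvbuf, jend-jsta+1, MPI_DOUBLE_PRECISION, iprev_rank, 0, &
                     MPI_COMM_WORLD, req(1), ierr)
      call MPI_Isend(sendbuf, jend-jsta+1, MPI_DOUBLE_PRECISION, inext_rank, 0, &
                     MPI_COMM_WORLD, req(2), ierr)
      call MPI_Waitall(2, req, stats, ierr)
      feq(ista-1, jsta:jend, 1) = recvbuf      ! unpack into ghost cells
    end subroutine exchange_mpi

The CAF version is the single assignment in the bullet above: on the X1's globally addressable memory it becomes a direct remote get, avoiding both the pack/unpack copies and MPI tag matching.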
LBMHD: Performance
[Figure: % of runtime spent in collision, stream, and communication for the 8192 x 8192 grid on 64 processors, per architecture (P3, P4, ES, X1)]
• The time breakdown is shown relative to each individual architecture
• The Cray X1 has the highest % of time spent in communication; the CAF version reduced this
• ES shows the best memory bandwidth performance (stream)
LBMHD: Performance
[Figure: % of runtime spent in collision, stream, and communication for the 8192 x 8192 grid on 256 processors, per architecture]
• The limitation of the Power4/Colony switch becomes more obvious at P=256
Magnetic Fusion: GTC
• Gyrokinetic Toroidal Code: transport of thermal energy (plasma microturbulence)
• The goal of magnetic fusion is a burning-plasma power plant producing cleaner energy
• GTC solves the 3D gyroaveraged gyrokinetic system with a particle-in-cell (PIC) approach
• PIC scales as N instead of N^2 - particles interact with the electromagnetic field on a grid
• Allows solving the equations of particle motion with ODEs (instead of nonlinear PDEs)
• Main computational tasks:
  – Scatter: deposit particle charge to the nearest grid points
  – Solve: Poisson equation to get the potential at each grid point
  – Gather: calculate the force on each particle from the neighboring potentials
  – Move: advance particles by solving the equations of motion
  – Shift: move particles that have left the local domain
• Developed at Princeton Plasma Physics Laboratory, vectorized by Stephane Ethier
[Figure: 3D visualization of the electrostatic potential in a magnetic fusion device]
GTC: Scatter operation
• Particle charge is deposited amongst the nearest grid points; the particles can be anywhere inside the domain
• Several particles can contribute to the same grid points, resulting in memory conflicts (dependencies) that prevent vectorization
• Since particles are randomly localized, the scatter also hinders cache reuse
• Solution: VLEN copies of the charge deposition array, with a reduction after the main loop (see the sketch below)
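A minimal Fortran 90 sketch of this workaround (1D grid and illustrative names; the real GTC deposition is on a 3D toroidal grid):

    program scatter_vlen
      implicit none
      integer, parameter :: vlen = 256          ! hardware vector length
      integer, parameter :: nparticles = 100000, ngrid = 1000
      real(8) :: charge(ngrid), charge_v(vlen, ngrid)
      real(8) :: x(nparticles), w(nparticles)
      integer :: i, ig, iv

      call random_number(x); call random_number(w)
      charge_v = 0.0d0

      ! Each lane iv = mod(i-1,vlen)+1 owns its own copy of the deposition
      ! array, so no two iterations within a vector chunk touch the same
      ! element and the loop can vectorize.
      do i = 1, nparticles
         ig = int(x(i) * (ngrid-1)) + 1
         iv = mod(i-1, vlen) + 1
         charge_v(iv, ig) = charge_v(iv, ig) + w(i)
      end do

      ! Reduction over the replicated copies after the main loop
      charge = sum(charge_v, dim=1)
      print *, 'total deposited charge =', sum(charge)
    end program scatter_vlen

The price of this transformation is VLEN times the memory for the deposition array, which contributes to the large vector memory footprint noted on the porting-details slide that follows.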
GTC: Porting Details
• The vector memory footprint is large due to the temporary arrays used to eliminate dependencies and reduce bank conflicts: P=64 uses 42 GB on ES compared with 5 GB on Power3 (Seaborg)
• The relatively small memory per processor (ES=2GB, X1=4GB) severely limits the problem sizes that can be run
• GTC has a second level of loop parallelism using OpenMP; unfortunately, the hybrid version is still not working on ES/X1, and its memory footprint is even larger (an additional 8X, about 320 GB)
• The non-vectorized "Shift" routine accounted for 54% of runtime on the X1 but only 11% on ES
  – Due to the high penalty of serialized sections on the X1 when multistreaming
  – The shift routine was then vectorized on the X1, but NOT on ES - giving the X1 an advantage
  – Shift now accounts for only 4% of the X1 runtime
GTC: Performance
Particles         P    Power3          Power4          Altix           ES              X1
                       Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak
20M (10/cell)     32   0.13     9%     0.29     5%     0.29     5%     0.96     12%    1.00     8%
20M (10/cell)     64   0.13     9%     0.32     5%     0.26     4%     0.84     10%    0.80     6%
200M (100/cell)   32   0.13     9%     0.29     5%     0.33     6%     1.34     17%    1.50     12%
200M (100/cell)   64   0.13     9%     0.29     5%     0.31     5%     1.25     16%    1.36     11%
200M (100/cell) 1024   0.06     4%     --       --     --       --     --       --     --       --

• Vectors achieve the fastest per-processor performance of any tested architecture
  – P=64 on the X1 is 35% faster than P=1024 on Power3!
• The advantage of ES for PIC codes may reside in higher-statistical-resolution simulations: the greater speed allows more particles per grid cell
• Larger tests could not be performed due to the ES hurdles for parallelization/vectorization efficiency
• Low Altix performance is under investigation (random number generation)
GTC: Performance
• With increasing processors and a fixed problem size, the vector length decreases
• Scaling is limited by decreased vector efficiency rather than by communication overhead: the MPI communication by itself has near-perfect scaling
Cosmology: MADCAP
• Microwave Anisotropy Dataset Computational Analysis Package
• Optimal general algorithm for extracting key cosmological data from the Cosmic Microwave Background Radiation (CMB)
  – Anisotropies in the CMB contain the early history of the Universe
• Calculates the maximum-likelihood two-point angular correlation function
• Recasts the problem in dense linear algebra: ScaLAPACK
  – Steps include: mat-mat, matrix inversion, mat-vec, Cholesky decomposition, redistribution
• Porting: ScaLAPACK plus a rewrite of the Legendre polynomial recursion so that large batches are computed in the inner loop (see the sketch below)
• Developed at NERSC by Julian Borrill
[Figure: temperature anisotropies in the CMB as measured by Boomerang]
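A sketch of the batched-recursion idea (interface and names are illustrative, not the actual MADCAP routine): the short Legendre recursion stays in the outer loop while a long, independent batch of pixel-pair angles runs in the inner loop, so the inner loop vectorizes.

    subroutine legendre_batch(x, p, nbatch, lmax)
      implicit none
      integer, intent(in)  :: nbatch, lmax
      real(8), intent(in)  :: x(nbatch)           ! cos(angle) for a batch of pixel pairs
      real(8), intent(out) :: p(nbatch, 0:lmax)   ! P_l(x) for every batch element
      integer :: l, i

      p(:, 0) = 1.0d0
      p(:, 1) = x
      ! Standard recursion: l*P_l(x) = (2l-1)*x*P_{l-1}(x) - (l-1)*P_{l-2}(x)
      do l = 2, lmax                  ! short recursion loop stays outside
         do i = 1, nbatch             ! long, independent batch loop vectorizes
            p(i, l) = ((2*l-1)*x(i)*p(i, l-1) - (l-1)*p(i, l-2)) / l
         end do
      end do
    end subroutine legendre_batch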
MADCAP: Performance
P     Power3          Power4          ES              X1
      Gflops/P %peak  Gflops/P %peak  Gflops/P %peak  Gflops/P %peak
16    0.62     41%    1.5      29%    2.2      27%    4.1      32%
64    0.54     36%    0.81     16%    1.9      23%    2.0      16%

• Only partially ported due to the code's assumption of a global file system
• All systems sustain a relatively low % of peak considering MADCAP's BLAS3 operations
  – Complex tradeoffs between architectural paradigm, interconnect technology, and I/O filesystem
  – Detailed analysis presented at HiPC 2004
• Further work is required to reduce I/O, remove system calls, and remove the global file system requirement
• Plan to implement a new MADCAP version for the next ES visit
Overview
Code       % of peak (P=64)                       Speedup of ES vs. (P=max avail)
           Pwr3   Pwr4   Altix   ES     X1        Pwr3   Pwr4   Altix   X1
LBMHD       7%     5%    11%     58%    37%       30.6   15.3    7.2    1.5
CACTUS      6%    11%     7%     34%     6%       45.0    5.1    6.4    4.0
GTC         9%     6%     5%     16%    11%        9.4    4.3    4.1    0.9
PARATEC    57%    33%    54%     58%    20%        8.2    3.9    1.4    3.9
MADCAP     61%    40%    ---     53%    19%        3.4    2.3    ---    0.9
• Tremendous potential of vector architectures: 4 codes ran faster than ever before
  – Vector systems allow resolutions not possible with scalar architectures (regardless of # of procs)
• ES shows much higher raw and sustained performance compared with the X1
  – Non-vectorizable code segments become very expensive (8:1 or even 32:1 speed ratio)
  – No X1-specific optimization was performed - the optimal programming approach is still unclear (CAF, etc.)
• Vectors are potentially at odds with emerging techniques (sparse, irregular, multi-physics)
  – Much more difficult to evaluate codes poorly suited for vectorization
• Return to ES in October:
  – Plan to evaluate higher-scalability runs
  – New codes in climate and cosmology
  – Potential opportunity for large-scale scientific runs (not just benchmarking)
Future directions
• Leverage our evaluation suite, (unclassified) application expertise, and emerging-architecture research
• Develop application-driven architectural probes for the evaluation of emerging petascale systems
• Research the enhancement of commodity scalar processors with vector components for increased scientific productivity (including investigation into VIVA2)
• Investigate algorithmic (kernel-level) optimizations for leading vector systems
• Explore new application areas on leading parallel systems
  – Examples include sparse linear algebra, molecular dynamics, and combustion
  – Evaluate more difficult codes traditionally at odds with vector architectures
  – Interested in examining AMR codes - a key component for future multi-scale simulations
• Study the potential of implicit parallel programming languages: UPC and CAF
• Evaluate soon-to-be-released supercomputing systems
  – Including Cray X1e, Hitachi SR11000, NEC SX-8, Cray Red Storm, IBM Power5, BlueGene/*, large-scale Altix
Publications
• L. Oliker, A. Canning, J. Carter, J. Shalf, and S. Ethier,
  "Scientific Computations on Modern Parallel Vector Systems", Supercomputing 2004, to appear.
  Nominated for Best Paper award.
• J. Carter, J. Borrill, and L. Oliker,
  "Performance Characteristics of a Cosmology Package on Leading HPC Architectures",
  International Conference on High Performance Computing: HiPC 2004, to appear.
  Nominated for Best Paper award.
• L. Oliker, J. Borrill, A. Canning, J. Carter, H. Shan, D. Skinner, R. Biswas, and J. Djomehri,
  "Performance Evaluation of the SX-6 Vector Architecture",
  Journal of Concurrency and Computation, 2004, to appear.
• L. Oliker and R. Biswas,
  "Performance Modeling and Evaluation of Ultra-Scale Systems", minisymposium organized at the
  SIAM Conference on Parallel Processing for Scientific Computing: SIAMPP 2004.
• L. Oliker, J. Borrill, A. Canning, J. Carter, H. Shan, D. Skinner, R. Biswas, and J. Djomehri,
  "A Performance Evaluation of the Cray X1 for Scientific Applications", International Meeting on High
  Performance Computing for Computational Science: VECPAR 2004.
• H. Shan, E. Strohmaier, and L. Oliker,
  "Optimizing Performance of Superscalar Codes for a Single Cray X1 MSP Processor",
  46th Cray User Group Conference: CUG 2004.
• L. Oliker, J. Carter, J. Shalf, D. Skinner, S. Ethier, R. Biswas, J. Djomehri, and R. Van der Wijngaart,
  "Evaluation of Cache-based Superscalar and Cacheless Vector Architectures for Scientific Computations",
  Supercomputing 2003.