UPC Status Report - 10/12/04
Adam Leko
UPC Project, HCS Lab
University of Florida
Oct 12, 2004
NSA bench9
● Simple code
  – Given stream A
  – Two parameters
    ● L – number of elements in A
    ● N – number of bits for each element in A
  – Compute Bi = Ai “right justified”
    ● 1000 -> 0001
    ● 1010 -> 0101
    ● 1011 -> 1011
    ● Removes factors of 2 from list
  – Compute C such that Bi * Ci = 1 mod 2^N
● Parameters for experiments
  – N = 48 (recommended: N = 30 or N = 46)
  – L = 5*10^6 (recommended: L = 5*10^7)

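As a sketch only: the per-element “right justified” step above amounts to shifting out trailing zero bits. A minimal C version (the helper name right_justify is ours; a nonzero input is assumed) could look like:

    #include <stdint.h>

    /* Shift out trailing zero bits so the lowest set bit lands in bit 0:
     * 1000b -> 0001b, 1010b -> 0101b, 1011b -> 1011b (odd values unchanged).
     * Assumes a != 0, which the fill step on the next slide guarantees. */
    static uint64_t right_justify(uint64_t a)
    {
        while ((a & 1) == 0)
            a >>= 1;
        return a;
    }
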
Program flow
● Computation section (embarrassingly parallel)
  – Fill up A with [rand() & (2^N - 1)] + 1
  – Compute B & C
    ● B: directly by right shifts (>> 1)
    ● C: iterative algorithm
      – x_n -> n correct bits computed
      – x_2j = x_j * (2 - Bi * x_j) mod 2^(2j)
      – Example: 12 bits, Bi = 127
        ● x_3 = 7 mod 8
        ● x_6 = 7 * (2 - 127*7) = 63 mod 64
        ● x_12 = 63 * (2 - 127*63) = 3967 mod 4096
● Check section (gather)
  – First node checks all values to verify Bi * Ci = 1 mod 2^N
  – Fits along with benchmark
    ● “Output selected values from A, B, and C”

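To make the iteration concrete, here is a minimal C sketch of the inverse computation plus the check (the name inverse_pow2 is ours; it seeds with x = Bi mod 8 and applies the doubling step above, reproducing the 12-bit example with Bi = 127):

    #include <assert.h>
    #include <stdint.h>

    /* Inverse of an odd b modulo 2^nbits (nbits < 64 assumed).
     * Seed: x = b mod 8 is already correct to 3 bits, since b*b == 1 (mod 8)
     * for any odd b.  Each step x <- x * (2 - b*x) doubles the number of
     * correct low bits. */
    static uint64_t inverse_pow2(uint64_t b, int nbits)
    {
        uint64_t mask = (1ULL << nbits) - 1;
        uint64_t x = b & 7;
        int bits;

        for (bits = 3; bits < nbits; bits *= 2)
            x = x * (2 - b * x);   /* 64-bit wraparound is harmless; only low bits matter */

        return x & mask;
    }

    int main(void)
    {
        uint64_t b = 127, c = inverse_pow2(b, 12);

        assert(c == 3967);                /* matches the worked example above */
        assert(((b * c) & 0xFFF) == 1);   /* the check: Bi * Ci = 1 mod 2^12 */
        return 0;
    }

With N = 48 the loop takes four doubling steps (3 -> 6 -> 12 -> 24 -> 48 correct bits).
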
bench9 - Marvel total time
[Chart: bench9 total time on Marvel for 1, 2, 3, and 4 threads. Series: cc seq; upcc seq; upc_forall; upc_forall, cast; for; for, cast; seq check time; upcc seq check time; upc par check time. Vertical axis runs 0 to 5.]

rand() fill loop, size 10e6 * THREADS
[Chart: rand() fill-loop time on Lambda BUPC (VAPI), Lambda MUPC, and Marvel for the cases gcc/cc seq; upcc seq; upc_forall, 1 thread; upc_forall, 2 threads; upc for, 1 thread; upc for, 2 threads. Vertical axis runs 0 to 5.5.]

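For reference, a rough UPC sketch of what the measured fill-loop variants might look like; the array and function names are ours, and this is only an assumed reading of the "upc_forall", "for", and "cast" labels used in these charts (the cast variant privatizes a thread's slice of the default cyclic array):

    #include <upc_relaxed.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define NBITS      48
    #define PER_THREAD (1000 * 1000)            /* illustrative size only */

    shared uint64_t A[PER_THREAD * THREADS];    /* default cyclic (block size 1) layout */

    /* "upc_forall" variant: the affinity expression &A[i] makes each thread
     * execute only the iterations that touch its own elements. */
    void fill_forall(void)
    {
        uint64_t mask = ((uint64_t)1 << NBITS) - 1;
        int i;

        upc_forall (i = 0; i < PER_THREAD * THREADS; i++; &A[i])
            A[i] = ((uint64_t)rand() & mask) + 1;   /* fill formula from the Program flow slide */
    }

    /* "cast" variant: cast the shared address to a private pointer so the loop
     * runs on plain local memory; with block size 1, this thread's elements are
     * contiguous in its local slice.  (Per-thread rand() seeding omitted.) */
    void fill_cast(void)
    {
        uint64_t mask = ((uint64_t)1 << NBITS) - 1;
        uint64_t *a_local = (uint64_t *) &A[MYTHREAD];
        int i;

        for (i = 0; i < PER_THREAD; i++)
            a_local[i] = ((uint64_t)rand() & mask) + 1;
    }

The privatized loop avoids shared-pointer arithmetic on every element, which is typically why such "cast" variants are timed alongside the plain upc_forall and for versions.
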
Analysis for factors (user's perspective)
● Big question: where is time being spent? Which statements in source code use the most cycles?
  – Which statements incur remote accesses?
    ● Factors: network characteristics, communication patterns
  – Which threads are sitting idle?
    ● Factors: CPU utilization, parallel efficiency, synchronization overhead
  – How close am I to peak GFLOPS?
    ● Factors: all, especially lower-level cache and network/memory
  – How expensive and how much synchronization?
    ● Factors: synchronization algorithms, network/memory latency

Analysis strategy
● Come up with list of questions we want our performance tool to answer
● Think about possible factors in terms of which questions they answer or help answer
  – Split up some questions in terms of combinations of factors
  – Try to get as many as possible
  – Preliminary list from brainstorming?
    ● Based on important questions from above
  – Perform sensitivity study
    ● Assemble microbenchmark suite to isolate factors
    ● Vary parameters artificially
  – Also run through list of questions and catalog answers
    ● Can we record this factor? etc.
● Combine results from sensitivity study with survey and tool study to get preliminary list of factors

Individual part of project
● Contacting developers
  – Sent out email to all developers from contact list
  – Purpose
    ● Understand “compiler weirdness”
    ● Get ideas for factors
    ● Get access to a Cray machine?
● Look at benchmarks
  – Get ideas for factors
● Start on next coding project – convolution
● Model-driven factor development

Model-driven factor development
● Start up one or more performance models that take into account major performance factors
● Tune those models to Marvel, lambda+IBA, kappa+SCI
● General idea:
  – If a performance model can have 90%+ accuracy, then using the model we can determine which factors are important for which architectures
  – And thus what to concentrate on and what to show user
  – Gives us a good understanding of “what's going on”
  – Also can be used to validate factors we have chosen
● Issues
  – Existing models?
  – Simulation or equations?
  – Corner cases?
  – Too hard?