CS 395 Last Lecture
Summary, Anti-summary, and
Final Thoughts
Summary (1) Architecture
• Modern architecture designs are driven by energy
constraints
• Shortening latencies is too costly, so we use
parallelism in hardware to increase potential
throughput
• Some parallelism is implicit (out-of-order
superscalar processing), but it has limits
• Other parallelism is explicit (vectorization and
multithreading) and relies on software to unlock it
Summary (2) Memory
• Memory technologies trade off energy and
cost for capacity, with SRAM registers on one
end and spinning platter hard disks on the
other
• Locality (relationships between memory
accesses) can help us get the best of all cases
• Caching is the hardware-only solution to
capturing locality, but software-driven
solutions exist too (memcache for files, etc.)
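A minimal Python sketch of a software-managed cache, assuming the standard-library `functools.lru_cache` as the caching layer. The function `load_block` is a hypothetical stand-in for a slow read; the slides do not name a specific API. The access pattern revisits the same four blocks, so temporal locality turns most calls into hits.

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def load_block(block_id: int) -> bytes:
    """Hypothetical slow read; real code would go to disk or the network."""
    return bytes([block_id % 256]) * 16


# Access pattern with reuse: blocks 0..3 are each visited three times.
for _ in range(3):
    for block_id in range(4):
        load_block(block_id)

# cache_info() reports how many calls were served from the cache.
stats = load_block.cache_info()
```

Only the first visit to each block pays the full cost; `stats.hits` and `stats.misses` show how much reuse the cache captured.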
Summary (3) Software
• Want to fully occupy your hardware?
– Express locality (tiling)
– Vectorize (compiler or manual)
– Multithread (e.g. OpenMP)
– Accelerate (e.g. CUDA, OpenCL)
• Take the cost into consideration. Unless
you’re optimizing in your free time, your time
isn’t free.
Research Perspective (2010)
• Can we generalize and categorize the most
important, generally applicable GPU
Computing software optimizations?
– Across multiple architectures
– Across many applications
• What kinds of performance trends are we
seeing from successive GPU generations?
• Conclusion – GPUs aren’t special, and parallel
programming is getting easier
Application Survey
• Surveyed the GPU Computing Gems chapters
• Studied the Parboil benchmarks in detail
Results:
• Eight (for now) major categories of
optimization transformations
– Performance impact of individual optimizations on
certain Parboil benchmarks included in the paper
1. (Input) Data Access Tiling
[Figure: two tiling paths. DRAM → explicit copy → scratchpad → local access; DRAM → implicit copy → cache → local access.]
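A minimal Python sketch of input tiling, assuming a dense matrix multiply as the workload (the slide does not name one). The names `matmul_tiled` and `tile` are illustrative. Small lists stand in for a scratchpad: each tile is copied explicitly, then reused for several multiply-adds before the next tile is fetched.

```python
def matmul_naive(a, b):
    """Reference version: every operand is read straight from the full matrices."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_tiled(a, b, tile=2):
    """Same result, computed one tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Explicit copy: stage one tile of each input in "local" storage.
                a_tile = [row[p0:p0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[p0:p0 + tile]]
                # Local access: each staged element is reused up to `tile` times.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        acc = 0
                        for p, a_val in enumerate(a_row):
                            acc += a_val * b_tile[p][j]
                        c[i0 + i][j0 + j] += acc
    return c
```

In pure Python the tiled version is not faster; the point is the access pattern, which is what a scratchpad or cache rewards on real hardware.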
2. (Output) Privatization
• Avoid contention
by aggregating
updates locally
• Requires storage
resources to keep
copies of data
structures
[Figure: results aggregated at three levels: private, local, and global.]
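A minimal Python sketch of privatization, assuming a histogram as the shared output (the helper names are illustrative, not from the slides). Each worker fills its own private `Counter`, and the private copies are merged once at the end, so there is one merge per worker instead of one contended update per element.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def private_histogram(chunk):
    # Each worker updates only its own copy: no locks, no contention.
    return Counter(chunk)


def histogram(data, workers=4):
    chunk_size = max(1, -(-len(data) // workers))  # ceiling division
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(private_histogram, chunks))
    # Aggregate the private results into the global result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

The storage cost named on the slide shows up here as one full `Counter` per worker.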
Running Example: SpMV
Ax = v
[Figure: sparse matrix A stored as Row, Col, and Data arrays, multiplied by vector x to produce vector v.]
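A minimal Python sketch of the running example, assuming the Row, Col, and Data arrays hold one entry per nonzero (coordinate format); the slide does not state the storage format. The function name `spmv_coo` is illustrative. Each nonzero scatters its product into the output vector.

```python
def spmv_coo(rows, cols, data, x, n_rows):
    """Compute v = A x, with A given as parallel Row/Col/Data arrays."""
    v = [0.0] * n_rows
    for r, c, a in zip(rows, cols, data):
        # Scatter: the input element decides which output it updates.
        v[r] += a * x[c]
    return v


# A = [[2, 0, 1],
#      [0, 3, 0],
#      [4, 0, 5]]
rows = [0, 0, 1, 2, 2]
cols = [0, 2, 1, 0, 2]
data = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]
v = spmv_coo(rows, cols, data, x, n_rows=3)
```

If this loop ran in parallel over nonzeros, several workers could update the same `v[r]`, which is the contention the later transformations address.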
3. “Scatter to Gather” Transformation
Ax = v
[Figure: the SpMV running example, with A stored as Row, Col, and Data arrays, input vector x, and output vector v.]
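A minimal Python sketch of the scatter-to-gather idea applied to SpMV, assuming the nonzeros start in coordinate form and are regrouped by output row (a CSR-style layout). The names `coo_to_csr` and `spmv_gather` are illustrative. After regrouping, each output element reads only its own inputs and is written exactly once, so parallel workers never update the same location.

```python
def coo_to_csr(rows, cols, data, n_rows):
    """Group nonzeros by output row; row_ptr[r]..row_ptr[r+1] spans row r."""
    counts = [0] * n_rows
    for r in rows:
        counts[r] += 1
    row_ptr = [0] * (n_rows + 1)
    for r in range(n_rows):
        row_ptr[r + 1] = row_ptr[r] + counts[r]
    next_slot = row_ptr[:-1]  # slicing makes a fresh list of write cursors
    csr_cols = [0] * len(data)
    csr_data = [0.0] * len(data)
    for r, c, a in zip(rows, cols, data):
        slot = next_slot[r]
        csr_cols[slot], csr_data[slot] = c, a
        next_slot[r] += 1
    return row_ptr, csr_cols, csr_data


def spmv_gather(row_ptr, cols, data, x):
    # Gather: the output element decides which inputs it reads.
    return [sum(data[k] * x[cols[k]] for k in range(row_ptr[r], row_ptr[r + 1]))
            for r in range(len(row_ptr) - 1)]
```

The regrouping is a one-time cost that pays off when the same matrix is multiplied many times.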
4. Binning
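A minimal Python sketch of binning, assuming 2-D points and a fixed-radius neighbor query (the slide gives only the title, so the workload and the names `build_bins` and `neighbors_within` are assumptions). Points are sorted into uniform grid cells once; a query then inspects only the 3x3 block of cells around it instead of every point.

```python
from collections import defaultdict


def build_bins(points, cell):
    """Map each grid cell to the indices of the points that fall inside it."""
    bins = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        bins[(int(x // cell), int(y // cell))].append(idx)
    return bins


def neighbors_within(points, bins, cell, query, radius):
    """Indices of points within `radius` of `query`; assumes radius <= cell."""
    qx, qy = query
    cx, cy = int(qx // cell), int(qy // cell)
    found = []
    for bx in (cx - 1, cx, cx + 1):
        for by in (cy - 1, cy, cy + 1):
            # .get avoids creating empty bins for cells with no points.
            for idx in bins.get((bx, by), ()):
                px, py = points[idx]
                if (px - qx) ** 2 + (py - qy) ** 2 <= radius ** 2:
                    found.append(idx)
    return sorted(found)
```

The `radius <= cell` assumption is what makes the 3x3 search sufficient: any point within one cell width of the query lies at most one cell away in each direction.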
5. Regularization (Load Balancing)
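A minimal Python sketch of one possible form of regularization, assuming SpMV rows of uneven length and a lockstep execution model in which `width` rows run together and each group takes as long as its longest row. The slide gives only the title, so this model and the names `padded_slots` and `regularize` are assumptions. Sorting rows by length puts similar amounts of work side by side.

```python
def padded_slots(row_lengths, width):
    """Slots consumed when `width` rows run in lockstep per group."""
    total = 0
    for start in range(0, len(row_lengths), width):
        group = row_lengths[start:start + width]
        total += max(group) * width  # every lane waits for the longest row
    return total


def regularize(row_lengths):
    """Return (row order, reordered lengths), longest rows first."""
    order = sorted(range(len(row_lengths)),
                   key=lambda r: row_lengths[r], reverse=True)
    return order, [row_lengths[r] for r in order]


lengths = [1, 8, 1, 8, 2, 7, 1, 8]
order, balanced = regularize(lengths)
before = padded_slots(lengths, width=4)
after = padded_slots(balanced, width=4)
```

The useful work (`sum(lengths)`) is the same in both cases; only the idle slots change.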
6. Compaction
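A minimal Python sketch of compaction, assuming the usual prefix-sum formulation (the slide gives only the title; `compact` and `keep` are illustrative names). Each kept item learns its output position from an exclusive prefix sum over the keep flags, so every item can be placed independently of the others.

```python
from itertools import accumulate


def compact(items, keep):
    """Pack the items for which keep(item) is true into a dense list."""
    flags = [1 if keep(item) else 0 for item in items]
    # Exclusive prefix sum: offsets[i] is the output slot for item i;
    # the final entry is the total number of kept items.
    offsets = list(accumulate(flags, initial=0))
    out = [None] * offsets[-1]
    for item, flag, offset in zip(items, flags, offsets):
        if flag:
            out[offset] = item
    return out
```

Later passes then iterate over a dense array of useful elements rather than skipping idle ones.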
7. Data Layout Transformation
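A minimal Python sketch of one common data layout transformation, array-of-structures to structure-of-arrays, assuming records with numeric fields (the slide gives only the title; the field names and `aos_to_soa` are illustrative). After the transformation, the values of one field sit next to each other, so a pass over a single field touches contiguous memory.

```python
from array import array


def aos_to_soa(records, fields):
    """Turn a list of per-item records into one contiguous array per field."""
    return {name: array("d", (rec[name] for rec in records)) for name in fields}


# Array of structures: the fields of one particle are stored together.
particles = [
    {"x": 1.0, "y": 2.0, "mass": 0.5},
    {"x": 3.0, "y": 4.0, "mass": 1.0},
    {"x": 5.0, "y": 6.0, "mass": 2.0},
]

# Structure of arrays: all values of one field are stored together.
soa = aos_to_soa(particles, ("x", "y", "mass"))
total_mass = sum(soa["mass"])
```

Which layout is better depends on the access pattern: code that reads one field across many items favors the second form.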
8. Granularity Coarsening
• Parallel execution often requires redundant work and
coordination work
– Merging multiple threads into one allows reuse of results,
reducing redundancy
[Figure: time bars comparing 4-way and 2-way parallel execution, split into redundant and essential work.]
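A minimal Python sketch of granularity coarsening, assuming row normalization as the workload (not from the slides; all names are illustrative). With one task per element, every task recomputes its row's norm. With one task per row, the norm is computed once and reused.

```python
def row_norm(row):
    """Work shared by every element of a row."""
    return sum(v * v for v in row) ** 0.5


def normalize_fine(matrix):
    """One task per element: each task recomputes the norm it needs."""
    calls = 0
    out = [[0.0] * len(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, v in enumerate(row):
            norm = row_norm(row)  # redundant across elements of a row
            calls += 1
            out[i][j] = v / norm
    return out, calls


def normalize_coarse(matrix):
    """One task per row: the norm is computed once and reused."""
    calls = 0
    out = []
    for row in matrix:
        norm = row_norm(row)
        calls += 1
        out.append([v / norm for v in row])
    return out, calls
```

The trade-off is fewer independent tasks: the coarse version exposes one unit of parallel work per row instead of one per element.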
How much faster do applications
really get each hardware
generation?
Unoptimized Code Has Improved
Drastically
• Orders of
magnitude
speedup in many
cases
• Hardware does not
solve all problems
– Coalescing (lbm)
– Highly contentious
atomics (bfs)
Optimized Code Is Improving Faster
than “Peak Performance”
• Caches capture
locality that scratchpad
can’t capture efficiently
(spmv, stencil)
• Increased local
storage capacity
enables extra
optimization (sad)
• Some benchmarks
need atomic
throughput more
than flops (bfs, histo)
Optimization Still Matters
• Hardware never
changes algorithmic
complexity (cutcp)
• Caches do not solve
layout problems for
big data (lbm)
• Coarsening still
makes a big
difference (cutcp,
sgemm)
• Many artificial
performance cliffs
are gone (sgemm,
tpacf, mri-q)
Stuff we haven’t covered
• Good tools exist for profiling code beyond
just timing (cache misses, etc.). If you can’t
find why a particular piece of code is taking so
long, look into hardware performance
counters.
• Patterns and practice
– We covered some of the major optimization
patterns, but only the basic ones. Many
optimization patterns are algorithmic.
Fill Out Evaluations!