
Paraprox: Pattern-Based
Approximation for Data Parallel
Applications
Mehrzad Samadi, D. Anoushe Jamshidi,
Janghaeng Lee, and Scott Mahlke
University of Michigan
March 2014
University of Michigan
Electrical Engineering and Computer Science
Compilers Creating Custom Processors
Approximate Computing
• 100% accuracy is not always necessary
• Less work means:
– Better performance
– Lower power consumption
• There are many domains where approximate output is acceptable
Data Parallelism is everywhere
Financial Modeling, Games, Medical Imaging, Physics Simulation, Image Processing, Audio Processing, Machine Learning, Statistics, Video Processing
• Mostly regular applications
• Work on large data sets
• Exact output is not required for operation
⇒ Good opportunity for automatic approximation
Approximating KMeans
[Animation: a clustering run compares exact centers against approximate centers; some points near cluster boundaries end up mislabeled.]
[Plot: mislabeling error (0–50%) vs. error in computing the clusters' centers (0–100%).]
Approximating alone is not enough; we need a way to control the output quality.
Approximate Computing
• Ask the programmer to do it
– Not easy / practical
– Hard to debug
• Automatic approximation
– One solution does not fit all
• Paraprox: pattern-based approximation
– Pattern-specific approximation methods
– Provides knobs to control the output quality
Common Patterns
• Map — Image Processing, Finance, …
• Scan — Signal Processing, Physics, …
• Partitioning — Machine Learning, Search, …
• Scatter/Gather — Statistics, …
• Stencil — Image Processing, Physics, …
• Reduction — Machine Learning, Physics, …
[Diagram: each pattern shown as a dataflow of 𝑓 and + nodes over array elements.]
M. McCool et al. "Structured Parallel Programming: Patterns for Efficient Computation." Morgan Kaufmann, 2012.
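As a rough illustration, two of these patterns can be written as sequential C stand-ins for the parallel idioms (the `square` function and array shapes are illustrative, not from the talk):

```c
#include <stddef.h>

/* Map: apply f independently to every element (no cross-element deps),
   so every iteration could run as its own GPU thread. */
double square(double x) { return x * x; }

void map_pattern(const double *in, double *out, size_t n,
                 double (*f)(double)) {
    for (size_t i = 0; i < n; i++)
        out[i] = f(in[i]);            /* each iteration is independent */
}

/* Reduction: combine all elements with an associative operator (+),
   which a parallel runtime can evaluate as a tree. */
double reduce_sum(const double *in, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += in[i];
    return acc;
}
```

Stencil, scan, and the other patterns differ mainly in which neighboring elements each output may read.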
Paraprox
Parallel program (OpenCL/CUDA) → Paraprox (pattern detection + approximation methods) → approximate kernels + tuning parameters → runtime system
Approximate Memoization
[Diagram: the BlackScholes kernel's dataflow — inputs S, X, T, R, V flowing through Sqrt, Div, Mul, Log, Add, Exp, Sub, and CND() nodes to CallResult and PutResult — is replaced by a lookup table. Each input is quantized to a few bits (q0–q4), the concatenated bits form the table address, and the table returns a float2 holding (CallResult, PutResult).]
Approximate Memoization
• Identify candidate functions
• Find the table size
• Check the quality
• Determine qi for each input
• Fill the table
• Execution
Candidate Functions
• Pure functions do not:
– read or write any global or static mutable state
– call an impure function
– perform I/O
• In CUDA/OpenCL:
– No global/shared memory access
– No thread-ID-dependent computation
[Diagram: the BlackScholes dataflow (inputs S, X, T, R, V through Sqrt, Div, Mul, Log, Add, Exp, Sub, and CND() to CallResult/PutResult) satisfies these criteria and is a memoization candidate.]
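A small made-up example of the purity distinction these criteria capture (both functions are hypothetical, not from the talk):

```c
/* Pure: the result depends only on the arguments, so a lookup table
   filled once stays valid — a memoization candidate. */
double axpy(double a, double x, double y) {
    return a * x + y;
}

/* Impure: reads mutable global state, so a table filled at one point
   in time could silently go stale — not a candidate. */
double scale_factor = 2.0;

double scaled(double x) {
    return x * scale_factor;   /* depends on a global -> rejected */
}
```

In CUDA/OpenCL the analogous disqualifiers are global/shared memory accesses and thread-ID-dependent computation.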
Table Size
[Plot: output quality vs. speedup trade-off for table sizes of 16K, 32K, and 64K entries.]
How Many Bits per Input?
Table size = 32 KB → a 15-bit address shared among the inputs.
[Table: candidate splits of the 15 address bits across inputs A, B, and C (e.g. 5/5/5, 6/5/4, 4/7/4, …) and the resulting output quality, roughly 91–96%.]
Chosen allocation (quantization levels):
A: 5 bits (32 levels), B: 6 bits (64 levels), C: 4 bits (16 levels)
Inputs that do not need high precision get fewer bits.
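The slide's 5/6/4-bit split can be sketched as address packing (the [0, 1) input ranges are hypothetical; only the bit budgets come from the slide):

```c
#include <stdint.h>

/* Quantize x in [lo, hi) to a `bits`-bit integer (2^bits levels). */
uint32_t quantize(double x, double lo, double hi, int bits) {
    int levels = 1 << bits;
    int q = (int)((x - lo) / (hi - lo) * levels);
    if (q < 0) q = 0;                 /* clamp out-of-range inputs */
    if (q >= levels) q = levels - 1;
    return (uint32_t)q;
}

/* Pack three quantized inputs into one 15-bit table address:
   A gets 5 bits (32 levels), B 6 bits (64 levels), C 4 bits (16 levels). */
uint32_t table_address(double a, double b, double c) {
    uint32_t qa = quantize(a, 0.0, 1.0, 5);
    uint32_t qb = quantize(b, 0.0, 1.0, 6);
    uint32_t qc = quantize(c, 0.0, 1.0, 4);
    return (qa << 10) | (qb << 4) | qc;   /* 5 + 6 + 4 = 15 bits */
}
```

Giving a low-sensitivity input fewer bits frees address bits for the inputs that matter most to output quality.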
Tile Approximation
[Histogram: percentage of pixels (0–80%) vs. each pixel's difference with its neighbors, bucketed into [0–10), [10–20), …, [90–100]; most differences fall in the lowest buckets, suggesting neighboring pixels are usually similar.]
Stencil/Partitioning
A 3×3 stencil reads nine neighbors per output element:
C = Input[i][j], W = Input[i][j-1], E = Input[i][j+1]
NW = Input[i-1][j-1], N = Input[i-1][j], NE = Input[i-1][j+1]
SW = Input[i+1][j-1], S = Input[i+1][j], SE = Input[i+1][j+1]
• Paraprox looks for global/texture/shared load accesses to arrays with affine addresses
• Control the output quality by changing the number of accesses per tile
[Animation: progressively fewer loads per tile. First the center row (W, C, E) stands in for one neighboring row, then for both, and finally the single center element C stands in for the entire 3×3 tile.]
Scan / Prefix Sum
• Prefix sum: Output[i] = Input[0] + Input[1] + … + Input[i]
• Used for cumulative histograms, list ranking, …
• Data-parallel implementation:
1. Divide the input into smaller subarrays
2. Compute the prefix sum of each subarray in parallel
Data Parallel Scan
[Animation: an array of sixteen 1s is split into four subarrays. Phase I: each subarray is scanned in parallel, yielding 1 2 3 4 in each. Phase II: the subarray totals (4, 4, 4, 4) are themselves scanned, giving the offsets 4, 8, 12, 16. Phase III: each offset is added to the following subarray, producing the full prefix sum 1 2 3 … 16.]
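The three-phase scan can be sketched sequentially (the chunked loops here are sequential stand-ins for what runs as parallel thread blocks on the GPU):

```c
#include <stddef.h>

/* Three-phase scan: (I) scan each subarray independently, (II) scan the
   per-subarray totals, (III) add each subarray's offset to its elements. */
void scan_three_phase(const int *in, int *out, size_t n, size_t chunk) {
    size_t nchunks = (n + chunk - 1) / chunk;

    /* Phase I: inclusive scan inside each chunk (parallel across chunks). */
    for (size_t c = 0; c < nchunks; c++) {
        size_t lo = c * chunk, hi = lo + chunk < n ? lo + chunk : n;
        int acc = 0;
        for (size_t i = lo; i < hi; i++) { acc += in[i]; out[i] = acc; }
    }

    /* Phases II + III: propagate each chunk's total into later chunks. */
    int offset = 0;
    for (size_t c = 0; c < nchunks; c++) {
        size_t lo = c * chunk, hi = lo + chunk < n ? lo + chunk : n;
        int total = out[hi - 1];   /* read before the offset is added */
        for (size_t i = lo; i < hi; i++) out[i] += offset;
        offset += total;
    }
}
```

On sixteen 1s with chunk size 4, phase I produces 1 2 3 4 in each chunk and the offsets 0, 4, 8, 12 turn that into the full prefix sum 1 … 16.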
Scan Approximation
[Figure: approximation applied across the output elements 0 … N.]
Evaluation
Experimental Setup
• Compiler flow: Clang 3.3 — CUDA driver, AST visitor, pattern detection, action generator, and rewriter producing approximate kernels
• GPU: NVIDIA GTX 560
• CPU: Intel Core i7
• Benchmarks: NVIDIA SDK, Rodinia, …
Runtime System
[Diagram: the runtime performs quality checking against a quality target, trading output quality for speedup, following Green [PLDI 2010] and SAGE [MICRO 2013].]
Speedups for Both CPU and GPU
[Bar chart, quality target = 90%: CPU and GPU speedups (0–5×) for Cumulative Histogram, Mean Filter, Gaussian Filter, Convolution Separable, HotSpot (reaching 7.9×), Kernel Density, Naïve Bayes, Image Denoising, Matrix Multiplication, BoxMuller, Gamma Correction, Quasirandom Generator, BlackScholes, and the geometric mean.]
One Solution Does Not Fit All!
[Bar chart: Paraprox vs. loop perforation speedups (0–4×) on BlackScholes, Quasirandom Generator, Gamma Correction, BoxMuller, HotSpot, Gaussian Filter, Mean Filter, Cumulative Histogram, and the geometric mean.]
We Have Control on Output Quality
[Line chart: speedup (1–5×) vs. output quality (100% down to 90%) for Matrix Multiplication, Kernel Density, Gaussian Filter, Quasirandom Generator, Convolution Separable, and BlackScholes; lowering the quality target increases speedup.]
Distribution of Errors
[Cumulative distribution: percentage of output elements (0–100%) vs. per-element error (0–100%) for Cumulative Histogram, Gamma Correction, Matrix Multiplication, Image Denoising, Naïve Bayes, Kernel Density, Hotspot, Gaussian Filter, and Mean Filter; for most benchmarks the vast majority of elements have small errors.]
Conclusion
• Manual approximation is not easy or practical.
• We need tools for approximation.
• One approximation method does not fit all applications.
• Using pattern-based approximation, we achieved a 2.6× speedup while maintaining 90% of the output quality.