Co-clustering using CUDA
Co-Clustering Explained
Problem:
Large binary matrix of samples (rows) and features (columns)
What samples should be grouped together? Why?
What are shared features?
Co-clustering provides you the “why” explicitly
Example: a correlated sample/feature pair
Row cluster: s1 and s3 are in a group
Column cluster: distinguishing features are 2, 3, and 5
Co-Clustering - Details
Using Information Theoretic Co-clustering, as parallelized for the Hadoop architecture in:
Disco: Distributed co-clustering with Map-Reduce: A case study towards petabyte-scale end-to-end mining, Papadimitriou et al., ICDM 2008
Partition entire matrix into row groups, col groups
Minimize length of encoding of resulting partitioned matrix
Competing code length factors: number of row groups & col groups,
homogeneity of clusters
Iterate over rows, rearrange and sub-partition to find better encoding using
heuristic
Repeat for columns, then rows again, until local optimum is found
Complexity: O(n * fp * (row_groups + col_groups)^2 * iters)
Credit: Chakrabarti et al., KDD 2004
Implementation - Basics
Initial matrix generation: CPU
Initial random row/column group assignment: CPU
Memory structures very simple, arrays of ints
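A minimal host-side sketch of these structures, assuming a flat row-major binary matrix plus one group index per row and per column; all names here are hypothetical, not the project's actual code.

#include <cstdlib>

// Hypothetical host-side setup matching the bullets above: flat int arrays only.
void initProblem(int n, int m, int kRowGroups, int kColGroups,
                 int **matrix, int **rowGroup, int **colGroup)
{
    *matrix   = (int *)malloc((size_t)n * m * sizeof(int)); // row-major 0/1 entries
    *rowGroup = (int *)malloc(n * sizeof(int));             // rowGroup[r] in [0, kRowGroups)
    *colGroup = (int *)malloc(m * sizeof(int));             // colGroup[c] in [0, kColGroups)

    for (int i = 0; i < n * m; ++i) (*matrix)[i] = rand() & 1;            // random binary matrix (CPU)
    for (int r = 0; r < n; ++r)     (*rowGroup)[r] = rand() % kRowGroups; // random initial row groups (CPU)
    for (int c = 0; c < m; ++c)     (*colGroup)[c] = rand() % kColGroups; // random initial column groups (CPU)
}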
Implementation – Stats step 1
Statistics calculations:
Calculates statistics for each row of each column group
The statistic is the number of 1's in a column group
Straightforward parallelization (each thread works on one row at a time), global memory (a kernel sketch follows this slide)
[Figure: example binary matrix with its columns partitioned into column groups and its rows into row groups; Stat(Row 3, ColumnGroup 3) = 1]
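A sketch of this step as a kernel, assuming the per-row statistics live in a row-major nRows x nColGroups array that has been zeroed before launch; names are hypothetical, not the project's actual code.

// One thread per row, grid-stride loop over rows, all counts in global memory.
// No atomics are needed because a single thread owns all writes for its row.
__global__ void rowColGroupCounts(const int *matrix, const int *colGroup,
                                  int *stats, int nRows, int nCols, int nColGroups)
{
    for (int r = blockIdx.x * blockDim.x + threadIdx.x; r < nRows;
         r += gridDim.x * blockDim.x)
        for (int c = 0; c < nCols; ++c)
            if (matrix[r * nCols + c])
                stats[r * nColGroups + colGroup[c]] += 1;  // count 1s per (row, column group)
}

It would be launched with something like rowColGroupCounts<<<numBlocks, numThreads>>>(d_matrix, d_colGroup, d_stats, n, n, kColGroups).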
Room For Improvement
Calculate row statistics using the histogram algorithm from the textbook (sketched below)
Block the columns
Assign one thread block to each block
Compute shared-memory histograms within the block
Merge back to global memory when finished
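A hedged sketch of that variant, under the assumption that each thread block owns one tile of rows by one tile of columns, keeps TILE_ROWS x nColGroups counters in shared memory, and merges them into the global stats array with atomics; the textbook's exact blocking may differ, and all names here are hypothetical.

#define TILE_ROWS 32
#define TILE_COLS 128

// Each block owns rows [rowBase, rowBase+TILE_ROWS) and columns
// [colBase, colBase+TILE_COLS). Counts of 1s per (row, column group) are first
// accumulated in a shared-memory histogram, then merged into global stats.
__global__ void rowColGroupCountsTiled(const int *matrix, const int *colGroup,
                                       int *stats, int nRows, int nCols, int nColGroups)
{
    extern __shared__ int hist[];                 // TILE_ROWS * nColGroups counters
    int rowBase = blockIdx.y * TILE_ROWS;
    int colBase = blockIdx.x * TILE_COLS;

    for (int i = threadIdx.x; i < TILE_ROWS * nColGroups; i += blockDim.x)
        hist[i] = 0;                              // zero the shared histogram cooperatively
    __syncthreads();

    int rowEnd = min(rowBase + TILE_ROWS, nRows);
    int colEnd = min(colBase + TILE_COLS, nCols);
    for (int c = colBase + threadIdx.x; c < colEnd; c += blockDim.x) {
        int g = colGroup[c];
        for (int r = rowBase; r < rowEnd; ++r)
            if (matrix[r * nCols + c])
                atomicAdd(&hist[(r - rowBase) * nColGroups + g], 1);
    }
    __syncthreads();

    for (int i = threadIdx.x; i < TILE_ROWS * nColGroups; i += blockDim.x) {
        int r = rowBase + i / nColGroups;
        if (r < nRows && hist[i] != 0)            // merge non-empty counters back to global memory
            atomicAdd(&stats[r * nColGroups + (i % nColGroups)], hist[i]);
    }
}

It would be launched with a 2D grid of ceil(nCols / TILE_COLS) x ceil(nRows / TILE_ROWS) blocks and TILE_ROWS * nColGroups * sizeof(int) bytes of dynamic shared memory.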
Implementation – Stats step 2
Calculates the cost for each row group of each column group
Essentially a reduce on the per-row data
Block the rows, assign each block of rows to a thread block
Use shared memory and atomics to build a histogram of all rows in a given row group
Merge the shared histogram with the global histogram for that row group
Iterate over all row groups (a kernel sketch follows this slide)
[Figure: the example matrix partitioned into row groups and column groups; Stat(RowGroup 1, ColumnGroup 3) = 2]
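A sketch of the reduce described above, assuming step 1 left per-row counts in rowStats (nRows x nColGroups, row-major) and that nRowGroups x nColGroups counters fit in shared memory, which matches the small group counts seen in practice; names are hypothetical.

// Each block strides over rows, accumulating each row's per-column-group counts
// into a shared histogram indexed by (row group, column group), then merges the
// block's histogram into the global one with atomics.
// groupStats must be zeroed before launch.
__global__ void rowGroupStats(const int *rowStats, const int *rowGroup,
                              int *groupStats, int nRows, int nRowGroups, int nColGroups)
{
    extern __shared__ int hist[];                      // nRowGroups * nColGroups counters
    for (int i = threadIdx.x; i < nRowGroups * nColGroups; i += blockDim.x)
        hist[i] = 0;
    __syncthreads();

    for (int r = blockIdx.x * blockDim.x + threadIdx.x; r < nRows;
         r += gridDim.x * blockDim.x) {
        int g = rowGroup[r];
        for (int cg = 0; cg < nColGroups; ++cg)
            atomicAdd(&hist[g * nColGroups + cg], rowStats[r * nColGroups + cg]);
    }
    __syncthreads();

    for (int i = threadIdx.x; i < nRowGroups * nColGroups; i += blockDim.x)
        if (hist[i] != 0)
            atomicAdd(&groupStats[i], hist[i]);        // merge into the global histogram
}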
Implementation – Row/Col Group Optimization
For each row, find optimal group it could belong to
Parallelized straightforwardly: one row per thread, loop and stride to get all rows (sketched in the kernel below)
Each row calculation goes through all row groups,
determines global cost of moving to that row group
Move all rows to their optimal group
Recompute statistics
Repeat for column groups
Continue alternating row/column groupings until
convergence
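A sketch of the row-reassignment kernel: one row per thread with a grid-stride loop, scoring the row against every row group and recording the cheapest. The cost below is a simplified per-row code-length estimate (bits to encode the row's 1s and 0s under each block's smoothed density); it stands in for the project's actual cost function, and all names are hypothetical.

__global__ void findBestRowGroup(const int *rowStats,     // nRows x nColGroups: 1-counts per row
                                 const int *groupStats,   // nRowGroups x nColGroups: 1-counts per block
                                 const int *groupRows,    // rows currently in each row group
                                 const int *colGroupCols, // columns in each column group
                                 int *bestGroup,          // output: cheapest row group per row
                                 int nRows, int nRowGroups, int nColGroups)
{
    for (int r = blockIdx.x * blockDim.x + threadIdx.x; r < nRows;
         r += gridDim.x * blockDim.x) {
        float bestCost = 1.0e30f;                         // effectively +infinity
        int best = 0;
        for (int g = 0; g < nRowGroups; ++g) {
            float cost = 0.0f;
            for (int cg = 0; cg < nColGroups; ++cg) {
                // Smoothed density of 1s in block (g, cg).
                float cells = (float)groupRows[g] * colGroupCols[cg] + 2.0f;
                float p1 = (groupStats[g * nColGroups + cg] + 1.0f) / cells;
                int n1 = rowStats[r * nColGroups + cg];
                int n0 = colGroupCols[cg] - n1;
                // Bits to encode this row's slice of column group cg under block (g, cg).
                cost += -n1 * log2f(p1) - n0 * log2f(1.0f - p1);
            }
            if (cost < bestCost) { bestCost = cost; best = g; }
        }
        bestGroup[r] = best;
    }
}

A second pass (or the host) then moves every row to bestGroup[r] and the statistics are recomputed, as described above.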
Room For Improvement
Parallelization could be more sophisticated
Could block the rows and compute the cost of the row joining each row group in parallel
Using shared memory atomics to identify minimum cost
In practice, this algorithm heavily favors a small
number of row and column groups
The parallelization benefit would therefore be small
Implementation – Outer Loop
After local minimum is found, change initial number
of row and column groups and retry
Change the number of row groups or the number of column groups, up or down
Continue changing number of row or column groups in
that direction until cost fails to decrease
Try both directions in both dimensions before stopping
Outer loop performed on CPU
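A hedged host-side sketch of this outer loop; coclusterCost stands in for "run the alternating inner loop to convergence for a given (row group, column group) count and return its final cost", which is not shown here.

// Nudge one dimension at a time in one direction, keep going while the converged
// cost drops, and try both directions in both dimensions before stopping.
typedef float (*CostFn)(int rowGroups, int colGroups);

void outerLoop(CostFn coclusterCost, int *k, int *l)
{
    float best = coclusterCost(*k, *l);
    int improved = 1;
    while (improved) {
        improved = 0;
        for (int dim = 0; dim < 2; ++dim) {            // 0: row groups, 1: column groups
            for (int dir = -1; dir <= 1; dir += 2) {   // try shrinking and growing
                for (;;) {
                    int nk = *k + (dim == 0 ? dir : 0);
                    int nl = *l + (dim == 1 ? dir : 0);
                    if (nk < 1 || nl < 1) break;
                    float c = coclusterCost(nk, nl);
                    if (c >= best) break;              // cost failed to decrease
                    best = c; *k = nk; *l = nl;        // keep moving in this direction
                    improved = 1;
                }
            }
        }
    }
}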
Room for Improvement
Outer loop could parallelize inner loop actions over
different GPUs
Each GPU could explore the different dimensions and directions in parallel
Implementation – CPU + Validation
CPU implementation performed all steps described
earlier, but sequentially
Validation
Used the CPU implementation of the statistics calculations to validate the GPU stats calculations
CPU and GPU log implementations differ, so cost calculations were validated by allowing a tolerance of 5% between results
Did not have time to validate the overall algorithm or visualize its outputs to see if the co-clusters produced were reasonable
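A small sketch of the tolerance check implied above, comparing CPU and GPU costs with a 5% relative tolerance; the function name is hypothetical.

#include <math.h>

// Returns 1 when the GPU cost is within `tol` (e.g. 0.05) of the CPU cost,
// allowing for the differing log implementations noted above.
int costsMatch(double cpuCost, double gpuCost, double tol)
{
    double denom = fabs(cpuCost) > 1e-12 ? fabs(cpuCost) : 1.0;
    return fabs(cpuCost - gpuCost) / denom <= tol;
}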
Timing Measurements
Time was measured with clock()/CLOCKS_PER_SEC in the CPU implementation
Measured with CUDA events in the GPU implementation
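A sketch of the two timing paths, assuming a placeholder runCpuCocluster() for the CPU side and someKernel for the GPU side; neither name is from the original project.

#include <time.h>
#include <cuda_runtime.h>

// CPU path: wall time via clock()/CLOCKS_PER_SEC, as described above.
clock_t t0 = clock();
runCpuCocluster();                                  // hypothetical CPU implementation
double cpuSeconds = (double)(clock() - t0) / CLOCKS_PER_SEC;

// GPU path: CUDA events bracketing the kernel work.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
someKernel<<<numBlocks, numThreads>>>(/* ... */);   // hypothetical kernel launch
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float gpuMs = 0.0f;
cudaEventElapsedTime(&gpuMs, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);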
Development Lessons Learned
CUDA and structured data is a bad idea
Even structs of arrays are impossible to deal with
Host-side pointer math on device pointers does not work
CUDA API has REALLY unfriendly error messages
__device__ variables declared globally must be
passed to kernels
Take care to do very, very little through that API
Runtime errors otherwise
You can malloc and free memory in device code as of CUDA 3.2 (allocations come from the device heap in global memory, not shared memory)
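One common mitigation for the unfriendly error reporting noted above, not something from the original project: wrap every runtime API call so the failing call, file, and line are printed.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Standard error-checking macro: report which call failed, where, and why.
#define CUDA_CHECK(call)                                                     \
    do {                                                                     \
        cudaError_t err__ = (call);                                          \
        if (err__ != cudaSuccess) {                                          \
            fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__, __LINE__,    \
                    #call, cudaGetErrorString(err__));                       \
            exit(EXIT_FAILURE);                                              \
        }                                                                    \
    } while (0)

// Example: CUDA_CHECK(cudaMalloc(&d_matrix, n * n * sizeof(int)));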
Development Lessons Learned (cont.)
Visual Studio CUDA integration leaves a lot to be
desired
All optimizations removed, still can't set breakpoints everywhere
Many variables show as freed
No real-time, in-editor compile errors
But Visual Studio does give nice auto-complete and go-to-definition navigation
No CUDA linker => separate files must be directly
#include’d
Experiment - Environment
Float.cs.drexel.edu
CPU: 4 quad-core Intel Xeon L5360 processors @ 2.13 GHz
GPU: 2 NVIDIA GeForce GTX 580 GPUs @ 1544 MHz
Experiment - Description
Sequential (CPU) and Parallel (GPU) tested on square
matrices of order 100, 1000, and 10000
Larger matrices caused memory problems
GPU tested with varying block and thread counts
Num blocks: 10, 100, 5000
Num threads: 10, 100, 1024 (max)
Resulting co-clusters usually stayed in the 50-200
row/column group range, regardless of matrix order
Row and column groupings are important in the calculation of matrix statistics, since rows and columns are blocked by these groupings
Experiment Results
[Chart: Speedup - 10 Blocks; speedup vs. matrix order (100, 1000, 10000) for 10, 100, and 1024 threads per block; speedup axis 0 to 80]
Experiment Results
For a small number of blocks, 100-thread performance peaks at num_blocks * num_threads = matrix_order
I would expect this to be the optimal configuration, when num_blocks ~= num_row_groups ~= num_col_groups
Slowdown occurs when matrix order exceeds total
number of threads and more must be done serially
Experiment Results
[Chart: Speedup - 100 Blocks; speedup vs. matrix order (100, 1000, 10000) for 10, 100, and 1024 threads per block; speedup axis 0 to 80]
Experiment Results
[Chart: Speedup - 5000 Blocks; speedup vs. matrix order (100, 1000, 10000) for 10, 100, and 1024 threads per block; speedup axis 0 to 80]
Experiment Results
Interestingly, the maximum speedup was the same for all block counts
Roughly speaking, as long as num_blocks * num_threads >= matrix_order, a maximum speedup of ~70 is achieved
10 threads never got there, due to block scheduling overhead?
Possibly cost of copying to shared memory for block processing
was not recouped in 10 thread case?
Maxing out thread count is counter-productive in smaller
matrices
Hypothesis: When block count is excessive (as for small
matrices), scheduling of large blocks of threads that return
immediately is costly
Experiment Results
[Chart: Efficiency - 10 Blocks; efficiency vs. matrix order (100, 1000, 10000) for 10, 100, and 1024 threads per block; efficiency axis 0 to 0.08]
Experiment Results
[Chart: Efficiency - 100 Blocks; efficiency vs. matrix order (100, 1000, 10000) for 10, 100, and 1024 threads per block; efficiency axis 0 to 0.05]
Experiment Results
[Chart: Efficiency - 5000 Blocks; efficiency vs. matrix order (100, 1000, 10000) for 10, 100, and 1024 threads per block; efficiency axis 0 to 0.0012]
Experiment Results
Efficiency is consistently highest for the smaller
numbers of blocks and smaller numbers of threads
within those blocks
Hypothesis: the overhead of starting blocks and threads must be high enough to result in diminishing returns when adding blocks and threads