
Parallelized Boosting
Mehmet Basbug
Burcin Cakir
Ali Javadi Abhari
Date: 10 Jan 2013
Motivating Example
• Many examples, many attributes
• Can we find a good (strong) hypothesis relating the attributes to the final labels?
Table 1. Example Data Format (examples as rows, attributes as columns, with a label per example)
User Interface
• User specifies the desired options in two ways:
  o Configuration File: information about the number of nodes/cores, memory, and the number of iterations. To be parsed by the preprocessor.
  o Behavioral Classes: defining the hypothesis "behaviors"
--------------------------------------------------------------------------
<configurations.config>
--------------------------------------------------------------------------
[Configuration 1]
working_directory = '/scratch/pboost/example'
data_files = 'diabetes_train.dat'
test_files = 'diabetes_test.dat'
fn_behavior = 'behaviors_diabetes.py'
boosting_algorithm = 'confidence_rated'
max_memory = 2
xval_no = 10
round_no = 1000
---------------------------------------------------------------------------
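The configuration file above follows INI-style syntax, so a preprocessor could read it with Python's standard `configparser` module. The sketch below parses a snippet of the file shown above; the variable names are illustrative, not pboost's actual API:

```python
from configparser import ConfigParser

# A snippet mirroring the configuration file shown above.
raw = """
[Configuration 1]
working_directory = '/scratch/pboost/example'
data_files = 'diabetes_train.dat'
boosting_algorithm = 'confidence_rated'
max_memory = 2
xval_no = 10
round_no = 1000
"""

parser = ConfigParser()
parser.read_string(raw)
section = parser["Configuration 1"]

# Values are stored as strings: strip the surrounding quotes and
# convert numeric fields explicitly.
working_dir = section["working_directory"].strip("'")
rounds = section.getint("round_no")
```

A real preprocessor would also validate the paths and pass the node/core counts on to the scheduler.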
----------------------------------------------------------------------------------------------------
<behaviors_diabetes.py>
----------------------------------------------------------------------------------------------------
from parallel_boosting.utility.behavior import Behavioral

class BGL_Day_Av(Behavioral):
    def behavior(self, bgl_m, bgl_n, bgl_e):
        # Daily average of the morning, noon and evening BGL readings
        return (self.data[:, bgl_m] + self.data[:, bgl_n] + self.data[:, bgl_e]) / 3

    def fn_generator(self):
        # Register one hypothesis per day from column triples (3k, 3k+1, 3k+2)
        for k in range(1, (self.data.shape[1] - 4) // 3 + 1):
            bgl_m = 3 * k
            bgl_n = 3 * k + 1
            bgl_e = 3 * k + 2
            self.insert(bgl_m, bgl_n, bgl_e)
----------------------------------------------------------------------------------------------------
Package Diagram
Pre-Processing
• User-defined Python classes: to obtain different function behaviors and a set of hypotheses.
• Configuration file: to get the path of the required data and definitions.
• Function Definitions Table: to store the hypotheses and make them available to different cores.
• Hypothesis Result Matrix
• Sorting Index Matrix: to save the sorting indices of each example.
Function ID | Behavior Path                                   | Behavioral Class     | Arguments
1           | "/Scratch/pboost/example/behaviors_diabetes.py" | "BMI_Exponential"    | "{'weight_col':2, 'height_col':3, 'power':2}"
2           | "/Scratch/pboost/example/behaviors_diabetes.py" | "BGL_Day_Av"         | "{'BGL-m':4, 'BGL-a':5}"
3           | "/Scratch/pboost/example/behaviors_diabetes.py" | "BGL_Day_Av"         | "{'BGL-m':7, 'BGL-a':8}"
4           | "/Scratch/pboost/example/behaviors_diabetes.py" | "BGL_Time_Period_Av" | "{'day-1-BGL':4, 'day-2-BGL':7}"
5           | "/Scratch/pboost/example/behaviors_diabetes.py" | "BGL_Time_Period_Av" | "{'day-1-BGL':5, 'day-2-BGL':8}"
6           | "/Scratch/pboost/example/behaviors_diabetes.py" | "BGL_Time_Period_Av" | "{'day-1-BGL':6, 'day-2-BGL':9}"
Table 2. Function Definitions Table
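The Arguments column in Table 2 stores each hypothesis's keyword arguments as a string. One safe way a worker could turn such a row back into usable values is `ast.literal_eval`; the row dictionary and field names below are illustrative, not pboost's internal representation:

```python
import ast

# One row of the Function Definitions Table (values taken from Table 2).
row = {
    "function_id": 2,
    "behavior_path": "/Scratch/pboost/example/behaviors_diabetes.py",
    "behavioral_class": "BGL_Day_Av",
    "arguments": "{'BGL-m':4, 'BGL-a':5 }",
}

# The Arguments column is a string; ast.literal_eval parses it back into
# a dict safely (no arbitrary code execution, unlike eval).
kwargs = ast.literal_eval(row["arguments"])
```

Each core can then load the behavioral class named in the row and call its `behavior` with these column indices.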
Pre-Processing (cont.)
        F1      F2     F3     F4     F5     F6
Alice   24.129  3.221  1.212  1.321  6.782  5.123
Bob     32.312  1.108  5.875  1.412  3.091  4.312
...     ...     ...    ...    ...    ...    ...
Helena  25.178  6.612  4.912  3.128  2.412  7.281
Table 3. Function Output Table
F1  F2  F3  F4  F5  F6
1   2   5   3   7   2
3   1   2   4   3   1
... ... ... ... ... ...
6   7   8   1   5   4
Table 4. Sorting Index Table
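The Sorting Index Table can be built from the Function Output Table with a single column-wise argsort. A minimal sketch with NumPy (toy values in the spirit of Table 3; not pboost's actual code):

```python
import numpy as np

# Function outputs for three examples and two functions (cf. Table 3).
outputs = np.array([[24.129, 3.221],
                    [32.312, 1.108],
                    [25.178, 6.612]])

# For each function (column), store the order of example indices that
# sorts the outputs ascending. uint16 suffices for up to 65,535 examples
# and is a quarter of the size of the float64 outputs themselves.
sort_idx = np.argsort(outputs, axis=0).astype(np.uint16)
```

Sorting once during preprocessing lets every boosting round scan thresholds in sorted order without re-sorting.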
Applying each function to each example is a parallelizable task. Therefore, another important step that needs to be implemented in the preprocessing part is to read the machine information from the configuration file.
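Since each function is applied to the data independently of the others, the columns of the Function Output Table can be computed concurrently. The toy sketch below uses Python threads to stand in for separate cores or nodes; the data and functions are illustrative, not pboost's API:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

# Toy data: 4 examples, 3 attributes.
data = np.arange(12.0).reshape(4, 3)

# Two toy "behaviors"; each maps the whole data set to one output column.
functions = [
    lambda rows: (rows[:, 0] + rows[:, 1]) / 2,  # mean of two attributes
    lambda rows: rows.sum(axis=1),               # sum of all attributes
]

# Each function's column is computed by a separate worker, then the
# columns are gathered into the function output matrix.
with ThreadPoolExecutor(max_workers=2) as pool:
    columns = list(pool.map(lambda fn: fn(data), functions))
output = np.column_stack(columns)
```

In the actual system the work would be distributed across the nodes/cores named in the configuration file rather than across threads.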
Training the boosting algorithm
[Diagram: the master sends the distribution Dt to each slave and receives the best hypothesis from each (h1t, h2t, ...). The sorting index table is partitioned among the slaves, and each slave keeps an error matrix for its own share of the hypotheses.]
Boosting (Master):
    Start with a distribution over the examples (Dt)
    For each round t = 1...T:
        send Dt to each slave
        receive the best hypothesis from each slave (h1t, h2t, ...)
        find the one with the least error (ht)
        update Dt using ht
        calculate the coefficient at
    Return the linear combination of the ht's

Weak Learner (Slave):
    Calculate the error of each combination (hypothesis, labeling, threshold)
    for every hypothesis in the given set,
    under the given distribution over the examples (Dt)
    Return the hypothesis with the least error
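The master loop above can be sketched serially as a minimal AdaBoost with threshold stumps, the slave's least-error search inlined as the inner loop. This is a toy illustration of the algorithm, not pboost's distributed implementation:

```python
import numpy as np

# Toy 1-D data: the label is +1 when the single attribute exceeds 2.
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([-1, -1, -1, 1, 1, 1])

n = len(X)
D = np.full(n, 1.0 / n)          # start with a uniform distribution Dt
alphas, stumps = [], []

for t in range(3):               # a few boosting rounds
    # "Slave" step: pick the (threshold, sign) stump with least weighted error.
    best = None
    for thresh in X:
        for sign in (1, -1):
            pred = np.where(X > thresh, sign, -sign)
            err = D[pred != y].sum()
            if best is None or err < best[0]:
                best = (err, thresh, sign, pred)
    err, thresh, sign, pred = best
    # "Master" step: coefficient a_t, then reweight Dt and renormalize.
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
    D = D * np.exp(-alpha * y * pred)
    D /= D.sum()
    alphas.append(alpha)
    stumps.append((thresh, sign))

# The final classifier is the sign of the weighted vote of the stumps.
def classify(x):
    votes = sum(a * (s if x > th else -s)
                for a, (th, s) in zip(alphas, stumps))
    return 1 if votes > 0 else -1
```

In the parallel version, only the inner stump search is farmed out to the slaves; the reweighting of Dt stays on the master.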
Features
- Super fast
  --- memory-based
  --- single pass through the data
  --- stores indexes rather than results (16-bit vs. 64-bit)
  --- LAPACK & numexpr
  --- embarrassingly parallel
- Several boosting algorithms
- Flexible cross-validation (xval) structure
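The "16-bit vs. 64-bit" point can be quantified: a column of sorting indices costs a quarter of the memory of the raw hypothesis outputs it replaces. A small NumPy sketch (the sizes are illustrative):

```python
import numpy as np

n_examples = 10_000

# Storing raw hypothesis outputs needs 64-bit floats...
results = np.zeros(n_examples, dtype=np.float64)
# ...while a column of sorting indices fits in 16-bit unsigned ints
# (enough for up to 65,535 examples): a 4x saving per column.
indexes = np.zeros(n_examples, dtype=np.uint16)

ratio = results.nbytes // indexes.nbytes
```

Across thousands of hypothesis columns, this is what lets the whole table stay memory-resident.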
Post-Processing
• Combines and reports the collected results.
• The result after each round of iteration is stored by the master: the set of hypotheses (ht), their respective coefficients (at), and the error.
• Plot training and testing error vs. number of rounds.
• Plot ROC curves for the training and testing data.
• Confusion matrix showing false/true positives/negatives.
• Create a standalone final classifier.
• Report running time, amount of memory used, number of cores, ...
• Clean up extra intermediary data stored on disk.
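The confusion matrix report reduces to four counts over the label/prediction pairs. A minimal sketch with illustrative values (not results from the diabetes data set):

```python
import numpy as np

# True labels and a classifier's predictions for eight test examples.
y_true = np.array([ 1,  1,  1,  1, -1, -1, -1, -1])
y_pred = np.array([ 1,  1, -1,  1, -1, -1,  1, -1])

tp = int(np.sum((y_true == 1)  & (y_pred == 1)))   # true positives
fn = int(np.sum((y_true == 1)  & (y_pred == -1)))  # false negatives
fp = int(np.sum((y_true == -1) & (y_pred == 1)))   # false positives
tn = int(np.sum((y_true == -1) & (y_pred == -1)))  # true negatives

confusion = [[tp, fn],
             [fp, tn]]
```

The same counts, swept over the classifier's score threshold, also yield the points of the ROC curve.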
Thank You!