Classification and Regression Trees for GLAST Analysis: to IM or not to IM?

Classification and Regression Trees for GLAST Analysis: to IM or not to IM?
Toby Burnett
Data Challenge Meeting, 15 June 2003
The problem
• Bill is using IM's classification and regression tree analysis for:
  – calorimeter validity
  – PSF tail suppression
  – background suppression
• IM is proprietary, and rather expensive ($5K): only UW and UCSC have academic licenses ($500 single; $1K for 10).
Bill’s IM worksheet (PSFAnalysis_14)
[Screenshot of the IM worksheet, with labeled stages: training region, input tuple, prediction tree, analyze results]
The Trees: calculate 4 values with 11 nodes
• Good calorimeter measurement [1 node]
• Vertex vs. 1 track (thin and thick) [2 nodes]
• Core vs. tail (thin/thick and vtx/1 trk) [4 nodes]
• Prediction of recon direction error [4 nodes]
Example: a GoodCAL/BadCal prediction node
CalTwrEdge < 48.48,
CalTrackDoca < 10.27,
CalTwrEdge >= 26.58,
CalTwrEdge < 34.81,
CalXtalRatio < 0.82,
CalTransRms > 3611.48,
CalTrackDoca > 3.96,
CalXtalRatio < 0.46,
CalTotSumCorr > 1.76
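
Read top to bottom, these cuts are the conjunction along one root-to-leaf path: an event reaches this node only if every cut holds. A minimal R sketch (the event values are invented for illustration):

    # Hypothetical event; the values are chosen only to satisfy each cut.
    evt <- list(CalTwrEdge = 30.0, CalTrackDoca = 8.0, CalXtalRatio = 0.40,
                CalTransRms = 4000.0, CalTotSumCorr = 2.0)

    # An event lands in this GoodCAL node iff all path cuts hold.
    in_node <- with(evt,
        CalTwrEdge < 48.48 & CalTrackDoca < 10.27 & CalTwrEdge >= 26.58 &
        CalTwrEdge < 34.81 & CalXtalRatio < 0.82 & CalTransRms > 3611.48 &
        CalTrackDoca > 3.96 & CalXtalRatio < 0.46 & CalTotSumCorr > 1.76)
    in_node   # TRUE: this event reaches the node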
Bill’s result*
* Flawed by G4 problems
A Solution
• IM saves its results as XML files, which are easy to interpret.
• A new package, "classification", defines a class classification::Tree that does the following:
  – accepts a "lookup" object to obtain a pointer to the double associated with each named quantity
  – parses the XML file, creating a tree for each prediction tree found
  – returns a value from each tree
• Merit creates and fills the new tuple variables, in a new class ClassificationTree, which:
  – duplicates the logic defining the 4 categories
  – evaluates each of the 4 variables
Current Procedure
• Bill releases an IM file.
• I strip it down, removing nodes not required for the analysis
  – size reduced by half, to 500 KB
• Rename it, and check it in to CVS as classification/xml/PSF_Analysis.xml
• Create a tuple with merit, containing the new tuple quantities
• Feed that tuple to this IM worksheet, which writes a new tuple with both versions
Results: the good
• The comparisons were made with 10000 generated 100 MeV normal-incidence events.
• The vertex classification (used to select the vertex vs. 1-track direction estimate) is perfect, as is the core vs. tail classification.

[Plots: Pr(CORE) vs. IM coreProb and Pr(VTX) vs. IM vertexProb]
Results: the bad
• The results of the "regression tree" to predict the PSF error have two populations!
  – The merit evaluation is only the first tree.
  – The IM evaluation uses an average of the two trees.
  – Note that there are three branches.
• The agreement is rather poor for the "thin vertex" category; otherwise perfect.
• An explanation: Bill generated two different trees from different data sets, of 1000 and 243 events. (The latter has only two nodes and can only generate 3 values.)

[Plot: Pred.PSF.Core vs. IM psfErrPred; series: all categories, thin vertex, line]
Results: the ugly
• This is the comparison of the prediction for a good energy measurement.
• Again, Bill created two trees, which are apparently being averaged.

[Plot: Pr(GoodCAL) vs. IM goodCalProb]
Observations
• Fixing the "disagreement":
  – Bill: will train only one tree
  – me: average all the trees (a sketch follows this list)
• Using IM to train the classification or regression trees:
  – The current procedure is exploratory.
  – If we decide to use these trees in the final analysis, they must be trained systematically.
  – Another possibility (idea from Tracy): use the classification/regression analysis in S-PLUS, which manages tree objects.
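
A minimal sketch of the averaging option in R (the open-source alternative discussed on the next slide); tree1, tree2, and events are hypothetical stand-ins for two trained prediction trees and an event tuple:

    # Average the predictions of several trees instead of keeping only
    # the first one, as merit currently does.
    average_trees <- function(trees, newdata) {
        preds <- sapply(trees, predict, newdata = newdata)  # one column per tree
        rowMeans(preds)
    }
    # usage: avg <- average_trees(list(tree1, tree2), newdata = events)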
S-PLUS
• No question about academic licenses ($100 per license at UW)
• Linux version available
• Open-source alternative: R
• Scriptable, also callable from C++
• Supports the same classification and regression tree functions (we think!)

From the S-PLUS help page:

Fit a Regression or Classification Tree

DESCRIPTION:
Grows a tree object from a specified formula and data.

USAGE:
tree(formula, data=<<see below>>, weights=<<see below>>,
     subset=<<see below>>, na.action=na.fail,
     method="recursive.partition", control=<<see below>>,
     model=NULL, x=F, y=T, ...)

REQUIRED ARGUMENTS:
formula
a formula expression as for other regression models, of the form `response ~ predictors'.
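
The same interface exists in R's tree package (an implementation compatible with the S-PLUS function); a minimal check that the call carries over, using R's built-in iris data:

    library(tree)
    # Grow a classification tree: species from the four flower measurements.
    fit <- tree(Species ~ ., data = iris)
    summary(fit)                               # size, deviance, misclassification rate
    predict(fit, iris[1:5, ], type = "class")  # per-row class predictions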
Status
• Work done by a summer student (both studies are reproduced in the R sketch below):
  – Explore a classification tree with random x, y in (0,1); good = x < y; see the validity plot below.
  – Explore a regression tree: feed it x, y = x^2, have it create a predictor for y.
• In progress: direct comparison
  – Choose the GoodCAL category: ifelse((EvtMcEnergySigma > -5.), "GoodCAL", "BadCal")
  – Use IM (v2) to create a classification with the independent variables used by Bill.
  – Write the results to a file for S-PLUS.
• Next steps:
  – Run the same analysis in S-PLUS, compare.
  – Establish procedures to construct tree predictions with R or S-PLUS.

[Plot: classification-tree validity, y vs. x on (0,1)]
[Tree diagram: "Approximating w = z^2 Using a Regression Tree"; splits at z < 0.6405, 0.4105, 0.2625, 0.5335, 0.8335, 0.7415, 0.9195; leaf values 0.02292, 0.11510, 0.22400, 0.34550, 0.47830, 0.62090, 0.76890, 0.92210]
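
A rough R reproduction of the two student studies (the sample size and seed are invented; the slide does not give them):

    library(tree)
    set.seed(1)
    n <- 1000

    # 1. Classification: random points in the unit square, labeled by
    #    which side of the diagonal x = y they fall on.
    d <- data.frame(x = runif(n), y = runif(n))
    d$good <- factor(ifelse(d$x < d$y, "good", "bad"))
    cfit <- tree(good ~ x + y, data = d)
    plot(cfit); text(cfit)   # axis-parallel cuts approximating the diagonal

    # 2. Regression: predict w = z^2 from z; the fit is a step function,
    #    as in the "Approximating w = z^2" diagram above.
    r <- data.frame(z = runif(n))
    r$w <- r$z^2
    rfit <- tree(w ~ z, data = r)
    plot(rfit); text(rfit)
    plot(r$z, predict(rfit, r)); curve(x^2, add = TRUE)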