Transcript Document

Andreas Höcker (ATLAS), Kai Voss (ATLAS), Helge Voss (LHCb),
Jörg Stelzer (BaBar), Peter Speckmayer (CERN)
Xavier Prudent - LAPP
BaBar Collaboration Meeting
June 2006 - Montreal
Multi-variable analysis widely used in HEP
(LEP, BaBar, Belle, D0, MiniBooNE, …)
Common reproaches to multi-variable methods :
“Black box” methods
The training sample may not describe the data correctly
(creates no bias, only bad performance, but requires a control sample)
In case of correlations, cuts are no longer transparent
Systematics ?
Independent & cumbersome implementations
…
Need for a global tool that would :
provide the most common MV methods
do both the training and the evaluation of these methods
enable easy computation of systematics
“TMVA” means Toolkit for Multivariate Analysis
ROOT package written by Andreas Höcker, Kai Voss, Helge Voss, Jörg Stelzer and
Peter Speckmayer for the evaluation of MV methods in parallel with an analysis
MV Methods available so far :
Rectangular cut optimization
Correlated Likelihood estimator (PDE)
Multi-dimensional likelihood estimator (PDE)
Fisher & Mahalanobis discriminant
H-Matrix (χ2 estimator)
Neural Network (2 different implementations)
Boosted decision tree
TMVA provides training, testing & evaluation of these methods
A dedicated class lets you plug the training results into your favorite analysis
Cut Optimization
 Scan in signal efficiency for highest background rejection
Correlated & de-correlated likelihood
 PDE approach, generalized to a multi-dimensional likelihood; for event i the estimator is the likelihood ratio
   x_PDE,i = L_S(i) / ( L_S(i) + L_B(i) )
 Output transformed by an inverse Fermi function (less peaked):
   x'_PDE,i = -(1/τ) ln( 1/x_PDE,i - 1 )
 De-correlation possible with the square
root of the covariance matrix
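As a plain-C++ sketch of the inverse-Fermi transform mentioned above (the stretch parameter tau is an assumed illustrative value, not necessarily TMVA's default):

```cpp
#include <cmath>

// Inverse-Fermi transform of a likelihood ratio x in (0,1):
//   x' = -(1/tau) * ln(1/x - 1)
// It maps the sharply peaked ratio onto the whole real axis, giving a
// less peaked output; tau = 15 is an assumed illustrative stretch factor.
double inverseFermi(double x, double tau = 15.0) {
    return -std::log(1.0 / x - 1.0) / tau;
}
```

x = 0.5 maps to 0, and values near 0 or 1 are stretched out instead of piling up at the edges of the distribution.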
Fisher discriminant and H-matrix
Neural Network
 Classical definitions
 2 NNs, both multi-layer perceptrons with stochastic learning
• Clermont-Ferrand ANN (used for the ALEPH Higgs analysis)
• TMultiLayerPerceptron (ANN from ROOT)
(Boosted) Decision trees
Inspired by MiniBooNE
Sequential application of cuts
What is a Boosted Decision Tree ?
Training
Each event has a weight Wi (= 1 to start)
Cut on the variable that optimizes the separation, based on the purity P of the resulting node:
  P = Σi Wi(signal) / Σj Wj
[Diagram: root node with S/B = 52/48, split on Var1 ≤ x1 / Var1 > x1 into nodes with S/B = 4/37 and 48/11; the 48/11 node split on Var2 ≤ x2 / Var2 > x2 into leaves with S/B = 2/10 and 46/1]
Cut optimization by scanning (Genetic Algorithm soon)
Split until a minimal #events or a purity limit is reached
Final node = “leaf” : if P > Pmin → “signal leaf”, if P < Pmin → “background leaf”
Boosting : if a signal event lands on a background leaf, or a background event on a signal leaf, its weight is modified
Training is then re-performed with the new weights (× 1000 trees)
Testing
Starting from the root node, each event goes through the 1000 boosted trees
Each time an event ends on a signal or background leaf, its score is updated (↔ Neural Net output)
(smoother output than the classical discrete output)
How to get and use TMVA ?
How to download TMVA ?
• Get a tgz file from the TMVA website http://tmva.sourceforge.net, then click on Download
• Via cvs : cvs -z3 -d:pserver:[email protected]:/cvsroot/tmva co -P TMVA
 Automatic creation of 6 directories
src/ Source for TMVA library
example/ example of how to use TMVA
lib/ TMVA library once compiled
reader/ all functionalities to apply MV weights
macros/ ROOT macros to display the results
development/ working & testing directories
For your own analysis :
> cp example myTMVA
 Modify “makefile” for compilation in /myTMVA
Detailed Steps for the Example
How to compile TMVA ?
Include TMVA/lib in your library path
/home> cd TMVA
/home/TMVA> source setup.csh
/home/TMVA> cd src/
/home/TMVA/src> make
Compiles the library → libTMVA.so
How to choose the MV method I want ?
Go to the examples/ directory and open TMVAnalysis.cpp
/home/TMVA/src cd ../examples
You will find a list of available methods (Booleans)
Switch to 1/0 the method you want/don’t want
…
Bool_t Use_Cuts           = 1;
Bool_t Use_Likelihood     = 0;
Bool_t Use_LikelihoodD    = 0;
Bool_t Use_PDERS          = 0;
Bool_t Use_HMatrix        = 0;
Bool_t Use_Fisher         = 1;
Bool_t Use_CFMlpANN       = 1;
Bool_t Use_TMlpANN        = 0;
Bool_t Use_BDT_GiniIndex  = 0;
Bool_t Use_BDT_CrossEntro = 0;
Bool_t Use_BDT_SdivStSpB  = 0;
Bool_t Use_BDT_MisClass   = 0;
…
You just have to switch the Booleans on or off !
Here for instance I will compare Cuts, Fisher and the CFM Neural Net
How to point TMVA to the training samples & variables ?
In TMVAnalysis.cpp
Both ascii and ROOT files can be used as input
Creation of the factory object
How to point to the input ascii files
How to point to the variables (example with 4 variables)
In /examples/data : toy_sig.dat bkg_toy.dat
How to change the training options ?
In TMVAnalysis.cpp
Training cycles, #hidden layers, #neurons per layer
factory->PrepareTrainingAndTestTree( mycut, 2000, 4000 );
#events used : 2000 for training, 4000 for testing
 A description of every option is given in the class BookMethod
How do I run TMVA ?
/home/TMVA/src> cd ../examples
/home/TMVA/examples> make
/home/TMVA/examples> TMVAnalysis "myOutput.root"   (name of the output ROOT file)
What does it create ?
Some weight files for each trained MV method in weights/
A ROOT file in the main directory with the MV outputs and efficiencies
How to look at the results ?
Use the nice ROOT macros in the directory macros/
/home/TMVA/examples> root -l
root [0] .L ../macros/efficiencies.C
root [1] efficiencies("myOutput.root")
Plots are created in the directory plots/
Which ROOT macros are available ? (1)
variables.C
 Distributions of input variables
Which ROOT macros are available ? (2)
correlations.C
 Colored correlation matrix of input variables
Numeric values displayed during TMVA running
Which ROOT macros are available ? (3)
mvas.C
 Outputs of MV methods
Which ROOT macros are available ? (4)
efficiencies.C
 Background rejection vs. Signal efficiency
Direct comparison of all MV methods !
I have trained the MV method I want …
I have the weight files …
How to use this MV method in my analysis ?
Detailed example is TMVA/reader/TMVApplication.cpp
Dedicated class
reader/TMVA_reader.hh
The next slide shows what must be included in your analysis program …
Work in progress (being implemented in ROOT), so later versions may differ …
#include "TMVA_reader.h"
using TMVApp::TMVA_Reader;                        // [1] Include the reader class

void MyAnalysis() {

  // [2] Create an array of the input variable names (here 4 variables)
  vector<string> inputVars;
  inputVars.push_back( "var1" );
  inputVars.push_back( "var2" );
  inputVars.push_back( "var3" );
  inputVars.push_back( "var4" );

  // [3] Create the reader class
  TMVA_Reader *tmva = new TMVA_Reader( inputVars );

  // [4] Read the weights and build the MV tool
  tmva->BookMVA( TMVA_Reader::Fisher, "TMVAnalysis_Fisher.weights" );

  // [5] Create an array with the input variable values
  vector<double> varValues;
  varValues.push_back( var1 );
  varValues.push_back( var2 );
  varValues.push_back( var3 );
  varValues.push_back( var4 );

  // [6] Compute the value of the MV; this is the value you will cut on
  double mvaFi = tmva->EvaluateMVA( varValues, TMVA_Reader::Fisher );

  delete tmva;
}
TMVA is already used by several AWGs in BaBar
UK Charmless Dalitz group : TMVA Fisher for continuum rejection in the
Dalitz-plot analyses of KSπ+π− and K+π−π+ ( BADs 1376 and 1512 ).
Use of 11 input variables; pictures taken from BAD 1376
Group D0h0 : TMVA Clermont-Ferrand NN for continuum rejection in the measurement
of the BFs of the color-suppressed modes B0 → D0h0 (h0 = ω, η, η’, ρ, π0)
and in the measurement of the CKM angle β
Use of 4 input variables
Measurement of sin(2α) with B → ρπ
 Uses the Clermont-Ferrand NN to reject combinatorial background
Measurement of the CKM angle γ with the GLW method (Emmanuel Latour – LLR)
 Uses Fisher to reject combinatorial background
Signal = MC signal
B → D*K, D* → D0π0, D0 → Kπ
Background = udsc MC
What to keep in mind about TMVA ?
A powerful multivariate toolkit with 12 different methods (more are coming)
A user-friendly package from training to plots ! Already used in BaBar
Easy, direct comparison between the different MV methods
C++ & ROOT functionalities, announced in ROOT version v5-11-06 http://root.cern.ch/
Have a look at http://tmva.sourceforge.net/ !!
Talk by Kai Voss at CERN
http://agenda.cern.ch/askArchive.php?base=agenda&categ=a057207&id=a057207s27t6/transparencies
TMVA Tutorial
https://twiki.cern.ch/twiki/bin/view/Atlas/AnalysisTutorial1105#TMVA_Multi_Variate_Data_Analysis
Physics analysis HN advertisement
http://babar-hn.slac.stanford.edu:5090/HyperNews/get/physAnal/2989.html
A similar tool has been developed by Ilya Narsky ( StatPatternRecognition )
Back Up Slides
Available Options for Every Method in TMVAnalysis.cpp
Rectangular cut optimization
Correlated Likelihood estimator (PDE)
Multi-dimensional likelihood estimator (PDE)
Fisher & Mahalanobis discriminant
H-Matrix (χ2 estimator)
Neural Network (2 different implementations)
Boosted decision tree
Rectangular cuts
factory->BookMethod( "MethodCuts", "Method : nBin : OptionVar1 : … : OptionVarn" );
nBin : # bins in the histogram of efficiency S/B
Method : method of cut optimization
- "MC" : Monte Carlo optimization (recommended)
- "FitSel" : Minuit fit: "Fit_Migrad" or "Fit_Simplex"
- "FitPDF" : PDF-based: only useful for uncorrelated input variables
OptionVari : option for each variable
- "FMax" : ForceMax (the max cut is fixed to the maximum of variable i)
- "FMin" : ForceMin (the min cut is fixed to the minimum of variable i)
- "FSmart" : ForceSmart (the min or max cut is fixed to min/max, based on the mean value)
- Adding "All" to "option_vari", e.g. "AllFSmart", will use this option for all variables
- If "option_vari" is empty (== ""), no assumptions on cut min/max are made
Likelihood
factory->BookMethod( "MethodLikelihood", "TypeOfSpline : NbSmooth : NbBin : Decorr" );
TypeOfSpline : which spline is used for smoothing the pdfs, "Splinei" [i = 1,2,3,5]
NbSmooth : how often the input histos are smoothed
NbBin : average number of events per PDF bin to trigger a warning
Decorr : option for decorrelation
- "NoDecorr" : do not use the square-root matrix to decorrelate the variable space
- "Decorr" : decorrelate the variable space
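The square-root-of-the-covariance-matrix decorrelation can be illustrated for two variables: diagonalize C = [[a,b],[b,c]], scale each eigen-direction by 1/sqrt(eigenvalue), rotate back, i.e. apply C^(-1/2). A generic sketch, not the TMVA code:

```cpp
#include <array>
#include <cmath>

// Apply C^(-1/2) to a 2-vector x, where C = [[a,b],[b,c]] is a symmetric,
// positive-definite covariance matrix.  After this transform the two
// variables are uncorrelated with unit variance.
std::array<double, 2> decorrelate(double a, double b, double c,
                                  const std::array<double, 2>& x) {
    // Eigenvalues of C
    double mean = 0.5 * (a + c);
    double disc = std::sqrt(0.25 * (a - c) * (a - c) + b * b);
    double l1 = mean + disc, l2 = mean - disc;
    // Rotation angle of the eigenbasis
    double theta = 0.5 * std::atan2(2.0 * b, a - c);
    double ct = std::cos(theta), st = std::sin(theta);
    // Rotate into the eigenbasis, scale by 1/sqrt(lambda), rotate back
    double y1 = ( ct * x[0] + st * x[1]) / std::sqrt(l1);
    double y2 = (-st * x[0] + ct * x[1]) / std::sqrt(l2);
    return { ct * y1 - st * y2, st * y1 + ct * y2 };
}
```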
Fisher Discriminant and H-Matrix
factory->BookMethod( "MethodFisher", "Fisher" );
Which method :
- "Fisher"
- "Mahalanobis" (another definition of distance)
factory->BookMethod( "MethodHMatrix" );
Artificial Neural Network
factory->BookMethod( "WhichANN", "NbCycles:NeuronsL1:NeuronsL2:…:NeuronsLn" );
Which type of NN :
- "MethodCFMlpANN" : Clermont-Ferrand NN, used for the Higgs search in ALEPH
- "MethodTMlpANN" : ROOT NN
NbCycles : number of training cycles
NeuronsLi : number of neurons in layer i; the 1st layer necessarily has as many neurons as input variables
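One fully connected layer of such a multi-layer perceptron (with sigmoid activations) can be sketched as follows; the weights are placeholders to be fixed by training, and this is not the CFMlpANN or TMultiLayerPerceptron code:

```cpp
#include <cmath>
#include <vector>
#include <cstddef>

double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Forward pass through one fully connected layer: each output neuron j
// computes sigmoid( bias[j] + sum_i w[j][i] * in[i] ).  Stacking such
// layers gives the "NeuronsL1:NeuronsL2:..." structure of the option string.
std::vector<double> layerForward(const std::vector<double>& in,
                                 const std::vector<std::vector<double>>& w,
                                 const std::vector<double>& bias) {
    std::vector<double> out(w.size());
    for (std::size_t j = 0; j < w.size(); ++j) {
        double s = bias[j];
        for (std::size_t i = 0; i < in.size(); ++i) s += w[j][i] * in[i];
        out[j] = sigmoid(s);
    }
    return out;
}
```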
Boosted Decision Trees
factory->BookMethod( "MethodBDT", "nTree : BoostType : SeparationType : nEvtMin : MaxNodePurity : nCuts" );
nTree : number of trees
BoostType : method of boosting
- AdaBoost
- EpsilonBoost
SeparationType : method for evaluating the misclassification
- GiniIndex
- CrossEntropy
- SdivSqrtSplusB
- MisClassificationError
nEvtMin : minimum number of events in a node (leaf criterion)
MaxNodePurity : upper purity bound for a leaf or intermediate node
nCuts : number of steps in the optimization of the cut for a node
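The four SeparationType criteria can be written directly as functions of the node purity p = S/(S+B), or of the raw signal and background counts. This is a generic sketch of the standard definitions, not the TMVA source:

```cpp
#include <algorithm>
#include <cmath>

// Node-separation indices used to pick the best cut: a split is chosen to
// maximize the decrease of the index between parent and daughter nodes.
double giniIndex(double p) { return p * (1.0 - p); }

double crossEntropy(double p) {
    if (p <= 0.0 || p >= 1.0) return 0.0;
    return -p * std::log(p) - (1.0 - p) * std::log(1.0 - p);
}

double sOverSqrtSPlusB(double s, double b) { return s / std::sqrt(s + b); }

double misClassificationError(double p) { return 1.0 - std::max(p, 1.0 - p); }
```

All four vanish (or are smallest) for a pure node and peak at p = 0.5, which is what makes them usable as splitting criteria.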