Transcript Colinearity among variables
Chance Correlation in QSAR studies Ahmadreza Mehdipour Medicinal & Natural Product Chemistry Research Center
Correlation or causation?
Correlation is essential but not sufficient.
Correlation is meaningless unless its cause (or role) in the biological activity is interpreted.
A satisfactory QSAR correlation does not mean that a particular descriptor causes the efficient action of a compound.
Chance Correlation
• Topliss ratio (J. Med. Chem. 1972, 15, 1066)
• A misconception
  • Ratio of variables in model to sample size
  • Ratio of variables in data pool to sample size
• Revalidation of the problem by Livingstone (J. Med. Chem. 2005, 48, 661)
• Topliss et al. demonstrated that the more independent variables (X) that are available for selection in a multiple linear regression model, the more likely a model will be found by chance. These authors recommended that in order to reduce the risk of chance correlations there should be a certain ratio of data points to the number of independent variables available.
Unfortunately, this ratio was often misinterpreted as the number of data points to the number of independent variables in the final model, a practice that did very little if anything to reduce chance effects.
D.W. Salt, S. Ajmani, R. Crichton, D.J. Livingstone, An improved approximation to the estimation of the critical F values in best subset regression. J. Chem. Inf. Model. 47 (2007) 143-149.
Chance Correlation How does it occur?
• A trial example with random data
• Characteristics:
  • N (sample size) = 20
  • K (number of variables in data pool) = 10, 20, 50, 75, 100
[Figures: results of the random-data trial, one panel each for N=20 with K=10, 20, 50, 75 and 100]
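The trial above can be approximated with a short simulation. This is a minimal sketch, not the procedure used for the slides: it assumes normally distributed random descriptors and a random response, and it selects only the single best descriptor from the pool rather than a multi-variable model. The function names `best_single_r2` and `mean_best_r2` are hypothetical.

```python
import numpy as np

def best_single_r2(n, k, rng):
    """Largest R^2 between a random y and any one of k random descriptors."""
    X = rng.standard_normal((n, k))      # random data pool (assumption: normal)
    y = rng.standard_normal(n)           # random "activity", unrelated to X
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson r between y and every column of X
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return float(np.max(r ** 2))

def mean_best_r2(n, k, runs=200, seed=0):
    """Average of the best chance R^2 over many random data sets."""
    rng = np.random.default_rng(seed)
    return float(np.mean([best_single_r2(n, k, rng) for _ in range(runs)]))

results = {k: mean_best_r2(20, k) for k in (10, 20, 50, 75, 100)}
for k, r2 in results.items():
    print(f"N=20 K={k:3d}  mean best R^2 by chance = {r2:.2f}")
```

Even though every descriptor is pure noise, the best R² found should grow as the pool size K grows while N stays at 20, which is the effect the trial illustrates.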
Avoiding chance correlation
What should we do?
Solutions for detection of chance correlation
• F max critical
• Randomization of Y (input scrambling)
• Validation procedures
F max critical
Livingstone approach
The normal tabulated F is a valid test of significance ONLY WHEN K = P
K = number of variables in data pool
P = number of variables in model
F max critical
However, in most cases K >> P
K = number of variables in data pool
P = number of variables in model
N = sample size
Introduction of F max critical
• Simulated random data
• Run 1000 times
• Different N, K and P
• Obtain F max for each combination (for a significance level of 5%)
• Check against some known data sets
www.cmd.port.ac.uk
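The simulation idea above can be sketched as follows. This is an illustration under stated assumptions, not the published algorithm: it uses normally distributed random data, an exhaustive best-subset search (practical only for small K and P), and fewer runs than the 1000 mentioned above to keep it quick. The names `f_statistic` and `f_max_critical` are hypothetical.

```python
from itertools import combinations
import numpy as np

def f_statistic(X, y):
    """Overall regression F for an ordinary least-squares fit with intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return ((ss_tot - ss_res) / p) / (ss_res / (n - p - 1))

def f_max_critical(n, k, p, runs=300, alpha=0.05, seed=0):
    """Simulated critical value of the largest F over all p-subsets of k
    random descriptors (assumption: exhaustive best-subset selection)."""
    rng = np.random.default_rng(seed)
    f_max = np.empty(runs)
    for i in range(runs):
        X = rng.standard_normal((n, k))
        y = rng.standard_normal(n)
        f_max[i] = max(f_statistic(X[:, list(c)], y)
                       for c in combinations(range(k), p))
    return float(np.quantile(f_max, 1.0 - alpha))

crit_equal = f_max_critical(n=20, k=2, p=2)    # K = P: no selection involved
crit_pool = f_max_critical(n=20, k=10, p=2)    # K > P: selection from a pool
print(f"K=P  : critical F      = {crit_equal:.2f}")
print(f"K>>P : critical F max  = {crit_pool:.2f}")
```

When K = P the simulated threshold should sit near the ordinary tabulated F, while selecting P variables from a larger pool pushes the threshold up, so a model must exceed the larger value to be judged significant.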
Randomization of Y
Y values are randomly reassigned to the samples
Y-randomization
However, this method should also be performed during the variable selection process.
If R² max and Q² max are low, then the risk of chance correlation is low.
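A minimal sketch of Y-randomization with the variable selection repeated inside every scrambled run is given here. The data set is synthetic and purely hypothetical (two informative descriptors out of 20), forward selection stands in for whatever selection method is actually used, and the helper names `r2` and `forward_select_r2` are assumptions made for illustration.

```python
import numpy as np

def r2(X, y):
    """R^2 of an ordinary least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())

def forward_select_r2(X, y, p):
    """Greedy forward selection of p descriptors; returns the final R^2."""
    chosen = []
    for _ in range(p):
        rest = [j for j in range(X.shape[1]) if j not in chosen]
        best = max(rest, key=lambda j: r2(X[:, chosen + [j]], y))
        chosen.append(best)
    return r2(X[:, chosen], y)

rng = np.random.default_rng(1)
n, k, p = 30, 20, 2
X = rng.standard_normal((n, k))
# Hypothetical activity: depends on descriptors 0 and 3 plus noise
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.5 * rng.standard_normal(n)

r2_real = forward_select_r2(X, y, p)
# Scramble Y and redo the whole selection each time
r2_scrambled = np.array([forward_select_r2(X, rng.permutation(y), p)
                         for _ in range(200)])

print(f"R^2 with real Y            : {r2_real:.2f}")
print(f"max R^2 with scrambled Y   : {r2_scrambled.max():.2f}")
print(f"95th percentile (scrambled): {np.quantile(r2_scrambled, 0.95):.2f}")
```

Repeating the selection for each scrambled response matters: scrambling Y while keeping the already selected descriptors fixed would understate how well a pool of K descriptors can fit noise.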
Cross-validation Process
Different N, K, P
• N = 10, 20, 30, 40, 50, 80, 100
• P = 1-8
• K = P, 10, 20, 30, 50, 100
Run 1000 times
Evaluation factors
• R² of training set
• Q²_1 = Q² for LOO CV
• Q²_20% = Q² for leave-20%-of-samples-out CV
• Q²_50% = Q² for leave-50%-of-samples-out CV
• R²_P = R² of one random test set (25% of samples)
[Figure: five panels, each plotting a value from 0 to 1 against p (1 to 9), with one curve per sample size n = 10, 20, 30, 40, 50, 80, 100]
Cross-validation Process
Leave-one-out vs leave-group-out
Q²_L50%O is independent of N, K, P
Hemmateenejad B, Mehdipour AR, Bagheri L, Miri R. Judging the significance of the multiple linear regression-based QSAR models by cross-validation. To be submitted.
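The cross-validated statistics compared above can be computed as in the sketch below. It assumes Q² = 1 - PRESS / SS_total with SS_total taken about the mean of all Y values (other conventions exist), treats leave-20%-out and leave-50%-out as single 5-fold and 2-fold splits, and uses a hypothetical synthetic data set. The function names are illustrative only.

```python
import numpy as np

def fit(X, y):
    """Least-squares coefficients (intercept first)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(X, coef):
    return np.column_stack([np.ones(X.shape[0]), X]) @ coef

def r2_train(X, y):
    resid = y - predict(X, fit(X, y))
    return 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())

def q2_kfold(X, y, n_folds, rng):
    """Q^2 from one random split into n_folds groups (n_folds = N gives LOO)."""
    idx = rng.permutation(len(y))
    press = 0.0
    for test in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, test)
        pred = predict(X[test], fit(X[train], y[train]))
        press += float(((y[test] - pred) ** 2).sum())
    return 1.0 - press / float(((y - y.mean()) ** 2).sum())

rng = np.random.default_rng(0)
n, p = 40, 3
X = rng.standard_normal((n, p))
# Hypothetical activity with a real dependence on all three descriptors
y = X @ np.array([1.5, -1.0, 0.5]) + 0.3 * rng.standard_normal(n)

r2 = r2_train(X, y)
q2_loo = q2_kfold(X, y, n, rng)    # leave-one-out
q2_l20 = q2_kfold(X, y, 5, rng)    # leave-20%-out
q2_l50 = q2_kfold(X, y, 2, rng)    # leave-50%-out
print(f"R2={r2:.3f}  Q2_LOO={q2_loo:.3f}  Q2_L20%O={q2_l20:.3f}  Q2_L50%O={q2_l50:.3f}")
```

This sketch cross-validates a fixed set of descriptors. To study chance correlation as described above, the same statistics would be collected over many random data sets with the variable selection included in each run.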
Concluding Remarks
• Be aware of the N to K ratio, not only the N to P ratio
• Check different approaches for chance correlation
Models are not real
but