Colinearity among variables


Chance Correlation in QSAR Studies

Ahmadreza Mehdipour

Medicinal & Natural Product Chemistry Research Center

Correlation or causation?

• Correlation is essential but not sufficient.
• A correlation is meaningless unless its cause (or role) in the biological activity is interpreted.
• A satisfactory QSAR correlation does not mean that a particular descriptor causes the efficient action of a compound.

Chance Correlation

• Topliss ratio (J. Med. Chem. 1972, 15, 1066)
• A misconception:
  • Often read as: ratio of variables in the model to sample size
  • Intended: ratio of variables in the data pool to sample size
• Revalidation of the problem by Livingstone (J. Med. Chem. 2005, 48, 661)

• Topliss et al. demonstrated that the more independent variables (X) that are available for selection in a multiple linear regression model, the more likely a model will be found by chance. These authors recommended that in order to reduce the risk of chance correlations there should be a certain ratio of data points to the number of independent variables available.

Unfortunately, this ratio was often misinterpreted as the number of data points to the number of independent variables in the final model, a practice that did very little if anything to reduce chance effects.

D.W. Salt, S. Ajmani, R. Crichton, D.J. Livingstone, An improved approximation to the estimation of the critical F values in best subset regression. J. Chem. Inf. Model. 47 (2007) 143-149.

Chance Correlation: How does it occur?

• A trial example with random data
• Characteristics:
  • N (sample size) = 20
  • K (number of variables in data pool) = 10, 20, 50, 75, 100

[Figures: one panel per pool size, for N=20 with K=10, 20, 50, 75 and 100]
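The trial example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say which variable selection method or model size was used, so the greedy forward selection, the choice of P=3 and the names `r_squared`, `forward_select` and `mean_chance_r2` are all hypothetical. Both X and Y are pure random noise, so any R² found is a chance correlation.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X (intercept included)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

def forward_select(X, y, p):
    """Greedy forward selection of p columns from the pool X.
    Assumption: the source does not name its selection method."""
    chosen = []
    for _ in range(p):
        remaining = [j for j in range(X.shape[1]) if j not in chosen]
        best = max(remaining, key=lambda j: r_squared(X[:, chosen + [j]], y))
        chosen.append(best)
    return chosen, r_squared(X[:, chosen], y)

def mean_chance_r2(n, k, p, runs=50, seed=0):
    """Average R^2 of the selected p-variable model when X and y are random."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(runs):
        X = rng.standard_normal((n, k))   # random descriptor pool
        y = rng.standard_normal(n)        # random "activity"
        values.append(forward_select(X, y, p)[1])
    return float(np.mean(values))

if __name__ == "__main__":
    # Same N and pool sizes as the trial example; P=3 is an arbitrary choice.
    for k in (10, 20, 50, 75, 100):
        print(f"N=20 K={k:3d} mean chance R^2 = {mean_chance_r2(20, k, 3):.2f}")
```

With N and P held fixed, the average R² of the selected model should rise as the pool size K grows, which is the effect the N to K ratio is meant to guard against.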

Avoiding chance correlation

What should we do?

Solutions for detection of chance correlation
• F max critical
• Randomization of Y (input scrambling)
• Validation procedures

F max Critical

• Livingstone approach
• The normal tabulated F is a valid significance threshold ONLY WHEN K = P
  K = number of variables in data pool
  P = number of variables in model

F max Critical

• However, in most cases K >> P
  K = number of variables in data pool
  P = number of variables in model
  N = sample size

Introduction of F max Critical

• Simulated random data
• Run 1000 times
• Different N, K and P
• Obtain F max for each combination (for a significance level of 5%)
• Check for some known data sets
• www.cmd.port.ac.uk
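The simulation steps listed above might look roughly like the following sketch. It is not the published procedure: the exhaustive best-subset search, the standard normal random data and the names `f_statistic` and `f_max_critical` are assumptions made for illustration. The F statistic used is the usual regression F, (R²/P) / ((1 - R²)/(N - P - 1)).

```python
from itertools import combinations
import numpy as np

def f_statistic(X, y):
    """Regression F for an OLS fit of y on the columns of X (intercept included)."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return ((ss_tot - ss_res) / p) / (ss_res / (n - p - 1))

def f_max_critical(n, k, p, runs=1000, alpha=0.05, seed=0):
    """Simulated critical F when the best p of k random variables is chosen.
    Assumption: exhaustive best-subset search, feasible only for small k."""
    rng = np.random.default_rng(seed)
    f_max = np.empty(runs)
    for i in range(runs):
        X = rng.standard_normal((n, k))   # random descriptor pool
        y = rng.standard_normal(n)        # random response
        f_max[i] = max(f_statistic(X[:, list(c)], y)
                       for c in combinations(range(k), p))
    # The (1 - alpha) quantile of the best-subset F values under pure chance.
    return float(np.quantile(f_max, 1.0 - alpha))

if __name__ == "__main__":
    # K = P reproduces the ordinary F test; larger pools raise the threshold.
    for k in (2, 5, 10):
        print(f"N=20 K={k:2d} P=2 simulated F max critical = "
              f"{f_max_critical(20, k, 2, runs=200):.2f}")
```

A model selected from a pool of K variables would then be judged against the simulated threshold for its own N, K and P rather than against the tabulated F.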

Randomization of Y

• Y values are randomly reassigned to samples

Y-randomization

However

• This method should also be performed during the variable selection process.
• If R² max and Q² max for the scrambled Y are low, then the risk of chance correlation is low.
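A minimal sketch of Y-randomization with the variable selection step repeated for every scrambled Y, as the point above recommends. The single-variable selection rule, the permutation p-value and the names `best_single_r2` and `y_randomization` are assumptions chosen to keep the example short; they are not taken from the slides.

```python
import numpy as np

def best_single_r2(X, y):
    """Selection step: highest R^2 of any one-variable model in the pool.
    For one predictor, R^2 equals the squared Pearson correlation."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return float(np.max(r ** 2))

def y_randomization(X, y, n_perm=200, seed=0):
    """Compare the real R^2 with R^2 values obtained after scrambling y.
    Selection is rerun on every scrambled y, not only on the final model."""
    rng = np.random.default_rng(seed)
    r2_real = best_single_r2(X, y)
    r2_scrambled = np.array([best_single_r2(X, rng.permutation(y))
                             for _ in range(n_perm)])
    # Permutation p-value: share of scrambled runs at least as good as the real one.
    p_value = (1 + np.sum(r2_scrambled >= r2_real)) / (n_perm + 1)
    return r2_real, r2_scrambled, float(p_value)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((20, 50))               # N=20 samples, K=50 descriptors
    y_signal = 3.0 * X[:, 0] + rng.standard_normal(20)
    y_noise = rng.standard_normal(20)
    for label, y in (("real signal", y_signal), ("pure noise", y_noise)):
        r2, scrambled, p = y_randomization(X, y)
        print(f"{label}: R^2={r2:.2f}, max scrambled R^2={scrambled.max():.2f}, p={p:.3f}")
```

If the selection step were skipped and only the final descriptors were refit to scrambled Y, the scrambled R² values would look much lower and the test would be too optimistic.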

Cross-validation Process

• Different N, K, P
  N = 10, 20, 30, 40, 50, 80, 100
  P = 1-8
  K = P, 10, 20, 30, 50, 100
• Run 1000 times
• Evaluation factors
  R² of training set
  Q² 1 = Q² for LOO CV
  Q² 20% = Q² for leave-20%-of-samples-out CV
  Q² 50% = Q² for leave-50%-of-samples-out CV
  R² P = R² of one random test set (25% of samples)

[Figures: plots of values from 0 to 1 against p (1 to 9), with one curve per sample size n = 10, 20, 30, 40, 50, 80, 100]

Cross-validation Process

• Leave-one-out vs leave-group-out
• Q² L50%O is independent of N, K, P

Hemmateenejad B, Mehdipour AR, Bagheri L, Miri R. Judging the significance of the multiple linear regression-based QSAR models by cross-validation. To be submitted.
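The leave-one-out and leave-group-out statistics compared above can be computed along these lines. This is a sketch under assumptions: Q² is taken as 1 - PRESS / SS_tot using the overall mean of Y, and leave-group-out is done as a single random partition into equal groups (5 groups for 20%, 2 groups for 50%). The source does not state either choice, and the names `fit_predict` and `q_squared` are hypothetical.

```python
import numpy as np

def fit_predict(X_train, y_train, X_test):
    """Fit OLS with intercept on the training rows, predict the test rows."""
    A = np.column_stack([np.ones(len(y_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

def q_squared(X, y, leave_out_fraction=None, seed=0):
    """Cross-validated Q^2 = 1 - PRESS / SS_tot.
    leave_out_fraction=None gives leave-one-out; 0.2 or 0.5 gives
    leave-20%-out or leave-50%-out via one random partition (an assumption)."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    if leave_out_fraction is None:
        folds = [idx[i:i + 1] for i in range(n)]
    else:
        n_folds = max(2, round(1 / leave_out_fraction))
        folds = np.array_split(idx, n_folds)
    press = 0.0
    for test in folds:
        train = np.setdiff1d(idx, test)          # all rows not in this fold
        pred = fit_predict(X[train], y[train], X[test])
        press += ((y[test] - pred) ** 2).sum()   # accumulate prediction errors
    return 1.0 - press / ((y - y.mean()) ** 2).sum()

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.standard_normal((30, 2))
    y = X @ np.array([1.0, -2.0]) + 0.1 * rng.standard_normal(30)
    print("Q2 LOO  :", round(q_squared(X, y), 3))
    print("Q2 L20%O:", round(q_squared(X, y, 0.2), 3))
    print("Q2 L50%O:", round(q_squared(X, y, 0.5), 3))
```

To reproduce the kind of study summarized on the slides, the same function would be applied to models selected from random data over a grid of N, K and P.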

Concluding Remarks

• Be aware of the N to K ratio
• Not only the N to P ratio
• Check different approaches for chance correlation

Models are not real, but sometimes are helpful