Regression vs. Correlation

Both:
- Two variables
- Continuous data
Regression:
- Change in X causes change in Y
- Independent and dependent variables
- or: predict Y based on X
Correlation:
- No dependence (causation) assumed
- Estimate the degree to which 2 variables vary together

Correlation: more on bivariate statistics

- No dependence (causation) assumed
- Can call the variables X and Y, or X1 and X2
- Are the two variables independent, or do they covary?

Purpose of investigator and nature of variables (adapted from Sokal & Rohlf, pg 559)

Purpose: establish and estimate dependence of Y upon X; describe functional relationship or predict Y from X
- Y random, X fixed: Model I regression
- Both random: Model II regression (with few exceptions, e.g. prediction)
Purpose: establish and estimate association (interdependence) between X & Y
- Y random, X fixed: Meaningless
- Both random: Correlation coefficient; significance test only if the variables are normally distributed

Visualize Correlation

[Figure: four scatterplot panels, Y (X2) plotted against X1]
- Positive: increase in X associated with increase in Y
- Negative: increase in X associated with decrease in Y
- No correlation (vertical)
- No correlation (horizontal)

Pearson product-moment correlation coefficient

Lowercase x and y are deviations from the means: x = X - \bar{X}, y = Y - \bar{Y}.

\[ r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} = \frac{\text{summed products of deviations of } x \text{ and } y}{\sqrt{SS_X \cdot SS_Y}} = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sqrt{\sum (X-\bar{X})^2 \, \sum (Y-\bar{Y})^2}} \]

Equivalent calculations (1)

\[ r = \frac{\sum xy}{(n-1)\, s_X s_Y} \]

where s_X = SD of X and s_Y = SD of Y.

Equivalent calculations (2)

\[ r^2 = \frac{\text{regression SS}}{\text{total SS}} = \frac{\sum (\hat{Y}_i-\bar{Y})^2}{\sum (Y_i-\bar{Y})^2}, \qquad r = \sqrt{r^2} = \sqrt{\frac{\text{regression SS}}{\text{total SS}}} \]

r takes the sign of the regression slope.

Testing significance

H0: \rho = 0, where \rho is the true population parameter estimated by r.
Assumes that the data come from a bivariate normal distribution.

\[ t = \frac{r}{s_r}, \qquad s_r = \sqrt{\frac{1-r^2}{n-2}} \quad \text{(SE of } r\text{)} \]

Reject the null if |t_calc| > t_{\alpha(2),\, n-2}.

SAS program

    /* Read the monitoring data from a comma-delimited file */
    data start;
    infile 'C:\Documents and Settings\cmayer3\My Documents\teaching\Biostatistics\Lectures\monitoring data for corr.csv' dlm=',' DSD;
    input year day site $ depth temp DO spCond turb pH Kpar secchi alk Chla;
    options ls=180;
    proc print;

    /* Correlations on raw data */
    data one; set start;
    options ls=100;
    proc corr; var temp DO spCond turb pH Kpar secchi alk Chla;

    /* Create new variables by transformation */
    data two; set start;
    lnturb=log(turb);          /* natural log */
    lnsecchi=log(secchi);
    lgturb=log10(turb);        /* log base 10 */
    lgsecchi=log10(secchi);
    sqturb=sqrt(turb);         /* square root */
    sqsecchi=sqrt(secchi);
    proc print;

    /* Correlations on transformed data */
    data three; set two;
    proc corr; var lnturb lnsecchi;
    proc corr; var lgturb lgsecchi;
    proc corr; var sqturb sqsecchi;

    /* Plot raw and transformed */
    data four; set two;
    options ls=100;
    proc plot;
    plot turb*secchi;
    plot lnturb*lnsecchi;
    plot lgturb*lgsecchi;
    plot sqturb*sqsecchi;
    run;

SAS output (each variable has three rows: r, P value, N)

Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

              temp        DO    spCond      turb        pH      Kpar    secchi       alk      Chla
temp       1.00000  -0.21792   0.06538  -0.14523   0.35328  -0.23911   0.15689   0.11311   0.37612
                      0.0302    0.5202    0.1515    0.0003    0.1541    0.1209    0.3895    0.0001
                99        99        99        99        99        37        99        60        99
DO        -0.21792   1.00000   0.01542  -0.21550   0.50679  -0.24013  -0.06504   0.15790   0.38699
            0.0302              0.8796    0.0322    <.0001    0.1523    0.5224    0.2282    <.0001
                99        99        99        99        99        37        99        60        99
spCond     0.06538   0.01542   1.00000   0.48214  -0.29017   0.78394  -0.51332   0.74021   0.21367
            0.5202    0.8796              <.0001    0.0036    <.0001    <.0001    <.0001    0.0337
                99        99        99        99        99        37        99        60        99
turb      -0.14523  -0.21550   0.48214   1.00000  -0.33727   0.89941  -0.50336   0.47441   0.07208
            0.1515    0.0322    <.0001              0.0006    <.0001    <.0001    0.0001    0.4783
                99        99        99        99        99        37        99        60        99
pH         0.35328   0.50679  -0.29017  -0.33727   1.00000  -0.56355   0.14049  -0.14061   0.61033
            0.0003    <.0001    0.0036    0.0006              0.0003    0.1654    0.2839    <.0001
                99        99        99        99        99        37        99        60        99
Kpar      -0.23911  -0.24013   0.78394   0.89941  -0.56355   1.00000  -0.76680   0.85542   0.04579
            0.1541    0.1523    <.0001    <.0001    0.0003              <.0001    <.0001    0.7878
                37        37        37        37        37        37        37        29        37
secchi     0.15689  -0.06504  -0.51332  -0.50336   0.14049  -0.76680   1.00000  -0.49649  -0.30918
            0.1209    0.5224    <.0001    <.0001    0.1654    <.0001              <.0001    0.0018
                99        99        99        99        99        37        99        60        99
alk        0.11311   0.15790   0.74021   0.47441  -0.14061   0.85542  -0.49649   1.00000   0.12410
            0.3895    0.2282    <.0001    0.0001    0.2839    <.0001    <.0001              0.3448
                60        60        60        60        60        29        60        60        60
Chla       0.37612   0.38699   0.21367   0.07208   0.61033   0.04579  -0.30918   0.12410   1.00000
            0.0001    <.0001    0.0337    0.4783    <.0001    0.7878    0.0018    0.3448
                99        99        99        99        99        37        99        60        99

Nonparametric statistics

Sometimes called distribution-free statistics because they do not require that the data fit a normal distribution.
Many nonparametric procedures are based on ranked data. Data are ranked by ordering them from lowest to highest and assigning them, in order, the integer values from 1 to the sample size.

Some Commonly Used Statistical Tests

Normal theory based test                Corresponding nonparametric test               Purpose of test
t test for independent samples          Mann-Whitney U test; Wilcoxon rank-sum test    Compares two independent samples
Paired t test                           Wilcoxon matched pairs signed-rank test        Examines a set of differences
Pearson correlation coefficient         Spearman rank correlation coefficient          Assesses the linear association between two variables
One way analysis of variance (F test)   Kruskal-Wallis analysis of variance by ranks   Compares three or more groups
Two way analysis of variance            Friedman two way analysis of variance          Compares groups classified by two different factors

From: http://www.tufts.edu/~gdallal/npar.htm

Data transformations

- Data transformation can “correct” deviation from normality and uneven variance (heteroscedasticity)
- See chapter 13 in Zar
- Pretty much... whatever works, works
- Some common ones:
  - for % or proportion, use the arcsine of the square root
  - log10 for density (#/m²)
- The right transformation can allow you to use parametric statistics
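
The Spearman rank correlation named above as the nonparametric counterpart of Pearson's r is computed on ranks, so a monotonic increasing transformation such as log or square root leaves it unchanged. The following is a minimal sketch, not part of the original lecture program. It assumes the data set two created by the SAS code earlier (raw turb and secchi plus the transformed turbidity variables) is still available. SPEARMAN and WITH are standard PROC CORR features; the choice of variables here is only illustrative.

```sas
/* Sketch only: not part of the original lecture program.              */
/* Assumes data set two (raw + transformed variables) already exists.  */
proc corr data=two spearman;
  var turb lnturb lgturb sqturb;   /* raw and transformed turbidity (columns) */
  with secchi;                     /* each correlated with raw Secchi depth   */
run;
```

Because ranking ignores the spacing between values, the four coefficients should be identical, provided no observation becomes missing under a transformation (for example, the log of zero). Pearson's r, by contrast, generally changes with each transformation, which is why the lecture code reruns proc corr on each transformed pair.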