Simple Linear Regression: An Introduction
Dr. Tuan V. Nguyen
Garvan Institute of Medical Research, Sydney
Give a man three weapons – correlation, regression and a pen – and he will use all three (Anon, 1978)

An example
Age and cholesterol levels in 18 individuals:

ID:    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18
Age:  46   20   52   30   57   25   28   36   22   43   57   33   22   63   40   48   28   49
Chol: 3.5  1.9  4.0  2.6  4.5  3.0  2.9  3.8  2.1  3.8  4.1  3.0  2.5  4.6  3.2  4.2  2.3  4.0  (mg/ml)

Read data into R

id <- seq(1:18)
age <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
plot(chol ~ age, pch=16)

[Figure: scatter plot of chol against age]

Questions of interest
• Association between age and cholesterol levels
• Strength of association
• Prediction of cholesterol for a given age

Correlation and Regression analysis

Variance and covariance: algebra
• Let x and y be two random variables from a sample of n observations.
• Measure of variability of x and y: the variance
  var(x) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2,   var(y) = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2
• Measure of covariation between x and y?
• Algebraically: if x and y are independent, var(x + y) = var(x) + var(y); in general,
  var(x + y) = var(x) + var(y) + 2\,cov(x, y)
  where
  cov(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})

Variance and covariance: geometry
• The independence or dependence between x and y can be represented geometrically, with x and y as two sides of a triangle and h as the third side:
  – If x and y are perpendicular (independent): h^2 = x^2 + y^2
  – In general: h^2 = x^2 + y^2 - 2xy\cos(H), where H is the angle between x and y
[Figure: triangle with sides x, y, h and angle H]

Meaning of variance and covariance
• Variance is always positive.
• If covariance = 0, x and y are uncorrelated (no linear association).
• Covariance is a sum of cross-products: it can be positive or negative.
• Negative covariance = deviations in the two distributions are in opposite directions, e.g. genetic covariation.
• Positive covariance = deviations in the two distributions are in the same direction.
• Covariance is a measure of the strength of association.

Covariance and correlation
• Covariance is unit-dependent.
• The coefficient of correlation (r) between x and y is a standardized covariance.
• r is defined by:
  r = \frac{cov(x, y)}{\sqrt{var(x)\,var(y)}} = \frac{cov(x, y)}{SD_x \, SD_y}

Positive and negative correlation
[Figure: scatter plots illustrating a positive correlation (r = 0.9) and a negative correlation (r = -0.9)]

Test of hypothesis of correlation
• Hypothesis: H0: r = 0 versus H1: r ≠ 0.
• The standard error of r is:
  SE_r = \sqrt{\frac{1 - r^2}{n - 2}}
• The t-statistic:
  t = \frac{r}{SE_r} = r\sqrt{\frac{n - 2}{1 - r^2}}
• This statistic has a t distribution with n - 2 degrees of freedom.
• Fisher's z-transformation:
  z = \frac{1}{2}\ln\left(\frac{1 + r}{1 - r}\right)
• Standard error of z:
  SE_z = \frac{1}{\sqrt{n - 3}}
• A 95% CI for z can then be constructed as z ± 1.96 \, SE_z.

An illustration of correlation analysis
• Using the age (x) and cholesterol (y) data above: mean age = 38.83 (SD = 13.60), mean cholesterol = 3.33 (SD = 0.84), cov(x, y) = 10.68.
• r = cov(x, y) / (SD_x SD_y) = 10.68 / (13.60 × 0.84) = 0.94
• t = r\sqrt{(n-2)/(1-r^2)} = 0.94 × \sqrt{16 / (1 - 0.94^2)} ≈ 11.0 on 16 degrees of freedom; the critical t-value at alpha = 5% is 2.12.
• Fisher's z = ½ ln(1.94 / 0.06) ≈ 1.74, SE_z = 1/\sqrt{15} ≈ 0.26, so z / SE_z ≈ 6.7.
• Conclusion: there is a significant association between age and cholesterol.
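The same correlation analysis can be run directly in R; a minimal sketch, assuming the age and chol vectors read in earlier (cor.test() performs the t-test on n - 2 degrees of freedom and reports a confidence interval based on Fisher's z-transformation):

# Pearson correlation coefficient between age and cholesterol
cor(age, chol)

# Test of H0: correlation = 0 (t-test on n - 2 df, CI via Fisher's z)
cor.test(age, chol)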
Simple linear regression analysis
• Only two variables are of interest: one response variable and one predictor variable.
• No adjustment is needed for confounding or covariates.
• Assessment: quantify the relationship between the two variables.
• Prediction: make predictions and validate a test.
• Control: adjust for confounding effects (in the case of multiple variables).

Relationship between age and cholesterol
[Figure: scatter plot of cholesterol against age]

Linear regression: model
• Y: random variable representing a response (outcome).
• X: random variable representing a predictor (risk factor).
  – Both Y and X can be categorical (e.g. yes/no) or continuous (e.g. age).
  – If Y is categorical, the model is a logistic regression model; if Y is continuous, a simple linear regression model.
• Model: Y = α + βX + ε
  α: intercept
  β: slope / gradient
  ε: random error (variation between subjects in Y even if X is constant, e.g. variation in cholesterol among patients of the same age).

Linear regression: assumptions
• The relationship is linear in terms of the parameters;
• X is measured without error;
• The values of Y are independent of each other (e.g. Y1 is not correlated with Y2);
• The random error term ε is normally distributed with mean 0 and constant variance.

Expected value and variance
• If the assumptions are tenable, then:
• The expected value of Y is: E(Y | x) = α + βx
• The variance of Y is: var(Y) = var(ε) = σ²

Estimation of model parameters
Given two points A(x1, y1) and B(x2, y2) in a two-dimensional space, we can derive an equation connecting the points.
  Gradient: m = dy/dx = (y2 - y1) / (x2 - x1)
  Equation: y = mx + a, where a is the intercept (the value of y at x = 0)
What happens if we have more than two points?

Estimation of a and b
• For a series of pairs (x1, y1), (x2, y2), (x3, y3), …, (xn, yn), let a and b be sample estimates of the parameters α and β.
• We then have a sample equation: Y* = a + bx.
• Aim: find the values of a and b so that the differences (Y - Y*) are, in aggregate, as small as possible.
• Let SSE = Σ(Yi - a - bxi)².
• The values of a and b that minimise SSE are called the least squares estimates.

Criteria of estimation
The fitted value is ŷi = a + bxi and the residual is di = yi - ŷi.
[Figure: scatter of Chol against Age with the fitted line and the vertical deviations di]
The goal of the least squares estimator (LSE) is to find a and b such that the sum of di² is minimal.

Estimation of a and b
• After some calculus, the results can be shown to be:
  b = S_{xy} / S_{xx},   a = \bar{y} - b\bar{x}
  where S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2 and S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}).
• When the regression assumptions are valid, the estimators of α and β have the following properties:
  – Unbiased
  – Uniformly minimum variance (i.e. efficient)

Goodness-of-fit
• Now we have the equation Y = a + bX + e.
• Question: how well does the regression equation describe the actual data?
• Answer: the coefficient of determination (R²), the proportion of variation in Y that is explained by the variation in X.

Partitioning of variations: concept
• SST = sum of squared differences between yi and the mean of y.
• SSR = sum of squared differences between the predicted values of y and the mean of y.
• SSE = sum of squared differences between the observed and predicted values of y.
• SST = SSR + SSE, and the coefficient of determination is R² = SSR / SST.
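These formulas can be checked by hand in R; a minimal sketch, assuming the age (x) and chol (y) vectors from the earlier example, that computes the least-squares estimates and verifies the decomposition SST = SSR + SSE:

# Least-squares estimates from the formulas above
Sxx <- sum((age - mean(age))^2)
Sxy <- sum((age - mean(age)) * (chol - mean(chol)))
b <- Sxy / Sxx                     # slope
a <- mean(chol) - b * mean(age)    # intercept

# Partition the total variation and compute R-squared
chol.hat <- a + b * age            # fitted values
SST <- sum((chol - mean(chol))^2)
SSR <- sum((chol.hat - mean(chol))^2)
SSE <- sum((chol - chol.hat)^2)
all.equal(SST, SSR + SSE)          # TRUE: the decomposition holds
R2 <- SSR / SST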
Partitioning of variations: geometry
[Figure: scatter of Chol (Y) against Age (X) with the fitted line and the mean of Y, showing the components SST, SSR and SSE]

Partitioning of variations: algebra
• Total variation: SST = \sum_{i=1}^{n}(y_i - \bar{y})^2
• Attributed to the model: SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2
• Residual sum of squares: SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2
• SST = SSR + SSE
• SSR = SST - SSE

Analysis of variance
• SS increases in proportion to the sample size (n).
• Mean squares (MS) normalise for degrees of freedom (df):
  – MSR = SSR / p (where p is the number of predictors; p = 1 for simple linear regression)
  – MSE = SSE / (n - p - 1)
  – MST = SST / (n - 1)
• Analysis of variance (ANOVA) table:

  Source       d.f.        Sum of squares (SS)   Mean squares (MS)   F-test
  Regression   p           SSR                   MSR                 MSR/MSE
  Residual     n - p - 1   SSE                   MSE
  Total        n - 1       SST

Hypothesis tests in regression analysis
• Now we have
  Population:   Y = α + βX + ε
  Sample data:  Y = a + bX + e
• H0: β = 0. There is no linear association between the outcome and the predictor variable.
• In lay terms: if there really were no association, what is the chance of observing a sample of data at least as inconsistent with the null hypothesis as the one we observed?

Inference about the slope (parameter β)
• Recall that ε is assumed to be normally distributed with mean 0 and variance σ².
• The estimate of σ² is the MSE (s²).
• It can be shown that:
  – the expected value of b is β, i.e. E(b) = β;
  – the standard error of b is SE(b) = s / \sqrt{S_{xx}}.
• The test of whether β = 0 is then t = b / SE(b), which follows a t-distribution with n - 2 degrees of freedom.

Confidence interval around a predicted value
• The observed value is Yi; the predicted value is Ŷi = a + bxi.
• The standard error of the predicted value is:
  SE(\hat{Y}_i) = s\sqrt{1 + \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}}
• An interval estimate for Yi is:
  \hat{Y}_i \pm t_{n-p-1,\,1-\alpha/2}\,SE(\hat{Y}_i)

Checking assumptions
• Assumption of constant variance
• Assumption of normality
• Correctness of functional form
• Model stability
• All can be checked with graphical analysis. The residuals from the model, or functions of the residuals, play an important role in all of the model diagnostic procedures.

Checking assumptions
• Assumption of constant variance
  – Plot the studentized residuals versus the fitted values. Examine whether the variability of the residuals remains relatively constant across the range of fitted values.
• Assumption of normality
  – Plot the residuals versus their expected values under normality (normal probability plot). If the residuals are normally distributed, the points should fall along a 45° line.
• Correct functional form?
  – Plot the residuals versus the fitted values. Examine the plot for evidence of a non-linear trend in the residuals across the range of fitted values.
• Model stability
  – Check whether one or more observations are influential. Use Cook's distance.

Checking assumptions (cont.)
• Cook's distance (D) is a measure of the magnitude by which the fitted values of the regression model change if the ith observation is removed from the data set.
• Leverage is a measure of how extreme the value of xi is relative to the remaining values of x.
• The studentized residual provides a measure of how extreme the value of yi is relative to the remaining values of y.

Remedial measures
• Non-constant variance
  – Transforming the response variable (y) to a new scale (e.g. a logarithm) is often helpful.
  – If no transformation can resolve the non-constant variance problem, use a more robust estimator such as iteratively weighted least squares.
• Non-normality
  – Non-normality and non-constant variance often go hand-in-hand.
• Outliers
  – Check for accuracy.
  – Use a robust estimator.
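These checks and remedies can be carried out with standard R functions; a minimal sketch for the age–cholesterol data (the model is refitted here as fit so the snippet stands alone; rlm() from the MASS package is one robust alternative):

# Diagnostic quantities described above
fit <- lm(chol ~ age)
r.stud <- rstudent(fit)            # studentized residuals
plot(fitted(fit), r.stud)          # constant variance / functional form
qqnorm(r.stud); qqline(r.stud)     # normality (normal probability plot)
cooks.distance(fit)                # influence of each observation
hatvalues(fit)                     # leverage

# Possible remedies: transform the response, or use a robust estimator
fit.log <- lm(log(chol) ~ age)     # log transformation of y
# library(MASS); fit.rob <- rlm(chol ~ age)   # iteratively reweighted least squares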
Regression analysis using R

id <- seq(1:18)
age <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# Fit linear regression model
reg <- lm(chol ~ age)
summary(reg)

ANOVA result

> anova(reg)
Analysis of Variance Table

Response: chol
          Df  Sum Sq Mean Sq F value    Pr(>F)
age        1 10.4944 10.4944  114.57 1.058e-08 ***
Residuals 16  1.4656  0.0916
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Results of R analysis

> summary(reg)

Call:
lm(formula = chol ~ age)

Residuals:
     Min       1Q   Median       3Q      Max
-0.40729 -0.24133 -0.04522  0.17939  0.63040

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218   0.221466   4.918 0.000154 ***
age         0.057788   0.005399  10.704 1.06e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3027 on 16 degrees of freedom
Multiple R-Squared: 0.8775, Adjusted R-squared: 0.8698
F-statistic: 114.6 on 1 and 16 DF, p-value: 1.058e-08

Diagnostics: influential data

par(mfrow=c(2,2))
plot(reg)

[Figure: the four diagnostic panels from plot(reg) – Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage (with Cook's distance); observations 6, 8 and 17 are flagged]

A non-linear illustration: BMI and sexual attractiveness
– Study of 44 university students
– Body mass index (BMI) measured
– Sexual attractiveness (SA) score

id <- seq(1:44)
bmi <- c(11.00, 12.00, 12.50, 14.00, 14.00, 14.00, 14.00,
         14.00, 14.00, 14.80, 15.00, 15.00, 15.50, 16.00,
         16.50, 17.00, 17.00, 18.00, 18.00, 19.00, 19.00,
         20.00, 20.00, 20.00, 20.50, 22.00, 23.00, 23.00,
         24.00, 24.50, 25.00, 25.00, 26.00, 26.00, 26.50,
         28.00, 29.00, 31.00, 32.00, 33.00, 34.00, 35.50,
         36.00, 36.00)
sa <- c(...)   # 44 sexual attractiveness scores, ranging from 1.5 to 6.5

Linear regression analysis of BMI and SA

reg <- lm(sa ~ bmi)
summary(reg)

Residuals:
     Min       1Q   Median       3Q      Max
-2.54204 -0.97584  0.05082  1.16160  2.70856

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.92512    0.64489   7.637 1.81e-09 ***
bmi         -0.05967    0.02862  -2.084   0.0432 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.354 on 42 degrees of freedom
Multiple R-Squared: 0.09376, Adjusted R-squared: 0.07218
F-statistic: 4.345 on 1 and 42 DF, p-value: 0.04323

BMI and SA: analysis of residuals

plot(reg)

[Figure: the four diagnostic panels from plot(reg) for the BMI–SA model; observations 10, 20 and 21 are flagged]

BMI and SA: a simple plot

par(mfrow=c(1,1))
reg <- lm(sa ~ bmi)
plot(sa ~ bmi, pch=16)
abline(reg)

[Figure: scatter plot of sa against bmi with the fitted straight line]

Re-analysis of sexual attractiveness data

# Fit 3 regression models
linear <- lm(sa ~ bmi)
quad <- lm(sa ~ poly(bmi, 2))
cubic <- lm(sa ~ poly(bmi, 3))

# Make new BMI axis
bmi.new <- 10:40

# Get predicted values
quad.pred <- predict(quad, data.frame(bmi=bmi.new))
cubic.pred <- predict(cubic, data.frame(bmi=bmi.new))

# Plot predicted values
abline(reg)
lines(bmi.new, quad.pred, col="blue", lwd=3)
lines(bmi.new, cubic.pred, col="red", lwd=3)

[Figure: scatter plot of sa against bmi with the linear (straight line), quadratic (blue) and cubic (red) fitted curves]
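Whether the curvature is worth modelling can be judged formally by comparing the nested fits; a minimal sketch, assuming the linear, quad and cubic objects created above (each F-test compares a model with the one before it):

# Sequential F-tests for the nested polynomial models, plus AIC
anova(linear, quad, cubic)
AIC(linear, quad, cubic)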
Some comments: Interpretation of correlation
• Correlation lies between -1 and +1. A very small correlation does not mean that there is no association between the two variables; the relationship may be non-linear.
• For curvilinear relationships, a rank correlation (e.g. Spearman's) is better than Pearson's correlation.
• A small correlation (e.g. 0.1) may be statistically significant, but clinically unimportant.
• R² is another measure of the strength of association. An r of 0.7 may sound impressive, but the corresponding R² is only 0.49!
• Correlation does not mean causation.

Some comments: Interpretation of correlation
• Be careful with multiple correlations. For p variables there are p(p - 1)/2 possible pairwise correlations, and false positives are a problem.
• One correlation cannot be inferred from other correlations.
  – r(age, weight) = 0.05 and r(weight, fat) = 0.03 do not mean that r(age, fat) is near zero.
  – In fact, r(age, fat) = 0.79.

Some comments: Interpretation of regression
• The fitted (regression) line is only an estimate of the relation between these variables in the population.
• There is uncertainty associated with the estimated parameters.
• The regression line should not be used to make predictions for x values outside the range of values in the observed data.
• A statistical model is an approximation; the "true" relation may be non-linear, but a linear model is often a reasonable approximation.

Some comments: Reporting results
• Results should be reported in sufficient detail: the nature of the response and predictor variables, any transformation, checking of assumptions, etc.
• The regression coefficients (a, b), their associated standard errors, and R² are a useful summary.

Some final comments
• Equations are the cornerstone on which the edifice of science rests.
• Equations are like poems, or even an onion.
• So, be careful with your building of equations!