Transcript Regression
Regression
Dr. L. Jeyaseelan
Dept. of Biostatistics, Christian Medical College, Vellore, India

Linear Regression
"a linear regression coefficient indicates the impact of each independent variable on the outcome in the context of (or 'adjusting for') all other variables. ..." - J. Concato, A. R. Feinstein, T. R. Holford

Overview
• Research interest arises when we want to describe the relationship between two variables and thus predict the value of one variable from the value of the other for an individual.
• Describing the relation between the values of the two variables is called regression.

Origin of the Regression Concept
• Sir Francis Galton (1822-1911) used the term "regression"
• to explain the relationship between the heights (in inches) of fathers and their sons.
• Father-son pairs (n = 1,078)
• Son's height, Y = 33.73 + 0.516 × (Father's height, X)
  when X = 74, Y ≈ 72 (the son is not as tall as his father)
  when X = 65, Y ≈ 67 (the son is taller than his father)

Assumptions
• Outcome is normally distributed
• Independent observations
• Relationship between variables is linear

Linear Regression Equation
Y = a + bX
Suppose we want to test whether there is any relation between the birth weight (BW) of a baby and blood pressure (BP). The dependent variable is BP and the independent variable is BW, so the equation is BP = a + b (BW); i.e., given a value of birth weight (BW), the corresponding blood pressure (BP) can be predicted.
In mathematics Y is called a function of X, but in statistics the term regression is used to describe the relationship.
So the regression equation will be Y = 8.74 + 25.34 × X
What do these coefficients tell us? The slope b means that for each unit increase in X (birth weight), Y (blood pressure) increases by 25.34 units.

Straight line
The equation of the straight line is Y = β0 + β1 X, where β0 is the Y-intercept of the line and β1 is the slope.
The following diagram depicts the relationship between the blood pressure and the drug concentration.
The highest line shows the relationship Y = 20 + 15X, which represents the effect of drug A on an animal. The quantity of drug is measured in micrograms, the blood pressure in millimeters of mercury. If 4 µg of the drug have been given, then the blood pressure would be Y = 20 + 15(4) = 80 mm Hg. If the independent variable equals zero, the dependent variable does not also equal zero, but equals β0. In the diagram, this is a blood pressure of 20 mm Hg, which is the normal BP of the animal in the absence of the drug. Obviously, when no drug is administered, the BP should be at the same Y-intercept, since the identical animal is studied. In the above equation β0 is called the Y-intercept and β1 is called the slope or regression coefficient. In the lowest line, Y = 20 + 7.5X, the Y-intercept remains the same, but the slope has been halved. We visualize this as the effect of a different drug B on the animal. (Kleinbaum and Kupper, 1978)

Test for slope and intercept
The null hypothesis is β1 = 0.
Wald statistic: $t = \hat{\beta}_1 / SE(\hat{\beta}_1)$
The data showed a change of 5 units in cholesterol level for a one-year increase in age. Is this increase of 5 units confined to this dataset (a chance effect), or is it a real change due to the effect of age?

Interpretation
Coefficients (a)
Model 1       Unstandardized B   Std. Error   Standardized Beta   t       Sig.   95% CI for B: Lower   Upper
(Constant)    107.549            28.516                           3.772   .001   48.559                166.539
AGE           5.248              .699         .843                7.506   .000   3.802                 6.694
a. Dependent Variable: CHOL

For a one-year increase in age, there is a significant increase of about 5 units in cholesterol level.

Prediction
Age = 43 years, Cholesterol = ???
Cholesterol = 107.55 + (5.25 × Age)
For Age = 43: Cholesterol = 107.55 + (5.25 × 43) ≈ 333.3

Principle
Estimated value of Y at X = X_i:
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
where $\hat{\beta}_0$ and $\hat{\beta}_1$ are the intercept and slope regression parameters to be determined.
Error in predicting an actual observation Y = Y_i at X = X_i:
$Y_i - \hat{Y}_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)$
Total sum of squared errors (SSE):
$SSE = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$
[Figure: scatter plot of y against x]
Objective: fit the line so that SSE is minimised.

Simple (Linear) Regression
One independent variable
• Age and cholesterol
• Age and BP
• Age and Forced Vital Capacity

Multiple (Linear) Regression
More than one independent variable
• Age, gender, BMI and cholesterol
• Age, height, weight and FVC

Uses
• Measure of linear association
• Interpolation
• Prediction after controlling for confounders
• Identifying which combination of variables best predicts the response variable or outcome

Misuses
• Extrapolating without assurance that the trend remains the same
• Using a regression relationship whose slope has been shown to be not significantly different from zero
• Concluding that a cause-and-effect relationship exists, while the relationship may be merely statistical
• Applying the relationship established in one group of subjects to another group without assurance that it is applicable to all groups
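
The least-squares principle described earlier (choose the intercept and slope so that SSE is minimised) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the age and cholesterol values in `ages` and `chol` are hypothetical and are not the data behind the coefficients table, and the helper names `fit_line`, `sse` and `predict_chol` are made up for this sketch. The closed-form slope used here, Sxy / Sxx, is the standard least-squares solution for one independent variable. The last lines reuse the figures reported in the slides (intercept 107.55, slope 5.25, B = 5.248, Std. Error = .699) to answer the prediction question and to recompute the Wald t for AGE.

```python
def fit_line(xs, ys):
    """Return (b0, b1) minimising SSE = sum((y - b0 - b1*x)**2)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1


def sse(xs, ys, b0, b1):
    """Total sum of squared errors for the line y = b0 + b1*x."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, NOT taken from the slides.
ages = [30, 35, 40, 45, 50, 55]
chol = [262, 295, 318, 340, 372, 395]

b0, b1 = fit_line(ages, chol)
fitted_sse = sse(ages, chol, b0, b1)

# Moving either coefficient away from the fitted values can only raise SSE.
shifted_intercept_sse = sse(ages, chol, b0 + 1.0, b1)
shifted_slope_sse = sse(ages, chol, b0, b1 + 0.1)


def predict_chol(age, intercept=107.55, slope=5.25):
    """Prediction using the rounded coefficients reported in the slides."""
    return intercept + slope * age


predicted_43 = predict_chol(43)   # the "Age = 43 years" question

# Wald t for AGE from the reported table: B / Std. Error.
wald_t_age = 5.248 / 0.699

print(f"fitted line: y = {b0:.2f} + {b1:.2f} x, SSE = {fitted_sse:.2f}")
print(f"predicted cholesterol at age 43: {predicted_43:.1f}")
print(f"Wald t for AGE: {wald_t_age:.2f}")
```

The recomputed t differs from the tabled 7.506 only in the third decimal, which is consistent with the table's B and Std. Error having been rounded before display.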