Regression
Dr.L.Jeyaseelan
Dept. of Biostatistics
Christian Medical College
Vellore, India
Linear Regression
a linear regression coefficient indicates the
impact of each independent variable on the
outcome in the context of (or “adjusting for”) all
other variables.
...
- J. Concato, A. R. Feinstein, T. R. Holford
Overview
• Research interest often lies in describing the
relationship between two variables and using it to
predict the value of one variable from the
value of the other variable for an
individual.
• Describing the relationship between the values
of two variables is called regression.
Origin of Regression Concept
• Sir Francis Galton (1822-1911) used the term
Regression.
• He used it to explain the relationship between the heights (inches)
of fathers and their sons.
• Father – Son pairs (n=1,078)
• Son’s height, Y = 33.73 + 0.516 (Father’s height, X)
when X = 74 => Y = 72 (the son is not as tall as his father)
when X = 65 => Y = 67 (the son is taller than his father)
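The two worked values above can be checked with a short sketch. The function name `predict_son_height` is an illustrative assumption; only the coefficients 33.73 and 0.516 come from the slide.

```python
def predict_son_height(father_height_in: float) -> float:
    """Predicted son's height (inches) from the fitted line on the slide."""
    intercept = 33.73  # from the slide
    slope = 0.516      # from the slide
    return intercept + slope * father_height_in


for father in (74, 65):
    son = predict_son_height(father)
    direction = "shorter" if son < father else "taller"
    print(f"father {father} in -> son {son:.1f} in ({direction} than father)")
```

Because the slope is less than 1, predicted sons of tall fathers are shorter than their fathers and predicted sons of short fathers are taller, which is the pattern the two examples show.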
Assumptions
• Outcome is normally distributed
• Independent observations
• Relationship between variables is linear
Linear regression Equation
Y = a + bX
Suppose we want to test whether there is any relation between the birth weight (BW)
of a baby and blood pressure (BP).
The dependent variable is BP and the independent variable is BW,
so the equation will be
BP = a + b (BW)
i.e. given a value of birth weight (BW), the corresponding blood pressure (BP) can be
predicted.
In mathematics Y is called a function of X, but in statistics the term regression is
used to describe the relationship.
So the regression equation will be
Y = 8.74 + 25.34 * X
What do these coefficients tell us?
The intercept a is the predicted value of Y when X = 0.
The slope b means that for each one-unit increase in X (i.e. birth
weight), Y (blood pressure) increases by 25.34 units.
Straight line:
The equation of the straight line is
Y = β0 + β1 X
where β0 is the Y intercept of the line and
β1 is the slope.
The following diagram depicts the relationship between
blood pressure and drug concentration.
The highest line shows the relationship
Y = 20 + 15X, which represents the effect of drug A on
an animal. The quantity of drug is measured in
micrograms, the blood pressure in millimeters of
mercury. If 4 μg of the drug have been given, then the
blood pressure would be Y = 20 + 15(4) = 80 mm Hg.
If the independent variable equals zero, the
dependent variable does not also equal zero, but
equals β0. In the diagram, this is a blood
pressure of 20 mm Hg, which is the normal BP of the animal
in the absence of drug. When no drug is
administered, the BP is at the same Y intercept for every line, since the same animal is studied.
In the above equation β0 is called the Y-intercept and β1 is
called the slope or regression coefficient.
In the lowest line, Y=20+7.5X, the Y intercept
remains the same, but the slope has been halved.
We visualize this as the effect of a different drug B on
the animal.
(Kleinbaum and Kupper, 1978)
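The two drug lines can be compared numerically. This is a minimal sketch: the helper `blood_pressure` and the dictionary names are illustrative assumptions, while the intercepts and slopes are the ones quoted in the text.

```python
def blood_pressure(dose_ug: float, intercept: float, slope: float) -> float:
    """Blood pressure (mm Hg) on the line Y = intercept + slope * X."""
    return intercept + slope * dose_ug


DRUG_A = {"intercept": 20.0, "slope": 15.0}  # Y = 20 + 15X
DRUG_B = {"intercept": 20.0, "slope": 7.5}   # Y = 20 + 7.5X

bp_a_at_4 = blood_pressure(4, **DRUG_A)  # drug A at 4 micrograms
bp_b_at_4 = blood_pressure(4, **DRUG_B)  # drug B at the same dose
bp_a_at_0 = blood_pressure(0, **DRUG_A)  # no drug: the intercept
bp_b_at_0 = blood_pressure(0, **DRUG_B)

# The change in Y for one extra microgram is the slope, whatever the dose.
step_a = blood_pressure(5, **DRUG_A) - blood_pressure(4, **DRUG_A)
step_b = blood_pressure(5, **DRUG_B) - blood_pressure(4, **DRUG_B)

print(bp_a_at_4, bp_b_at_4, bp_a_at_0, bp_b_at_0, step_a, step_b)
```

Both lines start from the same intercept of 20 mm Hg; only the rise per microgram differs.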
Test for slope and intercepts
The null hypothesis is β1 = 0.
Wald statistic: $t = \hat{\beta}_1 / SE(\hat{\beta}_1)$
The data showed a 5-unit change in cholesterol level
for a one-year increase in age.
Is this increase of 5 units just confined to this dataset
(a chance effect), or is it a real change due to the effect
of age?
Interpretation
Coefficients (a)
Model 1 | B (Unstandardized) | Std. Error | Beta (Standardized) | t | Sig. | 95% CI for B, Lower Bound | 95% CI for B, Upper Bound
(Constant) | 107.549 | 28.516 | | 3.772 | .001 | 48.559 | 166.539
AGE | 5.248 | .699 | .843 | 7.506 | .000 | 3.802 | 6.694
a. Dependent Variable: CHOL
For a one-year increase in age, there is a
significant increase of about 5 units in cholesterol level
(B = 5.248, 95% CI 3.802 to 6.694, Sig. < .001).
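As a check on how the table's columns relate to each other, the sketch below recomputes the Wald t for AGE as B divided by its standard error, and backs out the multiplier used for the 95% confidence interval. The variable names are illustrative; the numbers are copied from the AGE row. The sample size is not stated in the slides, so the multiplier is derived from the reported interval rather than from a t table.

```python
# Values copied from the AGE row of the coefficients table.
b, se = 5.248, 0.699
ci_lower, ci_upper = 3.802, 6.694

# Wald statistic: estimated slope divided by its standard error.
t_stat = b / se

# The interval is symmetric around B, so its half-width divided by the
# standard error gives the critical value that was used to build it.
midpoint = (ci_lower + ci_upper) / 2
half_width = (ci_upper - ci_lower) / 2
implied_multiplier = half_width / se

print(f"t = {t_stat:.3f}")
print(f"CI midpoint = {midpoint:.3f}")
print(f"implied multiplier = {implied_multiplier:.3f}")
```

The recomputed t agrees with the tabulated 7.506 up to rounding of B and the standard error. The implied multiplier is a little above the large-sample value of 1.96, as expected when the interval is based on a t distribution with a modest sample size.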
Prediction
Age = 43 years
Cholesterol = ???
Cholesterol = 107.55 + (5.25*Age)
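The prediction asked for above is a direct substitution into the fitted equation. In this small sketch the function name `predict_cholesterol` is an assumption; the rounded coefficients are the ones on the slide, and the cholesterol units are whatever the original dataset used (not stated in the slides).

```python
def predict_cholesterol(age_years: float) -> float:
    """Predicted cholesterol from the fitted line Cholesterol = 107.55 + 5.25 * Age."""
    return 107.55 + 5.25 * age_years


predicted = predict_cholesterol(43)
print(f"Predicted cholesterol at age 43: {predicted:.1f}")
```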
Principle…
Estimated value of Y at X = X_i:
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
where $\hat{\beta}_0$ and $\hat{\beta}_1$ are the intercept and slope regression parameters
to be determined.
Error in predicting an actual observation Y = Y_i at X = X_i is
$Y_i - \hat{Y}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i$
Total sum of squared errors (SSE):
$SSE = \sum_{i=1}^{n} \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i \right)^2$
[Figure: scatter plot of y against x]
Objective: Fit so that SSE is minimised.
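For a single predictor, minimising SSE has a well-known closed-form solution: the slope is the sum of cross-products of deviations divided by the sum of squared deviations of X, and the intercept is the mean of Y minus the slope times the mean of X. The sketch below applies this to a small hypothetical age and cholesterol dataset (the numbers are invented for illustration and are not the data behind the slides) and checks that nudging either coefficient makes the SSE larger.

```python
from statistics import mean


def fit_simple_regression(x, y):
    """Return (intercept, slope) that minimise the sum of squared errors."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sse(x, y, intercept, slope):
    """Total sum of squared errors for the line Y = intercept + slope * X."""
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, for illustration only.
ages = [25, 30, 35, 40, 45, 50, 55, 60]
chol = [240, 262, 295, 310, 350, 368, 401, 420]

b0, b1 = fit_simple_regression(ages, chol)
best = sse(ages, chol, b0, b1)

# Any other line gives a larger SSE than the fitted one.
worse_intercept = sse(ages, chol, b0 + 1, b1)
worse_slope = sse(ages, chol, b0, b1 + 0.1)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}, SSE = {best:.1f}")
print(worse_intercept > best, worse_slope > best)
```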
Simple (Linear) Regression
One independent variable
Age and cholesterol
Age and BP
Age and Forced Vital Capacity
Multiple (Linear) Regression
More than one independent variable
Age, gender, BMI and cholesterol
Age, height, weight and FVC
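With more than one independent variable, the same least-squares idea leads to the normal equations (X'X)b = X'y, where X has a leading column of ones for the intercept. The sketch below solves them in plain Python so that it stays self-contained; in practice a statistics package would normally be used. The function names and the data are assumptions: the outcome is generated without noise from y = 50 + 4*age + 2*bmi so that the recovered coefficients can be checked.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_multiple_regression(predictors, y):
    """Least-squares coefficients [b0, b1, ..., bk] via the normal equations."""
    rows = [[1.0] + list(p) for p in predictors]  # leading 1 for the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical (age, BMI) pairs; outcome generated from a known equation.
data = [(30, 22), (40, 27), (50, 24), (60, 30), (35, 31)]
outcome = [50 + 4 * age + 2 * bmi for age, bmi in data]

coefs = fit_multiple_regression(data, outcome)
print([round(c, 6) for c in coefs])
```

Each slope is interpreted as the change in the outcome for a one-unit increase in that variable with the other variables held fixed.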
Uses:
• Measure of linear association
• Interpolation
• Prediction after controlling confounders
• To identify which combination of variables best predicts
response variables or outcome.
Misuses
• Extrapolation without assurance that the trend remains
the same.
• Using a regression relationship whose slope has
been shown to be not significantly different from zero
• Concluding that a cause and effect relationship exists,
while the relationship may just be statistical
• Applying the relationship established in one group of
subjects to another group without assurance that it is
applicable to all groups.
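One simple safeguard against the first misuse is to refuse predictions outside the range of X that was actually observed. The sketch below is only an illustration: the function name is an assumption, and the observed range of 59 to 75 inches for fathers' heights is hypothetical, since the slides do not report the range of the data.

```python
def predict_within_range(x_new, intercept, slope, x_min, x_max):
    """Predict Y only when x_new lies inside the observed range of X."""
    if not (x_min <= x_new <= x_max):
        raise ValueError(
            f"x = {x_new} is outside the observed range [{x_min}, {x_max}]; "
            "prediction would be an extrapolation"
        )
    return intercept + slope * x_new


# Father-son line from the slides; the range 59 to 75 inches is hypothetical.
print(predict_within_range(70, 33.73, 0.516, x_min=59, x_max=75))

try:
    predict_within_range(100, 33.73, 0.516, x_min=59, x_max=75)
except ValueError as err:
    print("refused:", err)
```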