Transcript Slide 1

Quantitative Methods – Week 5:
Linear Regression Analysis
Roman Studer
Nuffield College
[email protected]
Homework (II)
• Many factors (variables) are potentially associated with the drop in crime rates in
the US. Where do you find correlations between a variable and the falling crime
rate? Which variables are positively, which ones negatively correlated with the
crime rate variable? Where do you not find any correlation? Explanations?
Concept/Variable
Measured Variable
Strong economy
Unemployment rate, poverty level
Reliance on prisons
Number of prisoners
Education
Educational Attainment
Legalizing abortion
Number of abortions
 How did you solve the lag problem with the abortion variable?
Correlation
Simple Linear Regression: Introduction
 As with correlation, we are still looking at the relationship between two
(ratio level) variables, but now we do not treat them symmetrically
anymore, but make a distinction between the two:
influences
y
x
Dependent variable
Independent variable
Explained variable
Explanatory variable
 As a consequence, the question that can be anwered with correlation
analysis and regression analysis are different:
Correlation
Regression
Is there an association between X and Y?
How exactly does Y change if X changes?
How strong? Positive or negative?
How much of the change in Y is explained?
Simple Linear Regression: Introduction (II)
 Therefore, we move from issues of mere association to issues of causality…
 HOWEVER: A simple linear regression cannot establish causation!
•
•
We assume that causality runs from x to y
Sometimes this assumption is questionable, then more sophisticated models
and tests are needed
 Simultaneous equation models
 Two-stage least square regressions
 Causality tests
 So we have to be careful when making such assumptions. Let’s look at some
pairs of variables.
•
•
Which is the explanatory variable and which the dependent variable?
In which cases might you expect a mutual interaction between the two
variables?
•
•
•
•
•
•
Height and weight
Birth rate and marriage rate
Rainfall and crop yield
Rate of unemployment and level of relief expenditure in British parishes
Government spending on welfare programmes and the voting share of leftliberal parties
CO2 emissions and global warming
Simple Linear Regression: Introduction (III)
 Straight line (“linear”) relationship between two variables
 The relationship between two variables can take a nonlinear
form
Income Inequality
• Kuznets (1955) hypothesised that the relationship between the
level of economic development (x) and income inequality (y)
takes the form of a nonlinear (inverted-U shaped) form:
Economic Development

Multiple or multivariate regression includes two or more
explanatory variables
The Equation of a Straight Line
 The equation of a straight line: Y = a + bX
Example: y=2+3x
The Equation of a Straight Line (II)
 Equation of a straight line: Y=a+bX
• Y and X are the dependent and the independent variable
respectively and y1, y2, …, yn and x1, x2, …, xn their
values
• a is the intercept. It determines the level at which the
straight line crosses the vertical axis, i.e. it gives the
value of y when x=0
• b measures the slope of the line
→ Positive relationship: b>0
→ No relationship: b=0
→ Negative Relationship b<0
• If x increases by one unit, y will increase by b units
Fitting the Regression Line
40
30
20
10
nsdap votes (nazi party)
50
60
 How do we find the line, which is the “best fit”, i.e. the line that
describes the linear relationship between X and Y the best
 This is an issue as in the real world, relationships between two
variables never follow a completely linear pattern…
0
20
40
unemployment rate (in %)
60
80
Fitting the Regression Line (II)
 The regression line predicts the values of Y based on the values
of X. Thus, the best line will minimise the deviation between
the predicted and the actual values (the error, e)
Regression line
IP=a+bWage
=YUK - ŶUK
Fitting the Regression Line (III)
 However, to avoid the problem that positive and negative deviations
cancel out, we look at squared deviations (Yi – Ŷi)2
 Also, as we want to minimise the total errors, we are interested in
minimising the sum of all squared errors
Therefore, the regression line it the line that….
“minimizes the sum of the squares of the vertical deviation of all pairs
of the values of X and Y from the regression line”
 This estimation procedure is known as ordinary least squares regression
(OLS). Its formal derivation yields the two formulae needed to calculate
the regression line, i.e. a formula for the intercept (a) and a formula for
the slope (b):
Fitting the Regression Line (IV)

b
N
i 1
( xi  x )( yi  y )
2
(
x

x
)
i1 i
N
a  y  bx
With these two formulae, we can actually easily calculate
the regression line for small datasets by hand. However,
for large datasets and when we have more than one
explanatory variable, we use the Stata to do it for us
The Goodness of Fit
 Once we get the “best fit” regression line, we still want to know
“how good a fit” this line really is:
• How much of the variation in Y is explained by this
regression line?
 The measure to describe the explanatory power of a regression is the
coefficient of determination r2, which is equal to the square of the
correlation coefficient
 It is a measure of the success with which the movements in Y are
explained by the movements in X
• In particular it measures how much of the derivation of Y
from the mean of Y is explained by the regression
The Goodness of Fit (II)
 This is the interactive part of the class….
 Please explain the concept depicted in the following graph in
your own words…
yi
Regression line
???
???
y
xi
x
The Goodness of Fit (III)
Total variation = explained variation + unexplained variation
TSS
ESS
TSS  ( yi  y)2 ESS  ( yˆi  y )2
USS
ESS  ( yi  yˆi )2
R²=ESS/TSS
Explained Sum of Squares/Total Sum of Squares
The Goodness of Fit (IV)
R2 gives the proportion of the sample variation in y that is
explained by x
R2 =0.35 means that the explanatory variable explains 35% of the
variation in the dependent variable
R2 ranges between 0 and 1
The higher R2 the better the fit of the regression line to the data
R2 can be used to compare the explanatory power of different
regression models (with the same dependent variable)
R2 has to be interpreted in the light of what we would expect
 A low R2 does not necessarily mean that an OLS regression
equation is useless
Computer Class:
•
Regression Analysis
Exercises
1.Weimar elections: Unemployment and votes for the Nazi
A) Descriptive Statistics
Get the dataset about the Weimar election of 1932 at
http://www.nuff.ox.ac.uk/users/studer/teaching.htm
• Look at the variables (votes for the Nazi party, level of unemployment) in turn
• Get a first visualisation of the data; does it look normally distributed?
• Compute the mean, median, standard deviation, coefficient of variation, kurtosis
and skewness for the variable
• Make a scatter plot; Do you think the two variables are associated? How and how
strongly?
B) Regression Analysis
• Which one is the dependent variable?
• Estimate the regression line
• What is the interpretation?
• What is the explanatory power of the regression?
• Draw a scatter plot and add the regression line
Exercises (II)
2. Weimar elections: Nazi votes and the share of Catholics
 Do the same exercises as with the unemployment rate
A) Descriptive Statistics
B) Regression Analysis
Appendix: STATA Commands

regress depvar indepvars

predict yhat, xb

predict res, resid

generate newvar=exp
Linear regression; the dependent and
independent variables are indicated by the
order. The dependent variable depvar comes
first, then the independent variable(s) indepvar
follow
Calculates the linear prediction for each
observation, i.e. yhati= a+b*indepvari
The option “resid” behind the comma let
STATA calculate and save the residuals, i.e.
resi=yi- yhati
Creates a new variable. STATA provides
numerous functions, see STATA’s Help on
“Functions and expressions”/“math functions
Homework
Readings:
• Feinstein & Thomas, Ch. 5
Problem Set 4:
 Finish the exercises from today’s computer class if you haven’t done so already.
Include all the results and answers in the file you send me
 On the next slide you see part of the macroeconomic dataset we used in week 3.
Look at the variables “GDP per head” and “education”, assuming that GDP per head
is the dependent variable and education the explanatory variable. Calculate by hand:
• The regression coefficient, b
• The intercept, a
• The total sum of squares
• The explained sum of squares
• The residual sum of squares
• The coefficient of determination
(hint: graph the values first by hand and then look at tables 4.1 and 4.2)
 Download the complete data set at
http://www.nuff.ox.ac.uk/users/studer/teaching.htm (“macro data”) and
calculate the regression line and the coefficient of determination
 Interpret the results
 Do the results differ from the correlation results?
Homework (II)
Country
GDP per Head
Agriculture
Education
Norway
54360
1.60
81
Switzerland
49660
1.40
49
United States
39430
1.20
83
Brazil
3340
10.10
21
Iran
2340
13.70
21