Transcript Slide 1

Statistics for Business and
Economics
Module 2: Regression and time series analysis
Spring 2010
Lecture 2: Examining the relationship between two
quantitative variables
Priyantha Wijayatunga, Department of Statistics, Umeå University
These materials are adapted from copyrighted lecture slides (© 2009 W.H.
Freeman and Company) from the homepage of the book:
The Practice of Business Statistics: Using Data for Decisions, Second Edition,
by Moore, McCabe, Duckworth and Alwan.
Examining the relationship between two
quantitative variables
Reference to the book: Chapter 2.1 and 2.2

Explanatory and response variables

Scatterplots and interpreting, outliers

Categorical variables in scatterplots

Quantifying linear relationships with correlation
coefficient “r”

Properties of correlation coefficient
Examining Relationships
Most statistical studies involve more than one variable.
Questions:
• What individuals do the data describe?
• What variables are present and how are they measured?
• Are all of the variables quantitative?
• Do some of the variables explain or even cause changes
in other variables?
Relationships between two variables
Most models are linear
1. Probabilistic models:
E.g. real estate prices in Umeå may be related to population
per km² in the local area, plus some random variation
2. Deterministic models:
E.g. in electric current theory, V = IR
(apart from measurement errors, the voltage across a given wire is
proportional to the current flowing through it)
Most models in economics and finance are probabilistic,
and often linear too!
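The difference between the two kinds of model can be sketched in a few lines of Python. This is only an illustration: the function names and the intercept, slope and noise level for the price model are made-up values, not estimates for Umeå.

```python
import random

def voltage(current_amps, resistance_ohms):
    """Deterministic model: Ohm's law, V = I * R, with no random term."""
    return current_amps * resistance_ohms

def simulated_price(density, intercept, slope, noise_sd, rng):
    """Probabilistic linear model: price = intercept + slope * density + noise."""
    return intercept + slope * density + rng.gauss(0.0, noise_sd)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Deterministic: the same input always gives the same output.
v1 = voltage(2.0, 5.0)
v2 = voltage(2.0, 5.0)

# Probabilistic: the same density gives a different price on each draw.
# Intercept, slope and noise level are hypothetical illustration values.
p1 = simulated_price(1000, intercept=10_000, slope=5.0, noise_sd=500.0, rng=rng)
p2 = simulated_price(1000, intercept=10_000, slope=5.0, noise_sd=500.0, rng=rng)

print(v1, v2)
print(round(p1), round(p2))
```

The two voltages agree exactly, while the two simulated prices scatter around the same straight-line value.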
Looking at relationships

Start with a graph

Look for an overall pattern and deviations from the
pattern

Use numerical descriptions of the data and overall
pattern (if appropriate)
Explanatory and response variables

A response variable measures or records an outcome
of a study. Also called dependent variable.

An explanatory variable explains changes in the
response variable (also called independent variable).
response variable: real estate price
explanatory variable: population per km²
Scatterplot
• A scatterplot shows the relationship between two
quantitative variables measured on the same individuals.
• Typically, the explanatory or independent variable is
plotted on the x axis, and the response or dependent
variable is plotted on the y axis.
• Each individual in the data appears as a point in the plot.
Here, we have two quantitative
variables for each of 16
students:
1) How many beers they
drank, and
2) Their blood alcohol
level (BAC)
We are interested in the
relationship between the two
variables: How is one affected
by changes in the other one?

Student   Beers   Blood Alcohol
1         5       0.1
2         2       0.03
3         9       0.19
6         7       0.095
7         3       0.07
9         3       0.02
11        4       0.07
13        5       0.085
4         8       0.12
5         3       0.04
8         5       0.06
10        5       0.05
12        6       0.1
14        7       0.09
15        1       0.01
16        4       0.05
Scatterplot example
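As a rough sketch of how such a plot is built, the 16 (beers, BAC) pairs can be placed on a coarse character grid, with beers on the x axis and BAC on the y axis. The function name text_scatter and the grid size are choices made for this illustration; a plotting library would normally be used instead.

```python
# (beers, BAC) pairs for the 16 students, in the order of the table
data = [(5, 0.10), (2, 0.03), (9, 0.19), (7, 0.095), (3, 0.07), (3, 0.02),
        (4, 0.07), (5, 0.085), (8, 0.12), (3, 0.04), (5, 0.06), (5, 0.05),
        (6, 0.10), (7, 0.09), (1, 0.01), (4, 0.05)]

def text_scatter(points, rows=10, cols=9):
    """Return text lines with one '*' per occupied grid cell."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * cols for _ in range(rows)]
    for x, y in points:
        col = round((x - x_min) / (x_max - x_min) * (cols - 1))
        row = round((y - y_min) / (y_max - y_min) * (rows - 1))
        grid[rows - 1 - row][col] = "*"  # row 0 is the top of the plot
    return ["".join(line) for line in grid]

plot_lines = text_scatter(data)
for line in plot_lines:
    print("|" + line)
print("+" + "-" * 9)
print(" beers (x) left to right, BAC (y) bottom to top")
```

Each student is one point; the points rise from the lower left to the upper right.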
Some plots don’t have clear explanatory and response variables.
Do calories explain
sodium amounts?
Does percent return on Treasury
bills explain percent return
on common stocks?
Interpreting scatterplots

After plotting two variables on a scatterplot, we describe the
relationship by examining the form, direction, and strength of the
association. We look for an overall pattern …


Form: linear, curved, clusters, no pattern

Direction: positive, negative, no direction

Strength: how closely the points fit the “form”
… and deviations from that pattern.

Outliers
Form and direction of an association
Linear
Nonlinear
No relationship
Positive association: High values of one variable tend to occur together
with high values of the other variable.
Negative association: High values of one variable tend to occur together
with low values of the other variable.
No relationship: X and Y vary independently. Knowing X tells you nothing
about Y.
Strength of the association
The strength of the relationship between the two variables can be
seen by how much variation, or scatter, there is around the main form.
With a strong relationship, you
can get a pretty good estimate
of y if you know x.
With a weak relationship, for any
x you might get a wide range of
y values.
This is a weak relationship. For a
particular state median household
income, you can’t predict the state
per capita income very well.
This is a very strong relationship.
The daily amount of gas consumed
can be predicted quite accurately for
a given temperature value.
How to scale a scatterplot
Using an inappropriate
scale for a scatterplot
can give an incorrect
impression.
Both variables should be
given a similar amount of
space:
• Plot roughly square
• Points should occupy all
the plot space (no blank
space)
Outliers
An outlier is a data value that has a low probability of occurrence (i.e.,
it is unusual or unexpected).
In a scatterplot, outliers are points that fall (far) outside of the overall
pattern of the relationship.
Not an outlier:
Outliers
The upper right-hand point here is
not an outlier of the relationship. It
is what you would expect for this
many beers given the linear
relationship between beers/weight
and blood alcohol.
This point is not in line with the
others, so it is an outlier of the
relationship.
IQ score and
Grade Point Average
a) Describe in words what this
plot shows.
b) Describe the direction,
shape, and strength. Are
there outliers?
c) What is the deal with these
people?
Categorical variables in scatterplots
Often, things are not simple and one-dimensional. We need to group
the data into categories to reveal trends.
What may look like a positive linear
relationship is in fact a series of
negative linear associations.
Plotting different habitats in different
colors allows us to make that
important distinction.
Comparison of men and women
racing records over time.
Each group shows a very strong
negative linear relationship that
would not be apparent without the
gender categorization.
Relationship between lean body mass
and metabolic rate in men and women.
Both men and women follow the same
positive linear trend, but women show
a stronger association. As a group,
males typically have larger values for
both variables.
Categorical explanatory variables
When the explanatory variable is categorical, you cannot make a
scatterplot, but you can compare the different categories side-by-side on
the same graph (boxplots, or mean ± standard deviation).
Comparison of income
(quantitative response variable)
for different education levels (five
categories).
But be careful in your
interpretation: This is NOT a
positive association, because
education is not quantitative.
Stronger association?
Two scatterplots of
the same data.
The straight-line
pattern in the lower
plot appears
stronger because
of the surrounding
open space.
The correlation coefficient "r"

The correlation coefficient is a measure of the direction
and strength of a linear relationship.

It is calculated using the mean and the standard
deviation of both the x and y variables.

Correlation can only be used to describe quantitative
variables. Categorical variables don’t have means and
standard deviations.
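One way to see how the means and standard deviations enter: r can be written as the average product of the standardized x and y values, which is algebraically the same as the ratio of sums of deviations. A short derivation, assuming the usual sample standard deviations with divisor n - 1:

```latex
\[
s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad
s_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2}
\]
\[
r = \frac{1}{n-1}\sum_{i=1}^{n}
      \left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)
  = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
         {\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \,\sum_{i=1}^{n}(y_i-\bar{y})^2}}
\]
```

The factors of n - 1 cancel, so only the deviations from the two means remain.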
The correlation coefficient "r"
Time to swim: x̄ = 35, s_x = 0.7
Pulse rate: ȳ = 140, s_y = 9.5
\[
r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}
\]
Scatterplot: data on baby birth length and weight
[Figure: scatterplot of weight in g (axis from 2000 to 5000) against length in cm (axis from 47 to 56)]
The two variables seem to be linearly related, so we measure the correlation
Correlation between weight and length
for newborn babies
Correlations
                               LENGTH    WEIGHT
LENGTH   Pearson Correlation   1         0.765**
         Sig. (2-tailed)       .         0.000
         N                     35        35
WEIGHT   Pearson Correlation   0.765**   1
         Sig. (2-tailed)       0.000     .
         N                     35        35
**. Correlation is significant at the 0.01 level (2-tailed).
Calculating sample correlation coefficient
If you have n observations on a pair of variables X and Y:
x1, x2, x3, ..., xn
y1, y2, y3, ..., yn
The sample correlation coefficient between X and Y is
\[
r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}
\]
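The sample correlation formula on this slide can be sketched as a small Python function. The name pearson_r and the toy data are chosen for this illustration only.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient from paired observations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of products and squares of the deviations from the means
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    # Note: this divides by zero if either variable is constant
    return s_xy / sqrt(s_xx * s_yy)

# Toy check: points exactly on a rising line, then on a falling line
r_up = pearson_r([1, 2, 3, 4], [3, 5, 7, 9])
r_down = pearson_r([1, 2, 3, 4], [9, 7, 5, 3])
print(r_up, r_down)
```

Points lying exactly on a rising line give r = +1, and points exactly on a falling line give r = -1.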
Detailed example:
For 6 randomly selected students, the number of hours spent studying for the exam
and the exam marks are recorded.
student   hours (x)   marks (y)
1         10          17
2         20          32
3         30          58
4         40          60
5         50          87
6         60          99
Total     210         353
mean      35          58.83
Detailed example:
For 6 randomly selected students, the number of hours spent studying for the exam
and the exam marks are recorded.
student   hours (x)   marks (y)   xi - x̄   yi - ȳ   (xi - x̄)(yi - ȳ)   (xi - x̄)²   (yi - ȳ)²
1         10          17          -25       -41.83   1045.83             625          1750.03
2         20          32          -15       -26.83   402.50              225          720.03
3         30          58          -5        -0.83    4.17                25           0.69
4         40          60          5         1.17     5.83                25           1.36
5         50          87          15        28.17    422.50              225          793.36
6         60          99          25        40.17    1004.17             625          1613.36
Total                                                2885                1750         4878.83
mean      35          58.83
Detailed example
\[
r = \frac{2885}{\sqrt{1750 \times 4878.83}} = 0.987
\]
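The hand calculation can be checked with a short Python sketch that recomputes the three sums and r from the six (hours, marks) pairs. The variable names are chosen for this illustration.

```python
from math import sqrt

hours = [10, 20, 30, 40, 50, 60]
marks = [17, 32, 58, 60, 87, 99]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(marks) / n

# Column totals of the worked table
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, marks))
s_xx = sum((x - x_bar) ** 2 for x in hours)
s_yy = sum((y - y_bar) ** 2 for y in marks)

r = s_xy / sqrt(s_xx * s_yy)

print(round(x_bar, 2), round(y_bar, 2))                # 35.0 58.83
print(round(s_xy, 2), round(s_xx, 2), round(s_yy, 2))  # 2885.0 1750.0 4878.83
print(round(r, 3))                                     # 0.987
```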
Correlations
                              hours     marks
hours   Pearson Correlation   1         0.987**
        Sig. (2-tailed)                 0.000
        N                     6         6
marks   Pearson Correlation   0.987**   1
        Sig. (2-tailed)       0.000
        N                     6         6
**. Correlation is significant at the 0.01 level (2-tailed).
Facts about correlation

r ignores the distinction between response and explanatory
variables

r measures the strength and direction of a linear relationship
between two quantitative variables

r is not affected by changes in the unit of measurement

Positive value of r means association between the two variables is
positive

Negative value of r means association between the variables is
negative

r is always between -1 and +1

r is strongly affected by outliers
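Several of these facts can be checked numerically. The sketch below reuses the study-hours data from the detailed example; the helper pearson_r and the conversion of hours to minutes are choices made for this illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient from paired observations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

hours = [10, 20, 30, 40, 50, 60]
marks = [17, 32, 58, 60, 87, 99]

r_xy = pearson_r(hours, marks)
r_yx = pearson_r(marks, hours)            # roles of x and y swapped

minutes = [60 * h for h in hours]         # same times in another unit
r_minutes = pearson_r(minutes, marks)

lost_marks = [100 - m for m in marks]     # reverses the direction
r_negative = pearson_r(hours, lost_marks)

print(round(r_xy, 3), round(r_yx, 3), round(r_minutes, 3), round(r_negative, 3))
```

Swapping the variables or changing the unit leaves r unchanged, reversing the direction of one variable only flips the sign, and every value stays between -1 and +1.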
"r" ranges from -1 to +1
Strength: how closely the points
follow a straight line.
Direction: is positive when
individuals with higher X values
tend to have higher values of Y.
When variability in one
or both variables
decreases, the
correlation coefficient
gets stronger
(closer to +1 or -1).
Correlation only describes linear relationships
No matter how strong the association,
r does not describe curved relationships.
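A small made-up example makes the point: below, y is determined exactly by x through y = x², yet r is zero because the relationship is curved and symmetric rather than linear. The data and the helper pearson_r are invented for this illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient from paired observations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# A perfect but curved relationship: y = x squared on a symmetric range
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curved = pearson_r(xs, ys)
print(r_curved)
```

Knowing x here tells you y exactly, so the association is as strong as it can be, but r does not detect it.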
Influential points
Correlations are calculated using
means and standard deviations,
and thus are NOT resistant to
outliers.
Just moving one point away from the
general trend here decreases the
correlation from -0.91 to -0.75
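The same effect can be reproduced on the study-hours data from the detailed example (not the data behind the figure on this slide). In this sketch one mark is changed to a hypothetical value far from the straight-line pattern.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient from paired observations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

hours = [10, 20, 30, 40, 50, 60]
marks = [17, 32, 58, 60, 87, 99]
r_original = pearson_r(hours, marks)

# Hypothetical change: the sixth student's mark drops from 99 to 10,
# pulling that one point far below the straight-line pattern.
marks_moved = marks[:-1] + [10]
r_moved = pearson_r(hours, marks_moved)

print(round(r_original, 3))  # 0.987
print(round(r_moved, 2))     # 0.24
```

A single moved point out of six is enough to turn a very strong correlation into a weak one.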
Review Example

Estimate r
1. r = 1.00
2. r = -0.94
3. r = 1.12
4. r = 0.94
5. r = 0.21
[Figure: scatterplot; axis label "(in 1000's)"]