Regression Notes
i.e. How to get an A on the big project I’m about to assign you…
PHASE I. Scatter-Plots and Bivariate (2-variable) Statistics
Describing Variables
One-Variable Descriptors:
- SHAPE
- CENTER
- SPREAD
- anything unusual?
Two-Variable Descriptors:
1. LINEAR?
2. DIRECTION
3. SCATTER
4. anything unusual?
What do you see in these scatter plots?
[Scatter plot: Mean January Air Temperatures for 30 U.S. Locations. x-axis: Latitude (°N), 35 to 45; y-axis: Temperature (°C), 14 to 20]
LINEAR TREND
NEGATIVE ASSOCIATION
CONSTANT SCATTER
ANYTHING UNUSUAL?
What do you see in these scatter plots?
[Scatter plot: % of population who are Internet Users vs GDP per capita for 202 Countries. x-axis: GDP per capita (thousands of dollars), 0 to 40; y-axis: Internet Users (%), 0 to 80]
NON-LINEAR TREND
POSITIVE ASSOCIATION
NON-CONSTANT SCATTER
and an OUTLIER!!!
What do you see in these scatter plots?
[Scatter plot: Average Age Americans are First Married. x-axis: Year, 1930 to 1990; y-axis: Age, 20 to 30]
2 SEPARATE, NON-LINEAR TRENDS
NEGATIVE ASSOCIATION TIL ~1970, THEN POSITIVE
NO SCATTER
…gap in data in 1940s?
What to look for in scatter plots
1. Trend
- Linear or non-linear?
- Positive or negative association?
2. Scatter
- Strong or weak relationship?
- Constant or non-constant scatter?
3. Anything unusual
- Outlier
- Groupings
Rank relationships: weakest (1) to strongest (4)
2
4
1
3
Correlation Coefficient
little r – what is it?
- r is the correlation coefficient between y and x
- r measures the strength of a linear relationship
- r is a multiple of the slope: r = slope × (s_x / s_y), so r and the slope always have the same sign
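The three statements about r can be checked numerically. Below is a minimal sketch in plain Python; the data values and the helper names (`mean`, `std_dev`, `correlation`, `slope`) are made up for illustration and are not from the slides.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def std_dev(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """Pearson correlation coefficient r between xs and ys."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def slope(xs, ys):
    """Slope of the least squares line of y on x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Made-up, roughly linear data (an assumption for this sketch).
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

r = correlation(xs, ys)
b = slope(xs, ys)
r_from_slope = b * std_dev(xs) / std_dev(ys)  # slope rescaled by s_x / s_y

print("r =", round(r, 4))
print("slope * s_x / s_y =", round(r_from_slope, 4))
```

Both printed values agree, which is what "r is a multiple of the slope" means in practice.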
r – when can it be used?
- Only use r if the scatter plot is linear
[Scatter plot of y against x, r = 0.99]
- Don’t use r if the scatter plot is non-linear!
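One way to see why r should not be used on a non-linear plot: the sketch below uses made-up data where y is completely determined by x (y = x²), yet r comes out as zero because the relationship is not a line. The `correlation` helper is written out here for illustration.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r between xs and ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: a perfect curve, symmetric about x = 0.
curve_x = [-3, -2, -1, 0, 1, 2, 3]
curve_y = [x ** 2 for x in curve_x]

r_curve = correlation(curve_x, curve_y)
print("r for y = x^2:", r_curve)
```

A value of r near zero here says only that there is no linear relationship, not that x and y are unrelated.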
r – what does it tell you?
- How close the points in the scatter plot come to lying on the line
[Two scatter plots of y against x: r = 0.57 and r = 0.99]
Difficult Ones
[Scatter plots]
Playing with Outliers (1)…
What will happen to the correlation coefficient if we remove the tallest 12th grader? Bigger or smaller?
LINEAR TREND
POSITIVE ASSOCIATION
MOSTLY CONSTANT SCATTER
and an OUTLIER!!!
Hint: …correlation measures how linear the data is
See for yourself HERE
Playing with Outliers (2)…
What will happen to the correlation coefficient if we remove the elephant? Bigger or smaller?
LINEAR TREND
POSITIVE ASSOCIATION
CONSTANT SCATTER
and an OUTLIER!!!
Hint: …make your brain zoom in on that main cluster of points
See for yourself HERE
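A rough way to experiment with the two outlier questions is to compute r with and without the unusual point. The sketch below uses made-up numbers, not the data from the slides, so it illustrates the two kinds of outlier rather than giving the answer for the actual plots.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r between xs and ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Case 1 (hypothetical): a clear linear trend plus one point far off the line.
trend_x = [1, 2, 3, 4, 5, 6, 7, 4]
trend_y = [1.2, 1.9, 3.1, 4.0, 5.2, 5.8, 7.1, 12.0]  # last point is the outlier
r1_with = correlation(trend_x, trend_y)
r1_without = correlation(trend_x[:-1], trend_y[:-1])

# Case 2 (hypothetical): a loose cluster plus one point far away from it.
cluster_x = [1, 2, 3, 1, 2, 3, 30]
cluster_y = [2, 1, 3, 3, 2, 1, 30]  # last point is the outlier
r2_with = correlation(cluster_x, cluster_y)
r2_without = correlation(cluster_x[:-1], cluster_y[:-1])

print("off-trend outlier:  with", round(r1_with, 2), "without", round(r1_without, 2))
print("far-away outlier:   with", round(r2_with, 2), "without", round(r2_without, 2))
```

In this made-up data, removing the off-trend point raises r, while removing the far-away point lowers it, because the remaining cluster on its own is not very linear.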
Guess which are correlated with Test Scores?
4 are… 4 aren’t…
1. Highly educated parents
2. Mom’s age >30 at birth
3. Mom stays home until Kindergarten
4. Intact family (live with mom and dad)
5. Attended Head Start program
6. Parents have money
7. Move to a better neighborhood
8. Low birthweight (including premature)
Guess which are correlated with Test Scores?
3 are… 4 aren’t…
9. Parents speak English
10. Family goes to museums, zoos,
concerts…
11. Parents involved in PTA
12. Child spanked regularly
13. Watches TV a lot
14. Parents own a lot of books
15. Child is read to every day
Life Expectancy Example
[Scatter plot: Life Expectancy and Availability of Doctors for a Sample of 40 Countries. x-axis: People per Doctor, 0 to 40000; y-axis: Life Expectancy, 50 to 80]
- Non-linear trend
- Negative Association
- Fairly Constant Scatter
Can you suggest how to increase life expectancy in a country?
Get fewer people per doctor! Duh!
Life Expectancy Example
[Scatter plot: Life Expectancy and Availability of Televisions for a Sample of 40 Countries. x-axis: People per Television, 0 to 600; y-axis: Life Expectancy, 50 to 80]
Can you suggest how to increase life expectancy in a country?
Get fewer people per TV?!?
BEWARE LURKING VARIABLES!!!
Kinds of Lurking Variables (1)
CAUSATION
“People who take showers have better organizational skills.”
perceived correlation: Shower (x) and Organized (y)
Maybe changes in x CAUSE changes in y
Kinds of Lurking Variables (2)
CAUSATION again... but in reverse
“People who take showers have better organizational skills.”
perceived correlation: Shower (x) and Organized (y)
Maybe changes in y CAUSE changes in x
Kinds of Lurking Variables (3)
COMMON RESPONSE
“People who take showers have better organizational skills.”
perceived correlation: Shower (x) and Organized (y)
Good Habits in General (z)
Maybe something else z is causing changes in both x and y at the same time!
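The common-response idea can be illustrated with a small simulation. In the sketch below a hidden variable z drives both x and y, and x never affects y directly. All numbers, including the noise levels and the seed, are arbitrary assumptions chosen for illustration.

```python
import random
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r between xs and ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

random.seed(0)  # fixed seed so the sketch is repeatable
n = 2000

# z plays the role of "good habits in general" (hypothetical scores).
z = [random.gauss(0, 1) for _ in range(n)]
# x and y each depend on z plus their own independent noise.
x = [zi + random.gauss(0, 0.5) for zi in z]
y = [zi + random.gauss(0, 0.5) for zi in z]

r_xy = correlation(x, y)

# Take z out of both variables: what is left is unrelated noise.
x_left = [xi - zi for xi, zi in zip(x, z)]
y_left = [yi - zi for yi, zi in zip(y, z)]
r_left = correlation(x_left, y_left)

print("r between x and y:", round(r_xy, 2))
print("r once z is removed:", round(r_left, 2))
```

x and y are strongly correlated even though neither causes the other. Once the shared z is taken out, the correlation all but disappears.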
Kinds of Lurking Variables (4)
CONFOUNDING
“People who take showers have better organizational skills.”
…we don’t know which variable is causing the changes
perceived correlation: Shower (x) and Organized (y)
Good Habits in General (z)
LURKING VARIABLES
Heh, Heh, Heh…
How Regression gets you in Trouble…
Famous examples of strong correlations:
- Instances of drunkenness in those below 18 years of age are significantly lower than for those above. (Clearly children can hold their drink better than adults)
- Whenever ice cream sales rise, so does the number of shark attacks. (eating ice cream makes you tastier?)
- As vocabulary in infancy rises, so does appetite. (learning words makes you hungry?)
- The more fire trucks you send to a fire, the worse the damage is. (fire trucks cause damage?)
- The more you pay teachers in a town, the more expensive alcohol is.
- In Scandinavia, storks appear more often on the rooftops of families with more babies.
- Deer and cattle orient themselves along a north/south axis when grazing.
Correlation is not Causation
The story: The smoking ban in Wales "caused" a 13% fall in heart attacks from October to December 2007, compared with the same period in 2006.
The flaw: The ban began in April. We also observed a 13% fall in heart attacks in April. And presumably it "caused" me to spill my coffee. For that happened during the smoking ban too.
!! TRADE UNIONS SECURE BETTER PAY !!
See? Union membership can get you as much as 30% more pay!!
!! TRADE UNIONS SECURE BETTER PAY !!
perceived correlation: Union Membership (x) and Better Pay! (y)
Lurking variables (z):
- Education level of employee
- Experience level of employee
- Age of employee
Gapminder
PHASE II. Residuals and Least Squares Regression Lines (LSRL)
Residuals = Actual – Predicted
Prediction line: y = 5 + 2x
The actual point is (8, 25)
The predicted point is (8, 21)
Residual = Actual – Predicted = 25 – 21 = 4
[Plot: points labelled with “Actual” – “Predicted” values: 7, -3, 17, -10, 1, -4]
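The residual arithmetic for the point (8, 25) can be reproduced in a few lines. This is a minimal sketch; the function names `predict` and `residual` are made up for illustration, while the line y = 5 + 2x and the point come from the slide.

```python
def predict(x, intercept=5.0, slope=2.0):
    """Predicted y on the prediction line y = 5 + 2x."""
    return intercept + slope * x

def residual(x, y_actual):
    """Residual = actual - predicted."""
    return y_actual - predict(x)

predicted = predict(8)     # height of the line at x = 8
res = residual(8, 25)      # how far the actual point sits above the line

print("predicted:", predicted)
print("residual:", res)
```

A positive residual means the actual point lies above the line; a negative one means it lies below.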
Least Squares Regression:
We’ll try to get the Least Squares
Σ (Resids)² = 439.2988
Least Squares Regression Line Facts:
“Least Squares Regression Line” (LSRL)
- There is one and only one LSRL for every set of bivariate data.
- Σ Residuals = 0 (just like with st.dev)
- The LSRL must go through the point (x̄, ȳ)
- You’ll only have to calculate the LSRL by hand once (…heh, heh)
Your calculator will give you the one equation with the “least” amount of squares…
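The LSRL facts can be checked on a small data set. The sketch below fits the line with the usual least squares formulas and then verifies each fact; the five data points and the helper names `fit_lsrl` and `sum_sq_resids` are made up for illustration.

```python
def fit_lsrl(xs, ys):
    """Return (intercept, slope) of the least squares regression line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def sum_sq_resids(xs, ys, intercept, slope):
    """Sum of squared residuals for the line y = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Made-up bivariate data (an assumption for this sketch).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_lsrl(xs, ys)
resids = [y - (a + b * x) for x, y in zip(xs, ys)]
x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)

print("LSRL: y =", round(a, 2), "+", round(b, 2), "x")
print("sum of residuals:", round(sum(resids), 10))
print("line at x-bar:", round(a + b * x_bar, 10), "vs y-bar:", y_bar)

# Any other line has a larger sum of squared residuals.
best = sum_sq_resids(xs, ys, a, b)
nudged = sum_sq_resids(xs, ys, a + 0.5, b)
tilted = sum_sq_resids(xs, ys, a, b + 0.1)
print("sum of squares:", round(best, 4), "nudged:", round(nudged, 4), "tilted:", round(tilted, 4))
```

The nudged and tilted lines are only two alternatives, so this demonstrates the "least squares" property rather than proving it.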
PHASE III. Your Project
Example of analysis: “Going Crackers”
An example of the type of work you’ll be doing for your REGRESSION ASSIGNMENT
You start with some raw data…
1. Predict the energy content of a cracker with 25% fat content.
2. If you reduced the salt content by 100 mg, how would the fat change?
Example of analysis: “Going Crackers”
1. Predict the energy content of a cracker with 25% fat content.
[Scatter plot: ENERGY against FAT]
(ENERGY) = 380 + 4.98 (FAT)
= 380 + 4.98 (25)
= 504.5
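The same prediction can be written as a tiny function. This is a minimal sketch; the function name `predict_energy` is made up, while the intercept 380 and slope 4.98 come from the fitted line above.

```python
def predict_energy(fat_percent):
    """Predicted energy from the fitted line (ENERGY) = 380 + 4.98 (FAT)."""
    return 380 + 4.98 * fat_percent

energy_at_25 = predict_energy(25)
print("predicted energy at 25% fat:", round(energy_at_25, 1))
```

Predictions like this are only trustworthy for fat values inside the range covered by the raw data.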
Example of analysis: “Going Crackers”
2. If you reduced the salt content by 100 mg, how would the fat change?
[Scatter plot: FAT against SALT]
(FAT) = - 2.7 + 0.0237 (SALT)
The predicted fat content would drop by 0.0237 × 100 = 2.37, i.e. about 2.4%.
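The slope gives the predicted change in fat for each one-unit change in salt, so a larger change in salt just scales it. The sketch below assumes salt is measured in mg and fat in %, as in the Problem 2 analysis; the function names are made up for illustration.

```python
SLOPE = 0.0237      # predicted change in fat (%) per 1 mg of salt
INTERCEPT = -2.7

def predict_fat(salt_mg):
    """Predicted fat (%) from the fitted line (FAT) = -2.7 + 0.0237 (SALT)."""
    return INTERCEPT + SLOPE * salt_mg

def fat_change(salt_change_mg):
    """Predicted change in fat for a given change in salt (the intercept cancels)."""
    return SLOPE * salt_change_mg

change = fat_change(-100)   # salt reduced by 100 mg
print("predicted change in fat:", round(change, 2))

# Same answer from two predictions 100 mg apart (500 mg is an arbitrary starting point).
difference = predict_fat(400) - predict_fat(500)
print("difference of predictions:", round(difference, 2))
```

The starting salt level does not matter for a straight line: any 100 mg reduction gives the same predicted drop.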
Problem 2 Analysis
The data suggest a linear trend. The association is positive with constant scatter about the trend line. It is reasonable to do a linear regression.
The LSRL is y = -2.7 + 0.0237x. The slope of the fitted line is 0.0237, which tells us that, on average, each 100 mg decrease in salt content is associated with a decrease in total fat content of about 2.4%. The moderate relationship (r = 0.69) means that predicting such a drop will not necessarily be highly accurate.