Chapter 2 Scatter plots, Correlation, Linear Regression, Inferences


CHAPTER 2
SCATTER PLOTS, CORRELATION, LINEAR REGRESSION, INFERENCES FOR REGRESSION
By: Tasha Carr, Lyndsay Gentile, Darya Rosikhina, Stacey Zarko
SCATTER PLOTS
Shows the relationship between two quantitative variables measured on the same individuals.
• Look at:
• Direction: positive, negative, or none
• Form: linear (straight) or curved
• Strength: little scatter means strong association; great scatter means weak association
• Outliers: make sure there are no major outliers

CORRELATION
• Measures the direction and strength of the linear relationship
• Usually written as r
• r is the correlation coefficient
• Not resistant
CORRELATION
Rules:
• It does not change if you switch x and y
• Both variables must be quantitative
• Does not change when we change units of measurement
• Positive r shows positive association; negative r shows negative association
• Always between -1 and 1
• Values near 0 show a weak linear relationship
• Strength of the relationship increases as r moves toward -1 or 1 (r = ±1 means the points lie on a straight line)
• Not resistant, so outliers can change the value
• A bad measure for curved relationships
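These rules can be checked numerically. The sketch below computes r from its definition in Python; the small data set is made up purely for illustration.

```python
import statistics as st

def correlation(xs, ys):
    """Pearson correlation coefficient r (sample form)."""
    n = len(xs)
    mx, my = st.mean(xs), st.mean(ys)
    sx, sy = st.stdev(xs), st.stdev(ys)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys)) / ((n - 1) * sx * sy)

# hypothetical data; both variables are quantitative
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
print(round(r, 4))                         # 0.7746 -- always between -1 and 1

# Rule: r does not change if you switch x and y
print(abs(correlation(y, x) - r) < 1e-12)  # True

# Rule: r does not change when we change units (x in inches -> centimeters)
print(abs(correlation([2.54 * xi for xi in x], y) - r) < 1e-12)  # True
```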
LEAST-SQUARES REGRESSION
Makes the sum of the squares of the vertical distances of the data points from the line as small as possible (not resistant).
• ŷ = b0 + b1x
• b1 = slope
• b1 = r(sy / sx)
• Amount by which the predicted y changes when x increases by one unit
• b0 = y-intercept
• Value of ŷ when x = 0
• b0 = ȳ − b1x̄
• Extrapolation: making predictions outside the range of the given data; often inaccurate
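The slope and intercept formulas can be sketched directly in Python; the data set here is hypothetical.

```python
import statistics as st

def lsrl(xs, ys):
    """Least-squares slope and intercept: b1 = r*(sy/sx), b0 = ybar - b1*xbar."""
    mx, my = st.mean(xs), st.mean(ys)
    # b1 = r*(sy/sx) simplifies to sum of cross-deviations over sum of squared x-deviations
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
    sxx = sum((xi - mx) ** 2 for xi in xs)
    b1 = sxy / sxx            # change in predicted y per one-unit increase in x
    b0 = my - b1 * mx         # value of the prediction when x = 0
    return b0, b1

# hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = lsrl(x, y)
print(round(b0, 4), round(b1, 4))   # 2.2 0.6  ->  yhat = 2.2 + 0.6x
```

Predicting at, say, x = 100 with this line would be extrapolation: the data only cover x = 1 through 5.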
LEAST-SQUARES REGRESSION
A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes.
• Based on correlation
• Used to predict the value of y for a given value of x
• R² = Coefficient of Determination
• In the model, R² × 100% of the variability in the y-variable is accounted for by variation in the x-variable.
RESIDUALS
• Minimized by the LSRL
• Difference between actual and predicted data
• Observed − Expected
• Actual − Guess
• e = y − ŷ
• Positive residuals: the model underestimates
• Negative residuals: the model overestimates
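The steps above can be sketched in a few lines; the data and fitted line (ŷ = 2.2 + 0.6x) are hypothetical.

```python
# hypothetical data with its least-squares line yhat = 2.2 + 0.6x
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
yhat = [2.2 + 0.6 * xi for xi in x]

# residual e = observed - predicted
residuals = [yi - fi for yi, fi in zip(y, yhat)]
print([round(e, 2) for e in residuals])   # [-0.8, 0.6, 1.0, -0.6, -0.2]

# positive residual -> the line underestimates that point; negative -> overestimates
# a property of the LSRL: the residuals always sum to 0
print(abs(sum(residuals)) < 1e-9)         # True
```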
RESIDUAL PLOT
A scatter plot of the regression residuals against the explanatory variable or the predicted values.
• Shows if a linear model is appropriate
• If there is no apparent shape or pattern and the residuals are randomly scattered, a linear model is a good fit
• If there is a curve, a horn shape, or a big change in scatter, a linear model is not a good fit

LURKING VARIABLES

Variable that has an important effect on the
relationship among the variables in a study but is
not included among the variables studied



Make a correlation or regression misleading
An outlier- point that lies outside the overall
pattern of the other observations
Influential point- removing it would change the
outcome (outliers in the x- direction)
CAUSATION
• An association between an explanatory and a response variable does not show causation (a cause-and-effect relationship), even if the correlation is high
• Correlation based on averages is higher than correlation based on data from individuals
INFERENCE FOR REGRESSION
• Used to test if there is an association between two quantitative variables, based on the population
• To test for an association we check β1, the population slope
• If no association exists, this slope should be zero
INFERENCE FOR REGRESSION
Hypotheses:
• H0: β1 = 0. There is no association.
• HA: β1 ≠ 0. There is an association.
Conditions:
• Straight Enough: check for no curves in the scatter plot
• Independence: the data are assumed independent
• Equal Variance: check the residual plot for changes in spread
• Nearly Normal: create a histogram or a Normal probability plot of the residuals
Once all conditions have been met, use a Student's t-model for a test on the slope of the regression model.
INFERENCE FOR REGRESSION
Mechanics:
• df = n − 2
• t = (b1 − 0) / SE(b1)
• P-value = 2P(tn−2 > |t|)
Multiple Regression
Model of House Prices
Response attribute (numeric): Price

Predictor   Coefficient   Std Error   t-Statistic   P-Value
Constant    1244.2712     75.4607     16.489        0.0000
Age         -5.3659       3.8596      -1.390        0.1691

Regression Equation: Price = 1244.2712 − 5.3659(Age)
R-Squared: 0.0284526
Adjusted R-Squared: 0.0137322
Standard Deviation of the Error: 400.242

Here b1 is the Age coefficient, SE(b1) is its standard error, and t = (b1 − 0)/SE(b1).
INFERENCE FOR REGRESSION
Conclusion:
• If the P-value is less than alpha, reject the null hypothesis
• If we reject H0, there is evidence of an association
• If the P-value is greater than alpha, we fail to reject the null hypothesis
• If we fail to reject H0, there is not enough evidence of an association