Scatter Diagrams and Linear Correlation

• Chapters 1-3 covered single-variable data
• Examples of two variables: age of person vs. time to master a cell phone
task, grade point average vs. time studying, grade point average vs.
time playing video games, amount of smoking vs. rate of lung cancer
• Scatter diagram: (x, y) data plotted as individual points
– x – explanatory variable (independent)
– y – response variable (dependent)
• Evaluate scatterplot data
– A plot of y vs. x values shows the relationship between 2
quantitative variables measured on the same individual
• Look at overall pattern
– Any striking deviation (outliers)?
• Describe by
a) form (linear or curved)
b) direction: positively associated (+ slope) or negatively associated (– slope)
c) strength: how closely the points follow the form
• Examples: age of person vs. time to master a cell phone task, grade
point average vs. time studying, grade point average vs. time playing
video games, amount of smoking vs. rate of lung cancer
[Figure: degrees of correlation]
• Tips for drawing a scatterplot
– Scale the axes: the intervals along each axis must be uniform; the
two axes can use different scales
– Label both axes
– Adopt a scale that uses the entire grid (do not compress the plot
into one corner of the grid)
• Correlation coefficient (r)
– Assesses strength and direction of linear relationship
between x and y.
– Unitless
– -1 ≤ r ≤ 1
– r = -1 or r = 1: perfect correlation (all points exactly on the line)
– The closer r is to 1 or -1, the better a line describes the
relationship (better fit of the data)
– r > 0: positive association (as x increases, y tends to increase)
– r < 0: negative association (as x increases, y tends to decrease)
– x and y are interchangeable in calculating r
– r does not change if either (or both) variables have unit
changes (inches to cm, or F to C)
[Figure: linear and non-linear correlations]
• $r = \frac{1}{n-1} \sum \left( \frac{x - \bar{x}}{s_x} \right) \left( \frac{y - \bar{y}}{s_y} \right)$
• Using TI-83: example on p. 129 (number of police vs. muggings)
• Cautions: association does not imply causation
– Lurking variables may play a role
– r is only good for linear models
– Correlation between averages is higher than correlation between
individual points.
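A minimal Python sketch of the formula for r and two of the points above, assuming made-up illustrative data (the study-hours and GPA values are not from the textbook). The helper name `correlation` is hypothetical. It checks that swapping x and y or changing units leaves r unchanged, and that a perfectly curved relationship can still give r near 0.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)

# Made-up illustrative data (an assumption, not the textbook example).
hours_studied = [2, 4, 5, 7, 9, 10]
gpa = [2.1, 2.6, 2.9, 3.2, 3.6, 3.8]

r = correlation(hours_studied, gpa)
r_swapped = correlation(gpa, hours_studied)                      # x and y interchanged
r_minutes = correlation([h * 60 for h in hours_studied], gpa)    # hours -> minutes

# A perfect but curved relationship: y = x^2 on x values symmetric about 0.
xs_curved = [-3, -2, -1, 0, 1, 2, 3]
ys_curved = [x ** 2 for x in xs_curved]
r_curved = correlation(xs_curved, ys_curved)

print(f"r = {r:.4f}, swapped = {r_swapped:.4f}, in minutes = {r_minutes:.4f}")
print(f"r for y = x^2: {abs(r_curved):.4f}")
```

The curved case is why r should only be used after the scatterplot has been checked for a linear form.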
• Facts
– No distinction between x and y variable. The
value of r is unaffected by switching x and y
– Both x and y must be quantitative
– Only good for linear relationships
– Not resistant to outliers
• Correlation (r) is not a complete description of 2-variable
data; the means and standard deviations of x and y should also be
included
• HW: p131 2,4,6,8 a,b,c, 10 a,b,c, 12 a,b,c
For “c” use calculator to compute r
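To illustrate the fact that r is not resistant to outliers, here is a small Python sketch under assumed, made-up data. The helper `correlation` is a hypothetical name implementing the standardized-product formula for r. One point far from the pattern is added and r is recomputed.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r using standardized values and the n - 1 divisor."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Made-up data that follows a nearly straight line.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
r_clean = correlation(xs, ys)

# Add a single point far below the pattern (illustrative outlier).
r_with_outlier = correlation(xs + [9], ys + [2.0])

print(f"r without the outlier: {r_clean:.3f}")
print(f"r with the outlier:    {r_with_outlier:.3f}")
```

A single point is enough to pull r well away from 1 here, so it is worth inspecting the scatterplot before reporting r.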
4.2 Least Squares Regression
• Least Squares Regression
– Method for finding a line (best fit) that
summarizes the relationship between 2
variables, x (explanatory) and y (response)
– Use the line to predict the value of y for a given x
– Must have a specific response variable y and
explanatory variable x (cannot switch them as with r)
• Least Squares Regression Line (LSRL)
– Minimizes the sum of the squared errors (measured in the y-direction)
– Error = observed value – predicted value
Σ(y – ŷ)² (y is the actual value, ŷ is the predicted value;
ŷ is called "y hat")
– The line of y on x that makes the sum of the squares of the
vertical distances from the data points to the fitted line as
small as possible
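The "as small as possible" idea can be checked numerically. The sketch below, on made-up data, computes the sum of squared errors Σ(y – ŷ)² for the least-squares line and for several nudged lines. The function names `sse` and `least_squares_line` are hypothetical, and the slope is computed with the standard closed form Σ(x – x̄)(y – ȳ) / Σ(x – x̄)², which is an assumption about how to obtain the line rather than a method stated on the slide.

```python
from statistics import mean

def sse(a, b, xs, ys):
    """Sum of squared errors, sum((y - y_hat)^2), for the line y_hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def least_squares_line(xs, ys):
    """Intercept a and slope b of the least-squares line (standard closed form)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2.2, 4.1, 5.8, 8.3, 9.6]

a, b = least_squares_line(xs, ys)
best = sse(a, b, xs, ys)

# SSE for lines with a slightly different intercept and/or slope.
nudged = [sse(a + da, b + db, xs, ys)
          for da in (-0.5, 0.0, 0.5)
          for db in (-0.2, 0.0, 0.2)
          if (da, db) != (0.0, 0.0)]

print(f"LSRL: y_hat = {a:.2f} + {b:.2f}x, SSE = {best:.4f}")
print(f"smallest SSE among nudged lines = {min(nudged):.4f}")
```

Every nudged line has a larger SSE than the fitted one, which is what "least squares" means.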
• LSRL equation: ŷ = a + bx
• ŷ is the predicted value of y
• Slope: b = r(sy/sx)
• y-intercept: a = ȳ - bx̄
• x̄ and ȳ are the means of all x and y data, respectively;
the point (x̄, ȳ) is on the LSRL
• sx and sy are the standard deviations of the x and y data
• r is the correlation
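A short Python sketch of these formulas, assuming made-up data: the slope comes from b = r(sy/sx), the intercept from a = ȳ - bx̄, and the last line checks that the point (x̄, ȳ) lies on the fitted line. The names `correlation`, `lsrl`, and `predict` are hypothetical helpers.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r using standardized values and the n - 1 divisor."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

def lsrl(xs, ys):
    """Return (a, b) for y_hat = a + b*x using b = r*(sy/sx), a = y_bar - b*x_bar."""
    r = correlation(xs, ys)
    b = r * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

def predict(a, b, x):
    """Predicted value y_hat for a given x."""
    return a + b * x

# Made-up illustrative data with a negative association.
xs = [10, 15, 20, 25, 30]
ys = [30, 26, 20, 17, 12]

a, b = lsrl(xs, ys)
print(f"y_hat = {a:.2f} + ({b:.2f})x")
print(f"prediction at x_bar: {predict(a, b, mean(xs)):.2f}, y_bar: {mean(ys):.2f}")
```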
• TI-83: enter data into L1, L2 (x, y)
– Use STAT CALC, select #8: LinReg(a+bx) to
get the best-fit line
• Slope: important for interpretation of data
– Rate of change of y for each unit increase in x
• Intercept: may not be practically important
for problems.
• Plot the LSRL: using the formula ŷ = a + bx, find 2
points on the line.
– (x1, ŷ1) and (x2, ŷ2); make sure x1 and x2 are
near opposite ends of the data
• Influential observations and outliers
– Influential: extreme in the x-direction;
removing an influential point changes
the LSRL significantly
– Outliers: extreme in the y-direction; removing one typically
does not significantly change the LSRL
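The difference between an influential point and a y-direction outlier can be sketched in Python. The data and the two added points below are made up for illustration, and `slope` is a hypothetical helper using the standard closed form for the least-squares slope. The comparison holds for this data set, not necessarily for every one.

```python
from statistics import mean

def slope(xs, ys):
    """Least-squares slope b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx

# Made-up base data lying close to a straight line.
base_x = [1, 2, 3, 4, 5, 6]
base_y = [2.0, 4.1, 5.9, 8.2, 9.9, 12.1]
b_base = slope(base_x, base_y)

# Add a point that is extreme in the x-direction and off the trend.
b_influential = slope(base_x + [15], base_y + [10.0])

# Add a point that is extreme in the y-direction but sits mid-range in x.
b_outlier = slope(base_x + [4], base_y + [14.0])

print(f"slope, base data:              {b_base:.3f}")
print(f"slope, with influential point: {b_influential:.3f}")
print(f"slope, with y-outlier:         {b_outlier:.3f}")
```

Here the x-extreme point drags the slope far from the base fit, while the y-outlier moves it only slightly.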
Coefficient of Determination
• r² – coefficient of determination
• r – describes the strength and direction of a
straight-line relationship
• r² – fraction of the variation in the values of y that is
explained by the LSRL of y on x
• r = 1, r² = 1: perfect correlation, 100% of the
variation is explained by the LSRL
• r = 0.7, r² = 0.49: about 49% of the variation in y is
explained by the LSRL
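The "fraction of variation explained" reading can be checked directly. This Python sketch, on made-up data, compares r² with 1 - SSE/SST, where SSE is the sum of squared residuals about the fitted line and SST is the total variation of y about its mean. These two names, and the helper `least_squares_line`, are assumptions introduced here for illustration.

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Intercept a and slope b of the least-squares line (standard closed form)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

# Made-up illustrative data.
xs = [1, 2, 3, 4, 5, 6]
ys = [3.1, 4.8, 7.4, 8.7, 11.2, 12.9]

a, b = least_squares_line(xs, ys)
x_bar, y_bar = mean(xs), mean(ys)

sst = sum((y - y_bar) ** 2 for y in ys)                      # total variation in y
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))    # variation left unexplained
explained_fraction = 1 - sse / sst

# Correlation computed separately, then squared.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
r = s_xy / (s_xx * sst) ** 0.5

print(f"r^2 = {r ** 2:.4f}, 1 - SSE/SST = {explained_fraction:.4f}")
```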
Residuals
• Residuals – difference between observed value
and predicted value
– Residual = y – ŷ
– Mean of least-squares residuals = 0
• Residual plots – scatterplot of regression residuals
against explanatory variable (x)
– Useful in assessing the fit of the regression line, i.e. do we have
a straight line?
• Uniform scatter indicates a linear relationship
• A curved pattern indicates the relationship is not linear
• Increasing or decreasing spread indicates that prediction of y
will be less accurate where the spread is larger (e.g. for larger x)
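A minimal Python sketch of residuals, assuming made-up data: it fits a least-squares line to a roughly linear data set and to a curved one (y = x²), then lists the residuals y – ŷ. The helper names are hypothetical. In both cases the residuals sum to zero, but only the curved data leaves a systematic pattern.

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Intercept a and slope b of the least-squares line (standard closed form)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

def residuals(xs, ys):
    """Residual = observed y minus predicted y_hat, for each data point."""
    a, b = least_squares_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys_linear = [2.3, 3.8, 6.1, 8.2, 9.7, 12.1, 14.2, 15.8]   # made-up, roughly linear
ys_curved = [x ** 2 for x in xs]                          # clearly not linear

res_linear = residuals(xs, ys_linear)
res_curved = residuals(xs, ys_curved)

print("linear data residuals:", [round(e, 2) for e in res_linear])
print("curved data residuals:", [round(e, 2) for e in res_curved])
print("sums:", round(sum(res_linear), 6), round(sum(res_curved), 6))
```

For the curved data the residuals are positive at both ends and negative in the middle, the kind of pattern a residual plot is meant to reveal.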