Influential Points and Outliers
Debbi Amanti
OUTLIERS:
Data points more than two or three standard
deviations from the mean of the data.
Observations that differ significantly from
the pattern of the REST OF THE DATA
Observations that lie outside the overall
pattern of the other observations.
OUTLIERS IN TERMS OF
REGRESSION:
Observations with large (in absolute
value) residuals.
Observations falling far from the
regression line while not following the
pattern of the relationship apparent in
the others
Residual = actual - fitted
To identify outliers mathematically
in a univariate set of data:
Find the interquartile range, a.k.a.
IQR (Q3-Q1), and multiply this value
by 1.5. An outlier for a data set is
any point:
Greater than Q3+1.5*(IQR)
Less than Q1-1.5*(IQR)
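The 1.5*IQR rule above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample values are made up, and the quartiles come from the standard library's statistics.quantiles with method="inclusive", which is one of several quartile conventions. A textbook or calculator may give slightly different Q1 and Q3 for the same data.

```python
import statistics


def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) cutoffs Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(data):
    """Return the values lying strictly outside the 1.5*IQR cutoffs."""
    # Assumption: "inclusive" quartiles; other conventions can differ slightly.
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    lower, upper = iqr_fences(q1, q3)
    return [x for x in data if x < lower or x > upper]


# Hypothetical sample, not taken from the slides.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
print(find_outliers(sample))
```

Note the strict comparisons: a value exactly equal to a cutoff is not flagged.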
INFLUENTIAL POINTS ARE:
Points whose removal would greatly affect the
association of two variables
Points whose removal would significantly
change the slope of an LSR line
Points with a large moment (i.e., they are far
away from the rest of the data).
Usually outliers in the x direction.
The two graphs below show the same data – the one on the
right with the removal of the green data point. As you can
see, the removal of this point significantly affects the slope of
the regression line. This is an influential point!
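A rough way to check for influence numerically is to fit the least-squares line twice, once with the suspect point and once without it, and compare the slopes. The sketch below uses invented data (not the data behind the graphs) with one point far to the right; fit_line is a hypothetical helper written out by hand.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: the last point sits far to the right of the others.
xs = [1, 2, 3, 4, 5, 15]
ys = [2, 1, 3, 2, 3, 10]
far = 5  # index of the far-right point

slope_all, intercept_all = fit_line(xs, ys)
slope_without, _ = fit_line(xs[:far], ys[:far])

# Residual = actual - fitted, using the line fitted to all six points.
residual_far = ys[far] - (intercept_all + slope_all * xs[far])

print(f"slope with point:    {slope_all:.3f}")
print(f"slope without point: {slope_without:.3f}")
print(f"residual of point:   {residual_far:.3f}")
```

In this made-up example the slope roughly halves when the point is dropped, yet the point's own residual is small, because the line is pulled toward it.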
Using the same data as shown on the
previous slide, let’s compare the x and
y data sets for the presence of outliers:
X DATA
MIN=1 Q1=3 Q3=8 MAX=15.5
IQR=5
An outlier is any point:
> Q3+1.5*IQR=15.5
or
< Q1-1.5*IQR=-4.5
The maximum (15.5) equals the upper cutoff but does not exceed it.
THERE ARE NO OUTLIERS
IN THIS DATA SET!!!
Y DATA
MIN=2 Q1=4 Q3=9 MAX=10
IQR=5
An outlier is any point:
> Q3+1.5*IQR=16.5
or
< Q1-1.5*IQR=-3.5
THERE ARE NO OUTLIERS IN
THIS DATA SET!!!
!!!REMEMBER!!!
An observation does NOT have
to be an Outlier to be an
Influential Point!!
Nor does an observation need
to be an Influential Point in order
to be an Outlier!!
Get your
calculator
handy...
Given the five-number summary
{8 21 35 43 77}, which of the
following is correct?
A. There are no outliers
B. There are at least two outliers
C. There is not enough data to make
any conclusion
D. There is exactly one outlier
E. There is at least one outlier
The correct answer is
E
The five number summary gives you
{Min Q1 Median Q3 Max}
The IQR is calculated by Q3-Q1
So, the IQR for the given data is 43-21=22
An outlier for this data would be:
>Q3+1.5*IQR or <Q1-1.5*IQR
>43+(22*1.5)=76 or <21-(22*1.5)=-12
Since the max is 77, there must be at least one
outlier in this data set, but we cannot conclude
how many outliers without more data.
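The same reasoning can be checked from the five-number summary alone. In this sketch the function names are made up for illustration; it reports only whether the minimum or maximum falls outside the cutoffs, since a summary cannot reveal how many points do.

```python
def fences(q1, q3, k=1.5):
    """Return the (lower, upper) cutoffs Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def summary_outlier_check(summary):
    """Check a (min, Q1, median, Q3, max) summary against the 1.5*IQR rule."""
    minimum, q1, _median, q3, maximum = summary
    lower, upper = fences(q1, q3)
    return {
        "lower_fence": lower,
        "upper_fence": upper,
        # True means at least one outlier exists on that side.
        "low_outlier": minimum < lower,
        "high_outlier": maximum > upper,
    }


result = summary_outlier_check((8, 21, 35, 43, 77))
print(result)
```

Here the maximum of 77 is above the upper cutoff of 76, so at least one high outlier exists, while the minimum of 8 is well above the lower cutoff of -12.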
Given the following scatterplot and residual plot, which
of the following is true about the yellow data point?
[Scatterplot and residual plot not included in this transcript]
I. It is an influential point
II. It is an outlier with respect to the regression model
III. It appears to be an outlier in the x direction
A. I only
B. I and II
C. I and III
D. None of the above
E. All of the above
The correct answer is
C
I. Because this point has a large moment and is
far from the rest of the data, it is an influential
point. If this point were removed, the slope of
the line would change markedly.
II. This point is not an outlier with respect to the
model because, as you can see in the residual
plot, it does not have a large residual (it
follows the regression pattern of the data).
III. By looking at both the scatterplot and the
residual plot, you can see that the yellow
point is an outlier in the x direction (far right of
the rest of the data).
Resources used in this
presentation include:
Workshop Statistics by Allan Rossman
The Basic Practice of Statistics by David
S. Moore
AMSCO’s AP Statistics by James Bohan