Transcript Document

NOTICE
Please note that these slides are not a
substitute for reading and interacting with
your course material. Their intended
purpose is to be a quick reference for
studying concepts and to present the
material from a different angle that might
help you better understand the statistics.
For the Student:
How do insurance companies determine premium rates for
different age groups and sexes? What information does the government use to
decide who gets taxed how much? How are you going to determine which
vehicle is the safest to drive? The answer to each of these questions relies
heavily on statistics. Unfortunately, a large amount of the statistics in
society has been badly manipulated and will provide you with
unreliable results.
Statistics is everywhere and affects virtually everyone. The question to
ask yourself now is, “Are you going to become another ‘victim of statistics’?”
Whether your future career is in education, politics, or firefighting, making
decisions based on statistics will be inevitable, and there will be
consequences. Statistics is not hard; it only takes a little time and patience to
gain a true understanding of what your information really means.
To the Student
Here is a little hint to keep yourself from being overwhelmed while first learning
statistics. It is not vital to memorize all of the equations. Although
memorizing them can help, it is better to understand what the equations mean,
why they are used, and when to use them. Some of the equations,
especially any equation that has a Σ symbol, require adding a long series of
numbers. Practically speaking, you should use either a calculator or a computer to
compute such equations. Focus on understanding the concept of what the
computer is doing so that the number that pops out is more than a number to
you, because that number means something. Don’t memorize, just recognize.
Graphs and Summaries
A. One Categorical Variable
1. Graphs
2. Summaries
B. One Numerical Variable
1. Graphs
a. Stemplot
b. Histogram
c. Boxplot
d. Normal Quantile plot (Q-plot)
e. Shapes
i. Symmetric
1) Normal
2) Uniform
ii. Skewed
1) Left
2) Right
2. Summaries
a. Locations
i. Mean
ii. Median
iii. Mode
iv. Min and Max
v. Quartiles
vi. Comparisons of Mean and Median
vii. Z-scores
Empirical Rule
b. Spreads
i. Variance
ii. Standard Deviation
iii. Range
iv. Interquartile Range (IQR)
3. Transformations
a. Shift changes
i. Centers
ii. Spreads
b. Scale changes
i. Centers
ii. Spreads
Beginning Definitions
Variable- the overall characteristic of interest that we want to
understand.
e.g. Percent of people who use deodorant in America
e.g. Average debt of a college graduate from Texas A&M
Individual- a single case on which the variable is observed.
e.g. Bob, an American who does not use deodorant
e.g. Jill, a college graduate of Texas A&M with $10,000 in debt
Variable Types
• Categorical (Qualitative)
-Nominal
e.g. Colors {red, blue, green}
-Ordinal
e.g. Strength {weak, moderate, strong}
• Numerical (Quantitative)
-Discrete
e.g. Number of children in a family
-Continuous
e.g. Amount of water the average house uses
{Depending on the context, certain discrete
numbers can be considered continuous for
practical purposes, and continuous data can
be made discrete}
Distribution
-Shape e.g. unimodal, bimodal, multimodal, symmetric, skewed right, skewed left
-Center e.g. Mean, Median
-Spread e.g. Range, Standard Deviation, Variance, Interquartile Range
Categorical Graphs (Nominal or Ordinal)
• Pie Charts
• Bar Graphs
Index
Pie Charts (Counts and Percents)
[Figure: two pie charts of Country of Origin. Pie Graph with Counts: American n=253, European n=73, Japanese n=79. Pie Graph with Percents: American 62.47%, European 18.02%, Japanese 19.51%.]
Bar Graphs
[Figure: two bar graphs of Country of Origin. Bar Graph with Counts (y-axis Count, 0 to 250): American n=253, European n=73, Japanese n=79. Bar Graph with Percents (y-axis Percent, 0% to 60%): American 62%, European 18%, Japanese 20%.]
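The counts and percents shown in the pie charts and bar graphs carry the same information. As a minimal sketch (the variable names are only illustrative), the percents can be recovered from the counts like this:

```python
# Counts read off the Country of Origin graphs.
counts = {"American": 253, "European": 73, "Japanese": 79}

# Each percent is the category count divided by the total count.
total = sum(counts.values())
percents = {name: 100 * count / total for name, count in counts.items()}

for name, pct in percents.items():
    print(f"{name}: n={counts[name]}, {pct:.2f}%")
```

Rounded to whole numbers, these are the 62%, 18% and 20% shown on the bar graph with percents.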
Numerical Graphs (Univariate)
• Stemplots
• Histograms
• Boxplots
• Normal Quantile Plots (Q-plots)
Stemplots
Stemplot of Scores
3 | 178
4 | 567
5 | 09
6 | 3789
7 | 013355789
8 | 00124588
9 | 7
Back to back stemplot
 boys |   | girls
   18 | 3 | 7
   67 | 4 | 5
    0 | 5 | 9
    7 | 6 | 389
13379 | 7 | 0558
 1488 | 8 | 0025
      | 9 | 7
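A stemplot is simple enough to build by hand, but a short sketch may make the idea concrete. The scores below are read back from the stemplot of scores; the function name `stemplot` is just an illustrative choice, and it assumes two-digit whole numbers (tens digit as the stem, ones digit as the leaf).

```python
from collections import defaultdict

# Scores read back from the "Stemplot of Scores" slide.
scores = [31, 37, 38, 45, 46, 47, 50, 59, 63, 67, 68, 69,
          70, 71, 73, 73, 75, 75, 77, 78, 79,
          80, 80, 81, 82, 84, 85, 88, 88, 97]

def stemplot(values):
    """Return stemplot lines: tens digit as stem, ones digits as leaves."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lines = []
    # Walk every stem from smallest to largest so empty stems still show up.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem} | {row}")
    return lines

lines = stemplot(scores)
print("\n".join(lines))
```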
Histogram
[Figure: two histograms of Total Calories per bar of Common Candy, x-axis Calories (100 to 400). One uses a Frequency y-axis (0 to 14), the other a Density y-axis (0.000 to 0.010).]
Note that these are analogous to counts and percents with bar charts
Boxplots
[Figure: boxplot of Calories in Common Candies, axis from 100 to 350.]
Boxplots are made using the 5 number
summary to define the box and whiskers
unless there are outliers present. If the
outliers are low values, the whisker ends at
the smallest number that is not considered
an outlier. Likewise, if the outliers are
high values, the whisker ends at the largest
number that is not considered an outlier.
Outlier? A number is considered an
outlier if it lies more than 1.5 times the
IQR (Interquartile Range) below the
1st quartile or above the 3rd quartile.
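To see the 1.5 × IQR rule in action, here is a small sketch that applies it to the student heights used on the Summaries slides, with the quartiles given there (1st Quartile = 68, 3rd Quartile = 71). The names `lower_fence` and `upper_fence` are illustrative, not terms from the slides.

```python
# Ordered heights and quartiles as given on the Summaries slides.
heights = [62, 65, 68, 68, 69, 69, 70, 71, 71, 72]
q1, q3 = 68, 71

iqr = q3 - q1                      # interquartile range
lower_fence = q1 - 1.5 * iqr       # anything below this is an outlier
upper_fence = q3 + 1.5 * iqr       # anything above this is an outlier

outliers = [h for h in heights if h < lower_fence or h > upper_fence]
inside = [h for h in heights if lower_fence <= h <= upper_fence]

# The whiskers stop at the most extreme values that are not outliers.
whisker_low, whisker_high = min(inside), max(inside)

print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
print("whiskers:", whisker_low, whisker_high)
```

With these numbers the height 62 falls below the lower fence of 63.5, so the lower whisker would stop at 65 and 62 would be drawn as a separate point.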
Normal Plots (aka. Q-plots)
[Figure: normal quantile plot of Calories in Common Candies; x-axis Theoretical Quantiles (-2 to 2), y-axis Sample Quantiles (100 to 350).]
Q-plots are used to determine
how reasonable it may be to
assume that the sample comes
from a normal distribution. If the
sample comes from a normal
distribution then the points of the
plot should fall along a
straight line, or in the
case where the Q-plot includes a
Q-line, the points should follow
“closely” to the line.
Unfortunately there is no clear
rule for declaring a set of data
normal or not. It takes practice
examining patterns in Q-plots to
recognize “close calls”, but if the
data is strongly skewed it will be
very easy to see the points bend
away from the line.
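For a feel of what a Q-plot is plotting, the sketch below pairs each ordered observation with the standard normal quantile it would be expected to match. It uses the student heights from the Summaries slides as sample data. The plotting positions (i - 0.5)/n are an assumption here; they are one common convention, and software may use a slightly different one.

```python
from statistics import NormalDist, mean, stdev

# Student heights from the Summaries slides, sorted into sample quantiles.
heights = sorted([71, 70, 68, 69, 68, 65, 72, 69, 71, 62])
n = len(heights)

# Theoretical quantiles from the standard normal distribution.
# The (i - 0.5) / n positions are an assumed convention, not from the slides.
theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# For normal data, sample quantile is roughly mean + sd * theoretical quantile.
m, s = mean(heights), stdev(heights)
for z, x in zip(theoretical, heights):
    print(f"theoretical {z:+.2f}   sample {x}   line {m + s * z:.1f}")
```

Plotting the sample column against the theoretical column gives the Q-plot; the "line" column is where the points would sit if the data followed the normal pattern exactly.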
Shapes-Symmetric-Normal
The blue histograms are samples from a population of test grades
that has an average of 65 with a standard deviation of 10. Notice that
the one with more observations begins to look more like the density
curve of a normal distribution (the red line).
[Figure: histograms of 100 Samples ~ Normal(65,10) and 1000 Samples ~ Normal(65,10) with the normal density curve; x-axis test grade (30 to 100), y-axis Density (0.00 to 0.04).]
Shapes-Symmetric-Normal
[Figure: boxplots of 100 Samples ~ N(65,10) and 1000 Samples ~ N(65,10), axis from 30 to 90; normal plots (QQplot of 100 Samples, QQplot of 1000 Samples), x-axis Theoretical Quantiles, y-axis Sample Quantiles.]
Shapes-Symmetric-Uniform
[Figure: histograms (x-axis test grade, y-axis Density), boxplots, and QQplots (x-axis Theoretical Quantiles, y-axis Sample Quantiles) of 100 Samples ~ Uniform(40,95) and 1000 Samples ~ Uniform(40,95).]
Shapes-Skewed- Right and Left
[Figure: example graphs labeled Right Skewed and Left Skewed.]
The other major pattern to
recognize is skew. Think about a
skewer on a barbecue grill.
Everything seems pushed to one side
of the stick. The pattern
in graphs is similar. If the majority
of the data lies on the left, with a long tail
to the right, then the graph is right skewed,
and vice versa.
Shapes- Skewed Left
[Figure: histograms (x-axis Average Speed of Stock Cars, y-axis Density), boxplots, and QQplots (x-axis Theoretical Quantiles, y-axis Sample Quantiles) of 100 Samples Skewed left and 1000 Samples Skewed left.]
Shapes-Skewed Right
[Figure: histograms (x-axis Costs of Meals at Restaurants, y-axis Density), boxplots, and QQplots (x-axis Theoretical Quantiles, y-axis Sample Quantiles) of 100 Samples Skewed right and 1000 Samples Skewed right.]
Summaries
Locations - Mean
Heights of students 71 70 68 69 68 65 72 69 71 62
$$\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + \cdots + x_n}{n} = \frac{71+70+68+69+68+65+72+69+71+62}{10} = 68.5$$
Summaries
Location-Median
Heights of students 71 70 68 69 68 65 72 69 71 62
Ordered heights 62 65 68 68 69 69 70 71 71 72
$$\text{Median} = \tilde{x} = 69$$
If the number of observations is even, the Median is the average of
the middle two numbers. If the number of observations is odd,
then the middle number of the ordered data is the Median.
Heights of male students 65 68 70 71 72
$$\tilde{x} = 70$$
Summaries
Location-Mode, Min, Max
Ordered heights 62 65 68 68 69 69 70 71 71 72
Mode= 68, 69 & 71
The Mode is the most common number. If there is a tie for the
most common number then there is more than one mode. Here 68, 69 and 71
each appear twice.
Min= 62
Max= 72
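The location summaries so far can be checked with a few lines of code. This is only a sketch using Python's standard `statistics` module; the variable names are illustrative. Note that `multimode` reports every value tied for most common.

```python
from statistics import mean, median, multimode

# Data from the Summaries slides.
heights = [71, 70, 68, 69, 68, 65, 72, 69, 71, 62]
male_heights = [65, 68, 70, 71, 72]

print("mean:", mean(heights))
print("median (even n):", median(heights))
print("median (odd n):", median(male_heights))
print("modes:", sorted(multimode(heights)))
print("min, max:", min(heights), max(heights))
```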
Summaries Location- Quartiles
Ordered heights 62 65 68 68 69 69 70 71 71 72
1st Quartile = 68
3rd Quartile = 71
To find the 1st and 3rd Quartiles you consider the data separately to the left and
to the right of the median. The median is the 2nd Quartile. The 1st Quartile
is the middle number (or the average of the two middle numbers if the subset has
an even count) between the minimum and the median. The 3rd Quartile is calculated
the same way, using the max in place of the min.
Technical note: Include the median when finding the 1st or 3rd Quartile if
the number of observations is odd.
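The quartile recipe above can be sketched in code as follows. The function name `quartiles` is illustrative, and the rule for odd counts follows the technical note (the median is included in both halves). Statistical software often interpolates instead, so it can report slightly different quartiles for the same data.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the split-at-the-median method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 0:
        # Even count: the two halves do not share any observation.
        lower, upper = data[:half], data[half:]
    else:
        # Odd count: include the median in both halves (technical note).
        lower, upper = data[:half + 1], data[half:]
    return median(lower), median(data), median(upper)

heights = [62, 65, 68, 68, 69, 69, 70, 71, 71, 72]
q1, q2, q3 = quartiles(heights)
print("Q1, median, Q3:", q1, q2, q3)
```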
Comparing Means and Medians
Notice the blue and red lines on the distribution graphs below. The blue
line represents the mean and the red line represents the median. This
demonstrates how whenever data becomes skewed the mean is affected
more than the median. The bottom graph shows how the mean and median
are about the same on a normal distribution.
[Figure: three distribution graphs (Right Skewed, Left Skewed, Normal Distribution) with the mean and median marked; on the Normal Distribution the median and mean are the same.]
Z - Scores
Suppose we are given a set of data that has a normal distribution.
Given that we already know the mean and the standard deviation, we want to
find precisely how many standard deviations a certain value is from the mean.
That number is called a z-score. The equation is:
$$z = \frac{x - \mu}{\sigma}$$
Why is the z-score useful to us? If we compare our z-score to
the 68-95-99.7 rule we can learn roughly what percentage of values lies
closer to, or farther from, the mean than our value. Suppose we had a z-score of 1.5.
Since 68% of the values lie within one standard deviation of the mean, more than
68% of the values are closer to the mean than ours, meaning that
fewer than 32% of values lie at least as far from the mean as ours.
Now consider a value with a z-score of -2.5, meaning that it
is 2 and 1/2 standard deviations to the left of the mean. It falls
between the 95% and 99.7% bands, which means that fewer than 5% of values,
but more than 0.3%, lie at least this far from the mean. We can look up our z-score in
a table of “Standard Normal Probabilities” to find the exact
percentage.
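As a rough illustration of the z-score formula and the table lookup, the sketch below uses Python's `statistics.NormalDist` in place of a printed table. The mean of 65 and standard deviation of 10 are the test-grade values from the Shapes slides; the grade of 80 is a made-up example value.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

# Hypothetical grade of 80 on a test with mean 65 and standard deviation 10.
z = z_score(80, 65, 10)

# The standard normal cdf plays the role of the probability table:
# it gives the share of values that fall below a given z-score.
below = NormalDist().cdf(z)
below_neg = NormalDist().cdf(-2.5)

print(f"z = {z}")
print(f"share of values below z = 1.5: {below:.4f}")
print(f"share of values below z = -2.5: {below_neg:.4f}")
```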
Z-Scores
Based on the standard deviation, Z-Scores are used to determine
how far away a value is from the mean. A Z-Score of 1 corresponds to
one standard deviation from the mean. The 68-95-99.7 rule is helpful in
determining what the value of a z-score really means. Figure 2 is a density
curve demonstrating what is meant by the 68-95-99.7 rule. The area under
the blue contains 68% of the data. Where the blue ends is where z = 1 or z
= -1. The red plus the blue contains 95% of the data, with the outer edges
being z = 2 or z = -2. Likewise, adding the green brings the total to 99.7%
of the whole data. If we had a z-score of 0.5 we know that our number is
somewhere in the blue. A z-score of 2.5 would lie somewhere in the green.
[Figure 2: density curve of z-scores. Blue: 68%. Blue & Red: 95%. Blue, Red & Green: 99.7%.]
When to use 68-95-99.7 rule
NORMAL
Value    Frequency   Percent   Valid Percent   Cumulative Percent
-2.00        1          2.0         2.0               2.0
 0.00        2          4.0         4.0               6.0
 4.00        1          2.0         2.0               8.0
 5.00        4          8.0         8.0              16.0
 6.00        6         12.0        12.0              28.0
 7.00        4          8.0         8.0              36.0
 8.00        4          8.0         8.0              44.0
 9.00        6         12.0        12.0              56.0
10.00        3          6.0         6.0              62.0
11.00        5         10.0        10.0              72.0
12.00        2          4.0         4.0              76.0
13.00        2          4.0         4.0              80.0
14.00        2          4.0         4.0              84.0
15.00        1          2.0         2.0              86.0
16.00        1          2.0         2.0              88.0
17.00        4          8.0         8.0              96.0
18.00        1          2.0         2.0              98.0
19.00        1          2.0         2.0             100.0
Total       50        100.0       100.0
Statistics (NORMAL)
N Valid            50
N Missing           0
Mean                9.4200
Std. Deviation      4.6995
Percentiles 25      6.0000
Percentiles 50      9.0000
Percentiles 75     12.2500
When do we use the Empirical Rule? It is better to make a decision based on
graphs (histograms, boxplots, Q-plots), but if all we are given is the above
we can notice some features of the distribution by observing the
frequency column. The tallies need to be somewhat low at the top and
bottom of this column with the data building up near the middle. Notice that for
this example this is what we have. If this pattern is apparent it is then
necessary to compare the standard deviation of the data with the percentiles.
If the data is normal then the interval within one standard deviation of the mean
should contain about 68% of our data. According to the table, the middle 68% of the
data (from 16% to 84% cumulative) lies between 5 and 14, for a length of 9. The
standard deviation is approximately 4.7, so the empirical rule says that we expect
this length to be about 2 * 4.7 = 9.4. Since 9 is close to 9.4, we conclude
that it is reasonable to treat the data as having a Normal distribution.
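The comparison just described can be sketched in code. The frequencies below are copied from the NORMAL frequency table; the helper `value_at_cumulative_percent` is a made-up name for reading a value off the Cumulative Percent column. This is one way to carry out the check, not the only one.

```python
from statistics import mean, stdev

# value: frequency, copied from the NORMAL frequency table.
freq = {-2: 1, 0: 2, 4: 1, 5: 4, 6: 6, 7: 4, 8: 4, 9: 6, 10: 3,
        11: 5, 12: 2, 13: 2, 14: 2, 15: 1, 16: 1, 17: 4, 18: 1, 19: 1}

# Expand the table back into the 50 individual observations.
data = [value for value, count in freq.items() for _ in range(count)]
n = len(data)

def value_at_cumulative_percent(pct):
    """Smallest value whose cumulative percent reaches pct."""
    running = 0
    for value in sorted(freq):
        running += freq[value]
        if running * 100 >= pct * n:
            return value

# The middle 68% of the data runs from 16% to 84% cumulative.
low = value_at_cumulative_percent(16)
high = value_at_cumulative_percent(84)
s = stdev(data)

print(f"mean = {mean(data):.2f}, s = {s:.4f}")
print(f"middle 68% runs from {low} to {high}, length {high - low}")
print(f"empirical rule expects about 2 * s = {2 * s:.2f}")
```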
Empirical Rule usage
UNIFORM
Value    Frequency   Percent   Valid Percent   Cumulative Percent
-4.00        1          2.0         2.0               2.0
-2.00        3          6.0         6.0               8.0
-1.00        3          6.0         6.0              14.0
 0.00        2          4.0         4.0              18.0
 1.00        1          2.0         2.0              20.0
 2.00        1          2.0         2.0              22.0
 3.00        1          2.0         2.0              24.0
 4.00        1          2.0         2.0              26.0
 5.00        3          6.0         6.0              32.0
 6.00        4          8.0         8.0              40.0
 7.00        1          2.0         2.0              42.0
 8.00        1          2.0         2.0              44.0
10.00        3          6.0         6.0              50.0
11.00        1          2.0         2.0              52.0
12.00        1          2.0         2.0              54.0
13.00        2          4.0         4.0              58.0
14.00        3          6.0         6.0              64.0
15.00        1          2.0         2.0              66.0
16.00        5         10.0        10.0              76.0
18.00        1          2.0         2.0              78.0
19.00        1          2.0         2.0              80.0
20.00        4          8.0         8.0              88.0
21.00        1          2.0         2.0              90.0
22.00        2          4.0         4.0              94.0
24.00        2          4.0         4.0              98.0
25.00        1          2.0         2.0             100.0
Total       50        100.0       100.0
Statistics (UNIFORM)
N Valid            50
N Missing           0
Mean               10.4400
Std. Deviation      8.3426
Percentiles 25      3.7500
Percentiles 50     10.5000
Percentiles 75     16.5000
Once again the two things we need to check for:
-pattern of the tallies
-68% Interval
Here we see that the frequency column does not build up in the middle: the
tallies near the top and bottom are about the same as, or bigger
than, the tallies in the center. But to be safe we
compare the 68% interval with the standard
deviation. The lower bound of the interval (16% cumulative) is
between -1 and 0, and the upper bound (84% cumulative) is between 19
and 20. Therefore the length of the interval is
between 19 and 21. With the empirical rule we
would expect this length to be around 2 * 8.34 =
16.68. Because 16.68 is clearly smaller than
either 19 or 21, we conclude that the data is
not normal.
Spreads- Variance
Variance is a number that describes how much the data “varies”. The reason
for the two different formulas below is that the first one is used if
we have the whole population and know its mean. The second equation is used for
a sample. It divides by n – 1 because deviations measured from the sample mean
tend to be smaller than deviations from the population mean, so dividing by n
would underestimate the population variance. However, as n gets large there
is very little difference between these two equations.
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{n} = \frac{(x_1 - \mu)^2 + \cdots + (x_n - \mu)^2}{n}$$
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
Spreads- Standard Deviation
The Standard Deviation is just the square root of the
variance. A distance of one standard deviation from the mean
is exactly the same as a Z-score of one. Once again the
difference between the two formulas below is whether the data is the
population or a sample from a population.
$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$$
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
Spread-Calculation of variance and standard
deviation.
Heights of male students
65 68 70 71 72
$$\bar{x} = 69.2$$
$$s^2 = \frac{(65-69.2)^2 + (68-69.2)^2 + (70-69.2)^2 + (71-69.2)^2 + (72-69.2)^2}{5 - 1} = 7.7$$
$$s = \sqrt{7.7} = 2.77$$
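The same calculation can be done step by step in code and then compared with the library functions. This is a sketch using Python's standard `statistics` module, where `variance` and `stdev` divide by n – 1 (sample) and `pvariance` divides by n (population).

```python
from statistics import mean, variance, stdev, pvariance

male_heights = [65, 68, 70, 71, 72]

# Step by step: deviations from the mean, squared, summed, divided by n - 1.
x_bar = mean(male_heights)
squared_devs = [(x - x_bar) ** 2 for x in male_heights]
s2_by_hand = sum(squared_devs) / (len(male_heights) - 1)

print("mean:", x_bar)
print("sample variance by hand:", round(s2_by_hand, 2))
print("sample variance (n - 1):", round(variance(male_heights), 2))
print("sample standard deviation:", round(stdev(male_heights), 2))
print("population variance (n):", round(pvariance(male_heights), 2))
```

Dividing by n instead of n – 1 gives 6.16 rather than 7.7, which shows how much the choice matters when n is as small as 5.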
Summaries-Range and IQR
Ordered heights 62 65 68 68 69 69 70 71 71 72
1st Quartile = 68
3rd Quartile = 71
Range = Max – Min = 72 – 62 = 10
Inter-Quartile-Range (IQR) = 3rd Quartile – 1st Quartile
= 71 – 68 = 3
Transformations
A Transformation is when each value of a data set is placed into the same
function. For example if we add a number n to every observation we will
have a transformed data set that is shifted n-units. If we multiply or divide
every observation by the same number then the data set will have a new
scale.
If you are given a mean, $\bar{x}$ (or $\mu$), and a standard deviation, $s$ (or $\sigma$), and want to
convert your data so you have a new mean, $\bar{x}_{new}$ (or $\mu_{new}$), and new
standard deviation, $s_{new}$ (or $\sigma_{new}$), all you need to remember is what shift
and scale changes affect. In our linear transformation formula:
$$x_{new} = a + bx$$
where a is the shift and b is the scale.
Transformation
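Standard deviations are only affected by scale changes, but means are affected by both
shift and scale changes. This means that:
$$\bar{x}_{new} = \text{shift} + \text{scale} \cdot \bar{x}$$
$$s_{new} = \text{scale} \cdot s$$
For example, suppose College Station has an average annual temperature of 72
degrees Fahrenheit with a standard deviation of 10 degrees. We want to know what these
statistics are in Celsius. The formula for Celsius is:
$$\text{Celsius} = \frac{5}{9}(\text{Fahrenheit} - 32) = -\frac{160}{9} + \frac{5}{9}\,\text{Fahrenheit}$$
so the shift is $-160/9 \approx -17.78$ and the scale is $5/9$.
$$\bar{x}_{new} = -\frac{160}{9} + \frac{5}{9}(72) = 22.2$$
$$s_{new} = \frac{5}{9}(10) = 5.556$$
$$\bar{x}_{Celsius} = 22.2 \qquad s_{Celsius} = 5.556$$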
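The temperature example can be checked with a short sketch. The function name `transform_stats` is made up for illustration; it applies the rule that the mean takes both the shift and the scale while the standard deviation takes only the scale. Writing Celsius = (5/9)(Fahrenheit - 32) in the form a + bx gives a shift of -160/9 and a scale of 5/9.

```python
def transform_stats(mean, sd, shift, scale):
    """Mean and standard deviation after x_new = shift + scale * x."""
    new_mean = shift + scale * mean      # mean: shifted and scaled
    new_sd = abs(scale) * sd             # spread: scaled only
    return new_mean, new_sd

# Celsius = (5/9) * (Fahrenheit - 32) = -160/9 + (5/9) * Fahrenheit
scale = 5 / 9
shift = -32 * 5 / 9

mean_c, sd_c = transform_stats(72, 10, shift, scale)
print(f"mean in Celsius: {mean_c:.1f}")
print(f"standard deviation in Celsius: {sd_c:.3f}")
```

As a sanity check, 72 degrees Fahrenheit is about 22.2 degrees Celsius.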
Transformations- Shifts
Suppose we discover that a measuring instrument was off by 3 inches because
someone was measuring from the top of the shoe to the head. Obviously the
given heights would not be the true heights of the subjects. If we assume every subject’s
shoes were the same height of 3 inches then we can fix the data appropriately with
the equation:
$$x_{new} = x_i + 3$$
Ordered heights 62 65 68 68 69 69 70 71 71 72
Shifted heights 65 68 71 71 72 72 73 74 74 75
Notice what this does to the following statistics.
$$\bar{x}_{new} = \bar{x} + 3$$
$$s^2_{new} = s^2$$
$$\text{range} = 10 \qquad \text{IQR} = 3$$
$$\min x_{new} = \min x + 3$$
$$Q1_{new} = Q1 + 3$$
$$\tilde{x}_{new} = \tilde{x} + 3$$
$$Q3_{new} = Q3 + 3$$
$$\max x_{new} = \max x + 3$$
What we see from this is that a
shift change adds or subtracts
the same amount from every
statistic that is not related to
spread. The statistics that
describe the spread (e.g. $s^2$, range and
IQR) are not affected by the shift.
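A quick way to convince yourself of the shift rule is to compute the statistics before and after the shift. This is a minimal sketch using the heights from the slide; the variable names are illustrative.

```python
from statistics import mean, median, variance

heights = [62, 65, 68, 68, 69, 69, 70, 71, 71, 72]
shifted = [h + 3 for h in heights]   # add the 3 inch shoe height back

# Location statistics move by exactly 3.
print("mean:", mean(heights), "->", mean(shifted))
print("median:", median(heights), "->", median(shifted))
print("min:", min(heights), "->", min(shifted))
print("max:", max(heights), "->", max(shifted))

# Spread statistics do not change at all.
print("variance:", round(variance(heights), 3), "->", round(variance(shifted), 3))
print("range:", max(heights) - min(heights), "->", max(shifted) - min(shifted))
```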
Transformations - Scale
Going back to our original subjects, for whom we have heights, suppose that instead
of inches we wanted to know how tall everyone was in cm. 2.54 cm = 1 inch.
Therefore our new data is as follows:
Ordered heights 62 65 68 68 69 69 70 71 71 72
Heights in cm 157.48 165.10 172.72 172.72 175.26 175.26 177.80 180.34 180.34 182.88
$$\bar{x}_{new} = 174$$
$$s_{new} = 7.69$$
$$\text{Range}_{new} = 25.4$$
$$\text{IQR}_{new} = 7.62$$
$$\min x_{new} = 157.48$$
$$Q1_{new} = 172.7$$
$$\tilde{x}_{new} = 175.3$$
$$Q3_{new} = 180.3$$
$$\max x_{new} = 182.88$$
Unlike with the shifts,
notice that every single
one of these statistics is
affected by the scale
change: each is multiplied by 2.54.
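The scale rule can be checked the same way. The sketch below converts the heights to centimeters and compares each statistic with 2.54 times its value in inches; the variable names are illustrative.

```python
from statistics import mean, median, stdev

heights = [62, 65, 68, 68, 69, 69, 70, 71, 71, 72]
cm = [round(h * 2.54, 2) for h in heights]   # 2.54 cm = 1 inch

# Both location and spread statistics are multiplied by the scale factor.
print("mean:", round(mean(cm), 2), "vs", round(2.54 * mean(heights), 2))
print("median:", round(median(cm), 2), "vs", round(2.54 * median(heights), 2))
print("std dev:", round(stdev(cm), 2), "vs", round(2.54 * stdev(heights), 2))
print("range:", round(max(cm) - min(cm), 2), "vs", 2.54 * 10)
```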