Basic Quantitative Methods in
the Social Sciences
(AKA Intro Stats)
02-250-01
Lecture 3
Variation
• Variability: The extent to which the numbers in a
data set are dissimilar (different) from
each other
• When all elements measured receive
the same scores (e.g., everyone in the
data set is the same age, in years),
there is no variability in the data set
• As the scores in a data set become
more dissimilar, variability increases
Variation: Range
• The range tells us the span over which
the data are distributed, and is only a
very rough measure of variability
• Range: The difference between the
maximum and minimum scores
Example: The youngest student in a class is
19 and the oldest is 46. Therefore, the age
range of the class is 46 – 19 = 27 years.
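The range calculation can be sketched in a few lines of Python. Only the youngest (19) and oldest (46) ages come from the example; the other ages in the list, and the function name `data_range`, are made up for illustration.

```python
# Ages in a hypothetical class. Only 19 and 46 come from the example;
# the remaining values are invented for illustration.
ages = [19, 22, 25, 31, 46]


def data_range(scores):
    """Return the range: the maximum score minus the minimum score."""
    return max(scores) - min(scores)


age_range = data_range(ages)
print(age_range)  # 27
```

Note that the three middle ages have no effect on the result, which is why the range is only a rough measure of variability.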
Variation
$X$      $X - \bar{X}$
5        0.00
5        0.00
5        0.00
5        0.00
5        0.00
$\sum X = 25$, $n = 5$, $\bar{X} = 5$
This is an example of data
with NO variability
Variation
$X$      $X - \bar{X}$
6        +1.00
4        -1.00
6        +1.00
5        0.00
4        -1.00
$\sum X = 25$, $n = 5$, $\bar{X} = 5$
This is an example of data
with low variability
Variation
$X$      $X - \bar{X}$
8        +3.00
1        -4.00
9        +4.00
5        0.00
2        -3.00
$\sum X = 25$, $n = 5$, $\bar{X} = 5$
This is an example of data
with higher variability
Note:
• Let’s say we wanted to figure out the average
deviation from the mean. Normally, we would want to
sum all deviations from the mean and then divide by
n, i.e.,
$$\frac{\sum (X - \bar{X})}{n}$$
• (recall: look at your formula for the mean from last
lecture)
• BUT: We have a problem. $\sum (X - \bar{X})$ will always add
up to zero
Variation
• However, if we square each of the deviations
from the mean, we obtain a sum that is not
equal to zero
• This is the basis for the measures of variance
and standard deviation, the two most
common measures of variability of data
Variation
$X$      $X - \bar{X}$      $(X - \bar{X})^2$
8        +3.00              9.00
1        -4.00              16.00
9        +4.00              16.00
5        0.00               0.00
2        -3.00              9.00
$\sum X = 25$      $\sum (X - \bar{X}) = 0.00$      $\sum (X - \bar{X})^2 = 50.00$
Note: $\sum (X - \bar{X})^2$ is called the Sum of Squares
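As a minimal sketch, the Sum of Squares for the scores 8, 1, 9, 5, 2 can be reproduced step by step in Python (the variable names are illustrative):

```python
scores = [8, 1, 9, 5, 2]
mean = sum(scores) / len(scores)  # 5.0

# Square each deviation from the mean, then add the squares up.
squared_deviations = [(x - mean) ** 2 for x in scores]
sum_of_squares = sum(squared_deviations)

print(squared_deviations)  # [9.0, 16.0, 16.0, 0.0, 9.0]
print(sum_of_squares)      # 50.0
```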
Variance of a Population
• VARIANCE OF A POPULATION: the
sum of squared deviations from the
mean divided by the number of scores
(sigma squared):
$$\sigma^2 = \frac{\sum (X - \mu)^2}{n}$$
Population Standard Deviation
Square root of the variance
$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{n}}$$
Sample Variance
• the sum of squared deviations from the
mean divided by the number of
degrees of freedom, n-1 (an estimate of
the population variance)
$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$
Sample Standard Deviation
• Square root of the variance $s^2$
$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$
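The four definitional formulae differ only in the denominator (n or n-1) and in whether a square root is taken. The following Python sketch makes that explicit; the function names are illustrative, and the data are the scores 8, 1, 9, 5, 2 used on the earlier slides.

```python
import math


def sum_of_squares(scores):
    """Sum of squared deviations from the mean."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores)


def population_variance(scores):
    """Sum of squares divided by the number of scores, n."""
    return sum_of_squares(scores) / len(scores)


def sample_variance(scores):
    """Sum of squares divided by the degrees of freedom, n - 1."""
    return sum_of_squares(scores) / (len(scores) - 1)


def population_sd(scores):
    return math.sqrt(population_variance(scores))


def sample_sd(scores):
    return math.sqrt(sample_variance(scores))


scores = [8, 1, 9, 5, 2]
print(population_variance(scores))      # 10.0
print(sample_variance(scores))          # 12.5
print(round(population_sd(scores), 2))  # 3.16
print(round(sample_sd(scores), 2))      # 3.54
```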
Why use Standard Deviation and
not Variance!??!
• Normally, you will only calculate variance in
order to calculate standard deviation, as
standard deviation is what we typically want
• Why? Because standard deviation expresses
variability in the same units as the data
• Example: Standard deviation of ages in a class
is 3.7 years
Variance
• The above formulae are definitional - they
are the mathematical representation of the
concepts of variance and standard deviation
• When calculating variance and standard
deviation (especially when doing so by hand)
the following computational formulae are
easiest to use (trust us, they really are easier
to use. You should however have a good
understanding of the definitional formulae):
Population Variance
• Computational Formula:
$$\sigma^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n}$$
Population Standard Deviation
• Computational Formula:
$$\sigma = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n}}$$
Sample Variance
• Computational Formula:
$$s^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}$$
Sample Standard Deviation
• Computational Formula:
$$s = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}}$$
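A short derivation, offered here as a sketch, shows why the computational formulae give the same result as the definitional ones. It expands the Sum of Squares and substitutes $\bar{X} = \sum X / n$:

$$
\begin{aligned}
\sum (X - \bar{X})^2 &= \sum X^2 - 2\bar{X}\sum X + n\bar{X}^2 \\
&= \sum X^2 - 2\,\frac{(\sum X)^2}{n} + \frac{(\sum X)^2}{n} \\
&= \sum X^2 - \frac{(\sum X)^2}{n}
\end{aligned}
$$

Dividing this quantity by n gives the population variance, and dividing it by n-1 gives the sample variance.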
Sample Standard Deviation Example
Data:
$X$      $X^2$
8        64
1        1
9        81
5        25
2        4
$\sum X = 25$      $\sum X^2 = 175$
$n = 5$, $\bar{X} = 5$
$$s^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}$$
$$s^2 = \frac{175 - \frac{(25)^2}{5}}{4}$$
$$s^2 = 12.50$$
$$s = \sqrt{12.50}$$
$$s = 3.54$$
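The same calculation can be sketched in Python using the computational formula. The function name is illustrative; the final line cross-checks the result against `statistics.stdev` from the standard library, which also divides by n-1.

```python
import math
import statistics


def sample_variance_computational(scores):
    """Sample variance: (sum of X^2 - (sum of X)^2 / n) / (n - 1)."""
    n = len(scores)
    sum_x = sum(scores)
    sum_x_squared = sum(x ** 2 for x in scores)
    return (sum_x_squared - sum_x ** 2 / n) / (n - 1)


scores = [8, 1, 9, 5, 2]
s2 = sample_variance_computational(scores)  # (175 - 625 / 5) / 4
s = math.sqrt(s2)

print(s2)           # 12.5
print(round(s, 2))  # 3.54
print(math.isclose(s, statistics.stdev(scores)))  # True
```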
Computing Standard Deviation
• When calculating standard deviation, create a
table that looks like this:
$X$        $X^2$
$X_1$      $X_1^2$
$X_2$      $X_2^2$
$X_3$      $X_3^2$
$X_4$ …    $X_4^2$ …
$\sum X =$      $\sum X^2 =$

Example:
$X$      $X^2$
2        4
4        16
7        49
9        81
$\sum X = 22$      $\sum X^2 = 150$
Computing Standard Deviation
• The values are then entered into the formula as
follows:
$(\sum X)^2 = 22^2 = 484$
$\sum X^2 = 150$
$n = 4$
$n - 1 = 3$
$$s = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}}$$
Computing Standard Deviation
• The values are then entered into the formula as
follows:
$(\sum X)^2 = 22^2 = 484$
$\sum X^2 = 150$
$n = 4$
$n - 1 = 3$
$$s = \sqrt{\frac{150 - \frac{484}{4}}{3}}$$
Computing Standard Deviation
• The values are then entered into the formula
as follows:
$$s = \sqrt{\frac{150 - 121}{3}}$$
$$s = \sqrt{\frac{29}{3}}$$
$$s = \sqrt{9.6667}$$
$$s = 3.1091$$
$$s \approx 3.11$$
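The table-then-formula procedure can be followed step by step in Python. This is only a sketch of the hand calculation for the scores 2, 4, 7, 9; the variable names are illustrative.

```python
import math

scores = [2, 4, 7, 9]

# Step 1: build the X and X^2 columns of the table.
table = [(x, x ** 2) for x in scores]
for x, x_squared in table:
    print(f"{x:>3} {x_squared:>4}")

# Step 2: total each column.
n = len(scores)
sum_x = sum(x for x, _ in table)                   # 22
sum_x_squared = sum(x_sq for _, x_sq in table)     # 150

# Step 3: substitute into the computational formula.
variance = (sum_x_squared - sum_x ** 2 / n) / (n - 1)  # (150 - 121) / 3
s = math.sqrt(variance)

print(round(variance, 4))  # 9.6667
print(round(s, 4))         # 3.1091
print(round(s, 2))         # 3.11
```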
Degrees of Freedom
• Degrees of Freedom: The number of
independent observations, or, the number of
observations that are free to vary
• In our data example above, there are 5
numbers that total 25 ($\sum X = 25$, $n = 5$)
Degrees of Freedom
• Many combinations of numbers can total 25, but
only the first 4 can be any value
• The 5th number cannot vary if $\sum X = 25$
• This example has 4 degrees of freedom, as four of
the five numbers are free to vary
• A standard deviation computed from a sample with n in the
denominator usually underestimates the population standard
deviation. Using n-1 in the
denominator corrects for this and gives us a better
estimate of the population standard deviation.
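One way to see the effect of the n-1 correction is to enumerate every possible sample from a tiny population. The sketch below rests on several assumptions made purely for illustration: it treats the five scores 8, 1, 9, 5, 2 as the entire population, draws every ordered sample of size 2 with replacement, and compares variances rather than standard deviations, because it is the variance estimate that the n-1 denominator makes correct on average under this sampling scheme.

```python
import itertools


def variance(scores, denominator):
    """Sum of squared deviations from the mean, divided by `denominator`."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores) / denominator


population = [8, 1, 9, 5, 2]  # assumed to be the whole population
true_variance = variance(population, len(population))  # 50 / 5

# Every ordered sample of size 2, drawn with replacement (25 samples).
sample_size = 2
samples = list(itertools.product(population, repeat=sample_size))

# Average each estimate over all possible samples.
mean_with_n = sum(variance(s, sample_size) for s in samples) / len(samples)
mean_with_n_minus_1 = (
    sum(variance(s, sample_size - 1) for s in samples) / len(samples)
)

print(true_variance)        # 10.0
print(mean_with_n)          # 5.0  (dividing by n underestimates)
print(mean_with_n_minus_1)  # 10.0 (dividing by n - 1 matches on average)
```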
Degrees of Freedom
• Degrees of freedom are usually
n-1
(the total # of data points minus one)
Time for an example
• Seven people were asked to rate the taste of
McDonald's french fries on a scale of 1 to 10.
Their ratings are as follows:
8, 4, 6, 2, 5, 7, 7
Calculate the population standard deviation
Calculate the sample variance
Class discussion: When would this be a
population, and when would it be a sample?
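After working the example by hand, the answers can be checked with Python's standard `statistics` module: `pstdev` divides by n (population) and `variance` divides by n-1 (sample). This is a checking aid rather than a substitute for the hand calculation.

```python
import statistics

ratings = [8, 4, 6, 2, 5, 7, 7]

# Treating the seven raters as the whole population: divide by n.
population_sd = statistics.pstdev(ratings)

# Treating them as a sample from a larger population: divide by n - 1.
sample_variance = statistics.variance(ratings)

print(round(population_sd, 2))    # 1.92
print(round(sample_variance, 2))  # 4.29
```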
Why is Standard Deviation so
Important?
• What does the standard deviation really
tell us?
• Why would a sample’s standard
deviation be small?
• Why would a sample’s standard
deviation be large?
An Example
• You’re sitting in the CAW Student
Centre with 4 of your friends. A
member of the opposite sex walks by,
and you and your friends rate this
person’s attractiveness on a scale from
1 to 10 (where 1=very unattractive and
10=drop dead gorgeous)
Food for thought
• 1) What would it mean if all five of you rated this person a 9
on 10?
• 2) What would it mean if all five of you rated this person a 5
on 10?
• 3) What would it mean if the five of you produced the following
ratings: 1, 10, 2, 9, and 3 (note that the mean rating would be
5)?
• Why would scenario #3 happen instead of scenario #2? What
factors would lead to these different ratings?
• These questions form the basis of why statisticians like to
“explain variability”
An In-Depth Look at Scenario #3
• So if the five of you produced the
following ratings: 1, 10, 2, 9, and 3,
what is the standard deviation of these
ratings?
• Calculate!
• What is the standard deviation in
Scenario #2? Calculate!
Normal Distribution
• The normal distribution is a theoretical
distribution
• “Normal” does not mean typical or
average, it is a technical term given to
this mathematical function
• The normal distribution is unimodal and
symmetrical, and is often referred to as
the Bell Curve
Normal Distribution
[Figure: normal curve with the mean, median, and mode marked at its center]
Normal Distribution
• We study the normal distribution
because many naturally occurring
events yield a distribution that
approximates the normal distribution
Properties of Area Under the
Normal Distribution
• One of the properties of the Normal
Distribution is the fixed area under the
curve
• If we split the distribution in half, 50%
of the scores of the sample lie to the
left of the mean (or median, or mode),
and 50% of the scores lie to the right of
the mean (or median, or mode)
Properties of Area Under the
Normal Distribution
• The mean, median, and mode always
cut the Normal Distribution in half, and
are equal since the Normal Distribution
is unimodal and symmetrical:
Properties of Area Under the
Normal Distribution
[Figure: normal curve split at the mean, median, and mode, with 50% of scores on each side]
Properties of Area Under the
Normal Distribution
• The entire area under the normal curve
can be considered to be a proportion of
1.0000
• Thus, half, or .5000 of the scores lie in
the bottom half (i.e., left of the mean)
of the distribution, and half, or .5000 of
the scores lie in the top half (i.e., right
of the mean)
Properties of Area Under the
Normal Distribution
[Figure: normal curve split at the mean, median, and mode, with .5000 of scores on each side]
Z-scores
• Z-Scores (or standard scores) are a way of
expressing a raw score’s place in a distribution
• Z-score formula:
$$z = \frac{X - \mu}{\sigma}$$
Z-scores
• The mean $\mu$ and standard deviation $\sigma$ are always
notated in Greek letters
• Z-scores only reflect the data points’ position relative
to the overall data set (so you’re now considering the
data as a population, as you’re not looking to infer to
a greater population)
• This means use the population formula for standard
deviation rather than the sample formula whenever
you calculate Z
Z-scores
• A z-score is a better indicator of where
your score falls in a distribution than a
raw score
• A student could get a 75/100 on a test
(75%) and consider this to be a very
high score
Z-scores
• If the average of the class marks is 89 and the
(population) standard deviation is 5.2, then the
z-score for a mark of 75 would be:
$\mu = 89$, $\sigma = 5.2$
$$z = \frac{X - \mu}{\sigma}$$
z = (75-89)/5.2
z = (-14)/5.2
z = -2.69
Z-scores
• This means that a mark of 75% is
actually 2.69 standard deviations
BELOW the mean
• The student would have done poorly on
this test, as compared to the rest of the
class
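The z-score formula translates directly into a one-line function. The sketch below reproduces the class-mark example; the function and parameter names are illustrative.

```python
def z_score(x, mu, sigma):
    """Number of population standard deviations that x lies from the mean."""
    return (x - mu) / sigma


z = z_score(75, mu=89, sigma=5.2)
print(round(z, 2))  # -2.69
```

The negative sign shows that the mark is below the class mean.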
Z-scores
• z = 0 represents the mean score (which
would be 89 in this example)
• z < 0 represents a score less than the
mean (which would be less than 89)
• z > 0 represents a score greater than
the mean (which would be greater than
89)
Z-scores
• For any set of scores, the z-scores will:
sum to zero
($\sum Z = 0.00$)
have a mean equal to zero
($\mu_Z = 0.00$)
and have a standard deviation equal to one
($\sigma_Z = 1.00$)
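These three properties can be verified numerically. The sketch below converts the scores 8, 1, 9, 5, 2 to z-scores using the population standard deviation and then checks each property; the choice of data set is an assumption made for illustration.

```python
import math

scores = [8, 1, 9, 5, 2]
n = len(scores)
mu = sum(scores) / n
sigma = math.sqrt(sum((x - mu) ** 2 for x in scores) / n)  # population SD

# Convert every raw score to a z-score.
z_scores = [(x - mu) / sigma for x in scores]

z_sum = sum(z_scores)
z_mean = z_sum / n
z_sd = math.sqrt(sum((z - z_mean) ** 2 for z in z_scores) / n)

# Floating-point rounding can leave tiny errors, so compare with a tolerance.
print(math.isclose(z_sum, 0.0, abs_tol=1e-9))   # True
print(math.isclose(z_mean, 0.0, abs_tol=1e-9))  # True
print(math.isclose(z_sd, 1.0))                  # True
```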
Z-scores
• A z-score expresses the position of the
raw score above or below the mean in
standard deviation sized units
• E.g.,
z = +1.50 means that the raw score is 1
and one-half standard deviations above
the mean
z = -2.00 means that the raw score is 2
standard deviations below the mean
Z-score Example
• If you write two exams, in Math and English,
and get the following scores:
Math 70% (class $\mu = 55$, $\sigma = 10$)
English 60% (class $\mu = 50$, $\sigma = 5$)
• Which test mark represents the better
performance (relative to the class)?
Z-score Example cont.
• Math mark:
z = (70-55)/10
z = +1.50
• English mark:
z = (60-50)/5
z = +2.00
$$z = \frac{X - \mu}{\sigma}$$
Z-score Example Illustration
[Figure: normal curve with Z=0.00 at the mean, and Z=1.50 and Z=2.00 marked to its right]
The Answer
• Because: Z = +2.00 is greater than Z =
+1.50, the English class mark of 60%
reflects a better performance relative to
that class than does the Math class
mark of 70%
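The comparison can be sketched in Python by converting each mark to a z-score and picking the largest. The dictionary layout is an illustrative choice; the numbers are those of the Math and English example.

```python
# Each entry: (mark, class mean, class standard deviation).
exams = {
    "Math": (70, 55, 10),
    "English": (60, 50, 5),
}

# Convert each mark to a z-score relative to its own class.
z_scores = {
    name: (mark - mu) / sigma for name, (mark, mu, sigma) in exams.items()
}

# The better relative performance is the mark with the larger z-score.
best = max(z_scores, key=z_scores.get)

print(z_scores)  # {'Math': 1.5, 'English': 2.0}
print(best)      # English
```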
Z-score: Solving for X
• The z-score formula can be rearranged
to solve for X:
$$z = \frac{X - \mu}{\sigma} \quad\Rightarrow\quad X = (z)(\sigma) + \mu$$
Z-scores: Solving for X
• This formula is used when you know
the z-score of a data point, and want to
solve for the raw score.
Example
• E.g., if a class midterm exam has $\mu = 65$ and $\sigma = 5$, what
exam mark has a z-score value of 1.25?
$$X = (z)(\sigma) + \mu$$
X = (1.25)(5) + 65
= 6.25 + 65
= 71.25
So, a person whose test is 1.25 standard deviations above
the mean obtained a score of 71.25%
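Both directions of the calculation can be sketched as a pair of Python functions, which also makes it easy to confirm that one undoes the other. The function names are illustrative.

```python
def raw_score(z, mu, sigma):
    """Rearranged z-score formula: X = (z)(sigma) + mu."""
    return z * sigma + mu


def z_score(x, mu, sigma):
    """Original z-score formula: z = (X - mu) / sigma."""
    return (x - mu) / sigma


x = raw_score(1.25, mu=65, sigma=5)
print(x)                           # 71.25
print(z_score(x, mu=65, sigma=5))  # 1.25
```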
Z-scores
• Z-score problems ask you to solve for X
or solve for z
• Review both types of problems!