Transcript Slide 1

Descriptive Statistics

Definition: Statistics used to describe the characteristics of a
distribution of scores. They apply only to the members of a sample
or population from which data have been collected.
Generalizability to the population is not the objective of
descriptive statistics.

Inferential Statistics

Definition: Statistics, derived from sample data, that are used to
make inferences about the population from which the sample was drawn.
Generalizability is important in this type of statistic because it is
the ability to use the results of data collected from a sample to
reach conclusions about the characteristics of the population.
Population

Definition: The collection of cases that comprise the entire
set of cases with the specified characteristics (e.g., “All living
adult males in the United States”)

Example: In order to find the average salary of Psychology
majors who graduated from college in 2004, collect
information about the salaries of all the 2004 Psychology
graduates and derive an average from that data.

Any value generated from or applied to the population is a
parameter.
Sample



Definition: A collection of cases
selected from a larger population
Example: In order to find the average
salary of Psychology majors who
graduated from college in 2004, you
select (randomly or non-randomly)
some of these graduates and derive a
mean from their salaries.
Any value derived from the sample,
such as the mean, is a statistic.
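The parameter/statistic distinction can be sketched in a few lines of Python. The salary figures below are invented for illustration and are not data from the slides; the sample size of 4 and the fixed seed are likewise assumptions.

```python
import random
import statistics

# Hypothetical salaries (in dollars) for an entire, very small population of graduates.
population_salaries = [31000, 35000, 38000, 40000, 42000,
                       45000, 47000, 52000, 58000, 72000]

# Parameter: a value computed from every case in the population.
population_mean = statistics.mean(population_salaries)

# Statistic: a value computed from a sample drawn from that population.
rng = random.Random(1)  # fixed seed so the sketch is repeatable
sample_salaries = rng.sample(population_salaries, 4)
sample_mean = statistics.mean(sample_salaries)

print("parameter (population mean):", population_mean)
print("statistic (sample mean):", sample_mean)
```

The sample mean will usually differ somewhat from the population mean, which is why it is treated as an estimate.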
Sampling Methods
RANDOM



Definition: Selecting cases
from a population in a
manner that ensures each
member of the population
has an equal chance of
being selected into the
sample.
One of the most useful,
but most difficult to use.
The major benefit of
random sampling is that
any differences between
the sample and the
population from which the
sample was selected will
not be systematic.
REPRESENTATIVE
 Definition: A method of
selecting a sample in which
members are purposely selected
to create a sample that
represents the population on
some characteristic(s) of interest
(e.g., when a sample is selected
to have the same percentages of
various ethnic groups as the
larger population).
• This type of sampling can be
expensive and time consuming;
however, it ensures that your
sample looks like the population on
some important variables,
thereby increasing the
generalizability of the sample.



CONVENIENCE
Definition: Selecting a
sample based on ease
of access or availability.
This method of
selecting a sample is less
labor-intensive than
selecting a random or
representative sample.
For a convenience sample to be
an acceptable method, the
sample cannot differ from the
population of interest in
ways that influence the
outcome of the study.
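A rough sketch of the three sampling methods, assuming a made-up population of 1,000 cases split 70/30 between two groups. The group labels, the split, and the choice to draw randomly within each group for the representative sample are all illustrative assumptions.

```python
import random

# Hypothetical population: 1,000 cases, 70% in group "A" and 30% in group "B".
population = [{"id": i, "group": "A" if i < 700 else "B"} for i in range(1000)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Random sample: every member has an equal chance of being selected.
random_sample = rng.sample(population, 100)

# Representative sample: purposely match the population's group percentages.
group_a = [p for p in population if p["group"] == "A"]
group_b = [p for p in population if p["group"] == "B"]
representative_sample = rng.sample(group_a, 70) + rng.sample(group_b, 30)

# Convenience sample: whoever is easiest to reach (here, the first 100 records).
convenience_sample = population[:100]

def share_of_a(sample):
    """Proportion of the sample that belongs to group A."""
    return sum(p["group"] == "A" for p in sample) / len(sample)

for name, sample in [("random", random_sample),
                     ("representative", representative_sample),
                     ("convenience", convenience_sample)]:
    print(name, share_of_a(sample))
```

In this invented setup the convenience sample contains only group A members, which illustrates the kind of systematic difference from the population that makes such a sample unacceptable when group membership influences the outcome.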
Variable


Any construct with more than one value that is examined in
research.
Examples include income, gender, age, height, attitudes about
school, score on a measure of depression, etc.
Types of Variables

Quantitative (continuous) variable
A variable that has assigned values
and the values are ordered and
meaningful, such that 1 is less than
2, 2 is less than 3, etc.

Qualitative (categorical) variable
A variable that has
discrete categories. If the
categories are given
numerical values, the values
have meaning as nominal
references but not as
numerical values (e.g., with 1 =
“male” and 2 = “female,” 1 is
not more or less than 2).
Scales of Measurement for Variables



Nominally (or categorically) scaled
variable: A variable in which the
numerical values assigned to each
category are simply labels rather than
meaningful numbers.
Ordinal variable: Variables measured
with numerical values where the numbers
are meaningful (e.g., 2 is larger than 1) but
the distance between the numbers is not
constant.
Interval or Ratio variable: Variables
measured with numerical values with
equal distance, or space, between each
number (e.g., the distance between
1 and 2 is the same as the distance
between 2 and 3). On a ratio scale, which
also has a true zero point, 2 is twice as
much as 1 and 4 is twice as much as 2.
Collecting Data



Collecting data produces a group of scores on one or
more variables
To get the distribution of scores you must arrange the
scores from lowest to highest
Researchers are usually interested in central tendency, a
set of distribution characteristics that consist of the
mean, median, and mode
The Mean




Definition: The arithmetic average of a distribution of
scores
Provides a single, simple number that gives a rough
summary of the distribution
The most commonly used statistic in all social science
research
Useful, but does not tell you anything about how spread
out the scores are (i.e., variance) or how many scores in
the distribution are close to the mean
The Median



Definition: The score in a distribution that marks the 50th
percentile. It is the score below which 50% of the distribution
falls and above which 50% falls
Used when dividing distribution scores into two groups
(median split)
Useful statistic to examine when the scores in a distribution
are skewed or when there are a few extreme scores at the
high end or the low end of the distribution
The Mode


Definition: The score in the distribution that occurs
most frequently
Least used of the measures of central tendency;
provides the least amount of information
Calculating the Mean
1. Add, or sum, all of the scores in a distribution
2. Divide by the number of scores
Formula for calculating the mean of a distribution:
$\bar{X} = \frac{\Sigma X}{n}$ (sample) or $\mu = \frac{\Sigma X}{N}$ (population)
$\bar{X}$ is the sample mean
$\mu$ is the population mean
$\Sigma$ means “the sum of”
X is an individual score in the distribution
n is the number of scores in the sample
N is the number of scores in the population
OR
1. Multiply each value by the frequency with which the value occurred
2. Add all of these products
3. Divide by the number of scores
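Both procedures for the mean can be sketched as follows. The function names and the seven scores are hypothetical; the point is only that the two routes give the same answer.

```python
from collections import Counter

def mean_from_scores(scores):
    """Sum all of the scores, then divide by the number of scores."""
    return sum(scores) / len(scores)

def mean_from_frequencies(frequencies):
    """Multiply each value by its frequency, add the products, divide by the number of scores."""
    total = sum(value * count for value, count in frequencies.items())
    number_of_scores = sum(frequencies.values())
    return total / number_of_scores

scores = [2, 3, 3, 4, 4, 4, 5]   # invented scores for illustration
frequencies = Counter(scores)    # maps each value to how often it occurred

print(mean_from_scores(scores))
print(mean_from_frequencies(frequencies))
```

The frequency version is convenient when the data arrive as a table of values and counts rather than as a list of individual scores.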
Calculating The Median
1. Arrange all of the scores in the distribution in order, from
smallest to largest
2. Find the middle score in the distribution
If there is an odd number of scores,
there will be a single score that marks the
middle of the distribution
If there is an even number of scores
in the distribution,
the median is the average of the two
scores in the middle of the distribution (as
long as the scores are arranged in order)
Finding the average:
add the two scores in the middle together
and divide by two
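The median steps above might be written like this. It is a minimal sketch; the function name and the decision to raise an error on an empty list are assumptions.

```python
def median(scores):
    """Return the middle score, or the average of the two middle scores."""
    if not scores:
        raise ValueError("median requires at least one score")
    ordered = sorted(scores)          # step 1: arrange from smallest to largest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd number of scores: a single middle score
        return ordered[middle]
    # even number of scores: average the two scores in the middle
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([7, 1, 5]))
print(median([4, 1, 3, 2]))
```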
Finding The Mode
Example of bimodal distribution


Remember, the mode is simply the
category in the distribution that has the
highest number of scores, or the highest
frequency
Multimodal: When a distribution of scores
has two or more values that have the
highest frequency of scores
• Example - Bimodal distribution: A
distribution that has two values that have
the highest frequency of scores; often
occurs when people respond to
controversial questions that tend to
polarize the public
On the following scale, please indicate how you
feel about capital punishment.

1-----------2-----------3-----------4-----------5
Strongly Opposed                  Strongly In Favor

Frequency of Responses

Category of Responses on the Scale         1    2    3    4    5
Frequency of Responses in Each Category    45   3    4    3    45
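Finding every mode of the capital punishment responses above can be sketched as follows. The dictionary layout and the function name are assumptions; the frequencies are the ones given in the example.

```python
# Response category -> frequency, taken from the capital punishment example.
response_frequencies = {1: 45, 2: 3, 3: 4, 4: 3, 5: 45}

def find_modes(frequencies):
    """Return every value that shares the highest frequency."""
    highest = max(frequencies.values())
    return [value for value, count in frequencies.items() if count == highest]

modes = find_modes(response_frequencies)
print(modes)
```

Returning a list rather than a single value keeps bimodal and multimodal distributions from being silently reduced to one mode.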
Example: The Mean, Median, and
Mode of a Distribution
The following distribution of test scores is given:
86, 90, 96, 96, 100, 105, 115, 121

$\text{Mean} = \frac{86+90+96+96+100+105+115+121}{8} = 101.13$
Calculating the mean: Add up all the scores, then divide by the number of
scores. In this case, there are 8 scores.

$\text{Median} = \frac{96+100}{2} = 98$
Calculating the median: Because there is an even number of scores, sum the two
scores that are found in the middle of the distribution when it is put into
numerical order, then divide by two.

Mode = 96
Calculating the mode: 96 is the score that occurs most frequently
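The worked example can be checked with Python's standard `statistics` module. This is only a verification sketch; the variable names are arbitrary.

```python
import statistics

test_scores = [86, 90, 96, 96, 100, 105, 115, 121]

mean_score = statistics.mean(test_scores)      # sum of scores divided by 8
median_score = statistics.median(test_scores)  # average of the two middle scores
mode_score = statistics.mode(test_scores)      # most frequent score

print(mean_score, median_score, mode_score)
```

The unrounded mean is 101.125, which the example reports as 101.13.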
Skewed Distribution


Definition: A distribution of scores that has
a high number of scores clustered at
one end of the distribution with
relatively few scores spread out toward
the other end of the distribution,
forming a tail.
When working with a skewed
distribution, the mean, median, and
mode are usually all at different points
rather than at the center of the
distribution.

Similarities between a skewed and
normal distribution:
• The procedures used to calculate a
mean, median, and mode are the same
Differences between a skewed and
normal distribution:
• The position of the three measures of
central tendency in the distribution
[Figure: two skewed distributions, labeled "Left or Negative" and "Right or Positive"]
Skewness

Skewness Ranges



If skewness is less than −1 or
greater than +1, the
distribution is highly skewed.
If skewness is between −1 and
−½ or between +½ and +1,
the distribution is moderately
skewed.
If skewness is between −½
and +½, the distribution is
approximately symmetric.
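The slides do not say which skewness formula they have in mind. The sketch below assumes the moment coefficient of skewness (the third central moment divided by the 1.5 power of the second central moment) and applies the ranges above. Treating values of exactly ±½ or ±1 as moderately skewed is also an assumption, and the scores need some spread to avoid division by zero.

```python
def skewness(scores):
    """Moment coefficient of skewness (an assumed choice of formula)."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in scores) / n  # third central moment
    return m3 / m2 ** 1.5

def describe_skew(value):
    """Classify a skewness value using the ranges from the slide."""
    if value < -1 or value > 1:
        return "highly skewed"
    if value <= -0.5 or value >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"

symmetric_scores = [1, 2, 3, 4, 5]       # invented data
right_skewed_scores = [1, 1, 1, 2, 10]   # invented data with a long right tail

print(describe_skew(skewness(symmetric_scores)))
print(describe_skew(skewness(right_skewed_scores)))
```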


If a distribution is symmetric, the
next question is about the central
peak: is it high and sharp, or short
and broad?
Kurtosis
The reference standard is a normal
distribution, which has a kurtosis of 3.
Often the excess kurtosis is presented:
excess kurtosis = kurtosis−3.

A normal distribution has kurtosis
exactly 3 (excess kurtosis exactly 0).
Any distribution with kurtosis ≈3
(excess ≈0) is called mesokurtic.

A distribution with kurtosis <3 (excess
kurtosis <0) is called platykurtic.
Compared to a normal distribution, its
central peak is lower and broader, and
its tails are shorter and thinner.

A distribution with kurtosis >3 (excess
kurtosis >0) is called leptokurtic.
Compared to a normal distribution, its
central peak is higher and sharper, and
its tails are longer and fatter.
[Figure: three example distributions, labeled kurtosis = 3 (excess = 0), kurtosis = 1.8 (excess = −1.2), and kurtosis = 4.2 (excess = 1.2)]
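A sketch of the kurtosis calculation, assuming the moment-based definition (the fourth central moment divided by the square of the second central moment), for which a normal distribution gives 3. The `tolerance` used to decide what counts as "≈3" is an arbitrary illustrative value, not one taken from the slides.

```python
def kurtosis(scores):
    """Moment-based kurtosis (an assumed choice of formula)."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in scores) / n  # fourth central moment
    return m4 / m2 ** 2

def describe_kurtosis(value, tolerance=0.05):
    """Label a kurtosis value; `tolerance` is a hypothetical cutoff for 'about 3'."""
    excess = value - 3
    if abs(excess) <= tolerance:
        return "mesokurtic"
    return "platykurtic" if excess < 0 else "leptokurtic"

flat_scores = [1, 2, 3, 4, 5]          # invented, evenly spread data
peaked_scores = [0] * 8 + [-5, 5]      # invented data with a sharp peak and long tails

for scores in (flat_scores, peaked_scores):
    k = kurtosis(scores)
    print(k, k - 3, describe_kurtosis(k))
```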
Measures of Central Tendency vs.
Measures of Variability



Measures of central tendency provide useful
information, but are limited.
Measures of central tendency provide insufficient
information on the dispersion of scores in a
distribution or, in other words, the variety of the scores
in a distribution.
Three measures of dispersion that researchers typically
examine are the range, variance, and standard deviation.
Standard deviation is the most informative and widely
used of the three.
Range




Definition: The range is the difference between the largest
score (maximum value) and the smallest score (minimum
value) of a distribution
Gives researchers a quick sense of how spread out the
scores of a distribution are
Not always practical; can be misleading because it depends only on
the two most extreme scores
Helps see whether all or most of the points on a scale,
such as a survey, were covered
Interquartile Range (IQR)


Definition: The difference between
the 75th percentile (third quartile)
and 25th percentile (first quartile)
scores in a distribution
The IQR spans the two middle
quartiles (the middle 50% of scores)
when the scores in a distribution are
arranged in numerical order
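The range and IQR can be sketched with the eight test scores from the earlier example. `statistics.quantiles` (Python 3.8+) is one of several conventions for locating quartiles; other software may give slightly different cut points for small samples, so the IQR shown here is specific to that method.

```python
import statistics

test_scores = [86, 90, 96, 96, 100, 105, 115, 121]

# Range: largest score minus smallest score.
score_range = max(test_scores) - min(test_scores)

# quantiles(n=4) returns the three quartile cut points: Q1, the median, and Q3.
q1, q2, q3 = statistics.quantiles(test_scores, n=4)

# Interquartile range: 75th percentile minus 25th percentile.
iqr = q3 - q1

print("range:", score_range)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```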
Variance



Definition: The sum of the squared deviations divided by
the number of cases in the population, or by the number
of cases minus one in the sample
Provides a statistical average of the amount of dispersion
in a distribution of scores
Researchers rarely look at the variance by itself because it is not on
the same scale as the original measure of a variable (it is in squared
units); it is nonetheless helpful for the calculation of other
statistics (e.g., analysis of variance, regression)
Standard Deviation


Definition: The average deviation between the
individual scores in the distribution and the
mean for the distribution
To understand standard deviation, consider
the meanings of the two words:
• Standard: typical or average
• Deviation: refers to the difference between
an individual score and the average score
for the distribution
Useful statistic; provides handy measure of
how spread out the scores are in the
distribution

When combined, the mean and standard
deviation provide a pretty good picture of
what the distribution of scores is like
Sample Statistics as Estimates of
Population Parameters


For the most part, researchers are concerned with
what a sample tells us about the population from
which the sample was drawn. This is important
because most of the statistics, although generated
from sample data, are used to make inferences about
the population
The formulas for calculating the variance and standard
deviation of sample data are actually designed to make
sample statistics better estimates of the population
parameters (i.e., the population variance and standard
deviation)
Making Sense of the Formulas for
Calculating the Variance



Here the interest is not in the average score of the distribution,
but rather in the average difference, or deviation, between each
score in the distribution and the mean of the distribution
First, calculate a deviation score for each individual score in
the distribution
See next slide for formula
Similarities Between the Variance and Standard
Deviation Formulas
Variance and Standard Deviation Formulas

Population
Variance: $\sigma^2 = \frac{\Sigma (X - \mu)^2}{N}$
Standard Deviation: $\sigma = \sqrt{\frac{\Sigma (X - \mu)^2}{N}}$
where
$\Sigma$ = to sum
X = a score in the distribution
$\mu$ = the population mean
N = the number of cases in the population

Estimate Based on Sample
Variance: $s^2 = \frac{\Sigma (X - \bar{X})^2}{n - 1}$
Standard Deviation: $s = \sqrt{\frac{\Sigma (X - \bar{X})^2}{n - 1}}$
where
$\Sigma$ = to sum
X = a score in the distribution
$\bar{X}$ = the sample mean
n = the number of cases in the sample

Formulas for calculating the
variance and the standard
deviation are virtually
identical. The square root in the
standard deviation formula is
the only difference.
Calculating the variance is the
same for both sample and
population data, except that the
sample formula uses the sample
mean and has n - 1 in the
denominator
The formula for calculating the
variance is known as the
deviation score formula
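The deviation score formulas can be sketched as four small functions. The function names and the eight scores are invented for illustration; the same scores are treated first as a whole population and then as a sample.

```python
import math

def population_variance(scores):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(scores) / len(scores)
    return sum((x - mu) ** 2 for x in scores) / len(scores)

def sample_variance(scores):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    x_bar = sum(scores) / len(scores)
    return sum((x - x_bar) ** 2 for x in scores) / (len(scores) - 1)

def population_sd(scores):
    """Square root of the population variance."""
    return math.sqrt(population_variance(scores))

def sample_sd(scores):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(scores))

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # invented scores with mean 5

print(population_variance(scores), population_sd(scores))
print(sample_variance(scores), sample_sd(scores))
```

The only difference between each variance function and its standard deviation counterpart is the square root, and the only difference between the population and sample versions is the denominator.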
Differences Between the Variance and
Standard Deviation Formulas: Why n – 1?

Brief explanation:
• If the population mean is unknown, use the
sample mean as an estimate. But the sample
mean will probably differ from the
population mean
• Whenever a number other than the
actual mean of a set of scores is used to
calculate their variance, a larger variance
will be found. This is true regardless of
whether the number used in the formula is
smaller or larger than the actual mean
• Because the sample mean usually differs
from the population mean, the variance
and standard deviation calculated with the
sample mean will probably be smaller than
they would have been if the population
mean had been used
• When using the sample mean to
generate an estimate of the population
variance or standard deviation, the result
will therefore underestimate the size of the
population variance or standard deviation
• To adjust for this underestimation, use
n - 1 in the denominator in the
sample formulas
• Smaller denominators produce larger
overall variance and standard deviation
statistics, making them more accurate
estimates of the population parameters
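The underestimation can be illustrated by listing every possible sample of size 2 from a tiny, made-up population and averaging the variance estimates. This sketch assumes sampling with replacement, so that each draw is independent; the population values are invented.

```python
from itertools import product

population = [1, 2, 3, 4, 5]  # invented population
N = len(population)
mu = sum(population) / N
population_variance = sum((x - mu) ** 2 for x in population) / N

def variance(sample, denominator):
    """Squared deviations from the sample's own mean, divided by `denominator`."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample) / denominator

n = 2
# Every ordered sample of size n, drawn with replacement (25 samples in total).
samples = list(product(population, repeat=n))

average_dividing_by_n = sum(variance(s, n) for s in samples) / len(samples)
average_dividing_by_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print("population variance:", population_variance)
print("average estimate using n:", average_dividing_by_n)
print("average estimate using n - 1:", average_dividing_by_n_minus_1)
```

Across all possible samples, dividing by n comes out too small on average, while dividing by n - 1 matches the population variance.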
Working with a Population Distribution


Researchers usually assume they are working with a
sample that represents a larger population
How much difference it makes to use N rather than n - 1 in
the denominator depends on the size of the sample
• If the sample is large, there is virtually no difference
• If the sample is small, there is a relatively large difference between
the results produced by the population and sample
formulas
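For the same set of scores, the sample-formula variance equals the population-formula variance multiplied by n / (n - 1). A short sketch of how that factor shrinks as the sample grows follows; the function name and the sample sizes are arbitrary.

```python
def inflation_factor(n):
    """Ratio of the n - 1 variance to the N-denominator variance for the same n scores."""
    return n / (n - 1)

for n in (5, 30, 500):
    print(n, round(inflation_factor(n), 4))
```

With 5 scores the two formulas differ by 25%; with 500 scores the difference is a fraction of one percent.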
Why Have Variance?

Why not go straight to standard deviation?
• We need to calculate the variance before finding the
standard deviation. That is because we need to square
the deviation scores (so they will not sum to zero).
Averaging these squared deviations produces the variance. Then
we need to take the square root to find the standard
deviation.
• The fundamental piece of the variance formula,
which is the sum of the squared deviations, is used in
a number of other statistics, most notably analysis of
variance (ANOVA)
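The need to square the deviations can be seen in a short sketch with five invented scores, treated here as a sample.

```python
import math

scores = [1, 3, 5, 7, 9]  # invented scores
mean = sum(scores) / len(scores)

# Raw deviations from the mean cancel each other out.
deviations = [x - mean for x in scores]
sum_of_deviations = sum(deviations)

# Squaring removes the signs, so the sum no longer collapses to zero.
sum_of_squares = sum(d ** 2 for d in deviations)

# Variance, then the square root to return to the original scale.
sample_variance = sum_of_squares / (len(scores) - 1)
sample_sd = math.sqrt(sample_variance)

print(sum_of_deviations, sum_of_squares, sample_variance, sample_sd)
```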
Students’ responses to the item “I would feel really good if I were
the only one who could answer the teacher’s question in class.”

Sample Size = 491
Mean = 2.92
Standard Deviation = 1.43
Variance = $(1.43)^2$ = 2.04
Range = 5 – 1 = 4
Range does not provide very much
information.
The mean of 2.92 is not particularly
informative because from the mean it is
impossible to determine whether:
• Most students circled a 3 on the
scale
• Roughly equal numbers of
students circled each of the five
numbers on the response scale
• Almost half of the students
circled 1 whereas the other half
circled 5
[Figure: histogram of Frequency by Scores on desire to demonstrate ability item (1 to 5); bar labels 120, 115, 98, 81, and 77]
Drawing Conclusions…

Consider the standard deviation in conjunction
with the mean
• Predicting what the size of the standard
deviation will be:
  • If almost all of the students circled a 2 or a 3 on
  the response scale, expect a fairly small standard
  deviation
  • If half of the students circled 1 whereas the
  other half circled 5, expect a large standard
  deviation (about 2.0) because each score would
  be about two units away from the mean
  • If the responses are fairly evenly spread out
  across the five response categories, expect a
  moderately sized standard deviation (about 1.50)

[Figure: Boxplot for the desire to appear able variable, vertical axis from 0 to 6]
Presented for the same variable that is represented in
the previous graph, wanting to demonstrate ability
Conclusions:
• The distribution looks somewhat
symmetrical due to the mean of 2.92 being
somewhat in the middle
• From the standard deviation of 1.43, we
know that the scores are pretty well spread
out across the five response categories
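The three predictions about the size of the standard deviation can be checked with invented response patterns. Each list below assumes 100 hypothetical respondents on the 1 to 5 scale; these are not the actual survey responses.

```python
import statistics

# Almost everyone circles a 2 or a 3.
clustered = [2] * 50 + [3] * 50

# Half circle 1 and the other half circle 5.
polarized = [1] * 50 + [5] * 50

# Responses spread evenly across the five categories.
evenly_spread = [1, 2, 3, 4, 5] * 20

sd_clustered = statistics.stdev(clustered)
sd_polarized = statistics.stdev(polarized)
sd_evenly_spread = statistics.stdev(evenly_spread)

print("clustered:", round(sd_clustered, 2))
print("polarized:", round(sd_polarized, 2))
print("evenly spread:", round(sd_evenly_spread, 2))
```

The observed standard deviation of 1.43 is closest to the evenly spread pattern, which fits the conclusion that the scores are spread out across the five response categories.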