Quantitative Measures - University of Oxford

Statistics for Linguistics Students
Michaelmas 2004
Week 1
Bettina Braun
Why calculate statistics?
• Describe and summarise the data
• E.g. examination results (out of 100)
22 98 40 45 16 31 77 78
55 45 61 91 87 45 54 66
75 87 88 49 64 76 58 61 …
• Typical questions: What is the average mark? How spread out
are the scores? What are the lowest and highest marks? How do
they compare with other results (e.g. last year’s)?
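As a minimal sketch of such a summary, the following uses Python's standard `statistics` module on the 24 marks shown above. The original list is truncated ("…"), so the figures describe only the visible marks, not the full data set.

```python
import statistics

# Only the 24 marks visible on the slide; the full list continues ("…").
marks = [22, 98, 40, 45, 16, 31, 77, 78,
         55, 45, 61, 91, 87, 45, 54, 66,
         75, 87, 88, 49, 64, 76, 58, 61]

average = statistics.mean(marks)          # average mark
lowest, highest = min(marks), max(marks)  # lowest and highest marks
spread = highest - lowest                 # range as a crude measure of spread

print(f"n = {len(marks)}, mean = {average:.2f}")
print(f"lowest = {lowest}, highest = {highest}, range = {spread}")
```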
Population vs. Sample
• Population: total universe of all possible
observations.
Populations can be finite or infinite, real or
theoretical
– The IQ of all adult men in Britain
– The outcome of an infinite number of flips of a
coin
• Descriptive statistics of a population are called parameters
Population vs. Sample (cont’d)
• Sample: Subset of observations drawn from a
given population
– The IQ scores of 100 adult men in Britain
– The outcome of 50 flips of a coin
• Descriptive statistics from a sample are called
statistics
• Note: In experimental research it is important to
draw a representative, random sample that is
not biased
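The distinction between a parameter and a statistic can be illustrated with a small simulation. This is only a sketch: the population below is hypothetical simulated data (not real IQ scores), and the seed, population size and sample size are arbitrary choices made for illustration.

```python
import random
import statistics

rng = random.Random(2004)  # fixed seed so the sketch is repeatable

# Hypothetical finite population: 10,000 simulated IQ-like scores.
population = [rng.gauss(100, 16) for _ in range(10_000)]

# Simple random sample: every member is equally likely to be drawn.
sample = rng.sample(population, 100)

parameter = statistics.mean(population)  # describes the whole population
statistic = statistics.mean(sample)      # describes the sample only

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
```

The sample mean will usually be close to, but not identical with, the population mean. A biased sampling procedure would shift it systematically.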
Histograms: Frequency distribution
of each event
[Figure: histogram of VAR00001 (x-axis) against Frequency (y-axis). Mean = 54.6207, Std. Dev. = 16.9673, N = 87. Data: Tutorial1.sav]
Central tendency: mode and
median
• Mode: Most frequent mark (Note: there may be
multiple modes)
Statistics: mark
N (valid)     87
N (missing)   0
Mean          54.62
Median        56.00
Mode          55 (a)
(a) Multiple modes exist. The smallest value is shown.
• Median: the middle score when the scores are ordered from
lowest to highest (with an even number of scores, the mean of
the two middle scores). It cuts the data into halves and
depends only on the score(s) in the middle position, not on
the values of all scores.
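A short sketch of both measures using Python's standard library. The seven scores are hypothetical, chosen so that two modes exist. Taking the smallest mode mirrors the convention in the SPSS footnote above.

```python
import statistics

# Hypothetical scores, chosen so that two values tie for most frequent.
scores = [40, 55, 55, 56, 60, 60, 71]

modes = statistics.multimode(scores)   # all most-frequent values
smallest_mode = min(modes)             # convention: report the smallest mode
median = statistics.median(scores)     # middle value of the ordered scores

print("modes:", modes)
print("smallest mode:", smallest_mode)
print("median:", median)
```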
Central tendency: mean
• Mean: sum of scores divided by the
number of scores
Note on notation: Greek letters are often used for
population parameters (e.g. μ, σ), Roman letters for
statistics, i.e. properties of a sample (e.g. x̄, s)
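The formula itself did not survive in this transcript. In the standard notation, which follows the Greek/Roman convention noted above, the mean of a population of N scores and the mean of a sample of n scores are written as:

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad\text{(population mean)}
\qquad\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad\text{(sample mean)}
```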
Comparing measures of “central
tendency”
• Mode:
– quick if we have frequency distribution
– Possible with categorical data
• Median:
– Good estimate if there are abnormally large or small
values (e.g. maximum aircraft speeds of 450 km/h, 480 km/h,
500 km/h, 530 km/h, 600 km/h, and 1100 km/h)
– Only influenced by values in the middle of ordered
data
• Mean
– Every score is taken into account
– Some interesting properties → most widely used
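The aircraft-speed example can be worked through directly. The sketch below compares mean and median with and without the one abnormally large value. Variable names are illustrative.

```python
import statistics

# Maximum aircraft speeds in km/h from the example above.
speeds = [450, 480, 500, 530, 600, 1100]

mean_speed = statistics.mean(speeds)
median_speed = statistics.median(speeds)

# The same data without the abnormally large value.
without_outlier = speeds[:-1]
mean_without = statistics.mean(without_outlier)
median_without = statistics.median(without_outlier)

print(f"all six values: mean = {mean_speed}, median = {median_speed}")
print(f"without 1100:   mean = {mean_without}, median = {median_without}")
```

The single value of 1100 km/h moves the mean from 512 to 610 km/h, above five of the six speeds, while the median only moves from 500 to 515 km/h.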
Types of variables
• Interval (scale): differences between consecutive
numbers are equal (e.g. time, speed,
distances). Precise measurements
• Ordinal: assignment of ranks that represent
position along some ordered dimension (e.g.
ranking people with respect to their speed, 1 = fastest,
4 = slowest). No equal intervals
• Categorical (nominal): numbers used only as category
labels (e.g. brown = 1, blue = 2, green = 3)
• Question: on which type of data can we
calculate a meaningful “central tendency”?
Spread of distributions: why?
Spread of distributions:
range and quartiles
• Small spread often desirable as it
indicates that most scores are identical
or close together
• Large spread indicates large differences
between individual scores
• Range: difference between highest and
lowest score – rather crude measure
• Quartiles: cut the ordered data into
quarters (second quartile = median)
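A minimal sketch of range and quartiles on nine hypothetical scores. Note that statistics packages use different conventions for interpolating quartiles, so the values from `statistics.quantiles` with `method="inclusive"` (the choice assumed here) need not match what SPSS reports for the same data.

```python
import statistics

# Nine hypothetical scores, already ordered from lowest to highest.
scores = [16, 22, 31, 40, 45, 55, 61, 77, 98]

score_range = max(scores) - min(scores)  # highest minus lowest score

# Three cut points that split the ordered data into quarters.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
interquartile_range = q3 - q1

print(f"range = {score_range}")
print(f"lower quartile = {q1}, median = {q2}, upper quartile = {q3}")
print(f"interquartile range = {interquartile_range}")
```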
Median, quartiles, and outliers
[Figure: box plot of var0001 (tutorial1.sav: simple boxplot, separate variables). Labelled parts, from top to bottom: largest value which is not an outlier; upper quartile; median; lower quartile; smallest value which is not an outlier. The box spans the interquartile range.]
o Outlier: more than 1.5 box lengths above or below the box
* Extreme value: more than 3 box lengths above or below the box
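The box-plot rule for outliers and extreme values can be sketched as a small function. The function name, the return labels and the example quartiles (40 and 60) are hypothetical. Only the 1.5 and 3 box-length thresholds come from the figure legend.

```python
def classify(value, lower_quartile, upper_quartile):
    """Label a score using the box-plot rule from the figure legend."""
    box_length = upper_quartile - lower_quartile  # interquartile range
    # Distance of the value from the nearest edge of the box (0 if inside).
    distance = max(lower_quartile - value, value - upper_quartile, 0)
    if distance > 3 * box_length:
        return "extreme value"
    if distance > 1.5 * box_length:
        return "outlier"
    return "regular"


# Hypothetical quartiles: box from 40 to 60, so one box length is 20.
for score in (50, 90, 95, 5, 125):
    print(score, classify(score, 40, 60))
```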
Spread of the population:
variance measures
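• Variance: mean of the squared deviations from
the mean, i.e. the sum of squared deviations divided by
the number of scores N (n - 1 when estimating from a sample)
Variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$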
• Standard deviation: square root of
variance
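The computation can be followed step by step on a small hypothetical data set. The sketch computes the population variance by hand, checks it against Python's `statistics` module, and also shows the sample version, which divides by n - 1.

```python
import math
import statistics

# Eight hypothetical scores with mean 5.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(scores) / len(scores)
squared_deviations = [(x - mean) ** 2 for x in scores]

# Population variance: sum of squared deviations divided by N.
population_variance = sum(squared_deviations) / len(scores)
population_sd = math.sqrt(population_variance)

# Sample variance: same sum, divided by n - 1.
sample_variance = sum(squared_deviations) / (len(scores) - 1)

print(f"sum of squared deviations = {sum(squared_deviations)}")
print(f"population variance = {population_variance}, sd = {population_sd}")
print(f"sample variance = {sample_variance:.3f}")
```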
Normal distribution
(Gaussian distribution)
• Example: IQ scores, mean=100, sd=16
Mean = Median = Mode
Skewed distributions and measures
of central tendency
Bimodal distributions
z-scores
• Z-score: deviation of given score from the
mean in terms of standard deviations
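In formula terms, z = (score - mean) / standard deviation. The sketch below applies this to the IQ example used earlier (mean = 100, sd = 16). The function name and the two example scores are illustrative.

```python
def z_score(score, mean, sd):
    """Deviation of a score from the mean, in units of standard deviations."""
    return (score - mean) / sd


# IQ example: mean = 100, sd = 16.
print(z_score(132, 100, 16))  # two standard deviations above the mean
print(z_score(84, 100, 16))   # one standard deviation below the mean
```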
How likely is a given event?
• Example: time to utter a particular
sentence: mean x̄ = 3.45 s and sd = 0.84 s
• Questions:
– What proportion of the population of utterance
times will fall below 3s?
– What proportion would lie between 3s and
4s?
– What is the time value below which we will
find 1% of the data?
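These three questions can be answered with the normal distribution, assuming (this is an assumption, not something the example states) that utterance times are normally distributed with mean 3.45 s and standard deviation 0.84 s. A sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later):

```python
from statistics import NormalDist

# Assumption: utterance times follow a normal distribution.
times = NormalDist(mu=3.45, sigma=0.84)

# Proportion of utterance times below 3 s.
below_3s = times.cdf(3.0)

# Proportion between 3 s and 4 s: difference of two cumulative proportions.
between_3s_and_4s = times.cdf(4.0) - times.cdf(3.0)

# Time value below which 1% of the data fall (inverse of the cdf).
first_percentile = times.inv_cdf(0.01)

print(f"P(time < 3 s)        = {below_3s:.3f}")
print(f"P(3 s < time < 4 s)  = {between_3s_and_4s:.3f}")
print(f"1st percentile       = {first_percentile:.2f} s")
```

Under this assumption roughly 30% of utterance times fall below 3 s, roughly 45% lie between 3 s and 4 s, and 1% fall below about 1.5 s. The same answers can be obtained by converting 3 s and 4 s to z-scores and looking them up in a standard normal table.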