Introduction - University of Toronto


The Rise of Statistics

• Statistics is the science of collecting, organizing and interpreting data. The goal of statistics is to gain information and understanding from data.

• Statistics is the common bond supporting all other sciences.

• For more on the history of statistics, read: Howard W. Eves, "A Very Brief History of Statistics," The College Mathematics Journal, Vol. 33, No. 4 (Sep., 2002), pp. 306-308.

STA248 week 2 1

Elements of Statistics - Introduction

• Data are numerical facts with context, and we need to understand the context if we are to make sense of the numbers.

• A set of data contains some information about a group of individuals. Individuals are the objects upon which we collect data. Individuals can be people, animals, plots of land and many other things.

• A population is a set of individuals that we are interested in studying.

• A variable is any characteristic of an individual.

• A sample is a subset of the individuals of a population.

Questions to ask when planning a statistical study

• Why? What purpose do the data have? Do we hope to answer some specific questions? Do we want to draw conclusions about individuals other than the ones we actually have data for?

• Who? What individuals do the data describe? How many individuals appear in the data?

• What? How many variables do the data contain? What are the exact definitions of these variables? What are the units of measurement in which each variable is recorded? Weights, for example, might be recorded in pounds or in kg.

Collecting Data

• Generally, data can be obtained in four different ways:
- Published source.
- Designed experiment.
- Survey.
- Observational study.

Simple Random Sample

• A simple random sample (SRS) of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected.

• How to select an SRS? Hat method, random number tables or software.

• SRS helps eliminate the problem of having the sample reflect a different population than the one we are interested in studying.
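As a rough illustration of the software route, the sketch below draws an SRS in Python (the slides themselves use R). The population of 20 numbered individuals, the seed, and the helper name `simple_random_sample` are assumptions made only for this example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct individuals so that every set of n is equally likely."""
    rng = random.Random(seed)          # a seed only makes the draw repeatable
    return rng.sample(population, n)   # sampling without replacement

# Hypothetical population: individuals labelled 1..20
population = list(range(1, 21))
srs = simple_random_sample(population, n=5, seed=248)
print(srs)
```

Sampling without replacement is what makes each set of n individuals equally likely; drawing with replacement would allow the same individual to appear twice.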

Stratified Random Sampling

• Often, the sampling units are not homogeneous and naturally divide themselves into nonoverlapping groups that are homogeneous.

• These groups of similar individuals are called strata.

• A stratified random sample consists of a separate SRS within each stratum; these SRSs are combined to form the full sample.
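A minimal Python sketch of stratified random sampling, assuming the strata are already known. The stratum names, their members, the per-stratum sample sizes, and the function name `stratified_sample` are hypothetical.

```python
import random

def stratified_sample(strata, sizes, seed=None):
    """Take a separate SRS of the requested size within each stratum, then combine."""
    rng = random.Random(seed)
    sample = []
    for name, members in strata.items():
        sample.extend(rng.sample(members, sizes[name]))  # SRS inside this stratum
    return sample

# Hypothetical strata: students grouped by year of study
strata = {
    "year1": ["a1", "a2", "a3", "a4", "a5", "a6"],
    "year2": ["b1", "b2", "b3", "b4"],
}
sizes = {"year1": 3, "year2": 2}
sample = stratified_sample(strata, sizes, seed=1)
print(sample)
```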

Data and Statistics

• Statistics is about data: collecting, organizing, summarizing, analyzing, and using data to make decisions.

• A statistic is a number in context.

• Statistical inference is the connection between statistics and probability.

• There is a random mechanism behind the data because we only have data on a random sample.

• Data: observations $x_1, \ldots, x_n$ are considered realizations of n random variables $X_1, \ldots, X_n$, or n realizations of one random variable X.

Displaying Distributions With Graphs

• Statistical tools and ideas help us examine data in order to describe their main features. This examination is called exploratory data analysis.

• Two basic strategies for exploration of a data set:
- Begin by examining each variable by itself. Then move on to study the relationships among the variables.
- Begin with graphs. Then add numerical summaries of specified aspects of the data.

Describing Quantitative Data

• The pattern of variation of a variable is called its distribution.

• The distribution of a variable is best displayed graphically.

• There are three main graphical methods for describing, summarizing, and detecting patterns in quantitative data:
- Dot plot
- Stem-and-leaf plot
- Histogram

Stemplots

To make a stemplot:

1. Separate each observation into a stem consisting of all but the final (rightmost) digit and a leaf, the final digit.

2. Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column.

3. Write each leaf in the row to the right of its stem, in increasing order out from the stem.

Example

• Here are the scores of a basketball player (say player A):

54 59 35 41 46 25 47 60 54 46 49 46 41 34 22

Make a stemplot of these data. Describe the main features of the distribution.

• Solution: Min. = 22, Max. = 60 …

• R command: stem(scores, scale=2)
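The three stemplot steps can be sketched in Python as follows. This is an illustration only: it assumes non-negative whole-number data, the function name `stemplot` is made up for the example, and it prints one row per stem, whereas R's `stem(scores, scale=2)` may split each stem over two rows.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows: stem = all but the last digit, leaf = last digit."""
    leaves = defaultdict(list)
    for v in sorted(values):            # sorting puts leaves in increasing order
        leaves[v // 10].append(v % 10)
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):   # keep empty stems too
        row_leaves = "".join(str(d) for d in leaves.get(stem, []))
        rows.append(f"{stem} | {row_leaves}")
    return rows

scores = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
for row in stemplot(scores):
    print(row)
```

Most of the leaves fall on stem 4, so the display shows a single peak in the forties.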

Examining a distribution

• In any graph of data, look for the overall pattern and for striking deviations from that pattern.

• The overall pattern of a distribution can be described by its shape, centre, and spread.

• An important kind of deviation is an outlier, an individual value that falls outside the overall pattern.

• Some other things to look for in describing shape are:
- Does the distribution have one or several major peaks, usually called modes? A distribution with one major peak is called unimodal.
- Is it approximately symmetric or skewed in one direction?

Exercise

Describe the shape of the distribution summarized by the following stemplot.

Stem-and-leaf of sta220 marks   N = 42
Leaf Unit = 1.0

   1    6  7
   3    7  44
  11    7  77888999
 (11)   8  00011233444
  20    8  555556666778
   8    9  000001
   2    9  7
   1   10  0

Exercise

Describe the shape of the distribution summarized by the following stemplot.

Stem-and-leaf of C1   N = 50
Leaf Unit = 0.10

  18    0  000111122233334444
 (17)   0  55555566667889999
  15    1  0011444
   8    1  5669
   4    2  03
   2    2
   2    3  1
   1    3
   1    4  2

Histograms

• A histogram breaks the range of values of a variable into intervals and displays only the count or percent of the observations that fall into each interval.

• We can choose a convenient number of intervals.

• Histograms do not display the actual values observed (only the counts in each interval).

• Example: Here are some data on the number of days lost due to illness of a group of employees:

47, 1, 55, 30, 1, 3, 7, 14, 7, 66, 34, 6, 10, 5, 12, 5, 3, 9, 18, 45, 5, 8, 44, 42, 46, 6, 4, 24, 24, 34, 11, 2, 3, 13, 5, 5, 3, 4, 4, 1

The main steps in constructing a histogram

1. Determine the range of the data (largest and smallest values). In our example the data range from a min. of 1 day to a max. of 66 days.

2. Decide on the number of intervals (or classes), and the width of each class (usually equal).

3. Count the number of observations in each class. These counts are called class frequencies.

4. Draw the histogram.

Class    No. of employees    Cumulative    Relative
         (Frequency)         frequency     frequency
0-10     23                  23            0.575
10-20     5                  28            0.125
20-30     3                  31            0.075
30-40     2                  33            0.050
40-50     5                  38            0.125
50-60     1                  39            0.025
60-70     1                  40            0.025
Total    40                                1.000

• A table with the first two columns above is called a frequency table or frequency distribution.

• A table with the first column and the third column is called a cumulative frequency distribution.
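The class counts can be checked with a short Python sketch. It assumes each class includes its right endpoint, i.e. the class a-b holds values with a < x <= b; this is inferred from the frequencies, since the value 10 has to be counted in 0-10 to give 23. The function name `frequency_table` is illustrative.

```python
days = [47, 1, 55, 30, 1, 3, 7, 14, 7, 66, 34, 6, 10, 5, 12, 5, 3, 9, 18, 45,
        5, 8, 44, 42, 46, 6, 4, 24, 24, 34, 11, 2, 3, 13, 5, 5, 3, 4, 4, 1]

def frequency_table(values, start, width, n_classes):
    """Count values in classes (lo, hi]; return rows of
    (lo, hi, frequency, cumulative frequency, relative frequency)."""
    rows, cumulative = [], 0
    for k in range(n_classes):
        lo, hi = start + k * width, start + (k + 1) * width
        freq = sum(1 for v in values if lo < v <= hi)   # right-closed class
        cumulative += freq
        rows.append((lo, hi, freq, cumulative, freq / len(values)))
    return rows

table = frequency_table(days, start=0, width=10, n_classes=7)
for lo, hi, freq, cum, rel in table:
    print(f"{lo:>2}-{hi:<2}  {freq:>2}  {cum:>2}  {rel:.3f}")
```

The relative frequencies are the class frequencies divided by 40, so they must add up to 1.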

[Figure: histogram of the days-lost data; horizontal axis "days lost" from 0 to 70]

• R command: hist(days)

Describing distributions with numbers

• A large number of numerical methods are available for describing quantitative data sets. Most of these methods measure one of two data characteristics:
- The central tendency or location of the set of observations: the tendency of the data to cluster, or center, about certain numerical values.
- The variability of the set of observations: the spread of the data.

Measuring Center

• Two common measures of center are the mean and the median.

• These two measures behave differently.

• The mean is the “average value” and the median is the “middle value”.

Measuring center: the median

• The median is the midpoint of a distribution: the number such that half the observations are smaller than it and the other half are larger.

• To find the median of a distribution:

1. Arrange the observations in order of size, from smallest to largest.

2. If the number of observations n is odd, the median is the center observation in the ordered list.

3. If the number of observations n is even, the median is the average of the two center observations in the ordered list.

Example

The annual salaries (in thousands of dollars) of a random sample of five employees of a company are: 40, 30, 25, 200, 28.

Arranging the values in increasing order: 25 28 30 40 200, so median = 30.

Excluding 200: median = (28 + 30)/2 = 29.
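The rule for odd and even n can be written out directly. The Python sketch below is for illustration; the function name `median` is our own, and Python's standard library offers `statistics.median` for the same job.

```python
def median(observations):
    """Sort, then take the middle value (n odd) or the
    average of the two middle values (n even)."""
    ordered = sorted(observations)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

salaries = [40, 30, 25, 200, 28]                     # thousands of dollars
median_all = median(salaries)                        # all five salaries
median_without_200 = median([s for s in salaries if s != 200])
print(median_all, median_without_200)
```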

Measuring center: mean

• If the n observations are $x_1, x_2, \ldots, x_n$, the sample mean $\bar{x}$ is given by

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

• Example: Find the mean of the following observations: 4, 5, 9, 3, 5.

Solution: $\bar{x} = \frac{4 + 5 + 9 + 3 + 5}{5} = \frac{26}{5} = 5.2$

Example

• The annual salaries (in thousands of dollars) of a random sample of five employees of a company are: 40, 30, 25, 200, 28.

$$\text{mean} = \frac{40 + 30 + 25 + 200 + 28}{5} = \frac{323}{5} = 64.6$$

If we exclude 200 as an outlier,

$$\text{mean} = \frac{40 + 30 + 25 + 28}{4} = \frac{123}{4} = 30.75$$

• The mean is sensitive to the influence of a few extreme observations. Because the mean cannot resist the influence of extreme values, we say that it is NOT a resistant measure of center.
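To see the lack of resistance numerically, the following Python sketch compares the mean and the median of the salary data with and without the value 200. It relies on the standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

salaries = [40, 30, 25, 200, 28]                  # thousands of dollars
without_200 = [s for s in salaries if s != 200]

mean_all, mean_without = mean(salaries), mean(without_200)
median_all, median_without = median(salaries), median(without_200)

print(mean_all, mean_without)       # mean with and without 200
print(median_all, median_without)   # median with and without 200
```

Dropping one observation moves the mean from 64.6 to 30.75, while the median only moves from 30 to 29.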

Mean versus median

• The median and mean are the most common measures of the center of a distribution.

• If the distribution is exactly symmetric, the mean and median are exactly the same.

• Median is less influenced by extreme values.

• If the distribution is skewed to the right, then typically mode < median < mean.

• If the distribution is skewed to the left, then typically mean < median < mode.

Trimmed mean

• Trimmed mean is a measure of the center that is more resistant than the mean but uses more of the available information than the median.

• To compute the 10% trimmed mean, discard the highest 10% and the lowest 10% of the observations and compute the mean of the remaining 80%. Similarly, we can compute the 5%, 20%, etc. trimmed mean.

• Trimming eliminates the effect of a small number of outliers.

Example

• Compute the 10% trimmed mean of the data given below.

20 40 22 22 21 21 20 10 20 20 20 13 18 50 20 18 15 8 22 25

• Solution:
- Arrange the values in increasing order: 8 10 13 15 18 18 20 20 20 20 20 20 21 21 22 22 22 25 40 50
- There are 20 observations and 10% of 20 = 2. Hence, discard the first 2 and the last 2 observations in the ordered data and compute the mean of the remaining 16 values.

Variable    N    Mean
C2          16   19.812
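The same calculation as a Python sketch. The function name `trimmed_mean` is illustrative, and rounding the number of discarded observations down when the percentage does not give a whole number is an assumption that the slides do not address.

```python
def trimmed_mean(values, percent):
    """Discard the lowest and highest `percent` % of the ordered values,
    then average what remains."""
    ordered = sorted(values)
    k = len(ordered) * percent // 100   # number cut from each end (rounded down: an assumption)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [20, 40, 22, 22, 21, 21, 20, 10, 20, 20,
        20, 13, 18, 50, 20, 18, 15, 8, 22, 25]
print(trimmed_mean(data, 10))
```

The 16 retained values sum to 317, giving 317/16 = 19.8125, which the output above reports as 19.812.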

Questions

1. You are asked to recommend a measure of center to characterize the following data: 0.6, 0.2, 0.1, 0.2, 0.2, 0.3, 0.7, 0.1, 0.0, 22.5, 0.4. What is your recommendation and why?

2. The mean is ____ sensitive to extreme values than the median.
(a) more (b) less (c) equally (d) can’t say without data

3. Changing the value of a single score in a data set will necessarily cause the mean to change. (T/F)

4. Changing the value of a single score in a data set will necessarily cause the median to change. (T/F)

Measuring Spread

• There are two main measures of spread that we will discuss: the sample range and the sample standard deviation.

• The range (max - min) is a measure of spread but it is very sensitive to the influence of extreme values.

• The range can be very useful in statistical quality control.

• The measure of spread that is used most often is the sample standard deviation.

Measuring spread - Standard deviation

• The sample variance, $s^2$, of a set of n observations $x_1, x_2, \ldots, x_n$ is

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

• The sample standard deviation, s, is the square root of the variance, i.e. $s = \sqrt{s^2}$.

• It can be shown that

$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right]$$

This formula is usually quicker.
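For readers who want the missing step, the quicker formula follows from expanding the square and using $\sum_{i=1}^{n} x_i = n\bar{x}$, where $\bar{x}$ is the sample mean:

$$\sum_{i=1}^{n}(x_i-\bar{x})^2 = \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2 = \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$$

Dividing both sides by $n-1$ gives the second expression for $s^2$.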

• The deviations $x_i - \bar{x}$ display the spread of the values $x_i$ about their mean. Some of these deviations will be positive and some negative because the observations fall on each side of the mean.

• The sum of the deviations of the observations from their mean will always be zero.

• Squaring the deviations makes them all positive, so that observations far from the mean in either direction have large positive squared deviations.

• The variance is roughly the average of the squared deviations (the divisor is n - 1 rather than n).

• The variance, $s^2$, and the standard deviation, s, will be large if the observations are widely spread about their mean, and small if the observations are all close to the mean.
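The zero-sum property is a one-line consequence of the definition of the mean:

$$\sum_{i=1}^{n}(x_i-\bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0$$

For instance, the observations 4, 8, 2, 9, 7 have mean 6 and deviations -2, 2, -4, 3, 1, which add to 0.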

Example

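• Find the standard deviation of the following data set: 4, 8, 2, 9, 7.

• Solution: n = 5, and the mean is $\bar{x} = \frac{4 + 8 + 2 + 9 + 7}{5} = \frac{30}{5} = 6$.

$$s^2 = \frac{(4-6)^2 + (8-6)^2 + (2-6)^2 + (9-6)^2 + (7-6)^2}{5-1} = \frac{34}{4} = 8.5$$

so $s = \sqrt{8.5} \approx 2.92$.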
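As a check on the arithmetic, this Python sketch computes the sample variance of 4, 8, 2, 9, 7 both from the definition and from the quicker formula. The function names are illustrative; the standard library's `statistics.variance` and `statistics.stdev` use the same n - 1 divisor.

```python
import math

def sample_variance(xs):
    """Definition: sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Shortcut: (sum of squares - n * mean^2) / (n - 1)."""
    n = len(xs)
    xbar = sum(xs) / n
    return (sum(x * x for x in xs) - n * xbar ** 2) / (n - 1)

data = [4, 8, 2, 9, 7]
s2 = sample_variance(data)
s = math.sqrt(s2)
print(s2, sample_variance_shortcut(data), round(s, 2))
```

Both routes give the same variance; the shortcut only needs the sum and the sum of squares, which is why it is quicker by hand.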

Properties of standard deviation (s)

• s measures the spread about the mean and should be used only when the mean is chosen as the measure of center.

• s = 0 only when there is no spread. This happens only when all observations have the same value. Otherwise, s > 0.

• s, like the mean, is not resistant to extreme values. A few outliers can make s very large.

Ballpark approximation for s

• The ballpark approximation for the standard deviation s is Range/4 (divide by 3 if there are fewer than 10 observations, divide by 5 if there are more than 100 observations).

• For the data set 4, 8, 2, 9, 7, range = 9 - 2 = 7 and there are fewer than 10 observations, so $s \approx 7/3 \approx 2.33$.

Percentiles

• The simplest useful numerical description of a distribution consists of both a measure of center and a measure of spread.

• We can describe the spread or variability of a distribution by giving several percentiles.

• The pth percentile of a distribution is the value such that p percent of the observations are smaller than or equal to it.

• The median is the 50th percentile.

• If a data set contains n observations, then the pth percentile is the $\frac{(n+1)p}{100}$th value in the ordered data set.

Example

• Find the 20th percentile of the data represented by the following stem-and-leaf plot.

Stem-and-leaf of Rural   N = 29   N* = 7
Leaf Unit = 1.0

   1    2  1
   5    3  3589
 (12)   4  122333456788
  12    5  112467
   6    6  7
   5    7  04
   3    8  48
   1    9
   1   10  8

Solution

Since n = 29, p = 20 and

$$\frac{(n+1)p}{100} = \frac{30 \times 20}{100} = 6,$$

the 20th percentile is the 6th value from the top of the stemplot, which is 41.
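The position rule (n + 1)p/100 can be sketched in Python as below. The data are the 29 values read off the Rural stemplot. The function name `percentile` is illustrative, and the linear interpolation used when the position is not a whole number is an assumption, since the slides only treat the whole-number case.

```python
def percentile(values, p):
    """p-th percentile: the value at position (n + 1) * p / 100
    in the ordered data (positions start at 1)."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) * p / 100
    lower = int(position)                 # whole part of the position
    fraction = position - lower
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    # Assumption: interpolate linearly when the position is not a whole number.
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

# Values read off the stemplot of the Rural data (n = 29)
rural = [21, 33, 35, 38, 39,
         41, 42, 42, 43, 43, 43, 44, 45, 46, 47, 48, 48,
         51, 51, 52, 54, 56, 57, 67, 70, 74, 84, 88, 108]
print(percentile(rural, 20))
```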

Quartiles

• The 25th percentile is called the first quartile (Q1).

• The first quartile (Q1) is the median of the observations whose position in the ordered list is to the left of the location of the overall median.

• The 75th percentile is called the third quartile (Q3).

• The third quartile (Q3) is the median of the observations whose position in the ordered list is to the right of the location of the overall median.

NOTE: The median is the second quartile Q2.

Example

• The highway mileages of 20 cars, arranged in increasing order, are:

13 15 16 16 17 19 20 22 23 23 | 23 24 25 25 26 28 28 28 29 32

The median is …
The first quartile Q1 is …
The third quartile Q3 is …

Exercise: Find (a) the 10th percentile and (b) the 90th percentile of the above data set.

Measuring Spread: IQR

• The range (max - min) is a measure of spread but it is very sensitive to the influence of extreme values.

• The distance between the first and third quartiles is called the interquartile range (IQR), i.e. IQR = Q3 - Q1.

• The IQR is another measure of spread that is less sensitive to the influence of extreme values.

The five-number summary

• The five-number summary of a set of observations consists of the smallest observation, the first quartile, the median, the third quartile and the largest observation.

• These five numbers give a reasonably complete description of both the center and the spread of the distribution.

• R command: summary(data)

Example

• The highway mileages of 20 cars, arranged in increasing order, are:

13 15 16 16 17 19 20 22 23 23 23 24 25 25 26 28 28 28 29 32

Give the five-number summary.

• Answer:

Variable    N    Minimum   Q1      Median   Q3      Maximum
mileage     20   13.00     17.50   23.00    27.50   32.00
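A Python sketch that reproduces these five numbers. It uses `statistics.quantiles` with `method="exclusive"`, which places the pth percentile at position (n + 1)p/100 and interpolates between neighbouring values. Quartile conventions differ between packages, so other software may report slightly different values for Q1 and Q3. The function name `five_number_summary` is illustrative.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    # method="exclusive": percentile at position (n + 1) * p / 100, with interpolation
    q1, med, q3 = quantiles(values, n=4, method="exclusive")
    return min(values), q1, med, q3, max(values)

mileage = [13, 15, 16, 16, 17, 19, 20, 22, 23, 23,
           23, 24, 25, 25, 26, 28, 28, 28, 29, 32]
summary = five_number_summary(mileage)
print(summary)
```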

Box-plot

• A box-plot is a graph of the five-number summary.

• Example: Make a box-plot for the data in the above example.

[Figure: Boxplot of Mileages; axis ticks at 15, 20, 25, 30]

• R command: boxplot(data)

Outliers

An outlier is an observation that is unusually large or small relative to the other values in a data set. Outliers are typically attributable to one of the following causes:

1. The observation is observed, recorded, or entered incorrectly.

2. The observation comes from a different population.

3. The observation is correct but represents a rare event.

The 1.5×IQR Criterion for outliers

• Call an observation a suspected outlier if it falls more than 1.5×IQR above the 3rd quartile or below the 1st quartile.

• Example: Consider the highway mileage data from the five-number summary example, with an extra observation of 66.

Variable    N    Mean    Min   Q1   Median   Q3   Max
Mileages    21   24.67   13    18   23       28   66

The IQR = 28 - 18 = 10, so 1.5×IQR = 15. The largest observation, 66, lies above Q3 + 15 = 43, i.e. more than 1.5×IQR above Q3, and is therefore a suspected outlier.
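The 1.5×IQR criterion can be sketched in Python as follows. The quartiles come from `statistics.quantiles` with `method="exclusive"`, which reproduces Q1 = 18 and Q3 = 28 for these 21 mileages; other quartile conventions could shift the fences slightly. The function name `suspected_outliers` is illustrative.

```python
from statistics import quantiles

def suspected_outliers(values):
    """Flag values more than 1.5 * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # lower and upper fences
    return [v for v in values if v < low or v > high]

mileage = [13, 15, 16, 16, 17, 19, 20, 22, 23, 23,
           23, 24, 25, 25, 26, 28, 28, 28, 29, 32, 66]
flagged = suspected_outliers(mileage)
print(flagged)
```

With Q1 = 18 and Q3 = 28 the fences are 3 and 43, so only the value 66 is flagged.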