LIS 397.1 Introduction to Research in Library and


LIS 386.13 Information Technologies and the Information Professions

Introduction to Statistics R. E. Wyllys Copyright © 2001 by R. E. Wyllys Last revised 2001 Sep 2 School of Information - The University of Texas at Austin LIS 386.13, Information Technologies & the Information Professions

Lesson Objectives

• You will acquire an introductory understanding of
  – Descriptive statistics
  – Inferential statistics

Statistics Is a Tool for Making Decisions

• “Statistics is a method of decision making in the face of uncertainty, on the basis of numerical data, and at calculated risks.”*

*From: Chou, Ya-Lun. (1969). Statistical Analysis with Business and Economic Applications. New York, NY: Holt, Rinehart and Winston.

Two Branches of Statistics

• Descriptive Statistics
  – Numbers that describe or characterize groups: mean, median, mode, total, range, standard deviation, proportion, etc.
  – Presentations of such numbers in charts and tables
• Inferential Statistics
  – The use of numbers from samples to provide generalizations (inferences) about the populations from which the samples came

Statistical Terms

• Population or Universe
  – The set of entities in which the investigator is interested, typically with the purpose of seeking the value of certain characteristics of the population
  – “Population” is the term generally used in social statistics; “universe,” in the mathematical theory of statistics
• Sample
  – A subset of a population that is examined and from which inferences are drawn about the characteristics of the population

Statistical Terms

• Parameter
  – Measurable characteristic of a population (universe)
• Sample value
  – Measurable characteristic of a sample
  – Value that a population parameter takes on in a particular sample

Note: Some writers use the term “statistic” to refer to samples only, not to populations. In LIS 386.13 we do not restrict it that way.

Types of Data

• There are four main types of data that it is useful to distinguish
  – Categorical (nominal)
  – Ordinal
  – Interval
  – Ratio
• The next three slides discuss these data types in more detail.

Types of Data (cont'd)

• Categorical (nominal) Numbers
  – Numbers used like names, e.g., faces of a die
  – No order is implied (although order may be used for convenience in listing numbers for human viewing)
• Ordinal Numbers
  – Numbers used to indicate position in a sequence, e.g., in rank number
  – Order is indicated, but “distance” between positions is not indicated

Types of Data (cont'd)

• Interval Numbers
  – Express order and distances or differences
  – Result from measurements and counts
  – May lack a true zero
• Ratio Numbers
  – Express order and distances or differences
  – Result from measurements and counts
  – Do have a true zero

Note that all ratio numbers are interval numbers, but not all interval numbers are ratio numbers.

Interval vs. Ratio Numbers

• A temperature of 68 °F might look twice as hot as 34 °F. But the same temperatures expressed in the Celsius scale are 20 °C and 1.1 °C, which differ by a factor of about 18. The apparent ratios disagree because the zeros in both the Celsius and Fahrenheit scales are arbitrary (though the Celsius zero makes more sense than the Fahrenheit zero).

• Only in the Kelvin and Rankine scales, where the zero of the scale is a true zero, Absolute Zero (-273.15 °C), do temperatures express true relative hotness. Since 34 °F = 274 K and 68 °F = 293 K, you can see that 68 °F is nowhere near twice as hot as 34 °F.

• Celsius and Fahrenheit are interval, but not ratio, scales; Kelvin is a ratio (and, hence, also an interval) scale.
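The comparison on this slide can be checked with a short calculation. The following Python sketch is an illustration added to this transcript, not part of the original slides; the function and variable names are assumptions chosen for readability. It converts the two Fahrenheit readings and compares their ratios on each scale.

```python
def fahrenheit_to_celsius(deg_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0


def fahrenheit_to_kelvin(deg_f):
    """Convert degrees Fahrenheit to kelvins (absolute zero is -273.15 °C)."""
    return fahrenheit_to_celsius(deg_f) + 273.15


low_f, high_f = 34.0, 68.0

# Ratio of the two readings on each scale.
ratio_fahrenheit = high_f / low_f
ratio_celsius = fahrenheit_to_celsius(high_f) / fahrenheit_to_celsius(low_f)
ratio_kelvin = fahrenheit_to_kelvin(high_f) / fahrenheit_to_kelvin(low_f)

print(f"Fahrenheit ratio: {ratio_fahrenheit:.2f}")
print(f"Celsius ratio:    {ratio_celsius:.2f}")
print(f"Kelvin ratio:     {ratio_kelvin:.2f}")
```

Only the Kelvin ratio (about 1.07) says anything about relative hotness; the Fahrenheit and Celsius ratios change with the arbitrary choice of zero.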

Why Use Samples?

• Observing a whole population would enable certainty about its characteristics, so why settle for uncertainty?
• If it is feasible to do so, you may observe the whole population
• But observing the whole population may be
  – Logically impossible
    • Infinite populations
    • Future populations
  – Destructive
  – Too expensive
  – Too time-consuming

Virtues of Sampling

• Reduced cost
• Results available sooner
• Broader scope yields more information
  – 1 variable observed on each of 1000 elements in population vs. 10 variables observed on each of 100 elements in sample
    • This offers the opportunity not only to study 9 additional variables but also to investigate possible correlations or associations among all 10 variables, all at the cost of only slightly less accuracy with respect to the 1 variable with the 1000 observations
• Greater accuracy
  – More attention possible to each observation

Random Sampling Is Basic to Statistical Inference

• The mathematical theory of statistics requires that each element of the population have a known chance of being chosen in a sample
• In practical terms, this usually translates to: Each element of the population must have the same chance as any other element of being chosen in a sample
  – Strictly speaking, this is called equiprobable sampling
• Ordinarily, the term "random sampling" is used loosely to describe both known-probability and equiprobable sampling.

Statistics as a Tool for LIS

• LIS problems
  – concern people and other complex entities
  – concern interactions among these complex entities
  – typically involve many contributing, or potentially contributing, variables
• Many LIS problems involve variables that exhibit random or probabilistic variation

Experiments in Sampling

• Toss a coin 10 times, writing down the exact sequence of Heads and Tails. Also write down the proportion of Heads (e.g. 0.6 if there are 6 Heads).

• Repeat this experiment 3 more times.

• You probably got 3 or 4 different values for the proportion of Heads, even though, as you already know, you expect that proportion to be 0.5 in the long run.

• Calculate the proportion of Heads in all 40 tosses of the coin. Note that it is closer to 0.5 than are most of the 10-toss proportions.

• This should suggest to you one of the benefits of taking large samples whenever feasible.
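If no coin is handy, the experiment can be imitated in software. The Python sketch below is an added illustration, not part of the original slides; the helper names and the seed value are assumptions. It simulates four runs of 10 tosses of a fair coin and then pools all 40 tosses.

```python
import random


def toss_coin(rng, n_tosses):
    """Return a list of n_tosses outcomes, each 'H' or 'T', from a fair coin."""
    return [rng.choice("HT") for _ in range(n_tosses)]


def proportion_heads(tosses):
    """Proportion of Heads in a sequence of tosses."""
    return tosses.count("H") / len(tosses)


rng = random.Random(2001)  # arbitrary fixed seed so reruns give the same tosses

# Four experiments of 10 tosses each, as on the slide.
experiments = [toss_coin(rng, 10) for _ in range(4)]
proportions = [proportion_heads(tosses) for tosses in experiments]

# Pool all 40 tosses into one larger sample.
all_tosses = [outcome for tosses in experiments for outcome in tosses]
pooled = proportion_heads(all_tosses)

for tosses, p in zip(experiments, proportions):
    print("".join(tosses), p)
print("All 40 tosses:", pooled)
```

Changing the seed gives a different set of tosses; rerunning with several seeds shows how much the 10-toss proportions wander compared with the 40-toss proportion.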


Random Sampling Can be Tricky

• Random sampling aims to eliminate all anticipatable biases, i.e., tendencies to favor one element or class of elements over another in the population
• Think about how you would choose randomly from among
  – Lumps of coal in a gondola car
  – Laboratory rats in a large cage
  – Human volunteers in a survey
• What problems might you encounter? Please think carefully about this question before you move to the next slide to see my answer.

Random Sampling Can be Tricky

• It can be difficult to choose truly at random
  – Lumps of coal in a gondola car get shaken during the train's journey. Small lumps sift to the bottom, so that the easily chosen lumps on top tend to be larger than the average lump in the whole car.
  – Laboratory rats in a large cage will try to evade the person who is trying to capture them for an experiment. This means that the rats that get captured will tend to be among the less vigorous, less speedy rats.
  – Human volunteers in a survey tend to be those with an interest in the purpose of the survey or in making money as paid subjects. Either way, a group of volunteers rarely resembles the kind of group that would be chosen at random.

Random Sampling

• How to choose, randomly, population elements for a sample?
  – Mechanical means
    • Toss a coin
    • Toss a die or dice
    • Draw straws or cards
    • Random interval timers
  – Use a printed table of random numbers
  – Use a random-number generator in a computer program
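As one illustration of the last option, the following Python sketch (added to this transcript, not from the original slides) draws an equiprobable sample with the standard library's random-number generator. The population of 500 numbered library patrons, the sample size, and the seed are all hypothetical.

```python
import random

# Hypothetical population: 500 library patrons identified by number.
population = list(range(1, 501))

rng = random.Random(386)  # arbitrary fixed seed so the draw can be repeated

# sample() draws without replacement, and every element has the same
# chance of being chosen, i.e., this is equiprobable sampling.
sample = rng.sample(population, k=25)

print(sorted(sample))
```

The sketch assumes a complete list of the population is available, which is often the hard part in practice.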

Basic Statistical Measures

• Measures of Central Tendency (central or typifying values for a set of numbers)
  – Mean (arithmetic average)
  – Median
  – Mode
• Measures of Dispersion (the scatteredness, the variability of a set of numbers)
  – Range
  – Standard deviation and variance

Basic Statistical Measures

• Measures of central tendency apply to sets of 1 or more elements
• Measures of dispersion apply only to sets of 2 or more elements
• Variability pervades the entire biological world and the world of human activity
• Variability is usually vastly under-recognized and under-estimated

Basic Statistical Measures

• The mean is the arithmetic average of a set of numbers. For example,
  – The mean of 2, 5, and 8 is 5, since 15 ÷ 3 = 5.
    • Note: In statistical work, this equation would usually be written as 15/3 = 5.
  – The mean of 1, 3, 2, and 8 is 3.5, since 14/4 = 3.5.
    • Note that this mean is not among the set of numbers from which it was calculated.
    • That is, the mean of a set of numbers can be a number that cannot characterize any element of the set.
    • Think of 1, 3, 2, and 8 as the numbers of children in four different families to see the point of the previous remark.

Basic Measures of Central Tendency

• If we make a set of observations, we can call them X in general, or X_i if we want to emphasize the individual observations: e.g., X_1, X_2, . . ., X_97, etc.

• Using this notation, and using ∑ to represent the operation of summation (i.e., adding things up, as you may recall from your high-school algebra), we can write the mean of n observations X as

\[ \bar{X} = \frac{\sum X}{n} \quad \text{or} \quad \bar{X} = \frac{\sum_{j=1}^{n} X_j}{n} \]
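The summation formula for the mean translates almost word for word into code. This Python sketch is an added illustration, not part of the original slides; the function name is an assumption. It is applied to the two small sets used as examples on the earlier slide.

```python
def mean(observations):
    """Arithmetic mean: the sum of the observations X_j divided by n."""
    n = len(observations)
    if n == 0:
        raise ValueError("the mean needs at least one observation")
    return sum(observations) / n


print(mean([2, 5, 8]))     # 15/3
print(mean([1, 3, 2, 8]))  # 14/4
```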

Basic Measures of Central Tendency

• The median of a set of observations is the middle value of the observations when they have been arranged in order. This turns out to be slightly different for odd numbers of observations than for even numbers of observations. For example,
  – The median of the five observations 1, 3, 15, 16, and 17 is 15. As this example shows, the median has the property that just as many observations are smaller than it as are larger than it, making it a meaningful middle value.
  – Determining the median of the six observations 1, 2, 3, 5, 8, and 9 requires us to agree that what we will mean by the "middle value" is the half-way point between the middle pair of observations. Here the middle pair is 3 and 5, and the half-way point between them is 4. (More complicated definitions exist, but the half-way point idea conveys the essence of the median satisfactorily for most practical purposes.)
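The odd and even cases can be made concrete with a short sketch. The Python below is an added illustration, not part of the original slides; it uses the half-way point convention described above, and the function name is an assumption.

```python
def median(observations):
    """Middle value of the ordered observations.

    For an even number of observations, return the half-way point
    between the middle pair.
    """
    ordered = sorted(observations)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median needs at least one observation")
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]                          # odd: single middle value
    return (ordered[middle - 1] + ordered[middle]) / 2  # even: half-way point


print(median([1, 3, 15, 16, 17]))
print(median([1, 2, 3, 5, 8, 9]))
```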

Basic Measures of Central Tendency

• The mode of a set of observations is the most frequently occurring value in the set (provided that there is a most frequent value). For example,
  – The mode of the observations 1, 2, 2, 3, 4, 5 is 2.
  – The set of observations 1, 2, 3, 4, 5 has no mode.
  – The set of observations 1, 2, 3, 3, 4, 5, 5 has, strictly speaking, no mode. However, it is sometimes convenient to call such a set bi-modal, i.e., to allow a set to have 2 modes: e.g., 3 and 5 in this example.

• Though it would be logical to speak of tri-modal, quadri-modal, etc. sets of observations, the basic idea of the mode is rarely stretched beyond allowing for two modes.
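A minimal Python sketch of this idea follows. It is an added illustration, not part of the original slides, and the function name is an assumption. It returns every most-frequent value, so a bi-modal set yields two values, and it returns an empty list when no value occurs more than once.

```python
from collections import Counter


def modes(observations):
    """Return the most frequent value(s), or [] if no value repeats."""
    counts = Counter(observations)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once, so there is no mode
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([1, 2, 2, 3, 4, 5]))
print(modes([1, 2, 3, 4, 5]))
print(modes([1, 2, 3, 3, 4, 5, 5]))
```

Returning a list rather than a single value is a design choice made here so that the "no mode" and "two modes" cases are visible to the caller.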


Basic Measures of Central Tendency

• Why are there so many measures of central tendency?
  – Because each has its own special advantages and disadvantages.

Advantages and Disadvantages of the Mean

• Advantage: The mean always exists, and can always be calculated by a basically simple formula (even though large numbers of observations can be very tedious to handle). For this reason, the mean is readily usable as a basis for building a theory for making statistical inferences.

• Disadvantage: Extremely large or extremely small values among the set of observations can "distort" the mean into a value that fails to be useful as a characteristic of the set. For example,
  – The mean of 1, 2, and 1,000,000 is 333,334.33, which fails to provide any useful idea about the actual observations in this set.
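The distortion is easy to see by computing the mean and the median of the same three numbers. This Python sketch is an added illustration, not part of the original slides; it relies only on the standard library's statistics module.

```python
import statistics

observations = [1, 2, 1_000_000]

mean_value = statistics.mean(observations)      # pulled far upward by the one huge value
median_value = statistics.median(observations)  # the middle observation

print(f"mean   = {mean_value:,.2f}")
print(f"median = {median_value}")
```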


Advantages and Disadvantages of the Median

• Advantage: The median is a useful way of describing sets of observations that are skewed (i.e., distorted) by including extremely large or small values.
  – The classic example is that of household incomes. Few households have very small positive incomes (or even negative incomes), but in almost any set of households (e.g., in a city or state) there are a few households with extremely high incomes (e.g., the households of Michael Milken in the 1980s, or Michael Dell and Bill Gates in the 1990s, with annual incomes in the hundreds of millions of dollars).

Advantages and Disadvantages of the Median (cont'd)

  – Because the mean is badly distorted by extreme incomes, it has become conventional to express household incomes in terms of the median.
    • For example, the Bureau of the Census might report that the median household income in Bellevue, Washington is, say, $40,000.
    • The interpretation is that half of all the households in Bellevue have incomes below $40,000 and half have incomes above $40,000, so that the $40,000 median is a meaningful characteristic of the set of household incomes.
    • In contrast, the mean household income in Bellevue might be well over $1,000,000 since Bill Gates's house is located there.

Advantages and Disadvantages of the Median (cont'd)

• Disadvantage: Because the median is defined somewhat differently for odd numbers than for even numbers of observations, it has been historically less usable than the mean as a basis for building a theory of statistical inference.

Advantages and Disadvantages of the Mode

• Advantage: If a set of observations has a mode, then that mode is a useful way of characterizing the set. For example, "The most common result of tossing two dice is that their top faces add up to 7."

• Disadvantage: Many sets of observations lack a mode because no observed value occurs more than once; other sets of observations may have several different "most frequent" values; and in either case, the notion of a mode has no useful value for characterizing the set of observations. Furthermore, because of these definitional difficulties, the mode fails to provide a solid basis for building a theory of statistical inference.
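The two-dice statement can be verified by listing all 36 equally likely outcomes and counting how often each total occurs. This Python sketch is an added illustration, not part of the original slides; the variable names are assumptions.

```python
from collections import Counter
from itertools import product

# Every (first die, second die) pair, each equally likely for fair dice.
totals = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# The mode of the totals is the sum that arises in the most ways.
most_common_total, ways = totals.most_common(1)[0]

for total in sorted(totals):
    print(total, totals[total])
print("Most common total:", most_common_total, "in", ways, "of 36 outcomes")
```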


Measures of Dispersion

• The range of a set of observations is the distance between the smallest and largest observations in the set. For example,
  – The range of the set of observations 2, 4, 7 is 5. The range of the set -10, -3, 4 is 14.
  – Sometimes it may be desirable to take into account the matter of whether the observations come from counts (in which case they are necessarily whole numbers) or from measurements (in which case the extent to which the measurements are rounded off can play a role).
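As a quick check of the two examples, here is a small Python sketch (an added illustration, not part of the original slides). It uses the simple largest-minus-smallest definition and ignores the rounding question just mentioned; the function name is an assumption, chosen to avoid clashing with Python's built-in range.

```python
def value_range(observations):
    """Distance between the smallest and largest observations."""
    if not observations:
        raise ValueError("the range needs at least one observation")
    return max(observations) - min(observations)


print(value_range([2, 4, 7]))
print(value_range([-10, -3, 4]))
```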

Measures of Dispersion (cont'd)

• The standard deviation s of a set of n observations X_i is calculated by a somewhat complicated formula:

\[ s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \]

  – In pre-computer days this formula required some effort to use. Now, however, computer programs (and even some sophisticated calculators) make the calculation trivial for the user, who needs only to provide the program with the numbers from the original observations.

Measures of Dispersion (cont'd)

• What does this complicated formula really do?
  – Inside the parentheses are the distances between each individual observation and the mean, i.e., the center, of the whole set of observations.
  – Each of these distances is squared, and the squared distances are added up.
  – This sum is divided by n - 1, which is almost the number of numbers in the sum; i.e., the result is an "adjusted average" of the squared distances.

\[ s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \]

Measures of Dispersion (cont'd)

  – It is worth noting that if n is large (say, over 30 or so), then there is really very little difference between dividing by n - 1 and dividing by n in obtaining this average squared distance between individual observations and the center (i.e., the mean) of the set of observations.
  – Finally, the formula takes the square root of this average squared distance, and calls it s, the standard deviation.
  – Speaking somewhat loosely, we can say that the standard deviation is a fancy average distance between the individual observations and the center of the set of observations.

\[ s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \]
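The steps just described (distances from the mean, squaring, summing, dividing, taking the square root) can be followed one at a time in code. This Python sketch is an added illustration, not part of the original slides; the function name, its flag, and the set of 100 observations are assumptions made for the example.

```python
import math


def standard_deviation(observations, divide_by_n_minus_1=True):
    """Standard deviation computed step by step from the formula."""
    n = len(observations)
    if n < 2:
        raise ValueError("dispersion needs at least two observations")
    center = sum(observations) / n                          # the mean
    squared_distances = [(x - center) ** 2 for x in observations]
    divisor = n - 1 if divide_by_n_minus_1 else n
    return math.sqrt(sum(squared_distances) / divisor)


small = [1, 2, 3, 4, 5, 6, 7, 8, 9]
large = list(range(1, 101))  # hypothetical set of 100 observations

for data in (small, large):
    with_n_minus_1 = standard_deviation(data)
    with_n = standard_deviation(data, divide_by_n_minus_1=False)
    print(len(data), round(with_n_minus_1, 3), round(with_n, 3))
```

With 9 observations the two divisors give noticeably different results; with 100 observations the two results differ by about half a percent.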

Advantages and Disadvantages of the Range

• Advantages: The range is very simple to calculate, and it provides a meaningful characteristic of a set of observations, viz., the total spread of the observations.

• Disadvantage: The range measures only the total spread; it fails to take any account of the scatteredness or clustering of the observations other than the two extremes. For example,
  – The set 1, 2, 3, 4, 5, 6, 7, 8, 9 has a range of 8.
  – But so also does the set 1, 9, 9, 9, 9, 9, 9, 9, 9, which is, overall, obviously less scattered than the previous set.

Advantages and Disadvantages of the Standard Deviation

• Advantages: The standard deviation can always be calculated (i.e., its definition, though complicated, never runs into logical difficulties). It provides a meaningful characteristic of a set of observations that takes every observation into account in developing a number to express the scatteredness of the observations. For example,
  – The set of observations 1, 2, 3, 4, 5, 6, 7, 8, 9 has a standard deviation s = 2.74.
  – The set of observations 1, 9, 9, 9, 9, 9, 9, 9, 9 has a standard deviation s = 2.67, reflecting the lesser scatteredness of this set compared with the first.
  – In short, the range fails to distinguish any difference in scatteredness between these two sets, but the standard deviation does measure a difference in their scatteredness.
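The comparison between the two sets can be reproduced with Python's standard statistics module, whose stdev function divides by n - 1 as in the formula above. This sketch is an added illustration, not part of the original slides; the variable names are assumptions.

```python
import statistics

evenly_spread = [1, 2, 3, 4, 5, 6, 7, 8, 9]
clustered = [1, 9, 9, 9, 9, 9, 9, 9, 9]

for label, data in (("evenly spread", evenly_spread), ("clustered", clustered)):
    total_spread = max(data) - min(data)  # the range: uses only the two extremes
    s = statistics.stdev(data)            # uses every observation
    print(f"{label}: range = {total_spread}, s = {s:.2f}")
```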


Advantages and Disadvantages of the Standard Deviation

• Disadvantage: The formula for the standard deviation is complicated (for humans, though not for computer programs, or even for sophisticated calculators).

Statistical Inference

• Although the descriptive statistics that we have been discussing are useful in their own right, they are also important as a basis for making inferences from a sample of observations to characteristics of the population from which the sample came. For example,
  – The mean of a sample can be used to suggest the likely value of the mean of the population.
  – The standard deviation of a sample can be used to suggest the likely value of the standard deviation of the population.

Statistical Inference

• The theory of statistical inference deals with how to use sample statistics in reasonable and suitable ways in order to enable us to infer various values of a population through observing one or more samples of observations drawn from that population.

• You will learn about inferential statistics in GSLIS research courses.


Goal of Statistics: To Pierce through the Haze of Obscuring Variation & Reveal Underlying Patterns
