INTRODUCTION - METU | Middle East Technical University


IAM 530 ELEMENTS OF PROBABILITY AND STATISTICS INTRODUCTION


WHAT IS STATISTICS?

Statistics is a science of collecting data, organizing and describing it, and drawing conclusions from it. That is, statistics is a way to get information from data. It is the science of uncertainty.


WHAT IS STATISTICS?

• A pharmaceutical CEO wants to know whether a new drug is superior to existing drugs, or what its possible side effects are.

• How fuel-efficient is a certain car model?

• Is there any relationship between your GPA and employment opportunities?

• Actuaries want to determine “risky” customers for insurance companies.

STEPS OF STATISTICAL PRACTICE

• Preparation: Set clearly defined goals and questions of interest for the investigation.

• Data collection: Make a plan of which data to collect and how to collect it.

• Data analysis: Apply appropriate statistical methods to extract information from the data.

• Data interpretation: Interpret the information and draw conclusions.

STATISTICAL METHODS

• Descriptive statistics include the collection, presentation and description of numerical data.

• Inferential statistics include making inferences and decisions from the collected data using appropriate statistical methods.

• Model building includes developing prediction equations to understand a complex system.

BASIC DEFINITIONS

• POPULATION: The collection of all items of interest in a particular study.

• SAMPLE: A set of data drawn from the population; a subset of the population available for observation.

• PARAMETER: A descriptive measure of the population, e.g., the mean.

• STATISTIC: A descriptive measure of a sample.

• VARIABLE: A characteristic of interest about each element of a population or sample.

EXAMPLE

Population: All students currently enrolled in school
• Unit: Student
• Sample: Any department
• Variables: GPA; hours of work per week

Population: All books in library
• Unit: Book
• Sample: Statistics books
• Variables: Replacement cost; frequency of check out; repair needs

Population: All campus fast food restaurants
• Unit: Restaurant
• Sample: Burger King
• Variables: Number of employees; seating capacity; hiring/not hiring

Note that some samples are not representative of the population and shouldn’t be used to draw conclusions about the population.

How not to run a presidential poll

For the 1936 election, the Literary Digest picked names at random out of telephone books in some cities and sent these people some ballots, attempting to predict the election results, Roosevelt versus Landon, by the returns. Now, even if 100% returned the ballots, even if all told how they really felt, even if all would vote, even if none would change their minds by election day, still this method could be (and was) in trouble: They estimated a conditional probability in that part of the American population which had phones and showed that that part was not typical of the total population. [Dudewicz & Mishra, 1988]

STATISTIC

A statistic (or estimator) is any function of the random variables (r.v.s) of a random sample (r.s.) that does not contain any unknown quantity. E.g.

• $\sum_{i=1}^{n} X_i$, $\sum_{i=1}^{n} X_i / n$, $\min_i(X_i)$ and $\max_i(X_i)$ are statistics.

• $\sum_{i=1}^{n} X_i - \mu$ and $\sum_{i=1}^{n} X_i / \sigma$ are NOT, because they contain the unknown quantities $\mu$ and $\sigma$.

• Any observed or particular value of an estimator is an estimate.
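To make the distinction between an estimator and an estimate concrete, the sketch below computes the sum, the mean, the minimum and the maximum of a small sample. The sample values and the function name `observed_statistics` are invented for illustration and are not part of the course material.

```python
def observed_statistics(sample):
    """Return estimates: values of statistics computed from the sample alone."""
    n = len(sample)
    total = sum(sample)
    return {
        "sum": total,          # observed value of sum of X_i
        "mean": total / n,     # observed value of sum of X_i divided by n
        "min": min(sample),    # observed value of min(X_i)
        "max": max(sample),    # observed value of max(X_i)
    }

# Hypothetical observations x_1, ..., x_5 (invented values).
sample = [4.1, 5.3, 3.8, 6.0, 4.8]
estimates = observed_statistics(sample)
print(estimates)
```

Every quantity here is computable from the data alone, which is what qualifies each function as a statistic.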

RANDOM VARIABLES

• Variables whose observed value is determined by chance.

• A r.v. is a function defined on the sample space S that associates a real number with each outcome in S.

• R.v.s are denoted by uppercase letters, and their observed values by lowercase letters.

Example: Consider the random variable X, the number of brown-eyed children born to a couple heterozygous for eye color (each with genes for both brown and blue eyes). If the couple is assumed to have 2 children, X can assume any of the values 0, 1, or 2. The variable is random in that brown eyes depend on the chance inheritance of a dominant gene at conception. If for a particular couple there are two brown-eyed children, we have x = 2.
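The eye-color example can be sketched in code by treating X as a function on the sample space of the two children. The sketch assumes a simple dominant/recessive model in which each child of two heterozygous parents independently has brown eyes with probability 3/4; that probability and the name `distribution_of_x` are assumptions for illustration, not stated on the slide.

```python
from fractions import Fraction
from itertools import product

P_BROWN = Fraction(3, 4)  # assumption: simple dominant/recessive inheritance

def distribution_of_x(children=2):
    """Map each value of X (number of brown-eyed children) to its probability."""
    pmf = {}
    # Each outcome in the sample space is a tuple such as (True, False):
    # first child brown-eyed, second child not.
    for outcome in product([True, False], repeat=children):
        prob = Fraction(1)
        for brown in outcome:
            prob *= P_BROWN if brown else 1 - P_BROWN
        x = sum(outcome)  # the random variable assigns a number to the outcome
        pmf[x] = pmf.get(x, Fraction(0)) + prob
    return pmf

pmf = distribution_of_x()
for x in sorted(pmf):
    print(x, pmf[x])
```

Under these assumptions the three possible values 0, 1 and 2 receive probabilities that sum to 1.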

COLLECTING DATA

• Target Population: The population about which we want to draw inferences.

• Sampled Population: The actual population from which the sample has been taken.

SAMPLING PLAN

• Simple Random Sample (SRS): All possible samples with the same number of observations are equally likely to be selected.

• Stratified Sampling: The population is separated into mutually exclusive sets (strata), and then a simple random sample is drawn from each stratum.

• Convenience Sample: It is obtained by selecting individuals or objects without systematic randomization.
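A minimal sketch of the first two sampling plans, using Python's standard `random` module. The population of 100 numbered units, the two strata and the function names are hypothetical, chosen only to show the mechanics.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n units without replacement; every subset of size n is equally likely."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample of fixed size from each stratum."""
    return {
        name: rng.sample(members, n_per_stratum)
        for name, members in strata.items()
    }

rng = random.Random(0)  # fixed seed so the sketch is reproducible
population = list(range(1, 101))  # hypothetical units numbered 1..100
strata = {
    "units_1_to_50": population[:50],
    "units_51_to_100": population[50:],
}

srs = simple_random_sample(population, 10, rng)
stratified = stratified_sample(strata, 5, rng)
print(srs)
print(stratified)
```

The stratified draw guarantees five units from each stratum, whereas the simple random sample may by chance over-represent one half of the population.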

EXAMPLE

• A manufacturer of computer chips claims that less than 10% of his products are defective. When 1000 chips were drawn from a large production run, 7.5% were found to be defective.

• What is the population of interest? The complete production run of the computer chips.

• What is the sample? The 1000 chips.

• What is the parameter? The proportion of all chips that are defective.

• What is the statistic? The proportion of sample chips that are defective.

• Does the value 10% refer to a parameter or a statistic? Parameter.

• Explain briefly how the statistic can be used to make inferences about the parameter to test the claim. Because the sample proportion is less than 10%, we can conclude that the claim may be true.
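One common way to make "may be true" more precise, which goes beyond what the slide states, is a one-sided z-test for a proportion based on the normal approximation. The sketch below is only an illustration under that assumption; the function name `one_sided_z_test` is hypothetical.

```python
import math

def one_sided_z_test(p_hat, p0, n):
    """z statistic and lower-tail p-value for H0: p = p0 against H1: p < p0.

    Assumes n is large enough for the normal approximation to hold.
    """
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Standard normal CDF evaluated at z, written with the error function.
    p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return z, p_value

z, p_value = one_sided_z_test(p_hat=0.075, p0=0.10, n=1000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

A small p-value here would indicate that a sample proportion this far below 10% is unlikely if the true proportion were exactly 10%, which is consistent with the manufacturer's claim.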

DESCRIPTIVE STATISTICS

Descriptive statistics involves the arrangement, summary, and presentation of data, to enable meaningful interpretation, and to support decision making.

• Descriptive statistics methods make use of
– graphical techniques
– numerical descriptive measures.

• The methods presented apply both to
– the entire population
– the sample

Types of data and information

• A variable: a characteristic of a population or sample that is of interest to us.
– Cereal choice
– Expenditure
– The waiting time for medical services

• Data: the observed values of variables
– Interval and ratio data are numerical observations. In ratio data, the ratio of two observations is meaningful and the value 0 means that none of the quantity is present. E.g., weight is ratio data; temperature is interval data.
– Nominal data are categorical observations
– Ordinal data are ordered categorical observations

Types of data – examples

Examples of types of data

Quantitative
• Continuous: Blood pressure, height, weight, age
• Discrete: Number of children; number of attacks of asthma per week

Categorical (Qualitative)
• Ordinal (ordered categories): Grade of breast cancer; better, same, worse; disagree, neutral, agree
• Nominal (unordered categories): Sex (male/female); alive or dead; blood group O, A, B, AB

Types of data – analysis

Knowing the type of data is necessary to properly select the technique to be used when analyzing data.

Types of descriptive analysis allowed for each type of data:
• Numerical data: arithmetic calculations
• Nominal data: counting the number of observations in each category
• Ordinal data: computations based on an ordering process

Types of data - examples

Numerical data (age, income): 55, 75000; 42, 68000; ...

Nominal data (person, marital status): 1, married; 2, single; 3, single; ...

Nominal data (computer brand): IBM, Dell, IBM, ...

Types of data - examples

Nominal data

A descriptive statistic for nominal data is the proportion of data that falls into each category.

Brand     Count   Proportion
IBM       25      50%
Dell      11      22%
Compaq     8      16%
Other      6      12%
Total     50
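The proportions in the computer-brand table can be reproduced from the counts with a few lines of Python. The dictionary layout is just one convenient representation and is not prescribed by the slides.

```python
# Category counts taken from the table of computer brands.
counts = {"IBM": 25, "Dell": 11, "Compaq": 8, "Other": 6}

total = sum(counts.values())
proportions = {brand: count / total for brand, count in counts.items()}

for brand, share in proportions.items():
    print(f"{brand}: {share:.0%}")
```

Counting and dividing by the total is the only arithmetic that makes sense for nominal data, since the categories themselves have no numerical value or order.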

Cross-Sectional/Time Series/Panel Data

• Cross-sectional data is collected at a certain point in time
– Test score in a statistics course
– Starting salaries of the graduates of an MBA program

• Time series data is collected over successive points in time
– Weekly closing price of gold
– Amount of crude oil imported monthly

• Panel data is collected over successive points in time as well

Differences between cross-sectional, time series and panel data

Change in time
• Cross-sectional: Cannot measure
• Time series: Can measure
• Panel: Can measure

Properties of the series
• Cross-sectional: No series
• Time series: Long; usually just one or a few series
• Panel: Short; hundreds of series

Measurement time
• Cross-sectional: Measurement only at one time point; even if more than one time point, samples are independent from each other
• Time series: Usually at regular time points (all series are taken at the same time points and time points are equally spaced)
• Panel: Varies

Measurements
• Cross-sectional: Response(s); time-independent covariates
• Time series: Response(s); time; usually no covariate
• Panel: Response(s); time; time-dependent and time-independent covariates

GAMES OF CHANCE


COUNTING TECHNIQUES

• Methods to determine how many subsets can be obtained from a set of objects are called counting techniques.

FUNDAMENTAL THEOREM OF COUNTING: If a job consists of $k$ separate tasks, the $i$-th of which can be done in $n_i$ ways, $i = 1, 2, \ldots, k$, then the entire job can be done in $n_1 \times n_2 \times \cdots \times n_k$ ways.
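The multiplication rule can be checked by brute force: list every way of doing the whole job and compare the count with the product of the task sizes. The three-course menu below is a made-up job with k = 3 tasks, used only for illustration.

```python
from itertools import product
from math import prod

# Hypothetical job with k = 3 tasks: choose a starter, a main course, a dessert.
tasks = [
    ["soup", "salad"],             # n_1 = 2 ways
    ["fish", "chicken", "pasta"],  # n_2 = 3 ways
    ["cake", "fruit"],             # n_3 = 2 ways
]

ways = list(product(*tasks))  # every complete way of doing the job
expected = prod(len(options) for options in tasks)  # n_1 x n_2 x n_3

print(len(ways), expected)
```

Enumeration becomes impractical quickly, which is why the product formula is used instead of listing outcomes.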

THE FACTORIAL

• $n!$ is the number of ways in which $n$ objects can be permuted: $n! = n(n-1)(n-2)\cdots 2 \cdot 1$, with $0! = 1$ and $1! = 1$.

Example: The possible permutations of {1,2,3} are {1,2,3}, {1,3,2}, {3,1,2}, {2,1,3}, {2,3,1}, {3,2,1}. So, there are 3! = 6 different permutations.
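As a quick check of the factorial example, the following sketch lists the orderings of {1,2,3} with the standard library and compares their number with 3!.

```python
from itertools import permutations
from math import factorial

orderings = list(permutations([1, 2, 3]))  # each ordering is a tuple
print(orderings)
print(len(orderings), factorial(3))
```

`math.factorial` also follows the convention 0! = 1.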

COUNTING

Partition Rule:

There exists a single set of $N$ distinctly different elements which is partitioned into $k$ sets; the first set containing $n_1$ elements, …, the $k$-th set containing $n_k$ elements. The number of different partitions is

$$\frac{N!}{n_1!\, n_2! \cdots n_k!}, \quad \text{where } N = n_1 + n_2 + \cdots + n_k.$$

COUNTING

• Example: Let’s partition {1,2,3} into two sets; first with 1 element, second with 2 elements.

• Solution:
Partition 1: {1} {2,3}
Partition 2: {2} {1,3}
Partition 3: {3} {1,2}
3!/(1! 2!) = 3 different partitions

Example

• How many different arrangements can be made of the letters “ISI”?

1st letter   2nd letter   3rd letter
I            I            S
I            S            I
S            I            I

N=3, n1=2, n2=1; 3!/(2!1!)=3
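The partition rule translates directly into a short function, and for a small word such as “ISI” the result can be confirmed by listing all orderings and discarding duplicates. The function names `partition_count` and `arrangements` are illustrative choices, not notation from the course.

```python
from collections import Counter
from itertools import permutations
from math import factorial

def partition_count(sizes):
    """N! / (n_1! n_2! ... n_k!), where N is the sum of the group sizes."""
    result = factorial(sum(sizes))
    for size in sizes:
        result //= factorial(size)  # each intermediate quotient is an integer
    return result

def arrangements(word):
    """Number of distinct orderings of the letters of word."""
    letter_counts = Counter(word)  # e.g. {"I": 2, "S": 1} for "ISI"
    return partition_count(list(letter_counts.values()))

brute_force = len(set(permutations("ISI")))  # distinct orderings by enumeration
print(arrangements("ISI"), brute_force)
```

The brute-force check is only feasible for short words; the formula works for any length.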

Example

• How many different arrangements can be made of the letters “statistics”?

• N=10, n1=3 (s), n2=3 (t), n3=1 (a), n4=2 (i), n5=1 (c)

$$\frac{10!}{3!\,3!\,1!\,2!\,1!} = 50400$$

COUNTING

1. Ordered, without replacement (e.g. picking the first 3 winners of a competition)
2. Ordered, with replacement (e.g. tossing a coin and observing a Head in the k-th toss)
3. Unordered, without replacement (e.g. 6/49 lottery)
4. Unordered, with replacement (e.g. picking up red balls from an urn that has both red and green balls & putting them back)

PERMUTATIONS

• Any ordered sequence of $r$ objects taken from a set of $n$ distinct objects is called a permutation of size $r$ of the objects.

$$P_{n,r} = \frac{n!}{(n-r)!} = n(n-1)\cdots(n-r+1)$$

COMBINATION

• Given a set of $n$ distinct objects, any unordered subset of size $r$ of the objects is called a combination.

$$C_{n,r} = \binom{n}{r} = \frac{n!}{r!\,(n-r)!}$$

Properties: $\binom{n}{0} = 1$, $\binom{n}{n} = 1$, $\binom{n}{r} = \binom{n}{n-r}$

COUNTING

Number of possible arrangements of size $r$ from $n$ objects:

            Without Replacement     With Replacement
Ordered     $\frac{n!}{(n-r)!}$     $n^r$
Unordered   $\binom{n}{r}$          $\binom{n+r-1}{r}$
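The four counting cases can be verified on a small instance by comparing the closed-form counts with explicit enumeration from `itertools`. The choice n = 4, r = 2 and the dictionary keys are arbitrary and only serve the illustration.

```python
from itertools import (
    combinations,
    combinations_with_replacement,
    permutations,
    product,
)
from math import comb, perm

def counts(n, r):
    """Closed-form counts for arrangements of size r from n objects."""
    return {
        "ordered, without replacement": perm(n, r),         # n! / (n - r)!
        "ordered, with replacement": n ** r,
        "unordered, without replacement": comb(n, r),
        "unordered, with replacement": comb(n + r - 1, r),
    }

n, r = 4, 2
items = range(n)
enumerated = {
    "ordered, without replacement": len(list(permutations(items, r))),
    "ordered, with replacement": len(list(product(items, repeat=r))),
    "unordered, without replacement": len(list(combinations(items, r))),
    "unordered, with replacement": len(
        list(combinations_with_replacement(items, r))
    ),
}

print(counts(n, r))
print(counts(n, r) == enumerated)
```

Each `itertools` function corresponds to exactly one of the four cases, which makes it a convenient cross-check when deciding which formula applies to a problem.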

EXAMPLE

• How many different ways can we arrange 3 books (A, B and C) on a shelf?

• Order is important; without replacement

• n=3, r=3; n!/(n-r)! = 3!/0! = 6, or

(possible number of books for the 1st place on the shelf) × (possible number of books for the 2nd place on the shelf) × (possible number of books for the 3rd place on the shelf) = 3 × 2 × 1 = 6

EXAMPLE, cont.

• How many different ways can we arrange 3 books (A, B and C) on a shelf?

1st book   2nd book   3rd book
A          B          C
A          C          B
B          A          C
B          C          A
C          A          B
C          B          A

EXAMPLE

• Lotto games: Suppose that you pick 6 numbers out of 49.

• What is the number of possible choices
– if the order does not matter and no repetition is allowed?

$$\binom{49}{6} = 13{,}983{,}816 \approx 14 \text{ million}$$

– if the order matters and no repetition is allowed?

$$\frac{49!}{43!} = 49 \times 48 \times 47 \times 46 \times 45 \times 44 \approx 10^{10}$$
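Both lotto counts can be reproduced with the standard library functions `math.comb` and `math.perm` (available from Python 3.8), as in this short sketch.

```python
from math import comb, factorial, perm

unordered = comb(49, 6)  # order does not matter, no repetition
ordered = perm(49, 6)    # order matters, no repetition: 49! / 43!

print(f"{unordered:,}")
print(f"{ordered:,}")
# Each unordered choice of 6 numbers corresponds to 6! ordered sequences.
print(ordered == unordered * factorial(6))
```

The last line shows how the two answers are related: dividing the ordered count by 6! removes the ordering of the six chosen numbers.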