The Art in the State of the Art


Biostatistics Yusuf Al-Gau’d

BDS, MSc, MSPH, MHPE, FFPH, ScD Prof. of Epidemiology, Biostatistics and Medical Education

Why do we need biostatistics?

• Main reason: handling variation
• Biological variation
  • Attributes differ not only among individuals but also within the same individual over time
  • Examples: height, weight, blood pressure, eye color ...
• Sample variation
  • Biomedical research projects are usually carried out on small numbers of study subjects

Why do we need to learn biostatistics?

• Essential for the scientific method of investigation
  • Formulate a hypothesis
  • Design a study to test the hypothesis objectively
  • Collect reliable and unbiased data
  • Process and evaluate the data rigorously
  • Interpret the results and draw appropriate conclusions
• Essential for understanding, appraising and critiquing the scientific literature

HEALTH STATUS IN JORDAN

Selected indicators 2012

Area (km²): 88778
Total population (million): 6.4
Crude birth rate per 1000 population: 28.1
Crude death rate per 1000 population: 7
Population growth rate (%): 2.2
Total fertility rate: 3.8

Age Distribution of Jordanian Population (2010)

Selected indicators

Adult illiteracy rate (% of 15+ years): males 3.5, females 10.0

Unemployment (%): 12.2

Mortality

Total life expectancy at birth (years): 73 (71.6 males and 74.4 females)

Infant mortality rate per 1000 live births: 23

Maternal mortality rate per 100000 live births: 19.1

Morbidity: Chronic and non-communicable diseases

Disease                    Total %   Males %   Females %
Hypertension               44.6      42.2      46.0
Diabetes                   19.5      19.7      19.8
Total blood cholesterol    36.0      35.6      36.8
HDL cholesterol (HDL-C)    59.6      61.5      59.2
Triglyceride               38.6      48.1      34.2
Overweight (BMI)           30.2      35.5      22.5
Obesity (BMI)              44.1      12.3      60.7
Metabolic syndrome         36.3      28.7      40.9

Cancer in Jordan 2000-2010

Cancer rate

Common Cancers

Road traffic accidents

Jordan has one of the highest rates of traffic-related accidents in the world.

In 2007 alone we lost 1,000 people, 18,000 were wounded, 6,000 of them seriously.

Every 10 hours someone is killed and every 35 hours a child is killed.

Considering the population of Jordan, this is a very high rate: for every 100,000 citizens, 307 are either killed or injured.

Communicable diseases

Communicable diseases have largely been controlled in Jordan

The trend of vaccine-preventable diseases has shown a remarkable decline in the last 20 years.

There is lack of information on the prevalence of hepatitis B and C virus infections.

Nutrition

Micronutrient deficiencies are common in our region

The prevalence of anemia was 34% in children under 5

The prevalence of low vitamin D (25(OH)D <30 ng/ml) was 37.3% in females as compared to 5.1% in males.

Resources

Hospitals: 104
Comprehensive health centers: 70
Primary health care centers: 378
Maternity and child health care centers: 431
Physicians per 10000 population: 24.5
Dentists per 10000 population: 8.2
Nurses per 10000 population: 33.0
Pharmacists per 10000 population: 12.0

Pharmaceuticals

The high cost of drugs is a major constraint.

Irrational use of drugs

Inadequate drug information services.

Public health

A shortage of public health providers

No specialized training in public health

Lack of capacity to respond to new and emerging health threats

Lack of public health programs (mainly prevention and screening programs) to serve populations

Challenges facing health development in Jordan

High rates of non-communicable diseases .

Considerable changes in lifestyles favouring the development of determinants and risk factors for chronic diseases, accidents, and injuries.

Lack of health system research as an integral part of national health development.

Jordan lacks appropriate policies and interventions that aim to improve the social, environmental and nutritional determinants of health, including poverty reduction strategies, promotion of healthy lifestyles and food safety.

Lack of strategies for addressing issues related to prevention and management of accidents and injuries

Inadequate coordination and partnership between health service providers and educational institutions for health professionals.

Lack of integration of the priority programmes within primary health care.

Inadequate coordination between the public sector and the rapidly expanding private sector.

Lack of effective systems for regulation and enforcement of standards of care.

Biostatistics: Types of Statistics

Biostatistics ???

Descriptive statistics

 Organization of data  Summarization of data  Presentation of data

Inferential statistics

Sample and population

• Populations are rarely studied because of logistical, financial and other considerations
• Researchers have to rely on study samples

[Figure: a sample drawn from a population]

Types of statistical methods

Descriptive statistical methods

 Provide summary indices for a given data, e.g. arithmetic mean, median, standard deviation, coefficient of variation, etc. 

Inductive (inferential) statistical methods

• Produce statistical inferences about a population based on information from a sample derived from that population; these methods need to take variation into account

[Figure: inference from a sample to the population]

Random sampling

• Suppose that we want to estimate the mean birth weight of male live births in Jordan
• Due to logistical constraints, we decide to take a random sample of 100 live births at the University Hospital in a given year

Sampled population: all live births in KAUH, 2008

Target population: all live births in Jordan

Variable

• Definition
  • What is observed or measured in the way people differ
• Examples
  • age
  • height
  • hair color
  • smoking

Types of Data

• Qualitative or categorical variables
  • Nominal
  • Ordinal
• Quantitative (numerical) variables
  • Discrete variables
  • Continuous variables

Categorical Variables

• Cannot be measured numerically
• Categories must not overlap and must cover all possibilities
• Classified as nominal or ordinal

Categorical Nominal Variables

• Named categories
• No implied order among categories
• Examples
  • Gender: male/female
  • Blood groups: O, A, B, AB
  • Ethnic group: Chinese, Malay, Indian, Jordanian
  • Eye color: brown/black/blue/green/mixed

Categorical Ordinal Variables

• Same as nominal but with ordered categories
• Differences between categories are not considered equal
• Examples
  • Grading: excellent, satisfactory, unsatisfactory
  • Pain severity: no pain, slight pain, moderate pain, severe pain

Quantitative Variables

• Can be measured numerically
  • Weight
  • Number of admissions to the hospital
  • Concentration of chlorine
• Can be discrete or continuous

Discrete Numerical Variables

• Integers that correspond to a count
• Can assume only whole numbers
• Examples
  • Number of bacterial colonies on a plate
  • Number of missing teeth
  • Number of accidents in a time period
  • Number of illnesses in a time period

Continuous Data

• Continuous data are measured
• Can take any value within a defined range
• Precision is limited by the measuring instrument
• Examples: blood pressure, height, weight, time

SUMMARY: Types of variables

Variable
• Qualitative or categorical
  • Nominal (not ordered), e.g. ethnic group
  • Ordinal (ordered), e.g. response to treatment
• Quantitative (measurement)
  • Discrete (count data), e.g. number of admissions
  • Continuous (real-valued), e.g. height

Measurement scales: nominal, ordinal, discrete, continuous

Determining the Type of Data 

Categorical variables

• Nominal: the variables are divided into a number of named categories that cannot be ordered one above the other (sex, marital status)
• Ordinal: the variables are divided into a number of named categories that can be ordered from lowest to highest or vice versa (levels of satisfaction, levels of knowledge)

Numerical variables

• Continuous: between any two points there is, at least theoretically, an infinite number of values (weight, height)
• Discrete: only certain fixed values are possible, with no intermediate values (number of students in a classroom)

Dependent and independent variables

Whether a variable is dependent or independent is determined by the statement of the problem and the study objectives.

• Dependent: the variable that is used to describe or measure the problem under study
• Independent: the variables that are used to describe or measure the factors that are assumed to cause, or at least to influence, the problem

Why does it matter?

Categorical and quantitative variables are graphed, charted, tabled and statistically summarized in very different ways.

Descriptive statistics

Organizing and Presenting Data

Overview of Data Collection Techniques

• Data collection techniques allow us to systematically collect information about our study subjects (people, objects or phenomena)
• Various data collection techniques can be used:
  • Using available information
  • Observing
  • Interviewing (mainly face to face)
  • Administering written questionnaires
  • Focus group discussions

Using Available Information

• Look for sources of already collected data:
  • health information system data
  • census data
  • unpublished reports
  • publications in archives, libraries or offices
• Analysing available data can even be a study in itself
• Design an instrument to retrieve the needed data, such as checklists and data compilation forms

Using Available Information

Advantages
• Inexpensive
• Permits examination of past trends

Disadvantages
• Data are not always easily accessible
• Ethical issues regarding confidentiality may arise
• Information may be incomplete and inaccurate

Observing

• It involves the systematic selection, watching and recording of the behavior and characteristics of living beings or objects
• Observations of human behavior are used on a small scale and can be:
  • Participant observation
  • Non-participant observation
• All kinds of measurements are also called observations
• Measurements require additional tools that can be simple or complex
• Observation can be the primary source of information
• Observation can also provide information additional to other methods of data collection

Observing

Advantages
• More detailed, more accurate information
• Collection of information not covered in questionnaires
• Testing the validity of responses to questionnaires

Disadvantages
• Ethical issues of privacy and confidentiality
• Observer bias
• The presence of the observer can influence the situation
• Extensive training of assistants is needed

Interviewing

• It is the oral questioning of respondents, either individually or as a group, face to face or over the phone
• It can be conducted with a high or a low degree of flexibility, depending on the level of the researcher's understanding of the problem or situation

Interviewing

Advantages
• Suitable for illiterate respondents
• Permits clarification of questions
• Higher response rate than questionnaires

Disadvantages
• The presence of the interviewer can influence the responses
• Less complete information compared to observation

Administering Written Questionnaires

• Written questions are answered by respondents in written form
• Questionnaires can be administered by:
  • Mail
  • Group
  • Drop-off

Administering Questionnaires

Advantages
• Less expensive
• Permits anonymity and probably more honest responses
• No need for assistants
• Eliminates observer bias

Disadvantages
• Cannot be used with illiterate individuals
• Nonresponse rate could be high
• Questions may be misunderstood

Terminology Used in Sample Surveys

• An element is the entity on which data are collected.
• A population is the collection of all elements of interest.
• A sample is a subset of the population.
• The target population is the population we want to make inferences about.
• The sampled population is the population from which the sample is actually selected.
• These two populations are not always the same.
• If inferences from a sample are to be valid, the sampled population must be representative of the target population.

Terminology Used in Sample Surveys

• The population is divided into sampling units, which are groups of elements or the elements themselves.
• A list of the sampling units for a particular study is called a frame.
• The choice of a particular frame is often determined by the availability and reliability of a list.
• The development of a frame can be one of the most difficult and important steps in conducting a sample survey.

Types of Surveys

• Surveys Involving Questionnaires
  • Three common types are mail surveys, telephone surveys, and personal interview surveys.
  • Survey costs are lower for mail and telephone surveys.
  • With well-trained interviewers, higher response rates and longer questionnaires are possible with personal interviews.
  • The design of the questionnaire is critical.

Types of Surveys

 Surveys Not Involving Questionnaires  Often, someone simply counts or measures the sampled items and records the results.

 An example is sampling a company’s inventory of parts to estimate the total inventory value.

Sampling Methods

 Sample surveys can also be classified in terms of the sampling method used.

 The two categories of sampling methods are:  Probabilistic sampling  Nonprobabilistic sampling

Nonprobabilistic Sampling Methods

• The probability of obtaining each possible sample cannot be computed.
• Statistically valid statements cannot be made about the precision of the estimates.
• Sampling cost is lower and implementation is easier.
• Methods include convenience and judgment sampling.

Nonprobabilistic Sampling Methods

 Convenience Sampling  The units included in the sample are chosen because of accessibility.

 In some cases, convenience sampling is the only practical approach.

Nonprobabilistic Sampling Methods

 Judgment Sampling  A knowledgeable person selects sampling units that he/she feels are most representative of the population.

 The quality of the result is dependent on the judgment of the person selecting the sample.

 Generally, no statistical statement should be made about the precision of the result.

Probabilistic Sampling Methods

• The probability of obtaining each possible sample can be computed.
• Probabilistic sampling utilizes some form of random selection.
• Methods include: simple random, stratified simple random, cluster, and systematic sampling.

Survey Errors

 Two types of errors can occur in conducting a survey:  Sampling error  Nonsampling error

Survey Errors

Sampling Error

• It is defined as the magnitude of the difference between the point estimate, developed from the sample, and the population parameter.

• It occurs because not every element in the population is surveyed.

• It cannot occur in a census.

• It can not be avoided, but it can be controlled.
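As a rough illustration of the definition above, the sketch below computes the sampling error of a sample mean. The population values, the seed and the function name `sampling_error` are made up for the example and do not come from the slides.

```python
import random
from statistics import mean

def sampling_error(population, sample):
    """Absolute difference between the sample mean and the population mean."""
    return abs(mean(sample) - mean(population))

# Hypothetical population of 10 birth weights in grams (illustrative values only).
population = [2900, 3100, 3250, 3400, 3050, 3600, 2800, 3300, 3150, 3450]

rng = random.Random(1)                # fixed seed so the run is repeatable
sample = rng.sample(population, 4)    # survey only 4 of the 10 elements
error_sample = sampling_error(population, sample)

# A census surveys every element, so its sampling error is exactly zero.
error_census = sampling_error(population, population)

print(error_sample, error_census)
```

Drawing a larger sample tends to shrink the error, which is the sense in which sampling error can be controlled but not avoided.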

Survey Errors

Nonsampling Error

• It can occur in both a census and a sample survey.
• Examples include:
  • Measurement error
  • Errors due to nonresponse
  • Errors due to lack of respondent knowledge
  • Selection error
  • Processing error

Simple Random Sampling

• A simple random sample of size n from a finite population of size N is a sample selected such that every possible sample of size n has the same probability of being selected.
• We begin by developing a frame or list of all elements in the population.
• Then a selection procedure, based on the use of random numbers, is used to ensure that each element in the sampled population has the same probability of being selected.
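A minimal sketch of this procedure in Python, assuming the frame is simply a list of ID numbers. The frame, the seed and the function name `simple_random_sample` are illustrative assumptions, not part of the original slides.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct elements; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)      # sampling without replacement

# Hypothetical frame: ID numbers 1..500 standing in for a list of all elements.
frame = list(range(1, 501))
sample = simple_random_sample(frame, n=100, seed=2008)

print(sorted(sample)[:10])
```

Fixing the seed only makes the draw reproducible; in a real survey the random numbers would come from a documented random number table or generator.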

Simple Random Sampling

We will see in the upcoming slides how to:  Estimate the following population parameters:  Population mean  Population total  Population proportion  Determine the appropriate sample size

Determining the Sample Size

     An important consideration in sample design is the choice of sample size.

The best choice usually involves a tradeoff between cost and precision (size of the confidence interval).

Larger samples provide greater precision, but are more costly.

A budget might dictate how large the sample can be.

A specified level of precision might dictate how small a sample can be.

Determining the Sample Size

    Smaller confidence intervals provide more precision.

The size of the approximate confidence interval depends on the bound B on the sampling error.

Choosing a level of precision amounts to choosing a value for B.

Given a desired level of precision, we can solve for the value of n.

Stratified Simple Random Sampling

• The population is first divided into H groups, called strata.
• Then for stratum h, a simple random sample of size n_h is selected.
• The data from the H simple random samples are combined to develop an estimate of a population parameter.
• If the variability within each stratum is smaller than the variability across the strata, a stratified simple random sample can lead to greater precision.
• The basis for forming the various strata depends on the judgment of the designer of the sample.
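The steps above can be sketched as follows. The strata names, the unit IDs and the stratum sample sizes are hypothetical and were chosen only to make the example run.

```python
import random

def stratified_sample(strata, sizes, seed=None):
    """Draw a simple random sample of size sizes[h] from each stratum h."""
    rng = random.Random(seed)
    return {h: rng.sample(units, sizes[h]) for h, units in strata.items()}

# Hypothetical strata (H = 3) of patient IDs; the grouping variable is an assumption.
strata = {
    "urban": list(range(0, 60)),
    "rural": list(range(60, 90)),
    "remote": list(range(90, 100)),
}
sizes = {"urban": 6, "rural": 3, "remote": 1}    # n_h for each stratum
samples = stratified_sample(strata, sizes, seed=1)

# Combine the H simple random samples into one data set for estimation.
combined = [unit for units in samples.values() for unit in units]
print(combined)
```

Here the stratum sample sizes are proportional to the stratum sizes, which is one common choice among several.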

Systematic Sampling

 Systematic Sampling is often used as an alternative to simple random sampling which can be time-consuming if a large population is involved.

 If a sample size of n from a population of size N is desired, we might sample one element for every N/n elements in the population.

 We would randomly select one of the first N/n elements and then select every (N/n)th element thereafter.

 Since the first element selected is a random choice, a systematic sample is often assumed to have the properties of a simple random sample.

Systematic Random Sampling

Here are the steps you need to follow in order to achieve a systematic random sample:

• Number the units in the population from 1 to N
• Decide on the n (sample size) that you want or need
• Compute k = N/n, the interval size
• Randomly select an integer between 1 and k
• Then take every kth unit
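These steps translate almost directly into code. The sketch below assumes that N is an exact multiple of n, so that k is a whole number; the population and the function name `systematic_sample` are hypothetical.

```python
import random

def systematic_sample(units, n, seed=None):
    """Pick a random start among the first k units, then every k-th unit after it."""
    N = len(units)
    k = N // n                       # interval size (assumes N is a multiple of n)
    rng = random.Random(seed)
    start = rng.randint(1, k)        # random integer between 1 and k, inclusive
    # Positions start, start + k, start + 2k, ... in 1-based numbering.
    return [units[i - 1] for i in range(start, N + 1, k)]

units = list(range(1, 101))          # hypothetical population numbered 1..100
sample = systematic_sample(units, n=20, seed=3)
print(sample)
```

With N = 100 and n = 20 the interval is k = 5, so the sample always contains 20 units spaced 5 apart.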

Cluster Sampling

    Cluster sampling requires that the population be divided into N groups of elements called clusters.

We would define the frame as the list of N clusters.

We then select a simple random sample of n clusters.

We would then collect data for all elements in each of the n clusters.
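A small sketch of the same idea, assuming the clusters are stored as a dictionary that maps each cluster name to its elements. The block names and household IDs are invented for the example.

```python
import random

def cluster_sample(clusters, n, seed=None):
    """Select n whole clusters at random and keep every element in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n)     # the frame is the list of cluster names
    elements = [e for name in chosen for e in clusters[name]]
    return chosen, elements

# Hypothetical clusters: four city blocks, each listing its household IDs.
clusters = {
    "block_A": [1, 2, 3],
    "block_B": [4, 5],
    "block_C": [6, 7, 8, 9],
    "block_D": [10],
}
chosen, elements = cluster_sample(clusters, n=2, seed=4)
print(chosen, elements)
```

Note that the number of elements in the final sample depends on which clusters are drawn, unlike in simple random sampling.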

Cluster Sampling

 Cluster sampling tends to provide better results than stratified sampling when the elements within the clusters are heterogeneous.

 A primary application of cluster sampling involves area sampling, where the clusters are counties, city blocks, or other well-defined geographic sections.

Multi-Stage Sampling

  In most real applied social research, we would use sampling methods that are considerably more complex than these simple variations. The most important principle here is that we can combine the simple methods described earlier in a variety of useful ways that help us address our sampling needs in the most efficient and effective manner possible.

Multi-stage sampling

 Consider the problem of sampling students in grade schools. We might begin with a national sample of school districts stratified by economics and educational level. Within selected districts, we might do a simple random sample of schools. Within schools, we might do a simple random sample of classes or grades. And, within classes, we might even do a simple random sample of students.  In this case, we have three or four stages in the sampling process and we use both stratified and simple random sampling. By combining different sampling methods we are able to achieve a rich variety of probabilistic sampling methods that can be used in a wide range of social research contexts.

Organizing and Presenting Data

Descriptive Statistics: Tabular and Graphical Methods

 Summarizing Qualitative Data  Summarizing Quantitative Data  Crosstabulations and Scatter Diagrams

Summarizing Qualitative Data

• Frequency Distribution
• Relative Frequency Distribution
• Percent Frequency Distribution
• Bar Graph
• Pie Chart

Example:

Students in JUST were asked to rate the quality of food served in the cafeteria in JUST as being excellent, above average, average, below average, or poor. The ratings provided by a sample of 20 students are shown below.

Below Average, Above Average, Above Average, Average, Above Average
Average, Above Average, Average, Above Average, Below Average
Poor, Excellent, Above Average, Average, Above Average
Above Average, Below Average, Poor, Above Average, Average

Frequency Distribution

 A frequency distribution is a tabular summary of data showing the frequency (or number) of items in each of several nonoverlapping classes.

 The objective is to provide insights about the data that cannot be quickly obtained by looking only at the original data.

Example:

 Frequency Distribution

Rating           Frequency
Poor             2
Below Average    3
Average          5
Above Average    9
Excellent        1
Total            20

Relative Frequency Distribution

  The relative frequency of a class is the fraction or proportion of the total number of data items belonging to the class.

A relative frequency distribution is a tabular summary of a set of data showing the relative frequency for each class.

Percent Frequency Distribution

 The percent frequency of a class is the relative frequency multiplied by 100.

 A percent frequency distribution is a tabular summary of a set of data showing the percent frequency for each class.

 Relative Frequency and Percent Frequency Distributions

Rating           Frequency   Relative Frequency   Percent Frequency
Poor             2           .10                  10
Below Average    3           .15                  15
Average          5           .25                  25
Above Average    9           .45                  45
Excellent        1           .05                  5
Total            20          1.00                 100
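The three distributions can be reproduced from the 20 ratings with a short script. This is only a sketch: the variable names are arbitrary, and the ratings are transcribed from the example earlier in this section.

```python
from collections import Counter

ratings = [
    "Below Average", "Above Average", "Above Average", "Average", "Above Average",
    "Average", "Above Average", "Average", "Above Average", "Below Average",
    "Poor", "Excellent", "Above Average", "Average", "Above Average",
    "Above Average", "Below Average", "Poor", "Above Average", "Average",
]
classes = ["Poor", "Below Average", "Average", "Above Average", "Excellent"]

counts = Counter(ratings)                                # tally each class
n = len(ratings)
frequency = {c: counts[c] for c in classes}
relative = {c: frequency[c] / n for c in classes}        # proportion of the total
percent = {c: 100 * relative[c] for c in classes}        # relative frequency x 100

for c in classes:
    print(f"{c:<14}{frequency[c]:>3}{relative[c]:>7.2f}{percent[c]:>6.0f}")
```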

Tables

• Data are arranged in rows and columns
• Simple and self-explanatory
• Title
• Label each row and column
• Show totals for rows and columns
• Include units of measure (yrs, mg/dl)
• Explain codes in a footnote

Guidelines for Developing a Table

• Describe what, when and where in the title
• Label rows and columns clearly
• Provide units of measure
• Provide row and column totals
• Define abbreviations and symbols
• Note data exclusions
• References
• Source
• Should stand alone

Bar Graph

A bar graph is a graphical device for depicting qualitative data that have been summarized in a frequency, relative frequency, or percent frequency distribution.

On the horizontal axis we specify the labels that are used for each of the classes.

A frequency, relative frequency, or percent frequency scale can be used for the vertical axis.

Using a bar of fixed width drawn above each class label, we extend the height appropriately.

The bars are separated to emphasize the fact that each class is a separate category.

Example: quality of food

[Figure: bar graph of frequency (vertical axis, 1 to 9) by rating (horizontal axis: Poor, Below Average, Average, Above Average, Excellent)]

Charts

 Appropriate for categorical data  Bar charts  Simple  Grouped  Stacked

Bar Charts

• Display data from a one-variable table
• Each value of the variable is represented by a bar
• Bars are proportional to the number of events
• Can be presented vertically or horizontally

Simple Bar Chart

[Figure: Annual Death Rates by Governorate, 1996-2000. Horizontal axis (Governorate): AQABA, ZARQA, AMMAN, JARAS, MAFRQ, MADAB, TAFEL, IRBID, BALQA, KARAK, AJLON, MAANN]

Grouped Bar Chart

[Figure: Treatment completion and cure of disease X in various racial groups, 1994-2000. Bars for cases, completion and cure, grouped by race (Race A, Race B, Race C, Race D)]

Pie Chart

   The pie chart is a commonly used graphical device for presenting relative frequency distributions for qualitative data.

First draw a circle; then use the relative frequencies to subdivide the circle into sectors that correspond to the relative frequency for each class.

Since there are 360 degrees in a circle, a class with a relative frequency of .25 would consume .25(360) = 90 degrees of the circle.
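As a quick check of this rule, the sketch below converts the relative frequencies of the food quality ratings from the earlier example into sector angles. The variable names are arbitrary.

```python
# Relative frequencies from the food quality example.
relative_frequency = {
    "Poor": 0.10,
    "Below Average": 0.15,
    "Average": 0.25,
    "Above Average": 0.45,
    "Excellent": 0.05,
}

# Each class gets a share of the 360 degrees equal to its relative frequency.
angles = {c: rf * 360 for c, rf in relative_frequency.items()}

for c, angle in angles.items():
    print(f"{c:<14}{angle:>6.0f} degrees")
```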

Example: Quality of food

[Figure: pie chart of quality ratings: Above Average 45%, Excellent 5%, Poor 10%, Below Average 15%, Average 25%]

Summarizing Quantitative Data

   Frequency Distribution Relative Frequency and Percent Frequency Distributions Histogram

Frequency Distributions

• A frequency distribution for NUMERICAL data is built after the data have been grouped into suitable categories
• Age is a continuous variable that can be grouped into age groups and presented as a frequency distribution
• When grouping numerical variables into categories, the following rules are important:
  • Groups must not overlap
  • There must be continuity from one group to the next
  • Groups must range from the lowest to the highest measurement (preferably using round numbers)
  • It is preferable that groups be of the same width

Example:

The manager of hospital A would like to get a better picture of the distribution of waiting times of his patients. A sample of 50 patients has been taken and their waiting times (minutes), are listed below.

91 71 104 74 85 97 62 78 69 82 93 72 62 88 57 89 68 68 75 66 97 105 77 83 52 75 68 99 79 71 98 101 79 105 79 80 75 65 69 69 97 72 80 109 67 74 62 62 76 73

Frequency Distribution

 Guidelines for Selecting Number of Classes  Use between 5 and 20 classes.

 Data sets with a larger number of elements usually require a larger number of classes.

 Smaller data sets usually require fewer classes.

Frequency Distribution

• Guidelines for Selecting Width of Classes
  • Use classes of equal width.
  • Approximate class width = (largest data value − smallest data value) / number of classes

Frequency Distribution

If we choose six classes: approximate class width = (109 − 52)/6 = 9.5 ≈ 10

Waiting time (minutes)   Frequency
50-59                    2
60-69                    13
70-79                    16
80-89                    7
90-99                    7
100-109                  5
Total                    50

Relative Frequency and Percent Frequency Distributions

Waiting time (minutes)   Relative Frequency   Percent Frequency
50-59                    .04                  4
60-69                    .26                  26
70-79                    .32                  32
80-89                    .14                  14
90-99                    .14                  14
100-109                  .10                  10
Total                    1.00                 100
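The class width calculation and the resulting distribution can be reproduced with the sketch below. The starting limit of 50 and the rounding of the width up to 10 follow the worked example; the variable names are arbitrary.

```python
import math

# Waiting times (minutes) for the sample of 50 patients.
waiting_times = [
    91, 71, 104, 74, 85, 97, 62, 78, 69, 82,
    93, 72, 62, 88, 57, 89, 68, 68, 75, 66,
    97, 105, 77, 83, 52, 75, 68, 99, 79, 71,
    98, 101, 79, 105, 79, 80, 75, 65, 69, 69,
    97, 72, 80, 109, 67, 74, 62, 62, 76, 73,
]

num_classes = 6
raw_width = (max(waiting_times) - min(waiting_times)) / num_classes
width = math.ceil(raw_width)     # round up to a convenient class width
lower = 50                       # first class limit, a round number below the minimum

frequency = {}
for i in range(num_classes):
    lo, hi = lower + i * width, lower + (i + 1) * width - 1
    frequency[f"{lo}-{hi}"] = sum(lo <= t <= hi for t in waiting_times)

n = len(waiting_times)
relative = {label: f / n for label, f in frequency.items()}

for label in frequency:
    print(f"{label:<9}{frequency[label]:>4}{relative[label]:>7.2f}")
```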

17 33 12 43 53 37 44 44 32 33 39 11 48 56 21 49 43 17 12 34 35 32 20 29 35 33 20 47 48 56 48 60 26 40 52 42 15 12 24 24 33 31 13 46 18 46 14 43 54 33 51 12 50 31 46 37 20 41 25 29 47 24 38 26 13 22 45 54 15 35 23 29 57 33 41 40 11 59 56 59 55 20 32 17 39 55 17 12 45 45 54 15 35 23 29 57 55 41 40 11 33 56 59 55 20 32 17 39 59 17 40 11 59 17 12 23 45 54 55 20 32 17 15 35 23 29 57 33 56 59 39 55 12 24 24 33 31 48 60 26 44 44 48 56 40 13 46 33 12 43 53 37 21 49 29 47 24 38 26 50 31 46 48 56 52 42 37 13 22 29 35 33 20 47 15 12 20 32 17 39 55 41 40 11 51 12 20 41 59 17 12 46 14 43 54 33 25 29 55 20 32 17 39 55 41 40 57 33 56 59 11 59 17 54 15 35 23 29 55 20 29 57 33 56 59 55 20 32 29 57 33 56 17 39 55 45 54 15 35 23 59 55

Simple Frequency Distribution

Primary and secondary syphilis morbidity by age, United States, 1989

Age group (years)   Number of Cases
<14                 230
15-19               4378
20-24               10405
25-29               9610
30-34               8648
35-44               6901
45-54               2631
>55                 1278
Total               44081

Simple Frequency Distribution

Primary and secondary syphilis morbidity by age, United States, 1989

Age group (years)   Number of Cases   Percent
<14                 230               0.5
15-19               4378              9.9
20-24               10405             23.6
25-29               9610              21.8
30-34               8648              19.6
35-44               6901              15.7
45-54               2631              6.0
>55                 1278              2.9
Total               44081             100.0
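The percent column is each age group's case count divided by the total number of cases, multiplied by 100. A minimal sketch of that calculation follows; the variable names are arbitrary and the results are rounded to one decimal place, so small rounding differences from a printed table are possible.

```python
# Number of cases per age group, from the table above.
cases = {
    "<14": 230, "15-19": 4378, "20-24": 10405, "25-29": 9610,
    "30-34": 8648, "35-44": 6901, "45-54": 2631, ">55": 1278,
}

total = sum(cases.values())
percent = {age: round(100 * count / total, 1) for age, count in cases.items()}

for age, count in cases.items():
    print(f"{age:<7}{count:>7}{percent[age]:>7.1f}")
print(f"{'Total':<7}{total:>7}")
```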

Histogram

    Another common graphical presentation of quantitative data is a histogram.

The variable of interest is placed on the horizontal axis and the frequency, relative frequency, or percent frequency is placed on the vertical axis.

A rectangle is drawn above each class interval with its height corresponding to the interval’s frequency, relative frequency, or percent frequency.

Unlike a bar graph, a histogram has no natural separation between rectangles of adjacent classes.

Waiting time (Histogram)

[Figure: histogram of waiting time. Horizontal axis: waiting time, 50 to 110; vertical axis: frequency]

Histograms

 Graph of the frequency distribution of a continuous variable  Columns are adjoining  Area of each column is proportional to # of observations in that interval

Histograms and Frequency Polygons

Example of a Histogram

[Figure: histogram of cases by week. Horizontal axis: week, 1 to 16; vertical axis: cases]

Crosstabulations and Scatter Diagrams

   Thus far we have focused on methods that are used to summarize the data for one variable at a time.

Often a manager is interested in tabular and graphical methods that will help understand the relationship between two variables.

Crosstabulation and a scatter diagram are two methods for summarizing the data for two (or more) variables simultaneously.

Crosstabulation

 Crosstabulation is a tabular method for summarizing the data for two variables simultaneously.

 Crosstabulation can be used when:  Both variables are qualitative  The left and top margin labels define the classes for the two variables.

Crosstabulation: Row or Column Percentages

 Converting the entries in the table into row percentages or column percentages can provide additional insight about the relationship between the two variables.

Different Types of Cross-tabulation

• Cross tabulations that describe the sample using different combinations of background variables (age, sex, occupation, residence ...)
• Cross tabulations displaying comparisons between groups
• Cross tabulations focusing on the relationship between variables

Cross-Tabulation to describe the sample

   In any study, it is common practice to first describe the research subjects before presenting the various results.

This can be done either by presenting the variables in simple frequency tables or in a combination of variables in cross tables Data is usually listed in both absolute figures and relative frequencies

Distribution of disease status according to age groups

Age group   Disease present   Disease absent   Total
<10         2 (3%)            1 (2%)           3
10-19       2 (3%)            0                2
20-29       5 (7%)            2 (4%)           7
30-39       23 (34%)          12 (23%)         35
=>40        36 (53%)          37 (71%)         73
Total       68 (100%)         52 (100%)        120
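The percentages in this table are column percentages: each count is divided by the total of its disease status column. A short sketch of that calculation, with arbitrary variable names:

```python
age_groups = ["<10", "10-19", "20-29", "30-39", "=>40"]
present = [2, 2, 5, 23, 36]      # disease present, by age group
absent = [1, 0, 2, 12, 37]       # disease absent, by age group

def column_percentages(column):
    """Percent of the column total in each row, rounded to whole numbers."""
    total = sum(column)
    return [round(100 * value / total) for value in column]

present_pct = column_percentages(present)
absent_pct = column_percentages(absent)
row_totals = [p + a for p, a in zip(present, absent)]

for row in zip(age_groups, present, present_pct, absent, absent_pct, row_totals):
    print("{:<6}{:>4} ({:>2}%){:>4} ({:>2}%){:>5}".format(*row))
```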

Cross-Tabulation to Determine Differences Between Groups

• Cross-tabulation should be used when we aim at discovering any differences between two or more groups on a particular variable in case-control, cohort, quasi-experimental and experimental studies
• Dependent variables are displayed in columns and independent variables in rows
• Totals for each of the comparison groups should be 100%
• Cross tables will be used further for certain statistical testing

Duration of Breast Feeding in Mothers of Different Age Groups

[Table shell: rows are age groups (<10, 10-19, 20-29, 30-39, =>40, Total); columns are duration of breast feeding (0-5 m, 6-11 m, 12+ m) and Total]

Duration of Breast Feeding in Relation to Working Status of Mothers

[Table shell: rows are working status (Full time, Part time, Not working, Total); columns are duration of breast feeding (0-5 m, 6-11 m, 12+ m) and Total]

Scatter Diagram

 A scatter diagram is a graphical presentation of the relationship between two quantitative variables.

 One variable is shown on the horizontal axis and the other variable is shown on the vertical axis.

 The general pattern of the plotted points suggests the overall relationship between the variables.

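Scatter Diagram

[Figure: scatter diagram of y against x showing a positive relationship]

Scatter Diagram

[Figure: scatter diagram of y against x showing a negative relationship]

Scatter Diagram

[Figure: scatter diagram of y against x showing no apparent relationship]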

 Scatter Diagram The Panthers football team is interested in investigating the relationship, if any, between interceptions made and points scored.

x = Number of Interceptions   y = Number of Points Scored
1                             14
3                             24
2                             18
1                             17
3                             27
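Before plotting, the direction of the relationship can be previewed by averaging the points scored at each number of interceptions. This is only an informal check on five games, not a method given in the slides; the variable names are arbitrary.

```python
from collections import defaultdict
from statistics import mean

interceptions = [1, 3, 2, 1, 3]       # x
points = [14, 24, 18, 17, 27]         # y

# Group the y values by their x value.
by_x = defaultdict(list)
for x, y in zip(interceptions, points):
    by_x[x].append(y)

# Average points scored for each number of interceptions.
average_points = {x: mean(ys) for x, ys in sorted(by_x.items())}

for x, avg in average_points.items():
    print(x, avg)
```

The averages rise with the number of interceptions, which is what a positive relationship looks like in a scatter diagram.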

Example: Panthers Football Team

Scatter Diagram

[Figure: scatter diagram. Horizontal axis (x): number of interceptions, 0 to 3; vertical axis (y): number of points scored, 0 to 30]

Scatter Diagram

[Figure: Serum levels of heavy metal X in 38 moon settlers, 2200. Horizontal axis: age (years), 0 to 60; vertical axis: serum level, 0 to 20]

Frequency Polygon

[Figure: histogram of cases by week (weeks 1 to 16) with its frequency polygon (Cases-FP) overlaid]

Frequency Polygon

• Graph of the entire frequency distribution of a continuous variable
• The number of events in each interval is plotted at the midpoint of the interval
• Straight lines connect the points
• Useful for comparing two or more distributions on the same axes

Frequency Polygon

[Figure: two frequency polygons (Cases-FP and More Cases-FP) by week, weeks 1 to 16]

Frequency Polygon

[Figure: map of the geographic distribution of HAV infection. Legend (anti-HAV prevalence): high, intermediate, low, very low]

Guidelines in Developing Graphs

• Label title, source, axes, scales, legend
• Portray frequency on the vertical scale, starting with zero
• Portray method of classification on the horizontal scale
• Indicate units of measure
• Define abbreviations and symbols
• Note data exclusions

Frequency Distribution of Shortage of Antihypertensive Drugs in PHCCs During 1999

Relative Frequency of Shortage of Antihypertensive Drugs in PHCCs During 1999

Frequency Distribution of Shortage of Antihypertensive Drugs in PHCCs During 1999

Histogram

Number of Brucellosis Cases in Nowhere During 1990s

Descriptive Statistics: Numerical Methods

 Measures of Location  Measures of Variability


Measures of Location

 Mean  Median  Mode  Percentiles  Quartiles

Example

Given below is a sample of monthly income ($) for 70 diabetic patients. The data are presented in ascending order.

425 430 430 435 435 435 435 435 440 440 440 440 440 445 445 445 445 445 450 450 450 450 450 450 450 460 460 460 465 465 465 470 470 472 475 475 475 480 480 480 480 485 490 490 490 500 500 500 500 510 510 515 525 525 525 535 549 550 570 570 575 575 580 590 600 600 600 600 615 615

   Mean Median Mode

Measures of Central Tendency: Mean

• It is the arithmetic mean, also known as the average.
• It is calculated by totaling the results of all observations and dividing by the total number of observations.
• Example: the heights of 7 girls are as follows: 141, 141, 143, 144, 145, 146, 155 cm
  • Total is 1015
  • Mean is 1015/7 = 145 cm

Mean

• The mean of a data set is the average of all the data values.
• If the data are from a sample, the mean is denoted by x̄:

$$\bar{x} = \frac{\sum x_i}{n}$$

• If the data are from a population, the mean is denoted by μ (mu):

$$\mu = \frac{\sum x_i}{N}$$

 Mean

Example: income

$$\bar{x} = \frac{\sum x_i}{n} = \frac{34{,}356}{70} = 490.80$$

Cholesterol level (mg/dl): 190, 199, 198, 196, 192, 199, 198, 196, 193, 199, 198, 196, 196, 190, 199, 198, 196, 190, 199, 198, 196, 400, 480

• Mean of all values: 217.2 mg/dl
• Mean excluding the extreme values (400 and 480): 196 mg/dl
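The effect of the two extreme readings can be checked with a short sketch. It assumes that "extreme values" means the readings of 400 mg/dl and above (the cut-off is an assumption chosen for illustration), and it also reports the median of all values for comparison.

```python
from statistics import mean, median

cholesterol = [190, 199, 198, 196, 192, 199, 198, 196, 193, 199, 198, 196,
               196, 190, 199, 198, 196, 190, 199, 198, 196, 400, 480]

# Assumed cut-off: readings of 400 mg/dl or more count as extreme values.
typical = [x for x in cholesterol if x < 400]

mean_all = mean(cholesterol)      # pulled upwards by 400 and 480
mean_typical = mean(typical)      # mean without the two extreme readings
median_all = median(cholesterol)  # middle value of all 23 readings

print(round(mean_all, 1), mean_typical, median_all)
```

The median of all 23 readings stays close to the bulk of the data, while the mean moves by about 21 mg/dl.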

Median

• It is the value that divides the distribution into two equal halves.
• List all observations from lowest to highest.
• Count the number of observations (n).
• The position of the median is (n+1)/2.
• Example: the weights of 7 girls are as follows: 47, 42, 44, 40, 43, 72, 41 kg
  • Sort first: 40, 41, 42, 43, 44, 47, 72 kg
  • The position of the median is (7+1)/2 = 4 (the 4th one, which is 43)

Median

 The median is the measure of location most often reported for annual income and property value data.

 A few extremely large incomes or property values can inflate the mean.

Median

   The median of a data set is the value in the middle when the data items are arranged in ascending order.

For an odd number of observations, the median is the middle value.

For an even number of observations, the median is the average of the two middle values.

Asymmetric Distributions of the Population Values

Example: income

• Median = 50th percentile. With n = 70 observations, the median is the average of the 35th and 36th data values:

Median = (475 + 475)/2 = 475

Mode

• It is the most frequently occurring value in a set of observations.
• It is useful for categorized data.
• Example: the weights of 7 girls are as follows: 47, 44, 44, 40, 43, 72, 44 kg
  • Sort first: 40, 43, 44, 44, 44, 47, 72 kg
  • The mode is 44

[Figure: histogram of VAR00001 (values 5.0 to 17.0, frequencies 0 to 10) with the mean and median both marked at about 11 (Mean = 11.1, Std. Dev = 3.17), and a box plot of LENGTH (N = 16)]

Mode

    The mode of a data set is the value that occurs with greatest frequency.

The greatest frequency can occur at two or more different values.

If the data have exactly two modes, the data are bimodal.

If the data have more than two modes, the data are multimodal.

Example: income

• Mode: 450 occurred most frequently (7 times)

Mode = 450
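The three measures of location for the income example can be reproduced with Python's standard `statistics` module. This is a minimal sketch: the `counts` list is a hand transcription of the 70 ordered income values (each value with the number of times it appears), and the variable names are illustrative.

```python
from statistics import mean, median, multimode

# (value, count) pairs transcribed from the 70 ordered income values.
counts = [(425, 1), (430, 2), (435, 5), (440, 5), (445, 5), (450, 7), (460, 3),
          (465, 3), (470, 2), (472, 1), (475, 3), (480, 4), (485, 1), (490, 3),
          (500, 4), (510, 2), (515, 1), (525, 3), (535, 1), (549, 1), (550, 1),
          (570, 2), (575, 2), (580, 1), (590, 1), (600, 4), (615, 2)]
income = [value for value, k in counts for _ in range(k)]

income_mean = mean(income)        # sum of all values divided by n
income_median = median(income)    # average of the 35th and 36th values
income_modes = multimode(income)  # list of all most-frequent values

print(len(income), sum(income), income_mean, income_median, income_modes)
```

`multimode` returns a list, so bimodal or multimodal data would show up as more than one entry.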

Percentiles

 A percentile provides information about how the data are spread over the interval from the smallest value to the largest value.

 Admission test scores for colleges and universities are frequently reported in terms of percentiles.

Percentiles

The pth percentile of a data set is a value such that at least p percent of the items take on this value or less and at least (100 − p) percent of the items take on this value or more.

• Arrange the data in ascending order.
• Compute index i, the position of the pth percentile: i = (p/100)n
• If i is not an integer, round up. The pth percentile is the value in the ith position.
• If i is an integer, the pth percentile is the average of the values in positions i and i + 1.

Example: income

• 90th Percentile

i = (p/100)n = (90/100)70 = 63

Averaging the 63rd and 64th data values:

90th Percentile = (580 + 590)/2 = 585

Quartiles

• Quartiles are specific percentiles
• First Quartile = 25th Percentile
• Second Quartile = 50th Percentile = Median
• Third Quartile = 75th Percentile

Example: income

• Third Quartile

Third quartile = 75th percentile

i = (p/100)n = (75/100)70 = 52.5, rounded up to 53

Third quartile = 53rd data value = 525
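The percentile rule above (round the index up when it is not an integer, otherwise average two neighbouring positions) can be sketched as a small function. The function name `percentile` and the `counts` transcription of the income data are illustrative assumptions; library routines such as `statistics.quantiles` use different interpolation rules and may give slightly different answers.

```python
import math

# (value, count) pairs transcribed from the 70 ordered income values.
counts = [(425, 1), (430, 2), (435, 5), (440, 5), (445, 5), (450, 7), (460, 3),
          (465, 3), (470, 2), (472, 1), (475, 3), (480, 4), (485, 1), (490, 3),
          (500, 4), (510, 2), (515, 1), (525, 3), (535, 1), (549, 1), (550, 1),
          (570, 2), (575, 2), (580, 1), (590, 1), (600, 4), (615, 2)]
income = sorted(value for value, k in counts for _ in range(k))

def percentile(sorted_data, p):
    """pth percentile (0 < p < 100) using the index rule i = (p/100)n."""
    n = len(sorted_data)
    i = p * n / 100
    if i.is_integer():
        i = int(i)
        # Integer index: average the values in positions i and i + 1 (1-based).
        return (sorted_data[i - 1] + sorted_data[i]) / 2
    # Non-integer index: round up and take the value in that position.
    return sorted_data[math.ceil(i) - 1]

p90 = percentile(income, 90)
q1, q2, q3 = (percentile(income, p) for p in (25, 50, 75))
print(p90, q1, q2, q3)
```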

Descriptive Statistics: Numerical Methods

 Measures of Location  Measures of Variability


 Mean  Median  Mode

Measures of Location

Compute mean, median, mode.

A: 8 9 10 10 10 11 12

B: 1 5 10 10 10 15 19

Measures of Variability

 It is often desirable to consider measures of variability (dispersion), as well as measures of location.

 For example, in choosing supplier A or supplier B we might consider not only the average delivery time for each, but also the variability in delivery time for each.

Measures of Variability

 Range  Interquartile Range  Variance  Standard Deviation  Coefficient of Variation

Range

   The range of a data set is the difference between the largest and smallest data values.

It is the simplest measure of variability.

It is very sensitive to the smallest and largest data values.

Measures of Dispersion

Range: “Difference between smallest and largest value”

• Simple to calculate
• Example: data set 13, 20, 89, 47, 12, 22, 70, 51, 30 (cm); Range = 89 − 12 = 77 cm
• The range does not tell much about the distribution of the measurements

Example: income

• Range

Range = largest value − smallest value

Range = 615 − 425 = 190

Interquartile Range

   The interquartile range of a data set is the difference between the third quartile and the first quartile.

It is the range for the middle 50% of the data.

It overcomes the sensitivity to extreme data values.

Example: income

• Interquartile Range

3rd Quartile (Q3) = 525

1st Quartile (Q1) = 445

Interquartile Range = Q3 − Q1 = 525 − 445 = 80
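A quick check of the range and interquartile range for the income data, as a sketch. The quartile positions follow the index rule i = (p/100)n used in these slides (17.5 and 52.5, both rounded up), and `counts` is a hand transcription of the data.

```python
import math

# (value, count) pairs transcribed from the 70 ordered income values.
counts = [(425, 1), (430, 2), (435, 5), (440, 5), (445, 5), (450, 7), (460, 3),
          (465, 3), (470, 2), (472, 1), (475, 3), (480, 4), (485, 1), (490, 3),
          (500, 4), (510, 2), (515, 1), (525, 3), (535, 1), (549, 1), (550, 1),
          (570, 2), (575, 2), (580, 1), (590, 1), (600, 4), (615, 2)]
income = sorted(value for value, k in counts for _ in range(k))
n = len(income)

data_range = max(income) - min(income)

# i = (p/100)n is 17.5 for Q1 and 52.5 for Q3: not integers, so round up.
q1 = income[math.ceil(25 * n / 100) - 1]   # 18th value
q3 = income[math.ceil(75 * n / 100) - 1]   # 53rd value
iqr = q3 - q1

print(data_range, q1, q3, iqr)
```

Replacing the largest value by a much bigger one would change the range but leave the interquartile range untouched, which is the sense in which it "overcomes the sensitivity to extreme data values".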

Variance

• The variance is a measure of variability that utilizes all the data.
• It is based on the difference between the value of each observation (x_i) and the mean (x̄ for a sample, μ for a population).

Variance

• The variance is the average of the squared differences between each data value and the mean.
• If the data set is a sample, the variance is denoted by s²:

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

• If the data set is a population, the variance is denoted by σ²:

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$

Standard Deviation

• The standard deviation of a data set is the positive square root of the variance.
• It is measured in the same units as the data, making it more easily comparable to the mean than the variance is.
• If the data set is a sample, the standard deviation is denoted s:

$$s = \sqrt{s^2}$$

• If the data set is a population, the standard deviation is denoted σ (sigma):

$$\sigma = \sqrt{\sigma^2}$$

Measures of Dispersion

Calculating Standard Deviation (Method 1)

Mean:

$$\bar{X} = \frac{\sum x}{n}$$

Standard Deviation:

$$SD = \sqrt{\frac{\sum (x - \bar{X})^2}{n - 1}}$$

Measures of Dispersion

Calculating Standard Deviation (Method 2)

$$SD = \sqrt{\frac{\sum x^2 - (\sum x)^2 / n}{n - 1}}$$

Coefficient of Variation

• The coefficient of variation indicates how large the standard deviation is in relation to the mean.
• If the data set is a sample, the coefficient of variation is computed as follows:

$$\frac{s}{\bar{x}} (100)$$

• If the data set is a population, the coefficient of variation is computed as follows:

$$\frac{\sigma}{\mu} (100)$$

 Variance

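Example: income

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 2{,}996.16$$

• Standard Deviation

$$s = \sqrt{s^2} = \sqrt{2996.16} = 54.74$$

• Coefficient of Variation

$$\frac{s}{\bar{x}} \times 100 = \frac{54.74}{490.80} \times 100 = 11.15$$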

Introduction to Probability

• Experiments, Counting Rules, and Assigning Probabilities
• Events and Their Probability
• Some Basic Relationships of Probability
• Conditional Probability
• Bayes’ Theorem

Probability

     Probability is a numerical measure of the likelihood that an event will occur.

Probability values are always assigned on a scale from 0 to 1.

A probability near 0 indicates an event is very unlikely to occur.

A probability near 1 indicates an event is almost certain to occur.

A probability of 0.5 indicates the occurrence of the event is just as likely as it is unlikely.

Probability as a Numerical Measure of the Likelihood of Occurrence

[Figure: probability scale from 0 to 1 with an arrow showing increasing likelihood of occurrence; at .5 the occurrence of the event is just as likely as it is unlikely]

An Experiment and Its Sample Space

 An experiment is any process that generates well-defined outcomes.

 The sample space for an experiment is the set of all experimental outcomes.

 A sample point is an element of the sample space, any one particular experimental outcome.

Assigning Probabilities

• Classical Method: assigning probabilities based on the assumption of equally likely outcomes.
• Relative Frequency Method: assigning probabilities based on experimentation or historical data.
• Subjective Method: assigning probabilities based on the assignor’s judgment.

Classical Method

If an experiment has n possible outcomes, this method would assign a probability of 1/n to each outcome.

 Example Experiment: Rolling a die Sample Space: S = {1, 2, 3, 4, 5, 6} Probabilities: Each sample point has a 1/6 chance of occurring.

Probability

  The probability of an outcome is the proportion of times the outcome would occur if we repeated the procedure many times.

Examples

• Coin: What is the probability of obtaining heads when flipping a coin?
• A single die: What is the probability I will roll a four?
• Two dice: What is the probability I will roll a four?
• A jar of 30 red and 40 green jelly beans: What is the probability I will randomly select a red jelly bean?
• Computer: In the past 20 times I used my computer, it crashed 4 times and didn’t crash 16 times. What is the probability my computer will crash next time I use it?

Probability

• Independence: Two events are independent if the outcome of one does not affect or give an indication of the outcome of the other.

Independent events:
• Flipping a coin twice

Dependent events:
• Temperature on consecutive days
• 3 jelly beans: red, green, orange. Eat one. Eat another.

Probability

• Independence: Two events are independent if the outcome of one does not affect or give an indication of the outcome of the other.

Independent events:
• Randomly polling two individuals
• Rolling two dice

Dependent events:
• Comparing fertilizer yield for two adjacent field plots

Probability

• Definition: A sample space is the set of all the possible outcomes of a process.
• Example: Coin
  • What is the sample space for flipping a coin 3 times?

Probability

• Definition: An event is an outcome or set of outcomes of a process.
• Example: Coin
  • What is one of the possible events for flipping a coin 3 times?

Probability Rules

• Rule 1: The probability of any event is between 0 and 1 inclusive.
  • Pr(HTH) = 1/8, which is between 0 and 1.
• Rule 2: The probability of the whole sample space is 1.
  • Pr(rolling a 1 or 2 or 3 or 4 or 5 or 6) = 1
• Rule 3: The probability of an event not occurring is 1 minus the probability of the event. This is known as the complement rule.
  • Pr(not rolling a 5) = 1 − 1/6 = 5/6

Probability Rules

• Rule 4: If two events A and B have no outcomes in common (they are disjoint), then Pr(A or B) = Pr(A) + Pr(B)
  • Pr(rolling a 1 or a 6) = Pr(rolling a 1) + Pr(rolling a 6) = 1/6 + 1/6 = 2/6 or 1/3
• Rule 5: If two events A and B are independent, then Pr(A and B) = Pr(A)Pr(B)
  • Pr(rolling a 1 and then a 6) = Pr(rolling a 1) × Pr(rolling a 6) = (1/6)(1/6) = 1/36

Rules of Probability

• Rule 1: 0 ≤ P(A) ≤ 1
• Rule 2: P(S) = 1
• Rule 3: Complement Rule: For any event A, P(A^c) = 1 − P(A)
• Rule 4: Addition Rule: If A and B are disjoint events, then P(A or B) = P(A) + P(B)
• Rule 5: Multiplication Rule: If A and B are independent events, then P(A and B) = P(A)P(B)
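The five rules can be checked by brute-force enumeration of small sample spaces. The sketch below assumes a fair die and a fair coin (equally likely outcomes, the classical method) and uses exact fractions; the helper `pr` is a hypothetical name introduced only for this illustration.

```python
from fractions import Fraction
from itertools import product

def pr(event, space):
    """Classical probability: favourable outcomes / all equally likely outcomes."""
    space = list(space)
    return Fraction(sum(1 for outcome in space if event(outcome)), len(space))

die = range(1, 7)
three_flips = list(product("HT", repeat=3))   # 8 outcomes such as ('H','T','H')
two_rolls = list(product(die, repeat=2))      # 36 ordered pairs

p_hth = pr(lambda o: o == ("H", "T", "H"), three_flips)             # Rule 1
p_whole_space = pr(lambda o: True, die)                             # Rule 2
p_not_5 = 1 - pr(lambda o: o == 5, die)                             # Rule 3
p_1_or_6 = pr(lambda o: o == 1, die) + pr(lambda o: o == 6, die)    # Rule 4
p_1_then_6 = pr(lambda o: o == 1, die) * pr(lambda o: o == 6, die)  # Rule 5

print(p_hth, p_whole_space, p_not_5, p_1_or_6, p_1_then_6)
```

Counting the outcomes directly, for example `pr(lambda o: o == (1, 6), two_rolls)`, gives the same 1/36 as the multiplication rule.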

Events and Their Probability

An event is a collection of sample points.

 The probability of any event is equal to the sum of the probabilities of the sample points in the event.

 If we can identify all the sample points of an experiment and assign a probability to each, we can compute the probability of an event.

Some Basic Relationships of Probability

• There are some basic probability relationships that can be used to compute the probability of an event without knowledge of all the sample point probabilities.
  • Complement of an Event
  • Union of Two Events
  • Intersection of Two Events
  • Mutually Exclusive Events

Complement of an Event

• The complement of event A is defined to be the event consisting of all sample points that are not in A.
• The complement of A is denoted by A^c.

[Figure: Venn diagram of sample space S showing event A and its complement A^c]

Union of Two Events

• The union of events A and B is the event containing all sample points that are in A or B or both.
• The union is denoted by A ∪ B.

[Figure: Venn diagram of sample space S with events A and B; the union is the whole area covered by A and B]

Intersection of Two Events

• The intersection of events A and B is the set of all sample points that are in both A and B.
• The intersection is denoted by A ∩ B.

[Figure: Venn diagram of sample space S with events A and B; the intersection is the area of overlap]

Addition Law

• The addition law provides a way to compute the probability of event A, or B, or both A and B occurring.
• The law is written as:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

Mutually Exclusive Events

Two events are said to be mutually exclusive if the events have no sample points in common. That is, two events are mutually exclusive if, when one event occurs, the other cannot occur.

[Figure: Venn diagram of sample space S with events A and B drawn as non-overlapping regions]

Mutually Exclusive Events

• Addition Law for Mutually Exclusive Events:

$$P(A \cup B) = P(A) + P(B)$$

Conditional Probability

• The probability of an event given that another event has occurred is called a conditional probability.
• The conditional probability of A given B is denoted by P(A|B).
• A conditional probability is computed as follows:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

Multiplication Law

• The multiplication law provides a way to compute the probability of an intersection of two events.
• The law is written as:

$$P(A \cap B) = P(B)\,P(A|B)$$

Independent Events

• Events A and B are independent if P(A|B) = P(A).
• Multiplication Law for Independent Events:

$$P(A \cap B) = P(A)\,P(B)$$

• The multiplication law also can be used as a test to see if two events are independent.

Bayes’ Theorem

• To find the posterior probability that event A_i will occur given that event B has occurred, we apply Bayes’ theorem:

$$P(A_i|B) = \frac{P(A_i)\,P(B|A_i)}{P(A_1)P(B|A_1) + P(A_2)P(B|A_2) + \cdots + P(A_n)P(B|A_n)}$$

• Bayes’ theorem is applicable when the events for which we want to compute posterior probabilities are mutually exclusive and their union is the entire sample space.

Daniel’s example: Frequency of cocaine use by gender (Erickson and Murray, Am J of Drug and Alcohol Abuse, 15 (1989), 135-152).

Lifetime frequency of use | Male (M) | Female (F) | Total
1-19 times (A) | 32 | 7 | 39
20-99 times (B) | 18 | 20 | 38
100+ times (C) | 25 | 9 | 34
Total | 75 | 36 | 111

Assumptions: male and female are mutually exclusive, and the likelihood of selecting any person in the group is equal for all.

Marginal probabilities:

P(person is a male) = number of males/number of subjects = 75/111 = 0.6757

P(person is a female) = P(is not a male) = 1 − P(is a male) = 1 − 0.6757 = 0.3243

Probability in clinical practice

http://www.screening.nhs.uk/screening

Example: results of a screening test

Test result | Disease Present (D) | Disease Absent (D^c) | Total
Positive (T) | a = TP | b = FP | a + b
Negative (T^c) | c = FN | d = TN | c + d
Total | a + c | b + d | n

• Sensitivity: P(T+|D+) = TP/(TP+FN)
• Specificity: P(T−|D−) = TN/(TN+FP)
• Predictive value positive: P(D+|T+) = TP/(TP+FP)
• Predictive value negative: P(D−|T−) = TN/(TN+FN)

Screening tests

A proportion p of the population is known to carry virus Z. Tests of people carrying the virus return positive 98% of the time. Of people not carrying the virus, the test correctly returns a negative 97% of the time. If an individual tests positive, what is P(individual carries the virus)?

V = virus Z is present; T = test result is positive

[Figure: probability tree. V (probability p) branches to T (0.98) and T^c (0.02); V^c (probability 1 − p) branches to T (0.03) and T^c (0.97)]

$$P(V|T) = \frac{P(V \cap T)}{P(T)} = \frac{P(T|V)\,P(V)}{P(T|V)\,P(V) + P(T|V^c)\,P(V^c)} = \frac{0.98\,p}{0.98\,p + 0.03\,(1 - p)}$$

p = prevalence

p | P(V|T)
0.1 | 0.784
0.05 | 0.632
0.001 | 0.032
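The dependence of P(V|T) on prevalence can be reproduced with a few lines of code. This is a sketch of the Bayes calculation for the virus Z example; the function name `posterior` and its parameter names are illustrative, with sensitivity 0.98 and specificity 0.97 taken from the example.

```python
def posterior(prevalence, sensitivity=0.98, specificity=0.97):
    """P(V|T): probability of carrying the virus given a positive test."""
    true_positive = sensitivity * prevalence                # P(T|V) P(V)
    false_positive = (1 - specificity) * (1 - prevalence)   # P(T|V^c) P(V^c)
    return true_positive / (true_positive + false_positive)

results = {p: round(posterior(p), 3) for p in (0.1, 0.05, 0.001)}
for p, value in results.items():
    print(f"prevalence {p}: P(V|T) = {value}")
```

Even with a test this accurate, a positive result at a prevalence of 0.001 corresponds to only about a 3% chance of carrying the virus, because false positives from the large virus-free group outnumber the true positives.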

Probability: frequency (m) of occurrence of an event of interest among N mutually exclusive and equally likely events: P(E) = m/N

PROPERTIES: 0 ≤ P(E_i) ≤ 1

MULTIPLICATION RULE: P(M ∩ C) = P(C|M)P(M) if P(M) ≠ 0

ADDITION RULE: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

MARGINAL PROBABILITY (TOTAL PROBABILITY):

$$P(B) = P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + \cdots + P(B|A_k)P(A_k) = \sum_{i=1}^{k} P(B|A_i)\,P(A_i)$$

BAYES:

$$P(V|T) = \frac{P(V \cap T)}{P(T)} = \frac{P(T|V)\,P(V)}{P(T|V)\,P(V) + P(T|V^c)\,P(V^c)}$$

Normal Distribution

Bell-Shaped Curve: The Normal Distribution of Population Values

Function

 The Normal is a theoretical distribution specified by its two parameters

Normal Probability Distribution

• Normal Probability Density Function

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2 / 2\sigma^2}$$

where:
μ = mean
σ = standard deviation
π = 3.14159
e = 2.71828

Asymmetric Distributions of the Population Values

The Normal Distribution

[Figure: normal probability distribution curve over x, with the mean and the standard deviation marked]

Normal Probability Distribution

Characteristics of the Normal Probability Distribution

• The shape of the normal curve is often illustrated as a bell-shaped curve.
• Two parameters, μ (mean) and σ (standard deviation), determine the location and shape of the distribution.
• The highest point on the normal curve is at the mean, which is also the median and mode.
• The mean can be any numerical value: negative, zero, or positive.
• The normal curve is symmetric.
• The standard deviation determines the width of the curve: larger values result in wider, flatter curves.
• The total area under the curve is 1 (.5 to the left of the mean and .5 to the right).
• Probabilities for the normal random variable are given by areas under the curve.

Empirical Rule for Any Normal Curve

• About 68% of the values fall within one standard deviation of the mean
• About 95% of the values fall within two standard deviations of the mean
• About 99.7% of the values fall within three standard deviations of the mean

Empirical Rule for Any Normal Curve: mean = 200, SD = 10

Interval | Values | Share of values
Mean ± 1 SD | 190 to 210 | 68%
Mean ± 2 SD | 180 to 220 | 95%
Mean ± 3 SD | 170 to 230 | 99.7%

The Mean and Standard Deviation of the Normal Distribution Determine

• What proportion of individuals fall into any range of values
• At what percentile a given individual falls, if you know their value
• What value corresponds to a given percentile

The Standard Normal Curve

• Because there are many possible normal curves, we standardize them in order to make use of their properties:
  • Convert every normal distribution to the standard normal distribution, and convert all of the scores to standard scores.
  • Then use a Table of Areas under the Normal Curve to interpret and make inferences about the data.

Properties of a Standard Normal Curve: z-scores

• Convert the raw scores to z-scores.
• Every distribution of z-scores:
  • Has a mean equal to 0
  • Has a standard deviation equal to 1
  • Has a shape that is the same as the underlying distribution of raw scores

The Normal Distribution

• For the Standard Normal Distribution (or Z Distribution) we can find probabilities associated with different values of Z using Z tables.

The Normal Distribution

• First we look at some general characteristics of the Z-distribution.
  • The area under the entire curve is 1.
  • The area under the curve to the left of 0 is 0.5.
  • We say, “The probability that Z is to the left of 0 is 0.5.” This can be written as Pr(Z < 0) = 0.5.

The Normal Distribution

• We can find the probability that Z is to the left of any number using the Z-table.
  • Z-tables can also be found on the inside front cover of the book.
  • Notice first that if we go in the table to the value z = 0.00, the probability is 0.5.

The Normal Distribution

• We can find the probability that Z is to the left of any number using the Z-table.

Pr(Z < 1.25) = ?

Pr(Z < 0.50) = ?

• More examples of probabilities to the left of (less than) a number:

Pr(Z < -2.01) = ?

Pr(Z < -3.75) = ?

The Normal Distribution

• The Z-table only gives probabilities to the left of the value. If we want to get probabilities to the right we use 1 − Pr(Z < z).

Pr(Z > 1.25) = ?

Pr(Z > 0.50) = ?

• More examples of finding probabilities to the right of a number using 1 − Pr(Z < z):

Pr(Z > -2.01) = ?

Pr(Z > -3.75) = ?

The Normal Distribution

• To find probabilities between two numbers, find the less than (or to the left) probability for each number and then subtract.

Pr(-2.01 < Z < 2.01) = ?
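The Z-table look-ups above can be checked in code. The sketch below uses `statistics.NormalDist` from the Python standard library as a stand-in for the printed table (its `cdf` method returns the area to the left of a value); results are rounded to four decimals, as in a typical table.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Probabilities to the left: Pr(Z < z)
left = {z: round(Z.cdf(z), 4) for z in (1.25, 0.50, -2.01, -3.75)}

# Probabilities to the right: Pr(Z > z) = 1 - Pr(Z < z)
right = {z: round(1 - Z.cdf(z), 4) for z in (1.25, 0.50, -2.01, -3.75)}

# Probability between two numbers: subtract the two left-hand probabilities
between = round(Z.cdf(2.01) - Z.cdf(-2.01), 4)

print(left)
print(right)
print(between)
```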

Standard Normal Probability Distribution

• A random variable that has a normal distribution with a mean of zero and a standard deviation of one is said to have a standard normal probability distribution.
• The letter z is commonly used to designate this normal random variable.
• Converting to the Standard Normal Distribution:

$$z = \frac{x - \mu}{\sigma}$$

• We can think of z as a measure of the number of standard deviations x is from μ.

Standard or z score

• A z score indicates distance from the mean in standard deviation units. Formula:

$$z = \frac{X - \mu}{\sigma} \qquad \text{or, using sample values,} \qquad z = \frac{X - \bar{X}}{S_X}$$

• Converting to standard or z scores does not change the shape of the distribution.

z-Scores

The regularity of the normal distribution allows us to:

• determine the normalised deviation of an individual observation from the mean of the distribution: the z-score (transformation to mean μ = 0, sd σ = 1)
• estimate the probability of observing a particular score, based on the normalised area under the bell curve

Formula:

$$z = \frac{X - \mu}{\sigma}$$

X: observed value; μ: mean of distribution; σ: standard deviation of distribution

[Figure: bell curve over z from −2 to +2; 34.13% of the area lies between z = 0 and z = +1, 13.59% between z = +1 and z = +2, and 2.28% beyond z = +2]

Note:

• z is the deviation from the mean in units of standard deviation
• intervals defined by z-boundaries correspond to precisely defined proportions of the distribution

normal distribution & z-score transformation

Formula:

$$Y = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(X - \mu)^2 / 2\sigma^2}$$

properties of the normal distribution:

1. 50% below and 50% above the mean (mean = median)
2. most scores at mean (mean = mode), extreme scores are relatively rare (low frequencies)
3. symmetrical

Basis for z-score transformation:

$$z = \frac{X - \mu}{\sigma}$$

This gives normalized proportions of area in z-score sections: 34.13% between z = 0 and z = +1, 13.59% between z = +1 and z = +2, and 2.28% beyond z = +2.

Sampling distributions

The distribution of sample means for sample size n will have:

• a mean of μ
• a standard deviation of σ/√n

and will approach a normal distribution as n approaches infinity.

[Figure: interval brackets drawn around several sample means x̄, with the distance |x̄ − μ| marked]

Central limit theorem

A cornerstone for much of inferential statistics is the central limit theorem: for any population with a mean μ and a standard deviation σ, the distribution of sample means for sample size n will have a mean of μ and a standard deviation of σ/√n, and will approach a normal distribution as n approaches infinity.

The shape of the distribution of sample means is close to normal if the population from which the samples are drawn is normal, or if n is large (n > 30).

Central limit theorem

• The mean of the distribution of sample means, the expected value of X̄, equals the population mean μ (it is more probable to draw scores close to μ).
• The variability of the distribution of sample means is described by the standard error of the mean, σ_X̄, which measures the standard difference between X̄ and μ.
• The sample size determines how close the sample mean will be to the population mean, and influences the standard error of the mean:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \qquad (\text{note that for } n = 1:\ \sigma_{\bar{X}} = \sigma)$$
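A small simulation illustrates these statements. The sketch below is an assumption-laden illustration, not part of the original slides: it draws repeated samples of size n = 36 from a skewed population (an exponential distribution with mean 1 and standard deviation 1, chosen arbitrarily) and compares the mean and standard deviation of the sample means with μ and σ/√n.

```python
import random
from math import sqrt
from statistics import fmean, stdev

random.seed(0)          # fixed seed so the run is repeatable

MU, SIGMA = 1.0, 1.0    # exponential with rate 1 has mean 1 and sd 1
n, repetitions = 36, 5000

# One sample mean per repetition, each based on n independent draws.
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(repetitions)
]

mean_of_means = fmean(sample_means)   # should be close to MU
sd_of_means = stdev(sample_means)     # should be close to SIGMA / sqrt(n)
standard_error = SIGMA / sqrt(n)

print(round(mean_of_means, 3), round(sd_of_means, 3), round(standard_error, 3))
```

Although the population is strongly skewed, a histogram of `sample_means` would already look roughly bell-shaped at n = 36.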

The Normal Distribution

Properties

• Most values cluster near the mean and few values lie out near the tails
• Values in the middle of the curve have a greater probability of occurring than values further away
• There are many possible normal curves with different means and standard deviations

Standardized Scores

standardized score = (observed value minus mean) / (std dev)

$$z = \frac{x - \mu}{\sigma}$$

• x is the observed value
• μ is the population mean
• σ is the population standard deviation

The Normal Distribution

• Now suppose we know X ~ N(μ, σ²) and we want to know the probability that X is less than some value. We must first convert the X to a Z and then use the probabilities from the Z-table. Recall that if X ~ N(μ, σ²), then

$$Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$$

So Pr(X < x) = Pr(Z < (x − μ)/σ).

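The Normal Distribution

• Here’s an example. Suppose X ~ N(3, 4), so μ = 3 and σ = 2. Find the probability that X is less than 4.

$$Z = \frac{X - \mu}{\sigma} = \frac{4 - 3}{2} = 0.5$$

Pr(X < 4) = Pr(Z < 0.5), which is read from the Z-table.

[Figure: the area to the left of 4 under the curve of X (centred at 3) matches the area to the left of 0.50 under the curve of Z (centred at 0)]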

The Normal Distribution

• We will look at some more difficult examples by hand. Suppose X ~ N(2, 9).
  • Given a value z, find the corresponding x that it came from.
  • How many standard deviations is x from μ?
  • Find Pr(X > 5).
  • Find Pr(X < –4 or X > 8).
  • Find Pr(–4 < X < 8).
  • Find the x* such that Pr(X < x*) = 0.8485, where 0.8485 is some probability.
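The conversions between X and Z in these examples can be checked numerically. This sketch uses `statistics.NormalDist` in place of the Z-table; note that N(3, 4) and N(2, 9) give the variance, so the standard deviations passed to `NormalDist` are 2 and 3.

```python
from statistics import NormalDist

Z = NormalDist()                 # standard normal
X = NormalDist(mu=3, sigma=2)    # X ~ N(3, 4): variance 4, sd 2
Y = NormalDist(mu=2, sigma=3)    # X ~ N(2, 9): variance 9, sd 3

# Pr(X < 4) directly, and via the z score (4 - 3)/2 = 0.5
p_direct = X.cdf(4)
p_via_z = Z.cdf((4 - 3) / 2)

# Exercises for N(2, 9)
p_greater_5 = 1 - Y.cdf(5)          # Pr(X > 5) = Pr(Z > 1)
p_between = Y.cdf(8) - Y.cdf(-4)    # Pr(-4 < X < 8) = Pr(-2 < Z < 2)
p_outside = 1 - p_between           # Pr(X < -4 or X > 8)
x_star = Y.inv_cdf(0.8485)          # value with 0.8485 of the area to its left
z_star = (x_star - Y.mean) / Y.stdev  # standard deviations x* lies from the mean

print(round(p_direct, 4), round(p_greater_5, 4), round(p_between, 4),
      round(p_outside, 4), round(x_star, 2), round(z_star, 2))
```

Going from z back to x uses the inverse transformation x = μ + zσ, which is what `inv_cdf` does internally for a non-standard normal.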

Law of Large Numbers

• Draw independent observations at random from any population with finite mean μ. Decide how accurately you would like to estimate μ. As the number of observations drawn increases, the sample mean x̄ approaches the mean μ of the population as closely as you specified and then stays that close.

Interval Estimation

• Interval Estimation of a Population Mean
• Interval Estimation of a Population Proportion
• Interval Estimation of the difference between two Population Means
• Interval Estimation of the difference between two Population Proportions

Interval Estimation of population mean

[Figure: interval brackets drawn around several sample means x̄, with the distance |x̄ − μ| marked]

Interval Estimation of a Population Mean: Large-Sample Case

• Sampling Error
• Probability Statements about the Sampling Error
• Constructing an Interval Estimate: Large-Sample Case with σ Known
• Calculating an Interval Estimate: Large-Sample Case with σ Unknown

Sampling Error

• The absolute value of the difference between an unbiased point estimate and the population parameter it estimates is called the sampling error.
• For the case of a sample mean estimating a population mean, the sampling error is

Sampling Error = |x̄ − μ|

Probability Statements About the Sampling Error

• Knowledge of the sampling distribution of x̄ enables us to make probability statements about the sampling error even though the population mean μ is not known.
• A probability statement about the sampling error is a precision statement.

Confidence Interval (CI)

Definition of Confidence Interval: “Interval or range of values which most likely encompasses the true population value”

• Helps determine the extent of deviation of the sample value from the population value
• Upper and lower limits of the CI are termed Confidence Limits

Interpretation of CI

A 95% CI of 152 to 164 cm for the mean height of a population of women means that you are 95% confident that the real population mean* lies between 152 and 164 cm.

152 = lower confidence limit; 164 = upper confidence limit

* which you cannot know exactly unless you measure the heights of all women

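Probability Statements About the Sampling Error

• Precision Statement: There is a 1 − α probability that the value of a sample mean will provide a sampling error of z_{α/2} σ_x̄ or less.

[Figure: sampling distribution of x̄, with 1 − α of all x̄ values in the centre and α/2 in each tail]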

Calculating CI

• Takes into account the standard error (SE).
• SE gives an estimate of the degree to which the sample mean varies from the population mean.
• Computed on the basis of the standard deviation:

$$SE = \frac{SD}{\sqrt{n}}$$

Interval Estimate of a Population Mean: Large-Sample Case (n > 30)

• With σ Known

$$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$

where:
x̄ is the sample mean
1 − α is the confidence coefficient
z_{α/2} is the z value providing an area of α/2 in the upper tail of the standard normal probability distribution
σ is the population standard deviation
n is the sample size

Calculations

SE and 95% CI of a Mean: to what degree does the sample mean vary from the population mean?

$$SE = \frac{SD}{\sqrt{n}}$$

$$95\%\ CI = \bar{x} \pm (1.96 \times SE)$$

Interval Estimate of a Population Mean: Large-Sample Case (n > 30)

• With σ Unknown

In most applications the value of the population standard deviation is unknown. We simply use the value of the sample standard deviation, s, as the point estimate of the population standard deviation.

$$\bar{x} \pm z_{\alpha/2} \frac{s}{\sqrt{n}}$$

Example
Sampling can be used to develop an interval estimate of the mean annual income for individuals in a potential marketing area for National Discount. A sample of size n = 36 was taken. The sample mean is denoted \(\bar{x}\); the sample standard deviation, s, is $4,500. We will use .95 as the confidence coefficient in our interval estimate.

Example
• Precision Statement
There is a .95 probability that the value of a sample mean for National Discount will provide a sampling error of $1,470 or less, determined as follows:
95% of the sample means that can be observed are within \(\pm 1.96\,\sigma_{\bar{x}}\) of the population mean \(\mu\).
If \(\sigma_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{4{,}500}{\sqrt{36}} = 750\), then \(1.96\,\sigma_{\bar{x}} = 1{,}470\).
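The arithmetic behind the $1,470 figure can be reproduced in a few lines. The values s = 4,500 and n = 36 come from the example; 1.96 is the standard normal value for a .95 confidence coefficient. The variable names are chosen only for this sketch.

```python
import math

s = 4500   # sample standard deviation in dollars (from the example)
n = 36     # sample size (from the example)
z = 1.96   # z value for a .95 confidence coefficient

standard_error = s / math.sqrt(n)          # s / sqrt(n)
max_sampling_error = z * standard_error    # 1.96 standard errors

print(f"standard error: ${standard_error:,.0f}")
print(f"sampling error bound: ${max_sampling_error:,.0f}")
```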

Interval Estimation of a Population Mean: Small-Sample Case (n < 30)

• Population is Not Normally Distributed
The only option is to increase the sample size to n > 30 and use the large-sample interval-estimation procedures.

• Population is Normally Distributed and \(\sigma\) is Known
The large-sample interval-estimation procedure can be used.

• Population is Normally Distributed and \(\sigma\) is Unknown
The appropriate interval estimate is based on a probability distribution known as the t distribution.

t Distribution

• The t distribution is a family of similar probability distributions.
• A specific t distribution depends on a parameter known as the degrees of freedom.
• As the number of degrees of freedom increases, the difference between the t distribution and the standard normal probability distribution becomes smaller and smaller.
• A t distribution with more degrees of freedom has less dispersion.
• The mean of the t distribution is zero.
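The claim that the t distribution approaches the standard normal as the degrees of freedom grow can be illustrated by comparing critical values. In this sketch the t values are the usual table entries for an upper-tail area of .025, typed in by hand (an assumption of the example, since the Python standard library has no t distribution); the z value is computed with `statistics.NormalDist`.

```python
from statistics import NormalDist

# Upper-tail .025 critical values copied from a standard t table,
# keyed by degrees of freedom (hand-entered for illustration).
T_025 = {7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228, 30: 2.042, 120: 1.980}

z_025 = NormalDist().inv_cdf(0.975)   # about 1.960

for df, t_value in sorted(T_025.items()):
    print(f"df = {df:>3}: t = {t_value:.3f}, t - z = {t_value - z_025:.3f}")

# Gap between t and z, ordered by increasing degrees of freedom.
gaps = [T_025[df] - z_025 for df in sorted(T_025)]
```

The gap shrinks steadily as the degrees of freedom increase, which is why the z value is an acceptable stand-in for large samples.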

t table

Interval Estimation of a Population Mean: Small-Sample Case (n < 30) with \(\sigma\) Unknown

• Interval Estimate

\[ \bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} \]

where
\(1 - \alpha\) = the confidence coefficient
\(t_{\alpha/2}\) = the t value providing an area of \(\alpha/2\) in the upper tail of a t distribution with n - 1 degrees of freedom
s = the sample standard deviation

Example: Apartment Rents

Interval Estimation of a Population Mean: Small-Sample Case (n < 30) with \(\sigma\) Unknown

A reporter for a student newspaper is writing an article on the cost of off-campus housing. A sample of 10 one-bedroom units within a half-mile of campus resulted in a sample mean of $550 per month and a sample standard deviation of $60.

Let us provide a 95% confidence interval estimate of the mean rent per month for the population of one-bedroom units within a half-mile of campus. We’ll assume this population to be normally distributed.

Example: Apartment Rents

• t Value
At 95% confidence, \(1 - \alpha = .95\), \(\alpha = .05\), and \(\alpha/2 = .025\).
\(t_{.025}\) is based on n - 1 = 10 - 1 = 9 degrees of freedom.
In the t distribution table we see that \(t_{.025} = 2.262\).

                          Area in Upper Tail
Degrees of Freedom    .10      .05      .025     .01      .005
        .              .        .        .        .        .
        7            1.415    1.895    2.365    2.998    3.499
        8            1.397    1.860    2.306    2.896    3.355
        9            1.383    1.833    2.262    2.821    3.250
       10            1.372    1.812    2.228    2.764    3.169
        .              .        .        .        .        .

Example: Apartment Rents

Interval Estimation of a Population Mean: Small-Sample Case (n < 30) with \(\sigma\) Unknown

\[ \bar{x} \pm t_{.025} \frac{s}{\sqrt{n}} \]

\[ 550 \pm 2.262 \frac{60}{\sqrt{10}} \]

\[ 550 \pm 42.92 \]

or $507.08 to $592.92

We are 95% confident that the mean rent per month for the population of one-bedroom units within a half-mile of campus is between $507.08 and $592.92.
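The apartment rent interval can be reproduced with a short function. The inputs (mean $550, s = $60, n = 10, t = 2.262 for 9 degrees of freedom) are taken from the example; the function name `t_interval` is an assumption of this sketch, and the t value is passed in by hand because it comes from the table.

```python
import math

def t_interval(sample_mean, s, n, t_value):
    """Interval x_bar +/- t_{alpha/2} * s / sqrt(n); t_value is read from a t table."""
    margin = t_value * s / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Values from the apartment rents example (t_.025 with 9 degrees of freedom).
low, high = t_interval(sample_mean=550, s=60, n=10, t_value=2.262)
print(f"95% CI for mean rent: ${low:.2f} to ${high:.2f}")
```

Using the z value 1.96 here instead of 2.262 would give a narrower interval that understates the uncertainty of a sample of only 10 units.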

Interval Estimation of a Population Proportion

• Interval Estimate

\[ \bar{p} \pm z_{\alpha/2} \sqrt{\frac{\bar{p}(1 - \bar{p})}{n}} \]

where:
\(1 - \alpha\) is the confidence coefficient
\(z_{\alpha/2}\) is the z value providing an area of \(\alpha/2\) in the upper tail of the standard normal probability distribution
\(\bar{p}\) is the sample proportion

Example: dental caries
n = 500, number with caries = 220

• Interval Estimate of a Population Proportion

\[ \bar{p} \pm z_{\alpha/2} \sqrt{\frac{\bar{p}(1 - \bar{p})}{n}} \]

where: n = 500, \(\bar{p}\) = 220/500 = .44, \(z_{\alpha/2}\) = 1.96

\[ .44 \pm 1.96 \sqrt{\frac{.44(1 - .44)}{500}} \]

\[ .44 \pm .0435 \]

We are 95% confident that the proportion of children with dental caries is between .3965 and .4835.
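The dental caries calculation can be checked in code. The counts (220 of 500) are from the example; the function name `proportion_interval` is hypothetical, and the z value is computed with `statistics.NormalDist` rather than rounded to 1.96, which changes the result only beyond the fourth decimal place.

```python
import math
from statistics import NormalDist

def proportion_interval(successes, n, confidence=0.95):
    """Interval p_bar +/- z_{alpha/2} * sqrt(p_bar * (1 - p_bar) / n)."""
    p_bar = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_bar * (1 - p_bar) / n)
    return p_bar - margin, p_bar + margin

# Counts from the dental caries example: 220 of 500 children.
low, high = proportion_interval(220, 500)
print(f"95% CI for the proportion: {low:.4f} to {high:.4f}")
```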

Comparisons Involving Means

• Estimation of the Difference Between the Means of Two Populations: Independent Samples
• Hypothesis Tests about the Difference between the Means of Two Populations: Independent Samples
• Inferences about the Difference between the Means of Two Populations: Matched Samples
• Inferences about the Difference between the Proportions of Two Populations

Estimation of the Difference Between the Means of Two Populations: Independent Samples

• Point Estimator of the Difference between the Means of Two Populations
• Sampling Distribution of \(\bar{x}_1 - \bar{x}_2\)
• Interval Estimate of \(\mu_1 - \mu_2\): Large-Sample Case
• Interval Estimate of \(\mu_1 - \mu_2\): Small-Sample Case

Point Estimator of the Difference Between the Means of Two Populations

• Let \(\mu_1\) equal the mean of population 1 and \(\mu_2\) equal the mean of population 2.
• The difference between the two population means is \(\mu_1 - \mu_2\).
• To estimate \(\mu_1 - \mu_2\), we will select a simple random sample of size \(n_1\) from population 1 and a simple random sample of size \(n_2\) from population 2.
• Let \(\bar{x}_1\) equal the mean of sample 1 and \(\bar{x}_2\) equal the mean of sample 2.
• The point estimator of the difference between the means of populations 1 and 2 is \(\bar{x}_1 - \bar{x}_2\).

Sampling Distribution of \(\bar{x}_1 - \bar{x}_2\)

• Standard Deviation

\[ \sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \]

where:
\(\sigma_1\) = standard deviation of population 1
\(\sigma_2\) = standard deviation of population 2
\(n_1\) = sample size from population 1
\(n_2\) = sample size from population 2

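Interval Estimate of \(\mu_1 - \mu_2\): Large-Sample Case (\(n_1 > 30\) and \(n_2 > 30\))

• Interval Estimate with \(\sigma_1\) and \(\sigma_2\) Known

\[ \bar{x}_1 - \bar{x}_2 \pm z_{\alpha/2}\,\sigma_{\bar{x}_1 - \bar{x}_2} \]

where: \(1 - \alpha\) is the confidence coefficient

• Interval Estimate with \(\sigma_1\) and \(\sigma_2\) Unknown

\[ \bar{x}_1 - \bar{x}_2 \pm z_{\alpha/2}\,s_{\bar{x}_1 - \bar{x}_2} \]

where:

\[ s_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]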
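A possible sketch of the large-sample interval for a difference between two means, using the sample standard deviations in place of the unknown population values. The function name and every input below (means 82 and 78, standard deviations 8 and 6, sample sizes 64 and 36) are hypothetical values chosen so the arithmetic is easy to follow; they are not from the slides.

```python
import math
from statistics import NormalDist

def diff_means_interval(mean1, s1, n1, mean2, s2, n2, confidence=0.95):
    """Interval (x1_bar - x2_bar) +/- z_{alpha/2} * sqrt(s1^2/n1 + s2^2/n2)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se_diff = math.sqrt(s1**2 / n1 + s2**2 / n2)   # s of x1_bar - x2_bar
    diff = mean1 - mean2                           # point estimator
    return diff - z * se_diff, diff + z * se_diff

# Hypothetical inputs (assumed for illustration): both samples larger than 30.
low, high = diff_means_interval(mean1=82.0, s1=8.0, n1=64,
                                mean2=78.0, s2=6.0, n2=36)
print(f"95% CI for mu1 - mu2: {low:.3f} to {high:.3f}")
```

Because this interval lies entirely above zero, the hypothetical data would suggest that population 1 has the larger mean.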

Example: Par, Inc.

Interval Estimate of \(\mu_1 - \mu_2\): Large-Sample Case

Par, Inc. is a manufacturer of golf equipment and has developed a new golf ball that has been designed to provide “extra distance.” In a test of driving distance using a mechanical driving device, a sample of Par golf balls was compared with a sample of golf balls made by Rap, Ltd., a competitor. The sample statistics appear on the next slide.

Example: Par, Inc.

Interval Estimate of \(\mu_1 - \mu_2\): Large-Sample Case

• Sample Statistics

                   Sample #1                      Sample #2
                   Par, Inc.                      Rap, Ltd.
Sample Size        \(n_1\) = 120 balls            \(n_2\) = 80 balls
Mean               \(\bar{x}_1\) = 235 yards      \(\bar{x}_2\) = 218 yards
Standard Dev.      \(s_1\) = 15 yards             \(s_2\) (not given)

Hypothesis testing