Transcript The Art in the State of the Art
Biostatistics Yusuf Al-Gau’d
BDS, MSc, MSPH, MHPE, FFPH, ScD Prof. of Epidemiology, Biostatistics and Medical Education
Why need biostatistics?
Main reason: handling variation
Biological variation: attributes differ not only among individuals but also within the same individual over time. Examples: height, weight, blood pressure, eye color ...
Sample variation: biomedical research projects are usually carried out on small numbers of study subjects
Why need to learn biostatistics?
Essential for the scientific method of investigation
• Formulate a hypothesis
• Design a study to objectively test the hypothesis
• Collect reliable and unbiased data
• Process and evaluate the data rigorously
• Interpret and draw appropriate conclusions
Essential for understanding, appraisal and critique of the scientific literature
HEALTH STATUS IN JORDAN
Selected indicators 2012
Area (km²): 88778
Total population (million): 6.4
Crude birth rate per 1000 population: 28.1
Crude death rate per 1000 population: 7
Population growth rate (%): 2.2
Total fertility rate: 3.8
Age Distribution of Jordanian Population (2010)
Selected indicators
Indicator Adult illiteracy rate (% of 15+ years) Total % Males 3.5
Females 10.0
Unemployment (%) 12.2
Mortality
Total life expectancy at birth (total years): 73 (71.6 males and 74.4 females)
Infant mortality rate per 1000 live births: 23
Maternal mortality rate per 100000 live births: 19.1
Morbidity: Chronic and non-communicable diseases
Disease                   Total %   Males %   Females %
Hypertension              44.6      42.2      46.0
Diabetes                  19.5      19.7      19.8
Total blood cholesterol   36.0      35.6      36.8
Cholesterol HDL-C         59.6      61.5      59.2
Triglyceride              38.6      48.1      34.2
Overweight                30.2      35.5      22.5
Obesity (BMI)             44.1      12.3      60.7
Metabolic syndrome        36.3      28.7      40.9
Cancer in Jordan 2000-2010
Cancer rate
Common Cancers
Road traffic accidents
Jordan has one of the highest rates of traffic-related accidents in the world.
In 2007 alone we lost 1,000 people, 18,000 were wounded, 6,000 of them seriously.
Every 10 hours someone is killed and every 35 hours a child is killed.
Considering the population of Jordan, this is a very high rate: for every 100,000 citizens, 307 are either killed or injured.
Communicable diseases
Communicable diseases have largely been controlled in Jordan
The trend of vaccine-preventable diseases has shown a remarkable decline in the last 20 years.
There is lack of information on the prevalence of hepatitis B and C virus infections.
Nutrition
Micronutrient deficiencies are common in our region
The prevalence of anemia was 34% in children under 5
The prevalence of low vitamin D (25(OH)D <30 ng/ml) was 37.3% in females as compared to 5.1% in males.
Resources
Hospitals (number): 104
Comprehensive health centers: 70
Primary health care centers: 378
Maternity and child health care centers: 431
Physicians per 10000 population: 24.5
Dentists per 10000 population: 8.2
Nurses per 10000 population: 33.0
Pharmacists per 10000 population: 12.0
Pharmaceuticals
The high cost of drugs is a major constraint.
Irrational use of drugs
inadequate drug information services.
Public health
A shortage of public health providers
No specialized training in public health
lack of capacity to respond to new and emerging health threats
Lack of public health programs (mainly prevention and screening programs) to serve populations
Challenges facing health development in Jordan
High rates of non-communicable diseases .
Considerable changes in lifestyles favouring the development of determinants and risk factors for chronic diseases, accidents, and injuries.
Lack of health system research as an integral part of national health development.
Challenges facing health development in Jordan
Jordan lacks appropriate policies and interventions that aim to improve the social, environmental and nutritional determinants of health, including poverty reduction strategies, promotion of healthy lifestyles and food safety.
Lack of strategies for addressing issues related to prevention and management of accidents and injuries
Inadequate coordination and partnership between health service providers and educational institutions for health professionals.
Challenges facing health development in Jordan
Lack of integration of the priority programmes within primary health care.
Inadequate coordination between the public sector and the rapidly expanding private sector
lack of effective systems for regulation and enforcement of standards of care.
Biostatistics: Types of Statistics
Biostatistics ???
Descriptive statistics
Organization of data Summarization of data Presentation of data
Inferential statistics
Sample and population
Populations are rarely studied because of logistical, financial and other considerations
Researchers have to rely on study samples
[Figure: a sample drawn from a population]
Types of statistical methods
Descriptive statistical methods
Provide summary indices for a given data, e.g. arithmetic mean, median, standard deviation, coefficient of variation, etc.
Inductive (inferential) statistical methods
Produce statistical inferences about a population based on information from a sample drawn from that population; these methods need to take variation into account
Random sampling
Suppose that we want to estimate the mean birth weight of male live births in Jordan.
Due to logistical constraints, we decide to take a random sample of 100 live births at the University Hospital in a given year.
• Sample: 100 live births
• Sampled population: all live births in KAUH, 2008
• Target population: all live births in Jordan
Variable
Definition: what is observed or measured in the way people differ
Examples: age, height, hair color, smoking
Types of Data
Qualitative or categorical variables
• Nominal
• Ordinal
Quantitative variables (numerical)
• Discrete variables
• Continuous variables
Categorical Variables
Cannot be measured numerically Categories must not overlap and must cover all possibilities
Classified as nominal or ordinal
Categorical Nominal Variables
Named categories
No implied order among categories
Examples
• Gender: Male/Female
• Blood groups: O, A, B, AB
• Ethnic group: Chinese, Malay, Indian, Jordanian
• Eye color: brown/black/blue/green/mixed
Categorical Ordinal Variables
Same as nominal but with ordered categories
Differences between categories are not considered equal
Examples
• Grading: excellent, satisfactory, unsatisfactory
• Pain severity: no pain, slight pain, moderate pain, severe pain
Quantitative Variables
Can be measured numerically
Examples
• Weight
• # of admissions to the hospital
• Concentration of chlorine
Can be discrete or continuous
Discrete - Numerical Variables
Integers that correspond to a count
Can assume only whole numbers
Examples
• # of bacterial colonies on a plate
• # of missing teeth
• # of accidents in a time period
• # of illnesses in a time period
Continuous Data
Continuous data are measured
Can take any value within a defined range
Limitations are imposed by the measuring instrument
Examples: blood pressure, height, weight, time
SUMMARY: Types of variables (measurement scales)
Qualitative or categorical variable
• Nominal (not ordered), e.g. ethnic group
• Ordinal (ordered), e.g. response to treatment
Quantitative variable (measurement)
• Discrete (count data), e.g. number of admissions
• Continuous (real-valued), e.g. height
Determining the Type of Data
Categorical variables
Nominal: where the variables are divided into a number of named categories that cannot be ordered one above the other (sex, marital status)
Ordinal: where the variables are divided into a number of named categories that can be ordered from lowest to highest or vice versa (levels of satisfaction and levels of knowledge)
Numerical Variables
Continuous: where between any two points there is, at least theoretically, an infinite number of values (weight and height)
Discrete: variables that have only certain fixed values, with no intermediate values possible (number of students in a classroom)
Dependent and independent variables
Whether a variable is dependent or independent is determined by the statement of the problem and the study objectives
Dependent: the variable that is used to describe or measure the problem under study.
Independent : the variables that are used to describe or measure the factors that are assumed to cause or at least to influence the problem
Why does it matter?
Categorical and quantitative variables are graphed, charted, tabled and statistically summarized in very different ways
Descriptive statistics
Organizing and Presenting Data
Overview of Data Collection Techniques
Data collection techniques allow us to systematically collect information about our study subjects (people, objects or phenomena)
Various data collection techniques can be used
• Using available information
• Observing
• Interviewing (mainly face to face)
• Administering written questionnaires
• Focus group discussions
Using Available Information
Look for sources of already collected data
• Health information system data
• Census data
• Unpublished reports
• Publications of archives, libraries or offices
Using available information can even be a study in itself
Design the instrument to retrieve the needed data, such as checklists and data compilation forms
Using Available Information
Advantages
• Inexpensive
• Permits examination of past trends
Disadvantages
• Data are not always easily accessible
• Ethical issues regarding confidentiality may arise
• Information may be incomplete and inaccurate
Observing
It involves systematically selecting, watching, and recording the behavior and characteristics of living beings or objects
Observations of human behavior are used on a small scale and can be
• Participant observation
• Non-participant observation
All kinds of measurements are also called observations
Measurements will require additional tools that can be simple or complex
Observation can be the primary source of information
It can also provide information additional to other methods of data collection
Observing
Advantages
• More detailed, more accurate information
• Collection of information not written in questionnaires
• Testing validity of responses to questionnaires
Disadvantages
• Ethical issues of privacy and confidentiality
• Observer bias
• The presence of the observer can influence the situation
• Extensive training of assistants is needed
Interviewing
It is oral questioning of respondents, either individually or as a group, face to face or over the phone
Interviews can have a high or a low degree of flexibility
The degree of flexibility depends on the researcher's level of understanding of the problem or situation
Interviewing
Advantages
• Suitable for illiterate respondents
• Permits clarification by respondents
• Higher response rate than questionnaires
Disadvantages
• The presence of the interviewer can influence the responses
• Less complete information compared to observation
Administering Written Questionnaires
Written questions are to be answered by respondents in written form
Questionnaires can be administered by
• Mail
• Group
• Drop-off
Administering Questionnaires
Advantages
• Less expensive
• Permits anonymity and probably more honest responses
• No need for assistants
• Eliminates observer bias
Disadvantages
• Cannot be used with illiterate individuals
• Nonresponse rate could be high
• Questions may be misunderstood
Terminology Used in Sample Surveys
An element is the entity on which data are collected.
A population is the collection of all elements of interest.
A sample is a subset of the population.
The target population is the population we want to make inferences about.
The sampled population is the population from which the sample is actually selected.
These two populations are not always the same.
If inferences from a sample are to be valid, the sampled population must be representative of the target population.
Terminology Used in Sample Surveys
The population is divided into sampling units, which are groups of elements or the elements themselves.
A list of the sampling units for a particular study is called a frame.
The choice of a particular frame is often determined by the availability and reliability of a list.
The development of a frame can be one of the most difficult and important steps in conducting a sample survey.
Types of Surveys
Surveys Involving Questionnaires
Three common types are mail surveys, telephone surveys, and personal interview surveys.
Survey costs are lower for mail and telephone surveys.
With well-trained interviewers, higher response rates and longer questionnaires are possible with personal interviews.
The design of the questionnaire is critical.
Types of Surveys
Surveys Not Involving Questionnaires Often, someone simply counts or measures the sampled items and records the results.
An example is sampling a company’s inventory of parts to estimate the total inventory value.
Sampling Methods
Sample surveys can also be classified in terms of the sampling method used.
The two categories of sampling methods are: Probabilistic sampling Nonprobabilistic sampling
Nonprobabilistic Sampling Methods
The probability of obtaining each possible sample cannot be computed.
Statistically valid statements cannot be made about the precision of the estimates.
Sampling cost is lower and implementation is easier.
Methods include convenience and judgment sampling.
Nonprobabilistic Sampling Methods
Convenience Sampling The units included in the sample are chosen because of accessibility.
In some cases, convenience sampling is the only practical approach.
Nonprobabilistic Sampling Methods
Judgment Sampling A knowledgeable person selects sampling units that he/she feels are most representative of the population.
The quality of the result is dependent on the judgment of the person selecting the sample.
Generally, no statistical statement should be made about the precision of the result.
Probabilistic Sampling Methods
The probability of obtaining each possible sample can be computed.
Probabilistic sampling utilizes some form of random selection.
Methods include simple random, stratified simple random, cluster, and systematic sampling.
Survey Errors
Two types of errors can occur in conducting a survey: Sampling error Nonsampling error
Survey Errors
Sampling Error
• It is defined as the magnitude of the difference between the point estimate, developed from the sample, and the population parameter.
• It occurs because not every element in the population is surveyed.
• It cannot occur in a census.
• It can not be avoided, but it can be controlled.
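The definition of sampling error above can be illustrated with a minimal sketch. The population below is simulated (hypothetical birth weights in grams, not data from the lecture), and `sampling_error` is an illustrative helper name; the point is only that the error is the absolute difference between the sample estimate and the population parameter, and that it is zero in a census.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical population of 2000 birth weights in grams (simulated, an assumption).
population = [rng.gauss(3200, 450) for _ in range(2000)]
mu = statistics.mean(population)  # population parameter

def sampling_error(sample, parameter):
    """Magnitude of the difference between the point estimate and the parameter."""
    return abs(statistics.mean(sample) - parameter)

sample = rng.sample(population, 100)          # a sample survey of 100 elements
err_sample = sampling_error(sample, mu)       # usually greater than zero
err_census = sampling_error(population, mu)   # a census surveys every element

print(f"sample survey error: {err_sample:.1f} g")
print(f"census error: {err_census:.1f} g")
```

Drawing a larger sample would typically shrink the error, which is the sense in which sampling error can be controlled but not avoided.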
Survey Errors
Nonsampling Error
It can occur in both a census and a sample survey.
Examples include:
• Measurement error
• Errors due to nonresponse
• Errors due to lack of respondent knowledge
• Selection error
• Processing error
Simple Random Sampling
A simple random sample of size n from a finite population of size N is a sample selected such that every possible sample of size n has the same probability of being selected.
We begin by developing a frame or list of all elements in the population.
Then a selection procedure, based on the use of random numbers, is used to ensure that each element in the sampled population has the same probability of being selected.
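As a minimal sketch of the procedure above, the frame can be a simple list of element identifiers and the selection can be delegated to a pseudo-random number generator. The frame of 500 IDs and the function name `simple_random_sample` are illustrative assumptions, not part of the lecture.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct elements so that every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)  # sampling without replacement

# Hypothetical frame: element IDs 1..500 (N = 500).
frame = list(range(1, 501))
sample = simple_random_sample(frame, 20, seed=42)
print(sorted(sample))
```

Fixing the seed only makes the draw reproducible; in practice the seed would be left unset.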
Simple Random Sampling
We will see in the upcoming slides how to:
Estimate the following population parameters:
• Population mean
• Population total
• Population proportion
Determine the appropriate sample size
Determining the Sample Size
An important consideration in sample design is the choice of sample size.
The best choice usually involves a tradeoff between cost and precision (size of the confidence interval).
Larger samples provide greater precision, but are more costly.
A budget might dictate how large the sample can be.
A specified level of precision might dictate how small a sample can be.
Determining the Sample Size
Smaller confidence intervals provide more precision.
The size of the approximate confidence interval depends on the bound B on the sampling error.
Choosing a level of precision amounts to choosing a value for B.
Given a desired level of precision, we can solve for the value of n.
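The slides do not state the formula, so the following is a sketch under a stated assumption: for a simple random sample used to estimate a population mean, the approximate 95% interval has bound B = 2*sqrt(((N - n)/N) * s^2/n), where s is a planning value for the standard deviation. Solving for n gives n = N*s^2 / (N*B^2/4 + s^2). The function name and the numbers in the example are hypothetical.

```python
import math

def sample_size_for_mean(N, s, B):
    """Smallest n whose approximate bound on the sampling error is at most B.

    Assumes B = 2 * sqrt(((N - n) / N) * s**2 / n) for a simple random sample.
    """
    D = B ** 2 / 4
    n = N * s ** 2 / (N * D + s ** 2)
    return math.ceil(n)  # round up so the bound is not exceeded

# Hypothetical planning values: N = 5000 patients, s = 30, desired bound B = 5.
n_needed = sample_size_for_mean(5000, 30, 5)
print(n_needed)  # 140

# Halving B (more precision) requires a larger sample.
n_tighter = sample_size_for_mean(5000, 30, 2.5)
print(n_tighter)
```

This makes the tradeoff on the slide concrete: a smaller B buys precision at the cost of a larger, more expensive sample.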
Stratified Simple Random Sampling
The population is first divided into H groups, called strata.
Then for stratum h, a simple random sample of size n_h is selected.
The data from the H simple random samples are combined to develop an estimate of a population parameter.
If the variability within each stratum is smaller than the variability across the strata, a stratified simple random sample can lead to greater precision.
The basis for forming the various strata depends on the judgment of the designer of the sample.
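A minimal sketch of stratified simple random sampling follows. The patient list, the urban/rural strata and the per-stratum sizes n_h are hypothetical choices made for illustration; the basis for the strata is left to the designer, as noted above.

```python
import random
from collections import defaultdict

def stratified_sample(elements, stratum_of, sizes, seed=None):
    """Draw a simple random sample of size sizes[h] within each stratum h."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for element in elements:
        strata[stratum_of(element)].append(element)
    return {h: rng.sample(members, sizes[h]) for h, members in strata.items()}

# Hypothetical frame: patients 1..60 are urban, 61..100 are rural.
patients = [(i, "urban" if i <= 60 else "rural") for i in range(1, 101)]
samples = stratified_sample(
    patients,
    stratum_of=lambda p: p[1],
    sizes={"urban": 6, "rural": 4},  # n_h for each stratum (an assumption)
    seed=7,
)
for stratum, members in samples.items():
    print(stratum, [pid for pid, _ in members])
```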
Systematic Sampling
Systematic Sampling is often used as an alternative to simple random sampling which can be time-consuming if a large population is involved.
If a sample size of n from a population of size N is desired, we might sample one element for every N/n elements in the population.
We would randomly select one of the first N/n elements and then select every (N/n)th element thereafter.
Since the first element selected is a random choice, a systematic sample is often assumed to have the properties of a simple random sample.
Systematic Random Sampling
Here are the steps you need to follow in order to achieve a systematic random sample:
• Number the units in the population from 1 to N
• Decide on the n (sample size) that you want or need
• Compute k = N/n, the interval size
• Randomly select an integer between 1 and k
• Then take every kth unit
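The steps for a systematic random sample can be sketched as follows. The function name and the values N = 1000 and n = 50 are illustrative; using integer division for k is an assumption that matters only when N/n is not a whole number.

```python
import random

def systematic_sample(N, n, seed=None):
    """Return the unit numbers (1..N) chosen by systematic random sampling."""
    rng = random.Random(seed)
    k = N // n                  # interval size (integer division is an assumption)
    start = rng.randint(1, k)   # random start between 1 and k, inclusive
    return [start + i * k for i in range(n)]  # every kth unit thereafter

units = systematic_sample(N=1000, n=50, seed=5)
print(units[:5])
```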
Cluster Sampling
Cluster sampling requires that the population be divided into N groups of elements called clusters.
We would define the frame as the list of N clusters.
We then select a simple random sample of n clusters.
We would then collect data for all elements in each of the n clusters.
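A minimal sketch of these cluster sampling steps follows, assuming a hypothetical population of 8 city blocks with 5 households each. The frame is the list of cluster names, a simple random sample of clusters is drawn, and every element in the chosen clusters is kept.

```python
import random

def cluster_sample(clusters, n, seed=None):
    """Select n clusters at random and keep all elements of each selected cluster."""
    rng = random.Random(seed)
    frame = sorted(clusters)        # the frame is the list of clusters
    chosen = rng.sample(frame, n)   # simple random sample of n clusters
    return {name: clusters[name] for name in chosen}

# Hypothetical population: 8 city blocks, 5 households in each.
clusters = {
    f"block-{b}": [f"block-{b}/household-{h}" for h in range(1, 6)]
    for b in range(1, 9)
}
selected = cluster_sample(clusters, 3, seed=3)
households = [h for members in selected.values() for h in members]
print(sorted(selected), len(households))
```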
Cluster Sampling
Cluster sampling tends to provide better results than stratified sampling when the elements within the clusters are heterogeneous.
A primary application of cluster sampling involves area sampling, where the clusters are counties, city blocks, or other well-defined geographic sections.
Multi-Stage Sampling
In most real applied social research, we would use sampling methods that are considerably more complex than these simple variations. The most important principle here is that we can combine the simple methods described earlier in a variety of useful ways that help us address our sampling needs in the most efficient and effective manner possible.
Multi-stage sampling
Consider the problem of sampling students in grade schools. We might begin with a national sample of school districts stratified by economics and educational level. Within selected districts, we might do a simple random sample of schools. Within schools, we might do a simple random sample of classes or grades. And, within classes, we might even do a simple random sample of students. In this case, we have three or four stages in the sampling process and we use both stratified and simple random sampling. By combining different sampling methods we are able to achieve a rich variety of probabilistic sampling methods that can be used in a wide range of social research contexts.
Organizing and Presenting Data
Descriptive Statistics: Tabular and Graphical Methods
Summarizing Qualitative Data Summarizing Quantitative Data Crosstabulations and Scatter Diagrams
Summarizing Qualitative Data
Frequency Distribution Relative Frequency Percent Frequency Distribution Bar Graph Pie Chart
Example:
Students in JUST were asked to rate the quality of food served in the cafeteria in JUST as being excellent, above average, average, below average, or poor. The ratings provided by a sample of 20 students are shown below.
Below Average, Above Average, Above Average, Average, Above Average
Average, Above Average, Average, Above Average, Below Average
Poor, Excellent, Above Average, Average, Above Average
Above Average, Below Average, Poor, Above Average, Average
Frequency Distribution
A frequency distribution is a tabular summary of data showing the frequency (or number) of items in each of several nonoverlapping classes.
The objective is to provide insights about the data that cannot be quickly obtained by looking only at the original data.
Example:
Frequency Distribution
Rating          Frequency
Poor            2
Below Average   3
Average         5
Above Average   9
Excellent       1
Total           20
Relative Frequency Distribution
The relative frequency of a class is the fraction or proportion of the total number of data items belonging to the class.
A relative frequency distribution is a tabular summary of a set of data showing the relative frequency for each class.
Percent Frequency Distribution
The percent frequency of a class is the relative frequency multiplied by 100.
A percent frequency distribution is a tabular summary of a set of data showing the percent frequency for each class.
Relative Frequency and Percent Frequency Distributions
Rating          Frequency   Relative Frequency   Percent Frequency
Poor            2           .10                  10
Below Average   3           .15                  15
Average         5           .25                  25
Above Average   9           .45                  45
Excellent       1           .05                  5
Total           20          1.00                 100
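The three distributions for the food-quality example can be reproduced with a short sketch. The ratings are the 20 responses from the example; the variable names are illustrative.

```python
from collections import Counter

ratings = [
    "Below Average", "Above Average", "Above Average", "Average", "Above Average",
    "Average", "Above Average", "Average", "Above Average", "Below Average",
    "Poor", "Excellent", "Above Average", "Average", "Above Average",
    "Above Average", "Below Average", "Poor", "Above Average", "Average",
]
order = ["Poor", "Below Average", "Average", "Above Average", "Excellent"]

freq = Counter(ratings)   # frequency of each class
n = len(ratings)          # total number of data items

print(f"{'Rating':<15}{'Freq':>5}{'Rel.':>7}{'Pct.':>6}")
for rating in order:
    rel = freq[rating] / n   # relative frequency
    pct = 100 * rel          # percent frequency
    print(f"{rating:<15}{freq[rating]:>5}{rel:>7.2f}{pct:>6.0f}")
```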
Tables
Data are arranged in rows and columns
• Simple and self-explanatory
• Title
• Label each row and column
• Show totals for rows and columns
• Include units of measure (yrs, mg/dl)
• Explain codes in a footnote
Guidelines for Developing a Table
• Describe what, when, where in the title
• Label rows and columns clearly
• Provide units of measure
• Provide row and column totals
• Define abbreviations and symbols
• Note data exclusions
• References
• Source
• Should stand alone
Bar Graph
A bar graph is a graphical device for depicting qualitative data that have been summarized in a frequency, relative frequency, or percent frequency distribution.
On the horizontal axis we specify the labels that are used for each of the classes.
A frequency, relative frequency, or percent frequency scale can be used for the vertical axis.
Using a bar of fixed width drawn above each class label, we extend the height appropriately.
The bars are separated to emphasize the fact that each class is a separate category.
Example: quality of food
[Figure: bar graph of the food quality ratings; horizontal axis: Rating (Poor, Below Average, Average, Above Average, Excellent); vertical axis: frequency, 1 to 9]
Charts
Appropriate for categorical data Bar charts Simple Grouped Stacked
Bar Charts
Display data from one-variable table Each variable is represented by a bar Bars are proportional to the number of events Can be presented vertically or horizontally
Simple Bar Chart
[Figure: simple bar chart, Annual Death Rates by Governorate, 1996-2000; horizontal axis: Governorate (AQABA, ZARQA, AMMAN, JARAS, MAFRQ, MADAB, TAFEL, IRBID, BALQA, KARAK, AJLON, MAANN)]
Grouped Bar Chart
Treatment completion and cure of disease X in various racial groups, 1994-2000
[Figure: grouped bar chart; horizontal axis: Race (A, B, C, D); bars for Cases, Completion and Cure]
Pie Chart
The pie chart is a commonly used graphical device for presenting relative frequency distributions for qualitative data.
First draw a circle; then use the relative frequencies to subdivide the circle into sectors that correspond to the relative frequency for each class.
Since there are 360 degrees in a circle, a class with a relative frequency of .25 would consume .25(360) = 90 degrees of the circle.
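The sector angles follow directly from the relative frequencies. A small sketch using the relative frequencies of the food-quality example (variable names are illustrative):

```python
relative_frequency = {
    "Poor": 0.10,
    "Below Average": 0.15,
    "Average": 0.25,
    "Above Average": 0.45,
    "Excellent": 0.05,
}

# Each class takes its share of the 360 degrees in a circle.
angles = {label: rel * 360 for label, rel in relative_frequency.items()}

for label, degrees in angles.items():
    print(f"{label:<15}{degrees:6.1f} degrees")
```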
Example: Quality of food
[Figure: pie chart of Quality Ratings: Above Average 45%, Excellent 5%, Poor 10%, Below Average 15%, Average 25%]
Summarizing Quantitative Data
Frequency Distribution Relative Frequency and Percent Frequency Distributions Histogram
Frequency Distributions
Frequency distribution for NUMERICAL data after being grouped into suitable categories
Age is a continuous variable that can be grouped into age groups and presented as a frequency distribution
When grouping numerical variables into categories, the following rules are important:
• Groups must not overlap
• There must be continuity from one group to the next
• Groups must range from the lowest to the highest measurements (preferably round numbers)
• It is preferable that groups be the same width
Example:
The manager of hospital A would like to get a better picture of the distribution of waiting times of his patients. A sample of 50 patients has been taken and their waiting times (minutes), are listed below.
91 71 104 74 85 97 62 78 69 82 93 72 62 88 57 89 68 68 75 66 97 105 77 83 52 75 68 99 79 71 98 101 79 105 79 80 75 65 69 69 97 72 80 109 67 74 62 62 76 73
Frequency Distribution
Guidelines for Selecting Number of Classes Use between 5 and 20 classes.
Data sets with a larger number of elements usually require a larger number of classes.
Smaller data sets usually require fewer classes.
Frequency Distribution
Guidelines for Selecting Width of Classes Use classes of equal width.
Approximate Class Width = (Largest Data Value - Smallest Data Value) / Number of Classes
Frequency Distribution
If we choose six classes:
Approximate Class Width = (109 - 52)/6 = 9.5 ≈ 10
Waiting time   Frequency
50-59          2
60-69          13
70-79          16
80-89          7
90-99          7
100-109        5
Total          50
Relative Frequency and Percent Frequency Distributions
Waiting time   Relative Frequency   Percent Frequency
50-59          .04                  4
60-69          .26                  26
70-79          .32                  32
80-89          .14                  14
90-99          .14                  14
100-109        .10                  10
Total          1.00                 100
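The waiting-time distributions can be rebuilt from the 50 raw values with a short sketch. Rounding the class width up and starting the first class at a round number (50) are assumptions chosen to match the classes used in the example.

```python
import math

waiting = [
    91, 71, 104, 74, 85, 97, 62, 78, 69, 82,
    93, 72, 62, 88, 57, 89, 68, 68, 75, 66,
    97, 105, 77, 83, 52, 75, 68, 99, 79, 71,
    98, 101, 79, 105, 79, 80, 75, 65, 69, 69,
    97, 72, 80, 109, 67, 74, 62, 62, 76, 73,
]

num_classes = 6
# Approximate class width: (109 - 52) / 6 = 9.5, rounded up to 10.
width = math.ceil((max(waiting) - min(waiting)) / num_classes)
# Start the first class at a round number at or below the smallest value.
lower = (min(waiting) // width) * width

classes = [(lower + i * width, lower + (i + 1) * width - 1) for i in range(num_classes)]
freq = [sum(lo <= t <= hi for t in waiting) for lo, hi in classes]
n = len(waiting)

for (lo, hi), f in zip(classes, freq):
    print(f"{lo}-{hi:<5}{f:>4}{f / n:>7.2f}{100 * f / n:>6.0f}")
```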
17 33 12 43 53 37 44 44 32 33 39 11 48 56 21 49 43 17 12 34 35 32 20 29 35 33 20 47 48 56 48 60 26 40 52 42 15 12 24 24 33 31 13 46 18 46 14 43 54 33 51 12 50 31 46 37 20 41 25 29 47 24 38 26 13 22 45 54 15 35 23 29 57 33 41 40 11 59 56 59 55 20 32 17 39 55 17 12 45 45 54 15 35 23 29 57 55 41 40 11 33 56 59 55 20 32 17 39 59 17 40 11 59 17 12 23 45 54 55 20 32 17 15 35 23 29 57 33 56 59 39 55 12 24 24 33 31 48 60 26 44 44 48 56 40 13 46 33 12 43 53 37 21 49 29 47 24 38 26 50 31 46 48 56 52 42 37 13 22 29 35 33 20 47 15 12 20 32 17 39 55 41 40 11 51 12 20 41 59 17 12 46 14 43 54 33 25 29 55 20 32 17 39 55 41 40 57 33 56 59 11 59 17 54 15 35 23 29 55 20 29 57 33 56 59 55 20 32 29 57 33 56 17 39 55 45 54 15 35 23 59 55
Simple Frequency Distribution
Primary and secondary syphilis morbidity by age, United States, 1989
Age group (years)   Number of Cases
<14                 230
15-19               4378
20-24               10405
25-29               9610
30-34               8648
35-44               6901
45-54               2631
>55                 1278
Total               44081
Simple Frequency Distribution
Primary and secondary syphilis morbidity by age, United States, 1989
Age group (years)   Number of Cases   Percent
<14                 230               0.5
15-19               4378              10.0
20-24               10405             23.6
25-29               9610              21.8
30-34               8648              19.6
35-44               6901              15.7
45-54               2631              6.0
>55                 1278              2.9
Total               44081             100.0
Histogram
Another common graphical presentation of quantitative data is a histogram.
The variable of interest is placed on the horizontal axis and the frequency, relative frequency, or percent frequency is placed on the vertical axis.
A rectangle is drawn above each class interval with its height corresponding to the interval’s frequency, relative frequency, or percent frequency.
Unlike a bar graph, a histogram has no natural separation between rectangles of adjacent classes.
Waiting time (Histogram)
[Figure: histogram of the waiting times; horizontal axis: waiting time, 50 to 110; vertical axis: frequency, 2 to 18]
Histograms
• Graph of the frequency distribution of a continuous variable
• Columns are adjoining
• Area of each column is proportional to the # of observations in that interval
Histograms
Histograms and Frequency Polygons
Example of a Histogram
[Figure: histogram of cases by week; horizontal axis: Week, 1 to 16; vertical axis: Cases, 0 to 80]
Crosstabulations and Scatter Diagrams
Thus far we have focused on methods that are used to summarize the data for one variable at a time.
Often a manager is interested in tabular and graphical methods that will help understand the relationship between two variables.
Crosstabulation and a scatter diagram are two methods for summarizing the data for two (or more) variables simultaneously.
Crosstabulation
Crosstabulation is a tabular method for summarizing the data for two variables simultaneously.
Crosstabulation can be used when both variables are qualitative.
The left and top margin labels define the classes for the two variables.
Crosstabulation: Row or Column Percentages
Converting the entries in the table into row percentages or column percentages can provide additional insight about the relationship between the two variables.
Different Types of Cross-tabulation
• Cross-tabulations that describe the sample using different combinations of background variables (age, sex, occupation, residence ...)
• Cross-tabulations displaying comparisons between groups
• Cross-tabulations focusing on the relationship between variables
Cross-Tabulation to describe the sample
In any study, it is common practice to first describe the research subjects before presenting the various results.
This can be done either by presenting the variables in simple frequency tables or by combining variables in cross tables.
Data are usually listed in both absolute figures and relative frequencies.
Distribution of disease status according to age groups
Age group   Disease present   Disease absent   Total
<10         2 (3%)            1 (2%)           3
10-19       2 (3%)            0                2
20-29       5 (7%)            2 (4%)           7
30-39       23 (34%)          12 (23%)         35
>=40        36 (53%)          37 (71%)         73
Total       68 (100%)         52 (100%)        120
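The column percentages and row totals in the disease-status table can be checked with a short sketch. The counts come from the table; the helper name `column_percent` is illustrative.

```python
age_groups = ["<10", "10-19", "20-29", "30-39", ">=40"]
present = [2, 2, 5, 23, 36]   # disease present, by age group
absent = [1, 0, 2, 12, 37]    # disease absent, by age group

def column_percent(counts):
    """Each count as a rounded percentage of its column total."""
    total = sum(counts)
    return [round(100 * c / total) for c in counts]

present_pct = column_percent(present)
absent_pct = column_percent(absent)
row_totals = [p + a for p, a in zip(present, absent)]

for group, p, pp, a, ap, t in zip(age_groups, present, present_pct, absent, absent_pct, row_totals):
    print(f"{group:<7}{p:>3} ({pp}%)  {a:>3} ({ap}%)  {t:>4}")
print(f"{'Total':<7}{sum(present):>3}         {sum(absent):>3}        {sum(row_totals):>4}")
```

Because percentages are computed within each column, each column sums to roughly 100% (up to rounding), which is what allows the two disease-status groups to be compared by age.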
Cross-Tabulation to Determine Differences Between Groups
• Cross-tabulation should be used when we aim at discovering any differences between two or more groups on a particular variable in case-control, cohort, quasi-experimental and experimental studies
• Dependent variables are displayed in columns and independent variables in rows
• Totals for each of the comparison groups should be 100%
• Cross tables will be used further for certain statistical testing
Duration of Breast Feeding In Mothers of Different Age Groups
Age Groups Duration of Breast Feeding Total 0-5 m 6-11 m 12+ m <10 10-19 20-29 30-39 =>40 Total
Duration of Breast Feeding In Relation To Working Status of Mothers
Working Status Duration of Breast Feeding 0-5 m 6-11 m 12+ m Total Full time Part time Not working Total
Scatter Diagram
A scatter diagram is a graphical presentation of the relationship between two quantitative variables.
One variable is shown on the horizontal axis and the other variable is shown on the vertical axis.
The general pattern of the plotted points suggests the overall relationship between the variables.
Scatter Diagram
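A Positive Relationship
[Figure: scatter diagram of y against x]
Scatter Diagram
A Negative Relationship
[Figure: scatter diagram of y against x]
Scatter Diagram
No Apparent Relationship
[Figure: scatter diagram of y against x]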
Scatter Diagram
Example: Panthers Football Team
The Panthers football team is interested in investigating the relationship, if any, between interceptions made and points scored.
x = Number of Interceptions   y = Number of Points Scored
1                             14
3                             24
2                             18
1                             17
3                             27
[Figure: scatter diagram; horizontal axis: Number of Interceptions, 0 to 3; vertical axis: Number of Points Scored, 0 to 30]
Scatter Diagram
Serum levels of heavy metal X in 38 moon settlers, 2200
[Figure: scatter diagram; horizontal axis: Age (years), 0 to 60; vertical axis: serum level, 0 to 20]
Frequency Polygon
[Figure: histogram of cases by week with the frequency polygon (Cases-FP) overlaid; horizontal axis: Week, 1 to 16; vertical axis: 0 to 80]
Frequency Polygon
• Graph of the entire frequency distribution of a continuous variable
• # of events in each interval is plotted at the midpoint of the interval
• Straight lines connect the points
• Useful to compare two or more distributions on the same axis
Frequency Polygon
[Figure: two frequency polygons (Cases-FP and More Cases-FP) on the same axes; horizontal axis: Week, 1 to 16; vertical axis: 0 to 80]
Frequency Polygon
Geographic Distribution of HAV Infection
Anti-HAV Prevalence High Intermediate Low Very Low
Guidelines in Developing Graphs
• Label title, source, axes, scales, legend
• Portray frequency on the vertical scale, starting with zero
• Portray method of classification on the horizontal scale
• Indicate units of measure
• Define abbreviations and symbols
• Note data exclusions
Frequency Distribution of Shortage of Antihypertensive Drugs in PHCCs During 1999
Relative Frequency of Shortage of Antihypertensive Drugs in PHCCs During 1999
Frequency Distribution of Shortage of Antihypertensive Drugs in PHCCs During 1999
Histogram
Number of Brucellosis Cases in Nowhere During 1990s
Descriptive Statistics: Numerical Methods
• Measures of Location
• Measures of Variability
Measures of Location
Mean Median Mode Percentiles Quartiles
Example
Given below is a sample of monthly income ($) for 70 diabetic patients. The data are presented in ascending order.
425 430 430 435 435 435 435 435 440 440 440 440 440 445 445 445 445 445 450 450 450 450 450 450 450 460 460 460 465 465 465 470 470 472 475 475 475 480 480 480 480 485 490 490 490 500 500 500 500 510 510 515 525 525 525 535 549 550 570 570 575 575 580 590 600 600 600 600 615 615
Mean Median Mode
Measures of Central Tendency Mean
It is the arithmetic mean and is also known as the average
It is calculated by totaling the results of all observations and dividing by the total number of observations
Example: heights of 7 girls are as follows: 141, 141, 143, 144, 145, 146, 155 cm
Total is 1015
Mean is 1015/7 = 145 cm
Mean
The mean of a data set is the average of all the data values.
If the data are from a sample, the mean is denoted by $\bar{x}$:
$\bar{x} = \frac{\sum x_i}{n}$
If the data are from a population, the mean is denoted by $\mu$ (mu):
$\mu = \frac{\sum x_i}{N}$
Mean
Example: income
$\bar{x} = \frac{\sum x_i}{n} = \frac{34{,}356}{70} = 490.80$
Example: cholesterol level (mg/dl)
190, 199, 198, 196, 192, 199, 198, 196, 193, 199, 198, 196, 196, 190, 199, 198, 196, 190, 199, 198, 196, 400, 480
Mean of all values: 217.2 mg/dl
Mean excluding the two extreme values (400 and 480): 196 mg/dl
Median
It is the value that divides the distribution into two equal halves
List all observations from lowest to highest
Count the number of observations (n)
The position of the median is (n+1)/2
Example: the weights of 7 girls are as follows: 47, 42, 44, 40, 43, 72, 41 kg
Sort first: 40, 41, 42, 43, 44, 47, 72 kg
The position of the median is (7+1)/2 = 4 (the 4th one, which is 43)
Median
The median is the measure of location most often reported for annual income and property value data.
A few extremely large incomes or property values can inflate the mean.
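The effect of extreme values on the mean can be checked with a short sketch. It reuses the cholesterol values from the earlier example; the variable names and the cut-off of 400 used to drop the two extreme values are illustrative assumptions.

```python
from statistics import mean, median

# Cholesterol levels (mg/dl) from the example above.
cholesterol = [190, 199, 198, 196, 192, 199, 198, 196, 193, 199, 198, 196,
               196, 190, 199, 198, 196, 190, 199, 198, 196, 400, 480]

# Assumption for illustration: treat readings of 400 or more as extreme.
without_extremes = [value for value in cholesterol if value < 400]

mean_all = mean(cholesterol)             # pulled upward by 400 and 480
mean_trimmed = mean(without_extremes)    # mean of the remaining 21 values
median_all = median(cholesterol)         # barely affected by the extremes

print(f"mean, all values:        {mean_all:.1f}")
print(f"mean, extremes excluded: {mean_trimmed}")
print(f"median, all values:      {median_all}")
```

The median of the full data set stays close to the bulk of the readings, which is why it is preferred for skewed data such as income.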
Median
The median of a data set is the value in the middle when the data items are arranged in ascending order.
For an odd number of observations, the median is the middle value.
For an even number of observations, the median is the average of the two middle values.
Asymmetric Distributions of the Population Values
Example: income
Median
Median = 50th percentile
Averaging the 35th and 36th data values:
Median = (475 + 475)/2 = 475
Mode
It is the most frequently occurring value in a set of observations
It is useful for categorized data
Example: the weights of 7 girls are as follows: 47, 44, 44, 40, 43, 72, 44 kg
Sort first: 40, 43, 44, 44, 44, 47, 72 kg
The mode is 44
[Figure: histogram of VAR00001 with the mean and median marked (Mean = 11.1, Median = 11, Std. Dev = 3.17), and box plot of LENGTH (N = 16)]
Mode
The mode of a data set is the value that occurs with greatest frequency.
The greatest frequency can occur at two or more different values.
If the data have exactly two modes, the data are bimodal.
If the data have more than two modes, the data are multimodal.
Example: income
Mode
450 occurred most frequently (7 times)
Mode = 450
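The three measures for the income example can be reproduced with Python's standard `statistics` module. This is only a sketch; the list name `income` is an assumption, and the values are the 70 observations listed in the example.

```python
from statistics import mean, median, multimode

# Monthly income ($) of the 70 patients, in ascending order.
income = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440, 440, 440, 440, 445,
    445, 445, 445, 445, 450, 450, 450, 450, 450, 450, 450, 460, 460, 460,
    465, 465, 465, 470, 470, 472, 475, 475, 475, 480, 480, 480, 480, 485,
    490, 490, 490, 500, 500, 500, 500, 510, 510, 515, 525, 525, 525, 535,
    549, 550, 570, 570, 575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

income_mean = mean(income)        # sum of values divided by n
income_median = median(income)    # average of the 35th and 36th values
income_modes = multimode(income)  # list of all most frequent values

print(f"n = {len(income)}, total = {sum(income)}")
print(f"mean = {income_mean:.2f}, median = {income_median}, mode(s) = {income_modes}")
```

`multimode` returns a list, so bimodal or multimodal data would show up as more than one value.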
Percentiles
A percentile provides information about how the data are spread over the interval from the smallest value to the largest value.
Admission test scores for colleges and universities are frequently reported in terms of percentiles.
Percentiles
The pth percentile of a data set is a value such that at least p percent of the items take on this value or less and at least (100 − p) percent of the items take on this value or more.
Arrange the data in ascending order.
Compute index i, the position of the pth percentile:
i = (p/100)n
If i is not an integer, round up. The pth percentile is the value in the ith position.
If i is an integer, the pth percentile is the average of the values in positions i and i + 1.
Example: income
90th Percentile
i = (p/100)n = (90/100)70 = 63
Averaging the 63rd and 64th data values:
90th Percentile = (580 + 590)/2 = 585
Quartiles
Quartiles are specific percentiles First Quartile = 25th Percentile Second Quartile = 50th Percentile = Median Third Quartile = 75th Percentile
Example: income
Third Quartile
Third quartile = 75th percentile
i = (p/100)n = (75/100)70 = 52.5, rounded up to 53
Third quartile = 53rd data value = 525
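The percentile rule used in these examples (round up when i is not an integer, otherwise average positions i and i + 1) can be sketched as a small function. The function name `percentile` is an assumption, and statistical packages often use different interpolation rules, so their results may differ slightly.

```python
def percentile(sorted_values, p):
    """pth percentile by the slide rule; p must be an integer with 0 < p < 100."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    n = len(sorted_values)
    # i = (p/100) * n, computed with integers to avoid floating-point surprises.
    whole, remainder = divmod(p * n, 100)
    if remainder:                      # i is not an integer: round up
        return sorted_values[whole]    # 1-based position whole + 1
    # i is an integer: average the values in positions i and i + 1
    return (sorted_values[whole - 1] + sorted_values[whole]) / 2


income = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440, 440, 440, 440, 445,
    445, 445, 445, 445, 450, 450, 450, 450, 450, 450, 450, 460, 460, 460,
    465, 465, 465, 470, 470, 472, 475, 475, 475, 480, 480, 480, 480, 485,
    490, 490, 490, 500, 500, 500, 500, 510, 510, 515, 525, 525, 525, 535,
    549, 550, 570, 570, 575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

q1 = percentile(income, 25)
q2 = percentile(income, 50)
q3 = percentile(income, 75)
p90 = percentile(income, 90)
print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, 90th percentile = {p90}")
```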
Descriptive Statistics: Numerical Methods
Measures of Location Measures of Variability
Mean Median Mode
Measures of Location
Compute mean, median, mode.
A: 8 9 10 10 10 11 12
B: 1 5 10 10 10 15 19
Measures of Variability
It is often desirable to consider measures of variability (dispersion), as well as measures of location.
For example, in choosing supplier A or supplier B we might consider not only the average delivery time for each, but also the variability in delivery time for each.
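A quick sketch shows why location alone is not enough. It assumes the two seven-value data sets from the exercise above are A = 8 9 10 10 10 11 12 and B = 1 5 10 10 10 15 19; the helper name `summarize` is illustrative.

```python
from statistics import mean, median, mode, stdev

set_a = [8, 9, 10, 10, 10, 11, 12]
set_b = [1, 5, 10, 10, 10, 15, 19]


def summarize(values):
    """Return location and variability measures for one data set."""
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
        "range": max(values) - min(values),
        "sd": stdev(values),   # sample standard deviation (n - 1 divisor)
    }


summary_a = summarize(set_a)
summary_b = summarize(set_b)
for name, summary in (("A", summary_a), ("B", summary_b)):
    print(name, summary)
```

Both sets share the same mean, median and mode, yet B is far more spread out, which only the range and standard deviation reveal.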
Measures of Variability
Range Interquartile Range Variance Standard Deviation Coefficient of Variation
Range
The range of a data set is the difference between the largest and smallest data values.
It is the simplest measure of variability.
It is very sensitive to the smallest and largest data values.
Measures of Dispersion
Range: “Difference between smallest and largest value”
Simple to calculate
Example:
Data set: 13, 20, 89, 47, 12, 22, 70, 51, 30 (cm)
Range = 89 – 12 = 77 cm
Range doesn’t tell much about the distribution of the measurements
Example: income
Range
Range = largest value - smallest value
Range = 615 - 425 = 190
Interquartile Range
The interquartile range of a data set is the difference between the third quartile and the first quartile.
It is the range for the middle 50% of the data.
It overcomes the sensitivity to extreme data values.
Example: income
Interquartile Range
3rd Quartile (Q3) = 525
1st Quartile (Q1) = 445
Interquartile Range = Q3 - Q1 = 525 - 445 = 80
Variance
The variance is a measure of variability that utilizes all the data.
It is based on the difference between the value of each observation ($x_i$) and the mean ($\bar{x}$ for a sample, $\mu$ for a population).
Variance
The variance is the average of the squared differences between each data value and the mean.
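If the data set is a sample, the variance is denoted by $s^2$.
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
If the data set is a population, the variance is denoted by $\sigma^2$.
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$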
Standard Deviation
The standard deviation of a data set is the positive square root of the variance.
It is measured in the same units as the data, making it more easily comparable, than the variance, to the mean.
If the data set is a sample, the standard deviation is denoted s.
$s = \sqrt{s^2}$
If the data set is a population, the standard deviation is denoted $\sigma$ (sigma).
$\sigma = \sqrt{\sigma^2}$
Measures of Dispersion
Calculating Standard Deviation (Method 1)
Mean: $\bar{X} = \frac{\sum x}{n}$
Standard Deviation: $SD = \sqrt{\frac{\sum (x - \bar{X})^2}{n - 1}}$
Measures of Dispersion
Calculating Standard Deviation (Method 2)
$SD = \sqrt{\frac{\sum x^2 - (\sum x)^2 / n}{n - 1}}$
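The two calculation methods are algebraically equivalent. The sketch below checks this numerically on the nine measurements from the range example; the function names are assumptions, and Method 2 is taken to be the usual computational shortcut based on the sum of squares.

```python
from math import isclose, sqrt
from statistics import stdev

data = [13, 20, 89, 47, 12, 22, 70, 51, 30]   # cm, from the range example


def sd_method_1(values):
    """Deviation method: sqrt( sum (x - mean)^2 / (n - 1) )."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


def sd_method_2(values):
    """Shortcut method: sqrt( (sum x^2 - (sum x)^2 / n) / (n - 1) )."""
    n = len(values)
    sum_x = sum(values)
    sum_x_squared = sum(x * x for x in values)
    return sqrt((sum_x_squared - sum_x ** 2 / n) / (n - 1))


sd1 = sd_method_1(data)
sd2 = sd_method_2(data)
print(f"method 1: {sd1:.4f}  method 2: {sd2:.4f}  statistics.stdev: {stdev(data):.4f}")
```

Method 2 avoids computing each deviation, which helped with hand calculation; with floating-point arithmetic on large values, Method 1 is generally the numerically safer choice.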
Coefficient of Variation
The coefficient of variation indicates how large the standard deviation is in relation to the mean.
If the data set is a sample, the coefficient of variation is computed as follows:
$\frac{s}{\bar{x}} (100)$
If the data set is a population, the coefficient of variation is computed as follows:
$\frac{\sigma}{\mu} (100)$
Variance
Example: income
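Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 2{,}996.16$
Standard Deviation: $s = \sqrt{s^2} = \sqrt{2{,}996.16} = 54.74$
Coefficient of Variation: $\frac{s}{\bar{x}} (100) = \frac{54.74}{490.80} (100) = 11.15$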
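The variance, standard deviation and coefficient of variation for the income example can be recomputed from the 70 listed values. This is a sketch using the standard `statistics` module; the variable names are assumptions.

```python
from statistics import mean, stdev, variance

income = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440, 440, 440, 440, 445,
    445, 445, 445, 445, 450, 450, 450, 450, 450, 450, 450, 460, 460, 460,
    465, 465, 465, 470, 470, 472, 475, 475, 475, 480, 480, 480, 480, 485,
    490, 490, 490, 500, 500, 500, 500, 510, 510, 515, 525, 525, 525, 535,
    549, 550, 570, 570, 575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

sample_variance = variance(income)   # divides by n - 1
sample_sd = stdev(income)            # square root of the sample variance
coefficient_of_variation = sample_sd / mean(income) * 100

print(f"s^2 = {sample_variance:.2f}")
print(f"s   = {sample_sd:.2f}")
print(f"CV  = {coefficient_of_variation:.2f}%")
```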
Introduction to Probability
Experiments, Counting Rules, and Assigning Probabilities
Events and Their Probability
Some Basic Relationships of Probability
Conditional Probability
Bayes’ Theorem
Probability
Probability is a numerical measure of the likelihood that an event will occur.
Probability values are always assigned on a scale from 0 to 1.
A probability near 0 indicates an event is very unlikely to occur.
A probability near 1 indicates an event is almost certain to occur.
A probability of 0.5 indicates the occurrence of the event is just as likely as it is unlikely.
Probability as a Numerical Measure of the Likelihood of Occurrence
[Figure: probability scale from 0 to 1 showing increasing likelihood of occurrence; at .5 the occurrence of the event is just as likely as it is unlikely]
An Experiment and Its Sample Space
An experiment is any process that generates well-defined outcomes.
The sample space for an experiment is the set of all experimental outcomes.
A sample point is an element of the sample space, any one particular experimental outcome.
Assigning Probabilities
Classical Method: assigning probabilities based on the assumption of equally likely outcomes.
Relative Frequency Method: assigning probabilities based on experimentation or historical data.
Subjective Method: assigning probabilities based on the assignor’s judgment.
Classical Method
If an experiment has n possible outcomes, this method would assign a probability of 1/n to each outcome.
Example
Experiment: Rolling a die
Sample Space: S = {1, 2, 3, 4, 5, 6}
Probabilities: Each sample point has a 1/6 chance of occurring.
Probability
The probability of an outcome is the proportion of times the outcome would occur if we repeated the procedure many times.
Examples Coin: What is the probability of obtaining heads when flipping a coin?
A single die: What is the probability I will roll a four?
Two dice: What is the probability I will roll a four?
A jar of 30 red and 40 green jelly beans: What is the probability I will randomly select a red jelly bean?
Computer: In the past 20 times I used my computer, it crashed 4 times and didn’t crash 16 times. What is the probability my computer will crash next time I use it?
Probability
Independence: Two events are independent if the outcome of one does not affect or give an indication of the outcome of the other.
Independent or dependent events?
Flipping a coin twice
Temperature on consecutive days
3 jelly beans: red, green, orange. Eat one. Eat another.
Probability
Independence: Two events are independent if the outcome of one does not affect or give an indication of the outcome of the other.
Independent or dependent events?
Randomly polling two individuals
Comparing fertilizer yield for two adjacent field plots
Rolling two dice
Probability
Definition: A sample space is a set of all the possible outcomes of a process.
Example: Coin What is the sample space for flipping a coin 3 times?
Probability
Definition: An event is an outcome or set of outcomes of a process.
Example: Coin What is one of the possible events for flipping a coin 3 times?
Probability Rules
Rule 1: The probability of any event is between 0 and 1 inclusive.
Pr(HTH) = 1/8 which is between 0 and 1.
Rule 2: The probability of the whole sample space is 1.
Pr(rolling a 1 or 2 or 3 or 4 or 5 or 6) = 1
Rule 3: The probability of an event not occurring is 1 minus the probability of the event. This is known as the complement rule.
Pr(not rolling a 5) = 1 – 1/6 = 5/6
Probability Rules
Rule 4: If two events A and B have no outcomes in common (they are disjoint), then Pr(A or B) = Pr(A) + Pr(B)
Pr(rolling a 1 or a 6) = Pr(rolling a 1) + Pr(rolling a 6) = 1/6 + 1/6 = 2/6 or 1/3
Rule 5: If two events A and B are independent, then Pr(A and B) = Pr(A)Pr(B)
Pr(rolling a 1 and then a 6) = Pr(rolling a 1) * Pr(rolling a 6) = (1/6)(1/6) = 1/36
Rules of Probability
Rule 1: 0 ≤ P(A) ≤ 1
Rule 2: P(S) = 1
Rule 3: Complement Rule: For any event A, $P(A^c) = 1 - P(A)$
Rule 4: Addition Rule: If A and B are disjoint events, then P(A or B) = P(A) + P(B)
Rule 5: Multiplication Rule: If A and B are independent events, then P(A and B) = P(A)P(B)
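The five rules can be verified by enumerating the sample space of a fair die. The sketch below uses exact fractions; the helper name `prob` is an assumption, and equally likely outcomes (the classical method) are assumed throughout.

```python
from fractions import Fraction
from itertools import product

die = {1, 2, 3, 4, 5, 6}                  # sample space for one roll
two_rolls = set(product(die, repeat=2))   # 36 ordered pairs for two rolls


def prob(event, sample_space):
    """Classical probability: outcomes in the event / outcomes in the space."""
    favourable = set(event) & set(sample_space)
    return Fraction(len(favourable), len(sample_space))


p_whole_space = prob(die, die)                       # Rule 2
p_not_five = 1 - prob({5}, die)                      # Rule 3 (complement)
p_one_or_six = prob({1}, die) + prob({6}, die)       # Rule 4 (disjoint events)
p_one_then_six = prob({1}, die) * prob({6}, die)     # Rule 5 (independent rolls)

print(p_whole_space, p_not_five, p_one_or_six, p_one_then_six)
```

Counting the single pair (1, 6) among the 36 ordered outcomes gives the same 1/36, which is a direct check on the multiplication rule.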
Events and Their Probability
An event is a collection of sample points.
The probability of any event is equal to the sum of the probabilities of the sample points in the event.
If we can identify all the sample points of an experiment and assign a probability to each, we can compute the probability of an event.
Some Basic Relationships of Probability
There are some basic probability relationships that can be used to compute the probability of an event without knowledge of all the sample point probabilities.
Complement of an Event
Union of Two Events
Intersection of Two Events
Mutually Exclusive Events
Complement of an Event
The complement of event A is defined to be the event consisting of all sample points that are not in A. The complement of A is denoted by $A^c$.
The Venn diagram below illustrates the concept of a complement.
[Venn diagram: sample space S containing event A and its complement $A^c$]
Union of Two Events
The union of events A and B is the event containing all sample points that are in A or B or both.
The union is denoted by $A \cup B$.
The union of A and B is illustrated below.
[Venn diagram: sample space S with overlapping events A and B]
Intersection of Two Events
The intersection of events A and B is the set of all sample points that are in both A and B.
The intersection is denoted by $A \cap B$.
The intersection of A and B is the area of overlap in the illustration below.
[Venn diagram: sample space S with events A and B; the overlap is the intersection]
Addition Law
The addition law provides a way to compute the probability of event A, or B, or both A and B occurring.
The law is written as: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
Mutually Exclusive Events
Two events are said to be mutually exclusive if the events have no sample points in common. That is, two events are mutually exclusive if, when one event occurs, the other cannot occur.
[Venn diagram: sample space S with non-overlapping events A and B]
Mutually Exclusive Events
Addition Law for Mutually Exclusive Events
$P(A \cup B) = P(A) + P(B)$
Conditional Probability
The probability of an event given that another event has occurred is called a conditional probability.
The conditional probability of A given B is denoted by P(A|B).
A conditional probability is computed as follows:
$P(A|B) = \frac{P(A \cap B)}{P(B)}$
Multiplication Law
The multiplication law provides a way to compute the probability of an intersection of two events.
The law is written as: $P(A \cap B) = P(B)P(A|B)$
Independent Events
Events A and B are independent if P(A|B) = P(A).
Independent Events
Multiplication Law for Independent Events
$P(A \cap B) = P(A)P(B)$
The multiplication law also can be used as a test to see if two events are independent.
Bayes’ Theorem
To find the posterior probability that event $A_i$ will occur given that event B has occurred we apply Bayes’ theorem.
$P(A_i|B) = \frac{P(A_i)P(B|A_i)}{P(A_1)P(B|A_1) + P(A_2)P(B|A_2) + \dots + P(A_n)P(B|A_n)}$
Bayes’ theorem is applicable when the events for which we want to compute posterior probabilities are mutually exclusive and their union is the entire sample space.
Daniel’s example: Frequency of cocaine use by gender (Erickson and Murray, Am J of Drug and Alcohol Abuse, 15 (1989), 135-152.)
Lifetime freq use      Male (M)   Female (F)   Total
1-19 times (A)         32         7            39
20-99 times (B)        18         20           38
100+ times (C)         25         9            34
Total                  75         36           111
Assumptions: male and female are mutually exclusive; the likelihood of selecting any person in the group is equal for all.
Marginal probabilities:
P(person is a male) = number of males/number of subjects = 75/111 = 0.6757
P(person is a female) = P(is not a male) = 1 − P(is a male) = 1 − 0.6757 = 0.3243
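The marginal probabilities, together with a joint and a conditional probability from the same table, can be computed as follows. This is a sketch: the dictionary layout and the helper `p` are assumptions, and the cell counts are read from the table as male 32, 18, 25 and female 7, 20, 9.

```python
from fractions import Fraction

# Cell counts keyed by (gender, lifetime frequency of use).
counts = {
    ("M", "A"): 32, ("M", "B"): 18, ("M", "C"): 25,
    ("F", "A"): 7,  ("F", "B"): 20, ("F", "C"): 9,
}
total = sum(counts.values())


def p(gender=None, freq=None):
    """Probability that a randomly selected subject matches the given labels."""
    matching = sum(
        count for (g, f), count in counts.items()
        if gender in (None, g) and freq in (None, f)
    )
    return Fraction(matching, total)


p_male = p(gender="M")                        # marginal probability
p_female = 1 - p_male                         # complement rule
p_heavy = p(freq="C")                         # marginal: used 100+ times
p_male_and_heavy = p("M", "C")                # joint probability
p_heavy_given_male = p_male_and_heavy / p_male   # conditional probability
independent = p_heavy_given_male == p_heavy      # independence test

print(f"P(M) = {float(p_male):.4f}, P(F) = {float(p_female):.4f}")
print(f"P(C|M) = {p_heavy_given_male}, P(C) = {p_heavy}, independent: {independent}")
```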
Probability in clinical practice
http://www.screening.nhs.uk/screening
Example: results of a screening test
                      Disease
Test result           Present (D)   Absent ($D^c$)   Total
Positive (T)          a = TP        b = FP           a + b
Negative ($T^c$)      c = FN        d = TN           c + d
Total                 a + c         b + d            n
Sensitivity: $P(T^+|D^+)$ = TP/(TP+FN)
Specificity: $P(T^-|D^-)$ = TN/(TN+FP)
Predictive value positive: $P(D^+|T^+)$ = TP/(TP+FP)
Predictive value negative: $P(D^-|T^-)$ = TN/(TN+FN)
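The four measures follow directly from the cells of the 2 x 2 table. The sketch below uses hypothetical counts (TP = 90, FP = 30, FN = 10, TN = 870) chosen only for illustration; they do not come from the slides, and the function name is an assumption.

```python
def screening_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity and predictive values from a 2 x 2 table."""
    return {
        "sensitivity": tp / (tp + fn),   # P(T+ | D+)
        "specificity": tn / (tn + fp),   # P(T- | D-)
        "ppv": tp / (tp + fp),           # P(D+ | T+)
        "npv": tn / (tn + fn),           # P(D- | T-)
    }


# Hypothetical counts for illustration only.
metrics = screening_metrics(tp=90, fp=30, fn=10, tn=870)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```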
Screening tests
A proportion p of the population is known to carry virus Z. Tests of people carrying the virus return a positive result 98% of the time. For people not carrying the virus, the test correctly returns a negative result 97% of the time. If an individual tests positive, what is P(individual carries the virus)?
V = virus Z is present, T = test result is positive
Probability tree:
P(V) = p: $P(T|V) = 0.98$, $P(T^c|V) = 0.02$
$P(V^c) = 1 - p$: $P(T|V^c) = 0.03$, $P(T^c|V^c) = 0.97$
$P(V|T) = \frac{P(V \cap T)}{P(T)} = \frac{P(T|V)P(V)}{P(T|V)P(V) + P(T|V^c)P(V^c)}$
$P(V|T) = \frac{0.98\,p}{0.98\,p + 0.03\,(1 - p)}$
p = prevalence
p        P(V|T)
0.1      0.784
0.05     0.632
0.001    0.032
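The dependence of P(V|T) on prevalence can be reproduced with a few lines. This is a sketch of Bayes' theorem for the screening example; the function and parameter names are assumptions, while the 98% and 97% figures come from the example.

```python
def posterior_given_positive(prevalence, sensitivity=0.98, specificity=0.97):
    """P(V | T): probability of carrying the virus given a positive test."""
    true_positive = sensitivity * prevalence                 # P(T|V) P(V)
    false_positive = (1 - specificity) * (1 - prevalence)    # P(T|V^c) P(V^c)
    return true_positive / (true_positive + false_positive)


posteriors = {p: posterior_given_positive(p) for p in (0.1, 0.05, 0.001)}
for p, value in posteriors.items():
    print(f"prevalence {p}: P(V|T) = {value:.3f}")
```

Even with a highly accurate test, a positive result in a low-prevalence population is far more likely to be a false positive than a true one.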
Probability: frequency (m) of occurrence of an event of interest among N mutually exclusive and equally likely events: P(E) = m/N
PROPERTIES: $0 \le P(E_i) \le 1$
MULTIPLICATION RULE: $P(M \cap C) = P(C|M)P(M)$ if $P(M) \ne 0$
ADDITION RULE: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
MARGINAL PROBABILITY (TOTAL PROBABILITY):
$P(B) = P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + \dots + P(B|A_k)P(A_k) = \sum_{i=1}^{k} P(B|A_i)P(A_i)$
BAYES:
$P(V|T) = \frac{P(V \cap T)}{P(T)} = \frac{P(T|V)P(V)}{P(T|V)P(V) + P(T|V^c)P(V^c)}$
Normal Distribution
Bell-Shaped Curve: The Normal Distribution of Population Values
Function
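The Normal is a theoretical distribution specified by its two parameters, the mean $\mu$ and the standard deviation $\sigma$
Normal Probability Distribution
Normal Probability Density Function
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x - \mu)^2 / (2\sigma^2)}$
where:
$\mu$ = mean
$\sigma$ = standard deviation
$\pi$ = 3.14159
e = 2.71828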
Asymmetric Distributions of the Population Values
The Normal Distribution
[Figure: normal curve annotated with the mean and standard deviation; horizontal axis x]
Normal Probability Distribution
Characteristics of the Normal Probability Distribution
The shape of the normal curve is often illustrated as a bell-shaped curve.
Two parameters, $\mu$ (mean) and $\sigma$ (standard deviation), determine the location and shape of the distribution.
The highest point on the normal curve is at the mean, which is also the median and mode.
The mean can be any numerical value: negative, zero, or positive.
Normal Probability Distribution
Characteristics of the Normal Probability Distribution
The normal curve is symmetric.
The standard deviation determines the width of the curve: larger values result in wider, flatter curves.
The total area under the curve is 1 (.5 to the left of the mean and .5 to the right).
Probabilities for the normal random variable are given by areas under the curve.
Empirical Rule for Any Normal Curve
68% of the values fall within one standard deviation of the mean
95% of the values fall within two standard deviations of the mean
99.7% of the values fall within three standard deviations of the mean
Empirical Rule for Any Normal Curve (example: mean = 200, SD = 10)
68%: from 190 (−1 SD) to 210 (+1 SD)
95%: from 180 (−2 SD) to 220 (+2 SD)
99.7%: from 170 (−3 SD) to 230 (+3 SD)
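The empirical rule can be checked against the exact normal areas for the example with mean 200 and SD 10. The sketch uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is an assumption. The rule's percentages are rounded versions of the exact areas.

```python
from statistics import NormalDist

dist = NormalDist(mu=200, sigma=10)


def share_within(k):
    """Proportion of values within k standard deviations of the mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)


shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} SD ({200 - 10 * k} to {200 + 10 * k}): {share:.4f}")
```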
The Mean and Standard Deviation of the Normal Distribution Determine
What proportion of individuals fall into any range of values
At what percentile a given individual falls, if you know their value
What value corresponds to a given percentile
The standard Normal Curve
Because there are many possible normal curves, in order to make use of their wonderful properties we need to standardize them:
Convert every normal distribution to the standard normal distribution by converting all of the scores to standard scores
Then use a Table of Areas under the Normal Curve to interpret and make inferences about the data
Properties of a Standard Normal Curve: z-scores
Convert the raw scores to z-scores
Every Distribution of z-scores:
Has a mean equal to 0
Has a standard deviation equal to 1
Has a shape that is the same as the underlying distribution of raw scores
The Normal Distribution
For the Standard Normal Distribution (or Z Distribution) we can find probabilities associated with different values of Z using Z tables.
The Normal Distribution
First we look at some general characteristics of the Z-distribution.
The area under the entire curve is 1.
The area under the curve to the left of 0 is 0.5.
We say, “The probability that Z is to the left of 0 is 0.5.” This can be written as Prob(Z < 0) = 0.5.
The Normal Distribution
We can find the probability that Z is to the left of any number using the Z-table.
Z-tables can also be found on the inside front cover of the book.
Notice first that if we go in the table to the value z = 0.00, we see the probability is 0.5.
The Normal Distribution
We can find the probability that Z is to the left of any number using the Z-table.
Pr(Z < 1.25) = ?
Answer: 0.8944
Pr(Z < 0.50) = ?
Answer: 0.6915
The Normal Distribution
More examples of probabilities to the left of (less than) a number
Pr(Z < -2.01) = ?
Answer: 0.0222
Pr(Z < -3.75) = ?
Answer: approximately 0.0001
The Normal Distribution
The Z-table only gives probabilities to the left of the value. If we want to get probabilities to the right we use 1 – Pr (Z < z).
Pr(Z > 1.25) = ?
Answer: 1 – 0.8944 = 0.1056
Pr(Z > 0.50) = ?
Answer: 1 – 0.6915 = 0.3085
The Normal Distribution
More examples of finding probabilities to the right of a number using 1 – Pr (Z < z).
Pr(Z > -2.01) = ?
Answer: 1 – 0.0222 = 0.9778
Pr(Z > -3.75) = ?
Answer: approximately 1 – 0.0001 = 0.9999
The Normal Distribution
To find probabilities between two numbers, find the less than (or to the left) probability for each number and then subtract.
Pr(-2.01 < Z < 2.01) = ?
ANSWER: Pr(Z < 2.01) – Pr(Z < -2.01) = 0.9778 – 0.0222 = 0.9556
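The Z-table lookups in these examples can be reproduced in code. The sketch below uses `statistics.NormalDist` in place of a printed table; the helper names are assumptions, and the results are rounded to four decimals as in a typical Z-table.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal: mean 0, standard deviation 1


def left_of(z):
    """Pr(Z < z), the value a Z-table gives directly."""
    return Z.cdf(z)


def right_of(z):
    """Pr(Z > z) = 1 - Pr(Z < z)."""
    return 1 - Z.cdf(z)


def between(a, b):
    """Pr(a < Z < b) = Pr(Z < b) - Pr(Z < a)."""
    return Z.cdf(b) - Z.cdf(a)


for z in (1.25, 0.50, -2.01, -3.75):
    print(f"Pr(Z < {z}) = {left_of(z):.4f}   Pr(Z > {z}) = {right_of(z):.4f}")
print(f"Pr(-2.01 < Z < 2.01) = {between(-2.01, 2.01):.4f}")
```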
Standard Normal Probability Distribution
A random variable that has a normal distribution with a mean of zero and a standard deviation of one is said to have a standard normal probability distribution.
The letter z is commonly used to designate this normal random variable.
Converting to the Standard Normal Distribution
$z = \frac{x - \mu}{\sigma}$
We can think of z as a measure of the number of standard deviations x is from $\mu$.
Standard or z score
A z score indicates distance from the mean in standard deviation units. Formula:
$z = \frac{X - \mu}{\sigma}$ (population)
$z = \frac{X - \bar{X}}{S_X}$ (sample)
Converting to standard or z scores does not change the shape of the distribution.
z-Scores
the regularity of the normal distribution allows us to:
determine the normalised deviation of an individual observation from the mean of the distribution: z-score (transformation to: mean μ = 0, sd σ = 1)
estimate the probability of observing a particular score, based on the normalised area under the bell curve
Formula: $z = \frac{X - \mu}{\sigma}$
X: observed value
μ: mean of distribution
σ: standard deviation of distribution
[Figure: standard normal curve with z from −2 to +2; area between 0 and +1 is 34.13%, between +1 and +2 is 13.59%, beyond +2 is 2.28%]
note:
z is the deviation from the mean in units of standard deviation
intervals defined by z-boundaries correspond to precisely defined proportions of the distribution
normal distribution & z-score transformation
Formula: $Y = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(X - \mu)^2 / 2\sigma^2}$
properties of the normal distribution:
1. 50% below and 50% above the mean (mean = median)
2. most scores at mean (mean = mode), extreme scores are relatively rare (low frequencies)
3. symmetrical
---> basis for z-score transformation:
Formula: $z = \frac{X - \mu}{\sigma}$
==> normalized proportions of area in z-score sections: 34.13% (0 to +1), 13.59% (+1 to +2), 2.28% (beyond +2)
Sampling distributions
The distribution of sample means for sample size n will have:
a mean of $\mu$
a standard deviation of $\sigma / \sqrt{n}$
and will approach a normal distribution as n approaches infinity
[Figure: sample means $\bar{x}$ from repeated samples, each shown with an interval around it]
Central limit theorem
A cornerstone for much of inferential statistics: the central limit theorem: For any population with a mean $\mu$ and a standard deviation $\sigma$, the distribution of sample means for sample size n will have a mean of $\mu$ and a standard deviation of $\sigma / \sqrt{n}$, and will approach a normal distribution as n approaches infinity.
The shape of the distribution of sample means is close to normal if the population from which the samples are drawn is normal, or if n is large (n>30)
Central limit theorem
• the mean of the distribution of sample means, the expected value of $\bar{X}$, equals the population mean $\mu$ (it is more probable to draw scores close to $\mu$)
• the variability of the distribution of sample means is described by the standard error of the mean, $\sigma_{\bar{X}}$, which measures the standard difference between $\bar{X}$ and $\mu$
• the sample size determines how close the sample mean will be to the population mean, and influences the standard error of the mean: $\sigma_{\bar{X}} = \sigma / \sqrt{n}$ (note that for n = 1: $\sigma_{\bar{X}} = \sigma$)
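A small simulation illustrates the central limit theorem. The sketch below draws repeated samples of size n = 36 from a strongly skewed population (exponential with mean 1 and standard deviation 1); the population, sample size, number of samples and seed are all assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)        # fixed seed so the run is repeatable
population_mean = 1.0         # exponential(1) has mean 1 ...
population_sd = 1.0           # ... and standard deviation 1
n = 36                        # size of each sample
num_samples = 5000            # number of repeated samples


def sample_mean():
    """Mean of one random sample of size n from the skewed population."""
    return sum(rng.expovariate(1.0) for _ in range(n)) / n


sample_means = [sample_mean() for _ in range(num_samples)]
mean_of_means = mean(sample_means)
sd_of_means = stdev(sample_means)
standard_error = population_sd / sqrt(n)   # sigma / sqrt(n) predicted by the theorem

print(f"mean of sample means: {mean_of_means:.3f} (population mean {population_mean})")
print(f"SD of sample means:   {sd_of_means:.3f} (predicted {standard_error:.3f})")
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve even though the population itself is far from normal.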
The Normal Distribution
Properties
Most values cluster near the mean and few values lie out near the tails
Values in the middle of the curve have a greater probability of occurring than values further away
There are many possible normal curves with different means and standard deviations
Standardized Scores
standardized score = (observed value minus mean) / (std dev)
$z = \frac{x - \mu}{\sigma}$
x is the observed value
$\mu$ is the population mean
$\sigma$ is the population standard deviation
The Normal Distribution
Now suppose we know $X \sim N(\mu, \sigma^2)$ and we want to know the probability that X is less than some value. We must first convert the X to a Z and then use the probabilities from the Z-table.
Recall that if $X \sim N(\mu, \sigma^2)$, then $Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$
So $\Pr(X < x) = \Pr\left(Z < \frac{x - \mu}{\sigma}\right)$
The Normal Distribution
Here’s an example. Suppose X ~ N(3, 4). Find the probability that X is less than 4.
Pr(X < 4) = ?
$Z = \frac{X - \mu}{\sigma} = \frac{4 - 3}{2} = 1/2 = 0.5$
Pr(X < 4) = Pr(Z < 0.5) = 0.6915
The Normal Distribution
We will look at some more difficult examples by hand:
Suppose X ~ N(2, 9).
Given a value z, find the corresponding x that it came from.
How many standard deviations is x from $\mu$?
Find Pr (X > 5).
Find Pr (X < –4 or X > 8).
Find Pr ( –4 < X < 8 ).
Find the x* such that Pr ( X < x* ) = 0.8485, where 0.8485 is some probability.
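These exercises can be checked numerically. The sketch below assumes that N(2, 9) means a mean of 2 and a variance of 9 (so σ = 3), matching the N(μ, σ²) notation used above; the helper and variable names are assumptions.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Number of standard deviations x lies from mu."""
    return (x - mu) / sigma


def x_from_z(z, mu, sigma):
    """Invert the z-score: recover the raw value x."""
    return mu + z * sigma


# Worked example: X ~ N(3, 4), so sigma = 2.
p_example = NormalDist(mu=3, sigma=2).cdf(4)     # Pr(X < 4) = Pr(Z < 0.5)

# Exercises: X ~ N(2, 9), so sigma = 3.
X = NormalDist(mu=2, sigma=3)
p_greater_5 = 1 - X.cdf(5)                       # Pr(X > 5) = Pr(Z > 1)
p_between = X.cdf(8) - X.cdf(-4)                 # Pr(-4 < X < 8)
p_outside = 1 - p_between                        # Pr(X < -4 or X > 8)
x_star = X.inv_cdf(0.8485)                       # Pr(X < x*) = 0.8485

print(f"Pr(X < 4) for N(3, 4):  {p_example:.4f}")
print(f"Pr(X > 5):              {p_greater_5:.4f}")
print(f"Pr(X < -4 or X > 8):    {p_outside:.4f}")
print(f"Pr(-4 < X < 8):         {p_between:.4f}")
print(f"x*:                     {x_star:.2f}")
```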
Law of Large Numbers
Draw independent observations at random from any population with finite mean μ. Decide how accurately you would like to estimate μ. As the number of observations drawn increases, the sample mean $\bar{x}$ approaches the mean μ of the population as closely as you specified and then stays that close.
Interval Estimation
Estimation of a Population Mean
Estimation of a Population Proportion Interval
Estimation of the difference between two Population Means
Estimation of the difference between two Population Proportions
Interval Estimation of population mean
[Figure: sample means $\bar{x}$ from repeated samples, each shown with an interval around it]
Interval Estimation of a Population Mean: Large-Sample Case
Sampling Error
Probability Statements about the Sampling Error
Constructing an Interval Estimate: Large-Sample Case with $\sigma$ Known
Calculating an Interval Estimate: Large-Sample Case with $\sigma$ Unknown
Sampling Error
The absolute value of the difference between an unbiased point estimate and the population parameter it estimates is called the sampling error.
For the case of a sample mean estimating a population mean, the sampling error is
Sampling Error = $|\bar{x} - \mu|$
Probability Statements About the Sampling Error
Knowledge of the sampling distribution of $\bar{x}$ enables us to make probability statements about the sampling error even though the population mean $\mu$ is not known.
A probability statement about the sampling error is a precision statement.
Confidence Interval (CI) Definition of Confidence Interval
“Interval or range of values which most likely encompasses the true population value”
Helps determine extent of deviation of sample value from the population
Upper and lower limits of the CI are termed Confidence Limits
Interpretation of CI
A 95% CI of 152 to 164 cm for the mean height of a population of women means that you are 95% confident (certain) that the real population mean* lies between 152 and 164cm
152 = lower confidence limit
164 = upper confidence limit
* which you cannot know exactly unless you measure the heights of all women
Probability Statements About the Sampling Error
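Precision Statement: There is a $1 - \alpha$ probability that the value of a sample mean will provide a sampling error of $z_{\alpha/2}\,\sigma_{\bar{x}}$ or less.
[Figure: sampling distribution of $\bar{x}$; $1 - \alpha$ of all $\bar{x}$ values lie in the center, with area $\alpha/2$ in each tail]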
Calculations
$SE = \frac{SD}{\sqrt{n}}$
Calculating CI
Takes into account standard error (SE)
SE gives an estimate of the degree to which the sample mean varies from the population mean
Computed on the basis of Standard Deviation
$SE = \frac{SD}{\sqrt{n}}$
Interval Estimate of a Population Mean: Large-Sample Case ( n > 30)
With $\sigma$ Known
$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
where:
$\bar{x}$ is the sample mean
$1 - \alpha$ is the confidence coefficient
$z_{\alpha/2}$ is the z value providing an area of $\alpha/2$ in the upper tail of the standard normal probability distribution
$\sigma$ is the population standard deviation
n is the sample size
Calculations
SE and 95% CI of a Mean
To what degree does the sample mean vary from the population mean?
$SE = \frac{SD}{\sqrt{n}}$
95% CI = $\bar{x} \pm 1.96 \times SE$
Interval Estimate of a Population Mean: Large-Sample Case ( n > 30)
With $\sigma$ Unknown
In most applications the value of the population standard deviation is unknown. We simply use the value of the sample standard deviation, s, as the point estimate of the population standard deviation.
$\bar{x} \pm z_{\alpha/2} \frac{s}{\sqrt{n}}$
Example
Sampling can be used to develop an interval estimate of the mean annual income for individuals in a potential marketing area for National Discount. A sample of size n = 36 was taken. The sample mean is $\bar{x}$ and the sample standard deviation, s, is $4,500. We will use .95 as the confidence coefficient in our interval estimate.
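Example
Precision Statement
There is a .95 probability that the value of a sample mean for National Discount will provide a sampling error of $1,470 or less, determined as follows:
95% of the sample means that can be observed are within $\pm 1.96\,\sigma_{\bar{x}}$ of the population mean $\mu$.
If $\sigma_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{4{,}500}{\sqrt{36}} = 750$, then $1.96\,\sigma_{\bar{x}} = 1{,}470$.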
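The standard error and the sampling-error bound in this example can be recomputed as follows. This is a sketch: the function name `margin_of_error` is an assumption, and the z value is obtained from `statistics.NormalDist` rather than rounded to 1.96, so the margin differs from 1,470 only by rounding.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(sd, n, confidence=0.95):
    """z_{alpha/2} * sd / sqrt(n) for a large-sample interval estimate."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for 95% confidence
    return z * sd / sqrt(n)


s = 4500   # sample standard deviation ($) from the example
n = 36     # sample size from the example

standard_error = s / sqrt(n)
margin = margin_of_error(s, n)

print(f"standard error = {standard_error:.0f}")
print(f"margin of error at 95% confidence = {margin:.0f}")
# The interval estimate is then (sample mean - margin, sample mean + margin).
```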
Interval Estimation of a Population Mean: Small-Sample Case ( n < 30)
Population is Not Normally Distributed
The only option is to increase the sample size to n > 30 and use the large-sample interval-estimation procedures.
Population is Normally Distributed and $\sigma$ is Known
The large-sample interval-estimation procedure can be used.
Population is Normally Distributed and $\sigma$ is Unknown
The appropriate interval estimate is based on a probability distribution known as the t distribution.
t Distribution
The t distribution is a family of similar probability distributions.
A specific t distribution depends on a parameter known as the degrees of freedom.
As the number of degrees of freedom increases, the difference between the t distribution and the standard normal probability distribution becomes smaller and smaller.
A t distribution with more degrees of freedom has less dispersion.
The mean of the t distribution is zero.
t table
Interval Estimation of a Population Mean: Small-Sample Case (n < 30) with $\sigma$ Unknown
Interval Estimate
$\bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}}$
where $1 - \alpha$ = the confidence coefficient
$t_{\alpha/2}$ = the t value providing an area of $\alpha/2$ in the upper tail of a t distribution with n - 1 degrees of freedom
s = the sample standard deviation
Example: Apartment Rents
Interval Estimation of a Population Mean: Small-Sample Case (n < 30) with $\sigma$ Unknown
A reporter for a student newspaper is writing an article on the cost of off-campus housing. A sample of 10 one-bedroom units within a half-mile of campus resulted in a sample mean of $550 per month and a sample standard deviation of $60.
Let us provide a 95% confidence interval estimate of the mean rent per month for the population of one-bedroom units within a half-mile of campus. We’ll assume this population to be normally distributed.
t Value
Example: Apartment Rents
At 95% confidence, $1 - \alpha = .95$, $\alpha = .05$, and $\alpha/2 = .025$.
$t_{.025}$ is based on n - 1 = 10 - 1 = 9 degrees of freedom.
In the t distribution table we see that $t_{.025} = 2.262$.
                      Area in Upper Tail
Degrees of Freedom    .10      .05      .025     .01      .005
.                     .        .        .        .        .
7                     1.415    1.895    2.365    2.998    3.499
8                     1.397    1.860    2.306    2.896    3.355
9                     1.383    1.833    2.262    2.821    3.250
10                    1.372    1.812    2.228    2.764    3.169
.                     .        .        .        .        .
Example: Apartment Rents
Interval Estimation of a Population Mean: Small-Sample Case (n < 30) with $\sigma$ Unknown
$\bar{x} \pm t_{.025} \frac{s}{\sqrt{n}}$
$550 \pm 2.262 \frac{60}{\sqrt{10}}$
$550 \pm 42.92$
or $507.08 to $592.92
We are 95% confident that the mean rent per month for the population of one-bedroom units within a half-mile of campus is between $507.08 and $592.92.
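The apartment rents interval can be reproduced with a short function. Python's standard library has no t quantile function, so this sketch takes the t value (2.262 for 9 degrees of freedom) from the table above as an input; the function name `t_interval` is an assumption.

```python
from math import sqrt


def t_interval(sample_mean, sample_sd, n, t_value):
    """Interval estimate: sample_mean +/- t_value * sample_sd / sqrt(n)."""
    margin = t_value * sample_sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Values from the example: n = 10 units, mean $550, SD $60, t(.025, 9 df) = 2.262.
lower, upper = t_interval(sample_mean=550, sample_sd=60, n=10, t_value=2.262)
margin = (upper - lower) / 2

print(f"margin = {margin:.2f}")
print(f"95% CI: ${lower:.2f} to ${upper:.2f}")
```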
Interval Estimation of a Population Proportion
Interval Estimate
$\bar{p} \pm z_{\alpha/2} \sqrt{\frac{\bar{p}(1 - \bar{p})}{n}}$
where: $1 - \alpha$ is the confidence coefficient
$z_{\alpha/2}$ is the z value providing an area of $\alpha/2$ in the upper tail of the standard normal probability distribution
$\bar{p}$ is the sample proportion
Example: Dental Caries
n = 500, number with caries = 220
Interval Estimate of a Population Proportion
$$\bar{p} \pm z_{\alpha/2} \sqrt{\frac{\bar{p}(1 - \bar{p})}{n}}$$
where: n = 500, $\bar{p}$ = 220/500 = .44, $z_{\alpha/2}$ = 1.96
$$.44 \pm 1.96 \sqrt{\frac{.44(1 - .44)}{500}} = .44 \pm .0435$$
We are 95% confident that the proportion of children with dental caries is between .3965 and .4835.
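The dental caries calculation can be reproduced with a short sketch. The function name `proportion_interval` is an assumption for illustration. The z value is taken from the standard library's `statistics.NormalDist` rather than the rounded 1.96, which does not change the result at four decimals.

```python
import math
from statistics import NormalDist

def proportion_interval(count, n, confidence=0.95):
    """Interval for a population proportion: p_bar plus or minus z * sqrt(p_bar(1 - p_bar)/n)."""
    p_bar = count / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 at 95%
    margin = z * math.sqrt(p_bar * (1 - p_bar) / n)
    return p_bar - margin, p_bar + margin

# Dental caries: 220 of 500 children.
lower, upper = proportion_interval(220, 500)
print(f"{lower:.4f} to {upper:.4f}")  # 0.3965 to 0.4835
```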
Comparisons Involving Means
Estimation of the Difference Between the Means of Two Populations: Independent Samples
Hypothesis Tests about the Difference between the Means of Two Populations: Independent Samples
Inferences about the Difference between the Means of Two Populations: Matched Samples
Inferences about the Difference between the Proportions of Two Populations
Estimation of the Difference Between the Means of Two Populations: Independent Samples
Point Estimator of the Difference between the Means of Two Populations
Sampling Distribution of $\bar{x}_1 - \bar{x}_2$
Interval Estimate of $\mu_1 - \mu_2$: Large-Sample Case
Interval Estimate of $\mu_1 - \mu_2$: Small-Sample Case
Point Estimator of the Difference Between the Means of Two Populations
Let $\mu_1$ equal the mean of population 1 and $\mu_2$ equal the mean of population 2.
The difference between the two population means is $\mu_1 - \mu_2$.
To estimate $\mu_1 - \mu_2$, we will select a simple random sample of size $n_1$ from population 1 and a simple random sample of size $n_2$ from population 2.
Let $\bar{x}_1$ equal the mean of sample 1 and $\bar{x}_2$ equal the mean of sample 2.
The point estimator of the difference between the means of populations 1 and 2 is $\bar{x}_1 - \bar{x}_2$.
Sampling Distribution of $\bar{x}_1 - \bar{x}_2$
Standard Deviation
$$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
where: $\sigma_1$ = standard deviation of population 1
$\sigma_2$ = standard deviation of population 2
$n_1$ = sample size from population 1
$n_2$ = sample size from population 2
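Interval Estimate of $\mu_1 - \mu_2$: Large-Sample Case ($n_1$ > 30 and $n_2$ > 30)
Interval Estimate with $\sigma_1$ and $\sigma_2$ Known
$$\bar{x}_1 - \bar{x}_2 \pm z_{\alpha/2} \, \sigma_{\bar{x}_1 - \bar{x}_2}$$
where: $1 - \alpha$ is the confidence coefficient
Interval Estimate with $\sigma_1$ and $\sigma_2$ Unknown
$$\bar{x}_1 - \bar{x}_2 \pm z_{\alpha/2} \, s_{\bar{x}_1 - \bar{x}_2}$$
where:
$$s_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$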
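The large-sample interval with unknown population standard deviations can be sketched as follows. The function name and the numbers in the usage example are hypothetical and chosen only for illustration; they are not taken from any example in these slides.

```python
import math
from statistics import NormalDist

def diff_means_interval(x1, s1, n1, x2, s2, n2, confidence=0.95):
    """Large-sample interval for mu1 - mu2 using the sample standard deviations."""
    point_estimate = x1 - x2
    std_error = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 at 95%
    margin = z * std_error
    return point_estimate - margin, point_estimate + margin

# Hypothetical data: two samples of 100, means 50 and 45, both with s = 10.
lower, upper = diff_means_interval(50, 10, 100, 45, 10, 100)
print(f"{lower:.2f} to {upper:.2f}")  # 2.23 to 7.77
```

Because this hypothetical interval does not contain zero, it would suggest that the two population means differ at the 95% confidence level.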
Example: Par, Inc.
Interval Estimate of $\mu_1 - \mu_2$: Large-Sample Case
Par, Inc. is a manufacturer of golf equipment and has developed a new golf ball that has been designed to provide “extra distance.” In a test of driving distance using a mechanical driving device, a sample of Par golf balls was compared with a sample of golf balls made by Rap, Ltd., a competitor. The sample statistics appear on the next slide.
Example: Par, Inc.
Interval Estimate of $\mu_1 - \mu_2$: Large-Sample Case
Sample Statistics
Sample #1 (Par, Inc.): sample size $n_1$ = 120 balls, mean $\bar{x}_1$ = 235 yards, standard deviation $s_1$ = 15 yards
Sample #2 (Rap, Ltd.): sample size $n_2$ = 80 balls
s
2 = 218 yards
x