The Role of Statistics and the Data Analysis Process

AP STATS
CHAPTER 1
The Data Analysis Process
Step 1
• Acknowledging Variability
• Collecting Data Sensibly
Step 2
• Describing Variability in the Data
• Descriptive Statistics
Step 3
• Drawing Conclusions in a Way That Recognizes Variability in the Data
• Probability Supports the Conclusion
Statistics (1)
 The scientific discipline that provides methods to
help make sense of data.
 Suspicion: Extreme skeptics, usually speaking out of
ignorance, characterize this discipline as a
subcategory of lying.
 Used properly, statistical methods offer a set of
POWERFUL tools for gaining insight into the world
around us.
 Used in business, medicine, agriculture, social
sciences, natural sciences, and applied sciences.
Statistics (2)
 “…teaches us how to make intelligent judgments and
informed decisions in the presence of uncertainty
and variation.”
1.1 Three Reasons to Study Statistics
Be Informed
• Understand news
reports making data-based claims.
• Extract information
from tables and
graphs.
• Follow numerical
arguments.
• Understand the
basics for valid
research designs.
Understand Issues and
Sound Decision
Making Based on Data
• Is existing info
adequate, or do we
need more?
• How to collect
information in a
reasonable and
thoughtful manner.
• Summarize data in a
useful and
informative way.
• Analyze available
data.
• Make conclusions
and decisions, and
assess risk for an
incorrect decision.
Evaluate Decisions
That Affect Your Life
• Other people use
statistical methods to
make decisions that
affect your life.
• Drug screening by
companies, medical
researchers,
university financial
aid, insurance
companies, etc.
• Are the decisions
made by these groups
reached in a reasonable
way?
1.2 The Nature and Role of Variability
 Statistics focuses on collecting, analyzing, and drawing
conclusions from data.
   • If all measurements were identical for every individual, this task
     would be easy.
   • But populations without variability are virtually non-existent.
   • In fact, variability is universal.
 We need to understand variability to be able to collect,
analyze, and draw conclusions from data in a sensible
way.
 The branch called descriptive statistics helps to increase
our understanding of the nature of variability in a
population.
Figure 1.1 Histogram of heights (in inches) of female athletes: (a) basketball
players; (b) gymnasts. Sample size is 100 for both groups (N = 100).
Example 1.1 If the Shoe Fits
 Is the variation in the heights between the two groups
similar?
 Suppose a 5’11” woman is looking for her sister, who is
practicing with her team in the gym. Where would you
direct her? Why?
 What if you found a pair of size 6 shoes left in the locker
room? Where would you try to return them?
 You informally used statistical reasoning that combined
your knowledge of the relationship of height between
siblings and height and shoe size with the information
about height distributions in Fig. 1.1.
Figure 1.2 Frequency of contaminant concentration (in ppm) in well water. Based
on the average of five measurements per day for 200 days (N = 200).
Example 1.2 Monitoring Water Quality
 Suppose a chemical spill occurred at a
manufacturing plant 1 mile from the well.
 One month after the spill, the average contamination
is 15.5 ppm. Would this be convincing evidence that
the well was affected by the spill?
 What if the average was 17.4 ppm? 22.0 ppm?
 In both Examples 1.1 and 1.2, reaching a conclusion
required an understanding of variability. Variability
allows us to distinguish between usual and unusual
values.
1.3 Statistics and the Data Analysis Process
 Conclusions based on data are seen regularly in popular
media and professional and academic publications.
 Decisions are data driven in business, industry, and
government.
 Descriptive statistics – methods for organizing and
summarizing data.
   • The next step in the data analysis process once a data set has been
     collected or an appropriate source identified.
 Inferential statistics – involves generalizing from a sample
to the population, and requires an understanding of the
variation in the population (i.e., descriptive statistics).
   • Also includes assessing the reliability of such generalizations.
   • Because these conclusions are based on incomplete information, there is
     a need to quantify the chance of an incorrect conclusion.
 Population – the entire collection of individuals or
objects about which information is desired.
 Sample – a subset of the population, selected for
study in some prescribed manner.
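As a rough illustration of selecting a sample from a population "in some prescribed manner," the sketch below draws a simple random sample. The population of 500 numbered individuals, the sample size of 25, and the seed are all hypothetical, and simple random sampling is only one possible selection method; the slides do not name a specific one here.

```python
import random

# Hypothetical population: 500 individuals identified by ID numbers 1..500.
population = list(range(1, 501))

# Fixed seed so the illustration is repeatable (an arbitrary choice).
rng = random.Random(1)

# Simple random sample of 25 individuals, drawn without replacement.
sample = rng.sample(population, 25)

print(sorted(sample))
```

Every individual in the sample comes from the population, and no individual appears twice, which is what makes the sample a subset.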
The Data Analysis Process (1)
 Raw data without analysis is of little value; likewise, even
a sophisticated analysis cannot provide meaningful
information from data that were not collected in a
sensible way.
 Data collection and analysis allow researchers to answer
questions about the way systems work.
 Steps in the data analysis process:
1. Understand the nature of the problem.
2. Decide what to measure and how to measure it.
3. Data collection
4. Data summarization and preliminary analysis
5. Formal data analysis
6. Interpretation of results
Example 1.3 A Proposed New Treatment for Alzheimer’s
Disease
 In 2002, eleven patients had shunts implanted into the
brain.
 A comparison group received the standard care for
Alzheimer’s.
 Quarterly tests of memory function for both groups
showed a steady decline in the control group, while
the surgically treated (experimental) group did not decline.
 Study was too small to produce conclusive statistical
evidence, but the preliminary results justified a
larger study to include 256 patients at 25 medical
centers across the country.
The Data Analysis Process (2)
 Evaluating a Research Study (in the popular and
technical press): The six data analysis steps can be used
as a guide.
1. What were the researchers trying to learn? What question
   motivated their research?
2. Was relevant information collected? Were the right things
   measured?
3. Were the data collected in a sensible way?
4. Were the data summarized in an appropriate way?
5. Was an appropriate method of analysis used, given the type of data
   and how the data were collected?
6. Are the conclusions drawn by the researchers supported by the data
   analysis?
Example 1.4 Spray Away the Flu
 Newspaper article reported the results of a study in
which vaccine was administered by nasal spray.
 Results generally look promising, but newspaper
articles typically do not report the details of the
study.
   • How were the data collected?
   • How were the children selected for the study?
   • How was it determined which children received the vaccine
     and which received the placebo?
   • How was subsequent diagnosis of flu made?
   • Not mentioned by the authors of the text: how does this compare to
     vaccine by injection?
1.4 Types of Data and Some Simple Graphical Displays
 Describing Data
 Variable – any characteristic whose value may change from
one individual or object to another.
 Data – results from making observations either on a single
variable or simultaneously on two or more variables.
 Univariate data set – data set consisting of observations on
a single attribute.
 Categorical (or qualitative) – individual responses are
categorical responses.
 Numerical (or quantitative) – observations are numerical.
Example 1.5 Airline Safety Violations
 FAA monitors airlines
 USA Today (March 13, 2000) reported on violations
that could lead to fines from the FAA.
 Violation categories: Security (S), Maintenance (M),
Flight Operations (F), Hazardous Materials (H), or
Other (O).
 Subset of the data for 20 administrative actions:
S S M H M O S M S S F S O M S M S M S M
 What kind of data set is this?
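Each observation in Example 1.5 is a single category label, so a natural first summary is a count per category. The sketch below tallies the 20 codes with Python's standard `collections.Counter`; the variable names are illustrative only.

```python
from collections import Counter

# The 20 violation codes listed in Example 1.5.
codes = "S S M H M O S M S S F S O M S M S M S M".split()

# Category names as given in the example.
labels = {
    "S": "Security",
    "M": "Maintenance",
    "F": "Flight Operations",
    "H": "Hazardous Materials",
    "O": "Other",
}

counts = Counter(codes)

# Print each category with its frequency and relative frequency.
for code, n in counts.most_common():
    print(f"{labels[code]:<20} {n:>2}  {n / len(codes):.2f}")
```

Security (9) and Maintenance (7) account for 16 of the 20 actions in this subset.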
 Describing data (continued)
 Bivariate data set – when a data set consists of two
attributes recorded simultaneously for each individual.
 Multivariate data set – when a data set consists of two or
more attributes recorded for each individual.
Example 1.6 Revisiting Airline Safety Violations
Airline         Number of Violations   Average Fine per Violation (in US dollars)
Alaska                           258   5038.760
America West                     257   3112.840
American                        1745   2693.410
Continental                      973   5755.396
Delta                           1280   3828.125
Northwest                       1097   2643.573
Southwest                        535   3925.234
TWA                              642   2803.738
United                          1110   2612.613
US Airways                       891   3479.237
What type of data set is this?
 Two Types of Numerical Data
 Discrete – a numerical variable in which the possible values
of the variable correspond to isolated points on the number
line.
 Continuous – a numerical variable in which the possible
values of the variable form an entire interval on the number
line.
Example 1.7 Calls to Drug Abuse Hotline
 The number of telephone calls per day to a drug
abuse hotline is recorded for 12 days:
3 0 4 3 1 0 6 2 0 0 1 2
 This data set represents isolated points on a number
line, thus this is a discrete numerical data set.
 Example 1.6 had both types of numerical data.
   • Number of violations (discrete)
   • Average fine per violation (continuous)
 In general, data are continuous when observations
involve making measurements.
 Frequency Distributions and Bar Charts for
Categorical Data
   • A tabular or graphical display can effectively communicate
     information.
   • A common way to present categorical data is in the form of
     a table called a frequency distribution.
 Frequency distribution for categorical data – a table that
displays the possible categories along with the associated
frequencies and/or relative frequencies.
 Frequency – for a particular category, the number of times the
category appears in the data set.
 Relative frequency – for a particular category, the fraction or
proportion of the observations resulting in the category.
relative frequency = frequency / (number of observations in the data set)
If a table includes relative frequencies, it is
sometimes referred to as a relative frequency
distribution.
Example 1.8 Motorcycle Helmets – Can You See Those Ears?
Table 1.1 Frequency distribution of helmet use.
Helmet Use Category    Frequency   Relative Frequency
No Helmet                    731   0.430 (731/1700)
Noncompliant Helmet          153   0.090 (153/1700)
Compliant Helmet             816   0.480
TOTAL                       1700   1.000
The total frequency, 1700, is the total number of observations.
The relative frequencies should sum to 1, but the total may be slightly
off due to rounding.
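The relative frequencies in Table 1.1 can be reproduced by dividing each frequency by the total number of observations. The following is a minimal sketch using the counts from the table; the dictionary and variable names are illustrative.

```python
# Frequencies from Table 1.1.
frequencies = {
    "No Helmet": 731,
    "Noncompliant Helmet": 153,
    "Compliant Helmet": 816,
}

# Total number of observations in the data set.
total = sum(frequencies.values())

# relative frequency = frequency / number of observations
relative = {category: count / total for category, count in frequencies.items()}

for category, value in relative.items():
    print(f"{category:<20} {frequencies[category]:>5}  {value:.3f}")
print(f"{'TOTAL':<20} {total:>5}  {sum(relative.values()):.3f}")
```

With these counts each division is exact to two decimal places (0.43, 0.09, 0.48), so the rounded column sums to 1.000; with other data the rounded values may not.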
 Bar Charts – a graph of the frequency distribution
of categorical data.
   • When to use: Categorical data.
   • How to construct:
      Horizontal line, with category names below the line at regularly
       spaced intervals.
      Vertical line, with the scale labeled using frequency or relative
       frequency.
      Rectangular bar above every category; bars should be the same width,
       with height determined by the category’s frequency.
   • What to look for: Frequently and infrequently occurring
     categories.
Example 1.9 Revisiting Motorcycle Helmets
Figure 1.5 Bar chart of helmet use. (Vertical axis: Frequency, 0 to 900;
horizontal axis: Helmet Use Category, with bars for No Helmet,
Noncompliant Helmet, and Compliant Helmet.)
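A rough text rendering of the helmet-use bar chart can be produced without any plotting library. The sketch below draws the bars horizontally rather than vertically, and the scale of one `#` per 50 observations is an arbitrary choice; the function name `text_bar_chart` is hypothetical.

```python
# Frequencies from Table 1.1.
frequencies = {
    "No Helmet": 731,
    "Noncompliant Helmet": 153,
    "Compliant Helmet": 816,
}

UNIT = 50  # observations represented by one '#' (arbitrary scale)


def text_bar_chart(freqs, unit):
    """Return one line per category; bar length is proportional to frequency."""
    width = max(len(name) for name in freqs)
    lines = []
    for name, count in freqs.items():
        bar = "#" * round(count / unit)
        lines.append(f"{name:<{width}} | {bar} {count}")
    return lines


chart = text_bar_chart(frequencies, UNIT)
print("\n".join(chart))
```

As in the figure, the short middle bar shows at a glance that noncompliant helmets are the infrequent category.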
 Dotplots for Numerical Data
   • A dotplot is a simple way to display numerical data when the data
     set is reasonably small.
   • When to use: Small numerical data sets.
   • How to construct:
      Draw a horizontal line and mark it with an appropriate measurement
       scale.
      Locate each value in the data set along the measurement scale and
       represent it by a dot. If there are two or more observations with the
       same value, stack the dots vertically.
   • What to look for: Dotplots convey information about:
      A representative or typical value in the data set.
      The extent of the spread of the data.
      The nature of the distribution of values along the number line.
      The presence of unusual values in the data set.
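The construction steps for a dotplot can be sketched in a few lines of code. The example below uses the 12 daily call counts from Example 1.7 and prints a text dotplot with stacked `*` symbols; the function name `text_dotplot` is hypothetical, and the column alignment assumes single-digit integer values.

```python
from collections import Counter

# Calls per day to the hotline over 12 days (Example 1.7).
calls = [3, 0, 4, 3, 1, 0, 6, 2, 0, 0, 1, 2]


def text_dotplot(values):
    """Return the rows of a text dotplot, top row first, axis labels last."""
    counts = Counter(values)
    low, high = min(values), max(values)
    rows = []
    # Build the stacks from the tallest level down to the axis.
    for level in range(max(counts.values()), 0, -1):
        row = " ".join(
            "*" if counts[v] >= level else " " for v in range(low, high + 1)
        )
        rows.append(row.rstrip())
    # Measurement scale along the horizontal axis.
    rows.append(" ".join(str(v) for v in range(low, high + 1)))
    return rows


rows = text_dotplot(calls)
print("\n".join(rows))
```

The tallest stack sits at 0 calls, the values cluster between 0 and 4, and the single day with 6 calls stands apart from the rest.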

Example 1.10 Graduation Rates for NCAA Division I Schools
in CA and TX
 From The Chronicle of Higher Education, Aug. 31,
2001
 Reported graduation rates as percentages of full-time
freshmen in fall 1993 who earned bachelor’s degrees
by Aug. 1999.
California          Texas
64 41 44 31 37      67 21 32 88 35
73 72 68 35 37      71 39 35 71 63
81 90 82 74 79      12 46 35 39 28
67 66 66 70 63      65 25 24 22
Figure 1.6 Minitab dotplot of graduation rates.
Figure 1.7 Minitab dotplot of graduation rates for California
and Texas.
