Lect 1 - Department of Engineering and Physics


Chapter 1
Lecture Slides
1
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Chapter 1:
Summarizing Univariate Data
2
Introduction
• How can one draw conclusions from the
results of an experiment when those results
could have come out differently?
– Example: Normal blood sugar levels should be
between 70 and 120 mg/dL
• A knowledge of statistics is essential to
address this question.
3
Quality Control: Example 1
Consider a machine that makes steel balls for ball
bearings used in clutch systems. The specification for
the diameter of the balls is 0.65 ± 0.03 cm. During the
last hour, the machine has made 2000 balls. The quality
engineer wants to know approximately how many of
these balls meet the specification. He does not have
time to measure all 2000 balls. So, he draws a random
sample of 80 balls, measures them, and finds that 72 of
them (90%) meet the diameter specification. It is
unlikely that the sample of 80 balls represents the
population of 2000 exactly.
4
Section 1.1: Sampling
Definitions:
• A population is the entire collection of objects or
outcomes about which information is sought.
• A sample is a subset of a population, containing the
objects or outcomes that are actually observed.
• A simple random sample (SRS) of size n is a
sample chosen by a method in which each collection
of n population items is equally likely to comprise the
sample, just as in the lottery.
5
SRS Sample: Example 2
Q: A utility company wants to conduct a survey to
measure the satisfaction level of its customers in a
certain town. There are 10,000 customers in the
town, and utility employees want to draw a sample
of size 200 to interview over the telephone. They
obtain a list of all 10,000 customers, and number
them from 1 to 10,000. They use a computer
random number generator to generate 200 random integers
between 1 and 10,000 and then telephone the
customers who correspond to those numbers. Is this
a simple random sample?
A: Yes, this is a simple random sample.
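The selection step in this example can be sketched in Python. This is only an illustration, not the utility's actual procedure: the function name `draw_srs` and the fixed seed are assumptions made here. The sketch uses `random.sample`, which draws without replacement, so no customer number can be selected twice.

```python
import random

def draw_srs(population_size, sample_size, seed=None):
    """Return a simple random sample of customer numbers 1..population_size."""
    rng = random.Random(seed)  # the seed is optional; it only makes the run repeatable
    # random.sample draws without replacement, so every set of
    # sample_size distinct customer numbers is equally likely.
    return rng.sample(range(1, population_size + 1), sample_size)

# 200 customers out of 10,000, as in the example (seed chosen arbitrarily).
customers_to_call = draw_srs(10_000, 200, seed=1)
print(sorted(customers_to_call)[:10])  # a few of the selected customer numbers
```

Drawing without replacement is a design choice of this sketch; it guarantees 200 distinct customers are called.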
6
Sampling (cont.)
Definition: A sample of convenience is a sample
that is not drawn by a well-defined random
method.
Things to consider with convenience samples:
• They may differ systematically in some way from the
population.
• Use them only when it is not feasible to draw a random
sample.
7
Sample of Convenience: Example 3
A construction engineer has received a shipment
of 1000 concrete blocks, each weighing
approximately 50 pounds. The blocks are in a
large pile. The engineer wishes to investigate
the crushing strength of the blocks by measuring
the strengths in a sample of 10 blocks. It may be
difficult to take an SRS since that would involve
getting blocks from the middle and bottom of the
pile, so the engineer may just take 10 off the top.
This would be a sample of convenience.
8
Simple Random Sampling
• An SRS is not guaranteed to reflect the population
perfectly.
• SRSs always differ in some ways from each other;
occasionally a sample is substantially different from
the population.
• Two different samples from the same population will
vary from each other as well.
• This phenomenon is known as sampling variation.
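Sampling variation can be seen in a small simulation based on the steel ball example. The sketch below assumes a hypothetical population in which exactly 1800 of the 2000 balls meet the specification; the true count is unknown in the example, so this number and the seed are illustrative assumptions only.

```python
import random

# Hypothetical population: 2000 balls, 1800 of which meet the specification.
# (Assumed for illustration; the example does not give the true count.)
population = [True] * 1800 + [False] * 200

rng = random.Random(42)  # fixed seed so the run is repeatable

def sample_proportion(pop, n):
    """Proportion of conforming balls in one simple random sample of size n."""
    sample = rng.sample(pop, n)  # drawn without replacement
    return sum(sample) / n

# Five independent samples of 80 balls each.
proportions = [sample_proportion(population, 80) for _ in range(5)]
print(proportions)  # the proportions generally differ from each other and from 0.90
```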
9
Example 2 cont.
Which sample is a simple random sample?
10
Tangible Population
• Populations that consist of actual physical
objects (customers, blocks, balls) are called
tangible populations.
• Tangible populations are always finite.
• After we sample an item, the population size
decreases by 1.
11
More on SRS
Definition: A conceptual population consists of
items that are not actual objects.
• For example, a geologist weighs a rock several
times on a sensitive scale. Each time, the scale
gives a slightly different reading.
• Here the population is conceptual. It consists
of all the readings that the scale could in
principle produce.
12
SRS (cont.)
• The items in a sample are independent if
knowing the values of some of the items does
not help to predict the values of the others.
• Items in a simple random sample may be
treated as independent in most cases
encountered in practice. The exception occurs
when the population is finite and the sample
comprises a substantial fraction (more than
5%) of the population.
13
Types of Data
• Data are numerical, or quantitative, if a numerical
quantity is assigned to each item in the sample.
– Height
– Weight
– Age
• Data are categorical, or qualitative, if the sample
items are placed into categories.
– Gender
– Hair color
– Zip code
14
Controlled Experiments
• Suppose that a chemical engineer wants to
determine how the concentrations of reagent
and catalyst affect the yield of a process.
• The engineer can run the process several times,
changing the concentrations each time and
comparing the yields that result.
• This sort of experiment is called a controlled
experiment because the values of the factors
are under the control of the experimenter.
15
Observational Studies
• There are many situations in which scientists cannot
control the levels of the factors.
• Many studies have been conducted to determine the
effect of cigarette smoking on the risk of lung cancer.
• In these studies, rates of cancer among smokers are
compared with rates among nonsmokers.
• The experimenter cannot control who smokes and
who doesn’t.
• This kind of study is called an observational study.
16
Section 1.2: Summary Statistics
• Sample Mean:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
• Sample Variance:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)$$
• Sample standard deviation is the square root of
the sample variance.
17
More on Summary Statistics
• If X1, …, Xn is a sample, and Yi = a + bXi, where a
and b are constants, then
$$\bar{Y} = a + b\bar{X}, \qquad s_y^2 = b^2 s_x^2, \qquad s_y = |b|\, s_x.$$
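These relationships can be checked numerically. The sketch below uses a made-up sample and made-up constants a and b (both are assumptions for illustration) and compares each side of the three identities with Python's `statistics` module, whose `variance` and `stdev` use the n - 1 divisor.

```python
import math
import statistics

x = [2.0, 4.0, 7.0, 11.0]   # hypothetical sample
a, b = 3.0, -2.0            # hypothetical constants; b is negative on purpose
y = [a + b * xi for xi in x]

# Mean: Y-bar = a + b * X-bar
mean_ok = math.isclose(statistics.mean(y), a + b * statistics.mean(x))
# Variance: s_y^2 = b^2 * s_x^2
var_ok = math.isclose(statistics.variance(y), b ** 2 * statistics.variance(x))
# Standard deviation: s_y = |b| * s_x (the absolute value matters when b < 0)
sd_ok = math.isclose(statistics.stdev(y), abs(b) * statistics.stdev(x))

print(mean_ok, var_ok, sd_ok)
```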
18
Definition of a Median
The median is another measure of center, like the
mean. To find it, order the sample values from
smallest to largest:
• If n is odd, the sample median is the number in
position $\frac{n+1}{2}$.
• If n is even, the sample median is the average
of the numbers in positions $\frac{n}{2}$ and $\frac{n}{2}+1$.
19
Example 3
A simple random sample of five men is chosen from a
large population of men, and their heights are
measured. The five heights (in cm) are 166.4, 183.6,
173.5, 170.3, and 179.5. Find the sample mean,
sample variance, sample standard deviation, and the
median.
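One way to work this example is sketched below in Python. The variance is computed with both forms of the sample variance formula (sum of squared deviations, and sum of squares minus n times the squared mean, each divided by n - 1) to show that they agree; the variable names are chosen here for illustration.

```python
import math
import statistics

heights = [166.4, 183.6, 173.5, 170.3, 179.5]  # cm, from the example
n = len(heights)

mean = sum(heights) / n
# Definition form: squared deviations from the mean, divided by n - 1.
var_def = sum((x - mean) ** 2 for x in heights) / (n - 1)
# Computational form: (sum of squares - n * mean^2) / (n - 1).
var_alt = (sum(x * x for x in heights) - n * mean ** 2) / (n - 1)
sd = math.sqrt(var_def)
median = statistics.median(heights)  # middle value of the ordered sample

print(f"mean = {mean:.2f} cm")
print(f"variance = {var_def:.3f} cm^2 (computational form: {var_alt:.3f})")
print(f"standard deviation = {sd:.3f} cm")
print(f"median = {median} cm")
```

The results are a mean of 174.66 cm, a variance of about 47.983 cm², a standard deviation of about 6.927 cm, and a median of 173.5 cm.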
20
Quartiles
• The first quartile is the median of the lower
half of the data (include the median in the
lower half of the data if n is odd).
• The third quartile is the median of the upper
half of the data (include the median in the
upper half of the data if n is odd).
21
Definition of Percentile
• The pth percentile of a sample, for a number p
between 0 and 100, divides the sample so that
as nearly as possible p% of the sample values
are less than the pth percentile.
22
To Find Percentiles
Order the sample values from smallest to
largest.
Then compute the quantity (p/100)(n+1),
where n is the sample size.
If this quantity is an integer, the sample value
in this position is the pth percentile.
Otherwise, average the two sample values on
either side.
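The procedure above can be sketched as a small Python function. The function name `percentile`, the error raised when the position falls outside the sample, and the seven-value sample are assumptions made for this illustration; statistical libraries often use other interpolation rules and may give slightly different answers.

```python
def percentile(values, p):
    """pth percentile using the (p/100)(n+1) position rule."""
    ordered = sorted(values)            # order from smallest to largest
    n = len(ordered)
    position = p * (n + 1) / 100        # 1-based position in the ordered sample
    if position < 1 or position > n:
        # The rule does not say what to do here; this sketch refuses to guess.
        raise ValueError("percentile position falls outside the sample")
    if position.is_integer():
        return ordered[int(position) - 1]
    lower = int(position)               # 1-based position just below the target
    # Average the two sample values on either side of the position.
    return (ordered[lower - 1] + ordered[lower]) / 2

sample = [12, 7, 3, 9, 15, 21, 5]       # hypothetical sample, n = 7
print(percentile(sample, 50))           # position 4 in the ordered sample
print(percentile(sample, 40))           # position 3.2: average of positions 3 and 4
```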
23
Note on Percentiles
• The first quartile is the 25th percentile.
• The median is the 50th percentile.
• The third quartile is the 75th percentile.
24
Example 4
The following values of fracture stress (in
megapascals) were measured for a sample of 24
mixtures of hot-mixed asphalt: 30 75 79 80 80
105 126 138 149 179 179 191 223 232 232
236 240 242 245 247 254 274 384 470.
• What is the mean of these data?
• What is the median?
• What is the first quartile?
• What is the third quartile?
• What is the 65th percentile?
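A possible way to answer these questions is sketched below, treating the median and quartiles as the 50th, 25th, and 75th percentiles and applying the (p/100)(n+1) position rule. The helper `pct` is a hypothetical name and omits range checks, since every position needed here lies inside the sample.

```python
import statistics

stress = [30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 179, 191,
          223, 232, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470]

def pct(sorted_values, p):
    """Position rule (p/100)(n+1); averages the two neighbors if not an integer."""
    pos = p * (len(sorted_values) + 1) / 100
    k = int(pos)
    if pos == k:
        return sorted_values[k - 1]
    return (sorted_values[k - 1] + sorted_values[k]) / 2

data = sorted(stress)
answers = {
    "mean": statistics.mean(data),
    "median": pct(data, 50),
    "first quartile": pct(data, 25),
    "third quartile": pct(data, 75),
    "65th percentile": pct(data, 65),
}
for name, value in answers.items():
    print(f"{name}: {value:.2f}")
```

With n = 24 the positions are 12.5, 6.25, 18.75, and 16.25, so each answer is an average of two neighboring values: a median of 207, a first quartile of 115.5, a third quartile of 243.5, and a 65th percentile of 238. The mean is about 195.42.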
25
Section 1.3: Graphical Summaries
•
•
•
•
Stem-and-leaf plot
Dotplot
Histogram
Boxplot
26
Stem-and-leaf Plot
• A simple way to summarize a data set.
• Each item in the sample is divided into two
parts: a stem, consisting of the leftmost one or
two digits, and the leaf, which consists of the
next digits.
• It is a compact way to represent the data.
• It also gives us some indication of the shape of
our data.
27
Example 5
• Amount of Drug in Skin
• Stem-and-leaf plot:
0 34477899
1 22566778
2 001122234566667
3 34456678
4 0011
5 1355
6
7 4
Let’s look at the first line of the stem-and-leaf plot. This represents measurements of 3,
4, 4, 7, 7, 8, 9, and 9 minutes.
A good feature of these plots is that they display all the sample values. One can
reconstruct the data in its entirety from a stem-and-leaf plot.
28
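The claim that the data can be recovered from a stem-and-leaf plot can be illustrated with a short Python sketch. It reads the rows of the plot shown in Example 5, rebuilds the 48 values, and then regenerates the plot from those values. The function name and the assumption of non-negative integers with a one-digit leaf are choices made for this sketch.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return plot rows: the tens part is the stem, the ones digit is the leaf."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(str(leaf))
    rows = []
    # Include stems with no leaves so gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(leaves.get(stem, []))
        rows.append(f"{stem} {digits}".rstrip())
    return rows

# Rows of the plot as shown on the slide.
slide_rows = ["0 34477899", "1 22566778", "2 001122234566667",
              "3 34456678", "4 0011", "5 1355", "6", "7 4"]

# Reconstruct the sample values from the plot.
data = []
for row in slide_rows:
    stem, _, digits = row.partition(" ")
    data.extend(10 * int(stem) + int(d) for d in digits)

rebuilt = stem_and_leaf(data)
print("\n".join(rebuilt))
```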
Dotplot
• A dotplot is a graph that can be used to give a rough
impression of the shape of a sample.
• It is useful when the sample size is not too large and when the
sample contains some repeated values.
• Along with the stem-and-leaf plot, it is a good method for
informally examining a sample.
• Not generally used in formal presentations.
[Figure: Dotplot for HiAltitude; horizontal axis: HiAltitude]
29
Histogram
• Graphical display that gives an idea of the
shape of the sample.
• We want a reasonable number of observations
in each interval.
• The bars of the histogram touch each other. A
space indicates that there are no observations
in that interval.
30
Creating a Histogram
• Determine the number of classes to use, and construct
class intervals of equal width.
• Compute the frequency and relative frequency for
each class.
• Draw a rectangle for each class. The heights of the
rectangles may be set equal to the frequencies or to
the relative frequencies.
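The first two steps can be sketched in Python using the fracture stress data from Example 4. The choice of five classes of width 100 starting at 0, the left-closed intervals, and the function name `frequency_table` are assumptions made for this illustration; other class choices are equally valid.

```python
stress = [30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 179, 191,
          223, 232, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470]

def frequency_table(values, start, width, n_classes):
    """Count values in equal-width classes [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_classes
    for v in values:
        k = int((v - start) // width)   # index of the class containing v
        if not 0 <= k < n_classes:
            raise ValueError(f"{v} falls outside the class intervals")
        counts[k] += 1
    n = len(values)
    # Each row: lower limit, upper limit, frequency, relative frequency.
    return [(start + k * width, start + (k + 1) * width, c, c / n)
            for k, c in enumerate(counts)]

table = frequency_table(stress, start=0, width=100, n_classes=5)
for lo, hi, freq, rel in table:
    print(f"[{lo}, {hi})  frequency = {freq:2d}  relative frequency = {rel:.3f}")
```

The frequencies or the relative frequencies in the table then give the heights of the rectangles.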
31
Example 6
32
Example 6 cont.
33
Symmetry and Skewness
• A histogram is perfectly symmetric if its right half is a mirror
image of its left half.
– Heights of random men
• Histograms that are not symmetric are referred to as skewed.
• A histogram with a long right-hand tail is said to be skewed to
the right, or positively skewed.
– Incomes are right skewed.
• A histogram with a long left-hand tail is said to be skewed to
the left, or negatively skewed.
– Grades on an easy test are left skewed.
34
Symmetry and Skewness
35
Shape of Histogram
• A histogram with only one peak is what we
call unimodal.
• If a histogram has two peaks then we say that
it is bimodal.
• If there are more than two peaks in a
histogram, then it is said to be multimodal.
36
Boxplots
• A boxplot is a graphic that presents the
median, the first and third quartiles, and any
outliers present in the sample.
• The interquartile range (IQR) is the
difference between the third and first quartiles.
This is the distance needed to span the middle
half of the data.
37
Boxplots
38
Creating a Boxplot
• Compute the median and the first and third quartiles of the
sample. Indicate these with horizontal lines. Draw vertical
lines to complete the box.
• Find the largest sample value that is no more than 1.5 IQR
above the third quartile, and the smallest sample value that
is no more than 1.5 IQR below the first quartile. Extend
vertical lines (whiskers) from the quartile lines to these
points.
• Points more than 1.5 IQR above the third quartile, or more
than 1.5 IQR below the first quartile, are designated as
outliers. Plot each outlier individually.
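The numbers needed for a boxplot can be computed as in the Python sketch below, shown here on the fracture stress data from Example 4. The quartiles are taken as medians of the lower and upper halves of the ordered data, with the median included in both halves when n is odd. The function name and the returned dictionary layout are assumptions made for this sketch, and no drawing is attempted.

```python
import statistics

def boxplot_summary(values):
    """Quartiles, whisker ends, and outliers under the 1.5 IQR rule."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2                 # an odd n puts the median in both halves
    q1 = statistics.median(data[:half])
    q3 = statistics.median(data[n - half:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Values within the fences; the whiskers end at the extremes of these.
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "median": statistics.median(data),
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "whiskers": (inside[0], inside[-1]),
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }

stress = [30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 179, 191,
          223, 232, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470]
summary = boxplot_summary(stress)
print(summary)
```

For these data the IQR is 128, so any value above 243.5 + 192 = 435.5 is an outlier. Only 470 qualifies, and the upper whisker ends at 384.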
39
Example 5 cont.
• Notice there are no outliers in these data.
• Looking at the four pieces of the boxplot, we can tell that the sample
values are comparatively densely packed between the median and the
third quartile.
• The lower whisker is a bit longer than the upper one, indicating that
the data have a slightly longer lower tail than upper tail.
• The distance between the first quartile and the median is greater than
the distance between the median and the third quartile.
• This boxplot suggests that the data are skewed to the left.
40
Comparative Boxplots
• Sometimes we want to compare two or more
samples.
• We can place the boxplots of the samples
side-by-side.
• This will allow us to compare how the medians
differ between samples, as well as the first and
third quartiles.
• It also tells us about the difference in spread
between the samples.
41
Example 7
42
Example 7 cont.
43
Summary
• Types of data
• Sampling
• Summary Statistics
• Graphical displays of data
HW1: 1.1 (4, 6); 1.2 (2, 10, 12, 16); 1.3 (5, 7, 9, 12, 14); Ch (8, 13)
44