Proposition 1.1 De Morgan’s Laws


Chapter 1: Looking at Data - Distributions:
http://anengineersaspect.blogspot.com/2013_05_01_archive.html
1
What is Statistics?
http://vadlo.com/cartoons.php?id=71
2
What is Statistics
• Statistics is the science of learning from data.
• Components
– Collection
– Organization
– Analysis
– Interpretation
3
Applications of Statistics
• Computer Science
client-server performance
image processing
• Chemistry/Physics
determining outliers in your data
linear regression
propagation of error
dealing with large populations and approximations
• Engineering
is one process/technique better than another one?
• Business
Making good decisions
• Everyday life
Medical information
Average cell phone usage of Purdue students
4
Branches of Statistics
• Collection of data
• Descriptive Statistics
• Inferential Statistics
5
1.1: Data: Goals
• Give examples of cases in a data set.
• Identify the variables in a data set.
• Demonstrate how a label can be used as a
variable in a data set.
• Identify the values of a variable.
• Classify variables as categorical or quantitative.
Information on Histograms:
Slides: 15 – 19, Book: pp. 15 – 20.
6
Basic Definitions
• Cases
– objects that are described by the data
• Label
– special variable used to separate the cases
• Variable
– characteristic of a case
7
Types of Variables
• Number
– univariate
– bivariate
– multivariate
• Type
– Categorical
– Quantitative
• Distribution of a variable
– The values that the variable takes and how
often it takes each value
8
To better understand a data set, ask:
• Who?
•What cases do the data describe?
•How many cases?
• What?
•How many variables?
•What is the exact definition of each variable?
•What is the unit of measurement for each variable?
• Why?
•What is the purpose of the data?
•What questions are being asked?
•Are the variables suitable?
9
1.2: Displaying Distributions with Graphs:
Goals
• Analyze the distribution of a categorical variable:
– Bar Graphs
– Pie Charts
• Analyze the distribution of a quantitative variable:
– Histogram
– Time plots
– Identify the shape, center, and spread
– Identify and describe any outliers
10
Categorical Variables - Display
The distribution of a categorical variable lists the
categories and gives the count or percent or
frequency of individuals who fall into each
category.
• Pie charts show the distribution of a
categorical variable as a “pie” whose slices are
sized by the counts or percents for the
categories.
• Bar graphs represent categories as bars
whose heights show the category counts or
percents.
11
Categorical Variables – Display (STAT 311)
[Figure: pie chart and two bar graphs of the grade distribution (Grade vs. Percent): A 20%, B 10%, C 40%, D 25%, F 5%. One bar graph orders the grades A through F; the other orders them C, D, A, B, F.]
12
Quantitative Variable: Histograms
Histograms show the distribution of a
quantitative variable by using bars. The height
of a bar represents the number of individuals
whose values fall within the corresponding class.
Procedure - discrete
1. Calculate the frequency and/or relative
frequency of each x value.
2. Mark the possible x values on the x-axis.
3. Above each value, draw a rectangle whose
height is the frequency (or relative
frequency) of that value.
15
Histogram - Discrete
100 married couples between 30 and 40 years of
age are studied to see how many children each
couple has. The table below is the frequency
table of this data set.
Kids     # of Couples   Rel. Freq
0        11             0.11
1        22             0.22
2        24             0.24
3        30             0.30
4        11             0.11
5        1              0.01
6        0              0.00
7        1              0.01
Total    100            1.00
16
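The discrete procedure can be sketched in Python with the couples data from this slide. This is a minimal illustration rather than part of the original slides; the names `counts` and `rel_freq` are made up for the example, and the bars are drawn as text instead of a real plot.

```python
# Frequency table from the slide: number of children -> number of couples.
counts = {0: 11, 1: 22, 2: 24, 3: 30, 4: 11, 5: 1, 6: 0, 7: 1}

n = sum(counts.values())  # total number of couples

# Step 1: relative frequency of each x value.
rel_freq = {kids: c / n for kids, c in counts.items()}

# Steps 2-3: one bar per x value; each '#' stands for one couple.
for kids in sorted(counts):
    print(f"{kids}  {'#' * counts[kids]:<30}  {rel_freq[kids]:.2f}")
```

The relative frequencies must sum to 1, which is a quick check that no case was dropped.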
Quantitative Variable: Histograms - Continuous
Procedure - continuous
1. Divide the x-axis into a number of class
intervals or classes such that each
observation falls into exactly one interval.
2. Calculate the frequency or relative frequency
for each interval.
3. Above each interval, draw a rectangle whose
height is the frequency (or relative
frequency) of that interval.
17
Visual Display: Continuous Histogram
Power companies need information about customer
usage to obtain accurate forecasts of demand.
Investigators from Wisconsin Power and Light
determined the energy consumption (BTUs) during
a particular period for a sample of 90 gas-heated
homes. An adjusted consumption value was
calculated via
\[
\text{adj consumption} = \frac{\text{consumption}}{(\text{weather, degree days})(\text{house area})}
\]
The data is listed under furnace.txt under extra files
on the computer web page.
18
Example (cont)
[Figure: histograms of the adjusted consumption data for several bin widths.]
Bin = 0.25: 63 classes
Bin = 0.5: 32 classes
Bin = 1: 17 classes
Bin = 3: 7 classes
Bin = 5: 4 classes
19
Examining Distributions
In any graph of data, look for the overall pattern
and for striking deviations from that pattern.
• You can describe the overall pattern by its
shape, center, and spread.
• An important kind of deviation is an outlier, an
individual that falls outside the overall
pattern.
20
Shapes of Histograms - Number
Symmetric
unimodal
bimodal
multimodal
http://www.particleandfibretoxicology.com/content/6/1/6/figure/F1?highres=y
21
Shapes of Histograms (cont)
Symmetric
Positively skewed
Negatively skewed
22
Shapes of Histograms (cont)
23
Outliers
http://ewencp.org/blog/url-reshorteners/
24
Time Plots
A time plot shows behavior over time.
• Time is always on the x-axis; the other variable
is on the y-axis
• Look for a trend and deviations from the
trend. Connecting the data points by lines may
emphasize this trend.
• Look for patterns that repeat at known regular
intervals.
25
Example: Time Plots
We are interested in the temperature (°F) of
effluent at a sewage treatment plant.
47 54 53 50 46 46 47 50 51 50 51 50 46
52 50 50
a) Plot a histogram of the data.
b) Plot a time plot of the data.
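Both parts can be sketched with the standard library alone. The sketch assumes the 16 readings are listed in the order they were recorded and at equal time spacing, which the slide does not state. The text output stands in for real graphs.

```python
from collections import Counter

# Effluent temperatures (°F), assumed to be in recording order.
temps = [47, 54, 53, 50, 46, 46, 47, 50, 51, 50, 51, 50, 46, 52, 50, 50]

# (a) Histogram: frequency of each distinct temperature.
freq = Counter(temps)
for value in sorted(freq):
    print(f"{value}: {'#' * freq[value]}")

# (b) Time plot: observation index (time) paired with temperature.
time_series = list(enumerate(temps, start=1))
low = min(temps)
for t, value in time_series:
    # Horizontal offset of '*' grows with temperature, one row per time point.
    print(f"t={t:2d} {' ' * (value - low)}*")
```

The histogram discards the ordering, while the time plot keeps it, so only the second display can reveal a trend or a repeating pattern.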
26
Example: Time Plots (cont)
27
1.3: Describing Distributions with
Numbers: Goals
• Describe the center of a distribution by:
– mean
– median
• Compare the mean and median
• Describe the measure of spread:
– quartiles
– standard deviation
• Describe a distribution by a boxplot (five-number
summary and outliers)
• Be able to determine which summary statistics are
appropriate for a given situation
• Be able to determine the effects of a linear
transformation on the above summary statistics.
28
Sample Mean
\[
\bar{x} = \frac{\text{sum of observations}}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i
\]
29
Sample Mean: Example
The following data give the time in months from hire
to promotion to manager for a random sample of
20 software engineers from all software engineers
employed by a large telecommunications firm.
a) What is the mean time for this sample?
5 7 12 14 18 14 14 22 21 25
23 24 34 37 34 49 64 47 67 69
b) Suppose that instead of x_20 = 69, we had chosen
another engineer who took 483 months to be
promoted. What is the mean time for this new
sample?
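A short Python sketch of both parts, using the 20 promotion times from this slide. The helper name `sample_mean` is illustrative.

```python
# Months from hire to promotion for the 20 sampled engineers.
months = [5, 7, 12, 14, 18, 14, 14, 22, 21, 25,
          23, 24, 34, 37, 34, 49, 64, 47, 67, 69]

def sample_mean(xs):
    """Sum of the observations divided by the number of observations."""
    return sum(xs) / len(xs)

# Part (a): mean of the original sample.
mean_a = sample_mean(months)

# Part (b): replace the last observation (69) with 483.
months_b = months[:-1] + [483]
mean_b = sample_mean(months_b)

print(mean_a, mean_b)
```

The observations sum to 600, so the mean in (a) is 30 months. Swapping in the single value 483 raises the sum to 1014 and the mean to 50.7 months, which shows that the mean is not resistant to extreme values.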
30
Sample Median, M or x̃
Procedure
1. Sort n observations from smallest to largest
2. If n is odd, x̃ is the center
If n is even, x̃ is the average of the two center
observations
31
Sample Median: Example
The following data give the time in months from hire
to promotion to manager for a random sample of
20 software engineers from all software engineers
employed by a large telecommunications firm.
a) What is the median time for this sample?
5 7 12 14 14 14 18 21 22 23
24 25 34 34 37 47 49 64 67 69
b) Suppose that instead of x_20 = 69, we had chosen
another engineer who took 483 months to be
promoted. What is the median time for this new
sample?
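The median procedure translates directly into a few lines of Python. The function name `sample_median` is illustrative.

```python
def sample_median(xs):
    """Middle value of the sorted data, or the average of the two middle values."""
    s = sorted(xs)              # step 1: sort from smallest to largest
    n = len(s)
    mid = n // 2
    if n % 2 == 1:              # odd n: the single center observation
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2   # even n: average of the two centers

months = [5, 7, 12, 14, 14, 14, 18, 21, 22, 23,
          24, 25, 34, 34, 37, 47, 49, 64, 67, 69]

median_a = sample_median(months)                  # part (a)
median_b = sample_median(months[:-1] + [483])     # part (b): 69 replaced by 483

print(median_a, median_b)
```

With n = 20 the two center observations are 23 and 24, so the median is 23.5 months in both parts. Replacing the largest value changes nothing, because the median depends only on the middle of the ordered data.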
32
Mean and Median
[Figure: density curves showing the relative positions of the mean and median for left-skewed, symmetric, and right-skewed distributions.]
33
Variability of Data
[Figure: dot plots of three data sets on a common axis from -20 to 20.]
Set 1: -15 -10 -5 0 5 10 15
Set 2: -15 -5 -1 0 1 5 15
Set 3: -3 -2 -1 0 1 2 3
34
Quartiles
Q1
Q2
Q3
35
Quartiles
Procedure
1. Sort the values from lowest to highest and
locate the median.
2. The first Quartile, Q1 is the median of the
lower half.
3. The third quartile, Q3 is the median of the
upper half.
36
Quartiles: Example
The following data give the time in months from
hire to promotion to manager for a random
sample of 19 software engineers from all
software engineers employed by a large
telecommunications firm.
24 7 12 14 14 14 18 21 22 23
25 34 34 37 47 49 64 100 150
a) Find the median and the quartiles.
b) What is the Interquartile Range?
c) Are there any outliers in this data set?
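A sketch of parts (a) through (c) in Python. Two assumptions are not stated on the slides: when n is odd, the median is left out of both halves, and an observation is flagged as an outlier when it lies more than 1.5 × IQR below Q1 or above Q3. Function and variable names are illustrative.

```python
def median(sorted_xs):
    """Median of an already sorted list."""
    n = len(sorted_xs)
    mid = n // 2
    if n % 2 == 1:
        return sorted_xs[mid]
    return (sorted_xs[mid - 1] + sorted_xs[mid]) / 2

def quartiles(xs):
    """Return (Q1, median, Q3) using the median-of-halves procedure."""
    s = sorted(xs)
    n = len(s)
    half = n // 2
    lower = s[:half]              # observations below the median position
    upper = s[half + (n % 2):]    # observations above it (median skipped if n is odd)
    return median(lower), median(s), median(upper)

months = [24, 7, 12, 14, 14, 14, 18, 21, 22, 23,
          25, 34, 34, 37, 47, 49, 64, 100, 150]

q1, q2, q3 = quartiles(months)
iqr = q3 - q1

# Assumed 1.5 x IQR rule for suspected outliers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in sorted(months) if x < low_fence or x > high_fence]

print(q1, q2, q3, iqr, outliers)
```

Under these assumptions the sketch gives Q1 = 14, median = 24, Q3 = 47, and IQR = 33. The fences are -35.5 and 96.5, so 100 and 150 are flagged.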
37
Boxplots
Procedure
1. Draw and label a number line that includes
the range of the distribution.
2. Draw a central box from Q1 to Q3.
3. Draw a line for the median.
4. Extend lines (whiskers) from the box to the
minimum and maximum values that are not
outliers.
5. Put in dots (* or some symbol) for the
outliers
38
Boxplot: Example
[Figure: boxplot of Promotion (months); vertical axis from 0 to 160.]
39
Side-by-side Boxplot: Example
40
Sample Standard Deviation
\[
\text{variance} = s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
\]
\[
\text{standard deviation} = s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}
\]
41
Properties of Standard Deviation
• s measures spread about the mean so only
use this measure when you are using the
mean to measure the center.
• s = 0 only when all of the observations are
the same; otherwise s > 0
• s is not resistant to outliers
• s has the same units of measurement as the
original observations
42
Sample Standard Deviation: Example
The following data give the time in months
from hire to promotion to manager for a
random sample of 20 software engineers
from all software engineers employed by a
large telecommunications firm.
What is the standard deviation time for this
sample?
5 7 12 14 18 14 14 22 21 25
23 24 34 37 34 49 64 47 67 69
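The calculation can be sketched step by step in Python, following the n − 1 definition of the sample variance. Variable names are illustrative.

```python
import math

# Months from hire to promotion for the 20 sampled engineers.
months = [5, 7, 12, 14, 18, 14, 14, 22, 21, 25,
          23, 24, 34, 37, 34, 49, 64, 47, 67, 69]

n = len(months)
mean = sum(months) / n

# Sum of squared deviations from the mean.
sum_sq_dev = sum((x - mean) ** 2 for x in months)

variance = sum_sq_dev / (n - 1)   # divide by n - 1, not n
sd = math.sqrt(variance)

print(round(variance, 2), round(sd, 2))
```

Here the mean is 30, the squared deviations sum to 7422, and the variance is 7422 / 19. The standard deviation comes out at about 19.76 months, in the same units as the data.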
43
Choosing Measures of Center and
Spread
Choices
1. Mean and standard deviation
2. Median and IQR
ALWAYS PLOT YOUR DATA!
http://freshspectrum.com/wp-content/uploads/2012/09/
Hans-Rosling-Bubble-Plot-Cartoon.jpg
44
Change of Measurement
• Linear transformation: x_new = a + bx
• Effects
1. No change to shape
2. Adding a: adds a to measures of center;
does not affect measures of spread
3. Multiplying by b: multiplies measures of
center by b and measures of spread (s, IQR)
by |b|.
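These effects can be checked numerically. The sketch below reuses the effluent temperatures from the time plot example and converts °F to °C, a linear transformation with a = −160/9 and b = 5/9. The choice of data and transformation is only for illustration.

```python
from statistics import mean, median, stdev

# Effluent temperatures in °F (from the time plot example).
temps_f = [47, 54, 53, 50, 46, 46, 47, 50, 51, 50, 51, 50, 46, 52, 50, 50]

# °C = (°F - 32) * 5/9, written as x_new = a + b * x.
a, b = -160 / 9, 5 / 9
temps_c = [a + b * x for x in temps_f]

# Measures of center follow the full transformation a + b * (old value).
print(mean(temps_c), a + b * mean(temps_f))
print(median(temps_c), a + b * median(temps_f))

# Measures of spread are only rescaled by |b|; the shift a drops out.
print(stdev(temps_c), abs(b) * stdev(temps_f))
```

Each printed pair agrees up to floating-point rounding.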
45
1.4: Density Curves and Normal
Distributions: Goals
• Be able to state the definition and practical importance of
a density curve.
• State the physical meaning of the measures of center
and spread for density curves.
• Normal distributions
– Be able to sketch the normal distribution.
– Be able to state the importance of the 68–95–99.7 rule.
– Be able to standardize a value.
– Be able to use the Z-table.
– Be able to calculate percentages.
– Be able to calculate percentiles (inverse calculations).
– Be able to determine if a distribution is normal (normal
quantile plots).
46
Exploring Quantitative Data
1. Always plot your data.
2. Look for the overall pattern.
3. Calculate a numeric summary.
4. Sometimes, the overall pattern is regular so
that we can describe it by a specific
methodology.
47
Density Curve
(a)
(b)
(c)
48
Properties of Density Curve
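• The curve y = f(x) is always on or above the x-axis: f(x) ≥ 0.
• The total area under the curve is 1:
\[
\int_{-\infty}^{\infty} f(x)\,dx = 1
\]
• The proportion of observations between a and b is the area under the curve between a and b:
\[
\text{proportion between } a \text{ and } b = \int_{a}^{b} f(x)\,dx
\]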
49
Density Curves – Median and Mean
• The median of a density curve is the equal-areas
point: half of the area under the curve lies to its left.
\[
p = 0.5 = \int_{-\infty}^{\text{median}} f(x)\,dx
\]
• The mean of a density curve is the balance
point.
• If the distribution is symmetric, the median
and mean are the same and are the center of
the curve.
50
Mean
http://isc.temple.edu/economics/notes/descprob/descprob.htm
51
Sample vs. Population
• Terms for samples (actual observations)
– Mean: x̄, median: x̃, standard deviation: s
• Terms for populations (density curves)
– Mean: μ, median: μ̃, standard deviation: σ
52
Normal Distribution
[Cartoon: a visual comparison of normal and paranormal distributions.]
http://stats.stackexchange.com/questions/423/what-is-your-favorite-data-analysis-cartoon
53
Normal Distribution
\[
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\]
where -∞ < μ < ∞, σ > 0
X ~ N(μ, σ)
54
Shapes of Normal Density Curve
http://resources.esri.com/help/9.3/arcgisdesktop/com/gp_toolref
/process_simulations_sensitivity_analysis_and_error_analysis_modeling
/distributions_for_assigning_random_values.htm
55
68-95-99.7 Rule
Empirical Rule
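The three percentages in the rule can be reproduced with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a quick numerical check, not a derivation; the dictionary name `within` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Proportion of a normal distribution within k standard deviations of the mean.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} sd: {p:.4f}")
```

The values are roughly 0.683, 0.954, and 0.997, which the rule rounds to 68%, 95%, and 99.7%. Because standardizing removes μ and σ, the same proportions hold for every normal distribution.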
56
Standard Normal or z curve
\[
f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}
\]
57
Cumulative z curve area
58
Z-table
59
Using the Z table
area right of z = 1 − area left of z
area between z1 and z2 = area left of z2 − area left of z1 (for z1 < z2)
60
Procedure for Normal Distribution
Problems
1. Sketch the situation and shade the area to be
found.
2. Standardize X to state the problem in terms
of Z.
3. Use Table A to find the area to the left of z.
4. Calculate the final answer.
5. Write your conclusion in the context of the
problem.
61
Normal Distribution: Example
A particular rash has shown up in an elementary
school. It has been determined that the length of
time that the rash will last is normally distributed
with mean 6 days and standard deviation 1.5 days.
a) What is the percentage of students that have the
rash for longer than 8 days?
b) What is the percentage of students that the rash
will last between 3.7 and 8 days?
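Parts (a) and (b) can be worked by standardizing and then taking areas to the left of z, as the procedure describes. The sketch below uses `statistics.NormalDist` in place of Table A; the helper `standardize` and the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 6.0, 1.5          # mean and standard deviation in days
std_normal = NormalDist()     # the z curve

def standardize(x):
    """Convert a value x to its z-score."""
    return (x - mu) / sigma

# (a) P(X > 8) = 1 - area left of z.
z_8 = standardize(8)
p_longer = 1 - std_normal.cdf(z_8)

# (b) P(3.7 < X < 8) = area left of the larger z - area left of the smaller z.
z_37 = standardize(3.7)
p_between = std_normal.cdf(z_8) - std_normal.cdf(z_37)

print(round(z_8, 2), round(p_longer, 4))
print(round(z_37, 2), round(p_between, 4))
```

The z-scores are about 1.33 and −1.53, giving roughly 9.1% for (a) and 84.6% for (b). Answers read from a printed table can differ slightly in the last digit because z is rounded to two decimals first.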
62
Percentiles
63
Normal Distribution: Example
A particular rash has shown up in an elementary
school. It has been determined that the length of
time that the rash will last is normally distributed
with mean 6 days and standard deviation 1.5 days.
c) How long must a student’s rash last to be in the
top 10% of rash durations?
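Part (c) is an inverse calculation: find the z value with 90% of the area to its left, then convert back with x = μ + zσ. A minimal sketch, with `inv_cdf` from the standard library playing the role of a backwards table lookup:

```python
from statistics import NormalDist

mu, sigma = 6.0, 1.5

# Top 10% means 90% of the area lies to the left of the cutoff.
z_90 = NormalDist().inv_cdf(0.90)

# Undo the standardization: x = mu + z * sigma.
cutoff_days = mu + z_90 * sigma

print(round(z_90, 2), round(cutoff_days, 2))
```

The cutoff is about 7.92 days (z ≈ 1.28), so a rash lasting longer than that is among the longest 10%.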
64
Symmetrically Located Areas
65
Normal Distribution: Example
A particular rash has shown up in an elementary
school. It has been determined that the length of
time that the rash will last is normally distributed
with mean 6 days and standard deviation 1.5 days.
d) What interval, symmetrically placed about the
mean, captures 95% of the rash durations?
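For part (d), a central 95% leaves 2.5% in each tail, so the upper endpoint is the 97.5th percentile and the lower endpoint mirrors it about the mean. A minimal sketch, with illustrative variable names:

```python
from statistics import NormalDist

mu, sigma = 6.0, 1.5

# Central 95% leaves 2.5% in each tail.
z_star = NormalDist().inv_cdf(0.975)

lower = mu - z_star * sigma
upper = mu + z_star * sigma

print(round(z_star, 2), round(lower, 2), round(upper, 2))
```

With z ≈ 1.96 the interval is roughly 3.06 to 8.94 days. The 68–95–99.7 rule uses z = 2 as a rounded version of this and would give 3 to 9 days.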
66
Procedure: Normal Quantile Plot
1) Arrange the data from smallest to largest.
2) Record the corresponding percentiles
(quantiles).
3) Find the z value corresponding to the
quantile calculated in part 2.
4) Plot the original data points (from 1) vs. the z
values (from 3).
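The four steps can be sketched as follows, using the 20 promotion times from the earlier examples. The slides do not say how the percentile of each ordered observation is assigned; the sketch assumes the common convention (i − 0.5)/n for the i-th smallest value. It prints the coordinates rather than drawing the plot.

```python
from statistics import NormalDist

months = [5, 7, 12, 14, 18, 14, 14, 22, 21, 25,
          23, 24, 34, 37, 34, 49, 64, 47, 67, 69]

# Step 1: arrange the data from smallest to largest.
data = sorted(months)
n = len(data)

# Step 2: percentile of each ordered observation (assumed convention).
percentiles = [(i - 0.5) / n for i in range(1, n + 1)]

# Step 3: z value with that much area to its left.
std_normal = NormalDist()
z_values = [std_normal.inv_cdf(p) for p in percentiles]

# Step 4: points to plot, data value against z value.
points = list(zip(z_values, data))
for z, x in points:
    print(f"{z:6.2f}  {x}")
```

If the data are close to normal, the points fall near a straight line. Systematic curvature suggests skewness, and isolated points far from the line suggest outliers.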
67