Exploratory Data Analysis: One Variable
FPP 3-6
Plan of attack
Distinguish different types of variables
Summarize data numerically
Summarize data graphically
Use theoretical distributions to potentially learn more about a variable
The five steps of statistical analyses
1. Form the question
2. Collect data
3. Model the observed data
   We start with exploratory techniques.
4. Check the model for reasonableness
5. Make and present conclusions
Just to make sure we are on the same page
More (or repeated) vocabulary
Individuals are the objects described by a set of data
  examples: employees, lab mice, states…
A variable is any characteristic of an individual that is of interest to the researcher
  Takes on different values for different individuals
  examples: age, salary, weight, location…
How is this different from a mathematical variable?
Just to make sure we are on the same page #2
Measurement: the value of a variable obtained and recorded on an individual
  Example: 145 recorded as a person’s weight, 65 recorded as the height of a tree, etc.
Data is a set of measurements made on a group of individuals
The distribution of a variable tells us what values it takes and how often it takes these values

Chest Sizes of 5,738 Militiamen
Chest size (possible values)    Count (how often each occurs)
33-34                             21
35-36                            266
37-38                           1169
39-40                           2152
41-42                           1592
43-44                            462
45-46                             71
47-48                              5
Two Types of Variables
A categorical/qualitative variable places an individual into one of several groups or categories
  examples: Gender, Race, Job Type, Geographic location…
  JMP calls these variables nominal
A quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense
  examples: Height, Age, Salary, Price, Cost…
  Can be further divided into ordinal and continuous
Why two types?
Both require their own summaries (graphically and numerically) and analysis.
I can’t emphasize enough the importance of identifying the type of variable being considered before proceeding with any type of statistical analysis
Example
Name                  Age   Gender   Race    Salary   Job Type
Fleetwood, Delores    39    Female   White   62,100   Management
Perez, Juan           27    Male     White   47,350   Technical
Wang, Lin             20    Female   Asian   18,250   Clerical
Johnson, LaVerne      48    Male     Black   77,600   Management

Age: quantitative
Gender: categorical
Race: categorical
Salary: quantitative
Job type: categorical
Variable types in JMP
Qualitative/categorical: JMP uses Nominal
Quantitative
  Discrete: JMP uses Ordinal
  Continuous: JMP uses Continuous
Exploratory data analysis
Statistical tools that help examine data in order to describe their main features
Basic strategy
  Examine variables one by one, then look at the relationships among the different variables
  Start with graphs, then add numerical summaries of specific aspects of the data
Exploratory data analysis: One variable
Graphical displays Qualitative/categorical data: bar chart, pie chart, etc.
Quantitative data: histogram, stem-leaf, boxplot, timeplot etc.
Summary statistics Qualitative/categorical: contingency tables Quantitative: mean, median, standard deviation, range etc.
Probability models Qualitative: Binomial distribution (others we won’t cover in this class) Quantitative: Normal curve (others we won’t cover in this class)
Example categorical/qualitative data
Summary table
We summarize categorical data using a table. Note that percentages (or proportions) are often called relative frequencies.
Class                        Frequency          Relative Frequency
(Highest Degree Obtained)    (Number of CEOs)   (Proportion)
None                          1                 0.04
Bachelors                     7                 0.28
Masters                      11                 0.44
Doctorate / Law               6                 0.24
Totals                       25                 1.00
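The relative frequencies in the table are each count divided by the total count. A minimal sketch of that arithmetic, using the counts from the CEO table (the variable names are illustrative, not from the slides):

```python
# Counts copied from the CEO degree table.
counts = {"None": 1, "Bachelors": 7, "Masters": 11, "Doctorate / Law": 6}

total = sum(counts.values())  # number of CEOs in the table

# Relative frequency = count / total, one value per category.
relative_freq = {degree: n / total for degree, n in counts.items()}

for degree, n in counts.items():
    print(f"{degree:<16} {n:>3} {relative_freq[degree]:.2f}")
print(f"{'Totals':<16} {total:>3} {sum(relative_freq.values()):.2f}")
```

The relative frequencies always add up to 1 (or 100%), which is a quick check that no category was left out.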
Bar graph
The bar graph quickly compares the degrees of the four groups
The heights of the four bars show the counts for the four degree categories
Pie chart
A pie chart helps us see what part of the whole each group forms
To make a pie chart, you must include all the categories that make up a whole
Summary of categorical variables
Graphically: bar graphs, pie charts
A bar graph is nearly always preferable to a pie chart. It is easier to compare bar heights than slices of a pie
Numerically: tables with total counts or percents
Quantitative variables
Graphical summary: histogram, stemplots, time plots, more
Numerical summary: mean, median, quartiles, range, standard deviation, more
Histograms
The bins are:
3.0 ≤ rate < 4.0
4.0 ≤ rate < 5.0
5.0 ≤ rate < 6.0
6.0 ≤ rate < 7.0
7.0 ≤ rate < 8.0
8.0 ≤ rate < 9.0
9.0 ≤ rate < 10.0
10.0 ≤ rate < 11.0
11.0 ≤ rate < 12.0
12.0 ≤ rate < 13.0
13.0 ≤ rate < 14.0
14.0 ≤ rate < 15.0
Histograms
The bins are:
2.0 ≤ rate < 4.0
4.0 ≤ rate < 6.0
6.0 ≤ rate < 8.0
8.0 ≤ rate < 10.0
10.0 ≤ rate < 12.0
12.0 ≤ rate < 14.0
14.0 ≤ rate < 16.0
16.0 ≤ rate < 18.0
Histograms
Where did the bins come from?
They were chosen rather arbitrarily
Does choosing other bins change the picture?
Yes!! And sometimes dramatically
What do we do about this?
Some pretty smart people have come up with some “optimal” bin widths and we will rely on their suggestions
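To see how the bin choice changes the picture, here is a small sketch that counts the same observations with two different bin layouts, like the width-1 and width-2 bins listed earlier. The rates below are invented for illustration and are not the data behind the slides. Sturges' rule is shown as one widely cited suggestion for the number of bins; the slides do not say which rule they rely on.

```python
import math

# Hypothetical rates, invented for illustration only (not the slide data).
rates = [3.2, 4.1, 4.8, 5.5, 5.9, 6.3, 6.4, 7.0, 7.7, 8.1, 9.6, 14.5]

def bin_counts(data, start, width):
    """Count observations in bins [start + k*width, start + (k+1)*width)."""
    counts = {}
    for x in data:
        k = math.floor((x - start) / width)  # index of the bin holding x
        left = start + k * width             # left edge of that bin
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

narrow = bin_counts(rates, start=3.0, width=1.0)
wide = bin_counts(rates, start=2.0, width=2.0)
print(narrow)
print(wide)

# Sturges' rule: one common suggestion for how many bins to use.
n_bins = math.ceil(math.log2(len(rates))) + 1
print(n_bins)
```

With the narrow bins the counts look fairly flat, while the wide bins pile most observations into two bars, even though the data are identical.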
Histogram
The purpose of a graph is to help us understand the data
After you make a graph, always ask, “What do I see?”
Once you have displayed a distribution, you can see its important features
Histograms
We will describe the features of the distribution that the histogram is displaying with three characteristics
1. Shape: symmetric, skewed right, skewed left, uni-modal, multi-modal, bell shaped
2. Center: mean, median
3. Spread (outliers or not): standard deviation, inter-quartile range
Body temperatures of 30 people
[Figure: JMP Distributions output for Body Temp: histogram with Quantiles and Moments tables; values not recoverable from the transcript]
Incomes from 500 households in 2000 current population survey
[Figure: JMP Distributions output for household income: histogram with Quantiles and Moments tables; values not recoverable from the transcript]
Histogram vs. Bar graph
Spaces mean something in histograms but not in bar graphs
Shape means nothing with bar graphs
The biggest difference is that they display fundamentally different types of variables
Time Plots
Many variables are measured at intervals over time
Examples: closing stock prices, number of hurricanes, unemployment rates
If the interest in a variable is to see how it changes over time, use a time plot
Time Plots
Patterns to look for
Patterns that repeat themselves at known regular intervals of time are called seasonal variation
A trend is a persistent, long-term rise or fall
Time plots
[Figure: time plot of the number of hurricanes each year from 1970 - 1990; horizontal axis: Year]
Numerical summaries of quantitative variables
Want a numerical summary for center and spread
Center: mean, median, mode
Spread: range, inter-quartile range, standard deviation
The 5 number summary is a popular collection of the following: min, 1st quartile, median, 3rd quartile, max
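As a rough sketch of the 5 number summary, the following uses Python's standard statistics module on a made-up list. The data are hypothetical, and the quartile convention used by statistics.quantiles is only one of several, so JMP or the textbook may report slightly different quartiles for the same data.

```python
import statistics

# Hypothetical observations, invented for illustration.
data = [2, 4, 4, 5, 7, 9, 10, 12, 15]

# With n=4, statistics.quantiles returns the three quartile cut points.
# Its default method is one convention among several.
q1, q2, q3 = statistics.quantiles(data, n=4)

five_number = (min(data), q1, q2, q3, max(data))
data_range = max(data) - min(data)   # largest minus smallest
iqr = q3 - q1                        # 3rd quartile minus 1st quartile

print(five_number)
print(data_range, iqr)
```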
Mean
To find the mean of a set of observations, add their values and divide by the number of observations
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
Mean example
The average age of 20 people in a room is 25. A 28 year old leaves while a 30 year old enters the room. Does the average age change?
If so, what is the new average age?
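One way to reason about this question is to work with the total rather than the individual ages, since the mean is just the total divided by the count. A short sketch of that arithmetic (variable names are illustrative):

```python
n_people = 20
old_mean = 25

# mean = total / count, so the total age in the room is count * mean.
old_total = n_people * old_mean

# A 28-year-old leaves and a 30-year-old enters; the head count stays at 20.
new_total = old_total - 28 + 30
new_mean = new_total / n_people

print(old_total, new_total, new_mean)
```

The total rises by 2, so the mean rises by 2/20 = 0.1 without our needing to know anyone else's age.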
Median
The median is the midpoint of a distribution
The number such that half the observations are smaller and the other half are larger
Also called the 50th percentile or 2nd quartile
To compute a median
  Order the observations
  If the number of observations is odd, the median is the center observation
  If the number of observations is even, the median is the average of the two center observations
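The recipe for computing a median can be sketched as a small function. This is an illustrative implementation of the steps described here, with made-up inputs; Python's built-in statistics.median does the same job.

```python
def median(values):
    """Median by the recipe: order the observations, then take the middle."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single center observation.
        return ordered[mid]
    # Even count: average of the two center observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_example = median([31, 22, 25])        # hypothetical ages
even_example = median([31, 22, 25, 28])   # hypothetical ages
print(odd_example, even_example)
```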
Median example
The median age of 20 people in a room is 25. A 28 year old leaves while a 30 year old enters the room. Does the median age change?
If so, what is the new median age?
The median age of 21 people in a room is 25. A 28 year old leaves while a 30 year old enters the room. Does the median age change?
If so, what is the new median age?
Mean vs Median
When the histogram is symmetric, the mean and median are similar
The mean and median are different when the histogram is skewed
  Skewed to the right: mean is larger than median
  Skewed to the left: mean is smaller than median
The business magazine Forbes estimates that the “average” household wealth of its readers is either about $800,000 or about $2.2 million, depending on which “average” it reports. Which of these numbers is the mean wealth and which is the median wealth? Why?
Mean vs Median
Symmetric distribution
Mean vs Median
Right skewed distribution
Mean vs Median
Left skewed distribution
Extreme example
Income in a small town of 6 people
$25,000 $27,000 $29,000 $35,000 $37,000 $38,000
Mean is about $31,833 and median is $32,000
Bill Gates moves to town
$25,000 $27,000 $29,000 $35,000 $37,000 $38,000 $40,000,000
Mean is $5,741,571 and median is $35,000
The mean is pulled by the outlier while the median is not. The median is a better measure of center for these data
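The numbers in this example can be reproduced with Python's standard statistics module. The incomes are the ones listed on the slide; the variable names are illustrative.

```python
import statistics

# Incomes from the small-town example.
incomes = [25_000, 27_000, 29_000, 35_000, 37_000, 38_000]
with_outlier = incomes + [40_000_000]   # Bill Gates moves to town

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(round(mean_before), median_before)
print(round(mean_after), median_after)
```

One extreme value moves the mean by millions of dollars, while the median shifts by only one position in the ordered list.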
Is a central measure enough?
A warm, stable climate greatly affects some individuals’ health. Atlanta and San Diego have about equal average temperatures (62° vs. 64°). If a person’s health requires a stable climate, in which city would you recommend they live?
Measures of spread
Range: subtract the smallest value from the largest
Inter-quartile range: subtract the 1st quartile from the 3rd quartile
Standard Deviation (SD): “average” distance from the mean
Which one should we use?
Standard Deviation
The standard deviation looks at how far observations are from their mean
It is the square root of the average squared deviation from the mean
  Compute the distance of each value from the mean
  Square each of these distances
  Take the average of these squares, then take the square root
$$\mathrm{SD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}$$
Often we will use SD to denote standard deviation
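The SD recipe can be followed step by step in code. This is a sketch on a made-up list; it divides by N, as in the description here, which matches statistics.pstdev. (Many software packages divide by N - 1 instead, as statistics.stdev does, so their output can differ slightly.)

```python
import math
import statistics

# Hypothetical list, invented for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)

# Step 1 and 2: distance of each value from the mean, squared.
squared_distances = [(x - mean) ** 2 for x in data]

# Step 3: average of the squares, then the square root.
average_square = sum(squared_distances) / len(data)
sd = math.sqrt(average_square)

print(mean, average_square, sd)
print(statistics.pstdev(data))   # same divide-by-N definition
```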
Example
Standard deviation
Order these histograms by the SD of the numbers they portray. Go from smallest to largest
What is a reasonable guess of the SD for each?
[Figure: three histograms, each drawn on its own horizontal scale]
Histograms on same scale
[Figure: the same three histograms, each drawn on a common horizontal scale from -30 to 30]
Problem from text (p. 74, #2)
Which of the following sets of numbers has the smaller SD?
a) 50, 40, 60, 30, 70, 25, 75
b) 50, 40, 60, 30, 70, 25, 75, 50, 50, 50
Repeat for these two sets
c) 50, 40, 60, 30, 70, 25, 75
d) 50, 40, 60, 30, 70, 25, 75, 99, 1
More intuition behind the SD
This is a variance contest. You must give a list of six numbers chosen from the whole numbers 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 with repeats allowed. Give a list of six numbers with the largest standard deviation such a list described above can possibly have.
Give a list of six numbers with the smallest standard deviation such a list can possibly have.
Properties of SD
SD ≥ 0. (When is SD = 0?)
Has the same unit of measurement as the original observations
Inflated by outliers
Mean and SD
What happens to the mean if you add 5 to every number in a list?
What happens to the SD?
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
$$\mathrm{SD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}$$
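The two questions on this slide can be checked numerically before reasoning through the formulas. A small sketch on a made-up list, using the divide-by-N SD (statistics.pstdev):

```python
import statistics

# Hypothetical list, invented for illustration.
numbers = [2, 4, 4, 4, 5, 5, 7, 9]
shifted = [x + 5 for x in numbers]   # add 5 to every number

mean_before = statistics.mean(numbers)
mean_after = statistics.mean(shifted)
sd_before = statistics.pstdev(numbers)
sd_after = statistics.pstdev(shifted)

print(mean_before, mean_after)   # the mean moves up by 5
print(sd_before, sd_after)       # the SD does not change
```

Adding a constant moves every value and the mean by the same amount, so each deviation from the mean, and hence the SD, is unchanged.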
Standard deviation
SDs are like measurement units on a ruler Any quantitative variable can be converted into “standardized” units These are often called z-scores and are denoted by the letter z
Important formula
$$z = \frac{\text{value} - \text{mean}}{\mathrm{SD}}$$
Example: ACT versus SAT scores
A 1340 on the SAT, or a 32 on the ACT?
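A sketch of how z-scores make the two test scores comparable. The slides do not give the means and SDs of the two tests, so the values below are assumptions chosen only for illustration; with different means or SDs the comparison could come out differently.

```python
def z_score(value, mean, sd):
    """Standardized units: how many SDs the value lies above the mean."""
    return (value - mean) / sd

# ASSUMED summary values, not from the slides.
SAT_MEAN, SAT_SD = 1000, 200
ACT_MEAN, ACT_SD = 21, 5

z_sat = z_score(1340, SAT_MEAN, SAT_SD)
z_act = z_score(32, ACT_MEAN, ACT_SD)

print(z_sat, z_act)
```

Under these assumed values the ACT score sits more SDs above its mean than the SAT score does, so it would be the more impressive of the two.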
The normal curve
When the histogram looks like a bell-shaped curve, z-scores are associated with percentages
The percentage of the data in between two different z-score values equals the area under the normal curve in between the two z-score values
A bit of notation here: N(μ, σ) is shorthand for a normal curve with mean μ and standard deviation σ (get used to this notation, as it will be used fairly regularly throughout the course)
Normal curves
Properties of normal curve
In the Normal distribution with mean μ and standard deviation σ:
68% of the observations fall within 1σ of μ
95% of the observations fall within 2σ of μ
99.7% of the observations fall within 3σ of μ
By remembering these numbers, you can think about Normal curves without constantly making detailed calculations
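The 68%, 95% and 99.7% figures are rounded areas under the normal curve, and they can be checked with the NormalDist class from Python's standard library. This is only a verification sketch; the helper name area_within is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)   # the N(0, 1) curve

def area_within(k):
    """Area under N(0, 1) between z = -k and z = +k."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k) * 100, 1))
```

Because z-scores put any normal curve on the same scale, the same areas apply to N(μ, σ) for any mean and standard deviation.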
Properties of normal curves
For a N(0,1) the following holds
IQ
A person is considered to have mental retardation when
1. IQ is below 70
2. Significant limitations exist in two or more adaptive skill areas
3. Condition is present from childhood
What percentage of people have an IQ that meets the first criterion of mental retardation?
IQ
A histogram of all people’s IQ scores has μ = 100 and σ = 16
How do we get the % of people with IQ < 70?
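One way to approach this, assuming IQ scores follow the normal curve with the mean and SD given here, is to convert 70 to a z-score and read off the area to its left. A minimal sketch using the standard library:

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=16)   # mean and SD as given for IQ scores

# Standardize: how many SDs below the mean is an IQ of 70?
z = (70 - 100) / 16

# Area under the normal curve to the left of 70, as a percentage.
pct_below_70 = iq.cdf(70) * 100

print(z, round(pct_below_70, 1))
```

An IQ of 70 is a little less than 2 SDs below the mean, so the answer should be a bit more than the 2.5% that the 95% rule leaves in the lower tail.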
More IQ
Reggie Jackson, one of the greatest baseball players ever, has an IQ of 140. What percentage of people have bigger IQs than Reggie?
Marilyn vos Savant, self-proclaimed smartest person in the world, has a reported IQ of 205. What percentage of people have IQ scores smaller than Marilyn’s score?
Mensa is a society for “intelligent people.” To qualify for Mensa, one needs to be in at least the upper 2% of the population in IQ score. What is the score needed to qualify for Mensa?
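These three questions can be sketched the same way, assuming the normal curve with μ = 100 and σ = 16 is a reasonable model for IQ scores. The first two ask for an area given a score; the Mensa question goes the other direction and asks for a score given an area. The normal model is only an approximation, especially this far out in the tails.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=16)

# Area to the right of 140 (z = 2.5), as a percentage.
pct_above_reggie = (1 - iq.cdf(140)) * 100

# Area to the left of 205 (z is about 6.6), as a percentage.
pct_below_marilyn = iq.cdf(205) * 100

# Score with 98% of the area below it, i.e. the start of the upper 2%.
mensa_cutoff = iq.inv_cdf(0.98)

print(round(pct_above_reggie, 2))
print(pct_below_marilyn)
print(round(mensa_cutoff, 1))
```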
Checking if data follow normal curve
Look for a symmetric histogram
A different method is a normal probability plot. When the normal curve is a good fit, the points fall on a nearly straight line
Measurement error
Measurement error model
  Measurement = truth + chance error
Outliers
Bias affects all measurements in the same way
  Measurement = truth + bias + chance error
Often we assume that the chance error follows a normal curve that is centered at 0
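A small simulation can illustrate the measurement error model. The truth, bias and chance-error SD below are made-up values, and the chance error is drawn from a normal curve centered at 0, as assumed here.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable

# Hypothetical values, invented for illustration.
TRUTH = 10.0
BIAS = 0.5
CHANCE_SD = 0.2

def measure():
    """One measurement = truth + bias + chance error."""
    chance_error = rng.gauss(0, CHANCE_SD)
    return TRUTH + BIAS + chance_error

measurements = [measure() for _ in range(10_000)]
average = statistics.mean(measurements)
spread = statistics.pstdev(measurements)

print(round(average, 2), round(spread, 2))
```

Averaging many measurements cancels most of the chance error, but the average still sits near truth + bias: repetition does not remove bias.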