Transcript V10_CAT
Categorical Data
Department of ISM, University of Alabama, 1992-2003
M28- Categorical Analysis 1
Lesson Objective
Understand basic rules of probability.
Calculate marginal and
conditional probabilities.
Determine if two categorical variables
are independent.
Recall Rule of Thumb:
Quantitative variables:
averages or differences
have meaning.
Ex: weight, height, income, age
Recall Rule of Thumb:
Categorical variables:
classify people or things.
Ex: gender, race, occupation,
political affiliation,
country of origin
Note: Sometimes
quantitative variables are
expressed as categorical.
Income (Family Economic Income):
Class
Definition
1. Less than $30,000
2. $30,000 but less than $100,000
3. $100,000 or more.
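The three income classes above can be sketched as a small function. This is only an illustration: the function name and the sample incomes are hypothetical, and only the class boundaries come from the slide.

```python
def income_class(income):
    """Map a family income in dollars to the slide's class number (1, 2, or 3)."""
    if income < 30_000:
        return 1          # Class 1: less than $30,000
    if income < 100_000:
        return 2          # Class 2: $30,000 but less than $100,000
    return 3              # Class 3: $100,000 or more

# Hypothetical incomes, chosen only to exercise each class boundary.
sample_incomes = [12_500, 30_000, 99_999, 100_000]
classes = [income_class(x) for x in sample_incomes]
print(classes)  # [1, 2, 2, 3]
```

Once income is recorded this way, averages of the class numbers no longer have meaning, so the variable is treated as categorical.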
Relationship between
two quantitative variables?
Is relationship linear (scatterplot)?
Yes: Use Correlation &
Least Squares Regression.
No: Data transformations.
Recall: Boxplots
Best graphical tool for examining the
relationship between a quantitative
variable and a categorical variable,
(i.e., comparing distributions).
Example:
Weight vs. Country of Origin
Boxplot can be used to answer:
“Do the distributions of
weights vary for different
countries of origin?”
[Figure: side-by-side boxplots of weight (axis ticks 2000 to 4000) by origin (US, Far East, Europe; codes 1 to 3)]
Relationship between
two categorical variables?
Use two-way frequency tables:
Look at marginal probabilities
and conditional probabilities.
STATISTICS
is the science of
transforming data
into information
to make decisions
in the face of uncertainty.
How do we measure
"uncertainty"?
Probability
A numerical measure of the
likelihood that an outcome or
an event occurs.
P(A) = probability of event A
Three Methods for
Assessing Probability
Classical
Relative Frequency
Subjective
Probability requirements for
discrete variables:
1. 0 ≤ P(A) ≤ 1
P(A) = 0
impossible event
P(A) = 1
certain event
2. Sum of the probabilities of
all possible outcomes
must equal 1. (Binomial, Poisson)
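The two requirements can be checked numerically for a discrete distribution. The sketch below does this for a binomial distribution only; the values n = 10 and p = 0.3 are hypothetical and chosen purely for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial distribution with n trials and success chance p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3  # hypothetical parameters, not taken from the slides
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Requirement 1: every probability lies between 0 and 1.
all_in_range = all(0 <= q <= 1 for q in probs)

# Requirement 2: the probabilities of all possible outcomes sum to 1
# (up to floating-point rounding).
total = sum(probs)
print(all_in_range, math.isclose(total, 1.0))
```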
Conditional probability:
The chance one event happens,
given that another event will
occur.
P(A | B) = P(A and B) / P(B)
= (All outcomes belonging to BOTH A AND B) / (Those outcomes in the restricted group, B)
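Both forms of the definition give the same number. The sketch below shows this for one roll of a fair six-sided die, which is a hypothetical example and not part of the lecture data; the events A (even roll) and B (roll greater than 3) are assumptions made for illustration.

```python
from fractions import Fraction

outcomes = set(range(1, 7))                 # one roll of a fair six-sided die
A = {x for x in outcomes if x % 2 == 0}     # event A: the roll is even
B = {x for x in outcomes if x > 3}          # event B: the roll is greater than 3

def prob(event):
    """Classical probability: favorable outcomes over all equally likely outcomes."""
    return Fraction(len(event), len(outcomes))

# Form 1: P(A | B) = P(A and B) / P(B)
p_a_given_b = prob(A & B) / prob(B)

# Form 2: count the outcomes in both A and B, within the restricted group B
counted = Fraction(len(A & B), len(B))

print(p_a_given_b, counted)  # 2/3 2/3
```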
Problem: Credit Card Manager
New credit test to determine
credit worthiness.
Credit test checked against
500 previous customers.
                 Credit Test A
Credit History   Passed (P)   Failed (F)   Total
Good (G)            350           50        400
Default (D)          20           80        100
Total               370          130        500
What is the probability of
a customer defaulting?
P(Defaults) = P(D) = 100/500 = .20
What is the probability of a customer
defaulting given that he fails test A?
P(Defaults given failed test A) = P(D|F) = 80/130 = .615
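The two probabilities asked for here can be read straight off the Credit Test A counts. A minimal sketch, assuming the table is stored as a nested dictionary (the variable names are illustrative only):

```python
from fractions import Fraction

# Counts from the Credit Test A table: rows = credit history, columns = test result.
test_a = {
    "Good":    {"Passed": 350, "Failed": 50},
    "Default": {"Passed": 20,  "Failed": 80},
}

n = sum(sum(row.values()) for row in test_a.values())       # 500 customers
n_default = sum(test_a["Default"].values())                  # 100 defaulted
n_failed = sum(row["Failed"] for row in test_a.values())     # 130 failed the test

# Marginal probability: all 500 customers are the reference group.
p_d = Fraction(n_default, n)

# Conditional probability: only the 130 who failed are the reference group.
p_d_given_f = Fraction(test_a["Default"]["Failed"], n_failed)

print(float(p_d), round(float(p_d_given_f), 3))  # 0.2 0.615
```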
General Rules:
P(A and B) = P(A) P(B|A)
= P(B) P(A|B)
P(A or B) = P(A) + P(B) - P(A and B)
P(Fails AND Defaults)
= P(F) P(D|F)
= (130/500)(80/130) = 80/500 = .16
P(Fails OR Defaults)
= P(F) + P(D) - P(D AND F)
= 130/500 + 100/500 - 80/500 = 150/500 = .30
Note: The “overlap” group
would be counted twice if
no subtraction.
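Both general rules can be checked against direct counts from the Credit Test A table (Good: 350 passed, 50 failed; Default: 20 passed, 80 failed). The sketch below uses exact fractions; the variable names are illustrative only.

```python
from fractions import Fraction

n = 500            # all customers in the Credit Test A sample
n_f = 130          # failed the test
n_d = 100          # defaulted
n_d_and_f = 80     # defaulted AND failed

p_f = Fraction(n_f, n)
p_d = Fraction(n_d, n)
p_d_given_f = Fraction(n_d_and_f, n_f)

# Multiplication rule: P(F and D) = P(F) P(D|F)
p_and = p_f * p_d_given_f

# Addition rule: P(F or D) = P(F) + P(D) - P(D and F)
p_or = p_f + p_d - p_and

# Direct count of "failed or defaulted": 50 + 20 + 80 customers.
direct_or = Fraction(50 + 20 + 80, n)

print(p_and, p_or, direct_or)  # 4/25 3/10 3/10
```

Without the subtraction, the 80 customers in the overlap would be counted twice and the sum would be 230/500 rather than 150/500.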
Does knowledge of “test A result”
help you make a better decision?
P(D|F) = 80/130 = .615
P(D) = 100/500 = .20
Do you want to know the test A
results before you give the loan?
“Credit test A results” and “defaulting”
are dependent on each other.
A different sample of
500 credit records
                 Credit Test B
Credit History   Passed (P)   Failed (F)   Total
Good (G)            340           60        400
Default (D)          85           15        100
Total               425           75        500
What is the probability of
a customer defaulting?
P(Defaults) = P(D) = 100/500 = .20
What is the probability of a customer
defaulting given that he fails test B?
P(Defaults given failed test B) = P(D|F) = 15/75 = .20
Does knowledge of “test B result”
help you make a better decision?
P(D|F) = 15/75 = .20
P(D) = 100/500 = .20
Test B tells me nothing new about the chance of default.
“Credit test B results” and “defaulting” are
independent of each other.
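The comparison of P(D|F) with P(D) can be run side by side for the two credit tests. A minimal sketch using the cell counts from the Test A and Test B tables; the function name is hypothetical.

```python
from fractions import Fraction

def default_rates(good_pass, good_fail, default_pass, default_fail):
    """Return (P(D), P(D|F)) as exact fractions from the four cell counts."""
    n = good_pass + good_fail + default_pass + default_fail
    p_d = Fraction(default_pass + default_fail, n)
    p_d_given_f = Fraction(default_fail, good_fail + default_fail)
    return p_d, p_d_given_f

results = {
    "Test A": default_rates(350, 50, 20, 80),
    "Test B": default_rates(340, 60, 85, 15),
}

for name, (p_d, p_d_given_f) in results.items():
    changes = p_d_given_f != p_d
    print(f"{name}: P(D) = {float(p_d):.3f}, P(D|F) = {float(p_d_given_f):.3f}, "
          f"failing changes the default probability: {changes}")
```

For Test A the condition moves the default probability from .20 to .615; for Test B it stays at .20, so knowing the Test B result does not help the decision.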
Two events are independent if
the occurrence, or non-occurrence,
of one does not affect the chances of
the other occurring, or not occurring.
Otherwise, we say the
events are
dependent.
If A and B are independent, then
P(A and B) = P(A) P(B)
P(A or B) = P(A) + P(B) - P(A) P(B)
P(A|B) = P(A)
P(B|A) = P(B)
Note: The condition
does NOT change
the probability.
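The product rule gives another way to check independence: compare P(A and B) with P(A) P(B). The sketch below applies it to "defaults" and "fails" in the two credit samples (Test A: 80 of 500 defaulted and failed, 100 defaulted, 130 failed; Test B: 15, 100, and 75). The function name is hypothetical.

```python
from fractions import Fraction

def satisfies_product_rule(n_a_and_b, n_a, n_b, n):
    """True when P(A and B) equals P(A) P(B) exactly for the given counts."""
    return Fraction(n_a_and_b, n) == Fraction(n_a, n) * Fraction(n_b, n)

# Test A: P(D and F) = .16, but P(D) P(F) = (.20)(.26) = .052
test_a_independent = satisfies_product_rule(80, 100, 130, 500)

# Test B: P(D and F) = .03, and P(D) P(F) = (.20)(.15) = .03
test_b_independent = satisfies_product_rule(15, 100, 75, 500)

print(test_a_independent, test_b_independent)  # False True
```

Exact equality is a strict test. With real sample data the two sides rarely match exactly, so in practice the question is whether they are close.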
Survey of randomly selected
people in Jan. 2001:
Q1: Did you vote in the 2000 election?
Q2: Do you favor an amendment
to require a balanced budget?
Q3: To which political party do you
belong ?
Do you favor amendment for a balanced budget?
Political Party   Yes    No   Total
Republican         90    82    172
Democrat           44   104    148
Other              48    32     80
Total             182   218    400
In the two-way table of Party by opinion:
Bottom row (182, 218): marginal totals for opinion.
Right column (172, 148, 80): marginal totals for Party.
Bottom-right cell (400): sample size.
What proportion
favor the amend.
and are Other?
What proportion
favor the amend.?
What proportion
claim to be Rep?
Of those that claim
to be Democrat,
what proportion
favor the amend.?
What proportion
favor the amend,
given those that
claim to be Rep?
Considering only
those opposed,
what proportion
are not Republican?
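The six questions above differ only in which count goes on top and which group goes on the bottom. A minimal sketch using the party-by-opinion counts (Republican 90/82, Democrat 44/104, Other 48/32); the dictionary layout and labels are illustrative only.

```python
from fractions import Fraction

table = {
    "Republican": {"Yes": 90, "No": 82},
    "Democrat":   {"Yes": 44, "No": 104},
    "Other":      {"Yes": 48, "No": 32},
}

n = sum(sum(row.values()) for row in table.values())                    # 400
row_total = {party: sum(row.values()) for party, row in table.items()}
col_total = {op: sum(row[op] for row in table.values()) for op in ("Yes", "No")}

answers = {
    # Joint and marginal proportions: divide by the whole sample.
    "P(Yes and Other)": Fraction(table["Other"]["Yes"], n),
    "P(Yes)":           Fraction(col_total["Yes"], n),
    "P(Rep)":           Fraction(row_total["Republican"], n),
    # Conditional proportions: divide by the restricted group only.
    "P(Yes | Demo)":    Fraction(table["Democrat"]["Yes"], row_total["Democrat"]),
    "P(Yes | Rep)":     Fraction(table["Republican"]["Yes"], row_total["Republican"]),
    "P(not Rep | No)":  Fraction(col_total["No"] - table["Republican"]["No"],
                                 col_total["No"]),
}

for label, value in answers.items():
    print(f"{label} = {float(value):.3f}")
```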
Conditional Distribution:
Restrict subjects to only those that meet a
condition. Within this restricted group,
what is the distribution of some other var.?
Distribution of “opinion” given
those that claim to be Republican:
P( Yes | Rep. ) = 90/172 = .523
P( No | Rep. ) = 82/172 = .477
(The bar “|” is read “given that”.)
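A conditional distribution is just one row of the table rescaled so that it sums to 1. A minimal sketch for the Republican row; the function name is hypothetical.

```python
import math

def conditional_distribution(row_counts):
    """Divide each count in the restricted group by that group's total."""
    total = sum(row_counts.values())
    return {category: count / total for category, count in row_counts.items()}

# Restrict to those that claim to be Republican: 90 Yes, 82 No.
rep = conditional_distribution({"Yes": 90, "No": 82})

print({k: round(v, 3) for k, v in rep.items()})  # {'Yes': 0.523, 'No': 0.477}
```

The same function can be applied to the Democrat and Other rows to get the remaining conditional distributions.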
Is there a relationship between
the party and the opinion
on the amendment?
What would you expect
to happen if
no relationship existed?
Three Conditional Distributions:
P( Yes | Rep.) = .523, P( No | Rep.) = .477
P( Yes | Demo) = .297, P( No | Demo) = .703
P( Yes | Other) = .600, P( No | Other) = .400
Marginal Distribution:
P( Yes ) = .455, P( No ) = .545
Is there a relationship?
Why? or Why not?
If there is NO relationship
(i.e., independence)
between the party and
the opinion, then
“the three conditional probabilities
should be close to each other
and close to the
marginal probability.”
Three Conditional Probabilities:
P( Yes | Rep.) = .523
P( Yes | Demo) = .297
P( Yes | Other) = .600
Are these close to each other?
Marginal Probability:
P( Yes ) = .455
AND close to the “marginal”?
Not close; therefore, “party” and
the “opinion” are dependent.
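The comparison can be laid out as the gap between each conditional probability and the marginal probability. This is only a sketch: the slides do not define how small a gap counts as "close", so the code reports the gaps and leaves that judgment to the reader.

```python
table = {
    "Republican": {"Yes": 90, "No": 82},
    "Democrat":   {"Yes": 44, "No": 104},
    "Other":      {"Yes": 48, "No": 32},
}

n = sum(row["Yes"] + row["No"] for row in table.values())        # 400
marginal_yes = sum(row["Yes"] for row in table.values()) / n     # 182/400

# P(Yes | party) for each party, and its distance from the marginal P(Yes).
cond_yes = {party: row["Yes"] / (row["Yes"] + row["No"])
            for party, row in table.items()}
gaps = {party: p - marginal_yes for party, p in cond_yes.items()}

print(f"P(Yes) = {marginal_yes:.3f}")
for party in table:
    print(f"P(Yes | {party}) = {cond_yes[party]:.3f}  gap = {gaps[party]:+.3f}")
```

Under independence all three gaps would be near zero; here they range from about -.158 (Democrat) to +.145 (Other).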
Create with
“Pivot Tables”
in Excel.
Barchart - Clustered
[Figure: clustered bar chart; x-axis: party (Rep., Demo., Other); y-axis: Frequency; legend includes Yes]
Barchart - Stacked
[Figure: stacked bar chart; x-axis: party (Rep., Demo., Other); y-axis: Frequency; legend includes Yes]
Barchart - Percents
[Figure: stacked percentage bar chart; x-axis: party (Rep., Demo., Other); y-axis: Percent; legend includes Yes]
Summary
For two categorical variables:
Must use conditional probabilities
to determine if a relationship exists.
Cannot use correlation.
Visual display:
Stacked percentage bar charts
Associations between TWO Variables
Variables           numerical                          graphical
Quant. vs. Quant.   LS regression line,                Scatterplot,
                    r, r-sq, std error                 residual plots
Quant. vs. Cat.     X-bar and s                        Side-by-side
                    for each category                  box plots
Cat. vs. Cat.       Two-way table, conditional &       Bar chart:
                    marginal distributions             stacked, percent.
The End