Lecture Eleven: Probability Models
Outline
• Bayesian Probability
• Duration Models
Bayesian Probability
• Facts:
• The incidence of the disease in the population is one in a thousand.
• The probability of testing positive if you have the disease is 99 out of 100.
• The probability of testing positive if you do not have the disease is 2 out of 100.
Joint and Marginal Probabilities
          Sick: S     Healthy: H   Marginal
Test +    Pr(+∩S)     Pr(+∩H)      Pr(+)
Test -    Pr(-∩S)     Pr(-∩H)      Pr(-)
Marginal  Pr(S)       Pr(H)        1
Filling In Our Facts
          Sick: S          Healthy: H
Test +
Test -
          Pr(S) = 0.001    Pr(H) = 0.999
Using Conditional Probability
• Pr(+∩H) = Pr(+/H)*Pr(H) = 0.02*0.999 = 0.01998
• Pr(+∩S) = Pr(+/S)*Pr(S) = 0.99*0.001 = 0.00099
Filling In Our Facts
          Sick: S             Healthy: H
Test +    Pr(+∩S) = 0.00099   Pr(+∩H) = 0.01998
Test -
          Pr(S) = 0.001       Pr(H) = 0.999
By Sum and By Difference
          Sick: S             Healthy: H           Marginal
Test +    Pr(+∩S) = 0.00099   Pr(+∩H) = 0.01998    Pr(+) = 0.02097
Test -    Pr(-∩S) = 0.00001   Pr(-∩H) = 0.97902    Pr(-) = 0.97903
          Pr(S) = 0.001       Pr(H) = 0.999
False Positive Paradox
• Probability of being sick if you test positive: Pr(S/+) = ?
• From conditional probability:
• Pr(S/+) = Pr(S∩+)/Pr(+) = 0.00099/0.02097
• Pr(S/+) = 0.0472
Bayesian Probability By Formula
• Pr(S/+) = Pr(S∩+)/Pr(+) = Pr(+/S)*Pr(S)/Pr(+)
• where Pr(+) = Pr(+/S)*Pr(S) + Pr(+/H)*Pr(H)
• Using our facts:
• Pr(S/+) = 0.99*0.001/[0.99*0.001 + 0.02*0.999]
• Pr(S/+) = 0.00099/[0.00099 + 0.01998]
• Pr(S/+) = 0.00099/0.02097 = 0.0472
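The arithmetic above can be checked with a short script (a sketch; the variable names are mine, the numbers are the facts from the slides):

```python
# Bayes' rule for the disease-testing example, using the facts above.
p_s = 0.001            # Pr(S): incidence of the disease
p_h = 1 - p_s          # Pr(H)
p_pos_given_s = 0.99   # Pr(+/S): positive test given sick
p_pos_given_h = 0.02   # Pr(+/H): positive test given healthy

# Total probability: Pr(+) = Pr(+/S)Pr(S) + Pr(+/H)Pr(H)
p_pos = p_pos_given_s * p_s + p_pos_given_h * p_h

# Bayes' rule: Pr(S/+) = Pr(+/S)Pr(S) / Pr(+)
p_s_given_pos = p_pos_given_s * p_s / p_pos

print(round(p_pos, 5))          # 0.02097
print(round(p_s_given_pos, 4))  # 0.0472
```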
Duration Models
• Exploratory (Graphical) Estimates
– Kaplan-Meier
• Functional Form Estimates
– Exponential Distribution
Duration of Post-War Economic Expansions in Months

Trough        Peak            Duration
Oct. 1945     Nov. 1948       37
Oct. 1949     July 1953       45
May 1954      August 1957     39
April 1958    April 1960      24
Feb. 1961     Dec. 1969       106
Nov. 1970     Nov. 1973       36
March 1975    January 1980    58
July 1980     July 1981       12
Nov. 1982     July 1990       92
March 1991    March 2000      120
Estimated Survivor Function for Ten Post-War Expansions
Kaplan-Meier Estimate of the Survivor Function
• At each ending time, multiply the survivor function by (# at risk - # ending)/# at risk
Duration   # Ending   # At Risk   Survivor
  0        0          10          1.0
 12        1          10          0.9
 24        1           9          0.8
 36        1           8          0.7
 37        1           7          0.6
 39        1           6          0.5
 45        1           5          0.4
 58        1           4          0.3
 92        1           3          0.2
106        1           2          0.1
120        1           1          0.0
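The survivor column can be reproduced programmatically; a minimal sketch (the function name is mine) that steps the estimate down at each ending time:

```python
# Kaplan-Meier survivor estimate for the ten expansion durations.
# With no censoring and one expansion ending at each observed duration,
# the estimate simply steps down by 1/n at every ending time.
durations = [37, 45, 39, 24, 106, 36, 58, 12, 92, 120]

def survivor_function(durations):
    """Return (duration, survivor) pairs using S = S * (at_risk - ending)/at_risk."""
    s = 1.0
    out = [(0, 1.0)]
    at_risk = len(durations)
    for t in sorted(durations):
        s *= (at_risk - 1) / at_risk  # one observation ends at time t
        out.append((t, round(s, 1)))
        at_risk -= 1
    return out

for t, s in survivor_function(durations):
    print(t, s)
```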
[Figure 2: Estimated Survivor Function for Post-War Expansions — survivor function vs. duration in months.]
Exponential Distribution
• Density: f(t) = λ exp[-λt], 0 ≤ t < ∞
• Cumulative distribution function, F(t):
• F(t) = ∫₀ᵗ f(u) du = ∫₀ᵗ λ exp[-λu] du
• F(t) = -exp[-λu] evaluated from 0 to t
• F(t) = -{exp[-λt] - exp[0]}
• F(t) = 1 - exp[-λt]
• Survivor function: S(t) = 1 - F(t) = exp[-λt]
• Taking logarithms, ln S(t) = -λt
Postwar Expansions
[Figure: ln survivor function vs. duration in months, with fitted trend line y = -0.0217x + 0.1799, R² = 0.9533.]
So λ ≈ 0.022
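The fitted slope can be recovered by ordinary least squares on the (duration, ln survivor) points, including (0, ln 1) and dropping t = 120 where S = 0 (a sketch; the data are the Kaplan-Meier estimates above):

```python
import math

# Least-squares fit of ln S(t) = a + b*t to the Kaplan-Meier estimates.
# The point at t = 120 (S = 0) is dropped because ln 0 is undefined.
points = [(0, 1.0), (12, 0.9), (24, 0.8), (36, 0.7), (37, 0.6),
          (39, 0.5), (45, 0.4), (58, 0.3), (92, 0.2), (106, 0.1)]
t = [p[0] for p in points]
y = [math.log(p[1]) for p in points]
n = len(points)
t_bar, y_bar = sum(t) / n, sum(y) / n

b = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y)) \
    / sum((ti - t_bar) ** 2 for ti in t)
a = y_bar - b * t_bar

print(round(b, 4), round(a, 4))  # slope and intercept, matching y = -0.0217x + 0.1799
print(round(-b, 3))              # lambda ≈ 0.022
```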
Exponential Distribution (Cont.)
• Mean = 1/λ = ∫₀^∞ t·f(t) dt
• Memoryless feature:
• Duration conditional on surviving until t = τ:
• DURC(τ) = ∫_τ^∞ t·f(t) dt / S(τ) = τ + 1/λ
• Expected remaining duration = duration conditional on surviving until time τ, i.e. DURC, minus τ
• Or 1/λ, which equals the overall mean, so the distribution is memoryless
Exponential Distribution (Cont.)
• The hazard rate or function, h(t), is the probability of failure conditional on survival until that time, and is the ratio of the density function to the survivor function. It is constant for the exponential.
• h(t) = f(t)/S(t) = λ exp[-λt]/exp[-λt] = λ
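Both the memoryless feature and the constant hazard can be checked by simulation (a sketch; λ comes from the fit above, the cutoff τ is an illustrative assumption):

```python
import random

# Memoryless check: for exponential durations with rate lam, the mean
# remaining duration beyond any survival time tau equals the overall mean 1/lam.
random.seed(1)
lam = 0.022   # rate estimated above
tau = 40      # illustrative survival time (an assumption)
samples = [random.expovariate(lam) for _ in range(200_000)]

overall_mean = sum(samples) / len(samples)
remaining = [x - tau for x in samples if x > tau]
remaining_mean = sum(remaining) / len(remaining)

# Both means should be close to 1/lam ≈ 45.5 months.
print(round(overall_mean, 1), round(remaining_mean, 1))
```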
Model Building
• Reference: Ch 20
20.2 Polynomial Models
• There are models where the independent variables (xi) may appear as functions of a smaller number of predictor variables.
• Polynomial models are one such example.
Polynomial Models with One Predictor Variable
y = b0 + b1x1 + b2x2 + … + bpxp + e
y = b0 + b1x + b2x^2 + … + bpx^p + e
Polynomial Models with One Predictor Variable
• First order model (p = 1): y = b0 + b1x + e
• Second order model (p = 2): y = b0 + b1x + b2x^2 + e
  (the parabola opens downward when b2 < 0 and upward when b2 > 0)
Polynomial Models with One Predictor Variable
• Third order model (p = 3): y = b0 + b1x + b2x^2 + b3x^3 + e
  (the curve's direction depends on whether b3 < 0 or b3 > 0)
Polynomial Models with Two Predictor Variables
• First order model: y = b0 + b1x1 + b2x2 + e
  (a plane in (x1, x2, y) space)
20.3 Nominal Independent Variables
• In many real-life situations one or more
independent variables are nominal.
• Including nominal variables in a regression
analysis model is done via indicator
variables.
• An indicator variable (I) can assume one out of two values, "zero" or "one":

  I = 1 if the first condition out of two is met
      0 if the second condition out of two is met

• Examples:
  I = 1 if data were collected before 1980; 0 if data were collected after 1980
  I = 1 if the temperature was below 50°; 0 if the temperature was 50° or more
  I = 1 if a degree earned is in Finance; 0 if a degree earned is not in Finance
Nominal Independent Variables;
Example: Auction Car Price (II)
• Example 18.2 - revised (Xm18-02a)
– Recall: A car dealer wants to predict the auction
price of a car.
– The dealer believes now that odometer reading
and the car color are variables that affect a
car’s price.
– Three color categories are considered:
  • White
  • Silver
  • Other colors
  Note: Color is a nominal variable.
Nominal Independent Variables;
Example: Auction Car Price (II)
• Example 18.2 - revised (Xm18-02b)
I1 = 1 if the color is white; 0 if the color is not white
I2 = 1 if the color is silver; 0 if the color is not silver
The category "Other colors" is defined by: I1 = 0; I2 = 0
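The two indicator variables can be built with a small helper (the function name is mine):

```python
# Map the three-category color variable into the two indicator variables,
# with "Other colors" as the omitted base category (I1 = I2 = 0).
def color_indicators(color):
    i1 = 1 if color == "white" else 0   # I1: white
    i2 = 1 if color == "silver" else 0  # I2: silver
    return i1, i2

print(color_indicators("white"))   # (1, 0)
print(color_indicators("silver"))  # (0, 1)
print(color_indicators("blue"))    # (0, 0) -> "Other colors"
```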
How Many Indicator Variables?
• Note: To represent the situation of three
possible colors we need only two indicator
variables.
• Conclusion: To represent a nominal
variable with m possible categories, we
must create m-1 indicator variables.
Nominal Independent Variables;
Example: Auction Car Price
• Solution
– the proposed model is
y = b0 + b1(Odometer) + b2I1 + b3I2 + e
– The data (first six rows):

Price   Odometer   I-1   I-2
14636   37388      1     0     White car
14122   44758      1     0     White car
14016   45833      0     0     Other color
15590   30862      0     0     Other color
15568   31705      0     1     Silver color
14718   34010      0     1     Silver color
.       .          .     .
Example: Auction Car Price
The Regression Equation
From Excel (Xm18-02b) we get the regression equation
PRICE = 16701-.0555(Odometer)+90.48(I-1)+295.48(I-2)
The fitted equation gives three parallel price lines against Odometer:
Price = 16701 - .0555(Odometer) + 90.48(1) + 295.48(0)   (white)
Price = 16701 - .0555(Odometer) + 90.48(0) + 295.48(1)   (silver)
Price = 16701 - .0555(Odometer) + 90.48(0) + 295.48(0)   (other colors)
Example: Auction Car Price
The Regression Equation
From Excel we get the regression equation
PRICE = 16701-.0555(Odometer)+90.48(I-1)+295.48(I-2)
For one additional mile the auction price decreases by 5.55 cents.
A white car sells, on average, for $90.48 more than a car in the "Other colors" category.
A silver car sells, on average, for $295.48 more than a car in the "Other colors" category.
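Plugging the indicator values into the fitted equation gives the predicted price for each color; a sketch at an illustrative odometer reading (40,000 miles is my assumption, not from the slides):

```python
# Predicted auction price from the fitted regression, by color category.
def predicted_price(odometer, i1, i2):
    return 16701 - 0.0555 * odometer + 90.48 * i1 + 295.48 * i2

odo = 40_000  # illustrative odometer reading (an assumption)
print(round(predicted_price(odo, 1, 0), 2))  # white
print(round(predicted_price(odo, 0, 1), 2))  # silver
print(round(predicted_price(odo, 0, 0), 2))  # other colors
```

At any odometer reading the white and silver lines sit $90.48 and $295.48 above the "Other colors" line, which is exactly the coefficient interpretation above.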
Example: Auction Car Price
The Regression Equation
SUMMARY OUTPUT (Xm18-02b)

Regression Statistics
Multiple R          0.8355
R Square            0.6980
Adjusted R Square   0.6886
Standard Error      284.5
Observations        100

ANOVA
             df    SS         MS        F       Significance F
Regression    3    17966997   5988999   73.97   0.0000
Residual     96     7772564     80964
Total        99    25739561

             Coefficients   Standard Error   t Stat   P-value
Intercept    16701          184.33            90.60   0.0000
Odometer     -0.0555        0.0047           -11.72   0.0000
I-1          90.48          68.17              1.33   0.1876
I-2          295.48         76.37              3.87   0.0002

There is insufficient evidence to infer that a white car and a car of "other color" sell for different auction prices.
There is sufficient evidence to infer that a silver car sells for a higher price than a car of the "other color" category.
Nominal Independent Variables;
Example: MBA Program Admission
(MBA II)
• Recall: The Dean wanted to evaluate applications for
the MBA program by predicting future performance of
the applicants.
• The following three predictors were suggested:
– Undergraduate GPA
– GMAT score
– Years of work experience
• It is now believed that the type of undergraduate degree should be included in the model. (Note: the undergraduate degree is nominal data.)
Nominal Independent Variables;
Example: MBA Program Admission
(II)
I1 = 1 if B.A.; 0 otherwise
I2 = 1 if B.B.A.; 0 otherwise
I3 = 1 if B.Sc. or B.Eng.; 0 otherwise
The category "Other group" is defined by: I1 = 0; I2 = 0; I3 = 0
Nominal Independent Variables;
Example: MBA Program Admission
(II)
SUMMARY OUTPUT (MBA-II)

Regression Statistics
Multiple R          0.7461
R Square            0.5566
Adjusted R Square   0.5242
Standard Error      0.729
Observations        89

ANOVA
             df    SS      MS     F       Significance F
Regression    6    54.75   9.13   17.16   0.0000
Residual     82    43.62   0.53
Total        88    98.37

             Coefficients   Standard Error   t Stat   P-value
Intercept     0.19          1.41              0.13    0.8930
UnderGPA     -0.0061        0.114            -0.05    0.9577
GMAT          0.0128        0.0014            9.43    0.0000
Work          0.098         0.030             3.24    0.0017
I-1          -0.34          0.22             -1.54    0.1269
I-2           0.71          0.24              2.93    0.0043
I-3           0.03          0.21              0.17    0.8684
20.4 Applications in Human
Resources Management: Pay-Equity
• Pay-equity can be handled in two different
forms:
– Equal pay for equal work
– Equal pay for work of equal value.
• Regression analysis is extensively
employed in cases of equal pay for equal
work.
Human Resources Management:
Pay-Equity
• Solution
– Construct the following multiple regression
model:
y = b0 + b1Education + b2Experience + b3Gender + e
– Note the nature of the variables:
• Education – Interval
• Experience – Interval
• Gender – Nominal (Gender = 1 if male; 0 otherwise).
Human Resources Management:
Pay-Equity
• Solution – Continued (Xm20-03)
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.8326
R Square            0.6932
Adjusted R Square   0.6836
Standard Error      16274
Observations        100

ANOVA
             df    SS            MS            F       Significance F
Regression    3    57434095083   19144698361   72.29   0.0000
Residual     96    25424794888   264841613.4
Total        99    82858889971

             Coefficients   Standard Error   t Stat   P-value
Intercept    -5835.1        16082.8          -0.36    0.7175
Education     2118.9         1018.5           2.08    0.0401
Experience    4099.3          317.2          12.92    0.0000
Gender        1851.0         3703.1           0.50    0.6183

Analysis and interpretation:
• The model fits the data quite well.
• The model is very useful.
• Experience is a variable strongly related to salary.
• There is no evidence of sex discrimination.
Human Resources Management:
Pay-Equity
• Solution – Continued (Xm20-03)
• Further studying the data we find:
– Average experience (years) for women is 12; for men it is 17.
– Average salary for a female manager is $76,189; for a male manager, $97,832.
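The point of these averages is that the raw salary gap is largely explained by the experience difference; a quick check using the fitted Experience coefficient from the output above:

```python
# How much of the raw male-female salary gap does the 5-year experience
# difference explain, using the fitted Experience coefficient?
exp_coef = 4099.3                  # salary increase per year of experience
raw_gap = 97832 - 76189            # male minus female average salary
explained = exp_coef * (17 - 12)   # experience difference in years

print(raw_gap)                     # 21643
print(round(explained, 1))         # 20496.5
```

Nearly all of the $21,643 raw gap is accounted for by experience, consistent with the insignificant Gender coefficient (P-value = 0.6183).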