DATA ANALYSIS - DCU School of Computing

DATA ANALYSIS
Module Code CA660
Supplementary
Extended examples
Extended example – Value of information
Recall: price of a new computer tablet. When expected payoffs are used as the decision strategy, the action selected is the one with the largest expected payoff. Hence,
Expected Value of Perfect Information (EVPI)
= (Average payoff using a perfect predictor) - (Average payoff for whatever action is actually selected)
States of nature (Si) = time until a major competitor introduces a similar product. Maximum payoffs for each Si are summarised below, together with the supposed probabilities P{Si}:

State of nature       Max. payoff (millions)   P{Si}
S1  < 6 months        250 (for A1)             0.1
S2  6-12 months       320 (for A1)             0.5
S3  12-18 months      410 (for A4)             0.3
S4  > 18 months       550 (for A4)             0.1
Exp. Payoff (using perfect predictor) = (0.1)(250) + (0.5)(320) + (0.3)(410) + (0.1)(550) = 363
So,
EVPI = 363 - 330 = 33 (million)
N.B. This is the upper limit on what the company should pay for any information about the future. In practice the predicted state of nature is only an estimate, so the real value of the information depends e.g. on consultant reliability, volatility of the market etc.
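The EVPI arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the variable names are illustrative, and the figure of 330 (expected payoff of the action actually selected) is taken as given from the EVPI line rather than derived here.

```python
# Probabilities P{Si} and maximum payoffs (millions) for S1..S4, as listed above.
probs = [0.1, 0.5, 0.3, 0.1]
max_payoffs = [250, 320, 410, 550]

# Expected payoff of the action selected without extra information.
# Assumption: 330 is simply copied from the EVPI line, not recomputed.
best_expected_payoff = 330

# With a perfect predictor, the best action is always taken for each state.
expected_with_perfect_info = sum(p * m for p, m in zip(probs, max_payoffs))
evpi = expected_with_perfect_info - best_expected_payoff

print(round(expected_with_perfect_info, 2))
print(round(evpi, 2))
```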
Example contd.
Not using expected payoff as the decision base: e.g. gambling, e.g. insure vs do not. What about risk? Can use the variance of the respective payoffs, as seen earlier.
Need a way to combine a given attitude to payoff with the corresponding risk of each alternative (profit vs loss): the Utility Value
Steps:
1. Assign utility values to the smallest and largest payoffs on a scale U from 0 to 100, so that U(Min) = 0, U(Max) = 100
(Relative values important)
2. Utility value for any payoff (F) to be considered = U(F) = P x 100. [P = the probability of getting the Max payoff (with the Min payoff at prob. 1-P) that would make this gamble equally attractive to getting payoff F with certainty].
Note: this probability relates to willingness to take a risk, not to P{Si}
Check attitude to risk by comparing the Utility vs Profit curve with the straight line joining (Min, 0) and (Max, 100). If the curve lies above the line = risk avoider
So suppose we have utility values for a simple example:

Action            S1 (P{S1} = 0.7)   S2 (P{S2} = 0.3)   Exp. Utility
A1 Gilt-edged     50                 50                 (50)(0.7) + (50)(0.3) = 50
A2 Oil well       0                  100                (0)(0.7) + (100)(0.3) = 30

Larger expected utility is associated with gilt-edged (as for a risk avoider)
Example contd.
So from the table of payoffs for the computer tablet selling-price decision, Min = 80, Max = 550, so U(80) = 0, U(550) = 100.
The easiest way to determine utility values for the other 14 possible payoffs is to sketch a curve of U vs Profit (Payoff).
Typically, pick values in the range between the min and max payoff and ask the question: "For a payoff of 200, say, what value of P would make getting that payoff with certainty as attractive as a payoff of 550 with prob. P and a payoff of 80 with prob. 1-P?"
If the decision maker responds that P = 0.55 is acceptable, then
U(200) = 0.55 x 100 = 55
Curve basis

Hypothetical payoff   100    150    200    300    400    500
Prob. P               0.2    0.4    0.55   0.75   0.90   0.97
Utility U             20     40     55     75     90     97

Thus for actual payoffs corresponding to Ai, Si, read off the curve.
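Reading utilities off the curve can be approximated in code. The sketch below assumes straight-line interpolation between the elicited points, which is a simplification: a smooth hand-sketched curve will typically give slightly different readings. The function name `utility` is illustrative only.

```python
# Elicited (payoff, utility) points from the "Curve basis" table, plus the
# fixed end points U(80) = 0 and U(550) = 100.
points = [(80, 0), (100, 20), (150, 40), (200, 55),
          (300, 75), (400, 90), (500, 97), (550, 100)]

def utility(payoff):
    """Piecewise-linear stand-in for the sketched utility curve (an assumption)."""
    if not points[0][0] <= payoff <= points[-1][0]:
        raise ValueError("payoff outside the elicited range")
    # Find the segment containing the payoff and interpolate linearly on it.
    for (x0, u0), (x1, u1) in zip(points, points[1:]):
        if x0 <= payoff <= x1:
            return u0 + (u1 - u0) * (payoff - x0) / (x1 - x0)

print(utility(200))  # an elicited point
print(utility(250))  # a value between elicited points
```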
Example contd.

Action               S1 (0.1)   S2 (0.5)   S3 (0.3)   S4 (0.1)   Expected Utility
A1: price at 1500    67         79         84         90         80.4
A2: price at 1750    40         68         75         87         69.2
A3: price at 2000    30         74         88         94         75.8
A4: price at 2500    0          72         91         100        73.3
Where e.g. for Action A2:
Expected Utility = (0.1)(40) + (0.5)(68) + (0.3)(75) + (0.1)(87) = 69.2
Choosing the action with largest expected utility, the decision is to select
action A1 (selling price 1500). For this particular example, it appears that A1
maximises both expected payoff and expected utility.
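The expected-utility comparison can be reproduced as follows. This is a small sketch using the utilities and state probabilities from the table; the dictionary layout and names are illustrative choices.

```python
# State probabilities P{S1}..P{S4} and utilities per action, from the table.
probs = [0.1, 0.5, 0.3, 0.1]
utilities = {
    "A1": [67, 79, 84, 90],   # price at 1500
    "A2": [40, 68, 75, 87],   # price at 1750
    "A3": [30, 74, 88, 94],   # price at 2000
    "A4": [0, 72, 91, 100],   # price at 2500
}

# Expected utility = probability-weighted sum of utilities over the states.
expected_utility = {
    action: sum(p * u for p, u in zip(probs, us))
    for action, us in utilities.items()
}
best_action = max(expected_utility, key=expected_utility.get)

for action, eu in expected_utility.items():
    print(action, round(eu, 1))
print("Selected:", best_action)
```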
Examples in General Linear Models - ANOVA
• Suppose, as part of QA, components are subjected to a strength test, where rods with pointed tips are forced into a hinge and movement is measured. There are 3 types of tip available; 10 hinges are randomly selected and all three tips are used against each hinge.
$y_{ij} = \mu + T_i + \varepsilon_{ij}$, with $y \sim NID(\mu, \sigma_m^2)$, $T \sim NID(0, \sigma_T^2)$, $\varepsilon \sim NID(0, \sigma_e^2)$
The coded data (RANDOMISED BLOCK design) are:

Hinge:    1    2    3    4    5    6    7    8    9    10
Tip #1    68   40   82   56   70   80   47   55   78   53
Tip #2    72   43   89   60   75   91   58   68   77   65
Tip #3    65   42   84   50   68   86   50   52   75   60
• If we assume Normality (interested in Random effects) and assume zero covariance between tip effects and error:
$\sigma_m^2 = \sigma_T^2 + \sigma^2$, where $\sigma^2 = \sigma_e^2$ is the error variance
Example - RB Models contd.
• What about other factors, e.g. Process and TP interactions?
Extension to Simple Model: Tip strength effects measured within blocks.
$y_{ijk} = \mu + T_i + P_j + (TP)_{ij} + \varepsilon_{ijk}$
ANOVA Table: Randomized Blocks within Process. For b = replications.
Focus - on Tip strength

Source      dof             Expected MSQ
Process     p-1             know there are differences
Blocks      (b-1)p          again, know there are differences
Tip type    T-1             $\sigma^2 + b\sigma_{TP}^2 + bp\sigma_T^2$
TP          (T-1)(p-1)      $\sigma^2 + b\sigma_{TP}^2$
Error       (b-1)(T-1)p     $\sigma^2$
Note: individuals are blocked within processes, so the process effect is intrinsic to error.
The model form is standard, but the only meaningful comparisons are within process, hence the form of the random error = population variance = $\sigma^2$; so the random effects of interest are obtained from ratios of variances.
Example contd.
ANOVA Table

Source             dof                   SSQ       MSQ = SSQ/dof   F-Ratio = MSQ/MSE
Factor (Tips)      k-1 = 2               304.2     152.1           15.63 *
Blocks (Hinges)    b-1 = 9               5705.0    633.9           65.15 *
Error              (k-1)(b-1) = 18       175.1     9.73
______________________________________________________________________
Total              29                    6184.3
Note: If the 30 observations are treated as 10 replicates for each of the 3 tips (one-way design) and blocking is ignored, F is not significant
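The randomised-block sums of squares and F-ratios can be recomputed directly from the coded data. The following is a minimal sketch in plain Python using the usual correction-factor formulas; variable names are illustrative, and the last line shows the one-way analysis that ignores blocking.

```python
# Coded strength data: one row per tip type, one column per hinge (block).
data = [
    [68, 40, 82, 56, 70, 80, 47, 55, 78, 53],  # Tip #1
    [72, 43, 89, 60, 75, 91, 58, 68, 77, 65],  # Tip #2
    [65, 42, 84, 50, 68, 86, 50, 52, 75, 60],  # Tip #3
]
k = len(data)       # number of tips (treatments)
b = len(data[0])    # number of hinges (blocks)
n = k * b

grand_total = sum(sum(row) for row in data)
correction = grand_total ** 2 / n   # correction factor (grand total squared / n)

ss_total = sum(y * y for row in data for y in row) - correction
ss_tips = sum(sum(row) ** 2 for row in data) / b - correction
ss_blocks = sum(sum(col) ** 2 for col in zip(*data)) / k - correction
ss_error = ss_total - ss_tips - ss_blocks

ms_tips = ss_tips / (k - 1)
ms_blocks = ss_blocks / (b - 1)
ms_error = ss_error / ((k - 1) * (b - 1))

f_tips = ms_tips / ms_error
f_blocks = ms_blocks / ms_error

# One-way analysis ignoring blocking: block variation is absorbed into error.
f_one_way = ms_tips / ((ss_total - ss_tips) / (n - k))

print(round(ss_tips, 1), round(ss_blocks, 1), round(ss_error, 1), round(ss_total, 1))
print(round(f_tips, 2), round(f_blocks, 2), round(f_one_way, 2))
```

Comparing `f_tips` with `f_one_way` illustrates why blocking on hinges matters here: most of the variation is between hinges.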
Example – factorial design
Sickness claims costs by employee category and sex: 2 factors + interaction.
p = 3 replicates per cell
$y_{ijk} = \mu + C_i + S_j + (CS)_{ij} + \varepsilon_{ijk}$

Employee Classification (cell totals in parentheses)

Sex      C1              C2              C3              C4              Total   Ave.
M        190, 225, 200   135, 180, 100   260, 330, 350   305, 275, 240   2790    232.50
         (615)           (415)           (940)           (820)
F        235, 190, 270   275, 305, 285   160, 205, 140   155, 110, 75    2405    200.42
         (695)           (865)           (505)           (340)
Total    1310            1280            1445            1160            5195
Example contd.
ANOVA Table

Source                   dof                 SSQ          MSQ = SSQ/dof   F-Ratio = MSQ/MSE
Factor 1 (Sex)           r-1 = 1             6176.04      6176.04         5.05 *
Factor 2 (Emp. Class)    c-1 = 3             6853.12      2284.37         1.87
Interaction              (r-1)(c-1) = 3      98578.13     32859.38        26.87 *
Error                    rc(p-1) = 16        19566.67     1222.92
______________________________________________________________________
Total                    rcp-1 = 23          131173.96

Here SS(Interaction) = SS(Cells) - SS(Factor 1) - SS(Factor 2).
Note: To see effect of interaction, sketch amounts claimed (on average) vs
employment category, for M and F
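The two-factor table can be recomputed from the cell data. This is a sketch in plain Python under the usual balanced-design formulas, with r = 2 sexes, c = 4 classes and p = 3 replicates per cell; the dictionary layout and names are illustrative.

```python
# Claims data: 3 replicates per (sex, employee class) cell.
cells = {
    ("M", "C1"): [190, 225, 200], ("M", "C2"): [135, 180, 100],
    ("M", "C3"): [260, 330, 350], ("M", "C4"): [305, 275, 240],
    ("F", "C1"): [235, 190, 270], ("F", "C2"): [275, 305, 285],
    ("F", "C3"): [160, 205, 140], ("F", "C4"): [155, 110, 75],
}
sexes = ["M", "F"]
classes = ["C1", "C2", "C3", "C4"]
r, c, p = len(sexes), len(classes), 3
n = r * c * p

grand_total = sum(sum(v) for v in cells.values())
correction = grand_total ** 2 / n

ss_total = sum(y * y for v in cells.values() for y in v) - correction
ss_sex = sum(sum(sum(cells[s, k]) for k in classes) ** 2 for s in sexes) / (c * p) - correction
ss_class = sum(sum(sum(cells[s, k]) for s in sexes) ** 2 for k in classes) / (r * p) - correction
ss_cells = sum(sum(v) ** 2 for v in cells.values()) / p - correction
ss_inter = ss_cells - ss_sex - ss_class   # interaction = cells minus main effects
ss_error = ss_total - ss_cells

ms_error = ss_error / (r * c * (p - 1))   # error dof = rc(p-1) = 16
f_sex = (ss_sex / (r - 1)) / ms_error
f_class = (ss_class / (c - 1)) / ms_error
f_inter = (ss_inter / ((r - 1) * (c - 1))) / ms_error

print(round(ss_sex, 2), round(ss_class, 2), round(ss_inter, 2), round(ss_error, 2))
print(round(f_sex, 2), round(f_class, 2), round(f_inter, 2))
```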
Multiple Populations: Mendel - χ² and G

                Round Seed          Wrinkled Seed
Plant           Count   Expected    Count   Expected    dof   χ²     p-value   G      p-value
1               45      42.75       12      14.25       1     0.47   0.49      0.49   0.49
2               27      26.25       8       8.75        1     0.09   0.77      0.09   0.77
3               24      23.25       7       7.75        1     0.10   0.76      0.10   0.75
4               19      21.75       10      7.25        1     1.39   0.24      1.30   0.26
5               32      32.25       11      10.75       1     0.01   0.93      0.01   0.93
6               26      24.00       6       8.00        1     0.67   0.41      0.71   0.40
7               88      84.00       24      28.00       1     0.76   0.38      0.79   0.38
8               22      24.00       10      8.00        1     0.67   0.41      0.63   0.43
9               28      25.50       6       8.50        1     0.98   0.32      1.06   0.30
10              25      24.00       7       8.00        1     0.17   0.68      0.17   0.68
Total           336                 101                 10    5.30             5.34
Pooled          336     327.75      101     109.25      1     0.83   0.36      0.85   0.36
Heterogeneity                                           9     4.47   0.88      4.50   0.88
Multiple Populations - summary
• Parallels χ²:
$G_{Total} = 2 \sum_{i=1}^{p} \sum_{j=1}^{n} O_{ij} \log\left(\frac{O_{ij}}{E_{ij}}\right)$
• Partitions therefore:
$G_{Pooled} = 2 \sum_{j=1}^{n} \left(\sum_{i=1}^{p} O_{ij}\right) \log\left(\frac{\sum_{i=1}^{p} O_{ij}}{\sum_{i=1}^{p} E_{ij}}\right)$
and $G_{heterogeneity} = G_{Total} - G_{Pooled}$
(n = no. classes, p = no. populations)
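The partition of G (and the parallel χ² partition) can be illustrated on the Mendel counts. The sketch below assumes the 3:1 round:wrinkled expectation implied by the expected counts in the table, uses natural logarithms, and computes only the statistics (no p-values); function names are illustrative.

```python
from math import log

# Observed round / wrinkled seed counts for the 10 plants (p = 10 populations,
# n = 2 classes).
round_counts = [45, 27, 24, 19, 32, 26, 88, 22, 28, 25]
wrinkled_counts = [12, 8, 7, 10, 11, 6, 24, 10, 6, 7]
ratios = (0.75, 0.25)  # assumed 3:1 expectation, matching the tabulated expected counts

def expected_counts(observed):
    total = sum(observed)
    return [total * q for q in ratios]

def g_stat(observed, expected):
    # G = 2 * sum O * log(O / E)
    return 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)

def chi2_stat(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

plants = list(zip(round_counts, wrinkled_counts))

# Total: sum of the per-plant statistics (10 dof).
g_total = sum(g_stat(obs, expected_counts(obs)) for obs in plants)
chi2_total = sum(chi2_stat(obs, expected_counts(obs)) for obs in plants)

# Pooled: one test on the summed counts (1 dof).
pooled = (sum(round_counts), sum(wrinkled_counts))
g_pooled = g_stat(pooled, expected_counts(pooled))
chi2_pooled = chi2_stat(pooled, expected_counts(pooled))

# Heterogeneity: the remainder (9 dof).
g_het = g_total - g_pooled
chi2_het = chi2_total - chi2_pooled

print(round(g_total, 2), round(g_pooled, 2), round(g_het, 2))
print(round(chi2_total, 2), round(chi2_pooled, 2), round(chi2_het, 2))
```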
Smoothing Time series – choice of smoothing constant for exponentially smoothed forecasts
Suppose actual weekly demand Yt is as below. We want smoothed values St for smoothing constants (α) of 0.1 and 0.8, contrasted; S1 = Y1 by convention.

Week   Actual demand Yt   Forecast St (α = 0.1)   Forecast St (α = 0.8)
1      100                100                     100
2      100                100                     100
3      100                100                     100
4      100                100                     100
5      150                105                     140
6      100                104.5                   108
7      100                104.05                  101.6
8      100                103.645                 100.32
9      100                103.2805                100.064
10     100                102.95245               100.0128
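The two smoothed series can be regenerated with a short function. This sketch assumes the standard recursion St = αYt + (1 - α)S(t-1) with S1 = Y1, which reproduces the tabulated values; the function name is illustrative.

```python
# Actual weekly demand Yt for weeks 1..10.
demand = [100, 100, 100, 100, 150, 100, 100, 100, 100, 100]

def exponential_smooth(y, alpha):
    """Return smoothed values St, with S1 = Y1 by convention."""
    smoothed = [y[0]]
    for value in y[1:]:
        # New smoothed value: weight alpha on the latest observation,
        # (1 - alpha) on the previous smoothed value.
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

s_low = exponential_smooth(demand, 0.1)    # slow to react, slow to recover
s_high = exponential_smooth(demand, 0.8)   # reacts strongly, recovers quickly

for week, (y, a, b) in enumerate(zip(demand, s_low, s_high), start=1):
    print(week, y, round(a, 5), round(b, 5))
```

The contrast shows the trade-off in choosing α: a small value damps the one-off spike in week 5 but carries its effect for many weeks, while a large value follows the spike closely and then returns to 100 almost immediately.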