Optimal Learning:
Efficient Data Collection in the Information Age
Industrial Engineering Research Conference
Cancun, Mexico
June 6, 2010
Warren B. Powell
With research by
Peter Frazier
Ilya Ryzhov
Warren Scott
Princeton University
© 2010 Warren B. Powell, Princeton University
Energy technology

Retrofitting buildings with new energy technologies. Different combinations of technologies interact, with behaviors that depend on the characteristics of the building. Potential technologies include:
• Window tinting, insulation
• Energy-efficient lighting
• Advanced thermostats
• … many others
We need to try different combinations of technologies to build up a knowledge base on different interactions, in different settings.
Finding the best path

Figuring out how to get around Manhattan:
• Walking
• Subway/walking
• Taxi
• Street bus
• Driving
Finding effective compounds

Materials research:
• How do we find the best material for converting sunlight to electricity?
• What is the best battery design for storing energy?
We need a method to sort through potentially thousands of experiments.
Applications

Pandemic disease control
Face masks are effective at disease containment, but it is better to test people for the disease. Since we cannot test everyone, who do we test?
Applications

Finding good designs
How do we optimize the dimensions of tubes, plates and distances in an aerosol device? Each design requires several hours to set up and execute. Five parameters determine the effectiveness of the spray.
The nomadic trucker illustration

[Figures: a network of locations (MN, CO, NY, CA, TX) with the trucker's value estimates at the nodes and the payoffs of offered loads on the arcs.]
Initially every value estimate is zero: V^0(MN) = V^0(CO) = V^0(NY) = V^0(CA) = 0. Offered loads paying $350, $150, $450 and $300, the trucker takes the $450 load to Texas and updates V^1(TX) = 450. From Texas, the new offers pay $180, $400, $600 and $125; taking the $600 load updates the estimate for New York to 600. Out of New York the offers are $550, $350, $150 and $250, and the process continues, gradually building estimates of the value of being in each location.
Applications
Outline
The challenge of learning
The knowledge gradient policy
The knowledge gradient with correlated beliefs
The knowledge gradient for on-line learning
Applications
The challenge of learning

Deterministic optimization
Find the choice with the highest reward (assumed known):

Choice  Value
1       759   ← The winner!
2       722
3       698
4       653
5       616
The challenge of learning

Stochastic optimization
Now assume the reward you will earn is stochastic, drawn from a normal distribution. The reward is revealed after the choice is made.

Choice  Mean  Std dev
1       759   120   ← The winner!
2       722   142
3       698   133
4       653    90
5       616   102
The challenge of learning

Optimal learning
Now you have a budget of 10 measurements to determine which of the 5 choices is best. You have an estimate of the performance of each, but you are unsure and you are willing to update your belief.

Choice  Mean  Std dev  Observation
1       759   120      702
2       722    78
3       698   133
4       653    90
5       616   102

Choice  Mean  Std dev  Observation
1       712    96
2       722    78      734
3       698   133
4       653    90
5       616   102

Choice  Mean  Std dev
1       712    96
2       726    64
3       698   133
4       653    90
5       616   102

• … It is no longer obvious which you should try first.
The challenge of learning

At first, we believe that

    \mu_x \sim N(\theta_x^0, 1/\beta_x^0)

But we measure alternative x and observe

    \hat{y}_x^1 \sim N(\mu_x, 1/\beta^y)

Our beliefs change:

    \theta_x^1 = \frac{\beta_x^0 \theta_x^0 + \beta^y \hat{y}_x^1}{\beta_x^0 + \beta^y}, \qquad \beta_x^1 = \beta_x^0 + \beta^y

where \beta = 1/\sigma^2 denotes a precision. Thus, our beliefs about the rewards are gradually improved over measurements.
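The update above is easy to sketch in code; here is a minimal Python version (the function name and argument names are mine, not from the talk):

```python
def bayes_update(theta, beta, y_hat, beta_y):
    """One Bayesian update of an independent normal belief.

    theta:  current estimate of the mean reward
    beta:   precision (1 / variance) of that belief
    y_hat:  observed reward
    beta_y: precision of the measurement noise
    Returns the updated (theta, beta).
    """
    theta_new = (beta * theta + beta_y * y_hat) / (beta + beta_y)
    beta_new = beta + beta_y
    return theta_new, beta_new
```

For example, starting from a belief of 759 with standard deviation 120, an observation pulls the estimate toward the observed value in proportion to the relative precisions.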
The challenge of learning

Now assume we have five choices, with uncertainty in our belief about how well each one will perform. If you can make one measurement, which would you measure?

[Figures: belief distributions for choices 1–5. One measurement may produce no improvement, or it may reveal a new solution.]

The value of learning is that it may change your decision.
The challenge of learning

The measurement problem
We wish to design a sequential measurement policy, where each measurement depends on previous choices. We can formulate this as a dynamic program:

    V^n(S^n) = \max_x \left( C(S^n, x) + E\left[ V^{n+1}(S^{n+1}) \,|\, S^n \right] \right)

… but it is a little different than most dynamic programs that focus on the physical state.
The challenge of learning

Optimal routing over a graph
Here S^n is a node in the network:

    V^n(S^n) = \max_x \left( C(S^n, x) + E\left[ V^{n+1}(S^{n+1}) \,|\, S^n \right] \right)

S^n is the current node (e.g. node 2), x is the decision to go to a node (e.g. 5), and S^{n+1} is the downstream node (e.g. 5).
The challenge of learning

Learning problems
Now S^n is our "state of knowledge", e.g. S_5 = N(\theta_5, \sigma_5^2) for alternative 5:

    V^n(S^n) = \max_x \left( C(S^n, x) + E\left[ V^{n+1}(S^{n+1}) \,|\, S^n \right] \right)

Here S^n is the current state of knowledge, x is the decision to make a measurement, and S^{n+1} is the new state of knowledge.
The challenge of learning

Heuristic measurement policies
Pure exploitation – Always make the choice that appears to be the best.
Pure exploration – Make choices at random so that you are always learning more.
Hybrid (epsilon greedy)
• Explore with probability \epsilon and exploit with probability 1 - \epsilon.
• Declining exploration – explore with probability \epsilon_n = c/n. Goes to zero as n \to \infty, but not too quickly.
Boltzmann exploration
• Explore choice x with probability

    p_x^n = \frac{\exp(\theta_x^n / T)}{\sum_{x'} \exp(\theta_{x'}^n / T)}

where T is a temperature parameter.
Interval estimation
• Choose the x which maximizes \theta_x^n + z \sigma_x^n, where z is a tunable parameter and \sigma_x^n is the standard deviation of the estimate of the mean.
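The heuristic policies above can be sketched as simple choice rules. A minimal Python version (the function names and the temperature convention are my own):

```python
import math
import random

def epsilon_greedy(theta, eps):
    """Explore with probability eps, otherwise exploit the current best estimate."""
    if random.random() < eps:
        return random.randrange(len(theta))
    return max(range(len(theta)), key=lambda x: theta[x])

def boltzmann(theta, temperature):
    """Sample choice x with probability proportional to exp(theta[x] / temperature)."""
    m = max(theta)
    weights = [math.exp((t - m) / temperature) for t in theta]  # subtract max for stability
    r = random.uniform(0.0, sum(weights))
    for x, w in enumerate(weights):
        r -= w
        if r <= 0:
            return x
    return len(theta) - 1

def interval_estimation(theta, sigma, z):
    """Choose the x maximizing theta[x] + z * sigma[x]."""
    return max(range(len(theta)), key=lambda x: theta[x] + z * sigma[x])
```

Note how interval estimation can prefer a lower mean with a higher standard deviation, which is the exploration bonus the tunable z controls.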
The knowledge gradient

Basic principle:
Assume you can make only one measurement, after which you have to make a final choice (the implementation decision). What choice would you make now to maximize the expected value of the implementation decision?

[Figure: a change in the estimate of the value of option 5, due to measurement, produces a change in the decision.]
The knowledge gradient

General model
Off-line learning – We have a measurement budget of N observations. After we do our measurements, we have to make an implementation decision.

Notation:
y = implementation decision
K^n = our state of knowledge after n measurements
F(y, K^n) = value of making decision y given knowledge K^n
x^n = measurement decision after n measurements
W_x^{n+1} = observation resulting from measuring x^n = x
K^{n+1}(x) = updated distribution of belief about costs after observing W_x^{n+1}
The knowledge gradient

The knowledge gradient is the expected value of a single measurement x, given by

    \nu_x^{KG,n} = E\left[ \max_y F(y, K^{n+1}(x)) \,|\, K^n \right] - \max_y F(y, K^n)

The first term is the new optimization problem given the updated knowledge state after measuring x, with the expectation taken over the different measurement outcomes; the second term is the optimization problem given what we know now. The difference is the marginal value of measuring x (the knowledge gradient).

Knowledge gradient policy:

    X^{KG} = \arg\max_x \nu_x^{KG,n}

The challenge is a computational one: how do we compute the expectation?
The knowledge gradient

Computing the knowledge gradient for Gaussian beliefs
The change in variance can be found to be

    \tilde{\sigma}_x^{2,n} = Var\left[ \theta_x^{n+1} - \theta_x^n \,|\, S^n \right] = \sigma_x^{2,n} - \sigma_x^{2,n+1}

Next compute the normalized influence:

    \zeta_x^n = - \left| \frac{\theta_x^n - \max_{x' \ne x} \theta_{x'}^n}{\tilde{\sigma}_x^n} \right|

Let

    f(\zeta) = \zeta \Phi(\zeta) + \phi(\zeta)

where \Phi(\zeta) is the cumulative standard normal distribution and \phi(\zeta) is the standard normal density. The knowledge gradient is computed using

    \nu_x^{KG} = \tilde{\sigma}_x^n \, f(\zeta_x^n)
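These formulas are straightforward to implement for independent Gaussian beliefs. A minimal sketch (the names are mine; `sigma_w` is the assumed measurement-noise standard deviation):

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def knowledge_gradient(mu, sigma, sigma_w):
    """Knowledge gradient of each alternative under independent normal beliefs.

    mu:      current estimates of the means
    sigma:   current standard deviations of the beliefs
    sigma_w: standard deviation of the measurement noise
    """
    n = len(mu)
    kg = []
    for x in range(n):
        # std dev of the change in the estimate after one measurement of x:
        # sigma_tilde^2 = sigma^2 - posterior variance = sigma^4 / (sigma^2 + sigma_w^2)
        s_tilde = sigma[x] ** 2 / math.sqrt(sigma[x] ** 2 + sigma_w ** 2)
        best_other = max(mu[xp] for xp in range(n) if xp != x)
        zeta = -abs(mu[x] - best_other) / s_tilde
        kg.append(s_tilde * (zeta * Phi(zeta) + phi(zeta)))
    return kg
```

Running this on the five-choice example from earlier in the talk, the KG policy can rank a choice with a large mean and large uncertainty above one with a slightly larger mean but a tight belief.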
The knowledge gradient

Classic steepest ascent moves in the direction of the gradient:

    x^{n+1} = x^n + \alpha_n \nabla f(x^n)

The knowledge gradient policy is a type of coordinate ascent, where each measurement updates one coordinate of the belief state:

    x^{n+1} = x^n + \alpha_n \nu^{KG,n}
The knowledge gradient

[Bar charts, three examples: for each of the five choices, the current estimate of the value of the decision (mu), the current estimate of its standard deviation (sigma), and the resulting value of the knowledge gradient (KG index).]
The knowledge gradient

The knowledge gradient policy

    X^{KG}(S^n) = \arg\max_x \nu_x^{KG,n}

Properties
• Effectively a myopic policy, but also similar to steepest ascent for nonlinear programming.
• The best single measurement you can make (by construction).
• Asymptotically optimal for off-line learning (a more difficult proof): as the measurement budget grows, we get the optimal solution.
• The knowledge gradient policy is the only stationary policy with this behavior, and it has no tunable parameters.
The knowledge gradient policy

Myopic and asymptotic optimality
[Plots: a policy can have fast initial convergence but stall before reaching the optimal solution, or be asymptotically optimal but converge slowly. The knowledge gradient combines myopic optimality (fast initial convergence) with asymptotic optimality.]

Myopic policy vs. three-step lookahead
[Plot: opportunity cost of a rolling horizon policy with a three-step lookahead versus the knowledge gradient, a one-step lookahead.]
The value of information

The value of information is often concave… but not always. The marginal value of a single measurement can be small!

Optimal number of choices
As measurement noise increases, the optimal number of alternatives to evaluate decreases.
[Plot: performance versus the number of alternatives being evaluated.]

The KG(*) policy
Maximize the average value of measurements.
Introduction

An important problem class involves correlated beliefs – measuring one alternative tells us something about other alternatives.

[Figure: five belief distributions; measure one alternative, and the beliefs about the others change too.]
The knowledge gradient with correlated beliefs

Introduction
Examples:
• Finding the best price at which to sell a product. Demand at a price of $8 is close to demand at a price of $9.
• Choosing a combination of drugs to treat a disease. Two treatments may share common medications.
• Finding a chemical for a particular medical or industrial purpose. Two chemicals sharing similar molecular structures behave similarly.
• Choosing a combination of features to include in a product. We can only evaluate sales of a complete product; two products may have some features in common, while others are different.
The knowledge gradient with correlated beliefs

Introduction
Optimizing the price of a product: estimating demand at a price of $84 tells us something about the demand when we charge $86.

Correlated knowledge gradient procedure
[Figures: belief surfaces without and with correlations.]
• Chooses measurements based in part on what we learn about other potential measurements.
• Updating of correlations is built into the decision function, not just the transition function.
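The belief update behind this procedure is the standard multivariate normal (Kalman-style) update; a minimal sketch, with invented names:

```python
import numpy as np

def update_correlated(theta, Sigma, x, w, noise_var):
    """Update a multivariate normal belief N(theta, Sigma) after observing
    w = mu_x + noise at alternative x, where the noise has variance noise_var."""
    col = Sigma[:, x]                   # covariance of every alternative with x
    gamma = noise_var + Sigma[x, x]     # variance of the observation
    theta_new = theta + (w - theta[x]) / gamma * col
    Sigma_new = Sigma - np.outer(col, col) / gamma
    return theta_new, Sigma_new
```

Because `col` has nonzero entries for every alternative correlated with x, a single measurement moves the beliefs about all of them at once, which is exactly why the correlated KG can learn about prices (or compounds) it never measures directly.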
Major problem classes

Types of learning problems
Off-line learning (ranking and selection/stochastic search)
• There is a phase of information collection with a finite budget, after which you make an implementation decision.
• Examples:
– Finding the best design of a manufacturing configuration or engineering design which is evaluated using an expensive simulation.
– What is the best combination of designs for hydrogen production, storage and conversion?
On-line learning (multiarmed bandit problems)
• "Learn as you earn"
• Examples:
– Finding the best path to work
– What is the best set of energy-saving technologies to use for your building?
– What is the best medication to control your diabetes?
Knowledge gradient for online learning

Objective function for off-line problems
We wish to find the best design after N measurements:

    \max_x \theta_x^N

Objective function for on-line problems
We wish to maximize the total reward as we proceed:

    \max_\pi \sum_{n=1}^{N} \theta_{x^n}^n, \qquad x^n = X^\pi(S^n)
Measurement policies

Special case: the multiarmed bandit problem
Which slot machine should I try next to maximize total expected rewards?

Breakthrough (Gittins and Jones, 1974)
• Do not need to solve the high-dimensional dynamic program.
• Compute a single index (the "Gittins index") for each slot machine.
• Try the slot machine with the largest index:

    \nu_x^{Gittins} = \theta_x^n + \Gamma(n, \gamma)\, \sigma_x

where \theta_x^n is the current estimate of the reward from machine x, \Gamma(n, \gamma) is the Gittins index for a measurement with mean zero and variance 1 (\gamma is the discount factor), and \sigma_x is the standard deviation of a measurement.
Knowledge gradient for online learning

Knowledge gradient policy
For off-line problems, \nu_x^{KG,n} is the value of a measurement from a single decision.

For finite-horizon on-line problems:
• Assume we have made 3 measurements out of our budget of 20.
• What is the value of learning from one more measurement? \nu_x^{KG,3} is the improvement in the 4th decision given what we know after the 3rd measurement. But we benefit from this decision 17 more times:

    \nu_x^{KGOL,3} = \theta_x^3 + (20 - 3)\, \nu_x^{KG,3} = \theta_x^3 + 17\, \nu_x^{KG,3}

• The more times we can use the information, the more we are willing to take a loss for future benefits.
Knowledge gradient for online learning

Knowledge gradient policy
For finite-horizon on-line problems:

    \nu_x^{KGOL,n} = \theta_x^n + (N - n)\, \nu_x^{KG,n}

For infinite-horizon discounted problems:

    \nu_x^{KGOL,n} = \theta_x^n + \frac{\gamma}{1 - \gamma}\, \nu_x^{KG,n}

Compare to Gittins indices for bandit problems:

    \nu_x^{Gittins} = \theta_x^n + \Gamma(n, \gamma)\, \sigma_x
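The two on-line index formulas can be sketched directly (assuming the off-line KG value has already been computed, e.g. with the Gaussian formula earlier; the names are mine):

```python
def online_kg_index(theta, kg, N=None, n=None, gamma=None):
    """On-line knowledge gradient index for one alternative.

    theta: current estimate of the reward
    kg:    off-line knowledge gradient value for this alternative
    Pass (N, n) for a finite horizon, or gamma for infinite-horizon discounting.
    """
    if gamma is not None:
        return theta + gamma / (1.0 - gamma) * kg
    return theta + (N - n) * kg
```

The (N - n) factor makes the exploration bonus shrink as the budget runs out, so late measurements are chosen almost purely by their estimated reward.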
Knowledge gradient for online learning

Gittins indices
Gittins indices look at one measurement at a time, over the entire future. The knowledge gradient looks across all measurements at a point in time.

[Diagram: measurements laid out over time; Gittins indices scan along the time axis, the knowledge gradient across the alternatives.]
Knowledge gradient for online learning

On-line KG vs. Gittins
[Plot: performance versus number of measurements; on-line KG slightly underperforms Gittins in some regions and slightly outperforms it in others.]
Knowledge gradient for online learning

KG versus Gittins indices for multiarmed bandit problems
Gittins indices are provably optimal, but computing them is hard. Chick and Gans (2009) have developed a simple and accurate approximation.

[Plots: improvement of KG over Gittins, with an informative prior and with an uninformative prior.]
Knowledge gradient for online learning

But the knowledge gradient can also handle finite horizons and correlated beliefs.

[Plots: KG vs. Gittins, KG vs. upper confidence bounding, KG vs. interval estimation, KG vs. pure exploitation.]
Knowledge gradient for online learning

KG versus interval estimation
Recall that with IE, you choose the alternative with the highest

    \nu_x^{IE} = \theta_x^n + z\, \sigma_x^n

where z is a tunable parameter.

[Plot: opportunity cost as a function of the IE parameter z; IE beats KG only over a well-tuned range of z.]

Tuning z for interval estimation
The optimal value of z is very sensitive to the problem parameters.
[Plots: opportunity cost versus z for two problem instances with very different optimal values of z.]
Outline
Applications
Optimizing an energy storage problem
Warren Scott, Emre Barut
Jennifer and Christine Schoppe
Drug discovery
Diana Negoescu
Peter Frazier
Learning on a graph
Ilya Ryzhov
Optimal control of wind and storage

Wind
Varies with multiple frequencies (seconds, hours, days, seasonal). Spatially uneven, generally not aligned with population centers.

Solar
Shines primarily during the day (when it is needed), but not entirely reliably. Strongest in the south/southwest.

[Plots: wind power over 30 days and over 1 year.]

Storage technologies: hydroelectric, batteries, flywheels, ultracapacitors.
Optimal control of wind and storage

Controlling the storage process
Imagine that we would like to use storage to reduce demand when electricity prices are high. We use a simple policy controlled by two parameters: a store price and a withdraw price.

[Plot: electricity price over time, with the store and withdraw levels marked.]
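As an illustration, here is a toy simulation of such a two-threshold policy (the prices, capacity and efficiency parameters are invented; the actual storage model in the talk is richer):

```python
def storage_profit(prices, store_below, withdraw_above,
                   capacity=10.0, rate=1.0, efficiency=0.9):
    """Toy two-threshold storage policy: buy energy when the price is at or
    below store_below, sell it back when the price is at or above
    withdraw_above. Returns total profit over the price sequence."""
    level, profit = 0.0, 0.0
    for p in prices:
        if p <= store_below and level < capacity:
            amount = min(rate, capacity - level)
            level += amount * efficiency      # charging losses
            profit -= amount * p              # pay to charge
        elif p >= withdraw_above and level > 0.0:
            amount = min(rate, level)
            level -= amount
            profit += amount * p              # earn by discharging
    return profit
```

Tuning the pair (store_below, withdraw_above) is exactly the kind of two-dimensional, expensive-to-evaluate optimization that the measurement policies in this talk address.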
Optimizing storage policy

Initially we believe the profit surface is the same everywhere:
[Surfaces: estimated profit and knowledge gradient.]
We want to measure the value where the knowledge gradient is the highest. This is the measurement that teaches us the most.
Optimizing storage policy

After four measurements:
[Surfaces: estimated profit, with the measurements marked, and the knowledge gradient (the value of another measurement), showing a new optimum at the same location.]
Whenever we measure at a point, the value of another measurement at the same point goes down. The knowledge gradient guides us to measuring areas of high uncertainty.
Optimizing storage policy

[Surfaces: estimated profit and knowledge gradient after five, six, seven, eight, nine and ten samples.]
Optimizing storage policy

After 10 measurements, our estimate of the surface:
[Surfaces: estimated profit versus the true surface.]
Applications

Biomedical research
How do we find the best drug to cure cancer? There are millions of combinations, with laboratory budgets that cannot test everything. We need a method for sequencing experiments.
Drug discovery

Designing molecules
X and Y are sites where we can hang substituents to change the behavior of the molecule.
[Figure: base molecule with sites X and Y.]
Drug discovery

We express our belief using a linear, additive QSAR model. Let X_ij = 1 if substituent j is at site i, and 0 otherwise. Then

    Y = \theta_0 + \sum_{\text{sites } i} \; \sum_{\text{substituents } j} \theta_{ij} X_{ij}
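Fitting such an additive model by least squares is routine with standard tools; a sketch on synthetic data (the number of sites, substituents and the coefficients are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_sites, n_subst = 2, 3                     # toy problem: 2 sites, 3 substituents
true_theta = rng.normal(size=n_sites * n_subst)
true_theta0 = 1.5

def indicators(choices):
    """0/1 indicator vector X_ij for a compound (one substituent per site)."""
    X = np.zeros(n_sites * n_subst)
    for i, j in enumerate(choices):
        X[i * n_subst + j] = 1.0
    return X

compounds = [(a, b) for a in range(n_subst) for b in range(n_subst)]
X = np.array([indicators(c) for c in compounds])
y = true_theta0 + X @ true_theta            # noise-free responses for the demo

# Least-squares fit of (theta_0, theta_ij). Individual coefficients are not
# identifiable (indicator columns sum to one per site), but predictions are.
A = np.hstack([np.ones((len(X), 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef
```

With a handful of substituents per site, the linear belief lets a few tested compounds inform the predictions for thousands of untested ones, which is what makes the 10,000- and 87,120-compound problems below tractable.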
Drug discovery

Knowledge gradient versus pure exploration for 99 compounds
[Plot: performance under the best possible versus number of molecules tested (out of 99), comparing pure exploration and the knowledge gradient.]
Drug discovery

A more complex molecule:
[Figure: base molecule with numbered ring positions and substituent sites R1–R5.]
Potential substituents: F, OH, CH3, OCOCH3, OCH3, NO, Cl, …
From this base molecule, we created problems with 10,000 compounds, and one with 87,120 compounds.
Drug discovery

Compact representation on the 10,000-compound problem
[Plot: performance under the best possible versus number of molecules tested, for 15 sample paths.]

Single sample path on the molecule with 87,120 combinations
[Plot: performance under the best possible.]
Parametric belief models

Representing beliefs using linear regression has many applications:
• How do we find the optimal price of a product sold on the internet?
• Which internet ad will generate the most ad clicks?
• How will a customer, described by a set of attributes, respond to a price for a contract?
• What parameter settings produce the best results from my business simulator?
• What are the best features that I should include in a laptop?
Major problem classes

Belief structures
Lookup tables (one belief for each discrete value x):
• Independent beliefs
• Correlated beliefs
Parametric beliefs:

    y = \theta_0 + \theta_1 \phi_1(S) + \theta_2 \phi_2(S) + \ldots

Nonparametric beliefs
Information collection for rapid response

The challenge:
We need to plan the movement of emergency response resources to respond to an emergency.

Collecting information:
• Aerial videos
• Sampling mobile phones with GPS
• Ground observations
Information collection on a graph

Optimal routing over a graph
[Figures: a network with the shortest path highlighted. Evaluating a link updates our belief about its cost; after the update we may have a new shortest path.]
How do we decide which links to measure?
Information collection on a graph

The knowledge gradient on a graph
We can apply the knowledge gradient concept directly:

    \nu_x^{KG} = E\left[ \max_y F(y, K(x)) \right] - \max_y F(y, K)

The first term is the expected value of the shortest path problem after updating the distribution of the cost of link x; the second is the value of the current shortest path problem, based on what we know now.

How do we compute the expected value of a stochastic shortest path problem?
Information collection on a graph

The knowledge gradient on a graph
When we had finite alternatives, we had to compute the normalized distance to the best (or second best):

    \zeta_x^n = - \left| \frac{\theta_x^n - \max_{x' \ne x} \theta_{x'}^n}{\tilde{\sigma}_x^n} \right|

For problems on graphs, we have to compute

    \zeta_{ij}^n = - \left| \frac{ \max_{p \in P(ij)} V_p^n - \max_{p \notin P(ij)} V_p^n }{\tilde{\sigma}_{ij}^n} \right|

where the two terms in the numerator are the value of the best path that includes link (i, j) and the value of the best path that does not include link (i, j).
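On a small graph, the two path values in the numerator can be found by brute-force enumeration; a sketch (the graph and its values are invented, and real instances would use shortest-path algorithms instead):

```python
def all_paths(graph, src, dst, path=None):
    """Enumerate all simple src -> dst paths in a dict-of-dicts graph."""
    path = (path or []) + [src]
    if src == dst:
        yield path
        return
    for nxt in graph.get(src, {}):
        if nxt not in path:
            yield from all_paths(graph, nxt, dst, path)

def path_value(graph, path):
    """Total value of a path (sum of its edge values)."""
    return sum(graph[a][b] for a, b in zip(path, path[1:]))

def best_with_and_without(graph, src, dst, edge):
    """Value of the best src -> dst path that uses `edge`, and of the best
    path that avoids it -- the two quantities in the numerator above."""
    with_e, without_e = float("-inf"), float("-inf")
    for p in all_paths(graph, src, dst):
        v = path_value(graph, p)
        if edge in zip(p, p[1:]):
            with_e = max(with_e, v)
        else:
            without_e = max(without_e, v)
    return with_e, without_e
```

A link whose best containing path is far from the best avoiding path (relative to its uncertainty) has little chance of changing the decision, which is exactly what the normalized quantity captures.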
Experimental results

Ten layered graphs (22 nodes, 50 edges). Ten larger layered graphs (38 nodes, 102 edges).
Special thanks to

Peter Frazier (faculty, Cornell University)
Ilya Ryzhov (available 2011)
Warren Scott (expected availability 2012)
Emre Barut (expected availability 2013?)
Major problem classes

Measurement variable
• Binary problems, x = (0, 1): sequential hypothesis testing.
• Discrete choice problems, x = (1, 2, …, M): finding the best technology.
• Subset selection problems, x = (0 1 1 0 1 0 0 0 1): R&D portfolio optimization.
• Continuous scalar parameter, x = 2.682: what is the best temperature, density, quantity?
• Continuous vectors, x = (1.43, 12.78, 4.59, …): tuning a design or process.
• Multiattribute problems, x = (OH, OCH3, NO, Cl, …): drug discovery (what is the molecular compound?); what is the best set of features for a device?
The knowledge gradient

Computing the knowledge gradient
Normalized distance to the best (or second best):

    \zeta_x^n = - \left| \frac{\theta_x^n - \max_{x' \ne x} \theta_{x'}^n}{\tilde{\sigma}_x^n} \right|

[Chart: knowledge gradient values for the five choices.]
Approximate dynamic programming

Learning the value of being in each state
[Figure: two states. Staying in state 1 earns $0; moving from state 1 to state 2 costs $5; state 2 earns $20. Initial estimates: v1 = 0, v2 = 0.]
Starting in state 1, given our initial estimate of the value of being in each state, we would prefer to stay in state 1 and get $0 than move to state 2 and get -$5 + 0 = -$5. To learn the value of being in state 2, we have to make an explicit decision to explore state 2.
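This exploration trap is easy to reproduce in a tiny simulation (my own construction mirroring the two-state example, with a crude TD(0)-style update): a pure-exploitation learner never visits state 2, so its estimate stays at zero, while an exploring learner discovers the $20 reward.

```python
import random

def run(n_steps, explore_prob, seed=0):
    """Two-state example: staying in state 1 pays $0, moving 1 -> 2 pays -$5,
    and state 2 pays $20 (then returns to state 1). Learn state values with a
    simple bootstrapped (TD(0)-style) update."""
    rng = random.Random(seed)
    v = [0.0, 0.0]            # estimated values of states 1 and 2
    gamma, alpha = 0.9, 0.1
    state = 0
    for _ in range(n_steps):
        if state == 0:
            if rng.random() < explore_prob:
                action = rng.choice(["stay", "move"])
            else:  # exploit current estimates
                action = "stay" if 0.0 + gamma * v[0] >= -5.0 + gamma * v[1] else "move"
            reward, nxt = (0.0, 0) if action == "stay" else (-5.0, 1)
        else:
            reward, nxt = 20.0, 0
        v[state] += alpha * (reward + gamma * v[nxt] - v[state])
        state = nxt
    return v

v_exploit = run(2000, explore_prob=0.0)   # never explores: v2 stays at 0
v_explore = run(2000, explore_prob=0.2)   # explores: learns state 2 is valuable
```

With explore_prob = 0, the estimate of state 2 never moves off zero, so the policy never discovers that moving is worthwhile; any positive exploration probability eventually corrects this.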