Optimal Learning:
Efficient Data Collection in the Information Age
Industrial Engineering Research Conference
Cancun, Mexico
June 6, 2010
Warren B. Powell
With research by
Peter Frazier
Ilya Ryzhov
Warren Scott
Princeton University
© 2010 Warren B. Powell, Princeton University
Energy technology

 Retrofitting buildings with new energy technologies
 Different combinations of technologies interact, with behaviors that depend on the characteristics of the building.
 Potential technologies include:
• Window tinting, insulation
• Energy-efficient lighting
• Advanced thermostats
• … many others
 We need to try different combinations of technologies to build up a knowledge base on different interactions, in different settings.
Finding the best path

 Finding the best way to get around Manhattan:
• Walking
• Subway/walking
• Taxi
• Street bus
• Driving
Finding effective compounds

 Materials research
• How do we find the best material for converting sunlight to electricity?
• What is the best battery design for storing energy?
• We need a method to sort through potentially thousands of experiments.
Applications

 Pandemic disease control
• Face masks are effective at disease containment.
• It is better to test people for the disease.
• But we cannot test everyone. Who do we test?
Applications

 Finding good designs
• How do we optimize the dimensions of tubes, plates and distances in an aerosol device?
• Each design requires several hours to set up and execute.
• Five parameters determine the effectiveness of the spray.
Nomadic trucker illustration

 The nomadic trucker starts with value estimates of zero for every location: V^0(MN) = V^0(CO) = V^0(NY) = V^0(CA) = 0.
 Sitting in Texas, he sees loads paying $350, $150, $450 and $300. Taking the best one gives the updated estimate V^1(TX) = 450.
 From the next location, the loads pay $180, $400, $600 and $125; the $600 load updates the estimate for New York to 600.
 The process repeats as new loads appear ($550, $350, $150, $250), gradually building up estimates of the value of being in each location.
Outline

 The challenge of learning
 The knowledge gradient policy
 The knowledge gradient with correlated beliefs
 The knowledge gradient for on-line learning
 Applications
The challenge of learning

 Deterministic optimization
 Find the choice with the highest reward (assumed known):

Choice   Value
1        759    ← The winner!
2        722
3        698
4        653
5        616
The challenge of learning

 Stochastic optimization
 Now assume the reward you will earn is stochastic, drawn from a normal distribution. The reward is revealed after the choice is made.

Choice   Mean   Std dev
1        759    120      ← The winner!
2        722    142
3        698    133
4        653    90
5        616    102
16
The challenge of learning

Optimal learning


Choice
1
2
3
4
5
Now, you have a budget of 10 measurements to determine
which of the 5 choices is best.
You have an estimate of the performance of each, but you
are unsure and you are willing to update your belief.
Mean
759
722
698
653
616
Std dev Observation
120
702
78
133
90
102
Mean
712
722
698
653
616
Std dev Observation
96
78
734
133
90
102
Mean
712
726
698
653
616
Std dev Observation
96
64
133
90
102
• … It is no longer obvious which you should try first.
17
17
The challenge of learning

 At first, we believe that

      \mu_x \sim N(\mu_x^0, 1/\beta_x^0)

  where \beta_x^0 is the precision (one over the variance) of our prior belief.
 But we measure alternative x and observe

      \hat{y}_x^1 \sim N(\mu_x, 1/\beta^y)

 Our beliefs change:

      \mu_x \sim N(\mu_x^1, 1/\beta_x^1), \qquad
      \mu_x^1 = \frac{\beta_x^0 \mu_x^0 + \beta^y \hat{y}_x^1}{\beta_x^0 + \beta^y}, \qquad
      \beta_x^1 = \beta_x^0 + \beta^y

 Thus, our beliefs about the rewards are gradually improved over measurements.
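As a concrete illustration, here is a minimal Python sketch of this precision-weighted update; the function name is mine, and the measurement noise (a standard deviation of 150) is an assumed value, not from the talk.

```python
def update_belief(mu_prior, beta_prior, y_obs, beta_meas):
    """Precision-weighted Bayesian update of a normal belief about one alternative.

    mu_prior, beta_prior : prior mean and precision (1/variance) of the belief
    y_obs, beta_meas     : observed value and precision of the measurement
    """
    beta_post = beta_prior + beta_meas
    mu_post = (beta_prior * mu_prior + beta_meas * y_obs) / beta_post
    return mu_post, beta_post

# Prior N(759, 120^2) and observation 702, as in the earlier table; assumed noise std dev 150.
mu1, beta1 = update_belief(759.0, 1 / 120.0**2, 702.0, 1 / 150.0**2)
print(mu1, (1.0 / beta1) ** 0.5)   # the mean moves toward 702 and the std dev shrinks
```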
The challenge of learning

 Now assume we have five choices, with uncertainty in our belief about how well each one will perform.
 If you can make one measurement, which would you measure?
 (Figure: belief distributions for five choices.)
The challenge of learning

 One possible outcome: the measurement changes our belief, but the same choice still looks best, so there is no improvement in the decision.
The challenge of learning

 Another possible outcome: the measurement reveals a new solution.
 The value of learning is that it may change your decision.
The challenge of learning

 The measurement problem
 We wish to design a sequential measurement policy, where each measurement depends on previous choices.
 We can formulate this as a dynamic program:

      V^n(S^n) = \max_x \left( C(S^n, x) + E\left[ V^{n+1}(S^{n+1}) \mid S^n \right] \right)

 … but it is a little different than most dynamic programs that focus on the physical state.
The challenge of learning

 Optimal routing over a graph
 Here S^n is a node in the network (the current node, e.g. node 2), x is the decision to go to a node (e.g. node 5), and S^{n+1} is the downstream node:

      V^n(S^n) = \max_x \left( C(S^n, x) + E\left[ V^{n+1}(S^{n+1}) \mid S^n \right] \right)
The challenge of learning

 Learning problems
 Now S^n is our "state of knowledge", e.g. S_5 = N(\mu_5, \sigma_5^2) is our belief about alternative 5:

      V^n(S^n) = \max_x \left( C(S^n, x) + E\left[ V^{n+1}(S^{n+1}) \mid S^n \right] \right)

 Here x is the decision to make a measurement, S^n is the current state of knowledge, and S^{n+1} is the new state of knowledge.
The challenge of learning

 Heuristic measurement policies
 Pure exploitation – Always make the choice that appears to be the best.
 Pure exploration – Make choices at random so that you are always learning more.
 Hybrid (epsilon-greedy)
• Explore with probability \epsilon and exploit with probability 1 - \epsilon.
• Declining exploration – explore with probability \epsilon^n = c / n, which goes to zero as n \to \infty, but not too quickly.
 Boltzmann exploration
• Explore choice x with probability

      p_x^n = \frac{\exp(\theta \mu_x^n)}{\sum_{x'} \exp(\theta \mu_{x'}^n)}

 Interval estimation
• Choose the x which maximizes \mu_x^n + z \sigma_x^n, where z is a tunable parameter and \sigma_x^n is the standard deviation of the estimate of the mean.
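For concreteness, a minimal Python sketch of these heuristics; the function names and default parameter values are my own.

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(mu, n, c=1.0):
    """Declining-epsilon policy: explore with probability c/n, otherwise exploit."""
    if rng.random() < min(1.0, c / max(n, 1)):
        return int(rng.integers(len(mu)))           # explore: pick an alternative at random
    return int(np.argmax(mu))                       # exploit: pick the apparent best

def boltzmann(mu, theta):
    """Boltzmann exploration: choose x with probability proportional to exp(theta * mu_x)."""
    p = np.exp(theta * (np.asarray(mu, float) - np.max(mu)))   # shift for numerical stability
    return int(rng.choice(len(mu), p=p / p.sum()))

def interval_estimation(mu, sigma, z=2.0):
    """Interval estimation: choose the largest mu_x + z * sigma_x (z is the tunable parameter)."""
    return int(np.argmax(np.asarray(mu, float) + z * np.asarray(sigma, float)))
```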
Outline

 The challenge of learning
 The knowledge gradient policy
 The knowledge gradient with correlated beliefs
 The knowledge gradient for on-line learning
 Applications
The knowledge gradient

 Basic principle:
 Assume you can make only one measurement, after which you have to make a final choice (the implementation decision).
 What choice would you make now to maximize the expected value of the implementation decision?
 (Figure: a change in the estimate of the value of option 5, due to the measurement, produces a change in the decision.)
The knowledge gradient

 General model
 Off-line learning – We have a measurement budget of N observations. After we do our measurements, we have to make an implementation decision.
 Notation:
      y           = implementation decision
      K^n         = our state of knowledge after n measurements
      F(y, K^n)   = value of making decision y given knowledge K^n
      x^n         = measurement decision after n measurements
      W_x^{n+1}   = observation resulting from measuring x^n = x
      K^{n+1}(x)  = updated distribution of belief about costs after observing W_x^{n+1}
The knowledge gradient

 The knowledge gradient is the expected value of a single measurement x, given by

      \nu_x^{KG,n} = E\left[ \max_y F(y, K^{n+1}(x)) \right] - \max_y F(y, K^n)

 Here y is the implementation decision and K^n is the knowledge state. The first term is the new optimization problem, given the updated knowledge state after measuring x, with the expectation taken over the different measurement outcomes; the second term is the optimization problem given what we know now. The difference is the marginal value of measuring x (the knowledge gradient).
 The knowledge gradient policy measures the alternative with the largest knowledge gradient:

      X^{KG} = \arg\max_x \nu_x^{KG}

 The challenge is a computational one: how do we compute the expectation?
The knowledge gradient

 Computing the knowledge gradient for Gaussian beliefs
 The change in variance can be found to be

      \tilde{\sigma}_x^{2,n} = \mathrm{Var}\left[ \mu_x^{n+1} - \mu_x^n \mid S^n \right] = \sigma_x^{2,n} - \sigma_x^{2,n+1}

 Next compute the normalized influence:

      \zeta_x^n = - \left| \frac{ \mu_x^n - \max_{x' \ne x} \mu_{x'}^n }{ \tilde{\sigma}_x^n } \right|

 Let f(\zeta) = \zeta \Phi(\zeta) + \phi(\zeta), where \Phi is the cumulative standard normal distribution and \phi is the standard normal density.
 The knowledge gradient is computed using

      \nu_x^{KG} = \tilde{\sigma}_x^n \, f(\zeta_x^n)
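A minimal Python sketch of this calculation for independent normal beliefs, following the formulas above; the measurement noise of 150 and the function name are my own choices for illustration.

```python
import numpy as np
from scipy.stats import norm

def knowledge_gradient(mu, sigma, sigma_w):
    """Knowledge gradient for independent normal beliefs.

    mu, sigma : current belief means and standard deviations
    sigma_w   : standard deviation of the measurement noise
    """
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    var_next = 1.0 / (1.0 / sigma**2 + 1.0 / sigma_w**2)     # belief variance after one measurement
    sigma_tilde = np.sqrt(sigma**2 - var_next)               # predictive change in the mean
    best_other = np.array([np.max(np.delete(mu, i)) for i in range(len(mu))])
    zeta = -np.abs(mu - best_other) / sigma_tilde            # normalized influence
    f = zeta * norm.cdf(zeta) + norm.pdf(zeta)               # f(z) = z*Phi(z) + phi(z)
    return sigma_tilde * f

nu = knowledge_gradient([759, 722, 698, 653, 616], [120, 142, 133, 90, 102], sigma_w=150.0)
print(np.argmax(nu))   # the alternative the KG policy would measure next
```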
The knowledge gradient

 Classic steepest ascent (in the decision variables x_1, x_2):

      x^{n+1} = x^n + \alpha_n \nabla f(x^n)
The knowledge gradient

 The knowledge gradient policy is a type of coordinate ascent:

      x^{n+1} = x^n + \alpha_n \left( \nu^{KG,n} \right)
The knowledge gradient

 (Figures: three examples of bar charts showing, for each of the five choices, the current estimate of the value of the decision (mu), the current estimate of its standard deviation (Sigma), and the value of the knowledge gradient (KG index).)
The knowledge gradient

 The knowledge gradient policy

      X^{KG}(S^n) = \arg\max_x \nu_x^{KG,n}

 Properties
• Effectively a myopic policy, but also similar to steepest ascent for nonlinear programming.
• The best single measurement you can make (by construction).
• Asymptotically optimal for offline learning (a more difficult proof): as the measurement budget grows, we get the optimal solution.
• The knowledge gradient policy is the only stationary policy with this behavior, with no tunable parameters.
The knowledge gradient policy

 Myopic and asymptotic optimality
 (Figure: the ideal policy combines fast initial convergence with asymptotic optimality. Some policies converge quickly at first but stall short of the optimal solution; others are asymptotically optimal but slow initially. The knowledge gradient has both myopic optimality, giving fast initial convergence, and asymptotic optimality.)
The knowledge gradient

 Myopic policy vs. three-step lookahead
 (Figure: opportunity cost of a rolling-horizon policy with a three-step lookahead compared with the knowledge gradient, a one-step lookahead.)
The value of information

 The value of information is often concave…

The value of information

 … but not always.
 The marginal value of a single measurement can be small!
The value of information

 Optimal number of choices
 As measurement noise increases, the optimal number of alternatives to evaluate decreases.
 (Figure: plotted against the number of alternatives being evaluated.)
The value of information

 The KG(*) policy
 Maximize the average value of measurements.
Outline

 The challenge of learning
 The knowledge gradient policy
 The knowledge gradient with correlated beliefs
 The knowledge gradient for on-line learning
 Applications
Introduction

 An important problem class involves correlated beliefs – measuring one alternative tells us something about other alternatives.
 (Figure: belief distributions for five alternatives; when we measure one of them, the beliefs about the others change too.)
The knowledge gradient with correlated beliefs

 Examples
 Finding the best price at which to sell a product.
• Demand at a price of $8 is close to demand at a price of $9.
 Choosing a combination of drugs to treat a disease.
• Two treatments may share common medications.
 Finding a chemical for a particular medical or industrial purpose.
• Two chemicals sharing similar molecular structures behave similarly.
 Choosing a combination of features to include in a product.
• Can only evaluate sales of a complete product.
• Two products may have some features in common, while others are different.
The knowledge gradient with correlated beliefs

 Optimizing the price of a product
• Estimating demand at a price of $84 tells us something about the demand when we charge $86.
 (Figure: beliefs about the response to price, without correlations and with correlations.)
 The correlated knowledge gradient procedure:
• Chooses measurements based in part on what we learn about other potential measurements.
• Updating of correlations is built into the decision function, not just the transition function.
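For intuition, here is a minimal sketch of the standard multivariate-normal belief update that a correlated model performs after one measurement; the correlated knowledge gradient itself requires an additional computation over these updated beliefs that is not shown here, and the names are mine.

```python
import numpy as np

def update_correlated(mu, Sigma, x, w, lam):
    """Update a correlated normal belief (mean vector mu, covariance Sigma)
    after measuring alternative x and observing w with measurement variance lam."""
    mu, Sigma = np.asarray(mu, float), np.asarray(Sigma, float)
    Sigma_x = Sigma[:, x]                                 # covariances of every alternative with x
    gamma = lam + Sigma[x, x]
    mu_new = mu + (w - mu[x]) / gamma * Sigma_x           # every mean moves, not just mu[x]
    Sigma_new = Sigma - np.outer(Sigma_x, Sigma_x) / gamma
    return mu_new, Sigma_new
```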
Outline

 The challenge of learning
 The knowledge gradient policy
 The knowledge gradient with correlated beliefs
 The knowledge gradient for on-line learning
 Applications
Major problem classes

 Types of learning problems
 Off-line learning (ranking and selection / stochastic search)
• There is a phase of information collection with a finite budget, after which you make an implementation decision.
• Examples:
  – Finding the best design of a manufacturing configuration or engineering design which is evaluated using an expensive simulation.
  – What is the best combination of designs for hydrogen production, storage and conversion?
 On-line learning (multiarmed bandit problems)
• "Learn as you earn"
• Examples:
  – Finding the best path to work.
  – What is the best set of energy-saving technologies to use for your building?
  – What is the best medication to control your diabetes?
Knowledge gradient for online learning

 Objective function for off-line problems
• We wish to find the best design after N measurements:

      \max_x \mu_x^N

 Objective function for on-line problems
• We wish to maximize the total reward as we proceed:

      \max_\pi \sum_{n=1}^{N} \mu_{x^n}^n, \qquad x^n = X^\pi(S^n)
Measurement policies

 Special case: the multiarmed bandit problem
 Which slot machine should I try next to maximize total expected rewards?
 Breakthrough (Gittins and Jones, 1974)
• Do not need to solve the high-dimensional dynamic program.
• Compute a single index (the "Gittins index") for each slot machine.
• Try the slot machine with the largest index:

      \nu_x^{Gittins} = \mu_x^n + \Gamma(n, \gamma)\, \sigma_x

  where \mu_x^n is the current estimate of the reward from machine x, \sigma_x is the standard deviation of a measurement, and \Gamma(n, \gamma) is the Gittins index for a measurement with mean zero and variance 1.
Knowledge gradient for online learning

 Knowledge gradient policy
 For off-line problems, \nu_x^{KG,n} is the value of a measurement for a single decision.
 For finite-horizon on-line problems:
• Assume we have made 3 measurements out of our budget of 20.
• What is the value of learning from one more measurement?
• \nu_x^{KG,3} is the improvement in the 4th decision given what we know after the 3rd measurement. But we benefit from this decision 17 more times:

      \nu_x^{KGOL,3} = \mu_x^3 + (20 - 3)\,\nu_x^{KG,3} = \mu_x^3 + 17\,\nu_x^{KG,3}

• The more times we can use the information, the more we are willing to take a loss for future benefits.
Knowledge gradient for online learning

 Knowledge gradient policy
 For finite-horizon on-line problems:

      \nu_x^{KGOL,n} = \mu_x^n + (N - n)\,\nu_x^{KG,n}

 For infinite-horizon discounted problems:

      \nu_x^{KGOL,n} = \mu_x^n + \frac{\gamma}{1-\gamma}\,\nu_x^{KG,n}

 Compare to Gittins indices for bandit problems:

      \nu_x^{Gittins} = \mu_x^n + \Gamma(n, \gamma)\,\sigma_x
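A minimal sketch of the on-line score, assuming ν^KG has already been computed as in the earlier sketch; names are mine.

```python
import numpy as np

def online_kg_score(mu, nu_kg, n, N=None, gamma=None):
    """On-line knowledge-gradient score for each alternative.

    Finite horizon of N measurements:       mu + (N - n) * nu_kg
    Infinite horizon with discount gamma:   mu + gamma / (1 - gamma) * nu_kg
    """
    mu, nu_kg = np.asarray(mu, float), np.asarray(nu_kg, float)
    if N is not None:
        return mu + (N - n) * nu_kg
    return mu + gamma / (1.0 - gamma) * nu_kg

# The policy measures the argmax of this score, e.g.
# x = int(np.argmax(online_kg_score(mu, nu_kg, n=3, N=20)))
```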
Knowledge gradient for online learning

 Gittins indices
 Gittins indices look at one measurement at a time, over the entire future. The knowledge gradient looks across all measurements at a point in time.
Knowledge gradient for online learning

 On-line KG vs. Gittins
 (Figure: plotted against the number of measurements, on-line KG slightly underperforms Gittins in one regime and slightly outperforms Gittins in another.)
Knowledge gradient for online learning

 KG versus Gittins indices for multiarmed bandit problems
 Gittins indices are provably optimal, but computing them is hard. Chick and Gans (2009) have developed a simple and accurate approximation.
 (Figures: improvement of KG over Gittins, with an informative prior and with an uninformative prior.)
Knowledge gradient for online learning

 But knowledge gradient can also handle:
• Finite horizons
• Correlated beliefs
 Comparisons: KG vs. Gittins, KG vs. upper confidence bounding, KG vs. interval estimation, KG vs. pure exploitation.
Knowledge gradient for online learning

 KG versus interval estimation
 Recall that with IE, you choose the alternative with the highest

      \nu_x^{IE} = \mu_x + z \sigma_x

  where z is a tunable parameter.
 (Figure: opportunity cost versus the IE parameter z, for IE and for KG; with a well-tuned z, IE beats KG.)
Knowledge gradient for online learning

 Tuning z for interval estimation
 The optimal value is very sensitive to the problem parameters: on one problem z ≈ 0.3, on another z ≈ 110.
Outline

 The challenge of learning
 The knowledge gradient policy
 The knowledge gradient with correlated beliefs
 The knowledge gradient for on-line learning
 Applications
Outline

 Applications
 Optimizing an energy storage problem (Warren Scott, Emre Barut, Jennifer and Christine Schoppe)
 Drug discovery (Diana Negoescu, Peter Frazier)
 Learning on a graph (Ilya Ryzhov)
Optimal control of wind and storage

 Wind
• Varies with multiple frequencies (seconds, hours, days, seasonal).
• Spatially uneven, generally not aligned with population centers.
 Solar
• Shines primarily during the day (when it is needed), but not entirely reliably.
• Strongest in the south/southwest.
Optimal control of wind and storage

 (Figures: time series over 30 days and over 1 year.)
Optimal control of wind and storage

 Storage options: hydroelectric, batteries, flywheels, ultracapacitors.
Optimal control of wind and storage

 Controlling the storage process
• Imagine that we would like to use storage to reduce demand when electricity prices are high.
• We use a simple policy controlled by two parameters: store when the price falls below \rho_{store}, and withdraw when it rises above \rho_{withdraw}.
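A minimal sketch of such a two-parameter threshold rule (the names rho_store and rho_withdraw and the example prices are mine); the learning problem is then to find the pair of thresholds that maximizes simulated profit.

```python
def storage_decision(price, rho_store, rho_withdraw):
    """Two-parameter threshold policy: store energy when the price is low,
    withdraw from storage when the price is high, otherwise do nothing."""
    if price <= rho_store:
        return "store"
    if price >= rho_withdraw:
        return "withdraw"
    return "hold"

# Illustrative thresholds: with (30, 80) $/MWh, a price of 95 triggers a withdrawal.
print(storage_decision(95.0, rho_store=30.0, rho_withdraw=80.0))
```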
Optimizing storage policy

 (Figure: the profit surface over the two thresholds \rho_{store} and \rho_{withdraw}.)
Optimizing storage policy

 Initially we think the profit is the same everywhere.
 (Figures: estimated profit and knowledge gradient.)
 We want to measure the value where the knowledge gradient is the highest. This is the measurement that teaches us the most.
Optimizing storage policy

 After four measurements:
 (Figures: estimated profit, with the measurements marked and a new optimum at the same location; the knowledge gradient, the value of another measurement.)
 Whenever we measure at a point, the value of another measurement at the same point goes down. The knowledge gradient guides us to measuring areas of high uncertainty.
Optimizing storage policy

 After five, six, seven, eight, nine and ten samples:
 (Figures: the estimated profit surface and the knowledge gradient after each additional measurement.)
Optimizing storage policy

 After 10 measurements, our estimate of the surface:
 (Figures: estimated profit versus the true surface.)
Outline

 Applications
 Optimizing an energy storage problem (Warren Scott, Emre Barut, Jennifer and Christine Schoppe)
 Drug discovery (Diana Negoescu, Peter Frazier)
 Learning on a graph (Ilya Ryzhov)
Applications

 Biomedical research
 How do we find the best drug to cure cancer?
• There are millions of combinations, with laboratory budgets that cannot test everything.
• We need a method for sequencing experiments.
Drug discovery

 Designing molecules
• X and Y are sites where we can hang substituents to change the behavior of the molecule.
Drug discovery

 We express our belief using a linear, additive QSAR model:

      Y = \theta_0 + \sum_{\text{sites } i} \; \sum_{\text{substituents } j} \theta_{ij} X_{ij}

  where X_{ij} = 1 if substituent j is at site i, and 0 otherwise.
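A minimal Python sketch of the indicator features in this linear, additive model; the sites, substituents and coefficient values below are made up for illustration.

```python
import numpy as np

def qsar_features(compound, sites, substituents):
    """Indicator features: X_ij = 1 if substituent j is attached at site i, else 0."""
    x = np.zeros(len(sites) * len(substituents))
    for i, site in enumerate(sites):
        j = substituents.index(compound[site])
        x[i * len(substituents) + j] = 1.0
    return x

sites = ["R1", "R2"]                        # hypothetical attachment sites
substituents = ["F", "OH", "CH3"]           # hypothetical candidate substituents
theta0 = 0.5                                # belief about the intercept
theta = np.zeros(len(sites) * len(substituents))   # beliefs about the theta_ij coefficients
compound = {"R1": "OH", "R2": "CH3"}
y_hat = theta0 + theta @ qsar_features(compound, sites, substituents)
print(y_hat)                                # predicted value under the current beliefs
```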
Drug discovery

 Knowledge gradient versus pure exploration for 99 compounds
 (Figure: performance relative to the best possible versus the number of molecules tested (out of 99), for pure exploration and for the knowledge gradient.)
Drug discovery

 A more complex molecule, with substituent sites R1 through R5 attached to a numbered ring system.
 Potential substituents include F, OH, CH3, OCOCH3, OCH3, NO, Cl and others.
 From this base molecule, we created problems with 10,000 compounds, and one with 87,120 compounds.
Drug discovery

 Compact representation on the 10,000-combination compound problem
 (Figure: results from 15 sample paths; performance relative to the best possible versus the number of molecules tested.)
Drug discovery

 A single sample path on the molecule with 87,120 combinations
 (Figure: performance relative to the best possible.)
Parametric belief models

 Representing beliefs using linear regression has many applications:
 How do we find the optimal price of a product sold on the internet?
 Which internet ad will generate the most ad clicks?
 How will a customer, described by a set of attributes, respond to a price for a contract?
 What parameter settings produce the best results from my business simulator?
 What are the best features that I should include in a laptop?
Major problem classes

 Belief structures
 Lookup tables (one belief \mu_x^n for each discrete value x)
• Independent beliefs
• Correlated beliefs
 Parametric beliefs

      y = \theta_0 + \theta_1 \phi_1(S) + \theta_2 \phi_2(S) + \ldots

 Nonparametric beliefs
Outline

 Applications
 Optimizing an energy storage problem (Warren Scott, Jennifer and Christine Schoppe)
 Drug discovery (Diana Negoescu, Peter Frazier)
 Learning on a graph (Ilya Ryzhov)
Information collection for rapid response

 The challenge:
• We need to plan the movement of emergency response resources to respond to an emergency.
 Collecting information
• Aerial videos
• Sampling mobile phones with GPS
• Ground observations
Information collection on a graph

 Optimal routing over a graph:
• The shortest path
• Evaluating a link
• After evaluating the link, we have a new shortest path
 How do we decide which links to measure?
Information collection on a graph

 The knowledge gradient on a graph
 We can apply the knowledge gradient concept directly:

      \nu_x^{KG} = E\left[ \max_y F(y, K(x)) \right] - \max_y F(y, K)

 The first term is the expected value of the updated shortest path problem, after updating the distributions of the arc costs from measuring link x; the second term is the current shortest path problem, based on what we believe about the arc costs now.
 How do we compute the expected value of a stochastic shortest path problem?
Information collection on a graph

 The knowledge gradient on a graph
 When we had finite alternatives, we had to compute the normalized distance to the best (or second best) alternative:

      \zeta_x^n = - \left| \frac{ \mu_x^n - \max_{x' \ne x} \mu_{x'}^n }{ \tilde{\sigma}_x^n } \right|

 For problems on graphs, we have to compute

      \zeta_{ij}^n = - \left| \frac{ \bar{p}^n(ij) - \bar{p}^n(\overline{ij}) }{ \tilde{\sigma}_{ij}^n } \right|

  where \bar{p}^n(ij) is the value of the best path that includes link (i, j) and \bar{p}^n(\overline{ij}) is the value of the best path that does not include link (i, j).
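A minimal sketch of this quantity, assuming a networkx directed graph whose edges carry the current belief about the mean arc cost in a "mean" attribute; the function name is mine, and the knowledge gradient for the link would then be σ̃_ij · f(ζ_ij) as before.

```python
import networkx as nx

def graph_zeta(G, source, target, i, j, sigma_tilde_ij):
    """Normalized distance for link (i, j): compare the best (shortest) path that is
    forced to use (i, j) with the best path that avoids (i, j)."""
    # Best path through (i, j): source -> i, the arc (i, j), then j -> target.
    p_with = (nx.shortest_path_length(G, source, i, weight="mean")
              + G[i][j]["mean"]
              + nx.shortest_path_length(G, j, target, weight="mean"))
    # Best path that avoids (i, j): remove the arc and re-solve.
    H = G.copy()
    H.remove_edge(i, j)
    p_without = nx.shortest_path_length(H, source, target, weight="mean")
    return -abs(p_with - p_without) / sigma_tilde_ij
```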
Experimental results

 Ten layered graphs (22 nodes, 50 edges)
 Ten larger layered graphs (38 nodes, 102 edges)
Special thanks to

 Peter Frazier (faculty, Cornell University)
 Ilya Ryzhov (available 2011)
 Warren Scott (expected availability 2012)
 Emre Barut (expected availability 2013?)
Major problem classes

 Measurement variable
 Binary problems x = (0, 1)
• Sequential hypothesis testing
 Discrete choice problems x = (1, 2, …, M)
• Finding the best technology
 Subset selection problems x = (0 1 1 0 1 0 0 0 1)
• R&D portfolio optimization
 Continuous scalar parameter x = 2.682
• What is the best temperature, density, quantity?
 Continuous vectors x = (1.43, 12.78, 4.59, …)
• Tuning a design or process
 Multiattribute problems x = (OH, OCH3, NO, Cl, …)
• Drug discovery – what is the molecular compound?
• What is the best set of features for a device?
The knowledge gradient

 Computing the knowledge gradient
 The normalized distance to the best (or second best) alternative:

      \zeta_x^n = - \left| \frac{ \mu_x^n - \max_{x' \ne x} \mu_{x'}^n }{ \tilde{\sigma}_x^n } \right|

 (Figure: belief distributions for five alternatives, illustrating the knowledge gradient \nu_5^{KG} for alternative 5.)
Approximate dynamic programming

 Learning the value of being in each state
 (Figure: two states, 1 and 2, with arcs labeled $0, -$5, $0 and $20, and initial value estimates v_1 = 0, v_2 = 0.)
 Starting in state 1, given our initial estimate of the value of being in each state, we would prefer to stay in state 1 and get $0 than move to state 2 and get -$5 + 0 = -$5.
 To learn the value of being in state 2, we have to make an explicit decision to explore state 2.
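A toy sketch of this two-state example; assigning the figure's labels as $0 for staying in state 1, -$5 for moving from 1 to 2, and $20 available from state 2 is my own reading. Pure exploitation compares 0 + v[1] with -5 + v[2] and, starting from v = (0, 0), never visits state 2.

```python
# Rewards as read off the figure (my interpretation of which arc carries which label).
reward = {(1, "stay"): 0.0, (1, "move"): -5.0, (2, "stay"): 20.0, (2, "move"): 0.0}
v = {1: 0.0, 2: 0.0}                       # initial estimates of the value of each state
state = 1
for n in range(10):
    stay = reward[(state, "stay")] + v[state]        # value of staying put
    move = reward[(state, "move")] + v[3 - state]    # value of moving to the other state
    v[state] = max(stay, move)                       # update the value of the visited state
    state = state if stay >= move else 3 - state     # pure exploitation: no exploring
print(v)   # v[2] is never updated: exploitation alone never discovers the $20 in state 2
```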