A Simple Distribution-Free Approach to the Max k-Armed Bandit Problem
Matthew Streeter and Stephen Smith
Carnegie Mellon University

Outline
- The max k-armed bandit problem
- Previous work
- Our distribution-free approach
- Experimental evaluation

What is the max k-armed bandit problem?

The classical k-armed bandit
- You are in a room with k slot machines
- Pulling the arm of machine i returns a payoff drawn (independently at random) from unknown distribution Di
- Allowed n total pulls
- Goal: maximize total payoff
- > 50 years of papers

The max k-armed bandit
- You are in a room with k slot machines
- Pulling the arm of machine i returns a payoff drawn (independently at random) from unknown distribution Di
- Allowed n total pulls
- Goal: maximize the highest single payoff
- Introduced ~2003

Why study it?
Goal: improve multi-start heuristics
- A multi-start heuristic runs an underlying randomized heuristic a number of times and returns the best solution
- Examples:
  - HBSS (Bresina 1996)
  - VBSS (Cicirello & Smith 2005)
  - GRASPs (Feo & Resende 1995, and many others)

Application: selecting among heuristics
- Given: some optimization problem, k randomized heuristics
- Each time you run a heuristic, you get a solution with a certain quality
- Allowed n runs
- Goal: maximize quality of the best solution

The max k-armed bandit: example
[Figure: payoff distributions for two arms, one blue and one maroon]

Given n pulls, how can we maximize the (expected) maximum payoff?
- If n=1, should pull the blue arm (higher mean)
- If n=1000, should mainly pull the maroon arm (higher variance)

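To make the mean/variance tradeoff concrete, here is a minimal Python sketch with two hypothetical payoff distributions (illustrative stand-ins, not the distributions in the figure): a "blue" arm with higher mean and a "maroon" arm with higher variance. It estimates the expected maximum payoff from pulling a single arm n times.

import random

def expected_max(sample_payoff, n, trials=2000):
    # Monte Carlo estimate of E[max of n i.i.d. payoffs from one arm]
    total = 0.0
    for _ in range(trials):
        total += max(sample_payoff() for _ in range(n))
    return total / trials

def clip01(x):
    return min(1.0, max(0.0, x))

blue = lambda: clip01(random.gauss(0.60, 0.05))    # higher mean, low variance (hypothetical)
maroon = lambda: clip01(random.gauss(0.50, 0.20))  # lower mean, high variance (hypothetical)

for n in (1, 10, 100, 1000):
    print(n, round(expected_max(blue, n), 3), round(expected_max(maroon, n), 3))

For n=1 the higher-mean arm wins, while for large n the higher-variance arm's heavier upper tail dominates the expected maximum.
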
Distributional assumptions?
- Without distributional assumptions, the optimal strategy is not interesting.
- For example, suppose payoffs are in {0,1} and the arms are shuffled so you don’t know which is which.
  - The optimal strategy samples the arms in round-robin order!
  - You can’t distinguish a “good” arm until you receive a payoff of 1, at which point the max payoff can’t be improved

Distributional assumptions?
- All previous work assumed each machine returns payoffs from a generalized extreme value (GEV) distribution
- Why?
  - Extremal Types Theorem: let Mn = the maximum of n independent draws from some fixed distribution. As n → ∞, the distribution of Mn (suitably rescaled) converges to a GEV distribution
  - The GEV sometimes gives an excellent fit to the payoff distributions we care about

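For reference, the standard statements behind these two bullets (textbook extreme value theory, not anything specific to this paper), written out in LaTeX:

\[
\text{Extremal Types Theorem: if there exist } a_n > 0,\; b_n \text{ such that }
\Pr\!\left[\tfrac{M_n - b_n}{a_n} \le x\right] \longrightarrow G(x)
\]
\[
\text{for a non-degenerate } G, \text{ then } G \text{ is a GEV distribution,}
\qquad
G(x) = \exp\!\left\{ -\left[ 1 + \xi\left(\tfrac{x - \mu}{\sigma}\right) \right]^{-1/\xi} \right\}
\]
\[
\text{(defined where } 1 + \xi\,\tfrac{x-\mu}{\sigma} > 0\text{), with the Gumbel case }
G(x) = \exp\!\left(-e^{-(x-\mu)/\sigma}\right) \text{ recovered as } \xi \to 0.
\]
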
Previous work
- Cicirello & Smith (CP 2004, AAAI 2005):
  - Assumed Gumbel distributions (a special case of the GEV), no rigorous performance guarantees
  - Good results selecting among heuristics for the RCPSP/max
- Streeter & Smith (AAAI 2006):
  - Rigorous result for general GEV distributions
  - But no experimental evaluation

Our contributions
- Threshold Ascent: a strategy that solves the max k-armed bandit problem using a classical k-armed bandit solver as a subroutine
- Chernoff interval estimation: a strategy for the classical k-armed bandit that works well when mean payoffs are small (we assume payoffs lie in [0,1])

Threshold Ascent
- Parameters: a strategy S for the classical k-armed bandit, an integer m > 0
- Idea:
  - Initialize t ← -∞
  - Use S to maximize the number of payoffs that exceed t
  - Once m payoffs > t have been received, increase t and repeat

Threshold Ascent
- Designed to work well when:
  - For t > t_critical, there is a growing gap between the probability that the eventually-best arm yields a payoff > t and the corresponding probability for the other arms
- m controls the exploration/exploitation tradeoff (larger m means the algorithm converges more before increasing t)
- As t gets large, S sees a classical k-armed bandit instance where almost all payoffs are zero
- We don’t really start S from scratch each time we increase t

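A minimal Python sketch of the Threshold Ascent idea just described. The interface of the subroutine S (select_arm/update), the rule for raising t (set it to the m-th largest payoff seen so far), and the fact that S keeps its statistics across threshold increases are simplifying assumptions; the paper's pseudocode differs in these details.

import heapq
import math

class ThresholdAscent:
    """Sketch of Threshold Ascent: run a classical k-armed bandit strategy S on
    the 0/1 reward 'payoff exceeds the current threshold t', and raise t once m
    payoffs above t have been received.  make_strategy(k) must return an object
    with select_arm() and update(arm, reward) methods (assumed interface)."""

    def __init__(self, k, make_strategy, m=100):
        self.S = make_strategy(k)
        self.m = m
        self.t = -math.inf     # current threshold
        self.top = []          # min-heap of the m largest payoffs seen so far
        self.best = -math.inf  # best payoff received so far

    def pull(self, pull_arm):
        arm = self.S.select_arm()
        payoff = pull_arm(arm)
        self.best = max(self.best, payoff)
        # The subroutine S is rewarded only for beating the current threshold.
        self.S.update(arm, 1.0 if payoff > self.t else 0.0)
        heapq.heappush(self.top, payoff)
        if len(self.top) > self.m:
            heapq.heappop(self.top)
        # If the m-th largest payoff exceeds t, then m payoffs > t have been
        # received, so raise the threshold (here: to that m-th largest payoff).
        if len(self.top) == self.m and self.top[0] > self.t:
            self.t = self.top[0]

    def run(self, pull_arm, n):
        for _ in range(n):
            self.pull(pull_arm)
        return self.best

The classical strategy S used in the experiments below is Chernoff interval estimation, sketched after the Chernoff Interval Estimation slide.
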
Interval Estimation
- Interval estimation (Lai & Robbins 1987, Kaelbling 1993) maintains a confidence interval for each arm’s mean payoff and pulls the arm with the highest upper bound
[Figure: confidence intervals for Arm 1, Arm 2, and Arm 3]

Chernoff Interval Estimation
- We analyze a variant of interval estimation with confidence intervals derived from Chernoff bounds
- Define regret = μ* - average_payoff(strategy), where μ* = mean payoff of the best arm
- We prove an O(sqrt(μ*) · X) regret bound, where X = sqrt(k (log n) / n)
- Using Hoeffding’s inequality gives only an O(X) bound (Auer et al. 2002). As μ* → 0, our bound is much better
- Comparable bounds can be obtained using “multiplicative weight update” algorithms

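A matching Python sketch of interval estimation with a Chernoff-style upper confidence bound, usable as the strategy S in the ThresholdAscent sketch above. The particular bound below follows from the multiplicative Chernoff bound for rewards in [0,1]; the constants, the choice of alpha, and the union bound over arms and time steps needed for the formal regret guarantee are simplifications, not the paper's exact construction.

import math

class ChernoffIntervalEstimation:
    """Pull the arm whose Chernoff-style upper confidence bound on the mean
    reward is largest.  Rewards are assumed to lie in [0,1].  With
    alpha = ln(1/delta), the multiplicative Chernoff bound
        P[ mean_hat <= mu - sqrt(2*mu*alpha/n) ] <= delta
    yields the upper bound
        U = mean_hat + alpha/n + sqrt(2*mean_hat*alpha/n + (alpha/n)**2),
    which shrinks much faster than a Hoeffding-based bound when mean_hat is
    small -- the regime Threshold Ascent creates as t grows."""

    def __init__(self, k, delta=0.01):
        self.alpha = math.log(1.0 / delta)
        self.counts = [0] * k
        self.sums = [0.0] * k

    def _upper_bound(self, i):
        if self.counts[i] == 0:
            return float("inf")          # force one pull of every arm first
        n = self.counts[i]
        mean = self.sums[i] / n
        a = self.alpha / n
        return mean + a + math.sqrt(2.0 * mean * a + a * a)

    def select_arm(self):
        bounds = [self._upper_bound(i) for i in range(len(self.counts))]
        return bounds.index(max(bounds))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward
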
Experimental Evaluation

The RCPSP/max
- Assign start times to activities subject to resource and temporal constraints
- Goal: find a schedule with minimum makespan
- NP-hard, “one of the most intractable problems in operations research” (Mohring 2000)
- Multi-start heuristics give state-of-the-art performance (Cicirello & Smith 2005)

Evaluation
- Five multi-start heuristics; each is a randomized rule for greedily building a schedule:
  - LPF - “longest path following”
  - LST - “latest start time”
  - MST - “minimum slack time”
  - MTS - “most total successors”
  - RSM - “resource scheduling method”
- Three max k-armed bandit strategies:
  - Threshold Ascent (m=100, S = Chernoff interval estimation with 99% confidence intervals)
  - Round-robin sampling
  - QD-BEACON (Cicirello & Smith 2004, 2005)
- Note: we use a less aggressive variant of interval estimation in these experiments

Evaluation
- Ran on 169 instances from the ProGen/max library
- For each instance, ran each of the five rules 10,000 times and saved the results to a file
- For each of the three strategies, solved the instance as a max 5-armed bandit with n=10,000 pulls
- Define regret = the difference between the maximum possible payoff and the maximum payoff actually obtained

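A sketch of the replay protocol described on this slide, using the ThresholdAscent and ChernoffIntervalEstimation sketches above. The data layout (saved_runs[i] is the list of 10,000 recorded solution qualities for heuristic i, already scaled to [0,1]) is an assumption about the saved files, not their actual format.

def make_replay_arms(saved_runs):
    """Turn pre-recorded payoffs into a pull_arm function: the j-th pull of
    arm i returns saved_runs[i][j], replaying the recorded runs offline."""
    next_index = [0] * len(saved_runs)
    def pull_arm(i):
        payoff = saved_runs[i][next_index[i]]
        next_index[i] += 1
        return payoff
    return pull_arm

def evaluate_instance(saved_runs, n=10000, m=100, delta=0.01):
    """Replay Threshold Ascent on one instance and return the regret defined
    on the slide: max possible payoff minus max payoff actually obtained."""
    k = len(saved_runs)
    strategy = ThresholdAscent(
        k, lambda num_arms: ChernoffIntervalEstimation(num_arms, delta), m=m)
    achieved = strategy.run(make_replay_arms(saved_runs), n)
    max_possible = max(max(runs) for runs in saved_runs)
    return max_possible - achieved
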
Results
- Threshold Ascent outperforms the other max k-armed bandit strategies, as well as the five “pure” strategies

Summary & Conclusions
- The max k-armed bandit problem is a simple online learning problem with applications to heuristic search
- We described a new, distribution-free approach to the max k-armed bandit problem
- Our strategy is effective at selecting among randomized priority dispatching rules for the RCPSP/max