Algorithmic Game Theory
Uri Feige
Robi Krauthgamer
Moni Naor
Lecture 8: Regret Minimization
Lecturer: Moni Naor
Announcements
• Next Week (Dec 24th):
Israeli Seminar on Computational Game Theory
(10:00 - 4:30)
• At Microsoft Israel R&D Center,
13 Shenkar St. Herzliya.
• January: the course will meet 13:00-15:00
– The meetings on Jan 7th, 14th and 21st 2009
The Peanuts Game
• There are n bins.
• At each round, “nature” throws a peanut into one of the bins
• If you (the player) are at the chosen bin – you catch the peanut
– Otherwise – you miss it
• You may choose to move to any bin before any round
• Game ends when d peanuts were thrown at some bin
– Independent of whether they were caught or not
Goal: to catch as many peanuts as possible.
Hopeless if the opponent is tossing the peanuts based on knowing where you
stand.
Say sequence of bins is predetermined (but unknown).
How well can we do as a function of d and n?
Basic setting
• View learning as a sequence of trials.
• In each trial, the algorithm is given x, asked to predict f(x), and
is then told the correct value.
• Make no assumptions about how examples are chosen.
• Goal is to minimize number of mistakes.
Focus on number of mistakes. Need to “learn from our
mistakes”.
Using “expert” advice
Want to predict the stock market.
• We solicit n “experts” for their advice: will the market go up or
down tomorrow?
• Want to use their advice to make our prediction.
What is a good strategy for using their opinions,
given no a priori knowledge of which expert is the best?
•“Expert”: someone with an opinion.
Some expert is perfect
• We have n “experts”.
• One of these is perfect (never makes a mistake). We just don’t
know which one.
• Can we find a strategy that makes no more than log n
mistakes?
Simple algorithm: take majority vote over all experts that have been
completely correct so far.
What if we have a “prior” p over the experts? Can we make no more than log(1/p_i)
mistakes, where expert i is the perfect one?
Take weighted vote according to p.
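A minimal sketch of this halving strategy in Python (the input format, per-round 0/1 predictions and outcomes, is an assumption made for illustration):

def halving(advice, truth):
    # advice[t][i] is expert i's 0/1 prediction at round t; truth[t] is the outcome.
    n = len(advice[0])
    alive = set(range(n))              # experts that have never erred so far
    mistakes = 0
    for preds, y in zip(advice, truth):
        votes = sum(preds[i] for i in alive)
        guess = 1 if 2 * votes >= len(alive) else 0   # majority vote over consistent experts
        if guess != y:
            mistakes += 1
        alive = {i for i in alive if preds[i] == y}    # cross off experts that were wrong
    return mistakes

Each mistake means a majority of the surviving experts was wrong, so |alive| at least halves; with a perfect expert present, the number of mistakes is at most log2(n).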
Relation to concept learning
• If computation time is no object, can have one “expert”
per concept in C.
• If the target is in C, then the number of mistakes is at most log|C|.
• More generally, for any description language, the number of
mistakes is at most the number of bits needed to write down f.
What if no expert is perfect?
Goal is to do nearly as well as the best one in hindsight.
Strategy #1:
• Iterated halving algorithm. Same as before, but once we've
crossed off all the experts, restart from the beginning.
• Makes at most log(n)*OPT mistakes,
– OPT is # mistakes of the best expert in hindsight.
[Slide illustration: the run splits into phases, each costing at most log n mistakes before a restart.]
Seems wasteful: constantly forgetting what we've “learned”.
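A sketch of the iterated variant under the same assumed input format as above: identical to the halving routine, except that when every expert has been crossed off, all are restored.

def iterated_halving(advice, truth):
    n = len(advice[0])
    alive = set(range(n))
    mistakes = 0
    for preds, y in zip(advice, truth):
        votes = sum(preds[i] for i in alive)
        guess = 1 if 2 * votes >= len(alive) else 0
        if guess != y:
            mistakes += 1
        alive = {i for i in alive if preds[i] == y}
        if not alive:                   # everyone has erred at least once: restart
            alive = set(range(n))
    return mistakes

Each phase between restarts costs at most about log2(n) mistakes and can only end after the best expert errs, which gives the log(n)*OPT-type bound above.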
Weighted Majority Algorithm
Intuition: Making a mistake doesn't completely disqualify an
expert. So, instead of crossing off, just lower its weight.
Weighted Majority Alg:
– Start with all experts having weight 1.
– Predict based on weighted majority vote.
– Penalize mistakes by cutting weight in half.
Example: [slide table showing the experts' predictions and weights over Days 1-4, with a wrong expert's weight halved after each mistake]
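A minimal sketch of the deterministic Weighted Majority rule in Python (same assumed input format as the earlier sketches; the halving factor beta = 1/2 matches the slide):

def weighted_majority(advice, truth, beta=0.5):
    n = len(advice[0])
    w = [1.0] * n                          # all experts start with weight 1
    mistakes = 0
    for preds, y in zip(advice, truth):
        up = sum(w[i] for i in range(n) if preds[i] == 1)
        down = sum(w[i] for i in range(n) if preds[i] == 0)
        guess = 1 if up >= down else 0     # weighted majority vote
        if guess != y:
            mistakes += 1
        for i in range(n):
            if preds[i] != y:
                w[i] *= beta               # penalize a mistake by cutting the weight in half
    return mistakes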
Analysis: not far from best expert in hindsight
• M = # mistakes we've made so far.
• b = # mistakes best expert has made so far.
• W = total weight. Initially W is set to n.
• After each mistake: W drops by at least 25%.
So, after M mistakes, W is at most n(3/4)^M.
• The weight of the best expert is (1/2)^b, and this is at most the total weight. So:
  (1/2)^b ≤ n(3/4)^M,
  which gives M ≤ (b + log2 n) / log2(4/3) ≈ 2.41 (b + log2 n),
  i.e., a constant competitive ratio with respect to the best expert.
Randomized Weighted Majority
If the best expert makes a mistake 20% of the time, then a bound like
2.41(b + log n) is not so good: we can be wrong nearly half the time. Can we do better?
Instead of a majority vote, use the weights as probabilities:
if 70% of the weight says up and 30% says down, predict up with probability 0.7 and down with probability 0.3.
Idea: smooth out the worst case.
• Also, generalize the factor 1/2 to 1 - ε.
Randomized Weighted Majority
/* Initialization */
  w_i ← 1 for all i ∈ {1, …, n}
/* Main loop */
  For t = 1, …, T:
    Let P_t(i) = w_i / Σ_{j=1}^n w_j
    Choose an action i according to P_t
    /* Update weights */
    Observe the losses
    For i ∈ {1, …, n}:  w_i ← w_i · (1 - ε)^{loss(i)}
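A minimal runnable sketch of RWM in Python; the assumption that the losses of all experts are revealed each round as a list (the full-information setting) is made for illustration:

import random

def rwm(loss_sequence, n, eps=0.1):
    # loss_sequence[t][i] is expert i's loss in [0,1] at round t.
    w = [1.0] * n
    total_loss = 0.0
    for losses in loss_sequence:
        total = sum(w)
        p = [wi / total for wi in w]                    # P_t(i) = w_i / sum_j w_j
        i = random.choices(range(n), weights=p)[0]      # sample an expert according to P_t
        total_loss += losses[i]
        w = [wj * (1 - eps) ** losses[j] for j, wj in enumerate(w)]   # multiplicative update
    return total_loss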
Analysis
• Say at time t a fraction F_t of the weight is on experts that made a mistake.
• We have probability F_t of making a mistake, and we remove an εF_t fraction of the total weight:
  W_final = n (1 - εF_1)(1 - εF_2) … (1 - εF_T)
  ln(W_final) = ln(n) + Σ_t ln(1 - εF_t) ≤ ln(n) - ε Σ_t F_t      (using ln(1-x) ≤ -x)
              = ln(n) - εM                                         (Σ_t F_t = E[# mistakes] = M)
• If the best expert makes b mistakes, then ln(W_final) ≥ ln((1-ε)^b) = b ln(1-ε).
• Now solve ln(n) - εM ≥ b ln(1-ε):
  M ≤ b ln(1/(1-ε))/ε + ln(n)/ε
    ≤ b (1+ε) + ln(n)/ε            (using -ln(1-x) ≤ x + x² for 0 ≤ x ≤ 1/2)
• M = expected # of mistakes; b = # mistakes of the best expert; W = total weight.
Summarizing: RWM
• Can be (1+ε)-competitive with the best expert in hindsight, with an additive ln(n)/ε term:
  M ≤ b (1+ε) + ln(n)/ε
• If running T time steps, set ε = (ln n / T)^{1/2} to get
  M ≤ b (1 + (ln n / T)^{1/2}) + ln(n) / (ln n / T)^{1/2}
    = b + (b² ln n / T)^{1/2} + (T ln n)^{1/2}
    ≤ b + 2 (T ln n)^{1/2}
  i.e., an additive loss of at most 2(T ln n)^{1/2}.
• M = expected # of mistakes; b = # mistakes of the best expert.
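A small worked instance of this tuning (the numbers n = 100 and T = 10,000 are only an example):

import math
n, T = 100, 10_000
eps = math.sqrt(math.log(n) / T)                 # about 0.021
additive_loss = 2 * math.sqrt(T * math.log(n))   # about 429 extra mistakes over the best expert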
Questions
• Isn’t it sometimes better to go with the majority?
– The best expert may have a hard time on an easy question,
and we would be better off using the wisdom of the crowds.
• Answer: if that is a good idea, make the “majority vote” itself
one of the experts!
Lower Bounds
Cannot hope to do better than:
• log n
• T^{1/2}
What can we use this for?
• Can use to combine multiple algorithms to do nearly as
well as best in hindsight.
– E.g., online auctions: one expert per price level.
• Play a repeated game so as to do nearly as well as the best
strategy in hindsight: regret minimization.
• Extensions: “bandit problem”, movement costs.
“No-regret” algorithms for repeated games
Repeated play of matrix game with N rows. (Algorithm is
row-player, rows represent different possible actions).
[Diagram: at each step the Algorithm picks a row i_t and the Adversary (world, life) picks a column C_t.]
• At each step t: the algorithm picks a row; life picks a column.
– The algorithm pays the cost C_t(i_t) of the action chosen.
– The algorithm gets the column C_t as feedback
(or just its own cost, in the “bandit” model).
– Assume a bound on the maximum cost: all costs are between 0 and 1.
“No-regret” algorithms for repeated games
• At each time step, algorithm picks row, life picks
column.
• The algorithm pays the cost C_t(i_t) of the action chosen, and gets the column C_t as feedback.
• Assume all costs are between 0 and 1.
Define the average regret in T time steps as:
  [avg cost of alg] - [avg cost of best fixed row in hindsight]
  = (1/T) [ Σ_{t=1}^T C_t(i_t) - min_i Σ_{t=1}^T C_t(i) ]
Want this to go to 0 or better as T gets large
[= “no-regret” algorithm].
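A small sketch of this definition in Python (the history format, per-round cost vectors plus the rows actually chosen, is assumed for illustration):

def average_regret(costs, chosen):
    # costs[t][i] = C_t(i); chosen[t] = i_t, the row the algorithm played at round t.
    T = len(costs)
    n = len(costs[0])
    alg_cost = sum(costs[t][chosen[t]] for t in range(T))
    best_fixed = min(sum(costs[t][i] for t in range(T)) for i in range(n))
    return (alg_cost - best_fixed) / T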
Randomized Weighted Majority
/* Initialization */
  w_i ← 1 for all i ∈ {1, …, n}
/* Main loop */
  For t = 1, …, T:
    Let P_t(i) = w_i / Σ_{j=1}^n w_j
    Choose a row i according to P_t
    /* Update weights */
    Observe the column C_t
    For i ∈ {1, …, n}:  w_i ← w_i · (1 - ε)^{C_t(i)}
Analysis
• Similar to the {0,1} case:
  E[cost of RWM] ≤ (min_i Σ_{t=1}^T C_t(i)) / (1-ε) + ln(n)/ε
                ≤ (min_i Σ_{t=1}^T C_t(i)) (1+2ε) + ln(n)/ε        for ε ≤ 1/2.
• No regret: as T grows, the difference in average cost goes to 0.
Properties of no-regret algorithms.
• Time-average performance guaranteed to approach
minimax value V of game
– or better, if life isn’t adversarial.
• Two NR algorithms playing against each other:
their empirical distributions approach minimax optimal play.
• The existence of no-regret algorithms yields a proof of the
minimax theorem.
von Neumann’s Minimax Theorem
• Zero-sum game: u2(a1,a2) = -u1(a1,a2)
Theorem:
• For any two-player zero-sum game with finite
strategy sets A1, A2 there is a value v ∈ R, the
game value, s.t.
  v = max_{p ∈ Δ(A1)} min_{q ∈ Δ(A2)} u1(p,q)
    = min_{q ∈ Δ(A2)} max_{p ∈ Δ(A1)} u1(p,q)
  (Δ(A) = mixed strategies over A)
• For all mixed Nash equilibria (p,q): u1(p,q) = v
Convergence to Minimax
• Suppose we know
  v = max_{p ∈ Δ(A1)} min_{q ∈ Δ(A2)} u1(p,q) = min_{q ∈ Δ(A2)} max_{p ∈ Δ(A1)} u1(p,q)
• Consider the distribution q for player 2 given by the observed frequencies of
player 2's actions over T steps.
• There is a best response x ∈ A1 to q with u1(x,q) ≥ v (equivalently, u2(x,q) ≤ -v).
• If player 1 had always played x, its expected payoff would be at least vT.
• If player 1 follows a no-regret procedure, its payoff is at least vT - R, where R/T → 0.
• Using RWM: the average payoff is at least v - O((log n / T)^{1/2}).
Proof of the Minimax
• Want to show
  v = max_{p ∈ Δ(A1)} min_{q ∈ Δ(A2)} u1(p,q) = min_{q ∈ Δ(A2)} max_{p ∈ Δ(A1)} u1(p,q)
Consider, for player 1:
• v1max = min_{q ∈ Δ(A2)} max_{x ∈ A1} u1(x,q)    [the best choice, given player 2's distribution]
• v1min = max_{p ∈ Δ(A1)} min_{q ∈ Δ(A2)} u1(p,q)  [the best distribution, not given player 2's distribution]
For player 2 (with payoff -u1):
• v2max = min_{p ∈ Δ(A1)} max_{y ∈ A2} (-u1(p,y))
• v2min = max_{q ∈ Δ(A2)} min_{x ∈ A1} (-u1(x,q))
• Need to prove v1max = v1min. The easy direction: v1max ≥ v1min
(responding to the opponent's distribution can only help).
• Suppose, towards a contradiction, that v1max = v1min + δ for some δ > 0.
• Players 1 and 2 each follow a no-regret procedure for T steps with regret R.
• Choose T large enough that R/T < δ/2.
Proof of the Minimax (cont.)
• Suppose v1max = v1min + δ for some δ > 0, and both players follow a no-regret
procedure for T steps with regret R, where R/T < δ/2.
• Let G be player 1's total payoff over the T steps; since the game is zero-sum,
player 2's total payoff is -G.
• Let G1 and G2 be the best-response payoffs of players 1 and 2 against the
opponent's empirical distribution.
• Then G1/T ≥ v1max and G2/T ≥ v2max = -v1min.
• No-regret gives G ≥ G1 - R and -G ≥ G2 - R.
• Combining: T·v1max - R ≤ G ≤ T·v1min + R, so v1max - v1min ≤ 2R/T < δ,
a contradiction. Hence v1max = v1min.
History and development
• [Hannan ’57, Blackwell ’56]: algorithms with regret O((N/T)^{1/2}).
– Need T = O(N/ε²) steps to get time-average regret ε; call this quantity T_ε.
– Optimal dependence on T (or ε); views N as a constant.
• Learning theory, 80s-90s: “combining expert advice”.
– Perform (nearly) as well as the best f ∈ C; views N as large.
– [Littlestone-Warmuth ’89]: the Weighted Majority algorithm,
  E[cost] ≤ OPT(1+ε) + (log N)/ε ≤ OPT + εT + (log N)/ε
• Regret O(((log N)/T)^{1/2}); T_ε = O((log N)/ε²).
Why Regret Minimization?
• Finding Nash equilibria can be computationally
difficult
• Not clear that agents would converge to it, or
remain in one if there are several
• Regret minimization is realistic:
– There are efficient algorithms that minimize regret
– It can be computed locally by each player,
– Players improve by lowering regret
Efficient implicit implementation for large n…
• Bounds have only log dependence on n.
• So, conceivably we can do well even when n is
exponential in the natural problem size, if only we could
implement the algorithm efficiently.
• E.g., the case of paths to a target in a graph…
• Recent years: a series of results giving efficient
implementations/alternatives in various settings
The Eavesdropping Game
• Let G = (V,E)
• Player 1 chooses an edge e ∈ E
• Player 2 chooses a spanning tree T
• Payoff u1(e,T) = 1 if e ∈ T and 0 otherwise
• The number of moves is exponential in |G|
• But the best response for Player 2, given a distribution on
edges, is easy:
– compute a minimum spanning tree with respect to the edge probabilities
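A minimal sketch of Player 2's best response given a distribution on edges, as a minimum spanning tree under the probabilities via Kruskal's algorithm (the function and input names are illustrative, and G is assumed connected):

def best_response_tree(vertices, edge_prob):
    # edge_prob maps an undirected edge (u, v) to the probability Player 1 picks it.
    # Player 2 minimizes the chance that Player 1's edge lands in the tree,
    # i.e., finds a minimum spanning tree with the probabilities as weights.
    parent = {v: v for v in vertices}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path compression
            v = parent[v]
        return v
    tree = []
    for (u, v), p in sorted(edge_prob.items(), key=lambda kv: kv[1]):
        ru, rv = find(u), find(v)
        if ru != rv:                        # adding (u, v) keeps the tree acyclic
            parent[ru] = rv
            tree.append((u, v))
    return tree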
Correlated Equilibrium and Swap Regret
• What about Nash?
• For correlated equilibria: if every player uses an algorithm with low
swap regret, then the empirical distribution of play converges to a
correlated equilibrium.
Back to the Peanuts Game
• There are n bins.
• At each round, “nature” throws a peanut into one of the bins
• If you (the player) are at the chosen bin – you catch the peanut
– Otherwise – you miss it
• You may choose to move to any bin before any round
• Game ends when d peanuts were thrown at some bin
Homework: what guarantees can you provide?