
Constraints in Repeated Games

Rational Learning Leads to Nash Equilibrium

Kalai & Lehrer, 1993

…so what is rational learning?

What is Rational Learning?

Rational learning is…

Bayesian Updating

frequentist vs. Bayesian statistics

Frequentist Approach

•Flip a coin 10 times and it comes up heads 8 times
•A frequentist approach would conclude that the coin comes up heads 80% of the time
•Using the relative frequency as a probability estimate, we can calculate the maximum likelihood estimate (MLE)
•The frequentist MLE is not always accurate in all contexts
•For μ_m the model asserting P(head) = m, and s an observed sequence, the MLE is:

arg max_m P(s | μ_m)

Bayesian Approach

•Allows us to incorporate prior beliefs, e.g., that our coin is fair (why not?)
•We can measure degrees of belief, which can be updated in the face of evidence using Bayes’ theorem:

P(μ_m | s) = P(s | μ_m) · P(μ_m) / P(s)

•We already have P(s | μ_m); we can quantify P(μ_m) and ignore the normalization factor P(s)

arg max_m P(μ_m | s) = 0.75 for P(μ_m) = 6m(1 − m)

Under What Conditions?

•Infinitely repeated game
•Subjective beliefs about others are compatible with true strategies
•Players know their own payoff matrices
•Players choose strategies to maximize their expected utility
•Perfectly monitored
•Discounted payoffs

…must eventually play according to a Nash equilibrium of the repeated game

What Isn’t Needed

•Assumptions about the rationality of other players
•Knowledge of the payoff matrices of other players

Definitions

A game is perfectly monitored if all players have access to the complete history of the game up to the point where they are currently at.

Discounting introduces a factor that future payoffs are multiplied by:

u_i(f) = (1 − λ_i) ∑_{t=0}^{∞} E_f(x_i^{t+1}) λ_i^t

note the relation to geometric series
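The normalization (1 − λ_i) can be sanity-checked on a constant payoff stream using the geometric series; a small sketch, approximating the infinite sum with a long finite prefix:

```python
def discounted_payoff(payoffs, lam):
    """(1 - lam) * sum over t of payoffs[t] * lam**t, on a finite prefix."""
    return (1 - lam) * sum(x * lam ** t for t, x in enumerate(payoffs))

# A constant stream of payoff 1 is worth 1 under any discount factor,
# since (1 - lam) * sum lam**t = (1 - lam) / (1 - lam) = 1.
approx = discounted_payoff([1.0] * 1000, 0.9)
print(approx)
```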

…continued

Beliefs are compatible with true strategies if the distribution over infinite play paths induced by the belief is absolutely continuous with respect to that of the true strategies.

A measure μ_f is absolutely continuous with respect to μ_g (denoted μ_f << μ_g) if every event having a positive measure according to μ_f also has a positive measure according to μ_g.

More Definitions

Let  > 0 and let same space.   and  ’ be two probability measures defined on the is  -close to  ’ if there is a measurable set Q satisfying:  (Q) and  ’(Q) are greater than 1  for every measurable set A  Q (1  )  ’(A) <=  (A) <= (1 +  )  ’(A) For  >= 0, f plays  -like g if  f is  -close to  g

Optimality and Domination in Repeated Games

Fortnow & Whang

…so what are optimality and domination?

Two Types of Infinite Repeated Games

•History (h) -> strategy (σ) -> action (a) … payoff (u)
•Limit of means game (G∞):

u_i^{G∞}(σ_I, σ_II) = lim inf_{k->∞} (1/k) ∑_{j=1}^{k} u_i(a_I^j(σ_I, σ_II), a_II^j(σ_I, σ_II))

•Discounted game (G_λ) with discount 0 < λ < 1:

u_i^{Gλ}(σ_I, σ_II) = (1 − λ) ∑_{j=1}^{∞} λ^{j−1} u_i(a_I^j(σ_I, σ_II), a_II^j(σ_I, σ_II))
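The limit-of-means payoff can be approximated by averaging longer and longer prefixes of the play path; a sketch with an assumed alternating payoff stream:

```python
from itertools import islice

def running_means(payoffs):
    """Yield the average payoff (1/k) * sum of the first k entries."""
    total = 0.0
    for k, u in enumerate(payoffs, start=1):
        total += u
        yield total / k

def alternating():
    """An infinite payoff stream 0, 1, 0, 1, ..."""
    while True:
        yield 0
        yield 1

# Alternating payoffs have limit-of-means value 1/2.
mean_10000 = list(islice(running_means(alternating()), 10000))[-1]
print(mean_10000)  # 0.5
```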

Definitions

Optimality: their way of saying Nash equilibrium

u_i^{G∞}(σ_I, σ_II) − u_i^{G∞}(σ_I′, σ_II) >= 0
lim inf_{λ->1} (u_i^{Gλ}(σ_I, σ_II) − u_i^{Gλ}(σ_I′, σ_II)) >= 0

Domination: reciprocal best response, not for one opposing strategy but for all: for every choice of strategy σ_II, the strategy σ_I is optimal.

Example

Mozart or Mahler:

2, 2   0, 0
0, 0   1, 1

Prisoners’ dilemma:

3, 3   0, 4
4, 0   1, 1

Mozart or Mahler has two optimal strategies, and prisoners’ dilemma has one.

Mozart or Mahler has no dominant strategies, and prisoners’ dilemma has one.

Classes of Strategies

•All possible strategies (rational): uncountably many
•Those strategies implemented on a Turing machine that always halts (recursive)
•Those strategies implemented on a Turing machine that halts in time polynomial in r between rounds r and r + 1 (polynomial)
•Those strategies implemented on a finite state automaton (regular)

can also allow behavioral versions of these strategies
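As a concrete instance of a regular strategy, tit-for-tat needs only two states, namely the opponent's last action; a sketch, with 'C'/'D' as assumed action labels:

```python
def tit_for_tat(opponent_history):
    """A two-state automaton: cooperate first, then copy the opponent."""
    return opponent_history[-1] if opponent_history else 'C'

# Five rounds against an always-defect opponent.
opp, mine = [], []
for _ in range(5):
    mine.append(tit_for_tat(opp))
    opp.append('D')
print(mine)  # ['C', 'D', 'D', 'D', 'D']
```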

Bounding the Number of Rounds

We want our payoff functions to converge at a reasonable rate to the final payoff of the game.

With average payoff function

u_i^k(σ_I, σ_II) = (1/k) ∑_{j=1}^{k} u_i(a_I^j(σ_I, σ_II), a_II^j(σ_I, σ_II))

we have

u_i^{G∞}(σ_I, σ_II) = lim inf_{k->∞} u_i^k(σ_I, σ_II)

…continued

We know that for all ε > 0 there is a round t such that for all k >= t,

u_i^k(σ_I, σ_II) >= u_i^{G∞}(σ_I, σ_II) − ε

Our bound will be to require that t be a function of ε and of the size of the strategy for σ_II (number of states of the FSA): σ_I converges in t rounds.

For discounted games, to bound the number of rounds we say σ_I converges in t rounds if, for ε > 0 and λ > 2^{−1/(t(s(σ_II))/ε)},

u_i^{Gλ}(σ_I, σ_II) − u_i^{Gλ}(σ_I′, σ_II) >= −ε

Previous Work

Gilboa & Samet (1989) showed that if player II is limited to strategies realized by strongly connected FSAs, then there exists a recursive dominant strategy.

Strong connection is needed to protect against vengeful strategies, i.e., those strategies that penalize the opponent forever simply for some earlier choice made in the game.

To extend this result to arbitrary finite automata we must weaken our notion of domination. An eventually dominant strategy is one that only requires domination of strategies that agree with it for some initial finite number of rounds.

Extension of Previous Work

For any game G, there is a recursive strategy σ_1 which is eventually dominant for the class of rational strategies against the class of strategies realized by finite automata.

The following results aim to show how well strategies of differing complexities perform against one another in certain cases.

In the next paper, we will see what happens when strategies are both restricted to the same complexity class.

Prisoner’s Dilemma vs. Matching Pennies

Prisoner’s Dilemma:

max(u_I(a_1, A_1), u_I(a_2, A_1)) != max(u_I(a_1, A_2), u_I(a_2, A_2))

Matching Pennies:

max(u_I(a_1, A_1), u_I(a_2, A_1)) = max(u_I(a_1, A_2), u_I(a_2, A_2))

More Results

Consider prisoner’s dilemma for any fixed 0 <= ε < 1 and n. If σ_I is any strategy for player I, then there is some rational strategy σ_II for player II, implemented by an FSA of n states, such that any ε-optimal strategy against σ_II will require an exponential number of rounds to converge.

For matching pennies, there exists a polynomial-time strategy that dominates all finite automata and converges in a polynomial number of rounds.

There exists a behavioral regular strategy for which there is no optimal rational strategy even for matching pennies.

…continued

For prisoner’s dilemma, there is a polynomial-time strategy σ_II for which there is no eventually optimal rational strategy σ_I.

For prisoner’s dilemma, there is some polynomial-time strategy σ_II such that there is an optimal rational strategy, but for all 0 <= ε < 1, there is no eventually ε-optimal recursive strategy for the class of rational strategies.

For matching pennies, there is a recursive strategy σ_I which is dominant for the class of rational strategies against the class of polynomial-time strategies.

Further Questions

•Does there exist a behavioral strategy for which there is no eventually optimal rational strategy?

•For some ε > 0, does there exist a behavioral regular strategy for which there is no eventually ε-optimal recursive or polynomial-time strategy?

•Does there exist a polynomial-time or recursive strategy that eventually dominates all behavioral regular strategies?

•Incomplete information? Finite Games? Infinite non-repeated stage games?

On Bounded Rationality and Computational Complexity

Papadimitriou & Yannakakis

…so what is bounded rationality?

Bounded Rationality

From Simon:

Reasoning and computation are costly, so agents don’t invest inordinate amounts of computational resources and reasoning power to achieve relatively insignificant gains in their payoff

We can implement bounded rationality by restricting the computational complexity of a strategy.

but why would we want to?

Motivation

•Leads to a more accurate model of the world •Has interesting game-theoretic consequences •Increased elegance (no needlessly complicated strategies) •Leads to more and better cooperation, and therefore higher payoffs

Example

Consider the prisoner’s dilemma, repeated n times for n > 1. The only Nash equilibrium of this game is

(D^n, D^n)

Both players play (D, D) in the stage game in the last round, and proceeding backwards by induction, they play (D, D) in all rounds.

Shouldn’t we be able to do better than this?
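The induction step can be made concrete: once continuation play is fixed, only the stage payoff varies, and D maximizes it against either opponent action. A sketch using the prisoner's dilemma payoffs from the earlier example slide:

```python
# Player I's stage payoffs: (my action, opponent action) -> payoff.
payoff = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 4, ('D', 'D'): 1}

def best_action(opponent_action, future_value):
    # By induction, the continuation value is the same for either action,
    # so the best response depends only on the stage payoff.
    return max('CD', key=lambda a: payoff[(a, opponent_action)] + future_value)

# D is the best response whatever the opponent does and whatever the future holds.
choices = [best_action(opp, fv) for opp in 'CD' for fv in (0, 10)]
print(choices)  # ['D', 'D', 'D', 'D']
```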

...continued

For the n-repeated prisoner’s dilemma, the strategy space is doubly exponential in n. This is not realistic for even small n.

Our undesirable result (no cooperation) arises when we place no constraints on the complexity (i.e., states in the FSA) of the strategies.

What happens when we constrain the complexity?

Good Things

If we require that s I (n) and s II (n) are less than n -1, then the FSAs can’t count to n, and backwards induction fails.

In this case, tit-for-tat is a Nash equilibrium.

Neyman: for s_I(n) and s_II(n) between n^{1/k} and n^k for k > 1, there is an equilibrium that approximates cooperation (payoff 3 − 1/k).

If s_I(n), s_II(n) >= 2^n, then backwards induction is possible via dynamic programming.

Theorem 1

For all subexponential complexities there are equilibria that are arbitrarily close to the collaborative behavior

If at least one of the state bounds in the n-round prisoner’s dilemma is 2^{o(n)}, then for large enough n, there is a mixed equilibrium with average payoff for each player at least 3 − ε.

This can be extended to arbitrary games and payoffs by making ε a function of these new parameters.

The Idea

The number of histories is exponential in the length of the game. Memory states can be filled with small histories (to use up space); the remaining states are then few enough that the automata can’t count too high and use backwards induction to always defect, so cooperation is enabled.

Some Details

Players exchange short customized sequences of Cs and Ds (“business cards”), then periodically repeat the XOR of these sequences, intermittently with long periods of cooperation.

The advantage held by players with D-heavy business cards must be cancelled; the imbalance in the periodic repetitions is solved with the XOR (then players get the same payoff as each other).

The possibility of saving states by misusing punitive transitions (reached only through dishonesty) must be eliminated.
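One way to read the XOR step (an assumed encoding, treating D as 1 and C as 0): combining both business cards gives each player the same sequence to replay, so neither keeps the advantage of a D-heavy card:

```python
def xor_cards(card1, card2):
    """XOR two C/D sequences position by position (D = 1, C = 0)."""
    return ''.join('D' if a != b else 'C' for a, b in zip(card1, card2))

combined = xor_cards('CDDC', 'CCDD')
print(combined)  # CDCD
```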

General Games (definitions)

The minimax of player I is

v_1 = min_{y ∈ Y} max_{x ∈ X} g_I(x, y)

Player I can always guarantee this much payoff, assuming that player II uses a strategy known to player I.
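Over pure strategies the minimax value is a direct min-over-columns of max-over-rows; a sketch (the mixed-strategy version would need a linear program):

```python
def minimax_value(g1):
    """g1[x][y]: player I's payoff when I plays row x and II plays column y.
    v1 = min over II's choices of the best-reply payoff I can secure."""
    return min(max(g1[x][y] for x in range(len(g1)))
               for y in range(len(g1[0])))

# Prisoner's dilemma payoffs for player I (rows C, D; columns C, D).
v1 = minimax_value([[3, 0], [4, 1]])
print(v1)  # 1, the threat-point payoff from the slides
```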

v = (v_1, v_2) is called the threat point ((1, 1) in prisoner’s dilemma).

The feasible region is the convex hull of payoff combinations.

The individually rational region is the part of the feasible region that dominates the corresponding threat point.

General Games (theorems)

The Folk Theorem: in the infinitely repeated game, all points in the mixed individually rational region are equilibria.

The Folk Theorem for Automata: let (a, b) be a payoff combination in the infinitely repeated game with automata. TFAE:
•(a, b) is a pure equilibrium payoff
•(a, b) is a mixed equilibrium payoff with finite support and rational coefficients
•(a, b) is a rational point in the pure nonstrict individually rational region

More Definitions

For pure strategy pairs (A, B) and (A′, B′): they are dependent if A = A′ or B = B′, and independent otherwise; they are aligned if g_I(A, B) = g_I(A′, B′) or g_II(A, B) = g_II(A′, B′), and nonaligned otherwise.

Every point on the Pareto boundary corresponds to either a pure strategy or a convex combination of two nonaligned pure strategies.

Another Theorem

Let G be an arbitrary game and let p = (p_1, p_2) be a point in the strict, pure individually rational region. For every ε > 0, there are a, c, n > 0 such that for m >= n > 0, in the n-round repeated game G played by automata with sizes bounded by a, there is a mixed equilibrium with average payoff for each player within ε of p_i if either:
(i) p can be realized by pure strategies and at least one of the bounds is smaller than 2^{c·n}, or
(ii) p can be realized as the convex combination of two nonaligned (or independent) pure strategy pairs, and both bounds are smaller than 2^{c·n}