Constraints in Repeated Games
Rational Learning Leads to Nash Equilibrium
Kalai & Lehrer, 1993
…so what is rational learning?
What is Rational Learning?
Rational learning is…
Bayesian Updating
frequentist vs. Bayesian statistics
Frequentist Approach
•Suppose we flip a coin 10 times and it comes up heads 8 times •A frequentist approach would conclude that the coin comes up heads 80% of the time •Using the relative frequency as a probability estimate, we can calculate the maximum likelihood estimate (MLE) •The frequentist MLE is not accurate in all contexts •For m the model asserting P(heads) = m, and s an observed sequence, the MLE is:
arg max_m P(s | m)
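As a sanity check, here is a minimal sketch (function and variable names are my own, not the paper's) that recovers the MLE for 8 heads in 10 flips by grid search:

```python
import numpy as np

# Likelihood of observing `heads` heads in `flips` flips under the model
# asserting P(heads) = m (the binomial coefficient is constant in m,
# so it does not affect the argmax)
def likelihood(m, heads=8, flips=10):
    return m**heads * (1 - m)**(flips - heads)

# Grid search over candidate models m in [0, 1]
grid = np.linspace(0, 1, 10001)
mle = grid[np.argmax(likelihood(grid))]
print(round(mle, 2))  # 0.8, the relative frequency of heads
```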
Bayesian Approach
•Allows us to incorporate prior beliefs, e.g., that our coin is fair (why not?) •We can measure degrees of belief, which can be updated in the face of evidence using Bayes’ theorem:
P(m | s) = P(s | m) · P(m) / P(s)
•We already have P(s | m); we can quantify P(m) and ignore the normalization factor P(s) •For the prior P(m) = 6m(1 − m):
arg max_m P(m | s) = 0.75
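The 0.75 figure can be checked numerically. A minimal sketch (same hypothetical helper names as above) that maximizes the unnormalized posterior under the prior P(m) = 6m(1 − m):

```python
import numpy as np

def likelihood(m, heads=8, flips=10):
    return m**heads * (1 - m)**(flips - heads)  # P(s | m) up to a constant

def prior(m):
    return 6 * m * (1 - m)  # a Beta(2, 2) density: mass concentrated near fair

# Unnormalized posterior: P(m | s) is proportional to P(s | m) P(m);
# the normalization factor P(s) is ignored, as it does not affect the argmax
grid = np.linspace(0, 1, 10001)
posterior = likelihood(grid) * prior(grid)
map_estimate = grid[np.argmax(posterior)]
print(round(map_estimate, 2))  # 0.75: the prior pulls the estimate toward 0.5
```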
Under What Conditions?
•Infinitely repeated game •subjective beliefs about others are compatible with true strategies •Players know their own payoff matrices •Players choose strategies to maximize their expected utility •Perfectly monitored •Discounted payoffs
…must eventually play according to a Nash equilibrium of the repeated game
What Isn’t Needed
•assumptions about the rationality of other players •knowledge of the payoff matrices of other players
Definitions
A game is perfectly monitored if all players have access to the complete history of the game up to the current point.
Discounting introduces a factor λ_i by which future payoffs are multiplied:
u_i(f) = (1 − λ_i) ∑_{t=0}^{∞} E_f(x_i^{t+1}) λ_i^t
Note the relation to geometric series.
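To see the geometric-series connection concretely: with a constant payoff stream, the (1 − λ) factor normalizes the discounted sum back to the per-period payoff. A minimal sketch (names are my own, not the paper's notation):

```python
# Discounted payoff of a stream: (1 - lam) * sum_t x_t * lam**t
def discounted_payoff(payoffs, lam):
    return (1 - lam) * sum(x * lam**t for t, x in enumerate(payoffs))

# A constant stream of 3s: the geometric series sums to 3 / (1 - lam),
# so the normalized payoff is exactly 3
lam = 0.9
stream = [3.0] * 500  # long finite truncation of the infinite stream
print(round(discounted_payoff(stream, lam), 6))  # 3.0 (up to truncation error)
```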
…continued
Beliefs are compatible with true strategies if the distribution over infinite play paths induced by the beliefs is absolutely continuous with respect to that of the true strategies.
A measure f is absolutely continuous with respect to g (denoted f << g) if every event having a positive measure according to f also has a positive measure according to g.
More Definitions
Let ε > 0 and let μ and μ’ be two probability measures defined on the same space. μ is ε-close to μ’ if there is a measurable set Q satisfying:
μ(Q) and μ’(Q) are greater than 1 − ε
for every measurable set A ⊆ Q: (1 − ε) μ’(A) <= μ(A) <= (1 + ε) μ’(A)
For ε >= 0, f plays ε-like g if f is ε-close to g
Optimality and Domination in Repeated Games
Fortnow & Whang
…so what are optimality and domination?
Two Types of Infinite Repeated Games
•History (h) -> strategy (σ) -> action (a) … payoff (u) •Limit of means game (G∞):
u_i^{G∞}(σ_I, σ_II) = lim inf_{k->∞} (1/k) ∑_{j=1}^{k} u_i(a_I^j(σ_I, σ_II), a_II^j(σ_I, σ_II))
•Discounted game (G_λ) with discount 0 < λ < 1:
u_i^{G_λ}(σ_I, σ_II) = (1 − λ) ∑_{j=1}^{∞} λ^{j−1} u_i(a_I^j(σ_I, σ_II), a_II^j(σ_I, σ_II))
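A minimal sketch contrasting the two payoff criteria on a finite truncation of a play path (the lim inf is approximated by a long finite average; names are my own):

```python
def average_payoff(payoffs):
    # Finite-horizon stand-in for the lim inf limit-of-means payoff
    return sum(payoffs) / len(payoffs)

def discounted_payoff(payoffs, lam):
    # (1 - lam) * sum over rounds j >= 1 of lam**(j-1) * u_j
    return (1 - lam) * sum(lam**t * u for t, u in enumerate(payoffs))

path = [j % 2 for j in range(10000)]  # payoffs alternate 0, 1, 0, 1, ...
print(round(average_payoff(path), 3))           # 0.5
print(round(discounted_payoff(path, 0.99), 3))  # 0.497: early 0s weigh more
```

With discounting close to 1 the two criteria nearly agree on this path; smaller λ would tilt the discounted payoff further toward the early rounds.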
Definitions
Optimality: their way of saying Nash equilibrium; σ_I is optimal against σ_II if for every σ_I’:
u_i^{G∞}(σ_I, σ_II) − u_i^{G∞}(σ_I’, σ_II) >= 0
lim inf_{λ->1} (u_i^{G_λ}(σ_I, σ_II) − u_i^{G_λ}(σ_I’, σ_II)) >= 0
Domination: best response not for one opposing strategy, but for all: σ_I is dominant if for every choice of strategy σ_II, the strategy σ_I is optimal.
Example
Mozart or Mahler:
2, 2   0, 0
0, 0   1, 1
Prisoners’ dilemma:
3, 3   0, 4
4, 0   1, 1
Mozart or Mahler has two optimal strategies, and prisoners’ dilemma has one.
Mozart or Mahler has no dominant strategies and prisoners’ dilemma has one.
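These claims can be verified by brute force. A minimal sketch (function names are my own) that enumerates pure equilibria and dominant row strategies in a 2x2 bimatrix game:

```python
import itertools

def pure_equilibria(payoffs):
    """Pure-strategy Nash equilibria of a 2x2 bimatrix game, where
    payoffs[i][j] = (u_I, u_II) when I plays row i and II plays column j."""
    eq = []
    for i, j in itertools.product(range(2), repeat=2):
        best_row = all(payoffs[i][j][0] >= payoffs[k][j][0] for k in range(2))
        best_col = all(payoffs[i][j][1] >= payoffs[i][k][1] for k in range(2))
        if best_row and best_col:
            eq.append((i, j))
    return eq

def dominant_rows(payoffs):
    # Rows that are a best response to every column: a pure-strategy
    # analogue of the paper's notion of domination
    return [i for i in range(2)
            if all(payoffs[i][j][0] >= payoffs[k][j][0]
                   for j in range(2) for k in range(2))]

mozart_mahler = [[(2, 2), (0, 0)], [(0, 0), (1, 1)]]
prisoners     = [[(3, 3), (0, 4)], [(4, 0), (1, 1)]]
print(pure_equilibria(mozart_mahler))  # [(0, 0), (1, 1)] -- two optima
print(pure_equilibria(prisoners))      # [(1, 1)] -- one
print(dominant_rows(mozart_mahler))    # [] -- no dominant strategy
print(dominant_rows(prisoners))        # [1] -- defection dominates
```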
Classes of Strategies
•All possible strategies (rational) -- uncountably many •Those strategies implemented on a Turing machine that always halts (recursive) •Those strategies implemented on a Turing machine that halts in time polynomial in r between rounds r and r + 1 (polynomial) •Those strategies implemented on a finite state automaton (regular)
We can also allow behavioral versions of these strategies.
Bounding the Number of Rounds
We want our payoff functions to have a reasonable rate of convergence to the final payoff of the game.
With average payoff function
u_i^k(σ_I, σ_II) = (1/k) ∑_{j=1}^{k} u_i(a_I^j(σ_I, σ_II), a_II^j(σ_I, σ_II))
we have u_i^{G∞}(σ_I, σ_II) = lim inf_{k->∞} u_i^k(σ_I, σ_II)
…continued
We know that for all ε > 0 there is a round t such that for all k >= t:
u_i^k(σ_I, σ_II) >= u_i^{G∞}(σ_I, σ_II) − ε
in which case we say σ_I converges in t rounds.
Our bound will be to require that t be a function of the strategy for II (the number of states s(σ_II) of its FSA) and the size of ε.
For discounted games, to bound the number of rounds we use ε > 0 and λ > 2^{−1/(t(s(σ_II))/ε)}, requiring u_i^{G_λ}(σ_I, σ_II) − u_i^{G∞}(σ_I’, σ_II) >= −ε.
Previous Work
Gilboa & Samet (1989) showed that if player II is limited to strategies realized by strongly connected FSAs, then there exists a recursive dominant strategy.
Strong connectivity is needed to protect against vengeful strategies, i.e., those strategies that penalize the opponent forever simply for some earlier choice made in the game.
To extend this result to arbitrary finite automata we must weaken our notion of domination. An eventually dominant strategy is one that only requires domination of strategies that agree with it for some initial finite number of rounds.
Extension of Previous Work
For any game G, there is a recursive strategy which is eventually dominant for the class of rational strategies against the class of strategies realized by finite automata.
The following results aim to show how well strategies of differing complexities perform against one another in certain cases. In the next paper, we will see what happens when both strategies are restricted to the same complexity class.
Prisoner’s Dilemma vs. Matching Pennies
Prisoner’s Dilemma:
max(u_I(a_1, A_1), u_I(a_2, A_1)) != max(u_I(a_1, A_2), u_I(a_2, A_2))
Matching Pennies:
max(u_I(a_1, A_1), u_I(a_2, A_1)) = max(u_I(a_1, A_2), u_I(a_2, A_2))
More Results
Consider prisoner’s dilemma for any fixed 0 <= ε < 1 and any n. There is some rational strategy σ_II for player II, implemented by an FSA of n states, such that any ε-optimal strategy σ_I for player I against σ_II will require an exponential number of rounds to converge.
For matching pennies, there exists a polynomial-time strategy that dominates all finite automata and converges in a polynomial number of rounds.
There exists a behavioral regular strategy for which there is no optimal rational strategy even for matching pennies.
…continued
For prisoner’s dilemma, there is a polynomial-time strategy σ_II for which there is no eventually optimal rational strategy σ_I.
For prisoner’s dilemma, there is some polynomial-time strategy σ_II such that there is an optimal rational strategy, but for all 0 <= ε < 1, there is no eventually ε-optimal recursive strategy.
For matching pennies, there is a recursive strategy σ_I which is dominant for the class of rational strategies against the class of polynomial-time strategies.
Further Questions
•Does there exist a behavioral strategy for which there is no eventually optimal rational strategy?
•For some ε > 0, does there exist a behavioral regular strategy for which there is no eventually ε-optimal recursive or polynomial-time strategy?
•Does there exist a polynomial-time or recursive strategy that eventually dominates all behavioral regular strategies?
•Incomplete information? Finite Games? Infinite non-repeated stage games?
On Bounded Rationality and Computational Complexity
Papadimitriou & Yannakakis
…so what is bounded rationality?
Bounded Rationality
From Simon:
Reasoning and computation are costly, so agents don’t invest inordinate amounts of computational resources and reasoning power to achieve relatively insignificant gains in their payoff
We can implement bounded rationality by restricting the computational complexity of a strategy.
but why would we want to?
Motivation
•Leads to a more accurate model of the world •Has interesting game-theoretic consequences •Increased elegance (no needlessly complicated strategies) •Leads to more and better cooperation, and therefore higher payoffs
Example
Consider the prisoner’s dilemma, repeated n times for n > 1. The only Nash equilibrium of this game is (D^n, D^n).
Both players play the stage-game equilibrium (D, D) in the last round, and by proceeding backwards by induction they play (D, D) in all rounds.
Shouldn’t we be able to do better than this?
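A minimal sketch of the unraveling argument, using the prisoners’ dilemma payoffs from the earlier example (the encoding is my own):

```python
PD = {('C', 'C'): (3, 3), ('C', 'D'): (0, 4),
      ('D', 'C'): (4, 0), ('D', 'D'): (1, 1)}

# In the last round, D strictly dominates C for player I (symmetrically for II):
assert all(PD[('D', b)][0] > PD[('C', b)][0] for b in 'CD')

def backwards_induction(n):
    """Unravel the n-round game: once play in all later rounds is pinned to
    (D, D), the continuation payoff is action-independent, so the stage-game
    dominant action D is the unique best response in the current round too."""
    plan = []
    for _ in range(n):
        a_I = max('CD', key=lambda a: PD[(a, 'D')][0])   # opponent plays D
        a_II = max('CD', key=lambda a: PD[('D', a)][1])
        plan.append((a_I, a_II))
    return plan

print(backwards_induction(3))  # [('D', 'D'), ('D', 'D'), ('D', 'D')]
```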
...continued
For the n-repeated prisoner’s dilemma, the strategy space is doubly exponential in n. This is not realistic for even small n.
Our undesirable result (no cooperation) arises when we place no constraints on the complexity (i.e., states in the FSA) of the strategies.
What happens when we constrain the complexity?
Good Things
If we require that s I (n) and s II (n) are less than n -1, then the FSAs can’t count to n, and backwards induction fails.
In this case, tit-for-tat is a Nash equilibrium.
Neyman: for s_I(n) and s_II(n) between n^{1/k} and n^k for k > 1, there is an equilibrium that approximates cooperation (payoff 3 − 1/k).
If s_I(n), s_II(n) >= 2^n then backwards induction is possible via dynamic programming.
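A minimal sketch of tit-for-tat as a tiny automaton whose only state is the opponent’s last action, played against itself and against always-defect (the simulation harness and names are my own):

```python
PD = {('C', 'C'): (3, 3), ('C', 'D'): (0, 4),
      ('D', 'C'): (4, 0), ('D', 'D'): (1, 1)}

def tit_for_tat(last_opponent_action):
    # Cooperate first, then copy whatever the opponent did last round
    return 'C' if last_opponent_action is None else last_opponent_action

def always_defect(last_opponent_action):
    return 'D'

def play(strategy_I, strategy_II, rounds):
    last_I = last_II = None
    totals = [0, 0]
    for _ in range(rounds):
        a_I, a_II = strategy_I(last_II), strategy_II(last_I)
        u = PD[(a_I, a_II)]
        totals[0] += u[0]
        totals[1] += u[1]
        last_I, last_II = a_I, a_II
    return totals

print(play(tit_for_tat, tit_for_tat, 100))    # [300, 300]: full cooperation
print(play(tit_for_tat, always_defect, 100))  # [99, 103]: exploited once only
```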
Theorem 1
For all subexponential complexities there are equilibria that are arbitrarily close to the collaborative behavior
If at least one of the state bounds in the n-round prisoner’s dilemma is 2^{o(n)}, then for large enough n there is a mixed equilibrium with average payoff for each player at least 3 − ε.
This can be extended to arbitrary games and payoffs by making ε a function of these new parameters.
The Idea
The number of histories is exponential in the length of the game. Memory states can be filled with small histories (to use up space); for the remaining states (few enough that they can’t count too high and use backwards induction to always defect), cooperation is enabled.
Some Details
Players exchange short customized sequences of Cs and Ds (“business cards”), then periodically repeat these, intermittently with long periods of cooperation, using the XOR of the two sequences.
The advantage held by players with D-heavy business cards must be cancelled; the imbalance in the periodic repetitions is solved with the XOR (so players get the same payoff as each other).
The possibility of saving states by misusing punitive transitions, detected only upon dishonesty, must be eliminated.
General Games (definitions)
The minimax of player I is v_1 = min_{y ∈ Y} max_{x ∈ X} g_I(x, y). Player I can always guarantee this much payoff, assuming that player II uses a strategy known to player I.
v = (v_1, v_2) is called the threat point ((1, 1) in prisoner’s dilemma).
The feasible region is the convex hull of payoff combinations.
The individually rational region is the part of the feasible region that dominates the threat point.
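A minimal sketch computing pure-action minimax values for prisoner’s dilemma (the definition allows mixed strategies for player II; for PD the pure version already yields the threat point (1, 1)):

```python
def minimax_value(payoffs, player):
    """Pure-action minimax of a 2x2 bimatrix game: the opponent picks the
    action minimizing the player's best-response payoff."""
    if player == 0:  # II picks a column to minimize I's best row payoff
        return min(max(payoffs[i][j][0] for i in range(2)) for j in range(2))
    else:            # I picks a row to minimize II's best column payoff
        return min(max(payoffs[i][j][1] for j in range(2)) for i in range(2))

prisoners = [[(3, 3), (0, 4)], [(4, 0), (1, 1)]]
threat_point = (minimax_value(prisoners, 0), minimax_value(prisoners, 1))
print(threat_point)  # (1, 1): mutual defection is the threat point
```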
General Games (theorems)
The Folk Theorem: In the infinitely repeated game, all points in the mixed individually rational region are equilibrium payoffs.
The Folk Theorem for Automata: Let (a, b) be a payoff combination in the infinitely repeated game with automata. TFAE:
(a, b) is a pure equilibrium payoff
(a, b) is a mixed equilibrium payoff with finite support and rational coefficients
(a, b) is a rational point in the pure nonstrict individually rational region
More Definitions
For pure strategy pairs (A, B) and (A’, B’): they are dependent if A = A’ or B = B’, and independent otherwise; they are aligned if g_I(A, B) = g_I(A’, B’) or g_II(A, B) = g_II(A’, B’), and nonaligned otherwise.
Every point on the Pareto boundary corresponds to either a pure strategy or a convex combination of two nonaligned pure strategies.
Another Theorem
Let G be an arbitrary game and let p = (p_1, p_2) be a point in the strict, pure individually rational region. For every ε > 0, there are a, c, n > 0 such that for m >= n, in the m-round repeated game G played by automata with sizes bounded by a, there is a mixed equilibrium with average payoff for each player within ε of p_i if either:
(i) p can be realized by pure strategies and at least one of the bounds is smaller than 2^{c·m}
(ii) p can be realized as the convex combination of two nonaligned (or independent) pure strategy pairs, and both bounds are smaller than 2^{c·m}