Joint Strategy Fictitious Play

Joint Strategy Fictitious Play Sherwin Doroudi

“Adapted” from J. R. Marden, G. Arslan, J. S. Shamma, “Joint strategy fictitious play with inertia for potential games,” in Proceedings of the 44th IEEE Conference on Decision and Control, December 2005, pp. 6692-6697.

Review: Game • Players: • Actions: • Payoffs:

Review: Game We then play the game repeatedly in “stages,” starting at stage 0. Players can use learning algorithms as discussed in lecture. Note that players know the structural form of their own payoff function, but do not know the form of the other players’ payoff functions.

Notation: Actions As in the lecture, we use the notation

Review: Regret Matching • The empirical distribution of play is guaranteed to converge to the set of Coarse Correlated Equilibria (CCE) in all games (Hart & Mas-Colell, 2000).

• But a CCE can be quite bad in some cases, since the set of CCE is a superset of the set of Nash Equilibria (NE).

Review: Fictitious Play (FP) • Observe empirical frequencies of every player’s action • Consider best response(s) under the (incorrect) assumption that other players play according to their empirical frequencies • Randomly choose a best response and act accordingly

Empirical Frequency in FP The empirical frequency for a player and an action is the fraction of stages, up to and including the previous stage, in which the player chose that action:

Empirical Frequency in FP Each player also has an empirical frequency vector.
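The definition above can be sketched in a few lines. This is only an illustration: the function name, the `history` list, and the road labels are hypothetical, and the history is assumed to hold one player's actions at stages 0 through t-1.

```python
from collections import Counter

def empirical_frequency(history, actions):
    """Fraction of past stages in which each action was chosen.

    history: the actions one player chose at stages 0..t-1 (hypothetical input).
    actions: that player's action set.
    """
    t = len(history)
    if t == 0:
        raise ValueError("no stages have been played yet")
    counts = Counter(history)
    # Actions never played get frequency 0.0, so the result is a full vector.
    return {a: counts[a] / t for a in actions}

# Hypothetical history of one player over four stages.
q = empirical_frequency(["r1", "r2", "r1", "r1"], ["r1", "r2", "r3"])
print(q)
```

The returned dictionary plays the role of the player's empirical frequency vector: one entry per action, summing to 1.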

Best Response in FP Each player assumes an expected payoff And each player chooses a best response from the set
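A rough sketch of the FP best-response step, assuming the expected payoff is taken over the product of the other players' empirical frequencies (the independence assumption FP makes). The function names, the `payoff(i, profile)` signature, and the small coordination game are all hypothetical.

```python
from itertools import product

def fp_expected_payoff(i, a_i, payoff, action_sets, freqs):
    """Expected payoff to player i for a_i when the others are assumed to
    play independently according to their empirical frequencies."""
    others = [j for j in range(len(action_sets)) if j != i]
    total = 0.0
    # Enumerate every joint action of the other players.
    for combo in product(*(action_sets[j] for j in others)):
        prob = 1.0
        for j, a_j in zip(others, combo):
            prob *= freqs[j][a_j]
        profile = list(combo)
        profile.insert(i, a_i)  # put player i's action back in position i
        total += prob * payoff(i, tuple(profile))
    return total

def fp_best_responses(i, payoff, action_sets, freqs):
    """All actions of player i that maximize the FP expected payoff."""
    values = {a: fp_expected_payoff(i, a, payoff, action_sets, freqs)
              for a in action_sets[i]}
    best = max(values.values())
    return [a for a, v in values.items() if v == best]

# Hypothetical 3-player game: a player earns 1 for each other player it matches.
def match_payoff(i, profile):
    return sum(1 for j, a in enumerate(profile) if j != i and a == profile[i])

action_sets = [("A", "B")] * 3
freqs = [None, {"A": 0.75, "B": 0.25}, {"A": 0.5, "B": 0.5}]  # player 0's own entry is unused
print(fp_best_responses(0, match_payoff, action_sets, freqs))
```

Note that the loop visits every joint action of the other players, so its cost grows exponentially with the number of players.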

The Good News!

“The empirical frequencies generated by FP converge to a Nash equilibrium in potential games” (Monderer & Shapley, 1996).

The Bad News (if any)?

What are some weaknesses of FP?

A Routing Example • Consider a routing game with 100 players all with the same source and sink • There are 4 roads from the source to the sink • Players want to minimize their cost.

A Routing Example • The cost of traveling on each road is given by a quadratic cost function with positive coefficients (could be randomly generated) depending on the number of players choosing that road • Can we use FP as a learning algorithm in this example?

A Routing Example Formalizing the game, we have

A Routing Example Remember this?

The sum above is over 4^99=2^198 terms!

This is not computationally feasible!
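The size of that sum is easy to check with exact integer arithmetic. The evaluation rate used below (one billion terms per second) is an arbitrary assumption, chosen only to give a sense of scale.

```python
# Each of the 99 other players picks one of 4 roads, so one player's
# FP expected payoff sums over 4**99 joint actions of the others.
terms = 4 ** 99
assert terms == 2 ** 198

# Assumed rate: 1e9 terms per second (hypothetical hardware).
seconds = terms / 1e9
years = seconds / (365 * 24 * 3600)
print(f"{terms:.3e} terms, roughly {years:.1e} years at 1e9 terms/s")
```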

What do we do?

The routing example (which is fairly realistic) shows that we need either a more effective way to compute this utility or an algorithm that is computationally suitable for “large” games.

Joint Strategy Fictitious Play (JSFP) • Observe empirical frequencies of joint actions • Consider best response(s) under the (still incorrect) assumption that all other players act collectively as a group according to their joint empirical frequency • Randomly choose a best response and act accordingly

Does FP=JSFP?

• In the case of two players it is easy to see that FP and JSFP are the same • But with three or more players they are not necessarily the same!
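A small illustration of why the two can differ with three players, from player 1's point of view. The history below is hypothetical: players 2 and 3 always choose the same action, so their joint empirical frequency is not the product of their individual frequencies.

```python
from collections import Counter

# Hypothetical history of (player 2's action, player 3's action) over four stages.
history = [("A", "A"), ("B", "B"), ("A", "A"), ("B", "B")]
t = len(history)

# JSFP view: empirical frequency of the joint action of players 2 and 3.
joint = {pair: count / t for pair, count in Counter(history).items()}

# FP view: product of the two players' individual empirical frequencies.
freq2 = Counter(a2 for a2, _ in history)
freq3 = Counter(a3 for _, a3 in history)
independent = {(a2, a3): (freq2[a2] / t) * (freq3[a3] / t)
               for a2 in "AB" for a3 in "AB"}

print("joint:      ", joint)
print("independent:", independent)
```

Under the FP view the mismatched pair ("A", "B") gets positive weight even though it was never played; under the JSFP view it gets none.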

Empirical Frequency in JSFP The empirical frequency for an action profile may be calculated as follows:
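One way to write this frequency, offered as a sketch since the slide's own formula is not reproduced here. The symbols are assumptions in the spirit of the cited paper: \bar{a} is a fixed action profile, a(\tau) is the profile actually played at stage \tau, and I\{\cdot\} is the indicator function. Stages are counted from 0, so t stages have been played before stage t.

```latex
z^{\bar{a}}(t) \;=\; \frac{1}{t} \sum_{\tau=0}^{t-1} I\{\, a(\tau) = \bar{a} \,\}
```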

Expected Payoff in JSFP Each player assumes an expected payoff But this looks about as bad as (maybe worse than) FP!

So what can we do?

Expected Payoff in JSFP We rewrite the expected payoff in a more useful form!

The JSFP Payoff Recursion So now we can rewrite the expected payoff as a simple recursion and, at every stage, choose an action that maximizes it (our best response). Equivalently, we are choosing the action with maximal regret!
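A minimal sketch of such a recursion, assuming it is the running average of the payoff each action would have earned against the other players' actual past actions, which is how the cited paper presents it. The function name and the payoff stream below are hypothetical.

```python
def jsfp_update(V, t, payoff_if_played):
    """One step of the running-average recursion (assumed form).

    V[a]: average payoff action a would have earned over stages 0..t-1.
    payoff_if_played[a]: payoff a would have earned at stage t, given what
    the other players actually did at stage t.
    """
    return {a: (t * V[a] + payoff_if_played[a]) / (t + 1) for a in V}

# Hypothetical hypothetical-payoff stream for one player with actions x and y.
stream = [{"x": 1.0, "y": 4.0}, {"x": 3.0, "y": 0.0}, {"x": 5.0, "y": 2.0}]

V = {"x": 0.0, "y": 0.0}
for t, payoffs in enumerate(stream):
    V = jsfp_update(V, t, payoffs)

# The recursion reproduces the plain average without storing the history.
best_response = max(V, key=V.get)
print(V, best_response)
```

Each update touches one number per action, so a player never has to enumerate the other players' joint actions.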

Convergence Properties of JSFP The convergence properties of JSFP (for games of three or more players) remain unknown, so this is an open problem. But if the joint action generated by JSFP ever reaches a strict NE, play stays there forever. To get convergence properties, we add “inertia” to our learning algorithm.

JSFP with Inertia • Assume that all NE are strict • JSFP-1: If the action a player chose in the previous stage is still a best response at the current stage, the player chooses that action again • JSFP-2: Otherwise, the player chooses an action according to the distribution

The JSFP-2 Distribution Here the alpha parameter represents the player’s willingness to optimize at a given stage, the beta term is a distribution whose support is contained in the set of best responses at this stage, and the v term is a distribution that places all of its mass on the action taken in the previous stage.
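The two rules can be sketched as a single selection function. This reads the JSFP-2 distribution as a mixture: with probability alpha draw from beta, otherwise repeat the previous action. Taking beta to be uniform over the best responses is an assumption made here for simplicity, and the function and variable names are hypothetical.

```python
import random

def jsfp_inertia_action(prev_action, V, alpha, rng):
    """Choose this stage's action under JSFP with inertia (sketch).

    V[a]: the player's current expected payoff for action a.
    alpha: willingness to optimize, between 0 and 1.
    """
    best = max(V.values())
    best_responses = [a for a, v in V.items() if v == best]
    # JSFP-1: keep the previous action if it is still a best response.
    if prev_action in best_responses:
        return prev_action
    # JSFP-2: with probability alpha sample from beta (assumed uniform over
    # the best responses); otherwise stay with the previous action.
    if rng.random() < alpha:
        return rng.choice(best_responses)
    return prev_action

rng = random.Random(0)
V = {"r1": -5.0, "r2": -3.0}  # hypothetical payoffs (negative costs)
print(jsfp_inertia_action("r1", V, alpha=0.5, rng=rng))
```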

JSFP w/ Inertia Converges!

• In particular, it converges to some Nash Equilibrium in generalized ordinal potential games • Of course there is no equilibrium selection mechanism • And not much is known regarding the convergence rate • But we have shown that JSFP w/ Inertia is a good substitute for FP in “large” games

If you want the proof, read the paper; the proof is not trivial!

The Fading Memory Variant So far we used a recursion that weights all past stages equally. But we could also use a recursion that discounts older stages. Here, rho is a constant or function less than or equal to 1, and it has also been proven that this variant gives rise to a process converging to some NE.
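A sketch of what a fading-memory update could look like, assuming the form "new value = (1 - rho) * old value + rho * latest hypothetical payoff" with rho in (0, 1]. That form, the function name, and the numbers are assumptions for illustration.

```python
def fading_memory_update(V, payoff_if_played, rho):
    """Exponentially discounted version of the payoff recursion (assumed form).

    Larger rho forgets the past faster; rho = 1 keeps only the latest stage.
    """
    if not 0 < rho <= 1:
        raise ValueError("rho must lie in (0, 1]")
    return {a: (1 - rho) * V[a] + rho * payoff_if_played[a] for a in V}

V = {"x": 2.0, "y": 6.0}                      # hypothetical current values
V = fading_memory_update(V, {"x": 4.0, "y": 0.0}, rho=0.5)
print(V)
```

Unlike the running average, the weight on the newest stage does not shrink as stages accumulate, so recent behavior of the other players matters more.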

A Routing Example, Revisited • We can now apply JSFP w/ Inertia and fading memory to the routing problem, and we should converge to some NE (routing games are generalized ordinal potential games) • Simulations suggest that JSFP without inertia also works in this case • Try it!
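For readers who want to try it, here is a rough simulation sketch of the 100-player, 4-road game under JSFP with inertia and fading memory. Everything numeric is an assumption: the coefficient ranges, alpha = 0.5, rho = 0.1, the stage count, a uniform beta over best responses, and a fading-memory update of the form (1 - rho) * old + rho * new. It is not the setup used for the original simulations.

```python
import random
from collections import Counter

N_PLAYERS, N_ROADS = 100, 4
rng = random.Random(0)
# Hypothetical positive coefficients per road: cost(k) = a*k^2 + b*k + c.
COEFFS = [(rng.uniform(0.1, 1), rng.uniform(0.1, 1), rng.uniform(0.1, 1))
          for _ in range(N_ROADS)]

def cost(road, k):
    a, b, c = COEFFS[road]
    return a * k * k + b * k + c

def hypothetical_payoffs(i, profile, counts):
    """Payoff (negative cost) player i would get on each road, others fixed."""
    return [-cost(r, counts[r] + (0 if profile[i] == r else 1))
            for r in range(N_ROADS)]

def is_nash(profile):
    """True if no player can lower its cost by switching roads alone."""
    counts = Counter(profile)
    for i in range(len(profile)):
        u = hypothetical_payoffs(i, profile, counts)
        if u[profile[i]] < max(u):
            return False
    return True

def run_jsfp_inertia(stages=500, alpha=0.5, rho=0.1):
    profile = [rng.randrange(N_ROADS) for _ in range(N_PLAYERS)]
    V = [[0.0] * N_ROADS for _ in range(N_PLAYERS)]
    for t in range(stages):
        counts = Counter(profile)
        new_profile = []
        for i in range(N_PLAYERS):
            u = hypothetical_payoffs(i, profile, counts)
            w = 1.0 if t == 0 else rho  # first stage initializes V
            V[i] = [(1 - w) * v + w * x for v, x in zip(V[i], u)]
            best = max(V[i])
            best_roads = [r for r in range(N_ROADS) if V[i][r] == best]
            # Inertia: stay if already best, or with probability 1 - alpha.
            if profile[i] in best_roads or rng.random() >= alpha:
                new_profile.append(profile[i])
            else:
                new_profile.append(rng.choice(best_roads))
        profile = new_profile
    return profile

final = run_jsfp_inertia()
print("road loads:", sorted(Counter(final).items()))
print("Nash equilibrium reached:", is_nash(final))
```

Each player only needs the road loads from the previous stage, so a stage costs about N_PLAYERS * N_ROADS cost evaluations rather than a sum over 4^99 joint actions. Whether a given run ends at an equilibrium within the chosen number of stages depends on the parameters, so the final check is printed rather than assumed.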

Example of Convergence

Conclusion • We have demonstrated some weaknesses of FP (computational demands, observational demands, etc.) • We have developed JSFP, which seems to accommodate computational limitations