
Draft:Policy-Space Response Oracles

From Wikipedia, the free encyclopedia

In multi-agent learning, reinforcement learning, and game theory, Policy-Space Response Oracles[1] (PSRO) is a family of multi-agent learning algorithms for training agents in two-player, zero-sum, imperfect-information (partially observable), extensive-form (sequential) games, using deep reinforcement learning as an approximate best-response operator. Knowledge of all players' payoffs is required (complete information). PSRO unifies, and is heavily influenced by, earlier algorithms such as Double Oracle[2] (DO) and fictitious play (FP). It can be considered a framework of algorithms which, under certain parameterizations, are equivalent to algorithms that came before it. PSRO is closely related to empirical game-theoretic analysis (EGTA).

The multi-agent problem setting involves agents learning to interact with others in a shared environment. PSRO works by iteratively training new policies against past opponents' policies (so-called "self-play"). A key property of PSRO is that the resulting distribution over policies provably converges to a normal-form Nash equilibrium (NE) under certain parameterizations. For two-player, zero-sum games an NE cannot be exploited by any other policy, which makes it a particularly suitable solution concept in this setting. Many interesting games are two-player, zero-sum ("purely competitive"). Notable projects such as AlphaZero[3] and AlphaStar[4] make use of this family of algorithms.

Other classes of games, such as those with more than two players or with general payoffs, are not usually solved using PSRO because of difficulties selecting among Nash equilibria in these game classes (the equilibrium selection problem). Extensions such as JPSRO are more suitable, but use different solution concepts.

History

PSRO builds on a long line of prior algorithms and analysis techniques, most notably the Double Oracle algorithm[2] (DO) and empirical game-theoretic analysis (EGTA).



Algorithm


PSRO works by iteratively training a policy against a distribution over all previous opponent policies found so far. This step of the algorithm is called the best response (BR) and is commonly estimated using reinforcement learning (RL) and function approximation (typically a neural network).

The distribution over opponent policies is determined by a meta-solver (MS), which in turn determines many of the properties of PSRO. For example, with a uniform distribution PSRO is similar to fictitious self-play (FSP), and with the Nash distribution it is similar to Double Oracle.

The meta-solver computes its distribution from a meta-game: an empirical normal-form game whose strategies are the policies found so far and whose payoffs are their expected returns against one another.
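As an illustrative sketch (the function names below, such as `build_meta_game` and `uniform_meta_solver`, are hypothetical and not from a PSRO implementation), the meta-game can be built by evaluating every pair of policies, and a uniform meta-solver then yields FSP-like behaviour:

```python
def build_meta_game(pop1, pop2, expected_return):
    """Fill the empirical payoff matrices of the restricted meta-game:
    entry (i, j) holds each player's expected return when player 1
    uses pop1[i] and player 2 uses pop2[j]."""
    M1 = [[0.0] * len(pop2) for _ in pop1]
    M2 = [[0.0] * len(pop2) for _ in pop1]
    for i, p1 in enumerate(pop1):
        for j, p2 in enumerate(pop2):
            M1[i][j], M2[i][j] = expected_return(p1, p2)
    return M1, M2

def uniform_meta_solver(M1, M2):
    """Uniform distribution over each population; with this choice
    PSRO behaves like fictitious self-play."""
    n, m = len(M1), len(M1[0])
    return [1.0 / n] * n, [1.0 / m] * m

# Example: a matching-pennies meta-game over two pure policies each.
pennies = lambda a, b: (1.0, -1.0) if a == b else (-1.0, 1.0)
M1, M2 = build_meta_game(["H", "T"], ["H", "T"], pennies)
sigma1, sigma2 = uniform_meta_solver(M1, M2)
```

Swapping in a Nash meta-solver over the same matrices would instead recover Double Oracle-like behaviour.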

function expected_return(policy π1, policy π2) → (float, float) is
    // Estimate each player's expected payoff by simulating games
    // between π1 and π2 (e.g. Monte Carlo rollouts).
    return payoff1, payoff2

function meta_solver(matrix M1, matrix M2) → (dist, dist) is
    // Return a distribution over each player's policies, e.g. uniform
    // (fictitious-play-like) or the meta-game Nash (Double Oracle-like).
    return σ1, σ2

function PSRO(game g) → (dist, dist), (list[policy], list[policy]) is
    // Initialize.
    Π1 := {πrandom}
    Π2 := {πrandom}
    // Iterate until convergence.
    for i in 1, ... do
        // Build the meta-game payoff matrices from expected returns.
        M1[π1, π2], M2[π1, π2] := expected_return(π1, π2) for all π1 ∈ Π1, π2 ∈ Π2
        // Solve the meta-game for a distribution over each population.
        σ1, σ2 := meta_solver(M1, M2)
        // Train approximate best responses to the opponent mixtures.
        π'1 := best_response(g, σ2)
        π'2 := best_response(g, σ1)
        // gap: total improvement of the best responses over the mixtures.
        if gap == 0 then
            break
        Π1 := Π1 ∪ {π'1}
        Π2 := Π2 ∪ {π'2}
    return (σ1, σ2), (Π1, Π2)
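To make the loop concrete, the following is a minimal, self-contained sketch on rock-paper-scissors, where "policies" are simply pure actions, the best-response oracle is exact rather than learned by RL, and the meta-solver approximates the meta-game Nash by running fictitious play over the populations. All names here are illustrative, not from a real PSRO implementation:

```python
ACTIONS = ["R", "P", "S"]
BEATS = {("R", "S"), ("P", "R"), ("S", "P")}

def payoff(a, b):
    """Player 1's payoff in rock-paper-scissors (zero-sum)."""
    if a == b:
        return 0.0
    return 1.0 if (a, b) in BEATS else -1.0

def meta_solver(pop1, pop2, rounds=2000):
    """Approximate the Nash of the restricted meta-game by running
    fictitious play over the current populations."""
    c1, c2 = [0] * len(pop1), [0] * len(pop2)
    c1[0] = c2[0] = 1
    for _ in range(rounds):
        b1 = max(range(len(pop1)),
                 key=lambda i: sum(c2[j] * payoff(pop1[i], pop2[j])
                                   for j in range(len(pop2))))
        b2 = max(range(len(pop2)),
                 key=lambda j: sum(c1[i] * -payoff(pop1[i], pop2[j])
                                   for i in range(len(pop1))))
        c1[b1] += 1
        c2[b2] += 1
    return [c / sum(c1) for c in c1], [c / sum(c2) for c in c2]

def best_response(opp_pop, sigma, as_player1):
    """Exact best response over pure actions and its expected value."""
    def value(a):
        return sum(p * (payoff(a, b) if as_player1 else -payoff(b, a))
                   for p, b in zip(sigma, opp_pop))
    br = max(ACTIONS, key=value)
    return br, value(br)

def psro(max_iters=10, tol=0.1):
    pop1, pop2 = ["R"], ["R"]  # initialize with arbitrary policies
    for _ in range(max_iters):
        sigma1, sigma2 = meta_solver(pop1, pop2)
        br1, v1 = best_response(pop2, sigma2, as_player1=True)
        br2, v2 = best_response(pop1, sigma1, as_player1=False)
        # v1 + v2 is the total gain from deviating (0 at an exact Nash).
        if v1 + v2 <= tol:
            break
        if br1 not in pop1:
            pop1.append(br1)
        if br2 not in pop2:
            pop2.append(br2)
    return (sigma1, sigma2), (pop1, pop2)
```

Starting from a single policy, each player's population grows to cover all three actions, after which the best responses stop producing novel policies.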


Convergence


PSRO is known to converge to an equilibrium. Without loss of generality, consider only pure best responses. At some point in the algorithm's execution we have a set of pure policies for each player, and an equilibrium of the induced restricted normal-form game. After computing a best response for each player there are two possibilities: 1) the best responses improve on the equilibrium payoffs, in which case they are necessarily novel policies, or 2) no best response improves on them.

In the second case, by definition an equilibrium of the full game has been found and the algorithm can terminate. In the first case, the new best responses are necessarily novel, so each such iteration strictly grows the populations. Because there are only finitely many pure policies to search through, the algorithm must eventually terminate.
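The first case can be illustrated on rock-paper-scissors (a sketch, using player 1's payoff table): with the restricted populations {R, P}, the restricted game's equilibrium plays P, and the best response to P lies outside the restricted set, so the population must grow.

```python
# Player 1's rock-paper-scissors payoffs.
payoff = {("R", "R"): 0, ("R", "P"): -1, ("R", "S"): 1,
          ("P", "R"): 1, ("P", "P"): 0, ("P", "S"): -1,
          ("S", "R"): -1, ("S", "P"): 1, ("S", "S"): 0}
restricted = ["R", "P"]
# In the restricted game P weakly dominates R, so (P, P) is an
# equilibrium of the restricted game.
br = max(["R", "P", "S"], key=lambda a: payoff[(a, "P")])
# The best response to P is S, a novel policy outside the restricted
# set, so the population grows (case 1 of the argument above).
```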

Although this convergence guarantee is weak (in the worst case every pure policy may need to be enumerated), PSRO performs much better in practice.

Other extensions


Performance Extensions


Pipeline PSRO[5] parallelizes best-response training to scalably find approximate Nash equilibria in large games.

Other solution concepts


XDO[6] is a variant of DO that converges to an extensive-form Nash equilibrium.

Joint Policy-Space Response Oracles[7] (JPSRO) extends PSRO to converge to normal-form correlated equilibria (CEs) and normal-form coarse correlated equilibria (CCEs) in n-player, general-sum extensive-form games. (C)CEs are a more suitable solution concept for n-player, general-sum games, where there can be many NEs. For two-player, zero-sum games the set of CEs equals the set of NEs, so PSRO is a special case of JPSRO in this setting.

AlphaRank can also be used as a meta-solver[8].



References

  1. ^ Lanctot, Marc; Zambaldi, Vinicius; Gruslys, Audrunas; Lazaridou, Angeliki; Tuyls, Karl; Perolat, Julien; Silver, David; Graepel, Thore (2017). "A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning". arXiv:1711.00832 [cs.AI].
  2. ^ a b McMahan, H. Brendan; Gordon, Geoffrey J.; Blum, Avrim (21 August 2003). "Planning in the presence of cost functions controlled by an adversary". Proceedings of the 20th International Conference on Machine Learning (ICML-03). pp. 536–543. ISBN 9781577351894.
  3. ^ Silver, David; Hubert, Thomas; Schrittwieser, Julian; Antonoglou, Ioannis; Lai, Matthew; Guez, Arthur; Lanctot, Marc; Sifre, Laurent; Kumaran, Dharshan; Graepel, Thore; Lillicrap, Timothy; Simonyan, Karen; Hassabis, Demis (7 December 2018). "A general reinforcement learning algorithm that masters chess, shogi, and go through self-play". Science. 362 (6419): 1140–1144. Bibcode:2018Sci...362.1140S. doi:10.1126/science.aar6404. PMID 30523106.
  4. ^ Vinyals, Oriol; Babuschkin, Igor; Czarnecki, Wojciech M.; et al. (30 October 2019). "Grandmaster level in StarCraft II using multi-agent reinforcement learning". Nature. 575 (7782): 350–354. Bibcode:2019Natur.575..350V. doi:10.1038/s41586-019-1724-z. PMID 31666705. S2CID 204972004.
  5. ^ McAleer, Stephen; Lanier, John; Fox, Roy; Baldi, Pierre (2020). "Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games". arXiv:2006.08555 [cs.GT].
  6. ^ McAleer, Stephen; Lanier, John; Baldi, Pierre; Fox, Roy (2021). "XDO: A Double Oracle Algorithm for Extensive-Form Games". arXiv:2103.06426 [cs.GT].
  7. ^ Marris, Luke; Muller, Paul; Lanctot, Marc; Tuyls, Karl; Graepel, Thore (2021). "Multi-Agent Training beyond Zero-Sum with Correlated Equilibrium Meta-Solvers". arXiv:2106.09435 [cs.MA].
  8. ^ Muller, Paul; Omidshafiei, Shayegan; Rowland, Mark; Tuyls, Karl; Perolat, Julien; Liu, Siqi; Hennes, Daniel; Marris, Luke; Lanctot, Marc; Hughes, Edward; Wang, Zhe; Lever, Guy; Heess, Nicolas; Graepel, Thore; Munos, Remi (2019). "A Generalized Training Approach for Multiagent Learning". arXiv:1909.12823 [cs.MA].