MuZero

All living beings exhibit some type of behavior, in the sense that they perform actions in response to the signals they receive from the environment in which they live. Some of them also modify their behavior over time, so that when faced with equivalent signals they respond differently.

Reinforcement learning is an area of machine learning inspired by this concept, concerned with determining which actions a software agent should take in a given environment in order to maximize some notion of "reward."
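To make this concrete, here is a minimal sketch of that signal-action-reward cycle, written against the Gymnasium API. The CartPole environment and the random policy are illustrative assumptions on my part, not anything from the article; a real agent would choose actions to maximize expected reward.

```python
# Minimal sketch of the agent-environment loop (illustrative, not from
# the article): the agent observes, acts, and receives a reward.
import gymnasium as gym

env = gym.make("CartPole-v1")          # hypothetical example environment
observation, info = env.reset(seed=0)

total_reward = 0.0
for _ in range(1000):
    # A learning agent would pick the action it expects to maximize
    # reward; here we sample at random just to show the cycle.
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:
        observation, info = env.reset()

env.close()
print(f"Accumulated reward: {total_reward}")
```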

For an example implementation, I leave you the following repository, https://github.com/al118345/muzero-pytorch, along with this video: https://www.youtube.com/watch?v=C837WJkFc4k

2. Contributions presented in the article.

Reinforcement learning algorithms have attempted to simulate behavioral psychology, with excellent results in very controlled, low-dimensional environments. Tree-based planning methods have had great success in challenging domains such as chess [1] and Go [2]. However, in real-world problems the dynamics that govern the environment are often complex and unknown. In the article "Mastering Atari, Go, chess and shogi by planning with a learned model" [1], the MuZero algorithm combines a tree-based search with a learned model, achieving superhuman performance in a variety of challenging and visually complex domains without any knowledge of their underlying dynamics.

This algorithm learns a model that, applied iteratively, produces the predictions most relevant to planning: the action-selection policy, the value function and the reward. When evaluated on 57 different Atari games, the MuZero algorithm achieved state-of-the-art performance. When tested on Go, chess and shogi (canonical environments for high-performance planning), the MuZero algorithm matched the performance of the AlphaZero [5] algorithm without any knowledge of the dynamics of the game.

Figure 1: Evolution of Reinforcement Learning algorithms

3. MuZero.

The main idea of the algorithm is to predict those aspects of the future that are directly relevant to planning. The model receives an observation (for example, an image of the Go board or of the Atari screen) as input and transforms it into a hidden state. The hidden state is then updated iteratively by a recurrent process that receives the previous hidden state and a hypothetical next action. At each of these steps the model produces a policy (which predicts the move to play), a value function (which predicts the cumulative reward, for example the eventual winner) and a prediction of the immediate reward (for example, the points obtained by playing a move).
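As an illustration, here is a hedged sketch of those three components as small PyTorch modules. The class names, layer sizes and architectures are my own illustrative assumptions; the linked muzero-pytorch repository contains an actual implementation.

```python
# Sketch of MuZero's three learned functions (illustrative architectures).
import torch
import torch.nn as nn

class Representation(nn.Module):
    """h: observation -> initial hidden state."""
    def __init__(self, obs_dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())

    def forward(self, observation):
        return self.net(observation)

class Dynamics(nn.Module):
    """g: (hidden state, action) -> (next hidden state, immediate reward)."""
    def __init__(self, hidden_dim: int, action_dim: int):
        super().__init__()
        self.state = nn.Sequential(
            nn.Linear(hidden_dim + action_dim, hidden_dim), nn.ReLU())
        self.reward = nn.Linear(hidden_dim, 1)

    def forward(self, hidden, action_one_hot):
        x = torch.cat([hidden, action_one_hot], dim=-1)
        next_hidden = self.state(x)
        return next_hidden, self.reward(next_hidden)

class Prediction(nn.Module):
    """f: hidden state -> (policy logits, value)."""
    def __init__(self, hidden_dim: int, action_dim: int):
        super().__init__()
        self.policy = nn.Linear(hidden_dim, action_dim)
        self.value = nn.Linear(hidden_dim, 1)

    def forward(self, hidden):
        return self.policy(hidden), self.value(hidden)
```

During planning, only the initial hidden state comes from a real observation (via h); the search then unrolls g over hypothetical actions and scores each reached state with f.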

The model is trained end to end, with the sole objective of accurately estimating these three important quantities, so as to match the improved policy and value function generated by the search, as well as the observed reward. There is no direct requirement or restriction for the hidden state to capture all the information needed to reconstruct the original observation, which drastically reduces the amount of information the model has to maintain and predict. There is also no requirement for the hidden state to match the real, unknown state of the environment, nor any other restriction on the semantics of the state. Instead, the hidden states are free to represent whatever is useful for correctly computing the policy, the value function and the reward. Intuitively, the agent can internally invent any dynamics that lead to accurate planning.
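That objective can be sketched as follows, reusing the three modules from the previous snippet and unrolling the model K steps along a stored trajectory. The tensor shapes and names are illustrative assumptions, and the sketch omits details such as the gradient scaling and regularization used in the paper.

```python
# Sketch of the end-to-end training loss (illustrative; omits details
# such as gradient scaling and weight regularization).
import torch
import torch.nn.functional as F

def muzero_loss(h, g, f, observation, actions,
                target_policies, target_values, target_rewards):
    """Unroll the model K steps and match policy, value and reward targets.

    actions:         K one-hot actions actually taken, shape (B, A) each
    target_policies: K+1 search visit-count distributions, shape (B, A)
    target_values:   K+1 bootstrapped returns, shape (B,)
    target_rewards:  K observed rewards, shape (B,)
    """
    hidden = h(observation)
    policy_logits, value = f(hidden)
    # Soft-label cross-entropy against the search policy (PyTorch >= 1.10).
    loss = F.cross_entropy(policy_logits, target_policies[0]) \
         + F.mse_loss(value.squeeze(-1), target_values[0])

    for k, action in enumerate(actions):
        hidden, reward = g(hidden, action)
        policy_logits, value = f(hidden)
        loss = loss \
            + F.cross_entropy(policy_logits, target_policies[k + 1]) \
            + F.mse_loss(value.squeeze(-1), target_values[k + 1]) \
            + F.mse_loss(reward.squeeze(-1), target_rewards[k])
    return loss
```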

4. Conclusions.

MuZero has matched the superhuman performance of high-performance planning algorithms in their favored domains (logically complex board games like chess and Go) and has surpassed state-of-the-art model-free RL algorithms in Atari environments.

MuZero's ability to learn a model of its environment and use it to plan successfully demonstrates a significant advance in reinforcement learning and general-purpose search algorithms. Its predecessors have already been applied to a variety of complex problems in sectors such as chemistry, quantum physics and logistics. This advance may pave the way toward new challenges in robotics, industrial systems and other complicated real-world environments where the "rules of the game" are not known.

Bibliography

  • [1] Schrittwieser, Julian; Antonoglou, Ioannis; Hubert, Thomas; Simonyan, Karen; Sifre, Laurent; Schmitt, Simon; Guez, Arthur; Lockhart, Edward; Hassabis, Demis; Graepel, Thore; et al. "Mastering Atari, Go, chess and shogi by planning with a learned model." Nature.
  • [2] "List of shogi software." Wikipedia.
  • [3] Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Rusu, Andrei A.; Veness, Joel; Bellemare, Marc G.; Graves, Alex; Riedmiller, Martin; Fidjeland, Andreas K.; Ostrovski, Georg; et al. "Human-level control through deep reinforcement learning." Nature.
  • [4] "Aprendizaje por refuerzo." Wikipedia.
  • [5] Riedmiller, Martin; Gabel, Thomas; Hafner, Roland; Lange, Sascha. "Reinforcement learning for robot soccer." Autonomous Robots.
  • [6] "MuZero: Mastering Go, chess, shogi and Atari without rules." DeepMind.