Introduction to Reinforcement Learning.

Current reading guide

This article explains the 2015 DQN paper and should be read as a foundation, not as the final state of deep reinforcement learning. Current projects distinguish episode termination from time-limit truncation, document wrappers and random seeds, and report results across several runs instead of selecting one successful trace.

Use maintained environment APIs such as Gymnasium and pin environment, ROM and library versions.
Start with a random-policy baseline and a small tabular or supervised test before expensive training.
Track evaluation return separately from training return and publish variance, seeds and compute budget.
DQN fits discrete actions; continuous control normally requires a different algorithm family.
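
As a rough illustration of reporting results across several seeds, the sketch below evaluates a random-policy baseline on a made-up guessing task and summarizes the per-seed scores. The task, the function names `evaluate_random_policy` and `summarize`, and the seed list are all assumptions chosen for illustration; they do not come from the DQN paper or from any library.

```python
import random
import statistics

def evaluate_random_policy(seed, episodes=20, horizon=10):
    """Average return of a uniformly random policy on a toy guessing task (hypothetical)."""
    rng = random.Random(seed)
    returns = []
    for _ in range(episodes):
        episode_return = 0
        for _ in range(horizon):
            hidden_bit = rng.randrange(2)   # what the environment expects
            action = rng.randrange(2)       # random-policy action
            episode_return += 1 if action == hidden_bit else 0
        returns.append(episode_return)
    return statistics.mean(returns)

def summarize(seeds):
    """Report every per-seed score together with the mean and sample standard deviation."""
    scores = [evaluate_random_policy(s) for s in seeds]
    return {
        "seeds": list(seeds),
        "scores": scores,
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),
    }

report = summarize([0, 1, 2, 3, 4])
print(report)
```

Publishing the full `scores` list next to the mean and standard deviation makes it harder to hide a single lucky run.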

1. Introduction

This article is based on the paper "Human-level control through deep reinforcement learning" [1].

2. Description of the topic.

All living beings exhibit some form of behavior: they act in response to the signals they receive from the environment in which they live. Some of them also modify their behavior over time, so that they respond differently to equivalent signals as they gain experience.

Reinforcement learning is an area of machine learning inspired by this idea. It studies which actions a software agent should choose in a given environment in order to maximize some notion of "reward."

3. Contributions presented in the article.

Reinforcement learning algorithms, which draw on behavioral psychology, had achieved excellent results in tightly controlled, low-dimensional environments. However, until the publication of the paper, comparable results had not been obtained in environments as high-dimensional and varied as the classic Atari 2600 games.

The authors use recent advances in the training of deep neural networks to develop an artificial agent, called a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning.

4. Summary of the experimental part

To analyze the experimental part, we need to start by defining the following concepts:

Policy: the rule the agent uses to decide which action to take. Under the $\varepsilon$-greedy policy, the agent almost always takes the best possible action given the information it has.
Exploration vs exploitation: from time to time, with probability $\varepsilon$, the agent takes a completely random action instead. This way, if the first action the agent tries happens to yield a positive reward, it does not get stuck choosing that same action all the time; with probability $\varepsilon$ it explores other options. This value is configurable and balances exploration against exploitation.
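
The two definitions above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function name `epsilon_greedy`, the example Q-values and the choice of $\varepsilon$ = 0.1 are assumptions.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

rng = random.Random(0)
q_values = [1.0, 4.0, 2.5]   # hypothetical estimates for three actions
picks = [epsilon_greedy(q_values, epsilon=0.1, rng=rng) for _ in range(1000)]
greedy_share = picks.count(1) / len(picks)
print(f"greedy action chosen {greedy_share:.1%} of the time")
```

With $\varepsilon$ = 0 the agent only exploits, and with $\varepsilon$ = 1 it only explores; values in between trade one off against the other.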

In a complete reinforcement learning problem, the state changes every time the agent executes an action. The agent observes the current state of the environment, represented by the letter s (state). It then executes the action it chooses, represented by the letter a (action). In response, the environment provides a reward, represented by the letter r (reward), and moves to a new state, represented by s' (next state). This cycle is shown in Figure 1.

Figure 1. Complete reinforcement learning cycle.
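
The cycle of state, action, reward and next state can be written as a short loop. The sketch below uses a toy corridor environment invented for this example; the class `CorridorEnv`, its `reset`/`step` methods and the helper `run_episode` are hypothetical and only mimic the general shape of common environment interfaces.

```python
import random

class CorridorEnv:
    """Toy environment (hypothetical): walk from cell 0 to the goal at the last cell."""
    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, any other action moves left; walls clamp the position.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def run_episode(env, policy, max_steps=200):
    """Collect (s, a, r, s') transitions until the goal or the step limit is reached."""
    s = env.reset()
    transitions = []
    for _ in range(max_steps):
        a = policy(s)                     # the agent chooses an action
        s_next, r, done = env.step(a)     # the environment answers with r and s'
        transitions.append((s, a, r, s_next))
        s = s_next
        if done:
            break
    return transitions

rng = random.Random(0)
episode = run_episode(CorridorEnv(), policy=lambda s: rng.randrange(2))
print(len(episode), "steps, total reward", sum(r for _, _, r, _ in episode))
```

Each stored tuple (s, a, r, s') is exactly the information a Q-learning update needs.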

The Q-learning algorithm used in the paper tries to learn how much reward the agent will accumulate in the long term for each state-action pair (s, a). This is called the action-value function, and the algorithm represents it as Q(s, a): the return the agent expects to receive when it executes action a from state s and then follows the policy dictated by Q until the end of the episode. For example, if Q(s, a1) = 1 and Q(s, a2) = 4, the agent knows that action a2 is better and will bring more reward, so that is the action it executes.

Formally, the Q-values are defined as follows:

$$Q(s, a; \theta) = r + \gamma \max_{a'} Q(s', a'; \theta')$$

The Q-value of state s and action a, Q(s, a), is defined as the reward r obtained by executing that action, plus the Q-value of the best possible action a' from the next state s', multiplied by a discount factor $\gamma \in (0, 1]$.
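
For a small problem, the definition above can be applied directly to a table. The sketch below is one common tabular form of the update, not the paper's implementation: the learning rate `alpha`, the `terminal` flag (no bootstrapping from a final state) and the state and action names are assumptions introduced for illustration.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, gamma=0.99, alpha=0.1, terminal=False):
    """Move Q[(s, a)] a fraction alpha toward r + gamma * max_a' Q[(s', a')]."""
    best_next = 0.0 if terminal else max(Q[(s_next, a2)] for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return target

Q = defaultdict(float)        # unseen state-action pairs start at 0
Q[("s1", "left")] = 1.0
Q[("s1", "right")] = 4.0

target = q_learning_update(
    Q, "s0", "right", r=1.0, s_next="s1",
    actions=["left", "right"], gamma=0.9, alpha=0.5,
)
print(target, Q[("s0", "right")])
```

Here the best next value is 4, so the target is 1 + 0.9 · 4 = 4.6, and with `alpha = 0.5` the stored estimate moves halfway from 0 toward that target.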

However, when there are billions of different states and hundreds of different actions, tabular Q-learning is no longer practical. For this reason, the authors of [1] define a new technique called Deep Q-Network (DQN). This algorithm combines Q-learning with deep neural networks to approximate the function Q, thus avoiding the need for a table to represent it. It uses two neural networks to stabilize the learning process. The first, the main neural network, with parameters $\theta$, estimates the Q-values of the current state s and action a: $Q(s, a; \theta)$. The second, the target neural network, parameterized by $\theta'$, has the same architecture as the main network but is used to approximate the Q-values of the next state s' and the next action a'. Learning happens on the main network, not on the target network. The target network is frozen (its parameters are not changed) for several iterations (usually around 10000), after which the parameters of the main network are copied to the target network. This transfers what has been learned from one network to the other, so that the estimates computed by the target network become more accurate.
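
The freeze-and-copy schedule can be sketched without any deep learning library. In the example below a plain dictionary stands in for the network parameters, and the class `TargetSync` with its method `after_update` is a hypothetical helper written for this article; a period of 3 is used only to keep the printout short.

```python
import copy

class TargetSync:
    """Keeps a frozen copy of the main parameters and refreshes it every `period` updates."""
    def __init__(self, online_params, period=10_000):
        self.online = online_params
        self.target = copy.deepcopy(online_params)   # frozen snapshot (theta')
        self.period = period
        self.updates = 0

    def after_update(self):
        """Call once after each learning step on the main network."""
        self.updates += 1
        if self.updates % self.period == 0:
            self.target = copy.deepcopy(self.online)  # copy theta into theta'
            return True
        return False

theta = {"w": [0.0, 0.0]}             # stand-in for the main network parameters
sync = TargetSync(theta, period=3)
for step in range(1, 7):
    theta["w"][0] += 1.0              # stand-in for a gradient step
    copied = sync.after_update()
    print(step, theta["w"][0], sync.target["w"][0], copied)
```

Between copies the target parameters do not move, which is what keeps the regression target from shifting at every step.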

To train a neural network we need a loss function, which we define as the square of the difference between the two sides of the Q-value equation above, that is:

$$L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta') - Q(s, a; \theta)\right)^{2}\right]$$
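
To make the loss concrete, the sketch below evaluates it on a tiny batch of transitions, approximating the expectation with a batch average. The two lookup functions `q_online` and `q_target` are hypothetical stand-ins for the main and target networks, and the handling of terminal states is left out to stay close to the formula.

```python
def dqn_loss(batch, q_online, q_target, actions, gamma=0.99):
    """Mean squared difference between the target and the main network's estimate."""
    total = 0.0
    for s, a, r, s_next in batch:
        # Target uses the frozen network: r + gamma * max_a' Q(s', a'; theta')
        target = r + gamma * max(q_target(s_next, a2) for a2 in actions)
        total += (target - q_online(s, a)) ** 2
    return total / len(batch)

# Hypothetical stand-ins for the two networks: fixed lookup tables.
ONLINE = {(0, 0): 0.5, (0, 1): 1.0}
TARGET = {(1, 0): 2.0, (1, 1): 3.0}

def q_online(s, a):
    return ONLINE.get((s, a), 0.0)

def q_target(s, a):
    return TARGET.get((s, a), 0.0)

batch = [(0, 1, 1.0, 1), (0, 0, 0.0, 1)]   # (s, a, r, s') transitions
loss = dqn_loss(batch, q_online, q_target, actions=[0, 1], gamma=0.5)
print(loss)
```

For the first transition the target is 1 + 0.5 · 3 = 2.5 against an estimate of 1.0; for the second it is 0 + 0.5 · 3 = 1.5 against 0.5. The squared errors 2.25 and 1.0 average to 1.625.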

5. Conclusions and critical summary of the article

The paper shows how, using only the pixels and the game score as input, the deep Q-network agent was able to outperform all previous algorithms and reach a level comparable to that of a professional human games tester on a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the gap between high-dimensional sensory inputs and actions, resulting in the first artificial agent capable of learning to excel at a wide range of challenging tasks.

Taken together, this work illustrates the power of combining state-of-the-art machine learning techniques with biologically inspired mechanisms to create agents that are capable of learning to master a wide variety of challenging tasks.

6. Code example

In the repository https://github.com/al118345/OpenAi_Examples I have made available several code examples that try to solve Atari 2600 games. I have also uploaded the video https://www.youtube.com/watch?v=Z2DbDXeNJOc, which I hope helps you understand the topic.

7. How to continue after this introduction

A good learning path is to start with tabular methods, where the state and action spaces are small enough to understand every update, and then move to deep reinforcement learning only when the number of states becomes too large for a table. That progression makes the difference between Q-Learning and DQN much clearer.
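
As a possible first step on that path, here is a small tabular Q-learning run on a five-state chain. Everything in it is an assumption made for illustration: the chain task, the function name `train_chain`, and the hyperparameter values are not taken from the paper or from the repository linked above.

```python
import random

def train_chain(n_states=5, episodes=300, gamma=0.9, alpha=0.5, epsilon=0.2, seed=0):
    """Tabular Q-learning on a toy chain: start at state 0, reward 1 at the last state."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]   # Q[state][action]; 0 = left, 1 = right
    for _ in range(episodes):
        s = 0
        for _ in range(100):                     # step cap acts as a time limit
            if rng.random() < epsilon:
                a = rng.randrange(2)             # explore
            else:
                best = max(Q[s])
                a = rng.choice([i for i in (0, 1) if Q[s][i] == best])  # exploit, random ties
            s_next = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
            done = s_next == n_states - 1
            r = 1.0 if done else 0.0
            # No bootstrapping from the terminal state.
            target = r if done else r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
            if done:
                break
    return Q

Q = train_chain()
for state, (left, right) in enumerate(Q[:-1]):
    print(state, round(left, 3), round(right, 3))
```

With only five states the whole table can be printed and checked by hand, which is no longer possible once the table is replaced by a neural network.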

Bibliography

[1] Mnih, Volodymyr and Kavukcuoglu, Koray and Silver, David and Rusu, Andrei A and Veness, Joel and Bellemare, Marc G and Graves, Alex and Riedmiller, Martin and Fidjeland, Andreas K and Ostrovski, Georg and others. Human-level control through deep reinforcement learning. Nature.
[2] Aprendizaje por refuerzo. Wikipedia. https://es.wikipedia.org/wiki/Aprendizaje_por_refuerzo
[3] Riedmiller, Martin and Gabel, Thomas and Hafner, Roland and Lange, Sascha. Reinforcement learning for robot soccer. Autonomous Robots.
