Deep reinforcement learning

This project compares three deep reinforcement-learning agents on the same control problem to study how different learning strategies behave during training and how their final performance differs:

Double DQN (DDQN)
Dueling DQN
REINFORCE with baseline

0. Context and environment

One of the motivations behind deep reinforcement learning in robotics is to obtain controllers that can learn complex sequences of actions on their own, especially in tasks such as autonomous landing and takeoff. To study that idea, this notebook works with OpenAI Gym's Lunar Lander environment.

The agent controls a spacecraft that must land on the target area at coordinates (0, 0) with a safe velocity and without crashing. The ship has three engines and the action space is discrete:

0: do nothing
1: fire left engine
2: fire main engine
3: fire right engine

Figure: Lunar Lander environment

Rewards are shaped to favor an efficient descent, a landing in the correct zone, stable leg contact and low impact speed, while crashes and excessive engine use are penalized. The task is considered solved when the agent reaches an average score of at least 200 over 100 consecutive episodes.
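
The solved criterion above can be checked with a few lines of standard-library Python. This is a minimal sketch, not the notebook's own code: the names is_solved, SOLVED_THRESHOLD and WINDOW are hypothetical, and the scores in the usage loop are synthetic placeholders rather than training results.

```python
from collections import deque

SOLVED_THRESHOLD = 200.0  # average score quoted in the text
WINDOW = 100              # number of consecutive episodes

def is_solved(episode_scores, window=WINDOW, threshold=SOLVED_THRESHOLD):
    """Return True if the mean of the last `window` scores reaches `threshold`."""
    if len(episode_scores) < window:
        return False  # not enough episodes yet to judge
    recent = list(episode_scores)[-window:]
    return sum(recent) / window >= threshold

# Typical use inside a training loop: keep only the most recent scores.
recent_scores = deque(maxlen=WINDOW)
for score in [150.0] * 50 + [250.0] * 100:  # synthetic scores, not real results
    recent_scores.append(score)
print(is_solved(recent_scores))
```

A bounded deque keeps memory constant over long training runs, since only the last 100 scores matter for the criterion.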

1. Initialization and exploration

Before training any agent, the notebook installs the required dependencies, including box2d-py, loads the main libraries and explores the environment. This section clarifies the observation space, confirms the action set and establishes the computational setup used later by all three agents.

The exploration highlights a key difference from tabular examples: Lunar Lander has a continuous state space, so direct lookup tables are no longer practical and neural-network function approximators become necessary.
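
A rough count shows why a lookup table does not scale here. The sketch below assumes the standard Lunar Lander observation layout (six continuous values plus two boolean leg-contact flags), which should be confirmed against the environment's observation space; the function name q_table_entries and the bin counts are illustrative choices.

```python
# Assumed observation layout (standard Lunar Lander): 6 continuous values
# (x, y, vx, vy, angle, angular velocity) plus 2 boolean leg-contact flags.
N_CONTINUOUS = 6
N_BOOLEAN = 2
N_ACTIONS = 4  # the four discrete actions listed in section 0

def q_table_entries(bins_per_dim):
    """Number of Q-values needed if each continuous value is cut into bins."""
    n_states = (bins_per_dim ** N_CONTINUOUS) * (2 ** N_BOOLEAN)
    return n_states * N_ACTIONS

for bins in (10, 20, 50):
    print(f"{bins:>3} bins per dimension -> {q_table_entries(bins):,} table entries")
```

Even a coarse grid of 10 bins per dimension already requires 16 million entries, and most of them would never be visited during training.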

2. DDQN agent

The first model is a Double DQN agent. The notebook defines the neural-network architecture, implements the agent logic, trains the model and finally evaluates the learned behavior in test episodes.

The practical objective of this section is to separate action selection from target evaluation: the online network chooses the next action and the target network scores it, which reduces the overestimation bias that affects standard DQN updates.
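
The difference between the two targets can be illustrated on a single transition. This is a simplified sketch with plain lists instead of network outputs; the discount factor, the helper names and the Q-values are assumptions made for illustration and are not taken from the notebook.

```python
GAMMA = 0.99  # discount factor; an assumed value, not taken from the notebook

def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def dqn_target(reward, next_q_target, done, gamma=GAMMA):
    """Standard DQN: the target network both picks and scores the action."""
    if done:
        return reward  # no bootstrap from a terminal state
    return reward + gamma * max(next_q_target)

def ddqn_target(reward, next_q_online, next_q_target, done, gamma=GAMMA):
    """Double DQN: the online network picks the action, the target network scores it."""
    if done:
        return reward
    best_action = argmax(next_q_online)
    return reward + gamma * next_q_target[best_action]

# Hypothetical Q-values for the four actions in the next state.
q_online = [1.0, 2.5, 0.3, 2.0]
q_target = [1.2, 1.8, 0.1, 3.0]

print(dqn_target(1.0, q_target, done=False))             # bootstraps from max(q_target)
print(ddqn_target(1.0, q_online, q_target, done=False))  # bootstraps from q_target[1]
```

Because the Double DQN target evaluates an action chosen by a different network, it can never exceed the standard DQN target computed from the same target-network values.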

3. Dueling DQN agent

The second model keeps the value-based framework but changes the architecture to a Dueling DQN. The network learns separate streams for the state value and the action advantage before combining them into the final Q estimate.
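
A common way to combine the two streams is Q(s, a) = V(s) + A(s, a) - mean(A(s, ·)). The sketch below assumes this mean-subtraction form, which may differ from the aggregation used in the notebook; the function name and the numbers are illustrative.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) as Q(s, a) = V(s) + A(s, a) - mean(A)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]

# Hypothetical outputs of the two streams for one state and four actions.
value = 10.0
advantages = [0.5, -1.0, 2.0, -1.5]

q_values = dueling_q_values(value, advantages)
print(q_values)

# Adding a constant to every advantage leaves the Q-values unchanged.
shifted = dueling_q_values(value, [a + 3.0 for a in advantages])
print(shifted == q_values)
```

Subtracting the mean advantage makes the decomposition identifiable: the average of the Q-values equals V(s), so the network cannot shift an arbitrary constant between the two streams.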

As in the DDQN section, the notebook covers architecture, agent definition, training and testing, which makes it possible to compare both variants under the same environment and evaluation criteria.

4. REINFORCE with baseline

The third block switches from value-based learning to policy gradients through REINFORCE with baseline. This adds a different learning perspective: instead of estimating Q values and acting greedily, the model learns a policy directly, and a baseline is subtracted from the returns to reduce the variance of the gradient estimate.
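
The role of the baseline can be seen in how the per-step weights of the policy gradient are built. This is a minimal sketch: the function names are hypothetical, the rewards are made up, and the baseline used here is simply the mean return of the episode, whereas the notebook may use a learned state-value estimate instead.

```python
def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

def advantages_with_baseline(returns, baselines):
    """Weights applied to the log-probabilities: G_t - b(s_t)."""
    return [g - b for g, b in zip(returns, baselines)]

rewards = [1.0, 0.0, -1.0, 10.0]  # made-up rewards for one short episode
returns = discounted_returns(rewards, gamma=0.5)

# Simplest possible baseline (an assumption): the mean return of the episode.
mean_return = sum(returns) / len(returns)
weights = advantages_with_baseline(returns, [mean_return] * len(returns))

print(returns)
print(weights)
```

In a full agent, the policy loss would be the negative sum of each action's log-probability multiplied by its weight. Centering the weights around zero means that actions followed by below-average returns are actively discouraged rather than merely reinforced less.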

The notebook follows the same workflow as before (network definition, agent implementation, training and testing), so the comparison is methodologically consistent across all three approaches.

5. Model comparison and optimization

Once the three agents are trained, the notebook compares their reward curves, convergence behavior and final performance. A later optimization section then adjusts hyperparameters and training choices to improve on those results.
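
Raw episode rewards are noisy, so reward curves are usually smoothed before comparing agents. The helper below is one possible way to do that, assuming a trailing moving average; the name moving_average and the sample scores are illustrative and do not come from the notebook.

```python
def moving_average(scores, window=100):
    """Mean of the last `window` scores at each episode (shorter at the start)."""
    averages = []
    running_sum = 0.0
    for i, score in enumerate(scores):
        running_sum += score
        if i >= window:
            running_sum -= scores[i - window]  # drop the score that left the window
        averages.append(running_sum / min(i + 1, window))
    return averages

# Tiny synthetic example with a window of 3 episodes.
print(moving_average([0.0, 30.0, 60.0, 90.0], window=3))
```

Applying the same window to every agent keeps the smoothed curves directly comparable.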

The notebook closes with an analysis section that discusses the strengths and weaknesses of the three methods in this environment rather than presenting a single winner without context.

Main takeaway

This project uses the same environment to contrast three approaches to deep reinforcement learning: an improved Q-learning target (DDQN), an architectural refinement of Q-learning (Dueling DQN), and policy gradients with a baseline (REINFORCE). The full notebook, including code, training outputs and comparative tables, remains below in Spanish.