Tabular reinforcement learning solutions

This project focuses on a controlled environment where classic tabular reinforcement-learning algorithms can be implemented, compared and interpreted. The objective is not only to obtain a policy that reaches the goal, but to understand how the policy is learned and how its quality changes when the environment and the hyperparameters are modified.

0. The WindyGridWorld environment

The starting point is WindyGridWorld, a grid of 7 rows and 10 columns in which the agent begins at [3, 0] and must reach [3, 7], with positions written as [row, column]. The difficulty comes from the wind acting on the central columns, which pushes the agent upward with a strength that depends on the column.

This is a standard reinforcement-learning benchmark because it is simple enough for tabular methods while still requiring exploration, policy improvement and long-horizon credit assignment.

The notebook first loads the environment, prints the action and observation spaces, and executes a random episode so the behavior of the dynamics can be seen before any learning takes place.
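As a rough illustration of the dynamics and the random episode described in this section, the sketch below re-creates the board in plain Python. It is not the notebook's environment class: the wind profile (the classic textbook values), the action encoding, the reward of -1 per step and the names `step` and `random_episode` are all assumptions made for the example.

```python
import random

# Assumed layout: 7 rows x 10 columns, row 0 at the top, states as (row, col).
N_ROWS, N_COLS = 7, 10
START, GOAL = (3, 0), (3, 7)
# Assumed wind strength per column (the classic textbook profile).
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
# Assumed action encoding: 0=up, 1=right, 2=down, 3=left.
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]


def step(state, action):
    """Apply one move plus the wind of the column being left; reward is -1 per step."""
    row, col = state
    d_row, d_col = MOVES[action]
    # Upward wind decreases the row index; clip so the agent stays on the grid.
    new_row = min(max(row + d_row - WIND[col], 0), N_ROWS - 1)
    new_col = min(max(col + d_col, 0), N_COLS - 1)
    next_state = (new_row, new_col)
    return next_state, -1, next_state == GOAL


def random_episode(max_steps=10_000, seed=0):
    """Follow uniformly random actions from START; stop at GOAL or after max_steps."""
    rng = random.Random(seed)
    state, trajectory = START, [START]
    for _ in range(max_steps):
        state, _, done = step(state, rng.randrange(len(MOVES)))
        trajectory.append(state)
        if done:
            break
    return trajectory
```

Calling `random_episode()` and printing the length of the returned trajectory gives a feel for how long an untrained agent wanders before (or without) reaching the goal. The step cap is there only so the example always terminates.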

1. Modifying the environment

After the initial inspection, the exercise asks for a modified version of the environment. The new grid is larger, the wind profile changes, and both the start and target states are moved. The change matters because it shows that the algorithms are not fixed recipes tied to a single board configuration.
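One way to make such a modification cheap is to keep the board parameters in a single configuration object and build the dynamics from it. The sketch below assumes this design; `GridConfig`, `make_step` and the action encoding are hypothetical names, and the values in `MODIFIED` are invented purely for illustration, since the text does not state the actual modified layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridConfig:
    n_rows: int
    n_cols: int
    wind: tuple   # upward push per column
    start: tuple  # (row, col)
    goal: tuple   # (row, col)

    def __post_init__(self):
        # Reject inconsistent boards early instead of failing mid-episode.
        if len(self.wind) != self.n_cols:
            raise ValueError("wind needs exactly one entry per column")
        for name, (row, col) in (("start", self.start), ("goal", self.goal)):
            if not (0 <= row < self.n_rows and 0 <= col < self.n_cols):
                raise ValueError(f"{name} lies outside the grid")


def make_step(config):
    """Build a step function for a board; assumed actions 0=up, 1=right, 2=down, 3=left."""
    moves = [(-1, 0), (0, 1), (1, 0), (0, -1)]

    def step(state, action):
        row, col = state
        d_row, d_col = moves[action]
        new_row = min(max(row + d_row - config.wind[col], 0), config.n_rows - 1)
        new_col = min(max(col + d_col, 0), config.n_cols - 1)
        next_state = (new_row, new_col)
        return next_state, -1, next_state == config.goal

    return step


# The original board as described in the text (wind values assumed).
ORIGINAL = GridConfig(7, 10, (0, 0, 0, 1, 1, 1, 2, 2, 1, 0), (3, 0), (3, 7))
# Purely hypothetical variant: larger grid, different wind, moved start and goal.
MODIFIED = GridConfig(8, 12, (0, 0, 1, 1, 2, 2, 1, 1, 0, 0, 1, 0), (5, 0), (2, 9))
```

With this layout, any learning routine that only receives a `step` function and a start state can be rerun on a new board without touching the algorithm itself.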

The updated environment is saved in a second file and reused in the later experiments, which makes the rest of the notebook closer to a real implementation exercise than to a purely theoretical summary.

2. Monte Carlo methods

This block estimates an optimal policy with on-policy first-visit Monte Carlo control using an epsilon-soft policy. Because the environment is deterministic, the practical goal is to recover a policy that follows the shortest path, or at least one of the shortest paths when several exist.

The notebook implements the policy-construction logic, samples a large number of episodes and prints the learned action values and the resulting policy over the grid.
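The pieces named above can be sketched as follows. This is a minimal illustration, not the notebook's code: the function names, the `step_fn(state, action) -> (next_state, reward, done)` interface, the per-episode step cap and the tiny corridor used as a stand-in environment are all assumptions. The cap truncates very long early episodes, which biases their returns but keeps the run time bounded.

```python
import random
from collections import defaultdict

N_ACTIONS = 4  # assumed encoding: 0=up, 1=right, 2=down, 3=left


def epsilon_soft_probs(q_values, epsilon):
    """Epsilon-greedy action probabilities; every action keeps at least epsilon / n."""
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])  # ties go to the lowest index
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs


def first_visit_returns(episode, gamma):
    """Return {(state, action): G} where G follows the first visit of each pair."""
    returns, g = {}, 0.0
    for state, action, reward in reversed(episode):
        g = reward + gamma * g
        # Iterating backwards, the last write for a pair is its earliest visit.
        returns[(state, action)] = g
    return returns


def mc_control(step_fn, start, episodes, gamma=1.0, epsilon=0.1, max_steps=10_000, seed=0):
    """On-policy first-visit MC control; step_fn(state, action) -> (next_state, reward, done)."""
    rng = random.Random(seed)
    q = defaultdict(lambda: [0.0] * N_ACTIONS)
    visits = defaultdict(int)
    for _ in range(episodes):
        episode, state = [], start
        # max_steps is an assumed safety cap: early policies may wander for a long time.
        for _ in range(max_steps):
            probs = epsilon_soft_probs(q[state], epsilon)
            action = rng.choices(range(N_ACTIONS), weights=probs)[0]
            next_state, reward, done = step_fn(state, action)
            episode.append((state, action, reward))
            state = next_state
            if done:
                break
        for (s, a), g in first_visit_returns(episode, gamma).items():
            visits[(s, a)] += 1
            q[s][a] += (g - q[s][a]) / visits[(s, a)]  # incremental sample mean
    return q


def corridor_step(state, action):
    """Toy stand-in environment: cells 0..3 in a row, goal at cell 3, reward -1 per step."""
    if action == 1:
        state = min(state + 1, 3)
    elif action == 3:
        state = max(state - 1, 0)
    return state, -1, state == 3
```

For example, `q = mc_control(corridor_step, start=0, episodes=200)` returns the action-value table, and taking the index of the largest entry in each `q[state]` gives the greedy policy that would be printed over the grid.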

3. Temporal-difference methods

The next part moves to Q-learning, an off-policy temporal-difference method. Here the notebook estimates the state-action values directly while interacting with the modified WindyGridWorld environment.
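For reference, the standard one-step Q-learning update is written below in textbook notation, where \alpha is the learning rate and \gamma the discount factor. The symbols follow the usual convention and are not necessarily those used in the notebook.

```latex
Q(S_t, A_t) \leftarrow Q(S_t, A_t)
  + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]
```

The maximum over next actions is what makes the method off-policy: the target assumes greedy behavior from the next state onward, regardless of the action the exploring agent actually takes there.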

Besides learning the Q table, the exercise also asks for the estimated value function and for a final episode executed under the learned optimal policy so the trajectory can be inspected step by step.
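The three outputs mentioned here (Q table, value function, final greedy episode) can be sketched as below. For lack of the modified layout, the example runs on the original 7x10 board with an assumed wind profile, action encoding and reward of -1 per step. The hyperparameter defaults and all function names are illustrative choices, not the notebook's.

```python
import random

# Assumed board: classic layout and wind, actions 0=up, 1=right, 2=down, 3=left.
N_ROWS, N_COLS = 7, 10
START, GOAL = (3, 0), (3, 7)
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]


def step(state, action):
    row, col = state
    d_row, d_col = MOVES[action]
    new_row = min(max(row + d_row - WIND[col], 0), N_ROWS - 1)
    new_col = min(max(col + d_col, 0), N_COLS - 1)
    next_state = (new_row, new_col)
    return next_state, -1, next_state == GOAL


def q_learning(episodes=2000, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = random.Random(seed)
    q = {(r, c): [0.0] * len(MOVES) for r in range(N_ROWS) for c in range(N_COLS)}
    for _ in range(episodes):
        state, done = START, False
        while not done:
            if rng.random() < epsilon:
                action = rng.randrange(len(MOVES))
            else:
                action = max(range(len(MOVES)), key=lambda a: q[state][a])
            next_state, reward, done = step(state, action)
            # Off-policy target: best next action value, nothing beyond the goal.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def state_values(q):
    """Estimated value function: V(s) = max over actions of Q(s, a)."""
    return {state: max(values) for state, values in q.items()}


def greedy_episode(q, max_steps=200):
    """Follow the greedy policy from START; the cap guards against loops in a poor table."""
    state, trajectory = START, [START]
    for _ in range(max_steps):
        action = max(range(len(MOVES)), key=lambda a: q[state][a])
        state, _, done = step(state, action)
        trajectory.append(state)
        if done:
            break
    return trajectory
```

Printing `greedy_episode(q_learning())` lists the visited cells one by one, which is the step-by-step inspection the exercise asks for. On this board no path from start to goal is shorter than 15 moves, so a well-trained table should produce a trajectory close to that length.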

4. Comparing the algorithms

The last section is not limited to showing one solution per method. It compares Monte Carlo and temporal-difference learning under changes in the number of episodes, the discount factor and the learning rate. The notebook explicitly recommends repeated simulations: the outcome of a single training run is stochastic, so each configuration should be judged by its most frequent behavior across runs rather than by one run.
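A small helper along these lines can automate the repeated simulations. It is a sketch under stated assumptions: `run_fn` is a hypothetical callable that trains one agent with the given hyperparameters and seed and returns a hashable summary (for instance, the length of the final greedy path), and both function names are invented for the example.

```python
from collections import Counter
from itertools import product


def most_frequent_outcome(run_fn, config, seeds):
    """Run one configuration under several seeds; return (modal result, its share of runs)."""
    results = [run_fn(seed=seed, **config) for seed in seeds]
    value, count = Counter(results).most_common(1)[0]
    return value, count / len(results)


def sweep(run_fn, grid, seeds):
    """Evaluate every combination in `grid`, a dict mapping parameter name to candidate values."""
    names = list(grid)
    summary = {}
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        summary[values] = most_frequent_outcome(run_fn, config, seeds)
    return summary
```

A call such as `sweep(run_fn, {"episodes": [500, 5000], "gamma": [0.9, 1.0], "alpha": [0.1, 0.5]}, seeds=range(20))` would return, for each combination, the most common result and how often it occurred, which makes it easier to see whether a configuration is reliably good or only occasionally lucky.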

A dedicated subsection also contrasts Monte Carlo and Q-learning directly so that convergence speed, policy quality and sensitivity to hyperparameters can be discussed side by side.

Why this notebook matters

Tabular reinforcement learning is still the cleanest way to understand the logic behind policy estimation, exploration and value updates before moving to neural approximations. The full notebook, with all code cells, experiments and comments, remains below in Spanish.