Hands-On Machine Learning Chapter 18 - Reinforcement Learning
This chapter introduces and goes over some reinforcement learning concepts. I am going to read a book on RL next, so I just skimmed this chapter.
Chapter 18: Reinforcement Learning
Reinforcement Learning (RL) is one of the most exciting fields of machine learning today, and also one of the oldest (it has been around since the 1950s).
Learning to Optimize Rewards
In reinforcement learning, a software agent makes observations and takes actions within an environment, and in return it receives rewards from the environment. Its objective is to learn to act in a way that will maximize its expected rewards over time.
Policy Search
The algorithm a software agent uses to determine its actions is called its policy. The policy could be a neural network taking observations as inputs and outputting an action to take.
A stochastic policy is a policy that involves some randomness. Policy parameters can be tweaked to change the policy. Policy search is the process of searching for the parameter values that produce the best-performing policy, and the policy space is the space of all possible policy parameter values. Using policy gradients, you can tweak the policy parameters by following the gradients toward higher rewards.
OpenAI Gym
Simulated environments are used to train reinforcement learning algorithms. You might use PyBullet or MuJoCo for 3D physics simulation. OpenAI Gym is a toolkit that provides a wide variety of simulated environments that you can use to train agents, compare them, or develop new RL algorithms. It is preinstalled on Colab.
# Only run these commands on Colab or Kaggle!
%pip install -q -U gym # Upgrade gym to the latest version quietly
%pip install -q -U gym[classic_control,box2d,atari,accept-rom-license] # Installs libraries to run various kinds of environments
"""
Import gym and make an environment
"""
import gym
env = gym.make("CartPole-v1", render_mode="rgb_array")
The above code builds a CartPole environment.
The gym.envs.registry dictionary contains the names and specifications of all the available environments.
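For example, assuming a recent gym version where the registry really is a plain dict of environment specs, you could peek at a few of the registered environment IDs like this:
# List a few registered environment IDs (keys of the registry dict)
print(sorted(gym.envs.registry.keys())[:10])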
"""
Initialize Environment
"""
# Initialize environment
# This returns the first observation, which depends on environment
obs, info = env.reset(seed=42)
# First observation: [position, velocity, angle, angular velocity]
print(obs)
# Returns a dictionary that may contain extra environment-specific information
print(info)
img = env.render() # Renders the environment as an image - a NumPy array
print("Environment Shape:",img.shape)
import matplotlib.pyplot as plt
plt.imshow(img)
ax = plt.gca()
ax.set_title("Environment")
plt.show()
print("What actions are possible:",env.action_space)
"""
Discrete(2) meansd thatthe possible actions are intehers 0 and 1, which represents accelerating right or left
"""
def plot_env(environ):
    """
    Plot Environment
    """
    img = environ.render()
    plt.imshow(img)
    ax = plt.gca()
    ax.set_title("Environment")
    plt.show()
action = 1 # accelerate right
obs, reward, done, truncated, info = env.step(action)
print(obs)
print(reward)
print(done)
print(truncated)
print(info)
The step() function executes the desired action and returns 5 values:
- obs
  - This is the new observation. The cart is now moving toward the right (obs[1] > 0). The pole is still tilted toward the right (obs[2] > 0), but its angular velocity is now negative (obs[3] < 0), so it will likely be tilted toward the left after the next step
- reward
  - In this environment, you get a reward of 1.0 at every step, no matter what you do, so the goal is to keep the episode running for as long as possible
- done
  - This value will be True when the episode is over. This will happen when the pole tilts too much or the cart goes off the screen
- truncated
  - This value will be True when an episode is cut short, for example by an environment wrapper that imposes a maximum number of steps per episode (CartPole-v1 caps episodes at 500 steps)
- info
  - This environment-specific dictionary may provide extra information
"""
Hardcode a simple policy that accelerates left when the pole is learning towards
the left and accelerates right when the pole is learning toward the right.
"""
def basic_policy(obs):
angle = obs[2]
return 0 if angle < 0 else 1
totals = []
for episode in range(500):
    episode_rewards = 0
    obs, info = env.reset(seed=episode)
    for step in range(200):
        action = basic_policy(obs)
        obs, reward, done, truncated, info = env.step(action)
        episode_rewards += reward
        if done or truncated:
            break
    totals.append(episode_rewards)
"""
Looking at the results
"""
import numpy as np
print(np.mean(totals), np.std(totals), min(totals), max(totals))
"""
Even with 500 tries, this policy never managed to keep the pole upright for more than 63 consecutive steps.
"""
print("")
Neural Network Policies
A neural network policy takes an observation as input and outputs the action to execute. More precisely, it estimates a probability for each action, and then we select an action randomly according to the estimated probabilities.
Why pick randomly instead of always picking the highest-probability action? It lets the agent find the right balance between exploring new actions and exploiting the actions that are known to work well. The exploration/exploitation dilemma is central in reinforcement learning.
"""
Neural network policy in Keras
"""
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Dense(5, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # outputs the probability of action 0 (accelerate left)
])
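As a side note (this helper is not from the chapter), here is a minimal sketch of how an action could be sampled from this policy, assuming the single sigmoid output is the probability of action 0 (accelerating left); it relies on the numpy and TensorFlow imports above:
# Hypothetical helper: sample an action according to the policy's output probability
def sample_action(model, obs):
    left_proba = model(obs[np.newaxis])  # shape (1, 1): estimated probability of action 0 (left)
    action = tf.random.uniform([1, 1]) > left_proba  # True with probability 1 - left_proba
    return int(action[0, 0].numpy())  # 1 = accelerate right, 0 = accelerate left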
Evaluating Actions: The Credit Assignment Problem
The credit assignment problem: when the agent gets a reward, it is hard to know which actions should be credited (or blamed) for it. A common strategy is to evaluate an action based on the sum of all rewards that come after it, usually applying a discount factor. The sum of the discounted rewards is called the action's return. Typical discount factors vary from 0.9 to 0.99. We want to estimate how much better or worse an action is, compared to the other possible actions, on average. This is called the action advantage. For this, we must run many episodes and normalize all the action returns, by subtracting the mean and dividing by the standard deviation. After that, we can reasonably assume that actions with a negative advantage were bad while actions with a positive advantage were good.
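To make this concrete, here is a small sketch of how discounted returns and normalized advantages could be computed (the function names are illustrative, not necessarily the chapter's):
import numpy as np

def discount_rewards(rewards, discount_factor):
    # Work backwards: each step's return = its reward + discounted return of the next step
    discounted = np.array(rewards, dtype=np.float64)
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_factor
    return discounted

def discount_and_normalize_rewards(all_rewards, discount_factor):
    # all_rewards is a list of per-episode reward lists; normalize across all episodes
    all_discounted = [discount_rewards(rewards, discount_factor) for rewards in all_rewards]
    flat = np.concatenate(all_discounted)
    mean, std = flat.mean(), flat.std()
    return [(discounted - mean) / std for discounted in all_discounted]

# e.g., discount_rewards([10, 0, -50], discount_factor=0.8) -> array([-22., -40., -50.])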
Policy Gradients
PG algorithms optimize the parameters of a policy by following the gradients toward higher rewards. Here is one common variant:
- Let the neural network policy play the game several times, and at each step, compute the gradients that would make the chosen action even more likely - but don't apply these gradients yet (see the sketch after this list).
- Once you have run several episodes, compute each action's advantage, using the method in the previous section.
- If an action's advantage is positive, it means that the action was probably good, and you want to apply the gradients computed earlier to make the action even more likely to be chosen in the future. However, if the action's advantage is negative, it means the action was probably bad, and you want to apply the opposite gradients to make this action slightly less likely in the future. The solution is to multiply each gradient vector by the corresponding action's advantage.
- Compute the mean of all the resulting gradient vectors, and use it to perform a gradient descent step.
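Not the chapter's exact code, but a rough sketch of the "compute but don't apply gradients" step for the CartPole policy above, extending the sampling idea from earlier and assuming the sigmoid output is the probability of going left with binary cross entropy as the loss:
loss_fn = tf.keras.losses.binary_crossentropy

def play_one_step(env, obs, model, loss_fn):
    with tf.GradientTape() as tape:
        left_proba = model(obs[np.newaxis])
        action = tf.random.uniform([1, 1]) > left_proba  # pick right with probability 1 - left_proba
        y_target = tf.constant([[1.]]) - tf.cast(action, tf.float32)  # target probability of going left
        loss = tf.reduce_mean(loss_fn(y_target, left_proba))  # gradients push toward the chosen action
    grads = tape.gradient(loss, model.trainable_variables)  # computed, but not applied yet
    obs, reward, done, truncated, info = env.step(int(action[0, 0].numpy()))
    return obs, reward, done, truncated, grads
The gradients returned for each step would then be scaled by that action's advantage, averaged across episodes, and applied with an optimizer (e.g., via apply_gradients).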
Markov Decision Processes
Markov chains are stochastic processes with no memory. Such a process has a fixed number of states, and it randomly evolves from one state to another at each step. The probability for it to evolve from a state s to a state s′ is fixed, and it depends only on the pair (s, s′), not on past states.
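As an illustration (the transition probabilities below are made up, not from the book), a small three-state Markov chain could be simulated like this:
import numpy as np

transition_probabilities = np.array([  # probs[s][s'] = probability of moving from state s to state s'
    [0.7, 0.2, 0.1],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],  # state 2 only transitions to itself, so the chain can never leave it
])
state = 0
for step in range(10):
    state = np.random.choice(3, p=transition_probabilities[state])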
A terminal state is a state from which the Markov chain cannot escape. Markov decision processes resemble Markov chains, but with a twist: at each step, an agent can choose one of several possible actions, and the transition probabilities depend on the chosen action. Moreover, some state transitions return some reward, and the agent's goal is to find a policy that will maximize rewards over time.
There is an algorithm to estimate the optimal state-action values, generally called Q-values (quality values). The optimal Q-value of the state-action pair (s, a), noted Q∗(s, a), is the sum of discounted future rewards the agent can expect on average after it reaches the state s and chooses action a, but before it sees the outcome of this action, assuming it acts optimally after that action.
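For reference, the Q-Value Iteration algorithm estimates these optimal Q-values by repeatedly applying the following update to every state-action pair, where T(s, a, s′) is the transition probability, R(s, a, s′) is the reward, and γ is the discount factor:
Q_{k+1}(s, a) ← Σ_{s′} T(s, a, s′) · [R(s, a, s′) + γ · max_{a′} Q_k(s′, a′)]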
I am planning on reading a book on RL after this, so I am skipping the rest of the chapter.