Deep Reinforcement Learning Hands-On Ch1-5
I started reading Deep Reinforcement Learning Hands-On by Maxim Lapan. The first five chapters introduce Reinforcement Learning, the Cross-Entropy method, and the Bellman Equation.
Preface
The topic of this book is reinforcement learning (RL); it focuses on the general and challenging problem of learning optimal behavior in a complex environment. The learning process is driven only by the reward value and observations obtained from the environment. This model is very general and can be applied to many practical situations, from playing games to optimizing complex manufacturing processes.
This book was written as an attempt to fill the obvious gap in practical and structured information about RL methods and approaches.
Code Examples Used in the Book
Chapter 1: What is Reinforcement Learning?
Reinforcement Learning (RL) is a subfield of machine learning (ML) that addresses the problem of the automatic learning of optimal decisions over time. RL is an approach that natively incorporates an extra dimension (which is usually time) into learning equations.
Supervised Learning
Its basic question is: how do you automatically build a function that maps some input into some output when given a set of example pairs? The name supervised comes from the fact that we learn from known answers provided by a "ground truth" data source.
Unsupervised Learning
Unsupervised learning assumes no supervision and has no known labels assigned to our data. The main objective is to learn some hidden structure of the dataset at hand.
Reinforcement Learning
RL lies somewhere in between full supervision and a complete lack of predefined labels. On the one hand, it uses many well-established methods of supervised learning, such as deep neural networks for function approximation, stochastic gradient descent, and backpropagation, to learn data representation. On the other hand, it usually applies them in a different way.
RL's Complications
The first thing to note is that observation in RL depends on an agent's behavior and, to some extent, is the result of that behavior. This means the data is not independent and identically distributed (iid), which is a requirement for most supervised learning methods. The second thing that complicates our agent's life is that it needs to not only exploit the knowledge it has learned, but also actively explore the environment, because doing things differently might significantly improve the outcome. This exploration/exploitation dilemma is one of the open fundamental questions in RL. The third complicating factor is that reward can be seriously delayed after actions. During learning, we need to discover possibly long-term causalities, which can be tricky.
RL Formalisms
Reward
In RL, reward is just a scalar value we obtain periodically from the environment. It can be positive or negative, large or small. Its purpose is to tell the agent how well it has behaved. It's common to receive rewards every fixed timestep or every environment interaction, just for convenience. Reward is local, meaning that it reflects the success of the agent's recent activity and not all the successes achieved by the agent so far.
Agent
An agent is somebody or something who/that interacts with the environment by executing certain actions, making observations, and receiving eventual rewards for this.
The Environment
The environment is everything outside of an agent.
Actions
Actions are things that an agent can do in the environment. Discrete actions form the finite set of mutually exclusive things an agent can do. Continuous actions have some value attached to them.
Observations
Observations are pieces of information that the environment provides the agent with that say what's going on around the agent.
The Theoretical Foundations of RL
Markov Decision Processes
The Markov process, also known as the Markov chain, is a system that can switch between a finite set of states in the state space according to some laws of dynamics. A sequence of observations of these states over time forms a chain, and this chain is called the history. To call a system an MP, it needs to fulfill the Markov property, which means that the future system dynamics from any state have to depend on this state only. The main point of the Markov property is to make every observable state self-contained enough to describe the future of the system. In other words, the Markov property requires the states of the system to be distinguishable from each other and unique.
As long as your system model complies with the Markov property, you can capture the transition probabilities with a transition matrix, which is a square matrix of size $N \times N$, where $N$ is the number of states in the model. Each cell in row $i$ and column $j$ contains the probability of the system transitioning from state $i$ to state $j$.
| From \ To | Sunny | Rainy |
| --- | --- | --- |
| Sunny | 0.8 | 0.2 |
| Rainy | 0.1 | 0.9 |
The formal definition of an MP is:
- A set of states ($S$) that a system can be in
- A transition matrix ($T$), with transition probabilities, which defines the system dynamics
It's not complicated to estimate the transition matrix from our observations—we just count all the transitions from every state and normalize them to a sum of 1. It's also worth noting that the Markov property implies stationarity (that is, the underlying transition distribution for any state does not change over time). Nonstationarity means that there is some hidden factor that influences our system dynamics, and this factor is not included in observations.
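Here is a rough sketch of that counting-and-normalizing step in Python. The observed chain of sunny/rainy states below is made up for illustration, so the resulting numbers won't match the table above.

```python
import numpy as np

# Made-up observation history of the weather example (illustrative only)
states = ["Sunny", "Rainy"]
observed_chain = ["Sunny", "Sunny", "Rainy", "Rainy", "Rainy", "Sunny", "Sunny"]

# Count transitions between consecutive observations
idx = {s: i for i, s in enumerate(states)}
counts = np.zeros((len(states), len(states)))
for cur, nxt in zip(observed_chain, observed_chain[1:]):
    counts[idx[cur], idx[nxt]] += 1

# Normalize every row so it sums to 1, giving transition probabilities
transition_matrix = counts / counts.sum(axis=1, keepdims=True)
print(transition_matrix)
```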
Markov Reward Processes
Reward can be represented in various forms. The most general way is to have another square matrix, similar to the transition matrix, with the reward for transitioning from state $i$ to state $j$ residing in row $i$ and column $j$. The second thing we're adding to the model is the discount factor $\gamma$ (gamma), which is a single number from 0 to 1. For every episode, we define the return at time $t$ as this quantity:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
For every time point $t$, we calculate the return as the sum of subsequent rewards, but rewards that are $k$ steps away from the starting point $t$ are multiplied by the discount factor raised to the power $k$. The discount factor stands for the foresightedness of the agent. Most of the time, gamma is set to something between 0.9 and 0.99. Think of gamma as a measure of how far into the future we want to look when estimating the future return: the closer it is to 1, the more steps we take into account.
If we go to the extreme and calculate the mathematical expectation of the return for any state, we get a much more useful quantity, which is called the value of the state:

$$V(s) = \mathbb{E}[G \mid S_t = s]$$

For every state $s$, the value $V(s)$ is the average (or expected) return we get by following the Markov reward process.
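Both quantities are easy to sketch in code. The episode rewards and the discount factor of 0.9 below are assumptions for illustration: the return is the discounted sum of rewards, and the value estimate is just the average of the returns observed from a state.

```python
import numpy as np

GAMMA = 0.9  # assumed discount factor for this sketch

def discounted_return(rewards, gamma=GAMMA):
    # G = r_1 + gamma * r_2 + gamma^2 * r_3 + ...
    return sum(r * gamma ** k for k, r in enumerate(rewards))

# Made-up rewards observed after visiting some state in three episodes;
# the value of that state is estimated as the average of their returns.
episodes = [[1.0, 0.0, 2.0], [1.0, 1.0, 1.0], [0.0, 0.0, 5.0]]
returns = [discounted_return(ep) for ep in episodes]
print(returns, np.mean(returns))
```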
Adding Actions
We must add a set of actions (A), which has to be finite. This is our agent's action space. We need to condition our transition matrix with actions, which basically means that our matrix needs an extra dimension, which turns it into a cube.
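A rough illustration of that cube, with assumed sizes and random placeholder dynamics: the tensor is indexed by (state, action, next state), and each (state, action) slice is a probability distribution over next states.

```python
import numpy as np

n_states, n_actions = 3, 2  # assumed sizes for illustration

# Random placeholder dynamics: one transition matrix per action
transition_cube = np.random.rand(n_states, n_actions, n_states)
# Normalize over the destination-state axis so each (state, action) pair
# yields a proper probability distribution over next states
transition_cube /= transition_cube.sum(axis=-1, keepdims=True)

print(transition_cube.shape)  # (3, 2, 3)
print(transition_cube[0, 1])  # P(next state | state 0, action 1)
```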
Policy
The simple definition of a policy is that it is some set of rules that controls the agent's behavior. Even for fairly simple environments, we can have a variety of policies. The main objective of the agent in RL is to gather as much return as possible. Different policies can give us different amounts of return, which makes it important to find a good policy. Formally, the policy is defined as the probability distribution over actions for every possible state:

$$\pi(a \mid s) = P[A_t = a \mid S_t = s]$$
This is defined as probability and not as a concrete action to introduce randomness into an agent's behavior.
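A small sketch of such a stochastic policy for a toy problem (the sizes and probabilities are made up): each row is a distribution over actions for one state, and acting means sampling from that row.

```python
import numpy as np

n_actions = 2  # assumed action space size

# One probability distribution over actions per state (rows sum to 1)
policy = np.array([
    [0.5, 0.5],  # state 0: both actions equally likely
    [0.9, 0.1],  # state 1: strongly prefer action 0
    [0.0, 1.0],  # state 2: always take action 1 (deterministic special case)
])

def select_action(state, rng=np.random.default_rng()):
    # Sample an action according to pi(a | s)
    return rng.choice(n_actions, p=policy[state])

print(select_action(1))
```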
Chapter 2: OpenAI Gym
OpenAI Gym is a library that provides a uniform API between an RL agent and lots of RL environments, removing the need to write boilerplate code. The central class in this library is an environment, which is called Env. Instances of this class expose several methods and fields that provide the required information about its capabilities. At a high level, every environment provides this information and functionality:
- A set of actions that is allowed to be executed in the environment
- The shape and boundaries of the observations that the environment provides the agent with
- A method called step to execute an action, which returns the current observation, the reward, and the indication that the episode is over
- A method called reset, which returns the environment to its initial state and obtains the first observation
The environment may accept multiple actions at once. Observations are pieces of information that an environment provides the agent with, besides the reward.
import gym
e = gym.make('CartPole-v0')
obs = e.reset()
print(obs)
print(e.action_space)
print(e.observation_space)
print(e.step(0))
print(e.action_space.sample())
print(e.action_space.sample())
print(e.observation_space.sample())
print(e.observation_space.sample())
[-0.01932181 0.01775413 -0.04872379 0.04151772]
Discrete(2)
Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
(array([-0.01896673, -0.1766365 , -0.04789343, 0.3184385 ], dtype=float32), 1.0, False, {})
0
0
[-4.6867809e+00 -2.7532286e+38 2.6945692e-01 -2.2183380e+38]
[-2.8329217e+00 -3.2517213e+38 2.6583666e-01 1.2318715e+38]
import gym

env = gym.make('CartPole-v0')
total_reward = 0.0
total_steps = 0
obs = env.reset()
while True:
    # Sample a random action, apply it, and accumulate the reward
    action = env.action_space.sample()
    obs, reward, done, _ = env.step(action)
    total_reward += reward
    total_steps += 1
    if done:
        break
print("Episode done in %d steps, total reward %.2f" % (total_steps, total_reward))
(array([-0.04837472, -0.16432339, 0.04757575, 0.28160578], dtype=float32),
1.0,
False,
{})
Chapter 4: The Cross-Entropy Method
Taxonomy of RL methods:
- Model-free or model-based
- Value-based or policy-based
- On-policy or off-policy
The cross-entropy method is simple and has good convergence. It falls into the model-free and policy-based category of methods. The term "model-free" means that the method doesn't build a model of the environment or reward; it just directly connects observations to actions. In other words, the agent takes current observations, does some computations on them, and the result is the action that it should take. Model-based methods, in contrast, try to predict what the next observation and/or reward will be. Based on this prediction, the agent tries to choose the next action.
Policy-based methods directly approximate the policy of the agent, that is, what actions the agent should carry out at every step. Value-based methods instead calculate the value of every possible action and choose the action with the best value. Off-policy refers to the ability of a method to learn from historical data.
The core of the cross-entropy method is to throw away bad episodes and train on better ones (a sketch of the filtering step follows the list below):
- Play N episodes using our current model and environment.
- Calculate the total reward for every episode and decide on a reward boundary. Usually, we use some percentile of all rewards, such as 50th or 70th.
- Throw away all episodes with a reward below the boundary.
- Train on the remaining "elite" episodes using observations as the input and issued actions as the desired output.
- Repeat from step 1 until we become satisfied with the result.
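Here is a minimal sketch of the filtering step (steps 2 and 3 above); this is not the book's exact code, and the Episode structure and 70th-percentile default are assumptions:

```python
import numpy as np
from collections import namedtuple

# One finished episode: its total reward plus the (observation, action) pairs
Episode = namedtuple("Episode", ["reward", "observations", "actions"])

def filter_elite(batch, percentile=70):
    rewards = [ep.reward for ep in batch]
    reward_bound = np.percentile(rewards, percentile)
    train_obs, train_act = [], []
    for ep in batch:
        if ep.reward < reward_bound:
            continue  # throw away episodes below the reward boundary
        train_obs.extend(ep.observations)
        train_act.extend(ep.actions)
    # Observations become the network input, issued actions the targets
    return train_obs, train_act, reward_bound
```

In the full loop, the surviving observations and actions would be used to train a policy network with a cross-entropy loss, and a fresh batch of episodes would then be played with the updated policy.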
Chapter 5: Tabular Learning and the Bellman Equation
If we choose a concrete action $a$ and calculate the value given to this action, then the value will be $V_0(a) = r_a + V_a$. So, to choose the best possible action, the agent needs to calculate the resulting values for every action and choose the maximum possible outcome. In other words, $V_0 = \max_{a \in A}(r_a + V_a)$. If we are using the discount factor $\gamma$, we need to multiply the value of the next state by gamma:

$$V_0 = \max_{a \in A}(r_a + \gamma V_a)$$
Bellman proved that with that extension, our behavior will get the best possible outcome. In other words, it will be optimal. So, the preceding equation is called the Bellman equation of value (for a deterministic case).
The optimal value of the state is obtained by the action that gives us the maximum possible expected immediate reward, plus the discounted long-term value of the next state.
The value of the action, $Q(s, a)$, equals the expected total reward we can get by executing action $a$ in state $s$, and it can be defined via $V(s)$:

$$Q(s, a) = \mathbb{E}_{s'}\left[r_{s,a} + \gamma V(s')\right] = \sum_{s' \in S} p_{a, s \to s'}\left(r_{s,a} + \gamma V(s')\right)$$
In the case of Q, to choose the action based on the state, the agent just needs to calculate Q for all available actions using the current state and choose the action with the largest value of Q.
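A small sketch of that Q-based action selection, assuming we already have value estimates V and a model of the dynamics (the sizes and the random P and R below are placeholders, not real environment data):

```python
import numpy as np

GAMMA = 0.9
n_states, n_actions = 3, 2  # assumed sizes

# Placeholder dynamics: transition probabilities P[s, a, s'] and rewards R[s, a, s']
P = np.random.rand(n_states, n_actions, n_states)
P /= P.sum(axis=-1, keepdims=True)
R = np.random.rand(n_states, n_actions, n_states)
V = np.zeros(n_states)  # current value estimates for every state

def q_values(state):
    # Q(s, a) = sum over s' of P(s' | s, a) * (r(s, a, s') + gamma * V(s'))
    return (P[state] * (R[state] + GAMMA * V)).sum(axis=-1)

def greedy_action(state):
    # Pick the action with the largest Q value for this state
    return int(np.argmax(q_values(state)))

print(q_values(0), greedy_action(0))
```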
Value Iteration (VI) is an algorithm used to solve RL problems. It works by iteratively improving its estimate of the 'value' of being in each state. It does this by considering the immediate rewards and expected future rewards when taking different available actions. These values are tracked using a value table, which updates at each step.
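Here is a minimal Value Iteration sketch under the same placeholder assumptions (random P and R, an assumed gamma of 0.9): the Bellman update is applied to the whole value table until it stops changing.

```python
import numpy as np

GAMMA = 0.9
n_states, n_actions = 3, 2  # assumed sizes

P = np.random.rand(n_states, n_actions, n_states)   # P[s, a, s'], placeholder
P /= P.sum(axis=-1, keepdims=True)
R = np.random.rand(n_states, n_actions, n_states)   # r(s, a, s'), placeholder

def value_iteration(P, R, gamma=GAMMA, eps=1e-6):
    V = np.zeros(P.shape[0])  # the value table: one entry per state
    while True:
        # Bellman update: V(s) <- max_a sum_s' P(s'|s,a) * (r + gamma * V(s'))
        Q = (P * (R + gamma * V)).sum(axis=-1)  # shape (n_states, n_actions)
        new_V = Q.max(axis=1)
        if np.max(np.abs(new_V - V)) < eps:     # stop when the table has converged
            return new_V
        V = new_V

print(value_iteration(P, R))
```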