A Formal Introduction to Deep Reinforcement Learning

Deep Reinforcement Learning from scratch with Deep Q-learning

Lorenzo Tinfena
28 min read · May 10, 2022

To follow this article easily, I suggest that you be familiar with:

  • Basic math
  • Basic matrix operations
  • Partial derivatives
  • Probability function

In some parts, I will use a notation like a_b to denote a subscript in maths, like in LaTeX.

1. Introduction

1.1 Purpose

In this article, I would like to explain how it is possible to build an artificial intelligence that understands when it makes errors and when it performs its actions correctly, and that improves itself continuously. To me, this is currently one of the most general forms of artificial intelligence. I also made a sample application: an inverted pendulum, which is simply a pole hinged to a pivot on a cart; the cart must move either to the right or to the left so that the pole does not fall.
All of the algorithm implementations I used in the attached project (see the bottom) are written from scratch in Python.

[Figure: the inverted pendulum (cartpole) environment]

A process called “deep Q-learning” is the algorithm I used for this, but before we get to how it works, we have to start with the basics.
What are artificial intelligence and machine learning?
What is the concept of reinforcement learning?

1.2 What is artificial intelligence?

Artificial intelligence is the ability of a machine to reproduce biological intelligence, that is, the one we all know that humans, animals, and plants have.

Generalizing, we can say that the intelligence (both artificial and biological) of an entity in an environment is its ability to choose the correct action to perform according to its state and its experience. The chosen action is calculated as:

A = E(S)

Where A is the chosen action, E is the experience, and S is the state.

For example, for a human being to walk, he has to choose which muscles to move in a space (the environment), based on his position (the state), and on his experience (which regulates in a general way how to move).

1.2.1 The relationship between general intelligence and the correctness of actions

The intelligence of an entity is proportional to the correctness of the actions performed by the entity as its state and experience vary. Thus, just as correctness is subjective, so is intelligence.

1.2.2 The criteria for the correctness of actions

To create artificial intelligence that assists humans, the criterion of correctness will have to be subjective to the users.
For instance: person A applauds and considers a robot that cleans the house once a week very intelligent, while person B assumes the opposite, because he would like the robot to clean the house 5 times a day.
We could incorporate into the robot's state the exact person whose house it has to clean, thereby making the intelligence a little more objective. However, it can never be fully so.

1.3 How is the experience of an entity formed?

The experience and the state of the entity determine the actions that will be carried out. We can consider the experience as the strategy of the entity, and it is divided into 2 broad types:

  • Static experience: This is only found in artificial entities, and from a mathematical point of view, we can identify it as a “non-parametric function”.
    For example, a program (artificial entity) that is given the route of a car, i.e. the space covered in kilometers and the duration in hours (both of these form the state), and a non-parametric function (static experience) f(space, duration) = space/duration that calculates the average speed (the action is the value of the average speed).
  • Dynamic experience: This is found in both artificial and biological entities, and from a mathematical point of view, we can identify it as a “parametric function”.
    For example, a program (artificial entity) that, given the surface of a house (state), must predict its price (the chosen action is the value itself). It cannot use a non-parametric function, because many variables would affect the price.
    These variables could be the location of the construction or the year of construction.
    We could include these variables in the state together with the surface, but it would still be difficult to write by hand a function that efficiently calculates the price.
    For this reason, we write a very generalized function, parameterized by variables that are determined dynamically, trying to maximize the accuracy (or intelligence) of the entity; see the sketch after this list.
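To make the distinction concrete, here is a minimal sketch of a static experience versus a dynamic one (the names and values are illustrative, not taken from the attached project):

```python
# Static experience: a fixed, non-parametric function.
def average_speed(space_km: float, duration_h: float) -> float:
    # State (space, duration) -> action, with nothing to learn.
    return space_km / duration_h

# Dynamic experience: a parametric function whose parameters (w, b)
# are adjusted over time to improve the entity's accuracy.
def predict_price(surface_m2: float, w: float, b: float) -> float:
    # State (surface) -> action, where w and b encode the learned experience.
    return w * surface_m2 + b

print(average_speed(150.0, 2.0))                  # 75.0 km/h
print(predict_price(80.0, w=2000.0, b=10000.0))   # a hypothetical price estimate
```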

2 Reinforcement learning

2.1 General machine learning

In an artificial entity with static experience, it is the task of classical programmers to determine the suitable static experience to maximize its accuracy.

Instead, when dealing with a dynamic experience, it is the artificial entity itself (at least in most cases) that modifies its own experience, and it is only in this case that the entity is allowed to do so.
This means that it must be provided with a correctness criterion.

The set of processes that govern the experience, and how it can change, is called machine learning.

2.1.1 Types of machine learning

For an entity to modify its experience in a way that maximizes its accuracy, someone must provide it with a correctness criterion. So, how an entity modifies its experience given a correctness criterion determines the various types of machine learning.

  • Supervised learning (dynamic experience):
    In supervised learning, the provided correctness criterion is a set of records called a dataset, or more specifically a training set, where each record contains a state and an action that is considered correct.
    There are many models/algorithms in supervised learning, but the general idea is that the experience is constantly modified so that, for each state, the chosen action is similar to the corresponding one in the dataset.
  • Unsupervised learning (static experience):
    In unsupervised learning, the experience, that is, how the entity should work when it is tested, is determined from the beginning.
    This means that it is static, because the purpose of the models/algorithms of this type of machine learning is generally to collect insights or statistics regarding the data.
    For example, a single state comprises a list of images that must be classified by looking for the differences between them (the list of predictions is the chosen action).
  • Reinforcement learning (RL) (dynamic experience):
    A very general way of providing the correctness criterion is simply saying how correct the action chosen by the entity is, based on the state.
    This value is called the reward. The experience here will be modified trying to maximize it. Giving a reward (mostly continuous, not discrete) to the entity for every action that it carries out is a very general workflow.
    A valid example is ourselves: with our senses we understand our surroundings and take actions to increase or improve our happiness (i.e. the reward).
    For each state, action, and random factor (only in certain environments), a reward is attached.

2.2 Basic workflow

To summarize: providing scalar feedback (a reward) for each state-action pair is a very simple and general correctness criterion.
Now let's better understand the basic workflow of an artificial reinforcement learning entity.

[Figure: the reinforcement learning loop between agent, interpreter, and environment]

The agent initially receives the first state s and then uses it together with its experience to perform an action a.
Then, an interpreter (which in some cases corresponds to the environment itself) receives s' (the next state) and evaluates the choice of a (based on s, s', and sometimes also on a random factor that we will formalize later). The reward r' chosen by the interpreter is given to the agent along with s'.
Let's see a small example in Python of an agent that performs random actions:
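The original embedded snippet is not reproduced in this text; a minimal sketch of such a random agent, using the classic gym interface (the reset/step signatures may differ slightly depending on your gym version), could look like this:

```python
import gym

# Create the cartpole environment and get the first state s.
env = gym.make("CartPole-v1")
state = env.reset()

done = False
total_reward = 0.0
while not done:
    # The agent ignores the state and samples a random action a.
    action = env.action_space.sample()
    # The environment/interpreter returns s' and the reward r'.
    state, reward, done, info = env.step(action)
    total_reward += reward

print("total reward:", total_reward)
env.close()
```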

2.3 Formalization

Before we begin, you need to understand these concepts well:

  • States are sometimes called observations.
  • All possible states belong to an observation space.
  • Like states, actions also belong to an action space.
  • Each space can be discrete, continuous, or a combination of other spaces. For example, a discrete space is defined as [0, 1, … , n-1], where n is the number of discrete states, and a continuous space is a tensor of scalars with a minimum and a maximum value.
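As an illustration, this is roughly how such spaces can be declared with gym's spaces module (a sketch; the exact bounds and shapes are hypothetical):

```python
import numpy as np
from gym import spaces

# A discrete space with n = 4 states: {0, 1, 2, 3}.
discrete_space = spaces.Discrete(4)

# A continuous space: a tensor (here a vector of 3 scalars),
# each bounded between a minimum and a maximum value.
continuous_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)

print(discrete_space.sample())    # e.g. 2
print(continuous_space.sample())  # e.g. [ 0.13 -0.87  0.42]
```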

To formalize a reinforcement learning workflow, it is viewed in the form of a particular type of Markov chain named a Markov decision process (MDP).
An MDP is a way to represent a discrete observation space (S) of an environment and its discrete action space (A).

An MDP is not part of the entity's experience; it is simply a way to represent transitions in reinforcement learning. We are currently considering environments with discrete observation and action spaces, and the equations derived from an MDP will also be used for continuous observation and action spaces, as we will see later.

2.3.1 Types of MDP

  • Fully observable MDP: In addition to S and A, a matrix P (also called T) is included. It can be stochastic, meaning that

    P(s, a, s') = p(s_{t+1} = s' | s_t = s, a_t = a)

    is the probability of passing from state s to state s' having performed the action a; or deterministic:

    s' = P(s, a)

    It also has a matrix R, such that

    R(s, a, s')

    is the reward obtained from the transition from state s to state s' having performed action a.

In some MDPs, different rewards can occur for the same s, a, and s', so the matrix R can represent the average of such values.

Usually, P is expressed stochastically to generalize; at most, you will have a deterministic-like probability distribution. This type of MDP respects the Markov property, which says that, given the state (and action) at time t, the transitions and rewards after time t are independent of what happened before t. So, when we talk about MDPs in general, we refer to this type: MDP = (S, A, P, R).

  • Partially observable MDP:
    This does not have a probability matrix P or a reward matrix R, because they are unknown, so one can only try to approximate them. This type of MDP may not respect the Markov property; in the next section, we will understand why.

The dynamics of the environment are often characterized using a fully observable MDP, while the dynamics of the agent can be characterized using the same fully observable MDP as its environment, or by another, partially observable MDP. This will also be clarified in the next section.
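As a concrete (and purely hypothetical) illustration of a fully observable MDP with discrete spaces, P and R can be stored as plain arrays indexed by (s, a, s'):

```python
import numpy as np

n_states, n_actions = 3, 2

# P[s, a, s'] = probability of reaching s' from s after performing action a.
P = np.zeros((n_states, n_actions, n_states))
P[0, 0, 1] = 1.0              # deterministic: from state 0, action 0 always leads to state 1
P[0, 1, :] = [0.5, 0.0, 0.5]  # stochastic: from state 0, action 1 leads to state 0 or 2

# R[s, a, s'] = (average) reward obtained for that transition.
R = np.zeros((n_states, n_actions, n_states))
R[0, 0, 1] = 1.0

# Each row P[s, a, :] must be a probability distribution over next states.
assert np.allclose(P[0, 1, :].sum(), 1.0)
```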

2.4 Model-based and model-free RL

There are 2 types of reinforcement learning (each with its own algorithms), called model-based RL and model-free RL. They differ in what the agent knows about the environment; formally, we can define the difference using the 2 types of MDP described above.

[Figure: model-based vs model-free reinforcement learning]
  • Model-based RL:
    This type of reinforcement learning uses a fully observable MDP that describes both the dynamics of the agent and those of the environment, so the agent can predict all transitions.
    We can find this type of RL in some video games like “snake”, where every state of the environment (the matrix) corresponds to the state of the (artificial) agent. However, it is difficult to apply in reality, because, for example, a human cannot know what happens in every part of the universe.
  • Model-free RL:
    In model-free RL, the dynamics of the agent are described by a partially observable MDP, while the dynamics of its environment are described by a fully observable MDP.
    The agent, therefore, does not initially know the rewards (R) and P. All it can do is try to predict them, within the limits determined by the relationship between the two MDPs.
    In these two different MDPs, the states may not correspond, and the partially observable MDP does not always follow the Markov property.
    For example, suppose one day a person performs an action that produces a butterfly effect on the other side of the world, then suffers a stroke and can no longer remember anything. The next day he may suffer the consequences of what he did, but if he had not performed that action that day, the future, and thus the future transitions, would have been different.

2.5 Policy

Recalling that an agent, when receiving a state, chooses which action to take based on its experience, we can now understand how this process occurs.
The strategy/experience of an agent in reinforcement learning is determined by a policy, which may be stochastic (π : A × S → [0, 1]) or deterministic (π : S → A), where S and A are the observation space and the action space of the agent's MDP.

The stochastic policy is generally defined as

π(a|s) = p(a_t = a | s_t = s)

where p is the probability function, or (alternative notation) π(a, s) = p(a|s). The probability given by the stochastic policy, for a given state and action, is the probability that that action is optimal according to the agent. The deterministic policy instead deterministically chooses the best action given a state and is defined as

a = π(s)

The action chosen by the policy will then be carried out. To keep the formulas general, the stochastic policy is usually used.

2.6 Policy evaluation

Premise: In a series of infinite transitions, at time t we can define a variable called the return (or more specifically the discounted return), with the symbol G (R is also used, but it can create confusion with the reward matrix), defined as

G_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + …

If instead the current state is the last of the episode, G is worth 0, because it is useless to consider future rewards that will never be obtained. The return represents a sum of future rewards, where more weight is given to the immediate ones. The discount factor γ ∈ [0, 1] regulates this weight: the smaller it is, the more the immediate rewards affect the discounted return. Given this premise, we can define the optimal policy

π*

as that deterministic policy which, at time t = 0, maximizes the expected value of the return G or, equivalently, as that policy which, if used, maximizes the average of the total rewards over a large number of episodes. Formally:

π* = argmax_π E(G_0)

with

s_0 ∼ p_0(·),   a_t ∼ π(· | s_t),   s_{t+1} ∼ P(· | s_t, a_t)

The dot is used only by convention. For example, p_0(·) in the equation is the discrete probability distribution that each state is selected as the first one; another notation used is p(s_0 = ·). Another example: P(· | s_t, a_t) is the discrete probability distribution that each state is selected as the next one, given the current state and the chosen action.

The formula may seem complicated, and of course it is, but let's analyze it: E(G) does not refer to a single series of transitions, but to the possible transitions in the MDP of the agent; in fact, the expected value of the return is calculated with the discrete random variables s_0, a_t, and s_{t+1}.

The first is the first state of the return. It has a probability distribution given by the probability that s_0 is the state of the first transition; we must be careful that t = 0 means the initial time of the return calculation, not the time of the episode.

Then there is the random variable a_t, which has a distribution obtained from the policy given the state at the same time t. Finally, the last random variable is s_{t+1}, which has a probability distribution given by the probability that this state is selected by the environment at time t+1, given the state at time t and the action taken a_t.

In the case of a fully observable MDP, the algorithms/optimizers try to solve this equation directly. In partially observable MDPs, however, the probabilities of the transitions and the rewards are initially unknown; since the optimal policy depends on the rewards, and the prediction of the rewards varies with time, the optimal policy can change: the optimal policy at time t will not be the optimal policy at time t+1. We can therefore distinguish the optimal policy according to the agent from the optimal policy according to the environment (the “really optimal” one). The goal is to bring the agent's optimal policy closer to the environment's optimal policy.

2.6.1 Difference between on-policy and off-policy algorithms

In on-policy algorithms, the same policy used in evaluation is also used during training; in off-policy algorithms, the optimal policy is used only in evaluation, while training can follow a different policy.
In the Q-learning section, we will see why.

The optimal policy is always deterministic, but a non-optimal policy can be either stochastic or deterministic; this concept will also be clarified in section 3.1.

If something is unclear, I recommend that you continue reading the Q-learning section, and then come back to this section later.

There are many types of reinforcement learning algorithms with different approaches. For example, there are algorithms of the value optimization type and of the policy optimization type. We will not go into detail; it is enough to know that Q-learning, which we will see next, belongs to the value optimization family.

2.7 State-value and action-value

The state-value

V_π(s)

is a function that tells how good a state is; basically, it is the expected return following the policy π, starting from the state s. Formally:

V_π(s) = E_π(G_t | s_t = s)

The action-value

Q_π(s, a)

also called the Q-value, is another function, and it plays a role similar to the state-value, but the first chosen action must also be specified.
The Q-value, then, is the expected return following the policy π, starting from a determined state and initially choosing a determined action:

Q_π(s, a) = E_π(G_t | s_t = s, a_t = a)

2.8 Q-value evaluation and Bellman equation

The Bellman equation succeeds in expressing the Q-value recursively:

Q_π(s, a) = E(r' + γ Q_π(s', a'))

Note that the prime symbol (') is used to denote the next action or state.

Now, considering the optimal policy (remember that if it is optimal it is deterministic), given a state, we obtain the action that maximizes the Q-value. So, we can say:

π*(s) = argmax_a Q_{π*}(s, a)

Now we can define the optimal Q-value (the one that follows the optimal policy) with the Bellman equation (in this case called the Bellman equation for optimality):

Q_{π*}(s, a) = E(r' + γ max_{a'} Q_{π*}(s', a'))

This formula will take us into the next section to optimize the agent.

3 Q-Learning

Q-learning is a model-free reinforcement learning algorithm of the value optimization type. The optimization is based on the Q-value and the Bellman equation. Its limitations are that the observation and action spaces must be discrete, and that the policy is deterministic.

The agent has a two-dimensional table called the Q-table, where each row corresponds to a state in the observation space and each column to an action in the action space.
This table has a corresponding Q-value for every state-action pair. From now on, we will take for granted that it is the Q-value that follows the optimal policy, so the following convention will be valid:

Q(s, a) ≡ Q_{π*}(s, a)

As already mentioned, the optimal policy (which, I remind you, is always deterministic), when given a state, chooses the action with the highest Q-value:

π*(s) = argmax_a Q(s, a)

If we can collect the right Q-values, we’ve got the optimal policy!
But how do we calculate the Q-values?
If we re-observe the Bellman equation for optimality (which uses that same policy), we will notice that at time t, if we know the Q-value at time t-1, we can better approximate the Q-value at time t.
I used the word “approximate” for 3 reasons: the rewards can vary even with the same s, a, and s', so if I get a certain reward for a transition now, I could get a different one later for the same transition; s' itself may not be fixed if P is stochastic; finally, the Q-value at time t-1 is also an approximation.

But if we keep updating the Q-values over the episodes, for a time that tends to infinity, we can obtain the average (or expected) Q-value.
This means approximating the optimal policy according to the agent with the one according to the environment (the really optimal one), which is exactly what we are looking for.
The update happens in the following way:

Q(s, a) ← Q(s, a) + α (r' + γ max_{a'} Q(s', a') − Q(s, a))

Let's understand this process better:

r' + γ max_{a'} Q(s', a')

is the expected (or target) Q-value of s and a, while

Q(s, a)

is the current one. We calculate their difference and obtain an error, which we add to the current Q-value after multiplying it by a factor α ∈ ]0, 1], called the learning rate: if it is close to 1, the new Q-value is close to the target one, but if it is close to 0, the new Q-value stays close to the current one.
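In code, with the Q-table stored as a numpy array indexed by (state, action), this update is a single line; here is a minimal sketch (the shapes and hyperparameters are illustrative):

```python
import numpy as np

n_states, n_actions = 16, 2
gamma, alpha = 0.99, 0.1

q_table = np.zeros((n_states, n_actions))

def update(state, action, reward, next_state, done):
    # Target Q-value from the Bellman equation for optimality;
    # if the episode is over, there are no future rewards to consider.
    target = reward if done else reward + gamma * np.max(q_table[next_state])
    # Move the current Q-value a fraction alpha towards the target.
    q_table[state, action] += alpha * (target - q_table[state, action])
```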

The basic workflow of Q-learning is as follows:

[Figure: the basic Q-learning workflow]

3.1 Exploration vs exploitation dilemma

Imagine you are in a room with 2 doors; you know that behind one door there are 100 coins, but you do not know what may be behind the other.
Maybe there are more than 100 coins, or maybe fewer.
The question is: do you try your luck, or do you accept the first, safe 100 coins?
Another example: every time a student goes to school, he or she always travels the same road. Should the student try a road that he or she does not know, and perhaps discover that it shortens (or, at worst, lengthens) the route?
Surely it is disadvantageous to try new roads every day: if you take the risk, the route may get longer, but if you never take the risk, you will only ever take the best route that you already know.
The solution might be to try new routes now and then. But how often should you try a route that you don't know?

This is a big dilemma in reinforcement learning, and it is based on the choice between exploration (exploring new ways) and exploitation (going the best ways I already know). Because of this, the action that is actually carried out is not always the one chosen by the optimal policy.

Having said this, we can identify the real policy (previously called “non-optimal”), which is the one that chooses the actions that will actually be performed. It relies on the optimal policy for exploitation (although sometimes exploitation can be avoided during training), and (usually) on a random factor for exploration. In other words, the action chosen by the optimal policy may or may not be the one that is actually carried out. To make this choice there are many algorithms, called bandit algorithms; the most important are the following:

  • Epsilon-greedy (ε-greedy): Epsilon is a constant value between 0 and 1 inclusive; it is the probability that a random action is performed. The pseudocode is:
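The original pseudocode embed is not reproduced here; a minimal sketch of the idea (assuming the Q-table is a numpy array) could be:

```python
import random

def epsilon_greedy(q_table, state, epsilon, n_actions):
    # With probability epsilon, explore: pick a random action.
    if random.random() < epsilon:
        return random.randrange(n_actions)
    # Otherwise, exploit: pick the action with the highest Q-value.
    return int(q_table[state].argmax())
```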

The choice of the epsilon constant is based on how deterministic the MDP is: the more deterministic it is, the less it needs to be explored, so a small epsilon is fine.
In addition, the duration of the episodes also matters: if they tend to finish quickly at the slightest wrong action, it is best to prefer a low epsilon.

  • Softmax: This is a “value-adaptive epsilon algorithm”, and it relies on the Q-values to determine the probability of choosing each action. The algorithm chooses the action based on a probability distribution calculated from the action values (the estimated returns): the higher the value, the higher the probability that the action is chosen. Each probability of the distribution is obtained from:

p(a) = exp(Q(s, a) / τ) / Σ_b exp(Q(s, b) / τ)

The parameter τ > 0 regulates the choice: if it is high, the difference in probability between actions with different values is smaller, but if it is close to 0, a “greedy” choice is made, which is always choosing the best action.

  • Decay epsilon-greedy (based on ε-greedy): This algorithm, at each step, decreases epsilon exponentially, as follows:

ε ← ε · ε_decay

where ε_decay ∈ [0, 1].
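To complement the ε-greedy sketch above, here is a rough sketch of softmax action selection and of the epsilon decay step (subtracting the maximum before the exponential is a standard numerical-stability trick; the default values mirror the hyperparameters listed later in section 4.7):

```python
import numpy as np

def softmax_action(q_values, tau=1.0):
    # Probability distribution over actions computed from their Q-values.
    z = (q_values - q_values.max()) / tau
    probs = np.exp(z)
    probs /= probs.sum()
    return int(np.random.choice(len(q_values), p=probs))

def decay_epsilon(epsilon, epsilon_decay=0.99, min_epsilon=0.01):
    # Exponential decay of epsilon, clipped at a minimum value.
    return max(min_epsilon, epsilon * epsilon_decay)
```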

[Figure: the exploration vs exploitation trade-off]

3.2 Example in Python

Now let's take a look at a small example in Python, which we'll then explain.
In the attached project there is a Python notebook called “q-learning-example.py”, divided into 2 parts. Let's analyze the first one:

We import the libraries we will use: numpy for various mathematical operations, gym, used to create environments with a common interface quickly and easily, and plotly, to show graphs.

The class Q-Agent is defined: the environment is passed to its constructor, and the Q-table is initialized with zeros.
A start-episode function is defined, with a loop that lasts until the episode is over.
It chooses the action with the real (not optimal) policy, using the epsilon-greedy algorithm. Then the agent performs that action and gets the reward, the next state, and a boolean variable which indicates whether the episode is over.
If it is over, the update does not take into account the Q-value of the next state, because it would not make sense to count on future rewards if the episode is over.
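The real implementation is in the attached project; a simplified sketch of such an agent (class and method names here are hypothetical, the environment is assumed to have discrete spaces, and the gym step signature may differ between versions) might look like this:

```python
import random
import numpy as np

class QAgent:
    def __init__(self, env, discount_factor=0.99, learning_rate=0.1):
        self.env = env
        self.gamma = discount_factor
        self.alpha = learning_rate
        # Q-table initialized with zeros: one row per state, one column per action.
        self.q_table = np.zeros((env.observation_space.n, env.action_space.n))

    def start_episode(self, epsilon):
        state = self.env.reset()
        total_reward = 0.0
        done = False
        while not done:
            # Real policy: epsilon-greedy over the Q-table.
            if random.random() < epsilon:
                action = self.env.action_space.sample()
            else:
                action = int(self.q_table[state].argmax())
            next_state, reward, done, info = self.env.step(action)
            # If the episode is over, ignore the Q-value of the next state.
            target = reward if done else reward + self.gamma * self.q_table[next_state].max()
            self.q_table[state, action] += self.alpha * (target - self.q_table[state, action])
            state = next_state
            total_reward += reward
        return total_reward
```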

After defining the agent, in this second part we execute a small example program. The discount factor used is 0.99 and the learning rate 0.1.
Note that the value of epsilon in testing is equal to 0, so that the real policy corresponds to the optimal one. The first graph shows the total reward trend over the various episodes, following the optimal policy and the real one.
Then the last graph shows the distribution of the total reward following the optimal policy.

The output of the script is as follows:

mean total reward: 7.92, standard deviation total reward: 2.58

3.3 Conclusion

Q-learning is a very easy algorithm, and its parameters (discount factor, learning rate, etc.) have a relatively good margin of error. Its limitations become apparent especially as the complexity of the agent's MDP increases.

4 Deep Q-Learning

A great limitation of Q-learning is the discrete observation space, since the Q-values are obtained by accessing a matrix.
The solution adopted by deep Q-learning is to replace the Q-table with a function that approximates the Q-values, and this function is a neural network (called in this case a deep Q-network, or DQN). When it is given a state, it approximates/predicts the Q-values of all possible actions. Deep Q-learning is therefore a deep reinforcement learning algorithm, a term derived from both reinforcement learning and deep learning.

[Figure: a deep Q-network mapping a state to the Q-values of all possible actions]

The function that calculates the Q-values will follow the following convention:

Q(s, a) ≡ Q(s, a; W)

where W are the parameters of the neural network.

4.1 Neural Networks

A neural network (also called an artificial neural network, or a deep neural network) is a function approximator that takes inspiration from biological neural networks, and it can be structured in many ways. Below we will see a simple example.

[Figure: example structure of a neural network with input, hidden, and output layers]

Each neural network has an input layer that contains the inputs, hidden layers where the inputs are processed, and an output layer that contains the outputs of the neural network.
The parameters W (or θ) are called weights and, together with the biases B, they are the ones that parameterize the prediction of the neural network. W is simply a three-dimensional matrix, where the first dimension identifies the layer, the second the next neuron, and the third the previous neuron.
Thinking of a linear function, W represents the angular coefficients and B the known term.
The inputs of the neural network correspond to the Z_1 values, while the outputs are the A_L values, where L is the total number of layers.
The Z and A values of each neuron, also called input and output values (except those in the input layer), are calculated in the way I summarize in the following diagram, with input a vector x:

[Figure: computation of a single neuron: its input value Z is a weighted sum of the previous layer's outputs plus the bias, and its output value is A = σ(Z)]

In general:

Z_l = W_l · A_{l-1} + B_l,   A_l = σ(Z_l)

where σ is a function called the activation function, which mainly has the task of preparing the input values for the next summation.
For example, it tries to avoid large values, although it also has other purposes that we will not go into. There are many types of activation functions, and they can vary between layers or even between individual neurons of a single layer. The most used activation functions are sigmoid, ReLU, leaky ReLU, the linear function, etc.
In the cartpole project, I used leaky ReLU for the hidden layers and the linear function for the input and output layers.
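As an illustration, here is a minimal numpy sketch of this forward pass, with the weights stored as one matrix per layer; the layer sizes and the initialization range mirror the ones listed in section 4.7, but the code itself is only a sketch, not the attached project's implementation:

```python
import numpy as np

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def forward(x, weights, biases):
    # x: input vector; weights[l], biases[l]: parameters of layer l.
    a = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ a + b
        # Leaky ReLU on the hidden layers, linear on the output layer.
        a = z if l == len(weights) - 1 else leaky_relu(z)
    return a

# Hypothetical DQN shapes: 4 inputs (cartpole state) -> 50 -> 100 -> 2 Q-values.
rng = np.random.default_rng(0)
sizes = [4, 50, 100, 2]
weights = [rng.uniform(-0.5, 0.5, (n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]
print(forward(np.array([0.0, 0.1, -0.02, 0.3]), weights, biases))  # two Q-values
```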

[Figure: common activation functions (sigmoid, ReLU, leaky ReLU, linear)]

How many neurons in the hidden layer must a neural network have? Or how many hidden layers must it have?

Both things vary from problem to problem, but we can say that to approximate complex functions we need more layers, even if there is a risk of overfitting the neural network. Regarding the number of neurons, there are some rules of thumb, like the one that says that the right number is half the number of neurons in the input layer plus half the number of neurons in the output layer (i.e. their average), but in reality we often have to go by trial and error.

4.2 Neural Network Optimization

Premise: From now on, we will consider the bias B as part of the weights W, so it is possible to think of an extra neuron for each layer (except the output layer), with an output value equal to 1 and connected only to the neurons of the next layer.

To optimize the neural network so that it correctly predicts all the target Q-values based on a state, we compute the Q-values for each action and then, via the policy, we find the action to take.
After executing the action, through the Bellman equation we find the target Q-value and substitute it, in the vector of Q-values already computed, in place of the value of the chosen action. Finally, we optimize the neural network. The pseudocode is:
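The original pseudocode is not reproduced here; a sketch of this step, assuming a hypothetical network object with predict and train methods, could be:

```python
def dqn_step(env, network, state, policy, gamma=0.99):
    # One optimization step of deep Q-learning (network and policy are hypothetical objects).
    q_values = network.predict(state)                 # Q-values for every action
    action = policy(q_values)                         # e.g. epsilon-greedy
    next_state, reward, done, info = env.step(action)

    # Target Q-value from the Bellman equation for optimality.
    target = reward if done else reward + gamma * network.predict(next_state).max()

    # Replace only the entry of the chosen action, then optimize towards the new vector.
    target_q_values = q_values.copy()
    target_q_values[action] = target
    network.train(state, target_q_values)
    return next_state, done
```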

For the real optimization of the neural network, we must identify a function to minimize. This is called the cost function (or loss function) and is usually referred to with the letter J or C; its parameters are an input vector s and a vector of target outputs q*. Its purpose is to say how far apart the prediction for the input is from the target.
There are many types; the most common, which I used, is the MSE (mean squared error), defined here as

J(s, q*) = (1/2) Σ_i (A_{L,i} − q*_i)^2

Our goal is to minimize this function by varying W, thus:

min_W J(s, q*)

Note that to get the true mean squared error we should divide not by 2 but by the length of A_L; however, since our goal is to minimize the function, the result will be the same, and dividing by 2 makes it easier to calculate the partial derivative that we will see below.
Now, to solve this optimization problem, we will use an algorithm called gradient descent.

The weights are initialized with random values (several distributions can be followed) and then, according to the partial derivatives

∂J/∂w

of the cost function with respect to each weight (which correspond to the error of that weight), the algorithm varies the weights to minimize the cost function.
As an example: if the derivative is negative, it means that a greater value of the weight will diminish the cost function.
The optimization therefore happens in the following way:

w ← w − α · ∂J/∂w

where α is the learning rate.
If α is too low the optimization is slow, and if it is too high the weights will diverge.
So, to find the right learning rate, it is better to start from a low one and gradually increase it.

[Figure: gradient descent on the cost function]

How do we compute the partial derivative for each weight?
One efficient way is to use the backpropagation algorithm, which propagates the error through the neural network, from the output layer to the input layer, taking advantage of the chain rule, which allows us to compute the derivative of composite functions. We will not go into detail on this algorithm.

In the next sections, we will see 3 optimizations that I implemented and used in the attached project, and then there will be an example in pseudocode.

4.3 Gradient descent with momentum

One of the risks of gradient descent is that the optimization gets stuck in a local minimum.
Although there are many other, more efficient algorithms, for simplicity I used “gradient descent with momentum”, which manages to avoid some local minima and speeds up the optimization process.

It is based on updating each weight not according to its current derivative, but according to a weighted average of its derivatives calculated over time, giving more weight to the most recent ones (how much more weight is regulated by a parameter called the momentum).
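A minimal sketch of this update, written as an exponential moving average of the gradients (one common formulation; the attached project may differ in its details, and the default values simply mirror the hyperparameters of section 4.7):

```python
def momentum_step(w, grad, velocity, learning_rate=0.0001, momentum=0.4):
    # w, grad, velocity: numpy arrays of the same shape.
    # Weighted average of past gradients, giving more weight to the recent ones.
    velocity = momentum * velocity + (1.0 - momentum) * grad
    # Then a normal gradient descent step using the averaged gradient.
    w = w - learning_rate * velocity
    return w, velocity
```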

4.4 Replay memory

Updating the weights of the neural network at every step can change the predictions of the network too quickly, and therefore worsen them, because each update, made for one state, will also influence the predicted Q-values of all the other states. A solution is to memorize the information needed to optimize the network (state, action, reward, done, and next state) in a memory called the replay memory (or experience replay), and every few steps update the network using some transitions randomly extracted from the replay memory (their number is called the batch size).
In general, updating the neural network with gradient descent with momentum on mini-batches (a mini-batch is a set of transitions taken from the replay memory) is called “mini-batch gradient descent with momentum”.
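A replay memory can be as simple as a bounded double-ended queue; here is a sketch (the sizes mirror the hyperparameters of section 4.7):

```python
import random
from collections import deque

replay_memory = deque(maxlen=10000)   # replay memory max size

def remember(state, action, reward, done, next_state):
    replay_memory.append((state, action, reward, done, next_state))

def sample_minibatch(batch_size=30):
    # Random sampling breaks the correlation between consecutive updates.
    return random.sample(replay_memory, batch_size)
```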

One limitation of experience replay is that it gives the same importance to transitions that are perhaps secondary and to transitions that are perhaps essential for the agent to learn; a possible optimization is called “prioritized experience replay”, which we will not delve into.

4.5 Target network

To make the optimization more stable, it is necessary to use a target network, a neural network structurally identical to the main one. The difference is that, in the update of a mini-batch, the prediction of the subsequent Q-values (from which the target Q-values are derived) is computed by the target network, while the weights are optimized only in the main neural network; every few steps, the weights of the target network are synchronized with those of the main neural network.
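With the hypothetical network objects used in the earlier sketches, the synchronization is just a periodic copy of the weights:

```python
def maybe_sync(step, main_network, target_network, sync_every=10):
    # Every few steps, copy the main network's weights into the target network.
    if step % sync_every == 0:
        target_network.weights = [W.copy() for W in main_network.weights]
```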

4.6 Example of an agent in pseudocode

Now let's see an example, in pseudocode, of a function of a DQN (deep Q-network) agent performing an episode.
The bandit algorithm used is decay epsilon-greedy, and for the optimization the mini-batch gradient descent with momentum algorithm is used (the real backpropagation is done in the “train” function of the neural network).
All the complete code of the neural network is in the attached project.
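The original pseudocode embed is not reproduced here; a sketch of such an episode function, built by combining the previous sketches (all names are hypothetical, and `import random` is assumed), could be:

```python
def start_episode(env, main_network, target_network, epsilon, step,
                  gamma=0.99, batch_size=30, epsilon_decay=0.99, min_epsilon=0.01):
    state = env.reset()
    done = False
    while not done:
        # Real policy: epsilon-greedy on the main network's predictions.
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(main_network.predict(state).argmax())
        next_state, reward, done, info = env.step(action)
        remember(state, action, reward, done, next_state)

        # Mini-batch update: the next-state Q-values come from the target network.
        if len(replay_memory) >= batch_size:
            for s, a, r, d, s_next in sample_minibatch(batch_size):
                target_q = main_network.predict(s)
                target_q[a] = r if d else r + gamma * target_network.predict(s_next).max()
                main_network.train(s, target_q)

        maybe_sync(step, main_network, target_network)
        epsilon = max(min_epsilon, epsilon * epsilon_decay)
        state = next_state
        step += 1
    return epsilon, step
```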

4.7 Application with cartpole

In the attached application, I made a small deep Q-learning project with the cartpole.
I used the following parameters:

  • Replay memory max size = 10000
  • Batch size = 30
  • Discount factor = 0.99
  • Learning rate = 0.0001
  • Number of steps to sync target network = 10
  • Epsilon decay = 0.99
  • Min epsilon = 0.01
  • Momentum = 0.4
  • For the neural network, I used 4 total layers, with 4, 50, 100, and 2 neurons respectively.
  • The activation function of the first layer and the last one is linear, while for the hidden layers I used the leaky ReLU.
  • For the initialization of the weights, I used a uniform distribution with a minimum of -0.5 and a maximum of 0.5

Out of 1000 total training episodes, I noticed that the agent got the most rewards around episode 320, so I saved the weights of that episode to a file and used them to evaluate the results.

You can notice that after about episode 320, the total reward starts to go down; this phenomenon is called “catastrophic interference”, and it is typical of neural networks used in reinforcement learning; it is caused by the fact that new optimizations can worsen the ones previously made.
So, after the optimizations from later steps of the various episodes have been applied, the neural network may no longer be able to predict correctly during the first steps of the episodes.
This problem can be circumvented in offline learning (where training is separated from evaluation, very similar to off-policy learning), which is what I did, but it is serious in online learning (no difference between training and evaluation), which is the basis of a general artificial intelligence.

For the evaluation of the results, I obviously used the optimal policy, and below I calculated the distribution of the total rewards over a large number of episodes, where you can see not only a kind of approximate Gaussian but also a tendency to finish the episode between 475 and 500 total reward. This also depends strongly on the environment.
Note, however, that the total reward never exceeds 500, because that is the maximum achievable in this environment, after which the episode ends; the dashed green line is the average (388.6).

4.8 Video result

5 Deep distributed Q-learning

Stanford's “CME 323: Distributed Algorithms and Optimization” course in 2015 proposed a distributed deep Q-learning algorithm (see source 10), based on the Downpour SGD algorithm, which is an asynchronous variant of gradient descent. This algorithm defines a server and workers, represented in the figure below:

[Figure: a server and several workers in distributed deep Q-learning, where p are the weights (source 10)]

5.1 Workers

Each worker works asynchronously and independently from the others. Its task is to request the most updated weights from the server, maintain a local target network that generates data to insert into the replay memory (which is also kept locally), compute the gradients on mini-batches, and send them to the server at the end of each mini-batch.

5.2 Server

The server maintains the neural network weights locally; it is responsible for providing them to the workers when they request them, and for updating the weights with the gradients it receives.
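Roughly, the two roles can be summarized with the following sketch (a simplification of the scheme described above; all names are hypothetical and no real networking code is shown):

```python
# Worker: pull fresh weights, generate experience, push gradients.
def worker_loop(server, local_network, local_target_network, env):
    while True:
        local_network.weights = server.get_weights()
        minibatch = collect_transitions(env, local_target_network)  # fills the local replay memory
        gradients = local_network.compute_gradients(minibatch)
        server.push_gradients(gradients)

# Server: hand out the current weights and apply incoming gradients asynchronously.
class Server:
    def __init__(self, network):
        self.network = network

    def get_weights(self):
        return self.network.weights

    def push_gradients(self, gradients):
        self.network.apply_gradients(gradients)   # e.g. gradient descent with momentum
```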

5.3 Scalability

The scalability is first of all horizontal: adding workers not only speeds up learning but also increases stability.

One of the risks is that the server may be too slow to respond to the workers.
A solution can be to increase the batch size, making the communication of gradients less frequent, or to scale the server vertically, i.e. upgrade the hardware or increase the bandwidth.

Another risk is that, given the high frequency of weight updates in the server, workers may use an outdated model that slows down learning.
A solution may be to use optimizers called “adaptive learning rate methods”, such as RMSprop or AdaGrad.

Attachment

Conclusion

I originally wrote this article in Italian for a school project, and then translated it into English (with some help) with some slight corrections.

I'm not an expert, so take all of this with a grain of salt; these are just my considerations, based on my very limited experience. For this reason, considerations, improvements, or criticisms are welcome in the comments.

Sources

  1. http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture14.pdf
  2. https://spinningup.openai.com/en/latest/spinningup/rl_intro.html
  3. https://en.wikipedia.org/wiki/Reinforcement_learning
  4. https://medium.com/@qempsil0914/zero-to-one-deep-q-learning-part1-basic-introduction-and-implementation-bb7602b55a2c
  5. https://towardsdatascience.com/reinforcement-learning-concept-on-cart-pole-with-dqn-799105ca670
  6. https://towardsdatascience.com/deep-q-network-dqn-ii-b6bf911b6b2c
  7. https://towardsdatascience.com/a-beginners-guide-to-q-learning-c3e2a30a653c
  8. https://www.allaboutcircuits.com/technical-articles/how-many-hidden-layers-and-hidden-nodes-does-a-neural-network-need
  9. https://medium.com/@shrutijadon/survey-on-activation-functions-for-deep-learning-9689331ba092
  10. https://stanford.edu/~rezab/classes/cme323/S15/projects/deep_Qlearning_report.pdf
  11. https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
  12. https://intellabs.github.io/coach/components/agents/index.html
  13. Mohit Sewak. Deep Reinforcement Learning: Frontiers of Artificial Intelligence. 2019.
