Rijul Rajesh

Understanding Reinforcement Learning with Neural Networks Part 6: Completing the Reinforcement Learning Process

In the previous article, we covered the basics of training and saw how rewards, derivatives, and the step size are used to achieve it.

In this article, we will finish the training process for the model.

To fully train the model, we need to feed the neural network many different input values between 0 and 1.

This allows the model to learn how to behave under different hunger levels.


Training Over Time

After many updates using a wide range of inputs between 0 and 1, the value of the bias starts to hover around -10.

The fact that the bias stabilizes around a certain value suggests that the model has finished training.
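To make this concrete, here is a minimal sketch of such a training loop in Python. The single-neuron model (P(Place B) = sigmoid(weight × hunger + bias)), the reward rule (Place B pays off only when hunger is above 0.5), and the step size are illustrative assumptions, not values taken from the series:

```python
import math
import random

def sigmoid(x):
    # Numerically safe sigmoid for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

# Assumed single-neuron model: P(Place B) = sigmoid(weight * hunger + bias)
weight, bias = 0.0, 0.0
step_size = 0.1

for _ in range(200_000):
    hunger = random.random()               # input between 0 and 1
    p_b = sigmoid(weight * hunger + bias)  # probability of choosing Place B

    # The network decides what to do by sampling its own probability.
    chose_b = random.random() < p_b

    # Assumed reward rule: Place B is the right call only when fairly hungry.
    if chose_b:
        reward = 1.0 if hunger > 0.5 else -1.0
        d_bias = 1.0 - p_b                 # derivative of log P(B) w.r.t. bias
    else:
        reward = 1.0 if hunger <= 0.5 else -1.0
        d_bias = -p_b                      # derivative of log P(A) w.r.t. bias
    d_weight = d_bias * hunger

    # Multiply the derivative by the reward, then take a gradient step
    # that makes rewarded actions more likely.
    bias += step_size * reward * d_bias
    weight += step_size * reward * d_weight

print(f"weight = {weight:.2f}, bias = {bias:.2f}")
```

Under these assumptions the decision boundary sits at hunger = 0.5, which forces bias ≈ -weight / 2 (the probability crosses 0.5 where weight × hunger + bias = 0). A bias near -10 therefore corresponds to a weight near 20.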


What Happens When We Are Not Hungry?

Now, let us test the model.

When we are not hungry and the input is 0.0, the probability of going to Place B is essentially 0.

This means that when hunger is low, the model will practically always choose Place A.


What Happens When We Are Hungry?

In contrast, when we are very hungry and the input value is 1.0, the probability of going to Place B is essentially 1.

This means that when hunger is high, the model will practically always choose Place B.
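We can verify both behaviors directly. The bias of -10 comes from the training above; the weight of 20 is an assumed value, chosen to be consistent with a decision boundary at hunger = 0.5:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

bias = -10.0   # value the training settled on
weight = 20.0  # assumed, consistent with bias = -weight / 2

for hunger in (0.0, 1.0):
    p_b = sigmoid(weight * hunger + bias)
    print(f"hunger = {hunger}: P(Place B) = {p_b:.6f}")

# hunger = 0.0: P(Place B) = 0.000045  -> practically always Place A
# hunger = 1.0: P(Place B) = 0.999955  -> practically always Place B
```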


Summary of Reinforcement Learning

In summary, reinforcement learning allows us to optimize a neural network even when we do not know the correct outputs in advance.

The process works like this (a worked example follows the list):

  1. The neural network decides what action to take
  2. We assume that the chosen action was the correct decision
    • For example, if we go to Place B, we temporarily assume that Place B was the best choice, even if we later discover that Place A would have been better
  3. We calculate the derivative with respect to the parameter we want to optimize
  4. We determine the reward associated with the decision
  5. We multiply the derivative by the reward to correct mistakes
  6. This gives us the updated derivative
  7. We use the updated derivative in gradient descent to optimize the neural network
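To tie the seven steps together, here is one update worked through numerically, using assumed mid-training values (weight 2, bias 0, step size 0.1) and a hunger of 0.8:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

weight, bias, step_size = 2.0, 0.0, 0.1  # assumed mid-training values
hunger = 0.8

p_b = sigmoid(weight * hunger + bias)  # P(Place B) = sigmoid(1.6) ≈ 0.832

# Step 1: the network samples an action; suppose it picks Place B.
# Step 2: we temporarily assume Place B was the correct decision.
d_bias = 1.0 - p_b                     # step 3: d log P(B) / d bias ≈ 0.168
reward = 1.0                           # step 4: hunger is high, so B paid off
update = reward * d_bias               # steps 5-6: the "updated derivative"
bias += step_size * update             # step 7: gradient step, bias ≈ 0.017

print(f"bias after one update = {bias:.4f}")
```

If the reward had instead been negative (say we went to Place B while not hungry), the same update would push the bias in the opposite direction, making Place B less likely next time.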

That wraps up this article.

In the next article, we will explore Reinforcement Learning from Human Feedback (RLHF).


Looking for an easier way to install tools, libraries, or entire repositories?
Try Installerpedia: a community-driven, structured installation platform that lets you install almost anything with minimal hassle and clear, reliable guidance.

Just run:

```
ipm install repo-name
```

… and you’re done! 🚀


🔗 Explore Installerpedia here
