DEV Community: Rijul Rajesh

Understanding Reinforcement Learning with Human Feedback Part 5: Training the Reward Model with Loss Functions

Rijul Rajesh — Mon, 25 May 2026 19:15:00 +0000

In the previous article, we created a reward model. In this article, we will continue exploring how this model is trained.

One important thing to note is that we do not need to define the ideal reward values in advance.

Instead, the model learns to determine appropriate rewards on its own.

The Loss Function

To train the reward model, OpenAI used the following loss function in their 2022 paper, shown here for a single pair of responses:

Loss = -log(sigmoid(Reward_better - Reward_worse))

This loss function helps the model learn good reward values without us explicitly defining what the rewards should be.

Where:

Reward_better is the reward calculated for the preferred response
Reward_worse is the reward calculated for the less preferred response

Ideally, we want:
Reward_better - Reward_worse
to be a large positive number.

Step 1: The Sigmoid Function

The difference between the rewards is first passed through a sigmoid function.

For any input value, the sigmoid function outputs a value between 0 and 1.

In an ideal case, we want the sigmoid output to be close to 1. This happens when the preferred response receives a much higher reward than the worse response.

Step 2: The Log Function

The output of the sigmoid function is then passed through a log function.

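In the ideal case, the sigmoid output is close to 1, so the log is close to 0. That is the highest value the log can produce here, because the log of any number between 0 and 1 is negative.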

Finally, we multiply the result by -1.

This turns the equation into a loss function that optimization algorithms can minimize during training.

The interesting part is that we never explicitly tell the model:

the preferred response must have a positive reward
the worse response must have a negative reward

Instead, the loss function naturally guides the model toward assigning rewards in a way that makes preferred responses score higher than worse ones.
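To make this concrete, here is a minimal Python sketch of the loss described above. The function names are our own, and the reward values are made-up numbers chosen only to show how the loss behaves.

```python
import math

def sigmoid(x: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_loss(reward_better: float, reward_worse: float) -> float:
    """Loss = -log(sigmoid(reward_better - reward_worse))."""
    return -math.log(sigmoid(reward_better - reward_worse))

# Preferred response scores much higher: the loss is close to 0.
print(pairwise_loss(5.0, -5.0))

# Both responses score the same: the loss is log(2).
print(pairwise_loss(0.0, 0.0))

# Preferred response scores much lower: the loss is large.
print(pairwise_loss(-5.0, 5.0))

# Only the difference matters: shifting both rewards by the
# same amount leaves the loss unchanged.
print(pairwise_loss(2.0, 1.0), pairwise_loss(102.0, 101.0))
```

The last line shows why we never have to say which reward should be positive or negative. The loss only sees the gap between the two rewards.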

What Happens Next?

Once the reward model is fully trained, we can use it to train the original model that only went through supervised fine-tuning.

In the next article, we will explore how the reward model is used to further train the original model.

Looking for an easier way to install tools, libraries, or entire repositories?
Try Installerpedia: a community-driven, structured installation platform that lets you install almost anything with minimal hassle and clear, reliable guidance.

Just run:

ipm install repo-name

… and you’re done! 🚀

🔗 Explore Installerpedia here

Understanding Reinforcement Learning with Human Feedback Part 4: Teaching Models Human Preferences

Rijul Rajesh — Sat, 23 May 2026 19:25:30 +0000

In the previous article, we explored the part where we collect human preferences. In this article, we will see how to use this data to train the models.

To train a model that gives higher scores to preferred responses, we first make a copy of the model that has already gone through supervised fine-tuning.

Modifying the Model

Next, we modify this copied model.

We remove the unembedding layer, which normally predicts the next token, and replace it with a single output value.

The result is a new model called a reward model.

Instead of generating text, this model learns to assign a reward score to a response.
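As a rough illustration of this change, the sketch below compares the two output layers on a toy example. The hidden size, vocabulary size, and all weight values are made up for illustration and do not come from any real model.

```python
# A toy final hidden state for a response (hidden size 3, values made up).
hidden = [0.2, -1.0, 0.5]

# Unembedding layer: one row of weights per vocabulary token (toy vocab of 4).
unembedding = [
    [0.1, 0.0, 0.3],
    [-0.2, 0.4, 0.1],
    [0.5, 0.5, -0.5],
    [0.0, -0.3, 0.2],
]

# Reward head: a single row of weights, so the output is one number.
reward_head = [0.7, -0.1, 0.4]

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

# Original model: one score per vocabulary token.
logits = [dot(row, hidden) for row in unembedding]

# Reward model: one score for the whole response.
reward = dot(reward_head, hidden)

print(len(logits), reward)
```

Everything below the output layer stays the same. Only the final projection changes from "one value per token" to "one value in total".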

Training the Reward Model

We can now train this reward model using the human preference data we collected earlier.

For a preferred response, we train the model to produce a higher reward value.

For a less preferred response, we train the model to produce a lower reward value or a negative reward.

For example:

If humans preferred Response A over Response B, the reward model learns to give a higher score to Response A
And a lower score to Response B

Over time, the reward model learns what kinds of responses humans tend to prefer.
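Here is a tiny sketch of that idea. To keep it short, it assumes the two rewards are plain numbers we can adjust directly instead of outputs of a neural network, and it uses the pairwise loss -log(sigmoid(reward_a - reward_b)) covered in Part 5 of this series. The learning rate and step count are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Stand-ins for the reward model's outputs (assumption: in a real
# reward model these come from a network, not free parameters).
reward_a = 0.0  # Response A, preferred by humans
reward_b = 0.0  # Response B, less preferred
learning_rate = 0.5

for step in range(20):
    gap = reward_a - reward_b
    # Gradients of -log(sigmoid(gap)) with respect to each reward.
    grad_a = sigmoid(gap) - 1.0
    grad_b = 1.0 - sigmoid(gap)
    # Gradient descent: move each reward against its gradient.
    reward_a -= learning_rate * grad_a
    reward_b -= learning_rate * grad_b

print(reward_a, reward_b)
```

Both rewards start equal, and the updates push Response A up and Response B down by the same amount.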

We will continue further in the next article.

Understanding Reinforcement Learning with Human Feedback Part 3: Collecting Human Preferences

Rijul Rajesh — Wed, 20 May 2026 19:05:25 +0000

In the previous article, we explored the concept of aligning the pretrained model. Now, we will look at the next component: human preference collection.

The first step in understanding RLHF is to understand that, given a specific prompt, a model can generate different responses.

One way to generate a response is to configure the model to always select the token with the highest output value at every step.

In this case, the model will generate the same response every single time for a given prompt.

Generating Different Responses

Instead of always selecting the highest-value token, we can also use the outputs of the softmax function as probabilities for selecting tokens.

In this approach:

the token with the highest probability is more likely to be selected
but other tokens still have a chance of being selected

As a result, the model can generate different responses for the same prompt.
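The two selection strategies can be sketched in a few lines of Python. The candidate tokens and their output values below are invented for illustration, and the random seed is fixed only so the run is repeatable.

```python
import math
import random

def softmax(values):
    """Convert raw output values into probabilities that sum to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token candidates and their raw output values.
tokens = ["mat", "sofa", "roof"]
output_values = [2.0, 1.5, 0.1]
probs = softmax(output_values)

def pick_greedy(tokens, probs):
    """Always select the token with the highest probability."""
    return tokens[probs.index(max(probs))]

def pick_sampled(tokens, probs, rng):
    """Select a token at random, weighted by its probability."""
    return rng.choices(tokens, weights=probs, k=1)[0]

rng = random.Random(0)
greedy_picks = {pick_greedy(tokens, probs) for _ in range(100)}
sampled_picks = {pick_sampled(tokens, probs, rng) for _ in range(100)}

print(greedy_picks)   # a single token, every time
print(sampled_picks)  # more than one distinct token
```

Greedy selection returns the same token on every call, while sampling occasionally picks the less likely ones.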

Collecting Human Preferences

Since a model can generate multiple responses, we can create pairs of responses for the same prompt.

We can then ask people which response they prefer.

For example, given two possible answers, a person can simply choose the better one.

Collecting preferences like this is much faster than asking people to manually write responses for every prompt.

This preference collection process is the “Human Feedback” part of RLHF.
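One simple way to store a collected preference is as a record holding the prompt and the two responses, ordered by which one the person chose. The class and field names below are our own choice, not a standard format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    preferred: str       # the response the person chose
    less_preferred: str  # the response the person did not choose

def record_choice(prompt, response_a, response_b, human_picks_a):
    """Store the pair so the chosen response always comes first."""
    if human_picks_a:
        return PreferencePair(prompt, response_a, response_b)
    return PreferencePair(prompt, response_b, response_a)

pair = record_choice(
    "Suggest a snack",
    "Fries",
    "I cannot help with that",
    human_picks_a=True,
)
print(pair.preferred)
```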

Using Preference Data

Once we collect preference data, we can use it to train the model so that it assigns higher scores to preferred responses and lower scores to less preferred ones.

Over time, this helps the model generate responses that better match human preferences.

In the next article, we will explore how to train the model using this preference data.

Understanding Reinforcement Learning with Human Feedback Part 2: Aligning Pretrained Models

Rijul Rajesh — Tue, 19 May 2026 17:29:56 +0000

In the previous article, we explored the concept of pre-training and its limitations without a further step in the training process.

In this article, we will explore how we can align a pretrained model to help overcome these limitations.

The Two Steps of Alignment

Aligning a pretrained model usually involves two stages:

Supervised Fine-Tuning (SFT)
Reinforcement Learning with Human Feedback (RLHF)

Step 1: Supervised Fine-Tuning

Supervised fine-tuning uses a dataset made up of human-written prompts and human-written responses.

For example, someone might create a prompt like:

“Suggest a coding assistant tool”

And then provide a response such as:

“Try out Cursor”

Using many examples like this, we can train the model with standard backpropagation so that it learns to generate helpful responses.

What Supervised Fine-Tuning Achieves

After supervised fine-tuning, the pretrained model becomes more aligned with human communication.

Instead of only predicting the next token like it did during pre-training, the model now starts to generate:

helpful responses
polite responses
responses to natural language prompts

In other words, supervised fine-tuning transforms a pretrained but unaligned model into one that has started learning how to respond like an assistant.

The Limitation of Supervised Fine-Tuning

Since supervised fine-tuning requires human effort and time, the dataset is usually much smaller than the massive dataset used during pre-training.

Because of this, supervised fine-tuning can sometimes cause the model to overfit.

This means the model may respond well to prompts that are similar to examples it was trained on, but struggle with new prompts that were not part of the fine-tuning dataset.

For example, it may respond appropriately to a prompt it has seen during training, but fail to generalize to unfamiliar prompts.

Why RLHF Is Needed

One possible solution would be to create a much larger supervised fine-tuning dataset.

However, collecting and writing a huge dataset by hand would be extremely expensive and time-consuming.

Instead, we can use Reinforcement Learning with Human Feedback (RLHF) to help train the model to generate better responses, even for prompts it was not directly trained on.

We will explore this further in the next article.

Understanding Reinforcement Learning with Human Feedback Part 1: Pre-Training Large Language Models

Rijul Rajesh — Mon, 18 May 2026 19:48:14 +0000

In this article, we will explore Reinforcement Learning with Human Feedback (RLHF).

RLHF is one of the techniques used to help train large language models like ChatGPT.

Starting with an Untrained Model

Suppose we want to build a model like ChatGPT from scratch so that we can ask it questions.

To do this, we first need to understand how to train an untrained decoder-only transformer model.

By untrained, we mean that all the weights and biases in the model are initialized with random values.

At this stage, the model does not understand language or meaning.

The First Step: Pre-Training

The first step in training a large language model is to teach it to predict the next token using a very large body of text, such as Wikipedia articles.

We take segments of text and use the earlier words as input tokens. The model then learns to predict the next token in the sequence.

For example, if the input is:

“The cat sat on the…”

The model learns to predict the next likely word.
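As a small sketch, this is how one piece of text can be turned into several "earlier words in, next word out" training examples. It splits on spaces for simplicity, which is an assumption. Real models use subword tokenizers.

```python
def next_token_pairs(text: str):
    """Build (earlier tokens, next token) pairs from one piece of text."""
    tokens = text.split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_token_pairs("The cat sat on the")

for context, target in pairs:
    print(context, "->", target)
```

Each pair asks the model to predict one token from everything that came before it, so a single sentence yields many training examples.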

By repeating this process across a massive amount of text, the model gradually learns:

grammar
sentence structure
facts and patterns in language

This training stage is called pre-training.

Over time, this process produces a pretrained model.

Why Pre-Training Is Not Enough

At this point, the model becomes good at predicting the next token in text.

However, simply predicting the next token is not enough to solve the problem of answering questions like a chatbot.

For example, being good at continuing Wikipedia text does not automatically mean the model will give helpful, safe, or conversational responses.

To make the model useful for chat, we need to align it with human expectations.

We will explore this in the next article.

Understanding Reinforcement Learning with Neural Networks Part 6: Completing the Reinforcement Learning Process

Rijul Rajesh — Sat, 16 May 2026 20:28:40 +0000

In the previous article, we covered the basics of training and how rewards, derivatives, and step size were used to achieve it.

In this article, we will finish the training process for the model.

To fully train the model, we need to use different input values between 0 and 1 as inputs to the neural network.

This allows the model to learn how to behave under different hunger levels.

Training Over Time

After many updates using a wide range of inputs between 0 and 1, the value of the bias starts to hover around -10.

The fact that the bias stabilizes around a certain value suggests that the model has finished training.

What Happens When We Are Not Hungry?

Now, let us test the model.

When we are not hungry and the input is 0.0, the probability of going to Place B becomes 0.

This means that when hunger is low, the model will always choose Place A.

What Happens When We Are Hungry?

In contrast, when we are very hungry and the input value is 1.0, the probability of going to Place B becomes 1.

This means that when hunger is high, the model will always choose Place B.
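We can sanity-check these two results with a small sketch. It assumes the network behaves like a single sigmoid unit, p(B) = sigmoid(weight x hunger + bias). The bias of -10 comes from the training result above, but the weight of 20 is a hypothetical value picked so the example works. It is not taken from the trained model.

```python
import math

BIAS = -10.0    # value the bias settles near after training
WEIGHT = 20.0   # hypothetical weight, assumed for this sketch

def p_b(hunger: float) -> float:
    """Probability of going to Place B for a given hunger level."""
    return 1.0 / (1.0 + math.exp(-(WEIGHT * hunger + BIAS)))

print(p_b(0.0))  # very close to 0
print(p_b(1.0))  # very close to 1
```

Under these assumptions the probabilities are not exactly 0 and 1, but they are close enough that the model effectively always picks the same place.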

Summary of Reinforcement Learning

In summary, reinforcement learning allows us to optimize a neural network even when we do not know the correct outputs in advance.

The process works like this:

The neural network decides what action to take
We assume that the chosen action was the correct decision
- For example, if we go to Place B, we temporarily assume that Place B was the best choice, even if we later discover that Place A would have been better
We calculate the derivative with respect to the parameter we want to optimize
We determine the reward associated with the decision
We multiply the derivative by the reward to correct mistakes
This gives us the updated derivative
We use the updated derivative in gradient descent to optimize the neural network
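The steps above can be sketched as a single update function. One assumption here: the derivative with respect to the bias is taken as the actual p(B) minus the ideal p(B), which matches the 0.5 and -0.6 values from the worked example in Part 5. The function and parameter names are our own.

```python
def policy_gradient_step(bias, p_b, chose_b, reward, learning_rate=1.0):
    """Return the new bias after one policy gradient update."""
    # Assume the chosen action was the correct decision.
    ideal_p_b = 1.0 if chose_b else 0.0
    # Derivative with respect to the bias (assumed: actual minus ideal).
    derivative = p_b - ideal_p_b
    # Multiply by the reward to correct the guess if it was wrong.
    updated_derivative = derivative * reward
    # Gradient descent: step against the updated derivative.
    return bias - learning_rate * updated_derivative

# First update from Part 5: chose Place A with p(B) = 0.5, reward = 1.
print(policy_gradient_step(0.0, 0.5, chose_b=False, reward=1.0))

# Second update from Part 5: chose Place B with p(B) = 0.4, reward = -1.
print(policy_gradient_step(-0.5, 0.4, chose_b=True, reward=-1.0))
```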

That wraps up this article.

In the next article, we will explore Reinforcement Learning with Human Feedback (RLHF).

Understanding Reinforcement Learning with Neural Networks Part 5: Connecting Reward, Derivative, and Step Size

Rijul Rajesh — Fri, 15 May 2026 20:33:05 +0000

In the previous article, we explored the reward system in reinforcement learning.

In this article, we will begin calculating the step size.

First Update

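In this example, the learning rate is 1.0.

The step size is the learning rate multiplied by the updated derivative, which is 0.5 here:

Step size = 1.0 x 0.5 = 0.5

Next, we update the bias by subtracting the step size from the old bias value 0.0:

New bias = 0.0 - 0.5 = -0.5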

After the Update

Now that the bias has been updated, we run the model again.

The new probability of going to Place B becomes 0.4.

This means the probability of going to Place A is:

p(A) = 1 - p(B) = 1 - 0.4 = 0.6

Choosing Again

We now pick a random number between 0 and 1, and get 0.9.

Since 0.9 falls in the region representing Place B, we choose Place B.

Computing the Gradient Again

To update the bias, we again compute the derivative.

First, we assume that choosing Place B was the correct action.

So ideally:

p(B) = 1.0

Now we compute the difference between the ideal value 1.0 and the actual value 0.4.

Using this, we calculate the derivative with respect to the bias, which gives:

Derivative = -0.6

Checking the Reward

Now we check whether this was actually a good decision.

Place B gives a large portion of fries, but our hunger input is 0.0, meaning we are not very hungry.

So this was not a good choice.

Therefore, the reward is:

Reward = -1

Updating with Reward

We multiply the derivative by the reward:

-0.6 x -1 = 0.6

So the updated derivative becomes 0.6.

Second Step Update

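Now we calculate the step size again, using the learning rate of 1.0 and the updated derivative of 0.6:

Step size = 1.0 x 0.6 = 0.6

Subtracting this from the current bias of -0.5 gives the new bias:

New bias = -0.5 - 0.6 = -1.1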

Final Result

We plug the new bias back into the neural network.

Now the probability of going to Place B has decreased.

This means that when hunger is low, the model is more likely to choose Place A, which is the correct behavior.

This shows that the reinforcement learning algorithm, specifically policy gradients, is working as expected.
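To check the two updates numerically, here is a short sketch. It assumes that with a hunger input of 0.0 the network reduces to p(B) = sigmoid(bias), which fits the 0.5 starting probability. The updated derivatives of 0.5 and 0.6 are the rounded values used in this walkthrough.

```python
import math

def p_b_at_zero_hunger(bias: float) -> float:
    """p(B) when hunger is 0.0, assuming p(B) = sigmoid(bias)."""
    return 1.0 / (1.0 + math.exp(-bias))

learning_rate = 1.0
biases = [0.0]

# First update: updated derivative of 0.5.
biases.append(biases[-1] - learning_rate * 0.5)

# Second update: updated derivative of 0.6.
biases.append(biases[-1] - learning_rate * 0.6)

probabilities = [p_b_at_zero_hunger(b) for b in biases]

for bias, prob in zip(biases, probabilities):
    print(f"bias = {bias:.1f}, p(B) = {prob:.2f}")
```

Under this assumption the probability after the first update comes out slightly below 0.4, so the 0.4 in the walkthrough is a rounded figure. The trend is the same either way: p(B) drops with each update.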

In the next article, we will explore how to further train the model using different input values.

Understanding Reinforcement Learning with Neural Networks Part 4: Positive and Negative Rewards

Rijul Rajesh — Wed, 13 May 2026 20:46:18 +0000

In the previous article, we began the process of guessing the ideal output.

Let us continue with the same example.

Suppose we receive a small number of fries.

Since our hunger level is 0, this is actually a good outcome.

In this case, we should assign a reward of 1.

Now consider the opposite situation.

Suppose we receive a large order of fries.

Since we are not hungry enough to eat all the fries, this means we made a poor decision.

In that case, we assign a reward of -1.

In general:

Any positive reward indicates a good decision
Any negative reward indicates a bad decision

Updating the Derivative with the Reward

We now use this reward to update the derivative.

To do this, we simply multiply the derivative by the reward.

Case 1: Correct Decision

If the reward is 1, then:

Updated derivative = Derivative x 1 = Derivative

The derivative remains unchanged.

This means the derivative is already pointing in the correct direction.

Case 2: Incorrect Decision

If the reward is -1, then:

Updated derivative = Derivative x -1 = -Derivative

Now the derivative changes sign.

This causes the optimization process to move the bias in the opposite direction.

In other words, the negative reward flips the direction of the update so the neural network can learn from the bad decision.
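A short sketch shows both cases side by side. The derivative of 0.5 and the learning rate of 1.0 are example values, and the function name is our own.

```python
def bias_change(derivative: float, reward: float, learning_rate: float = 1.0) -> float:
    """How much the bias moves after the derivative is scaled by the reward."""
    updated_derivative = derivative * reward
    return -learning_rate * updated_derivative

# Case 1: reward of 1 keeps the original direction.
print(bias_change(0.5, 1.0))

# Case 2: reward of -1 flips the direction.
print(bias_change(0.5, -1.0))
```

The size of the move is the same in both cases. Only its sign changes.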

In the next article, we will explore how to calculate the step size for updating the parameters.

Understanding Reinforcement Learning with Neural Networks Part 3: Guessing the Ideal Output

Rijul Rajesh — Mon, 11 May 2026 18:48:10 +0000

In the previous article, we explored the limitations of backpropagation and why it is not ideal when the correct output values are unknown.

In this article, we will begin exploring the core ideas behind reinforcement learning.

Starting Example

Let us begin by assuming that we are not hungry.

We will feed the value 0.0 into the neural network.

The neural network outputs a probability of 0.5 for going to Place B.

So:

Probability of going to Place B = p(B) = 0.5
Probability of going to Place A = 1 - p(B) = 0.5

Visualizing the Probabilities

We can represent these probabilities using a line.

First, we draw a line segment with length 0.5 to represent the probability of going to Place A.

Then, we append another line segment to represent the probability of going to Place B.

Together, these form a line ranging from 0 to 1.

Choosing an Action

To decide which place to go for a snack, we randomly pick a number between 0 and 1.

Let us pick 0.2.

Since 0.2 falls inside the region representing Place A, we choose to go to Place A.
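This selection rule is easy to express in code. In the sketch below the random number is passed in as an argument so the result is repeatable. In practice it would come from something like Python's random.random(). Treating a number exactly on the boundary as Place B is our own choice.

```python
def choose_place(p_b: float, random_number: float) -> str:
    """Pick a place by checking where the random number falls on the 0-to-1 line."""
    p_a = 1.0 - p_b
    # The first segment, from 0 up to p(A), represents Place A.
    # The remaining segment, up to 1, represents Place B.
    return "A" if random_number < p_a else "B"

print(choose_place(p_b=0.5, random_number=0.2))
```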

Making a Guess About the Correct Action

Now, let us assume that going to Place A when hunger = 0 was the correct decision.

Ideally:

The probability of going to Place A, p(A), should be 1
The probability of going to Place B, p(B), should be 0

These ideal values are based on our guess about what the correct action should have been.

Moving Toward Optimization

Using these guessed ideal values, we can calculate the difference between:

the ideal probability for p(A)
the actual probability produced by the neural network

This allows us to calculate the derivative of the difference with respect to the bias we want to optimize.

In the next article, we will continue exploring how this optimization process works in reinforcement learning.

Understanding Reinforcement Learning with Neural Networks Part 2: Why Backpropagation Is Not Enough

Rijul Rajesh — Sun, 10 May 2026 19:53:16 +0000

In the previous article, we explored an example where reinforcement learning is required and standard methods do not work.

In this article, we will understand why policy gradients are needed, and why the standard backpropagation method does not work in certain situations.

How Backpropagation Normally Works

Assume we have the following training data, where the desired outputs are already known:

Input (Hunger)    Output p(B)
0.0               0
1.0               1
0.1               0
0.9               1

With this data, we can feed the input values into the neural network one at a time.

The neural network produces an output, and we compare it with the ideal output value from the training data.

Using this difference, we can measure how wrong the network is.

Using Derivatives to Update the Bias

We can calculate these differences for different values of the bias and visualize how the error changes as the bias changes.

From this graph, we can calculate the derivative.

If the derivative is negative, we shift the bias to the right
If the derivative is positive, we shift the bias to the left

The derivative correctly tells us which direction to move because the training data already contains the ideal output values.

This is the basic idea behind backpropagation.
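Here is a small sketch of this process using the training data above. It assumes the network is a single sigmoid unit with a fixed, hypothetical weight of 20, so that only the bias is being optimized. It measures the error as a sum of squared differences and estimates the derivative numerically. Both are simplifications chosen for this example.

```python
import math

# Training data from the table above: (hunger, ideal p(B)).
training_data = [(0.0, 0.0), (1.0, 1.0), (0.1, 0.0), (0.9, 1.0)]

WEIGHT = 20.0  # hypothetical fixed weight; only the bias is optimized

def predict(hunger: float, bias: float) -> float:
    return 1.0 / (1.0 + math.exp(-(WEIGHT * hunger + bias)))

def total_error(bias: float) -> float:
    """Sum of squared differences between predictions and ideal outputs."""
    return sum((predict(h, bias) - ideal) ** 2 for h, ideal in training_data)

def derivative(bias: float, eps: float = 1e-6) -> float:
    """Numerical estimate of how the error changes as the bias changes."""
    return (total_error(bias + eps) - total_error(bias - eps)) / (2 * eps)

def shift_direction(bias: float) -> str:
    """Positive derivative: shift left. Negative derivative: shift right."""
    return "left" if derivative(bias) > 0 else "right"

print(shift_direction(0.0))    # bias too high for this data
print(shift_direction(-20.0))  # bias too low for this data
```

None of this works without the ideal outputs in the second column of the table, which is exactly what reinforcement learning is missing.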

The Problem in Reinforcement Learning

However, in reinforcement learning, we do not know the ideal output values in advance.

For example, we do not already know whether choosing Place A or Place B is the correct action.

Because of this:

we cannot calculate the difference between the neural network’s output and the ideal output
without these differences, we cannot calculate derivatives in the normal way

A Different Approach

Instead, we can guess what the ideal outputs should be and use those guesses to estimate the derivatives.

This idea forms the foundation of policy gradients in reinforcement learning.

In the next article, we will explore how reinforcement learning and policy gradients help us solve this problem.

Understanding Reinforcement Learning with Neural Networks Part 1: Learning Without Correct Answers

Rijul Rajesh — Fri, 08 May 2026 18:50:27 +0000

In this article, we will explore reinforcement learning with neural networks.

Let’s start with a simple example.

Choosing Between Two Snack Places

Suppose it is snack time, and you have to choose between Place A and Place B for fries.

To make a good decision, we also need to consider how hungry we are.

Some days we may be very hungry, while on other days we may only want a small snack.

We also need to consider how many fries each place might serve.

For example:

Place B might give a large quantity of fries, which would be great if we were very hungry
But if we were not that hungry, getting too many fries might not be ideal

Similarly:

Getting a small amount of fries would not be good if we were extremely hungry
But it could be perfectly fine if we only wanted a light snack

So, it would be useful to have a system that helps decide which place to choose based on:

our hunger level
the possible quantity of fries we might receive

Using a Neural Network

To solve this problem, we will use a neural network.

The neural network takes our hunger level as the input and outputs the probability of choosing Place B, written as p(B).

The Challenge

Normally, when training a neural network, we start with a training dataset that contains:

input values
correct output values

Using this data, we can train the network with standard backpropagation.

However, in this example, we do not know in advance whether Place A or Place B will serve a large or small quantity of fries.

Because of this, we do not know what the correct output values should be.

Reinforcement Learning

In situations where we do not have known output values, we can still train a model using reinforcement learning.

Instead of learning from correct answers, the model learns by trying actions and receiving feedback based on how good the outcome was.

In the next article, we will explore a reinforcement learning algorithm called policy gradients.

Understanding Encoder-Only Transformers: The Foundation of BERT and RAG Retrieval

Rijul Rajesh — Thu, 07 May 2026 18:50:05 +0000

Back in 2017, the first transformer architecture introduced two main components:

an encoder
a decoder

These two parts were connected so they could work together.

This original design is known as an encoder–decoder transformer.

Decoders Can Work on Their Own

Over time, researchers realized that the decoder alone was powerful enough for many tasks.

Using only a decoder, models could:

generate text
continue sentences
perform translation and other language tasks

As we discussed in the article on decoder-only transformers, these models form the foundation of systems like ChatGPT.

These are called decoder-only transformers.

Encoders Can Also Work Independently

In a similar way, encoder-based models are also very useful on their own.

This idea forms the foundation of models like BERT and many others.

These are called encoder-only transformers.

Building Blocks of Encoder-Only Transformers

Encoder-only transformers use the same core components we explored earlier:

Word embeddings convert words into numbers
Positional encoding keeps track of word order
Self-attention helps establish relationships between words

When these layers are combined, they create a new representation for each token that captures:

meaning
position
relationships with other words

These representations are called context-aware embeddings or contextualized embeddings.

Why Context-Aware Embeddings Are Useful

Context-aware embeddings can help group together:

similar sentences
similar paragraphs
similar documents

This capability is one of the foundations of Retrieval-Augmented Generation (RAG).

RAG works by:

Breaking documents into smaller chunks of text
Using an encoder-only transformer to generate embeddings for each chunk
Comparing embeddings to find the most relevant information
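The comparison step can be sketched with cosine similarity, one common way to compare embeddings. The three-number vectors below are hand-written stand-ins for what an encoder-only transformer would produce. Real embeddings have hundreds of dimensions and come from a trained model.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two vectors based on the angle between them."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings for three chunks of text.
chunk_embeddings = {
    "Place B serves a large portion of fries.": [0.9, 0.1, 0.0],
    "Self-attention relates words to each other.": [0.0, 0.9, 0.2],
    "BERT is an encoder-only transformer.": [0.1, 0.3, 0.8],
}

# Made-up embedding for the question "What is BERT?".
query_embedding = [0.1, 0.4, 0.8]

# Rank chunks from most to least similar to the query.
ranked = sorted(
    chunk_embeddings,
    key=lambda chunk: cosine_similarity(query_embedding, chunk_embeddings[chunk]),
    reverse=True,
)
best_chunk = ranked[0]
print(best_chunk)
```

The top-ranked chunks are the ones a RAG system would hand to the language model as extra context.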

Other Uses of Encoder-Only Transformers

Context-aware embeddings can also be used as inputs for machine learning models.

For example:

neural networks can use them for sentiment classification
logistic regression models can also use them for classification tasks

That wraps up encoder-only transformers.

In the next article, we will explore reinforcement learning in neural networks.
