<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rijul Rajesh</title>
    <description>The latest articles on DEV Community by Rijul Rajesh (@rijultp).</description>
    <link>https://dev.to/rijultp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1207862%2Ff06197aa-d585-4225-94a6-86243238376f.png</url>
      <title>DEV Community: Rijul Rajesh</title>
      <link>https://dev.to/rijultp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rijultp"/>
    <language>en</language>
    <item>
      <title>Understanding Reinforcement Learning with Human Feedback Part 5: Training the Reward Model with Loss Functions</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Mon, 25 May 2026 19:15:00 +0000</pubDate>
      <link>https://dev.to/rijultp/understanding-reinforcement-learning-with-human-feedback-part-5-training-the-reward-model-with-3g37</link>
      <guid>https://dev.to/rijultp/understanding-reinforcement-learning-with-human-feedback-part-5-training-the-reward-model-with-3g37</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-reinforcement-learning-with-human-feedback-part-4-teaching-models-human-preferences-m7f"&gt;previous article&lt;/a&gt;, we created a &lt;strong&gt;reward model&lt;/strong&gt;. In this article, we will continue exploring how this model is trained.&lt;/p&gt;

&lt;p&gt;One important thing to note is that we do &lt;strong&gt;not&lt;/strong&gt; need to define the ideal reward values in advance.&lt;/p&gt;

&lt;p&gt;Instead, the model learns to determine appropriate rewards on its own.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Loss Function
&lt;/h2&gt;

&lt;p&gt;To train the reward model, OpenAI used the following loss function in their 2022 paper:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqmaqeqw6ftguvkwrg3ef.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqmaqeqw6ftguvkwrg3ef.png" alt=" " width="582" height="89"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This loss function helps the model learn good reward values &lt;strong&gt;without us explicitly defining what the rewards should be&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this formula:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reward Better&lt;/strong&gt; corresponds to the reward calculated for the preferred response&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reward Worse&lt;/strong&gt; corresponds to the reward calculated for the less preferred response&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ideally, we want:&lt;br&gt;
Reward_better - Reward_worse&lt;br&gt;
to be a &lt;strong&gt;large positive number&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 1: The Sigmoid Function
&lt;/h2&gt;

&lt;p&gt;The difference between the rewards is first passed through a &lt;strong&gt;sigmoid function&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For any input value, the sigmoid function outputs a value between &lt;strong&gt;0 and 1&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In an ideal case, we want the sigmoid output to be &lt;strong&gt;close to 1&lt;/strong&gt;. This happens when the preferred response receives a much higher reward than the worse response.&lt;/p&gt;
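
&lt;p&gt;To make this concrete, here is a minimal Python sketch of the sigmoid applied to a few reward differences. The example differences are made up for illustration, and this simple form is only meant for moderate input values.&lt;/p&gt;

```python
import math

def sigmoid(x):
    """Squash any real number into the range between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical values of Reward_better - Reward_worse
for diff in [-4.0, 0.0, 4.0]:
    print(diff, round(sigmoid(diff), 3))
# -4.0 0.018
# 0.0 0.5
# 4.0 0.982
```

&lt;p&gt;A large positive difference pushes the output toward 1, while a large negative difference pushes it toward 0.&lt;/p&gt;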
&lt;h2&gt;
  
  
  Step 2: The Log Function
&lt;/h2&gt;

&lt;p&gt;The output of the sigmoid function is then passed through a &lt;strong&gt;log function&lt;/strong&gt;.&lt;/p&gt;

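&lt;p&gt;Because the sigmoid output always lies between 0 and 1, its logarithm is always negative. In the ideal case, where the sigmoid output is close to 1, the log is &lt;strong&gt;close to 0&lt;/strong&gt;, which is the largest value it can take.&lt;/p&gt;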

&lt;p&gt;Finally, we multiply the result by &lt;strong&gt;-1&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This turns the equation into a loss function that optimization algorithms can minimize during training.&lt;/p&gt;
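
&lt;p&gt;Putting the pieces together, the loss can be sketched in a few lines of Python. The function name &lt;code&gt;pairwise_loss&lt;/code&gt; and the example rewards are assumptions made for illustration. The code uses the identity that the negative log of the sigmoid of a difference equals log(1 + exp(-difference)), written in a form that avoids overflow.&lt;/p&gt;

```python
import math

def pairwise_loss(reward_better, reward_worse):
    """Return -log(sigmoid(reward_better - reward_worse))."""
    diff = reward_better - reward_worse
    # Stable form of log(1 + exp(-diff))
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

# Hypothetical reward values
print(round(pairwise_loss(2.0, -2.0), 3))   # 0.018  (preferred response scored higher)
print(round(pairwise_loss(1.0, 1.0), 3))    # 0.693  (both scored the same)
print(round(pairwise_loss(-2.0, 2.0), 3))   # 4.018  (preferred response scored lower)
```

&lt;p&gt;The loss is near 0 when the preferred response scores much higher, and it grows when the ordering is wrong.&lt;/p&gt;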

&lt;p&gt;The interesting part is that we never explicitly tell the model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the preferred response must have a positive reward&lt;/li&gt;
&lt;li&gt;the worse response must have a negative reward&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead, the loss function naturally guides the model toward assigning rewards in a way that makes preferred responses score higher than worse ones.&lt;/p&gt;
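
&lt;p&gt;As a rough illustration of this point, the sketch below runs gradient descent on the loss for a single pair of responses. It is a simplification: the two rewards are treated as free numbers rather than outputs of a neural network, and the step count and learning rate are arbitrary choices.&lt;/p&gt;

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_rewards(steps=200, learning_rate=0.5):
    # Both rewards start at 0; no target values are ever supplied.
    reward_better = 0.0
    reward_worse = 0.0
    for _ in range(steps):
        diff = reward_better - reward_worse
        # Derivative of -log(sigmoid(diff)) with respect to diff
        grad_diff = sigmoid(diff) - 1.0
        # diff rises with reward_better and falls with reward_worse
        reward_better -= learning_rate * grad_diff
        reward_worse -= learning_rate * (-grad_diff)
    return reward_better, reward_worse

better, worse = train_rewards()
print(better, worse)
```

&lt;p&gt;In this toy setup the preferred reward drifts upward and the other drifts downward, even though the loss only ever looks at the difference between them.&lt;/p&gt;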
&lt;h2&gt;
  
  
  What Happens Next?
&lt;/h2&gt;

&lt;p&gt;Once the reward model is fully trained, we can use it to train the original model that only went through &lt;strong&gt;supervised fine-tuning&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In the next article, we will explore how the reward model is used to further train the original model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Looking for an easier way to install tools, libraries, or entire repositories?&lt;/strong&gt;&lt;br&gt;
Try &lt;strong&gt;Installerpedia&lt;/strong&gt;: a &lt;strong&gt;community-driven, structured installation platform&lt;/strong&gt; that lets you install almost anything with &lt;strong&gt;minimal hassle&lt;/strong&gt; and &lt;strong&gt;clear, reliable guidance&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Just run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ipm &lt;span class="nb"&gt;install &lt;/span&gt;repo-name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;… and you’re done! 🚀&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hexmos.com/ipm" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2s3mzj8pfcq94a1y4at.png" alt="Installerpedia Screenshot" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://hexmos.com/ipm/" rel="noopener noreferrer"&gt;&lt;strong&gt;Explore Installerpedia here&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Reinforcement Learning with Human Feedback Part 4: Teaching Models Human Preferences</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Sat, 23 May 2026 19:25:30 +0000</pubDate>
      <link>https://dev.to/rijultp/understanding-reinforcement-learning-with-human-feedback-part-4-teaching-models-human-preferences-m7f</link>
      <guid>https://dev.to/rijultp/understanding-reinforcement-learning-with-human-feedback-part-4-teaching-models-human-preferences-m7f</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-reinforcement-learning-with-human-feedback-part-3-collecting-human-preferences-6cl"&gt;previous article&lt;/a&gt;, we explored the part where we collect human preferences. In this article, we will see how to use this data to train the models.&lt;/p&gt;

&lt;p&gt;To train a model that gives &lt;strong&gt;higher scores to preferred responses&lt;/strong&gt;, we first make a copy of the model that has already gone through &lt;strong&gt;supervised fine-tuning&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fohrrwglu0fuyy6lviyab.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fohrrwglu0fuyy6lviyab.png" alt=" " width="779" height="557"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Modifying the Model
&lt;/h2&gt;

&lt;p&gt;Next, we modify this copied model.&lt;/p&gt;

&lt;p&gt;We remove the &lt;strong&gt;unembedding layer&lt;/strong&gt;, which normally produces a score for every possible next token, and replace it with a layer that produces a &lt;strong&gt;single output value&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygcbddrfjuyycy2z9hx1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygcbddrfjuyycy2z9hx1.png" alt=" " width="458" height="696"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The result is a new model called a &lt;strong&gt;reward model&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of generating text, this model learns to assign a &lt;strong&gt;reward score&lt;/strong&gt; to a response.&lt;/p&gt;
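
&lt;p&gt;The change can be sketched in plain Python. Everything here is hypothetical: the tiny sizes, the random weights, and the made-up hidden state stand in for a real transformer. The only point is the difference in output shape between the two layers.&lt;/p&gt;

```python
import random

random.seed(0)
HIDDEN_SIZE = 4   # hypothetical, tiny for illustration
VOCAB_SIZE = 6    # hypothetical

def random_layer(n_out, n_in):
    """Create a linear layer with random weights and zero biases."""
    weights = [[random.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def linear(weights, biases, hidden):
    """Apply a linear layer: one output per row of weights."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, biases)]

# Made-up final hidden state coming out of the transformer body
hidden = [0.5, -1.0, 0.25, 2.0]

# Original model: the unembedding layer gives one score per vocabulary token
unembed_w, unembed_b = random_layer(VOCAB_SIZE, HIDDEN_SIZE)
logits = linear(unembed_w, unembed_b, hidden)

# Reward model: the replacement layer gives a single number
reward_w, reward_b = random_layer(1, HIDDEN_SIZE)
reward = linear(reward_w, reward_b, hidden)[0]

print(len(logits), reward)
```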




&lt;h2&gt;
  
  
  Training the Reward Model
&lt;/h2&gt;

&lt;p&gt;We can now train this reward model using the &lt;strong&gt;human preference data&lt;/strong&gt; we collected earlier.&lt;/p&gt;

&lt;p&gt;For a preferred response, we train the model to produce a &lt;strong&gt;higher reward value&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For a less preferred response, we train the model to produce a &lt;strong&gt;lower reward value&lt;/strong&gt; or a &lt;strong&gt;negative reward&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If humans preferred &lt;strong&gt;Response A&lt;/strong&gt; over &lt;strong&gt;Response B&lt;/strong&gt;, the reward model learns to give a higher score to &lt;strong&gt;Response A&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;And a lower score to &lt;strong&gt;Response B&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Over time, the reward model learns what kinds of responses humans tend to prefer.&lt;/p&gt;

&lt;p&gt;We will continue in the next article.&lt;/p&gt;


</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Reinforcement Learning with Human Feedback Part 3: Collecting Human Preferences</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Wed, 20 May 2026 19:05:25 +0000</pubDate>
      <link>https://dev.to/rijultp/understanding-reinforcement-learning-with-human-feedback-part-3-collecting-human-preferences-6cl</link>
      <guid>https://dev.to/rijultp/understanding-reinforcement-learning-with-human-feedback-part-3-collecting-human-preferences-6cl</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-reinforcement-learning-with-human-feedback-part-2-aligning-pretrained-models-58ho"&gt;previous article&lt;/a&gt; we explored the concept of aligning the pretrained model. Now, we will look at the next component: human preference collection.&lt;/p&gt;

&lt;p&gt;The first step in understanding &lt;strong&gt;RLHF&lt;/strong&gt; is recognizing that, given a specific prompt, a model can generate different responses.&lt;/p&gt;

&lt;p&gt;One way to generate a response is to configure the model to always select the token with the &lt;strong&gt;highest output value&lt;/strong&gt; at every step.&lt;/p&gt;

&lt;p&gt;In this case, the model will generate the same response every single time for a given prompt.&lt;/p&gt;
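
&lt;p&gt;A minimal sketch of this selection rule is shown below. The candidate tokens and their output values are invented for illustration.&lt;/p&gt;

```python
def greedy_pick(tokens, scores):
    """Return the token with the highest output value."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return tokens[best_index]

# Hypothetical candidate tokens and output values for one step
tokens = ["mat", "sofa", "roof"]
scores = [2.0, 1.0, 0.1]

picks = [greedy_pick(tokens, scores) for _ in range(5)]
print(picks)  # ['mat', 'mat', 'mat', 'mat', 'mat']
```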




&lt;h2&gt;
  
  
  Generating Different Responses
&lt;/h2&gt;

&lt;p&gt;Instead of always selecting the highest-value token, we can also use the outputs of the &lt;strong&gt;softmax function&lt;/strong&gt; as probabilities for selecting tokens.&lt;/p&gt;

&lt;p&gt;In this approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the token with the highest probability is &lt;strong&gt;more likely&lt;/strong&gt; to be selected&lt;/li&gt;
&lt;li&gt;but other tokens still have a chance of being selected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a result, the model can generate &lt;strong&gt;different responses&lt;/strong&gt; for the same prompt.&lt;/p&gt;
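
&lt;p&gt;Sampling from the softmax outputs can be sketched as follows. The tokens, output values, and random seed are made up for illustration, and a real model would repeat this at every step of the response.&lt;/p&gt;

```python
import math
import random

def softmax(scores):
    """Convert output values into probabilities that sum to 1."""
    top = max(scores)  # subtracting the maximum keeps exp() from overflowing
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, scores, rng):
    """Pick one token at random, weighted by its softmax probability."""
    probs = softmax(scores)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical candidate tokens and output values for one step
tokens = ["mat", "sofa", "roof"]
scores = [2.0, 1.0, 0.1]

rng = random.Random(0)
probs = softmax(scores)
samples = [sample_token(tokens, scores, rng) for _ in range(1000)]
counts = {token: samples.count(token) for token in tokens}
print(probs)
print(counts)
```

&lt;p&gt;The highest-value token is picked most often, but the other tokens still appear, which is what lets the same prompt lead to different responses.&lt;/p&gt;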




&lt;h2&gt;
  
  
  Collecting Human Preferences
&lt;/h2&gt;

&lt;p&gt;Since a model can generate multiple responses, we can create &lt;strong&gt;pairs of responses&lt;/strong&gt; for the same prompt.&lt;/p&gt;

&lt;p&gt;We can then ask people which response they prefer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxaft85vd3uz0c6loe7zz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxaft85vd3uz0c6loe7zz.png" alt=" " width="800" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, given two possible answers, a person can simply choose the better one.&lt;/p&gt;

&lt;p&gt;Collecting preferences like this is much faster than asking people to manually write responses for every prompt.&lt;/p&gt;

&lt;p&gt;This preference collection process is the &lt;strong&gt;“Human Feedback”&lt;/strong&gt; part of &lt;strong&gt;RLHF&lt;/strong&gt;.&lt;/p&gt;
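
&lt;p&gt;One possible way to store a single comparison is sketched below. The &lt;code&gt;PreferencePair&lt;/code&gt; class, the &lt;code&gt;record_choice&lt;/code&gt; helper, and the example responses are all hypothetical; real datasets may use a different layout.&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    preferred: str   # the response the person chose
    rejected: str    # the response the person did not choose

def record_choice(prompt, response_a, response_b, choice):
    """Turn a person's choice ("A" or "B") into a PreferencePair."""
    if choice == "A":
        return PreferencePair(prompt, response_a, response_b)
    if choice == "B":
        return PreferencePair(prompt, response_b, response_a)
    raise ValueError("choice must be 'A' or 'B'")

# Made-up example: the person preferred Response A
pair = record_choice(
    "Suggest a coding assistant tool",
    "Try out Cursor",
    "I do not know",
    "A",
)
print(pair.preferred)  # Try out Cursor
```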

&lt;h2&gt;
  
  
  Using Preference Data
&lt;/h2&gt;

&lt;p&gt;Once we collect preference data, we can use it to train the model so that it assigns &lt;strong&gt;higher scores to preferred responses&lt;/strong&gt; and lower scores to less preferred ones.&lt;/p&gt;

&lt;p&gt;Over time, this helps the model generate responses that better match human preferences.&lt;/p&gt;




&lt;p&gt;In the next article, we will explore how to train the model using this preference data.&lt;/p&gt;


</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Reinforcement Learning with Human Feedback Part 2: Aligning Pretrained Models</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Tue, 19 May 2026 17:29:56 +0000</pubDate>
      <link>https://dev.to/rijultp/understanding-reinforcement-learning-with-human-feedback-part-2-aligning-pretrained-models-58ho</link>
      <guid>https://dev.to/rijultp/understanding-reinforcement-learning-with-human-feedback-part-2-aligning-pretrained-models-58ho</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-reinforcement-learning-with-human-feedback-part-1-pre-training-large-language-models-42hg"&gt;previous article&lt;/a&gt;, we explored the concept of &lt;strong&gt;pre-training&lt;/strong&gt; and its limitations without a further step in the training process.&lt;/p&gt;

&lt;p&gt;In this article, we will explore how we can &lt;strong&gt;align a pretrained model&lt;/strong&gt; to help overcome these limitations.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Two Steps of Alignment
&lt;/h2&gt;

&lt;p&gt;Aligning a pretrained model usually involves two stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Supervised Fine-Tuning (SFT)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reinforcement Learning with Human Feedback (RLHF)&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Step 1: Supervised Fine-Tuning
&lt;/h2&gt;

&lt;p&gt;Supervised fine-tuning uses a dataset made up of &lt;strong&gt;human-written prompts and human-written responses&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For example, someone might create a prompt like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Suggest a coding assistant tool”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And then provide a response such as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Try out Cursor”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Using many examples like this, we can train the model with standard &lt;strong&gt;backpropagation&lt;/strong&gt; so that it learns to generate helpful responses.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Supervised Fine-Tuning Achieves
&lt;/h2&gt;

&lt;p&gt;After supervised fine-tuning, the pretrained model becomes more aligned with human communication.&lt;/p&gt;

&lt;p&gt;Instead of only predicting the next token like it did during pre-training, the model now starts to generate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;helpful responses&lt;/li&gt;
&lt;li&gt;polite responses&lt;/li&gt;
&lt;li&gt;responses to natural language prompts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, supervised fine-tuning transforms a &lt;strong&gt;pretrained but unaligned model&lt;/strong&gt; into one that has started learning how to respond like an assistant.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Limitation of Supervised Fine-Tuning
&lt;/h2&gt;

&lt;p&gt;Since supervised fine-tuning requires &lt;strong&gt;human effort and time&lt;/strong&gt;, the dataset is usually much smaller than the massive dataset used during pre-training.&lt;/p&gt;

&lt;p&gt;Because of this, supervised fine-tuning can sometimes cause the model to &lt;strong&gt;overfit&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This means the model may respond well to prompts that are similar to examples it was trained on, but struggle with new prompts that were not part of the fine-tuning dataset.&lt;/p&gt;

&lt;p&gt;For example, it may respond appropriately to a prompt it has seen during training, but fail to generalize to unfamiliar prompts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why RLHF Is Needed
&lt;/h2&gt;

&lt;p&gt;One possible solution would be to create a much larger supervised fine-tuning dataset.&lt;/p&gt;

&lt;p&gt;However, collecting and writing a huge dataset by hand would be extremely expensive and time-consuming.&lt;/p&gt;

&lt;p&gt;Instead, we can use &lt;strong&gt;Reinforcement Learning with Human Feedback (RLHF)&lt;/strong&gt; to help train the model to generate better responses, even for prompts it was not directly trained on.&lt;/p&gt;

&lt;p&gt;We will explore this further in the next article.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Reinforcement Learning with Human Feedback Part 1: Pre-Training Large Language Models</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Mon, 18 May 2026 19:48:14 +0000</pubDate>
      <link>https://dev.to/rijultp/understanding-reinforcement-learning-with-human-feedback-part-1-pre-training-large-language-models-42hg</link>
      <guid>https://dev.to/rijultp/understanding-reinforcement-learning-with-human-feedback-part-1-pre-training-large-language-models-42hg</guid>
      <description>&lt;p&gt;In this article, we will explore &lt;strong&gt;Reinforcement Learning with Human Feedback (RLHF)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;RLHF is one of the techniques used to help train large language models like ChatGPT.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting with an Untrained Model
&lt;/h2&gt;

&lt;p&gt;Suppose we want to build a model like ChatGPT from scratch so that we can ask it questions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw76ma4zea2run6cgwic1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw76ma4zea2run6cgwic1.png" alt=" " width="487" height="619"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To do this, we first need to understand how to train an &lt;strong&gt;untrained decoder-only transformer model&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;By &lt;strong&gt;untrained&lt;/strong&gt;, we mean that all the weights and biases in the model are initialized with random values.&lt;/p&gt;

&lt;p&gt;At this stage, the model does not understand language or meaning.&lt;/p&gt;




&lt;h2&gt;
  
  
  The First Step: Pre-Training
&lt;/h2&gt;

&lt;p&gt;The first step in training a large language model is to teach it to &lt;strong&gt;predict the next token&lt;/strong&gt; using a very large body of text, such as Wikipedia articles.&lt;/p&gt;

&lt;p&gt;We take segments of text and use the earlier words as input tokens. The model then learns to predict the next token in the sequence.&lt;/p&gt;

&lt;p&gt;For example, if the input is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“The cat sat on the…”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model learns to predict the next likely word.&lt;/p&gt;
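
&lt;p&gt;The way training examples are built from a segment of text can be sketched like this. The sentence is made up, and splitting on spaces is only a stand-in for a real tokenizer.&lt;/p&gt;

```python
def next_token_pairs(text):
    """Build (earlier tokens, next token) training examples from a text segment."""
    tokens = text.split()  # assumption: whitespace split instead of a real tokenizer
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Hypothetical text segment
pairs = next_token_pairs("the dog ran to the park")
for context, target in pairs:
    print(" ".join(context), "| next:", target)
# the | next: dog
# the dog | next: ran
# the dog ran | next: to
# the dog ran to | next: the
# the dog ran to the | next: park
```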

&lt;p&gt;By repeating this process across a massive amount of text, the model gradually learns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;grammar&lt;/li&gt;
&lt;li&gt;sentence structure&lt;/li&gt;
&lt;li&gt;facts and patterns in language&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This training stage is called &lt;strong&gt;pre-training&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Over time, this process produces a &lt;strong&gt;pretrained model&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foa1acnf1nid6u5q0x6hi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foa1acnf1nid6u5q0x6hi.png" alt=" " width="769" height="501"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Pre-Training Is Not Enough
&lt;/h2&gt;

&lt;p&gt;At this point, the model becomes good at predicting the next token in text.&lt;/p&gt;

&lt;p&gt;However, simply predicting the next token is not enough to solve the problem of answering questions like a chatbot.&lt;/p&gt;

&lt;p&gt;For example, being good at continuing Wikipedia text does not automatically mean the model will give helpful, safe, or conversational responses.&lt;/p&gt;

&lt;p&gt;To make the model useful for chat, we need to &lt;strong&gt;align&lt;/strong&gt; it with human expectations.&lt;/p&gt;

&lt;p&gt;We will explore this in the next article.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Reinforcement Learning with Neural Networks Part 6: Completing the Reinforcement Learning Process</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Sat, 16 May 2026 20:28:40 +0000</pubDate>
      <link>https://dev.to/rijultp/understanding-reinforcement-learning-with-neural-networks-part-6-completing-the-reinforcement-5g8b</link>
      <guid>https://dev.to/rijultp/understanding-reinforcement-learning-with-neural-networks-part-6-completing-the-reinforcement-5g8b</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-reinforcement-learning-with-neural-networks-part-5-connecting-reward-derivative-2dk"&gt;previous article&lt;/a&gt; we covered the basics of training, and how rewards, derivatives and step-size were used to acheive it.&lt;/p&gt;

&lt;p&gt;In this article, we will finish the training process for the model.&lt;/p&gt;

&lt;p&gt;To fully train the model, we need to use &lt;strong&gt;different input values between 0 and 1&lt;/strong&gt; as inputs to the neural network.&lt;/p&gt;

&lt;p&gt;This allows the model to learn how to behave under different hunger levels.&lt;/p&gt;




&lt;h2&gt;
  
  
  Training Over Time
&lt;/h2&gt;

&lt;p&gt;After many updates using a wide range of inputs between &lt;strong&gt;0 and 1&lt;/strong&gt;, the value of the bias starts to hover around &lt;strong&gt;-10&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs94u004w2lkib5x88g8h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs94u004w2lkib5x88g8h.png" alt=" " width="800" height="206"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The fact that the bias stabilizes around a certain value suggests that the model has finished training.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Happens When We Are Not Hungry?
&lt;/h2&gt;

&lt;p&gt;Now, let us test the model.&lt;/p&gt;

&lt;p&gt;When we are &lt;strong&gt;not hungry&lt;/strong&gt; and the input is &lt;strong&gt;0.0&lt;/strong&gt;, the probability of going to &lt;strong&gt;Place B&lt;/strong&gt; becomes &lt;strong&gt;0&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This means that when hunger is low, the model will always choose &lt;strong&gt;Place A&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgmt4k5qaprmi97y08fc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgmt4k5qaprmi97y08fc.png" alt=" " width="800" height="220"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What Happens When We Are Hungry?
&lt;/h2&gt;

&lt;p&gt;In contrast, when we are &lt;strong&gt;very hungry&lt;/strong&gt; and the input value is &lt;strong&gt;1.0&lt;/strong&gt;, the probability of going to &lt;strong&gt;Place B&lt;/strong&gt; becomes &lt;strong&gt;1&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcpkrjpdx986bdqmenkg7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcpkrjpdx986bdqmenkg7.png" alt=" " width="800" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This means that when hunger is high, the model will always choose &lt;strong&gt;Place B&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary of Reinforcement Learning
&lt;/h2&gt;

&lt;p&gt;In summary, reinforcement learning allows us to optimize a neural network &lt;strong&gt;even when we do not know the correct outputs in advance&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The process works like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The neural network decides what action to take&lt;/li&gt;
&lt;li&gt;We assume that the chosen action was the correct decision

&lt;ul&gt;
&lt;li&gt;For example, if we go to &lt;strong&gt;Place B&lt;/strong&gt;, we temporarily assume that Place B was the best choice, even if we later discover that Place A would have been better&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;We calculate the derivative with respect to the parameter we want to optimize&lt;/li&gt;
&lt;li&gt;We determine the reward associated with the decision&lt;/li&gt;
&lt;li&gt;We multiply the derivative by the reward to correct mistakes&lt;/li&gt;
&lt;li&gt;This gives us the &lt;strong&gt;updated derivative&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;We use the updated derivative in &lt;strong&gt;gradient descent&lt;/strong&gt; to optimize the neural network&lt;/li&gt;
&lt;/ol&gt;
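&lt;p&gt;The steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not the exact network from this series: it assumes a single sigmoid output with one bias, a derivative of the form &lt;code&gt;p(B) - ideal p(B)&lt;/code&gt;, and a hypothetical &lt;code&gt;reward_for&lt;/code&gt; function that returns +1 or -1. All function and variable names are made up for illustration.&lt;/p&gt;

```python
import math
import random

def p_place_b(hunger, weight, bias):
    # Hypothetical one-neuron network: sigmoid(weight * hunger + bias).
    return 1.0 / (1.0 + math.exp(-(weight * hunger + bias)))

def reward_for(hunger, action):
    # Hypothetical reward: Place B (large fries) is good only when hungry.
    good = "B" if hunger >= 0.5 else "A"
    return 1.0 if action == good else -1.0

def policy_gradient_step(hunger, weight, bias, learning_rate, rng):
    p_b = p_place_b(hunger, weight, bias)               # 1. network decides
    action = "B" if rng.random() >= 1.0 - p_b else "A"
    ideal_p_b = 1.0 if action == "B" else 0.0           # 2. assume it was correct
    derivative = p_b - ideal_p_b                        # 3. assumed derivative for the bias
    reward = reward_for(hunger, action)                 # 4. reward for the decision
    updated_derivative = derivative * reward            # 5-6. updated derivative
    return bias - learning_rate * updated_derivative    # 7. gradient descent

rng = random.Random(0)
bias = 0.0
for _ in range(200):
    bias = policy_gradient_step(0.0, 1.0, bias, 0.1, rng)

# With hunger 0.0, p(B) should end up below its starting value of 0.5.
print(p_place_b(0.0, 1.0, bias))
```

&lt;p&gt;In this sketch, with hunger fixed at 0.0, every update lowers the bias whichever place is sampled: a good choice of Place A keeps the derivative as it is, and a bad choice of Place B flips it.&lt;/p&gt;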

&lt;p&gt;That wraps up this article.&lt;/p&gt;

&lt;p&gt;In the next article, we will explore &lt;strong&gt;Reinforcement Learning with Human Feedback (RLHF)&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Looking for an easier way to install tools, libraries, or entire repositories?&lt;/strong&gt;&lt;br&gt;
Try &lt;strong&gt;Installerpedia&lt;/strong&gt;: a &lt;strong&gt;community-driven, structured installation platform&lt;/strong&gt; that lets you install almost anything with &lt;strong&gt;minimal hassle&lt;/strong&gt; and &lt;strong&gt;clear, reliable guidance&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Just run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ipm &lt;span class="nb"&gt;install &lt;/span&gt;repo-name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;… and you’re done! 🚀&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hexmos.com/ipm" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2s3mzj8pfcq94a1y4at.png" alt="Installerpedia Screenshot" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://hexmos.com/ipm/" rel="noopener noreferrer"&gt;&lt;strong&gt;Explore Installerpedia here&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Reinforcement Learning with Neural Networks Part 5: Connecting Reward, Derivative, and Step Size</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Fri, 15 May 2026 20:33:05 +0000</pubDate>
      <link>https://dev.to/rijultp/understanding-reinforcement-learning-with-neural-networks-part-5-connecting-reward-derivative-2dk</link>
      <guid>https://dev.to/rijultp/understanding-reinforcement-learning-with-neural-networks-part-5-connecting-reward-derivative-2dk</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-reinforcement-learning-with-neural-networks-part-4-positive-and-negative-rewards-23h0"&gt;previous article&lt;/a&gt;, we explored the reward system in reinforcement learning.&lt;/p&gt;

&lt;p&gt;In this article, we will begin calculating the &lt;strong&gt;step size&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  First Update
&lt;/h2&gt;

&lt;p&gt;In this example, the learning rate is &lt;strong&gt;1.0&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gaf2bcc6qnlfoo97obt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gaf2bcc6qnlfoo97obt.png" alt=" " width="800" height="145"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, the step size is &lt;strong&gt;0.5&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Next, we update the bias by subtracting the step size from the old bias value &lt;strong&gt;0.0&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzivto11xex0mqmywf4a5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzivto11xex0mqmywf4a5.png" alt=" " width="595" height="162"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  After the Update
&lt;/h2&gt;

&lt;p&gt;Now that the bias has been updated, we run the model again.&lt;/p&gt;

&lt;p&gt;The new probability of going to &lt;strong&gt;Place B&lt;/strong&gt; becomes &lt;strong&gt;0.4&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiu35h5j6ol9yrzupnf2l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiu35h5j6ol9yrzupnf2l.png" alt=" " width="644" height="129"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This means the probability of going to &lt;strong&gt;Place A&lt;/strong&gt; is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxr65rqf5vyhy634ii90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxr65rqf5vyhy634ii90.png" alt=" " width="391" height="88"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgb7teytxtd16uruvt201.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgb7teytxtd16uruvt201.png" alt=" " width="657" height="175"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Choosing Again
&lt;/h2&gt;

&lt;p&gt;We now pick a random number between 0 and 1, and get &lt;strong&gt;0.9&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd024wbn97kuxo2q5jr16.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd024wbn97kuxo2q5jr16.png" alt=" " width="672" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since &lt;strong&gt;0.9&lt;/strong&gt; falls in the region representing &lt;strong&gt;Place B&lt;/strong&gt;, we choose Place B.&lt;/p&gt;

&lt;h2&gt;
  
  
  Computing the Gradient Again
&lt;/h2&gt;

&lt;p&gt;To update the bias, we again compute the derivative.&lt;/p&gt;

&lt;p&gt;First, we assume that choosing &lt;strong&gt;Place B&lt;/strong&gt; was the correct action.&lt;/p&gt;

&lt;p&gt;So ideally:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6ugtinbby8aquykebvq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6ugtinbby8aquykebvq.png" alt=" " width="328" height="93"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we compute the difference between the ideal value &lt;strong&gt;1.0&lt;/strong&gt; and the actual value &lt;strong&gt;0.4&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Using this, we calculate the derivative with respect to the bias, which gives:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0a4sdax4nf8qdpfaovud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0a4sdax4nf8qdpfaovud.png" alt=" " width="479" height="69"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzmm7f8258xmctg9gt30h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzmm7f8258xmctg9gt30h.png" alt=" " width="579" height="365"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Checking the Reward
&lt;/h2&gt;

&lt;p&gt;Now we check whether this was actually a good decision.&lt;/p&gt;

&lt;p&gt;Place B gives a large portion of fries, but our hunger input is &lt;strong&gt;0.0&lt;/strong&gt;, meaning we are not hungry.&lt;/p&gt;

&lt;p&gt;So this was not a good choice.&lt;/p&gt;

&lt;p&gt;Therefore, the reward is:&lt;/p&gt;

&lt;p&gt;Reward = -1&lt;/p&gt;




&lt;h2&gt;
  
  
  Updating with Reward
&lt;/h2&gt;

&lt;p&gt;We multiply the derivative by the reward:&lt;/p&gt;

&lt;p&gt;-0.6 × -1 = 0.6&lt;/p&gt;

&lt;p&gt;So the updated derivative becomes &lt;strong&gt;0.6&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Second Step Update
&lt;/h2&gt;

&lt;p&gt;Now we calculate the step size again:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rbgqh9tj72k0szf15b1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rbgqh9tj72k0szf15b1.png" alt=" " width="800" height="152"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Result
&lt;/h2&gt;

&lt;p&gt;We plug the new bias back into the neural network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3ifab68esfkfjvpn7zm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3ifab68esfkfjvpn7zm.png" alt=" " width="628" height="194"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now the probability of going to &lt;strong&gt;Place B&lt;/strong&gt; has decreased.&lt;/p&gt;

&lt;p&gt;This means that when hunger is low, the model is more likely to choose &lt;strong&gt;Place A&lt;/strong&gt;, which is the correct behavior.&lt;/p&gt;

&lt;p&gt;This shows that the reinforcement learning algorithm, specifically &lt;strong&gt;policy gradients&lt;/strong&gt;, is working as expected.&lt;/p&gt;




&lt;p&gt;In the next article, we will explore how to further train the model using different input values.&lt;/p&gt;


</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Reinforcement Learning with Neural Networks Part 4: Positive and Negative Rewards</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Wed, 13 May 2026 20:46:18 +0000</pubDate>
      <link>https://dev.to/rijultp/understanding-reinforcement-learning-with-neural-networks-part-4-positive-and-negative-rewards-23h0</link>
      <guid>https://dev.to/rijultp/understanding-reinforcement-learning-with-neural-networks-part-4-positive-and-negative-rewards-23h0</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-reinforcement-learning-with-neural-networks-part-3-guessing-the-ideal-output-3m47"&gt;previous article&lt;/a&gt;, we began the process of guessing the ideal output.&lt;/p&gt;

&lt;p&gt;Let us continue with the same example.&lt;/p&gt;

&lt;p&gt;Suppose we receive a &lt;strong&gt;small number of fries&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Since our hunger level is &lt;strong&gt;0&lt;/strong&gt;, this is actually a good outcome.&lt;/p&gt;

&lt;p&gt;In this case, we should assign a &lt;strong&gt;reward of 1&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;Now consider the opposite situation.&lt;/p&gt;

&lt;p&gt;Suppose we receive a &lt;strong&gt;large order of fries&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Since we are not hungry enough to eat all the fries, this means we made a poor decision.&lt;/p&gt;

&lt;p&gt;In that case, we assign a &lt;strong&gt;reward of -1&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;In general:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Any &lt;strong&gt;positive reward&lt;/strong&gt; indicates a good decision&lt;/li&gt;
&lt;li&gt;Any &lt;strong&gt;negative reward&lt;/strong&gt; indicates a bad decision&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Updating the Derivative with the Reward
&lt;/h2&gt;

&lt;p&gt;We now use this reward to update the derivative.&lt;/p&gt;

&lt;p&gt;To do this, we simply multiply the derivative by the reward.&lt;/p&gt;




&lt;h3&gt;
  
  
  Case 1: Correct Decision
&lt;/h3&gt;

&lt;p&gt;If the reward is &lt;strong&gt;1&lt;/strong&gt;, then:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqooiniacv6ovw8q9i3e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqooiniacv6ovw8q9i3e.png" alt=" " width="360" height="93"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The derivative remains unchanged.&lt;/p&gt;

&lt;p&gt;This means the derivative is already pointing in the correct direction.&lt;/p&gt;




&lt;h3&gt;
  
  
  Case 2: Incorrect Decision
&lt;/h3&gt;

&lt;p&gt;If the reward is &lt;strong&gt;-1&lt;/strong&gt;, then:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fld54pyb01v6hd9qdjetd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fld54pyb01v6hd9qdjetd.png" alt=" " width="376" height="101"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now the derivative changes sign.&lt;/p&gt;

&lt;p&gt;This causes the optimization process to move the bias in the opposite direction.&lt;/p&gt;

&lt;p&gt;In other words, the negative reward flips the direction of the update so the neural network can learn from the bad decision.&lt;/p&gt;
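&lt;p&gt;The two cases can be sketched in a few lines of Python. This is only an illustration: &lt;code&gt;apply_reward&lt;/code&gt; is a hypothetical helper, and the derivative values are example numbers rather than values taken from the images above.&lt;/p&gt;

```python
def apply_reward(derivative, reward):
    # Multiplying by the reward keeps or flips the sign of the derivative.
    return derivative * reward

print(apply_reward(0.5, 1))    # 0.5: good decision, direction kept
print(apply_reward(-0.6, -1))  # 0.6: bad decision, direction flipped
```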

&lt;p&gt;In the next article, we will explore how to calculate the &lt;strong&gt;step size&lt;/strong&gt; for updating the parameters.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Reinforcement Learning with Neural Networks Part 3: Guessing the Ideal Output</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Mon, 11 May 2026 18:48:10 +0000</pubDate>
      <link>https://dev.to/rijultp/understanding-reinforcement-learning-with-neural-networks-part-3-guessing-the-ideal-output-3m47</link>
      <guid>https://dev.to/rijultp/understanding-reinforcement-learning-with-neural-networks-part-3-guessing-the-ideal-output-3m47</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-reinforcement-learning-with-neural-networks-part-2-why-backpropagation-is-not-enough-2el2"&gt;previous article&lt;/a&gt;, we explored the limitations of backpropagation and why it is not ideal when the correct output values are unknown.&lt;/p&gt;

&lt;p&gt;In this article, we will begin exploring the core ideas behind reinforcement learning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting Example
&lt;/h2&gt;

&lt;p&gt;Let us begin by assuming that we are &lt;strong&gt;not hungry&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We will feed the value &lt;strong&gt;0.0&lt;/strong&gt; into the neural network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxbrdbq7cqde9q2bzo575.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxbrdbq7cqde9q2bzo575.png" alt=" " width="800" height="175"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The neural network outputs a probability of &lt;strong&gt;0.5&lt;/strong&gt; for going to &lt;strong&gt;Place B&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Probability of going to Place B = &lt;strong&gt;p(B) = 0.5&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Probability of going to Place A = &lt;strong&gt;1 - p(B) = 0.5&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Visualizing the Probabilities
&lt;/h2&gt;

&lt;p&gt;We can represent these probabilities using a line.&lt;/p&gt;

&lt;p&gt;First, we draw a line segment with length &lt;strong&gt;0.5&lt;/strong&gt; to represent the probability of going to &lt;strong&gt;Place A&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Then, we append another line segment to represent the probability of going to &lt;strong&gt;Place B&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67ez8dzdjhuk5z06ik5z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67ez8dzdjhuk5z06ik5z.png" alt=" " width="800" height="180"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Together, these form a line ranging from &lt;strong&gt;0 to 1&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Choosing an Action
&lt;/h2&gt;

&lt;p&gt;To decide which place to go for a snack, we randomly pick a number between &lt;strong&gt;0 and 1&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let us pick &lt;strong&gt;0.2&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ckpjtnj21fhvfgecxti.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ckpjtnj21fhvfgecxti.png" alt=" " width="800" height="213"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since &lt;strong&gt;0.2&lt;/strong&gt; falls inside the region representing &lt;strong&gt;Place A&lt;/strong&gt;, we choose to go to Place A.&lt;/p&gt;
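&lt;p&gt;This selection rule can be written as a short Python function. It is a minimal sketch: &lt;code&gt;choose_place&lt;/code&gt; is a hypothetical name, and treating a number that lands exactly on the boundary as Place B is an assumption.&lt;/p&gt;

```python
def choose_place(p_b, random_number):
    # Place A occupies the segment from 0 to p(A) = 1 - p(B); Place B the rest.
    p_a = 1.0 - p_b
    return "B" if random_number >= p_a else "A"

print(choose_place(0.5, 0.2))  # A
```

&lt;p&gt;In practice, the random number would come from a generator such as Python's &lt;code&gt;random.random()&lt;/code&gt;, so the same probabilities can lead to different choices on different runs.&lt;/p&gt;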




&lt;h2&gt;
  
  
  Making a Guess About the Correct Action
&lt;/h2&gt;

&lt;p&gt;Now, let us assume that going to &lt;strong&gt;Place A&lt;/strong&gt; when hunger = 0 was the correct decision.&lt;/p&gt;

&lt;p&gt;Ideally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The probability of going to Place A, &lt;strong&gt;p(A)&lt;/strong&gt;, should be &lt;strong&gt;1&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The probability of going to Place B, &lt;strong&gt;p(B)&lt;/strong&gt;, should be &lt;strong&gt;0&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These ideal values are based on our guess about what the correct action should have been.&lt;/p&gt;




&lt;h2&gt;
  
  
  Moving Toward Optimization
&lt;/h2&gt;

&lt;p&gt;Using these guessed ideal values, we can calculate the difference between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the ideal probability for &lt;strong&gt;p(A)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;the actual probability produced by the neural network&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows us to calculate the derivative of the difference with respect to the bias we want to optimize.&lt;/p&gt;




&lt;p&gt;In the next article, we will continue exploring how this optimization process works in reinforcement learning.&lt;/p&gt;


</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Reinforcement Learning with Neural Networks Part 2: Why Backpropagation Is Not Enough</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Sun, 10 May 2026 19:53:16 +0000</pubDate>
      <link>https://dev.to/rijultp/understanding-reinforcement-learning-with-neural-networks-part-2-why-backpropagation-is-not-enough-2el2</link>
      <guid>https://dev.to/rijultp/understanding-reinforcement-learning-with-neural-networks-part-2-why-backpropagation-is-not-enough-2el2</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-reinforcement-learning-with-neural-networks-part-1-learning-without-correct-answers-47ld"&gt;previous article&lt;/a&gt;, we explored an example where reinforcement learning is required and standard methods do not work.&lt;/p&gt;

&lt;p&gt;In this article, we will understand why &lt;strong&gt;policy gradients&lt;/strong&gt; are needed, and why the standard &lt;strong&gt;backpropagation&lt;/strong&gt; method does not work in certain situations.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Backpropagation Normally Works
&lt;/h2&gt;

&lt;p&gt;Assume we have the following training data, where the desired outputs are already known:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Input (Hunger)&lt;/th&gt;
&lt;th&gt;Output p(B)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.9&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With this data, we can feed the input values into the neural network one at a time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8scwonpe9136c8nap1pi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8scwonpe9136c8nap1pi.png" alt=" " width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The neural network produces an output, and we compare it with the &lt;strong&gt;ideal output value&lt;/strong&gt; from the training data.&lt;/p&gt;

&lt;p&gt;Using this difference, we can measure how wrong the network is.&lt;/p&gt;




&lt;h2&gt;
  
  
  Using Derivatives to Update the Bias
&lt;/h2&gt;

&lt;p&gt;We can calculate these differences for different values of the bias and visualize how the error changes as the bias changes.&lt;/p&gt;

&lt;p&gt;From this graph, we can calculate the &lt;strong&gt;derivative&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the derivative is &lt;strong&gt;negative&lt;/strong&gt;, we shift the bias to the right (increase it)&lt;/li&gt;
&lt;li&gt;If the derivative is &lt;strong&gt;positive&lt;/strong&gt;, we shift the bias to the left (decrease it)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The derivative correctly tells us which direction to move because the training data already contains the ideal output values.&lt;/p&gt;

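&lt;p&gt;This is the basic idea behind training with &lt;strong&gt;backpropagation&lt;/strong&gt;: it computes the derivative, and gradient descent uses that derivative to update the bias.&lt;/p&gt;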
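&lt;p&gt;The direction rule above can be checked numerically with the training data from the table. This is a sketch under loud assumptions: the network is reduced to a single sigmoid neuron with a hypothetical fixed weight of 10.0, the error is the sum of squared differences, and the derivative is estimated numerically instead of with backpropagation.&lt;/p&gt;

```python
import math

# Training data from the table above: (hunger, ideal p(B)).
DATA = [(0.0, 0.0), (1.0, 1.0), (0.1, 0.0), (0.9, 1.0)]
WEIGHT = 10.0  # hypothetical fixed weight, not taken from the article

def predict(hunger, bias):
    # Assumed model: sigmoid(WEIGHT * hunger + bias).
    return 1.0 / (1.0 + math.exp(-(WEIGHT * hunger + bias)))

def total_error(bias):
    # Sum of squared differences between ideal and predicted outputs.
    return sum((ideal - predict(h, bias)) ** 2 for h, ideal in DATA)

def derivative(bias, eps=1e-5):
    # Central-difference estimate of the slope of the error curve.
    return (total_error(bias + eps) - total_error(bias - eps)) / (2 * eps)

def direction(bias):
    # Positive slope: shift left (decrease). Negative slope: shift right.
    return "left" if derivative(bias) > 0 else "right"

print(direction(0.0))    # left
print(direction(-10.0))  # right
```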




&lt;h2&gt;
  
  
  The Problem in Reinforcement Learning
&lt;/h2&gt;

&lt;p&gt;However, in reinforcement learning, we do not know the ideal output values in advance.&lt;/p&gt;

&lt;p&gt;For example, we do not already know whether choosing Place A or Place B is the correct action.&lt;/p&gt;

&lt;p&gt;Because of this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;we cannot calculate the difference between the neural network’s output and the ideal output&lt;/li&gt;
&lt;li&gt;without these differences, we cannot calculate derivatives in the normal way&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  A Different Approach
&lt;/h2&gt;

&lt;p&gt;Instead, we can &lt;strong&gt;guess&lt;/strong&gt; what the ideal outputs should be and use those guesses to estimate the derivatives.&lt;/p&gt;

&lt;p&gt;This idea forms the foundation of &lt;strong&gt;policy gradients&lt;/strong&gt; in reinforcement learning.&lt;/p&gt;

&lt;p&gt;In the next article, we will explore how reinforcement learning and policy gradients help us solve this problem.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Reinforcement Learning with Neural Networks Part 1: Learning Without Correct Answers</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Fri, 08 May 2026 18:50:27 +0000</pubDate>
      <link>https://dev.to/rijultp/understanding-reinforcement-learning-with-neural-networks-part-1-learning-without-correct-answers-47ld</link>
      <guid>https://dev.to/rijultp/understanding-reinforcement-learning-with-neural-networks-part-1-learning-without-correct-answers-47ld</guid>
      <description>&lt;p&gt;In this article, we will explore &lt;strong&gt;reinforcement learning with neural networks&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let’s start with a simple example.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing Between Two Snack Places
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fazbdd5bnvfcgb508jaai.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fazbdd5bnvfcgb508jaai.png" alt=" " width="800" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Suppose it is snack time, and you have to choose between &lt;strong&gt;Place A&lt;/strong&gt; and &lt;strong&gt;Place B&lt;/strong&gt; for fries.&lt;/p&gt;

&lt;p&gt;To make a good decision, we also need to consider &lt;strong&gt;how hungry we are&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Some days we may be very hungry, while on other days we may only want a small snack.&lt;/p&gt;

&lt;p&gt;We also need to consider how many fries each place might serve.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Place B might give a &lt;strong&gt;large quantity of fries&lt;/strong&gt;, which would be great if we were very hungry&lt;/li&gt;
&lt;li&gt;But if we were not that hungry, getting too many fries might not be ideal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Similarly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Getting a small amount of fries would not be good if we were extremely hungry&lt;/li&gt;
&lt;li&gt;But it could be perfectly fine if we only wanted a light snack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, it would be useful to have a system that helps decide which place to choose based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;our hunger level&lt;/li&gt;
&lt;li&gt;the possible quantity of fries we might receive&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Using a Neural Network
&lt;/h2&gt;

&lt;p&gt;To solve this problem, we will use a &lt;strong&gt;neural network&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The neural network takes our &lt;strong&gt;hunger level&lt;/strong&gt; as the input and outputs the probability of choosing &lt;strong&gt;Place B&lt;/strong&gt;, written as &lt;strong&gt;p(B)&lt;/strong&gt;.&lt;/p&gt;
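
&lt;p&gt;As a minimal sketch of that idea, the snippet below uses a single neuron with a sigmoid output to turn a hunger level into p(B). The hunger scale from 0 to 1, the function names, and the weight and bias values are all assumptions made for illustration; they are not trained values and not the network used in this series.&lt;/p&gt;

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def probability_of_place_b(hunger, weight=2.0, bias=-1.0):
    """Map a hunger level (assumed to lie between 0 and 1) to p(B).

    weight and bias are arbitrary placeholder values, not trained ones.
    """
    return sigmoid(weight * hunger + bias)


p_b = probability_of_place_b(hunger=0.9)
p_a = 1.0 - p_b  # only two choices, so the two probabilities sum to 1
print(f"p(B) = {p_b:.3f}, p(A) = {p_a:.3f}")
```

&lt;p&gt;Training would adjust the weight and bias so that p(B) is high on the days when Place B turns out to be the better choice.&lt;/p&gt;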

&lt;h2&gt;
  
  
  The Challenge
&lt;/h2&gt;

&lt;p&gt;Normally, when training a neural network, we start with a training dataset that contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;input values&lt;/li&gt;
&lt;li&gt;correct output values&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using this data, we can train the network with standard &lt;strong&gt;backpropagation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;However, in this example, we do not know in advance whether Place A or Place B will serve a large or small quantity of fries.&lt;/p&gt;

&lt;p&gt;Because of this, we do not know what the correct output values should be.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reinforcement Learning
&lt;/h2&gt;

&lt;p&gt;In situations where we do not have known output values, we can still train a model using &lt;strong&gt;reinforcement learning&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of learning from correct answers, the model learns by trying actions and receiving feedback based on how good the outcome was.&lt;/p&gt;

&lt;p&gt;In the next article, we will explore a reinforcement learning algorithm called &lt;strong&gt;policy gradients&lt;/strong&gt;.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Encoder-Only Transformers: The Foundation of BERT and RAG Retrieval</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Thu, 07 May 2026 18:50:05 +0000</pubDate>
      <link>https://dev.to/rijultp/understanding-encoder-only-transformers-the-foundation-of-bert-and-rag-retrieval-4bk8</link>
      <guid>https://dev.to/rijultp/understanding-encoder-only-transformers-the-foundation-of-bert-and-rag-retrieval-4bk8</guid>
      <description>&lt;p&gt;Back in 2017, the first transformer architecture introduced two main components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an &lt;strong&gt;encoder&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;decoder&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These two parts were connected so they could work together.&lt;/p&gt;

&lt;p&gt;This original design is known as an &lt;strong&gt;encoder–decoder transformer&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decoders Can Work on Their Own
&lt;/h2&gt;

&lt;p&gt;Over time, researchers realized that the decoder alone was powerful enough for many tasks.&lt;/p&gt;

&lt;p&gt;Using only a decoder, models could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generate text&lt;/li&gt;
&lt;li&gt;continue sentences&lt;/li&gt;
&lt;li&gt;perform translation and other language tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As we discussed in the article on &lt;a href="https://dev.to/rijultp/understanding-decoder-only-transformers-part-1-masked-self-attention-mf8"&gt;decoder-only transformers&lt;/a&gt;, these models form the foundation of systems like ChatGPT.&lt;/p&gt;

&lt;p&gt;These are called &lt;strong&gt;decoder-only transformers&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Encoders Can Also Work Independently
&lt;/h2&gt;

&lt;p&gt;In a similar way, encoder-based models are also very useful on their own.&lt;/p&gt;

&lt;p&gt;This idea forms the foundation of models like &lt;strong&gt;BERT&lt;/strong&gt; and many others.&lt;/p&gt;

&lt;p&gt;These are called &lt;strong&gt;encoder-only transformers&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Blocks of Encoder-Only Transformers
&lt;/h2&gt;

&lt;p&gt;Encoder-only transformers use the same core components we explored earlier:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Word embeddings&lt;/strong&gt; convert words into numbers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Positional encoding&lt;/strong&gt; keeps track of word order&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-attention&lt;/strong&gt; helps establish relationships between words&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9wqriy365gvonj5c3isu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9wqriy365gvonj5c3isu.png" alt=" " width="549" height="702"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When these layers are combined, they create a new representation for each token that captures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;meaning&lt;/li&gt;
&lt;li&gt;position&lt;/li&gt;
&lt;li&gt;relationships with other words&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These representations are called &lt;strong&gt;context-aware embeddings&lt;/strong&gt; or &lt;strong&gt;contextualized embeddings&lt;/strong&gt;.&lt;/p&gt;
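
&lt;p&gt;To make this concrete, here is a rough sketch of how the three components combine. The 2-dimensional embeddings and positional values are made up, and the attention step uses the inputs directly as queries, keys, and values. A real encoder learns separate weight matrices for those, so treat this as an illustration rather than a faithful implementation.&lt;/p&gt;

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections.

    Real encoders learn separate query, key, and value weight matrices;
    using the inputs directly is a simplification for illustration.
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # Each output is a weighted average of every token's vector.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
    return outputs


# Hypothetical 2-dimensional word embeddings for a 3-token sentence.
word_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical positional values, one vector per position.
positions = [[0.0, 0.1], [0.1, 0.0], [0.1, 0.1]]

# Add position information to each word embedding.
inputs = [[w + p for w, p in zip(word, pos)]
          for word, pos in zip(word_embeddings, positions)]

contextual = self_attention(inputs)
for vector in contextual:
    print([round(x, 3) for x in vector])
```

&lt;p&gt;Each printed vector blends information from every token in the sentence, which is what makes the embedding context-aware.&lt;/p&gt;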




&lt;h2&gt;
  
  
  Why Context-Aware Embeddings Are Useful
&lt;/h2&gt;

&lt;p&gt;Context-aware embeddings can help group together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;similar sentences&lt;/li&gt;
&lt;li&gt;similar paragraphs&lt;/li&gt;
&lt;li&gt;similar documents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This capability is one of the foundations of &lt;strong&gt;Retrieval-Augmented Generation (RAG)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;RAG works by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Breaking documents into smaller chunks of text&lt;/li&gt;
&lt;li&gt;Using an encoder-only transformer to generate embeddings for each chunk&lt;/li&gt;
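&lt;li&gt;Embedding the user's query in the same way and comparing it with the chunk embeddings to find the most relevant chunks&lt;/li&gt;
&lt;li&gt;Passing the retrieved chunks to a generative model as context for its answer&lt;/li&gt;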
&lt;/ol&gt;
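
&lt;p&gt;The retrieval steps above can be sketched as follows. Note that &lt;code&gt;toy_embed&lt;/code&gt; is a hypothetical stand-in that simply counts vocabulary words; a real system would call an encoder-only transformer at that point. The chunks, vocabulary, and query are also invented for the example, and cosine similarity is one common way to compare embeddings, not the only one.&lt;/p&gt;

```python
import math


def toy_embed(text, vocabulary):
    """Stand-in for an encoder: count how often each vocabulary word occurs.

    A real RAG system would call an encoder-only transformer here.
    """
    words = text.lower().split()
    return [float(words.count(term)) for term in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Step 1: documents already split into chunks (hypothetical text).
chunks = [
    "the encoder builds contextual embeddings",
    "the decoder generates text one token at a time",
    "fries taste better with salt",
]
vocabulary = ["encoder", "decoder", "embeddings", "text", "fries", "salt"]

# Step 2: embed every chunk.
chunk_vectors = [toy_embed(chunk, vocabulary) for chunk in chunks]

# Step 3: embed the query and pick the most similar chunk.
query = "which part creates embeddings"
query_vector = toy_embed(query, vocabulary)
scores = [cosine_similarity(query_vector, v) for v in chunk_vectors]
best_chunk = chunks[scores.index(max(scores))]
print(best_chunk)
```

&lt;p&gt;Because contextual embeddings place similar text close together, the chunk with the highest similarity score is the one most likely to be relevant to the query.&lt;/p&gt;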




&lt;h2&gt;
  
  
  Other Uses of Encoder-Only Transformers
&lt;/h2&gt;

&lt;p&gt;Context-aware embeddings can also be used as inputs for machine learning models.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;neural networks can use them for &lt;strong&gt;sentiment classification&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;logistic regression models can also use them for classification tasks&lt;/li&gt;
&lt;/ul&gt;
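
&lt;p&gt;As a small illustration of the second case, the sketch below trains a logistic regression classifier on embeddings. The 2-dimensional vectors and sentiment labels are invented for the example (real embeddings have hundreds of dimensions), and the training loop is a plain gradient descent written from scratch rather than any particular library's implementation.&lt;/p&gt;

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=200):
    """Fit weights and a bias with plain gradient descent on log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for vector, label in zip(features, labels):
            prediction = sigmoid(
                sum(w * x for w, x in zip(weights, vector)) + bias)
            error = prediction - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, vector)]
            bias -= learning_rate * error
    return weights, bias


# Hypothetical 2-dimensional sentence embeddings with sentiment labels
# (1 = positive, 0 = negative).
embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = [1, 1, 0, 0]

weights, bias = train_logistic_regression(embeddings, labels)

# Score a new, unseen embedding that sits near the positive examples.
new_sentence = [0.85, 0.15]
score = sigmoid(sum(w * x for w, x in zip(weights, new_sentence)) + bias)
print(f"probability of positive sentiment: {score:.3f}")
```

&lt;p&gt;The encoder does the hard work of turning text into useful numbers, so even a simple classifier like this can separate the two classes.&lt;/p&gt;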

&lt;p&gt;That wraps up encoder-only transformers.&lt;/p&gt;

&lt;p&gt;In the next article, we will explore &lt;strong&gt;reinforcement learning in neural networks&lt;/strong&gt;.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
