<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Charbel</title>
    <description>The latest articles on DEV Community by Charbel (@charbull).</description>
    <link>https://dev.to/charbull</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3707644%2F81e2b118-4e82-4758-b87c-8110f3e58bf8.png</url>
      <title>DEV Community: Charbel</title>
      <link>https://dev.to/charbull</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/charbull"/>
    <language>en</language>
    <item>
      <title>I Taught a 4B Parameter LLM to Play Wordle on a Mac M4 (Using GRPO)</title>
      <dc:creator>Charbel</dc:creator>
      <pubDate>Tue, 13 Jan 2026 18:26:05 +0000</pubDate>
      <link>https://dev.to/charbull/i-taught-a-4b-parameter-llm-to-play-wordle-on-a-mac-m4-using-grpo-i9k</link>
      <guid>https://dev.to/charbull/i-taught-a-4b-parameter-llm-to-play-wordle-on-a-mac-m4-using-grpo-i9k</guid>
      <description>&lt;p&gt;DeepSeek-R1 changed the conversation. Their paper &lt;a href="https://arxiv.org/abs/2501.12948" rel="noopener noreferrer"&gt;"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning"&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But DeepSeek was trained on massive clusters. I have a &lt;strong&gt;MacBook Pro M4&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I spent a few weeks answering a specific question: &lt;strong&gt;Can we replicate this reasoning behavior on a consumer device, using a small model (Gemma-3 4B), without any supervised fine-tuning (SFT)?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I chose &lt;strong&gt;Wordle&lt;/strong&gt; as the testbed. While simple, it requires state tracking, hypothesis testing, and information theory—a perfect microcosm for testing "reasoning" capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why MLX? (The Technology Stack)
&lt;/h2&gt;

&lt;p&gt;I chose Apple's &lt;strong&gt;MLX&lt;/strong&gt; framework over PyTorch for three specific technical reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Unified Memory Access:&lt;/strong&gt; Training with GRPO requires generating multiple "rollouts" (completions) in parallel. On a standard GPU, moving these massive tensors between VRAM and RAM is a bottleneck. MLX is optimized for the M-series Unified Memory architecture, allowing zero-copy access to arrays.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The Quantization Struggle:&lt;/strong&gt; In the PyTorch ecosystem, libraries like &lt;code&gt;bitsandbytes&lt;/code&gt; (crucial for loading models in 4-bit/8-bit) have historically had unstable support on Apple Silicon.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Forcing Local Constraints:&lt;/strong&gt; Using a cloud GPU is an "escape hatch." By forcing myself to train locally, I had to confront the actual hardware limits (bandwidth vs. capacity) that shape modern LLM architecture.&lt;/li&gt;
&lt;/ol&gt;
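
&lt;p&gt;To make the stack concrete, here is a minimal sketch of loading a 4-bit Gemma-3 with &lt;code&gt;mlx_lm&lt;/code&gt;. The repo name is an assumption; any mlx-community 4-bit Gemma-3 build loads the same way:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from mlx_lm import load, generate

# Assumed mlx-community 4-bit build; swap in whichever quantization you use.
# MLX keeps the weights in Unified Memory, so there is no host-to-device copy.
model, tokenizer = load("mlx-community/gemma-3-4b-it-4bit")

# Smoke test: one completion, fully on-device.
print(generate(model, tokenizer, prompt="Guess a 5-letter word:", max_tokens=32))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;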

&lt;h2&gt;
  
  
  The Challenge: "Straight-to-RL"
&lt;/h2&gt;

&lt;p&gt;Most RL pipelines start with &lt;strong&gt;Supervised Fine-Tuning (SFT)&lt;/strong&gt;. You show the model thousands of expert games, and &lt;em&gt;then&lt;/em&gt; use RL to polish the strategy.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I wanted to test the &lt;strong&gt;"Cold Start"&lt;/strong&gt; problem: can a 4B parameter model learn the rules and strategy of Wordle &lt;em&gt;purely&lt;/em&gt; through trial and error, guided only by a reward function?&lt;/li&gt;
&lt;li&gt;Specifically, I wanted to see if &lt;strong&gt;GRPO (Group Relative Policy Optimization)&lt;/strong&gt; could teach a model the rules &lt;em&gt;and&lt;/em&gt; the strategy simultaneously.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It turns out, skipping SFT with a 4B parameter model is a high-wire act.&lt;/p&gt;

&lt;h2&gt;
  
  
  1: The "Final Final" Loop (Reward Hacking)
&lt;/h2&gt;

&lt;p&gt;An RL agent does not learn what you &lt;em&gt;want&lt;/em&gt; it to learn; it learns what you &lt;em&gt;incentivize&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In my early runs, the model discovered a loophole. It realized that making a "bad guess" (a word that doesn't fit the clues) resulted in a penalty. But it also realized that if it just output garbage, or repeated the word &lt;code&gt;Final&lt;/code&gt; forever (&lt;code&gt;Final Final Final...&lt;/code&gt;), the penalty was sometimes &lt;em&gt;less&lt;/em&gt; severe (or delayed).&lt;/p&gt;

&lt;p&gt;The model converged on a strategy of &lt;strong&gt;inaction&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fix:&lt;/strong&gt; I had to engineer a &lt;code&gt;format_fail_penalty&lt;/code&gt; that was unequivocally the worst possible outcome (-200 reward). I effectively told the model: &lt;em&gt;"You can lose the game, but if you mess up the JSON format or refuse to play, you will regret it."&lt;/em&gt;&lt;/p&gt;
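
&lt;p&gt;Here is a minimal sketch of that shaped reward. Only the -200 format penalty is the value from my runs; &lt;code&gt;parse_json_guess&lt;/code&gt;, the game object, and the other reward values are illustrative placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;FORMAT_FAIL_PENALTY = -200.0  # unequivocally the worst possible outcome

def reward(completion: str, game) -&gt; float:
    guess = parse_json_guess(completion)       # hypothetical JSON parser
    if guess is None:                          # garbage, loops, refusals
        return FORMAT_FAIL_PENALTY             # strictly worse than losing
    if not game.consistent_with_clues(guess):  # hypothetical clue check
        return -50.0                           # bad guess: penalized, still playing
    if guess == game.answer:
        return 100.0                           # win
    return -1.0                                # legal guess: small step cost
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;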

&lt;h2&gt;
  
  
  2: Policy Collapse (Rank 64 vs. Rank 16 at the Same Learning Rate)
&lt;/h2&gt;

&lt;p&gt;There is a misconception that "Higher Rank LoRA = Better."&lt;/p&gt;

&lt;p&gt;I initially tried training with a LoRA Rank of 64 and a standard learning rate. The result was a catastrophic &lt;strong&gt;Policy Collapse&lt;/strong&gt;. The win rate dropped to 0%, and the model's outputs degraded into gibberish.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Insight:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Model Sensitivity:&lt;/strong&gt; Smaller models (4B) are incredibly sensitive to hyperparameter swings compared to the massive reasoning models described in research papers.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Gradient Clipping:&lt;/strong&gt; This became non-negotiable. Without aggressive gradient clipping, the "Straight-to-RL" updates were too volatile, shattering the weights before they could settle.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Rank Reduction:&lt;/strong&gt; Dropping to Rank 16 stabilized the training. It forced the model to learn efficient updates rather than overfitting to the noise of early random exploration.&lt;/li&gt;
&lt;/ol&gt;
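
&lt;p&gt;In MLX terms, the stabilizers look roughly like this. Rank 16 is the value from my runs; the clipping threshold and learning rate are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import mlx.optimizers as optim

LORA_RANK = 16        # rank 64 at the same learning rate collapsed the policy
MAX_GRAD_NORM = 1.0   # assumed threshold; the point is that clipping is on

optimizer = optim.Adam(learning_rate=1e-5)  # illustrative learning rate

def update_step(model, grads):
    # clip_grad_norm returns the clipped gradient tree plus the pre-clip
    # norm; logging that norm helps you spot volatility before it
    # shatters the weights.
    grads, total_norm = optim.clip_grad_norm(grads, MAX_GRAD_NORM)
    optimizer.update(model, grads)
    return total_norm
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;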

&lt;h2&gt;
  
  
  3: The Hardware Bottleneck (KV Cache vs. M4)
&lt;/h2&gt;

&lt;p&gt;I am running this on an M4 Pro with 48GB of Unified Memory using the &lt;strong&gt;MLX&lt;/strong&gt; framework. During my training runs, my tokens-per-second would suddenly drop by 8x. I initially thought it was a memory leak in my code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Culprit: The KV Cache.&lt;/strong&gt;&lt;br&gt;
In GRPO, you generate multiple "rollouts" (completions) for every prompt to calculate the group advantage.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generating text is cheap.&lt;/li&gt;
&lt;li&gt;Generating text &lt;em&gt;inside a gradient tape&lt;/em&gt; with &lt;code&gt;num_generations=4&lt;/code&gt; is expensive.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On Apple Silicon, the &lt;strong&gt;Key-Value (KV) Cache&lt;/strong&gt; grows linearly with the group size. Each parallel generation requires its own massive cache. Once that cache filled the Unified Memory, the system fell back to heavy Swap Memory (20GB+ Swap Used), crippling performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Lesson:&lt;/strong&gt; If you are training locally, &lt;code&gt;num_generations&lt;/code&gt; is your most expensive hyperparameter. I had to tune the batch size and group size specifically to hover around 40GB RAM usage to prevent swapping.&lt;/p&gt;
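
&lt;p&gt;A back-of-envelope formula makes the scaling visible. The architecture numbers below are illustrative placeholders (not Gemma-3's exact configuration); what matters is the linear factor of &lt;code&gt;num_generations&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# KV cache = keys + values, per layer, per KV head, per token, per rollout.
def kv_cache_gb(num_layers=34, num_kv_heads=8, head_dim=128,
                seq_len=2048, num_generations=4, bytes_per_val=2):
    total = (2 * num_layers * num_kv_heads * head_dim
             * seq_len * num_generations * bytes_per_val)
    return total / 1024**3

print(f"{kv_cache_gb():.2f} GB")  # doubles every time you double num_generations
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
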
&lt;h2&gt;
  
  
  4: Prompting (Symbols vs. English)
&lt;/h2&gt;

&lt;p&gt;I originally fed the model raw Wordle grids (e.g., &lt;code&gt;'xxx✓x'&lt;/code&gt;). It struggled to track state. I switched to a structured text summary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**Current Knowledge:**
*   **Correct Position (Green):** `A _ _ _ _`
*   **Wrong Position (Yellow):** O, R, T, U
*   **Not in Word (Gray):** B, E, I, S
*   **Words Already Guessed:** ARISE, ABOUT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Explicitly summarizing the state in natural language gave the model a "scratchpad" to reason from. It transformed the problem from "Visual Pattern Matching" to "Logical Deduction."&lt;/p&gt;
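
&lt;p&gt;The summary itself is mechanical to build. Here is a sketch, assuming per-letter feedback strings like &lt;code&gt;"gyxxy"&lt;/code&gt; (the data layout is hypothetical, not my exact code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def summarize_state(guesses, feedback):
    greens = ["_"] * 5
    yellows, grays = set(), set()
    for word, marks in zip(guesses, feedback):
        for i, (letter, mark) in enumerate(zip(word, marks)):
            if mark == "g":
                greens[i] = letter
            elif mark == "y":
                yellows.add(letter)
            else:
                grays.add(letter)
    # Duplicate letters: don't list a letter as gray if it scored
    # green or yellow elsewhere.
    grays -= yellows | set(greens)
    return (
        "**Current Knowledge:**\n"
        f"*   **Correct Position (Green):** `{' '.join(greens)}`\n"
        f"*   **Wrong Position (Yellow):** {', '.join(sorted(yellows))}\n"
        f"*   **Not in Word (Gray):** {', '.join(sorted(grays))}\n"
        f"*   **Words Already Guessed:** {', '.join(guesses)}"
    )
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;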

&lt;h2&gt;
  
  
  Results &amp;amp; Analysis
&lt;/h2&gt;

&lt;p&gt;My first attempt at training from Turn 1 (starting from scratch) failed. The 4B model was too "dumb" to stumble upon a winning strategy randomly.&lt;/p&gt;

&lt;p&gt;I implemented a &lt;strong&gt;Curriculum Strategy&lt;/strong&gt; to fix this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Single Guess History:&lt;/strong&gt; I first trained on prompts that already had one previous guess. This gave the model enough context to start learning basic constraints.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Random History (0-4 Turns):&lt;/strong&gt; Once the model stabilized, I expanded the dataset to include games with 0 to 4 turns of history.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By feeding the model synthetic data with random histories (0-4 turns), I created a "Zone of Proximal Development" where the model could actually learn.&lt;/p&gt;
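
&lt;p&gt;The curriculum sampler is only a few lines. Here, &lt;code&gt;score_guess&lt;/code&gt; is a hypothetical Wordle scorer and &lt;code&gt;summarize_state&lt;/code&gt; is the sketch from the prompting section above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random

def make_prompt(answer, word_list, max_history=4):
    n_turns = random.randint(0, max_history)     # 0 = true cold start
    history = random.sample(word_list, n_turns)  # synthetic prior guesses
    feedback = [score_guess(g, answer) for g in history]
    return summarize_state(history, feedback)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;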

&lt;p&gt;I evaluated the trained LoRA adapter against the base Gemma-3 model on 150 unseen games.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Win Rate Improvement (Zero-Shot)
&lt;/h3&gt;

&lt;p&gt;Without any game history (starting from scratch), the base model is effectively guessing randomly. The RL training provided a massive boost in reliability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F098gwa6nusl64vux4svr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F098gwa6nusl64vux4svr.png" alt="Win Rate Comparison" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Base Model:&lt;/strong&gt; 4.7% Win Rate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GRPO Trained:&lt;/strong&gt; 16.0% Win Rate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; A &lt;strong&gt;~3.4x improvement&lt;/strong&gt; in win rate without seeing a single expert game.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. The Power of Context (With History)
&lt;/h3&gt;

&lt;p&gt;When provided with partial game history (e.g., entering the game at Turn 3), the model's ability to deduce the answer skyrocketed. This suggests the model learned to &lt;strong&gt;utilize constraints&lt;/strong&gt; (Green/Yellow letters) rather than just memorizing words.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9owvha4gykw0tw9bp8x3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9owvha4gykw0tw9bp8x3.png" alt="Cumulative Wins With History" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GRPO Trained:&lt;/strong&gt; 31.3% Win Rate (Red Line)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Base Model:&lt;/strong&gt; 16.0% Win Rate (Blue Line)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Creativity vs. Consistency (Temperature)
&lt;/h3&gt;

&lt;p&gt;I benchmarked the model at Temperature 0.9 (Creative) vs. 0.1 (Deterministic).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Temp 0.1:&lt;/strong&gt; Consistently outperformed high temperature.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temp 0.9:&lt;/strong&gt; Win rates dropped significantly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Insight:&lt;/strong&gt; For logic/reasoning tasks, "creativity" is often detrimental. The model performs best when forced to be deterministic, reducing the chance of hallucinating a strategy that violates the rules.&lt;/p&gt;
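
&lt;p&gt;In recent &lt;code&gt;mlx_lm&lt;/code&gt; versions (an assumption about your install), sampling is configured through &lt;code&gt;make_sampler&lt;/code&gt;; a sketch of the deterministic setting:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("mlx-community/gemma-3-4b-it-4bit")  # assumed repo name
prompt = "You are playing Wordle. Current Knowledge: ... Your next guess?"

# temp=0.1 was the reliable setting for this logic task; 0.9 cost wins.
sampler = make_sampler(temp=0.1)
print(generate(model, tokenizer, prompt=prompt, sampler=sampler))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;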

&lt;h2&gt;
  
  
  Related Work &amp;amp; Comparison
&lt;/h2&gt;

&lt;p&gt;This project sits at the intersection of two recent approaches to "Reasoning" models:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepSeek-R1 (Zero):&lt;/strong&gt; Uses pure RL with sparse outcome rewards (Win/Loss). This often fails on small models because they never stumble onto the solution (the "Cold Start" problem).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supervised Reinforcement Learning (Deng et al., Oct 2025):&lt;/strong&gt; Solves the Cold Start problem by using Expert Trajectories to provide dense, step-by-step rewards based on similarity to human reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My Approach (Wordle-RL):&lt;/strong&gt; Takes a third path. I solved the Cold Start problem without the Expert Trajectories that Deng et al. require. Instead of supervising with Data, I supervised with Information Theory.&lt;/p&gt;

&lt;p&gt;By calculating the Entropy of every guess, I generated the same kind of "Dense, Step-wise Rewards" that Deng et al. advocate for, but I did it using pure computation rather than human datasets.&lt;/p&gt;
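
&lt;p&gt;Concretely, the entropy of a guess is the entropy of the feedback distribution it induces over the remaining candidates. A sketch (&lt;code&gt;score_guess&lt;/code&gt; is again a hypothetical scorer returning patterns like &lt;code&gt;"gyxxy"&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math
from collections import Counter

def guess_entropy(guess, candidates):
    # Partition the remaining candidates by the feedback pattern this
    # guess would produce, then take the Shannon entropy (in bits).
    patterns = Counter(score_guess(guess, answer) for answer in candidates)
    n = len(candidates)
    return -sum((c / n) * math.log2(c / n) for c in patterns.values())
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;A high-entropy guess splits the candidate set evenly, so it is expected to reveal the most information. Rewarding that per guess is what stands in for Deng et al.'s expert trajectories.&lt;/p&gt;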

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project proves that we don't always need massive clusters to do interesting RL research.&lt;/p&gt;

&lt;p&gt;By combining Apple MLX for efficient local training and Heuristic Rewards (Entropy) as a substitute for expert data, I was able to train a small model to "reason" about game states. It learned to burn guesses to find vowel positions and navigate the trade-off between exploration and exploitation.&lt;/p&gt;

&lt;p&gt;The code is open source. If you have an M-series Mac, you can run this today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Code and Logs are available on GitHub: &lt;a href="https://github.com/charbull/wordle-rl-gemma" rel="noopener noreferrer"&gt;https://github.com/charbull/wordle-rl-gemma&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deng et al. (2025):&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2510.25992" rel="noopener noreferrer"&gt;Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning&lt;/a&gt; (An alternative approach using data instead of math for dense rewards).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-R1:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2501.12948" rel="noopener noreferrer"&gt;Incentivizing Reasoning Capability in LLMs via Reinforcement Learning&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>deepseek</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>applesilicon</category>
    </item>
  </channel>
</rss>
