I Taught a 4B Parameter LLM to Play Wordle on a Mac M4 (Using GRPO)

DeepSeek-R1 changed the conversation. Their paper, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning," showed that reasoning behavior can emerge from reinforcement learning alone, with no supervised fine-tuning to bootstrap it.

But DeepSeek was trained on massive clusters. I have a MacBook Pro M4.

I spent a few weeks answering a specific question: Can we replicate this reasoning behavior on a consumer device, using a small model (Gemma-3 4B), without any supervised fine-tuning (SFT)?

I chose Wordle as the testbed. While simple, it requires state tracking, hypothesis testing, and information theory—a perfect microcosm for testing "reasoning" capabilities.

Why MLX? (The Technology Stack)

I chose Apple's MLX framework over PyTorch for three specific technical reasons:

  1. Unified Memory Access: Training with GRPO requires generating multiple "rollouts" (completions) in parallel. On a standard GPU, moving these massive tensors between VRAM and RAM is a bottleneck. MLX is optimized for the M-series Unified Memory architecture, allowing zero-copy access to arrays.
  2. The Quantization Struggle: In the PyTorch ecosystem, libraries like bitsandbytes (crucial for loading models in 4-bit/8-bit) have historically had unstable support on Apple Silicon. MLX has quantization built in, so a 4-bit Gemma-3 loads without fighting the tooling (see the loading snippet after this list).
  3. Forcing Local Constraints: Using a cloud GPU is an "escape hatch." By forcing myself to train locally, I had to confront the actual hardware limits (bandwidth vs. capacity) that shape modern LLM architecture.
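
As a concrete starting point, here is a minimal sketch of loading a 4-bit quantized checkpoint with mlx-lm. The model ID below is an assumption (any converted Gemma-3 4B checkpoint works); the rest is the standard mlx-lm loading path.

```python
# Minimal sketch: load a 4-bit Gemma-3 checkpoint with mlx-lm and run one generation.
# The model ID is an assumption; point it at whatever converted checkpoint you use.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-3-4b-it-4bit")  # 4-bit weights sit comfortably in Unified Memory

prompt = 'You are playing Wordle. Reply with your first guess as JSON: {"guess": "..."}'
print(generate(model, tokenizer, prompt=prompt, max_tokens=128))
```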

The Challenge: "Straight-to-RL"

Most RL pipelines start with Supervised Fine-Tuning (SFT). You show the model thousands of expert games, and then use RL to polish the strategy.

  • I wanted to test the "Cold Start" problem: can a 4B parameter model learn the rules and strategy of Wordle purely through trial and error, guided only by a reward function?
  • I wanted to see whether GRPO (Group Relative Policy Optimization) could teach the rules and the strategy simultaneously, without a single expert example.

It turns out, skipping SFT with a 4B parameter model is a high-wire act.

1. The "Final Final" Loop (Reward Hacking)

An RL agent does not learn what you want it to learn; it learns what you incentivize.

In my early runs, the model discovered a loophole. It realized that making a "bad guess" (a word that doesn't fit the clues) resulted in a penalty. But it also realized that if it just output garbage or looped "Final Final Final..." forever, the penalty was sometimes less severe (or delayed).

The model converged on a strategy of inaction.

The Fix: I had to engineer a format_fail_penalty that was unequivocally the worst possible outcome (-200 reward). I effectively told the model: "You can lose the game, but if you mess up the JSON format or refuse to play, you will regret it."
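
The full reward shaping lives in the repo; the sketch below only illustrates the ordering. The helper names (`parse_json_guess`, `is_valid_word`) and the intermediate values are hypothetical, and only the -200 figure comes from the runs described above.

```python
# Sketch of the reward ordering (hypothetical helpers and values; only -200 is from the actual runs).
# Key invariant: breaking the format or refusing to play must be strictly worse than any legal loss.
FORMAT_FAIL_PENALTY = -200.0   # the worst possible outcome, by construction

def reward(completion: str, game) -> float:
    guess = parse_json_guess(completion)          # hypothetical: returns None on malformed output
    if guess is None or not is_valid_word(guess):
        return FORMAT_FAIL_PENALTY                # gibberish, "Final Final Final...", broken JSON
    if guess == game.answer:
        return 100.0                              # solved it
    if guess in game.previous_guesses:
        return -50.0                              # stalling by repeating is penalized, but survivable
    return -10.0                                  # a legal but wrong guess: mild penalty, optionally shaped further
```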

2. Policy Collapse: Rank 64 vs. Rank 16 at the Same Learning Rate

There is a misconception that "Higher Rank LoRA = Better."

I initially tried training with a LoRA Rank of 64 and a standard learning rate. The result was a catastrophic Policy Collapse. The win rate dropped to 0%, and the model's outputs degraded into gibberish.

The Insight:

  1. Model Sensitivity: Smaller models (4B) are incredibly sensitive to hyperparameter swings compared to the massive reasoning models described in research papers.
  2. Gradient Clipping: This became non-negotiable. Without aggressive gradient clipping, the "Straight-to-RL" updates were too volatile, shattering the weights before they could settle.
  3. Rank Reduction: Dropping to Rank 16 stabilized the training. It forced the model to learn efficient updates rather than overfitting to the noise of early random exploration. (A rough sketch of the stabilized settings follows this list.)
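
For reference, here is a rough sketch of what the stabilized setup looks like with MLX's optimizer utilities. The specific values are illustrative, not the repo's exact configuration.

```python
# Illustrative stabilization settings (example values, not the repo's exact config).
import mlx.optimizers as optim

LORA_RANK = 16          # rank 64 collapsed; rank 16 held
LEARNING_RATE = 1e-5    # small learning rate for a small, sensitive model
MAX_GRAD_NORM = 0.5     # aggressive clipping is non-negotiable for straight-to-RL

optimizer = optim.Adam(learning_rate=LEARNING_RATE)

def apply_update(model, grads):
    # Clip the global gradient norm before the update so one bad rollout batch
    # cannot shatter the LoRA weights.
    grads, total_norm = optim.clip_grad_norm(grads, MAX_GRAD_NORM)
    optimizer.update(model, grads)
    return total_norm
```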

3. The Hardware Bottleneck (KV Cache vs. M4)

I am running this on an M4 Pro with 48GB of Unified Memory using the MLX framework. During my training runs, my tokens-per-second would suddenly drop by 8x. I initially thought it was a memory leak in my code.

The Culprit: The KV Cache.
In GRPO, you generate multiple "rollouts" (completions) for every prompt to calculate the group advantage.

  • Generating text is cheap.
  • Generating text inside a gradient tape with num_generations=4 is expensive.

On Apple Silicon, the Key-Value (KV) Cache grows linearly with the group size. Each parallel generation requires its own massive cache. Once that cache filled the Unified Memory, the system fell back to heavy Swap Memory (20GB+ Swap Used), crippling performance.

The Lesson: If you are training locally, num_generations is your most expensive hyperparameter. I had to tune the batch size and group size specifically to hover around 40GB RAM usage to prevent swapping.
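
To build intuition for why `num_generations` bites so hard, here is a back-of-the-envelope estimator. The architecture numbers are placeholders, not Gemma-3's exact values; read the real ones from the model's config.

```python
# Back-of-the-envelope KV cache size: grows linearly with both sequence length and group size.
# Architecture numbers below are placeholders; substitute the values from the model's config.json.
def kv_cache_gb(num_layers=32, num_kv_heads=8, head_dim=128,
                seq_len=2048, num_generations=4, bytes_per_value=2):
    per_rollout = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value  # 2x for keys and values
    return per_rollout * num_generations / 1024**3

print(f"{kv_cache_gb():.2f} GB")                   # one group of 4 rollouts
print(f"{kv_cache_gb(num_generations=8):.2f} GB")  # doubling the group doubles the cache
```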

4. Prompting: Symbols vs. English

I originally fed the model raw Wordle grids (e.g., 'xxx✓x'). It struggled to track state. I switched to a structured text summary:

**Current Knowledge:**
*   **Correct Position (Green):** `A _ _ _ _`
*   **Wrong Position (Yellow):** O, R, T, U
*   **Not in Word (Gray):** B, E, I, S
*   **Words Already Guessed:** ARISE, ABOUT

Explicitly summarizing the state in natural language gave the model a "scratchpad" to reason from. It transformed the problem from "Visual Pattern Matching" to "Logical Deduction."
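
A sketch of how that summary can be rendered from the raw game state is shown below. The field layout mirrors the example above; the function signature is illustrative, not the repo's exact schema.

```python
# Sketch: render raw Wordle feedback into the structured natural-language summary shown above.
# The signature is illustrative; the repo's actual state representation may differ.
def format_state(greens: dict, yellows: set, grays: set, guessed: list) -> str:
    """greens maps position -> letter; yellows/grays are sets of letters; guessed is a list of words."""
    green_row = " ".join(greens.get(i, "_") for i in range(5))
    return (
        "**Current Knowledge:**\n"
        f"*   **Correct Position (Green):** `{green_row}`\n"
        f"*   **Wrong Position (Yellow):** {', '.join(sorted(yellows)) or 'None'}\n"
        f"*   **Not in Word (Gray):** {', '.join(sorted(grays)) or 'None'}\n"
        f"*   **Words Already Guessed:** {', '.join(guessed) or 'None'}"
    )

print(format_state({0: "A"}, {"O", "R", "T", "U"}, {"B", "E", "I", "S"}, ["ARISE", "ABOUT"]))
```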

Results & Analysis

My first attempt at training from Turn 1 (starting from scratch) failed. The 4B model was too "dumb" to stumble upon a winning strategy randomly.

I implemented a Curriculum Strategy to fix this:

  1. Single Guess History: I first trained on prompts that already had one previous guess. This gave the model enough context to start learning basic constraints.
  2. Random History (0-4 Turns): Once the model stabilized, I expanded the dataset to include games with 0 to 4 turns of history.

By feeding the model synthetic data with random histories (0-4 turns), I created a "Zone of Proximal Development" where the model could actually learn.
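
A minimal sketch of how such random-history examples can be synthesized (reusing the `format_state` helper from the earlier snippet; the feedback logic here is deliberately simplified and ignores duplicate letters):

```python
# Sketch: build curriculum examples with a random number of prior turns (0-4).
# Simplified feedback accumulation; duplicate-letter edge cases are ignored for clarity.
import random

def make_training_example(word_list, max_history=4):
    answer = random.choice(word_list)
    history = random.sample([w for w in word_list if w != answer],
                            k=random.randint(0, max_history))
    greens, yellows, grays = {}, set(), set()
    for guess in history:
        for i, letter in enumerate(guess):
            if answer[i] == letter:
                greens[i] = letter
            elif letter in answer:
                yellows.add(letter)
            else:
                grays.add(letter)
    return format_state(greens, yellows, grays, history), answer
```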

I evaluated the trained LoRA adapter against the base Gemma-3 model on 150 unseen games.

1. Win Rate Improvement (Zero-Shot)

Without any game history (starting from scratch), the base model is effectively guessing randomly. The RL training provided a massive boost in reliability.

Win Rate Comparison

  • Base Model: 4.7% Win Rate
  • GRPO Trained: 16.0% Win Rate
  • Result: A ~3.4x improvement in win rate, without the model ever seeing a single expert game.

2. The Power of Context (With History)

When provided with partial game history (e.g., entering the game at Turn 3), the model's ability to deduce the answer skyrocketed. This proves the model learned to utilize constraints (Green/Yellow letters) rather than just memorizing words.

Cumulative Wins With History

  • GRPO Trained: 31.3% Win Rate
  • Base Model: 16.0% Win Rate

3. Creativity vs. Consistency (Temperature)

I benchmarked the model at Temperature 0.9 (Creative) vs. 0.1 (Deterministic).

  • Temp 0.1: Consistently outperformed high temperature.
  • Temp 0.9: Win rates dropped significantly.

Insight: For logic/reasoning tasks, "creativity" is often detrimental. The model performs best when forced to be deterministic, reducing the chance of hallucinating a strategy that violates the rules.

Related Work & Comparison

This project sits at the intersection of two recent approaches to "Reasoning" models:

DeepSeek-R1 (Zero): Uses pure RL with sparse outcome rewards (Win/Loss). This often fails on small models because they never stumble onto the solution (the "Cold Start" problem).

Supervised Reinforcement Learning (Deng et al., Oct 2025): Solves the Cold Start problem by using Expert Trajectories to provide dense, step-by-step rewards based on similarity to human reasoning.

My Approach (Wordle-RL) takes a third path. I solved the Cold Start problem without the expert trajectories that Deng et al. require. Instead of supervising with data, I supervised with information theory.

By calculating the Entropy of every guess, I generated the same kind of "Dense, Step-wise Rewards" that Deng et al. advocate for, but I did it using pure computation rather than human datasets.
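
Concretely, the reward for a guess can be the Shannon entropy of the feedback-pattern distribution it induces over the words still consistent with previous clues. A minimal sketch (with a simplified feedback function that ignores duplicate letters, not the repo's exact implementation):

```python
# Sketch: a dense, step-wise reward from information theory instead of expert data.
# A guess that splits the remaining candidates into many patterns carries more information.
import math
from collections import Counter

def feedback(guess: str, answer: str) -> str:
    # 'g' = right letter/right spot, 'y' = in word/wrong spot, 'x' = absent (simplified; ignores duplicates)
    return "".join("g" if g == a else ("y" if g in answer else "x")
                   for g, a in zip(guess, answer))

def entropy_reward(guess: str, candidates: list) -> float:
    counts = Counter(feedback(guess, answer) for answer in candidates)
    total = len(candidates)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# "arise" splits these four candidates into four distinct patterns -> 2.0 bits of information.
print(entropy_reward("arise", ["about", "actor", "argue", "aside"]))
```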

Conclusion

This project proves that we don't always need massive clusters to do interesting RL research.

By combining Apple MLX for efficient local training and Heuristic Rewards (Entropy) as a substitute for expert data, I was able to train a small model to "reason" about game states. It learned to burn guesses to find vowel positions and navigate the trade-off between exploration and exploitation.

The code is open source. If you have an M-series Mac, you can run this today.

The code and logs are available on GitHub: https://github.com/charbull/wordle-rl-gemma

