<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Charbel</title>
    <description>The latest articles on DEV Community by Charbel (@charbull).</description>
    <link>https://dev.to/charbull</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3707644%2F81e2b118-4e82-4758-b87c-8110f3e58bf8.png</url>
      <title>DEV Community: Charbel</title>
      <link>https://dev.to/charbull</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/charbull"/>
    <language>en</language>
    <item>
      <title>I Taught a 4B Parameter LLM to Play Wordle on a Mac M4 (Using GRPO)</title>
      <dc:creator>Charbel</dc:creator>
      <pubDate>Tue, 13 Jan 2026 18:26:05 +0000</pubDate>
      <link>https://dev.to/charbull/i-taught-a-4b-parameter-llm-to-play-wordle-on-a-mac-m4-using-grpo-i9k</link>
      <guid>https://dev.to/charbull/i-taught-a-4b-parameter-llm-to-play-wordle-on-a-mac-m4-using-grpo-i9k</guid>
      <description>&lt;p&gt;DeepSeek-R1 changed the conversation. Their paper &lt;a href="https://arxiv.org/abs/2501.12948" rel="noopener noreferrer"&gt;"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning"&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But DeepSeek was trained on massive clusters. I have a &lt;strong&gt;MacBook Pro M4&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I spent a few weeks answering a specific question: &lt;strong&gt;Can we replicate this reasoning behavior on a consumer device, using a small model (Gemma-3 4B), without any supervised fine-tuning (SFT)?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I chose &lt;strong&gt;Wordle&lt;/strong&gt; as the testbed. While simple, it requires state tracking, hypothesis testing, and information theory—a perfect microcosm for testing "reasoning" capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why MLX? (The Technology Stack)
&lt;/h2&gt;

&lt;p&gt;I chose Apple's &lt;strong&gt;MLX&lt;/strong&gt; framework over PyTorch for three specific technical reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Unified Memory Access:&lt;/strong&gt; Training with GRPO requires generating multiple "rollouts" (completions) in parallel. On a standard GPU, moving these massive tensors between VRAM and RAM is a bottleneck. MLX is optimized for the M-series Unified Memory architecture, allowing zero-copy access to arrays.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The Quantization Struggle:&lt;/strong&gt; In the PyTorch ecosystem, libraries like &lt;code&gt;bitsandbytes&lt;/code&gt; (crucial for loading models in 4-bit/8-bit) have historically had unstable support on Apple Silicon.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Forcing Local Constraints:&lt;/strong&gt; Using a cloud GPU is an "escape hatch." By forcing myself to train locally, I had to confront the actual hardware limits (bandwidth vs. capacity) that shape modern LLM architecture.&lt;/li&gt;
&lt;/ol&gt;
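
&lt;p&gt;To make the stack concrete, here is a minimal sketch of loading a 4-bit Gemma-3 with &lt;code&gt;mlx_lm&lt;/code&gt;. The repo name is an assumption; any mlx-community 4-bit Gemma-3 build loads the same way:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from mlx_lm import load, generate

# Assumed mlx-community 4-bit build; swap in whichever quantization you use.
# MLX keeps the weights in Unified Memory, so there is no host-to-device copy.
model, tokenizer = load("mlx-community/gemma-3-4b-it-4bit")

# Smoke test: one completion, fully on-device.
print(generate(model, tokenizer, prompt="Guess a 5-letter word:", max_tokens=32))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;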

&lt;h2&gt;
  
  
  The Challenge: "Straight-to-RL"
&lt;/h2&gt;

&lt;p&gt;Most RL pipelines start with &lt;strong&gt;Supervised Fine-Tuning (SFT)&lt;/strong&gt;. You show the model thousands of expert games, and &lt;em&gt;then&lt;/em&gt; use RL to polish the strategy.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I wanted to test the &lt;strong&gt;"Cold Start"&lt;/strong&gt; problem: can a 4B parameter model learn the rules and strategy of Wordle &lt;em&gt;purely&lt;/em&gt; through trial and error, guided only by a reward function?&lt;/li&gt;
&lt;li&gt;Specifically, I wanted to see if &lt;strong&gt;GRPO (Group Relative Policy Optimization)&lt;/strong&gt; could teach a model the rules &lt;em&gt;and&lt;/em&gt; the strategy simultaneously.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It turns out, skipping SFT with a 4B parameter model is a high-wire act.&lt;/p&gt;

&lt;h2&gt;
  
  
  1: The "Final Final" Loop (Reward Hacking)
&lt;/h2&gt;

&lt;p&gt;An RL agent does not learn what you &lt;em&gt;want&lt;/em&gt; it to learn; it learns what you &lt;em&gt;incentivize&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In my early runs, the model discovered a loophole. It realized that making a "bad guess" (a word that doesn't fit the clues) resulted in a penalty. But it also realized that if it just output garbage, or repeated the word &lt;code&gt;Final&lt;/code&gt; forever (&lt;code&gt;Final Final Final...&lt;/code&gt;), the penalty was sometimes &lt;em&gt;less&lt;/em&gt; severe (or delayed).&lt;/p&gt;

&lt;p&gt;The model converged on a strategy of &lt;strong&gt;inaction&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fix:&lt;/strong&gt; I had to engineer a &lt;code&gt;format_fail_penalty&lt;/code&gt; that was unequivocally the worst possible outcome (-200 reward). I effectively told the model: &lt;em&gt;"You can lose the game, but if you mess up the JSON format or refuse to play, you will regret it."&lt;/em&gt;&lt;/p&gt;
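
&lt;p&gt;Here is a minimal sketch of that shaped reward. Only the -200 format penalty is the value from my runs; &lt;code&gt;parse_json_guess&lt;/code&gt;, the game object, and the other reward values are illustrative placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;FORMAT_FAIL_PENALTY = -200.0  # unequivocally the worst possible outcome

def reward(completion: str, game) -&gt; float:
    guess = parse_json_guess(completion)       # hypothetical JSON parser
    if guess is None:                          # garbage, loops, refusals
        return FORMAT_FAIL_PENALTY             # strictly worse than losing
    if not game.consistent_with_clues(guess):  # hypothetical clue check
        return -50.0                           # bad guess: penalized, still playing
    if guess == game.answer:
        return 100.0                           # win
    return -1.0                                # legal guess: small step cost
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;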

&lt;h2&gt;
  
  
  2: Policy Collapse (Rank 64 vs. Rank 16 at the Same Learning Rate)
&lt;/h2&gt;

&lt;p&gt;There is a misconception that "Higher Rank LoRA = Better."&lt;/p&gt;

&lt;p&gt;I initially tried training with a LoRA Rank of 64 and a standard learning rate. The result was a catastrophic &lt;strong&gt;Policy Collapse&lt;/strong&gt;. The win rate dropped to 0%, and the model's outputs degraded into gibberish.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Insight:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Model Sensitivity:&lt;/strong&gt; Smaller models (4B) are incredibly sensitive to hyperparameter swings compared to the massive reasoning models described in research papers.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Gradient Clipping:&lt;/strong&gt; This became non-negotiable. Without aggressive gradient clipping, the "Straight-to-RL" updates were too volatile, shattering the weights before they could settle.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Rank Reduction:&lt;/strong&gt; Dropping to Rank 16 stabilized the training. It forced the model to learn efficient updates rather than overfitting to the noise of early random exploration.&lt;/li&gt;
&lt;/ol&gt;
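
&lt;p&gt;In MLX terms, the stabilizers look roughly like this. Rank 16 is the value from my runs; the clipping threshold and learning rate are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import mlx.optimizers as optim

LORA_RANK = 16        # rank 64 at the same learning rate collapsed the policy
MAX_GRAD_NORM = 1.0   # assumed threshold; the point is that clipping is on

optimizer = optim.Adam(learning_rate=1e-5)  # illustrative learning rate

def update_step(model, grads):
    # clip_grad_norm returns the clipped gradient tree plus the pre-clip
    # norm; logging that norm helps you spot volatility before it
    # shatters the weights.
    grads, total_norm = optim.clip_grad_norm(grads, MAX_GRAD_NORM)
    optimizer.update(model, grads)
    return total_norm
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;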

&lt;h2&gt;
  
  
  3: The Hardware Bottleneck (KV Cache vs. M4)
&lt;/h2&gt;

&lt;p&gt;I am running this on an M4 Pro with 48GB of Unified Memory using the &lt;strong&gt;MLX&lt;/strong&gt; framework. During my training runs, my tokens-per-second would suddenly drop by 8x. I initially thought it was a memory leak in my code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Culprit: The KV Cache.&lt;/strong&gt;&lt;br&gt;
In GRPO, you generate multiple "rollouts" (completions) for every prompt to calculate the group advantage.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generating text is cheap.&lt;/li&gt;
&lt;li&gt;Generating text &lt;em&gt;inside a gradient tape&lt;/em&gt; with &lt;code&gt;num_generations=4&lt;/code&gt; is expensive.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On Apple Silicon, the &lt;strong&gt;Key-Value (KV) Cache&lt;/strong&gt; grows linearly with the group size. Each parallel generation requires its own massive cache. Once that cache filled the Unified Memory, the system fell back to heavy Swap Memory (20GB+ Swap Used), crippling performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Lesson:&lt;/strong&gt; If you are training locally, &lt;code&gt;num_generations&lt;/code&gt; is your most expensive hyperparameter. I had to tune the batch size and group size specifically to hover around 40GB RAM usage to prevent swapping.&lt;/p&gt;
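
&lt;p&gt;A back-of-envelope formula makes the scaling visible. The architecture numbers below are illustrative placeholders (not Gemma-3's exact configuration); what matters is the linear factor of &lt;code&gt;num_generations&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# KV cache = keys + values, per layer, per KV head, per token, per rollout.
def kv_cache_gb(num_layers=34, num_kv_heads=8, head_dim=128,
                seq_len=2048, num_generations=4, bytes_per_val=2):
    total = (2 * num_layers * num_kv_heads * head_dim
             * seq_len * num_generations * bytes_per_val)
    return total / 1024**3

print(f"{kv_cache_gb():.2f} GB")  # doubles every time you double num_generations
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
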
&lt;h2&gt;
  
  
  4: Prompting (Symbols vs. English)
&lt;/h2&gt;

&lt;p&gt;I originally fed the model raw Wordle grids (e.g., &lt;code&gt;'xxx✓x'&lt;/code&gt;). It struggled to track state. I switched to a structured text summary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**Current Knowledge:**
*   **Correct Position (Green):** `A _ _ _ _`
*   **Wrong Position (Yellow):** O, R, T, U
*   **Not in Word (Gray):** B, E, I, S
*   **Words Already Guessed:** ARISE, ABOUT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Explicitly summarizing the state in natural language gave the model a "scratchpad" to reason from. It transformed the problem from "Visual Pattern Matching" to "Logical Deduction."&lt;/p&gt;
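
&lt;p&gt;The summary itself is mechanical to build. Here is a sketch, assuming per-letter feedback strings like &lt;code&gt;"gyxxy"&lt;/code&gt; (the data layout is hypothetical, not my exact code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def summarize_state(guesses, feedback):
    greens = ["_"] * 5
    yellows, grays = set(), set()
    for word, marks in zip(guesses, feedback):
        for i, (letter, mark) in enumerate(zip(word, marks)):
            if mark == "g":
                greens[i] = letter
            elif mark == "y":
                yellows.add(letter)
            else:
                grays.add(letter)
    # Duplicate letters: don't list a letter as gray if it scored
    # green or yellow elsewhere.
    grays -= yellows | set(greens)
    return (
        "**Current Knowledge:**\n"
        f"*   **Correct Position (Green):** `{' '.join(greens)}`\n"
        f"*   **Wrong Position (Yellow):** {', '.join(sorted(yellows))}\n"
        f"*   **Not in Word (Gray):** {', '.join(sorted(grays))}\n"
        f"*   **Words Already Guessed:** {', '.join(guesses)}"
    )
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;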

&lt;h2&gt;
  
  
  Results &amp;amp; Analysis
&lt;/h2&gt;

&lt;p&gt;My first attempt at training from Turn 1 (starting from scratch) failed. The 4B model was too "dumb" to stumble upon a winning strategy randomly.&lt;/p&gt;

&lt;p&gt;I implemented a &lt;strong&gt;Curriculum Strategy&lt;/strong&gt; to fix this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Single Guess History:&lt;/strong&gt; I first trained on prompts that already had one previous guess. This gave the model enough context to start learning basic constraints.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Random History (0-4 Turns):&lt;/strong&gt; Once the model stabilized, I expanded the dataset to include games with 0 to 4 turns of history.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By feeding the model synthetic data with random histories (0-4 turns), I created a "Zone of Proximal Development" where the model could actually learn.&lt;/p&gt;
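
&lt;p&gt;The curriculum sampler is only a few lines. Here, &lt;code&gt;score_guess&lt;/code&gt; is a hypothetical Wordle scorer and &lt;code&gt;summarize_state&lt;/code&gt; is the sketch from the prompting section above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random

def make_prompt(answer, word_list, max_history=4):
    n_turns = random.randint(0, max_history)     # 0 = true cold start
    history = random.sample(word_list, n_turns)  # synthetic prior guesses
    feedback = [score_guess(g, answer) for g in history]
    return summarize_state(history, feedback)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;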

&lt;p&gt;I evaluated the trained LoRA adapter against the base Gemma-3 model on 150 unseen games.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Win Rate Improvement (Zero-Shot)
&lt;/h3&gt;

&lt;p&gt;Without any game history (starting from scratch), the base model is effectively guessing randomly. The RL training provided a massive boost in reliability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F098gwa6nusl64vux4svr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F098gwa6nusl64vux4svr.png" alt="Win Rate Comparison" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Base Model:&lt;/strong&gt; 4.7% Win Rate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GRPO Trained:&lt;/strong&gt; 16.0% Win Rate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; A &lt;strong&gt;~3.4x improvement&lt;/strong&gt; in win rate without seeing a single expert game.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. The Power of Context (With History)
&lt;/h3&gt;

&lt;p&gt;When provided with partial game history (e.g., entering the game at Turn 3), the model's ability to deduce the answer skyrocketed. This suggests the model learned to &lt;strong&gt;utilize constraints&lt;/strong&gt; (Green/Yellow letters) rather than just memorizing words.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9owvha4gykw0tw9bp8x3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9owvha4gykw0tw9bp8x3.png" alt="Cumulative Wins With History" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GRPO Trained:&lt;/strong&gt; 31.3% Win Rate (Red Line)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Base Model:&lt;/strong&gt; 16.0% Win Rate (Blue Line)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Creativity vs. Consistency (Temperature)
&lt;/h3&gt;

&lt;p&gt;I benchmarked the model at Temperature 0.9 (Creative) vs. 0.1 (Deterministic).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Temp 0.1:&lt;/strong&gt; Consistently outperformed high temperature.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temp 0.9:&lt;/strong&gt; Win rates dropped significantly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Insight:&lt;/strong&gt; For logic/reasoning tasks, "creativity" is often detrimental. The model performs best when forced to be deterministic, reducing the chance of hallucinating a strategy that violates the rules.&lt;/p&gt;
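
&lt;p&gt;In recent &lt;code&gt;mlx_lm&lt;/code&gt; versions (an assumption about your install), sampling is configured through &lt;code&gt;make_sampler&lt;/code&gt;; a sketch of the deterministic setting:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("mlx-community/gemma-3-4b-it-4bit")  # assumed repo name
prompt = "You are playing Wordle. Current Knowledge: ... Your next guess?"

# temp=0.1 was the reliable setting for this logic task; 0.9 cost wins.
sampler = make_sampler(temp=0.1)
print(generate(model, tokenizer, prompt=prompt, sampler=sampler))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;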

&lt;h2&gt;
  
  
  Related Work &amp;amp; Comparison
&lt;/h2&gt;

&lt;p&gt;This project sits at the intersection of two recent approaches to "Reasoning" models:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepSeek-R1 (Zero):&lt;/strong&gt; Uses pure RL with sparse outcome rewards (Win/Loss). This often fails on small models because they never stumble onto the solution (the "Cold Start" problem).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supervised Reinforcement Learning (Deng et al., Oct 2025):&lt;/strong&gt; Solves the Cold Start problem by using Expert Trajectories to provide dense, step-by-step rewards based on similarity to human reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My Approach (Wordle-RL):&lt;/strong&gt; Takes a third path. I solved the Cold Start problem without the Expert Trajectories that Deng et al. require. Instead of supervising with Data, I supervised with Information Theory.&lt;/p&gt;

&lt;p&gt;By calculating the Entropy of every guess, I generated the same kind of "Dense, Step-wise Rewards" that Deng et al. advocate for, but I did it using pure computation rather than human datasets.&lt;/p&gt;
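
&lt;p&gt;Concretely, the entropy of a guess is the entropy of the feedback distribution it induces over the remaining candidates. A sketch (&lt;code&gt;score_guess&lt;/code&gt; is again a hypothetical scorer returning patterns like &lt;code&gt;"gyxxy"&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math
from collections import Counter

def guess_entropy(guess, candidates):
    # Partition the remaining candidates by the feedback pattern this
    # guess would produce, then take the Shannon entropy (in bits).
    patterns = Counter(score_guess(guess, answer) for answer in candidates)
    n = len(candidates)
    return -sum((c / n) * math.log2(c / n) for c in patterns.values())
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;A high-entropy guess splits the candidate set evenly, so it is expected to reveal the most information. Rewarding that per guess is what stands in for Deng et al.'s expert trajectories.&lt;/p&gt;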

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project proves that we don't always need massive clusters to do interesting RL research.&lt;/p&gt;

&lt;p&gt;By combining Apple MLX for efficient local training and Heuristic Rewards (Entropy) as a substitute for expert data, I was able to train a small model to "reason" about game states. It learned to burn guesses to find vowel positions and navigate the trade-off between exploration and exploitation.&lt;/p&gt;

&lt;p&gt;The code is open source. If you have an M-series Mac, you can run this today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Code and Logs are available on GitHub: &lt;a href="https://github.com/charbull/wordle-rl-gemma" rel="noopener noreferrer"&gt;https://github.com/charbull/wordle-rl-gemma&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deng et al. (2025):&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2510.25992" rel="noopener noreferrer"&gt;Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning&lt;/a&gt; (An alternative approach using data instead of math for dense rewards).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-R1:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2501.12948" rel="noopener noreferrer"&gt;Incentivizing Reasoning Capability in LLMs via Reinforcement Learning&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>deepseek</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>applesilicon</category>
    </item>
  </channel>
</rss>
