A3E Ecosystem

Replacing Strategy Gates With Intelligence Amplifiers: A Reviewer Stage for LLM-Driven Trading

Most automated trading stacks look like this:

signal -> filter_gates -> (pass or drop) -> execution

Each gate is a hard-coded rule. Volume above X. RSI below Y. Correlation below Z.
The problem is the gates have no context. A rule that says "skip signals during
low-volume hours" will drop the one asymmetric setup that only fires during
low-volume hours.

We spent the last two weeks rewriting this. The new shape looks like:

signal -> context_pack -> reviewer(LLM) -> (accept | modify | veto) -> execution

The gates are still there as a cheap pre-filter, but they no longer make the
final decision. A reviewer stage does. This post is about why we changed,
what the reviewer actually does, and how we made it safe to put an LLM on
the hot path without it becoming a single point of failure.

Why gates aren't enough

Rule-based gates are great at one thing: keeping the obviously wrong stuff
out. If your signal's 24h volume is below some floor, you don't want to
trade it. No argument.

They fall apart on the trades that matter. The trades that matter are
usually the ones that don't look like the training set. They're the 2am
breakout on a coin that just got listed on a new venue, or the unusual
order-book shape you've never seen before, or the third identical signal
from the same strategy that just stopped working last week and you
haven't noticed yet.

A gate can't tell the difference between "this is the exception the rule
was designed to handle" and "this is a new regime we haven't modeled yet".
An LLM with the right context can, at least some of the time.

What the reviewer actually sees

The reviewer is not a chat model. It's a scoped prompt that gets a context
pack and returns a structured verdict. The pack is built at signal time
and contains:

  • The signal itself (strategy id, direction, size hint, entry/SL/TP)
  • Recent trades from the same strategy (last 20, with outcomes)
  • A regime tag (trending, mean-reverting, chop) derived from realized vol
  • Top-5 correlated open positions across the portfolio
  • News flags for the symbol in the last 4 hours (if any)
  • A one-line note from the gate stage explaining why it passed
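A minimal sketch of assembling that pack might look like the following. All field and method names here are illustrative stand-ins, not our production schema; the helpers (`trade_log`, `regime_tagger`, etc.) are assumed interfaces:

```python
from dataclasses import dataclass

# Illustrative context-pack shape; field names are hypothetical,
# not the actual production schema.
@dataclass
class ContextPack:
    signal: dict                # strategy id, direction, size hint, entry/SL/TP
    recent_trades: list         # last 20 trades from the same strategy
    regime: str                 # "trending" | "mean_reverting" | "chop"
    correlated_positions: list  # top-5 correlated open positions
    news_flags: list            # news for the symbol in the last 4 hours
    gate_note: str              # one-line note from the gate stage

def build_context_pack(signal, trade_log, regime_tagger, portfolio, news):
    """Built once at signal time, so the reviewer sees a frozen snapshot."""
    return ContextPack(
        signal=signal,
        recent_trades=trade_log.last_n(signal["strategy_id"], n=20),
        regime=regime_tagger.tag(signal["symbol"]),
        correlated_positions=portfolio.top_correlated(signal["symbol"], k=5),
        news_flags=news.recent(signal["symbol"], hours=4),
        gate_note=signal.get("gate_note", ""),
    )
```

The point of a frozen snapshot: the reviewer's verdict is reproducible from the pack alone, which makes the logged rationales auditable later.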

The output shape is strict JSON with three possible verdicts and a
confidence:

{
  "verdict": "accept" | "modify" | "veto",
  "confidence": 0.0,
  "rationale": "...",
  "modified_size_pct": null,
  "modified_sl": null
}

That's it. No free-form reasoning in production. The model can write
whatever rationale it wants, but the executor only reads verdict,
confidence, modified_size_pct, and modified_sl. Everything else
gets logged for review.
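The executor-side parse could be sketched like this — a hypothetical `parse_verdict` that keeps only the four fields the executor acts on and rejects anything malformed:

```python
import json

ALLOWED_VERDICTS = {"accept", "modify", "veto"}

def parse_verdict(raw: str) -> dict:
    """Parse the reviewer's JSON and keep only the fields the executor
    is allowed to act on. Everything else (rationale included) is
    dropped here and logged upstream."""
    data = json.loads(raw)
    verdict = data.get("verdict")
    if verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unknown verdict: {verdict!r}")
    confidence = float(data.get("confidence", 0.0))
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return {
        "verdict": verdict,
        "confidence": confidence,
        "modified_size_pct": data.get("modified_size_pct"),
        "modified_sl": data.get("modified_sl"),
    }
```

Failing loudly on an unknown verdict matters: a parse error routes to the next tier in the cascade instead of silently executing a malformed instruction.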

Why modify, not just accept/reject

The third verdict, modify, is where most of the value shows up. A
classic gate-based system is binary. A signal either gets through at
full size or doesn't get through at all. But most of the hard cases
aren't binary. They're "yes, but smaller" or "yes, but with a tighter
stop because the regime is trending against you."

The reviewer can return modify with a modified_size_pct (bounded
between 10% and 100% of the original) and a modified_sl (bounded to
be tighter, never looser, than the original). The executor clamps
these on read — we never trust the LLM to size a trade without hard
caps. The model is suggesting, not commanding.
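The clamping step can be sketched as below. The 10–100% size bound comes from the text above; the "tighter, never looser" stop check is direction-aware, and the function names are illustrative:

```python
def clamp_modifications(original_size, original_sl, entry,
                        modified_size_pct, modified_sl, direction="long"):
    """Apply hard caps to reviewer suggestions on read. The model
    suggests; this function decides what actually gets through."""
    size = original_size
    if modified_size_pct is not None:
        # Bound the suggested size to 10%..100% of the original.
        pct = min(max(modified_size_pct, 10.0), 100.0)
        size = original_size * pct / 100.0

    sl = original_sl
    if modified_sl is not None:
        if direction == "long":
            # Tighter for a long means a higher stop, still below entry.
            if original_sl < modified_sl < entry:
                sl = modified_sl
        else:
            # Tighter for a short means a lower stop, still above entry.
            if entry < modified_sl < original_sl:
                sl = modified_sl
    return size, sl
```

A looser stop is silently ignored rather than raising, so a confused model degrades to "no modification" instead of crashing the stage.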

Early data from our paper run: about 60% of signals get accept, 25%
get modify (usually smaller size during mixed-regime conditions),
and 15% get veto. The veto rate is higher than we expected. Most
vetoes come from the "three bad trades in a row on this strategy"
pattern the reviewer sees in the context pack.

Fail-open routing: the LLM cannot be a SPOF

The scariest thing about putting a model on the hot path is that
models break. Rate limits, outages, Cloudflare incidents,
token-pricing changes that silently degrade a cheap tier. If the
reviewer becomes a single point of failure, one incident kills the
whole pipeline.

We solved this with a tiered router that fails open, not fails closed:

async def review(signal, context):
    # Tiers in descending order of sharpness; each call gets a hard
    # timeout so one slow tier can't stall the whole pipeline.
    for tier in ["sonnet", "gpt4_mini", "phi4_local"]:
        try:
            result = await call_tier(tier, signal, context, timeout=8.0)
            if result.valid:
                return result
        except (TimeoutError, RateLimitError, UpstreamError):
            # RateLimitError / UpstreamError are our client wrappers,
            # not stdlib exceptions. Count the failure and fall through.
            metrics.tier_failure.labels(tier=tier).inc()
            continue
    # All tiers down. Don't block the signal — hand it back with a
    # fallback verdict and let the old gate system make the call.
    return Verdict(
        verdict="accept",
        confidence=0.0,
        rationale="reviewer_unavailable",
        source="fallback",
    )

Two important choices here.

First, the fallback returns accept with confidence 0, not veto. If
the review layer is down, the trading system should behave exactly
like it did before we added the reviewer. Fail-closed would mean an
LLM outage = trading halt, which is worse than not having a reviewer
at all.

Second, the confidence 0 signal matters. The executor treats any
verdict with confidence below 0.3 as "use the pre-existing gate
decision only." So the reviewer's influence scales with its
confidence, and scales to zero when it's offline.
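In sketch form, that executor-side rule might look like this (the 0.3 threshold is from the text; the function name and signature are assumptions):

```python
CONFIDENCE_FLOOR = 0.3  # below this, the reviewer is advisory only

def should_execute(verdict: str, confidence: float) -> bool:
    """Signals only reach the reviewer after passing the gate
    pre-filter, so the pre-reviewer behavior is 'execute'. Any
    low-confidence verdict — including the fallback's confidence 0 —
    restores exactly that behavior."""
    if confidence < CONFIDENCE_FLOOR:
        return True   # reviewer offline or unsure: behave as before
    return verdict != "veto"
```

This is why fail-open and the confidence floor are one mechanism, not two: the fallback verdict doesn't need special-casing in the executor, it just lands below the floor.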

Picking a local fallback on purpose

The last tier in the cascade is phi4_local — a Phi-4 14B quantized
model running on a local GPU. It's slower than Sonnet and less
sharp, but it has three properties the hosted tiers don't:

  1. It's free. Every signal reviewed by the local tier costs nothing.
  2. It can't rate-limit us. Concurrent request cap is set by our hardware, not by a vendor's billing engine.
  3. It can't be deprecated out from under us overnight.

The cascade isn't "hosted first, local as a sad fallback." It's
"use the sharpest tier available for each signal, and if nothing is
available, still have a real model in the loop." A lot of days, the
cheap hosted tier is plenty. But on the day Sonnet has an outage
and GPT-4-mini is queued three minutes deep, phi4_local keeps the
lights on.

What we measure

We don't claim this is a better trading strategy. We claim it's a
better decision stage. The questions we measure are:

  • Veto precision: of the trades the reviewer vetoed, what fraction actually would have lost money? (target: >60%)
  • Modify improvement: on trades where we took a modified size, did the expected-loss reduction justify the expected-return reduction? (target: yes, on average)
  • Fallback rate: what fraction of signals got the fallback path because all tiers were down? (target: <1% over a 30-day window)
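Veto precision is the simplest of the three to compute. A sketch, assuming the vetoed signals are paper-replayed and their simulated PnL logged (the logging format here is an assumption):

```python
def veto_precision(vetoed_pnls):
    """Fraction of vetoed signals that, replayed on paper, would have
    lost money. `vetoed_pnls` is a list of simulated PnL values for
    the signals the reviewer blocked."""
    if not vetoed_pnls:
        return None  # no vetoes yet; nothing to measure
    losers = sum(1 for pnl in vetoed_pnls if pnl < 0)
    return losers / len(vetoed_pnls)
```

The `None` for an empty window matters: a precision of 0.0 and "no data" are very different operational signals, and conflating them is how dashboards lie.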

We publish these numbers on an operations dashboard, not in
marketing copy. If the reviewer's veto precision drops below 50%
for two weeks running, we take it off the hot path until we
understand why. The spec isn't "always use the LLM." The spec is
"use it only when it's measurably earning its place."

The honest disclaimer

This setup does not promise better returns. We have no backtested
accuracy number to offer because we believe backtests over-fit and
we don't want to sell one. What we offer is a decision stage that
can see more of the context than a rule-based gate can, with a
fail-open path so it can't take down the rest of the system, and
a measurement plan that will retire it if the veto precision
stops paying rent.

If you're building something similar and have notes on how the
context pack should be shaped, or what tier-cascade behavior you
found works best when all hosted tiers are degraded at once, we'd
genuinely like to hear them.


Posted from the ops side of a live trading platform. No affiliate
links, no course to sell, no promise of returns. Just a design
pattern we wish we'd had two years ago.
