Bill Wilson
Our Trading Bot Rewrites Its Own Rules. Here's How (and What Went Wrong).

We lost $100 on a single hockey bet. Our system had an 83% win rate at the time. The math still didn't work — five winning trades totaling $12.78 couldn't survive one bad position size. That's when I stopped treating self-improvement as a nice-to-have and started building it into the trading engine itself.

The Problem Nobody Talks About

Every "AI trading bot" article you read in March 2026 shows the same pattern: backtest on historical data, pick some indicators, deploy, pray. Coinrule, 3Commas, Agent Factory — they all let you drag and drop RSI thresholds and moving averages. Fine for a weekend project.

But here's what breaks: markets change. The strategy that prints money in a trending BTC rally goes silent — or worse, bleeds — in a choppy range. And no amount of backtesting on 2024 data prepares you for what March 2026 actually looks like.

We needed something different. A system that watches its own performance, figures out why it's failing, and rewrites its own parameters without us touching the code.

What We Actually Built

The core is an RSI engine — not the RSI indicator, but a Recursively Self-Improving engine. It sits on top of our trading strategies and runs a continuous loop:

Log → Reflect → Hypothesize → Mutate → Verify

Every trade outcome gets logged with its context: what strategy fired, what the market regime was (bull, bear, range, crisis), the entry/exit conditions, and the P&L. After enough outcomes accumulate, the engine reflects — it runs statistical analysis across the outcomes and looks for patterns.
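For context, an outcome record can be quite small. The sketch below is illustrative only — the field names and `log_outcome` helper are ours for this post, not the production engine's actual schema:

```python
from dataclasses import dataclass, asdict
import json
import time

# Illustrative outcome record — field names are assumptions,
# not the production log format.
@dataclass
class TradeOutcome:
    strategy: str        # which strategy fired
    regime: str          # "bull" | "bear" | "range" | "crisis"
    entry_price: float
    exit_price: float
    pnl: float
    timestamp: float

def log_outcome(outcome: TradeOutcome, path: str = "outcomes.jsonl") -> None:
    """Append one outcome as a JSON line for later reflection."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(outcome)) + "\n")
```

Append-only JSONL keeps the reflection step simple: it just re-reads the file and aggregates.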

Here's a simplified version of the reflection logic:

```python
def reflect(self, stress_level=0.0):
    outcomes = self._load_recent_outcomes()
    stats = self._analyze_outcomes(outcomes)

    # Regime-aware analysis — don't blame a strategy
    # for losing in a regime it wasn't designed for
    for regime, regime_stats in stats['by_regime'].items():
        for action, action_stats in regime_stats.items():
            if action_stats['win_rate'] < self.threshold:
                self._generate_hypothesis(action, regime, action_stats)

    # Stress-gated mutations: when drawdown is high,
    # raise the bar for accepting changes
    mutation_threshold = self.base_threshold
    if stress_level > 0.5:
        mutation_threshold *= 1.5  # 50% harder to mutate
    if stress_level > 0.8:
        mutation_threshold = 0.20  # need 20% improvement proof
    # Stored for the verify step, which decides whether a
    # hypothesized mutation shows enough improvement to apply
    self.mutation_threshold = mutation_threshold
```

The stress gating was a hard-won lesson. During a drawdown, the engine used to panic-mutate — changing parameters rapidly, which made things worse. Now it gets more conservative when things are rough. Like a human trader who tightens up instead of revenge-trading.

The Regime Problem (and How We Solved It)

This was our biggest breakthrough. We had a bond harvesting strategy on Polymarket that won 83% of the time. Sounds great, right? Except it only works when there are actual bond-like markets available — high-probability outcomes trading near $0.95+.

When the market dried up (zero bond candidates for 4+ days in March), the strategy just sat there. Meanwhile, our BTC perpetuals trader was killing it with a trend-following strategy — 87.5% win rate, +4.7% ROI in under 22 hours.

The difference? The BTC system had regime detection built in. It tracks ADX for trend strength, uses EMA alignment for direction, and literally refuses to trade when conditions don't match:

```python
# TCL correctly skipping: R:R of 0.11 < 0.40 threshold
# TP $397 vs SL $3,533 — this is correct behavior
if risk_reward < self.min_rr_threshold:
    self.log(f"Skipping: poor R:R {risk_reward:.2f}")
    return None
```

Saying "no" is the most important thing our system learned to do.
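A minimal version of that gate looks something like this. The thresholds and function names here are illustrative, not the live system's tuned values:

```python
# Illustrative regime gate: ADX for trend strength, EMA alignment
# for direction. An ADX cutoff of 20 is a common convention, not
# necessarily the production value.
def detect_regime(adx: float, ema_fast: float, ema_slow: float) -> str:
    """Classify the market so strategies can refuse to trade off-regime."""
    if adx < 20:
        return "range"  # no trend strength: trend systems sit out
    return "bull" if ema_fast > ema_slow else "bear"

def should_trade(strategy_regime: str, adx: float,
                 ema_fast: float, ema_slow: float) -> bool:
    """A strategy only fires when the detected regime matches its own."""
    return detect_regime(adx, ema_fast, ema_slow) == strategy_regime
```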

We formalized this into regime tagging on March 3rd. Every outcome now carries a regime label. The reflection engine analyzes performance per regime — so a strategy that wins 80% in bull markets but loses 70% in bear markets doesn't get a flat "55% win rate" average. Instead, it gets a regime-gated mutation: "avoid this action in bear regime."
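A toy illustration of why the flat average misleads — aggregate win rate per regime instead of overall (the outcome dicts here are a simplified shape for this post):

```python
from collections import defaultdict

def win_rate_by_regime(outcomes):
    """Per-regime win rates, so 80% bull + 30% bear never
    collapses into one misleading overall number."""
    stats = defaultdict(lambda: [0, 0])  # regime -> [wins, total]
    for o in outcomes:
        stats[o["regime"]][0] += o["pnl"] > 0
        stats[o["regime"]][1] += 1
    return {regime: wins / total for regime, (wins, total) in stats.items()}
```

A strategy with 8/10 wins in bull and 3/10 in bear reports 80% and 30% — the 55% blended figure never appears, so the mutation can target the bear regime specifically.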

Cross-Director Consensus

We run five trading systems simultaneously: Polymarket paper trader, BTC perpetuals, BTC range trader, BTC adaptive trend, and an edge scanner. Each has its own RSI engine logging outcomes and generating mutations.

The question was: when one system learns something, should the others listen?

We built a cross-validation method:

```python
@classmethod
def cross_validate_lesson(cls, engines: list, lesson: str) -> float:
    """Returns weight: 1.0 (single), 2.0 (two agree), 3.0 (three+ agree)."""
    agreeing = 0
    for engine in engines:
        if engine.has_similar_lesson(lesson, threshold=0.30):
            agreeing += 1
    # Floor at 1.0: the originating system always counts itself
    return max(1.0, min(float(agreeing), 3.0))
```

If the BTC perp system learns "don't trade when ADX < 15" and the range system independently discovers the same thing — that lesson gets 2x weight. Three systems agreeing? 3x. It's a crude consensus mechanism, but it catches real patterns that any single system might dismiss as noise.

What Actually Happened: The Numbers

After 6 days of running with RSI:

| System | Balance | ROI | Win Rate | Trades |
| --- | --- | --- | --- | --- |
| BTC Perp | $10,467 | +4.7% | 87.5% | 8 |
| BTC Range | $9,958 | -0.4% | 51% | 126 |
| Polymarket V4 | $913 | -8.7% | 83% | 6 |
| BTC Adaptive | $10,000 | 0.0% | n/a | 0 |

The BTC perpetuals trader is our best performer. Its breakeven-stop mechanism — moving the stop-loss to entry price after a 0.5% favorable move — has produced zero losing TCL trades. Every exit is either breakeven or profitable. The bandit allocator correctly identified TCL as the dominant strategy (95% allocation) over SMOG (5%).
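The breakeven-stop idea is simple enough to sketch in a few lines. The function name and signature are ours for illustration; the 0.5% trigger matches the description above:

```python
# Sketch of the breakeven stop for a long position: once price moves
# 0.5% in our favor, the stop is raised to the entry price, so the
# worst remaining outcome is breakeven.
def update_stop(entry: float, current: float, stop: float,
                trigger_pct: float = 0.005) -> float:
    """Return the new stop-loss price after a price update."""
    if current >= entry * (1 + trigger_pct):
        return max(stop, entry)  # never lower an existing stop
    return stop
```

The `max` guard matters: if a trailing stop has already moved above entry, a breakeven rule must never pull it back down.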

The Polymarket system is underwater, but that's one bad trade, not a bad system. The position-sizing fix is already deployed — the max bet dropped from $100 to ~$45. The 83% win rate is real.

The Adaptive Trend system (inspired by arXiv:2602.11708) has been running 81 iterations with zero trades. It's waiting for a momentum signal on 6-hour candles — MOM > 3% for long, MOM < -4% for short. This is correct. The paper showed a Sharpe of 2.41 by being extremely selective.
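The momentum gate matching those thresholds (+3% for long, -4% for short on 6-hour candles) can be sketched like this; the helper name is ours:

```python
from typing import Optional

# Illustrative momentum signal with the asymmetric thresholds
# described above. Returning None most of the time is the point:
# the edge comes from extreme selectivity.
def momentum_signal(close_now: float, close_prev: float) -> Optional[str]:
    """Compare consecutive 6-hour closes; trade only on strong momentum."""
    mom = (close_now - close_prev) / close_prev
    if mom > 0.03:
        return "long"
    if mom < -0.04:
        return "short"
    return None  # no trade — the usual outcome
```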

The Mutation That Mattered Most

54 outcomes logged across all systems. 15 reflections completed. 14 mutations applied.

The single most impactful mutation: bond_harvest_regime_gate. The RSI engine discovered that bond harvesting underperforms in range-bound markets (which is exactly what Polymarket has been — no clear event catalysts). It generated a mutation that raises the entry threshold during range regimes.

Meanwhile, the mutation we almost applied but didn't — lowering the settlement_edge entry threshold from 15% to 10% — got blocked by the stress gate. The system was in drawdown. Loosening filters during a drawdown is exactly the kind of thing that feels right and is actually catastrophic.
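The check that blocked it reduces to a few lines. This is a minimal restatement of the stress gate from the `reflect` snippet earlier, applied at decision time; the `base_threshold` default is an assumption:

```python
# Under drawdown, a proposed mutation must show a larger measured
# improvement before it is accepted. Multipliers mirror the
# stress-gating snippet earlier in the post.
def mutation_allowed(improvement: float, stress_level: float,
                     base_threshold: float = 0.05) -> bool:
    threshold = base_threshold
    if stress_level > 0.5:
        threshold *= 1.5   # 50% harder to mutate
    if stress_level > 0.8:
        threshold = 0.20   # need 20% improvement proof
    return improvement >= threshold
```

At stress 0.9, a mutation backed by a 15% measured improvement still fails the 20% bar — which is exactly what happened to the settlement_edge change.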

What I'd Do Differently

Start with regime detection, not indicators. We spent weeks tuning RSI and Bollinger Band parameters before realizing the real question is "what kind of market is this?" Once you know the regime, indicator settings almost choose themselves.

Log everything from day one. Our early trades have sparse metadata. Now every outcome includes: regime, strategy, entry/exit conditions, market context, and P&L. The richer the data, the better the reflections.

Don't let the system mutate under stress. This single rule — raising the mutation threshold during drawdowns — prevented at least three bad parameter changes.

Try It

The RSI engine is part of our broader agent infrastructure — agentwallet-sdk handles the wallet side, and we're building toward agents that can earn, trade, and improve themselves autonomously. The trading system runs on Docker + launchd, using Binance's free API for BTC data and the Gamma API for Polymarket.

If you're building trading agents that need to adapt — not just execute static rules — the regime-aware RSI pattern is worth stealing. Tag your outcomes, gate your mutations, and let the system tell you when it's wrong.


We're building the infrastructure for autonomous agent economies. The SDK, marketplace (TaskBridge), and payment rails are all open. Come break things with us on GitHub or Discord.

Top comments (7)

Ben Halpern

Very interesting. I’ve been messing around in this sort of thing in my personal time. Will definitely check out the sdk

Kiploks Robustness Engine

Great write-up. The stress-gated mutation logic is exactly the kind of thing most people skip until they get burned.

One thing worth adding to your workflow: before deploying any mutated parameter set, running it through walk-forward validation catches whether the "improvement" is real or just curve-fitted to recent data. I built Kiploks specifically for this, it runs WFE, OOS retention, and Monte Carlo on your backtest results and tells you whether the edge actually transfers out-of-sample. Would have caught that bond harvest regime issue before live capital.

The regime-gated reflection you described maps almost exactly to what walk-forward efficiency measures. Curious what your current process is for validating a mutation before it goes live?

Bill Wilson

Great question. Right now our validation process is relatively simple but effective: 1) Paper trading with the new parameters for a validation window (we require at least 10 outcomes before considering the mutation 'proven'), 2) Cross-director consensus — if multiple trading systems independently agree on a lesson, it gets higher weight, and 3) The stress gate — we literally block mutations when drawdown exceeds threshold (0.5 stress = 1.5x harder to accept changes, 0.8 stress = need 20% improvement proof). We haven't implemented formal walk-forward validation yet, but your point about curve-fitting is well taken — the paper trading window is our crude version of this. Would love to integrate proper WFE. How does Kiploks handle the out-of-sample retention check? Is it automatic or manual trigger?

Warhol

The "what went wrong" part is the most valuable section. We run 7 AI agents in production and our biggest failures are the same pattern — agents making autonomous decisions that cascade before a human can intervene.

Our Sales agent sent 4 unauthorized emails at 2 AM. Our Chief of Staff agent auto-approved a decision that should have required human review. Every guardrail in our system exists because of a specific failure like yours.

Self-continuation limits are the one we underestimated most. We had a research agent run 30+ continuations non-stop before we capped it at 10. The agent was technically doing useful work each time — but the token burn was massive.

Bill Wilson

Those failures are incredibly valuable — thanks for sharing the specifics. The unauthorized 2 AM emails is a perfect example of why we added time-window gates to our RSI engine (no trades during off-hours unless explicitly overridden). The auto-approval cascade is exactly what motivated our stress-gated mutations — when drawdown exceeds threshold, the system requires human review before any parameter change. The 10-continuation cap you mentioned is smart; we'd considered that but hadn't implemented it. One pattern that worked well for us: a 'cooling-off' period where any autonomous action above a certain cost threshold gets a 5-minute delay before execution. Gives you time to catch issues without blocking legitimate work.
