We lost $100 on a single hockey bet. Our system had an 83% win rate at the time. The math still didn't work — five winning trades totaling $12.78 couldn't survive one bad position size. That's when I stopped treating self-improvement as a nice-to-have and started building it into the trading engine itself.
The Problem Nobody Talks About
Every "AI trading bot" article you read in March 2026 shows the same pattern: backtest on historical data, pick some indicators, deploy, pray. Coinrule, 3Commas, Agent Factory — they all let you drag and drop RSI thresholds and moving averages. Fine for a weekend project.
But here's what breaks: markets change. The strategy that prints money in a trending BTC rally goes silent — or worse, bleeds — in a choppy range. And no amount of backtesting on 2024 data prepares you for what March 2026 actually looks like.
We needed something different. A system that watches its own performance, figures out why it's failing, and rewrites its own parameters without us touching the code.
What We Actually Built
The core is an RSI engine — not the RSI indicator, but a Recursively Self-Improving engine. It sits on top of our trading strategies and runs a continuous loop:
Log → Reflect → Hypothesize → Mutate → Verify
Every trade outcome gets logged with its context: what strategy fired, what the market regime was (bull, bear, range, crisis), the entry/exit conditions, and the P&L. After enough outcomes accumulate, the engine reflects — it runs statistical analysis across the outcomes and looks for patterns.
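Each logged outcome needs enough structure for the reflection step to slice by regime and strategy later. A minimal record might look like this (the field names are my illustration, not the engine's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class TradeOutcome:
    strategy: str          # which strategy fired
    regime: str            # "bull", "bear", "range", or "crisis"
    action: str            # e.g. "bond_harvest", "trend_long"
    entry: float
    exit: float
    pnl: float
    context: dict = field(default_factory=dict)  # entry/exit conditions, market state

    @property
    def is_win(self) -> bool:
        return self.pnl > 0

outcome = TradeOutcome("btc_perp", "bull", "trend_long", 62_000.0, 62_900.0, 47.0)
print(outcome.is_win)  # True
```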
Here's a simplified version of the reflection logic:
```python
def reflect(self, stress_level=0.0):
    outcomes = self._load_recent_outcomes()
    stats = self._analyze_outcomes(outcomes)

    # Regime-aware analysis -- don't blame a strategy
    # for losing in a regime it wasn't designed for
    for regime, regime_stats in stats['by_regime'].items():
        for action, action_stats in regime_stats.items():
            if action_stats['win_rate'] < self.threshold:
                self._generate_hypothesis(action, regime, action_stats)

    # Stress-gated mutations: when drawdown is high,
    # raise the bar for accepting changes
    mutation_threshold = self.base_threshold
    if stress_level > 0.5:
        mutation_threshold *= 1.5   # 50% harder to mutate
    if stress_level > 0.8:
        mutation_threshold = 0.20   # need 20% improvement proof
    self.mutation_threshold = mutation_threshold
```
The stress gating was a hard-won lesson. During a drawdown, the engine used to panic-mutate — changing parameters rapidly, which made things worse. Now it gets more conservative when things are rough. Like a human trader who tightens up instead of revenge-trading.
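Where does the stress level come from? One natural source is drawdown. Here's a sketch of the idea (my illustration, not the engine's exact formula; the 10% tolerance is an assumed figure):

```python
def stress_from_drawdown(equity: float, peak_equity: float,
                         max_tolerated_dd: float = 0.10) -> float:
    """Map current drawdown onto a 0..1 stress scale.

    0.0 at the equity peak, 1.0 once drawdown reaches the maximum
    tolerated level (assumed 10% here).
    """
    if peak_equity <= 0:
        return 0.0
    drawdown = max(0.0, (peak_equity - equity) / peak_equity)
    return min(1.0, drawdown / max_tolerated_dd)

print(stress_from_drawdown(9_500, 10_000))  # 0.5 -- a 5% drawdown is halfway to the cap
```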
The Regime Problem (and How We Solved It)
This was our biggest breakthrough. We had a bond harvesting strategy on Polymarket that won 83% of the time. Sounds great, right? Except it only works when there are actual bond-like markets available — high-probability outcomes trading near $0.95+.
When the market dried up (zero bond candidates for 4+ days in March), the strategy just sat there. Meanwhile, our BTC perpetuals trader was killing it with a trend-following strategy — 87.5% win rate, +4.7% ROI in under 22 hours.
The difference? The BTC system had regime detection built in. It tracks ADX for trend strength, uses EMA alignment for direction, and literally refuses to trade when conditions don't match:
```python
# TCL correctly skipping: R:R of 0.11 < 0.40 threshold
# TP $397 vs SL $3,533 -- this is correct behavior
if risk_reward < self.min_rr_threshold:
    self.log(f"Skipping: poor R:R {risk_reward:.2f}")
    return None
```
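Assuming ADX and the EMAs are already computed, the regime check itself can be small. A sketch of the ADX-plus-EMA-alignment idea (the thresholds of 25 and 15 are common defaults, not necessarily what the live system uses):

```python
def classify_regime(adx: float, ema_fast: float, ema_slow: float,
                    trend_adx: float = 25.0, chop_adx: float = 15.0) -> str:
    """Illustrative regime tags from ADX strength + EMA alignment.

    Thresholds here are conventional defaults, assumed for the sketch.
    """
    if adx < chop_adx:
        return "range"       # too weak to trust any direction
    if adx >= trend_adx:
        return "bull" if ema_fast > ema_slow else "bear"
    return "transition"      # trend forming, stay cautious

print(classify_regime(adx=31.0, ema_fast=63_100, ema_slow=62_400))  # bull
```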
Saying "no" is the most important thing our system learned to do.
We formalized this into regime tagging on March 3rd. Every outcome now carries a regime label. The reflection engine analyzes performance per regime — so a strategy that wins 80% in bull markets but loses 70% in bear markets doesn't get a flat "55% win rate" average. Instead, it gets a regime-gated mutation: "avoid this action in bear regime."
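Per-regime aggregation is essentially a small group-by. A sketch of how the 80%-bull / 30%-bear split survives instead of collapsing into one flat average (the dict schema is assumed for illustration):

```python
from collections import defaultdict

def win_rates_by_regime(outcomes):
    """outcomes: iterable of dicts with 'regime' and 'pnl' keys (assumed schema)."""
    buckets = defaultdict(lambda: [0, 0])   # regime -> [wins, total]
    for o in outcomes:
        buckets[o["regime"]][0] += o["pnl"] > 0
        buckets[o["regime"]][1] += 1
    return {r: wins / total for r, (wins, total) in buckets.items()}

outcomes = ([{"regime": "bull", "pnl": 1.0}] * 8 + [{"regime": "bull", "pnl": -1.0}] * 2
            + [{"regime": "bear", "pnl": 1.0}] * 3 + [{"regime": "bear", "pnl": -1.0}] * 7)
print(win_rates_by_regime(outcomes))  # {'bull': 0.8, 'bear': 0.3}
```

A flat average over those 20 trades would report 55% and hide the asymmetry entirely.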
Cross-Director Consensus
We run five trading systems simultaneously: Polymarket paper trader, BTC perpetuals, BTC range trader, BTC adaptive trend, and an edge scanner. Each has its own RSI engine logging outcomes and generating mutations.
The question was: when one system learns something, should the others listen?
We built a cross-validation method:
```python
@classmethod
def cross_validate_lesson(cls, engines: list, lesson: str) -> float:
    """Returns weight: 1.0 (single), 2.0 (2 agree), 3.0 (3+ agree)"""
    agreeing = 0
    for engine in engines:
        if engine.has_similar_lesson(lesson, threshold=0.30):
            agreeing += 1
    return min(float(agreeing), 3.0)
```
If the BTC perp system learns "don't trade when ADX < 15" and the range system independently discovers the same thing — that lesson gets 2x weight. Three systems agreeing? 3x. It's a crude consensus mechanism, but it catches real patterns that any single system might dismiss as noise.
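The `has_similar_lesson` check can be a plain fuzzy string match. Here's a stand-in built on stdlib `difflib` (the engine's actual similarity metric isn't shown above; the 0.30 threshold comes from the snippet):

```python
from difflib import SequenceMatcher

def has_similar_lesson(known_lessons, lesson: str, threshold: float = 0.30) -> bool:
    """Sketch of a fuzzy 'same lesson?' check using SequenceMatcher.

    SequenceMatcher.ratio() is one cheap stand-in for whatever the
    real engine uses to compare lesson strings.
    """
    return any(
        SequenceMatcher(None, lesson.lower(), known.lower()).ratio() > threshold
        for known in known_lessons
    )

lessons = ["don't trade when ADX < 15"]
print(has_similar_lesson(lessons, "avoid trading when ADX < 15"))  # True
```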
What Actually Happened: The Numbers
After 6 days of running with RSI:
| System | Balance | ROI | Win Rate | Trades |
|---|---|---|---|---|
| BTC Perp | $10,467 | +4.7% | 87.5% | 8 |
| BTC Range | $9,958 | -0.4% | 51% | 126 |
| Polymarket V4 | $913 | -8.7% | 83% | 6 |
| BTC Adaptive | $10,000 | 0.0% | — | 0 |
The BTC perpetuals trader is our best performer. Its breakeven-stop mechanism — moving the stop-loss to entry price after a 0.5% favorable move — has produced zero losing TCL trades. Every exit is either breakeven or profitable. The bandit allocator correctly identified TCL as the dominant strategy (95% allocation) over SMOG (5%).
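The breakeven-stop rule is simple to state: once price moves 0.5% in your favor, ratchet the stop up to entry. A sketch for the long side (the live system's order plumbing is obviously more involved):

```python
def update_stop_long(entry: float, current_price: float, stop: float,
                     trigger_pct: float = 0.005) -> float:
    """Move the stop to breakeven after a 0.5% favorable move (long side).

    The stop only ratchets toward entry, never back down.
    """
    if current_price >= entry * (1 + trigger_pct):
        return max(stop, entry)   # worst case from here is breakeven
    return stop

new_stop = update_stop_long(entry=62_000.0, current_price=62_400.0, stop=60_500.0)
print(new_stop)  # 62000.0 -- price is +0.65%, past the 0.5% trigger
```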
The Polymarket system is underwater, but that's one bad trade, not a bad system. Position sizing fix deployed — max bet dropped from $100 to ~$45. The 83% win rate is real.
The Adaptive Trend system (inspired by arXiv:2602.11708) has been running 81 iterations with zero trades. It's waiting for a momentum signal on 6-hour candles — MOM > 3% for long, MOM < -4% for short. This is correct. The paper showed a Sharpe of 2.41 by being extremely selective.
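The momentum gate is equally terse. A sketch of the asymmetric entry rule (thresholds are from the text; the candle sourcing is assumed):

```python
def momentum_signal(close_now: float, close_6h_ago: float,
                    long_thresh: float = 0.03, short_thresh: float = -0.04):
    """Asymmetric momentum gate: long above +3%, short below -4%, else no trade.

    Thresholds match the article; how the 6-hour candles arrive is assumed.
    """
    mom = close_now / close_6h_ago - 1.0
    if mom > long_thresh:
        return "long"
    if mom < short_thresh:
        return "short"
    return None   # most of the time: sit on hands

print(momentum_signal(64_000.0, 62_000.0))  # long -- a +3.2% move
```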
The Mutation That Mattered Most
54 outcomes logged across all systems. 15 reflections completed. 14 mutations applied.
The single most impactful mutation: bond_harvest_regime_gate. The RSI engine discovered that bond harvesting underperforms in range-bound markets (which is exactly what Polymarket has been — no clear event catalysts). It generated a mutation that raises the entry threshold during range regimes.
Meanwhile, the mutation we almost applied but didn't — lowering the settlement_edge entry threshold from 15% to 10% — got blocked by the stress gate. The system was in drawdown. Loosening filters during a drawdown is exactly the kind of thing that feels right and is actually catastrophic.
What I'd Do Differently
Start with regime detection, not indicators. We spent weeks tuning RSI and Bollinger Band parameters before realizing the real question is "what kind of market is this?" Once you know the regime, indicator settings almost choose themselves.
Log everything from day one. Our early trades have sparse metadata. Now every outcome includes: regime, strategy, entry/exit conditions, market context, and P&L. The richer the data, the better the reflections.
Don't let the system mutate under stress. This single rule — raising the mutation threshold during drawdowns — prevented at least three bad parameter changes.
Try It
The RSI engine is part of our broader agent infrastructure — agentwallet-sdk handles the wallet side, and we're building toward agents that can earn, trade, and improve themselves autonomously. The trading system code runs on Docker + launchd, uses Binance free API for BTC data and Gamma API for Polymarket.
If you're building trading agents that need to adapt — not just execute static rules — the regime-aware RSI pattern is worth stealing. Tag your outcomes, gate your mutations, and let the system tell you when it's wrong.
We're building the infrastructure for autonomous agent economies. The SDK, marketplace (TaskBridge), and payment rails are all open. Come break things with us on GitHub or Discord.
Top comments (4)
Very interesting. I've been messing around with this sort of thing in my personal time. Will definitely check out the SDK.

Really appreciate the depth here -- the stress-gated mutation logic is something I wish more trading systems had. I'm building an AI platform that analyzes 13F filings (institutional investor holdings), and we face a similar regime problem: the signals that work in a bull market look completely different in risk-off environments.
Your approach of tagging outcomes with regime labels and then running per-regime analysis is exactly what we've been converging on for portfolio rebalancing signals. The cross-director consensus idea is clever too -- in our case, we cross-validate signals across different fund categories (hedge funds vs. pension funds vs. endowments) to filter noise.
One thing that caught my eye: the adaptive system running 81 iterations with zero trades. That patience is underrated. In institutional investing, the best performers are often the ones who say "no" most often.
Curious -- have you considered incorporating any macro signals (like 13F-derived institutional flow data) into the regime detection? Sometimes the smart money moves before technical indicators shift.
Great write-up. The stress-gated mutation logic is exactly the kind of thing most people skip until they get burned.
One thing worth adding to your workflow: before deploying any mutated parameter set, run it through walk-forward validation to catch whether the "improvement" is real or just curve-fitted to recent data. I built Kiploks specifically for this: it runs WFE, OOS retention, and Monte Carlo on your backtest results and tells you whether the edge actually transfers out-of-sample. It would have caught that bond harvest regime issue before live capital.
The regime-gated reflection you described maps almost exactly to what walk-forward efficiency measures. Curious what your current process is for validating a mutation before it goes live?
Thanks for this - and you've hit on the exact gap I glossed over in the article.
You're right that our current _test_mutation() only validates against the most recent 20 in-sample outcomes before accepting a parameter change. That's essentially asking 'did this mutation improve performance on the same data it was trained on?' - a much weaker test than it sounds.
Walk-forward validation is the honest version. Train on N outcomes, validate on the next M, roll forward. If the edge doesn't transfer out-of-sample, the mutation doesn't ship. Our cross-director consensus (multiple independent systems converging on the same parameter change) is a partial proxy for this - but it's not a proper OOS holdout.
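For concreteness, the rolling split I mean looks like this (a sketch of the splitting logic only, not Kiploks or our production code; window sizes are illustrative):

```python
def walk_forward_splits(n_outcomes: int, train: int = 30, test: int = 10):
    """Yield (train_range, test_range) index pairs, rolling forward.

    Train on N outcomes, validate on the next M, then roll the window.
    A mutation ships only if it improves the test windows, not the train ones.
    """
    start = 0
    while start + train + test <= n_outcomes:
        yield (range(start, start + train),
               range(start + train, start + train + test))
        start += test

splits = list(walk_forward_splits(54))   # 54 outcomes, as in the article
print(len(splits))  # 2 rolling folds
```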
The Monte Carlo piece matters for us specifically because our sample sizes are small (8-54 trades per system right now). Standard significance tests break down there. Monte Carlo gives you a distribution of outcomes rather than a point estimate you can't actually trust.
Going to look at Kiploks - WFE + OOS retention is exactly what we need before mutating any numeric entry threshold. Regime-gate mutations (directional flags like 'avoid bond harvest in range regime') we auto-accept since the downside is a missed trade, not a blown risk limit. But numeric param changes? Your point stands.
One question: what's your typical in-sample/out-of-sample split for systems with small trade counts? At 50-100 outcomes I'm not sure the standard 70/30 gives enough OOS data to be meaningful.