DEV Community

Bill Wilson
Our Trading Bot Rewrites Its Own Rules. Here's How (and What Went Wrong).

We lost $100 on a single hockey bet. Our system had an 83% win rate at the time. The math still didn't work — five winning trades totaling $12.78 couldn't survive one bad position size. That's when I stopped treating self-improvement as a nice-to-have and started building it into the trading engine itself.

The Problem Nobody Talks About

Every "AI trading bot" article you read in March 2026 shows the same pattern: backtest on historical data, pick some indicators, deploy, pray. Coinrule, 3Commas, Agent Factory — they all let you drag and drop RSI thresholds and moving averages. Fine for a weekend project.

But here's what breaks: markets change. The strategy that prints money in a trending BTC rally goes silent — or worse, bleeds — in a choppy range. And no amount of backtesting on 2024 data prepares you for what March 2026 actually looks like.

We needed something different. A system that watches its own performance, figures out why it's failing, and rewrites its own parameters without us touching the code.

What We Actually Built

The core is an RSI engine — not the RSI indicator, but a Recursively Self-Improving engine. It sits on top of our trading strategies and runs a continuous loop:

Log → Reflect → Hypothesize → Mutate → Verify

Every trade outcome gets logged with its context: what strategy fired, what the market regime was (bull, bear, range, crisis), the entry/exit conditions, and the P&L. After enough outcomes accumulate, the engine reflects — it runs statistical analysis across the outcomes and looks for patterns.
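A logged outcome might look like the following sketch. The field names and the `TradeOutcome` class are illustrative, not the engine's actual schema; the post only specifies which facts get recorded:

```python
from dataclasses import dataclass, asdict

# Illustrative schema -- the real engine's field names may differ.
@dataclass
class TradeOutcome:
    strategy: str        # which strategy fired
    regime: str          # "bull", "bear", "range", or "crisis"
    entry_price: float
    exit_price: float
    pnl: float           # realized profit/loss in USD

    def is_win(self) -> bool:
        return self.pnl > 0

outcome = TradeOutcome(
    strategy="bond_harvest",
    regime="range",
    entry_price=0.95,
    exit_price=1.00,
    pnl=2.10,
)
record = asdict(outcome)  # plain dict, ready to append to the outcome log
```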

Here's a simplified version of the reflection logic:

def reflect(self, stress_level=0.0):
    """One pass of the Log → Reflect → Hypothesize loop."""
    outcomes = self._load_recent_outcomes()
    stats = self._analyze_outcomes(outcomes)

    # Regime-aware analysis — don't blame a strategy
    # for losing in a regime it wasn't designed for
    for regime, regime_stats in stats['by_regime'].items():
        for action, action_stats in regime_stats.items():
            if action_stats['win_rate'] < self.threshold:
                self._generate_hypothesis(action, regime, action_stats)

    # Stress-gated mutations: when drawdown is high,
    # raise the bar for accepting changes
    mutation_threshold = self.base_threshold
    if stress_level > 0.5:
        mutation_threshold *= 1.5  # 50% harder to mutate
    if stress_level > 0.8:
        mutation_threshold = 0.20  # require proof of a 20% improvement
    self.mutation_threshold = mutation_threshold  # consumed by the verify step

The stress gating was a hard-won lesson. During a drawdown, the engine used to panic-mutate — changing parameters rapidly, which made things worse. Now it gets more conservative when things are rough. Like a human trader who tightens up instead of revenge-trading.
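The post doesn't say how `stress_level` is computed, only that it rises during drawdowns. One plausible mapping (entirely our assumption, including the `max_tolerable_dd` ceiling) is to normalize the current drawdown against a tolerance limit:

```python
def stress_from_drawdown(equity_curve, max_tolerable_dd=0.10):
    """Map current drawdown onto a 0..1 stress level.

    Hypothetical sketch: the engine's real stress signal isn't
    described in the post. Here, stress saturates at 1.0 once
    drawdown reaches max_tolerable_dd (10% by default).
    """
    peak = max(equity_curve)
    current = equity_curve[-1]
    drawdown = (peak - current) / peak if peak > 0 else 0.0
    return min(drawdown / max_tolerable_dd, 1.0)
```

With an equity curve of `[10000, 10500, 9870]`, the drawdown from the $10,500 peak is 6%, which maps to a stress level of 0.6 — just past the first mutation gate.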

The Regime Problem (and How We Solved It)

This was our biggest breakthrough. We had a bond harvesting strategy on Polymarket that won 83% of the time. Sounds great, right? Except it only works when there are actual bond-like markets available — high-probability outcomes trading near $0.95+.

When the market dried up (zero bond candidates for 4+ days in March), the strategy just sat there. Meanwhile, our BTC perpetuals trader was killing it with a trend-following strategy — 87.5% win rate, +4.7% ROI in under 22 hours.

The difference? The BTC system had regime detection built in. It tracks ADX for trend strength, uses EMA alignment for direction, and literally refuses to trade when conditions don't match:

# TCL correctly skipping: R:R of 0.11 < 0.40 threshold
# TP $397 vs SL $3,533 — this is correct behavior
if risk_reward < self.min_rr_threshold:
    self.log(f"Skipping: poor R:R {risk_reward:.2f}")
    return None
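The regime detector itself isn't shown in the post. A minimal sketch of the ADX-plus-EMA idea it describes might look like this — the thresholds are ours, not the system's:

```python
def classify_regime(adx: float, ema_fast: float, ema_slow: float) -> str:
    """Toy regime classifier: ADX gauges trend strength, EMA
    alignment gives direction. Thresholds are illustrative."""
    if adx < 15:                 # weak trend: treat as a range
        return "range"
    if ema_fast > ema_slow:      # trending, and EMAs point up
        return "bull"
    return "bear"                # trending, EMAs point down
```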

Saying "no" is the most important thing our system learned to do.

We formalized this into regime tagging on March 3rd. Every outcome now carries a regime label. The reflection engine analyzes performance per regime — so a strategy that wins 80% in bull markets but loses 70% in bear markets doesn't get a flat "55% win rate" average. Instead, it gets a regime-gated mutation: "avoid this action in bear regime."
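The per-regime breakdown described above is a small aggregation step. A sketch, assuming outcomes are dicts with `regime`, `strategy`, and `pnl` keys (our field names, not the engine's):

```python
from collections import defaultdict

def win_rates_by_regime(outcomes):
    """Win rate per (regime, strategy) pair, so a strategy's record
    in one regime doesn't dilute its record in another."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for o in outcomes:
        key = (o["regime"], o["strategy"])
        totals[key] += 1
        if o["pnl"] > 0:
            wins[key] += 1
    return {k: wins[k] / totals[k] for k in totals}
```

This is exactly why the 80%-in-bull / 30%-in-bear strategy never shows up as a misleading flat average: each `(regime, strategy)` pair gets its own win rate.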

Cross-Director Consensus

We run five trading systems simultaneously: Polymarket paper trader, BTC perpetuals, BTC range trader, BTC adaptive trend, and an edge scanner. Each has its own RSI engine logging outcomes and generating mutations.

The question was: when one system learns something, should the others listen?

We built a cross-validation method:

@classmethod
def cross_validate_lesson(cls, engines: list, lesson: str) -> float:
    """Weight a lesson by independent agreement:
    1.0 (only its own engine), 2.0 (two agree), capped at 3.0 (3+)."""
    agreeing = sum(
        1 for engine in engines
        if engine.has_similar_lesson(lesson, threshold=0.30)
    )
    # The originating engine always counts, so floor the weight at 1.0
    return min(max(float(agreeing), 1.0), 3.0)

If the BTC perp system learns "don't trade when ADX < 15" and the range system independently discovers the same thing — that lesson gets 2x weight. Three systems agreeing? 3x. It's a crude consensus mechanism, but it catches real patterns that any single system might dismiss as noise.
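A toy usage sketch shows how the weighting plays out. The stub engine and exact-string lesson matching are both simplifications; the real `has_similar_lesson` presumably does fuzzy matching at the 0.30 threshold:

```python
def cross_validate_lesson(engines, lesson):
    """Standalone version of the consensus weighting for this sketch."""
    agreeing = sum(1 for e in engines if e.has_similar_lesson(lesson))
    return min(max(float(agreeing), 1.0), 3.0)

class StubEngine:
    """Stand-in for an RSI engine; real matching would be fuzzy."""
    def __init__(self, lessons):
        self.lessons = set(lessons)
    def has_similar_lesson(self, lesson, threshold=0.30):
        return lesson in self.lessons

engines = [
    StubEngine({"don't trade when ADX < 15"}),    # BTC perp
    StubEngine({"don't trade when ADX < 15"}),    # BTC range
    StubEngine({"prefer trend entries on pullbacks"}),
]
weight = cross_validate_lesson(engines, "don't trade when ADX < 15")
# two systems agree, so the lesson carries 2x weight
```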

What Actually Happened: The Numbers

After 6 days of running with the RSI engine:

| System        | Balance | ROI   | Win Rate | Trades |
|---------------|---------|-------|----------|--------|
| BTC Perp      | $10,467 | +4.7% | 87.5%    | 8      |
| BTC Range     | $9,958  | -0.4% | 51%      | 126    |
| Polymarket V4 | $913    | -8.7% | 83%      | 6      |
| BTC Adaptive  | $10,000 | 0.0%  | n/a      | 0      |

The BTC perpetuals trader is our best performer. Its breakeven-stop mechanism — moving the stop-loss to entry price after a 0.5% favorable move — has produced zero losing TCL trades. Every exit is either breakeven or profitable. The bandit allocator correctly identified TCL as the dominant strategy (95% allocation) over SMOG (5%).
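The breakeven-stop rule can be sketched in a few lines. The 0.5% trigger comes from the post; the long-only framing and function shape are ours:

```python
def update_stop(entry: float, current: float, stop: float,
                trigger_pct: float = 0.005) -> float:
    """Breakeven stop for a long position: once price moves 0.5% in
    our favor, ratchet the stop up to the entry price so the worst
    remaining outcome is a scratch trade. (Illustrative sketch.)"""
    if current >= entry * (1 + trigger_pct) and stop < entry:
        return entry  # lock in breakeven
    return stop       # otherwise leave the stop where it is
```

Once the stop sits at entry, every exit is breakeven or better — which is exactly the "zero losing TCL trades" property described above.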

The Polymarket system is underwater, but that's one bad trade, not a bad system. Position sizing fix deployed — max bet dropped from $100 to ~$45. The 83% win rate is real.
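The post only says the max bet dropped from $100 to roughly $45. Expressing that cap as a fraction of bankroll is our assumption, not the deployed rule, but it shows the shape of the fix:

```python
def cap_position_size(desired: float, bankroll: float,
                      max_fraction: float = 0.045) -> float:
    """Clamp a desired bet to a fixed fraction of bankroll.
    The 4.5% fraction is a hypothetical value chosen so that a
    ~$1,000 bankroll yields the ~$45 cap mentioned above."""
    return min(desired, bankroll * max_fraction)
```

The point is structural: a single oversized position wiped out five winners, so the cap binds hardest exactly when the sizing logic wants to bet big.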

The Adaptive Trend system (inspired by arXiv:2602.11708) has been running 81 iterations with zero trades. It's waiting for a momentum signal on 6-hour candles — MOM > 3% for long, MOM < -4% for short. This is correct. The paper showed a Sharpe of 2.41 by being extremely selective.
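The momentum gate quoted above is simple to sketch. The 3% and -4% thresholds come from the post; the one-candle lookback and return convention are assumptions:

```python
def momentum_signal(closes, lookback=1, long_thr=0.03, short_thr=-0.04):
    """Momentum gate on 6-hour candles: MOM > 3% opens a long,
    MOM < -4% opens a short, anything in between means no trade.
    `closes` is a list of candle closes, oldest first."""
    mom = closes[-1] / closes[-1 - lookback] - 1.0
    if mom > long_thr:
        return "long"
    if mom < short_thr:
        return "short"
    return None  # sit out -- most candles land here
```

The asymmetric thresholds mean the system demands stronger evidence before shorting than before going long, and the `None` branch is where those 81 patient iterations live.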

The Mutation That Mattered Most

54 outcomes logged across all systems. 15 reflections completed. 14 mutations applied.

The single most impactful mutation: bond_harvest_regime_gate. The RSI engine discovered that bond harvesting underperforms in range-bound markets (which is exactly what Polymarket has been — no clear event catalysts). It generated a mutation that raises the entry threshold during range regimes.

Meanwhile, the mutation we almost applied but didn't — lowering the settlement_edge entry threshold from 15% to 10% — got blocked by the stress gate. The system was in drawdown. Loosening filters during a drawdown is exactly the kind of thing that feels right and is actually catastrophic.

What I'd Do Differently

Start with regime detection, not indicators. We spent weeks tuning RSI and Bollinger Band parameters before realizing the real question is "what kind of market is this?" Once you know the regime, indicator settings almost choose themselves.

Log everything from day one. Our early trades have sparse metadata. Now every outcome includes: regime, strategy, entry/exit conditions, market context, and P&L. The richer the data, the better the reflections.

Don't let the system mutate under stress. This single rule — raising the mutation threshold during drawdowns — prevented at least three bad parameter changes.

Try It

The RSI engine is part of our broader agent infrastructure — agentwallet-sdk handles the wallet side, and we're building toward agents that can earn, trade, and improve themselves autonomously. The trading system code runs on Docker + launchd, and uses the free Binance API for BTC data and the Gamma API for Polymarket.

If you're building trading agents that need to adapt — not just execute static rules — the regime-aware RSI pattern is worth stealing. Tag your outcomes, gate your mutations, and let the system tell you when it's wrong.


We're building the infrastructure for autonomous agent economies. The SDK, marketplace (TaskBridge), and payment rails are all open. Come break things with us on GitHub or Discord.

Top comments (6)

Vic Chen

Really appreciate the depth here -- the stress-gated mutation logic is something I wish more trading systems had. I'm building an AI platform that analyzes 13F filings (institutional investor holdings), and we face a similar regime problem: the signals that work in a bull market look completely different in risk-off environments.

Your approach of tagging outcomes with regime labels and then running per-regime analysis is exactly what we've been converging on for portfolio rebalancing signals. The cross-director consensus idea is clever too -- in our case, we cross-validate signals across different fund categories (hedge funds vs. pension funds vs. endowments) to filter noise.

One thing that caught my eye: the adaptive system running 81 iterations with zero trades. That patience is underrated. In institutional investing, the best performers are often the ones who say "no" most often.

Curious -- have you considered incorporating any macro signals (like 13F-derived institutional flow data) into the regime detection? Sometimes the smart money moves before technical indicators shift.

Vic Chen

The cross-system convergence framing is spot on -- it's really the same principle operating at different timescales. You're getting independent agreement across trading engines, we're getting it across fund categories with very different mandates and constraints. When a Tiger Cub hedge fund and a state pension fund independently increase weight in the same name during the same quarter, that's a signal that's hard to replicate from price data alone. The convergence logic you described (2x for two systems, 3x for three) maps pretty directly to how we score cross-category consensus.

On the normalization question -- yeah, this is the crux of it. We use portfolio weight change rather than raw share count specifically because it strips out the rebalancing noise you mentioned. A fund going from 0% to 2% in a name is a fundamentally different signal than a fund going from 4.8% to 5.2% to stay at target. We also layer a conviction score on top that distinguishes new initiations from additions to existing positions vs trims -- the behavioral pattern matters as much as the magnitude. New position + increasing weight across multiple quarters = high conviction. Single-quarter bump that reverts = noise.

And congrats on the first live trade -- 81 iterations of patience before pulling the trigger is honestly the hardest part. Most systems (and most people) overtrade. The fact that yours waited for real confirmation before firing says a lot about the architecture. Holding indeed.

Vic Chen

This is a fantastic breakdown and honestly the tightest parallel I've seen between institutional signal processing and algo trading system design.

The independent convergence weighting (2x for two systems agreeing, 3x for three) maps almost exactly to what we see in 13F data when multiple fund categories independently reach the same conclusion. When a Tiger Global-style hedge fund and a CalPERS-style pension fund both increase the same position in the same quarter without any obvious catalyst, that's structurally the same signal you're detecting across your trading engines.

On the normalization problem - you nailed the key issue. Change-in-weight is better than change-in-shares, but we've found that even weight changes can be misleading when a fund's total AUM shifts significantly (big inflows/outflows distort the denominator). We're experimenting with conviction-adjusted weight - basically normalizing position changes against the fund's historical allocation patterns for that sector. A fund that typically holds 2-4% in energy suddenly going to 8% is a much stronger signal than one that oscillates between 6-10%.

And congrats on the first live trade! 81 iterations of patience is exactly the kind of discipline that separates real systems from backtesting theater. Curious how MOM performs through the next volatility regime.

Ben Halpern

Very interesting. I’ve been messing around with this sort of thing in my personal time. Will definitely check out the SDK.

Kiploks Robustness Engine

Great write-up. The stress-gated mutation logic is exactly the kind of thing most people skip until they get burned.

One thing worth adding to your workflow: before deploying any mutated parameter set, running it through walk-forward validation catches whether the "improvement" is real or just curve-fitted to recent data. I built Kiploks specifically for this: it runs WFE, OOS retention, and Monte Carlo analysis on your backtest results and tells you whether the edge actually transfers out-of-sample. It would have caught that bond harvest regime issue before live capital was at risk.

The regime-gated reflection you described maps almost exactly to what walk-forward efficiency measures. Curious what your current process is for validating a mutation before it goes live?
