We lost $100 on a single hockey bet. Our system had an 83% win rate at the time. The math still didn't work — five winning trades totaling $12.78 couldn't survive one bad position size. That's when I stopped treating self-improvement as a nice-to-have and started building it into the trading engine itself.
The Problem Nobody Talks About
Every "AI trading bot" article you read in March 2026 shows the same pattern: backtest on historical data, pick some indicators, deploy, pray. Coinrule, 3Commas, Agent Factory — they all let you drag and drop RSI thresholds and moving averages. Fine for a weekend project.
But here's what breaks: markets change. The strategy that prints money in a trending BTC rally goes silent — or worse, bleeds — in a choppy range. And no amount of backtesting on 2024 data prepares you for what March 2026 actually looks like.
We needed something different. A system that watches its own performance, figures out why it's failing, and rewrites its own parameters without us touching the code.
What We Actually Built
The core is an RSI engine — not the RSI indicator, but a Recursively Self-Improving engine. It sits on top of our trading strategies and runs a continuous loop:
Log → Reflect → Hypothesize → Mutate → Verify
Every trade outcome gets logged with its context: what strategy fired, what the market regime was (bull, bear, range, crisis), the entry/exit conditions, and the P&L. After enough outcomes accumulate, the engine reflects — it runs statistical analysis across the outcomes and looks for patterns.
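Each logged outcome needs enough structure for the reflection step to slice by regime and strategy later. A minimal record might look like this (the field names are my illustration, not the engine's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class TradeOutcome:
    strategy: str          # which strategy fired
    regime: str            # "bull", "bear", "range", or "crisis"
    action: str            # e.g. "bond_harvest", "trend_long"
    entry: float
    exit: float
    pnl: float
    context: dict = field(default_factory=dict)  # entry/exit conditions, market state

    @property
    def is_win(self) -> bool:
        return self.pnl > 0

outcome = TradeOutcome("btc_perp", "bull", "trend_long", 62_000.0, 62_900.0, 47.0)
print(outcome.is_win)  # True
```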
Here's a simplified version of the reflection logic:
```python
def reflect(self, stress_level=0.0):
    outcomes = self._load_recent_outcomes()
    stats = self._analyze_outcomes(outcomes)

    # Regime-aware analysis -- don't blame a strategy
    # for losing in a regime it wasn't designed for
    for regime, regime_stats in stats['by_regime'].items():
        for action, action_stats in regime_stats.items():
            if action_stats['win_rate'] < self.threshold:
                self._generate_hypothesis(action, regime, action_stats)

    # Stress-gated mutations: when drawdown is high,
    # raise the bar for accepting changes
    mutation_threshold = self.base_threshold
    if stress_level > 0.5:
        mutation_threshold *= 1.5   # 50% harder to mutate
    if stress_level > 0.8:
        mutation_threshold = 0.20   # need 20% improvement proof
    self.mutation_threshold = mutation_threshold
```
The stress gating was a hard-won lesson. During a drawdown, the engine used to panic-mutate — changing parameters rapidly, which made things worse. Now it gets more conservative when things are rough. Like a human trader who tightens up instead of revenge-trading.
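Where does the stress level come from? One natural source is drawdown. Here's a sketch of the idea (my illustration, not the engine's exact formula; the 10% tolerance is an assumed figure):

```python
def stress_from_drawdown(equity: float, peak_equity: float,
                         max_tolerated_dd: float = 0.10) -> float:
    """Map current drawdown onto a 0..1 stress scale.

    0.0 at the equity peak, 1.0 once drawdown reaches the maximum
    tolerated level (assumed 10% here).
    """
    if peak_equity <= 0:
        return 0.0
    drawdown = max(0.0, (peak_equity - equity) / peak_equity)
    return min(1.0, drawdown / max_tolerated_dd)

print(stress_from_drawdown(9_500, 10_000))  # 0.5 -- a 5% drawdown is halfway to the cap
```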
The Regime Problem (and How We Solved It)
This was our biggest breakthrough. We had a bond harvesting strategy on Polymarket that won 83% of the time. Sounds great, right? Except it only works when there are actual bond-like markets available — high-probability outcomes trading near $0.95+.
When the market dried up (zero bond candidates for 4+ days in March), the strategy just sat there. Meanwhile, our BTC perpetuals trader was killing it with a trend-following strategy — 87.5% win rate, +4.7% ROI in under 22 hours.
The difference? The BTC system had regime detection built in. It tracks ADX for trend strength, uses EMA alignment for direction, and literally refuses to trade when conditions don't match:
```python
# TCL correctly skipping: R:R of 0.11 < 0.40 threshold
# TP $397 vs SL $3,533 -- this is correct behavior
if risk_reward < self.min_rr_threshold:
    self.log(f"Skipping: poor R:R {risk_reward:.2f}")
    return None
```
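Assuming ADX and the EMAs are already computed, the regime check itself can be small. A sketch of the ADX-plus-EMA-alignment idea (the thresholds of 25 and 15 are common defaults, not necessarily what the live system uses):

```python
def classify_regime(adx: float, ema_fast: float, ema_slow: float,
                    trend_adx: float = 25.0, chop_adx: float = 15.0) -> str:
    """Illustrative regime tags from ADX strength + EMA alignment.

    Thresholds here are conventional defaults, assumed for the sketch.
    """
    if adx < chop_adx:
        return "range"       # too weak to trust any direction
    if adx >= trend_adx:
        return "bull" if ema_fast > ema_slow else "bear"
    return "transition"      # trend forming, stay cautious

print(classify_regime(adx=31.0, ema_fast=63_100, ema_slow=62_400))  # bull
```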
Saying "no" is the most important thing our system learned to do.
We formalized this into regime tagging on March 3rd. Every outcome now carries a regime label. The reflection engine analyzes performance per regime — so a strategy that wins 80% in bull markets but loses 70% in bear markets doesn't get a flat "55% win rate" average. Instead, it gets a regime-gated mutation: "avoid this action in bear regime."
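Per-regime aggregation is essentially a small group-by. A sketch of how the 80%-bull / 30%-bear split survives instead of collapsing into one flat average (the dict schema is assumed for illustration):

```python
from collections import defaultdict

def win_rates_by_regime(outcomes):
    """outcomes: iterable of dicts with 'regime' and 'pnl' keys (assumed schema)."""
    buckets = defaultdict(lambda: [0, 0])   # regime -> [wins, total]
    for o in outcomes:
        buckets[o["regime"]][0] += o["pnl"] > 0
        buckets[o["regime"]][1] += 1
    return {r: wins / total for r, (wins, total) in buckets.items()}

outcomes = ([{"regime": "bull", "pnl": 1.0}] * 8 + [{"regime": "bull", "pnl": -1.0}] * 2
            + [{"regime": "bear", "pnl": 1.0}] * 3 + [{"regime": "bear", "pnl": -1.0}] * 7)
print(win_rates_by_regime(outcomes))  # {'bull': 0.8, 'bear': 0.3}
```

A flat average over those 20 trades would report 55% and hide the asymmetry entirely.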
Cross-Director Consensus
We run five trading systems simultaneously: Polymarket paper trader, BTC perpetuals, BTC range trader, BTC adaptive trend, and an edge scanner. Each has its own RSI engine logging outcomes and generating mutations.
The question was: when one system learns something, should the others listen?
We built a cross-validation method:
```python
@classmethod
def cross_validate_lesson(cls, engines: list, lesson: str) -> float:
    """Returns weight: 1.0 (single), 2.0 (2 agree), 3.0 (3+ agree)"""
    agreeing = 0
    for engine in engines:
        if engine.has_similar_lesson(lesson, threshold=0.30):
            agreeing += 1
    return min(float(agreeing), 3.0)
```
If the BTC perp system learns "don't trade when ADX < 15" and the range system independently discovers the same thing — that lesson gets 2x weight. Three systems agreeing? 3x. It's a crude consensus mechanism, but it catches real patterns that any single system might dismiss as noise.
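The `has_similar_lesson` check can be a plain fuzzy string match. Here's a stand-in built on stdlib `difflib` (the engine's actual similarity metric isn't shown above; the 0.30 threshold comes from the snippet):

```python
from difflib import SequenceMatcher

def has_similar_lesson(known_lessons, lesson: str, threshold: float = 0.30) -> bool:
    """Sketch of a fuzzy 'same lesson?' check using SequenceMatcher.

    SequenceMatcher.ratio() is one cheap stand-in for whatever the
    real engine uses to compare lesson strings.
    """
    return any(
        SequenceMatcher(None, lesson.lower(), known.lower()).ratio() > threshold
        for known in known_lessons
    )

lessons = ["don't trade when ADX < 15"]
print(has_similar_lesson(lessons, "avoid trading when ADX < 15"))  # True
```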
What Actually Happened: The Numbers
After 6 days of running with RSI:
| System | Balance | ROI | Win Rate | Trades |
|---|---|---|---|---|
| BTC Perp | $10,467 | +4.7% | 87.5% | 8 |
| BTC Range | $9,958 | -0.4% | 51% | 126 |
| Polymarket V4 | $913 | -8.7% | 83% | 6 |
| BTC Adaptive | $10,000 | 0.0% | — | 0 |
The BTC perpetuals trader is our best performer. Its breakeven-stop mechanism — moving the stop-loss to entry price after a 0.5% favorable move — has produced zero losing TCL trades. Every exit is either breakeven or profitable. The bandit allocator correctly identified TCL as the dominant strategy (95% allocation) over SMOG (5%).
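The breakeven-stop rule is simple to state: once price moves 0.5% in your favor, ratchet the stop up to entry. A sketch for the long side (the live system's order plumbing is obviously more involved):

```python
def update_stop_long(entry: float, current_price: float, stop: float,
                     trigger_pct: float = 0.005) -> float:
    """Move the stop to breakeven after a 0.5% favorable move (long side).

    The stop only ratchets toward entry, never back down.
    """
    if current_price >= entry * (1 + trigger_pct):
        return max(stop, entry)   # worst case from here is breakeven
    return stop

new_stop = update_stop_long(entry=62_000.0, current_price=62_400.0, stop=60_500.0)
print(new_stop)  # 62000.0 -- price is +0.65%, past the 0.5% trigger
```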
The Polymarket system is underwater, but that's one bad trade, not a bad system. Position sizing fix deployed — max bet dropped from $100 to ~$45. The 83% win rate is real.
The Adaptive Trend system (inspired by arXiv:2602.11708) has been running 81 iterations with zero trades. It's waiting for a momentum signal on 6-hour candles — MOM > 3% for long, MOM < -4% for short. This is correct. The paper showed a Sharpe of 2.41 by being extremely selective.
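The momentum gate is equally terse. A sketch of the asymmetric entry rule (thresholds are from the text; the candle sourcing is assumed):

```python
def momentum_signal(close_now: float, close_6h_ago: float,
                    long_thresh: float = 0.03, short_thresh: float = -0.04):
    """Asymmetric momentum gate: long above +3%, short below -4%, else no trade.

    Thresholds match the article; how the 6-hour candles arrive is assumed.
    """
    mom = close_now / close_6h_ago - 1.0
    if mom > long_thresh:
        return "long"
    if mom < short_thresh:
        return "short"
    return None   # most of the time: sit on hands

print(momentum_signal(64_000.0, 62_000.0))  # long -- a +3.2% move
```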
The Mutation That Mattered Most
54 outcomes logged across all systems. 15 reflections completed. 14 mutations applied.
The single most impactful mutation: bond_harvest_regime_gate. The RSI engine discovered that bond harvesting underperforms in range-bound markets (which is exactly what Polymarket has been — no clear event catalysts). It generated a mutation that raises the entry threshold during range regimes.
Meanwhile, the mutation we almost applied but didn't — lowering the settlement_edge entry threshold from 15% to 10% — got blocked by the stress gate. The system was in drawdown. Loosening filters during a drawdown is exactly the kind of thing that feels right and is actually catastrophic.
What I'd Do Differently
Start with regime detection, not indicators. We spent weeks tuning RSI and Bollinger Band parameters before realizing the real question is "what kind of market is this?" Once you know the regime, indicator settings almost choose themselves.
Log everything from day one. Our early trades have sparse metadata. Now every outcome includes: regime, strategy, entry/exit conditions, market context, and P&L. The richer the data, the better the reflections.
Don't let the system mutate under stress. This single rule — raising the mutation threshold during drawdowns — prevented at least three bad parameter changes.
Try It
The RSI engine is part of our broader agent infrastructure — agentwallet-sdk handles the wallet side, and we're building toward agents that can earn, trade, and improve themselves autonomously. The trading system code runs on Docker + launchd, uses Binance free API for BTC data and Gamma API for Polymarket.
If you're building trading agents that need to adapt — not just execute static rules — the regime-aware RSI pattern is worth stealing. Tag your outcomes, gate your mutations, and let the system tell you when it's wrong.
We're building the infrastructure for autonomous agent economies. The SDK, marketplace (TaskBridge), and payment rails are all open. Come break things with us on GitHub or Discord.
Top comments (4)
Very interesting. I've been messing around with this sort of thing in my personal time. Will definitely check out the SDK.

Really appreciate the depth here -- the stress-gated mutation logic is something I wish more trading systems had. I'm building an AI platform that analyzes 13F filings (institutional investor holdings), and we face a similar regime problem: the signals that work in a bull market look completely different in risk-off environments.
Your approach of tagging outcomes with regime labels and then running per-regime analysis is exactly what we've been converging on for portfolio rebalancing signals. The cross-director consensus idea is clever too -- in our case, we cross-validate signals across different fund categories (hedge funds vs. pension funds vs. endowments) to filter noise.
One thing that caught my eye: the adaptive system running 81 iterations with zero trades. That patience is underrated. In institutional investing, the best performers are often the ones who say "no" most often.
Curious -- have you considered incorporating any macro signals (like 13F-derived institutional flow data) into the regime detection? Sometimes the smart money moves before technical indicators shift.
Great write-up. The stress-gated mutation logic is exactly the kind of thing most people skip until they get burned.
One thing worth adding to your workflow: before deploying any mutated parameter set, run it through walk-forward validation to catch whether the "improvement" is real or just curve-fitted to recent data. I built Kiploks specifically for this: it runs WFE, OOS retention, and Monte Carlo on your backtest results and tells you whether the edge actually transfers out-of-sample. It would have caught that bond harvest regime issue before live capital.
The regime-gated reflection you described maps almost exactly to what walk-forward efficiency measures. Curious what your current process is for validating a mutation before it goes live?
Thanks for this - and you've hit on the exact gap I glossed over in the article.
You're right that our current _test_mutation() only validates against the most recent 20 in-sample outcomes before accepting a parameter change. That's essentially asking 'did this mutation improve performance on the same data it was trained on?' - a much weaker test than it sounds.
Walk-forward validation is the honest version. Train on N outcomes, validate on the next M, roll forward. If the edge doesn't transfer out-of-sample, the mutation doesn't ship. Our cross-director consensus (multiple independent systems converging on the same parameter change) is a partial proxy for this - but it's not a proper OOS holdout.
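For concreteness, the rolling split I mean looks like this (a sketch of the splitting logic only, not Kiploks or our production code; window sizes are illustrative):

```python
def walk_forward_splits(n_outcomes: int, train: int = 30, test: int = 10):
    """Yield (train_range, test_range) index pairs, rolling forward.

    Train on N outcomes, validate on the next M, then roll the window.
    A mutation ships only if it improves the test windows, not the train ones.
    """
    start = 0
    while start + train + test <= n_outcomes:
        yield (range(start, start + train),
               range(start + train, start + train + test))
        start += test

splits = list(walk_forward_splits(54))   # 54 outcomes, as in the article
print(len(splits))  # 2 rolling folds
```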
The Monte Carlo piece matters for us specifically because our sample sizes are small (8-54 trades per system right now). Standard significance tests break down there. Monte Carlo gives you a distribution of outcomes rather than a point estimate you can't actually trust.
Going to look at Kiploks - WFE + OOS retention is exactly what we need before mutating any numeric entry threshold. Regime-gate mutations (directional flags like 'avoid bond harvest in range regime') we auto-accept since the downside is a missed trade, not a blown risk limit. But numeric param changes? Your point stands.
One question: what's your typical in-sample/out-of-sample split for systems with small trade counts? At 50-100 outcomes I'm not sure the standard 70/30 gives enough OOS data to be meaningful.