# I Built a 6-Agent AI System That Debates Every Trade Decision — It Just Got One Wrong
On March 31, my AI trading system analyzed Amphenol (APH). Six agents debated. Three said buy. Three said sell. The system chose to sit it out.
APH rallied 6% that day.
This is a system I built in 16 days — solo, with Claude as my only teammate. It runs 24/7 across US, China, and Hong Kong markets. Multiple AI agents debate each other before every decision. A guard pipeline can reject trades even when agents agree. Real paper trading through Alpaca.
Here's what I built, what broke, and what I learned from the first real mistake.
## The Numbers
| | Day 1 | Day 16 |
|---|---|---|
| API endpoints | 0 | 315+ |
| Backend services | 0 | 117 |
| Database tables | 0 | 57 |
| Symbols monitored | 0 | 87 (US + China + HK) |
| Dashboard pages | 0 | 19 |
| Lines of code | 0 | 45K Python + 14K React |
| Analysis | — | 6-agent debate with contradiction detection |
| Trading | — | Alpaca paper trading ($100K sim) |
Stack: Python FastAPI sidecar + React 19 dashboard + Claude Opus as the orchestrating brain + Qlib for quantitative scoring. Deployed on a 2-core/2GB Tencent Cloud box via Docker Compose + Cloudflare Tunnel.
## Why Multiple Agents Instead of One Prompt
The first version of the analysis engine was what most people build:
```python
async def analyze(symbol):
    indicators = await get_indicators(symbol)
    score = 5.0  # neutral baseline
    if indicators.rsi > 70:
        score -= 2
    if indicators.macd_cross == "golden":
        score += 1.5
    return {"score": score, "stance": "bullish" if score > 6.5 else "bearish"}
```
This isn't AI analysis. It's if-else wearing a lab coat.
The problem: RSI 70 means completely different things in different contexts. In a strong bull trend, RSI 70 is just a pause. In a ranging market, it's a reversal signal. A rule engine doesn't know the difference — it always subtracts 2 points.
Worse: when indicators contradict each other — technicals bullish but fund flows bearish, sentiment hot but fundamentals weak — the rule engine averages away the contradiction. But contradictions are the most valuable signal. They tell you the market hasn't made up its mind.
So I rebuilt the entire analysis layer as a multi-agent debate system.
## The 6-Agent Architecture
Six specialized analysts, each with their own tools and perspective:
| Agent | Focus | Key Tools |
|---|---|---|
| Technical Analyst | Price action, patterns, momentum | RSI, MACD, Bollinger, volume profile |
| Fundamental Analyst | Earnings, valuation, growth | Financial statements, peer comparison |
| Sentiment Analyst | News, social media, fear/greed | News API, CNN Fear & Greed Index |
| Fund Flow Analyst | Institutional behavior, options | Put/call ratio, dark pool, 13F signals |
| Macro Analyst | Rates, geopolitics, cross-market | Fed policy, VIX, sector rotation |
| Contrarian Analyst | Deliberately challenges consensus | Same tools, opposite framing |
Each agent independently analyzes a symbol and produces a stance (bullish/bearish/neutral), a confidence score, and reasoning.
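In code, each opinion reduces to a small record. The dataclass below is my sketch of that shape — field names and ranges are assumptions, not the system's actual schema:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class AgentOpinion:
    agent: str                                       # e.g. "technical", "macro"
    stance: Literal["bullish", "bearish", "neutral"]
    confidence: float                                # 0.0 - 1.0
    score: float                                     # 0 - 10, bearish to bullish
    reasoning: str                                   # free-text justification

# Illustrative example: a technical analyst reacting to a breakdown
opinion = AgentOpinion(
    agent="technical",
    stance="bearish",
    confidence=0.8,
    score=2.1,
    reasoning="High-volume breakdown below the 50-day moving average.",
)
```

Keeping reasoning as free text matters: it's what the downstream tie-breaker and the decision traces consume, not just the numbers.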
Then comes the interesting part: contradiction detection.
The system doesn't just average the scores. It explicitly identifies where agents disagree and flags the nature of the disagreement:
- Technical vs. Fundamental divergence: Price momentum says buy, but earnings are deteriorating. This is a classic value trap signal.
- Sentiment vs. Fund Flow divergence: Retail is euphoric, but institutions are quietly selling. This is a distribution pattern.
- Macro vs. Everything divergence: Individual stock looks great, but macro environment is hostile. This is a regime risk.
These contradictions get classified and weighted. A 5-1 bullish consensus with the contrarian dissenting is very different from a 3-3 split where technicals and fundamentals are on opposite sides.
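A minimal sketch of that detection-plus-weighting step: the pair-to-label mapping mirrors the divergence patterns named above, while the thresholds, function signature, and return shape are my own illustration.

```python
# Divergence patterns from the article; thresholds below are illustrative.
DIVERGENCE_LABELS = {
    ("technical", "fundamental"): "value_trap_risk",
    ("sentiment", "fund_flow"): "distribution_pattern",
}

def detect_contradictions(stances: dict[str, str]) -> dict:
    """stances: agent name -> 'bullish' | 'bearish' | 'neutral'."""
    bulls = {a for a, s in stances.items() if s == "bullish"}
    bears = {a for a, s in stances.items() if s == "bearish"}

    flags = [
        label
        for (a, b), label in DIVERGENCE_LABELS.items()
        if (a in bulls and b in bears) or (a in bears and b in bulls)
    ]

    # A lone dissenter is mild; a near-even split is severe.
    split = min(len(bulls), len(bears))
    if split == 0:
        severity = "none"
    elif split >= len(stances) // 2:
        severity = "major"
    else:
        severity = "minor"
    return {"flags": flags, "severity": severity}
```

With six agents, a 5-1 split scores `"minor"` while a 3-3 split scores `"major"` — the distinction the weighting depends on.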
## The Guard Pipeline: 11 Checks Before Any Trade
Even when the agents reach consensus, the system doesn't just execute. Every trade passes through a guard pipeline with 11 sequential checks:
1. Market hours check — Is the market actually open?
2. Symbol validity — Is this a tradeable symbol?
3. Position check — Do we already hold this? (No doubling down)
4. Confidence threshold — Is the consensus strong enough?
5. Contradiction severity — Are there unresolved major disagreements?
6. Daily trade limit — Max 3 auto-trades per day
7. Cooldown check — Minimum 2 hours between analyses of the same symbol
8. VIX regime check — Is VIX elevated? If so, only allow sells
9. Portfolio exposure — Would this trade over-concentrate the portfolio?
10. Cost basis check — Is the position size appropriate?
11. Final LLM review — Claude Opus gets one last look at the full context
If any check fails, the trade is rejected with a specific reason code. The rejection itself gets logged — because understanding why the system didn't trade is as valuable as understanding why it did.
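A short-circuiting pipeline with reason codes is compact to sketch. The check names, thresholds, and trade shape below are placeholders, not the production values:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class GuardResult:
    approved: bool
    reason_code: Optional[str] = None  # set when a check rejects

def run_guards(trade: dict,
               checks: list[tuple[str, Callable[[dict], bool]]]) -> GuardResult:
    """Run checks in order; the first failure short-circuits with its code."""
    for code, check in checks:
        if not check(trade):
            return GuardResult(approved=False, reason_code=code)
    return GuardResult(approved=True)

# Illustrative subset of the 11 checks; thresholds are assumptions.
checks = [
    ("MARKET_CLOSED",  lambda t: t["market_open"]),
    ("LOW_CONFIDENCE", lambda t: t["confidence"] >= 0.65),
    ("VIX_REGIME",     lambda t: t["side"] == "sell" or t["vix"] < 25),
    ("DAILY_LIMIT",    lambda t: t["trades_today"] < 3),
]

result = run_guards(
    {"market_open": True, "confidence": 0.7, "side": "buy",
     "vix": 26, "trades_today": 0},
    checks,
)
# Rejected by the VIX regime check: buys are blocked while VIX is elevated.
```

Because every rejection carries a specific code, logging it is free — which is exactly what makes the "why didn't we trade" question answerable later.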
## A Real Decision That Went Wrong
Let me show you what this actually looks like — including when the system gets it wrong.
On March 31, the system flagged Amphenol (APH) as a buy candidate. The signal fusion score was 0.9125 — high enough to rank #3 in the daily screening. Direction: bullish.
Then the six agents debated. The result was a perfect 3v3 split:
| Agent | Score | Stance |
|---|---|---|
| Fundamental | 7.8 | Bullish |
| Fund Flow | 7.5 | Bullish |
| Sentiment | 6.8 | Bullish |
| Macro | 3.4 | Bearish |
| Quant | 3.2 | Bearish |
| Technical | 2.1 | Bearish |
The bull case: strong earnings, institutional money flowing in, positive sentiment. The bear case: technical breakdown with heavy volume, quant factors pointing down, and VIX at 25.98 signaling macro stress.
Weighted score: 4.8/10 — almost perfectly neutral.
The Trader LLM had to break the tie. Its reasoning:
"Despite strong fundamentals, the market regime has shifted bearish. VIX at 25.98, technical analysis shows a high-volume breakdown — systemic risk outweighs individual stock strength. Buying here means catching a falling knife."
Decision: HOLD. No trade executed. Guard pipeline never even triggered — nothing to guard when there's no order.
Then the market proved the system wrong.
APH opened at $122.28, dipped to $121.00, then V-shaped into a close at $126.35. +6.04% on the day. It kept climbing for two more days, reaching +7% from the decision point.
The post-mortem revealed two failures:
1. **The Trader over-weighted macro/technical fear signals in a 3v3 tie.** When the debate is perfectly split, the system defaults to caution. That's by design — but it means the system will systematically miss V-shaped reversals.
2. **A classification bug.** The system labeled this 3v3 split as "ALIGNED" (signals in agreement) when it should have been "DIVERGENT" (major disagreement). This is now fixed.
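The fix boils down to a split-severity rule. A minimal sketch — "ALIGNED" and "DIVERGENT" are the labels from the post, while the third label and the threshold are my assumptions:

```python
def classify_debate(stances: list[str]) -> str:
    """Label a debate outcome by how evenly the agents split."""
    bulls = stances.count("bullish")
    bears = stances.count("bearish")
    if bulls == 0 or bears == 0:
        return "ALIGNED"          # everyone on the same side
    if abs(bulls - bears) <= 1:
        return "DIVERGENT"        # near-even split: major disagreement
    return "MOSTLY_ALIGNED"       # e.g. 5-1 with a lone dissenter

# The APH case: a perfect 3v3 split must classify as divergent.
assert classify_debate(["bullish"] * 3 + ["bearish"] * 3) == "DIVERGENT"
```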
I'm sharing this because the quant space is full of people showing their wins. The losses — and the bugs — are where the real learning happens.
## Why Half the Code Was Wasted
Here's the part nobody writes about.
Of the 45,000 lines of Python, I estimate roughly half was unnecessary. Not buggy — unnecessary. Code that solved problems I didn't actually have, or solved real problems in overcomplicated ways.
Examples:
- Built an elaborate caching layer for market data before realizing the API rate limits were generous enough that caching barely mattered.
- Wrote a custom task scheduler before discovering APScheduler does everything I needed in 20 lines.
- Implemented a complex event sourcing system for trade decisions before realizing a simple `trade_traces` table with JSON columns was more than sufficient.
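For scale, here's how small the simple-table approach is — an in-memory sketch with stdlib `sqlite3`; the column names and sample values are illustrative, not the real schema:

```python
import sqlite3
import json

# One row per decision, with the full agent debate stored as a JSON blob
# instead of a chain of sourced events.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE trade_traces (
        id INTEGER PRIMARY KEY,
        symbol TEXT NOT NULL,
        decided_at TEXT NOT NULL,
        decision TEXT NOT NULL,          -- buy / sell / hold
        agent_opinions TEXT NOT NULL,    -- JSON blob of the full debate
        guard_reason TEXT                -- reason code if a guard rejected
    )
""")
conn.execute(
    "INSERT INTO trade_traces (symbol, decided_at, decision, agent_opinions, guard_reason) "
    "VALUES (?, ?, ?, ?, ?)",
    ("APH", "placeholder-timestamp", "hold",
     json.dumps({"technical": 2.1, "fundamental": 7.8}), None),
)
row = conn.execute(
    "SELECT decision FROM trade_traces WHERE symbol = 'APH'"
).fetchone()
```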
The lesson: Claude Code can generate code incredibly fast. So fast that you'll build things before questioning whether they should exist. The bottleneck was never "can I implement this?" — it was always "should I implement this?"
AI accelerates execution, not judgment. The 16-day timeline is real, but it would have been 10 days if I'd made better decisions about what not to build.
## The Signal Fusion Model
Individual agent scores get combined through a weighted fusion model:
- L1 base score: Qlib quantitative score (30%) + agent debate consensus (25%) + factor robustness score (25%) + remaining signals (20%)
- L2 macro modifier: The base score gets multiplied by a macro environment factor. In a hostile macro regime (VIX > 30, rates rising, cross-market stress), even a strong individual stock signal gets dampened.
- Weight auto-adjustment: Analyst weights aren't static. The system tracks which analysts have been most accurate over rolling windows and gradually shifts weight toward the better performers.
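Putting the L1 weights and L2 multiplier together, the core of the fusion step is a few lines. The weights are the ones stated above; the example inputs and the macro modifier value are illustrative:

```python
def fuse(qlib: float, debate: float, robustness: float, other: float,
         macro_modifier: float) -> float:
    """L1 weighted base score, then the L2 macro multiplier.

    All inputs on a 0-10 scale; macro_modifier in (0, 1] --
    1.0 in a benign regime, below 1 when the regime is hostile.
    """
    l1 = 0.30 * qlib + 0.25 * debate + 0.25 * robustness + 0.20 * other
    return l1 * macro_modifier

# A strong individual-stock signal dampened by a hostile macro regime:
strong = fuse(8.0, 7.5, 7.0, 6.0, macro_modifier=1.0)
dampened = fuse(8.0, 7.5, 7.0, 6.0, macro_modifier=0.6)
```

The multiplicative L2 stage is what encodes "macro vetoes everything": no matter how good the stock-level inputs are, a hostile regime scales the whole score down.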
This matters because alpha is reflexive. A factor that works today will stop working once enough people trade on it. The system needs to detect when an analyst's edge is decaying and reduce its influence before the damage compounds.
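One simple way to implement that decay is a rolling hit rate per analyst, renormalized into weights. This sketch uses an assumed window size and Laplace smoothing — the production logic is surely different:

```python
from collections import deque

class AnalystWeights:
    """Shift weight toward analysts with better recent hit rates."""

    def __init__(self, analysts: list[str], window: int = 50):
        # Fixed-size history per analyst: old outcomes age out automatically.
        self.hits = {a: deque(maxlen=window) for a in analysts}

    def record(self, analyst: str, correct: bool) -> None:
        self.hits[analyst].append(1.0 if correct else 0.0)

    def weights(self) -> dict[str, float]:
        # Laplace-smoothed hit rate, normalized so weights sum to 1.
        raw = {a: (sum(h) + 1) / (len(h) + 2) for a, h in self.hits.items()}
        total = sum(raw.values())
        return {a: r / total for a, r in raw.items()}
```

The fixed window is the decay mechanism: an analyst whose edge fades loses weight within one window, before the damage compounds.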
## Cost and Infrastructure
Running this on a 2C2G cloud box for under $10/day in LLM costs required aggressive model tiering:
| Task | Model | Why |
|---|---|---|
| News classification | Haiku | High volume, low complexity |
| Individual analyst | Sonnet | Good reasoning, reasonable cost |
| Risk manager synthesis | Opus | Needs to weigh contradictions |
| Guard final review | Opus | Highest stakes decision |
The contrarian analyst is deliberately run on the same model tier as the others — you don't want your devil's advocate to be dumber than the team it's challenging.
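The tiering itself can be as simple as a routing table keyed by task type. This sketch mirrors the table above, with models abbreviated to their family names; the task keys and fallback rule are my assumptions:

```python
# Task -> model family, matching the tiering table above.
MODEL_TIERS = {
    "news_classification": "haiku",
    "analyst": "sonnet",
    "contrarian": "sonnet",   # same tier as the analysts it challenges
    "risk_synthesis": "opus",
    "guard_review": "opus",
}

def pick_model(task: str) -> str:
    # Unknown tasks fall back to the cheapest tier by default.
    return MODEL_TIERS.get(task, "haiku")
```

Centralizing the mapping makes the cost ceiling auditable: one dict tells you exactly which decisions are allowed to spend Opus money.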
## What I'd Do Differently
Start with the trace system. I built the analysis engine, the dashboard, the notification system — and only later realized I needed a way to trace every decision from signal to execution to outcome. This should have been table #1, not table #50.
Fewer agents, deeper tools. Six agents sounds impressive, but three well-tooled agents (technical, fundamental, macro) with richer data access would probably outperform six shallow ones. Quality of tools > quantity of perspectives.
Don't build a dashboard until you have data worth showing. I built 19 pages of React dashboard in the first two weeks. Most of them showed placeholder data. The dashboard should have come after a month of running data, not before.
## Current Status
The system is live in paper trading mode. 87 symbols, three markets, 24/7 monitoring. Daily health checks, Feishu (Lark) push notifications for trades and rejections, automated post-trade verification.
Early results are not statistically significant yet — the IC is around 0.032, and annualized returns don't beat a money market fund. I'm being transparent about this because I think the quant space has too many people showing cherry-picked backtests and not enough people showing honest forward-test results.
The system's value right now isn't in its trading performance. It's in the decision traces — understanding how multiple AI agents reason about markets under uncertainty, where they agree, where they disagree, and how often the disagreements predict something the consensus missed.
## What This Is Really About
I'm not building the next Renaissance Technologies. I don't have their data, their compute, their execution infrastructure, or their 30 years of institutional knowledge.
What I'm building is a proof of concept for multi-agent decision systems in adversarial environments. Trading happens to be a domain where decisions are frequent, feedback is fast, and the cost of being wrong is precisely measurable.
The same architecture — multiple specialized agents debating, a guard pipeline that can override consensus, contradiction detection as a first-class signal — applies to any domain where you need AI systems to make real decisions under uncertainty. Medical diagnosis. Legal analysis. Strategic planning.
If you're building multi-agent systems (trading or otherwise), I'm documenting the entire process publicly. The debug logs, the architecture changes, the things that broke. Not a polished tutorial — a live engineering journal.
QuantNova: quantnova.app
Building something similar? I'd love to compare notes.