Daniel Kalski

Originally published at prereason.com

I ran a 202-day experiment to test if AI agents need analysis, not just data

An AI agent with full internet access lost to one receiving 200 tokens of pre-analyzed context. By 9.38 percentage points.

That result came from 24 experiments I ran over 6 months, testing whether AI trading agents perform better with pre-analyzed market briefings or raw data. The answer was consistent across 7 controlled runs and 1,116 total decision points: the analysis layer matters more than the data layer.

Here's the experiment, what broke along the way, and the open-source MCP server that came out of it.

The question

Most financial APIs give you raw data. Price feeds. OHLCV bars. Time series. When you point an LLM at that and ask it to make a decision, it either hallucinates the analysis or wastes tokens doing basic math that could be done server-side.

What if you moved the analysis to the API layer?

// What most APIs return (~15 tokens, no context)
{"price": 94200, "change_24h": -2.1, "volume": 18400000000}

// What a pre-analyzed briefing returns (~200 tokens, decision-ready)
## BTC Quick Check
BTC: $94,200 (-2.1% 24h, +6.5% 7d)
Regime: Risk Off Contraction (75% confidence)
Hash Ribbon: Bearish (12 days)
200D MA: $91,400 (price 3.1% above)
Net Liquidity: $5.77T (+1.1% 30d)
Strongest Signal: Hash Ribbon bearish crossover

The raw number $94,200 tells the agent nothing about whether that's high or low relative to recent history, what the trend direction is, how BTC correlates with macro indicators, or whether the broader market is risk-on or risk-off.

The briefing gives all of that in ~200 tokens. The agent can act immediately instead of spending 2,000 tokens on its own (usually worse) analysis.
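As a back-of-envelope sketch of the token economics (the per-token price below is an illustrative placeholder, not any real model's rate card; the 15 / 200 / 2,000 token figures come from the comparison above):

```python
# Cost of one agent decision: context the agent reads plus analysis it writes.
PRICE_PER_1K = 0.003  # hypothetical $ per 1K tokens

def decision_cost(context_tokens: int, analysis_tokens: int) -> float:
    """Total token spend for a single decision, in dollars."""
    return (context_tokens + analysis_tokens) * PRICE_PER_1K / 1000

raw_feed = decision_cost(15, 2000)  # tiny context, agent does its own analysis
briefing = decision_cost(200, 0)    # larger context, analysis already done
```

Under these placeholder numbers the briefing path is roughly 10x cheaper per decision, before even accounting for the quality gap the experiments measure.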

I built PreReason to test whether this actually helps.

The experiment

I designed a 4-arm randomized controlled trial. Each arm uses the same AI model, the same trading rules, the same portfolio constraints. The only variable is the information each arm receives per tick.

Treatment: 4 PreReason briefings per tick covering mining economics, regime classification, cross-asset breadth, and momentum indicators. About 800 tokens of pre-analyzed context.

Control: BTC price and portfolio state only. About 348 characters. No analysis at all.

Websearch Control: Same agent, same rules, but with full web search access. It can google anything it wants. This arm answers the question: "could the agent just look it up instead of receiving structured briefings?"

Placebo: Same agent, same briefing format and structure, but the data is time-offset by 150-210 days. The agent sees confident-looking analysis that's systematically wrong. This arm controls for template format bias, the possibility that structured markdown helps regardless of whether the data is accurate.
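The four arms can be summarized as data. The field values restate the design above; the structure itself is an illustrative sketch, not the actual experiment harness:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Arm:
    name: str
    context_per_tick: str   # what the agent sees each tick
    data_accurate: bool     # placebo gets confident-looking but stale data

ARMS = [
    Arm("treatment", "4 PreReason briefings (~800 tokens)", True),
    Arm("control", "BTC price + portfolio state (~348 chars)", True),
    Arm("websearch", "price + portfolio state + full web search", True),
    Arm("placebo", "briefings time-offset by 150-210 days", False),
]
```

Everything else (model, trading rules, portfolio constraints) is held constant, so any performance gap isolates the information layer.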

The treatment prompt uses a signal weighting hierarchy so the agent knows what to prioritize:

// Signal weighting guide (abbreviated from treatment prompt)
// 1. REGIME (cross.regime): macro atmosphere.
//    Expansion/Contraction = directional context.
//    TRANSITION = ambiguous, defer to structural signals.
//    Regime is ONE input, not a gate.
//
// 2. STRUCTURAL (200D MA zone + Hash Ribbon):
//    medium-term anchors. Hash ribbon bearish + BTC below
//    200D MA = strong contraction signal even if regime
//    says transition.
//
// 3. TACTICAL (hashprice, price-to-production,
//    USDT.D, divergences): fine-tune sizing and timing.
//
// Position sizing by signal agreement:
//   4+ signals agree = up to 40-50% exposure
//   2-3 signals agree = up to 20-25% exposure
//   single signal     = max 10% exposure
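The sizing rule at the bottom of the guide maps cleanly to a function. A minimal sketch of that mapping (not code from the actual harness):

```python
def max_exposure(agreeing_signals: int) -> float:
    """Cap portfolio exposure by how many signals agree, per the guide above."""
    if agreeing_signals >= 4:
        return 0.50   # 4+ signals agree: up to 40-50% exposure
    if agreeing_signals >= 2:
        return 0.25   # 2-3 signals agree: up to 20-25% exposure
    if agreeing_signals == 1:
        return 0.10   # single signal: max 10% exposure
    return 0.0        # no agreement: stay flat
```

The point of encoding this explicitly is to stop the LLM from sizing positions off a single confident-sounding signal.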

Each run: up to 202 daily ticks over 6 months. Claude as the decision model. $100K starting capital per arm. 0.045% transaction costs (matching Hyperliquid's taker fee). Temperature 0 for reproducibility. All arms process each tick synchronously before the simulation advances.

Full methodology, including the complete prompt templates and portfolio rules: prereason.com/evidence/methodology

The results

Every approved run shows treatment outperformance: +4.46pp to +15.97pp across 7 controlled experiments and 1,116 total ticks.

The progression tells a story. Run 1 showed the first positive signal (+5.28pp). Run 2 confirmed it held across a different AI model (+4.46pp). Run 3 proved template restructuring preserved the edge. Run 4 was the breakthrough, jumping to +14.90pp after rebuilding the analysis architecture. Runs 5-7 confirmed and strengthened the result, with the websearch control arm added in Run 6.

Here's the latest and strongest run:

Run 7 results: Treatment +7.83%, Control -8.14%, Websearch -1.55%, BTC Buy & Hold -34.7%

BTC fell 34.7% during this period. The treatment arm returned +7.83%. That's a +15.97 percentage point edge over the price-only control, and +9.38pp over the agent with full web search access.

The websearch finding is the most interesting result. The model with structured briefings outperformed the model with full internet access. The websearch agent made more trades (95 vs 86) but with worse timing. Web search results are noisy, contradictory, and require the agent to synthesize across sources. Pre-analyzed briefings skip that synthesis step entirely.

The edge is defensive, not predictive. Treatment's entire alpha came from 2 crash phases. During a mid-period selloff (BTC -20.8%), treatment gained +4.05%. During a late crash (BTC -27.3%), treatment gained +10.35%. In all other phases (rallies, recoveries, sideways chop), treatment underperformed or matched control. The briefings helped the agent get out early, not get in better. Treatment short accuracy was 65% vs control's 50%.

Stale data was worse than no data. The placebo arm, receiving confident-looking analysis that was systematically wrong, performed worse than even the price-only control. The placebo made more trades with worse outcomes because it acted confidently on wrong signals.

Full results across all 7 runs, with a tick-by-tick viewer: prereason.com/evidence

What went wrong

24 experiments. Most of them failed. The failures shaped the API more than the successes did.

"More data made things worse"

In the very first experiment, treatment was the worst arm, finishing 2.00pp behind the price-only control.

The agent received briefings with no guardrails, no signal hierarchy, no position sizing rules. It overthought every tick. Conflicting signals (bearish macro, bullish hash ribbon) paralyzed it. The control arm, working with nothing but a price number, made simple decisions and outperformed.

The signal weighting guide, the position sizing rules, the cooldown periods: all of that scaffolding exists because the first run proved that dumping context on an LLM without structure makes things worse, not better.

"Enough data to be confidently wrong"

An early run. The agent saw "contraction regime, 75% confidence" combined with "hash rate -15.2% over 7 days." It correctly identified bearish macro conditions. Then it went aggressively short at $109K. BTC rallied to $122K. The shorts got squeezed. Treatment fell 9.6pp behind control. I aborted the run at tick 34.

The templates were accurate. The macro was contracting. But BTC rallies in macro contractions all the time. The template framing presented "regime: contraction" as an alarm signal, and the LLM interpreted alarm signals as "act now, act big."

I rewrote every template to present regime as "background context, not a trading signal." That single framing change was the difference between the first experiment (-2.00pp) and Run 4 (+14.90pp).

"32 regime flips in 202 ticks"

Even in the strongest run (Run 7), the regime classifier was noisy. It flipped between expansion, contraction, and transition 32 times across 202 ticks, roughly once every 6 days. The treatment arm whipsawed in sideways markets, giving back gains on 15 of 17 short campaigns.

Only 2 of those 17 short campaigns were profitable. But those 2 were the big crash phases, and they were sized large enough to carry the entire run.

The fix: widening the classifier's noise bands, raising the score threshold required for a regime change, and replacing static thresholds with dynamic bands that adapt to current volatility.
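A minimal sketch of the hysteresis idea behind that fix (the thresholds and score scale are illustrative; the real classifier also adapts its bands to current volatility):

```python
def classify_regime(score: float, previous: str,
                    enter_expansion: float = 0.6,
                    enter_contraction: float = -0.6) -> str:
    """Sticky regime label: flip only when the score clears a wide band,
    otherwise keep the previous regime to suppress whipsaw."""
    if score >= enter_expansion:
        return "expansion"
    if score <= enter_contraction:
        return "contraction"
    return previous  # inside the noise band: hold the prior label

# A noisy score series now produces 1 flip instead of 5 sign changes:
labels, regime = [], "transition"
for s in [0.7, 0.2, -0.1, 0.3, -0.7, -0.2, 0.1]:
    regime = classify_regime(s, regime)
    labels.append(regime)
```

Raising the flip threshold trades responsiveness for stability, which is the right trade when 15 of 17 short campaigns were losing whipsaws.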

Each failure shaped the API design. Dampened signals. Explicit "this is context, not a directive" framing. Noise filtering. The product is the scar tissue from 24 experiments.

The full research writeup, including what went wrong across all 24 runs: prereason.com/evidence/research

Try it

MCP server (one JSON block)

For Claude Desktop, Cursor, Windsurf, or any MCP-compatible client:

{
  "mcpServers": {
    "prereason": {
      "command": "npx",
      "args": ["-y", "@prereason/mcp"],
      "env": {
        "PREREASON_API_KEY": "your_key_here"
      }
    }
  }
}

This gives your agent 5 tools:

get_context(briefing)   Full market briefing (17 available)
get_metric(metric)      Single metric with trend analysis
list_briefings()        Available briefings for your tier
list_metrics()          30 available individual metrics
get_health()            API status and version

Free tier works without an API key. Just run npx -y @prereason/mcp and you get 6 briefings and 30 metrics. The MCP server is MIT-licensed and open source on GitHub.

REST API

curl -H "Authorization: Bearer YOUR_KEY" \
  "https://api.prereason.com/api/context?briefing=btc.quick-check"

Response (abbreviated):

## BTC Quick Check
BTC: $87,420 (+3.2% 7d) | 200D MA: $91,400 (4.3% below)
Regime: Transition (55% confidence)
Hash Ribbon: Neutral (30d MA near 60d MA)
Net Liquidity: $5.77T (+1.1% 30d, stable)
USDT.D: 5.8% (neutral band, no strong risk signal)
Strongest Signal: Price below 200D MA (watch for recross)

## Trend Summary
| Metric | 7d | 30d | Signal |
|--------|-----|------|--------|
| BTC Price | +3.2% | -8.1% | Mixed |
| Hash Rate | +1.8% | +4.2% | Bullish |
| Net Liquidity | +0.3% | +1.1% | Neutral |

17 briefings across Bitcoin on-chain, macro, and cross-asset domains. 6 free, 5 basic ($19.99/mo), 6 pro ($49.99/mo). Free tier: 60 req/hr, no credit card.

x402 micropayments (no signup)

AI agents with crypto wallets can pay per call using the x402 protocol. $0.01 USDC on Base. No API key, no signup, no human in the loop.

Agent:  GET /api/context?briefing=btc.quick-check
API:    402 Payment Required
        X-Payment-Address: 0x...
        X-Payment-Amount: 10000 (0.01 USDC)
Agent:  Signs $0.01 USDC on Base
Agent:  Retries with X-Payment-Proof header
API:    200 OK (briefing content)

This is designed for autonomous agents that need financial context without a human setting up billing first.
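From the client side, the exchange above is a pay-and-retry loop. This sketch takes the HTTP transport and the wallet signer as parameters, since both depend on the agent's stack; sign_usdc_payment is a hypothetical stand-in for a real wallet call:

```python
def fetch_with_x402(get, sign_usdc_payment, url: str) -> str:
    """GET a resource; on 402, pay the quoted amount and retry with proof."""
    status, headers, body = get(url, {})
    if status == 402:
        proof = sign_usdc_payment(
            address=headers["X-Payment-Address"],
            amount=int(headers["X-Payment-Amount"]),  # 10000 = $0.01 USDC (6 decimals)
        )
        status, headers, body = get(url, {"X-Payment-Proof": proof})
    if status != 200:
        raise RuntimeError(f"request failed with status {status}")
    return body
```

The whole loop is two HTTP round trips, which is why no API key or billing setup is needed.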


7 controlled experiments, one consistent finding: the analysis layer matters more than the data layer for AI agent decisions.

If your agent is making financial decisions on raw numbers, it's probably doing worse analysis than a 200-token briefing could provide.

Free tier: 6 briefings, 30 metrics, 60 req/hr. No credit card.
