Ivan Valmont

Building a Multi-Agent AI Forecasting System: Architecture Deep Dive

Most "AI prediction" systems are a single prompt wrapped in a REST API. I wanted to build something closer to how intelligence agencies actually work — multiple analysts with different expertise, a red team that challenges every conclusion, and honest accuracy metrics.

This is the architecture behind Seldon Vault, a multi-agent forecasting system named after Hari Seldon, a character from Isaac Asimov's Foundation series (first published in 1942). Seldon invented "psychohistory": a mathematical framework that could predict the behavior of large populations statistically, the way gas dynamics predicts molecules. Not individual fates, but macro trends.

The idea always felt like the most elegant concept in science fiction. And with LLMs, real-time data feeds, and cheap Bayesian computation, we now have the building blocks to at least attempt it. This system is that attempt — a very humble one.

The Problem with Single-Prompt Prediction

Give an LLM a bunch of news articles and ask "what's going to happen?" You'll get a coherent, confident, and usually useless answer. One model with one prompt suffers from anchoring bias — it latches onto the first plausible hypothesis and builds a narrative around it.

The fix isn't a better prompt. It's more perspectives.

System Architecture

12 sources: RSS, Reddit, Telegram, Bluesky, Polymarket, Metaculus,
            FRED, Fear & Greed, ACLED, UCDP, GDACS, GDELT
        │
   Signal Processor (DeepSeek — cheap, fast)
   → classify: immediate vs. structural
   → score importance, extract entities
        │
   ┌────┴─────────────────────────────────────┐
   │        7 Analyst Agents (parallel)       │
   │ Geopolitics │ Economics │ Technology     │
   │ Sociology │ Climate │ Military │ Cyber   │
   └────┬─────────────────────────────────────┘
        │
   Skeptic Agent (Claude Opus + Tavily Search)
   → fact-check, find counter-evidence
   → risk score < 50 = auto-reject
        │
   Seldon Arbiter (Claude Opus)
   → synthesize top 5 forecasts
   → bilingual output (EN + RU)
   → detect cascade narratives
        │
   PostgreSQL + Redis + SSE

Seldon Vault architecture: signals → 7 analysts → skeptic with veto power → arbitrator. Red crosses indicate forecasts that failed the skeptic's test.

The LLM Factory

Every AI call goes through a unified LLMFactory that abstracts provider differences:

# Simplified — actual code handles retries, fallbacks, proxy
class LLMFactory:
    @staticmethod
    def create(provider: str, config: dict) -> BaseLLMProvider:
        providers = {
            "openai": OpenAIProvider,
            "anthropic": AnthropicProvider,
            "google": GoogleProvider,
            "deepseek": DeepSeekProvider,
        }
        return providers[provider](config)

    async def generate_with_retry(self, prompt, config):
        # self.primary / self.fallback are providers built via create()
        try:
            return await self.primary.generate(prompt)
        except LLMError:
            return await self.fallback.generate(prompt)

Provider configs are centralized in YAML — no hardcoded model names in agent code:

# configs/seldon.yaml
llm_config:
  signal_processor:
    provider: deepseek
    model: deepseek-chat
    temperature: 0.3
    max_tokens: 2000
  analysts:
    provider: deepseek
    model: deepseek-chat
    temperature: 0.7
    max_tokens: 4000
  skeptic:
    provider: anthropic
    model: claude-opus-4-6
    temperature: 0.4
    max_tokens: 3000
  seldon:
    provider: anthropic
    model: claude-opus-4-6
    temperature: 0.6
    max_tokens: 5000

Cost optimization: the Signal Processor and analysts run on DeepSeek's cheap deepseek-chat model, while the Skeptic and Arbiter use Claude Opus for stronger reasoning and fact-checking.
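Wiring a role to its model is then just a dictionary lookup on the parsed config. A minimal sketch, assuming the YAML above has already been loaded (e.g. with `yaml.safe_load`) into a dict; `resolve_llm` is an illustrative helper, not the project's actual code:

```python
# Mirrors the llm_config section of configs/seldon.yaml after parsing.
llm_config = {
    "signal_processor": {"provider": "deepseek", "model": "deepseek-chat",
                         "temperature": 0.3, "max_tokens": 2000},
    "skeptic": {"provider": "anthropic", "model": "claude-opus-4-6",
                "temperature": 0.4, "max_tokens": 3000},
}

def resolve_llm(role: str, config: dict = llm_config) -> dict:
    """Return the provider settings for a given agent role."""
    try:
        return config[role]
    except KeyError:
        raise ValueError(f"No LLM config for role {role!r}") from None

print(resolve_llm("skeptic")["provider"])  # anthropic
```

The payoff is that swapping the Skeptic to a different model is a one-line YAML change; no agent code knows a model name.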

The Skeptic Pattern

The most valuable architectural decision was the Skeptic agent. Every proposed forecast must survive systematic doubt before publication.

# Simplified skeptic flow
async def review_forecast(self, forecast: dict) -> SkepticReview:
    # 1. Fact-check key claims via Tavily Search
    fact_checks = await self.tavily.search(forecast["key_claims"])

    # 2. Generate critique with counter-evidence
    review = await self.llm.generate(
        prompt=self.build_skeptic_prompt(forecast, fact_checks)
    )

    # 3. Auto-reject if risk too high
    if review.risk_score < 50:
        review.verdict = "REJECTED"

    return review

The Skeptic's risk score threshold (50) is deliberately aggressive. Better to miss a valid forecast than to publish a hallucinated one.

Bayesian Updates

Forecasts aren't one-shot. Every 6 hours, the system re-evaluates:

def bayesian_update(prior: float, evidence_strength: float,
                     evidence_direction: float) -> float:
    """
    prior: current probability (0.05 - 0.95)
    evidence_strength: how strong the new evidence is (0.0 - 1.0)
    evidence_direction: positive (>0.5) or negative (<0.5)
    """
    if evidence_strength < MIN_EVIDENCE_STRENGTH:  # 0.3
        return prior  # ignore weak signals

    likelihood_ratio = evidence_direction / (1 - evidence_direction)
    posterior = (prior * likelihood_ratio) / \
                (prior * likelihood_ratio + (1 - prior))

    # Cap daily shift at ±15pp
    max_shift = MAX_DAILY_SHIFT  # 0.15
    posterior = max(prior - max_shift, min(prior + max_shift, posterior))

    # Clamp to 5-95% range
    return max(0.05, min(0.95, posterior))

The MAX_DAILY_SHIFT cap is critical. Without it, a single alarming headline could swing a probability from 30% to 80% in one cycle. Real analysts don't panic-update; neither should the system.
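To see the cap in action, here is the function restated with its constants so the arithmetic can be checked end to end (the input numbers are illustrative):

```python
MIN_EVIDENCE_STRENGTH = 0.3
MAX_DAILY_SHIFT = 0.15

def bayesian_update(prior, evidence_strength, evidence_direction):
    if evidence_strength < MIN_EVIDENCE_STRENGTH:
        return prior  # ignore weak signals
    likelihood_ratio = evidence_direction / (1 - evidence_direction)
    posterior = (prior * likelihood_ratio) / \
                (prior * likelihood_ratio + (1 - prior))
    # Cap daily shift at ±15pp, then clamp to 5-95%
    posterior = max(prior - MAX_DAILY_SHIFT, min(prior + MAX_DAILY_SHIFT, posterior))
    return max(0.05, min(0.95, posterior))

# Strong supporting evidence on a 30% forecast:
# LR = 0.8 / 0.2 = 4, raw posterior = 1.2 / 1.9 ≈ 0.63 — the cap holds it to 0.45.
print(round(bayesian_update(0.30, 0.9, 0.8), 2))  # 0.45

# Weak evidence (strength 0.2 < 0.3) is ignored entirely:
print(bayesian_update(0.30, 0.2, 0.9))  # 0.3
```

Without the cap, that single strong signal would have moved the forecast 33 points in one cycle instead of 15.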

Cascade Narratives

Forecasts link into causal chains:

Sanctions on chip exports (70%)
    └── TSMC Arizona delays (55%) [strength: 0.7]
         └── SE Asia AI slowdown (45%) [strength: 0.5]

When an upstream event resolves, downstream probabilities shift:

def propagate_cascade(resolved_forecast, delta, graph):
    """
    delta: change in resolved forecast probability
    graph: narrative links (source → target with strength)
    """
    for link in graph.get_downstream(resolved_forecast):
        shift = (delta * link.strength *
                 link.conditional_shift *
                 (DAMPENING ** link.depth))  # 0.5 per hop

        if link.depth <= MAX_CASCADE_DEPTH:  # 3
            update_probability(link.target, shift)

Dampening (0.5 per hop) prevents runaway cascades. Max depth 3 keeps the graph manageable.
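Plugging the example chain into the shift formula shows how quickly dampening attenuates a cascade. A sketch with illustrative numbers, assuming `conditional_shift` is 1.0 for both links:

```python
DAMPENING = 0.5

def cascade_shift(delta, strength, conditional_shift, depth):
    """Probability shift applied to a downstream forecast, per the formula above."""
    return delta * strength * conditional_shift * (DAMPENING ** depth)

# Chip-export sanctions resolve TRUE: upstream jumps 70% → 100%, so delta = +0.30.
# Direct child (TSMC delays, depth 1, link strength 0.7):
print(round(cascade_shift(0.30, 0.7, 1.0, 1), 3))   # 0.105
# Grandchild (SE Asia slowdown, depth 2, link strength 0.5):
print(round(cascade_shift(0.30, 0.5, 1.0, 2), 4))   # 0.0375
```

A 30-point upstream jump becomes a 10.5-point nudge one hop down and under 4 points two hops down, which is exactly the containment the dampening factor is there for.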

Agent Calibration Feedback Loop

Every agent gets its own accuracy report injected into its prompt:

Your calibration data (last 30 days):
- Sector Technology: avg Brier Score 0.28 (12 resolved forecasts)
- High-confidence predictions (>75%): correct 3/5 times
- Your forecasts tend to overestimate probability by ~8pp
Adjust your confidence levels accordingly.

This isn't fine-tuning — it's prompt-based calibration. The agent sees its own track record and (in theory) adjusts. Early results are promising but the sample size is still small.
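The Brier scores in that report are simple to compute: the mean squared error between the predicted probability and the binary outcome. A minimal sketch (the sample data is made up):

```python
def brier_score(forecasts):
    """Mean squared error between predicted probability and outcome (0 or 1).
    0.0 is perfect; always predicting 50% scores 0.25."""
    return sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)

# (predicted probability, actual outcome) pairs for one agent's sector:
resolved = [(0.80, 1), (0.70, 0), (0.60, 1), (0.90, 1), (0.75, 0)]
print(round(brier_score(resolved), 2))
```

Anything persistently above 0.25 means the agent would lose to a coin flip, which is the kind of fact worth injecting into its prompt.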

Tech Stack

| Layer | Choice | Why |
|---|---|---|
| Backend | FastAPI + async SQLAlchemy | Non-blocking DB + LLM calls |
| Database | PostgreSQL 16 + JSONB | Agent analyses stored inline, no complex joins |
| Task Queue | Celery + Redis Beat | Daily pipeline at 08:00 UTC, updates every 6h |
| Frontend | React 19 + Vite 7 + Tailwind | SPA with SSE for real-time updates |
| LLM | Multi-provider factory | Swap models via YAML config |
| Search | Tavily API | Skeptic fact-checking |
| Visualization | D3 (force graph) + Recharts | Cascade narratives + probability charts |
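On the SSE choice: the frontend just consumes `text/event-stream` messages, and the wire format is simple enough to sketch. This helper is illustrative, not the project's actual code:

```python
import json

def sse_event(event: str, data: dict) -> str:
    """Format one server-sent event per the text/event-stream wire format."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

msg = sse_event("forecast_update", {"id": 42, "probability": 0.45})
print(msg)
# event: forecast_update
# data: {"id": 42, "probability": 0.45}
```

The browser side is a plain `EventSource` listener, which is why SSE beats WebSockets here: updates only ever flow server to client.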

What I'd Do Differently

  1. Start with fewer agents. 7 analysts + skeptic + arbiter = 9 LLM calls per signal. The marginal value of agent #6 and #7 is questionable. Start with 3-4 and add based on data.

  2. Build the Brier tracking first. I built it in phase 3. Should have been phase 1. Without accuracy metrics, you're just generating confident-sounding text.

  3. The Skeptic is more valuable than any analyst. If I had to pick one agent, it would be the Skeptic. Most AI prediction systems fail not because their analysis is bad, but because nobody checks the analysis.

Does It Actually Work?

Honest answer: I don't know yet. The whole thing was built in about two days of vibe-coding, and it's been running for less than a week. The Brier scores are there, the tracking is there, but there aren't enough resolved forecasts yet to say anything statistically meaningful.

Ask me again in six months. By then there should be enough data to either validate the approach — or to write a very honest post-mortem about why multi-agent forecasting doesn't work.

That's the whole point of tracking accuracy publicly. If the system is garbage, the numbers will show it. No hiding behind vague "our AI is cutting-edge" marketing. Just math.

Try It

Questions? I'll be happy to answer what I can.
