Most "AI prediction" systems are a single prompt wrapped in a REST API. I wanted to build something closer to how intelligence agencies actually work — multiple analysts with different expertise, a red team that challenges every conclusion, and honest accuracy metrics.
This is the architecture behind Seldon Vault, a multi-agent forecasting system named after Hari Seldon, a character from Isaac Asimov's Foundation series (first published in 1942). Seldon invented "psychohistory": a mathematical framework that could predict the behavior of large populations statistically, the way statistical mechanics predicts a gas without tracking individual molecules. Not individual fates — macro trends.
The idea always felt like the most elegant concept in science fiction. And with LLMs, real-time data feeds, and cheap Bayesian computation, we now have the building blocks to at least attempt it. This system is that attempt — a very humble one.
The Problem with Single-Prompt Prediction
Give an LLM a bunch of news articles and ask "what's going to happen?" You'll get a coherent, confident, and usually useless answer. One model with one prompt suffers from anchoring bias — it latches onto the first plausible hypothesis and builds a narrative around it.
The fix isn't a better prompt. It's more perspectives.
System Architecture
12 sources: RSS, Reddit, Telegram, Bluesky, Polymarket, Metaculus,
            FRED, Fear&Greed, ACLED, UCDP, GDACS, GDELT
                      │
Signal Processor (DeepSeek — cheap, fast)
  → classify: immediate vs. structural
  → score importance, extract entities
                      │
┌──────────────────────────────────────────────┐
│         7 Analyst Agents (parallel)          │
│  Geopolitics │ Economics │ Technology        │
│  Sociology │ Climate │ Military │ Cyber      │
└──────────────────────────────────────────────┘
                      │
Skeptic Agent (Claude Opus + Tavily Search)
  → fact-check, find counter-evidence
  → risk score < 50 = auto-reject
                      │
Seldon Arbiter (Claude Opus)
  → synthesize top 5 forecasts
  → bilingual output (EN + RU)
  → detect cascade narratives
                      │
PostgreSQL + Redis + SSE
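The fan-out/fan-in at the center of the diagram is, in code, essentially an asyncio gather. Here is a minimal sketch with hypothetical agent objects — the method names (`analyze`, `review`, `synthesize`) and the `verdict` attribute are assumptions for illustration, not the actual interfaces:

```python
import asyncio

async def run_pipeline(signal, analysts, skeptic, arbiter):
    """Fan one processed signal out to all analysts, then gate through the Skeptic."""
    # All analysts review the signal concurrently
    analyses = await asyncio.gather(*(a.analyze(signal) for a in analysts))

    # Every candidate forecast must survive the Skeptic before the Arbiter sees it
    reviews = await asyncio.gather(*(skeptic.review(fc) for fc in analyses))
    survivors = [fc for fc, r in zip(analyses, reviews) if r.verdict != "REJECTED"]

    # The Arbiter synthesizes only the forecasts that passed review
    return await arbiter.synthesize(survivors)
```

Because the analysts are independent, wall-clock latency is dominated by the slowest single agent plus the Skeptic and Arbiter passes, not by the sum of nine LLM calls.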
The LLM Factory
Every AI call goes through a unified LLMFactory that abstracts provider differences:
# Simplified — actual code handles retries, fallbacks, proxy
class LLMFactory:
    @staticmethod
    def create(provider: str, config: dict) -> BaseLLMProvider:
        providers = {
            "openai": OpenAIProvider,
            "anthropic": AnthropicProvider,
            "google": GoogleProvider,
            "deepseek": DeepSeekProvider,
        }
        return providers[provider](config)

    async def generate_with_retry(self, prompt, config):
        try:
            return await self.primary.generate(prompt)
        except LLMError:
            return await self.fallback.generate(prompt)
Provider configs are centralized in YAML — no hardcoded model names in agent code:
# configs/seldon.yaml
llm_config:
  signal_processor:
    provider: deepseek
    model: deepseek-chat
    temperature: 0.3
    max_tokens: 2000
  analysts:
    provider: deepseek
    model: deepseek-chat
    temperature: 0.7
    max_tokens: 4000
  skeptic:
    provider: anthropic
    model: claude-opus-4-6
    temperature: 0.4
    max_tokens: 3000
  seldon:
    provider: anthropic
    model: claude-opus-4-6
    temperature: 0.6
    max_tokens: 5000
Cost optimization: the signal processor and analysts run on DeepSeek (cheap, fast), while the Skeptic and Arbiter use Claude Opus for stronger reasoning and fact-checking.
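With the YAML above parsed into a dict, picking the right model for a role is a single lookup. A hypothetical helper to illustrate the routing — `resolve_role` is not the actual code, just a sketch of the pattern:

```python
def resolve_role(llm_config: dict, role: str) -> dict:
    """Return the provider settings for one pipeline role from the parsed YAML."""
    cfg = llm_config[role]  # KeyError here means a role is missing from the config
    return {
        "provider": cfg["provider"],
        "model": cfg["model"],
        "temperature": cfg.get("temperature", 0.7),
        "max_tokens": cfg.get("max_tokens", 2000),
    }

# The dict shape mirrors configs/seldon.yaml after yaml.safe_load
llm_config = {
    "skeptic": {"provider": "anthropic", "model": "claude-opus-4-6",
                "temperature": 0.4, "max_tokens": 3000},
}
print(resolve_role(llm_config, "skeptic")["provider"])  # anthropic
```

The payoff is that swapping the Skeptic to a cheaper model during development is a one-line YAML change, with no agent code touched.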
The Skeptic Pattern
The most valuable architectural decision was the Skeptic agent. Every proposed forecast must survive systematic doubt before publication.
# Simplified skeptic flow
async def review_forecast(self, forecast: dict) -> SkepticReview:
    # 1. Fact-check key claims via Tavily Search
    fact_checks = await self.tavily.search(forecast["key_claims"])

    # 2. Generate critique with counter-evidence
    review = await self.llm.generate(
        prompt=self.build_skeptic_prompt(forecast, fact_checks)
    )

    # 3. Auto-reject if risk too high
    if review.risk_score < 50:
        review.verdict = "REJECTED"
    return review
The Skeptic's risk score threshold (50) is deliberately aggressive. Better to miss a valid forecast than to publish a hallucinated one.
Bayesian Updates
Forecasts aren't one-shot. Every 6 hours, the system re-evaluates:
def bayesian_update(prior: float, evidence_strength: float,
                    evidence_direction: float) -> float:
    """
    prior: current probability (0.05 - 0.95)
    evidence_strength: how strong the new evidence is (0.0 - 1.0)
    evidence_direction: positive (>0.5) or negative (<0.5)
    """
    if evidence_strength < MIN_EVIDENCE_STRENGTH:  # 0.3
        return prior  # ignore weak signals

    likelihood_ratio = evidence_direction / (1 - evidence_direction)
    posterior = (prior * likelihood_ratio) / \
                (prior * likelihood_ratio + (1 - prior))

    # Cap daily shift at ±15pp
    max_shift = MAX_DAILY_SHIFT  # 0.15
    posterior = max(prior - max_shift, min(prior + max_shift, posterior))

    # Clamp to 5-95% range
    return max(0.05, min(0.95, posterior))
The MAX_DAILY_SHIFT cap is critical. Without it, a single alarming headline could swing a probability from 30% to 80% in one cycle. Real analysts don't panic-update; neither should the system.
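Plugging numbers in makes the cap concrete. This is the same function as above with the constants inlined from its comments, so it runs standalone:

```python
MIN_EVIDENCE_STRENGTH = 0.3
MAX_DAILY_SHIFT = 0.15

def bayesian_update(prior, evidence_strength, evidence_direction):
    if evidence_strength < MIN_EVIDENCE_STRENGTH:
        return prior  # weak signals are ignored entirely
    likelihood_ratio = evidence_direction / (1 - evidence_direction)
    posterior = (prior * likelihood_ratio) / (prior * likelihood_ratio + (1 - prior))
    # Cap the per-cycle move, then clamp to the 5-95% band
    posterior = max(prior - MAX_DAILY_SHIFT, min(prior + MAX_DAILY_SHIFT, posterior))
    return max(0.05, min(0.95, posterior))

# Strong positive evidence: raw Bayes would jump 0.30 → ~0.79,
# but the daily cap holds the move to +15pp
print(round(bayesian_update(0.30, 0.8, 0.9), 2))  # 0.45

# Weak evidence (strength 0.2 < 0.3): prior unchanged
print(bayesian_update(0.30, 0.2, 0.9))  # 0.3
```

That first case is exactly the panic-update scenario: one strong headline yields a 49pp raw swing that the cap compresses to 15pp.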
Cascade Narratives
Forecasts link into causal chains:
Sanctions on chip exports (70%)
└── TSMC Arizona delays (55%) [strength: 0.7]
└── SE Asia AI slowdown (45%) [strength: 0.5]
When an upstream event resolves, downstream probabilities shift:
def propagate_cascade(resolved_forecast, delta, graph):
    """
    delta: change in resolved forecast probability
    graph: narrative links (source → target with strength)
    """
    for link in graph.get_downstream(resolved_forecast):
        shift = (delta * link.strength *
                 link.conditional_shift *
                 (DAMPENING ** link.depth))  # 0.5 per hop
        if link.depth <= MAX_CASCADE_DEPTH:  # 3
            update_probability(link.target, shift)
Dampening (0.5 per hop) prevents runaway cascades. Max depth 3 keeps the graph manageable.
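To make the arithmetic concrete, here is the per-link shift factored into a standalone function, with a toy `Link` dataclass standing in for whatever the real graph model looks like (the field names are assumptions based on the snippet above):

```python
from dataclasses import dataclass

DAMPENING = 0.5        # per-hop decay
MAX_CASCADE_DEPTH = 3  # hops beyond this are ignored

@dataclass
class Link:
    target: str
    strength: float           # how tightly the two events are coupled
    conditional_shift: float  # fraction of the delta that propagates
    depth: int                # hops from the resolved forecast

def cascade_shift(delta: float, link: Link) -> float:
    """Probability shift applied to one downstream forecast."""
    if link.depth > MAX_CASCADE_DEPTH:
        return 0.0
    return delta * link.strength * link.conditional_shift * (DAMPENING ** link.depth)

# Chip-sanctions forecast resolves and jumps +20pp; the TSMC delay
# forecast sits one hop downstream with link strength 0.7
link = Link("tsmc-arizona-delays", strength=0.7, conditional_shift=1.0, depth=1)
print(round(cascade_shift(0.20, link), 3))  # 0.07
```

A +20pp upstream move becomes a +7pp downstream nudge at depth 1, +3.5pp at depth 2, and nothing past depth 3 — which is exactly what keeps a cascade from becoming a feedback loop.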
Agent Calibration Feedback Loop
Every agent gets its own accuracy report injected into its prompt:
Your calibration data (last 30 days):
- Sector Technology: avg Brier Score 0.28 (12 resolved forecasts)
- High-confidence predictions (>75%): correct 3/5 times
- Your forecasts tend to overestimate probability by ~8pp
Adjust your confidence levels accordingly.
This isn't fine-tuning — it's prompt-based calibration. The agent sees its own track record and (in theory) adjusts. Early results are promising but the sample size is still small.
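The report above leans on the Brier Score, which is just the mean squared error between stated probabilities and binary outcomes: 0.0 is perfect, and always guessing 50% earns a 0.25. A minimal sketch of the metric (not the system's actual tracking code):

```python
def brier_score(forecasts: list[tuple[float, bool]]) -> float:
    """Mean squared error between predicted probability and outcome (0.0 = perfect)."""
    return sum((p - float(outcome)) ** 2 for p, outcome in forecasts) / len(forecasts)

# Three resolved forecasts: (stated probability, did it happen?)
resolved = [(0.8, True), (0.6, False), (0.3, False)]
print(round(brier_score(resolved), 3))  # 0.163
```

Low scores require both accuracy and calibration: the confident wrong call (0.6, False) contributes 0.36, more than the other two forecasts combined.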
Tech Stack
| Layer | Choice | Why |
|---|---|---|
| Backend | FastAPI + async SQLAlchemy | Non-blocking DB + LLM calls |
| Database | PostgreSQL 16 + JSONB | Agent analyses stored inline, no complex joins |
| Task Queue | Celery + Redis Beat | Daily pipeline at 08:00 UTC, updates every 6h |
| Frontend | React 19 + Vite 7 + Tailwind | SPA with SSE for real-time updates |
| LLM | Multi-provider factory | Swap models via YAML config |
| Search | Tavily API | Skeptic fact-checking |
| Visualization | D3 (force graph) + Recharts | Cascade narratives + probability charts |
What I'd Do Differently
Start with fewer agents. 7 analysts + skeptic + arbiter = 9 LLM calls per signal. The marginal value of agents #6 and #7 is questionable. Start with 3-4 and add based on data.
Build the Brier tracking first. I built it in phase 3. Should have been phase 1. Without accuracy metrics, you're just generating confident-sounding text.
The Skeptic is more valuable than any analyst. If I had to pick one agent, it would be the Skeptic. Most AI prediction systems fail not because their analysis is bad, but because nobody checks the analysis.
Does It Actually Work?
Honest answer: I don't know yet. The whole thing was built in about two days of vibe-coding, and it's been running for less than a week. The Brier Scores are there, the tracking is there, but there aren't enough resolved forecasts yet to say anything statistically meaningful.
Ask me again in six months. By then there should be enough data to either validate the approach — or to write a very honest post-mortem about why multi-agent forecasting doesn't work.
That's the whole point of tracking accuracy publicly. If the system is garbage, the numbers will show it. No hiding behind vague "our AI is cutting-edge" marketing. Just math.
Try It
- Live (free forever): seldonvault.io
- Methodology: seldonvault.io/methodology
- API Docs: seldonvault.io/developers
- Daily forecasts updated every 6 hours with Bayesian updates
Questions? I'll be happy to answer whatever I can.
