<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Brian McMahon</title>
    <description>The latest articles on DEV Community by Brian McMahon (@cipher813).</description>
    <link>https://dev.to/cipher813</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3843802%2Fbf913b64-1a65-46d7-8212-d3f22e988181.jpeg</url>
      <title>DEV Community: Brian McMahon</title>
      <link>https://dev.to/cipher813</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cipher813"/>
    <language>en</language>
    <item>
      <title>Multi-Agent Research: How 6 LLM Teams Analyze 900 Stocks</title>
      <dc:creator>Brian McMahon</dc:creator>
      <pubDate>Fri, 27 Mar 2026 22:46:10 +0000</pubDate>
      <link>https://dev.to/cipher813/multi-agent-research-how-6-llm-teams-analyze-900-stocks-348l</link>
      <guid>https://dev.to/cipher813/multi-agent-research-how-6-llm-teams-analyze-900-stocks-348l</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://nousergon.ai/blog/posts/multi-agent-research/" rel="noopener noreferrer"&gt;https://nousergon.ai/blog/posts/multi-agent-research/&lt;/a&gt;&lt;/em&gt;                                                        &lt;/p&gt;

&lt;p&gt;In &lt;a href="https://nousergon.ai/blog/posts/building-autonomous-alpha-engine/" rel="noopener noreferrer"&gt;Post 1&lt;/a&gt;, I introduced Nous Ergon — an autonomous trading system that splits intelligence across four layers: LLM agents for research judgment, ML for pattern recognition, deterministic rules for execution, and a backtester for system-wide learning. This post goes inside the Research module — the layer where the LLMs do the heavy lifting.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Weekend Run Looks Like
&lt;/h2&gt;

&lt;p&gt;Over the weekend, an AWS Lambda fires. It loads the S&amp;amp;P 500 and S&amp;amp;P 400 — roughly 900 mid-to-large-cap US stocks — along with recent price history, and then distributes them across six sector-specialized teams that run in parallel. Each team screens, analyzes, and debates their sector's best opportunities. A CIO agent evaluates the top picks across all teams and decides which stocks enter or exit the portfolio.&lt;/p&gt;

&lt;p&gt;The pipeline writes a single &lt;code&gt;signals.json&lt;/code&gt; file to S3 that the rest of the system — Predictor, Executor, Backtester — consumes. By the time markets open Monday morning, the week's research is done.&lt;/p&gt;

&lt;p&gt;This post walks through each stage of that process.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Funnel: 900 → ~25
&lt;/h2&gt;

&lt;p&gt;There's no single bottleneck filter narrowing the universe. Instead, the ~900 stocks are distributed across six sector teams based on their GICS sector classification — Technology, Healthcare, Financials, Industrials, Consumer, and Defensives. Each team receives its full sector allocation — typically 100–200 stocks — and runs its own screening independently.&lt;/p&gt;
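&lt;p&gt;As a rough sketch of this distribution step (the six team names match the post, but the GICS-to-team mapping and the function name are illustrative assumptions, not the system's actual code):&lt;/p&gt;

```python
from collections import defaultdict

# Illustrative grouping of the 11 GICS sectors into the six teams
# (an assumption; the actual mapping may differ).
SECTOR_TO_TEAM = {
    "Information Technology": "technology",
    "Communication Services": "technology",
    "Health Care": "healthcare",
    "Financials": "financials",
    "Real Estate": "financials",
    "Industrials": "industrials",
    "Materials": "industrials",
    "Energy": "industrials",
    "Consumer Discretionary": "consumer",
    "Consumer Staples": "defensives",
    "Utilities": "defensives",
}

def assign_teams(universe):
    """Split a {ticker: gics_sector} universe into per-team ticker lists."""
    teams = defaultdict(list)
    for ticker, sector in universe.items():
        teams[SECTOR_TO_TEAM[sector]].append(ticker)
    return dict(teams)
```

&lt;p&gt;Each team then receives only its own ticker list, so the six screening runs are fully independent and can execute in parallel.&lt;/p&gt;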

&lt;p&gt;Within each team, the process follows a four-stage pipeline:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Quant Analyst&lt;/strong&gt; — A ReAct agent with access to screening tools: volume filters, technical indicators, analyst consensus, balance sheet ratios, price performance, and options flow. The quant receives the full sector ticker list and autonomously decides which tools to call, which metrics to prioritize, and how to narrow from ~150 candidates down to a top 10. Each sector team is parameterized with focus metrics — the Technology team sees hints like revenue growth and R&amp;amp;D intensity, while the Financials team sees ROE and net interest margin — though the agent ultimately decides its own screening strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Qual Analyst&lt;/strong&gt; — Also a ReAct agent, reviewing the quant's top picks with deeper qualitative tools: recent news, insider transactions, institutional accumulation, SEC filing search via RAG, and lessons from past signal failures. The qual analyst adds the narrative layer — why this stock, why now, what could go wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Peer Review&lt;/strong&gt; — Quant and qual collaborate to agree on 2–3 final recommendations, each backed by a bull case, bear case, specific catalysts, and a conviction level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. CIO Evaluation&lt;/strong&gt; — All six teams run in parallel, producing 12–18 total recommendations. A CIO agent (single Sonnet call) evaluates them together across four dimensions: team conviction, macro alignment, portfolio fit, and catalyst specificity. The CIO loads its prior decisions to maintain portfolio continuity — it's not assembling a new portfolio from scratch each week, but rather rotating a handful of out-of-favor positions (typically 2–10) and replacing them with the strongest new candidates from the investment committee review.&lt;/p&gt;

&lt;p&gt;The result is a ~25-stock portfolio — sector-balanced, thesis-backed, refreshed weekly.&lt;/p&gt;

&lt;p&gt;Here's what a quant analyst tool looks like in practice — a LangChain &lt;code&gt;@tool&lt;/code&gt; that the ReAct agent calls when it wants balance sheet data for its top candidates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_balance_sheet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tickers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Get balance sheet metrics: debt/equity, current ratio, PE, revenue growth, gross margins.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;yfinance&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;yf&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tickers&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;yf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Ticker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;debt_to_equity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;debtToEquity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;current_ratio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;currentRatio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pe_ratio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trailingPE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;forward_pe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;forwardPE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;revenue_growth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;revenueGrowth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gross_margins&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grossMargins&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent passes in a list of tickers from its sector subset — typically its top candidates after an initial volume and momentum screen. It sees the tool's docstring and decides when balance sheet data is relevant to its analysis. It might call &lt;code&gt;get_balance_sheet&lt;/code&gt; for a value screen, then &lt;code&gt;get_technical_indicators&lt;/code&gt; for momentum confirmation — the reasoning loop drives the sequence, not hardcoded logic. (Today these tools pull data live from yfinance on each run. Longer term, the plan is to read from the predictor's feature store, which already pre-computes many of the same fundamental and technical metrics — faster, more reliable, and avoids rate limiting during the pipeline.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Sector Teams?
&lt;/h2&gt;

&lt;p&gt;Comparing a biotech awaiting an FDA decision to a utility with stable regulated earnings is an apples-to-oranges problem. The metrics that matter, the catalysts that move prices, and the risks worth watching are fundamentally different. A single generalist prompt can't evaluate all of these well.&lt;/p&gt;

&lt;p&gt;Sector teams solve this by making comparisons &lt;em&gt;within&lt;/em&gt; a peer group. The Healthcare team compares drug pipelines to drug pipelines. The Financials team compares credit portfolios to credit portfolios. Each team's prompts are parameterized with sector-specific focus metrics — the Technology quant is nudged toward revenue growth and R&amp;amp;D intensity, while the Defensives quant is nudged toward dividend yield and payout stability. When stocks are evaluated against true peers, the rankings should be more meaningful and the catalysts more specific.&lt;/p&gt;

&lt;p&gt;An important caveat: I'm describing the &lt;em&gt;intended&lt;/em&gt; behavior. Whether each sector's quant actually uses different tools and metrics in practice — and whether that differentiation produces better recommendations — is an empirical question. The system logs every tool call and iteration count per agent via LangSmith traces, which means I'm accumulating the data needed to answer it. As the dataset grows, I'll be able to analyze whether the Technology quant really does call &lt;code&gt;get_balance_sheet&lt;/code&gt; less frequently than the Financials quant, or whether agent behavior converges across sectors despite the sector-specific parameterization. This kind of agentic evaluation is a planned area of investigation.&lt;/p&gt;
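&lt;p&gt;Once the traces are exported, the analysis itself is simple. A minimal sketch, assuming each trace record has been flattened into a dict with &lt;code&gt;team&lt;/code&gt; and &lt;code&gt;tool&lt;/code&gt; keys (the record shape is an assumption, not LangSmith's native export format):&lt;/p&gt;

```python
from collections import Counter

def tool_call_profile(trace_records):
    """Count tool calls per (team, tool) pair across exported agent runs.
    Comparing rows of this table across sectors shows whether the
    sector-specific parameterization actually changes agent behavior."""
    profile = Counter()
    for rec in trace_records:
        profile[(rec["team"], rec["tool"])] += 1
    return profile
```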

&lt;h2&gt;
  
  
  ReAct Agents and the Tool Ecosystem
&lt;/h2&gt;

&lt;p&gt;The qual analysts aren't just reading pre-fetched data. They're &lt;a href="https://arxiv.org/abs/2210.03629" rel="noopener noreferrer"&gt;ReAct agents&lt;/a&gt; — they reason about what information they need and call tools to get it. Each agent has several tools available:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Data Source&lt;/th&gt;
&lt;th&gt;What It Provides&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;News articles&lt;/td&gt;
&lt;td&gt;Yahoo Finance RSS&lt;/td&gt;
&lt;td&gt;Headlines, sentiment, catalysts from recent coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Analyst consensus&lt;/td&gt;
&lt;td&gt;FMP API&lt;/td&gt;
&lt;td&gt;Ratings, price targets, earnings surprises&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Insider activity&lt;/td&gt;
&lt;td&gt;SEC EDGAR Form 4&lt;/td&gt;
&lt;td&gt;Cluster buys/sells, net sentiment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SEC filings&lt;/td&gt;
&lt;td&gt;SEC EDGAR&lt;/td&gt;
&lt;td&gt;8-K event filings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prior thesis&lt;/td&gt;
&lt;td&gt;SQLite archive&lt;/td&gt;
&lt;td&gt;Last week's bull/bear case for continuity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Options flow&lt;/td&gt;
&lt;td&gt;yfinance&lt;/td&gt;
&lt;td&gt;Put/call ratio, IV rank, expected move&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Institutional activity&lt;/td&gt;
&lt;td&gt;SEC EDGAR 13F&lt;/td&gt;
&lt;td&gt;Fund accumulation/distribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Filing search (RAG)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Neon pgvector&lt;/td&gt;
&lt;td&gt;Semantic search over 10-K/10-Q full text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lessons (episodic memory)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SQLite&lt;/td&gt;
&lt;td&gt;Past signal failures and extracted lessons&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The last two tools deserve deeper explanation — they represent two different forms of persistent knowledge that the agents build up over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG: Searching SEC Filings
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;query_filings&lt;/code&gt; tool uses retrieval-augmented generation (RAG) to search hundreds of SEC filings stored as embedded chunks in a vector database. When the agent asks "What are the competitive risks in the memory semiconductor market?", it gets real Risk Factors text from Micron's latest 10-Q — not a summary, but the actual filing language that management signed off on.&lt;/p&gt;

&lt;p&gt;This is where RAG adds depth beyond headlines. An analyst consensus report tells you the price target is $150. The 10-K Risk Factors section tells you &lt;em&gt;why&lt;/em&gt; that target might be wrong — competitive threats management disclosed, regulatory risks they're navigating, customer concentration they're exposed to. The qual agent can search for this context on demand, and it integrates naturally because it's just another tool in the ReAct loop.&lt;/p&gt;

&lt;p&gt;Under the hood, the RAG retrieval is a metadata-filtered vector similarity search. The query is embedded using Voyage, filtered by ticker and document type, then ranked by cosine similarity against an HNSW (Hierarchical Navigable Small World) index over thousands of filing chunks. The agent gets back the most relevant excerpts with source metadata — filing type, date, and section label.&lt;/p&gt;
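&lt;p&gt;Stripped of the database, the retrieval logic reduces to filter-then-rank. A self-contained sketch of that shape (the chunk schema and function signature are illustrative, not the actual &lt;code&gt;query_filings&lt;/code&gt; implementation, and a real HNSW index avoids the brute-force scan shown here):&lt;/p&gt;

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def query_filings(chunks, query_vec, ticker, doc_type, k=3):
    """Metadata filter first, then rank by cosine similarity --
    the same shape as the pgvector query, minus the index."""
    candidates = [c for c in chunks
                  if c["ticker"] == ticker and c["doc_type"] == doc_type]
    candidates.sort(key=lambda c: cosine(c["embedding"], query_vec),
                    reverse=True)
    return candidates[:k]
```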

&lt;p&gt;If RAG is unavailable (database down, embedding service unreachable), the agent proceeds with the other tools. New capabilities are additive — they never block the pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent Memory: Learning from Mistakes
&lt;/h3&gt;

&lt;p&gt;The memory system gives agents access to accumulated knowledge from past performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Episodic memory&lt;/strong&gt; captures lessons from past signal failures. The Research module tracks the outcome of every BUY signal it generates — specifically, whether each signal beat the S&amp;amp;P 500 over a 10-day window (via the &lt;code&gt;score_performance&lt;/code&gt; table in the research database). For signals that underperformed, the system uses an LLM call to extract a structured lesson — what the thesis was, what the market conditions were (regime, VIX level), and what the agent should watch for next time. When a qual analyst evaluates a stock, it can call &lt;code&gt;get_lessons&lt;/code&gt; to retrieve any prior lessons for that ticker. If the system recommended SMCI three months ago based on rising data center demand but the stock collapsed on accounting concerns, the qual analyst sees that context before making its current recommendation.&lt;/p&gt;
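&lt;p&gt;The scoring step itself is simple arithmetic. A sketch of the 10-day benchmark check (the field names are illustrative, not the actual &lt;code&gt;score_performance&lt;/code&gt; schema):&lt;/p&gt;

```python
def score_signal(entry_price, price_10d, spy_entry, spy_10d):
    """Score a BUY signal by its 10-day return relative to SPY.
    Underperformers are the ones routed to lesson extraction."""
    stock_ret = price_10d / entry_price - 1
    spy_ret = spy_10d / spy_entry - 1
    alpha_10d = stock_ret - spy_ret
    return {"alpha_10d": round(alpha_10d, 4),
            "underperformed": alpha_10d < 0}
```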

&lt;p&gt;Beyond the tool-based episodic memory, the system also accumulates &lt;strong&gt;semantic observations&lt;/strong&gt; — cross-agent insights about sector dynamics and macro reasoning. After each run, the system extracts thematic patterns from sector team reports, macro analysis, and CIO decisions — observations like "semiconductor stocks are rotating from memory to AI accelerators" or "the CIO has been underweighting utilities due to rate sensitivity." These observations are loaded as context for the qual analysts in subsequent runs, giving them awareness of patterns that emerged from other teams' analysis in prior weeks. Repeated observations are reinforced rather than duplicated, so persistent themes naturally rise in prominence.&lt;/p&gt;

&lt;p&gt;Both memory types are stored in SQLite and accumulate over time. The episodic memory teaches the system not to repeat specific mistakes; the semantic observations help it recognize recurring patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Living Theses and Material Triggers
&lt;/h2&gt;

&lt;p&gt;The agents don't start from scratch each week. For every stock in the population, the system maintains a living investment thesis — a structured document covering the bull case, bear case, key catalysts, risk factors, and conviction level.&lt;/p&gt;

&lt;p&gt;Each week, agents receive the prior thesis — but they don't blindly rewrite it. Thesis updates are triggered only when something material has changed: a significant news event, a price move exceeding 2x the stock's average true range (ATR), an analyst rating revision, approaching earnings, insider cluster activity, or a shift in the sector's macro regime. If nothing material has happened, the thesis is preserved as-is. This prevents the stateless behavior where an LLM rewrites a perfectly good thesis just because it was asked to, potentially losing nuance that accumulated over prior weeks.&lt;/p&gt;
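&lt;p&gt;The trigger check can be sketched as a plain predicate over the conditions listed above (the 2x ATR rule is from the system; the 14-day earnings window and the argument names are illustrative guesses):&lt;/p&gt;

```python
def thesis_needs_update(price_move, atr, news_event, rating_changed,
                        insider_cluster, regime_shift, days_to_earnings):
    """True only when something material changed since the last run;
    otherwise the prior thesis is preserved as-is."""
    return any([
        abs(price_move) > 2 * atr,   # price move beyond 2x average true range
        news_event,                  # significant news event
        rating_changed,              # analyst rating revision
        insider_cluster,             # insider cluster activity
        regime_shift,                # shift in the sector's macro regime
        days_to_earnings <= 14,      # earnings approaching
    ])
```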

&lt;p&gt;Conviction tracking captures the momentum of the thesis itself: a stock whose bull case keeps strengthening gets "rising" conviction, while one where risks are accumulating gets "declining." These signals feed directly into position sizing downstream — the Executor gives larger positions to rising conviction and smaller positions to declining.&lt;/p&gt;

&lt;p&gt;All theses are archived in S3 and SQLite, creating a historical record of what the system believed about each stock and when.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability: Validating the Pipeline
&lt;/h2&gt;

&lt;p&gt;With six teams running in parallel, a macro agent, an exit evaluator, and a CIO all executing autonomously, things can go wrong silently. A sector team might fail to produce recommendations. The macro agent might not load its prior report. A node might execute out of order.&lt;/p&gt;

&lt;p&gt;To catch these issues, the system runs trajectory validation after every pipeline execution using LangSmith. The validator checks that all required nodes fired (all 6 sector teams, macro, CIO, consolidator), that they executed in the correct order, and that the graph completed successfully. Currently, failures are logged to CloudWatch — surfacing them more prominently (via Telegram alerts or the morning briefing email) is a planned improvement.&lt;/p&gt;
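&lt;p&gt;A minimal version of that validation logic, assuming the executed node names have been pulled from a run trace (the node names and the single ordering rule shown are assumptions for illustration):&lt;/p&gt;

```python
REQUIRED_NODES = {"technology", "healthcare", "financials", "industrials",
                  "consumer", "defensives", "macro", "cio", "consolidator"}

def validate_trajectory(executed):
    """Return a list of problems found in one pipeline run.
    `executed` is the ordered list of node names from the trace."""
    errors = []
    missing = REQUIRED_NODES - set(executed)
    if missing:
        errors.append(f"missing nodes: {sorted(missing)}")
    # Ordering rule (an assumption): the CIO decides before consolidation.
    if "cio" in executed and "consolidator" in executed:
        if executed.index("cio") > executed.index("consolidator"):
            errors.append("consolidator ran before cio")
    return errors
```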

&lt;p&gt;This is the kind of observability investment that doesn't add alpha directly, but prevents the silent degradation that erodes it over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm Still Iterating On
&lt;/h2&gt;

&lt;p&gt;This is an active project, not a finished product. Some areas I'm working through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RAG efficacy&lt;/strong&gt; — The filing search infrastructure is in place and wired to the qual analysts. Whether access to full SEC filing text produces materially better theses compared to headlines and consensus data alone is a question I want to answer over time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document coverage&lt;/strong&gt; — SEC 10-K/10-Q filings are ingested. Earnings call transcripts are a natural next source, though the data pipeline for transcripts is still in progress.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory accumulation&lt;/strong&gt; — The episodic and semantic memory systems are new. As the lesson database grows, I'll be watching whether agents actually use the memory tools effectively and whether the lessons improve signal quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agentic evaluation&lt;/strong&gt; — I'm logging tool calls and agent iterations per team, which will eventually let me answer questions like: do sector teams actually behave differently? Do agents that use more tools produce better signals? This requires time — the system needs to accumulate enough signal history for these evaluations to be statistically meaningful. Building up this kind of proprietary dataset is a natural limiting factor on how fast the research layer can develop and be evaluated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scoring calibration&lt;/strong&gt; — The backtester auto-tunes scoring weights weekly, but needs more sample history before the optimization converges.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The Research module identifies &lt;em&gt;what's attractive&lt;/em&gt;. The next posts in this series cover how the rest of the system acts on those signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ML Prediction&lt;/strong&gt; — LightGBM feature engineering, sector-neutral labeling (benchmarking each stock against its sector ETF rather than the broad market), and the veto gate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autonomous Backtesting&lt;/strong&gt; — How the system evaluates its own signal quality and auto-tunes parameters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution&lt;/strong&gt; — The order book, intraday technical triggers, and risk management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full signal chain: Research identifies what's attractive → Predictor validates short-term timing → Executor sizes and executes positions → Backtester measures what worked and feeds optimized parameters back upstream. Each module reads from S3 and is independently replaceable — the architecture is designed so I can swap any component without touching the others.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Nous Ergon is an open-source autonomous trading system. Follow along at &lt;a href="https://nousergon.ai" rel="noopener noreferrer"&gt;nousergon.ai&lt;/a&gt; or explore the code on &lt;a href="https://github.com/cipher813" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>Nous Ergon: Building an Autonomous Alpha Engine with AI</title>
      <dc:creator>Brian McMahon</dc:creator>
      <pubDate>Fri, 27 Mar 2026 22:43:17 +0000</pubDate>
      <link>https://dev.to/cipher813/nous-ergon-building-an-autonomous-alpha-engine-with-ai-13hj</link>
      <guid>https://dev.to/cipher813/nous-ergon-building-an-autonomous-alpha-engine-with-ai-13hj</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://nousergon.ai/blog/posts/building-autonomous-alpha-engine/" rel="noopener noreferrer"&gt;https://nousergon.ai/blog/posts/building-autonomous-alpha-engine/&lt;/a&gt;&lt;/em&gt;                                                   &lt;/p&gt;

&lt;h2&gt;
  
  
  The Thesis
&lt;/h2&gt;

&lt;p&gt;Can AI generate sustained market alpha — not through a single model making predictions, but through a system of specialized components, each contributing what it does best?&lt;/p&gt;

&lt;p&gt;That's the question behind &lt;strong&gt;Nous Ergon: Alpha Engine&lt;/strong&gt; (νοῦς ἔργον — "intelligence at work"), a fully autonomous trading system I've been building that combines AI-driven research, quantitative prediction, and rule-based execution. Quantitative finance — using mathematical models and statistical analysis to make investment decisions — has traditionally been the domain of institutional hedge funds with massive engineering teams. Large language models and modern machine learning tooling are changing that equation.&lt;/p&gt;

&lt;p&gt;The system's north star is &lt;strong&gt;alpha&lt;/strong&gt; — the difference between the portfolio's return and the S&amp;amp;P 500 (SPY):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Alpha = Portfolio Return − SPY Return
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Positive alpha means you're doing something the market isn't already pricing in. Everything in Nous Ergon — every agent prompt, every feature, every risk rule — exists to find, validate, and capture that edge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not Just Ask an LLM to Trade?
&lt;/h2&gt;

&lt;p&gt;The naive approach is tempting: give a large language model (LLM) market data and ask it what to buy. But LLMs are probabilistic text generators. They excel at synthesis, judgment, and reasoning across unstructured information. They're terrible at precise numerical prediction, risk management, and consistent execution.&lt;/p&gt;

&lt;p&gt;Nous Ergon splits the problem into three layers, each matched to the right tool:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Research&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLM agents (Claude)&lt;/td&gt;
&lt;td&gt;Judgment over unstructured data — news, analyst reports, macro context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prediction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Machine learning (ML) ensemble (starting with LightGBM)&lt;/td&gt;
&lt;td&gt;Pattern recognition over structured numerical features&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Execution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deterministic rules&lt;/td&gt;
&lt;td&gt;Hard risk constraints that never get creative&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LLMs reason about &lt;em&gt;why&lt;/em&gt; a stock might move. ML models find &lt;em&gt;patterns&lt;/em&gt; in how stocks actually move. And risk rules ensure the system survives long enough for the other two to matter.&lt;/p&gt;

&lt;p&gt;A key design decision: LLM agents are used &lt;em&gt;only&lt;/em&gt; in the Research module. This deliberate separation means the Predictor, Executor, and Backtester can run unlimited simulations, parameter sweeps, and backtests without making a single LLM API call. When you're iterating on model features or testing risk parameters, that cost decoupling matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Five Modules
&lt;/h2&gt;

&lt;p&gt;Nous Ergon runs as five modules on AWS, connected through a shared S3 bucket. Each module has a single job, reads its inputs from S3, and writes its outputs back. There's no shared state beyond the bucket — no databases to coordinate, no APIs to call between services.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Research
&lt;/h3&gt;

&lt;p&gt;Five LLM agents orchestrated by LangGraph maintain rolling investment theses on ~20 tracked stocks and scan ~900 S&amp;amp;P 500 and S&amp;amp;P 400 tickers weekly for the top buy candidates. A quantitative filter first reduces the ~900 universe to ~50 candidates using volume, price, and momentum screens — no LLM calls. From there, a ranking agent (Sonnet) compares all ~50 candidates in a single cross-stock evaluation and selects the top ~35. Then two per-ticker agents — news sentiment and analyst research — each run independently on every candidate and population stock (Haiku), producing the sub-scores that feed into the final composite. A macro agent (Sonnet) assesses the broader market environment and sector conditions. A consolidator agent (Sonnet) synthesizes all analyses into a research brief.&lt;/p&gt;

&lt;p&gt;Research outputs a composite attractiveness score (0–100) per ticker, combining news sentiment (50%) and analyst research (50%), with per-sector macro adjustments. The resulting &lt;code&gt;signals.json&lt;/code&gt; — written to Amazon S3 — is the system's primary input for everything downstream.&lt;/p&gt;
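&lt;p&gt;The scoring arithmetic is straightforward. A sketch of the 50/50 blend (the additive form of the macro adjustment and the clamping to the 0-100 range are my assumptions):&lt;/p&gt;

```python
def composite_score(news_sentiment, analyst_score, macro_adjust=0.0):
    """Blend the two sub-scores 50/50, apply the per-sector macro
    adjustment, and clamp to the 0-100 attractiveness scale."""
    raw = 0.5 * news_sentiment + 0.5 * analyst_score + macro_adjust
    return max(0.0, min(100.0, raw))
```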

&lt;p&gt;Research focuses entirely on fundamental attractiveness over a 6–12 month horizon. Technical analysis is deliberately excluded from the composite score. This is the first half of what I call &lt;em&gt;horizon separation&lt;/em&gt; — Research answers "is this a good stock?", not "is now the right time to buy it?"&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Predictor
&lt;/h3&gt;

&lt;p&gt;The Predictor handles the second half of horizon separation: short-term technical timing. Research may identify a stock as fundamentally attractive over the next 6–12 months, but that doesn't mean today is the right day to enter. Each trading day, the Predictor evaluates the population's near-term momentum using engineered features across technical indicators, macro context, volume analysis, and cross-sectional measures. Its &lt;strong&gt;veto gate&lt;/strong&gt; can override a BUY signal from Research when the model predicts DOWN with high confidence — preventing the system from entering a fundamentally sound position at a technically poor time.&lt;/p&gt;
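&lt;p&gt;The veto gate itself can be expressed in a few lines (the confidence threshold and the signal labels here are illustrative, not the production values):&lt;/p&gt;

```python
def apply_veto(research_signal, predicted_direction, confidence,
               veto_threshold=0.70):
    """Let the Predictor override a Research BUY when it forecasts DOWN
    with high confidence: good stock, bad entry timing."""
    if (research_signal == "BUY"
            and predicted_direction == "DOWN"
            and confidence >= veto_threshold):
        return "HOLD"  # vetoed for this trading day
    return research_signal
```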

&lt;p&gt;The current implementation uses a LightGBM gradient-boosted machine (GBM) model, but the architecture is designed for an ensemble of ML and deep learning algorithms. The plan is to layer additional models — likely including a neural network — and combine their predictions through confidence-weighted voting. LightGBM is a strong starting point: it handles threshold interactions and missing data well, trains fast, and provides interpretable feature importance. As the system matures, adding models that capture different types of patterns (non-linear interactions, sequential dependencies) should improve prediction quality.&lt;/p&gt;

&lt;p&gt;The model trains on sector-neutral labels — stock returns minus sector exchange-traded fund (ETF) returns — isolating stock-specific signal from sector momentum. Weekly retraining uses 10 years of price history. New model weights only promote to production if they pass an Information Coefficient (IC) gate. IC measures the rank correlation between predicted and actual returns — in financial ML, an IC of 0.03–0.05 is considered meaningful because even small persistent edges compound significantly when applied across many positions over time. The current validation gate requires IC &amp;gt; 0.03.&lt;/p&gt;
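&lt;p&gt;A dependency-free sketch of the label construction and the IC gate (a real implementation would likely use &lt;code&gt;scipy.stats.spearmanr&lt;/code&gt;, which also handles ties properly):&lt;/p&gt;

```python
def sector_neutral_label(stock_return: float, sector_etf_return: float) -> float:
    """Training label: stock return minus its sector ETF's return."""
    return stock_return - sector_etf_return

def ranks(xs):
    """Rank values 0..n-1 (ties broken by input order; fine for a sketch)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def information_coefficient(predicted, actual):
    """Spearman rank correlation between predicted and realized returns."""
    rp, ra = ranks(predicted), ranks(actual)
    n = len(rp)
    mp, ma = sum(rp) / n, sum(ra) / n
    cov = sum((a - mp) * (b - ma) for a, b in zip(rp, ra))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_a = sum((b - ma) ** 2 for b in ra)
    return cov / (var_p * var_a) ** 0.5

def should_promote(ic: float, gate: float = 0.03) -> bool:
    """New model weights reach production only if IC clears the gate."""
    return ic > gate
```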

&lt;h3&gt;3. Executor&lt;/h3&gt;

&lt;p&gt;Once the Predictor clears a position for entry, the Executor takes over. It reads signals and predictions from S3, applies hard risk rules, sizes the position, and executes market orders on Interactive Brokers. From that point forward, the Executor owns the position — managing it through a set of deterministic rules until exit.&lt;/p&gt;

&lt;p&gt;Risk management is graduated, not binary. A drawdown response system scales position sizing through tiers: full sizing in normal conditions, reduced sizing as drawdowns deepen, and a complete halt at -8%. Additional constraints cap individual positions (5% of net asset value (NAV), 2.5% in bear markets), sector concentration (25% NAV), and total equity exposure (90% NAV).&lt;/p&gt;
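&lt;p&gt;A hedged sketch of the graduated response. Only the -8% halt and the per-position caps come from the description above; the intermediate tier boundaries here are placeholders:&lt;/p&gt;

```python
def position_size_multiplier(drawdown: float) -> float:
    """Graduated drawdown response: scale sizing down as drawdowns
    deepen, halt entirely at -8%. The -8% circuit breaker is stated
    in the text; the intermediate tiers are illustrative assumptions.
    """
    dd = abs(drawdown)
    if dd >= 0.08:
        return 0.0    # circuit breaker: no new positions
    if dd >= 0.05:
        return 0.5    # deep drawdown: half sizing (assumed tier)
    if dd >= 0.03:
        return 0.75   # mild drawdown: reduced sizing (assumed tier)
    return 1.0        # normal conditions: full sizing

def max_position_nav_fraction(bear_market: bool) -> float:
    """Per-position cap: 5% of NAV normally, 2.5% in bear markets."""
    return 0.025 if bear_market else 0.05
```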

&lt;p&gt;Exit management combines ATR-based trailing stops (volatility-adaptive) with time-decay rules that progressively tighten stops as positions age, forcing the system to either prove a thesis quickly or move on.&lt;/p&gt;
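&lt;p&gt;Sketching the exit rule: a volatility-scaled stop whose ATR multiplier shrinks as the position ages. All parameter values below are illustrative placeholders — in the live system they come out of the Backtester's sweeps:&lt;/p&gt;

```python
def trailing_stop(highest_close: float, atr: float, days_held: int,
                  base_mult: float = 3.0, decay_after: int = 20,
                  min_mult: float = 1.5, decay_per_day: float = 0.1) -> float:
    """ATR-based trailing stop that tightens with position age.

    The stop trails the highest close by a multiple of ATR; after
    `decay_after` days the multiple decays linearly toward `min_mult`,
    forcing the position to prove its thesis or get stopped out.
    All defaults here are assumptions, not production values.
    """
    mult = base_mult
    if days_held > decay_after:
        mult = max(min_mult, base_mult - decay_per_day * (days_held - decay_after))
    return highest_close - mult * atr
```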

&lt;p&gt;The Executor's rules are simple by design — but they aren't arbitrary. They're the output of the Backtester's systematic optimization (more on this below). The Executor doesn't reason or predict; it applies its parameters exactly as given, every time, with no emotional second-guessing. The intelligence lives in the process that &lt;em&gt;produces&lt;/em&gt; those parameters, not in the component that executes them. This gives you consistent, repeatable execution while the learning happens offline where you can run thousands of simulations cheaply.&lt;/p&gt;

&lt;h3&gt;4. Backtester&lt;/h3&gt;

&lt;p&gt;The Backtester is the system's learning mechanism. It runs weekly to validate the entire pipeline end-to-end — not just "did we make money?" but "are our signals predictive, which components drive that predictiveness, and what execution parameters maximize risk-adjusted returns?"&lt;/p&gt;

&lt;p&gt;The Backtester does this through several layers of analysis:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Signal quality&lt;/strong&gt;: are Research scores actually predictive? What percentage of BUY signals beat SPY at 10 and 30 days? Are higher scores more predictive than lower ones?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attribution&lt;/strong&gt;: which sub-scores (news vs research) correlate with outperformance? This determines where the scoring formula's weight should shift.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weight optimization&lt;/strong&gt;: adjusts the Research scoring weights based on attribution results — conservatively, with a 30/70 blend of data-driven recommendations against current weights and a 15% max change per weight.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Executor parameter optimization&lt;/strong&gt;: a sweep across the Executor's tunable parameters — minimum entry score, position size limits, Average True Range (ATR) trailing stop multipliers, time-decay windows — replaying historical signals through the full executor simulation for each combination and ranking by Sharpe ratio. Random sampling (Bergstra &amp;amp; Bengio 2012) replaces exhaustive grid search: the number of trials auto-scales as a percentage of the total parameter space, with a statistical floor that guarantees a 95% probability of finding a top-5% combination regardless of grid size. The best-performing parameters get recommended for production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Predictor threshold calibration&lt;/strong&gt;: sweeps the veto gate's confidence threshold across seven levels, measuring the trade-off between precision (correctly blocked losing trades) and missed alpha (incorrectly blocked winners).&lt;/li&gt;
&lt;/ul&gt;
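&lt;p&gt;The statistical floor in the parameter sweep falls out of a short calculation: if a fraction &lt;em&gt;f&lt;/em&gt; of the grid counts as "top-5%", the chance that &lt;em&gt;n&lt;/em&gt; independent random trials all miss it is (1 − &lt;em&gt;f&lt;/em&gt;)&lt;sup&gt;&lt;em&gt;n&lt;/em&gt;&lt;/sup&gt;. A sketch (the 10% auto-scaling fraction is an assumed value, not from the system):&lt;/p&gt;

```python
import math

def required_trials(top_frac: float = 0.05, confidence: float = 0.95) -> int:
    """Smallest n such that P(at least one random trial lands in the
    top `top_frac` of the parameter space) >= `confidence`:
        1 - (1 - top_frac)**n >= confidence
    """
    return math.ceil(math.log(1 - confidence) / math.log(1 - top_frac))

def num_trials(grid_size: int, sample_frac: float = 0.10) -> int:
    """Auto-scale trials as a fraction of the grid, with the
    statistical floor. `sample_frac` is an illustrative assumption."""
    return max(required_trials(), int(grid_size * sample_frac))
```

&lt;p&gt;Solving 1 − 0.95&lt;sup&gt;&lt;em&gt;n&lt;/em&gt;&lt;/sup&gt; ≥ 0.95 gives &lt;em&gt;n&lt;/em&gt; = 59, which is why the floor is independent of grid size.&lt;/p&gt;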

&lt;p&gt;Each optimization has guardrails — minimum sample sizes, minimum improvement thresholds, excluded parameters (the drawdown circuit breaker is never auto-tuned) — to prevent overfitting to noise.&lt;/p&gt;

&lt;p&gt;The results flow back through S3: updated scoring weights for Research, optimized parameters for the Executor, and calibrated thresholds for the Predictor. Without the Backtester, the system operates blind. This is the component that turns a static pipeline into an adaptive one.&lt;/p&gt;

&lt;h3&gt;5. Dashboard&lt;/h3&gt;

&lt;p&gt;A Streamlit application providing read-only visibility into the full system: portfolio performance vs SPY, signal quality trends, per-ticker research timelines, backtester results, and predictor metrics. The operational cockpit.&lt;/p&gt;

&lt;h2&gt;How It All Connects&lt;/h2&gt;

&lt;p&gt;The modules run on AWS in two cadences — a daily trading loop and a weekly optimization cycle — with S3 as the sole communication bus:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Daily Cadence (Mon–Fri)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Predictor&lt;/strong&gt; (6:15 AM PT) — reads latest &lt;code&gt;signals.json&lt;/code&gt; from S3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Executor&lt;/strong&gt; (6:30 AM PT) — reads predictions, trades on Interactive Brokers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EOD Reconcile&lt;/strong&gt; (1:05 PM PT) — captures NAV, computes daily return and alpha, sends email&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weekly Cadence (Sunday/Monday)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Research&lt;/strong&gt; — scans 900 tickers, rotates population, outputs &lt;code&gt;signals.json&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Predictor Training&lt;/strong&gt; — retrains on 10y history, promotes weights if IC &amp;gt; 0.03&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backtester&lt;/strong&gt; — signal quality analysis, weight optimization, parameter sweeps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Always-On&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dashboard&lt;/strong&gt; (Streamlit) — read-only monitoring of all modules via S3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Research runs weekly to refresh the tracked population and generate updated &lt;code&gt;signals.json&lt;/code&gt;. During the daily trading loop, the Predictor reads the latest signals from S3 and the Executor reads the Predictor's output. Each module's output is the next module's input, and S3 acts as the contract between them.&lt;/p&gt;

&lt;p&gt;S3 as the communication bus means any module can be replaced, rewritten, or tested independently. The Research module doesn't know or care that a LightGBM model reads its signals. The Executor doesn't know that five LLM agents generated the scores it's acting on. They agree on a JSON schema, and S3 handles the rest.&lt;/p&gt;
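&lt;p&gt;To make the contract idea concrete, here's what a consumer-side reader might look like. The field names, values, and validation are hypothetical — the point is that every module depends only on the shared JSON shape, not on how it was produced:&lt;/p&gt;

```python
import json

# Hypothetical signals.json payload; field names and values are
# illustrative assumptions, not the actual production schema.
example_payload = json.dumps({
    "generated_at": "2026-03-22",
    "signals": [
        {"ticker": "NVDA", "sector": "technology",
         "news_sentiment": 78, "analyst_research": 86,
         "score": 82, "action": "BUY"},
    ],
})

def load_signals(raw: str) -> list:
    """Parse and minimally validate the shared schema; any consumer
    (Predictor, Executor, Dashboard) relies on the same contract."""
    doc = json.loads(raw)
    signals = doc["signals"]
    for s in signals:
        assert set(s) >= {"ticker", "score", "action"}, "schema violation"
        assert 0 <= s["score"] <= 100
    return signals

signals = load_signals(example_payload)
```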

&lt;h2&gt;The Feedback Loop&lt;/h2&gt;

&lt;p&gt;The most important architectural decision wasn't any individual module — it was connecting the Backtester's output back to the upstream components.&lt;/p&gt;

&lt;p&gt;Every week, the Backtester measures whether the system's signals actually worked. It tracks the percentage of BUY signals that beat SPY over 10 and 30 days, runs attribution analysis to determine which scoring components are pulling their weight, and recommends adjustments.&lt;/p&gt;

&lt;p&gt;These recommendations flow back through S3: updated scoring weights that Research loads on its next run, optimized parameters that the Executor reads on cold-start, and calibrated thresholds that the Predictor uses for its veto gate. The system observes its own performance and adapts — slowly, conservatively, with guardrails — but it adapts.&lt;/p&gt;

&lt;p&gt;This is what separates Nous Ergon from a static trading bot. Most automated trading systems are write-once: you build a strategy, deploy it, and hope it keeps working. Nous Ergon is designed to be fully autonomous — no human in the trading loop, no manual approvals, no daily oversight required. It researches, predicts, trades, measures, and adjusts on its own.&lt;/p&gt;

&lt;h2&gt;Where Things Stand&lt;/h2&gt;

&lt;p&gt;The infrastructure is built. All five modules are deployed on AWS, wired end-to-end, and running against live market data on Interactive Brokers paper trading. Research refreshes signals weekly. The Predictor and Executor run autonomously every trading day — Predictor scores the latest signals, Executor places trades, end-of-day (EOD) reconciliation measures performance.&lt;/p&gt;

&lt;p&gt;Now comes the hard part: making it actually generate alpha.&lt;/p&gt;

&lt;p&gt;The system hasn't yet demonstrated sustained outperformance against SPY. That's the work ahead — refining signal quality, tuning the ML models, calibrating risk parameters, expanding the prediction ensemble, and iterating on the scoring weights until the system finds edges that persist. Once the system demonstrates that ability consistently in paper trading, the plan is to transition to real capital in small amounts.&lt;/p&gt;

&lt;p&gt;Building the infrastructure was the engineering challenge. Generating alpha is the research challenge. This is where it gets interesting.&lt;/p&gt;

&lt;h2&gt;Areas for Further Development&lt;/h2&gt;

&lt;p&gt;Beyond refining the core system, there are several directions that could meaningfully expand Nous Ergon's capabilities:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prediction Ensemble.&lt;/strong&gt; The Predictor currently runs a single LightGBM model. The near-term goal is to build out an ensemble with additional ML and deep learning architectures. Options like Temporal Fusion Transformers (TFT) are compelling for their ability to model time-varying relationships, but may be cost-prohibitive at this stage — both in compute for training and in the engineering effort to deploy on Lambda. As the system generates alpha, there will be opportunities to invest in stronger deep neural network (DNN) architectures and higher-quality data APIs that aren't justifiable today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval-Augmented Generation (RAG).&lt;/strong&gt; Research agents currently see fresh data each run plus the last thesis snapshot, but have no persistent memory of historical patterns — past earnings surprises, sector rotation cycles, how a stock behaved during previous rate hike environments. A RAG layer could let agents retrieve relevant historical context during their analysis, producing more informed research.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Tool Use.&lt;/strong&gt; Currently, the pipeline pre-fetches data and passes it to agents for analysis. With Model Context Protocol (MCP), agents could query data sources on demand — pulling specific SEC filings, checking real-time options flow, or querying alternative data — as part of their reasoning process rather than being limited to a pre-determined data scope.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Social Sentiment.&lt;/strong&gt; Financial Twitter/X surfaces market-moving information — earnings reactions, sector rotation narratives, retail sentiment — often faster than traditional news sources. Integrating social sentiment as an additional signal source for Research could expand the system's information surface area.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expanded Sources and Features.&lt;/strong&gt; Both Research and Predictor have room to grow their input data. Research could incorporate earnings call transcripts, insider trading filings, or institutional flow data. The Predictor's feature set could expand with alternative data sources — options market signals, credit spreads, or cross-asset correlations — that may carry predictive information the current 29 features don't capture.&lt;/p&gt;

&lt;p&gt;Each of these represents both a system improvement and a meaningful engineering challenge. The modular architecture — S3 contracts between independent modules — means any of them can be pursued without disrupting the rest of the system.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
  </channel>
</rss>
