Most multi-agent LLM frameworks promise more than they deliver. TradingAgents is one of the rare exceptions: open-sourced by Tauric Research alongside an arXiv paper, now at version 0.2.4, and built around a clean role decomposition that mirrors a real research desk.
This guide focuses on what TradingAgents does, what changed in v0.2.4, how its agent architecture works, and how to test the LLM and market-data layers underneath with Apidog. If you are already thinking about agent contracts, pair this with the agents.md guide for API teams.
TL;DR
- TradingAgents is a multi-agent LLM trading framework from Tauric Research, described in arXiv 2412.20138.
- It decomposes trading into specialist agents: Fundamentals Analyst, Sentiment Analyst, News Analyst, Technical Analyst, Bull/Bear Researchers, Trader, and Risk Management agents.
- v0.2.4 adds structured-output agents, LangGraph checkpoint resume, persistent decision logs, Docker support, and more LLM providers.
- It can run against OpenAI-compatible endpoints, which makes hosted, local, and self-hosted models easier to swap.
- Use Apidog to mock market-data APIs, replay LLM traffic, assert structured output, and compare provider behavior.
- Download Apidog if you want to wire these checks into CI before trusting agent output.
What TradingAgents is
TradingAgents is a Python package and CLI for running a multi-agent trading research workflow.
Instead of asking one model to “analyze this stock,” the framework splits the workflow into roles:
- Fundamentals Analyst
- Sentiment Analyst
- News Analyst
- Technical Analyst
- Bull Researcher
- Bear Researcher
- Research Manager
- Trader
- Risk Management agents
- Portfolio Manager
Each agent has:
- A specific role prompt.
- A focused toolset.
- A place in the workflow graph.
- A defined output consumed by the next stage.
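In code, a role like this reduces to a small spec. A minimal sketch, with illustrative names only — these are not the framework's actual classes:

from dataclasses import dataclass
from typing import Callable

# Illustrative only: TradingAgents' real internals differ.
@dataclass
class AgentSpec:
    role_prompt: str          # the specialist system prompt
    tools: list[Callable]     # focused toolset for this role
    consumes: list[str]       # upstream artifacts this agent reads
    produces: str             # the artifact the next stage consumes

fundamentals = AgentSpec(
    role_prompt="You are a fundamentals analyst. Assess valuation and earnings.",
    tools=[],                 # e.g. filings and balance-sheet fetchers
    consumes=["ticker", "date"],
    produces="fundamentals_report",
)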
The project README frames it as research code, not investment advice. That distinction matters. The useful engineering lesson is not “let an LLM trade for you.” It is how to design a multi-agent system with specialist roles, debate, structured decisions, and an audit trail.
What v0.2.4 shipped
The v0.2.4 release is important because it improves reliability around long-running agent workflows.
Structured-output agents
The Research Manager, Trader, and Portfolio Manager now emit structured output through either:
- OpenAI Responses API
- Anthropic tool-use channel
That replaces brittle free-text parsing with typed JSON-style outputs, which makes downstream automation safer.
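The pattern looks roughly like this. The sketch below uses the OpenAI Chat Completions structured-output parameter to show the idea; the framework's actual wiring goes through the Responses API or Anthropic tool use, and its schemas will differ:

from openai import OpenAI

client = OpenAI()

# The model must return JSON matching a declared schema instead of free text.
resp = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Decide: buy, sell, or hold AAPL."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "trade_decision",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "action": {"type": "string", "enum": ["buy", "sell", "hold"]},
                    "confidence": {"type": "number"},
                    "reasoning": {"type": "string"},
                },
                "required": ["action", "confidence", "reasoning"],
                "additionalProperties": False,
            },
        },
    },
)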
LangGraph checkpoint resume
TradingAgents uses LangGraph for orchestration. v0.2.4 adds checkpoint resume support, so a run can recover from interruptions such as:
- LLM provider 429 responses
- market-data API throttling
- local process failures
- network issues
Instead of restarting the full workflow, you can resume from a saved checkpoint.
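The mechanism is standard LangGraph checkpointing. A self-contained sketch, with a toy one-node graph standing in for the real workflow:

import sqlite3
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.sqlite import SqliteSaver  # pip install langgraph-checkpoint-sqlite

class State(TypedDict):
    report: str

def analyst(state: State) -> State:
    return {"report": "fundamentals look fine"}

builder = StateGraph(State)
builder.add_node("analyst", analyst)
builder.add_edge(START, "analyst")
builder.add_edge("analyst", END)

# Persist checkpoints to disk so an interrupted run can recover.
app = builder.compile(checkpointer=SqliteSaver(sqlite3.connect("checkpoints.db")))

# The thread_id identifies the run; invoking again with the same id
# picks up from the last saved checkpoint instead of starting over.
config = {"configurable": {"thread_id": "AAPL-2026-04-30"}}
app.invoke({"report": ""}, config)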
Persistent decision log
Trader decisions are written to a SQLite log with reasoning, inputs, and timestamps.
That gives you an audit trail you can inspect later or use for evaluation.
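Because it is plain SQLite, you can query it directly. The path, table, and column names below are assumptions; check the schema your version actually writes:

import sqlite3

conn = sqlite3.connect("tradingagents/results/decisions.db")  # path assumed
for ticker, action, ts in conn.execute(
    "SELECT ticker, action, created_at FROM decisions ORDER BY created_at DESC LIMIT 10"
):
    print(f"{ts}  {ticker}  {action}")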
More LLM providers
v0.2.4 added support for:
- DeepSeek
- Qwen
- GLM
- Azure OpenAI
Those join the existing provider matrix that includes OpenAI, Anthropic, Gemini, and Grok.
If you want to compare cost and reasoning behavior, you can test DeepSeek through its OpenAI-compatible endpoint. The request pattern is covered in the DeepSeek V4 API guide.
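Because the endpoint is OpenAI-compatible, the official SDK works with only a base_url swap. A minimal sketch — the model name follows this article's naming and may differ from what the provider actually serves:

from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com")
resp = client.chat.completions.create(
    model="deepseek-v4-pro",  # name assumed; check the provider's model list
    messages=[{"role": "user", "content": "Summarize AAPL's latest 10-Q risks."}],
)
print(resp.choices[0].message.content)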
Docker and Windows fixes
The release also includes:
- Dockerfile support
- a Windows UTF-8/path encoding fix from v0.2.3
Not exciting, but useful if you want repeatable local or CI runs.
TradingAgents architecture
A complete TradingAgents run follows this flow:
- The CLI accepts a ticker and date.
- The Analyst Team fans out.
- Each analyst fetches data and writes a report.
- The Bull Researcher writes a bullish thesis.
- The Bear Researcher writes a bearish thesis.
- The researchers debate.
- The Research Manager synthesizes the debate into a recommendation.
- The Trader reads the recommendation and decision history.
- The Trader produces a trade plan.
- Risk Management agents review the plan from aggressive, conservative, and neutral perspectives.
- The Portfolio Manager approves or sends the plan back.
- The final decision is written to SQLite.
The highest LLM cost usually appears in the debate and risk-review stages because multiple agents reason over the same context.
That is also where smaller models tend to fail. A weak local model may loop, repeat arguments, or produce shallow Bull/Bear debates. Stronger reasoning models generally produce more useful tradeoffs and cleaner structured conclusions.
How it compares to LangGraph and CrewAI
TradingAgents is not a general-purpose agent framework in the way LangGraph or CrewAI are.
Think of the layers like this:
- LangGraph: low-level graph orchestration for agent workflows.
- CrewAI: general-purpose role-based multi-agent framework.
- TradingAgents: domain-specific implementation for trading research.
If you want maximum flexibility, start with LangGraph.
If you want a general multi-agent abstraction, evaluate CrewAI.
If you want to study a concrete, opinionated multi-agent workflow with debate, decision, risk review, and logging, read TradingAgents.
Why you need to test the API layers
TradingAgents depends on two unstable surfaces:
- Market-data APIs
- LLM provider APIs
Both can break runs in ways that are hard to debug.
Market-data APIs fail through drift
Common issues include:
- inconsistent free-tier rate limits
- renamed fields
- missing fields
- different trading-day boundaries
- different historical-data formats between vendors
A run can work one day and fail the next because a vendor changed a field such as regularMarketTime.
LLM provider APIs fail through shape and cost
Common issues include:
- changed response formats
- tool-call parsing differences
- reasoning-mode cost spikes
- provider-specific structured-output behavior
- token usage that varies by role
The fix is to keep saved, replayable request collections with assertions. That is where Apidog fits. The same pattern is useful for protocol-level testing, as described in the MCP server testing playbook.
Mock market-data APIs with Apidog
Use this workflow to make TradingAgents test runs deterministic.
Step 1: define upstream endpoints
Create an Apidog project and add the market-data endpoints TradingAgents calls, such as:
- Yahoo Finance
- FinnHub
- Polygon
- OpenBB
For each endpoint, save:
- method
- path
- query parameters
- headers
- example response body
Use real vendor responses as fixtures.
Step 2: enable the mock server
Turn on Apidog’s mock server and point TradingAgents’ tool configuration at the mock URL.
The Fundamentals Analyst, Technical Analyst, and other data-consuming agents now receive deterministic data instead of live vendor responses.
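How you point the tools at the mock depends on your configuration. A hypothetical sketch, assuming the data layer resolves its vendor base URL from an environment variable (the variable name is an assumption; check your tool config for the real override point):

import os
import requests

# Point FINNHUB_BASE_URL at the Apidog mock server to get fixtures
# instead of live vendor responses.
BASE_URL = os.environ.get("FINNHUB_BASE_URL", "https://finnhub.io/api/v1")

def fetch_quote(ticker: str) -> dict:
    return requests.get(f"{BASE_URL}/quote", params={"symbol": ticker}).json()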
Step 3: detect vendor drift
On a schedule, replay the live vendor endpoints and compare their response shapes against your saved fixtures.
Look for:
- renamed fields
- removed fields
- newly required fields
- type changes
- empty values where data previously existed
This is the same contract-first workflow described in contract-first API development.
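A minimal shape diff you can run inside a scheduled check looks like this. It is pure Python and makes no assumptions about any specific vendor:

def shape(value):
    # Reduce a response to its structural shape: field names and types only.
    if isinstance(value, dict):
        return {k: shape(v) for k, v in sorted(value.items())}
    if isinstance(value, list):
        return [shape(value[0])] if value else []
    return type(value).__name__

def drift(fixture: dict, live: dict) -> list[str]:
    """Report fields whose presence or type changed between fixture and live."""
    f, l = shape(fixture), shape(live)
    issues = []
    for key in f.keys() | l.keys():
        if key not in l:
            issues.append(f"removed: {key}")
        elif key not in f:
            issues.append(f"added: {key}")
        elif f[key] != l[key]:
            issues.append(f"type changed: {key} ({f[key]} -> {l[key]})")
    return issues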
Test the LLM provider layer
Before scaling TradingAgents runs, test three things.
1. Cost per role
Run a single ticker and capture token usage per agent.
At minimum, track:
- Fundamentals Analyst tokens
- Sentiment Analyst tokens
- News Analyst tokens
- Technical Analyst tokens
- Bull/Bear debate tokens
- Risk Management tokens
- final decision tokens
The Bull/Bear debate should usually be more expensive than a single analyst pass. If it is not, the model may be short-circuiting the debate.
Use Apidog request logs to capture provider traffic and compare token usage across runs.
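A small aggregation script makes the per-role breakdown concrete. The JSONL capture format below is an assumption; adapt the field names to whatever your logging layer actually emits:

import json
from collections import Counter

# Assumed line format: {"role": "fundamentals", "prompt_tokens": 1200, "completion_tokens": 400}
spend = Counter()
with open("run_log.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        spend[entry["role"]] += entry["prompt_tokens"] + entry["completion_tokens"]

for role, tokens in spend.most_common():
    print(f"{role:<25} {tokens:>8} tokens")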
2. Structured output shape
For v0.2.4 structured-output agents, add assertions that verify required fields exist.
For example, assert that the Trader output contains fields like:
{
  "action": "buy | sell | hold",
  "confidence": 0.72,
  "reasoning": "...",
  "risk_notes": "..."
}
Then add JSONPath checks such as:
$.action
$.confidence
$.reasoning
A structured-output regression is dangerous because downstream code may fail only after the model response is already accepted.
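The same contract is easy to mirror in test code. A minimal validator for the shape above:

def validate_trader_output(payload: dict) -> list[str]:
    """Return a list of contract violations; empty means the shape is intact."""
    errors = []
    if payload.get("action") not in {"buy", "sell", "hold"}:
        errors.append("action must be one of buy/sell/hold")
    confidence = payload.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        errors.append("confidence must be a number in [0, 1]")
    if not isinstance(payload.get("reasoning"), str):
        errors.append("reasoning must be a string")
    return errors

assert validate_trader_output(
    {"action": "buy", "confidence": 0.72, "reasoning": "...", "risk_notes": "..."}
) == []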
3. Provider parity
When swapping providers, do not compare one run against one run.
Instead:
- Select a fixed ticker basket.
- Run the same dates through provider A.
- Run the same dates through provider B.
- Compare the SQLite decision logs.
- Measure how often conclusions diverge.
For example:
OpenAI vs DeepSeek
30 tickers
2 debate rounds
same market-data fixtures
same date range
compare final action + confidence + reasoning summary
Use the DeepSeek V4 API guide and the GPT-5.5 API guide for provider request patterns.
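A divergence check over two decision logs might look like the sketch below. The decisions schema is an assumption for illustration; match it to the log your version writes:

import sqlite3

def final_actions(db_path: str) -> dict[tuple[str, str], str]:
    # Assumed schema: a decisions table keyed by (ticker, date) with an action column.
    conn = sqlite3.connect(db_path)
    return {
        (ticker, date): action
        for ticker, date, action in conn.execute(
            "SELECT ticker, date, action FROM decisions"
        )
    }

a = final_actions("results_openai.db")
b = final_actions("results_deepseek.db")
shared = a.keys() & b.keys()
diverged = [k for k in shared if a[k] != b[k]]
print(f"divergence: {len(diverged)}/{len(shared)} runs")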
Minimal TradingAgents run
A basic run looks like this:
git clone https://github.com/TauricResearch/TradingAgents
cd TradingAgents
pip install -r requirements.txt
export OPENAI_API_KEY="sk-..."
export FINNHUB_API_KEY="..."
python -m tradingagents.cli \
  --ticker AAPL \
  --date 2026-04-30 \
  --models gpt-5.5 \
  --rounds 2
Two debate rounds are a practical minimum for testing the Bull/Bear workflow.
The output is written under:
tradingagents/results/
Expect JSON artifacts plus a Markdown decision summary.
Swap to DeepSeek
To test a different reasoning provider, configure the provider and model:
export DEEPSEEK_API_KEY="sk-..."
python -m tradingagents.cli \
  --ticker AAPL \
  --date 2026-04-30 \
  --models deepseek-v4-pro \
  --provider deepseek \
  --rounds 2
The same pattern applies to Qwen, GLM, or local OpenAI-compatible servers such as Ollama or vLLM.
For local model options, see the best local LLMs of 2026 post.
Common pitfalls
Running with a model that is too small
Small local models can produce repetitive Bull/Bear debates that never converge.
For serious evaluation, use at least a mid-tier reasoning model. The original article identifies DeepSeek V4 Flash, Qwen 3.6 32B, GPT-5.5, and Claude 4.5 as realistic options.
Skipping market-data caching
Each analyst can call the data layer separately. Without caching, one ticker run can fan out into multiple vendor requests.
Enable caching before running batches.
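A simple disk memoizer is enough for test runs. A sketch, where fetch stands in for whatever function actually hits the vendor:

import json
from pathlib import Path

CACHE_DIR = Path(".cache/market_data")

def cached_fetch(ticker: str, date: str, fetch) -> dict:
    """Memoize vendor responses on disk so repeated analyst calls reuse them."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path = CACHE_DIR / f"{ticker}_{date}.json"
    if path.exists():
        return json.loads(path.read_text())
    data = fetch(ticker, date)  # `fetch` is your real vendor call (name assumed)
    path.write_text(json.dumps(data))
    return data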
Treating research code as a trading bot
TradingAgents is research code. Backtest results are sensitive to:
- model choice
- prompt seed
- debate length
- data quality
- provider behavior
Treat outputs as hypotheses, not executable trading strategies.
Not logging token spend
A single ticker run can cost anywhere from cents to several dollars depending on model and debate rounds.
Track per-run cost in Apidog’s replay history so debate loops do not silently burn budget.
Hardcoding one provider
The framework supports multiple providers. Use that to your advantage.
Before committing to one provider:
- Run the same ticker set through several models.
- Compare decision logs.
- Compare token cost.
- Review failure modes.
- Pick based on both cost and behavior.
Where Apidog fits in the development loop
Design the API surface
Before wiring TradingAgents to live vendors, model each market-data endpoint in Apidog.
That forces you to identify which response fields the agents actually need.
Run local CI against mocks
Use Apidog’s mock server for unit and integration tests.
That keeps tests independent of:
- vendor uptime
- market hours
- rate limits
- network failures
The same workflow is covered in API testing without Postman.
Diff live responses against fixtures
Schedule a weekly replay of live vendor endpoints.
Compare the live response shape against saved fixtures and alert on schema drift. This gives you an early warning when the data layer changes underneath the agents.
Why this pattern matters beyond trading
TradingAgents is useful even if you never build trading software.
The architecture transfers to other multi-step agent workflows:
- customer support triage
- code review
- compliance review
- research summarization
- incident analysis
- security review
The reusable pattern is:
specialist agents -> debate/review -> synthesis -> decision -> audit log
That structure is easier to test than a single large prompt because each stage has a defined responsibility and output.
Real-world examples
A quant research student can run the same 30-ticker basket through DeepSeek V4, GPT-5.5, and Claude 4.5, then use Apidog logs to compare request/response behavior.
A fintech engineer can reuse the multi-agent pattern for code reviews: security agent, performance agent, style agent, then a synthesizer that writes the final PR comment.
A solo developer can run TradingAgents nightly on a 10-ticker watchlist and log every decision into a database while using Apidog mocks for weekend test runs.
Conclusion
TradingAgents is a practical reference implementation for multi-agent LLM workflows. It uses specialist roles, debate, risk review, structured decisions, and persistent logs instead of a single monolithic prompt.
v0.2.4 makes the project more useful for production-style experimentation with structured outputs, checkpoint resume, SQLite decision logs, Docker support, and broader provider coverage.
The key implementation lesson: test the layers underneath the agents.
- Mock market-data vendors in Apidog.
- Assert structured LLM outputs.
- Log token cost by role.
- Compare providers with repeatable fixtures.
- Treat final decisions as research artifacts, not trading instructions.
Next step: clone the repo, run one ticker, and route the upstream calls through an Apidog mock server. You should know within an hour whether the architecture fits your workflow.
FAQ
Is TradingAgents safe to use with real money?
The repo describes TradingAgents as research code, not financial advice. Treat its output as a hypothesis. Running it against a live brokerage is your own risk.
Which LLM provider gives the best cost-quality tradeoff?
The original article identifies DeepSeek V4 Flash with thinking mode as a strong cost-quality option for early 2026 workloads. See the DeepSeek V4 API guide for request details.
Can I run TradingAgents on local models?
Yes. Multi-provider support allows OpenAI-compatible local endpoints from tools such as Ollama, vLLM, and LM Studio. See the best local LLMs of 2026 post.
How do I mock market-data APIs?
Define each vendor endpoint in Apidog, enable the mock server, and point TradingAgents’ tool config at the mock URL. The same pattern is covered in API testing tools for QA engineers.
What hardware do I need?
If you call hosted LLMs such as OpenAI, Anthropic, or DeepSeek, any laptop with Python 3.10+ should be enough.
If you serve local models, hardware depends on model size. Larger reasoning models need substantially more GPU memory than small local models.
Does it support after-hours and weekend simulation?
TradingAgents can run against historical data for a selected date. Live trading is a separate problem that the framework does not claim to solve.
How does it compare to other multi-agent frameworks?
TradingAgents is domain-specific. CrewAI, AutoGen, and LangGraph are general-purpose. Use TradingAgents to study a concrete multi-agent implementation; use LangGraph or another general framework when you need to build your own agent graph from scratch.