Most multi-agent LLM frameworks promise more than they deliver. TradingAgents is one of the rare exceptions: open-sourced by Tauric Research alongside an arXiv paper, now at version 0.2.4, and built around a clean role decomposition that mirrors a real research desk.
This guide focuses on what TradingAgents does, what changed in v0.2.4, how its agent architecture works, and how to test the LLM and market-data layers underneath with Apidog. If you are already thinking about agent contracts, pair this with the agents.md guide for API teams.
TL;DR
- TradingAgents is a multi-agent LLM trading framework from Tauric Research, described in arXiv 2412.20138.
- It decomposes trading into specialist agents: Fundamentals Analyst, Sentiment Analyst, News Analyst, Technical Analyst, Bull/Bear Researchers, Trader, and Risk Management agents.
- v0.2.4 adds structured-output agents, LangGraph checkpoint resume, persistent decision logs, Docker support, and more LLM providers.
- It can run against OpenAI-compatible endpoints, which makes hosted, local, and self-hosted models easier to swap.
- Use Apidog to mock market-data APIs, replay LLM traffic, assert structured output, and compare provider behavior.
- Download Apidog if you want to wire these checks into CI before trusting agent output.
What TradingAgents is
TradingAgents is a Python package and CLI for running a multi-agent trading research workflow.
Instead of asking one model to “analyze this stock,” the framework splits the workflow into roles:
- Fundamentals Analyst
- Sentiment Analyst
- News Analyst
- Technical Analyst
- Bull Researcher
- Bear Researcher
- Research Manager
- Trader
- Risk Management agents
- Portfolio Manager
Each agent has:
- A specific role prompt.
- A focused toolset.
- A place in the workflow graph.
- A defined output consumed by the next stage.
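In code, a role like this reduces to a small spec. A minimal sketch, with illustrative names only — these are not the framework's actual classes:

from dataclasses import dataclass
from typing import Callable

# Illustrative only: TradingAgents' real internals differ.
@dataclass
class AgentSpec:
    role_prompt: str          # the specialist system prompt
    tools: list[Callable]     # focused toolset for this role
    consumes: list[str]       # upstream artifacts this agent reads
    produces: str             # the artifact the next stage consumes

fundamentals = AgentSpec(
    role_prompt="You are a fundamentals analyst. Assess valuation and earnings.",
    tools=[],                 # e.g. filings and balance-sheet fetchers
    consumes=["ticker", "date"],
    produces="fundamentals_report",
)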
The project README frames it as research code, not investment advice. That distinction matters. The useful engineering lesson is not “let an LLM trade for you.” It is how to design a multi-agent system with specialist roles, debate, structured decisions, and an audit trail.
What v0.2.4 shipped
The v0.2.4 release is important because it improves reliability around long-running agent workflows.
Structured-output agents
The Research Manager, Trader, and Portfolio Manager now emit structured output through either:
- OpenAI Responses API
- Anthropic tool-use channel
That replaces brittle free-text parsing with typed JSON-style outputs, which makes downstream automation safer.
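The pattern looks roughly like this. The sketch below uses the OpenAI Chat Completions structured-output parameter to show the idea; the framework's actual wiring goes through the Responses API or Anthropic tool use, and its schemas will differ:

from openai import OpenAI

client = OpenAI()

# The model must return JSON matching a declared schema instead of free text.
resp = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Decide: buy, sell, or hold AAPL."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "trade_decision",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "action": {"type": "string", "enum": ["buy", "sell", "hold"]},
                    "confidence": {"type": "number"},
                    "reasoning": {"type": "string"},
                },
                "required": ["action", "confidence", "reasoning"],
                "additionalProperties": False,
            },
        },
    },
)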
LangGraph checkpoint resume
TradingAgents uses LangGraph for orchestration. v0.2.4 adds checkpoint resume support, so a run can recover from interruptions such as:
- LLM provider 429 responses
- market-data API throttling
- local process failures
- network issues
Instead of restarting the full workflow, you can resume from a saved checkpoint.
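The mechanism is standard LangGraph checkpointing. A self-contained sketch, with a toy one-node graph standing in for the real workflow:

import sqlite3
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.sqlite import SqliteSaver  # pip install langgraph-checkpoint-sqlite

class State(TypedDict):
    report: str

def analyst(state: State) -> State:
    return {"report": "fundamentals look fine"}

builder = StateGraph(State)
builder.add_node("analyst", analyst)
builder.add_edge(START, "analyst")
builder.add_edge("analyst", END)

# Persist checkpoints to disk so an interrupted run can recover.
app = builder.compile(checkpointer=SqliteSaver(sqlite3.connect("checkpoints.db")))

# The thread_id identifies the run; invoking again with the same id
# picks up from the last saved checkpoint instead of starting over.
config = {"configurable": {"thread_id": "AAPL-2026-04-30"}}
app.invoke({"report": ""}, config)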
Persistent decision log
Trader decisions are written to a SQLite log with reasoning, inputs, and timestamps.
That gives you an audit trail you can inspect later or use for evaluation.
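Because it is plain SQLite, you can query it directly. The path, table, and column names below are assumptions; check the schema your version actually writes:

import sqlite3

conn = sqlite3.connect("tradingagents/results/decisions.db")  # path assumed
for ticker, action, ts in conn.execute(
    "SELECT ticker, action, created_at FROM decisions ORDER BY created_at DESC LIMIT 10"
):
    print(f"{ts}  {ticker}  {action}")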
More LLM providers
v0.2.4 added support for:
- DeepSeek
- Qwen
- GLM
- Azure OpenAI
Those join the existing provider matrix that includes OpenAI, Anthropic, Gemini, and Grok.
If you want to compare cost and reasoning behavior, you can test DeepSeek through its OpenAI-compatible endpoint. The request pattern is covered in the DeepSeek V4 API guide.
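Because the endpoint is OpenAI-compatible, the official SDK works with only a base_url swap. A minimal sketch — the model name follows this article's naming and may differ from what the provider actually serves:

from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com")
resp = client.chat.completions.create(
    model="deepseek-v4-pro",  # name assumed; check the provider's model list
    messages=[{"role": "user", "content": "Summarize AAPL's latest 10-Q risks."}],
)
print(resp.choices[0].message.content)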
Docker and Windows fixes
The release also includes:
- Dockerfile support
- a Windows UTF-8/path encoding fix from v0.2.3
Not exciting, but useful if you want repeatable local or CI runs.
TradingAgents architecture
A complete TradingAgents run follows this flow:
- The CLI accepts a ticker and date.
- The Analyst Team fans out.
- Each analyst fetches data and writes a report.
- The Bull Researcher writes a bullish thesis.
- The Bear Researcher writes a bearish thesis.
- The researchers debate.
- The Research Manager synthesizes the debate into a recommendation.
- The Trader reads the recommendation and decision history.
- The Trader produces a trade plan.
- Risk Management agents review the plan from aggressive, conservative, and neutral perspectives.
- The Portfolio Manager approves or sends the plan back.
- The final decision is written to SQLite.
The highest LLM cost usually appears in the debate and risk-review stages because multiple agents reason over the same context.
That is also where smaller models tend to fail. A weak local model may loop, repeat arguments, or produce shallow Bull/Bear debates. Stronger reasoning models generally produce more useful tradeoffs and cleaner structured conclusions.
How it compares to LangGraph and CrewAI
TradingAgents is not a general-purpose agent framework in the way LangGraph or CrewAI are.
Think of the layers like this:
- LangGraph: low-level graph orchestration for agent workflows.
- CrewAI: general-purpose role-based multi-agent framework.
- TradingAgents: domain-specific implementation for trading research.
If you want maximum flexibility, start with LangGraph.
If you want a general multi-agent abstraction, evaluate CrewAI.
If you want to study a concrete, opinionated multi-agent workflow with debate, decision, risk review, and logging, read TradingAgents.
Why you need to test the API layers
TradingAgents depends on two unstable surfaces:
- Market-data APIs
- LLM provider APIs
Both can break runs in ways that are hard to debug.
Market-data APIs fail through drift
Common issues include:
- inconsistent free-tier rate limits
- renamed fields
- missing fields
- different trading-day boundaries
- different historical-data formats between vendors
A run can work one day and fail the next because a vendor changed a field such as regularMarketTime.
LLM provider APIs fail through shape and cost
Common issues include:
- changed response formats
- tool-call parsing differences
- reasoning-mode cost spikes
- provider-specific structured-output behavior
- token usage that varies by role
The fix is to keep saved, replayable request collections with assertions. That is where Apidog fits. The same pattern is useful for protocol-level testing, as described in the MCP server testing playbook.
Mock market-data APIs with Apidog
Use this workflow to make TradingAgents test runs deterministic.
Step 1: define upstream endpoints
Create an Apidog project and add the market-data endpoints TradingAgents calls, such as:
- Yahoo Finance
- FinnHub
- Polygon
- OpenBB
For each endpoint, save:
- method
- path
- query parameters
- headers
- example response body
Use real vendor responses as fixtures.
Step 2: enable the mock server
Turn on Apidog’s mock server and point TradingAgents’ tool configuration at the mock URL.
The Fundamentals Analyst, Technical Analyst, and other data-consuming agents now receive deterministic data instead of live vendor responses.
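How you point the tools at the mock depends on your configuration. A hypothetical sketch, assuming the data layer resolves its vendor base URL from an environment variable (the variable name is an assumption; check your tool config for the real override point):

import os
import requests

# Point FINNHUB_BASE_URL at the Apidog mock server to get fixtures
# instead of live vendor responses.
BASE_URL = os.environ.get("FINNHUB_BASE_URL", "https://finnhub.io/api/v1")

def fetch_quote(ticker: str) -> dict:
    return requests.get(f"{BASE_URL}/quote", params={"symbol": ticker}).json()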
Step 3: detect vendor drift
On a schedule, replay the live vendor endpoints and compare their response shapes against your saved fixtures.
Look for:
- renamed fields
- removed fields
- newly required fields
- type changes
- empty values where data previously existed
This is the same contract-first workflow described in contract-first API development.
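A minimal shape diff you can run inside a scheduled check looks like this. It is pure Python and makes no assumptions about any specific vendor:

def shape(value):
    # Reduce a response to its structural shape: field names and types only.
    if isinstance(value, dict):
        return {k: shape(v) for k, v in sorted(value.items())}
    if isinstance(value, list):
        return [shape(value[0])] if value else []
    return type(value).__name__

def drift(fixture: dict, live: dict) -> list[str]:
    """Report fields whose presence or type changed between fixture and live."""
    f, l = shape(fixture), shape(live)
    issues = []
    for key in f.keys() | l.keys():
        if key not in l:
            issues.append(f"removed: {key}")
        elif key not in f:
            issues.append(f"added: {key}")
        elif f[key] != l[key]:
            issues.append(f"type changed: {key} ({f[key]} -> {l[key]})")
    return issues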
Test the LLM provider layer
Before scaling TradingAgents runs, test three things.
1. Cost per role
Run a single ticker and capture token usage per agent.
At minimum, track:
- Fundamentals Analyst tokens
- Sentiment Analyst tokens
- News Analyst tokens
- Technical Analyst tokens
- Bull/Bear debate tokens
- Risk Management tokens
- final decision tokens
The Bull/Bear debate should usually be more expensive than a single analyst pass. If it is not, the model may be short-circuiting the debate.
Use Apidog request logs to capture provider traffic and compare token usage across runs.
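A small aggregation script makes the per-role breakdown concrete. The JSONL capture format below is an assumption; adapt the field names to whatever your logging layer actually emits:

import json
from collections import Counter

# Assumed line format: {"role": "fundamentals", "prompt_tokens": 1200, "completion_tokens": 400}
spend = Counter()
with open("run_log.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        spend[entry["role"]] += entry["prompt_tokens"] + entry["completion_tokens"]

for role, tokens in spend.most_common():
    print(f"{role:<25} {tokens:>8} tokens")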
2. Structured output shape
For v0.2.4 structured-output agents, add assertions that verify required fields exist.
For example, assert that the Trader output contains fields like:
{
  "action": "buy | sell | hold",
  "confidence": 0.72,
  "reasoning": "...",
  "risk_notes": "..."
}
Then add JSONPath checks such as:
$.action
$.confidence
$.reasoning
A structured-output regression is dangerous because downstream code may fail only after the model response is already accepted.
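The same contract is easy to mirror in test code. A minimal validator for the shape above:

def validate_trader_output(payload: dict) -> list[str]:
    """Return a list of contract violations; empty means the shape is intact."""
    errors = []
    if payload.get("action") not in {"buy", "sell", "hold"}:
        errors.append("action must be one of buy/sell/hold")
    confidence = payload.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        errors.append("confidence must be a number in [0, 1]")
    if not isinstance(payload.get("reasoning"), str):
        errors.append("reasoning must be a string")
    return errors

assert validate_trader_output(
    {"action": "buy", "confidence": 0.72, "reasoning": "...", "risk_notes": "..."}
) == []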
3. Provider parity
When swapping providers, do not compare one run against one run.
Instead:
- Select a fixed ticker basket.
- Run the same dates through provider A.
- Run the same dates through provider B.
- Compare the SQLite decision logs.
- Measure how often conclusions diverge.
For example:
OpenAI vs DeepSeek
30 tickers
2 debate rounds
same market-data fixtures
same date range
compare final action + confidence + reasoning summary
Use the DeepSeek V4 API guide and the GPT-5.5 API guide for provider request patterns.
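A divergence check over two decision logs might look like the sketch below. The decisions schema is an assumption for illustration; match it to the log your version writes:

import sqlite3

def final_actions(db_path: str) -> dict[tuple[str, str], str]:
    # Assumed schema: a decisions table keyed by (ticker, date) with an action column.
    conn = sqlite3.connect(db_path)
    return {
        (ticker, date): action
        for ticker, date, action in conn.execute(
            "SELECT ticker, date, action FROM decisions"
        )
    }

a = final_actions("results_openai.db")
b = final_actions("results_deepseek.db")
shared = a.keys() & b.keys()
diverged = [k for k in shared if a[k] != b[k]]
print(f"divergence: {len(diverged)}/{len(shared)} runs")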
Minimal TradingAgents run
A basic run looks like this:
git clone https://github.com/TauricResearch/TradingAgents
cd TradingAgents
pip install -r requirements.txt
export OPENAI_API_KEY="sk-..."
export FINNHUB_API_KEY="..."
python -m tradingagents.cli \
  --ticker AAPL \
  --date 2026-04-30 \
  --models gpt-5.5 \
  --rounds 2
Two debate rounds are a practical minimum for testing the Bull/Bear workflow.
The output is written under:
tradingagents/results/
Expect JSON artifacts plus a Markdown decision summary.
Swap to DeepSeek
To test a different reasoning provider, configure the provider and model:
export DEEPSEEK_API_KEY="sk-..."
python -m tradingagents.cli \
  --ticker AAPL \
  --date 2026-04-30 \
  --models deepseek-v4-pro \
  --provider deepseek \
  --rounds 2
The same pattern applies to Qwen, GLM, or local OpenAI-compatible servers such as Ollama or vLLM.
For local model options, see the best local LLMs of 2026 post.
Common pitfalls
Running with a model that is too small
Small local models can produce repetitive Bull/Bear debates that never converge.
For serious evaluation, use at least a mid-tier reasoning model. The original article identifies DeepSeek V4 Flash, Qwen 3.6 32B, GPT-5.5, and Claude 4.5 as realistic options.
Skipping market-data caching
Each analyst can call the data layer separately. Without caching, one ticker run can fan out into multiple vendor requests.
Enable caching before running batches.
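A simple disk memoizer is enough for test runs. A sketch, where fetch stands in for whatever function actually hits the vendor:

import json
from pathlib import Path

CACHE_DIR = Path(".cache/market_data")

def cached_fetch(ticker: str, date: str, fetch) -> dict:
    """Memoize vendor responses on disk so repeated analyst calls reuse them."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path = CACHE_DIR / f"{ticker}_{date}.json"
    if path.exists():
        return json.loads(path.read_text())
    data = fetch(ticker, date)  # `fetch` is your real vendor call (name assumed)
    path.write_text(json.dumps(data))
    return data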
Treating research code as a trading bot
TradingAgents is research code. Backtest results are sensitive to:
- model choice
- prompt seed
- debate length
- data quality
- provider behavior
Treat outputs as hypotheses, not executable trading strategies.
Not logging token spend
A single ticker run can cost anywhere from cents to several dollars depending on model and debate rounds.
Track per-run cost in Apidog’s replay history so debate loops do not silently burn budget.
Hardcoding one provider
The framework supports multiple providers. Use that to your advantage.
Before committing to one provider:
- Run the same ticker set through several models.
- Compare decision logs.
- Compare token cost.
- Review failure modes.
- Pick based on both cost and behavior.
Where Apidog fits in the development loop
Design the API surface
Before wiring TradingAgents to live vendors, model each market-data endpoint in Apidog.
That forces you to identify which response fields the agents actually need.
Run local CI against mocks
Use Apidog’s mock server for unit and integration tests.
That keeps tests independent of:
- vendor uptime
- market hours
- rate limits
- network failures
The same workflow is covered in API testing without Postman.
Diff live responses against fixtures
Schedule a weekly replay of live vendor endpoints.
Compare the live response shape against saved fixtures and alert on schema drift. This gives you an early warning when the data layer changes underneath the agents.
Why this pattern matters beyond trading
TradingAgents is useful even if you never build trading software.
The architecture transfers to other multi-step agent workflows:
- customer support triage
- code review
- compliance review
- research summarization
- incident analysis
- security review
The reusable pattern is:
specialist agents -> debate/review -> synthesis -> decision -> audit log
That structure is easier to test than a single large prompt because each stage has a defined responsibility and output.
Real-world examples
A quant research student can run the same 30-ticker basket through DeepSeek V4, GPT-5.5, and Claude 4.5, then use Apidog logs to compare request/response behavior.
A fintech engineer can reuse the multi-agent pattern for code reviews: security agent, performance agent, style agent, then a synthesizer that writes the final PR comment.
A solo developer can run TradingAgents nightly on a 10-ticker watchlist and log every decision into a database while using Apidog mocks for weekend test runs.
Conclusion
TradingAgents is a practical reference implementation for multi-agent LLM workflows. It uses specialist roles, debate, risk review, structured decisions, and persistent logs instead of a single monolithic prompt.
v0.2.4 makes the project more useful for production-style experimentation with structured outputs, checkpoint resume, SQLite decision logs, Docker support, and broader provider coverage.
The key implementation lesson: test the layers underneath the agents.
- Mock market-data vendors in Apidog.
- Assert structured LLM outputs.
- Log token cost by role.
- Compare providers with repeatable fixtures.
- Treat final decisions as research artifacts, not trading instructions.
Next step: clone the repo, run one ticker, and route the upstream calls through an Apidog mock server. You should know within an hour whether the architecture fits your workflow.
FAQ
Is TradingAgents safe to use with real money?
The repo describes TradingAgents as research code, not financial advice. Treat its output as a hypothesis. Running it against a live brokerage is your own risk.
Which LLM provider gives the best cost-quality tradeoff?
The original article identifies DeepSeek V4 Flash with thinking mode as a strong cost-quality option for early 2026 workloads. See the DeepSeek V4 API guide for request details.
Can I run TradingAgents on local models?
Yes. Multi-provider support allows OpenAI-compatible local endpoints from tools such as Ollama, vLLM, and LM Studio. See the best local LLMs of 2026 post.
How do I mock market-data APIs?
Define each vendor endpoint in Apidog, enable the mock server, and point TradingAgents’ tool config at the mock URL. The same pattern is covered in API testing tools for QA engineers.
What hardware do I need?
If you call hosted LLMs such as OpenAI, Anthropic, or DeepSeek, any laptop with Python 3.10+ should be enough.
If you serve local models, hardware depends on model size. Larger reasoning models need substantially more GPU memory than small local models.
Does it support after-hours and weekend simulation?
TradingAgents can run against historical data for a selected date. Live trading is a separate problem that the framework does not claim to solve.
How does it compare to other multi-agent frameworks?
TradingAgents is domain-specific. CrewAI, AutoGen, and LangGraph are general-purpose. Use TradingAgents to study a concrete multi-agent implementation; use LangGraph or another general framework when you need to build your own agent graph from scratch.