Achal Jhawar

Posted on May 20 • Originally published at github.com

Which LLM is the best stock picker? I built a benchmark to find out.

#ai #llm #opensource #showdev

Every other week there's a new GPT-vs-Claude-vs-Gemini benchmark on coding or math or reasoning. None of them tell you whether the model can actually make a decision under uncertainty, where the answer isn't in the training data and the result shows up two weeks later in a P&L.

So I built a different kind of eval. Seven frontier LLMs, $100,000 of paper capital each, identical tools, identical prompts, identical data. Every Monday they pick stocks. The market grades them.

The project is 1rok. Live leaderboard: investingbench.vercel.app. The clock started January 20, 2026.

The contestants

GPT-5.5 (OpenAI)
Gemini 3.1 Pro Preview (Google)
Grok 4.3 (xAI)
DeepSeek V4 Pro
GLM-5.1 (Zhipu)
Kimi K2.6 (Moonshot)
MiniMax M2.7

Each model gets its own isolated Alpaca paper account. Same tool registry, same prompts, same screener output. The LLM is the only variable.

Why this isn't a hedge fund pitch

I want to be upfront. I don't think a weekly LLM-driven portfolio is going to beat the S&P. If it could, hedge funds would already be doing it. Some are; results so far are mixed.

The point of 1rok isn't alpha. It's that "which model should I use" is the most-asked question in AI engineering, and most of the answers are vibes. Coding evals are saturated. Math benchmarks get gamed. I wanted a downstream task where the model has to plan, call tools, synthesize conflicting signals, and commit to a decision, with an objective scoreboard at the end.

Stock picking happens to fit. The fact that everyone has an opinion about it is a bonus.

The pipeline

Every Monday at 9:45 ET, a cron fires and kicks off one run per model in parallel. Each run is 10 agents in 4 stages.

Walking through it:

Macro reads the regime: interest rates, sector flows, yield curve, geopolitical news. Then it declares whether we're in risk-on growth or late-cycle caution. That constrains every downstream decision.
Screener runs 4-10 different stock screens (quality, value, growth, defensive) and surfaces 25-30 names. Stocks that pass multiple lenses get priority.
Six analysts work the candidate list in parallel. Each scores every stock 0-100 from a narrow angle.
Orchestrator composites the six scores into one number, applies the macro regime as an adjustment, and assigns A/B/C ratings.
Constructor turns ratings into trade orders. It delegates all portfolio math to dedicated calculation tools, because agents that try to do their own sizing math get it wrong about a third of the time.
Alpaca executes. Sells first to free cash, then buys.
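The "sells first, then buys" rule in the last step can be sketched as below. This is a minimal illustration, not the 1rok code: the `Order` shape, the `sequenceOrders` helper, and the ticker symbols are all hypothetical.

```typescript
// Hypothetical order shape; the real 1rok artifact format is not shown in this post.
type Side = "buy" | "sell";

interface Order {
  symbol: string;
  side: Side;
  notional: number; // dollar amount to trade
}

// Return a new array with every sell ahead of every buy,
// preserving the original relative order within each side.
function sequenceOrders(orders: readonly Order[]): Order[] {
  const sells = orders.filter((o) => o.side === "sell");
  const buys = orders.filter((o) => o.side === "buy");
  return [...sells, ...buys];
}

// Placeholder tickers, for illustration only.
const planned: Order[] = [
  { symbol: "AAA", side: "buy", notional: 12000 },
  { symbol: "BBB", side: "sell", notional: 8000 },
  { symbol: "CCC", side: "buy", notional: 5000 },
  { symbol: "DDD", side: "sell", notional: 9000 },
];

const sequenced = sequenceOrders(planned);
console.log(sequenced.map((o) => `${o.side} ${o.symbol}`).join(", "));
```

Submitting the sells first means the cash they free up is available before any buy is sent, so a buy is less likely to be rejected for insufficient buying power.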

The composite formula lives in one place and looks like this:

composite =
    fundamental   * 0.20  // business quality
  + valuation     * 0.20  // price discipline
  + (100 - risk)  * 0.20  // capital preservation (inverted)
  + technical     * 0.15
  + catalyst      * 0.15
  + sentiment     * 0.10
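
As a runnable sketch, the same formula in TypeScript might look like this. The weights are copied from the formula above; the `AnalystScores` type and the `compositeScore` name are assumptions for illustration, and the macro-regime adjustment is left out because its details are not given here.

```typescript
// Analyst scores, each on the 0-100 scale used by the six analysts.
interface AnalystScores {
  fundamental: number;
  valuation: number;
  risk: number; // higher means riskier
  technical: number;
  catalyst: number;
  sentiment: number;
}

// Weights copied from the formula; they sum to 1.0.
const WEIGHTS = {
  fundamental: 0.2,
  valuation: 0.2,
  risk: 0.2,
  technical: 0.15,
  catalyst: 0.15,
  sentiment: 0.1,
};

// Weighted sum of the six scores, with risk flipped to (100 - risk)
// so that a riskier stock scores lower.
function compositeScore(s: AnalystScores): number {
  return (
    s.fundamental * WEIGHTS.fundamental +
    s.valuation * WEIGHTS.valuation +
    (100 - s.risk) * WEIGHTS.risk +
    s.technical * WEIGHTS.technical +
    s.catalyst * WEIGHTS.catalyst +
    s.sentiment * WEIGHTS.sentiment
  );
}

// Two made-up stocks that differ only in their risk score.
const base = { fundamental: 80, valuation: 80, technical: 80, catalyst: 80, sentiment: 80 };
const lowRisk = compositeScore({ ...base, risk: 20 });
const highRisk = compositeScore({ ...base, risk: 80 });
console.log(lowRisk > highRisk); // the riskier stock gets the lower composite
```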

Risk is inverted on purpose. A high-conviction buy with high tail risk should be smaller, not bigger. The constructor caps any single position at 40%, holds at most 8 names, and won't let cash run above 15%.
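
Those three limits are easy to express as a check over a target portfolio. The sketch below is an assumption about how such a check could look, not the constructor's actual tool: `TargetPortfolio`, `findViolations`, and the sample weights are hypothetical, and only the 40%, 8, and 15% limits come from the text.

```typescript
// Hypothetical target-portfolio shape; weights are fractions of account equity.
interface TargetPortfolio {
  positions: Record<string, number>; // symbol -> weight
  cash: number; // cash weight
}

// Limits taken from the post.
const MAX_POSITION = 0.4;
const MAX_NAMES = 8;
const MAX_CASH = 0.15;
const EPS = 1e-9; // tolerance for floating-point noise

// Return a human-readable list of violated limits (empty means the portfolio passes).
function findViolations(p: TargetPortfolio): string[] {
  const violations: string[] = [];
  const symbols = Object.keys(p.positions).filter((s) => p.positions[s] > 0);

  if (symbols.length > MAX_NAMES) {
    violations.push(`holds ${symbols.length} names, max is ${MAX_NAMES}`);
  }
  for (const s of symbols) {
    if (p.positions[s] > MAX_POSITION + EPS) {
      violations.push(`${s} exceeds the 40% position cap`);
    }
  }
  if (p.cash > MAX_CASH + EPS) {
    violations.push("cash exceeds the 15% ceiling");
  }
  return violations;
}

// Placeholder tickers and weights, for illustration only.
const passing: TargetPortfolio = { positions: { AAA: 0.3, BBB: 0.3, CCC: 0.3 }, cash: 0.1 };
const failing: TargetPortfolio = { positions: { AAA: 0.55, BBB: 0.2 }, cash: 0.25 };

console.log(findViolations(passing));
console.log(findViolations(failing));
```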

Under the hood

How agents actually get data. Each pipeline run spins up its own tool registry. There are ~32 tools across 8 groups: market overview, stock data, screening, technicals, options, earnings, portfolio, web search. An agent calls listTools to see its slice (the Macro agent gets different tools than the Risk agent), then callTool(name, args) returns typed JSON from a handler that knows how to talk to Alpaca, Yahoo Finance, FRED, or Tavily. Retries, rate limits, and circuit breaking live in the handler layer, so agents never have to deal with a 429 or a flaky socket mid-thought.
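
To make the shape of that concrete, here is a minimal sketch of a per-agent tool slice with retries kept in the handler layer. Only the names `listTools` and `callTool` come from the description above; `ToolDef`, `withRetry`, `createAgentTools`, the tool names, and the stub data are hypothetical, and a real handler would also back off between attempts and respect rate limits.

```typescript
type ToolArgs = Record<string, unknown>;
type ToolHandler = (args: ToolArgs) => Promise<unknown>;

interface ToolDef {
  name: string;
  group: string; // e.g. "market overview", "options"
  handler: ToolHandler;
}

// Handler-layer retry: the agent only ever sees the final result or the final error.
// (No delay between attempts here; a real handler would add backoff.)
function withRetry(handler: ToolHandler, maxAttempts: number): ToolHandler {
  return async (args) => {
    let lastError: unknown = new Error("maxAttempts must be at least 1");
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return await handler(args);
      } catch (err) {
        lastError = err;
      }
    }
    throw lastError;
  };
}

// Build the per-agent view: listTools shows only the agent's slice,
// and callTool refuses anything outside it.
function createAgentTools(registry: ToolDef[], allowedGroups: string[]) {
  const slice = registry.filter((t) => allowedGroups.indexOf(t.group) !== -1);
  return {
    listTools: (): string[] => slice.map((t) => t.name),
    callTool: async (name: string, args: ToolArgs): Promise<unknown> => {
      const tool = slice.filter((t) => t.name === name)[0];
      if (!tool) throw new Error(`Tool not available to this agent: ${name}`);
      return tool.handler(args);
    },
  };
}

// Stub handlers standing in for real data sources; the first one fails twice.
let flakyCalls = 0;
const registry: ToolDef[] = [
  {
    name: "getYieldCurve",
    group: "market overview",
    handler: withRetry(async () => {
      flakyCalls += 1;
      if (flakyCalls < 3) throw new Error("simulated 429");
      return { twoYear: 4.1, tenYear: 4.5 }; // made-up numbers
    }, 3),
  },
  {
    name: "getOptionChain",
    group: "options",
    handler: withRetry(async (args) => ({ symbol: args.symbol, contracts: [] }), 3),
  },
];

const macroTools = createAgentTools(registry, ["market overview"]);
console.log(macroTools.listTools());
macroTools.callTool("getYieldCurve", {}).then((result) => console.log(result));
```

In this sketch the agent's view never includes the retry loop, which mirrors the idea that a 429 is the handler's problem, not the model's.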

Two commands, never one. run produces a portfolio-construction JSON artifact. execute reads the artifact and places orders. They're always separate.

bun run 1rok -- run --model gpt-5.5
bun run 1rok -- execute ./results/openai/gpt-5.5/portfolio-2026-04-16.json

run never touches a broker. --live is the only path to real order placement; without it, everything goes to paper-api.alpaca.markets. This means I can re-run any model on last week's data without accidentally trading, and I can audit exactly what the model decided before a single order leaves the box.
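
A minimal sketch of that default-to-paper behavior, assuming the flag is read straight from the argument list: `resolveBrokerBaseUrl` is a hypothetical helper, and how 1rok actually parses `--live` is not shown here. The two hostnames are Alpaca's paper and live trading endpoints.

```typescript
const PAPER_BASE_URL = "https://paper-api.alpaca.markets";
const LIVE_BASE_URL = "https://api.alpaca.markets";

// Paper trading is the default; only an explicit --live flag selects the live endpoint.
function resolveBrokerBaseUrl(argv: readonly string[]): string {
  return argv.indexOf("--live") !== -1 ? LIVE_BASE_URL : PAPER_BASE_URL;
}

// Hypothetical argument lists for the execute command.
const artifact = "./results/openai/gpt-5.5/portfolio-2026-04-16.json";
console.log(resolveBrokerBaseUrl(["execute", artifact]));
console.log(resolveBrokerBaseUrl(["execute", artifact, "--live"]));
```

Making the safe endpoint the fallback means a forgotten or mistyped flag results in a paper trade, never a real one.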

What I want to find out

Open questions I'm watching:

Does any model consistently beat the others, or is the spread just noise over a one-year window?
Do the cheaper models (Kimi, DeepSeek, GLM) underperform, or just trade more cautiously?
Do "reasoning" models actually reason better about a multi-step financial decision, or do they just spend more tokens arriving at the same answer?
Does any model panic in a drawdown?
Does any of them randomly load up on a single stock when it shouldn't?

I don't have answers yet. The whole experiment is about not having answers yet.

How to engage

Watch: investingbench.vercel.app for the live leaderboard, agent traces, and per-trade reasoning.
Run your own: clone github.com/achaljhawar/1rok, set whichever provider keys you have (any one of the six is enough), and then run bun run 1rok -- run --model <id>.

Star the repo if you want to follow the milestones. I'll write up findings as the leaderboard separates.

Top comments (2)

Achal Jhawar • May 20

repo: github.com/achaljhawar/1rok
website: investingbench.vercel.app/

Sol • May 21

Strong benchmark framing, especially the delayed P&L criterion. One methodological question: did you enforce a contamination boundary on evaluation windows (for example, embargoing any post-hoc market narratives that a model could echo when justifying picks)?

I keep seeing evals look robust on aggregate while leakage or relatedness assumptions stay implicit. I drafted a 12-question independence worksheet for that failure mode if useful: telegra.ph/LLM-Eval-Independence-D...