Dhananjay Lakkawar

We Built a Poor Man’s o1 on AWS for $0.25 – And You Can Too

I remember the first time I tried OpenAI’s o1.

I asked it a gnarly infrastructure question: “Design a multi‑region, strongly consistent queue that survives a full AWS region outage.”

It paused for ten seconds. Then it gave me a brilliant, cautious, self‑corrected answer. I was blown away.

Then I saw the price. And the rate limits. And the fact that I couldn’t see why it rejected certain paths.

That’s when a thought hit me – not a breakthrough, but an old, boring, beautiful cloud pattern: Map‑Reduce.

Because here’s the secret no AI lab will tell you: reasoning is just search. And search loves parallelism.

You don’t need o1. You need 50 cheap LLMs running in parallel, one judge, and AWS Step Functions.

Let me show you exactly how we built a “bring‑your‑own‑o1” engine. It costs 25 cents per hard question and runs in under 15 seconds.


The “Aha” Moment: Why One Model Fails

A single LLM is a brilliant guesser, but it only gets one shot. When you ask a really hard question, it starts generating tokens immediately. If it stumbles on token #20, the whole answer drifts into a ditch.

o1 fixes that by thinking before talking – simulating multiple internal chains of thought.

But here’s the trick: you don’t need a special model to do that. You can brute‑force reasoning by asking 50 different copies of a cheap model to each try a different approach. Then you hire a single expensive judge to pick the best ideas and stitch them together.

That’s not magic. That’s distributed computing.

I call it Scatter‑Gather Reasoning.
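
Stripped of the AWS plumbing, the whole pattern fits in a dozen lines. Here's a minimal sketch, with the two model calls passed in as plain callables (`generate` for the cheap worker model, `judge` for the strong one) rather than any particular SDK:

```python
# Scatter-gather reasoning in its simplest form: sample many diverse
# answers cheaply, then let one strong model pick over the pile.
# `generate` and `judge` are whatever LLM calls you have on hand.
def scatter_gather(question, generate, judge, n_workers=50):
    # Scatter: high temperature = high variance. Each draft deliberately
    # wanders down a different path on the same question.
    drafts = [generate(question, temperature=0.9) for _ in range(n_workers)]

    # Gather: one careful, low-temperature judge reviews every draft,
    # throws out the broken ones, and stitches the best ideas together.
    review = (
        f"Question: {question}\n\n"
        "Candidate answers:\n" + "\n---\n".join(drafts) + "\n\n"
        "Reject the clearly wrong ones and synthesize one final answer."
    )
    return judge(review, temperature=0.1)
```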


The Architecture: A 50‑Worker Reasoning Swarm

We built this entirely on serverless AWS. No Kubernetes. No long‑running GPUs.


Step 1 – The Scatter (High‑variance generation)

We take the user’s question and use a Step Functions Distributed Map to launch 50 Lambda invocations simultaneously. Each Lambda calls Claude 3 Haiku (super cheap, super fast) with temperature = 0.9.

High temperature means the same prompt yields wildly different answers. One Haiku might propose a Postgres‑based queue. Another might suggest SQS + DynamoDB. A third might hallucinate a completely wrong but interesting pattern.

That’s fine. We want diversity.

All 50 responses land in an S3 bucket within 2–4 seconds.
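
Here's roughly what one worker looks like. This is a minimal sketch, assuming Bedrock's Converse API, a RESULTS_BUCKET environment variable, and a Distributed Map item that carries the question plus a run_id and worker_id; those names are illustrative, not lifted from our actual codebase.

```python
import json
import os

import boto3

bedrock = boto3.client("bedrock-runtime")
s3 = boto3.client("s3")

def handler(event, context):
    # Each Distributed Map item carries the question plus a worker index.
    question = event["question"]
    worker_id = event["worker_id"]

    # Temperature 0.9: we *want* each worker to wander down a different path.
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": question}]}],
        inferenceConfig={"temperature": 0.9, "maxTokens": 1024},
    )
    answer = response["output"]["message"]["content"][0]["text"]

    # Park the raw attempt in S3 so the judge (and you) can inspect it later.
    key = f"runs/{event['run_id']}/worker-{worker_id}.json"
    s3.put_object(
        Bucket=os.environ["RESULTS_BUCKET"],
        Key=key,
        Body=json.dumps({"worker_id": worker_id, "answer": answer}),
    )
    return {"s3_key": key}
```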

Step 2 – The Gather (The Judge)

Once the 50 workers finish, Step Functions triggers a single Judge Lambda. This Lambda reads all 50 answers, builds a massive prompt, and sends it to Claude 3.5 Sonnet (much smarter, but slower and pricier) with temperature = 0.1.

The judge’s system prompt is brutally simple:

“Review these 50 solutions. Reject any that are clearly wrong. Extract the strongest ideas from the survivors. Then synthesize a single, correct, production‑ready answer. Cite which worker contributed which idea.”

Sonnet returns the final answer. The user sees a thoughtful, well‑reasoned response – without ever knowing 50 mini‑models died to bring it to them.
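
The judge Lambda is sketched below under the same assumptions: it lists this run's worker outputs from S3, builds one big prompt, and makes a single low-temperature call to Claude 3.5 Sonnet.

```python
import json
import os

import boto3

bedrock = boto3.client("bedrock-runtime")
s3 = boto3.client("s3")

JUDGE_SYSTEM = (
    "Review these solutions. Reject any that are clearly wrong. "
    "Extract the strongest ideas from the survivors. Then synthesize a single, "
    "correct, production-ready answer. Cite which worker contributed which idea."
)

def handler(event, context):
    bucket = os.environ["RESULTS_BUCKET"]
    prefix = f"runs/{event['run_id']}/"

    # Pull every worker's raw attempt back out of S3.
    answers = []
    for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
        answers.append(json.loads(body))

    # One massive prompt: the original question plus all candidate answers.
    prompt = (
        f"User question: {event['question']}\n\n"
        "Proposed answers:\n" + json.dumps(answers, indent=2)
    )

    # Temperature 0.1: the judge should be boring, careful, and consistent.
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        system=[{"text": JUDGE_SYSTEM}],
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.1, "maxTokens": 2048},
    )
    return {"answer": response["output"]["message"]["content"][0]["text"]}
```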


The Real Cost: $0.25 Per Deep Question

Let’s do the math. I use us‑east‑1 Bedrock prices (as of today).

Assumptions for one hard query:

  • Input: 500 tokens
  • Each Haiku output: 1,000 tokens
  • 50 Haiku workers
  • Judge reads ~50.5k tokens (all 50 answers plus the original question), writes 2k tokens

Haiku swarm:

(500 input + 1,000 output tokens) × 50 workers

≈ $0.0014 per worker → ≈ $0.069 total

Sonnet judge:

Input 50,500 tokens → ≈ $0.15

Output 2,000 tokens → ≈ $0.03

Total judge ≈ $0.18

Total ≈ $0.25 (plus pennies for Lambda, Step Functions, S3).
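
If you want to sanity-check that yourself, here's the arithmetic in a few lines of Python. The per-million-token prices are the ones assumed above; verify them against the current Bedrock pricing page before trusting the output.

```python
# Assumed us-east-1 on-demand prices in USD per million tokens (verify before use).
HAIKU_IN, HAIKU_OUT = 0.25, 1.25
SONNET_IN, SONNET_OUT = 3.00, 15.00

workers = 50
worker_in, worker_out = 500, 1_000      # tokens per Haiku worker
judge_in, judge_out = 50_500, 2_000     # judge reads all answers + the question

haiku_cost = workers * (worker_in * HAIKU_IN + worker_out * HAIKU_OUT) / 1e6
judge_cost = (judge_in * SONNET_IN + judge_out * SONNET_OUT) / 1e6

print(f"Haiku swarm:  ${haiku_cost:.3f}")               # ~$0.069
print(f"Sonnet judge: ${judge_cost:.3f}")                # ~$0.182
print(f"Total:        ${haiku_cost + judge_cost:.3f}")   # ~$0.250
```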

That’s 25 cents to simulate a reasoning engine that feels like o1.

For a financial strategy question or a compliance check? That’s nothing. For a “what’s the weather” query? Overkill. But for the hard problems – the ones where a mistake costs you hours – this pattern is a steal.


The Three Real‑World Limits (And How We Beat Them)

I’ve run this in production. You’ll hit three walls. Here’s how we handle each.

1. The Context Window Ceiling

50 workers × 1,000 tokens = 50k tokens. That’s fine for Sonnet’s 200k limit.

But if you go to 150 workers or each worker writes code? You’ll blow past 200k.

Our fix: A tournament bracket.

Instead of one judge, we run 10 sub‑judges (each reviewing 10 answers). They pick 2 winners each. Then a final judge reviews those 20 winners. Works up to 500 workers.
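
Here's the shape of the bracket as a plain Python sketch, with the two LLM calls abstracted behind `pick_best` and `synthesize` callables (both hypothetical; in the real build the first round is itself another Distributed Map, not an in-process loop):

```python
def tournament_judge(question, answers, pick_best, synthesize,
                     group_size=10, winners_per_group=2):
    # Round 1: each sub-judge only ever sees `group_size` answers,
    # so its prompt stays far below the context window.
    finalists = []
    for i in range(0, len(answers), group_size):
        group = answers[i:i + group_size]
        finalists.extend(pick_best(question, group, winners_per_group))

    # Final round: synthesize from the handful of surviving finalists only.
    return synthesize(question, finalists)
```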


2. Bedrock Throttling

Launching 100 concurrent Lambdas means 100 concurrent Bedrock calls, and that will hit the default quotas (ThrottlingException).

Fix 1: Request a quota increase for Bedrock on‑demand throughput (takes a few days).

Fix 2 (simpler): Use Step Functions MaxConcurrency = 25 to burst in waves. Adds 1–2 seconds but avoids errors.
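
On top of MaxConcurrency, it also helps to give each worker's Bedrock client adaptive retries, so a stray ThrottlingException turns into a backoff instead of a failed state. A small sketch (the retry count is arbitrary):

```python
import boto3
from botocore.config import Config

# "adaptive" mode backs off client-side when it starts seeing throttling,
# which smooths out the burst from a wave of concurrent workers.
bedrock = boto3.client(
    "bedrock-runtime",
    config=Config(retries={"max_attempts": 8, "mode": "adaptive"}),
)
```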

3. Latency: Not for Chat

Waiting for 50 LLMs + a judge reading 50k tokens takes 10–15 seconds in my tests.

Don’t use this for a real‑time chatbot. Use it for async tasks: report generation, architecture review, code refactoring suggestions, legal document analysis. Users will happily wait 15 seconds for a deeply reasoned answer.


Why This Feels Better Than o1 (To Me)

Yes, o1 is magical. But it’s also a black box. You don’t know why it rejected a path. You can’t tune it.

With this architecture, you can:

  • Log all 50 raw attempts – see exactly which ideas were rejected and why.
  • Swap the judge’s prompt – make it more or less conservative.
  • Change the worker model – use Llama 3 on Bedrock if Haiku isn’t creative enough.
  • Add a voting step – before the judge, have 3 small models rank the 50 answers.

You’re not praying to an API. You’re orchestrating intelligence.


The One Mistake I Made (And Fixed)

In an earlier draft of this post, I called this “Monte Carlo Tree Search on AWS.”

I was wrong. MCTS requires iterative tree expansion and backpropagation. This is just parallel sampling + a judge – technically “best‑of‑N with ensemble summarization.”

But you know what? It works. And it’s simple. And any senior engineer can build it in an afternoon.

So no more cargo‑culting AI buzzwords. Call it what it is: scatter‑gather reasoning.


Try It Yourself

You can build a minimal version today in less than 50 lines of Step Functions ASL and two Lambda functions.

The hardest part is writing the judge prompt. Here’s ours (edited for brevity):

You are a senior architect. You will receive 50 proposed answers to a user question.
Your job:
1) Discard any answer that contains factual errors or hallucinations.
2) For the remaining answers, extract the best components.
3) Synthesize a final answer that is better than any single proposal.
4) Cite which worker contributed which insight.

User question: {{original_prompt}}

Proposed answers:
{{answers_json}}
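
The {{...}} placeholders are plain string substitution before the judge call; a minimal sketch (the function name and arguments are illustrative):

```python
import json

def build_judge_prompt(template: str, original_prompt: str, answers: list) -> str:
    # No templating library needed; two replaces fill the prompt.
    return (template
            .replace("{{original_prompt}}", original_prompt)
            .replace("{{answers_json}}", json.dumps(answers, indent=2)))
```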

That’s it.

We’re running this for internal code reviews and infrastructure design. It’s not AGI. But it gives you an o1‑like feel for 25 cents, with full transparency.

Now go build your own swarm.


Have you tried multi‑agent consensus or parallel LLM patterns on AWS? I’d love to hear what judge prompts worked for you – drop a comment below.

