## TL;DR
Claude has a real speed problem for our team — but mostly in TTFT, not in raw decoding speed.
I measured our actual usage and found this:
- TTFT p50: 4.2s–6.8s
- TTFT p90: 14.5s–28.1s
- Claude Sonnet decode p50: 176 tok/s
That explains the feeling: Claude often isn’t that slow once it starts, but sometimes it takes so long to begin that the whole thing feels like it’s crawling.
That naturally raises the next question:
Should we move the team to self-hosted open-weight models?
At first glance, that sounds promising. Self-hosted setups can have dramatically better TTFT. In the numbers I looked at, open-weight deployments were often estimated around 150–600ms TTFT, versus Claude’s 4–7s median in our real usage.
But once I looked at the actual team setup — 10 engineers sharing one GPU budget — the answer stopped looking obvious.
The best open-weight models need serious multi-GPU infra, and once that infra is shared, the speed case starts looking surprisingly shaky.
So this post is not “open source bad.”
It’s a narrower question:
If Claude feels slow, is moving a team to open-weight models on shared infra actually the answer?
Right now, I’m not convinced.
## The problem: Claude feels like it crawls
This started with a very practical complaint:
Claude is slow.
That could mean a lot of things, so I measured it.
Across about 50 session files and roughly 3,000 API calls, the pattern was clear: the main issue was TTFT, especially in the tail.
### TTFT from our real usage
| Trigger | p10 | p50 | p90 |
|---|---|---|---|
| User message | 2.8s | 6.8s | 28.1s |
| Tool result | 2.5s | 4.2s | 14.5s |
That 28.1s p90 is the whole story.
Claude is not just “a bit laggy” there. It’s slow enough to break flow.
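The percentile math itself is trivial; here is a sketch with made-up sample values (our real per-request TTFTs live in session logs whose format I'm not reproducing here):

```python
# Sketch: p10/p50/p90 of TTFT from a list of per-request samples.
# The sample values below are illustrative, not our measured data.
from statistics import quantiles

def ttft_percentiles(samples_s):
    """Return (p10, p50, p90) for TTFT samples in seconds."""
    q = quantiles(samples_s, n=10)  # 9 cut points: p10, p20, ..., p90
    return q[0], q[4], q[8]

ttfts = [2.1, 2.8, 3.5, 4.0, 5.2, 6.8, 8.9, 12.0, 20.3, 28.1, 31.0]
p10, p50, p90 = ttft_percentiles(ttfts)
print(f"p10={p10:.2f}s  p50={p50:.2f}s  p90={p90:.2f}s")
```

The point of tracking p90 separately is exactly what the table shows: the median can look tolerable while the tail is what breaks flow.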
## The surprising part: decode speed wasn’t the main problem
Here’s the other half of the picture.
### Generation speed
| Metric | p10 | p50 | p90 |
|---|---|---|---|
| Decode tok/s (excluding TTFT) | 72 | 178 | 567 |
| Wall tok/s (including TTFT) | 23 | 41 | 63 |
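The two rows differ only in whether the denominator includes TTFT. A minimal sketch (the timings are placeholders, not a specific measured request):

```python
# Sketch: decode tok/s vs. wall tok/s for a single streamed response.
# ttft_s and stream_s are illustrative placeholders.

def throughputs(ttft_s, stream_s, n_tokens):
    decode = n_tokens / stream_s           # rate once streaming starts
    wall = n_tokens / (ttft_s + stream_s)  # rate the user actually feels
    return decode, wall

# 500 tokens streamed in 2.8s after a 4.2s TTFT:
decode, wall = throughputs(ttft_s=4.2, stream_s=2.8, n_tokens=500)
print(f"decode ~{decode:.0f} tok/s, wall ~{wall:.0f} tok/s")
```

With numbers like these, a long TTFT cuts the felt rate by more than half even though decoding itself is fast — which is the whole gap between the two rows above.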
And per model:
| Model | TTFT p50 | Decode p50 |
|---|---|---|
| Haiku 4.5 | 1.8s | 287 tok/s |
| Sonnet 4.6 | 4.2s | 176 tok/s |
| Opus 4.6 | 4.7s | 130 tok/s |
So the core problem wasn’t really:
Claude can’t stream fast enough.
It was:
Claude often takes too long to get started.
That distinction matters, because it makes self-hosting sound much more attractive than it might actually be.
## Why open weights sound like the obvious answer
If TTFT is the problem, then self-hosting sounds like the clean fix.
The pitch is simple:
- no provider-side queue
- no shared API congestion
- your own inference server
- much lower TTFT
And the numbers I collected from the self-hosting side were definitely seductive.
### Best-case self-hosted framing
| Metric | Claude now | Best self-hosted |
|---|---|---|
| Tool-triggered TTFT p50 | 4,200ms | ~160ms |
| User-triggered TTFT p50 | 6,800ms | ~160ms |
| TTFT p90 (bad day) | 14,500ms+ | <400ms |
If TTFT were the only thing that mattered, I think this would already be enough to move seriously toward GPUs.
But TTFT is not the whole developer experience.
## The models we’d actually consider
We’re not talking about toy models here. We’re talking about the real open-weight candidates people would actually put on the table.
### Models considered
| Model | Why consider it? |
|---|---|
| Qwen3-Coder-Next | Fast MoE coding model, 80B total / 3B active |
| MiniMax M2.5 | Stronger quality candidate, 230B total / 10B active |
| DeepSeek V3.2 | Very large MoE option |
| Qwen3.5-27B | Dense, simpler, slower but cheaper |
And the inference engines are the standard ones you’d expect:
### Inference engines
| Model family | Realistic inference engine |
|---|---|
| Qwen / DeepSeek | vLLM or SGLang |
| MiniMax M2.5 | vLLM |
| Dense smaller models | usually vLLM |
That means this isn’t some hypothetical future stack. It’s the standard modern self-hosted inference path.
## The part that makes this much less exciting: GPU budgets are shared
This is the piece I think gets hand-waved away too often.
Our current setup is:
| Item | Value |
|---|---|
| Engineers | 10 |
| Claude subscription per engineer | $150/mo |
| Total Claude cost | $1,500/mo |
The budget I was willing to entertain for self-hosting was roughly 3× that, so about $4,500/month.
That sounds like a lot.
But for top open-weight coding models, it buys you something like this:
### What the budget can buy
| Config | Cost/month | Notes |
|---|---|---|
| 5× H100 on Vast.ai | $4,712 | Enough for MiniMax M2.5 / DeepSeek-class INT4 |
| 3× H100 on Lambda | $4,521 | More reliable, lower GPU count |
| 4× H200 on Vast.ai | $4,153 | Better memory bandwidth |
| 8× A100 on Vast.ai | $2,580 | Cheapest high-count option |
That’s not “10 engineers each get a fast private model.”
That’s one shared cluster.
And that changes the question completely.
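One sanity check the budget framing invites: divide each option by head count and compare it to the current subscription (all figures come from the tables above):

```python
# Sketch: per-engineer cost of each shared-cluster option vs. the
# current $150/mo Claude subscription (figures from the post).
ENGINEERS = 10
CLAUDE_PER_ENGINEER = 150  # $/mo

configs = {
    "5x H100 (Vast.ai)": 4712,
    "3x H100 (Lambda)": 4521,
    "4x H200 (Vast.ai)": 4153,
    "8x A100 (Vast.ai)": 2580,
}

for name, monthly in configs.items():
    share = monthly / ENGINEERS
    ratio = share / CLAUDE_PER_ENGINEER
    print(f"{name}: ${share:.0f}/engineer/mo ({ratio:.1f}x Claude)")
```

So every engineer is paying roughly 2–3× the subscription price for a *share* of one cluster, not a private one.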
## The real metric is not TTFT. It’s team step time.
The right equation is not:

`lower TTFT = faster experience`

It’s more like:

`team step time = queueing + TTFT + output_tokens / decode_speed`
That’s the part that made me hesitate.
Because once you share one cluster across 10 engineers:
- TTFT might improve
- but per-user decoding might not
- and queueing becomes part of the story
That is a very different situation from “look how fast this benchmark is on one box.”
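To see how queueing enters the equation, here is a toy M/M/1 sketch. Every rate in it is made up; the point is only the shape — that shared-cluster waiting can dominate even an excellent TTFT:

```python
# Toy M/M/1 sketch of team step time on one shared inference server.
# All rates below are hypothetical; only the shape of the result matters.

def step_time(ttft_s, out_tokens, decode_tps, arrivals_per_s, service_s):
    """queueing + TTFT + output_tokens / decode_speed (seconds)."""
    rho = arrivals_per_s * service_s          # server utilization
    assert rho < 1, "queue is unstable"
    queue_wait = rho * service_s / (1 - rho)  # M/M/1 mean wait in queue
    return queue_wait + ttft_s + out_tokens / decode_tps

# 10 engineers each firing a request every ~30s; requests take ~2s:
t = step_time(ttft_s=0.16, out_tokens=500, decode_tps=400,
              arrivals_per_s=10 / 30, service_s=2.0)
print(f"~{t:.2f}s per step")
```

With these particular made-up numbers, the queue alone contributes ~4s — more than twenty times the ~160ms TTFT that made self-hosting look attractive in the first place.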
## Why I’m not yet sold
The self-hosted numbers I gathered looked like this:
### Self-hosted decode estimates I considered
| Model | Config | INT4 decode tok/s |
|---|---|---|
| Qwen3-Coder-Next | 2× H100 | ~3,400 |
| MiniMax M2.5 | 4× H100 | ~2,000 |
| MiniMax M2.5 | 2× H100 | ~1,000 |
| DeepSeek V3.2 | 5× H100 | ~700 |
| Qwen3.5-27B | 2× H100 | ~380 |
| Qwen3.5-27B | 1× H100 | ~190 |
Those numbers are exciting. They make open weights look like a no-brainer.
But they also raise exactly the question I still don’t think I’ve answered cleanly:
Are these the numbers one engineer feels, or the numbers a shared cluster produces in aggregate?
Because for a 10-person team, those are not the same thing.
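The naive version of that distinction is just division (real batched inference doesn’t scale this linearly, so treat it as a rough bound on what each engineer sees):

```python
# Sketch: aggregate decode throughput vs. what one engineer sees.
# Batched inference is not truly linear; this is a rough upper bound.
AGGREGATE_TPS = 2000  # e.g. MiniMax M2.5 on 4x H100, from the table above

for concurrent in (1, 3, 6, 10):
    per_user = AGGREGATE_TPS / concurrent
    print(f"{concurrent} concurrent requests -> ~{per_user:.0f} tok/s each")
```

Even this naive split can stay ahead of Claude’s decode rates at moderate concurrency — but that’s exactly the point: the headline number is the aggregate, and what one engineer feels depends on how many colleagues are hitting the cluster at the same moment.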
And once I started looking at the problem through the lens of shared infra, the speed case stopped looking like an obvious slam dunk.
## So where does that leave me?
I think I’ve convinced myself of a few things:
### What seems true
| Statement | My current view |
|---|---|
| Claude has a real speed problem | Yes |
| The problem is mostly TTFT | Yes |
| Self-hosting probably improves TTFT a lot | Yes |
| The best open-weight models are expensive to run well | Yes |
| Shared infra weakens the speed story | Yes |
| Moving the whole team looks obviously promising | No |
That’s the interesting part.
The story I expected was:
Claude is slow, open weights are fast, buy GPUs, problem solved.
The story I actually found was:
Claude is slow mostly because of TTFT.
Open weights probably help that.
But once the infra is shared across a team, the speed case gets much less clean.
## Bottom line
I started with a very simple frustration:
Claude felt slow.
I measured it and found a very specific issue:
TTFT, especially the p90 tail, was bad enough to make the whole experience feel like it was crawling.
That led to the obvious next idea:
What if we just move to open-weight models on our own GPUs?
And right now, my answer is not “definitely no.”
It’s this:
Open-weight models look promising for TTFT.
They look much less promising as a shared-infra speed fix for a whole team.
That’s the question I’m left with.
Not whether open weights are good.
Not whether they’re possible.
But whether they really solve the problem we actually have.