## TL;DR
Claude has a real speed problem for our team — but mostly in TTFT, not in raw decoding speed.
I measured our actual usage and found this:
- TTFT p50: 4.2s–6.8s
- TTFT p90: 14.5s–28.1s
- Claude Sonnet decode p50: 176 tok/s
That explains the feeling: Claude often isn’t that slow once it starts, but sometimes it takes so long to begin that the whole thing feels like it’s crawling.
That naturally raises the next question:
Should we move the team to self-hosted open-weight models?
At first glance, that sounds promising. Self-hosted setups can have dramatically better TTFT. In the numbers I looked at, open-weight deployments were often estimated around 150–600ms TTFT, versus Claude’s 4–7s median in our real usage.
But once I looked at the actual team setup — 10 engineers sharing one GPU budget — the answer stopped looking obvious.
The best open-weight models need serious multi-GPU infra, and once that infra is shared, the speed case starts looking surprisingly shaky.
So this post is not “open source bad.”
It’s a narrower question:
If Claude feels slow, is moving a team to open-weight models on shared infra actually the answer?
Right now, I’m not convinced.
## The problem: Claude feels like it crawls
This started with a very practical complaint:
Claude is slow.
That could mean a lot of things, so I measured it.
Across about 50 session files and roughly 3,000 API calls, the pattern was clear: the main issue was TTFT, especially in the tail.
### TTFT from our real usage
| Trigger | p10 | p50 | p90 |
|---|---|---|---|
| User message | 2.8s | 6.8s | 28.1s |
| Tool result | 2.5s | 4.2s | 14.5s |
That 28.1s p90 is the whole story.
Claude is not just “a bit laggy” there. It’s slow enough to break flow.
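The percentile math itself is trivial; here is a sketch with made-up sample values (our real per-request TTFTs live in session logs whose format I'm not reproducing here):

```python
# Sketch: p10/p50/p90 of TTFT from a list of per-request samples.
# The sample values below are illustrative, not our measured data.
from statistics import quantiles

def ttft_percentiles(samples_s):
    """Return (p10, p50, p90) for TTFT samples in seconds."""
    q = quantiles(samples_s, n=10)  # 9 cut points: p10, p20, ..., p90
    return q[0], q[4], q[8]

ttfts = [2.1, 2.8, 3.5, 4.0, 5.2, 6.8, 8.9, 12.0, 20.3, 28.1, 31.0]
p10, p50, p90 = ttft_percentiles(ttfts)
print(f"p10={p10:.2f}s  p50={p50:.2f}s  p90={p90:.2f}s")
```

The point of tracking p90 separately is exactly what the table shows: the median can look tolerable while the tail is what breaks flow.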
## The surprising part: decode speed wasn’t the main problem
Here’s the other half of the picture.
### Generation speed
| Metric | p10 | p50 | p90 |
|---|---|---|---|
| Decode tok/s (excluding TTFT) | 72 | 178 | 567 |
| Wall tok/s (including TTFT) | 23 | 41 | 63 |
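The two rows differ only in whether the denominator includes TTFT. A minimal sketch (the timings are placeholders, not a specific measured request):

```python
# Sketch: decode tok/s vs. wall tok/s for a single streamed response.
# ttft_s and stream_s are illustrative placeholders.

def throughputs(ttft_s, stream_s, n_tokens):
    decode = n_tokens / stream_s           # rate once streaming starts
    wall = n_tokens / (ttft_s + stream_s)  # rate the user actually feels
    return decode, wall

# 500 tokens streamed in 2.8s after a 4.2s TTFT:
decode, wall = throughputs(ttft_s=4.2, stream_s=2.8, n_tokens=500)
print(f"decode ~{decode:.0f} tok/s, wall ~{wall:.0f} tok/s")
```

With numbers like these, a long TTFT cuts the felt rate by more than half even though decoding itself is fast — which is the whole gap between the two rows above.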
And per model:
| Model | TTFT p50 | Decode p50 |
|---|---|---|
| Haiku 4.5 | 1.8s | 287 tok/s |
| Sonnet 4.6 | 4.2s | 176 tok/s |
| Opus 4.6 | 4.7s | 130 tok/s |
So the core problem wasn’t really:
Claude can’t stream fast enough.
It was:
Claude often takes too long to get started.
That distinction matters, because it makes self-hosting sound much more attractive than it might actually be.
## Why open weights sound like the obvious answer
If TTFT is the problem, then self-hosting sounds like the clean fix.
The pitch is simple:
- no provider-side queue
- no shared API congestion
- your own inference server
- much lower TTFT
And the numbers I collected from the self-hosting side were definitely seductive.
### Best-case self-hosted framing
| Metric | Claude now | Best self-hosted |
|---|---|---|
| Tool-triggered TTFT p50 | 4,200ms | ~160ms |
| User-triggered TTFT p50 | 6,800ms | ~160ms |
| TTFT p90 (bad day) | 14,500ms+ | <400ms |
If TTFT were the only thing that mattered, I think this would already be enough to move seriously toward GPUs.
But TTFT is not the whole developer experience.
## The models we’d actually consider
We’re not talking about toy models here. We’re talking about the real open-weight candidates people would actually put on the table.
### Models considered
| Model | Why consider it? |
|---|---|
| Qwen3-Coder-Next | Fast MoE coding model, 80B total / 3B active |
| MiniMax M2.5 | Stronger quality candidate, 230B total / 10B active |
| DeepSeek V3.2 | Very large MoE option |
| Qwen3.5-27B | Dense, simpler, slower but cheaper |
And the inference engines are the standard ones you’d expect:
### Inference engines
| Model family | Realistic inference engine |
|---|---|
| Qwen / DeepSeek | vLLM or SGLang |
| MiniMax M2.5 | vLLM |
| Dense smaller models | usually vLLM |
That means this isn’t some hypothetical future stack. It’s the standard modern self-hosted inference path.
## The part that makes this much less exciting: GPU budgets are shared
This is the piece I think gets hand-waved away too often.
Our current setup is:
| Item | Value |
|---|---|
| Engineers | 10 |
| Claude subscription per engineer | $150/mo |
| Total Claude cost | $1,500/mo |
The budget I was willing to entertain for self-hosting was roughly 3× that, so about $4,500/month.
That sounds like a lot.
But for top open-weight coding models, it buys you something like this:
### What the budget can buy
| Config | Cost/month | Notes |
|---|---|---|
| 5× H100 on Vast.ai | $4,712 | Enough for MiniMax M2.5 / DeepSeek-class INT4 |
| 3× H100 on Lambda | $4,521 | More reliable, lower GPU count |
| 4× H200 on Vast.ai | $4,153 | Better memory bandwidth |
| 8× A100 on Vast.ai | $2,580 | Cheapest high-count option |
That’s not “10 engineers each get a fast private model.”
That’s one shared cluster.
And that changes the question completely.
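One sanity check the budget framing invites: divide each option by head count and compare it to the current subscription (all figures come from the tables above):

```python
# Sketch: per-engineer cost of each shared-cluster option vs. the
# current $150/mo Claude subscription (figures from the post).
ENGINEERS = 10
CLAUDE_PER_ENGINEER = 150  # $/mo

configs = {
    "5x H100 (Vast.ai)": 4712,
    "3x H100 (Lambda)": 4521,
    "4x H200 (Vast.ai)": 4153,
    "8x A100 (Vast.ai)": 2580,
}

for name, monthly in configs.items():
    share = monthly / ENGINEERS
    ratio = share / CLAUDE_PER_ENGINEER
    print(f"{name}: ${share:.0f}/engineer/mo ({ratio:.1f}x Claude)")
```

So every engineer is paying roughly 2–3× the subscription price for a *share* of one cluster, not a private one.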
## The real metric is not TTFT. It’s team step time.
The right equation is not:

`lower TTFT = faster experience`

It’s more like:

`team step time = queueing + TTFT + output_tokens / decode_speed`
That’s the part that made me hesitate.
Because once you share one cluster across 10 engineers:
- TTFT might improve
- but per-user decoding might not
- and queueing becomes part of the story
That is a very different situation from “look how fast this benchmark is on one box.”
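To see how queueing enters the equation, here is a toy M/M/1 sketch. Every rate in it is made up; the point is only the shape — that shared-cluster waiting can dominate even an excellent TTFT:

```python
# Toy M/M/1 sketch of team step time on one shared inference server.
# All rates below are hypothetical; only the shape of the result matters.

def step_time(ttft_s, out_tokens, decode_tps, arrivals_per_s, service_s):
    """queueing + TTFT + output_tokens / decode_speed (seconds)."""
    rho = arrivals_per_s * service_s          # server utilization
    assert rho < 1, "queue is unstable"
    queue_wait = rho * service_s / (1 - rho)  # M/M/1 mean wait in queue
    return queue_wait + ttft_s + out_tokens / decode_tps

# 10 engineers each firing a request every ~30s; requests take ~2s:
t = step_time(ttft_s=0.16, out_tokens=500, decode_tps=400,
              arrivals_per_s=10 / 30, service_s=2.0)
print(f"~{t:.2f}s per step")
```

With these particular made-up numbers, the queue alone contributes ~4s — more than twenty times the ~160ms TTFT that made self-hosting look attractive in the first place.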
## Why I’m not yet sold
The self-hosted numbers I gathered looked like this:
### Self-hosted decode estimates I considered
| Model | Config | INT4 decode tok/s |
|---|---|---|
| Qwen3-Coder-Next | 2× H100 | ~3,400 |
| MiniMax M2.5 | 4× H100 | ~2,000 |
| MiniMax M2.5 | 2× H100 | ~1,000 |
| DeepSeek V3.2 | 5× H100 | ~700 |
| Qwen3.5-27B | 2× H100 | ~380 |
| Qwen3.5-27B | 1× H100 | ~190 |
Those numbers are exciting. They make open weights look like a no-brainer.
But they also raise exactly the question I still don’t think I’ve answered cleanly:
Are these the numbers one engineer feels, or the numbers a shared cluster produces in aggregate?
Because for a 10-person team, those are not the same thing.
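The naive version of that distinction is just division (real batched inference doesn’t scale this linearly, so treat it as a rough bound on what each engineer sees):

```python
# Sketch: aggregate decode throughput vs. what one engineer sees.
# Batched inference is not truly linear; this is a rough upper bound.
AGGREGATE_TPS = 2000  # e.g. MiniMax M2.5 on 4x H100, from the table above

for concurrent in (1, 3, 6, 10):
    per_user = AGGREGATE_TPS / concurrent
    print(f"{concurrent} concurrent requests -> ~{per_user:.0f} tok/s each")
```

Even this naive split can stay ahead of Claude’s decode rates at moderate concurrency — but that’s exactly the point: the headline number is the aggregate, and what one engineer feels depends on how many colleagues are hitting the cluster at the same moment.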
And once I started looking at the problem through the lens of shared infra, the speed case stopped looking like an obvious slam dunk.
## So where does that leave me?
I think I’ve convinced myself of a few things:
### What seems true
| Statement | My current view |
|---|---|
| Claude has a real speed problem | Yes |
| The problem is mostly TTFT | Yes |
| Self-hosting probably improves TTFT a lot | Yes |
| The best open-weight models are expensive to run well | Yes |
| Shared infra weakens the speed story | Yes |
| Moving the whole team looks obviously promising | No |
That’s the interesting part.
The story I expected was:
Claude is slow, open weights are fast, buy GPUs, problem solved.
The story I actually found was:
Claude is slow mostly because of TTFT.
Open weights probably help that.
But once the infra is shared across a team, the speed case gets much less clean.
## Bottom line
I started with a very simple frustration:
Claude felt slow.
I measured it and found a very specific issue:
TTFT, especially the p90 tail, was bad enough to make the whole experience feel like it was crawling.
That led to the obvious next idea:
What if we just move to open-weight models on our own GPUs?
And right now, my answer is not “definitely no.”
It’s this:
Open-weight models look promising for TTFT.
They look much less promising as a shared-infra speed fix for a whole team.
That’s the question I’m left with.
Not whether open weights are good.
Not whether they’re possible.
But whether they really solve the problem we actually have.