This article was originally published on aicoderscope.com
Most benchmarks comparing cloud AI coding to local inference focus on tokens per second. That number is not what you actually feel when you're writing code. A cloud API that returns 150 tok/s after an 800ms wait can feel slower than a local model pushing 40 tok/s with a 30ms time-to-first-response. The difference matters at the granularity where developers work — single-line completions, small function edits, and rapid back-and-forth loops with an agent.
This article separates the two variables that govern perceived coding speed, measures them for concrete setups, and maps results to the use cases where each setup wins. Hardware tiers covered: no GPU (cloud only), RTX 5060 Ti 16GB, and RTX 3090 24GB.
Why tokens/sec is the wrong metric for coding
Throughput (tok/s) matters for long responses. If you're asking an agent to refactor 800 lines of code or generate a full test suite, the speed at which tokens stream determines how long you wait. For those tasks, a 150 tok/s cloud API beats a 30 tok/s local model cleanly.
But most coding interactions are short. Autocomplete fills in a single argument or closes a function signature. An inline edit replaces 5-20 lines. A quick Q&A asks "what does this regex do?" None of these generate more than 100 tokens of output. For interactions under 100 tokens, time-to-first-response (TTFR) — the wall-clock gap between submitting the request and receiving the first token back — dominates the experience.
The math: at 40 tok/s, a 60-token completion takes 1.5 seconds of generation time. If your TTFR is 50ms, total perceived latency is ~1.55 seconds. Now take a 150 tok/s cloud API with an 800ms TTFR: the same 60-token completion takes 0.4 seconds of generation, for a total of ~1.2 seconds. Cloud wins on total time here, but only just, and for completions shorter than about 40 tokens the local model is already faster. If cloud throughput also drops below average on busy servers at that 800ms TTFR, local wins by a significant margin.
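The arithmetic above fits a simple two-term model: total time is TTFR plus output length divided by throughput. The sketch below is illustrative only; the function names are hypothetical, and the inputs are the example figures from the paragraph rather than new measurements. It also solves for the output length at which the two setups tie.

```python
def total_latency_s(ttfr_ms: float, tokens: int, tok_per_s: float) -> float:
    """Perceived latency: wait for the first token, then stream the output."""
    return ttfr_ms / 1000 + tokens / tok_per_s


def crossover_tokens(local_ttfr_ms: float, local_tps: float,
                     cloud_ttfr_ms: float, cloud_tps: float) -> float:
    """Output length at which both setups take the same total time.

    Assumes the local setup has the lower TTFR and the lower throughput.
    Shorter outputs favor local; longer outputs favor cloud.
    """
    ttfr_gap_s = (cloud_ttfr_ms - local_ttfr_ms) / 1000
    per_token_gap_s = 1 / local_tps - 1 / cloud_tps
    return ttfr_gap_s / per_token_gap_s


local = total_latency_s(ttfr_ms=50, tokens=60, tok_per_s=40)
cloud = total_latency_s(ttfr_ms=800, tokens=60, tok_per_s=150)
breakeven = crossover_tokens(50, 40, 800, 150)
print(f"local {local:.2f}s, cloud {cloud:.2f}s, break-even at ~{breakeven:.0f} tokens")
# prints: local 1.55s, cloud 1.20s, break-even at ~41 tokens
```

Plugging in your own TTFR and throughput numbers shows quickly which side of the break-even point your typical completion length falls on.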
This is the parameter space that determines whether autocomplete feels instant or annoyingly lagged.
Latency components: cloud vs local
Understanding what creates latency in each system shows you where to attack it.
Cloud API latency stack
Network round-trip time (RTT): From a US-based developer to Anthropic or OpenAI's nearest inference cluster. East Coast US to AWS us-east-1: roughly 20-60ms. West Coast: 30-80ms. EU: 80-150ms. India: 150-300ms. This floor cannot be optimized away — it's physics.
Queue and load-balancer time: Cloud providers queue requests during high-traffic periods. This is generally 0-200ms on paid tiers, but can spike higher. Anthropic's status page has historically logged queue-time incidents of 300-800ms during peak demand. OpenAI's batch and pro tiers have separate queue priorities.
Time-to-first-token on server: From when the inference server receives the request to when the first output token is emitted. This depends on server-side scheduling, model size, and speculative decoding implementation. For Claude Sonnet 4.6 on Anthropic's infrastructure, documented TTFR at normal load is in the 200-500ms range. For GPT-4o on OpenAI infrastructure, it's slightly faster, typically 150-400ms.
Combined cloud TTFR (network + queue + server TTFR): For a US developer on Cursor Pro under normal load, measured aggregate TTFR is approximately 600-900ms for Claude Sonnet 4.6 and 500-750ms for GPT-4o. During peak hours these stretch by 50-100%.
Local inference latency stack
Model load time (one-time cost): Cold-loading Qwen2.5-Coder 32B from disk into VRAM takes 8-12 seconds on an NVMe SSD. The 7B model loads in 2-4 seconds. Once loaded, a model stays resident in VRAM until it has been idle for longer than Ollama's keep-alive timeout, which the OLLAMA_KEEP_ALIVE environment variable controls (default: 5 minutes). During an active coding session requests arrive often enough to keep the model loaded, so in practice you pay this cost once per session; raise the timeout if you take long breaks.
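One way to avoid paying the cold-load cost mid-session is to preload the model with a longer keep-alive through Ollama's HTTP API. The following is a minimal sketch, assuming a default Ollama install listening on localhost:11434; the helper names, the "2h" duration, and the model tag are illustrative assumptions, not values from the measurements in this article.

```python
import json
import urllib.request

# Assumption: Ollama's default local endpoint.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_preload_request(model: str, keep_alive: str = "2h") -> urllib.request.Request:
    """Build a request that loads `model` and keeps it resident for `keep_alive`.

    Sending no prompt asks Ollama to load the model without generating text.
    """
    payload = {"model": model, "keep_alive": keep_alive}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def preload(model: str, keep_alive: str = "2h") -> None:
    """Send the preload request. Requires a running Ollama server."""
    request = build_preload_request(model, keep_alive)
    with urllib.request.urlopen(request, timeout=120) as response:
        response.read()


# The model tag is an assumption; use whichever tag you pulled.
request = build_preload_request("qwen2.5-coder:32b")
print(request.get_method(), request.full_url, request.data.decode("utf-8"))
```

Calling `preload("qwen2.5-coder:32b")` at the start of a session would move the 8-12 second load out of your first real request.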
Time-to-first-token on warm GPU: With the model loaded, TTFR on local hardware is dominated by the time to process the prompt tokens (prefill). For short prompts (< 500 tokens), this is 10-80ms on an RTX 5060 Ti or 3090. For a standard coding request with moderate context, expect 20-60ms TTFR — roughly 10-15× faster than cloud API TTFR under typical conditions.
Network: Zero. The request never leaves the machine.
Measured TTFR and throughput by setup
These numbers reflect measurements taken from community testing, Ollama benchmark threads, and API performance monitoring tooling. All GPU numbers assume the model is warm (loaded in VRAM).
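If you want to check figures like these on your own setup, both metrics can be read off any streaming response by timestamping tokens as they arrive. This is a sketch only, not the tooling behind the numbers below: `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` stands in for a real streaming client so the example runs without a server.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttfr_seconds, tokens_per_second, token_count) for a token stream.

    TTFR is the gap between starting to consume the stream and the first token.
    Throughput covers the tokens after the first, over the time after the first.
    """
    start = time.perf_counter()
    first_at = None
    count = 0
    for _ in tokens:
        if first_at is None:
            first_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first_at is None:
        raise ValueError("stream produced no tokens")
    ttfr = first_at - start
    generation_time = end - first_at
    if count > 1 and generation_time > 0:
        tokens_per_second = (count - 1) / generation_time
    else:
        tokens_per_second = 0.0
    return ttfr, tokens_per_second, count


def fake_stream(ttfr_s: float, tok_per_s: float, n: int) -> Iterator[str]:
    """Stand-in for a streaming client: sleeps to mimic TTFR and token pacing."""
    time.sleep(ttfr_s)
    for i in range(n):
        if i:
            time.sleep(1 / tok_per_s)
        yield f"tok{i}"


ttfr, tps, n = measure_stream(fake_stream(ttfr_s=0.05, tok_per_s=100, n=20))
print(f"TTFR {ttfr * 1000:.0f} ms, {tps:.0f} tok/s over {n} tokens")
```

To measure a real backend, pass the iterator your client library yields for a streamed response in place of `fake_stream`. Keeping the first token out of the throughput figure stops a slow TTFR from dragging down the tok/s number.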
Cursor Pro — Claude Sonnet 4.6 (cloud API)
- TTFR average: ~800ms
- TTFR range: 550ms (low load, East Coast US) – 1,400ms (peak hours or EU)
- Generation throughput: 120-180 tok/s (server-side; stream delivery varies)
- Rate limits: 500 requests/day on Pro tier before slow-mode kicks in; heavy agent loops can burn this in under 2 hours
Cursor Pro — GPT-4o (cloud API)
- TTFR average: ~600ms
- TTFR range: 400ms – 1,100ms
- Generation throughput: 100-160 tok/s
- Rate limits: Similar request-count limits; separate pool from Claude models
Local Ollama — Qwen2.5-Coder 32B (Q4_K_M) on RTX 3090 24GB
- TTFR (warm model): 30-50ms
- Generation throughput: 25-28 tok/s
- Rate limits: None
- Context window: 32k tokens by default; extend to 128k with OLLAMA_CONTEXT_LENGTH=131072 (expect TTFR increase to 80-150ms at large context)
Local Ollama — Qwen2.5-Coder 7B (Q4_K_M) on RTX 5060 Ti 16GB
- TTFR (warm model): 15-25ms
- Generation throughput: 85-100 tok/s
- Rate limits: None
- Context window: 16k tokens reasonable; 32k possible with VRAM headroom
The 5060 Ti with the 7B model beats the 3090 with the 32B model on both TTFR and throughput. It loses on code quality: the 32B model scores roughly 92% on HumanEval versus 75% for the 7B. That quality gap is real and matters for complex tasks, but for autocomplete the 7B output is usually sufficient.
Use-case latency breakdown
| Use case | Cloud API (Sonnet 4.6) | Local (32B / 3090) | Local (7B / 5060 Ti) | Winner |
|---|---|---|---|---|
| Autocomplete (< 50 tokens) | 700-900ms perceived | 50-80ms perceived | 25-50ms perceived | Local |
| Inline edit (200-500 tokens output) | 2.5-5s total | 8-22s total | 2.5-7s total | Cloud / 5060 Ti tie |
| Single-file refactor (~1k tokens) | 8-18s | 40-60s | 12-18s | Cloud / 5060 Ti tie |
| Multi-file agent loop (~3k tokens) | 25-60s | 110-180s | 35-50s | Cloud |
| 10 rapid agent calls in sequence | 70-150s total | 10-25s total | 5-15s total | Local |
| Large context (50k+ token codebase) | 15-40s first response | Not feasible (OOM) | Not feasible (OOM) | Cloud only |
The "10 rapid agent calls" row deserves explanation. When you run an agentic loop — generate code, test, fix error, test again — each loop iteration is a separate API call. Cloud API TTFR compounds: 10 calls at 800ms TTFR adds 8 seconds of pure waiting, before generation time. Local TTFR at 30ms adds 0.3 seconds. For tight iteration loops, local has a compounding TTFR advantage even when individual generations are slower.
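The compounding effect is plain multiplication, and it is worth running with your own loop length. A minimal sketch with a hypothetical helper name, using the 800ms and 30ms figures from the paragraph above:

```python
def ttfr_overhead_s(calls: int, ttfr_ms: float) -> float:
    """Seconds spent waiting for first tokens across sequential calls."""
    return calls * ttfr_ms / 1000


cloud_wait = ttfr_overhead_s(calls=10, ttfr_ms=800)
local_wait = ttfr_overhead_s(calls=10, ttfr_ms=30)
print(f"cloud {cloud_wait:.1f}s, local {local_wait:.1f}s of pure TTFR wait")
# prints: cloud 8.0s, local 0.3s of pure TTFR wait
```

This counts only the wait before each response starts; generation time per call comes on top of it.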
Rate limit latency: the invisible cloud tax
Cursor Pro at $20/month includes 500 "fast requests" per day on the premium models (Claude Sonnet 4.6, GPT-4o). After that, requests route to slower server pools with higher queue times. In practice, a developer running heavy agent sessions can exhaust fast requests by mid-afternoon, at which point effective TTFR can climb to 2-5 seconds per call.
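To estimate when the fast-request pool runs dry, divide the quota by your request rate. The sketch below uses the 500-request figure quoted above; the function name and the rate of 5 requests per minute are assumptions chosen for illustration, not measured values.

```python
def hours_until_quota_exhausted(daily_quota: int, requests_per_minute: float) -> float:
    """Hours of continuous use before the fast-request quota is spent."""
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    return daily_quota / requests_per_minute / 60


# Assumed rate: an agent loop issuing 5 requests per minute.
hours = hours_until_quota_exhausted(daily_quota=500, requests_per_minute=5)
print(f"quota exhausted after {hours:.1f} hours")
# prints: quota exhausted after 1.7 hours
```

Lighter interactive use at one request per minute would stretch the same quota past eight hours, which is why the limit mostly bites during agent-heavy sessions.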
The $60/month Business tier raises the limit but doesn't remove it. The $200/month Ultra tier has the most generous allocation but sti