Jovan Chan

Posted on Jun 2 • Originally published at aicoderscope.com

AI Coding Speed: Cloud API vs Local LLM — Real Latency Numbers in 2026

#performance #latency #localllm #cloudapi


Most benchmarks comparing cloud AI coding to local inference focus on tokens per second. That number is not what you actually feel when you're writing code. A cloud API that returns 150 tok/s after an 800ms wait can feel slower than a local model pushing 40 tok/s with a 30ms time-to-first-response. The difference matters at the granularity where developers work — single-line completions, small function edits, and rapid back-and-forth loops with an agent.

This article separates the two variables that govern perceived coding speed, measures them for concrete setups, and maps results to the use cases where each setup wins. Hardware tiers covered: no GPU (cloud only), RTX 5060 Ti 16GB, and RTX 3090 24GB.

Why tokens/sec is the wrong metric for coding

Throughput (tok/s) matters for long responses. If you're asking an agent to refactor 800 lines of code or generate a full test suite, the speed at which tokens stream determines how long you wait. For those tasks, a 150 tok/s cloud API beats a 30 tok/s local model cleanly.

But most coding interactions are short. Autocomplete fills in a single argument or closes a function signature. An inline edit replaces 5-20 lines. A quick Q&A asks "what does this regex do?" Most of these generate fewer than 100 tokens of output, and even a larger inline edit stays in the low hundreds. For interactions under 100 tokens, time-to-first-response (TTFR), the wall-clock gap between submitting the request and receiving the first token back, dominates the experience.

The math: at 40 tok/s, a 60-token completion takes 1.5 seconds of generation time. If your TTFR is 50ms, total perceived latency is ~1.55 seconds. Now take a 150 tok/s cloud API with an 800ms TTFR: the same 60-token completion takes 0.4 seconds of generation, for a total of ~1.2 seconds. Cloud wins on total time here, but only just, and the result reverses for shorter outputs: with these numbers the break-even is about 41 tokens, so any completion shorter than that finishes first on the local model. Slower cloud conditions tip the 60-token case too: once cloud TTFR passes roughly 1.15 seconds, or throughput drops on busy servers, local wins outright.
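The arithmetic above can be reproduced with a short sketch. It assumes a deliberately simple model (total time = TTFR + tokens / throughput) and ignores stream jitter; the function names are illustrative, not taken from any tool.

```python
def total_latency(ttfr_s: float, tokens: int, tok_per_s: float) -> float:
    """Seconds from sending the request to receiving the last token."""
    return ttfr_s + tokens / tok_per_s


def break_even_tokens(local_ttfr_s: float, local_tps: float,
                      cloud_ttfr_s: float, cloud_tps: float) -> float:
    """Output length at which both setups finish at the same moment.

    Shorter outputs favor the low-TTFR setup, longer ones the
    high-throughput setup. Assumes cloud_tps > local_tps.
    """
    return (cloud_ttfr_s - local_ttfr_s) / (1 / local_tps - 1 / cloud_tps)


# Figures from the paragraph above: 60 tokens, 40 tok/s local with 50ms TTFR,
# 150 tok/s cloud with 800ms TTFR.
local_total = total_latency(0.050, 60, 40)
cloud_total = total_latency(0.800, 60, 150)
crossover = break_even_tokens(0.050, 40, 0.800, 150)

print(f"local: {local_total:.2f}s, cloud: {cloud_total:.2f}s")
print(f"break-even output length: {crossover:.0f} tokens")
```

Plugging in your own measured TTFR and throughput shows quickly which side of the break-even your typical completions fall on.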

This is the parameter space that determines whether autocomplete feels instant or annoyingly lagged.

Latency components: cloud vs local

Understanding what creates latency in each system shows you where to attack it.

Cloud API latency stack

Network round-trip time (RTT): From the developer's machine to Anthropic or OpenAI's nearest inference cluster. East Coast US to AWS us-east-1: roughly 20-60ms. West Coast: 30-80ms. EU: 80-150ms. India: 150-300ms. This floor cannot be optimized away; it is set by physical distance.

Queue and load-balancer time: Cloud providers queue requests during high-traffic periods. This is generally 0-200ms on paid tiers, but can spike higher. Anthropic's status page has historically logged queue-time incidents of 300-800ms during peak demand. OpenAI's batch and pro tiers have separate queue priorities.

Time-to-first-token on server: From when the inference server receives the request to when the first output token is emitted. This depends on server-side scheduling, model size, and speculative decoding implementation. For Claude Sonnet 4.6 on Anthropic's infrastructure, documented TTFR at normal load is in the 200-500ms range. For GPT-4o on OpenAI infrastructure, it's slightly faster, typically 150-400ms.

Combined cloud TTFR (network + queue + server TTFR): For a US developer on Cursor Pro under normal load, measured aggregate TTFR is approximately 600-900ms for Claude Sonnet 4.6 and 500-750ms for GPT-4o. During peak hours these stretch by 50-100%.

Local inference latency stack

Model load time (one-time cost): Cold-loading Qwen2.5-Coder 32B from disk into VRAM takes 8-12 seconds on an NVMe SSD. The 7B model loads in 2-4 seconds. Once loaded, models stay resident in VRAM until they are unloaded or the idle timeout expires. Ollama keeps loaded models in memory by default; the OLLAMA_KEEP_ALIVE environment variable controls the idle timeout (default: 5 minutes). For active coding sessions, this is a cost you pay once per IDE launch, provided idle gaps stay shorter than the keep-alive timeout.
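One way to pay the load cost before you start typing is to preload the model with a longer keep-alive. The sketch below builds a request for Ollama's /api/generate endpoint, which loads a model when called without a prompt and accepts a per-request keep_alive value. The "1h" duration and the helper names are assumptions for illustration; the URL is Ollama's default local address.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama address


def preload_payload(model: str, keep_alive: str = "1h") -> dict:
    """Request body with no prompt: asks Ollama to load the model and keep it resident."""
    return {"model": model, "keep_alive": keep_alive, "stream": False}


def preload(model: str, keep_alive: str = "1h",
            url: str = OLLAMA_URL, timeout_s: float = 120.0) -> dict:
    """Send the preload request; needs a running Ollama server."""
    body = json.dumps(preload_payload(model, keep_alive)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read())


# Example (not run here because it needs a local server):
# preload("qwen2.5-coder:32b", keep_alive="1h")
```

Setting OLLAMA_KEEP_ALIVE on the server achieves the same effect globally; the per-request field is handy when only one model should stay warm.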

Time-to-first-token on warm GPU: With the model loaded, TTFR on local hardware is dominated by the time to process the prompt tokens (prefill). For short prompts (< 500 tokens), this is 10-80ms on an RTX 5060 Ti or 3090. For a standard coding request with moderate context, expect 20-60ms TTFR — roughly 10-15× faster than cloud API TTFR under typical conditions.

Network: Zero. The request never leaves the machine.

Measured TTFR and throughput by setup

These numbers reflect measurements taken from community testing, Ollama benchmark threads, and API performance monitoring tooling. All GPU numbers assume model is warm (loaded in VRAM).
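If you want to check numbers like these on your own setup, a minimal timing harness might look like the sketch below. It is not the tooling behind the figures in this article. It assumes the client hands you a lazy iterator of streamed chunks, so that the request is actually sent when iteration starts, and it counts chunks as a stand-in for tokens. The fake_stream generator is a made-up stand-in for a real client.

```python
import time
from typing import Callable, Iterable, Iterator, NamedTuple, Optional


class StreamTiming(NamedTuple):
    ttfr_s: float                # request start to first chunk
    chunks: int                  # chunks received (approximates tokens)
    chunks_per_s: Optional[float]  # rate after the first chunk, if measurable


def time_stream(chunks: Iterable[str],
                clock: Callable[[], float] = time.perf_counter) -> StreamTiming:
    """Consume a streamed response and separate TTFR from generation rate."""
    start = clock()
    first = None
    last = start
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no chunks")
    gen_time = last - first
    rate = (count - 1) / gen_time if count > 1 and gen_time > 0 else None
    return StreamTiming(first - start, count, rate)


def fake_stream(ttfr_s: float, n: int, tok_per_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming client, driven by sleeps."""
    time.sleep(ttfr_s)
    for i in range(n):
        if i:
            time.sleep(1 / tok_per_s)
        yield "tok"


timing = time_stream(fake_stream(ttfr_s=0.03, n=5, tok_per_s=100))
print(timing)
```

Keeping TTFR and rate as separate fields matters: averaging them into a single "seconds per request" figure hides exactly the distinction this article is about.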

Cursor Pro — Claude Sonnet 4.6 (cloud API)

TTFR average: ~800ms
TTFR range: 550ms (low load, East Coast US) – 1,400ms (peak hours or EU)
Generation throughput: 120-180 tok/s (server-side; stream delivery varies)
Rate limits: 500 requests/day on Pro tier before slow-mode kicks in; heavy agent loops can burn this in under 2 hours

Cursor Pro — GPT-4o (cloud API)

TTFR average: ~600ms
TTFR range: 400ms – 1,100ms
Generation throughput: 100-160 tok/s
Rate limits: Similar request-count limits; separate pool from Claude models

Local Ollama — Qwen2.5-Coder 32B (Q4_K_M) on RTX 3090 24GB

TTFR (warm model): 30-50ms
Generation throughput: 25-28 tok/s
Rate limits: None
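Context window: 32k tokens by default; OLLAMA_CONTEXT_LENGTH=131072 raises the limit to 128k, but on 24GB the usable window is bounded by VRAM (see the large-context row in the use-case table). Expect TTFR to rise to 80-150ms as context grows.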

Local Ollama — Qwen2.5-Coder 7B (Q4_K_M) on RTX 5060 Ti 16GB

TTFR (warm model): 15-25ms
Generation throughput: 85-100 tok/s
Rate limits: None
Context window: 16k tokens reasonable; 32k possible with VRAM headroom

The 5060 Ti with the 7B model is faster than the 3090 with the 32B model on both TTFR and throughput. It is weaker on code quality: the 32B model scores roughly 92% on HumanEval vs 75% for the 7B. That quality gap is real and matters for complex tasks, but for autocomplete the 7B output is usually sufficient.

Use-case latency breakdown

Use case	Cloud API (Sonnet 4.6)	Local (32B / 3090)	Local (7B / 5060 Ti)	Winner
Autocomplete (< 50 tokens)	700-900ms perceived	50-80ms perceived	25-50ms perceived	Local
Inline edit (200-500 tokens output)	2.5-5s total	8-22s total	2.5-7s total	Cloud / 5060 Ti tie
Single-file refactor (~1k tokens)	8-18s	40-60s	12-18s	Cloud / 5060 Ti tie
Multi-file agent loop (~3k tokens)	25-60s	110-180s	35-50s	Cloud
10 rapid agent calls in sequence	70-150s total	10-25s total	5-15s total	Local
Large context (50k+ token codebase)	15-40s first response	Not feasible (OOM)	Not feasible (OOM)	Cloud only

The "10 rapid agent calls" row deserves explanation. When you run an agentic loop — generate code, test, fix error, test again — each loop iteration is a separate API call. Cloud API TTFR compounds: 10 calls at 800ms TTFR adds 8 seconds of pure waiting, before generation time. Local TTFR at 30ms adds 0.3 seconds. For tight iteration loops, local has a compounding TTFR advantage even when individual generations are slower.
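The compounding effect is easy to sketch. The overhead figures below come straight from the paragraph above; the 20-token reply length is a hypothetical value chosen to represent a short tool-style response, and the throughputs are taken from the measured ranges earlier in the article.

```python
def ttfr_overhead(calls: int, ttfr_s: float) -> float:
    """Seconds spent waiting for first tokens across a sequential loop."""
    return calls * ttfr_s


def loop_total(calls: int, ttfr_s: float,
               tokens_per_call: int, tok_per_s: float) -> float:
    """Total seconds for the loop, assuming calls run strictly one after another."""
    return calls * (ttfr_s + tokens_per_call / tok_per_s)


cloud_wait = ttfr_overhead(10, 0.800)  # 10 calls at 800ms TTFR
local_wait = ttfr_overhead(10, 0.030)  # 10 calls at 30ms TTFR
print(f"pure waiting: cloud {cloud_wait:.1f}s, local {local_wait:.1f}s")

REPLY_TOKENS = 20  # hypothetical short reply per call, not a measured value
cloud_loop = loop_total(10, 0.800, REPLY_TOKENS, 150)
local_loop = loop_total(10, 0.030, REPLY_TOKENS, 28)
print(f"loop total: cloud {cloud_loop:.1f}s, local {local_loop:.1f}s")
```

The local advantage shrinks as replies get longer, so the result depends heavily on how many tokens each iteration actually produces.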

Rate limit latency: the invisible cloud tax

Cursor Pro at $20/month includes 500 "fast requests" per day on the premium models (Claude Sonnet 4.6, GPT-4o). After that, requests route to slower server pools with higher queue times. In practice, a developer running heavy agent sessions can exhaust fast requests by mid-afternoon, at which point effective TTFR can climb to 2-5 seconds per call.
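To see how quickly that allowance can run out, here is a rough estimate that assumes a steady request rate. The 500-request budget is the figure quoted above; the request rates are hypothetical examples, not measurements.

```python
def hours_of_fast_requests(daily_budget: int, requests_per_hour: float) -> float:
    """Hours until the fast-request budget is used up at a steady request rate."""
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    return daily_budget / requests_per_hour


FAST_REQUESTS_PER_DAY = 500  # figure quoted in the text above

# Hypothetical usage levels: light editing, steady agent use, heavy agent loops.
for rate in (60, 125, 250):
    hours = hours_of_fast_requests(FAST_REQUESTS_PER_DAY, rate)
    print(f"{rate:>3} requests/hour -> fast requests last {hours:.1f} hours")
```

An agent loop issuing about four requests a minute sits near the 250 requests/hour case, which matches the "under 2 hours" burn rate mentioned in the measurements section.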

The $60/month Business tier raises the limit but doesn't remove it. The $200/month Ultra tier has the most generous allocation but sti
