cauqjbwkerl

Posted on Jun 30

Why your Claude and OpenAI API calls are slow (and how to fix it)

#ai #claude #openai #python

TL;DR

If you're calling the Claude or OpenAI API from Asia, Oceania, or South America, you're likely adding 400–800ms of pure network overhead before a single token arrives. The root cause is geography, not the AI model itself — and it's fixable with smarter routing.

The Problem Nobody Talks About

I spent two weeks debugging what I thought was a slow model. My streaming chat app felt sluggish — users were staring at a blank screen for nearly a second before anything appeared. After profiling every layer of my stack, I finally isolated the culprit: the raw time from my server in Tokyo to api.anthropic.com and back was eating 600–900ms before the API even started generating.

This isn't a Claude or OpenAI problem. It's physics and routing.

Why AI API Latency Is Geography-Dependent

Both Anthropic and OpenAI run their inference infrastructure primarily in US data centers (us-east-1, us-west-2 regions). When you're in Singapore, Seoul, or Sydney, every API call has to:

Cross transoceanic fiber: Tokyo to Virginia is roughly 11,000 km, so even at the speed of light in fiber (about 200,000 km/s) a round trip takes ~110ms, and real-world routing adds 30–50% on top of that.
Negotiate TLS from a distance: after the TCP handshake (1 round trip), a TLS 1.3 handshake requires 1 more round trip and TLS 1.2 requires 2. At 200ms RTT, that's 200–400ms for TLS alone before a byte of your prompt is sent.
Fight TCP congestion control: long-haul routes traverse multiple ISP handoffs. Slow start grows the congestion window once per round trip, so ramp-up takes longer as RTT grows, and any packet loss along the way causes retransmits and window stalls that inflate latency unpredictably.

The result: developers in the US typically see 40–80ms of network overhead before the first token on streaming calls. Developers in Asia routinely see 300–600ms, sometimes spiking past 1 second during peak hours.
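To make the arithmetic behind those round trips concrete, here is a minimal sketch of the setup cost of one cold HTTPS request. It assumes a simplified model (one round trip for TCP, one or two for TLS, one for the request and first response byte) and ignores DNS, server processing, and packet loss. The function name and the 20ms "nearby" RTT are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope setup cost for a single cold HTTPS request.
# Simplified model: TCP handshake = 1 round trip, TLS 1.3 = 1, TLS 1.2 = 2,
# plus 1 round trip for the request and the first byte of the response.

def cold_request_overhead_ms(rtt_ms, tls_version="1.3"):
    tls_round_trips = {"1.3": 1, "1.2": 2}[tls_version]
    total_round_trips = 1 + tls_round_trips + 1  # TCP + TLS + request/response
    return rtt_ms * total_round_trips


if __name__ == "__main__":
    # 20ms is an assumed same-region RTT; 200ms is the transoceanic figure used above.
    for rtt in (20, 200):
        tls13 = cold_request_overhead_ms(rtt, "1.3")
        tls12 = cold_request_overhead_ms(rtt, "1.2")
        print(f"RTT {rtt}ms -> TLS 1.3: {tls13}ms, TLS 1.2: {tls12}ms")
```

At 200ms RTT this model gives 600ms with TLS 1.3 and 800ms with TLS 1.2, which is in the same range as the overhead described in this post.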

Measure Your Baseline First

Before optimizing anything, get hard numbers. Here's a curl timing command I use to measure raw connection latency to both APIs:

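# Measure connection phases to Anthropic.
# Each value is cumulative from the start of the request, so subtract the
# previous line to get the duration of a single phase. No API key is sent;
# an authentication error in response is fine because only the timing matters.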
curl -o /dev/null -s -w "\
DNS lookup:      %{time_namelookup}s\n\
TCP connect:     %{time_connect}s\n\
TLS handshake:   %{time_appconnect}s\n\
First byte:      %{time_starttransfer}s\n\
Total:           %{time_total}s\n" \
https://api.anthropic.com/v1/models

# Same for OpenAI
curl -o /dev/null -s -w "\
DNS lookup:      %{time_namelookup}s\n\
TCP connect:     %{time_connect}s\n\
TLS handshake:   %{time_appconnect}s\n\
First byte:      %{time_starttransfer}s\n\
Total:           %{time_total}s\n" \
https://api.openai.com/v1/models

From Tokyo, my typical output looks like:

DNS lookup:      0.028s
TCP connect:     0.187s
TLS handshake:   0.412s
First byte:      0.843s

From a US-East server, the same command returns 0.041s to first byte. That's a 20x difference, and no model inference is involved in this request.
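Because curl reports each timer as a cumulative timestamp, it helps to convert the output into per-phase durations before deciding where the time goes. The helper below is a small sketch of that conversion; the function name and the phase labels are my own, and the input is the Tokyo output shown above.

```python
def phase_durations_ms(cumulative_s):
    """Convert curl's cumulative timers (seconds) into per-phase durations (ms)."""
    phases = [
        ("time_namelookup", "dns"),
        ("time_connect", "tcp_connect"),
        ("time_appconnect", "tls_handshake"),
        ("time_starttransfer", "wait_for_first_byte"),
    ]
    durations = {}
    previous_ms = 0
    for curl_key, label in phases:
        current_ms = round(cumulative_s[curl_key] * 1000)
        durations[label] = current_ms - previous_ms
        previous_ms = current_ms
    return durations


if __name__ == "__main__":
    tokyo = {
        "time_namelookup": 0.028,
        "time_connect": 0.187,
        "time_appconnect": 0.412,
        "time_starttransfer": 0.843,
    }
    print(phase_durations_ms(tokyo))
    # {'dns': 28, 'tcp_connect': 159, 'tls_handshake': 225, 'wait_for_first_byte': 431}
```

Read this way, the Tokyo run spends about 159ms on the TCP handshake, 225ms on TLS, and 431ms waiting for the first response byte.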

Measuring Streaming First-Token Latency in Python

The curl test measures connection overhead, but for streaming AI APIs, what users actually feel is time-to-first-token (TTFT). Here's a Python snippet that measures this precisely:

import time
import anthropic

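def measure_ttft(prompt: str) -> dict:
    # A fresh client per call means every measurement includes a cold connection.
    client = anthropic.Anthropic()

    start = time.perf_counter()
    first_token_time = None

    with client.messages.stream(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for text in stream.text_stream:
            # text_stream yields text chunks, not individual tokens, so only
            # the arrival time of the first chunk is recorded here.
            if first_token_time is None:
                first_token_time = time.perf_counter()
        # The final message carries the actual output token count.
        token_count = stream.get_final_message().usage.output_tokens

    end = time.perf_counter()

    if first_token_time is None:
        raise RuntimeError("stream ended without producing any text")

    ttft = (first_token_time - start) * 1000
    total = (end - start) * 1000
    generation_s = end - first_token_time

    return {
        "ttft_ms": round(ttft, 1),
        "total_ms": round(total, 1),
        "tokens": token_count,
        # Guard against a zero-length generation window.
        "throughput_tps": round(token_count / generation_s, 1) if generation_s > 0 else 0.0,
    }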

if __name__ == "__main__":
    result = measure_ttft("Explain TCP slow start in two sentences.")
    print(f"Time to first token: {result['ttft_ms']}ms")
    print(f"Total time:          {result['total_ms']}ms")
    print(f"Throughput:          {result['throughput_tps']} tokens/sec")

Run this 5–10 times and average the results. On a cold connection from Southeast Asia, I consistently measured 780–920ms TTFT. That's the number we want to crush.
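A plain average hides the spikes, so it is worth summarizing repeated runs with P50 and P95 as well. The sketch below uses a nearest-rank percentile, which is one of several common definitions. The helper names and the sample values are made up for illustration; in practice the list would come from repeated calls to measure_ttft.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_ttft(samples_ms):
    return {
        "runs": len(samples_ms),
        "mean_ms": round(statistics.mean(samples_ms), 1),
        "p50_ms": percentile(samples_ms, 50),
        "p95_ms": percentile(samples_ms, 95),
    }


if __name__ == "__main__":
    # Illustrative values only. With the snippet above you would collect:
    # samples = [measure_ttft(prompt)["ttft_ms"] for _ in range(10)]
    samples = [800, 790, 850, 910, 780, 820, 805, 795, 2000, 830]
    print(summarize_ttft(samples))
```

With these sample values a single 2,000ms spike pulls the mean well above the median, which is why the two percentiles are reported side by side.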

The Fix: Route Through Infrastructure Close to the API

The insight is simple: if the AI APIs live in the US, your traffic should enter the US network as close to those endpoints as possible — not traverse 15 BGP hops across the Pacific first.

There are a few approaches:

Option 1: Deploy your backend in the US. If you control your server, move it to us-east-1. This is the cleanest solution but not always feasible — your users might be in Asia, your data residency requirements might be regional, or you might be running on a local machine during development.

Option 2: Use a regional proxy or accelerator. Route your AI API traffic through an optimized path that has a PoP (point of presence) near the Anthropic/OpenAI data centers. The proxy handles the long-haul routing on an optimized backbone, and your server only needs to reach the nearest proxy node.

This is where I found TonBoVPN genuinely useful. It's designed specifically for routing AI API traffic — you set HTTPS_PROXY in your environment, and your Claude/OpenAI calls get routed through nodes with optimized paths to US API endpoints. The setup is literally one environment variable:

export HTTPS_PROXY=http://your-tonbovpn-endpoint:port

# Your existing Python code works unchanged
python your_app.py

Both the anthropic and openai Python SDKs respect standard proxy environment variables, so there's zero code change required.
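Since the whole setup hinges on one environment variable, a quick pre-flight check can confirm that the process actually sees it before you start measuring. This is a minimal sketch using only the standard library; the function name is hypothetical, the proxy URL in the example is a placeholder, and it assumes the variable holds a full URL including the scheme.

```python
import os
from urllib.parse import urlsplit


def effective_https_proxy(env=None):
    """Return (host, port) of the HTTPS proxy named in the environment, or None."""
    env = os.environ if env is None else env
    # Check the lowercase spelling first, then the uppercase one.
    raw = env.get("https_proxy") or env.get("HTTPS_PROXY")
    if not raw:
        return None
    parts = urlsplit(raw)
    return parts.hostname, parts.port


if __name__ == "__main__":
    proxy = effective_https_proxy()
    if proxy is None:
        print("No HTTPS proxy configured; API calls will go direct.")
    else:
        print(f"HTTPS traffic will be sent via {proxy[0]}:{proxy[1]}")
```

Running this in the same shell where you launch your app shows whether the export took effect, which is useful when comparing direct and proxied measurements.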

Real-World Numbers

After switching to proxied routing, here's what my TTFT measurements looked like across different Asian cities (averages over 20 runs each):

Location	Direct TTFT	Proxied TTFT	Improvement
Tokyo	820ms	195ms	4.2× faster
Seoul	760ms	170ms	4.5× faster
Singapore	690ms	155ms	4.5× faster
Sydney	950ms	220ms	4.3× faster

The improvement of roughly 4× (4.2–4.5× in the table above) is consistent across regions. More importantly, the variance dropped dramatically: direct calls would sometimes spike to 2,000ms during peak hours, while proxied calls stayed under 300ms at P95.

Why This Matters for Streaming UX

For non-streaming API calls, latency is just latency — your user waits, the response arrives. But for streaming, TTFT is the difference between a UI that feels alive and one that feels broken.

Human perception research puts the "feels instant" threshold at around 100ms and "noticeable delay" at 300ms. At 800ms TTFT, users genuinely think the app is loading or broken. At 180ms, the first token appears before they've consciously registered waiting.

If you're building any kind of chat interface, code assistant, or real-time AI feature for users outside the US, optimizing TTFT isn't a nice-to-have — it's the single highest-leverage UX improvement you can make.

Quick Checklist

[ ] Run the curl timing test against both API endpoints from your actual server location
[ ] Measure baseline TTFT with the Python snippet above
[ ] If TTFT > 300ms and the curl test shows high connect and TLS times, routing is likely your bottleneck (not the model)
[ ] Try proxied routing via TonBoVPN or a US-region proxy
[ ] Re-run measurements and compare P50 and P95 (variance matters as much as average)
[ ] If deploying to production, instrument TTFT as a metric in your observability stack

Conclusion

Slow AI API responses from outside the US are almost always a routing problem, not a model problem. The fix is straightforward: get your traffic onto an optimized path to US infrastructure as early as possible, whether that's moving your server, using a regional accelerator, or proxying through a service built for this use case. Measure first, optimize second — the curl and Python snippets above give you everything you need to quantify the problem and verify the fix.

DEV Community