DEV Community

Daniel Nwaneri

What Actually Happens When You Call an LLM API

Physical journey and regional latency tax

you've felt it.

you type a prompt, hit send, and the response starts streaming in under a second. smooth. instant. you feel like you're thinking out loud with a machine.

then the next day — same model, same prompt — you wait. three seconds. five. the cursor blinks. nothing. then it all comes at once.

you probably blamed your wifi.

it wasn't your wifi.

what actually happened in those extra seconds is a story that starts in a building you'll never visit, runs through a cable at the bottom of an ocean, and ends on a gpu that was busy doing someone else's thinking before it got to yours.

and if you're building in africa or anywhere that isn't virginia, ireland, or frankfurt — that story has a chapter in it specifically about you.


the journey of one api call

let's follow a single request from the moment you hit send.

your prompt leaves your device and travels as packets of data through your ISP, hits a submarine fibre cable, crosses an ocean, arrives at a data centre, gets routed to the right server, waits for a gpu to become available, gets processed, and the response travels back the same way.

that whole round trip happens in what feels like nothing.

except it isn't nothing. every step costs time. and some of those steps cost more depending on where you're sitting on the planet.


what is a data centre, actually

before we get to the interesting parts, ground this.

a data centre is a building — sometimes the size of several football pitches — filled with servers. those servers are computers without screens. stacked in metal racks. thousands of them. running twenty-four hours a day, seven days a week, never switching off.

every api call you make, every message you send on whatsapp, every google search, every youtube video — all of it is touching a server in a building like this somewhere.

the building needs three things to function: power, cooling, and connectivity. the power runs the servers. the cooling stops them melting — servers generate enormous heat at this density. the connectivity is the fibre cable that connects the building to the rest of the internet.

nigeria has 17 of these buildings. the united states has over 5,500.

that gap matters. we'll come back to it.


latency: the physics problem nobody warned you about

latency is the time it takes for data to travel from point A to point B and back.

it is bounded by physics. data moves through fibre optic cable at roughly two-thirds the speed of light. you cannot make it faster. you can only make the distance shorter.

lagos to london is approximately 5,000 kilometres in a straight line. at two-thirds the speed of light, the minimum possible round-trip time is around 50 milliseconds from the distance alone. real cables don't run straight, so add routing, congestion, and processing and you're looking at a round trip of 100 to 150ms before the server has done any work.
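
the arithmetic behind that 50ms floor can be checked in a few lines. this is a minimal sketch, assuming a straight fibre run and the two-thirds figure used above; the function name `min_rtt_ms` is made up for illustration.

```python
# rough lower bound on round-trip time from distance alone.
# assumes a straight cable and light travelling at ~2/3 c in fibre.
SPEED_OF_LIGHT_KM_S = 299_792.458   # speed of light in a vacuum
FIBRE_FRACTION = 2 / 3              # approximation used in the text


def min_rtt_ms(distance_km: float) -> float:
    """Minimum round-trip time in milliseconds for a given one-way distance."""
    speed_km_s = SPEED_OF_LIGHT_KM_S * FIBRE_FRACTION
    one_way_s = distance_km / speed_km_s
    return 2 * one_way_s * 1000


# lagos to london, using the ~5,000 km figure from the text
print(f"lagos -> london: {min_rtt_ms(5_000):.0f} ms")  # prints "lagos -> london: 50 ms"
```

no amount of code optimisation gets you under this number. only a shorter distance does.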

then the model has to think.

then the response travels back.

most developers building in nigeria are hitting llm servers in us-east-1 (virginia), eu-west-1 (ireland), or eu-central-1 (frankfurt). that's not a complaint; those are where the servers are. but it means every api call carries 100 to 200ms of latency just from geography, before inference even begins.

for a streaming chatbot, you feel this. that pause before the first token appears isn't the model being slow. it's the speed of light, applied to distance.
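
if you want to see that pause rather than guess at it, measure time to first token separately from total time. the sketch below works on any iterable of tokens. `fake_stream` is a hypothetical stand-in for a real streaming response, and its delays are invented numbers, not measurements.

```python
import time
from typing import Iterable, Iterator, List


def fake_stream(tokens: List[str], first_delay_s: float = 0.05,
                gap_s: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming API response."""
    time.sleep(first_delay_s)       # network + queueing + prompt processing
    for i, token in enumerate(tokens):
        if i:
            time.sleep(gap_s)       # gap between later tokens
        yield token


def measure_stream(stream: Iterable[str]) -> dict:
    """Consume a token stream, recording time to first token and total time."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return {"ttft_s": ttft, "total_s": total, "tokens": count}


stats = measure_stream(fake_stream(["it", " wasn't", " your", " wifi"]))
print(stats)
```

swap `fake_stream` for your provider's streaming iterator and the same function tells you how much of the wait happens before the first token.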


inference: what the gpu is actually doing

when your prompt arrives at the server, it doesn't get processed the way you might imagine — like a search engine matching keywords.

the model runs your prompt through billions of mathematical operations, layer by layer, to predict what the most likely next token should be. then the next. then the next. each token generated one at a time, sequentially, until the response is complete.

this is inference.
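
the shape of that loop is easier to see in code. this is a toy sketch, not how any real model is implemented: the lookup table `NEXT_TOKEN` stands in for billions of operations, and every name here is made up. what it does show is the sequential part: each new token needs another pass, and each output is fed back in as input.

```python
from typing import Dict, List

# toy "model": maps the last token to the most likely next token.
NEXT_TOKEN: Dict[str, str] = {
    "<start>": "data",
    "data": "moves",
    "moves": "through",
    "through": "fibre",
    "fibre": "<end>",
}


def predict_next(context: List[str]) -> str:
    """Stand-in for a full forward pass over the context."""
    return NEXT_TOKEN.get(context[-1], "<end>")


def generate(prompt: List[str], max_tokens: int = 10) -> List[str]:
    """Generate tokens one at a time until <end> or the token limit."""
    context = list(prompt)
    output = []
    for _ in range(max_tokens):
        token = predict_next(context)   # one pass per generated token
        if token == "<end>":
            break
        output.append(token)
        context.append(token)           # the output becomes part of the input
    return output


print(generate(["<start>"]))  # ['data', 'moves', 'through', 'fibre']
```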

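a token is roughly three-quarters of a word, though the exact split depends on the model's tokenizer. a short common word like "hello" is usually one token. a longer word like "infrastructure" may be split into two or more. a chat response of a few paragraphs runs to several hundred tokens.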

why does this matter? because every token costs compute. a longer prompt costs more compute on the input side. a longer response costs more on the output side. and all of that compute is happening on a gpu inside a data centre consuming real electricity.
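
a back-of-envelope estimate makes this concrete. the sketch below uses the three-quarters-of-a-word rule of thumb, so it is only an approximation; a real tokenizer will give different counts. the prices in the example are hypothetical placeholders, not any provider's actual rates.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb; real tokenizers vary


def estimate_tokens(text: str) -> int:
    """Approximate token count from a whitespace word count."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


def estimate_cost_usd(prompt: str, expected_output_tokens: int,
                      input_price_per_million: float,
                      output_price_per_million: float) -> float:
    """Approximate cost of one call, given per-million-token prices."""
    input_tokens = estimate_tokens(prompt)
    return (input_tokens * input_price_per_million
            + expected_output_tokens * output_price_per_million) / 1_000_000


prompt = "summarise this support ticket in two sentences"
print(estimate_tokens(prompt), "input tokens (approx.)")
# hypothetical prices: $1 per million input tokens, $2 per million output tokens
print(f"${estimate_cost_usd(prompt, 60, 1.0, 2.0):.6f} per call (approx.)")
```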


the gpu: why this specific hardware

your laptop has a cpu (central processing unit). it's designed for general tasks: running your browser, compiling your code, handling your operating system. it has a handful of powerful cores, each very fast at one thing at a time.

a gpu — graphics processing unit — was originally designed to render video games. thousands of smaller cores that can do many calculations simultaneously. it turns out this parallel architecture is exactly what llm inference needs: running the same mathematical operations across billions of parameters at once.

a single high-end gpu used for llm inference — an nvidia h100 — costs around $30,000. a data centre running a frontier model has thousands of them.

when you call an llm api, your request is routed to one of these gpus. if that gpu is busy processing another user's request, yours waits. that wait is real. it shows up as latency on your end.

this is a large part of what rate limits are actually enforcing: the physical capacity of the hardware.
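
here is that wait as a tiny simulation. it is a deliberately simplified sketch: one gpu, strict first-come-first-served, no batching. real serving systems batch requests together, so treat the numbers as illustrative only. `fifo_waits` is a made-up name.

```python
from typing import List, Sequence


def fifo_waits(arrivals_s: Sequence[float],
               service_s: Sequence[float]) -> List[float]:
    """How long each request waits before a single gpu starts working on it.

    arrivals_s: arrival time of each request, in ascending order
    service_s:  how long each request occupies the gpu
    """
    waits = []
    gpu_free_at = 0.0
    for arrival, service in zip(arrivals_s, service_s):
        start = max(arrival, gpu_free_at)   # wait if the gpu is still busy
        waits.append(start - arrival)
        gpu_free_at = start + service
    return waits


# three requests arrive one second apart, each needing two seconds of gpu time
print(fifo_waits([0, 1, 2], [2, 2, 2]))
```

the first request starts immediately, the second waits one second, and the third waits two. none of them did anything wrong. they just arrived behind someone else.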


cold starts: why the first request is slower

you've noticed that sometimes the very first call in a while takes noticeably longer.

this isn't imaginary. it's a cold start.

models are large. a frontier model can be hundreds of gigabytes of weights — the numbers that encode what the model knows. those weights need to be loaded into gpu memory before inference can happen. if no request has come in for a while, the system may have partially unloaded the model to free up memory for other things.

the first request has to wait for the model to load back in. subsequent requests hit the already-warm model and feel faster.

serverless llm deployments are especially prone to this. you pay less when traffic is low. but your users feel the first request after a quiet period.
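
a small simulation shows the pattern. this sketch uses a simulated clock rather than real sleeps, and every number in it (load time, inference time, idle timeout) is an invented assumption. real platforms unload on their own schedules.

```python
class ModelServer:
    """Toy server that unloads its model after an idle period."""

    def __init__(self, load_s: float, infer_s: float, idle_timeout_s: float):
        self.load_s = load_s                  # time to load weights into gpu memory
        self.infer_s = infer_s                # time to run inference once warm
        self.idle_timeout_s = idle_timeout_s  # idle time before the model is unloaded
        self.last_used_at = None              # simulated time the last request finished

    def handle(self, now_s: float) -> float:
        """Return the latency of a request arriving at simulated time now_s."""
        cold = (self.last_used_at is None
                or now_s - self.last_used_at > self.idle_timeout_s)
        latency = self.infer_s + (self.load_s if cold else 0.0)
        self.last_used_at = now_s + latency
        return latency


server = ModelServer(load_s=8.0, infer_s=1.0, idle_timeout_s=60.0)
for arrival in (0.0, 10.0, 200.0):
    print(f"t={arrival:>5}: {server.handle(arrival)} s")
```

the request at t=0 pays the load time, the one at t=10 hits a warm model, and the one at t=200 arrives after the idle timeout and pays the load time again.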


why nigeria specifically

nigeria's 17 data centres — 14 of them in lagos — run almost entirely on diesel generators. the national grid provides on average four hours of power per day. every data centre makes up the difference with generators burning diesel around the clock.

this is expensive. it's also why local cloud infrastructure hasn't scaled the way it has in markets with stable power.

the consequence for you as a developer: every llm api call you make routes to a server that is not in nigeria. not in west africa. often not even on the continent. you are paying the latency cost of that distance on every single request, for every single user you have.

this isn't a software problem. it's a geography and infrastructure problem. and it has a direct effect on how your ai-powered products feel to the people using them.


what this means when you're building

three practical things:

stream the response. don't wait for the full response before showing anything. streaming tokens as they arrive makes the experience feel faster even when it isn't. the perceived latency drops dramatically because the user sees something happening.

cache aggressively. if you're calling the same prompt or near-identical prompts repeatedly, cache the response. inference is expensive. latency is expensive. caching eliminates both for repeated queries.

pick the right model for the job. a 70-billion-parameter model is generally slower and more expensive to run than a 7-billion-parameter one. for many tasks (classification, extraction, short-form generation) the smaller model is sufficient and returns results significantly faster. frontier models are not always the right tool.
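
the first two of these combine naturally. the sketch below streams tokens to the caller as they arrive and stores the full response once it completes, so a repeated prompt never leaves the building. it is a minimal in-memory version under stated assumptions: `call_model` is a hypothetical function you supply, the cache never expires, and "near-identical" here only means identical after lowercasing and collapsing whitespace, which is not safe for prompts where case matters.

```python
import hashlib
from typing import Callable, Dict, Iterable, Iterator, List

_cache: Dict[str, List[str]] = {}


def cache_key(model: str, prompt: str) -> str:
    """Key on the model name plus a whitespace- and case-normalised prompt."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()


def cached_stream(model: str, prompt: str,
                  call_model: Callable[[str, str], Iterable[str]]) -> Iterator[str]:
    """Yield tokens as they arrive; replay from the cache on repeat prompts."""
    key = cache_key(model, prompt)
    if key in _cache:
        yield from _cache[key]      # cache hit: no network, no inference
        return
    tokens = []
    for token in call_model(model, prompt):
        tokens.append(token)
        yield token                 # show each token as soon as it arrives
    _cache[key] = tokens            # store only once the response is complete


def fake_model(model: str, prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a real streaming api call."""
    yield from ["it", " wasn't", " your", " wifi"]


print("".join(cached_stream("small-model", "Why was it slow?", fake_model)))
print("".join(cached_stream("small-model", "why was it  slow?", fake_model)))  # served from cache
```

because the model name is part of the key, switching to a smaller model for a task never serves you a response cached from a larger one.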


the bigger picture

data centres exist because computation has to live somewhere physical. it takes power, water, land, and connectivity to run the infrastructure that makes ai feel effortless.

africa accounts for less than 1% of global data centre capacity while housing 18% of the world's population. the gap between what the continent generates as digital demand and what it owns as infrastructure is where the latency comes from, where the dependency comes from, where the value extraction happens.

knowing it's a physics problem, not a code problem, changes where you look. knowing that equinix, aws, and microsoft own most of the continent's usable capacity changes what you think about it.

it's probably not your code. it's a building somewhere running on diesel.


AI helped me research, structure, and edit this piece. The arguments, the examples, and the opinions are mine. So is whatever's wrong with them.

Top comments (12)

Daniel Nwaneri

pascal, the ground-level version of the 1% story is here.
The diesel generators, the latency tax, the 17 buildings versus 5,500. This is what the infrastructure gap looks like from inside it. not 2029. right now, on every API call.
the piece we talked about writing. this is the non-fiction half of it.

Sylwia Laskowska

@pascal_cescato_692b7a8a20 is a trendsetter this week 😀

Daniel Nwaneri

He started something this week. glad you're here too, Sylwia....

Pascal CESCATO

You brought the ground truth. Sylwia brought the room. I just left the door open.

Pascal CESCATO • Edited

Trendsetter is generous. Early symptom is more accurate. 😄

Pascal CESCATO

"It's probably not your code. It's a building somewhere running on diesel."

That's the sentence. Everything before it is the setup, and it lands because you've actually earned it.
17 versus 5,500. That number does what the whole geopolitics section of 1% couldn't — it makes the gap physical, countable, impossible to abstract away.
This is the piece I couldn't write. You wrote it from inside. That's the difference.

Daniel Nwaneri

Pascal, "from inside" is the only part of the infrastructure story that doesn't get written. The latency numbers exist in documentation. The diesel generators are in McKinsey reports. what's missing is what it feels like to build a product knowing every user is paying a tax you can't eliminate with better code.

That's the piece. the constraint that produces different instincts.

Pascal CESCATO

A tax you can't eliminate with better code.

That's the line. Everything else in the infrastructure conversation is documentation. That sentence is the piece.
Write the next one from there.

Daniel Nwaneri

Writing it...

Adam - The Developer

Great breakdown of the physical journey behind a single token.

The reminder to stream responses and cache aggressively is so critical, especially when building for regions bearing the brunt of that 'regional latency tax'.

great piece!

Daniel Nwaneri

Adam, building for those regions is exactly who this was written for....

UnitBuilds

🤑 Runs on Diesel... Ouch! Especially now...

Yeah, people don't really consider the physical bottlenecks that exist, they just think the model is slow. Can't do much about distance, but you can change the protocol you use for sending the data (e.g. V.E.L.O.C.I.T.Y. was built for handling bank transfers at microseconds), or what kind of data you send (NDA vs JSON means no serialization and deserialization), inference needs to run (Google has custom accelerators that run way faster than H100s at least), data congestion (queueing, or dropping) can be addressed with load-balancers which also add latency (albeit fractions), then the response comes back and your editor needs to display it (if your IDE uses Electron, then it's SLOW). So many things that add up to why it takes a few seconds to get the first token and why the industry is at a standstill again. People think AI surpassed hardware, true in some ways, but the real bottleneck is the aging infrastructure code we've been using. Take JSON, it takes over a millisecond just to serialize and deserialize it. Now multiply that by a billion... That's the impact of JSON every hour for cloud models. That's compute that costs money and time that could be used inferring instead of packing and retrying if the output wasn't structured perfectly (who still remembers "Model produced a malformed edit"?). LLMs like Gemini 3.5 flash already run at 400+ tps, that's 2.5ms per token, a token is on average 0.8 of a word, which works out to 320 words per second... That's really damn fast. It's just the infrastructure that's still slow, the GPUs are fast and the models are fast.