Daniel Nwaneri

Posted on Jun 29

What Actually Happens When You Call an LLM API

#ai #webdev #beginners #discuss

Physical journey and regional latency tax

you've felt it.

you type a prompt, hit send, and the response starts streaming in under a second. smooth. instant. you feel like you're thinking out loud with a machine.

then the next day — same model, same prompt — you wait. three seconds. five. the cursor blinks. nothing. then it all comes at once.

you probably blamed your wifi.

it wasn't your wifi.

what actually happened in those extra seconds is a story that starts in a building you'll never visit, runs through a cable at the bottom of an ocean, and ends on a gpu that was busy doing someone else's thinking before it got to yours.

and if you're building in africa or anywhere that isn't virginia, ireland, or frankfurt — that story has a chapter in it specifically about you.

the journey of one api call

let's follow a single request from the moment you hit send.

your prompt leaves your device and travels as packets of data through your ISP, hits a submarine fibre cable, crosses an ocean, arrives at a data centre, gets routed to the right server, waits for a gpu to become available, gets processed, and the response travels back the same way.

that whole round trip happens in what feels like nothing.

except it isn't nothing. every step costs time. and some of those steps cost more depending on where you're sitting on the planet.

what is a data centre, actually

before we get to the interesting parts, let's ground this.

a data centre is a building — sometimes the size of several football pitches — filled with servers. those servers are computers without screens. stacked in metal racks. thousands of them. running twenty-four hours a day, seven days a week, never switching off.

every api call you make, every message you send on whatsapp, every google search, every youtube video — all of it is touching a server in a building like this somewhere.

the building needs three things to function: power, cooling, and connectivity. the power runs the servers. the cooling stops them melting — servers generate enormous heat at this density. the connectivity is the fibre cable that connects the building to the rest of the internet.

nigeria has 17 of these buildings. the united states has over 5,500.

that gap matters. we'll come back to it.

latency: the physics problem nobody warned you about

latency is the time it takes for data to travel from point A to point B and back.

it is bounded by physics. data moves through fibre optic cable at roughly two-thirds the speed of light. you cannot make it faster. you can only make the distance shorter.

lagos to london is approximately 5,000 kilometres. at two-thirds the speed of light (about 200,000 km per second) that's 25 milliseconds each way, so the minimum possible round-trip time is around 50 milliseconds from distance alone. add routing, congestion, processing and you're looking at 100 to 150ms of round trip before the server has done any work.
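a rough sketch of that arithmetic, in python. the 5,000 km figure and the two-thirds factor come from the paragraph above; the function name is made up for illustration. real cable routes are longer than the straight-line distance, so treat the result as a floor, not a prediction.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in a vacuum, km per second
FIBRE_FACTOR = 2 / 3               # rough slowdown inside fibre, as stated above


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over fibre, ignoring routing and queues."""
    speed_in_fibre = SPEED_OF_LIGHT_KM_S * FIBRE_FACTOR  # km per second
    one_way_seconds = distance_km / speed_in_fibre
    return 2 * one_way_seconds * 1000  # there and back, in milliseconds


if __name__ == "__main__":
    # lagos to london, using the approximate 5,000 km figure from the text
    print(f"{min_round_trip_ms(5_000):.0f} ms")  # prints "50 ms"
```

anything you measure above this number is routing, congestion, or the server itself. anything below it is not possible.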

then the model has to think.

then the response travels back.

most developers building in nigeria are hitting llm servers in us-east-1 (virginia), eu-west-1 (ireland), or eu-central-1 (frankfurt). that's not a complaint; that's where the servers are. but it means every api call carries 100 to 200ms of latency just from geography, before inference even begins.

for a streaming chatbot, you feel this. that pause before the first token appears isn't only the model being slow. part of it is the speed of light, applied to distance.
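one way to see that pause on its own is to time the first chunk of a stream separately from the whole response. this is a minimal sketch, not any provider's sdk: `time_stream` and `fake_stream` are hypothetical names, and the delays inside `fake_stream` are invented stand-ins for network, queue, and generation time.

```python
import time
from typing import Callable, Iterable, Tuple


def time_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, str]:
    """Consume a stream and return (seconds to first chunk, total seconds, text)."""
    start = clock()
    first_chunk_at = None
    parts = []
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = clock()  # the moment the user first sees something
        parts.append(chunk)
    end = clock()
    if first_chunk_at is None:  # empty stream: nothing ever arrived
        first_chunk_at = end
    return first_chunk_at - start, end - start, "".join(parts)


def fake_stream():
    """Stand-in for a streaming api response; the delays are invented."""
    time.sleep(0.05)  # pretend network + queue + prompt processing
    for word in ["it ", "wasn't ", "your ", "wifi."]:
        yield word
        time.sleep(0.01)  # pretend per-token generation time


if __name__ == "__main__":
    ttft, total, text = time_stream(fake_stream())
    print(f"first chunk after {ttft * 1000:.0f} ms, done after {total * 1000:.0f} ms")
    print(text)
```

swap `fake_stream()` for whatever iterator your client library gives you. if the first number is large and the gap between the two is small, the wait is happening before generation starts.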

inference: what the gpu is actually doing

when your prompt arrives at the server, it doesn't get processed the way you might imagine — like a search engine matching keywords.

the model runs your prompt through billions of mathematical operations, layer by layer, to predict what the most likely next token should be. then the next. then the next. each token generated one at a time, sequentially, until the response is complete.

this is inference.
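the one-at-a-time loop is easier to see in a toy. this is not how a real model picks tokens: the lookup table below is invented, and it only looks at the previous token, where a real llm looks at the whole context so far. the shape of the loop is the part that carries over.

```python
from typing import List

# toy "model": for each token, the single most likely next token.
# a real llm computes this with billions of operations; this table is invented.
NEXT_TOKEN = {
    "<start>": "the",
    "the": "cable",
    "cable": "crosses",
    "crosses": "an",
    "an": "ocean",
    "ocean": "<end>",
}


def generate(max_tokens: int = 10) -> List[str]:
    """Generate tokens one at a time; each step needs the previous step's output."""
    tokens = []
    current = "<start>"
    for _ in range(max_tokens):
        current = NEXT_TOKEN.get(current, "<end>")  # one "forward pass" per token
        if current == "<end>":
            break
        tokens.append(current)
    return tokens


if __name__ == "__main__":
    print(" ".join(generate()))  # prints "the cable crosses an ocean"
```

token five cannot be computed until token four exists. that dependency is why a longer response takes longer, no matter how many gpus are in the building.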

a token is roughly three-quarters of a word. "hello" is one token. "infrastructure" is two. the exact split depends on the tokenizer the model uses. the response you're reading right now would be several hundred tokens.

why does this matter? because every token costs compute. a longer prompt costs more compute on the input side. a longer response costs more on the output side. and all of that compute is happening on a gpu inside a data centre consuming real electricity.
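a back-of-envelope sketch for budgeting that compute. it uses the three-quarters-of-a-word rule of thumb from above rather than a real tokenizer, so expect it to be off by a fair margin. the function names are hypothetical and the prices are parameters you fill in from your own provider's pricing page.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb from the text; real tokenizers vary


def estimate_tokens(text: str) -> int:
    """Rough token count from a word count. not a real tokenizer."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN)


def estimate_cost(
    prompt: str,
    expected_output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Estimated cost of one call, given per-1,000-token prices you supply."""
    input_tokens = estimate_tokens(prompt)
    return (
        input_tokens * input_price_per_1k
        + expected_output_tokens * output_price_per_1k
    ) / 1000


if __name__ == "__main__":
    prompt = "summarise this support ticket in one sentence"
    print(estimate_tokens(prompt), "tokens, roughly")
```

for anything you bill customers on, count tokens with the tokenizer your provider documents instead of this estimate.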

the gpu: why this specific hardware

your laptop has a cpu — central processing unit. it's designed for general tasks: running your browser, compiling your code, handling your operating system. very fast at one thing at a time.

a gpu — graphics processing unit — was originally designed to render video games. thousands of smaller cores that can do many calculations simultaneously. it turns out this parallel architecture is exactly what llm inference needs: running the same mathematical operations across billions of parameters at once.

a single high-end gpu used for llm inference — an nvidia h100 — costs around $30,000. a data centre running a frontier model has thousands of them.

when you call an llm api, your request is routed to one of these gpus. if that gpu is busy processing another user's request, yours waits. that wait is real. it shows up as latency on your end.

this is a large part of what rate limits are actually enforcing: the physical capacity of the hardware.
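a simplified sketch of that wait: one gpu, requests served strictly in arrival order. real serving systems batch several requests together and spread them across many gpus, so this is an assumption-heavy toy, but it shows how the same request can wait zero seconds or several depending on who arrived just before it. `queue_waits` is a hypothetical helper, not part of any library.

```python
from typing import List, Tuple


def queue_waits(requests: List[Tuple[float, float]]) -> List[float]:
    """Wait time for each request on one gpu serving strictly in arrival order.

    each request is (arrival_time, service_time) in seconds, sorted by arrival.
    """
    waits = []
    gpu_free_at = 0.0
    for arrival, service in requests:
        start = max(arrival, gpu_free_at)  # wait if the gpu is still busy
        waits.append(start - arrival)
        gpu_free_at = start + service
    return waits


if __name__ == "__main__":
    # three requests arriving half a second apart, each needing 2s of gpu time
    print(queue_waits([(0.0, 2.0), (0.5, 2.0), (1.0, 2.0)]))  # prints [0.0, 1.5, 3.0]
```

the third request did nothing wrong. it just arrived behind two others.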

cold starts: why the first request is slower

you've noticed that sometimes the very first call in a while takes noticeably longer.

this isn't imaginary. it's a cold start.

models are large. a frontier model can be hundreds of gigabytes of weights — the numbers that encode what the model knows. those weights need to be loaded into gpu memory before inference can happen. if no request has come in for a while, the system may have partially unloaded the model to free up memory for other things.

the first request has to wait for the model to load back in. subsequent requests hit the already-warm model and feel faster.

serverless llm deployments are especially prone to this. you pay less when traffic is low. but your users feel the first request after a quiet period.
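to put rough numbers on "hundreds of gigabytes", here is a sketch of the arithmetic. it assumes 2 bytes per parameter (16-bit weights) and a read throughput you supply yourself; the 70 billion parameter size and the 5 GB/s figure in the example are illustrative assumptions, not measurements of any real deployment.

```python
def weights_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate size of model weights in gigabytes (1 GB = 1e9 bytes)."""
    # billions of params * bytes per param = billions of bytes = GB
    return params_billions * bytes_per_param


def load_seconds(size_gb: float, read_gb_per_s: float) -> float:
    """Time to read the weights at a given sustained throughput."""
    return size_gb / read_gb_per_s


if __name__ == "__main__":
    size = weights_gb(70)  # an illustrative 70 billion parameter model
    print(f"{size:.0f} GB of weights")                          # prints "140 GB of weights"
    print(f"{load_seconds(size, 5.0):.0f} s to load at 5 GB/s")  # prints "28 s to load at 5 GB/s"
```

this only counts reading the weights. a real cold start can also include starting a container and allocating gpu memory, which this sketch ignores.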

why nigeria specifically

nigeria's 17 data centres — 14 of them in lagos — run almost entirely on diesel generators. the national grid provides on average four hours of power per day. every data centre makes up the difference with generators burning diesel around the clock.

this is expensive. it's also why local cloud infrastructure hasn't scaled the way it has in markets with stable power.

the consequence for you as a developer: every llm api call you make routes to a server that is not in nigeria. not in west africa. often not even on the continent. you are paying the latency cost of that distance on every single request, for every single user you have.

this isn't a software problem. it's a geography and infrastructure problem. and it has a direct effect on how your ai-powered products feel to the people using them.

what this means when you're building

three practical things:

stream the response. don't wait for the full response before showing anything. streaming tokens as they arrive makes the experience feel faster even when it isn't. the perceived latency drops dramatically because the user sees something happening.

cache aggressively. if you're calling the same prompt or near-identical prompts repeatedly, cache the response. inference is expensive. latency is expensive. caching eliminates both for repeated queries.

pick the right model for the job. a 70 billion parameter model is slower and more expensive than a 7 billion parameter model. for many tasks — classification, extraction, short-form generation — the smaller model is sufficient and returns results significantly faster. frontier models are not always the right tool.
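the caching advice is the easiest of the three to sketch. this is a minimal in-memory version under stated assumptions: `PromptCache` is a hypothetical class, `call_model` stands in for whatever function actually hits your provider, and "near-identical" here only means "same after collapsing whitespace and case". it has no expiry and no size limit, so it is a starting point, not something to ship as-is.

```python
import hashlib
from typing import Callable, Dict


class PromptCache:
    """In-memory cache keyed on model name plus a normalised prompt."""

    def __init__(self, call_model: Callable[[str, str], str]):
        self._call_model = call_model  # your real api call goes here
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # collapse whitespace and case so trivially different prompts share a key
        normalised = " ".join(prompt.lower().split())
        return hashlib.sha256(f"{model}\n{normalised}".encode("utf-8")).hexdigest()

    def complete(self, model: str, prompt: str) -> str:
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]  # no network, no gpu, no wait
        self.misses += 1
        response = self._call_model(model, prompt)
        self._store[key] = response
        return response


if __name__ == "__main__":
    cache = PromptCache(lambda model, prompt: f"[{model}] answer")
    cache.complete("small-model", "What is latency?")
    cache.complete("small-model", "  what is   LATENCY? ")
    print(cache.hits, cache.misses)  # prints "1 1"
```

lowercasing the prompt is a deliberate shortcut. if case matters in your prompts (code, names, passwords in test data), drop that step and accept fewer hits.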

the bigger picture

data centres exist because computation has to live somewhere physical. it takes power, water, land, and connectivity to run the infrastructure that makes ai feel effortless.

africa accounts for less than 1% of global data centre capacity while housing 18% of the world's population. the gap between what the continent generates as digital demand and what it owns as infrastructure is where the latency comes from, where the dependency comes from, where the value extraction happens.

knowing it's a physics problem, not a code problem, changes where you look. knowing that equinix, aws, and microsoft own most of the continent's usable capacity changes what you think about it.

it's probably not your code. it's a building somewhere running on diesel.

AI helped me research, structure, and edit this piece. The arguments, the examples, and the opinions are mine. So is whatever's wrong with them.

Top comments (54)

Daniel Nwaneri • Jun 29

pascal, the ground-level version of the 1% story is here.
The diesel generators, the latency tax, the 17 buildings versus 5,500. This is what the infrastructure gap looks like from inside it. not 2029. right now, on every API call.
the piece we talked about writing: this is the non-fiction half of it.

Sylwia Laskowska • Jun 29

@pascal_cescato_692b7a8a20 is a trendsetter this week 😀

Daniel Nwaneri • Jun 29

He started something this week. glad you're here too, Sylwia....

Pascal CESCATO • Jun 29

You brought the ground truth. Sylwia brought the room. I just left the door open.

Pascal CESCATO • Jun 29 • Edited

Trendsetter is generous. Early symptom is more accurate. 😄

Pascal CESCATO • Jun 29

"It's probably not your code. It's a building somewhere running on diesel."

That's the sentence. Everything before it is the setup, and it lands because you've actually earned it.
17 versus 5,500. That number does what the whole geopolitics section of 1% couldn't — it makes the gap physical, countable, impossible to abstract away.
This is the piece I couldn't write. You wrote it from inside. That's the difference.

Daniel Nwaneri • Jun 29

Pascal, "from inside" is the only part of the infrastructure story that doesn't get written. The latency numbers exist in documentation. The diesel generators are in McKinsey reports. what's missing is what it feels like to build a product knowing every user is paying a tax you can't eliminate with better code.

That's the piece. the constraint that produces different instincts.

Pascal CESCATO • Jun 29

A tax you can't eliminate with better code.

That's the line. Everything else in the infrastructure conversation is documentation. That sentence is the piece.
Write the next one from there.

Daniel Nwaneri • Jun 29

Writing it...

UnitBuilds • Jun 29

🤑 Runs on Diesel... Ouch! Especially now...

Yeah, people don't really consider the physical bottlenecks that exist, they just think the model is slow. Can't do much about distance, but you can change the protocol you use for sending the data (e.g. V.E.L.O.C.I.T.Y. was built for handling bank transfers at microseconds), or what kind of data you send (NDA vs JSON means no serialization and deserialization), inference needs to run (google has custom accelerators that run way faster than H100s at least), data congestion (queueing, or dropping) can be addressed with load-balancers which also add latency (albeit fractions), then response comes back and your editor needs to display it (if your ide uses Electron, then it's SLOW). So many things that add up to why it takes a few seconds to get the first token and why the industry is at the standstill again. People think AI surpassed hardware, true in some ways, but the real bottleneck is the aging infrastructure code we've been using. Take JSON, it takes over a millisecond just to serialize and deserialize it. Now multiply that by a billion... That's the impact of JSON every hour for cloud models. That's compute that costs money and time that could be used inferring instead of packing and retrying if the output wasn't structured perfectly (who still remembers "Model produced a malformed edit"). LLMs like Gemini 3.5 flash already run at 400+ tps, that's 2.5ms per token, a token is on average 0.8 of a word, which works out to 320 words per second... That's really damn fast. it's just the infrastructure that's still slow, the GPUs are fast and the models are fast.

Daniel Nwaneri • Jun 29

UnitBuilds, the JSON number is the one that sits with me. 1ms per serialize/deserialize sounds trivial until you map it across every layer: client to proxy, proxy to inference server, response back. by the time you count all the hops a Nigerian developer's request makes, you're stacking serialization costs on top of geography costs on top of queue costs. the latency isn't one problem. it's five problems wearing one coat.

The NDA format angle is sharp. if you're eliminating serialization at the transport layer, you're attacking the problem at the right level. most people optimize the model. you're optimizing the pipe.

curious whether V.E.L.O.C.I.T.Y. handles the case where the receiving end is still JSON-native. does it translate at the edge or does it require both sides to speak NDA?

UnitBuilds • Jun 29

It can, yes. The POC for that would be NMCP (Rust-based MCP built on the NDA format instead of JSON-RPC, and self-hosted instead of Node.JS). You can use NDA, or you can use JSON, so it's backwards compatible when necessary (e.g. public APIs for cloud models). So you don't get the full benefit of using an AI-native language, zero deserialization, E2E AES 256 encryption and a merkle root immutable history, but you do still get some performance gain vs traditional.

UnitBuilds • Jun 29

NDA is more than just a packing format, it's a means of representing data deterministically (think excel formulas). By using the triplet structure, it's easier for a model to understand cause and effect and everything in between. Essentially allowing it to 'infer' the meaning, without running inference. So it also improves your signal to noise ratio while at it, because the old standard is that MCP tools have 'a title + description', plain text packed in json, with parameter fields and no clue what they do, unless you bloat the context with large descriptions. NDA is a single A -> B -> C and the model just gets it.

While V.E.L.O.C.I.T.Y. had its roots in bank transfers and wall-street, NDA was built to be the USB-C of files, it can contain code, a spreadsheet, a form, a webpage, an image, video, audio, even a zip. But it represents data differently. With the triplet structure, a spreadsheet line isn't just Cell A, B, C, it's Qty * Price = Cost. As a document standard, the idea was for it to use this structure and when you print it, print a tiny QR code on the doc, which when scanned reconstructs the document precisely, so for financial documents you can't just change the PDF's owner password check and edit a line, it leaves a fingerprint on the NDA doc's merkle root if edited properly, if you dumb-edit it (just replace the text in paint), the QR code will invalidate the text on the page. Also means that when you print a doc, you don't have to run OCR to scan it back in. But that's beside the point. As a data-packet it's encrypted by the hardware AES 256, effectively making it free encryption even at insane data rates due to the architecture and it natively handles chunking, so you simply fill a UDP packet to the brim, send it, prep and send the next one, the merkle root ensures that any packets lost are recalled and that the final structure is 100% intact and unaltered. Because it's hardware signed, even if you snoop a packet, you'd need the destination pc's cpu just to unlock the packet and it would have to be in sequence, else it's simply garbage data. As a data transfer system, it's computationally impossible to hack, because it runs at bare-metal speeds, if tech advances to THz cpus, it scales right up to the limit with it.

Daniel Nwaneri • Jun 29

The merkle root detail is the one I wasn't expecting. backwards compatible transport is interesting on its own but immutable history baked into the protocol layer is a different claim. that's not just performance, that's auditability by default.

been thinking about agent governance from the application layer. signoff-workers tracks what decisions were made, under what conditions, with what authorization. you're solving the same problem one layer down. if the transport itself carries a tamper-evident history, the governance layer doesn't have to reconstruct it from logs.
does NMCP expose that merkle root in a way an orchestration layer can query, or is it internal to the transport?

UnitBuilds • Jun 29

It's open, it's how the system validates state before executing, that way non-runnable code is detected at write-time instead of compile or run time. the NDA standard was designed with auditability in mind, at every layer, if you're authorized, you can view anything's history, be it a file or packet

UnitBuilds • Jun 29

You might wanna check out the 12 part series I posted yesterday. While it's the journey of 5 days from Cloudflare Worker AI integration into an agentic extension for VS-Code to a standalone OS, it's actually all just testimony to NDA as a format, in all its use-cases.

Daniel Nwaneri • Jun 29

write-time detection is the part that changes the calculus. catching non-runnable code before compile means the governance signal arrives before the damage not after the incident report.

The authorization layer on history access is what makes it usable in production. full auditability without access control is just a liability. scoped auditability (if you're authorized, you can see anything's history) is the thing enterprises will actually trust.

The production-safe-agent-loop I built runs five primitives to keep agents from executing outside their authorized scope. right now that's application-layer enforcement, after the transport. if NMCP exposes the merkle root and authorization check at the protocol level, you could enforce scope before the agent loop even fires. that's a meaningful boundary to push down.

is the authorization model role-based or something more granular?

UnitBuilds • Jun 29

CPU's hardware signature. So it's system-level secured.

Daniel Nwaneri • Jun 29

system-level enforcement is the boundary I didn't know I was looking for. that changes what's possible above it....

UnitBuilds • Jun 29

What I learned during the 12 part series, is that you can start at surface level, add security, but you'll always find loopholes and risks until you're at bare-metal.

Adam - The Developer ✨ • Jun 29

Great breakdown of the physical journey behind a single token.

The reminder to stream responses and cache aggressively is so critical, especially when building for regions bearing the brunt of that 'regional latency tax'.

great piece!

Daniel Nwaneri • Jun 29

Adam, building for those regions is exactly who this was written for....

Hemapriya Kanagala • Jun 30

Really enjoyed this one 😄

I learned a few things from this, especially around latency and what actually happens behind the scenes. It's easy to forget how much infrastructure is involved when an API call feels almost instant.

Mykola Kondratiuk • Jun 30

spent a week arguing that my LLM integration was broken. turned out it was fine - just high GPU load on the provider’s end. the invisible queue is the part nobody builds dashboards for.

VoltageGPU • Jun 30

Interesting breakdown of the pipeline! As someone who works with GPU infrastructure, I'm always fascinated by how scheduling and memory management on the device side can impact latency—especially when dealing with multiple concurrent requests. It's easy to forget how much happens behind the scenes to get that first token so quickly.

akash yadav • Jun 29

When you call an LLM API, your application sends a prompt to a hosted AI model over the internet. The API processes the request by tokenizing the input, running it through the language model, and generating a response token by token. The final output is then returned to your application in a structured format like JSON, making it easy to integrate into websites, chatbots, CRMs, or other software.

The overall workflow typically looks like this:

Your app sends a prompt through an API request.
The LLM interprets the prompt using its trained parameters.
It generates a context-aware response.
The API returns the response for your application to display or process.

Many businesses use LLM APIs to power customer support, automate content creation, summarize documents, and enhance internal workflows without having to train or host their own AI models.

At Aqva Marketing, we help businesses integrate AI solutions like LLM APIs into their digital strategies, enabling smarter automation, personalized customer experiences, and scalable business growth. The key is not just accessing an AI model but implementing it in a way that delivers real business value.

algorhymer • Jun 29

What Actually Happens When You Call an LLM API
you've felt it.
you type a prompt, hit send, and the response starts streaming in under a second. smooth. instant. you feel like you're thinking out loud with a machine.

Meanwhile the elusive girlfriendicus in the background, eyerolls and slowly starts her long march to yoga class.

I was your host, David Attenborough.

Aneesha Prasannan • Jun 29

One of the best explanations I've read on why AI feels "slow" sometimes. We often optimize prompts while forgetting that networking, GPUs, and data center location are just as important to the user experience.

Kartik N V J K • Jun 29

Great walkthrough, and the part people underrate is that the GPU queue is usually a bigger share of that wait than the ocean cable. The same prompt can stream instantly or stall for five seconds based purely on how loaded the serving batch is at that moment, which is why tail latency looks so random from the client side. When I trace my own calls, splitting queue wait from network and decode time is the only thing that makes p95 explainable, do you separate those out when you measure?
