DEV Community

What Actually Happens When You Call an LLM API

Daniel Nwaneri on June 29, 2026

you've felt it. you type a prompt, hit send, and the response starts streaming in under a second. smooth. instant. you feel like you're thinking o...

Read full post

Daniel Nwaneri • Jun 29

pascal, the ground-level version of the 1% story is here.
The diesel generators, the latency tax, the 17 buildings versus 5,500 . This is what the infrastructure gap looks like from inside it. not 2029. right now, on every API call.
the piece we talked about writing . this is the non-fiction half of it.

Sylwia Laskowska • Jun 29

@pascal_cescato_692b7a8a20 is a trendsetter this week 😀

Daniel Nwaneri • Jun 29

He started something this week. glad you're here too, Sylwia....

Pascal CESCATO • Jun 29

You brought the ground truth. Sylwia brought the room. I just left the door open.

Pascal CESCATO • Jun 29 • Edited

Trendsetter is generous. Early symptom is more accurate. 😄

Pascal CESCATO • Jun 29

"It's probably not your code. It's a building somewhere running on diesel."

That's the sentence. Everything before it is the setup, and it lands because you've actually earned it.
17 versus 5,500. That number does what the whole geopolitics section of 1% couldn't — it makes the gap physical, countable, impossible to abstract away.
This is the piece I couldn't write. You wrote it from inside. That's the difference.

UnitBuilds • Jun 29

🤑 Runs on Diesel... Ouch! Especially now...

Yeah people dont really consider the physical bottlenecks that exist, they just think the model is slow. Cant do much about distance, but you can change the protocol you use for sending the data (eg. V.E.L.O.C.I.T.Y. was built for handling bank transfers at microseconds), or what kind of data you send (NDA vs JSON means no serialization and deserialization), inference needs to run (google has custom accelerators that run way faster than H100s atleast), data congestion (queueing, or dropping) can be addressed with load-balancers which also add latency (albeit fractions), then response comes back and your editor needs to display it (if your ide uses Electron, then it's SLOW). So many things that add up to why it takes a few seconds to get the first token and why the industry is at the standstill again. People think AI surpassed hardware, true in some ways, but the real bottleneck is the aging infrastructure code we've been using. Take JSON, it takes over a milisecond just to serialize and deserialize it. Now multiply that by a billion... That's the impact of JSON every hour for cloud models. That's compute that costs money and time that could be used inferring instead of packing and retrying if the output wasnt structured perfectly (whose still remembers "Model produced a malformed edit"). LLMs like Gemini 3.5 flash already run at 400+ tps, that's 2.5ms per token, a token is on average 0.8 of a word, which works out to 320 words per second... That's really damn fast. it's just the infrastructure that's still slow, the GPUs are fast and the models are fast.

Daniel Nwaneri • Jun 29

UnitBuilds, the JSON number is the one that sits with me. 1ms per serialize/deserialize sounds trivial until you map it across every layer . client to proxy, proxy to inference server, response back. by the time you count all the hops a Nigerian developer's request makes, you're stacking serialization costs on top of geography costs on top of queue costs. the latency isn't one problem. it's five problems wearing one coat.

The NDA format angle is sharp . if you're eliminating serialization at the transport layer, you're attacking the problem at the right level. most people optimize the model. you're optimizing the pipe.

curious whether V.E.L.O.C.I.T.Y. handles the case where the receiving end is still JSON-native . does it translate at the edge or does it require both sides to speak NDA?

UnitBuilds • Jun 29

It can yes, the POC for that would be NMCP (Rust-based MCP built on using the NDA format, instead of JSON-RPC and self-hosted instead of Node.JS). You can use NDA, or you can use JSON, so it's backwards compatible when necessary (eg. public APIs for cloud models). So you dont get the full benefit of using an AI-native language, zero deserialization, E2E AES 256 encryption and a merkle root immutable history, but you do get some performance gain still vs traditional.

UnitBuilds • Jun 29

NDA is more than just a packing format, it's a means of representing data deterministically (think excel formulas), by using the triplet structure, it's easier for a model to understand cause and effect and everything inbetween. Essentially allowing it to 'infer' the meaning, without running inference. So it also improves your signal to noise ratio while at it, because the old standard is MCP tools have 'a title + description', plain text packed in json, with parameter fields and no clue what they do, unless you bloat the context with large descriptions. NDA is a single A -> B -> C and the model just gets it.

While V.E.L.O.C.I.T.Y. had it's roots in bank transfers and wall-street, NDA was built to be the USB-C of files, it can contain code, a spreadsheet, a form, a webpage, an image, video, audio, even a zip. But it represents data differently. With the triplet structure, a spreadsheet line isnt just Cell A, B, C, it's Qty * Price = Cost. As a document standard, idea was for it to use this structure and when you print it, print a tiny QR code on the doc, which when scanned reconstructs the document precisely, so for financial documents you cant just change the PDF's owner password check and edit a line, it leaves a fingerprint on the NDA doc's merkle root if edited properly, if you dumb-edit it (just replace the text in paint), the QR code will invalidate the text on the page. Also means that when you print a doc, you dont have to run OCR to scan it back in. But that's besides the point. As a data-packet it's encrypted by the hardware AES 256, effectively making it free encryption even at insane data rates due to the architecture and it natively handles chunking, so you simply fill a UDP packet to the brim, send it, prep and send the next 1, the merkle root ensures that any packets lost are recalled and that the final structure is 100% intact and unaltered. Because it's hardware signed, even if you snoop a packet, you'd need the destination pc's cpu just to unlock the packet and it would have to be in sequence, else it's simply garbage data. As a data transfer system, it's computationally impossible to hack, because it runs at bare-metal speeds, if tech advances to THz cpus, it scales right up to the limit with it.

Daniel Nwaneri • Jun 29

The merkle root detail is the one I wasn't expecting. backwards compatible transport is interesting on its own but immutable history baked into the protocol layer is a different claim. that's not just performance, that's auditability by default.

been thinking about agent governance from the application layer . signoff-workers tracks what decisions were made, under what conditions, with what authorization. you're solving the same problem one layer down. if the transport itself carries a tamper-evident history, the governance layer doesn't have to reconstruct it from logs.
does NMCP expose that merkle root in a way an orchestration layer can query, or is it internal to the transport?

UnitBuilds • Jun 29

It's open, it's how the system validates state before executing, that way nonrunable code is detected at write-time instead of compile or run time. the NDA standard was designed for auditability in mind, at every layer, if you're authorized, you can view anything's history, be it a file or packet

Adam - The Developer ✨ • Jun 29

Great breakdown of the physical journey behind a single token.

The reminder to stream responses and cache aggressively is so critical, especially when building for regions bearing the brunt of that 'regional latency tax'.

great piece!

Daniel Nwaneri • Jun 29

Adam, building for those regions is exactly who this was written for....

Hemapriya Kanagala • Jun 30

Really enjoyed this one 😄

I learned a few things from this, especially around latency and what actually happens behind the scenes. It's easy to forget how much infrastructure is involved when an API call feels almost instant.

Mykola Kondratiuk • Jun 30

spent a week arguing that my LLM integration was broken. turned out it was fine - just high GPU load on the provider’s end. the invisible queue is the part nobody builds dashboards for.

VoltageGPU • Jun 30

Interesting breakdown of the pipeline! As someone who works with GPU infrastructure, I'm always fascinated by how scheduling and memory management on the device side can impact latency—especially when dealing with multiple concurrent requests. It's easy to forget how much happens behind the scenes to get that first token so quickly.

akash yadav • Jun 29

When you call an LLM API, your application sends a prompt to a hosted AI model over the internet. The API processes the request by tokenizing the input, running it through the language model, and generating a response token by token. The final output is then returned to your application in a structured format like JSON, making it easy to integrate into websites, chatbots, CRMs, or other software.

The overall workflow typically looks like this:

Your app sends a prompt through an API request.
The LLM interprets the prompt using its trained parameters.
It generates a context-aware response.
The API returns the response for your application to display or process.

Many businesses use LLM APIs to power customer support, automate content creation, summarize documents, and enhance internal workflows without having to train or host their own AI models.

At Aqva Marketing, we help businesses integrate AI solutions like LLM APIs into their digital strategies, enabling smarter automation, personalized customer experiences, and scalable business growth. The key is not just accessing an AI model but implementing it in a way that delivers real business value.

algorhymer • Jun 29

What Actually Happens When You Call an LLM API
you've felt it.
you type a prompt, hit send, and the response starts streaming in under a second. smooth. instant. you feel like you're thinking out loud with a machine.

Meanwhile the elusive girlfriendicus in the background, eyerolls and slowly starts her long march to yoga class.

I was your host, David Attenborough.

Aneesha Prasannan • Jun 29

One of the best explanations I've read on why AI feels "slow" sometimes. We often optimize prompts while forgetting that networking, GPUs, and data center location are just as important to the user experience.

Kartik N V J K • Jun 29

Great walkthrough, and the part people underrate is that the GPU queue is usually a bigger share of that wait than the ocean cable. The same prompt can stream instantly or stall for five seconds based purely on how loaded the serving batch is at that moment, which is why tail latency looks so random from the client side. When I trace my own calls, splitting queue wait from network and decode time is the only thing that makes p95 explainable, do you separate those out when you measure?

QuantaMind • Jun 29

The cold-start point compounds in agents, a multi-step loop pays that distance/TTFT tax on every step, so 200ms becomes seconds across 20 calls. Makes your "pick the right model" advice even stronger: a smaller model close to home can beat a frontier one an ocean away for agentic work. Has running anything locally been viable for you, or does the cost beat the latency?

Yunetzi • Jun 30

Great, a clear tour of what actually happens when you call an LLM API: latency, streaming, and costs explained. A quick add: refetch/caching strategies and a simple retry backoff would help newcomers avoid surprise bills. More please.

Kartik N V J K • Jun 30

The part most people miss is that a lot of the variance is queue time on a shared GPU, not network distance, so the same prompt can swing several seconds depending on batch pressure at that instant. I started logging time-to-first-token separately from total time for exactly this, since they fail for different reasons. Do you see the regional gap shrink once you account for provider-side batching versus the raw submarine-cable hop?

Alex Shev • Jun 30

The hidden production lesson is that an LLM call is not one thing. It is auth, routing, prompt construction, policy, model behavior, token accounting, retries, logging, and post-processing. Treating it as a single API call is how teams miss the failure modes.

mote • Jul 1

This is a really solid breakdown of the physical reality behind "it's just an API call." The part about Nigeria's 17 data centers running on diesel generators hit hard â I've worked on systems where the assumption was always "the cloud is somewhere close," and then you realize close is relative to which side of the ocean you're on.

One thing that makes this worse in practice: when you're building agentic systems, the latency compounds. Each tool call, each reasoning step, each function invocation adds another round trip. A 150ms base latency becomes 1500ms across a 10-step agent loop. You can stream the final response, but the agent's internal deliberation is still serial and still pays the distance tax on every hop.

What I've been wondering: for latency-sensitive regions, does it make more sense to push more of the agent's internal state and decision logic to the edge â running smaller models locally for the deliberation loops and only calling the cloud frontier model for the final synthesis? Or does that just split the problem into two harder problems? Curious if you've experimented with hybrid local/cloud setups from Port Harcourt.

Theo Valmis • Jul 3

Good instinct to look under the hood, because a lot of AI-app bugs come from treating the API call as deterministic when it isn't. Same input, different output, and no exception thrown, which breaks every assumption the rest of your code makes about a function returning what you expect. Understanding that the call is a sample, not a lookup, is what separates people who build reliable AI features from people who ship a demo that works until it doesn't.

Vignesh J • Jul 8

Hey, great write-up and an excellent breakdown of the topics.

While reading it, something came to mind. A few days ago, I downloaded a local 8B parameter model. Based on my research, it should have run comfortably on my laptop's specs. However, my very first prompt took much longer than I expected to generate a response.

In fact, that ended up being my only prompt because of how slow it was.

Could that have been due to a cold start?

I think I'll give it another try and see if the subsequent requests are any faster.

Alfatech • Jul 9

Excellent explanation. I especially liked how you connected concepts that many developers treat separately—network latency, GPU scheduling, inference, and streaming. It's easy to think of an LLM API as "just another HTTP request," but in reality there's an enormous amount of infrastructure working behind the scenes before the first token appears.

The point about geography and latency was particularly interesting. Many of us optimize prompts or tweak model parameters without realizing that physical distance to the data center and hardware availability can have just as much impact on the user experience. It really highlights that building AI applications isn't only about prompting—it's also about understanding distributed systems and infrastructure.

I also appreciated that the article stays practical instead of getting lost in marketing buzzwords. Explaining concepts like cold starts, GPU queues, and streaming in plain language makes this accessible even for developers who are just starting to work with LLMs.

Thanks for taking the time to break down such a complex topic into something that's both educational and enjoyable to read. Looking forward to reading more articles like this.

Wren Calloway • Jul 1

The part that makes this genuinely hard to fix: the person building the product is usually the person least able to feel the problem. If you're developing a few hops from us-east-1, every call feels instant, your p50 looks great, and the tax your Lagos users pay is invisible to the one person who could budget around it.

That's the trap with latency specifically — "works on my machine" isn't just a testing joke here, it's structural. Your machine is the best-case network path, so your own experience is the least representative data point you have.

What's worked for me is to stop trusting local numbers and run synthetic checks from the regions users are actually in — even a cheap cron hitting the endpoint from a couple of far-flung boxes, logging time-to-first-token separately from total (someone above nailed why: they fail for different reasons). Once you're looking at p95 from the worst region instead of p50 from your desk, the streaming/caching advice stops being generic and starts being aimed at the users who need it. You can't optimize a tax you can't see.

Valentin Monteiro • Jul 4

The "cache aggressively" line is the one I'd put an asterisk on. Exact-match caching barely hits because LLM prompts are almost never byte-identical, so people jump to semantic caching, and that quietly swaps the latency win for a correctness risk: a subtly different question gets served a close-but-wrong cached answer. Safer boundary is to cap it to high-similarity hits plus cheap deterministic tasks (classification, extraction) where a bad hit is obvious. How do you draw that line in practice?

Kishore Ganapathy • Jul 5

Yes sir, picking the model with lower parameters can give faster responses. But, it may reduce response quality and accuracy when compared to larger models. I am curious how GPT 5.5 with trillions of parameters would response too fast. Is it purely their scalable architecture ?

Mudassir Khan • Jul 2

the GPU queue wait being larger than the submarine cable hop is the thing most dashboards don't separate. we started logging TTFT separately from total response time after chasing a P95 regression that turned out to be batch pressure on a provider's serving layer, not geography.

the place this really compounds: tool calling loops in agents. each invocation pays the TTFT tax again. 150ms base becomes 1.5s across a 10 step loop, and the agent's deliberation is still serial throughout.

for Africa specifically, have you found edge inference (Cloudflare Workers AI, local smaller models) viable for the deliberation steps, or does the model quality gap kill the routing benefit?

YutoGPT • Jul 5

One layer I would add to this is that the expensive unit is often not the single API call, but the user-visible workflow that contains it.

For a simple chat box, one request might mostly map to one inference path. But once the product becomes agentic, the same visible action can include retrieval, a planner call, one or more tool calls, validation, retry, fallback and a final summary. Latency and cost then stop being properties of a model call and become properties of the whole workflow.

The receipt I would want for production systems is:

user action -> workflow step -> provider/model -> cache hit/miss -> retry/fallback/tool event -> final outcome.

That makes it much easier to see whether a slow or expensive experience came from geography, cold start, long context, model choice, tool-loop shape or a budget policy that should have stopped earlier.

Evans Owusu • Jun 29

Really well explained! The part about statelessness hits close
home — I actually ran into this building Yhuu (yhuu.life), an
anonymous Q&A app. Managing context between sessions without
storing sensitive conversation state was trickier than expected.

Ended up using Supabase edge functions to handle it cleanly.
Anyone else found good patterns for this?

leob • Jun 29

This is so interesting, also the fact that you're not just explaining about the LLMs but that you're telling the "complete story" - the ocean floor data cables, the fact that the signal travels at two thirds of the speed of light in a vacuum, and so on - I love that kind of thing!

Varadraj Galgali • Jul 8

Great article!

Todd Pressley • Jul 4

Love the honesty and authenticity of this post… great work, please keep it coming 😊

Laura Ambiguity • Jun 29

Diesel...
People really forget that the data is using real energy. For many it's just an abstract concept

Michał Szczepański • Jun 29

Your request is processed by the forest of GPUs

trendbrewers official • Jul 3

vroom go!!!1

Hannah • Jul 3

Thanks !!

laxman Subedi • Jun 30

Hamza Benzaoui • Jul 5

It’s wild how we just treat LLM APIs like magic black boxes, but honestly, that "regional latency tax" mentioned is such a massive, overlooked pain point. Being based in a region with less-than-ideal infrastructure really changes your perspective; you’re not just fighting model speed, you’re literally fighting the speed of light across submarine cables. I’ve found that streaming responses and aggressive caching aren't just "best practices"—they’re survival strategies for anyone building outside the main tech hubs. It really puts into perspective that the "intelligence" of the model is only half the battle, and the rest is just brutal engineering and infrastructure logistics.