you've felt it.
you type a prompt, hit send, and the response starts streaming in under a second. smooth. instant. you feel like you're thinking o...
For further actions, you may consider blocking this person and/or reporting abuse
pascal, the ground-level version of the 1% story is here.
The diesel generators, the latency tax, the 17 buildings versus 5,500 . This is what the infrastructure gap looks like from inside it. not 2029. right now, on every API call.
the piece we talked about writing . this is the non-fiction half of it.
@pascal_cescato_692b7a8a20 is a trendsetter this week 😀
He started something this week. glad you're here too, Sylwia....
You brought the ground truth. Sylwia brought the room. I just left the door open.
Trendsetter is generous. Early symptom is more accurate. 😄
That's the sentence. Everything before it is the setup, and it lands because you've actually earned it.
17 versus 5,500. That number does what the whole geopolitics section of 1% couldn't — it makes the gap physical, countable, impossible to abstract away.
This is the piece I couldn't write. You wrote it from inside. That's the difference.
🤑 Runs on Diesel... Ouch! Especially now...
Yeah people dont really consider the physical bottlenecks that exist, they just think the model is slow. Cant do much about distance, but you can change the protocol you use for sending the data (eg. V.E.L.O.C.I.T.Y. was built for handling bank transfers at microseconds), or what kind of data you send (NDA vs JSON means no serialization and deserialization), inference needs to run (google has custom accelerators that run way faster than H100s atleast), data congestion (queueing, or dropping) can be addressed with load-balancers which also add latency (albeit fractions), then response comes back and your editor needs to display it (if your ide uses Electron, then it's SLOW). So many things that add up to why it takes a few seconds to get the first token and why the industry is at the standstill again. People think AI surpassed hardware, true in some ways, but the real bottleneck is the aging infrastructure code we've been using. Take JSON, it takes over a milisecond just to serialize and deserialize it. Now multiply that by a billion... That's the impact of JSON every hour for cloud models. That's compute that costs money and time that could be used inferring instead of packing and retrying if the output wasnt structured perfectly (whose still remembers "Model produced a malformed edit"). LLMs like Gemini 3.5 flash already run at 400+ tps, that's 2.5ms per token, a token is on average 0.8 of a word, which works out to 320 words per second... That's really damn fast. it's just the infrastructure that's still slow, the GPUs are fast and the models are fast.
UnitBuilds, the JSON number is the one that sits with me. 1ms per serialize/deserialize sounds trivial until you map it across every layer . client to proxy, proxy to inference server, response back. by the time you count all the hops a Nigerian developer's request makes, you're stacking serialization costs on top of geography costs on top of queue costs. the latency isn't one problem. it's five problems wearing one coat.
The NDA format angle is sharp . if you're eliminating serialization at the transport layer, you're attacking the problem at the right level. most people optimize the model. you're optimizing the pipe.
curious whether V.E.L.O.C.I.T.Y. handles the case where the receiving end is still JSON-native . does it translate at the edge or does it require both sides to speak NDA?
It can yes, the POC for that would be NMCP (Rust-based MCP built on using the NDA format, instead of JSON-RPC and self-hosted instead of Node.JS). You can use NDA, or you can use JSON, so it's backwards compatible when necessary (eg. public APIs for cloud models). So you dont get the full benefit of using an AI-native language, zero deserialization, E2E AES 256 encryption and a merkle root immutable history, but you do get some performance gain still vs traditional.
NDA is more than just a packing format, it's a means of representing data deterministically (think excel formulas), by using the triplet structure, it's easier for a model to understand cause and effect and everything inbetween. Essentially allowing it to 'infer' the meaning, without running inference. So it also improves your signal to noise ratio while at it, because the old standard is MCP tools have 'a title + description', plain text packed in json, with parameter fields and no clue what they do, unless you bloat the context with large descriptions. NDA is a single A -> B -> C and the model just gets it.
While V.E.L.O.C.I.T.Y. had it's roots in bank transfers and wall-street, NDA was built to be the USB-C of files, it can contain code, a spreadsheet, a form, a webpage, an image, video, audio, even a zip. But it represents data differently. With the triplet structure, a spreadsheet line isnt just Cell A, B, C, it's Qty * Price = Cost. As a document standard, idea was for it to use this structure and when you print it, print a tiny QR code on the doc, which when scanned reconstructs the document precisely, so for financial documents you cant just change the PDF's owner password check and edit a line, it leaves a fingerprint on the NDA doc's merkle root if edited properly, if you dumb-edit it (just replace the text in paint), the QR code will invalidate the text on the page. Also means that when you print a doc, you dont have to run OCR to scan it back in. But that's besides the point. As a data-packet it's encrypted by the hardware AES 256, effectively making it free encryption even at insane data rates due to the architecture and it natively handles chunking, so you simply fill a UDP packet to the brim, send it, prep and send the next 1, the merkle root ensures that any packets lost are recalled and that the final structure is 100% intact and unaltered. Because it's hardware signed, even if you snoop a packet, you'd need the destination pc's cpu just to unlock the packet and it would have to be in sequence, else it's simply garbage data. As a data transfer system, it's computationally impossible to hack, because it runs at bare-metal speeds, if tech advances to THz cpus, it scales right up to the limit with it.
The merkle root detail is the one I wasn't expecting. backwards compatible transport is interesting on its own but immutable history baked into the protocol layer is a different claim. that's not just performance, that's auditability by default.
been thinking about agent governance from the application layer . signoff-workers tracks what decisions were made, under what conditions, with what authorization. you're solving the same problem one layer down. if the transport itself carries a tamper-evident history, the governance layer doesn't have to reconstruct it from logs.
does NMCP expose that merkle root in a way an orchestration layer can query, or is it internal to the transport?
It's open, it's how the system validates state before executing, that way nonrunable code is detected at write-time instead of compile or run time. the NDA standard was designed for auditability in mind, at every layer, if you're authorized, you can view anything's history, be it a file or packet
Great breakdown of the physical journey behind a single token.
The reminder to stream responses and cache aggressively is so critical, especially when building for regions bearing the brunt of that 'regional latency tax'.
great piece!
Adam, building for those regions is exactly who this was written for....
Meanwhile the elusive girlfriendicus in the background, eyerolls and slowly starts her long march to yoga class.
I was your host, David Attenborough.
One of the best explanations I've read on why AI feels "slow" sometimes. We often optimize prompts while forgetting that networking, GPUs, and data center location are just as important to the user experience.
When you call an LLM API, your application sends a prompt to a hosted AI model over the internet. The API processes the request by tokenizing the input, running it through the language model, and generating a response token by token. The final output is then returned to your application in a structured format like JSON, making it easy to integrate into websites, chatbots, CRMs, or other software.
The overall workflow typically looks like this:
Your app sends a prompt through an API request.
The LLM interprets the prompt using its trained parameters.
It generates a context-aware response.
The API returns the response for your application to display or process.
Many businesses use LLM APIs to power customer support, automate content creation, summarize documents, and enhance internal workflows without having to train or host their own AI models.
At Aqva Marketing, we help businesses integrate AI solutions like LLM APIs into their digital strategies, enabling smarter automation, personalized customer experiences, and scalable business growth. The key is not just accessing an AI model but implementing it in a way that delivers real business value.
Really well explained! The part about statelessness hits close
home — I actually ran into this building Yhuu (yhuu.life), an
anonymous Q&A app. Managing context between sessions without
storing sensitive conversation state was trickier than expected.
Ended up using Supabase edge functions to handle it cleanly.
Anyone else found good patterns for this?