DEV Community

An LLM API call, in 4 GIFs

Jasmin Virdi on May 26, 2026

This is the first post of the series Building TinyAgent, where we are going to build a small agent from scratch in Node.js with no frameworks, just the A...
 
FrancisTRᴅᴇᴠ (っ◔◡◔)っ

Great illustration! I am a visual learner and this helped a lot! Good work :D

Jasmin Virdi

Thanks @francistrdev
I am a visual learner too. 🙋‍♀️😄

Valentin Monteiro

The 4 GIFs are the happy path. The 5th invisible one in prod is retry/fallback/idempotency, which is where most agent loops actually burn their budget. Pricing math also flips once you're in a tool-calling loop: output tokens usually dominate input by an order of magnitude or more, so input price arbitrage between providers stops mattering. The real comparison is output cost plus structured-output reliability.
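
To make the retry/fallback part of this comment concrete, here is a minimal sketch of a retry loop with exponential backoff that falls back to a second provider. It is an illustration under stated assumptions, not the post's implementation: `ModelCall`, `RetryOptions`, `backoffDelayMs`, and `callWithRetryAndFallback` are hypothetical names, and idempotency keys are left out.

```typescript
// Hypothetical call signature: any async function that turns a prompt into text.
type ModelCall = (prompt: string) => Promise<string>;

interface RetryOptions {
  maxAttempts: number; // attempts per provider before falling back
  baseDelayMs: number; // first backoff delay; doubles on each retry
}

// Exponential backoff: base, 2x base, 4x base, ...
function backoffDelayMs(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * Math.pow(2, attempt);
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Try each provider in order, retrying each one before moving to the next.
async function callWithRetryAndFallback(
  providers: ModelCall[],
  prompt: string,
  opts: RetryOptions,
): Promise<{ text: string; attempts: number }> {
  let attempts = 0;
  let lastError: unknown = new Error("no providers configured");
  for (const call of providers) {
    for (let i = 0; i < opts.maxAttempts; i++) {
      attempts++;
      try {
        const text = await call(prompt);
        return { text, attempts };
      } catch (err) {
        lastError = err;
        // Wait before the next retry, but not after the final attempt.
        if (i < opts.maxAttempts - 1) {
          await sleep(backoffDelayMs(i, opts.baseDelayMs));
        }
      }
    }
  }
  throw lastError;
}
```

The returned `attempts` count is the useful number for budgeting: every failed attempt that reached the provider may still have been billed.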

Jasmin Virdi

Fair point @valentin_monteiro

I really appreciate you adding this. It is helping me think from a broader perspective for this series, and I will try to cover it in upcoming posts.
Quick question though, what do you mean by structured output reliability? Is that about the model consistently returning valid JSON or something broader?
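
One narrow reading of the question, "does the reply parse as JSON and carry the fields the agent expects", can be measured with a small validator. The sketch below is only that narrow reading, not a definition from the comment above. The `ToolCall` shape and the names `parseToolCall` and `reliability` are assumptions made for illustration.

```typescript
// Hypothetical shape the agent expects back from the model.
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

function isPlainObject(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Returns the parsed call, or null if the text is not valid JSON
// or is missing the expected fields.
function parseToolCall(raw: string): ToolCall | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null; // not JSON at all, e.g. prose wrapped around the object
  }
  if (!isPlainObject(value)) return null;
  const tool = value["tool"];
  const args = value["args"];
  if (typeof tool !== "string" || !isPlainObject(args)) return null;
  return { tool, args };
}

// Fraction of replies that pass validation (0 for an empty sample).
function reliability(replies: string[]): number {
  if (replies.length === 0) return 0;
  const valid = replies.filter((r) => parseToolCall(r) !== null).length;
  return valid / replies.length;
}
```

Running `reliability` over a sample of logged replies gives a rough per-model number to compare, though a broader definition would also check argument types and values.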

zhongqiyue

Great post — the stop_reason branching is something a lot of tutorials skip, but it's essential for building reliable agents. We ran into the same need to switch providers without rewriting code, so we started using ai.interwestinfo.com as a unified gateway. The pricing has been noticeably lower than buying direct, and having one key for 300+ models simplifies a lot. Have you experimented with routing requests between providers based on cost or latency?

Jasmin Virdi

Thanks @__c1b9e06dc90a7e0a676b

Interesting, does it support multiple models? I haven't tried routing requests based on cost or latency. Could you share some more pointers on it?

xulingfeng

These GIFs are brilliantly clear — they show exactly how much the SDK abstractions hide. We switched to raw API calls for our Hermes agent stack after hitting a mysterious latency issue. Turned out the SDK was polling for stream completion even on non-streaming requests, adding 300-800ms per call that didn't show up in any dashboard.

Out of curiosity — are you planning to cover streaming vs non-streaming latency differences in the TinyAgent series? That's the one gap I haven't seen well explained with visuals.

VoltageGPU

Interesting breakdown of the API call flow! When working with GPU-backed LLMs, I've seen how critical it is to manage memory and concurrency efficiently—especially when handling multiple requests. If you're scaling this up, you might want to look into how frameworks like VoltageGPU help with resource isolation.

Jasmin Virdi

Thanks @voltagegpu

Seems interesting, will check. I believe infra and scaling could be a separate topic altogether.

The Seventeen

This is a really beautiful write up. Would love to see how you integrate AgentSecrets for credentials management!

agentsecrets.theseventeen.co

Jasmin Virdi

Thanks @the_seventeen

More coming soon. Stay tuned!

The Seventeen

Can't wait!

Nahuel Nucera

Amazing!

Jasmin Virdi

Thanks @nahuel990

Anguishe

Very nice! Awesome topic to do a series on. Looking forward to seeing the rest 😍

Jasmin Virdi • Edited

Thanks @bashsnippets

I have a bunch of things to cover in this series. This is really motivating, and I hope I do justice to it. 😄

leob

Insightful, very well written! AI/LLMs explained for "the rest of us" (a.k.a. "mere mortals") :-)

Jasmin Virdi

Thanks @leob

Glad you liked it. More in the series coming soon.

leob

Looking forward to it!

Vasyl • Edited

This is actually a really clean explanation for beginners. The "JSON is expensive" part surprises almost everyone 😄 People use AI APIs for months without knowing what stop_reason does.
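
For readers who have not met `stop_reason` yet, a rough sketch of what branching on it can look like follows. It assumes the Anthropic-style values `end_turn`, `stop_sequence`, `tool_use`, and `max_tokens`; other providers use different field names and values. The `NextStep` labels and the `nextStep` function are hypothetical.

```typescript
// What the agent loop should do next (hypothetical labels).
type NextStep = "return_answer" | "run_tool" | "handle_truncation";

// Map a stop_reason string to the next action in the loop.
function nextStep(stopReason: string): NextStep {
  switch (stopReason) {
    case "end_turn": // the model finished its reply
    case "stop_sequence": // the model hit a configured stop sequence
      return "return_answer";
    case "tool_use": // the model is asking for a tool to be run
      return "run_tool";
    case "max_tokens": // the reply was cut off at the token limit
      return "handle_truncation";
    default:
      // Fail loudly so new or unexpected values are noticed.
      throw new Error("unhandled stop_reason: " + stopReason);
  }
}
```

Throwing on unknown values is a deliberate choice in this sketch: silently treating an unrecognized reason as a finished answer is how truncated or half-complete replies slip through.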

Jasmin Virdi • Edited

Thanks @workout097collab

Glad you liked it. More coming soon. 😄

Nafas Ebrahimi

Great post! I learned a few things from this post.

Jasmin Virdi

Thanks @nafasebra

Glad you liked it

UnitBuilds

And if you're using APIs, turn on flex/batch processing and context caching so you don't burn through your wallet.

Jasmin Virdi

Great point @unitbuilds

Prompt caching is good to have when we have a long prompt that does not change frequently, since it helps reduce input costs. Batch, on the other hand, is perfect for anything that doesn't need a real-time response. I will make sure to cover both in upcoming modules of the series.
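
A back-of-the-envelope sketch of how the two discounts interact is below. Every number and name in it is an assumption for illustration: real prices, cache-read multipliers, and batch discounts differ by provider. It ignores any extra charge for writing to the cache, and it assumes the batch discount applies on top of caching, which may not hold everywhere.

```typescript
// All fields are hypothetical inputs, not any provider's real price list.
interface Pricing {
  inputPerMTok: number; // price per million uncached input tokens
  outputPerMTok: number; // price per million output tokens
  cachedReadMultiplier: number; // fraction of the input price paid for cached tokens
  batchMultiplier: number; // fraction of the normal price paid for batch requests
}

interface Usage {
  freshInputTokens: number; // input tokens not served from cache
  cachedInputTokens: number; // input tokens served from cache
  outputTokens: number;
}

// Estimated cost of one request, in the same currency as the prices.
function estimateCost(usage: Usage, pricing: Pricing, useBatch: boolean): number {
  const billedInputTokens =
    usage.freshInputTokens + usage.cachedInputTokens * pricing.cachedReadMultiplier;
  const inputCost = (billedInputTokens * pricing.inputPerMTok) / 1_000_000;
  const outputCost = (usage.outputTokens * pricing.outputPerMTok) / 1_000_000;
  const total = inputCost + outputCost;
  return useBatch ? total * pricing.batchMultiplier : total;
}
```

With made-up prices of 2 per million input tokens and 10 per million output tokens, a request with 1,000,000 input tokens and 100,000 output tokens costs 3 uncached. If 900,000 of those input tokens are cache reads at a 0.1 multiplier, the same request comes to about 1.38, and a 0.5 batch multiplier brings it to about 0.69.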

CapeStart

As models become commoditized, understanding the mechanics around API calls, context windows, tool usage, and cost control may become a bigger advantage than model choice itself.

Theo Valmis

The 4-GIF framing is great precisely because it forces the question of where the boundary actually lives. Once people see how thin the request/response shell is, the more interesting question becomes what shapes the prompt before it goes out — and that's where most production complexity ends up living.