<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Harshvardhan Singh</title>
    <description>The latest articles on DEV Community by Harshvardhan Singh (@hrsvd).</description>
    <link>https://dev.to/hrsvd</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4016094%2Fa85d7913-556d-4e1a-b878-2c13bff2d9ea.jpeg</url>
      <title>DEV Community: Harshvardhan Singh</title>
      <link>https://dev.to/hrsvd</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hrsvd"/>
    <language>en</language>
    <item>
      <title>The Hidden Cost of Every LLM API Call</title>
      <dc:creator>Harshvardhan Singh</dc:creator>
      <pubDate>Sun, 05 Jul 2026 16:00:51 +0000</pubDate>
      <link>https://dev.to/hrsvd/the-hidden-cost-of-every-llm-api-call-570b</link>
      <guid>https://dev.to/hrsvd/the-hidden-cost-of-every-llm-api-call-570b</guid>
      <description>&lt;h3&gt;
  
  
  What actually happens after your app sends a prompt to an LLM?
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;~6 min read&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You call &lt;code&gt;client.messages.create(...)&lt;/code&gt;. A few hundred ms later, tokens start streaming back.&lt;/p&gt;

&lt;p&gt;Feels simple. Isn't. Here's the full path, broken into fast, skimmable sections.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Your SDK does work before anything leaves your laptop
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Serializes your messages to JSON&lt;/li&gt;
&lt;li&gt;Attaches headers (API key, content-type)&lt;/li&gt;
&lt;li&gt;Decides HTTP/1.1 vs HTTP/2&lt;/li&gt;
&lt;li&gt;Sets up retry/backoff logic&lt;/li&gt;
&lt;/ul&gt;
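&lt;p&gt;The retry/backoff item in the list above can be sketched as exponential backoff with full jitter. This is a minimal illustration, not any SDK's actual retry policy: the function name &lt;code&gt;backoff_delay&lt;/code&gt;, its parameters, and the default values are all assumptions.&lt;/p&gt;

```python
import random


def backoff_delay(attempt, base=0.5, cap=8.0, rng=random):
    """Return a sleep time in seconds for a zero-based retry attempt.

    The ceiling doubles each attempt (base, 2*base, 4*base, ...) up to cap.
    The delay is drawn uniformly from [0, ceiling] ("full jitter") so that
    many clients do not all retry at the same instant.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    for attempt in range(5):
        print(attempt, round(backoff_delay(attempt), 3))
```

&lt;p&gt;Passing a seeded &lt;code&gt;random.Random&lt;/code&gt; as &lt;code&gt;rng&lt;/code&gt; makes the schedule reproducible in tests.&lt;/p&gt;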

&lt;p&gt;💡 &lt;strong&gt;Common Mistake:&lt;/strong&gt; Making a new client instance per request. You lose connection pooling and pay full TCP + TLS setup cost every time. Reuse the client.&lt;/p&gt;
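&lt;p&gt;One way to avoid the per-request client is to build it once and hand out the same instance. The sketch below uses a hypothetical &lt;code&gt;FakeLLMClient&lt;/code&gt; class as a stand-in for whatever SDK client you use; only the reuse pattern is the point.&lt;/p&gt;

```python
from functools import lru_cache


class FakeLLMClient:
    """Hypothetical stand-in for a real SDK client that owns a connection pool."""

    instances_created = 0

    def __init__(self):
        FakeLLMClient.instances_created += 1


@lru_cache(maxsize=1)
def get_client():
    # Built on first use; every later call returns the same object,
    # so its pooled connections are reused across requests.
    return FakeLLMClient()


def handle_request(prompt):
    client = get_client()  # reuse; do not construct a client per request
    return client, prompt
```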




&lt;h2&gt;
  
  
  2. DNS: finding the server
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;api.anthropic.com → Resolver → 203.0.113.42
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cold lookup: 20–120ms. Cached: basically free. This is one reason connection reuse is a real win at scale: a kept-alive connection skips DNS resolution entirely on later calls.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. TLS: locking the channel
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client → TCP handshake → TLS handshake → Encrypted request →
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;TLS 1.3 trimmed the TLS handshake to 1 round trip (TLS 1.2 needed 2), on top of the round trip for the TCP handshake. Still not free, especially on mobile networks with higher latency.&lt;/p&gt;
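&lt;p&gt;To see how the DNS lookup and the handshakes add up, here is a back-of-the-envelope sketch. The function names and the example numbers (50ms DNS, 40ms round trip) are assumptions for illustration, not measurements.&lt;/p&gt;

```python
def connection_setup_ms(dns_ms, rtt_ms, tls_round_trips=1):
    """DNS lookup + one round trip for TCP + the TLS handshake round trips."""
    return dns_ms + rtt_ms + tls_round_trips * rtt_ms


def total_setup_ms(requests, dns_ms, rtt_ms, reuse_connection):
    """Setup overhead across many requests, with or without connection reuse."""
    per_connection = connection_setup_ms(dns_ms, rtt_ms)
    connections = 1 if reuse_connection else requests
    return connections * per_connection


if __name__ == "__main__":
    print(total_setup_ms(1000, dns_ms=50, rtt_ms=40, reuse_connection=False))
    print(total_setup_ms(1000, dns_ms=50, rtt_ms=40, reuse_connection=True))
```

&lt;p&gt;Under these assumed numbers, 1,000 requests on fresh connections spend 130,000ms on setup alone, versus 130ms when one connection is reused.&lt;/p&gt;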




&lt;h2&gt;
  
  
  4. Load balancer: you're not hitting one server
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request → [Load Balancer] → Server A / B / C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Health checks, geographic routing, traffic spike absorption. This is why one dead server rarely becomes your problem.&lt;/p&gt;
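&lt;p&gt;A minimal sketch of the idea, assuming plain round-robin rotation plus a health flag per backend. Real load balancers are far more involved; the class name &lt;code&gt;RoundRobinBalancer&lt;/code&gt; and its methods are made up for this example.&lt;/p&gt;

```python
class RoundRobinBalancer:
    """Rotate through backends, skipping any that are marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._next = 0

    def mark_down(self, backend):
        # A failed health check would call this.
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Try each backend at most once, starting from the rotation pointer.
        for _ in range(len(self.backends)):
            candidate = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")
```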




&lt;h2&gt;
  
  
  5. API Gateway: airport security for your request
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Auth&lt;/strong&gt; — is this API key valid, whose account is it?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limiting&lt;/strong&gt; — protects shared infra from noisy neighbors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validation&lt;/strong&gt; — malformed JSON or bad params get rejected here, before wasting GPU time downstream&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 &lt;strong&gt;Engineering Insight:&lt;/strong&gt; Rate limits aren't there to annoy you — they keep one client from degrading service for everyone sharing that hardware.&lt;/p&gt;
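&lt;p&gt;Rate limiting is often explained with a token bucket, sketched below. This is a generic textbook version, not how any particular provider implements its limits; the class and parameter names are assumptions.&lt;/p&gt;

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill for the time elapsed since the last call, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # the caller would typically answer with HTTP 429
```

&lt;p&gt;The injectable &lt;code&gt;clock&lt;/code&gt; is a design choice for testability: a fake clock lets you step time forward without sleeping.&lt;/p&gt;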




&lt;h2&gt;
  
  
  6. Logging (async, non-blocking)
&lt;/h2&gt;

&lt;p&gt;Request IDs, token counts, per-stage latency — feeds debugging, abuse detection, and your invoice. Doesn't block your request.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Tokenization: words become numbers
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Explain quantum entanglement" → [16350, 14294, 4776, 385, 1997]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things this affects directly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;💰 &lt;strong&gt;Cost&lt;/strong&gt; — billed per token, not per character&lt;/li&gt;
&lt;li&gt;📏 &lt;strong&gt;Context limit&lt;/strong&gt; — "200K context" = token budget, not word count&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 &lt;strong&gt;Real Example:&lt;/strong&gt; Code and non-English text often burn more tokens than plain English for the same "amount" of meaning — the tokenizer saw those patterns less during training.&lt;/p&gt;

&lt;p&gt;💡 &lt;strong&gt;Performance Tip:&lt;/strong&gt; Trim repeated boilerplate/system prompts. Every token costs money and context space.&lt;/p&gt;
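&lt;p&gt;To make "words become numbers" concrete, here is a toy greedy longest-match tokenizer over a made-up five-entry vocabulary. Production tokenizers use learned vocabularies with many thousands of entries and different algorithms (BPE variants, for example), so the vocabulary and IDs below are purely hypothetical.&lt;/p&gt;

```python
# Hypothetical vocabulary: piece of text -> token ID.
VOCAB = {"Explain": 0, " quantum": 1, " ent": 2, "ang": 3, "lement": 4}


def tokenize(text, vocab):
    """Split text into the longest known pieces, left to right."""
    ids = []
    start = 0
    while start != len(text):
        # Try the longest candidate first, then shrink.
        for end in range(len(text), start, -1):
            piece = text[start:end]
            if piece in vocab:
                ids.append(vocab[piece])
                start = end
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {start}")
    return ids


if __name__ == "__main__":
    print(tokenize("Explain quantum entanglement", VOCAB))
```

&lt;p&gt;Three words come out as five tokens here, which is why billing and context limits track tokens rather than words.&lt;/p&gt;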




&lt;h2&gt;
  
  
  8. Model routing
&lt;/h2&gt;

&lt;p&gt;A routing layer picks which model + cluster serves your request based on capacity and region. Provider-specific, mostly undocumented in detail — but this general shape is common everywhere.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. GPU scheduling: the real bottleneck
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[User A][User B][User C][User D] → batched onto one GPU
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GPU capacity can't be spun up instantly the way a web server can, so the GPUs that are running have to stay busy. Batching multiple requests keeps them efficient. &lt;strong&gt;Continuous batching&lt;/strong&gt; (slotting new requests into an in-flight batch) is why modern serving is so much faster than naive one-at-a-time processing.&lt;/p&gt;

&lt;p&gt;💡 This is also why your latency varies call to call — you're sharing hardware.&lt;/p&gt;
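&lt;p&gt;A toy simulation shows why slotting requests into an in-flight batch helps. It counts decode steps only, assumes every request in the batch produces one token per step, and ignores everything else a real scheduler handles; the function names are made up for this sketch.&lt;/p&gt;

```python
from collections import deque


def static_batching_steps(lengths, max_batch):
    """Each batch runs until its longest request finishes, then the next starts."""
    return sum(
        max(lengths[i:i + max_batch]) for i in range(0, len(lengths), max_batch)
    )


def continuous_batching_steps(lengths, max_batch):
    """Finished requests leave after each step and waiting ones take their slot.

    `lengths` are output lengths in tokens and must be positive integers.
    """
    if not max_batch >= 1:
        raise ValueError("max_batch must be at least 1")
    waiting = deque(lengths)
    active = []
    steps = 0
    while waiting or active:
        # Fill any free slots before the next decode step.
        while waiting and len(active) != max_batch:
            active.append(waiting.popleft())
        # One decode step: each active request emits one token;
        # requests that just emitted their last token drop out.
        active = [n - 1 for n in active if n > 1]
        steps += 1
    return steps


if __name__ == "__main__":
    lengths = [8, 2, 2, 2]
    print(static_batching_steps(lengths, 2), continuous_batching_steps(lengths, 2))
```

&lt;p&gt;With one long request and three short ones on a batch size of 2, the static version needs 10 steps and the continuous version 8, because the short requests no longer wait for the long one to finish.&lt;/p&gt;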




&lt;h2&gt;
  
  
  10. KV Cache: the trick behind fast generation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Token 1 → compute + cache
Token 2 → reuse cache + compute new token only
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without this, every new token would mean reprocessing the whole conversation. With it, generation stays fast — but the cache grows with context length, eating GPU memory the whole time your request is active.&lt;/p&gt;

&lt;p&gt;This is also the mechanism behind &lt;strong&gt;prompt caching&lt;/strong&gt; — reusing cached state for a shared prefix (like a system prompt) across calls, cutting cost + latency.&lt;/p&gt;
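&lt;p&gt;The memory cost can be estimated with a common rule of thumb: two tensors (keys and values) per layer, per KV head, per token. The model dimensions below are hypothetical round numbers, not any specific model's configuration.&lt;/p&gt;

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV cache size for one request.

    The factor of 2 covers keys and values; bytes_per_value=2 assumes
    16-bit storage.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 8 KV heads, head dimension 128.
    per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
    full = kv_cache_bytes(32, 8, 128, seq_len=8192)
    print(per_token // 1024, "KiB per token")
    print(full // 2**30, "GiB at 8192 tokens")
```

&lt;p&gt;For this assumed configuration that is 128 KiB per token, or 1 GiB for a single 8,192-token request, held for as long as the request is active.&lt;/p&gt;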




&lt;h2&gt;
  
  
  11. Transformer inference (the part everyone pictures)
&lt;/h2&gt;

&lt;p&gt;Per token: embed → run through N transformer layers (self-attention + feed-forward) → probability distribution over vocabulary → sample next token.&lt;/p&gt;

&lt;p&gt;💡 &lt;strong&gt;Common Mistake:&lt;/strong&gt; Higher &lt;code&gt;temperature&lt;/code&gt; ≠ smarter model. It just changes sampling randomness.&lt;/p&gt;
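&lt;p&gt;A small sketch of what temperature does to the output distribution, using made-up logits for a three-token vocabulary. It shows the standard softmax-with-temperature formula; how a given provider applies it internally is not something this example claims to know.&lt;/p&gt;

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature rescales the scores first."""
    if not temperature > 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
    for t in (0.5, 1.0, 2.0):
        print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

&lt;p&gt;The ranking of tokens never changes. Low temperature concentrates probability on the top token, high temperature spreads it out.&lt;/p&gt;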




&lt;h2&gt;
  
  
  12. Streaming: why it feels like typing
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Prompt → [t1] → [t1,t2] → [t1,t2,t3] → ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tokens generate one at a time (autoregressive) and get streamed to you as each one is ready — usually via Server-Sent Events.&lt;/p&gt;

&lt;p&gt;💡 &lt;strong&gt;Performance Tip:&lt;/strong&gt; Always stream user-facing responses longer than a sentence. Total time is the same, but perceived latency drops massively: the first token shows up within a few hundred ms instead of a blank screen until the whole response is done.&lt;/p&gt;
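&lt;p&gt;On the wire, Server-Sent Events are plain text lines. The sketch below parses only the &lt;code&gt;data:&lt;/code&gt; field of an event stream and ignores other fields; the sample payloads are invented, and real SDKs do this parsing for you.&lt;/p&gt;

```python
def iter_sse_data(lines):
    """Yield the data payload of each event in an SSE stream.

    An event ends at a blank line. Multiple data: lines within one event
    are joined with newlines. An unterminated final event is dropped.
    """
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[5:]
            if value.startswith(" "):
                value = value[1:]  # one optional space after the colon
            buffer.append(value)
        # Other fields (event:, id:, retry:) and comment lines are ignored here.


if __name__ == "__main__":
    stream = ["data: Hel\n", "\n", "data: lo\n", "\n"]  # invented payloads
    for chunk in iter_sse_data(stream):
        print(chunk, end="", flush=True)  # render each chunk as soon as it arrives
    print()
```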




&lt;h2&gt;
  
  
  13. Billing, running in parallel
&lt;/h2&gt;

&lt;p&gt;Input tokens and output tokens are metered separately (cached input tokens are often cheaper). The counts feed your invoice and, in some systems, real-time quota checks in the rate limiter from step 5.&lt;/p&gt;

&lt;p&gt;💡 A long repeated system prompt quietly becomes a big line item unless the provider discounts the repeated prefix via caching.&lt;/p&gt;
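&lt;p&gt;A rough cost sketch makes the line item visible. All prices and the cache discount below are hypothetical placeholders, so check your provider's pricing page for real numbers; the function name is also made up.&lt;/p&gt;

```python
def request_cost_usd(input_tokens, output_tokens, cached_input_tokens=0,
                     input_price=3.00, output_price=15.00,
                     cached_price_factor=0.1):
    """Estimate one request's cost.

    Prices are USD per million tokens. The defaults, including the
    cached-token discount factor, are hypothetical.
    """
    uncached = input_tokens - cached_input_tokens
    cost = uncached * input_price
    cost += cached_input_tokens * input_price * cached_price_factor
    cost += output_tokens * output_price
    return cost / 1_000_000


if __name__ == "__main__":
    calls = 100_000
    prompt_only = request_cost_usd(2000, 0)
    prompt_cached = request_cost_usd(2000, 0, cached_input_tokens=2000)
    print(round(calls * prompt_only, 2), round(calls * prompt_cached, 2))
```

&lt;p&gt;Under these assumed prices, a 2,000-token system prompt sent on 100,000 calls costs about $600 uncached and about $60 if the whole prefix is billed at the cached rate.&lt;/p&gt;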




&lt;h2&gt;
  
  
  The whole pipeline, one diagram
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your Code → SDK → DNS → TLS → Load Balancer → Gateway (auth/limit/validate)
   → Logging → Tokenization → Routing → GPU Scheduling → KV Cache
   → Inference → Generation → Streaming → (Billing, parallel) → Your Code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;~15 systems, different teams, different hardware, all cooperating to get the first token back to you in well under a second.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;This is a distributed systems problem first, ML problem second&lt;/li&gt;
&lt;li&gt;Reuse connections — DNS + TLS cost adds up&lt;/li&gt;
&lt;li&gt;Tokens = cost + context budget, treat them as a resource&lt;/li&gt;
&lt;li&gt;Latency variance = GPU batching, not "harder thinking"&lt;/li&gt;
&lt;li&gt;KV cache = why long chats cost more server-side&lt;/li&gt;
&lt;li&gt;Streaming = better &lt;em&gt;perceived&lt;/em&gt; speed, not better actual speed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Discussion:&lt;/strong&gt; As agent chains stack tool calls on tool calls, how much of this overhead gets duplicated at every hop — and what should get collapsed into one shared layer instead?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Provider-specific details (routing, scheduling, caching) vary — the patterns above are common across large-scale LLM serving systems, not any one provider's exact internals.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt; &lt;a href="https://docs.claude.com" rel="noopener noreferrer"&gt;Anthropic Docs&lt;/a&gt; · &lt;a href="https://platform.openai.com/docs" rel="noopener noreferrer"&gt;OpenAI Docs&lt;/a&gt; · Vaswani et al., "Attention Is All You Need" (2017) · &lt;a href="https://www.cloudflare.com/learning/ssl/what-is-ssl/" rel="noopener noreferrer"&gt;Cloudflare: How TLS Works&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>backend</category>
      <category>webdev</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
