DEV Community

Abhishek solanki

🧠 Streaming LLM APIs Can Quietly Give Free Tokens

📌 The Problem

Most OpenAI-compatible APIs report token usage only in the final chunk of a stream (OpenAI, for instance, only includes a usage object when stream_options={"include_usage": True} is set, and it arrives after the last content chunk).

So if a user:

  • refreshes the page
  • closes the tab mid-stream

👉 the stream stops
👉 usage data never arrives

But the user has already seen part of the response.
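To make the failure mode concrete, here is a minimal sketch. The chunk dicts are hypothetical stand-ins for the SSE events an OpenAI-compatible API sends; the point is that usage only rides on the very last chunk, so an interrupted read never sees it:

```python
# Hypothetical chunks mimicking an OpenAI-compatible stream:
# content chunks carry usage=None; only the final chunk has usage.
chunks = [
    {"content": "Hello", "usage": None},
    {"content": " world", "usage": None},
    {"content": "", "usage": {"prompt_tokens": 12, "completion_tokens": 2}},
]

def consume(stream, interrupt_after=None):
    """Read a stream; return (text_seen, usage).
    interrupt_after simulates a user refreshing or closing the tab."""
    text, usage = "", None
    for i, chunk in enumerate(stream):
        text += chunk["content"]
        if chunk["usage"] is not None:
            usage = chunk["usage"]
        if interrupt_after is not None and i + 1 >= interrupt_after:
            break  # client disconnected mid-stream
    return text, usage

full_text, full_usage = consume(chunks)                      # completes: usage present
part_text, part_usage = consume(chunks, interrupt_after=2)   # interrupted: usage is None
```

Here part_text is "Hello world" while part_usage is None: the user saw real output, but nothing was ever recorded.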


⚠️ Why This Matters

  • ❌ tokens not recorded
  • ❌ users get partial responses for free
  • ❌ can be abused
  • ❌ billing becomes inaccurate

🚀 The Fix

I added a fallback in the finally block of the streaming loop.

👉 If no usage data is received:

  • calculate tokens manually using tiktoken
  • count prompt + generated output
import tiktoken

# Fallback: the stream ended without a usage chunk, so estimate locally
if not tokens_used:
    enc = tiktoken.get_encoding("cl100k_base")
    prompt_tokens = sum(len(enc.encode(m["content"])) for m in messages)
    output_tokens = len(enc.encode(full_content))
    tokens_used = prompt_tokens + output_tokens
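Putting it together, the whole pattern fits in one generator. This is a minimal sketch, not my exact code: the names (stream_with_usage, count_tokens, record_usage) are illustrative, and count_tokens stands in for tiktoken's cl100k_base encoder so the example runs without external dependencies. The key property is that finally runs even when the client disconnects and the generator is closed:

```python
def stream_with_usage(chunks, messages, count_tokens, record_usage):
    """Relay a token stream and guarantee usage gets recorded.

    chunks yields dicts with "content" and an optional "usage" field;
    count_tokens is a callable (in practice, wrapping tiktoken's
    cl100k_base encoder); record_usage persists the final count.
    """
    full_content = ""
    tokens_used = 0
    try:
        for chunk in chunks:
            full_content += chunk.get("content", "")
            usage = chunk.get("usage")
            if usage:
                tokens_used = usage["prompt_tokens"] + usage["completion_tokens"]
            yield chunk  # forward to the client
    finally:
        # Runs on normal completion AND when the client disconnects
        # (the generator is closed mid-iteration).
        if not tokens_used:
            prompt_tokens = sum(count_tokens(m["content"]) for m in messages)
            tokens_used = prompt_tokens + count_tokens(full_content)
        record_usage(tokens_used)
```

If the consumer calls .close() on this generator after a partial read (which is what most web frameworks do when the connection drops), the finally block still fires and records the fallback count.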

βœ”οΈ Result

  • βœ”οΈ tokens tracked even if stream is interrupted
  • βœ”οΈ no free token exploits
  • βœ”οΈ accurate usage tracking

📌 Takeaway

Streaming APIs don’t guarantee usage data.

If you’re building anything with token limits or billing, don’t depend only on the final chunk; keep a server-side fallback count.
