DEV Community

Abhishek solanki

🧠 Streaming LLM APIs Can Quietly Give Free Tokens

📌 The Problem

Most OpenAI-compatible APIs report token usage only in the final chunk of a stream (OpenAI, for instance, only includes a usage object when stream_options={"include_usage": True} is set, and it arrives after the last content chunk).

So if a user:

  • refreshes the page
  • closes the tab mid-stream

👉 the stream stops
👉 usage data never arrives

But the user has already seen part of the response.
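To make the failure mode concrete, here is a minimal sketch. The chunk dicts are hypothetical stand-ins for the SSE events an OpenAI-compatible API sends; the point is that usage only rides on the very last chunk, so an interrupted read never sees it:

```python
# Hypothetical chunks mimicking an OpenAI-compatible stream:
# content chunks carry usage=None; only the final chunk has usage.
chunks = [
    {"content": "Hello", "usage": None},
    {"content": " world", "usage": None},
    {"content": "", "usage": {"prompt_tokens": 12, "completion_tokens": 2}},
]

def consume(stream, interrupt_after=None):
    """Read a stream; return (text_seen, usage).
    interrupt_after simulates a user refreshing or closing the tab."""
    text, usage = "", None
    for i, chunk in enumerate(stream):
        text += chunk["content"]
        if chunk["usage"] is not None:
            usage = chunk["usage"]
        if interrupt_after is not None and i + 1 >= interrupt_after:
            break  # client disconnected mid-stream
    return text, usage

full_text, full_usage = consume(chunks)                      # completes: usage present
part_text, part_usage = consume(chunks, interrupt_after=2)   # interrupted: usage is None
```

Here part_text is "Hello world" while part_usage is None: the user saw real output, but nothing was ever recorded.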


⚠️ Why This Matters

  • ❌ tokens not recorded
  • ❌ users get partial responses for free
  • ❌ can be abused
  • ❌ billing becomes inaccurate

🚀 The Fix

I added a fallback in the finally block of the streaming loop.

👉 If no usage data is received:

  • calculate tokens manually using tiktoken
  • count prompt + generated output
import tiktoken

# Fallback: the stream ended without a usage chunk, so estimate locally
if not tokens_used:
    enc = tiktoken.get_encoding("cl100k_base")
    prompt_tokens = sum(len(enc.encode(m["content"])) for m in messages)
    output_tokens = len(enc.encode(full_content))
    tokens_used = prompt_tokens + output_tokens
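Putting it together, the whole pattern fits in one generator. This is a minimal sketch, not my exact code: the names (stream_with_usage, count_tokens, record_usage) are illustrative, and count_tokens stands in for tiktoken's cl100k_base encoder so the example runs without external dependencies. The key property is that finally runs even when the client disconnects and the generator is closed:

```python
def stream_with_usage(chunks, messages, count_tokens, record_usage):
    """Relay a token stream and guarantee usage gets recorded.

    chunks yields dicts with "content" and an optional "usage" field;
    count_tokens is a callable (in practice, wrapping tiktoken's
    cl100k_base encoder); record_usage persists the final count.
    """
    full_content = ""
    tokens_used = 0
    try:
        for chunk in chunks:
            full_content += chunk.get("content", "")
            usage = chunk.get("usage")
            if usage:
                tokens_used = usage["prompt_tokens"] + usage["completion_tokens"]
            yield chunk  # forward to the client
    finally:
        # Runs on normal completion AND when the client disconnects
        # (the generator is closed mid-iteration).
        if not tokens_used:
            prompt_tokens = sum(count_tokens(m["content"]) for m in messages)
            tokens_used = prompt_tokens + count_tokens(full_content)
        record_usage(tokens_used)
```

If the consumer calls .close() on this generator after a partial read (which is what most web frameworks do when the connection drops), the finally block still fires and records the fallback count.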

βœ”οΈ Result

  • βœ”οΈ tokens tracked even if stream is interrupted
  • βœ”οΈ no free token exploits
  • βœ”οΈ accurate usage tracking

📌 Takeaway

Streaming APIs don’t guarantee usage data.

If you’re building anything with token limits or billing, don’t depend only on the final chunk; keep a server-side fallback count.
