An uncle-nephew conversation about the new line item on every developer's bill: LLM tokens.
Uncle, I burned ₹1000 in 4 runs
An uncle-nephew conversation on why some AI calls cost paise, and others cost thousands.
👦 Nephew: Uncle, I have a problem. I put ₹1000 of my own money into the AI API this month and it was gone in four runs. I don't even know what happened.
👨🦳 Uncle: Sit down, beta. Drink chai first. Then tell me everything you built this month, from the beginning.
👦 Nephew: Okay. I built four things. Three of them were fine. One of them ate the whole month.
👨🦳 Uncle: Good. We'll go one by one. Start with the easy one.
Project one: image text extraction
👦 Nephew: First one was simple. User uploads a photo — invoice, prescription, screenshot, whatever. I send it to the vision model. It returns the text. Done.
👨🦳 Uncle: How many calls per user?
👦 Nephew: One. Maybe two if the image is bad and they retry.
👨🦳 Uncle: And the cost?
👦 Nephew: Honestly, kaka, almost nothing. A few paise per image. I scanned around 200 images during testing and barely touched ₹50.
👨🦳 Uncle: Now stop and think. Why was it so cheap?
👦 Nephew: Aha… because it's just one image in, some text out. One round trip.
👨🦳 Uncle: Exactly. One input, one output, one transaction. Like sending one SMS. You know what you're paying before you press send. Keep this model in your head. It's the cheapest pattern that exists.
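The "one SMS" model can be written down as a tiny function. This is a minimal sketch. The prices are placeholders, not any provider's real rates. The token counts for the image and the extracted text are also assumptions, because each vision model bills images differently.

```python
def call_cost(input_tokens: int, output_tokens: int,
              price_in_per_m: float, price_out_per_m: float) -> float:
    """Cost of one LLM call, in the same currency as the per-million-token prices."""
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000


# Hypothetical prices, in ₹ per million tokens (placeholders only).
PRICE_IN = 100.0
PRICE_OUT = 400.0

# Assumed OCR call: the image is billed as ~1,000 input tokens,
# and the extracted text comes back as ~300 output tokens.
ocr_cost = call_cost(1_000, 300, PRICE_IN, PRICE_OUT)
print(f"₹{ocr_cost:.2f} per image")
```

With one call and fixed sizes, you can read the cost off before you press send.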
Project two: a customer support bot
👦 Nephew: Second one was a support bot for a client. User types a question. Bot replies.
👨🦳 Uncle: Where does the bot get its answers from?
👦 Nephew: I wrote a system prompt — around 800 words — explaining the product, common questions, tone of voice. Every user message goes in with that prompt. Bot replies.
👨🦳 Uncle: And the cost?
👦 Nephew: A bit higher than OCR. Each conversation was maybe ₹2 to ₹3. Total for the month, around ₹200 across roughly a hundred conversations.
👨🦳 Uncle: Tell me — why is this one costlier than the image project?
👦 Nephew: (thinking) …because the system prompt is going every time?
👨🦳 Uncle: Yes. You're paying for the same 800 words of instructions on every single message. The user sends one message; you pay for 800 words plus their message. But conversations are short, so it doesn't get out of hand.
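Here is a minimal sketch of that overhead. It assumes the rough 1.3 tokens-per-word rule of thumb the uncle gives later. The `history_words` parameter is an illustrative addition, not something the nephew described: it shows what happens if the bot also resends earlier turns.

```python
TOKENS_PER_WORD = 1.3   # rough rule of thumb for English text
SYSTEM_PROMPT_WORDS = 800


def words_to_tokens(words: float) -> int:
    return round(words * TOKENS_PER_WORD)


def message_input_tokens(user_message_words: int, history_words: int = 0) -> int:
    """Input tokens billed for one bot reply.

    The 800-word system prompt is sent with every message. If earlier turns are
    resent as history, they are billed again on every reply too.
    """
    return words_to_tokens(SYSTEM_PROMPT_WORDS + history_words + user_message_words)


for question_words in (20, 50):
    billed = message_input_tokens(question_words)
    prompt_share = words_to_tokens(SYSTEM_PROMPT_WORDS) / billed
    print(f"{question_words}-word question: {billed} input tokens, "
          f"{prompt_share:.0%} of them are the system prompt")
```

The user's question is a small fraction of what you pay for. In short conversations the history term stays small, which is why the bot stays cheap.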
👦 Nephew: And the bot replies are also small — 100, 150 words.
👨🦳 Uncle: Small input, small output, slightly bigger envelope than OCR. Still safe. Now the next project.
Project three: document Q&A with RAG
👦 Nephew: Third one was a document Q&A. User uploads a 50-page PDF, asks questions, gets answers from inside the document.
👨🦳 Uncle: Smart. How did you handle the document?
👦 Nephew: I chunked the PDF, embedded each chunk, stored them in a vector database. When the user asks, I pull the top 5 relevant chunks and send them with the question to the LLM.
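The retrieval step the nephew describes can be sketched with plain Python. This is a sketch under assumptions, not his implementation:

- An in-memory list stands in for the vector database.
- Tiny hand-written vectors stand in for real embeddings from an embedding model.
- `top_k_chunks` and `build_prompt` are hypothetical helper names.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec: list[float], chunk_vecs: list[list[float]],
                 chunks: list[str], k: int = 5) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(zip(chunk_vecs, chunks),
                    key=lambda pair: cosine(query_vec, pair[0]),
                    reverse=True)
    return [chunk for _, chunk in ranked[:k]]


def build_prompt(question: str, retrieved: list[str]) -> str:
    """Only the retrieved chunks go to the LLM, never the whole document."""
    context = "\n\n".join(retrieved)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"


# Toy data: 3-dimensional "embeddings" instead of real model output.
chunks = ["refund policy", "shipping times", "warranty terms"]
chunk_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
retrieved = top_k_chunks([1.0, 0.0, 0.0], chunk_vecs, chunks, k=2)
prompt = build_prompt("Can I return this?", retrieved)
```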
👨🦳 Uncle: Beta, that's actually a sensible design. Cost?
👦 Nephew: ₹3 to ₹4 per question. The project as a whole cost me around ₹300 across some testing and a few real users.
👨🦳 Uncle: Why higher than the support bot?
👦 Nephew: Because each question carries 5 chunks of document context. Maybe 3000 words of input on every call.
👨🦳 Uncle: Correct. Bigger input on every call. But notice what you didn't do — you didn't send the entire 50-page PDF every time. You sent only the 5 most relevant pieces. That's a smart RAG.
👦 Nephew: What's a dumb one?
👨🦳 Uncle: A dumb RAG sends the whole 50-page document every time the user asks anything. Some people actually build this. They pay for 50 pages of input when they could have paid for 5. Don't be that person.
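A quick way to see the gap is to count input tokens per question both ways. The page density and chunk size below are assumptions, chosen so that five chunks land near the story's ~3000 words. The system prompt is left out because it is the same in both designs.

```python
TOKENS_PER_WORD = 1.3
WORDS_PER_PAGE = 500    # assumed density of a PDF page
CHUNK_WORDS = 600       # assumed chunk size; 5 chunks ≈ 3000 words
QUESTION_WORDS = 30


def input_tokens(context_words: int, question_words: int = QUESTION_WORDS) -> int:
    return round((context_words + question_words) * TOKENS_PER_WORD)


smart = input_tokens(5 * CHUNK_WORDS)       # top-5 retrieved chunks
dumb = input_tokens(50 * WORDS_PER_PAGE)    # all 50 pages, every question

print(f"smart RAG: {smart} tokens/question, "
      f"dumb RAG: {dumb} tokens/question, {dumb / smart:.1f}x more")
```

Under these assumptions, the dumb design pays for roughly eight times more input on every question, for the same answer.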
👦 Nephew: Noted.
Project four: the disaster
👨🦳 Uncle: Now tell me about the project that ate the rest of your money.
👦 Nephew: (sighs) Job search agent. Idea was — user uploads their resume, my agent searches the web for matching jobs, returns the top ones with reasoning. Like a personal recruiter.
👨🦳 Uncle: How does the agent search?
👦 Nephew: It has a web search tool. Searches job boards, company career pages — whatever comes up. Clicks into postings, reads them, decides if they match the resume, then either suggests them or searches again with refined keywords.
👨🦳 Uncle: How many of these search-and-read cycles per user?
👦 Nephew: I didn't think about it carefully. The agent does around 5 searches. Each search returns 10 results. Then it fetches the most promising 3 or 4 pages in full. Reasons over them. Maybe searches again.
👨🦳 Uncle: Cost per user?
👦 Nephew: …₹200 to ₹250 per user, kaka.
👨🦳 Uncle: (silence)
👦 Nephew: I tested it four times to debug. ₹1000 gone. And it didn't even get anyone a job. It just suggested some.
Where the money actually went
👨🦳 Uncle: Beta, listen carefully now. I want you to understand what really happened, because it will change how you think about every AI project from now on.
👦 Nephew: Tell me.
👨🦳 Uncle: Look at the four projects we discussed.
```
OCR:               [image] → [text]                           one call
Support bot:       [prompt + query] → [answer]                one call
RAG:               [prompt + chunks + query] → [answer]       one call
Web search agent:  [prompt + query] → [search results]
                   → [prompt + results] → [decide to fetch]
                   → [prompt + results + FULL PAGE] → [decide next]
                   → [prompt + results + page + ANOTHER PAGE] → ...
                   → and on, and on
```
👦 Nephew: Oh… the last one is a loop.
👨🦳 Uncle: Yes. And here is the part that breaks beginners. Every call inside that loop includes everything that came before it.
👦 Nephew: Wait, what?
👨🦳 Uncle: When the agent makes its second decision, it must see what the first search returned. So the second call carries the first search's results. When it makes the third decision, it must see the first two searches plus whatever pages were fetched in between. By the fifth step, the input on each call has grown to thousands and thousands of words.
👦 Nephew: So the context keeps stacking.
👨🦳 Uncle: And web pages are huge. Even a stripped-down job posting page is easily 2000 to 5000 words. Fetch 3 or 4 of those and, once they are in the context, your agent is reasoning over 15,000 to 20,000 words on every step. Five steps of looping like that, and you're at 80,000+ words of input billed for one user session.
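A toy simulation makes the stacking visible. It rests on these assumptions:

- Each call resends the full history, which is the common pattern for chat-style tool loops.
- The system prompt plus resume is 1,000 words.
- A page of search results is 1,000 words and a fetched page is 4,000 words.
- The step sequence (5 searches, 4 page fetches) is illustrative.

```python
def billed_input_per_call(base_words: int, tool_results: list[int]) -> list[int]:
    """Words of input billed on each call of an agent loop that resends its history.

    Call 1 sees only the base prompt. After each call, the tool's result is
    appended, so every later call is billed for the base prompt plus every
    earlier result. One final call reasons over everything.
    """
    billed = []
    context = base_words
    for result_words in tool_results:
        billed.append(context)      # bill this call for everything gathered so far
        context += result_words     # the tool output joins the context for the next call
    billed.append(context)          # final answer call
    return billed


SEARCH, PAGE = 1_000, 4_000   # assumed sizes, in words
# 5 searches and 4 full-page fetches, roughly like the story's agent
steps = [SEARCH, PAGE, PAGE, SEARCH, PAGE, SEARCH, SEARCH, PAGE, SEARCH]

per_call = billed_input_per_call(1_000, steps)
total = sum(per_call)
unique = 1_000 + sum(steps)
print(f"{len(per_call)} calls, {total:,} words billed "
      f"for {unique:,} words of actual content")
```

In this run, only 22,000 distinct words ever enter the context, yet about 121,000 words are billed as input. The same text is paid for again on every later call.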
👦 Nephew: Compared to OCR…
👨🦳 Uncle: Compared to OCR, your web search agent cost you roughly 400 times more per user. For the same kind of API.
👦 Nephew: (quietly) I see.
The one rule to remember
👨🦳 Uncle: Pick up your pen. Write this down. Stick it above your laptop.
The cost of an LLM application is not the cost of one call.
It's the cost of one call × how many times you call × how big the input has grown by then.
👨🦳 Uncle: Three things multiply together. Most beginner cost disasters come from forgetting one of them.
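The rule can also be written as a formula. The symbols are introduced here for illustration only:

- $n$ is the number of calls in a session.
- $t^{\text{in}}_i$ and $t^{\text{out}}_i$ are the input and output tokens of call $i$.
- $p_{\text{in}}$ and $p_{\text{out}}$ are prices per million tokens.

$$
C_{\text{session}} = \sum_{i=1}^{n} \frac{t^{\text{in}}_i \, p_{\text{in}} + t^{\text{out}}_i \, p_{\text{out}}}{10^6}
$$

Now assume the input starts at $t_0$ tokens and grows by roughly $\Delta$ tokens per step, as it does when an agent keeps appending results. The total input is then

$$
\sum_{i=1}^{n} t^{\text{in}}_i = \sum_{i=1}^{n} \bigl(t_0 + (i-1)\,\Delta\bigr) = n\,t_0 + \Delta\,\frac{n(n-1)}{2}
$$

so the input bill grows with the square of the number of steps, not linearly.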
👦 Nephew: Which one bit me?
👨🦳 Uncle: All three at once. Many calls because of the agent loop. Growing input because of accumulated page content. And each call seemed cheap when you tested in isolation, so you never saw the multiplication coming.
👦 Nephew: Kaka, this is like electricity. Each appliance seems small. The bill at the end of the month tells the truth.
👨🦳 Uncle: Yes! Exactly that. AI tokens are electricity, not a flat-rate water connection: variable, not fixed. Some operations are fans. Some are air conditioners running all day.
How to build the same agent without going broke
👦 Nephew: So how do I build a job search agent that doesn't bankrupt me?
👨🦳 Uncle: I'll give you five rules. Learn them properly.
Rule 1: Match the model to the job
👨🦳 Uncle: Not every step needs your most expensive model. There's a hierarchy.
👦 Nephew: Like what?
👨🦳 Uncle: Extracting text from an image? Use a smaller, cheaper vision model. The task is mechanical. Classifying whether a job posting matches a resume? A small model is more than enough. Writing the final personalized recommendation letter to the user? Now use your best model. That's the part where quality is visible.
👦 Nephew: I was using the smartest, most expensive model for every step.
👨🦳 Uncle: I know. Most people do. It's like using a JCB to plant a tulsi plant.
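One way to make "match the model to the job" explicit is a routing table from pipeline step to model tier. This is a minimal sketch under assumptions:

- `small-model` and `large-model` are placeholder names, not real model IDs.
- The prices are made up for illustration.
- Unknown tasks fall back to the cheap tier, so escalating to the expensive model is always a deliberate choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str               # placeholder, not a real model ID
    price_in_per_m: float   # ₹ per million input tokens (assumed)
    price_out_per_m: float  # ₹ per million output tokens (assumed)


SMALL = ModelTier("small-model", 15.0, 60.0)
LARGE = ModelTier("large-model", 300.0, 1200.0)

# Each step gets the cheapest tier that does the job well enough.
ROUTES = {
    "ocr": SMALL,
    "match_job_to_resume": SMALL,
    "write_recommendation": LARGE,   # the step where quality is visible
}


def pick_model(task: str) -> ModelTier:
    return ROUTES.get(task, SMALL)


def step_cost(task: str, input_tokens: int, output_tokens: int) -> float:
    m = pick_model(task)
    return (input_tokens * m.price_in_per_m + output_tokens * m.price_out_per_m) / 1_000_000


# Classifying 30 postings (~1,300 tokens each, short yes/no answer)
match_cost = 30 * step_cost("match_job_to_resume", 1_300, 10)
```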
Rule 2: Don't pass full pages — pass extracts
👨🦳 Uncle: If your agent fetches a 5000-word page, do not shove all 5000 words into the next reasoning step.
👦 Nephew: Then what?
👨🦳 Uncle: Run a cheap model first to pull only the relevant 200 words. Pass those 200 words to your expensive reasoning model. Two-stage processing like this can cut 80–90% of the cost of that step.
```
Bad:  [agent: best model] reads 5000-word page → decides
Good: [extractor: cheap model] pulls 200 relevant words
      → [agent: best model] reads 200 words → decides
```
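The arithmetic behind two-stage processing, as a sketch. The prices for the cheap and best models are invented, and the size of the saving depends almost entirely on the price ratio between them. Note that the cheap model still reads the whole page; the saving comes from reading it at the cheaper rate.

```python
TOKENS_PER_WORD = 1.3
CHEAP_IN, CHEAP_OUT = 15.0, 60.0     # ₹ per million tokens (assumed)
BEST_IN, BEST_OUT = 300.0, 1200.0    # ₹ per million tokens (assumed)


def cost(in_words: float, out_words: float, p_in: float, p_out: float) -> float:
    return (in_words * TOKENS_PER_WORD * p_in + out_words * TOKENS_PER_WORD * p_out) / 1_000_000


PAGE_WORDS, EXTRACT_WORDS, DECISION_WORDS = 5000, 200, 100

# Bad: the best model reads the whole page and decides.
one_stage = cost(PAGE_WORDS, DECISION_WORDS, BEST_IN, BEST_OUT)

# Good: the cheap model reads the page and writes a 200-word extract,
# then the best model reads only the extract and decides.
two_stage = (cost(PAGE_WORDS, EXTRACT_WORDS, CHEAP_IN, CHEAP_OUT)
             + cost(EXTRACT_WORDS, DECISION_WORDS, BEST_IN, BEST_OUT))

savings = 1 - two_stage / one_stage
print(f"one-stage ₹{one_stage:.3f}, two-stage ₹{two_stage:.3f}, saving {savings:.0%}")
```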
Rule 3: Cap the loop
👨🦳 Uncle: Set a maximum number of iterations on your agent. Three searches. Four. Not "keep going until satisfied."
👦 Nephew: Why?
👨🦳 Uncle: Because an agent with no cap will happily run forever if you let it. Polite, hard-working, expensive servants need supervision.
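A capped loop can enforce two limits at once: a maximum number of steps and a ceiling on total input tokens. This is a sketch, not a real SDK:

- `decide` is a hypothetical stand-in for one model call. It returns whether the agent is done and how many tokens its tool result adds to the context.
- The limits and the 1,300-token starting context are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    max_steps: int = 4
    max_input_tokens: int = 60_000
    steps: int = 0
    input_tokens: int = 0

    def allow(self, next_call_tokens: int) -> bool:
        """True if one more call of this size stays within both limits."""
        return (self.steps < self.max_steps
                and self.input_tokens + next_call_tokens <= self.max_input_tokens)

    def record(self, call_tokens: int) -> None:
        self.steps += 1
        self.input_tokens += call_tokens


def run_agent(decide, budget: Budget) -> str:
    """Loop until the agent says it is done or the budget runs out."""
    context_tokens = 1_300  # system prompt + resume (assumed size)
    while budget.allow(context_tokens):
        budget.record(context_tokens)
        done, added = decide(context_tokens)
        if done:
            return "finished"
        context_tokens += added   # tool output is carried into the next call
    return "stopped: budget reached"


# An agent that is never "satisfied" and adds 5,000 tokens per step
budget = Budget(max_steps=4)
result = run_agent(lambda ctx: (False, 5_000), budget)
```

Checking the token ceiling before each call matters because, as the earlier simulation shows, later calls are the expensive ones.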
Rule 4: Cache like your wallet depends on it
👨🦳 Uncle: If two users have similar resumes, you're searching for the same jobs and reading the same pages. Cache the search results. Cache the extracted summaries. Cache the model's interpretations.
👦 Nephew: Disk is cheaper than tokens?
👨🦳 Uncle: Disk is cheaper than tokens by a factor of a thousand. Always.
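A minimal cache sketch, assuming results can be keyed by a normalized version of the request (lowercased, whitespace collapsed). A plain dict keeps the example self-contained. A real deployment would back it with disk, Redis, or a database and add an expiry, since job postings go stale. `ResultCache` and `fake_search` are hypothetical names.

```python
import hashlib
import json


class ResultCache:
    """Cache keyed by a hash of (kind, normalized request text)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(kind: str, payload: str) -> str:
        normalized = " ".join(payload.lower().split())
        return hashlib.sha256(json.dumps([kind, normalized]).encode()).hexdigest()

    def get_or_compute(self, kind: str, payload: str, compute):
        k = self.key(kind, payload)
        if k in self._store:
            self.hits += 1          # served from cache: zero tokens spent
            return self._store[k]
        self.misses += 1
        value = compute(payload)    # the expensive part: search, fetch, or LLM call
        self._store[k] = value
        return value


calls = []

def fake_search(query: str) -> list[str]:
    calls.append(query)             # stand-in for a paid search or model call
    return ["job-1", "job-2"]


cache = ResultCache()
a = cache.get_or_compute("search", "Python backend jobs Pune", fake_search)
b = cache.get_or_compute("search", "python  backend   jobs pune", fake_search)
```

The same pattern works for extracted summaries and model interpretations. Use a different `kind` so they never collide.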
Rule 5: Use specialized APIs, not generic crawling
👨🦳 Uncle: This was the biggest mistake in your job project.
👦 Nephew: Tell me.
👨🦳 Uncle: You were doing general web search and crawling job board HTML. That is the most expensive path there is. Many job platforms and aggregators offer official APIs or structured data feeds, though access is often limited to approved partners, so check what LinkedIn, Naukri, Indeed, or Glassdoor actually offer today before you build.
👦 Nephew: Even paid?
👨🦳 Uncle: Even paid. Pay ₹1000 for an API subscription instead of ₹250 per user for crawling. You break even at the fourth user, and every user after that is savings.
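The break-even point is a one-line calculation. This sketch uses the story's ₹1000 and ₹250 figures. It assumes the per-user cost on the API path is close to zero, which is a simplification: the LLM still has to reason over the API results, just far fewer tokens of them.

```python
import math


def break_even_users(fixed_cost: float, per_user_saving: float) -> int:
    """Smallest number of users at which a fixed cost has paid for itself."""
    return math.ceil(fixed_cost / per_user_saving)


api_subscription = 1000     # ₹, from the story
crawl_cost_per_user = 250   # ₹ per user, from the story
api_cost_per_user = 0       # simplifying assumption

users = break_even_users(api_subscription, crawl_cost_per_user - api_cost_per_user)
print(users)
```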
👦 Nephew: I thought building my own crawler was the clever, free way.
👨🦳 Uncle: It was the expensive way wearing a free costume.
The bigger lesson
👦 Nephew: Kaka, I think I've been treating the AI API like a database. Like a fixed cost.
👨🦳 Uncle: And what is it actually?
👦 Nephew: More like electricity. The more you do, the more it draws. Some operations draw a lot more than others.
👨🦳 Uncle: Now you're thinking like an engineer, not just a developer.
👦 Nephew: What's the difference?
👨🦳 Uncle: A developer asks, "does this work?" An engineer asks, "does this work, and what does it cost to run for ten thousand users?" Both questions matter. But only one of them will tell you whether your product survives past the third month.
The math before you build
👦 Nephew: So before I build my next AI project, what do I do?
👨🦳 Uncle: Estimate four numbers for your worst-case user.
👦 Nephew: Tell me.
👨🦳 Uncle:
- How many model calls per session.
- How big the input is on each call (in words; for English text, roughly 1 word ≈ 1.3 tokens).
- How big the output is on each call.
- What you're paying per million input and per million output tokens for your chosen model.
👨🦳 Uncle: Multiply them out. If the per-user cost is more than what the user pays you — or more than you can absorb as a free tier — you don't have a product. You have a charity that will close in two months.
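The four numbers fit in a few lines of Python. This is a sketch of the uncle's spreadsheet under stated assumptions:

- `input_words_per_call` is the *average* over the session. In an agent loop that average is much larger than the first call, so estimate it from the growing context.
- The example figures (10 calls, 12,000 words average input, placeholder prices, 4 sessions and ₹499 revenue per user per month) are invented for illustration.

```python
from dataclasses import dataclass

TOKENS_PER_WORD = 1.3  # rough rule of thumb for English text


@dataclass
class SessionEstimate:
    calls_per_session: int
    input_words_per_call: float    # average over the session, worst-case user
    output_words_per_call: float
    price_in_per_m: float          # per million input tokens
    price_out_per_m: float         # per million output tokens

    def cost_per_session(self) -> float:
        tin = self.input_words_per_call * TOKENS_PER_WORD
        tout = self.output_words_per_call * TOKENS_PER_WORD
        per_call = (tin * self.price_in_per_m + tout * self.price_out_per_m) / 1_000_000
        return self.calls_per_session * per_call


def is_viable(est: SessionEstimate, sessions_per_user_per_month: int,
              revenue_per_user_per_month: float) -> tuple[bool, float]:
    """Compare monthly cost per user against what that user pays you."""
    monthly_cost = est.cost_per_session() * sessions_per_user_per_month
    return monthly_cost <= revenue_per_user_per_month, monthly_cost


agent = SessionEstimate(calls_per_session=10, input_words_per_call=12_000,
                        output_words_per_call=300,
                        price_in_per_m=300.0, price_out_per_m=1200.0)
ok, monthly = is_viable(agent, sessions_per_user_per_month=4,
                        revenue_per_user_per_month=499)
print(f"₹{agent.cost_per_session():.2f} per session, ₹{monthly:.2f} per user per month, viable: {ok}")
```

If `ok` comes back `False` for your worst-case user, that is the charity the uncle warns about.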
👦 Nephew: Or worse, a bill that goes to my father.
👨🦳 Uncle: (laughs) Don't make me explain this conversation to him.
A short summary for the next nephew
👦 Nephew: Kaka, can you give me a small table I can keep in my notes?
👨🦳 Uncle: Take this.
| Project type | Pattern | Cost behaviour |
|---|---|---|
| Image OCR | One image → one output | Cheap, predictable |
| Support bot | Prompt + question → answer | Cheap, mildly accumulating |
| Document RAG | Prompt + retrieved chunks + question → answer | Moderate, scales with chunk size and count |
| Agent + web search | Many loops with growing context | Expensive, scales explosively |
👨🦳 Uncle: The hierarchy is not about technology being "advanced" or "basic." It is about how many tokens flow through the model per user, per session. Some patterns naturally produce a small flow. Others produce a flood.
👦 Nephew: Know which one I'm building. Do the math first. Pick the right model for each step. Cache aggressively.
👨🦳 Uncle: And one last thing.
👦 Nephew: What?
👨🦳 Uncle: Next time, build a small calculator before you build the agent. One spreadsheet cell would have saved you a thousand rupees.
👦 Nephew: (laughs) Noted, kaka.
If you found this useful, I write about backend engineering, system design, and these kinds of practical lessons from real projects. Less noise, more action.
— until the next nephew project goes sideways.