I built an AI chat over my CV on a zero-pound inference budget

Hiten Patel — Thu, 11 Jun 2026 09:37:11 +0000

My CV is a PDF, and PDFs do not answer questions. So I built ask.hiten.dev: a streaming chat grounded in my actual career history, where a recruiter can ask "why should I hire you over another senior frontend engineer?" and get a real answer.

The constraint that made it interesting: the total inference budget is zero. No OpenAI bill, no hosted vector DB, nothing. Here is what that actually took.

Four free providers and a failover chain

No single free tier is reliable enough to put in front of strangers. Groq's free tier caps at 100k tokens/day, and I hit that cap on day one. OpenRouter's free models come and go. Cerebras occasionally queues you out at busy times.

The fix is boring and effective: an ordered provider chain, all OpenAI-compatible, walked per-request until one answers.

Groq (llama-3.3-70b) -> OpenRouter (gpt-oss-120b:free) -> NVIDIA (llama-3.3-70b) -> Cerebras (gpt-oss-120b)

Each provider is just a base URL, a key and a model name. The API route tries each in order; the first 2xx with a body wins, and the response streams straight through. The client gets an X-Provider header so I can see who served what in the logs.
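As a rough sketch, the chain walk could look like this in TypeScript. The Provider and UpstreamReply shapes and the injected send function are assumptions made for illustration; the post only states that a provider is a base URL, a key and a model name, and that the first 2xx with a body wins.

```typescript
// Hypothetical shape: a base URL, a key and a model name, plus a label for logs.
interface Provider {
  name: string;
  baseUrl: string;
  apiKey: string;
  model: string;
}

// Minimal view of an upstream reply; a real route would hold a fetch Response here.
interface UpstreamReply {
  status: number;
  body: string | null;
}

// Injected so the chain logic can be exercised without network access.
type Send = (provider: Provider) => Promise<UpstreamReply>;

interface ChainResult {
  provider: string; // candidate value for the X-Provider header
  reply: UpstreamReply;
}

// Walk the chain in order; the first 2xx reply that has a body wins.
async function firstWorkingProvider(
  chain: Provider[],
  send: Send,
): Promise<ChainResult | null> {
  for (const provider of chain) {
    try {
      const reply = await send(provider);
      const is2xx = reply.status >= 200 && reply.status < 300;
      if (is2xx && reply.body !== null) {
        return { provider: provider.name, reply };
      }
      // Non-2xx (for example a rate limit) or empty body: try the next provider.
    } catch {
      // Network error or timeout: try the next provider.
    }
  }
  return null; // every provider failed
}
```

In a real route, send would presumably wrap a fetch call to the provider's chat completions endpoint, and a null result would map to an error response.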

Two details that mattered:

Empty env vars are not unset. Docker Compose's ${VAR:-} yields an empty string, and ?? in Node only falls back on null or undefined, so the default never applies. Every key goes through a helper that coerces "" to undefined, otherwise a provider with no key "exists" and fails every request.
You cannot cheaply probe a tokens-per-day (TPD) cap. My health check hits GET /models on each provider (auth check, 60s cache). It tells you "key works, service up", not "you have tokens left". The failover chain covers the gap: a TPD-capped provider fails fast and the next one picks up.
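The empty-string trap is easy to reproduce. A minimal sketch of such a helper follows; readEnv and the variable names are hypothetical, and in a Node process the first argument would be process.env.

```typescript
type Env = Record<string, string | undefined>;

// Treat "" exactly like an unset variable.
function readEnv(env: Env, name: string): string | undefined {
  const value = env[name];
  return value === "" ? undefined : value;
}

// Hypothetical variable names, mimicking what ${VAR:-} produces for unset values.
const env: Env = { GROQ_API_KEY: "", GROQ_MODEL: "" };

// ?? only catches null and undefined, so the empty string slips through.
const naive = env.GROQ_MODEL ?? "llama-3.3-70b";

// With the helper, the default applies as intended.
const fixed = readEnv(env, "GROQ_MODEL") ?? "llama-3.3-70b";

// A provider whose key is "" is now correctly reported as not configured.
const hasGroq = readEnv(env, "GROQ_API_KEY") !== undefined;
```

Here naive is "", fixed is "llama-3.3-70b", and hasGroq is false.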

If every provider is down, the page itself says so. The health check runs server-side at render time, and instead of a broken chat you get a short maintenance note. Never ship a chat UI that can fail after the user has typed.
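One way to sketch the cached health check that gates the page: the probe function is injected and stands in for the GET /models call, and createHealthCheck is a hypothetical name. The clock is a parameter only so that the cache can be tested.

```typescript
// Resolves true when the provider's auth check (for example GET /models) succeeds.
type Probe = (providerName: string) => Promise<boolean>;

const HEALTH_TTL_MS = 60_000; // the 60s cache described above

function createHealthCheck(
  providers: string[],
  probe: Probe,
  now: () => number = () => Date.now(),
) {
  let cached: { at: number; anyUp: boolean } | null = null;

  return async function anyProviderUp(): Promise<boolean> {
    // Serve the cached verdict while it is fresh.
    if (cached !== null && now() - cached.at < HEALTH_TTL_MS) {
      return cached.anyUp;
    }
    // Probe all providers in parallel; a thrown error counts as "down".
    const results = await Promise.all(
      providers.map((name) => probe(name).catch(() => false)),
    );
    cached = { at: now(), anyUp: results.some(Boolean) };
    return cached.anyUp;
  };
}
```

At render time the page would call anyProviderUp() and show either the chat or the maintenance note. As the post says, a true result only means "key works, service up", so the failover chain is still needed for capped providers.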

Open-weight models do not follow formatting orders

My site's voice avoids em dashes and curly quotes everywhere. The system prompt says, in increasingly desperate ways, "plain ASCII punctuation only". Llama 3.3 mostly complies. gpt-oss-120b absolutely does not.

Instead of fighting the model, I rewrite the stream at the proxy. Each SSE data: chunk gets parsed, the delta content normalised (curly quotes to straight, every dash variant handled, NBSP, ellipsis, arrows, bullets), and re-emitted. The client never sees the model's typographic opinions.
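A sketch of that rewrite step, assuming the standard OpenAI-compatible streaming shape (choices[0].delta.content inside each data: line). The replacement table is an assumption: the post names the categories but not the exact mappings, and both function names are hypothetical.

```typescript
// Assumed mapping from typographic characters to plain ASCII.
const REPLACEMENTS: Array<[RegExp, string]> = [
  [/[\u2018\u2019]/g, "'"],        // curly single quotes
  [/[\u201C\u201D]/g, '"'],        // curly double quotes
  [/[\u2010-\u2015\u2212]/g, "-"], // hyphen and dash variants, minus sign
  [/\u00A0/g, " "],                // non-breaking space
  [/\u2026/g, "..."],              // ellipsis
  [/\u2192/g, "->"],               // right arrow
  [/\u2022/g, "-"],                // bullet
];

function normalizePunctuation(text: string): string {
  let out = text;
  for (const [pattern, plain] of REPLACEMENTS) {
    out = out.replace(pattern, plain);
  }
  return out;
}

// Rewrite one complete SSE line; anything that is not a JSON data chunk passes through.
function rewriteSseLine(line: string): string {
  if (!line.startsWith("data: ") || line === "data: [DONE]") return line;
  try {
    const chunk = JSON.parse(line.slice("data: ".length));
    const delta = chunk?.choices?.[0]?.delta;
    if (delta && typeof delta.content === "string") {
      delta.content = normalizePunctuation(delta.content);
    }
    return "data: " + JSON.stringify(chunk);
  } catch {
    return line; // not valid JSON: leave the line untouched
  }
}
```

Every mapping here works on a single code point, so a delta can be normalised without looking at its neighbours. The proxy still has to buffer network chunks into complete lines before calling rewriteSseLine, since a read can end mid-line.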

The same idea applies more broadly: anything you must guarantee about model output, enforce in code after the model, not in the prompt.

Grounding without a vector DB

There is no RAG here. The grounding document is a hand-written ~3.5k-token version of my CV, injected as the system prompt. At this scale a vector store is overengineering: the whole corpus fits in context with room to spare.

The hard part was stopping hallucinated specifics. Early versions confidently invented user counts for my projects. The fix was a whitelist: the prompt lists the only numbers the model may state about my career, and instructs it to answer honest gaps with one line and a redirect to an actual strength. A 15-test Playwright suite asserts forbidden characters never appear, suspect numbers never appear, and the framing stays right.
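The "suspect numbers never appear" assertion can be approximated with a small pure function that a test calls on each reply. This is a sketch, not the actual suite: findUnlistedNumbers is a hypothetical helper, and the allowed values below are placeholders rather than real figures from the CV.

```typescript
// Return every number in the reply that is not on the allowlist.
// Thousands separators are stripped, so "50,000" is compared as "50000".
function findUnlistedNumbers(reply: string, allowed: ReadonlySet<string>): string[] {
  const found: string[] = reply.match(/\d+(?:[.,]\d+)*/g) ?? [];
  return found
    .map((n) => n.replace(/,/g, ""))
    .filter((n) => !allowed.has(n));
}

// Placeholder allowlist for illustration only.
const ALLOWED_NUMBERS: ReadonlySet<string> = new Set(["10", "3"]);

const suspects = findUnlistedNumbers(
  "10 years of experience, 50,000 monthly users",
  ALLOWED_NUMBERS,
);
// suspects is ["50000"]: the user count is not on the allowlist.
```

An end-to-end test would then assert that the list is empty for every reply it collects.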

Making the common path free

Most visitors click one of the starter chips ("why hire you?", "core stack", "available now?"). Those prompts are fixed strings, so their responses are cached in memory for an hour. The first visitor pays the tokens; everyone else gets the answer almost instantly, with no upstream call, and X-Provider: cache. Follow-up suggestions reuse the same chips, so they hit the cache too.
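A minimal sketch of that cache, assuming the full response text is stored once the stream has finished. createTtlCache and createCachedAnswer are hypothetical names, and the chip labels stand in for the real prompt strings.

```typescript
const ONE_HOUR_MS = 60 * 60 * 1000;

// Tiny in-memory cache with a fixed time-to-live; the clock is injectable for tests.
function createTtlCache(ttlMs: number, now: () => number = () => Date.now()) {
  const entries = new Map<string, { value: string; expiresAt: number }>();
  return {
    get(key: string): string | undefined {
      const entry = entries.get(key);
      if (entry === undefined) return undefined;
      if (now() >= entry.expiresAt) {
        entries.delete(key); // expired: the next request refills it
        return undefined;
      }
      return entry.value;
    },
    set(key: string, value: string): void {
      entries.set(key, { value, expiresAt: now() + ttlMs });
    },
  };
}

// Only fixed starter prompts are cacheable; free-form questions always go upstream.
const STARTERS: ReadonlySet<string> = new Set(["why hire you?", "core stack", "available now?"]);

function createCachedAnswer(
  generate: (prompt: string) => Promise<string>,
  cache = createTtlCache(ONE_HOUR_MS),
) {
  return async (prompt: string): Promise<{ text: string; fromCache: boolean }> => {
    const cacheable = STARTERS.has(prompt);
    if (cacheable) {
      const hit = cache.get(prompt);
      if (hit !== undefined) return { text: hit, fromCache: true };
    }
    const text = await generate(prompt);
    if (cacheable) cache.set(prompt, text);
    return { text, fromCache: false };
  };
}
```

A route built on this would set X-Provider: cache whenever fromCache is true. Being in-memory, the cache is emptied by every restart, which is acceptable when refilling it costs one request per chip.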

History sent upstream is trimmed to a character budget (roughly 3k tokens' worth) rather than a message count, which protects the daily caps once conversations get long.
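Trimming by budget rather than count could be sketched as follows. The 4-characters-per-token ratio is a rough heuristic assumed here, not a tokenizer, and trimHistory is a hypothetical name.

```typescript
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

// Assumption: about 4 characters per token for English text.
const CHARS_PER_TOKEN = 4;
const HISTORY_BUDGET_CHARS = 3_000 * CHARS_PER_TOKEN;

// Keep the newest messages that fit the budget, in their original order.
// The latest message is always kept, even if it alone exceeds the budget.
function trimHistory(
  messages: ChatMessage[],
  budgetChars: number = HISTORY_BUDGET_CHARS,
): ChatMessage[] {
  const kept: ChatMessage[] = [];
  let used = 0;
  for (const message of messages.slice().reverse()) {
    const size = message.content.length;
    if (kept.length > 0 && used + size > budgetChars) break;
    kept.unshift(message);
    used += size;
  }
  return kept;
}
```

A message-count limit lets a few very long turns blow through the cap, while a character budget bounds the upstream cost regardless of how the conversation is shaped.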

The rest of the stack

Astro 5 SSR (node adapter) in Docker on an Oracle ARM free-tier VM
Vanilla JS client: SSE parsing, a small safe markdown renderer, sessionStorage persistence, an abort button. No framework; the whole client is one file
Playwright for E2E, Forgejo (self-hosted) for CI
Hosting cost: the VM is free tier. Inference: zero

What I would tell you to copy

Chain free providers; never depend on one
Enforce output guarantees in code, not prompts
Gate the UI on a server-side health check
Whitelist the facts your model may state about anything that matters
Cache your fixed prompts; most traffic is the same five questions

Try it: ask.hiten.dev. And if you are hiring (permanent or contract), the chat will happily explain why that is a good idea. The rest of my work lives at hiten.dev.