My CV is a PDF, and PDFs do not answer questions. So I built ask.hiten.dev: a streaming chat grounded in my actual career history, where a recruiter can ask "why should I hire you over another senior frontend engineer?" and get a real answer.
The constraint that made it interesting: the total inference budget is zero. No OpenAI bill, no hosted vector DB, nothing. Here is what that actually took.
Four free providers and a failover chain
No single free tier is reliable enough to put in front of strangers. Groq's free tier caps at 100k tokens/day, and I hit that cap on day one. OpenRouter's free models come and go. Cerebras occasionally queues you out at busy times.
The fix is boring and effective: an ordered provider chain, all OpenAI-compatible, walked per-request until one answers.
Groq (llama-3.3-70b) -> OpenRouter (gpt-oss-120b:free) -> NVIDIA (llama-3.3-70b) -> Cerebras (gpt-oss-120b)
Each provider is just a base URL, a key and a model name. The API route tries each in order; the first 2xx with a body wins, and the response streams straight through. The response also carries an X-Provider header, so I can see in the logs which provider served what.
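A minimal sketch of that walk, not the site's actual code. The names (Provider, FetchLike, firstHealthy) are hypothetical, the /chat/completions path is the usual OpenAI-compatible route and is assumed here, and the fetch function is passed in so the logic can be exercised without a network.

```typescript
// Hypothetical shape: the post only says a provider is a base URL, a key and a model.
interface Provider {
  name: string;
  baseUrl: string;
  apiKey: string;
  model: string;
}

// Minimal response and fetch types so the sketch does not depend on DOM or Node typings.
interface UpstreamResponse {
  ok: boolean; // true for a 2xx status
  status: number;
  body: unknown; // the stream to pipe through; null when there is no body
}

type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<UpstreamResponse>;

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Walk the chain in order; the first 2xx response with a body wins.
async function firstHealthy(
  providers: Provider[],
  messages: ChatMessage[],
  fetchImpl: FetchLike,
): Promise<{ provider: string; response: UpstreamResponse } | null> {
  for (const p of providers) {
    try {
      const response = await fetchImpl(`${p.baseUrl}/chat/completions`, {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          Authorization: `Bearer ${p.apiKey}`,
        },
        body: JSON.stringify({ model: p.model, messages, stream: true }),
      });
      if (response.ok && response.body != null) {
        return { provider: p.name, response };
      }
      // Non-2xx (for example a 429 from a capped tier): try the next provider.
    } catch {
      // Network error or timeout: also try the next provider.
    }
  }
  return null; // Every provider failed; the caller decides what to show.
}
```

In a route handler, the returned provider name would go into the X-Provider header and response.body would be piped to the client.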
Two details that mattered:
- Empty env vars are not unset. Docker Compose's ${VAR:-} yields an empty string, which defeats ?? defaults in Node. Every key goes through a helper that coerces "" to undefined, otherwise a provider with no key "exists" and fails every request.
- You cannot cheaply probe a token-per-day cap. My health check hits GET /models on each provider (auth check, 60s cache). It tells you "key works, service up", not "you have tokens left". The failover chain covers the gap: a TPD-capped provider fails fast and the next one picks up.
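The empty-string problem can be sketched in a few lines. The helper name envOrUndefined and the variable names in the comments are hypothetical; the only behaviour taken from the text above is coercing "" to undefined.

```typescript
// `??` only falls back on null or undefined, so an empty string from
// `${VAR:-}` would otherwise count as a configured value.
function envOrUndefined(value: string | undefined): string | undefined {
  return value === "" ? undefined : value;
}

// Hypothetical usage in a Node process (illustrative variable names):
//   const groqKey = envOrUndefined(process.env.GROQ_API_KEY);
//   const model = envOrUndefined(process.env.GROQ_MODEL) ?? "llama-3.3-70b";
// A provider whose key comes back undefined can then be left out of the chain.
```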
If every provider is down, the page itself says so. The health check runs server-side at render time, and instead of a broken chat you get a short maintenance note. Never ship a chat UI that can fail after the user has typed.
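One way the cached health check could look, as a sketch under assumptions: HealthTarget, GetLike and healthyProviders are hypothetical names, the fetch function and clock are injected for testability, and probing all providers in parallel is my choice rather than something the post states.

```typescript
interface HealthTarget {
  name: string;
  baseUrl: string;
  apiKey: string;
}

// Minimal GET signature so the sketch does not depend on DOM or Node typings.
type GetLike = (
  url: string,
  init: { headers: Record<string, string> },
) => Promise<{ ok: boolean }>;

const HEALTH_TTL_MS = 60 * 1000; // the 60s cache mentioned above
let healthCache: { at: number; healthy: string[] } | null = null;

// Returns the names of providers whose GET /models answered 2xx.
// This proves "key works, service up", not "tokens left".
async function healthyProviders(
  targets: HealthTarget[],
  fetchImpl: GetLike,
  now: number = Date.now(),
): Promise<string[]> {
  if (healthCache && now - healthCache.at < HEALTH_TTL_MS) {
    return healthCache.healthy;
  }
  const results = await Promise.all(
    targets.map(async (t) => {
      try {
        const res = await fetchImpl(`${t.baseUrl}/models`, {
          headers: { Authorization: `Bearer ${t.apiKey}` },
        });
        return res.ok ? t.name : null;
      } catch {
        return null; // unreachable provider counts as unhealthy
      }
    }),
  );
  const healthy = results.filter((n): n is string => n !== null);
  healthCache = { at: now, healthy };
  return healthy;
}
```

At render time, an empty result would be the signal to show the maintenance note instead of the chat.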
Open-weight models do not follow formatting orders
My site's voice avoids em dashes and curly quotes everywhere. The system prompt says, in increasingly desperate ways, "plain ASCII punctuation only". Llama 3.3 mostly complies. gpt-oss-120b absolutely does not.
Instead of fighting the model, I rewrite the stream at the proxy. Each SSE data: chunk gets parsed, the delta content normalised (curly quotes to straight, every dash variant handled, NBSP, ellipsis, arrows, bullets), and re-emitted. The client never sees the model's typographic opinions.
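A rough sketch of the rewrite step, assuming OpenAI-style streaming chunks with the text at choices[0].delta.content. The function names and the exact replacement choices (for example turning an em dash into " - ") are my assumptions; the post does not spell out its mapping.

```typescript
// Order matters: NBSP becomes a plain space before the dash rule runs.
const REPLACEMENTS: Array<[RegExp, string]> = [
  [/\u00A0/g, " "], // non-breaking space
  [/[\u2018\u2019\u201A]/g, "'"], // curly and low single quotes
  [/[\u201C\u201D\u201E]/g, '"'], // curly and low double quotes
  [/ ?[\u2014\u2015] ?/g, " - "], // em dash and horizontal bar, spaces collapsed
  [/[\u2010\u2011\u2012\u2013\u2212]/g, "-"], // hyphen variants, en dash, minus
  [/\u2026/g, "..."], // ellipsis
  [/\u2192/g, "->"], // right arrow
  [/\u2190/g, "<-"], // left arrow
  [/\u2022/g, "-"], // bullet
];

function toAscii(text: string): string {
  let out = text;
  for (const [pattern, replacement] of REPLACEMENTS) {
    out = out.replace(pattern, replacement);
  }
  return out;
}

// Rewrites one complete SSE line; anything that is not a JSON data line passes through.
function rewriteSseLine(line: string): string {
  const prefix = "data: ";
  if (line.indexOf(prefix) !== 0) return line;
  const payload = line.slice(prefix.length);
  if (payload === "[DONE]") return line;
  try {
    const chunk = JSON.parse(payload);
    const delta = chunk?.choices?.[0]?.delta;
    if (delta && typeof delta.content === "string") {
      delta.content = toAscii(delta.content);
    }
    return prefix + JSON.stringify(chunk);
  } catch {
    return line; // not valid JSON: leave it untouched
  }
}
```

The caller still has to buffer the upstream bytes into complete lines first, since a network chunk can end in the middle of an SSE line.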
The same idea applies more broadly: anything you must guarantee about model output, enforce in code after the model, not in the prompt.
Grounding without a vector DB
There is no RAG here. The grounding document is a hand-written ~3.5k-token version of my CV, injected as the system prompt. At this scale a vector store is overengineering: the whole corpus fits in context with room to spare.
The hard part was stopping hallucinated specifics. Early versions confidently invented user counts for my projects. The fix was a whitelist: the prompt lists the only numbers the model may state about my career, and instructs it to answer honest gaps with one line and a redirect to an actual strength. A 15-test Playwright suite asserts forbidden characters never appear, suspect numbers never appear, and the framing stays right.
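The assertions in such a suite can rest on two small pure checks. This is a sketch only: the function names are hypothetical, the number-matching regex is a simple heuristic of mine, and the allow-list passed in would be the real whitelist from the prompt, which is not reproduced here.

```typescript
// Returns every numeric token in the answer that is not on the allow-list.
// Tokens may contain commas and dots ("12,000", "3.5") but must start and end with a digit.
function findUnlistedNumbers(answer: string, allowed: string[]): string[] {
  const tokens = answer.match(/\d[\d,.]*\d|\d/g) ?? [];
  const allowedSet = new Set(allowed);
  return tokens.filter((t) => !allowedSet.has(t));
}

// Returns every character outside the ASCII range, for the punctuation checks.
function findNonAscii(answer: string): string[] {
  return answer.match(/[^\x00-\x7F]/g) ?? [];
}
```

A test would then fetch an answer from the chat and expect both functions to return an empty array.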
Making the common path free
Most visitors click one of the starter chips ("why hire you?", "core stack", "available now?"). Those prompts are fixed strings, so responses are cached in-memory for an hour. The first visitor pays the tokens; everyone else gets the answer straight from memory, with no upstream call, marked X-Provider: cache. Follow-up suggestions reuse the same chips, so they hit the cache too.
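A minimal in-memory cache with a one-hour expiry could look like this. The names are hypothetical, keying on the prompt string alone is an assumption, and the clock is a parameter so expiry can be tested.

```typescript
const CACHE_TTL_MS = 60 * 60 * 1000; // one hour, as described above

interface CacheEntry {
  answer: string;
  expiresAt: number;
}

const answerCache = new Map<string, CacheEntry>();

// Returns the cached answer, or undefined when missing or expired.
function getCached(prompt: string, now: number = Date.now()): string | undefined {
  const entry = answerCache.get(prompt);
  if (!entry) return undefined;
  if (now >= entry.expiresAt) {
    answerCache.delete(prompt); // drop stale entries lazily on read
    return undefined;
  }
  return entry.answer;
}

// Stores a complete answer once the upstream stream has finished.
function putCached(prompt: string, answer: string, now: number = Date.now()): void {
  answerCache.set(prompt, { answer, expiresAt: now + CACHE_TTL_MS });
}
```

Only the fixed chip strings should ever be passed in as keys; caching free-form questions would grow the map without helping the hit rate.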
History sent upstream is trimmed to a character budget (roughly 3k tokens' worth) rather than a message count, which protects the daily caps once conversations get long.
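Trimming by budget instead of count can be sketched as a walk from the newest message backwards. The names are hypothetical, and the suggested budget of about 12,000 characters assumes roughly four characters per token, which is a rule of thumb rather than a figure from the post.

```typescript
interface Turn {
  role: "user" | "assistant";
  content: string;
}

// Keeps the most recent turns whose combined length fits within maxChars.
// The newest turn is always kept, even if it alone exceeds the budget.
function trimHistory(history: Turn[], maxChars: number): Turn[] {
  const kept: Turn[] = [];
  let used = 0;
  for (const turn of [...history].reverse()) {
    const cost = turn.content.length;
    if (kept.length > 0 && used + cost > maxChars) break;
    kept.unshift(turn); // restore chronological order
    used += cost;
  }
  return kept;
}

// Hypothetical usage: trimHistory(conversation, 12000)
```

A count-based limit would let ten long messages through and cut ten short ones equally; a character budget tracks what the token cap actually measures.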
The rest of the stack
- Astro 5 SSR (node adapter) in Docker on an Oracle ARM free-tier VM
- Vanilla JS client: SSE parsing, a small safe markdown renderer, sessionStorage persistence, an abort button. No framework; the whole client is one file
- Playwright for E2E, Forgejo (self-hosted) for CI
- Hosting cost: the VM is free tier. Inference: zero
What I would tell you to copy
- Chain free providers; never depend on one
- Enforce output guarantees in code, not prompts
- Gate the UI on a server-side health check
- Whitelist the facts your model may state about anything that matters
- Cache your fixed prompts; most traffic is the same five questions
Try it: ask.hiten.dev. And if you are hiring (permanent or contract), the chat will happily explain why that is a good idea. The rest of my work lives at hiten.dev.