<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hiten Patel</title>
    <description>The latest articles on DEV Community by Hiten Patel (@hiten_patel_35ea71d4d2007).</description>
    <link>https://dev.to/hiten_patel_35ea71d4d2007</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3865919%2F98cb7deb-e02e-4cdb-a004-f42779ee2ecc.png</url>
      <title>DEV Community: Hiten Patel</title>
      <link>https://dev.to/hiten_patel_35ea71d4d2007</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hiten_patel_35ea71d4d2007"/>
    <language>en</language>
    <item>
      <title>I built an AI chat over my CV on a zero-pound inference budget</title>
      <dc:creator>Hiten Patel</dc:creator>
      <pubDate>Thu, 11 Jun 2026 09:37:11 +0000</pubDate>
      <link>https://dev.to/hiten_patel_35ea71d4d2007/i-built-an-ai-chat-over-my-cv-on-a-zero-pound-inference-budget-km3</link>
      <guid>https://dev.to/hiten_patel_35ea71d4d2007/i-built-an-ai-chat-over-my-cv-on-a-zero-pound-inference-budget-km3</guid>
      <description>&lt;p&gt;My CV is a PDF, and PDFs do not answer questions. So I built &lt;a href="https://ask.hiten.dev" rel="noopener noreferrer"&gt;ask.hiten.dev&lt;/a&gt;: a streaming chat grounded in my actual career history, where a recruiter can ask "why should I hire you over another senior frontend engineer?" and get a real answer.&lt;/p&gt;

&lt;p&gt;The constraint that made it interesting: the total inference budget is zero. No OpenAI bill, no hosted vector DB, nothing. Here is what that actually took.&lt;/p&gt;

&lt;h2&gt;Four free providers and a failover chain&lt;/h2&gt;

&lt;p&gt;No single free tier is reliable enough to put in front of strangers. Groq's free tier caps at 100k tokens/day, and I hit that cap on day one. OpenRouter's free models come and go. Cerebras occasionally queues you out at busy times.&lt;/p&gt;

&lt;p&gt;The fix is boring and effective: an ordered chain of providers, all OpenAI-compatible, walked on every request until one answers.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Groq (llama-3.3-70b) -&amp;gt; OpenRouter (gpt-oss-120b:free) -&amp;gt; NVIDIA (llama-3.3-70b) -&amp;gt; Cerebras (gpt-oss-120b)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each provider is just a base URL, a key and a model name. The API route tries each in order; the first 2xx response with a body wins, and that response streams straight through to the client. The response also carries an &lt;code&gt;X-Provider&lt;/code&gt; header, so I can see in the logs which provider served each request.&lt;/p&gt;
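
&lt;p&gt;A minimal sketch of that walk, assuming Node 18+ with a global &lt;code&gt;fetch&lt;/code&gt;. The names &lt;code&gt;walkChain&lt;/code&gt;, &lt;code&gt;sendWithFetch&lt;/code&gt; and &lt;code&gt;isSuccess&lt;/code&gt; are hypothetical, and the base URLs, keys and model ids are placeholders rather than the real configuration.&lt;/p&gt;

```javascript
// Placeholder providers: the names follow the chain above, but the URLs,
// keys and model ids are made up for this sketch.
const providers = [
  { name: "groq", baseUrl: "https://groq.example.invalid/v1", key: "key-1", model: "llama-3.3-70b" },
  { name: "openrouter", baseUrl: "https://openrouter.example.invalid/v1", key: "key-2", model: "gpt-oss-120b:free" },
];

// True for any 2xx status code.
const isSuccess = (status) => Math.floor(status / 100) === 2;

// Default transport: an OpenAI-compatible streaming chat completion request.
function sendWithFetch(provider, messages) {
  return fetch(`${provider.baseUrl}/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${provider.key}`,
    },
    body: JSON.stringify({ model: provider.model, messages, stream: true }),
  });
}

// Try each provider in order; the first 2xx response with a body wins.
// `send` is injectable so the walk can be exercised without network access.
async function walkChain(chain, messages, send = sendWithFetch) {
  const failures = [];
  for (const provider of chain) {
    try {
      const res = await send(provider, messages);
      if (!isSuccess(res.status) || !res.body) {
        failures.push(`${provider.name}: HTTP ${res.status}`);
        continue;
      }
      return {
        provider: provider.name,
        body: res.body, // stream this straight through to the client
        headers: { "X-Provider": provider.name },
      };
    } catch (err) {
      failures.push(`${provider.name}: ${err.message}`);
    }
  }
  throw new Error(`all providers failed (${failures.join("; ")})`);
}
```

&lt;p&gt;Collecting the failures rather than discarding them keeps the final error useful when the whole chain is down.&lt;/p&gt;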

&lt;p&gt;Two details that mattered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Empty env vars are not unset.&lt;/strong&gt; Docker Compose's &lt;code&gt;${VAR:-}&lt;/code&gt; yields an empty string rather than leaving the variable unset, and &lt;code&gt;??&lt;/code&gt; in Node only falls back on &lt;code&gt;null&lt;/code&gt; or &lt;code&gt;undefined&lt;/code&gt;, so the default never applies. Every key goes through a helper that coerces &lt;code&gt;""&lt;/code&gt; to &lt;code&gt;undefined&lt;/code&gt;; otherwise a provider with no key "exists" and fails every request.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You cannot cheaply probe a tokens-per-day (TPD) cap.&lt;/strong&gt; My health check hits &lt;code&gt;GET /models&lt;/code&gt; on each provider (an auth check, cached for 60s). It tells you "key works, service up", not "you have tokens left". The failover chain covers the gap: a provider that has hit its TPD cap fails fast and the next one picks up.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
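
&lt;p&gt;The empty-string problem is easy to reproduce. A minimal sketch of such a helper follows; &lt;code&gt;envOrUndefined&lt;/code&gt;, &lt;code&gt;configuredProviders&lt;/code&gt; and the variable names are hypothetical, not the site's actual code.&lt;/p&gt;

```javascript
// Coerce "" to undefined so that `??` defaults and "is this provider
// configured?" checks behave as if the variable were never set.
function envOrUndefined(value) {
  return value === "" ? undefined : value;
}

// Keep only the providers that actually have a key.
// `keyVar` is a hypothetical field naming the env var to read.
function configuredProviders(candidates, env) {
  return candidates
    .map((c) => ({ name: c.name, key: envOrUndefined(env[c.keyVar]) }))
    .filter((p) => p.key !== undefined);
}

// An empty string is what ${VAR:-} produces when VAR is not set.
const demoEnv = { GROQ_API_KEY: "", NVIDIA_API_KEY: "abc" };

const active = configuredProviders(
  [
    { name: "groq", keyVar: "GROQ_API_KEY" },
    { name: "nvidia", keyVar: "NVIDIA_API_KEY" },
  ],
  demoEnv
);

console.log(demoEnv.GROQ_API_KEY ?? "fallback"); // prints an empty line
console.log(envOrUndefined(demoEnv.GROQ_API_KEY) ?? "fallback"); // prints "fallback"
console.log(active.map((p) => p.name)); // only "nvidia" remains
```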

&lt;p&gt;If every provider is down, the page itself says so. The health check runs server-side at render time, and instead of a broken chat you get a short maintenance note. Never ship a chat UI that can fail after the user has typed.&lt;/p&gt;
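
&lt;p&gt;One way the render-time gate could look, as a sketch. It assumes a single cached verdict for the whole chain (the real cache may be per provider), and &lt;code&gt;createHealthCheck&lt;/code&gt;, &lt;code&gt;pageMode&lt;/code&gt; and the injected &lt;code&gt;probe&lt;/code&gt; are hypothetical names. A real probe would call &lt;code&gt;GET /models&lt;/code&gt; and report whether it returned 2xx.&lt;/p&gt;

```javascript
const HEALTH_TTL_MS = 60000; // the 60s cache mentioned above

// `probe(provider)` resolves to true or false; `now` is injectable so the
// cache expiry can be exercised without waiting.
function createHealthCheck(probe, now = Date.now) {
  let cached = null; // { at, healthy } once the first check has run

  return async function anyProviderHealthy(providers) {
    if (cached !== null) {
      if (HEALTH_TTL_MS > now() - cached.at) return cached.healthy;
    }
    // Probe all providers in parallel; a probe that throws counts as down.
    const results = await Promise.all(
      providers.map((p) =>
        Promise.resolve()
          .then(() => probe(p))
          .catch(() => false)
      )
    );
    cached = { at: now(), healthy: results.some(Boolean) };
    return cached.healthy;
  };
}

// Decide at render time which version of the page to send.
async function pageMode(isHealthy, providers) {
  return (await isHealthy(providers)) ? "chat" : "maintenance";
}
```

&lt;p&gt;As noted above, a healthy verdict only means the key works and the service is up, so the failover chain is still needed once a request is actually sent.&lt;/p&gt;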

&lt;h2&gt;Open-weight models do not follow formatting orders&lt;/h2&gt;

&lt;p&gt;My site's voice avoids em dashes and curly quotes everywhere. The system prompt says, in increasingly desperate ways, "plain ASCII punctuation only". Llama 3.3 mostly complies. gpt-oss-120b absolutely does not.&lt;/p&gt;

&lt;p&gt;Instead of fighting the model, I rewrite the stream at the proxy. Each SSE &lt;code&gt;data:&lt;/code&gt; chunk gets parsed, the delta content normalised (curly quotes to straight, every dash variant handled, NBSP, ellipsis, arrows, bullets), and re-emitted. The client never sees the model's typographic opinions.&lt;/p&gt;
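
&lt;p&gt;A sketch of the per-line rewrite, assuming OpenAI-style chunks where the text lives in &lt;code&gt;choices[].delta.content&lt;/code&gt;. The replacement table is my guess at a reasonable mapping; the function names &lt;code&gt;normaliseText&lt;/code&gt; and &lt;code&gt;rewriteSseLine&lt;/code&gt; are hypothetical.&lt;/p&gt;

```javascript
// Typographic characters mapped to plain ASCII. The exact table is an
// assumption; only the categories come from the description above.
const REPLACEMENTS = [
  [/[\u2018\u2019\u201A\u2032]/g, "'"], // curly single quotes, prime
  [/[\u201C\u201D\u201E\u2033]/g, '"'], // curly double quotes, double prime
  [/[\u2010-\u2015\u2212]/g, "-"], // hyphen, en and em dashes, minus sign
  [/\u00A0/g, " "], // non-breaking space
  [/\u2026/g, "..."], // ellipsis
  [/\u2192/g, "->"], // right arrow
  [/\u2022/g, "-"], // bullet
];

function normaliseText(text) {
  return REPLACEMENTS.reduce(
    (out, [pattern, ascii]) => out.replace(pattern, ascii),
    text
  );
}

// Rewrite a single SSE line. Lines that are not `data:` payloads, the
// [DONE] sentinel and anything that is not valid JSON pass through untouched.
function rewriteSseLine(line) {
  if (!line.startsWith("data:")) return line;
  const payload = line.slice(5).trim();
  if (payload === "[DONE]") return line;

  let parsed;
  try {
    parsed = JSON.parse(payload);
  } catch {
    return line;
  }

  for (const choice of parsed.choices ?? []) {
    if (typeof choice.delta?.content === "string") {
      choice.delta.content = normaliseText(choice.delta.content);
    }
  }
  return `data: ${JSON.stringify(parsed)}`;
}
```

&lt;p&gt;A network chunk can end in the middle of an SSE line, so a real proxy would need to buffer until it sees a newline before calling something like &lt;code&gt;rewriteSseLine&lt;/code&gt;.&lt;/p&gt;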

&lt;p&gt;The same idea applies more broadly: anything you must guarantee about model output, enforce in code after the model, not in the prompt.&lt;/p&gt;

&lt;h2&gt;Grounding without a vector DB&lt;/h2&gt;

&lt;p&gt;There is no RAG here. The grounding document is a hand-written ~3.5k-token version of my CV, injected as the system prompt. At this scale a vector store is overengineering: the whole corpus fits in context with room to spare.&lt;/p&gt;

&lt;p&gt;The hard part was stopping hallucinated specifics. Early versions confidently invented user counts for my projects. The fix was a whitelist: the prompt lists the only numbers the model may state about my career, and instructs it to admit genuine gaps in one line and then redirect to an actual strength. A 15-test Playwright suite asserts that forbidden characters never appear, that suspect numbers never appear, and that the framing stays right.&lt;/p&gt;
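
&lt;p&gt;The two mechanical checks can be expressed as a plain function that an end-to-end test could call on the rendered answer. This is a sketch, not the actual suite: &lt;code&gt;checkAnswer&lt;/code&gt; is a hypothetical name and the whitelist values are placeholders, not figures from my CV.&lt;/p&gt;

```javascript
// Placeholder whitelist; the real list lives in the system prompt.
const ALLOWED_NUMBERS = new Set(["10", "2019"]);

// Anything outside the ASCII range counts as a forbidden character.
function findForbiddenCharacters(text) {
  return [...text].filter((ch) => ch.codePointAt(0) > 127);
}

// Extract numeric tokens (including forms like 50,000 or 3.5) and report
// the ones that are not on the whitelist.
function findSuspectNumbers(text, allowed) {
  const found = text.match(/\d+(?:[.,]\d+)*/g) ?? [];
  return found.filter((n) => !allowed.has(n));
}

function checkAnswer(text, allowed = ALLOWED_NUMBERS) {
  return {
    forbiddenCharacters: findForbiddenCharacters(text),
    suspectNumbers: findSuspectNumbers(text, allowed),
  };
}

const report = checkAnswer("Since 2019 I have shipped to 50,000 users.");
console.log(report.suspectNumbers); // [ '50,000' ]
console.log(report.forbiddenCharacters); // []
```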

&lt;h2&gt;Making the common path free&lt;/h2&gt;

&lt;p&gt;Most visitors click one of the starter chips ("why hire you?", "core stack", "available now?"). Those prompts are fixed strings, so their responses are cached in memory for an hour. The first visitor pays the tokens; everyone else gets the answer straight from memory, with no upstream call, marked &lt;code&gt;X-Provider: cache&lt;/code&gt;. Follow-up suggestions reuse the same chips, so they hit the cache too.&lt;/p&gt;
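
&lt;p&gt;A minimal sketch of such a cache, assuming answers are keyed by the exact prompt string and that only the known chip prompts are ever stored. &lt;code&gt;createPromptCache&lt;/code&gt; is a hypothetical name, and the clock is injectable so that expiry can be tested.&lt;/p&gt;

```javascript
const CACHE_TTL_MS = 60 * 60 * 1000; // one hour

function createPromptCache(fixedPrompts, now = Date.now) {
  const cacheable = new Set(fixedPrompts);
  const entries = new Map(); // prompt -> { at, answer }

  return {
    // Returns the cached answer, or undefined on a miss or an expired entry.
    get(prompt) {
      const entry = entries.get(prompt);
      if (entry === undefined) return undefined;
      if (now() - entry.at >= CACHE_TTL_MS) {
        entries.delete(prompt);
        return undefined;
      }
      return entry.answer;
    },
    // Free-text questions are ignored; only the fixed prompts are stored.
    set(prompt, answer) {
      if (cacheable.has(prompt)) entries.set(prompt, { at: now(), answer });
    },
  };
}

const cache = createPromptCache(["why hire you?", "core stack", "available now?"]);
cache.set("core stack", "TypeScript, Astro, Node");
console.log(cache.get("core stack")); // TypeScript, Astro, Node
console.log(cache.get("why hire you?")); // undefined
```

&lt;p&gt;Keying on the prompt alone only makes sense when the answer does not depend on earlier turns, so a real route would probably need to include any prior context in the key or skip the cache when history is present.&lt;/p&gt;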

&lt;p&gt;History sent upstream is trimmed to a character budget (sized to roughly 3k tokens) rather than a message count, which protects the daily caps once conversations get long.&lt;/p&gt;
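
&lt;p&gt;Trimming by budget can be sketched as a walk from the newest message backwards. The ratio of about four characters per token is a rough rule of thumb for English text and an assumption here, as is the name &lt;code&gt;trimHistory&lt;/code&gt;.&lt;/p&gt;

```javascript
// Roughly 3k tokens at an assumed 4 characters per token.
const CHAR_BUDGET = 3000 * 4;

// Keep the most recent messages whose combined length fits the budget,
// preserving their original order.
function trimHistory(messages, budget = CHAR_BUDGET) {
  const kept = [];
  let used = 0;
  for (const message of [...messages].reverse()) {
    const cost = message.content.length;
    if (used + cost > budget) break;
    kept.unshift(message);
    used += cost;
  }
  return kept;
}

const history = [
  { role: "user", content: "aaaaaa" },
  { role: "assistant", content: "bbbb" },
  { role: "user", content: "cccc" },
];

// With a budget of 10 characters the oldest message no longer fits.
console.log(trimHistory(history, 10).map((m) => m.content)); // [ 'bbbb', 'cccc' ]
```

&lt;p&gt;In this sketch a single message longer than the whole budget leaves nothing to send, so a real route would want to handle that case explicitly, for example by truncating the message instead.&lt;/p&gt;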

&lt;h2&gt;The rest of the stack&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Astro 5 SSR&lt;/strong&gt; (node adapter) in Docker on an Oracle ARM free-tier VM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vanilla JS client&lt;/strong&gt;: SSE parsing, a small safe markdown renderer, sessionStorage persistence, an abort button. No framework; the whole client is one file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Playwright&lt;/strong&gt; for E2E, &lt;strong&gt;Forgejo&lt;/strong&gt; (self-hosted) for CI&lt;/li&gt;
&lt;li&gt;Hosting cost: the VM is free tier. Inference: zero&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;What I would tell you to copy&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Chain free providers; never depend on one&lt;/li&gt;
&lt;li&gt;Enforce output guarantees in code, not prompts&lt;/li&gt;
&lt;li&gt;Gate the UI on a server-side health check&lt;/li&gt;
&lt;li&gt;Whitelist the facts your model may state about anything that matters&lt;/li&gt;
&lt;li&gt;Cache your fixed prompts; most traffic is the same five questions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Try it: &lt;a href="https://ask.hiten.dev" rel="noopener noreferrer"&gt;ask.hiten.dev&lt;/a&gt;. And if you are hiring (permanent or contract), the chat will happily explain why that is a good idea. The rest of my work lives at &lt;a href="https://hiten.dev" rel="noopener noreferrer"&gt;hiten.dev&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>javascript</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
