<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anup Sharma</title>
    <description>The latest articles on DEV Community by Anup Sharma (@anup_sharma_86fa94612fe3c).</description>
    <link>https://dev.to/anup_sharma_86fa94612fe3c</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3947205%2F0f0dcd14-4a48-4d26-b9e2-7a9102608e05.png</url>
      <title>DEV Community: Anup Sharma</title>
      <link>https://dev.to/anup_sharma_86fa94612fe3c</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anup_sharma_86fa94612fe3c"/>
    <language>en</language>
    <item>
      <title>I Built an AI That Decides Which AI to Talk To — Running 24/7 From My Living Room</title>
      <dc:creator>Anup Sharma</dc:creator>
      <pubDate>Sat, 23 May 2026 07:43:58 +0000</pubDate>
      <link>https://dev.to/anup_sharma_86fa94612fe3c/i-built-an-ai-that-decides-which-ai-to-talk-to-running-247-from-my-living-room-211p</link>
      <guid>https://dev.to/anup_sharma_86fa94612fe3c/i-built-an-ai-that-decides-which-ai-to-talk-to-running-247-from-my-living-room-211p</guid>
      <description>&lt;p&gt;Last Saturday when I woke up, my AI agent reviewed 14 restaurant ratings in Indiranagar, updated a shared Google Sheet, signed a 20-page PDF I'd been ignoring for a week, and wrote a bash script to clean up my server logs.&lt;/p&gt;

&lt;p&gt;I didn't ask it to do any of that. It just... does things now.&lt;/p&gt;

&lt;p&gt;Meet &lt;strong&gt;OpenClaw&lt;/strong&gt; — my long-running autonomous agent that lives on a Raspberry Pi, plugged into Discord, running 24/7. It manages my memory, handles research, writes code, edits documents, finds the best weekend spots in Bangalore by scraping live ratings — basically, it runs half my life on autopilot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But a few weeks ago, I noticed something that bothered me.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I asked it: &lt;em&gt;"Write a Python script to parse JSON logs."&lt;/em&gt; Simple coding task. It sent that request to a cloud API, waited 3 seconds, burned tokens I paid for, and came back with an answer — when I had a perfectly capable local LLM sitting idle on my Mac Mini, three feet away.&lt;/p&gt;

&lt;p&gt;Then I asked: &lt;em&gt;"Think step by step about the trade-offs between event-driven vs polling architecture for my notification system."&lt;/em&gt; That's a hard reasoning question. I want that going to a frontier model. That's worth the tokens.&lt;/p&gt;

&lt;p&gt;Same agent. Same endpoint. Completely different needs.&lt;/p&gt;

&lt;p&gt;And that's when a stupid idea hit me:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if the system could figure out which brain to use — before the request even reaches a model?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Turns out, it's not stupid at all. And it took me a weekend, a Raspberry Pi, a Mac Mini, 50 lines of Python, and an open-source gateway to build it.&lt;/p&gt;

&lt;p&gt;Here's how.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxd9x1tqn50cf1lew8lu8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxd9x1tqn50cf1lew8lu8.png" alt=" " width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's what's running in my living room:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raspberry Pi&lt;/strong&gt; → Runs OpenClaw, my autonomous agent. It takes input from Discord, manages context, memory, and orchestrates everything.&lt;br&gt;
&lt;strong&gt;Mac Mini&lt;/strong&gt; → The brain farm. Runs three things:&lt;br&gt;
Ollama with qwen2.5-coder:7b — a local coding model that never leaves my network&lt;br&gt;
&lt;strong&gt;AgentGateway&lt;/strong&gt; — an open-source AI gateway from Google that handles routing, auth, observability&lt;br&gt;
&lt;strong&gt;A lightweight Python router&lt;/strong&gt; — the "intent classifier" I wrote in ~50 lines of code&lt;br&gt;
The magic? OpenClaw doesn't know any of this is happening. It just sends a request to one endpoint. Behind the scenes, the system figures out the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgy4hte82eh0nergs3xya.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgy4hte82eh0nergs3xya.png" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three models. Three price points. One unified endpoint. OpenClaw just hits &lt;a href="http://192.168.1.15:1234/v1/chat/completions" rel="noopener noreferrer"&gt;http://192.168.1.15:1234/v1/chat/completions&lt;/a&gt; and forgets about it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why AgentGateway?
&lt;/h3&gt;

&lt;p&gt;I evaluated a few options — raw Envoy, Nginx with Lua scripting, even building a full proxy from scratch. But &lt;strong&gt;AgentGateway&lt;/strong&gt; stood out for a few reasons:&lt;/p&gt;

&lt;p&gt;What it gives you out of the box:&lt;br&gt;
&lt;strong&gt;Protocol translation&lt;/strong&gt; — It speaks OpenAI-compatible API on the frontend, but can talk to Gemini, Vertex AI, Bedrock, Ollama, and more on the backend. I don't write a single line of provider-specific code.&lt;br&gt;
&lt;strong&gt;Backend authentication&lt;/strong&gt; — API keys are managed at the gateway level. OpenClaw never sees or stores any API key. I just set backendAuth: key: $GEMINI_API_KEY in the config and it handles the rest.&lt;br&gt;
&lt;strong&gt;Model aliasing&lt;/strong&gt; — OpenClaw sends model: "inteli-llm" in every request. AgentGateway silently translates that to qwen2.5-coder:7b, gpt-4o, or gemini-2.5-flash depending on which route matched. The client has no idea.&lt;br&gt;
&lt;strong&gt;Observability&lt;/strong&gt; — Every request gets logged with provider name, model, token counts, and latency. I can see exactly how many tokens are going to OpenAI vs staying local.&lt;br&gt;
&lt;strong&gt;Prompt guards &amp;amp; rate limiting&lt;/strong&gt; — Built-in regex-based PII masking, webhook-based content moderation, and rate limiting. Enterprise-grade features I get for free.&lt;br&gt;
&lt;strong&gt;Weighted load balancing &amp;amp; failover&lt;/strong&gt; — If Ollama crashes (it happens), I can configure automatic failover to a cloud model. No downtime.&lt;br&gt;
&lt;strong&gt;What it doesn't do (yet):&lt;/strong&gt; Content-aware routing. AgentGateway routes based on path, headers, and methods — which is the right design for a gateway. It doesn't peek into your request body to decide where to send it. That's a feature, not a bug — gateways should be fast and protocol-level, not parsing JSON payloads.&lt;/p&gt;

&lt;p&gt;But I needed content-aware routing. So instead of searching for other tool, I extended it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The 50-Line Router That Makes It All Work
&lt;/h3&gt;

&lt;p&gt;I wrote a tiny FastAPI proxy that sits in front of AgentGateway. Here's what it does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Intercepts the incoming OpenAI-compatible request&lt;/li&gt;
&lt;li&gt;Reads the last message in the chat&lt;/li&gt;
&lt;li&gt;Classifies intent using simple keyword matching + prompt length heuristics:

&lt;ul&gt;
&lt;li&gt;Contains &lt;code&gt;code&lt;/code&gt;, &lt;code&gt;python&lt;/code&gt;, &lt;code&gt;script&lt;/code&gt;, &lt;code&gt;function&lt;/code&gt;, &lt;code&gt;bug&lt;/code&gt;? → coding&lt;/li&gt;
&lt;li&gt;Contains &lt;code&gt;think&lt;/code&gt;, &lt;code&gt;analyze&lt;/code&gt;, &lt;code&gt;reasoning&lt;/code&gt;, &lt;code&gt;deduce&lt;/code&gt;? Or prompt &amp;gt; 400 chars? → reasoning&lt;/li&gt;
&lt;li&gt;Everything else? → simple&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Injects an x-intent HTTP header&lt;/li&gt;

&lt;li&gt;Forwards the request to AgentGateway untouched
That's it. No ML model for classification. No vector databases. No semantic similarity. Just good old keyword matching that works 90% of the time — and that's good enough for a homelab.
&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;coding_keywords&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;javascript&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;script&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bug&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;reasoning_keywords&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;think&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analyze&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;explain in detail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deduce&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;prompt_lower&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;coding_keywords&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;prompt_lower&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reasoning_keywords&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Cost Equation
&lt;/h3&gt;

&lt;p&gt;Here's what this setup actually saves me:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Intent&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Where it runs&lt;/th&gt;
&lt;th&gt;Cost per 1M tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Coding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;qwen2.5-coder:7b&lt;/td&gt;
&lt;td&gt;Local (Ollama)&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Simple Q&amp;amp;A&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;gemini-2.5-flash&lt;/td&gt;
&lt;td&gt;Google Cloud&lt;/td&gt;
&lt;td&gt;~$0.15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deep Reasoning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;gpt-4o&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;~$2.50&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Before this setup, every single request was going to a cloud API. Now, roughly 60-70% of my queries stay local — coding questions, quick lookups, simple formatting tasks. They're fast, free, and private.&lt;/p&gt;

&lt;p&gt;The expensive reasoning model only gets called when I genuinely need it. And the mid-tier Gemini handles everything in between.&lt;/p&gt;

&lt;p&gt;My monthly API bill dropped significantly, and the local responses are actually faster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Design Choices &amp;amp; Why They Worked
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Header-based routing over path-based routing&lt;/strong&gt; Initially, I was going to use URL paths (&lt;code&gt;/coding&lt;/code&gt;, &lt;code&gt;/reasoning&lt;/code&gt;, &lt;code&gt;/simple&lt;/code&gt;) and strip them with URL rewriting. But header injection is cleaner — the original request path stays intact, and AgentGateway's header matching is first-class.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Classification at the proxy, not the gateway&lt;/strong&gt; I could have tried to use AgentGateway's CEL expressions or ExtProc policies for classification. But those run after backend selection, not before. Keeping classification in a separate lightweight layer means I can swap algorithms without touching my gateway config.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Keyword heuristics over ML classifiers&lt;/strong&gt; Could I use a small classifier model or even RouteLLM for smarter routing? Absolutely. But for a homelab, keyword matching is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero latency overhead&lt;/li&gt;
&lt;li&gt;Zero dependencies&lt;/li&gt;
&lt;li&gt;Easy to debug (just read the logs)&lt;/li&gt;
&lt;li&gt;Surprisingly accurate for my use cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. One unified model name&lt;/strong&gt; OpenClaw sends model: &lt;code&gt;"inteli-llm"&lt;/code&gt; for everything. AgentGateway's &lt;code&gt;modelAliases&lt;/code&gt; feature translates it per-route. This means I can swap out backend models without touching a single line of OpenClaw's config. Last week it was &lt;code&gt;gemini-1.5-flash&lt;/code&gt;, this week it's &lt;code&gt;gemini-2.5-flash&lt;/code&gt;. OpenClaw never knew.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Smarter classification&lt;/strong&gt; — Maybe a tiny local classifier model, or even using the first few tokens of a response to reclassify and retry on a better model.&lt;br&gt;
&lt;strong&gt;Metrics dashboard&lt;/strong&gt; — AgentGateway already emits OpenTelemetry traces. I want to hook up a Grafana dashboard to see which models are handling what, with latency and token breakdowns.&lt;br&gt;
&lt;strong&gt;Failover chains&lt;/strong&gt; — If Ollama is under heavy load, automatically fall back to Gemini for coding tasks. AgentGateway supports priority groups for this.&lt;br&gt;
&lt;strong&gt;More agents&lt;/strong&gt; — OpenClaw is just the beginning. I want to run specialized agents for different domains, all routing through the same gateway.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;You don't need a Kubernetes cluster or a $10K GPU server to build a multi-model AI system. A Raspberry Pi, a Mac Mini, an open-source gateway, and 50 lines of Python got me:&lt;/p&gt;

&lt;p&gt;✅ An always-on autonomous agent ✅Intelligent routing ✅across 3 different LLMs ✅Local-first for privacy and speed ✅Cloud when I need the horsepower ✅Zero API keys exposed to the client ✅A monthly bill I actually don't mind paying&lt;/p&gt;

&lt;p&gt;The best part? The entire config is a single YAML file and a single Python script. No Docker. No Kubernetes. No Terraform. Just two processes on a Mac Mini and an agent on a Pi.&lt;/p&gt;

&lt;p&gt;Sometimes the best infrastructure is the one you can explain in a napkin sketch.&lt;/p&gt;

&lt;p&gt;If you're building something similar or want to see the config files, drop a comment — happy to share the full setup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rmzapxb0p2vk8wdzjgn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rmzapxb0p2vk8wdzjgn.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  AI #HomeAssistant #LLM #AgentGateway #Ollama #OpenAI #Gemini #HomeLab #BuildInPublic #MacMini #RaspberryPi #AIEngineering
&lt;/h1&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>architecture</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
