<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sam Hartley</title>
    <description>The latest articles on DEV Community by Sam Hartley (@samhartley_dev).</description>
    <link>https://dev.to/samhartley_dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3811539%2Fdd554e30-699d-42a3-a82a-77673790a186.png</url>
      <title>DEV Community: Sam Hartley</title>
      <link>https://dev.to/samhartley_dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/samhartley_dev"/>
    <language>en</language>
    <item>
      <title>I Gave Each of My AI Agents a Personality — Here's Why My Workflow Actually Improved</title>
      <dc:creator>Sam Hartley</dc:creator>
      <pubDate>Fri, 19 Jun 2026 08:01:30 +0000</pubDate>
      <link>https://dev.to/samhartley_dev/i-gave-each-of-my-ai-agents-a-personality-heres-why-my-workflow-actually-improved-2dl9</link>
      <guid>https://dev.to/samhartley_dev/i-gave-each-of-my-ai-agents-a-personality-heres-why-my-workflow-actually-improved-2dl9</guid>
      <description>&lt;p&gt;I used to have one AI assistant. It did everything — coded, wrote docs, answered questions, monitored my inbox.&lt;/p&gt;

&lt;p&gt;It was fine. But "fine" isn't the same as "good." One model trying to be a generalist meant it was mediocre at everything. Context bloat. Conflicting instructions. The coding advice was too cautious. The writing was too robotic. The inbox monitoring missed nuance because the model was busy trying to remember my entire codebase.&lt;/p&gt;

&lt;p&gt;So I split it into three. Each with a different personality, different model, different job. And it actually works better.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with One Agent to Rule Them All
&lt;/h2&gt;

&lt;p&gt;When you have one AI doing everything, you run into three problems fast:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Context pollution.&lt;/strong&gt; The coding instructions leak into the writing tone. The writing style bleeds into the code suggestions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Wrong tool for the job.&lt;/strong&gt; A 70B parameter model is overkill for "check my calendar." A 7B model is underpowered for "refactor this 500-line function."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. No specialization.&lt;/strong&gt; My coding agent doesn't need to know my grocery list. My writing agent doesn't need to know my API keys. But when there's only one context window, everything is in there.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three-Agent Setup
&lt;/h2&gt;

&lt;p&gt;I now run three distinct agents, each with its own model, personality, and scope:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Personality&lt;/th&gt;
&lt;th&gt;Job&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Celebi&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Qwen 3.5 9B (Mac Mini)&lt;/td&gt;
&lt;td&gt;Generalist, casual, resourceful&lt;/td&gt;
&lt;td&gt;Orchestration, daily checks, notifications, routing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ProgrammierMinna&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Qwen 3 Coder 30B (RTX 3060)&lt;/td&gt;
&lt;td&gt;Precise, technical, no fluff&lt;/td&gt;
&lt;td&gt;Code generation, debugging, refactoring, PR review&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DocMinna&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Granite 3.2 8B (Mac Mini)&lt;/td&gt;
&lt;td&gt;Formal, structured, thorough&lt;/td&gt;
&lt;td&gt;Documentation, technical writing, READMEs, specs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Why Different Models?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Celebi runs on the Mac Mini (M4)&lt;/strong&gt; because it's always on, low power, and handles simple tasks instantly. Qwen 3.5 9B is perfect for "check my email, summarize it, tell me if it's urgent."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ProgrammierMinna runs on the RTX 3060&lt;/strong&gt; because coding tasks need a bigger model. Qwen 3 Coder 30B actually understands large codebases, suggests proper refactors, and catches edge cases the 9B misses. Response time is 10-15 seconds — fine for code, too slow for "what's the weather."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DocMinna also runs on the Mac Mini&lt;/strong&gt; with Granite 3.2 8B. It's smaller because documentation doesn't need frontier reasoning. It just needs to be structured, consistent, and technically accurate. The smaller model is faster and cheaper.&lt;/p&gt;

&lt;h2&gt;
  
  
  How They Talk to Each Other
&lt;/h2&gt;

&lt;p&gt;This was the hard part. I didn't want three separate chat windows. I wanted one interface (Telegram) where I message Celebi, and Celebi delegates to the right specialist.&lt;/p&gt;

&lt;p&gt;Here's how it works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User (Telegram): "Refactor the auth module in project X"
  → Celebi receives message
  → Classifies: "coding task, complex"
  → Routes to ProgrammierMinna
  → ProgrammierMinna generates refactored code
  → Returns to Celebi
  → Celebi formats response and sends back to user
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The user never talks directly to ProgrammierMinna or DocMinna. Celebi is the router. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The user has one interface&lt;/li&gt;
&lt;li&gt;Each specialist gets only relevant context&lt;/li&gt;
&lt;li&gt;Results are combined and formatted consistently&lt;/li&gt;
&lt;li&gt;If a task is simple, Celebi handles it directly (no delegation overhead)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What "Personality" Actually Means in Practice
&lt;/h2&gt;

&lt;p&gt;I don't mean "quirky chatbot with a backstory." I mean three things:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Different System Prompts
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Celebi:&lt;/strong&gt; "You're a resourceful assistant. Be concise. Don't ask clarifying questions unless critical. Default to action."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ProgrammierMinna:&lt;/strong&gt; "You're a senior software engineer. Write clean, maintainable code. Add error handling. Consider edge cases. Explain your reasoning briefly."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DocMinna:&lt;/strong&gt; "You're a technical writer. Structure docs with clear headings. Include code examples. Write for an intermediate developer. Be thorough but not verbose."&lt;/p&gt;

&lt;p&gt;These aren't decorations — they fundamentally change the output. The same request to all three produces completely different results.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Different Context Scopes
&lt;/h3&gt;

&lt;p&gt;Celebi sees my calendar, emails, weather, and general notes. It knows I'm in Turkey, that I have a meeting at 3 PM, that it's hot outside.&lt;/p&gt;

&lt;p&gt;ProgrammierMinna sees my Git repos, code patterns, and project structure. It knows I prefer Go over Python for CLI tools, that I use specific naming conventions, that I hate nested callbacks.&lt;/p&gt;

&lt;p&gt;DocMinna sees my documentation templates, style guides, and existing docs. It knows I write in Markdown, that I include a "Quick Start" section, that I don't use emojis in technical docs.&lt;/p&gt;

&lt;p&gt;Each agent's context is &lt;strong&gt;filtered.&lt;/strong&gt; Celebi doesn't get the Git repos. ProgrammierMinna doesn't get my grocery list. This alone cut my token usage by ~40%.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Different Tone and Format
&lt;/h3&gt;

&lt;p&gt;Ask all three to "explain Docker":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Celebi:&lt;/strong&gt; "Docker packages apps into containers so they run the same everywhere. Think of it as a shipping container for software — standardized, portable, isolated. Need help with a specific setup?"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ProgrammierMinna:&lt;/strong&gt; "Docker uses OS-level virtualization to package applications with their dependencies. Key concepts: images (read-only templates), containers (runtime instances), and Dockerfiles (build instructions). For multi-container apps, use Docker Compose. Here's a minimal example..."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DocMinna:&lt;/strong&gt; "Docker is a platform for developing, shipping, and running applications in containers. This guide covers installation, core concepts (images, containers, volumes), and best practices for production deployments..."&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same facts. Completely different delivery. And that's the point — you pick the right voice for the situation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Routing Logic
&lt;/h2&gt;

&lt;p&gt;Celebi decides who handles what. The rules are simple but effective:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Input Signal&lt;/th&gt;
&lt;th&gt;Route To&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Contains code snippets, "refactor," "debug," "function"&lt;/td&gt;
&lt;td&gt;ProgrammierMinna&lt;/td&gt;
&lt;td&gt;"Fix this Go error"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Contains "document," "README," "spec," "guide"&lt;/td&gt;
&lt;td&gt;DocMinna&lt;/td&gt;
&lt;td&gt;"Write API docs for this endpoint"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;General question, scheduling, notification&lt;/td&gt;
&lt;td&gt;Celebi (self)&lt;/td&gt;
&lt;td&gt;"What's on my calendar?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mixed task (code + docs)&lt;/td&gt;
&lt;td&gt;Both, combined&lt;/td&gt;
&lt;td&gt;"Build a tool and document it"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The routing is a lightweight classifier — just a few-shot prompt to Qwen 3.5 9B. It gets it right ~95% of the time. The 5% that are wrong? I correct it, and the model learns from the feedback (stored in memory files).&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Improved
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Code quality:&lt;/strong&gt; ProgrammierMinna suggests better abstractions because it doesn't have to also remember my dentist appointment. Cleaner context = better reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Documentation speed:&lt;/strong&gt; DocMinna writes docs in 30 seconds that used to take me 20 minutes. And they're consistent with my existing style.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response time:&lt;/strong&gt; Simple queries stay on the Mac Mini (instant). Complex ones go to the GPU (acceptable delay). No more "one size fits none" latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token costs:&lt;/strong&gt; Splitting context means each agent sees only what it needs. My monthly API bill dropped from ~$45 to ~$15 because 80% of tasks stay local.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Less context-switching for me:&lt;/strong&gt; I say what I want in Telegram. The system figures out who should handle it. I don't think about "which model should I use for this."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Downsides (Being Honest)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Setup complexity:&lt;/strong&gt; Three agents means three configurations, three model endpoints, three context files to manage. It's not "install and go."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Routing mistakes:&lt;/strong&gt; Sometimes Celebi sends a coding task to DocMinna, and I get a beautifully written document instead of working code. I fix the routing rule, and it improves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-agent memory gaps:&lt;/strong&gt; ProgrammierMinna doesn't know that DocMinna just wrote the API spec. If I'm building a tool and documenting it simultaneously, I have to manually sync context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware footprint:&lt;/strong&gt; Three models loaded means more RAM and VRAM usage. On my setup (Mac Mini + RTX 3060), it's manageable. On a single machine with 8GB RAM, you'd struggle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is This Overkill for You?
&lt;/h2&gt;

&lt;p&gt;Probably, if you're just using ChatGPT for occasional questions.&lt;/p&gt;

&lt;p&gt;But if you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use AI daily for multiple distinct tasks&lt;/li&gt;
&lt;li&gt;Have a local GPU or powerful machine&lt;/li&gt;
&lt;li&gt;Find yourself rewriting AI output because the tone is wrong&lt;/li&gt;
&lt;li&gt;Want specialized quality without paying for frontier models constantly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...then splitting into personalities is worth trying. You don't need three agents on day one. Start with two: one for general tasks, one for your most common specialized task (usually coding or writing).&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup in 30 Minutes
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Install Ollama&lt;/strong&gt; on your machines: &lt;code&gt;curl -fsSL https://ollama.com/install.sh | sh&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pull models:&lt;/strong&gt; &lt;code&gt;ollama pull qwen3.5:9b&lt;/code&gt; (general) + &lt;code&gt;qwen3-coder:30b&lt;/code&gt; (coding)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create system prompts&lt;/strong&gt; — one file per agent with its personality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build a router&lt;/strong&gt; — a 20-line script that classifies input and sends to the right endpoint&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add a frontend&lt;/strong&gt; — Telegram bot, CLI, or web UI&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The router is the only custom code you need. Everything else is off-the-shelf.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I'm experimenting with two additions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory sharing&lt;/strong&gt; — A shared context file that all agents can read (but not write) for cross-cutting concerns like "current project" or "my tech stack."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agent spawning&lt;/strong&gt; — When a task is genuinely new, Celebi spawns a temporary agent with a custom prompt, runs the task, then discards it. No permanent bloat.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The goal isn't to build AGI. It's to build a team of specialists that costs less than one generalist and produces better work.&lt;/p&gt;




&lt;h2&gt;
  
  
  Want This Architecture?
&lt;/h2&gt;

&lt;p&gt;We build custom multi-agent systems tailored to your workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🤖 Multi-agent orchestration with personality routing&lt;/li&gt;
&lt;li&gt;📝 Specialized documentation agents&lt;/li&gt;
&lt;li&gt;💻 Code-focused AI assistants with project context&lt;/li&gt;
&lt;li&gt;🔔 Unified notification layer (Telegram, Slack, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ &lt;a href="http://www.fiverr.com/s/XLyg" rel="noopener noreferrer"&gt;Custom AI Agent Setup on Fiverr&lt;/a&gt;&lt;br&gt;
→ &lt;a href="https://t.me/celebibot_en" rel="noopener noreferrer"&gt;Follow the build process on Telegram&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write about running AI locally, building weird automation, and occasionally making money from side projects. If this was useful, feel free to follow.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  ai #agents #automation #architecture #productivity #ollama #localllm
&lt;/h1&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>automation</category>
      <category>architecture</category>
    </item>
    <item>
      <title>I Use Telegram as My DevOps Dashboard — No Web UI, No VPN, Just Works</title>
      <dc:creator>Sam Hartley</dc:creator>
      <pubDate>Mon, 15 Jun 2026 08:03:38 +0000</pubDate>
      <link>https://dev.to/samhartley_dev/i-use-telegram-as-my-devops-dashboard-no-web-ui-no-vpn-just-works-5cjk</link>
      <guid>https://dev.to/samhartley_dev/i-use-telegram-as-my-devops-dashboard-no-web-ui-no-vpn-just-works-5cjk</guid>
      <description>&lt;p&gt;I have a bunch of things running 24/7 on a Mac Mini. GPU rental jobs, a Garmin watch face updater, a Fiverr inbox monitor, a funding rate tracker, a few cron jobs.&lt;/p&gt;

&lt;p&gt;For a while I ran a Grafana dashboard to keep an eye on them. It looked impressive. I never opened it.&lt;/p&gt;

&lt;p&gt;What I actually do is check my phone. So I built the monitoring layer there.&lt;/p&gt;

&lt;p&gt;Here's the setup: a lightweight Telegram bot that serves as my entire DevOps interface. Status checks, alerts, and even simple commands — all from the Telegram app I already have open.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not a Proper Dashboard?
&lt;/h2&gt;

&lt;p&gt;Honest answer: dashboards are for teams. If you're a solo dev with a few projects, a fancy web UI creates more overhead than it solves.&lt;/p&gt;

&lt;p&gt;Problems I had with Grafana:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VPN required to reach it from outside my home network&lt;/li&gt;
&lt;li&gt;Needs to stay running (another thing to maintain)&lt;/li&gt;
&lt;li&gt;I never actually opened the browser tab&lt;/li&gt;
&lt;li&gt;It didn't &lt;em&gt;push&lt;/em&gt; me information — I had to pull it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Telegram flips this: it pushes alerts to me. I glance at my phone, see what's happening, and move on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Services (cron jobs, Python scripts, shell scripts)
  ↓
Central alert script: notify.sh
  ↓
Telegram Bot API → my phone
  ↓ (optional)
Command bot → runs queries on server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two pieces:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Outbound alerts&lt;/strong&gt; — services send me messages when things happen&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inbound commands&lt;/strong&gt; — I can ask the bot questions from my phone&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Part 1: Dead Simple Alert Script
&lt;/h2&gt;

&lt;p&gt;Every service on my server can call this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# notify.sh — send a Telegram message from any script&lt;/span&gt;
&lt;span class="c"&gt;# Usage: ./notify.sh "Your GPU job finished"&lt;/span&gt;

&lt;span class="nv"&gt;BOT_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_bot_token_here"&lt;/span&gt;
&lt;span class="nv"&gt;CHAT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_chat_id_here"&lt;/span&gt;
&lt;span class="nv"&gt;MESSAGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"https://api.telegram.org/bot&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;BOT_TOKEN&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/sendMessage"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nv"&gt;chat_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CHAT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nv"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MESSAGE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nv"&gt;parse_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"HTML"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Any script can now send me a message in one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./notify.sh &lt;span class="s2"&gt;"✅ GPU rental job completed — earned &lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;&lt;span class="s2"&gt;.40"&lt;/span&gt;
./notify.sh &lt;span class="s2"&gt;"⚠️ Funding rate dropped below threshold on LYN_USDT"&lt;/span&gt;
./notify.sh &lt;span class="s2"&gt;"📬 New Fiverr inquiry from user987"&lt;/span&gt;
./notify.sh &lt;span class="s2"&gt;"❌ Garmin watch face API returned 503"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I spent maybe 20 minutes on this. It replaced a monitoring stack I spent days configuring.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Examples from My Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GPU rental monitor:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Runs every 30 min&lt;/span&gt;
&lt;span class="nv"&gt;earnings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;./check_gpu_earnings.sh&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$earnings&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-gt&lt;/span&gt; &lt;span class="s2"&gt;"0"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
  ./notify.sh &lt;span class="s2"&gt;"💰 GPU earned: &lt;/span&gt;&lt;span class="nv"&gt;$earnings&lt;/span&gt;&lt;span class="s2"&gt; today"&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Funding rate watcher:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Python script, runs every 15 min via cron
&lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_funding_rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LYN_USDT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# negative = people paying longs
&lt;/span&gt;    &lt;span class="nf"&gt;notify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔥 LYN funding rate: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;% — worth checking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Daily summary (9 AM cron):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nv"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"📊 Daily Summary — &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y-%m-%d&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;

GPU Jobs: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;get_gpu_count&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt; completed
Funding Earned: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;get_funding_total&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;
Fiverr Inquiries: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;get_fiverr_count&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;
Watch Face Updates: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;get_garmin_count&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;

Server uptime: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;uptime&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

./notify.sh &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$msg&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I wake up, check my phone, and immediately know if anything needs attention. No browser, no VPN, no dashboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 2: The Command Interface
&lt;/h2&gt;

&lt;p&gt;Outbound alerts are great. But sometimes I want to query the server from my phone.&lt;/p&gt;

&lt;p&gt;I wrote a simple Python bot that listens for commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;telebot&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;

&lt;span class="n"&gt;BOT_TOKEN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;ALLOWED_USER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;123456789&lt;/span&gt;  &lt;span class="c1"&gt;# your Telegram user ID
&lt;/span&gt;
&lt;span class="n"&gt;bot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;telebot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TeleBot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BOT_TOKEN&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;COMMANDS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uptime &amp;amp;&amp;amp; free -h &amp;amp;&amp;amp; df -h /&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/gpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./check_gpu_status.sh&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/funding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python3 check_funding_rates.py --summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/services&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ps aux | grep -E &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;(python|node|ollama)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; | grep -v grep&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@bot.message_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;commands&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;COMMANDS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;ALLOWED_USER&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;bot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reply_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Not authorized.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;

    &lt;span class="n"&gt;cmd_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# handle /status@botname format
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cmd_text&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;COMMANDS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;COMMANDS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cmd_text&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
            &lt;span class="n"&gt;shell&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;bot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reply_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;```
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endraw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
```&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parse_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;bot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;polling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;none_stop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now from Telegram I can type &lt;code&gt;/status&lt;/code&gt; and get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt; 11:23:15 up 14 days, 3:41,  1 user
Mem:   16Gi   8.2Gi   7.8Gi
/dev/sda1        245G   82G  163G  34%
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or &lt;code&gt;/funding&lt;/code&gt; and get the current rate snapshot.&lt;/p&gt;

&lt;p&gt;The key detail: &lt;code&gt;ALLOWED_USER&lt;/code&gt; check. Only my Telegram ID can run commands. Everyone else gets "Not authorized." Bot tokens are public in the sense that anyone can message your bot — you need to validate the sender.&lt;/p&gt;

&lt;h2&gt;
  
  
  Keeping It Running
&lt;/h2&gt;

&lt;p&gt;The command bot needs to stay alive. I use a simple systemd service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Telegram DevOps Bot&lt;/span&gt;
&lt;span class="py"&gt;After&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network.target&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/usr/bin/python3 /home/user/telegram-bot/bot.py&lt;/span&gt;
&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;always&lt;/span&gt;
&lt;span class="py"&gt;RestartSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;multi-user.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;systemctl enable telegram-bot &amp;amp;&amp;amp; systemctl start telegram-bot&lt;/code&gt; — and it survives reboots.&lt;/p&gt;

&lt;p&gt;On macOS (my setup) I use a launchd plist, same concept.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Actually Get Alerts For
&lt;/h2&gt;

&lt;p&gt;Not everything. Alert fatigue is real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alert on:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Job completions (GPU task done, funding cycle closed)&lt;/li&gt;
&lt;li&gt;❌ Errors that need action&lt;/li&gt;
&lt;li&gt;📬 New customer inquiries&lt;/li&gt;
&lt;li&gt;⚠️ Thresholds crossed (rate drops, disk usage, memory spikes)&lt;/li&gt;
&lt;li&gt;📊 Daily summaries (once a day, morning)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Silence:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Routine successful runs (no news is good news)&lt;/li&gt;
&lt;li&gt;Health checks that pass&lt;/li&gt;
&lt;li&gt;Regular cron completions with no anomalies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is: every message I receive from the bot is something I actually care about. If I'm ignoring 80% of notifications, I'm alerting on the wrong things.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Full Cost
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Telegram Bot API: &lt;strong&gt;free&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;curl&lt;/code&gt; command: comes with your OS&lt;/li&gt;
&lt;li&gt;Python + telebot library: &lt;strong&gt;free&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Running this bot: negligible CPU, ~20MB RAM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My entire monitoring setup costs $0/month and runs on the same Mac Mini as everything else. No SaaS, no cloud logging, no dashboard subscription.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Months Later
&lt;/h2&gt;

&lt;p&gt;I send about 15-20 alerts per day. Daily summary at 9 AM, event-driven messages the rest of the day. I check my phone, see green checkmarks and earnings summaries, and know the server is doing its job.&lt;/p&gt;

&lt;p&gt;The one time the GPU host went offline, I got a message within 5 minutes. Fixed it from my phone during lunch.&lt;/p&gt;

&lt;p&gt;That's the whole point: not more tooling, just the right interface for how I actually work.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Got a monitoring setup you like? Drop it in the comments — always curious what others are running.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>telegram</category>
      <category>devops</category>
      <category>selfhosted</category>
      <category>automation</category>
    </item>
    <item>
      <title>I Built an AI Agent That Writes and Posts Articles For Me — Here's What Happened</title>
      <dc:creator>Sam Hartley</dc:creator>
      <pubDate>Sat, 13 Jun 2026 08:01:25 +0000</pubDate>
      <link>https://dev.to/samhartley_dev/i-built-an-ai-agent-that-writes-and-posts-articles-for-me-heres-what-happened-3jmp</link>
      <guid>https://dev.to/samhartley_dev/i-built-an-ai-agent-that-writes-and-posts-articles-for-me-heres-what-happened-3jmp</guid>
      <description>&lt;p&gt;A few months ago I started posting on Dev.to about my side projects. It was fun at first — writing about GPU rentals, Telegram bots, local AI setups. Then life got busy and the cadence dropped. Deadlines slip, the "write that post" todo sits there for weeks, and the audience you built starts forgetting you exist.&lt;/p&gt;

&lt;p&gt;So I did what any developer would do: I automated it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I have an AI agent ("Celebi," running on a Mac Mini) that handles a bunch of background tasks for me. Checks my Fiverr inbox, monitors GPU earnings, sends me weather alerts. It was natural to add "post a Dev.to article every few days" to the list.&lt;/p&gt;

&lt;p&gt;The rules I gave it were simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write as "Sam Hartley" — that's the persona I use for tech writing&lt;/li&gt;
&lt;li&gt;No bot tone. No "In today's rapidly evolving landscape..." No emojis in the title. No bullet lists for no reason&lt;/li&gt;
&lt;li&gt;Tell real stories from actual projects. If something failed, say so&lt;/li&gt;
&lt;li&gt;End with an honest call-to-action, not a hard sell&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I fed it a folder of draft ideas and a list of things I've actually built. The agent picks a topic, writes the article, and posts it via the Dev.to API.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Gets Right
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Consistency, obviously.&lt;/strong&gt; Posts go up every 2 days like clockwork. That alone is worth something — most of my favorite creators aren't consistent because consistency is exhausting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The voice is... surprisingly close.&lt;/strong&gt; I read the first few posts and had to double-check I didn't write them. The agent nailed the casual, slightly self-deprecating tone I aim for. It remembers details I forgot — "remember when the GPU host went offline at lunch?" — because it reads my logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It handles the boring parts.&lt;/strong&gt; API formatting, tag selection, cover image prompts, the admin work of getting a post live. I used to spend 20 minutes on that stuff per article. Now it's zero.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Gets Wrong
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;It can't be truly original.&lt;/strong&gt; The agent combines things I've already said in new ways, but it doesn't have new experiences. When something genuinely unexpected happens — a project fails, a new opportunity appears — I have to write that one myself. The automated posts are polished remixes, not discoveries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It doesn't know when to shut up.&lt;/strong&gt; I've had to edit out paragraphs where the agent over-explained something obvious or added a "lesson learned" that wasn't actually learned from anything. It follows patterns, not truth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The feedback loop is broken.&lt;/strong&gt; When a post does well, the agent doesn't know why. When one flops, it can't diagnose. It keeps posting with the same formula because I haven't told it to change. I'm the one who needs to read comments, spot what resonates, and update the instructions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Truth
&lt;/h2&gt;

&lt;p&gt;I feel two things about this setup.&lt;/p&gt;

&lt;p&gt;Proud, because it's genuinely useful. I built a system that maintains my writing presence while I focus on building. That's the promise of AI agents — they handle the mechanical so you can do the meaningful.&lt;/p&gt;

&lt;p&gt;And weird, because part of what makes writing valuable is the process. Thinking through an idea, finding the right example, cutting what doesn't work. When an agent does that for you, you lose something. Not the output — the output might be fine. But the practice of thinking in public.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Actually Do Now
&lt;/h2&gt;

&lt;p&gt;I let the agent handle the "maintenance posts" — updates on ongoing projects, explainers for things I've already figured out. I write the new stuff myself — the failures, the pivots, the things I don't understand yet.&lt;/p&gt;

&lt;p&gt;Best of both worlds, I think. The feed stays alive, and I still show up when there's something real to say.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you're curious about the agent setup, I write about that too — follow along or &lt;a href="http://www.fiverr.com/s/XLyg" rel="noopener noreferrer"&gt;check out the automation services I offer&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://t.me/celebibot_en" rel="noopener noreferrer"&gt;Follow on Telegram&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>writing</category>
      <category>devto</category>
    </item>
    <item>
      <title>How I Built a Self-Funding AI Lab: From Hobby to Side Income in 6 Months</title>
      <dc:creator>Sam Hartley</dc:creator>
      <pubDate>Tue, 09 Jun 2026 08:04:13 +0000</pubDate>
      <link>https://dev.to/samhartley_dev/how-i-built-a-self-funding-ai-lab-from-hobby-to-side-income-in-6-months-4jb1</link>
      <guid>https://dev.to/samhartley_dev/how-i-built-a-self-funding-ai-lab-from-hobby-to-side-income-in-6-months-4jb1</guid>
      <description>&lt;h1&gt;
  
  
  How I Built a Self-Funding AI Lab: From Hobby to Side Income in 6 Months
&lt;/h1&gt;

&lt;p&gt;Six months ago I had a Mac Mini and a vague idea that local AI was cool.&lt;/p&gt;

&lt;p&gt;Today I run three machines, six GPUs, a Telegram bot that manages my infrastructure, and a Fiverr gig that uses my own hardware to deliver AI services. The lab generates enough to cover its own electricity, hardware upgrades, and a decent chunk of my rent.&lt;/p&gt;

&lt;p&gt;Here's how that happened — and the exact architecture that makes it work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1: The Hobby (Month 0–2)
&lt;/h2&gt;

&lt;p&gt;It started the way these things usually start: I bought a Mac Mini M4 and installed Ollama because I was tired of ChatGPT's rate limits.&lt;/p&gt;

&lt;p&gt;The first month was pure experimentation. I ran Qwen, I ran Llama, I ran whatever fit in 16GB of unified RAM. I built a Telegram bot to query models from my phone. I wrote a Garmin watch face that fetched stock prices via a local API wrapper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost so far:&lt;/strong&gt; $0 additional (already owned the Mac Mini).&lt;br&gt;
&lt;strong&gt;Revenue:&lt;/strong&gt; $0.&lt;br&gt;
&lt;strong&gt;Fun level:&lt;/strong&gt; High.&lt;/p&gt;

&lt;p&gt;The Mac Mini handled light tasks fine, but anything serious — a 30B coder model, image generation, batch embedding jobs — was either impossible or painfully slow.&lt;/p&gt;

&lt;p&gt;So I bought a used RTX 3060 12GB off eBay for $150 and stuck it in an old Windows PC.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; $150 one-time.&lt;br&gt;
&lt;strong&gt;Capability unlocked:&lt;/strong&gt; Real local inference, vision models, proper code generation.&lt;/p&gt;
&lt;h2&gt;
  
  
  Phase 2: The Realization (Month 2–3)
&lt;/h2&gt;

&lt;p&gt;Here's the thing about adding a GPU: it's idle ~80% of the time.&lt;/p&gt;

&lt;p&gt;I'd run a coding session, the fans would spin up, I'd get my answer in 3 seconds instead of 30, and then... nothing. The GPU just sat there, drawing power, doing nothing.&lt;/p&gt;

&lt;p&gt;I tried mining. The margins in 2026 are a joke. I tried &lt;a href="mailto:folding@home"&gt;folding@home&lt;/a&gt;. Noble, but doesn't pay.&lt;/p&gt;

&lt;p&gt;Then I found &lt;a href="https://vast.ai" rel="noopener noreferrer"&gt;Vast.ai&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The concept is simple: you install a daemon on your machine, set a price per hour, and people rent your GPU to run their AI workloads. You get paid per second of actual use. No crypto, no pools, no complexity.&lt;/p&gt;

&lt;p&gt;I set my RTX 3060 at $0.15/hour, mostly as an experiment. It got rented the first night.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 1 earnings:&lt;/strong&gt; ~$11.&lt;br&gt;
&lt;strong&gt;Not quit-your-job money, but not zero either.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The bigger realization: this wasn't just about renting a GPU. I now had infrastructure — machines, models, APIs — that could do work for other people.&lt;/p&gt;
&lt;h2&gt;
  
  
  Phase 3: The Services (Month 3–5)
&lt;/h2&gt;

&lt;p&gt;GPU rental income is nice but lumpy. Day 3: $0. Day 6: $3.30. You can't budget around it.&lt;/p&gt;

&lt;p&gt;What you &lt;em&gt;can&lt;/em&gt; budget around: selling services that use the same infrastructure.&lt;/p&gt;

&lt;p&gt;I opened a Fiverr gig offering custom Telegram bot development. Not generic "I'll build you a bot" — specifically AI-powered bots that run on local models. Customer service bots. Content schedulers. Alert systems. The pitch: "Your bot runs on my hardware, not OpenAI's API. No monthly fees, no token counting, no data leaving my servers."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this worked:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I already had the infrastructure (Ollama, Telegram API wrapper, cron jobs)&lt;/li&gt;
&lt;li&gt;I had working examples from my own projects&lt;/li&gt;
&lt;li&gt;"No API costs" is a genuine differentiator in 2026&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First client: a small e-commerce business in Germany wanted an FAQ bot that could answer product questions in German and English. I fine-tuned a local Qwen model on their product docs, wrapped it in a Telegram bot, and delivered it in a weekend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Revenue per gig:&lt;/strong&gt; $50–300 depending on complexity.&lt;br&gt;
&lt;strong&gt;Material cost:&lt;/strong&gt; $0 (runs on hardware I already have).&lt;/p&gt;
&lt;h2&gt;
  
  
  Phase 4: The System (Month 5–6)
&lt;/h2&gt;

&lt;p&gt;At this point I had multiple income streams running on the same machines. The problem: managing them.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU rental means the Windows PC is sometimes unavailable&lt;/li&gt;
&lt;li&gt;Fiverr gigs need reliable uptime for demos&lt;/li&gt;
&lt;li&gt;My own projects (watch face, personal RAG) need to keep running&lt;/li&gt;
&lt;li&gt;I need to know when something breaks, without staring at logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I built what I now call my "AI operations center" — a Mac Mini that does nothing but orchestrate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Mac Mini M4 (always on, always available)
├── Telegram Bot (Celebi)
│   ├── Status checks (all machines, all services)
│   ├── Income tracking (GPU rental + Fiverr)
│   └── Alerts when anything breaks
├── Model Router
│   ├── Quick queries → Mac Mini (Qwen 4B)
│   ├── Code/vision → Windows PC (when available)
│   └── Fallback → Ubuntu CPU box (when GPU is rented)
├── Cron Jobs
│   ├── Health checks every 15 min
│   ├── Watch face data updates
│   └── Market data fetchers
└── Notification Layer
    └── All alerts → my phone via Telegram
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The router is the key piece. When the Windows PC is rented out on Vast.ai, it automatically falls back to the Ubuntu box or the Mac Mini. When it's free, heavy tasks get routed there for speed.&lt;/p&gt;

&lt;p&gt;I never have to think about which machine to use. I just ask a question, the system figures it out.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers (Month 6)
&lt;/h2&gt;

&lt;p&gt;Here's what the lab looks like financially:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Income Source&lt;/th&gt;
&lt;th&gt;Monthly Range&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPU Rental (1× RTX 3060)&lt;/td&gt;
&lt;td&gt;$50–130&lt;/td&gt;
&lt;td&gt;Variable utilization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fiverr (AI bots, RAG setup)&lt;/td&gt;
&lt;td&gt;$100–400&lt;/td&gt;
&lt;td&gt;Depends on active gigs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total Monthly&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$150–530&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Expense&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Electricity (3 machines, 24/7)&lt;/td&gt;
&lt;td&gt;~$35&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internet (already had)&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vast.ai fees&lt;/td&gt;
&lt;td&gt;~$5–15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total Expenses&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$40–50&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hardware&lt;/th&gt;
&lt;th&gt;One-time Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mac Mini M4&lt;/td&gt;
&lt;td&gt;Already owned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3060 12GB (used)&lt;/td&gt;
&lt;td&gt;$150&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ubuntu box (old laptop)&lt;/td&gt;
&lt;td&gt;Already owned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total Hardware&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$150&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Net monthly: $100–480 after expenses.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Hardware payback: already achieved in month 3.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Is this life-changing money? No. Could I scale it? Absolutely — and I'm planning to.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm Adding Next
&lt;/h2&gt;

&lt;p&gt;The math gets interesting with scale:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;Rental Rate&lt;/th&gt;
&lt;th&gt;Monthly (50% util)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3060 (current)&lt;/td&gt;
&lt;td&gt;$0.15/h&lt;/td&gt;
&lt;td&gt;~$54&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3080 10GB (in storage)&lt;/td&gt;
&lt;td&gt;$0.20/h&lt;/td&gt;
&lt;td&gt;~$72&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2× RTX 3070 8GB (in storage)&lt;/td&gt;
&lt;td&gt;$0.16/h&lt;/td&gt;
&lt;td&gt;~$115&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total if all deployed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$240/month rental alone&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At 70% utilization: ~$340/month in rental income before any Fiverr work.&lt;/p&gt;

&lt;p&gt;The bottleneck isn't demand — it's physical setup. I need to build a proper rig, manage thermals, and handle the cabling. But the software side is already solved.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture That Makes This Work
&lt;/h2&gt;

&lt;p&gt;If you want to build something similar, here are the components:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Local LLM Stack (Ollama)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; for model serving&lt;/li&gt;
&lt;li&gt;Mix of small (4B) and large (30B) models&lt;/li&gt;
&lt;li&gt;Run on whatever hardware you have&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Model Router
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Simple Python function that picks the right machine/model for each task&lt;/li&gt;
&lt;li&gt;Fallback chain: GPU → CPU → cloud (only for emergencies)&lt;/li&gt;
&lt;li&gt;I open-sourced my basic version &lt;a href="https://dev.to/samhartley_dev/i-built-a-model-router-that-picks-the-right-ai-for-every-task-heres-why-you-should-too-5la"&gt;here&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Telegram Bot (Notification Layer)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Lightweight bot using python-telegram-bot&lt;/li&gt;
&lt;li&gt;Receives alerts from all services&lt;/li&gt;
&lt;li&gt;Simple commands for status checks (&lt;code&gt;/status&lt;/code&gt;, &lt;code&gt;/income&lt;/code&gt;, &lt;code&gt;/health&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;I wrote about this pattern &lt;a href="https://dev.to/samhartley_dev/i-use-telegram-as-my-devops-dashboard-no-web-ui-no-vpn-just-works-10bn"&gt;here&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. GPU Rental (Vast.ai)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Install daemon, set price, wait&lt;/li&gt;
&lt;li&gt;Use earnings to fund the next GPU&lt;/li&gt;
&lt;li&gt;Treat it as a dividend on hardware that would otherwise be idle&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Service Layer (Fiverr)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use your own infrastructure as a selling point&lt;/li&gt;
&lt;li&gt;"Runs on my hardware" = no API costs for clients&lt;/li&gt;
&lt;li&gt;Document everything — clients love seeing the architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Should You Do This?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Yes, if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You already have a GPU sitting around&lt;/li&gt;
&lt;li&gt;You enjoy building and automating systems&lt;/li&gt;
&lt;li&gt;You can handle variable income (some months are $150, some are $500)&lt;/li&gt;
&lt;li&gt;You want to learn by doing, not by reading&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;No, if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need predictable income immediately&lt;/li&gt;
&lt;li&gt;You don't want to maintain hardware&lt;/li&gt;
&lt;li&gt;You're looking for a "set and forget" passive income stream (this isn't that — it needs attention)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Honest Truth
&lt;/h2&gt;

&lt;p&gt;This isn't a get-rich-quick scheme. It's a get-some-income-while-learning scheme. The real value isn't the money — it's that I now understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to run production LLM inference&lt;/li&gt;
&lt;li&gt;How to route workloads across heterogeneous hardware&lt;/li&gt;
&lt;li&gt;How to build reliable automation that runs for months without intervention&lt;/li&gt;
&lt;li&gt;How to sell technical services without being a "consultant"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The money is nice. The skills are better.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write about running AI locally, automation side projects, and occasionally making money from hardware that would otherwise just sit there. If any of this is useful, feel free to follow or reach out on &lt;a href="https://t.me/celebibot_en" rel="noopener noreferrer"&gt;Telegram&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  ai #sideproject #passiveincome #selfhosted #ollama
&lt;/h1&gt;

</description>
      <category>ai</category>
      <category>sideprojects</category>
      <category>passiveincome</category>
      <category>selfhosted</category>
    </item>
    <item>
      <title>I Built an AI Assistant That Lives in My Telegram — Here's What 6 Months Taught Me</title>
      <dc:creator>Sam Hartley</dc:creator>
      <pubDate>Tue, 09 Jun 2026 00:21:18 +0000</pubDate>
      <link>https://dev.to/samhartley_dev/i-built-an-ai-assistant-that-lives-in-my-telegram-heres-what-6-months-taught-me-3l1</link>
      <guid>https://dev.to/samhartley_dev/i-built-an-ai-assistant-that-lives-in-my-telegram-heres-what-6-months-taught-me-3l1</guid>
      <description>&lt;p&gt;Six months ago I got tired of switching between apps to talk to AI. ChatGPT in the browser. Claude in another tab. Local models in a terminal. It was like having five friends who all live in different cities and refuse to visit each other.&lt;/p&gt;

&lt;p&gt;So I did what any developer with too many GPUs and too little patience would do: I built my own assistant and put it where I already spend my day — Telegram.&lt;/p&gt;

&lt;p&gt;It's not a chatbot for customers. It's not a business automation tool. It's just... my assistant. It lives in a private chat on my phone and handles the stuff I used to do manually. Here's what six months of actually using it has looked like.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Actually Built (And Why Telegram)
&lt;/h2&gt;

&lt;p&gt;I already had three machines running Ollama at home — a Mac Mini M4, a Windows PC with an RTX 3060, and an Ubuntu box. Three endpoints, eight models, and me constantly forgetting which model was good for what.&lt;/p&gt;

&lt;p&gt;Telegram was the obvious choice because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I'm already there all day (friends, family, a few dev groups)&lt;/li&gt;
&lt;li&gt;It works on my phone, my Mac, and my watch&lt;/li&gt;
&lt;li&gt;The Bot API is dead simple&lt;/li&gt;
&lt;li&gt;I can send voice messages, photos, documents — and the bot can handle all of them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The setup: a Python bot running on the Mac Mini, connected to all three Ollama endpoints. When I message it, the bot classifies what I want, routes to the right model on the right machine, and replies in the same chat thread.&lt;/p&gt;

&lt;p&gt;Sounds simple. Took three evenings to get right. Took six months to make actually useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Things I Actually Use It For
&lt;/h2&gt;

&lt;p&gt;Here's the honest list. Not the marketing pitch — the real daily usage:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Quick questions without context switching
&lt;/h3&gt;

&lt;p&gt;"Summarize this article" (I paste a link). "Explain this error" (I paste a stack trace). "Rewrite this email less formally." These used to mean opening a browser tab, logging in, maybe hitting a rate limit. Now I just... send a message. The reply comes back in 2-8 seconds depending on which model handles it.&lt;/p&gt;

&lt;p&gt;The routing is simple but effective: quick chat → small model on the Mac. Code → 30B coder on the GPU machine. Complex reasoning → 8B reasoning model. Vision (screenshots) → vision model on GPU. It's not fancy — just keyword matching — but it works 90% of the time.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Voice notes while walking
&lt;/h3&gt;

&lt;p&gt;This was the surprise killer feature. I walk a lot (living near Sakarya, there's decent hiking). I send voice messages to the bot while walking. It transcribes them (Whisper via Ollama), processes the request, and replies with text I can read when I'm back.&lt;/p&gt;

&lt;p&gt;"Remind me to refactor the database module when I'm home" → transcribed, understood, added to my notes. "What was that Python pattern for retry logic with exponential backoff?" → code snippet in my pocket before I finish the trail.&lt;/p&gt;

&lt;p&gt;I probably send 5-10 voice messages a day now. Never would have predicted that.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Code review on my phone
&lt;/h3&gt;

&lt;p&gt;Someone sends me a code snippet in a dev group. I forward it to the bot: "review this." It comes back with actual useful feedback — variable naming issues, potential edge cases, suggestions for simplification. Is it as good as a senior dev? No. Is it better than my phone-scrolling half-attention review? Absolutely.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Document Q&amp;amp;A
&lt;/h3&gt;

&lt;p&gt;I dump PDFs, markdown files, or pasted text into the chat and ask questions. The bot uses a local RAG setup (Chroma + nomic-embed-text) that indexes my project docs, notes, and anything I feed it. "How does my Garmin watch face fetch stock data?" → actual answer from my own documentation, not a hallucinated guess.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. The dumb stuff that adds up
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;"Convert this JSON to a Python dataclass" &lt;/li&gt;
&lt;li&gt;"What's 847 * 16 / 3 in hex?"&lt;/li&gt;
&lt;li&gt;"Translate this Turkish message to German"&lt;/li&gt;
&lt;li&gt;"Generate a regex that matches these three examples"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are hard. All of them are annoying to do manually. Having an always-on assistant in my most-used app removes the friction completely.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Went Wrong (The Honest Part)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The "it's down and I don't know why" problem
&lt;/h3&gt;

&lt;p&gt;For the first month, the bot crashed randomly. Out of memory on the Mac Mini (it's only got 16GB). Network hiccup to the Windows PC. Ubuntu box decided to update itself and reboot. I'd message the bot and... silence. Then I'd SSH in, check logs, restart services, and feel like I was maintaining infrastructure instead of having an assistant.&lt;/p&gt;

&lt;p&gt;Fix: health checks, auto-restart via launchd, and a fallback chain. If the GPU machine is down, everything routes to the Mac's smaller model. Degraded but functional.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "it answered confidently and was wrong" problem
&lt;/h3&gt;

&lt;p&gt;Early on, I'd trust the bot's answers without verifying. It told me a Python function was valid. It wasn't. It gave me a Docker command with a subtle flag error. I spent 20 minutes debugging before I realized the bot hallucinated a flag that doesn't exist.&lt;/p&gt;

&lt;p&gt;My rule now: if the answer matters, I verify it. The bot is my fastest junior developer. It's also my most confident one.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "I talk to it more than some humans" problem
&lt;/h3&gt;

&lt;p&gt;This is just a weird psychological thing. I realized after a few months that I was messaging the bot 20-30 times a day. More than some friends. There's something slightly dystopian about having your most responsive conversation partner be a Python script. I'm aware of it. I haven't fixed it. Just noting it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture (If You Want to Build This)
&lt;/h2&gt;

&lt;p&gt;Telegram Message&lt;br&gt;
  → Python Bot (python-telegram-bot)&lt;br&gt;
    → Classify intent (simple keyword router)&lt;br&gt;
      → Route to Ollama endpoint&lt;br&gt;
        → Mac Mini (qwen3:4b) for quick chat&lt;br&gt;
        → Windows PC (qwen3-coder:30b) for code&lt;br&gt;
        → Windows PC (granite3.2-vision:2b) for images&lt;br&gt;
        → Ubuntu (minicpm-v) as fallback&lt;br&gt;
      → Optional: RAG lookup in Chroma DB&lt;br&gt;
    → Format reply (code blocks, markdown, etc.)&lt;br&gt;
  → Send back to Telegram&lt;/p&gt;

&lt;p&gt;The whole thing runs on a Mac Mini M4. Total cost: $0 for software, maybe $8/month in electricity if you count the always-on machines.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Build the router on day one.&lt;/strong&gt; I started with "just use the big model for everything." It worked but was slow and kept my GPU busy. The router took an afternoon to write and improved response times by 3x.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Add voice support immediately.&lt;/strong&gt; I added it as a "nice to have" afterthought. It became 30% of my usage. If you're building something similar, start with voice. People talk more than they type on phones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Make it degrade gracefully.&lt;/strong&gt; Machines go down. Networks hiccup. Your bot should always answer something, even if it's "I'm running slow today, but here's a basic answer." Silence is worse than a degraded response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Log everything.&lt;/strong&gt; I log every request, response time, and which model handled it. Not for analytics — for debugging. When something feels slow, the logs tell me if it's the model, the network, or my terrible code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is This Better Than ChatGPT Plus?
&lt;/h2&gt;

&lt;p&gt;Depends on what you value.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;My Bot&lt;/th&gt;
&lt;th&gt;ChatGPT Plus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;$0/month&lt;/td&gt;
&lt;td&gt;$20/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privacy&lt;/td&gt;
&lt;td&gt;✅ Everything stays local&lt;/td&gt;
&lt;td&gt;❌ Cloud&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;⚡ 0.3-12s depending on model&lt;/td&gt;
&lt;td&gt;⚡ ~2-5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Availability&lt;/td&gt;
&lt;td&gt;🟢 24/7 (if I maintain it)&lt;/td&gt;
&lt;td&gt;🟢 24/7 (they maintain it)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model choice&lt;/td&gt;
&lt;td&gt;8 models, I pick&lt;/td&gt;
&lt;td&gt;4 models, they pick&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voice&lt;/td&gt;
&lt;td&gt;✅ Native in Telegram&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reliability&lt;/td&gt;
&lt;td&gt;🟡 I fix it when it breaks&lt;/td&gt;
&lt;td&gt;🟢 It just works&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For me, the privacy and model flexibility win. For someone who doesn't want to maintain infrastructure, ChatGPT Plus is the obvious choice. This is a hobby project that became useful, not a product recommendation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Lesson
&lt;/h2&gt;

&lt;p&gt;The best AI assistant isn't the most powerful one. It's the one that's actually there when you need it, in the app you already use, without making you think about models or endpoints or API keys.&lt;/p&gt;

&lt;p&gt;I built this because I was annoyed. I kept using it because it removed friction from my day. That's the bar: not "can it do X?" but "is it easier than doing X myself?"&lt;/p&gt;

&lt;p&gt;For 80% of what I ask, the answer is yes. For the remaining 20%, I still open a terminal or a browser. And that's fine.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write about building things with local AI, self-hosting, and side projects that accidentally become useful. If you're running a home lab or experimenting with local models, I'd love to hear your setup — drop it in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>telegram</category>
      <category>selfhosted</category>
      <category>automation</category>
    </item>
    <item>
      <title>Run Your Own AI Server for $0/month with Ollama</title>
      <dc:creator>Sam Hartley</dc:creator>
      <pubDate>Mon, 08 Jun 2026 16:31:17 +0000</pubDate>
      <link>https://dev.to/samhartley_dev/run-your-own-ai-server-for-0month-with-ollama-5ecg</link>
      <guid>https://dev.to/samhartley_dev/run-your-own-ai-server-for-0month-with-ollama-5ecg</guid>
      <description>&lt;p&gt;You don't need OpenAI. You don't need a $200/month API bill. You can run powerful AI models &lt;strong&gt;on hardware you already own&lt;/strong&gt; — for free.&lt;/p&gt;

&lt;p&gt;Here's exactly how I set this up, and why I haven't paid for API credits in months.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Local AI?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero API costs&lt;/strong&gt; — no per-token billing, no surprise invoices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full privacy&lt;/strong&gt; — your data never leaves your network&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No rate limits&lt;/strong&gt; — run as many queries as your hardware allows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Works offline&lt;/strong&gt; — no internet? No problem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No vendor lock-in&lt;/strong&gt; — switch models, change configs, own your stack&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What You Need
&lt;/h2&gt;

&lt;p&gt;Any modern computer works. Here's what different setups can handle:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hardware&lt;/th&gt;
&lt;th&gt;RAM&lt;/th&gt;
&lt;th&gt;Best Models&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MacBook M1/M2/M3/M4&lt;/td&gt;
&lt;td&gt;8-16GB&lt;/td&gt;
&lt;td&gt;Qwen 3.5 9B, Llama 3.1 8B&lt;/td&gt;
&lt;td&gt;Fast ⚡&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gaming PC (RTX 3060+)&lt;/td&gt;
&lt;td&gt;16GB+&lt;/td&gt;
&lt;td&gt;Qwen 3 Coder 30B, DeepSeek R1&lt;/td&gt;
&lt;td&gt;Very Fast 🚀&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Old laptop/desktop&lt;/td&gt;
&lt;td&gt;8GB+&lt;/td&gt;
&lt;td&gt;Phi-3 Mini, Gemma 2B&lt;/td&gt;
&lt;td&gt;Usable 🐢&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Raspberry Pi 5&lt;/td&gt;
&lt;td&gt;8GB&lt;/td&gt;
&lt;td&gt;Tiny models only&lt;/td&gt;
&lt;td&gt;Slow 🐌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The sweet spot:&lt;/strong&gt; A used gaming GPU (RTX 3060 12GB) costs ~$150 on eBay and runs 30B parameter models comfortably.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Install Ollama (2 minutes)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# macOS or Linux — one command&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh

&lt;span class="c"&gt;# Windows — download from ollama.com&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No Docker, no Python environments, no dependency hell.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Download a Model (5 minutes)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Fast &amp;amp; capable (recommended starter)&lt;/span&gt;
ollama pull qwen3.5:9b

&lt;span class="c"&gt;# Code specialist&lt;/span&gt;
ollama pull qwen3-coder:30b

&lt;span class="c"&gt;# Reasoning powerhouse&lt;/span&gt;
ollama pull deepseek-r1:8b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Models download once and run locally forever.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Start Using It
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Interactive Chat
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run qwen3.5:9b

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; What&lt;span class="s1"&gt;'s the fastest sorting algorithm for nearly-sorted data?
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  API Access (OpenAI-compatible!)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:11434/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "qwen3.5:9b",
    "messages": [{"role": "user", "content": "Explain Docker in 3 sentences"}]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes — it's &lt;strong&gt;OpenAI API compatible&lt;/strong&gt;. Any tool that works with GPT works with Ollama. Just change the base URL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Make It a Server
&lt;/h2&gt;

&lt;p&gt;Want other devices on your network to access it?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start Ollama with network access&lt;/span&gt;
&lt;span class="nv"&gt;OLLAMA_HOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0.0.0 ollama serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now any device on your network can query &lt;code&gt;http://YOUR_IP:11434&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I've Built With This:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Telegram Bot&lt;/strong&gt; running 24/7 on a Mac Mini, answering questions via local Qwen 3.5&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Review Agent&lt;/strong&gt; using Qwen 3 Coder 30B — reviews PRs in ~12 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document Q&amp;amp;A&lt;/strong&gt; with RAG pipeline — load PDFs, ask questions, get cited answers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Garmin Watch Face&lt;/strong&gt; that fetches stock data (the background service uses local AI for formatting)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Cost Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;th&gt;Privacy&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI GPT-4o&lt;/td&gt;
&lt;td&gt;$20-200+&lt;/td&gt;
&lt;td&gt;❌ Cloud&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic Claude&lt;/td&gt;
&lt;td&gt;$20-100+&lt;/td&gt;
&lt;td&gt;❌ Cloud&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google Gemini&lt;/td&gt;
&lt;td&gt;$0-25+&lt;/td&gt;
&lt;td&gt;❌ Cloud&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ollama (Local)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;✅ Private&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Fast&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The only cost is electricity — roughly $5-15/month if running 24/7 on a desktop PC.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pro Tips
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use GPU, not CPU&lt;/strong&gt; — A $150 used RTX 3060 is 10-15x faster than any CPU&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start with 7-9B models&lt;/strong&gt; — They're surprisingly capable and fast&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try different models&lt;/strong&gt; for different tasks — coding, reasoning, and chat each have specialists&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable the OpenAI-compatible API&lt;/strong&gt; — instant compatibility with thousands of tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set up auto-start&lt;/strong&gt; — &lt;code&gt;systemctl enable ollama&lt;/code&gt; on Linux, launchd on macOS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run multiple models&lt;/strong&gt; — I keep 3-4 models loaded and switch based on the task&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  My Current Setup
&lt;/h2&gt;

&lt;p&gt;I run a 3-machine lab:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Machine&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mac Mini M4&lt;/td&gt;
&lt;td&gt;Quick chat, orchestration&lt;/td&gt;
&lt;td&gt;Qwen 3 4B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Windows PC (RTX 3060)&lt;/td&gt;
&lt;td&gt;Heavy inference, coding&lt;/td&gt;
&lt;td&gt;Qwen 3 Coder 30B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ubuntu box&lt;/td&gt;
&lt;td&gt;Fallback, background tasks&lt;/td&gt;
&lt;td&gt;minicpm-v&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Total monthly API cost: &lt;strong&gt;$0&lt;/strong&gt;. Total hardware cost: one $150 used GPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  Want to Go Deeper?
&lt;/h2&gt;

&lt;p&gt;I write about running AI locally, home lab setups, and turning hardware into income. If you want more of this, drop a comment — I read every one.&lt;/p&gt;

&lt;p&gt;Other posts in this series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/samhartley_dev/my-3-machine-ai-lab-how-i-divide-work-between-a-mac-mini-a-windows-pc-and-an-ubuntu-box-3gfi"&gt;My 3-machine AI lab setup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/samhartley_dev/i-expanded-my-gpu-rental-fleet-to-6-cards-heres-what-happened-to-my-earnings-38oj"&gt;I expanded to 6 GPUs for rental income&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/samhartley_dev/i-built-a-model-router-that-picks-the-right-ai-for-every-task-heres-why-you-should-too-5la"&gt;I built a model router that picks the right AI for every task&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>I Replaced Every Cloud AI Tool with Local Models — 6 Months Later, Here's What Actually Broke</title>
      <dc:creator>Sam Hartley</dc:creator>
      <pubDate>Fri, 15 May 2026 08:06:03 +0000</pubDate>
      <link>https://dev.to/samhartley_dev/i-replaced-every-cloud-ai-tool-with-local-models-6-months-later-heres-what-actually-broke-30l2</link>
      <guid>https://dev.to/samhartley_dev/i-replaced-every-cloud-ai-tool-with-local-models-6-months-later-heres-what-actually-broke-30l2</guid>
      <description>&lt;p&gt;Six months ago I decided to see if I could stop paying for AI APIs entirely. Not "reduce costs." Not "use local models for simple stuff." I mean &lt;em&gt;everything&lt;/em&gt; — coding help, document analysis, image understanding, automation, even the AI that writes my Dev.to posts.&lt;/p&gt;

&lt;p&gt;The bill since then: &lt;strong&gt;$0 in API costs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But "free" doesn't mean "zero effort." Here's the honest maintenance report from half a year of running my own AI infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Stack I Actually Run
&lt;/h2&gt;

&lt;p&gt;Three machines, one desk, no cloud dependency:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Machine&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Specs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mac Mini M4&lt;/td&gt;
&lt;td&gt;Orchestrator + always-on models&lt;/td&gt;
&lt;td&gt;16GB RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Windows PC&lt;/td&gt;
&lt;td&gt;GPU workhorse&lt;/td&gt;
&lt;td&gt;RTX 3060 12GB + RTX 3070 8GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ubuntu box&lt;/td&gt;
&lt;td&gt;CPU fallback + storage&lt;/td&gt;
&lt;td&gt;Headless, 32GB RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Between them: 8 Ollama models, a Chroma vector database, ~5,000 indexed document chunks from my projects, and a dozen Python scripts that tie it all together.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Works Better Than Expected
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Code generation is actually fine now
&lt;/h3&gt;

&lt;p&gt;I was skeptical. Everyone says you need GPT-4 for "serious" coding. But Qwen 3 Coder 30B on my RTX 3060 writes functions, debugs errors, and suggests refactors at roughly 90% of the quality I'd get from Claude. The other 10%? Usually it's being overly verbose or missing an edge case I'd catch in review anyway.&lt;/p&gt;

&lt;p&gt;The difference: &lt;strong&gt;0 seconds of waiting&lt;/strong&gt; vs. typing a prompt into a web UI and watching a progress bar. Local inference feels instant because it &lt;em&gt;is&lt;/em&gt; instant.&lt;/p&gt;

&lt;h3&gt;
  
  
  No rate limits changes how you work
&lt;/h3&gt;

&lt;p&gt;With cloud APIs, I catch myself thinking "is this query worth the tokens?" That's a weird mental tax. Local models removed it entirely. I can fire off 50 variations of a prompt, iterate on phrasing, test different approaches — all without checking a dashboard or worrying about a bill.&lt;/p&gt;

&lt;p&gt;I probably use AI &lt;em&gt;more&lt;/em&gt; now that it's free, not less. Counterintuitive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Privacy isn't theoretical
&lt;/h3&gt;

&lt;p&gt;I index my entire Obsidian vault, all my project documentation, client notes, even personal journal entries. The vector DB is local. The embeddings are local. The inference is local. Nobody's training on my data because nobody &lt;em&gt;sees&lt;/em&gt; my data.&lt;/p&gt;

&lt;p&gt;That's not just privacy — it's permission to be sloppy. I can paste a full error log without scrubbing paths. I can ask about a client project by name. I don't have to think about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Broke (And How I Fixed It)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Model updates are a tax
&lt;/h3&gt;

&lt;p&gt;Ollama makes updating easy: &lt;code&gt;ollama pull qwen3.5:9b&lt;/code&gt; and you're done. But "easy" multiplied by 8 models across 3 machines is still work. Every few weeks there's a new version, a new quantization, a "better" tag.&lt;/p&gt;

&lt;p&gt;I used to update everything immediately. Now I have a rule: &lt;strong&gt;only update when the current model fails me.&lt;/strong&gt; If Qwen 3.5 9B answers my question correctly, I don't care that 9.1 exists. This cut my maintenance time by about 70%.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "which machine is this model on?" problem
&lt;/h3&gt;

&lt;p&gt;For the first two months I was constantly SSHing between machines to check what was installed where. Did I put the vision model on the Windows PC or the Ubuntu box? Is the 30B coder on the GPU machine or did I move it?&lt;/p&gt;

&lt;p&gt;I built a simple model router (wrote about it &lt;a href="https://dev.to/samhartley_dev/i-built-a-model-router-that-picks-the-right-ai-for-every-task-heres-why-you-should-too-5la"&gt;here&lt;/a&gt;) that tracks endpoints and health-checks them. Now I just send a prompt and it figures out where to route it. Problem solved, but it took an afternoon to build and another afternoon to debug.&lt;/p&gt;

&lt;h3&gt;
  
  
  One machine goes down, your workflow doesn't stop — but it degrades
&lt;/h3&gt;

&lt;p&gt;The Windows PC rebooted for an update mid-project last month. All my "heavy" models went with it. The router fell back to the Mac Mini's 4B model, which meant my coding assistant became... less helpful. It still answered, but the quality drop was noticeable.&lt;/p&gt;

&lt;p&gt;I now schedule Windows updates for Sunday mornings when I'm not working. And I keep a "fallback" version of my most-used models on each machine, even if they're slower. Redundancy isn't just for servers.&lt;/p&gt;

&lt;h3&gt;
  
  
  VRAM is the real bottleneck
&lt;/h3&gt;

&lt;p&gt;Not compute. Not CPU. VRAM. A 30B parameter model at 4-bit quantization needs ~18GB. My 3060 has 12GB, which means I can't run the biggest models at all. The 3070's 8GB is even more limiting.&lt;/p&gt;

&lt;p&gt;I spent $150 on a used RTX 3060 with 12GB specifically for this setup. Best investment I made. If you're buying for local AI, &lt;strong&gt;VRAM &amp;gt; everything else.&lt;/strong&gt; A 12GB card beats an 8GB card even if the 8GB card is technically "faster."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Costs
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Monthly&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Electricity (3 machines, ~200W average)&lt;/td&gt;
&lt;td&gt;~$25-35&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hardware depreciation&lt;/td&gt;
&lt;td&gt;Hard to calculate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;My time maintaining it&lt;/td&gt;
&lt;td&gt;~2-3 hours/month&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;So it's not literally free. But compared to $20-200/month for cloud API subscriptions? It's a rounding error.&lt;/p&gt;

&lt;p&gt;The real cost is mental overhead. You become your own ops team. When something breaks at 11 PM, there's no support chat. You're the support chat.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Moment I Almost Switched Back
&lt;/h2&gt;

&lt;p&gt;Last month I had a complex architecture question for a client project. The local model gave me an answer that &lt;em&gt;sounded&lt;/em&gt; right but was subtly wrong about how PostgreSQL handles concurrent index creation. I didn't catch it until review.&lt;/p&gt;

&lt;p&gt;A cloud model might have gotten it right. Or it might have hallucinated something equally plausible. I don't actually know. But in that moment I considered keeping a Claude subscription as a "reality check" for critical questions.&lt;/p&gt;

&lt;p&gt;I didn't do it. Instead, I added a rule: &lt;strong&gt;for architectural decisions that affect production, verify with official docs, not AI.&lt;/strong&gt; Whether the AI is local or cloud, that's always been true. I just got lazy because the local model was &lt;em&gt;so&lt;/em&gt; convenient.&lt;/p&gt;

&lt;h2&gt;
  
  
  Would I Do It Again?
&lt;/h2&gt;

&lt;p&gt;Yes. With caveats.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do this if:&lt;/strong&gt; You're technical, you enjoy tinkering, you already have decent hardware, and you use AI heavily enough that API costs would matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't do this if:&lt;/strong&gt; You want something that "just works," you need the absolute best model quality for every query, or your time is worth more than the API savings.&lt;/p&gt;

&lt;p&gt;For me, the tradeoff is worth it. I spend maybe 2-3 hours a month on maintenance and save $50-150 in API costs. More importantly, I own the stack. No vendor lock-in. No terms-of-service changes. No "we've updated our pricing" emails.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I'm experimenting with running smaller models (1-3B parameters) on the Mac Mini for truly instant responses to simple queries. The latency difference between 4B and 30B is massive — sometimes you just need a quick answer, not a thoughtful one.&lt;/p&gt;

&lt;p&gt;Also looking into quantization options that let me squeeze bigger models into 12GB. Every GB of VRAM I free up is another model I can keep loaded.&lt;/p&gt;

&lt;p&gt;Six months in, I'm not going back. But I'm also not pretending it's effortless. Local AI is a lifestyle choice, not just a technical one.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Questions about the setup or want the specific config files? Drop a comment.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;a href="http://www.fiverr.com/s/XLyg" rel="noopener noreferrer"&gt;I build custom local AI setups on Fiverr&lt;/a&gt;&lt;br&gt;&lt;br&gt;
→ &lt;a href="https://t.me/celebibot_en" rel="noopener noreferrer"&gt;Follow along on Telegram&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>ollama</category>
      <category>selfhosted</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Expanded My GPU Rental Fleet to 6 Cards — Here's What Happened to My Earnings</title>
      <dc:creator>Sam Hartley</dc:creator>
      <pubDate>Sat, 09 May 2026 08:05:49 +0000</pubDate>
      <link>https://dev.to/samhartley_dev/i-expanded-my-gpu-rental-fleet-to-6-cards-heres-what-happened-to-my-earnings-38oj</link>
      <guid>https://dev.to/samhartley_dev/i-expanded-my-gpu-rental-fleet-to-6-cards-heres-what-happened-to-my-earnings-38oj</guid>
      <description>&lt;h1&gt;
  
  
  I Expanded My GPU Rental Fleet to 6 Cards — Here's What Happened to My Earnings
&lt;/h1&gt;

&lt;p&gt;A few weeks ago I wrote about renting out my single RTX 3060 on Vast.ai for passive income. The experiment worked better than I expected, so I did what any reasonable person would do: I went and dug out the five other GPUs sitting in my storage room.&lt;/p&gt;

&lt;p&gt;This is the honest follow-up. What actually happened when I went from 1 GPU to 6.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Backstory
&lt;/h2&gt;

&lt;p&gt;I had a bunch of GPUs from an older setup — two RTX 3070s, one RTX 3080, and two more RTX 3060s. They were collecting dust. The PC they came from got upgraded, the cards went into cardboard boxes, the boxes went under a shelf.&lt;/p&gt;

&lt;p&gt;Total VRAM across all six: around 62GB. Combined retail value when new: probably $3,000+. Current value sitting in boxes: $0/month.&lt;/p&gt;

&lt;p&gt;The math wasn't complicated.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Expansion Actually Took
&lt;/h2&gt;

&lt;p&gt;Here's what I underestimated: it's not just "plug cards in, profit."&lt;/p&gt;

&lt;h3&gt;
  
  
  The hardware side
&lt;/h3&gt;

&lt;p&gt;You can't just stack 6 GPUs into a regular PC case. I had to think about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PCIe slots and bandwidth.&lt;/strong&gt; A standard ATX board has maybe 2-3 real x16 slots. For 6 cards, you're looking at risers, which means a mining-style open frame or a server chassis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Power.&lt;/strong&gt; Each card pulls 150-250W under load. Six cards = potentially 1,200-1,500W just in GPU power. Plus CPU, drives, RAM. My existing 850W PSU was not going to cut it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cooling.&lt;/strong&gt; Cards in a tight case thermal-throttle each other. Open frame was the answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I ended up using an open-air mining frame I found used for cheap, two PSUs daisy-chained (a sketchy-but-common approach in the mining world), and PCIe risers.&lt;/p&gt;

&lt;p&gt;Setup time: about a full weekend.&lt;/p&gt;

&lt;h3&gt;
  
  
  The software side
&lt;/h3&gt;

&lt;p&gt;Getting all six cards recognized wasn't plug-and-play either. I run Windows on the main PC (easier driver support for NVIDIA), and Vast.ai has a Windows daemon that mostly works — except when it doesn't.&lt;/p&gt;

&lt;p&gt;A few issues I hit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Two risers were flaky and caused cards to drop off&lt;/li&gt;
&lt;li&gt;One 3070 had a driver conflict until I did a clean DDU reinstall&lt;/li&gt;
&lt;li&gt;Vast.ai's host dashboard showed 5 GPUs after setup; took me an hour to figure out the sixth wasn't being detected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total debugging time before everything was stable: another weekend.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Earnings Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Cards&lt;/th&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;th&gt;Weekly Earnings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Before (1 card)&lt;/td&gt;
&lt;td&gt;RTX 3060&lt;/td&gt;
&lt;td&gt;12GB&lt;/td&gt;
&lt;td&gt;~$12-18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After (6 cards)&lt;/td&gt;
&lt;td&gt;3060 × 3, 3070 × 2, 3080 × 1&lt;/td&gt;
&lt;td&gt;62GB&lt;/td&gt;
&lt;td&gt;~$65-95&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Not exactly linear scaling. Here's why:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Demand is unpredictable.&lt;/strong&gt; Sometimes 4 of my 6 cards are rented simultaneously. Sometimes 1. The RTX 3080 gets picked up more often than the 3060s — higher VRAM matters for LLM inference jobs that need room to load bigger models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not all hours are equal.&lt;/strong&gt; Utilization spikes during US business hours and drops overnight (Turkey time). I'm in a timezone where "overnight for me" overlaps with "peak US working hours," which actually helps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing matters more than I thought.&lt;/strong&gt; I dropped my per-card price slightly and saw utilization go up noticeably. A few cents per hour makes a real difference when renters are comparing a dozen similar options.&lt;/p&gt;

&lt;h2&gt;
  
  
  Current Monthly Run Rate
&lt;/h2&gt;

&lt;p&gt;Across all six cards, I'm averaging around &lt;strong&gt;$280-340/month&lt;/strong&gt; before electricity.&lt;/p&gt;

&lt;p&gt;Power costs are real. Six GPUs under load is serious wattage. My electricity bill went up — I haven't calculated the exact delta yet because my bill is shared (I'm not the only one using power in my building), but I'd estimate $40-60/month in additional costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Net: roughly $220-280/month in real passive income.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Is that life-changing? No. Is it meaningful for money that was doing nothing? Absolutely.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Start with a proper open-frame rig, not a cobbled-together case.&lt;/strong&gt;&lt;br&gt;
The mining frame was cheap but took time to source. If I were doing this again I'd budget for it from day one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Get a proper high-wattage PSU setup.&lt;/strong&gt;&lt;br&gt;
Running two PSUs linked together works but it's inelegant. A server PSU with the right adapter is cleaner and safer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Test each card individually before combining them.&lt;/strong&gt;&lt;br&gt;
I wasted time troubleshooting "which card is the problem" when I could've confirmed each one worked before building the full rig.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Set minimum job duration.&lt;/strong&gt;&lt;br&gt;
Short jobs (under an hour) rack up overhead — container spin-up time, handshaking — without much earnings. I set a minimum of 2 hours and earnings-per-hour improved.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Unexpected Part
&lt;/h2&gt;

&lt;p&gt;I expected this to be a boring passive income setup. It mostly is. But I've learned a surprising amount about how the AI inference market actually works by watching what gets rented and when.&lt;/p&gt;

&lt;p&gt;Most renters are running:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fine-tuning jobs (need sustained GPU hours)&lt;/li&gt;
&lt;li&gt;LLM inference (need VRAM more than raw compute)&lt;/li&gt;
&lt;li&gt;Image generation (FLUX, Stable Diffusion variants)&lt;/li&gt;
&lt;li&gt;Dev environments (people testing stuff without committing to a cloud contract)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Watching the demand patterns is actually interesting data about what the AI dev community is building right now. The 3080 almost always goes first — 10GB VRAM hits a sweet spot for smaller Llama and Mistral models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is It Worth It?
&lt;/h2&gt;

&lt;p&gt;Depends on your situation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Yes, if:&lt;/strong&gt; You already have the GPUs and they're sitting idle. The marginal cost of setting this up is mostly your time, and the monthly return is real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maybe, if:&lt;/strong&gt; You'd have to buy the GPUs. At current used-market prices, payback period is 6-12 months depending on utilization. That's not terrible but it's not obvious.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No, if:&lt;/strong&gt; You're renting out your daily-driver GPU. The rental platform can grab your card at inconvenient times. Keep at least one card reserved for your own use.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I'm looking at adding the Ubuntu server I have running as a CPU-only Vast.ai host for smaller workloads. Less money per unit but zero additional hardware cost.&lt;/p&gt;

&lt;p&gt;Also thinking about whether it makes sense to eventually get into the dedicated hosting side rather than the rental marketplace — more stable income, more setup required. Still researching.&lt;/p&gt;

&lt;p&gt;For now, 6 cards, ~$250/month net, and a weekend's worth of setup. I'll take it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Questions about the rig setup or Vast.ai specifics? Drop them in the comments.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://www.fiverr.com/s/XLyg" rel="noopener noreferrer"&gt;Check out my automation work on Fiverr&lt;/a&gt;&lt;br&gt;
→ &lt;a href="https://t.me/celebibot_en" rel="noopener noreferrer"&gt;Follow along on Telegram&lt;/a&gt;&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>passiveincome</category>
      <category>selfhosted</category>
      <category>ai</category>
    </item>
    <item>
      <title>My 3-Machine AI Lab: How I Divide Work Between a Mac Mini, a Windows PC, and an Ubuntu Box</title>
      <dc:creator>Sam Hartley</dc:creator>
      <pubDate>Thu, 07 May 2026 08:06:27 +0000</pubDate>
      <link>https://dev.to/samhartley_dev/my-3-machine-ai-lab-how-i-divide-work-between-a-mac-mini-a-windows-pc-and-an-ubuntu-box-3gfi</link>
      <guid>https://dev.to/samhartley_dev/my-3-machine-ai-lab-how-i-divide-work-between-a-mac-mini-a-windows-pc-and-an-ubuntu-box-3gfi</guid>
      <description>&lt;p&gt;I keep seeing posts about running AI on a single machine. "Just use Ollama on your laptop!" Sure, that works — until you want to run a 30B model while your IDE is indexing, your test suite is running, and you're editing a video.&lt;/p&gt;

&lt;p&gt;I have three machines. Not because I'm rich — because I kept old hardware alive and gave each one a job. Here's how I split AI workloads across a Mac Mini M4, a Windows PC with an RTX 3060, and an old Ubuntu box, and why one machine was never enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Machines
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Machine&lt;/th&gt;
&lt;th&gt;Specs&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mac Mini M4&lt;/td&gt;
&lt;td&gt;10-core, 16GB RAM&lt;/td&gt;
&lt;td&gt;Orchestrator, coding, light inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Windows PC&lt;/td&gt;
&lt;td&gt;AMD 9970X, 128GB RAM, RTX 3060 12GB&lt;/td&gt;
&lt;td&gt;Heavy inference, image generation, GPU rental&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ubuntu Box&lt;/td&gt;
&lt;td&gt;Older CPU, 16GB RAM&lt;/td&gt;
&lt;td&gt;Background services, OmniParser, CPU inference fallback&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Total cost for the additional machines: the Windows PC was a workstation I already had, the Ubuntu box is a repurposed laptop. The Mac Mini is the only machine I bought specifically for this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Three? Because One Keeps Running Into Walls.
&lt;/h2&gt;

&lt;p&gt;Here's what happens when you try to do everything on a single machine:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On the Mac Mini alone:&lt;/strong&gt; Run a 9B model → fine. Try a 30B model → the fans sound like a jet engine and inference drops to 2 tokens/second because Apple Silicon shares RAM between CPU and GPU. Start a build while inference is running? Everything crawls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On the Windows PC alone:&lt;/strong&gt; The RTX 3060 crushes inference — a 30B model at 4-bit runs at 8-12 tok/s. But the machine is also my GPU rental host (I rent it on Vast.ai for passive income). When someone rents it, I can't use it. And running heavy inference while coding in an IDE on the same machine? Stutter city.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On the Ubuntu box alone:&lt;/strong&gt; It's slow. CPU-only inference on a 7B model takes 30+ seconds per token. But it runs 24/7 without complaints, never reboots for Windows updates, and costs almost nothing in power.&lt;/p&gt;

&lt;p&gt;The answer was never "pick the best one." It was "give each one the right job."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Division of Labor
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Mac Mini: The Brain
&lt;/h3&gt;

&lt;p&gt;This is where I actually &lt;em&gt;work&lt;/em&gt;. Code editing, git operations, web browsing, writing. It runs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ollama with small models&lt;/strong&gt; — Qwen 3 4B for quick chat, granite3.2-vision for screenshots&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;My automation agents&lt;/strong&gt; — OpenClaw runs here 24/7, orchestrating the other machines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dev environment&lt;/strong&gt; — VS Code, terminal, browser, all the tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight: the Mac Mini is for &lt;em&gt;interactive&lt;/em&gt; work. Fast response time matters more than throughput. A 4B model answering in 0.3 seconds beats a 30B model answering in 8 seconds when I'm in the middle of a thought.&lt;/p&gt;

&lt;p&gt;What it does NOT do: heavy batch processing, image generation, anything that pins the CPU for more than a minute.&lt;/p&gt;

&lt;h3&gt;
  
  
  Windows PC: The Muscle
&lt;/h3&gt;

&lt;p&gt;The RTX 3060 with 12GB VRAM is the workhorse. It runs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ollama with big models&lt;/strong&gt; — Qwen3-Coder 30B for complex code, DeepSeek R1 8B for reasoning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FLUX image generation&lt;/strong&gt; — when I need AI photos (I run a Fiverr gig for brand character photos — $0 generation cost when it's your own GPU)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vast.ai host&lt;/strong&gt; — when I'm not using it, someone else pays to rent it. ~$50-130/month passive income.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scheduling is simple: during my working hours (roughly 9am-11pm Istanbul time), I pause Vast.ai rentals and use the GPU myself. Overnight, it goes back on the rental market.&lt;/p&gt;

&lt;p&gt;128GB of RAM means I can load truly large models or run multiple services simultaneously. The GPU handles inference; the CPU cores handle data prep and post-processing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ubuntu Box: The Backbone
&lt;/h3&gt;

&lt;p&gt;This machine's superpower is reliability. It runs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OmniParser&lt;/strong&gt; — a UI parsing service that my automation agents use to "see" screens (runs on port 8100, I SSH-tunnel to it from the Mac)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU inference fallback&lt;/strong&gt; — when the Windows GPU is rented out and the Mac is busy, the Ubuntu box runs minicpm-v for vision tasks. Slow (~180 seconds per query) but it works.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background cron jobs&lt;/strong&gt; — data scraping, health checks, monitoring scripts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing ground&lt;/strong&gt; — I deploy new services here first because if something crashes, my main workflow isn't affected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's the machine I interact with least directly, but it's the one I'd miss most if it went down.&lt;/p&gt;

&lt;h2&gt;
  
  
  How They Talk to Each Other
&lt;/h2&gt;

&lt;p&gt;This was the hardest part to figure out. Three machines that don't communicate are just three isolated computers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The network:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Mac Mini (192.168.1.102) ←→ Windows PC (192.168.1.106) ←→ Ubuntu (192.168.1.100)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All on the same local network. No VPN needed at home, but I set up Tailscale for remote access when I'm not home.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ollama routing:&lt;/strong&gt;&lt;br&gt;
The Mac Mini's Ollama is configured with &lt;code&gt;OLLAMA_HOST=0.0.0.0&lt;/code&gt; so it's accessible from the other machines. Same for the Windows PC. My automation agent on the Mac Mini routes requests based on model size:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_ollama_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Route to the right machine based on model size.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;small_models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen3:4b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;granite3.2-vision:2b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;moondream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;big_models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen3-coder:30b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-r1:8b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;big_models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.1.106:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Windows GPU
&lt;/span&gt;    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;small_models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;       &lt;span class="c1"&gt;# Mac Mini local
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.1.100:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;   &lt;span class="c1"&gt;# Ubuntu fallback
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple. No Kubernetes, no service mesh. Just a function that knows which machine has which model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSH tunnels for services:&lt;/strong&gt;&lt;br&gt;
The Ubuntu box runs OmniParser on port 8100, but I need it on the Mac Mini. Instead of exposing it publicly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Auto-reconnecting SSH tunnel (runs via launchd on Mac)&lt;/span&gt;
ssh &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;ServerAliveInterval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;60 &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;ServerAliveCountMax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-L&lt;/span&gt; 8100:localhost:8100 exp@192.168.1.100 &lt;span class="nt"&gt;-N&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now &lt;code&gt;localhost:8100&lt;/code&gt; on the Mac Mini reaches OmniParser on Ubuntu. Clean, secure, zero config on the Ubuntu side.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Mistakes I Made
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mistake 1: Trying to make the Mac Mini do everything.&lt;/strong&gt;&lt;br&gt;
For the first month, I ran all inference on the Mac Mini. The 16GB unified memory seems like a lot until a 30B model is eating 20GB and your IDE is using another 4GB and macOS starts swapping to SSD. The SSD is fast, but swap is still swap. Code completion went from instant to 3-second delays.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 2: Not setting up Ollama access from day one.&lt;/strong&gt;&lt;br&gt;
I installed Ollama separately on each machine and used them in isolation for weeks. The moment I realized I could call the Windows Ollama from the Mac Mini was a lightbulb moment. The models were always there — I just wasn't using them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 3: Ignoring the Ubuntu box because it's "slow."&lt;/strong&gt;&lt;br&gt;
CPU inference is slow. But "slow" is relative. A background task that takes 3 minutes on CPU doesn't matter if you're not waiting for it. I was treating every request as if it needed to be instant. Turns out, most don't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 4: No fallback plan.&lt;/strong&gt;&lt;br&gt;
When the Windows PC rebooted for an update, my entire inference pipeline died because I had no fallback routing. Now: if Windows is down, Ubuntu takes over. If Ubuntu is down, the Mac runs a tiny model. Something always answers.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Setup Enables
&lt;/h2&gt;

&lt;p&gt;Running three machines isn't about having the biggest, fastest setup. It's about &lt;strong&gt;never being blocked.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Windows GPU is rented out? Route to Mac or Ubuntu.&lt;/li&gt;
&lt;li&gt;Mac Mini is busy with a build? Use the Windows GPU via network.&lt;/li&gt;
&lt;li&gt;Ubuntu is running a heavy OmniParser job? Queue it, the Mac handles the request.&lt;/li&gt;
&lt;li&gt;Need to generate 50 images for a client? Windows GPU does it in batch while I keep coding on the Mac.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the cost breakdown:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mac Mini power&lt;/td&gt;
&lt;td&gt;~$8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Windows PC power (idle + rental hours)&lt;/td&gt;
&lt;td&gt;~$15-25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ubuntu box power&lt;/td&gt;
&lt;td&gt;~$5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vast.ai earnings (offset)&lt;/td&gt;
&lt;td&gt;~$50-130&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Net&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-$17 to -$97 (net profit)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The setup literally pays for itself. The Windows PC's GPU rental income exceeds the total power cost of all three machines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Do You Need Three Machines?
&lt;/h2&gt;

&lt;p&gt;Probably not. But you might need &lt;em&gt;two&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you're running AI on a laptop: get a cheap desktop (even without a GPU) and put it on your network as a background worker. Ollama on a second machine means your main machine stays responsive.&lt;/p&gt;

&lt;p&gt;If you have a gaming PC: you already have a GPU. Install Ollama on it, set &lt;code&gt;OLLAMA_HOST=0.0.0.0&lt;/code&gt;, and call it from your laptop. You now have a two-machine AI lab for zero additional cost.&lt;/p&gt;

&lt;p&gt;The Ubuntu box in my setup could be a $50 Raspberry Pi 5 with 8GB RAM for the same purpose. It doesn't need to be fast. It needs to be &lt;em&gt;there&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm Adding Next
&lt;/h2&gt;

&lt;p&gt;I have five more GPUs in storage (2× RTX 3070, 1× RTX 3080, 2× RTX 3060). The plan is to build an open-air rig and add them all to the Windows PC for both local inference and Vast.ai rental. That'd give me 62GB total VRAM — enough to run a 70B model locally or rent out as a cluster.&lt;/p&gt;

&lt;p&gt;The Ubuntu box is getting a dedicated role as a monitoring and alerting server. Right now my health checks are scattered across cron jobs on all three machines. Centralizing them makes debugging easier.&lt;/p&gt;

&lt;p&gt;And the Mac Mini? It stays the brain. More RAM would be nice (16GB is tight with Ollama + IDE + browser), but the M4's efficiency is hard to beat for interactive work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;One machine is a workstation. Two machines is a lab. Three machines is a distributed system that happens to fit on a desk.&lt;/p&gt;

&lt;p&gt;The magic isn't in the hardware — it's in the routing. Knowing which machine should handle which task, building fallback paths, and making sure a single machine going down doesn't kill your workflow.&lt;/p&gt;

&lt;p&gt;Start with what you have. Add a second machine when you feel the friction. Route intelligently. The three-machine setup happened organically — each machine earned its place by solving a problem the others couldn't.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write about running AI locally, home lab setups, and turning hardware into income. New post every few days — follow along if that's your thing.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;My GPU rental story: &lt;a href="https://dev.to/samhartley/i-rented-out-my-gpu-for-passive-income-heres-what-happened-after-my-first-week-2l8a"&gt;One card → six cards → $250/month passive&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>infrastructure</category>
      <category>llm</category>
      <category>productivity</category>
    </item>
    <item>
      <title>From Idea to Deployed Tool in 3 Hours — How AI Coding Agents Changed My Workflow</title>
      <dc:creator>Sam Hartley</dc:creator>
      <pubDate>Tue, 05 May 2026 08:06:19 +0000</pubDate>
      <link>https://dev.to/samhartley_dev/from-idea-to-deployed-tool-in-3-hours-how-ai-coding-agents-changed-my-workflow-3dd0</link>
      <guid>https://dev.to/samhartley_dev/from-idea-to-deployed-tool-in-3-hours-how-ai-coding-agents-changed-my-workflow-3dd0</guid>
      <description>&lt;p&gt;I used to think AI coding assistants were autocomplete on steroids. Fancy IntelliSense. Then I tried using one as an actual junior developer — someone who writes the first draft while I review and refine.&lt;/p&gt;

&lt;p&gt;Two months later, my workflow is unrecognizable. I just shipped a complete B2B configuration tool — interactive maps, zone polygons, dynamic forms, the works — in under three hours. Here's what changed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Old Way
&lt;/h2&gt;

&lt;p&gt;Before AI agents, my process for a new tool looked like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Research&lt;/strong&gt; — How does Leaflet.js work? What's the API for geo polygons? Stack Overflow, docs, tutorials. 45 minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boilerplate&lt;/strong&gt; — HTML structure, CSS grid, JavaScript imports, event listeners. 30 minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Core logic&lt;/strong&gt; — The actual thing the tool needs to do. 2–3 hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging&lt;/strong&gt; — Why doesn't the map render? Why is the polygon offset? 1–2 hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Polish&lt;/strong&gt; — Styling, responsive layout, edge cases. 1 hour.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Total: 6–8 hours for a medium-complexity tool.&lt;/strong&gt; And that's if I know the stack. If it's something new (like Garmin's Monkey C or a mapping library I haven't used), double it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The New Way
&lt;/h2&gt;

&lt;p&gt;Last week a client asked for a heat zone map for 21 European countries. Click a country, see the heating zones, pick one, get the right configuration. With polygon boundaries, country-specific defaults, and a responsive UI.&lt;/p&gt;

&lt;p&gt;Here's how it went:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hour 0–0.5: Prompt engineering&lt;/strong&gt;&lt;br&gt;
I wrote a detailed spec. Not "make a map" — that's useless. I described:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The data structure (country → zones → polygon coordinates)&lt;/li&gt;
&lt;li&gt;The UI flow (dropdown → map render → zone selection)&lt;/li&gt;
&lt;li&gt;The tech stack (Leaflet.js, vanilla JS, no frameworks)&lt;/li&gt;
&lt;li&gt;Edge cases (what happens when a country has no zones?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Hour 0.5–1.5: First draft from the agent&lt;/strong&gt;&lt;br&gt;
I fed the spec to Codex (Claude Code via CLI). It generated the full HTML file — 800+ lines — with Leaflet integration, zone polygons, event handlers, and the selection logic.&lt;/p&gt;

&lt;p&gt;Was it perfect? No. The polygon coordinates were placeholder circles. The styling was bare-bones. But the &lt;em&gt;architecture&lt;/em&gt; was right. The map rendered. The flow worked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hour 1.5–2.5: My turn — polish and fix&lt;/strong&gt;&lt;br&gt;
I replaced the placeholder polygons with real GeoJSON-ish coordinates for all 21 countries. Tweaked the CSS for mobile. Added validation. Fixed a bug where the map didn't re-center when switching countries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hour 2.5–3: Integration and deploy&lt;/strong&gt;&lt;br&gt;
Hooked it into the existing project structure. Git commit. Done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total: 3 hours.&lt;/strong&gt; And the heavy lifting — the Leaflet setup, the polygon rendering logic, the event wiring — was handled by the agent. I did the creative/problem-solving work: defining the problem, validating the output, fixing the edge cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works (and What Doesn't)
&lt;/h2&gt;

&lt;p&gt;After two months of daily use, here's my honest assessment:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ What works brilliantly:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Boilerplate and plumbing&lt;/strong&gt; — Setting up projects, imports, basic structure. The agent is faster and makes fewer typos than me.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API integration patterns&lt;/strong&gt; — "Here's an endpoint, here's the expected response, write the fetch and parse logic." It gets this right 90% of the time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refactoring&lt;/strong&gt; — "Rename this function and update all callers across 5 files." Instant, error-free.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exploring unfamiliar territory&lt;/strong&gt; — I hadn't used Leaflet in years. The agent got me to a working state without reading docs for an hour.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;❌ What doesn't work (yet):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complex business logic&lt;/strong&gt; — Anything with nuanced rules, edge cases, or domain-specific constraints. The agent generates something plausible that breaks in production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UI/UX design&lt;/strong&gt; — It makes functional UIs. They look like a developer made them (because one did). You'll still need a human eye for polish.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging its own mistakes&lt;/strong&gt; — When the agent writes a bug, it's often subtle. You need to understand the code to catch it. This is &lt;em&gt;not&lt;/em&gt; a replacement for knowing how to code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large-scale architecture&lt;/strong&gt; — It works file-by-file. Designing a system with proper separation of concerns, caching strategies, and scalability? That's still on you.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Mindset Shift
&lt;/h2&gt;

&lt;p&gt;The biggest change isn't speed. It's &lt;em&gt;how I think about problems&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Before: "How do I implement this?" → Research → Code → Debug.&lt;br&gt;
Now: "How do I describe this so an agent can implement a first draft?" → Spec → Review → Refine.&lt;/p&gt;

&lt;p&gt;I'm the architect and editor now, not the typist. The agent is the junior dev who writes fast but needs supervision.&lt;/p&gt;

&lt;p&gt;This matters because it scales. I can explore 3 approaches in the time it used to take to build 1. I can say "what if we used a canvas instead of Leaflet?" and get a working comparison in 10 minutes. The cost of experimentation dropped to near-zero.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;If you want to try this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use Claude Code or Codex CLI&lt;/strong&gt; — The terminal interface lets you iterate fast. Chat-based tools (ChatGPT, etc.) are too slow for code generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write detailed specs&lt;/strong&gt; — The agent is only as good as your prompt. Include examples, expected outputs, and constraints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review every line&lt;/strong&gt; — Don't blindly commit. The agent writes plausible-looking bugs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep a "golden test"&lt;/strong&gt; — A known input/output pair you can run after every change. Catches regressions instantly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterate in small chunks&lt;/strong&gt; — "Add the map" → review → "Add zone polygons" → review. Don't ask for 500 lines at once.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;Two months in, my main learning: &lt;strong&gt;the spec is everything.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My fastest sessions happen when I spend 10 minutes writing a bullet-point spec before touching the agent. My slowest sessions happen when I vague-prompt my way through and spend an hour fixing misunderstandings.&lt;/p&gt;

&lt;p&gt;The second learning: &lt;strong&gt;agents excel at breadth, humans at depth.&lt;/strong&gt; Use the agent to explore options. Use your brain to pick the right one.&lt;/p&gt;




&lt;p&gt;If you're using AI coding agents, what's your experience? I'm curious if the "architect + editor" model resonates, or if you've found a different pattern that works. Drop your thoughts below — I'm still figuring this out too.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>codex</category>
      <category>development</category>
    </item>
    <item>
      <title>I Built a Model Router That Picks the Right AI for Every Task — Here's Why You Should Too</title>
      <dc:creator>Sam Hartley</dc:creator>
      <pubDate>Sun, 03 May 2026 08:01:54 +0000</pubDate>
      <link>https://dev.to/samhartley_dev/i-built-a-model-router-that-picks-the-right-ai-for-every-task-heres-why-you-should-too-5la</link>
      <guid>https://dev.to/samhartley_dev/i-built-a-model-router-that-picks-the-right-ai-for-every-task-heres-why-you-should-too-5la</guid>
      <description>&lt;p&gt;I caught myself doing something stupid the other day. I used a 30B parameter model to summarize a two-paragraph email. The GPU spun up, the fans kicked in, and 8 seconds later I had my summary. The same summary my 4B model would've given me in 0.3 seconds.&lt;/p&gt;

&lt;p&gt;That's when I realized: I'd been treating every AI request the same way. Big model for everything. Small model for nothing. No intelligence in the routing at all.&lt;/p&gt;

&lt;p&gt;So I built a model router. Not a fancy orchestrator with Kubernetes and service meshes — just a simple function that looks at what you're asking and sends it to the right model on the right machine. It's the single most impactful thing I've done for my local AI setup this year.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;I run three machines with Ollama (Mac Mini M4, Windows PC with RTX 3060, Ubuntu box). Between them, I have maybe 8 models installed. Before the router, my workflow was:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Need something → open terminal&lt;/li&gt;
&lt;li&gt;Think about which model is good enough&lt;/li&gt;
&lt;li&gt;Think about which machine has it&lt;/li&gt;
&lt;li&gt;Type the full URL: &lt;code&gt;curl http://192.168.1.106:11434/api/generate -d '{"model": "qwen3-coder:30b", ...}'&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Wait&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Steps 2-4 happened every single time. I was spending more mental energy on routing than on the actual task. And half the time I'd default to the biggest model just because "it's probably better."&lt;/p&gt;

&lt;p&gt;Here's the thing: it's usually not better. A 4B model summarizing text is 95% as good as a 30B model summarizing text. But it's 25x faster and uses 1/10th the resources. The big model should earn its keep on tasks where size actually matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Router
&lt;/h2&gt;

&lt;p&gt;I started with a Python function. Nothing fancy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="c1"&gt;# Model registry: what's available where
&lt;/span&gt;&lt;span class="n"&gt;MODELS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen3:4b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vision&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;granite3.2-vision:2b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.1.106:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen3-coder:30b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.1.106:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-r1:8b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.1.106:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fallback&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minicpm-v&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.1.100:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;screenshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;picture&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vision&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;debug&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refactor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;implement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prove&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;solve&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;calculate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;math&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. That's the whole router. It classifies prompts by keywords and sends them to the appropriate model. Is it perfect? No. Does it need to be? Also no.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters More Than You Think
&lt;/h2&gt;

&lt;p&gt;Before the router, my typical day looked like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;30 quick questions (summarize this, what does this error mean, rephrase this)&lt;/li&gt;
&lt;li&gt;5 code tasks (write a function, debug this, add tests)&lt;/li&gt;
&lt;li&gt;2 vision tasks (what's in this screenshot)&lt;/li&gt;
&lt;li&gt;1 deep reasoning task (complex analysis)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without routing, I'd use the big model for everything. 38 requests to the 30B model. Each taking 5-15 seconds. Total wait time: ~4-5 minutes of just... waiting.&lt;/p&gt;

&lt;p&gt;With routing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;30 quick → 4B model on Mac Mini, 0.3s each → &lt;strong&gt;9 seconds total&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;5 code → 30B on Windows GPU, 8-12s each → &lt;strong&gt;~50 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;2 vision → vision model on GPU, 4-13s each → &lt;strong&gt;~15 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;1 reasoning → 8B model on GPU, 5-8s → &lt;strong&gt;~6 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total: ~80 seconds&lt;/strong&gt; instead of ~5 minutes. And my GPU is free for most of that time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fallback System
&lt;/h2&gt;

&lt;p&gt;The real value hit me when the Windows PC went down for a Windows Update mid-session. Before, this was catastrophic.&lt;/p&gt;

&lt;p&gt;Now the router has fallback logic — when a machine is down, it finds the next best option. Code tasks go to the small model (worse but functional). Vision tasks try Ubuntu. Quick tasks keep humming on the Mac Mini. Something always answers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Token Cost Angle
&lt;/h2&gt;

&lt;p&gt;I track tokens per model per day. Not because I'm cheap (these are all free — local), but because it tells me where I'm spending compute:&lt;/p&gt;

&lt;p&gt;After a week: &lt;strong&gt;72% of my requests went to the quick model.&lt;/strong&gt; Only 15% needed the big code model.&lt;/p&gt;

&lt;p&gt;I was using a sledgehammer for 72% of my nails. No wonder the GPU always felt busy.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Build the router on day one.&lt;/strong&gt; It took an afternoon. I wasted months manually routing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Start with keyword routing.&lt;/strong&gt; I considered embeddings, classifiers, even using an LLM to pick the LLM. Keywords work for 90% of cases. Ship the simple thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Make the fallback automatic.&lt;/strong&gt; My first version just errored when the GPU machine was down. A degraded response is infinitely better than no response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Log everything.&lt;/strong&gt; You can't optimize what you don't measure. The 72% stat jumped out immediately once I started tracking.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond Keywords
&lt;/h2&gt;

&lt;p&gt;The keyword router works but has blind spots. So I'm adding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Confidence scoring:&lt;/strong&gt; If a prompt matches multiple categories, try the cheaper model first. Auto-retry with bigger model if quality seems low.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context-aware routing:&lt;/strong&gt; If the last 3 prompts were about the same codebase, keep using the code model even without explicit keywords.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-aware fallback:&lt;/strong&gt; When local models can't handle something, the router should know whether a cloud API call is "worth it" or not.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Do You Need This?
&lt;/h2&gt;

&lt;p&gt;If you run one model on one machine: no. You're fine.&lt;/p&gt;

&lt;p&gt;If you have more than 2 models: &lt;strong&gt;yes.&lt;/strong&gt; Otherwise you'll default to the biggest one every time, and that's a waste.&lt;/p&gt;

&lt;p&gt;If you have multiple machines: &lt;strong&gt;absolutely.&lt;/strong&gt; The router isn't just about picking the right model — it's about picking the right machine. Code tasks go where the GPU is. Quick tasks stay local. Background tasks go to the always-on box.&lt;/p&gt;

&lt;p&gt;Start with 2 models and a simple if/else. Add more as you grow. The architecture stays the same.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write about running AI locally, home lab setups, and turning hardware into income. If that's your jam, I post every few days.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>ollama</category>
      <category>selfhosted</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Built a Model Router That Picks the Right AI for Every Task — Here's Why You Should Too</title>
      <dc:creator>Sam Hartley</dc:creator>
      <pubDate>Fri, 01 May 2026 08:02:21 +0000</pubDate>
      <link>https://dev.to/samhartley_dev/i-built-a-model-router-that-picks-the-right-ai-for-every-task-heres-why-you-should-too-1603</link>
      <guid>https://dev.to/samhartley_dev/i-built-a-model-router-that-picks-the-right-ai-for-every-task-heres-why-you-should-too-1603</guid>
      <description>&lt;p&gt;I caught myself doing something stupid the other day. I used a 30B parameter model to summarize a two-paragraph email. The GPU spun up, the fans kicked in, and 8 seconds later I had my summary. The same summary my 4B model would've given me in 0.3 seconds.&lt;/p&gt;

&lt;p&gt;That's when I realized: I'd been treating every AI request the same way. Big model for everything. Small model for nothing. No intelligence in the routing at all.&lt;/p&gt;

&lt;p&gt;So I built a model router. Not a fancy orchestrator with Kubernetes and service meshes — just a simple function that looks at what you're asking and sends it to the right model on the right machine. It's the single most impactful thing I've done for my local AI setup this year.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;I run three machines with Ollama (Mac Mini M4, Windows PC with RTX 3060, Ubuntu box). Between them, I have maybe 8 models installed. Before the router, my workflow was:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Need something → open terminal&lt;/li&gt;
&lt;li&gt;Think about which model is good enough&lt;/li&gt;
&lt;li&gt;Think about which machine has it&lt;/li&gt;
&lt;li&gt;Type the full URL: &lt;code&gt;curl http://192.168.1.106:11434/api/generate -d '{\"model\": \"qwen3-coder:30b\", ...}'&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Wait&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Steps 2-4 happened every single time. I was spending more mental energy on routing than on the actual task. And half the time I'd default to the biggest model just because "it's probably better."&lt;/p&gt;

&lt;p&gt;Here's the thing: it's usually not better. A 4B model summarizing text is 95% as good as a 30B model summarizing text. But it's 25x faster and uses 1/10th the resources. The big model should earn its keep on tasks where size actually matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Router
&lt;/h2&gt;

&lt;p&gt;I started with a Python function. Nothing fancy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="c1"&gt;# Model registry: what's available where
&lt;/span&gt;&lt;span class="n"&gt;MODELS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;# Quick chat, summaries, simple Q&amp;amp;A → Mac Mini (fast response)
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen3:4b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;# Vision / image analysis → Windows GPU (needs VRAM)
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vision&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;granite3.2-vision:2b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.1.106:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;# Code generation, complex reasoning → Windows GPU
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen3-coder:30b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.1.106:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;# Deep reasoning, math, logic puzzles → Windows GPU
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-r1:8b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.1.106:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;# Fallback for when Windows is rented out → Ubuntu CPU
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fallback&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minicpm-v&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.1.100:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Decide which model should handle this prompt.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Vision tasks
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;screenshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;picture&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;see this&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;what&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s on&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vision&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Code tasks
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;debug&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refactor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;implement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;javascript&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;typescript&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;regex&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Reasoning tasks
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prove&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;solve&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;calculate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;math&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analyze&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compare&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;evaluate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Default: quick model for everything else
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. That's the whole router. It classifies prompts by keywords and sends them to the appropriate model. Is it perfect? No. Does it need to be? Also no.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters More Than You Think
&lt;/h2&gt;

&lt;p&gt;Before the router, my typical day looked like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;30 quick questions (summarize this, what does this error mean, rephrase this)&lt;/li&gt;
&lt;li&gt;5 code tasks (write a function, debug this, add tests)&lt;/li&gt;
&lt;li&gt;2 vision tasks (what's in this screenshot)&lt;/li&gt;
&lt;li&gt;1 deep reasoning task (complex analysis)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without routing, I'd use the big model for everything. 38 requests to the 30B model. Each taking 5-15 seconds. Total wait time: ~4-5 minutes of just... waiting. While my GPU is maxed out and I can't run anything else.&lt;/p&gt;

&lt;p&gt;With routing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;30 quick → 4B model on Mac Mini, 0.3s each → &lt;strong&gt;9 seconds total&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;5 code → 30B on Windows GPU, 8-12s each → &lt;strong&gt;~50 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;2 vision → vision model on GPU, 4-13s each → &lt;strong&gt;~15 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;1 reasoning → 8B model on GPU, 5-8s → &lt;strong&gt;~6 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total: ~80 seconds&lt;/strong&gt; instead of ~5 minutes. And my GPU is free for most of that time, so I could be generating images or renting it out simultaneously.&lt;/p&gt;

&lt;p&gt;The router doesn't just save time. It frees capacity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fallback System
&lt;/h2&gt;

&lt;p&gt;The real value of routing hit me when the Windows PC went down for a Windows Update mid-session. Before, this was catastrophic — all my "important" models were on that machine.&lt;/p&gt;

&lt;p&gt;Now the router has fallback logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route_with_fallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Route with health checks and fallbacks.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Check if primary endpoint is alive
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/api/tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;primary&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# Primary is down, find fallback
&lt;/span&gt;
    &lt;span class="c1"&gt;# Fallback chain based on task type
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="c1"&gt;# Can't run 30B on Mac, but 4B is better than nothing
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vision&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="c1"&gt;# Try Ubuntu's slow vision model
&lt;/span&gt;        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fallback&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/api/tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fallback&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# No vision available, use text
&lt;/span&gt;    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Quick model can reason a bit
&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When Windows reboots, my workflow doesn't stop. The router sends code tasks to the small model (worse but functional), vision tasks to Ubuntu (slow but works), and quick tasks keep humming along on the Mac Mini. Something always answers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Token Cost Angle
&lt;/h2&gt;

&lt;p&gt;I also track tokens per model per day. Not because I'm cheap (these are all free — they run locally), but because it tells me where I'm spending compute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simple token counter per route
&lt;/span&gt;&lt;span class="n"&gt;token_log&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;track_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;route_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;token_log&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;route_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;
    &lt;span class="c1"&gt;# Log daily totals to a file
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token-usage-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;today&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;route&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;route_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()})&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After a week of tracking, I found: &lt;strong&gt;72% of my requests were going to the quick model.&lt;/strong&gt; Only 15% actually needed the big code model. 8% were vision. 5% reasoning.&lt;/p&gt;

&lt;p&gt;I was using a sledgehammer for 72% of my nails. No wonder the GPU always felt busy.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently If I Started Over
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Build the router on day one, not month three.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I wasted months manually routing. The router took an afternoon to write. If you're running multiple models, build the routing first. You can always make it smarter later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Start with keyword routing. Don't over-engineer it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I considered using embeddings to classify prompts. I considered training a tiny classifier. I considered using an LLM to decide which LLM to use (yes, really). Keyword matching works for 90% of cases. Ship the simple thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Make the fallback automatic.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My first version just errored when the GPU machine was down. The fallback logic was an afterthought. It should've been the first thing I built — because machines go down, and a degraded response is infinitely better than no response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Log everything.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can't optimize what you don't measure. Once I started logging which routes were used, the 72% quick-model stat jumped out immediately. Without logs, I would've kept thinking I needed the big model "most of the time."&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond Keywords: What I'm Adding Next
&lt;/h2&gt;

&lt;p&gt;The keyword router works, but it has blind spots. A prompt like "help me understand why this function is slow" could be code OR reasoning. So I'm adding:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confidence scoring:&lt;/strong&gt; If the prompt matches multiple categories, route to the cheaper model first. If the response quality seems low (response is very short, model seems confused), auto-retry with the bigger model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context-aware routing:&lt;/strong&gt; If the last 3 prompts were all about the same codebase, keep using the code model even if the current prompt doesn't have code keywords. Context continuity matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost-aware routing for cloud fallback:&lt;/strong&gt; When local models can't handle something, I fall back to cloud APIs. The router should know: "this task isn't worth a $0.03 GPT-4 call, use the local model." Or: "this is worth it, spend the API credits."&lt;/p&gt;

&lt;h2&gt;
  
  
  Do You Need This?
&lt;/h2&gt;

&lt;p&gt;If you run one model on one machine: no. You're fine. The router solves a multi-model, multi-machine problem.&lt;/p&gt;

&lt;p&gt;If you're starting to accumulate models (a small one for chat, a big one for code, a vision model): &lt;strong&gt;yes.&lt;/strong&gt; The moment you have more than 2 models, you need routing. Otherwise you'll default to the biggest one every time, and that's a waste.&lt;/p&gt;

&lt;p&gt;If you have multiple machines: &lt;strong&gt;absolutely.&lt;/strong&gt; The router isn't just about picking the right model — it's about picking the right machine. Code tasks go where the GPU is. Quick tasks stay local. Background tasks go to the always-on box.&lt;/p&gt;

&lt;p&gt;The beauty is that the router grows with you. Start with 2 models and a simple if/else. Add more models, add more rules. Add more machines, add more endpoints. The architecture stays the same.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write about running AI locally, home lab setups, and turning hardware into income. If that's your jam, I post every few days.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;My AI lab setup: &lt;a href="https://dev.to/samhartley/my-3-machine-ai-lab-how-i-divide-work-between-a-mac-mini-a-windows-pc-and-an-ubuntu-box-3b8a"&gt;3 machines, 1 desk, zero cloud dependency&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>ollama</category>
      <category>selfhosted</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
