DEV Community: Sam Hartley

I Gave My AI a Memory. It Changed Everything.

Sam Hartley — Wed, 29 Jul 2026 08:03:29 +0000

I got tired of explaining my own codebase to an AI every single session.

"Here's the architecture. Here's the README. Here's what I tried last time and why it didn't work." Every. Single. Time.

Six months in, I realized I wasn't chatting with an assistant — I was onboarding a new intern who quit after every conversation and came back with total amnesia.

So I built a memory system. Local. Free. And honestly, it's the single most useful AI upgrade I've made.

The Problem Nobody Talks About

LLMs don't remember you. They don't remember your projects. They don't remember that you tried asyncio.gather last week and it deadlocked, or that your Garmin watch face has a hard 64KB memory limit, or that you always forget how your Telegram bot handles rate limits.

What they have is a context window — a temporary scratchpad that gets wiped when the conversation ends. For coding workflows, this is broken by design. You're not having a chat. You're doing ongoing work on a codebase that spans months.

I tried the obvious fixes:

Pasting huge chunks of context every time. Burned through token limits, got truncated responses.
Keeping "project briefs" in a text file and pasting them in. Worked until I had ten projects.
Using cloud memory features. Worked until I hit another subscription and another privacy question.

None of it felt right. What I wanted was simple: ask a question about my stuff, get an answer based on my docs, with zero setup friction.

The Setup: RAG That Actually Runs Locally

RAG (Retrieval-Augmented Generation) isn't new. What's new is that you can run the entire stack — embedding model, vector database, and LLM — on a Mac Mini without touching a single cloud API.

Here's what I ended up with:

My Documents (markdown, code, notes, PDFs)
  → Chunked into ~400-token pieces
  → Embedded with nomic-embed-text (Ollama)
  → Stored in Chroma (local vector DB, no server needed)

Question
  → Embedded with the same model
  → Top-5 relevant chunks pulled from Chroma
  → Fed into Qwen 3.5 9B with a prompt template
  → Answer

Total stack: Ollama + Chroma + a 40-line Python script. No Docker. No cloud. No API keys.

What I Actually Indexed

Not the whole internet. Just the stuff I actually need:

Project docs — READMEs, architecture decisions, API specs
My notes — Obsidian vault, scratchpads, debugging logs
Code snippets — Functions I keep reusing, config templates
Bookmarked solutions — Stack Overflow answers I always forget
Deployment notes — "How did I set up the VPS again?"

About 4,800 chunks total. Query time: under 2 seconds on the Mac Mini, under 500ms if I route it through the Windows PC's RTX 3060.

The Code (It's Embarrassingly Simple)

Indexing:

import os

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# "~" is not expanded automatically, so resolve the home directory first
loader = DirectoryLoader(os.path.expanduser("~/projects/"), glob="**/*.md", recursive=True)
docs = loader.load()

# Note: chunk_size and chunk_overlap are counted in characters by default, not tokens
splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
chunks = splitter.split_documents(docs)

embeddings = OllamaEmbeddings(model="nomic-embed-text")
db = Chroma.from_documents(chunks, embeddings, persist_directory="./knowledge-base")
# Recent Chroma versions write to persist_directory on their own; this call is for older ones
db.persist()

Querying:

import ollama

# `db` is the Chroma store created by the indexing script above

def ask(question: str) -> str:
    results = db.similarity_search(question, k=5)
    context = "\n\n".join([r.page_content for r in results])

    response = ollama.chat(
        model="qwen3.5:9b",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }]
    )
    return response['message']['content']

That's it. No LangChain chains. No agents. No frameworks that require a PhD to configure.
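The pipeline diagram mentions a prompt template, while ask() sends context and question as a bare f-string. A minimal sketch of a slightly stricter template, assuming you want numbered chunks and an explicit instruction to stay inside the retrieved text. The helper name build_prompt and the instruction wording are hypothetical, not part of the original script.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded prompt from retrieved chunks (hypothetical helper)."""
    # Number each chunk so the model can point at the passage it used
    numbered = "\n\n".join(
        f"[{i}] {text.strip()}" for i, text in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}"
    )
```

Inside ask(), the result of build_prompt(question, [r.page_content for r in results]) would replace the inline f-string.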

What Changed

Before: "How does my Garmin watch face fetch stock data?" → paste 200 lines of code → wait → hope the model doesn't hallucinate.

After: ask("How does my Garmin watch face fetch stock data?") → "It uses Background.exit() to pass a dictionary with the ticker and price to the app-side view, because the background service has a 64KB memory limit." — in 1.8 seconds. Correct. Specific. Zero hallucination.

Before: "What's the rate limit for the crypto bot?" → dig through files → find nothing → guess.

After: "3 requests per second with a 1.5-second cooldown between calls, enforced in the throttle() decorator in bot/utils.py." — pulled straight from my own code comments.

The AI stopped being a generalist that needed onboarding. It became a specialist that already knew my stuff.

The Honest Downsides

Indexing is slow. Embedding 4,800 chunks on a Mac Mini CPU takes ~20 minutes. On the RTX 3060 it's ~4 minutes. You don't do it often — I run incremental updates via cron every hour — but the first run is painful.

Chunking is an art. Too small (100 tokens) and you lose context. Too big (1000 tokens) and the top-5 results fill the prompt with text that is only loosely relevant. I settled on 400 with 50-token overlap after way too much trial and error.
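To make the size/overlap trade-off concrete, here is a rough sketch of a sliding-window chunker. It assumes the text is already tokenized; whitespace splitting stands in for a real tokenizer, and chunk_tokens is a hypothetical helper rather than the splitter used in the indexing script.

```python
def chunk_tokens(tokens: list[str], size: int = 400, overlap: int = 50) -> list[list[str]]:
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

# Whitespace splitting stands in for a real tokenizer here
words = ("word " * 1000).split()
pieces = chunk_tokens(words)
print(len(pieces), [len(p) for p in pieces])  # 3 [400, 400, 300]
```

Each chunk repeats the last 50 tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.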

It doesn't replace search. If you ask "what's the syntax for Python list comprehension," RAG is overkill and probably retrieves some random list-handling function from your old projects. This is for your knowledge, not general knowledge.

Storage grows. 4,800 chunks ≈ 300MB of vector data. Manageable, but not nothing. Chroma handles it fine locally.

One Trick That Made It Actually Useful

I added incremental updates with file hashes instead of re-indexing everything:

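import hashlib
import json
from pathlib import Path

def update_index(docs_dir, db, cache_file="./index-cache.json"):
    # cache maps file path -> MD5 of the file contents at the last indexing run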
    cache = json.loads(Path(cache_file).read_text()) if Path(cache_file).exists() else {}
    changed = []

    for f in Path(docs_dir).rglob("*.md"):
        h = hashlib.md5(f.read_bytes()).hexdigest()
        if cache.get(str(f)) != h:
            changed.append(str(f))
            cache[str(f)] = h

    if changed:
        # ... re-index only changed files
        Path(cache_file).write_text(json.dumps(cache))

Runs every hour via cron. Most of the time it does nothing. When I edit a file, it catches it. My knowledge base is never more than an hour stale.
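One gap in the hash cache as shown: it notices edited and new files, but a file that is deleted or renamed stays in the cache, and presumably its chunks stay in the index. A minimal sketch of detecting both cases by comparing the old cache with a fresh scan; diff_cache is a hypothetical helper, and how the stale chunks are removed from the vector store is left open.

```python
def diff_cache(old: dict[str, str], new: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare two {path: hash} maps and return (changed_or_new, deleted) paths."""
    changed = sorted(p for p, h in new.items() if old.get(p) != h)
    deleted = sorted(p for p in old if p not in new)
    return changed, deleted

old = {"a.md": "111", "b.md": "222", "c.md": "333"}
new = {"a.md": "111", "b.md": "999", "d.md": "444"}
print(diff_cache(old, new))  # (['b.md', 'd.md'], ['c.md'])
```

Here `new` would come from hashing every file found by the rglob scan, and `old` from the JSON cache on disk.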

When This Makes Sense

Do this if:

You work on multiple long-term projects
You keep notes/docs/code that you reference repeatedly
You want context-aware answers without pasting walls of text
You have a machine that can run Ollama (Mac Mini M4, any modern PC)

Don't do this if:

You only do one-off queries ("what's the weather")
Your projects are small and you remember everything anyway
You can't spare 300MB of disk space
You expect it to replace Google (it won't)

Getting Started (15 Minutes)

# 1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull models
ollama pull qwen3.5:9b
ollama pull nomic-embed-text

# 3. Python deps
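pip install chromadb langchain langchain-community ollama
# DirectoryLoader's default file loader may also need: pip install unstructured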

# 4. Index your docs (adapt the script above)
# 5. Ask questions

Total time: 15 minutes. Total cost: $0. Monthly cost: $0.

The Real Win

The technical setup is neat. But the real change is mental.

Before, every AI session started with context-building: "Here's what I'm working on, here's what I tried, here's the constraints." Now I just ask the question. The system already knows the constraints because it read my docs.

It's the difference between calling a consultant and calling a teammate who was in the meeting.

That gap — the gap between "AI that answers questions" and "AI that knows your work" — is bigger than I expected. And closing it costs exactly zero dollars.

Sam Hartley is a solo dev building local AI tools on a 3-machine home lab. Writes about the infrastructure that makes AI actually useful for real work.

→ Custom automation setups on Fiverr
→ Follow CelebiBots on Telegram

#ai #rag #ollama #selfhosted #productivity #buildinginpublic

I Built a Health Monitor for My AI Agents — Now They Tell Me When They're Dying

Sam Hartley — Mon, 27 Jul 2026 08:03:06 +0000

For two weeks, my AI content pipeline was silently broken.

The cron job was still running. The logs showed "SUCCESS." But no articles were actually posting. The Dev.to API was returning 429s (rate limited) and the script was swallowing the error because I forgot to check the response status.

I only noticed because I happened to check the blog and saw the gap. Two weeks of missing posts. Two weeks of thinking everything was fine while the system was quietly failing.

That was the moment I realized: my agents need a pulse.

The Problem: Silent Failures Are the Worst Failures

I run three AI agents on three machines. Celebi on a Mac Mini, ProgrammierMinna on a Windows PC, DocMinna on... also the Mac Mini. They're connected via a simple router, triggered by cron jobs, and they talk to me through Telegram.

When everything works, it's magic. When something breaks, it's archaeology.

Common failure modes I'd hit:

Ollama crashed on the Windows PC after a Windows update. No one knew. Queries just timed out.
Disk full on the Mac Mini because model files kept accumulating. Logs stopped rotating. The system ground to a halt.
API rate limits on Dev.to (max 10 posts/day). The publish script didn't retry or notify.
Router misconfiguration after a router restart. The Windows PC got a new IP. Queries fell into the void.
Model pulled but not loaded. I'd update a model, restart Ollama, but forget to actually ollama run it. First query would error out.

Each of these took 20-60 minutes to diagnose. Not because they're hard problems, but because I didn't know where to look.

The Fix: A 50-Line Health Check Script

I didn't build Prometheus. I didn't install Grafana. I wrote a Python script that runs every 10 minutes, checks the things I care about, and sends me a Telegram message if something is wrong.

That's it. No dashboard. No metrics server. Just "tell me when it's broken."

# health_check.py
import shutil
import time

import requests

TELEGRAM_BOT_TOKEN = "your_token"
TELEGRAM_CHAT_ID = "your_chat_id"
MACHINES = {
    "mac_mini": "http://192.168.1.102:11434",
    "windows_pc": "http://192.168.1.106:11434",
    "ubuntu": "http://192.168.1.100:11434",
}
MODELS = ["qwen3.5:9b", "qwen3-coder:30b", "granite3.2:8b"]

def alert(message):
    url = f"https://api.telegram.org/bot{TELEGRAM_BOT_TOKEN}/sendMessage"
    requests.post(url, json={"chat_id": TELEGRAM_CHAT_ID, "text": f"🚨 {message}"})

def check_ollama(machine_name, url):
    try:
        r = requests.get(f"{url}/api/tags", timeout=5)
        if r.status_code != 200:
            alert(f"Ollama down on {machine_name}: HTTP {r.status_code}")
            return False

        models = [m["name"] for m in r.json().get("models", [])]
        for model in MODELS:
            if model not in models:
                alert(f"Model {model} missing on {machine_name}")
                return False
        return True
    except requests.exceptions.RequestException:
        # Catches timeouts as well as refused or dropped connections
        alert(f"Ollama unreachable on {machine_name}")
        return False

def check_disk():
    disk = shutil.disk_usage("/")
    percent = disk.used / disk.total * 100
    if percent > 90:
        alert(f"Disk usage: {percent:.1f}% — clean up needed!")
        return False
    return True

def check_last_article():
    # Check if an article was posted in the last 3 days
    # This reads a simple timestamp file written by the publish script
    try:
        with open("/tmp/last_article_timestamp.txt") as f:
            last = float(f.read().strip())
        import time
        if time.time() - last > 3 * 24 * 3600:
            alert("No article published in 3 days — pipeline may be stuck")
            return False
        return True
    except FileNotFoundError:
        return True  # No articles yet, that's fine

if __name__ == "__main__":
    all_ok = True
    all_ok &= check_disk()

    for name, url in MACHINES.items():
        all_ok &= check_ollama(name, url)

    check_last_article()  # Non-critical, don't fail on this

    if all_ok:
        # Optional: send heartbeat once per day
        pass

Total dependencies: requests (probably already installed). Total lines: ~50. Total setup time: 10 minutes.

What I Actually Monitor

1. Ollama Heartbeat

The most critical check. Every machine gets pinged at /api/tags every 10 minutes. If it doesn't respond in 5 seconds, I get a Telegram alert with the machine name.

This caught a Windows update restart last week. The PC came back up, but Ollama didn't start (I hadn't added it to startup yet). I knew within 10 minutes instead of finding out when I actually needed it.

2. Model Presence

Even if Ollama is running, the model I need might not be there. /api/tags lists the models that are pulled, not the ones currently loaded in memory, so this check confirms that my three core models (qwen3.5:9b, qwen3-coder:30b, granite3.2:8b) are present on their respective machines.

This saved me once when I updated qwen3-coder and the new version had a different tag. The old tag was gone. The router was pointing to a ghost. The monitor caught it before the first failed query.

3. Disk Space

Model files are big. A 30B parameter model is ~20GB. Three machines times three models each — that's a lot of storage that grows quietly. I alert at 90% disk usage.

The Mac Mini has a 256GB SSD. I hit 95% once and the system started swapping aggressively. Response times went from 2 seconds to 30 seconds. Now I clean up before it hurts.

4. Pipeline Liveness

This is the subtle one. The cron job runs, but is it actually doing anything? I write a timestamp file every time an article successfully publishes. The health check compares that timestamp to "now." If it's been more than 3 days, I get an alert.

This is what would have caught the silent 429 errors. The script was running, but not publishing. The timestamp wouldn't update. I'd know.
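The liveness check only works if the timestamp is written on real success, not on "the request was sent". A minimal sketch of that guard, assuming the publish script has the HTTP status code in hand; record_publish is a hypothetical helper, and the file format (a float from time.time()) matches what check_last_article() reads.

```python
import time
from pathlib import Path

def record_publish(status_code: int, stamp_file: str) -> bool:
    """Update the liveness timestamp only when the publish call succeeded."""
    if not 200 <= status_code < 300:
        # A 429 or 5xx must not count as a published article
        return False
    Path(stamp_file).write_text(str(time.time()))
    return True
```

With this in place, a run of 429 responses leaves the timestamp untouched, and the 3-day alert fires as intended.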

5. Router Reachability (Bonus)

I added a check that pings the gateway router itself. If the whole network is down, I get a different alert. This happened once during a power outage — the router rebooted faster than the machines, but DHCP reassigned IPs. I knew the network was wonky before I even tried to query anything.

What I Don't Monitor (On Purpose)

I'm not running a datacenter. There are things I deliberately don't check:

GPU temperature. My RTX 3060 has a perfectly fine stock cooler. If it throttles, I'll notice in response times. Adding temp monitoring means sensors, drivers, more code. Not worth it.
Network bandwidth. I'm not streaming video. Local network latency is never the bottleneck.
CPU load. The Mac Mini sits at 15% most of the time. If it spikes, I don't care unless it's sustained — and then Ollama response times will tell me.
Log aggregation. I read logs when something breaks. I don't need them shipped to Elasticsearch.

The rule: if a failure mode would be annoying to debug, monitor it. If it's just "nice to know," skip it.

How I Run It

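The script runs as a systemd timer. systemd is Linux-only, so the units below fit the Ubuntu box; on the Mac Mini the equivalent would be a launchd job or a cron entry. Every 10 minutes, checks fire. Alerts go to Telegram. If everything is fine, nothing happens. Silent success is the goal.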

# /etc/systemd/system/ai-health-check.timer
[Unit]
Description=AI Agent Health Check

[Timer]
OnCalendar=*:0/10
Persistent=true

[Install]
WantedBy=timers.target

# /etc/systemd/system/ai-health-check.service
[Unit]
Description=Run AI health check

[Service]
Type=oneshot
ExecStart=/usr/bin/python3 /home/sam/health_check.py
User=sam

sudo systemctl enable ai-health-check.timer
sudo systemctl start ai-health-check.timer

Takes 30 seconds to set up. Runs forever.

What Changed (The Honest Version)

I fix things before they hurt. The Windows PC issue was fixed within 10 minutes of the alert. Before monitoring, it would have been "why is this query so slow today?" followed by 20 minutes of ssh and log reading.

I trust the system more. There's a difference between "I think it's working" and "I know it's working because the monitor is green." I can focus on building instead of worrying.

But I also get more alerts than I'd like. The Ubuntu box is on WiFi and occasionally drops for 30 seconds. I get an "unreachable" alert, then 10 minutes later it's fine. I haven't tuned the thresholds yet because... honestly, it's not annoying enough to fix.

False positives exist. Once, the model check failed because Ollama was still loading after a restart. The model existed, but /api/tags returned an empty list for 15 seconds. I added a 5-second retry and it went away.
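The retry itself is not shown in the script above, so here is one way it might look: a small wrapper that re-runs a check after a pause before declaring failure. The name check_with_retry and its parameters are hypothetical; the sleep function is injectable so the wrapper can be tested without waiting.

```python
import time
from typing import Callable

def check_with_retry(check: Callable[[], bool], retries: int = 1,
                     delay_s: float = 5.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run `check`; on failure wait `delay_s` and try again up to `retries` times."""
    for attempt in range(retries + 1):
        if check():
            return True
        if attempt < retries:
            sleep(delay_s)  # give a restarting Ollama time to list its models
    return False
```

For this to suppress false positives, the wrapped check should only return True or False, and the alert should be sent once, after check_with_retry() has returned False.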

When This Matters (And When It Doesn't)

Do this if:

You have automated processes running unsupervised
You've had a "wait, when did that break?" moment
You use Telegram (or Slack, or email) anyway
You value knowing about problems over perfect metrics

Don't do this if:

You're still debugging your setup manually every day (fix the core issue first)
You want beautiful dashboards (use Prometheus + Grafana instead)
You enjoy the thrill of surprise failures

The Real Lesson

The best monitoring isn't the one that tells you everything. It's the one that tells you the thing you actually need to know, at the time you can still do something about it.

My 50-line script isn't impressive. It won't get me DevOps cred. But it caught three real issues in the first month, and it cost me nothing but an afternoon.

If you're running AI agents at home and you don't know when they break... you just don't know when they break. Fix that first. Everything else is optimization.

Sam Hartley is a solo dev running a monitored 3-machine AI home lab. Writes about the boring infrastructure that makes local AI actually reliable.

→ Custom automation setups on Fiverr
→ Follow CelebiBots on Telegram

#ai #agents #monitoring #devops #selfhosted #homelab #buildinginpublic

I Moved My AI Stack From NVIDIA to Apple Silicon. Then I Moved Back. Here's Why.

Sam Hartley — Sat, 25 Jul 2026 08:03:13 +0000

I bought a Mac Mini M4 thinking it would replace my Windows PC for local AI. Spoiler: it didn't. But it also didn't fail — it just turned out to be good at different things than I expected.

Here's six months of actual usage data, with the real numbers nobody puts in benchmark charts.

The Hardware

Machine	CPU/GPU	RAM	What I Paid
Mac Mini M4	10-core M4, 10-core GPU	16GB unified	$599
Windows PC	AMD 9970X	128GB	~$2,500 (existing)
	RTX 3060 12GB		$150 (used)

The Mac Mini is silent, tiny, and sips power. The PC sounds like a jet engine under load and pulls 200W+ when the GPU is working. I wanted the Mac to replace the PC for everything AI-related. I was half right.

The Migration (Month 1-2)

I installed Ollama on the Mac Mini, pulled Qwen 3.5 9B, and started using it as my daily driver. The first thing I noticed: it's fast. Really fast. For a 9B model, inference felt snappier than the RTX 3060 running the same model.

But then I tried loading bigger models.

Qwen 3 Coder 30B (Q4_K_M, ~18GB): Loaded fine on the RTX 3060 (12GB VRAM... wait, that shouldn't work). Oh right — it offloads to system RAM. Slow, but functional. On the Mac Mini? Out of memory. 16GB unified memory isn't 16GB free memory — the OS, browser, and apps eat 6-8GB before Ollama even starts.

DeepSeek R1 8B: Runs great on both. Mac Mini: ~25 tok/s. RTX 3060: ~35 tok/s. The GPU wins, but not by enough to matter for interactive use.

Qwen 3.5 9B: Mac Mini: ~18 tok/s. RTX 3060: ~28 tok/s. Again, GPU is faster, but both feel instant.

So far: Mac Mini handles small models well. But I hit the wall fast.

The Wall (Month 3)

Three things broke my "Mac only" dream:

1. Memory Pressure

16GB unified memory sounds like a lot until you're running:

Ollama (4-8GB for a loaded model)
VS Code with a few extensions (2-3GB)
Safari with 20 tabs (3-4GB)
A terminal, a music player, maybe a video call (2GB)

That's 11-17GB before you even ask the model a question. macOS starts swapping. Ollama slows down. The whole machine gets sluggish.

On the Windows PC with 128GB RAM? I can load a 30B model and run a game and compile code simultaneously. The RTX 3060's 12GB VRAM is a separate pool that doesn't compete with the OS.

2. Model Availability

Not every model runs well on Apple Silicon. The M4 has excellent support for mainstream models (Llama, Qwen, Mistral), but niche or newly released models often ship with "CUDA only" initial releases. I spent an afternoon trying to get a specific vision model running on the Mac before giving up and running it on the PC in 10 minutes.

The NVIDIA ecosystem is still the default for AI tooling. Apple Silicon is catching up, but it's not there yet.

3. Multi-User / Multi-Model

I wanted to run two models simultaneously — a coding assistant and a general chatbot. On the Mac Mini, loading two 9B models (~12GB total) leaves zero headroom for the OS. Everything crawls.

On the PC, I can run a 30B model on the GPU (with the layers that exceed 12GB of VRAM offloaded to system RAM) and a 7B model on the CPU, and both stay responsive. The GPU handles the heavy one; the CPU handles the light one. Separate memory pools are a feature, not a bug.

What the Mac Mini Actually Excels At

I didn't move everything back to the PC. The Mac Mini found its niche, and it's a good one:

Always-On Tasks

The Mac Mini draws ~15W at idle. The PC draws ~80W at idle (that GPU doesn't sleep well). For 24/7 tasks — routing, scheduling, lightweight notifications — the Mac wins on power cost alone.
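To turn those idle wattages into money, a back-of-the-envelope calculation helps. The electricity price of $0.15/kWh below is an assumption for illustration; plug in your own rate.

```python
def monthly_cost_usd(watts: float, price_per_kwh: float = 0.15,
                     hours: float = 24 * 30) -> float:
    """Energy cost of a constant load over one 30-day month."""
    kwh = watts * hours / 1000  # watt-hours -> kilowatt-hours
    return kwh * price_per_kwh

print(f"Mac Mini idle: ${monthly_cost_usd(15):.2f}")  # 10.8 kWh -> $1.62
print(f"PC idle:       ${monthly_cost_usd(80):.2f}")  # 57.6 kWh -> $8.64
```

At that assumed rate, idling alone costs the PC roughly five times what it costs the Mac Mini, before any GPU load is counted.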

Small Model Inference

For anything 9B and under, the Mac Mini is genuinely competitive. The unified memory architecture means there's no "copy to VRAM" overhead. A 4B model on the Mac can feel faster than the same model on the GPU because the latency is lower.

No Driver Drama

NVIDIA drivers on Windows are... fine. Until they aren't. An update breaks CUDA, a new Ollama version wants a different driver, and suddenly you're debugging nvidia-smi on a Tuesday night. The Mac just works. I have never, in six months, had an Apple Silicon AI setup break because of a system update.

Portability

I unplugged the Mac Mini, took it to a different room, plugged it back in, and my entire AI stack was back online in 30 seconds. Try that with a full tower PC.

The Real Numbers

Here's a month of actual usage, tracked with a simple script that logs tokens/sec and power draw:

Task	Mac Mini M4	RTX 3060	Winner
Qwen 3.5 9B, 500-token prompt	18 tok/s	28 tok/s	RTX 3060
DeepSeek R1 8B, reasoning	25 tok/s	35 tok/s	RTX 3060
Qwen 3 4B, quick chat	42 tok/s	55 tok/s	RTX 3060
30B model loading	❌ OOM	✅ Loads (slow)	RTX 3060
Two models simultaneously	❌ Crawls	✅ Works	RTX 3060
Idle power draw	~15W	~80W	Mac Mini
Load power draw	~35W	~220W	Mac Mini
Setup time (new model)	2 min	2 min	Tie
Driver reliability	Flawless	Occasional issues	Mac Mini
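The logging script is not shown, but the tokens/sec column can be derived from fields Ollama already returns: the final response of a generation call includes eval_count (tokens generated) and eval_duration (nanoseconds spent generating). A minimal sketch, with tokens_per_second as a hypothetical helper name and made-up sample numbers:

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama response's eval_count and eval_duration."""
    duration_ns = response["eval_duration"]  # nanoseconds spent generating
    if duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return response["eval_count"] / (duration_ns / 1e9)

sample = {"eval_count": 90, "eval_duration": 5_000_000_000}  # made-up numbers
print(tokens_per_second(sample))  # 18.0
```

This measures generation only; prompt processing time is reported separately and is not included here.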

What I Actually Do Now

I stopped trying to pick a winner. Both machines have jobs:

Mac Mini M4:

Always-on orchestration (routing, scheduling, notifications)
Small model inference (4-9B) for quick tasks
Coding and development (it's my daily driver)
Runs 24/7, costs ~$3/month in power

Windows PC + RTX 3060:

Heavy inference (30B models, image generation)
GPU rental on Vast.ai when I'm not using it
Anything that needs CUDA or more than 8GB VRAM
On-demand, not 24/7

This isn't a benchmark conclusion. It's a workflow conclusion. The Mac Mini is the brain. The PC is the muscle. Separating them by task type eliminated the "which machine should I use?" decision fatigue and made both faster at what they're good at.

The Honest Bottom Line

If you're buying one machine for local AI and you want to run large models: get an NVIDIA GPU. Full stop. The ecosystem, the memory architecture, and the raw inference speed are still unbeatable for serious workloads.

If you want a silent, efficient, always-on machine for smaller models and orchestration: the Mac Mini M4 is excellent. Just don't expect it to replace a dedicated GPU for everything.

And if you already have both? Stop trying to consolidate. Use each for what it's good at. The separation is the feature.

Sam Hartley is a solo dev running a split AI stack across a Mac Mini and a Windows PC. No benchmarks were harmed in the making of this article — just a lot of time commands and a power meter.

Drop your setup in the comments — curious if others have found the same split, or if you've made a single-machine setup work for everything.

I Stopped Asking One AI to Do Everything. Here's What Happened.

Sam Hartley — Thu, 23 Jul 2026 08:03:37 +0000

For months I had one AI agent. One model. One endpoint. I'd throw every question at it — code review, article drafts, debugging, random "what's the weather" questions — and expect it to handle it all.

It didn't. Not really. It answered everything, but it answered everything the same way. The same tone, the same depth, the same blind spots. When I needed a surgical code review, I got a friendly chat response. When I needed a casual summary, I got a three-paragraph essay.

So I split it into three specialized agents. Not because it's trendy. Because general-purpose AI is a compromise, and I was tired of compromising.

The Problem: One Brain, Too Many Jobs

I run a home lab on a Mac Mini M4, a Windows PC with an RTX 3060, and an Ubuntu box. All three have Ollama installed. For the first six months, I just used the biggest model I could fit and routed everything to it.

That model was Qwen 3 Coder 30B — a coding specialist. Great for refactoring. Great for debugging. But when I asked it to "write a friendly summary of my day," it would respond with something that sounded like a technical spec document. Because that's what it was trained to do.

The reverse was just as bad. When I used a general chat model for coding questions, it would hallucinate APIs, miss edge cases, and write code that looked right but subtly violated conventions.

I was asking a brain surgeon to do therapy, and a therapist to do surgery.

The Split: Three Agents, Three Jobs

I didn't build a microservices architecture. I didn't install Kubernetes. I added two more Ollama endpoints and a 10-line router.

Here's what I ended up with:

Celebi (Mac Mini, Qwen 3.5 9B) — The generalist. Handles routing, scheduling, daily summaries, weather, quick questions. Response time: 1-2 seconds. Perfect for "what's on my calendar today" and "summarize these emails."

ProgrammierMinna (Windows PC, Qwen 3 Coder 30B) — The coder. Handles code generation, refactoring, debugging, PR review. Response time: 8-15 seconds. When I need a function written or a bug found, this is where the query goes.

DocMinna (Mac Mini, Granite 3.2 8B) — The writer. Handles documentation, article drafts, READMEs, technical specs. Response time: 3-5 seconds. This model is worse at coding but surprisingly good at structure and flow.

The router is embarrassingly simple:

def route_query(query: str) -> str:
    coding_keywords = ["code", "function", "bug", "refactor", "debug", "pr", "review"]
    writing_keywords = ["write", "draft", "article", "readme", "doc", "summary", "blog"]

    if any(k in query.lower() for k in coding_keywords):
        return "http://192.168.1.106:11434"  # ProgrammierMinna
    elif any(k in query.lower() for k in writing_keywords):
        return "http://192.168.1.102:11434"  # DocMinna
    else:
        return "http://192.168.1.102:11434"  # Celebi

Is it perfect? No. "Write a Python script that sends emails" gets routed to DocMinna because of "write," when it probably should go to ProgrammierMinna. I fix those manually when I catch them. But it works 90% of the time, and that's enough.
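Two things in the router are easy to tighten. Substring matching means "pr" also fires on words like "improve", and since DocMinna and Celebi share a host, the URL alone does not say which model to call. A possible variant, assuming whole-word matching with a simple hit count; the model tags are the ones listed elsewhere in this post, while the extra keywords "python" and "script" are hypothetical additions.

```python
import re

# Endpoints and model tags come from this post; "python" and "script"
# are hypothetical keyword additions for illustration.
AGENTS = {
    "coder": ("http://192.168.1.106:11434", "qwen3-coder:30b"),   # ProgrammierMinna
    "writer": ("http://192.168.1.102:11434", "granite3.2:8b"),    # DocMinna
    "general": ("http://192.168.1.102:11434", "qwen3.5:9b"),      # Celebi
}
CODING = {"code", "function", "bug", "refactor", "debug", "pr", "review", "python", "script"}
WRITING = {"write", "draft", "article", "readme", "doc", "summary", "blog"}

def route_query(query: str) -> tuple[str, str]:
    """Return (endpoint, model) by counting whole-word keyword hits."""
    words = set(re.findall(r"[a-z0-9]+", query.lower()))
    coding_hits = len(words & CODING)
    writing_hits = len(words & WRITING)
    if coding_hits == 0 and writing_hits == 0:
        return AGENTS["general"]
    # Ties go to the coder, matching the original check order
    return AGENTS["coder"] if coding_hits >= writing_hits else AGENTS["writer"]
```

With these keyword sets, "Write a Python script that sends emails" scores two coding hits against one writing hit and goes to the coder. The trade-off is that whole-word matching misses plurals such as "bugs" unless they are added to the sets.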

What Actually Changed

Quality went up immediately

Before the split, I'd ask for a code review and get generic advice like "consider adding error handling." After the split, ProgrammierMinna would say "this async function doesn't handle TimeoutError — add a try/except around line 47, and use asyncio.wait_for with a 5-second timeout."

Same question. Different model. Different depth.

I stopped over-explaining

With the generalist, I'd have to add context like "please be thorough, this is for production code" or "keep it casual, this is a blog post." The specialists already know their role. I don't need to prompt-engineer the tone. It's baked into the model choice.

Parallel processing became possible

I can now fire off a coding task to ProgrammierMinna and a writing task to DocMinna simultaneously. They're running on different machines with different GPUs. No queue. No waiting for one to finish before the other starts.

Fallbacks got simpler

When the Windows PC is offline (asleep, rented out on Vast.ai, or I'm traveling), queries that would normally go to ProgrammierMinna fall back to Celebi with a note: "PC offline: answering with generalist model, quality may vary." The system degrades gracefully instead of just failing.
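The fallback logic is not shown in the post, so this is only a sketch of how it might be wired: try the primary endpoint, and if it is down, return the fallback together with a note along the lines of the one quoted above. pick_endpoint and the is_up callback are hypothetical names.

```python
from typing import Callable

def pick_endpoint(primary: str, fallback: str,
                  is_up: Callable[[str], bool]) -> tuple[str, str]:
    """Return (endpoint, note); the note is empty when the primary is reachable."""
    if is_up(primary):
        return primary, ""
    return fallback, "PC offline: answering with generalist model, quality may vary."
```

In practice is_up could wrap the same GET on /api/tags with a short timeout that the health check uses, so routing and monitoring agree on what "down" means.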

The Honest Downsides

Three models to manage. Updates, storage, keeping track of which version is on which machine — it's overhead. Each model is 4-20GB. My model folder went from 40GB to 120GB across three machines.

Routing mistakes happen. I mentioned the "write a Python script" example. There are others. "Debug this article" (writing + debugging) confuses the router. I hit maybe 5% mis-routes, and I notice them because the response feels slightly off.

More endpoints to monitor. Instead of one Ollama instance to check, I have three. I built a simple health check script that pings each one and sends me a Telegram alert if any are down. It took 30 minutes to build. It runs forever.

Context doesn't transfer. If I'm in a long coding session with ProgrammierMinna and then ask Celebi "what did we just decide about the database schema?" — Celebi has no idea. Each agent has its own conversation history. I work around this by copying relevant context when I switch, but it's friction.

The Numbers

Metric	One Generalist	Three Specialists
Avg response time (coding)	8-15s	8-15s (same model)
Avg response time (writing)	8-15s	3-5s (smaller, faster model)
Avg response time (general)	8-15s	1-2s (tiny model)
Code review quality	6/10	9/10
Draft writing quality	5/10	8/10
Daily summary quality	7/10	8/10
Monthly cost	$0 (local)	$0 (local)
Setup complexity	Low	Medium
Maintenance overhead	Low	Medium

The quality jumps are subjective but real. I measured them by how often I had to ask for a redo. With the generalist, maybe 30% of responses needed a follow-up clarification. With specialists, maybe 5%.

When This Makes Sense (And When It Doesn't)

Do this if:

You have multiple distinct task types (coding + writing + analysis)
You have the hardware to run multiple models (even small ones)
You care about quality more than simplicity
You're already hitting the limits of your current model

Don't do this if:

You're just casually chatting with AI
You only have one machine with limited RAM
Your tasks are all similar (all coding, all writing)
You value simplicity over marginal quality gains

The Setup in 30 Minutes

If you want to try this, here's the fastest path:

Install Ollama on two machines (or twice on one machine if you have the RAM)
Pull different models: a coder model on one, a general model on the other
Write a 10-line router (like the Python snippet above)
Point your scripts at the router instead of directly at a model
Adjust keywords as you find mis-routes

Total time: 30 minutes. Total cost: $0.

The Real Win

The biggest change isn't technical. It's mental.

Before, I had one AI "employee" who was mediocre at everything. Now I have three specialists who are genuinely good at their jobs. When I send a query, I know which expert is handling it. I trust the output more. I spend less time verifying and fixing.

It's the difference between a Swiss Army knife and an actual toolbox. The knife fits in your pocket. But when you need to build something real, you want the right tool.

Sam Hartley is a solo dev running a multi-agent AI setup on a 3-machine home lab. Writes about the infrastructure that makes local AI actually usable.

→ Custom automation setups on Fiverr
→ Follow CelebiBots on Telegram

#ai #agents #automation #selfhosted #ollama #productivity #buildinginpublic

I Built a Garmin Watch Face with Live Stock Charts in Monkey C — Here’s What I Learned

Sam Hartley — Tue, 21 Jul 2026 08:02:19 +0000

I wanted stock prices on my wrist. Not a notification — an actual chart. So I built a custom Garmin watch face with live sparkline charts for 5 configurable assets.

Here’s what I learned building StockFaceTC for the Garmin Venu 2 Plus.

The Challenge

Garmin watch faces are tiny — 124KB memory, 416×416 AMOLED display, and a language (Monkey C) that most devs have never heard of. No npm. No frameworks. No Stack Overflow answers.

Oh, and the “simulator” can’t make web requests. You test API calls on real hardware or not at all.

What I Built

A watch face showing:

Time, date, and health stats (heart rate, steps, floors)
5 polar sparkline charts in an outer ring — each showing 24 hours of price data
Auto-updates every 15 minutes via Twelve Data API
6-hour weather forecast with custom-drawn icons
Fully configurable tickers via the Garmin Connect app

Default tickers: AAPL, TSLA, Gold, BTC/USD, ETH/USD

The Font Problem (and How I Solved It)

Garmin’s VectorFont API isn’t available on all devices. The Venu 2 Plus? Nope.

So I built my own: PrimitiveFont — a complete vector font library using only drawLine() calls. Every character is defined on a 5×7 grid and rendered procedurally.

// "A" on a 5×7 grid: each group of four numbers is one line segment (x1, y1, x2, y2)
CHAR_A = [0,6, 0,2,  0,2, 2,0,  2,0, 4,2,  4,2, 4,6,  0,4, 4,4]

Features:

Full alphabet (A-Z, a-z, 0-9, specials)
Proportional widths (Arial-like metrics — ‘i’ is narrow, ‘M’ is wide)
Bold, italic, underline
Arc text — characters curved along a circular path (for the ring labels!)
Any size via simple scaling: sizePx / 7.0

The entire library is ~21KB — well within the 124KB budget.
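
To make the grid-plus-scaling idea concrete, here is a minimal sketch in Python (the watch face itself is Monkey C; Python is used here only for illustration). It assumes the flat array above is read four numbers at a time as x1, y1, x2, y2. The function name `scale_glyph` and its parameters are hypothetical, not part of PrimitiveFont.

```python
# Hypothetical illustration: scale a 5x7 grid glyph to pixel-space line segments.
CHAR_A = [0, 6, 0, 2,  0, 2, 2, 0,  2, 0, 4, 2,  4, 2, 4, 6,  0, 4, 4, 4]

GRID_HEIGHT = 7.0  # glyphs are designed on a 5x7 grid


def scale_glyph(coords, size_px, origin_x=0.0, origin_y=0.0):
    """Turn a flat [x1, y1, x2, y2, ...] glyph into scaled line segments."""
    scale = size_px / GRID_HEIGHT  # the "sizePx / 7.0" rule from the article
    segments = []
    for i in range(0, len(coords), 4):
        x1, y1, x2, y2 = coords[i:i + 4]
        segments.append((
            origin_x + x1 * scale, origin_y + y1 * scale,
            origin_x + x2 * scale, origin_y + y2 * scale,
        ))
    return segments


if __name__ == "__main__":
    # Each tuple would become one drawLine() call on the device.
    for segment in scale_glyph(CHAR_A, size_px=14, origin_x=100, origin_y=50):
        print(segment)
```

Because every glyph is just a list of segments, bold or italic variants can in principle be derived by drawing extra offset lines or shearing the x coordinates before scaling.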

Architecture

Garmin Background Service (every 15 min)
  → Twelve Data API (1h candles × 24 = full day)
  → Phone acts as transparent HTTP proxy
  → JSON parsed in background thread (64KB limit!)
  → Prices + display names stored in model
  → View renders sparklines + arc labels on update

Key Constraint: 64KB Background Memory

The background service that fetches data runs in a separate 64KB sandbox. You can’t access the main app’s memory. The only way to pass data: Background.exit(dictionary), which the main app receives in onBackgroundData().

This means: parse JSON, extract only what you need, pass a minimal dictionary. No room for raw API responses.
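
The "extract only what you need" step can be sketched as follows. This is a Python illustration rather than the Monkey C that runs on the watch, and it assumes the response carries a `meta` object plus a `values` list whose entries have a string `close` field. The function name `extract_closes` and the output keys are hypothetical.

```python
import json


def extract_closes(raw_json, max_points=24):
    """Keep only the symbol and closing prices from a time-series response."""
    payload = json.loads(raw_json)
    meta = payload.get("meta", {})
    values = payload.get("values", [])[:max_points]
    # Everything else (open/high/low/volume, timestamps) is dropped on purpose
    # so the dictionary handed to the main app stays as small as possible.
    return {
        "symbol": meta.get("symbol", ""),
        "closes": [float(v["close"]) for v in values],
    }


if __name__ == "__main__":
    sample = json.dumps({
        "meta": {"symbol": "AAPL", "interval": "1h"},
        "values": [
            {"datetime": "2026-07-20 15:30:00", "open": "1.0", "close": "1.5"},
            {"datetime": "2026-07-20 14:30:00", "open": "0.9", "close": "1.0"},
        ],
    })
    print(extract_closes(sample))
```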

Dynamic Labels from API

One early mistake: I hardcoded display names like “XAU/USD” → “GOLD”. That doesn’t scale.

Better approach: The Twelve Data API returns metadata with every response:

"meta": {
    "symbol": "XAU/USD",
    "currency_base": "Gold Spot",
    "type": "Precious Metal"
}

Now labels come from the API itself:

Stocks: meta.symbol → “AAPL”
Crypto: meta.currency_base → “Bitcoin” → “BITCOIN”
Commodities: meta.currency_base → “Gold Spot” → “GOLD”

Zero hardcoded aliases. Add any ticker, get the right label automatically.
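
One way to express that labeling rule, sketched in Python for illustration. The "first word, uppercased" step is an assumption inferred from the two examples above ("Gold Spot" becoming "GOLD", "Bitcoin" becoming "BITCOIN"), and `label_from_meta` is a hypothetical name.

```python
def label_from_meta(meta):
    """Derive a short display label from a response's meta block."""
    # Crypto and commodities carry a human-readable currency_base.
    words = (meta.get("currency_base") or "").split()
    if words:
        return words[0].upper()
    # Stocks have no currency_base, so fall back to the ticker symbol.
    return meta.get("symbol", "").upper()


if __name__ == "__main__":
    print(label_from_meta({"symbol": "XAU/USD", "currency_base": "Gold Spot"}))
    print(label_from_meta({"symbol": "AAPL"}))
```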

Simulator Gotchas

If you’re building Connect IQ apps, save yourself some pain:

Properties are cached forever — changing properties.xml defaults does nothing after first run. Use “Reset All App Data” first.
makeWebRequest() doesn’t work in the simulator — you need a real phone connected via Bluetooth.
getSettingsView() must be implemented (even returning null) or the settings menu stays greyed out.
No VectorFont on many devices — if you need custom text, roll your own like PrimitiveFont.

API Budget

Twelve Data free tier: 800 API calls/day.

My budget: 5 symbols × 96 fetches/day (every 15 min) = 480 calls. Well within limits.
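
The same arithmetic as a small Python sketch, useful if you change the refresh interval or the number of tickers. It assumes one API call per symbol per fetch, as in the budget above; the function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def daily_calls(symbols, interval_min):
    """API calls per day when every symbol is fetched once per interval."""
    return symbols * (MINUTES_PER_DAY // interval_min)


def max_symbols(daily_limit, interval_min):
    """Largest number of symbols that still fits under the daily limit."""
    return daily_limit // (MINUTES_PER_DAY // interval_min)


if __name__ == "__main__":
    print(daily_calls(symbols=5, interval_min=15))       # 5 x 96 fetches
    print(max_symbols(daily_limit=800, interval_min=15))
```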

The Code

Source is on GitLab. Feel free to fork it, break it, or build something cooler.

What’s the weirdest hardware you’ve built for? Drop it in the comments.

I Built a Personal AI That Actually Knows My Projects (RAG + Ollama, Zero Cloud)

Sam Hartley — Sun, 19 Jul 2026 08:03:13 +0000

I got tired of explaining my own codebase to an AI every single session.

"Here's the architecture. Here's the README. Here's what I tried last time." Every. Single. Time.

So I built a local RAG (Retrieval-Augmented Generation) system that knows my projects, my notes, and my docs — permanently. No cloud. No API costs. No context window resets.

Here's exactly how it works.

The Problem with Context Windows

LLMs don't remember. You paste the same 200 lines of context every session, hit the token limit, and start over. It's fine for one-off questions. It's exhausting for ongoing projects.

The standard solution is RAG: instead of stuffing everything into the prompt, you store docs in a vector database and retrieve only the relevant chunks when you ask a question. The model sees 3-5 paragraphs of targeted context instead of your entire repo.

Result: faster, cheaper, and the AI actually answers the right question.

The Architecture

Your Documents (markdown, code, PDFs, notes)
  → Chunked + embedded (Ollama nomic-embed-text)
  → Stored in Chroma (local vector DB)

Query
  → Embedded (same model)
  → Top-5 relevant chunks retrieved
  → Stuffed into Ollama prompt (Qwen 3.5 9B)
  → Answer

Zero cloud. Zero API keys. Runs on a Mac Mini or any machine with 8GB RAM.

What I Index

Everything that would normally eat my context window:

Project READMEs and architecture docs
My personal notes (Obsidian vault)
Code snippets and past solutions
API documentation I use regularly
Stack Overflow answers I bookmarked (because I always forget them again)
Config files and deployment notes

Total indexed: ~4,800 chunks. Query time: under 2 seconds.

Step 1: Install the Stack (15 minutes)

# Ollama (already installed? skip)
curl -fsSL https://ollama.com/install.sh | sh

# Pull models
ollama pull qwen3.5:9b          # LLM for answers
ollama pull nomic-embed-text    # Embedding model

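# Python dependencies (langchain-community provides the loaders, embeddings, and Chroma wrapper used below)
pip install chromadb langchain langchain-community ollama pypdf markdown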

That's the entire stack. No Docker required (though Chroma has a Docker option if you want a persistent server).

Step 2: Index Your Documents

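from pathlib import Path

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Load your docs folder. Python does not expand "~" on its own, so resolve it first.
# TextLoader reads the markdown as plain text and avoids the extra "unstructured" dependency.
docs_dir = Path("~/projects/").expanduser()
loader = DirectoryLoader(str(docs_dir), glob="**/*.md", recursive=True, loader_cls=TextLoader)
docs = loader.load()

# Split into chunks of 400 with 50 overlap. This splitter counts characters by default,
# so pass a token-based length_function if you need true 400-token chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
chunks = splitter.split_documents(docs)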

print(f"Indexed {len(chunks)} chunks from {len(docs)} documents")

# Embed and store locally
embeddings = OllamaEmbeddings(model="nomic-embed-text")
db = Chroma.from_documents(
    chunks,
    embeddings,
    persist_directory="./my-knowledge-base"
)
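# Chroma 0.4+ saves to persist_directory automatically; older releases need an explicit persist()
if hasattr(db, "persist"):
    db.persist()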

Run once. Done. Your docs are now searchable by meaning, not just keywords.

Step 3: Query It

import ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Load existing DB
embeddings = OllamaEmbeddings(model="nomic-embed-text")
db = Chroma(persist_directory="./my-knowledge-base", embedding_function=embeddings)

def ask(question: str) -> str:
    # Retrieve top 5 relevant chunks
    results = db.similarity_search(question, k=5)
    context = "\n\n".join([r.page_content for r in results])

    # Query local LLM with context
    response = ollama.chat(
        model="qwen3.5:9b",
        messages=[{
            "role": "user",
            "content": f"Based on this context:\n\n{context}\n\nAnswer: {question}"
        }]
    )
    return response['message']['content']

# Example
print(ask("How does my Garmin watch face fetch stock data?"))
print(ask("What's the API rate limit for the crypto bot?"))
print(ask("How do I deploy the Telegram bot to the VPS?"))

Real answers from your own documentation, and far fewer hallucinations about your specific setup, because the model answers from retrieved text instead of guessing.

The Killer Feature: Incremental Updates

Don't re-index everything when one file changes. Just update what's new:

import hashlib
import json
from pathlib import Path

def get_file_hash(path):
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def update_index(docs_dir, db, index_cache="./index-cache.json"):
    cache = json.loads(Path(index_cache).read_text()) if Path(index_cache).exists() else {}

    changed_files = []
    for f in Path(docs_dir).rglob("*.md"):
        h = get_file_hash(f)
        if cache.get(str(f)) != h:
            changed_files.append(str(f))
            cache[str(f)] = h

    if changed_files:
        print(f"Re-indexing {len(changed_files)} changed files...")
        # [load, chunk, embed, upsert only changed files]

    # Save the updated hashes so the next run only sees new changes
    Path(index_cache).write_text(json.dumps(cache))

Run this as a cron job every hour. Your knowledge base stays current automatically.
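
The placeholder comment in `update_index` leaves the actual re-indexing open. A minimal sketch of one way to fill it in is below. It assumes a collection object that exposes `delete(where=...)` and `upsert(ids=..., documents=..., metadatas=...)` the way Chroma collections do, and that the collection was created with an embedding function so embedding happens on upsert. The names `chunk_text` and `reindex_file` are hypothetical, and the chunking here is a plain character window, not the LangChain splitter.

```python
import hashlib


def chunk_text(text, size=400, overlap=50):
    """Split text into overlapping character windows."""
    chunks, start, step = [], 0, size - overlap
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
        start += step
    return chunks


def reindex_file(collection, path, text):
    """Replace every stored chunk of one file with freshly split chunks."""
    # Drop the old chunks first so shortened files leave no stale tail behind.
    collection.delete(where={"source": path})
    chunks = chunk_text(text)
    if not chunks:
        return 0
    # Deterministic IDs make repeated runs overwrite instead of duplicate.
    ids = [hashlib.md5(f"{path}:{i}".encode()).hexdigest() for i in range(len(chunks))]
    metadatas = [{"source": path, "chunk": i} for i in range(len(chunks))]
    collection.upsert(ids=ids, documents=chunks, metadatas=metadatas)
    return len(chunks)
```

Inside `update_index`, the loop over `changed_files` would then read each file and call `reindex_file(collection, path, text)`.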

What Actually Changed for Me

Before RAG:

"Explain the background service memory limit in my Garmin project" → paste 200 lines → wait → answer
Every new chat session: context reset, start explaining again

After RAG:

ask("Garmin background service memory limit") → "64KB sandbox, pass data via Background.exit(dictionary)" — in 1.8 seconds

My LLM now answers questions about projects I haven't touched in 6 months. No context management. No pasting. Just ask.

Hardware Requirements

Setup	RAM	Embedding Speed	Query Speed
Mac Mini M4 8GB	8GB	~500 docs/min	~2s
RTX 3060 12GB	12GB VRAM	~3000 docs/min	~0.5s
Old laptop 8GB	8GB	~100 docs/min	~5-8s

The embedding step (indexing) is the slow part — run it once, then it's instant.

Tips from Running This for 3 Months

Chunk size matters — 400 tokens works well for prose and docs. For code, try 200 with more overlap.
Metadata is your friend — store filename and section in chunk metadata. When the AI says "see the deployment notes," you know exactly where to look.
Re-rank when accuracy matters — if top-5 chunks aren't enough, add a re-ranker step (Cohere has a free API, or use a local cross-encoder).
Watch your embed model — nomic-embed-text beats most larger models for RAG. Don't use your chat LLM for embeddings.
Hybrid search — combine vector search with BM25 keyword search for better results on technical queries with specific names/functions.
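
For the hybrid search tip, one common way to merge a vector ranking with a BM25 ranking is reciprocal rank fusion (RRF). The sketch below is a generic RRF implementation, not something taken from this setup. It assumes you already have two ranked lists of chunk IDs, and the constant `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked ID lists; IDs ranked high in many lists win."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


if __name__ == "__main__":
    vector_hits = ["deploy.md#2", "readme.md#0", "bot.md#4"]   # hypothetical IDs
    keyword_hits = ["readme.md#0", "bot.md#4", "deploy.md#2"]
    print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

RRF only needs ranks, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales.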

The Bigger Picture

This is step one of something bigger: a personal AI that grows with your projects instead of resetting every session.

Next phase I'm building: automatic indexing from Git commits (index diffs in real-time as you code) + a simple web UI for non-terminal queries.

Total current cost of this setup: $0/month. It runs on the same Mac Mini I already had.

The Honest Bottom Line

RAG isn't magic. It's a database query with a language model on top. But it solves a real problem: LLMs that don't know your stuff.

If you're pasting READMEs into ChatGPT every session, try this. 15 minutes of setup, and your AI finally remembers what you told it yesterday.

Sam Hartley is a solo dev building tools on a Mac Mini + RTX 3060 home lab. Writes about the messy reality of shipping stuff with AI.

#ai #rag #ollama #selfhosted #python #homelab #buildinpublic

I Automated My Entire Dev Workflow with AI Agents Running 24/7 on a Mac Mini

Sam Hartley — Fri, 17 Jul 2026 08:03:02 +0000

Every morning I wake up and check Telegram. There's a message from Celebi — my AI agent — telling me what happened overnight. New emails summarized. A draft article ready for review. A reminder that I have a meeting in 2 hours. Sometimes a screenshot from a camera showing motion at the front door.

All of this runs on a Mac Mini. Not in the cloud. Not on rented GPUs. On a $599 box under my desk.

Here's how I built it and what it actually costs.

The Hardware

My setup is three machines that talk to each other over my home network:

Machine	Role	Cost
Mac Mini M4 (16GB)	Always-on orchestrator, notifications, lightweight tasks	$599
Windows PC (AMD 9970X, RTX 3060 12GB)	Heavy lifting — coding models, image generation	~$2,500 existing hardware
Ubuntu Server (CPU-only)	Fallback, OCR backend, lightweight inference	~$300 old laptop

The Mac Mini is the brain. It's on 24/7, draws maybe 15W at idle, and handles routing, scheduling, and simple queries. The Windows PC wakes up for the hard stuff — 30B parameter models, vision tasks, anything that needs a GPU.

The Ubuntu box is my safety net. When the Windows PC is offline or I need something CPU-only, it handles it. It's slow (180 seconds for a vision query vs 4 seconds on the GPU), but it works.

What the Agents Actually Do

Celebi (Mac Mini, Qwen 3.5 9B)

My main agent. It runs on the Mac Mini and handles:

Daily summaries — emails, calendar, notifications in one message
Routing — decides which agent handles which task
Publishing — posts articles to Dev.to, sends Telegram messages
Lightweight queries — weather, quick questions, reminders

Response time: 1-3 seconds. Perfect for "what's my schedule today?"

ProgrammierMinna (Windows PC, Qwen 3 Coder 30B)

The coder. Handles:

Code generation — full features from descriptions
Refactoring — restructuring messy code
Debugging — finding bugs I missed
PR review — automated code review

Response time: 8-15 seconds. Worth the wait for quality code.

DocMinna (Mac Mini, Granite 3.2 8B)

The writer. Handles:

Documentation — READMEs, API docs, guides
Article drafts — turning my notes into readable prose
Technical specs — structured requirements documents

Response time: 3-5 seconds. Fast enough for iterative writing.

The Architecture (Simple)

No Kubernetes. No Docker Swarm. Just:

User (Telegram) → Celebi → Router → Right Agent → Response

Celebi receives the message, classifies it (coding, writing, general), and routes to the right specialist. The specialist does the work and sends it back to Celebi, which formats it and sends it to me.

If the Windows PC is offline, Celebi either handles it itself or falls back to the Ubuntu server. It's not fancy, but it works.
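
The route-then-fall-back behaviour can be sketched in a few lines of Python. The machine names mirror the hardware table, but the exact fallback order (Windows, then Ubuntu, then the Mac Mini) is an assumption, and `ROUTES` and `pick_machine` are hypothetical names rather than the actual router.

```python
# Preferred machines per task type, in fallback order (assumed ordering).
ROUTES = {
    "coding": ["windows", "ubuntu", "mac"],
    "writing": ["mac"],
    "general": ["mac"],
}


def pick_machine(task_type, online):
    """Return the first preferred machine that is currently online, else None."""
    for machine in ROUTES.get(task_type, ROUTES["general"]):
        if machine in online:
            return machine
    return None


if __name__ == "__main__":
    # Windows PC asleep: the coding task falls through to the next option.
    print(pick_machine("coding", online={"mac", "ubuntu"}))
```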

What I Automated

Morning Briefing (Every day at 8 AM)

Celebi checks my calendar, recent emails, and any flagged notifications. Sends a 3-sentence summary to Telegram. Takes me 10 seconds to read instead of 10 minutes of app-hopping.

Article Pipeline (Every 2 days)

ProgrammierMinna writes a draft from my notes. DocMinna reviews and edits. Celebi publishes via the Dev.to API. I get a notification with a link to review.

Actual time I spend per article: 10-15 minutes editing. Before this? 2-3 hours writing from scratch.

Code Review (On every push)

ProgrammierMinna scans PRs for bugs, anti-patterns, missing error handling. It's not perfect — it misses edge cases sometimes — but it catches 80% of the obvious stuff before a human reviews it.

Home Monitoring (Motion-triggered)

Camera detects motion? Celebi sends me a screenshot and asks if it's important. Package delivery? I'll know in 10 seconds. Stray cat? Also know in 10 seconds.

The Numbers (Monthly Cost)

Component	Cost
Mac Mini electricity (24/7, ~15W)	~$3/month
Windows PC electricity (on demand, ~200W when active)	~$8/month
Ubuntu server electricity (24/7, ~10W)	~$2/month
Dev.to API	$0
Telegram Bot API	$0
Ollama	$0
Total	~$13/month

Compare that to cloud alternatives:

OpenAI API for my volume: ~$150-200/month
Anthropic Claude: ~$100-150/month
A hosted agent platform (n8n, Make, etc.): ~$50-100/month

Savings: ~$300-400/month. The Mac Mini paid for itself in 2 months.
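
A quick way to sanity-check these figures against your own electricity price is sketched below in Python. It assumes a 30-day month; the function names are hypothetical, and the implied price per kWh is derived from the table rather than stated in it.

```python
HOURS_PER_MONTH = 24 * 30  # assumes a 30-day month


def monthly_kwh(watts, hours=HOURS_PER_MONTH):
    """Energy used in a month by a device drawing a constant load."""
    return watts * hours / 1000


def payback_months(hardware_cost, monthly_savings):
    """Months until the saved spend covers the hardware cost."""
    return hardware_cost / monthly_savings


if __name__ == "__main__":
    mac_kwh = monthly_kwh(15)  # Mac Mini at ~15W around the clock
    print(f"Mac Mini: {mac_kwh:.1f} kWh/month")
    print(f"Implied price: ${3 / mac_kwh:.2f}/kWh")
    for savings in (300, 400):
        months = payback_months(599, savings)
        print(f"${savings}/month saved: payback in {months:.1f} months")
```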

What's Annoying

Model management. Keeping track of which model is on which machine, updating them, clearing old ones — it's overhead. Not huge, but real.

Windows PC sleep. When the PC is asleep, complex queries take 180 seconds on the Ubuntu fallback instead of 8 seconds on the GPU. I've learned to schedule heavy tasks during hours when the PC is already awake.

Context limits. Even 30B models have limited context windows. For large codebases, I have to chunk the work. The model doesn't see the full picture, which leads to integration issues.

Debugging the system. When something breaks, it's not always obvious where. Is the model acting weird? Is the routing wrong? Is the hardware offline? I spend maybe 30 minutes per week on maintenance.

What Surprised Me

Local models keep up on simple tasks. Qwen 3.5 9B on the Mac Mini answers in 1-2 seconds. GPT-4o via API takes 500ms to 2 seconds, and that is before network latency. For quick queries, local feels just as snappy, and it never waits on a network round trip.

The agents talk to each other better than expected. I was worried about the handoff — would context get lost? Would responses be garbled? In practice, the routing works 95% of the time. The 5% failures are usually obvious and easy to fix.

It's more reliable than cloud. I've had OpenAI outages, rate limits, API changes. My local setup? The only downtime is when I restart the machine for updates. In 6 months of operation, total downtime: maybe 2 hours.

The Honest Bottom Line

This isn't about replacing developers or writers or thinkers. It's about removing friction.

I still make all the decisions. I still review all the code. I still edit every article. The agents just handle the parts I find tedious — turning my rough notes into readable prose, catching obvious bugs, formatting responses.

The result? I ship more. I write more. I spend less time on grunt work and more time on things that matter.

Is it perfect? No. Is it better than doing everything manually? Absolutely.

If you're running side projects and drowning in maintenance, consider a local agent setup. It doesn't have to be this elaborate — start with one agent on one machine and expand from there.

The $599 Mac Mini was the best dev investment I've made this year.

Sam Hartley is a solo dev building tools on a 3-machine home lab. Writes about the messy reality of shipping stuff with AI.

I Built a Dead-Simple API Gateway for My Local LLMs in 50 Lines of Python

Sam Hartley — Thu, 16 Jul 2026 15:43:08 +0000

I run three machines with local LLMs. A Mac Mini with an M4, a Windows box with an RTX 3060, and an Ubuntu server with a couple of older GPUs. Each has Ollama installed. Each has different models loaded.

For months, I hardcoded URLs in my scripts. Need a quick answer? Query the Mac. Need a coding assistant? Hit the Windows machine. Need the big model? Wait for the Ubuntu server.

It was annoying. So I built a tiny API gateway that routes requests automatically. It took an afternoon. It runs on a single Python file. And it completely changed how I use my local AI setup.

The Problem: Three URLs, Zero Logic

Before the gateway, my scripts looked like this:

# quick_question.py
import requests

# Which machine do I use today?
# Mac Mini — fast, small models
response = requests.post("http://192.168.1.100:11434/api/generate", json={
    "model": "qwen2.5:7b",
    "prompt": "Explain Python decorators"
})

# code_review.py
import requests

# Windows — has the GPU
response = requests.post("http://192.168.1.106:11434/api/generate", json={
    "model": "qwen3-coder:30b",
    "prompt": "Review this function..."
})

# hard_question.py
import requests

# Ubuntu — has the most VRAM
response = requests.post("http://192.168.1.100:11434/api/generate", json={
    "model": "deepseek-r1:70b",
    "prompt": "Design a distributed task queue..."
})

Three scripts. Three URLs. Zero flexibility. If the Windows machine was offline, the coding script just failed. If I added a new model, I had to update everything manually.

The Fix: A Stupid-Simple Gateway

I wanted one URL. One API. Let the gateway figure out which machine can handle the request.

Here's what I built:

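# gateway.py
from flask import Flask, request, jsonify
import requests

app = Flask(__name__)

# My machines and what they can run.
# Each entry needs its own LAN address: the Ubuntu URL below repeats the Mac's,
# so set it to that server's real IP before running.
MACHINES = {
    "mac": {"url": "http://192.168.1.100:11434", "models": ["qwen2.5:7b", "granite3.2-vision:2b"]},
    "windows": {"url": "http://192.168.1.106:11434", "models": ["qwen3-coder:30b", "deepseek-r1:8b"]},
    "ubuntu": {"url": "http://192.168.1.100:11434", "models": ["deepseek-r1:70b", "minicpm-v"]},
}

def find_machine(model):
    """Return the base URL of the first machine that lists this model, else None."""
    for cfg in MACHINES.values():
        if model in cfg["models"]:
            return cfg["url"]
    return None

@app.route("/api/generate", methods=["POST"])
def generate():
    # silent=True yields None instead of raising on a missing or invalid JSON body
    data = request.get_json(silent=True) or {}
    model = data.get("model")

    if not model:
        return jsonify({"error": "No model specified"}), 400

    machine_url = find_machine(model)
    if not machine_url:
        return jsonify({"error": f"Model {model} not found on any machine"}), 404

    # Ollama streams newline-delimited JSON by default, which response.json()
    # cannot parse, so ask for a single JSON object instead.
    payload = {**data, "stream": False}

    try:
        response = requests.post(
            f"{machine_url}/api/generate",
            json=payload,
            timeout=300
        )
        return jsonify(response.json()), response.status_code
    except requests.exceptions.ConnectionError:
        return jsonify({"error": f"Machine for {model} is offline"}), 503
    except requests.exceptions.Timeout:
        return jsonify({"error": f"Machine for {model} timed out"}), 504

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=11435)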

That's it. 50 lines. Run it on any machine, point all your scripts at http://gateway:11435, and forget about which box has which model.

Why This Is Better Than I Expected

I can move models around. When I got a new GPU for the Windows machine, I moved the big coding model there. Changed one line in MACHINES. Every script kept working.

Health checks are trivial. I added a /health endpoint that pings each machine. If one is down, my main script knows and routes around it.
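
The article does not show the /health code, so here is a minimal sketch of what such a check could look like. It probes each machine's Ollama `/api/tags` endpoint using only the standard library; `is_up` and `health` are hypothetical names, and the 2-second timeout is an arbitrary choice.

```python
import urllib.error
import urllib.request


def is_up(base_url, timeout=2.0):
    """True if the Ollama instance at base_url answers its model-list endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def health(machines, probe=is_up):
    """Map each machine name to whether it currently responds."""
    return {name: probe(cfg["url"]) for name, cfg in machines.items()}
```

In the gateway, a `/health` route could simply return `jsonify(health(MACHINES))`, and `find_machine` could skip any machine reported as down.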

Load balancing is obvious. If two machines have the same model, I can pick whichever is less busy. I haven't needed this yet, but the structure supports it.

My scripts got dumber. In a good way. They don't need to know about the infrastructure anymore. They just ask for a model and get an answer.

What I Didn't Build (On Purpose)

No database. No config files. No Docker. No Kubernetes. No "service mesh."

This is a single Python file with a dictionary. If I need to change something, I edit the file and restart it. Takes 10 seconds.

I thought about making it "proper" — YAML configs, hot reloading, Prometheus metrics. But this is for my home lab. I'm the only user. Complexity is the enemy.

The Real Win: Mental Overhead

Before the gateway, using my local AI felt like work. I'd open a script, remember which machine had which model, check if it was online, then query it.

Now it feels like... using an API. Any API. I don't think about the infrastructure. I just write the prompt and get the result.

That's the whole point of infrastructure: it should disappear.

Numbers (Because Why Not)

Lines of Python: 50
Time to build: 2 hours (including testing)
Time saved per week: ~30 minutes of "which machine is this on again?"
Additional dependencies: Flask (already installed for other projects)
Cost: $0

Getting Started

If you have multiple Ollama instances, you can literally copy-paste the script above, change the IPs and models, and be done.

pip install flask requests
python gateway.py

Then in your scripts:

import requests

# One URL. Any model. Gateway handles the rest.
response = requests.post("http://localhost:11435/api/generate", json={
    "model": "qwen3-coder:30b",
    "prompt": "Refactor this function..."
})

The Honest Bottom Line

Is this production-ready? No. Does it handle edge cases? Barely. Is it good enough for my home lab? Absolutely.

Sometimes the right architecture is the one you'll actually maintain. A 50-line Python file I can debug in my head beats a "proper" solution I'd never finish.

If you're running multiple Ollama instances and manually switching between them — just build the gateway. It takes an afternoon and saves you from ever thinking about machine IPs again.

Sam Hartley is a solo dev running a multi-machine AI home lab in Turkey. Writes about the boring infrastructure that makes local AI actually usable.

#ai #ollama #selfhosted #api #python #homelab #buildinpublic

I Ditched ChatGPT for Local LLMs and Saved $2,000 in a Year — The Real Numbers

Sam Hartley — Tue, 07 Jul 2026 08:02:10 +0000

"Just use ChatGPT." — I heard this for months. And I did. Until I got the bill.

$187 in one month. For a solo dev running side projects. That was my wake-up call.

This is the story of how I went from cloud-only to a hybrid setup, what it actually cost, and where local models fall flat on their face.

The Setup (July 2025)

I was using three APIs daily:

OpenAI GPT-4o for code review and general questions
Anthropic Claude Sonnet for writing and reasoning
Google Gemini Pro for quick tasks and summaries

My workload wasn't enterprise-level. Maybe 300-500 queries per day across all projects — a mix of coding help, content drafting, data extraction, and random "what's the difference between these two Python libraries" questions.

The bill for June 2025: $187.42.

For context, that's more than my internet bill, my streaming subscriptions, and my VPS combined.

Month 1: The Experiment

I bought a used RTX 3060 12GB off eBay for $150. Added it to my existing PC (which already had a decent CPU). Installed Ollama. Pulled Qwen 2.5 7B.

Took 20 minutes from "unboxing" to "first local query".

The result? For simple questions — "explain this regex", "refactor this function", "summarize this text" — the 7B model was about 85% as good as GPT-4o. The answers were slightly less polished, sometimes missing nuance, but good enough.

The catch? Complex reasoning. I asked it to design a database schema for a multi-tenant app with row-level security. It gave me something that looked right but had a subtle flaw that would have caused data leaks in production.

GPT-4o caught that flaw. The local model didn't.

Lesson learned: Local models are great for 80% of tasks. The other 20% still needs the big guns.

Building the Hybrid

By month 3, I had a routing system:

Query comes in
  → Is it simple? (explain, refactor, summarize)
    → Local model (free, ~2s)
  → Is it code review?
    → Local coder model (free, ~8s)
  → Is it complex reasoning or architecture?
    → Cloud API ($0.003-0.02 per query)

I didn't build anything fancy. Just a 30-line Python script that checks the query type and routes it. The "complexity check" is embarrassingly simple — if the query contains words like "architecture", "design", "security", "performance", or is longer than 500 characters, it goes to the cloud.

Is it perfect? No. Does it catch edge cases? Sometimes. But it's good enough and saved me a fortune.
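The routing rule described above fits in a few lines. Here is a minimal sketch, not the actual script: the keyword list and the 500-character cutoff come from the description, while the function name, the return labels, and the code-review hint phrases are assumptions made for illustration.

```python
# Keywords and length cutoff as described in the post.
CLOUD_KEYWORDS = ("architecture", "design", "security", "performance")
MAX_LOCAL_CHARS = 500

# Assumption: the post does not say how a "code review" query is recognised,
# so these hint phrases are purely illustrative.
CODE_REVIEW_HINTS = ("code review", "review this")


def route(query: str) -> str:
    """Return 'cloud', 'local-coder', or 'local' for a query string."""
    text = query.lower()
    # Long queries or "complex" keywords go to the paid cloud API.
    if len(query) > MAX_LOCAL_CHARS or any(word in text for word in CLOUD_KEYWORDS):
        return "cloud"
    # Code review goes to the local coder model.
    if any(hint in text for hint in CODE_REVIEW_HINTS):
        return "local-coder"
    # Everything else is treated as a simple query.
    return "local"
```

Plain substring matching is cheap but noisy: "redesign this button" would also go to the cloud. That trade-off matches the "good enough" spirit of the original.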

The Numbers (12 Months)

Cloud-Only Year (Hypothetical)

If I kept my June 2025 pace: $187/month × 12 = $2,244/year

Actual Hybrid Year

Cost	Amount
Used RTX 3060 12GB	$150 (one-time)
Electricity (GPU running 24/7)	~$12/month = $144/year
Cloud API usage (reduced)	~$25/month = $300/year
Total first year	$594
Total subsequent years	~$444/year

Savings: $2,244 - $594 = $1,650 in year one.

After the GPU is paid off, it's $444/year vs $2,244. At roughly $150/month in savings ($187 minus $37 for electricity and reduced cloud usage), the $150 GPU pays for itself in about a month.
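For anyone who wants to plug in their own numbers, the arithmetic above can be reproduced with a short sketch. The function and key names are made up for this example; the inputs are the figures from the table.

```python
def yearly_costs(cloud_only_monthly, gpu_price, electricity_monthly, hybrid_cloud_monthly):
    """Return the cloud-only vs hybrid comparison as a dict of dollar amounts."""
    cloud_only_year = cloud_only_monthly * 12
    running_year = (electricity_monthly + hybrid_cloud_monthly) * 12
    first_year = gpu_price + running_year
    monthly_saving = cloud_only_monthly - electricity_monthly - hybrid_cloud_monthly
    return {
        "cloud_only_year": cloud_only_year,
        "hybrid_first_year": first_year,
        "hybrid_later_years": running_year,
        "first_year_saving": cloud_only_year - first_year,
        "payback_months": gpu_price / monthly_saving,
    }


costs = yearly_costs(cloud_only_monthly=187, gpu_price=150,
                     electricity_monthly=12, hybrid_cloud_monthly=25)
print(costs)
```

With these inputs the sketch gives $2,244 for the cloud-only year, $594 for the first hybrid year, $444 for later years, and $1,650 saved in year one.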

What Surprised Me

Latency is better locally. Cloud APIs average 500-2000ms. My local setup answers in 200-800ms depending on model size. For iterative coding (write, test, ask, fix), that speed difference matters.

Privacy is underrated. I started piping customer support tickets through the local model for sentiment analysis and categorization. With cloud APIs, I'd need a data processing agreement. With local? The data never leaves my machine.

Rate limits don't exist locally. Hit a deadline and need to process 1000 queries in an hour? Cloud APIs throttle you. Local hardware just gets warm.

Model management is annoying. Updates, storage (each model is 4-15GB), keeping track of which model does what — it's overhead. Not huge, but real.

Where Local Models Fail (Honestly)

Frontier reasoning. I asked DeepSeek R1 70B (local, quantized) and Claude 3.5 Sonnet (cloud) to debug a race condition in my async Python code. Claude spotted it in 2 sentences. The local model gave me a 3-paragraph explanation that was technically correct but missed the actual bug.

Creative writing. GPT-4o writes prose that flows. Local models write prose that... exists. For marketing copy or user-facing content, I still use the cloud.

Multimodal. Local vision models exist but they're not great. If I need to analyze a screenshot or diagram, cloud wins hands down.

My Actual Recommendation

Your Situation	What to Do
Solo dev, side projects	Local only. Start with Ollama + Qwen 2.5 7B
Small team, some budget	Hybrid. Local for 80%, cloud for complex stuff
Startup with VC funding	Hybrid. Local default, cloud for frontier tasks
Enterprise with compliance needs	Local + air-gapped. Cloud only for non-sensitive
"I just want it to work"	Cloud. But you're paying for convenience

Getting Started (10 Minutes)

If you're curious, here's the fastest path:

# 1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull a model (7B fits in 8GB RAM)
ollama pull qwen2.5:7b

# 3. Start chatting
ollama run qwen2.5:7b

Total time: 10 minutes. Total cost: $0.

If you have an old gaming GPU lying around, you're golden. If not, CPU-only works for smaller models. It's slower but still usable for casual queries.
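Once `ollama run` works, the same model is also reachable from scripts through Ollama's local HTTP API, which listens on port 11434 by default. The following is a minimal standard-library sketch, assuming the Ollama server is running and `qwen2.5:7b` has been pulled; the helper names `build_payload` and `ask_local` are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_payload(prompt: str, model: str = "qwen2.5:7b") -> dict:
    """Build a non-streaming request body for /api/generate."""
    return {"model": model, "prompt": prompt, "stream": False}


def ask_local(prompt: str, model: str = "qwen2.5:7b", timeout: float = 120.0) -> str:
    """Send one prompt to the local Ollama server and return the reply text."""
    body = json.dumps(build_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # With "stream": False the server returns one JSON object whose
        # "response" field holds the generated text.
        return json.loads(response.read())["response"]


# Example (needs a running Ollama server):
# print(ask_local("Summarize what a context window is in one sentence."))
```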

The Honest Bottom Line

Local LLMs aren't a magic bullet. They're a cost optimization with trade-offs.

You lose some quality on complex tasks. You gain speed, privacy, and massive cost savings. For me, routing 80% of queries locally dropped my cloud API bill from $187/month to about $25/month, plus roughly $12/month in electricity.

That's $1,650 saved in the first year, GPU included, that I can spend on... literally anything else.

If you've tried local LLMs, what's your experience? Did the quality drop bother you, or was the cost saving worth it?

Drop your setup in the comments — always curious how others are handling this.

Sam Hartley is a solo dev building tools on a Mac Mini + RTX 3060 home lab. Writes about the messy reality of shipping stuff with AI.

I Built a Morning Briefing Bot in 50 Lines of Python — Here's Why I Check Telegram Before Email Now

Sam Hartley — Wed, 01 Jul 2026 08:05:34 +0000

For years, my morning routine was the same: open laptop, check email, get distracted by Slack, remember I needed to check the weather, forget what I was doing, and 20 minutes later realize I hadn't started actual work yet.

Sound familiar?

Three months ago, I built a dead-simple Telegram bot that aggregates everything I actually care about into one message. It fires at 8 AM every day. I read it in 30 seconds. Then I start working.

No apps to switch between. No rabbit holes. Just one message with the stuff that matters.

What It Actually Sends

Every morning at 8 AM, my phone buzzes with something like this:

📅 Morning Brief — Wednesday, Jul 1

Today: Team standup at 10:00, Dentist at 14:30
Tomorrow: Deploy to prod (set reminder!)

🌤️ Weather: 28°C, sunny — no rain expected

💻 Systems: Mac Mini ✅ | GPU Server ✅ | Ubuntu Box ✅
All green. Uptime: 47 days.

📝 Yesterday's notes: "Refactored auth module, tests passing"

⚠️ One thing: GitHub issue #142 still open — "fix API rate limiting"

That's it. No graphs. No dashboards. No "click here to see more." Just the facts I need to plan my day.

Why Telegram?

I already use Telegram for my devops dashboard (I wrote about that in an earlier post). Adding a morning briefing was a natural extension.

But honestly? The real reason is friction reduction. My phone's notification shade is a graveyard of app alerts I ignore. Telegram is one of the few apps I actually open. Putting my briefing there means I actually read it.

The Architecture (It's Embarrassingly Simple)

Here's the entire stack:

Cron job on my Mac Mini — triggers at 8:00 AM daily
50-line Python script — gathers data from 4 sources
Telegram Bot API — sends the formatted message

That's it. No web framework. No database. No message queue. Just a script that runs, collects, formats, and sends.

The Script

Here's the core of it (simplified, but functional):

#!/usr/bin/env python3
import os
import json
import requests
from datetime import datetime, timedelta

# Config
BOT_TOKEN = os.environ['BRIEFING_BOT_TOKEN']
CHAT_ID = os.environ['MY_TELEGRAM_CHAT_ID']

# Gather data
def get_calendar():
    # I export from Apple Calendar to a local ICS file nightly
    # This reads today's events from it
    # ... (parsing logic here) ...
    return ["Team standup at 10:00", "Dentist at 14:30"]

def get_weather():
    # wttr.in — free, no API key needed
    # Timeout so a slow wttr.in response cannot hang the cron run
    r = requests.get('https://wttr.in/Sakarya?format=%C+%t', timeout=10)
    return r.text.strip()

def get_system_status():
    # Quick ping to my other machines
    machines = {
        'Mac Mini': '192.168.1.105',
        'GPU Server': '192.168.1.106',
        'Ubuntu Box': '192.168.1.100'
    }
    status = {}
    for name, ip in machines.items():
        # On macOS ping, -W is in milliseconds (2000 = 2 s); on Linux it is seconds, so keep -W 2 there
        response = os.system(f'ping -c 1 -W 2000 {ip} > /dev/null 2>&1')
        status[name] = '✅' if response == 0 else '❌'
    return status

def get_github_issues():
    # Check my main repo for open issues labeled "urgent"
    # ... (GitHub API call) ...
    return ["#142: Fix API rate limiting"]

# Build and send message
def send_briefing():
    calendar = get_calendar()
    weather = get_weather()
    systems = get_system_status()
    issues = get_github_issues()

    today = datetime.now().strftime('%A, %b %d')

    lines = [
        f"📅 Morning Brief — {today}",
        "",
        "Today:",
        # Parentheses are required: unpacking a bare conditional expression is a SyntaxError
        *([f"  • {event}" for event in calendar] if calendar else ["  • Nothing scheduled 🎉"]),
        "",
        f"🌤️ Weather: {weather}",
        "",
        "💻 Systems:",
        *[f"  {name} {status}" for name, status in systems.items()],
        "",
    ]

    if issues:
        lines += ["⚠️ Open issues:", *[f"  • {issue}" for issue in issues]]

    message = '\n'.join(lines)

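    # Sent as plain text: the message has no markup, and with parse_mode set
    # a stray '<' in an event or issue title can make Telegram reject it
    response = requests.post(
        f'https://api.telegram.org/bot{BOT_TOKEN}/sendMessage',
        json={'chat_id': CHAT_ID, 'text': message},
        timeout=10,
    )
    response.raise_for_status()  # failures end up in the cron log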

if __name__ == '__main__':
    send_briefing()

The real version has error handling and a few more data sources (like yesterday's git commit summary), but this is the gist. 50 lines. One cron entry. Done.
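The `get_github_issues` placeholder could be filled in along these lines. This is a sketch, not the version running on my machine: it uses GitHub's REST endpoint for listing repository issues, sticks to the standard library so it runs without extra packages, and the `OWNER`, `REPO`, and `GITHUB_TOKEN` names are placeholders.

```python
import json
import os
import urllib.parse
import urllib.request

# Hypothetical repository; replace with your own.
OWNER, REPO = "your-user", "your-repo"


def format_issues(items):
    """Turn GitHub issue objects into '#number: title' strings.

    The issues endpoint also returns pull requests; those carry a
    'pull_request' key and are skipped here.
    """
    return [f"#{item['number']}: {item['title']}"
            for item in items if "pull_request" not in item]


def get_github_issues(label="urgent"):
    """Fetch open issues with the given label from the GitHub REST API."""
    query = urllib.parse.urlencode({"state": "open", "labels": label})
    url = f"https://api.github.com/repos/{OWNER}/{REPO}/issues?{query}"
    headers = {"Accept": "application/vnd.github+json"}
    token = os.environ.get("GITHUB_TOKEN")  # assumed variable name; needed for private repos
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        return format_issues(json.load(response))
```

Keeping the formatting in its own function means the string layout can be checked without touching the network.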

The Cron Job

# crontab -e
0 8 * * * /usr/bin/python3 /Users/sam/scripts/morning_briefing.py >> /tmp/briefing.log 2>&1

That's literally the entire scheduling infrastructure.

What Changed

Before: I'd check 4-5 apps every morning. Sometimes I'd miss something. Sometimes I'd get distracted. Average time to "actually working": 20-30 minutes.

After: One notification. 30 seconds to read. I know what's happening today, whether my systems are healthy, and if there's anything urgent. Then I close Telegram and start work.

Average time to "actually working": 2 minutes.

The weird part? I feel less anxious. I used to have this low-grade worry that I was forgetting something. Now I know the bot checks for me. If there was a problem, I'd know.

The Downsides (Because Nothing's Perfect)

ICS parsing is brittle. Apple Calendar exports aren't always clean. I had to add a nightly script that sanitizes the ICS file before the morning run.

Weather API can be flaky. wttr.in is great until it isn't. I added a fallback to a local weather file that updates every hour.

No interactivity. It's a one-way push. If I want details ("what's in that GitHub issue?"), I have to go look. I tried adding reply buttons, but honestly, it added complexity I didn't need. The goal was speed, not interactivity.

Edge cases suck. Daylight saving time shift? The cron fired at 7 AM for a week before I noticed. Public holidays? The bot doesn't know. I had to add a manual "skip" flag for vacation days.
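The weather fallback mentioned above can be sketched as follows. The cache path, the function names, and the choice to refresh the cache on every successful fetch are assumptions for this example; my setup updates the local file from a separate hourly job.

```python
import urllib.request
from pathlib import Path

WTTR_URL = "https://wttr.in/Sakarya?format=%C+%t"
CACHE_FILE = Path("/tmp/weather_cache.txt")  # hypothetical path


def fetch_wttr(timeout: float = 10.0) -> str:
    """Fetch the one-line weather summary from wttr.in."""
    with urllib.request.urlopen(WTTR_URL, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def get_weather(fetch=fetch_wttr, cache_file: Path = CACHE_FILE) -> str:
    """Return live weather, falling back to the last cached value."""
    try:
        weather = fetch()
    except OSError:
        # urllib's URLError and socket timeouts are both OSError subclasses.
        if cache_file.exists():
            return cache_file.read_text(encoding="utf-8").strip() + " (cached)"
        return "weather unavailable"
    # Remember the latest good value for the next failure.
    cache_file.write_text(weather, encoding="utf-8")
    return weather
```

Passing the fetch function as a parameter makes the fallback path easy to exercise with a stub that raises.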

Should You Build This?

If you check more than 3 apps every morning: probably yes.

You don't need my exact setup. Use whatever you have:

Discord webhook instead of Telegram?
A simple Bash script instead of Python?
iOS Shortcuts instead of cron?

The pattern matters more than the stack: one automated message, curated by you, delivered where you already look.

Start small. My first version just sent the weather and today's calendar. Everything else came later. The 50-line script grew to 150 lines over three months — but it started as a weekend experiment.

What's Next

I'm experimenting with two additions:

Weekly summary on Sundays — "This week you shipped 12 commits, closed 3 issues, and your GPU earned $47."
Context-aware alerts — If I have a meeting in 15 minutes and haven't checked the briefing, send a nudge. But only for external meetings (not "standup" — I won't forget that).

The goal isn't to build a product. It's to build a personal utility that removes friction from my day.

I write about running AI locally, building weird automation, and occasionally making money from side projects. If this was useful, drop a comment with your morning routine — I'm always looking for ideas to steal.

I Rented Out My GPU for Passive Income — Here's What Happened After My First Week

Sam Hartley — Sat, 27 Jun 2026 08:05:35 +0000

I had an RTX 3060 sitting on a shelf.

Not broken. Not old. Just... not doing anything. My Windows PC runs models when I need them, but most of the time it's idle. The fans spin, the power draw ticks along, and that 12GB of VRAM just sits there.

A week ago I connected it to Vast.ai — a GPU marketplace where people rent compute time. No code required. You install a daemon, set a price, and wait for someone to rent your machine.

Here's what actually happened.

Why I Didn't Just Mine Crypto

First thing people ask: "Why not just mine?"

Short answer: it's 2026, the margins are brutal, and I didn't want to deal with it. GPU compute rental is different — you're renting raw processing power, and the demand right now is AI inference and training. People building LLMs, running diffusion models, doing batch jobs.

The upside: no mining pool setup, no daily coin price anxiety, no special software. Your machine runs Docker containers, gets paid per second of use, you get a payout.

The Setup (Genuinely About 90 Minutes)

Created a Vast.ai account
Installed the host daemon on Windows (it's a one-click installer)
Set my RTX 3060 12GB at $0.15/hour
Went to bed

That's it. No configuration rabbit holes, no drivers to hunt down. The daemon manages everything — spinning up containers, cleaning up after renters, reporting uptime.

I set the minimum rental duration to 1 hour so I wouldn't get hit with a dozen 5-minute jobs.

The First Week Numbers

Day	Hours Rented	Earnings
Day 1	3.2h	$0.48
Day 2	11.5h	$1.73
Day 3	0h	$0.00
Day 4	16.8h	$2.52
Day 5	9.1h	$1.37
Day 6	22.0h	$3.30
Day 7	14.4h	$2.16

Week 1 total: ~$11.56

Annualized naively? About $600/year. Which would be great except day 3 was $0 and utilization is inconsistent.

A more realistic steady-state: $50–130/month depending on demand.
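The week-one figures can be rechecked with a few lines. This sketch uses the hours from the table and the $0.15/hour list price; the function and key names are invented for the example.

```python
RATE_PER_HOUR = 0.15  # listed price in USD
HOURS_RENTED = [3.2, 11.5, 0.0, 16.8, 9.1, 22.0, 14.4]  # days 1-7 from the table


def week_summary(hours, rate):
    """Return total hours, utilization, weekly and naive yearly earnings."""
    total_hours = sum(hours)
    weekly = total_hours * rate
    return {
        "total_hours": round(total_hours, 1),
        "utilization": round(total_hours / (24 * len(hours)), 3),
        "weekly_usd": round(weekly, 2),
        "yearly_usd": round(weekly * 52, 2),
    }


summary = week_summary(HOURS_RENTED, RATE_PER_HOUR)
print(summary)
```

That works out to 77 rented hours, or about 46% utilization. The weekly total lands a cent under the table's figure because the table rounds each day separately.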

What People Actually Rent It For

Vast.ai shows you the jobs (anonymized). Mine has been used for:

Running vLLM inference servers (Mistral, Qwen, LLaMA variants)
Stable Diffusion batch jobs
Some kind of PyTorch training run that lasted 8 hours

The 3060 with 12GB VRAM is actually a sweet spot for inference: it fits most 7B–13B models at 4-bit quantization without breaking a sweat. It's not the fastest card, but it's affordable to rent, which means demand is there.

The Honest Downsides

You can't use your GPU while it's rented. Sounds obvious, but the practical implication: if you need your machine for local inference and someone's rented it, tough luck. I started routing heavy tasks to my Mac Mini during rental periods.

Electricity. My RTX 3060 at load pulls about 150W. At Turkish electricity rates, that's roughly $8–15/month in power at typical utilization. So the net is lower than the gross numbers above.

It's genuinely passive but not predictable. Day 3 was $0. Day 6 was near-full utilization. There's no way to forecast demand.

Payouts have a minimum. Vast.ai pays out once you hit a threshold. Nothing to worry about, just something to know going in.
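To see how electricity eats into the gross, here is a rough sketch. The 150W draw and the $0.15/hour rate are from this post; the price per kWh is an illustrative assumption, not a quoted Turkish tariff, and idle power draw is ignored.

```python
def monthly_power_cost(watts, hours_under_load, usd_per_kwh):
    """Electricity cost in USD for a card drawing `watts` for the given hours."""
    kwh = watts / 1000 * hours_under_load
    return kwh * usd_per_kwh


# Rented 50% of a 30-day month.
hours = 0.5 * 24 * 30
# Assumption: $0.20/kWh, picked only to illustrate the calculation.
cost = monthly_power_cost(watts=150, hours_under_load=hours, usd_per_kwh=0.20)
gross = 0.15 * hours  # rental income at $0.15/hour
net = gross - cost
print(f"power: ${cost:.2f}/month, gross: ${gross:.2f}, net: ${net:.2f}")
```

Under that assumed tariff the power bill comes to about $11 a month, inside the $8–15 range above, leaving roughly $43 of the $54 gross.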

What I'm Going to Try Next

The obvious play is adding more GPUs. I have a few more in storage: an RTX 3080 and a pair of RTX 3070s. If I rack those up, the math gets interesting:

GPU	Rate	Monthly (50% util)
RTX 3060 (current)	$0.15/h	~$54
RTX 3080 10GB	$0.20/h	~$72
2x RTX 3070 8GB	$0.16/h	~$115

That's ~$240/month without doing anything after setup. At 70% utilization: ~$340.
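The projection is simple enough to redo for any mix of cards. A small sketch, using the rates and card counts from the table and assuming a 30-day month; the names are made up for the example.

```python
HOURS_PER_MONTH = 24 * 30

# (name, card count, hourly rate in USD) as listed in the table above.
RIG = [
    ("RTX 3060 12GB", 1, 0.15),
    ("RTX 3080 10GB", 1, 0.20),
    ("RTX 3070 8GB", 2, 0.16),
]


def monthly_gross(rig, utilization):
    """Gross monthly rental income per card type at a given utilization (0-1)."""
    return {name: count * rate * HOURS_PER_MONTH * utilization
            for name, count, rate in rig}


for utilization in (0.5, 0.7):
    per_card = monthly_gross(RIG, utilization)
    print(f"{utilization:.0%}: ${sum(per_card.values()):.0f}/month", per_card)
```

These are gross figures: electricity and any platform fees come off the top, and utilization is the number nobody can promise.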

The real work is physical — pulling GPUs from storage, getting them into a rig, managing thermals. But the software side is almost zero maintenance.

Should You Try This?

If you have a spare GPU collecting dust: yes, probably. The setup is low friction, the risk is near-zero (worst case, you uninstall the daemon and move on), and even modest earnings beat $0.

If you're thinking about buying a GPU specifically for this: do the math carefully. At current rates, an RTX 3060 costs ~$300–350 used. Payback period at $50/month is 6–7 months, which is fine — but don't expect to fund your retirement from a single card.

The real value for me isn't the income (yet). It's that I now have a system running, I understand the demand patterns, and I know the path to scale looks viable.

The Setup Summary

Platform: Vast.ai (there's also RunPod if you want alternatives)
Time to set up: ~90 minutes
Technical skill required: Know how to install software on Windows
Ongoing maintenance: Almost none
Realistic earnings: $50–130/month per mid-tier GPU

Happy to answer questions if you try this and run into something weird. The daemon is pretty solid but there's always an edge case or two.

I write about running AI locally, automation side projects, and occasionally making money from hardware that would otherwise just collect dust. If any of this is useful, feel free to follow.