DEV Community: Anup Sharma

What Word Break Leetcode Problem Taught Me About Debugging Order

Anup Sharma — Mon, 06 Jul 2026 12:35:36 +0000

I recently worked through the classic Word Break problem in an interview. My approach was solid from the start — recursion with memoization, a breakable helper that tests every prefix and recurses on the rest. The logic was right. What slowed me down was everything around the logic.

Here's the solution I landed on:

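#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>
using namespace std; // LeetCode's harness supplies these; needed to compile standalone

class Solution {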
public:
    bool wordBreak(string s, vector<string>& wordDict) {
        unordered_set<string> dict(wordDict.begin(), wordDict.end());
        unordered_map<size_t, bool> memo;
        return breakable(s, dict, memo, 0); // missed: breakable was a free function defined below -> "not declared in this scope"
    }

private:
    // missed: had int here, compared against s.length() (size_t) -> sign-compare warnings
    bool breakable(const string& s, const unordered_set<string>& dict,
                   unordered_map<size_t, bool>& memo, size_t starting) {
        if (starting == s.length()) return true;
        if (memo.count(starting)) return memo[starting];

        for (size_t e = starting + 1; e <= s.length(); e++) {
            string word = s.substr(starting, e - starting); // missed: shadowed an outer `word`, and had substr(starting, e) instead of e - starting
            if (dict.count(word) && breakable(s, dict, memo, e)) {
                memo[starting] = true; // missed: wrote == instead of =, so success was never cached
                return true;
            }
        }

        memo[starting] = false; // missed: this line, so failures were never cached and memoization broke down
        return false;
    }
};

The real lesson

Most of what tripped me up was syntax and scope — a function declared in the wrong place, signed/unsigned mismatches, a shadowed variable, == where I meant =. None of these were about the algorithm. But because I spent my time chasing them, I had less room to focus on the one thing that actually matters in this problem: the logical correctness of the memoization.
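To make the memoization point concrete, here is a minimal Python sketch of the same recursion. It is not the interview code: the `word_break` name, the `cache_failures` switch, and the call counter are illustrative additions, there only to show what happens when the "cache the failure" line is missing.

```python
from typing import Dict, List, Set, Tuple


def word_break(s: str, words: List[str], cache_failures: bool = True) -> Tuple[bool, int]:
    """Return (can_segment, number_of_helper_calls) for the memoized recursion."""
    dictionary: Set[str] = set(words)
    memo: Dict[int, bool] = {}
    calls = 0

    def breakable(start: int) -> bool:
        nonlocal calls
        calls += 1
        if start == len(s):
            return True
        if start in memo:
            return memo[start]
        # Try every prefix that begins at `start`, then recurse on the rest.
        for end in range(start + 1, len(s) + 1):
            if s[start:end] in dictionary and breakable(end):
                memo[start] = True
                return True
        if cache_failures:
            memo[start] = False  # the line that was missing in the first draft
        return False

    return breakable(0), calls


# Adversarial input: many ways to split the run of a's, none survive the final "b".
text = "a" * 18 + "b"
vocab = ["a", "aa", "aaa"]
ok_cached, calls_cached = word_break(text, vocab, cache_failures=True)
ok_uncached, calls_uncached = word_break(text, vocab, cache_failures=False)
print(ok_cached, calls_cached, calls_uncached)
```

Both runs return the same answer, but the version that skips failure caching re-explores every dead-end suffix, and on this input its call count grows exponentially with the length of the run of a's. Successes return up the stack immediately, so it is the cached failures that do most of the work.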

The takeaway I'm keeping: get the syntax and scoping clean early, so the debugging budget goes toward logic, not typos. That's the difference between a working solution and a working solution you can reason about under pressure.

Adding Observability to My AI Homelab

Anup Sharma — Fri, 03 Jul 2026 16:46:42 +0000

Part 4 of the Homelab AI Series — Part 1 | Part 2 | Part 3

Let me set the scene.

My personal AI agent, Pi, is running its nightly cron jobs. Calendar summaries. Email digests. Task prioritization. It's been doing this silently for three weeks since I integrated the vLLM Semantic Router in Part 3.

And I have absolutely no idea if it's working.

Not because it's broken. Because I have no visibility into it at all. The Mac Mini sits in my living room, green light blinking quietly, processing requests — and I have zero idea whether the routing is actually working, whether my API bills are exploding, or whether the local Ollama model is grinding through prompts that should have gone to Gemini.

I was flying completely blind.

The Plan That Never Happened

After Part 3, my original observability roadmap was ambitious. I was going to deploy the full "Big Tech" monitoring stack:

Prometheus to scrape AgentGateway's /metrics endpoint
Jaeger for distributed tracing via OpenTelemetry
Grafana with custom dashboards for token costs and latency
Loki for log aggregation, because why not go full enterprise

I'd even started writing the docker-compose.yaml. Four services, two config volumes, a shared network — and I hadn't even gotten to the Grafana provisioning scripts yet.

Then, during the weekly AgentGateway community meeting, Lin and John announced a new UI in v1.3.0.

I quickly ran git pull on the AgentGateway repo.

$ git pull origin main
...
 crates/agentgateway/src/ui.rs    | 423 ++++++++++++++++++++++++
 ui/src/pages/Analytics.tsx       | 311 ++++++++++++++++
 ui/src/pages/Logs.tsx            | 287 +++++++++++++++

The team had just shipped a brand new built-in UI — complete with an Analytics dashboard, a live Logs Explorer, and a Cost Breakdown view. Everything I was about to spend my weekend building was already there. Native. In the binary. On port 15000.

I closed the docker-compose.yaml. I was never going to open it again.

Three Lines of YAML. That's It.

The built-in UI was already serving at http://localhost:15000/ui. But when I navigated there, the Logs and Analytics pages showed nothing. Just empty charts and a message:

Logs API error — request log database is not configured

Right. The UI needed somewhere to write request logs. This is where I expected to set up a Postgres instance or at minimum a Docker container for SQLite.

Instead, I added this to my homelab_config.yaml:

config:
  modelCatalog:
  - file: base-costs.json
  database:
    url: sqlite://agentgateway.db

That's it.

One important gotcha I hit: the database: key must be nested inside the config: section. I originally tried adding it at the top level of the YAML and got an "unknown field" validation error. The config parser is strict. Nest it correctly and it just works.
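A rough Python sketch of why the top-level placement fails. This is not AgentGateway's parser: the allowed key set below is only the set of top-level keys that appear in this post's homelab_config.yaml (an assumption, not the full schema), and `unknown_top_level_fields` is a hypothetical helper.

```python
from typing import Any, Dict, List

# Top-level keys seen in this post's config. Assumed for illustration only.
ALLOWED_TOP_LEVEL = {"policies", "binds", "llm", "frontendPolicies", "config"}


def unknown_top_level_fields(doc: Dict[str, Any]) -> List[str]:
    """Return the top-level keys a strict parser would reject."""
    return sorted(key for key in doc if key not in ALLOWED_TOP_LEVEL)


# What I tried first: database at the top level.
wrong = {"binds": [], "database": {"url": "sqlite://agentgateway.db"}}

# What works: database nested under config.
right = {"binds": [], "config": {"database": {"url": "sqlite://agentgateway.db"}}}

print(unknown_top_level_fields(wrong))
print(unknown_top_level_fields(right))
```

A strict parser treats any key outside its known set as an error rather than ignoring it, which is why the misplaced block fails loudly instead of silently doing nothing.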

Restarted AgentGateway. Sent a few test requests. Refreshed the dashboard.

The charts lit up.

What's Actually Inside the Dashboard

The Analytics View

The Analytics page groups every request by provider and model. In my setup, I have three possible destinations for every request Pi sends:

qwen2.5-coder:7b via Ollama — local, free, slower
gpt-4o via OpenAI — expensive, fast, best reasoning
gemini-2.5-flash via Google — cheap cloud, fast, great context window

AgentGateway knows which model handled each request because the vLLM Semantic Router adds an x-selected-model header before forwarding. So the UI doesn't just show me "a request happened" — it shows me which model got it, how many tokens it consumed, and the estimated dollar cost using the built-in model pricing catalog.

In the 24-hour snapshot above: 60 calls, 13,929 tokens, $0.0340 total. That's the entire cost of running Pi's overnight jobs. Fractions of a cent per interaction.
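The per-interaction figure is simple arithmetic on those three dashboard numbers. A quick sketch (the variable names are mine; the inputs are the snapshot values quoted above):

```python
# Values read off the 24-hour Analytics snapshot.
calls = 60
tokens = 13_929
total_cost_usd = 0.0340

cost_per_call_usd = total_cost_usd / calls
cents_per_call = cost_per_call_usd * 100
cost_per_1k_tokens_usd = total_cost_usd / tokens * 1000
avg_tokens_per_call = tokens / calls

print(f"{cents_per_call:.3f} cents per call")
print(f"${cost_per_1k_tokens_usd:.4f} per 1k tokens (blended across models)")
print(f"{avg_tokens_per_call:.0f} tokens per call on average")
```

The per-1k-token figure is a blend across all models in the snapshot, including the free local ones, so it will sit below the cloud models' actual rates.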

And I can see the routing is working — the traffic spike on the right corresponds to Pi's 3 AM cron batch. The model breakdown lets me verify that coding tasks are actually hitting the local Ollama and not burning cloud API credits.

The Logs Explorer

This is the view that genuinely surprised me.

Every single LLM call shows up as a row with:

HTTP Status — 200, 400, 404 — the bad ones are impossible to miss
Duration — total time from request received to response delivered
Model — the actual model called, not my MoM alias
Provider — gcp.gemini, openai, openai (for Ollama, since it speaks the OpenAI API)
Token counts — input and output separately
Estimated cost — per-request dollar amount against the model price catalog

Look at the screenshot above. You can see real requests: gemini-2.5-flash calls at a few tenths of a cent each, qwen2.5-coder:7b calls with zero cost, and a handful of 404s for non-existent-model at the top — those are the simulated error requests from my traffic test, showing up exactly as expected.

I can click into any row and see the full request detail — the exact prompt Pi sent and the exact response it got back. When Pi's 3 AM calendar job sends something weird, I can see the raw JSON. That was never possible before.

The Full Config

For anyone setting this up, here's the complete homelab_config.yaml that runs my entire homelab AI stack:

# yaml-language-server: $schema=https://agentgateway.dev/schema/config

# Gateway-level policy: Semantic Router as ExtProc sidecar
policies:
- name:
    name: semantic-router
    namespace: default
  target:
    gateway:
      gatewayName: default
      gatewayNamespace: default
  phase: gateway
  policy:
    extProc:
      host: 127.0.0.1:50051
      processingOptions:
        requestBodyMode: buffered
        responseBodyMode: none
        requestHeaderMode: send
        responseHeaderMode: skip
        requestTrailerMode: skip
        responseTrailerMode: skip
      failureMode: failOpen   # If SR crashes, requests fall through to Gemini

# Routes based on the header the Semantic Router sets
binds:
- port: 3000
  listeners:
  - routes:

    # x-selected-model: qwen-coder → Local Ollama (free)
    - matches:
      - headers:
        - name: x-selected-model
          value:
            exact: qwen-coder
      policies:
        ai:
          modelAliases:
            MoM: qwen2.5-coder:7b
            inteli-llm: qwen2.5-coder:7b
      backends:
      - ai:
          provider:
            openAI: {}
          name: ollama
          hostOverride: localhost:11434

    # x-selected-model: gpt-4o → OpenAI
    - matches:
      - headers:
        - name: x-selected-model
          value:
            exact: gpt-4o
      policies:
        ai:
          modelAliases:
            MoM: gpt-4o
            inteli-llm: gpt-4o
      backends:
      - ai:
          provider:
            openAI: {}
          name: openai
        policies:
          backendAuth:
            key: $OPENAI_API_KEY

    # x-selected-model: gemini-flash → Google
    - matches:
      - headers:
        - name: x-selected-model
          value:
            exact: gemini-flash
      policies:
        ai:
          modelAliases:
            MoM: gemini-2.5-flash
            inteli-llm: gemini-2.5-flash
      backends:
      - ai:
          provider:
            gemini: {}
          name: gemini
        policies:
          backendAuth:
            key: $GEMINI_API_KEY

    # Fallback (SR down or no header matched)
    - backends:
      - ai:
          provider:
            gemini: {}
          name: gemini-default
        policies:
          ai:
            modelAliases:
              MoM: gemini-2.5-flash
              inteli-llm: gemini-2.5-flash
          backendAuth:
            key: $GEMINI_API_KEY

# Direct LLM proxy on port 4000
llm:
  port: 4000
  models:
  - name: openai
    provider: openai
  providers: []
  virtualModels: []

# Frontend policy
frontendPolicies:
  http:
    maxBufferSize: 33554432

# The three lines that unlocked full observability
config:
  modelCatalog:
  - file: base-costs.json
  database:
    url: sqlite://agentgateway.db

The separation of concerns is worth calling out again: the Semantic Router never touches API keys. It classifies the prompt, sets a header, and gets out of the way. AgentGateway owns the downstream auth entirely. This is the same design pattern you'd use in a production Kubernetes cluster — routing intelligence decoupled from security posture.

Why Not Grafana?

I want to address this directly because I know some people will ask.

If you're running an enterprise Kubernetes cluster with a dedicated platform team, absolutely export AgentGateway's OpenTelemetry data to your centralized Datadog or Prometheus stack. AgentGateway supports this out of the box — it emits OTLP traces and a /metrics endpoint. The production observability story is excellent.

But if you're running a homelab?

The operational burden of Prometheus + Grafana for a single-node AI gateway is enormous relative to what you get. You need to keep two additional services running and healthy, write and maintain Grafana dashboard JSON, configure Prometheus alerting rules, and keep all of it in sync when your schema changes.

AgentGateway's built-in dashboard gives you every metric I care about — token usage, cost per model, latency distribution, error rates — with zero operational overhead. The SQLite file lives right next to the binary. There's nothing to maintain, nothing to restart, nothing to provision.

Do not build an observability stack if you don't have to.

The Numbers After One Week of Real Visibility

Having actual data changes how you think about your setup:

Metric	Blind (before)	With Dashboard
Routing correctness	"Probably fine?"	Verified per-model in Analytics
Monthly API cost estimate	"Maybe $20-30?"	~$12 projected
Error rate	Unknown	2.3% (mostly 3 AM config edge cases)
Avg. Gemini latency	Unknown	~340ms
Avg. Ollama latency	Unknown	~18 seconds (7B model on CPU)
Hidden issues found	0	3 in first week

That last row is the one that matters. Three real problems I'd had zero visibility into — a calendar cron sending malformed date ranges to Gemini, a tokenization edge case in Pi's summarization prompt, and one silent API key rotation failure. The dashboard didn't just give me numbers. It gave me answers.

The Homelab Stack, Complete

Four posts. One Mac Mini in a living room. Here's the full picture:

Pi (Personal Agent)
       │
       ▼ POST /v1/chat/completions  model: "MoM"
       │
┌──────────────────────────────────────────────────────┐
│                AgentGateway (:3000)                   │
│                                                        │
│  ExtProc → vLLM Semantic Router (:50051)              │
│  mmBERT classifies prompt in ~1ms                     │
│  Sets x-selected-model header                         │
│                                                        │
│  Route match on header → forward to backend           │
│                                                        │
│  Built-in UI (:15000/ui)                              │
│  SQLite → Analytics + Logs Explorer                   │
└───────┬─────────────┬─────────────┬──────────────────┘
        ▼             ▼             ▼
   Ollama:11434   OpenAI API   Gemini API
   qwen2.5-coder  gpt-4o       gemini-2.5-flash
   (free, local)  (~$0.03/1k)  (~$0.0015/1k)

The Agent — Pi, running cron jobs and personal tasks 24/7 from a Mac Mini in my living room.
The Intelligence Layer — vLLM Semantic Router, using mmBERT embeddings to classify every prompt and set routing headers in ~1ms.
The Data Plane — AgentGateway in Rust, owning all API keys, handling auth, matching routes.
The Control Plane — AgentGateway's built-in UI, backed by SQLite, showing real-time token usage, costs, latency, and errors.

The whole stack runs as a single binary (plus the SR container). Zero cloud spend on infrastructure. The Mac Mini was already sitting in my living room.

What's Next

This feels like a natural pause point. The stack is stable, observable, and honestly more capable than I expected when I started this series.

A few things I'm actively exploring:

Dockerizing the stack — a single docker-compose.yaml to boot Ollama, the SR container, and AgentGateway together so the Mac Mini fully self-heals after a reboot without me touching anything.
More model cards — now that routing is semantic, adding a new specialized model is just writing a new description in the SR's config.yaml. The router figures out the rest.
OTLP export — AgentGateway already emits OpenTelemetry spans. I want to wire it to a lightweight alertmanager that notifies me when Pi's error rate spikes past a threshold during its 3 AM runs.

If you're building agents — homelab or production — the combination of AgentGateway + vLLM Semantic Router + the built-in SQLite observability is, right now, the most complete single-node AI infrastructure stack I know of. No YAML sprawl. No external dependencies for the happy path. Just a config file, a binary, and a Mac Mini with a green light.

And it runs silently, 24/7, from my living room. 🏠

Have questions about the setup? Drop them in the comments — I check daily. And if you've built something similar, I'd love to see how you've adapted it.

#ai #agents #observability #homelab #agentgateway #vllm #sqlite #llm #opensource

Giving AgentGateway a Semantic Brain with vLLM Semantic Router - Inside My Homelab

Anup Sharma — Sat, 20 Jun 2026 03:00:37 +0000

Part 3 of the Homelab AI Series — Part 1 | Part 2

The Problem Was Embarrassing

In Part 1, I showed how I built a personal AI agent (Pi) that runs 24/7 from my living room, using AgentGateway to route requests across three models: a local Ollama (qwen2.5-coder:7b) for coding, OpenAI (gpt-4o) for deep reasoning, and Gemini (gemini-2.5-flash) for fast general tasks.

The routing brain? A 100-line Python script sitting between Pi and AgentGateway:

# router.py — The "AI brain" I was embarrassed to deploy
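coding_keywords = ["code", "python", "javascript", "bash", "script",
                   "function", "bug", "error", "html", "css"]
reasoning_keywords = ["think", "analyze", "explain in detail",
                      "reasoning", "logic", "deduce"]

def classify_intent(prompt: str) -> str:
    """Return "coding", "reasoning", or "simple" based on keyword matches."""
    prompt_lower = prompt.lower()  # keywords are lowercase, so match case-insensitively
    if any(k in prompt_lower for k in coding_keywords):
        return "coding"
    # Long prompts are assumed to need deeper reasoning
    if len(prompt) > 400 or any(k in prompt_lower for k in reasoning_keywords):
        return "reasoning"
    return "simple"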

Yes. My "intelligent" AI routing was a glorified if-elif-else chain.

It worked — until it didn't. "Explain the async/await pattern in Rust" got classified as simple because none of the keywords matched. "Help me think about dinner options" got classified as reasoning because think was in the keyword list. And anything in Hindi or mixed-language prompts? Straight to the fallback, every single time.

After running this setup daily for two weeks, I collected some rough numbers:

Metric	With Python Router
Misrouted requests (spot-checked)	~18%
Monthly estimated API cost	~$24
Routing latency (Python proxy hop)	~45ms
Keyword list maintenance	Manual, weekly tweaks

Eighteen percent of requests going to the wrong model doesn't just waste money — it gives bad answers. When my cron-job agent sends a complex "summarize this week's calendar and suggest optimizations" to the 7B local model instead of Gemini or GPT-4o, the output is noticeably worse.

I needed something that understood the prompt, not just scanned it for keywords.

Enter vLLM Semantic Router

While talking with the AgentGateway maintainers, I discovered a first-class integration with vLLM Semantic Router, thanks to Keith Mattix and John Howard. The architecture clicked immediately:

Instead of my Python script sitting in front of AgentGateway as a janky reverse proxy, the Semantic Router runs as an Envoy ExtProc sidecar. AgentGateway pauses the request, sends the HTTP body to the SR's gRPC endpoint, gets back a header mutation (x-selected-model: qwen-coder), and resumes routing. Zero proxy hops. Zero Python processes. Just gRPC-native intelligence inside the gateway's own request lifecycle.

The SR uses an embedded mmBERT model (a 2D Matryoshka embedding model, ~130MB) to semantically classify every prompt and compare it against model descriptions you write in YAML. No keyword lists. No regex. Actual embeddings.

The Architecture

┌─────────────────────────────────────────────────────┐
│                  Client (Pi Agent)                   │
│             POST /v1/chat/completions                │
│                  model: "MoM"                        │
└────────────────────────┬────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────┐
│              AgentGateway (:3000)                     │
│                                                       │
│  1. Receive request                                   │
│  2. Pause → send body to ExtProc (gRPC :50051)       │
│  3. SR analyzes prompt with mmBERT embeddings         │
│  4. SR returns header: x-selected-model: qwen-coder  │
│  5. Resume → match route by header → forward          │
└──────┬──────────────┬──────────────┬────────────────┘
       │              │              │
       ▼              ▼              ▼
   ┌────────┐   ┌──────────┐   ┌──────────┐
   │ Ollama  │   │ OpenAI   │   │ Gemini   │
   │ :11434  │   │ Cloud    │   │ Cloud    │
   └────────┘   └──────────┘   └──────────┘

Setting It Up (Two YAML Files)

The entire setup is defined in two config files. No code. No Python.

1. Semantic Router Config (config.yaml)

This tells the SR about your models and how to route between them:

version: v0.3

providers:
  defaults:
    default_model: qwen-coder
  models:
    - name: qwen-coder
      provider_model_id: qwen2.5-coder:7b
      api_format: openai
      backend_refs:
        - name: local-ollama
          endpoint: host.docker.internal:11434
          protocol: http

    - name: gpt-4o
      provider_model_id: gpt-4o
      api_format: openai
      backend_refs:
        - name: openai-cloud
          base_url: https://api.openai.com/v1

    - name: gemini-flash
      provider_model_id: gemini-2.5-flash
      api_format: openai
      backend_refs:
        - name: gemini-cloud
          base_url: https://generativelanguage.googleapis.com/v1beta/openai

routing:
  modelCards:
    - name: qwen-coder
      param_size: 7B
      context_window_size: 32768
      description: >
        Specialized coding model optimized for programming tasks.
        Excellent at writing code, debugging, algorithms, data structures,
        code review, refactoring, and technical implementation in Python,
        Rust, JavaScript, Go. Best for code generation, fixing bugs,
        writing tests, and technical programming Q&A.

    - name: gpt-4o
      param_size: 200B+
      context_window_size: 128000
      description: >
        Frontier reasoning model with exceptional analytical capability.
        Best for complex multi-step reasoning, strategic analysis,
        comparing trade-offs, writing long-form essays, nuanced
        explanations, math proofs, scientific reasoning.

    - name: gemini-flash
      param_size: ~100B
      context_window_size: 1000000
      description: >
        Fast general-purpose model. Ideal for simple factual questions,
        quick lookups, summarization, casual conversation, translations,
        everyday tasks, and when speed matters more than depth.

  decisions:
    - name: MoM
      description: "Mixture of Models router"
      priority: 100
      rules: {}
      modelRefs:
        - model: qwen-coder
        - model: gpt-4o
        - model: gemini-flash
      algorithm:
        type: multi_factor
        multi_factor:
          weights:
            quality: 0.1
            latency: 0.4
            cost: 0.5
          slo:
            max_cost_per_1m: 0.5

The key insight: you describe what each model is good at in natural language, and the SR uses those descriptions as semantic anchors. No keyword lists to maintain. When a new prompt arrives, the SR embeds it and compares it against these descriptions using cosine similarity. The model whose description is closest to the prompt wins.
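A toy Python sketch of that "closest description wins" step. To be clear about the assumptions: the `embed` function here is a bag-of-words counter standing in for mmBERT, the model card text is abbreviated from the config above, and the SR's multi_factor weighting is not modeled at all. Only the cosine-similarity comparison is the point.

```python
import math
from collections import Counter
from typing import Dict


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Abbreviated descriptions, loosely based on the model cards in config.yaml.
MODEL_CARDS: Dict[str, str] = {
    "qwen-coder": "writing code debugging algorithms fixing bugs writing tests programming",
    "gpt-4o": "complex multi-step reasoning strategic analysis trade-offs math proofs",
    "gemini-flash": "simple factual questions quick lookups summarization casual conversation translations",
}


def select_model(prompt: str, cards: Dict[str, str] = MODEL_CARDS) -> str:
    """Pick the model whose description is most similar to the prompt."""
    prompt_vec = embed(prompt)
    scores = {name: cosine(prompt_vec, embed(desc)) for name, desc in cards.items()}
    return max(scores, key=scores.get)


coding_choice = select_model("help me with debugging code and fixing bugs")
simple_choice = select_model("quick summarization of simple factual questions")
print(coding_choice, simple_choice)
```

A word-count vector only matches on shared words, so it has the same blind spots as a keyword list. A real embedding model places related phrases close together even when they share no words, which is the whole reason to use one.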

2. AgentGateway Config (homelab_config.yaml)

This tells AgentGateway to use the SR as an ExtProc sidecar, and to route based on the header it sets:

# Gateway-level policy: ExtProc to Semantic Router
policies:
- name:
    name: semantic-router
    namespace: default
  target:
    gateway:
      gatewayName: default
  phase: gateway
  policy:
    extProc:
      host: "127.0.0.1:50051"
      processingOptions:
        requestBodyMode: buffered
        responseBodyMode: none
        requestHeaderMode: send
      failureMode: failOpen   # If SR is down, fall through

binds:
- port: 3000
  listeners:
  - routes:
    # When SR sets x-selected-model: qwen-coder → Local Ollama
    - matches:
      - headers:
        - name: "x-selected-model"
          value:
            exact: "qwen-coder"
      backends:
      - ai:
          provider:
            openAI: {}
          name: ollama
          hostOverride: "localhost:11434"

    # When SR sets x-selected-model: gpt-4o → OpenAI
    - matches:
      - headers:
        - name: "x-selected-model"
          value:
            exact: "gpt-4o"
      backends:
      - ai:
          provider:
            openAI: {}
          name: openai
        policies:
          backendAuth:
            key: $OPENAI_API_KEY

    # When SR sets x-selected-model: gemini-flash → Google
    - matches:
      - headers:
        - name: "x-selected-model"
          value:
            exact: "gemini-flash"
      backends:
      - ai:
          provider:
            gemini: {}
          name: gemini
        policies:
          backendAuth:
            key: $GEMINI_API_KEY

    # Fallback if SR is down (failOpen)
    - backends:
      - ai:
          provider:
            gemini: {}
          name: gemini-default
        policies:
          backendAuth:
            key: $GEMINI_API_KEY

Notice the separation of concerns: the Semantic Router never touches API keys. It classifies the prompt and mutates a header. AgentGateway owns the downstream auth. This is exactly how infrastructure teams design production gateways — the routing intelligence is decoupled from the security posture.

And that failureMode: failOpen? It means if the SR container ever crashes or is restarting, AgentGateway seamlessly falls through to the default Gemini route. I've tested this — during SR container restarts, Pi's requests still get answered without a single error. The agent doesn't even notice.
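Here is the fail-open behavior as a small Python sketch. It models the idea, not AgentGateway's implementation: `pick_backend` and the two fake classifiers are hypothetical, and the route table just mirrors the header values and backend names from the config above.

```python
from typing import Callable, Dict

# Header value -> backend name, mirroring the routes in homelab_config.yaml.
ROUTES: Dict[str, str] = {
    "qwen-coder": "ollama",
    "gpt-4o": "openai",
    "gemini-flash": "gemini",
}
FALLBACK_BACKEND = "gemini-default"


def pick_backend(prompt: str, classify: Callable[[str], str]) -> str:
    """Return the backend for a prompt, failing open to the fallback route."""
    try:
        selected = classify(prompt)  # stands in for the ExtProc call to the SR
    except Exception:
        # failOpen: the classifier is unreachable, so no header gets set and
        # the request falls through to the catch-all route.
        return FALLBACK_BACKEND
    # An unrecognized header value also lands on the catch-all route.
    return ROUTES.get(selected, FALLBACK_BACKEND)


def healthy_classifier(prompt: str) -> str:
    return "qwen-coder"


def crashed_classifier(prompt: str) -> str:
    raise ConnectionError("semantic router unavailable")


print(pick_backend("write a sort function", healthy_classifier))
print(pick_backend("write a sort function", crashed_classifier))
```

The trade-off with failing open is that a dead classifier costs you routing quality rather than availability: every request still gets answered, just by the default model.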

The ARM64 Rabbit Hole (Two Bugs, Two PRs)

Here's where the story gets real. I run this on an Apple Silicon Mac Mini (M-series, ARM64). Everything installed fine. The SR container started. And then:

{
  "msg": "embedding_models_init_completed",
  "embedding_ready": false,
  "tools_ready": false
}

The mmBERT model loaded but the embedding runtime never became ready. Every routing attempt logged:

Failed to embed model qwen-coder: failed to generate batched embedding (status: -1)

Bug #1: Wrong FFI Dispatch (#2172)

After deep-diving into the SR source code, I discovered the issue. The Go router was calling candle_binding.GetEmbeddingBatched() for all model types — but the Rust FFI backend only supports batched embeddings for qwen3 architectures. For mmbert (the default), it returned status: -1.

The fix (PR #2192) was elegant — a 15-line change that adds a dispatch check:

// Only qwen3 supports the batched FFI. Others use single-text FFI.
func candleEmbeddingSupportsBatched(modelType string) bool {
    return modelType == "qwen3"
}

For non-qwen3 models, it gracefully falls back to GetEmbeddingWithModelType(), which works perfectly on ARM64.

Bug #2: Missing Model Files on First Boot (#2173)

The second issue was subtler. When the SR container downloaded the mmBERT model files from HuggingFace on first boot, several required files (like tokenizer.json and config.json) weren't being fetched. This was a download-completeness bug in the model resolver.

Fixed in PR #2195.

A Huge Thank You 🙏

Both issues were triaged and fixed within days by the vLLM Semantic Router team, particularly @WUKUNTAI-0211, who wrote the fix for the FFI dispatch, and @theohsiung, who handled the file completeness fix. The PRs are now merged into main. If you're running on ARM64/Apple Silicon, just pull the latest and it works. Also, a shout-out to AayushSaini101 for recently encouraging me to contribute to the repo.

This is open source at its best. I filed two issues with reproduction steps and log snippets, and got working fixes merged into the upstream repo. The community aspect of this project is exceptional.

The Proof: Real Routing Logs

Let me show you what it actually looks like when a request flows through. I send this:

curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [
      {"role": "user", "content": "Write me a Python function to compute fibonacci numbers using memoization"}
    ]
  }'

Step 1: SR Classifies the Prompt (1ms!)

{
  "msg": "routing_decision",
  "original_model": "MoM",
  "selected_model": "qwen-coder",
  "reason_code": "auto_routing",
  "routing_latency_ms": 1,
  "component": "extproc"
}

One millisecond. The SR embedded the prompt, compared it against the three model descriptions, and decided this is a coding task → qwen-coder.

Step 2: AgentGateway Routes to Ollama

info  request
  gateway=default/default
  route=default/route0
  endpoint=localhost:11434
  http.status=200
  gen_ai.request.model=qwen2.5-coder:7b
  gen_ai.response.model=qwen2.5-coder:7b
  gen_ai.usage.input_tokens=41
  gen_ai.usage.output_tokens=366
  duration=22537ms

AgentGateway matched the x-selected-model: qwen-coder header, routed to the local Ollama endpoint, and the entire round-trip (including LLM generation) completed in 22.5 seconds. The routing overhead? 1ms. The rest is just Ollama thinking.

Step 3: The SR Startup Sequence

On container boot, you see the full model loading pipeline:

{"msg":"embedding_models_init_started","mmbert_configured":true,"use_cpu":true}

INFO: mmBERT embedding model registered with 2D Matryoshka support

{"msg":"embedding_models_initialized","use_batched":false}

{"msg":"selection_factory_initialized","selector_count":14}

{"msg":"startup_complete","embedding_ready":false,"sem_cache_enabled":true,
 "model_selection":true,"extproc_port":50051,"decisions":"MoM"}

14 selection algorithms available out of the box. Multi-factor, ELO, reinforcement-learning-driven, hybrid, latency-aware, session-aware, KNN, SVM, K-means — all registered and ready. I'm using multi_factor with cost-heavy weighting, but I can switch to any of these with a single YAML change. Try doing that with a Python keyword list.

The Numbers After Two Weeks

After running the SR-powered setup alongside Pi for two weeks, here's the comparison:

Metric	Python Router	vLLM Semantic Router
Misrouted requests	~18%	~3% (subjective spot-checks)
Routing latency	~45ms (HTTP proxy)	1-3ms (gRPC ExtProc)
Monthly estimated API cost	~$24	~$14
Maintenance effort	Weekly keyword updates	Zero (model descriptions are stable)
Failover behavior	Manual restart	Automatic failOpen to Gemini
Language support	English keywords only	Multi-language (embedding-based)
Config	100 lines of Python	2 YAML files

The cost savings come from fewer misroutes. When "explain the async/await pattern in Rust" correctly goes to the local Ollama instead of GPT-4o, that's a $0.003 request instead of $0.03. Across hundreds of daily requests from Pi's cron jobs and my direct usage, it adds up fast.

Why Every Agent Builder Needs This

If you're building agents — whether it's a personal Pi running on a Mac Mini or a production fleet of agents in Kubernetes — you need a routing layer that understands prompts. Here's why:

Cost control is the #1 agent problem. Agents generate a lot of requests. Without intelligent routing, every request goes to your most expensive model. The SR's multi_factor algorithm explicitly weighs cost, latency, and quality.
Keyword routing doesn't scale. The moment your agent handles a domain you didn't anticipate (my Pi started doing recipe research — none of my keywords covered "sourdough starter hydration"), keyword-based routing silently fails.
AgentGateway + SR is production-grade. This isn't a hobby-tier setup. AgentGateway is a Gateway API data plane built in Rust. The SR is an Envoy ExtProc server written in Go and Rust, backed by the vLLM project. This is the same architecture you'd deploy in a Kubernetes cluster with 50 models.
Zero code maintenance. I haven't touched my routing config since I wrote those model descriptions. The SR learns from the descriptions, not from rules I have to keep updating.

What's Next

With the routing intelligence sorted, I'm now focused on:

Observability: Wiring up Jaeger and Prometheus to trace every request from Pi → AgentGateway → SR → Upstream LLM and back. The AgentGateway already emits OpenTelemetry-compatible spans — I just need to set up the collectors.
More models: Now that routing is semantic, I can add specialized models (a medical one, a legal one) with just a new model card in YAML. The SR will automatically figure out when to use them.

If you're running a homelab AI setup — or building agents at any scale — the combination of AgentGateway + vLLM Semantic Router is, in my opinion, the most underrated infrastructure combo in the AI ecosystem right now. It turned my janky Python keyword matcher into a proper ML-powered routing plane.

And it runs on a Mac Mini in my living room. 🏠

Follow me for Part 4, where I'll add full observability to this pipeline and show you exactly what happens when Pi dreams at 3 AM — now with traces.

#ai #agents #architecture #opensource

How Cassandra Compression Actually Works (Chunks, Offsets, and Reads)

Anup Sharma — Tue, 09 Jun 2026 12:13:51 +0000

I got asked about Cassandra compression by someone recently and didn't do it justice on the spot. The questions were good ones: what does chunk_length_in_kb really control, what happens on a write, and on a read how does Cassandra know how many bytes to pull off disk before it can decompress anything? I work on a database with a Cassandra backend, but we forked Cassandra years ago, before table compression existed, and our data is local so we never leaned on it. So I went and read the actual mechanics. Here's the version I wish I'd had in my head.

The setup

Cassandra stores data in SSTables, which are immutable once written. Compression happens when the SSTable is written and never changes after that. If you ALTER the compression settings, nothing happens to existing data until those SSTables get rewritten by compaction.

When compression is on, two files matter:

Data.db holds the compressed bytes
CompressionInfo.db holds the metadata Cassandra needs to find and decompress those bytes

That second file is the whole trick. Hold onto it.

chunk_length_in_kb is the uncompressed size

This is the part I had backwards in my head. chunk_length_in_kb is not how big each chunk is on disk. It's the size of the uncompressed buffer Cassandra fills before it compresses and flushes.

So with the default of 16 KB (it was 64 KB before Cassandra 4.0), Cassandra buffers 16 KB of real data, compresses that block, and writes the result. The compressed output might be 4 KB or 9 KB depending on how squishy the data is. The chunks on disk are all different sizes. But every chunk represents exactly one fixed slice of the uncompressed stream.

That "fixed in the uncompressed world, variable on disk" split is what the person asking kept circling, and it's the thing that makes everything else work.

The write path

Writing is the easy direction:

Buffer incoming data until you hit chunk_length_in_kb worth of uncompressed bytes.
Compress that buffer (LZ4 by default).
Append the compressed bytes to Data.db, followed by a 4-byte checksum.
Record the starting byte offset of this chunk in CompressionInfo.db.

Repeat until the SSTable is done. The checksum is a CRC over the compressed bytes; it's how Cassandra catches bitrot later, and crc_check_chance controls how often it bothers to verify on read.

So CompressionInfo.db ends up looking roughly like this:

compressor name        e.g. "LZ4Compressor"
chunk_length           e.g. 16384   (uncompressed bytes per chunk)
data_length            total uncompressed length of the file
chunk_count            N
chunk_offsets[]        long[N]   <-- the important bit

chunk_offsets is just an array of byte positions into Data.db. Offset i tells you where compressed chunk i starts.
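Here is a minimal sketch of that write loop in Python, mostly to make the bookkeeping concrete. It is not Cassandra's code: zlib stands in for LZ4 only because it ships with Python, and write_compressed, data_db, and the in-memory layout are names I made up for illustration.

```python
import struct
import zlib

CHUNK_LENGTH = 16 * 1024  # uncompressed bytes per chunk (chunk_length_in_kb = 16)


def write_compressed(data: bytes, chunk_length: int = CHUNK_LENGTH):
    """Return (data_db, chunk_offsets) for an uncompressed byte stream.

    Assumption: zlib replaces LZ4, and both "files" are kept in memory.
    """
    data_db = bytearray()  # plays the role of Data.db
    chunk_offsets = []     # plays the role of chunk_offsets[] in CompressionInfo.db

    for start in range(0, len(data), chunk_length):
        chunk = data[start:start + chunk_length]   # fixed-size uncompressed slice
        compressed = zlib.compress(chunk)          # variable-size compressed output
        chunk_offsets.append(len(data_db))         # where this chunk starts on "disk"
        data_db += compressed
        data_db += struct.pack(">I", zlib.crc32(compressed))  # 4-byte CRC trailer

    return bytes(data_db), chunk_offsets


# 10,000 rows of 11 bytes each = 110,000 uncompressed bytes
payload = b"".join(b"row-%06d," % i for i in range(10_000))
data_db, chunk_offsets = write_compressed(payload)
print(len(payload), len(data_db), len(chunk_offsets))
```

The offsets come out unevenly spaced because each chunk compresses to a different size, while every chunk except the last still covers exactly 16384 uncompressed bytes.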

The read path

Now the question that actually matters. Cassandra has resolved a partition through its index and knows the uncompressed byte position it wants, call it position. The data on disk is compressed and every chunk is a different size, so it can't just seek there. Here's how it gets from an uncompressed position to actual bytes.

First, figure out which chunk holds that position. Because chunks are a fixed size in uncompressed terms, this is plain division:

int chunkIndex = (int) (position / chunkLength);
int offsetInChunk = (int) (position % chunkLength);

Then look up where that chunk lives on disk, and figure out how many bytes to read. The length of a compressed chunk isn't stored directly. You get it by subtracting consecutive offsets (minus the 4 checksum bytes):

long start = chunkOffsets[chunkIndex];

long end = (chunkIndex + 1 < chunkCount)
    ? chunkOffsets[chunkIndex + 1]   // next chunk starts here
    : compressedFileLength;          // last chunk runs to EOF

int compressedLength = (int) (end - start - 4); // 4 = CRC checksum

That answers the "how do we know how many bytes / offsets to read" question directly. You don't store the compressed length, you derive it from the gap between this offset and the next one.

After that the rest is mechanical:

file.seek(start);
file.read(buffer, 0, compressedLength);   // read exactly this chunk

if (shouldCheck(crcCheckChance))
    verifyCrc(buffer, file.readInt());    // the trailing 4 bytes

byte[] decompressed = lz4.decompress(buffer); // up to chunkLength bytes

return decompressed[offsetInChunk ...];   // jump to what we wanted

A concrete pass with 16 KB chunks (16384 bytes). Say Cassandra wants uncompressed position = 50000:

chunkIndex = 50000 / 16384 = 3
offsetInChunk = 50000 % 16384 = 848
read the compressed bytes between chunkOffsets[3] and chunkOffsets[4]
decompress that one chunk back into ~16 KB
skip to byte 848 in the result

That's it. One division to find the chunk, one subtraction to size the read, one decompress, one in-memory skip.
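To convince myself the arithmetic holds end to end, here is a small round trip in Python. It is a sketch under assumptions, not Cassandra's reader: zlib stands in for LZ4, the CRC is always verified (as if crc_check_chance were 1.0), and write_compressed and read_at are hypothetical helper names.

```python
import struct
import zlib


def write_compressed(data, chunk_length):
    """Tiny writer so the example is self-contained (zlib stands in for LZ4)."""
    data_db, offsets = bytearray(), []
    for start in range(0, len(data), chunk_length):
        compressed = zlib.compress(data[start:start + chunk_length])
        offsets.append(len(data_db))
        data_db += compressed + struct.pack(">I", zlib.crc32(compressed))
    return bytes(data_db), offsets


def read_at(data_db, offsets, chunk_length, position, length):
    """Read `length` uncompressed bytes at `position`, staying inside one chunk."""
    chunk_index, offset_in_chunk = divmod(position, chunk_length)

    start = offsets[chunk_index]
    # The next offset marks the end; the last chunk runs to the end of the file.
    end = offsets[chunk_index + 1] if chunk_index + 1 < len(offsets) else len(data_db)
    compressed_length = end - start - 4  # 4 = trailing CRC

    compressed = data_db[start:start + compressed_length]
    (stored_crc,) = struct.unpack(">I", data_db[start + compressed_length:end])
    if zlib.crc32(compressed) != stored_crc:
        raise IOError(f"corrupt chunk {chunk_index}")

    chunk = zlib.decompress(compressed)  # at most chunk_length bytes
    return chunk[offset_in_chunk:offset_in_chunk + length]


CHUNK_LENGTH = 16384
payload = bytes(i % 251 for i in range(100_000))
data_db, offsets = write_compressed(payload, CHUNK_LENGTH)

print(divmod(50_000, CHUNK_LENGTH))  # (3, 848)
result = read_at(data_db, offsets, CHUNK_LENGTH, 50_000, 16)
print(result == payload[50_000:50_016])  # True
```

A read that spans two chunks would need to repeat the lookup for the next chunk; the sketch leaves that out to keep the single-chunk path readable.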

Why fixed uncompressed size, and not fixed disk size

You can't index into compressed data, because the compressor changes the size unpredictably. If chunks were a fixed size on disk, you'd have no idea which uncompressed byte each one started at, and random reads would mean decompressing from the front of the file every time.

By fixing the uncompressed size instead, the mapping from "byte I want" to "chunk number" becomes a single divide. The offset array handles the other direction, telling you where that chunk sits on disk. The two together give you O(1) random access into compressed data, which is the whole point.

The tradeoff in chunk size

To read one tiny cell, Cassandra still has to read and decompress the entire chunk that contains it. So chunk size is a real knob:

Bigger chunks give the compressor more context, so better compression ratio and a smaller file. But every small read drags a big block off disk and decompresses it. That's read amplification.
Smaller chunks mean less wasted I/O per read, but a worse ratio and more off-heap memory, since there are more chunks and therefore more offsets to keep around.

That's why 4.0 dropped the default from 64 KB to 16 KB. For read-heavy or point-read workloads, dragging 64 KB off disk to return a few hundred bytes is mostly waste. If you're doing big sequential scans or your rows are large, bigger chunks can still win.
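A quick back-of-the-envelope calculation makes the knob tangible. The numbers below are assumptions for illustration: a 1 GiB uncompressed SSTable, a 256-byte point read, and 8 bytes per stored offset. The amplification figure counts uncompressed bytes decoded per byte wanted, not bytes read off disk, which depend on the compression ratio.

```python
def chunk_tradeoff(uncompressed_bytes: int, chunk_kb: int, read_bytes: int):
    """Rough numbers for one SSTable and one point read (illustrative only)."""
    chunk_length = chunk_kb * 1024
    chunk_count = -(-uncompressed_bytes // chunk_length)  # ceiling division
    offsets_bytes = chunk_count * 8                       # one 8-byte long per chunk
    amplification = chunk_length / read_bytes             # bytes decoded per byte wanted
    return chunk_count, offsets_bytes, amplification


GIB = 1024 ** 3
for chunk_kb in (64, 16, 4):
    count, offsets_bytes, amp = chunk_tradeoff(GIB, chunk_kb, read_bytes=256)
    print(f"{chunk_kb:>2} KB chunks: {count:>6} offsets, "
          f"{offsets_bytes // 1024:>4} KiB of offsets, {amp:.0f}x decode amplification")
```

Going from 64 KB to 16 KB chunks quadruples the offset count (128 KiB to 512 KiB of offsets in this example) and cuts the decoded bytes per point read by the same factor of four.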

The short version

chunk_length_in_kb sizes the uncompressed buffer. On write, Cassandra compresses one buffer at a time and records each chunk's disk offset in CompressionInfo.db. On read, it divides the wanted position by the chunk length to pick a chunk, subtracts neighbouring offsets to size the read, pulls exactly those bytes, checks the CRC, decompresses, and skips to the byte it wanted. Fixed uncompressed chunks are what let it do all that without scanning from the start of the file.

I should have been able to walk through this in the room. Now I can, and writing it down made it stick.

I Traced Personal Agent's Source Code. Inside Was Pi... And It Dreams at 3 AM.

Anup Sharma — Sat, 30 May 2026 09:17:18 +0000

This is Part 2 of my homelab AI series. In Part 1, I built a system where one AI decides which AI to talk to. This time, I popped the hood on the agent itself — and what I found inside changed how I think about AI software.

Last week I wrote about my homelab setup: an autonomous agent called OpenClaw running on a Raspberry Pi, routing requests through AgentGateway to three different LLMs based on intent. People loved it. A few folks DMed me asking how OpenClaw actually works. Like, what happens after the routing? How does an autonomous agent that edits PDFs, writes code, schedules research, and finds the best restaurants in Indiranagar every Friday actually... do all that?

Honestly? I didn't fully know either. I knew OpenClaw was powerful. I used it daily. I'd even contributed some code. But I'd never really sat down and traced a request all the way through. So last weekend, I did.

And about 30 minutes in, I hit a line in package.json that stopped me cold:

"@earendil-works/pi-agent-core": "0.75.4",
"@earendil-works/pi-coding-agent": "0.75.4"

OpenClaw doesn't have its own agent engine. Buried inside it (embedded as an SDK, not a subprocess, not an API call) is a tiny coding agent called Pi. I went straight to YouTube and found a great talk by Pi's author, Mario Zechner, at the AI Engineer conference.

And Pi might be the most elegant piece of AI software I've ever read.

Wait, What is Pi?

Pi is an open-source terminal coding agent written in TypeScript by Mario Zechner. If you've been in the AI coding agent space, you've probably heard of Cursor, Windsurf, Aider, or Claude Code. Pi sits in the same category but takes a radically different approach.

Where other agents keep adding features, Pi keeps removing them.

Where other agents have massive system prompts spanning thousands of tokens, Pi's is almost embarrassingly short.

Where other agents ship with dozens of built-in tools, Pi ships with four.

Yes. Four.

read   →  Read a file
write  →  Write a file
edit   →  Edit a file
bash   →  Run a shell command

That's it. That's the entire toolkit the LLM gets to work with.

And here's the thing that broke my brain: it's enough.

Think about it. What can you do with a terminal? You can read files, write files, edit files, and run commands. That's literally everything. grep? That's a bash command. git commit? Bash. npm install? Bash. curl an API? Bash. Run tests? Bash. Deploy to production? ...also bash.

Pi doesn't try to build a specialized tool for every possible operation. It gives the LLM the same primitives that you have as a developer, and trusts the model to compose them.
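To show how little surface area four tools need, here is a sketch of the idea in Python. This is not Pi's code or API (Pi is TypeScript); the function signatures, the TOOLS table, and the dispatch helper are all hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def read(path: str) -> str:
    return Path(path).read_text()


def write(path: str, content: str) -> str:
    Path(path).write_text(content)
    return f"wrote {len(content)} characters to {path}"


def edit(path: str, old: str, new: str) -> str:
    text = Path(path).read_text()
    if text.count(old) != 1:
        raise ValueError("`old` must match exactly once")
    Path(path).write_text(text.replace(old, new))
    return f"edited {path}"


def bash(command: str) -> str:
    done = subprocess.run(command, shell=True, capture_output=True, text=True)
    return done.stdout + done.stderr


TOOLS = {"read": read, "write": write, "edit": edit, "bash": bash}


def dispatch(call: dict) -> str:
    """Run one tool call shaped like {"name": ..., "args": {...}} (hypothetical shape)."""
    return TOOLS[call["name"]](**call["args"])


with tempfile.TemporaryDirectory() as tmp:
    target = str(Path(tmp) / "hello.txt")
    dispatch({"name": "write", "args": {"path": target, "content": "hello world\n"}})
    dispatch({"name": "edit", "args": {"path": target, "old": "world", "new": "Pi"}})
    final_text = dispatch({"name": "read", "args": {"path": target}})
    echoed = dispatch({"name": "bash", "args": {"command": "echo composed"}})

print(final_text, echoed)
```

Everything else (grep, git, npm, curl) would go through the bash entry, which is the whole point of keeping the table this small.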

Armin Ronacher (of Flask fame) wrote about Pi back in January and called it a glimpse into the future of software. After spending a weekend inside the source code, I think he undersold it, and his post explains it very well.

How Pi Actually Runs Inside OpenClaw

Here's what surprised me the most: Pi isn't a separate service that OpenClaw calls over HTTP. It's not a subprocess. It's not even an RPC server.

OpenClaw literally imports Pi as an npm package and runs the agent loop in the same process.

OpenClaw starts up
    ↓
Calls createAgentSession() from @earendil-works/pi-coding-agent
    ↓
Pi's agent loop starts running in-process
    ↓
OpenClaw subscribes to Pi's events (message_start, tool_execution, turn_end, etc.)
    ↓
OpenClaw replaces Pi's default tools with its own extended set
    ↓
User sends a message on Discord → OpenClaw calls session.prompt(message)
    ↓
Pi takes over: talks to LLM, executes tools, streams responses
    ↓
OpenClaw receives events, formats them, sends back to Discord

This is wild to me. Pi is designed as a standalone CLI agent. You can npm install -g @earendil-works/pi-coding-agent and use it directly in your terminal. But Mario architected it so cleanly that the entire agent core can be extracted and embedded into another application like a library.

OpenClaw is the vehicle. Pi is the engine.

The Agent Loop: Where the Magic Happens

Let me walk you through what actually happens when I send a message to OpenClaw on Discord. This is where it gets fun.

Pi's agent loop lives in a single 743-line file (agent-loop.ts), and it follows a deceptively simple cycle:

┌─────────────────────────────────────────────────┐
│                  USER PROMPT                     │
└─────────────┬───────────────────────────────────┘
              ↓
┌─────────────────────────────────────────────────┐
│   Transform Context (extensions can modify)      │
│   Convert AgentMessages → LLM Messages           │
│   Send to LLM provider (streaming)               │
└─────────────┬───────────────────────────────────┘
              ↓
┌─────────────────────────────────────────────────┐
│          ASSISTANT RESPONSE                      │
│   ┌──────────────┐    ┌──────────────────┐      │
│   │  Text Reply   │    │  Tool Calls       │      │
│   └──────────────┘    └───────┬──────────┘      │
└───────────────────────────────┼──────────────────┘
              ↓                 ↓
        (no tools?)      Execute tools
         ↓                (parallel by default)
    ┌──────────┐              ↓
    │ Check     │      Tool Results
    │ follow-up │         ↓
    │ queue     │   ┌─────────────────────┐
    └──────────┘   │ Check steering queue │
         ↓         │ (user interrupts?)   │
    Empty? STOP    └─────────┬───────────┘
    Has msgs?                ↓
    → loop again       Loop back to LLM
                       with tool results

But here's where Pi gets clever. See those two queues?

The Dual Queue System

Most agents have a simple loop: user says something → agent responds → done. Pi has two hidden message queues that make it far more powerful:

1. Steering Queue — "Hey, change direction."
These messages get injected between tool results and the next LLM call. If the agent is mid-task and you send a new message saying "actually, use TypeScript instead of Python," Pi doesn't wait for the current task to finish. It slides your message into the conversation right before the next LLM turn. The model sees the tool results AND your course correction, and adapts.

2. Follow-Up Queue — "Before you stop, consider this too."
These get checked after the agent would normally stop (no more tool calls). If there are follow-up messages, the agent continues instead of ending. Extensions use this to chain multi-step workflows without the user having to manually prompt each step.

This is elegant. Most agents treat conversations as request-response. Pi treats them as navigable streams that can be redirected mid-flight.
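The two queues are easier to see in code than in a diagram. Below is a minimal Python sketch of the control flow as I understand it, not a port of agent-loop.ts: the message shapes, run_agent, and the scripted fake_llm are assumptions, and the queues are pre-filled here whereas real steering messages arrive while the agent is working.

```python
from collections import deque


def run_agent(prompt, llm, tools, steering, follow_up):
    """Drive one prompt to completion and return the full message list."""
    messages = [{"role": "user", "content": prompt}]
    while True:
        reply = llm(messages)  # hypothetical shape: {"text": ..., "tool_calls": [...]}
        messages.append({"role": "assistant", "content": reply["text"]})

        if reply["tool_calls"]:
            for call in reply["tool_calls"]:
                result = tools[call["name"]](call["arg"])
                messages.append({"role": "tool", "content": result})
            # Steering: interrupts land after tool results, before the next LLM turn.
            while steering:
                messages.append({"role": "user", "content": steering.popleft()})
            continue

        # No tool calls: the agent would stop, unless follow-ups are waiting.
        if follow_up:
            messages.append({"role": "user", "content": follow_up.popleft()})
            continue
        return messages


script = deque([
    {"text": "checking files", "tool_calls": [{"name": "bash", "arg": "ls"}]},
    {"text": "switching to TypeScript", "tool_calls": []},
    {"text": "tests added", "tool_calls": []},
])


def fake_llm(messages):
    return script.popleft()  # scripted stand-in for a real model


steering = deque(["actually, use TypeScript instead of Python"])
follow_up = deque(["also add tests"])
tools = {"bash": lambda arg: f"ran: {arg}"}

messages = run_agent("build a CLI", fake_llm, tools, steering, follow_up)
print([m["role"] for m in messages])
```

The steering message shows up right after the tool result, and the follow-up keeps the loop alive for one more turn after the model would otherwise have stopped.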

The Part That Changed How I Think: Append-Only Tree Sessions

This is where I went from "oh, this is a nice agent" to "okay, this is genuinely brilliant engineering."

Most AI chat apps store conversations as a flat list. Message 1, message 2, message 3... linear. If you want to try a different approach, you either edit your message (and lose the original response) or start a new conversation entirely.

Pi stores conversations as an append-only tree.

                    Session Start
                         │
                    User: "Build me a REST API"
                         │
                    Assistant: "Sure, I'll use Express..."
                         │
              ┌──────────┴──────────┐
              │                      │
         [Branch A]             [Branch B]
    "Use FastAPI instead"    "Add authentication"
              │                      │
    Assistant: "Okay,         Assistant: "I'll add
    switching to Python..."   JWT middleware..."
              │                      │
         [Branch A1]            [Branch B1]
    "Add rate limiting"      "Use OAuth instead"

Every message is a node with an id and a parentId. When you fork a conversation, Pi creates a new branch from any point in the tree. The original branch stays untouched. You can navigate back and forth between branches, compare approaches, and even branch from a branch.

The session file is JSONL (one JSON object per line, append-only). It's never rewritten, never mutated. New messages just get appended with pointers to their parent.

Why does this matter? Three reasons:

1. It's crash-proof. Append-only means no data corruption on unexpected shutdown. Your Raspberry Pi loses power at 3 AM mid-response? The session is fine. Just re-open and continue from the last complete message.

2. It enables time travel. You can jump back to any point in the conversation and fork. "What if I'd asked for Rust instead of Python?" Just navigate back and try it. Both histories coexist.

3. It makes compaction elegant. When the context window fills up, Pi doesn't throw away old messages. It summarizes them into a CompactionEntry node in the tree. The original messages are still in the file — they're just not loaded into context anymore. You can always go back.
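Here is a small Python sketch of the append-only tree idea, including what recovery after a power cut could look like. The id and parentId fields come from the description above; the role and text fields, the helper names, and the torn-line handling are my assumptions, not Pi's actual session format.

```python
import json
import os
import tempfile


def append_entry(path, entry_id, parent_id, role, text):
    """Append one node; existing lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"id": entry_id, "parentId": parent_id,
                            "role": role, "text": text}) + "\n")


def load_entries(path):
    """Load all nodes, skipping a torn final line left by a crash mid-write."""
    entries = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue
            entries[entry["id"]] = entry
    return entries


def branch(entries, leaf_id):
    """Walk parent pointers from a leaf to the root; return texts root-first."""
    path = []
    node_id = leaf_id
    while node_id is not None:
        node = entries[node_id]
        path.append(node["text"])
        node_id = node["parentId"]
    return path[::-1]


session = os.path.join(tempfile.mkdtemp(), "session.jsonl")
append_entry(session, "1", None, "user", "Build me a REST API")
append_entry(session, "2", "1", "assistant", "Sure, I'll use Express...")
append_entry(session, "3", "2", "user", "Use FastAPI instead")   # Branch A
append_entry(session, "4", "2", "user", "Add authentication")    # Branch B
with open(session, "a", encoding="utf-8") as f:
    f.write('{"id": "5", "parentId": "4", "ro')                  # simulated power loss

entries = load_entries(session)
print(branch(entries, "3"))
print(branch(entries, "4"))
```

Both branches share the first two nodes and neither overwrites the other, so switching branches is just picking a different leaf and walking up.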

Iterative Compaction: How Pi Remembers What Matters

Every AI agent has the same problem: context windows are finite. Eventually, your conversation gets too long and you hit the token limit. Most agents handle this by... well, by crashing. Or by silently dropping the oldest messages. Or by starting a new session.

Pi does something smarter. It runs iterative compaction.

When the context is getting full, Pi:

Walks backward from the newest messages, counting tokens
Keeps the most recent ~20,000 tokens intact (you want your recent context fresh)
Takes everything older and generates a structured summary via the LLM itself
Stores that summary as a CompactionEntry in the session tree
On the next context build, it loads the summary instead of the original messages

But here's the key word: iterative. When compaction runs a second time, Pi doesn't regenerate the summary from scratch. It takes the existing summary and merges new information into it. The summary evolves over time, like a living document.

The summary follows a structured format:

## Goal
## Constraints & Preferences  
## Progress (Done / In Progress / Blocked)
## Key Decisions
## Next Steps
## Critical Context

It also tracks which files were read and modified across the entire session, even across multiple compactions. So if you ask "what files have we changed today?" after 6 hours of work and 3 compactions, Pi knows.
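The cut-point selection and the "merge, don't regenerate" behaviour can be sketched in a few lines of Python. This is my reading of the steps above, not Pi's implementation: the four-characters-per-token estimate, the function names, and the fake summarizer are all assumptions, and a real summarizer would call the LLM.

```python
KEEP_RECENT_TOKENS = 20_000  # figure from the post; the real setting may differ


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: roughly four characters per token."""
    return max(1, len(text) // 4)


def split_for_compaction(messages, keep_recent=KEEP_RECENT_TOKENS):
    """Walk backward from the newest message; return (older, recent)."""
    total = 0
    cut = len(messages)
    for index in range(len(messages) - 1, -1, -1):
        total += estimate_tokens(messages[index])
        if total > keep_recent:
            break
        cut = index
    return messages[:cut], messages[cut:]


def compact(previous_summary, messages, summarize, keep_recent=KEEP_RECENT_TOKENS):
    """Fold older messages into the existing summary instead of starting over."""
    older, recent = split_for_compaction(messages, keep_recent)
    if not older:
        return previous_summary, recent
    return summarize(previous_summary, older), recent


def fake_summarize(previous, older):
    # A real implementation would ask the LLM to merge `older` into `previous`.
    return f"{previous} +{len(older)} msgs".strip()


history = ["x" * 400 for _ in range(5)]  # about 100 tokens each
summary, recent = compact("", history, fake_summarize, keep_recent=250)

later = recent + ["y" * 400 for _ in range(3)]
summary2, recent2 = compact(summary, later, fake_summarize, keep_recent=250)
print(summary, "|", summary2)
```

The second pass receives the first summary as input, which is what keeps the summary evolving rather than being rebuilt from raw history each time.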

OpenClaw's Memory: The Part Where AI Dreams

Okay, this is where things get genuinely sci-fi. And I mean that literally.

Pi handles context management within a single session beautifully. But what about across sessions? What about things you told the agent three weeks ago? What about your preferences, your coding style, the fact that you always want biryani recommendations from places with 4.5+ ratings?

OpenClaw builds a multi-layered memory system on top of Pi:

Layer 1: File-Based Memory

MEMORY.md — Long-term memory, loaded at every session start
memory/YYYY-MM-DD.md — Daily notes (today + yesterday auto-loaded)
DREAMS.md — A dream diary. Yes, really.

Layer 2: Active Memory

Before every reply, a bounded sub-agent runs a quick memory search and injects relevant past context into the prompt. It has a circuit breaker — if it takes too long, it gets skipped.
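The simplest reading of "if it takes too long, it gets skipped" is a deadline around the memory search. Here is a sketch of that pattern in Python; it is not OpenClaw's code, and recall_with_deadline, build_prompt, and the two search functions are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def recall_with_deadline(search, query, deadline_seconds):
    """Run search(query); return [] if it misses the deadline."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(search, query)
    try:
        return future.result(timeout=deadline_seconds)
    except FutureTimeout:
        return []  # skip memory rather than delay the reply
    finally:
        pool.shutdown(wait=False)  # do not block on a slow search


def build_prompt(user_message, memories):
    context = "".join(f"[memory] {m}\n" for m in memories)
    return context + user_message


def fast_search(query):
    return ["prefers tabs over spaces"]


def slow_search(query):
    time.sleep(0.5)
    return ["too late"]


fast = build_prompt("format this file", recall_with_deadline(fast_search, "format", 1.0))
slow = build_prompt("format this file", recall_with_deadline(slow_search, "format", 0.05))
print(repr(fast))
print(repr(slow))
```

When the search misses its deadline, the reply goes out with no injected memory instead of waiting, which trades a bit of recall for predictable latency.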

Layer 3: The Dreaming System 🌙

This is the one that made me put my laptop down and take a walk.

Every night at 3 AM, OpenClaw runs a three-phase memory consolidation cycle inspired by how human sleep works:

Light Sleep — Sorts through recent short-term memories. Stages candidates. Doesn't write anything yet. Just organizes.

REM Sleep — Reflects on recurring themes, patterns, and connections across memories. Still no writes. Just thinking.

Deep Sleep — Scores each memory candidate across 6 weighted signals and decides what gets promoted to long-term storage:

Relevance:            30%   (how useful is this?)
Frequency:            24%   (how often did this come up?)
Query Diversity:      15%   (was it relevant to different topics?)
Recency:              15%   (is it still timely?)
Consolidation:        10%   (does it connect to existing memories?)
Conceptual Richness:   6%   (is it a deep insight or just a fact?)

Memories that score high enough get written to MEMORY.md. Everything else fades.
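The scoring step is a weighted sum, so it fits in a few lines of Python. The weights are the ones listed above; everything else is an assumption for illustration: signal values normalized to [0, 1], the 0.6 promotion threshold, and the two example memories.

```python
WEIGHTS = {
    "relevance": 0.30,
    "frequency": 0.24,
    "query_diversity": 0.15,
    "recency": 0.15,
    "consolidation": 0.10,
    "conceptual_richness": 0.06,
}
PROMOTION_THRESHOLD = 0.6  # hypothetical cut-off, not taken from OpenClaw


def score(signals: dict) -> float:
    """Weighted sum of per-signal scores, each assumed to be in [0, 1]."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)


def promote(candidates: dict) -> list:
    """Names of the candidates whose score clears the threshold."""
    return [name for name, signals in candidates.items()
            if score(signals) >= PROMOTION_THRESHOLD]


candidates = {
    "prefers tabs over spaces": {
        "relevance": 0.9, "frequency": 0.8, "query_diversity": 0.5,
        "recency": 0.7, "consolidation": 0.6, "conceptual_richness": 0.3,
    },
    "asked about the weather once": {
        "relevance": 0.2, "frequency": 0.1, "query_diversity": 0.1,
        "recency": 0.9, "consolidation": 0.0, "conceptual_richness": 0.1,
    },
}
print(promote(candidates))
```

With these made-up inputs the first memory scores 0.72 and the second 0.24, so only the tabs preference would be promoted; a very recent but one-off memory does not make it on recency alone.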

The agent literally sleeps, dreams, and wakes up smarter the next morning.

I'm not going to pretend I wasn't a little unsettled the first time I realized my agent had reorganized its own memory overnight without being asked. But also... it remembered that I prefer tabs over spaces three weeks later without me mentioning it again. So, worth it.

The Extension System: How OpenClaw Bends Pi to Its Will

Pi ships with 4 tools. OpenClaw's agent has dozens — browser automation, web search, image generation, cron scheduling, subagent spawning, Discord actions, PDF extraction, memory search, and more.

How? Pi's extension system.

Extensions are TypeScript files that hook into Pi's 30+ lifecycle events:

Session Events:     session_start, session_before_compact, session_shutdown
Agent Events:       before_agent_start, agent_start, agent_end, turn_start, turn_end
Message Events:     message_start, message_update, message_end
Tool Events:        tool_call (can block!), tool_result (can modify!)
Input Events:       input (can intercept and transform user input)
Model Events:       model_select, thinking_level_select
Resource Events:    resources_discover

An extension can:

Register new tools that the LLM can call
Intercept tool calls before they execute (for safety, logging, sandboxing)
Modify tool results after execution
Inject messages mid-conversation (steering queue!)
Register custom LLM providers
Override the system prompt
Add UI widgets to the terminal
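The two hooks I find most interesting are tool_call (which can block) and tool_result (which can modify). Here is a Python sketch of that interception pattern. It only borrows the two event names; the Hooks class, the return shapes, and the example handlers are hypothetical and do not mirror Pi's TypeScript extension API.

```python
class Hooks:
    """Minimal event registry: event name -> list of handlers."""

    def __init__(self):
        self.handlers = {}

    def on(self, event, handler):
        self.handlers.setdefault(event, []).append(handler)

    def handlers_for(self, event):
        return self.handlers.get(event, [])


def run_tool(hooks, tools, name, arg):
    # A tool_call handler can veto the call by returning {"block": reason}.
    for handler in hooks.handlers_for("tool_call"):
        verdict = handler({"name": name, "arg": arg})
        if verdict and "block" in verdict:
            return f"blocked: {verdict['block']}"

    result = tools[name](arg)

    # A tool_result handler can rewrite the output by returning {"result": ...}.
    for handler in hooks.handlers_for("tool_result"):
        change = handler({"name": name, "result": result})
        if change and "result" in change:
            result = change["result"]
    return result


def deny_rm(call):
    if call["arg"].startswith("rm "):
        return {"block": "rm is not allowed"}


def truncate(event):
    return {"result": event["result"][:20]}


hooks = Hooks()
hooks.on("tool_call", deny_rm)
hooks.on("tool_result", truncate)
tools = {"bash": lambda arg: f"ran {arg}: " + "x" * 100}

allowed = run_tool(hooks, tools, "bash", "ls")
denied = run_tool(hooks, tools, "bash", "rm -rf /tmp/scratch")
print(allowed)
print(denied)
```

The blocked call never reaches the tool at all, which is what makes this kind of hook useful for safety policies and sandboxing.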

When OpenClaw boots up, it calls createAgentSession() from Pi and then runs a 7-stage tool pipeline that completely replaces Pi's default 4 tools with OpenClaw's full suite:

Pi's defaults → Custom replacements → OpenClaw tools → Channel-specific tools
    → Policy filtering → Schema normalization → AbortSignal wrapping

This is what good software architecture looks like. Pi doesn't try to be everything. It gives you a clean, minimal core and says: "Here are 30 hooks. Build whatever you want."

Why This Architecture Works

After spending a weekend inside this codebase, I think Pi gets three things right that most AI agents get wrong:

1. Trust the Model, Don't Hand-Hold It

Most agents build a specialized tool for every operation: search_files, list_directory, run_tests, git_commit, install_package...

Pi says: here's bash. Figure it out.

This seems reckless until you realize that modern LLMs are really good at shell commands. They know grep. They know find. They know git. Giving them bash and getting out of the way produces better results than giving them 50 narrow tools with rigid parameter schemas.

2. State is a Tree, Not a Line

Linear chat history is a lie. Real problem-solving is branching. You try approach A, realize it's wrong, backtrack, try approach B. Pi's tree sessions make this a first-class operation instead of a hack.

3. Extensions > Features

Instead of shipping a monolithic agent with every feature imaginable, Pi ships a tiny core with a powerful extension system. OpenClaw adds 129 extensions. My homelab setup is much simpler. Both work, because the core doesn't care what you bolt onto it.

Setting This Up For Yourself

If you want to try Pi standalone (no OpenClaw, just the coding agent):

npm install -g @earendil-works/pi-coding-agent
pi

That's it. You now have a terminal coding agent with 4 tools, tree sessions, and iterative compaction.

If you want the full OpenClaw experience — Discord integration, dreaming, multi-agent orchestration, 129 extensions — check out openclaw.ai. Fair warning: once you have an agent that dreams and remembers your preferences across weeks, going back to stateless ChatGPT feels like using a typewriter.

What I'm Building Next

In Part 1, I built the routing layer (which AI answers). In this post, I explored the engine (how the AI thinks). The next piece of the puzzle: observability.

AgentGateway already emits OpenTelemetry traces for every LLM call. Pi tracks token usage, tool execution times, and compaction events. I want to pipe all of this into a Grafana dashboard so I can see, in real-time:

Which model is handling which type of request
How many tokens stay local vs go to the cloud
How long tool executions take
When compaction fires and how much context it saves
What the dreaming system promoted to long-term memory

Stay tuned.

If you made it this far — first of all, respect. Second, if you're building something similar or want to nerd out about agent architectures, hit me up. I live for this stuff.

#AI #CodingAgent #OpenClaw #Pi #LLM #AgentArchitecture #HomeLab #BuildInPublic #AIEngineering #OpenSource

I Built an AI That Decides Which AI to Talk To — Running 24/7 From My Living Room

Anup Sharma — Sat, 23 May 2026 07:43:58 +0000

Last Saturday when I woke up, my AI agent had reviewed 14 restaurant ratings in Indiranagar, updated a shared Google Sheet, signed a 20-page PDF I'd been ignoring for a week, and written a bash script to clean up my server logs.

I didn't ask it to do any of that. It just... does things now.

Meet OpenClaw — my long-running autonomous agent that lives on a Raspberry Pi, plugged into Discord, running 24/7. It manages my memory, handles research, writes code, edits documents, finds the best weekend spots in Bangalore by scraping live ratings — basically, it runs half my life on autopilot.

But a few weeks ago, I noticed something that bothered me.

I asked it: "Write a Python script to parse JSON logs." Simple coding task. It sent that request to a cloud API, waited 3 seconds, burned tokens I paid for, and came back with an answer — when I had a perfectly capable local LLM sitting idle on my Mac Mini, three feet away.

Then I asked: "Think step by step about the trade-offs between event-driven vs polling architecture for my notification system." That's a hard reasoning question. I want that going to a frontier model. That's worth the tokens.

Same agent. Same endpoint. Completely different needs.

And that's when a stupid idea hit me:

What if the system could figure out which brain to use — before the request even reaches a model?

Turns out, it's not stupid at all. And it took me a weekend, a Raspberry Pi, a Mac Mini, 50 lines of Python, and an open-source gateway to build it.

Here's how.

The Setup

Here's what's running in my living room:

Raspberry Pi → Runs OpenClaw, my autonomous agent. It takes input from Discord, manages context, memory, and orchestrates everything.
Mac Mini → The brain farm. Runs three things:
- Ollama with qwen2.5-coder:7b, a local coding model that never leaves my network
- AgentGateway, an open-source AI gateway (started at Solo.io, now a Linux Foundation project) that handles routing, auth, and observability
- A lightweight Python router, the "intent classifier" I wrote in ~50 lines of code

The magic? OpenClaw doesn't know any of this is happening. It just sends a request to one endpoint. Behind the scenes, the system figures out the rest.

The Architecture

Three models. Three price points. One unified endpoint. OpenClaw just hits http://192.168.1.15:1234/v1/chat/completions and forgets about it.

Why AgentGateway?

I evaluated a few options — raw Envoy, Nginx with Lua scripting, even building a full proxy from scratch. But AgentGateway stood out for a few reasons:

What it gives you out of the box:
Protocol translation — It speaks OpenAI-compatible API on the frontend, but can talk to Gemini, Vertex AI, Bedrock, Ollama, and more on the backend. I don't write a single line of provider-specific code.
Backend authentication — API keys are managed at the gateway level. OpenClaw never sees or stores any API key. I just set backendAuth: key: $GEMINI_API_KEY in the config and it handles the rest.
Model aliasing — OpenClaw sends model: "inteli-llm" in every request. AgentGateway silently translates that to qwen2.5-coder:7b, gpt-4o, or gemini-2.5-flash depending on which route matched. The client has no idea.
Observability — Every request gets logged with provider name, model, token counts, and latency. I can see exactly how many tokens are going to OpenAI vs staying local.
Prompt guards & rate limiting — Built-in regex-based PII masking, webhook-based content moderation, and rate limiting. Enterprise-grade features I get for free.
Weighted load balancing & failover — If Ollama crashes (it happens), I can configure automatic failover to a cloud model. No downtime.
What it doesn't do (yet): Content-aware routing. AgentGateway routes based on path, headers, and methods — which is the right design for a gateway. It doesn't peek into your request body to decide where to send it. That's a feature, not a bug — gateways should be fast and protocol-level, not parsing JSON payloads.

But I needed content-aware routing. So instead of searching for another tool, I extended it.

The 50-Line Router That Makes It All Work

I wrote a tiny FastAPI proxy that sits in front of AgentGateway. Here's what it does:

Intercepts the incoming OpenAI-compatible request
Reads the last message in the chat
Classifies intent using simple keyword matching + prompt length heuristics:
- Contains code, python, script, function, bug? → coding
- Contains think, analyze, reasoning, deduce? Or prompt > 400 chars? → reasoning
- Everything else? → simple
Injects an x-intent HTTP header
Forwards the request to AgentGateway untouched

That's it. No ML model for classification. No vector databases. No semantic similarity. Just good old keyword matching that works 90% of the time, and that's good enough for a homelab.

coding_keywords = ["code", "python", "javascript", "bash", "script", "function", "bug"]
reasoning_keywords = ["think", "analyze", "explain in detail", "reasoning", "logic", "deduce"]

# prompt is the text of the last message in the chat
prompt_lower = prompt.lower()

# Coding keywords are checked first; long prompts or reasoning keywords come next
if any(k in prompt_lower for k in coding_keywords):
    intent = "coding"
elif len(prompt) > 400 or any(k in prompt_lower for k in reasoning_keywords):
    intent = "reasoning"
else:
    intent = "simple"
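For anyone who wants to try the classification and header-injection steps without standing up FastAPI, here is a dependency-free Python sketch. The keyword lists and the x-intent header come from the post; classify_intent, route, and the assumption that the last message's content is a plain string are mine.

```python
CODING_KEYWORDS = ["code", "python", "javascript", "bash", "script", "function", "bug"]
REASONING_KEYWORDS = ["think", "analyze", "explain in detail", "reasoning", "logic", "deduce"]


def classify_intent(prompt: str) -> str:
    """Keyword and length heuristic: coding first, then reasoning, else simple."""
    prompt_lower = prompt.lower()
    if any(k in prompt_lower for k in CODING_KEYWORDS):
        return "coding"
    if len(prompt) > 400 or any(k in prompt_lower for k in REASONING_KEYWORDS):
        return "reasoning"
    return "simple"


def route(body: dict, headers: dict) -> dict:
    """Return the headers to forward: the originals plus x-intent."""
    last_message = body["messages"][-1]["content"]  # assumed to be a plain string
    return {**headers, "x-intent": classify_intent(last_message)}


body = {
    "model": "inteli-llm",
    "messages": [{"role": "user", "content": "Write a Python script to parse JSON logs."}],
}
forwarded = route(body, {"content-type": "application/json"})
print(forwarded)

print(classify_intent("Think step by step about the trade-offs between "
                      "event-driven vs polling architecture for my notification system."))
print(classify_intent("What's the capital of Karnataka?"))
print(classify_intent("Give me a description of Hampi"))
```

The last call is the 10% case: "description" contains "script", so plain substring matching labels it as coding. Matching on whole words would avoid that particular false positive at the cost of a little more code.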

The Cost Equation

Here's what this setup actually saves me:

Intent	Model	Where it runs	Cost per 1M tokens
Coding	qwen2.5-coder:7b	Local (Ollama)	$0
Simple Q&A	gemini-2.5-flash	Google Cloud	~$0.15
Deep Reasoning	gpt-4o	OpenAI	~$2.50

Before this setup, every single request was going to a cloud API. Now, roughly 60-70% of my queries stay local — coding questions, quick lookups, simple formatting tasks. They're fast, free, and private.

The expensive reasoning model only gets called when I genuinely need it. And the mid-tier Gemini handles everything in between.

My monthly API bill dropped significantly, and the local responses are actually faster.
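To put a rough number on it, here is the blended-cost arithmetic in Python. The per-token prices are the approximate ones from the table above. The traffic split is hypothetical: 65% local matches the "60-70%" estimate, but the 25% / 10% split between Gemini and GPT-4o is invented purely to show the calculation, so treat the output as an illustration, not a measurement.

```python
PRICE_PER_MILLION = {  # USD per 1M tokens, approximate values from the table
    "local": 0.00,
    "gemini-2.5-flash": 0.15,
    "gpt-4o": 2.50,
}


def blended_cost(mix: dict) -> float:
    """Average cost per 1M tokens for a traffic mix whose shares sum to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * PRICE_PER_MILLION[model] for model, share in mix.items())


routed = {"local": 0.65, "gemini-2.5-flash": 0.25, "gpt-4o": 0.10}  # hypothetical split
everything_on_gpt4o = {"gpt-4o": 1.0}

print(f"routed:      ${blended_cost(routed):.4f} per 1M tokens")
print(f"all gpt-4o:  ${blended_cost(everything_on_gpt4o):.4f} per 1M tokens")
```

Under this made-up split, the reasoning tier still accounts for most of the spend, which is why keeping it for the prompts that genuinely need it matters more than anything else.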

Design Choices & Why They Worked

1. Header-based routing over path-based routing

Initially, I was going to use URL paths (/coding, /reasoning, /simple) and strip them with URL rewriting. But header injection is cleaner: the original request path stays intact, and AgentGateway's header matching is first-class.

2. Classification at the proxy, not the gateway

I could have tried to use AgentGateway's CEL expressions or ExtProc policies for classification. But those run after backend selection, not before. Keeping classification in a separate lightweight layer means I can swap algorithms without touching my gateway config.

3. Keyword heuristics over ML classifiers

Could I use a small classifier model or even RouteLLM for smarter routing? Absolutely. But for a homelab, keyword matching is:

Zero latency overhead
Zero dependencies
Easy to debug (just read the logs)
Surprisingly accurate for my use cases

4. One unified model name

OpenClaw sends model: "inteli-llm" for everything. AgentGateway's modelAliases feature translates it per-route. This means I can swap out backend models without touching a single line of OpenClaw's config. Last week it was gemini-1.5-flash, this week it's gemini-2.5-flash. OpenClaw never knew.

What's Next

Smarter classification — Maybe a tiny local classifier model, or even using the first few tokens of a response to reclassify and retry on a better model.
Metrics dashboard — AgentGateway already emits OpenTelemetry traces. I want to hook up a Grafana dashboard to see which models are handling what, with latency and token breakdowns.
Failover chains — If Ollama is under heavy load, automatically fall back to Gemini for coding tasks. AgentGateway supports priority groups for this.
More agents — OpenClaw is just the beginning. I want to run specialized agents for different domains, all routing through the same gateway.

The Takeaway

You don't need a Kubernetes cluster or a $10K GPU server to build a multi-model AI system. A Raspberry Pi, a Mac Mini, an open-source gateway, and 50 lines of Python got me:

✅ An always-on autonomous agent
✅ Intelligent routing across 3 different LLMs
✅ Local-first for privacy and speed
✅ Cloud when I need the horsepower
✅ Zero API keys exposed to the client
✅ A monthly bill I actually don't mind paying

The best part? The entire config is a single YAML file and a single Python script. No Docker. No Kubernetes. No Terraform. Just two processes on a Mac Mini and an agent on a Pi.

Sometimes the best infrastructure is the one you can explain in a napkin sketch.

If you're building something similar or want to see the config files, drop a comment — happy to share the full setup.

#AI #HomeAssistant #LLM #AgentGateway #Ollama #OpenAI #Gemini #HomeLab #BuildInPublic #MacMini #RaspberryPi #AIEngineering
