I Built Two Ollama Tools I Don't Actually Need Yet

#ollama #opensource #selfhosted #ai

I had one Ollama instance and a question: can I queue inference tasks and route them with priorities? That question turned into two open source tools over about three days, from planning through to shipped repos. I don't fully need either of them yet. I built them anyway.

The starting point

Ollama ships with no authentication. There's an OLLAMA_API_KEY option, but it's a single shared key — every client uses it, there's no per-client separation, and community tools that can't send Authorization: Bearer headers are locked out entirely.

Several services share the same Ollama instance on a dedicated host that homelab-agent is provisioning: LibreChat for interactive chat, a SearXNG MCP server for ML-reranked search, and three background embedding jobs — graphiti, jobsearch-mcp, and memsearch-watch.

The embedding jobs run constantly. Interactive chat waits. I wanted to fix that — give batch workloads a lower priority so interactive requests don't queue behind them.

Before building anything, I used the SearXNG MCP to do a proper competitive search — structured queries, ML reranking, multi-source synthesis. A few tool calls instead of an afternoon of tab-hopping through SEO results.

What the research turned up

The existing tools fell into two camps:

Too little: Basic auth proxies that add a shared key or simple API key checking. No queuing, no per-client control, no routing.

Too much: Rails-heavy tools with VRAM-aware scheduling and GPU memory introspection. Impressive engineering, but overkill for anyone not running a GPU cluster.

The gap: nobody had priority queuing with per-tier depth limits, per-client concurrency caps, and client injection (injecting auth on behalf of clients that can't send headers themselves) — all in one place. LiteLLM has semantic caching and multi-provider routing but no client injection. The simpler proxies have auth but no queue semantics. Nothing had all three.

That gap was worth filling.

The two tools

ollama-queue-proxy

github.com/TadMSTR/ollama-queue-proxy

A smart pool manager for Ollama. Drop it in front of your instance (or fleet), point your consumers at port 11435 instead of 11434, and everything else works as before.

What it adds:

Per-client API keys with priority ceilings. A batch key with max_priority: low that sends X-Queue-Priority: high gets silently capped. The client doesn't know — it just gets queued at its allowed tier.
Three-tier priority queue — high, normal, low — with per-tier depth limits, expiry, and backpressure (429/503 + Retry-After).
Client injection — extra listen ports that inject a fixed identity for clients that can't send Bearer headers. memsearch-watch points at port 11436; the proxy fills in its key and routes it through the same queue with the same priority ceiling.
Model-aware routing — background poller hits /api/tags on each host every 30 seconds, maintains a live model inventory, and routes requests to a host that already has the target model loaded. Avoids cold-start latency when you have multiple Ollama hosts.
Embedding cache — SHA-256 keyed Valkey cache for /api/embed and /api/embeddings. Repeated RAG requests skip the queue and upstream entirely. X-Cache: HIT in the response.
Per-client concurrency caps — a batch client configured with max_concurrent: 2 can't monopolize the worker pool regardless of how many requests it queues.
Failover — mark a host unhealthy on connection failure, retry on the next host, recover automatically via background health checks.
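To make the priority ceiling and per-tier depth limits concrete, here is a minimal Python sketch of the idea. It is not the proxy's actual code: the names cap_priority and TieredQueue, the fallback to normal for unknown header values, and the depth numbers are all assumptions for illustration.

```python
from collections import deque

# Tier order from most to least urgent, matching the three tiers above.
TIERS = ["high", "normal", "low"]


def cap_priority(requested: str, max_priority: str) -> str:
    """Return the requested tier, silently lowered to the client's ceiling."""
    if requested not in TIERS:
        requested = "normal"  # assumption: unknown values fall back to normal
    # A larger index means a lower priority, so take the larger of the two.
    return TIERS[max(TIERS.index(requested), TIERS.index(max_priority))]


class TieredQueue:
    """Three FIFO queues with a depth limit per tier (hypothetical structure)."""

    def __init__(self, depth_limits: dict[str, int]):
        self.depth_limits = depth_limits
        self.queues = {tier: deque() for tier in TIERS}

    def enqueue(self, item, tier: str) -> bool:
        """Return False when the tier is full; a proxy would answer 429/503 here."""
        if len(self.queues[tier]) >= self.depth_limits[tier]:
            return False
        self.queues[tier].append(item)
        return True

    def dequeue(self):
        """Pop the oldest item from the most urgent non-empty tier, or None."""
        for tier in TIERS:
            if self.queues[tier]:
                return self.queues[tier].popleft()
        return None


if __name__ == "__main__":
    q = TieredQueue({"high": 2, "normal": 2, "low": 1})
    # A batch key capped at low asks for high and lands in the low tier.
    q.enqueue("embed-1", cap_priority("high", "low"))
    q.enqueue("chat-1", cap_priority("high", "high"))
    print(q.dequeue())  # chat-1
```

The chat request is served first even though it arrived second, because the batch request was capped to its allowed tier before it was queued.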

Proxy overhead is ~1-2ms. Negligible compared to inference time.

The unique combination is the point. Priority queuing, per-client concurrency caps, and client injection in one tool: I didn't find that anywhere else.
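The embedding cache from the feature list can be sketched the same way. This is a rough illustration only. I'm assuming the key is derived from the endpoint path plus the canonicalized JSON body, a plain dict stands in for Valkey, and the MISS status is my own label (only X-Cache: HIT is part of the proxy's described behavior).

```python
import hashlib
import json


def embedding_cache_key(path: str, body: dict) -> str:
    """Build a deterministic SHA-256 key from the endpoint path and JSON body."""
    # Sorting keys makes the key independent of JSON field order.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(f"{path}\n{canonical}".encode("utf-8")).hexdigest()
    return f"embed:{digest}"  # the "embed:" prefix is an assumption


class EmbeddingCache:
    """In-memory stand-in for the Valkey cache, used only for illustration."""

    def __init__(self):
        self._store: dict[str, list] = {}

    def get_or_compute(self, path: str, body: dict, compute):
        """Return (embedding, status); compute() stands in for the upstream call."""
        key = embedding_cache_key(path, body)
        if key in self._store:
            return self._store[key], "HIT"
        result = compute(body)
        self._store[key] = result
        return result, "MISS"
```

A repeated request with the same model and input returns the stored vector without calling compute again, which is what lets repeated RAG requests skip both the queue and the upstream host.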

OLLAMA_HOST=http://localhost:11435

That's the only change your consumers need. Everything else is transparent.
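For a client that talks to the proxy directly over HTTP and wants to present its own key and a priority hint, the request might look like the sketch below. The key value, model name, and helper name are placeholders I made up. Only the port, the Bearer header, and X-Queue-Priority come from the feature list above.

```python
import json
import urllib.request


def build_embed_request(base_url: str, api_key: str, priority: str,
                        model: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/embed through the proxy."""
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/embed",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # per-client key
            "X-Queue-Priority": priority,          # capped by the key's ceiling
        },
    )


req = build_embed_request(
    "http://localhost:11435",  # proxy port instead of Ollama's 11434
    "example-batch-key",       # placeholder key
    "low",
    "nomic-embed-text",        # placeholder model name
    "hello world",
)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```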

ollama-auth-sidecar

github.com/TadMSTR/ollama-auth-sidecar

Not everyone needs a queue. Some people just want to put auth in front of Ollama without running a full reverse proxy stack.

The sidecar is a single nginx container with a config file. Each consumer gets its own listen port and its own key. Clients that can't send auth headers point at the sidecar instead of Ollama — the sidecar injects the header for them.

services:
  - name: librechat
    listen: 11436
    upstream: http://ollama:11434
    timeout: 300s
    headers:
      Authorization: "Bearer ${LIBRECHAT_KEY}"

  - name: memsearch-watch
    listen: 11437
    upstream: http://ollama:11434
    timeout: 60s
    headers:
      Authorization: "Bearer ${MEMSEARCH_KEY}"

No databases. No dashboards. No processes to keep alive. Container restart is under 1 second. The entrypoint fails fast at startup if any ${ENV_VAR} reference is unset — you don't discover a missing key at request time.
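The fail-fast behavior is easy to picture. The sidecar's real entrypoint isn't shown in this post, so treat the following as a Python sketch of the same idea rather than its implementation. The function name is hypothetical, and treating an empty value as unset is my assumption.

```python
import os
import re

# Matches ${NAME} references like the ones in the config above.
ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_or_fail(text: str, env=None) -> str:
    """Substitute ${VAR} references, exiting if any variable is unset or empty."""
    env = os.environ if env is None else env
    missing = sorted({name for name in ENV_REF.findall(text) if not env.get(name)})
    if missing:
        # Stop at startup instead of discovering the gap at request time.
        raise SystemExit(f"unset environment variables: {', '.join(missing)}")
    return ENV_REF.sub(lambda m: env[m.group(1)], text)
```

Called on the whole config text before nginx starts, a check like this turns a missing LIBRECHAT_KEY into an immediate, named startup error.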

The upstream field is generic — works with any auth-gated HTTP service, not just Ollama. The Ollama-specific framing is for discoverability; the implementation doesn't care.

The relationship between the two tools: If you run the queue proxy, you don't need the sidecar — v0.2.0 bakes client injection in directly. If you only want auth, use the sidecar. Each README links to the other as the appropriate upgrade or downgrade path.

The workflow that made this possible

Both tools went from question to shipped repo in a few days. That's not because I'm fast — it's because each phase of work is handed off to a specialized agent, so I'm not context-switching between research, implementation, security, and docs.

Each phase is a separate Claude Code session with a specific role:

Research agent — competitive analysis using the SearXNG MCP. Structured queries across multiple sources, ML reranking, cross-source synthesis. Found the ecosystem gaps that shaped the feature list.
Build plan — a detailed spec document: repo structure, config schema, phases, success criteria, deployment modes. Written before any code.
Security review (pre-build) — a dedicated pass over the build plan before handoff. For the queue proxy: 7 FLAG items, 0 BLOCKs. For the sidecar: 10 FLAG items across two passes, 0 BLOCKs. All implementation-level — no plan amendments needed, but concrete requirements the dev agent had to address.
Dev agent — picks up the spec and security review, implements, tests, writes docs, tags the release.
Security audit — a separate security agent runs a post-build audit and sends a structured report back. The build agent then picks up the findings and works through them: some fixes, some confirmations, documented reasoning for anything left as-is.
Writer agent — once the build is clean, a writer agent takes over. Docs updated, verified, and cross-referenced across both repos. READMEs, CHANGELOGs, and cross-links between the two tools all done in one pass.

This isn't Copilot autocomplete. Each phase is a full delegated role with a distinct responsibility and a clean handoff. The research agent doesn't touch code. The security reviewer reads both the plan and the finished implementation. The writer agent doesn't make build decisions — it synthesizes and documents what was actually built.

The queue proxy's dev agent shipped all five planned features plus two unplanned improvements (Prometheus label escaping, Dockerfile CMD fix). The sidecar's dev agent shipped the feature, hit real test failures, fixed them methodically, then handed off cleanly to the security and writer passes. Both look like healthy implementation trails from experienced engineers.

The honest close

I have one Ollama instance. I don't need priority queuing. I don't have a second GPU host to route between. The embedding cache is useful, but my embedding jobs aren't running at a volume where cache hits matter much.

I built these anyway for two reasons.

First: the research showed there was a real gap. Tools built to fill ecosystem gaps get used. Tools built to scratch a personal itch stay forks. If I'm going to build something, it might as well be useful to more people than me.

Second: the workflow needed a real test. The research → build plan → security review → dev agent → security audit → writer agent pipeline is what I'm developing inside homelab-agent, with the goal of shipping it as a proper framework on the platform that project is building toward. Running the workflow against a real, non-trivial project (one with actual security considerations, a competitive landscape to map, and multiple phases of implementation) was the validation I needed. The workflow held up for both tools.

I'll grow into the features. I'm adding a second Ollama host to my homelab. The model-aware routing will stop being theoretical the moment I plug it in.

Both repos ship multi-arch Docker images (linux/amd64 and linux/arm64) published to ghcr.io, with CI on every PR.