When you use a cloud LLM API, you get usage data for free. Token counts, latency, cost per request, all tracked and queryable. When you run Ollama locally, you get a response and nothing else. No logs, no token counts, no way to tell which of your services is eating all the inference time.
I had 5 services and a workflow engine all hitting the same Ollama instance on a Mac mini with 24GB of RAM. I couldn't tell which service was the heaviest consumer, whether my model-swapping strategy was working, or if specific workflows were making redundant calls. I was flying blind.
So I built a transparent proxy that sits between my services and Ollama, logs every inference call with full token counts and timing, and streams responses through without adding any latency. Then I open-sourced it.
The Concept
Point your apps at the proxy (port 11433) instead of Ollama directly (port 11434). The proxy forwards everything transparently. For inference calls, it records what model was used, how many tokens were consumed, how long it took, and which service made the request. Health checks and model listing get passed through without logging.
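The routing decision above can be sketched in a few lines. This is an illustration of the idea, not the proxy's actual code: the endpoint paths are Ollama's real API routes, but the function and set names are mine.

```python
# Ollama endpoints whose requests are worth logging: the inference calls.
# Everything else (health checks, /api/tags model listing) passes through.
LOGGED_ENDPOINTS = {"/api/generate", "/api/chat", "/api/embed", "/api/embeddings"}

def should_log(path: str) -> bool:
    """Return True for inference calls that should be recorded."""
    return path in LOGGED_ENDPOINTS

print(should_log("/api/chat"))   # True  -> record model, tokens, timing
print(should_log("/api/tags"))   # False -> forward without logging
```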
Response streaming works exactly as it does with Ollama directly. The proxy never buffers a full response before forwarding, so clients see tokens appearing in real time during long generations. Token counting happens on a copy of the stream after the response completes.
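A minimal sketch of that tee pattern, not the proxy's actual implementation: each chunk is yielded to the client the moment it arrives, and only after the stream ends is the retained copy inspected. The `prompt_eval_count` and `eval_count` fields are what Ollama reports in the final `"done": true` chunk of its NDJSON stream; the fake chunks below are illustrative.

```python
import json

def tee_stream(chunks, record):
    """Forward NDJSON chunks unmodified while keeping a copy; once the
    stream ends, pull token counts out of the final 'done' chunk."""
    seen = []
    for chunk in chunks:
        seen.append(chunk)
        yield chunk  # client sees each chunk immediately, no buffering
    # Stream finished: parse the copy without having delayed the client.
    for line in seen:
        obj = json.loads(line)
        if obj.get("done"):
            record["prompt_tokens"] = obj.get("prompt_eval_count", 0)
            record["output_tokens"] = obj.get("eval_count", 0)

# Fake Ollama-style stream for illustration
stream = [
    b'{"response": "Hel", "done": false}',
    b'{"response": "lo", "done": false}',
    b'{"response": "", "done": true, "prompt_eval_count": 12, "eval_count": 2}',
]
stats = {}
forwarded = list(tee_stream(stream, stats))
print(stats)  # {'prompt_tokens': 12, 'output_tokens': 2}
```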
What the Data Actually Told Me
The reason I'm writing about this isn't the proxy itself. It's what I learned once I had visibility into my LLM traffic.
Model swap overhead is measurable. I run two models on a machine with only enough memory for one at a time. They swap via a 10-second keep-alive timeout. The proxy showed me that the first request after a model swap is consistently 3-4x slower than subsequent requests. That's the cold-start penalty of loading a 9GB model into memory. Before the proxy, I assumed model swapping was "fast enough." Now I have actual numbers and can make informed decisions about keep-alive timing.
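Those numbers turn keep-alive into a tunable knob. `keep_alive` is a real field in Ollama's generate/chat API; the model name and prompt below are made-up examples of how a per-request policy might look.

```python
# Two request payloads differing only in keep_alive policy: evict fast
# to free memory for the other model, or stay resident to dodge the
# 3-4x cold-start penalty on the next call.
fast_swap = {
    "model": "llama3.1:8b",             # example model name
    "prompt": "Classify this email ...",
    "keep_alive": "10s",                # unload quickly after use
}
sticky = {**fast_swap, "keep_alive": "5m"}  # keep loaded between bursts
```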
One workflow was consuming 40% of all inference. The email classification pipeline was the heaviest consumer by a wide margin. Seeing the actual numbers made it the obvious first target for optimization. I restructured the batching to reduce total calls, and the overall daily inference time dropped noticeably. Without per-caller breakdown, I would have been optimizing the wrong things.
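This is the kind of per-caller query a SQLite request log makes trivial. The schema here is hypothetical (illustrative column names and numbers, not the tool's actual schema), but the shape of the analysis is the same.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE requests (
    service TEXT, model TEXT, total_tokens INTEGER, duration_ms REAL)""")
db.executemany("INSERT INTO requests VALUES (?, ?, ?, ?)", [
    ("email-classifier", "llama3.1:8b", 900, 4200.0),
    ("email-classifier", "llama3.1:8b", 850, 3900.0),
    ("search", "nomic-embed-text", 40, 80.0),
    ("notes", "llama3.1:8b", 300, 1500.0),
])

# Per-caller breakdown: which service is eating the inference time?
for service, ms in db.execute(
        "SELECT service, SUM(duration_ms) FROM requests "
        "GROUP BY service ORDER BY 2 DESC"):
    print(f"{service}: {ms:.0f} ms")
```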
Embedding calls are almost free. My embedding model returns in under 100ms consistently. I had been conservative about batching embeddings because I assumed they were competing for resources with the larger models. They weren't. The proxy data gave me confidence to run embeddings more aggressively.
None of this was visible before. I was making assumptions about my own infrastructure that turned out to be wrong.
What's In the Repo
The tool is designed to be useful with zero configuration (install, run, point your apps at it, logs go to SQLite) but flexible enough for production setups. It ships with:
- 4 storage backends: SQLite (default), PostgreSQL, JSONL files, or stdout
- A built-in dashboard: vanilla HTML and Chart.js, no npm, no build step
- Prometheus metrics: request counts, token totals, and duration histograms, ready for Grafana
- A pre-built Grafana dashboard: included in the repo with a Docker Compose file that wires everything together
- An extensible backend protocol: implement three Python methods and you can send logs anywhere
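To give a feel for the last point, here is what a three-method backend protocol could look like. The method names (`start`, `write`, `close`) are my guesses, not the tool's actual interface; the takeaway is how little a custom backend needs to implement.

```python
from typing import Protocol

class LogBackend(Protocol):
    def start(self) -> None: ...            # open connections or files
    def write(self, record: dict) -> None: ...  # persist one request record
    def close(self) -> None: ...            # flush and release resources

class MemoryBackend:
    """Toy backend that collects records in memory; a real one might
    POST to an HTTP endpoint or append JSONL to a file."""
    def __init__(self) -> None:
        self.records: list[dict] = []
    def start(self) -> None:
        pass
    def write(self, record: dict) -> None:
        self.records.append(record)
    def close(self) -> None:
        pass

backend = MemoryBackend()
backend.start()
backend.write({"service": "demo", "model": "llama3.1:8b", "tokens": 42})
backend.close()
print(len(backend.records))  # 1
```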
The README has the full setup guide, CLI reference, and Docker instructions. I won't repeat all of that here.
Why I Open-Sourced It
This started as an internal tool: a single Python script that logged Ollama calls to a PostgreSQL table. It grew into something more general because the problem isn't specific to my setup. Anyone running Ollama for real workloads (not just chatting with a model in a terminal) eventually needs to know what's happening at the request level.
The local LLM ecosystem has great tools for running models and great tools for building applications on top of them. The observability layer in between is mostly missing. This is one piece of that.
If you run Ollama and you've ever wondered "which of my services is actually using the most tokens," give it a try. The README walks through everything from a zero-config SQLite setup to a full Docker + Grafana stack.