
stockyard-dev

I built an open source LLM proxy as a single Go binary — here's why

About 18 months ago I started building Stockyard. It's an LLM proxy: you point your apps at it instead of directly at OpenAI or Anthropic or Gemini, and it handles routing, caching, rate limiting, logging, and retries. You can self-host it in under a minute.

The interesting design decision: it ships as a single ~25MB Go binary with embedded SQLite and zero external dependencies.

That choice drives everything else. Here's why I made it, and where it hurts.

Why not Postgres + Redis?

Most infrastructure projects reach for Postgres and Redis by default. It's a reasonable stack. But for an LLM proxy, it creates friction exactly where you don't want it:

  • Deploying means provisioning two additional services
  • "Try it out" becomes a multi-step process
  • On-call incidents now include "is it the proxy or the DB?"

I wanted onboarding to be: download binary, run it, change one URL in your code. That's it. No YAML, no Compose file, no managed database tier.

SQLite handles this surprisingly well for most workloads. LLM proxy traffic is write-heavy (logging requests) but not write-concurrent in ways that stress SQLite. Reads are fast. The database file lives next to the binary. Backups are a file copy.
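As a concrete illustration of why this workload suits SQLite, here is a minimal request-log schema sketch. The table and column names are hypothetical, not Stockyard's actual schema:

```sql
-- Hypothetical request-log table. WAL mode keeps concurrent reads
-- cheap while a single writer appends log rows.
PRAGMA journal_mode = WAL;

CREATE TABLE IF NOT EXISTS request_log (
    id          INTEGER PRIMARY KEY,
    ts          TEXT    NOT NULL DEFAULT (datetime('now')),
    provider    TEXT    NOT NULL,
    model       TEXT    NOT NULL,
    status      INTEGER NOT NULL,
    latency_ms  INTEGER NOT NULL,
    tokens_in   INTEGER,
    tokens_out  INTEGER
);
```

Backups really are a file copy, and the sqlite3 CLI can also snapshot a live database with `.backup` if you'd rather not copy a file mid-write.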

Why Go?

A few reasons, in order of how much they actually mattered:

Static compilation. go build produces a self-contained binary. No runtime, no shared libraries, no "which Python version." The Linux binary runs on Linux. The macOS binary runs on macOS. This isn't magic, it's just how Go works, but it's genuinely useful for distribution.

Goroutines. An LLM proxy is fundamentally a concurrent problem: you're waiting on upstream API calls, you're handling multiple requests at once, you might be streaming. Go's goroutine model handles this without the overhead of threads-per-request. The event loop alternative (Node, Python asyncio) works too, but Go's concurrency is easier to reason about when things go wrong.

Cross-compilation. GOOS=linux GOARCH=arm64 go build just works. I release binaries for Linux amd64, Linux arm64, macOS amd64, macOS arm64, and Windows amd64 from a single CI step. That would be much harder in most other languages.

Boring is good. Go is not exciting. It has no macros, limited generics, and verbose error handling. For infrastructure code, that's a feature. The codebase is readable by anyone who knows Go, and most of it is readable by anyone who programs.

The "change one URL" onboarding

Here's what using Stockyard actually looks like:

```
# Download and run
curl -L https://stockyard.dev/install.sh | sh
stockyard start

# Your code, before:
client = OpenAI(api_key="sk-...")

# Your code, after:
client = OpenAI(api_key="sk-...", base_url="http://localhost:8080/openai")
```

That's the entire migration for OpenAI. Same pattern for Anthropic, Gemini, Mistral, and 37 other providers. The proxy speaks each provider's native API format, so your existing SDK code keeps working.

Lessons from 40+ provider integrations

Supporting 40+ LLM providers taught me things I didn't expect:

API inconsistency is the norm. Every provider has slightly different authentication (Bearer token, x-api-key header, query param, custom header), different streaming formats, different error codes. There's no standard. OpenAI's format is the closest thing to one, and providers that claim "OpenAI-compatible" mean it to varying degrees.

Models come and go fast. I've had to update provider configs for model deprecations more times than I can count. Building this as config-driven rather than hard-coded matters a lot.

Streaming is where bugs hide. Non-streaming request/response is straightforward to proxy. Streaming is not. Different providers use different SSE formats, different done signals, different error injection patterns mid-stream. Robust streaming support took probably 40% of the proxy implementation work.

Documentation is often wrong. Provider API docs lag behind actual behavior. The ground truth is the HTTP traffic.

The honest tradeoff: embedded SQLite = no horizontal scaling

Here's what I won't pretend isn't true: SQLite doesn't scale horizontally.

If you're running a single instance, this doesn't matter at all. If you need to run two instances behind a load balancer and have them share state (for rate limiting, caching), you can't. You'd need to switch to a different persistence layer.

For the majority of Stockyard's users, a single instance handles the load. LLM APIs are slow (typically 200-2,000 ms per request), so you can handle a lot of concurrent requests before you need more than one proxy instance. But if you're at the scale where this matters, you'll need to work around it.

I'm not planning to swap out SQLite for Postgres. The single-binary property is core to the project's identity. If horizontal scaling is a requirement, Stockyard might not be the right tool.

Where it is today

Stockyard is live at stockyard.dev. The proxy core is Apache 2.0. The full platform (dashboard, team features) is BSL 1.1. Source is at github.com/stockyard-dev/Stockyard.

It supports 40+ providers, has a web dashboard for request logging and analytics, handles caching and rate limiting, and ships as that single ~25MB binary.

If you're routing LLM traffic through multiple providers or want visibility into what your app is actually sending to APIs, it might be useful to you.
