The Problem: Rate Limits Kill Projects
We’ve all been there. You’re building a bot or research tool, and just when it gets interesting, you hit a rate limit or your credits run out. Everything goes dark, and it's incredibly frustrating.
The fix isn't finding one "perfect" free API. It’s about building a system that treats every provider as a disposable spare part. I built a Go-based gateway that handles 17,500+ requests a day for $0. Here’s how.
The Backstory: Tired of Broken Bots
I didn't actually want to write a Go service; I did it because I was sick of my antispam bot crashing.
I started with Python and n8n, which worked for about five minutes. As traffic grew, the setup crumbled. Free models on OpenRouter changed weekly, and my bot would quit whenever an API vanished. I tried Cloudflare’s AI Gateway, but it disconnected under heavy load. To get 100% uptime on a budget, I had to build a tool I could actually control.
The real hurdle was my hardware: a $3/month VDS with 700MB of RAM. Tools like LiteLLM used half my memory just idling. I needed a lightweight binary that could handle thousands of requests without breaking a sweat.
The Plan: Building a "Meta-Tier"
Instead of relying on one provider, I grouped several free APIs into a "Meta-Tier." If one provider throttles or goes offline, the gateway instantly moves to the next one.
The Capacity Breakdown:
- Groq (Free): ~15,000 Req/Day (Llama 3.3 70B) — Industry-leading inference speed.
- Gemini (AI Studio): 1,500 Req/Day (Gemini 1.5 Flash) — Massive context window.
- OpenRouter: 1,000 Req/Day (GPT-OSS / Qwen) — Access to niche/experimental models.
- Mistral (Exp): Variable Capacity (Mistral Small) — Excellent for complex logic fallback.
Total: 17,500+ Requests for $0.00/month.
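To make the ordering concrete, here's a minimal Go sketch of a meta-tier as an ordered provider list, tried top to bottom. The struct, URLs, and model IDs are illustrative assumptions for this post; the real values belong in the repo's config.yaml.

```go
package main

import "fmt"

// Provider describes one free tier in the meta-tier (illustrative fields only).
type Provider struct {
	Name     string // label used in logs
	BaseURL  string // OpenAI-compatible API root
	Model    string // model requested from this provider
	DailyCap int    // rough free-tier budget (0 = variable/unknown)
}

// Order matters: the rotator tries providers from top to bottom.
var metaTier = []Provider{
	{"groq", "https://api.groq.com/openai/v1", "llama-3.3-70b-versatile", 15000},
	{"gemini", "https://generativelanguage.googleapis.com/v1beta/openai", "gemini-1.5-flash", 1500},
	{"openrouter", "https://openrouter.ai/api/v1", "openai/gpt-oss-120b", 1000},
	{"mistral", "https://api.mistral.ai/v1", "mistral-small-latest", 0},
}

func main() {
	for i, p := range metaTier {
		budget := "variable"
		if p.DailyCap > 0 {
			budget = fmt.Sprintf("~%d req/day", p.DailyCap)
		}
		fmt.Printf("%d. %-11s %-28s %s\n", i+1, p.Name, p.Model, budget)
	}
}
```

The order doubles as a priority: the biggest free budget (Groq) absorbs most of the traffic, and the smaller tiers only see requests when it throttles.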
How the Gateway Works
This is a specialized load balancer designed for LLM-specific failures. Since we want to keep things lean, we skip heavyweight dependencies and stick to a simple, robust request flow (sketched in Go right after the list):
The Request Flow:
- Client (Bot/App) → Sends HTTPS request to Nginx.
- Nginx → Proxies via Unix Socket to the Go Gateway.
- Go Gateway → Performs internal Auth & Token check.
- Sequential Rotator → Picks the first available provider (e.g., Groq).
- Failover Logic → If Provider A returns a 429 (Rate Limit), the Gateway instantly retries with Provider B (Gemini) or Provider C (OpenRouter).
- Logging → Every success and failure is saved as structured JSON for monitoring.
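Here's a rough Go sketch of that request path, assuming a static internal bearer token and the standard library's log/slog for JSON logs. The handler, socket path, and environment variable names are illustrative, not the repo's actual code.

```go
package main

import (
	"errors"
	"log/slog"
	"net"
	"net/http"
	"os"
)

// forward is a stub for the sequential rotator described in the next section.
func forward(w http.ResponseWriter, r *http.Request) (string, error) {
	return "", errors.New("rotator not implemented in this sketch")
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil)) // structured JSON logs

	mux := http.NewServeMux()
	mux.HandleFunc("/v1/chat/completions", func(w http.ResponseWriter, r *http.Request) {
		// Internal auth: compare the caller's bearer token against our own secret.
		if r.Header.Get("Authorization") != "Bearer "+os.Getenv("INTERNAL_TOKEN") {
			logger.Warn("auth failed", "remote", r.RemoteAddr)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		// Hand the request to the provider rotator.
		provider, err := forward(w, r)
		if err != nil {
			logger.Error("all providers failed", "error", err)
			http.Error(w, "upstream exhausted", http.StatusBadGateway)
			return
		}
		logger.Info("request served", "provider", provider)
	})

	// Nginx proxies to this Unix socket instead of a TCP port.
	const socket = "/run/ai-gateway.sock"
	os.Remove(socket) // clear a stale socket from a previous run
	ln, err := net.Listen("unix", socket)
	if err != nil {
		logger.Error("listen failed", "error", err)
		os.Exit(1)
	}
	logger.Info("gateway listening", "socket", socket)
	http.Serve(ln, mux)
}
```

On the Nginx side, a standard proxy_pass to the Unix socket is all that's needed to bridge public HTTPS to the gateway.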
Why Go?
The 700MB RAM limit dictated the architecture. Python is too bloated for this hardware. This Go gateway is a small binary that sips ~15MB of RAM, leaving the rest of the server for your actual apps.
Catching the Errors
The "brain" is a Sequential Rotator that is "429-aware." When a provider returns a rate-limit error, the gateway catches it and retries with the next provider in milliseconds. Your application never sees the failure.
🚀 Get it Running
First off, clone https://github.com/leshchenko1979/ai-gateway.
1. Setup
Copy the example config and add your API keys.
cp config.yaml.example config.yaml
2. Install
Skip Docker to save resources. Use the script to build and install the systemd service.
./install.sh build
./install.sh install-service
sudo systemctl start ai-gateway
3. Remote Deploy
Deploy from your local machine straight to your server.
cp .env.example .env
SSH_HOST=your-server.com ./install.sh deploy
Monitoring
The gateway logs everything in JSON. Run
journalctl -u ai-gateway -f
to watch it swap providers in real time as rate limits are reached.
Try it Out
Once running, the stack works like a single OpenAI-compatible endpoint:
curl http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer YOUR_INTERNAL_TOKEN" \
-d '{"model": "gpt-oss-120b", "messages": [{"role": "user", "content": "Hello!"}]}'
By owning this layer, you've built a private "meta-tier" that’s more reliable than any single API on its own.
See the repo: https://github.com/leshchenko1979/ai-gateway