Turn ~800M Free AI Tokens Into a Single OpenAI-Compatible API with FreeLLMAPI

#ai #llm #opensource #devtools

The Problem Nobody Talks About

Every major AI lab now offers a free tier. Gemini, Groq, Mistral, Cerebras — they all give you a few million tokens a month, a few thousand requests a day.

On paper, that's generous. In practice, you end up juggling 14 different SDKs, 14 rate limits, and 14 places a request can silently fail.

FreeLLMAPI solves exactly that.

What It Does

It's a self-hosted proxy that aggregates free tiers from 14 providers behind a single /v1/chat/completions endpoint, compatible with the OpenAI SDK for text chat completions.

Supported providers include:

Provider	Notable Models
Google Gemini	2.5 Pro / Flash
Groq	Llama 4, Qwen, Kimi
Cerebras	Llama 3.3, Qwen
SambaNova	Llama 3.3 70B
NVIDIA NIM	Full catalog
Mistral	La Plateforme
OpenRouter	Free-tier models
GitHub Models	GPT-4o, Llama, Phi
Hugging Face	Inference Providers
Cloudflare	Workers AI
Zhipu	GLM-4 series
Moonshot	Kimi
MiniMax	abab / hailuo

Combined: roughly 800M tokens/month across all providers.

Zero Code Changes

Point your existing OpenAI SDK at localhost:3001/v1:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3001/v1",
    api_key="freellmapi-your-unified-key",
)

# with_raw_response exposes the HTTP headers; .parse() returns the usual ChatCompletion
raw = client.chat.completions.with_raw_response.create(
    model="auto",  # router picks the best available
    messages=[{"role": "user", "content": "Summarise the fall of Rome in one sentence."}],
)
resp = raw.parse()

print(resp.choices[0].message.content)
print("Routed via:", raw.headers.get("x-routed-via"))

That's it. Every response includes an X-Routed-Via header so you know which provider actually served the request.

Technical Highlights

Automatic failover — On 429 / timeout / 5xx, the router cools down the key and retries the next provider in your chain, up to 20 attempts.
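The failover behaviour described above can be sketched roughly as follows. This is not the project's code, only an illustration: `RetryableError`, `route_with_failover`, and the 60-second cooldown are hypothetical names and values chosen for the example. Only the 20-attempt ceiling comes from the description.

```python
import time
from typing import Callable, Dict, List, Tuple


class RetryableError(Exception):
    """Hypothetical stand-in for a 429, timeout, or 5xx from a provider."""


def route_with_failover(
    chain: List[str],
    call: Callable[[str], str],
    cooldown_until: Dict[str, float],
    cooldown_s: float = 60.0,   # assumed value, not taken from the project
    max_attempts: int = 20,     # ceiling mentioned in the text
    now: Callable[[], float] = time.monotonic,
) -> Tuple[str, str]:
    """Try providers in chain order, skipping any that are cooling down."""
    attempts = 0
    for provider in chain:
        if attempts >= max_attempts:
            break
        if cooldown_until.get(provider, 0.0) > now():
            continue  # still cooling down; skipping does not use an attempt
        attempts += 1
        try:
            return provider, call(provider)
        except RetryableError:
            # Park this provider for a while and move on to the next one.
            cooldown_until[provider] = now() + cooldown_s
    raise RuntimeError("all providers failed or are cooling down")
```

Passing `now` as a parameter keeps the sketch easy to test with a fake clock. A real router would track cooldowns per key rather than per provider name.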

Sticky sessions — Multi-turn conversations stay on the same model for 30 minutes. This matters more than it sounds: switching models mid-conversation can produce inconsistent answers, because each model interprets the earlier turns differently.
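One way to picture sticky sessions is a small map from a session identifier to a pinned model with an expiry time. The sketch below is an assumption-heavy illustration: the `StickySessions` class, the way a session ID is obtained, and the sliding (refreshed on every turn) expiry are all hypothetical. Only the 30-minute window comes from the description.

```python
import time
from typing import Callable, Dict, Tuple

STICKY_TTL_S = 30 * 60  # 30 minutes, as stated in the text


class StickySessions:
    """Pins each session to one model until the pin expires."""

    def __init__(self, ttl_s: float = STICKY_TTL_S,
                 now: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._now = now
        # session_id -> (model, expires_at)
        self._pins: Dict[str, Tuple[str, float]] = {}

    def model_for(self, session_id: str, pick_model: Callable[[], str]) -> str:
        pinned = self._pins.get(session_id)
        if pinned is not None and pinned[1] > self._now():
            model = pinned[0]        # pin still valid: reuse the same model
        else:
            model = pick_model()     # no pin or expired: let the router choose
        # Assumption: every turn refreshes the 30-minute window.
        self._pins[session_id] = (model, self._now() + self._ttl_s)
        return model
```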

Per-key rate tracking — RPM, RPD, TPM, and TPD counters per (platform, model, key). The router always picks a key that's under its caps.
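A minimal sketch of per-key tracking, assuming rolling 60-second and 24-hour windows for simplicity (real provider quotas may reset at fixed times instead). `Caps`, `RateTracker`, and the method names are hypothetical and not taken from the project.

```python
import time
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

KeyId = Tuple[str, str, str]  # (platform, model, key)


@dataclass
class Caps:
    rpm: int  # requests per minute
    rpd: int  # requests per day
    tpm: int  # tokens per minute
    tpd: int  # tokens per day


class RateTracker:
    def __init__(self, now: Callable[[], float] = time.time) -> None:
        self._now = now
        # Per key: list of (timestamp, tokens used by that request)
        self._events: Dict[KeyId, List[Tuple[float, int]]] = defaultdict(list)

    def record(self, key: KeyId, tokens: int) -> None:
        self._events[key].append((self._now(), tokens))

    def _usage(self, key: KeyId, window_s: float) -> Tuple[int, int]:
        """Return (request count, token count) inside the rolling window."""
        cutoff = self._now() - window_s
        recent = [tok for ts, tok in self._events[key] if ts > cutoff]
        return len(recent), sum(recent)

    def under_caps(self, key: KeyId, caps: Caps) -> bool:
        rpm, tpm = self._usage(key, 60)
        rpd, tpd = self._usage(key, 86_400)
        return (rpm < caps.rpm and rpd < caps.rpd
                and tpm < caps.tpm and tpd < caps.tpd)

    def pick(self, candidates: Dict[KeyId, Caps]) -> Optional[KeyId]:
        """First candidate (in insertion order) that is under all four caps."""
        for key, caps in candidates.items():
            if self.under_caps(key, caps):
                return key
        return None
```

Storing raw events keeps the example short. A long-running process would prune old events or use fixed-size counters instead.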

Encrypted key storage — Keys are encrypted with AES-256-GCM before being written to SQLite. Upstream provider keys are stored only on your machine and are sent only to the provider they belong to.

Admin dashboard — React + Vite UI to manage keys, reorder the fallback chain, inspect analytics, and test prompts in a playground.

Lightweight — Runs on a Raspberry Pi 4 at ~40MB RAM idle.

Setup in 3 Lines

git clone https://github.com/tashfeenahmed/freellmapi
cd freellmapi && npm install
cp .env.example .env && npm run dev

Open localhost:5173, add your provider API keys, grab your unified key → done.
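To check the setup without installing the OpenAI SDK, a standard-library smoke test along these lines should work. It assumes the API listens on localhost:3001 (as in the SDK example) and that the unified key is sent as a Bearer token, which is how the OpenAI SDK sends its `api_key`. The function names are made up for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3001/v1"  # API port used in the SDK example


def build_chat_request(prompt: str, api_key: str,
                       model: str = "auto") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the local proxy."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


def smoke_test(api_key: str) -> None:
    """Send one prompt and print the reply plus the routing header."""
    req = build_chat_request("Say hello.", api_key)
    with urllib.request.urlopen(req, timeout=60) as resp:
        payload = json.load(resp)
        print(payload["choices"][0]["message"]["content"])
        print("Routed via:", resp.headers.get("X-Routed-Via"))
```

Call `smoke_test("freellmapi-your-unified-key")` once the dev server is running and at least one provider key has been added.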

The Honest Part

A few things the README says clearly, and you should know upfront:

Answer quality degrades throughout the day. Gemini 2.5 Pro and GPT-4o (via GitHub Models) have the lowest daily caps. Once they're exhausted, the router falls back to smaller models. Expect effective quality to drop late in the UTC day, then recover when the daily quotas reset at UTC midnight.

Tool calling and vision are not yet supported. Text-only for now. PRs are welcome.

Latency is unpredictable. Cerebras and Groq are extremely fast. Others are not. You get whichever one is available.

Personal use only. No multi-tenant auth. Don't expose this to the internet.

Free tiers change without notice. When a provider tightens limits, you'll see 429s until the catalog is updated.

Who This Is For

✅ Building AI agents or coding assistants and want to prototype without spending money upfront

✅ Researchers and students who hit rate limits on one provider and want seamless fallback

✅ Anyone tired of maintaining multiple SDK integrations

❌ Production workloads — use a paid API with an SLA

Quick ToS Note

The project includes a detailed review of each provider's terms. Most are fine for single-user personal use. Notable exceptions: Cohere's trial ToS explicitly forbids personal/household use, and NVIDIA NIM's free tier is scoped to evaluation only.

Read the full table in the README before adding keys.

FreeLLMAPI is MIT licensed and actively welcoming contributors — especially for adding embeddings, tool calling, and new providers.

→ github.com/tashfeenahmed/freellmapi