On March 24, 2026, LiteLLM versions 1.82.7 and 1.82.8 were uploaded to PyPI with a credential harvester, a Kubernetes lateral-movement toolkit, and a persistent remote code execution backdoor baked in.
The malicious package was live for about 40 minutes before PyPI quarantined it.
40 minutes doesn't sound like much. But LiteLLM gets 95 million downloads a month. It's the default multi-provider routing library for anyone building on LLMs. Teams running pip install litellm during that window got compromised automatically. No explicit import needed. The payload triggered on Python interpreter startup via a .pth file.
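The `.pth` trigger is worth understanding: Python's `site` module scans site directories at interpreter startup, and any line in a `.pth` file that begins with `import` is executed as code. A minimal sketch of that mechanism (using `site.addsitedir` to simulate the startup scan; the payload here is a harmless hypothetical, not the actual malware):

```python
import os
import site
import tempfile

# .pth files in site directories are processed at interpreter startup;
# any line beginning with "import" is executed by the site module.
d = tempfile.mkdtemp()
with open(os.path.join(d, "demo.pth"), "w") as f:
    # Hypothetical payload: runs during the directory scan,
    # with no explicit import by the victim's code.
    f.write('import os; os.environ["PTH_RAN"] = "1"\n')

site.addsitedir(d)  # simulates what happens automatically at startup
print(os.environ.get("PTH_RAN"))
```

This is why "we never import litellm in production" was no defense: installation alone was enough.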
Google brought in Mandiant for the investigation. Snyk, Kaspersky, and Trend Micro all published breakdowns. The attack vector: a compromised Trivy security scanner leaked CircleCI credentials, including the PyPI publishing token and a GitHub PAT.
This is not a theoretical risk. This happened.
The real problem is not one attack
LiteLLM does a lot. 2,000+ models across 100+ providers. Proxy server, load balancing, spend tracking, A/B testing, caching, logging, guardrails, prompt management.
That scope is the problem.
A developer on HN described the codebase as having a 7,000+ line utils.py. An engineer with 30 years of experience called it "the worst code I have ever read in my life." Before the supply chain attack, a DEV Community post titled "5 Real Issues With LiteLLM That Are Pushing Teams Away in 2026" was already documenting the trust erosion.
The supply chain attack was the tipping point, not the root cause. The root cause is depending on a massive, opaque library for critical routing infrastructure.
What a simpler design looks like
I ran into the same multi-provider routing problem last year while building Metis, an AI stock analysis tool. Kept burning through Groq's free tier in 20 minutes, switching to Gemini manually, hitting their cap, switching again.
Built FreeLLM to stop doing that manually. It solves a narrower problem than LiteLLM, and that's the point.
FreeLLM is an OpenAI-compatible gateway that routes across Groq, Gemini, Mistral, Cerebras, NVIDIA NIM, and Ollama. When one provider rate-limits, the next one answers. That's the core of it.
What it does
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "free-fast", "messages": [{"role": "user", "content": "Hello!"}]}'
Your existing OpenAI SDK code works. Swap the base URL. Keep your code.
Three meta-models handle routing: free-fast (lowest latency, usually Groq/Cerebras), free-smart (best reasoning, usually Gemini 2.5 Pro), and free (max availability).
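Under the hood this is priority-ordered failover: each meta-model maps to an ordered list of providers, and the first one that answers wins. A minimal sketch of the idea (toy provider callables, not FreeLLM's actual code):

```python
class RateLimited(Exception):
    pass

def route(meta_model, providers, prompt):
    """Try providers in priority order; fall through on rate limits."""
    for name, call in providers[meta_model]:
        try:
            return name, call(prompt)
        except RateLimited:
            continue  # this provider's quota is exhausted; try the next
    raise RuntimeError("all providers failed")

# Toy providers: Groq is rate-limited, Cerebras answers.
def groq(prompt):
    raise RateLimited()

def cerebras(prompt):
    return f"echo: {prompt}"

providers = {"free-fast": [("groq", groq), ("cerebras", cerebras)]}
name, answer = route("free-fast", providers, "Hello!")
print(name, answer)  # → cerebras echo: Hello!
```

The caller never sees the rate limit; it just gets an answer from whichever provider was next in line.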
What it fixes that LiteLLM doesn't
Gemini 2.5 reasoning tokens eating your output. This is one of the most reported Gemini bugs right now. Gemini 2.5 Flash and Pro are reasoning models. They burn 90-98% of your max_tokens on internal thinking before producing visible text. Ask for 1,000 tokens and you get back 37. There are 15+ open GitHub issues about this across multiple SDKs.
FreeLLM fixes it at the gateway. Flash gets reasoning_effort: "none" by default. Pro gets "low". Your full token budget goes to the actual answer. Override per-request if you want the reasoning back.
Provider outages don't break your app. Claude went down for three consecutive days in early April. 8,000+ Downdetector reports. If your app depends on one provider, that's three days of broken service. FreeLLM's circuit breakers pull failing providers from rotation and test for recovery automatically.
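The circuit-breaker pattern behind this is simple: after N consecutive failures a provider is pulled from rotation ("open"), then retried after a cooldown ("half-open"). A minimal sketch with illustrative thresholds (not FreeLLM's actual implementation):

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a probe
    request again once `cooldown` seconds have passed."""
    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def available(self, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        # half-open: permit a probe request after the cooldown elapses
        return now - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now=None):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic() if now is None else now

cb = CircuitBreaker(threshold=2, cooldown=10.0)
cb.record_failure(now=0.0)
cb.record_failure(now=0.0)    # threshold hit: circuit opens
print(cb.available(now=5.0))  # → False (still cooling down)
print(cb.available(now=15.0)) # → True  (half-open probe allowed)
```

Combined with the failover routing above, a three-day outage degrades one provider out of six instead of taking your app down.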
Response caching without a separate layer. Identical prompts return in ~23ms with zero quota burn. The cache refuses to store truncated responses (another Gemini bug: reasoning models returning cut-off output that then poisons your cache for an hour).
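The truncation guard is the important part. A sketch of a TTL cache that refuses incomplete completions, assuming the OpenAI-style `finish_reason` convention ("stop" means the model finished naturally, "length" means it hit the token cap):

```python
import time

class ResponseCache:
    """TTL cache that never stores truncated completions, so a
    cut-off response can't poison the cache for the whole window."""
    def __init__(self, ttl=3600.0):
        self.ttl = ttl
        self.store = {}

    def put(self, prompt, text, finish_reason, now=None):
        if finish_reason != "stop":
            return False  # truncated or aborted: refuse to cache
        now = time.monotonic() if now is None else now
        self.store[prompt] = (text, now)
        return True

    def get(self, prompt, now=None):
        now = time.monotonic() if now is None else now
        hit = self.store.get(prompt)
        if hit and now - hit[1] < self.ttl:
            return hit[0]
        return None

cache = ResponseCache(ttl=3600.0)
print(cache.put("hi", "partial ans", "length", now=0.0))  # → False
print(cache.put("hi", "full answer", "stop", now=0.0))    # → True
print(cache.get("hi", now=1.0))                           # → full answer
```

Without that one `finish_reason` check, a single truncated Gemini response would be served to every identical prompt for an hour.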
Browser-safe tokens for static sites. Mint a short-lived HMAC-signed token from a serverless function, pass it to the browser, call the gateway directly from client-side JavaScript. No auth backend. No session store.
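The minting/verifying halves can be sketched in a few lines. This is an illustrative shape (function names and payload fields are hypothetical, not FreeLLM's actual API): the serverless function signs an expiry timestamp with a server-side secret, and the gateway verifies the signature in constant time before checking expiry.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"server-side-secret"  # never shipped to the browser

def mint_token(ttl=300, secret=SECRET):
    """Runs in the serverless function: sign an expiry timestamp."""
    payload = json.dumps({"exp": int(time.time()) + ttl}).encode()
    sig = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + sig

def verify_token(token, secret=SECRET):
    """Runs in the gateway: constant-time signature check, then expiry."""
    body, _, sig = token.partition(".")
    payload = base64.urlsafe_b64decode(body.encode())
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    return json.loads(payload)["exp"] > time.time()

token = mint_token(ttl=300)
print(verify_token(token))                   # → True
print(verify_token(token, secret=b"wrong"))  # → False
```

The browser only ever holds a token that expires in minutes; the provider API keys stay on the gateway.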
Key stacking: 360 free requests per minute
Every provider env var accepts a comma-separated list. FreeLLM rotates round-robin per key.
GROQ_API_KEY=gsk_key1,gsk_key2,gsk_key3
GEMINI_API_KEY=AI_key1,AI_key2,AI_key3
Stack 3 keys across 5 cloud providers: ~360 req/min. All free. Enough to prototype an entire product without spending anything.
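Round-robin rotation over a comma-separated key list is a one-liner with `itertools.cycle`. A sketch of the idea (the env parsing shape is assumed from the config above, not lifted from FreeLLM's source):

```python
from itertools import cycle

def make_rotors(env):
    """One infinite round-robin iterator per provider env var."""
    return {provider: cycle(raw.split(","))
            for provider, raw in env.items()}

env = {
    "GROQ_API_KEY": "gsk_key1,gsk_key2,gsk_key3",
    "GEMINI_API_KEY": "AI_key1,AI_key2,AI_key3",
}
rotors = make_rotors(env)
print(next(rotors["GROQ_API_KEY"]))  # → gsk_key1
print(next(rotors["GROQ_API_KEY"]))  # → gsk_key2
print(next(rotors["GROQ_API_KEY"]))  # → gsk_key3
print(next(rotors["GROQ_API_KEY"]))  # → gsk_key1  (wraps around)
```

Each provider's per-key rate limit effectively multiplies by the number of keys you stack.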
Get it running
Docker:
docker run -d -p 3000:3000 \
-e GROQ_API_KEY=gsk_... \
-e GEMINI_API_KEY=AI... \
ghcr.io/devansh-365/freellm:latest
Or one-click deploy on Railway or Render (buttons in the README).
Use it from Python:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="unused")
response = client.chat.completions.create(
    model="free-smart",
    messages=[{"role": "user", "content": "Explain circuit breakers"}]
)
print(response.choices[0].message.content)
TypeScript, Go, Ruby, anything that speaks OpenAI. Same pattern.
Why this matters beyond FreeLLM
The LiteLLM attack exposed something the community already suspected: critical AI infrastructure is running on libraries nobody audits.
The fix is not "use my tool instead." The fix is smaller dependencies, pinned versions, codebases you can read in an afternoon. FreeLLM is 262 tests across 22 files. TypeScript, not Python. Docker images with pinned deps. MIT licensed.
If you don't use FreeLLM, build something similarly scoped. The era of "install this 100-provider mega-library and trust it with your API keys" should be over.
262 tests. 6 providers. One endpoint. Zero cost.
GitHub: github.com/devansh-365/freellm