I built an API that roasts you, and every response is AI-generated

#python #django #ai #showdev

TL;DR I built Snark, an open-source REST API that serves AI-generated humor. Roasts, brutally honest commit messages, ELI5, corporate jargon, and around 25 other endpoints. Every response is a live LLM call, so you almost never get the same line twice. Here's how it works under the hood.

Why I built it

Most joke APIs hand you a random line from a static list. That's fine for about ten requests, and then you've seen everything it has.

I wanted the opposite. An API where every response is generated fresh by a language model, in the voice of a specific persona, and where the same endpoint almost never repeats a joke on you. Honestly it started as an excuse to learn provider fallback and caching properly, but I ended up using it more than I expected.

Here are a few real responses from the running service:

$ curl -s localhost:8100/v1/wit/commit-message/ | jq -r .response
fix: finally found the typo

$ curl -s localhost:8100/v1/wit/bug-blame/ | jq -r .response
The culprit behind the burnt toast is a rogue toaster wire,
sparked to life by a freak solar flare. Case closed.

$ curl -s "localhost:8100/v1/wit/explain-like-im-5/?q=quantum+physics" | jq -r .response
Quantum physics is like coloring with crayons, but the colors
can be in many places at the same time.

The stack

Nothing exotic here:

Django and Django REST Framework for the API
PostgreSQL for personas and a response log
Redis for caching and per-IP rate limiting
Groq as the default model provider on its free tier, with Gemini and Claude as optional fallbacks
Docker Compose, so a single docker compose up brings the whole thing up with nothing external to provision

Provider fallback is the part I care about

If you lean on a single model provider, it will eventually let you down. Rate limits, content filters, the occasional 500. So Snark doesn't trust any one provider on its own.

Every endpoint runs through one orchestrator. It tries the default provider, and if that goes wrong it walks down a chain:

def _generate_with_fallback(system_prompt, user_prompt, temperature, max_tokens):
    primary = ProviderRegistry.get()
    try:
        return primary.generate(...)
    except ContentFilterError:
        # The model refused. Soften the prompt and retry on the SAME provider
        # before giving up on it.
        softened = system_prompt + "\n\nIMPORTANT: Keep it light and safe..."
        try:
            return primary.generate(system_prompt=softened, temperature=max(temperature - 0.2, 0.3), ...)
        except (ContentFilterError, ProviderError):
            pass
    except ProviderError:
        pass

    # Primary is out. Walk the rest of the chain.
    for fallback in ProviderRegistry.get_fallbacks(exclude=primary.name):
        try:
            return fallback.generate(...)
        except (ContentFilterError, ProviderError):
            continue

    raise ProviderError("All AI providers failed to generate a response")

The bit I'm happiest with is the content-filter branch. When a model refuses, the first instinct is to bail and hit the next provider, but a lot of the time the model just needs a calmer prompt. So before switching anything, it lowers the temperature, appends a "keep it safe" line, and asks the same provider again. That alone rescues a surprising number of requests, and it's cheaper than paying for a second provider's round trip.

The other nice side effect is that a provider is just a class with a generate() method. When I added Claude, it was one new file. The registry handles the ordering and the "don't retry the one that already failed" logic.
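
As a rough sketch of that shape: the names ProviderRegistry.get() and get_fallbacks(exclude=...) come from the snippet above, but everything else here (the register() method, the EchoProvider stand-in, and the order-of-registration rule) is an assumption made for illustration, not Snark's actual code.

```python
class ProviderError(Exception):
    """Raised when a provider cannot produce a response."""


class EchoProvider:
    """Hypothetical stand-in: a provider is anything with a name and generate()."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail

    def generate(self, system_prompt, user_prompt, temperature, max_tokens):
        if self.fail:
            raise ProviderError(f"{self.name} is unavailable")
        return f"[{self.name}] {user_prompt}"


class ProviderRegistry:
    """Keeps providers in priority order; the first one registered is the default."""

    _providers = []

    @classmethod
    def register(cls, provider):
        cls._providers.append(provider)

    @classmethod
    def get(cls):
        return cls._providers[0]

    @classmethod
    def get_fallbacks(cls, exclude):
        # Skip the provider that already failed and keep the rest in order.
        return [p for p in cls._providers if p.name != exclude]


def generate(user_prompt):
    primary = ProviderRegistry.get()
    candidates = [primary, *ProviderRegistry.get_fallbacks(exclude=primary.name)]
    for provider in candidates:
        try:
            return provider.generate("You are witty.", user_prompt, 0.9, 100)
        except ProviderError:
            continue
    raise ProviderError("All AI providers failed to generate a response")


# Simulate the default provider being down.
ProviderRegistry.register(EchoProvider("groq", fail=True))
ProviderRegistry.register(EchoProvider("gemini"))
ProviderRegistry.register(EchoProvider("claude"))

print(generate("roast my PR"))  # [gemini] roast my PR
```

With this layout, adding a provider means writing one class and one register() call; the orchestrator never needs to know which vendors exist.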

Stopping it from repeating itself

A generator that keeps repeating itself feels broken, even if every response is technically a fresh API call. I deal with this in the prompt rather than in code.

Right before each call, Snark grabs the last 10 responses for that persona from the database, trims each to 80 characters, and drops them into the system prompt as a "don't do these again" list:

recent = (
    ResponseLog.objects.filter(persona=persona)
    .order_by("-created_at")
    .values_list("response_text", flat=True)[:ANTI_REPETITION_COUNT]
)

anti_rep = (
    "\n\nIMPORTANT: Do NOT repeat or closely paraphrase any of these "
    "recent responses. Be completely original:\n"
    + "\n".join(f'- "{s[:80]}"' for s in recent)
)

It's cheap, the model doesn't have to remember anything between calls, and in practice it does a good job of keeping things varied.
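
One case the snippet leaves open is a persona with no history yet, where the header would be appended with nothing under it. A minimal sketch of a guard, assuming a hypothetical helper named build_anti_repetition_block that takes the recent responses as a plain list of strings (the real service may already handle this elsewhere):

```python
def build_anti_repetition_block(recent_responses, max_chars=80):
    """Return the prompt suffix, or "" when there is nothing to avoid yet."""
    # Trim each response so a long history cannot bloat the system prompt.
    snippets = [text[:max_chars] for text in recent_responses if text]
    if not snippets:
        return ""
    lines = "\n".join(f'- "{s}"' for s in snippets)
    return (
        "\n\nIMPORTANT: Do NOT repeat or closely paraphrase any of these "
        "recent responses. Be completely original:\n" + lines
    )
```

Keeping this as a pure function also makes it easy to unit test without touching the database or a model.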

Caching without making everything identical

This one's a bit of a contradiction. Caching saves money and latency, but the entire point of the service is that responses are unique. Cache too hard and you've built the static joke list I was trying to avoid.

The middle ground I landed on is to cache by the exact shape of the request, but only for a few minutes. The key is the first 16 hex characters of a SHA-256 over slug:user_input:mood, and the entry expires after five minutes:

import hashlib

def _response_cache_key(slug, user_input, mood):
    # Same slug + input + mood gives the same key; a missing mood hashes as "".
    raw = f"{slug}:{user_input}:{mood or ''}"
    digest = hashlib.sha256(raw.encode()).hexdigest()[:16]
    return f"wit:resp:{digest}"

So if two people hit /v1/wit/roast/dave/?mood=spicy within the same five-minute window, they share one result and I only pay for one call. A different input, or the same input after the entry has expired, still gets something new. Each response carries a cached flag that tells you which one you got:

{ "response": "...", "persona": "The Honest Committer", "cached": false }
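
The read-through flow around that key might look roughly like this. It is a sketch only: TTLCache is a tiny in-memory stand-in for Redis so the example runs on its own, and respond() and CACHE_TTL_SECONDS are hypothetical names, not taken from the Snark codebase.

```python
import time

CACHE_TTL_SECONDS = 300  # five minutes, matching the expiry described above


class TTLCache:
    """In-memory stand-in for a Redis key with an expiry (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired entries are dropped so the next call regenerates.
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._store[key] = (value, self._clock() + ttl)


def respond(cache, key, persona_name, generate):
    """Return a fresh cached payload if there is one, else generate and store."""
    cached_text = cache.get(key)
    if cached_text is not None:
        return {"response": cached_text, "persona": persona_name, "cached": True}
    text = generate()  # the expensive LLM call happens only on a miss
    cache.set(key, text, CACHE_TTL_SECONDS)
    return {"response": text, "persona": persona_name, "cached": False}
```

The clock is injectable so a test can jump past the expiry without sleeping for five minutes.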

Personas instead of hardcoded prompts

Every endpoint maps to a persona that lives in the database. A name, a system prompt, a tone, some rules, and its own temperature and max_tokens. "The Honest Committer" writes the commit messages, "The Feedback Villain" handles code review comments. Adding an endpoint is usually just adding a row, not writing code.

On top of that there's an optional ?mood= parameter (sarcastic, deadpan, unhinged, wholesome, and so on) that overrides the tone, so one persona can say the same thing fifteen different ways.
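
To make the override concrete, here is a minimal sketch using a plain dataclass instead of a Django model. The field names and the build_system_prompt helper are assumptions for illustration, and the persona's rules are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Persona:
    # Hypothetical fields mirroring the description above.
    name: str
    system_prompt: str
    tone: str
    temperature: float = 0.9
    max_tokens: int = 120


def build_system_prompt(persona, mood=None):
    """Use the requested mood as the tone, else fall back to the persona's own."""
    tone = mood or persona.tone
    return f"{persona.system_prompt}\n\nTone: {tone}."


committer = Persona(
    name="The Honest Committer",
    system_prompt="Write a brutally honest git commit message.",
    tone="sarcastic",
)

print(build_system_prompt(committer, mood="deadpan"))
```

Because the mood only swaps one line of the prompt, the persona's prompt, temperature, and max_tokens stay untouched.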

A few things I learned

The one that bit me: my tests mock the model SDKs. That's the right call for fast, deterministic tests, but it also means a green test suite tells you nothing about whether a provider's real API still matches your code. When I started bumping dependencies, I had to check the SDKs structurally, meaning whether the classes and methods my code calls still existed with the same parameters, instead of trusting the checkmark.
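
One way such a structural check could look, as a sketch rather than the check Snark actually runs: use inspect.signature to confirm that a method still accepts the keyword arguments your code passes. FakeClient below is a made-up stand-in; in practice you would point missing_parameters (a hypothetical helper) at the real SDK method your provider class calls.

```python
import inspect

_KEYWORD_KINDS = (
    inspect.Parameter.POSITIONAL_OR_KEYWORD,
    inspect.Parameter.KEYWORD_ONLY,
)


def missing_parameters(func, expected):
    """Return the expected keyword names that func no longer accepts."""
    params = inspect.signature(func).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        # A **kwargs catch-all accepts any name, so nothing can be flagged.
        return []
    accepted = {name for name, p in params.items() if p.kind in _KEYWORD_KINDS}
    return [name for name in expected if name not in accepted]


class FakeClient:
    """Stand-in for a real SDK client class (not a real library)."""

    def create(self, model, messages, temperature=1.0, max_tokens=None):
        raise NotImplementedError


print(missing_parameters(FakeClient.create, ["model", "max_output_tokens"]))
# ['max_output_tokens']
```

This runs without network access or API keys, so it can sit next to the mocked tests. It only inspects signatures, though, so it will not catch methods that hide their parameters behind **kwargs or changes in response shape.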

Bundling Postgres and Redis into the compose file early was worth it too. Once docker compose up brought up everything, the "how do I even run this" questions disappeared.

And the thing that took longest to accept is that the jokes were the easy part. The fallback, the retries, the anti-repetition, the caching, that's what actually makes it feel like a real service rather than a demo.

Try it

It's open source under AGPL-3.0 and free. Calling the endpoints doesn't need any auth or API key. The only key involved is a free Groq key, which you need to run your own instance:

git clone https://github.com/PramodTKodag/snark.git
cd snark && cp .env.example .env   # add a free Groq key
docker compose --profile dev up
curl http://localhost:8100/v1/wit/roast/your-pr/