Harshal Mehta

Posted on Mar 30

I Built EvalGuard: A LLM Security & Evaluation Platform

#ai #cybersecurity #rag

I Built EvalGuard: A Full-Stack LLM Security & Evaluation Platform

After spending some days watching teams ship AI features with basically zero confidence in how their models would behave under adversarial conditions, I decided to build the tool I kept wishing existed.

EvalGuard is a full-stack LLM security and evaluation platform. Think Promptfoo meets Datadog — but purpose-built for AI teams who need more than vibe checks before deploying to production.

This post covers the architecture, the interesting technical decisions, and what I learned building this end-to-end as a solo project.

What EvalGuard Does

At a high level, EvalGuard gives teams three things:

1. Eval Suites — run structured evaluations across multiple LLM providers (OpenAI, Anthropic, Google AI, Groq) side by side. Compare GPT-4o vs Claude vs Llama on the same test cases with 7 different scoring metrics.

2. Red-Teaming — attack your own models before someone else does. 50+ attack templates across 5 categories: prompt injection, jailbreaking, PII leakage, bias, and toxicity. Beyond static templates, EvalGuard uses an LLM to dynamically generate adversarial prompts tailored to your model's specific system prompt.

3. Agent Monitoring — real-time tracing at the span level with automatic policy violation detection. If your agent does something it shouldn't, you know immediately.

Everything sits behind org-level plans with usage tracking, rate limiting, and CI/CD integration via GitHub Actions or CLI.

Architecture Overview

The system is split into 5 layers:

Clients → Frontend & Auth → API Layer → Processing → Data & Providers

Clients

Three ways to interact with the platform:

Dashboard — Next.js web UI
CLI — built with Typer, talks to the API directly
Python SDK — async, built on httpx, with LangChain integration

Frontend & Auth

Next.js 15 + React + Tailwind + shadcn/ui
Clerk handles auth — JWT + JWKS, multi-tenant out of the box

Clerk was genuinely the right call here. Multi-tenancy with org-level access control would have taken weeks to build from scratch. Offloading that entirely let me focus on the actual product.

API Layer (FastAPI)

The backend is a FastAPI app with a middleware chain: CORS → Auth → Org Guard → Rate Limit, before any request hits a route.

Routes are grouped into: Suites, Runs, Red-Team, Agents, Reports, Billing, CI/CD, Keys.

SQLAlchemy 2.0 async with Pydantic validation throughout. The async SQLAlchemy shift was worth it — under load, the difference is noticeable.

Processing (Celery Workers)

This is where the interesting stuff happens. Three main workers, but before getting into what each does — why Celery at all?

LLM calls are slow. A single eval run might involve dozens of API calls to external providers, each taking 2–10 seconds. Doing that synchronously in a FastAPI request would mean holding HTTP connections open for minutes, timeouts everywhere, and zero visibility into progress. The answer is obvious: push the work onto a queue and process it asynchronously.

The architecture looks like this:

FastAPI → Redis (broker) → Celery Workers → PostgreSQL (results)
    ↑                                              ↓
    └─────────────── Status polling ───────────────┘

When a user triggers an eval run or a red-team campaign, FastAPI creates a DB record, pushes a task onto the Redis queue, and immediately returns a run_id to the client. The frontend polls for status updates. Workers pick up tasks, do the heavy lifting, and write results back to Postgres as they complete.

Redis here is doing double duty — it's both the Celery broker (task queue) and the result backend (where task state gets written). That's a deliberate choice to keep the infra footprint small rather than introducing a separate message broker.

Each worker type runs in its own Celery queue, so you can scale them independently:

Eval Runner — evalguard.eval queue

Test Case → LLM Call → Score

Test cases within a suite are fanned out as individual Celery tasks using group() so they run in parallel across workers. Results get aggregated back with a chord callback that writes the final suite summary once all cases complete. This means a 50-case eval suite doesn't run sequentially — it saturates however many workers you have.

from celery import group, chord
from app.tasks import run_test_case, finalize_eval_suite

# Fan out all test cases in parallel, aggregate when all complete
job = chord(
    group(
        run_test_case.s(case_id=case.id, run_id=run.id)
        for case in suite.test_cases
    ),
    finalize_eval_suite.s(run_id=run.id)
)

job.apply_async()

group() fires all test cases in parallel across available workers. chord() holds the finalize callback until every task in the group has a result — that's where pass/fail rates, aggregate scores, and the final run status get computed and written to Postgres.

Red-Team Runner — evalguard.redteam queue

Attack → Target → Judge → Risk

Red-team runs are more sequential by nature — generate attack, hit target, judge response, score risk — so these use Celery chains. Each attack prompt is its own chain, but multiple chains run concurrently across the queue. The judge step is where LLM-as-a-Judge fires, adding latency but also adding the signal that makes red-team results actually meaningful.

Report Generator — evalguard.reports queue

Query → Jinja2 → PDF

Separated into its own queue specifically so report generation never competes with eval or red-team capacity. Reports can be slow (big DB queries, PDF rendering) and you don't want one large report export starving active eval runs of workers.

Workers have graceful shutdown configured — on deploy or restart, Celery's SIGTERM handling lets in-flight tasks finish before the process exits. Without this, a worker restart mid-eval would silently drop results and leave runs stuck in a RUNNING state forever.

Core Services (inside Processing)

LiteLLM Router — unified interface across all providers. Swapping models is one config change.
Scorer Engine — 7 scoring metrics, composable per eval suite
LLM-as-a-Judge — secondary model evaluates target model responses for safety and accuracy
Attack Generator — uses an LLM to craft adversarial inputs from the target model's system prompt
Policy Engine — defines and enforces rules for agent monitoring

Data & Providers

PostgreSQL 16 — 12 tables, managed on Render
Redis 7 — task queue + rate limiting, managed on Render
Cloudflare R2 — report storage

The AI-Powered Parts

This is where EvalGuard gets self-referential in a fun way.

LLM-as-a-Judge

Static scoring (exact match, regex) only gets you so far. For safety evaluation especially, you need semantic understanding. EvalGuard uses a secondary LLM to evaluate whether the target model's response was actually safe and appropriate — not just syntactically correct.

The judge prompt is structured around the eval category. A PII leakage judge looks for different signals than a toxicity judge.

Dynamic Attack Generation

The 50+ static templates are a starting point, not a ceiling. Given a target model's system prompt, EvalGuard calls an LLM to generate novel adversarial inputs specific to that model's context. A customer support bot and a code assistant have completely different attack surfaces — the generator accounts for that.

Semantic Similarity Scoring

For cases where you want to measure how close an output is to an expected answer without requiring exact matches, EvalGuard uses sentence-transformers embeddings to compute cosine similarity. Useful for evaluating open-ended responses where wording varies but meaning should be consistent.

Deployment: Render Blueprint + Docker

The entire infra is defined in a render.yaml Blueprint file:

Frontend — Node web service
Backend — Docker web service
Workers — Docker background workers
PostgreSQL 16 — managed DB
Redis 7 — managed cache

Render spins all of this up from a single Blueprint deploy. The workers have graceful shutdown handling so in-flight Celery tasks don't get killed mid-run on deploys.

Huge thanks to Render here — the free hobby Postgres gave me a real month of building without worrying about infra costs, and the Blueprint + Docker combo meant I could focus on the actual product instead of YAML wrestling.

CLI & SDK

The Python SDK is async (httpx under the hood) with a LangChain integration for teams already in that ecosystem.

from evalguard import EvalGuardClient

client = EvalGuardClient(api_key="...")
run = await client.runs.create(suite_id="...", model="gpt-4o")
results = await run.wait()
print(results.summary())

The CLI is built with Typer and covers the full surface: trigger runs, pull results, manage keys, stream logs. Designed to drop into GitHub Actions without friction.

What I'd Do Differently

Async SQLAlchemy from day one. I started sync and migrated mid-build. It's not painful, but it's friction you don't need.

More investment in the Judge prompts earlier. The quality of LLM-as-a-Judge evaluations is almost entirely determined by how well the judge prompt is structured. I underestimated this.

Blueprint-first deployment. I set up the Render Blueprint after the fact. Defining infra as code from the start would have saved a few annoying debugging sessions.

What's Next

Open the red-team templates to community contributions
Deeper LangSmith / Langfuse integration
More agent tracing protocols

Not open-source at this point, but genuinely happy to talk architecture, design decisions, or anything LLM eval-related in the comments.

The hosted instance is paused right now to keep costs sane, but if you want a live walkthrough, drop a comment or DM me and I'll spin it back up.

Built this solo — backend, frontend, CLI, SDK, deployment, the whole stack. If you're building in the LLM evaluation or AI security space, I'd love to hear what you're working on.

DEV Community