I Built EvalGuard: A Full-Stack LLM Security & Evaluation Platform
After spending some days watching teams ship AI features with basically zero confidence in how their models would behave under adversarial conditions, I decided to build the tool I kept wishing existed.
EvalGuard is a full-stack LLM security and evaluation platform. Think Promptfoo meets Datadog — but purpose-built for AI teams who need more than vibe checks before deploying to production.
This post covers the architecture, the interesting technical decisions, and what I learned building this end-to-end as a solo project.
What EvalGuard Does
At a high level, EvalGuard gives teams three things:
1. Eval Suites — run structured evaluations across multiple LLM providers (OpenAI, Anthropic, Google AI, Groq) side by side. Compare GPT-4o vs Claude vs Llama on the same test cases with 7 different scoring metrics.
2. Red-Teaming — attack your own models before someone else does. 50+ attack templates across 5 categories: prompt injection, jailbreaking, PII leakage, bias, and toxicity. Beyond static templates, EvalGuard uses an LLM to dynamically generate adversarial prompts tailored to your model's specific system prompt.
3. Agent Monitoring — real-time tracing at the span level with automatic policy violation detection. If your agent does something it shouldn't, you know immediately.
Everything sits behind org-level plans with usage tracking, rate limiting, and CI/CD integration via GitHub Actions or CLI.
Architecture Overview
The system is split into 5 layers:
Clients → Frontend & Auth → API Layer → Processing → Data & Providers
Clients
Three ways to interact with the platform:
- Dashboard — Next.js web UI
- CLI — built with Typer, talks to the API directly
- Python SDK — async, built on httpx, with LangChain integration
Frontend & Auth
- Next.js 15 + React + Tailwind + shadcn/ui
- Clerk handles auth — JWT + JWKS, multi-tenant out of the box
Clerk was genuinely the right call here. Multi-tenancy with org-level access control would have taken weeks to build from scratch. Offloading that entirely let me focus on the actual product.
API Layer (FastAPI)
The backend is a FastAPI app with a middleware chain: CORS → Auth → Org Guard → Rate Limit, before any request hits a route.
Routes are grouped into: Suites, Runs, Red-Team, Agents, Reports, Billing, CI/CD, Keys.
SQLAlchemy 2.0 async with Pydantic validation throughout. The async SQLAlchemy shift was worth it — under load, the difference is noticeable.
Processing (Celery Workers)
This is where the interesting stuff happens. Three main workers, but before getting into what each does — why Celery at all?
LLM calls are slow. A single eval run might involve dozens of API calls to external providers, each taking 2–10 seconds. Doing that synchronously in a FastAPI request would mean holding HTTP connections open for minutes, timeouts everywhere, and zero visibility into progress. The answer is obvious: push the work onto a queue and process it asynchronously.
The architecture looks like this:
FastAPI → Redis (broker) → Celery Workers → PostgreSQL (results)
↑ ↓
└─────────────── Status polling ───────────────┘
When a user triggers an eval run or a red-team campaign, FastAPI creates a DB record, pushes a task onto the Redis queue, and immediately returns a run_id to the client. The frontend polls for status updates. Workers pick up tasks, do the heavy lifting, and write results back to Postgres as they complete.
Redis here is doing double duty — it's both the Celery broker (task queue) and the result backend (where task state gets written). That's a deliberate choice to keep the infra footprint small rather than introducing a separate message broker.
Each worker type runs in its own Celery queue, so you can scale them independently:
Eval Runner — evalguard.eval queue
Test Case → LLM Call → Score
Test cases within a suite are fanned out as individual Celery tasks using group() so they run in parallel across workers. Results get aggregated back with a chord callback that writes the final suite summary once all cases complete. This means a 50-case eval suite doesn't run sequentially — it saturates however many workers you have.
from celery import group, chord
from app.tasks import run_test_case, finalize_eval_suite
# Fan out all test cases in parallel, aggregate when all complete
job = chord(
group(
run_test_case.s(case_id=case.id, run_id=run.id)
for case in suite.test_cases
),
finalize_eval_suite.s(run_id=run.id)
)
job.apply_async()
group() fires all test cases in parallel across available workers. chord() holds the finalize callback until every task in the group has a result — that's where pass/fail rates, aggregate scores, and the final run status get computed and written to Postgres.
Red-Team Runner — evalguard.redteam queue
Attack → Target → Judge → Risk
Red-team runs are more sequential by nature — generate attack, hit target, judge response, score risk — so these use Celery chains. Each attack prompt is its own chain, but multiple chains run concurrently across the queue. The judge step is where LLM-as-a-Judge fires, adding latency but also adding the signal that makes red-team results actually meaningful.
Report Generator — evalguard.reports queue
Query → Jinja2 → PDF
Separated into its own queue specifically so report generation never competes with eval or red-team capacity. Reports can be slow (big DB queries, PDF rendering) and you don't want one large report export starving active eval runs of workers.
Workers have graceful shutdown configured — on deploy or restart, Celery's SIGTERM handling lets in-flight tasks finish before the process exits. Without this, a worker restart mid-eval would silently drop results and leave runs stuck in a RUNNING state forever.
Core Services (inside Processing)
- LiteLLM Router — unified interface across all providers. Swapping models is one config change.
- Scorer Engine — 7 scoring metrics, composable per eval suite
- LLM-as-a-Judge — secondary model evaluates target model responses for safety and accuracy
- Attack Generator — uses an LLM to craft adversarial inputs from the target model's system prompt
- Policy Engine — defines and enforces rules for agent monitoring
Data & Providers
- PostgreSQL 16 — 12 tables, managed on Render
- Redis 7 — task queue + rate limiting, managed on Render
- Cloudflare R2 — report storage
The AI-Powered Parts
This is where EvalGuard gets self-referential in a fun way.
LLM-as-a-Judge
Static scoring (exact match, regex) only gets you so far. For safety evaluation especially, you need semantic understanding. EvalGuard uses a secondary LLM to evaluate whether the target model's response was actually safe and appropriate — not just syntactically correct.
The judge prompt is structured around the eval category. A PII leakage judge looks for different signals than a toxicity judge.
Dynamic Attack Generation
The 50+ static templates are a starting point, not a ceiling. Given a target model's system prompt, EvalGuard calls an LLM to generate novel adversarial inputs specific to that model's context. A customer support bot and a code assistant have completely different attack surfaces — the generator accounts for that.
Semantic Similarity Scoring
For cases where you want to measure how close an output is to an expected answer without requiring exact matches, EvalGuard uses sentence-transformers embeddings to compute cosine similarity. Useful for evaluating open-ended responses where wording varies but meaning should be consistent.
Deployment: Render Blueprint + Docker
The entire infra is defined in a render.yaml Blueprint file:
- Frontend — Node web service
- Backend — Docker web service
- Workers — Docker background workers
- PostgreSQL 16 — managed DB
- Redis 7 — managed cache
Render spins all of this up from a single Blueprint deploy. The workers have graceful shutdown handling so in-flight Celery tasks don't get killed mid-run on deploys.
Huge thanks to Render here — the free hobby Postgres gave me a real month of building without worrying about infra costs, and the Blueprint + Docker combo meant I could focus on the actual product instead of YAML wrestling.
CLI & SDK
The Python SDK is async (httpx under the hood) with a LangChain integration for teams already in that ecosystem.
from evalguard import EvalGuardClient
client = EvalGuardClient(api_key="...")
run = await client.runs.create(suite_id="...", model="gpt-4o")
results = await run.wait()
print(results.summary())
The CLI is built with Typer and covers the full surface: trigger runs, pull results, manage keys, stream logs. Designed to drop into GitHub Actions without friction.
What I'd Do Differently
Async SQLAlchemy from day one. I started sync and migrated mid-build. It's not painful, but it's friction you don't need.
More investment in the Judge prompts earlier. The quality of LLM-as-a-Judge evaluations is almost entirely determined by how well the judge prompt is structured. I underestimated this.
Blueprint-first deployment. I set up the Render Blueprint after the fact. Defining infra as code from the start would have saved a few annoying debugging sessions.
What's Next
- Open the red-team templates to community contributions
- Deeper LangSmith / Langfuse integration
- More agent tracing protocols
Not open-source at this point, but genuinely happy to talk architecture, design decisions, or anything LLM eval-related in the comments.
The hosted instance is paused right now to keep costs sane, but if you want a live walkthrough, drop a comment or DM me and I'll spin it back up.
Built this solo — backend, frontend, CLI, SDK, deployment, the whole stack. If you're building in the LLM evaluation or AI security space, I'd love to hear what you're working on.

Top comments (0)