A few weeks ago I went down a rabbit hole. I'd been reading about how every SaaS company eventually has to deal with GDPR / SOC 2 / HIPAA, and how the existing tooling space basically goes like this:
"Do you have a password policy document?"
"Yes."
"Great, you're compliant."
That checks the policy. It doesn't check whether your login route actually hashes passwords with MD5. Which felt like… kind of the wrong layer to look at?
So I built Themida — an open-source compliance scanner that reads the actual code.
What it does
Point it at a GitHub repo (or a local directory). It returns findings like this:
src/auth/login.ts:41
CRITICAL GDPR Art. 5(1)(f), 32(1)(a)
Password hashed with broken MD5
Maximum fine: €20M or 4% of revenue
Fix → bcrypt at cost 12+, or Argon2id
Every finding has:
- The exact file and line number
- The legal article that the code violates
- The maximum fine for context
- A code fix you can paste straight into a PR
- A severity rating (CRITICAL / HIGH / MEDIUM / LOW)
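The shape of a finding could be modeled roughly like this (the field names here are my guesses for illustration, not Themida's actual schema):

```typescript
// Hypothetical shape of a single finding; names are illustrative,
// not Themida's actual schema.
type Severity = "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";

interface Finding {
  file: string;         // e.g. "src/auth/login.ts"
  line: number;         // exact line the violation occurs on
  severity: Severity;
  articles: string[];   // legal articles violated
  maxFine: string;      // kept as display text for context
  description: string;
  suggestedFix: string; // paste-ready guidance for a PR
}

const example: Finding = {
  file: "src/auth/login.ts",
  line: 41,
  severity: "CRITICAL",
  articles: ["GDPR Art. 5(1)(f)", "GDPR Art. 32(1)(a)"],
  maxFine: "€20M or 4% of revenue",
  description: "Password hashed with broken MD5",
  suggestedFix: "Use bcrypt at cost 12+, or Argon2id",
};
```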
You can export the whole report as a PDF if you need to share it with someone non-technical.
Why an LLM and not regex?
Honest answer: I tried regex first. It was awful.
Pattern-matching catches the easy cases (crypto.createHash('md5')) but produces a tidal wave of false positives on real codebases. MD5 used to hash a password is a crime. MD5 used as a cache key is fine. A regex can't tell the difference. An LLM can, if you give it enough context and the right prompt.
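To make that concrete, here are two functions that are textually almost identical, only one of which is a compliance problem (Node's built-in crypto; the function names are mine):

```typescript
import { createHash } from "node:crypto";

// Fine: MD5 as a cache key. A collision costs a cache miss, nothing more.
function cacheKey(url: string): string {
  return createHash("md5").update(url).digest("hex");
}

// Not fine: MD5 as a password hash. Fast, unsalted, trivially crackable.
// A regex sees the exact same createHash("md5") call in both functions.
function hashPassword(password: string): string {
  return createHash("md5").update(password).digest("hex");
}
```

The primitive is identical; only the surrounding intent differs, and intent is what the LLM pass is there to judge.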
The scanner runs three passes:
- Recon: a small/cheap LLM scans the file tree and picks ~15 suspect paths
- Deep scan: a bigger LLM reads those files line by line and produces findings
- Verify: a final pass that drops hallucinated paths and findings already mitigated nearby
Splitting it this way keeps the cost under control. A scan of a medium-sized repo costs around 5–20 cents depending on which models you pick.
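The three passes chain together something like this (a sketch; the interfaces and function names are mine, not the repo's):

```typescript
// Sketch of the three-pass pipeline; the Models interface is an invented
// stand-in for whatever adapter Themida actually wires in.
interface Finding { file: string; line: number; issue: string }

interface Models {
  recon(tree: string[], max: number): Promise<string[]>;      // cheap model
  deepScan(path: string, source: string): Promise<Finding[]>; // bigger model
  verify(findings: Finding[]): Promise<Finding[]>;            // final pass
}

async function scan(
  tree: string[],
  read: (path: string) => Promise<string>,
  models: Models,
): Promise<Finding[]> {
  // Pass 1: recon narrows the whole tree down to ~15 suspect paths.
  const suspects = await models.recon(tree, 15);

  // Pass 2: deep scan reads each suspect file and emits raw findings.
  const raw: Finding[] = [];
  for (const path of suspects) {
    raw.push(...(await models.deepScan(path, await read(path))));
  }

  // Pass 3: drop hallucinated paths, then let the verifier drop
  // findings that are already mitigated nearby.
  const grounded = raw.filter((f) => tree.includes(f.file));
  return models.verify(grounded);
}
```

Only the handful of recon-selected files ever reach the expensive model, which is where the cost control comes from.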
Provider-agnostic
You bring your own LLM key. The scanner ships adapters for:
- Anthropic (Claude)
- OpenAI
- Anything that speaks OpenAI's Chat Completions API: OpenRouter, Groq, Together, vLLM, llama.cpp server, Ollama, LiteLLM
Pick one with one env var:
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=https://openrouter.ai/api/v1 # optional, defaults to OpenAI
Self-hosters running a local model are first-class citizens: the cost tracker just records 0 cents and moves on.
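An adapter only really needs to do one thing, so the interface can stay tiny. This is a hypothetical sketch, not Themida's real adapter API:

```typescript
// Hypothetical adapter interface; the real one in the repo may differ.
interface LLMAdapter {
  complete(opts: { model: string; prompt: string }): Promise<{
    text: string;
    costCents: number; // local models simply report 0 here
  }>;
}

// Stand-in for a real HTTP call to, say, a local Ollama server.
async function fetchLocalCompletion(prompt: string): Promise<string> {
  return `stub completion for: ${prompt}`;
}

// A local-model adapter: same interface as the hosted ones, zero recorded cost.
const localAdapter: LLMAdapter = {
  async complete({ prompt }) {
    const text = await fetchLocalCompletion(prompt);
    return { text, costCents: 0 };
  },
};
```

Keeping cost reporting inside the adapter is what lets the rest of the pipeline stay provider-agnostic.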
What's done, what's not
Done: GDPR (5 rules), EU AI Act (5 rules), full scan pipeline, dashboard, real-time progress, PDF export, GitHub App integration, local CLI path (pnpm dev:scan).
Open issues, PRs welcome: HIPAA, SOC 2, ISO 27001, OWASP Top 10, PCI DSS. Each rule pack is a single TypeScript file with a fairly readable schema; adding rules is the easiest way to contribute.
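I won't reproduce the exact schema here, but a rule pack along these lines (one TypeScript file, a handful of declarative rules; all field names below are invented) is the kind of thing a contribution looks like:

```typescript
// Hypothetical rule-pack file; field names are invented, not Themida's schema.
type Severity = "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";

interface Rule {
  id: string;
  article: string;  // legal article the rule enforces
  severity: Severity;
  maxFine: string;  // shown in reports for context
  prompt: string;   // what the deep-scan model is asked to look for
}

export const gdprPack: Rule[] = [
  {
    id: "gdpr-weak-password-hash",
    article: "GDPR Art. 32(1)(a)",
    severity: "CRITICAL",
    maxFine: "€20M or 4% of revenue",
    prompt: "Flag password hashing that uses MD5, SHA-1, or unsalted fast hashes.",
  },
];
```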
On the roadmap: better local-LLM ergonomics, a VS Code extension, and an eval suite for measuring rule accuracy as packs grow.
Try it
git clone https://github.com/Nikolaospet/themida
cd themida
pnpm install
cp .env.example .env.local
# edit .env.local — pick a provider
pnpm dev:scan
There's also a sample report on OWASP NodeGoat you can poke around in without setting anything up.
This is a personal project
I want to be upfront about this: Themida isn't a company, it doesn't have funding, there's no "managed version" hiding behind the OSS face. It's a side project I'm building in the open because I find the problem interesting and I think devs are tired of compliance tools that don't read code.
It's released under AGPL-3.0: use it, modify it, run it for your team, fork it. The license just stops someone from wrapping it in a SaaS and closing it back up.
If you try it and something breaks, open an issue. If you want to add a rule pack, an LLM adapter, or improve the eval suite, PRs are warmly welcomed; there's a CONTRIBUTING.md and a PR template ready to go.
The repo is here: github.com/Nikolaospet/themida
If you build software in a regulated industry (fintech, health, anything EU-facing, anywhere with AI Act exposure), I'd love to hear which rule packs would be most useful to ship next. Drop a comment.
Thanks for reading 🙏