When most developers want to scan their code for security vulnerabilities, they install Semgrep or Snyk and call it a day. I did the opposite. I built one from scratch.
Not because the existing tools are bad — they're excellent. But because I'm transitioning from 13 years of software engineering into application security, and I wanted to understand what a SAST tool actually is under the hood. What decisions go into building one? What tradeoffs do you make? What does "language-agnostic" really mean when you have to implement it yourself?
This is the story of those decisions. Some were obvious. Some I got wrong the first time. All of them taught me something.
What I Set Out to Build
The goal was a tool that could:
- Scan source code across any language without needing language-specific parsers
- Use a rule engine that non-engineers could extend without touching code
- Produce output in three formats — terminal, JSON, and HTML — so it could fit into both human workflows and CI/CD pipelines
- Fail builds when findings exceeded a configurable severity threshold
- Handle false positive suppression with inline annotations

That's not a toy. That's a real tool with real requirements. So let's talk about how I built it.
Decision 1: Regex Over AST (And Why I'd Make the Same Choice Again)
This was the most consequential decision in the whole project, and I want to be honest about the tradeoffs.
A proper SAST tool ideally parses code into an Abstract Syntax Tree (AST) — a structured representation of the code's meaning, not just its text. AST-based analysis can understand context. It knows that password on line 42 is a variable assignment, not a string literal. It can trace data flow. It can detect that user input on line 10 reaches an unparameterised SQL query on line 87 without being sanitised in between.
Regex can't do any of that. Regex sees text.
So why did I choose regex?
Because AST parsing is language-specific by definition. Every language has its own grammar, its own AST format, its own parsing library. Java's AST looks nothing like Python's. Kotlin's is different again. If I wanted to support 12+ languages — Python, Java, Kotlin, JavaScript, TypeScript, C#, Go, Ruby, PHP, Shell, YAML, Terraform — I'd need 12+ separate AST parsers, each with its own dependencies, its own quirks, its own maintenance burden.
Regex patterns, by contrast, can be written to match suspicious code constructs across any language where those constructs look similar in text form. SQL injection via string concatenation looks recognisably similar in Java, Python, and PHP. Hardcoded AWS access keys follow the same pattern everywhere. MD5 usage reads roughly the same in most languages.
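To make that concrete, here's a small self-contained illustration, using a hypothetical pattern rather than the tool's actual ruleset, of one regex flagging the same construct in Java and Python:

```python
import re

# Hypothetical pattern: a query call whose string literal is concatenated
# with something else. Deliberately simplified; real rules are more careful.
PATTERN = re.compile(r'(execute|query)\s*\(\s*["\'].*["\']\s*\+')

java_line = 'stmt.execute("SELECT * FROM users WHERE id = " + userId);'
python_line = 'cursor.execute("SELECT * FROM users WHERE id = " + user_id)'

for line in (java_line, python_line):
    if PATTERN.search(line):
        print("possible SQL injection:", line)
```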
The tradeoff is accuracy. Regex-based SAST has higher false positive rates than AST-based analysis because it can't understand context. It sees `md5(` and flags it regardless of whether it's being used for a password hash or a file integrity check.
My answer to this was confidence scoring on rules and inline suppression annotations. Rules can declare their confidence level (HIGH, MEDIUM, LOW), and developers can annotate lines with `# sast-ignore` or `# nosec` to suppress false positives with a documented reason. That's not perfect, but it's pragmatic — and it mirrors how production tools like Bandit handle the same problem.
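As a sketch of how that suppression check might work (the names here are illustrative, not lifted from the tool):

```python
import re

# A finding is dropped when its line carries a recognised ignore annotation.
SUPPRESS = re.compile(r'#\s*(sast-ignore|nosec)\b')

def is_suppressed(source_line: str) -> bool:
    """Return True if the line opts out via an inline annotation."""
    return bool(SUPPRESS.search(source_line))

assert is_suppressed('digest = md5(data)  # nosec: integrity check, not auth')
assert not is_suppressed('digest = md5(password)')
```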
If I were building a production-grade commercial tool, I'd layer in AST analysis per language using something like tree-sitter, which provides a unified API across dozens of languages. But for a portfolio project built to understand the domain? Regex got me 80% of the value at 20% of the complexity.
Decision 2: YAML-Driven Rules (The Best Decision I Made)
Every detection rule in the scanner is defined in a YAML file. Not in code. Here's what one looks like:
```yaml
- id: INJ-001
  title: SQL Injection — String Concatenation
  description: >
    User-controlled input is concatenated directly into a SQL query,
    bypassing parameterisation and enabling SQL injection attacks.
  severity: CRITICAL
  category: INJECTION
  cwe: CWE-89
  owasp: A03:2021 - Injection
  languages: ["python", "java", "javascript", "php", "csharp"]
  remediation: >
    Use parameterised queries or prepared statements. Never concatenate
    user input directly into SQL strings.
  patterns:
    - regex: '(execute|query|cursor)\s*\(\s*["\'].*\+.*["\']'
      confidence: HIGH
```
Why is this the best decision I made?
Because it separates the detection logic from the engine. The scanner engine — the part that reads files, applies patterns, generates findings, produces reports — never needs to change when you add a new vulnerability category. You just write a new YAML file and drop it in the rules/ directory. The engine discovers it automatically on startup.
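A minimal sketch of that discovery step, assuming PyYAML and rule files shaped like the example above (the engine's actual internals may differ):

```python
from pathlib import Path

import yaml  # PyYAML

def load_rules(rules_dir: str = "rules") -> list[dict]:
    """Collect every rule from every YAML file under rules_dir."""
    rules: list[dict] = []
    for path in sorted(Path(rules_dir).glob("*.yaml")):
        with path.open() as f:
            rules.extend(yaml.safe_load(f) or [])
    return rules

# Adding a detection is a data change, not a code change: drop a new
# .yaml file into rules/ and it's picked up on the next run.
```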
This is exactly how production tools like Semgrep and Nuclei work. Rules are data. The engine is infrastructure. Keeping them separate means:
- Security teams can contribute new detections without needing to understand Python
- Rules are version-controlled and diff-able like any other file
- Rules can be reviewed in pull requests by people who've never written a line of the engine
- Custom organisational rules can be maintained separately from the core ruleset

I ended up with five rule files covering 28 rules across six categories: Injection, Secrets, Cryptography, Authentication, Misconfiguration, and Path Traversal. Every rule maps to a CWE identifier and an OWASP Top 10 category. That structure matters — it's the language that security professionals and auditors actually speak.
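Inside the engine, each YAML entry can map onto a typed object. A hedged sketch of what that might look like, with field names mirroring the example rule rather than the tool's actual class:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    id: str              # e.g. "INJ-001"
    title: str
    severity: str        # CRITICAL / HIGH / MEDIUM / LOW
    category: str        # e.g. "INJECTION"
    cwe: str             # e.g. "CWE-89"
    owasp: str           # e.g. "A03:2021 - Injection"
    languages: list[str]
    remediation: str
    patterns: list[dict] # each entry: a regex plus a confidence level
```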
Decision 3: Three Output Formats From Day One
I could have built a tool that prints to terminal and called it done. Instead I built three output formats simultaneously: rich terminal output, JSON, and HTML.
This wasn't vanity. Each format serves a completely different consumer.
Terminal output is for developers running scans locally during development. It needs to be immediately readable, colour-coded by severity, and show exactly the file and line number of each finding. I used Python's rich library for this, which gives you nice bordered panels with colour-coded severity labels without writing a lot of custom formatting code.
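For instance, a few lines of rich are enough for a severity-coloured panel. This is a simplified sketch, not the tool's exact rendering code:

```python
from rich.console import Console
from rich.panel import Panel

SEVERITY_COLOURS = {"CRITICAL": "red", "HIGH": "dark_orange",
                    "MEDIUM": "yellow", "LOW": "green"}

def print_finding(console: Console, severity: str, title: str, location: str) -> None:
    colour = SEVERITY_COLOURS.get(severity, "white")
    console.print(Panel(f"[bold {colour}]{severity}[/] {title}\n{location}",
                        border_style=colour))

print_finding(Console(), "CRITICAL",
              "SQL Injection — String Concatenation", "app/db.py:42")
```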
JSON output is for machines. CI/CD pipelines, SIEM systems, dashboards, and any downstream tooling that needs to process findings programmatically. The JSON schema includes everything: finding ID, title, severity, category, CWE, OWASP reference, file path, line number, matched content, and remediation guidance. That's a schema a security team could ingest into Splunk or Elastic without modification.
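A single finding in that output might look like this; the field names follow the schema described above, but treat the exact shape as illustrative:

```json
{
  "id": "INJ-001",
  "title": "SQL Injection — String Concatenation",
  "severity": "CRITICAL",
  "category": "INJECTION",
  "cwe": "CWE-89",
  "owasp": "A03:2021 - Injection",
  "file": "app/db.py",
  "line": 42,
  "match": "cursor.execute(\"SELECT * FROM users WHERE id = \" + user_id)",
  "remediation": "Use parameterised queries or prepared statements."
}
```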
HTML output is for stakeholders. Developers understand terminal output. Product managers and engineering leads don't. The HTML report is a self-contained file — no server required, just open it in a browser — with severity filtering and full remediation guidance. You generate it, you email it, anyone can read it.
The design principle here is that a security tool's effectiveness is limited by how well its output reaches the people who need to act on it. Building three output formats from the start wasn't over-engineering — it was thinking about the full workflow.
Decision 4: CI/CD Exit Codes and Configurable Severity Thresholds
This is where the tool goes from "interesting project" to "actually useful in production."
The scanner exits with code 1 when findings meet or exceed a configurable severity threshold. In practice:
```yaml
# Fail the build on any HIGH or CRITICAL finding
- name: SAST Scan
  run: |
    docker run --rm -v ${{ github.workspace }}:/src \
      sast-tool /src \
      --fail-on HIGH
```

```bash
# Audit mode — run without failing the build
python main.py ./src --fail-on none
```
The `--fail-on` flag is the key design decision here. It lets teams adopt the tool incrementally:

1. Start in audit mode (`--fail-on none`). Get a baseline of your existing findings. Don't break anything.
2. Tighten to `--fail-on CRITICAL`. Only the most severe issues block releases.
3. Over time, tighten to `--fail-on HIGH` as the codebase gets cleaned up.

This reflects something I learned from running Snyk against a production Node.js codebase: you can't go from zero security gates to blocking every high-severity finding overnight. The build will fail constantly and engineers will start disabling the check to ship. Incremental adoption with configurable thresholds is how security tooling actually gets embedded into teams.
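Under the hood, a threshold check like this is just an ordering over severities. A minimal sketch with illustrative names, not the tool's actual code:

```python
import sys

SEVERITY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2, "CRITICAL": 3}

def exit_code(findings: list[dict], fail_on: str) -> int:
    """Return 1 if any finding meets or exceeds the threshold, else 0."""
    if fail_on.lower() == "none":
        return 0  # audit mode: report everything, block nothing
    threshold = SEVERITY_RANK[fail_on.upper()]
    worst = max((SEVERITY_RANK[f["severity"]] for f in findings), default=-1)
    return 1 if worst >= threshold else 0

sys.exit(exit_code([{"severity": "HIGH"}], fail_on="HIGH"))  # exits 1
```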
Decision 5: Docker-First Distribution
The scanner ships as a Docker image. Local Python installation is an option, but Docker is the primary recommended path.
Why? Zero dependency hell.
Python dependency management is a known pain point. Different teams run different Python versions. pip install on one machine behaves differently on another. A tool that fails to install never gets used.
Docker eliminates this. One command, any machine with Docker installed, consistent results:
```bash
docker run --rm -v $(pwd):/src sast-tool /src
```
For CI/CD integration — which is where this tool matters most — Docker is even more natural. GitHub Actions, GitLab CI, Jenkins — they all run steps in containers. A Docker-first tool drops into any pipeline without configuration.
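The image itself can stay simple. A plausible Dockerfile for a tool like this (the published image may well be built differently):

```dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# The caller mounts the code to scan at /src
ENTRYPOINT ["python", "main.py"]
CMD ["/src"]
```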
What I'd Do Differently
I'd add tree-sitter for at least two or three languages. Python and JavaScript are well-supported, and adding AST-based passes for those two languages would dramatically reduce false positives on the most commonly scanned codebases. The regex engine would remain the fallback for everything else.
I'd add a findings baseline. The first time you scan a legacy codebase, you might get 200 findings. That's not useful — it's noise. A baseline file that records the current state of findings and only alerts on new ones since the last scan is critical for real-world adoption. Snyk does this. I didn't build it.
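The core of a baseline is small. A sketch of the idea, using hypothetical helpers the tool doesn't currently have:

```python
import hashlib
import json

def fingerprint(finding: dict) -> str:
    """Stable ID for a finding: rule + file + matched text (line numbers drift)."""
    key = f'{finding["id"]}:{finding["file"]}:{finding["match"]}'
    return hashlib.sha256(key.encode()).hexdigest()

def new_findings(findings: list[dict],
                 baseline_path: str = ".sast-baseline.json") -> list[dict]:
    """Return only findings not already recorded in the baseline file."""
    try:
        with open(baseline_path) as f:
            known = set(json.load(f))
    except FileNotFoundError:
        known = set()  # first run: everything is new
    return [f for f in findings if fingerprint(f) not in known]
```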
I'd invest more in the HTML report. The current version is functional but basic. A proper interactive report with trend data across multiple scans, drill-down into individual findings, and a remediation progress tracker would make it genuinely compelling for security leadership conversations.
What Building This Taught Me
Understanding how SAST tools work made me a better consumer of them. When I run Snyk or Semgrep now, I have a much clearer mental model of what's happening under the hood, why certain findings are false positives, and what "confidence level" actually means.
The design decisions in a security tool aren't just engineering decisions — they're security decisions. Choosing regex over AST isn't just a technical tradeoff; it's a decision about your false positive rate, which is a decision about how much friction you introduce into developer workflows, which determines whether the tool actually gets used.
Building something from scratch is still one of the fastest ways to understand a domain deeply.
The full source code is on GitHub at github.com/pgmpofu/sast-tool. The rules are all in `rules/`, the engine is in `sast/`, and there's a `vulnerable_sample.py` you can use to test it immediately.
If you want to see it in action against real vulnerable applications, I wrote about that in my previous article.
Next up: the regex vs AST debate in depth — when pattern matching is good enough, and when it'll get you into trouble.