In my last article, I mentioned that my SAST tool uses regex-based pattern matching instead of AST parsing, and that this was a deliberate tradeoff. A few people asked me to go deeper on that decision — because on the surface, it sounds like I took a shortcut.
I didn't. Or rather — I did, but it was an informed shortcut, and there's a meaningful difference.
Let me explain what AST parsing actually is, why it's considered the "correct" approach, why I chose not to use it, and — most importantly — when that choice would be the wrong one.
## First, What's the Difference?
When your SAST tool scans a file, it needs to understand what the code is doing. There are two fundamentally different ways to approach this.
### The Regex Approach
Regex treats source code as plain text and looks for patterns that look like vulnerabilities. Here's a simplified version of what my SQL injection rule does:
```
(execute|query|cursor)\s*\(\s*["\'].*\+.*["\']
```
This pattern says: find any call to execute, query, or cursor that contains a string concatenation inside the parentheses. If it matches, flag it as a potential SQL injection.
It's fast, simple, and language-agnostic. The same pattern catches suspicious SQL construction in Python, Java, PHP, and JavaScript without modification.
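To make that cheapness concrete, here's a minimal sketch (not the tool's actual engine) of how the pattern above can be applied line by line with Python's standard `re` module. The sample lines are illustrative:

```python
import re

# The simplified SQL injection pattern from above
SQLI = re.compile(r'(execute|query|cursor)\s*\(\s*["\'].*\+.*["\']')

samples = [
    "cursor.execute(\"SELECT * FROM users WHERE name = '\" + name + \"'\")",  # Python
    'db.query("SELECT * FROM users WHERE name = \'" + name + "\'")',          # JavaScript
    "stmt.execute(safeQuery, params)",                                        # parameterised: safe
]

for line in samples:
    print("FLAG" if SQLI.search(line) else "ok  ", line)
```

The same compiled pattern flags the Python and JavaScript lines and passes the parameterised call, with no per-language work at all.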
### The AST Approach
AST stands for Abstract Syntax Tree. When a compiler or interpreter reads your code, it doesn't see text — it parses the text into a structured tree that represents the meaning of the code.
Take this Python snippet:
```python
user_id = request.args.get("id")
query = "SELECT * FROM users WHERE id = " + user_id
cursor.execute(query)
```
An AST parser doesn't just see the word execute followed by some text. It understands:
- `user_id` is a variable assigned from `request.args` — a known source of user-controlled input
- `query` is a string built by concatenating that variable — a taint propagation step
- `cursor.execute(query)` is a database call receiving that tainted string — a sink

This is taint analysis — tracking the flow of untrusted data from a source to a dangerous sink. It's the gold standard of SAST analysis because it understands context, not just surface patterns.
## What Regex Gets Wrong
Let me show you a concrete example of where regex fails.
### False Positive: The Innocuous MD5
My rule CRYPTO-001 flags any use of MD5 as a potential weak hashing vulnerability:
```yaml
- id: CRYPTO-001
  title: Weak Hashing — MD5
  severity: HIGH
  patterns:
    - regex: '\bmd5\s*\('
  confidence: HIGH
```
This will correctly flag:
```python
hashed_password = md5(user_password).hexdigest()  # BAD — MD5 for passwords
```
But it will also flag:
```python
file_checksum = md5(file_contents).hexdigest()  # FINE — MD5 for file integrity
```
An AST-based tool with data flow analysis could potentially distinguish these cases by tracking what kind of data is being hashed. A regex tool cannot: it sees `md5(` and fires regardless.
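You can verify that blindness directly. Using Python's `re` module and the two lines above:

```python
import re

md5_rule = re.compile(r'\bmd5\s*\(')  # the CRYPTO-001 pattern

bad  = 'hashed_password = md5(user_password).hexdigest()'  # real weakness
fine = 'file_checksum = md5(file_contents).hexdigest()'    # legitimate checksum

# The pattern fires on both; it has no notion of what is being hashed.
print(bool(md5_rule.search(bad)), bool(md5_rule.search(fine)))  # True True
```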
### False Negative: The Indirect Injection
Regex also misses vulnerabilities that span multiple lines or involve intermediate variables. Consider:
```java
String userId = request.getParameter("id");
String sql = buildQuery(userId); // vulnerability travels through this function
Statement stmt = conn.createStatement();
stmt.execute(sql); // regex might not flag this
```
My regex looks for string concatenation at the point of execution. If the tainted input is assembled in a helper function and passed in as a completed string, the regex never fires. The vulnerability is invisible to pattern matching.
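The same limitation is easy to demonstrate in miniature, again with Python's `re` and the simplified pattern from earlier:

```python
import re

SQLI = re.compile(r'(execute|query|cursor)\s*\(\s*["\'].*\+.*["\']')

# Concatenation at the call site: the pattern fires.
assert SQLI.search('cursor.execute("SELECT \'" + user_id + "\'")')

# Same vulnerability, assembled in a helper: the pattern sees a clean call.
assert SQLI.search('stmt.execute(sql)') is None
```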
An AST tool with interprocedural taint analysis would follow the data through buildQuery() and flag the eventual execute() call correctly.
## So Why Did I Choose Regex Anyway?
Three reasons.
### 1. Language-Agnostic by Design
AST parsing is inherently language-specific. Every language has its own grammar and its own parser. Python's AST looks nothing like Java's. Kotlin's is different again. JavaScript has multiple competing parsers with different behaviours across versions.
To support AST-based analysis across 12 languages — Python, Java, Kotlin, JavaScript, TypeScript, C#, Go, Ruby, PHP, Shell, YAML, Terraform — I'd need 12 separate parsing libraries, each with its own dependencies, version constraints, and maintenance burden.
tree-sitter comes closest to solving this problem. It's a parser generator that provides a unified API across dozens of languages, and it's what tools like GitHub's code scanning use under the hood. But even with tree-sitter, you still need to write language-specific query logic to express what you're looking for in each language's AST structure.
Regex patterns, by contrast, can be written once and applied across any language where the vulnerable pattern looks similar in text form. Hardcoded AWS access keys follow the same format everywhere. JWT secrets look the same in any language. That's genuine value that regex delivers cheaply.
### 2. The Vulnerability Surface I'm Targeting
Not all vulnerability classes require deep analysis. Some are genuinely well-served by pattern matching.
Secrets detection — hardcoded API keys, passwords, connection strings, private key material — is almost entirely a pattern matching problem. The secret has to appear literally in the source code for it to be a finding. Regex is exactly the right tool.
```yaml
- id: SEC-001
  title: Hardcoded AWS Access Key
  patterns:
    - regex: 'AKIA[0-9A-Z]{16}'
  confidence: HIGH
```
That pattern will catch a hardcoded AWS key in any language, in any file, instantly. AST analysis adds nothing here.
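To show how language-independent that is, here's the pattern running against lines from different ecosystems (using `AKIAIOSFODNN7EXAMPLE`, the example key from AWS's own documentation, which is safe to publish):

```python
import re

# SEC-001's pattern: AWS access key IDs share the AKIA prefix everywhere
aws_key = re.compile(r'AKIA[0-9A-Z]{16}')

assert aws_key.search('aws_access_key_id = "AKIAIOSFODNN7EXAMPLE"')  # Python / .env
assert aws_key.search('String key = "AKIAIOSFODNN7EXAMPLE";')        # Java
assert not aws_key.search('key = os.environ["AWS_ACCESS_KEY_ID"]')   # read from env: fine
```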
Misconfiguration detection — debug mode enabled, CORS wildcards, insecure session settings — is similarly pattern-oriented. These are usually single-line declarations that look the same regardless of context.
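As an illustration, here's a hypothetical rule of that kind (not one of the tool's 28): the Flask/Django-style `DEBUG = True` flag, which is a single-line declaration wherever it appears:

```python
import re

# Hypothetical misconfiguration rule: debug mode left on in settings
debug_rule = re.compile(r'\bDEBUG\s*=\s*True\b')

assert debug_rule.search('DEBUG = True  # settings.py')
assert not debug_rule.search('DEBUG = env.bool("DEBUG", default=False)')
```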
The injection and authentication categories are where regex struggles most. But even there, high-confidence patterns — direct string concatenation in SQL calls, algorithm: "none" in JWT configurations — catch a meaningful portion of real vulnerabilities.
### 3. Pragmatic Scope
I built this tool to learn application security deeply, not to compete with Checkmarx. Scope matters. A tool that actually ships with 28 working rules across 6 categories is more valuable than a tool that was going to have perfect taint analysis but never got finished.
The regex approach let me build a complete, functional, deployable tool. That's not nothing.
## When This Choice Would Be Wrong
I want to be direct about the scenarios where choosing regex would be the wrong call.
If you're scanning a single language at scale, the language-agnostic argument evaporates. If you're only scanning Java — which is common in enterprise AppSec programmes — you should be using a Java AST parser or a tool like SpotBugs or SonarQube that understands Java's type system.
If you need to catch data flow vulnerabilities reliably, regex will miss too much. Injection vulnerabilities that travel through multiple functions, variables, or modules require taint analysis. The indirect injection example I showed earlier is not an edge case — it's the norm in real codebases.
If you're running this in a high-security environment where false negatives are more dangerous than false positives, the calculus changes. A false negative means a real vulnerability gets missed. In a financial services or healthcare context, that might be unacceptable.
If you're trying to replace a commercial SAST tool, you need AST analysis. There's no way around it. Tools like Semgrep (which uses a hybrid AST/pattern approach), Checkmarx, and Veracode achieve their accuracy because they understand code structure. Pattern matching is a starting point, not a destination.
## The Hybrid Path Forward
The most pragmatic production approach is a hybrid — which is exactly what Semgrep does.
Semgrep's rule syntax looks like pattern matching but operates on the AST. When you write a Semgrep rule that matches `cursor.execute($X + $Y)`, Semgrep isn't doing string matching. It's matching against the AST, which means it correctly handles whitespace, string formatting variations, and code structure in ways that regex cannot.
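For comparison, a Semgrep rule expressing that check might look like this. This is a sketch based on Semgrep's published rule syntax; the id and message are mine:

```yaml
rules:
  - id: sqli-string-concat
    languages: [python]
    severity: ERROR
    message: User-controlled data concatenated into a SQL execute call
    pattern: cursor.execute($X + $Y)
```

Because `$X` and `$Y` are metavariables bound against AST nodes, the rule matches regardless of spacing, line breaks, or how the operands are written.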
For my tool, the natural evolution would be to keep the YAML rule engine and regex patterns as the default layer, but add an optional tree-sitter AST pass for languages where it's available. The two approaches aren't mutually exclusive — they're complementary. Regex for speed and coverage, AST for accuracy on the highest-risk patterns.
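A minimal sketch of that layering, under my own assumptions about the design (regex pass everywhere, with an AST confirmation pass here approximated by Python's built-in `ast` rather than tree-sitter):

```python
import ast
import re

SQLI_REGEX = re.compile(r'(execute|query|cursor)\s*\(\s*["\'].*\+.*["\']')

def scan(path: str, source: str) -> list:
    """Layered scan: cheap regex pass on every file, AST confirmation for Python."""
    findings = [(i, "regex") for i, line in enumerate(source.splitlines(), 1)
                if SQLI_REGEX.search(line)]
    if path.endswith(".py"):
        try:
            tree = ast.parse(source)
        except SyntaxError:
            return findings  # unparseable file: fall back to regex-only results
        # Lines where an execute() call really does receive a concatenation
        confirmed = {node.lineno for node in ast.walk(tree)
                     if isinstance(node, ast.Call)
                     and isinstance(node.func, ast.Attribute)
                     and node.func.attr == "execute"
                     and any(isinstance(a, ast.BinOp) for a in node.args)}
        findings = [(ln, "ast-confirmed" if ln in confirmed else tag)
                    for ln, tag in findings]
    return findings
```

For a Python file the regex hit gets upgraded to an AST-confirmed finding; for any other language the regex result stands on its own. That's the complementary relationship in about twenty lines.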
That's the architecture note I left in the README:
> For a production tool, layering in tree-sitter AST analysis per language would reduce false positives.
That's not hedging. That's honest engineering — knowing where your current approach has limits and documenting the path to improving it.
## The Practical Takeaway
If you're building a SAST tool or evaluating one, here's how to think about the regex vs AST question:
| Scenario | Right Approach |
|---|---|
| Multi-language scanning, broad coverage | Regex or hybrid (Semgrep-style) |
| Single language, high accuracy | AST-based analysis |
| Secrets detection | Regex — it's optimal |
| Taint/data flow analysis | AST — regex can't do this |
| CI/CD gate with low false positive tolerance | AST or hybrid |
| Learning how SAST works | Build both and compare |
The "correct" approach depends entirely on your threat model, your team's language footprint, and how much false positive noise your developers will tolerate before they disable the scanner entirely.
A scanner that developers trust and actually use is more valuable than a theoretically perfect scanner that gets switched off after the first sprint.
The full source code, including all YAML rules, is at github.com/pgmpofu/sast-tool.
Next up: how I modelled the OWASP Top 10 into a YAML rule engine — and the thought process behind some of the trickier rules like JWT algorithm confusion and insecure deserialization.