Patience Mpofu
How I Modelled the OWASP Top 10 Into a YAML Rule Engine

When I set out to write detection rules for my SAST tool, I didn't start with a list of regex patterns. I started with the OWASP Top 10.

That might sound obvious, but it matters. The OWASP Top 10 is the closest thing the AppSec world has to a universal curriculum. Every security engineer speaks it. Every compliance framework references it. When I map my rules to OWASP categories, I'm not just organising them — I'm making them legible to the people who will ultimately use them.

This article is about the thought process behind translating OWASP into a machine-readable rule engine. Not just what rules I wrote, but why I wrote them the way I did, and where the tricky ones gave me the most trouble.


The Rule Schema

Every rule in the engine follows the same structure:

- id: AUTHN-001
  title: "JWT Algorithm None Attack Vector"
  description: >
    The application accepts JWTs with algorithm set to 'none', allowing
    attackers to forge tokens without a valid signature.
  severity: CRITICAL
  category: Authentication
  cwe: CWE-347
  owasp: A07:2021 - Identification and Authentication Failures
  languages: ["python", "javascript", "java", "csharp", "go"]
  remediation: >
    Always explicitly specify and enforce the expected algorithm when
    verifying JWTs. Never accept 'none' as a valid algorithm. Use an
    allowlist of accepted algorithms.
  patterns:
    - regex: 'algorithm[s]?\s*[=:]\s*["'']none["'']'
      confidence: HIGH
    - regex: 'verify\s*=\s*False'
      confidence: MEDIUM

Six things matter in this schema beyond the obvious metadata:

  1. CWE ID — links to the Common Weakness Enumeration, which is the language of vulnerability databases and CVEs
  2. OWASP category — maps to a Top 10 entry using the 2021 version
  3. Languages array — controls which file types the pattern is applied to
  4. Multiple patterns — a rule can have several patterns, each with its own confidence level
  5. Confidence — HIGH means the pattern is very likely a real vulnerability; MEDIUM means it warrants manual review
  6. Remediation — not just "this is bad" but "here's what to do instead."

That last one is deliberate. A scanner that flags vulnerabilities without telling developers how to fix them creates noise, not security. Every rule in my tool includes actionable remediation guidance.
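
To make the schema concrete, here is a minimal sketch of how an engine can load a rule file like the one above and apply its patterns line by line. The file names, function names, and the use of PyYAML are illustrative assumptions, not the tool's actual API.

import re
import yaml  # assumes PyYAML is installed


def load_rules(path):
    """Load a list of rule dicts from a YAML rule file."""
    with open(path) as f:
        return yaml.safe_load(f)


def scan_file(source_path, rules):
    """Return (rule id, line number, confidence) for every pattern hit."""
    findings = []
    with open(source_path) as f:
        lines = f.readlines()
    for rule in rules:
        for pattern in rule["patterns"]:
            compiled = re.compile(pattern["regex"])
            for lineno, line in enumerate(lines, start=1):
                if compiled.search(line):
                    findings.append((rule["id"], lineno, pattern["confidence"]))
    return findings


# Illustrative usage:
# rules = load_rules("rules/authentication.yaml")
# print(scan_file("app/auth.py", rules))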

How the 28 Rules Map to OWASP

Here's the full picture before we go deep on individual rules:

OWASP 2021 Category | My Rules
A01 — Broken Access Control | AUTHN-005 (IDOR), MISC-001 (Path Traversal)
A02 — Cryptographic Failures | CRYPTO-001 through CRYPTO-006, SEC-003, SEC-004
A03 — Injection | INJ-001 through INJ-005, MISC-003 (XXE), MISC-006 (Deserialization)
A04 — Insecure Design | MISC-004 (File Upload)
A05 — Security Misconfiguration | MISC-002 (Debug Mode), MISC-003, MISC-005 (CORS)
A07 — Auth & Identity Failures | AUTHN-001 through AUTHN-005, SEC-001 through SEC-006
A08 — Software & Data Integrity | MISC-006 (Insecure Deserialization)

Some OWASP categories are underrepresented — A06 (Vulnerable Components) is better handled by SCA tools like Snyk than a SAST scanner, and A09 (Logging Failures) and A10 (SSRF) would require data flow analysis that regex can't reliably deliver. I'll come back to this.


Deep Dive: The Rules That Required Real Thought

AUTHN-001 — JWT Algorithm None Attack Vector

This one is my favourite rule in the entire set, because it targets a specific, well-known attack that is both elegant and devastating.

The vulnerability: The JWT specification allows the alg header to be set to "none", which means "no signature required." Some libraries honour this. If an attacker intercepts a JWT, changes the payload (for example, escalating "role": "user" to "role": "admin"), sets alg: none, and removes the signature, a vulnerable library will accept it as valid.

This is CWE-347 — Improper Verification of Cryptographic Signature. It's not a cryptographic weakness in the algorithm — it's a logic flaw in how the algorithm is selected.
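
To see how little work the forgery takes, here is a rough sketch of building an alg-none token by hand. The claims in the payload are made up for illustration.

import base64
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT segments require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


header = b64url(json.dumps({"alg": "none", "typ": "JWT"}).encode())
payload = b64url(json.dumps({"sub": "1234", "role": "admin"}).encode())

# An alg-none token carries an empty signature segment: "header.payload."
forged_token = f"{header}.{payload}."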

The detection challenge: The attack can be enabled in several ways. The most obvious is setting the algorithm explicitly:

jwt.decode(token, key="", algorithms=["none"])

But it can also be enabled by disabling verification entirely:

jwt.decode(token, verify=False)  # Python jwt library

Or by using a wildcard algorithm list that implicitly includes none. My rule covers the first two patterns:

patterns:
  - regex: 'algorithm[s]?\s*[=:]\s*["'']none["'']'
    confidence: HIGH
  - regex: 'verify\s*=\s*False'
    confidence: MEDIUM

The second pattern (verify=False) is MEDIUM confidence rather than HIGH because disabling verification has legitimate uses in test environments. That's an important distinction — the same code can be correct or dangerous depending on context, and the confidence level communicates that to the developer reviewing the finding.

Remediation: Always pass an explicit allowlist of algorithms when decoding JWTs and never include none. In Python's PyJWT library, that looks like:

jwt.decode(token, key, algorithms=["HS256"])  # explicit allowlist

MISC-006 — Insecure Deserialization

This is the rule I found hardest to write well, because insecure deserialization is one of those vulnerability classes where the presence of the function call isn't necessarily dangerous — it's the source of the data being deserialized that makes it dangerous.

The vulnerability: Deserializing untrusted data can lead to remote code execution. In Python, pickle.loads() will execute arbitrary Python code embedded in the serialized payload. In Java, ObjectInputStream.readObject() has been the source of countless critical CVEs. In PHP, unserialize() is a classic RCE vector.

The detection challenge: I can't tell from the call site alone whether the data being deserialized is trusted (coming from a file the application wrote itself) or untrusted (coming from a user-submitted HTTP body or a message queue). Both look identical to a regex scanner.

My decision was to flag the Python and Java deserialization calls at HIGH confidence, with a remediation note that acknowledges the context-dependence:

- id: MISC-006
  title: Insecure Deserialization
  severity: CRITICAL
  cwe: CWE-502
  owasp: A08:2021 - Software and Data Integrity Failures
  patterns:
    - regex: 'pickle\.loads?\s*\('
      confidence: HIGH
    - regex: 'ObjectInputStream\s*\('
      confidence: HIGH
    - regex: 'unserialize\s*\('
      confidence: MEDIUM
  remediation: >
    Avoid deserializing untrusted data. If deserialization is required,
    use safer formats like JSON. If using pickle, only deserialize data
    from trusted, integrity-verified sources. Consider signing serialized
    payloads. For Java, use safer alternatives like Jackson or Gson for
    JSON deserialization.

I gave unserialize() in PHP a MEDIUM confidence rather than HIGH because PHP codebases legitimately use it in contexts where the data comes from internal sources. The confidence difference is a signal to the developer: look harder at this one, but don't automatically treat it as a defect.
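
For the "sign the payload" part of the remediation, here is a minimal sketch of integrity-verified pickle using the standard library's hmac module. The key handling is deliberately simplified; a real application would load the secret from proper key storage.

import hashlib
import hmac
import pickle

SECRET_KEY = b"replace-with-a-real-secret"  # illustrative placeholder


def dump_signed(obj) -> bytes:
    """Serialize obj and prepend an HMAC-SHA256 tag over the payload."""
    payload = pickle.dumps(obj)
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return tag + payload


def load_signed(blob: bytes):
    """Verify the tag before deserializing; reject anything tampered with."""
    tag, payload = blob[:32], blob[32:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("payload failed integrity check")
    return pickle.loads(payload)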


AUTHN-004 — Timing Attack in Auth Comparison

This is the subtlest rule in the set, and the one most likely to generate confused questions from developers who haven't encountered it before.

The vulnerability: When you compare two strings — say, a provided token against a stored token — using a standard equality operator (==), most implementations short-circuit on the first mismatched character. This means comparing a completely wrong token takes microseconds, while a token that matches the first 30 characters takes longer.

An attacker can exploit this by measuring response times to brute-force secrets character by character. It sounds theoretical. It isn't — it's been used in practice against authentication systems.

The fix: Use a constant-time comparison function. In Python, that's hmac.compare_digest(). In Node.js, it's crypto.timingSafeEqual().
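
In Python, the vulnerable and safe comparisons differ by a single line; the function names here are just for illustration.

import hmac


def check_token_unsafe(provided: str, stored: str) -> bool:
    # Short-circuits on the first mismatched character: a timing leak.
    return provided == stored


def check_token_safe(provided: str, stored: str) -> bool:
    # Takes the same time regardless of where the strings differ.
    return hmac.compare_digest(provided, stored)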

The detection: I look for direct string comparison in contexts that suggest authentication:

- id: AUTHN-004
  title: Timing Attack in Auth Comparison
  severity: MEDIUM
  cwe: CWE-208
  patterns:
    - regex: '(token|secret|password|api_key)\s*==\s*'
      confidence: MEDIUM
    - regex: '==\s*(token|secret|password|api_key)'
      confidence: MEDIUM

MEDIUM severity, MEDIUM confidence. The false positive rate here is real — lots of code compares passwords or tokens with == in contexts where timing attacks are a genuine concern, but also in test code, logging, and input validation where they aren't. The finding is a prompt to review, not an automatic defect.


CRYPTO-005 — ECB Mode Encryption

This rule catches one of the most common misuses of encryption that isn't immediately obvious to developers who aren't cryptographers.

The vulnerability: AES-ECB (Electronic Codebook) mode encrypts each block of plaintext independently using the same key. This means identical plaintext blocks produce identical ciphertext blocks, which leaks structural information about the data even when it's "encrypted."

The classic demonstration is encrypting a bitmap image with AES-ECB — the overall pattern of the image remains visible in the ciphertext because regions of the same colour encrypt to the same blocks. For structured data like JSON or database rows, the same leakage applies.
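
You can see the leak in a few lines of Python, assuming the cryptography package is installed: two identical plaintext blocks encrypt to two identical ciphertext blocks under ECB.

import os

from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = os.urandom(32)
encryptor = Cipher(algorithms.AES(key), modes.ECB()).encryptor()

plaintext = b"A" * 16 + b"A" * 16  # two identical 16-byte blocks
ciphertext = encryptor.update(plaintext) + encryptor.finalize()

print(ciphertext[:16] == ciphertext[16:32])  # True: structure leaks through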

The detection:

- id: CRYPTO-005
  title: ECB Mode Encryption
  severity: HIGH
  cwe: CWE-327
  patterns:
    - regex: 'AES\.MODE_ECB|Cipher\.getInstance\(["'']AES["'']|AES/ECB'
      confidence: HIGH

The pattern catches Java's Cipher.getInstance("AES") because Java's default AES mode — when you don't specify one — is ECB. This is a documentation trap that developers fall into all the time. They think they're using secure AES; they're actually using AES-ECB because they didn't know to specify AES/GCM or AES/CBC.

Remediation: Use AES-GCM for authenticated encryption (preferred) or AES-CBC with a random IV and separate HMAC for integrity verification.
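
For the remediation side, here is a minimal AES-GCM sketch with the same cryptography package. Key storage and nonce bookkeeping are out of scope for the sketch; the one hard rule is a fresh nonce per message.

import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)
aesgcm = AESGCM(key)

nonce = os.urandom(12)  # never reuse a nonce with the same key
ciphertext = aesgcm.encrypt(nonce, b"sensitive data", None)
plaintext = aesgcm.decrypt(nonce, ciphertext, None)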


MISC-005 — CORS Wildcard / Reflected Origin

This rule sits at MEDIUM severity because CORS misconfiguration is context-dependent in a way that matters.

The vulnerability: A wildcard CORS header (Access-Control-Allow-Origin: *) allows any website to make cross-origin requests to your API and read the responses. A reflected origin header — where the server echoes back whatever Origin header the client sends — is even worse, because it behaves like a wildcard while also working with Access-Control-Allow-Credentials: true, a combination browsers refuse for a literal wildcard. That means any site can make credentialed requests as a logged-in user and read the results.

The patterns:

- id: MISC-005
  title: CORS Wildcard / Reflected Origin
  severity: MEDIUM
  cwe: CWE-942
  owasp: A05:2021 - Security Misconfiguration
  patterns:
    - regex: "Access-Control-Allow-Origin['\"]?\s*[,:]\s*['\"]?\*"
      confidence: HIGH
    - regex: 'allow_origins\s*=\s*\[?\s*["\']\*["\']'
      confidence: HIGH
    - regex: 'request\.headers\.get\(["\']Origin["\']\)'
      confidence: MEDIUM

The third pattern — looking for code that reads the Origin header — is a signal that reflected origin might be happening, not a definitive finding. A developer reading the Origin header might be implementing proper allowlist validation. MEDIUM confidence reflects that ambiguity.
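
For contrast, here is a sketch of the allowlist validation a developer reading the Origin header should be doing. The origins and the helper name are illustrative.

ALLOWED_ORIGINS = {"https://app.example.com", "https://admin.example.com"}


def cors_headers(request_origin: str) -> dict:
    """Echo the origin back only when it is on the explicit allowlist."""
    if request_origin in ALLOWED_ORIGINS:
        return {
            "Access-Control-Allow-Origin": request_origin,
            "Vary": "Origin",
        }
    return {}  # no CORS headers for unknown origins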


The Categories I Deliberately Left Out

Being honest about gaps matters as much as documenting what you built.

A06 — Vulnerable and Outdated Components belongs to Software Composition Analysis (SCA), not SAST. SCA tools like Snyk and Dependabot check your dependency versions against CVE databases. A regex scanner can't do this — it would need to parse package manifests and cross-reference them against live vulnerability feeds. I deferred this entirely to dedicated SCA tooling.

A09 — Security Logging and Monitoring Failures requires understanding what isn't in the code — which authentication events aren't being logged, which error handlers swallow exceptions silently. Pattern matching can only find things that are present in the text. Detecting absence requires semantic understanding the tool doesn't have.

A10 — Server-Side Request Forgery (SSRF) requires taint analysis. An SSRF vulnerability exists when user-controlled input reaches an HTTP request function without validation. That's exactly the kind of multi-step data flow that regex can't trace. I flagged this in the README as a known gap and a candidate for future AST-based analysis.


What Mapping to OWASP Gave Me

Structuring the rules against OWASP rather than building them ad hoc gave me three things I didn't expect.

Coverage gaps become visible. When you're mapping rules to a framework, the categories with no rules stand out immediately. That's a forcing function for honesty about what your tool actually covers.

The output speaks to security professionals. When a finding says A03:2021 - Injection and CWE-89, a security engineer doesn't need to read the description to understand what they're looking at. The taxonomy does the communication work.

It's defensible. If someone asks why I chose to flag MD5 usage, I can say: because CWE-327 maps to A02:2021 - Cryptographic Failures, and OWASP identifies weak hashing as a top-tier risk category. That's not me making a judgment call — it's me implementing an industry-standard framework.

Building your own tool is one of the fastest ways to understand why the standards are structured the way they are. You don't really understand OWASP until you've had to decide how to implement it.


The full rule set is in the rules/ directory at github.com/pgmpofu/sast-tool. Each YAML file corresponds to a rule category, and every rule follows the schema described above.

Next up: writing custom SAST rules for vulnerabilities your scanner doesn't cover — a practical tutorial using the YAML rule format to extend the tool for stack-specific patterns.
