Our security CLI produced findings engineers couldn't triage without hours of research. We applied Andreas Zeller's defect/infection/failure chain from debugging theory — and triage time collapsed.
50 findings. 4.8 hours of triage.
Our Go CLI scans cloud configurations and reports security misconfigurations. A typical scan produces 50+ findings. Each finding says what control fired, which asset, and what severity.
```
Finding: CTL.IAM.ESCALATION.001
Asset: arn:aws:iam::123456:role/DeployerRole
Severity: high
Remediation: Restrict iam:PassRole permissions
```
An engineer reads this and asks three questions:
- What is wrong? "Escalation" — but what specifically about this role is the problem?
- Why does it matter? Is this a theoretical risk or an active exposure?
- What happens if I ignore it? Account compromise? Data leak? Compliance finding?
The finding answers none of these. The engineer opens the AWS console, reads the role's policy, traces the permissions, checks what resources the role can access, and reconstructs the risk chain manually. Per finding. Fifty times.
4.8 hours per week on triage. Not because the findings were wrong — because they were incomplete.
## The insight from debugging theory
Andreas Zeller's *Why Programs Fail* describes how bugs propagate through programs. Every failure has three stages:
- Defect — a specific flaw in the code
- Infection — the defect causes incorrect state at runtime
- Failure — the incorrect state produces an observable wrong behavior
A developer debugging a crash doesn't start at the crash. They trace backward: what state was wrong (infection), what code produced that state (defect). The chain from defect to failure is how they understand the bug.
Cloud misconfigurations follow the exact same chain:
- Defect — a specific misconfiguration (the role grants iam:PassRole without resource constraints)
- Infection — the misconfiguration propagates through dependent infrastructure (an attacker with this role can pass any role to any Lambda function)
- Failure — the observable security consequence (full account compromise via privilege escalation through Lambda execution roles)
We were reporting the failure (high severity escalation finding) without the chain that explains it. Engineers reconstructed the chain manually every time.
## 1. What the output looks like now
Each finding carries the full chain:
```
Finding: CTL.IAM.ESCALATION.001
Asset: arn:aws:iam::123456:role/DeployerRole
Severity: high

DEFECT:
The role grants iam:PassRole and
lambda:InvokeFunction without resource
constraints, allowing any role to be passed
to any Lambda function.

INFECTION:
An attacker with temporary access to this
role's credentials can create or invoke a
Lambda function that executes with a higher-
privileged role. The iam:PassRole permission
has no Condition or Resource restriction,
meaning any role in the account is a valid
target — including administrator roles.

FAILURE:
Full account compromise. The attacker
escalates from the DeployerRole's permissions
to any role in the account, potentially
reaching administrator access within a single
API call chain.

OBSERVED:
identity.role.permissions = [
  "iam:PassRole",
  "lambda:InvokeFunction",
  "lambda:CreateFunction"
]
identity.role.condition = null

REMEDIATION:
Add a Condition to iam:PassRole limiting
which roles can be passed, and restrict the
Resource field to specific function ARNs.
```
Three authored sections (defect, infection, failure) plus one mechanical section (observed). The engineer reads the finding and knows what's wrong, why it matters, and what happens if they ignore it. No console. No manual tracing. No reconstructing the chain.
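Structurally, that is one record with three authored string fields and one mechanical trace. A minimal sketch in Go (the type and field names here are illustrative, not Stave's actual API):

```go
// Illustrative shapes only; Stave's real types may differ.

// PropertyRead is one entry in the mechanical OBSERVED trace:
// a property path the engine consulted and the value it saw.
type PropertyRead struct {
	Path  string // e.g. "identity.role.condition"
	Value any    // e.g. nil, or a list of permission strings
}

// Finding carries the three authored sections plus the mechanical trace.
type Finding struct {
	Control     string // e.g. "CTL.IAM.ESCALATION.001"
	Asset       string // ARN of the affected resource
	Severity    string
	Defect      string         // authored: what specifically is misconfigured
	Infection   string         // authored: how the defect propagates
	Failure     string         // authored: the worst-case consequence
	Observed    []PropertyRead // mechanical: the engine's property-access trace
	Remediation string
}
```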
## 2. Authored vs. mechanical
Two kinds of content in the expanded output, with different sources:
Authored content: defect, infection, failure. Written by humans who understand the vulnerability. Stored as metadata alongside the control definition. Reviewed for accuracy and clarity. Changes when explanation quality improves or when new attack techniques emerge.
Mechanical content: observed. Generated automatically by the engine during evaluation. Every property the predicate read during evaluation is captured as a trace — property path and observed value. No authoring needed. Scales to every control automatically.
The OBSERVED section is the engine's property-access trace — Zeller's dynamic slice applied to configuration evaluation. Instead of instrumenting a program's runtime to capture variable reads, we instrument the predicate evaluator to capture observation property reads. The trace shows exactly what data the engine consulted to produce this finding.
```
OBSERVED:
storage.access.acl.grants[0].grantee =
  "http://acs.amazonaws.com/groups/global/AllUsers"
storage.access.acl.grants[0].permission = "READ"
```
An engineer reading this knows: the engine looked at the ACL grants. It found AllUsers with READ permission. That's what fired the control. No guessing about which property triggered the finding.
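Capturing that trace is cheap when every property read goes through a single accessor. A hedged sketch of the instrumentation, reusing PropertyRead from the earlier sketch (Tracer, Observation, and the example predicate are hypothetical names, not Stave's evaluator):

```go
// Hypothetical instrumentation sketch, not Stave's actual evaluator.

// Tracer accumulates every property the predicate reads.
type Tracer struct {
	reads []PropertyRead
}

// Observation hides the raw configuration behind a traced accessor,
// here a flattened map keyed by property path.
type Observation struct {
	data   map[string]any
	tracer *Tracer
}

// Get returns a property's value and records the read in the trace.
// A missing property is recorded as nil, which renders as null.
func (o *Observation) Get(path string) any {
	v := o.data[path]
	o.tracer.reads = append(o.tracer.reads, PropertyRead{Path: path, Value: v})
	return v
}

// Predicates only see the traced accessor, so every consulted
// property is captured as a side effect of evaluation.
func publicReadACL(o *Observation) bool {
	grantee, _ := o.Get("storage.access.acl.grants[0].grantee").(string)
	permission, _ := o.Get("storage.access.acl.grants[0].permission").(string)
	return grantee == "http://acs.amazonaws.com/groups/global/AllUsers" &&
		permission == "READ"
}
```

When the predicate fires, the engine can print the tracer's recorded path/value pairs verbatim as the OBSERVED section; no per-control authoring is involved.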
## 3. Why three sections, not one
An earlier design combined everything into a single context field — one paragraph explaining the finding. It didn't work. The single paragraph mixed three concerns that serve different triage needs:
| Section | Triage question | Who uses it |
|---|---|---|
| Defect | What do I look for in my config? | Engineer fixing the issue |
| Infection | Should I care about this right now? | Engineer prioritizing the backlog |
| Failure | What do I tell leadership? | Engineer reporting risk upward |
An engineer triaging 50 findings scans the failure sections first to prioritize. "Account compromise" triages before "development log exposure." Then they read the infection for the top-priority findings to decide urgency. Finally, the defect tells them where to look.
Three sections, three reading patterns. A combined paragraph forces the engineer to parse prose for the piece they need. Separate sections let them scan.
## 4. The scaling problem
Three controls authored, 675 to go. Per-control authoring doesn't scale.
The observation that saved us: within a control family, infection and failure text is nearly identical. Every CTL.S3.PUBLIC.* control shares the same infection ("public internet access to bucket contents") and the same failure ("data exposure"). Only the defect differs (which specific ACL or policy property is misconfigured).
We separated the authored content into two levels:
Family-level templates — infection and failure text shared across a family:
```yaml
# triage/families/s3_public.yaml
family: CTL.S3.PUBLIC
infection: "Anyone on the internet can access this
  bucket's contents without authentication. Automated
  scanners continuously enumerate public S3 buckets."
failure: "Data exposure. Bucket contents are readable
  or writable by the public."
```
Per-control overrides — defect text specific to each control:
```yaml
# triage/overrides/s3_public_001.yaml
control: CTL.S3.PUBLIC.001
defect: "The bucket's ACL grants read access to the
  AllUsers principal."
```
47 family templates cover 675 controls. Per-control overrides exist only when the defect needs to be specific. The engine joins them at runtime: family template provides infection and failure; override provides defect; OBSERVED is always mechanical.
This reduced the authoring burden by 14×. And it separated the two concerns — security definitions in one directory, triage content in another. Different change rates, different authors, different review processes.
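The join itself is a few lines. A sketch under the same caveat (FamilyTemplate, Override, and the lookup maps are assumed shapes, not Stave's loader):

```go
import "strings"

// Hypothetical join logic for resolving authored triage content.

type FamilyTemplate struct {
	Family    string // e.g. "CTL.S3.PUBLIC"
	Infection string
	Failure   string
}

type Override struct {
	Control string // e.g. "CTL.S3.PUBLIC.001"
	Defect  string
}

// familyOf maps a control ID like CTL.S3.PUBLIC.001
// to its family prefix CTL.S3.PUBLIC.
func familyOf(control string) string {
	if i := strings.LastIndex(control, "."); i >= 0 {
		return control[:i]
	}
	return control
}

// TriageContent joins family template and per-control override:
// the template supplies infection and failure, the override supplies
// the defect. OBSERVED is never authored; it comes from the trace.
func TriageContent(control string,
	families map[string]FamilyTemplate,
	overrides map[string]Override) (defect, infection, failure string) {

	if t, ok := families[familyOf(control)]; ok {
		infection, failure = t.Infection, t.Failure
	}
	if o, ok := overrides[control]; ok {
		defect = o.Defect
	}
	return
}
```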
## 5. The authoring guide
Authors write three things per family. Each has a specific purpose and a specific quality bar:
Defect: Specific about what's misconfigured. "The bucket's ACL grants AllUsers read access" is a defect. "Public S3 exposure" is a category. Engineers need the former to match against their own config.
Infection: How the defect propagates to enable attack. Plain language, focused on mechanism. Engineers use this to decide whether the defect matters in their specific environment. A public bucket in a CDN-only architecture is different from a public bucket holding customer PII. The infection section provides the reasoning; the engineer applies context.
Failure: Worst-case outcome in terms that matter to the business. "Data exposure" for storage issues. "Account compromise" for privilege escalation. "Regulatory violation" for compliance controls. Not CVSS scores — language that leadership understands without translation.
The quality bar: an engineer reading the three sections should be able to triage the finding without opening another tool or consulting external documentation. If the content doesn't reach that bar, it's not ready.
## 6. What changed for engineers
Before: 50 findings. Each a control ID, severity, and generic remediation. 4.8 hours of manual triage per week.
After: 50 findings. Each with a defect-infection-failure chain explaining why it matters, mechanical observed data showing what the engine found, and specific remediation. Triage is reading, not research.
The findings aren't fewer. The work per finding is smaller. An engineer scanning failure sections to prioritize, reading infection sections to decide urgency, and checking defect sections for the specific fix — that's minutes, not hours.
## When to apply this
The defect-infection-failure chain works whenever your tool reports problems that require context to act on:
- Linters that report rule violations but don't explain why the rule exists or what happens if it's violated.
- Policy engines that report policy failures but don't explain how the failure propagates through dependent infrastructure.
- Compliance scanners that report control failures but don't explain the business consequence of non-compliance.
- Infrastructure auditors that report drift but don't explain which drift matters and which is cosmetic.
If your tool's output requires engineers to research context before acting, the context belongs in the output. Zeller's chain gives you the structure to deliver it: what's wrong, how it spreads, what breaks.
## Understanding vs. detection accuracy
Tool authors optimize for detection accuracy. We tune predicates, reduce false positives, expand control catalogs. That's the hard technical work, and it matters.
But the engineer receiving 50 findings doesn't care about detection accuracy at triage time. They care about understanding: what's wrong, why it matters, what happens if I ignore it, and what should I do. If answering those questions takes hours of manual research, detection accuracy is irrelevant — the findings sit in a backlog until someone has time.
The output isn't done when the detection is correct. It's done when the engineer can act without leaving the terminal.
These lessons were learned from real problems during development of Stave, an offline configuration safety evaluator.