<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Douglas Walseth</title>
    <description>The latest articles on DEV Community by Douglas Walseth (@douglasrw).</description>
    <link>https://dev.to/douglasrw</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3830187%2F4a2f19ff-f874-4388-9647-854bd1e1898e.jpeg</url>
      <title>DEV Community: Douglas Walseth</title>
      <link>https://dev.to/douglasrw</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/douglasrw"/>
    <language>en</language>
    <item>
      <title>RSA 2026: The AI Governance Gap Nobody Is Talking About</title>
      <dc:creator>Douglas Walseth</dc:creator>
      <pubDate>Wed, 18 Mar 2026 07:46:21 +0000</pubDate>
      <link>https://dev.to/douglasrw/rsa-2026-the-ai-governance-gap-nobody-is-talking-about-28ja</link>
      <guid>https://dev.to/douglasrw/rsa-2026-the-ai-governance-gap-nobody-is-talking-about-28ja</guid>
      <description>&lt;p&gt;RSA Conference 2026 starts March 23. Every security vendor will announce AI agent governance.&lt;/p&gt;

&lt;p&gt;CrowdStrike just acquired SGNL for $740M. Okta announced "Okta for AI Agents" (GA April 30). Singulr, Lasso, Arthur AI, and Patronus are all pitching runtime detection. The AI governance market is officially hot.&lt;/p&gt;

&lt;p&gt;But nobody is talking about the gap that matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Identity Plane vs. The Behavioral Plane
&lt;/h2&gt;

&lt;p&gt;Okta's announcement is significant. They're extending enterprise IAM to treat AI agents as non-human identities — discovery, credential vaulting, universal logout, governance workflows. The stats they cite are real: 88% of orgs report AI agent security incidents, yet only 22% manage agents as identity-bearing entities.&lt;/p&gt;

&lt;p&gt;CrowdStrike's SGNL acquisition adds policy-based access governance to their security platform.&lt;/p&gt;

&lt;p&gt;Both are solving the &lt;strong&gt;identity plane&lt;/strong&gt;: Who are your agents? What can they access? What credentials do they hold?&lt;/p&gt;

&lt;p&gt;This is necessary. It is not sufficient.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happens After Authentication?
&lt;/h2&gt;

&lt;p&gt;Once an agent is authenticated, credentialed, and authorized — what governs &lt;em&gt;what it actually does&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;Consider: Your AI coding agent has the right credentials, the right permissions, and passes every identity check. Then it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Forgets a critical rule because the context window compressed it away&lt;/li&gt;
&lt;li&gt;Generates code that passes tests but introduces a subtle security vulnerability
&lt;/li&gt;
&lt;li&gt;Makes the same class of mistake it made last week, because no one encoded the fix&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Identity governance cannot prevent these. They are &lt;strong&gt;behavioral&lt;/strong&gt; problems — structural issues with how agents process context, retain rules, and learn from failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Detection Ceiling
&lt;/h2&gt;

&lt;p&gt;Every vendor at RSA 2026 selling AI governance will demo the same thing: runtime detection. An agent does something bad, the system catches it, an alert fires.&lt;/p&gt;

&lt;p&gt;The problem with detection-based governance is mathematical:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alert volume grows linearly with agent count&lt;/li&gt;
&lt;li&gt;The same class of violation can recur every day&lt;/li&gt;
&lt;li&gt;Governance teams become alert-processing bottlenecks&lt;/li&gt;
&lt;li&gt;Compliance evidence is a snapshot, not a guarantee&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After 12 months of detection-based governance, you have the same violations. You just get faster alerts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Alternative: Prevent by Construction
&lt;/h2&gt;

&lt;p&gt;What if the system structurally prevented violation classes from recurring?&lt;/p&gt;

&lt;p&gt;The enforcement ladder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;L2 (Prose):&lt;/strong&gt; A rule in documentation. Humans must remember it. This is where most governance "frameworks" stop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L3 (Template):&lt;/strong&gt; The rule is embedded in code templates. New code starts correct by default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L4 (Test):&lt;/strong&gt; The rule is checked automatically. Violations fail CI. No human needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L5 (Hook):&lt;/strong&gt; The rule is enforced at the system level. The violation literally cannot occur.&lt;/li&gt;
&lt;/ul&gt;
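&lt;p&gt;To make the ladder concrete, here is a minimal sketch of an L4 check: a plain CI test that encodes one hypothetical rule ("no eval() in application code") against a hypothetical src/ tree. The rule, paths, and pattern are illustrative stand-ins, not the production system described here. Once this runs in CI, no human has to remember the rule:&lt;/p&gt;

```python
# Minimal sketch of an L4 check: a CI test that encodes one rule.
# The banned pattern and the src/ tree are hypothetical placeholders.
import pathlib

BANNED = "eval("           # the violation class being eliminated
SRC = pathlib.Path("src")  # hypothetical application source tree

def find_violations(root):
    """Return (path, line_number) pairs where the banned call appears."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        for number, line in enumerate(path.read_text().splitlines(), 1):
            if BANNED in line and not line.lstrip().startswith("#"):
                hits.append((str(path), number))
    return hits

def test_no_eval_calls():
    # L4: CI fails on any violation, so no human review is needed.
    assert find_violations(SRC) == []
```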

&lt;p&gt;Each step up the ladder requires less human awareness, until at L4 and L5 it requires none at all. L5 enforcement means the lesson is permanent and compounding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Numbers
&lt;/h2&gt;

&lt;p&gt;We run this system in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3,700+ violations&lt;/strong&gt; processed through the enforcement ladder&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt;5% regression rate&lt;/strong&gt; — once encoded at L4+, violation classes almost never recur&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;250+ specs&lt;/strong&gt; executed by AI agents under structural enforcement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero governance team&lt;/strong&gt; — the system governs itself&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What to Ask at RSA
&lt;/h2&gt;

&lt;p&gt;When vendors pitch you AI governance this week:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;"When you detect a violation, what prevents the same class from recurring?" If the answer involves humans or dashboards — you're buying detection.&lt;/li&gt;
&lt;li&gt;"After 12 months, will we have more or fewer alerts?" If "more, because more agents" — governance scales linearly with your problem.&lt;/li&gt;
&lt;li&gt;"Does the system learn from violations structurally?" If "we update our models" — they improve detection, not governance.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Real Gap
&lt;/h2&gt;

&lt;p&gt;The companies that win the AI agent era will not have the best monitoring dashboards. They will have systems that get better every week without human intervention.&lt;/p&gt;

&lt;p&gt;Identity governance (Okta, CrowdStrike) + behavioral enforcement (structural prevention) = complete AI agent governance. Either alone is incomplete.&lt;/p&gt;

&lt;p&gt;RSA 2026 will be full of announcements about the identity side. The behavioral side — preventing violations structurally rather than detecting them — is the gap nobody is talking about.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Free governance scanner for any public repo: &lt;a href="https://walseth.ai/scan" rel="noopener noreferrer"&gt;walseth.ai/scan&lt;/a&gt;. 30 seconds, no signup.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;RSA 2026 AI Governance Hub with all vendor comparisons: &lt;a href="https://walseth.ai/rsa-2026" rel="noopener noreferrer"&gt;walseth.ai/rsa-2026&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>devops</category>
      <category>governance</category>
    </item>
    <item>
      <title>Structural Enforcement vs Singulr AI: Runtime Governance Compared</title>
      <dc:creator>Douglas Walseth</dc:creator>
      <pubDate>Wed, 18 Mar 2026 05:10:22 +0000</pubDate>
      <link>https://dev.to/douglasrw/structural-enforcement-vs-singulr-ai-runtime-governance-compared-43h0</link>
      <guid>https://dev.to/douglasrw/structural-enforcement-vs-singulr-ai-runtime-governance-compared-43h0</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Singulr AI and structural enforcement both aim to solve the same problem: making AI agents trustworthy in production. They take fundamentally different approaches. Singulr operates at runtime, detecting and responding to violations as they occur. Structural enforcement operates at the system level, making classes of violations impossible by construction.&lt;/p&gt;

&lt;p&gt;This is not a question of which product is better. It is a question of which architecture matches your needs: continuous monitoring or permanent prevention.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Singulr AI Works
&lt;/h2&gt;

&lt;p&gt;Singulr AI launched Agent Pulse in March 2026, positioning it as "enforceable runtime governance and visibility for AI agents." The platform provides:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Discovery:&lt;/strong&gt; Singulr maps a context graph of tool connections, data access, MCP servers, and permissions across your AI agent ecosystem. This gives visibility into what agents exist and what they can access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Risk Scoring:&lt;/strong&gt; The Singulr Trust Feed combines AI red-teaming with risk scoring aligned to agent type, data sensitivity, and scope. This identifies which agents pose the highest governance risk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runtime Controls:&lt;/strong&gt; Policy enforcement against unauthorized access and prompt injection, applied at runtime. Integrations cover Copilot Studio, AWS Bedrock, Azure Foundry, GCP Vertex AI, Databricks, ServiceNow, CrewAI, LangGraph, and OpenTelemetry.&lt;/p&gt;

&lt;p&gt;The strength of this approach is breadth. Singulr covers a wide range of agent frameworks and cloud platforms with a consistent governance layer. For organizations with diverse agent deployments, this visibility is genuinely valuable.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Structural Enforcement Works
&lt;/h2&gt;

&lt;p&gt;The prevent-by-construction methodology is built on the &lt;a href="https://dev.to/blog/convergence-enforcement"&gt;enforcement ladder&lt;/a&gt; -- five levels from ephemeral conversation rules (L1) to permanent pre-commit hooks (L5). The core principle: every lesson learned from a violation must be encoded at the highest possible enforcement level.&lt;/p&gt;

&lt;p&gt;When a violation is detected, the response is not an alert. It is a structural change that makes the class of violation impossible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;L3 (Template):&lt;/strong&gt; New code starts correct by default because templates embed the rule.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L4 (Test):&lt;/strong&gt; CI fails if the rule is violated. No human review needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L5 (Hook):&lt;/strong&gt; The violation is blocked at commit time. It literally cannot enter the codebase.&lt;/li&gt;
&lt;/ul&gt;
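&lt;p&gt;As a concrete illustration of L5, here is a minimal sketch of a pre-commit hook that refuses any staged file reintroducing an eliminated pattern. The banned string is a hypothetical placeholder, not a rule from the production system described in this post:&lt;/p&gt;

```python
#!/usr/bin/env python3
# Sketch of an L5 hook: block any commit that reintroduces an
# eliminated violation class. The banned string is a placeholder.
import subprocess

BANNED = "TODO: remove before prod"  # hypothetical violation class

def staged_files():
    """List files added, copied, or modified in the staged changeset."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]

def check(paths):
    """Return the subset of paths containing the banned pattern."""
    bad = []
    for path in paths:
        try:
            text = open(path, encoding="utf-8", errors="ignore").read()
        except OSError:
            continue
        if BANNED in text:
            bad.append(path)
    return bad

def main():
    offenders = check(staged_files())
    if offenders:
        print("blocked: violation class present in", ", ".join(offenders))
        return 1   # non-zero exit: the commit cannot enter the codebase
    return 0

# Installed as .git/hooks/pre-commit, the script would end with:
#     sys.exit(main())
```

&lt;p&gt;Saved as .git/hooks/pre-commit and marked executable, the script runs before every commit; a non-zero exit status blocks the commit.&lt;/p&gt;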

&lt;p&gt;In production, this approach has processed &lt;a href="https://dev.to/blog/why-detection-based-ai-governance-fails"&gt;3,700+ violations with less than 5% regression rate&lt;/a&gt; on enforced code paths. The system improves permanently with each violation -- alert volume decreases over time instead of growing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Differences
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Singulr AI&lt;/th&gt;
&lt;th&gt;Structural Enforcement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enforcement model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runtime detection and response&lt;/td&gt;
&lt;td&gt;Prevent-by-construction (hooks, tests, templates)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Violation recurrence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same violation class can recur indefinitely&lt;/td&gt;
&lt;td&gt;Each violation class is eliminated permanently&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-improvement&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No automated learning loop&lt;/td&gt;
&lt;td&gt;GEPA cycle + convergence encoding compound over time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Alert trajectory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Alert volume grows with agent scale&lt;/td&gt;
&lt;td&gt;Alert volume decreases as lessons compound&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compliance evidence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Point-in-time monitoring snapshots&lt;/td&gt;
&lt;td&gt;Structural proof that violation classes are impossible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS platform with agent framework integrations&lt;/td&gt;
&lt;td&gt;Embedded in development workflow (CI/CD, pre-commit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Integration breadth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10+ agent frameworks and cloud platforms&lt;/td&gt;
&lt;td&gt;Framework-agnostic (operates at code and commit level)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  When to Choose Each
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Choose Singulr AI when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have agents deployed across many frameworks and need unified visibility&lt;/li&gt;
&lt;li&gt;Your primary concern is discovering what agents exist and what they access&lt;/li&gt;
&lt;li&gt;You need runtime protection against external threats like prompt injection&lt;/li&gt;
&lt;li&gt;Your organization prefers SaaS platforms with vendor support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose structural enforcement when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want violations to stop recurring, not just be detected faster&lt;/li&gt;
&lt;li&gt;Your governance team is drowning in alerts and needs volume to decrease&lt;/li&gt;
&lt;li&gt;You need compliance evidence that is structural, not snapshot-based&lt;/li&gt;
&lt;li&gt;You are willing to invest in embedding governance into your development workflow&lt;/li&gt;
&lt;li&gt;You want a system that &lt;a href="https://dev.to/blog/ai-governance-that-learns"&gt;gets better autonomously&lt;/a&gt; with each violation processed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Consider both:&lt;/strong&gt; runtime detection catches the immediate threat while structural enforcement prevents the class. These are complementary architectures. Singulr tells you what happened; structural enforcement ensures it cannot happen again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The difference between detection and prevention is measurable. Run a free context engineering scan on your repository to see your current enforcement posture -- how many of your governance rules are structural (L4/L5) versus prose (L1/L2).&lt;/p&gt;

&lt;p&gt;See what structural enforcement finds that runtime monitoring misses.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/scan"&gt;Run the free scan at walseth.ai/scan&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Competitor information sourced from public product documentation and announcements as of March 2026. We aim for accuracy -- if anything here is incorrect, contact us and we will update it.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://walseth.ai/blog/structural-enforcement-vs-singulr" rel="noopener noreferrer"&gt;walseth.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>comparison</category>
      <category>singulr</category>
      <category>aigovernance</category>
      <category>ai</category>
    </item>
    <item>
      <title>Your Next AI Agent Should Cost $0 to Train</title>
      <dc:creator>Douglas Walseth</dc:creator>
      <pubDate>Wed, 18 Mar 2026 05:10:20 +0000</pubDate>
      <link>https://dev.to/douglasrw/your-next-ai-agent-should-cost-0-to-train-5ei0</link>
      <guid>https://dev.to/douglasrw/your-next-ai-agent-should-cost-0-to-train-5ei0</guid>
      <description>&lt;p&gt;&lt;em&gt;Fine-tuned domain agents on consumer hardware. The economics just changed.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The consulting pitch for custom AI agents used to start at $50,000. GPU cloud rental, data labeling, ML engineering time -- the cost structure assumed enterprise budgets. If you were a 5-person startup with a $10M seed round, custom AI was not in your budget.&lt;/p&gt;

&lt;p&gt;Two developments collapsed this cost structure in early 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Zero-Compute Stack
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Unsloth + Qwen3.5-4B&lt;/strong&gt; dropped fine-tuning requirements to 5GB VRAM. That is a consumer laptop GPU. Unsloth's custom CUDA kernels deliver 2x training speedup with 70% less memory. Combined with Qwen3.5-4B -- a model with a 256K context window, support for 201 languages, and agentic coding optimization -- you can fine-tune a production-capable model on Google Colab's free tier.&lt;/p&gt;

&lt;p&gt;No cloud GPU rental. No ML infrastructure team. No $50,000 training budget.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SCOTT and MIM-JEPA&lt;/strong&gt; solved the data problem. Traditional fine-tuning needs thousands of labeled examples. SCOTT's sparse convolutional tokenizer enables self-supervised training on datasets "orders of magnitude smaller than traditionally required." For domain-specific agents, this means your existing documentation, knowledge base, and past interactions are sufficient training data. No labeling pipeline needed.&lt;/p&gt;

&lt;p&gt;Together: zero compute cost, minimal data requirements, and a base model capable enough for domain Q&amp;amp;A, classification, and simple tool use.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Changes
&lt;/h2&gt;

&lt;p&gt;For post-seed startups running AI agents, the calculation flips. Instead of "can we afford custom AI?" the question becomes "can we afford NOT to customize?"&lt;/p&gt;

&lt;p&gt;Off-the-shelf models produce generic outputs that require constant human correction. Every time an engineer fixes a model's response to match your domain, that's a training example being wasted. With zero-cost fine-tuning, those corrections become training data that improves the model.&lt;/p&gt;
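&lt;p&gt;A sketch of that capture step, assuming a hypothetical correction log with prompt, model_output, and corrected_output fields and a chat-style JSONL target (match the schema your trainer actually expects):&lt;/p&gt;

```python
# Sketch: turn human corrections into fine-tuning examples. The record
# shape and the chat-style JSONL format are assumptions, not a fixed API.
import json

def corrections_to_jsonl(corrections, path):
    """Write each (prompt, corrected answer) pair as one training record."""
    with open(path, "w", encoding="utf-8") as f:
        for c in corrections:
            record = {
                "messages": [
                    {"role": "user", "content": c["prompt"]},
                    # train on what the engineer fixed, not what the model said
                    {"role": "assistant", "content": c["corrected_output"]},
                ]
            }
            f.write(json.dumps(record) + "\n")
    return path
```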

&lt;p&gt;The deliverable: a Qwen3.5-4B LoRA adapter trained on your domain data, running on your hardware, owned by you. No API dependency. No per-token billing. No vendor lock-in.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Missing Piece: Enforcement
&lt;/h2&gt;

&lt;p&gt;Fine-tuning solves domain knowledge. It does not solve reliability.&lt;/p&gt;

&lt;p&gt;A fine-tuned model that knows your domain terminology will still hallucinate. It will still violate constraints that matter to your business. It will work in demos and fail in production in ways that are expensive to debug.&lt;/p&gt;

&lt;p&gt;This is where the enforcement ladder changes the equation. Layer structural constraints -- L4 automated tests, L5 pre-commit hooks -- on top of the fine-tuned model's outputs. Every response passes through domain-specific assertions before delivery. Violations are caught automatically, not discovered by users.&lt;/p&gt;

&lt;p&gt;The enforcement layer is not overhead -- it is what makes the fine-tuned model production-ready.&lt;/p&gt;
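&lt;p&gt;A sketch of such an assertion gate, with two hypothetical domain rules standing in for real business constraints:&lt;/p&gt;

```python
# Sketch of a response gate: domain assertions applied to every model
# output before delivery. The rules (banned phrase, required disclaimer)
# are hypothetical stand-ins for real domain constraints.
def check_response(text):
    """Return a list of violated rule names; empty means deliverable."""
    failures = []
    if "guaranteed return" in text.lower():
        failures.append("no-financial-guarantees")
    if "not financial advice" not in text.lower():
        failures.append("missing-disclaimer")
    return failures

def deliver(text, fallback="Escalated to a human reviewer."):
    # violations are caught here, not discovered by users
    return text if check_response(text) == [] else fallback
```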

&lt;h2&gt;
  
  
  Why Not Just Use API Fine-Tuning?
&lt;/h2&gt;

&lt;p&gt;OpenAI, Anthropic, and Google all offer fine-tuning APIs. They work. But they come with trade-offs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Per-token billing.&lt;/strong&gt; Every inference call costs money. At scale, a domain-specific agent fielding hundreds of queries per day accumulates meaningful costs. A local model has zero marginal inference cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vendor lock-in.&lt;/strong&gt; Your fine-tuned weights live on their infrastructure. If pricing changes, you migrate or pay more. With a local LoRA adapter, you own the weights.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No enforcement layer.&lt;/strong&gt; API fine-tuning delivers a model. Not a model with structural quality guarantees. The gap between "fine-tuned" and "production-ready" is exactly the enforcement layer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data residency.&lt;/strong&gt; For regulated industries, sending training data to third-party APIs raises compliance questions. Local fine-tuning keeps data on your infrastructure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;API fine-tuning is the right choice for teams that need frontier reasoning capabilities. Local fine-tuning with enforcement is the right choice for teams that need domain-specific reliability at predictable cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Two-Week Engagement
&lt;/h2&gt;

&lt;p&gt;Here is what a zero-cost custom agent deployment looks like:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 1:&lt;/strong&gt; Data preparation (convert existing docs to training format), fine-tuning (LoRA training on Colab), and initial enforcement layer (L4 tests for your domain rules).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 2:&lt;/strong&gt; Validation against your test scenarios, iteration on enforcement rules, handoff of model checkpoint + deployment guide + training notebook.&lt;/p&gt;

&lt;p&gt;What you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A fine-tuned model that understands your domain terminology and workflows&lt;/li&gt;
&lt;li&gt;An enforcement layer (L4 tests) that catches domain-specific failures automatically&lt;/li&gt;
&lt;li&gt;A reproducible Jupyter notebook so your team can retrain as data evolves&lt;/li&gt;
&lt;li&gt;A deployment guide for your target infrastructure (Colab, local GPU, or cloud)&lt;/li&gt;
&lt;li&gt;A performance report comparing base model vs. fine-tuned on your test scenarios&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What it costs: consulting time. The compute is free.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;p&gt;The ideal client is a post-seed startup (5-20 people, $5M-15M raised) running AI agents that produce generic outputs requiring constant human correction. They have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Domain-specific data (documentation, knowledge base, past interactions)&lt;/li&gt;
&lt;li&gt;Engineers spending time correcting model outputs instead of building product&lt;/li&gt;
&lt;li&gt;Budget for a proof-of-concept ($5K-10K) but not enterprise ML infrastructure ($50K+)&lt;/li&gt;
&lt;li&gt;A use case where domain knowledge matters more than general reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your agents need to understand your terminology, your workflows, and your constraints, a domain-tuned model with enforcement beats a generic frontier model without it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Limitations
&lt;/h2&gt;

&lt;p&gt;A 4B parameter model has a ceiling. It will not match Claude or GPT-4 on complex multi-step reasoning. It excels at domain Q&amp;amp;A, classification, entity extraction, and simple tool use. If your use case requires sophisticated reasoning, you need a larger model or an API-based approach.&lt;/p&gt;

&lt;p&gt;The enforcement layer catches known failure modes. Novel edge cases require ongoing monitoring. The honest pitch: this is not magic. It is a practical, low-cost path from generic AI to domain-specific AI with production-grade quality gates.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://walseth.ai/blog/zero-cost-custom-agents" rel="noopener noreferrer"&gt;walseth.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>finetuning</category>
      <category>customagents</category>
      <category>unsloth</category>
      <category>devops</category>
    </item>
    <item>
      <title>Your Context Is Poisoned</title>
      <dc:creator>Douglas Walseth</dc:creator>
      <pubDate>Wed, 18 Mar 2026 02:49:39 +0000</pubDate>
      <link>https://dev.to/douglasrw/your-context-is-poisoned-3poc</link>
      <guid>https://dev.to/douglasrw/your-context-is-poisoned-3poc</guid>
      <description>&lt;p&gt;Lance Martin at LangChain published a framework for context engineering with four operations: Write, Select, Compress, and Isolate. Each operation has a failure mode. We mapped all four to real production data from our 6-agent autonomous system.&lt;/p&gt;

&lt;p&gt;The result: 4,768 violations detected. Every single one traces back to one of these four poisoned context patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure Mode 1: Stale Context (Write)
&lt;/h2&gt;

&lt;p&gt;Your agent's instructions were written three sprints ago. The API changed. The schema migrated. The agent keeps generating code against a version of reality that no longer exists.&lt;/p&gt;

&lt;p&gt;In our system, stale context is the most common violation category. Agent CLAUDE.md files reference file paths that have moved, cite constraints from specs that were superseded, and enforce patterns the codebase abandoned weeks ago.&lt;/p&gt;

&lt;p&gt;The fix is not better documentation. Documentation drifts by definition. The fix is &lt;a href="https://dev.to/blog/convergence-enforcement"&gt;structural enforcement&lt;/a&gt; -- L5 hooks that validate context freshness before the agent acts on it. When an instruction references a file path, the hook verifies the path exists. When a constraint cites a spec, the hook confirms the spec is still active.&lt;/p&gt;

&lt;p&gt;Detection tools tell you the context was stale after the agent shipped broken code. The prevent-by-construction approach is different: an L5 hook prevents the stale context from reaching the agent at all.&lt;/p&gt;
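&lt;p&gt;A minimal sketch of such a freshness gate, assuming instructions cite file paths in backticks (a hypothetical convention -- adapt the pattern to how your context files actually reference artifacts):&lt;/p&gt;

```python
# Sketch of a freshness gate: before instructions reach the agent,
# verify that every file path they reference still exists.
import pathlib
import re

# assumption: instructions quote file paths in backticks
PATH_PATTERN = re.compile(r"`([\w./-]+\.[\w]+)`")

def stale_references(instructions, root="."):
    """Return referenced paths that no longer exist under root."""
    base = pathlib.Path(root)
    missing = []
    for ref in PATH_PATTERN.findall(instructions):
        if not (base / ref).exists():
            missing.append(ref)
    return missing

def load_context(instructions, root="."):
    missing = stale_references(instructions, root)
    if missing:
        # L5 behavior: stale context never reaches the agent at all
        raise RuntimeError("stale context, missing paths: " + ", ".join(missing))
    return instructions
```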

&lt;h2&gt;
  
  
  Failure Mode 2: Missing Context (Select)
&lt;/h2&gt;

&lt;p&gt;The agent has access to 200K tokens of context window. It selects 40K of conversation history, 20K of file contents, and zero bytes of the configuration that actually matters.&lt;/p&gt;

&lt;p&gt;We track this as "context gap" violations. The agent makes a decision that would have been correct if it had read the target repo's CLAUDE.md first. It didn't, because nothing forced it to. The context was available but not selected.&lt;/p&gt;

&lt;p&gt;Our enforcement ladder addresses this with mandatory context loading. Before a coder agent touches any repo, an L5 hook verifies it has read the repo's CLAUDE.md. Before an agent responds to an operator message, it must check its inbox. These are not suggestions -- they are automated gates that block execution until the required context is loaded.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure Mode 3: Bloated Context (Compress)
&lt;/h2&gt;

&lt;p&gt;Context windows are large. 200K tokens sounds infinite until your agent has read 40 files, run 80 commands, and accumulated 150K tokens of conversation history. At that point, compression kicks in. Earlier messages get summarized or dropped.&lt;/p&gt;

&lt;p&gt;What gets dropped first? The instructions at the top. The system prompt. The behavioral constraints.&lt;/p&gt;

&lt;p&gt;We measured this directly: agents averaged 12 rule violations per day after context compression events. They kept working confidently, generating output that violated the rules they could no longer see. This is the silent failure mode -- &lt;a href="https://dev.to/blog/ai-agent-forgets-rules-compaction-fix"&gt;your agent forgets its rules every 45 minutes&lt;/a&gt; and never tells you.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://dev.to/blog/ai-agent-forgets-rules-compaction-fix"&gt;pre-compaction memory flush&lt;/a&gt; hook solves this. At 150 tool calls, it writes critical context to persistent storage before compression hits. When the agent's context gets compressed, the knowledge survives on disk.&lt;/p&gt;
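&lt;p&gt;A sketch of the flush mechanic, with the threshold and storage layout as assumptions rather than the production implementation:&lt;/p&gt;

```python
# Sketch of a pre-compaction memory flush: once the tool-call counter
# hits a threshold, critical notes are written to disk so they survive
# context compression. Threshold and file layout are assumptions.
import json
import pathlib

FLUSH_AT = 150  # tool calls before compression typically kicks in

class MemoryFlush:
    def __init__(self, store="agent_memory.json"):
        self.store = pathlib.Path(store)
        self.tool_calls = 0
        self.critical_notes = []

    def note(self, text):
        """Mark a piece of context as critical: it must survive compaction."""
        self.critical_notes.append(text)

    def on_tool_call(self):
        self.tool_calls += 1
        if self.tool_calls == FLUSH_AT:
            self.flush()

    def flush(self):
        # persist before the context window compresses the notes away
        self.store.write_text(json.dumps(self.critical_notes))

    def restore(self):
        """Reload critical notes at session start or after compaction."""
        if self.store.exists():
            return json.loads(self.store.read_text())
        return []
```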

&lt;h2&gt;
  
  
  Failure Mode 4: Leaking Context (Isolate)
&lt;/h2&gt;

&lt;p&gt;Multi-agent systems have a context isolation problem. Agent A's constraints leak into Agent B's context. Agent B's research pollutes Agent C's execution queue. Without isolation boundaries, agents influence each other in ways nobody intended.&lt;/p&gt;

&lt;p&gt;We run 6 agents with distinct roles, including coder, CEO, oracle, and communications. Each has its own memory, goals, and behavioral rules. Without isolation enforcement, the coder agent was picking up strategic directives meant for the CEO agent and making product decisions it had no authority to make.&lt;/p&gt;

&lt;p&gt;Isolation enforcement means each agent's context is structurally bounded. Cross-agent signals follow a defined routing protocol. The coder agent cannot read the CEO's mailbox. The oracle cannot modify the coder's priorities. These boundaries are enforced at the data access layer, not by asking agents to be careful.&lt;/p&gt;
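&lt;p&gt;A sketch of that data-access boundary, with a hypothetical read policy standing in for the real routing protocol:&lt;/p&gt;

```python
# Sketch of data-layer isolation: a mailbox router that enforces which
# agent may read which mailbox, instead of trusting agents to be careful.
# Role names mirror the article; the policy table itself is an assumption.
READ_POLICY = {
    "coder":  {"coder"},         # the coder cannot read the CEO mailbox
    "ceo":    {"ceo", "coder"},  # the CEO may review coder reports
    "oracle": {"oracle"},
}

class MailboxRouter:
    def __init__(self):
        self.boxes = {role: [] for role in READ_POLICY}

    def send(self, to_role, message):
        self.boxes[to_role].append(message)

    def read(self, reader, mailbox):
        if mailbox not in READ_POLICY[reader]:
            # structural boundary: the access simply does not resolve
            raise PermissionError(reader + " may not read " + mailbox)
        return list(self.boxes[mailbox])
```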

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;From our production system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;4,768&lt;/strong&gt; total violations detected across 6 agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;18&lt;/strong&gt; violations promoted to structural enforcement (L3-L5)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;477:1&lt;/strong&gt; violation-to-promotion ratio -- the real measure of &lt;a href="https://dev.to/blog/the-477-to-1-problem"&gt;self-improvement velocity&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt; 5%&lt;/strong&gt; regression rate on violations that received L5 enforcement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Detection finds violations. Enforcement makes them impossible. That is the difference between monitoring context health and engineering context health.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Your System
&lt;/h2&gt;

&lt;p&gt;If you are running AI agents in production -- coding assistants, research agents, autonomous workflows -- your context is poisoned in at least one of these four ways. You might not know it yet because the failure mode is silent: the agent keeps producing confident output from degraded context.&lt;/p&gt;

&lt;p&gt;The question is not whether your context is clean. The question is whether your system can detect and prevent context poisoning structurally, before the agent acts on it.&lt;/p&gt;

&lt;p&gt;Run a scan on your repo at &lt;a href="https://walseth.ai/scan" rel="noopener noreferrer"&gt;walseth.ai/scan&lt;/a&gt; to see where your context engineering stands.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://walseth.ai/blog/your-context-is-poisoned" rel="noopener noreferrer"&gt;walseth.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>productiondata</category>
    </item>
    <item>
      <title>Token Security Is an Innovation Sandbox Finalist. Here Is What That Means for AI Agent Governance.</title>
      <dc:creator>Douglas Walseth</dc:creator>
      <pubDate>Wed, 18 Mar 2026 02:49:34 +0000</pubDate>
      <link>https://dev.to/douglasrw/token-security-is-an-innovation-sandbox-finalist-here-is-what-that-means-for-ai-agent-governance-1m7e</link>
      <guid>https://dev.to/douglasrw/token-security-is-an-innovation-sandbox-finalist-here-is-what-that-means-for-ai-agent-governance-1m7e</guid>
      <description>&lt;h2&gt;
  
  
  The Innovation Sandbox Pick Nobody Expected
&lt;/h2&gt;

&lt;p&gt;RSAC 2026 selected Token Security as one of ten Innovation Sandbox finalists, presenting on Day 1 (March 23). The program has a strong track record -- previous winners include Wiz, Apiiro, and Abnormal Security, each worth billions today.&lt;/p&gt;

&lt;p&gt;Token Security is not a governance vendor in the traditional sense. They are building identity security purpose-built for non-human identities (NHIs) -- the AI agents, service accounts, API keys, and machine credentials that are rapidly outnumbering human users in enterprise infrastructure. Their thesis is straightforward: traditional IAM was designed for humans, and the explosion of AI agents requires a machine-first identity architecture.&lt;/p&gt;

&lt;p&gt;They raised $28M in Series A funding led by Notable Capital in January 2026, and their Innovation Sandbox selection puts them in front of the largest security audience of the year.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Token Security Actually Does
&lt;/h2&gt;

&lt;p&gt;Token Security's platform operates at the identity layer for AI agents and NHIs. Four core capabilities:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Continuous NHI Discovery&lt;/strong&gt; -- Automatically finds AI agents and non-human identities across cloud infrastructure, mapping what exists and what it connects to&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contextual Identity Graph&lt;/strong&gt; -- Maps relationships between agents, services, resources, and permissions into a queryable graph structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permission Drift Detection&lt;/strong&gt; -- Monitors when agent permissions deviate from their intended scope, catching privilege creep before it becomes a security incident&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intent-Based Access Controls&lt;/strong&gt; -- Grants and restricts access based on what agents are supposed to do, not just static role assignments&lt;/li&gt;
&lt;/ol&gt;
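&lt;p&gt;As a minimal sketch of the drift-detection idea (the function and permission names here are ours for illustration, not Token Security's API):&lt;/p&gt;

```python
# Hypothetical sketch of permission drift detection: compare what an agent
# currently holds against its intended scope. Names are illustrative only.
def detect_drift(granted, intended_scope):
    """Return permissions the agent holds beyond its intended scope."""
    return granted - intended_scope

# A billing agent that has silently accumulated extra access:
granted = {"read:invoices", "write:invoices", "read:prod_db", "delete:backups"}
intended = {"read:invoices", "write:invoices"}

detect_drift(granted, intended)  # flags read:prod_db and delete:backups
```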

&lt;p&gt;They also integrate with MCP servers, which gives them visibility into the agent toolchain layer -- what tools agents are using and what resources those tools access.&lt;/p&gt;

&lt;p&gt;Their content marketing leading into RSA has been notably aggressive: 10+ blog posts published in a single week, each targeting different segments of the NHI security narrative. This is sophisticated go-to-market execution that signals both marketing maturity and confidence in their Innovation Sandbox pitch.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Question Token Security Answers -- And the One It Does Not
&lt;/h2&gt;

&lt;p&gt;Token Security answers a critical question: &lt;strong&gt;Who are your AI agents, and what can they access?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a real problem. Most organizations have no inventory of their non-human identities. AI agents spin up with credentials that nobody tracks. Permission sprawl happens silently. When a security team asks "which agents have access to production data?", there is usually no answer.&lt;/p&gt;

&lt;p&gt;Token Security provides that answer. Their identity graph and continuous discovery solve the visibility gap that makes NHI governance impossible. This is valuable and necessary work.&lt;/p&gt;

&lt;p&gt;But there is a second question that identity governance does not address: &lt;strong&gt;What will the agent do with that access, and how do you prevent violations?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An AI agent can be fully discovered in Token Security's identity graph, have correctly scoped permissions, pass every NHI compliance check -- and still produce outputs that violate compliance policies. It can still drift from its behavioral constraints. It can still introduce governance regressions in the codebase it modifies. Identity verification ensures the right agent has the right access. It does not ensure the agent uses that access correctly.&lt;/p&gt;

&lt;p&gt;This is the identity-behavioral gap. Token Security operates at the identity layer. Behavioral enforcement operates at the constraint layer. They are different problems requiring different architectures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Identity Layer vs Behavioral Layer
&lt;/h2&gt;

&lt;p&gt;The distinction matters because the two layers prevent different failure modes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identity layer failures&lt;/strong&gt; (Token Security's domain):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unknown agents operating in production infrastructure&lt;/li&gt;
&lt;li&gt;Stale credentials with excessive permissions&lt;/li&gt;
&lt;li&gt;Permission sprawl across NHI populations&lt;/li&gt;
&lt;li&gt;No audit trail of which agents exist or what they connect to&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Behavioral layer failures&lt;/strong&gt; (Walseth AI's domain):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agents producing outputs that violate compliance policies&lt;/li&gt;
&lt;li&gt;Context drift where agent behavior diverges from intent&lt;/li&gt;
&lt;li&gt;Constraint regression when code changes weaken governance controls&lt;/li&gt;
&lt;li&gt;No structural prevention of violation classes before runtime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can have perfect identity governance and still experience behavioral failures. You can have perfect behavioral enforcement and still have NHI visibility gaps. Enterprises need both layers.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Innovation Sandbox Means for the Market
&lt;/h2&gt;

&lt;p&gt;The Innovation Sandbox selection validates that NHI security is now a first-class category at RSA, not a niche within traditional IAM. This matters for three reasons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visibility.&lt;/strong&gt; Innovation Sandbox finalists receive press coverage worth an estimated $5M+ in equivalent media value. Token Security's pitch on March 23 puts NHI identity security in front of every CISO, security architect, and enterprise buyer attending RSA. Search volume for "Token Security", "NHI security", and related terms will spike during the week.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Validation.&lt;/strong&gt; The program's track record of selecting companies that become category leaders means the judges see NHI identity as a real, fundable, scalable market. This draws more investment and more competition to the identity layer of agent governance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complementary positioning.&lt;/strong&gt; For organizations evaluating AI agent security, Token Security's Innovation Sandbox presence clarifies the two-layer architecture: identity governance (who agents are and what they can access) and behavioral enforcement (what agents do and how they comply). Teams that deploy Token Security for NHI discovery still need behavioral constraints for the agents they discover.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Two-Layer Architecture Enterprises Need
&lt;/h2&gt;

&lt;p&gt;The strongest AI agent security posture combines both layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Token Security&lt;/strong&gt; discovers every agent, maps every permission, detects every identity drift&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Behavioral enforcement&lt;/strong&gt; ensures every discovered agent actually complies with policies, maintains context integrity, and produces governance-compliant outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Neither layer alone is sufficient. Identity without behavioral constraints means you know who your agents are but cannot prevent what they do. Behavioral constraints without identity management means you govern agent behavior but cannot see your full NHI surface.&lt;/p&gt;

&lt;p&gt;Our &lt;a href="https://walseth.ai/blog/enforcement-ladder-nist-ai-rmf"&gt;enforcement ladder&lt;/a&gt; operates at five levels, from prose documentation through automated hooks, each compounding on the previous. This prevent-by-construction approach eliminates violation classes before they reach runtime -- the exact layer that identity governance does not cover.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Watch at RSA
&lt;/h2&gt;

&lt;p&gt;Token Security presents to the Innovation Sandbox judges on March 23. Watch for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How they position NHI discovery relative to existing IAM vendors (particularly &lt;a href="https://walseth.ai/vs-okta"&gt;Okta for AI Agents&lt;/a&gt;, which also targets NHI governance from the enterprise identity side)&lt;/li&gt;
&lt;li&gt;Whether their pitch addresses the behavioral gap or stays focused on identity&lt;/li&gt;
&lt;li&gt;Audience questions about what happens after agents are discovered and permissioned&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a full comparison of all AI governance vendors heading into RSA, see our &lt;a href="https://walseth.ai/vendors"&gt;vendor map&lt;/a&gt;. To see how behavioral enforcement scores your own repository, &lt;a href="https://walseth.ai/scan"&gt;run the free scanner&lt;/a&gt; or explore the &lt;a href="https://walseth.ai/leaderboard"&gt;AI governance leaderboard&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;See also: &lt;a href="https://walseth.ai/vs-token-security"&gt;Walseth AI vs Token Security&lt;/a&gt; for a detailed feature comparison.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://walseth.ai/blog/token-security-innovation-sandbox-rsa-2026" rel="noopener noreferrer"&gt;walseth.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>tokensecurity</category>
      <category>innovationsandbox</category>
      <category>rsa2026</category>
      <category>nhisecurity</category>
    </item>
    <item>
      <title>The AI Failure Tax: What Unreliable Agents Actually Cost in Financial Services</title>
      <dc:creator>Douglas Walseth</dc:creator>
      <pubDate>Wed, 18 Mar 2026 01:59:40 +0000</pubDate>
      <link>https://dev.to/douglasrw/the-ai-failure-tax-what-unreliable-agents-actually-cost-in-financial-services-2g1i</link>
      <guid>https://dev.to/douglasrw/the-ai-failure-tax-what-unreliable-agents-actually-cost-in-financial-services-2g1i</guid>
      <description>&lt;p&gt;Every AI agent failure has a cost. In financial services, those costs compound:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Direct cost:&lt;/strong&gt; The failed transaction, wrong calculation, missed compliance check&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recovery cost:&lt;/strong&gt; Human time to detect, diagnose, and fix&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trust cost:&lt;/strong&gt; Internal stakeholders lose confidence in AI adoption&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regulatory cost:&lt;/strong&gt; Audit findings, remediation plans, potential fines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We measured this across production agent systems. The numbers:&lt;/p&gt;

&lt;p&gt;A single L1-enforced rule (prose instruction in a prompt) has a ~47% violation rate under context pressure. For a financial services agent processing 1,000 decisions/day, that's ~470 potential violations.&lt;/p&gt;

&lt;p&gt;The same rule at L5 (hook enforcement) has a 0% violation rate. The agent literally cannot proceed without satisfying the check.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The math:&lt;/strong&gt; If each violation costs $50 in recovery time (conservative — most finserv incidents cost much more), moving from L1 to L5 enforcement saves $23,500/day per rule. For 10 critical rules, that's $235,000/day.&lt;/p&gt;

&lt;p&gt;This is not theoretical. This is measured production data from systems running the enforcement ladder.&lt;/p&gt;
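&lt;p&gt;The arithmetic, spelled out (the $50 recovery cost per violation is our conservative assumption):&lt;/p&gt;

```python
# The failure-tax arithmetic from the paragraphs above, spelled out.
# The $50 recovery cost per violation is an assumption labeled conservative.
decisions_per_day = 1_000
l1_violation_rate = 0.47   # prose-only rule under context pressure
l5_violation_rate = 0.0    # hook-enforced rule: the check blocks the action
cost_per_violation = 50    # USD (assumption)

violations_per_day_l1 = decisions_per_day * l1_violation_rate          # 470.0
daily_saving_per_rule = violations_per_day_l1 * cost_per_violation     # 23500.0
daily_saving_10_rules = daily_saving_per_rule * 10                     # 235000.0
```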

&lt;p&gt;&lt;strong&gt;Scan your own AI repos:&lt;/strong&gt; &lt;a href="https://walseth.ai/scan" rel="noopener noreferrer"&gt;Free Governance Scanner&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://walseth.ai/blog/ai-failure-tax-financial-services" rel="noopener noreferrer"&gt;Full analysis →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>fintech</category>
      <category>governance</category>
      <category>compliance</category>
    </item>
    <item>
      <title>Why Your AI Agent Needs a Command Center, Not Better Prompts</title>
      <dc:creator>Douglas Walseth</dc:creator>
      <pubDate>Wed, 18 Mar 2026 01:59:36 +0000</pubDate>
      <link>https://dev.to/douglasrw/why-your-ai-agent-needs-a-command-center-not-better-prompts-2c5l</link>
      <guid>https://dev.to/douglasrw/why-your-ai-agent-needs-a-command-center-not-better-prompts-2c5l</guid>
      <description>&lt;p&gt;Andrej Karpathy described the ideal AI system as a "command center" — observable, debuggable, steerable. Most agent frameworks give you none of that.&lt;/p&gt;

&lt;p&gt;Here's the gap: your agent runs 50 tasks, fails silently on 3, and you find out from a customer complaint. There's no audit trail, no enforcement of what went wrong, no way to prevent it next time.&lt;/p&gt;

&lt;p&gt;The enforcement ladder approach gives agents 5 levels of structural control:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;L1 (Prose):&lt;/strong&gt; Instructions in CLAUDE.md — easily ignored&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L2 (Convention):&lt;/strong&gt; Naming patterns, file structure — fragile&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L3 (Template):&lt;/strong&gt; Structured output formats — moderate enforcement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L4 (Test):&lt;/strong&gt; Automated verification — catches violations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L5 (Hook):&lt;/strong&gt; Pre-commit/pre-deploy automation — prevents violations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight: L1 (prose instructions) fails ~47% of the time under context pressure. L5 (hooks) fails 0% — the code literally cannot execute if the check fails.&lt;/p&gt;
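&lt;p&gt;A minimal sketch of what an L5 hook looks like in practice (the forbidden markers and the git plumbing here are illustrative, not a specific product's implementation):&lt;/p&gt;

```python
# Minimal sketch of an L5 pre-commit hook: scan the staged diff for hard
# constraints and abort the commit on any hit. Markers are illustrative.
import subprocess
import sys

FORBIDDEN = ["api_key =", "TODO: remove before prod"]  # example hard constraints

def check_diff(diff):
    """Return the forbidden markers present in a staged diff."""
    return [marker for marker in FORBIDDEN if marker in diff]

def run_hook():
    """Entry point wired into .git/hooks/pre-commit."""
    diff = subprocess.run(
        ["git", "diff", "--cached"], capture_output=True, text=True
    ).stdout
    hits = check_diff(diff)
    if hits:
        print("Commit rejected, found:", hits)
        return 1  # non-zero exit aborts the commit: the violation cannot land
    return 0

# The check itself is plain string matching over the staged diff:
check_diff("+ api_key = 'sk-live-123'")  # ["api_key ="] -- commit would be blocked
```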

&lt;p&gt;When Karpathy talks about a command center, this is what he means: structural enforcement that doesn't depend on the model reading its instructions correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try it yourself:&lt;/strong&gt; &lt;a href="https://walseth.ai/scan" rel="noopener noreferrer"&gt;Free AI Governance Scanner&lt;/a&gt; — paste any GitHub repo and get a scored assessment in 30 seconds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://walseth.ai/blog/karpathy-command-center" rel="noopener noreferrer"&gt;Full analysis on our blog →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>governance</category>
      <category>devops</category>
    </item>
    <item>
      <title>How the Enforcement Ladder Maps to Anthropic's Context Engineering Framework</title>
      <dc:creator>Douglas Walseth</dc:creator>
      <pubDate>Wed, 18 Mar 2026 01:32:06 +0000</pubDate>
      <link>https://dev.to/douglasrw/how-the-enforcement-ladder-maps-to-anthropics-context-engineering-framework-3284</link>
      <guid>https://dev.to/douglasrw/how-the-enforcement-ladder-maps-to-anthropics-context-engineering-framework-3284</guid>
      <description>&lt;h2&gt;
  
  
  Anthropic Published the Playbook. We Already Ran It.
&lt;/h2&gt;

&lt;p&gt;Last week Anthropic released "Effective Context Engineering for AI Agents" — their official guide to managing the tokens that flow through production AI systems. It immediately became the most-cited reference in the agent engineering space.&lt;/p&gt;

&lt;p&gt;Reading it felt like looking in a mirror.&lt;/p&gt;

&lt;p&gt;Their core framework — what they call "Right Altitude" — describes a spectrum from over-specified prose (brittle, breaks on edge cases) to structural constraints (robust, self-enforcing). They argue that the right level of abstraction determines whether your agent system compounds or collapses.&lt;/p&gt;

&lt;p&gt;We've been running exactly this hierarchy in production since September 2025. We call it the enforcement ladder. Five levels, from conversation to pre-commit hooks, each encoding lessons at increasing durability.&lt;/p&gt;

&lt;p&gt;The mapping isn't approximate. It's exact.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Technical Mapping
&lt;/h2&gt;

&lt;p&gt;Anthropic's guide identifies four core operations for context engineering: &lt;strong&gt;Write&lt;/strong&gt; (add information), &lt;strong&gt;Select&lt;/strong&gt; (choose what enters the window), &lt;strong&gt;Compress&lt;/strong&gt; (reduce without losing signal), and &lt;strong&gt;Isolate&lt;/strong&gt; (separate concerns into independent contexts).&lt;/p&gt;

&lt;p&gt;Our enforcement ladder implements all four — plus a fifth operation Anthropic acknowledges but doesn't systematize: &lt;strong&gt;Verify&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Anthropic Concept&lt;/th&gt;
&lt;th&gt;Enforcement Ladder&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tool design constraints&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;L5: Hooks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pre-commit hooks, automated scanners, CI gates. Hard constraints that reject bad context before it enters the system. Anthropic's guide says "tool design &amp;gt; verbose instructions." We agree — and we have 3,700+ violation records proving hooks catch what instructions miss.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compaction with structured recall&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;L4: Tests&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Automated tests that verify context survives compression. When Claude auto-compacts your 200K context to 40K, do the critical facts survive? L4 tests catch context rot before it causes downstream failures.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structured note-taking patterns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;L3: Templates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Standardized formats (WARM files, completion reports, spec templates) that ensure critical information is written in machine-parseable structure. Manus independently discovered the same pattern — their &lt;code&gt;todo.md&lt;/code&gt; recitation is L3 enforcement.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Brittle extreme" (over-specified prose)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;L2: Prose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Natural language instructions in CLAUDE.md files. Anthropic explicitly warns this is the weakest form of context management. We track it as a failure mode: if a lesson can only be encoded as prose, we document why structural enforcement was impossible.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The hierarchy isn't arbitrary. Each level is strictly more durable than the one below it. A hook (L5) survives context compaction, developer turnover, and model upgrades. A prose instruction (L2) survives none of those.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Layer Anthropic Left Out: Verification
&lt;/h2&gt;

&lt;p&gt;Anthropic's guide focuses on getting the right context into the window. That's necessary but not sufficient.&lt;/p&gt;

&lt;p&gt;The missing piece is &lt;strong&gt;closed-loop verification&lt;/strong&gt; — systematically checking whether your context engineering actually worked. Not "did the model generate output?" but "did the output respect the constraints the context was supposed to enforce?"&lt;/p&gt;

&lt;p&gt;In our system, this is the violation database. Every time an agent's output contradicts an encoded constraint, we record it: which rule, which agent, what happened, what the intended behavior was. 3,706 violations logged across 960+ commits. Each violation is a data point that feeds back into the enforcement ladder — promoting patterns from L2 prose up to L5 hooks when they fail repeatedly.&lt;/p&gt;
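&lt;p&gt;A simplified sketch of such a violation record and the promotion rule (field names are illustrative, not the production schema):&lt;/p&gt;

```python
# Illustrative violation record and promotion rule. The point is that each
# violation is structured data that can drive promotion up the ladder.
from dataclasses import dataclass
from collections import Counter

@dataclass
class Violation:
    rule_id: str
    agent: str
    observed: str
    intended: str

log = []

def record(rule_id, agent, observed, intended):
    log.append(Violation(rule_id, agent, observed, intended))

def promotion_candidates(threshold=3):
    """Rules violated threshold+ times get promoted from prose toward hooks."""
    counts = Counter(v.rule_id for v in log)
    return sorted(rule for rule, n in counts.items() if n >= threshold)

# Three violations of the same rule trigger a promotion review:
for _ in range(3):
    record("no-prod-writes", "deploy-agent", "wrote to prod", "staging only")
record("cite-sources", "research-agent", "uncited claim", "cite or omit")

promotion_candidates()  # ["no-prod-writes"]
```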

&lt;p&gt;Anthropic hints at this with their mention of "evaluative signals," but they don't prescribe a systematic feedback loop. The OpenClaw-RL paper (arxiv 2603.10165) quantifies why this matters: combined evaluative + directive signals produce 4.8x improvement over 16 iterations compared to evaluative signals alone.&lt;/p&gt;

&lt;p&gt;Context engineering without verification is like writing tests but never running them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Manus Confirms
&lt;/h2&gt;

&lt;p&gt;The Manus team published their own context engineering lessons the same week. Their production agent — handling real user tasks at scale — independently validated the same hierarchy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Logit masking&lt;/strong&gt; (hard constraint on token generation) maps to L5 hooks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic tool removal&lt;/strong&gt; (soft constraint) maps to L3/L4&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File system as extended memory&lt;/strong&gt; maps to our WARM files (persistent context that survives compaction)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;todo.md recitation&lt;/strong&gt; (re-reading structured state each step) maps to L3 templates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Their key insight: "No amount of raw capability replaces memory, environment, and feedback." That's the enforcement ladder thesis in one sentence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters for Your Production Agents
&lt;/h2&gt;

&lt;p&gt;If you're running AI agents in production — or planning to — Anthropic's guide gives you the vocabulary. The enforcement ladder gives you the implementation.&lt;/p&gt;

&lt;p&gt;Three concrete actions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Audit your context altitude.&lt;/strong&gt; How much of your agent's behavior depends on prose instructions (L2) vs. structural constraints (L5)? The ratio predicts failure rate. In our system, every lesson starts as prose and gets promoted up the ladder. If a constraint has been violated 3+ times as prose, it must be promoted to L4 or L5.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Build the verification loop.&lt;/strong&gt; Log every time your agent violates an expected constraint. Not just errors — constraint violations. The difference matters. An error is "the code crashed." A violation is "the code ran fine but ignored the rule that says don't modify production data." Violations are invisible without explicit checking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Measure context survival across compaction.&lt;/strong&gt; Anthropic's guide discusses context rot — information loss when the context window compresses. Test this explicitly. Write a critical fact into your agent's context, trigger compaction, then check if the agent still knows the fact. Our data shows ~40% of L2 prose doesn't survive a single compaction cycle. L5 hooks survive indefinitely because they're encoded in the file system, not the context window.&lt;/p&gt;
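&lt;p&gt;A toy version of that compaction-survival test (the &lt;code&gt;compact&lt;/code&gt; stand-in here is naive truncation; a real test would exercise your stack's actual summarization path):&lt;/p&gt;

```python
# A toy compaction-survival check (L4-style test). `compact` is a naive
# truncation stand-in; a real test would call your stack's summarizer.
def compact(context, keep_chars):
    return context[:keep_chars]  # worst case for fact survival

def fact_survives(context, fact, keep_chars):
    return fact in compact(context, keep_chars)

critical_fact = "NEVER write to the production database."
context = ("task history filler. " * 500) + critical_fact

fact_survives(context, critical_fact, keep_chars=len(context))  # True: no compaction
fact_survives(context, critical_fact, keep_chars=2000)          # False: fact compacted away
```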

&lt;h2&gt;
  
  
  The Competitive Window
&lt;/h2&gt;

&lt;p&gt;Three things happened in March 2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anthropic published official context engineering guidance&lt;/li&gt;
&lt;li&gt;OpenAI acquired Promptfoo (AI testing/security)&lt;/li&gt;
&lt;li&gt;Microsoft announced E7 general availability for May&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The platform vendors are converging on "context integrity" as a feature. The window for independent practitioners to establish methodology ownership is narrowing.&lt;/p&gt;

&lt;p&gt;The enforcement ladder isn't a product pitch. It's a framework that maps 1:1 to what Anthropic recommends, extends it with verification, and has 6 months of production data behind it. If you're evaluating how to manage context for your agents, the question isn't whether to use a hierarchy like this — it's whether to build it yourself or adopt one that's already been battle-tested.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Run a free context health scan on your repository at &lt;a href="https://walseth.ai/scan" rel="noopener noreferrer"&gt;walseth.ai/scan&lt;/a&gt;. See how your project's enforcement structure maps to Anthropic's framework — in 30 seconds, no signup required.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>governance</category>
      <category>agents</category>
      <category>anthropic</category>
    </item>
    <item>
      <title>What Okta's Entry Into Agent Governance Means for Enterprises</title>
      <dc:creator>Douglas Walseth</dc:creator>
      <pubDate>Wed, 18 Mar 2026 00:50:08 +0000</pubDate>
      <link>https://dev.to/douglasrw/what-oktas-entry-into-agent-governance-means-for-enterprises-4bka</link>
      <guid>https://dev.to/douglasrw/what-oktas-entry-into-agent-governance-means-for-enterprises-4bka</guid>
      <description>&lt;h2&gt;
  
  
  The Biggest Enterprise Entry Into Agent Governance
&lt;/h2&gt;

&lt;p&gt;On March 16, 2026, Okta announced "Okta for AI Agents" at Okta Showcase -- the most significant enterprise entry into AI agent governance to date. General availability is set for April 30, one week before RSA Conference closes.&lt;/p&gt;

&lt;p&gt;The timing is not accidental. Okta is planting its flag in the agent governance space before the largest security conference of the year, and before the rest of the industry can react.&lt;/p&gt;

&lt;p&gt;Here is what they announced, what it means, and the critical gap it does not address.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Okta Built
&lt;/h2&gt;

&lt;p&gt;Okta for AI Agents extends enterprise IAM to treat AI agents as first-class non-human identities. The platform has seven key capabilities:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Shadow AI Discovery&lt;/strong&gt; -- Detects unauthorized AI agents connected to enterprise apps via their Identity Security Posture Management (ISPM) engine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Universal Directory for Agents&lt;/strong&gt; -- Registers agents as non-human identities with full lifecycle management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Gateway&lt;/strong&gt; -- Centralized control plane with a virtual MCP server that aggregates tools from the Okta MCP registry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privileged Credential Management&lt;/strong&gt; -- Vaults and rotates agent credentials automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Universal Logout for AI Agents&lt;/strong&gt; -- Instant kill switch that revokes all tokens for a rogue agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance Workflows&lt;/strong&gt; -- Brings agents into standard certification and access-review workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SIEM Integration&lt;/strong&gt; -- Agent activity logged and forwarded to enterprise SIEM&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Integration partners at launch include Boomi, DataRobot, and Google Vertex AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers That Drove This
&lt;/h2&gt;

&lt;p&gt;Okta cited four statistics that explain why they are entering this space now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;88% of organizations&lt;/strong&gt; report suspected or confirmed AI agent security incidents&lt;/li&gt;
&lt;li&gt;Only &lt;strong&gt;22%&lt;/strong&gt; treat agents as independent, identity-bearing entities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;80%&lt;/strong&gt; have experienced unintended agent behavior&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;23%&lt;/strong&gt; report credential exposure from agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first and third numbers are the most telling. Nearly nine in ten organizations have had agent security incidents. Four in five have experienced unintended agent behavior. These are not edge cases. This is the baseline reality of deploying AI agents in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Okta Gets Right
&lt;/h2&gt;

&lt;p&gt;Okta is solving a real problem at the right layer for their expertise. Enterprise organizations genuinely do not know how many AI agents are operating in their environment, what those agents can access, or how to shut them down when something goes wrong.&lt;/p&gt;

&lt;p&gt;The Universal Logout capability alone is worth the price of entry. When 23% of organizations have experienced credential exposure from agents, an instant kill switch is not a nice-to-have -- it is table stakes for production deployments.&lt;/p&gt;

&lt;p&gt;Shadow AI discovery addresses the visibility gap that plagues every enterprise we talk to. You cannot govern what you cannot see, and Okta's ISPM engine is well-positioned to find agents that IT does not know about.&lt;/p&gt;

&lt;p&gt;The Agent Gateway with MCP integration is forward-looking. By aggregating agent tools through a centralized control plane, Okta creates a single point of policy enforcement for agent access. This is architecturally sound for the identity layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gap: Identity Does Not Equal Behavior
&lt;/h2&gt;

&lt;p&gt;Here is what Okta's announcement does not cover: what agents actually do once they have been authenticated.&lt;/p&gt;

&lt;p&gt;An agent can be fully registered in Okta's Universal Directory, properly credentialed via their Privileged Credential Management, and monitored through their SIEM integration. That agent can still:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Produce outputs that violate compliance policies&lt;/strong&gt; -- identity verification does not constrain agent output quality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drift from its behavioral constraints&lt;/strong&gt; -- credential management does not enforce context integrity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Introduce governance regressions&lt;/strong&gt; in the codebases it modifies -- access control does not prevent structural violations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinate or generate non-compliant content&lt;/strong&gt; -- the Agent Gateway controls what tools an agent can access, not how it uses them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the identity-behavioral governance gap. Okta secures the identity plane. The behavioral plane -- what agents do, how they comply, whether their outputs meet governance standards -- requires prevent-by-construction enforcement: structural constraints that make violations impossible regardless of identity controls.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Layers, One Problem
&lt;/h2&gt;

&lt;p&gt;Think of it as two layers of the same stack:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Identity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Is this agent authorized to act?&lt;/td&gt;
&lt;td&gt;Okta for AI Agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Behavioral&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Is this agent acting correctly?&lt;/td&gt;
&lt;td&gt;Structural enforcement&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 88% incident rate exists because most organizations have neither layer. Okta entering the space means enterprises will soon have the identity layer. The behavioral layer -- enforcement ladders, context integrity checks, constraint automation -- is what we build.&lt;/p&gt;

&lt;p&gt;These layers are complementary. Identity governance without behavioral enforcement catches unauthorized agents but misses compliant-but-incorrect behavior. Behavioral enforcement without identity governance catches violations but cannot revoke agent access when needed.&lt;/p&gt;

&lt;p&gt;The strongest governance posture uses both.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for the Market
&lt;/h2&gt;

&lt;p&gt;Three implications of Okta's entry:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Market validation.&lt;/strong&gt; When a $15B identity company builds an AI agent governance product, it confirms that agent governance is enterprise-critical infrastructure, not a niche concern. Every conversation we have with prospects about whether AI governance is "real" just got easier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Category creation.&lt;/strong&gt; Okta's announcement creates the "AI agent governance" category in enterprise security. Search volume for "okta ai agents", "ai agent governance", and "ai agent security" will spike around RSA. This lifts all boats in the space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The complementary gap becomes obvious.&lt;/strong&gt; As enterprises deploy Okta for agent identity, they will immediately discover that identity alone does not prevent behavioral incidents. The 80% who have experienced unintended agent behavior will not see that number drop just because agents have better credentials.&lt;/p&gt;

&lt;h2&gt;
  
  
  For Teams Evaluating Agent Governance
&lt;/h2&gt;

&lt;p&gt;If you are evaluating AI agent governance solutions ahead of RSA Conference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start with identity&lt;/strong&gt; if you do not know how many AI agents operate in your environment. Okta's shadow discovery solves the visibility problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start with behavioral enforcement&lt;/strong&gt; if you are building AI agent systems and your agents are producing incorrect, non-compliant, or inconsistent outputs. &lt;a href="https://walseth.ai/scan"&gt;Run our free scanner&lt;/a&gt; to see where your governance gaps are.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan for both&lt;/strong&gt; if you are deploying AI agents in regulated industries. EU AI Act compliance, NIST AI RMF alignment, and SOC 2 requirements span both identity and behavioral governance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a detailed comparison of how the two approaches differ, see our &lt;a href="https://dev.to/vs-okta"&gt;Walseth AI vs Okta comparison page&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Okta's entry is good for the entire AI agent governance space. They bring enterprise credibility, existing customer relationships, and a serious engineering effort to the identity layer.&lt;/p&gt;

&lt;p&gt;The behavioral enforcement layer -- what agents do, how they comply, whether their outputs meet governance standards -- remains the unsolved half of the problem. That is where structural enforcement, enforcement ladders, and context engineering operate.&lt;/p&gt;

&lt;p&gt;Okta governs who your agents are. We govern what they do. Enterprises need both.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://walseth.ai/blog/okta-enters-agent-governance" rel="noopener noreferrer"&gt;walseth.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>okta</category>
      <category>agentgovernance</category>
      <category>identity</category>
      <category>competitiveanalysis</category>
    </item>
    <item>
      <title>AI Governance Leaderboard: We Scanned 21 Top Repos Before RSA 2026</title>
      <dc:creator>Douglas Walseth</dc:creator>
      <pubDate>Wed, 18 Mar 2026 00:50:03 +0000</pubDate>
      <link>https://dev.to/douglasrw/ai-governance-leaderboard-we-scanned-21-top-repos-before-rsa-2026-3k3p</link>
      <guid>https://dev.to/douglasrw/ai-governance-leaderboard-we-scanned-21-top-repos-before-rsa-2026-3k3p</guid>
      <description>&lt;h1&gt;
  
  
  AI Governance Leaderboard: We Scanned 21 Top Repos Before RSA 2026
&lt;/h1&gt;

&lt;p&gt;RSA Conference 2026 starts March 23. Every AI security vendor will be on stage talking about governance, compliance, and responsible AI. We wanted to see what governance actually looks like in the repos people are shipping.&lt;/p&gt;

&lt;p&gt;So we scanned 21 of the most popular AI/ML repositories using the same &lt;a href="https://walseth.ai/scan" rel="noopener noreferrer"&gt;governance scanner&lt;/a&gt; anyone can run for free. No manual review. No subjective scoring. Just structural analysis of what each repo enforces automatically.&lt;/p&gt;

&lt;p&gt;The results are not great.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;21 repos scanned&lt;/strong&gt; across AI agent frameworks, ML libraries, web frameworks, and AI SDKs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Average score: 53/100&lt;/strong&gt; (grade C)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only 2 repos (10%)&lt;/strong&gt; score 70+ and are on track for EU AI Act readiness&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only 6 repos (29%)&lt;/strong&gt; have any AI governance configuration (CLAUDE.md or .cursorrules)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 repo&lt;/strong&gt; scored an F&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://dev.to/leaderboard"&gt;View the full interactive leaderboard&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Top 5
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Repository&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Grade&lt;/th&gt;
&lt;th&gt;EU AI Act&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;a href="https://walseth.ai/scan?url=https://github.com/vllm-project/vllm" rel="noopener noreferrer"&gt;vllm-project/vllm&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;78&lt;/td&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;On track&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;a href="https://walseth.ai/scan?url=https://github.com/BerriAI/litellm" rel="noopener noreferrer"&gt;BerriAI/litellm&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;On track&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;a href="https://walseth.ai/scan?url=https://github.com/Significant-Gravitas/AutoGPT" rel="noopener noreferrer"&gt;Significant-Gravitas/AutoGPT&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;Gaps identified&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;a href="https://walseth.ai/scan?url=https://github.com/fastapi/fastapi" rel="noopener noreferrer"&gt;fastapi/fastapi&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;Gaps identified&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;a href="https://walseth.ai/scan?url=https://github.com/langchain-ai/langchain" rel="noopener noreferrer"&gt;langchain-ai/langchain&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;61&lt;/td&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;Gaps identified&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;vLLM leads the pack at 78/100 with pre-commit hooks, 7 CI/CD workflows, a security policy, and Dependabot. Its one critical finding: 2 .env files committed to source control.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom 3
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Repository&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Grade&lt;/th&gt;
&lt;th&gt;EU AI Act&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;&lt;a href="https://walseth.ai/scan?url=https://github.com/ollama/ollama" rel="noopener noreferrer"&gt;ollama/ollama&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;36&lt;/td&gt;
&lt;td&gt;D&lt;/td&gt;
&lt;td&gt;Not ready&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;&lt;a href="https://walseth.ai/scan?url=https://github.com/microsoft/autogen" rel="noopener noreferrer"&gt;microsoft/autogen&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;D&lt;/td&gt;
&lt;td&gt;Not ready&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;&lt;a href="https://walseth.ai/scan?url=https://github.com/yoheinakajima/babyagi" rel="noopener noreferrer"&gt;yoheinakajima/babyagi&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;Not ready&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;BabyAGI's 17/100 is the lowest score in the set. No CI/CD pipeline, no enforcement hooks, no security policy, no governance config. It scores points only for having a test directory and basic project hygiene.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern: CI/CD Without Enforcement
&lt;/h2&gt;

&lt;p&gt;The most striking finding across all 21 repos: &lt;strong&gt;nearly every project has CI/CD, but almost none enforce rules structurally.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most repos scored 15/15 on CI/CD. They have GitHub Actions. They run tests in the pipeline. That part of modern software development is well-adopted.&lt;/p&gt;

&lt;p&gt;But enforcement -- pre-commit hooks, commit-lint, CODEOWNERS, branch protection -- averages only 11/30 across all repos. This is the gap. Rules exist in documentation but are not structurally enforced before code enters the pipeline.&lt;/p&gt;

&lt;p&gt;This is exactly what we call the "detection gap" in the &lt;a href="https://dev.to/blog/convergence-enforcement"&gt;enforcement ladder&lt;/a&gt; framework. You can detect violations in CI, but by then the code is already committed. Structural enforcement catches problems before they enter the system.&lt;/p&gt;
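
&lt;p&gt;To make the enforcement side concrete: commit-boundary enforcement is usually wired through a pre-commit configuration. A minimal, illustrative example using real hook IDs from the standard pre-commit-hooks project (the &lt;code&gt;rev&lt;/code&gt; is a placeholder to pin to a current release before use):&lt;/p&gt;

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0                      # placeholder -- pin to a current release
    hooks:
      - id: detect-private-key       # blocks committed credentials
      - id: check-added-large-files  # blocks accidental large blobs
      - id: end-of-file-fixer        # baseline hygiene
```

&lt;p&gt;Once installed with &lt;code&gt;pre-commit install&lt;/code&gt;, these checks run before each commit is created, rather than after the code has already landed in CI.&lt;/p&gt;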

&lt;h2&gt;
  
  
  AI Governance Is Nearly Absent
&lt;/h2&gt;

&lt;p&gt;Only 6 of 21 repos (29%) have any AI governance configuration -- a CLAUDE.md file or .cursorrules. This means that in 71% of the most popular AI/ML repos, AI coding tools operate with zero structural guidance.&lt;/p&gt;

&lt;p&gt;When a developer uses Cursor, Claude Code, or GitHub Copilot on these repos, the AI has no project-specific rules to follow. No constraints on what it can modify. No enforced patterns. The governance score for these repos on this dimension: 0/15.&lt;/p&gt;

&lt;p&gt;The repos that do have governance configs: vLLM, LiteLLM, AutoGPT, LangChain, Transformers, and LocalAI.&lt;/p&gt;
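
&lt;p&gt;For the 71% without any config, even a small file closes the gap. A hypothetical starting-point CLAUDE.md (the paths and rules below are illustrative, not taken from any of the scanned repos):&lt;/p&gt;

```markdown
# Project rules for AI coding tools
- Run the full test suite before committing; never commit on red.
- Do not modify files under migrations/ or generated/ (illustrative paths).
- Follow the existing module layout; ask before adding dependencies.
```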

&lt;h2&gt;
  
  
  What the Scores Mean
&lt;/h2&gt;

&lt;p&gt;Our scanner evaluates 6 dimensions (100 points total):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enforcement (30 pts):&lt;/strong&gt; Pre-commit hooks, commit-lint, CODEOWNERS, branch protection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD (15 pts):&lt;/strong&gt; GitHub Actions, Travis CI, CircleCI workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security (20 pts):&lt;/strong&gt; Security policy, .gitignore, no committed .env files, Dependabot/Renovate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing (10 pts):&lt;/strong&gt; Test configuration files, test directories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance (15 pts):&lt;/strong&gt; CLAUDE.md, .cursorrules, governance directories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hygiene (10 pts):&lt;/strong&gt; README, CONTRIBUTING, LICENSE, CHANGELOG, lockfiles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Grades: A (80+), B (60-79), C (40-59), D (20-39), F (below 20).&lt;/p&gt;
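
&lt;p&gt;As a sanity check on the rubric, the scoring and grading reduce to a few lines. This is a sketch using the caps and cutoffs above; the function and key names are ours, not the scanner's:&lt;/p&gt;

```python
# Dimension caps and grade cutoffs are taken from the published rubric;
# everything else here is an illustrative reimplementation.
DIMENSION_CAPS = {
    "enforcement": 30,
    "cicd": 15,
    "security": 20,
    "testing": 10,
    "governance": 15,
    "hygiene": 10,
}

def total_score(points: dict) -> int:
    # Clamp each dimension to its cap so a repo cannot exceed 100.
    return sum(min(points.get(dim, 0), cap) for dim, cap in DIMENSION_CAPS.items())

def grade(score: int) -> str:
    # A (80+), B (60-79), C (40-59), D (20-39), F (below 20)
    for cutoff, letter in ((80, "A"), (60, "B"), (40, "C"), (20, "D")):
        if score >= cutoff:
            return letter
    return "F"
```

&lt;p&gt;Under this rubric, vLLM's 78 is a B two points shy of an A, and BabyAGI's 17 falls below the D floor.&lt;/p&gt;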

&lt;h2&gt;
  
  
  Category Breakdown
&lt;/h2&gt;

&lt;h3&gt;
  
  
  AI Agent Frameworks (8 repos, avg 47/100)
&lt;/h3&gt;

&lt;p&gt;The agent frameworks -- the repos building autonomous AI systems -- scored the lowest as a category. AutoGPT leads at 68, but BabyAGI (17), AutoGen (30), and SuperAGI (41) drag the average down. These are the repos building systems that make autonomous decisions, and they have the least governance infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  ML Libraries (3 repos, avg 62/100)
&lt;/h3&gt;

&lt;p&gt;vLLM (78) lifts this category. scikit-learn and Transformers both score 54 -- solid CI/CD and testing, but weak on enforcement and governance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Web Frameworks (3 repos, avg 58/100)
&lt;/h3&gt;

&lt;p&gt;FastAPI (62), Pydantic (59), Django (54). These established projects have mature CI/CD but mostly lack AI governance configs and full enforcement tooling.&lt;/p&gt;

&lt;h3&gt;
  
  
  AI SDKs (4 repos, avg 56/100)
&lt;/h3&gt;

&lt;p&gt;The Anthropic SDK (55), OpenAI SDK (53), LlamaIndex (58), and DSPy (56) cluster tightly in the C range. The Anthropic SDK notably has no pre-commit hooks despite being from the company that makes Claude.&lt;/p&gt;

&lt;h3&gt;
  
  
  Local AI / Inference (3 repos, avg 53/100)
&lt;/h3&gt;

&lt;p&gt;LiteLLM (72) stands out. Ollama (36) is the weakest -- no enforcement hooks, no test infrastructure detected, and no governance config.&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;All scans were run on March 16, 2026 using the &lt;a href="https://walseth.ai/scan" rel="noopener noreferrer"&gt;Walseth AI Governance Scanner&lt;/a&gt; -- the same tool available for free at walseth.ai/scan. Scores are point-in-time snapshots based on the default branch at scan time.&lt;/p&gt;

&lt;p&gt;The scanner analyzes the file tree of each repository via the GitHub API. It checks for the presence of specific files and directories that indicate structural governance. It does not read file contents beyond filenames and paths.&lt;/p&gt;

&lt;p&gt;Repos that fail to scan (private, rate-limited, or not found) are excluded. All 21 repos in this leaderboard scanned successfully.&lt;/p&gt;
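
&lt;p&gt;The presence-check approach can be sketched in a few lines. This is not the scanner's actual code; the marker paths are illustrative examples of the kinds of files it looks for, applied to a path list such as the one returned by the GitHub trees API:&lt;/p&gt;

```python
# Illustrative presence-based checks: given a repo's file paths, flag
# which governance artifacts exist. Marker lists are assumptions.
GOVERNANCE_MARKERS = {
    "pre_commit": [".pre-commit-config.yaml"],
    "codeowners": ["CODEOWNERS", ".github/CODEOWNERS", "docs/CODEOWNERS"],
    "security_policy": ["SECURITY.md", ".github/SECURITY.md"],
    "ai_config": ["CLAUDE.md", ".cursorrules"],
}

def check_markers(paths: list) -> dict:
    # Filename/path matching only -- no file contents are read,
    # mirroring the scanner's stated methodology.
    present = set(paths)
    return {name: any(p in present for p in candidates)
            for name, candidates in GOVERNANCE_MARKERS.items()}
```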

&lt;h2&gt;
  
  
  What Would It Take to Score an A?
&lt;/h2&gt;

&lt;p&gt;No repo in this scan scored an A (80+). To get there, a project would need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pre-commit hooks AND commit-lint AND CODEOWNERS (25/30 enforcement)&lt;/li&gt;
&lt;li&gt;3+ CI/CD workflows (15/15)&lt;/li&gt;
&lt;li&gt;Security policy + Dependabot + no committed .env files (17-20/20)&lt;/li&gt;
&lt;li&gt;Test config + test directories (10/10)&lt;/li&gt;
&lt;li&gt;CLAUDE.md or .cursorrules + governance directory (15/15)&lt;/li&gt;
&lt;li&gt;README + CONTRIBUTING + LICENSE + lockfile (8-10/10)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tooling exists. The patterns are well-understood. Most projects just have not prioritized structural enforcement alongside their CI/CD pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scan Your Own Repo
&lt;/h2&gt;

&lt;p&gt;Every score in this leaderboard was generated by the same free scanner you can run right now:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://walseth.ai/scan" rel="noopener noreferrer"&gt;Scan your repo free at walseth.ai/scan&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Want a deeper analysis? Our &lt;a href="https://buy.stripe.com/4gw7vP5lKbcP4wg6oo" rel="noopener noreferrer"&gt;$497 Full Governance Report&lt;/a&gt; covers 30+ dimensions with specific remediation steps and a compliance roadmap.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/leaderboard"&gt;View the full interactive leaderboard with sortable columns&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Last scanned: March 16, 2026. Scores are point-in-time snapshots. &lt;a href="https://walseth.ai/scan" rel="noopener noreferrer"&gt;Run the scanner&lt;/a&gt; to get the latest score for any repo.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://walseth.ai/blog/ai-governance-leaderboard-rsa-2026" rel="noopener noreferrer"&gt;walseth.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aigovernance</category>
      <category>leaderboard</category>
      <category>rsaconference</category>
      <category>ai</category>
    </item>
    <item>
      <title>The 477:1 Problem</title>
      <dc:creator>Douglas Walseth</dc:creator>
      <pubDate>Wed, 18 Mar 2026 00:28:47 +0000</pubDate>
      <link>https://dev.to/douglasrw/the-4771-problem-5gco</link>
      <guid>https://dev.to/douglasrw/the-4771-problem-5gco</guid>
      <description>&lt;p&gt;Every AI team celebrates when their agent catches errors. Nobody tracks whether those errors stop recurring.&lt;/p&gt;

&lt;p&gt;We ran 6 autonomous agents through 145+ specs and 960+ commits. The critical metric we discovered: &lt;strong&gt;477:1&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That's 4,768 violations detected but only 18 promoted to structural enforcement. A stark gap between detection and actual prevention.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Ratio Means
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;violation&lt;/strong&gt; is a detected failure — an agent breaks rules, uses outdated context, or misses constraints. Detection is straightforward; every monitoring tool does it.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;promotion&lt;/strong&gt; is when that violation becomes structurally impossible to repeat. Not "we documented it." Not "we added a Jira ticket." The violation gets encoded as an L5 hook, L4 test, or L3 template in the &lt;a href="https://walseth.ai/blog/enforcement-ladder" rel="noopener noreferrer"&gt;enforcement ladder&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The remaining 4,750 violations can recur because nothing structural changed — despite logging and alerting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Gap Exists
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. No promotion pipeline.&lt;/strong&gt; Teams have error logging but lack mechanisms to transform logged errors into structural prevention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Promotion requires architecture.&lt;/strong&gt; Real solutions mean writing L5 hooks or L4 tests that fail builds — not just updating documentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The 80/20 trap.&lt;/strong&gt; Most violations are low-severity conventions. The 18 promotions targeted the highest-leverage issues: those causing cascading failures or production breaks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Example
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Violation:&lt;/strong&gt; Coder agent committed code without running full test suites, breaking unrelated modules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L2 Detection:&lt;/strong&gt; Added a prose rule to documentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Agent violated it again within 2 days. Prose rules get lost in context compression.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L5 Promotion:&lt;/strong&gt; Created a pre-commit hook running tests automatically. Commits fail if any test breaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Zero violations in 30+ days. Prevention-by-construction achieved.&lt;/p&gt;
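
&lt;p&gt;A minimal sketch of that kind of gate, assuming a pytest-based suite (the command and wiring are illustrative; any hook manager, or a plain &lt;code&gt;.git/hooks/pre-commit&lt;/code&gt; script, can invoke it):&lt;/p&gt;

```python
#!/usr/bin/env python3
# Hypothetical sketch of an L5 commit gate: run the test suite and
# block the commit on any failure. The default command is an assumption.
import subprocess
import sys

def run_gate(cmd=("pytest", "-q")) -> int:
    # Run the full suite; any nonzero exit status blocks the commit.
    result = subprocess.run(list(cmd))
    if result.returncode != 0:
        print("Commit blocked: test suite failed.", file=sys.stderr)
    return result.returncode

# As a git pre-commit hook, exit with the gate's status:
# sys.exit(run_gate())
```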

&lt;h2&gt;
  
  
  How to Measure Your Ratio
&lt;/h2&gt;

&lt;p&gt;Most teams cannot answer: "Of the errors your AI agents made, how many can never happen again?"&lt;/p&gt;

&lt;p&gt;To measure:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Count distinct failure classes (not individual errors)&lt;/li&gt;
&lt;li&gt;Count promotions — structural enforcement that makes violations impossible&lt;/li&gt;
&lt;li&gt;Divide violations by promotions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A ratio of 477:1 is honest. Most production AI systems would be thousands-to-one or infinity-to-one.&lt;/p&gt;
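
&lt;p&gt;The three steps reduce to one division, with a single edge case worth making explicit (the helper and its names are ours, for illustration):&lt;/p&gt;

```python
from math import inf

def promotion_ratio(failure_classes: int, promotions: int) -> float:
    # Violations-to-promotions ratio. Zero promotions means pure
    # detection with no structural prevention: an infinite ratio.
    if promotions == 0:
        return inf
    return failure_classes / promotions
```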

&lt;h2&gt;
  
  
  Regression Rates
&lt;/h2&gt;

&lt;p&gt;Our 18 L5 promotions show &lt;strong&gt;&amp;lt; 5% regression rates&lt;/strong&gt;. Once promoted to structural enforcement, violations rarely recur.&lt;/p&gt;

&lt;p&gt;Compare this to L2 prose enforcement with &lt;strong&gt;&amp;gt; 40% regression rates&lt;/strong&gt; — documentation gets forgotten or compressed out of context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enterprise Implications
&lt;/h2&gt;

&lt;p&gt;If you are deploying AI agents in production, you have violations. The determining question isn't detection volume but promotion count.&lt;/p&gt;

&lt;p&gt;We publish this ratio because transparency about the gap builds more credibility than pretending the gap does not exist. Every team running AI agents has this gap — most just don't measure it.&lt;/p&gt;

&lt;p&gt;The path forward isn't better detection. It's building the pipeline that turns detected violations into structural enforcement, one promotion at a time.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Free codebase governance audit: &lt;a href="https://walseth.ai" rel="noopener noreferrer"&gt;walseth.ai&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>devops</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Your AI Agent Forgets Its Rules Every 45 Minutes. Here's the Fix.</title>
      <dc:creator>Douglas Walseth</dc:creator>
      <pubDate>Wed, 18 Mar 2026 00:28:26 +0000</pubDate>
      <link>https://dev.to/douglasrw/your-ai-agent-forgets-its-rules-every-45-minutes-heres-the-fix-151e</link>
      <guid>https://dev.to/douglasrw/your-ai-agent-forgets-its-rules-every-45-minutes-heres-the-fix-151e</guid>
      <description>&lt;p&gt;Every AI coding agent has the same silent failure mode: context compression. When the conversation gets long enough, the LLM compresses earlier messages to make room. Your carefully crafted system prompts, project rules, and behavioral constraints? Gone. The agent keeps working, but now it's working without guardrails.&lt;/p&gt;

&lt;p&gt;This isn't theoretical. We run a 6-agent production system that processes thousands of tool calls per day. Before we fixed this, agents would silently lose their CLAUDE.md instructions, forget which files they'd already modified, and repeat work they'd done 30 minutes ago. The worst part: they never told us. They just kept generating confident, rule-violating output.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Invisible Knowledge Loss
&lt;/h2&gt;

&lt;p&gt;LLMs have finite context windows. Claude's is 200K tokens. Sounds like a lot — until your agent has read 40 files, run 80 commands, and accumulated 150K tokens of conversation history. At that point, the system compresses. Earlier messages get summarized or dropped entirely.&lt;/p&gt;

&lt;p&gt;What gets lost first? The instructions at the top. The CLAUDE.md rules. The project constraints. The memory of what the agent already tried.&lt;/p&gt;

&lt;p&gt;Detection tools can't help here. By the time you detect that the agent forgot its rules, it's already shipped rule-violating code. You need prevention.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix: One Hook, 150 Tool Calls
&lt;/h2&gt;

&lt;p&gt;The solution is a PostToolUse hook that monitors context consumption and flushes critical knowledge to persistent storage before compression hits.&lt;/p&gt;

&lt;p&gt;Here's the core logic (simplified from our production implementation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Pre-compaction memory flush hook.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="n"&gt;FLUSH_THRESHOLD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;  &lt;span class="c1"&gt;# ~75% of context window
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;session_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/tmp/session_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AGENT_NAME&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Load or initialize session state
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;session_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;files_read&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;files_written&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flushed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Track every tool call
&lt;/span&gt;    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="c1"&gt;# Flush before compression hits
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;FLUSH_THRESHOLD&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flushed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;memory_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/agents/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AGENT_NAME&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/MEMORY.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;## Session &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- Files modified: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;files_written&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- Files read: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;files_read&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; total&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- Tool calls: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flushed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Memory flushed to persistent storage before context compression.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;

    &lt;span class="n"&gt;session_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Register it as a Claude Code hook in your project's &lt;code&gt;.claude/settings.local.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"PostToolUse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"matcher"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hooks/pre_compaction_flush.py"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Every tool call increments the counter. At 150 calls (roughly the point where our sessions have filled ~75% of a 200K-token context window), the hook writes a session summary to persistent storage. When compression happens, the knowledge survives.&lt;/p&gt;
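&lt;p&gt;The counting logic is small enough to sketch on its own. This is a minimal re-creation of the idea rather than the exact hook above; the &lt;code&gt;session.json&lt;/code&gt; filename, the 150-call threshold, and the flush-once behavior are assumptions taken from the description:&lt;/p&gt;

```python
import json
from pathlib import Path

# Assumption from the post: ~150 tool calls is where a 200K-token window is ~75% full.
FLUSH_THRESHOLD = 150

def record_tool_call(session_file: Path) -> bool:
    """Increment the per-session tool-call counter; return True when it's time to flush."""
    if session_file.exists():
        session = json.loads(session_file.read_text())
    else:
        session = {"tool_calls": 0, "flushed": False}
    session["tool_calls"] += 1
    # Flush once per session, not on every call past the threshold.
    should_flush = session["tool_calls"] >= FLUSH_THRESHOLD and not session["flushed"]
    if should_flush:
        session["flushed"] = True
    session_file.write_text(json.dumps(session))
    return should_flush
```

&lt;p&gt;When &lt;code&gt;record_tool_call&lt;/code&gt; returns true, the real hook would write the session summary to persistent storage.&lt;/p&gt;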

&lt;h2&gt;
  
  
  What Changes in Practice
&lt;/h2&gt;

&lt;p&gt;Before the hook, our agents exhibited three failure patterns after context compression:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rule amnesia.&lt;/strong&gt; Agent forgets CLAUDE.md constraints and starts violating project rules. In our system, this caused an average of 12 rule violations per agent per day post-compression.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Work repetition.&lt;/strong&gt; Agent re-reads files it already processed, re-runs commands it already executed. We measured 23% wasted tool calls from redundant work after compression events.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Silent context loss.&lt;/strong&gt; Agent continues with high confidence but degraded capability. No error, no warning, no indication that critical context was lost. This is the most dangerous pattern — the agent doesn't know what it doesn't know.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After deploying the hook across our 6-agent system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rule violations post-compression dropped to near-zero. The agent reads the flushed memory file on restart and recovers its constraints.&lt;/li&gt;
&lt;li&gt;Redundant tool calls reduced by 18%. The session summary tells the agent what it already did.&lt;/li&gt;
&lt;li&gt;We paired the flush hook with a PreToolUse memory search enforcer — a second hook that reminds the agent to read persistent memory before its first substantive action. Belt and suspenders.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why This Matters Beyond Our System
&lt;/h2&gt;

&lt;p&gt;Context compression isn't unique to our setup. Every long-running AI agent session hits it. If you're running Claude Code, Cursor, Windsurf, or any agentic coding tool on a large codebase, your agent is losing context regularly. You just don't see it because the failure is silent.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://walseth.ai/blog/enforcement-ladder" rel="noopener noreferrer"&gt;enforcement ladder&lt;/a&gt; framework treats this as an L5 problem — the highest enforcement level. L5 means automated, zero-awareness-required. The hook fires automatically. The agent doesn't need to remember to save its memory. The system handles it.&lt;/p&gt;

&lt;p&gt;This is the difference between detection and prevention. A monitoring tool tells you the agent forgot its rules after it shipped bad code. An L5 hook prevents the forgetting from happening at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;The hook above works with any Claude Code project. Drop it in your repo, make the script executable, register it in settings, and your agent stops losing context at compression boundaries.&lt;/p&gt;
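&lt;p&gt;A minimal setup sketch, assuming you saved the hook as &lt;code&gt;hooks/pre_compaction_flush.py&lt;/code&gt; (the placeholder write below is only so the snippet runs standalone):&lt;/p&gt;

```shell
# Hypothetical one-time setup for the hook described in this post.
mkdir -p hooks .claude
# Placeholder so this snippet is self-contained; paste the real script here.
[ -f hooks/pre_compaction_flush.py ] || printf '#!/usr/bin/env python3\n' > hooks/pre_compaction_flush.py
# Command hooks are exec'd directly, so the exec bit is required.
chmod +x hooks/pre_compaction_flush.py
```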

&lt;p&gt;If you want to see how your entire AI development pipeline scores on enforcement posture — including compaction vulnerability, rule enforcement, and test coverage — run our free governance scanner:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free codebase governance audit: &lt;a href="https://walseth.ai" rel="noopener noreferrer"&gt;walseth.ai&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>devops</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
