<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nathan Maine</title>
    <description>The latest articles on DEV Community by Nathan Maine (@dentity007).</description>
    <link>https://dev.to/dentity007</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3858306%2Fe600f034-eccc-4bb7-a775-189eb2753fe8.png</url>
      <title>DEV Community: Nathan Maine</title>
      <link>https://dev.to/dentity007</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dentity007"/>
    <language>en</language>
    <item>
      <title>You Can't Verify Intent. Can You Verify Output?</title>
      <dc:creator>Nathan Maine</dc:creator>
      <pubDate>Thu, 09 Apr 2026 20:01:07 +0000</pubDate>
      <link>https://dev.to/dentity007/you-cant-verify-intent-can-you-verify-output-37l9</link>
      <guid>https://dev.to/dentity007/you-cant-verify-intent-can-you-verify-output-37l9</guid>
      <description>&lt;h2&gt;
  
  
  The Zero Trust Paradox at the Frontier of Autonomous AI Agents
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;By Nathan Maine&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;There's a question I can't stop thinking about. It sits at the intersection of two ideas that the AI industry treats as compatible but aren't.&lt;/p&gt;

&lt;p&gt;The first idea is zero trust architecture. Every action is verified explicitly. No implicit trust. No assumption that because a system was authorized to do something five minutes ago, it's still authorized now. This is the foundation of modern enterprise security and it works well for systems that behave predictably.&lt;/p&gt;

&lt;p&gt;The second idea is Level 3 autonomous AI agents. These are systems that explore their environment freely - browsing the web, reading emails, querying databases, executing multi-step plans, running for hours or days without human intervention. They don't follow a predetermined path. They decide their own path at runtime based on what they encounter. And increasingly, they write and execute their own code.&lt;/p&gt;

&lt;p&gt;Here's the paradox: zero trust demands that you verify every action explicitly. But how do you explicitly verify the intent of a system that is constantly rewriting its own internal logic?&lt;/p&gt;

&lt;p&gt;I don't think anyone has a complete answer yet. But I think we're asking the wrong question, and I want to walk through why.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Works: Levels 0 Through 2
&lt;/h2&gt;

&lt;p&gt;Before we get to the hard part, it's worth acknowledging that the AI security community has built real solutions for simpler agent architectures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 0&lt;/strong&gt; is a single inference call. User asks a question, model answers. The security model here is straightforward: scan the input, scan the output, block anything malicious. Tools like NVIDIA's garak vulnerability scanner do this well. I contribute adversarial probes to garak - one tests whether models fabricate regulatory citations when asked compliance questions (PR 1658), another tests whether attackers can bypass safety filters using Unicode character substitution (PR 1660). At Level 0, these probes catch the failures before deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1&lt;/strong&gt; is a chain of deterministic tool calls. The agent follows a predetermined sequence: retrieve data, process it, format the output. The security model adds dataflow tracing - you can manually map every possible path and block untrusted data from reaching sensitive tools. It's tedious but tractable because the paths are enumerable.&lt;/p&gt;
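&lt;p&gt;Because the paths are enumerable, a Level 1 policy can literally be a static allowlist of (data source, tool) edges. A minimal sketch of the idea - the edge names here are illustrative placeholders, not any shipping product's API:&lt;/p&gt;

```python
# Level 1 policy sketch: every legal dataflow edge is enumerated up front.
# Edge names are illustrative placeholders, not a real product's API.
ALLOWED_EDGES = {
    ("crm_database", "format_report"),
    ("crm_database", "summarize"),
    ("public_web", "summarize"),
    # ("public_web", "send_email") is deliberately absent: untrusted web
    # content must never reach the high-privilege email tool.
}

def check_dataflow(source: str, tool: str) -> bool:
    """Return True only if this exact source-to-tool edge was pre-approved."""
    return (source, tool) in ALLOWED_EDGES

assert check_dataflow("crm_database", "format_report")
assert not check_dataflow("public_web", "send_email")
```

&lt;p&gt;Tedious to maintain, as noted - but every decision is a set lookup, which is exactly why it stays tractable.&lt;/p&gt;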

&lt;p&gt;&lt;strong&gt;Level 2&lt;/strong&gt; introduces weak autonomy. The agent chooses which tools to call based on context. Now you need runtime guardrails (like NeMo Guardrails filtering input and output in real time), sandboxing (like NVIDIA's OpenShell isolating each agent at the kernel level using Linux Landlock), and manual approval gates for sensitive actions. The attack surface is larger but still bounded because the agent's autonomy is constrained.&lt;/p&gt;

&lt;p&gt;These are real, shipping solutions. They work. The industry should be proud of them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Everything Breaks: Level 3
&lt;/h2&gt;

&lt;p&gt;Level 3 is where the security model collapses.&lt;/p&gt;

&lt;p&gt;A fully autonomous agent doesn't follow a predetermined path. It explores. It reads a document, decides it needs more context, searches the web, finds a relevant page, summarizes it, realizes the summary contradicts the original document, queries a database to resolve the contradiction, writes a script to analyze the results, executes the script, and uses the output to update its plan. All without a human in the loop.&lt;/p&gt;

&lt;p&gt;The threat vector at Level 3 is no longer the user. It's the environment. A compromised web page. A poisoned database entry. A malicious instruction embedded in white text on a PDF that the agent reads as a system command. The agent didn't start malicious. The environment made it malicious, mid-session, through data it ingested autonomously.&lt;/p&gt;

&lt;p&gt;The standard defense for this is taint tracing - tagging every piece of data from an untrusted source as "tainted" and blocking any tainted data from reaching high-privilege tools. In theory, this works. In practice, it creates a cascading problem.&lt;/p&gt;

&lt;p&gt;When a Level 3 agent enters a reasoning loop - which it will, because that's the whole point of autonomy - every piece of data it processes after touching a tainted source becomes tainted itself. The agent summarizes a tainted web page. The summary is now tainted. The agent uses that summary to formulate a new query. The query is tainted. The query returns results that get incorporated into the agent's reasoning. All tainted. Within minutes, the entire context window is what I'd call "permanently pink."&lt;/p&gt;
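&lt;p&gt;The cascade is easy to reproduce. In the toy tracker below - my own illustration, not a real taint-tracing library - any value derived from a tainted input inherits the taint, and a handful of reasoning steps is enough to turn the whole context pink:&lt;/p&gt;

```python
# Toy taint tracker: a derived value is tainted if ANY input was tainted.
# This illustrates the cascade, not a real provenance system.
class Value:
    def __init__(self, text, tainted=False):
        self.text = text
        self.tainted = tainted

def derive(op_name, *inputs):
    """Derived values inherit taint from any tainted input."""
    tainted = any(v.tainted for v in inputs)
    return Value(f"{op_name}(...)", tainted)

page = Value("web page", tainted=True)      # untrusted source
summary = derive("summarize", page)          # tainted
query = derive("formulate_query", summary)   # tainted
results = derive("db_query", query)          # tainted
plan = derive("update_plan", results)        # tainted

assert all(v.tainted for v in (summary, query, results, plan))
```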

&lt;p&gt;If you enforce strict taint tracing policies at this point, you trigger a denial of service against your own application. The policy engine flags every subsequent tool call. The human operator drowns in approval requests. The agent's autonomy collapses back to Level 0. You've spent millions of dollars building a system that's functionally equivalent to a chatbot with extra steps.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Deeper Problem: Agents Building Agents
&lt;/h2&gt;

&lt;p&gt;It gets worse. Jensen Huang described this at GTC 2026 as the next industrial revolution in knowledge work - employees "supercharged by teams of frontier, specialized, and custom-built agents they deploy and manage." But the current trajectory isn't just autonomous agents executing tasks. It's autonomous agents writing and deploying code for other autonomous agents to execute. The orchestrator agent identifies a problem, spins up a temporary worker agent in a sandboxed environment, feeds it data, evaluates the output, and terminates the worker when the job is done.&lt;/p&gt;

&lt;p&gt;In this architecture, the fundamental boundary between code and data dissolves. The prompt IS the code. The generated code IS the data for the next agent. A malicious instruction injected into one agent's data stream becomes executable code in the next agent's runtime. Traditional security assumes you can distinguish between what the system is told to do (code) and what the system processes (data). When agents write code for other agents, that distinction ceases to exist.&lt;/p&gt;

&lt;p&gt;Zero trust says: verify explicitly. But verify WHAT? The agent's intent changes with every reasoning step. Its logic rewrites itself continuously. The verification target is a moving target that moves faster than any verification system can evaluate.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Wrong Question and the Right One
&lt;/h2&gt;

&lt;p&gt;I spent months trying to figure out how to verify the intent of a self-modifying autonomous system. I couldn't. And I eventually realized I was asking the wrong question.&lt;/p&gt;

&lt;p&gt;You can't verify intent. Intent in a Level 3 system is non-deterministic by definition. The agent's "intent" is an emergent property of its current context window, its model weights, the data it has ingested, and the tools available to it. It changes with every token generated. Trying to verify it is like trying to verify the intent of weather. You can observe it. You can model it. You can't verify it.&lt;/p&gt;

&lt;p&gt;But you can verify output.&lt;/p&gt;

&lt;p&gt;The agent produces something. A recommendation. A generated document. A code commit. An API call. Whatever it produces, it produces specific bytes. And those bytes can be cryptographically attested before they leave the system.&lt;/p&gt;




&lt;h2&gt;
  
  
  An Output-Centric Framework
&lt;/h2&gt;

&lt;p&gt;I've been building toward a framework that shifts the security question from "did the agent mean well?" to "can we prove what the agent actually produced?"&lt;/p&gt;

&lt;p&gt;The approach has three layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Canonical byte-binding.&lt;/strong&gt; When the agent generates output, the exact bytes are canonicalized to a deterministic sequence and bound to a cryptographic signature chain before the output leaves the system. Any modification downstream - whether by a compromised intermediary, a network man-in-the-middle, or a post-processing step that introduces errors - is detectable because the signature no longer matches the canonical form. You can prove to an auditor that the output the user received is byte-for-byte identical to what the model produced. I have a patent pending on this approach.&lt;/p&gt;
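&lt;p&gt;The patented mechanics aren't spelled out here, but the general shape is simple to sketch. In this deliberately simplified stand-in - not the patented design - a hypothetical &lt;code&gt;canonicalize&lt;/code&gt; step (Unicode NFC plus newline normalization, UTF-8 encoding) produces a deterministic byte sequence, and an HMAC stands in for the signature chain. Any downstream mutation of even one byte fails verification:&lt;/p&gt;

```python
import hashlib, hmac, unicodedata

SIGNING_KEY = b"demo-key"  # stand-in; a real system would use asymmetric keys

def canonicalize(text: str) -> bytes:
    """Deterministic byte form: NFC-normalize, normalize newlines, UTF-8 encode."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\r\n", "\n")
    return text.encode("utf-8")

def attest(text: str) -> bytes:
    return hmac.new(SIGNING_KEY, canonicalize(text), hashlib.sha256).digest()

def verify(text: str, tag: bytes) -> bool:
    return hmac.compare_digest(attest(text), tag)

tag = attest("Patient is cleared for discharge.\n")
assert verify("Patient is cleared for discharge.\n", tag)
assert not verify("Patient is cleared for discharge!\n", tag)  # one byte changed
```

&lt;p&gt;Canonicalization is what makes the binding meaningful: two visually identical outputs that differ in encoding reduce to the same bytes before signing.&lt;/p&gt;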

&lt;p&gt;&lt;strong&gt;Layer 2: Tamper-evident audit trails.&lt;/strong&gt; Every step of the agent's execution is logged in an append-only chain where each entry is cryptographically linked to the previous one. This isn't standard logging - standard logs can be modified by anyone with admin access. A cryptographically linked chain means even the system administrator can't alter a historical entry without breaking the hash chain. For regulated industries deploying autonomous agents (healthcare under HIPAA, defense under CMMC, finance under SOX), this level of auditability will eventually be table stakes.&lt;/p&gt;
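&lt;p&gt;A hash-chained log of this kind fits in a few lines. In this sketch each entry's hash covers both the event and the previous entry's hash, so editing any historical record invalidates everything after it:&lt;/p&gt;

```python
import hashlib, json

def append_entry(log, event: dict):
    """Append an entry whose hash covers the event AND the previous hash."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    payload = json.dumps({"event": event, "prev": prev_hash}, sort_keys=True)
    log.append({"event": event, "prev": prev_hash,
                "hash": hashlib.sha256(payload.encode()).hexdigest()})

def verify_chain(log) -> bool:
    """Recompute every hash; any edit to history breaks the chain."""
    prev_hash = "0" * 64
    for entry in log:
        payload = json.dumps({"event": entry["event"], "prev": prev_hash},
                             sort_keys=True)
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != hashlib.sha256(payload.encode()).hexdigest():
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, {"step": 1, "tool": "web_search"})
append_entry(log, {"step": 2, "tool": "db_query"})
assert verify_chain(log)
log[0]["event"]["tool"] = "send_email"   # an admin tampers with history
assert not verify_chain(log)
```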

&lt;p&gt;&lt;strong&gt;Layer 3: Steganographic channel prevention.&lt;/strong&gt; Even if the output passes through guardrails and sandbox restrictions, data can be exfiltrated through the authorized output channel itself. An agent can embed hidden information in Unicode characters that look identical to humans but carry different byte values - a Latin "a" swapped for a Cyrillic "a" passes visual inspection but encodes a different signal. Canonicalizing the output to strip these channels before attestation closes this exfiltration vector.&lt;/p&gt;
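&lt;p&gt;The Cyrillic/Latin swap is easy to demonstrate. The minimal - and deliberately incomplete - canonicalization pass below maps a few known confusables back to their Latin forms before attestation; a production system would use the full Unicode confusables data rather than this hand-picked table:&lt;/p&gt;

```python
# Minimal confusable-stripping pass. A real system would use the full
# Unicode confusables data; this table lists just a few examples.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic a, visually identical to Latin a
    "\u043e": "o",  # Cyrillic o
    "\u0435": "e",  # Cyrillic e
    "\u200b": "",   # zero-width space: an invisible carrier channel
}

def strip_confusables(text: str) -> str:
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)

visible = "data"              # what a human reviewer sees
smuggled = "d\u0430t\u0430"   # Cyrillic 'a's carry a different byte signal
assert visible != smuggled                      # the bytes differ...
assert strip_confusables(smuggled) == visible   # ...until canonicalization
```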




&lt;h2&gt;
  
  
  What This Doesn't Solve
&lt;/h2&gt;

&lt;p&gt;I want to be clear about the limitations.&lt;/p&gt;

&lt;p&gt;This framework does not solve the intent verification problem. Nothing does. If a Level 3 agent decides to pursue a harmful goal through a series of individually legitimate-looking actions, output attestation won't catch the strategic intent. It will only prove that each individual output was faithfully recorded and unmodified.&lt;/p&gt;

&lt;p&gt;It also doesn't replace the existing security stack. You still need garak for pre-deployment vulnerability scanning. You still need NeMo Guardrails for runtime input/output filtering. You still need OpenShell for kernel-level sandboxing. You still need taint tracing for data provenance. Output attestation is not a replacement for any of these. It's the layer that sits on top - the proof layer that tells a regulated customer: "We can't guarantee the agent was right. But we can prove exactly what it said, when it said it, and that the record hasn't been altered."&lt;/p&gt;

&lt;p&gt;For healthcare systems deploying autonomous agents to interact with patient data, for defense contractors running agents that process classified information, for financial institutions using agents to make trading decisions - that proof layer is the difference between "we trust the AI" and "we can demonstrate to an auditor exactly what the AI did." The first is a policy statement. The second is a compliance posture.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Question That Remains
&lt;/h2&gt;

&lt;p&gt;If zero trust and Level 3 autonomy are fundamentally incompatible - and I believe they are - then the industry needs to decide what replaces explicit intent verification for self-modifying systems.&lt;/p&gt;

&lt;p&gt;My bet is on output-centric attestation. Verify what the agent produced, not what it intended. Build the cryptographic proof chain that lets regulated industries deploy autonomous agents with auditable evidence trails.&lt;/p&gt;

&lt;p&gt;But this is an open problem. The agents are getting more autonomous faster than the security frameworks are adapting. And the moment agents start building other agents - which is already happening - the verification challenge compounds exponentially.&lt;/p&gt;

&lt;p&gt;I'd love to hear how others are thinking about this. Especially if you're working on the infrastructure side at companies building these systems. The solutions will come from practitioners who are living with these constraints daily, not from theoretical frameworks written in isolation.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Nathan Maine is a Technical Program Manager and AI practitioner. He contributes adversarial probes to NVIDIA's garak LLM vulnerability scanner, has trained 13 LLMs across 7 base architectures, and holds 6 pending patents on AI egress security and cryptographic attestation. He publishes models and research on HuggingFace.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Connect: &lt;a href="https://linkedin.com/in/nathanmaine" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://github.com/NathanMaine" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://huggingface.co/Nathan-Maine" rel="noopener noreferrer"&gt;HuggingFace&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agentic</category>
      <category>llm</category>
      <category>llmbuilder</category>
    </item>
    <item>
      <title>Gemma 4 After 24 Hours: What the Community Found vs What Google Promised</title>
      <dc:creator>Nathan Maine</dc:creator>
      <pubDate>Fri, 03 Apr 2026 02:31:45 +0000</pubDate>
      <link>https://dev.to/dentity007/-gemma-4-after-24-hours-what-the-community-found-vs-what-google-promised-3a2f</link>
      <guid>https://dev.to/dentity007/-gemma-4-after-24-hours-what-the-community-found-vs-what-google-promised-3a2f</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1o740cx5wp5k082njqhi.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1o740cx5wp5k082njqhi.jpeg" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Google released Gemma 4 yesterday under Apache 2.0. The benchmarks looked incredible. The community went to work. Here's what we're actually seeing.&lt;/p&gt;

&lt;p&gt;I spent the last 24 hours reading through forums, running my own fine-tuning experiments, and collecting reports from dozens of early adopters. This is a summary of the real-world findings, the open questions, and where I think this model family lands.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Good News First
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Apache 2.0 is a big deal.&lt;/strong&gt; Previous Gemma releases used a custom Google license that technically allowed them to restrict usage. Apache 2.0 removes that uncertainty entirely. For anyone building commercial products on open models, this matters more than any benchmark number.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multilingual quality is genuinely strong.&lt;/strong&gt; Users testing German, Arabic, Vietnamese, and French are reporting that Gemma 4 outperforms Qwen 3.5 in non-English tasks. One user called it "in a tier of its own" for translation. Another said it "makes translategemma feel outdated instantly." For global enterprise deployments, this is a significant differentiator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The ELO score tells a different story than benchmarks.&lt;/strong&gt; The 31B model scored ~1452 on LMArena, which puts it above GPT-OSS-120B and comparable to GPT-5-mini. But side-by-side benchmark tables show it roughly tying with Qwen 3.5 27B. The gap between ELO (human preference) and automated benchmarks suggests Gemma 4 produces responses that humans prefer even when raw accuracy is similar.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The E2B model is absurd.&lt;/strong&gt; Multiple users confirmed that the 2.3B effective parameter model beats Gemma 3 27B on most benchmarks. A user running it on a basic i7 laptop with 32GB RAM reported it was "not only faster, it gives significantly better answers" than Qwen 3.5 4B for finance analysis.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problems Nobody Warned About
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Inference Speed
&lt;/h3&gt;

&lt;p&gt;This is the elephant in the room. Multiple users are reporting that Gemma 4's MoE model (26B-A4B) runs significantly slower than Qwen 3.5's equivalent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One user: &lt;strong&gt;11 tokens/sec on Gemma 4 26B-A4B vs 60+ tokens/sec on Qwen 3.5 35B-A3B&lt;/strong&gt; on the same 5060 Ti 16GB&lt;/li&gt;
&lt;li&gt;Another confirmed higher VRAM usage for context at the same quantization level&lt;/li&gt;
&lt;li&gt;Someone running on a DGX Spark asked "why is it super slow?" with no clear answer yet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the dense 31B model, users are reporting 18-25 tokens/sec on dual NVIDIA GPUs (5070 Ti + 5060 Ti), which is reasonable but not fast.&lt;/p&gt;

&lt;p&gt;The speed gap against Qwen 3.5 is concerning for production deployments where latency matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  VRAM Consumption
&lt;/h3&gt;

&lt;p&gt;Gemma models have historically been VRAM-hungry for context, and Gemma 4 appears to continue this pattern. One user noted they could only fit Gemma 3 27B Q4 with 20K context on a 5090, while Qwen 3.5 27B Q4 fit with 190K context on the same card.&lt;/p&gt;

&lt;p&gt;For the 256K context window to be useful in practice, you need significantly more VRAM than competing models at the same parameter count.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fine-Tuning Compatibility
&lt;/h3&gt;

&lt;p&gt;As someone who attempted QLoRA fine-tuning within hours of release, I can confirm the tooling is not ready. Three issues hit immediately:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;HuggingFace Transformers&lt;/strong&gt; didn't recognize the &lt;code&gt;gemma4&lt;/code&gt; architecture (required installing from source)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PEFT&lt;/strong&gt; couldn't handle &lt;code&gt;Gemma4ClippableLinear&lt;/code&gt;, a new layer type in the vision encoder (required a monkey-patch)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A new &lt;code&gt;mm_token_type_ids&lt;/code&gt; field&lt;/strong&gt; is required during training even for text-only data (required a custom data collator)&lt;/li&gt;
&lt;/ol&gt;
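&lt;p&gt;For the third issue, the workaround was a thin wrapper collator. The sketch below assumes - as observed in practice, though the field's semantics aren't documented yet - that text-only batches can pass an all-zeros &lt;code&gt;mm_token_type_ids&lt;/code&gt; with the same shape as &lt;code&gt;input_ids&lt;/code&gt;; treat it as a placeholder until the upstream fix lands:&lt;/p&gt;

```python
import torch

def collate_with_mm_ids(base_collator):
    """Wrap an existing collator to inject mm_token_type_ids for text-only data.

    Assumption (observed, not documented): for text-only batches, zeros of
    input_ids' shape are accepted for the multimodal token-type field.
    """
    def collate(features):
        batch = base_collator(features)
        if "mm_token_type_ids" not in batch:
            batch["mm_token_type_ids"] = torch.zeros_like(batch["input_ids"])
        return batch
    return collate
```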

&lt;p&gt;I've filed issues on both huggingface/peft and huggingface/transformers. Both received responses within hours, and a fix for the &lt;code&gt;mm_token_type_ids&lt;/code&gt; issue is already in progress. Unsloth also has day-one support if you prefer that path.&lt;/p&gt;

&lt;p&gt;The community question "how easy is it to fine-tune compared to Gemma 3?" currently has no good answer beyond "harder, but solvable."&lt;/p&gt;

&lt;h3&gt;
  
  
  Stability Questions
&lt;/h3&gt;

&lt;p&gt;One user testing the non-quantized 31B in Google AI Studio reported "infinite loops and no possibility to read text from the image." Another found that the model jailbreaks with basic system prompts. A third reported Mac hard crashes when loading either the 31B or 26B in LM Studio.&lt;/p&gt;

&lt;p&gt;These are early reports and may be resolved with updates, but they're worth noting for anyone considering production deployment.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Benchmark Reality
&lt;/h2&gt;

&lt;p&gt;The community quickly assembled side-by-side comparisons. Here's the consolidated picture:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Gemma 4 31B&lt;/th&gt;
&lt;th&gt;Qwen 3.5 27B&lt;/th&gt;
&lt;th&gt;Winner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MMLU-Pro&lt;/td&gt;
&lt;td&gt;85.2%&lt;/td&gt;
&lt;td&gt;86.1%&lt;/td&gt;
&lt;td&gt;Qwen&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPQA Diamond&lt;/td&gt;
&lt;td&gt;84.3%&lt;/td&gt;
&lt;td&gt;85.5%&lt;/td&gt;
&lt;td&gt;Qwen&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiveCodeBench v6&lt;/td&gt;
&lt;td&gt;80.0%&lt;/td&gt;
&lt;td&gt;80.7%&lt;/td&gt;
&lt;td&gt;Tie&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codeforces ELO&lt;/td&gt;
&lt;td&gt;2150&lt;/td&gt;
&lt;td&gt;1899&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Gemma&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TAU2-Bench&lt;/td&gt;
&lt;td&gt;76.9%&lt;/td&gt;
&lt;td&gt;79.0%&lt;/td&gt;
&lt;td&gt;Qwen&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MMMLU&lt;/td&gt;
&lt;td&gt;88.4%&lt;/td&gt;
&lt;td&gt;85.9%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Gemma&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HLE (no tools)&lt;/td&gt;
&lt;td&gt;19.5%&lt;/td&gt;
&lt;td&gt;24.3%&lt;/td&gt;
&lt;td&gt;Qwen&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Gemma 4 wins on competitive coding (ELO) and multilingual (MMMLU). Qwen 3.5 wins on most reasoning benchmarks. Neither is a clear overall winner.&lt;/p&gt;

&lt;p&gt;The honest take from one top commenter: "Gemma 4 ties with Qwen, if not Qwen being slightly ahead. And Qwen 3.5 is more compute efficient too."&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Community Is Waiting For
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;QAT versions.&lt;/strong&gt; Gemma 3 QAT (quantization-aware training) models arrived weeks after the initial release. The community expects the same for Gemma 4, and these will likely improve quantized inference quality significantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Abliterated/uncensored versions.&lt;/strong&gt; At least one already exists. Multiple users are requesting more. The Apache 2.0 license makes this fully legal now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Larger models.&lt;/strong&gt; There were rumors of a 120B model that didn't materialize. Several users expressed disappointment. A 100B+ MoE from Google could be transformative.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A 9-12B dense model.&lt;/strong&gt; The gap between E4B (4.5B effective) and 26B MoE leaves a hole in the lineup. Gemma 3's 12B model was popular, and there's no direct upgrade path.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where This Leaves Us
&lt;/h2&gt;

&lt;p&gt;Gemma 4 is not the clear winner the benchmarks suggested. But it's not trying to be.&lt;/p&gt;

&lt;p&gt;The real value proposition is the combination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache 2.0&lt;/strong&gt; (fully permissive, no restrictions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multilingual excellence&lt;/strong&gt; (best in class for non-English)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Base models available&lt;/strong&gt; (fine-tuning ready on day one)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Size diversity&lt;/strong&gt; (2B to 31B covers edge to server)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native system prompts and function calling&lt;/strong&gt; (production-ready features)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For English-only, benchmark-optimized, speed-critical deployments, Qwen 3.5 is still the better choice. For multilingual, legally unrestricted, fine-tuning-focused use cases, Gemma 4 has a compelling argument.&lt;/p&gt;

&lt;p&gt;The speed and VRAM issues need to be addressed. The fine-tuning tooling needs a week or two to catch up. And we need QAT quantizations before the smaller models can truly compete on efficiency.&lt;/p&gt;

&lt;p&gt;But make no mistake: releasing a 31B dense model under Apache 2.0 that rivals models 4-10x its size on human preference benchmarks is a significant moment for open AI. Google is finally competing on openness, not just capability.&lt;/p&gt;

&lt;p&gt;I'll be publishing our fine-tuning results (including the day-zero bug fixes) and benchmark comparisons as the training run completes. Follow along if you're interested.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Nathan Maine builds AI systems for regulated industries. He is currently fine-tuning Gemma 4 31B for domain-specific deployment and has filed bug reports on huggingface/peft and huggingface/transformers for day-zero compatibility issues.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>google</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Fine-Tuning Gemma 4 on Day Zero: 3 Bugs We Solved in 30 Minutes</title>
      <dc:creator>Nathan Maine</dc:creator>
      <pubDate>Thu, 02 Apr 2026 20:35:03 +0000</pubDate>
      <link>https://dev.to/dentity007/fine-tuning-gemma-4-on-day-zero-3-bugs-we-solved-in-30-minutes-2ke</link>
      <guid>https://dev.to/dentity007/fine-tuning-gemma-4-on-day-zero-3-bugs-we-solved-in-30-minutes-2ke</guid>
      <description>&lt;p&gt;Google released &lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/" rel="noopener noreferrer"&gt;Gemma 4&lt;/a&gt; today under Apache 2.0 — their most capable open model family. The 31B dense model scores ~1452 on LMArena with a 256K context window.&lt;/p&gt;

&lt;p&gt;We wanted to fine-tune it immediately. QLoRA on a single NVIDIA B200. It broke three times before training started.&lt;/p&gt;

&lt;p&gt;Here's what happened and how we fixed each one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bug 1: "Transformers does not recognize this architecture"
&lt;/h2&gt;

&lt;p&gt;The first error hits before the model even loads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ValueError: The checkpoint you are trying to load has model type `gemma4` 
but Transformers does not recognize this architecture.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; The latest stable Transformers release (5.4.0) shipped before Gemma 4 existed. The &lt;code&gt;gemma4&lt;/code&gt; model type only exists in the dev branch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Install from source.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;git+https://github.com/huggingface/transformers.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gets you 5.5.0.dev0 which includes the &lt;code&gt;Gemma4ForConditionalGeneration&lt;/code&gt; class.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time to fix:&lt;/strong&gt; 2 minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bug 2: "Target module Gemma4ClippableLinear is not supported"
&lt;/h2&gt;

&lt;p&gt;After installing Transformers from source, the model loads fine. But when PEFT tries to apply LoRA:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ValueError: Target module Gemma4ClippableLinear(
  (linear): Linear4bit(in_features=1152, out_features=1152, bias=False)
) is not supported.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Gemma 4 introduces a new layer type called &lt;code&gt;Gemma4ClippableLinear&lt;/code&gt; for its vision and audio encoders. It wraps &lt;code&gt;nn.Linear&lt;/code&gt; with optional input/output clamping for numerical stability. The catch: it inherits from &lt;code&gt;nn.Module&lt;/code&gt;, not &lt;code&gt;nn.Linear&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;PEFT checks the type of every target module before applying LoRA. Since &lt;code&gt;Gemma4ClippableLinear&lt;/code&gt; isn't &lt;code&gt;nn.Linear&lt;/code&gt;, PEFT rejects it — even though we only want to apply LoRA to the text decoder layers, not the vision encoder.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;exclude_modules&lt;/code&gt; parameter doesn't help either. PEFT runs the type check &lt;em&gt;before&lt;/em&gt; filtering, so excluded modules still need to be recognized types.&lt;/p&gt;

&lt;p&gt;Installing PEFT from source doesn't help - the support simply doesn't exist yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Monkey-patch &lt;code&gt;Gemma4ClippableLinear&lt;/code&gt; to inherit from &lt;code&gt;nn.Linear&lt;/code&gt; before loading the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers.models.gemma4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;modeling_gemma4&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PatchedClippableLinear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;in_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_features&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;in_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;use_clipped_linears&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;use_clipped_linears&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;use_clipped_linears&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_buffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_min&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_buffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_max&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_buffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_min&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_buffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_max&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;use_clipped_linears&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_min&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;use_clipped_linears&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_min&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;

&lt;span class="n"&gt;modeling_gemma4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Gemma4ClippableLinear&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PatchedClippableLinear&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Place this &lt;strong&gt;before&lt;/strong&gt; any &lt;code&gt;AutoModelForCausalLM.from_pretrained()&lt;/code&gt; call. PEFT now sees the vision encoder layers as standard linear layers and proceeds normally.&lt;/p&gt;

&lt;p&gt;Result: 534M trainable parameters (1.68% of 31.8B total).&lt;/p&gt;
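&lt;p&gt;You can confirm that figure directly instead of trusting a log line (PEFT's &lt;code&gt;print_trainable_parameters()&lt;/code&gt; reports the same numbers). A minimal, framework-agnostic sketch; in practice you'd pass in &lt;code&gt;model.named_parameters()&lt;/code&gt;:&lt;/p&gt;

```python
# Count trainable vs. total parameter elements from (name, param) pairs.
# Works with anything exposing numel() and requires_grad, e.g. torch tensors.
def summarize_trainable(named_params):
    trainable = total = 0
    for _, p in named_params:
        n = p.numel()
        total += n
        if p.requires_grad:
            trainable += n
    return trainable, total
```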

&lt;p&gt;&lt;strong&gt;Time to fix:&lt;/strong&gt; 15 minutes (including reading the Gemma 4 source to understand the layer).&lt;/p&gt;




&lt;h2&gt;
  
  
  Bug 3: "mm_token_type_ids is required"
&lt;/h2&gt;

&lt;p&gt;LoRA applies, data loads, training starts — and immediately crashes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ValueError: `mm_token_type_ids` is required as a model input when training
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Gemma 3 required &lt;code&gt;token_type_ids&lt;/code&gt; during training. Gemma 4 adds a second required field: &lt;code&gt;mm_token_type_ids&lt;/code&gt; (multimodal token type IDs). The model validates their presence in the forward pass, even for text-only training. For text-only inputs, both should be all zeros.&lt;/p&gt;

&lt;p&gt;Standard tokenizers and data collators don't produce &lt;code&gt;mm_token_type_ids&lt;/code&gt;. You need a custom collator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Add both fields during tokenization and build a custom data collator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# During tokenization
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_chat_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;tokenize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;add_generation_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tokenized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;truncation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tokenized&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_type_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenized&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;tokenized&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mm_token_type_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenized&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;tokenized&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;labels&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenized&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tokenized&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Custom data collator
&lt;/span&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GemmaCollator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;object&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;max_len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;pad_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pad_token_id&lt;/span&gt;
        &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attention_mask&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_type_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mm_token_type_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;labels&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;pad_len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_len&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pad_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;pad_len&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attention_mask&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;pad_len&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_type_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;max_len&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mm_token_type_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;max_len&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;labels&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;labels&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;pad_len&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
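&lt;p&gt;Before committing to a multi-hour run, it's worth sanity-checking the padding logic on toy features. This sketch mirrors the collator's list-building step, with the tensor conversion omitted so it runs standalone (the pad id of 0 is illustrative):&lt;/p&gt;

```python
# Mirror of the collator above: right-pad every field to the longest sequence.
def pad_batch(features, pad_id=0):
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [],
             "token_type_ids": [], "mm_token_type_ids": [], "labels": []}
    for f in features:
        pad_len = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * pad_len)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * pad_len)
        batch["token_type_ids"].append([0] * max_len)      # all zeros: text-only
        batch["mm_token_type_ids"].append([0] * max_len)   # all zeros: text-only
        batch["labels"].append(f.get("labels", f["input_ids"]) + [-100] * pad_len)
    return batch

padded = pad_batch([{"input_ids": [5, 6, 7]}, {"input_ids": [8, 9]}])
# input_ids pad with pad_id; labels pad with -100 so the loss ignores padding
```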



&lt;p&gt;Important: set &lt;code&gt;remove_unused_columns=False&lt;/code&gt; in your training config, or the trainer will strip &lt;code&gt;mm_token_type_ids&lt;/code&gt; before it reaches the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;training_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SFTConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;...,&lt;/span&gt;
    &lt;span class="n"&gt;dataset_text_field&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;remove_unused_columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SFTTrainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;training_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tokenized_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;data_collator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;GemmaCollator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Time to fix:&lt;/strong&gt; 5 minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;After all three fixes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;31B model training at 4.5s/step&lt;/strong&gt; on a single NVIDIA B200 (192GB)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;534M trainable parameters&lt;/strong&gt; via QLoRA (1.68% of 31.8B)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU utilization: 89%&lt;/strong&gt;, 38GB VRAM used&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Estimated training time: ~7.5 hours&lt;/strong&gt; for 3 epochs on 16K examples&lt;/li&gt;
&lt;/ul&gt;
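&lt;p&gt;That time estimate is easy to reproduce. At 4.5 s/step, ~7.5 hours is about 6,000 optimizer steps, which matches 3 epochs over 16K examples if the effective batch size is 8 (the batch size isn't stated here; 8 is inferred because it makes the numbers line up):&lt;/p&gt;

```python
# Back-of-envelope check of the training-time estimate.
examples, epochs, sec_per_step = 16_000, 3, 4.5
effective_batch = 8                      # inferred, not taken from the run config
steps = examples * epochs // effective_batch
hours = steps * sec_per_step / 3600
```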

&lt;p&gt;Total time from "model released" to "training steps running": &lt;strong&gt;under 4 hours&lt;/strong&gt; (including model download).&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Day-zero fine-tuning requires bleeding-edge dependencies.&lt;/strong&gt; Install Transformers and PEFT from source when working with newly released models.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multimodal models have hidden requirements for text-only training.&lt;/strong&gt; Both &lt;code&gt;token_type_ids&lt;/code&gt; and &lt;code&gt;mm_token_type_ids&lt;/code&gt; are validated even when no images or audio are involved.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PEFT's type checking happens before module filtering.&lt;/strong&gt; Even if you exclude vision modules, they still need to be recognized types. Monkey-patching is a valid workaround until official support lands.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Experience doesn't make these avoidable.&lt;/strong&gt; They're day-zero discovery problems that anyone using the model on release day will hit; what experience changes is how fast you solve them.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
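&lt;p&gt;Takeaway 3 can be shown in miniature with plain classes (the names here are illustrative, not PEFT's). The framework type-checks each layer before any include/exclude filtering runs, so an unrecognized class has to be re-parented onto a known type before the model is built:&lt;/p&gt;

```python
class KnownLinear:                # the type the framework recognizes (think nn.Linear)
    def __init__(self, features):
        self.features = features

class ExoticLinear:               # vendor layer the framework has never seen
    def __init__(self, features):
        self.features = features

def framework_accepts(layer):
    # stand-in for the framework's type check, which runs before module filtering
    return isinstance(layer, KnownLinear)

assert not framework_accepts(ExoticLinear(4))   # rejected, even if excluded later

class PatchedExoticLinear(KnownLinear):
    """Drop-in replacement re-parenting the exotic layer onto a known type."""

# the patch: rebind the name models will instantiate, before construction
ExoticLinear = PatchedExoticLinear
```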




&lt;p&gt;&lt;em&gt;Issues filed:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;[huggingface/peft] Gemma4ClippableLinear not supported&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;[huggingface/transformers] mm_token_type_ids required for text-only fine-tuning&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Both include workarounds and suggested fixes. PRs welcome.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
