DEV Community: Sunil Prakash

Gemma 4 is the small-model tier agent stacks were waiting for

Sunil Prakash — Sun, 24 May 2026 13:08:50 +0000

Most agent failures aren't reasoning failures. They're policy failures. The model picked the right tool, then called it with arguments outside its scope. The delegation chain expanded one step beyond what the user actually authorized. The output cleared the LLM's own check but tripped the compliance rule three layers down.

These don't get fixed by a larger frontier model. They get fixed by a faster check, run more often, on a model small enough that running it constantly isn't a budget event.

That's the tier most agent stacks don't have. And it's the tier Gemma 4 finally fills.

The missing tier

When you sketch an agent system on a whiteboard, you draw one box for the reasoning model. In production you discover you need a lot more boxes around it: pre-flight checks before each tool call, scope verifiers in delegation chains, output classifiers feeding audit trails, intent disambiguation when the user's last message could mean two things.

Teams currently handle this in one of two ways. They route everything through the frontier model, which is fast becoming the most expensive way to ask trivial questions. Or they skip the checks entirely, which is how scope creep, prompt injection, and policy drift survive into production.

A third option, a small open-weight model running locally as a policy and verification layer, has existed in theory for two years. The blocker has always been the same: no small model was capable enough to be trusted on structured judgment while running on hardware cheap enough to check every action.

Google's positioning of Gemma 4 makes this explicit. The 26B and 31B variants are pitched for "advanced reasoning"; the E2B and E4B variants are pitched for "maximum compute and memory efficiency" and "mobile and IoT devices" (source). That second tier is the one most agent-system writeups skip past on the way to discussing the flagship. It is also the one that changes the architecture.

The argument for the small tier is not "it matches the frontier." It is that "did this delegated agent action stay inside the user's authorized scope?" is a structurally easier question than "write me a working SQL query." Lower complexity, higher determinism requirement, narrow output space. Small models that handle that class of question well are exactly what agent stacks have needed and haven't reliably had until now. Run Gemma 4 on your own eval set before you trust it, but the bar it has to clear is finally in range.

Three useful patterns

Pre-flight policy check. Before any tool call leaves the agent, a Gemma 4 E2B evaluates: does this call match what the user actually asked for, given the policy attached to this session? Frontier model proposes, edge model disposes. Run it local, fail closed, log the verdict. The economics work because the check runs on commodity hardware in the same process as the agent runtime, not as a remote call.
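A minimal sketch of the "fail closed, log the verdict" shape described above. The classifier is passed in as a plain callable, so nothing here depends on a particular model runtime. The `ToolCall`, `Verdict`, and `preflight` names and the "allow"/"deny" label convention are assumptions made for illustration, not part of any Gemma or JamJet API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: dict


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    reason: str


# Hypothetical signature: (user request, session policy, proposed call) -> label.
# In practice this would wrap a call to a locally hosted small model.
Classifier = Callable[[str, str, ToolCall], str]


def preflight(user_request: str, policy: str, call: ToolCall,
              classify: Classifier, audit_log: list) -> Verdict:
    """Check a proposed tool call before it leaves the agent; fail closed."""
    try:
        label = classify(user_request, policy, call).strip().lower()
    except Exception as exc:
        # The model crashed or timed out: deny rather than let the call through.
        verdict = Verdict(False, f"classifier error: {exc}")
    else:
        if label == "allow":
            verdict = Verdict(True, "classifier allowed")
        else:
            # Anything other than an explicit "allow" counts as a denial.
            verdict = Verdict(False, f"classifier returned {label!r}")
    # Every verdict is logged, allowed or not.
    audit_log.append({"tool": call.name, "allowed": verdict.allowed,
                      "reason": verdict.reason})
    return verdict
```

The design choice worth noting is that an unexpected label or an exception is handled exactly like a denial, so a misbehaving check can never widen what the agent is allowed to do.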

Delegation-scope verifier. In multi-agent chains, each handoff is a place where authorization can quietly widen. A small local model sitting on each delegation boundary checks the new request against the original scope and the attenuation rules. Forged or expanded delegations get rejected at the boundary, not three hops later when the damage is done.
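The attenuation rule itself can be sketched deterministically: a handoff may narrow the scopes it received but never add to them. The sketch below assumes scopes are plain strings such as "issues:read" (a hypothetical format), and `Delegation` and `verify_chain` are illustrative names. A small model would sit beside a check like this for the cases that need judgment, such as free-text requests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delegation:
    agent: str
    scopes: frozenset  # hypothetical scope strings, e.g. {"issues:read"}


def attenuates(parent: Delegation, child: Delegation) -> bool:
    """A handoff may keep or narrow the parent's scopes, never widen them."""
    return child.scopes <= parent.scopes


def verify_chain(chain: list) -> tuple:
    """Return (True, None) if every hop attenuates its predecessor,
    otherwise (False, index_of_first_offending_hop)."""
    for i in range(1, len(chain)):
        if not attenuates(chain[i - 1], chain[i]):
            return (False, i)
    return (True, None)
```

Checking each hop against its immediate predecessor is enough here, because subset relations compose: if every hop narrows the previous one, the last hop is also within the original scope.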

Output classifier for audit. Every agent response gets a small-model second-pass before it leaves the system: PII check, policy-violation check, scope-adherence check. The labels become the audit trail. Regulated teams stop choosing between "log everything raw" and "log nothing reviewable."

A caveat worth saying out loud: small-model checks should gate narrow classification and policy-shape decisions. They are not a substitute for a deterministic policy engine, a real authorization service, or a domain-specific rules layer. The pattern is belt-and-suspenders before the expensive call, not single point of truth.

The deployment model matters too

Many regulated teams do run agents against approved cloud APIs and vendored endpoints, with appropriate controls. The pattern above is not an argument against that. It is an argument for what becomes possible when the check layer can sit fully inside your trust boundary, runnable on the same machine as the agent, with no per-decision call to a third party. Open-weight models that run on-prem are the prerequisite for that pattern. Once the small tier is capable enough, the architecture is open.

Reframe

The discourse around new model releases tends to be "is this as good as the frontier?" That's the wrong question for Gemma 4. The right question is: what becomes affordable now that wasn't before?

When an E2B model handles structured judgment well, the architecture changes. You stop treating your agent stack as one expensive model with retries and start treating it as a graph of small fast checks gated by occasional expensive reasoning. That graph is more reliable, more auditable, and significantly cheaper than what most teams run today.

Gemma 4's flagship variants will get the headlines. My bet is that E2B and E4B will quietly become load-bearing in production agent systems for teams that take policy enforcement seriously. If you're building one, that's where to look.

I work on agent identity and policy enforcement, including the Agent Identity Protocol and an open-source AI governance framework. More writing on agent systems at the Agent Engineering Lab and my LinkedIn newsletter Building AI Systems.

I tested 4 AI agent-governance tools against an open spec - here's the matrix

Sunil Prakash — Sat, 23 May 2026 12:09:08 +0000

The scenario

Your AI agent just deleted a customer record. Three months later, an auditor asks you to prove:

What tool actually ran (not "the agent made a deletion call" — the precise tool, version, and capability)
With what arguments (the exact customer ID, scoped fields, options — byte-for-byte)
Who approved it (which human, or which automated policy rule)
Against which version of which policy (the literal policy bundle the runtime evaluated, not "the policy at the time, probably")
Whether it actually succeeded (not "we said allow", but "the downstream system confirmed the row is gone")

You open your audit log.

It says: delete_customer approved, run_id=xyz, decision=allow. The arguments are in a different table. The policy version isn't recorded anywhere — you'd have to git log your settings file. The execution outcome lives in your application logs, which roll over after 30 days. And the auditor has no way to verify any of this without an engineer walking them through every join.

This gap shows up the moment an agent does something consequential and a non-engineer needs to understand what happened. It's the same gap regardless of which framework you used. Approval is not proof.

What's actually missing

The pattern across every agent-governance tool I looked at is the same: they're built around the decision (allow / deny / require-approval) and treat the action itself as an implementation detail. So the audit log records "the policy fired" but not a single record carrying everything a third party needs to reconstruct what actually happened.

A useful audit artifact has to survive the following:

It can be verified without trusting the runtime that produced it. If your auditor has to call your engineers to interpret the log, the log is testimony, not evidence.
The arguments and the decision are cryptographically bound. If args mutate between approval and execution, the audit must show it.
The policy version is in the record. Not "the policy at the time" — the literal bundle identifier.
The execution outcome is in the record. Approval ≠ execution. Both belong in the same artifact.
The chain of receipts is tamper-evident. Deleting a row from history must break something a verifier can detect.

A receipt that does all five becomes a single evidence record you can hand to an auditor, regulator, insurer, or a compliance team six months later — without them needing access to your database, your cloud creds, or your engineering team.

What I built

AgentBoundary is an open spec for that kind of receipt. v0.1 is stable; v0.2-alpha (draft) adds the optional provenance block and singly-linked chain shown in the example below. Same JSON document, deterministic schema, hash-bound to its arguments.

Here's one a Discord agent I run in production emitted on 2026-05-21 — it files GitHub issues on behalf of users:

{
  "version":      "agentboundary/v0.2-alpha",
  "receipt_id":   "f04df972-f9fc-4624-83cb-0ed3682297cf",
  "issued_at":    "2026-05-21T06:54:39.251Z",

  "actor": {
    "type":         "agent",
    "id":           "agent:jambot:discord:user:aa74fa40751b528f"
  },

  "tool":   { "name": "github-rest", "version": "2022-11-28", "capability": "github.issues.create" },
  "target": { "system": "github.com/jamjet-labs/jamjet-discord-bot", "environment": "prod" },

  "arguments_hash":  "2d257d4e72f62afa112766154b9b5ac0dd98ae79ee7c2758563a4363a0fb4bdf",
  "policy":          { "name": "jambot.file-issue.v1", "version": "1", "decision": "allow" },
  "execution":       { "status": "success", "completed_at": "2026-05-21T06:54:40.103Z", "result_ref": "github://issues/1" },

  "prior_receipt":      { "receipt_id": "cab5eff7-…", "receipt_hash": "3e7f5a93…" },
  "completeness_score": 0.913,
  "receipt_hash":       "..."
}

A verifier with only this JSON — no database, no Fly.io credentials, no GitHub token, no Discord session — can run six independent checks:

Tamper-evidence. Re-canonicalise the body without receipt_hash, take SHA-256, confirm it matches the stored hash.
Argument binding. Re-canonicalise the arguments separately, take SHA-256, confirm it matches arguments_hash. If anything mutated between approval and execution, this fails.
Spec compliance. Fetch the public JSON Schema, validate the receipt structurally.
Chain integrity. Fetch the receipt at prior_receipt.receipt_id and confirm its hash matches the link.
Emitter honesty. Recompute completeness_score from the provenance block using the deterministic formula in the spec. Catches an emitter that lies about how confident it was in each field.
Execution proof. Follow execution.result_ref to a real downstream artifact (in this case, a public GitHub issue) and read it.
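The first two checks can be sketched in a few lines. The canonical form used here (sorted keys, compact separators, UTF-8) is an assumption for illustration; the spec's actual canonicalisation rules may differ, and the helper names are hypothetical. The argument-binding check also assumes the verifier has been handed the arguments alongside the receipt.

```python
import hashlib
import json


def canonical(obj) -> bytes:
    # ASSUMPTION: sorted keys, no insignificant whitespace, UTF-8 encoding.
    # The real spec may define a different canonical form.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def sha256_hex(obj) -> str:
    return hashlib.sha256(canonical(obj)).hexdigest()


def seal(receipt: dict, arguments: dict) -> dict:
    """Emitter side: bind the arguments first, then hash the whole body."""
    body = {k: v for k, v in receipt.items() if k != "receipt_hash"}
    body["arguments_hash"] = sha256_hex(arguments)
    return {**body, "receipt_hash": sha256_hex(body)}


def check_tamper_evidence(receipt: dict) -> bool:
    """Re-hash the body without receipt_hash and compare to the stored hash."""
    body = {k: v for k, v in receipt.items() if k != "receipt_hash"}
    return sha256_hex(body) == receipt.get("receipt_hash")


def check_argument_binding(receipt: dict, arguments: dict) -> bool:
    """Re-hash the arguments and compare to arguments_hash."""
    return sha256_hex(arguments) == receipt.get("arguments_hash")
```

Because `arguments_hash` is part of the body that `receipt_hash` covers, changing either the arguments or any other field makes at least one of the two checks fail.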

How existing tools do against the bar

I built one adapter per vendor — translating their normative artifact (or, where they don't have one, the developer-recommended capture shape) into an AgentBoundary v0.2-alpha receipt. Then I ran all 40 conformance scenarios against the adapter-produced receipts.

Vendor	PASS	PARTIAL	DOCS-ONLY	NOT COVERED	N/A
JamJet reference	40	0	0	0	0
Anthropic permission_policy	12	9	3	14	2
Cloudflare HITL Agents	5	7	1	25	2
LangSmith Gateway	15	14	1	8	2
Microsoft AGT	17	5	1	15	2

Reference implementation first; vendors alphabetical. Not ranked. The PASS counts collapse meaningful categorical differences. Each vendor is solving for a different layer of the stack:

Anthropic's permission_policy is the richest runtime evaluation pipeline of the four — layered hooks, scoped tool patterns, permission modes, the canUseTool callback. But the audit log from Anthropic's Managed Agents Console isn't a published schema, so there's no portable artifact a third party can verify. That's why 3 DOCS-ONLY (highest of any vendor) and 14 NOT COVERED.
Cloudflare HITL is a workflow primitive — durable approval gates with multi-day windows and external notifications. It's deliberately not an emitted-artifact format. The 25 NOT COVERED reflects that their recommended audit table is 6 columns and doesn't model the things conformance is asking about.
LangSmith is an observability platform. The Run object captures the data, but where in the Run varies by team convention — one team puts the decision in tags, another in feedback_stats. A cross-team auditor can't reliably extract it. That's why 14 PARTIAL.
Microsoft AGT is the closest peer — also an artifact format, also designed for verifiable evidence, with a Merkle-chained audit log that's structurally stronger than AgentBoundary's current singly-linked design. The 15 NOT COVERED rows are deliberate scoping decisions, not bugs.

Per-vendor breakdowns with structural reasoning live in adapters/<vendor>/results.md in the public repo.

Where AgentBoundary itself currently falls short

The reference implementation scoring 40/40 against its own spec is the implementation grading itself. That's meaningful but not sufficient.

JamBot's emitter mutates receipts on approval-finalize. When a maintainer approves a held action, the existing row's execution.status is updated in place and receipt_hash is recomputed — which breaks chain links from any later receipt whose prior_receipt.receipt_hash was captured before the mutation. Fix queued for v0.2.
The chain is singly-linked, not Merkle. AGT's design (every entry commits to every preceding one) catches arbitrary-entry-reordering attacks that v0.2-alpha would miss. v0.3 candidate.
provenance is a 3-value enum where AGT has a float [0.0, 1.0]. Simpler to reason about, coarser in practice. v0.3 candidate if practitioner feedback warrants it.
No second non-reference implementation yet. Only one production deployment (JamBot). A second emitter in Rust, Go, or Java would validate the spec is implementation-portable.

These are also in the report's §8.
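The first limitation is easy to reproduce in miniature. The sketch below uses a simplified receipt with only three fields and an assumed canonical form, so it illustrates the failure mode rather than the JamBot emitter itself. The `pending_approval` status value and all function names are hypothetical.

```python
import hashlib
import json
from typing import Optional


def receipt_hash(body: dict) -> str:
    # ASSUMPTION: illustrative canonical form (sorted keys, compact separators).
    data = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def emit(receipt_id: str, status: str, prior: Optional[dict]) -> dict:
    """Emit a simplified receipt that links to the previous one by hash."""
    body = {"receipt_id": receipt_id, "execution": {"status": status}}
    if prior is not None:
        body["prior_receipt"] = {"receipt_id": prior["receipt_id"],
                                 "receipt_hash": prior["receipt_hash"]}
    return {**body, "receipt_hash": receipt_hash(body)}


def link_ok(later: dict, earlier: dict) -> bool:
    """Chain-integrity check for one link."""
    link = later["prior_receipt"]
    return (link["receipt_id"] == earlier["receipt_id"]
            and link["receipt_hash"] == earlier["receipt_hash"])


def finalize_in_place(receipt: dict, status: str) -> dict:
    """Update the status of an existing receipt and recompute its hash."""
    body = {k: v for k, v in receipt.items() if k != "receipt_hash"}
    body["execution"] = {"status": status}
    return {**body, "receipt_hash": receipt_hash(body)}
```

Emitting a held receipt, emitting a second receipt that links to it, and then finalizing the first in place leaves the second receipt pointing at a hash that no longer exists, so `link_ok` returns False for a chain that was never attacked.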

Run the suite yourself

npx agentboundary run scenarios/
# or
uvx agentboundary run scenarios/

60 seconds on a clean machine. No signup, no Docker, no account. Scenarios are at jamjet-labs/agentboundary/scenarios. If your results disagree, open an issue with the exact command and your environment — the suite is reproducible; if it isn't on your machine, that's a bug.

What I want from this post

If you maintain an agent-governance product and any of the per-scenario mappings are wrong: open a PR against adapters/<your-product>/. Right-to-respond issues are filed against all four vendors; windows close 2026-05-28 to 2026-05-30 and corrections are folded in inline.
If you're integrating agents into a regulated stack (finance, healthcare, infrastructure ops): try the suite against your own runtime. Emitting an AgentBoundary receipt from your existing audit log is usually a few hundred lines.
If you already have an audit format: map one of your real audit rows to the conformance scenarios and open an issue where the suite misrepresents your model. Concrete corrections are far more useful than general feedback. AGT and AgentBoundary's design centres are complementary; the two specs could reasonably converge.

Full report with the per-vendor deep-dives at jamjet.dev/blog/agent-action-control-40-tests. Canonical archive on the spec microsite at agentboundary.jamjet.dev/reports/2026-05-comparative.

Spec is Apache 2.0. Implementations welcome.

Every AI toolchain is inventing its own safety layer.

Sunil Prakash — Tue, 12 May 2026 18:26:09 +0000

Same policy. Three runtimes.

# Claude Code, with @jamjet/claude-code-hook installed as a PreToolUse hook:
> Delete the old customer records from the staging DB.

  Tool request: bash.shell_exec
  Args: psql -c "DELETE FROM customers WHERE created_at < '2024-01-01'"

  JamJet policy: BLOCKED (rule: shell.exec)
  Audit: ~/.jamjet/audit/2026-05-11/claude-code-hook.jsonl

# OpenAI Agents SDK (TS), with @jamjet/openai-guardrail wired into a refund tool:
JamjetPolicyBlocked: JamJet policy: BLOCKED
  (tool: payments.refund, rule: payments.*)

  Audit: ~/.jamjet/audit/2026-05-11/openai-guardrail.jsonl

# Claude Desktop talking to a Postgres MCP server, fronted by @jamjet/mcp-shim:
{"jsonrpc":"2.0","id":7,"error":{
  "code": -32000,
  "message": "JamJet policy: BLOCKED (rule: *delete*)",
  "data": {"tool": "postgres.delete_all_rows", "audit": "mcp-shim.jsonl"}
}}

Same policy.yaml. Three runtimes. One audit log.

The models are real. The tool calls came from real agent loops. The destructive payloads never reached the tool function.

The fragmentation problem

The market is converging on "control AI agent actions." But the primitives are not portable.

Anthropic shipped Claude Code hooks — PreToolUse, PostToolUse, Notification, and friends. They run as subprocesses, get JSON on stdin, and decide whether the tool call proceeds. OpenAI ships tool guardrails in the Agents SDK — Python (and now TS) callables you attach to a tool, with tripwire booleans that abort the run. The MCP ecosystem is sprouting gateways and proxies for the same purpose: MCPX, IBM ContextForge, Microsoft's MCP Gateway, Lasso Security's MCP Gateway — all reasonable answers to the same wire-level question.

Each one is competently designed for its own context. None of them speaks the same policy language.

A real team I talked to last month runs Claude Code for engineering workflows, OpenAI Agents SDK for a customer-facing copilot, and two MCP servers wired into Cursor for ad-hoc database work. Their security review asked one question: what can the agents do?

The honest answer required reading a settings.json, a Python guardrail file, two MCP gateway configs in different YAML dialects, and an internal Confluence page describing the production if-statements. Three audit trails in three formats. Three approval flows — one Slack bot, one PagerDuty escalation, and one human paging through the OpenAI trace viewer.

The platforms are not the problem. The hook API is good. The guardrail API is good. The MCP proxy pattern is good. The problem is the seam between them — every team writes their own.

The thesis

JamJet is the action-control plane for AI agents. One policy file. One audit trail. Across hooks, guardrails, MCP gateways, SDKs, and custom runtimes.

The portable layer underneath all of them is a single policy.yaml schema and a single audit JSONL schema. Every adapter reads the same YAML, writes the same JSONL. jamjet audit show tails the lot in one chronological view.

Phase 2 shipped five packages today: @jamjet/cloud@0.3.0, @jamjet/claude-code-hook@0.1.0, @jamjet/mcp-shim@0.1.0, @jamjet/openai-guardrail@0.1.0, and @jamjet/cli@0.1.0 on npm; plus jamjet 0.8.3 on PyPI with jamjet.integrations.openai_guardrail as the Python sister. Source at jamjet-labs/jamjet-policy.

Three adapters in one paragraph each

@jamjet/claude-code-hook wires into Claude Code's PreToolUse hook. One entry in Claude Code's user settings file, ~/.claude/settings.json:

{ "hooks": { "PreToolUse": [
    { "matcher": "*", "hooks": [
        { "type": "command", "command": "jamjet-hook --policy ~/.jamjet/policy.yaml" }
    ] }
] } }

Every tool call — native or MCP — runs through the policy before Claude Code invokes it. What it does: enforce, audit, and surface approval prompts as blocks in v0.1. What it does not do: replace Claude Code's own hook system. It is the hook.

@jamjet/mcp-shim sits between an MCP client (Claude Desktop, Cursor, an OpenAI Agents SDK MCP client) and any MCP server. You swap the server's command for the shim, pass the policy path, and put the real server after --:

{ "mcpServers": { "postgres": {
  "command": "npx",
  "args": [
    "-y", "@jamjet/mcp-shim", "--policy", "~/.jamjet/policy.yaml",
    "--server", "postgres", "--",
    "postgres-mcp", "--db", "postgresql://localhost/mydb"
  ]
} } }

The shim relays MCP traffic transparently. On a blocked tools/call, it returns a JSON-RPC error to the client — and the real MCP server never sees the request. What it does not do: replace the MCP protocol. It speaks MCP on both ends.
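As a rough illustration of the blocked path, the function below builds a JSON-RPC 2.0 error of the same shape as the example response near the top of this post. It assumes the incoming request follows the MCP tools/call layout, with the tool name under `params.name`. The function name and the reuse of `-32000` as the blocked code are taken from that example, not from the shim's source.

```python
POLICY_BLOCKED = -32000  # code shown in the example response earlier in this post


def blocked_response(request: dict, rule: str, audit_file: str) -> dict:
    """Build the error a shim could return instead of forwarding tools/call."""
    return {
        "jsonrpc": "2.0",
        # Echo the request id so the client can correlate the error.
        "id": request.get("id"),
        "error": {
            "code": POLICY_BLOCKED,
            "message": f"JamJet policy: BLOCKED (rule: {rule})",
            "data": {"tool": request["params"]["name"], "audit": audit_file},
        },
    }
```

The response carries an `error` member and no `result`, which is what tells a JSON-RPC client the call failed without the upstream server ever being contacted.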

@jamjet/openai-guardrail (and its Python sister, jamjet.integrations.openai_guardrail) plugs into the OpenAI Agents SDK's inputGuardrails API. One line on a tool definition:

import { tool } from '@openai/agents'
import { jamjetGuardrail } from '@jamjet/openai-guardrail'

const refund = tool({
  name: 'payments.refund',
  inputGuardrails: [jamjetGuardrail({ policy: '~/.jamjet/policy.yaml' })],
  execute: refundCustomer,
})

Blocks throw JamjetPolicyBlocked. Approval-required calls throw JamjetApprovalRequired in v0.1 — the SDK aborts the run, audit gets written, and the run id is recoverable. What it does not do: replace the SDK's tripwire pattern. It is a tripwire.

The unified policy and audit

The policy file every adapter reads:

version: 1
rules:
  - { match: "*delete*",   action: block }
  - { match: "shell.exec",  action: block }
  - { match: "payments.*",  action: require_approval }
  - { match: "database.read_*", action: allow }
audit:
  destination: ~/.jamjet/audit

Glob match. Four actions: allow, block, require_approval, audit. Same shape in every adapter.
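A minimal sketch of how glob rules like these could be evaluated, using Python's standard `fnmatch` module. Two things are assumptions rather than documented behaviour: that the first matching rule wins, and that a tool matching no rule falls through to a configurable default. The rule list mirrors the policy file above.

```python
from fnmatch import fnmatchcase

RULES = [
    {"match": "*delete*", "action": "block"},
    {"match": "shell.exec", "action": "block"},
    {"match": "payments.*", "action": "require_approval"},
    {"match": "database.read_*", "action": "allow"},
]


def decide(tool_name: str, rules=RULES, default: str = "allow"):
    """Return (action, matched_pattern).

    ASSUMPTIONS: rules are checked in file order and the first match wins;
    a tool that matches nothing gets `default`.
    """
    for rule in rules:
        if fnmatchcase(tool_name, rule["match"]):
            return rule["action"], rule["match"]
    return default, None
```

Under plain `fnmatch` semantics the pattern `shell.exec` matches only that exact name, so it would not match a tool called `bash.shell_exec`. The real adapters presumably normalise tool names or match differently, and this sketch does not try to reproduce that.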

The audit log every adapter writes:

$ jamjet audit show
v 2026-05-11T10:14:02Z  claude-code-hook    fs.read_file           ALLOWED
x 2026-05-11T10:14:18Z  claude-code-hook    bash.shell_exec        BLOCKED               shell.exec
x 2026-05-11T10:21:47Z  mcp-shim            postgres.delete_rows   BLOCKED               *delete*
~ 2026-05-11T10:33:11Z  openai-guardrail    payments.refund        WAITING_FOR_APPROVAL  payments.*
v 2026-05-11T10:41:55Z  python-sdk          customers.search       ALLOWED
x 2026-05-11T10:52:09Z  openai-guardrail    db.drop_table          BLOCKED               *delete*
v 2026-05-11T11:07:33Z  mcp-shim            github.list_issues     ALLOWED

Four files in ~/.jamjet/audit/2026-05-11/, one row per decision, sorted by timestamp. Pending approvals live in ~/.jamjet/pending/<run-id>.json and clear via jamjet approve <run-id> or jamjet reject <run-id>. The audit format is documented in the conformance spec, and the v1 schema is what each adapter is tested against in CI.

The unified audit view is the part of the launch we are most willing to defend: the answer to "what can the agents do?" can be read once, in one place.
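The "one chronological view" idea can be sketched as a merge over per-adapter JSONL files. The field names used here (`ts`, `tool`, `decision`) are hypothetical, since the v1 audit schema is not reproduced in this post, and `load_audit` is an illustrative helper rather than part of the JamJet CLI.

```python
import json
from pathlib import Path


def load_audit(day_dir: Path) -> list:
    """Read every *.jsonl file in a day directory and return one sorted list."""
    rows = []
    for path in sorted(day_dir.glob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if not line.strip():
                    continue  # tolerate blank lines
                row = json.loads(line)
                # Record which adapter wrote the row if the row does not say.
                row.setdefault("adapter", path.stem)
                rows.append(row)
    # ISO-8601 UTC timestamps in one fixed format sort correctly as strings.
    rows.sort(key=lambda r: r["ts"])
    return rows
```

Sorting on the timestamp string works only because every adapter writes the same UTC format, which is one practical payoff of committing to a shared audit schema.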

What's honest

Each adapter is at v0.1. The policy YAML and audit JSONL shapes are committed to v1 and covered by conformance tests across all four adapters. Adapter-specific options will evolve in minor versions.
Approval surfaces as exceptions or blocks in v0.1 for hook, guardrail, and Python adapters. The filesystem flow works end-to-end today — jamjet approve <run-id> flips a pending file and the next run unblocks. SDK-integrated approval (the OpenAI Agents SDK approval API, Claude Code's native settings surface) and a web UI both land with JamJet Cloud sync in v0.2.
MCP shim is stdio only in v0.1. HTTP/SSE MCP transports land in Phase 3 alongside the Java/Spring adapter.
JamJet Cloud sync — shared team policies, cloud audit retention, signed approvals — is the v0.2 milestone. Today's flow is local-only by design, so nothing leaves the developer's machine unless you opt into Cloud.
One Phase 1 line still applies: the demo agent prompts are real, the enforcement path is real, the audit is real. Pre-baked deterministic agents are clearly labelled as such.

Try it

# Claude Code hook:
npm i -g @jamjet/claude-code-hook

# MCP shim (zero-install):
npx -y @jamjet/mcp-shim --help

# OpenAI Agents SDK guardrail (TS):
pnpm add @jamjet/openai-guardrail
# or Python:
pip install jamjet  # includes jamjet.integrations.openai_guardrail

# Unified CLI:
npm i -g @jamjet/cli
jamjet audit show

Star jamjet-labs/jamjet-policy — the Phase 2 monorepo.
Read the Phase 1 launch post for the deeper argument about why the runtime, not the model, is the safety boundary.
Join the JamJet Discord to talk through your toolchain — we want to know which extension points to plug into next.

Phase 3 is the Java/Spring adapter, MCP HTTP/SSE transport, and JamJet Cloud sync. Same policy. More surfaces.

Your AI agent already emits OpenTelemetry. Why aren't you watching it?

Sunil Prakash — Sat, 09 May 2026 02:16:14 +0000

TL;DR: Spring AI, LangChain4j, Koog (Kotlin), the Python OpenLLMetry-style instrumentations, and the Go OTel SDKs all emit gen_ai.* spans natively now. So you don't need a vendor SDK to make your agent observable. You need an OTLP endpoint that knows what to do with the spans your framework is already putting on the wire. Here's what that looks like in a few lines of YAML or one Kotlin extension.

Someone on your team shipped an LLM agent two months ago. Today it ran up a $400 bill in twenty minutes, hallucinated a refund policy to a real customer, and got stuck in a tool-calling loop that retried the same broken payments.create call seventeen times before the rate limiter caught it.

You'd like to know which of those things happened first, which agent was responsible (you're up to four now), what the user typed, and why the planner decided that calling the payments API was a reasonable response to "how do I unsubscribe."

If you're observing your agents the same way you observe your other services, you can answer maybe two of those questions and only after a long Slack thread with the engineer who wrote the prompt. The trace span called POST /chat doesn't help you. Neither does the metric for p99 latency on /v1/agent/run.

This post is about why that gap exists, why it's about to close, and what to do about it today.

The agent observability gap

Two existing approaches sort of work and mostly don't:

Generic APM (Datadog, New Relic, Honeycomb-as-default-config) treats your agent like any other HTTP service. You get latency histograms, error rates, and a top-level span. You don't get the prompt, the model, the token counts, the tool calls, or the cost. The signal is buried under "request body" or not captured at all.

Vendor LLM-observability SDKs (Langfuse, Helicone, Phoenix, the proprietary ones) capture all the right signals but ship as a heavy SDK that you bolt onto your service. Every framework upgrade is now a coordination problem. Every backend switch is a rewrite. And the more frameworks your stack uses (Spring AI for the orchestrator, LangChain4j for the RAG service, Koog for that one Kotlin pilot), the more SDKs you carry.

Neither is the right shape. The right shape is: your framework emits standard signal, your backend understands standard signal, you change zero application code.

Until recently that wasn't possible. The OpenTelemetry community had been working on the gen_ai.* semantic conventions for a year, but framework support was uneven and the conventions kept shifting.

That changed in the last six months. Concretely:

Spring AI 1.0 emits gen_ai.client.chat, gen_ai.tool.execute, and friends via Micrometer Observations.
LangChain4j emits the same via its ChatModelListener API.
Koog 0.8 ships a first-class OpenTelemetry feature with addDatadogExporter, addLangfuseExporter, addWeaveExporter.
Python OpenLLMetry / OpenInference instrumentations (Anthropic, OpenAI, LangChain, LlamaIndex) emit the same conventions and stream through standard OTel exporters.
Go's otel-instrumentation-genai is in alpha with the same shape.

In other words: the signal is on the wire, in standard form, regardless of which framework your team picked. What's missing is a backend that does something useful with it.

The Spring AI version: a few lines of YAML

Stock Spring Boot 3.5 + Spring AI 1.0. No vendor SDK on the classpath. Just the standard OTel pieces (micrometer-tracing-bridge-otel, opentelemetry-exporter-otlp).

In application.yml:

management:
  otlp:
    tracing:
      endpoint: ${JAMJET_API_URL}/v1/otlp/v1/traces
      headers:
        authorization: "Bearer ${JAMJET_API_KEY}"
  tracing:
    sampling:
      probability: 1.0

That's it. Spring AI's gen_ai.client.chat spans get serialized as standard OTLP/HTTP-protobuf and posted to a JamJet Cloud project. Demo: jamjet-runtime-java/examples/spring-ai-engram-cloud-demo.

The same pattern works against any OTLP-aware backend. The endpoint URL is the only thing that varies.

The Kotlin Koog one-liner

AIAgent(...) {
    install(OpenTelemetry) {
        setServiceInfo(serviceName = "memory-agent", serviceVersion = "1.0")
        addJamjetCloudExporter()  // reads JAMJET_API_KEY from env
    }
}

The extension is about 20 lines wrapping the standard OtlpHttpSpanExporter. Demo: jamjet-runtime-java/examples/kotlin-koog-engram-cloud-demo. We've filed an upstream YouTrack issue proposing that this land in agents-features-opentelemetry-jvm directly.

And for Python / Go folks

The same shape works:

Python:

export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="https://api.jamjet.dev/v1/otlp/v1/traces"
export OTEL_EXPORTER_OTLP_TRACES_HEADERS="authorization=Bearer ${JAMJET_API_KEY}"

Plus pip install opentelemetry-instrumentation-anthropic (or -openai, -langchain, etc.) and a one-line instrument() call. The OpenInference and OpenLLMetry projects each ship instrumentations for the major frameworks.

Go:

Standard otelhttp for the LLM client wrapper, plus otlptracehttp.New(...) configured to point at /v1/otlp/v1/traces. The instrumentation surface is younger but moving fast.

The point: once your framework speaks gen_ai.* OTel, the only language-specific code you write is the exporter setup. And that's stock OTel boilerplate, not vendor-specific.

What you get on the other side

The interesting question, which generic OTel backends won't help you with, is what the receiver does with these spans. The signals an agent owner actually wants:

Multi-agent network graph — every cross-agent call is a node, edges show who-called-whom with cost and latency rolled up per edge. (W3C traceparent plus a jj tracestate segment links agents across HTTP hops.)
Cost rollups per agent / model / end-user — computed server-side from gen_ai.usage.*_tokens against current vendor pricing. No pricing table in your app.
Failure-mode pie chart — typed exception classification, not just "HTTP 5xx" buckets.
Cross-agent identity — when a user request fans out across three agents, you see the same end-user-id stitched across all of them.
Policy enforcement + audit export — Ed25519-signed JSON+CSV+PDF audit packages for the SOC2-ish surface, OTLP-formatted exports for SIEM tools (Splunk, Datadog Logs).

If "the safety layer behind your AI agents" is something you've been trying to articulate to your CTO, that's the shape we're building for.
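To make the cost-rollup item concrete, here is a small sketch of computing per-agent cost from token-usage attributes. It assumes each span has been flattened to a dict carrying `gen_ai.request.model`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, and the emitting agent's `service.name`. The price table is entirely hypothetical and is not real vendor pricing.

```python
from collections import defaultdict

# HYPOTHETICAL prices in USD per million tokens; not real vendor pricing.
PRICES = {"model-a": {"input": 1.00, "output": 4.00}}


def rollup_cost(spans: list, prices: dict = PRICES) -> dict:
    """Sum estimated cost per agent from gen_ai.usage.* attributes."""
    totals = defaultdict(float)
    for attrs in spans:
        price = prices.get(attrs.get("gen_ai.request.model"))
        if price is None:
            continue  # unknown model: skip rather than guess a price
        cost = (attrs.get("gen_ai.usage.input_tokens", 0) * price["input"]
                + attrs.get("gen_ai.usage.output_tokens", 0) * price["output"]
                ) / 1_000_000
        # ASSUMPTION: the agent is identified by its service.name.
        totals[attrs.get("service.name", "unknown")] += cost
    return dict(totals)
```

Keeping the price table on the receiving side is what lets pricing change without redeploying any agent, which is the point of the "no pricing table in your app" claim above.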

Why this architecture is the durable bet

Three reasons stock-OTel-plus-LLM-aware-backend is likely to beat vendor SDKs over the next two years:

No SDK on the classpath / requirements.txt / go.mod. The exporter is already in your app for HTTP tracing — you change one URL.
Backend-portable. Every line in those demos works against Honeycomb, Tempo, Jaeger, Datadog, or a self-hosted OTel collector. That's a real CTO-pitch argument when "vendor lock-in" comes up in the procurement review.
The frameworks are doing the work. Spring AI's Observation handlers, LangChain4j's listeners, Koog's OpenTelemetry feature, Python OpenLLMetry instrumentations — these aren't vendor projects. They're the framework's own contracts. Every release brings new signals for free.

Try it

Demos: github.com/jamjet-labs/jamjet-runtime-java
Cloud sign-up: jamjet.dev (free tier)
Spring Boot starter on Maven Central (for users who want a richer in-process path than stock OTLP): dev.jamjet:jamjet-cloud-spring-boot-starter:0.2.0

If you're at Devoxx this week and want to see this running against a real Spring AI or Koog agent, drop me a line — happy to do a five-minute hallway walkthrough. If your stack is Python or Go and you'd like the equivalent demo for your language, that's the next post — let me know in the comments which framework so I can pick the right starting point.

If you've got opinions on what AI-agent observability should look like, especially the bits I've glossed over (multi-tenancy, on-prem, BYO collector), the comments are open.

The State of Memory in Java AI Agents (April 2026)

Sunil Prakash — Tue, 07 Apr 2026 17:12:08 +0000

This post was originally published on jamjet.dev.

TL;DR

If you're building AI agents in Java today, your options for persistent memory range from "store the last 20 chat messages in Postgres" to "run a Python service in a sidecar container and call it over HTTP." There is no Java-native equivalent to Mem0, Zep, or Letta — the libraries Python developers reach for when they need real memory.

This post is a tour of every option a Java developer has in April 2026, why most of them stop at chat history, what "real memory" should actually mean, and one library we shipped to fill the gap.

The scenario every Java AI developer recognises

You're building an AI agent in Spring Boot. Maybe it's a customer support copilot, maybe it's a coding assistant, maybe it's a research agent. You wire up Spring AI or LangChain4j, write a few tools, and the first conversation works.

Then your user comes back the next day. The agent doesn't remember them. It doesn't remember they're allergic to peanuts. It doesn't remember they're working on the Acme migration. It doesn't remember they prefer verbose explanations. Every conversation starts from zero.

You search for "Java AI agent memory" and end up with three kinds of results:

Tutorials on how to store chat messages in Postgres
Marketing pages for Mem0 and Zep — Python only
GitHub issues asking why there's no Java SDK

This is the gap.

"Memory" means three different things

Before we tour the libraries, we need to be precise. There are at least three different things people mean when they say "agent memory":

1. Conversation history. The last N messages of the current session. Solved problem — every framework ships this.

2. State checkpointing. Snapshots of agent execution state for resume and replay. Solved by LangGraph, Koog persistence, Temporal-style runtimes.

3. Long-term knowledge memory. Facts about the user, their preferences, their projects, their history — extracted from conversations, stored durably, retrievable across sessions, and de-conflicted when they change. This is what Mem0 and Zep do. It is not solved on the JVM.

The rest of this post is about the third one.

What real memory needs

Fact extraction. An LLM reads a conversation and pulls out discrete, atomic facts.
Conflict detection. When a new fact contradicts an old one, the system invalidates the old fact.
Hybrid retrieval. Vector + keyword + graph walk fused together.
Temporal reasoning. Facts have validity windows.
Token-budgeted context assembly. Pick which facts go in the prompt and respect the budget.
Decay and consolidation. Stale facts fade, frequent facts get promoted, duplicates merge.
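
To make conflict detection and validity windows concrete, here is a minimal sketch in plain Java. It is an illustration, not how Mem0, Zep, or any library in this post implements it: the class, record, and method names are all hypothetical, and "conflict" is reduced to an exact match on subject and attribute, where a real system would use similarity search plus an LLM judgment.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: none of these names come from a real memory library.
class TemporalFactStore {

    // A fact holds from validFrom until validTo; a null validTo means "still believed".
    record Fact(String subject, String attribute, String value, Instant validFrom, Instant validTo) {
        boolean isCurrent() { return validTo == null; }
    }

    private final List<Fact> facts = new ArrayList<>();

    // Records a new fact. If a current fact occupies the same subject/attribute slot
    // with a different value, its validity window is closed rather than deleted.
    void assertFact(String subject, String attribute, String value, Instant now) {
        for (int i = 0; i < facts.size(); i++) {
            Fact old = facts.get(i);
            boolean sameSlot = old.subject().equals(subject) && old.attribute().equals(attribute);
            if (sameSlot && old.isCurrent()) {
                if (old.value().equals(value)) {
                    return; // already believed; nothing to change
                }
                facts.set(i, new Fact(old.subject(), old.attribute(), old.value(), old.validFrom(), now));
            }
        }
        facts.add(new Fact(subject, attribute, value, now, null));
    }

    // Returns the value that is currently believed for this slot, if any.
    Optional<String> currentValue(String subject, String attribute) {
        return facts.stream()
                .filter(f -> f.isCurrent() && f.subject().equals(subject) && f.attribute().equals(attribute))
                .map(Fact::value)
                .findFirst();
    }

    // Full history, including facts whose windows have been closed.
    List<Fact> history() { return List.copyOf(facts); }

    public static void main(String[] args) {
        var store = new TemporalFactStore();
        store.assertFact("alice", "lives_in", "Austin", Instant.parse("2025-06-01T00:00:00Z"));
        store.assertFact("alice", "lives_in", "Denver", Instant.parse("2026-02-01T00:00:00Z"));
        System.out.println(store.currentValue("alice", "lives_in")); // the Denver fact
        store.history().forEach(System.out::println); // the Austin fact now has a validTo
    }
}
```

Keeping the superseded fact with a closed window, instead of overwriting it, is what lets the agent answer "where did Alice live last year?" as well as "where does she live now?".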

Tour: every option Java developers have today

LangChain4j ChatMemory

Most popular JVM AI framework. Ships a ChatMemory interface with two implementations, MessageWindowChatMemory and TokenWindowChatMemory. Persistence is via a developer-implemented ChatMemoryStore.

What it does: stores message objects, respects token/count limits. What it does not do: extract facts, deduplicate, retrieve semantically, reason about time. The docs are explicit — ChatMemory is a container abstraction.
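
A sliding window is easy to picture in code. The sketch below is plain Java written for this post's argument, not LangChain4j's implementation; the class and record names are hypothetical. It shows why a container abstraction forgets: eviction is by age, not by importance.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Illustrative only: a count-based message window, not a real framework class.
class SlidingMessageWindow {
    record Message(String role, String content) {}

    private final int maxMessages;
    private final Deque<Message> window = new ArrayDeque<>();

    SlidingMessageWindow(int maxMessages) {
        if (maxMessages < 1) {
            throw new IllegalArgumentException("maxMessages must be >= 1");
        }
        this.maxMessages = maxMessages;
    }

    // Appends a message and evicts the oldest ones once the window is full.
    void add(Message message) {
        window.addLast(message);
        while (window.size() > maxMessages) {
            window.removeFirst(); // dropped regardless of what it said
        }
    }

    List<Message> messages() { return List.copyOf(window); }

    public static void main(String[] args) {
        var memory = new SlidingMessageWindow(2);
        memory.add(new Message("user", "I'm allergic to peanuts"));
        memory.add(new Message("assistant", "Got it."));
        memory.add(new Message("user", "What should I cook for dinner?"));
        // The allergy message has been evicted by the time the dinner question arrives.
        memory.messages().forEach(System.out::println);
    }
}
```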

Spring AI ChatMemory

Shipped GA in 2025 with broad backend support: JDBC, Cassandra, Mongo, Neo4j, Cosmos DB. Three advisors plug it into ChatClient.

The VectorStoreChatMemoryAdvisor is the closest thing to "semantic memory": it indexes conversation messages in your VectorStore. But those are raw messages, not extracted facts. No entity model, no relationship graph, no conflict detection.

Google ADK for Java

Ships 1.0.0 with two memory implementations: InMemoryMemoryService (keyword matching only) and VertexAiMemoryBankService (Vertex AI only). Memory Bank is excellent but Google Cloud-locked.

Koog (JetBrains)

Kotlin-first framework with AgentMemory storing facts by Concept, Subject, Scope. Closest competitor on the "facts about subjects" axis.

Two caveats: Java consumption is awkward, and GitHub issue JetBrains/koog#1001 documents that AgentMemory floods prompts as facts accumulate — no token budgeting.
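
For contrast, token budgeting can be as simple as a greedy pass over scored facts. This is a hedged sketch in plain Java, not Koog's API or anyone's shipped implementation: the names are hypothetical, the relevance scores are assumed to come from an earlier retrieval step, and the four-characters-per-token estimate is a rough heuristic standing in for a real tokenizer.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of token-budgeted context assembly.
class TokenBudgetedContext {
    record ScoredFact(String text, double relevance) {}

    // Assumed heuristic: roughly four characters per token, rounded up.
    static int estimateTokens(String text) {
        return Math.max(1, (text.length() + 3) / 4);
    }

    // Greedily keeps the most relevant facts that still fit in the budget.
    static List<String> select(List<ScoredFact> candidates, int tokenBudget) {
        List<ScoredFact> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator.comparingDouble(ScoredFact::relevance).reversed());

        List<String> chosen = new ArrayList<>();
        int used = 0;
        for (ScoredFact fact : ranked) {
            int cost = estimateTokens(fact.text());
            if (used + cost > tokenBudget) {
                continue; // too big for the remaining budget; a smaller fact may still fit
            }
            chosen.add(fact.text());
            used += cost;
        }
        return chosen;
    }

    public static void main(String[] args) {
        var facts = List.of(
                new ScoredFact("Prefers verbose explanations with worked examples", 0.5),
                new ScoredFact("Allergic to peanuts", 0.9),
                new ScoredFact("Lives in Austin", 0.8));
        // With a budget of 10 estimated tokens, the long low-relevance fact is left out.
        System.out.println(select(facts, 10));
    }
}
```

The point of the budget is that the prompt stays a fixed size as memory grows; what changes over time is which facts win the slots.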

Embabel

Rod Johnson's JVM agent framework. Uses a blackboard pattern — shared state per agent run.

Per the maintainers: "in Embabel it's not about conversational memory so much as domain objects that are stored in the blackboard during the flow." Long-term memory is an explicit non-goal.

Mem0 Java SDK (the one that doesn't exist)

The top Google result is me.pgthinker:mem0-client-java, a community wrapper at version 0.1.3, last updated nine months ago, with 9 GitHub stars. It's a thin REST client requiring a Python Mem0 server alongside your JVM app.

No official Mem0 Java client exists. Python and Node.js only.

Zep Java SDK (also doesn't exist)

Zep's official clients are Python, TypeScript, and Go. No Java SDK.

DIY (what most teams actually do)

When Java teams need real memory today, they assemble:

Postgres + pgvector (or Qdrant) for embeddings
JdbcChatMemoryRepository for messages
Custom advisor that calls an LLM to extract facts
Custom retrieval layer combining vector and keyword search
Nightly cron job for decay and dedup
Custom token-budgeting in the prompt builder

Roughly 1,500–3,000 lines of bespoke Java per team. Quietly diverges between projects. Rarely gets temporal reasoning right. Almost never gets consolidation right.
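
As one example of what that bespoke retrieval layer tends to contain, here is a sketch of reciprocal rank fusion (RRF), a common way to merge a vector ranking with a keyword ranking. It is one option among several, not something this post claims every team uses; the class name and the fact ids are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: reciprocal rank fusion over any number of ranked id lists.
class ReciprocalRankFusion {

    // Each id scores 1 / (k + rank) per list it appears in; k = 60 is a commonly used default.
    static List<String> fuse(List<List<String>> rankedLists, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : rankedLists) {
            for (int i = 0; i < ranking.size(); i++) {
                int rank = i + 1; // ranks are 1-based
                scores.merge(ranking.get(i), 1.0 / (k + rank), Double::sum);
            }
        }
        List<String> fused = new ArrayList<>(scores.keySet());
        // Highest fused score first; ties broken by id so the order is deterministic.
        fused.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return fused;
    }

    public static void main(String[] args) {
        List<String> vectorHits = List.of("f1", "f2", "f3");  // from an embedding search
        List<String> keywordHits = List.of("f2", "f4", "f1"); // from a full-text search
        System.out.println(fuse(List.of(vectorHits, keywordHits), 60));
    }
}
```

RRF only looks at ranks, so it sidesteps the problem of comparing cosine similarities with keyword scores that live on different scales.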

The pattern

Every Java memory option lives in one of two boxes:

Chat history persistence (LangChain4j, Spring AI core, Embabel)
State checkpointing (LangGraph4j, Koog persistence)

Nothing in between. No JVM-native library that does fact extraction + conflict resolution + temporal graph + hybrid retrieval + consolidation in one dependency.

The Python ecosystem has had Mem0 since 2024 and Zep/Graphiti since early 2025. The Java ecosystem is roughly 18 months behind.

What we built

I run JamJet. As we built our agent runtime, the memory gap kept showing up. So we built a memory layer.

Engram is a durable memory system that does the things on the list:

Fact extraction from conversation messages via LLM
Conflict detection — vector similarity threshold plus LLM resolution
Hybrid retrieval — vector + SQLite FTS5 keyword + graph walk
Temporal knowledge graph with validity windows
Token-budgeted context assembly with three output formats
5-operation consolidation engine: decay, promote, dedup, summarize, reflect
MCP server option

Runs against SQLite by default. No Postgres, no Qdrant, no Neo4j, no Python sidecar.

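import dev.jamjet.engram.EngramClient;
import dev.jamjet.engram.EngramConfig;

import java.util.List;
import java.util.Map;

public class EngramQuickstart {
    public static void main(String[] args) throws Exception {
        // try-with-resources closes the client when the block exits
        try (var memory = new EngramClient(EngramConfig.defaults())) {
            // Store a short conversation under the user id "alice"
            memory.add(
                List.of(
                    Map.of("role", "user",      "content", "I'm allergic to peanuts and live in Austin"),
                    Map.of("role", "assistant", "content", "Got it, I'll remember that.")
                ),
                "alice", null, null
            );

            // Ask for prompt-ready context for a new query from the same user
            var context = memory.context(
                "what should I cook for dinner",
                "alice", null, 1000, "system_prompt"
            );

            System.out.println(context.get("text"));
        }
    }
}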

Maven Central:

<dependency>
    <groupId>dev.jamjet</groupId>
    <artifactId>jamjet-sdk</artifactId>
    <version>0.4.3</version>
</dependency>

Apache 2.0. Rust runtime published as jamjet-engram on crates.io.

What it doesn't do (yet)

No Spring Boot auto-configuration yet (starter on roadmap)
No JDBC backend (SQLite-first, Postgres in 0.5.x)
No managed cloud option
No published LongMemEval / DMR scores yet (benchmarks running, not going to cherry-pick)

Try it

<dependency>
    <groupId>dev.jamjet</groupId>
    <artifactId>jamjet-sdk</artifactId>
    <version>0.4.3</version>
</dependency>

Or run it as an MCP server:

cargo install jamjet-engram-server
engram serve --db memory.db

GitHub: github.com/jamjet-labs/jamjet

If you've been quietly rolling your own memory layer in Java, I'd love to hear what you ended up with. Reach out via GitHub issues or the comments below.

Introducing Agentic AI for Serious Engineers: A Free Practical Book + Code Repo

Sunil Prakash — Fri, 27 Mar 2026 04:33:12 +0000

Agentic AI is moving fast.

There are new frameworks, new demos, new orchestration patterns, and new opinions almost every week. That is exciting, but it also creates a problem for engineers trying to build real systems.

A lot of the material online is either too abstract, too framework-specific, too hype-heavy, or too focused on demo magic instead of production reality.

I wanted something different.

So I started writing Agentic AI for Serious Engineers - a free practical book and code repo for engineers who want to build agent systems that are not just impressive in a demo, but usable, testable, observable, and trustworthy in real environments.

Why I wrote it

I kept running into the same gap.

There is plenty of excitement around agents, but much less practical material on questions like:

What actually makes a system agentic?
When should you use an agent instead of a workflow?
How should tools be designed so they are safe and reliable to call?
What does good context design look like?
How do you evaluate an agent system beyond “it seemed to work”?
Where should human approval sit in the architecture?
How do you think about reliability, security, and observability from the start?

Those are the questions this book tries to answer.

Who this is for

This book is for engineers and architects building real AI systems:

backend engineers
platform engineers
staff+ engineers
software architects
technical leads
data engineers working on production AI applications

It assumes you already understand software systems.

It does not assume you already know how to design agent systems well.

What makes it different

This project is intentionally practical.

It is not a catalog of trendy frameworks.
It is not a collection of magical claims.
It is not “plug in this library and everything works.”

Instead, it focuses on the engineering questions that matter when systems move closer to production:

tool contracts
control flow
context boundaries
evaluation
approval gates
reliability
hardening
traceability
operating tradeoffs

The goal is not just to help someone build an agent.

The goal is to help them build one with better judgment.

What’s inside

The repo currently includes:

7 chapters
2 threaded end-to-end projects
per-chapter code
tests and eval-oriented structure
diagrams, principles, and roadmap
a free online version of the book

Some of the topics include:

what “agentic” actually means
tools, context, and the agent loop
workflow-first design
multi-agent systems
human-in-the-loop architecture
evaluating and hardening agents
when not to use agents

The kind of engineer I wrote this for

I wrote this for the engineer who looks at agentic AI and thinks:

This is interesting.

But how do I build it in a way I can actually trust?

That is the center of gravity of the book.

Not anti-agent.

Not pro-hype.

Just serious about engineering.

Read it free

You can read the book and explore the code here:

https://github.com/sunilp/agentic-ai

If you end up reading it, I’d genuinely love feedback on:

what feels most useful
what is still unclear
what you’d want covered more deeply

I’m continuing to improve it, and I’d love for it to become a genuinely useful resource for engineers building the next generation of AI systems.

Your AI CLI Writes Code. Mine Tells You What It'll Break.

Sunil Prakash — Sun, 22 Mar 2026 15:18:05 +0000

AI CLI tools are everywhere right now. Claude Code, Gemini CLI, GitHub Copilot in the terminal — they'll write your code, refactor your modules, even run your tests.

But ask any of them: "If I rename this function, what breaks?"

They'll scan the files they can see, make their best guess, and probably miss the SQL view that reads the column you're about to change. Or the Java batch job that calls your Python function through a stored procedure. Or the dbt model downstream of the table your migration is about to alter.

That's not a knock on AI. It's just not what LLMs are built for. Dependency analysis needs deterministic static analysis, not probabilistic text generation.

The gap in every AI CLI

Here's what I noticed building with these tools: they're incredible at writing code but terrible at understanding what already depends on it.

Ask Claude Code to "add retry logic to the HTTP client" — brilliant. Ask it "what will break if I change the response shape of getUser" — it'll read a few files and give you a confident answer that misses half the callers.

That's because LLMs work with whatever fits in context. Your codebase has thousands of files. The SQL stored procedure that calls your function through EXEC isn't in the context window.

Static analysis that crosses language boundaries

I built Jam to fill this gap. It's a developer CLI with 40+ commands, but the one that keeps saving me is jam trace --impact.

It uses tree-sitter to parse your entire workspace (TypeScript, Python, Java, and SQL) and builds a SQLite index of every function, every call site, every import, and every SQL column reference. Then it gives you a deterministic answer:

$ jam trace updateBalance --impact

Impact Analysis for updateBalance
───────────────────────────────────
Direct callers:
  PaymentService.processRefund()  [Java]
  BATCH_NIGHTLY_RECONCILE         [SQL]

Column dependents:
  VIEW v_customer_summary   (reads customer.balance)
  PROC_MONTHLY_STATEMENT    (reads customer.balance)

Risk: HIGH — 2 callers across 2 languages, 2 column dependents

No hallucination. No "I think these files might be affected." Two callers in two languages, two column dependents, risk level HIGH. Deterministic.

Why LLMs can't do this (yet)

Context window limits — Your 200-file Java project with SQL migrations doesn't fit in any context window. Static analysis indexes everything.
Cross-language boundaries — A Java class running EXEC update_user is calling a SQL stored procedure. LLMs see a string. Tree-sitter sees a cross-language call.
Column-level tracking — When a SQL view reads customer.balance, and your function writes to customer.balance, that's a dependency. No LLM tracks this.
Determinism — Ask an LLM the same question twice, get different answers. Ask jam trace twice, get the same graph. For impact analysis, you need guarantees, not guesses.

Not replacing AI — complementing it

Jam isn't anti-AI. It literally has AI built in — jam ask, jam go (agentic execution), jam commit (AI-powered commit messages), jam review. It works with Ollama, Copilot, OpenAI, Anthropic, and Groq.

But for the question "what breaks if I change this?" — AI is the wrong tool. You wouldn't ask ChatGPT to run your test suite. You shouldn't ask it to trace your dependency graph either.

The best workflow: use Claude Code or Gemini CLI to write the change, then use jam trace --impact to verify the blast radius before you ship.

# Trace any symbol's callers and callees
jam trace createProvider --depth 5

# Get the full impact report
jam trace updateBalance --impact

# Output as Mermaid diagram for docs
jam trace handleRequest --mermaid

# JSON for CI/automation
jam trace processPayment --impact --json

How it works

Tree-sitter parsing — Builds ASTs for TypeScript, Python, Java, SQL
Symbol extraction — Functions, classes, methods, imports, call sites
Cross-language detection — Java EXEC/CALL → SQL procedures. SQL column refs in SELECT/UPDATE/INSERT/DELETE
SQLite index — Local database, fast graph queries, incremental updates
Impact analysis — Walks the graph upstream, finds column dependents, calculates risk

The index rebuilds in seconds. No cloud. No API calls. Pure local static analysis.
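
The upstream walk in the last step is ordinary graph traversal. The sketch below is a simplified illustration in Java, not Jam's actual code: it assumes the index has already been loaded into an in-memory map from each symbol to its direct callers and readers, and `RefundController.post` is a made-up caller added to show a second hop.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: breadth-first walk over a reverse dependency graph.
class ImpactWalk {

    // callersOf maps a symbol to the symbols that directly call or read it.
    // Returns every symbol that transitively depends on the changed one.
    static Set<String> impacted(Map<String, List<String>> callersOf, String changed) {
        Set<String> seen = new LinkedHashSet<>(); // keeps discovery order, guards against cycles
        Deque<String> queue = new ArrayDeque<>();
        queue.add(changed);
        while (!queue.isEmpty()) {
            String current = queue.removeFirst();
            for (String caller : callersOf.getOrDefault(current, List.of())) {
                if (!caller.equals(changed) && seen.add(caller)) {
                    queue.addLast(caller);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> callersOf = Map.of(
                "updateBalance", List.of("PaymentService.processRefund", "BATCH_NIGHTLY_RECONCILE"),
                "PaymentService.processRefund", List.of("RefundController.post"));
        // Same graph in, same set out, every time.
        System.out.println(impacted(callersOf, "updateBalance"));
    }
}
```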

Try it

npm install -g @sunilp-org/jam-cli
jam trace <any-function-name>
jam trace <any-function-name> --impact

No API key needed for trace — it's pure static analysis. The AI features (ask, go, commit, review) auto-detect your provider.

978 tests. MIT licensed. Works everywhere Node runs.

GitHub | Website | npm | VSCode Extension