DEV Community: Ritwika Kancharla

Build shielded token mint, transfer, and burn flows in Compact

Ritwika Kancharla — Mon, 11 May 2026 20:28:52 +0000

Shielded tokens let Midnight applications move value while keeping sensitive
transaction details private. In this tutorial, you will build a small Compact
project that mints shielded value, transfers it, burns it, and verifies the
full lifecycle with a Vitest test suite.

The core idea is simple but important: a newly minted shielded coin is not yet a
committed ledger coin. Fresh coins use ShieldedCoinInfo and can be spent
immediately with sendImmediateShielded. Coins that already exist in the ledger
use QualifiedShieldedCoinInfo, which includes the Merkle tree index required
by sendShielded.

You will use that distinction to build three practical flows: minting to the
current contract, transferring committed value to a user, and burning both fresh
and committed shielded value. Along the way, you will also handle change outputs
and nonce derivation, two details that matter in real wallet and DApp code.

By the end of this tutorial, you will:

Create a Compact contract that mints shielded tokens to itself.
Use evolveNonce to derive mint nonces safely.
Transfer committed shielded coins with sendShielded.
Burn committed and freshly minted coins with shieldedBurnAddress.
Use sendImmediateShielded for coins created in the same transaction.
Write Vitest tests for minting, transfers, burns, change, nonce reuse, and the Merkle timing rule.

Prerequisites

You need:

Node.js 20 or newer.
npm.
The Midnight Compact toolchain.
Basic TypeScript knowledge.
Basic familiarity with Compact circuits and ledgers.

Install or update the Compact toolchain from the Midnight documentation, then
check that the command is available:

compact check

Expected result:

compact: Latest version available: ...

On Windows, run the Compact toolchain from WSL or make sure the Midnight
compact binary appears before C:\Windows\System32\compact.exe in your
PATH. Windows also has a system command named compact, and that command is
not the Midnight compiler.

What you will build

The finished demo project has this structure:

shielded-token-operations/
|-- package.json
|-- tsconfig.json
|-- src/
|   |-- shielded-token-lifecycle.compact
|   |-- witnesses.ts
|   `-- model/
|       `-- shielded-token-model.ts
`-- test/
    `-- shielded-token-lifecycle.test.ts

The Compact file contains the contract. The witness file manages local nonce
state. The TypeScript model gives you deterministic tests without needing a live
node. The Vitest file exercises the flows against that model and catches the
mistakes developers tend to make when they mix fresh and committed shielded
coins.

Part 0: Choose the right shielded API

Before writing code, map each operation to the correct standard library helper.
A common bug in shielded token code is choosing the right idea but the wrong
helper, for example calling sendShielded on a coin that was minted a moment ago.

Use mintShieldedToken when the contract creates a new shielded token. It
returns ShieldedCoinInfo, which describes a fresh output from the current
transaction.

Use sendImmediateShielded when that fresh output is spent in the same
transaction. This is the right tool for atomic flows such as mint-and-send or
mint-and-burn.

Use sendShielded when the coin already exists in the ledger. This requires
QualifiedShieldedCoinInfo, not plain ShieldedCoinInfo, because the committed
coin must include its Merkle tree position.

Use shieldedBurnAddress() when the send target should destroy the shielded
value. Burning is not a separate token primitive in this example; it is a send
to a special recipient.

Keep this decision table nearby while building:

Operation                      Input coin type              Helper
Mint new shielded value        none                         mintShieldedToken
Spend fresh minted value       ShieldedCoinInfo             sendImmediateShielded
Spend later committed value    QualifiedShieldedCoinInfo    sendShielded
Burn fresh minted value        ShieldedCoinInfo             sendImmediateShielded + shieldedBurnAddress
Burn committed value           QualifiedShieldedCoinInfo    sendShielded + shieldedBurnAddress

The tutorial code follows this table exactly. If you change the API shape later,
re-run the tests that check immediate sends and Merkle timing.
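
As a quick cross-check, the decision table can be written as a small lookup.
This is only a sketch in TypeScript: the CoinState and TokenOperation types and
the chooseHelper function are hypothetical and not part of the Compact standard
library. Only the helper names in the returned values come from the table.

```typescript
// Hypothetical names for illustration; not part of any Midnight package.
type CoinState = "fresh" | "committed";
type TokenOperation = "mint" | "spend" | "burn";

type HelperChoice = {
  readonly helper: "mintShieldedToken" | "sendImmediateShielded" | "sendShielded";
  readonly useBurnAddress: boolean;
};

// Mirrors the decision table: the coin state picks the send helper, and the
// operation decides whether the recipient is shieldedBurnAddress().
function chooseHelper(operation: TokenOperation, coin?: CoinState): HelperChoice {
  if (operation === "mint") {
    return { helper: "mintShieldedToken", useBurnAddress: false };
  }
  if (coin === undefined) {
    throw new Error(`${operation} needs an input coin`);
  }
  return {
    helper: coin === "fresh" ? "sendImmediateShielded" : "sendShielded",
    useBurnAddress: operation === "burn",
  };
}
```

For example, chooseHelper("burn", "fresh") selects sendImmediateShielded with
the burn address, which matches the fourth row of the table.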

Part 1: Create the package

Create a folder for the demo project:

mkdir shielded-token-operations
cd shielded-token-operations

Initialize the package:

npm init -y

Install the test and TypeScript dependencies:

npm install --save-dev typescript vitest @types/node

Update package.json with these scripts:

{
  "scripts": {
    "compact": "compact compile src/shielded-token-lifecycle.compact ./src/managed/shielded-token-lifecycle",
    "test": "vitest run",
    "typecheck": "tsc -p tsconfig.json --noEmit"
  }
}

Create tsconfig.json:

{
  "compilerOptions": {
    "target": "ES2022",
    "module": "NodeNext",
    "moduleResolution": "NodeNext",
    "strict": true,
    "esModuleInterop": true,
    "forceConsistentCasingInFileNames": true,
    "skipLibCheck": true,
    "types": ["node", "vitest"]
  },
  "include": ["src/**/*.ts", "test/**/*.ts"]
}

This setup lets you run fast local tests while still keeping a direct Compact
compile command for the smart contract.

Part 2: Write the Compact contract

Create the source folder:

mkdir -p src
touch src/shielded-token-lifecycle.compact

Start the contract with the language pragma, the standard library import, and
two public ledger fields:

pragma language_version >= 0.20;

import CompactStandardLibrary;

export ledger mintedOperations: Counter;
export ledger totalBurned: Uint<128>;

constructor() {
  totalBurned = 0;
}

mintedOperations is a simple counter used by the example. totalBurned
records the amount this contract has intentionally sent to the shielded burn
address.

Next, add a witness for local nonce seed management:

witness localNonceSeed(): Bytes<32>;

The witness value is private state supplied by the TypeScript layer. The
contract will combine it with a public nonce index using evolveNonce.

Part 3: Mint shielded tokens

Add a circuit that mints directly to the current contract:

export circuit mint_to_contract(
  domainSep: Bytes<32>,
  value: Uint<64>,
  nonce: Bytes<32>
): ShieldedCoinInfo {
  assert(disclose(value) > 0, "mint amount must be non-zero");

  const coin = mintShieldedToken(
    disclose(domainSep),
    disclose(value),
    disclose(nonce),
    right<ZswapCoinPublicKey, ContractAddress>(kernel.self())
  );

  mintedOperations.increment(1);
  return coin;
}

mintShieldedToken returns a ShieldedCoinInfo. That is a fresh output. It has
a nonce, color, and value, but it does not have a Merkle index yet.

The recipient is:

right<ZswapCoinPublicKey, ContractAddress>(kernel.self())

That means the new shielded coin belongs to the current contract. This is useful
when the contract should later spend the coin after it appears in the ledger.

Now add a second minting circuit that derives its nonce:

export circuit mint_with_local_nonce(
  domainSep: Bytes<32>,
  value: Uint<64>,
  nonceIndex: Uint<128>
): ShieldedCoinInfo {
  assert(disclose(value) > 0, "mint amount must be non-zero");

  const nonce = disclose(evolveNonce(disclose(nonceIndex), localNonceSeed()));
  const coin = mintShieldedToken(
    disclose(domainSep),
    disclose(value),
    nonce,
    right<ZswapCoinPublicKey, ContractAddress>(kernel.self())
  );

  mintedOperations.increment(1);
  return coin;
}

evolveNonce takes an index and a prior nonce or seed and returns a new 32-byte
nonce. The disclose(...) wrapper around the evolved nonce is intentional.
Compact treats witness-derived values as private by default, and the circuit
returns the minted coin, which carries that nonce. The wrapper declares that the
derived nonce may appear in the public-facing result. The raw witness seed is
never passed to disclose on its own.

Part 4: Transfer a committed shielded coin

Add a committed transfer circuit:

export circuit send_committed(
  input: QualifiedShieldedCoinInfo,
  recipient: ZswapCoinPublicKey,
  value: Uint<128>
): ShieldedSendResult {
  assert(disclose(value) > 0, "send amount must be non-zero");
  assert(disclose(input).value >= disclose(value), "send amount exceeds coin value");

  return sendShielded(
    disclose(input),
    left<ZswapCoinPublicKey, ContractAddress>(disclose(recipient)),
    disclose(value)
  );
}

The input type matters. sendShielded spends a QualifiedShieldedCoinInfo.
That type represents an existing shielded coin in the ledger. It includes the
Merkle tree position:

struct QualifiedShieldedCoinInfo {
  nonce: Bytes<32>;
  color: Bytes<32>;
  value: Uint<128>;
  mtIndex: Uint<64>;
}

The result type is ShieldedSendResult:

struct ShieldedSendResult {
  change: Maybe<ShieldedCoinInfo>;
  sent: ShieldedCoinInfo;
}

If you send the full input value, change is empty. If you send less than the
input value, change contains a new contract-owned shielded coin. Your
application must keep track of that change output, wait for it to be committed,
and later spend it as a qualified coin.

Part 5: Burn shielded tokens

Burning is a send to the special burn recipient returned by
shieldedBurnAddress().

Add a committed burn circuit:

export circuit burn_committed(
  input: QualifiedShieldedCoinInfo,
  value: Uint<128>
): ShieldedSendResult {
  assert(disclose(value) > 0, "burn amount must be non-zero");
  assert(disclose(input).value >= disclose(value), "burn amount exceeds coin value");

  const result = sendShielded(
    disclose(input),
    shieldedBurnAddress(),
    disclose(value)
  );

  totalBurned = (totalBurned + disclose(value)) as Uint<128>;
  return result;
}

This works like a committed transfer, except the recipient is the burn address.
The result can still include change. Burning 40 from a 90 value coin burns
40 and returns 50 as change.

Now add a fresh burn circuit:

export circuit burn_fresh(
  domainSep: Bytes<32>,
  mintValue: Uint<64>,
  mintNonce: Bytes<32>,
  burnValue: Uint<128>
): ShieldedSendResult {
  assert(disclose(mintValue) > 0, "mint amount must be non-zero");
  assert(disclose(burnValue) > 0, "burn amount must be non-zero");

  const coin = mintShieldedToken(
    disclose(domainSep),
    disclose(mintValue),
    disclose(mintNonce),
    right<ZswapCoinPublicKey, ContractAddress>(kernel.self())
  );

  assert(coin.value >= disclose(burnValue), "burn amount exceeds minted value");

  const result = sendImmediateShielded(
    coin,
    shieldedBurnAddress(),
    disclose(burnValue)
  );

  mintedOperations.increment(1);
  totalBurned = (totalBurned + disclose(burnValue)) as Uint<128>;
  return result;
}

Use sendImmediateShielded here because the coin was created in the same
transaction. It has not been committed yet, so it cannot be spent with
sendShielded.

Part 6: Add atomic mint and send

Atomic mint-and-send is the same fresh-coin pattern. Add this circuit:

export circuit mint_and_send(
  domainSep: Bytes<32>,
  mintValue: Uint<64>,
  mintNonce: Bytes<32>,
  recipient: ZswapCoinPublicKey,
  sendValue: Uint<128>
): ShieldedSendResult {
  assert(disclose(mintValue) > 0, "mint amount must be non-zero");
  assert(disclose(sendValue) > 0, "send amount must be non-zero");

  const coin = mintShieldedToken(
    disclose(domainSep),
    disclose(mintValue),
    disclose(mintNonce),
    right<ZswapCoinPublicKey, ContractAddress>(kernel.self())
  );

  assert(coin.value >= disclose(sendValue), "send amount exceeds minted value");

  const result = sendImmediateShielded(
    coin,
    left<ZswapCoinPublicKey, ContractAddress>(disclose(recipient)),
    disclose(sendValue)
  );

  mintedOperations.increment(1);
  return result;
}

This circuit mints to the contract and immediately sends part or all of the
fresh coin to a user public key. If mintValue is larger than sendValue, the
returned change belongs to the contract.

Part 7: Add the TypeScript witness

Create the witness file:

touch src/witnesses.ts

Add a private state type and a witness implementation:

import { createHash } from "node:crypto";

export type ShieldedTokenPrivateState = {
  readonly nonceSeed: Uint8Array;
  readonly nextNonceIndex: bigint;
};

type WitnessContext<PrivateState> = {
  readonly privateState: PrivateState;
};

export const createShieldedTokenPrivateState = (
  nonceSeed = hashBytes32("shielded-token:demo-seed"),
): ShieldedTokenPrivateState => ({
  nonceSeed,
  nextNonceIndex: 0n,
});

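export const witnesses = {
  // Returns [next private state, witness value]. The explicit tuple type keeps
  // TypeScript from widening the result to an array of a union type.
  localNonceSeed: (
    { privateState }: WitnessContext<ShieldedTokenPrivateState>,
  ): [ShieldedTokenPrivateState, Uint8Array] => [
    {
      nonceSeed: privateState.nonceSeed,
      nextNonceIndex: privateState.nextNonceIndex + 1n,
    },
    privateState.nonceSeed,
  ],
};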

export function hashBytes32(input: string): Uint8Array {
  return createHash("sha256").update(input).digest();
}

The witness returns the nonce seed to Compact and advances local private state.
In a production DApp, persist this private state. Do not reset it casually, or
you may reuse nonce material.
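
One way to catch accidental reuse early is to pass every nonce index through a
small guard before calling the mint circuit. The sketch below is an assumption
about how an application might do this: NonceIndexGuard and its claim method are
hypothetical names, and the guard only keeps its state in memory, so a real DApp
would persist the highest used index together with the private state.

```typescript
// Hypothetical guard; the names here are illustrative only.
class NonceIndexGuard {
  private highestUsed: bigint | null = null;

  // Call with the nonceIndex about to be passed to mint_with_local_nonce.
  // Rejects any index that is not strictly greater than the last one used.
  claim(index: bigint): bigint {
    if (this.highestUsed !== null && index <= this.highestUsed) {
      throw new Error(`nonce index ${index} was already used or is stale`);
    }
    this.highestUsed = index;
    return index;
  }
}
```

Requiring strictly increasing indices is stricter than necessary, but it keeps
the check to a single comparison and a single stored value.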

Part 8: Build a local test model

Create a deterministic model for tests:

mkdir -p src/model
touch src/model/shielded-token-model.ts

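The model mirrors the standard library types, with nonce and color held as plain
strings and an extra recipient field so tests can assert where each coin went: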

export type ShieldedCoinInfo = {
  readonly nonce: string;
  readonly color: string;
  readonly value: bigint;
  readonly recipient: Recipient;
};

export type QualifiedShieldedCoinInfo = ShieldedCoinInfo & {
  readonly mtIndex: bigint;
};

export type ShieldedSendResult = {
  readonly sent: ShieldedCoinInfo;
  readonly change: ShieldedCoinInfo | null;
};

Then model the two send paths:

export function sendShielded(
  input: QualifiedShieldedCoinInfo,
  recipient: Recipient,
  value: bigint | number,
): ShieldedSendResult {
  assertQualified(input);
  return splitCoin(input, recipient, value, "sendShielded");
}

export function sendImmediateShielded(
  input: ShieldedCoinInfo,
  target: Recipient,
  value: bigint | number,
): ShieldedSendResult {
  return splitCoin(input, target, value, "sendImmediateShielded");
}
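
Both functions delegate to splitCoin, whose body is not shown above. A minimal
sketch of what it might look like follows. It repeats the model types so that it
stands alone, and it assumes a hypothetical Recipient union and CONTRACT_SELF
value. The way the output nonces are labeled is also an assumption made for
readability, not how real shielded outputs derive their nonces.

```typescript
// Hypothetical recipient model; the tutorial does not show its definition.
type Recipient =
  | { readonly kind: "user"; readonly key: string }
  | { readonly kind: "contract"; readonly address: string }
  | { readonly kind: "burn" };

type ShieldedCoinInfo = {
  readonly nonce: string;
  readonly color: string;
  readonly value: bigint;
  readonly recipient: Recipient;
};

type ShieldedSendResult = {
  readonly sent: ShieldedCoinInfo;
  readonly change: ShieldedCoinInfo | null;
};

const CONTRACT_SELF: Recipient = { kind: "contract", address: "self" };

// Splits `input` into a sent output and, when the amount is smaller than the
// input value, a change output owned by the contract. Color is preserved.
function splitCoin(
  input: ShieldedCoinInfo,
  recipient: Recipient,
  value: bigint | number,
  helper: string,
): ShieldedSendResult {
  const amount = BigInt(value);
  if (amount <= 0n) throw new Error(`${helper}: amount must be non-zero`);
  if (amount > input.value) throw new Error(`${helper}: amount exceeds coin value`);

  // Assumption: output nonces are readable labels, not real evolved nonces.
  const sent: ShieldedCoinInfo = {
    nonce: `${input.nonce}:sent`,
    color: input.color,
    value: amount,
    recipient,
  };
  const remainder = input.value - amount;
  const change: ShieldedCoinInfo | null =
    remainder === 0n
      ? null
      : {
          nonce: `${input.nonce}:change`,
          color: input.color,
          value: remainder,
          recipient: CONTRACT_SELF,
        };

  return { sent, change };
}
```

With this shape, splitting 40 from a coin worth 90 yields a sent output of 40
and a contract-owned change output of 50, and splitting the full 90 yields no
change.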

The important test helper is assertQualified. It rejects fresh coins that do
not have an mtIndex:

function assertQualified(
  input: ShieldedCoinInfo | QualifiedShieldedCoinInfo,
): asserts input is QualifiedShieldedCoinInfo {
  if (!("mtIndex" in input)) {
    throw new Error("sendShielded requires a committed coin with an mtIndex");
  }
}

This is how the test suite makes the Merkle timing rule visible without running
a full network.
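
The tests in the next part call harness.commit to turn a fresh coin into a
qualified one. Its body is not shown in this tutorial. One plausible sketch,
assuming the model simply hands out the next free Merkle index, is below. The
ModelCommitmentTree class and the trimmed coin types are hypothetical stand-ins.

```typescript
// Trimmed coin shapes so the sketch is self-contained.
type FreshCoin = { readonly nonce: string; readonly color: string; readonly value: bigint };
type QualifiedCoin = FreshCoin & { readonly mtIndex: bigint };

// Hypothetical stand-in for the ledger's commitment tree: it only hands out
// sequential indices and remembers which nonces were already committed.
class ModelCommitmentTree {
  private nextIndex = 0n;
  private readonly committed = new Set<string>();

  commit(coin: FreshCoin): QualifiedCoin {
    if (this.committed.has(coin.nonce)) {
      throw new Error("coin with this nonce is already committed");
    }
    this.committed.add(coin.nonce);
    const qualified: QualifiedCoin = { ...coin, mtIndex: this.nextIndex };
    this.nextIndex += 1n;
    return qualified;
  }
}
```

Because only commit adds the mtIndex field, a coin that skipped this step still
fails the assertQualified check.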

Part 9: Write the Vitest suite

Create the test file:

mkdir -p test
touch test/shielded-token-lifecycle.test.ts

Start with a mint test:

it("mints a shielded coin to the contract with the expected color", () => {
  const harness = new ShieldedTokenHarness(DOMAIN);

  const coin = harness.mintToContract(1000n, NONCE);

  expect(coin.value).toBe(1000n);
  expect(coin.color).toEqual(tokenType(DOMAIN));
  expect(coin.recipient).toEqual(CONTRACT_SELF);
});

Add a nonce test:

it("derives deterministic unique nonces with evolveNonce", () => {
  const first = evolveNonce(0n, SEED);
  const second = evolveNonce(1n, SEED);

  expect(first).not.toEqual(second);
  expect(evolveNonce(0n, SEED)).toEqual(first);
});

Add the Merkle timing test:

it("requires a committed Merkle position before sendShielded can spend a coin", () => {
  const harness = new ShieldedTokenHarness(DOMAIN);
  const fresh = harness.mintToContract(25n, NONCE);

  expect(() =>
    sendShielded(fresh as unknown as QualifiedShieldedCoinInfo, ALICE, 5n),
  ).toThrow(/mtIndex/);
});

Add committed transfer tests for partial and exact sends:

const qualified = harness.commit(harness.mintToContract(100n, NONCE));
const result = harness.sendCommitted(qualified, ALICE, 35n);

expect(result.sent.value).toBe(35n);
expect(result.change?.value).toBe(65n);

Add burn tests for both paths:

const committed = harness.commit(harness.mintToContract(90n, NONCE));
const committedBurn = harness.burnCommitted(committed, 40n);
expect(committedBurn.sent.recipient).toEqual(BURN_ADDRESS);
expect(committedBurn.change?.value).toBe(50n);

const freshBurn = harness.burnFresh(75n, NONCE, 75n);
expect(freshBurn.sent.recipient).toEqual(BURN_ADDRESS);
expect(freshBurn.change).toBeNull();

Finally, test atomic mint-and-send:

const result = harness.mintAndSend(120n, NONCE, ALICE, 45n);

expect(result.sent.value).toBe(45n);
expect(result.sent.recipient).toEqual(ALICE);
expect(result.change?.value).toBe(75n);

The completed suite should cover normal flows and edge cases: over-spend,
over-burn, zero-value operations, change reuse, fresh immediate sends, and color
preservation.

Add one more test for change reuse. This test proves that change from a partial
send is not just bookkeeping text; it is the next coin your application must
commit and track:

it("allows change from one transaction to be committed and spent later", () => {
  const harness = new ShieldedTokenHarness(DOMAIN);
  const qualified = harness.commit(harness.mintToContract(100n, NONCE));
  const first = harness.sendCommitted(qualified, ALICE, 30n);

  const change = harness.commit(first.change!);
  const second = harness.sendCommitted(change, BOB, 20n);

  expect(second.sent.value).toBe(20n);
  expect(second.change?.value).toBe(50n);
});

This is the test that catches a common wallet integration bug. If the application
forgets to store first.change, the user may believe the remaining value is
still available, but the app will not know which output to qualify and spend
later.
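
One way to avoid losing change is to pass every send result through a single
function that records the change output as pending until it is committed. The
sketch below uses hypothetical names (PendingChangeStore, record, pendingValue,
takePending) and an in-memory array. A real DApp would persist this alongside
its private state.

```typescript
// Trimmed shapes so the sketch is self-contained.
type ChangeCoin = { readonly nonce: string; readonly color: string; readonly value: bigint };
type SendOutcome = { readonly sent: ChangeCoin; readonly change: ChangeCoin | null };

// Hypothetical in-memory store for change outputs awaiting commitment.
class PendingChangeStore {
  private readonly pending: ChangeCoin[] = [];

  // Call with every send result, even when change is null.
  record(outcome: SendOutcome): void {
    if (outcome.change !== null) {
      this.pending.push(outcome.change);
    }
  }

  // Total value still owned by the contract but not yet spendable.
  pendingValue(): bigint {
    return this.pending.reduce<bigint>((sum, coin) => sum + coin.value, 0n);
  }

  // Hands the pending coins to the caller (for example, to qualify them once
  // their Merkle positions are known) and clears the store.
  takePending(): ChangeCoin[] {
    return this.pending.splice(0, this.pending.length);
  }
}
```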

Part 10: Run the checks

Run TypeScript:

npm run typecheck

Expected output:

tsc -p tsconfig.json --noEmit

Run the tests:

npm test

Expected output:

Test Files  1 passed (1)
Tests  18 passed (18)

Compile the Compact smart contract:

npm run compact

Expected result:

Compilation successful

The compile step creates generated contract files under src/managed. Those
generated files are build output and do not need to be committed.

For final review, record the three verification results together:

Compact compiler: passed
TypeScript: passed
Vitest: 18 tests passed

Do not treat the Vitest result as a substitute for Compact compilation. The
tests check the lifecycle model and its edge cases. The compiler checks the
Compact syntax, types, witness disclosure, and standard library calls.

Troubleshooting

`compact` runs the wrong command on Windows

Problem:

Listing ... New files added to this directory will not be compressed.

That is the Windows filesystem compression utility, not the Midnight compiler.

Fix:

Run the command from WSL, or
Put the Midnight Compact toolchain earlier in PATH than C:\Windows\System32.

The compiler complains about witness disclosure

Problem:

potential witness-value disclosure must be declared but is not

Fix:

Wrap the evolved nonce in disclose(...):

const nonce = disclose(evolveNonce(disclose(nonceIndex), localNonceSeed()));

This declares the intentional disclosure of the derived nonce value, not the raw
witness seed.

A later transfer fails because the coin has no `mtIndex`

Problem:

You are trying to call sendShielded with a fresh ShieldedCoinInfo.

Fix:

Use sendImmediateShielded if the coin was created in the same transaction. If
the coin was created in a previous transaction, wait until it is committed and
spend it as a QualifiedShieldedCoinInfo with the real Merkle index.

Change disappears from your app state

Problem:

A partial send or burn returns change, but the app does not store it.

Fix:

Always inspect ShieldedSendResult.change. If it is present, persist it and
track its later Merkle position. That change is the remaining balance.

Conclusion

You have built a complete shielded token lifecycle example:

mintShieldedToken creates fresh shielded coins.
evolveNonce gives you deterministic nonce derivation.
sendShielded spends committed coins with mtIndex.
sendImmediateShielded spends coins created in the same transaction.
shieldedBurnAddress turns a shielded send into a burn.
ShieldedSendResult.change carries the unspent remainder.
Vitest tests prove the normal paths and the common failure cases.

The most important lesson is the fresh-versus-committed distinction. If you keep
that boundary clear, shielded mint, transfer, and burn operations become much
easier to reason about.

Related resources:

Midnight documentation: https://docs.midnight.network/
Compact language reference: https://docs.midnight.network/compact/
Compact standard library exports: https://docs.midnight.network/compact/standard-library/exports

Stop Building Chatbots. Google Just Showed Us What Comes After.

Ritwika Kancharla — Fri, 24 Apr 2026 19:46:58 +0000

This is a submission for the Google Cloud NEXT Writing Challenge

My Take

Everyone left Cloud NEXT '26 talking about TPU 8t, Gemini 3.1 Pro, or the Apple-Siri partnership. Those are headline-grabbers, sure. But the announcement that will quietly reshape how we architect software got maybe 10% of the attention it deserved:

Google shipped a complete, production-grade stack for multi-agent systems — from protocol (A2A v1.0) to framework (ADK v1.0) to runtime (Gemini Enterprise Agent Platform) to orchestration primitives (SequentialAgent, ParallelAgent, LoopAgent).

This isn't a research preview. 150 organizations are running A2A in production. ADK has stable releases in four languages. The "era of the pilot is over," as Thomas Kurian put it on stage — and for once, the infrastructure actually backs up the keynote rhetoric.

Here's why this matters more than any model upgrade, and what it concretely looks like to build with it.

The Problem Multi-Agent Systems Actually Solve

Let me frame this with a real scenario. Say you're building an incident triage pipeline — turning raw customer support tickets into actionable engineering tasks. Here's what the single-agent approach looks like vs. multi-agent:

Dimension	Single-Agent (One Prompt)	Multi-Agent (ADK-based)
Architecture	One giant prompt with few-shot examples	5 specialized agents in a SequentialAgent pipeline
Failure mode	Entire output is garbage if one step fails	Only the failing agent needs a fix; others keep working
Debugging	Re-read 2,000-token prompt, guess what went wrong	Agent interaction traces show exactly where the pipeline broke
Scaling	Scale the whole thing or nothing	Scale the clustering agent independently during incident spikes
Testing	End-to-end only; no unit tests for substeps	Each agent has its own eval suite with trajectory checks
Latency	Sequential by nature; everything waits on one call	ParallelAgent fetches data concurrently; LoopAgent iterates until quality passes

The agents in this pipeline would be:

Classifier Agent — categorizes tickets, tags severity (P0–P3)
Clustering Agent — groups similar issues using embeddings, surfaces patterns
Root Cause Analyzer — maps issue clusters to specific services/code paths
Action Generator — drafts Jira tickets, PR descriptions, runbook links
Reporter Agent — produces a clean incident summary for the on-call channel

Each agent is simpler to prompt, simpler to test, and simpler to replace. The system is what's intelligent — not any individual component.

What Google Actually Shipped: The Full Stack

Here's the part most coverage glossed over. Google didn't announce a feature. They shipped an integrated stack with four layers, each solving a distinct problem:

Layer 1: A2A Protocol (Agent-to-Agent Communication)

What it is: An open protocol (now at v1.0, governed by the Linux Foundation) that lets agents discover and communicate with each other across vendors and platforms.

Why it matters: Before A2A, if your Salesforce agent needed data from a ServiceNow agent, you wrote custom glue code. Now, agents publish Agent Cards (think: OpenAPI specs for agents) with cryptographic signatures, and any A2A-compliant agent can discover and call them.

Key v1.0 features:

Signed Agent Cards for domain verification (prevents card forgery)
Multi-tenancy (one endpoint, many agents — critical for SaaS)
Multi-protocol bindings (JSON-RPC and gRPC)
SDKs in Python, JavaScript, Java, Go, and .NET

Production adoption: 150 organizations running it in production, including deployments on Azure AI Foundry and Amazon Bedrock AgentCore. This isn't a Google-only play.

A2A handles agent↔agent communication. Anthropic's MCP handles agent↔tool communication. They're complementary, not competing — and Google adopted MCP across its own services in December 2025.

Layer 2: ADK (Agent Development Kit)

What it is: An open-source framework (v1.0 stable, Apache 2.0) for building multi-agent applications. Available in Python, TypeScript, Go, and Java.

The key abstraction — three agent types:

Agent Type	Role	Example
LLM Agents	Reasoning, planning, decisions	"Analyze this ticket and classify severity"
Workflow Agents	Deterministic orchestration	SequentialAgent, ParallelAgent, LoopAgent
Custom Agents	Arbitrary logic via BaseAgent	Rate limiting, auth checks, custom routing

What makes this different from LangChain/CrewAI: ADK's workflow agents are deterministic. They don't use an LLM to decide execution order — they follow predefined patterns (sequential, parallel, loop). The LLM agents handle reasoning; the workflow agents handle control flow. This separation is what makes the system predictable enough for production.

Layer 3: Gemini Enterprise Agent Platform (Runtime)

What it is: The renamed/consolidated Vertex AI — now a full-stack platform for deploying, governing, and scaling agents. Includes agent registries, skill registries, tool registries, and universal context management.

Deployment options:

Agent Engine (fully managed, one-command deploy)
Cloud Run (containerized, serverless)
GKE (full Kubernetes control — 300 sandboxes/second, sub-second cold starts)

Layer 4: Observability & Governance

Agent interaction traces (not just stack traces), Model Armor for prompt injection protection, IAM-based access control, and audit logging. This is the "Kubernetes for intelligence" layer — the orchestration infrastructure that makes multi-agent systems operable.

The Architectural Shift Developers Should Care About

Here's the mental model change. We've gone through several "unit of software design" transitions:

Era	Unit	Coordination	Failure Mode
Monolith	Function	Procedure calls	Whole app crashes
Microservices	Service	APIs + message queues	Circuit breakers, retries
Agentic	Agent	A2A + MCP	Agent interaction traces, fallback agents

This isn't a forced analogy. Look at the primitives Google shipped:

Agent Cards are the agentic equivalent of service discovery
A2A is the equivalent of inter-service communication protocols
ADK workflow agents are the equivalent of orchestration layers (think: Temporal, but for AI)
Agent Engine is the equivalent of managed Kubernetes

The teams that internalized microservices early (Netflix, Uber, Airbnb) had a massive architectural advantage for a decade. The same dynamic is starting now with agent architectures.

What I Think Is Underrated (My Actual Opinion)

The LoopAgent pattern. Most multi-agent demos show linear pipelines: Agent A → Agent B → Agent C → done. That's just a fancy chain-of-prompts.

The LoopAgent is where things get genuinely interesting. In ADK, you can build a writer's room pattern: a Researcher agent generates content, a Critic agent evaluates it, and the LoopAgent keeps cycling until the Critic passes the output. This is iterative refinement without human intervention — and it's deterministic at the orchestration level while being creative at the agent level.

From the ADK Codelab: a movie pitch generator uses exactly this pattern — a researcher, screenwriter, and critic agent collaborate in a loop until quality criteria are met. This is the kind of architecture that produces reliable output from unreliable components, which is the whole game in production AI.
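
To make the control flow concrete, here is a framework-agnostic sketch of the loop pattern in TypeScript. It does not use ADK's API: the DraftWorker and DraftCritic function types, runRefinementLoop, and the maxIterations cap are all hypothetical, and the agents are plain functions standing in for LLM calls.

```typescript
// Hypothetical agent shapes; plain functions stand in for LLM-backed agents.
type DraftWorker = (draft: string, feedback: string | null) => string;
type DraftCritic = (draft: string) => { approved: boolean; feedback: string };

// Deterministic orchestration: the loop itself makes no model calls and
// always stops after maxIterations, even if the critic never approves.
function runRefinementLoop(
  initial: string,
  worker: DraftWorker,
  critic: DraftCritic,
  maxIterations: number,
): { draft: string; iterations: number; approved: boolean } {
  let draft = initial;
  let feedback: string | null = null;
  for (let i = 1; i <= maxIterations; i++) {
    draft = worker(draft, feedback);
    const review = critic(draft);
    if (review.approved) {
      return { draft, iterations: i, approved: true };
    }
    feedback = review.feedback;
  }
  return { draft, iterations: maxIterations, approved: false };
}
```

The iteration cap is the design choice that matters here: the orchestration stays predictable because termination never depends on the critic's judgment alone.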

The second underrated thing: A2A's cross-vendor interoperability. The demo showed a Salesforce Agentforce agent handing off to a Google Vertex agent, which queries a ServiceNow agent for IT asset data — all through A2A, with none of the three systems needing to understand each other's internals. If this pattern holds, we're looking at an agent ecosystem that works like the web: open, interoperable, decentralized.

Where This Actually Leads

Most companies are still building:

Chatbots (single-turn Q&A)
Copilots (human-in-the-loop for every action)
"AI wrappers" (thin UIs over API calls)

The multi-agent stack Google shipped enables something fundamentally different: autonomous workflows that compose, self-evaluate, and coordinate across organizational boundaries.

Real examples already in production:

Citigroup's Citi Sky — a Gemini-powered wealth advisor
Macy's Ask Macy's — a customer-facing retail agent
Merck — a billion-dollar Google Cloud agentic AI deal
Google's own SOC — triage agents processing 5M+ alerts, reducing 30-minute manual analysis to 60 seconds

This isn't "AI as a feature." It's AI as infrastructure. And the teams that start designing agent architectures now — with proper orchestration, observability, and composability — will have the same structural advantage that early microservices adopters had in 2014.

TL;DR

Cloud NEXT '26's most important announcement wasn't a model — it was a complete multi-agent infrastructure stack: A2A protocol + ADK framework + Agent Platform runtime + governance layer
A2A v1.0 is in production at 150+ organizations, governed by the Linux Foundation, with SDKs in 5 languages
ADK separates reasoning (LLM agents) from orchestration (workflow agents) — this is the key insight that makes multi-agent systems production-viable
The LoopAgent pattern (iterative agent refinement) is the most underrated primitive in the stack
We're at the "microservices moment" for AI architecture — the teams that grok this early will build the next generation of software

Useful Links

Resource	URL
A2A Protocol Docs	a2a-protocol.org
ADK Documentation	google.github.io/adk-docs
ADK Multi-Agent Codelab	codelabs.developers.google.com
A2A Developer Blog Post	developers.googleblog.com
ADK Multi-Agent Blog	cloud.google.com
Cloud NEXT '26 Announcements	blog.google

A/B Testing LLM Systems

Ritwika Kancharla — Tue, 03 Mar 2026 20:45:36 +0000

When Your New Model "Looks Better" but the Metrics Disagree

You swapped in a new embedding model. Responses feel sharper. Your team is excited. You ship it.

Two weeks later, task completion is down 8%. You have no idea why, and no way to trace it back to the change.

This is the most common way LLM improvements go wrong. The new version looks better in demos, passes the vibe check, and fails silently in production. A/B testing is how you stop guessing and start knowing.

Why LLM A/B Testing Is Harder Than Normal A/B Testing

In a standard web A/B test, you change a button color and measure clicks. The metric is immediate, unambiguous, and causally close to the change.

LLM systems have three properties that make this harder:

Evaluation lag. Whether a response was actually helpful often isn't clear until the user does (or doesn't) complete their task — which might be minutes or sessions later.

Multi-component pipelines. Changing the embedding model affects retrieval quality, which affects generation quality, which affects user behavior. The signal is distributed across the whole pipeline, not just one component.

High variance outputs. The same query can produce meaningfully different responses across runs, which means you need more samples to detect real signal over noise.

None of these are insurmountable. They just mean you need to be more deliberate about experimental design than you would be for a UI test.

Part 1: What You're Actually Testing

Before writing any code, be precise about the hypothesis. LLM A/B tests fall into a few categories:

Change Type	Example	Primary Metric
Embedding model	`text-embedding-3-small` → `text-embedding-3-large`	Retrieval MRR, NDCG
Chunk size / strategy	500 chars → 1000 chars with overlap	Faithfulness, relevance
Reranker	No reranking → cross-encoder reranking	Precision@5
Generation model	GPT-4o-mini → GPT-4o	Faithfulness, task completion
Prompt change	Added chain-of-thought instruction	Citation accuracy, response quality
Temperature	0.4 → 0.2 for factual queries	Hallucination rate

Define your primary metric before you run the test. If you measure 12 things and declare victory on whichever one improved, you're doing p-hacking, not science.
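To see why picking the winner after the fact is risky, here is a minimal sketch of the arithmetic. It assumes 12 metrics that are independent and each tested at alpha = 0.05; real metrics are usually correlated, so treat the numbers as a rough guide. The function names are illustrative, not part of any library.

```python
def family_wise_error_rate(num_metrics: int, alpha: float = 0.05) -> float:
    """Chance that at least one of several independent null metrics looks significant."""
    return 1 - (1 - alpha) ** num_metrics


def bonferroni_alpha(num_metrics: int, alpha: float = 0.05) -> float:
    """Per-metric threshold that keeps the overall false positive rate near alpha."""
    return alpha / num_metrics


fwer = family_wise_error_rate(12)
print(f"Chance of at least one false positive across 12 metrics: {fwer:.0%}")
# → Chance of at least one false positive across 12 metrics: 46%
print(f"Bonferroni-corrected per-metric alpha: {bonferroni_alpha(12):.4f}")
# → Bonferroni-corrected per-metric alpha: 0.0042
```

With one pre-registered primary metric, the false positive rate stays at the 5% you chose. Secondary metrics can still be reported, ideally with a corrected threshold.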

Part 2: Traffic Splitting

The infrastructure for LLM A/B testing is simpler than most people expect. You need a router that assigns users to variants consistently and logs which variant served each request.

import hashlib
from enum import Enum

class Variant(str, Enum):
    CONTROL = "control"
    TREATMENT = "treatment"

def assign_variant(user_id: str, experiment_id: str, treatment_pct: float = 0.10) -> Variant:
    """
    Deterministic assignment — same user always gets same variant.
    Hash-based so no state required.
    """
    key = f"{experiment_id}:{user_id}"
    hash_val = int(hashlib.md5(key.encode()).hexdigest(), 16)
    bucket = (hash_val % 1000) / 1000  # 0.000 to 0.999

    return Variant.TREATMENT if bucket < treatment_pct else Variant.CONTROL

Deterministic assignment matters. If the same user sees control on Monday and treatment on Thursday, their behavior becomes uninterpretable. Hash-based assignment is stateless and consistent across restarts.

Start with 10% treatment traffic. You can ramp up once you've verified nothing is obviously broken.
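One useful property of the bucket comparison is that ramping up only ever adds users to treatment. The sketch below re-implements the same hashing in a standalone helper (bucket_for and in_treatment are illustrative names, and the user IDs and experiment name are made up) and checks both the observed split and the ramp-up behavior.

```python
import hashlib


def bucket_for(user_id: str, experiment_id: str) -> float:
    """Map a user to a stable bucket in [0, 1) using the same MD5 scheme as above."""
    key = f"{experiment_id}:{user_id}"
    return (int(hashlib.md5(key.encode()).hexdigest(), 16) % 1000) / 1000


def in_treatment(user_id: str, experiment_id: str, treatment_pct: float) -> bool:
    return bucket_for(user_id, experiment_id) < treatment_pct


# Hypothetical user IDs and experiment name, for illustration only
users = [f"user_{i}" for i in range(20_000)]
experiment = "embedding-v2"

at_10 = {u for u in users if in_treatment(u, experiment, 0.10)}
at_50 = {u for u in users if in_treatment(u, experiment, 0.50)}

print(f"Observed treatment share at 10%: {len(at_10) / len(users):.3f}")
print(f"Everyone from the 10% stage is still in treatment at 50%: {at_10 <= at_50}")
```

Because a user's bucket never changes, raising treatment_pct keeps early treatment users where they were, so their history stays interpretable.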

class ExperimentRouter:
    def __init__(self, experiment_id: str, control_system, treatment_system):
        self.experiment_id = experiment_id
        self.control = control_system
        self.treatment = treatment_system

    def route(self, user_id: str, query: str) -> dict:
        variant = assign_variant(user_id, self.experiment_id)

        system = self.treatment if variant == Variant.TREATMENT else self.control
        result = system.query(query)

        # Tag every response with its variant for analysis
        result["experiment_id"] = self.experiment_id
        result["variant"] = variant
        result["user_id"] = user_id

        return result

Part 3: What to Measure

Automated Metrics (Available Immediately)

These you can compute on every request:

from dataclasses import dataclass
from typing import Optional

@dataclass
class RequestMetrics:
    experiment_id: str
    variant: str
    user_id: str
    query: str

    # Retrieval (None when no golden labels exist for the query)
    mrr: Optional[float]
    ndcg_5: Optional[float]

    # Generation
    faithfulness: float
    citation_accuracy: float
    response_length: int

    # Latency
    total_latency_ms: float
    ttft_ms: float          # Time to first token

    # Reliability
    guardrail_passed: bool
    error: bool

def collect_metrics(result: dict, golden_labels: Optional[dict] = None) -> RequestMetrics:
    # Retrieval metrics need ground truth, so they stay None for unlabeled queries
    relevant_ids = golden_labels.get("relevant_ids", []) if golden_labels else None
    has_labels = relevant_ids is not None

    return RequestMetrics(
        experiment_id=result["experiment_id"],
        variant=result["variant"],
        user_id=result["user_id"],
        query=result["query"],
        mrr=mean_reciprocal_rank(result["retrieved_ids"], relevant_ids) if has_labels else None,
        ndcg_5=ndcg_at_k(result["retrieved_ids"], relevant_ids, k=5) if has_labels else None,
        faithfulness=faithfulness_score(result["response"], result["sources"]),
        citation_accuracy=citation_accuracy(result["response"], result["sources"])["accuracy"],
        response_length=len(result["response"]),
        total_latency_ms=result["latency_ms"],
        ttft_ms=result["ttft_ms"],
        guardrail_passed=result["guardrail_passed"],
        error=result.get("error", False)
    )

Behavioral Metrics (Require Follow-Through)

These are harder to collect but closer to what actually matters:

from datetime import datetime, timezone

class BehaviorTracker:
    """Track what users do after receiving a response."""

    def log_followup(self, user_id: str, experiment_id: str, event: str):
        """
        Events worth tracking:
        - "thumbs_up" / "thumbs_down"
        - "clicked_source"
        - "copied_response"
        - "asked_followup"       # Might mean confused or engaged
        - "task_completed"       # Best signal, hardest to measure
        - "session_abandoned"    # Bad signal
        """
        # log() and get_last_variant() stand in for your own logging and
        # assignment-lookup helpers
        log({
            "user_id": user_id,
            "experiment_id": experiment_id,
            "variant": get_last_variant(user_id, experiment_id),
            "event": event,
            "timestamp": datetime.now(timezone.utc).isoformat()
        })

Explicit feedback (thumbs up/down) has low response rates but high signal. Implicit signals (follow-up questions, session length, task completion) have high volume but require careful interpretation. Collect both.

Part 4: Statistical Analysis

This is where most teams go wrong. They run an experiment for a week, eyeball the numbers, and declare a winner. Here's how to do it properly.

Sample Size Calculation

Calculate required sample size before you start, not after you see the results:

from scipy import stats
import numpy as np

def required_sample_size(
    baseline_mean: float,
    minimum_detectable_effect: float,  # Smallest improvement worth caring about
    alpha: float = 0.05,               # False positive rate
    power: float = 0.80                # Probability of detecting a real effect
) -> int:
    """
    How many samples per variant do you need?
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # Two-tailed
    z_beta  = stats.norm.ppf(power)

    # Normal-approximation formula for comparing two proportions (simplified).
    # It treats the metric as a rate with variance p * (1 - p); for a continuous
    # score, substitute the variance observed in your baseline data.
    n = (2 * ((z_alpha + z_beta) ** 2) * baseline_mean * (1 - baseline_mean)) / (minimum_detectable_effect ** 2)

    return int(np.ceil(n))

# Example: faithfulness baseline 0.82, want to detect an absolute change of 0.03
n = required_sample_size(
    baseline_mean=0.82,
    minimum_detectable_effect=0.03
)
print(f"Need {n} samples per variant ({n * 2} total)")
# → Need 2575 samples per variant (5150 total)

If you'd need 50,000 samples to detect a 0.5% improvement, either increase traffic to the experiment or reconsider whether 0.5% is worth detecting.
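It can help to turn the question around: given the traffic you can realistically collect, what is the smallest change you could detect? The sketch below inverts the same normal-approximation formula using only the standard library. The function name minimum_detectable_effect is an illustrative choice, and the variance assumption (p * (1 - p)) is the same simplification as above.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(
    baseline_mean: float,
    samples_per_variant: int,
    alpha: float = 0.05,
    power: float = 0.80,
) -> float:
    """Smallest absolute change detectable with the given samples per variant."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # Two-tailed
    z_beta = NormalDist().inv_cdf(power)
    variance = baseline_mean * (1 - baseline_mean)
    return (z_alpha + z_beta) * sqrt(2 * variance / samples_per_variant)


for n in (500, 2_000, 10_000, 50_000):
    mde = minimum_detectable_effect(baseline_mean=0.82, samples_per_variant=n)
    print(f"{n:>6} samples per variant -> detectable change of about {mde:.4f}")
```

Detectable effect shrinks with the square root of the sample size, so halving the effect you want to catch costs roughly four times the traffic.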

Significance Testing

def analyze_experiment(control_metrics: list, treatment_metrics: list) -> dict:
    # Cast to float so boolean metrics (pass/fail flags) are handled as rates
    control_arr   = np.array(control_metrics, dtype=float)
    treatment_arr = np.array(treatment_metrics, dtype=float)

    # Two-sided Welch's t-test (does not assume equal variances)
    t_stat, p_value = stats.ttest_ind(treatment_arr, control_arr, equal_var=False)

    # Effect size (Cohen's d), using sample standard deviations
    pooled_std = np.sqrt(
        (control_arr.std(ddof=1) ** 2 + treatment_arr.std(ddof=1) ** 2) / 2
    )
    cohens_d = (treatment_arr.mean() - control_arr.mean()) / pooled_std

    # Confidence interval on the difference
    diff = treatment_arr.mean() - control_arr.mean()
    se   = np.sqrt(control_arr.var(ddof=1) / len(control_arr) +
                   treatment_arr.var(ddof=1) / len(treatment_arr))
    ci   = stats.norm.interval(0.95, loc=diff, scale=se)

    return {
        "control_mean":   control_arr.mean(),
        "treatment_mean": treatment_arr.mean(),
        "absolute_change": diff,
        "relative_change": diff / control_arr.mean(),
        "p_value":   p_value,
        "significant": p_value < 0.05,
        "cohens_d":  cohens_d,
        "ci_95":     ci,
        "practical_significance": abs(cohens_d) > 0.2  # Small effect threshold
    }

Statistical significance and practical significance are different things. A p-value of 0.001 tells you that a difference this large would be very unlikely if the two variants were truly equivalent. Cohen's d tells you whether it's big enough to matter. You want both.
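A quick worked example shows how the two can diverge. The sketch below uses made-up summary statistics and a plain z-test from the standard library (a simplification that assumes equal standard deviation and equal sample size in both arms), rather than the t-test used in the analysis function.

```python
from math import sqrt
from statistics import NormalDist


def z_test_from_summary(mean_a: float, mean_b: float, std: float, n_per_arm: int):
    """Two-sided z-test p-value and Cohen's d from summary statistics."""
    diff = mean_b - mean_a
    standard_error = std * sqrt(2 / n_per_arm)
    z = diff / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    cohens_d = diff / std
    return p_value, cohens_d


# Made-up numbers: a tiny lift measured on a very large sample
p_value, d = z_test_from_summary(mean_a=0.820, mean_b=0.823, std=0.15, n_per_arm=100_000)
print(f"p-value: {p_value:.2g}, Cohen's d: {d:.3f}")
print("statistically significant:", p_value < 0.05)
print("practically significant:", abs(d) > 0.2)
```

With enough traffic, a lift of 0.003 clears the significance bar easily while the effect size stays far below the 0.2 threshold. The result is unlikely to be noise, but it is also too small to justify a rollout on its own.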

The Full Analysis Report

def experiment_report(experiment_id: str, results_df) -> dict:
    control   = results_df[results_df["variant"] == "control"]
    treatment = results_df[results_df["variant"] == "treatment"]

    metrics_to_test = [
        "faithfulness",
        "citation_accuracy",
        "mrr",
        "total_latency_ms",
        "guardrail_passed"  # Column name from RequestMetrics; its mean is the pass rate
    ]

    report = {
        "experiment_id": experiment_id,
        "sample_sizes": {
            "control": len(control),
            "treatment": len(treatment)
        },
        "metrics": {}
    }

    for metric in metrics_to_test:
        if metric in results_df.columns:
            report["metrics"][metric] = analyze_experiment(
                control[metric].dropna().tolist(),
                treatment[metric].dropna().tolist()
            )

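    # Overall recommendation. Latency is lower-is-better, so flip its sign
    # before classifying a change as an improvement or a regression.
    lower_is_better = {"total_latency_ms"}

    def _direction(metric: str, result: dict) -> float:
        change = result["relative_change"]
        return -change if metric in lower_is_better else change

    significant_improvements = [
        m for m, r in report["metrics"].items()
        if r["significant"] and _direction(m, r) > 0 and r["practical_significance"]
    ]
    significant_regressions = [
        m for m, r in report["metrics"].items()
        if r["significant"] and _direction(m, r) < 0 and r["practical_significance"]
    ]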

    report["recommendation"] = _make_recommendation(
        significant_improvements,
        significant_regressions
    )

    return report

def _make_recommendation(improvements: list, regressions: list) -> str:
    if regressions:
        return f"DO NOT SHIP — regressions detected in: {', '.join(regressions)}"
    if improvements:
        return f"SHIP — significant improvements in: {', '.join(improvements)}"
    return "INCONCLUSIVE — no significant changes detected. Extend experiment or increase traffic."

Part 5: Common Mistakes

Stopping early. You ran the experiment for 3 days, faithfulness is up 4%, p=0.04. You ship it. This is p-hacking. Decide your stopping criteria — sample size or duration — before you start, and don't peek at results until you hit it.
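A small simulation makes the cost of peeking concrete. The sketch below runs A/A tests, where both arms draw from the same distribution, so every "significant" result is a false positive. The parameters (400 experiments, 500 samples per arm, 10 looks) are arbitrary choices for illustration, and the z-test assumes unit variance to keep the code short.

```python
import random
from math import sqrt
from statistics import NormalDist


def is_significant(sum_a: float, sum_b: float, n: int, alpha: float = 0.05) -> bool:
    """Two-sided z-test on two running sums of unit-variance samples."""
    z = (sum_b / n - sum_a / n) / sqrt(2 / n)
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha


def false_positive_rates(num_experiments=400, samples_per_arm=500, looks=10, seed=0):
    rng = random.Random(seed)
    checkpoint = samples_per_arm // looks
    peeking_hits = fixed_hits = 0
    for _ in range(num_experiments):
        sum_a = sum_b = 0.0
        any_look_significant = False
        for n in range(1, samples_per_arm + 1):
            # A/A test: both arms draw from the same distribution
            sum_a += rng.gauss(0, 1)
            sum_b += rng.gauss(0, 1)
            if n % checkpoint == 0 and is_significant(sum_a, sum_b, n):
                any_look_significant = True  # A peeker would have stopped and shipped here
        peeking_hits += any_look_significant
        fixed_hits += is_significant(sum_a, sum_b, samples_per_arm)
    return peeking_hits / num_experiments, fixed_hits / num_experiments


peeking_rate, fixed_rate = false_positive_rates()
print(f"False positive rate when stopping at the first significant look: {peeking_rate:.1%}")
print(f"False positive rate with a single final analysis: {fixed_rate:.1%}")
```

The single final analysis should land near the nominal 5%, while checking at every checkpoint and stopping on the first significant result pushes the rate well above it.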

Novelty effect. Users behave differently with new things. A new UI or response style might get better engagement for a week just because it's different. Run experiments for at least two full weeks for behavioral metrics.

Segment blindness. An overall improvement can hide a regression in a specific segment. Always break down results by query type, user cohort, and difficulty level.

def segment_analysis(results_df) -> dict:
    breakdowns = {}

    for segment_col in ["query_type", "user_cohort", "difficulty"]:
        if segment_col not in results_df.columns:
            continue

        breakdowns[segment_col] = {}
        for segment, group in results_df.groupby(segment_col):
            ctrl = group[group["variant"] == "control"]["faithfulness"].tolist()
            trt  = group[group["variant"] == "treatment"]["faithfulness"].tolist()

            if len(ctrl) > 30 and len(trt) > 30:  # Minimum for reliable stats
                breakdowns[segment_col][segment] = analyze_experiment(ctrl, trt)

    return breakdowns

Measuring the wrong thing. Faithfulness going up doesn't mean users are happier. Keep at least one behavioral metric (thumbs up rate, task completion) in every experiment so you're always connected to what actually matters.

Part 6: Decision Framework

When the experiment ends, you need a clear process for what to do with the results:

Experiment complete
        ↓
Did we hit the required sample size?
        ├── No → Extend or abort (don't analyze yet)
        └── Yes ↓
Any significant regressions?
        ├── Yes → Do not ship. Investigate why.
        └── No ↓
Any significant improvements on primary metric?
        ├── No → Inconclusive. Bigger change needed, or effect too small to matter.
        └── Yes ↓
Does improvement hold across segments?
        ├── No → Mixed results. Consider partial rollout or further investigation.
        └── Yes → Ship. Set new baseline. Document learnings.

The "set new baseline" step is critical and usually skipped. After you ship, update baseline.json so your regression detector compares against the new normal, not the old one.
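As a rough sketch of that step, assuming the shipped variant's aggregated metrics were saved to a JSON file in the same shape as baseline.json, the update could look like this. The function name promote_to_baseline and the backup file naming are hypothetical choices, not part of the original tooling.

```python
import json
from pathlib import Path


def promote_to_baseline(results_path: str, baseline_path: str = "baseline.json") -> dict:
    """Replace the baseline with the shipped variant's metrics, keeping a backup."""
    baseline_file = Path(baseline_path)
    new_metrics = json.loads(Path(results_path).read_text())

    if baseline_file.exists():
        # Keep the previous baseline so the change can be audited or rolled back
        backup = baseline_file.with_name(baseline_file.stem + ".previous.json")
        backup.write_text(baseline_file.read_text())

    baseline_file.write_text(json.dumps(new_metrics, indent=2))
    return new_metrics
```

Committing the updated baseline alongside the change that shipped keeps the history of "what normal looked like" in version control.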

Putting It Together: The Experiment Lifecycle

class Experiment:
    def __init__(self, experiment_id: str, hypothesis: str, primary_metric: str,
                 minimum_detectable_effect: float, control, treatment):
        self.id = experiment_id
        self.hypothesis = hypothesis
        self.primary_metric = primary_metric
        self.router = ExperimentRouter(experiment_id, control, treatment)

        # Pre-calculate required sample size
        baseline = load_baseline_metric(primary_metric)
        self.required_n = required_sample_size(baseline, minimum_detectable_effect)

        print(f"Experiment '{experiment_id}' initialized.")
        print(f"Hypothesis: {hypothesis}")
        print(f"Required samples per variant: {self.required_n}")

    def is_ready_to_analyze(self) -> bool:
        n_control   = count_samples(self.id, "control")
        n_treatment = count_samples(self.id, "treatment")
        return min(n_control, n_treatment) >= self.required_n

    def analyze(self) -> dict:
        if not self.is_ready_to_analyze():
            raise RuntimeError("Not enough samples yet. Don't peek.")

        results_df = load_experiment_results(self.id)
        report = experiment_report(self.id, results_df)
        report["segment_analysis"] = segment_analysis(results_df)

        return report

    def ship(self):
        """Call after analysis confirms improvement."""
        promote_treatment_to_production(self.id)
        update_baseline(self.primary_metric)
        archive_experiment(self.id)
        print(f"Experiment '{self.id}' shipped. Baseline updated.")

The Honest Truth About LLM A/B Testing

Most of the time, experiments are inconclusive. The new model is marginally better on some metrics, marginally worse on others, and you genuinely can't tell if shipping it is the right call.

That's useful information. It means the change doesn't matter enough to justify the operational risk of switching. Save the deployment for changes that move the needle clearly.

The teams that improve their LLM systems fastest aren't the ones running the most experiments — they're the ones running experiments with clear hypotheses, adequate sample sizes, and the discipline to ship nothing when the data says nothing.

Previous: Stop Eyeballing Your RAG Outputs. Start Measuring Quality.

Next up: Hybrid search — combining vector and BM25 for queries where pure semantic search falls flat.

Building an LLM Evaluation Framework That Actually Works

Ritwika Kancharla — Tue, 03 Mar 2026 20:36:52 +0000

Stop Eyeballing Your RAG Outputs. Start Measuring Quality.

I shipped a RAG system. It felt fine. Then users started reporting wrong product recommendations, invented prices, and confidently wrong answers to questions the documents couldn't support.

I had no numbers. No regression detection. No systematic way to improve. I was flying blind.

This is how I built an evaluation stack that catches failures before users do.

What "Evaluation" Actually Means

Most teams jump straight to asking humans "does this seem good?" That's too slow and too expensive to run on every change. There's a whole layer of automated evaluation that should come first.

Level	Question	Cadence
Unit	Does this component work correctly?	Every commit
Integration	Does the full pipeline work end-to-end?	Every PR
Human	Do users actually find this helpful?	Weekly
A/B	Is the new version measurably better?	Monthly

The lower layers are fast and cheap. Build them first, then let human evaluation handle the things automation genuinely can't.

Part 1: The Golden Dataset

Everything starts here. A golden dataset is a hand-curated set of examples that represent correct behavior — your ground truth for all automated metrics.

golden_examples = [
    {
        "id": "g_001",
        "query": "moisturizer for oily skin under $30",
        "context": {"user_skin_type": "oily", "budget": 30},
        "expected_retrieved_ids": ["prod_123", "prod_456"],
        "expected_response_contains": ["non-comedogenic", "oil-free", "lightweight"],
        "expected_citations": [1, 2],
        "difficulty": "medium"
    },
    {
        "id": "g_002",
        "query": "foundation",
        "context": {},
        "expected_action": "CLARIFY",
        "expected_clarifying_question_contains": ["skin type", "shade", "coverage"],
        "difficulty": "hard"
    }
]

Building It Without Guessing

Don't invent examples from your imagination. Sample from real production traffic, then label them.

import random

# Pull from recent logs
production_queries = load_logs(last_weeks=1, n=1000)

# Stratified sample by complexity
simple     = [q for q in production_queries if word_count(q) < 5]
medium     = [q for q in production_queries if 5 <= word_count(q) < 15]
complex_qs = [q for q in production_queries if word_count(q) >= 15]  # Avoids shadowing the built-in complex

def take(bucket: list, k: int) -> list:
    # random.sample raises ValueError if k exceeds the bucket size
    return random.sample(bucket, min(k, len(bucket)))

sample = take(simple, 30) + take(medium, 50) + take(complex_qs, 20)

Then have two annotators label each example independently. Target inter-annotator agreement above 0.8. Resolve disagreements with a third reviewer.

One rule: never modify your golden set in place. Version it. golden_v1.jsonl → golden_v2.jsonl. Track the diff. Your historical metrics are meaningless if the benchmark silently changes under them.
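Tracking the diff can be as simple as comparing two versions by example id. The sketch below is one way to do it, assuming each line of the JSONL file is an object with an "id" field as in the examples above. The helper names load_golden and diff_golden are illustrative.

```python
import json


def load_golden(path: str) -> dict:
    """Read a JSONL golden set into a dict keyed by example id, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return {ex["id"]: ex for ex in map(json.loads, filter(str.strip, f))}


def diff_golden(old: dict, new: dict) -> dict:
    """Summarize which example ids were added, removed, or changed between versions."""
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(i for i in shared if old[i] != new[i]),
    }


# Inline stand-ins for golden_v1.jsonl and golden_v2.jsonl
v1 = {"g_001": {"id": "g_001", "difficulty": "medium"},
      "g_002": {"id": "g_002", "difficulty": "hard"}}
v2 = {"g_001": {"id": "g_001", "difficulty": "easy"},
      "g_003": {"id": "g_003", "difficulty": "hard"}}

print(diff_golden(v1, v2))
# → {'added': ['g_003'], 'removed': ['g_002'], 'changed': ['g_001']}
```

Storing this summary next to each metrics run makes it clear whether a score moved because the system changed or because the benchmark did.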

Part 2: Automated Metrics

Retrieval Metrics

These answer the question: did we fetch the right documents?

import numpy as np

def mean_reciprocal_rank(retrieved_ids: list, relevant_ids: list) -> float:
    """Reciprocal rank of the first relevant result in the top 10 (0.0 if none)."""
    for rank, doc_id in enumerate(retrieved_ids[:10], 1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved_ids: list, relevant_ids: list, k: int = 5) -> float:
    """Ranking quality with binary relevance (a document is relevant or it is not)."""
    def dcg(ids):
        return sum(
            (1.0 if rid in relevant_ids else 0.0) / np.log2(i + 2)
            for i, rid in enumerate(ids[:k])
        )
    ideal_dcg = dcg(relevant_ids[:k])
    actual_dcg = dcg(retrieved_ids)
    return actual_dcg / ideal_dcg if ideal_dcg > 0 else 0.0

def precision_at_k(retrieved_ids: list, relevant_ids: list, k: int = 5) -> float:
    """How many of top-k results are actually relevant?"""
    hits = sum(1 for rid in retrieved_ids[:k] if rid in relevant_ids)
    return hits / k

Generation Metrics

Faithfulness — does the response follow from the sources, or is the model adding things it invented?

from sentence_transformers import CrossEncoder

nli_model = CrossEncoder("cross-encoder/nli-deberta-v3-base")

def faithfulness_score(response: str, sources: list) -> float:
    source_texts = [s["text"] for s in sources[:3]]
    if not source_texts:
        return 0.0  # Nothing to ground the response in

    # Score each (premise, hypothesis) pair; apply_softmax turns logits into probabilities
    probs = nli_model.predict(
        [(source, response) for source in source_texts],
        apply_softmax=True
    )
    # Label order per the model card: [contradiction, entailment, neutral]
    return float(np.mean(probs[:, 1]))  # Average entailment probability

Citation accuracy — are the [1], [2] references in the response pointing to real sources?

import re

def citation_accuracy(response: str, sources: list) -> dict:
    citations = re.findall(r'\[(\d+)\]', response)
    citation_indices = [int(c) - 1 for c in citations]

    issues = []
    for idx in citation_indices:
        if idx < 0 or idx >= len(sources):
            issues.append(f"Invalid citation [{idx + 1}]")

    return {
        "citation_count": len(citations),
        "valid_citations": len(citations) - len(issues),
        "accuracy": (len(citations) - len(issues)) / len(citations) if citations else 1.0,
        "issues": issues
    }

Answer relevance — does the response actually address the query?

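import numpy as np
from openai import OpenAI

client = OpenAI()  # Assumes OPENAI_API_KEY is set in the environment

def answer_relevance(query: str, response: str) -> float:
    # Embed both texts in a single request
    embeddings = client.embeddings.create(
        input=[query, response], model="text-embedding-3-small"
    ).data
    query_emb = np.array(embeddings[0].embedding)
    response_emb = np.array(embeddings[1].embedding)

    # Cosine similarity between the two embeddings
    return float(
        query_emb @ response_emb
        / (np.linalg.norm(query_emb) * np.linalg.norm(response_emb))
    )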

Latency Metrics

Speed is a quality signal. Measure every stage.

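import functools
import time

def measure_latency(func):
    @functools.wraps(func)  # Keep the wrapped function's name for logs and traces
    def wrapper(*args, **kwargs):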
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = (time.perf_counter() - start) * 1000
        return {"result": result, "latency_ms": elapsed}
    return wrapper

@measure_latency
def embed_query(query: str):
    return embedding_model.encode(query)

@measure_latency
def retrieve(query_emb):
    return vector_store.search(query_emb)

Track P50, P95, and P99. P95 is usually the most actionable — it's the experience your worst-off users are getting, without being dominated by outliers.
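As a minimal sketch of computing those percentiles from raw request timings, the standard library's statistics.quantiles is enough. The latencies below are synthetic (mostly fast requests plus a slow tail), and latency_percentiles is an illustrative helper name.

```python
import random
from statistics import quantiles


def latency_percentiles(latencies_ms: list) -> dict:
    """P50, P95, and P99 from raw per-request latencies."""
    # quantiles(n=100) returns the 99 cut points P1 through P99
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Synthetic latencies: 98% fast requests plus a 2% slow tail
rng = random.Random(42)
samples = [rng.gauss(200, 30) for _ in range(980)] + [rng.uniform(800, 2000) for _ in range(20)]

stats = latency_percentiles(samples)
mean = sum(samples) / len(samples)
print({name: round(value) for name, value in stats.items()}, f"mean={mean:.0f}")
```

In this synthetic data the slow tail barely moves P50 and P95 but dominates P99, which is why looking at all three together tells you more than the mean does.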

Part 3: The Evaluation Pipeline

Daily Automated Run

import time

import pandas as pd
from tqdm import tqdm

class EvaluationPipeline:
    def __init__(self, golden_path: str, system_under_test):
        self.golden = load_jsonl(golden_path)
        self.sut = system_under_test

    def run(self) -> dict:
        results = []

        for example in tqdm(self.golden):
            start = time.time()
            output = self.sut.process(example["query"])
            latency = (time.time() - start) * 1000

            # Examples such as clarification cases have no expected documents;
            # NaN keeps them out of the retrieval averages instead of raising KeyError
            expected_ids = example.get("expected_retrieved_ids")

            results.append({
                "example_id": example["id"],
                "query": example["query"],
                "difficulty": example.get("difficulty", "unknown"),  # Used by aggregate()
                "latency_ms": latency,
                "mrr": mean_reciprocal_rank(
                    output["retrieved_ids"], expected_ids
                ) if expected_ids else float("nan"),
                "ndcg_5": ndcg_at_k(
                    output["retrieved_ids"], expected_ids, k=5
                ) if expected_ids else float("nan"),
                "faithfulness": faithfulness_score(
                    output["response"],
                    output["sources"]
                ),
                "citation_accuracy": citation_accuracy(
                    output["response"],
                    output["sources"]
                )["accuracy"],
                "answer_relevance": answer_relevance(
                    example["query"],
                    output["response"]
                ),
                "passes_guardrails": output["guardrail_passed"],
                "correct_action": output["action"] == example.get("expected_action")
            })

        return self.aggregate(results)

    def aggregate(self, results) -> dict:
        df = pd.DataFrame(results)

        return {
            "retrieval": {
                "mrr_mean": df["mrr"].mean(),
                "mrr_p10": df["mrr"].quantile(0.10),
                "ndcg_mean": df["ndcg_5"].mean()
            },
            "generation": {
                "faithfulness_mean": df["faithfulness"].mean(),
                "citation_acc_mean": df["citation_accuracy"].mean(),
                "relevance_mean": df["answer_relevance"].mean()
            },
            "latency": {
                "p50": df["latency_ms"].median(),
                "p95": df["latency_ms"].quantile(0.95),
                "p99": df["latency_ms"].quantile(0.99)
            },
            "reliability": {
                "guardrail_pass_rate": df["passes_guardrails"].mean(),
                "correct_action_rate": df["correct_action"].mean()
            },
            "by_difficulty": df.groupby("difficulty")[["mrr", "faithfulness"]].mean().to_dict()
        }

Regression Detection

A daily run is only useful if something happens when metrics drop. Here's a regression detector with configurable tolerance:

class RegressionDetector:
    def __init__(self, baseline_metrics: dict, tolerance: float = 0.05):
        self.baseline = baseline_metrics
        self.tolerance = tolerance

    def check(self, new_metrics: dict) -> list:
        regressions = []

        checks = [
            ("retrieval.mrr_mean", "Retrieval MRR"),
            ("generation.faithfulness_mean", "Faithfulness"),
            ("latency.p95", "P95 Latency"),
            ("reliability.guardrail_pass_rate", "Guardrail Pass Rate")
        ]

        for path, name in checks:
            baseline = self._get_nested(self.baseline, path)
            current  = self._get_nested(new_metrics, path)

            # Latency: lower is better
            if "latency" in path:
                if current > baseline * (1 + self.tolerance):
                    regressions.append({
                        "metric": name,
                        "baseline": baseline,
                        "current": current,
                        "change": f"+{((current/baseline - 1) * 100):.1f}%"
                    })
            else:
                # Everything else: higher is better
                if current < baseline * (1 - self.tolerance):
                    regressions.append({
                        "metric": name,
                        "baseline": baseline,
                        "current": current,
                        "change": f"-{((1 - current/baseline) * 100):.1f}%"
                    })

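        return regressions

    @staticmethod
    def _get_nested(metrics: dict, path: str):
        """Resolve a dotted path such as "latency.p95" in a nested dict."""
        value = metrics
        for key in path.split("."):
            value = value[key]
        return value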

    def alert(self, regressions: list):
        if regressions:
            message = "REGRESSION DETECTED:\n" + "\n".join([
                f"- {r['metric']}: {r['baseline']:.3f} → {r['current']:.3f} ({r['change']})"
                for r in regressions
            ])
            send_slack_alert("#ml-alerts", message)

A 5% tolerance is a reasonable starting point. Tighten it as your baselines stabilize and the system matures.

Part 4: Human Evaluation (Done Right)

Automated metrics can't catch everything. Response helpfulness, tone, and nuanced faithfulness edge cases all need human judgment. The key is using humans efficiently.

What Automation Can and Can't Do

Task	Automated	Human
MRR, NDCG calculation	✅	❌
Faithfulness (clear cases)	✅	❌
Faithfulness (edge cases)	⚠️	✅
Response helpfulness	❌	✅
Tone, style, brand voice	❌	✅

Sample Strategically

Don't review a random 50 examples. Review the examples that are most likely to surface issues:

import random
from collections import defaultdict

def select_for_human_eval(all_results: list, n: int = 50) -> list:
    # Failures first
    failures = [r for r in all_results
                if not r["passes_guardrails"] or r["faithfulness"] < 0.6]

    # Uncertain cases — where the model might be right or wrong
    uncertain = [r for r in all_results if 0.4 < r["faithfulness"] < 0.8]

    # Diverse sample across query types
    by_type = defaultdict(list)
    for r in all_results:
        by_type[classify_query(r["query"])].append(r)

    diverse = []
    for qtype, items in by_type.items():
        diverse.extend(random.sample(items, min(5, len(items))))

    selected = list({r["example_id"]: r
                     for r in failures + uncertain + diverse}.values())
    return selected[:n]

A Rubric Worth Using

Unstructured "is this good?" questions produce inconsistent ratings. Give annotators something concrete:

Rate this response on 5 dimensions (1–5):

1. ACCURACY        — Information is correct and grounded in sources
2. COMPLETENESS    — Addresses all parts of the question
3. CLARITY         — Easy to understand, well-structured
4. HELPFULNESS     — Actually helps the user make progress
5. SAFETY          — No harmful, biased, or inappropriate content

Overall: Would you be satisfied with this response?
Provide brief justification for each score.

Check inter-annotator agreement with Cohen's kappa. Target above 0.6 (substantial agreement). If you're consistently below that, the rubric needs refinement before the ratings mean anything.
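For reference, here is a minimal sketch of unweighted Cohen's kappa in plain Python. The ratings are made up, and for ordinal 1 to 5 scales a weighted kappa that gives partial credit for near-misses is often preferred; this version treats every disagreement the same.

```python
from collections import Counter


def cohens_kappa(ratings_a: list, ratings_b: list) -> float:
    """Unweighted Cohen's kappa for two annotators rating the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both annotators must rate the same non-empty set of items")

    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        return 1.0  # Both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up HELPFULNESS scores (1-5) from two annotators on ten responses
annotator_1 = [5, 4, 4, 2, 3, 5, 1, 4, 3, 2]
annotator_2 = [5, 4, 3, 2, 3, 5, 1, 4, 4, 2]
print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.2f}")
# → kappa = 0.74
```

Here the annotators agree on 8 of 10 items (0.80 observed), chance agreement is 0.22, and kappa works out to 0.58 / 0.78, or about 0.74.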

Part 5: Continuous Integration

Evaluation that only runs on demand gets skipped. Put it in CI so it runs on every PR automatically.

# .github/workflows/eval.yml
name: Evaluation

on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run evaluation
        run: python -m evaluation.run --golden golden_v2.jsonl --output results.json

      - name: Check for regressions
        run: python -m evaluation.check_regression --baseline baseline.json --current results.json

      - name: Comment results on PR
        uses: actions/github-script@v6
        with:
          script: |
            const results = JSON.parse(require('fs').readFileSync('results.json', 'utf8'));
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Evaluation Results\n\n` +
                    `| Metric | Value | Status |\n` +
                    `|--------|-------|--------|\n` +
                    `| MRR | ${results.retrieval.mrr_mean.toFixed(3)} | ✅ |\n` +
                    `| Faithfulness | ${results.generation.faithfulness_mean.toFixed(3)} | ✅ |\n` +
                    `| P95 Latency | ${results.latency.p95.toFixed(0)}ms | ${results.latency.p95 < 500 ? '✅' : '⚠️'} |`
            });

Every PR now gets an automated evaluation comment. Reviewers can see metric changes alongside code changes.

The Streamlit Dashboard

import json

import streamlit as st

st.title("RAG Evaluation Dashboard")
with open("latest_eval.json") as f:
    results = json.load(f)

col1, col2, col3 = st.columns(3)
col1.metric("MRR", f"{results['retrieval']['mrr_mean']:.3f}", "+0.02")
col2.metric("Faithfulness", f"{results['generation']['faithfulness_mean']:.3f}", "-0.01")
col3.metric("P95 Latency", f"{results['latency']['p95']:.0f}ms", "-50ms")

st.line_chart(load_historical_metrics())

# Assumes the eval run also saves per-example rows under "per_example"
failures = [r for r in results.get("per_example", []) if r["faithfulness"] < 0.6]
st.table(failures[:10])

The Full Stack

┌─────────────────────────────────────────┐
│         GOLDEN DATASET                  │
│  Versioned, diverse, expert-labeled     │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│      AUTOMATED METRICS (CI)             │
│  • Retrieval: MRR, NDCG, Precision      │
│  • Generation: Faithfulness, Citations  │
│  • Latency: P50, P95, breakdown         │
│  • Reliability: Guardrails, errors      │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│      REGRESSION DETECTION               │
│  Compare to baseline, alert on degrade  │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│      HUMAN EVALUATION (Weekly)          │
│  Sampled, rubric-based, IAA-checked     │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│      A/B TESTING (Monthly)              │
│  New model vs. production, business KPIs│
└─────────────────────────────────────────┘

The thing nobody tells you about building LLM systems: getting the model to generate output is 20% of the work. Understanding whether that output is any good — and knowing the moment it gets worse — is the other 80%.

Build the evaluation stack early. It's what turns a prototype you're guessing about into a system you can actually improve.

Next up: A/B testing LLM systems — when your new model "looks better" but the metrics disagree.

Building a Production-Grade RAG System (Not Just a Demo)

Ritwika Kancharla — Tue, 03 Mar 2026 19:57:04 +0000

It's easy to build a RAG prototype that impresses in a notebook. It's much harder to build one that holds up in production — one that handles 100,000 documents instead of a hundred, recovers gracefully from failures, and gives you actual visibility into what's going wrong when it does.

This is the article for the second kind.

What "Production-Grade" Actually Means

Before we write any code, it's worth being precise about the target. A demo RAG system works on your laptop, handles a small corpus, and "looks right" to whoever's watching. A production RAG system does something fundamentally different: it's measured, monitored, and improvable. It handles load, recovers from failures, and can be understood by a teammate who didn't build it.

The architecture that gets you there has four layers:

┌─────────────────────────────────────────┐
│           DOCUMENT PIPELINE             │
│  Ingest → Chunk → Embed → Index         │
│  (Batch jobs, idempotent, monitored)    │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│           RETRIEVAL LAYER               │
│  Query → Embed → Search → Rerank        │
│  (Cached, filtered, logged)             │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│           GENERATION LAYER              │
│  Prompt → LLM → Post-process → Stream   │
│  (Guardrailed, traced, evaluated)       │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│           OBSERVABILITY                 │
│  Metrics → Logs → Evals → Alerts        │
│  (You actually know when it breaks)     │
└─────────────────────────────────────────┘

Let's build each one properly.

Part 1: Document Ingestion Pipeline

Chunking: The Strategy Nobody Thinks About Until It's Too Late

Most people grab a text splitter, pick an arbitrary chunk size, and move on. This works until you're debugging why your system can't answer questions the documents clearly contain.

The right mental model: one chunk = one answerable unit. A chunk should contain enough context to stand alone as the answer to some question. Too small and you lose context; too large and you dilute the signal.

from langchain_text_splitters import RecursiveCharacterTextSplitter  # older LangChain releases: langchain.text_splitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # Characters, not tokens
    chunk_overlap=50,      # Preserves context across boundaries
    separators=["\n\n", "\n", ". ", " ", ""],  # Tries these in order
    length_function=len,
)

chunks = splitter.split_text(long_document)

The RecursiveCharacterTextSplitter is the right default: it respects document structure, splitting on paragraphs before sentences before words. Fixed-size splitters will happily cleave a sentence in half.
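To make the difference concrete, here is a minimal, dependency-free sketch that compares a fixed-size split with a paragraph-first split. The helper `split_by_paragraphs` is a hypothetical illustration written for this comparison. It is not how LangChain implements its splitter, and it handles only the paragraph separator.

```python
def split_by_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_chars."""
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # An oversized paragraph becomes its own chunk in this sketch
            current = para
    if current:
        chunks.append(current)
    return chunks


doc = (
    "First paragraph about setup.\n\n"
    "Second paragraph about usage.\n\n"
    "Third paragraph about errors."
)

# Fixed-size split: cuts every 40 characters regardless of structure
fixed = [doc[i:i + 40] for i in range(0, len(doc), 40)]

# Paragraph-first split: each chunk ends on a paragraph boundary
structured = split_by_paragraphs(doc, max_chars=40)
```

The fixed-size version ends its first chunk in the middle of the word "paragraph", while the paragraph-first version returns three intact paragraphs.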

Metadata: Store It Now, Thank Yourself Later

Every chunk needs metadata attached at ingestion time. You will want to filter by source, date, and document type in production, and retrofitting that metadata later is painful.

def process_document(doc: dict) -> list:
    chunks = splitter.split_text(doc["content"])

    return [
        {
            "id": f"{doc['source_id']}_{i}",
            "text": chunk,
            "metadata": {
                "source": doc["source"],
                "created_at": doc["timestamp"],
                "chunk_index": i,
                "total_chunks": len(chunks),
                "section": extract_heading(chunk),
                "doc_type": classify_doc_type(chunk),  # FAQ, tutorial, reference, etc.
            }
        }
        for i, chunk in enumerate(chunks)
    ]

Embedding: Batch and Cache

Embedding is where your API costs live. Two habits that pay off immediately: batching and caching.

from openai import OpenAI
import hashlib
import diskcache

client = OpenAI()
cache = diskcache.Cache("./embedding_cache")

def embed_with_cache(texts: list) -> list:
    # One slot per input so results come back in the same order as `texts`
    embeddings = [None] * len(texts)
    texts_to_embed = []  # (position, cache key, text) for every cache miss

    for i, text in enumerate(texts):
        key = hashlib.md5(text.encode()).hexdigest()

        if key in cache:
            embeddings[i] = cache[key]
        else:
            texts_to_embed.append((i, key, text))

    if texts_to_embed:
        response = client.embeddings.create(
            model="text-embedding-3-large",
            input=[t[2] for t in texts_to_embed]
        )

        # The API returns one embedding per input, in input order
        for (i, key, _), embedding in zip(texts_to_embed, response.data):
            cache[key] = embedding.embedding
            embeddings[i] = embedding.embedding

    return embeddings

The sweet spot for batch size is 100–500 texts per API call. Don't embed one text at a time.
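The function above sends every cache miss in a single request, so a large ingestion job still needs something to cut the input into batches. A minimal sketch of that step follows. The names `batched` and `embed_in_batches` are hypothetical helpers, and `embed_fn` stands in for any function that maps a list of texts to a list of embeddings.

```python
from typing import Callable, Iterator


def batched(items: list, batch_size: int = 100) -> Iterator[list]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: list,
    embed_fn: Callable[[list], list],
    batch_size: int = 100,
) -> list:
    """Call embed_fn once per batch and concatenate results in input order."""
    embeddings = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_fn(batch))
    return embeddings


# Stand-in embedder for demonstration: one number per text
def fake_embed(batch: list) -> list:
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
```

In the pipeline, `embed_fn` would be the cached embedder, for example `embed_in_batches(chunks, embed_with_cache, batch_size=200)`.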

Choosing a Vector Store

Store	Best For
Chroma	Prototyping and smaller corpora (<100K docs)
Pinecone	Managed production scale with metadata filtering
Weaviate	Complex graph-like queries
pgvector	When you already have Postgres and want one database
FAISS	Batch/research use cases needing GPU acceleration

For most teams starting out, Chroma gets you running fast. Pinecone is the natural migration target when you need managed scale.

Idempotent Ingestion

Re-running your ingestion pipeline shouldn't create duplicates. This sounds obvious, but it's the kind of thing that bites you the first time you need to re-index after a bug fix.

def ingest_documents(new_docs: list):
    existing = collection.get(ids=[d["id"] for d in new_docs])
    existing_ids = set(existing["ids"])

    to_add = [d for d in new_docs if d["id"] not in existing_ids]
    to_update = [d for d in new_docs if d["id"] in existing_ids]

    if to_add:
        collection.add(...)

    for doc in to_update:
        if content_changed(doc):  # Compare hashes
            collection.delete(ids=[doc["id"]])
            collection.add(...)

Part 2: Retrieval Layer

Over-Fetch, Then Rerank

Vector similarity is good at finding roughly relevant chunks. It's not as good at ranking them. The solution is to over-fetch — grab 2–3x more candidates than you need — and then rerank with a cross-encoder.

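class RetrievalEngine:
    def search(self, query: str, filters: dict = None, top_k: int = 10) -> list:
        query_emb = self.embedder.embed(query)

        results = self.collection.query(
            query_embeddings=[query_emb],
            n_results=top_k * 2,  # Over-fetch
            where=filters,
            include=["documents", "metadatas", "distances"]
        )

        # Assuming a Chroma collection: results hold one list per query, so
        # flatten the single query's hits into the dicts that rerank() expects
        candidates = [
            {"id": id_, "text": text, "metadata": meta, "distance": dist}
            for id_, text, meta, dist in zip(
                results["ids"][0],
                results["documents"][0],
                results["metadatas"][0],
                results["distances"][0],
            )
        ]

        reranked = self.rerank(query, candidates, top_k)
        self.log_query(query, candidates, reranked)

        return reranked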

A cross-encoder scores each query–document pair jointly, which is more accurate than a bi-encoder embedding comparison. The tradeoff is speed, but since you're only reranking a small candidate set, it's fast enough:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list, top_k: int) -> list:
    pairs = [(query, doc["text"]) for doc in candidates]
    scores = reranker.predict(pairs)

    for doc, score in zip(candidates, scores):
        doc["rerank_score"] = score

    return sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)[:top_k]

In most benchmarks, reranking improves precision@5 by 15–25%. It's one of the highest-ROI improvements you can make.

Query Rewriting for Conversational Context

Users in a multi-turn conversation say things like "how do I fix it?" without specifying what "it" is. Retrieval breaks down on pronouns and context-dependent references.

The fix is a short LLM call that rewrites the query to be self-contained before searching:

def rewrite_query(query: str, conversation_history: list) -> str:
    prompt = f"""
    Rewrite this query to be self-contained and specific for search.

    History: {conversation_history[-3:]}
    Current query: {query}

    Rules:
    - Replace "this", "it", "that" with specific nouns from history
    - Add relevant context from the conversation
    - Make it keyword-friendly, not conversational

    Rewritten query:
    """

    return llm.generate(prompt)

So "How do I fix it?" becomes "How to fix Docker build failure: no space left on device" — something the vector store can actually work with.

Part 3: Generation Layer

Prompt Structure Beats Prompt Cleverness

There's a lot of mythology around prompt engineering. In practice, the highest-value thing you can do for RAG prompts is give the model clear, structured instructions with explicit fallback behavior:

RAG_PROMPT = """You are a helpful assistant. Answer based on the provided context.

CONTEXT:
{context}

USER QUESTION:
{question}

INSTRUCTIONS:
1. Answer using ONLY the context provided
2. If the context doesn't contain the answer, say "I don't have that information"
3. Cite your sources with [1], [2], etc.
4. Be concise but complete

ANSWER:
"""

def format_context(docs: list) -> str:
    return "\n\n".join([
        f"[{i+1}] {d['metadata']['source']}: {d['text'][:500]}"
        for i, d in enumerate(docs)
    ])

The explicit "say you don't know" instruction is critical. Without it, models will hallucinate confident answers from thin context.

Guardrails: Catch Bad Outputs Before Users See Them

A guardrail layer runs checks on every response before it goes to the user. Start simple — you can make this as sophisticated as you need over time:

import re

class OutputGuardrail:
    def check(self, response: str, sources: list) -> dict:
        issues = []

        # Hallucinated citations (model cited a source number that doesn't exist)
        citations = re.findall(r'\[(\d+)\]', response)
        for c in citations:
            # Valid citations run from [1] to [len(sources)]; [0] is invalid too
            if not 1 <= int(c) <= len(sources):
                issues.append(f"Invalid citation [{c}]")

        # Excessive hedging (often signals the model is guessing)
        weasel_words = ["might", "maybe", "possibly", "could be"]
        if sum(w in response.lower() for w in weasel_words) > 2:
            issues.append("Low confidence language detected")

        # Suspiciously short responses
        if len(response) < 20:
            issues.append("Response too short")

        return {
            "passed": len(issues) == 0,
            "issues": issues,
            "suggested_action": "retry" if issues else "proceed"
        }

Streaming Makes Everything Feel Faster

Users perceive a system that starts showing output immediately as dramatically faster than one that makes them wait for a complete response — even if total latency is similar.

def generate_streaming(prompt: str):
    stream = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )

    for chunk in stream:
        # Some chunks can arrive with no choices, so guard before indexing
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

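With streaming, the first token can appear in roughly 200ms instead of after a 2-second wait for the complete response. Treat those numbers as illustrative and measure your own. This is a perception win, not a performance hack.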
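Time-to-first-token is easy to measure around any token generator. The sketch below wraps an iterator and records when the first token arrives and when the stream ends. `measure_stream` and `fake_stream` are hypothetical names, and the fake generator only stands in for a real streaming call.

```python
import time
from typing import Iterable


def measure_stream(tokens: Iterable[str]) -> dict:
    """Consume a token stream and report time-to-first-token and total time."""
    start = time.perf_counter()
    first_token_at = None
    parts = []

    for token in tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        parts.append(token)

    end = time.perf_counter()
    return {
        "text": "".join(parts),
        # None when the stream produced no tokens at all
        "ttft_ms": None if first_token_at is None else (first_token_at - start) * 1000,
        "total_ms": (end - start) * 1000,
    }


def fake_stream():
    """Stand-in for a streaming LLM call: emits a token every 10 ms."""
    for token in ["Hello", ", ", "world"]:
        time.sleep(0.01)
        yield token


stats = measure_stream(fake_stream())
```

Logging `ttft_ms` alongside `total_ms` lets you track the latency users perceive separately from the latency you pay for.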

Part 4: Observability

If you don't measure it, you can't improve it. Here are the metrics that actually matter for RAG:

Category	Metric	Why It Matters
Retrieval	MRR, NDCG@5, Precision@K	Is search finding the right chunks?
Generation	Faithfulness, citation accuracy	Is the LLM staying grounded?
Latency	P50, P95, time-to-first-token	Is it fast enough for real use?
Business	User satisfaction, task completion	Is it actually useful?
Cost	Tokens per query, embedding costs	Can you afford to run it?

Structured Logging You Can Actually Query

Write logs as NDJSON. Every line is a complete, valid JSON object. BigQuery, Elasticsearch, and most log aggregators love this format.

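import hashlib
import json
from datetime import datetime, timezone

def log_interaction(query: str, retrieved: list, response: str, latency: float):
    log_entry = {
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12
        "timestamp": datetime.now(timezone.utc).isoformat(),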
        "query": query,
        "query_hash": hashlib.md5(query.encode()).hexdigest(),
        "num_retrieved": len(retrieved),
        "retrieved_sources": [r["metadata"]["source"] for r in retrieved],
        "response_length": len(response),
        "latency_ms": latency,
        "guardrail_issues": check_guardrails(response, retrieved),
    }

    with open("rag_logs.ndjson", "a") as f:
        f.write(json.dumps(log_entry) + "\n")
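Because every line is a standalone JSON object, the latency metrics from the table above can be computed straight from the log. The sketch below uses a nearest-rank percentile. `percentile` and `latency_summary` are hypothetical helpers, and the only field assumed is the `latency_ms` key written by the logger.

```python
import json
import math
from typing import Iterable


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value covering pct% of the data."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_summary(ndjson_lines: Iterable[str]) -> dict:
    """Parse NDJSON lines and summarize the latency_ms field."""
    latencies = [
        json.loads(line)["latency_ms"]
        for line in ndjson_lines
        if line.strip()  # tolerate blank lines
    ]
    return {
        "count": len(latencies),
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
    }


# Demonstration with in-memory lines instead of a real log file
sample = [json.dumps({"latency_ms": ms}) for ms in (120, 80, 400, 95, 150)]
summary = latency_summary(sample)
```

A file object iterates line by line, so the same function works as `latency_summary(open("rag_logs.ndjson"))`.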

Automated Daily Evaluation

Build a golden dataset of (query, expected relevant document IDs) pairs. Run it daily. Alert on regression.

import numpy as np

class RAGEvaluator:
    def evaluate_retrieval(self) -> dict:
        scores = []

        for item in self.golden:
            results = retrieval_engine.search(item["query"])
            retrieved_ids = [r["id"] for r in results]
            scores.append(calculate_mrr(retrieved_ids, item["relevant_ids"]))

        return {
            "mrr_mean": np.mean(scores),
            "mrr_p10": np.percentile(scores, 10),
            "mrr_p90": np.percentile(scores, 90),
        }

    def run_daily_eval(self):
        metrics = self.evaluate_retrieval()

        if metrics["mrr_mean"] < BASELINE_MRR * 0.95:
            send_alert(f"Retrieval MRR dropped to {metrics['mrr_mean']:.3f}")

        log_to_datadog(metrics)

A 5% regression threshold is a reasonable starting point. Tighten it as your system matures and baselines stabilize.
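The evaluator calls `calculate_mrr` without defining it. A minimal sketch of what that helper could look like follows, together with a precision@k counterpart. The signatures are assumptions chosen to match the call in `evaluate_retrieval`. Note that the function returns the reciprocal rank for one query, and the evaluator's `np.mean` turns those into the mean reciprocal rank.

```python
def calculate_mrr(retrieved_ids: list[str], relevant_ids: list[str]) -> float:
    """Reciprocal rank of the first relevant result, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int = 5) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant)
    return hits / k


# The first relevant document sits at rank 2, so the reciprocal rank is 1/2
rr = calculate_mrr(["d3", "d1", "d7"], ["d1"])

# Two of the top five results are relevant
p5 = precision_at_k(["d1", "d2", "d3", "d4", "d5"], ["d1", "d4"], k=5)
```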

Putting It All Together

Here's the full query path, assembled:

import time

class ProductionRAG:
    def query(self, user_query: str, conversation_history: list = None) -> dict:
        start_time = time.time()

        # Rewrite query if we have conversation context
        search_query = (
            rewrite_query(user_query, conversation_history)
            if conversation_history
            else user_query
        )

        # Retrieve with reranking
        retrieved = self.retrieval.search(search_query, top_k=5)

        # Build the prompt, then collect the streamed tokens into one string
        # so the guardrail can check the complete response
        prompt = RAG_PROMPT.format(
            context=format_context(retrieved),
            question=user_query
        )

        response = "".join(generate_streaming(prompt))

        # Guardrail check
        guardrail_result = self.guardrail.check(response, retrieved)
        if not guardrail_result["passed"]:
            response = "I need to verify some details before I can answer this confidently."

        # Log everything
        latency = (time.time() - start_time) * 1000
        log_interaction(user_query, retrieved, response, latency)

        return {
            "response": response,
            "sources": [r["metadata"]["source"] for r in retrieved],
            "latency_ms": latency,
            "guardrail_passed": guardrail_result["passed"]
        }

Before You Ship: The Checklist

[ ] Chunking strategy documented and tested against real queries
[ ] Metadata schema versioned and consistent across documents
[ ] Idempotent ingestion — re-running never creates duplicates
[ ] Embedding cache reducing API costs on repeated content
[ ] Reranking improving precision over raw vector similarity
[ ] Query rewriting handling ambiguous, conversational queries
[ ] Guardrails catching bad outputs before users see them
[ ] Streaming enabled for perceived performance
[ ] Structured NDJSON logging queryable in your data stack
[ ] Daily automated evaluation against a golden dataset
[ ] Alerts configured for metric regression
[ ] Health checks for load balancer integration
[ ] Runbook written for the three most likely failure modes

Where to Go From Here

Once this foundation is solid, the natural next steps are:

Hybrid search — combine vector search with BM25 keyword search. Purely vector-based retrieval underperforms on keyword-heavy queries (product names, error codes, proper nouns).

Multi-tenancy — separate collections per customer, with per-tenant metadata filtering. Don't let one customer's documents bleed into another's retrieval results.

Continuous indexing — webhook-driven updates instead of scheduled batch jobs. New documents show up in retrieval within seconds, not hours.

A/B testing — route 10% of traffic to a new embedding model and measure retrieval metrics before committing. This is the only rigorous way to evaluate embedding model changes.
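For the traffic split in the last item, one common approach is deterministic hash-based assignment, so a given user always lands in the same variant without any stored state. The sketch below is one way to do it under that assumption. `assign_variant` and the experiment name are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.10) -> str:
    """Deterministically map a user to 'treatment' or 'control'."""
    if not 0.0 <= treatment_share <= 1.0:
        raise ValueError("treatment_share must be between 0 and 1")

    # Salting with the experiment name keeps assignments independent across experiments
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()

    # First 8 bytes as an integer, scaled into [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if bucket < treatment_share else "control"


# Rough check of the split across simulated users
assignments = [assign_variant(f"user-{i}", "embedding-model-v2") for i in range(10_000)]
observed_share = assignments.count("treatment") / len(assignments)
```

The observed share should land near 10% over a large user population, and the same user ID always returns the same variant, which keeps a conversation from switching embedding models midway.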

The difference between a RAG prototype and a RAG system is mostly about the plumbing nobody sees: idempotent pipelines, structured logs, evaluation harnesses, guardrails. It's less glamorous than the retrieval algorithm, but it's what determines whether the thing is still working correctly six months after you shipped it.

Build the plumbing first.
