<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alberto Nieto</title>
    <description>The latest articles on DEV Community by Alberto Nieto (@albertocodes).</description>
    <link>https://dev.to/albertocodes</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3803172%2Fa2d1083a-bdd7-433f-99b3-7091a36a8cd8.png</url>
      <title>DEV Community: Alberto Nieto</title>
      <link>https://dev.to/albertocodes</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/albertocodes"/>
    <language>en</language>
    <item>
      <title>From one model to seven — what it took to make TurboQuant model-portable</title>
      <dc:creator>Alberto Nieto</dc:creator>
      <pubDate>Wed, 01 Apr 2026 00:42:31 +0000</pubDate>
      <link>https://dev.to/albertocodes/from-one-model-to-seven-what-it-took-to-make-turboquant-model-portable-4fjc</link>
      <guid>https://dev.to/albertocodes/from-one-model-to-seven-what-it-took-to-make-turboquant-model-portable-4fjc</guid>
      <description>&lt;p&gt;A KV cache compression plugin that only works on one model is a demo, not a tool. turboquant-vllm v1.0.0 shipped four days ago with one validated architecture: Molmo2. v1.3.0 validates seven — Llama 3.1, Mistral 7B, Qwen2.5, Phi-3-mini, Phi-4, Gemma-2, and Gemma-3. The path between those two points was more interesting than the destination.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changed
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fused paged kernels (v1.2.0).&lt;/strong&gt; The original architecture decompressed KV cache from TQ4 to FP16 in HBM, then ran standard attention on the result. The new fused kernel reads compressed blocks directly from vLLM's page table, decompresses in SRAM, and computes attention in a single pass. HBM traffic: 1,160 → 136 bytes per token.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# One flag. Same as before.
&lt;/span&gt;&lt;span class="n"&gt;vllm&lt;/span&gt; &lt;span class="n"&gt;serve&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;llama&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;Llama&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;3.1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;attention&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="n"&gt;CUSTOM&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
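&lt;p&gt;The 136-byte figure falls out of the TQ4 block layout: 4-bit nibble-packed indices plus one fp32 norm per vector. A back-of-envelope sketch, assuming &lt;code&gt;head_dim=128&lt;/code&gt; and counting per attention head per layer (the 1,160-byte baseline presumably covers more than raw K/V reads, so only the compressed side is reconstructed here):&lt;/p&gt;

```python
# Bytes of compressed KV read per decoded token in the fused path,
# assuming head_dim=128, 4-bit packed indices, one fp32 norm per vector.
head_dim = 128
packed_indices = head_dim * 4 // 8    # two 4-bit indices per byte: 64 B
norm = 4                              # fp32 norm: 4 B
per_vector = packed_indices + norm    # 68 B per K or V vector
per_token = 2 * per_vector            # K plus V, per head per layer: 136 B
print(per_token)  # 136
```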



&lt;p&gt;&lt;strong&gt;Non-pow2 head dimensions (v1.3.0).&lt;/strong&gt; Triton's &lt;code&gt;tl.arange&lt;/code&gt; requires power-of-two ranges. Phi-3-mini has head_dim=96. Gemma has head_dim=256. All five Triton kernels needed pad-to-next-power-of-two with boundary masking. 23 new tests cover the three new dimension classes.&lt;/p&gt;
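&lt;p&gt;The pad-and-mask pattern is simple to state even though wiring it through five kernels was not: round the head dimension up to the next power of two for &lt;code&gt;tl.arange&lt;/code&gt;, then mask off the padding lanes on every load and store. A minimal sketch of the index math in plain Python (the Triton kernels themselves are not reproduced here):&lt;/p&gt;

```python
def next_pow2(n):
    # Smallest power of two that is at least n (n a positive int).
    return 2 ** (n - 1).bit_length()

def lane_mask(head_dim):
    # Boundary mask over the padded range: True for lanes holding real
    # data, False for padding lanes that must be ignored.
    padded = next_pow2(head_dim)
    return [lane in range(head_dim) for lane in range(padded)]

assert next_pow2(96) == 128      # Phi-3-mini pads 96 up to 128
assert next_pow2(256) == 256     # Gemma is already a power of two
assert sum(lane_mask(96)) == 96  # 96 real lanes, 32 masked out
```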

&lt;p&gt;&lt;strong&gt;Sliding window attention bypass (v1.3.0).&lt;/strong&gt; Gemma-2 and Gemma-3 mix global and sliding window attention layers. Compressing SWA layers breaks cache eviction. The fix: SWA layers bypass compression automatically via the &lt;code&gt;is_sliding&lt;/code&gt; attribute. Global layers compress normally.&lt;/p&gt;
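&lt;p&gt;The dispatch amounts to a per-layer check on &lt;code&gt;is_sliding&lt;/code&gt;. A hypothetical sketch (the attribute name comes from the post; the surrounding plumbing is invented for illustration):&lt;/p&gt;

```python
def store_kv(layer, kv_block, compress):
    # SWA layers bypass compression so sliding-window cache eviction
    # keeps working; global-attention layers compress normally.
    if getattr(layer, "is_sliding", False):
        return kv_block            # stored uncompressed
    return compress(kv_block)      # TQ4-compressed

class GlobalLayer:
    is_sliding = False

class SlidingLayer:
    is_sliding = True

assert store_kv(SlidingLayer(), "kv", lambda b: "tq4") == "kv"
assert store_kv(GlobalLayer(), "kv", lambda b: "tq4") == "tq4"
```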

&lt;p&gt;&lt;strong&gt;Verify CLI.&lt;/strong&gt; Check any model in thirty seconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; turboquant_vllm.verify &lt;span class="nt"&gt;--model&lt;/span&gt; google/gemma-2-2b &lt;span class="nt"&gt;--bits&lt;/span&gt; 4
&lt;span class="c"&gt;# PASS — all layers, cosine 0.9951&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why This Design
&lt;/h2&gt;

&lt;p&gt;The fused kernel architecture was a prerequisite for everything else. Without it, model expansion would have multiplied a slow path — decompress-to-HBM on every decode step across more models means more wasted bandwidth. Fusing first meant each new model gets the fast path for free.&lt;/p&gt;

&lt;p&gt;The non-pow2 fix was not a config change. It was a kernel rewrite across five files, each with different padding constraints depending on whether the kernel reads keys, values, or both. The ~5–15% throughput penalty for non-pow2 dimensions is real and documented — but for head_dim=128 models (the majority), it's zero.&lt;/p&gt;

&lt;p&gt;The production hotfixes (v1.2.1, v1.2.2) are worth mentioning because they came from container benchmarking, not unit tests. Running TQ4 inside the vLLM container against real video clips surfaced OOM bugs that synthetic tests never would. Both patches landed within 24 hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;turboquant-vllm[vllm]&amp;gt;&lt;span class="o"&gt;=&lt;/span&gt;1.3.0
vllm serve meta-llama/Llama-3.1-8B &lt;span class="nt"&gt;--attention-backend&lt;/span&gt; CUSTOM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify compression quality on any supported model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; turboquant_vllm.verify &lt;span class="nt"&gt;--model&lt;/span&gt; &amp;lt;model-id&amp;gt; &lt;span class="nt"&gt;--bits&lt;/span&gt; 4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Validated models:&lt;/strong&gt; Molmo2-4B, Llama 3.1 8B, Mistral 7B, Qwen2.5-3B, Phi-3-mini, Phi-4, Gemma-2-2b, Gemma-3-4B-it. All pass at cosine ≥0.99.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmarks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VLM (Molmo2-4B, FP16 baseline): 3.76x KV compression&lt;/li&gt;
&lt;li&gt;Text-only (Llama 3.1 / Mistral, FP8 baseline): 1.88x KV capacity, lossless at temperature=0&lt;/li&gt;
&lt;li&gt;At 16K context: 6 concurrent requests vs. 3 for the baseline&lt;/li&gt;
&lt;/ul&gt;
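&lt;p&gt;Both headline ratios are consistent with a 68-byte TQ4 vector (64 bytes of packed 4-bit indices plus a 4-byte fp32 norm) at &lt;code&gt;head_dim=128&lt;/code&gt;:&lt;/p&gt;

```python
# Sanity check of the benchmark ratios against per-vector byte counts,
# assuming head_dim=128.
head_dim = 128
tq4 = head_dim * 4 // 8 + 4   # 64 B packed indices + 4 B fp32 norm = 68 B
fp16 = head_dim * 2           # 256 B per vector
fp8 = head_dim                # 128 B per vector

assert round(fp16 / tq4, 2) == 3.76   # the VLM number, FP16 baseline
assert round(fp8 / tq4, 2) == 1.88    # the text-only number, FP8 baseline
```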

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Upstream vLLM contribution (&lt;a href="https://github.com/vllm-project/vllm/issues/38171" rel="noopener noreferrer"&gt;vllm#38171&lt;/a&gt; — 49 upvotes)&lt;/li&gt;
&lt;li&gt;Flash Attention kernel fusion for multi-layer correctness&lt;/li&gt;
&lt;li&gt;VL-Cache stacking for multiplicative VLM savings&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://pypi.org/project/turboquant-vllm/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt; | &lt;a href="https://alberto-codes.github.io/turboquant-vllm/" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://github.com/Alberto-Codes/turboquant-vllm" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>vllm</category>
      <category>gpu</category>
      <category>triton</category>
    </item>
    <item>
      <title>Compressed VLM inference from a single Containerfile — turboquant-vllm v1.1</title>
      <dc:creator>Alberto Nieto</dc:creator>
      <pubDate>Sat, 28 Mar 2026 17:56:54 +0000</pubDate>
      <link>https://dev.to/albertocodes/compressed-vlm-inference-from-a-single-containerfile-turboquant-vllm-v11-35bm</link>
      <guid>https://dev.to/albertocodes/compressed-vlm-inference-from-a-single-containerfile-turboquant-vllm-v11-35bm</guid>
      <description>&lt;p&gt;The hardest part of GPU inference isn't the model — it's the environment. CUDA versions, driver compatibility, pip dependency conflicts. You can have a working quantization plugin and still spend an hour getting it to run on a fresh machine.&lt;/p&gt;

&lt;p&gt;turboquant-vllm v1.1.0 ships a &lt;code&gt;Containerfile&lt;/code&gt; that eliminates that setup. It extends the official vLLM image, installs the TQ4 compression plugin from PyPI, and verifies the plugin entry point at build time — not at runtime when you're debugging a silent fallback to uncompressed attention.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changed in v1.1
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Container support.&lt;/strong&gt; A single &lt;code&gt;Containerfile&lt;/code&gt; bakes turboquant-vllm into the official &lt;code&gt;vllm-openai&lt;/code&gt; image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Alberto-Codes/turboquant-vllm.git
&lt;span class="nb"&gt;cd &lt;/span&gt;turboquant-vllm
podman build &lt;span class="nt"&gt;-t&lt;/span&gt; vllm-turboquant &lt;span class="nt"&gt;-f&lt;/span&gt; infra/Containerfile.vllm &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then serve a vision-language model with compressed KV cache:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;podman run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--device&lt;/span&gt; nvidia.com/gpu&lt;span class="o"&gt;=&lt;/span&gt;all &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--shm-size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8g &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 8000:8000 &lt;span class="se"&gt;\&lt;/span&gt;
  vllm-turboquant &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; allenai/Molmo2-8B &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--attention-backend&lt;/span&gt; CUSTOM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One flag: &lt;code&gt;--attention-backend CUSTOM&lt;/code&gt;. That's it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Documentation site.&lt;/strong&gt; Auto-generated API reference from docstrings, usage guides for vLLM, HuggingFace, and container deployment — including Quadlet examples for running as a systemd service. Deployed to GitHub Pages on every release.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-layer quality tests.&lt;/strong&gt; 12 new cosine similarity tests verify compression fidelity at each of the 36 transformer layers, not just end-to-end output. This catches layer-specific precision degradation that whole-model tests miss.&lt;/p&gt;
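&lt;p&gt;The shape of such a test is easy to sketch: compare reference and compressed activations layer by layer, so a regression pinpoints the offending layer instead of surfacing as garbled end-to-end output. A toy illustration (not the project's actual test code):&lt;/p&gt;

```python
import math

def cosine(a, b):
    # Plain cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

reference = [[1.0, 2.0, 3.0], [0.5, 0.5, 1.0]]     # per-layer outputs
compressed = [[1.01, 1.99, 3.0], [0.5, 0.51, 0.99]]
for i, (ref, comp) in enumerate(zip(reference, compressed)):
    # min(...) == 0.99 holds exactly when similarity is at least 0.99.
    assert min(cosine(ref, comp), 0.99) == 0.99, f"layer {i} degraded"
```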

&lt;h2&gt;
  
  
  Why This Design
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;Containerfile&lt;/code&gt; is deliberately minimal — 11 lines. It installs from PyPI (not from source) and verifies the plugin entry point at build time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; docker.io/vllm/vllm-openai:v0.18.0&lt;/span&gt;
&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; TURBOQUANT_VERSION=1.1.0&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="s2"&gt;"turboquant-vllm[vllm]==&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TURBOQUANT_VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s2"&gt;import importlib.metadata; &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s2"&gt;eps = [e for e in importlib.metadata.entry_points(group='vllm.general_plugins') &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s2"&gt;       if e.name == 'tq4_backend']; &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s2"&gt;assert len(eps) == 1, 'TQ4 entry point not found'"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Build-time verification matters because vLLM's plugin discovery is silent. If the entry point isn't registered, vLLM falls back to its default attention backend without any error. You'd serve uncompressed inference thinking you had 3.76x compression. The &lt;code&gt;assert&lt;/code&gt; in the Containerfile makes that failure loud and early.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;TURBOQUANT_VERSION&lt;/code&gt; build arg means you can pin or upgrade versions without editing the file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Install from PyPI if you don't need the container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;turboquant-vllm[vllm]
vllm serve allenai/Molmo2-8B &lt;span class="nt"&gt;--attention-backend&lt;/span&gt; CUSTOM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or build the container for reproducible deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;podman build &lt;span class="nt"&gt;-t&lt;/span&gt; vllm-turboquant &lt;span class="nt"&gt;-f&lt;/span&gt; infra/Containerfile.vllm &lt;span class="nb"&gt;.&lt;/span&gt;
podman run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--device&lt;/span&gt; nvidia.com/gpu&lt;span class="o"&gt;=&lt;/span&gt;all &lt;span class="nt"&gt;-p&lt;/span&gt; 8000:8000 &lt;span class="se"&gt;\&lt;/span&gt;
  vllm-turboquant &lt;span class="nt"&gt;--model&lt;/span&gt; allenai/Molmo2-8B &lt;span class="nt"&gt;--attention-backend&lt;/span&gt; CUSTOM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify compression is active in the logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INFO [cuda.py:257] Using AttentionBackendEnum.CUSTOM backend.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Upstream vLLM contribution (&lt;a href="https://github.com/vllm-project/vllm/issues/38171" rel="noopener noreferrer"&gt;vllm#38171&lt;/a&gt; — 49 upvotes)&lt;/li&gt;
&lt;li&gt;Full Flash Attention-style kernel fusion for multi-layer correctness&lt;/li&gt;
&lt;li&gt;Stacking with token pruning (VL-Cache) for multiplicative VLM savings&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://pypi.org/project/turboquant-vllm/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt; | &lt;a href="https://alberto-codes.github.io/turboquant-vllm/" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://github.com/Alberto-Codes/turboquant-vllm" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>vllm</category>
      <category>gpu</category>
      <category>containers</category>
    </item>
    <item>
      <title>I shipped Google's TurboQuant as a vLLM plugin 72 hours after the paper — here's what nobody else tested</title>
      <dc:creator>Alberto Nieto</dc:creator>
      <pubDate>Fri, 27 Mar 2026 22:57:34 +0000</pubDate>
      <link>https://dev.to/albertocodes/i-shipped-googles-turboquant-as-a-vllm-plugin-72-hours-after-the-paper-heres-what-nobody-else-473g</link>
      <guid>https://dev.to/albertocodes/i-shipped-googles-turboquant-as-a-vllm-plugin-72-hours-after-the-paper-heres-what-nobody-else-473g</guid>
      <description>&lt;p&gt;Google published &lt;a href="https://arxiv.org/abs/2504.19874" rel="noopener noreferrer"&gt;TurboQuant&lt;/a&gt; at ICLR 2026 — a technique that compresses transformer KV caches to 4 bits per coordinate with zero accuracy loss. The paper reports 5-6x memory reduction on H100 GPUs, tested on text models like Gemma and Mistral.&lt;/p&gt;

&lt;p&gt;I wanted to know: does it work on a &lt;strong&gt;vision-language model&lt;/strong&gt; processing &lt;strong&gt;video&lt;/strong&gt;? On a &lt;strong&gt;consumer GPU&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;72 hours later, &lt;code&gt;turboquant-vllm&lt;/code&gt; is on PyPI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;turboquant-vllm[vllm]
vllm serve allenai/Molmo2-8B &lt;span class="nt"&gt;--attention-backend&lt;/span&gt; CUSTOM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The plugin auto-registers via vLLM's entry point system. No code changes, no forking, no monkey-patching.&lt;/p&gt;

&lt;p&gt;For HuggingFace users:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DynamicCache&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;turboquant_vllm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CompressedDynamicCache&lt;/span&gt;

&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DynamicCache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;compressed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CompressedDynamicCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;head_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Pass cache (not wrapper) to model.generate()
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why Vision-Language Models Matter
&lt;/h2&gt;

&lt;p&gt;Every other TurboQuant implementation tests on text-only models with hundreds of tokens. But a 12-second video clip through Molmo2-4B produces ~11,000 visual tokens — 1.6 GB of KV cache on a 24 GB GPU.&lt;/p&gt;
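&lt;p&gt;The 1.6 GB figure is plausible from first principles. A rough FP16 sizing sketch (the config numbers here are illustrative assumptions, not read from the Molmo2 config):&lt;/p&gt;

```python
# Back-of-envelope FP16 KV cache size for ~11,000 visual tokens,
# assuming 36 layers, 8 KV heads, head_dim=128, 2 bytes per value.
layers, kv_heads, head_dim, fp16 = 36, 8, 128, 2
per_token = 2 * kv_heads * head_dim * fp16 * layers   # K and V, all layers
total_mib = 11_000 * per_token / 2**20

assert per_token == 147_456
assert round(total_mib) == 1547   # same order as the 1.6 GB in the text
```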

&lt;p&gt;That's 10x more memory, 10x more opportunities for precision bugs to compound across 36 transformer layers. The existing VLM compression literature (VL-Cache, Dynamic-LLaVA, ZipVL) is all token pruning — deciding which tokens to discard. TurboQuant compresses the tokens you keep. They're complementary approaches, and nobody had validated whether vector quantization survives the visual token regime.&lt;/p&gt;

&lt;p&gt;It does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Molmo2-4B on RTX 4090, 11K visual tokens from a Seinfeld video clip:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;TQ4 Compressed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;KV cache&lt;/td&gt;
&lt;td&gt;1,639 MiB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;435 MiB (3.76x)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output quality&lt;/td&gt;
&lt;td&gt;Detailed scene description&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Near-identical&lt;/strong&gt; (100+ tokens match)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decode overhead&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;1.78x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Molmo2-8B: same 3.76x ratio, correctly identifies all Seinfeld characters. Full 23-minute episode processed at 24 tok/s.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built Differently
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Plugin, not fork
&lt;/h3&gt;

&lt;p&gt;Other vLLM TurboQuant efforts are forks or monkey-patches. &lt;code&gt;turboquant-vllm&lt;/code&gt; uses vLLM's official plugin entry point:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[project.entry-points."vllm.general_plugins"]&lt;/span&gt;
&lt;span class="py"&gt;tq4_backend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"turboquant_vllm.vllm:register_tq4_backend"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Incremental dequantization
&lt;/h3&gt;

&lt;p&gt;The naive approach decompresses the full KV cache at every layer, every step — 3.36x overhead. Incremental dequantization decompresses only the single new token generated each step and appends it to a running buffer. Overhead drops to 1.78x. This isn't in Google's paper.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cross-platform Triton
&lt;/h3&gt;

&lt;p&gt;Fused kernels run on both NVIDIA CUDA and AMD ROCm without code changes. 84/84 GPU tests pass on a Radeon 890M iGPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bugs Nobody Else Has Found
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;FP16 norms fail at scale.&lt;/strong&gt; Works at 11,385 tokens, garbles output at 11,397 tokens. The 0.01% error per vector compounds across 36 layers. Always use fp32.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;QJL correction is invisible in standard attention.&lt;/strong&gt; The paper's Stage 2 (2-bit MSE + 1-bit QJL) wastes 1 bit of precision — standard &lt;code&gt;Q @ K^T&lt;/code&gt; can't use the correction. Full 3-bit MSE produces identical output.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-layer precision drift in fused kernels.&lt;/strong&gt; A 0.023 cosine gap per layer between fp32 Triton and bf16 SDPA compounds to produce "pizza pizza pizza" at 36 layers. Flash Attention-style fusion needed.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Validation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;180+ tests, 9 test files, 95%+ coverage&lt;/li&gt;
&lt;li&gt;16 GPU experiments with documented failures&lt;/li&gt;
&lt;li&gt;Cross-platform: NVIDIA RTX 4090 + AMD Radeon 890M&lt;/li&gt;
&lt;li&gt;End-to-end: installed from PyPI into stock &lt;code&gt;vllm/vllm-openai:latest&lt;/code&gt; container&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Upstream contribution to vLLM (&lt;a href="https://github.com/vllm-project/vllm/issues/38171" rel="noopener noreferrer"&gt;issue #38171&lt;/a&gt;, 49 upvotes)&lt;/li&gt;
&lt;li&gt;Full Flash Attention fusion for the fused Triton kernels&lt;/li&gt;
&lt;li&gt;Stacking with VL-Cache-style token pruning for multiplicative VLM savings&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://pypi.org/project/turboquant-vllm/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt; | &lt;a href="https://github.com/Alberto-Codes/turboquant-vllm/tree/main/docs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://github.com/Alberto-Codes/turboquant-vllm" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>gpu</category>
    </item>
    <item>
      <title>I Implemented Google's TurboQuant and Tested It on a Vision-Language Model — Here's What the Paper Doesn't Tell You</title>
      <dc:creator>Alberto Nieto</dc:creator>
      <pubDate>Thu, 26 Mar 2026 08:24:13 +0000</pubDate>
      <link>https://dev.to/albertocodes/i-implemented-googles-turboquant-and-tested-it-on-a-vision-language-model-heres-what-the-paper-22a9</link>
      <guid>https://dev.to/albertocodes/i-implemented-googles-turboquant-and-tested-it-on-a-vision-language-model-heres-what-the-paper-22a9</guid>
      <description>&lt;p&gt;Google published &lt;a href="https://arxiv.org/abs/2504.19874" rel="noopener noreferrer"&gt;TurboQuant&lt;/a&gt; at ICLR 2026 — a technique that compresses transformer KV caches to 3-4 bits per coordinate with zero accuracy loss. Their paper reports 5-6x memory reduction and 8x attention speedup on H100 GPUs, tested on text-only models like Gemma and Mistral.&lt;/p&gt;

&lt;p&gt;I wanted to know: does it work on a &lt;strong&gt;vision-language model&lt;/strong&gt; processing &lt;strong&gt;video&lt;/strong&gt;? On a &lt;strong&gt;consumer GPU&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;So I implemented it from the paper and ran it on Molmo2-4B analyzing Seinfeld clips (~11,000 visual tokens) on an RTX 4090.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;TQ4 Compressed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;KV cache&lt;/td&gt;
&lt;td&gt;1,639 MiB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;435 MiB (3.76x)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output quality&lt;/td&gt;
&lt;td&gt;Detailed scene description&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Near-identical&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decode overhead&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;1.78x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The model describes the same Seinfeld scene with the same visual details. Different phrasing on minor points, both equally valid.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Things the Paper Doesn't Tell You
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. 4-bit nibble packing beats 3-bit unpacked
&lt;/h3&gt;

&lt;p&gt;The paper focuses on 3-bit quantization (8 centroids). But 3-bit values cross byte boundaries — no existing Python/Triton implementation actually packs them. Every implementation I found stores 3-bit indices in 8-bit bytes, wasting 62.5% of storage.&lt;/p&gt;

&lt;p&gt;4-bit indices pack trivially into nibbles:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pack two 4-bit indices into one byte
&lt;/span&gt;&lt;span class="n"&gt;packed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;[...,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;[...,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Unpack
&lt;/span&gt;&lt;span class="n"&gt;high&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;packed&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
&lt;span class="n"&gt;low&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;packed&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mh"&gt;0x0F&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And 16 centroids (4-bit) give ~97% cosine similarity vs ~95% for 8 centroids (3-bit). &lt;strong&gt;Better quality AND nearly double the compression&lt;/strong&gt; of unpacked 3-bit, with two lines of code.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Storage&lt;/th&gt;
&lt;th&gt;Compression&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;3-bit unpacked (uint8)&lt;/td&gt;
&lt;td&gt;132 B/block&lt;/td&gt;
&lt;td&gt;1.94x&lt;/td&gt;
&lt;td&gt;~95% cosine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4-bit nibble-packed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;68 B/block&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.76x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~97% cosine&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
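&lt;p&gt;The storage column follows directly from the block layout at &lt;code&gt;head_dim=128&lt;/code&gt;: one index per coordinate plus a 4-byte fp32 norm, against a 256-byte FP16 block:&lt;/p&gt;

```python
# Reproduce the table's storage and compression numbers from the layout.
head_dim, norm = 128, 4                  # 4 B fp32 norm per block
fp16_block = head_dim * 2                # 256 B uncompressed

unpacked_3bit = head_dim + norm          # one uint8 per 3-bit index: 132 B
packed_4bit = head_dim * 4 // 8 + norm   # two 4-bit indices per byte: 68 B

assert (unpacked_3bit, packed_4bit) == (132, 68)
assert round(fp16_block / unpacked_3bit, 2) == 1.94
assert round(fp16_block / packed_4bit, 2) == 3.76
```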

&lt;h3&gt;
  
  
  2. FP16 norms fail silently at scale
&lt;/h3&gt;

&lt;p&gt;I initially stored vector norms as float16 to save 2 bytes per vector. It worked fine on short sequences.&lt;/p&gt;

&lt;p&gt;Then I tested with an 11,397-token video clip and got:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Output: "In the video,1.0 0 0 0 0 0 0 0 0 0 0 0..."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model produced 4 correct tokens ("In the video,") then completely degenerated. The same prompt with 11,385 tokens (12 fewer) worked perfectly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; FP16 precision loss (~0.01% per vector) accumulated across 36 transformer layers, shifting attention logits at low-confidence decision boundaries. Token-by-token analysis showed the divergence at step 5 where the logit margin was &amp;lt; 0.5.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Float32 norms. The 2 extra bytes per vector barely affect the compression ratio (1.97x → 1.94x). No other TurboQuant implementation I've found documents this failure mode.&lt;/p&gt;
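&lt;p&gt;The size of the per-vector error matches fp16's precision: roughly 11 significand bits, so storing a norm in fp16 introduces a relative rounding error of up to about 2^-11 (~0.05%). A quick demonstration of the round-trip loss:&lt;/p&gt;

```python
import numpy as np

# Round-trip a vector norm through fp16 and measure the relative error.
rng = np.random.default_rng(0)
v = rng.standard_normal(128).astype(np.float32)
norm32 = float(np.linalg.norm(v))
norm16 = float(np.float16(norm32))       # what fp16 storage would keep
rel_err = abs(norm16 - norm32) / norm32

# Tiny for one vector (within fp16's ~2**-11 rounding), but these
# errors accumulate across 36 layers of attention.
assert min(rel_err, 5e-4) == rel_err
```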

&lt;h3&gt;
  
  
  3. The boring optimization wins
&lt;/h3&gt;

&lt;p&gt;I built a fused Triton kernel that computes &lt;code&gt;Q @ compressed_K^T&lt;/code&gt; directly from nibble-packed indices. It achieves &lt;strong&gt;17.8x speedup&lt;/strong&gt; on the micro-benchmark by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pre-rotating queries once (&lt;code&gt;q_rot = Q @ rotation_matrix^T&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Having the kernel do simple centroid lookups and dot products&lt;/li&gt;
&lt;li&gt;Eliminating the 128x128 rotation matmul from the inner loop&lt;/li&gt;
&lt;/ul&gt;
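&lt;p&gt;The pre-rotation step is safe because the rotation is orthogonal: rotating both queries and keys leaves every dot product unchanged, so the rotation can be applied to Q once up front instead of being undone for K inside the inner loop. A numerical check with a random orthogonal matrix:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
# Random orthogonal rotation from a QR decomposition.
rotation, _ = np.linalg.qr(rng.standard_normal((128, 128)))

Q = rng.standard_normal((4, 128))    # queries
K = rng.standard_normal((16, 128))   # keys (stands in for dequantized K)

q_rot = Q @ rotation.T               # done once, outside the kernel
k_rot = K @ rotation.T               # what the quantizer operates on

# Q R^T (K R^T)^T = Q R^T R K^T = Q K^T, since R^T R = I.
assert np.allclose(Q @ K.T, q_rot @ k_rot.T)
```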

&lt;p&gt;Sounds great. But when I wired it into all 36 layers of Molmo2, the output degenerated into "pizza pizza pizza pizza."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; The fused kernel computes in fp32, but the model expects bf16 attention behavior (via SDPA/FlashAttention). The 0.023 per-layer cosine gap between fp32 kernel output and bf16 SDPA output compounds catastrophically across 36 layers.&lt;/p&gt;

&lt;p&gt;The fix that actually worked: &lt;strong&gt;incremental dequantization&lt;/strong&gt;. Instead of decompressing the entire 11K-token cache at every layer on every decode step (the 3.36x overhead), decompress only the single new token and append it to a running buffer. Standard SDPA handles the attention.&lt;/p&gt;

&lt;p&gt;Overhead went from 3.36x to 1.78x. No custom kernels needed.&lt;/p&gt;
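&lt;p&gt;The decode-time pattern, as a sketch (names are illustrative, not the project's API):&lt;/p&gt;

```python
class IncrementalKV:
    """Keep a running decompressed buffer and dequantize only the newly
    appended token each decode step, instead of re-decompressing the
    entire cache at every layer."""

    def __init__(self, decompress):
        self.decompress = decompress
        self.buffer = []              # stands in for the running K/V tensor

    def append(self, compressed_token):
        # O(1) work per step: one token, not the whole 11K-token cache.
        self.buffer.append(self.decompress(compressed_token))
        return self.buffer            # standard SDPA attends over this

kv = IncrementalKV(decompress=lambda t: t * 2)
kv.append(1)
kv.append(2)
assert kv.buffer == [2, 4]
```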

&lt;h2&gt;
  
  
  The Fused Kernel Isn't Wasted
&lt;/h2&gt;

&lt;p&gt;The kernel is correct (1.0 cosine similarity on the micro-benchmark) and fast (17.8x). It just needs to be part of a full Flash Attention-style fusion — computing softmax and V multiplication inside the kernel, not just Q@K^T scores. That's a future project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Details
&lt;/h2&gt;

&lt;p&gt;The full implementation is at &lt;strong&gt;&lt;a href="https://github.com/Alberto-Codes/turboquant-consumer" rel="noopener noreferrer"&gt;github.com/Alberto-Codes/turboquant-consumer&lt;/a&gt;&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Core algorithm:&lt;/strong&gt; Lloyd-Max codebook solver, TurboQuantMSE (Stage 1), TurboQuantProd (Stage 2 with QJL)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CompressedDynamicCache:&lt;/strong&gt; Drop-in KV cache wrapper with nibble packing and incremental dequant&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fused Triton kernel:&lt;/strong&gt; Nibble unpack + centroid gather + GQA mapping&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark harness:&lt;/strong&gt; A/B testing CLI for any HuggingFace model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;62 tests&lt;/strong&gt; including long-sequence regression (36 layers, 1024 tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5 experiment logs&lt;/strong&gt; with full results
&lt;/li&gt;
&lt;/ul&gt;
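&lt;p&gt;The nibble packing mentioned above stores two 4-bit codes per byte. A minimal sketch of the round trip, with hypothetical helper names rather than the library's API:&lt;/p&gt;

```python
def pack_nibbles(codes):
    """Pack pairs of 4-bit codes (0..15) into single bytes."""
    out = bytearray()
    for i in range(0, len(codes), 2):
        hi = codes[i]
        lo = codes[i + 1] if i + 1 != len(codes) else 0  # pad odd lengths
        out.append(hi * 16 + lo)  # high nibble in the top 4 bits
    return bytes(out)

def unpack_nibbles(packed):
    """Recover the 4-bit codes from packed bytes."""
    codes = []
    for b in packed:
        codes.append(b // 16)  # high nibble
        codes.append(b % 16)   # low nibble
    return codes

packed = pack_nibbles([3, 12, 15, 1])
print(list(packed), unpack_nibbles(packed))  # [60, 241] [3, 12, 15, 1]
```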

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Quick start&lt;/span&gt;
git clone https://github.com/Alberto-Codes/turboquant-consumer.git
&lt;span class="nb"&gt;cd &lt;/span&gt;turboquant-consumer &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; uv &lt;span class="nb"&gt;sync&lt;/span&gt;

&lt;span class="c"&gt;# Run tests&lt;/span&gt;
uv run pytest tests/ &lt;span class="nt"&gt;-v&lt;/span&gt;

&lt;span class="c"&gt;# Benchmark (requires GPU)&lt;/span&gt;
uv run python &lt;span class="nt"&gt;-m&lt;/span&gt; turboquant_consumer.benchmark &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--model&lt;/span&gt; allenai/Molmo2-4B &lt;span class="nt"&gt;--bits&lt;/span&gt; 4 &lt;span class="nt"&gt;--compressed&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--video&lt;/span&gt; /path/to/video.mp4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Molmo2-8B validation&lt;/strong&gt; — the 8B model recognizes character names&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flash Attention-style fused kernel&lt;/strong&gt; — full softmax+V fusion for multi-layer correctness&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM integration&lt;/strong&gt; — waiting for upstream cache backend API&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This is the first TurboQuant implementation validated on a vision-language model with video input. If you're working on KV cache compression, I'd love to hear about your experiences — especially if you've hit the fp16 norms issue.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>gpu</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Your docstrings are lying — docvet 1.14 catches them</title>
      <dc:creator>Alberto Nieto</dc:creator>
      <pubDate>Mon, 23 Mar 2026 00:04:09 +0000</pubDate>
      <link>https://dev.to/albertocodes/your-docstrings-are-lying-docvet-114-catches-them-9a</link>
      <guid>https://dev.to/albertocodes/your-docstrings-are-lying-docvet-114-catches-them-9a</guid>
      <description>&lt;p&gt;A &lt;a href="https://arxiv.org/abs/2404.03114" rel="noopener noreferrer"&gt;2024 study by Macke &amp;amp; Doyle&lt;/a&gt; found that incorrect documentation degrades LLM task success by 22.6 percentage points. Missing documentation? No statistically significant effect. Your AI coding assistant performs &lt;em&gt;worse&lt;/em&gt; with wrong docs than with no docs at all.&lt;/p&gt;

&lt;p&gt;That's the gap docvet fills. And with v1.14, it closes the gap further — checking not just whether your docstrings exist, but whether they &lt;em&gt;match your code&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changed
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Parameter Agreement Checks
&lt;/h3&gt;

&lt;p&gt;Two new rules — &lt;code&gt;missing-param-in-docstring&lt;/code&gt; and &lt;code&gt;extra-param-in-docstring&lt;/code&gt; — compare function signatures against &lt;code&gt;Args:&lt;/code&gt; sections, parameter by parameter.&lt;/p&gt;

&lt;p&gt;You know the drill: you rename &lt;code&gt;retries&lt;/code&gt; to &lt;code&gt;max_retries&lt;/code&gt; across a refactor, update every call site, and forget the docstring. Now docvet catches it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/client.py:47: missing-param-in-docstring Function 'connect' has parameters not documented in Args: max_retries [required]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Handles positional-only params (PEP 570), keyword-only, &lt;code&gt;self&lt;/code&gt;/&lt;code&gt;cls&lt;/code&gt; exclusion, and both Google and Sphinx styles.&lt;/p&gt;
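&lt;p&gt;The core of a parameter-agreement check fits in a few lines. This is my own simplification, not docvet's implementation: compare the signature against names listed under a Google-style &lt;code&gt;Args:&lt;/code&gt; section.&lt;/p&gt;

```python
import inspect
import re

def args_in_docstring(doc):
    """Collect parameter names listed under a Google-style Args: section."""
    in_args = False
    names = set()
    for line in (doc or "").splitlines():
        if line.strip() == "Args:":
            in_args = True
            continue
        m = re.match(r"\s+(\w+)\s*(\(.*\))?:", line) if in_args else None
        if m:
            names.add(m.group(1))
    return names

def check(fn):
    """Return (missing-from-docstring, extra-in-docstring) name sets."""
    sig = {p for p in inspect.signature(fn).parameters if p not in ("self", "cls")}
    doc = args_in_docstring(fn.__doc__)
    return sig - doc, doc - sig

def connect(host, max_retries=3):
    """Connect.

    Args:
        host: Hostname.
        retries: Retry count.
    """

missing, extra = check(connect)
print(missing, extra)  # {'max_retries'} {'retries'}
```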

&lt;h3&gt;
  
  
  Reverse Enrichment Checks
&lt;/h3&gt;

&lt;p&gt;Before 1.14, docvet asked "did the docstring mention this behavior?" Now it also asks the reverse: "does the docstring &lt;em&gt;claim&lt;/em&gt; behavior the code doesn't exhibit?"&lt;/p&gt;

&lt;p&gt;Three new rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;extra-raises-in-docstring&lt;/code&gt; — documents exceptions the function never raises&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;extra-yields-in-docstring&lt;/code&gt; — documents yields in a non-generator&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;extra-returns-in-docstring&lt;/code&gt; — documents returns the function never makes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A docstring that claims &lt;code&gt;FileNotFoundError&lt;/code&gt; when the function never raises anything is a trap. Callers write &lt;code&gt;try/except&lt;/code&gt; blocks for phantom exceptions. AI tools generate defensive code for errors that can't happen.&lt;/p&gt;
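&lt;p&gt;Detecting a phantom &lt;code&gt;Raises:&lt;/code&gt; entry is a static comparison: exceptions named in the docstring versus &lt;code&gt;raise&lt;/code&gt; statements in the AST. A simplified sketch, not docvet's actual implementation:&lt;/p&gt;

```python
import ast
import textwrap

SOURCE = textwrap.dedent('''
    def label(code):
        """Look up a label.

        Raises:
            KeyError: If the code is unknown.
        """
        return LABELS.get(code, "unknown")  # .get() never raises KeyError
''')

def raised_names(fn_node):
    """Exception names the function body actually raises."""
    names = set()
    for node in ast.walk(fn_node):
        if isinstance(node, ast.Raise) and node.exc is not None:
            exc = node.exc
            if isinstance(exc, ast.Call):
                exc = exc.func
            if isinstance(exc, ast.Name):
                names.add(exc.id)
    return names

fn = ast.parse(SOURCE).body[0]
documented = {"KeyError"}  # as parsed from the Raises: section above
phantom = documented - raised_names(fn)
print(phantom)  # {'KeyError'} is documented but never raised
```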

&lt;h3&gt;
  
  
  Trivial Docstring Detection
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_user&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Get user.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This passes every presence check but adds zero information. The &lt;code&gt;trivial-docstring&lt;/code&gt; rule decomposes symbol names and summaries into word sets, filters stop words, and flags cases where the summary is just an echo of the name.&lt;/p&gt;
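&lt;p&gt;The word-set comparison behind the rule can be approximated like this; it's a simplification of the actual heuristic:&lt;/p&gt;

```python
import re

STOP_WORDS = {"the", "a", "an", "this", "that", "of", "to"}

def words(text):
    """Lowercased word set of a name or summary, stop words removed."""
    return {w.lower() for w in re.findall(r"[A-Za-z]+", text)} - STOP_WORDS

def is_trivial(name, summary):
    """The summary adds nothing if its words are a subset of the name's."""
    return words(summary) != set() and words(summary).issubset(words(name))

print(is_trivial("get_user", "Get user."))  # True: pure echo of the name
print(is_trivial("get_user", "Fetch the current user from the session."))  # False
```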

&lt;h3&gt;
  
  
  Also in This Release
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;missing-deprecation&lt;/strong&gt; — catches &lt;code&gt;warnings.warn(DeprecationWarning)&lt;/code&gt; or &lt;code&gt;@deprecated&lt;/code&gt; (PEP 702) without a deprecation notice in the docstring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;missing-return-type&lt;/strong&gt; (opt-in) — flags &lt;code&gt;Returns:&lt;/code&gt; sections with no type when there's no return annotation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;undocumented-init-params&lt;/strong&gt; (opt-in) — catches &lt;code&gt;__init__&lt;/code&gt; methods with parameters but no &lt;code&gt;Args:&lt;/code&gt; section&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Design note:&lt;/strong&gt; Reverse checks use &lt;code&gt;recommended&lt;/code&gt; severity (not &lt;code&gt;required&lt;/code&gt;) to account for delegation patterns. Two rules are opt-in for progressive adoption. &lt;a href="https://alberto.codes/blog/2026-03-22-when-docstrings-lie-your-ai-tools-pay-the-price" rel="noopener noreferrer"&gt;Full design tradeoffs in the blog post.&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;&lt;span class="nv"&gt;docvet&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;1.14.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Param agreement and reverse checks are on by default. Opt-in rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[tool.docvet.enrichment]&lt;/span&gt;
&lt;span class="py"&gt;require-return-type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;require-init-params&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docvet check src/ &lt;span class="nt"&gt;--all&lt;/span&gt; &lt;span class="nt"&gt;--verbose&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Semantic verification — not just "did you document the parameters?" but "is what you said about them accurate?"&lt;/li&gt;
&lt;li&gt;Expanding multi-style support across all rule categories&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://pypi.org/project/docvet/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt; | &lt;a href="https://alberto-codes.github.io/docvet/" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://github.com/Alberto-Codes/docvet" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>documentation</category>
      <category>ai</category>
      <category>developertools</category>
    </item>
    <item>
      <title>How docvet learned to read Sphinx and NumPy docstrings</title>
      <dc:creator>Alberto Nieto</dc:creator>
      <pubDate>Sun, 08 Mar 2026 16:27:53 +0000</pubDate>
      <link>https://dev.to/albertocodes/how-docvet-learned-to-read-sphinx-and-numpy-docstrings-2o6</link>
      <guid>https://dev.to/albertocodes/how-docvet-learned-to-read-sphinx-and-numpy-docstrings-2o6</guid>
      <description>&lt;h2&gt;
  
  
  The problem: one inspector, one language
&lt;/h2&gt;

&lt;p&gt;docvet checks whether your Python docstrings are present, complete, accurate, and renderable. Since v1.0, it's caught missing &lt;code&gt;Raises:&lt;/code&gt; sections, stale docstrings after code changes, broken mkdocs rendering, and more — 22 rules across five check modules.&lt;/p&gt;

&lt;p&gt;But it only understood Google-style.&lt;/p&gt;

&lt;p&gt;That's a problem, because a huge portion of the Python ecosystem uses something else:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sphinx/RST style&lt;/strong&gt; (&lt;code&gt;:param name:&lt;/code&gt;, &lt;code&gt;:returns:&lt;/code&gt;, &lt;code&gt;:raises:&lt;/code&gt;) — Django, Flask, SQLAlchemy, requests, boto3, CPython stdlib&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NumPy style&lt;/strong&gt; (underlined section headers) — NumPy, SciPy, pandas, scikit-learn, matplotlib&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you maintain a Django app or a scientific Python library, docvet's enrichment checks couldn't parse your docstrings. As of v1.13.0, that's fixed.&lt;/p&gt;

&lt;h2&gt;
  
  
  How style support works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  It's a project-level setting, not auto-detection
&lt;/h3&gt;

&lt;p&gt;docvet doesn't guess your style per-file. You tell it once in &lt;code&gt;pyproject.toml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[tool.docvet]&lt;/span&gt;
&lt;span class="py"&gt;docstring-style&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"sphinx"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two valid options: &lt;code&gt;"google"&lt;/code&gt; (default) and &lt;code&gt;"sphinx"&lt;/code&gt;. The setting applies project-wide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The NumPy twist:&lt;/strong&gt; NumPy-style underlined headers are recognized automatically in the default Google mode. If your project uses NumPy-style docstrings, you don't need to change anything — it already works.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sphinx/RST parsing
&lt;/h3&gt;

&lt;p&gt;When you set &lt;code&gt;docstring-style = "sphinx"&lt;/code&gt;, docvet maps field-list directives to the same internal section model used for Google-style:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Sphinx directive&lt;/th&gt;
&lt;th&gt;Maps to&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;:param:&lt;/code&gt;, &lt;code&gt;:type:&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Args&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;:returns:&lt;/code&gt;, &lt;code&gt;:return:&lt;/code&gt;, &lt;code&gt;:rtype:&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Returns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;:raises:&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Raises&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;:ivar:&lt;/code&gt;, &lt;code&gt;:cvar:&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Attributes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.. seealso::&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;See Also&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;.. code-block::&lt;/code&gt;, &lt;code&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Examples&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This means all existing enrichment rules apply — &lt;code&gt;missing-raises&lt;/code&gt;, &lt;code&gt;missing-examples&lt;/code&gt;, &lt;code&gt;missing-attributes&lt;/code&gt;, and the rest.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5432&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Connection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Open a database connection.

    :param host: Hostname or IP address.
    :param port: Port number.
    :returns: An active connection object.
    :raises ConnectionError: If the host is unreachable.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;docvet checks this the same way it checks a Google-style docstring — are all raised exceptions documented? Are there parameters in the signature not covered by &lt;code&gt;:param:&lt;/code&gt; directives?&lt;/p&gt;
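&lt;p&gt;Extracting those field lists is mostly pattern matching. Under stated assumptions (a simplified regex, not docvet's parser), it looks roughly like:&lt;/p&gt;

```python
import re

DOC = """Open a database connection.

:param host: Hostname or IP address.
:param port: Port number.
:returns: An active connection object.
:raises ConnectionError: If the host is unreachable.
"""

# Map each field-list directive onto the section it feeds.
params = re.findall(r"^:param (\w+):", DOC, re.MULTILINE)
raises = re.findall(r"^:raises (\w+):", DOC, re.MULTILINE)

signature_params = ["host", "port"]
undocumented = [p for p in signature_params if p not in params]
print(params, raises, undocumented)  # ['host', 'port'] ['ConnectionError'] []
```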

&lt;p&gt;&lt;strong&gt;Auto-disabled rules:&lt;/strong&gt; Five enrichment rules that have no Sphinx/RST equivalent are automatically disabled in Sphinx mode: &lt;code&gt;require_yields&lt;/code&gt;, &lt;code&gt;require_receives&lt;/code&gt;, &lt;code&gt;require_warns&lt;/code&gt;, &lt;code&gt;require_other_parameters&lt;/code&gt;, and &lt;code&gt;prefer_fenced_code_blocks&lt;/code&gt;. You can override any of them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[tool.docvet]&lt;/span&gt;
&lt;span class="py"&gt;docstring-style&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"sphinx"&lt;/span&gt;

&lt;span class="nn"&gt;[tool.docvet.enrichment]&lt;/span&gt;
&lt;span class="py"&gt;require_yields&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  &lt;span class="c"&gt;# re-enable if your project uses a yields convention&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Griffe compatibility:&lt;/strong&gt; The griffe check is auto-skipped in Sphinx mode, since griffe's parser targets Google-style docstrings.&lt;/p&gt;

&lt;h3&gt;
  
  
  NumPy section recognition
&lt;/h3&gt;

&lt;p&gt;NumPy-style uses section headers with matching-length underlines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Apply transformation along an axis.

    Parameters
    ----------
    data : array_like
        Input data.
    axis : int, optional
        Axis along which to operate.

    Returns
    -------
    result : ndarray
        Transformed data.

    Raises
    ------
    ValueError
        If axis is out of bounds.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the default Google mode, docvet already recognizes these headers alongside Google colon-format headers. The section parser looks for 3+ consecutive dashes or equals signs on the line following a known header name. No config change needed.&lt;/p&gt;
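&lt;p&gt;That recognition rule is small enough to sketch; this is a simplification of the real parser:&lt;/p&gt;

```python
import re

KNOWN = {"Parameters", "Returns", "Raises", "Yields", "Examples"}

def is_numpy_header(line, next_line):
    """A known header name underlined by 3 or more dashes or equals signs."""
    return (
        line.strip() in KNOWN
        and re.fullmatch(r"[-=]{3,}", next_line.strip()) is not None
    )

print(is_numpy_header("Parameters", "----------"))  # True
print(is_numpy_header("Parameters", "--"))          # False: underline too short
print(is_numpy_header("Notes?", "-------"))         # False: unknown header
```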

&lt;p&gt;NumPy-specific sections like &lt;code&gt;Notes&lt;/code&gt;, &lt;code&gt;References&lt;/code&gt;, &lt;code&gt;Warnings&lt;/code&gt;, &lt;code&gt;Extended Summary&lt;/code&gt;, and &lt;code&gt;Methods&lt;/code&gt; are recognized for section boundary detection but don't have their own enforcement rules — they won't trigger findings.&lt;/p&gt;

&lt;h2&gt;
  
  
  New rules
&lt;/h2&gt;

&lt;h3&gt;
  
  
  missing-returns
&lt;/h3&gt;

&lt;p&gt;Functions that return a value should document what they return:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docvet flags this — return value is undocumented
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_total&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Item&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Sum all item prices.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rule skips cases where a return section doesn't make sense:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stubs&lt;/strong&gt; (&lt;code&gt;...&lt;/code&gt; or &lt;code&gt;pass&lt;/code&gt; body) — interface definitions, not implementations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;__init__&lt;/code&gt; methods&lt;/strong&gt; — return &lt;code&gt;None&lt;/code&gt; by convention&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Properties&lt;/strong&gt; — the getter docstring describes the attribute, not a return value&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Re-raise-only functions&lt;/strong&gt; — they don't meaningfully "return"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Works in both Google and Sphinx modes.&lt;/p&gt;
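&lt;p&gt;The skip logic above is mechanical AST inspection. A condensed sketch (a hypothetical helper, not docvet's code) covering the stub, &lt;code&gt;__init__&lt;/code&gt;, and property cases:&lt;/p&gt;

```python
import ast
import textwrap

def needs_returns_section(src):
    """True if a function returns a value and isn't a stub/__init__/property."""
    fn = ast.parse(textwrap.dedent(src)).body[0]
    # Docstrings and `...` are Expr/Constant nodes; drop them to find stubs.
    stmts = [
        s for s in fn.body
        if not (isinstance(s, ast.Expr) and isinstance(s.value, ast.Constant))
    ]
    if not stmts or all(isinstance(s, ast.Pass) for s in stmts):
        return False  # stub: interface definition, not implementation
    if fn.name == "__init__":
        return False  # returns None by convention
    decorators = {d.id for d in fn.decorator_list if isinstance(d, ast.Name)}
    if "property" in decorators:
        return False  # getter docstring describes the attribute
    return any(
        isinstance(n, ast.Return) and n.value is not None for n in ast.walk(fn)
    )
```

The re-raise-only case is omitted here for brevity.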

&lt;h3&gt;
  
  
  overload-has-docstring
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;@typing.overload&lt;/code&gt; signatures describe distinct call patterns. They deserve docstrings explaining &lt;em&gt;when&lt;/em&gt; to use each variant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@overload&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="nd"&gt;@overload&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Parse input data into a dictionary.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;docvet flags the overload signatures missing docstrings. The existing &lt;code&gt;missing-docstring&lt;/code&gt; rule skips overloads to avoid double-reporting — each rule owns its scope cleanly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bigger picture
&lt;/h2&gt;

&lt;p&gt;This release is about &lt;strong&gt;reach&lt;/strong&gt;. docvet's quality model — six layers from presence to rendering — applies regardless of docstring style. The rules don't change; the parser learned new dialects.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Style&lt;/th&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Google&lt;/td&gt;
&lt;td&gt;Default, no config needed&lt;/td&gt;
&lt;td&gt;Original support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NumPy&lt;/td&gt;
&lt;td&gt;Default, no config needed&lt;/td&gt;
&lt;td&gt;Recognized automatically in Google mode&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sphinx/RST&lt;/td&gt;
&lt;td&gt;&lt;code&gt;docstring-style = "sphinx"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;One line in pyproject.toml&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you're on a Django team, a scientific Python project, or any codebase using Sphinx-style docs, docvet is ready.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; docvet
docvet check &lt;span class="nt"&gt;--all&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;22 rules across five check modules. Zero runtime dependencies beyond typer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://alberto-codes.github.io/docvet/" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://pypi.org/project/docvet/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>documentation</category>
      <category>devtools</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Encrypt Google ADK Sessions in 5 Minutes</title>
      <dc:creator>Alberto Nieto</dc:creator>
      <pubDate>Sat, 07 Mar 2026 07:09:05 +0000</pubDate>
      <link>https://dev.to/albertocodes/encrypt-google-adk-sessions-in-5-minutes-5b9f</link>
      <guid>https://dev.to/albertocodes/encrypt-google-adk-sessions-in-5-minutes-5b9f</guid>
      <description>&lt;p&gt;Google ADK stores everything your agent knows — tool calls, user messages, conversation context — in plaintext SQLite. If that makes you uncomfortable, this post fixes it.&lt;/p&gt;

&lt;p&gt;This is the recipe card. Ingredients, steps, done.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Python 3.12+&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An existing ADK agent&lt;/strong&gt; using &lt;code&gt;DatabaseSessionService&lt;/code&gt; (or a willingness to create a minimal one)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No system libraries, no C compilation, no Docker. The library is pure Python with two runtime dependencies: &lt;code&gt;google-adk&lt;/code&gt; and &lt;code&gt;cryptography&lt;/code&gt;. A short ingredient list.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;adk-secure-sessions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or with &lt;a href="https://docs.astral.sh/uv/" rel="noopener noreferrer"&gt;uv&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv add adk-secure-sessions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify the install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import adk_secure_sessions; print('OK')"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 2: Swap the Import
&lt;/h2&gt;

&lt;p&gt;Your agent code probably has something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before — ADK default (unencrypted):
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.sessions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DatabaseSessionService&lt;/span&gt;

&lt;span class="n"&gt;session_service&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DatabaseSessionService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;db_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sqlite+aiosqlite:///sessions.db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After — encrypted:
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;adk_secure_sessions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;EncryptedSessionService&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FernetBackend&lt;/span&gt;

&lt;span class="n"&gt;session_service&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EncryptedSessionService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;db_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sqlite+aiosqlite:///sessions.db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;FernetBackend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-secret-passphrase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two changes: the import line and the constructor. Everything else in your agent stays the same — &lt;code&gt;create_session&lt;/code&gt;, &lt;code&gt;get_session&lt;/code&gt;, &lt;code&gt;list_sessions&lt;/code&gt;, &lt;code&gt;delete_session&lt;/code&gt;, &lt;code&gt;append_event&lt;/code&gt; — the full ADK session lifecycle, identical behavior. The difference is what hits the disk.&lt;/p&gt;
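&lt;p&gt;To see that difference, here is the Fernet layer used directly via the &lt;code&gt;cryptography&lt;/code&gt; package. How adk-secure-sessions derives its key from the passphrase is an internal detail; this sketch just generates a random key:&lt;/p&gt;

```python
# Sketch of the Fernet layer only, using the `cryptography` package directly.
# Key derivation from a passphrase is an adk-secure-sessions internal; a
# random key stands in for it here.
import json
from cryptography.fernet import Fernet

f = Fernet(Fernet.generate_key())

state = {"patient_name": "Jane Doe", "api_key": "sk-secret-key-12345"}
ciphertext = f.encrypt(json.dumps(state).encode())  # this is what hits the disk

print(b"Jane Doe" in ciphertext)          # False: the token is opaque
print(json.loads(f.decrypt(ciphertext)))  # the original state dict
```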




&lt;h2&gt;
  
  
  Step 3: Use the Async Context Manager
&lt;/h2&gt;

&lt;p&gt;For proper connection cleanup, wrap the service in &lt;code&gt;async with&lt;/code&gt;. Here's a complete, runnable script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;adk_secure_sessions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;EncryptedSessionService&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FernetBackend&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FernetBackend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-secret-passphrase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;EncryptedSessionService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;db_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sqlite+aiosqlite:///sessions.db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Create a session with sensitive state
&lt;/span&gt;        &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user-123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patient_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Jane Doe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;diagnosis_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;J06.9&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-secret-key-12345&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Created session: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Retrieve — state is automatically decrypted
&lt;/span&gt;        &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user-123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Decrypted state: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# List sessions for this app/user
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_sessions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user-123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sessions found: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sessions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Clean up when you're done
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user-123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Session deleted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Copy this into a file and run it. The API behaves identically to ADK's &lt;code&gt;DatabaseSessionService&lt;/code&gt; — same methods, same signatures, same return types. The only difference is what's stored on disk: you've swapped a glass jar for a lockbox. Same ingredients go in, same ingredients come out, but nobody can peek inside without the key.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Verify the Encryption
&lt;/h2&gt;

&lt;p&gt;Trust but verify. Open the SQLite database directly and confirm the data is actually encrypted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using the &lt;code&gt;sqlite3&lt;/code&gt; CLI:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sqlite3 sessions.db &lt;span class="s2"&gt;"SELECT state FROM sessions LIMIT 1;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see a base64-encoded string — the encrypted envelope — not readable JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AQFnQUFBQUJuVm1Gc2RX...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Using Python:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sqlite3&lt;/span&gt;

&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sessions.db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT state FROM sessions LIMIT 1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# First 60 chars of the encrypted envelope
&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What you won't see: &lt;code&gt;{"patient_name": "Jane Doe", "diagnosis_code": "J06.9"}&lt;/code&gt;. That's the point. With &lt;code&gt;DatabaseSessionService&lt;/code&gt;, anyone with file access reads your mise en place. With &lt;code&gt;EncryptedSessionService&lt;/code&gt;, they see noise.&lt;/p&gt;

&lt;p&gt;For a more convincing demo, run the &lt;a href="https://github.com/Alberto-Codes/adk-secure-sessions/blob/main/examples/basic_usage.py" rel="noopener noreferrer"&gt;basic usage example&lt;/a&gt; from the repo — it runs a real multi-turn ADK agent with Ollama and then inspects the raw database to prove no plaintext leaks. After a three-turn conversation about patient intake, the database contains zero occurrences of "Jane Doe" or "headache."&lt;/p&gt;
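&lt;p&gt;That leak check is easy to reproduce yourself. Here's a minimal stdlib sketch of the same idea: scan every stored state value for a known plaintext marker. The table layout and envelope string below are hypothetical stand-ins, not the library's actual schema:&lt;/p&gt;

```python
import sqlite3

# Minimal leak check (hypothetical table layout, not the real schema):
# scan every stored state value for a known plaintext marker.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (id TEXT, state TEXT)")
# What an encrypted store holds: an opaque envelope, not readable JSON
conn.execute(
    "INSERT INTO sessions VALUES (?, ?)",
    ("session-1", "AQFnQUFBQUJuVm1Gc2RX..."),
)
rows = conn.execute("SELECT state FROM sessions").fetchall()
leaks = [state for (state,) in rows if "Jane Doe" in state]
print(f"Plaintext leaks found: {len(leaks)}")  # 0
conn.close()
```

&lt;p&gt;Point the same scan at a real &lt;code&gt;sessions.db&lt;/code&gt; and a plaintext-backed store will fail it immediately.&lt;/p&gt;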




&lt;h2&gt;
  
  
  Step 5: Manage Your Passphrase
&lt;/h2&gt;

&lt;p&gt;The passphrase is the only secret. Never hardcode it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;adk_secure_sessions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;EncryptedSessionService&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FernetBackend&lt;/span&gt;

&lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FernetBackend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SESSION_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set it in your environment, your &lt;code&gt;.env&lt;/code&gt; file, or your secrets manager. The library handles everything else — &lt;code&gt;FernetBackend&lt;/code&gt; derives a cryptographic key using PBKDF2-HMAC-SHA256 with 480,000 iterations. You don't need to generate, store, or rotate raw key material.&lt;/p&gt;
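&lt;p&gt;For intuition, the derivation step looks roughly like this stdlib sketch. The passphrase and salt below are illustrative; &lt;code&gt;FernetBackend&lt;/code&gt;'s actual salt handling and encoding are internal to the library:&lt;/p&gt;

```python
import base64
import hashlib

# Illustrative PBKDF2-HMAC-SHA256 derivation, not FernetBackend's
# internals. A fixed salt is used here for reproducibility; a real
# implementation generates a random salt and stores it with the data.
passphrase = b"correct-horse-battery-staple"
salt = b"example-salt-16b"  # hypothetical; real salts are random bytes
key = hashlib.pbkdf2_hmac("sha256", passphrase, salt, 480_000, dklen=32)

# Fernet consumes a urlsafe-base64-encoded 32-byte key
fernet_key = base64.urlsafe_b64encode(key)
print(len(key), len(fernet_key))  # 32 44
```

&lt;p&gt;The high iteration count is the point: it makes brute-forcing the passphrase expensive while costing you a one-time fraction of a second at startup.&lt;/p&gt;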

&lt;p&gt;If you try to read a session with a passphrase other than the one that encrypted it, you get a clear &lt;code&gt;DecryptionError&lt;/code&gt; — never garbage data, never silent corruption.&lt;/p&gt;
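&lt;p&gt;That guarantee comes from authenticated encryption: like Fernet, the service verifies an integrity tag before trusting the payload, so a key mismatch is detected rather than decoded into nonsense. A toy stdlib analogue of the verify-before-read step (illustration only, with no actual secrecy):&lt;/p&gt;

```python
import hashlib
import hmac
import os

# Toy authenticate-then-read envelope showing why a wrong key produces
# a clean error instead of garbage: the MAC is checked before the
# payload is trusted. Illustration only; Fernet also encrypts.
def seal(key, data):
    tag = hmac.new(key, data, hashlib.sha256).digest()
    return tag + data

def open_sealed(key, envelope):
    tag, data = envelope[:32], envelope[32:]
    expected = hmac.new(key, data, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("wrong key or tampered data")  # DecryptionError analogue
    return data

right, wrong = os.urandom(32), os.urandom(32)
envelope = seal(right, b'{"diagnosis_code": "J06.9"}')
print(open_sealed(right, envelope))  # original bytes back
try:
    open_sealed(wrong, envelope)
except ValueError as exc:
    print(exc)  # wrong key or tampered data
```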




&lt;h2&gt;
  
  
  What You Just Built
&lt;/h2&gt;

&lt;p&gt;Five steps, plaintext to encrypted-at-rest:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Installed&lt;/strong&gt; — &lt;code&gt;pip install adk-secure-sessions&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Swapped&lt;/strong&gt; — one import, one constructor change&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ran&lt;/strong&gt; — same API, encrypted storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verified&lt;/strong&gt; — the database contains ciphertext, not JSON&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secured&lt;/strong&gt; — passphrase in the environment, not the codebase&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Your agent still works the same way. Your tests still pass. But the SQLite file is now useless without the key — like a walk-in freezer with a combination lock. Nothing changes about how the food is stored or retrieved, but the back door isn't open anymore.&lt;/p&gt;




&lt;h2&gt;
  
  
  Error Handling
&lt;/h2&gt;

&lt;p&gt;When things go wrong, the library tells you what happened:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ConfigurationError&lt;/code&gt;&lt;/strong&gt; — raised at startup if the backend is misconfigured. You'll catch this before any data is written.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DecryptionError&lt;/code&gt;&lt;/strong&gt; — raised if you read a session with the wrong key. The library never returns garbage.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;adk_secure_sessions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ConfigurationError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;DecryptionError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;EncryptedSessionService&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;FernetBackend&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;EncryptedSessionService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;db_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sqlite+aiosqlite:///sessions.db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;FernetBackend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;correct-passphrase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user-123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;some-session-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Session not found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ConfigurationError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Backend doesn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t conform to EncryptionBackend protocol&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;DecryptionError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Wrong key — cannot decrypt session data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One install, one import change&lt;/strong&gt; — &lt;code&gt;pip install adk-secure-sessions&lt;/code&gt;, swap the constructor, done&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full ADK lifecycle&lt;/strong&gt; — &lt;code&gt;create_session&lt;/code&gt;, &lt;code&gt;get_session&lt;/code&gt;, &lt;code&gt;list_sessions&lt;/code&gt;, &lt;code&gt;delete_session&lt;/code&gt;, and &lt;code&gt;append_event&lt;/code&gt; all work identically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify it yourself&lt;/strong&gt; — inspect the SQLite file to confirm ciphertext, not plaintext&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Passphrase management&lt;/strong&gt; — use environment variables, never hardcode secrets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clear errors&lt;/strong&gt; — &lt;code&gt;DecryptionError&lt;/code&gt; for wrong keys, &lt;code&gt;ConfigurationError&lt;/code&gt; for bad setup&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://alberto-codes.github.io/adk-secure-sessions/" rel="noopener noreferrer"&gt;adk-secure-sessions documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://alberto-codes.github.io/adk-secure-sessions/getting-started/" rel="noopener noreferrer"&gt;Getting Started guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/adk-secure-sessions/" rel="noopener noreferrer"&gt;adk-secure-sessions on PyPI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://google.github.io/adk-docs/sessions/session/" rel="noopener noreferrer"&gt;Google ADK Session docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Alberto-Codes/adk-secure-sessions" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>security</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Let an Algorithm Rewrite My AI Agent's Prompts. It Found Things I Never Would Have.</title>
      <dc:creator>Alberto Nieto</dc:creator>
      <pubDate>Fri, 06 Mar 2026 07:13:19 +0000</pubDate>
      <link>https://dev.to/albertocodes/i-let-an-algorithm-rewrite-my-ai-agents-prompts-it-found-things-i-never-would-have-30dm</link>
      <guid>https://dev.to/albertocodes/i-let-an-algorithm-rewrite-my-ai-agents-prompts-it-found-things-i-never-would-have-30dm</guid>
      <description>&lt;p&gt;I started with this instruction for a Google ADK agent:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Greet the user appropriately."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Five words. Seemed fine. The agent produced decent greetings. I could've shipped it.&lt;/p&gt;

&lt;p&gt;Instead, I ran it through an evolutionary optimizer. Three iterations later, the instruction was three paragraphs long — covering formality tiers, period-appropriate language for different honorifics, tonal variation based on social context, and specific vocabulary constraints I never would have thought to include.&lt;/p&gt;

&lt;p&gt;The agent's quality score went from 0.35 to 0.81. Same model, same training examples, completely different output quality. The only thing that changed was the instruction text — and I didn't write a single word of the new one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Prompt engineering is guess-and-check. You write something, test it on a couple examples, tweak a word, test again. It works, kind of — like seasoning food without tasting it. You'll get something edible, but you'll never find the version that's genuinely great.&lt;/p&gt;

&lt;p&gt;The core issue: the space of possible instructions is infinite, and your intuition can only explore a tiny corner of it. You get stuck in local optima. You test against too few examples. You optimize for what &lt;em&gt;feels&lt;/em&gt; wrong instead of what &lt;em&gt;measurably&lt;/em&gt; underperforms.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix: Let an LLM Critique and Rewrite the Prompts
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Alberto-Codes/gepa-adk" rel="noopener noreferrer"&gt;gepa-adk&lt;/a&gt; automates this loop using evolutionary optimization (based on the &lt;a href="https://arxiv.org/abs/2507.19457" rel="noopener noreferrer"&gt;GEPA paper&lt;/a&gt;):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Run&lt;/strong&gt; the agent on training examples&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt; outputs with a critic agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reflect&lt;/strong&gt; — an LLM analyzes what went wrong&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mutate&lt;/strong&gt; — proposes a better instruction based on the analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep or discard&lt;/strong&gt; based on whether scores improve&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The mutation isn't random. The reflection model sees every output, every score, and every piece of critic feedback. It makes targeted changes. Think of it less like genetic mutation and more like a head chef tasting every plate and adjusting the recipe.&lt;/p&gt;
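&lt;p&gt;Stripped of the ADK machinery, the keep-or-discard loop can be sketched in a few lines. Every function below is a toy stand-in for the real agent run, critic scoring, and LLM reflection, not gepa-adk's API:&lt;/p&gt;

```python
# Hypothetical sketch of the evolve loop. All functions are stand-ins
# for the real agent run, critic scoring, and reflection steps.
def toy_score(instruction):
    # Stand-in critic: rewards longer, more specific instructions
    return min(len(instruction) / 200, 1.0)

def toy_mutate(instruction):
    # Stand-in reflection step: a targeted rewrite, not a random tweak
    return instruction + " Match formality to the speaker's social rank."

def evolve(instruction, iterations=3):
    best, best_score = instruction, toy_score(instruction)
    for _ in range(iterations):
        candidate = toy_mutate(best)
        candidate_score = toy_score(candidate)
        if candidate_score > best_score:  # keep or discard
            best, best_score = candidate, candidate_score
    return best, best_score

evolved, score = evolve("Greet the user appropriately.")
print(score > toy_score("Greet the user appropriately."))  # True
```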

&lt;p&gt;Here's the entire thing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LlmAgent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;gepa_adk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;evolve_sync&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SimpleCriticOutput&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LlmAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;greeter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Greet the user appropriately.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;critic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LlmAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Score for formal, Dickens-style greetings. 0.0-1.0.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SimpleCriticOutput&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;trainset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I am His Majesty, the King.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I am your mother.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I am a close friend.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evolve_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trainset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;critic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;critic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;original_score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; -&amp;gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;final_score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. &lt;code&gt;evolve_sync&lt;/code&gt; handles the loop. You get back the evolved instruction and the score trajectory.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Else Can Evolve
&lt;/h2&gt;

&lt;p&gt;Instructions are the default target, but gepa-adk can also optimize output schemas (Pydantic models), generation config (temperature, top-p), and even multi-agent systems — evolving how multiple agents coordinate together.&lt;/p&gt;

&lt;p&gt;The multi-agent case is where it gets wild. In a pipeline, one agent's instruction affects another agent's input. Evolving them together finds coordination patterns you'd never discover tuning each agent in isolation.&lt;/p&gt;

&lt;h2&gt;
  
  
  When This Makes Sense
&lt;/h2&gt;

&lt;p&gt;It shines when you have measurable quality criteria and diverse inputs, and when you're building for production, where the difference between 0.65 and 0.82 matters at scale.&lt;/p&gt;

&lt;p&gt;It's overkill for one-off prompts or tasks where "good enough" is actually good enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;gepa-adk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/Alberto-Codes/gepa-adk" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://pypi.org/project/gepa-adk/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt; | &lt;a href="https://alberto-codes.github.io/gepa-adk/" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://github.com/Alberto-Codes/gepa-adk/discussions/303" rel="noopener noreferrer"&gt;v1.0.0 Announcement&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the full deep dive on the evolution loop, critic agents, and architecture: &lt;a href="https://alberto.codes/blog/stop-writing-ai-agent-prompts-by-hand" rel="noopener noreferrer"&gt;Stop Writing AI Agent Prompts by Hand&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Based on the &lt;a href="https://arxiv.org/abs/2507.19457" rel="noopener noreferrer"&gt;GEPA paper&lt;/a&gt; — built on &lt;a href="https://google.github.io/adk-docs/" rel="noopener noreferrer"&gt;Google ADK&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's the worst prompt you've manually tuned into submission? I'm curious if evolution would've found something better.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>opensource</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Give Your AI Coding Agent a Docstring Quality Tool (MCP Setup for VS Code, Cursor, and Claude Code)</title>
      <dc:creator>Alberto Nieto</dc:creator>
      <pubDate>Wed, 04 Mar 2026 14:56:40 +0000</pubDate>
      <link>https://dev.to/albertocodes/give-your-ai-coding-agent-a-docstring-quality-tool-mcp-setup-for-vs-code-cursor-and-claude-code-4hdm</link>
      <guid>https://dev.to/albertocodes/give-your-ai-coding-agent-a-docstring-quality-tool-mcp-setup-for-vs-code-cursor-and-claude-code-4hdm</guid>
      <description>&lt;p&gt;Your AI coding agent can read your code, run your tests, and search your repo. But can it check whether your docstrings actually match what the code does?&lt;/p&gt;

&lt;p&gt;Research shows incorrect documentation drops LLM task success by &lt;a href="https://arxiv.org/abs/2404.03114" rel="noopener noreferrer"&gt;22.6 percentage points&lt;/a&gt;. Missing docs are annoying. &lt;em&gt;Wrong&lt;/em&gt; docs are toxic — they create false confidence in generated code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Alberto-Codes/docvet" rel="noopener noreferrer"&gt;docvet&lt;/a&gt; catches these gaps: 19 rules that flag docstrings that have drifted from the code they describe. Since v1.8, it ships an MCP server — meaning any MCP-aware editor can give its AI agent direct, programmatic access to those checks.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Your Agent Gets
&lt;/h2&gt;

&lt;p&gt;Two tools appear in the agent's toolbox:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;docvet_check&lt;/code&gt;&lt;/strong&gt; — Run checks on any Python file or directory. Returns structured JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"findings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"file"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"src/pipeline/extract.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"line"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"symbol"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"extract_text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rule"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"missing-raises"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Function 'extract_text' raises ValueError but has no Raises section"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"required"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"by_category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"recommended"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"files_checked"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;docvet_rules&lt;/code&gt;&lt;/strong&gt; — List all 19 rules with descriptions and categories.&lt;/p&gt;

&lt;p&gt;No CLI output to parse. No regex. Typed fields the agent reasons about directly.&lt;/p&gt;
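&lt;p&gt;As a sketch of what "typed fields the agent reasons about" can look like in practice, here is a small, hypothetical Python handler for a &lt;code&gt;docvet_check&lt;/code&gt; response shaped like the JSON above. The grouping logic is illustrative, not part of docvet:&lt;br&gt;
&lt;/p&gt;

```python
import json

def group_findings(payload: dict) -> dict:
    """Group docvet_check findings by category so "required" fixes surface first."""
    grouped: dict = {}
    for finding in payload.get("findings", []):
        grouped.setdefault(finding["category"], []).append(finding)
    return grouped

# A response shaped like the docvet_check example above.
report = json.loads("""
{"findings": [{"file": "src/pipeline/extract.py", "line": 42,
               "symbol": "extract_text", "rule": "missing-raises",
               "message": "raises ValueError but has no Raises section",
               "category": "required"}],
 "summary": {"total": 1, "by_category": {"required": 1}, "files_checked": 8}}
""")

for finding in group_findings(report).get("required", []):
    print(f"{finding['file']}:{finding['line']}  {finding['rule']}")
# prints: src/pipeline/extract.py:42  missing-raises
```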

&lt;h2&gt;
  
  
  Setup: One Block of JSON
&lt;/h2&gt;

&lt;p&gt;The MCP server runs on stdio via &lt;code&gt;uvx&lt;/code&gt; — no &lt;code&gt;pip install&lt;/code&gt; in your project, no virtual environment pollution, no global packages. &lt;code&gt;uvx&lt;/code&gt; downloads and runs docvet in an isolated environment automatically. You add the config and it just works.&lt;/p&gt;

&lt;h3&gt;
  
  
  VS Code
&lt;/h3&gt;

&lt;p&gt;Add to &lt;code&gt;.vscode/mcp.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"servers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"docvet"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"stdio"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uvx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"docvet[mcp]"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mcp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; VS Code uses &lt;code&gt;"servers"&lt;/code&gt;, not &lt;code&gt;"mcpServers"&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Cursor
&lt;/h3&gt;

&lt;p&gt;Add to &lt;code&gt;.cursor/mcp.json&lt;/code&gt; (project) or &lt;code&gt;~/.cursor/mcp.json&lt;/code&gt; (global):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"docvet"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uvx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"docvet[mcp]"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mcp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Claude Code
&lt;/h3&gt;

&lt;p&gt;One command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude mcp add &lt;span class="nt"&gt;--transport&lt;/span&gt; stdio &lt;span class="nt"&gt;--scope&lt;/span&gt; project docvet &lt;span class="nt"&gt;--&lt;/span&gt; uvx &lt;span class="s2"&gt;"docvet[mcp]"&lt;/span&gt; mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Others
&lt;/h3&gt;

&lt;p&gt;Windsurf, Claude Desktop, and anything that speaks MCP — same &lt;code&gt;mcpServers&lt;/code&gt; pattern. &lt;a href="https://alberto-codes.github.io/docvet/editor-integration/#client-configuration" rel="noopener noreferrer"&gt;Full configs here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Workflow
&lt;/h2&gt;

&lt;p&gt;Once configured, the agent uses docvet as part of its normal flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Agent opens a Python file to modify&lt;/li&gt;
&lt;li&gt;Agent runs &lt;code&gt;docvet_check&lt;/code&gt; on the file&lt;/li&gt;
&lt;li&gt;Findings come back — missing Raises sections, stale signatures, undocumented attributes&lt;/li&gt;
&lt;li&gt;Agent fixes the docstrings alongside the code change&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The feedback loop becomes automatic — like a line cook who taste-tests every dish before it leaves the pass. Code and documentation stay in sync because the agent checks both.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Add the &lt;code&gt;.vscode/mcp.json&lt;/code&gt; block above&lt;/li&gt;
&lt;li&gt;Open a Python file with a known gap (function raises an exception, no &lt;code&gt;Raises:&lt;/code&gt; section)&lt;/li&gt;
&lt;li&gt;Ask your AI agent to check the file with docvet&lt;/li&gt;
&lt;li&gt;Watch it fix the docstring&lt;/li&gt;
&lt;/ol&gt;
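&lt;p&gt;If you don't have a file with a known gap handy, a minimal hypothetical one for step 2 looks like the following. The function raises &lt;code&gt;ValueError&lt;/code&gt;, but its docstring has no &lt;code&gt;Raises:&lt;/code&gt; section:&lt;br&gt;
&lt;/p&gt;

```python
# demo_gap.py: a deliberately flawed example file (names are made up).
# extract_text raises ValueError, but the docstring never documents it,
# which is exactly the kind of gap described in step 2.
def extract_text(path: str) -> str:
    """Return the text contents of the file at the given path."""
    if not path.endswith(".txt"):
        raise ValueError("only .txt files are supported")
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```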




&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://alberto-codes.github.io/docvet/editor-integration/#__tabbed_1_3" rel="noopener noreferrer"&gt;VS Code MCP setup (copy-paste config)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://alberto-codes.github.io/docvet/editor-integration/" rel="noopener noreferrer"&gt;Full editor integration docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Alberto-Codes/docvet/discussions" rel="noopener noreferrer"&gt;GitHub announcement: docvet is on the MCP Registry&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Alberto-Codes/docvet" rel="noopener noreferrer"&gt;docvet on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/docvet/" rel="noopener noreferrer"&gt;docvet on PyPI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://registry.modelcontextprotocol.io" rel="noopener noreferrer"&gt;MCP Registry listing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://alberto.codes/blog/2026-02-25-your-ai-reads-your-docstrings" rel="noopener noreferrer"&gt;Previous post: Your AI Reads Your Docstrings. Are They Right?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>mcp</category>
      <category>vscode</category>
      <category>ai</category>
    </item>
    <item>
      <title>Your AI Reads Your Docstrings. Are They Right?</title>
      <dc:creator>Alberto Nieto</dc:creator>
      <pubDate>Tue, 03 Mar 2026 06:35:17 +0000</pubDate>
      <link>https://dev.to/albertocodes/your-ai-reads-your-docstrings-are-they-right-2g89</link>
      <guid>https://dev.to/albertocodes/your-ai-reads-your-docstrings-are-they-right-2g89</guid>
      <description>&lt;p&gt;Copilot, Claude Code, Cursor — they all read your docstrings to understand your code. When those docstrings are wrong, your AI makes confident, wrong suggestions.&lt;/p&gt;

&lt;p&gt;And wrong docs are worse than no docs. Studies show incorrect documentation drops LLM task success by &lt;a href="https://arxiv.org/abs/2404.03114" rel="noopener noreferrer"&gt;22.6 percentage points&lt;/a&gt; compared to correct docs.&lt;/p&gt;

&lt;p&gt;Your linter checks &lt;strong&gt;style&lt;/strong&gt;. But who checks that the docstring is actually &lt;strong&gt;accurate&lt;/strong&gt;?&lt;/p&gt;

&lt;h2&gt;
  
  
  The gap in your toolchain
&lt;/h2&gt;

&lt;p&gt;Existing tools cover the basics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ruff&lt;/strong&gt; — docstring style and formatting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;interrogate&lt;/strong&gt; — docstring presence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But neither checks whether your docstring &lt;em&gt;matches the code&lt;/em&gt;. A function that raises &lt;code&gt;ValueError&lt;/code&gt; but doesn't document it. A parameter added last sprint but missing from the docstring. Code that changed but the docstring didn't.&lt;/p&gt;

&lt;p&gt;That's layers 3–6 of docstring quality — and nothing was checking them.&lt;/p&gt;
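&lt;p&gt;The "parameter added last sprint" case is easy to reproduce. Below is a toy accuracy check (a crude substring match, not docvet's actual algorithm) run against a hypothetical drifted function:&lt;br&gt;
&lt;/p&gt;

```python
import inspect

def undocumented_params(func) -> list:
    """Toy accuracy check: parameters present in the signature but
    never mentioned in the docstring (crude substring match)."""
    doc = func.__doc__ or ""
    return [name for name in inspect.signature(func).parameters
            if name not in doc]

# Hypothetical drifted function: `timeout` was added to the signature,
# but the docstring still documents only `url`.
def fetch(url: str, timeout: float = 5.0) -> bytes:
    """Download a resource.

    Args:
        url: Address to fetch.
    """
    raise NotImplementedError

print(undocumented_params(fetch))  # prints: ['timeout']
```

&lt;p&gt;docvet's real checks are more precise than a substring match; this only shows the category of drift they target.&lt;/p&gt;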

&lt;h2&gt;
  
  
  docvet fills that gap
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Alberto-Codes/docvet" rel="noopener noreferrer"&gt;docvet&lt;/a&gt; is a CLI tool that vets docstrings across six quality layers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;What it catches&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Presence&lt;/td&gt;
&lt;td&gt;&lt;code&gt;docvet presence&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Public symbols with no docstring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Completeness&lt;/td&gt;
&lt;td&gt;&lt;code&gt;docvet enrichment&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Missing Raises, Yields, Attributes sections&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy&lt;/td&gt;
&lt;td&gt;&lt;code&gt;docvet freshness&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Code changed, docstring didn't&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rendering&lt;/td&gt;
&lt;td&gt;&lt;code&gt;docvet griffe&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Docstrings that break mkdocs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visibility&lt;/td&gt;
&lt;td&gt;&lt;code&gt;docvet coverage&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Modules invisible to doc generators&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;docvet
docvet check &lt;span class="nt"&gt;--all&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it on your codebase. You'll probably find something.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters for AI
&lt;/h2&gt;

&lt;p&gt;Docstrings are no longer just for humans reading your code. They're the context window for every AI tool touching your codebase. Accurate docstrings create a feedback loop: better context → better AI suggestions → better code.&lt;/p&gt;

&lt;p&gt;docvet keeps that contract honest.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://alberto-codes.github.io/docvet/" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; · &lt;a href="https://github.com/Alberto-Codes/docvet" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://pypi.org/project/docvet/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>opensource</category>
      <category>devtools</category>
    </item>
  </channel>
</rss>
