DEV Community: Ertuğrul Demir

Skills over System Prompts: Building an Anki Tutor with the Antigravity SDK

Ertuğrul Demir — Fri, 19 Jun 2026 10:09:03 +0000

AI has made me a little lazier.

Not dramatically lazy. Not "the robots will do everything" lazy. More like: once you get used to asking an agent to do boring work, every small manual workflow starts looking suspicious.

Anki is a perfect example.

Anki is great. I use it to remember things I study, subjects I work on, and the weird little decisions hidden inside codebases. Spaced repetition works. The problem is not Anki.

The problem is me.

I can already see the rot setting in. On complex cards, my brain starts negotiating with itself. "Yeah, I basically knew that." "Close enough." "I would have remembered it in context." Then I press Good and move on.

That is not studying. That is self-certified vibes.

What I actually wanted was a study buddy sitting on top of my real Anki collection. Someone to ask the card, wait for my answer, reveal the real answer, compare it honestly, explain the gap, and only then help decide whether it was Again, Hard, Good, or Easy.

AI is annoyingly good for that.

It is also useful when taking over a new project. When I enter a repo, I do not only want a summary. I want to be quizzed later on the key decisions, the architecture, the gotchas, and the "why is it like this?" parts. Anki is great for that too.

But I am still lazy.

I am not going to manually write every card. I am not going to keep every deck updated by hand. And if I am studying from my phone, I am definitely not going to type long answers into a chat just so the agent can grade me. Voice needs to work too.

So the project quickly stopped being "connect Gemini to Anki."

It became a small agent system:

a terminal tutor for focused review sessions
a Telegram tutor for studying from my phone, including voice answers
a deck builder that creates cards from web research or a local codebase
a watch mode that can notice code changes and create cards while I work

That is a lot of behavior.

My first instinct was the usual one: write a bigger system prompt. Tell the agent how to run a study session. Tell it how to write good flashcards. Tell it how to inspect a codebase and turn architecture into cards. Tell it how to behave differently in Telegram. Tell it not to touch scheduling unless I approve.

That works for about ten minutes.

Then the system prompt becomes a junk drawer.

The hard part was not giving the agent tools.

The hard part was giving it habits.

That is where the Google Antigravity SDK fit really well. It gives you the agent runtime as a Python library: custom tools, reusable skills, lifecycle hooks, safety policies, streaming, triggers, and multiple ways to run the same agent logic from different surfaces.

What the Antigravity SDK Gives You

The Antigravity SDK is not just a wrapper around a chat model.

It gives you programmatic access to the same agent runtime behind Google Antigravity 2.0 and the Antigravity CLI, but from Python.

That matters because a real agent is not only a model call. A real agent needs:

tools
memory across turns
permissions
hooks
skills
streaming
triggers
safety around side effects

The SDK puts those behind one main abstraction: Agent.

The smallest useful version really is tiny:

import asyncio
from google.antigravity import Agent, LocalAgentConfig

async def main():
    config = LocalAgentConfig()
    async with Agent(config) as agent:
        response = await agent.chat("What files are in the current directory?")
        print(await response.text())

if __name__ == "__main__":
    asyncio.run(main())

Install it with:

pip install google-antigravity

Then set a Gemini API key from Google AI Studio:

export GEMINI_API_KEY="your-key-here"

That is the hello world.

The useful version starts when you compose the runtime features around a real workflow.

In this project, the Antigravity SDK pieces mapped like this:

Antigravity SDK capability	Where I used it
`Agent` / `LocalAgentConfig`	the terminal tutor, Telegram tutor, and deck builder all run on the same agent runtime
Custom Python tools	AnkiConnect actions like `get_due_cards`, `show_answer`, `rate_card`, and `add_notes`
`skills_paths`	shared `review-buddy`, `plain-cards`, and `codebase-cards` behavior packages
Lifecycle hooks	sync on session start/end, deck backup before writes, audit log after scheduling changes, tool-error recovery
Safety policies	practice mode blocks `rate_card` so cram sessions cannot change real scheduling
Streaming	the deck builder prints progress while the agent researches and creates cards
Triggers	watch mode reacts to `.py` file changes and asks the agent to card important changes
Built-in read-only tools	codebase mode lets the agent inspect a repo without editing it

That list is the reason this worked better as an SDK project than as one giant prompt around a model call.

Now, the first useful step: give the agent hands.

Giving the Agent Hands: Anki as Python Tools

Anki already has an HTTP API through the AnkiConnect add-on. The entire bridge is basically one POST to localhost:

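import requests


def invoke(action: str, **params):
    # AnkiConnect listens on localhost:8765 and expects action/version/params.
    response = requests.post(
        "http://localhost:8765",
        json={"action": action, "version": 6, "params": params},
        timeout=30,
    )
    response.raise_for_status()
    payload = response.json()
    # Every reply carries both "result" and "error"; a non-null error means failure.
    if payload["error"]:
        raise RuntimeError(payload["error"])
    return payload["result"]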

From there, the agent tools are just normal Python functions.

A simplified version:

import json


def list_decks() -> str:
    """List all Anki decks with their due counts."""
    decks = invoke("deckNames")
    stats = invoke("getDeckStats", decks=decks)
    return json.dumps(stats)


def get_due_cards(deck: str = "", limit: int = 5) -> str:
    """Return due cards without revealing the answer side."""
    query = f'deck:"{deck}" is:due' if deck else "is:due"
    card_ids = invoke("findCards", query=query)[:limit]
    cards = invoke("cardsInfo", cards=card_ids)
    # cardsInfo also returns the answer and raw fields, so pass on only the question side.
    visible = [
        {"cardId": c["cardId"], "deck": c["deckName"], "question": c["question"]}
        for c in cards
    ]
    return json.dumps(visible)


def rate_card(card_id: int, rating: int) -> str:
    """Submit a user-confirmed Anki rating: 1 Again, 2 Hard, 3 Good, 4 Easy."""
    invoke("answerCards", answers=[{"cardId": card_id, "ease": rating}])
    return json.dumps({"rated": card_id, "rating": rating})

Then register them with the SDK:

from google.antigravity import LocalAgentConfig

config = LocalAgentConfig(
    tools=[list_decks, get_due_cards, rate_card],
)

That is one of the nicest parts of the SDK: custom tools do not require a separate server. For this version, I did not need MCP, a framework, a schema generator, or a second process.

The agent can call plain Python.

In the real project I ended up with more tools:

list_decks
get_due_cards
show_answer
rate_card
find_notes
add_note
add_notes
update_note
suspend_card
unsuspend_card
undo
get_stats
sync

That was enough to make the tutor useful.
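
For note creation, AnkiConnect's addNotes action takes a list of note objects. Here is a minimal sketch of how a tool like add_notes might build that payload, assuming the stock "Basic" note type with Front and Back fields. The helper name build_notes_payload is hypothetical, not taken from the project:

```python
def build_notes_payload(deck: str, cards: list, tag: str = "auto-researched") -> list:
    """Turn [{"front": ..., "back": ...}] into AnkiConnect note objects."""
    notes = []
    for card in cards:
        front = card.get("front", "").strip()
        back = card.get("back", "").strip()
        if not front or not back:
            # Skip incomplete cards rather than sending Anki an empty field.
            continue
        notes.append({
            "deckName": deck,
            "modelName": "Basic",  # assumption: the default Basic note type
            "fields": {"Front": front, "Back": back},
            "tags": [tag],
        })
    return notes
```

The result would then go through the bridge as invoke("addNotes", notes=payload). Filtering empty cards here keeps that check out of the prompt.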

This is the first pattern:

Put capabilities in tools.

Tools are the agent's hands. But hands are not behavior.

For behavior, I used skills.

The Problem with Giant System Prompts

At first, I tried to describe everything in the agent's system instructions.

The tutor needs to know how to run a review session:

show the question
wait for my answer
reveal the answer
compare my answer
suggest a rating
wait for confirmation
only then update Anki scheduling

It also needs to know how to write good cards:

one fact per card
answer-first backs
no trivia padding
no vague questions
no giant essay cards

Then the deck builder needs another workflow:

research a topic
extract the important facts
create cards
verify they exist in Anki

Then the codebase deck builder needs a different workflow:

inspect the repo breadth-first
find key abstractions
explain responsibilities and data flow
avoid making cards for random syntax

Then Telegram needs shorter replies because nobody wants a wall of Markdown on a phone.

You can put all of that into one system prompt.

But you should not.

A giant system prompt has three problems:

It pollutes every task. The agent is thinking about codebase exploration while you are reviewing Spanish verbs.
It is hard to reuse. The same card-writing rules need to appear in the terminal tutor, Telegram tutor, and deck builder.
It rots. Every new behavior gets pasted into the same blob until nobody knows which rule controls what.

This is exactly the problem skills solve.

The shape changed from this:

system prompt = tutor rules
              + card-writing rules
              + codebase-exploration rules
              + Telegram style rules
              + safety reminders
              + whatever I forgot last week

Into this:

system prompt  = identity + hard safety floor
review-buddy   = study-session behavior
plain-cards    = card-writing behavior
codebase-cards = repo-exploration behavior
hooks/policies = enforcement and receipts

That is the real pattern behind the title.

Not "make the prompt better."

Make the prompt smaller.

Skills over System Prompts

A skill is a folder with a SKILL.md file inside it.

My project has three:

.agents/skills/
  plain-cards/
    SKILL.md
  review-buddy/
    SKILL.md
  codebase-cards/
    SKILL.md

Each skill starts with a tiny bit of frontmatter.

For example, the review skill begins like this:

---
name: review-buddy
description: Playbook for running an interactive Anki review session — quiz one card at a time, grade recall together, submit ratings, repair noisy or broken cards.
---

That description is not just documentation for humans. It is the lightweight discovery layer. The agent can see what skills exist, then load the full instructions only when the task calls for them.

A skill is not a service. It is not an MCP server. It is not a deployment. It is a behavior package sitting on disk, ready to be pulled into the agent when needed.
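
To make the discovery idea concrete, here is a rough sketch of reading only the frontmatter from each SKILL.md. It illustrates the concept and is not how the SDK implements it; read_frontmatter and discover_skills are hypothetical names, and the parser handles only simple key: value lines:

```python
from pathlib import Path


def read_frontmatter(skill_file: Path) -> dict:
    """Parse the simple `key: value` lines between the two --- markers."""
    lines = skill_file.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        # Split on the first colon only, so descriptions may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def discover_skills(root: str) -> list:
    """List name and description for every <root>/<skill>/SKILL.md."""
    skills = []
    for skill_file in sorted(Path(root).glob("*/SKILL.md")):
        meta = read_frontmatter(skill_file)
        if "name" in meta and "description" in meta:
            skills.append({"name": meta["name"], "description": meta["description"]})
    return skills
```

The point of the sketch is the cost profile: listing skills touches a few lines per file, while the full playbook is read only when a task needs it.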

Then the SDK loads the skill directory:

config = LocalAgentConfig(
    system_instructions=SYSTEM_INSTRUCTIONS,
    tools=ALL_TOOLS,
    skills_paths=[".agents/skills"],
)

The key idea is simple:

The system prompt says who the agent is. Skills say what job it is currently doing.

For this project, the system prompt stays small. It says the agent is a friendly flashcard tutor working with a real Anki collection.

The details live in skills.

`review-buddy`: the study session playbook

This skill describes how to run a review session.

It covers the rhythm:

ask one card at a time
hide the answer until the user attempts it
reveal and teach briefly
suggest a rating
wait for confirmation
handle noisy or broken cards
close with a recap

This is not code. It is behavioral protocol.

That distinction matters. The review flow is not tied to terminal I/O, Telegram messages, or AnkiConnect. It is just the way a good tutor should behave.

`plain-cards`: the card-writing style guide

This skill handles card quality.

It tells the agent to write cards that are:

atomic
answer-first
lean
verified
free of filler
easy to review months later

A bad flashcard is worse than no flashcard. It creates fake progress. The model can generate ten cards in seconds, but without a style guide it will happily generate ten vague cards that future me will hate.

So card writing became a skill.

`codebase-cards`: the repo exploration protocol

This one is for turning source code into Anki cards.

The agent is told to inspect the repo breadth-first, identify architecture, data flow, responsibilities, and gotchas, then turn only the useful findings into cards.

That skill powers code mode in the deck builder:

python deck_builder.py "overall architecture" --path ~/my/project --count 6

The focus hint changes, but the exploration protocol stays the same.

This is the second pattern:

Put reusable behavior in skills.

Not in the system prompt. Not duplicated across entrypoints. Not buried in Python conditionals.

A skill is just a file, but it changes the shape of the whole project.

One Behavior Layer, Three Surfaces

Once the behavior lived in skills, adding new surfaces became much easier.

The architecture looked like this:

                         .agents/skills/
                  ┌──────────┼──────────┐
                  │          │          │
           review-buddy  plain-cards  codebase-cards
                  │          │          │
                  └──────────┼──────────┘
                             │
                    LocalAgentConfig
                             │
       ┌─────────────────────┼─────────────────────┐
       │                     │                     │
  terminal tutor        Telegram tutor        deck builder
    tutor.py          telegram_tutor.py      deck_builder.py

The terminal tutor is the simplest surface:

async with Agent(config) as agent:
    await run_interactive_loop(agent)

The Telegram tutor uses the same agent differently:

async def chat_response(agent: Agent, prompt: str) -> str:
    response = await agent.chat(prompt)
    return "".join([token async for token in response])

The deck builder streams output as it works:

response = await agent.chat(message)
async for token in response:
    print(token, end="", flush=True)

Different surfaces. Same runtime. Same skills.

That is the part I liked most. Telegram did not need a copied review prompt. The deck builder did not need its own card-writing manifesto. The codebase mode did not need a separate app-specific doctrine.

They all loaded the same skill directory.

The Terminal Tutor

The terminal version is the baseline.

Start Anki, run the tutor, and ask naturally:

python tutor.py

Then:

quiz me on XYZ

The tutor lists due cards, asks one question, waits for my answer, reveals the real Anki answer, compares, teaches, and suggests a rating.

The important part: it does not update scheduling just because the model thinks I got the answer right.

The review loop is human-in-the-loop by design:

Agent: I would rate this Good (3). You had the main idea but missed the date.
User: yes
Agent: rated 3. Next card...

Or I can override it:

Agent: I would rate this Hard (2).
User: actually 1
Agent: rated Again (1). Let's reinforce it.

Spaced repetition is stateful. A bad rating affects the future schedule. So the model can suggest, but I decide.

That is not just a prompt preference. It is the product boundary.

The Telegram Tutor

The second surface was Telegram.

Not because Telegram is fancy. Because the best study app is the one I actually open.

The Telegram bot long-polls the Bot API, sends messages into the same Antigravity agent, and returns the response. It also supports voice notes: speak the answer, transcribe it, and feed the transcript back into the tutor as text.
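
The long-polling half can be sketched with the standard library alone. This is an illustration of the Bot API's getUpdates pattern, not the project's code; the function names are hypothetical and the token is whatever BotFather issued:

```python
import json
import urllib.parse
import urllib.request
from typing import Optional


def fetch_updates(token: str, offset: Optional[int], wait_seconds: int = 30) -> list:
    """One long-poll request: the server holds it open until updates arrive or time runs out."""
    params = {"timeout": wait_seconds}
    if offset is not None:
        params["offset"] = offset
    url = f"https://api.telegram.org/bot{token}/getUpdates?" + urllib.parse.urlencode(params)
    # The HTTP timeout must outlast the long poll itself.
    with urllib.request.urlopen(url, timeout=wait_seconds + 10) as response:
        return json.load(response)["result"]


def next_offset(updates: list, current: Optional[int]) -> Optional[int]:
    """Acknowledge processed updates by asking for the highest update_id + 1 next time."""
    if not updates:
        return current
    return max(update["update_id"] for update in updates) + 1


def text_messages(updates: list) -> list:
    """Extract (chat_id, text) pairs, skipping non-text updates such as voice notes."""
    pairs = []
    for update in updates:
        message = update.get("message", {})
        if "text" in message:
            pairs.append((message["chat"]["id"], message["text"]))
    return pairs
```

A surface loop would call fetch_updates, hand each text to the agent, send the reply back, and carry next_offset forward so no update is processed twice.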

The agent gets a small extra instruction:

TELEGRAM_INSTRUCTIONS = """
You are chatting through Telegram on a phone. Keep replies short and plain
text only — no markdown headers, tables, or code fences. One card per message.
"""

Everything else stays shared.

Same Anki tools. Same hooks. Same skills.

I also added due-card nudges without spending model tokens. Every 30 minutes, plain Python checks Anki deck counts. If cards are waiting, the bot sends a short reminder:

25 cards waiting (X 5, Y 8). Say 'quiz me' to start.

No LLM needed. No reasoning needed. Just deterministic code.

This became a useful design rule:

Do not use the model for work a for loop can do.

The agent is for tutoring. The nudge is just a counter.
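
As a sketch of how small that counter can be, here is one way to format the reminder from per-deck due counts. The function name and the choice to show only the busiest decks are assumptions; how the counts are fetched is left to the existing Anki tools:

```python
from typing import Optional


def format_nudge(due_by_deck: dict, top: int = 2) -> Optional[str]:
    """Build the reminder text from due counts; return None when nothing is due."""
    waiting = {deck: n for deck, n in due_by_deck.items() if n > 0}
    total = sum(waiting.values())
    if total == 0:
        return None
    # Show only the busiest decks so the message stays short on a phone.
    busiest = sorted(waiting.items(), key=lambda item: item[1], reverse=True)[:top]
    breakdown = ", ".join(f"{deck} {n}" for deck, n in busiest)
    return f"{total} cards waiting ({breakdown}). Say 'quiz me' to start."
```

Returning None for an empty queue lets the scheduler stay silent instead of sending "0 cards waiting" every half hour.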

The Deck Builder

The third surface is a deck builder.

It has two modes.

Web mode:

python deck_builder.py "Ottoman Empire" --deck "History" --count 8

Codebase mode:

python deck_builder.py "error handling and edge cases" --path ~/my/project --count 6

Web mode gives the agent a small research toolset: Wikipedia search, Wikipedia read, and URL fetch. Then it asks the agent to create cards using the plain-cards skill.

Codebase mode is more interesting. The SDK can give the agent built-in file tools scoped to a workspace. I enabled read-only access:

from google.antigravity.types import BuiltinTools, CapabilitiesConfig

config = LocalAgentConfig(
    tools=[add_notes, list_decks],
    workspaces=[code_path],
    capabilities=CapabilitiesConfig(
        enabled_tools=BuiltinTools.read_only()
    ),
    skills_paths=[".agents/skills"],
)

That means the agent can inspect the target repo, but not edit it.

For a deck builder, that is the right permission boundary. It needs to read code and create Anki notes. It does not need to modify the project.

This is where codebase-cards activates. The agent explores the repo, identifies the concepts worth remembering, then writes cards through add_notes.

At the end, I do not trust the model's narration. The script queries Anki to verify the cards exist.

def cards_in_anki(deck: str) -> int:
    result = json.loads(find_notes(f'deck:"{deck}" tag:auto-researched', 100))
    return len(result) if isinstance(result, list) else 0

If the model says it created cards but Anki has zero, the script nudges it to try again.
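
That verify-then-nudge loop can be sketched independently of any SDK. Both callables below are placeholders: ask_agent stands in for whatever sends a prompt to the agent, and count_cards for a check like cards_in_anki. The retry wording and the attempt limit are assumptions:

```python
import asyncio
from typing import Awaitable, Callable


async def build_until_verified(
    ask_agent: Callable[[str], Awaitable[str]],
    count_cards: Callable[[], int],
    request: str,
    max_attempts: int = 3,
) -> int:
    """Ask for cards, then believe only the count reported by the system of record."""
    before = count_cards()
    prompt = request
    for _ in range(max_attempts):
        await ask_agent(prompt)  # the reply text is deliberately ignored
        created = count_cards() - before
        if created > 0:
            return created
        prompt = (
            "Anki shows no new cards from the last attempt. "
            "Call the note-creation tool again, then report what was added."
        )
    return 0
```

The agent's reply never decides success. Only the difference in Anki's own count does, and the loop gives up after a fixed number of attempts rather than spinning forever.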

That became another rule:

Trust the system receipt, not the model narration.

Turning It Ambient with Triggers

The SDK also supports triggers: background tasks that react to external events and push messages into the agent.

I used a file-change trigger for codebase card generation.

The idea: while I work on a project, if a Python file changes, the agent can inspect the change and decide whether it introduced something worth remembering.

Simplified:

from google.antigravity.triggers import on_file_change


def make_watch_trigger(path, deck, tag):
    async def on_change(ctx, changes):
        paths = sorted({c.path for c in changes if c.path.endswith(".py")})
        if not paths:
            return

        await ctx.send(
            f"These files changed: {', '.join(paths)}. "
            f"Create cards in deck {deck} if the change is worth remembering."
        )

    return on_file_change(path, on_change)

Run it like this:

python deck_builder.py "as I work" --path ~/my/project --watch

This is where the project started feeling less like a chatbot and more like a sidecar.

I edit code. The trigger wakes the agent. The codebase skill tells it how to inspect the change. The card-writing skill tells it how to write good cards. The Anki tool creates the notes.

No new server. No custom scheduler. No giant prompt.

Just SDK triggers plus skills.

The Part I Refused to Trust to the Model

Skills are guidance.

Policies and hooks are enforcement.

That line is the difference between a fun demo and a tool I can leave connected to my real Anki collection.

The Antigravity SDK has declarative safety policies and lifecycle hooks. I used both.

Practice mode blocks scheduling writes

Sometimes I want to cram without touching Anki scheduling.

A prompt instruction is not enough for that. If the agent forgets and calls rate_card, the schedule changes.

So practice mode denies the tool at the harness level:

from google.antigravity.hooks import policy

policies = policy.confirm_run_command()

if practice_mode:
    policies = policies + [
        policy.deny("rate_card", name="practice_mode")
    ]

Now rate_card is blocked even if the model tries to call it.

That is the kind of safety I want: not vibes, not trust, not "please don't". A runtime boundary.

Hooks sync, back up, audit, and recover

The SDK hook system lets you observe or intervene at lifecycle points.

I used session hooks to sync Anki:

@hooks.on_session_start
async def sync_on_start():
    sync_anki()

@hooks.on_session_end
async def sync_on_end():
    sync_anki()

I used a pre-tool-call Decide hook to back up a deck before note writes:

@hooks.pre_tool_call_decide
async def backup_before_note_writes(tool_call):
    if tool_call.name in ("add_note", "add_notes"):
        backup_deck(tool_call.args["deck"])
    return hooks.HookResult(allow=True)

I used a post-tool-call Inspect hook to audit scheduling changes:

@hooks.post_tool_call
async def audit_scheduling_changes(result):
    if result.name in {"rate_card", "undo", "suspend_card", "unsuspend_card"}:
        append_jsonl("backups/scheduling_audit.jsonl", result)
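
The append_jsonl helper is not shown in the post. One plausible shape, offered as a sketch rather than the project's code, is an append-only writer that stamps each entry with a UTC time:

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_jsonl(path: str, record) -> None:
    """Append one timestamped JSON object per line, creating the folder if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    entry = {
        "at": datetime.now(timezone.utc).isoformat(),
        "record": record,
    }
    with target.open("a", encoding="utf-8") as handle:
        # default=str keeps the log writable even for values JSON cannot encode.
        handle.write(json.dumps(entry, default=str) + "\n")
```

One JSON object per line means the audit file can be tailed, grepped, or replayed without parsing the whole thing.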

And I used a Transform hook to turn ugly tool errors into recovery hints the model can act on:

@hooks.on_tool_error
async def recover_from_tool_error(error):
    if isinstance(error, requests.Timeout):
        return "AnkiConnect timed out. Ask the user to check Anki, then retry."
    return None

This is one of the strongest parts of the SDK.

The model does not need to remember to audit itself. The harness does it.

The model does not need to remember to back up a deck before writing. The hook does it.

The model does not get to bypass practice mode. The policy blocks it.

The pattern became clear:

tools give the agent capabilities
skills give the agent reusable behavior
policies define what must never happen
hooks add system-level guarantees around the agent

That separation is the architecture.

What Worked

A few things worked better than expected.

Plain Python tools were enough

I originally thought I might need to build an MCP server immediately.

I did not.

For one application, custom Python functions were simpler. The SDK already knows how to expose them as tools. That kept the first version small.

MCP is still useful when you want the same tools available across multiple clients. But for an SDK-native app, Python functions are the shortest path.

Skills kept the project from becoming prompt soup

This was the biggest win.

The base system instructions stayed focused. The detailed workflows moved into skills.

When I improved card-writing rules, terminal, Telegram, and deck builder all benefited. I did not need to update three prompts.

Hooks made side effects less scary

Anki is not a toy database. It is my real spaced-repetition schedule.

The hooks gave me a deterministic layer around model behavior:

sync at session boundaries
backup before writes
audit after scheduling changes
recover from tool failures

That made the agent feel much less like a random chatbot with database access.

Triggers changed the feel of the app

The file watcher was small, but it changed the mental model.

The agent was no longer only something I talked to. It could react to work happening around it.

That is where SDK agents get interesting: not just chat, but event-driven labor.

What Did Not Work Perfectly

A few caveats.

Skills are not hard guarantees

Skills are instructions. They improve behavior, but they are still model-read guidance.

If something must be impossible, use a policy or remove the tool.

That is why practice mode denies rate_card instead of merely asking the model not to call it.

AnkiConnect has sharp edges

AnkiConnect is simple, but it has quirks.

For example, answerCards can return success even for bad card IDs unless you pre-check the card. Some note updates silently fail if the note is open in Anki's browser window. AnkiConnect also runs inside Anki's Qt process, so you should not treat it like a high-concurrency API.

The fix is boring and important: validate inside tools.
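
As an example of that validation, here is a sketch of a rating tool that checks its inputs before touching the schedule. It takes the invoke bridge as a parameter so it can be exercised without Anki running; the name rate_card_checked is hypothetical, and it relies on Anki's cid: search term to confirm the card exists:

```python
import json
from typing import Callable


def rate_card_checked(invoke: Callable[..., object], card_id: int, rating: int) -> str:
    """Rate a card only after confirming the rating is valid and the card exists."""
    if rating not in (1, 2, 3, 4):
        return json.dumps({"error": f"rating must be 1-4, got {rating}"})
    # An empty search result means Anki has no card with this ID.
    if card_id not in invoke("findCards", query=f"cid:{card_id}"):
        return json.dumps({"error": f"card {card_id} not found"})
    invoke("answerCards", answers=[{"cardId": card_id, "ease": rating}])
    return json.dumps({"rated": card_id, "rating": rating})
```

Errors come back as ordinary tool output rather than exceptions, so the model gets a message it can act on instead of a stack trace.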

Voice was simpler outside the agent loop

The Telegram bot supports voice answers, but I kept transcription outside the agent loop. A direct Gemini transcription call turns the voice note into text, then the transcript goes into the tutor.

That was simpler and more reliable for this build.

The lesson: use the SDK where it makes the architecture cleaner. Do not force every feature through the agent if a direct call is simpler.

How to Build Something Similar

If you want to build your own version of this pattern, I would do it in this order.

1. Start with one real workflow

Do not start with a platform.

Pick one annoying workflow with real state behind it:

flashcards
GitHub issues
CRM updates
personal knowledge base
support tickets
finance records

The state matters. Agents get interesting when they can act on something real.

2. Wrap the system as small Python tools

Keep the tools boring.

def search_items(query: str) -> str:
    """Search the user's records."""
    ...


def create_item(title: str, body: str) -> str:
    """Create a new record after user approval."""
    ...

config = LocalAgentConfig(
    tools=[search_items, create_item],
)

Make tools validate inputs. Do not rely on the model to pass perfect IDs.

3. Move task behavior into skills

Create a skill folder:

.agents/skills/my-workflow/SKILL.md

A minimal skill:

---
name: my-workflow
description: Use when helping the user process and update records in this system.
---

# My Workflow

1. Inspect the current record before changing it.
2. Propose the change in plain language.
3. Wait for user confirmation before writing.
4. After writing, verify the record exists.

Then load it:

config = LocalAgentConfig(
    tools=TOOLS,
    skills_paths=[".agents/skills"],
)

This is the move: do not keep growing the system prompt forever.

4. Add policies for non-negotiables

If a tool should never run in a mode, deny it.

policies = [
    policy.deny("delete_record", name="no_deletes"),
]

If shell execution should require confirmation, keep the default guard:

policies = policy.confirm_run_command()

The model can misunderstand a skill. It cannot ignore a denied tool.

5. Add hooks for receipts

Use hooks for things that should happen regardless of whether the model remembers them:

audit logs
backups
sync
metrics
sanitization
error recovery

@hooks.post_tool_call
async def audit(result):
    write_log({
        "tool": result.name,
        "result": result.result,
        "error": result.error,
    })

6. Add another surface only after the behavior is reusable

Once the behavior lives in tools and skills, a second surface becomes much cheaper.

Terminal first. Then Telegram, Slack, web, cron, or file triggers.

The surface should be thin. The agent behavior should not live there.

The Bigger Point

The old way to build an AI feature was to write a large prompt and hope the model followed it.

That is not enough for real agents.

A real agent needs separation of concerns:

Capabilities       → tools
Reusable behavior  → skills
Hard boundaries    → policies
System guarantees  → hooks
External events    → triggers
User interface     → thin surface

This is what the Antigravity SDK made pleasant. I could build one agent runtime and reuse it across terminal, Telegram, and deck generation. I could keep the tutoring behavior in SKILL.md files instead of duplicating it. I could wrap real side effects with policies and hooks instead of trusting the model to behave.

The Anki tutor is just the concrete example.

The pattern generalizes.

A support agent could keep triage behavior in a skill, expose ticket updates as tools, deny destructive writes by policy, and audit every status change by hook.

A code review agent could keep review rubrics in skills, expose GitHub as tools, require approval before comments, and verify every posted review.

A research agent could keep extraction protocols in skills, use file triggers to process new papers, and write structured outputs only after validation.

The skill is the portable behavior module.

The SDK is the harness that lets it act.

Closing

I started this because I was too lazy to open Anki.

That sounds like a joke, but most useful automation starts there. Not with a grand platform vision. With a small workflow that keeps not happening because the friction is just high enough.

The surprising part was not that an LLM could quiz me.

The surprising part was how clean the architecture became.

Tools gave the agent hands. Skills gave it habits. Policies gave it boundaries. Hooks gave it receipts. Triggers made it wake up when something changed.

That is the version of agents I trust more: not one giant prompt pretending to be an application, but a small runtime with clear layers.

The future of agent apps is not monolithic complex systems.

It is smaller prompts, sharper tools, reusable skills, and a harness that refuses to let the model pretend a side effect happened when it did not.

The Local Model That Doesn't Sleep: Gemma 4 + MTP as a Marathon Engine

Ertuğrul Demir — Fri, 08 May 2026 09:01:07 +0000

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

I set the agent running just before midnight, did a quick mental count of my remaining API quota, and went to sleep. I was going to wake up to a finished job. That was the plan, anyway...

What I actually woke up to was a frozen terminal. The agent had stopped in the tenth minute. The remote service had gone down overnight and taken the whole job with it. The task I had given it was simple enough: scrape fifty documentation pages, cross-reference the data across sources, produce a structured summary. It had barely started before the infrastructure I had no control over just switched off.

The model wasn't failing. The problem wasn't intelligence. The problem was that I was building on a foundation I didn't own: a service that could go down, a quota that could run out, and no way to know which one was waiting for me in the morning.

I had always worked with local models on the side: trained them, tested them, liked them. But to be honest, I'd never trusted them much in the past for complex tasks. They were a hobby, not a solution. Too much babysitting required for a real workload. I had filed them under "interesting" and left them there. That frozen terminal moved them to a different folder.

For a long time, the gap between the proprietary giants and the open-source world felt like a canyon. You had the "God-models" behind closed gates: GPT, Claude, Gemini. They could reason through almost anything, but if you wanted that level of intelligence, you paid the subscription and played by their rules.

But lately, that canyon is shrinking.

We're seeing a massive push from the open-weights community. Models like DeepSeek V4, Kimi K2.6, and GLM-5.1 are proving that high-end reasoning is becoming a commodity. The problem is the weight. Unless you're running a server farm or an expensive rack, hosting a model at that scale is a logistical nightmare. Great to admire from a distance, but too heavy to actually build with.

Then came the sweet spot: Gemma 4 31B and Qwen 3.6 27B.

Suddenly, the math changed. These models aren't as smart as the trillion-parameter giants, but they fit. They fit on consumer-grade GPUs. They work offline. And they work for free, minus whatever your GPU costs in electricity...

But here is the thing: I don't think the goal of local models is to beat the cloud models at a game of IQ.

For a complex task, you still want the big guns. You want the most powerful model available to handle high-value iterations where precision is everything. That is a sprint.

But what happens when the task isn't a sprint? What happens when you need a model to work for six hours straight? To scrape a hundred pages, try fifty different reasoning paths, fail, pivot, and keep grinding until the job is done?

That is a marathon.

And in a marathon, intelligence is secondary to endurance.

The real advantage of a local setup isn't just privacy or cost. It is the fact that you have a little working engine that doesn't get tired. No rate limits. No monthly token quota. It is completely yours, and you can leave it running all night while you sleep.

The stamina was already there. Then, recently, the Gemma family got something new: a way to run faster without burning out. A marathon engine that also picks up pace doesn't just finish sooner. It fits more work into the same night.

The Turbocharger (What is MTP?)

Before we get into the build, we need to talk about why this suddenly became possible. If you've been following the Gemma 4 release, you probably saw the term MTP (Multi-Token Prediction).

One thing worth naming up front: MTP isn't just a runtime trick bolted onto inference. It is a training objective. Google trained Gemma 4 from the ground up with auxiliary heads that predict multiple future tokens simultaneously. That structural choice is what makes the speculative-decoding pipeline below so tightly integrated and efficient, far more so than older bolt-on drafters like Medusa or generic small-model speculative decoding.

On the surface, Google says it makes the models "up to 3x faster." But as a developer, you know that "faster" can mean a lot of things. In this case, it is not about making the GPU clock speed higher. It is about changing how the model actually thinks.

Standard LLMs are autoregressive. They produce one token at a time. It doesn't matter if the next word is completely predictable or a complex logic puzzle: the model spends the same amount of energy and time to generate that one single token. This is the latency bottleneck. Your GPU spends most of its time just moving parameters around, waiting to spit out one word.

MTP fixes this using a technique called Speculative Decoding.

Think of it as pairing a heavy target model (the 31B brain) with a lightweight "drafter." The drafter is autoregressive too. It just runs much faster because of its size, producing a short candidate sequence in the time the target would take to produce a single token.

For example, if the model is generating something as predictable as "Once upon a time," the words "in a galaxy far far away" are practically a given in some contexts. A standard model would still grind through each of those words one by one, spending the same compute on a cliché as it would on a genuine reasoning problem. The drafter generates the likely sequence quickly simply because of its small size.

Then the target model steps in. Instead of generating those tokens one by one, it verifies the entire draft in a single parallel forward pass. The same weight load that normally yields one token now yields several, depending on how much of the draft is accepted. If the drafter was fully right, you get the whole sequence accepted in the time it usually takes to generate one word, and the target even throws in one extra token of its own as a bonus. If the drafter was only partially right, the target accepts everything up to the first disagreement, swaps in its own token at that point, and the process continues. Either way, the output follows the same probability distribution as running the target model alone. The acceptance algorithm is a mathematical guarantee, not a heuristic.
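To make the accept-and-correct step concrete, here is a minimal sketch of the greedy case in Python. It is a toy, not vLLM's implementation: `toy_target` and `verify_draft` are hypothetical names, the "target model" is a lookup over a fixed sentence, and the loop stands in for what a real engine does in one parallel forward pass. With sampling enabled, real engines use a rejection-sampling rule rather than plain equality, which is where the distribution guarantee comes from.

```python
from typing import Callable, List

# Toy "target model": it always continues this fixed sentence (illustration only).
SENTENCE = ["once", "upon", "a", "time", "in", "a", "galaxy", "far", "far", "away"]


def toy_target(context: List[str]) -> str:
    """Return the token the target would pick next for this context."""
    return SENTENCE[len(context)] if len(context) < len(SENTENCE) else "<eos>"


def verify_draft(prefix: List[str], draft: List[str],
                 target_next: Callable[[List[str]], str]) -> List[str]:
    """Greedy verification: keep draft tokens until the first disagreement."""
    context = list(prefix)
    accepted: List[str] = []
    for token in draft:
        expected = target_next(context)
        if token != expected:
            # First disagreement: the target's own token replaces the draft token.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        context.append(token)
    # Whole draft accepted: the target adds one bonus token of its own.
    accepted.append(target_next(context))
    return accepted


prefix = SENTENCE[:4]
full_hit = verify_draft(prefix, ["in", "a", "galaxy", "far"], toy_target)
partial_hit = verify_draft(prefix, ["in", "a", "city", "far"], toy_target)
print(len(full_hit), full_hit)
print(len(partial_hit), partial_hit)
```

A fully correct four-token draft yields five tokens for one verification step, while a draft that goes wrong at the third position still yields three.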

The result is a massive win for local agents.

When you are building an agent that needs to iterate, research, and self-correct, you are basically running a loop of "Think → Act → Observe." If every "Think" step takes a minute, your agent is a snail. If MTP cuts that down to a matter of seconds, your agent becomes a real-time engine.

You get the pretty strong reasoning of a 31B model, but it's delivered at the speed of a much smaller one. For a "marathon" task, this is the difference between a project that takes a day and one that finishes by breakfast.

The Engine Room

Now, the question is: how do you actually run this without your computer turning into a space heater?

When it comes to local inference, the landscape is usually split between two different philosophies. On one side, you have the llama.cpp ecosystem. This is the powerhouse of versatility. It’s the project that effectively democratized local LLMs, allowing us to run massive models on everything from MacBooks to old gaming PCs by utilizing GGUF and sophisticated memory offloading. If you need a model to run on a weird hardware configuration or want to lean on your system RAM, llama.cpp is the tool for the job.

But for an endurance engine, versatility is secondary to throughput.

That’s where vLLM comes in.

While llama.cpp is built for the individual user's flexibility, vLLM is built for the scale of a serving engine. To understand why, you have to understand the "Double Penalty" of long context.

When you increase the context length of a model, you get hit twice. First, you have the Compute Cost: the model has to attend to every previous token, so the work increases as the sequence grows. Second, you have the Memory Cost: you have to store the KV Cache, the pre-computed Keys and Values for every past token, so the model does not have to recompute that history from scratch on every new step.

In a standard setup, this KV cache is stored in one contiguous block of VRAM. But in the real world, sequences have different lengths. This leads to massive memory fragmentation: you have "holes" in your VRAM that are too small to be used but too large to ignore. As your context grows, this waste grows with it. Eventually, your batch size collapses, and your GPU sits underutilized while your agent crawls.

PagedAttention is vLLM's solution, and it's basically "Virtual Memory" for LLMs.

Instead of storing the KV cache as one giant chunk, PagedAttention splits it into fixed-size blocks, or "pages." It uses a page table to map logical tokens to physical memory blocks. This means the model can store the cache anywhere in VRAM, eliminating fragmentation and allowing it to pack requests tightly.
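The page-table idea fits in a few lines of Python. This is a toy bookkeeping model under my own assumptions (the class name `PagedKVCache`, a block size of 16, and a simple free list are all illustrative), not vLLM's actual block manager, and it tracks only which block a token lives in, not the tensors themselves.

```python
from typing import Dict, List, Tuple


class PagedKVCache:
    """Toy page table: maps each sequence's tokens to fixed-size physical blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}  # seq_id -> physical block ids
        self.lengths: Dict[str, int] = {}             # seq_id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve space for one more token, grabbing a new block only when needed."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks left")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def locate(self, seq_id: str, token_idx: int) -> Tuple[int, int]:
        """Translate a logical token index into (physical block, offset)."""
        block = self.block_tables[seq_id][token_idx // self.block_size]
        return block, token_idx % self.block_size

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4)
for _ in range(16):
    cache.append_token("a")
for _ in range(5):
    cache.append_token("b")
for _ in range(4):
    cache.append_token("a")  # sequence "a" grows again after "b" took a block
print(cache.block_tables)
print(cache.locate("a", 17))
```

Sequence "a" ends up in two blocks that are not adjacent in physical memory, and nothing breaks, which is the whole point: waste is limited to the unused tail of each sequence's last block.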

For a research agent that is reading fifty pages of documentation, this is the difference between the agent finishing the job and the system crashing with an Out of Memory error. It also enables prefix caching: if your agent asks ten different questions about the same documentation, vLLM doesn't recompute the documentation ten times. It shares the same KV pages across all requests.

The best part is that we no longer have to wait for the community to "hack" MTP support into the codebase. vLLM launched Day-0 support for Gemma 4 MTP.

They provided a ready-to-use Docker image, which effectively removes the "dependency hell" that usually comes with cutting-edge AI releases. You don't have to spend your afternoon wrestling with CUDA versions or Triton kernels. You pull the image, spin up the server, and you have a high-performance MTP engine running on consumer hardware.

Because vLLM provides an OpenAI-compatible API, the integration is seamless. The server sits there as a lightweight endpoint, and any tool, whether it's a custom Python script or an agentic orchestrator like pi, can talk to it using standard API calls.

You’ve effectively decoupled the "Brain" (the model) from the "Pilot" (the agent). The Brain lives in vLLM, optimized for raw speed and memory efficiency. The Pilot lives in your orchestration layer, focusing on the logic and the goal.

Setting Up vLLM

Time to actually run the thing. This is where most local-model articles get bogged down in CUDA versions, Triton kernels, and Python env nightmares. Fine-tuning a model on Bronze Age tablets, I can handle. CUDA toolchain mismatches at 1 AM, I cannot.

Luckily, the vLLM team shipped a pre-release Docker image specifically for Gemma 4. If you’re on Hopper, you grab vllm/vllm-openai:gemma4-0505-cu129. On Blackwell, it’s vllm/vllm-openai:gemma4-0505-cu130. One small but important gotcha: the standard vllm/vllm-openai:latest tag does not include MTP speculative decoding for Gemma 4 yet. If you pull the default image out of habit, the --speculative-config flag will silently get you nowhere.

docker pull vllm/vllm-openai:gemma4-0505-cu130

That’s dependency hell, gone in one command.

The next problem is fitting a 31B-parameter model on a single card. In native BF16, Gemma 4 31B needs roughly 62GB of VRAM just to load the weights (31 billion parameters at 2 bytes each), before a single byte goes to the KV cache. That’s workstation or server-class territory, well beyond a single consumer card like the RTX 5090 with its 32GB of VRAM.

The trick is NVFP4, NVIDIA’s 4-bit floating-point format, native to Blackwell. NVIDIA published a quantized checkpoint, nvidia/gemma-4-31B-it-NVFP4, that drops the weights to roughly 19GB. Stack an FP8 KV cache on top of that, and a 31B reasoning model fits comfortably on a consumer Blackwell card like the RTX 5090, with headroom left over for serving.
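The arithmetic behind those numbers is quick to check. The sketch below is back-of-the-envelope only: `weight_gb` and `headroom_gb` are hypothetical helper names, and the gap between the raw 4-bit figure and the roughly 19GB checkpoint is presumably scaling factors and layers kept at higher precision, which is my assumption rather than something I measured.

```python
def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Raw weight footprint in GB: parameters times bits, converted to bytes."""
    return params_billion * bits_per_param / 8


def headroom_gb(card_gb: float, weights_gb: float) -> float:
    """VRAM left for KV cache, activations, and the drafter after loading weights."""
    return card_gb - weights_gb


bf16 = weight_gb(31, 16)     # native BF16 checkpoint
fp4_raw = weight_gb(31, 4)   # pure 4-bit lower bound, before any overhead
print(f"BF16 weights:      {bf16:.1f} GB")
print(f"4-bit lower bound: {fp4_raw:.1f} GB")
# 19 GB is the approximate checkpoint size quoted above, 32 GB is the RTX 5090.
print(f"Headroom on 32 GB: {headroom_gb(32, 19):.1f} GB")
```

Whatever is left after the weights has to cover the KV cache, activations, and the drafter, which is why the FP8 KV cache matters on a 32GB card.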

Here’s the actual launch command:

docker run --gpus all \
  --privileged --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:gemma4-0505-cu130 nvidia/gemma-4-31B-it-NVFP4 \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 1 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --chat-template examples/tool_chat_template_gemma4.jinja \
  --reasoning-parser gemma4 \
  --speculative-config '{"model":"google/gemma-4-31B-it-assistant","num_speculative_tokens":4}'

A few lines worth calling out:

--kv-cache-dtype fp8 cuts the KV cache footprint roughly in half. Long contexts are still expensive, just half as expensive.
The --tool-call-parser, --reasoning-parser, and --chat-template trio wires up Gemma 4’s native function calling and structured thinking mode. We don’t need tools for the benchmark itself, but any agent that drives this engine afterwards will.
The interesting line is the last one. --speculative-config is the switch that turns MTP on. The target is the NVFP4 31B model doing the actual reasoning. The drafter is google/gemma-4-31B-it-assistant, a 0.5B-parameter companion model that Google ships specifically as the speculative-decoding partner for the 31B. At roughly 60x smaller than the target, it generates draft tokens fast enough that the verification step costs almost nothing extra. It also shares the target model’s KV cache and feeds off its final-layer activations rather than building its own context from scratch, which is why the acceptance rate stays stable even as context grows. num_speculative_tokens: 4 is the recommended starting point at this scale; vLLM’s own benchmarks suggest pushing up to 8 if your acceptance rate holds.

Once the container boots, vLLM exposes an OpenAI-compatible endpoint on localhost:8000. Anything that already speaks the OpenAI API talks to this. No new SDK, no custom wire protocol, no learning curve.
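As a quick illustration of "anything that speaks the OpenAI API," here is a minimal sketch using only the Python standard library. The function names are mine, and I am assuming the server registers the model under the same name passed on the command line (vLLM's default unless you override the served model name); adjust `MODEL` if yours differs.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # the endpoint from the docker command
MODEL = "nvidia/gemma-4-31B-it-NVFP4"   # assumption: served under its checkpoint name


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a standard OpenAI-style chat completion request."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# With the container running:
# print(ask("Summarize PagedAttention in two sentences."))
```

The same request shape works from the official OpenAI client libraries by pointing their base URL at the local server.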

That’s the whole engine. Brain loaded, drafter wired up, KV cache paged. Now the only question worth answering is whether MTP actually earns its keep, or whether it’s another "up to 3x faster" line that quietly evaporates on real workloads.

That’s what the next section is for.

Does MTP Actually Earn Its Keep?

I ran this on a dedicated Nvidia RTX PRO 6000 Blackwell 96GB instance rather than my local machine, and used the unquantized BF16 checkpoint. The PRO 6000 is a workstation card, not a consumer one — I picked it deliberately. Local inference benchmarking is noisy (background processes, thermal throttling, memory contention), and BF16 weights in a clean isolated environment let me measure the MTP mechanism itself without quantization or thermal effects muddying the numbers.

The trade-off worth naming: these numbers are not directly what a 5090 running NVFP4 will hit. The two setups pull in different directions — the PRO 6000 has more raw compute and memory bandwidth, but NVFP4 on Blackwell has native FP4 tensor cores and a much smaller memory footprint, which matters a lot for the bandwidth-bound decode step. Which curve ends up higher in absolute tok/s is an empirical question I haven't answered here. What does transfer is the shape: where MTP wins, where the gain narrows, and where it crosses over. If you want exact numbers for your card, run llama-benchy yourself with the config from the previous section.

The first test used vLLM’s own built-in benchmark tool, vllm bench serve. The setup was a controlled A/B: everything identical except the presence of --speculative-config. Three runs per arm, results averaged.

vllm bench serve \
  --model google/gemma-4-31B-it \
  --host localhost --port 8000 \
  --dataset-name random \
  --random-input-len 1024 --random-output-len 1024 \
  --num-prompts 100 --max-concurrency 32

Spec OFF: 356 tok/s. Spec ON: 642 tok/s. A consistent 1.80x across all three runs.

But vllm bench serve answers a different question than the one I was actually asking. It is built to stress-test a serving deployment: it saturates the server at concurrency 32, mixes request queues, and measures aggregate output across all users at once. That is exactly what you want if you are sizing a production API. It is not what you want if you are asking how fast a single agent thinks on a long task.

There is also a structural problem with the random dataset that goes beyond MTP. It is the only dataset option that lets you pin exact input and output lengths, so a controlled-length run means feeding the model random tokens instead of real text. And vllm bench serve has no mechanism to measure how performance changes as context grows, which is exactly what matters for a marathon task.

The question I actually needed to answer was different: how does per-request generation speed change as context grows from zero to 120k? Real text, real acceptance rates, one request at a time.

For that, I used llama-benchy.

The Context Ladder

llama-benchy is a llama-bench style tool built for any OpenAI-compatible endpoint. The key differences from vllm bench serve are threefold: it runs one request at a time, which is the actual solo-agent scenario; it uses real book text from Project Gutenberg, which gives the speculative drafter something meaningful to predict; and it sweeps across context depths, so you can see exactly how performance changes as the KV cache fills.

llama-benchy \
  --base-url http://localhost:8000/v1 \
  --latency-mode generation --skip-coherence \
  --pp 2048 --tg 480 \
  --depth 0 1000 5000 10000 20000 50000 100000 120000 \
  --book-url https://www.gutenberg.org/files/2600/2600-0.txt \
  --no-cache

Here is the full comparison across the context window, one request at a time:

Context depth	Spec ON (tok/s)	Spec OFF (tok/s)	Advantage
0 (fresh start)	52.5	22.3	2.4x
5k	46.2	21.7	2.1x
10k	40.3	21.3	1.9x
20k	38.3	20.6	1.9x
50k	27.0	19.7	1.4x
100k	19.1	18.4	~1.0x
120k	16.6	18.0	0.9x

As an additional test, I increased num_speculative_tokens from 4 to 8 to see if performance would scale. While the 8-token configuration did improve throughput, the results showed clear diminishing returns in this dataset. Across most context lengths, doubling the speculative tokens only yielded a modest bump of roughly 2 to 3.5 tok/s over the 4-token setup, with the most noticeable gains in the 10k to 50k range.

The engine does not get tired. But past a certain point, the turbocharger becomes a drag.

Two things stand out. First, spec OFF is surprisingly stable: only a 19% drop across the entire 120k window, from 22.3 to 18.0 tok/s. The model's autoregressive baseline is memory-bandwidth bound and barely sensitive to context length on its own. Second, spec ON drops 68% over the same range, from 52.5 to 16.6 tok/s. The drafter overhead compounds with the growing attention cost: the shared KV cache it attends over gets larger with every token processed, and that cost grows whether or not the drafter is predicting well.

The crossover lands at around 100k tokens. At 120k, spec OFF is actually faster.
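If you rerun the ladder on your own card, a few lines of Python reproduce the advantage column and estimate the crossover. The data below is copied from the table above; the straight-line interpolation between the two measured depths is my assumption, since the curve was only sampled at 100k and 120k.

```python
from typing import List, Optional, Tuple

# (depth in thousands of tokens, spec ON tok/s, spec OFF tok/s) from the table
LADDER: List[Tuple[int, float, float]] = [
    (0, 52.5, 22.3),
    (5, 46.2, 21.7),
    (10, 40.3, 21.3),
    (20, 38.3, 20.6),
    (50, 27.0, 19.7),
    (100, 19.1, 18.4),
    (120, 16.6, 18.0),
]


def speedups(rows: List[Tuple[int, float, float]]) -> List[Tuple[int, float]]:
    """Spec ON divided by spec OFF at each depth."""
    return [(depth, on / off) for depth, on, off in rows]


def crossover_depth(rows: List[Tuple[int, float, float]]) -> Optional[float]:
    """Linearly interpolate the depth where spec ON stops beating spec OFF."""
    for (d0, on0, off0), (d1, on1, off1) in zip(rows, rows[1:]):
        gap0, gap1 = on0 - off0, on1 - off1
        if gap0 > 0 >= gap1:  # the lead flips sign inside this interval
            return d0 + (d1 - d0) * gap0 / (gap0 - gap1)
    return None


for depth, ratio in speedups(LADDER):
    print(f"{depth:>4}k  {ratio:.1f}x")
print(f"estimated crossover: {crossover_depth(LADDER):.0f}k tokens")
```

On these numbers the interpolated crossover sits a little past 100k, consistent with the table; treat it as a rough marker rather than a measured point.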

It is also worth noting that acceptance rate is workload-dependent. The vLLM bench reported an acceptance length of 3.54 out of 4 on random dataset tokens. The ladder benchmark on War and Peace text showed a consistent ~2.7 out of 4 across all context depths. The inversion is counterintuitive — you might expect coherent prose to be more predictable than random tokens — but vLLM's random dataset feeds uniform random vocab IDs as input, which is a fairly degenerate condition for an LLM to operate in. Models under high uncertainty have a documented tendency to fall back toward repetitive or low-entropy outputs, and that kind of output is exactly what a small drafter handles well. The two benchmarks also differ in concurrency and decode settings, which complicates a direct comparison further. The takeaway isn't that one number is wrong, it's that 3.54/4 isn't the figure that will generalize to a real workload. The 2.7/4 on coherent prose is closer to what an agent on real text will see.

One more thing worth naming: MTP only touches the generation side. Prefill is compute-bound and speculative decoding does nothing for it. For a read-heavy agent continuously ingesting new documents, the time spent waiting for the model to process each new chunk of context is unaffected by whether spec decode is on or off. That is the next constraint, and prefix caching is what addresses it: if the agent revisits the same source material across multiple reasoning steps, the cached KV pages are free.

For a typical agentic task in the short to medium context range, this is not a concern. The roughly 2x advantage holds through 20k tokens and is still meaningful at 50k. But for a task designed to fill the full context window, the honest recommendation is to pick the configuration based on your expected average depth: spec ON for workloads that mostly stay under 50k, spec OFF if your agent spends most of its time deep in a 100k+ session. vLLM doesn't let you flip --speculative-config per request, so this is a server-launch decision, not a runtime one.

These numbers are also conservative in a second way: they come from near-default vLLM settings. There is meaningful headroom on top of both curves. The most impactful levers:

NVFP4 weights + FP8 KV cache: the production setup from the previous section. Cuts weight footprint from ~62GB to ~19GB and halves KV cache memory, freeing headroom for larger batches or longer contexts.
--enable-chunked-prefill: overlaps prefill computation with ongoing decode steps. Helps TTFT under load without touching throughput.
Prefix caching: if the agent re-reads the same documents across multiple reasoning steps, vLLM shares KV pages across those requests instead of recomputing them. For a research loop that revisits the same source material, this is a significant multiplier.
FlashInfer attention backend (--attention-backend flashinfer): optimized for Blackwell, can improve throughput over the default backend at longer context lengths where the attention step dominates.

The Pilot

The benchmarks answer the speed question. The actual workflow question is: what do you point at this thing?

For the agent layer, I have been using Pi. Minimal terminal harness, tiny system prompt, fully extensible. No context bloat, no baked-in opinions about how your workflow should look. For marathon tasks where every token in the context window has to earn its place, lean tooling matters.

Pointing it at the local engine is one config file. Add this to ~/.pi/agent/models.json:

{
  "providers": {
    "vllm-gemma4": {
      "baseUrl": "http://localhost:8000/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "google/gemma-4-31B-it",
          "contextWindow": 128000
        }
      ]
    }
  }
}

Switch to it with /model. Pi talks to your local vLLM instance the same way it would talk to any cloud endpoint. The Brain and the Pilot stay fully decoupled: one handles raw inference speed, the other handles the logic and the goal.

Which left only one thing to find out: whether the whole stack actually holds up overnight.

The Marathon

A couple of days later. Same shape of task, slightly different flavor: point the agent at a pile of raw sources (papers, scattered docs, half-finished notes) and have it build a Karpathy-style LLM wiki out of them. Structured markdown files and entity pages, the whole thing knitting itself together as it went. The kind of job that rewards grinding: read, summarize, link, double back, refine. I pointed Pi at the local vLLM endpoint, set it running just before midnight, and went to sleep.

This time I woke up to a populated wiki/ directory. Forty-something markdown files, a few hundred wikilinks, and a conflicts.md where the agent had flagged two sources disagreeing instead of silently picking a winner. No frozen terminal. No 12:10 AM service outage. The engine had just kept going through the night, on my desk, at whatever speed MTP and a 31B model could manage on consumer silicon.

That's what the marathon engine is actually for. Not to beat the cloud giants on a single hard reasoning step; it won't, and I don't ask it to. To be the thing that's still there at 3 AM, still working, when the clever model is down or rate-limited or metering every token. The "babysitting" problem I used to have with local models wasn't really about intelligence. It was about endurance, and a serving stack that didn't fall over. Both of those, finally, are being solved.

Verdict

A year ago, "local model" and "marathon agent" did not belong in the same sentence. The hardware was wrong, the serving stack was wrong, and the speed was definitely wrong. The frontier was something you rented by the token, and that was the deal.

That deal is now negotiable.

The deal changed because the models got good enough and the serving stack finally reached acceptable levels, with MTP as a good bonus on top of that. The benchmarks back up the headline at the depths where most agentic work actually lives. From a fresh start through 50k tokens, speculative decoding delivers a consistent 1.4x to 2.4x speedup over the autoregressive baseline. That is shy of the "up to 3x" top-line number, but it is a measured, reproducible win on real prose, with a verification step that mathematically guarantees the same output distribution as the target model alone. The drafter does what it claims, the acceptance algorithm holds, and the engine stays honest.

A few caveats worth naming before the takeaways:

The advantage is not flat across the context window. MTP shines early; gains narrow as the KV cache grows and the drafter overhead compounds with attention cost. Measure for your own workload before assuming the headline number applies everywhere.
Spec decode only touches generation. Prefill is a separate problem. For read-heavy agents that re-ingest the same documents, prefix caching matters more than MTP.
Acceptance rate is workload-dependent. Random benchmark tokens behaved differently from coherent prose in my tests. One number will not tell you what your stack actually does.

The takeaways:

Use the giants for sprints. When precision on a single hard reasoning step is what you need, the trillion-parameter models still win. That is not changing for a while (I hope I am wrong).
Use a local marathon engine for routine tasks. Anything that grinds: multi-hour scraping, knowledge-base construction, batch summarization, agent loops with dozens of self-correction steps. The economics flip the moment your task crosses the API quota line.
vLLM + Gemma 4 + MTP is the current sweet spot. Not because it beats everything else on IQ, but because it is the first stack where consumer hardware, modern serving infrastructure, and decent generation speed all line up at the same time.
Decouple Brain and Pilot. Keep inference (vLLM) separate from orchestration (Pi, or whatever you reach for). The Brain optimizes tokens per second. The Pilot optimizes getting the job done. Treating them as one thing is the bug behind half the local-agent frustrations I have seen.

The failed auto job that opened this post was not a failure of intelligence. It was a failure of foundation. Now there is a real alternative that fits on consumer hardware and runs without a token quota.

It is not the smartest model in the world. It is the one that works tirelessly and locally.

Decoding Bronze Age Paperwork: Modern AI vs. Ancient Assyrian Clay Tablets

Ertuğrul Demir — Sat, 28 Mar 2026 12:17:54 +0000

Four thousand years ago, Assyrian merchants were doing what people have always done: tracking debts, chasing payments, arguing over contracts. They pressed these records into clay tablets. Not sacred texts, not epic poetry. Just the ancient equivalent of office emails.

Nearly 23,000 of these tablets survive. Half have never been translated, not because they're damaged, but because only a handful of people on Earth can read Old Assyrian.

When the Deep Past Initiative turned this into a Kaggle competition (build a machine translation system for Old Assyrian cuneiform), I jumped in. The task: take transliterated text (cuneiform signs converted to Latin characters) and produce an English translation.

The training set? Around 1500 pairs. That's it.

For context, standard translation models train on millions of sentence pairs. Even research on "low-resource" languages works with tens of thousands. We got fifteen hundred documents and a pat on the back.

So the question was straightforward: how do you build a translation model when you barely have any data, for a language that no modern tokenizer has ever seen, where every proper noun and number matters because these are legal and financial records?

What started as "fine-tune a model on some ancient text" turned into a full-stack AI pipeline: Gemini vision for OCR-ing scanned academic books, LLMs for sentence alignment and cross-lingual translation, ByT5 as a byte-level backbone that doesn't choke on cuneiform, Unsloth for efficient LoRA training, and vLLM for fast inference on Kaggle T4s. The results surprised us.

Let's start with why the obvious approaches don't work.

Why the Obvious Approaches Don't Work

The first thing I tried was what everyone tries — throw a pretrained LLM at it. Gemma, Qwen, the usual suspects. Prompt it with some examples, let it translate.

And honestly? The outputs look pretty good at first glance. Fluent English, reasonable sentence structure, feels like it could be right. But "feels right" is dangerous when you're translating ancient legal documents.

The problem is hallucination — and not the subtle kind. These models confidently fill in names of merchants, cities, and commodities that simply aren't in the source text. When the transliteration says A-šùr-i-dí the model might output a completely different name that sounds plausibly Bronze Age. When it hits an unfamiliar trade term, it improvises. For documents where every name, every number, every commodity is the actual information — that's not a minor quality issue, it's the whole problem.

Ok so what about standard encoder-decoder translation models? Here the issue is more fundamental: tokenization. Modern tokenizers are trained on modern text. Akkadian transliteration is a different universe — hyphenated syllable sequences like a-na, Sumerian logograms in ALL CAPS like KÙ.BABBAR, determinatives in curly braces like {d} and {ki}, subscript digits encoding phonetic variants like il₅, and gap markers like <gap> for broken sections of the physical tablet.

Feed this into a standard tokenizer and it fragments on every character it hasn't seen. Proper nouns that have never appeared in any pretraining corpus get silently mangled. The <gap> markers that indicate missing text get treated as noise or special tokens.

So: decoder-only models hallucinate, standard translation models can't tokenize the input properly. What actually fits this problem?

ByT5 — The Right Tool for a Weird Job

One of the best things about Kaggle competitions is the community. People share findings, discuss approaches in the forums, and collectively narrow down what works. Early on, several participants converged on the same answer: ByT5.

Image from "ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models" (Xue et al., 2021)

ByT5 comes from a 2021 Google Research paper — "Towards a Token-Free Future with Pre-trained Byte-to-Byte Models". The idea is simple and kind of radical: skip tokenization entirely. Instead of mapping text to a learned vocabulary of subwords, ByT5 operates directly on raw bytes. A standard Transformer, minimal modifications, just processing one byte at a time.

Why does this matter for our problem? Because every character is valid input by definition. It doesn't matter that A-mur-{d}UTU has never appeared in any pretraining corpus — ByT5 doesn't need it to. No vocabulary misses, no fragmented tokens, no special handling for curly braces or subscript digits. The model just sees bytes.
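A few lines of plain Python show what "the model just sees bytes" means in practice. This sketch mirrors the ByT5 scheme as I understand it (each UTF-8 byte shifted by 3, because ids 0, 1, and 2 are reserved for padding, end-of-sequence, and unknown) but skips the end-of-sequence token and does not use the real tokenizer class; the function names are mine.

```python
from typing import List

SPECIAL_TOKEN_OFFSET = 3  # ids 0, 1, 2 are reserved for pad, eos, unk


def byt5_ids(text: str) -> List[int]:
    """Map text to byte-level ids: one id per UTF-8 byte, shifted past the specials."""
    return [byte + SPECIAL_TOKEN_OFFSET for byte in text.encode("utf-8")]


def byt5_decode(ids: List[int]) -> str:
    """Invert the mapping: shift ids back to bytes and decode as UTF-8."""
    return bytes(i - SPECIAL_TOKEN_OFFSET for i in ids).decode("utf-8")


sample = "A-mur-{d}UTU il₅ KÙ.BABBAR <gap>"
ids = byt5_ids(sample)
print(len(sample), "characters ->", len(ids), "ids")
print(byt5_decode(ids) == sample)
```

Diacritics and subscript digits simply cost two or three ids each instead of one. Nothing is ever out of vocabulary; the price is longer sequences.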

The paper also showed something else that turned out to be critical: byte-level models are significantly more robust to noise. When your source text comes from OCR'd editions of clay tablets, with inconsistent transcription conventions across different scholars and decades, that robustness isn't a nice-to-have, it's a requirement.

Architecture: solved. Now came the harder problem — we had the right model, but nowhere near enough data to train it.

The Data Problem

With ByT5 as the architecture, the bottleneck shifted entirely to data. And the competition host made the challenges very clear in a public discussion post.

Two things consistently broke translations more than anything else:

Named entities. Personal names, place names, divine names — they're transliterated inconsistently across different editions, they preserve older spelling conventions, and they're completely opaque to the model. In practice, many otherwise reasonable translations failed because a name got mangled, dropped, or hallucinated. The host even prepared an onomasticon (a curated list of attested name spellings) as supplemental data to help with this.

Transliteration format inconsistency. Different corpora encode the same text using different conventions. One participant converted diacritics to ASCII before training (š → sz, ú → u2) — a reasonable instinct, but the evaluation data expects diacritics. Collapsing ṣ into S₂ or š into SZ removes distinctions that are semantically meaningful in Akkadian. The rule was clear: normalize toward the format used in the evaluation set, not away from it.

On top of that, gap handling was tricky. Damaged sections of tablets are marked with <gap>, but the training data wasn't perfectly aligned — sometimes a large gap appears in the transliteration but not in the translation, forcing the model to learn misalignment rather than translation. Edge cases like <gap>-A-šùr (a gap attached to a word) needed to be preserved, not blindly stripped.
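As a rough illustration of "normalize toward the evaluation format," the sketch below converts ASCII conventions back to diacritics while passing `<gap>` markers through untouched. It is not the pipeline used in the competition: the mapping covers only the two pairs mentioned above (sz and u2), the function name is mine, and the sample string is made up for the example. A real table would need every convention found in your corpora.

```python
import re
from typing import List, Pattern, Tuple

# Illustrative subset only: the two ASCII conventions mentioned in the text.
ASCII_TO_DIACRITIC: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"sz"), "š"),
    (re.compile(r"u2(?!\d)"), "ú"),  # do not touch longer numbers such as "u20"
]
GAP_SPLIT = re.compile(r"(<gap>)")


def normalize(text: str) -> str:
    """Rewrite ASCII conventions as diacritics; keep <gap> markers exactly as-is."""
    pieces = []
    for part in GAP_SPLIT.split(text):
        if part == "<gap>":
            pieces.append(part)  # preserved, even when attached to a word
            continue
        for pattern, replacement in ASCII_TO_DIACRITIC:
            part = pattern.sub(replacement, part)
        pieces.append(part)
    return "".join(pieces)


print(normalize("i-sza-qal u2-ul <gap>-A-šùr"))
```

Text that already uses diacritics passes through unchanged, so the function can be applied uniformly across mixed corpora.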

The host's closing point stuck with me: these aren't model architecture problems. They're data problems. And with only ~1,500 training pairs, every one of these issues hits harder because the model sees so few examples to learn from.

So the path forward was obvious — find more data.

Finding More Data — The AKT Books

The training data had to come from somewhere. The competition hosts pointed the way — they shared scanned PDFs of the AKT series (Anatolian Kültepe Texts), a multi-volume scholarly publication of Old Assyrian tablets from the Kültepe excavations in Turkey. Each volume contains transliterations and translations of tablets. Exactly the domain, exactly the format we needed.

The catch? These are academic books published between 1990 and the 2020s, by different authors, in different languages. AKT 1, 2, 4, 9a, and 10 are in Turkish. AKT 3 is in German. Each volume has its own layout, its own heading conventions, its own way of marking tablet edges and sections. Different fonts, different editorial styles, different decades of typesetting.

This isn't structured data you can parse with a script. These are scanned pages of physical books — some crisp, some not — where a tablet's transliteration might start on one page and continue on the next, where scholarly commentary sits right next to the translation text, and where the format changes just enough between volumes that nothing generalizes cleanly.

But inside these messy PDFs was exactly what we were starving for: hundreds of additional transliteration-translation pairs, many with line-by-line alignment that the original training set didn't have.

The question was whether I could extract it reliably enough to actually help the model — or whether the noise would make things worse. This is where Gemini's multimodal capabilities came in — specifically its ability to understand page layouts, distinguish between transliteration blocks and commentary, and handle multilingual content out of the box. I decided to build the pipeline.

The Extraction Pipeline

Building this pipeline was its own mini-project. Each step solved one problem and revealed the next.

Step 1: PDF → Page Images

The simplest step: render each PDF page as a numbered PNG. This is the only part that runs purely locally. Everything else goes through Gemini.

Step 2: Page Images → Structured JSON

Each page image gets sent to Gemini's vision model via the Vertex AI Batch API. The flow: build a JSONL of requests (one per page image, referencing GCS URIs), submit to Vertex, parse the predictions back.

A quick note on why batch inference: when you're processing hundreds of pages and don't need real-time responses, the Batch API is a no-brainer. You get a 50% discount over standard inference, much higher rate limits, and the service handles parallelization and retries for you — typically completing within 24 hours. You submit one job, go do something else, come back to results. For a pipeline like this where I was processing multiple books with hundreds of pages each, it saved both money and sanity.

The request construction:

def build_request(gcs_uri: str, prompt_text: str) -> dict:
    return {
        "request": {
            "contents": [{
                "role": "user",
                "parts": [
                    {"fileData": {"mimeType": "image/png", "fileUri": gcs_uri}},
                    {"text": prompt_text},
                ]
            }],
            "generationConfig": {
                "responseMimeType": "application/json",
                "temperature": 0.0,
                "mediaResolution": "MEDIA_RESOLUTION_HIGH",  # needed for diacritics
                "thinkingConfig": {"thinkingLevel": "MEDIUM"},
            },
        }
    }

We used gemini-3.1-flash-lite-preview with medium thinking enabled — the reasoning step helped significantly with understanding complex page layouts and making correct decisions about where one tablet ends and another begins.
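
To make the "build a JSONL of requests" step concrete, here is a minimal sketch. The `build_requests_jsonl` helper, the bucket prefix, and the `page_0001.png` naming scheme are assumptions for illustration, and `build_request` is a trimmed copy of the function above with `generationConfig` left out.

```python
import json


def build_request(gcs_uri: str, prompt_text: str) -> dict:
    # Trimmed copy of the request shape shown above (generationConfig omitted).
    return {
        "request": {
            "contents": [{
                "role": "user",
                "parts": [
                    {"fileData": {"mimeType": "image/png", "fileUri": gcs_uri}},
                    {"text": prompt_text},
                ],
            }]
        }
    }


def build_requests_jsonl(gcs_prefix: str, num_pages: int, prompt_text: str) -> str:
    """Return JSONL text with one request per page image (hypothetical helper)."""
    lines = []
    for page in range(1, num_pages + 1):
        uri = f"{gcs_prefix}/page_{page:04d}.png"  # assumed naming scheme
        lines.append(json.dumps(build_request(uri, prompt_text), ensure_ascii=False))
    return "\n".join(lines) + "\n"
```

Zero-padded page numbers keep the files sorted in the bucket listing, and they match the `page_(\d+)\.png` pattern used later to recover the page number.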

Submit with the Vertex AI Batch API:

from google import genai
from google.genai.types import CreateBatchJobConfig

client = genai.Client(vertexai=True, project=project, location=location)
job = client.batches.create(
    model="gemini-3.1-flash-lite-preview",
    src="gs://your-bucket/book/ocr_batch/requests.jsonl",
    config=CreateBatchJobConfig(
        dest="gs://your-bucket/book/ocr_batch/predictions/"
    ),
)

One gotcha that bit me early: predictions come back shuffled. You can't rely on line order in the output — you have to extract the page number from each prediction's original request URI:

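import re

def extract_page_num(pred: dict) -> int:
    uri = pred["request"]["contents"][0]["parts"][0]["fileData"]["fileUri"]
    m = re.search(r"page_(\d+)\.png", uri)
    if m is None:  # fail loudly instead of crashing on m.group
        raise ValueError(f"no page number in request URI: {uri}")
    return int(m.group(1))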

This is actually a feature — it forces you to write robust parsing from the start.
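
Putting the pages back in order is then a sort on that key. A minimal sketch, assuming the predictions file has one JSON object per line and each object echoes the original request under `"request"` (as the snippet above relies on); `order_predictions` is a hypothetical helper name.

```python
import json
import re


def extract_page_num(pred: dict) -> int:
    uri = pred["request"]["contents"][0]["parts"][0]["fileData"]["fileUri"]
    m = re.search(r"page_(\d+)\.png", uri)
    if m is None:
        raise ValueError(f"no page number in request URI: {uri}")
    return int(m.group(1))


def order_predictions(jsonl_text: str) -> list[dict]:
    """Parse a predictions JSONL and return the records sorted by page number."""
    preds = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return sorted(preds, key=extract_page_num)
```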

Every AKT volume needs its own prompt. Different heading formats, different edge markers (Ö.y., Ak. for Turkish volumes; Vs., Rs. for German), different conventions for commentary blocks. Get this wrong and you extract commentary as translation, or merge two tablets into one.

Step 3: JSON Pages → Tablets CSV

A book-specific export script aggregates all the per-page JSONs into a flat CSV — one row per tablet with combined transliteration and translation fields. Each volume needs its own exporter because the structure varies enough that a generic one would silently break.
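
As a rough sketch of what such an exporter does, assume each page JSON looks like `{"page": 3, "tablets": [{"id": ..., "transliteration": ..., "translation": ...}]}`. That schema and the function names are made up for illustration; the real per-volume schemas differ.

```python
import csv
import io


def pages_to_rows(pages: list[dict]) -> list[dict]:
    """Merge per-page tablet blocks into one row per tablet, in page order."""
    tablets: dict[str, dict] = {}
    for page in sorted(pages, key=lambda p: p["page"]):
        for block in page["tablets"]:
            row = tablets.setdefault(
                block["id"],
                {"tablet_id": block["id"], "transliteration": [], "translation": []},
            )
            row["transliteration"].append(block.get("transliteration", ""))
            row["translation"].append(block.get("translation", ""))
    return [
        {
            "tablet_id": r["tablet_id"],
            "transliteration": " ".join(t for t in r["transliteration"] if t),
            "translation": " ".join(t for t in r["translation"] if t),
        }
        for r in tablets.values()
    ]


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize the tablet rows as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["tablet_id", "transliteration", "translation"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Keying on the tablet id is what lets a tablet that continues onto the next page end up in a single row.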

Step 4: Visual QC

Dump everything to an HTML file and actually look at it. This is where you spot the real problems: misread headings, commentary leaking into translation fields, duplicate translations from continuation pages. No amount of automated testing replaces eyeballing the output.
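
The dump itself can be tiny. A minimal sketch, assuming rows with `tablet_id`, `transliteration`, and `translation` keys (hypothetical column names):

```python
import html


def rows_to_html(rows: list[dict]) -> str:
    """Render tablet rows as a plain HTML table for eyeballing."""
    parts = [
        "<table border='1'>",
        "<tr><th>tablet</th><th>transliteration</th><th>translation</th></tr>",
    ]
    for row in rows:
        cells = "".join(
            f"<td>{html.escape(row[key])}</td>"  # escape so <gap> stays visible
            for key in ("tablet_id", "transliteration", "translation")
        )
        parts.append(f"<tr>{cells}</tr>")
    parts.append("</table>")
    return "\n".join(parts)
```

Escaping matters here: without it the browser would swallow `<gap>` markers as unknown tags, and you would miss exactly the kind of thing QC is supposed to catch.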

Step 5: Cleanup

Book-specific cleanup scripts apply the fixes found during QC — drop bad rows, merge tablets that got split across pages, strip commentary that leaked through. Unglamorous and manual but completely necessary.

Step 6: Sentence Chunking + Translation

Here's where it gets interesting again. The original training data is document-level — full tablet in, full translation out. But the AKT books have something better: line-by-line structure. Each transliteration line has a marker ((Vs.1), (2), (Rs.14)) and each translation sentence references those markers.

A second Gemini batch job handles two things at once: align transliteration lines to translation sentences by marker, and translate the non-English content (Turkish or German) into English. For each tablet, I retrieved the most similar examples from the official training set using TF-IDF cosine similarity and included them as few-shot context. This turned out to be crucial — not just for translation quality, but for matching the distribution of the host's wording, style, and terminology choices. The model wasn't just translating, it was learning to translate the way the competition data expected.

Same batch pattern — build JSONL, submit, parse shuffled predictions.
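
The retrieval side needs nothing fancy. Here is a dependency-free sketch of TF-IDF cosine similarity over character trigrams. The `TfidfRetriever` class, the trigram size, and the smoothed IDF formula are my choices for illustration, not necessarily what our pipeline used.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


class TfidfRetriever:
    def __init__(self, corpus: list[str], n: int = 3):
        self.n = n
        counts = [char_ngrams(doc, n) for doc in corpus]
        df = Counter(gram for c in counts for gram in c)
        # Smoothed IDF so n-grams present in every document keep a small weight.
        self.idf = {g: math.log((1 + len(corpus)) / (1 + d)) + 1 for g, d in df.items()}
        self.vectors = [self._weigh(c) for c in counts]

    def _weigh(self, counts: Counter) -> dict[str, float]:
        """TF-IDF weights, L2-normalized so a dot product is cosine similarity."""
        vec = {g: tf * self.idf[g] for g, tf in counts.items() if g in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {g: w / norm for g, w in vec.items()}

    def top_k(self, query: str, k: int = 5) -> list[int]:
        """Return indices of the k most similar corpus entries."""
        q = self._weigh(char_ngrams(query, self.n))
        scores = [sum(w * vec.get(g, 0.0) for g, w in q.items()) for vec in self.vectors]
        return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

Character n-grams suit transliterations well because sign sequences like `um-ma` or `a-na` recur across tablets even when the word boundaries are inconsistent.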

Step 7: Normalization

Most of the invisible work happened here. The competition test set uses a specific character format, and the books don't match it. Every volume has its own OCR artifacts, its own conventions.

A few examples from the normalization stack:

ḫ/Ḫ → h/H (test set uses plain H)
Unicode subscripts → plain digits (₄ → 4)
Superscript determinatives → brace format (ᵈ → {d}, ᵏⁱ → {ki})
OCR artifacts: KU.BABBAR → KÙ.BABBAR, ś → š, ş → ṣ
Gap deduplication: <gap> <gap> → <gap>, while preserving attachments like <gap>-A-šùr

For a character-level model like ByT5, this isn't cosmetic. A single character mismatch between training and test — ḫ vs h, ₄ vs 4 — is invisible to a human reviewer and catastrophic to a model that has learned exactly one representation.
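
A condensed sketch of those rules, covering only the examples listed above. The real stack has many more per-volume cases, and the gap rule here is one interpretation: bare `<gap>` repeats collapse, while a `<gap>` followed by a hyphen counts as attached and is left alone.

```python
import re
import unicodedata

SUBSCRIPTS = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")
CHAR_FIXES = str.maketrans({"ḫ": "h", "Ḫ": "H", "ś": "š", "ş": "ṣ"})
DETERMINATIVES = {"ᵏⁱ": "{ki}", "ᵈ": "{d}"}  # only the two examples from the list


def normalize(text: str) -> str:
    # Compose characters first so the single-character lookups below can match.
    text = unicodedata.normalize("NFC", text)
    text = text.translate(CHAR_FIXES).translate(SUBSCRIPTS)
    for superscript, braced in DETERMINATIVES.items():
        text = text.replace(superscript, braced)
    text = text.replace("KU.BABBAR", "KÙ.BABBAR")
    # Collapse runs of bare <gap> tokens; a <gap> followed by "-" is kept.
    return re.sub(r"<gap>(?:\s+<gap>(?!-))+", "<gap>", text)
```

Order matters a little: the NFC step has to come first, because a decomposed `h` plus combining mark would slip past the translation table.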

Step 8: Merge

The final step pulls normalized chunks into the main training set. Starting from ~1,500 pairs, the pipeline roughly multiplied our available training data — and more importantly, added sentence-level pairs that gave the model a much finer-grained learning signal than document-level translations alone.

Training — ByT5 Gets You Far, Then Stops

With the expanded dataset ready, training ByT5 was straightforward — standard seq2seq encoder-decoder training using HuggingFace Transformers. No tricks, no exotic schedulers. The model picked up patterns fast and translated training-domain tablets surprisingly well.

But then the leaderboard scores started telling a different story.

In our case, the hidden test set on Kaggle seemed to have a different distribution than what we trained on. Our best guess: different books, different topics, different translator styles, unfamiliar names and locations. Our ByT5 was doing well on what it had seen directly in training, but the leaderboard scores suggested it wasn't generalizing beyond that.

We hit a ceiling. Many teams went on to have great success pushing ByT5 further — better augmentation, longer training, smarter tricks I guess. But in our setup, the gains had stalled, and we decided to explore a different direction.

Back to Decoder-Only — But This Time, Fine-Tuned

This is where the story comes full circle. Earlier, we'd dismissed decoder-only LLMs because they hallucinate. That's still true — out of the box. But fine-tuning changes the picture completely.

The reasoning was simple: ByT5 and Qwen were solving different problems. ByT5 was a great fit for the transliteration itself — every character mattered, and byte-level modeling let it handle weird orthography, diacritics, subscripts, and determinatives without fighting the tokenizer. But once the task became generalization across unfamiliar tablets, translator styles, and topic shifts, Qwen3.5 had something ByT5 didn't: much stronger pretrained language knowledge.

Out of the box, that strength was useless because it came with hallucination. Fine-tuning changed that. LoRA gave us a way to keep the model's broader language ability while grounding it in the task and the dataset. Instead of prompting a general-purpose model and hoping for the best, we trained a lightweight adapter on our curated examples. Combined with few-shot prompting to match the host's translation style, the fine-tuned Qwen handled the distribution shifts that our ByT5 couldn't.

Fine-Tuning with Unsloth — Making LLMs Affordable

Before diving into the training details, a quick primer for anyone who hasn't fine-tuned a model before.

The naive approach to fine-tuning a large language model means updating all its parameters — billions of them. That requires serious hardware, serious memory, and serious money. For a Kaggle competition where you're iterating fast on limited GPUs, it's a non-starter.

This is where LoRA (Low-Rank Adaptation) comes in. Instead of updating the entire model, you freeze the original weights and train a small set of adapter matrices on top. You get most of the benefits of full fine-tuning at a fraction of the cost. QLoRA takes it a step further by quantizing the base model to 4-bit precision, which dramatically cuts memory usage — making it possible to fine-tune models that would otherwise never fit on a single GPU.
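
To get a feel for the savings, here is the arithmetic for a single weight matrix. The 4096×4096 shape and rank 16 are illustrative numbers, not Qwen3.5's actual dimensions or our training config.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    # W (d_out x d_in) stays frozen; only B (d_out x rank) and A (rank x d_in) train.
    return rank * (d_out + d_in)


full = 4096 * 4096                                  # parameters in the frozen matrix
lora = lora_trainable_params(4096, 4096, rank=16)   # parameters in the adapter

print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

For this layer the adapter trains about 131 thousand parameters against roughly 16.8 million frozen ones, which is under 1%.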

For this project we used Unsloth, which makes the whole process surprisingly painless. It handles the LoRA/QLoRA setup, optimizes training to run ~2x faster with ~70% less VRAM, and supports a wide range of models out of the box — including Qwen3.5, which is what we needed.

The training itself was SFT (Supervised Fine-Tuning) using Unsloth's built-in SFT trainer. We structured our data as chat conversations: a system prompt setting the role of an expert Assyriologist, few-shot examples retrieved via TF-IDF similarity, and the target tablet as the final user message. The model only learns from the assistant completion — the actual translation.

# each training example looks like this
messages = [
    {"role": "system", "content": "You are an expert Assyriologist..."},
    # few-shot examples from similar tablets
    {"role": "user", "content": "Translate: um-ma A-šùr-i-dí-ma ..."},
    {"role": "assistant", "content": "Thus says Aššur-idī: ..."},
    {"role": "user", "content": "Translate: um-ma Pu-šu-ki-in-ma ..."},
    {"role": "assistant", "content": "Thus says Pūšu-kēn: ..."},
    # the actual tablet to translate
    {"role": "user", "content": "Translate: a-na A-lim {ki} ..."},
    {"role": "assistant", "content": "To the City: ..."},  # model learns this
]

An important detail here: we used completion-only masking. The loss is computed only on the assistant's translation tokens — the prompt tokens (system message, few-shot examples, user messages) are masked out during training. This means the model isn't wasting capacity learning to predict the input; it's focused entirely on producing accurate translations.
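
Mechanically, masking just means the labels for prompt positions are set to an ignore value. A minimal sketch on raw token ids, assuming you already know the index where the final assistant turn starts; `completion_only_labels` is a hypothetical helper, and in practice the trainer handles this for you.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def completion_only_labels(input_ids: list[int], completion_start: int) -> list[int]:
    """Copy input_ids as labels, masking everything before the completion."""
    return [IGNORE_INDEX] * completion_start + input_ids[completion_start:]


# Toy example: five tokens, the last two are the assistant's translation.
labels = completion_only_labels([11, 12, 13, 14, 15], completion_start=3)
print(labels)  # [-100, -100, -100, 14, 15]
```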

This meant the model wasn't just learning to translate — it was learning to translate in context, grounded by similar examples. The same retrieval and prompt structure would be used at inference time, so there was no gap between how the model trained and how it would be evaluated.

One direction we started exploring but ran out of time for: reinforcement learning on top of the fine-tuned model. The idea was to use GRPO (Group Relative Policy Optimization) with custom reward functions — combining the competition metric itself, gap alignment between transliteration and translation, and length balance — to push the model beyond what SFT alone could achieve. Each reward would target a specific failure mode that supervised training couldn't address directly. We didn't get there before the deadline, but it felt like the natural next step.
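
Since we never built this, treat the following as a sketch of the idea only. The weights, the scoring formulas, and every function name are hypothetical, and `metric_fn` stands in for the competition metric.

```python
from typing import Callable


def gap_alignment(source: str, translation: str) -> float:
    """1.0 when both sides have the same number of <gap> markers, less otherwise."""
    diff = abs(source.count("<gap>") - translation.count("<gap>"))
    return 1.0 / (1 + diff)


def length_balance(translation: str, reference: str) -> float:
    """Ratio of the shorter to the longer text, measured in words."""
    a, b = len(translation.split()), len(reference.split())
    return min(a, b) / max(a, b) if max(a, b) else 1.0


def combined_reward(
    source: str,
    translation: str,
    reference: str,
    metric_fn: Callable[[str, str], float],
    weights: tuple[float, float, float] = (0.6, 0.2, 0.2),  # made-up weights
) -> float:
    w_metric, w_gap, w_len = weights
    return (
        w_metric * metric_fn(translation, reference)
        + w_gap * gap_alignment(source, translation)
        + w_len * length_balance(translation, reference)
    )
```

Keeping each term as its own function is deliberate: each one targets a single failure mode, so you can log them separately and see which one the policy is gaming.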

Inference — vLLM on Kaggle T4s

With a fine-tuned model ready, the next challenge was actually running it within Kaggle's competition constraints. This is a code competition — no internet access at submission time, two T4 GPUs with 16GB VRAM each, and a strict time limit.

A quick intro on vLLM for those unfamiliar: it's an open-source inference engine originally developed at UC Berkeley that's become the go-to for serving LLMs efficiently. The key innovation is PagedAttention — instead of pre-allocating a fixed block of memory for each sequence's key-value cache, it pages the KV cache dynamically, similar to how operating systems manage virtual memory. This means you can serve larger models on less hardware. On top of that you get continuous batching, optimized CUDA kernels, tensor parallelism, and seamless HuggingFace model support out of the box.
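
The paging idea is easier to see in a toy. This sketch only does the bookkeeping: which fixed-size blocks belong to which sequence. It is not vLLM's implementation, and the class and method names are invented for illustration.

```python
class PagedKVCache:
    """Toy block table: sequences claim fixed-size KV blocks only as they grow."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.lengths: dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # first token, or current block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

A sequence of 17 tokens with a block size of 16 holds two blocks, not a whole pre-allocated `max_model_len` slab, and those blocks go back to the pool the moment it finishes.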

Sounds perfect, right? In theory. In practice, we hit a wall.

Qwen3.5 was released in the final weeks of the competition. The model was brand new, and vLLM support was experimental and unstable. On top of that, Kaggle's T4 GPUs have compute capability 7.5, which means no FlashAttention 2 support. We had to fall back to the Triton attention backend, wrestle with environment compatibility issues, and work around the fact that you can't pip install anything at submission time: every dependency needs to be pre-packaged in your dataset.

Getting a 9B parameter model to load, run, and generate translations on two T4s without crashing was its own mini-project. Tensor parallelism across both GPUs was non-negotiable — the model simply wouldn't fit on a single card.

from vllm import LLM

llm = LLM(
    model=MODEL_PATH,
    dtype="float16",
    max_model_len=16000,
    gpu_memory_utilization=0.85,
    enforce_eager=True,            # no CUDA graphs on T4
    tensor_parallel_size=2,        # split across both T4s
)
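
The "wouldn't fit on a single card" part is plain arithmetic. A back-of-the-envelope sketch, assuming exactly 9e9 parameters, float16 weights, an even split across GPUs, and ignoring the KV cache and activations, which need room too:

```python
GIB = 1024**3


def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone (float16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param / GIB


total = weight_memory_gib(9e9)   # assumes exactly 9e9 parameters
per_gpu = total / 2              # tensor_parallel_size=2, assuming an even split
budget = 16 * 0.85               # gpu_memory_utilization=0.85 of a nominal 16 GB card

print(f"weights: {total:.1f} GiB, per GPU: {per_gpu:.1f} GiB, budget per GPU: {budget:.1f}")
```

The weights alone exceed what one T4 offers, while half of them leaves a few gigabytes per card for the KV cache.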

The inference prompt mirrors the training setup exactly — same system prompt, same TF-IDF few-shot retrieval. For each test tablet, we retrieve the 5 most similar examples from our training data and include them as conversation context:

prompts = [
    build_messages(
        transliteration=row["transliteration"],
        few_shot_examples=retriever.top_k(row["transliteration"]),
    )
    for row in test_rows
]
outputs = llm.chat(prompts, sampling_params=sampling_params)

Keeping the inference pipeline identical to training — same prompt structure, same retrieval, same style anchoring — meant the model was seeing exactly the kind of input it was trained on. No distribution shift at inference time.
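
For reference, `build_messages` can be as small as this. It is a sketch that assumes few-shot examples arrive as `(transliteration, translation)` pairs and reuses the `Translate:` prefix from the training example; the system prompt stays elided, as it is in the article.

```python
SYSTEM_PROMPT = "You are an expert Assyriologist..."  # elided in the article, kept as-is


def build_messages(
    transliteration: str,
    few_shot_examples: list[tuple[str, str]],
) -> list[dict]:
    """Build the chat prompt: system role, retrieved examples, then the target tablet."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for source, translation in few_shot_examples:
        messages.append({"role": "user", "content": f"Translate: {source}"})
        messages.append({"role": "assistant", "content": translation})
    # The final user turn is the tablet to translate; the model writes the reply.
    messages.append({"role": "user", "content": f"Translate: {transliteration}"})
    return messages
```

Sharing one function between the training data builder and the inference script is the simplest way to guarantee the two prompts never drift apart.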

Results and Reflections

Our team finished with a silver medal out of 2500+ teams. In the final days of the competition, the OCR extraction pipeline was still producing new data — each batch of cleaned and normalized tablets pushed our scores higher. We genuinely felt like gold was within reach with a couple more days. That stings a bit, but honestly? The journey was worth more than the medal.

Here's what I'm taking away from this:

Gemini's batch inference is a superpower for unstructured data. We used it to turn scanned academic books from the 1990s — messy layouts, multiple languages, inconsistent formatting — into clean, structured training data. If it works for 4,000-year-old Assyrian tablets in Turkish and German PDFs, it'll work for your use case too. The Vertex AI Batch API made it affordable and painless at scale.

Few-shot retrieval is still an easy win. TF-IDF character n-gram similarity is dead simple to implement, and using retrieved examples to anchor both training and inference gave us consistent improvements with minimal effort. Small iterations, big returns.

Fine-tuning is more accessible than you think. LoRA + Unsloth meant we could train a 9B parameter model on Kaggle's free GPUs. You don't need a cluster. You need good data and the right tools.

vLLM makes deployment practical. Even on constrained hardware like Kaggle T4s, with a brand-new model and no internet access, we got a 9B model running with tensor parallelism. The ecosystem is maturing fast.

And the bigger picture — the one that got me into this competition in the first place — is that there are still thousands of untranslated tablets sitting in museums. The pipeline we built here isn't a one-off competition hack. It's a blueprint: scan the books, extract the data, train the models, translate the tablets. The tools already exist. The data is already out there. At this point, the bottleneck is no longer whether this can be done. It's whether someone is willing to do it.

Skills, Not Vibes: Teaching AI Agents to Write Clean Code

Ertuğrul Demir — Mon, 26 Jan 2026 11:17:47 +0000

In February 2025, Andrej Karpathy coined "vibe coding" to describe programming's new reality: give in to the vibes, accept all changes, "forget that the code even exists." He called it "not too bad for throwaway weekend projects." But for production systems? That's where the trouble starts.

I've watched AI-generated codebases accumulate the same mess developers spent decades learning to avoid—duplication everywhere, inconsistent naming, missing edge cases. Then it hit me: these are exactly the problems Robert C. Martin warned about in Clean Code almost two decades ago.

So I went back to the book, specifically Chapter 17's catalog of 66 code smells and heuristics. These aren't just relevant to AI coding—they're more relevant. AI makes exactly the mistakes Uncle Bob warned us about, just faster and at scale.

The solution? Skills—instruction files that AI agents read before writing code. I've translated Clean Code's complete catalog into Python skills you can use today. They work in Google's Antigravity IDE, Anthropic's Claude Code, and anywhere that supports the Agent Skills standard.

Let me show you why we need this, and how to implement it.

Even Linus Torvalds Vibe Codes (Sometimes)

In January 2026, Linus Torvalds revealed a side project called AudioNoise—a digital audio effects simulator he'd been tinkering with over the holidays. The Python visualizer, he noted, was "basically written by vibe-coding."

In his own words from the repo:

"I know more about analog filters—and that's not saying much—than I do about python. It started out as my typical 'google and do the monkey-see-monkey-do' kind of programming, but then I cut out the middle-man—me—and just used Google Antigravity to do the audio sample visualizer."

The Hacker News discussion revealed two camps. Some saw it as validation: "It's official, vibe coding is legit." Others noted the crucial context: Torvalds used AI for the part he lacks expertise in (Python visualization) while hand-coding the parts he knows (C and digital signal processing).

One commenter nailed it: "There's a big difference between vibe-coding an entire project and having an AI build a component that you lack competency for."

Another observation cut deeper: "If anyone on the planet knows how to do vibe coding right, it's him"—because Torvalds spent decades mastering code review. He can spot bad code instantly. Most of us can't.

But here's what's telling: Torvalds wrote tests for his hand-coded C—numerical accuracy checks for the DSP primitives he understands. The vibe-coded Python visualizer? No tests, no type hints, and a duplicated function definition that slipped right through. The same four-line method appears twice in a row—the first an empty stub, the second the real implementation. It's textbook "Accept All, don't read the diffs." The code runs fine (Python silently overwrites the first definition), but it's exactly the kind of dead code that accumulates into maintenance nightmares.

This works for Torvalds' toy project precisely because it's a throwaway learning exercise. The moment that visualizer needs to be production code, those missing guardrails become technical debt.

The same week, Torvalds rejected "AI slop" submissions to the Linux kernel, arguing that documentation telling people not to submit garbage won't help because "the people who would submit it won't read the documentation anyway."

The lesson isn't that vibe coding is bad. It's that context matters. Skills let you define when to enforce rigor and when to let the vibes flow.

The Data: AI Code Quality Is Getting Worse

Google's DORA Report found AI adoption shows a negative relationship with software delivery stability. The 2025 report's central finding: "AI doesn't fix a team; it amplifies what's already there." Without robust control systems—strong testing, mature practices, fast feedback loops—increased AI-generated code leads to instability. Skills are exactly those control systems, encoded as instructions.

Carnegie Mellon researchers analyzed 807 GitHub repositories after Cursor adoption: +30% static analysis warnings, +41% code complexity. The speed gains were transient; the quality problems compounded.

GitClear's analysis of 211 million lines of code from Google, Microsoft, Meta, and enterprise repositories found code duplication increased 4x with AI adoption. For the first time in their dataset, copy/pasted code exceeded refactored code.

Even Anthropic's Agentic Coding Trends Report shows the gap: developers use AI in roughly 60% of their work, but can fully delegate only 0-20% of tasks. The rest requires "thoughtful setup, active supervision, and human judgment."

That gap—between what AI touches and what AI can own—is exactly what skills address. The setup is the skill. The supervision is the rules.

The Pattern: AI Recreates Classic Code Smells

The research consistently identifies the same failure patterns. Here's how they map to specific Clean Code violations:

Naming and Consistency Problems

Inconsistent variable names across similar functions
Vague names like data, tmp, proc
Mixing naming conventions (camelCase and snake_case)
Clean Code rules: N1 (descriptive names), G11 (consistency), G24 (conventions)

Code Duplication

Copy/paste instead of extracting shared logic
Same calculation appearing in multiple places
Pattern repetition that should be abstracted
Clean Code rule: G5 (DRY - Don't Repeat Yourself)

Missing Safety Checks

No validation of input boundaries
Assumptions about data structure without verification
Missing null/None checks
Clean Code rules: G3 (boundary conditions), G4 (don't override safeties), G26 (be precise)

Readability Issues

Magic numbers without explanation (what does 86400 mean?)
Unused variables cluttering code
Functions mixing multiple abstraction levels
Clean Code rules: G12 (remove clutter), G16 (no obscured intent), G34 (single abstraction level)

Performance Problems

Functions doing multiple things at once
Exposing internal data unnecessarily
Nested loops that could be optimized
Clean Code rules: G8 (minimize public interface), G30 (functions do one thing)

These aren't arbitrary style preferences—they're the exact problems that make code hard to maintain, debug, and extend. The skills we'll build enforce these rules automatically.

The fix isn't to stop using AI. It's to give AI the explicit rules it needs to follow.

That's what skills do.

What Are Skills?

Skills are markdown files containing domain-specific instructions that AI agents read before working on your code. They follow the Agent Skills open standard and work in Google Antigravity, Anthropic's Claude Code, and other compatible agents.

The architecture is called Progressive Disclosure. Instead of dumping every instruction into the agent's context at once (causing what Antigravity's docs call "Context Saturation"), skills work in layers:

Discovery: The agent sees only a lightweight menu of skill names and descriptions
Activation: When your request matches a skill's description, the full instructions load
Execution: Scripts and templates are read only when the task requires them

This keeps the agent fast and focused. It's not thinking about database migrations when you're writing a React component.

The format is simple:

---
name: skill-name
description: When this skill should activate
---

# Skill Title

Your instructions, examples, and rules here.

The description field is crucial—it's the trigger phrase. The agent semantically matches your request against all available skill descriptions to decide which ones to load. "Enforces function best practices" is vague. "Use when writing or refactoring Python functions" tells the agent exactly when to activate.
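
To show what the Discovery layer has to work with, here is a minimal sketch that splits a skill file into frontmatter and body. It handles only flat `key: value` lines, which is a simplification (real frontmatter is YAML), and `parse_skill` and `discovery_menu` are hypothetical names, not part of any agent's API.

```python
def parse_skill(text: str) -> dict:
    """Split a SKILL.md into its frontmatter fields and its instruction body."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("skill file must start with '---'")
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        raise ValueError("frontmatter is not closed with '---'")
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            meta[key.strip()] = value.strip()
    return {**meta, "body": "\n".join(lines[end + 1:]).strip()}


def discovery_menu(skills: list[dict]) -> list[tuple[str, str]]:
    """The lightweight view: names and descriptions only, no bodies."""
    return [(s["name"], s["description"]) for s in skills]
```

The menu is all the agent sees up front, which is why a precise description does more work than a long body.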

Skills can do far more than enforce coding standards—the community has built skills for Stripe integration, Metasploit security testing, voice agents, and even multi-agent startup automation. This article focuses on one specific use case: encoding Clean Code principles.

Let me show you how to translate Clean Code's catalog into working skills.

Building the Skills: Three Examples

Rather than catalog all 66 rules exhaustively, I'll show you three critical categories in detail. The complete implementation is at the end.

1. Comments (C1-C5): Code Should Explain Itself

Uncle Bob is famously skeptical of comments—not because documentation is bad, but because comments rot faster than code updates.

File Reference: clean-comments/SKILL.md

---
name: clean-comments
description: Use when writing, fixing, editing, or reviewing Python comments and docstrings. Enforces Clean Code principles—no metadata, no redundancy, no commented-out code.
---

# Clean Comments

## C1: No Inappropriate Information

Comments shouldn't hold metadata. Use Git for author names, change history, 
ticket numbers, and dates. Comments are for technical notes about code only.

## C2: Delete Obsolete Comments

If a comment describes code that no longer exists or works differently, 
delete it immediately. Stale comments become "floating islands of 
irrelevance and misdirection."

## C3: No Redundant Comments

# Bad - the code already says this
i += 1  # increment i
user.save()  # save the user

# Good - explains WHY, not WHAT
i += 1  # compensate for zero-indexing in display

## C4: Write Comments Well

If a comment is worth writing, write it well:
- Choose words carefully
- Use correct grammar
- Don't ramble or state the obvious
- Be brief

## C5: Never Commit Commented-Out Code

# DELETE THIS - it's an abomination
# def old_calculate_tax(income):
#     return income * 0.15

Who knows how old it is? Who knows if it's meaningful? Delete it. 
Git remembers everything.

## The Goal

The best comment is the code itself. If you need a comment to explain 
what code does, refactor first, comment last.

2. Functions (F1-F4): Small, Focused, Obvious

Functions should do one thing, do it well, and have an obvious purpose.

File Reference: clean-functions/SKILL.md

---
name: clean-functions
description: Use when writing or refactoring Python functions. Enforces Clean Code principles—maximum 3 arguments, single responsibility, no flag parameters.
---

# Clean Functions

## F1: Too Many Arguments (Maximum 3)

# Bad - too many parameters
def create_user(name, email, age, country, timezone, language, newsletter):
    ...

# Good - use a dataclass or dict
from dataclasses import dataclass

@dataclass
class UserData:
    name: str
    email: str
    age: int
    country: str
    timezone: str
    language: str
    newsletter: bool

def create_user(data: UserData):
    ...

More than 3 arguments means your function is doing too much or needs 
a data structure.

## F2: No Output Arguments

Don't modify arguments as side effects. Return values instead.

# Bad - modifies argument
def append_footer(report: list[str]) -> None:
    report.append("\n---\nGenerated by System")

# Good - returns new value
def with_footer(report: list[str]) -> list[str]:
    return report + ["\n---\nGenerated by System"]

## F3: No Flag Arguments

Boolean flags mean your function does at least two things.

# Bad - function does two different things
def render(is_test: bool):
    if is_test:
        render_test_page()
    else:
        render_production_page()

# Good - split into two functions
def render_test_page(): ...
def render_production_page(): ...

## F4: Delete Dead Functions

If it's not called, delete it. No "just in case" code. Git preserves history.

3. General Principles (G1-G36): The Core Rules

These are the fundamental patterns that separate clean code from legacy nightmares.

File Reference: clean-general/SKILL.md

---
name: clean-general
description: Use when reviewing Python code quality. Enforces Clean Code's core principles—DRY, single responsibility, clear intent, no magic numbers, proper abstractions.
---

# General Clean Code Principles

## Critical Rules

**G5: DRY (Don't Repeat Yourself)**

Every piece of knowledge has one authoritative representation.

# Bad - duplication
tax_rate = 0.0825
ca_total = subtotal * 1.0825
ny_total = subtotal * 1.07

# Good - single source of truth
TAX_RATES = {"CA": 0.0825, "NY": 0.07}
def calculate_total(subtotal: float, state: str) -> float:
    return subtotal * (1 + TAX_RATES[state])

**G16: No Obscured Intent**

Don't be clever. Be clear.

# Bad - what does this do?
return (x & 0x0F) << 4 | (y & 0x0F)

# Good - obvious intent
return pack_coordinates(x, y)

**G23: Prefer Polymorphism to If/Else**

# Bad - will grow forever
def calculate_pay(employee):
    if employee.type == "SALARIED":
        return employee.salary
    elif employee.type == "HOURLY":
        return employee.hours * employee.rate
    elif employee.type == "COMMISSIONED":
        return employee.base + employee.commission

# Good - open/closed principle
class SalariedEmployee:
    def calculate_pay(self): return self.salary

class HourlyEmployee:
    def calculate_pay(self): return self.hours * self.rate

class CommissionedEmployee:
    def calculate_pay(self): return self.base + self.commission

**G25: Replace Magic Numbers with Named Constants**

# Bad
if elapsed_time > 86400:
    ...

# Good
SECONDS_PER_DAY = 86400
if elapsed_time > SECONDS_PER_DAY:
    ...

**G30: Functions Should Do One Thing**

If you can extract another function, your function does more than one thing.

**G36: Law of Demeter (Avoid Train Wrecks)**

# Bad - reaching through multiple objects
output_dir = context.options.scratch_dir.absolute_path

# Good - one dot
output_dir = context.get_scratch_dir()

## Enforcement Checklist

When reviewing AI-generated code, verify:
- [ ] No duplication (G5)
- [ ] Clear intent, no magic numbers (G16, G25)
- [ ] Polymorphism over conditionals (G23)
- [ ] Functions do one thing (G30)
- [ ] No Law of Demeter violations (G36)
- [ ] Boundary conditions handled (G3)
- [ ] Dead code removed (G9)

The Complete Catalog

I've translated all 66 rules from Clean Code Chapter 17 into skills covering six categories:

Comments (C1-C5): Minimal, accurate commenting

C1: No inappropriate information (metadata belongs in version control)
C2: Delete obsolete comments immediately
C3: No redundant comments that repeat the code
C4: Write comments well—brief, grammatical, purposeful
C5: Never commit commented-out code

Environment (E1-E2): One-command build and test

E1: Build requires only one step
E2: Tests require only one step

Functions (F1-F4): Small, focused, obvious

F1: Maximum 3 arguments (use data structures for more)
F2: No output arguments (return values instead)
F3: No flag arguments (split into separate functions)
F4: Delete dead functions

General (G1-G36): Core principles

G1: Multiple languages in one source file
G2: Obvious behavior is unimplemented
G3: Incorrect behavior at the boundaries
G4: Overridden safeties
G5: Duplication
G6: Code at wrong level of abstraction
G7: Base classes depending on their derivatives
G8: Too much information
G9: Dead code
G10: Vertical separation
G11: Inconsistency
G12: Clutter
G13: Artificial coupling
G14: Feature envy
G15: Selector arguments
G16: Obscured intent
G17: Misplaced responsibility
G18: Inappropriate static
G19: Use explanatory variables
G20: Function names should say what they do
G21: Understand the algorithm
G22: Make logical dependencies physical
G23: Prefer polymorphism to if/else or switch/case
G24: Follow standard conventions
G25: Replace magic numbers with named constants
G26: Be precise
G27: Structure over convention
G28: Encapsulate conditionals
G29: Avoid negative conditionals
G30: Functions should do one thing
G31: Hidden temporal couplings
G32: Don't be arbitrary
G33: Encapsulate boundary conditions
G34: Functions should descend only one level of abstraction
G35: Keep configurable data at high levels
G36: Avoid transitive navigation

Names (N1-N7): Descriptive, unambiguous, right-sized

N1: Choose descriptive names
N2: Choose names at the right abstraction level
N3: Use standard nomenclature where possible
N4: Use unambiguous names
N5: Use long names for long scopes
N6: Avoid encodings (Hungarian notation, etc.)
N7: Names should describe side effects

Tests (T1-T9): Fast, independent, exhaustive

T1: Insufficient tests—test everything that could break
T2: Use a coverage tool
T3: Don't skip trivial tests
T4: Ignored tests indicate ambiguity
T5: Test boundary conditions
T6: Exhaustively test near bugs
T7: Patterns of failure are diagnostic
T8: Coverage patterns can be revealing
T9: Tests should be fast
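
As a rough illustration of T5, here is what boundary-focused tests can look like. The `clamp` function is a stand-in I picked for the example; the point is the choice of inputs on and around each edge.

```python
def clamp(value: int, low: int, high: int) -> int:
    """Restrict value to the inclusive range [low, high]."""
    return max(low, min(value, high))


def test_clamp_boundaries() -> None:
    # T5: exercise the values on and immediately around each edge.
    assert clamp(-1, 0, 10) == 0    # just below the lower bound
    assert clamp(0, 0, 10) == 0     # exactly the lower bound
    assert clamp(1, 0, 10) == 1     # just inside the lower bound
    assert clamp(9, 0, 10) == 9     # just inside the upper bound
    assert clamp(10, 0, 10) == 10   # exactly the upper bound
    assert clamp(11, 0, 10) == 10   # just above the upper bound


test_clamp_boundaries()
```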

Get the complete skill files:

ertugrul-dmr / clean-code-skills

Clean Code Skills for AI Agents

Teach your AI to write code that doesn't suck.

This repository contains Agent Skills that enforce Robert C. Martin's Clean Code principles. They work with Google Antigravity, Anthropic's Claude Code, and any agent that supports the Agent Skills standard.

Why?

AI generates code fast, but research shows it also generates technical debt fast:

GitClear: 4x increase in code duplication with AI adoption
Carnegie Mellon: +30% static analysis warnings, +41% code complexity after Cursor adoption
Google DORA: Negative relationship between AI adoption and software delivery stability

These skills encode battle-tested solutions to exactly these problems—directly into your AI workflow.

What's Included

Track	Skill	Description	Rules
Python	`boy-scout`	Orchestrator—always leave code cleaner than you found it	Coordinates all skills
Python	`python-clean-code`	Master skill with all 66 rules	C1-C5, E1-E2, F1-F4, G1-G36, N1-N7, P1-P3, T1-T9
Python	`clean-comments`	Minimal, accurate commenting	C1-C5
Python	`clean-functions`

…

The repo includes:

boy-scout: An orchestrator skill that embodies the Boy Scout Rule—"always leave code cleaner than you found it"—and coordinates the other skills
python-clean-code: A master skill with all 66 rules, plus a quick reference table and anti-patterns cheatsheet
Individual skills for each category (clean-comments, clean-functions, clean-general, clean-names, clean-tests)—drop in only what you need
Installation instructions for Antigravity, Claude Code, and other Agent Skills-compatible tools

How to Use These Skills

Skills sit in a specific place in the agent ecosystem. Rules are passive guardrails that are always on. Skills are agent-triggered—the model decides when to equip them based on your intent. If you're using MCP servers (connections to external tools like GitHub or Postgres), think of MCP as the "hands" and skills as the "brains" that direct them.

For Antigravity

Create .agent/skills/ in your project root (or ~/.gemini/antigravity/skills/ for global access)
Save the skill as a folder with a SKILL.md file inside (e.g., .agent/skills/python-clean-code/SKILL.md)
Ask the agent to review or write code—it'll automatically apply the rules when relevant
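
If you prefer to script the first two steps, something like the following works as a sketch. The folder layout comes from the steps above. The `SKILL_TEMPLATE` body is a placeholder I wrote for illustration, and its `name`/`description` frontmatter is an assumption about the Agent Skills layout, so check the docs for your agent. `install_skill` is a hypothetical helper, not part of any SDK.

```python
from pathlib import Path

# Placeholder content: illustrative only, not the published skill file.
SKILL_TEMPLATE = """\
---
name: python-clean-code
description: Apply Clean Code rules when writing or reviewing Python.
---

# Python Clean Code

- F4: Delete dead functions
- G25: Replace magic numbers with named constants
"""


def install_skill(project_root: Path, skill_name: str, content: str) -> Path:
    """Write <project_root>/.agent/skills/<skill_name>/SKILL.md and return its path."""
    skill_dir = project_root / ".agent" / "skills" / skill_name
    skill_dir.mkdir(parents=True, exist_ok=True)
    skill_file = skill_dir / "SKILL.md"
    skill_file.write_text(content, encoding="utf-8")
    return skill_file
```

Calling `install_skill(Path.cwd(), "python-clean-code", SKILL_TEMPLATE)` from the project root creates the project-level skill. In practice you would pass the real SKILL.md text from the repo instead of the placeholder.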

Global vs Project Skills

Project-specific: .agent/skills/
Global Antigravity: ~/.gemini/antigravity/skills/

The agent only loads full skill content when needed, so comprehensive skills don't slow down simple requests.
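
One way to picture that lazy loading is the toy model below. It is not how Antigravity or any other agent actually implements it; it only shows the idea of indexing short metadata first and reading the full file later. It assumes each SKILL.md starts with a simple `key: value` frontmatter block between two `---` lines, and every name in it is hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SkillIndexEntry:
    name: str
    description: str
    path: Path


def read_frontmatter(skill_file: Path) -> dict[str, str]:
    """Parse only the leading `key: value` block between the two `---` lines."""
    fields: dict[str, str] = {}
    with skill_file.open(encoding="utf-8") as handle:
        if handle.readline().strip() != "---":
            return fields
        for line in handle:
            if line.strip() == "---":
                break  # stop before the (possibly long) skill body
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
    return fields


def build_index(skills_root: Path) -> list[SkillIndexEntry]:
    """Cheap step: collect name and description for every skill folder."""
    entries = []
    for skill_file in sorted(skills_root.glob("*/SKILL.md")):
        meta = read_frontmatter(skill_file)
        name = meta.get("name", skill_file.parent.name)
        entries.append(SkillIndexEntry(name, meta.get("description", ""), skill_file))
    return entries


def load_full_skill(entry: SkillIndexEntry) -> str:
    """Expensive step: read the whole file only once a skill is selected."""
    return entry.path.read_text(encoding="utf-8")
```

The index stays small no matter how long each skill body is, which is why a 66-rule master skill does not have to cost anything on a request that never needs it.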

Going Further

The skills in this article are instruction-only: they tell the agent what to do. For stricter enforcement, you could add a scripts/ folder with a linter that compatible agents run automatically, or an examples/ folder with before/after code samples for few-shot learning. The format supports it; we're just keeping things simple here.
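
To give a feel for what such a script might contain, here is a small sketch built on Python's standard `ast` module. It is not part of the published skills. The function name, the output format, and the F3 heuristic (a parameter whose default is a literal `True` or `False`) are my own assumptions.

```python
import ast

MAX_ARGUMENTS = 3  # F1 threshold from the rule catalog


def find_function_violations(source: str) -> list[str]:
    """Return human-readable F1/F3 findings for every function in `source`."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        all_args = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
        params = [a.arg for a in all_args if a.arg not in ("self", "cls")]
        if len(params) > MAX_ARGUMENTS:
            findings.append(f"{node.name}:{node.lineno} F1 {len(params)} arguments")
        # Heuristic for F3: any default value that is a literal True/False.
        defaults = node.args.defaults + [
            d for d in node.args.kw_defaults if d is not None
        ]
        if any(
            isinstance(d, ast.Constant) and isinstance(d.value, bool)
            for d in defaults
        ):
            findings.append(f"{node.name}:{node.lineno} F3 boolean flag argument")
    return findings
```

A script in scripts/ could read each changed file, pass its text to `find_function_violations`, and exit non-zero when the list is not empty. That turns two of the rules from advice into a check.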

A Real-World Example

Here's code that violates multiple Clean Code rules:

from utils import *  # P1
# Author: John, Modified: 2024-01-15  # C1
def proc(d, t, flag=False):  # N1, F1, F3
    # Process the data  # C3
    x = []  # N1
    for i in d:
        if flag:  # F3
            if i['type'] == 'A':  # G23
                x.append(i['val'] * 1.0825)  # G25
            elif i['type'] == 'B':
                x.append(i['val'] * 1.05)  # G25
        else:
            x.append(i['val'])
    with open(f'/tmp/{t}.json', 'w') as f:  # G6
        json.dump(x, f)
    # Old approach  # C5
    # for item in d:
    #     print(item)
    return x

Violations: P1, C1, C3, C5, F1, F3, G6, G23, G25, N1

With the Clean Code skill active, asking your AI agent to refactor this produces something like:

import json
from pathlib import Path
from typing import List, Literal
from dataclasses import dataclass

TAX_RATE_CA = 0.0825
TAX_RATE_NY = 0.05
TransactionType = Literal['CA', 'NY']

@dataclass
class Transaction:
    value: float
    type: TransactionType

def apply_tax(transaction: Transaction) -> float:
    """Apply state-specific tax to transaction value."""
    tax_rates = {'CA': TAX_RATE_CA, 'NY': TAX_RATE_NY}
    return transaction.value * (1 + tax_rates[transaction.type])

def process_transactions_with_tax(
    transactions: List[Transaction]
) -> List[float]:
    """Calculate taxed values for all transactions."""
    return [apply_tax(t) for t in transactions]

def process_transactions_without_tax(
    transactions: List[Transaction]
) -> List[float]:
    """Extract raw values from all transactions."""
    return [t.value for t in transactions]

def save_results(values: List[float], output_path: Path) -> None:
    """Save processed values to JSON file."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open('w') as f:
        json.dump(values, f)

The refactored version:

✅ No wildcard imports (P1)
✅ No metadata comments (C1)
✅ No redundant comments (C3)
✅ No commented-out code (C5)
✅ Descriptive names (N1)
✅ No flag arguments (F3)
✅ Named constants instead of magic numbers (G25)
✅ Functions do one thing (G30)
✅ Polymorphism through data structure (G23)

Anatomy of a Vibe-Coded Script

Remember the duplicated function I mentioned in Torvalds' AudioNoise visualizer? Here it is:

def update_slider_text(self, val):
    """Helper to update slider texts (Width and End Point)."""
    start_val, end_val = val
    width = end_val - start_val

def update_slider_text(self, val):
    """Helper to update slider texts (Width and End Point)."""
    start_val, end_val = val
    width = end_val - start_val

    if self.x_mode == 'Time':
        self.slider.valtext.set_text(f"Window: {start_val:.3f} + {width:.3f} s")
    else:
        self.slider.valtext.set_text(f"Window: {int(start_val)} + {int(width)}")

The first definition unpacks values, calculates width, then... returns None. The second definition is the real implementation. Python silently overwrites the first with the second, so the code runs. But it's textbook dead code—Clean Code rule G9: Remove dead code.
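
Because Python never complains about this, it is the kind of thing worth checking mechanically. The sketch below uses the standard `ast` module to flag names defined more than once in the same module or class body. The `Visualizer` class and the `find_redefinitions` helper are hypothetical stand-ins, not code from the AudioNoise script.

```python
import ast
from collections import Counter

SOURCE = '''
class Visualizer:
    def update_slider_text(self, val):
        start_val, end_val = val

    def update_slider_text(self, val):
        start_val, end_val = val
        return end_val - start_val
'''


def find_redefinitions(source: str) -> list[str]:
    """Return names defined more than once in the same module or class body."""
    duplicates = []
    for scope in ast.walk(ast.parse(source)):
        if not isinstance(scope, (ast.Module, ast.ClassDef)):
            continue
        names = Counter(
            node.name
            for node in scope.body
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        )
        duplicates.extend(name for name, count in names.items() if count > 1)
    return duplicates


print(find_redefinitions(SOURCE))  # ['update_slider_text']
```

This is deliberately naive: legitimate patterns such as `@property` setters or `typing.overload` stubs also reuse a name, so a real check would need to skip decorated definitions.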

With the skill active, an agent refactors the entire 600-line script. The duplicate vanishes, magic numbers become constants, and nested functions get extracted into focused methods:

def update_slider_text(self, val: tuple[float, float]):
    """Update slider text with either time or sample count."""
    start_val, end_val = val
    width = end_val - start_val

    if self.x_mode == 'Time':
        self.slider.valtext.set_text(f"Window: {start_val:.3f} + {width:.3f} s")
    else:
        self.slider.valtext.set_text(f"Window: {int(start_val)} + {int(width)}")

The refactored version:

✅ Dead code removed (G9)
✅ Type hints added (clarity)
✅ Single, authoritative definition (G5)
✅ Magic numbers extracted to constants (G25)
✅ Large methods decomposed (G30)

The full diff shows 600+ lines reduced to ~440—not by removing functionality, but by eliminating duplication and extracting reusable patterns.

Why This Matters Now

Vibe coding isn't going away. AI will get better at generating code, not worse. But "better at generating" doesn't mean "better at maintaining."

The research is clear: AI produces code faster, but that code accumulates technical debt faster too. Without guardrails, we're building tomorrow's legacy systems today.

Uncle Bob's Clean Code principles are almost 20 years old, but they're exactly what we need now. They're not arbitrary style preferences—they're battle-tested solutions to the problems AI recreates at scale.

Skills give you the mechanism to encode these rules directly into your AI workflow. Whether you're using Antigravity, Claude Code, or another agent, the approach is the same: define what clean code means, then let the AI follow the rules.

Your agent doesn't know what good code looks like unless you tell it.

So tell it.

Resources

The Book

Clean Code by Robert C. Martin: Amazon

Skills Documentation

Agent Skills Standard — The open standard for AI agent instructions
Antigravity Skills Guide — Google's official documentation
Claude Code Agent Skills — Anthropic's implementation

Research Cited

DORA 2025: AI-Assisted Software Development — Google's findings on AI and delivery stability
Code Quality After Cursor Adoption — Carnegie Mellon's analysis of 807 repositories
GitClear 2025 Code Quality Report — 211M lines analyzed
Agentic Coding Trends — Anthropic's delegation gap analysis

Get the Skills

Clean Code Skills Repository — All 66 rules as ready-to-use skill files

The future of programming is human intent translated by AI. Make sure the translation preserves quality, not just speed.
