DEV Community: Ertuğrul Demir

Decoding Bronze Age Paperwork: Modern AI vs. Ancient Assyrian Clay Tablets

Ertuğrul Demir — Sat, 28 Mar 2026 12:17:54 +0000

Four thousand years ago, Assyrian merchants were doing what people have always done: tracking debts, chasing payments, arguing over contracts. They pressed these records into clay tablets. Not sacred texts, not epic poetry. Just the ancient equivalent of office emails.

Nearly 23,000 of these tablets survive. Half have never been translated, not because they're damaged, but because only a few people on Earth can read Old Assyrian.

When the Deep Past Initiative turned this into a Kaggle competition (build a machine translation system for Old Assyrian cuneiform), I jumped in. The task: take transliterated text (cuneiform signs converted to Latin characters) and produce an English translation.

The training set? Around 1,500 pairs. That's it.

For context, standard translation models train on millions of sentence pairs. Even research on "low-resource" languages works with tens of thousands. We got fifteen hundred documents and a pat on the back.

So the question was straightforward: how do you build a translation model when you barely have any data, for a language that no modern tokenizer has ever seen, where every proper noun and number matters because these are legal and financial records?

What started as "fine-tune a model on some ancient text" turned into a full-stack AI pipeline: Gemini vision for OCR-ing scanned academic books, LLMs for sentence alignment and cross-lingual translation, ByT5 as a byte-level backbone that doesn't choke on cuneiform, Unsloth for efficient LoRA training, and vLLM for fast inference on Kaggle T4s. The results surprised us.

Let's start with why the obvious approaches don't work.

Why the Obvious Approaches Don't Work

The first thing I tried was what everyone tries — throw a pretrained LLM at it. Gemma, Qwen, the usual suspects. Prompt it with some examples, let it translate.

And honestly? The outputs look pretty good at first glance. Fluent English, reasonable sentence structure, feels like it could be right. But "feels right" is dangerous when you're translating ancient legal documents.

The problem is hallucination — and not the subtle kind. These models confidently fill in names of merchants, cities, and commodities that simply aren't in the source text. When the transliteration says A-šùr-i-dí the model might output a completely different name that sounds plausibly Bronze Age. When it hits an unfamiliar trade term, it improvises. For documents where every name, every number, every commodity is the actual information — that's not a minor quality issue, it's the whole problem.

Ok so what about standard encoder-decoder translation models? Here the issue is more fundamental: tokenization. Modern tokenizers are trained on modern text. Akkadian transliteration is a different universe — hyphenated syllable sequences like a-na, Sumerian logograms in ALL CAPS like KÙ.BABBAR, determinatives in curly braces like {d} and {ki}, subscript digits encoding phonetic variants like il₅, and gap markers like <gap> for broken sections of the physical tablet.

Feed this into a standard tokenizer and it fragments on every character it hasn't seen. Proper nouns that have never appeared in any pretraining corpus get silently mangled. The <gap> markers that indicate missing text get treated as noise or special tokens.

So: decoder-only models hallucinate, standard translation models can't tokenize the input properly. What actually fits this problem?

ByT5 — The Right Tool for a Weird Job

One of the best things about Kaggle competitions is the community. People share findings, discuss approaches in the forums, and collectively narrow down what works. Early on, several participants converged on the same answer: ByT5.

Image from "ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models" (Xue et al., 2021)

ByT5 comes from a 2021 Google Research paper — "Towards a Token-Free Future with Pre-trained Byte-to-Byte Models". The idea is simple and kind of radical: skip tokenization entirely. Instead of mapping text to a learned vocabulary of subwords, ByT5 operates directly on raw bytes. A standard Transformer, minimal modifications, just processing one byte at a time.

Why does this matter for our problem? Because every character is valid input by definition. It doesn't matter that A-mur-{d}UTU has never appeared in any pretraining corpus — ByT5 doesn't need it to. No vocabulary misses, no fragmented tokens, no special handling for curly braces or subscript digits. The model just sees bytes.
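To make "the model just sees bytes" concrete, here is a minimal sketch that encodes a few of the transliteration strings mentioned above as UTF-8. It only shows the raw byte view; the helper name `to_bytes` is hypothetical, and the real ByT5 tokenizer additionally reserves a few ids for special tokens.

```python
# Sample strings are taken from the article; the helper name is hypothetical.
SAMPLES = ["A-mur-{d}UTU", "KÙ.BABBAR", "il₅", "<gap>-A-šùr"]


def to_bytes(text: str) -> list[int]:
    """Return the UTF-8 byte values a byte-level model would consume."""
    return list(text.encode("utf-8"))


for sample in SAMPLES:
    byte_values = to_bytes(sample)
    # Round-tripping is lossless, so no character can ever be "unknown".
    assert bytes(byte_values).decode("utf-8") == sample
    print(f"{sample!r}: {len(sample)} chars -> {len(byte_values)} bytes")
```

Plain ASCII costs one byte per character, while diacritics and subscript digits cost two or three. That is the trade-off of going token-free: sequences get longer, but nothing falls outside the vocabulary.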

The paper also showed something else that turned out to be critical: byte-level models are significantly more robust to noise. When your source text comes from OCR'd clay tablets with inconsistent transcription conventions across different scholars and decades — that robustness isn't a nice-to-have, it's a requirement.

Architecture: solved. Now came the harder problem — we had the right model, but nowhere near enough data to train it.

The Data Problem

With ByT5 as the architecture, the bottleneck shifted entirely to data. And the competition host made the challenges very clear in a public discussion post.

Two things consistently broke translations more than anything else:

Named entities. Personal names, place names, divine names — they're transliterated inconsistently across different editions, they preserve older spelling conventions, and they're completely opaque to the model. In practice, many otherwise reasonable translations failed because a name got mangled, dropped, or hallucinated. The host even prepared an onomasticon (a curated list of attested name spellings) as supplemental data to help with this.

Transliteration format inconsistency. Different corpora encode the same text using different conventions. One participant converted diacritics to ASCII before training (š → sz, ú → u2) — a reasonable instinct, but the evaluation data expects diacritics. Collapsing ṣ into S₂ or š into SZ removes distinctions that are semantically meaningful in Akkadian. The rule was clear: normalize toward the format used in the evaluation set, not away from it.

On top of that, gap handling was tricky. Damaged sections of tablets are marked with <gap>, but the training data wasn't perfectly aligned — sometimes a large gap appears in the transliteration but not in the translation, forcing the model to learn misalignment rather than translation. Edge cases like <gap>-A-šùr (a gap attached to a word) needed to be preserved, not blindly stripped.

The host's closing point stuck with me: these aren't model architecture problems. They're data problems. And with only ~1,500 training pairs, every one of these issues hits harder because the model sees so few examples to learn from.

So the path forward was obvious — find more data.

Finding More Data — The AKT Books

The training data had to come from somewhere. The competition hosts pointed the way — they shared scanned PDFs of the AKT series (Anatolian Kültepe Texts), a multi-volume scholarly publication of Old Assyrian tablets from the Kültepe excavations in Turkey. Each volume contains transliterations and translations of tablets. Exactly the domain, exactly the format we needed.

The catch? These are academic books published between 1990 and the 2020s, by different authors, in different languages. AKT 1, 2, 4, 9a, and 10 are in Turkish. AKT 3 is in German. Each volume has its own layout, its own heading conventions, its own way of marking tablet edges and sections. Different fonts, different editorial styles, different decades of typesetting.

This isn't structured data you can parse with a script. These are scanned pages of physical books — some crisp, some not — where a tablet's transliteration might start on one page and continue on the next, where scholarly commentary sits right next to the translation text, and where the format changes just enough between volumes that nothing generalizes cleanly.

But inside these messy PDFs was exactly what we were starving for: hundreds of additional transliteration-translation pairs, many with line-by-line alignment that the original training set didn't have.

The question was whether I could extract it reliably enough to actually help the model — or whether the noise would make things worse. This is where Gemini's multimodal capabilities came in — specifically its ability to understand page layouts, distinguish between transliteration blocks and commentary, and handle multilingual content out of the box. I decided to build the pipeline.

The Extraction Pipeline

Building this pipeline was its own mini-project. Each step solved one problem and revealed the next.

Step 1: PDF → Page Images

The simplest step: render each PDF page as a numbered PNG. This is the only part that runs purely locally. Everything else goes through Gemini.

Step 2: Page Images → Structured JSON

Each page image gets sent to Gemini's vision model via the Vertex AI Batch API. The flow: build a JSONL of requests (one per page image, referencing GCS URIs), submit to Vertex, parse the predictions back.

A quick note on why batch inference: when you're processing hundreds of pages and don't need real-time responses, the Batch API is a no-brainer. You get a 50% discount over standard inference, much higher rate limits, and the service handles parallelization and retries for you — typically completing within 24 hours. You submit one job, go do something else, come back to results. For a pipeline like this where I was processing multiple books with hundreds of pages each, it saved both money and sanity.

The request construction:

def build_request(gcs_uri: str, prompt_text: str) -> dict:
    return {
        "request": {
            "contents": [{
                "role": "user",
                "parts": [
                    {"fileData": {"mimeType": "image/png", "fileUri": gcs_uri}},
                    {"text": prompt_text},
                ]
            }],
            "generationConfig": {
                "responseMimeType": "application/json",
                "temperature": 0.0,
                "mediaResolution": "MEDIA_RESOLUTION_HIGH",  # needed for diacritics
                "thinkingConfig": {"thinkingLevel": "MEDIUM"},
            },
        }
    }

We used gemini-3.1-flash-lite-preview with medium thinking enabled — the reasoning step helped significantly with understanding complex page layouts and making correct decisions about where one tablet ends and another begins.
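The "build a JSONL of requests" step could look roughly like the sketch below: one JSON object per line, one line per page image. The `page_0001.png` naming scheme, the bucket prefix, and the prompt text are assumptions for illustration, and `build_request` is a trimmed copy of the function above so the snippet runs on its own.

```python
import json


def build_request(gcs_uri: str, prompt_text: str) -> dict:
    # Trimmed copy of the article's request builder (generationConfig omitted).
    return {"request": {"contents": [{"role": "user", "parts": [
        {"fileData": {"mimeType": "image/png", "fileUri": gcs_uri}},
        {"text": prompt_text},
    ]}]}}


def build_jsonl(gcs_prefix: str, num_pages: int, prompt_text: str) -> str:
    """Return one JSON request per line, one line per page image."""
    lines = []
    for page in range(1, num_pages + 1):
        uri = f"{gcs_prefix}/page_{page:04d}.png"  # hypothetical naming scheme
        request = build_request(uri, prompt_text)
        # ensure_ascii=False keeps diacritics in the prompt readable.
        lines.append(json.dumps(request, ensure_ascii=False))
    return "\n".join(lines) + "\n"


jsonl_text = build_jsonl(
    "gs://your-bucket/book/pages", 3, "Extract the tablets on this page as JSON."
)
print(jsonl_text.splitlines()[0][:80])
```

The resulting text would then be uploaded to the bucket path that the batch job reads from.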

Submit with the Vertex AI Batch API:

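from google import genai
from google.genai.types import CreateBatchJobConfig

# project and location are your Google Cloud project ID and region
client = genai.Client(vertexai=True, project=project, location=location)
job = client.batches.create(
    model="gemini-3.1-flash-lite-preview",
    src="gs://your-bucket/book/ocr_batch/requests.jsonl",
    config=CreateBatchJobConfig(
        dest="gs://your-bucket/book/ocr_batch/predictions/"
    ),
)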

One gotcha that bit me early: predictions come back shuffled. You can't rely on line order in the output — you have to extract the page number from each prediction's original request URI:

import re

def extract_page_num(pred: dict) -> int:
    uri = pred["request"]["contents"][0]["parts"][0]["fileData"]["fileUri"]
    m = re.search(r"page_(\d+)\.png", uri)
    if m is None:
        raise ValueError(f"no page number in URI: {uri}")
    return int(m.group(1))

This is actually a feature — it forces you to write robust parsing from the start.
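A minimal sketch of restoring page order from shuffled predictions. It assumes each output line is a JSON object that echoes the original `request` next to a `response`; the fake predictions here exist only to make the example runnable.

```python
import json
import re

PAGE_RE = re.compile(r"page_(\d+)\.png")


def page_number(pred: dict) -> int:
    """Read the page number back out of the echoed request URI."""
    uri = pred["request"]["contents"][0]["parts"][0]["fileData"]["fileUri"]
    match = PAGE_RE.search(uri)
    if match is None:
        raise ValueError(f"no page number in URI: {uri}")
    return int(match.group(1))


def order_predictions(jsonl_text: str) -> list[dict]:
    """Parse a predictions JSONL string and sort the entries by page."""
    preds = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return sorted(preds, key=page_number)


def fake_prediction(page: int) -> str:
    # Stand-in for one line of batch output (assumed shape).
    uri = f"gs://your-bucket/book/pages/page_{page:04d}.png"
    request = {"contents": [{"parts": [{"fileData": {"fileUri": uri}}]}]}
    return json.dumps({"request": request, "response": {}})


shuffled = "\n".join(fake_prediction(p) for p in (3, 1, 2))
ordered_pages = [page_number(p) for p in order_predictions(shuffled)]
print(ordered_pages)
```

Sorting by a key derived from the request itself means the pipeline never depends on the order in which the service happened to write results.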

Every AKT volume needs its own prompt. Different heading formats, different edge markers (Ö.y., Ak. for Turkish volumes; Vs., Rs. for German), different conventions for commentary blocks. Get this wrong and you extract commentary as translation, or merge two tablets into one.

Step 3: JSON Pages → Tablets CSV

A book-specific export script aggregates all the per-page JSONs into a flat CSV — one row per tablet with combined transliteration and translation fields. Each volume needs its own exporter because the structure varies enough that a generic one would silently break.

Step 4: Visual QC

Dump everything to an HTML file and actually look at it. This is where you spot the real problems: misread headings, commentary leaking into translation fields, duplicate translations from continuation pages. No amount of automated testing replaces eyeballing the output.

Step 5: Cleanup

Book-specific cleanup scripts apply the fixes found during QC — drop bad rows, merge tablets that got split across pages, strip commentary that leaked through. Unglamorous and manual but completely necessary.

Step 6: Sentence Chunking + Translation

Here's where it gets interesting again. The original training data is document-level: full tablet in, full translation out. But the AKT books have something better: line-by-line structure. Each transliteration line has a marker such as (Vs.1), (2), or (Rs.14), and each translation sentence references those markers.

A second Gemini batch job handles two things at once: align transliteration lines to translation sentences by marker, and translate the non-English content (Turkish or German) into English. For each tablet, I retrieved the most similar examples from the official training set using TF-IDF cosine similarity and included them as few-shot context. This turned out to be crucial — not just for translation quality, but for matching the distribution of the host's wording, style, and terminology choices. The model wasn't just translating, it was learning to translate the way the competition data expected.

Same batch pattern — build JSONL, submit, parse shuffled predictions.
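The few-shot retrieval described above can be sketched with nothing but the standard library. This is an illustrative reimplementation, not the code used in the competition: the class name, the trigram size, the IDF smoothing, and the tiny corpus are all assumptions.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams (n=3 is an assumed setting)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


class TfidfRetriever:
    def __init__(self, corpus: list[str], n: int = 3):
        self.n = n
        counts = [char_ngrams(doc, n) for doc in corpus]
        # Document frequency: in how many documents each n-gram occurs.
        doc_freq = Counter(gram for c in counts for gram in c)
        total = len(corpus)
        # Smoothed IDF, so n-grams found in every document keep some weight.
        self.idf = {g: math.log((1 + total) / (1 + df)) + 1
                    for g, df in doc_freq.items()}
        self.vectors = [self._weigh(c) for c in counts]

    def _weigh(self, counts: Counter) -> dict:
        """Turn raw counts into an L2-normalized TF-IDF vector."""
        vec = {g: tf * self.idf[g] for g, tf in counts.items() if g in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {g: w / norm for g, w in vec.items()}

    def top_k(self, query: str, k: int = 5) -> list[int]:
        """Return indices of the k corpus entries most similar to the query."""
        q = self._weigh(char_ngrams(query, self.n))
        # Vectors are unit length, so the dot product is the cosine similarity.
        scores = [sum(w * vec.get(g, 0.0) for g, w in q.items())
                  for vec in self.vectors]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return ranked[:k]


# Illustrative corpus built from fragments quoted in the article.
CORPUS = [
    "um-ma A-šùr-i-dí-ma a-na Pu-šu-ki-in",
    "10 ma-na KÙ.BABBAR",
    "a-na A-lim {ki} qí-bi-ma",
]
retriever = TfidfRetriever(CORPUS)
best = retriever.top_k("um-ma A-šùr-i-dí-ma", k=1)
print(best)
```

Character n-grams suit transliteration well because hyphenated syllables and name fragments overlap even when whole "words" never repeat exactly.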

Step 7: Normalization

Most of the invisible work happened here. The competition test set uses a specific character format, and the books don't match it. Every volume has its own OCR artifacts, its own conventions.

A few examples from the normalization stack:

ḫ/Ḫ → h/H (test set uses plain H)
Unicode subscripts → plain digits (₄ → 4)
Superscript determinatives → brace format (ᵈ → {d}, ᵏⁱ → {ki})
OCR artifacts: KU.BABBAR → KÙ.BABBAR, ś → š, ş → ṣ
Gap deduplication: <gap> <gap> → <gap>, while preserving attachments like <gap>-A-šùr

For a byte-level model like ByT5, this isn't cosmetic. A single character mismatch between training and test (ḫ vs h, ₄ vs 4) is invisible to a human reviewer and catastrophic to a model that has learned exactly one representation.
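A small sketch of what such a normalization stack might look like, covering only the five example rules listed above. The function name and the order of the rules are assumptions, and the real pipeline had per-volume rules that are not shown here.

```python
import re
import unicodedata

SUBSCRIPTS = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")

# (source, target) pairs; the multi-character determinative comes first.
REPLACEMENTS = [
    ("ᵏⁱ", "{ki}"),
    ("ᵈ", "{d}"),
    ("ḫ", "h"),
    ("Ḫ", "H"),
    ("ś", "š"),  # OCR artifact
    ("ş", "ṣ"),  # OCR artifact (Turkish s-cedilla read for emphatic s)
]


def normalize(text: str) -> str:
    """Map one transliteration string toward the test-set conventions."""
    # Compose characters first so "s + combining caron" and "š" compare equal.
    text = unicodedata.normalize("NFC", text)
    for src, dst in REPLACEMENTS:
        text = text.replace(unicodedata.normalize("NFC", src), dst)
    text = text.translate(SUBSCRIPTS)
    text = re.sub(r"\bKU\.BABBAR\b", "KÙ.BABBAR", text)
    # Collapse runs of gap markers; an attached suffix such as -A-šùr survives.
    text = re.sub(r"<gap>(?:\s+<gap>)+", "<gap>", text)
    return unicodedata.normalize("NFC", text)


print(normalize("KU.BABBAR ᵈUTU ḫa-ra-an il₅ <gap> <gap>-A-šùr"))
```

These rules should only ever run on the transliteration column: applied to Turkish translation text, the ş rule would corrupt ordinary words.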

Step 8: Merge

The final step pulls normalized chunks into the main training set. Starting from ~1,500 pairs, the pipeline multiplied our available training data and, more importantly, added sentence-level pairs that gave the model a much finer-grained learning signal than document-level translations alone.

Training — ByT5 Gets You Far, Then Stops

With the expanded dataset ready, training ByT5 was straightforward — standard seq2seq encoder-decoder training using HuggingFace Transformers. No tricks, no exotic schedulers. The model picked up patterns fast and translated training-domain tablets surprisingly well.

But then the leaderboard scores started telling a different story.

In our case, the hidden test set on Kaggle seemed to have a different distribution than what we trained on. Our best guess: different books, different topics, different translator styles, unfamiliar names and locations. Our ByT5 was doing well on what it had seen directly in training, but the leaderboard scores suggested it wasn't generalizing beyond that.

We hit a ceiling. Many teams went on to have great success pushing ByT5 further — better augmentation, longer training, smarter tricks I guess. But in our setup, the gains had stalled, and we decided to explore a different direction.

Back to Decoder-Only — But This Time, Fine-Tuned

This is where the story comes full circle. Earlier, we'd dismissed decoder-only LLMs because they hallucinate. That's still true — out of the box. But fine-tuning changes the picture completely.

The reasoning was simple: ByT5 and Qwen were solving different problems. ByT5 was a great fit for the transliteration itself — every character mattered, and byte-level modeling let it handle weird orthography, diacritics, subscripts, and determinatives without fighting the tokenizer. But once the task became generalization across unfamiliar tablets, translator styles, and topic shifts, Qwen3.5 had something ByT5 didn't: much stronger pretrained language knowledge.

Out of the box, that strength was useless because it came with hallucination. Fine-tuning changed that. LoRA gave us a way to keep the model's broader language ability while grounding it in the task and the dataset. Instead of prompting a general-purpose model and hoping for the best, we trained a lightweight adapter on our curated examples. Combined with few-shot prompting to match the host's translation style, the fine-tuned Qwen handled the distribution shifts that our ByT5 couldn't.

Fine-Tuning with Unsloth — Making LLMs Affordable

Before diving into the training details, a quick primer for anyone who hasn't fine-tuned a model before.

The naive approach to fine-tuning a large language model means updating all its parameters — billions of them. That requires serious hardware, serious memory, and serious money. For a Kaggle competition where you're iterating fast on limited GPUs, it's a non-starter.

This is where LoRA (Low-Rank Adaptation) comes in. Instead of updating the entire model, you freeze the original weights and train a small set of adapter matrices on top. You get most of the benefits of full fine-tuning at a fraction of the cost. QLoRA takes it a step further by quantizing the base model to 4-bit precision, which dramatically cuts memory usage — making it possible to fine-tune models that would otherwise never fit on a single GPU.
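To see where the savings come from: LoRA replaces the update to a frozen weight matrix of shape d_out × d_in with the product of two thin matrices of shapes d_out × r and r × d_in. The short calculation below counts trainable parameters for one such matrix. The dimensions and rank are hypothetical round numbers, not the settings we trained with.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full fine-tuning params, LoRA adapter params) for one matrix."""
    full = d_out * d_in              # every entry of W is trainable
    adapter = rank * (d_in + d_out)  # B is d_out x r, A is r x d_in
    return full, adapter


# Hypothetical layer size and rank, chosen only for illustration.
full, adapter = lora_param_counts(d_out=4096, d_in=4096, rank=16)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.2%}")
```

For this example layer the adapter is under one percent of the size of the full matrix, which is why the optimizer state and gradients shrink so dramatically.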

For this project we used Unsloth, which makes the whole process surprisingly painless. It handles the LoRA/QLoRA setup, optimizes training to run ~2x faster with ~70% less VRAM, and supports a wide range of models out of the box — including Qwen3.5, which is what we needed.

The training itself was SFT (Supervised Fine-Tuning) using Unsloth's built-in SFT trainer. We structured our data as chat conversations: a system prompt setting the role of an expert Assyriologist, few-shot examples retrieved via TF-IDF similarity, and the target tablet as the final user message. The model only learns from the assistant completion — the actual translation.

# each training example looks like this
messages = [
    {"role": "system", "content": "You are an expert Assyriologist..."},
    # few-shot examples from similar tablets
    {"role": "user", "content": "Translate: um-ma A-šùr-i-dí-ma ..."},
    {"role": "assistant", "content": "Thus says Aššur-idī: ..."},
    {"role": "user", "content": "Translate: um-ma Pu-šu-ki-in-ma ..."},
    {"role": "assistant", "content": "Thus says Pūšu-kēn: ..."},
    # the actual tablet to translate
    {"role": "user", "content": "Translate: a-na A-lim {ki} ..."},
    {"role": "assistant", "content": "To the City: ..."},  # model learns this
]

An important detail here: we used completion-only masking. The loss is computed only on the assistant's translation tokens — the prompt tokens (system message, few-shot examples, user messages) are masked out during training. This means the model isn't wasting capacity learning to predict the input; it's focused entirely on producing accurate translations.
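Mechanically, completion-only masking usually comes down to setting the labels of prompt positions to an ignore value. The sketch below shows the idea on toy token ids. The helper name is hypothetical, and in practice the trainer locates the completion boundary from the chat template rather than from a hand-supplied index.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_prompt(input_ids: list[int], completion_start: int) -> list[int]:
    """Copy input_ids into labels, hiding everything before the completion."""
    if not 0 <= completion_start <= len(input_ids):
        raise ValueError("completion_start is outside the sequence")
    return [IGNORE_INDEX] * completion_start + input_ids[completion_start:]


# Toy ids: the first 6 stand for system + few-shot + user tokens,
# the last 3 for the assistant's translation.
input_ids = [11, 12, 13, 14, 15, 16, 21, 22, 23]
labels = mask_prompt(input_ids, completion_start=6)
print(labels)
```

The model still attends to the full prompt during the forward pass; only the loss ignores those positions.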

This meant the model wasn't just learning to translate — it was learning to translate in context, grounded by similar examples. The same retrieval and prompt structure would be used at inference time, so there was no gap between how the model trained and how it would be evaluated.

One direction we started exploring but ran out of time for: reinforcement learning on top of the fine-tuned model. The idea was to use GRPO (Group Relative Policy Optimization) with custom reward functions — combining the competition metric itself, gap alignment between transliteration and translation, and length balance — to push the model beyond what SFT alone could achieve. Each reward would target a specific failure mode that supervised training couldn't address directly. We didn't get there before the deadline, but it felt like the natural next step.

Inference — vLLM on Kaggle T4s

With a fine-tuned model ready, the next challenge was actually running it within Kaggle's competition constraints. This is a code competition — no internet access at submission time, two T4 GPUs with 16GB VRAM each, and a strict time limit.

A quick intro on vLLM for those unfamiliar: it's an open-source inference engine originally developed at UC Berkeley that's become the go-to for serving LLMs efficiently. The key innovation is PagedAttention — instead of pre-allocating a fixed block of memory for each sequence's key-value cache, it pages the KV cache dynamically, similar to how operating systems manage virtual memory. This means you can serve larger models on less hardware. On top of that you get continuous batching, optimized CUDA kernels, tensor parallelism, and seamless HuggingFace model support out of the box.

Sounds perfect, right? In theory. In practice, we hit a wall.

Qwen3.5 was released in the final weeks of the competition. The model was brand new, and vLLM support was experimental and unstable. On top of that, Kaggle's T4 GPUs have compute capability 7.5, which means no FlashAttention 2 support. We had to fall back to the Triton attention backend, wrestle with environment compatibility issues, and work around the fact that you can't pip install anything at submission time: every dependency needs to be pre-packaged in your dataset.

Getting a 9B parameter model to load, run, and generate translations on two T4s without crashing was its own mini-project. Tensor parallelism across both GPUs was non-negotiable — the model simply wouldn't fit on a single card.

from vllm import LLM

# MODEL_PATH points at the model weights pre-packaged in a Kaggle dataset
llm = LLM(
    model=MODEL_PATH,
    dtype="float16",
    max_model_len=16000,
    gpu_memory_utilization=0.85,
    enforce_eager=True,            # no CUDA graphs on T4
    tensor_parallel_size=2,        # split across both T4s
)
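A back-of-the-envelope check of why a single card was not enough, assuming 16-bit weights as in the configuration above. It counts weights only and assumes tensor parallelism splits them roughly evenly; the KV cache, activations, and runtime overhead all come on top.

```python
T4_VRAM_GB = 16
GPU_MEMORY_UTILIZATION = 0.85  # same fraction as in the LLM(...) call


def weight_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


total = weight_gb(9e9, 2)  # 9B parameters, 2 bytes each in float16
per_gpu = total / 2        # rough share per card with tensor_parallel_size=2
budget = T4_VRAM_GB * GPU_MEMORY_UTILIZATION

print(f"weights: {total:.1f} GB, per GPU: {per_gpu:.1f} GB, budget: {budget:.1f} GB")
```

The weights alone exceed one T4, while half of them leaves a few gigabytes per card for the KV cache, which is also why `max_model_len` has to be capped.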

The inference prompt mirrors the training setup exactly — same system prompt, same TF-IDF few-shot retrieval. For each test tablet, we retrieve the 5 most similar examples from our training data and include them as conversation context:

prompts = [
    build_messages(
        transliteration=row["transliteration"],
        few_shot_examples=retriever.top_k(row["transliteration"]),
    )
    for row in test_rows
]
outputs = llm.chat(prompts, sampling_params=sampling_params)

Keeping the inference pipeline identical to training — same prompt structure, same retrieval, same style anchoring — meant the model was seeing exactly the kind of input it was trained on. No distribution shift at inference time.

Results and Reflections

Our team finished with a silver medal out of 2500+ teams. In the final days of the competition, the OCR extraction pipeline was still producing new data — each batch of cleaned and normalized tablets pushed our scores higher. We genuinely felt like gold was within reach with a couple more days. That stings a bit, but honestly? The journey was worth more than the medal.

Here's what I'm taking away from this:

Gemini's batch inference is a superpower for unstructured data. We used it to turn scanned academic books from the 1990s — messy layouts, multiple languages, inconsistent formatting — into clean, structured training data. If it works for 4,000-year-old Assyrian tablets in Turkish and German PDFs, it'll work for your use case too. The Vertex AI Batch API made it affordable and painless at scale.

Few-shot retrieval still delivers easy gains. TF-IDF character n-gram similarity is dead simple to implement, and using retrieved examples to anchor both training and inference gave us consistent improvements with minimal effort. Small iterations, big returns.

Fine-tuning is more accessible than you think. LoRA + Unsloth meant we could train a 9B parameter model on Kaggle's free GPUs. You don't need a cluster. You need good data and the right tools.

vLLM makes deployment practical. Even on constrained hardware like Kaggle T4s, with a brand-new model and no internet access, we got a 9B model running with tensor parallelism. The ecosystem is maturing fast.

And the bigger picture — the one that got me into this competition in the first place — is that there are still thousands of untranslated tablets sitting in museums. The pipeline we built here isn't a one-off competition hack. It's a blueprint: scan the books, extract the data, train the models, translate the tablets. The tools already exist. The data is already out there. At this point, the bottleneck is no longer whether this can be done. It's whether someone is willing to do it.

Skills, Not Vibes: Teaching AI Agents to Write Clean Code

Ertuğrul Demir — Mon, 26 Jan 2026 11:17:47 +0000

In February 2025, Andrej Karpathy coined "vibe coding" to describe programming's new reality: give in to the vibes, accept all changes, "forget that the code even exists." He called it "not too bad for throwaway weekend projects." But for production systems? That's where the trouble starts.

I've watched AI-generated codebases accumulate the same mess developers spent decades learning to avoid—duplication everywhere, inconsistent naming, missing edge cases. Then it hit me: these are exactly the problems Robert C. Martin warned about in Clean Code almost two decades ago.

So I went back to the book, specifically Chapter 17's catalog of 66 code smells and heuristics. These aren't just relevant to AI coding—they're more relevant. AI makes exactly the mistakes Uncle Bob warned us about, just faster and at scale.

The solution? Skills—instruction files that AI agents read before writing code. I've translated Clean Code's complete catalog into Python skills you can use today. They work in Google's Antigravity IDE, Anthropic's Claude Code, and anywhere that supports the Agent Skills standard.

Let me show you why we need this, and how to implement it.

Even Linus Torvalds Vibe Codes (Sometimes)

In January 2026, Linus Torvalds revealed a side project called AudioNoise—a digital audio effects simulator he'd been tinkering with over the holidays. The Python visualizer, he noted, was "basically written by vibe-coding."

In his own words from the repo:

"I know more about analog filters—and that's not saying much—than I do about python. It started out as my typical 'google and do the monkey-see-monkey-do' kind of programming, but then I cut out the middle-man—me—and just used Google Antigravity to do the audio sample visualizer."

The Hacker News discussion revealed two camps. Some saw it as validation: "It's official, vibe coding is legit." Others noted the crucial context: Torvalds used AI for the part he lacks expertise in (Python visualization) while hand-coding the parts he knows (C and digital signal processing).

One commenter nailed it: "There's a big difference between vibe-coding an entire project and having an AI build a component that you lack competency for."

Another observation cut deeper: "If anyone on the planet knows how to do vibe coding right, it's him"—because Torvalds spent decades mastering code review. He can spot bad code instantly. Most of us can't.

But here's what's telling: Torvalds wrote tests for his hand-coded C—numerical accuracy checks for the DSP primitives he understands. The vibe-coded Python visualizer? No tests, no type hints, and a duplicated function definition that slipped right through. The same four-line method appears twice in a row—the first an empty stub, the second the real implementation. It's textbook "Accept All, don't read the diffs." The code runs fine (Python silently overwrites the first definition), but it's exactly the kind of dead code that accumulates into maintenance nightmares.
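The "silently overwrites" behavior is easy to reproduce. The class below is a made-up stand-in, not the code from the AudioNoise repository: it defines the same method twice, and Python keeps only the second definition without any warning.

```python
class Visualizer:
    def scale(self, value: float) -> float:
        pass  # empty stub left behind; this definition is discarded

    def scale(self, value: float) -> float:  # rebinds the name "scale"
        return value * 2.0


result = Visualizer().scale(3.0)
print(result)
```

Nothing fails at import time or at call time, so only a linter or a human reading the diff would catch the dead stub.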

This works for Torvalds' toy project precisely because it's a throwaway learning exercise. The moment that visualizer needs to be production code, those missing guardrails become technical debt.

The same week, Torvalds rejected "AI slop" submissions to the Linux kernel, arguing that documentation telling people not to submit garbage won't help because "the people who would submit it won't read the documentation anyway."

The lesson isn't that vibe coding is bad. It's that context matters. Skills let you define when to enforce rigor and when to let the vibes flow.

The Data: AI Code Quality Is Getting Worse

Google's DORA Report found AI adoption shows a negative relationship with software delivery stability. The 2025 report's central finding: "AI doesn't fix a team; it amplifies what's already there." Without robust control systems—strong testing, mature practices, fast feedback loops—increased AI-generated code leads to instability. Skills are exactly those control systems, encoded as instructions.

Carnegie Mellon researchers analyzed 807 GitHub repositories after Cursor adoption: +30% static analysis warnings, +41% code complexity. The speed gains were transient; the quality problems compounded.

GitClear's analysis of 211 million lines of code from Google, Microsoft, Meta, and enterprise repositories found code duplication increased 4x with AI adoption. For the first time in their dataset, copy/pasted code exceeded refactored code.

Even Anthropic's Agentic Coding Trends Report shows the gap: developers use AI in roughly 60% of their work, but can fully delegate only 0-20% of tasks. The rest requires "thoughtful setup, active supervision, and human judgment."

That gap—between what AI touches and what AI can own—is exactly what skills address. The setup is the skill. The supervision is the rules.

The Pattern: AI Recreates Classic Code Smells

The research consistently identifies the same failure patterns. Here's how they map to specific Clean Code violations:

Naming and Consistency Problems

Inconsistent variable names across similar functions
Vague names like data, tmp, proc
Mixing naming conventions (camelCase and snake_case)
Clean Code rules: N1 (descriptive names), G11 (consistency), G24 (conventions)

Code Duplication

Copy/paste instead of extracting shared logic
Same calculation appearing in multiple places
Pattern repetition that should be abstracted
Clean Code rule: G5 (DRY - Don't Repeat Yourself)

Missing Safety Checks

No validation of input boundaries
Assumptions about data structure without verification
Missing null/None checks
Clean Code rules: G3 (boundary conditions), G4 (don't override safeties), G26 (be precise)

Readability Issues

Magic numbers without explanation (what does 86400 mean?)
Unused variables cluttering code
Functions mixing multiple abstraction levels
Clean Code rules: G12 (remove clutter), G16 (no obscured intent), G34 (single abstraction level)

Performance Problems

Functions doing multiple things at once
Exposing internal data unnecessarily
Nested loops that could be optimized
Clean Code rules: G8 (minimize public interface), G30 (functions do one thing)

These aren't arbitrary style preferences—they're the exact problems that make code hard to maintain, debug, and extend. The skills we'll build enforce these rules automatically.
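As a small illustration of one smell from the lists above, here is the magic number 86400 before and after cleanup. The function names and the session-expiry scenario are invented for the example.

```python
# Before: the reader has to guess what 86400 means.
def is_expired_before(age_seconds: float) -> bool:
    return age_seconds > 86400


# After: the constant names the concept and shows how it is derived.
SECONDS_PER_DAY = 24 * 60 * 60


def is_session_expired(age_seconds: float,
                       max_age_seconds: int = SECONDS_PER_DAY) -> bool:
    """Return True when a session is older than the allowed maximum age."""
    return age_seconds > max_age_seconds


print(is_session_expired(90_000), is_session_expired(60))
```

Both versions behave identically; the second one simply makes the intent reviewable.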

The fix isn't to stop using AI. It's to give AI the explicit rules it needs to follow.

That's what skills do.

What Are Skills?

Skills are markdown files containing domain-specific instructions that AI agents read before working on your code. They follow the Agent Skills open standard and work in Google Antigravity, Anthropic's Claude Code, and other compatible agents.

The architecture is called Progressive Disclosure. Instead of dumping every instruction into the agent's context at once (causing what Antigravity's docs call "Context Saturation"), skills work in layers:

Discovery: The agent sees only a lightweight menu of skill names and descriptions
Activation: When your request matches a skill's description, the full instructions load
Execution: Scripts and templates are read only when the task requires them

This keeps the agent fast and focused. It's not thinking about database migrations when you're writing a React component.

The format is simple:

---
name: skill-name
description: When this skill should activate
---

# Skill Title

Your instructions, examples, and rules here.

The description field is crucial—it's the trigger phrase. The agent semantically matches your request against all available skill descriptions to decide which ones to load. "Enforces function best practices" is vague. "Use when writing or refactoring Python functions" tells the agent exactly when to activate.
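If you want to sanity-check your own skill files, a rough sketch like the one below can split the frontmatter from the body and confirm both fields are present. It assumes flat `key: value` frontmatter exactly as shown above; real frontmatter is YAML, so a proper YAML parser is the safer choice for anything more complex. `parse_skill` is a hypothetical helper, not part of any agent tooling.

```python
def parse_skill(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md string into (frontmatter fields, body).

    Assumption: the frontmatter contains only flat `key: value` lines.
    """
    lines = [line.rstrip() for line in text.splitlines()]
    if not lines or lines[0] != "---":
        raise ValueError("skill file must start with a '---' line")
    try:
        end = lines.index("---", 1)  # position of the closing delimiter
    except ValueError:
        raise ValueError("frontmatter has no closing '---' line") from None

    fields: dict[str, str] = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, separator, value = line.partition(":")
        if not separator:
            raise ValueError(f"expected 'key: value', got {line!r}")
        fields[key.strip()] = value.strip()

    # This sketch treats both fields from the format above as required.
    for required in ("name", "description"):
        if not fields.get(required):
            raise ValueError(f"missing frontmatter field: {required}")

    body = "\n".join(lines[end + 1:]).strip()
    return fields, body


SAMPLE_SKILL = """---
name: clean-functions
description: Use when writing or refactoring Python functions.
---

# Clean Functions

Your instructions, examples, and rules here.
"""

fields, body = parse_skill(SAMPLE_SKILL)
```

A skill with an empty or missing description fails fast here, which is the case you most want to catch since the description is what triggers activation.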

Skills can do far more than enforce coding standards—the community has built skills for Stripe integration, Metasploit security testing, voice agents, and even multi-agent startup automation. This article focuses on one specific use case: encoding Clean Code principles.

Let me show you how to translate Clean Code's catalog into working skills.

Building the Skills: Three Examples

Rather than catalog all 66 rules exhaustively, I'll show you three critical categories in detail. The complete implementation is at the end.

1. Comments (C1-C5): Code Should Explain Itself

Uncle Bob is famously skeptical of comments, not because documentation is bad, but because comments go stale as the code around them changes.

File Reference: clean-comments/SKILL.md

---
name: clean-comments
description: Use when writing, fixing, editing, or reviewing Python comments and docstrings. Enforces Clean Code principles—no metadata, no redundancy, no commented-out code.
---

# Clean Comments

## C1: No Inappropriate Information

Comments shouldn't hold metadata. Use Git for author names, change history, 
ticket numbers, and dates. Comments are for technical notes about code only.

## C2: Delete Obsolete Comments

If a comment describes code that no longer exists or works differently, 
delete it immediately. Stale comments become "floating islands of 
irrelevance and misdirection."

## C3: No Redundant Comments

# Bad - the code already says this
i += 1  # increment i
user.save()  # save the user

# Good - explains WHY, not WHAT
i += 1  # compensate for zero-indexing in display

## C4: Write Comments Well

If a comment is worth writing, write it well:
- Choose words carefully
- Use correct grammar
- Don't ramble or state the obvious
- Be brief

## C5: Never Commit Commented-Out Code

# DELETE THIS - it's an abomination
# def old_calculate_tax(income):
#     return income * 0.15

Who knows how old it is? Who knows if it's meaningful? Delete it. 
Git remembers everything.

## The Goal

The best comment is the code itself. If you need a comment to explain 
what code does, refactor first, comment last.
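C5 is one of the few comment rules you can also check mechanically. The sketch below is a rough heuristic, not part of the skill files: it pulls comments out with `tokenize` and flags any whose text parses as a Python statement. `looks_like_code` and `find_commented_out_code` are hypothetical helper names, and the heuristic will miss some cases and misfire on others.

```python
import ast
import io
import tokenize


def looks_like_code(comment: str) -> bool:
    """Heuristic: does the comment text parse as a Python statement?"""
    candidate = comment.lstrip("#").strip()
    if not candidate:
        return False
    if candidate.endswith(":"):
        candidate += " pass"  # lets a lone `def f():` or `if x:` header parse
    try:
        tree = ast.parse(candidate)
    except SyntaxError:
        return False  # ordinary prose almost never parses
    if not tree.body:
        return False
    statement = tree.body[0]
    # "TODO" parses as a bare name and "TODO: fix" as an annotation,
    # so only calls and real statements count as code here.
    if isinstance(statement, ast.Expr):
        return isinstance(statement.value, ast.Call)
    if isinstance(statement, ast.AnnAssign) and statement.value is None:
        return False
    return True


def find_commented_out_code(source: str) -> list[int]:
    """Return line numbers of comments that look like disabled code."""
    flagged = []
    for token in tokenize.generate_tokens(io.StringIO(source).readline):
        if token.type == tokenize.COMMENT and looks_like_code(token.string):
            flagged.append(token.start[0])
    return flagged


SAMPLE = '''\
i = 0
i += 1  # compensate for zero-indexing in display
# def old_calculate_tax(income):
#     return income * 0.15
'''

flagged_lines = find_commented_out_code(SAMPLE)
```

On the sample, the explanatory comment on line 2 is left alone and the two lines of disabled code are flagged.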

2. Functions (F1-F4): Small, Focused, Obvious

Functions should do one thing, do it well, and have an obvious purpose.

File Reference: clean-functions/SKILL.md

---
name: clean-functions
description: Use when writing or refactoring Python functions. Enforces Clean Code principles—maximum 3 arguments, single responsibility, no flag parameters.
---

# Clean Functions

## F1: Too Many Arguments (Maximum 3)

# Bad - too many parameters
def create_user(name, email, age, country, timezone, language, newsletter):
    ...

# Good - use a dataclass or dict
@dataclass
class UserData:
    name: str
    email: str
    age: int
    country: str
    timezone: str
    language: str
    newsletter: bool

def create_user(data: UserData):
    ...

More than 3 arguments means your function is doing too much or needs 
a data structure.

## F2: No Output Arguments

Don't modify arguments as side effects. Return values instead.

# Bad - modifies argument
def append_footer(report: Report) -> None:
    report.append("\n---\nGenerated by System")

# Good - returns new value
def with_footer(report: Report) -> Report:
    return report + "\n---\nGenerated by System"

## F3: No Flag Arguments

Boolean flags mean your function does at least two things.

# Bad - function does two different things
def render(is_test: bool):
    if is_test:
        render_test_page()
    else:
        render_production_page()

# Good - split into two functions
def render_test_page(): ...
def render_production_page(): ...

## F4: Delete Dead Functions

If it's not called, delete it. No "just in case" code. Git preserves history.

3. General Principles (G1-G36): The Core Rules

These are the fundamental patterns that separate clean code from legacy nightmares.

File Reference: clean-general/SKILL.md

---
name: clean-general
description: Use when reviewing Python code quality. Enforces Clean Code's core principles—DRY, single responsibility, clear intent, no magic numbers, proper abstractions.
---

# General Clean Code Principles

## Critical Rules

**G5: DRY (Don't Repeat Yourself)**

Every piece of knowledge has one authoritative representation.

# Bad - duplication
tax_rate = 0.0825
ca_total = subtotal * 1.0825
ny_total = subtotal * 1.07

# Good - single source of truth
TAX_RATES = {"CA": 0.0825, "NY": 0.07}
def calculate_total(subtotal: float, state: str) -> float:
    return subtotal * (1 + TAX_RATES[state])

**G16: No Obscured Intent**

Don't be clever. Be clear.

# Bad - what does this do?
return (x & 0x0F) << 4 | (y & 0x0F)

# Good - obvious intent
return pack_coordinates(x, y)

**G23: Prefer Polymorphism to If/Else**

# Bad - will grow forever
def calculate_pay(employee):
    if employee.type == "SALARIED":
        return employee.salary
    elif employee.type == "HOURLY":
        return employee.hours * employee.rate
    elif employee.type == "COMMISSIONED":
        return employee.base + employee.commission

# Good - open/closed principle
class SalariedEmployee:
    def calculate_pay(self): return self.salary

class HourlyEmployee:
    def calculate_pay(self): return self.hours * self.rate

class CommissionedEmployee:
    def calculate_pay(self): return self.base + self.commission

**G25: Replace Magic Numbers with Named Constants**

# Bad
if elapsed_time > 86400:
    ...

# Good
SECONDS_PER_DAY = 86400
if elapsed_time > SECONDS_PER_DAY:
    ...

**G30: Functions Should Do One Thing**

If you can extract another function, your function does more than one thing.

**G36: Law of Demeter (Avoid Train Wrecks)**

# Bad - reaching through multiple objects
output_dir = context.options.scratch_dir.absolute_path

# Good - one dot
output_dir = context.get_scratch_dir()

## Enforcement Checklist

When reviewing AI-generated code, verify:
- [ ] No duplication (G5)
- [ ] Clear intent, no magic numbers (G16, G25)
- [ ] Polymorphism over conditionals (G23)
- [ ] Functions do one thing (G30)
- [ ] No Law of Demeter violations (G36)
- [ ] Boundary conditions handled (G3)
- [ ] Dead code removed (G9)
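The G23 example in the skill shows the three employee classes but not the code that calls them, which is where the payoff is. Here is one way to round it out, as a sketch: the `Payable` protocol, the dataclass fields, and `total_payroll` are my additions for illustration, not part of the skill file.

```python
from dataclasses import dataclass
from typing import Protocol


class Payable(Protocol):
    """Anything that can compute its own pay (hypothetical shared interface)."""

    def calculate_pay(self) -> float: ...


@dataclass
class SalariedEmployee:
    salary: float

    def calculate_pay(self) -> float:
        return self.salary


@dataclass
class HourlyEmployee:
    hours: float
    rate: float

    def calculate_pay(self) -> float:
        return self.hours * self.rate


@dataclass
class CommissionedEmployee:
    base: float
    commission: float

    def calculate_pay(self) -> float:
        return self.base + self.commission


def total_payroll(employees: list[Payable]) -> float:
    """Sum pay without knowing or checking any employee's concrete type."""
    return sum(employee.calculate_pay() for employee in employees)


staff = [
    SalariedEmployee(salary=3000.0),
    HourlyEmployee(hours=10.0, rate=20.0),
    CommissionedEmployee(base=1000.0, commission=250.0),
]
payroll = total_payroll(staff)
```

Adding a fourth employee type means adding one class. `total_payroll` doesn't change, which is exactly what the growing if/elif chain can't offer.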

The Complete Catalog

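I've translated all 66 rules into skills. The six categories below come from Clean Code Chapter 17 and account for 63 of them; the remaining three (P1-P3) appear in the master skill's rule list.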


Comments (C1-C5): Minimal, accurate commenting

C1: No inappropriate information (metadata belongs in version control)
C2: Delete obsolete comments immediately
C3: No redundant comments that repeat the code
C4: Write comments well—brief, grammatical, purposeful
C5: Never commit commented-out code

Environment (E1-E2): One-command build and test

E1: Build requires only one step
E2: Tests require only one step

Functions (F1-F4): Small, focused, obvious

F1: Maximum 3 arguments (use data structures for more)
F2: No output arguments (return values instead)
F3: No flag arguments (split into separate functions)
F4: Delete dead functions

General (G1-G36): Core principles

G1: Multiple languages in one source file
G2: Obvious behavior is unimplemented
G3: Incorrect behavior at the boundaries
G4: Overridden safeties
G5: Duplication
G6: Code at wrong level of abstraction
G7: Base classes depending on their derivatives
G8: Too much information
G9: Dead code
G10: Vertical separation
G11: Inconsistency
G12: Clutter
G13: Artificial coupling
G14: Feature envy
G15: Selector arguments
G16: Obscured intent
G17: Misplaced responsibility
G18: Inappropriate static
G19: Use explanatory variables
G20: Function names should say what they do
G21: Understand the algorithm
G22: Make logical dependencies physical
G23: Prefer polymorphism to if/else or switch/case
G24: Follow standard conventions
G25: Replace magic numbers with named constants
G26: Be precise
G27: Structure over convention
G28: Encapsulate conditionals
G29: Avoid negative conditionals
G30: Functions should do one thing
G31: Hidden temporal couplings
G32: Don't be arbitrary
G33: Encapsulate boundary conditions
G34: Functions should descend only one level of abstraction
G35: Keep configurable data at high levels
G36: Avoid transitive navigation

Names (N1-N7): Descriptive, unambiguous, right-sized

N1: Choose descriptive names
N2: Choose names at the right abstraction level
N3: Use standard nomenclature where possible
N4: Use unambiguous names
N5: Use long names for long scopes
N6: Avoid encodings (Hungarian notation, etc.)
N7: Names should describe side effects

Tests (T1-T9): Fast, independent, exhaustive

T1: Insufficient tests—test everything that could break
T2: Use a coverage tool
T3: Don't skip trivial tests
T4: Ignored tests indicate ambiguity
T5: Test boundary conditions
T6: Exhaustively test near bugs
T7: Patterns of failure are diagnostic
T8: Coverage patterns can be revealing
T9: Tests should be fast

Get the complete skill files:

ertugrul-dmr / clean-code-skills

Clean Code Skills for AI Agents

Teach your AI to write code that doesn't suck.

This repository contains Agent Skills that enforce Robert C. Martin's Clean Code principles. They work with Google Antigravity, Anthropic's Claude Code, and any agent that supports the Agent Skills standard.

Why?

AI generates code fast, but research shows it also generates technical debt fast:

GitClear: 4x increase in code duplication with AI adoption
Carnegie Mellon: +30% static analysis warnings, +41% code complexity after Cursor adoption
Google DORA: Negative relationship between AI adoption and software delivery stability

These skills encode battle-tested solutions to exactly these problems—directly into your AI workflow.

What's Included

Skill	Description	Rules
`boy-scout`	Orchestrator—always leave code cleaner than you found it	Coordinates all skills
`python-clean-code`	Master skill with all 66 rules	C1-C5, E1-E2, F1-F4, G1-G36, N1-N7, P1-P3, T1-T9
`clean-comments`	Minimal, accurate commenting	C1-C5
`clean-functions`	Small, focused, obvious functions	F1-F4

…


The repo includes:

boy-scout: An orchestrator skill that embodies the Boy Scout Rule—"always leave code cleaner than you found it"—and coordinates the other skills
python-clean-code: A master skill with all 66 rules, plus a quick reference table and anti-patterns cheatsheet
Individual skills for each category (clean-comments, clean-functions, clean-general, clean-names, clean-tests)—drop in only what you need
Installation instructions for Antigravity, Claude Code, and other Agent Skills-compatible tools

How to Use These Skills

Skills sit in a specific place in the agent ecosystem. Rules are passive guardrails that are always on. Skills are agent-triggered—the model decides when to equip them based on your intent. If you're using MCP servers (connections to external tools like GitHub or Postgres), think of MCP as the "hands" and skills as the "brains" that direct them.

For Antigravity

Create .agent/skills/ in your project root (or ~/.gemini/antigravity/skills/ for global access)
Save the skill as a folder with a SKILL.md file inside (e.g., .agent/skills/python-clean-code/SKILL.md)
Ask the agent to review or write code—it'll automatically apply the rules when relevant
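If you'd rather script the first two steps, a small helper along these lines would do it. This is a sketch using only the paths mentioned above; `install_project_skill` is a hypothetical function name, not something either tool ships.

```python
from pathlib import Path


def install_project_skill(project_root: Path, skill_name: str, skill_text: str) -> Path:
    """Write skill_text to <project_root>/.agent/skills/<skill_name>/SKILL.md.

    Creates the folders if they don't exist and returns the path of the
    written file. An existing SKILL.md at that path is overwritten.
    """
    skill_dir = project_root / ".agent" / "skills" / skill_name
    skill_dir.mkdir(parents=True, exist_ok=True)
    skill_file = skill_dir / "SKILL.md"
    skill_file.write_text(skill_text, encoding="utf-8")
    return skill_file
```

From a project root you would call something like `install_project_skill(Path.cwd(), "python-clean-code", skill_text)`, where `skill_text` is the contents of the skill file you copied from the repo.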

Global vs Project Skills

Project-specific: .agent/skills/
Global Antigravity: ~/.gemini/antigravity/skills/

The agent only loads full skill content when needed, so comprehensive skills don't slow down simple requests.

Going Further

The skills in this article are instruction-only: they tell the agent what to do. For stricter enforcement, you could add a scripts/ folder with a linter that compatible agents run automatically, or an examples/ folder with before/after code samples for few-shot learning. The format supports it; we're just keeping things simple here.
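As a taste of what such a script could look like, here is a minimal magic-number check for G25. It is a sketch, not something from the repo: `find_magic_numbers` is a hypothetical name, and treating `0` and `1` as acceptable and ALL_CAPS assignments as named constants are my assumptions.

```python
import ast

ALLOWED_LITERALS = {0, 1}  # assumption: these rarely need a name


def find_magic_numbers(source: str) -> list[tuple[int, float]]:
    """Return (line, value) for numeric literals not bound to an ALL_CAPS name."""
    tree = ast.parse(source)

    # Collect literal nodes that sit on the right-hand side of a constant
    # definition such as SECONDS_PER_DAY = 86400, so they can be skipped.
    named = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign) and all(
            isinstance(target, ast.Name) and target.id.isupper()
            for target in node.targets
        ):
            named.update(id(child) for child in ast.walk(node.value))

    findings = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)  # True/False are not numbers here
            and node.value not in ALLOWED_LITERALS
            and id(node) not in named
        ):
            findings.append((node.lineno, node.value))
    return sorted(findings)


SAMPLE = '''\
SECONDS_PER_DAY = 86400
if elapsed_time > 86400:
    pass
if elapsed_time > SECONDS_PER_DAY:
    pass
'''

magic_numbers = find_magic_numbers(SAMPLE)
```

On the sample it flags the bare `86400` on line 2 and ignores both the constant definition and the comparison that uses the name.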

A Real-World Example

Here's code that violates multiple Clean Code rules:

from utils import *  # P1
# Author: John, Modified: 2024-01-15  # C1
def proc(d, t, flag=False):  # N1, F1, F3
    # Process the data  # C3
    x = []  # N1
    for i in d:
        if flag:  # F3
            if i['type'] == 'A':  # G23
                x.append(i['val'] * 1.0825)  # G25
            elif i['type'] == 'B':
                x.append(i['val'] * 1.05)  # G25
        else:
            x.append(i['val'])
    with open(f'/tmp/{t}.json', 'w') as f:  # G6
        json.dump(x, f)
    # Old approach  # C5
    # for item in d:
    #     print(item)
    return x

Violations: P1, C1, C3, C5, F1, F3, G6, G23, G25, N1

With the Clean Code skill active, ask your AI agent to refactor this:

import json
from pathlib import Path
from typing import List, Literal
from dataclasses import dataclass

TAX_RATE_CA = 0.0825
TAX_RATE_NY = 0.05
TransactionType = Literal['CA', 'NY']

@dataclass
class Transaction:
    value: float
    type: TransactionType

def apply_tax(transaction: Transaction) -> float:
    """Apply state-specific tax to transaction value."""
    tax_rates = {'CA': TAX_RATE_CA, 'NY': TAX_RATE_NY}
    return transaction.value * (1 + tax_rates[transaction.type])

def process_transactions_with_tax(
    transactions: List[Transaction]
) -> List[float]:
    """Calculate taxed values for all transactions."""
    return [apply_tax(t) for t in transactions]

def process_transactions_without_tax(
    transactions: List[Transaction]
) -> List[float]:
    """Extract raw values from all transactions."""
    return [t.value for t in transactions]

def save_results(values: List[float], output_path: Path) -> None:
    """Save processed values to JSON file."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open('w') as f:
        json.dump(values, f)

The refactored version:

✅ No wildcard imports (P1)
✅ No metadata comments (C1)
✅ No redundant comments (C3)
✅ No commented-out code (C5)
✅ Descriptive names (N1)
✅ No flag arguments (F3)
✅ Named constants instead of magic numbers (G25)
✅ Functions do one thing (G30)
✅ Lookup table replaces the if/elif chain (G23)

Anatomy of a Vibe-Coded Script

Remember the duplicated function I mentioned in Torvalds' AudioNoise visualizer? Here it is:

def update_slider_text(self, val):
    """Helper to update slider texts (Width and End Point)."""
    start_val, end_val = val
    width = end_val - start_val

def update_slider_text(self, val):
    """Helper to update slider texts (Width and End Point)."""
    start_val, end_val = val
    width = end_val - start_val

    if self.x_mode == 'Time':
        self.slider.valtext.set_text(f"Window: {start_val:.3f} + {width:.3f} s")
    else:
        self.slider.valtext.set_text(f"Window: {int(start_val)} + {int(width)}")

The first definition unpacks values, calculates width, then... returns None. The second definition is the real implementation. Python silently overwrites the first with the second, so the code runs. But it's textbook dead code—Clean Code rule G9: Remove dead code.
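Because Python accepts the redefinition without complaint, this kind of dead code is easy to miss by eye but simple to find with `ast`. The sketch below is mine, not part of the skills: `find_redefinitions` and the `Visualizer` class are hypothetical names. It only looks at functions defined directly in a module, class, or function body, and it skips decorated functions so that property setters and overloads aren't reported.

```python
import ast

SCOPES = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
FUNCTIONS = (ast.FunctionDef, ast.AsyncFunctionDef)


def find_redefinitions(source: str) -> list[tuple[str, list[int]]]:
    """Return (name, line numbers) for functions defined twice in one scope."""
    findings = []
    for scope in ast.walk(ast.parse(source)):
        if not isinstance(scope, SCOPES):
            continue
        lines_by_name: dict[str, list[int]] = {}
        for child in scope.body:
            # Decorated functions are skipped: @x.setter and @overload
            # legitimately reuse a name (an assumption of this sketch).
            if isinstance(child, FUNCTIONS) and not child.decorator_list:
                lines_by_name.setdefault(child.name, []).append(child.lineno)
        findings.extend(
            (name, lines) for name, lines in lines_by_name.items() if len(lines) > 1
        )
    return findings


SAMPLE = '''\
class Visualizer:
    def update_slider_text(self, val):
        start_val, end_val = val
        width = end_val - start_val

    def update_slider_text(self, val):
        start_val, end_val = val
        return end_val - start_val
'''

redefinitions = find_redefinitions(SAMPLE)
```

On the sample it reports `update_slider_text` at lines 2 and 6, the same shape as the duplicate above.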

With the skill active, an agent refactors the entire 600-line script. The duplicate vanishes, magic numbers become constants, and nested functions get extracted into focused methods:

def update_slider_text(self, val: tuple[float, float]):
    """Update slider text with either time or sample count."""
    start_val, end_val = val
    width = end_val - start_val

    if self.x_mode == 'Time':
        self.slider.valtext.set_text(f"Window: {start_val:.3f} + {width:.3f} s")
    else:
        self.slider.valtext.set_text(f"Window: {int(start_val)} + {int(width)}")

The refactored version:

✅ Dead code removed (G9)
✅ Type hints added (clarity)
✅ Single, authoritative definition (G5)
✅ Magic numbers extracted to constants (G25)
✅ Large methods decomposed (G30)

The full diff shows 600+ lines reduced to ~440—not by removing functionality, but by eliminating duplication and extracting reusable patterns.

Why This Matters Now

Vibe coding isn't going away. AI will get better at generating code, not worse. But "better at generating" doesn't mean "better at maintaining."

The research is clear: AI produces code faster, but that code accumulates technical debt faster too. Without guardrails, we're building tomorrow's legacy systems today.

Uncle Bob's Clean Code principles are almost 20 years old, but they're exactly what we need now. They're not arbitrary style preferences—they're battle-tested solutions to the problems AI recreates at scale.

Skills give you the mechanism to encode these rules directly into your AI workflow. Whether you're using Antigravity, Claude Code, or another agent, the approach is the same: define what clean code means, then let the AI follow the rules.

Your agent doesn't know what good code looks like unless you tell it.

So tell it.

Resources

The Book

Clean Code by Robert C. Martin: Amazon

Skills Documentation

Agent Skills Standard — The open standard for AI agent instructions
Antigravity Skills Guide — Google's official documentation
Claude Code Agent Skills — Anthropic's implementation

Research Cited

DORA 2025: AI-Assisted Software Development — Google's findings on AI and delivery stability
Code Quality After Cursor Adoption — Carnegie Mellon's analysis of 807 repositories
GitClear 2025 Code Quality Report — 211M lines analyzed
Agentic Coding Trends — Anthropic's delegation gap analysis

Get the Skills

Clean Code Skills Repository — All 66 rules as ready-to-use skill files

The future of programming is human intent translated by AI. Make sure the translation preserves quality, not just speed.