<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ertuğrul Demir</title>
    <description>The latest articles on DEV Community by Ertuğrul Demir (@ertugrul_demir).</description>
    <link>https://dev.to/ertugrul_demir</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3700961%2Fa008c5d6-a099-4c11-827e-bc2df02828a9.jpg</url>
      <title>DEV Community: Ertuğrul Demir</title>
      <link>https://dev.to/ertugrul_demir</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ertugrul_demir"/>
    <language>en</language>
    <item>
      <title>The Local Model That Doesn't Sleep: Gemma 4 + MTP as a Marathon Engine</title>
      <dc:creator>Ertuğrul Demir</dc:creator>
      <pubDate>Fri, 08 May 2026 09:01:07 +0000</pubDate>
      <link>https://dev.to/gde/the-local-model-that-doesnt-sleep-gemma-4-mtp-as-a-marathon-engine-4c9</link>
      <guid>https://dev.to/gde/the-local-model-that-doesnt-sleep-gemma-4-mtp-as-a-marathon-engine-4c9</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Write About Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I set the agent running just before midnight, did a quick mental count of my remaining API quota, and went to sleep. I was going to wake up to a finished job. That was the plan, anyway...&lt;/p&gt;

&lt;p&gt;What I actually woke up to was a frozen terminal. The agent had stopped in the tenth minute. The remote service had gone down overnight and taken the whole job with it. The task I had given it was simple enough: scrape fifty documentation pages, cross-reference the data across sources, produce a structured summary. It had barely started before the infrastructure I had no control over just switched off.&lt;/p&gt;

&lt;p&gt;The model wasn't failing. The problem wasn't intelligence. The problem was that I was building on a foundation I didn't own: a service that could go down, a quota that could run out, and no way to know which one was waiting for me in the morning.&lt;/p&gt;

&lt;p&gt;I had always worked with local models on the side: trained them, tested them, liked them. But to be honest, I'd never trusted them much in the past for complex tasks. They were a hobby, not a solution. Too much babysitting required for a real workload. I had filed them under "interesting" and left them there. That frozen terminal moved them to a different folder.&lt;/p&gt;




&lt;p&gt;For a long time, the gap between the proprietary giants and the open-source world felt like a canyon. You had the "God-models" in the closed gates: GPT, Claude, Gemini. They could reason through almost anything, but you had to play by their rules. If you wanted actual intelligence, you paid the subscription and accepted their rules.&lt;/p&gt;

&lt;p&gt;But lately, that canyon is shrinking.&lt;/p&gt;

&lt;p&gt;We're seeing a massive push from the open-weights community. Models like DeepSeek V4, Kimi K2.6, and GLM-5.1 are proving that high-end reasoning is becoming a commodity. The problem is the weight. Unless you're running a server farm or expensive rack, hosting a model at that scale is a logistical nightmare. Great to admire from a distance, but too heavy to actually build with.&lt;/p&gt;

&lt;p&gt;Then came the sweet spot: Gemma 4 31B and Qwen 3.6 27B.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqdvhoyu98yeo73hpo629.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqdvhoyu98yeo73hpo629.png" alt="Gemma Official Banner" width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Suddenly, the math changed. These models aren't as smart as the trillion-parameter giants, but they fit. They fit on consumer-grade GPUs. They work offline. And they work for free, minus whatever your GPU costs in electricity...&lt;/p&gt;

&lt;p&gt;But here is the thing: I don't think the goal of local models is to beat the cloud models at a game of IQ.&lt;/p&gt;

&lt;p&gt;For a complex task, you still want the big guns. You want the most powerful model available to handle high-value iterations where precision is everything. That is a sprint.&lt;/p&gt;

&lt;p&gt;But what happens when the task isn't a sprint? What happens when you need a model to work for six hours straight? To scrape a hundred pages, try fifty different reasoning paths, fail, pivot, and keep grinding until the job is done?&lt;/p&gt;

&lt;p&gt;That is a marathon.&lt;/p&gt;

&lt;p&gt;And in a marathon, intelligence is secondary to endurance.&lt;/p&gt;

&lt;p&gt;The real advantage of a local setup isn't just privacy or cost. It is the fact that you have a little working engine that doesn't get tired. No rate limits. No monthly token quota. It is completely yours, and you can leave it running all night while you sleep.&lt;/p&gt;

&lt;p&gt;The stamina was already there. Then, recently, the Gemma family got something new: a way to run faster without burning out. A marathon engine that also picks up pace doesn't just finish sooner. It fits more work into the same night.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Turbocharger (What is MTP?)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8ecuko1021co9bk6wsd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8ecuko1021co9bk6wsd.png" alt="Based on https://x.com/googlegemma/status/2051694045869879749" width="800" height="999"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before we get into the build, we need to talk about why this suddenly became possible. If you've been following the Gemma 4 release, you probably saw the term &lt;strong&gt;MTP&lt;/strong&gt; (Multi-Token Prediction).&lt;/p&gt;

&lt;p&gt;One thing worth naming up front: MTP isn't just a runtime trick bolted onto inference. It is a training objective. Google trained Gemma 4 from the ground up with auxiliary heads that predict multiple future tokens simultaneously. That structural choice is what lets the speculative-decoding pipeline below run so tightly integrated and efficient, far more so than older bolt-on drafters like Medusa or generic small-model speculative decoding.&lt;/p&gt;

&lt;p&gt;On the surface, Google says it makes the models "up to 3x faster." But as a developer, you know that "faster" can mean a lot of things. In this case, it is not about making the GPU clock speed higher. It is about changing how the model actually thinks.&lt;/p&gt;

&lt;p&gt;Standard LLMs are autoregressive. They produce one token at a time. It doesn't matter if the next word is completely predictable or a complex logic puzzle: the model spends the same amount of energy and time to generate that one single token. This is the latency bottleneck. Your GPU spends most of its time just moving parameters around, waiting to spit out one word.&lt;/p&gt;

&lt;p&gt;MTP fixes this using a technique called &lt;strong&gt;Speculative Decoding&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Think of it as pairing a heavy target model (the 31B brain) with a lightweight "drafter." The drafter is autoregressive too. It just runs much faster because of its size, producing a short candidate sequence in the time the target would take to produce a single token.&lt;/p&gt;

&lt;p&gt;For example, if the model is generating something as predictable as "Once upon a time," the words "in a galaxy far far away" are practically a given in some contexts. A standard model would still grind through each of those words one by one, spending the same compute on a cliché as it would on a genuine reasoning problem. The drafter generates the likely sequence quickly simply because of its small size.&lt;/p&gt;

&lt;p&gt;Then the target model steps in. Instead of generating those tokens one by one, it verifies the entire draft in a single parallel forward pass. The same weight load that normally yields one token now yields a lot more (depending on the drafted sequence). If the drafter was fully right, you get the whole sequence accepted in the time it usually takes to generate one word, and the target even throws in one extra token of its own as a bonus. If the drafter was only partially right, the target accepts everything up to the first disagreement, swaps in its own token at that point, and the process continues. Either way, the output follows the same probability distribution as running the target model alone. The acceptance algorithm is a mathematical guarantee, not a heuristic.&lt;/p&gt;

&lt;p&gt;The result is a massive win for local agents.&lt;/p&gt;

&lt;p&gt;When you are building an agent that needs to iterate, research, and self-correct, you are basically running a loop of "Think → Act → Observe." If every "Think" step takes a minute, your agent is a snail. If MTP cuts that down to a matter of seconds, your agent becomes a real-time engine.&lt;/p&gt;

&lt;p&gt;You get the pretty strong reasoning of a 31B model, but it's delivered at the speed of a much smaller one. For a "marathon" task, this is the difference between a project that takes a day and one that finishes by breakfast.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Engine Room
&lt;/h3&gt;

&lt;p&gt;Now, the question is: how do you actually run this without your computer turning into a space heater?&lt;/p&gt;

&lt;p&gt;When it comes to local inference, the landscape is usually split between two different philosophies. On one side, you have the &lt;code&gt;llama.cpp&lt;/code&gt; ecosystem. This is the powerhouse of versatility. It’s the project that effectively democratized local LLMs, allowing us to run massive models on everything from MacBooks to old gaming PCs by utilizing GGUF and sophisticated memory offloading. If you need a model to run on a weird hardware configuration or want to lean on your system RAM, &lt;code&gt;llama.cpp&lt;/code&gt; is the tool for the job.&lt;/p&gt;

&lt;p&gt;But for an endurance engine, versatility is secondary to &lt;strong&gt;throughput&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That’s where &lt;a href="https://vllm.ai/" rel="noopener noreferrer"&gt;&lt;strong&gt;vLLM&lt;/strong&gt;&lt;/a&gt; comes in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fap3b3vh1hala2p6heiq9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fap3b3vh1hala2p6heiq9.png" alt="vLLM Official Logo" width="800" height="229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While &lt;code&gt;llama.cpp&lt;/code&gt; is built for the individual user's flexibility, vLLM is built for the scale of a serving engine. To understand why, you have to understand the "Double Penalty" of long context.&lt;/p&gt;

&lt;p&gt;When you increase the context length of a model, you get hit twice. First, you have the &lt;strong&gt;Compute Cost&lt;/strong&gt;: the model has to attend to every previous token, so the work increases as the sequence grows. Second, you have the &lt;strong&gt;Memory Cost&lt;/strong&gt;: you have to store the &lt;strong&gt;KV Cache&lt;/strong&gt;, the pre-computed Keys and Values for every past token, so the model does not have to recompute that history from scratch on every new step.&lt;/p&gt;

&lt;p&gt;In a standard setup, this KV cache is stored in one contiguous block of VRAM. But in the real world, sequences have different lengths. This leads to massive memory fragmentation: you have "holes" in your VRAM that are too small to be used but too large to ignore. As your context grows, this waste grows with it. Eventually, your batch size collapses, and your GPU sits underutilized while your agent crawls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PagedAttention&lt;/strong&gt; is vLLM's solution, and it's basically "Virtual Memory" for LLMs.&lt;/p&gt;

&lt;p&gt;Instead of storing the KV cache as one giant chunk, PagedAttention splits it into fixed-size blocks, or "pages." It uses a page table to map logical tokens to physical memory blocks. This means the model can store the cache anywhere in VRAM, eliminating fragmentation and allowing it to pack requests tightly.&lt;/p&gt;

&lt;p&gt;For a research agent that is reading fifty pages of documentation, this is the difference between the agent finishing the job and the system crashing with an &lt;code&gt;Out of Memory&lt;/code&gt; error. It also enables &lt;strong&gt;prefix caching&lt;/strong&gt;: if your agent asks ten different questions about the same documentation, vLLM doesn't recompute the documentation ten times. It shares the same KV pages across all requests.&lt;/p&gt;

&lt;p&gt;The best part is that we no longer have to wait for the community to "hack" MTP support into the codebase. vLLM launched &lt;strong&gt;Day-0 support&lt;/strong&gt; for Gemma 4 MTP.&lt;/p&gt;

&lt;p&gt;They provided a ready-to-use Docker image, which effectively removes the "dependency hell" that usually comes with cutting-edge AI releases. You don't have to spend your afternoon wrestling with CUDA versions or Triton kernels. You pull the image, spin up the server, and you have a high-performance MTP engine running on consumer hardware.&lt;/p&gt;

&lt;p&gt;Because vLLM provides an &lt;strong&gt;OpenAI-compatible API&lt;/strong&gt;, the integration is seamless. The server sits there as a lightweight endpoint, and any tool, whether it's a custom Python script or an agentic orchestrator like &lt;code&gt;pi&lt;/code&gt;, can talk to it using standard API calls.&lt;/p&gt;

&lt;p&gt;You’ve effectively decoupled the "Brain" (the model) from the "Pilot" (the agent). The Brain lives in vLLM, optimized for raw speed and memory efficiency. The Pilot lives in your orchestration layer, focusing on the logic and the goal.&lt;/p&gt;




&lt;h3&gt;
  
  
  Setting Up vLLM
&lt;/h3&gt;

&lt;p&gt;Time to actually run the thing. This is where most local-model articles get bogged down in CUDA versions, Triton kernels, and Python env nightmares. &lt;a href="https://dev.to/gde/decoding-bronze-age-paperwork-modern-ai-vs-ancient-assyrian-clay-tablets-5adf"&gt;Fine-tuning a model on Bronze Age tablets&lt;/a&gt;, I can handle. CUDA toolchain mismatches at 1 AM, I cannot.&lt;/p&gt;

&lt;p&gt;Luckily, the vLLM team shipped a pre-release Docker image specifically for Gemma 4. If you’re on Hopper, you grab &lt;code&gt;vllm/vllm-openai:gemma4-0505-cu129&lt;/code&gt;. On Blackwell, it’s &lt;code&gt;vllm/vllm-openai:gemma4-0505-cu130&lt;/code&gt;. One small but important gotcha: the standard &lt;code&gt;vllm/vllm-openai:latest&lt;/code&gt; tag does &lt;strong&gt;not&lt;/strong&gt; include MTP speculative decoding for Gemma 4 yet. If you pull the default image out of habit, the &lt;code&gt;--speculative-config&lt;/code&gt; flag will silently get you nowhere.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker pull vllm/vllm-openai:gemma4-0505-cu130
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s dependency hell, gone in one command.&lt;/p&gt;

&lt;p&gt;The next problem is fitting a 31B-parameter model on a single card. In native BF16, Gemma 4 31B eats a serious chunk of VRAM just to load the weights, before a single byte goes to the KV cache. That’s server-class hardware territory, not a workstation, and certainly not a single consumer card like the RTX 5090 with its 32GB of VRAM.&lt;/p&gt;

&lt;p&gt;The trick is &lt;strong&gt;NVFP4&lt;/strong&gt;, NVIDIA’s 4-bit floating-point format, native to Blackwell. NVIDIA published a quantized checkpoint, &lt;code&gt;nvidia/gemma-4-31B-it-NVFP4&lt;/code&gt;, that drops the weights to roughly &lt;strong&gt;19GB&lt;/strong&gt;. Stack an FP8 KV cache on top of that, and a 31B reasoning model fits comfortably on a consumer Blackwell card like the RTX 5090, with headroom left over for serving.&lt;/p&gt;

&lt;p&gt;Here’s the actual launch command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--gpus&lt;/span&gt; all &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--privileged&lt;/span&gt; &lt;span class="nt"&gt;--ipc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;host &lt;span class="nt"&gt;-p&lt;/span&gt; 8000:8000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; ~/.cache/huggingface:/root/.cache/huggingface &lt;span class="se"&gt;\&lt;/span&gt;
  vllm/vllm-openai:gemma4-0505-cu130 nvidia/gemma-4-31B-it-NVFP4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--kv-cache-dtype&lt;/span&gt; fp8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tensor-parallel-size&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-auto-tool-choice&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tool-call-parser&lt;/span&gt; gemma4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--chat-template&lt;/span&gt; examples/tool_chat_template_gemma4.jinja &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--reasoning-parser&lt;/span&gt; gemma4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--speculative-config&lt;/span&gt; &lt;span class="s1"&gt;'{"model":"google/gemma-4-31B-it-assistant","num_speculative_tokens":4}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few lines worth calling out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--kv-cache-dtype fp8&lt;/code&gt; cuts the KV cache footprint roughly in half. Long contexts are still expensive, just half as expensive.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;--tool-call-parser&lt;/code&gt;, &lt;code&gt;--reasoning-parser&lt;/code&gt;, and &lt;code&gt;--chat-template&lt;/code&gt; trio wires up Gemma 4’s native function calling and structured &lt;strong&gt;thinking mode&lt;/strong&gt;. We don’t need tools for the benchmark itself, but any agent that drives this engine afterwards will.&lt;/li&gt;
&lt;li&gt;The interesting line is the last one. &lt;code&gt;--speculative-config&lt;/code&gt; is the switch that turns MTP on. The &lt;strong&gt;target&lt;/strong&gt; is the NVFP4 31B model doing the actual reasoning. The &lt;strong&gt;drafter&lt;/strong&gt; is &lt;code&gt;google/gemma-4-31B-it-assistant&lt;/code&gt;, a &lt;strong&gt;0.5B-parameter&lt;/strong&gt; companion model that Google ships specifically as the speculative-decoding partner for the 31B. At roughly 60x smaller than the target, it generates draft tokens fast enough that the verification step costs almost nothing extra. It also shares the target model’s KV cache and feeds off its final-layer activations rather than building its own context from scratch, which is why the acceptance rate stays stable even as context grows. &lt;code&gt;num_speculative_tokens: 4&lt;/code&gt; is the recommended starting point at this scale; vLLM’s own benchmarks suggest pushing up to 8 if your acceptance rate holds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the container boots, vLLM exposes an OpenAI-compatible endpoint on &lt;code&gt;localhost:8000&lt;/code&gt;. Anything that already speaks the OpenAI API talks to this. No new SDK, no custom wire protocol, no learning curve.&lt;/p&gt;

&lt;p&gt;That’s the whole engine. Brain loaded, drafter wired up, KV cache paged. Now the only question worth answering is whether MTP actually earns its keep, or whether it’s another "up to 3x faster" line that quietly evaporates on real workloads.&lt;/p&gt;

&lt;p&gt;That’s what the next section is for.&lt;/p&gt;




&lt;h3&gt;
  
  
  Does MTP Actually Earn Its Keep?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjirzo3kqticuxt0iigz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjirzo3kqticuxt0iigz.png" alt="vLLM Bench Results" width="800" height="593"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I ran this on a dedicated Nvidia RTX PRO 6000 Blackwell 96GB instance rather than my local machine, and used the unquantized BF16 checkpoint. The PRO 6000 is a workstation card, not a consumer one — I picked it deliberately. Local inference benchmarking is noisy (background processes, thermal throttling, memory contention), and BF16 weights in a clean isolated environment let me measure the MTP mechanism itself without quantization or thermal effects muddying the numbers.&lt;/p&gt;

&lt;p&gt;The trade-off worth naming: these numbers are not directly what a 5090 running NVFP4 will hit. The two setups pull in different directions — the PRO 6000 has more raw compute and memory bandwidth, but NVFP4 on Blackwell has native FP4 tensor cores and a much smaller memory footprint, which matters a lot for the bandwidth-bound decode step. Which curve ends up higher in absolute tok/s is an empirical question I haven't answered here. What does transfer is the shape: where MTP wins, where the gain narrows, and where it crosses over. If you want exact numbers for your card, run llama-benchy yourself with the config from the previous section.&lt;/p&gt;

&lt;p&gt;The first test used vLLM’s own built-in benchmark tool, &lt;code&gt;vllm bench serve&lt;/code&gt;. The setup was a controlled A/B: everything identical except the presence of &lt;code&gt;--speculative-config&lt;/code&gt;. Three runs per arm, results averaged.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vllm bench serve &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; google/gemma-4-31B-it &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; localhost &lt;span class="nt"&gt;--port&lt;/span&gt; 8000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dataset-name&lt;/span&gt; random &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--random-input-len&lt;/span&gt; 1024 &lt;span class="nt"&gt;--random-output-len&lt;/span&gt; 1024 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--num-prompts&lt;/span&gt; 100 &lt;span class="nt"&gt;--max-concurrency&lt;/span&gt; 32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Spec OFF: &lt;strong&gt;356 tok/s&lt;/strong&gt;. Spec ON: &lt;strong&gt;642 tok/s&lt;/strong&gt;. A consistent &lt;strong&gt;1.80x&lt;/strong&gt; across all three runs.&lt;/p&gt;

&lt;p&gt;But &lt;code&gt;vllm bench serve&lt;/code&gt; answers a different question than the one I was actually asking. It is built to stress-test a serving deployment: it saturates the server at concurrency 32, mixes request queues, and measures aggregate output across all users at once. That is exactly what you want if you are sizing a production API. It is not what you want if you are asking how fast a single agent thinks on a long task.&lt;/p&gt;

&lt;p&gt;There is also a structural problem with the random dataset beyond just MTP. It is the only format that lets you pin exact input and output lengths. And &lt;code&gt;vllm bench serve&lt;/code&gt; has no mechanism to measure how performance changes as context grows, which is exactly what matters for a marathon task.&lt;/p&gt;

&lt;p&gt;The question I actually needed to answer was different: how does per-request generation speed change as context grows from zero to 120k? Real text, real acceptance rates, one request at a time.&lt;/p&gt;

&lt;p&gt;For that, I used &lt;a href="https://github.com/eugr/llama-benchy" rel="noopener noreferrer"&gt;&lt;strong&gt;llama-benchy&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;The Context Ladder&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;llama-benchy is a llama-bench style tool built for any OpenAI-compatible endpoint. The key differences from &lt;code&gt;vllm bench serve&lt;/code&gt; are threefold: it runs one request at a time, which is the actual solo-agent scenario; it uses real book text from Project Gutenberg, which gives the speculative drafter something meaningful to predict; and it sweeps across context depths, so you can see exactly how performance changes as the KV cache fills.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llama-benchy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--base-url&lt;/span&gt; http://localhost:8000/v1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--latency-mode&lt;/span&gt; generation &lt;span class="nt"&gt;--skip-coherence&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--pp&lt;/span&gt; 2048 &lt;span class="nt"&gt;--tg&lt;/span&gt; 480 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--depth&lt;/span&gt; 0 1000 5000 10000 20000 50000 100000 120000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--book-url&lt;/span&gt; https://www.gutenberg.org/files/2600/2600-0.txt &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-cache&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the full comparison across the context window, one request at a time:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context depth&lt;/th&gt;
&lt;th&gt;Spec ON (tok/s)&lt;/th&gt;
&lt;th&gt;Spec OFF (tok/s)&lt;/th&gt;
&lt;th&gt;Advantage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0 (fresh start)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;52.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;22.3&lt;/td&gt;
&lt;td&gt;2.4x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5k&lt;/td&gt;
&lt;td&gt;46.2&lt;/td&gt;
&lt;td&gt;21.7&lt;/td&gt;
&lt;td&gt;2.1x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10k&lt;/td&gt;
&lt;td&gt;40.3&lt;/td&gt;
&lt;td&gt;21.3&lt;/td&gt;
&lt;td&gt;1.9x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20k&lt;/td&gt;
&lt;td&gt;38.3&lt;/td&gt;
&lt;td&gt;20.6&lt;/td&gt;
&lt;td&gt;1.9x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50k&lt;/td&gt;
&lt;td&gt;27.0&lt;/td&gt;
&lt;td&gt;19.7&lt;/td&gt;
&lt;td&gt;1.4x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100k&lt;/td&gt;
&lt;td&gt;19.1&lt;/td&gt;
&lt;td&gt;18.4&lt;/td&gt;
&lt;td&gt;~1.0x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;120k&lt;/td&gt;
&lt;td&gt;16.6&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;18.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.9x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;As an additional test, I increased &lt;code&gt;num_speculative_tokens&lt;/code&gt; from 4 to 8 to see if performance would scale. While the 8-token configuration did improve throughput, the results showed clear diminishing returns in this dataset. Across most context lengths, doubling the speculative tokens only yielded a modest bump of roughly 2 to 3.5 tok/s over the 4-token setup, with the most noticeable gains in the 10k to 50k range.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The engine does not get tired. But past a certain point, the turbocharger becomes a drag.&lt;/p&gt;

&lt;p&gt;Two things stand out. First, spec OFF is surprisingly stable: only a 19% drop across the entire 120k window, from 22.3 to 18.0 tok/s. The model's autoregressive baseline is memory-bandwidth bound and barely sensitive to context length on its own. Second, spec ON drops 68% over the same range, from 52.5 to 16.6 tok/s. The drafter overhead compounds with the growing attention cost: the shared KV cache it attends over gets larger with every token processed, and that cost grows whether or not the drafter is predicting well.&lt;/p&gt;

&lt;p&gt;The crossover lands at around 100k tokens. At 120k, spec OFF is actually faster.&lt;/p&gt;

&lt;p&gt;It is also worth noting that acceptance rate is workload-dependent. The vLLM bench reported an acceptance length of 3.54 out of 4 on random dataset tokens. The ladder benchmark on War and Peace text showed a consistent ~2.7 out of 4 across all context depths. The inversion is counterintuitive — you might expect coherent prose to be more predictable than random tokens — but vLLM's random dataset feeds uniform random vocab IDs as input, which is a fairly degenerate condition for an LLM to operate in. Models under high uncertainty have a documented tendency to fall back toward repetitive or low-entropy outputs, and that kind of output is exactly what a small drafter handles well. The two benchmarks also differ in concurrency and decode settings, which complicates a direct comparison further. The takeaway isn't that one number is wrong, it's that 3.54/4 isn't the figure that will generalize to a real workload. The 2.7/4 on coherent prose is closer to what an agent on real text will see.&lt;/p&gt;

&lt;p&gt;One more thing worth naming: MTP only touches the generation side. Prefill is compute-bound and speculative decoding does nothing for it. For a read-heavy agent continuously ingesting new documents, the time spent waiting for the model to process each new chunk of context is unaffected by whether spec decode is on or off. That is the next constraint, and prefix caching is what addresses it: if the agent revisits the same source material across multiple reasoning steps, the cached KV pages are free.&lt;/p&gt;

&lt;p&gt;For a typical agentic task in the short to medium context range, this is not a concern. The 2x+ advantage holds through 20k tokens and is still meaningful at 50k. But for a task designed to fill the full context window, the honest recommendation is to pick the configuration based on your expected average depth: spec ON for workloads that mostly stay under 50k, spec OFF if your agent spends most of its time deep in a 100k+ session. vLLM doesn't let you flip &lt;code&gt;--speculative-config&lt;/code&gt; per request, so this is a server-launch decision, not a runtime one.&lt;/p&gt;

&lt;p&gt;These numbers are also conservative in a second way: they come from near-default vLLM settings. There is meaningful headroom on top of both curves. The most impactful levers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NVFP4 weights + FP8 KV cache&lt;/strong&gt;: the production setup from the previous section. Cuts weight footprint from ~62GB to ~19GB and halves KV cache memory, freeing headroom for larger batches or longer contexts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--enable-chunked-prefill&lt;/code&gt;&lt;/strong&gt;: overlaps prefill computation with ongoing decode steps. Helps TTFT under load without touching throughput.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefix caching&lt;/strong&gt;: if the agent re-reads the same documents across multiple reasoning steps, vLLM shares KV pages across those requests instead of recomputing them. For a research loop that revisits the same source material, this is a significant multiplier.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FlashInfer attention backend&lt;/strong&gt; (&lt;code&gt;--attention-backend flashinfer&lt;/code&gt;): optimized for Blackwell, can improve throughput over the default backend at longer context lengths where the attention step dominates.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  The Pilot
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbvw06xk8sd53o2ce1dux.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbvw06xk8sd53o2ce1dux.png" alt="Pi Logo" width="225" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The benchmarks answer the speed question. The actual workflow question is: what do you point at this thing?&lt;/p&gt;

&lt;p&gt;For the agent layer, I have been using &lt;a href="https://pi.dev" rel="noopener noreferrer"&gt;&lt;strong&gt;Pi&lt;/strong&gt;&lt;/a&gt;. Minimal terminal harness, tiny system prompt, fully extensible. No context bloat, no baked-in opinions about how your workflow should look. For marathon tasks where every token in the context window has to earn its place, lean tooling matters.&lt;/p&gt;

&lt;p&gt;Pointing it at the local engine is one config file. Add this to &lt;code&gt;~/.pi/agent/models.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"providers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"vllm-gemma4"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"baseUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8000/v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"api"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai-completions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"apiKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"none"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"google/gemma-4-31B-it"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"contextWindow"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;128000&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6h63w5y9fs5mzgh4jsne.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6h63w5y9fs5mzgh4jsne.png" alt="Pi Coding Agent Working" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Switch to it with &lt;code&gt;/model&lt;/code&gt;. Pi talks to your local vLLM instance the same way it would talk to any cloud endpoint. The Brain and the Pilot stay fully decoupled: one handles raw inference speed, the other handles the logic and the goal.&lt;/p&gt;

&lt;p&gt;Which left only one thing to find out: whether the whole stack actually holds up overnight.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Marathon
&lt;/h3&gt;

&lt;p&gt;Couple days later. Same shape of task, slightly different flavor: point the agent at a pile of raw sources, papers, scattered docs, half-finished notes, and have it build a Karpathy-style LLM wiki out of them. Structured markdown files and entity pages, the whole thing knitting itself together as it went. The kind of job that rewards grinding: read, summarize, link, double back, refine. I pointed Pi at the local vLLM endpoint, set it running just before midnight, and went to sleep.&lt;/p&gt;

&lt;p&gt;This time I woke up to a populated &lt;code&gt;wiki/&lt;/code&gt; directory. Forty-something markdown files, a few hundred wikilinks, and a &lt;code&gt;conflicts.md&lt;/code&gt; where the agent had flagged two sources disagreeing instead of silently picking a winner. No frozen terminal. No 12:10 AM service outage. The engine had just kept going through the night, on my desk, at whatever speed MTP and a 31B model could manage on consumer silicon.&lt;/p&gt;

&lt;p&gt;That's what the marathon engine is actually for. Not to beat the cloud giants on a single hard reasoning step; it won't, and I don't ask it to. To be the thing that's still there at 3 AM, still working, when the clever model is down or rate-limited or metering every token. The "babysitting" problem I used to have with local models wasn't really about intelligence. It was about endurance, and a serving stack that didn't fall over. Both of those, finally, are being solved.&lt;/p&gt;




&lt;h3&gt;
  
  
  Verdict
&lt;/h3&gt;

&lt;p&gt;A year ago, "local model" and "marathon agent" did not belong in the same sentence. The hardware was wrong, the serving stack was wrong, and the speed was definitely wrong. The frontier was something you rented by the token, and that was the deal.&lt;/p&gt;

&lt;p&gt;That deal is now negotiable.&lt;/p&gt;

&lt;p&gt;The deal changed because the models got good enough and the serving stack finally got to acceptable levels; and MTP is a good bonus on top of that. The benchmarks back up the headline at the depths where most agentic work actually lives. From a fresh start through 50k tokens, speculative decoding delivers a consistent 1.4x to 2.4x speedup over the autoregressive baseline. That is shy of the "up to 3x" top-line number, but it is a measured, reproducible win on real prose, with a verification step that mathematically guarantees the same output distribution as the target model alone. The drafter does what it claims, the acceptance algorithm holds, and the engine stays honest.&lt;/p&gt;

&lt;p&gt;A few caveats worth naming before the takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The advantage is not flat across the context window.&lt;/strong&gt; MTP shines early; gains narrow as the KV cache grows and the drafter overhead compounds with attention cost. Measure for your own workload before assuming the headline number applies everywhere.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spec decode only touches generation.&lt;/strong&gt; Prefill is a separate problem. For read-heavy agents that re-ingest the same documents, prefix caching matters more than MTP.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Acceptance rate is workload-dependent.&lt;/strong&gt; Random benchmark tokens behaved differently from coherent prose in my tests. One number will not tell you what your stack actually does.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use the giants for sprints.&lt;/strong&gt; When precision on a single hard reasoning step is what you need, the trillion-parameter models still win. That is not changing for a while (Hope I am wrong).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use a local marathon engine for routine tasks.&lt;/strong&gt; Anything that grinds: multi-hour scraping, knowledge-base construction, batch summarization, agent loops with dozens of self-correction steps. The economics flip the moment your task crosses the API quota line.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM + Gemma 4 + MTP is the current sweet spot.&lt;/strong&gt; Not because it beats everything else on IQ, but because it is the first stack where consumer hardware, modern serving infrastructure, and decent generation speed all line up at the same time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decouple Brain and Pilot.&lt;/strong&gt; Keep inference (vLLM) separate from orchestration (Pi, or whatever you reach for). The Brain optimizes tokens per second. The Pilot optimizes getting the job done. Treating them as one thing is the bug behind half the local-agent frustrations I have seen.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The failed auto job that opened this post was not a failure of intelligence. It was a failure of foundation. Now there is a real alternative that fits on consumer hardware and runs without a token quota.&lt;/p&gt;

&lt;p&gt;It is not the smartest model in the world. It is the one that works tirelessly and locally.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gemma</category>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
    </item>
    <item>
      <title>Decoding Bronze Age Paperwork: Modern AI vs. Ancient Assyrian Clay Tablets</title>
      <dc:creator>Ertuğrul Demir</dc:creator>
      <pubDate>Sat, 28 Mar 2026 12:17:54 +0000</pubDate>
      <link>https://dev.to/gde/decoding-bronze-age-paperwork-modern-ai-vs-ancient-assyrian-clay-tablets-5adf</link>
      <guid>https://dev.to/gde/decoding-bronze-age-paperwork-modern-ai-vs-ancient-assyrian-clay-tablets-5adf</guid>
      <description>&lt;p&gt;Four thousand years ago, Assyrian merchants were doing what people have always done: tracking debts, chasing payments, arguing over contracts. They pressed these records into clay tablets. Not sacred texts, not epic poetry. Just the ancient equivalent of office emails.&lt;/p&gt;

&lt;p&gt;Nearly 23,000 of these tablets survive. Half have never been translated — not because they're damaged, but because a few people on Earth can read Old Assyrian.&lt;/p&gt;

&lt;p&gt;When the Deep Past Initiative turned this into a Kaggle competition, build a machine translation system for Old Assyrian cuneiform — I jumped in. The task: take transliterated text (cuneiform signs converted to Latin characters) and produce an English translation.&lt;/p&gt;

&lt;p&gt;The training set? Around 1500 pairs. That's it.&lt;/p&gt;

&lt;p&gt;For context, standard translation models train on millions of sentence pairs. Even research on "low-resource" languages works with tens of thousands. We got fifteen hundred documents and a pat on the back.&lt;/p&gt;

&lt;p&gt;So the question was straightforward: how do you build a translation model when you barely have any data, for a language that no modern tokenizer has ever seen, where every proper noun and number matters because these are legal and financial records?&lt;/p&gt;

&lt;p&gt;What started as "fine-tune a model on some ancient text" turned into a full-stack AI pipeline: Gemini vision for OCR-ing scanned academic books, LLMs for sentence alignment and cross-lingual translation, ByT5 as a byte-level backbone that doesn't choke on cuneiform, Unsloth for efficient LoRA training, and vLLM for fast inference on Kaggle T4s. The results surprised us.&lt;/p&gt;

&lt;p&gt;Let's start with why the obvious approaches don't work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Obvious Approaches Don't Work
&lt;/h2&gt;

&lt;p&gt;The first thing I tried was what everyone tries — throw a pretrained LLM at it. Gemma, Qwen, the usual suspects. Prompt it with some examples, let it translate.&lt;/p&gt;

&lt;p&gt;And honestly? The outputs look pretty good at first glance. Fluent English, reasonable sentence structure, feels like it could be right. But "feels right" is dangerous when you're translating ancient legal documents.&lt;/p&gt;

&lt;p&gt;The problem is hallucination — and not the subtle kind. These models confidently fill in names of merchants, cities, and commodities that simply aren't in the source text. When the transliteration says &lt;code&gt;A-šùr-i-dí&lt;/code&gt; the model might output a completely different name that sounds plausibly Bronze Age. When it hits an unfamiliar trade term, it improvises. For documents where every name, every number, every commodity is the actual information — that's not a minor quality issue, it's the whole problem.&lt;/p&gt;

&lt;p&gt;Ok so what about standard encoder-decoder translation models? Here the issue is more fundamental: tokenization. Modern tokenizers are trained on modern text. Akkadian transliteration is a different universe — hyphenated syllable sequences like &lt;code&gt;a-na&lt;/code&gt;, Sumerian logograms in ALL CAPS like &lt;code&gt;KÙ.BABBAR&lt;/code&gt;, determinatives in curly braces like &lt;code&gt;{d}&lt;/code&gt; and &lt;code&gt;{ki}&lt;/code&gt;, subscript digits encoding phonetic variants like &lt;code&gt;il₅&lt;/code&gt;, and gap markers like &lt;code&gt;&amp;lt;gap&amp;gt;&lt;/code&gt; for broken sections of the physical tablet.&lt;/p&gt;

&lt;p&gt;Feed this into a standard tokenizer and it fragments on every character it hasn't seen. Proper nouns that have never appeared in any pretraining corpus get silently mangled. The &lt;code&gt;&amp;lt;gap&amp;gt;&lt;/code&gt; markers that indicate missing text get treated as noise or special tokens.&lt;/p&gt;

&lt;p&gt;So: decoder-only models hallucinate, standard translation models can't tokenize the input properly. What actually fits this problem?&lt;/p&gt;

&lt;h2&gt;
  
  
  ByT5 — The Right Tool for a Weird Job
&lt;/h2&gt;

&lt;p&gt;One of the best things about Kaggle competitions is the community. People share findings, discuss approaches in the forums, and collectively narrow down what works. Early on, several participants converged on the same answer: ByT5.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjrn2lis6dpfmlfo8i5o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjrn2lis6dpfmlfo8i5o.png" alt="ByT5 architecture" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;small&gt;Image from "ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models" (Xue et al., 2021)&lt;/small&gt;



&lt;p&gt;ByT5 comes from a 2021 Google Research paper — &lt;em&gt;"Towards a Token-Free Future with Pre-trained Byte-to-Byte Models"&lt;/em&gt;. The idea is simple and kind of radical: skip tokenization entirely. Instead of mapping text to a learned vocabulary of subwords, ByT5 operates directly on raw bytes. A standard Transformer, minimal modifications, just processing one byte at a time.&lt;/p&gt;

&lt;p&gt;Why does this matter for our problem? Because every character is valid input by definition. It doesn't matter that &lt;code&gt;A-mur-{d}UTU&lt;/code&gt; has never appeared in any pretraining corpus — ByT5 doesn't need it to. No vocabulary misses, no fragmented tokens, no special handling for curly braces or subscript digits. The model just sees bytes.&lt;/p&gt;

&lt;p&gt;The paper also showed something else that turned out to be critical: byte-level models are significantly more robust to noise. When your source text comes from OCR'd clay tablets with inconsistent transcription conventions across different scholars and decades — that robustness isn't a nice-to-have, it's a requirement.&lt;/p&gt;

&lt;p&gt;Architecture: solved. Now came the harder problem — we had the right model, but nowhere near enough data to train it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Data Problem
&lt;/h2&gt;

&lt;p&gt;With ByT5 as the architecture, the bottleneck shifted entirely to data. And the competition host made the challenges very clear in a public discussion post.&lt;/p&gt;

&lt;p&gt;Two things consistently broke translations more than anything else:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Named entities.&lt;/strong&gt; Personal names, place names, divine names — they're transliterated inconsistently across different editions, they preserve older spelling conventions, and they're completely opaque to the model. In practice, many otherwise reasonable translations failed because a name got mangled, dropped, or hallucinated. The host even prepared an onomasticon (a curated list of attested name spellings) as supplemental data to help with this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transliteration format inconsistency.&lt;/strong&gt; Different corpora encode the same text using different conventions. One participant converted diacritics to ASCII before training (&lt;code&gt;š → sz&lt;/code&gt;, &lt;code&gt;ú → u2&lt;/code&gt;) — a reasonable instinct, but the evaluation data expects diacritics. Collapsing &lt;code&gt;ṣ&lt;/code&gt; into &lt;code&gt;S₂&lt;/code&gt; or &lt;code&gt;š&lt;/code&gt; into &lt;code&gt;SZ&lt;/code&gt; removes distinctions that are semantically meaningful in Akkadian. The rule was clear: normalize &lt;em&gt;toward&lt;/em&gt; the format used in the evaluation set, not away from it.&lt;/p&gt;

&lt;p&gt;On top of that, gap handling was tricky. Damaged sections of tablets are marked with &lt;code&gt;&amp;lt;gap&amp;gt;&lt;/code&gt;, but the training data wasn't perfectly aligned — sometimes a large gap appears in the transliteration but not in the translation, forcing the model to learn misalignment rather than translation. Edge cases like &lt;code&gt;&amp;lt;gap&amp;gt;-A-šùr&lt;/code&gt; (a gap attached to a word) needed to be preserved, not blindly stripped.&lt;/p&gt;

&lt;p&gt;The host's closing point stuck with me: these aren't model architecture problems. They're data problems. And with only ~1,500 training pairs, every one of these issues hits harder because the model sees so few examples to learn from.&lt;/p&gt;

&lt;p&gt;So the path forward was obvious — find more data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding More Data — The AKT Books
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzxq6mxsgy4zhmqc57gxy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzxq6mxsgy4zhmqc57gxy.jpg" alt="AKT 5 Cover" width="250" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The training data had to come from somewhere. The competition hosts pointed the way — they shared scanned PDFs of the AKT series (Anatolian Kültepe Texts), a multi-volume scholarly publication of Old Assyrian tablets from the Kültepe excavations in Turkey. Each volume contains transliterations and translations of tablets. Exactly the domain, exactly the format we needed.&lt;/p&gt;

&lt;p&gt;The catch? These are academic books published between 1990 and the 2020s, by different authors, in different languages. AKT 1, 2, 4, 9a, and 10 are in Turkish. AKT 3 is in German. Each volume has its own layout, its own heading conventions, its own way of marking tablet edges and sections. Different fonts, different editorial styles, different decades of typesetting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft0r6loipcfioqnhvq8y6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft0r6loipcfioqnhvq8y6.png" alt="Example of publication" width="800" height="661"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This isn't structured data you can parse with a script. These are scanned pages of physical books — some crisp, some not — where a tablet's transliteration might start on one page and continue on the next, where scholarly commentary sits right next to the translation text, and where the format changes just enough between volumes that nothing generalizes cleanly.&lt;/p&gt;

&lt;p&gt;But inside these messy PDFs was exactly what we were starving for: hundreds of additional transliteration-translation pairs, many with line-by-line alignment that the original training set didn't have.&lt;/p&gt;

&lt;p&gt;The question was whether I could extract it reliably enough to actually help the model — or whether the noise would make things worse. This is where Gemini's multimodal capabilities came in — specifically its ability to understand page layouts, distinguish between transliteration blocks and commentary, and handle multilingual content out of the box. I decided to build the pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Extraction Pipeline
&lt;/h2&gt;

&lt;p&gt;Building this pipeline was its own mini-project. Each step solved one problem and revealed the next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: PDF → Page Images&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The simplest step — render each PDF page as a numbered PNG. This is the only part that runs purely local. Everything else goes through Gemini.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Page Images → Structured JSON&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each page image gets sent to Gemini's vision model via the Vertex AI Batch API. The flow: build a JSONL of requests (one per page image, referencing GCS URIs), submit to Vertex, parse the predictions back.&lt;/p&gt;

&lt;p&gt;A quick note on why batch inference: when you're processing hundreds of pages and don't need real-time responses, the Batch API is a no-brainer. You get a 50% discount over standard inference, much higher rate limits, and the service handles parallelization and retries for you — typically completing within 24 hours. You submit one job, go do something else, come back to results. For a pipeline like this where I was processing multiple books with hundreds of pages each, it saved both money and sanity.&lt;/p&gt;

&lt;p&gt;The request construction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gcs_uri&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fileData&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mimeType&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fileUri&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;gcs_uri&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt_text&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generationConfig&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;responseMimeType&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mediaResolution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MEDIA_RESOLUTION_HIGH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# needed for diacritics
&lt;/span&gt;                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thinkingConfig&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thinkingLevel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MEDIUM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We used &lt;code&gt;gemini-3.1-flash-lite-preview&lt;/code&gt; with medium thinking enabled — the reasoning step helped significantly with understanding complex page layouts and making correct decisions about where one tablet ends and another begins.&lt;/p&gt;

&lt;p&gt;Submit with the Vertex AI Batch API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;batches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3.1-flash-lite-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gs://your-bucket/book/ocr_batch/requests.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;CreateBatchJobConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;dest&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gs://your-bucket/book/ocr_batch/predictions/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One gotcha that bit me early: &lt;strong&gt;predictions come back shuffled&lt;/strong&gt;. You can't rely on line order in the output — you have to extract the page number from each prediction's original request URI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_page_num&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fileData&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fileUri&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page_(\d+)\.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is actually a feature — it forces you to write robust parsing from the start.&lt;/p&gt;

&lt;p&gt;Every AKT volume needs its own prompt. Different heading formats, different edge markers (&lt;code&gt;Ö.y.&lt;/code&gt;, &lt;code&gt;Ak.&lt;/code&gt; for Turkish volumes; &lt;code&gt;Vs.&lt;/code&gt;, &lt;code&gt;Rs.&lt;/code&gt; for German), different conventions for commentary blocks. Get this wrong and you extract commentary as translation, or merge two tablets into one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: JSON Pages → Tablets CSV&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A book-specific export script aggregates all the per-page JSONs into a flat CSV — one row per tablet with combined transliteration and translation fields. Each volume needs its own exporter because the structure varies enough that a generic one would silently break.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Visual QC&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Dump everything to an HTML file and actually look at it. This is where you spot the real problems: misread headings, commentary leaking into translation fields, duplicate translations from continuation pages. No amount of automated testing replaces eyeballing the output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Cleanup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Book-specific cleanup scripts apply the fixes found during QC — drop bad rows, merge tablets that got split across pages, strip commentary that leaked through. Unglamorous and manual but completely necessary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Sentence Chunking + Translation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbs63lyyuenjkftivfaa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbs63lyyuenjkftivfaa.png" alt="Batch Job Flow" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's where it gets interesting again. The original training data is document-level — full tablet in, full translation out. But the AKT books have something better: line-by-line structure. Each transliteration line has a marker (&lt;code&gt;(Vs.1)&lt;/code&gt;, &lt;code&gt;(2)&lt;/code&gt;, &lt;code&gt;(Rs.14)&lt;/code&gt;) and each translation sentence references those markers.&lt;/p&gt;

&lt;p&gt;A second Gemini batch job handles two things at once: align transliteration lines to translation sentences by marker, and translate the non-English content (Turkish or German) into English. For each tablet, I retrieved the most similar examples from the official training set using TF-IDF cosine similarity and included them as few-shot context. This turned out to be crucial — not just for translation quality, but for matching the distribution of the host's wording, style, and terminology choices. The model wasn't just translating, it was learning to translate &lt;em&gt;the way the competition data expected&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Same batch pattern — build JSONL, submit, parse shuffled predictions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 7: Normalization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most of the invisible work happened here. The competition test set uses a specific character format, and the books don't match it. Every volume has its own OCR artifacts, its own conventions.&lt;/p&gt;

&lt;p&gt;A few examples from the normalization stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ḫ/Ḫ → h/H&lt;/code&gt; (test set uses plain H)&lt;/li&gt;
&lt;li&gt;Unicode subscripts → plain digits (&lt;code&gt;₄ → 4&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Superscript determinatives → brace format (&lt;code&gt;ᵈ → {d}&lt;/code&gt;, &lt;code&gt;ᵏⁱ → {ki}&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;OCR artifacts: &lt;code&gt;KU.BABBAR → KÙ.BABBAR&lt;/code&gt;, &lt;code&gt;ś → š&lt;/code&gt;, &lt;code&gt;ş → ṣ&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Gap deduplication: &lt;code&gt;&amp;lt;gap&amp;gt; &amp;lt;gap&amp;gt; → &amp;lt;gap&amp;gt;&lt;/code&gt;, while preserving attachments like &lt;code&gt;&amp;lt;gap&amp;gt;-A-šùr&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a character-level model like ByT5, this isn't cosmetic. A single character mismatch between training and test — &lt;code&gt;ḫ&lt;/code&gt; vs &lt;code&gt;h&lt;/code&gt;, &lt;code&gt;₄&lt;/code&gt; vs &lt;code&gt;4&lt;/code&gt; — is invisible to a human reviewer and catastrophic to a model that has learned exactly one representation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 8: Merge&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The final step pulls normalized chunks into the main training set. Starting from ~1,500 pairs, the pipeline roughly multiplied our available training data — and more importantly, added sentence-level pairs that gave the model a much finer-grained learning signal than document-level translations alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Training — ByT5 Gets You Far, Then Stops
&lt;/h2&gt;

&lt;p&gt;With the expanded dataset ready, training ByT5 was straightforward — standard seq2seq encoder-decoder training using HuggingFace Transformers. No tricks, no exotic schedulers. The model picked up patterns fast and translated training-domain tablets surprisingly well.&lt;/p&gt;

&lt;p&gt;But then the leaderboard scores started telling a different story.&lt;/p&gt;

&lt;p&gt;In our case, the hidden test set on Kaggle seemed to have a different distribution than what we trained on. Our best guess: different books, different topics, different translator styles, unfamiliar names and locations. Our ByT5 was doing well on what it had seen directly in training, but the leaderboard scores suggested it wasn't generalizing beyond that.&lt;/p&gt;

&lt;p&gt;We hit a ceiling. Many teams went on to have great success pushing ByT5 further — better augmentation, longer training, smarter tricks I guess. But in our setup, the gains had stalled, and we decided to explore a different direction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Back to Decoder-Only — But This Time, Fine-Tuned
&lt;/h2&gt;

&lt;p&gt;This is where the story comes full circle. Earlier, we'd dismissed decoder-only LLMs because they hallucinate. That's still true — out of the box. But fine-tuning changes the picture completely.&lt;/p&gt;

&lt;p&gt;The reasoning was simple: ByT5 and Qwen were solving different problems. ByT5 was a great fit for the transliteration itself — every character mattered, and byte-level modeling let it handle weird orthography, diacritics, subscripts, and determinatives without fighting the tokenizer. But once the task became generalization across unfamiliar tablets, translator styles, and topic shifts, Qwen3.5 had something ByT5 didn't: much stronger pretrained language knowledge.&lt;/p&gt;

&lt;p&gt;Out of the box, that strength was useless because it came with hallucination. Fine-tuning changed that. LoRA gave us a way to keep the model's broader language ability while grounding it in the task and the dataset. Instead of prompting a general-purpose model and hoping for the best, we trained a lightweight adapter on our curated examples. Combined with few-shot prompting to match the host's translation style, the fine-tuned Qwen handled the distribution shifts that our ByT5 couldn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fine-Tuning with Unsloth — Making LLMs Affordable
&lt;/h2&gt;

&lt;p&gt;Before diving into the training details, a quick primer for anyone who hasn't fine-tuned a model before.&lt;/p&gt;

&lt;p&gt;The naive approach to fine-tuning a large language model means updating all its parameters — billions of them. That requires serious hardware, serious memory, and serious money. For a Kaggle competition where you're iterating fast on limited GPUs, it's a non-starter.&lt;/p&gt;

&lt;p&gt;This is where LoRA (Low-Rank Adaptation) comes in. Instead of updating the entire model, you freeze the original weights and train a small set of adapter matrices on top. You get most of the benefits of full fine-tuning at a fraction of the cost. QLoRA takes it a step further by quantizing the base model to 4-bit precision, which dramatically cuts memory usage — making it possible to fine-tune models that would otherwise never fit on a single GPU.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma6se1jr7jilhm9b8wik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma6se1jr7jilhm9b8wik.png" alt="Unsloth" width="225" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For this project we used Unsloth, which makes the whole process surprisingly painless. It handles the LoRA/QLoRA setup, optimizes training to run ~2x faster with ~70% less VRAM, and supports a wide range of models out of the box — including Qwen3.5, which is what we needed.&lt;/p&gt;

&lt;p&gt;The training itself was SFT (Supervised Fine-Tuning) using Unsloth's built-in SFT trainer. We structured our data as chat conversations: a system prompt setting the role of an expert Assyriologist, few-shot examples retrieved via TF-IDF similarity, and the target tablet as the final user message. The model only learns from the assistant completion — the actual translation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# each training example looks like this
&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are an expert Assyriologist...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;# few-shot examples from similar tablets
&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Translate: um-ma A-šùr-i-dí-ma ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Thus says Aššur-idī: ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Translate: um-ma Pu-šu-ki-in-ma ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Thus says Pūšu-kēn: ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;# the actual tablet to translate
&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Translate: a-na A-lim {ki} ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;To the City: ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  &lt;span class="c1"&gt;# model learns this
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An important detail here: we used completion-only masking. The loss is computed only on the assistant's translation tokens — the prompt tokens (system message, few-shot examples, user messages) are masked out during training. This means the model isn't wasting capacity learning to predict the input; it's focused entirely on producing accurate translations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gj9rl0yko1xfy4srvhf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gj9rl0yko1xfy4srvhf.png" alt="Completion masking: prompt tokens are masked in the labels, loss is only computed on the completion tokens" width="800" height="186"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This meant the model wasn't just learning to translate — it was learning to translate &lt;em&gt;in context&lt;/em&gt;, grounded by similar examples. The same retrieval and prompt structure would be used at inference time, so there was no gap between how the model trained and how it would be evaluated.&lt;/p&gt;

&lt;p&gt;One direction we started exploring but ran out of time for: reinforcement learning on top of the fine-tuned model. The idea was to use GRPO (Group Relative Policy Optimization) with custom reward functions — combining the competition metric itself, gap alignment between transliteration and translation, and length balance — to push the model beyond what SFT alone could achieve. Each reward would target a specific failure mode that supervised training couldn't address directly. We didn't get there before the deadline, but it felt like the natural next step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inference — vLLM on Kaggle T4s
&lt;/h2&gt;

&lt;p&gt;With a fine-tuned model ready, the next challenge was actually running it within Kaggle's competition constraints. This is a code competition — no internet access at submission time, two T4 GPUs with 16GB VRAM each, and a strict time limit.&lt;/p&gt;

&lt;p&gt;A quick intro on vLLM for those unfamiliar: it's an open-source inference engine originally developed at UC Berkeley that's become the go-to for serving LLMs efficiently. The key innovation is PagedAttention — instead of pre-allocating a fixed block of memory for each sequence's key-value cache, it pages the KV cache dynamically, similar to how operating systems manage virtual memory. This means you can serve larger models on less hardware. On top of that you get continuous batching, optimized CUDA kernels, tensor parallelism, and seamless HuggingFace model support out of the box.&lt;/p&gt;

&lt;p&gt;Sounds perfect, right? In theory. In practice, we hit a wall.&lt;/p&gt;

&lt;p&gt;Qwen3.5 was released in the final weeks of the competition. The model was brand new — vLLM support was experimental and unstable. On top of that, Kaggle's T4 GPUs have compute capability 7.5, which means no FlashAttention 2 support. We had to fall back to Triton attention backend, wrestle with environment compatibility issues, and work around the fact that you can't pip install anything at submission time — every dependency needs to be pre-packaged in your dataset.&lt;/p&gt;

&lt;p&gt;Getting a 9B parameter model to load, run, and generate translations on two T4s without crashing was its own mini-project. Tensor parallelism across both GPUs was non-negotiable — the model simply wouldn't fit on a single card.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL_PATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float16&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_model_len&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;gpu_memory_utilization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;enforce_eager&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;# no CUDA graphs on T4
&lt;/span&gt;    &lt;span class="n"&gt;tensor_parallel_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# split across both T4s
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The inference prompt mirrors the training setup exactly — same system prompt, same TF-IDF few-shot retrieval. For each test tablet, we retrieve the 5 most similar examples from our training data and include them as conversation context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nf"&gt;build_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;transliteration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transliteration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;few_shot_examples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transliteration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;test_rows&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sampling_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sampling_params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keeping the inference pipeline identical to training — same prompt structure, same retrieval, same style anchoring — meant the model was seeing exactly the kind of input it was trained on. No distribution shift at inference time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results and Reflections
&lt;/h2&gt;

&lt;p&gt;Our team finished with a silver medal out of 2500+ teams. In the final days of the competition, the OCR extraction pipeline was still producing new data — each batch of cleaned and normalized tablets pushed our scores higher. We genuinely felt like gold was within reach with a couple more days. That stings a bit, but honestly? The journey was worth more than the medal.&lt;/p&gt;

&lt;p&gt;Here's what I'm taking away from this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini's batch inference is a superpower for unstructured data.&lt;/strong&gt; We used it to turn scanned academic books from the 1990s — messy layouts, multiple languages, inconsistent formatting — into clean, structured training data. If it works for 4,000-year-old Assyrian tablets in Turkish and German PDFs, it'll work for your use case too. The Vertex AI Batch API made it affordable and painless at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Few-shot retrieval is still easy gains.&lt;/strong&gt; TF-IDF character n-gram similarity is dead simple to implement, and using retrieved examples to anchor both training and inference gave us consistent improvements with minimal effort. Small iterations, big returns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-tuning is more accessible than you think.&lt;/strong&gt; LoRA + Unsloth meant we could train a 9B parameter model on Kaggle's free GPUs. You don't need a cluster. You need good data and the right tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vLLM makes deployment practical.&lt;/strong&gt; Even on constrained hardware like Kaggle T4s, with a brand-new model and no internet access, we got a 9B model running with tensor parallelism. The ecosystem is maturing fast.&lt;/p&gt;

&lt;p&gt;And the bigger picture — the one that got me into this competition in the first place — is that there are still thousands of untranslated tablets sitting in museums. The pipeline we built here isn't a one-off competition hack. It's a blueprint: scan the books, extract the data, train the models, translate the tablets. The tools already exist. The data is already out there. At this point, the bottleneck is no longer whether this can be done. It's whether someone is willing to do it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>kaggle</category>
      <category>gemini</category>
      <category>vertexai</category>
    </item>
    <item>
      <title>Skills, Not Vibes: Teaching AI Agents to Write Clean Code</title>
      <dc:creator>Ertuğrul Demir</dc:creator>
      <pubDate>Mon, 26 Jan 2026 11:17:47 +0000</pubDate>
      <link>https://dev.to/gde/skills-not-vibes-teaching-ai-agents-to-write-clean-code-3l9e</link>
      <guid>https://dev.to/gde/skills-not-vibes-teaching-ai-agents-to-write-clean-code-3l9e</guid>
      <description>&lt;p&gt;In February 2025, Andrej Karpathy coined "vibe coding" to describe programming's new reality: give in to the vibes, accept all changes, "forget that the code even exists." He called it "not too bad for throwaway weekend projects." But for production systems? That's where the trouble starts.&lt;/p&gt;

&lt;p&gt;I've watched AI-generated codebases accumulate the same mess developers spent decades learning to avoid—duplication everywhere, inconsistent naming, missing edge cases. Then it hit me: these are exactly the problems Robert C. Martin warned about in &lt;em&gt;Clean Code&lt;/em&gt; almost two decades ago.&lt;/p&gt;

&lt;p&gt;So I went back to the book, specifically Chapter 17's catalog of 66 code smells and heuristics. These aren't just relevant to AI coding—they're &lt;em&gt;more&lt;/em&gt; relevant. AI makes exactly the mistakes Uncle Bob warned us about, just faster and at scale.&lt;/p&gt;

&lt;p&gt;The solution? &lt;strong&gt;Skills&lt;/strong&gt;—instruction files that AI agents read before writing code. I've translated Clean Code's complete catalog into Python skills you can use today. They work in Google's Antigravity IDE, Anthropic's Claude Code, and anywhere that supports the Agent Skills standard.&lt;/p&gt;

&lt;p&gt;Let me show you why we need this, and how to implement it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Even Linus Torvalds Vibe Codes (Sometimes)
&lt;/h2&gt;

&lt;p&gt;In January 2026, Linus Torvalds revealed a side project called &lt;a href="https://github.com/torvalds/AudioNoise" rel="noopener noreferrer"&gt;AudioNoise&lt;/a&gt;—a digital audio effects simulator he'd been tinkering with over the holidays. The Python visualizer, he noted, was "basically written by vibe-coding."&lt;/p&gt;

&lt;p&gt;In his own words from the repo:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I know more about analog filters—and that's not saying much—than I do about python. It started out as my typical 'google and do the monkey-see-monkey-do' kind of programming, but then I cut out the middle-man—me—and just used Google Antigravity to do the audio sample visualizer."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The Hacker News discussion revealed two camps. Some saw it as validation: "It's official, vibe coding is legit." Others noted the crucial context: Torvalds used AI for the part he lacks expertise in (Python visualization) while hand-coding the parts he knows (C and digital signal processing).&lt;/p&gt;

&lt;p&gt;One commenter nailed it: "There's a big difference between vibe-coding an entire project and having an AI build a component that you lack competency for."&lt;/p&gt;

&lt;p&gt;Another observation cut deeper: "If anyone on the planet knows how to do vibe coding right, it's him"—because Torvalds spent decades mastering code review. He can spot bad code instantly. Most of us can't.&lt;/p&gt;

&lt;p&gt;But here's what's telling: Torvalds wrote tests for his hand-coded C—numerical accuracy checks for the DSP primitives he understands. The vibe-coded Python visualizer? &lt;strong&gt;No tests, no type hints, and a duplicated function definition that slipped right through.&lt;/strong&gt; The same four-line method appears twice in a row—the first an empty stub, the second the real implementation. It's textbook "Accept All, don't read the diffs." The code runs fine (Python silently overwrites the first definition), but it's exactly the kind of dead code that accumulates into maintenance nightmares.&lt;/p&gt;

&lt;p&gt;This works for Torvalds' toy project precisely. It's a throwaway learning exercise. The moment that visualizer needs to be production code, those missing guardrails become technical debt.&lt;/p&gt;

&lt;p&gt;The same week, Torvalds rejected "AI slop" submissions to the Linux kernel, arguing that documentation telling people not to submit garbage won't help because "the people who would submit it won't read the documentation anyway."&lt;/p&gt;

&lt;p&gt;The lesson isn't that vibe coding is bad. It's that context matters. Skills let you define when to enforce rigor and when to let the vibes flow.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Data: AI Code Quality Is Getting Worse
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://cloud.google.com/resources/content/2025-dora-ai-assisted-software-development-report" rel="noopener noreferrer"&gt;Google's DORA Report&lt;/a&gt;&lt;/strong&gt;  found AI adoption shows a negative relationship with software delivery stability. The 2025 report's central finding: "AI doesn't fix a team; it amplifies what's already there." Without robust control systems—strong testing, mature practices, fast feedback loops—increased AI-generated code leads to instability. Skills are exactly those control systems, encoded as instructions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://arxiv.org/abs/2511.04427" rel="noopener noreferrer"&gt;Carnegie Mellon researchers&lt;/a&gt;&lt;/strong&gt; analyzed 807 GitHub repositories after Cursor adoption: +30% static analysis warnings, +41% code complexity. The speed gains were transient; the quality problems compounded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.gitclear.com/ai_assistant_code_quality_2025_research" rel="noopener noreferrer"&gt;GitClear's&lt;/a&gt;&lt;/strong&gt; analysis of 211 million lines of code from Google, Microsoft, Meta, and enterprise repositories found code duplication increased &lt;strong&gt;4x&lt;/strong&gt; with AI adoption. For the first time in their dataset, copy/pasted code exceeded refactored code.&lt;/p&gt;

&lt;p&gt;Even &lt;strong&gt;&lt;a href="https://claude.com/blog/eight-trends-defining-how-software-gets-built-in-2026" rel="noopener noreferrer"&gt;Anthropic's Agentic Coding Trends Report&lt;/a&gt;&lt;/strong&gt; shows the gap: developers use AI in roughly 60% of their work, but can fully delegate only 0-20% of tasks. The rest requires "thoughtful setup, active supervision, and human judgment."&lt;/p&gt;

&lt;p&gt;That gap—between what AI touches and what AI can own—is exactly what skills address. The setup &lt;em&gt;is&lt;/em&gt; the skill. The supervision &lt;em&gt;is&lt;/em&gt; the rules.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Pattern: AI Recreates Classic Code Smells
&lt;/h3&gt;

&lt;p&gt;The research consistently identifies the same failure patterns. Here's how they map to specific Clean Code violations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Naming and Consistency Problems&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inconsistent variable names across similar functions&lt;/li&gt;
&lt;li&gt;Vague names like &lt;code&gt;data&lt;/code&gt;, &lt;code&gt;tmp&lt;/code&gt;, &lt;code&gt;proc&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Mixing naming conventions (camelCase and snake_case)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Clean Code rules: N1 (descriptive names), G11 (consistency), G24 (conventions)&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Code Duplication&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Copy/paste instead of extracting shared logic&lt;/li&gt;
&lt;li&gt;Same calculation appearing in multiple places&lt;/li&gt;
&lt;li&gt;Pattern repetition that should be abstracted&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Clean Code rule: G5 (DRY - Don't Repeat Yourself)&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Missing Safety Checks&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No validation of input boundaries&lt;/li&gt;
&lt;li&gt;Assumptions about data structure without verification&lt;/li&gt;
&lt;li&gt;Missing null/None checks&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Clean Code rules: G3 (boundary conditions), G4 (don't override safeties), G26 (be precise)&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Readability Issues&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Magic numbers without explanation (what does 86400 mean?)&lt;/li&gt;
&lt;li&gt;Unused variables cluttering code&lt;/li&gt;
&lt;li&gt;Functions mixing multiple abstraction levels&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Clean Code rules: G12 (remove clutter), G16 (no obscured intent), G34 (single abstraction level)&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Performance Problems&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Functions doing multiple things at once&lt;/li&gt;
&lt;li&gt;Exposing internal data unnecessarily&lt;/li&gt;
&lt;li&gt;Nested loops that could be optimized&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Clean Code rules: G8 (minimize public interface), G30 (functions do one thing)&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't arbitrary style preferences—they're the exact problems that make code hard to maintain, debug, and extend. The skills we'll build enforce these rules automatically.&lt;/p&gt;

&lt;p&gt;The fix isn't to stop using AI. It's to give AI the explicit rules it needs to follow.&lt;/p&gt;

&lt;p&gt;That's what skills do.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Are Skills?
&lt;/h2&gt;

&lt;p&gt;Skills are markdown files containing domain-specific instructions that AI agents read before working on your code. They follow the &lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;Agent Skills&lt;/a&gt; open standard and work in Google Antigravity, Anthropic's Claude Code, and other compatible agents.&lt;/p&gt;

&lt;p&gt;The architecture is called &lt;strong&gt;Progressive Disclosure&lt;/strong&gt;. Instead of dumping every instruction into the agent's context at once (causing what Antigravity's docs call "Context Saturation"), skills work in layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Discovery&lt;/strong&gt;: The agent sees only a lightweight menu of skill names and descriptions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Activation&lt;/strong&gt;: When your request matches a skill's description, the full instructions load&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution&lt;/strong&gt;: Scripts and templates are read only when the task requires them&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This keeps the agent fast and focused. It's not thinking about database migrations when you're writing a React component.&lt;/p&gt;

&lt;p&gt;The format is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;skill-name&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;When this skill should activate&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="c1"&gt;# Skill Title&lt;/span&gt;

&lt;span class="s"&gt;Your instructions, examples, and rules here.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The &lt;code&gt;description&lt;/code&gt; field is crucial—it's the trigger phrase. The agent semantically matches your request against all available skill descriptions to decide which ones to load. "Enforces function best practices" is vague. "Use when writing or refactoring Python functions" tells the agent exactly when to activate.&lt;/p&gt;

&lt;p&gt;Skills can do far more than enforce coding standards—the community has built skills for Stripe integration, Metasploit security testing, voice agents, and even multi-agent startup automation. This article focuses on one specific use case: encoding Clean Code principles.&lt;/p&gt;

&lt;p&gt;Let me show you how to translate Clean Code's catalog into working skills.&lt;/p&gt;


&lt;h2&gt;
  
  
  Building the Skills: Three Examples
&lt;/h2&gt;

&lt;p&gt;Rather than catalog all 66 rules exhaustively, I'll show you three critical categories in detail. The complete implementation is at the end.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Comments (C1-C5): Code Should Explain Itself
&lt;/h3&gt;

&lt;p&gt;Uncle Bob is famously skeptical of comments—not because documentation is bad, but because comments rot faster than code updates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File Reference: &lt;code&gt;clean-comments/SKILL.md&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;clean-comments&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Use when writing, fixing, editing, or reviewing Python comments and docstrings. Enforces Clean Code principles—no metadata, no redundancy, no commented-out code.&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gh"&gt;# Clean Comments&lt;/span&gt;

&lt;span class="gu"&gt;## C1: No Inappropriate Information&lt;/span&gt;

Comments shouldn't hold metadata. Use Git for author names, change history, 
ticket numbers, and dates. Comments are for technical notes about code only.

&lt;span class="gu"&gt;## C2: Delete Obsolete Comments&lt;/span&gt;

If a comment describes code that no longer exists or works differently, 
delete it immediately. Stale comments become "floating islands of 
irrelevance and misdirection."

&lt;span class="gu"&gt;## C3: No Redundant Comments&lt;/span&gt;

&lt;span class="gh"&gt;# Bad - the code already says this&lt;/span&gt;
i += 1  # increment i
user.save()  # save the user

&lt;span class="gh"&gt;# Good - explains WHY, not WHAT&lt;/span&gt;
i += 1  # compensate for zero-indexing in display

&lt;span class="gu"&gt;## C4: Write Comments Well&lt;/span&gt;

If a comment is worth writing, write it well:
&lt;span class="p"&gt;-&lt;/span&gt; Choose words carefully
&lt;span class="p"&gt;-&lt;/span&gt; Use correct grammar
&lt;span class="p"&gt;-&lt;/span&gt; Don't ramble or state the obvious
&lt;span class="p"&gt;-&lt;/span&gt; Be brief

&lt;span class="gu"&gt;## C5: Never Commit Commented-Out Code&lt;/span&gt;

&lt;span class="gh"&gt;# DELETE THIS - it's an abomination&lt;/span&gt;
&lt;span class="gh"&gt;# def old_calculate_tax(income):&lt;/span&gt;
&lt;span class="gh"&gt;#     return income * 0.15&lt;/span&gt;

Who knows how old it is? Who knows if it's meaningful? Delete it. 
Git remembers everything.

&lt;span class="gu"&gt;## The Goal&lt;/span&gt;

The best comment is the code itself. If you need a comment to explain 
what code does, refactor first, comment last.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  2. Functions (F1-F4): Small, Focused, Obvious
&lt;/h3&gt;

&lt;p&gt;Functions should do one thing, do it well, and have an obvious purpose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File Reference: &lt;code&gt;clean-functions/SKILL.md&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;clean-functions&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Use when writing or refactoring Python functions. Enforces Clean Code principles—maximum 3 arguments, single responsibility, no flag parameters.&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gh"&gt;# Clean Functions&lt;/span&gt;

&lt;span class="gu"&gt;## F1: Too Many Arguments (Maximum 3)&lt;/span&gt;

&lt;span class="gh"&gt;# Bad - too many parameters&lt;/span&gt;
def create_user(name, email, age, country, timezone, language, newsletter):
    ...

&lt;span class="gh"&gt;# Good - use a dataclass or dict&lt;/span&gt;
@dataclass
class UserData:
    name: str
    email: str
    age: int
    country: str
    timezone: str
    language: str
    newsletter: bool

def create_user(data: UserData):
    ...

More than 3 arguments means your function is doing too much or needs 
a data structure.

&lt;span class="gu"&gt;## F2: No Output Arguments&lt;/span&gt;

Don't modify arguments as side effects. Return values instead.

&lt;span class="gh"&gt;# Bad - modifies argument&lt;/span&gt;
def append_footer(report: Report) -&amp;gt; None:
    report.append("&lt;span class="se"&gt;\n&lt;/span&gt;---&lt;span class="se"&gt;\n&lt;/span&gt;Generated by System")

&lt;span class="gh"&gt;# Good - returns new value&lt;/span&gt;
def with_footer(report: Report) -&amp;gt; Report:
    return report + "&lt;span class="se"&gt;\n&lt;/span&gt;---&lt;span class="se"&gt;\n&lt;/span&gt;Generated by System"

&lt;span class="gu"&gt;## F3: No Flag Arguments&lt;/span&gt;

Boolean flags mean your function does at least two things.

&lt;span class="gh"&gt;# Bad - function does two different things&lt;/span&gt;
def render(is_test: bool):
    if is_test:
        render_test_page()
    else:
        render_production_page()

&lt;span class="gh"&gt;# Good - split into two functions&lt;/span&gt;
def render_test_page(): ...
def render_production_page(): ...

&lt;span class="gu"&gt;## F4: Delete Dead Functions&lt;/span&gt;

If it's not called, delete it. No "just in case" code. Git preserves history.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  3. General Principles (G1-G36): The Core Rules
&lt;/h3&gt;

&lt;p&gt;These are the fundamental patterns that separate clean code from legacy nightmares.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File Reference: &lt;code&gt;clean-general/SKILL.md&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;clean-general&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Use when reviewing Python code quality. Enforces Clean Code's core principles—DRY, single responsibility, clear intent, no magic numbers, proper abstractions.&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gh"&gt;# General Clean Code Principles&lt;/span&gt;

&lt;span class="gu"&gt;## Critical Rules&lt;/span&gt;

&lt;span class="gs"&gt;**G5: DRY (Don't Repeat Yourself)**&lt;/span&gt;

Every piece of knowledge has one authoritative representation.

&lt;span class="gh"&gt;# Bad - duplication&lt;/span&gt;
tax_rate = 0.0825
ca_total = subtotal &lt;span class="err"&gt;*&lt;/span&gt; 1.0825
ny_total = subtotal &lt;span class="err"&gt;*&lt;/span&gt; 1.07

&lt;span class="gh"&gt;# Good - single source of truth&lt;/span&gt;
TAX_RATES = {"CA": 0.0825, "NY": 0.07}
def calculate_total(subtotal: float, state: str) -&amp;gt; float:
    return subtotal &lt;span class="err"&gt;*&lt;/span&gt; (1 + TAX_RATES[state])

&lt;span class="gs"&gt;**G16: No Obscured Intent**&lt;/span&gt;

Don't be clever. Be clear.

&lt;span class="gh"&gt;# Bad - what does this do?&lt;/span&gt;
return (x &amp;amp; 0x0F) &amp;lt;&amp;lt; 4 | (y &amp;amp; 0x0F)

&lt;span class="gh"&gt;# Good - obvious intent&lt;/span&gt;
return pack_coordinates(x, y)

&lt;span class="gs"&gt;**G23: Prefer Polymorphism to If/Else**&lt;/span&gt;

&lt;span class="gh"&gt;# Bad - will grow forever&lt;/span&gt;
def calculate_pay(employee):
    if employee.type == "SALARIED":
        return employee.salary
    elif employee.type == "HOURLY":
        return employee.hours &lt;span class="err"&gt;*&lt;/span&gt; employee.rate
    elif employee.type == "COMMISSIONED":
        return employee.base + employee.commission

&lt;span class="gh"&gt;# Good - open/closed principle&lt;/span&gt;
class SalariedEmployee:
    def calculate_pay(self): return self.salary

class HourlyEmployee:
    def calculate_pay(self): return self.hours &lt;span class="err"&gt;*&lt;/span&gt; self.rate

class CommissionedEmployee:
    def calculate_pay(self): return self.base + self.commission

&lt;span class="gs"&gt;**G25: Replace Magic Numbers with Named Constants**&lt;/span&gt;

&lt;span class="gh"&gt;# Bad&lt;/span&gt;
if elapsed_time &amp;gt; 86400:
    ...

&lt;span class="gh"&gt;# Good&lt;/span&gt;
SECONDS_PER_DAY = 86400
if elapsed_time &amp;gt; SECONDS_PER_DAY:
    ...

&lt;span class="gs"&gt;**G30: Functions Should Do One Thing**&lt;/span&gt;

If you can extract another function, your function does more than one thing.

&lt;span class="gs"&gt;**G36: Law of Demeter (Avoid Train Wrecks)**&lt;/span&gt;

&lt;span class="gh"&gt;# Bad - reaching through multiple objects&lt;/span&gt;
output_dir = context.options.scratch_dir.absolute_path

&lt;span class="gh"&gt;# Good - one dot&lt;/span&gt;
output_dir = context.get_scratch_dir()

&lt;span class="gu"&gt;## Enforcement Checklist&lt;/span&gt;

When reviewing AI-generated code, verify:
&lt;span class="p"&gt;-&lt;/span&gt; [ ] No duplication (G5)
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Clear intent, no magic numbers (G16, G25)
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Polymorphism over conditionals (G23)
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Functions do one thing (G30)
&lt;span class="p"&gt;-&lt;/span&gt; [ ] No Law of Demeter violations (G36)
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Boundary conditions handled (G3)
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Dead code removed (G9)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Complete Catalog
&lt;/h3&gt;

&lt;p&gt;I've translated all 66 rules from Clean Code Chapter 17 into skills covering six categories:&lt;/p&gt;

&lt;p&gt;
  Click to expand all skill categories
  &lt;p&gt;&lt;strong&gt;Comments (C1-C5)&lt;/strong&gt;: Minimal, accurate commenting&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;C1: No inappropriate information (metadata belongs in version control)&lt;/li&gt;
&lt;li&gt;C2: Delete obsolete comments immediately&lt;/li&gt;
&lt;li&gt;C3: No redundant comments that repeat the code&lt;/li&gt;
&lt;li&gt;C4: Write comments well—brief, grammatical, purposeful&lt;/li&gt;
&lt;li&gt;C5: Never commit commented-out code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Environment (E1-E2)&lt;/strong&gt;: One-command build and test&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;E1: Build requires only one step&lt;/li&gt;
&lt;li&gt;E2: Tests require only one step&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Functions (F1-F4)&lt;/strong&gt;: Small, focused, obvious&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;F1: Maximum 3 arguments (use data structures for more)&lt;/li&gt;
&lt;li&gt;F2: No output arguments (return values instead)&lt;/li&gt;
&lt;li&gt;F3: No flag arguments (split into separate functions)&lt;/li&gt;
&lt;li&gt;F4: Delete dead functions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;General (G1-G36)&lt;/strong&gt;: Core principles&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;G1: Multiple languages in one source file&lt;/li&gt;
&lt;li&gt;G2: Obvious behavior is unimplemented&lt;/li&gt;
&lt;li&gt;G3: Incorrect behavior at the boundaries&lt;/li&gt;
&lt;li&gt;G4: Overridden safeties&lt;/li&gt;
&lt;li&gt;G5: Duplication&lt;/li&gt;
&lt;li&gt;G6: Code at wrong level of abstraction&lt;/li&gt;
&lt;li&gt;G7: Base classes depending on their derivatives&lt;/li&gt;
&lt;li&gt;G8: Too much information&lt;/li&gt;
&lt;li&gt;G9: Dead code&lt;/li&gt;
&lt;li&gt;G10: Vertical separation&lt;/li&gt;
&lt;li&gt;G11: Inconsistency&lt;/li&gt;
&lt;li&gt;G12: Clutter&lt;/li&gt;
&lt;li&gt;G13: Artificial coupling&lt;/li&gt;
&lt;li&gt;G14: Feature envy&lt;/li&gt;
&lt;li&gt;G15: Selector arguments&lt;/li&gt;
&lt;li&gt;G16: Obscured intent&lt;/li&gt;
&lt;li&gt;G17: Misplaced responsibility&lt;/li&gt;
&lt;li&gt;G18: Inappropriate static&lt;/li&gt;
&lt;li&gt;G19: Use explanatory variables&lt;/li&gt;
&lt;li&gt;G20: Function names should say what they do&lt;/li&gt;
&lt;li&gt;G21: Understand the algorithm&lt;/li&gt;
&lt;li&gt;G22: Make logical dependencies physical&lt;/li&gt;
&lt;li&gt;G23: Prefer polymorphism to if/else or switch/case&lt;/li&gt;
&lt;li&gt;G24: Follow standard conventions&lt;/li&gt;
&lt;li&gt;G25: Replace magic numbers with named constants&lt;/li&gt;
&lt;li&gt;G26: Be precise&lt;/li&gt;
&lt;li&gt;G27: Structure over convention&lt;/li&gt;
&lt;li&gt;G28: Encapsulate conditionals&lt;/li&gt;
&lt;li&gt;G29: Avoid negative conditionals&lt;/li&gt;
&lt;li&gt;G30: Functions should do one thing&lt;/li&gt;
&lt;li&gt;G31: Hidden temporal couplings&lt;/li&gt;
&lt;li&gt;G32: Don't be arbitrary&lt;/li&gt;
&lt;li&gt;G33: Encapsulate boundary conditions&lt;/li&gt;
&lt;li&gt;G34: Functions should descend only one level of abstraction&lt;/li&gt;
&lt;li&gt;G35: Keep configurable data at high levels&lt;/li&gt;
&lt;li&gt;G36: Avoid transitive navigation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Names (N1-N7)&lt;/strong&gt;: Descriptive, unambiguous, right-sized&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;N1: Choose descriptive names&lt;/li&gt;
&lt;li&gt;N2: Choose names at the right abstraction level&lt;/li&gt;
&lt;li&gt;N3: Use standard nomenclature where possible&lt;/li&gt;
&lt;li&gt;N4: Use unambiguous names&lt;/li&gt;
&lt;li&gt;N5: Use long names for long scopes&lt;/li&gt;
&lt;li&gt;N6: Avoid encodings (Hungarian notation, etc.)&lt;/li&gt;
&lt;li&gt;N7: Names should describe side effects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tests (T1-T9)&lt;/strong&gt;: Fast, independent, exhaustive&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;T1: Insufficient tests—test everything that could break&lt;/li&gt;
&lt;li&gt;T2: Use a coverage tool&lt;/li&gt;
&lt;li&gt;T3: Don't skip trivial tests&lt;/li&gt;
&lt;li&gt;T4: Ignored tests indicate ambiguity&lt;/li&gt;
&lt;li&gt;T5: Test boundary conditions&lt;/li&gt;
&lt;li&gt;T6: Exhaustively test near bugs&lt;/li&gt;
&lt;li&gt;T7: Patterns of failure are diagnostic&lt;/li&gt;
&lt;li&gt;T8: Coverage patterns can be revealing&lt;/li&gt;
&lt;li&gt;T9: Tests should be fast&lt;/li&gt;
&lt;/ul&gt;



&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Get the complete skill files:&lt;/strong&gt;&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/ertugrul-dmr" rel="noopener noreferrer"&gt;
        ertugrul-dmr
      &lt;/a&gt; / &lt;a href="https://github.com/ertugrul-dmr/clean-code-skills" rel="noopener noreferrer"&gt;
        clean-code-skills
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Clean Code Skills for AI Agents&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a href="https://agentskills.io" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/2bad303febd09cbe378475a843a53a6edf564fbe547636be2bb815d8835c7e1e/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4167656e74253230536b696c6c732d436f6d70617469626c652d626c7565" alt="Agent Skills"&gt;&lt;/a&gt;
&lt;a href="https://github.com/ertugrul-dmr/clean-code-skills/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/784362b26e4b3546254f1893e778ba64616e362bd6ac791991d2c9e880a3a64e/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d677265656e2e737667" alt="License: MIT"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Teach your AI to write code that doesn't suck.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This repository contains &lt;a href="https://agentskills.io" rel="nofollow noopener noreferrer"&gt;Agent Skills&lt;/a&gt; that enforce Robert C. Martin's &lt;em&gt;Clean Code&lt;/em&gt; principles. They work with Google Antigravity, Anthropic's Claude Code, and any agent that supports the Agent Skills standard.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Why?&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;AI generates code fast, but research shows it also generates technical debt fast:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitClear&lt;/strong&gt;: 4x increase in code duplication with AI adoption&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Carnegie Mellon&lt;/strong&gt;: +30% static analysis warnings, +41% code complexity after Cursor adoption&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google DORA&lt;/strong&gt;: Negative relationship between AI adoption and software delivery stability&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These skills encode battle-tested solutions to exactly these problems—directly into your AI workflow.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;What's Included&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Track&lt;/th&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Rules&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;&lt;code&gt;boy-scout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Orchestrator&lt;/strong&gt;—always leave code cleaner than you found it&lt;/td&gt;
&lt;td&gt;Coordinates all skills&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;&lt;code&gt;python-clean-code&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Master skill&lt;/strong&gt; with all 66 rules&lt;/td&gt;
&lt;td&gt;C1-C5, E1-E2, F1-F4, G1-G36, N1-N7, P1-P3, T1-T9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;&lt;code&gt;clean-comments&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Minimal, accurate commenting&lt;/td&gt;
&lt;td&gt;C1-C5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;&lt;code&gt;clean-functions&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;…&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/ertugrul-dmr/clean-code-skills" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;The repo includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;boy-scout&lt;/code&gt;&lt;/strong&gt;: An orchestrator skill that embodies the Boy Scout Rule—"always leave code cleaner than you found it"—and coordinates the other skills&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;python-clean-code&lt;/code&gt;&lt;/strong&gt;: A master skill with all 66 rules, plus a quick reference table and anti-patterns cheatsheet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Individual skills&lt;/strong&gt; for each category (&lt;code&gt;clean-comments&lt;/code&gt;, &lt;code&gt;clean-functions&lt;/code&gt;, &lt;code&gt;clean-general&lt;/code&gt;, &lt;code&gt;clean-names&lt;/code&gt;, &lt;code&gt;clean-tests&lt;/code&gt;)—drop in only what you need&lt;/li&gt;
&lt;li&gt;Installation instructions for Antigravity, Claude Code, and other Agent Skills-compatible tools&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How to Use These Skills
&lt;/h2&gt;

&lt;p&gt;Skills sit in a specific place in the agent ecosystem. &lt;strong&gt;Rules&lt;/strong&gt; are passive guardrails that are always on. &lt;strong&gt;Skills&lt;/strong&gt; are agent-triggered—the model decides when to equip them based on your intent. If you're using MCP servers (connections to external tools like GitHub or Postgres), think of MCP as the "hands" and skills as the "brains" that direct them.&lt;/p&gt;

&lt;h3&gt;
  
  
  For Antigravity
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Create &lt;code&gt;.agent/skills/&lt;/code&gt; in your project root (or &lt;code&gt;~/.gemini/antigravity/skills/&lt;/code&gt; for global access)&lt;/li&gt;
&lt;li&gt;Save the skill as a folder with a &lt;code&gt;SKILL.md&lt;/code&gt; file inside (e.g., &lt;code&gt;.agent/skills/python-clean-code/SKILL.md&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Ask the agent to review or write code—it'll automatically apply the rules when relevant&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Global vs Project Skills
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Project-specific&lt;/strong&gt;: &lt;code&gt;.agent/skills/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Global Antigravity&lt;/strong&gt;: &lt;code&gt;~/.gemini/antigravity/skills/&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent only loads full skill content when needed, so comprehensive skills don't slow down simple requests.&lt;/p&gt;

&lt;h3&gt;
  
  
  Going Further
&lt;/h3&gt;

&lt;p&gt;The skills in this article are instruction-only—they tell the agent what to do. For stricter enforcement, you could add a &lt;code&gt;scripts/&lt;/code&gt; folder with a linter that compatible agents runs them automatically, or an &lt;code&gt;examples/&lt;/code&gt; folder with before/after code samples for few-shot learning. The format supports it; we're just keeping things simple here.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Real-World Example
&lt;/h2&gt;

&lt;p&gt;Here's code that violates multiple Clean Code rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;  &lt;span class="c1"&gt;# P1
# Author: John, Modified: 2024-01-15  # C1
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  &lt;span class="c1"&gt;# N1, F1, F3
&lt;/span&gt;    &lt;span class="c1"&gt;# Process the data  # C3
&lt;/span&gt;    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  &lt;span class="c1"&gt;# N1
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# F3
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# G23
&lt;/span&gt;                &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;val&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.0825&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# G25
&lt;/span&gt;            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;B&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;val&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.05&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# G25
&lt;/span&gt;        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;val&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/tmp/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# G6
&lt;/span&gt;        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Old approach  # C5
&lt;/span&gt;    &lt;span class="c1"&gt;# for item in d:
&lt;/span&gt;    &lt;span class="c1"&gt;#     print(item)
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Violations&lt;/strong&gt;: P1, C1, C3, C5, F1, F3, G6, G23, G25, N1&lt;/p&gt;

&lt;p&gt;With the Clean Code skill active, ask your AI agent to refactor this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;

&lt;span class="n"&gt;TAX_RATE_CA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0825&lt;/span&gt;
&lt;span class="n"&gt;TAX_RATE_NY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;
&lt;span class="n"&gt;TransactionType&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CA&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Transaction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TransactionType&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;apply_tax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transaction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Transaction&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Apply state-specific tax to transaction value.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;tax_rates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CA&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TAX_RATE_CA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TAX_RATE_NY&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;transaction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tax_rates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;transaction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_transactions_with_tax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;transactions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Transaction&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Calculate taxed values for all transactions.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;apply_tax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;transactions&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_transactions_without_tax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;transactions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Transaction&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Extract raw values from all transactions.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;transactions&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;save_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Save processed values to JSON file.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The refactored version:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ No wildcard imports (P1)&lt;/li&gt;
&lt;li&gt;✅ No metadata comments (C1)&lt;/li&gt;
&lt;li&gt;✅ No redundant comments (C3)&lt;/li&gt;
&lt;li&gt;✅ No commented-out code (C5)&lt;/li&gt;
&lt;li&gt;✅ Descriptive names (N1)&lt;/li&gt;
&lt;li&gt;✅ No flag arguments (F3)&lt;/li&gt;
&lt;li&gt;✅ Named constants instead of magic numbers (G25)&lt;/li&gt;
&lt;li&gt;✅ Functions do one thing (G30)&lt;/li&gt;
&lt;li&gt;✅ Polymorphism through data structure (G23)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Anatomy of a Vibe-Coded Script
&lt;/h3&gt;

&lt;p&gt;Remember the duplicated function I mentioned in Torvalds' &lt;a href="https://github.com/torvalds/AudioNoise/blob/3a6b51032da587e5d2e269515f3dc21c96da15c4/visualize.py#L342C9-L342C27" rel="noopener noreferrer"&gt;AudioNoise visualizer&lt;/a&gt;? Here it is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_slider_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Helper to update slider texts (Width and End Point).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;start_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;
    &lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;end_val&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_val&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_slider_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Helper to update slider texts (Width and End Point).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;start_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;
    &lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;end_val&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_val&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x_mode&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;slider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;valtext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Window: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;start_val&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; + &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;slider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;valtext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Window: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start_val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; + &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first definition unpacks values, calculates width, then... returns &lt;code&gt;None&lt;/code&gt;. The second definition is the real implementation. Python silently overwrites the first with the second, so the code runs. But it's textbook dead code—&lt;strong&gt;Clean Code rule G9: Remove dead code.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With the skill active, an agent refactors the entire 600-line script. The duplicate vanishes, magic numbers become constants, and nested functions get extracted into focused methods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_slider_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Update slider text with either time or sample count.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;start_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;
    &lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;end_val&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_val&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x_mode&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;slider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;valtext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Window: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;start_val&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; + &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;slider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;valtext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Window: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start_val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; + &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbaemhkvcb4do3479focu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbaemhkvcb4do3479focu.png" alt="Antigravity Review" width="583" height="1152"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The refactored version:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Dead code removed (G9)&lt;/li&gt;
&lt;li&gt;✅ Type hints added (clarity)&lt;/li&gt;
&lt;li&gt;✅ Single, authoritative definition (G5)&lt;/li&gt;
&lt;li&gt;✅ Magic numbers extracted to constants (G25)&lt;/li&gt;
&lt;li&gt;✅ Large methods decomposed (G30)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full diff shows 600+ lines reduced to ~440—not by removing functionality, but by eliminating duplication and extracting reusable patterns.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters Now
&lt;/h2&gt;

&lt;p&gt;Vibe coding isn't going away. AI will get better at generating code, not worse. But "better at generating" doesn't mean "better at maintaining."&lt;/p&gt;

&lt;p&gt;The research is clear: AI produces code faster, but that code accumulates technical debt faster too. Without guard rails, we're building tomorrow's legacy systems today.&lt;/p&gt;

&lt;p&gt;Uncle Bob's Clean Code principles are almost 20 years old, but they're exactly what we need now. They're not arbitrary style preferences—they're battle-tested solutions to the problems AI recreates at scale.&lt;/p&gt;

&lt;p&gt;Skills give you the mechanism to encode these rules directly into your AI workflow. Whether you're using Antigravity, Claude Code, or another agent, the approach is the same: &lt;strong&gt;define what clean code means, then let the AI follow the rules&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Your agent doesn't know what good code looks like unless you tell it.&lt;/p&gt;

&lt;p&gt;So tell it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Book&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Clean Code&lt;/strong&gt; by Robert C. Martin: &lt;a href="https://www.amazon.com/Clean-Code-Handbook-Software-Craftsmanship/dp/0132350882" rel="noopener noreferrer"&gt;Amazon&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Skills Documentation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;Agent Skills Standard&lt;/a&gt; — The open standard for AI agent instructions&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://antigravity.google/docs/skills" rel="noopener noreferrer"&gt;Antigravity Skills Guide&lt;/a&gt; — Google's official documentation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview" rel="noopener noreferrer"&gt;Claude Code Agent Skills&lt;/a&gt; — Anthropic's implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Research Cited&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/resources/content/2025-dora-ai-assisted-software-development-report" rel="noopener noreferrer"&gt;DORA 2025: AI-Assisted Software Development&lt;/a&gt; — Google's findings on AI and delivery stability&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2511.04427" rel="noopener noreferrer"&gt;Code Quality After Cursor Adoption&lt;/a&gt; — Carnegie Mellon's analysis of 807 repositories&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.gitclear.com/ai_assistant_code_quality_2025_research" rel="noopener noreferrer"&gt;GitClear 2025 Code Quality Report&lt;/a&gt; — 211M lines analyzed&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://claude.com/blog/eight-trends-defining-how-software-gets-built-in-2026" rel="noopener noreferrer"&gt;Agentic Coding Trends&lt;/a&gt; — Anthropic's delegation gap analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Get the Skills&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Clean Code Skills Repository — All 66 rules as ready-to-use skill files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The future of programming is human intent translated by AI. Make sure the translation preserves quality, not just speed.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>antigravity</category>
      <category>python</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
