Four thousand years ago, Assyrian merchants were doing what people have always done: tracking debts, chasing payments, arguing over contracts. They pressed these records into clay tablets. Not sacred texts, not epic poetry. Just the ancient equivalent of office emails.
Nearly 23,000 of these tablets survive. Half have never been translated — not because they're damaged, but because only a handful of people on Earth can read Old Assyrian.
When the Deep Past Initiative turned this into a Kaggle competition (build a machine translation system for Old Assyrian cuneiform), I jumped in. The task: take transliterated text (cuneiform signs converted to Latin characters) and produce an English translation.
The training set? Around 1500 pairs. That's it.
For context, standard translation models train on millions of sentence pairs. Even research on "low-resource" languages works with tens of thousands. We got fifteen hundred documents and a pat on the back.
So the question was straightforward: how do you build a translation model when you barely have any data, for a language that no modern tokenizer has ever seen, where every proper noun and number matters because these are legal and financial records?
What started as "fine-tune a model on some ancient text" turned into a full-stack AI pipeline: Gemini vision for OCR-ing scanned academic books, LLMs for sentence alignment and cross-lingual translation, ByT5 as a byte-level backbone that doesn't choke on cuneiform, Unsloth for efficient LoRA training, and vLLM for fast inference on Kaggle T4s. The results surprised us.
Let's start with why the obvious approaches don't work.
Why the Obvious Approaches Don't Work
The first thing I tried was what everyone tries — throw a pretrained LLM at it. Gemma, Qwen, the usual suspects. Prompt it with some examples, let it translate.
And honestly? The outputs look pretty good at first glance. Fluent English, reasonable sentence structure, feels like it could be right. But "feels right" is dangerous when you're translating ancient legal documents.
The problem is hallucination — and not the subtle kind. These models confidently fill in names of merchants, cities, and commodities that simply aren't in the source text. When the transliteration says A-šùr-i-dí, the model might output a completely different name that sounds plausibly Bronze Age. When it hits an unfamiliar trade term, it improvises. For documents where every name, every number, every commodity is the actual information — that's not a minor quality issue, it's the whole problem.
OK, so what about standard encoder-decoder translation models? Here the issue is more fundamental: tokenization. Modern tokenizers are trained on modern text. Akkadian transliteration is a different universe — hyphenated syllable sequences like a-na, Sumerian logograms in ALL CAPS like KÙ.BABBAR, determinatives in curly braces like {d} and {ki}, subscript digits encoding phonetic variants like il₅, and gap markers like <gap> for broken sections of the physical tablet.
Feed this into a standard tokenizer and it fragments on every character it hasn't seen. Proper nouns that have never appeared in any pretraining corpus get silently mangled. The <gap> markers that indicate missing text get treated as noise or special tokens.
So: decoder-only models hallucinate, standard translation models can't tokenize the input properly. What actually fits this problem?
ByT5 — The Right Tool for a Weird Job
One of the best things about Kaggle competitions is the community. People share findings, discuss approaches in the forums, and collectively narrow down what works. Early on, several participants converged on the same answer: ByT5.
ByT5 comes from a 2021 Google Research paper — "Towards a Token-Free Future with Pre-trained Byte-to-Byte Models". The idea is simple and kind of radical: skip tokenization entirely. Instead of mapping text to a learned vocabulary of subwords, ByT5 operates directly on raw bytes. A standard Transformer, minimal modifications, just processing one byte at a time.
Why does this matter for our problem? Because every character is valid input by definition. It doesn't matter that A-mur-{d}UTU has never appeared in any pretraining corpus — ByT5 doesn't need it to. No vocabulary misses, no fragmented tokens, no special handling for curly braces or subscript digits. The model just sees bytes.
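To make the byte-level view concrete, here's a minimal sketch of how ByT5 sees text — one id per UTF-8 byte, shifted by 3 to reserve ids for the pad, end-of-sequence, and unknown tokens (this mirrors HuggingFace's ByT5Tokenizer; shown here without the library):

```python
def byt5_ids(text: str) -> list[int]:
    """Map text to ByT5-style token ids: one id per UTF-8 byte, offset by 3."""
    return [b + 3 for b in text.encode("utf-8")]

# Plain ASCII: one id per character.
print(byt5_ids("a-na"))  # [100, 48, 113, 100]

# A name with diacritics: š and ù are each two UTF-8 bytes, but still
# always valid input — there is no vocabulary to miss.
ids = byt5_ids("A-šùr")
print(len(ids))  # 7 ids for 5 characters
```

Every possible input maps to ids in a fixed range of 256 byte values plus a few specials — which is exactly why unseen names and odd orthography can never fragment or fall out of vocabulary.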
The paper also showed something else that turned out to be critical: byte-level models are significantly more robust to noise. When your source text comes from OCR'd clay tablets with inconsistent transcription conventions across different scholars and decades — that robustness isn't a nice-to-have, it's a requirement.
Architecture: solved. Now came the harder problem — we had the right model, but nowhere near enough data to train it.
The Data Problem
With ByT5 as the architecture, the bottleneck shifted entirely to data. And the competition host made the challenges very clear in a public discussion post.
Two things consistently broke translations more than anything else:
Named entities. Personal names, place names, divine names — they're transliterated inconsistently across different editions, they preserve older spelling conventions, and they're completely opaque to the model. In practice, many otherwise reasonable translations failed because a name got mangled, dropped, or hallucinated. The host even prepared an onomasticon (a curated list of attested name spellings) as supplemental data to help with this.
Transliteration format inconsistency. Different corpora encode the same text using different conventions. One participant converted diacritics to ASCII before training (š → sz, ú → u2) — a reasonable instinct, but the evaluation data expects diacritics. Collapsing ṣ into S₂ or š into SZ removes distinctions that are semantically meaningful in Akkadian. The rule was clear: normalize toward the format used in the evaluation set, not away from it.
On top of that, gap handling was tricky. Damaged sections of tablets are marked with <gap>, but the training data wasn't perfectly aligned — sometimes a large gap appears in the transliteration but not in the translation, forcing the model to learn misalignment rather than translation. Edge cases like <gap>-A-šùr (a gap attached to a word) needed to be preserved, not blindly stripped.
The host's closing point stuck with me: these aren't model architecture problems. They're data problems. And with only ~1,500 training pairs, every one of these issues hits harder because the model sees so few examples to learn from.
So the path forward was obvious — find more data.
Finding More Data — The AKT Books
The training data had to come from somewhere. The competition hosts pointed the way — they shared scanned PDFs of the AKT series (Anatolian Kültepe Texts), a multi-volume scholarly publication of Old Assyrian tablets from the Kültepe excavations in Turkey. Each volume contains transliterations and translations of tablets. Exactly the domain, exactly the format we needed.
The catch? These are academic books published between 1990 and the 2020s, by different authors, in different languages. AKT 1, 2, 4, 9a, and 10 are in Turkish. AKT 3 is in German. Each volume has its own layout, its own heading conventions, its own way of marking tablet edges and sections. Different fonts, different editorial styles, different decades of typesetting.
This isn't structured data you can parse with a script. These are scanned pages of physical books — some crisp, some not — where a tablet's transliteration might start on one page and continue on the next, where scholarly commentary sits right next to the translation text, and where the format changes just enough between volumes that nothing generalizes cleanly.
But inside these messy PDFs was exactly what we were starving for: hundreds of additional transliteration-translation pairs, many with line-by-line alignment that the original training set didn't have.
The question was whether I could extract it reliably enough to actually help the model — or whether the noise would make things worse. This is where Gemini's multimodal capabilities came in — specifically its ability to understand page layouts, distinguish between transliteration blocks and commentary, and handle multilingual content out of the box. I decided to build the pipeline.
The Extraction Pipeline
Building this pipeline was its own mini-project. Each step solved one problem and revealed the next.
Step 1: PDF → Page Images
The simplest step — render each PDF page as a numbered PNG. This is the only part that runs purely locally. Everything else goes through Gemini.
Step 2: Page Images → Structured JSON
Each page image gets sent to Gemini's vision model via the Vertex AI Batch API. The flow: build a JSONL of requests (one per page image, referencing GCS URIs), submit to Vertex, parse the predictions back.
A quick note on why batch inference: when you're processing hundreds of pages and don't need real-time responses, the Batch API is a no-brainer. You get a 50% discount over standard inference, much higher rate limits, and the service handles parallelization and retries for you — typically completing within 24 hours. You submit one job, go do something else, come back to results. For a pipeline like this where I was processing multiple books with hundreds of pages each, it saved both money and sanity.
The request construction:
def build_request(gcs_uri: str, prompt_text: str) -> dict:
    return {
        "request": {
            "contents": [{
                "role": "user",
                "parts": [
                    {"fileData": {"mimeType": "image/png", "fileUri": gcs_uri}},
                    {"text": prompt_text},
                ],
            }],
            "generationConfig": {
                "responseMimeType": "application/json",
                "temperature": 0.0,
                "mediaResolution": "MEDIA_RESOLUTION_HIGH",  # needed for diacritics
                "thinkingConfig": {"thinkingLevel": "MEDIUM"},
            },
        }
    }
We used gemini-3.1-flash-lite-preview with medium thinking enabled — the reasoning step helped significantly with understanding complex page layouts and making correct decisions about where one tablet ends and another begins.
Submit with the Vertex AI Batch API:
client = genai.Client(vertexai=True, project=project, location=location)
job = client.batches.create(
    model="gemini-3.1-flash-lite-preview",
    src="gs://your-bucket/book/ocr_batch/requests.jsonl",
    config=CreateBatchJobConfig(
        dest="gs://your-bucket/book/ocr_batch/predictions/"
    ),
)
One gotcha that bit me early: predictions come back shuffled. You can't rely on line order in the output — you have to extract the page number from each prediction's original request URI:
import re

def extract_page_num(pred: dict) -> int:
    uri = pred["request"]["contents"][0]["parts"][0]["fileData"]["fileUri"]
    m = re.search(r"page_(\d+)\.png", uri)
    return int(m.group(1))
This is actually a feature — it forces you to write robust parsing from the start.
Every AKT volume needs its own prompt. Different heading formats, different edge markers (Ö.y., Ak. for Turkish volumes; Vs., Rs. for German), different conventions for commentary blocks. Get this wrong and you extract commentary as translation, or merge two tablets into one.
Step 3: JSON Pages → Tablets CSV
A book-specific export script aggregates all the per-page JSONs into a flat CSV — one row per tablet with combined transliteration and translation fields. Each volume needs its own exporter because the structure varies enough that a generic one would silently break.
Step 4: Visual QC
Dump everything to an HTML file and actually look at it. This is where you spot the real problems: misread headings, commentary leaking into translation fields, duplicate translations from continuation pages. No amount of automated testing replaces eyeballing the output.
Step 5: Cleanup
Book-specific cleanup scripts apply the fixes found during QC — drop bad rows, merge tablets that got split across pages, strip commentary that leaked through. Unglamorous and manual but completely necessary.
Step 6: Sentence Chunking + Translation
Here's where it gets interesting again. The original training data is document-level — full tablet in, full translation out. But the AKT books have something better: line-by-line structure. Each transliteration line has a marker ((Vs.1), (2), (Rs.14)) and each translation sentence references those markers.
A second Gemini batch job handles two things at once: align transliteration lines to translation sentences by marker, and translate the non-English content (Turkish or German) into English. For each tablet, I retrieved the most similar examples from the official training set using TF-IDF cosine similarity and included them as few-shot context. This turned out to be crucial — not just for translation quality, but for matching the distribution of the host's wording, style, and terminology choices. The model wasn't just translating, it was learning to translate the way the competition data expected.
Same batch pattern — build JSONL, submit, parse shuffled predictions.
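The retrieval step above can be sketched in a few lines — TF-IDF over character n-grams with cosine similarity. This is an illustrative pure-Python version of the idea (the class name `NgramRetriever` and the toy corpus are mine, not the competition code):

```python
import math
from collections import Counter

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Character n-gram counts — robust to hyphenation and diacritics."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

class NgramRetriever:
    """Minimal TF-IDF cosine-similarity retriever over character n-grams."""

    def __init__(self, corpus: list[str]):
        self.docs = [char_ngrams(t) for t in corpus]
        df = Counter(g for d in self.docs for g in d)  # document frequency
        n_docs = len(corpus)
        self.idf = {g: math.log(n_docs / df[g]) for g in df}
        self.vecs = [self._tfidf(d) for d in self.docs]

    def _tfidf(self, counts: Counter) -> dict:
        vec = {g: c * self.idf.get(g, 0.0) for g, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {g: w / norm for g, w in vec.items()}

    def top_k(self, query: str, k: int = 5) -> list[int]:
        """Indices of the k most similar corpus documents."""
        q = self._tfidf(char_ngrams(query))
        scores = [sum(q.get(g, 0.0) * w for g, w in v.items()) for v in self.vecs]
        return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

corpus = [
    "um-ma A-šùr-i-dí-ma a-na ...",
    "1 ma-na KÙ.BABBAR a-na ...",
    "um-ma Pu-šu-ki-in-ma ...",
]
r = NgramRetriever(corpus)
print(r.top_k("um-ma A-šùr-na-da-ma", k=2))
```

Character n-grams rather than words matter here: transliterations are dense with hyphens, diacritics, and name fragments, so word-level tokenization would throw away most of the signal.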
Step 7: Normalization
Most of the invisible work happened here. The competition test set uses a specific character format, and the books don't match it. Every volume has its own OCR artifacts, its own conventions.
A few examples from the normalization stack:
- `ḫ/Ḫ → h/H` (test set uses plain H)
- Unicode subscripts → plain digits (`₄ → 4`)
- Superscript determinatives → brace format (`ᵈ → {d}`, `ᵏⁱ → {ki}`)
- OCR artifacts: `KU.BABBAR → KÙ.BABBAR`, `ś → š`, `ş → ṣ`
- Gap deduplication: `<gap> <gap> → <gap>`, while preserving attachments like `<gap>-A-šùr`
For a character-level model like ByT5, this isn't cosmetic. A single character mismatch between training and test — ḫ vs h, ₄ vs 4 — is invisible to a human reviewer and catastrophic to a model that has learned exactly one representation.
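The rules above can be condensed into a sketch like this — illustrative only, not the full per-volume cleanup stack:

```python
import re

# Unicode subscript digits -> plain digits (il₅ -> il5)
SUBSCRIPTS = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")

def normalize(text: str) -> str:
    text = text.replace("ḫ", "h").replace("Ḫ", "H")        # test set uses plain h/H
    text = text.translate(SUBSCRIPTS)
    text = text.replace("ᵈ", "{d}").replace("ᵏⁱ", "{ki}")  # determinatives
    text = text.replace("KU.BABBAR", "KÙ.BABBAR")          # common OCR miss
    text = text.replace("ś", "š").replace("ş", "ṣ")        # OCR diacritic fixes
    # collapse repeated gap markers, but keep attachments like <gap>-A-šùr
    text = re.sub(r"(<gap>)(\s*<gap>)+", r"\1", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("iš-tí il₅ <gap> <gap> KU.BABBAR"))
```

Real cleanup was book-specific and accumulated through QC passes; the point of the sketch is that every rule is a plain, auditable string transformation you can unit-test against the evaluation format.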
Step 8: Merge
The final step pulls normalized chunks into the main training set. Starting from ~1,500 pairs, the pipeline roughly multiplied our available training data — and more importantly, added sentence-level pairs that gave the model a much finer-grained learning signal than document-level translations alone.
Training — ByT5 Gets You Far, Then Stops
With the expanded dataset ready, training ByT5 was straightforward — standard seq2seq encoder-decoder training using HuggingFace Transformers. No tricks, no exotic schedulers. The model picked up patterns fast and translated training-domain tablets surprisingly well.
But then the leaderboard scores started telling a different story.
In our case, the hidden test set on Kaggle seemed to have a different distribution than what we trained on. Our best guess: different books, different topics, different translator styles, unfamiliar names and locations. Our ByT5 was doing well on what it had seen directly in training, but the leaderboard scores suggested it wasn't generalizing beyond that.
We hit a ceiling. Many teams went on to have great success pushing ByT5 further — better augmentation, longer training, and presumably smarter tricks. But in our setup, the gains had stalled, and we decided to explore a different direction.
Back to Decoder-Only — But This Time, Fine-Tuned
This is where the story comes full circle. Earlier, we'd dismissed decoder-only LLMs because they hallucinate. That's still true — out of the box. But fine-tuning changes the picture completely.
The reasoning was simple: ByT5 and Qwen were solving different problems. ByT5 was a great fit for the transliteration itself — every character mattered, and byte-level modeling let it handle weird orthography, diacritics, subscripts, and determinatives without fighting the tokenizer. But once the task became generalization across unfamiliar tablets, translator styles, and topic shifts, Qwen3.5 had something ByT5 didn't: much stronger pretrained language knowledge.
Out of the box, that strength was useless because it came with hallucination. Fine-tuning changed that. LoRA gave us a way to keep the model's broader language ability while grounding it in the task and the dataset. Instead of prompting a general-purpose model and hoping for the best, we trained a lightweight adapter on our curated examples. Combined with few-shot prompting to match the host's translation style, the fine-tuned Qwen handled the distribution shifts that our ByT5 couldn't.
Fine-Tuning with Unsloth — Making LLMs Affordable
Before diving into the training details, a quick primer for anyone who hasn't fine-tuned a model before.
The naive approach to fine-tuning a large language model means updating all its parameters — billions of them. That requires serious hardware, serious memory, and serious money. For a Kaggle competition where you're iterating fast on limited GPUs, it's a non-starter.
This is where LoRA (Low-Rank Adaptation) comes in. Instead of updating the entire model, you freeze the original weights and train a small set of adapter matrices on top. You get most of the benefits of full fine-tuning at a fraction of the cost. QLoRA takes it a step further by quantizing the base model to 4-bit precision, which dramatically cuts memory usage — making it possible to fine-tune models that would otherwise never fit on a single GPU.
For this project we used Unsloth, which makes the whole process surprisingly painless. It handles the LoRA/QLoRA setup, optimizes training to run ~2x faster with ~70% less VRAM, and supports a wide range of models out of the box — including Qwen3.5, which is what we needed.
The training itself was SFT (Supervised Fine-Tuning) using Unsloth's built-in SFT trainer. We structured our data as chat conversations: a system prompt setting the role of an expert Assyriologist, few-shot examples retrieved via TF-IDF similarity, and the target tablet as the final user message. The model only learns from the assistant completion — the actual translation.
# each training example looks like this
messages = [
    {"role": "system", "content": "You are an expert Assyriologist..."},
    # few-shot examples from similar tablets
    {"role": "user", "content": "Translate: um-ma A-šùr-i-dí-ma ..."},
    {"role": "assistant", "content": "Thus says Aššur-idī: ..."},
    {"role": "user", "content": "Translate: um-ma Pu-šu-ki-in-ma ..."},
    {"role": "assistant", "content": "Thus says Pūšu-kēn: ..."},
    # the actual tablet to translate
    {"role": "user", "content": "Translate: a-na A-lim {ki} ..."},
    {"role": "assistant", "content": "To the City: ..."},  # model learns this
]
An important detail here: we used completion-only masking. The loss is computed only on the assistant's translation tokens — the prompt tokens (system message, few-shot examples, user messages) are masked out during training. This means the model isn't wasting capacity learning to predict the input; it's focused entirely on producing accurate translations.
This meant the model wasn't just learning to translate — it was learning to translate in context, grounded by similar examples. The same retrieval and prompt structure would be used at inference time, so there was no gap between how the model trained and how it would be evaluated.
One direction we started exploring but ran out of time for: reinforcement learning on top of the fine-tuned model. The idea was to use GRPO (Group Relative Policy Optimization) with custom reward functions — combining the competition metric itself, gap alignment between transliteration and translation, and length balance — to push the model beyond what SFT alone could achieve. Each reward would target a specific failure mode that supervised training couldn't address directly. We didn't get there before the deadline, but it felt like the natural next step.
Inference — vLLM on Kaggle T4s
With a fine-tuned model ready, the next challenge was actually running it within Kaggle's competition constraints. This is a code competition — no internet access at submission time, two T4 GPUs with 16GB VRAM each, and a strict time limit.
A quick intro on vLLM for those unfamiliar: it's an open-source inference engine originally developed at UC Berkeley that's become the go-to for serving LLMs efficiently. The key innovation is PagedAttention — instead of pre-allocating a fixed block of memory for each sequence's key-value cache, it pages the KV cache dynamically, similar to how operating systems manage virtual memory. This means you can serve larger models on less hardware. On top of that you get continuous batching, optimized CUDA kernels, tensor parallelism, and seamless HuggingFace model support out of the box.
Sounds perfect, right? In theory. In practice, we hit a wall.
Qwen3.5 was released in the final weeks of the competition. The model was brand new — vLLM support was experimental and unstable. On top of that, Kaggle's T4 GPUs have compute capability 7.5, which means no FlashAttention 2 support. We had to fall back to the Triton attention backend, wrestle with environment compatibility issues, and work around the fact that you can't pip install anything at submission time — every dependency needs to be pre-packaged in your dataset.
Getting a 9B parameter model to load, run, and generate translations on two T4s without crashing was its own mini-project. Tensor parallelism across both GPUs was non-negotiable — the model simply wouldn't fit on a single card.
llm = LLM(
    model=MODEL_PATH,
    dtype="float16",
    max_model_len=16000,
    gpu_memory_utilization=0.85,
    enforce_eager=True,       # no CUDA graphs on T4
    tensor_parallel_size=2,   # split across both T4s
)
The inference prompt mirrors the training setup exactly — same system prompt, same TF-IDF few-shot retrieval. For each test tablet, we retrieve the 5 most similar examples from our training data and include them as conversation context:
prompts = [
    build_messages(
        transliteration=row["transliteration"],
        few_shot_examples=retriever.top_k(row["transliteration"]),
    )
    for row in test_rows
]
outputs = llm.chat(prompts, sampling_params=sampling_params)
Keeping the inference pipeline identical to training — same prompt structure, same retrieval, same style anchoring — meant the model was seeing exactly the kind of input it was trained on. No distribution shift at inference time.
Results and Reflections
Our team finished with a silver medal out of 2500+ teams. In the final days of the competition, the OCR extraction pipeline was still producing new data — each batch of cleaned and normalized tablets pushed our scores higher. We genuinely felt like gold was within reach with a couple more days. That stings a bit, but honestly? The journey was worth more than the medal.
Here's what I'm taking away from this:
Gemini's batch inference is a superpower for unstructured data. We used it to turn scanned academic books from the 1990s — messy layouts, multiple languages, inconsistent formatting — into clean, structured training data. If it works for 4,000-year-old Assyrian tablets in Turkish and German PDFs, it'll work for your use case too. The Vertex AI Batch API made it affordable and painless at scale.
Few-shot retrieval is still easy gains. TF-IDF character n-gram similarity is dead simple to implement, and using retrieved examples to anchor both training and inference gave us consistent improvements with minimal effort. Small iterations, big returns.
Fine-tuning is more accessible than you think. LoRA + Unsloth meant we could train a 9B parameter model on Kaggle's free GPUs. You don't need a cluster. You need good data and the right tools.
vLLM makes deployment practical. Even on constrained hardware like Kaggle T4s, with a brand-new model and no internet access, we got a 9B model running with tensor parallelism. The ecosystem is maturing fast.
And the bigger picture — the one that got me into this competition in the first place — is that there are still thousands of untranslated tablets sitting in museums. The pipeline we built here isn't a one-off competition hack. It's a blueprint: scan the books, extract the data, train the models, translate the tablets. The tools already exist. The data is already out there. At this point, the bottleneck is no longer whether this can be done. It's whether someone is willing to do it.