Four thousand years ago, Assyrian merchants were doing what people have always done: tracking debts, chasing payments, arguing over contracts. They...
Brilliant! 99 problems and data is 98 of them. Thank you for your efforts in sharing this. Are you able to provide the Kaggle competition link?
kaggle.com/competitions/deep-past-...
Thank you and congratulations!
This is one of those rare posts where the engineering is impressive, but the implication is even bigger.
The way you reframed the problem—from “train a model” to “build a full data pipeline for a dead language”—is what really makes this work. The insight that this wasn’t primarily a modeling problem but a data + representation problem (tokenization, normalization, alignment) feels spot on.
Your use of ByT5 is a perfect example of choosing the right abstraction for the job. Most people would try to force modern tokenizers onto the problem and debug forever, but going byte-level sidesteps the entire failure mode. Then pairing that with a fine-tuned Qwen for generalization is a really elegant “hybrid thinking” approach—each model doing what it’s actually good at.
Also, the Gemini-powered extraction pipeline is low-key one of the most powerful parts of this project. Turning messy academic PDFs into structured training data at scale goes way beyond this competition—it’s basically a template for unlocking any niche or historical dataset. That’s the kind of thing people underestimate, but it’s where a lot of real-world AI leverage lives.
And philosophically, there’s something incredible here:
we’re using modern AI systems to decode what is essentially Bronze Age bureaucracy—contracts, debts, logistics. Like one of the comments said, it really is just ancient state management.
If anything, this project shows that the bottleneck isn’t capability anymore—it’s attention and intent. The tools exist. The data exists. It just needs people willing to connect the dots.
Really fascinating work—this feels like the beginning of something much bigger than a Kaggle competition.
Byte-Level Models Over Token-Based Models
This is where models like ByT5 shine.
Instead of relying on a predefined tokenizer, byte-level models operate directly on raw text as sequences of bytes. That means:
no fragmentation of rare or unseen tokens
proper nouns stay intact
special markers like the determinative {ki} remain meaningful
In your case, this is critical. The structure of transliteration is the data. Destroy that structure, and you destroy the signal.
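To make that concrete, here's a tiny sketch (the sample line is made up) showing why raw UTF-8 bytes preserve exactly the structure a subword tokenizer would shred: diacritics and determinatives survive, and the round-trip is lossless.

```python
# Hypothetical transliterated line: the diacritic (š) and the
# determinative {ki} are part of the signal, not noise.
line = "šu-be-lum{ki}"

# A byte-level model like ByT5 consumes the raw UTF-8 bytes directly,
# so no English-trained tokenizer ever gets a chance to fragment it.
byte_ids = list(line.encode("utf-8"))

# The round-trip is lossless: the exact transliteration comes back.
restored = bytes(byte_ids).decode("utf-8")
assert restored == line

# š is a two-byte character in UTF-8, so the byte sequence is one
# element longer than the character string.
assert len(byte_ids) == len(line) + 1
```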
Constrained Generation > Fluent Generation
The second shift is philosophical: you don’t want “good English.” You want faithful mapping.
That means:
penalizing hallucinations aggressively
preferring literal translations over fluent ones
possibly introducing constrained decoding (copy mechanisms, alignment hints, or pointer-style outputs)
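A toy version of that copy-style constraint, just to show the shape of the idea (the names, the entity heuristic, and the source set are all illustrative, not anyone's actual decoder): at generation time, an entity-like token is only admissible if it literally appears in the source.

```python
# Toy copy constraint: block entity-like tokens (names, quantities)
# that the source transliteration never mentions. A real constrained
# decoder would apply this as a logit mask over the vocabulary.

def allowed(candidate: str, source_tokens: set) -> bool:
    """Reject entity-like tokens absent from the source."""
    looks_like_entity = candidate[:1].isupper() or candidate.isdigit()
    return (not looks_like_entity) or candidate in source_tokens

# Hypothetical source content for one tablet line.
source = {"Puzur-Ashur", "30", "shekels", "silver"}

assert allowed("Puzur-Ashur", source)      # copied from source: OK
assert allowed("owes", source)             # ordinary word: OK
assert not allowed("Hammurabi", source)    # hallucinated name: blocked
assert not allowed("40", source)           # hallucinated quantity: blocked
```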
In legal/financial tablets, correctness is binary. Either the merchant name is right, or it isn’t.
Synthetic Data and Alignment as Force Multipliers
With only ~1500 pairs, raw training isn’t enough.
But you can expand effective data by:
using LLMs for alignment (not translation)
generating noisy back-translations
augmenting with structure-preserving transformations (e.g. permuting clauses, masking gaps)
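A minimal sketch of the clause-permutation idea, assuming (hypothetically) that clauses are delimited by "; " and that source and translation have clause-aligned segments. The key invariant is that the same permutation is applied to both sides, so alignment is preserved.

```python
import random

def permute_clauses(pair, rng):
    """Structure-preserving augmentation: shuffle clause order,
    identically on source and target, keeping alignment intact."""
    src, tgt = pair
    src_clauses = src.split("; ")
    tgt_clauses = tgt.split("; ")
    order = list(range(len(src_clauses)))
    rng.shuffle(order)  # one permutation, applied to both sides
    new_src = "; ".join(src_clauses[i] for i in order)
    new_tgt = "; ".join(tgt_clauses[i] for i in order)
    return new_src, new_tgt

rng = random.Random(0)
# Hypothetical aligned pair (transliteration, translation).
pair = ("a-na PN qi2-bi2-ma; 30 GIN2 KU3.BABBAR",
        "say to PN; 30 shekels of silver")
aug = permute_clauses(pair, rng)

# The clause set is unchanged on both sides; only the order moves.
assert sorted(aug[0].split("; ")) == sorted(pair[0].split("; "))
assert sorted(aug[1].split("; ")) == sorted(pair[1].split("; "))
```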
Here, LLMs become tools in the pipeline—not the final model.
Retrieval-Augmented Translation
Another underrated approach: treat this as retrieval + translation, not pure generation.
If similar tablet fragments exist:
retrieve nearest neighbors
condition the model on those examples
bias outputs toward historically consistent mappings
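The retrieval step can be almost embarrassingly simple. A dependency-free sketch (corpus lines are invented for illustration): rank known fragments by cosine similarity over term counts, then hand the top hits to the model as few-shot context.

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical mini-corpus of transliterated fragments.
corpus = [
    "a-na PN qi2-bi2-ma um-ma PN2-ma",
    "30 GIN2 KU3.BABBAR i-na li-bi PN",
    "2 MA.NA AN.NA a-na PN ip-qi2-id",
]
query = "10 GIN2 KU3.BABBAR a-na PN"

vecs = [Counter(doc.split()) for doc in corpus]
qvec = Counter(query.split())
ranked = sorted(range(len(corpus)),
                key=lambda i: cosine(qvec, vecs[i]), reverse=True)

# The fragment sharing "GIN2 KU3.BABBAR" ranks first; its known
# translation would be prepended to the prompt as context.
assert ranked[0] == 1
```

A real system would swap the term counts for TF-IDF weights or embeddings, but the pipeline shape (retrieve, then condition) stays the same.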
In a domain where patterns repeat (names, trade terms, contract structures), retrieval adds stability.
Domain-Specific Evaluation
BLEU scores won’t cut it here.
You care about:
exact match on named entities
numerical accuracy
preservation of structure
A model that scores lower on BLEU but gets names and quantities right is objectively better.
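A sketch of what such a metric could look like (the entity regex is a deliberately crude stand-in; a real evaluator would use a curated name list): score the fraction of reference names and numbers the hypothesis reproduces exactly.

```python
import re

def entity_score(reference: str, hypothesis: str) -> float:
    """Fraction of reference names/numbers reproduced exactly."""
    # Crude heuristic: capitalized tokens (with hyphens) and numbers.
    pattern = r"\b[A-Z][\w-]+\b|\b\d+\b"
    ref_entities = re.findall(pattern, reference)
    hyp_entities = set(re.findall(pattern, hypothesis))
    if not ref_entities:
        return 1.0
    return sum(e in hyp_entities for e in ref_entities) / len(ref_entities)

ref = "Puzur-Ashur owes 30 shekels of silver"

# Terse but faithful: both entities present, full credit.
assert entity_score(ref, "Puzur-Ashur owes 30 shekels") == 1.0
# Fluent but wrong name and quantity: zero, whatever BLEU says.
assert entity_score(ref, "Ashur-idi owes 40 shekels of silver") == 0.0
```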
The Real Insight
What makes this problem interesting is that it breaks modern AI defaults.
Most NLP progress assumes:
large data
natural language
tolerance for approximation
You have:
tiny data
semi-structured symbolic input
zero tolerance for error
So the winning approach isn’t “more intelligence.”
It’s less assumption, more control.
And that’s why a pipeline with byte-level models, careful alignment, constrained decoding, and domain-aware evaluation outperforms brute-force LLM prompting.
You’re not just translating text.
You’re reconstructing meaning from a writing system that predates language as we model it today.
You are the true Indiana Jones!
Fascinating comparison. There's something almost poetic about the fact that both clay tablets and modern databases are ultimately just structured ways to persist and query state — the abstraction layers have changed but the fundamental problem hasn't.
The structured data aspect is interesting too: cuneiform records were essentially early schemas. Field names, values, relationships. The Assyrians invented the concept of a record long before we had words for it.
Smart move using ByT5 for the character-level handling — modern tokenizers would butcher cuneiform transliterations. Curious whether the TF-IDF retrieval for few-shot context outperformed just throwing more LoRA training data at the Qwen model.
This is such a beautiful reminder that even 4,000 years later, we’re still debugging human paperwork, you just made ancient history feel weirdly personal.
Good read. The agent reliability problem is real — I've been building autonomous SEO tools and the failure modes you describe are spot-on. State management is the hardest part.
1500 training pairs is wild - that's like training on a handful of pages. curious how you handled class imbalance, some signs probably appear way more than others in that corpus.