DEV Community

kivumia

We validated our COBOL-to-Python engine on 15,552 real-world programs. 98.78% produce valid Python. Zero LLMs involved.

Last week we published a proof of concept with IBM's SAM1 — 505 lines, 32 milliseconds.
This week we scaled it to the entire planet.
The corpus
15,552 COBOL source files. Not synthetic benchmarks. Real programs, collected from 131
open-source repositories across 5 continents:
— Norway. France. Brazil. India. Japan. USA.
— GitHub. HuggingFace. CBT Tape. GnuCOBOL. IBM public repositories.
— Commercial COBOL. GnuCOBOL extensions. TypeCOBOL. Mainframe dialects.
No selection bias. No curated samples. Everything we could find.
The result
|              | Before (v5.6)   | After (v5.8e)         |
|--------------|-----------------|-----------------------|
| Corpus       | 14,508 files    | 15,552 files (+1,044) |
| Valid Python | 14,020 (96.84%) | 15,362 (98.78%)       |
| Failures     | 456             | 190                   |
| Net gain     |                 | +1,342 files          |
On the original v5.7 reference corpus: 99.25%. 180 of 289 failures corrected in a single session.
What "valid Python" means
We are not using LLMs to judge output quality. We are not doing string comparison. We are not
running style checks.
We use ast.parse().
Binary. Deterministic. No margin for interpretation.
If the generated Python passes ast.parse() without raising a SyntaxError — it is valid. If it raises — it
fails. Nothing in between.
This is a strict, mechanical definition of syntactic correctness: it measures whether the output parses, not whether it runs correctly. A human reviewer cannot override it.
A model cannot hallucinate its way through it.
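The check described above can be sketched in a few lines. This is a minimal illustration, not the project's actual test harness: the function name `is_valid_python` and the two sample strings are assumptions made for the example.

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if `source` parses as Python, False on SyntaxError."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


# Hypothetical generated outputs, for illustration only.
samples = {
    "ok.py": "total = 0\nfor i in range(3):\n    total += i\n",
    "broken.py": "def main(:\n    pass\n",
}

results = {name: is_valid_python(src) for name, src in samples.items()}
passed = sum(results.values())
print(f"{passed}/{len(results)} valid")
```

Run over a directory of generated files, the pass rate is the number of `True` results divided by the file count.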
What fails and why
190 files still fail. Here is what they are:
| Category             | ~Files | Example                                                |
|----------------------|--------|--------------------------------------------------------|
| TypeCOBOL            | ~60    | Multi-level qualifications, REPLACE, typed expressions |
| GnuCOBOL extensions  | ~40    | GUI, bitwise composed, OO, SCREEN SECTION              |
| Non-standard COBOL   | ~30    | WebSocket, brainfuck interpreter, .NET GUI             |
| Deep STRING/UNSTRING | ~25    | Complex nesting, multiple delimiters                   |
| Exotic mainframe     | ~35    | CICS inline, complex EXEC SQL, nested copybooks        |
These are not parsing bugs. These are constructs that sit at the outer boundary of what any
standard COBOL parser is expected to handle. The sanitizer cannot fix what the parser never
understood.
We know exactly what they are. We are working on them.
How it works
AGUELLID CODE does not translate COBOL to Python.
It transforms COBOL into a semantic intermediate representation, then generates Python that is
provably equivalent — not line-by-line, but behavior-by-behavior.
No neural network. No prompt. No sampling.
The transformation is deterministic: the same input always produces the same output. The output
can be audited. The logic can be traced. There is no black box.
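To make the shape of such a pipeline concrete, here is a toy sketch: source text to intermediate representation to generated Python, with a determinism check at the end. It is not AGUELLID CODE. The IR node types (`Assign`, `AddTo`, `Display`), the helper names, and the three-statement COBOL subset are all hypothetical, chosen only to illustrate the idea.

```python
import ast
import hashlib
from dataclasses import dataclass


# Hypothetical IR node types; the real engine's IR is not described in the post.
@dataclass(frozen=True)
class Assign:
    target: str
    value: int


@dataclass(frozen=True)
class AddTo:
    target: str
    amount: int


@dataclass(frozen=True)
class Display:
    name: str


def py_name(cobol_name: str) -> str:
    """Map a COBOL data name such as WS-TOTAL to a Python identifier."""
    return cobol_name.lower().replace("-", "_")


def to_ir(line: str):
    """Parse one statement from a toy COBOL subset into an IR node."""
    words = line.strip().rstrip(".").split()
    if len(words) == 4 and words[0] == "MOVE" and words[2] == "TO":
        return Assign(py_name(words[3]), int(words[1]))
    if len(words) == 4 and words[0] == "ADD" and words[2] == "TO":
        return AddTo(py_name(words[3]), int(words[1]))
    if len(words) == 2 and words[0] == "DISPLAY":
        return Display(py_name(words[1]))
    raise ValueError(f"unsupported statement: {line!r}")


def emit(node) -> str:
    """Generate one line of Python from an IR node."""
    if isinstance(node, Assign):
        return f"{node.target} = {node.value}"
    if isinstance(node, AddTo):
        return f"{node.target} += {node.amount}"
    return f"print({node.name})"


def transform(cobol: str) -> str:
    """Run the whole toy pipeline: COBOL text -> IR nodes -> Python text."""
    lines = [ln for ln in cobol.splitlines() if ln.strip()]
    return "\n".join(emit(to_ir(ln)) for ln in lines) + "\n"


cobol_source = """
    MOVE 5 TO WS-TOTAL.
    ADD 3 TO WS-TOTAL.
    DISPLAY WS-TOTAL.
"""

first = transform(cobol_source)
second = transform(cobol_source)

# Same input, byte-identical output: the property the post calls deterministic.
assert hashlib.sha256(first.encode()).hexdigest() == hashlib.sha256(second.encode()).hexdigest()
ast.parse(first)  # the generated text is also syntactically valid Python
print(first)
```

Because every step is a pure function of its input, comparing output hashes across runs is enough to detect any nondeterminism, and each generated line can be traced back to the IR node that produced it.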
This matters in banking. In insurance. In government systems. In any environment where "the model
thought it was right" is not an acceptable explanation.
Why this matters
There are an estimated 220 billion lines of COBOL in active production today.
Most of it runs on systems that organizations can no longer maintain. The engineers who wrote it
are retired. The documentation is incomplete. The behavior is institutional memory encoded in
syntax.
Modernizing this code is not a style choice. It is a survival question for dozens of industries.
Current approaches:
— Manual rewrite: expensive, slow, error-prone
— LLM translation: non-deterministic, unauditable, high hallucination risk on legacy syntax
— Transpilers: brittle, shallow, fail on complex constructs
AGUELLID CODE is none of these.
98.78% on 15,552 real files. Deterministic. Auditable. No LLMs.
What comes next
The 190 remaining failures map to specific parser gaps. We are working through them by gain/risk
ratio — some TypeCOBOL patterns alone can recover 20-30 files in a single micro-patch.
Target: 99.2-99.5% on the full expanded corpus.
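As a quick sanity check on that target, the arithmetic below works out how many of the 190 failing files would have to be recovered. The corpus size and failure count come from the post; the helper name `max_failures` and the basis-point representation are assumptions made for this sketch.

```python
TOTAL_FILES = 15_552   # full expanded corpus (from the post)
FAILURES_NOW = 190     # remaining failures in v5.8e (from the post)


def max_failures(total: int, target_basis_points: int) -> int:
    """Largest failure count that still meets the target rate.

    The target is given in basis points (99.2% -> 9920) so the
    calculation stays in exact integer arithmetic.
    """
    needed_valid = -(-total * target_basis_points // 10_000)  # ceiling division
    return total - needed_valid


current_rate = round(100 * (TOTAL_FILES - FAILURES_NOW) / TOTAL_FILES, 2)

to_recover = {}
for label, bp in (("99.2%", 9920), ("99.5%", 9950)):
    allowed = max_failures(TOTAL_FILES, bp)
    to_recover[label] = FAILURES_NOW - allowed
    print(f"{label}: at most {allowed} failures, recover {to_recover[label]} files")
```

On these figures, reaching 99.2% means recovering 66 files and reaching 99.5% means recovering 113, which is consistent with a handful of micro-patches at 20-30 files each covering the lower bound.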
The forge is still burning.
KIVUMIA — AGUELLID CODE v5.8e
Validated: 2026-04-05 03:27 UTC
Corpus: 131 sources, 15,552 files, 5 continents
Engine: deterministic, zero LLMs
kivumia.ai

Top comments (1)

kivumia

If you're curious about the 190 remaining failures — they're not random. They cluster around specific constructs: TypeCOBOL multi-level qualifications, GnuCOBOL OO extensions, CICS inline. We have a full breakdown and a prioritized attack plan. v5.9 targets the highest gain/risk patterns first.
The parser is next. That's where the real frontier is.