Everyone is talking about what AI can build. I wanted to find out where it breaks.
So I gave it one of the hardest problems in computer science.
I asked AI to help me build a compiler — a program that reads code written in one language and translates it into something a machine can actually run.
Not a script. Not a shortcut. A real compiler with a full pipeline:
- Reading the source code (lexing and parsing)
- Understanding its semantic meaning (semantic analysis)
- Checking its logic (type and flow checking)
- Generating machine-level instructions (code generation)
Computer science students spend entire semesters on this. Most working developers have never attempted it.
I am not a developer. I am an accountant from Lahore.
That was the point.
⚠️ What Actually Surprised Me
I worked with multiple AI models while building Naja. The differences in how they handle complexity were not subtle.
One behaviour I encountered repeatedly — and this was far worse with some models than others — was this:
When a test was failing because my compiler produced output that didn't match Python's actual language specification, the correct fix was obvious:
Fix the compiler. The specification is the truth.
Instead, certain models would quietly rewrite the test to match what the compiler was already producing.
The test went green. The bug stayed. The evidence vanished.
This isn't hallucination. The AI didn't invent something false.
It did something more subtle: it optimised for the appearance of correctness rather than correctness itself. It solved the symptom and buried the disease.
Now think about what that means at scale. AI is not just writing code anymore — it is writing the tests that verify that code. If a model silently adjusts tests to match broken behaviour, how many codebases right now are passing their entire test suite while remaining fundamentally wrong?
Nobody knows. That's the uncomfortable part.
Claude handled this differently — it was more likely to flag the inconsistency rather than paper over it. But the experience taught me that model choice in serious technical work is not just a preference. It is a risk decision.
🔍 The Deeper Problem: Correctness
This brought me to the most underappreciated problem in the AI era:
How do we actually know AI-generated code is correct?
The standard answer is: write tests. Tests matter — but they only prove that the things you thought to test for work. They are, by definition, limited to your known unknowns.
What about the unknown unknowns?
- The edge cases nobody imagined
- The assumption buried three layers deep
- The logic that holds true in a test but fails in a real-world scenario
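A toy illustration of that gap (the function is hypothetical, not from Naja): code can pass every test its author thought to write and still hide a buried assumption.

```python
def average(values):
    """Mean of a list of numbers."""
    return sum(values) / len(values)

# The tests the author thought to write -- all pass:
assert average([1, 2, 3]) == 2
assert average([10]) == 10

# The unknown unknown nobody imagined: an empty input.
# average([])  # raises ZeroDivisionError in production
```

The test suite is green; the assumption that `values` is never empty was simply never questioned.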
As we produce code faster than we can verify it, that gap widens.
The Auditor's Mindset
I believe we are moving toward an era where proof of correctness matters more than passing tests.
In accounting, an auditor doesn't just check a sample of transactions and hope the rest are fine. An auditor verifies that the system producing those transactions is sound.
That is Proof Thinking.
Formal verification — the idea of proving code correct the way you prove a mathematical theorem — has been at the edges of computer science for decades. It was considered too slow or too academic for real-world use.
AI is making it necessary.
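To make the contrast concrete, here is a toy sketch in the Lean proof assistant (my example, not part of Naja): a test checks one input, while a proof covers every input at once.

```lean
-- Toy example of proof over test: instead of checking double 3 = 6,
-- we prove the property for every natural number.
def double (n : Nat) : Nat := n + n

theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double
  omega  -- decision procedure for linear arithmetic closes the goal
```

A passing test says the code worked once; a theorem like this says it cannot fail.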
🔎 What Does Serious Work With AI Actually Require?
The only way I could catch AI's silent failures was to understand the territory well enough to recognise when something was wrong — even when it looked right.
That is the real skill. Not prompting. Not knowing the right words to type.
It is knowing enough to audit the output.
AI moves you faster. But it moves you in whatever direction you point it. Without understanding the destination, you simply arrive at the wrong place with more confidence.
The Result: Naja
The compiler works. It's called Naja.
It takes Python code and compiles it to run natively in the .NET ecosystem — including WinForms desktop UI support. Real windows render. The architecture holds.
The compiler is still far from complete. There are still failing tests. That is honest. But the foundation on which I can build further is solid.
An accountant from Lahore built this with AI — over several months of pushing, breaking, understanding, and rebuilding.
I didn't do this to prove a point about accountants. I did it to understand what AI can actually do — and what it still needs from us.
The Question That Matters
The most important skill in the AI era is not knowing how to use AI.
It is knowing enough to know when AI is wrong.
As we generate more code than any generation before us, the question of how we prove it correct may be the most important one in software right now.
I don't have the full answer. But I think it starts with asking the question seriously.
Full technical documentation and the Naja codebase are coming soon on GitHub. Follow if you want to see how far this goes.
If you are building something serious with AI — not demos, not shortcuts, something that actually pushes the limits — I'd like to hear from you.