O6lvl4

Posted on May 19

I built a programming language to test how well LLM-edited code survives

#programming #ai #webassembly #rust

Can programming languages be designed for LLM code edits?

I built a statically typed programming language called Almide.

The motivation was not simply to make another general-purpose language. I wanted to explore a specific question:

If LLMs are going to modify existing code, can programming languages be designed so those edits are less likely to break the program?

Most discussion around AI and programming focuses on code generation: can a model write a function from scratch, solve a benchmark problem, or generate an implementation from a prompt?

But in day-to-day software work, a lot of coding is not greenfield generation. It is modification:

add a parameter
change a data structure
update a parser
fix an edge case
refactor an API
preserve existing behavior
keep the existing tests passing

That led me to a metric I call modification survival rate.

Modification survival rate

The idea is simple:

After an LLM modifies an existing program, does the result still compile and pass the existing tests?

This is intentionally stricter than asking whether the output “looks right.” It asks whether the modified program survives the basic checks that real code has to survive.
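As a rough illustration, the metric could be computed like this in Rust. This is a minimal sketch, not the actual benchmark code: `TaskResult` and `survival_rate` are hypothetical names, and it assumes each task yields just two facts, whether the edited program compiled and whether the existing tests passed.

```rust
/// Outcome of one modification task after the model's edit is applied.
/// Hypothetical type for illustration; not taken from the Almide benchmark.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TaskResult {
    pub compiled: bool,
    pub tests_passed: bool,
}

impl TaskResult {
    /// An edit survives only if it compiles AND the existing tests pass.
    pub fn survived(&self) -> bool {
        self.compiled && self.tests_passed
    }
}

/// Fraction of tasks whose edits survived, in the range 0.0..=1.0.
/// Returns None for an empty task set, where the rate is undefined.
pub fn survival_rate(results: &[TaskResult]) -> Option<f64> {
    if results.is_empty() {
        return None;
    }
    let survived = results.iter().filter(|r| r.survived()).count();
    Some(survived as f64 / results.len() as f64)
}
```

Under this definition, an edit that compiles but fails a test counts the same as one that does not compile at all: neither survives.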

For Almide, I built a small benchmark with 30 code modification tasks.

On the current benchmark:

Claude Sonnet 4.6 passes 30/30 Almide tasks
The same task set in Rust reaches a pass rate of around 58%

This is not meant as “Almide is better than Rust.”

Rust is vastly more mature, has a much larger ecosystem, and is solving a broader and harder set of problems. I use Rust as a comparison point because it is a serious statically typed systems language with strong compile-time guarantees.

The question I’m trying to explore is narrower:

Can language design choices measurably affect how often LLM-generated edits continue to compile and pass tests?

What Almide is

Almide is a statically typed language implemented in Rust.

It currently has:

bidirectional type inference
generics
pattern matching with exhaustiveness checking
effect functions for automatic error propagation
pipeline operator and UFCS-style calls
a module system with versioned packages
native Rust code generation
direct WebAssembly code generation
a browser playground where the compiler itself runs as WASM

Playground:
https://almide.github.io/playground/

Benchmark:
https://almide.github.io/almide-dojo/

Why language design might matter for LLM edits

LLMs are very good at producing plausible code. The problem is that plausible code is not always valid code.

When a model edits an existing program, it can fail in many small ways:

changing one call site but not another
returning the wrong variant
forgetting an error case
breaking ownership or mutation rules
producing code that looks locally correct but no longer matches the surrounding program
making an edit that compiles but fails tests

Some of these failures are model problems. But some may be language-design problems.

A language can make edits more survivable by making program structure easier to infer, making common changes local, making invalid states easier to reject, and giving diagnostics that point toward the intended repair.

That is the design space Almide is exploring.
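One concrete instance of "making invalid states easier to reject" is exhaustive pattern matching. The sketch below is written in Rust rather than Almide, purely to illustrate the general idea. `ParseOutcome`, `parse`, and `describe` are hypothetical names invented for this example.

```rust
/// Hypothetical result type for a tiny parser; illustrative only.
#[derive(Debug, PartialEq)]
pub enum ParseOutcome {
    Number(i64),
    Empty,
    Invalid(String),
}

/// Parses a trimmed decimal integer, distinguishing empty from invalid input.
pub fn parse(input: &str) -> ParseOutcome {
    let trimmed = input.trim();
    if trimmed.is_empty() {
        return ParseOutcome::Empty;
    }
    match trimmed.parse::<i64>() {
        Ok(n) => ParseOutcome::Number(n),
        Err(_) => ParseOutcome::Invalid(trimmed.to_string()),
    }
}

/// This match has no wildcard arm. If an edit adds a new variant to
/// `ParseOutcome` but forgets to update this function, the program
/// stops compiling instead of silently mishandling the new case.
pub fn describe(outcome: &ParseOutcome) -> String {
    match outcome {
        ParseOutcome::Number(n) => format!("number {}", n),
        ParseOutcome::Empty => "empty input".to_string(),
        ParseOutcome::Invalid(text) => format!("invalid input: {}", text),
    }
}
```

The point is that a "forgot an error case" edit turns into a compiler diagnostic naming the missing variant, which gives a model (or a human) a specific repair target.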

The benchmark

The benchmark is built around code modification tasks rather than from-scratch generation.

Each task gives the model an existing program and asks it to modify the code. The output is then checked by compilation and tests.

The score is not based on whether the code is elegant, idiomatic, or human-preferred. The first question is just:

Did the edit survive?
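The two-stage check can be sketched as follows. This is an assumption-laden outline in Rust, not the real harness: `Outcome` and `check_edit` are hypothetical names, and the compile and test steps are passed in as closures so the sketch does not depend on any particular compiler or test runner.

```rust
/// How a single edited program fared. Hypothetical type for illustration.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// The edited program did not compile; carries the diagnostic text.
    CompileFailed(String),
    /// The program compiled but an existing test failed; carries the report.
    TestsFailed(String),
    /// The program compiled and all existing tests passed.
    Survived,
}

/// Runs the compile step first and the tests only if compilation succeeds.
/// `compile` and `run_tests` stand in for whatever toolchain is under test.
pub fn check_edit<C, T>(edited_source: &str, compile: C, run_tests: T) -> Outcome
where
    C: Fn(&str) -> Result<(), String>,
    T: Fn(&str) -> Result<(), String>,
{
    if let Err(diagnostic) = compile(edited_source) {
        return Outcome::CompileFailed(diagnostic);
    }
    match run_tests(edited_source) {
        Ok(()) => Outcome::Survived,
        Err(report) => Outcome::TestsFailed(report),
    }
}
```

Keeping the failure stage in the outcome, rather than collapsing it to a boolean, makes it possible to see later whether a language mostly loses edits at compile time or at test time.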

I know 30 tasks is a small sample, and I do not think this benchmark proves anything definitive yet.

But the early result is interesting enough that I want to make the benchmark public and get feedback from people who work on programming languages, compilers, testing, and AI-assisted programming.

What I want feedback on

I’m especially interested in feedback on the benchmark methodology.

For example:

What kinds of modification tasks should be included?
How should task difficulty be categorized?
How do we avoid overfitting the language to the benchmark?
Should the benchmark include multi-file edits?
Should it include larger libraries or real applications?
How should we compare across languages fairly?
Is “compile and pass tests” enough, or should there be another layer of semantic checking?

I’m also interested in language-design feedback.

If LLM-assisted modification becomes a normal programming workflow, what should languages optimize for?

Not just for humans writing code from scratch, but for humans and machines repeatedly changing existing code together.

That is the question Almide is trying to explore.
