Lars Moelleken

Crayons and the Wall

Why LLMs Break Systems — and Why That’s Our Fault

#blogPostAsWebApp: https://voku.github.io/llm_coding_constraints/


Introduction — Incentives, Not Rebellion

Imagine teaching a toddler to draw.

Every time they produce a recognizable shape on paper, you praise them:

  • “Great job.”
  • “Nice colors.”
  • “Very creative.”

But you never articulate the one sentence that actually matters:

“You are not allowed to draw on the wall.”

A week later, your hallway looks like a prehistoric art gallery.

The child didn’t rebel.

The child followed incentives.

You rewarded the output (drawing) and forgot to define the boundary (the paper).

That is exactly how we train and deploy Large Language Models today.

We reward:

  • helpfulness
  • fluency
  • clean abstractions
  • elegant refactorings

But we almost never encode the other half of the equation:

  • What must never change?
  • What is forbidden, even if it looks reasonable?
  • What exists solely because production burned down three years ago?

So the model generalizes.

If drawing is good and nobody mentioned walls, then the wall is just a larger canvas.


This post follows one red line, end to end: the missing constraint.

We’ll walk through:

  • why transformers are architecturally simple
  • why history matters more than architecture
  • why rules must be explicit, owned, and test-backed
  • and why we’re now papering over the gap with agents, prompts, and skills

The Important Realization (State It Early)

Before history, before code, here’s the punchline:

Modern LLMs are conceptually simple.

Their power — and their danger — comes from scale, data, and missing constraints.

Just like parenting.

Teaching rules is hard.

Handing out crayons is easy.


A Readable History (So We Can Argue About It)

This section exists to kill the “AI magic” narrative.

Nothing here is mystical. None of this is new.

1906 — Probability Before Intelligence

Long before silicon, the core idea already existed:

“Given what I’ve seen so far, what is the most likely next thing?”

That’s a Markov chain.

  • no understanding
  • no meaning
  • no intent

Just probability conditioned on context.

This is the child repeating a word because they heard it last.

No wall.

Just imitation.
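
To make “probability conditioned on context” concrete, here is a tiny bigram Markov chain, sketched in PHP (toy code invented for this post, not part of the original):

// A toy bigram "Markov chain": the next token depends only on the previous one.
// All names and the training sentence are illustrative.

function trainBigrams(array $tokens): array
{
    $counts = [];
    for ($i = 0; $i < count($tokens) - 1; $i++) {
        $current = $tokens[$i];
        $next    = $tokens[$i + 1];
        $counts[$current][$next] = ($counts[$current][$next] ?? 0) + 1;
    }
    return $counts;
}

function mostLikelyNext(array $counts, string $current): ?string
{
    if (!isset($counts[$current])) {
        return null; // unseen context: the chain has nothing to say
    }
    arsort($counts[$current]); // sort successors by frequency, descending
    return array_key_first($counts[$current]);
}

$tokens = explode(' ', 'the child draws on the paper the child draws on the wall');
$model  = trainBigrams($tokens);

echo mostLikelyNext($model, 'the'), "\n";   // "child": pure frequency, no meaning
echo mostLikelyNext($model, 'draws'), "\n"; // "on"

// After "the", "paper" and "wall" are equally likely.
// The chain has no concept that one of them is forbidden.

Counting and sampling. That is the entire trick; everything that follows in this history is a more powerful way to condition on more context.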


1940s–1950s — Tiny Brains, Explicit Boundaries

Early neural models (perceptrons) were brutally honest.

At their core:

if (weighted_sum > threshold) {
    fire();
}

No creativity.
No abstraction.
No crayons on the wall.

Ironically, these early models had clearer boundaries than modern ones.

The wall was absolute: this fires, that doesn’t.


1990s — Memory Appears

Then we realized something important:

Context matters.

With LSTMs, earlier inputs could influence later outputs.
Sequences became meaningful.

This is the child remembering:

“Last time I drew on the table, mom was angry.”

Still:

  • the rule is external
  • the model didn’t invent it
  • it just remembers the consequence


2017 — The Transformer (The Code, Not the Myth)

This is where explanations usually collapse into mysticism.

So let’s not do that.

Here is the uncomfortable truth:

The transformer architecture is embarrassingly small.

Conceptually, a transformer block is just linear algebra and normalization:

# 1. Token embeddings
x = embed(tokens)

# 2. Self-attention
q = x @ Wq
k = x @ Wk
v = x @ Wv

# Attention: softmax(QKᵀ / √d) · V
attn = softmax(q @ k.T / sqrt(d))
x = attn @ v

# 3. Feed-forward
x = relu(x @ W1) @ W2

# (residual connections and layer norm around each step omitted for brevity)

That’s it.

There is:

  • no business logic module
  • no ethics layer
  • no domain model
  • no notion of “this is illegal”

If you’re comfortable with matrix multiplication, you understand the engine.

The complexity is not inside the transformer.
The complexity is in the data, feedback loops, and scale around it.


Where the Metaphor Stops Being Cute

Transformers are exceptional generalizers.

That is their superpower — and their flaw.

If you teach a model:

  • “Clean code is good.”
  • “Duplication is bad.”
  • “Simplify logic.”

The model concludes:

“Simplify everything.”

So it removes a redundant check.

What it doesn’t know:

  • this check exists because of a lawsuit in 2019
  • this branch prevents a race condition we hit once
  • this “ugly” code is contractual
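
The “contractual ugliness” can be as mundane as a rounding rule. A made-up illustration (the function name and the contract are invented for this post):

// Looks like a cleanup candidate: "why not just round()?"
// It isn't: the partner's reconciliation expects amounts rounded *up*
// to full cents, as agreed in the contract. Changing it breaks invoicing.
function roundInvoiceAmountForPartner(float $euros): float
{
    return ceil($euros * 100) / 100;
}

Nothing in the function body says “contract”. The reason lives outside the tokens.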

Patterns vs. Rules

Across Markov → LSTM → Transformer, one invariant holds:

We taught models patterns, not rules.

Patterns scale.
Rules constrain.

Children need both.
So do LLMs.


The Wall Was Never Learned

An LLM trained on your codebase sees:

  • the final snapshot
  • the cleaned-up version
  • the happy path

It does not see:

  • reverted commits
  • production outages
  • 2 a.m. Slack threads
  • “never do this again” post-mortems

That knowledge lives:

  • in git history
  • in tests
  • in annotations
  • in human memory

Not in tokens.


Your Git History Is Parenting

Git is not just version control.
It is a decision log.

Every hotfix and revert is negative knowledge.

Example:

if ($timeout < 30) {
    $timeout = 30;
}

LLM interpretation:
Magic number. Cleanup candidate.

Git blame interpretation:
We tried 10. Production burned. 30 survived.

That if statement is the wall.

But the model can’t see it.


Rules Are Not Comments — They Are Contracts

Most teams try to fix this with:

  • comments
  • prompts
  • “be careful here” notes

That fails.

Comments are:

  • optional
  • unverifiable
  • easy to delete
  • invisible to tools

A real rule answers:

  • why does this exist?
  • who owns it?
  • how critical is it?
  • how is it proven?

A critical rule without executable proof is just a suggestion.


The Final Example — Drawing the Wall Properly

  1. Name the Rule

Stop relying on folklore.
Give the rule an identity.

enum BillingRules: string
{
    case RefundLimit = 'REFUND_LIMIT_CRITICAL';
}

  2. Define the Intent (Not the Logic)
return [
    BillingRules::RefundLimit->value => new RuleDefinition(
        statement: 'Refunds above 500 EUR require manual approval',
        tier: Tier::Critical,
        rationale: 'Fraud prevention and regulatory requirements (2021 Audit)',
        owner: 'Team-Billing',
        verifiedBy: [RefundLimitTest::class],
    ),
];

No conditionals.
No duplication.
Just why.
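
The RuleDefinition value object and the Tier enum used above are not from any framework. A minimal sketch could look like this (fields taken from the example above; the extra tiers are assumptions):

enum Tier: string
{
    case Critical = 'critical';
    // Additional tiers are illustrative, not prescribed by this post.
    case Important = 'important';
    case Advisory  = 'advisory';
}

final class RuleDefinition
{
    /**
     * @param class-string[] $verifiedBy Tests that prove the rule is still enforced.
     */
    public function __construct(
        public readonly string $statement,
        public readonly Tier $tier,
        public readonly string $rationale,
        public readonly string $owner,
        public readonly array $verifiedBy = [],
    ) {
    }
}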


  3. Attach the Rule to the Code
final class RefundService
{
    #[Rule(BillingRules::RefundLimit)]
    public function refund(Order $order): void
    {
        if ($order->amount > 500) {
            throw new ManualApprovalRequired();
        }
    }
}

Zero runtime cost.
Maximum semantic weight.
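
The #[Rule] attribute itself is just as small. A possible sketch (assuming PHP 8.1+, with the attribute accepting any backed enum case such as BillingRules::RefundLimit):

#[\Attribute(\Attribute::TARGET_METHOD | \Attribute::TARGET_CLASS)]
final class Rule
{
    public function __construct(
        public readonly \BackedEnum $rule,
    ) {
    }
}

Tooling (a CI script, a custom static-analysis rule, or the context you hand to an LLM) can read these attributes via reflection and surface the statement, rationale, and owner right next to the code they protect.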


  4. The Test Is the Concrete Wall
final class RefundLimitTest extends TestCase
{
    public function testRefundAboveLimitRequiresManualApproval(): void
    {
        $this->expectException(ManualApprovalRequired::class);

        // If an LLM (or human) removes the check,
        // this test fails. The wall holds.
        (new RefundService())->refund(Order::fake(amount: 600));
    }
}

Remove the test → the wall disappears.
Break the rule → CI fails.

This is enforced memory.
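
You can take this one step further and make the registry police itself. A hedged sketch, assuming the definitions above are collected behind a helper like BillingRules::definitions() (a method invented for this example):

final class RuleRegistryTest extends TestCase
{
    public function testEveryCriticalRuleHasAnExecutableProof(): void
    {
        // BillingRules::definitions() is assumed to return the RuleDefinition
        // array shown earlier, keyed by rule name.
        foreach (BillingRules::definitions() as $name => $definition) {
            if ($definition->tier !== Tier::Critical) {
                continue;
            }

            $this->assertNotEmpty(
                $definition->verifiedBy,
                "Critical rule {$name} has no verifying test."
            );

            foreach ($definition->verifiedBy as $testClass) {
                $this->assertTrue(
                    class_exists($testClass),
                    "Verifying test {$testClass} for {$name} does not exist."
                );
            }
        }
    }
}

Now a rule cannot silently lose its wall: delete RefundLimitTest and this test fails instead.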


Why We’re Now Talking About AGENTS.md, Prompts, and Skills

Here’s the twist that makes everything above unavoidable.

We are suddenly adding:
  • AGENTS.md
  • role descriptions
  • skill boundaries
  • “this agent may / may not” rules

Not because LLMs changed.

But because we trained them without our past.

LLMs were trained on:

  • final snapshots
  • cleaned-up repositories
  • best practices
  • success stories

They were not trained on:

  • git history
  • post-mortems
  • reverted ideas
  • hacks that hold entire systems together

In other words:

Happy path in. Happy path out.

Agent docs and skill descriptions are not AI features.
They are manual history injection.

They answer questions the model can never infer:

  • Where does optimization stop?
  • Which invariants override cleanup?
  • When must the agent refuse?

This is not babysitting.

This is us finally writing down what humans “just knew”.
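
In practice, that can be as simple as a few explicit sentences. A hypothetical AGENTS.md excerpt (the format is free-form; these entries are invented for illustration):

# AGENTS.md (excerpt)

## Billing

- You may refactor naming, formatting, and test structure freely.
- You may NOT change refund limits, rounding rules, or timeout floors.
  They are tied to contracts and audits; see the BillingRules definitions.
- If a change would remove or weaken a #[Rule]-annotated check: stop and ask a human.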


The Takeaway

  • LLMs are not magical; they are pattern matchers
  • Transformers are conceptually small
  • Scale amplifies mistakes
  • A missing “no” is interpreted as “yes”

Or, more bluntly:

We didn’t create a monster.
We optimized the crayons and forgot the wall.


Call to Action

Pick one critical service today and ask:

  • What must never change here?
  • Why does this ugly code exist?
  • Where is that knowledge written down?

If the answer is “in people’s heads”,
your wall is imaginary.

Start drawing it.


The wall was always there.
We just never wrote it down.
