Why fluent AI-generated technical content can still be fundamentally incorrect, and how to fix it with system design.
Introduction
At Devoxx, I presented a simple experiment:
What happens if you ask an LLM to generate an entire technical presentation on General Relativity?
The model produces something impressive:
- well-structured slides
- correct terminology
- equations
- citations
- a coherent narrative
It looks like something you could present. And yet, parts of it are fundamentally wrong. Not obviously wrong, but convincingly wrong.
This is the real problem with AI-generated technical content.
The Problem: Fluent ≠ Correct
Large Language Models are extremely good at:
- structure
- storytelling
- pedagogy
But they are not built to preserve:
- physical constraints
- invariants
- measurement consistency
In physics, that becomes "obvious" very quickly.
Example:
A model might say:
"Light slows down in gravity, so time slows down."
This sounds reasonable. But it's wrong, or at best, deeply misleading, because:
- locally, the speed of light is always c; it never slows down
- time dilation is defined through clock comparisons, not metaphors
This is what I call:
Frame confusion
The model mixes:
- different observers
- different measurement definitions
- intuitive metaphors
...into a single explanation.
Everything reads smoothly. But the reasoning is broken.
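For contrast, a statement that survives the measurement test is the standard relation (a textbook result, not something specific to this project) between the proper time τ of a clock held static at radius r outside a non-rotating mass M and the time t shown by a clock far away:

```latex
\frac{d\tau}{dt} = \sqrt{1 - \frac{2GM}{r c^{2}}}
```

Both sides are clock readings an observer could actually compare, and the locally measured speed of light stays c throughout. That is the level of precision the rest of the pipeline tries to enforce.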
Why Physics is a Perfect Stress Test
General Relativity is unforgiving.
You can't get away with:
- vague explanations
- metaphor-only reasoning
- mixing frames of reference
Every statement must answer:
"How would you measure that?"
If you can't answer that, the explanation is incomplete, or wrong. This makes physics an ideal domain to expose LLM weaknesses.
From Prompting to System Design
Instead of trying to "prompt better", I built a system around the model.
The goal:
Not to make the model smarter, but to make the output auditable and correctable.
Architecture Overview
The system is a multi-agent pipeline:
Sources → Chunking
→ Retrieval (RAG)
→ Author Agent (generate slides)
→ Schema Validation
→ Post-processing
→ Physics Rule Engine
→ Critic Agent
→ Refinement Loop
→ PowerPoint Rendering
Key Components
1. Structured Generation
The model doesn't output free text. It must generate strict JSON:
- slide types
- bullet constraints
- equations
- citations
Validated with Pydantic. If it doesn't parse, it doesn't ship.
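As a rough sketch (the field names here are illustrative, not the repo's actual schema), the contract can be a small set of Pydantic models that the pipeline refuses to pass downstream unless they validate:

```python
from typing import Literal
from pydantic import BaseModel, Field, ValidationError

class Slide(BaseModel):
    slide_type: Literal["title", "content", "equation", "summary"]
    title: str = Field(min_length=1, max_length=80)
    bullets: list[str] = Field(default_factory=list, max_length=6)  # bullet constraints
    equations: list[str] = Field(default_factory=list)              # LaTeX strings
    citations: list[str] = Field(default_factory=list)              # retrieved chunk ids

class Deck(BaseModel):
    slides: list[Slide] = Field(min_length=1)

def parse_deck(raw_json: str) -> Deck | None:
    """Return a validated Deck, or None so the caller can re-prompt the model."""
    try:
        return Deck.model_validate_json(raw_json)
    except ValidationError:
        return None
```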
2. Deterministic Validation (The "Physics Linter")
I implemented rule-based checks like:
- Time dilation must reference clocks or measurements
- Gravitational waves must reference strain or detectors
- No "black holes suck everything in" explanations
- Distinguish event horizon vs singularity
These rules catch systematic failure patterns instantly.
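A minimal version of this "physics linter" is nothing more than deterministic pattern checks over the validated slides. The rule wording below is a sketch; the real rules live in the repo:

```python
import re
from dataclasses import dataclass

@dataclass
class RuleViolation:
    slide_title: str
    message: str

# (trigger pattern, pattern that must also appear, message if it does not)
REQUIRED_CONTEXT = [
    (r"time dilation", r"clock|measure",
     "Time dilation mentioned without clocks or measurements"),
    (r"gravitational wave", r"strain|detector|interferometer",
     "Gravitational waves mentioned without strain or detectors"),
    (r"black hole", r"event horizon|escape velocity|geodesic",
     "Black hole described without horizon or escape-velocity language"),
]

# Phrasings that are rejected outright
FORBIDDEN = [
    (r"sucks? .{0,20}(everything|matter) in",
     "'Black holes suck everything in' phrasing"),
]

def lint_slide(title: str, text: str) -> list[RuleViolation]:
    """Apply the deterministic physics rules to one slide's text."""
    violations = []
    lowered = text.lower()
    for trigger, required, message in REQUIRED_CONTEXT:
        if re.search(trigger, lowered) and not re.search(required, lowered):
            violations.append(RuleViolation(title, message))
    for pattern, message in FORBIDDEN:
        if re.search(pattern, lowered):
            violations.append(RuleViolation(title, message))
    return violations
```

Because the checks are deterministic, the same draft always produces the same violations, which is what makes the failure rate measurable later.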
3. Critic Agent
A second LLM reviews the output:
- checks clarity
- checks reasoning
- suggests corrections
Importantly, it runs after deterministic validation, so it only reviews output that has already passed the rule-based checks.
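A sketch of that step, with `call_llm` standing in for whatever model client you actually use (a placeholder, not a function from the repo):

```python
import json

CRITIC_PROMPT = """You are reviewing slides about General Relativity.
For each slide, check:
  1. Is the reasoning frame-consistent (no mixed observers)?
  2. Does every claim say how it would be measured?
  3. Do the citations actually support the claim?
Return a JSON list: [{"slide": <index>, "issue": "...", "suggestion": "..."}]
Return [] if there is nothing to fix."""

def critique_deck(deck_json: str, call_llm) -> list[dict]:
    """Ask a second model to review a deck that already passed the rule engine.

    `call_llm` is a placeholder: any callable that takes a prompt string
    and returns the model's text response.
    """
    response = call_llm(f"{CRITIC_PROMPT}\n\nSlides:\n{deck_json}")
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        return []  # unparseable critique -> treat as no actionable feedback
```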
4. Refinement Loop
Generate → Validate → Critique → Revise
This loop runs until:
- errors are reduced
- or a maximum number of iterations is reached
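Wired together, the loop is plain control flow around the pieces sketched above (same hypothetical helper names, with `call_llm` again as a stand-in for the model client):

```python
def refine_deck(context: str, call_llm, max_iterations: int = 3) -> Deck | None:
    """Generate, validate, lint, critique, revise - until clean or out of budget."""
    deck_json = call_llm(f"Write the slide deck as JSON.\n\nContext:\n{context}")

    for _ in range(max_iterations):
        deck = parse_deck(deck_json)                      # schema validation
        if deck is None:
            feedback = "Output did not match the slide JSON schema."
        else:
            violations = [v for s in deck.slides
                          for v in lint_slide(s.title, " ".join(s.bullets))]
            critiques = critique_deck(deck_json, call_llm) if not violations else []
            if not violations and not critiques:
                return deck                               # nothing left to fix
            feedback = "\n".join(
                [v.message for v in violations]
                + [c.get("suggestion", "") for c in critiques])

        deck_json = call_llm(
            f"Revise the deck. Fix these issues:\n{feedback}\n\nCurrent deck:\n{deck_json}")

    return parse_deck(deck_json)                          # best effort after the cap
```

The important property is not that the loop converges (it often doesn't, fully), but that every iteration leaves a machine-readable record of what was wrong.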
Results
From a real run:
- Draft deck → 6 failing slides
- After refinement → 4 failing slides
We didn't achieve perfection. But we achieved something more important:
We made correctness measurable.
What Still Fails
Even with this pipeline:
- Citations may look correct but not truly support claims
- Subtle reasoning errors remain
- Frame confusion is hard to eliminate
- Models can satisfy rules while staying vague
Human review is still necessary.
Key Insight
Reliable AI is not a prompting problem.
It's a system design problem.
What You Can Reuse
These patterns generalize beyond physics:
- legal documents
- financial reports
- medical summaries
- architecture decisions
Use:
- structured outputs
- deterministic validation
- domain-specific rules
- critique loops
The project
GitHub Repo: https://github.com/tase-nikol/gr-deck-agent
Example commands:
gr-deck-agent index
gr-deck-agent draft
gr-deck-agent review
gr-deck-agent refine
gr-deck-agent replay
Devoxx Talk
You can watch the full talk here:
YouTube: https://www.youtube.com/watch?v=NanGs7ZMQEE
Final Thought
LLMs don't "understand" systems.
They generate plausible descriptions of them.
If your domain has constraints, invariants, or correctness requirements:
You need to build those constraints into the system, not hope the model learns them.
Conclusion
If you're working with AI-generated technical content, I'd love to hear:
- what failure modes you've seen
- how you validate outputs
- what worked (or didn't)
AI is fluent.
But reality is not optional.