Why fluent AI-generated technical content can still be fundamentally incorrect, and how to fix it with system design.
Introduction
At Devoxx, I presented a simple experiment:
What happens if you ask an LLM to generate an entire technical presentation on General Relativity?
The model produces something impressive:
- well-structured slides
- correct terminology
- equations
- citations
- a coherent narrative
It looks like something you could present. And yet, parts of it are fundamentally wrong. Not obviously wrong, but convincingly wrong.
This is the real problem with AI-generated technical content.
The Problem: Fluent ≠ Correct
Large Language Models are extremely good at:
- structure
- storytelling
- pedagogy
But they are not built to preserve:
- physical constraints
- invariants
- measurement consistency
In physics, that becomes "obvious" very quickly.
Example:
A model might say:
"Light slows down in gravity, so time slows down."
This sounds reasonable. But it's wrong, or at best, deeply misleading, because:
- locally, the speed of light is always c; it never slows down
- time dilation is defined through clock comparisons, not metaphors
This is what I call:
Frame confusion
The model mixes:
- different observers
- different measurement definitions
- intuitive metaphors
...into a single explanation.
Everything reads smoothly. But the reasoning is broken.
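For contrast, a statement that survives the measurement test is the standard relation (a textbook result, not something specific to this project) between the proper time τ of a clock held static at radius r outside a non-rotating mass M and the time t shown by a clock far away:

```latex
\frac{d\tau}{dt} = \sqrt{1 - \frac{2GM}{r c^{2}}}
```

Both sides are clock readings an observer could actually compare, and the locally measured speed of light stays c throughout. That is the level of precision the rest of the pipeline tries to enforce.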
Why Physics is a Perfect Stress Test
General Relativity is unforgiving.
You can't get away with:
- vague explanations
- metaphor-only reasoning
- mixing frames of reference
Every statement must answer:
"How would you measure that?"
If you can't answer that, the explanation is incomplete, or wrong. This makes physics an ideal domain to expose LLM weaknesses.
From Prompting to System Design
Instead of trying to "prompt better", I built a system around the model.
The goal:
Not to make the model smarter, but to make the output auditable and correctable.
Architecture Overview
The system is a multi-agent pipeline:
Sources → Chunking
→ Retrieval (RAG)
→ Author Agent (generate slides)
→ Schema Validation
→ Post-processing
→ Physics Rule Engine
→ Critic Agent
→ Refinement Loop
→ PowerPoint Rendering
Key Components
1. Structured Generation
The model doesn't output free text. It must generate strict JSON:
- slide types
- bullet constraints
- equations
- citations
Validated with Pydantic. If it doesn't parse, it doesn't ship.
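As a rough sketch (the field names here are illustrative, not the repo's actual schema), the contract can be a small set of Pydantic models that the pipeline refuses to pass downstream unless they validate:

```python
from typing import Literal
from pydantic import BaseModel, Field, ValidationError

class Slide(BaseModel):
    slide_type: Literal["title", "content", "equation", "summary"]
    title: str = Field(min_length=1, max_length=80)
    bullets: list[str] = Field(default_factory=list, max_length=6)  # bullet constraints
    equations: list[str] = Field(default_factory=list)              # LaTeX strings
    citations: list[str] = Field(default_factory=list)              # retrieved chunk ids

class Deck(BaseModel):
    slides: list[Slide] = Field(min_length=1)

def parse_deck(raw_json: str) -> Deck | None:
    """Return a validated Deck, or None so the caller can re-prompt the model."""
    try:
        return Deck.model_validate_json(raw_json)
    except ValidationError:
        return None
```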
2. Deterministic Validation (The "Physics Linter")
I implemented rule-based checks like:
- Time dilation must reference clocks or measurements
- Gravitational waves must reference strain or detectors
- No "black holes suck everything in" explanations
- Distinguish event horizon vs singularity
These rules catch systematic failure patterns instantly.
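A minimal version of this "physics linter" is nothing more than deterministic pattern checks over the validated slides. The rule wording below is a sketch; the real rules live in the repo:

```python
import re
from dataclasses import dataclass

@dataclass
class RuleViolation:
    slide_title: str
    message: str

# (trigger pattern, pattern that must also appear, message if it does not)
REQUIRED_CONTEXT = [
    (r"time dilation", r"clock|measure",
     "Time dilation mentioned without clocks or measurements"),
    (r"gravitational wave", r"strain|detector|interferometer",
     "Gravitational waves mentioned without strain or detectors"),
    (r"black hole", r"event horizon|escape velocity|geodesic",
     "Black hole described without horizon or escape-velocity language"),
]

# Phrasings that are rejected outright
FORBIDDEN = [
    (r"sucks? .{0,20}(everything|matter) in",
     "'Black holes suck everything in' phrasing"),
]

def lint_slide(title: str, text: str) -> list[RuleViolation]:
    """Apply the deterministic physics rules to one slide's text."""
    violations = []
    lowered = text.lower()
    for trigger, required, message in REQUIRED_CONTEXT:
        if re.search(trigger, lowered) and not re.search(required, lowered):
            violations.append(RuleViolation(title, message))
    for pattern, message in FORBIDDEN:
        if re.search(pattern, lowered):
            violations.append(RuleViolation(title, message))
    return violations
```

Because the checks are deterministic, the same draft always produces the same violations, which is what makes the failure rate measurable later.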
3. Critic Agent
A second LLM reviews the output:
- checks clarity
- checks reasoning
- suggests corrections
Importantly, it runs after deterministic validation, so it only reviews output that has already passed the rule-based checks.
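A sketch of that step, with `call_llm` standing in for whatever model client you actually use (a placeholder, not a function from the repo):

```python
import json

CRITIC_PROMPT = """You are reviewing slides about General Relativity.
For each slide, check:
  1. Is the reasoning frame-consistent (no mixed observers)?
  2. Does every claim say how it would be measured?
  3. Do the citations actually support the claim?
Return a JSON list: [{"slide": <index>, "issue": "...", "suggestion": "..."}]
Return [] if there is nothing to fix."""

def critique_deck(deck_json: str, call_llm) -> list[dict]:
    """Ask a second model to review a deck that already passed the rule engine.

    `call_llm` is a placeholder: any callable that takes a prompt string
    and returns the model's text response.
    """
    response = call_llm(f"{CRITIC_PROMPT}\n\nSlides:\n{deck_json}")
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        return []  # unparseable critique -> treat as no actionable feedback
```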
4. Refinement Loop
Generate → Validate → Critique → Revise
This loop runs until:
- errors are reduced
- or a maximum number of iterations is reached
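Wired together, the loop is plain control flow around the pieces sketched above (same hypothetical helper names, with `call_llm` again as a stand-in for the model client):

```python
def refine_deck(context: str, call_llm, max_iterations: int = 3) -> Deck | None:
    """Generate, validate, lint, critique, revise - until clean or out of budget."""
    deck_json = call_llm(f"Write the slide deck as JSON.\n\nContext:\n{context}")

    for _ in range(max_iterations):
        deck = parse_deck(deck_json)                      # schema validation
        if deck is None:
            feedback = "Output did not match the slide JSON schema."
        else:
            violations = [v for s in deck.slides
                          for v in lint_slide(s.title, " ".join(s.bullets))]
            critiques = critique_deck(deck_json, call_llm) if not violations else []
            if not violations and not critiques:
                return deck                               # nothing left to fix
            feedback = "\n".join(
                [v.message for v in violations]
                + [c.get("suggestion", "") for c in critiques])

        deck_json = call_llm(
            f"Revise the deck. Fix these issues:\n{feedback}\n\nCurrent deck:\n{deck_json}")

    return parse_deck(deck_json)                          # best effort after the cap
```

The important property is not that the loop converges (it often doesn't, fully), but that every iteration leaves a machine-readable record of what was wrong.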
Results
From a real run:
- Draft deck → 6 failing slides
- After refinement → 4 failing slides
We didn't achieve perfection. But we achieved something more important:
We made correctness measurable.
What Still Fails
Even with this pipeline:
- Citations may look correct but not truly support claims
- Subtle reasoning errors remain
- Frame confusion is hard to eliminate
- Models can satisfy rules while staying vague
Human review is still necessary.
Key Insight
Reliable AI is not a prompting problem.
It's a system design problem.
What You Can Reuse
These patterns generalize beyond physics:
- legal documents
- financial reports
- medical summaries
- architecture decisions
Use:
- structured outputs
- deterministic validation
- domain-specific rules
- critique loops
The project
GitHub Repo: https://github.com/tase-nikol/gr-deck-agent
Example commands:
gr-deck-agent index
gr-deck-agent draft
gr-deck-agent review
gr-deck-agent refine
gr-deck-agent replay
Devoxx Talk
You can watch the full talk here:
YouTube: https://www.youtube.com/watch?v=NanGs7ZMQEE
Final Thought
LLMs don't "understand" systems.
They generate plausible descriptions of them.
If your domain has constraints, invariants, or correctness requirements:
You need to build those constraints into the system, not hope the model learns them.
Conclusion
If you're working with AI-generated technical content, I'd love to hear:
- what failure modes you've seen
- how you validate outputs
- what worked (or didn't)
AI is fluent.
But reality is not optional.