Part of From Software Engineer to GenAI Engineer: A Practical Series for 2026
Large language models are often introduced as something fundamentally new.
A breakthrough.
A leap.
A category shift.
From a systems perspective, they’re something more familiar.
They’re probabilistic components with clear constraints, predictable failure modes, and operational costs. Once you see them that way, much of the confusion around GenAI disappears.
Determinism is the first thing you lose
Traditional software systems are deterministic.
Given the same input, you expect the same output. When that doesn’t happen, something is wrong.
LLMs break this assumption by design.
Even with the same prompt, the same model, and the same data, outputs can vary. This is not a bug. It’s a property of how these models generate text.
For engineers, this means correctness can no longer be defined as equality. It has to be defined in terms of acceptability, bounds, and constraints.
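To make that concrete, here is a minimal sketch of what acceptability-style checks can look like. The bounds and the failure-mode check are invented for illustration; the point is asserting properties instead of exact strings.

```python
# A sketch of acceptability-based checks: assert properties of the output
# rather than comparing it to one exact "golden" string.

def is_acceptable(summary: str, max_chars: int = 500) -> bool:
    """Accept any output that stays within agreed bounds and constraints."""
    text = summary.strip()
    if not text:                      # must produce something
        return False
    if len(text) > max_chars:         # must respect a length bound
        return False
    if "as an ai" in text.lower():    # crude check for a known failure mode
        return False
    return True

# Instead of `assert output == expected`, assert `is_acceptable(output)`
# for every output the model produces, including retries.
print(is_acceptable("The report covers Q3 revenue and churn."))  # True
print(is_acceptable(""))                                         # False
```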
Tokens are the real interface
LLMs don’t operate on text. They operate on tokens.
From a systems point of view, tokens behave more like memory than strings:
- Context is finite
- Cost scales with token count
- Latency grows as context grows
- Truncation happens silently
Once context becomes a constrained resource, prompt design stops being about wording and starts being about resource management.
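A rough sketch of what that budgeting looks like, assuming the tiktoken library for counting. The window size and per-token price below are made-up numbers, not real limits or real pricing.

```python
# Treating context as a budgeted resource, using tiktoken (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

CONTEXT_BUDGET = 8_000        # hypothetical window size in tokens
COST_PER_1K_TOKENS = 0.001    # hypothetical input price

def token_count(text: str) -> int:
    return len(enc.encode(text))

def fits_budget(system_prompt: str, user_input: str, reserved_for_output: int = 1_000) -> bool:
    """Check the request against the window before sending it, not after it fails."""
    used = token_count(system_prompt) + token_count(user_input)
    return used + reserved_for_output <= CONTEXT_BUDGET

prompt = "You are a support assistant."
question = "Why was my invoice higher this month?"
used = token_count(prompt) + token_count(question)
print(f"{used} tokens, ~${used / 1000 * COST_PER_1K_TOKENS:.5f}, fits: {fits_budget(prompt, question)}")
```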
Why hallucinations happen
Hallucinations aren’t random.
An LLM generates the most likely continuation of a sequence based on its training. When it lacks information, it doesn’t stop. It fills the gap with something statistically plausible.
This is expected behavior for a component optimized for fluency, not truth.
That’s why:
- Asking the model to “be accurate” doesn’t work
- Confidence is not a signal of correctness
- Grounding and validation must live outside the model
Hallucinations aren’t fixed by better prompts. They’re constrained by system design.
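Here is a deliberately crude sketch of what "validation outside the model" can mean: an answer is only accepted if it overlaps with retrieved source text. The word-overlap heuristic is purely illustrative, not a recommended grounding check.

```python
# Validation living outside the model: reject answers that can't be
# grounded in the retrieved sources, no matter how confident they sound.

def grounded(answer: str, sources: list[str], min_overlap: float = 0.5) -> bool:
    """Reject answers whose content words don't appear in any source."""
    answer_words = {w.lower().strip(".,") for w in answer.split() if len(w) > 4}
    if not answer_words:
        return False
    source_words = {w.lower().strip(".,") for s in sources for w in s.split()}
    overlap = len(answer_words & source_words) / len(answer_words)
    return overlap >= min_overlap

sources = ["The refund policy allows returns within 30 days of purchase."]
print(grounded("Returns are accepted within 30 days of purchase.", sources))  # True
print(grounded("Refunds are available for up to five years.", sources))       # False
```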
Temperature is not creativity
Temperature is often described as a creativity dial. That framing is misleading.
Temperature rescales the model’s next-token probability distribution. Lower temperatures concentrate probability on the most likely tokens and reduce variance. Higher temperatures flatten the distribution and increase it.
In production systems, temperature is a reliability control. Higher variance increases risk. Lower variance increases repeatability.
Treating temperature as an aesthetic choice instead of a systems lever is a common early mistake.
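A toy example makes the mechanism visible. The logits are made up; the temperature scaling and softmax are the real mechanics.

```python
# What temperature actually does: rescale logits before softmax.
# Lower values sharpen the distribution (less variance); higher values flatten it.
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate next tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
# At t=0.2 nearly all probability sits on the top token; at t=2.0 the choice
# is much closer to a coin flip. That spread is the variance, and the risk.
```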
Context windows define architecture
Context window size isn’t just a model feature. It’s an architectural constraint.
It determines:
- How much information the model can reason over at once
- Whether retrieval is required
- How often summarization happens
- How state is carried forward
Once the context window is exceeded, the system doesn’t fail loudly. It degrades quietly.
Good architectures are designed around this limit, not surprised by it.
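One sketch of designing around the limit: trim history explicitly and surface the fact that trimming happened, instead of letting it happen silently. The word-count tokenizer is a stand-in to keep the example self-contained.

```python
# Degrade loudly, not quietly: the caller is told when older turns are dropped.

def approx_tokens(text: str) -> int:
    return len(text.split())  # stand-in for a real tokenizer

def build_context(history: list[str], budget: int) -> tuple[list[str], bool]:
    """Keep the most recent turns that fit; report whether anything was dropped."""
    kept, used = [], 0
    for turn in reversed(history):
        cost = approx_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()
    truncated = len(kept) < len(history)
    return kept, truncated

history = ["user: hi", "assistant: hello, how can I help?", "user: summarize my last three invoices"]
context, truncated = build_context(history, budget=10)
if truncated:
    print("warning: context truncated, older turns dropped")
print(context)
```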
Why prompt-only systems hit a ceiling
Prompt engineering works well early on because it’s cheap and flexible.
It stops working when:
- Prompts grow uncontrollably
- Behavior becomes brittle
- Changes introduce side effects
- Multiple use cases collide
At that point, prompts are no longer instructions. They’re configuration.
And like any configuration, they need versioning, validation, and isolation.
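A minimal sketch of a prompt treated as configuration, with a version, required fields, and a validation step. The names and template are invented.

```python
# Prompts as configuration: versioned, validated, and isolated per use case.
from dataclasses import dataclass
from string import Formatter

@dataclass(frozen=True)
class PromptConfig:
    name: str
    version: str
    template: str
    required_fields: tuple[str, ...]

    def validate(self) -> None:
        """Fail fast if the template and its declared fields drift apart."""
        found = {f for _, f, _, _ in Formatter().parse(self.template) if f}
        missing = set(self.required_fields) - found
        if missing:
            raise ValueError(f"{self.name} v{self.version} missing fields: {missing}")

    def render(self, **kwargs: str) -> str:
        return self.template.format(**kwargs)

summarize_v2 = PromptConfig(
    name="summarize_ticket",
    version="2.1.0",
    template="Summarize the support ticket below in under {max_words} words:\n{ticket}",
    required_fields=("ticket", "max_words"),
)
summarize_v2.validate()
print(summarize_v2.render(ticket="Customer cannot log in since Monday.", max_words="40"))
```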
A useful mental model
A practical way to think about an LLM is this:
An LLM is a non-deterministic function that:
- Accepts a bounded context
- Produces a probabilistic output
- Optimizes for likelihood, not correctness
- Incurs cost and latency proportional to input size
Once framed this way, LLMs stop feeling mysterious. They become components with tradeoffs that can be reasoned about.
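Written down as a contract, that mental model looks roughly like this. The names and sizes are illustrative, not any particular provider’s API.

```python
# The mental model as a contract: bounded input, probabilistic output, explicit cost.
from dataclasses import dataclass

@dataclass
class LLMResult:
    text: str            # probabilistic output: may differ across calls
    input_tokens: int    # cost and latency scale with these
    output_tokens: int

class LLMClient:
    max_context_tokens: int = 8_000  # bounded context, hypothetical size

    def complete(self, prompt: str, temperature: float = 0.0) -> LLMResult:
        """Non-deterministic: the same prompt may return different results."""
        raise NotImplementedError  # wire up your provider of choice here
```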
What this changes downstream
When LLMs are treated as system components:
- Raw output is no longer trusted
- Validation layers become necessary
- Retries and fallbacks are expected
- Critical logic moves outside the model
This is where GenAI engineering starts to resemble backend engineering again.
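A sketch of that shape, with hypothetical call_model and is_acceptable placeholders standing in for a real client and real checks.

```python
# Validate output, retry a bounded number of times, fall back deterministically.

def generate_with_guardrails(prompt: str, call_model, is_acceptable, max_retries: int = 2) -> str:
    for attempt in range(max_retries + 1):
        output = call_model(prompt)
        if is_acceptable(output):
            return output
    # Critical logic stays outside the model: a fixed, safe fallback.
    return "Sorry, I couldn't generate a reliable answer. A human will follow up."

# Usage with stand-ins, just to show the control flow:
flaky = iter(["", "", "Your refund was processed on 12 March."])
result = generate_with_guardrails(
    prompt="When was my refund processed?",
    call_model=lambda p: next(flaky),
    is_acceptable=lambda out: bool(out.strip()),
)
print(result)  # the third attempt passes validation
```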
The next post looks at why prompt engineering alone doesn’t scale, and why it’s more useful to treat prompts as configuration than as a skillset.