Powerful LLMs Are Not the Problem — Using Them “Raw” Is

A systems-engineering view for builders

Large Language Models are no longer just tools for writing text or generating code.
They are increasingly used to advise, judge, and influence decisions — sometimes quietly, sometimes explicitly.

And that’s where a systems problem begins.

This post is not about which model is better, faster, or cheaper.
It’s about a more basic engineering question:

What is the correct system form of AI when it starts participating in decisions, not just producing output?

Many AI systems today are used “raw”

By raw, I don’t mean unsafe, unethical, or non-compliant.

I mean this:

We are embedding high-capability, non-deterministic reasoning systems directly into environments that require stable, repeatable, auditable decisions — without a real system-level control layer in between.

Prompt engineering, RAG, rules, and agent frameworks increase capability.
They do not, by themselves, guarantee decision stability.

For low-stakes tasks, this distinction barely matters.
For real systems, it matters a lot.
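To make the distinction concrete, here is a minimal sketch of what a control layer can look like: the LLM only proposes, and a deterministic policy owned by the system decides what actually executes. All names here (Proposal, Policy, handle) are illustrative, not part of any real framework.

```python
# Minimal sketch of a control layer between an LLM and the environment.
# The shape is what matters: the model proposes, a deterministic layer disposes.

from dataclasses import dataclass

@dataclass(frozen=True)
class Proposal:
    action: str          # e.g. "send_refund", as suggested by the LLM
    amount: float

class Policy:
    """Deterministic gate: no sampling, no generation, just rules over state."""

    def __init__(self, refund_limit: float):
        self.refund_limit = refund_limit

    def allows(self, proposal: Proposal) -> bool:
        if proposal.action == "send_refund":
            return proposal.amount <= self.refund_limit
        return False  # default deny

def handle(llm_output: Proposal, policy: Policy) -> str:
    # The LLM's suggestion never reaches the environment directly.
    if policy.allows(llm_output):
        return f"EXECUTE {llm_output.action}"
    return "REJECTED: escalate to a human"
```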

LLMs behave more like engines than finished systems

From a systems perspective, LLMs look less like complete products and more like extremely powerful engines.

They offer:

strong generalization

flexible reasoning paths

impressive expressive power

But they do not inherently manage:

stability

permissions

responsibility

long-term state consistency

In classical computing terms:

LLM ≈ CPU

Prompt ≈ instruction stream

Which naturally raises the real question:

Where is the operating system?

The real risk isn’t hallucinations

Hallucinations get most of the attention, but they’re not the core issue.

The deeper risks are structural.

Non-repeatability

The same inputs, under nearly identical conditions, can produce different conclusions.
In content generation, this is creativity.
In decision systems, it’s loss of control.

Illusion of control

LLMs can convincingly explain almost any result.
But in engineering, sounding reasonable does not equal being governed.

Poor debuggability

When decisions matter, we need to answer:

What triggered this decision?

Which path was taken?

Would it happen again?

If we can’t, the system isn’t production-grade.

The paradox: LLMs aren’t too weak — they’re too free

This is the counterintuitive part.

The problem isn’t intelligence.
It’s high capability without structural governance.

Powerful components without system-level constraints inevitably lead to:

behavior drift

accumulated risk

unclear accountability

This is not an AI problem.
It’s a systems engineering problem.

Why “AI operating systems” keep coming up

We’ve seen this pattern before.

CPUs alone were never enough:

no scheduling → chaos

no isolation → insecurity

no state management → instability

Operating systems didn’t weaken CPUs.
They made them usable at scale.

For AI, the equivalent challenge is not computation — it’s decision rights.

Decision models are not ML models

When we talk about decision models here, we don’t mean another trained model.

We mean a system layer that:

does not predict

does not generate

does not optimize creatively

It answers one question only:

Is this decision allowed under the current system state?

The requirement is simple, but rare in practice:

Same conditions → same decision.
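A minimal sketch of what such a layer can look like, assuming nothing beyond plain Python: the decision model is a pure function over explicit system state, so identical inputs cannot produce different verdicts. The field names are illustrative only.

```python
# Sketch of a decision model in the sense above: not a trained model, just a
# deterministic function over explicit system state.

from dataclasses import dataclass

@dataclass(frozen=True)
class SystemState:
    maintenance_mode: bool
    risk_budget_used: float      # 0.0 .. 1.0
    actor_role: str              # e.g. "operator", "service"

@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str

def decide(state: SystemState, action: str) -> Decision:
    """Pure function: no randomness, no I/O, no hidden memory.
    The same (state, action) always yields the same Decision."""
    if state.maintenance_mode:
        return Decision(False, "system in maintenance mode")
    if action == "place_order" and state.risk_budget_used >= 1.0:
        return Decision(False, "risk budget exhausted")
    if action == "place_order" and state.actor_role != "operator":
        return Decision(False, "actor not permitted to place orders")
    return Decision(True, "all constraints satisfied")
```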

Companion models need a hard boundary

Long-lived systems (AI phones, robots, vehicles) need continuity — preferences, habits, context.

This motivates the idea of companion models.

But a strict rule is required:

Companion models may provide state — never authority.

Once long-term preference gains decision power, control erodes.
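Continuing the sketch above, with hypothetical names: companion state may shape how a result is delivered, but it is never something the decision layer branches on for permission.

```python
# Sketch of the "state, never authority" boundary.
# CompanionState is a hypothetical read-only snapshot of long-term preferences.

from dataclasses import dataclass

@dataclass(frozen=True)
class CompanionState:
    preferred_language: str
    typical_wake_time: str

def personalize(response: str, companion: CompanionState) -> str:
    # Companion data may shape *how* something is said or scheduled...
    return f"[{companion.preferred_language}] {response}"

# ...but it never carries veto or approval power. Only a decide(state, action)
# function like the one sketched earlier can say yes or no; companion state
# is consulted after the verdict, not instead of it.
```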

Closing: this is a systems problem, not a model race

The next phase of AI isn’t about making models smarter.

It’s about making systems:

controllable

repeatable

auditable

trustworthy over time

Intelligence without a decision kernel doesn’t scale reliability — it scales risk.

Author note
This post distills ongoing work on decision stability and system boundaries, framed under an experimental architecture often referred to as EDCA (Expression-Driven Cognitive Architecture).
The focus is on structural questions, not implementation details.

AI Decision Systems · Core Q&A (v1.0)
Q1: Where is AI fundamentally stronger than traditional industry software?

A:
Not in speed or accuracy, but in its ability to operate under incomplete, ambiguous, and unstructured conditions.

Traditional industry software excels when:

rules are explicit

boundaries are clear

conditions are enumerable

LLM-based AI becomes powerful when:

information is incomplete

requirements are vaguely expressed

real-world variables constantly change

However, this is a capability advantage, not an engineering maturity advantage.

Q2: You argue that “constraining LLMs” improves safety and reliability. Doesn’t that weaken their power?

A:
No. It doesn’t weaken capability — it makes capability deployable.

Unconstrained LLMs:

appear powerful

but behave inconsistently

and cannot be reliably audited

System-governed LLMs:

retain their intelligence

but only act under permitted conditions

with decisions that can be traced, frozen, and reviewed

In engineering, capability without control has no production value.
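As an illustration of “traced, frozen, and reviewed”, a rough sketch (the field names are assumptions, not a real schema): every verdict is appended to a log together with a hash of exactly the input it was based on, so it can be replayed and audited later.

```python
# Sketch of an append-only decision trail: reviews read the log, never rewrite it.

import hashlib
import json
from datetime import datetime, timezone

def record_decision(log: list, state: dict, action: str, allowed: bool) -> dict:
    # Freeze the exact input the verdict was based on, then hash it.
    frozen_input = json.dumps({"state": state, "action": action}, sort_keys=True)
    entry = {
        "at": datetime.now(timezone.utc).isoformat(),
        "input_hash": hashlib.sha256(frozen_input.encode()).hexdigest(),
        "action": action,
        "allowed": allowed,
    }
    log.append(entry)
    return entry
```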

Q2 (Extended): You compare LLMs to powerful car engines. Does that imply most people are running LLMs “raw”? Why is that dangerous?

A:
Yes — that implication is intentional.

A high-performance engine:

without transmission, brakes, or stability control

becomes more dangerous as horsepower increases

LLMs behave similarly:

stronger reasoning

better articulation

larger impact radius when things go wrong

The danger is not that LLMs make mistakes,
but that their mistakes still sound convincing.

Q3: So like a PC needs Windows before the CPU is useful, AI needs an OS? Is that why you’re building EDCA OS?

A:
Yes — and this analogy is literal, not rhetorical.

A CPU does not manage:

task scheduling

permission isolation

state persistence

fault recovery

That’s the operating system’s role.

When AI participates in decisions, it needs similar structure:

who may decide

under what conditions

whether a decision is allowed

whether it can be reproduced

EDCA OS focuses on turning decisions into system behavior, not making AI “smarter.”

Q4: Why did you choose the GPT client as your runtime environment? Is this your own standard?

A:
This is not about preference. It’s about whether the runtime behaves like a system.

We prioritize:

session stability

built-in behavioral boundaries

consistent execution characteristics

At present, only a few LLM runtimes allow serious discussion of:

decision stability

repeatability

“same input → same outcome” validation

This is not a model benchmark — it’s a systems prerequisite.
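A sketch of what that validation can reduce to, assuming a deterministic decide(state, action) function like the one sketched earlier; the helper name is hypothetical. Any real decision layer should pass this trivially, while a raw sampled LLM call generally will not.

```python
# Replay the same frozen input many times and require identical verdicts.

def check_repeatability(decide, state, action, runs: int = 100) -> bool:
    first = decide(state, action)
    return all(decide(state, action) == first for _ in range(runs))
```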

Q5: What’s the real difference between traditional quantitative systems and AI-based quant systems? Where does AI quant fail?

A:
The difference is not predictive power — it’s decision trustworthiness.

Traditional quant systems:

fixed strategies

explicit paths

auditable and backtestable behavior

AI quant systems often suffer from:

decision drift

inconsistent behavior under identical conditions

weak auditability

The issue is not intelligence, but missing decision stability structure.

Q5 (Extended): Does this mean you aim for scikit-learn compatibility, or are you abandoning it?

A:
Neither. They operate at different layers.

scikit-learn handles training and prediction

EDCA-style decision models handle whether predictions are allowed to be acted upon

They are complementary, not competing.

You may use sklearn to generate signals —
but whether to trust and execute them belongs to the decision layer.
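As a rough sketch of that layering, using scikit-learn for the signal and plain Python for the gate; the gate's parameters (max_position, trading_halted) are illustrative, not a real API.

```python
# Prediction layer: scikit-learn produces a signal.
# Decision layer: a separate deterministic gate decides whether to act on it.

from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.array([[0.1, 1.2], [0.4, 0.9], [0.8, 0.3], [0.2, 1.1]])
y = np.array([0, 0, 1, 0])

model = LogisticRegression().fit(X, y)                    # prediction layer
signal = int(model.predict(np.array([[0.7, 0.4]]))[0])    # e.g. 1 = "buy"

def allowed_to_act(signal: int, position: float, max_position: float,
                   trading_halted: bool) -> bool:
    """Decision layer: deterministic rules over system state, not over the model."""
    if trading_halted:
        return False
    return signal == 1 and position < max_position

if allowed_to_act(signal, position=0.0, max_position=1.0, trading_halted=False):
    print("execute signal")
else:
    print("signal generated but not authorized")
```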

Q6: Why did you build CMRE? What were you trying to validate?

A:
CMRE is not about “building medical AI.”
It’s about testing decision boundaries in extreme risk environments.

Medical scenarios combine:

high risk

high responsibility

strong temptation to overstep

If a system can:

distinguish information from judgment

resist unauthorized decision-making

remain stable under pressure

then it will be safer in less critical domains.

Q7: What’s your breakthrough in LLM-based research assistants? Why do you disconnect online retrieval during testing?

A:
Because what harms research most is not ignorance —
it is false confidence.

Online retrieval often causes:

retrieval to be mistaken for reasoning

existing conclusions to masquerade as discovery

Disconnecting search forces the model to:

expose its reasoning structure

operate within known constraints

reveal gaps instead of hiding them behind citations

AI’s role in research is not to replace scientists,
but to surface blind spots and cognitive inertia.

Q7 (Extended): If data scarcity is no longer the bottleneck, what do you still rely on scientists for? Isn’t AI free of cognitive bias anyway?

A:
AI lacks cognitive inertia — but it also lacks research responsibility.

What scientists uniquely provide is not data volume, but:

which variables matter

which assumptions deserve challenge

which questions are worth asking

AI expands reasoning space.
Humans define research direction.
