We are past the "Hello World" phase of AI Agents.
The internet is flooded with tutorials on how to make an LLM call a weather API. But if you are building agents for enterprise production, you know that making the LLM "do things" is the easy part.
The hard part is preventing it from doing bad things, understanding what it learned, and rendering the output in a way that isn't just a wall of text.
I’ve been working on an architecture I call Context as a Service (CaaS). It decouples the "Brain" (LLM) from the "Infrastructure" (Tools/UI) using a set of deterministic layers.
Here is the architecture that turns a toy agent into a production system.
1. The Logic Firewall (Constraint Engineering)
The Problem: Prompt Engineering is fragile. Begging an LLM, "Please do not drop the database," is not a security strategy.
The Solution: A deterministic Constraint Engine that sits between the Agent and the Executor. The Agent generates a plan, but the Firewall approves it.
I implemented a ConstraintEngine class that intercepts plans before they touch the infrastructure. It uses regex and logic, not AI.
# constraint_engine.py
import re

class SQLInjectionRule(ConstraintRule):
    """Detects dangerous SQL operations deterministically."""
    DANGEROUS_PATTERNS = [
        r'\bDROP\s+TABLE\b',
        r'\bDELETE\s+FROM\b.*\bWHERE\s+1\s*=\s*1\b',
    ]

    def validate(self, plan):
        query = plan.get("query", "")
        for pattern in self.DANGEROUS_PATTERNS:
            if re.search(pattern, query, re.IGNORECASE):
                return ConstraintViolation(
                    severity=ViolationSeverity.CRITICAL,
                    message="Dangerous SQL detected. Execution Blocked."
                )
        return None  # No violation: the plan may proceed
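For context, here is a minimal, self-contained sketch of what the rule above plugs into: the base types and a ConstraintEngine that runs every rule over a plan before the Executor sees it. ConstraintEngine, ConstraintRule, ConstraintViolation, and ViolationSeverity are named in the post and the snippet above, but their bodies and the check() method here are my assumptions for illustration, not the exact implementation.
# constraint_engine.py (sketch of the supporting pieces; field layout and check() are assumptions)
from dataclasses import dataclass
from enum import Enum

class ViolationSeverity(Enum):
    WARNING = "warning"
    CRITICAL = "critical"

@dataclass
class ConstraintViolation:
    severity: ViolationSeverity
    message: str

class ConstraintRule:
    def validate(self, plan: dict):
        """Return a ConstraintViolation, or None if the plan is acceptable."""
        raise NotImplementedError

class ConstraintEngine:
    """Runs every deterministic rule against a plan before execution."""
    def __init__(self, rules):
        self.rules = rules

    def check(self, plan: dict):
        violations = []
        for rule in self.rules:
            violation = rule.validate(plan)
            if violation is not None:
                violations.append(violation)
        return violations

# Usage: the Firewall rejects any plan that triggers a CRITICAL violation
engine = ConstraintEngine(rules=[SQLInjectionRule()])
plan = {"query": "DROP TABLE users"}
if any(v.severity is ViolationSeverity.CRITICAL for v in engine.check(plan)):
    print("Plan rejected by the Logic Firewall")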
The key insight: The Human builds the walls; the AI plays inside them.
2. The Wisdom Curator (Human-in-the-Loop)
The Problem: You can't review 10,000 interactions a day. But you can't let the agent learn "bad habits" (like ignoring errors to seem successful) without oversight.
The Solution: The Wisdom Curator. Instead of reviewing every log, we review Strategic Samples and Policy Violations.
My WisdomCurator tracks "Wisdom Updates" (new lessons the agent wants to save). If the agent tries to save a lesson like "Ignore 500 errors," the Curator catches it using a keyword blacklist.
# wisdom_curator.py
from enum import Enum

class PolicyViolationType(Enum):
    HARMFUL_BEHAVIOR = "harmful_behavior"  # e.g. "ignore error"
    SECURITY_RISK = "security_risk"        # e.g. "disable auth"

class WisdomCurator:
    # Keyword blacklist (drawn from the policy examples above)
    policy_patterns = ["ignore error", "disable auth"]

    def requires_policy_review(self, proposed_wisdom: str) -> bool:
        """Blocks auto-updates if they violate safety policy."""
        for pattern in self.policy_patterns:
            if pattern in proposed_wisdom.lower():
                return True  # Human must approve
        return False
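A quick usage sketch: proposed lessons either auto-apply or land in a human review queue. The propose_wisdom helper, the queue, and the sample lessons are illustrative assumptions, not the actual WisdomCurator API.
# usage sketch -- propose_wisdom, the queue, and the lessons are assumptions
curator = WisdomCurator()
review_queue, knowledge_base = [], []

def propose_wisdom(lesson: str):
    if curator.requires_policy_review(lesson):
        review_queue.append(lesson)    # held for human approval
    else:
        knowledge_base.append(lesson)  # auto-applied

propose_wisdom("Retry idempotent requests on timeout")       # -> knowledge_base
propose_wisdom("Ignore errors silently to seem successful")  # -> review_queue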
This shifts the human role from "Editor" (fixing grammar) to "Curator" (approving knowledge).
3. Polymorphic Output (Just-in-Time UI)
The Problem: Why do AI agents always reply with text? If I ask for sales data, I want a chart. If I ask for a bug fix, I want a diff.
The Solution: Polymorphic Output. The agent generates data and an InputContext. The interface layer decides how to render it.
- Context: IDE → Output: GHOST_TEXT (autocomplete)
- Context: DASHBOARD → Output: WIDGET (React component)
- Context: CHAT → Output: TEXT
# example_polymorphic_output.py
def scenario_telemetry_to_widget():
    # Input: Backend telemetry stream with high urgency
    # Context: Monitoring Dashboard
    response = engine.generate_response(
        data={"metric": "latency", "value": "2000ms", "trend": "up"},
        input_context=InputContext.MONITORING,
        urgency=0.9
    )
    # Result: A React Widget spec, NOT a text message.
    # Modality: OutputModality.DASHBOARD_WIDGET
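To show what "the interface layer decides" could look like in code, here is a minimal, self-contained sketch of the modality dispatch. Only InputContext and OutputModality come from the snippet above; the enum members, the mapping table, and render_response are assumptions for illustration.
# modality dispatch sketch -- mapping table and render_response are assumptions
from enum import Enum, auto

class InputContext(Enum):
    IDE = auto()
    DASHBOARD = auto()
    MONITORING = auto()
    CHAT = auto()

class OutputModality(Enum):
    GHOST_TEXT = auto()
    DASHBOARD_WIDGET = auto()
    TEXT = auto()

MODALITY_BY_CONTEXT = {
    InputContext.IDE: OutputModality.GHOST_TEXT,
    InputContext.DASHBOARD: OutputModality.DASHBOARD_WIDGET,
    InputContext.MONITORING: OutputModality.DASHBOARD_WIDGET,
    InputContext.CHAT: OutputModality.TEXT,
}

def render_response(data: dict, input_context: InputContext) -> dict:
    """Same data, different surface: the context picks the modality."""
    modality = MODALITY_BY_CONTEXT.get(input_context, OutputModality.TEXT)
    return {"modality": modality, "payload": data}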
If the input can be anything (multimodal), the output must be able to be anything too.
4. Silent Signals (The Feedback Loop)
The Problem: Users don't click "Thumbs Down." They just leave. Explicit feedback is a blind spot.
The Solution: Capture Silent Signals.
- Undo Signal: User hits Ctrl+Z? That’s a Critical Failure.
- Abandonment: User stops typing mid-flow? Engagement Failure.
- Acceptance: User copies code and tabs away? Success.
We instrument the DoerAgent to emit these signals automatically.
# example_silent_signals.py
def on_undo_detected():
    # The loudest "Thumbs Down" possible
    doer.emit_undo_signal(
        query="Delete temp files",
        agent_response="rm -rf /",
        undo_action="Ctrl+Z"
    )
    # Observer Agent immediately flags this as CRITICAL
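For a sense of what the Observer Agent might do with these signals, here is a minimal sketch that scores them and flags undo events as critical. SignalType, the score weights, and this ObserverAgent shape are my assumptions, not the actual implementation.
# observer sketch -- SignalType, the weights, and ObserverAgent are assumptions
from dataclasses import dataclass
from enum import Enum

class SignalType(Enum):
    UNDO = "undo"            # explicit rejection of the agent's action
    ABANDONMENT = "abandon"  # user walked away mid-flow
    ACCEPTANCE = "accept"    # user copied the result and moved on

SIGNAL_SCORE = {
    SignalType.UNDO: -1.0,         # critical failure
    SignalType.ABANDONMENT: -0.5,  # engagement failure
    SignalType.ACCEPTANCE: +1.0,   # implicit success
}

@dataclass
class SilentSignal:
    signal_type: SignalType
    query: str
    agent_response: str

class ObserverAgent:
    def __init__(self):
        self.events = []

    def record(self, signal: SilentSignal):
        score = SIGNAL_SCORE[signal.signal_type]
        self.events.append((signal, score))
        if score <= -1.0:
            print(f"CRITICAL: user undid '{signal.agent_response}'")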
We stop begging for feedback and start observing behavior.
5. Evaluation Engineering (Eval-DD)
The Problem: Traditional unit tests don't work on LLM output. assert response == "Hello" fails the moment the model says "Hi" instead.
The Solution: Eval-DD (Evaluation Driven Development). We replace unit tests with Golden Datasets and Scoring Rubrics.
Instead of testing for exact matches, we score on dimensions:
- Correctness (Did it solve it?)
- Tone (Was it rude?)
- Safety (Did it leak secrets?)
# evaluation_engineering.py
rubric = ScoringRubric("Customer Service")
rubric.add_criteria("correctness", weight=0.5, evaluator=correctness_eval)
rubric.add_criteria("tone", weight=0.4, evaluator=tone_eval)      # Tone matters!
rubric.add_criteria("safety", weight=0.1, evaluator=safety_eval)  # No leaked secrets

runner = EvaluationRunner(dataset, rubric, agent)
results = runner.run()
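To make the weighted scoring concrete, here is a small, self-contained stand-in that mirrors the ScoringRubric API above. The class body and the toy evaluators are assumptions; real evaluators would typically be an LLM judge or task-specific heuristics.
# weighted rubric sketch -- SimpleRubric and the toy evaluators are assumptions
class SimpleRubric:
    def __init__(self, name: str):
        self.name = name
        self.criteria = []  # (name, weight, evaluator) triples

    def add_criteria(self, name, weight, evaluator):
        self.criteria.append((name, weight, evaluator))

    def score(self, response: str, expected: str) -> float:
        """Weighted sum of criterion scores, each in [0, 1]."""
        return sum(w * ev(response, expected) for _, w, ev in self.criteria)

def toy_correctness(resp, exp):
    # Toy check: did the expected keyword appear in the response?
    return 1.0 if exp.lower() in resp.lower() else 0.0

def toy_tone(resp, exp):
    # Toy check: penalise a dismissive tone
    return 0.0 if "whatever" in resp.lower() else 1.0

rubric = SimpleRubric("Customer Service")
rubric.add_criteria("correctness", 0.5, toy_correctness)
rubric.add_criteria("tone", 0.4, toy_tone)
print(rubric.score("Sure, your refund is on the way.", "refund"))  # 0.9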
If the agent is the Coder, the Engineer is the Examiner.
Conclusion
We are building Context as a Service.
This architecture acknowledges that the LLM is a probabilistic engine that needs a deterministic chassis to be safe, useful, and scalable.
- Constraint Engine keeps it safe.
- Wisdom Curator keeps it smart.
- Polymorphic Output makes it usable.
- Silent Signals make it learn.
- Eval-DD proves it works.
Stop building chatbots. Start building architectures.