Imran Siddique

Beyond the Chatbot: Architecture for Production-Grade Agents (Context as a Service)

We are past the "Hello World" phase of AI Agents.

The internet is flooded with tutorials on how to make an LLM call a weather API. But if you are building agents for enterprise production, you know that making the LLM "do things" is the easy part.

The hard part is preventing it from doing bad things, understanding what it learned, and rendering the output in a way that isn't just a wall of text.

I’ve been working on an architecture I call Context as a Service (CaaS). It decouples the "Brain" (LLM) from the "Infrastructure" (Tools/UI) using a set of deterministic layers.

Here is the architecture that turns a toy agent into a production system.

1. The Logic Firewall (Constraint Engineering)

The Problem: Prompt Engineering is fragile. Begging an LLM, "Please do not drop the database," is not a security strategy.

The Solution: A deterministic Constraint Engine that sits between the Agent and the Executor. The Agent generates a plan, but the Firewall approves it.

I implemented a ConstraintEngine class that intercepts plans before they touch the infrastructure. It uses regex and logic, not AI.

# constraint_engine.py
import re

class SQLInjectionRule(ConstraintRule):
    """Detects dangerous SQL operations deterministically."""
    DANGEROUS_PATTERNS = [
        r'\bDROP\s+TABLE\b',
        r'\bDELETE\s+FROM\b.*\bWHERE\s+1\s*=\s*1\b',
    ]

    def validate(self, plan):
        query = plan.get("query", "")
        for pattern in self.DANGEROUS_PATTERNS:
            if re.search(pattern, query, re.IGNORECASE):
                return ConstraintViolation(
                    severity=ViolationSeverity.CRITICAL,
                    message="Dangerous SQL detected. Execution blocked."
                )
        return None  # No violation: the plan may proceed


The key insight: The Human builds the walls; the AI plays inside them.
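The rule above leans on types the snippet doesn't define. Here is a minimal, self-contained sketch of how the surrounding engine could look; the `ConstraintRule` base class, `ViolationSeverity`, `ConstraintEngine`, and the second example rule are my assumptions, not the article's exact implementation:

```python
# constraint_engine_sketch.py -- hypothetical scaffolding around rules like the one above
from dataclasses import dataclass
from enum import Enum


class ViolationSeverity(Enum):
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass
class ConstraintViolation:
    severity: ViolationSeverity
    message: str


class ConstraintRule:
    """Base class: each rule inspects a plan and returns a violation or None."""
    def validate(self, plan: dict):
        raise NotImplementedError


class DestructiveShellRule(ConstraintRule):
    """Example rule: blocks obviously destructive shell commands."""
    def validate(self, plan):
        command = plan.get("command", "")
        if "rm -rf /" in command:
            return ConstraintViolation(
                ViolationSeverity.CRITICAL,
                "Recursive delete from root detected. Execution blocked.",
            )
        return None


class ConstraintEngine:
    """Runs every registered rule against a plan before the executor sees it."""
    def __init__(self, rules):
        self.rules = rules

    def approve(self, plan: dict):
        # Collect all violations; an empty list means the plan may execute
        violations = [v for r in self.rules if (v := r.validate(plan)) is not None]
        return (len(violations) == 0, violations)


engine = ConstraintEngine([DestructiveShellRule()])
approved, violations = engine.approve({"command": "rm -rf / --no-preserve-root"})
print(approved)  # False: the firewall blocks the plan deterministically
```

The point of the pattern is that the approval path contains no model call at all: the same plan always produces the same verdict.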

2. The Wisdom Curator (Human-in-the-Loop)

The Problem: You can't review 10,000 interactions a day. But you can't let the agent learn "bad habits" (like ignoring errors to seem successful) without oversight.

The Solution: The Wisdom Curator. Instead of reviewing every log, we review Strategic Samples and Policy Violations.

My WisdomCurator tracks "Wisdom Updates" (new lessons the agent wants to save). If the agent tries to save a lesson like "Ignore 500 errors," the Curator catches it using a keyword blacklist.

# wisdom_curator.py
from enum import Enum

class PolicyViolationType(Enum):
    HARMFUL_BEHAVIOR = "harmful_behavior"  # e.g. "ignore error"
    SECURITY_RISK = "security_risk"        # e.g. "disable auth"

class WisdomCurator:
    def requires_policy_review(self, proposed_wisdom: str) -> bool:
        """Blocks auto-updates if they violate safety policy."""
        for pattern in self.policy_patterns:
            if pattern in proposed_wisdom.lower():
                return True  # Human must approve
        return False


This shifts the human role from "Editor" (fixing grammar) to "Curator" (approving knowledge).
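To make the gate concrete, here is a small, runnable version of the idea. The specific blacklist entries and the `curate` return values are my assumptions; the real `WisdomCurator` would back this with a review queue:

```python
# wisdom_curator_sketch.py -- a minimal, self-contained version (details assumed)
from enum import Enum


class PolicyViolationType(Enum):
    HARMFUL_BEHAVIOR = "harmful_behavior"  # e.g. "ignore error"
    SECURITY_RISK = "security_risk"        # e.g. "disable auth"


class WisdomCurator:
    """Gates the agent's self-proposed lessons behind a deterministic blacklist."""

    # Keyword blacklist mapped to the policy each phrase violates (illustrative)
    policy_patterns = {
        "ignore error": PolicyViolationType.HARMFUL_BEHAVIOR,
        "suppress warning": PolicyViolationType.HARMFUL_BEHAVIOR,
        "disable auth": PolicyViolationType.SECURITY_RISK,
        "skip validation": PolicyViolationType.SECURITY_RISK,
    }

    def requires_policy_review(self, proposed_wisdom: str) -> bool:
        """Blocks auto-updates if they violate safety policy."""
        text = proposed_wisdom.lower()
        return any(pattern in text for pattern in self.policy_patterns)

    def curate(self, proposed_wisdom: str) -> str:
        if self.requires_policy_review(proposed_wisdom):
            return "QUEUED_FOR_HUMAN"  # a Curator must approve this lesson
        return "AUTO_APPROVED"         # safe lesson, saved immediately


curator = WisdomCurator()
print(curator.curate("Retry transient network failures with backoff"))
print(curator.curate("Ignore error responses to look successful"))
```

Only the second lesson lands in the human queue; everything benign flows through unattended, which is what makes the review load tractable.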

3. Polymorphic Output (Just-in-Time UI)

The Problem: Why do AI agents always reply with text? If I ask for sales data, I want a chart. If I ask for a bug fix, I want a diff.

The Solution: Polymorphic Output. The agent generates data and an InputContext. The interface layer decides how to render it.

  • Context: IDE → Output: GHOST_TEXT (Autocomplete)
  • Context: DASHBOARD → Output: WIDGET (React Component)
  • Context: CHAT → Output: TEXT

# example_polymorphic_output.py
def scenario_telemetry_to_widget():
    # Input: Backend telemetry stream with high urgency
    # Context: Monitoring Dashboard

    response = engine.generate_response(
        data={"metric": "latency", "value": "2000ms", "trend": "up"},
        input_context=InputContext.MONITORING,
        urgency=0.9
    )

    # Result: A React Widget spec, NOT a text message.
    # Modality: OutputModality.DASHBOARD_WIDGET


If input can be anything (multimodal), output must be anything.
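A toy dispatcher makes the mapping above executable. All the names here (`InputContext`, `OutputModality`, `PolymorphicEngine`, the urgency threshold) are my assumptions filled in around the scenario snippet:

```python
# polymorphic_output_sketch.py -- a toy context-to-modality dispatcher (names assumed)
from dataclasses import dataclass
from enum import Enum


class InputContext(Enum):
    IDE = "ide"
    MONITORING = "monitoring"
    CHAT = "chat"


class OutputModality(Enum):
    GHOST_TEXT = "ghost_text"              # inline autocomplete
    DASHBOARD_WIDGET = "dashboard_widget"  # spec for a React component
    TEXT = "text"                          # plain chat message


@dataclass
class AgentResponse:
    modality: OutputModality
    payload: dict


class PolymorphicEngine:
    """Maps (context, urgency) to a render modality; the data stays the same."""

    def generate_response(self, data, input_context, urgency=0.0):
        if input_context is InputContext.IDE:
            return AgentResponse(OutputModality.GHOST_TEXT, data)
        if input_context is InputContext.MONITORING and urgency >= 0.5:
            return AgentResponse(OutputModality.DASHBOARD_WIDGET, data)
        return AgentResponse(OutputModality.TEXT, data)


engine = PolymorphicEngine()
resp = engine.generate_response(
    data={"metric": "latency", "value": "2000ms", "trend": "up"},
    input_context=InputContext.MONITORING,
    urgency=0.9,
)
print(resp.modality)  # OutputModality.DASHBOARD_WIDGET
```

Note that the agent never decides how to render; it only emits data plus context, and the deterministic dispatcher picks the surface.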

4. Silent Signals (The Feedback Loop)

The Problem: Users don't click "Thumbs Down." They just leave. Explicit feedback is a blind spot.

The Solution: Capture Silent Signals.

  • Undo Signal: User hits Ctrl+Z? That’s a Critical Failure.
  • Abandonment: User stops typing mid-flow? Engagement Failure.
  • Acceptance: User copies code and tabs away? Success.

We instrument the DoerAgent to emit these signals automatically.

# example_silent_signals.py
def on_undo_detected():
    # The loudest "Thumbs Down" possible
    doer.emit_undo_signal(
        query="Delete temp files",
        agent_response="rm -rf /",
        undo_action="Ctrl+Z"
    )
    # Observer Agent immediately flags this as CRITICAL


We stop begging for feedback and start observing behavior.
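As a sketch of the instrumentation, here is a minimal emitter/observer pair. The `Signal`, `ObserverAgent`, and `DoerAgent` shapes are my assumptions; a real system would persist signals and correlate them with sessions:

```python
# silent_signals_sketch.py -- a minimal emitter/observer pair (shapes assumed)
from dataclasses import dataclass, field


@dataclass
class Signal:
    kind: str            # "undo" | "abandon" | "accept"
    query: str
    agent_response: str
    severity: str


@dataclass
class ObserverAgent:
    """Collects implicit feedback signals for later analysis."""
    signals: list = field(default_factory=list)

    def record(self, signal: Signal):
        self.signals.append(signal)


class DoerAgent:
    def __init__(self, observer: ObserverAgent):
        self.observer = observer

    def emit_undo_signal(self, query, agent_response, undo_action):
        # Ctrl+Z right after our output is the loudest "thumbs down" there is
        self.observer.record(Signal("undo", query, agent_response, "CRITICAL"))

    def emit_acceptance_signal(self, query, agent_response):
        # Copy-and-tab-away implies the answer was good enough to use
        self.observer.record(Signal("accept", query, agent_response, "SUCCESS"))


observer = ObserverAgent()
doer = DoerAgent(observer)
doer.emit_undo_signal("Delete temp files", "rm -rf /tmp/*", "Ctrl+Z")
print(observer.signals[0].severity)  # CRITICAL
```

The user never filled in a feedback form; the undo keystroke alone produced a labeled training signal.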

5. Evaluation Engineering (Eval-DD)

The Problem: You can't write unit tests for an LLM. assert response == "Hello" fails if the AI says "Hi".

The Solution: Eval-DD (Evaluation Driven Development). We replace unit tests with Golden Datasets and Scoring Rubrics.

Instead of testing for exact matches, we score on dimensions:

  1. Correctness (Did it solve it?)
  2. Tone (Was it rude?)
  3. Safety (Did it leak secrets?)

# evaluation_engineering.py
rubric = ScoringRubric("Customer Service")
rubric.add_criteria("correctness", weight=0.5, evaluator=correctness_eval)
rubric.add_criteria("tone", weight=0.4, evaluator=tone_eval) # Tone matters!

runner = EvaluationRunner(dataset, rubric, agent)
results = runner.run()


If the agent is the Coder, the Engineer is the Examiner.
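For completeness, here is a runnable sketch of the weighted-rubric idea. The `ScoringRubric` internals and the toy evaluators are my assumptions; in production the evaluators would be keyword checks, classifiers, or LLM judges rather than one-liners:

```python
# eval_dd_sketch.py -- weighted-rubric scoring, runnable end to end (details assumed)

class ScoringRubric:
    """Scores an answer on weighted dimensions instead of exact string match."""

    def __init__(self, name):
        self.name = name
        self.criteria = []  # (label, weight, evaluator) tuples

    def add_criteria(self, label, weight, evaluator):
        self.criteria.append((label, weight, evaluator))

    def score(self, query, response):
        # Weighted average of each evaluator's 0.0-1.0 score
        total_weight = sum(w for _, w, _ in self.criteria)
        return sum(w * fn(query, response) for _, w, fn in self.criteria) / total_weight


# Toy evaluators standing in for real ones
def correctness_eval(query, response):
    return 1.0 if "refund" in response.lower() else 0.0

def tone_eval(query, response):
    return 0.0 if "obviously" in response.lower() else 1.0


rubric = ScoringRubric("Customer Service")
rubric.add_criteria("correctness", weight=0.5, evaluator=correctness_eval)
rubric.add_criteria("tone", weight=0.5, evaluator=tone_eval)

score = rubric.score(
    "How do I get my money back?",
    "You can request a refund from the Orders page.",
)
print(score)  # 1.0: correct and polite
```

Because the score is a weighted average rather than an equality check, "Hi" and "Hello" can both pass, while a technically correct but condescending answer still loses points on tone.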

Conclusion

We are building Context as a Service.

This architecture acknowledges that the LLM is a probabilistic engine that needs a deterministic chassis to be safe, useful, and scalable.

  1. Constraint Engine keeps it safe.
  2. Wisdom Curator keeps it smart.
  3. Polymorphic Output makes it usable.
  4. Silent Signals make it learn.
  5. Eval-DD proves it works.

Stop building chatbots. Start building architectures.

Top comments (2)

Numan

Thanks for sharing these techniques from your experience, Imran.

While these techniques are genuinely helpful in their own right, I felt a mismatch between the expectations set by the title and the content of the article. The term "architecture" generally refers to a high-level overview of components and their communication. Your techniques are more low-level, in-process design and security considerations :)

Imran Siddique

Thank you for the thoughtful feedback and for taking the time to read the article; I really appreciate it!

You're absolutely right that "architecture" in software discussions often evokes high-level system diagrams: boxes like orchestrators, vector stores, tool routers, memory layers, observability pipelines, etc., with arrows showing data/control flow across services.

In this piece, I deliberately zoomed in on the inner layers, the deterministic "chassis" that wraps and constrains the probabilistic LLM core, because in my experience, that's where most production agent projects fail or become unmaintainable, even when the high-level design looks clean on paper.

The techniques (Logic Firewall, Wisdom Curator, Polymorphic Output, Silent Signals, Eval-DD) are more about in-process safety & quality engineering than a full distributed system blueprint. I probably should have titled it something like "Beyond the Chatbot: Hardening & Productionizing Agent Logic" or "Context as a Service, The Deterministic Layers" to set expectations better.

That said, the overall CaaS idea is meant as one (opinionated) way to fill the gap between "cool demo agent" and "reliable production system," especially around context integrity, constraint enforcement, and evaluation, which I see as foundational before you even worry about scaling horizontally.

Thanks again for the honest note; it's helpful for improving how I communicate these ideas. If you'd like, I'd be genuinely interested in your take: which high-level architectural patterns or components do you think are currently most missing or most important for production-grade agents?

Appreciate you! 🙏