Prateek Chaudhary

# Operating AI in Production Is an Ops Problem

Over the last year, I’ve been working with LLMs and AI systems that actually run in production: not demos, not notebooks, not proofs of concept.

What surprised me most wasn’t model behavior.
It was how quickly operational assumptions broke.

From an ops and platform perspective, AI systems don’t fail like models.
They fail like systems.

## What breaks first in real environments

When AI systems move into production, the early issues are rarely about accuracy.

Instead, teams struggle with:

- unclear decision boundaries
- non-reproducible behavior
- missing audit trails
- no safe rollback paths
- uncomfortable “why did this happen?” questions

Most existing tooling focuses on observing outputs.
Very little focuses on governing behavior.

## Observability helps, but it’s reactive

We already know how to observe software:

- logs
- metrics
- traces
- alerts

AI observability tools extend this to:

- drift
- cost
- latency
- token usage

All of this is useful, but mostly after the fact.

In production systems, knowing what happened is not enough.
You also need to know:

- whether it should have happened
- whether it should happen again
- whether it should be allowed at all

## The core mismatch

LLMs reason probabilistically.
Production systems expect determinism.

Trying to force AI to behave like traditional software doesn’t work.
But letting AI directly execute decisions inside deterministic systems also doesn’t work.

So we started experimenting with a different boundary:

AI can reason.
Deterministic systems decide.
Execution must remain controlled.

## Separating reasoning from execution

Once you separate these concerns, a lot of things become clearer:

- AI suggestions can be evaluated before execution
- policies can block or correct unsafe actions
- failures become structured signals, not surprises
- accountability boundaries become explicit

This is a familiar pattern in ops — just applied to intelligence.
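The boundary is easiest to see in code. Here is a minimal sketch of the pattern, with hypothetical names (`ProposedAction`, `gate`, `deny_large_refunds`) invented for illustration: the AI only *proposes* an action, and a deterministic policy layer decides whether it may execute.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ProposedAction:
    """An action suggested by an AI system -- not yet executed."""
    name: str
    params: dict

# A policy is a plain deterministic function: it returns None if the
# action is allowed, or a human-readable reason if it must be blocked.
Policy = Callable[[ProposedAction], Optional[str]]

def deny_large_refunds(action: ProposedAction) -> Optional[str]:
    """Example policy: large refunds need human approval."""
    if action.name == "issue_refund" and action.params.get("amount", 0) > 100:
        return "refunds over $100 require human approval"
    return None

def gate(action: ProposedAction, policies: list) -> tuple:
    """Deterministic decision layer: run every policy before execution.

    Returns (allowed, reasons). Execution happens only when allowed is
    True; otherwise the reasons become a structured, auditable record.
    """
    reasons = [r for p in policies if (r := p(action)) is not None]
    return (len(reasons) == 0, reasons)

# The model "reasons" its way to a suggestion; the gate decides.
allowed, reasons = gate(
    ProposedAction("issue_refund", {"amount": 250}),
    [deny_large_refunds],
)
```

Because the policies are ordinary deterministic functions, they can be versioned, tested, and audited like any other production code, regardless of how the suggestion was produced.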

## Why I started working on Kakveda

This line of thinking led me to start working on Kakveda, an open-source project focused on intelligence monitoring, observability, and deterministic control for AI systems.

The goal isn’t to replace models or agents.
It’s to supervise them.

Kakveda sits around AI systems and focuses on:

- observing how AI behaves over time
- enforcing rules before actions execute
- capturing failures as first-class events
- keeping execution predictable

In short: making AI systems operable.
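“Failures as first-class events” is worth making concrete. The sketch below is my own illustration, not Kakveda’s actual schema or API: every blocked or failed action is captured as a structured record that ordinary log tooling can query, instead of vanishing into free-text logs.

```python
import json
import time
from dataclasses import dataclass, asdict, field

@dataclass
class SupervisionEvent:
    """A structured record of one supervised AI action.

    Hypothetical schema for illustration only -- field names are
    assumptions, not a real Kakveda format.
    """
    action: str            # what the AI proposed, e.g. "issue_refund"
    outcome: str           # "executed", "blocked", or "failed"
    reason: str            # why, in policy terms
    timestamp: float = field(default_factory=time.time)

def record(event: SupervisionEvent) -> str:
    # Emit the event as a JSON line: a failure becomes queryable data
    # feeding audits and alerts, not a surprise buried in stdout.
    return json.dumps(asdict(event))

line = record(SupervisionEvent(
    action="issue_refund",
    outcome="blocked",
    reason="amount exceeds auto-approval limit",
))
```

Once failures have a stable shape like this, “why did this happen?” becomes a query rather than an archaeology exercise.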

## What Kakveda is not

To be clear, Kakveda is not:

- a prompt framework
- an agent toolkit
- an LLM wrapper
- a chatbot platform

It doesn’t try to make AI smarter.
It tries to make AI safer to run.

## Why open source

Governance and control layers should not be opaque.

If AI already introduces uncertainty, the systems supervising it should be:

- inspectable
- auditable
- adaptable

Open source allows this to evolve based on real failures, not theoretical design.

Kakveda is early-stage and opinionated, and that’s intentional.

## The bigger takeaway

As AI adoption grows, the most important question won’t be:

“How powerful is this model?”

It will be:

“Do we understand and control what this system is allowed to do?”

That’s an ops question.
And ops questions deserve first-class systems.

If you’re operating AI systems in production, especially from a DevOps, SRE, or platform perspective, I’d love to hear what’s breaking for you.
