Savi Saluwadana

AI-Native Platform Engineering: How OpenChoreo Brings MCP and an SRE Agent to Your Infrastructure

AI assistants have become a standard part of how developers write code. The next frontier is whether they can be trusted participants in how that code gets deployed, operated, and debugged.

OpenChoreo, an open source internal developer platform (IDP) that recently entered the CNCF Sandbox, takes a clear position on this. AI is not a plugin or an afterthought. It is a first-class platform construct with the same authorization model, the same guardrails, and the same observability as every other part of the system.

I contribute to the project, and in this post I want to walk through two specific capabilities: the MCP server integration that connects AI assistants to your platform, and the built-in RCA Agent that autonomously investigates production incidents.


Why AI at the Platform Layer Is Different

There is a meaningful difference between AI that helps you write code and AI that interacts with your running infrastructure.

A code suggestion going wrong costs you a review cycle. A deployment action going wrong costs you an incident. The stakes are different and the design has to reflect that.

OpenChoreo's approach is to expose AI interfaces that follow the same authorization policies as human users. When your AI assistant connects to the platform via MCP, it authenticates with OAuth2/OIDC and is subject to the same RBAC and ABAC policies as a human operator. It can only do what a human with the same role could do. No elevated permissions, no side doors.


The MCP Server Architecture

OpenChoreo exposes two MCP servers.

The Control Plane MCP server gives your AI assistant access to platform management operations. The Observability Plane MCP server gives it direct access to logs, metrics, traces, and alerts without proxying through the control plane.

The two-server design is intentional. Observability data never flows through the control plane on its way to an AI assistant. In multi-region or multi-tenant deployments this matters for data privacy and compliance. Each server is independently secured and independently queryable.
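To make the two-server split concrete, a client-side MCP configuration might register both endpoints separately, roughly like the sketch below. The server names, URLs, and config schema here are illustrative assumptions; the actual endpoints and your assistant's exact MCP config format are in the OpenChoreo docs.

```jsonc
{
  "mcpServers": {
    // Control Plane MCP server: platform management operations
    // (projects, components, release bindings, workflow runs)
    "openchoreo-control-plane": {
      "type": "http",
      "url": "https://openchoreo.example.com/mcp"   // placeholder URL
    },
    // Observability Plane MCP server: logs, metrics, traces, alerts,
    // queried directly rather than proxied through the control plane
    "openchoreo-observability": {
      "type": "http",
      "url": "https://observability.example.com/mcp" // placeholder URL
    }
  }
}
```

Each entry authenticates independently via OAuth2/OIDC, so the observability endpoint can be scoped and secured on its own terms.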

What your AI assistant can actually do

Once connected, your AI assistant becomes an active participant in platform operations across five categories:

Resource management

  • List namespaces, projects, components, and environments
  • Inspect deployment pipelines and release bindings
  • Check component status across environments

Build and workflow operations

  • Trigger workflow runs
  • Inspect build status and history
  • Query workflow logs
  • Compare successful and failed builds

Observability queries

  • Fetch distributed logs with domain-aware filtering by namespace, project, and component
  • Query metrics and check resource utilization
  • Trace requests across service boundaries with query_traces and query_trace_spans
  • Inspect active alerts and incidents

Deployment and promotion

  • Update release bindings to promote components across environments
  • Apply configuration changes to running deployments
  • Roll back by pointing a binding at a previous release

Resource optimization

  • Query resource metrics against actual allocation
  • Get right-sizing recommendations
  • Apply optimized configurations directly

Supported AI assistants

Claude Code, Cursor, Codex CLI, Gemini CLI, OpenCode CLI, and VS Code with GitHub Copilot all work out of the box. Both browser-based OAuth (authorization code with PKCE) and client credentials flows are supported depending on your setup.


Real Scenarios: What This Looks Like in Practice

The docs ship with five hands-on MCP scenarios that show exactly how this works. Here are the ones worth understanding in detail.

Debugging a cascading failure

This scenario uses the GCP Microservices Demo (Online Boutique). You intentionally break the product catalog service by scaling it to zero replicas. Then you use your AI assistant to diagnose the failure across service boundaries.

The assistant works through the investigation using:

list_components          → find affected services
query_component_logs     → surface error patterns in logs
query_traces             → follow the request path across services
query_trace_spans        → pinpoint exactly where the failure propagates
get_release_binding      → inspect current deployment state
update_release_binding   → apply the fix

The entire investigation and remediation happen conversationally without leaving your editor. The assistant has the full observability context, not just a log dump.

Diagnosing a build failure

You trigger a build with a misconfigured Dockerfile path in a Go service. The assistant:

list_workflow_runs        → find the failed run
get_workflow_run          → inspect the failure details
query_workflow_logs       → surface the exact error
create_workflow_run       → trigger a new build after the fix

Comparing against the previous successful build to identify what changed is a natural conversational step. The assistant has the history.

Resource optimization

You allocate excessive CPU and memory to several services in a demo deployment. The assistant:

list_components           → enumerate running services
list_release_bindings     → get current configurations
query_resource_metrics    → compare allocation vs actual usage
update_release_binding    → apply right-sized configurations

This is a genuinely useful operational workflow: right-sizing based on actual usage data rather than educated guesses, applied directly without a context switch to separate tooling.


The RCA Agent: Autonomous Incident Investigation

Beyond the interactive MCP integration, OpenChoreo ships with a built-in RCA Agent. This is a different model. Instead of you asking the AI assistant to investigate something, the RCA Agent reacts autonomously when alerts fire.

How it works

The RCA Agent is configured at the alert level. When you define an alert rule, you can set triggerAiRca: true. When that alert fires in production, the agent immediately pulls logs, metrics, and traces from the affected deployments and generates a root cause analysis report.

The workflow is:

Alert fires
    ↓
RCA Agent triggers automatically
    ↓
Agent pulls logs, metrics, traces from observability plane
    ↓
LLM analyzes the correlated signals
    ↓
Root cause analysis report generated
    ↓
Report available in the OpenChoreo portal and via the RCA chat interface

No engineer needs to be the first one paging through dashboards. By the time someone picks up the incident, there is already a structured analysis waiting for them.

The RCA chat interface

Beyond automatic reports, OpenChoreo ships an interactive RCA chat interface. You can query past incidents conversationally, ask follow-up questions about a specific report, and dig into the reasoning behind a root cause conclusion.

This is the key design difference from just getting a wall of text. The report is a starting point for a conversation, not a terminal output.

Setup

The RCA Agent requires:

  • OpenChoreo Observability Plane with at least a logs module installed
  • An LLM API key (currently OpenAI GPT model series, additional providers on the roadmap)
  • Alerting configured with triggerAiRca: true on the alerts you want covered
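As a sketch of what the alert side of this might look like: only the triggerAiRca flag below is taken from the docs; the surrounding field names and the condition syntax are illustrative placeholders, not OpenChoreo's actual alerting schema.

```yaml
# Illustrative alert rule; fields other than triggerAiRca are placeholders.
name: checkout-error-rate-high
severity: critical             # reserve AI RCA for critical alerts to control LLM cost
condition: error_rate > 0.05   # hypothetical condition expression
triggerAiRca: true             # documented flag: run the RCA Agent when this alert fires
```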

Enable it via Helm:

helm upgrade --install openchoreo-observability-plane \
  oci://ghcr.io/openchoreo/helm-charts/openchoreo-observability-plane \
  --version 1.0.0 \
  --namespace openchoreo-observability-plane \
  --reuse-values \
  --set rca.enabled=true \
  --set rca.llm.modelName=gpt-4o

Reports are stored in SQLite by default with a persistent volume. For production scale or horizontal scaling, PostgreSQL is supported as the report backend.

Cost note: The docs recommend enabling triggerAiRca only for critical alerts to manage LLM costs. Every alert trigger is an LLM call.


The Authorization Model Underneath All of This

Both the MCP servers and the RCA Agent operate within OpenChoreo's unified authorization engine. This is worth understanding because it is what makes AI at the infra layer safe to expose.

The authorization engine is powered by Apache Casbin and supports fine-grained RBAC, ABAC, and instance-level access controls down to the namespace, project, and component level.

When your AI assistant connects via MCP it authenticates with OAuth2/OIDC and is granted a role that defines exactly what it can and cannot do. The RCA Agent authenticates via the client_credentials grant and is assigned the rca-agent role, scoped precisely to the operations it needs for incident analysis.

The same policy model applies to humans and AI. Your AI assistant cannot do anything a human with equivalent permissions could not do. The guardrails are structural, not procedural.
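To illustrate what a shared policy model means in practice, here is a rough sketch in Casbin's standard CSV policy form. The rca-agent role name comes from the docs; every other role name, object path, and action here is an illustrative assumption, not OpenChoreo's actual policy.

```csv
# p = policy rule: role, object, action (standard Casbin CSV form)
p, developer, project:demo/component:*,    read
p, developer, project:demo/workflow-run:*, create
p, rca-agent, observability:logs,          read
p, rca-agent, observability:metrics,       read
p, rca-agent, observability:traces,        read

# g = role assignment: humans and AI identities bind to the same roles
g, alice,           developer
g, ai-assistant-01, developer
```

Because an AI assistant is just another subject bound to an existing role, there is no separate policy surface to audit for it.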


What This Means for Platform Teams

The practical implication of all of this is a shift in how platform operations work day to day.

For developers: Instead of opening five dashboards to understand why a build failed or why a service is returning errors, you ask your AI assistant. It has the context. It can correlate across logs, traces, and deployment state in a single conversation.

For on-call engineers: When an alert fires you are not starting from zero. The RCA Agent has already correlated the signals and generated a structured analysis. You start from a hypothesis, not a blank screen.

For platform teams: The same golden paths and authorization policies you define for human users apply to AI automatically. You do not need a separate AI governance model. The platform's existing model extends to cover it.

Getting Started

Connect your AI assistant to a local OpenChoreo instance in about 15 minutes:

  1. Run OpenChoreo locally with k3d following the quick start guide
  2. Connect your AI assistant using the MCP configuration in the docs
  3. Try the getting started scenario to verify the connection
  4. Work through the log analysis scenario to see the full observability integration

The project is fully open source under CNCF governance. If you are building in the platform engineering or AI tooling space, contributions and feedback are very welcome.
