<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Savi Saluwadana</title>
    <description>The latest articles on DEV Community by Savi Saluwadana (@savi_saluwadana).</description>
    <link>https://dev.to/savi_saluwadana</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3661277%2Ff471b5be-55fc-401e-a772-8e8358c3cdc3.jpeg</url>
      <title>DEV Community: Savi Saluwadana</title>
      <link>https://dev.to/savi_saluwadana</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/savi_saluwadana"/>
    <language>en</language>
    <item>
      <title>I Deployed the 11-Tier Google Microservices Demo in Minutes using OpenChoreo</title>
      <dc:creator>Savi Saluwadana</dc:creator>
      <pubDate>Thu, 16 Apr 2026 09:13:40 +0000</pubDate>
      <link>https://dev.to/savi_saluwadana/i-deployed-the-11-tier-google-microservices-demo-in-minutes-using-openchoreo-4p13</link>
      <guid>https://dev.to/savi_saluwadana/i-deployed-the-11-tier-google-microservices-demo-in-minutes-using-openchoreo-4p13</guid>
      <description>&lt;p&gt;If you work in the cloud-native ecosystem, you are probably familiar with the Google Online Boutique (formerly the GCP Microservices Demo). It is the gold standard for testing cloud-native infrastructure. It features 11 distinct microservices written in multiple languages (including Go, Java, and Python), generating a web of high-concurrency traffic and deep dependencies.&lt;/p&gt;

&lt;p&gt;Deploying it manually usually means wrestling with a mountain of Kubernetes manifests and struggling to visualize how the services actually talk to each other.&lt;/p&gt;

&lt;p&gt;I recently wanted to see how this demo would look inside OpenChoreo, a new CNCF Sandbox open-source developer platform for Kubernetes. I was looking for a unified way to visualize the architecture, track deployments, and manage dependencies without building a custom portal from scratch.&lt;/p&gt;

&lt;p&gt;Here is how I deployed the entire suite using OpenChoreo, and a look at the polished platform UI it generated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Setup: Zero to Platform in Minutes&lt;/strong&gt;&lt;br&gt;
The goal was to avoid spending hours configuring a control plane. I wanted a complete Internal Developer Platform (IDP) experience straight out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: The Quick Start Guide&lt;/strong&gt;&lt;br&gt;
I started with the OpenChoreo Quick Start Guide (QSG). Running the QSG locally spun up the entire foundation. Within about 10 minutes, I had the OpenChoreo Control, Data, Workflow, and Observability planes running on my cluster, complete with the Backstage-powered UI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Deploying the Demo Script&lt;/strong&gt;&lt;br&gt;
With the platform running, I did not have to manually translate the Google microservices into OpenChoreo components. I simply ran the provided script designed specifically for the Google Microservices Demo deployment.&lt;/p&gt;

&lt;p&gt;The script automatically registered the project, defined the component boundaries, and initiated the GitOps reconciliation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exploring the Results&lt;/strong&gt;&lt;br&gt;
Once the deployment finished, the OpenChoreo UI brought the entire microservice architecture to life. Here is what the platform gave me automatically.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;The Unified Component Catalog&lt;/strong&gt;&lt;br&gt;
Instead of hunting through terminal commands to see what was running, the Overview tab provided a clean, unified catalog.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every component (the Go-based frontend, the Java Spring Boot shipping service, the Redis cache) was cataloged neatly under the "GCP Microservice Demo" project. More importantly, the UI exposed the deployment state of each component across the defined pipeline, showing clear green checks for successful deployments in the Development environment, with paths ready for Staging and Production.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;strong&gt;The Relationship Graph&lt;/strong&gt;&lt;br&gt;
Microservices only make sense when you understand their context. OpenChoreo generated a complete system diagram mapping the exact relationships.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The graph visually links the default namespace to the GCP Microservice Demo project, and then branches out to show every single component connected to it, alongside the Default Pipeline. It provides instant clarity on ownership and system boundaries.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;&lt;strong&gt;The Cell Diagram (My Favorite Feature)&lt;/strong&gt;&lt;br&gt;
This is where the platform feels like a mature enterprise tool. Managing high-concurrency backend systems requires understanding real-time traffic flows and dependencies.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Cell Diagram automatically mapped the network interactions between the services. You can visually trace how the &lt;code&gt;frontend&lt;/code&gt; component acts as the central hub, reaching out to &lt;code&gt;checkout&lt;/code&gt;, &lt;code&gt;currency&lt;/code&gt;, &lt;code&gt;productcatalog&lt;/code&gt;, and &lt;code&gt;recommendation&lt;/code&gt;. You can also clearly see secondary dependencies, like the &lt;code&gt;cart&lt;/code&gt; service connecting directly to &lt;code&gt;redis&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If a service goes down, this diagram is exactly what an SRE or platform engineer needs to instantly identify the blast radius.&lt;/p&gt;
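&lt;p&gt;To make the blast-radius idea concrete, here is a small sketch (hypothetical function, with the call graph assumed from the Cell Diagram described above, not OpenChoreo code) that computes which services are affected when one dependency fails, by walking the call graph in reverse:&lt;/p&gt;

```python
from collections import deque

# service -> services it calls (assumed edges, based on the demo's topology)
CALLS = {
    "frontend": ["checkout", "currency", "productcatalog", "recommendation", "cart"],
    "checkout": ["productcatalog", "currency", "cart"],
    "recommendation": ["productcatalog"],
    "cart": ["redis"],
}

def blast_radius(failed: str) -> set:
    """Return every service whose requests can reach the failed one."""
    # invert the call graph: who depends (directly) on whom
    dependents = {}
    for caller, callees in CALLS.items():
        for callee in callees:
            dependents.setdefault(callee, set()).add(caller)
    # breadth-first walk over the inverted edges from the failed service
    seen, queue = set(), deque([failed])
    while queue:
        svc = queue.popleft()
        for dep in dependents.get(svc, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

print(sorted(blast_radius("productcatalog")))
```

&lt;p&gt;A failure in &lt;code&gt;productcatalog&lt;/code&gt; surfaces &lt;code&gt;checkout&lt;/code&gt;, &lt;code&gt;frontend&lt;/code&gt;, and &lt;code&gt;recommendation&lt;/code&gt; as its blast radius, which is exactly what the Cell Diagram lets you read off visually.&lt;/p&gt;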

&lt;p&gt;&lt;strong&gt;The Takeaway&lt;/strong&gt;&lt;br&gt;
Building a platform that abstracts Kubernetes while providing deep observability usually takes organizations quarters, if not years, of dedicated engineering.&lt;/p&gt;

&lt;p&gt;By combining the OpenChoreo QSG with the demo deployment script, I got a production-grade developer portal, automated deployment pipelines, and deep architectural mapping in minutes. It completely changes the developer experience from fighting infrastructure to purely focusing on software delivery.&lt;/p&gt;

&lt;p&gt;If you are building an IDP or just want to see how clean Kubernetes abstractions can be, you should definitely take it for a spin.&lt;/p&gt;

&lt;p&gt;Check out the project and give it a star: &lt;a href="https://openchoreo.dev/" rel="noopener noreferrer"&gt;https://openchoreo.dev/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4gude4j6eet0yxx66f9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4gude4j6eet0yxx66f9.png" alt=" " width="800" height="434"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4o5gjgze9aafmzxki1fw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4o5gjgze9aafmzxki1fw.png" alt=" " width="800" height="736"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdh28tp019uz10arnrp4v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdh28tp019uz10arnrp4v.png" alt=" " width="800" height="623"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjgql7r1gu758g4a4ezfp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjgql7r1gu758g4a4ezfp.png" alt=" " width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>containers</category>
      <category>sre</category>
    </item>
    <item>
      <title>AI-Native Platform Engineering: How OpenChoreo Brings MCP and an SRE Agent to Your Infrastructure</title>
      <dc:creator>Savi Saluwadana</dc:creator>
      <pubDate>Mon, 13 Apr 2026 01:35:05 +0000</pubDate>
      <link>https://dev.to/savi_saluwadana/ai-native-platform-engineering-how-openchoreo-brings-mcp-and-an-sre-agent-to-your-infrastructure-3i32</link>
      <guid>https://dev.to/savi_saluwadana/ai-native-platform-engineering-how-openchoreo-brings-mcp-and-an-sre-agent-to-your-infrastructure-3i32</guid>
      <description>&lt;p&gt;AI assistants have become a standard part of how developers write code. The next frontier is whether they can be trusted participants in how that code gets deployed, operated, and debugged.&lt;/p&gt;

&lt;p&gt;OpenChoreo, an open source IDP that recently entered the CNCF Sandbox, takes a clear position on this. AI is not a plugin or an afterthought. It is a first-class platform construct with the same authorization model, the same guardrails, and the same observability as every other part of the system.&lt;/p&gt;

&lt;p&gt;I contribute to the project, and in this post I want to walk through two specific capabilities: the MCP server integration that connects AI assistants to your platform, and the built-in RCA Agent that autonomously investigates production incidents.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why AI at the Platform Layer Is Different
&lt;/h2&gt;

&lt;p&gt;There is a meaningful difference between AI that helps you write code and AI that interacts with your running infrastructure.&lt;/p&gt;

&lt;p&gt;A code suggestion going wrong costs you a review cycle. A deployment action going wrong costs you an incident. The stakes are different and the design has to reflect that.&lt;/p&gt;

&lt;p&gt;OpenChoreo's approach is to expose AI interfaces that follow the same authorization policies as human users. When your AI assistant connects to the platform via MCP, it authenticates with OAuth2/OIDC and is subject to the same RBAC and ABAC policies as a human operator. It can only do what a human with the same role could do. No elevated permissions, no side doors.&lt;/p&gt;




&lt;h2&gt;
  
  
  The MCP Server Architecture
&lt;/h2&gt;

&lt;p&gt;OpenChoreo exposes two MCP servers.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Control Plane MCP server&lt;/strong&gt; gives your AI assistant access to platform management operations. The &lt;strong&gt;Observability Plane MCP server&lt;/strong&gt; gives it direct access to logs, metrics, traces, and alerts without proxying through the control plane.&lt;/p&gt;

&lt;p&gt;The two-server design is intentional. Observability data never flows through the control plane on its way to an AI assistant. In multi-regional or multi-tenant deployments this matters for data privacy and compliance. Each server is independently secured and independently queryable.&lt;/p&gt;

&lt;h3&gt;
  
  
  What your AI assistant can actually do
&lt;/h3&gt;

&lt;p&gt;Once connected, your AI assistant becomes an active participant in platform operations across five categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resource management&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;List namespaces, projects, components, and environments&lt;/li&gt;
&lt;li&gt;Inspect deployment pipelines and release bindings&lt;/li&gt;
&lt;li&gt;Check component status across environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Build and workflow operations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trigger workflow runs&lt;/li&gt;
&lt;li&gt;Inspect build status and history&lt;/li&gt;
&lt;li&gt;Query workflow logs&lt;/li&gt;
&lt;li&gt;Compare successful and failed builds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Observability queries&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fetch distributed logs with domain-aware filtering by namespace, project, and component&lt;/li&gt;
&lt;li&gt;Query metrics and check resource utilization&lt;/li&gt;
&lt;li&gt;Trace requests across service boundaries with &lt;code&gt;query_traces&lt;/code&gt; and &lt;code&gt;query_trace_spans&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Inspect active alerts and incidents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Deployment and promotion&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Update release bindings to promote components across environments&lt;/li&gt;
&lt;li&gt;Apply configuration changes to running deployments&lt;/li&gt;
&lt;li&gt;Roll back by pointing a binding at a previous release&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Resource optimization&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query resource metrics against actual allocation&lt;/li&gt;
&lt;li&gt;Get right-sizing recommendations&lt;/li&gt;
&lt;li&gt;Apply optimized configurations directly&lt;/li&gt;
&lt;/ul&gt;
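&lt;p&gt;The core of such a right-sizing recommendation can be sketched in a few lines (an illustrative heuristic with made-up numbers; OpenChoreo's actual recommendation algorithm is not shown here):&lt;/p&gt;

```python
def rightsize(usage_m: float, headroom: float = 1.3, floor_m: float = 50.0) -> float:
    """Recommend a CPU request (millicores): observed usage plus headroom, never below a floor."""
    return max(floor_m, usage_m * headroom)

# allocated request vs observed usage, in millicores (illustrative values)
services = {"frontend": (500, 120.0), "cartservice": (300, 40.0)}
for name, (allocated_m, usage_m) in services.items():
    rec = rightsize(usage_m)
    print(f"{name}: allocated {allocated_m}m, observed {usage_m}m, recommend {round(rec)}m")
```

&lt;p&gt;The assistant does the equivalent correlation for you: it pulls the metrics, applies the platform's recommendation logic, and can write the result back through &lt;code&gt;update_release_binding&lt;/code&gt;.&lt;/p&gt;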

&lt;h3&gt;
  
  
  Supported AI assistants
&lt;/h3&gt;

&lt;p&gt;Claude Code, Cursor, Codex CLI, Gemini CLI, OpenCode CLI, and VS Code with GitHub Copilot all work out of the box. Both browser-based OAuth (authorization code with PKCE) and client credentials flows are supported depending on your setup.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real Scenarios: What This Looks Like in Practice
&lt;/h2&gt;

&lt;p&gt;The docs ship with five hands-on MCP scenarios that show exactly how this works. Here are the ones worth understanding in detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Debugging a cascading failure
&lt;/h3&gt;

&lt;p&gt;This scenario uses the GCP Microservices Demo (Online Boutique). You intentionally break the product catalog service by scaling it to zero replicas. Then you use your AI assistant to diagnose the failure across service boundaries.&lt;/p&gt;

&lt;p&gt;The assistant works through the investigation using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;list_components          → find affected services
query_component_logs     → surface error patterns in logs
query_traces             → follow the request path across services
query_trace_spans        → pinpoint exactly where the failure propagates
get_release_binding      → inspect current deployment state
update_release_binding   → apply the fix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The entire investigation and remediation happens conversationally without leaving your editor. The assistant has the full observability context, not just a log dump.&lt;/p&gt;

&lt;h3&gt;
  
  
  Diagnosing a build failure
&lt;/h3&gt;

&lt;p&gt;You trigger a build with a misconfigured Dockerfile path in a Go service. The assistant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;list_workflow_runs        → find the failed run
get_workflow_run          → inspect the failure details
query_workflow_logs       → surface the exact error
create_workflow_run       → trigger a new build after the fix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Comparing against the previous successful build to identify what changed is a natural conversational step. The assistant has the history.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resource optimization
&lt;/h3&gt;

&lt;p&gt;You allocate excessive CPU and memory to several services in a demo deployment. The assistant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;list_components           → enumerate running services
list_release_bindings     → get current configurations
query_resource_metrics    → compare allocation vs actual usage
update_release_binding    → apply right-sized configurations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a genuinely useful operational workflow. It right-sizes based on actual usage data rather than educated guesses, and applies the changes directly, without a context switch to separate tooling.&lt;/p&gt;




&lt;h2&gt;
  
  
  The RCA Agent: Autonomous Incident Investigation
&lt;/h2&gt;

&lt;p&gt;Beyond the interactive MCP integration, OpenChoreo ships with a built-in RCA Agent. This is a different model. Instead of you asking the AI assistant to investigate something, the RCA Agent reacts autonomously when alerts fire.&lt;/p&gt;

&lt;h3&gt;
  
  
  How it works
&lt;/h3&gt;

&lt;p&gt;The RCA Agent is configured at the alert level. When you define an alert rule, you can set &lt;code&gt;triggerAiRca: true&lt;/code&gt;. When that alert fires in production, the agent immediately pulls logs, metrics, and traces from the affected deployments and generates a root cause analysis report.&lt;/p&gt;
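&lt;p&gt;As a hedged sketch, an alert rule with RCA enabled might look like this; only the &lt;code&gt;triggerAiRca&lt;/code&gt; field is confirmed by the docs, and the surrounding field names are illustrative assumptions:&lt;/p&gt;

```yaml
# Hypothetical alert-rule shape -- only `triggerAiRca` is confirmed;
# the other field names here are illustrative assumptions.
alert:
  name: checkout-error-rate-high
  severity: critical
  triggerAiRca: true   # fire the RCA Agent automatically when this alert triggers
```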

&lt;p&gt;The workflow is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Alert fires
    ↓
RCA Agent triggers automatically
    ↓
Agent pulls logs, metrics, traces from observability plane
    ↓
LLM analyzes the correlated signals
    ↓
Root cause analysis report generated
    ↓
Report available in the OpenChoreo portal and via the RCA chat interface
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No engineer needs to be the first one paging through dashboards. By the time someone picks up the incident, there is already a structured analysis waiting for them.&lt;/p&gt;

&lt;h3&gt;
  
  
  The RCA chat interface
&lt;/h3&gt;

&lt;p&gt;Beyond automatic reports, OpenChoreo ships an interactive RCA chat interface. You can query past incidents conversationally, ask follow-up questions about a specific report, and dig into the reasoning behind a root cause conclusion.&lt;/p&gt;

&lt;p&gt;This is the key design difference from just getting a wall of text. The report is a starting point for a conversation, not a terminal output.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setup
&lt;/h3&gt;

&lt;p&gt;The RCA Agent requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenChoreo Observability Plane with at least a logs module installed&lt;/li&gt;
&lt;li&gt;An LLM API key (currently OpenAI GPT model series, additional providers on the roadmap)&lt;/li&gt;
&lt;li&gt;Alerting configured with &lt;code&gt;triggerAiRca: true&lt;/code&gt; on the alerts you want covered&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Enable it via Helm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; openchoreo-observability-plane &lt;span class="se"&gt;\&lt;/span&gt;
  oci://ghcr.io/openchoreo/helm-charts/openchoreo-observability-plane &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--version&lt;/span&gt; 1.0.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; openchoreo-observability-plane &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--reuse-values&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; rca.enabled&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; rca.llm.modelName&lt;span class="o"&gt;=&lt;/span&gt;gpt-4o
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reports are stored in SQLite by default with a persistent volume. For production scale or horizontal scaling, PostgreSQL is supported as the report backend.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cost note:&lt;/strong&gt; The docs recommend enabling &lt;code&gt;triggerAiRca&lt;/code&gt; only for critical alerts to manage LLM costs. Every alert trigger is an LLM call.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Authorization Model Underneath All of This
&lt;/h2&gt;

&lt;p&gt;Both the MCP servers and the RCA Agent operate within OpenChoreo's unified authorization engine. This is worth understanding, because it is what makes exposing AI at the infrastructure layer safe.&lt;/p&gt;

&lt;p&gt;The authorization engine is powered by Apache Casbin and supports fine-grained RBAC, ABAC, and instance-level access controls down to the namespace, project, and component level.&lt;/p&gt;

&lt;p&gt;When your AI assistant connects via MCP it authenticates with OAuth2/OIDC and is granted a role that defines exactly what it can and cannot do. The RCA Agent authenticates via the &lt;code&gt;client_credentials&lt;/code&gt; grant and is assigned the &lt;code&gt;rca-agent&lt;/code&gt; role, scoped precisely to the operations it needs for incident analysis.&lt;/p&gt;

&lt;p&gt;The same policy model applies to humans and AI. Your AI assistant cannot do anything a human with equivalent permissions could not do. The guardrails are structural, not procedural.&lt;/p&gt;
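&lt;p&gt;The structural idea, a single policy check applied to every principal, can be sketched like this (a toy illustration of the concept, not Casbin's actual API or OpenChoreo's real policy set):&lt;/p&gt;

```python
# One policy table, one check -- human users and AI agents go through the same gate.
# Roles and tuples here are invented for illustration.
POLICIES = {
    ("developer", "component", "read"),
    ("developer", "workflow", "trigger"),
    ("rca-agent", "logs", "read"),
    ("rca-agent", "metrics", "read"),
}

def allowed(role: str, resource: str, action: str) -> bool:
    """Every request, regardless of who makes it, is answered from the same table."""
    return (role, resource, action) in POLICIES

# A human developer and an AI assistant holding the same role get identical answers.
assert allowed("developer", "workflow", "trigger")
assert not allowed("rca-agent", "workflow", "trigger")  # the agent is scoped to read-only telemetry
```

&lt;p&gt;The point of the sketch is the shape, not the contents: there is no second code path for AI principals, so there is nothing an agent can do that the equivalent human role cannot.&lt;/p&gt;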




&lt;h2&gt;
  
  
  What This Means for Platform Teams
&lt;/h2&gt;

&lt;p&gt;The practical implication of all of this is a shift in how platform operations work day to day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For developers:&lt;/strong&gt; Instead of opening five dashboards to understand why a build failed or why a service is returning errors, you ask your AI assistant. It has the context. It can correlate across logs, traces, and deployment state in a single conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For on-call engineers:&lt;/strong&gt; When an alert fires you are not starting from zero. The RCA Agent has already correlated the signals and generated a structured analysis. You start from a hypothesis, not a blank screen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For platform teams:&lt;/strong&gt; The same golden paths and authorization policies you define for human users apply to AI automatically. You do not need a separate AI governance model. The platform's existing model extends to cover it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Connect your AI assistant to a local OpenChoreo instance in about 15 minutes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run OpenChoreo locally with k3d following the quick start guide&lt;/li&gt;
&lt;li&gt;Connect your AI assistant using the MCP configuration in the docs&lt;/li&gt;
&lt;li&gt;Try the getting started scenario to verify the connection&lt;/li&gt;
&lt;li&gt;Work through the log analysis scenario to see the full observability integration&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI docs:&lt;/strong&gt; &lt;a href="https://openchoreo.dev/docs/ai/overview" rel="noopener noreferrer"&gt;openchoreo.dev/docs/ai/overview&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP scenarios:&lt;/strong&gt; &lt;a href="https://openchoreo.dev/docs/ai/mcp-prompt-scenarios" rel="noopener noreferrer"&gt;openchoreo.dev/docs/ai/mcp-prompt-scenarios&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RCA Agent setup:&lt;/strong&gt; &lt;a href="https://openchoreo.dev/docs/ai/rca-agent" rel="noopener noreferrer"&gt;openchoreo.dev/docs/ai/rca-agent&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/openchoreo/openchoreo" rel="noopener noreferrer"&gt;github.com/openchoreo/openchoreo&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project is fully open source under CNCF governance. If you are building in the platform engineering or AI tooling space, contributions and feedback are very welcome.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>mcp</category>
      <category>sre</category>
    </item>
    <item>
      <title>How OpenChoreo's Multi-Plane Architecture Works Under the Hood</title>
      <dc:creator>Savi Saluwadana</dc:creator>
      <pubDate>Mon, 13 Apr 2026 01:07:42 +0000</pubDate>
      <link>https://dev.to/savi_saluwadana/how-openchoreos-multi-plane-architecture-works-under-the-hood-235o</link>
      <guid>https://dev.to/savi_saluwadana/how-openchoreos-multi-plane-architecture-works-under-the-hood-235o</guid>
      <description>&lt;p&gt;Building an internal developer platform on Kubernetes involves a lot of moving pieces. CI pipelines, GitOps, observability, a developer portal, network policies, access control. Each of these is a solved problem in isolation. The interesting challenge is how you design the system that holds them together in a way that stays maintainable, scalable, and operable as your organisation grows.&lt;/p&gt;

&lt;p&gt;OpenChoreo approaches this by designing the platform as a modular, multi-plane system from the ground up, where each concern has a dedicated home, a clear API surface, and an independent lifecycle. I contribute to the project, and in this post I want to walk through the architecture in detail, plane by plane, so you understand not just what each piece does but why the separation exists and what it gives you operationally.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Idea: Planes, Not Monoliths
&lt;/h2&gt;

&lt;p&gt;OpenChoreo uses a clear separation of concerns across multiple planes, each responsible for specific aspects of the platform's functionality. It also uses a modular framework that allows external tools to be integrated as first-class experiences in the platform rather than just being bolted on.&lt;/p&gt;

&lt;p&gt;There are four planes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Plane&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Control Plane&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The brain. Orchestrates everything else.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Plane&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Where your workloads actually run.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Workflow Plane&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Where CI pipelines and automation execute.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Observability Plane&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Where logs, metrics, and traces are collected and queried.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each plane is independently deployable, independently scalable, and has its own upgrade lifecycle. In development you can run all of them in a single cluster using namespace isolation. In production each typically lives in its own cluster. The separation is not forced on you from day one, but it is designed to be the natural growth path.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Control Plane
&lt;/h2&gt;

&lt;p&gt;The control plane is a Kubernetes cluster that acts as the brain of OpenChoreo. It runs a central control loop that continuously monitors the state of the platform and developer resources. It takes actions to ensure that the desired state as declared via the Developer and Platform APIs is reflected in the actual state across all planes.&lt;/p&gt;

&lt;p&gt;It has three key components inside it.&lt;/p&gt;

&lt;h3&gt;
  
  
  API Server
&lt;/h3&gt;

&lt;p&gt;The API Server exposes the OpenChoreo API, which is used by both developers and platform teams to interact with the system. It serves as the main entry point for all API requests, handling authentication, authorization, and request validation. The API server also hosts OpenChoreo's authorization engine, which provides fine-grained RBAC, ABAC, and hierarchical instance-level access control to all resources created in OpenChoreo.&lt;/p&gt;

&lt;p&gt;The authorization engine is powered by Apache Casbin. It works by mapping groups from your Identity Provider to roles and authorization policies in OpenChoreo. The same authorization layer applies whether you are using the UI, CLI, API, or MCP servers. One policy model, consistent everywhere.&lt;/p&gt;

&lt;h3&gt;
  
  
  Controller Manager
&lt;/h3&gt;

&lt;p&gt;A set of Kubernetes controllers that implement the core reconciliation logic of the platform. These controllers watch for changes to the CRD instances defined in the Developer and Platform APIs and take appropriate actions to ensure that the desired state is achieved across all planes.&lt;/p&gt;

&lt;p&gt;For example, when a new Component is created, the controllers will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Validate the request&lt;/li&gt;
&lt;li&gt;Resolve any references such as dependencies of components&lt;/li&gt;
&lt;li&gt;Trigger the necessary workflows to build, deploy, and expose the component in the data plane with the required network policies and observability configurations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Everything in OpenChoreo is declarative. You declare what you want. The controller manager makes it happen.&lt;/p&gt;
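&lt;p&gt;The declarative model boils down to a level-triggered reconcile loop: diff desired state against actual state and act on the difference. A minimal sketch of that pattern (a toy diff, not OpenChoreo's controller code):&lt;/p&gt;

```python
def reconcile(desired: dict, actual: dict) -> list:
    """One pass of a level-triggered reconcile: diff desired vs actual, emit actions."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))       # declared but not running
        elif actual[name] != spec:
            actions.append(("update", name))       # running but drifted from the spec
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))       # running but no longer declared
    return actions

desired = {"frontend": {"replicas": 2}, "cart": {"replicas": 1}}
actual = {"frontend": {"replicas": 1}, "legacy": {"replicas": 1}}
print(reconcile(desired, actual))
```

&lt;p&gt;Real controllers run this continuously against the cluster, so drift is corrected whether it came from a user change, a failure, or an out-of-band edit.&lt;/p&gt;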

&lt;h3&gt;
  
  
  Cluster Gateway
&lt;/h3&gt;

&lt;p&gt;All other planes establish outbound connections to the control plane. This system component acts as the hub that allows the API Server and Controller Manager to communicate with the other planes in a hub-and-spoke model. It exposes a secure WebSocket API that allows bidirectional communication with the other planes via long-lived connections, authenticated with mTLS certificates issued by cert-manager. This prevents the Kubernetes API servers of the data, workflow, and observability planes from being exposed to the internet.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Security note:&lt;/strong&gt; None of the other planes need to expose their Kubernetes API servers publicly. They call out to the control plane, not the other way around. The communication is mTLS-authenticated and runs over long-lived secure WebSocket connections.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Platform API and Developer API
&lt;/h2&gt;

&lt;p&gt;The control plane exposes two distinct API surfaces, and understanding the difference between them is key to understanding how OpenChoreo separates platform concerns from developer concerns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Platform API
&lt;/h3&gt;

&lt;p&gt;The Platform API is a set of Kubernetes CRDs that allow platform builders to define the structure and behaviour of the platform itself. It provides abstractions for defining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Organizational boundaries (Namespaces)&lt;/li&gt;
&lt;li&gt;Environments&lt;/li&gt;
&lt;li&gt;Data Planes, Workflow Planes, and Observability Planes&lt;/li&gt;
&lt;li&gt;Deployment Pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Platform engineers work here. They define environments, configure deployment pipelines, set up gateway topologies, and create reusable ComponentTypes and Traits that become the golden paths developers use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Developer API
&lt;/h3&gt;

&lt;p&gt;The Developer API is a set of Kubernetes CRDs designed to simplify, streamline, and reduce the cognitive burden of application development on Kubernetes for development teams. Instead of exposing the entire configuration surface of the Kubernetes API, these abstractions provide a more intuitive and domain-driven way to define projects, their components, and their interactions via endpoints and dependencies.&lt;/p&gt;

&lt;p&gt;OpenChoreo avoids black-box abstractions that completely obscure Kubernetes. Instead, these abstractions let platform teams create opinionated, reusable templates that encode organizational best practices and standards as intent-driven interfaces for their development teams. This &lt;strong&gt;shift-down approach&lt;/strong&gt; reduces developer cognitive load by offloading complexity to the platform.&lt;/p&gt;

&lt;p&gt;A developer declares intent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# I want to deploy this component&lt;/span&gt;
&lt;span class="c1"&gt;# I want to expose this endpoint publicly&lt;/span&gt;
&lt;span class="c1"&gt;# I want to depend on this other service&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The platform compiles that intent into whatever Kubernetes resources are needed without the developer touching a single &lt;code&gt;NetworkPolicy&lt;/code&gt; or &lt;code&gt;HTTPRoute&lt;/code&gt; directly.&lt;/p&gt;
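&lt;p&gt;In simplified form (the field names here are illustrative rather than the exact schema), that declared intent might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: openchoreo.dev/v1alpha1   # illustrative API group
kind: Component
metadata:
  name: checkout-service
spec:
  type: Service
  workload:
    image: ghcr.io/example/checkout:1.4.2   # hypothetical image
  endpoints:
    - name: http
      port: 8080
      visibility: Public          # "expose this endpoint publicly"
  dependencies:
    - component: cart-service     # "depend on this other service"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;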




&lt;h2&gt;
  
  
  The Experience Plane
&lt;/h2&gt;

&lt;p&gt;Sitting across all of this is the experience plane, the user-facing layer. It includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenAPI-v3-based APIs&lt;/strong&gt; exposed by the control plane and observability plane&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CLI&lt;/strong&gt; (&lt;code&gt;occ&lt;/code&gt;) supporting both API server mode and file system mode for GitOps-driven workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backstage-based Internal Developer Portal&lt;/strong&gt; — an extended fork supporting native Backstage plugins and custom plugins built specifically for OpenChoreo's APIs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP servers&lt;/strong&gt; for AI-assisted development and operations, exposed by both the control plane and the observability plane&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The MCP servers mean your AI assistant can interact with the platform using the same authorization model as human users. Claude Code, Cursor, Codex, and Gemini CLI are all supported out of the box.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Data Plane
&lt;/h2&gt;

&lt;p&gt;A data plane is a Kubernetes cluster responsible for running component workloads, enforcing network policies, exposing component endpoints via a structured gateway topology, and wiring up dependencies as instructed by the control plane.&lt;/p&gt;

&lt;p&gt;An OpenChoreo deployment can have one or more data planes spanning clusters in different geographies and infrastructure providers. A component can be promoted across physically separated environments like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dev (data plane 1) → staging (data plane 1) → production (data plane 2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each promotion applies environment-specific configurations and secrets automatically.&lt;/p&gt;
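&lt;p&gt;For illustration (this is a hypothetical sketch, not the exact OpenChoreo schema), per-environment overrides might be declared once and resolved at promotion time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical sketch: the platform injects the matching values
# when the component is promoted into each environment.
environmentOverrides:
  staging:
    env:
      - name: LOG_LEVEL
        value: debug
  production:
    env:
      - name: LOG_LEVEL
        value: warn
    secretRefs:
      - payment-gateway-credentials   # resolved from the production secret store
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;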

&lt;h3&gt;
  
  
  Cells: The Runtime Boundary
&lt;/h3&gt;

&lt;p&gt;At runtime, resources of a project are isolated through &lt;strong&gt;Cells&lt;/strong&gt; — secure, isolated, and observable boundaries for all components belonging to a given &lt;code&gt;namespace-project-environment&lt;/code&gt; combination. A Cell becomes the runtime boundary for a group of components, with policy enforcement and observability applied at that boundary. This aligns with Cell-Based Architecture, where individual teams or domains operate independently within well-defined boundaries while still benefiting from shared infrastructure capabilities.&lt;/p&gt;

&lt;p&gt;Each Cell has a structured gateway topology covering all four traffic directions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Direction&lt;/th&gt;
&lt;th&gt;Handles&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;External Ingress&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Traffic from the internet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Internal Ingress&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Traffic from other cells or the internal network&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;External Egress&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Outbound traffic to external services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Internal Egress&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Outbound traffic to other cells&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Cilium and eBPF enforce network policies at every boundary.&lt;/p&gt;
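&lt;p&gt;The generated policies are conceptually similar to a CiliumNetworkPolicy like the following (the labels are illustrative, not the ones OpenChoreo actually emits):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: cell-internal-ingress
spec:
  endpointSelector:
    matchLabels:
      cell: online-boutique-production       # illustrative cell label
  ingress:
    - fromEndpoints:
        - matchLabels:
            cell: online-boutique-production # same-cell traffic is allowed
    - fromEndpoints:
        - matchLabels:
            role: internal-gateway           # cross-cell traffic enters via the gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;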

&lt;h3&gt;
  
  
  Data Plane Modules
&lt;/h3&gt;

&lt;p&gt;Optional modules extend data plane capabilities without touching core platform logic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API management module&lt;/strong&gt; — rate limiting, authentication, and observability at the endpoint level&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Elastic module&lt;/strong&gt; — automatic scale-to-zero based on traffic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guard module&lt;/strong&gt; — Cilium CNI and eBPF for zero-trust network policies and kernel-level observability&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Workflow Plane
&lt;/h2&gt;

&lt;p&gt;A workflow plane is a Kubernetes cluster responsible for executing platform-defined workflows. OpenChoreo has two categories of workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CI workflows&lt;/strong&gt; — developer self-service for building, testing, and deploying components&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generic workflows&lt;/strong&gt; — all other automation including GitOps workflows, resource provisioning, and custom platform team workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The default workflow module is powered by &lt;strong&gt;Argo Workflows&lt;/strong&gt;, a Kubernetes-native workflow engine. OpenChoreo's workflow concepts are designed to work with any CRD-based workflow engine so you can customise the Workflow Plane to use an alternative like Tekton.&lt;/p&gt;
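&lt;p&gt;A minimal build workflow in Argo's own terms might look like the sketch below (the builder image, repository, and registry are placeholders; OpenChoreo wraps this kind of definition in its own workflow CRDs):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: build-checkout-
spec:
  entrypoint: build
  templates:
    - name: build
      container:
        image: gcr.io/kaniko-project/executor:latest         # example builder
        args:
          - --context=git://github.com/example/checkout.git  # hypothetical repo
          - --destination=registry.example.com/checkout:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;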

&lt;p&gt;The workflow plane is also &lt;strong&gt;optional&lt;/strong&gt;. If you already have GitHub Actions, GitLab CI, or Jenkins, you can keep using them alongside it. A common pattern is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Git provider native CI  →  pre-PR-merge checks
OpenChoreo Workflow Plane  →  final build and deploy on PR merge
Generic workflows  →  GitOps, integration tests, post-deployment checks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Observability Plane
&lt;/h2&gt;

&lt;p&gt;An observability plane is a Kubernetes cluster responsible for providing centralized logs, metrics, traces, and alerts. It acts as a central data sink, collecting and aggregating observability data from all other workflow and data planes.&lt;/p&gt;

&lt;p&gt;Unlike the other planes, the observability plane exposes its own &lt;strong&gt;Observer API&lt;/strong&gt; and &lt;strong&gt;MCP server&lt;/strong&gt; directly. This design prevents observability data from being proxied through the control plane to end-users, which can be a concern in larger multi-regional, multi-tenant deployments where regional data privacy regulations may apply.&lt;/p&gt;

&lt;h3&gt;
  
  
  Default Observability Modules
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Module&lt;/th&gt;
&lt;th&gt;Powered By&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Logs&lt;/td&gt;
&lt;td&gt;OpenSearch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metrics&lt;/td&gt;
&lt;td&gt;Prometheus&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tracing&lt;/td&gt;
&lt;td&gt;OpenTelemetry collector + OpenSearch backend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alerting&lt;/td&gt;
&lt;td&gt;Built into logs and metrics modules&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All of these are swappable. If you have an existing observability system such as Datadog, Splunk, New Relic, or Grafana Cloud, OpenChoreo's adapter pattern allows a minimal observability plane to plug into an external system's API while still providing the same domain-centric Observer API and MCP servers across the unified experience plane.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deployment Topologies
&lt;/h2&gt;

&lt;p&gt;OpenChoreo supports three main topology patterns:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Topology&lt;/th&gt;
&lt;th&gt;When to use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Single cluster&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Development, testing, local k3d setup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Plane-per-cluster&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Production, full fault isolation, independent scaling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hybrid&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Co-locate Control + Workflow for cost or operational efficiency&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The architecture supports all of these without a redesign. The natural growth path is single-cluster locally, then namespace-isolated production, then splitting out planes as load and compliance requirements demand.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why the Separation Matters Operationally
&lt;/h2&gt;

&lt;p&gt;The multi-plane design has direct operational consequences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Independent upgrade lifecycles.&lt;/strong&gt; You can update the observability stack without touching the control plane. You can add a data plane in a new region without changing your workflow setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Independent scaling.&lt;/strong&gt; A heavy CI workload on the workflow plane does not compete with production workloads on the data plane. Observability ingestion spikes do not impact control plane availability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clear security boundaries.&lt;/strong&gt; The Kubernetes API servers of data, workflow, and observability planes are never exposed externally. All communication flows outbound through mTLS-authenticated websocket connections to the control plane's cluster gateway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Native GitOps.&lt;/strong&gt; Because all state is declarative Kubernetes CRDs, the entire platform is GitOps-compatible from day one. Platform topology, developer applications, deployment pipelines — all of it can be version controlled and reconciled from Git.&lt;/p&gt;
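&lt;p&gt;For example, a standard Argo CD Application (the repository URL and path are placeholders) can reconcile OpenChoreo CRDs like any other Kubernetes resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-config
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/platform-config.git  # hypothetical repo
    targetRevision: main
    path: environments                      # directory of OpenChoreo CRDs
  destination:
    server: https://kubernetes.default.svc  # the cluster hosting the control plane
    namespace: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;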

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;The full architecture runs locally on k3d in about 10 minutes. The quick start guide walks you through it step by step.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://openchoreo.dev/docs/overview/architecture" rel="noopener noreferrer"&gt;openchoreo.dev/docs/overview/architecture&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/openchoreo/openchoreo" rel="noopener noreferrer"&gt;github.com/openchoreo/openchoreo&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to go deeper on any specific plane or the runtime model around Cells, happy to dig into that in the comments. And if you are interested in contributing, the project is fully open source under CNCF governance.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>platformengineering</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Why Your Platform Team Became the Bottleneck — And How to Fix It</title>
      <dc:creator>Savi Saluwadana</dc:creator>
      <pubDate>Wed, 11 Mar 2026 09:10:21 +0000</pubDate>
      <link>https://dev.to/savi_saluwadana/why-your-platform-team-became-the-bottleneck-and-how-to-fix-it-4ok9</link>
      <guid>https://dev.to/savi_saluwadana/why-your-platform-team-became-the-bottleneck-and-how-to-fix-it-4ok9</guid>
      <description>&lt;p&gt;A practical look at why internal developer platforms fail at scale, and how Kubernetes, Backstage, GitOps, and Observability — when assembled the right way — can finally work together.&lt;/p&gt;

&lt;p&gt;There’s a quiet irony playing out inside engineering organizations right now. The team built to remove friction has become the source of it.&lt;/p&gt;

&lt;p&gt;Platform engineering was supposed to solve tool sprawl. Instead, it often created a new problem: the platform team can’t scale fast enough to meet business demand. Developers wait. Releases slip. Engineers burn out.&lt;/p&gt;

&lt;p&gt;This isn’t a people problem. It’s an approach problem.&lt;/p&gt;

&lt;h2&gt;The Hidden Cost of DIY Platforms&lt;/h2&gt;

&lt;p&gt;Every engineering org above a certain size eventually decides to build an Internal Developer Platform (IDP). The reasoning is sound: standardize tooling, reduce cognitive load, give developers a self-service experience.&lt;/p&gt;

&lt;p&gt;But the execution is where things unravel. Here’s what typically happens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DIY platforms take months to build and minutes to break&lt;/li&gt;
&lt;li&gt;Every new tool added increases cognitive load, not capability&lt;/li&gt;
&lt;li&gt;Maintenance consumes the bandwidth meant for innovation&lt;/li&gt;
&lt;li&gt;The platform becomes a product — one that nobody signed up to maintain forever&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Building an IDP from scratch in 2025 is a bit like writing your own database. Technically possible. Occasionally justified. Almost never the right first move.&lt;/p&gt;

&lt;h2&gt;Kubernetes: The Foundation the Industry Chose for a Reason&lt;/h2&gt;

&lt;p&gt;Let’s start with the elephant in the room. Kubernetes has a reputation for being complex — and that reputation isn’t entirely undeserved. But it’s also the most battle-tested, production-proven container orchestration platform in the world, and there’s a reason virtually every major cloud provider and enterprise has standardized on it.&lt;/p&gt;

&lt;p&gt;Kubernetes isn’t complex because it was poorly designed. It’s complex because it solves genuinely hard problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scheduling and running workloads across heterogeneous infrastructure reliably&lt;/li&gt;
&lt;li&gt;Self-healing systems that automatically recover from failure&lt;/li&gt;
&lt;li&gt;Declarative configuration that makes infrastructure auditable and reproducible&lt;/li&gt;
&lt;li&gt;A rich ecosystem of extensions, operators, and tooling built over a decade of community investment&lt;/li&gt;
&lt;li&gt;Multi-cloud and hybrid portability that prevents vendor lock-in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The CNCF ecosystem that has grown around Kubernetes is extraordinary. Service meshes, policy engines, secret management, cost optimization, progressive delivery — the community has built mature, production-ready solutions for virtually every platform concern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Kubernetes won:&lt;/strong&gt; Kubernetes didn’t win because it was the simplest option. It won because it was the right abstraction at the right level — flexible enough to run anything, opinionated enough to provide a stable foundation. Over 5 million developers now use it in production. That network effect and community investment is irreplaceable.&lt;/p&gt;

&lt;p&gt;The real challenge isn’t Kubernetes itself — it’s that raw Kubernetes requires platform teams to make hundreds of decisions before a single developer can deploy an application. Networking, RBAC, namespacing, resource quotas, ingress controllers, secrets management, CI/CD integration — all of it needs to be configured, secured, and maintained.&lt;/p&gt;

&lt;p&gt;The answer isn’t to avoid Kubernetes. It’s to build the right abstractions on top of it.&lt;/p&gt;

&lt;h2&gt;Backstage: The Developer Portal That Changed Everything&lt;/h2&gt;

&lt;p&gt;Spotify built Backstage to solve a problem that every fast-growing engineering organization eventually hits: as the number of services, teams, and tools grows, developers spend more time navigating complexity than building product.&lt;/p&gt;

&lt;p&gt;Backstage’s insight was elegant — give every service, API, pipeline, and piece of documentation a home. A single place where developers can discover what exists, understand who owns it, see its health, and take action on it.&lt;/p&gt;

&lt;p&gt;Since Spotify open-sourced it and donated it to the CNCF, Backstage has become the de facto standard for internal developer portals. The reasons are clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A software catalog that gives every component, API, and resource a consistent identity and owner&lt;/li&gt;
&lt;li&gt;TechDocs that brings documentation into the same workflow as the code it describes&lt;/li&gt;
&lt;li&gt;A plugin ecosystem with hundreds of community-built integrations covering CI/CD, cloud providers, monitoring, incident management, and more&lt;/li&gt;
&lt;li&gt;Scaffolder templates that let platform teams encode golden paths — so developers spin up new services the right way, every time&lt;/li&gt;
&lt;li&gt;Search that makes the entire engineering knowledge base discoverable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Backstage effect:&lt;/strong&gt; Organizations that have fully adopted Backstage report dramatic reductions in onboarding time for new engineers — sometimes from weeks to days. When every service has a home, when runbooks are one click away, and when deployments are self-service, developers spend their time on what matters: building.&lt;/p&gt;

&lt;p&gt;But here’s the honest truth about Backstage: it’s a framework, not a finished product. Getting from ‘installed Backstage’ to ‘developers actually love using it’ requires significant investment in catalog population, plugin configuration, and workflow integration. Many teams underestimate this.&lt;/p&gt;

&lt;p&gt;This is where having a platform that pre-integrates Backstage — with sensible defaults and pre-built workflows — removes the biggest barrier to adoption.&lt;/p&gt;

&lt;h2&gt;The Stack Is There. The Glue Is Missing.&lt;/h2&gt;

&lt;p&gt;The cloud-native ecosystem has matured remarkably. Every ingredient for a world-class developer platform already exists:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Kubernetes&lt;/strong&gt; — the production-proven foundation trusted by millions of engineering teams worldwide&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Backstage&lt;/strong&gt; — the CNCF-hosted developer portal that gives every service a home and every developer a front door&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GitOps&lt;/strong&gt; — declarative, auditable, automated delivery pipelines that treat infrastructure as code&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Observability&lt;/strong&gt; — distributed tracing, metrics, and logging that give teams real-time confidence in what they’ve built&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren’t new ideas. They’re proven. The challenge isn’t picking the right tools — it’s wiring them together reliably, at scale, for every team in the organization. That’s where orgs consistently lose months.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; 80% of every internal developer platform is identical across organizations. The same Kubernetes abstractions. The same Backstage configuration. The same GitOps patterns. The same observability setup. Yet every team rebuilds it from scratch — costing thousands of engineering hours solving problems that are already solved.&lt;/p&gt;

&lt;h2&gt;What Good Platform Engineering Actually Looks Like&lt;/h2&gt;

&lt;p&gt;The platform teams winning right now aren’t the ones who built everything themselves. They’re the ones who had the wisdom to start with the right abstractions — and spent their energy on the 20% that’s specific to their business.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Giving developers a self-service experience without writing a portal from scratch&lt;/li&gt;
&lt;li&gt;Standardizing deployment workflows without reinventing GitOps&lt;/li&gt;
&lt;li&gt;Running on Kubernetes with guardrails that make sense for application teams&lt;/li&gt;
&lt;li&gt;Shipping observability that surfaces the right signals — not more dashboards to ignore&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is to stop configuring and start enabling.&lt;/p&gt;

&lt;h2&gt;OpenChoreo: Built for Platform Teams Who Are Done Reinventing the Wheel&lt;/h2&gt;

&lt;p&gt;This is exactly the problem OpenChoreo was built to solve.&lt;/p&gt;

&lt;p&gt;OpenChoreo is an open-source developer platform that gives platform engineers production-ready abstractions for Kubernetes — with a Backstage-powered developer portal, GitOps workflows, and observability built in. It takes the tools the industry has already standardized on and assembles them with the right defaults, so teams can go from zero to productive without months of configuration work.&lt;/p&gt;

&lt;p&gt;It’s the 80% out of the box, so teams can focus on what actually differentiates their business.&lt;/p&gt;

&lt;p&gt;What you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Backstage-powered portal that gives developers a single pane of glass — without weeks of configuration&lt;/li&gt;
&lt;li&gt;GitOps pipelines that work out of the box with the guardrails your teams actually need&lt;/li&gt;
&lt;li&gt;Kubernetes abstractions that shield developers from complexity without hiding it from platform engineers&lt;/li&gt;
&lt;li&gt;Built-in observability so you know what’s happening across every service, from day one&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Less plumbing. More platform. Faster delivery — for everyone.&lt;/p&gt;

&lt;h2&gt;The Bottom Line&lt;/h2&gt;

&lt;p&gt;Kubernetes is exceptional. Backstage is transformative. GitOps is the right model for modern delivery. Observability is non-negotiable. These tools earned their place at the center of the cloud-native ecosystem — not through marketing, but through real-world proof at scale.&lt;/p&gt;

&lt;p&gt;The opportunity in front of platform teams isn’t to replace them or work around them. It’s to assemble them better — with the right abstractions, the right defaults, and the right workflows — so that developers experience the best of what the cloud-native ecosystem has to offer without having to understand every layer of it.&lt;/p&gt;

&lt;p&gt;Platform engineering doesn’t have to be a bottleneck. The tools exist. The patterns are proven. What’s been missing is a way to bring them together without months of assembly work.&lt;/p&gt;

&lt;p&gt;If your platform team is spending more time on infrastructure plumbing than on enabling your developers — it might be time to reconsider the approach.&lt;/p&gt;

&lt;p&gt;🔗 Explore OpenChoreo — open source, built for platform teams: &lt;a href="https://openchoreo.dev/" rel="noopener noreferrer"&gt;https://openchoreo.dev/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⭐ Star the project on GitHub to help the community grow and keep the momentum going.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>cloud</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Savi Saluwadana</dc:creator>
      <pubDate>Fri, 06 Mar 2026 18:40:40 +0000</pubDate>
      <link>https://dev.to/savi_saluwadana/-gjn</link>
      <guid>https://dev.to/savi_saluwadana/-gjn</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/savi_saluwadana" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3661277%2Ff471b5be-55fc-401e-a772-8e8358c3cdc3.jpeg" alt="savi_saluwadana"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/savi_saluwadana/beyond-you-build-it-you-run-it-the-strategic-case-for-internal-developer-platforms-2g6l" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Beyond “You Build It, You Run It”: The Strategic Case for Internal Developer Platforms&lt;/h2&gt;
      &lt;h3&gt;Savi Saluwadana ・ Mar 1&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#kubernetes&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#devops&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#docker&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#cloud&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>docker</category>
      <category>cloud</category>
    </item>
    <item>
      <title>The End of YAML Fatigue: What Platform Engineering Actually Is (And Why You Need It)</title>
      <dc:creator>Savi Saluwadana</dc:creator>
      <pubDate>Sun, 01 Mar 2026 07:03:09 +0000</pubDate>
      <link>https://dev.to/savi_saluwadana/the-end-of-yaml-fatigue-what-platform-engineering-actually-is-and-why-you-need-it-k67</link>
      <guid>https://dev.to/savi_saluwadana/the-end-of-yaml-fatigue-what-platform-engineering-actually-is-and-why-you-need-it-k67</guid>
      <description>&lt;p&gt;Let’s be brutally honest for a second: "You build it, you run it" was a fantastic idea that accidentally turned into a nightmare.&lt;/p&gt;

&lt;p&gt;A decade ago, the DevOps movement promised to break down the wall between developers and operations. And it did! But in the process of shifting everything "left," we accidentally shifted the entire cognitive weight of the cloud-native ecosystem onto the shoulders of application developers.&lt;/p&gt;

&lt;p&gt;Suddenly, a frontend or backend engineer trying to ship a simple feature was expected to be an expert in Kubernetes manifests, Terraform state, IAM roles, CI/CD pipeline optimization, and Helm charts.&lt;/p&gt;

&lt;p&gt;We didn't empower developers; we just gave them a massive second job. The result? Developer burnout, fractured "shadow ops" workarounds, and a massive drop in shipping velocity.&lt;/p&gt;

&lt;p&gt;Enter Platform Engineering.&lt;/p&gt;

&lt;h2&gt;What is Platform Engineering?&lt;/h2&gt;

&lt;p&gt;If DevOps is the philosophy of collaboration, Platform Engineering is the product that makes it possible.&lt;/p&gt;

&lt;p&gt;Platform Engineering is the discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations in the cloud-native era.&lt;/p&gt;

&lt;p&gt;Instead of asking every product team to wire up their own infrastructure from scratch, a dedicated Platform Team builds an Internal Developer Platform (IDP).&lt;/p&gt;

&lt;p&gt;The IDP acts as an abstraction layer—an internal product built for developers. It provides paved roads (often called "Golden Paths") that give developers exactly what they need to ship their code, without requiring them to understand the intricate details of the underlying infrastructure.&lt;/p&gt;

&lt;p&gt;The "Product" Mindset is Everything&lt;br&gt;
The most critical difference between an old-school IT Operations team and a modern Platform Engineering team is product management.&lt;/p&gt;

&lt;p&gt;Platform Engineers treat the developers as their primary customers. The IDP is their flagship product.&lt;/p&gt;

&lt;p&gt;Old IT Ops: "File a Jira ticket and we will provision your database in 3 to 5 business days."&lt;/p&gt;

&lt;p&gt;Platform Engineering: "Here is a self-service CLI command or portal button that instantly spins up a compliant, monitored database attached to your environment."&lt;/p&gt;

&lt;p&gt;The platform team measures their success using product metrics: Developer adoption rates, time-to-first-commit, and the reduction of onboarding time for new engineers.&lt;/p&gt;

&lt;h2&gt;Platform Engineering vs. DevOps: What's the Difference?&lt;/h2&gt;

&lt;p&gt;It’s easy to confuse the two, but they serve different purposes in the modern stack.&lt;/p&gt;

&lt;p&gt;DevOps is a cultural shift. It’s about breaking silos, sharing responsibility, and focusing on continuous delivery.&lt;/p&gt;

&lt;p&gt;Site Reliability Engineering (SRE) is the implementation of DevOps focused on reliability, scaling, and incident response.&lt;/p&gt;

&lt;p&gt;Platform Engineering is the team that builds the vending machine. They package up the complex DevOps and SRE practices into a self-service platform so developers can consume them on demand.&lt;/p&gt;

&lt;h2&gt;The Golden Path: Freedom Through Standardization&lt;/h2&gt;

&lt;p&gt;Developers naturally hate being told what tools they have to use. So, how does Platform Engineering avoid becoming just another restrictive IT bottleneck? Through the concept of the Golden Path.&lt;/p&gt;

&lt;p&gt;A Golden Path is a highly opinionated, fully supported set of tools and workflows.&lt;/p&gt;

&lt;p&gt;If a developer chooses to use the Golden Path (e.g., standard Node.js microservice template deployed to the standard cluster), the platform handles everything for them: CI/CD, security scanning, ingress routing, logging, and metrics. It just works.&lt;/p&gt;

&lt;p&gt;However, a good platform doesn't block escape hatches. If a team has a highly specific use case that requires going off the Golden Path, they are free to do so—but they have to take on the operational burden of maintaining that custom setup themselves.&lt;/p&gt;

&lt;p&gt;The goal is to make the Golden Path so incredibly easy and frictionless that developers want to use it.&lt;/p&gt;

&lt;h2&gt;Why the Business Cares&lt;/h2&gt;

&lt;p&gt;From a DevRel perspective, we love Platform Engineering because it makes developers happier. But the business loves it because it directly impacts the bottom line:&lt;/p&gt;

&lt;p&gt;Reduced Cognitive Load: Developers stop acting as amateur infrastructure engineers and get back to writing business logic.&lt;/p&gt;

&lt;p&gt;Standardized Security: Compliance and security guardrails are baked into the platform dynamically, rather than checked manually after the fact.&lt;/p&gt;

&lt;p&gt;Faster Time-to-Market: "Lead time for changes" drops from weeks to minutes when self-service replaces ticket-ops.&lt;/p&gt;

&lt;h2&gt;The TL;DR&lt;/h2&gt;

&lt;p&gt;Platform Engineering isn't about taking power away from developers. It’s about taking away the toil. It’s the realization that forcing every engineer to be a Kubernetes expert is a terrible way to scale a software company.&lt;/p&gt;

&lt;p&gt;Build the paved road, treat your developers like your best customers, and watch your shipping velocity skyrocket.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>terraform</category>
      <category>backend</category>
    </item>
    <item>
      <title>Beyond “You Build It, You Run It”: The Strategic Case for Internal Developer Platforms</title>
      <dc:creator>Savi Saluwadana</dc:creator>
      <pubDate>Sun, 01 Mar 2026 04:23:18 +0000</pubDate>
      <link>https://dev.to/savi_saluwadana/beyond-you-build-it-you-run-it-the-strategic-case-for-internal-developer-platforms-2g6l</link>
      <guid>https://dev.to/savi_saluwadana/beyond-you-build-it-you-run-it-the-strategic-case-for-internal-developer-platforms-2g6l</guid>
      <description>




&lt;h2&gt;
  
  
  1. The Context: We Have a Cognitive Load Crisis
&lt;/h2&gt;

&lt;p&gt;The cloud-native revolution gave us scalability, but it also shifted the complexity burden directly onto developers. Today, feature developers are expected to be experts in Kubernetes, Terraform, IAM policies, and networking, all while writing business logic. We told them, "You build it, you run it," but what we actually meant was, "You build it, you configure the entire universe around it."&lt;/p&gt;

&lt;p&gt;This has created a paralyzing level of friction known as &lt;strong&gt;Cognitive Load&lt;/strong&gt;. The cost is measurable and severe:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Velocity Drops:&lt;/strong&gt; Developers spend 30–40% of their week fighting infrastructure instead of shipping code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shadow Ops:&lt;/strong&gt; When the "right way" is too hard, teams create ad-hoc, insecure workarounds just to get things done.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Burnout:&lt;/strong&gt; The pressure to be a "full-stack infrastructure architect" is driving top talent away.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The solution is not more training. The solution is abstraction—which is exactly what an Internal Developer Platform provides.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. What is an Internal Developer Platform (IDP)?
&lt;/h2&gt;

&lt;p&gt;An IDP is not a wiki or a portal. It is the &lt;strong&gt;Operating System&lt;/strong&gt; for your engineering organization. Technically, it acts as an abstraction layer that sits between developers and infrastructure. It unifies your fragmented toolchain (Cloud, CI/CD, Security, Monitoring) into a cohesive, self-service product.&lt;/p&gt;

&lt;h3&gt;The Core Philosophy: Product Thinking&lt;/h3&gt;

&lt;p&gt;A successful IDP treats the platform as a &lt;strong&gt;Product&lt;/strong&gt;, not a project. The developer becomes the customer, and friction reduction becomes the primary metric.&lt;/p&gt;

&lt;h3&gt;The Anatomy of an IDP&lt;/h3&gt;

&lt;p&gt;Think of the IDP as a machine with five distinct layers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Interface Layer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The only part developers touch, consisting of the Developer Portal (like Backstage) and a unified CLI.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Control Plane&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The "brain" that translates developer intent into concrete infrastructure commands.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Integration Layer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The glue connecting existing CI/CD pipelines, Identity Providers, and Registries.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resource Plane&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Raw infrastructure (Kubernetes, RDS) completely abstracted so developers consume managed services, not raw components.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Observability Layer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ensures every new service is born with logging, metrics, and tracing automatically wired up.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;3. The Strategy: Golden Paths &amp;amp; Abstraction&lt;/h2&gt;

&lt;p&gt;The IDP doesn't just offer tools; it enforces standardization through &lt;strong&gt;Golden Paths&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;Golden Paths: The Path of Least Resistance&lt;/h3&gt;

&lt;p&gt;A Golden Path is an officially sanctioned, fully supported workflow. It embodies a simple philosophy: &lt;strong&gt;The path of least resistance must also be the path of best practice.&lt;/strong&gt; The easiest way to build is also the secure, compliant, and scalable way. Developers get an "easy button," while organizations get consistency and governance.&lt;/p&gt;

&lt;h3&gt;The Power of Abstraction&lt;/h3&gt;

&lt;p&gt;The IDP separates developer intent from implementation details.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Developer Intent:&lt;/strong&gt; "I need a Postgres 14 database for my Payment Service in Production."&lt;br&gt;
&lt;strong&gt;IDP Implementation:&lt;/strong&gt; Automates provisioning, ingress, storage, IAM, and configuration entirely behind the scenes.&lt;/p&gt;
&lt;/blockquote&gt;
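&lt;p&gt;As a rough sketch of that translation (every name below is hypothetical, not a real IDP's API), a control plane might expand the one-line intent into the bundle of concrete settings the platform owns:&lt;/p&gt;

```python
# Hypothetical sketch: how a control plane might expand a developer's
# one-line intent into the concrete resources it implies. All names
# (DB_DEFAULTS, expand_intent) are illustrative, not a real IDP API.

DB_DEFAULTS = {
    "production": {"instance_class": "large", "multi_az": True, "backup_days": 30},
    "staging": {"instance_class": "small", "multi_az": False, "backup_days": 7},
}

def expand_intent(service: str, engine: str, version: str, env: str) -> dict:
    """Turn 'I need a Postgres 14 database for my Payment Service in
    Production' into the provisioning, networking, and IAM details
    that the platform team, not the developer, owns."""
    defaults = DB_DEFAULTS[env]
    return {
        "database": {"engine": engine, "version": version, **defaults},
        "network": {"ingress": f"allow-from:{service}", "public": False},
        "iam": {"role": f"{service}-{env}-db-access", "rotation_days": 90},
        "config": {"secret_name": f"{service}/{env}/db-credentials"},
    }

spec = expand_intent("payment-service", "postgres", "14", "production")
```

&lt;p&gt;The developer supplies only the four high-level arguments; everything else is the platform's opinionated default.&lt;/p&gt;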




&lt;h2&gt;4. One Platform, Two Perspectives&lt;/h2&gt;

&lt;p&gt;A successful IDP bridges the gap between two opposing mandates: &lt;strong&gt;Speed&lt;/strong&gt; and &lt;strong&gt;Control&lt;/strong&gt;. This creates a win-win scenario where the platform team owns the complexity so the feature team can own the product.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Perspective&lt;/th&gt;
&lt;th&gt;View of Platform&lt;/th&gt;
&lt;th&gt;Primary Focus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Developer (The Customer)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Vending Machine&lt;/td&gt;
&lt;td&gt;Velocity and autonomy. They want pre-configured resources on demand, without ever seeing a Terraform state file.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Platform Engineer (The Provider)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Control Plane&lt;/td&gt;
&lt;td&gt;Stability, security, and governance. They design templates, set guardrails, and manage clusters.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;5. The Business Case: Why We Need an IDP&lt;/h2&gt;

&lt;p&gt;Technically, an IDP is about abstraction. Commercially, it is about &lt;strong&gt;Operational Leverage&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;Solving the Speed Bottleneck: Automated Orchestration&lt;/h3&gt;

&lt;p&gt;The "Ticket-Ops" model is a manual, imperative workflow that increases Lead Time for Changes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Technical Mechanism:&lt;/strong&gt; The IDP replaces manual provisioning with Dynamic Configuration Management via an API/CLI. It dynamically generates low-level configurations (Kubernetes manifests, IAM roles) from a developer's high-level intent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Resolution:&lt;/strong&gt; Eliminates operations dependencies. Deployments become atomic and automated, cutting deployment lead time from days to minutes.&lt;/li&gt;
&lt;/ul&gt;
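&lt;p&gt;To make "Dynamic Configuration Management" concrete, here is a minimal illustrative sketch (the workload-spec field names are invented for this example) of how a high-level spec might be expanded into a standard Kubernetes Deployment manifest:&lt;/p&gt;

```python
# Illustrative sketch of Dynamic Configuration Management: a simplified
# workload spec (field names are hypothetical) is expanded into a
# standard Kubernetes Deployment manifest by the platform, not by hand.

def render_deployment(spec: dict) -> dict:
    """Generate a Kubernetes apps/v1 Deployment from a simplified spec."""
    name = spec["name"]
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "replicas": spec.get("replicas", 2),       # platform default
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {"containers": [{
                    "name": name,
                    "image": spec["image"],
                    "ports": [{"containerPort": spec.get("port", 8080)}],
                }]},
            },
        },
    }

# The developer writes two fields; the platform generates the rest.
manifest = render_deployment({"name": "checkout", "image": "registry.local/checkout:1.4.2"})
```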

&lt;h3&gt;Solving the CapEx Drain: Cognitive Load Abstraction&lt;/h3&gt;

&lt;p&gt;The financial loss of high cognitive load comes from developers managing low-level primitives instead of high-level abstractions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Technical Mechanism:&lt;/strong&gt; The Platform team defines Infrastructure as Code (IaC) standards encapsulated within the IDP's Golden Paths. Developers simply interact with a simplified "Workload Specification" schema.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Resolution:&lt;/strong&gt; Reduces the context-switching penalty. Recovering that lost capacity can redirect roughly 30% of engineering time back to feature development without increasing headcount.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Governance Without Bureaucracy&lt;/h3&gt;

&lt;p&gt;Security is enforced architecturally. If base images in the Golden Path are patched automatically, compliance is baked in by default without chasing down feature teams.&lt;/p&gt;




&lt;h2&gt;6. Core Capabilities of the IDP Ecosystem&lt;/h2&gt;

&lt;p&gt;While architecture provides the skeleton, these five capabilities provide the muscle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Service Catalog:&lt;/strong&gt; A curated marketplace of templates for instant resource access.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated CI/CD:&lt;/strong&gt; Standardized pipelines that handle build, test, and security scanning out of the box.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitOps Delivery Engine:&lt;/strong&gt; A declarative engine syncing Git state with the cluster, making deployments traceable and reversible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure Orchestrator:&lt;/strong&gt; The background engine running IaC to abstract provisioning details.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedded Observability:&lt;/strong&gt; Automatic instrumentation for instant logging and metrics.&lt;/li&gt;
&lt;/ul&gt;
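&lt;p&gt;The GitOps engine's core idea can be sketched in a few lines: diff the desired state declared in Git against the live cluster state and derive the actions that converge them. This illustrates the reconciliation pattern in general, not any specific tool's implementation:&lt;/p&gt;

```python
# Minimal sketch of GitOps reconciliation: compare the desired state
# declared in Git with the live cluster state and compute the actions
# needed to converge. State shapes here are illustrative.

def reconcile(desired: dict, live: dict) -> list:
    """Return the create/update/delete actions that move live toward desired."""
    actions = []
    for name, spec in desired.items():
        if name not in live:
            actions.append(("create", name))      # declared in Git, absent live
        elif live[name] != spec:
            actions.append(("update", name))      # drifted from Git
    for name in live:
        if name not in desired:
            actions.append(("delete", name))      # live but no longer declared
    return actions

desired = {"checkout": {"image": "checkout:1.4.2"}, "cart": {"image": "cart:2.0.0"}}
live = {"checkout": {"image": "checkout:1.4.1"}, "legacy-svc": {"image": "legacy:0.9"}}
plan = reconcile(desired, live)
```

&lt;p&gt;Because every action is derived from Git, each deployment is traceable to a commit and reversible with a revert.&lt;/p&gt;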




&lt;h2&gt;7. The Strategic Decision: Build, Buy, or Compose?&lt;/h2&gt;

&lt;p&gt;For leadership, the decision is about the most efficient path to value.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;The Reality / Challenge&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Option A: Build In-House&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Building the platform from scratch for total control.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;The Trap:&lt;/strong&gt; You become an "Accidental Software Company," diverting resources to internal tools rather than revenue products.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Option B: Buy Commercial&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Purchasing a pre-built SaaS solution for speed.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;The Trap:&lt;/strong&gt; Rigidity at scale, vendor lock-in, and the hidden need for highly skilled engineers to manage integration.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Option C: Hybrid (Recommended)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Utilizing a foundational framework like &lt;strong&gt;OpenChoreo&lt;/strong&gt;.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;The Benefit:&lt;/strong&gt; Eliminates the binary choice. No building from scratch, and no SaaS black boxes.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;How OpenChoreo Achieves This&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Interface:&lt;/strong&gt; Adopts the Backstage portal interface for a standard "single pane of glass."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Engine:&lt;/strong&gt; Orchestrates a robust ecosystem of Open Source and CNCF tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Result:&lt;/strong&gt; You own a transparent, extensible platform without the technical debt of building or the lock-in of buying.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;8. Why IDPs Fail&lt;/h2&gt;

&lt;h3&gt;1. The Challenge of Opaque Abstraction&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mechanism:&lt;/strong&gt; Simplifying UI while suppressing the raw output of underlying engines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure Scenario &amp;amp; Impact:&lt;/strong&gt; When standard errors occur (e.g., ImagePullBackOff), developers receive a generic "Deployment Failed" message. With the actual logs hidden, MTTR increases and dependency on the platform team rises.&lt;/li&gt;
&lt;/ul&gt;
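&lt;p&gt;A transparent abstraction avoids this by summarizing known failure reasons while always preserving the raw error. Here is a minimal sketch (the mapping and message texts are invented for illustration):&lt;/p&gt;

```python
# Sketch of transparent abstraction: translate a raw Kubernetes waiting
# reason into an actionable hint while keeping the raw detail intact,
# instead of collapsing everything into "Deployment Failed".
# The reason-to-hint mapping below is illustrative only.

KNOWN_REASONS = {
    "ImagePullBackOff": "The container image could not be pulled. Check the image tag and registry credentials.",
    "CrashLoopBackOff": "The container keeps crashing after start. Check application logs for the failing process.",
    "OOMKilled": "The container exceeded its memory limit. Raise the limit or reduce memory use.",
}

def explain_failure(reason: str, raw_message: str) -> dict:
    """Pair a friendly hint with the untouched raw error; never hide it."""
    hint = KNOWN_REASONS.get(reason, "Unrecognized failure; inspect the raw message below.")
    return {"reason": reason, "hint": hint, "raw": raw_message}

report = explain_failure("ImagePullBackOff", 'Back-off pulling image "registry.local/checkout:1.4.9"')
```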

&lt;h3&gt;2. Disrupted "Inner Loop" Latency&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mechanism:&lt;/strong&gt; Enforcing "Remote-First" workflows that require a full CI/CD run for minor local changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure Scenario &amp;amp; Impact:&lt;/strong&gt; A 2-second hot reload turns into a 15-minute wait. Developers bypass the platform for localhost testing, causing environment drift and "works on my machine" issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;3. Proprietary DSL Fatigue&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mechanism:&lt;/strong&gt; Inventing custom configuration schemas instead of using industry standards.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure Scenario &amp;amp; Impact:&lt;/strong&gt; Developers are forced to learn syntax unique to the company. Fearing "resume rot," they resist platform adoption.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;4. The 80/20 Constraint: Insufficient Extensibility&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mechanism:&lt;/strong&gt; Optimizing solely for standard stateless microservices with rigid constraints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure Scenario &amp;amp; Impact:&lt;/strong&gt; Non-standard workloads (legacy apps, specialized databases) hit a dead end with no escape hatch, fostering "Shadow IT."&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;5. Imperative Interventions in a Declarative Workflow&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mechanism:&lt;/strong&gt; Allowing manual UI "ClickOps" changes that do not sync back to Git.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure Scenario &amp;amp; Impact:&lt;/strong&gt; Creates a "split-brain" state. The next automated deployment silently overwrites manual interventions, eroding trust in the platform.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;9. The Future: Where IDPs are Going&lt;/h2&gt;

&lt;p&gt;The IDP is becoming the standard operating system for the digital economy, defined by Composition and Intelligence:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Rise of Composable Frameworks:&lt;/strong&gt; The DIY era is over. Industry standards are shifting toward enterprise wrappers and pre-configured foundations (like OpenChoreo) so teams can focus on value-add plugins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;From Manual Interaction to Conversations:&lt;/strong&gt; Operations are becoming Intent-Based. Developers will interact with AI chatbots to deploy resources or troubleshoot failures, shifting operations to natural dialogue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The 'Product Platform Engineer':&lt;/strong&gt; Platform Engineering is maturing into a product management discipline. Success is measured by organizational velocity and developer satisfaction, not just uptime.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Role of AI:&lt;/strong&gt; AI will serve as the platform's central nervous system, predicting bottlenecks, suggesting optimizations, and automating root-cause analysis based on deep telemetry context.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;10. Conclusion&lt;/h2&gt;

&lt;p&gt;The Internal Developer Platform (IDP) has graduated from a luxury to a strategic necessity. By transitioning from "ticket-based" operations to a self-service product mindset, organizations solve the cognitive load crisis crippling velocity. An IDP is an organizational contract that promises developers autonomy while guaranteeing security and standardization.&lt;/p&gt;

&lt;p&gt;Whether built, bought, or adopted, the mandate is clear: &lt;strong&gt;Abstract the complexity, pave the golden paths, and let your engineers do what they were hired to do—innovate!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>docker</category>
      <category>cloud</category>
    </item>
  </channel>
</rss>
