Building an internal developer platform on Kubernetes involves a lot of moving pieces. CI pipelines, GitOps, observability, a developer portal, network policies, access control. Each of these is a solved problem in isolation. The interesting challenge is how you design the system that holds them together in a way that stays maintainable, scalable, and operable as your organisation grows.
OpenChoreo approaches this by designing the platform as a modular, multi-plane system from the ground up, where each concern has a dedicated home, a clear API surface, and an independent lifecycle. I contribute to the project, and in this post I want to walk through the architecture in detail, plane by plane, so you understand not just what each piece does but why the separation exists and what it gives you operationally.
The Core Idea: Planes, Not Monoliths
OpenChoreo uses a clear separation of concerns across multiple planes, each responsible for specific aspects of the platform's functionality. It also uses a modular framework that allows external tools to be integrated as first-class experiences in the platform rather than just being bolted on.
There are four planes:
| Plane | Responsibility |
|---|---|
| Control Plane | The brain. Orchestrates everything else. |
| Data Plane | Where your workloads actually run. |
| Workflow Plane | Where CI pipelines and automation execute. |
| Observability Plane | Where logs, metrics, and traces are collected and queried. |
Each plane is independently deployable, independently scalable, and has its own upgrade lifecycle. In development you can run all of them in a single cluster using namespace isolation. In production each typically lives in its own cluster. The separation is not forced on you from day one but it is designed to be the natural growth path.
The Control Plane
The control plane is a Kubernetes cluster that acts as the brain of OpenChoreo. It runs a central control loop that continuously monitors the state of the platform and developer resources. It takes actions to ensure that the desired state as declared via the Developer and Platform APIs is reflected in the actual state across all planes.
It has three key components inside it.
API Server
The API Server exposes the OpenChoreo API which is used by both developers and platform teams to interact with the system. It serves as the main entry point for all API requests, handling authentication, authorization, and request validation. The API server also hosts OpenChoreo's authorization engine that provides fine-grained RBAC, ABAC, and hierarchical instance-level access control to all resources created in OpenChoreo.
The authorization engine is powered by Apache Casbin. It works by mapping groups from your Identity Provider to roles and authorization policies in OpenChoreo. The same authorization layer applies whether you are using the UI, CLI, API, or MCP servers. One policy model, consistent everywhere.
Controller Manager
A set of Kubernetes controllers that implement the core reconciliation logic of the platform. These controllers watch for changes to the CRD instances defined in the Developer and Platform APIs and take appropriate actions to ensure that the desired state is achieved across all planes.
For example, when a new Component is created the controllers will:
- Validate the request
- Resolve any references such as dependencies of components
- Trigger the necessary workflows to build, deploy, and expose the component in the data plane with the required network policies and observability configurations
Everything in OpenChoreo is declarative. You declare what you want. The controller manager makes it happen.
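To make that concrete, here is a minimal sketch of what a Component declaration could look like. This is illustrative only: the apiVersion, kind, and field names are assumptions for the sake of example, not OpenChoreo's exact CRD schema.

```yaml
# Illustrative sketch only; the apiVersion and field names are assumptions,
# not OpenChoreo's exact Component schema.
apiVersion: openchoreo.dev/v1alpha1
kind: Component
metadata:
  name: order-service
  namespace: acme
spec:
  type: service                # a ComponentType defined by the platform team
  source:
    repository: https://github.com/acme/order-service
    branch: main
```

The developer declares the component and its source; the controllers reconcile everything downstream of that.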
Cluster Gateway
All other planes establish outbound connections to the control plane. The Cluster Gateway acts as the hub that lets the API Server and Controller Manager communicate with the other planes in a hub-and-spoke model. It exposes a Secure WebSocket API for bidirectional communication over long-lived connections, authenticated with mTLS using certificates issued by cert-manager. This keeps the Kubernetes API servers of the data, workflow, and observability planes off the internet.
Security note: None of the other planes need to expose their Kubernetes API servers publicly. They call out to the control plane, not the other way around. The communication is mTLS authenticated and runs over long-lived secure websocket connections.
The Platform API and Developer API
The control plane exposes two distinct API surfaces and understanding the difference between them is key to understanding how OpenChoreo separates platform concerns from developer concerns.
Platform API
The Platform API is a set of Kubernetes CRDs that allow platform builders to define the structure and behaviour of the platform itself. It provides abstractions for defining:
- Organizational boundaries (Namespaces)
- Environments
- Data Planes, Workflow Planes, and Observability Planes
- Deployment Pipelines
Platform engineers work here. They define environments, configure deployment pipelines, set up gateway topologies, and create reusable ComponentTypes and Traits that become the golden paths developers use.
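As a sketch, a platform engineer might declare an environment and a promotion pipeline along these lines. The field names and values here are illustrative assumptions, not the exact Platform API schema:

```yaml
# Illustrative sketch; kinds and field names are assumptions,
# not the exact Platform API schema.
apiVersion: openchoreo.dev/v1alpha1
kind: Environment
metadata:
  name: production
spec:
  dataPlaneRef: dp-eu-west     # which data plane backs this environment
  isProduction: true
---
apiVersion: openchoreo.dev/v1alpha1
kind: DeploymentPipeline
metadata:
  name: default-pipeline
spec:
  promotionOrder:
    - dev
    - staging
    - production
```

The point is the division of labour: platform engineers own these resources, and developers never have to look at them.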
Developer API
The Developer API is a set of Kubernetes CRDs designed to simplify, streamline, and reduce the cognitive burden of application development on Kubernetes for development teams. Instead of exposing the entire configuration surface of the Kubernetes API, these abstractions provide a more intuitive and domain-driven way to define projects, their components, and their interactions via endpoints and dependencies.
OpenChoreo avoids black-box abstractions that completely obscure Kubernetes. Instead these provide a way for platform teams to create opinionated, reusable templates that define organizational best practices and standards as intent-driven interfaces for their development teams. This shift-down approach reduces developer cognitive load by offloading complexity to the platform.
A developer declares intent:
```
# I want to deploy this component
# I want to expose this endpoint publicly
# I want to depend on this other service
```
The platform compiles that intent into whatever Kubernetes resources are needed without the developer touching a single NetworkPolicy or HTTPRoute directly.
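Sketching that intent as a manifest, again with assumed, illustrative field names rather than the real Developer API schema:

```yaml
# Illustrative sketch of developer intent; field names are assumptions.
apiVersion: openchoreo.dev/v1alpha1
kind: Component
metadata:
  name: checkout
spec:
  type: service
  endpoints:
    - name: api
      port: 8080
      visibility: public        # "expose this endpoint publicly"
  dependencies:
    - component: order-service  # "depend on this other service"
```

A dozen lines of intent like this can expand into Deployments, Services, HTTPRoutes, and NetworkPolicies on the data plane, none of which the developer writes by hand.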
The Experience Plane
Sitting across all of this is the experience plane, the user-facing layer. It includes:
- OpenAPI-v3-based APIs exposed by the control plane and observability plane
- CLI (`occ`) supporting both API server mode and file system mode for GitOps-driven workflows
- Backstage-based Internal Developer Portal: an extended fork supporting native Backstage plugins and custom plugins built specifically for OpenChoreo's APIs
- MCP servers for AI-assisted development and operations, exposed by both the control plane and the observability plane
The MCP servers mean your AI assistant can interact with the platform using the same authorization model as human users. Claude Code, Cursor, Codex, and Gemini CLI are all supported out of the box.
The Data Plane
A data plane is a Kubernetes cluster responsible for running component workloads, enforcing network policies, exposing component endpoints via a structured gateway topology, and wiring up dependencies, all as instructed by the control plane.
An OpenChoreo deployment can have one or more data planes spanning clusters in different geographies and infrastructure providers. A component can be promoted across physically separated environments like this:
dev (data plane 1) → staging (data plane 1) → production (data plane 2)
Each promotion applies environment-specific configurations and secrets automatically.
Cells: The Runtime Boundary
At runtime, resources belonging to a project are isolated through Cells: secure, isolated, and observable boundaries for all components in a given namespace-project-environment combination. A Cell is the runtime boundary for a group of components, with policy enforcement and observability built in. This aligns with Cell-Based Architecture, where individual teams or domains operate independently within well-defined boundaries while still benefiting from shared infrastructure capabilities.
Each Cell has a structured gateway topology covering all four traffic directions:
| Direction | Handles |
|---|---|
| External Ingress | Traffic from the internet |
| Internal Ingress | Traffic from other cells or the internal network |
| External Egress | Outbound traffic to external services |
| Internal Egress | Outbound traffic to other cells |
Cilium and eBPF enforce network policies at every boundary.
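As an illustration of the kind of policy enforced at a cell boundary, here is a stock CiliumNetworkPolicy that only admits traffic from workloads carrying the same cell label. This is plain Cilium, not OpenChoreo-specific, and the label keys and values are hypothetical, not the labels OpenChoreo actually applies:

```yaml
# Stock Cilium policy as an illustration; the label keys/values are
# hypothetical, not the labels OpenChoreo actually applies.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: same-cell-only
  namespace: acme-orders-prod
spec:
  endpointSelector:
    matchLabels:
      cell: acme-orders-prod   # applies to all workloads in this cell
  ingress:
    - fromEndpoints:
        - matchLabels:
            cell: acme-orders-prod   # only same-cell peers may connect
```

Because enforcement happens in eBPF rather than iptables, the policy check runs in the kernel datapath with no sidecar involved.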
Data Plane Modules
Optional modules extend data plane capabilities without touching core platform logic:
- API management module — rate limiting, authentication, and observability at the endpoint level
- Elastic module — automatic scale-to-zero based on traffic
- Guard module — Cilium CNI and eBPF for zero-trust network policies and kernel-level observability
The Workflow Plane
A workflow plane is a Kubernetes cluster responsible for executing platform-defined workflows. OpenChoreo has two categories of workflows:
- CI workflows — developer self-service for building, testing, and deploying components
- Generic workflows — all other automation including GitOps workflows, resource provisioning, and custom platform team workflows
The default workflow module is powered by Argo Workflows, a Kubernetes-native workflow engine. OpenChoreo's workflow concepts are designed to work with any CRD-based workflow engine so you can customise the Workflow Plane to use an alternative like Tekton.
The workflow plane is also optional. If you already have GitHub Actions, GitLab CI, or Jenkins, you can keep using them alongside it. A common pattern is:
- Git provider native CI → pre-PR-merge checks
- OpenChoreo Workflow Plane → final build and deploy on PR merge
- Generic workflows → GitOps, integration tests, post-deployment checks
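Since the default module is Argo Workflows, a generic post-deployment check can be expressed as a plain Argo Workflow. The image, command, and URL below are placeholders for illustration:

```yaml
# A plain Argo Workflow; the image, command, and URL are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: post-deploy-check-
spec:
  entrypoint: smoke-test
  templates:
    - name: smoke-test
      container:
        image: curlimages/curl:8.8.0
        command: ["curl", "--fail", "http://checkout.internal/healthz"]
```

Swapping the engine for Tekton would mean expressing the same step as a Tekton Task, while the surrounding OpenChoreo workflow concepts stay the same.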
The Observability Plane
An observability plane is a Kubernetes cluster responsible for providing centralized logs, metrics, traces, and alerts. It acts as a central data sink, collecting and aggregating observability data from the workflow and data planes.
Unlike the other planes, the observability plane exposes its own Observer API and MCP server directly. This design prevents observability data from being proxied through the control plane to end-users, which can be a concern in larger multi-regional, multi-tenant deployments where regional data privacy regulations may apply.
Default Observability Modules
| Module | Powered By |
|---|---|
| Logs | OpenSearch |
| Metrics | Prometheus |
| Tracing | OpenTelemetry collector + OpenSearch backend |
| Alerting | Built into logs and metrics modules |
All of these are swappable. If you have an existing observability system such as Datadog, Splunk, New Relic, or Grafana Cloud, OpenChoreo's adapter pattern allows a minimal observability plane to plug into an external system's API while still providing the same domain-centric Observer API and MCP servers across the unified experience plane.
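For example, a minimal OpenTelemetry Collector configuration in the observability plane could forward traces to an external backend instead of the bundled OpenSearch. The exporter endpoint below is a placeholder:

```yaml
# Minimal OpenTelemetry Collector config; the exporter endpoint is a
# placeholder for whatever external backend you plug in.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  otlphttp:
    endpoint: https://otlp.example-backend.com
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
```

The Observer API and MCP servers sit in front of whichever backend you choose, so the experience plane stays the same either way.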
Deployment Topologies
OpenChoreo supports three main topology patterns:
| Topology | When to use |
|---|---|
| Single cluster | Development, testing, local k3d setup |
| Plane-per-cluster | Production, full fault isolation, independent scaling |
| Hybrid | Co-locate Control + Workflow for cost or operational efficiency |
The architecture supports all of these without a redesign. The natural growth path is single-cluster locally, then namespace-isolated production, then splitting out planes as load and compliance requirements demand.
Why the Separation Matters Operationally
The multi-plane design has direct operational consequences.
Independent upgrade lifecycles. You can update the observability stack without touching the control plane. You can add a data plane in a new region without changing your workflow setup.
Independent scaling. A heavy CI workload on the workflow plane does not compete with production workloads on the data plane. Observability ingestion spikes do not impact control plane availability.
Clear security boundaries. The Kubernetes API servers of data, workflow, and observability planes are never exposed externally. All communication flows outbound through mTLS-authenticated websocket connections to the control plane's cluster gateway.
Native GitOps. Because all state is declarative Kubernetes CRDs, the entire platform is GitOps-compatible from day one. Platform topology, developer applications, deployment pipelines — all of it can be version controlled and reconciled from Git.
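For instance, with Argo CD (one of several GitOps options; the repo URL and path below are placeholders), the platform's CRDs can be reconciled from Git like any other Kubernetes resources:

```yaml
# Standard Argo CD Application; repoURL and path are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-config
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/acme/platform-config
    targetRevision: main
    path: environments
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert out-of-band changes
```

Because the Environments, Components, and DeploymentPipelines are ordinary CRD instances, they need no special GitOps tooling beyond this.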
Getting Started
The full architecture runs locally on k3d in about 10 minutes. The quick start guide walks you through it step by step.
If you want to go deeper on any specific plane or the runtime model around Cells, happy to dig into that in the comments. And if you are interested in contributing, the project is fully open source under CNCF governance.