Paulo Victor Leite Lima Gomes

Posted on Jul 5

the kubernetes dashboard died and the platform workbench took its place

#kubernetes #platformengineering #headlamp #developertools

The original Kubernetes Dashboard was a comforting idea.

Here is the cluster. Here are the pods. Here are the deployments. Here are the services. Click around, inspect a thing, maybe scale something, maybe edit YAML if you are brave or temporarily confused.

That model was fine when the question was "what is running in my cluster?"

It is much less fine when the question is "why is this fleet upgrade stuck, why is this batch workload waiting for capacity, why did this serverless revision receive traffic, and which team is allowed to change the thing that fixes it?"

Kubernetes is no longer one operational surface.

It is a pile of specialized control planes sharing an API server.

That is why the recent Headlamp plugin work is more interesting than it first looks. In late June, the Kubernetes project published posts about Headlamp plugins for Cluster API, Volcano, and Knative. Each one sounds like a UI feature. Cluster lifecycle views. Batch scheduling context. Serverless traffic splits. Prometheus metrics in detail pages. Map views.

Useful, sure.

But the larger signal is that the Kubernetes UI story is becoming extensible by domain instead of pretending that one generic dashboard can explain every kind of platform work.

The dashboard died quietly.

The platform workbench is taking its place.

the generic dashboard was always too flat

A generic Kubernetes dashboard has a structural problem: Kubernetes resources are not all at the same level of meaning.

A Pod is not a Cluster API MachineDeployment. A Service is not a Knative KService. A Deployment is not a Volcano Queue. A ConfigMap may be random application configuration, or it may be the cluster default that explains why an autoscaler is behaving differently than the operator expected.

To the API server, these are all resources.

To a human trying to fix production, they are very different stories.

The old UI shape flattens them into lists and detail pages. That is better than nothing. It helps with discovery. It is friendly for onboarding. It gives you a way to see what exists without memorizing every kubectl incantation.

But once a team depends on higher-level controllers, the problem is not finding objects.

The problem is understanding relationships.

Why is this resource here? Which controller owns it? Which condition is blocking progress? Which child object matters? Which parent object should I change? Is this a local override, a default, a generated value, or the result of reconciliation? If I scale here, am I fighting the controller that manages the topology?
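To make "which controller owns it?" concrete, here is a minimal sketch that walks `metadata.ownerReferences` upward from a child object. It assumes the objects have already been fetched and indexed by UID. The helper names and the sample objects are hypothetical, and this is not how any particular UI implements it.

```python
from typing import Optional


def controller_owner_uid(obj: dict) -> Optional[str]:
    """Return the UID of the owner reference marked controller=true, if any."""
    for ref in obj.get("metadata", {}).get("ownerReferences", []):
        if ref.get("controller"):
            return ref["uid"]
    return None


def ownership_chain(obj: dict, objects_by_uid: dict) -> list:
    """Walk controller owner references upward; return 'Kind/name' labels, child first."""
    chain = []
    seen = set()
    current = obj
    while current is not None:
        uid = current["metadata"]["uid"]
        if uid in seen:  # guard against malformed reference cycles
            break
        seen.add(uid)
        chain.append(f'{current["kind"]}/{current["metadata"]["name"]}')
        owner_uid = controller_owner_uid(current)
        # Stop when there is no controller owner or it was not fetched.
        current = objects_by_uid.get(owner_uid) if owner_uid else None
    return chain


# Hypothetical objects, trimmed to the fields the walk needs.
deployment = {"kind": "Deployment", "metadata": {"name": "web", "uid": "u1"}}
replica_set = {
    "kind": "ReplicaSet",
    "metadata": {
        "name": "web-5d9c",
        "uid": "u2",
        "ownerReferences": [{"kind": "Deployment", "name": "web", "uid": "u1", "controller": True}],
    },
}
pod = {
    "kind": "Pod",
    "metadata": {
        "name": "web-5d9c-x7k2p",
        "uid": "u3",
        "ownerReferences": [{"kind": "ReplicaSet", "name": "web-5d9c", "uid": "u2", "controller": True}],
    },
}
objects_by_uid = {o["metadata"]["uid"]: o for o in (deployment, replica_set, pod)}

print(" <- ".join(ownership_chain(pod, objects_by_uid)))
```

The same walk applies to deeper graphs. The chain only gets longer, which is the point where a list view stops being enough.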

Those are not generic dashboard questions.

They are workbench questions.

Cluster API needs fleet thinking

The Headlamp Cluster API plugin is a good example because Cluster API is Kubernetes looking at itself.

Cluster API turns cluster lifecycle management into Kubernetes-style declarative objects. A management cluster stores and reconciles the desired state of other clusters. That is elegant, but it also means the operational object graph gets deep quickly: clusters, machine deployments, machine sets, machines, control planes, bootstrap configs, infrastructure references, conditions, versions, provider details, and remediation hints.

You can inspect all of that with kubectl.

Of course you can.

You can also debug a distributed system with only logs and heroic patience. That does not make it a good default interface.

What the plugin does is collect the fleet-level story in one place. It shows cluster health, control plane and worker replica status, ownership hierarchies, topology-managed resources, bootstrap configuration, map views, and inline Prometheus metrics when configured.

That is not just convenience.

It changes the unit of attention.

Instead of asking an operator to reconstruct the cluster lifecycle graph from raw objects, the UI starts with the workflow the operator actually has: find the broken cluster, understand the condition, follow the owned resources, check whether the problem is control plane, worker, bootstrap, infrastructure, or remediation.
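The "understand the condition" step can be sketched in a few lines. This assumes the usual Kubernetes condition shape (`type`, `status`, `reason`, `message`) and that every condition is positive-polarity, meaning `"True"` is healthy. The condition types, reasons, and messages in the sample are illustrative, not copied from a real cluster, and the function name is hypothetical.

```python
def blocking_conditions(obj: dict) -> list:
    """Return conditions whose status is not "True", with "False" before "Unknown"."""
    severity = {"False": 0, "Unknown": 1}
    conditions = obj.get("status", {}).get("conditions", [])
    blocked = [c for c in conditions if c.get("status") != "True"]
    # sorted() is stable, so conditions of equal severity keep their original order.
    return sorted(blocked, key=lambda c: severity.get(c.get("status"), 2))


# Hypothetical Cluster object, trimmed to status.conditions.
cluster = {
    "kind": "Cluster",
    "metadata": {"name": "prod-eu-1"},
    "status": {
        "conditions": [
            {"type": "InfrastructureReady", "status": "True"},
            {"type": "ControlPlaneReady", "status": "Unknown", "reason": "ExampleReasonA",
             "message": "illustrative message"},
            {"type": "Ready", "status": "False", "reason": "ExampleReasonB",
             "message": "illustrative message"},
        ]
    },
}

for cond in blocking_conditions(cluster):
    print(f'{cond["type"]}={cond["status"]} ({cond.get("reason", "no reason")})')
```

A real plugin would also need to handle conditions where `"True"` means trouble, so treat the filter as a starting point rather than a rule.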

That is platform UI doing its job.

Not hiding Kubernetes. Not replacing the API. Giving the API a shape that matches the work.

AI and batch workloads need scheduling context

The Volcano plugin points at a different problem.

Kubernetes started with long-running services as the default mental model. A web service wants replicas. Keep them alive. Route traffic. Replace unhealthy pods. Scale up and down.

Batch, HPC, and AI workloads are stranger.

Jobs arrive. Capacity is scarce. Some work is useless unless several workers start together. Queues, quotas, priorities, and gang scheduling matter. A pending workload is not just "a pod did not start." It may be waiting because the whole group cannot be scheduled under the current resource and policy constraints.

If the UI only shows Pods, the operator has to reverse engineer the scheduling story.

Volcano's plugin brings Jobs, Queues, PodGroups, Pods, events, logs, and map relationships into one workflow. The interesting part is not that there is a sidebar item called Volcano. The interesting part is that the UI admits the scheduling domain has its own nouns.

Queue capacity is not a decoration.

PodGroup state is not an implementation detail.

Gang scheduling blockers are not visible if you only stare at individual pods.
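A toy model shows why. The sketch below is not Volcano's scheduling algorithm. It assumes identical pods, a single resource (GPUs), and an all-or-nothing rule in the spirit of a PodGroup's `minMember`. The function names and numbers are made up for illustration.

```python
def placeable_pods(free_gpus_per_node: list, gpus_per_pod: int) -> int:
    """Count how many identical pods fit, given the free GPUs on each node."""
    if gpus_per_pod <= 0:
        raise ValueError("gpus_per_pod must be positive")
    # A pod cannot span nodes, so leftover GPUs on each node are wasted.
    return sum(free // gpus_per_pod for free in free_gpus_per_node)


def gang_status(min_member: int, gpus_per_pod: int, free_gpus_per_node: list) -> str:
    """Apply the all-or-nothing rule: the group starts only if min_member pods fit."""
    fit = placeable_pods(free_gpus_per_node, gpus_per_pod)
    if fit >= min_member:
        return f"schedulable: {min_member} of {min_member} members fit"
    return f"blocked: only {fit} of {min_member} members fit, so none start"


# Three nodes with 3 free GPUs each: 9 free in total, and the job needs 8.
print(gang_status(min_member=4, gpus_per_pod=2, free_gpus_per_node=[3, 3, 3]))
```

The cluster has more free GPUs than the job needs, yet the group is blocked because of fragmentation. A pod list shows four pending pods. A PodGroup view can show one unmet promise.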

This matters more as Kubernetes becomes the default landing zone for AI infrastructure. GPUs and accelerators are expensive. Batch queues become political very quickly. Teams want fairness, priority, preemption, utilization, and predictable progress. The operator needs to see not only "what failed" but "which scheduling promise could not be satisfied."

That is a different lens than a service dashboard.

It deserves a different workbench.

serverless has its own operational grammar

Knative is another good case because it looks simple right up until it is not.

A KService sounds like one thing. In practice, it is the front door to Routes, Configurations, Revisions, traffic splits, autoscaling annotations, domain mappings, cluster defaults, request rates, latency, and backing pods.

When something goes wrong, the useful question is rarely "does a Kubernetes object exist?"

It is more like:

which revision is receiving traffic?
is this a canary or an old revision that never drained?
did the service inherit a cluster autoscaling default?
is the min scale explicit or coming from a ConfigMap?
are request rates and latency tied to the revision I think they are?
does RBAC allow this operator to change the traffic split?
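Two of those questions can be sketched directly against the object. This assumes a Knative Service fetched as a dict, with `status.traffic` entries carrying `revisionName` and `percent`, the `autoscaling.knative.dev/min-scale` annotation on the revision template, and a `min-scale` key in the cluster's autoscaler ConfigMap data. Treat those field and key names as assumptions to check against your Knative version. The function names and sample values are hypothetical.

```python
MIN_SCALE_ANNOTATION = "autoscaling.knative.dev/min-scale"


def traffic_by_revision(ksvc: dict) -> dict:
    """Sum traffic percentages per revision, skipping targets that receive nothing."""
    totals = {}
    for target in ksvc.get("status", {}).get("traffic", []):
        percent = target.get("percent") or 0
        if percent > 0:
            name = target.get("revisionName", "<unresolved>")
            totals[name] = totals.get(name, 0) + percent
    return totals


def min_scale_with_source(ksvc: dict, autoscaler_defaults: dict) -> tuple:
    """Return (min scale, where it came from) so the UI can label the provenance."""
    template_meta = ksvc.get("spec", {}).get("template", {}).get("metadata", {})
    annotations = template_meta.get("annotations", {})
    if MIN_SCALE_ANNOTATION in annotations:
        return int(annotations[MIN_SCALE_ANNOTATION]), "explicit annotation"
    if "min-scale" in autoscaler_defaults:
        return int(autoscaler_defaults["min-scale"]), "cluster default"
    return 0, "built-in default (assumed)"


# Hypothetical service: a 90/10 split and no explicit min scale.
ksvc = {
    "kind": "Service",
    "metadata": {"name": "hello"},
    "spec": {"template": {"metadata": {"annotations": {}}}},
    "status": {
        "traffic": [
            {"revisionName": "hello-00002", "percent": 90, "latestRevision": True},
            {"revisionName": "hello-00001", "percent": 10, "latestRevision": False},
        ]
    },
}
autoscaler_defaults = {"min-scale": "1"}  # stand-in for the ConfigMap's data

print(traffic_by_revision(ksvc))
print(min_scale_with_source(ksvc, autoscaler_defaults))
```

Returning the source alongside the value is the design choice that matters here. The number alone cannot tell an operator whether changing the service or changing the cluster default is the right fix.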

The Headlamp Knative plugin surfaces traffic splits, autoscaling configuration, map relationships, logs, redeploy actions, and Prometheus graphs near the thing the operator is inspecting.

Again, this is not just a prettier kubectl.

It is a domain-specific operational grammar.

Knative operators think in revisions, routes, traffic percentages, scale-to-zero behavior, concurrency, and defaults. A useful UI should make those concepts first-class instead of forcing every team to mentally translate them from raw YAML.

platform portals should learn from this

There is a lesson here for internal developer portals.

A lot of portals drift toward being catalogs with buttons. Here are your services. Here are your docs. Here are your scorecards. Here is a link to Grafana. Here is a link to Argo. Here is a link to the cloud console. Good luck.

That is better than a wiki graveyard, but it is still often too shallow.

The Headlamp direction is more interesting because it puts workflows near the resources and lets plugins bring domain knowledge into the same interface. Cluster lifecycle can get its own lens. Batch scheduling can get its own lens. Serverless traffic can get its own lens. GitOps, metrics, custom internal controllers, and company-specific workflows can do the same.

This is what a platform workbench should be.

Not one giant portal that tries to model the entire company.

Not five disconnected dashboards where every incident starts with tab archaeology.

A consistent shell with domain-specific plugins, shared authentication, RBAC-aware actions, resource relationships, metrics in context, and enough escape hatches that power users can still drop to the CLI.

The CLI remains important. Automation remains important. Raw YAML remains important.

But humans do not operate platforms by reading every object equally.

They operate by following meaning.

the risk is plugin sprawl

Of course, plugins do not automatically make a good platform.

They can make a messy platform easier to distribute.

If every team ships a plugin with different navigation, permissions, terminology, and action patterns, the workbench becomes a drawer full of unrelated mini-apps. If plugins perform dangerous actions without clear RBAC and audit trails, the UI becomes another privileged surface. If plugins are installed casually from unknown sources, the control plane gets a new supply chain problem.

This is where platform teams need boring standards.

Who can publish a plugin? How is it reviewed? Which permissions does it need? What actions are allowed from the UI? Does every action map cleanly to Kubernetes RBAC? Are destructive changes visible in audit logs? Can operators see whether a field is explicit, inherited, or generated? Do plugins expose the raw object when abstraction fails?
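One way to keep UI actions honest about RBAC is to ask the API server before enabling a button. The sketch below only builds the request body for a `SelfSubjectAccessReview` (from the `authorization.k8s.io/v1` API) and interprets a response. Sending it is left to whatever client the plugin already uses. The helper names and the example action are hypothetical.

```python
def access_review_body(verb: str, group: str, resource: str, namespace: str, name: str = "") -> dict:
    """Build a SelfSubjectAccessReview asking: may the current user do this?"""
    attrs = {"verb": verb, "group": group, "resource": resource, "namespace": namespace}
    if name:
        attrs["name"] = name  # narrow the check to a single named object
    return {
        "apiVersion": "authorization.k8s.io/v1",
        "kind": "SelfSubjectAccessReview",
        "spec": {"resourceAttributes": attrs},
    }


def action_enabled(review_response: dict) -> bool:
    """Enable the action only on an explicit allow; a missing status means no."""
    status = review_response.get("status", {})
    return bool(status.get("allowed")) and not status.get("denied", False)


# Hypothetical action: changing the traffic split on one Knative Service.
body = access_review_body(
    verb="patch",
    group="serving.knative.dev",
    resource="services",
    namespace="shop",
    name="checkout",
)
print(body["spec"]["resourceAttributes"])
print(action_enabled({"status": {"allowed": True}}))
```

This check is only a hint for the UI. The API server still enforces RBAC when the real request arrives, which is what keeps the plugin from becoming its own privileged surface.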

The workbench model is powerful because it accepts specialization.

It is risky for the same reason.

The answer is not to avoid plugins. The answer is to treat them as part of the platform surface, with the same care you would give admission policies, controllers, CLIs, and deployment tools.

what i would build first

If I were building an internal Kubernetes workbench, I would start with one annoying workflow.

Not "show everything in the cluster."

Something more specific: why is this cluster upgrade blocked? Why is this GPU job pending? Why did this revision receive traffic? Why is this rollout stuck between Git and runtime? Why is this storage claim waiting?

Then I would map the resources an operator actually checks, the commands they run, the metrics they open, the docs they search, and the decisions they make. That becomes the plugin boundary.

The goal would not be to hide complexity.

The goal would be to put the right complexity in order.

Every good workbench should answer three questions quickly:

what is the current state?
why is it in that state?
what action is safe for me to take?

If the UI cannot answer those, it is probably just a prettier object browser.

the punchline

Kubernetes won by making infrastructure programmable. The cost is that the platform now contains many different operational languages.

Cluster lifecycle has one language. Batch scheduling has another. Serverless traffic has another. Storage, networking, hardware allocation, GitOps, policy, and AI workloads all bring their own nouns, relationships, and failure modes.

One dashboard cannot explain all of that.

The better future is a workbench: common shell, shared access model, extensible plugins, domain-specific views, metrics in context, RBAC-aware actions, and a clean path back to raw Kubernetes objects when needed.

Headlamp's recent plugin work is not the whole answer, but it is pointing in the right direction.

The Kubernetes UI should not ask operators to pretend every resource is equally meaningful.

It should help them follow the shape of the work.
