Marcel Boccato

Posted on Apr 12

Sentinel Diary #3: From Information to Action — When the Dashboard Learned to Think

#kubernetes #finops #devops #go

A vibe coding journey: building a Kubernetes FinOps platform from scratch, one conversation at a time.

When I published Diary #2, the dashboard was finally telling the truth. The bugs were fixed, the data was real, the version badge was glowing cyan on hover. It felt like a finished thing.

It wasn't. It was a read-only mirror of a cluster.

Diary #3 is the story of turning that mirror into a tool — the session where Sentinel stopped showing data and started helping me act on it.

Where we left off

At v0.7.3, Sentinel had:

A Go agent collecting metrics every ~10s
PostgreSQL storing raw + hourly + daily aggregates
A dashboard with cost timeline, pod health, CPU utilization
22 automated tests
Zero authentication (honest versioning: still 0.x)

Online Boutique (12 Google microservices) was already deployed in the google-demo namespace, waiting. Twenty-four pods. Real workload distribution. Real waste candidates.

I just couldn't do anything about them from the dashboard.

The "before" — a beautiful read-only report.

v0.10.0 — The forecast that scared me

Before visual work, I wanted the dashboard to answer a question I kept asking manually: "if this cluster runs through the weekend, how much will I spend?"

I spec'd out the requirement: linear regression over historical cost data, with confidence bands. No external dependencies — pure Go. I handed it to Claude, and the result was /api/forecast: a projection endpoint with ±1.5σ confidence bands.
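For readers who want to see the shape of the idea, here is a minimal sketch of a least-squares projection with ±1.5σ bands in pure Go. It is not Sentinel's actual code. The names Forecast, fitLine and Project are hypothetical, the sample costs are made up, and using the standard deviation of the fit residuals as σ is an assumption, since the post does not say how the bands are derived.

```go
package main

import (
	"fmt"
	"math"
)

// Forecast is one projected point with its lower and upper band.
type Forecast struct {
	Value, Lower, Upper float64
}

// fitLine fits y = intercept + slope*x by ordinary least squares over the
// points (0, ys[0]), (1, ys[1]), ... and returns the standard deviation of
// the residuals as sigma. It expects at least two samples.
func fitLine(ys []float64) (slope, intercept, sigma float64) {
	n := float64(len(ys))
	var sumX, sumY, sumXY, sumXX float64
	for i, y := range ys {
		x := float64(i)
		sumX += x
		sumY += y
		sumXY += x * y
		sumXX += x * x
	}
	slope = (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX)
	intercept = (sumY - slope*sumX) / n

	var sumSq float64
	for i, y := range ys {
		r := y - (intercept + slope*float64(i))
		sumSq += r * r
	}
	sigma = math.Sqrt(sumSq / n)
	return slope, intercept, sigma
}

// Project extends the fitted line steps points past the end of ys and
// attaches a band of ±k*sigma to every projected value.
func Project(ys []float64, steps int, k float64) []Forecast {
	if len(ys) < 2 {
		return nil // not enough history to fit a line
	}
	slope, intercept, sigma := fitLine(ys)
	out := make([]Forecast, 0, steps)
	for s := 0; s < steps; s++ {
		v := intercept + slope*float64(len(ys)+s)
		out = append(out, Forecast{Value: v, Lower: v - k*sigma, Upper: v + k*sigma})
	}
	return out
}

func main() {
	hourlyCost := []float64{1.0, 1.2, 1.1, 1.4, 1.5} // made-up sample data
	for i, f := range Project(hourlyCost, 3, 1.5) {
		fmt.Printf("t+%d: %.2f (%.2f to %.2f)\n", i+1, f.Value, f.Lower, f.Upper)
	}
}
```

A constant-width band like this is the simplest option. A stricter prediction interval would widen as the projection moves further from the observed data.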

The chart came back with a dashed purple budget line, a cyan usage line, shaded confidence regions, and a projected waste card below. It looked like something from a Bloomberg terminal.

Then I looked at the numbers.

Projected waste: 67% of budget. Of every dollar spent on this cluster, sixty-seven cents was going to pods with requests set far above actual consumption.

The forecast didn't tell me something I didn't know. It told me something I knew but hadn't seen.

v0.10.1 — Closing M1

Before going further with UI, I closed Milestone 1 properly. I had a checklist:

/health endpoint with DB and collector status checks
Structured logging with slog (consistent fields across all components)
Thresholds loaded from config/thresholds.yaml via ConfigMap (no hardcoded values)
Version badge reading dynamically from /health
Fallback data for long ranges (30d/90d/1y)

Claude implemented all of it in a single session.

M1 criterion: "Sentinel collects, persists, calculates waste, and reports its own health without manual intervention." ✅
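To make the first two checklist items concrete, here is a rough sketch of a /health handler that runs named dependency checks and logs failures through slog. Everything in it is illustrative rather than taken from Sentinel: the Check type, HealthHandler, the probe helper, the errStale error and the JSON field names are all assumptions.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"log/slog"
	"net/http"
	"net/http/httptest"
	"os"
)

// Check reports whether one dependency is healthy (nil means healthy).
type Check func() error

// errStale is a made-up failure used by the demo in main.
var errStale = errors.New("no sample collected recently")

// healthResponse is a hypothetical response shape, not Sentinel's real one.
type healthResponse struct {
	Status  string            `json:"status"`
	Version string            `json:"version"`
	Checks  map[string]string `json:"checks"`
}

// HealthHandler runs every check on each request. It answers 200 when all
// checks pass and 503 when any check fails, logging each failure with the
// same structured fields.
func HealthHandler(version string, checks map[string]Check, logger *slog.Logger) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		resp := healthResponse{Status: "ok", Version: version, Checks: map[string]string{}}
		code := http.StatusOK
		for name, check := range checks {
			if err := check(); err != nil {
				resp.Checks[name] = "error"
				resp.Status = "degraded"
				code = http.StatusServiceUnavailable
				logger.Error("health check failed", "component", name, "error", err)
				continue
			}
			resp.Checks[name] = "ok"
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		if err := json.NewEncoder(w).Encode(resp); err != nil {
			logger.Error("writing health response failed", "error", err)
		}
	})
}

// probe calls the handler in-process and returns the status code and body.
func probe(checks map[string]Check) (int, string) {
	logger := slog.New(slog.NewJSONHandler(os.Stderr, nil))
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/health", nil)
	HealthHandler("v0.10.1", checks, logger).ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := probe(map[string]Check{
		"db":        func() error { return nil },
		"collector": func() error { return errStale },
	})
	fmt.Println(code, body)
}
```

Returning 503 on a failed check lets a Kubernetes readiness probe react to the same endpoint the version badge reads from, though whether Sentinel wires it up that way is not stated here.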

The layout problem

By v0.10.3, I had a confession to make to the dashboard.

It was working. Every metric was real. But it was ugly in a specific way: information arranged like a report, not like a tool. Everything equal weight. No hierarchy. No "look here first."

I spent the next few versions doing something I rarely do consciously: thinking about information architecture before writing a single directive.

The question wasn't "what data do we have?" It was "when someone opens this at 2am during an incident, where should their eyes go first?"

Answer: KPIs. Then cluster health. Then cost. Then details.

v0.10.4–v0.10.8 — The great layout rework

Version by version, I described what I needed and Claude shaped the layout:

v0.10.4: I wanted a dedicated Memory tile — a visual showing requested vs allocatable memory, with a drawer that broke risk down by namespace. Claude built a purple donut with OOM risk breakdown.

v0.10.5: Per-tile namespace filters — each tile (Pods, CPU, Memory) needed its own independent <select> so filtering one wouldn't break the others. Financial Correlation grew to full-width with an orange FinOps border. The drawer got an interactive period selector and sortable columns.

v0.10.6–v0.10.7 reorganized the grid — I drew the hierarchy on paper first:

row-4: Node Health | Pod Distribution | CPU (compact) | Memory (compact)
Financial Correlation: full-width, immediately below
Waste Intelligence: full-width with scroll, at the bottom
Active Alerts tile: removed (an empty tile is worse than no tile)

v0.10.8: An animated alert badge in the header — green dot for "All OK", orange for warnings, red pulsing for critical. All six KPI cards clickable, each opening its respective drawer. The dead "Active Alerts" KPI replaced with "Top Memory Consumer" — the actually useful metric.

The "after" — v0.10.12 with unified layout, forecast chart and Top Workloads panel.

v0.10.9 — The bug that crashed silently

During testing, I noticed the KPI cards were showing -- for values. Not an error. Not a console warning. Just dashes.

I flagged it to Claude, who traced it to a ReferenceError in updateOverview(). The code was doing:

pods.forEach(p => { ... })

But /api/summary doesn't return an individual pods array. It returns podsByPhase, failedPods, pendingPods. The variable pods didn't exist.

The error was thrown, silently swallowed by the outer try/catch, and execution stopped before updating kT, kMem, kW — all the KPI values. They stayed at -- from initialization.

Claude extracted updatePodsAllNsTile() — a new async function that fetches /api/pods separately, groups by namespace, and renders a namespace-distribution donut instead of the broken phase breakdown.

Silent failures are the worst kind. At least a loud crash tells you where to look.

v0.10.10 — The column that was always zero

The Memory drawer had a "Mem Request" column. It showed N/A for every pod.

I queried the DB directly.

SELECT DISTINCT mem_request FROM metrics LIMIT 5;
 -- 0

Every row. Zero.

Four versions back, when the DB INSERT was written, mem_request was hardcoded to 0. The struct field existed, the column existed, the frontend expected data — but real values were never being written.

I described the fix to Claude: collect memory requests per pod during the collection cycle and use those real values in the INSERT. Claude built podMemRequestMap[namespace][pod], summing memory requests across all containers. The INSERT now uses the real value.
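The shape of that map is easy to sketch. The version below is an illustration only: Container and Pod are simplified stand-ins for the Kubernetes API types, memory requests are assumed to be already converted to bytes, and BuildPodMemRequestMap is a hypothetical name.

```go
package main

import "fmt"

// Container and Pod are simplified stand-ins for the Kubernetes API types.
// MemRequestBytes is assumed to be pre-converted from resources.requests.memory.
type Container struct {
	Name            string
	MemRequestBytes int64
}

type Pod struct {
	Namespace  string
	Name       string
	Containers []Container
}

// BuildPodMemRequestMap returns namespace -> pod -> total requested memory,
// summing the requests of every container in the pod. A container without a
// request contributes 0.
func BuildPodMemRequestMap(pods []Pod) map[string]map[string]int64 {
	out := make(map[string]map[string]int64)
	for _, p := range pods {
		if out[p.Namespace] == nil {
			out[p.Namespace] = make(map[string]int64)
		}
		for _, c := range p.Containers {
			out[p.Namespace][p.Name] += c.MemRequestBytes
		}
	}
	return out
}

func main() {
	const Mi = int64(1) << 20
	pods := []Pod{ // made-up sample data
		{Namespace: "google-demo", Name: "frontend", Containers: []Container{
			{Name: "server", MemRequestBytes: 64 * Mi},
			{Name: "sidecar", MemRequestBytes: 32 * Mi},
		}},
		{Namespace: "sentinel", Name: "agent", Containers: []Container{
			{Name: "agent", MemRequestBytes: 0},
		}},
	}
	m := BuildPodMemRequestMap(pods)
	fmt.Println(m["google-demo"]["frontend"] / Mi) // prints 96
}
```

One thing a map like this cannot express on its own is the difference between "request is 0" and "no request set", which matters if the drawer is meant to show N/A only for the latter.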

Historical data stays zero — it's already written. But every new collection has the right number. A migration would fix history; I decided to let time heal it.

FinOps drawer: sortable history table with Budget, Actual, Waste and Waste% columns.

Memory drawer: per-namespace breakdown with OOM risk indicator per pod.

v0.10.11–v0.10.12 — From display to decision

This is the part I'm most proud of.

v0.10.11: I wanted a tooltip on the "Connected" badge — hover to see cluster health at a glance without opening any drawer. Claude built a card showing Cluster, Endpoint, Version, Session uptime, Last sync, and Database status. Small detail. High signal.

v0.10.12: I wanted to merge Waste Intelligence and Top Workloads into a single action-oriented panel: "Top Workloads — CPU & Waste Analysis". But the real ask was making pod names clickable.

I defined the interaction: click a pod name → drawer opens with current usage, request, a utilization bar, and a concrete rightsizing recommendation. Claude built it:

 kube-apiserver-minikube          sentinel          ⚠ Overprovisioned

 CPU Usage / Request              42m / 250m
 ████████░░░░░░░░░░░░░░░░░░░░    16.8%

 Memory Usage / Request           312 Mi / No request set

 ⚠ Savings Opportunity
 Potential CPU savings: -208m (83%)
 CPU request is significantly higher than actual usage.
 Consider reducing resources.requests.cpu to ~51m.

The number ~51m comes from ceil(actualUsage × 1.2): 42m × 1.2 = 50.4m, rounded up to 51m. That is a 20% headroom buffer, calculated when the drawer renders. Not a generic recommendation. A concrete one, specific to that pod, at that moment.
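The arithmetic behind the drawer fits in a few lines. This is a sketch in Go that reproduces the numbers shown above, not the dashboard's own code. The Recommendation type and RecommendCPU function are hypothetical names, and the rule that a suggestion is only offered when it is below the current request is an assumption.

```go
package main

import "fmt"

// Recommendation holds the figures shown in the pod detail drawer.
type Recommendation struct {
	SuggestedMilli int64   // ceil(usage × 1.2), in millicores
	SavingsMilli   int64   // request minus actual usage, in millicores
	SavingsPct     float64 // savings as a percentage of the request
}

// RecommendCPU applies a 20% headroom to the observed usage. It returns
// false when there is no request to compare against or when the suggested
// value would not be lower than the current request.
func RecommendCPU(usageMilli, requestMilli int64) (Recommendation, bool) {
	if usageMilli < 0 || requestMilli <= 0 {
		return Recommendation{}, false
	}
	// Integer form of ceil(usage × 1.2); avoids float rounding surprises.
	suggested := (usageMilli*12 + 9) / 10
	if suggested >= requestMilli {
		return Recommendation{}, false
	}
	savings := requestMilli - usageMilli
	return Recommendation{
		SuggestedMilli: suggested,
		SavingsMilli:   savings,
		SavingsPct:     float64(savings) / float64(requestMilli) * 100,
	}, true
}

func main() {
	if rec, ok := RecommendCPU(42, 250); ok {
		fmt.Printf("reduce requests.cpu to ~%dm, savings -%dm (%.0f%%)\n",
			rec.SuggestedMilli, rec.SavingsMilli, rec.SavingsPct)
	}
}
```

Note that the savings figure in the drawer is request minus usage (250m − 42m = 208m), not request minus the suggested value, so applying the recommendation would free slightly less than the headline number.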

Rows with waste are highlighted in amber. Rightsized pods get a green checkmark. The table became a prioritized action list.

The star of the show: click any pod name to get a concrete rightsizing recommendation.

What I learned

Data without action is just reporting. For the first three months of this project, Sentinel was a very nice report. The forecast was beautiful. The donuts were pretty. But you couldn't do anything from the dashboard — you had to write it down, open a terminal, and kubectl edit something.

The pod detail drawer is the first time Sentinel gives you a number you can directly use. That's a different category of tool.

Silent failures compound. The pods.forEach bug, the mem_request = 0 bug, the Database -- in the tooltip — none of them threw visible errors. They all degraded silently. I need better observability on the dashboard itself.

Layout is product thinking. I spent more time this session defining information hierarchy than requesting new features. That felt wasteful in the moment. In retrospect, a dashboard where your eyes know where to go is worth more than a dashboard with more features.

State of the cluster (v0.10.12)

Nodes:    1 (minikube) — Running
Pods:     24 Running (sentinel + google-demo namespaces)
CPU:      32.8% allocated
Waste:    20 pods with savings opportunities
DB:       ✓ OK
Version:  v0.10.12

What's next

The roadmap points to M2 and M3:

Efficiency score per namespace — not just "which pods waste" but "which namespace is worst"
/api/incidents — deterministic violation detection without LLM
Online Boutique lab — baseline → load → chaos → comparison (the post I promised in #2)

And eventually: auth. Because a dashboard with no auth is a tool that trusts everyone in the room.

Sentinel is open-source and honestly versioned. Still 0.x. Getting closer.
