Marcel Boccato

Posted on Apr 12 • Edited on Apr 16

Sentinel Diary #2: the day the dashboard lied (and other honest bugs)

#claudecode #kubernetes #devops #finops

Series: vibe coding with Claude Code + Kubernetes

Today was one of those days where you sit down to "do one small thing" and stand up three hours later with a commit log longer than planned. Spoiler: not a single line was typed manually.

But before getting to the bugs, I need to talk about what changed in the session setup — because that's what made the day possible.

The new setup: GitHub Copilot Pro as a model hub

After a while switching between the Claude API directly, Gemini, and other tools, I made a decision that completely changed my workflow: signing up for GitHub Copilot Pro.

The insight isn't obvious at first. Copilot Pro gives access to multiple models under a single subscription — and that's where things got interesting.

The flow for the day looked roughly like this:

GPT-4.1 mini — initial code review, fast and cheap, good for a first pass
GPT-5.3 Codex — deep architecture review, where I needed denser reasoning
OpenCode with Claude Opus — installed OpenCode and ran it with Opus for the first complex analyses
Migration to Claude Sonnet 4.6 — after comparing results, moved to Sonnet. Equivalent quality, significantly lower token consumption

Using Copilot Pro as a hub — switching models depending on the task — was the turning point. Instead of paying for individual tokens or depending on a single provider, you have a menu and pick the right tool for each moment.

Sonnet 4.6 specifically surprised me: across Sentinel development sessions, it delivered the same reasoning quality as Opus at a fraction of the consumption. For continuous work on long projects, that makes a real difference both on the wallet and on session flow.

What happened since the last post

Sentinel grew quite a bit over the last week — and some decisions deserve a record before getting into today.

Leaving Grafana/Prometheus. The original stack depended on kube-prometheus-stack: Prometheus collecting metrics, Grafana displaying, AlertManager notifying. It worked, but was heavy for a local environment and created an infrastructure dependency I wanted to eliminate. The solution was my call: make the Go agent the single source of truth. Claude implemented it — collecting directly via client-go, persisting to PostgreSQL, exposing the REST API. No sidecar. No scrape. No three port-forwards at startup.

Helm chart. I wanted Sentinel to be a first-class Kubernetes citizen. A single helm install to bring everything up:

helm install sentinel helm/sentinel -n sentinel --create-namespace

Claude built the chart — Deployment, Service, ConfigMap, an initContainer that waits for PostgreSQL, and automatic InClusterConfig (the agent detects if it's running inside the cluster and uses the ServiceAccount, no kubeconfig needed).

Three-tier retention policies. With history growing, I needed a storage strategy that wouldn't blow up the local PostgreSQL. I defined the tiers:

Tier	Granularity	Retention
Raw	~10s	24 hours
Hourly	1 hour	30 days
Daily	1 day	365 days

Claude implemented the hourly aggregation job and extended /api/history to support ranges from 30m to 365d — same API, transparent to the dashboard.
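To make the "same API, transparent to the dashboard" idea concrete, here is a minimal sketch of how a requested range could be mapped to a retention tier. The tier limits come from the table above; the function names (`parseRange`, `pickTier`) and the selection rule itself are assumptions for illustration, not Sentinel's actual Go code.

```javascript
// Hypothetical sketch: map a requested history range to a retention tier.
// Tier limits mirror the retention table; the selection rule is an assumption.
const UNIT_MS = { m: 60 * 1000, h: 60 * 60 * 1000, d: 24 * 60 * 60 * 1000 };

const TIERS = [
  { name: "raw", maxAgeMs: 24 * UNIT_MS.h },
  { name: "hourly", maxAgeMs: 30 * UNIT_MS.d },
  { name: "daily", maxAgeMs: 365 * UNIT_MS.d },
];

// Parse strings such as "30m", "6h" or "365d" into milliseconds.
function parseRange(range) {
  const match = /^(\d+)([mhd])$/.exec(range);
  if (!match) throw new Error(`invalid range: ${range}`);
  return Number(match[1]) * UNIT_MS[match[2]];
}

// Pick the finest-grained tier whose retention window still covers the range.
function pickTier(range) {
  const ms = parseRange(range);
  const tier = TIERS.find((t) => ms <= t.maxAgeMs);
  if (!tier) throw new Error(`range exceeds retention: ${range}`);
  return tier.name;
}
```

With this rule, a 30m or 24h request reads raw samples, a 7d request reads hourly aggregates, and anything up to 365d falls through to the daily tier.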

Security hardening. GPT-5.3 Codex did a deep code review and flagged several issues: an unbounded connection pool, missing rate limiting, and a bind address exposed on 0.0.0.0 with no way to configure it. I took those findings to Claude, who fixed all of them. The harness got Unicode normalization (NFKC), a 10MB input limit, and path traversal protection on the --component parameter. 16 tests covering critical cases.
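The order of operations matters for the --component check: normalize first, validate second. A rough sketch of that idea follows. The function name `sanitizeComponent` and the allowlist pattern are my assumptions, and this is not the harness code itself.

```javascript
// Hypothetical sketch: NFKC-normalize, then validate against a strict allowlist.
// The allowed character set is an assumption, not the real harness rule.
function sanitizeComponent(raw) {
  if (typeof raw !== "string") {
    throw new TypeError("component must be a string");
  }
  // NFKC folds compatibility characters (for example fullwidth "．" and "／")
  // into their ASCII forms, so lookalike traversal sequences cannot slip past.
  const name = raw.normalize("NFKC");
  // Letters, digits, underscore and hyphen only: no dots, no path separators.
  if (!/^[A-Za-z0-9_-]+$/.test(name)) {
    throw new Error("invalid component name");
  }
  return name;
}
```

Validating before normalizing would let a string like "．．／secrets" through, since it contains no ASCII dot or slash until NFKC rewrites it as "../secrets".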

The mystery of the zeros

Opened the Sentinel dashboard. Everything was zeroed out. All panels showing --. Node Health Map empty. Pod Distribution gone. FinOps missing.

The cluster was running. Pods were healthy. The port-forward had started. But JavaScript wasn't receiving anything.

First hypothesis: Sentinel pod with a problem. kubectl logs — normal.
Second hypothesis: Metrics Server offline. Tested — working.
Third hypothesis: something with the port-forward.

Ran the command that solved everything:

lsof -i :8080

sentinel-agent  59321  boccatosantos  ...  IPv4  *:8080 (LISTEN)  <- the villain
kubectl         61204  boccatosantos  ...  IPv6  *:8080 (LISTEN)  <- correct

A locally compiled sentinel-agent instance had been running since 17:49, listening on IPv4. Firefox was connecting to that instance, which had no access to the cluster at all. The kubectl port-forward was there too, but bound to IPv6, which is why both processes could hold port 8080 at the same time, and the browser preferred IPv4.

Fix: kill 59321.

The hard part was getting to that line. The fix itself took two seconds.

The dashboard still wouldn't load

Killed the process. Refreshed the browser. Still no data.

Opened DevTools and found this in the console:

Refused to connect to /api/summary because it violates the Content Security Policy directive

The Go server had a Content-Security-Policy header configured, but without connect-src. The browser was silently blocking every fetch() call from JavaScript. No visible error in the UI — just the console screaming for anyone paying attention.

I described the issue to Claude, who updated main.go:

// before
"default-src 'self'; script-src ..."

// after
"default-src 'self'; connect-src 'self'; script-src ..."

One word. One hour of diagnosis.
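For anyone debugging a similar header, here is a small sketch of how a browser decides which directive governs fetch(). It is a deliberately simplified model of CSP matching (it only looks for the 'self' keyword and the * wildcard), and the helper names are hypothetical. Two rules it encodes are standard CSP behavior: connect-src falls back to default-src when absent, and the keyword only counts when written with single quotes. Whether either rule was the root cause here depends on the exact header, which the snippet above elides.

```javascript
// Parse a Content-Security-Policy header value into { directive: [sources] }.
function parseCsp(header) {
  const policy = {};
  for (const part of header.split(";")) {
    const [name, ...sources] = part.trim().split(/\s+/);
    if (name) policy[name.toLowerCase()] = sources;
  }
  return policy;
}

// fetch() and XHR are governed by connect-src; when that directive is
// absent, the browser falls back to default-src.
function sourcesForFetch(header) {
  const policy = parseCsp(header);
  return policy["connect-src"] ?? policy["default-src"] ?? null;
}

// Simplified check: same-origin fetches pass when no directive applies,
// or when the governing list contains the quoted keyword 'self' or "*".
// A bare self (no quotes) is read as a host named "self", not the keyword.
function allowsSameOrigin(header) {
  const sources = sourcesForFetch(header);
  return sources === null || sources.includes("'self'") || sources.includes("*");
}
```

Running the header through a check like this before shipping would have turned an hour of staring at an empty dashboard into a failing assertion.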

The bar that lied

With the dashboard working, I noticed the Utilization bar in the Top Workloads panel was wrong. It looked right — showed percentages, had colors — but the calculation was:

// heaviest pod in the cluster = 100%
// all others are relative to it
const pct = (cpu / maxConsumer) * 100;

A pod using 10m CPU with a 1000m request appeared as 100% efficient if it happened to be the cluster's top consumer at that moment, even though it was using only 1% of what it had requested. Useless for FinOps.

I explained the right semantics to Claude — usage vs the pod's own request:

// how much the pod is using vs what IT REQUESTED
const pct = request > 0 ? Math.min((cpu / request) * 100, 100) : 0;

Semantic colors:

Green (>70%): well-sized pod
Orange (40-70%): some waste
Red (<40%): oversized, right-sizing candidate
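Putting the formula and the color bands together, a minimal sketch could look like this. The function names and the choice of millicores as the unit are assumptions; the thresholds are the ones listed above, with the 40-70 band treated as inclusive on both ends.

```javascript
// Usage relative to the pod's own CPU request, capped at 100%.
// Pods without a request get 0 so they never look "well-sized" by accident.
function utilizationPct(cpuMillicores, requestMillicores) {
  return requestMillicores > 0
    ? Math.min((cpuMillicores / requestMillicores) * 100, 100)
    : 0;
}

// Map a utilization percentage to the semantic color bands.
function utilizationColor(pct) {
  if (pct > 70) return "green"; // well-sized
  if (pct >= 40) return "orange"; // some waste
  return "red"; // oversized, right-sizing candidate
}
```

The pod from the earlier example (10m used against a 1000m request) now comes out at roughly 1% and lands in red, no matter how it compares to its neighbors.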

This bug had been there from the start. I only caught it when I stopped to actually look at the number.

Financial Correlation got context

The ROI Timeline panel was showing only the chart. You could see the Budget vs Actual lines, but without value references — hard to know if the waste was $0.002/h or $2/h.

I asked Claude to add a fixed summary above the chart:

Budget  $0.0312/h  |  Actual  $0.0102/h  |  Waste  $0.0210/h (67.3%)

Each point's tooltip now shows all three values, and the Y-axis precision adapts to the data: no more embarrassing $0.000 when values are fractions of a cent.
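The arithmetic behind the summary line is simple enough to sketch. Waste is budget minus actual, and the percentage is waste over budget. The helper names below are hypothetical, and the precision rule in `axisLabel` (about three significant digits below $1, capped at six decimals) is my guess at what "adaptive" could mean, not the dashboard's actual rule.

```javascript
// Build the fixed summary line shown above the chart.
function wasteSummary(budget, actual) {
  const waste = Math.max(budget - actual, 0);
  const wastePct = budget > 0 ? (waste / budget) * 100 : 0;
  const usd = (v) => `$${v.toFixed(4)}/h`;
  return (
    `Budget  ${usd(budget)}  |  Actual  ${usd(actual)}  |  ` +
    `Waste  ${usd(waste)} (${wastePct.toFixed(1)}%)`
  );
}

// Adaptive Y-axis label: two decimals from $1 up, more decimals for
// smaller values so tiny hourly costs never collapse to "$0.000".
function axisLabel(value) {
  if (!(value > 0)) return "$0.00";
  const decimals =
    value >= 1 ? 2 : Math.min(2 - Math.floor(Math.log10(value)), 6);
  return `$${value.toFixed(decimals)}`;
}
```

Feeding in the numbers from the summary above (0.0312 budget, 0.0102 actual) reproduces the same line, including the 67.3% waste figure.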

The versioning decision

This was the most honest moment of the day.

The project was at v1.7.3. But it has no auth, no configurable alerts, no tests. Calling it v1.x implies a stable API and a feature-complete product, and that's not what Sentinel is today.

Decided to renumber everything: 1.x → 0.x.

Before	After
v1.1	v0.1 — initial MVP
v1.3	v0.3 — FinOps + PostgreSQL
v1.5	v0.5 — Security hardening
v1.6	v0.6 — Configurable retention
v1.7	v0.7 — Standalone, no Prometheus
v1.7.3	v0.7.3 — Today

v1.0 will be the real milestone: when auth, alerts and tests are done. Until then, we're 0.x and proud of it.

The final touch

To close the session, I asked Claude to add a small version badge in the top-right corner of the header:

Default state: discreet gray, mono font
Hover: lights up in cyan
Tooltip: Sentinel v0.7.3 / Kubernetes Observability

Six lines of CSS. But it gives that sense of a cared-for product.

Final cluster state

Nodes:    1 (minikube) — Running
Pods:     24 Running, 0 Failed, 0 Pending
CPU:      2620m requested / 8000m allocatable (32.75% efficiency)
Commits:  11 today

Preparing the battlefield: Google Online Boutique

At the end of the session, before closing the terminal, I did something that will pay off in the next episode: deploying Google Online Boutique in a dedicated namespace.

Online Boutique is Google's microservices demo — 12 services simulating a real e-commerce app (frontend, cart, checkout, payment, recommendation engine and more). It's the perfect stress-test target for Sentinel.

kubectl create namespace google-demo
kubectl apply -n google-demo -f https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/main/release/kubernetes-manifests.yaml

Two commands. Twelve services. A proper load to observe.

Namespace: google-demo
Pods:      12 Running
Services:  12

The cluster went from 12 to 24 pods. Sentinel picked up everything without any configuration change — it monitors all namespaces by default.

Why does this matter? Because Sentinel was built and tested with its own workload as the only reference. Now there's a realistic multi-service app to probe: uneven CPU distribution, idle pods, services with no requests, cost variance between workloads. Real FinOps territory.

Next up: Sentinel Diary #3 — where we'll use Online Boutique as the lab. Capacity analysis, scaling automation on failure, request spike simulation. The cluster is set. Let's break things on purpose.

Three bugs. Four improvements. A more honest versioning. And not a single line typed manually.

This is Sentinel v0.7.3.

Full Changelog

v0.7.3 — today

Utilization bar fixed — now shows real usage / request, not relative to top consumer
Semantic colors: green (>70% efficient), orange (40-70%), red (<40% = waste)
Financial Correlation improved — Budget / Actual / Waste summary above the chart
Enriched tooltip — shows Budget, Actual and Waste per point on hover

v0.7 — fully standalone

Removed all Prometheus, Grafana and AlertManager dependencies
Resilient startup — initContainer waits for PostgreSQL + exponential backoff retry in Go
CSP fix — added connect-src 'self' to allow fetch requests in the dashboard
tools/monitor.py rewritten to use Go agent API
/startup simplified — only checks Minikube and Go agent

v0.6 — configurable retention

3-tier retention policy: raw (24h), hourly (30d), daily (365d) with automatic cleanup
/api/history now supports ranges from 30m to 365d
Hourly auto-aggregation compacting old metrics
New tables: metrics_hourly, metrics_daily, cost_history

v0.5 — Helm + security hardening

Helm chart — full Kubernetes deploy with helm install sentinel helm/sentinel -n sentinel
InClusterConfig — Go agent auto-detects if running inside the cluster
Auto-schema — metrics table created automatically on startup
Security hardening — PostgreSQL connection pool, rate limiting (100 rps), configurable bind address
Harness — Unicode normalization (NFKC), 10MB input limit, 16 tests
--component sanitized against path traversal, timeout with safe clamping

If you want to follow the project: github.com/boccato85/Sentinel

Top comments (1)

jakefurlong • Apr 13

Very nice. Sentinel looks like a cool project.