DEV Community: Marcel Boccato

# Sentinel Diary #4: From Dashboard to Incident Response — The deterministic path to reliable SRE

Marcel Boccato — Thu, 16 Apr 2026 21:43:57 +0000

Context: The "Vibe Coding" Evolution

We are currently at v0.10.20. Looking back at the last post, we were celebrating the FinOps module. Since then, the project has undergone a significant architectural shift.

My development stack evolved: I started with Claude Code, which generated the original monolith (main.go reaching ~2,200 lines). I then used Gemini 3.1 Pro to execute a massive refactoring, decomposing the monolith into a clean pkg/ structure (api, k8s, store, incidents). Finally, I integrated Minimax 2.7 (via Opencode) to push from v0.10.17 to v0.10.20, building the new "no-scroll" dashboard. I continue to use Gemini CLI as my core orchestration layer. The result? Higher velocity, better code structure, and a dashboard that finally feels like an SRE tool, not a prototype.

M3: Deterministic Incident Intelligence

The dashboard was excellent for viewing costs, but I realized it was "read-only." It showed the state, but it didn't detect issues. I needed it to assist the operator, not just display data.

1. Code Refactoring (The end of the "God Object")

The main.go file was a massive 2,282 lines of code. It was a monolith of responsibility. I guided the Gemini 3.1 Pro agent to refactor it into dedicated packages: pkg/api, pkg/k8s, pkg/store, and pkg/incidents. It now sits at a lean ~220 lines.

2. Deterministic Detection vs. LLM Hype

I wanted the system to be useful even without an LLM. I implemented /api/incidents using pure deterministic logic.

The logic: Correlating a CrashLoopBackOff status with a spike in CPU usage.
The severity: We now inject a severity field, allowing the dashboard to prioritize what matters.
The UI: The dashboard is now "no-scroll" and event-driven. You can toggle between FinOps and SRE views without leaving the context.

M4: Security Hardening (WIP)

Observability is a liability if your tool is an attack vector. We are currently executing Milestone 4, focusing on hardening the system:

DOM-based XSS: During a security audit via CodeQL, I received high-severity alerts indicating the dashboard was vulnerable to XSS due to dynamic rendering via innerHTML. I instructed the AI to integrate DOMPurify to sanitize inputs before rendering.
Data Resilience (In progress): We are migrating from emptyDir to PersistentVolumeClaim (PVC) for PostgreSQL, ensuring pod restarts no longer result in data loss.
CI/CD Pipeline (Completed): I implemented GitHub Actions. Every push or pull_request now triggers go test and helm lint. If the build is red, it does not merge.

Lessons Learned

"If the LLM goes down, Sentinel stays useful."

The deterministic-first approach is not just a design choice; it is a necessity for SRE tools. I observed that agents (like Minimax/Gemini) are brilliant, but they shouldn't be the central nervous system of your reliability tool. They should be the "specialist on call"—highly valuable, but not required for the system to remain upright.

Current Cluster State

Dimension	Grade
Efficiency	A+
Security	✅ (Patched)
Tests	25 (Go) + 16 (Harness)

Next steps: Prepare for M7, the real-world lab with Online Boutique (Chaos Engineering).

Portuguese Version

Contexto: A Evolução do "Vibe Coding"

Estamos na v0.10.20. Olhando para o post anterior, estávamos celebrando o módulo de FinOps. Desde então, o projeto passou por uma mudança arquitetural significativa.

Minha stack de desenvolvimento evoluiu: comecei com o Claude Code, que gerou o monólito original (main.go chegando a ~2.200 linhas). Em seguida, usei o Gemini 3.1 Pro para executar uma refatoração massiva, decompondo o monólito em uma estrutura limpa pkg/ (api, k8s, store, incidents). Finalmente, integrei o Minimax 2.7 (via Opencode) para subir da v0.10.17 para a v0.10.20, construindo o novo dashboard "no-scroll". Continuo usando o Gemini CLI como minha camada de orquestração central. O resultado? Maior velocidade, estrutura de código superior e um dashboard que finalmente parece uma ferramenta de SRE, não um protótipo.

M3: Inteligência Determinística de Incidentes

O dashboard era excelente para ver custos, mas percebi que ele era "somente leitura". Ele mostrava o estado, mas não detectava problemas. Eu precisava que ele auxiliasse o operador, não apenas exibisse dados.

1. Refatoração de Código (O fim do "God Object")

O arquivo main.go tinha 2.282 linhas. Era um monólito de responsabilidade. Orientei o agente do Gemini 3.1 Pro a refatorá-lo em pacotes dedicados: pkg/api, pkg/k8s, pkg/store e pkg/incidents. Agora ele tem enxutas ~220 linhas.

2. Detecção Determinística vs. Hype de LLM

Eu queria que o sistema fosse útil mesmo sem LLM. Implementei o /api/incidents usando lógica puramente determinística.

A lógica: Correlacionar um status CrashLoopBackOff com um pico na utilização de CPU.
A severidade: Agora injetamos um campo severity, permitindo que o dashboard priorize o que importa.
A UI: O dashboard agora é "no-scroll" e orientado a eventos. Você pode alternar entre as visões de FinOps e SRE sem sair do contexto.

M4: Hardening de Segurança (Em andamento)

Observabilidade é um risco se a ferramenta for um vetor de ataque. Estamos no meio da execução do Milestone 4, focando em blindar o sistema:

XSS baseado em DOM: Durante uma auditoria de segurança via CodeQL, recebi alertas de alta severidade indicando que o dashboard estava vulnerável a XSS por renderizar conteúdo via innerHTML. Instruí o agente a integrar o DOMPurify para sanitizar todo input antes da renderização no DOM.
Resiliência de Dados (Em andamento): Iniciamos a migração de emptyDir para PersistentVolumeClaim (PVC) real para o PostgreSQL, garantindo que restarts de pod não resultem em perda de dados.
Pipeline de CI/CD (Concluído): Implementei GitHub Actions. Todo push ou pull_request agora dispara go test e helm lint. A regra é clara: build vermelho não entra na main.

Lições Aprendidas

"Se o LLM cair, o Sentinel continua útil."

A abordagem deterministic-first não é apenas uma escolha de design; é uma necessidade para ferramentas de SRE. Observei que agentes (como Minimax/Gemini) são brilhantes, mas não devem ser o sistema nervoso central da sua ferramenta de confiabilidade. Eles devem ser o "especialista de plantão" — altamente valioso, mas não obrigatório para o sistema permanecer de pé.

Próximos passos: Preparar o M7, o laboratório real com Online Boutique (Chaos Engineering).

Repositório: https://github.com/boccato85/Sentinel

Sentinel Diary #3: From Information to Action — When the Dashboard Learned to Think

Marcel Boccato — Sun, 12 Apr 2026 18:06:36 +0000

Sentinel Diary #3: From Information to Action — When the Dashboard Learned to Think

A vibe coding journey: building a Kubernetes FinOps platform from scratch, one conversation at a time.

When I published Diary #2, the dashboard was finally telling the truth. The bugs were fixed, the data was real, the version badge was glowing cyan on hover. It felt like a finished thing.

It wasn't. It was a read-only mirror of a cluster.

Diary #3 is the story of turning that mirror into a tool — the session where Sentinel stopped showing data and started helping me act on it.

Where we left off

At v0.7.3, Sentinel had:

A Go agent collecting metrics every ~10s
PostgreSQL storing raw + hourly + daily aggregates
A dashboard with cost timeline, pod health, CPU utilization
22 automated tests
Zero authentication (honest versioning: still 0.x)

Online Boutique (12 Google microservices) was already deployed in google-demo namespace, waiting. Twenty-four pods. Real workload distribution. Real waste candidates.

I just couldn't do anything about them from the dashboard.

The "before" — a beautiful read-only report.

v0.10.0 — The forecast that scared me

Before visual work, I wanted the dashboard to answer a question I kept asking manually: "if this cluster runs through the weekend, how much will I spend?"

I spec'd out the requirement: linear regression over historical cost data, with confidence bands. No external dependencies — pure Go. I handed it to Claude, and the result was /api/forecast: a projection endpoint with ±1.5σ confidence bands.

The chart came back with a dashed purple budget line, a cyan usage line, shaded confidence regions, and a projected waste card below. It looked like something from a Bloomberg terminal.

Then I looked at the numbers.

Projected waste: 67% of budget. Every dollar spent on this cluster, sixty-seven cents was going to pods with requests set far above actual consumption.

The forecast didn't tell me something I didn't know. It told me something I knew but hadn't seen.

v0.10.1 — Closing M1

Before going further with UI, I closed Milestone 1 properly. I had a checklist:

/health endpoint with DB and collector status checks
Structured logging with slog (consistent fields across all components)
Thresholds loaded from config/thresholds.yaml via ConfigMap (no hardcoded values)
Version badge reading dynamically from /health
Fallback data for long ranges (30d/90d/1y)

Claude implemented all of it in a single session.

M1 criterion: "Sentinel collects, persists, calculates waste, and reports its own health without manual intervention." ✅

The layout problem

By v0.10.3, I had a confession to make to the dashboard.

It was working. Every metric was real. But it was ugly in a specific way: information arranged like a report, not like a tool. Everything equal weight. No hierarchy. No "look here first."

I spent the next few versions doing something I rarely do consciously: thinking about information architecture before writing a single directive.

The question wasn't "what data do we have?" It was "when someone opens this at 2am during an incident, where should their eyes go first?"

Answer: KPIs. Then cluster health. Then cost. Then details.

v0.10.4–v0.10.8 — The great layout rework

Version by version, I described what I needed and Claude shaped the layout:

v0.10.4: I wanted a dedicated Memory tile — a visual showing requested vs allocatable memory, with a drawer that broke risk down by namespace. Claude built a purple donut with OOM risk breakdown.

v0.10.5: Per-tile namespace filters — each tile (Pods, CPU, Memory) needed its own independent <select> so filtering one wouldn't break the others. Financial Correlation grew to full-width with an orange FinOps border. The drawer got an interactive period selector and sortable columns.

v0.10.6–v0.10.7 reorganized the grid — I drew the hierarchy on paper first:

row-4: Node Health | Pod Distribution | CPU (compact) | Memory (compact)
Financial Correlation: full-width, immediately below
Waste Intelligence: full-width with scroll, at the bottom
Active Alerts tile: removed (empty space is worse than no tile)

v0.10.8: An animated alert badge in the header — green dot for "All OK", orange for warnings, red pulsing for critical. All six KPI cards clickable, each opening its respective drawer. The dead "Active Alerts" KPI replaced with "Top Memory Consumer" — the actually useful metric.

The "after" — v0.10.12 with unified layout, forecast chart and Top Workloads panel.

v0.10.9 — The bug that crashed silently

During testing, I noticed the KPI cards were showing -- for values. Not an error. Not a console warning. Just dashes.

I flagged it to Claude, who traced it to a ReferenceError in updateOverview(). The code was doing:

pods.forEach(p => { ... })

But /api/summary doesn't return an individual pods array. It returns podsByPhase, failedPods, pendingPods. The variable pods didn't exist.

The error was thrown, silently swallowed by the outer try/catch, and execution stopped before updating kT, kMem, kW — all the KPI values. They stayed at -- from initialization.

Claude extracted updatePodsAllNsTile() — a new async function that fetches /api/pods separately, groups by namespace, and renders a namespace-distribution donut instead of the broken phase breakdown.

Silent failures are the worst kind. At least a loud crash tells you where to look.

v0.10.10 — The column that was always zero

The Memory drawer had an "Mem Request" column. It showed N/A for every pod.

I queried the DB directly.

SELECT DISTINCT mem_request FROM metrics LIMIT 5;
 -- 0
 -- 0
 -- 0

Every row. Zero.

Four versions back, when the DB INSERT was written, mem_request was hardcoded to 0. The struct field existed, the column existed, the frontend expected data — but real values were never being written.

I described the fix to Claude: collect memory requests per pod during the collection cycle and use those real values in the INSERT. Claude built podMemRequestMap[namespace][pod], summing memory requests across all containers. The INSERT now uses the real value.

Historical data stays zero — it's already written. But every new collection has the right number. A migration would fix history; I decided to let time heal it.

FinOps drawer: sortable history table with Budget, Actual, Waste and Waste% columns.

Memory drawer: per-namespace breakdown with OOM risk indicator per pod.

v0.10.11–v0.10.12 — From display to decision

This is the part I'm most proud of.

v0.10.11: I wanted a tooltip on the "Connected" badge — hover to see cluster health at a glance without opening any drawer. Claude built a card showing Cluster, Endpoint, Version, Session uptime, Last sync, and Database status. Small detail. High signal.

v0.10.12: I wanted to merge Waste Intelligence and Top Workloads into a single action-oriented panel: "Top Workloads — CPU & Waste Analysis". But the real ask was making pod names clickable.

I defined the interaction: click a pod name → drawer opens with current usage, request, a utilization bar, and a concrete rightsizing recommendation. Claude built it:

 kube-apiserver-minikube          sentinel          ⚠ Overprovisioned

 CPU Usage / Request              42m / 250m
 ████████░░░░░░░░░░░░░░░░░░░░    16.8%

 Memory Usage / Request           312 Mi / No request set

 ⚠ Savings Opportunity
 Potential CPU savings: -208m (83%)
 CPU request is significantly higher than actual usage.
 Consider reducing resources.requests.cpu to ~51m.

The number ~51m comes from ceil(actualUsage × 1.2) — a 20% headroom buffer calculated at draw time. Not a generic recommendation. A concrete one, specific to that pod, at that moment.

Rows with waste are highlighted in amber. Rightsized pods get a green checkmark. The table became a prioritized action list.

The star of the show: click any pod name to get a concrete rightsizing recommendation.

What I learned

Data without action is just reporting. For the first three months of this project, Sentinel was a very nice report. The forecast was beautiful. The donuts were pretty. But you couldn't do anything from the dashboard — you had to write it down, open a terminal, and kubectl edit something.

The pod detail drawer is the first time Sentinel gives you a number you can directly use. That's a different category of tool.

Silent failures compound. The pods.forEach bug, the mem_request = 0 bug, the Database -- in the tooltip — none of them threw visible errors. They all degraded silently. I need better observability on the dashboard itself.

Layout is product thinking. I spent more time this session defining information hierarchy than requesting new features. That felt wasteful in the moment. In retrospect, a dashboard where your eyes know where to go is worth more than a dashboard with more features.

State of the cluster (v0.10.12)

Nodes:    1 (minikube) — Running
Pods:     24 Running (sentinel + google-demo namespaces)
CPU:      32.8% allocated
Waste:    20 pods with savings opportunities
DB:       ✓ OK
Version:  v0.10.12

What's next

The roadmap points to M2 and M3:

Efficiency score per namespace — not just "which pods waste" but "which namespace is worst"
/api/incidents — deterministic violation detection without LLM
Online Boutique lab — baseline → load → chaos → comparison (the post I promised in #2)

And eventually: auth. Because a dashboard with no auth is a tool that trusts everyone in the room.

Sentinel is open-source and honestly versioned. Still 0.x. Getting closer.

Sentinel Diary #3: De Informação para Ação — Quando o Dashboard Aprendeu a Pensar

Uma jornada de vibe coding: construindo uma plataforma FinOps para Kubernetes do zero, uma conversa por vez.

Quando publiquei o Diary #2, o dashboard finalmente estava dizendo a verdade. Os bugs tinham sido corrigidos, os dados eram reais, o badge de versão brilhava em cyan no hover. Parecia uma coisa pronta.

Não estava. Era um espelho somente leitura de um cluster.

O Diary #3 é a história de transformar esse espelho numa ferramenta — a sessão em que o Sentinel parou de mostrar dados e começou a me ajudar a agir sobre eles.

De onde paramos

No v0.7.3, o Sentinel tinha:

Um agente Go coletando métricas a cada ~10s
PostgreSQL armazenando dados raw + hourly + daily
Dashboard com timeline de custo, saúde de pods, utilização de CPU
22 testes automatizados
Zero autenticação (versionamento honesto: ainda 0.x)

O Online Boutique (12 microsserviços do Google) já estava deployado no namespace google-demo, esperando. Vinte e quatro pods. Distribuição real de workload. Candidatos reais a rightsizing.

Eu só não conseguia fazer nada a respeito deles a partir do dashboard.

O "antes" — um relatório bonito, mas somente leitura.

v0.10.0 — O forecast que me assustou

Antes do trabalho visual, eu queria que o dashboard respondesse uma pergunta que eu ficava fazendo manualmente: "se esse cluster rodar durante o fim de semana, quanto vou gastar?"

Defini o requisito: regressão linear sobre os dados históricos de custo, com bandas de confiança. Sem dependências externas — Go puro. Passei o spec para o Claude, e o resultado foi o /api/forecast: um endpoint de projeção com bandas de confiança ±1.5σ.

O gráfico voltou com uma linha tracejada roxa de orçamento, uma linha cyan de uso, regiões sombreadas de confiança e um card de waste projetado abaixo. Parecia algo de um terminal Bloomberg.

Aí eu olhei para os números.

Waste projetado: 67% do orçamento. De cada real gasto no cluster, sessenta e sete centavos iam para pods com requests configurados bem acima do consumo real.

O forecast não me disse algo que eu não sabia. Me disse algo que eu sabia mas não tinha visto.

v0.10.1 — Fechando o M1

Antes de avançar na UI, fechei o Milestone 1 adequadamente. Tinha uma lista de critérios:

Endpoint /health com verificações de status do DB e do collector
Logs estruturados com slog (campos consistentes em todos os componentes)
Thresholds carregados de config/thresholds.yaml via ConfigMap (sem valores hardcoded)
Badge de versão lendo dinamicamente do /health
Fallback de dados para ranges longos (30d/90d/1y)

O Claude implementou tudo em uma única sessão.

Critério do M1: "O Sentinel coleta, persiste, calcula waste e reporta sua própria saúde sem intervenção manual." ✅

O problema do layout

No v0.10.3, eu tinha uma confissão a fazer ao dashboard.

Ele estava funcionando. Cada métrica era real. Mas era feio de uma forma específica: informação arranjada como relatório, não como ferramenta. Tudo com o mesmo peso. Sem hierarquia. Sem "olhe aqui primeiro."

Passei as próximas versões fazendo algo que raramente faço conscientemente: pensar em arquitetura de informação antes de escrever qualquer diretiva.

A pergunta não era "que dados temos?" Era "quando alguém abrir isso às 2h durante um incidente, para onde devem ir os olhos primeiro?"

Resposta: KPIs. Depois saúde do cluster. Depois custo. Depois detalhes.

v0.10.4–v0.10.8 — A grande reestruturação do layout

Versão por versão, eu descrevia o que precisava e o Claude moldava o layout:

v0.10.4: Queria um tile dedicado de Memória — um visual mostrando memória solicitada vs alocável, com um drawer quebrando o risco por namespace. O Claude construiu um donut roxo com breakdown de risco de OOM.

v0.10.5: Filtros de namespace por tile — cada tile (Pods, CPU, Memória) precisava do seu próprio <select> independente, para filtrar um sem quebrar os outros. O painel Financial Correlation cresceu para full-width com borda laranja FinOps. O drawer ganhou seletor de período interativo e colunas ordenáveis.

v0.10.6–v0.10.7: Reorganizei a hierarquia no papel primeiro:

row-4: Node Health | Pod Distribution | CPU (compacto) | Memory (compacto)
Financial Correlation: full-width, imediatamente abaixo
Waste Intelligence: full-width com scroll, no final
Tile Active Alerts: removido (espaço vazio é pior que nenhum tile)

v0.10.8: Pedi um badge de alerta animado no header — ponto verde para "All OK", laranja para warnings, vermelho pulsante para critical. Os seis cards KPI viraram clicáveis, cada um abrindo seu respectivo drawer. O KPI morto "Active Alerts" substituído por "Top Memory Consumer" — a métrica realmente útil.

O "depois" — v0.10.12 com layout unificado, gráfico de forecast e painel Top Workloads.

v0.10.9 — O bug que falhava silenciosamente

Durante os testes, percebi que os cards KPI mostravam -- nos valores. Não um erro. Não um aviso no console. Só travessões.

Reportei ao Claude, que rastreou até um ReferenceError em updateOverview(). O código fazia:

pods.forEach(p => { ... })

Mas /api/summary não retorna um array individual de pods. Retorna podsByPhase, failedPods, pendingPods. A variável pods não existia.

O erro era lançado, silenciosamente engolido pelo try/catch externo, e a execução parava antes de atualizar kT, kMem, kW — todos os valores KPI. Eles ficavam em -- desde a inicialização.

O Claude extraiu updatePodsAllNsTile() — uma nova função async que faz fetch separado em /api/pods, agrupa por namespace e renderiza um donut de distribuição por namespace.

Falhas silenciosas são o pior tipo. Pelo menos um crash barulhento te diz onde procurar.

v0.10.10 — A coluna que sempre foi zero

O drawer de Memória tinha uma coluna "Mem Request". Mostrava N/A para todo pod.

Fui consultar o banco diretamente.

SELECT DISTINCT mem_request FROM metrics LIMIT 5;
 -- 0
 -- 0
 -- 0

Toda linha. Zero.

Quatro versões atrás, quando o DB INSERT foi escrito, mem_request estava hardcoded como 0. O campo da struct existia, a coluna existia, o frontend esperava dados — mas valores reais nunca foram escritos.

Descrevi o fix para o Claude: coletar os memory requests por pod durante o ciclo de coleta e usar esses valores reais no INSERT. O Claude construiu podMemRequestMap[namespace][pod], somando memory requests de todos os containers. O INSERT agora usa o valor real.

Os dados históricos ficam zero — já foram escritos. Mas cada nova coleta tem o número certo. Uma migration consertaria o histórico; decidi deixar o tempo curar.

Drawer FinOps: tabela histórica ordenável com Budget, Actual, Waste e Waste%.

Drawer de memória: breakdown por namespace com indicador de risco de OOM por pod.

v0.10.11–v0.10.12 — De exibição para decisão

Esta é a parte de que mais me orgulho.

v0.10.11: Queria um tooltip no badge "Connected" — passar o mouse para ver a saúde do cluster sem abrir nenhum drawer. O Claude construiu um card mostrando Cluster, Endpoint, Version, Session uptime, Last sync e Database status. Detalhe pequeno. Sinal alto.

v0.10.12: Queria fundir Waste Intelligence e Top Workloads em um único painel orientado a ação: "Top Workloads — CPU & Waste Analysis". Mas o pedido central era tornar os nomes dos pods clicáveis.

Defini a interação: clicar num pod → drawer abre com uso atual, request, barra de utilização e uma recomendação concreta de rightsizing. O Claude implementou:

 kube-apiserver-minikube          sentinel          ⚠ Overprovisioned

 CPU Usage / Request              42m / 250m
 ████████░░░░░░░░░░░░░░░░░░░░    16.8%

 Memory Usage / Request           312 Mi / No request set

 ⚠ Savings Opportunity
 Potential CPU savings: -208m (83%)
 CPU request is significantly higher than actual usage.
 Consider reducing resources.requests.cpu to ~51m.

O número ~51m vem de ceil(usoReal × 1.2) — um buffer de 20% de headroom calculado no momento do render. Não é uma recomendação genérica. É uma concreta, específica para aquele pod, naquele momento.

Linhas com waste ficam destacadas em âmbar. Pods rightsized ganham um checkmark verde. A tabela virou uma lista de ações priorizadas.

A estrela do show: clique em qualquer nome de pod para uma recomendação concreta de rightsizing.

O que aprendi

Dados sem ação são apenas relatório. Durante os primeiros meses deste projeto, o Sentinel era um relatório muito bonito. O forecast era lindo. Os donuts eram bonitos. Mas você não conseguia fazer nada a partir do dashboard — tinha que anotar, abrir um terminal e kubectl edit alguma coisa.

O drawer de detalhe do pod é a primeira vez que o Sentinel te dá um número que você pode usar diretamente. Isso é uma categoria diferente de ferramenta.

Falhas silenciosas se acumulam. O bug do pods.forEach, o bug do mem_request = 0, o Database -- no tooltip — nenhum deles lançou erros visíveis. Todos degradaram silenciosamente. Preciso de melhor observabilidade no próprio dashboard.

Layout é pensamento de produto. Passei mais tempo nesta sessão definindo hierarquia de informação do que pedindo novas features. Isso pareceu desperdício no momento. Em retrospecto, um dashboard onde seus olhos sabem para onde ir vale mais do que um com mais features.

Estado do cluster (v0.10.12)

Nodes:    1 (minikube) — Running
Pods:     24 Running (sentinel + google-demo namespaces)
CPU:      32.8% allocated
Waste:    20 pods com oportunidades de savings
DB:       ✓ OK
Version:  v0.10.12

O que vem a seguir

O roadmap aponta para M2 e M3:

Score de eficiência por namespace — não só "quais pods desperdiçam" mas "qual namespace é o pior"
/api/incidents — detecção determinística de violações sem LLM
Lab Online Boutique — baseline → carga → chaos → comparação (o post que prometi no #2)

E eventualmente: auth. Porque um dashboard sem auth é uma ferramenta que confia em todo mundo na sala.

Sentinel é open-source e honestamente versionado. Ainda 0.x. Chegando lá.

Sentinel Diary #2: the day the dashboard lied (and other honest bugs)

Marcel Boccato — Sun, 12 Apr 2026 03:07:42 +0000

Series: vibe coding with Claude Code + Kubernetes

Today was one of those days where you sit down to "do one small thing" and stand up three hours later with a commit log longer than planned. Spoiler: not a single line was typed manually.

But before getting to the bugs, I need to talk about what changed in the session setup — because that's what made the day possible.

The new setup: GitHub Copilot Pro as a model hub

After a while switching between Claude API directly, Gemini and other tools, I made a decision that completely changed my workflow: signing up for GitHub Copilot Pro.

The insight isn't obvious at first. Copilot Pro gives access to multiple models under a single subscription — and that's where things got interesting.

The flow for the day looked roughly like this:

GPT-4.1 mini — initial code review, fast and cheap, good for a first pass
GPT-5.3 Codex — deep architecture review, where I needed denser reasoning
OpenCode with Claude Opus — installed OpenCode and ran it with Opus for the first complex analyses
Migration to Claude Sonnet 4.6 — after comparing results, moved to Sonnet. Equivalent quality, significantly lower token consumption

Using Copilot Pro as a hub — switching models depending on the task — was the turning point. Instead of paying for individual tokens or depending on a single provider, you have a menu and pick the right tool for each moment.

Sonnet 4.6 specifically surprised me: across Sentinel development sessions, it delivered the same reasoning quality as Opus at a fraction of the consumption. For continuous work on long projects, that makes a real difference both on the wallet and on session flow.

What happened since the last post

Sentinel grew quite a bit over the last week — and some decisions deserve a record before getting into today.

Leaving Grafana/Prometheus. The original stack depended on kube-prometheus-stack: Prometheus collecting metrics, Grafana displaying, AlertManager notifying. It worked, but was heavy for a local environment and created an infrastructure dependency I wanted to eliminate. The solution was my call: make the Go agent the single source of truth. Claude implemented it — collecting directly via client-go, persisting to PostgreSQL, exposing the REST API. No sidecar. No scrape. No three port-forwards at startup.

Helm chart. I wanted Sentinel to be a first-class Kubernetes citizen. A single helm install to bring everything up:

helm install sentinel helm/sentinel -n sentinel --create-namespace

Claude built the chart — Deployment, Service, ConfigMap, an initContainer that waits for PostgreSQL, and automatic InClusterConfig (the agent detects if it's running inside the cluster and uses the ServiceAccount, no kubeconfig needed).

Three-tier retention policies. With history growing, I needed a storage strategy that wouldn't blow up the local PostgreSQL. I defined the tiers:

Tier	Granularity	Retention
Raw	~10s	24 hours
Hourly	1 hour	30 days
Daily	1 day	365 days

Claude implemented the hourly aggregation job and extended /api/history to support ranges from 30m to 365d — same API, transparent to the dashboard.

Security hardening. GPT-5.3 Codex did a deep code review and flagged several issues: unbounded connection pool, missing rate limiting, bind address exposed on 0.0.0.0 without configuration. I took those findings to Claude, who fixed all of them. The harness got Unicode normalization (NFKC), 10MB input limit and path traversal protection on the --component parameter. 16 tests covering critical cases.

The mystery of the zeros

Opened the Sentinel dashboard. Everything was zeroed out. All panels showing --. Node Health Map empty. Pod Distribution gone. FinOps missing.

The cluster was running. Pods were healthy. The port-forward had started. But JavaScript wasn't receiving anything.

First hypothesis: Sentinel pod with a problem. kubectl logs — normal.
Second hypothesis: Metrics Server offline. Tested — working.
Third hypothesis: something with the port-forward.

Ran the command that solved everything:

lsof -i :8080

sentinel-agent  59321  boccatosantos  ...  IPv4  *:8080 (LISTEN)  <- the villain
kubectl         61204  boccatosantos  ...  IPv6  *:8080 (LISTEN)  <- correct

A locally compiled sentinel-agent instance had been running since 17:49 — listening on IPv4. Firefox was connecting to it, which had no access to the cluster at all. The kubectl port-forward was there too, but on IPv6, and the browser preferred IPv4.

Fix: kill 59321.

The hard part was getting to that line. The fix itself took two seconds.

The dashboard still wouldn't load

Killed the process. Refreshed the browser. Still no data.

Opened DevTools and found this in the console:

Refused to connect to /api/summary because it violates the Content Security Policy directive

The Go server had a Content-Security-Policy header configured, but without connect-src. The browser was silently blocking every fetch() call from JavaScript. No visible error in the UI — just the console screaming for anyone paying attention.

I described the issue to Claude, who updated main.go:

// before
"default-src self; script-src ..."

// after
"default-src self; connect-src self; script-src ..."

One word. One hour of diagnosis.

The bar that lied

With the dashboard working, I noticed the Utilization bar in the Top Workloads panel was wrong. It looked right — showed percentages, had colors — but the calculation was:

// heaviest pod in the cluster = 100%
// all others are relative to it
const pct = (cpu / maxConsumer) * 100;

A pod using 10m CPU with a 1000m request appeared as 100% efficient if it happened to be the cluster's top consumer at that moment. Useless for FinOps.

I explained the right semantics to Claude — usage vs the pod's own request:

// how much the pod is using vs what IT REQUESTED
const pct = request > 0 ? Math.min((cpu / request) * 100, 100) : 0;

Semantic colors:

Green (>70%): well-sized pod
Orange (40-70%): some waste
Red (<40%): oversized, right-sizing candidate

This bug had been there from the start. I only caught it when I stopped to actually look at the number.

Financial Correlation got context

The ROI Timeline panel was showing only the chart. You could see the Budget vs Actual lines, but without value references — hard to know if the waste was $0.002/h or $2/h.

I asked Claude to add a fixed summary above the chart:

Budget  $0.0312/h  |  Actual  $0.0102/h  |  Waste  $0.0210/h (67.3%)

Each point's tooltip now shows all three values. Adaptive Y-axis precision — no more embarrassing $0.000 when values are milli-cents.

The versioning decision

This was the most honest moment of the day.

The project was at v1.7.3. But it has no auth, no configurable alerts, no tests. Calling it v1.x implies stable API and feature-complete — and that's not what Sentinel is today.

Decided to renumber everything: 1.x → 0.x.

Before	After
v1.1	v0.1 — initial MVP
v1.3	v0.3 — FinOps + PostgreSQL
v1.5	v0.5 — Security hardening
v1.6	v0.6 — Configurable retention
v1.7	v0.7 — Standalone, no Prometheus
v1.7.3	v0.7.3 — Today

v1.0 will be the real milestone: when auth, alerts and tests are done. Until then, we're 0.x and proud of it.

The final touch

To close the session, I asked Claude to add a small version badge in the top-right corner of the header:

Default state: discrete gray, mono font
Hover: lights up in cyan
Tooltip: Sentinel v0.7.3 / Kubernetes Observability

Six lines of CSS. But it gives that sense of a cared-for product.

Final cluster state

Nodes:    1 (minikube) — Running
Pods:     24 Running, 0 Failed, 0 Pending
CPU:      2620m requested / 8000m allocatable (32.75% efficiency)
Commits:  11 today

Preparing the battlefield: Google Online Boutique

At the end of the session, before closing the terminal, I did something that will pay off in the next episode: deploying Google Online Boutique in a dedicated namespace.

Online Boutique is Google's microservices demo — 12 services simulating a real e-commerce app (frontend, cart, checkout, payment, recommendation engine and more). It's the perfect stress-test target for Sentinel.

kubectl create namespace google-demo
kubectl apply -n google-demo -f https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/main/release/kubernetes-manifests.yaml

Two commands. Twelve services. A proper load to observe.

Namespace: google-demo
Pods:      12 Running
Services:  12

The cluster went from 12 to 24 pods. Sentinel picked up everything without any configuration change — it monitors all namespaces by default.

Why does this matter? Because Sentinel was built and tested with its own workload as the only reference. Now there's a realistic multi-service app to probe: uneven CPU distribution, idle pods, services with no requests, cost variance between workloads. Real FinOps territory.

Next up: Sentinel Diary #3 — where we'll use Online Boutique as the lab. Capacity analysis, scaling automation on failure, request spike simulation. The cluster is set. Let's break things on purpose.

Three bugs. Four improvements. A more honest versioning. And not a single line typed manually.

This is Sentinel v0.7.3.

Full Changelog

v0.7.3 — today

Utilization bar fixed — now shows real usage / request, not relative to top consumer
Semantic colors: green (>70% efficient), orange (40-70%), red (<40% = waste)
Financial Correlation improved — Budget / Actual / Waste summary above the chart
Enriched tooltip — shows Budget, Actual and Waste per point on hover

v0.7 — fully standalone

Removed all Prometheus, Grafana and AlertManager dependencies
Resilient startup — initContainer waits for PostgreSQL + exponential backoff retry in Go
CSP fix — added connect-src 'self' to allow fetch requests in the dashboard
tools/monitor.py rewritten to use Go agent API
/startup simplified — only checks Minikube and Go agent

v0.6 — configurable retention

3-tier retention policy: raw (24h), hourly (30d), daily (365d) with automatic cleanup
/api/history now supports ranges from 30m to 365d
Hourly auto-aggregation compacting old metrics
New tables: metrics_hourly, metrics_daily, cost_history

v0.5 — Helm + security hardening

Helm chart — full Kubernetes deploy with helm install sentinel helm/sentinel -n sentinel
InClusterConfig — Go agent auto-detects if running inside the cluster
Auto-schema — metrics table created automatically on startup
Security hardening — PostgreSQL connection pool, rate limiting (100 rps), configurable bind address
Harness — Unicode normalization (NFKC), 10MB input limit, 16 tests
--component sanitized against path traversal, timeout with safe clamping

Série: vibe coding com Claude Code + Kubernetes

Hoje foi daqueles dias que você senta para fazer uma coisa pequena e levanta três horas depois com commit log maior do que planejava. Spoiler: nenhuma linha foi digitada manualmente.

Mas antes de chegar nos bugs, preciso contar o que mudou na infraestrutura da sessão — porque foi isso que tornou o dia possível.

O novo setup: GitHub Copilot Pro como hub de modelos

Depois de um tempo alternando entre Claude API direto, Gemini e outras ferramentas, tomei uma decisão que mudou o fluxo de trabalho completamente: assinar o GitHub Copilot Pro.

A sacada não é óbvia à primeira vista. O Copilot Pro dá acesso a vários modelos dentro de uma única assinatura — e foi aí que a coisa ficou interessante.

O fluxo do dia foi mais ou menos assim:

GPT-4.1 mini — code review inicial, rápido e barato, bom para uma primeira passagem
GPT-5.3 Codex — review profundo de arquitetura, onde realmente precisava de raciocínio mais denso
OpenCode com Claude Opus — instalei o OpenCode e rodei com Opus para as primeiras análises mais complexas
Migração para Claude Sonnet 4.6 — depois de comparar os resultados, migrei para o Sonnet. Qualidade equivalente, consumo de tokens significativamente menor

Essa estratégia de usar o Copilot Pro como hub — trocando de modelo conforme o tipo de tarefa — foi o divisor de águas. Em vez de pagar por tokens avulsos ou depender de um único provider, você tem um cardápio e escolhe a ferramenta certa para cada momento.

O Sonnet 4.6 especificamente surpreendeu: nas sessões de desenvolvimento do Sentinel, entregou a mesma qualidade de raciocínio do Opus com uma fração do consumo. Para trabalho contínuo em projetos longos, isso faz diferença real no bolso e na fluidez da sessão.

O que aconteceu desde o último post

O Sentinel cresceu bastante nas últimas semanas — e algumas decisões merecem registro antes de entrar no dia de hoje.

Saída do Grafana/Prometheus. O stack original dependia de kube-prometheus-stack. Funcionava, mas era pesado e criava dependências de infraestrutura que eu queria eliminar. Decidi tornar o Go agent a única fonte de verdade. O Claude implementou: coleta direto via client-go, persiste no PostgreSQL e expõe a API REST. Sem sidecar. Sem scrape. Sem três port-forwards na inicialização.

Helm chart. Queria que o Sentinel fosse um cidadão de primeira classe no Kubernetes. Um helm install para subir tudo:

helm install sentinel helm/sentinel -n sentinel --create-namespace

O Claude construiu o chart — Deployment, Service, ConfigMap, initContainer que aguarda o PostgreSQL e InClusterConfig automático (o agente detecta se está rodando dentro do cluster e usa o ServiceAccount, sem precisar de kubeconfig montado).

Políticas de retenção em três camadas. Com o histórico crescendo, precisava de uma estratégia de storage que não explodisse o PostgreSQL local. Defini as camadas:

Camada	Granularidade	Retenção
Raw	~10s	24 horas
Hourly	1 hora	30 dias
Daily	1 dia	365 dias

O Claude implementou o job de agregação por hora e estendeu o /api/history para suportar ranges de 30m até 365d — mesma API, transparente para o dashboard.

Security hardening. O GPT-5.3 Codex fez um code review profundo e sinalizou vários problemas: connection pool sem limite, ausência de rate limiting, bind address exposto em 0.0.0.0. Levei os achados para o Claude, que corrigiu tudo. O harness ganhou normalização Unicode (NFKC), limite de input de 10MB e proteção contra path traversal no parâmetro --component. 16 testes cobrindo os casos críticos.

O mistério dos zeros

Abri o dashboard do Sentinel. Estava tudo zerado. Todos os painéis mostrando --. Node Health Map vazio. Pod Distribution sem dados. FinOps sumido.

O cluster estava rodando. Os pods estavam healthy. O port-forward tinha subido. Mas o JavaScript não recebia nada.

Primeira hipótese: pod do Sentinel com problema. kubectl logs — normal.
Segunda hipótese: Metrics Server offline. Testei — funcionando.
Terceira hipótese: algo com o port-forward.

Rodei o comando que resolveu tudo:

lsof -i :8080

sentinel-agent  59321  boccatosantos  ...  IPv4  *:8080 (LISTEN)  <- vilão
kubectl         61204  boccatosantos  ...  IPv6  *:8080 (LISTEN)  <- correto

Tinha uma instância do sentinel-agent compilada localmente rodando desde as 17h49 — escutando em IPv4. O Firefox conectava nela, que não tinha acesso nenhum ao cluster. O kubectl port-forward estava lá também, mas em IPv6, e o browser preferia o IPv4.

Fix: kill 59321.

O mais trabalhoso foi chegar nessa linha. O fix em si levou dois segundos.

O dashboard ainda não carregava

Matei o processo. Atualizei o browser. Continuava sem dados.

Abri o DevTools e encontrei isso no console:

Refused to connect to /api/summary because it violates the Content Security Policy directive

O servidor Go tinha um header Content-Security-Policy configurado, mas sem connect-src. O browser bloqueava silenciosamente todo fetch() do JavaScript. Nenhum erro visível na UI — só o console gritando pra quem olhasse.

Descrevi o problema para o Claude, que atualizou o main.go:

// antes
"default-src self; script-src ..."

// depois
"default-src self; connect-src self; script-src ..."

Uma palavra. Uma hora de diagnóstico.

A barra que mentia

Com o dashboard funcionando, percebi que a barra de Utilization no painel Top Workloads estava errada. Parecia certa — mostrava porcentagens, tinha cores — mas o cálculo era:

// o pod mais pesado do cluster = 100%
// todos os outros são relativos a ele
const pct = (cpu / maxConsumer) * 100;

Um pod usando 10m de CPU com request de 1000m aparecia como 100% eficiente se fosse o maior consumidor do cluster naquele momento. Inútil para FinOps.

Expliquei a semântica correta para o Claude — uso versus o próprio request do pod:

// quanto o pod está usando do que ELE MESMO pediu
const pct = request > 0 ? Math.min((cpu / request) * 100, 100) : 0;

Cores semânticas:

Verde (>70%): pod bem dimensionado
Laranja (40-70%): algum desperdício
Vermelho (<40%): oversized, candidato a right-sizing

Esse bug estava lá desde sempre. Só percebi quando parei pra olhar o número com atenção.

Financial Correlation ganhou contexto

O painel de ROI Timeline mostrava só o gráfico. Você via as linhas de Budget vs Actual, mas sem referência de valores — ficava difícil saber se o desperdício era $0.002/h ou $2/h.

Pedi ao Claude para adicionar um sumário fixo acima do gráfico:

Budget  $0.0312/h  |  Actual  $0.0102/h  |  Waste  $0.0210/h (67.3%)

Tooltip de cada ponto agora mostra os três valores. Eixo Y com precisão adaptativa — sem aquele $0.000 vergonhoso quando os valores são milicêntimos.

A decisão do versionamento

Esse foi o momento mais honesto do dia.

O projeto estava em v1.7.3. Só que ele não tem auth, não tem alertas configuráveis, não tem testes. Chamar de v1.x implica API estável e feature-complete — e não é isso que o Sentinel é hoje.

Decidi renumerar tudo: 1.x → 0.x.

Antes	Depois
v1.1	v0.1 — MVP inicial
v1.3	v0.3 — FinOps + PostgreSQL
v1.5	v0.5 — Security hardening
v1.6	v0.6 — Retenção configurável
v1.7	v0.7 — Standalone, sem Prometheus
v1.7.3	v0.7.3 — Hoje

v1.0 vai ser o marco real: quando tiver auth, alertas e testes. Até lá, somos 0.x e temos orgulho disso.

O toque final

Para fechar a sessão, pedi ao Claude para adicionar um badge de versão pequeno no canto direito do header:

Estado normal: cinza discreto, fonte mono
Hover: acende em cyan
Tooltip: Sentinel v0.7.3 / Kubernetes Observability

Seis linhas de CSS. Mas dá aquela sensação de produto cuidado.

Estado final do cluster

Nodes:    1 (minikube) — Running
Pods:     24 Running, 0 Failed, 0 Pending
CPU:      2620m requested / 8000m allocatable (32.75% efficiency)
Commits:  11 hoje

Preparando o campo de batalha: Google Online Boutique

No final da sessão, antes de fechar o terminal, fiz algo que vai render no próximo episódio: o deploy do Google Online Boutique em um namespace dedicado.

O Online Boutique é o demo de microsserviços do Google — 12 serviços simulando um e-commerce real (frontend, carrinho, checkout, pagamento, motor de recomendação e mais). É o alvo perfeito para estressar o Sentinel.

kubectl create namespace google-demo
kubectl apply -n google-demo -f https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/main/release/kubernetes-manifests.yaml

Dois comandos. Doze serviços. Uma carga real para observar.

Namespace: google-demo
Pods:      12 Running
Services:  12

O cluster foi de 12 para 24 pods. O Sentinel pegou tudo sem nenhuma mudança de configuração — ele monitora todos os namespaces por padrão.

Por que isso importa? Porque o Sentinel foi construído e testado tendo sua própria workload como única referência. Agora tem um app multi-serviço realista para sondar: distribuição de CPU irregular, pods ociosos, serviços sem requisições, variância de custo entre workloads. Território de FinOps de verdade.

No próximo: Sentinel Diary #3 — onde vamos usar o Online Boutique como laboratório. Análise de capacity, automação de scaling em falhas, simulação de picos de requests. O cluster está pronto. Vamos quebrar coisas de propósito.

Três bugs. Quatro melhorias. Um versionamento mais honesto. E nenhuma linha digitada manualmente.

Esse é o Sentinel v0.7.3.

Changelog completo

v0.7.3 — hoje

Barra de Utilization corrigida — agora mostra uso / request real, não relativo ao maior consumidor
Cores semânticas na barra: verde (>70% eficiente), laranja (40-70%), vermelho (<40% = desperdício)
Financial Correlation melhorado — sumário de Budget / Actual / Waste acima do gráfico
Tooltip enriquecido — mostra Budget, Actual e Waste por ponto ao passar o mouse

v0.7 — standalone completo

Removida toda dependência de Prometheus, Grafana e AlertManager
Inicialização resiliente — initContainer aguarda PostgreSQL + retry com backoff exponencial no Go
CSP fix — adicionado connect-src 'self' para permitir fetch requests no dashboard
tools/monitor.py reescrito para usar API do Go agent
/startup simplificado — apenas verifica Minikube e Go agent

v0.6 — retenção configurável

Política de retenção em 3 camadas: raw (24h), hourly (30d), daily (365d) com cleanup automático
/api/history agora suporta ranges de 30m até 365d
Agregação automática por hora compactando métricas antigas
Novas tabelas: metrics_hourly, metrics_daily, cost_history

v0.5 — Helm + security hardening

Helm chart — deploy completo no Kubernetes com helm install sentinel helm/sentinel -n sentinel
InClusterConfig — Go agent detecta automaticamente se está rodando dentro do cluster
Auto-schema — tabela metrics criada automaticamente no startup
Security hardening — connection pool PostgreSQL, rate limiting (100 rps), bind address configurável
Harness — normalização Unicode (NFKC), limite de input (10MB), 16 testes
Sanitização de --component contra path traversal, timeout com clamping seguro

Se quiser acompanhar o projeto: github.com/boccato85/Sentinel

Sentinel Diary #1: from the Anthropic certificate to a FinOps tool in days

Marcel Boccato — Sun, 05 Apr 2026 22:08:13 +0000

This article is not a tutorial. It's a diary :)

I'm taking the CKA (Certified Kubernetes Administrator) course on KodeKloud, wanted to get started in the world of agentic AI with Claude — and ended up with a Kubernetes observability and FinOps tool running Go, PostgreSQL and a real-time dashboard. Without having planned any of it.

Day 1: the certificate and the $5 credit

I decided to take the official Anthropic course on Skilljar: Claude Code in Action. Free, with a certificate, and it took a few hours to complete.

The course asks for an Anthropic Platform API key to run the examples. I created the key, spun up the local server, and got:

error: Your credit balance is too low to access the Anthropic API.

"But the course is free..."

Not exactly. The course is free. The API calls consume real credits. They're two separate things — and the platform doesn't make that clear enough. I bought $5 in credits, cleared cache and sessions, recreated the key, and it worked.

Lesson 1: capitalism always wins. But $5 goes a long way with Haiku, the entry-level model.

With the certificate in hand the same day, I moved on to practice.

Still Day 1: creating v1.0

The idea was simple: a Claude Code agent that would monitor a Kubernetes cluster and automatically generate runbooks. No manual code. Just directives.

I run Linux Fedora 43 KDE on an Acer Predator PHN16-72 laptop.

Installed Minikube, spun up kube-prometheus-stack via Helm:

helm install prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set grafana.adminPassword=admin123

Six pods coming up at once. Grafana with ready-made dashboards. Prometheus collecting real Minikube metrics.

Then I created the base structure with slash commands in Markdown inside .claude/commands/:

Command	Function
`/sentinel`	Main orchestrator
`/collect-metrics`	Sub-agent A — queries Prometheus via PromQL
`/analyze-pods`	Sub-agent B — checks pods via kubectl
`/correlate`	Sub-agent C — classifies severity

The CLAUDE.md became the agent's operational memory: endpoints, thresholds, namespaces, runbook template.

Ran /sentinel for the first time and it generated this:

Severity: WARNING
CPU: 11.4% ✅ | Memory: 45.1% ✅ | Disk: 17.65% ✅
64 Warning events identified as residual from previous node restart
storage-provisioner: recent BackOff — requires monitoring

The agent separated noise from signal on its own. It identified that the 64 Warning events were residual from a Minikube reboot — not real anomalies. That wasn't in the prompt. It was model reasoning.

v1.0 on GitHub. Same day as the certificate.

Day 2: v1.1 and automatic startup

The biggest friction in v1.0 was operational: every time I opened Claude Code, I had to remember to manually spin up three port-forwards before running any command.

I created /startup — an agent that checks whether Prometheus, Grafana and AlertManager are accessible and only starts the missing port-forwards, in the background, with up to 10 retries:

╔══════════════════════════════════════════╗
║     Sentinel — Startup                   ║
╚══════════════════════════════════════════╝
 Prometheus    (localhost:9090)  →  ✅ STARTED
 Grafana       (localhost:3000)  →  ✅ STARTED
 AlertManager  (localhost:9093)  →  ✅ STARTED

I asked Claude to expand /analyze-pods to monitor multiple namespaces in parallel: default, monitoring and kube-system — with results grouped by namespace and cross-namespace root cause correlation.

The result looked like this:

Namespace	Total	Unhealthy
default	0	0
monitoring	6	0 (all Running)
kube-system	8	0 (all Running)

The agent also identified storage-provisioner with 21 restarts versus an average of 8 for other pods — flagging the genuinely anomalous component with no explicit instruction to do so.

v1.1 on GitHub. Day 2.

The Gemini week: when Claude Code went down

The following week, I had a hard time using Claude — whether via web or via terminal. Tokens were being consumed much faster, even following all of Anthropic's own best practices. Complaints started piling up on Reddit, X, Stack Overflow. Apparently demand was too high for them to absorb. I'm on the $20 pro plan — not a big deal, but even for studying it became unworkable.

I migrated to Gemini. It worked for a good while — and that was the period when Sentinel started gaining its more complex features: the Go agent, PostgreSQL integration, the dashboard.

But as the project grew, problems appeared:

Dashboard UI: Gemini generated functional HTML but with visual inconsistencies I had to fix manually
Code security: I quickly ran Claude Code as a reviewer and found patterns in the generated code that worried me — missing input validation, absent security headers, SQL queries without proper sanitization

When Claude Code came back (they gave an additional ~$22 credit), I switched back. The cost per session is higher and the context window is smaller — but the quality and reliability of generated code, especially on security matters, make it worth it.

Sentinel was born in Gemini. But it grew up in Claude Code.

The v2.0: when it became a real tool

The turning point came with a question: "what if I had the data already collected before calling Claude?"

I wanted a real-time platform and to dive headfirst into vibe coding.

The slash commands were querying Prometheus and kubectl in real time on every execution. That worked, but it was slow and stateless — no history, no trends, no real FinOps.

The solution: a Go agent running continuously in the background.

// Collects every 10 seconds
// Persists to PostgreSQL
// Exposes REST API for Claude to consume

Using native client-go, the agent collects CPU, memory and waste metrics per pod, persists them in batch transactions to PostgreSQL and exposes three endpoints:

Endpoint	Description
`GET /api/summary`	Cluster state: nodes, pods, CPU
`GET /api/metrics`	Per-pod metrics: CPU usage, waste
`GET /api/history`	Cost history for the last 30 min

Claude Code started consuming these endpoints via /incident instead of querying directly. Real layer separation: Go collects, Claude analyzes.

Node Health Map with honeycomb, Pod Distribution with donut chart, Waste Intelligence with savings opportunities per pod, Financial ROI Timeline with Budget vs Actual over the last 30 minutes.

The harness: treating LLM output as untrusted input

Between LinkedIn posts, my brother sent me an interesting article about Harness Engineering — basically having security and reliability around the model, with proper infrastructure, security reviews and constant ReACT patterns (Review, ACT) plus feedback.

Article: Harness Engineering — Martin Fowler

A very important architectural decision to put the concept into practice: harness/validador_saida.py.

Every report generated by Claude Code goes through a gatekeeper before being written to disk. The validator blocks:

Rule	Blocked examples
Destructive commands	`rm -rf`, `kubectl delete`, `DROP TABLE`, fork bomb
Required structure	Reports without `## Executive Summary` are rejected
Minimum size	Content under 100 chars is rejected

if result.returncode != 0:
    return {
        "status": "error",
        "message": f"Validator blocked the write: {result.stderr.strip()}",
        "file": None,
    }

If the validator rejects it, the file is not created. Period.

This isn't paranoia — it's production architecture. Any system that uses an LLM to generate actions on infrastructure needs a gatekeeper.

What vibe coding means in practice

Vibe coding isn't "letting AI do everything". It's a specific way of working:

You define the what. The AI decides the how.

At each step, I knew the result I wanted and acted as the SRE, the decision maker:

"I want startup to check services and spin up missing port-forwards"
"I want namespace-grouped analysis with cross-namespace root cause correlation"
"I want a gatekeeper that blocks destructive commands before writing"

The agent implemented. I reviewed, questioned, redirected.

What surprised me: at no point did I write a single line of the Go agent, the dashboard, or the harness. But every architecture decision was mine. The AI was my senior software engineer.

What became clear about Claude Code

Strengths:

Quality and consistency of generated code, especially in Go
Security reasoning — correct HTTP headers, sanitization, file permissions
Ability to maintain project context via CLAUDE.md
Parallel sub-agents that genuinely execute in parallel
Optimized UI generation

Real limitations:

Smaller context window than competitors — large projects require frequent /compact
Higher cost per token — relevant in long sessions
The token instability from that week showed that total dependency on a single provider is real risk

Journey timeline

Date	Milestone
Mar 8, 2026	Claude Code in Action certificate — Anthropic/Skilljar
Mar 8, 2026	v1.0: slash commands + Prometheus + K8s
Mar 9, 2026	v1.1: automatic startup + multiple namespaces
Week 2	Temporary migration to Gemini — global Anthropic token restriction
Week 2-3	v2.0: Go agent + PostgreSQL + dashboard + harness
Week 3	Return to Claude Code + renamed to Sentinel

Where it stands today

The project is open source, Apache 2.0:

👉 github.com/boccato85/Sentinel

Full stack:

Minikube / Kubernetes v1.35.1
Go agent with client-go
PostgreSQL
Claude Code (OpenCode + Sonnet 4.6)

If you work with SRE, CloudOps or FinOps and want to explore Claude Code in practice, this is a real starting point — not a hello world.

This project is part of a personal track: CKA → Claude Code → MLOps. Follow along here on dev.to.

Esse artigo não é um tutorial. É um diário :)

Estou fazendo o curso de certificação CKA, Certified Kubernetes Administrator através da plataforma KodeKloud, e assim gostaria de iniciar no mundo "agêntico" de IA com o Claude e terminei com uma plataforma de observabilidade e FinOps para Kubernetes rodando Go, PostgreSQL e um dashboard em tempo real — sem ter planejado nada disso.

Dia 1: o certificado e os $5 de crédito

Decidi fazer o curso oficial da Anthropic (Skilljar): Claude Code in Action. Gratuito, com certificado, e levei algumas horas para concluir.

O curso pede uma API key do Anthropic Platform para rodar os exemplos. Criei a chave, subi o servidor local, e recebi:

error: Your credit balance is too low to access the Anthropic API.

"Mas o curso é gratuito..."

Não é bem assim. O curso é gratuito. As chamadas de API consomem créditos reais. São duas coisas separadas — e a plataforma não deixa isso claro o suficiente. Comprei $5 de crédito, deletei cache e sessões, recriei a key, e funcionou.

Lição 1: o capitalismo sempre vence. Mas $5 dura bastante com o Haiku, o modelo de "entrada" digamos assim.

Com o certificado em mãos no mesmo dia, parti para a prática.

Ainda no Dia 1: criação da v1.0

A ideia era simples: um agente Claude Code que monitorasse um cluster Kubernetes e gerasse runbooks automaticamente. Sem código manual. Só diretivas.

Eu utilizo o Linux Fedora 43 KDE em um Notebook Acer Predator PHN16-72.

Instalei o minikube, subi o kube-prometheus-stack via Helm:

helm install prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set grafana.adminPassword=admin123

Seis pods subindo de uma vez. Grafana com dashboards prontos. Prometheus coletando métricas reais do Minikube.

Então criei a estrutura base com slash commands em Markdown dentro de .claude/commands/:

Comando	Função
`/sentinel`	Orquestrador principal
`/collect-metrics`	Sub-agent A — consulta Prometheus via PromQL
`/analyze-pods`	Sub-agent B — verifica pods via kubectl
`/correlate`	Sub-agent C — classifica severidade

O CLAUDE.md virou a memória operacional do agente: endpoints, thresholds, namespaces, template de runbook.

Rodei /sentinel pela primeira vez e ele gerou isso:

Severidade: WARNING
CPU: 11.4% ✅ | Memória: 45.1% ✅ | Disco: 17.65% ✅
64 Warning events identificados como residuais de restart anterior do nó
storage-provisioner: BackOff recente — requer monitoramento

O agente separou ruído de sinal sozinho. Identificou que os 64 eventos de Warning eram residuais de um reboot do Minikube — não anomalias reais. Isso não estava no prompt. Foi raciocínio do modelo.

v1.0 no GitHub. Mesmo dia do certificado.

Dia 2: a v1.1 e o startup automático

O maior atrito da v1.0 era operacional: toda vez que abria o Claude Code, precisava lembrar de subir os três port-forwards manualmente antes de rodar qualquer comando.

Criei o /startup — um agente que verifica se Prometheus, Grafana e AlertManager estão acessíveis e sobe apenas os port-forwards ausentes, em background, com retry de até 10 tentativas:

╔══════════════════════════════════════════╗
║     Sentinel — Startup                   ║
╚══════════════════════════════════════════╝
 Prometheus    (localhost:9090)  →  ✅ STARTED
 Grafana       (localhost:3000)  →  ✅ STARTED
 AlertManager  (localhost:9093)  →  ✅ STARTED

Pedi ao Claude para expandir o /analyze-pods para monitorar múltiplos namespaces em paralelo: default, monitoring e kube-system — com resultados agrupados por namespace e correlação de causa raiz cruzada.

O resultado ficou assim:

Namespace	Total	Unhealthy
default	0	0
monitoring	6	0 (todos Running)
kube-system	8	0 (todos Running)

E o agente ainda identificou o storage-provisioner com 21 restarts versus média de 8 dos outros pods — sinalizando o componente realmente anômalo sem nenhuma instrução explícita sobre isso.

v1.1 no GitHub. Dia 2.

A semana do Gemini: quando o Claude Code saiu do ar

Na semana seguinte, senti uma grande dificuldade de utilizar o Claude, seja via web ou Code via terminal. Os tokens estavam sendo consumidos muito mais rápido, mesmo seguindo todas as boas práticas recomendadas da própria Anthropic. Reclamações começaram a se multiplicar no Reddit, X, Stack etc. Aparentemente a demanda estava sendo alta demais para absorverem. Eu uso o plano pro R$20 dólares, sei que não é grande coisa, mas mesmo para estudos estava inviável.

Migrei para o Gemini. Funcionou por um bom tempo — e foi nesse período que o Sentinel começou a ganhar as features mais complexas: o Go agent, a integração com PostgreSQL, o dashboard.

Mas à medida que o projeto crescia, os problemas apareceram:

UI do dashboard: o Gemini gerava HTML funcional mas com inconsistências visuais que eu precisava corrigir manualmente
Segurança no código: Utilizei o Claude Code rapidamente para ser um code review e encontrei padrões no código gerado que me preocuparam — falta de validação de input, headers de segurança ausentes, queries SQL sem sanitização adequada

Quando o Claude Code voltou a funcionar (deram um crédito adicional de R$ 110), migrei de volta. O custo por sessão é maior e a janela de contexto é menor — mas a qualidade e confiabilidade do código gerado, especialmente em questões de segurança, compensam.

O Sentinel nasceu no Gemini. Mas cresceu no Claude Code.

A v2.0: quando virou plataforma

A virada aconteceu com uma pergunta: "e se eu tiver os dados já coletados antes de chamar o Claude?"

Ou seja, queria uma plataforma em RealTime e entrar de cabeça no vibecoding.

Os slash commands consultavam Prometheus e kubectl em tempo real a cada execução. Isso funcionava, mas era lento e stateless — sem histórico, sem tendência, sem FinOps real.

A solução: um Go agent rodando continuamente em background.

// Coleta a cada 10 segundos
// Persiste no PostgreSQL
// Expõe API REST para o Claude consumir

Com client-go nativo, o agent coleta métricas de CPU, memória e waste por pod, persiste em transação batch no PostgreSQL e expõe três endpoints:

Endpoint	Descrição
`GET /api/summary`	Estado do cluster: nodes, pods, CPU
`GET /api/metrics`	Métricas por pod: CPU usage, waste
`GET /api/history`	Histórico de custo dos últimos 30min

O Claude Code passou a consumir esses endpoints via /incident em vez de fazer as queries diretamente. Separação de camadas real: Go coleta, Claude analisa.

Node Health Map com honeycomb, Pod Distribution com donut chart, Waste Intelligence com savings opportunities por pod, ROI Timeline financeiro com Budget vs Actual nos últimos 30 minutos.

O harness: tratando output de LLM como untrusted input

Bem, entre publicações no Linkedin, meu irmão me enviou um artigo interessante sobre Harness Engineering, basicamente ter segurança e confiabilidade ao redor do modelo, com infraestrutura adequada, revisões de segurança e padrões de ReACT constante (Review, ACT) além de feedback.

Segue artigo: Harness Engineering — Martin Fowler

Uma decisão de arquitetura muito importante para colocar o conceito em prática: o harness/validador_saida.py.

Todo relatório gerado pelo Claude Code passa por um gatekeeper antes de ser gravado em disco. O validador bloqueia:

Regra	Exemplos bloqueados
Comandos destrutivos	`rm -rf`, `kubectl delete`, `DROP TABLE`, fork bomb
Estrutura obrigatória	Relatórios sem `## Resumo Executivo` são rejeitados
Tamanho mínimo	Conteúdo menor que 100 chars é rejeitado

if result.returncode != 0:
    return {
        "status": "error",
        "message": f"Validador bloqueou a gravação: {result.stderr.strip()}",
        "file": None,
    }

Se o validador rejeitar, o arquivo não é criado. Ponto.

Isso não é paranoia — é arquitetura de produção. Qualquer sistema que usa LLM para gerar ações sobre infraestrutura precisa de um gatekeeper.

O que o vibe coding significa na prática

Vibe coding não é "deixar a AI fazer tudo". É uma forma específica de trabalho:

Você define o quê. A AI decide o como.

Em cada etapa, eu sabia o resultado que queria e agia como SRE, tomador de decisões:

"Quero que o startup verifique os serviços e suba os port-forwards ausentes"
"Quero análise agrupada por namespace com correlação de causa raiz"
"Quero um gatekeeper que bloqueie comandos destrutivos antes de gravar"

O agente implementava. Eu revisava, questionava, redirecionava.

O que me surpreendeu: em nenhum momento eu escrevi uma linha de código do Go agent, do dashboard, ou do harness. Mas cada decisão de arquitetura foi minha. A AI foi o meu software engineer sênior.

O que ficou claro sobre Claude Code

Pontos fortes:

Qualidade e consistência do código gerado, especialmente em Go
Raciocínio sobre segurança — headers HTTP corretos, sanitização, permissões de arquivo
Capacidade de manter contexto de projeto via CLAUDE.md
Sub-agents paralelos que genuinamente executam em paralelo
Capacidade de criação UI otimizada

Limitações reais:

Janela de contexto menor que concorrentes — projetos grandes exigem /compact frequente
Custo por token mais elevado — relevante em sessões longas
A instabilidade de tokens que aconteceu na semana passada mostrou que dependência total de um único provider é risco real

Timeline da jornada

Data	Marco
8 mar 2026	Certificado Claude Code in Action — Anthropic/Skilljar
8 mar 2026	v1.0: slash commands + Prometheus + K8s
9 mar 2026	v1.1: startup automático + múltiplos namespaces
Semana 2	Migração temporária para Gemini — restrição global de tokens Anthropic
Semana 2-3	v2.0: Go agent + PostgreSQL + dashboard + harness
Semana 3	Retorno ao Claude Code + renomeação para Sentinel

Onde está hoje

O projeto é open source, Apache 2.0:

👉 github.com/boccato85/Sentinel

Stack completa:

Minikube / Kubernetes v1.35.1
Go agent com client-go
PostgreSQL
Claude Code (OpenCode + Sonnet 4.6)

Se você trabalha com SRE, CloudOps ou FinOps e quer explorar Claude Code na prática, esse é um ponto de partida real — não um hello world.

Esse projeto faz parte de uma trilha pessoal: CKA → Claude Code → MLOps. Se quiser acompanhar, me segue aqui no dev.to.