Marcel Boccato

Posted on Apr 5 • Edited on Apr 16

Sentinel Diary #1: from the Anthropic certificate to a FinOps tool in days

#claudecode #kubernetes #devops #finops

This article is not a tutorial. It's a diary :)

I'm taking the CKA (Certified Kubernetes Administrator) course on KodeKloud and wanted to get started with agentic AI using Claude. I ended up with a Kubernetes observability and FinOps tool built on Go, PostgreSQL and a real-time dashboard, without having planned any of it.

Day 1: the certificate and the $5 credit

I decided to take the official Anthropic course on Skilljar: Claude Code in Action. Free, with a certificate, and it took a few hours to complete.

The course asks for an Anthropic Platform API key to run the examples. I created the key, spun up the local server, and got:

error: Your credit balance is too low to access the Anthropic API.

"But the course is free..."

Not exactly. The course is free. The API calls consume real credits. They're two separate things — and the platform doesn't make that clear enough. I bought $5 in credits, cleared cache and sessions, recreated the key, and it worked.

Lesson 1: capitalism always wins. But $5 goes a long way with Haiku, the entry-level model.

With the certificate in hand the same day, I moved on to practice.

Still Day 1: creating v1.0

The idea was simple: a Claude Code agent that would monitor a Kubernetes cluster and automatically generate runbooks. No manual code. Just directives.

I run Fedora Linux 43 (KDE) on an Acer Predator PHN16-72 laptop.

Installed Minikube, spun up kube-prometheus-stack via Helm:

helm install prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set grafana.adminPassword=admin123

Six pods coming up at once. Grafana with ready-made dashboards. Prometheus collecting real Minikube metrics.

Then I created the base structure with slash commands in Markdown inside .claude/commands/:

Command	Function
`/sentinel`	Main orchestrator
`/collect-metrics`	Sub-agent A — queries Prometheus via PromQL
`/analyze-pods`	Sub-agent B — checks pods via kubectl
`/correlate`	Sub-agent C — classifies severity

The CLAUDE.md became the agent's operational memory: endpoints, thresholds, namespaces, runbook template.
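For readers who have not used PromQL over HTTP, here is a minimal sketch of what "queries Prometheus via PromQL" boils down to. In Sentinel the slash command is a Markdown directive and Claude decides how to run the query, so this is not the project's code. The CPU expression is an assumption based on the standard node_exporter metric, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Local address of the Prometheus port-forward.
PROMETHEUS_URL = "http://localhost:9090"

# Assumed expression: node CPU usage in percent, from node_exporter metrics.
CPU_USAGE_EXPR = '100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'


def build_query_url(expr, base_url=PROMETHEUS_URL):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def first_sample_value(payload):
    """Extract the first sample of an instant-vector response as a float."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    results = payload["data"]["result"]
    if not results:
        return None
    # Each sample is a [timestamp, "value-as-string"] pair.
    _timestamp, value = results[0]["value"]
    return float(value)
```

Fetching `build_query_url(CPU_USAGE_EXPR)` with `urllib.request.urlopen` and passing the decoded JSON to `first_sample_value` would give a single percentage to compare against the thresholds kept in CLAUDE.md.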

Ran /sentinel for the first time and it generated this:

Severity: WARNING
CPU: 11.4% ✅ | Memory: 45.1% ✅ | Disk: 17.65% ✅
64 Warning events identified as residual from previous node restart
storage-provisioner: recent BackOff — requires monitoring

The agent separated noise from signal on its own. It identified that the 64 Warning events were residual from a Minikube reboot, not real anomalies. That wasn't in the prompt. It was the model's reasoning.

v1.0 on GitHub. Same day as the certificate.

Day 2: v1.1 and automatic startup

The biggest friction in v1.0 was operational: every time I opened Claude Code, I had to remember to manually spin up three port-forwards before running any command.

I created /startup — an agent that checks whether Prometheus, Grafana and AlertManager are accessible and only starts the missing port-forwards, in the background, with up to 10 retries:

╔══════════════════════════════════════════╗
║     Sentinel — Startup                   ║
╚══════════════════════════════════════════╝
 Prometheus    (localhost:9090)  →  ✅ STARTED
 Grafana       (localhost:3000)  →  ✅ STARTED
 AlertManager  (localhost:9093)  →  ✅ STARTED
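The check behind /startup can be sketched in a few lines. This is an illustration under assumptions, not the command itself (which is a Markdown directive): the ports come from the banner above, while the function names and the one-second delay are hypothetical.

```python
import socket
import time

# Ports taken from the startup banner; the names are labels only.
SERVICES = {"Prometheus": 9090, "Grafana": 3000, "AlertManager": 9093}


def is_port_open(port, host="localhost", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def missing_forwards(services, probe=is_port_open):
    """Return the names of services whose local port is not reachable."""
    return [name for name, port in services.items() if not probe(port)]


def wait_for_port(port, attempts=10, delay=1.0, probe=is_port_open, sleep=time.sleep):
    """Probe a port up to `attempts` times, sleeping between tries."""
    for attempt in range(attempts):
        if probe(port):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Only the services returned by `missing_forwards(SERVICES)` would need a new `kubectl port-forward`, and `wait_for_port` covers the retry loop of up to 10 attempts once the forward has been launched in the background.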

I asked Claude to expand /analyze-pods to monitor multiple namespaces in parallel: default, monitoring and kube-system — with results grouped by namespace and cross-namespace root cause correlation.

The result looked like this:

Namespace	Total	Unhealthy
default	0	0
monitoring	6	0 (all Running)
kube-system	8	0 (all Running)

The agent also identified storage-provisioner with 21 restarts versus an average of 8 for other pods — flagging the genuinely anomalous component with no explicit instruction to do so.
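The comparison the agent made can be written down as a simple rule of thumb. The sketch below is my own reading of it, not something the agent was given: the 2x factor is an assumption, and the peer pod names and their individual counts are made up so that they average 8.

```python
from statistics import mean


def restart_outliers(restarts, factor=2.0):
    """Return pods whose restart count exceeds `factor` times the mean of the other pods."""
    flagged = []
    for pod, count in restarts.items():
        others = [c for name, c in restarts.items() if name != pod]
        if others and count > factor * mean(others):
            flagged.append(pod)
    return flagged


# 21 and the peer average of 8 come from the run above; the peers are illustrative.
sample = {"storage-provisioner": 21, "pod-a": 7, "pod-b": 8, "pod-c": 9}
print(restart_outliers(sample))
```

Comparing each pod against the mean of the others, rather than the overall mean, keeps a single noisy pod from raising the baseline it is measured against.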

v1.1 on GitHub. Day 2.

The Gemini week: when Claude Code went down

The following week, I had a hard time using Claude, whether on the web or in the terminal. Tokens were being consumed much faster, even when following Anthropic's own best practices. Complaints started piling up on Reddit, X and Stack Overflow. Apparently demand was higher than Anthropic could absorb. I'm on the $20 Pro plan, which I know isn't much, but even for studying it became unworkable.

I migrated to Gemini. It worked for a good while — and that was the period when Sentinel started gaining its more complex features: the Go agent, PostgreSQL integration, the dashboard.

But as the project grew, problems appeared:

Dashboard UI: Gemini generated functional HTML but with visual inconsistencies I had to fix manually
Code security: I briefly used Claude Code as a code reviewer and found patterns in the generated code that worried me: missing input validation, absent security headers, and SQL queries without proper sanitization

When Claude Code came back (Anthropic granted roughly $22 in additional credit), I switched back. The cost per session is higher and the context window is smaller, but the quality and reliability of the generated code, especially on security matters, make it worth it.

Sentinel was born in Gemini. But it grew up in Claude Code.

v2.0: when it became a real tool

The turning point came with a question: "what if I had the data already collected before calling Claude?"

I wanted a real-time platform and to dive headfirst into vibe coding.

The slash commands were querying Prometheus and kubectl in real time on every execution. That worked, but it was slow and stateless — no history, no trends, no real FinOps.

The solution: a Go agent running continuously in the background.

// Collects every 10 seconds
// Persists to PostgreSQL
// Exposes REST API for Claude to consume

Using client-go directly, the agent collects CPU, memory and waste metrics per pod, persists them to PostgreSQL in batch transactions and exposes three endpoints:

Endpoint	Description
`GET /api/summary`	Cluster state: nodes, pods, CPU
`GET /api/metrics`	Per-pod metrics: CPU usage, waste
`GET /api/history`	Cost history for the last 30 min

Claude Code started consuming these endpoints via /incident instead of querying directly. Real layer separation: Go collects, Claude analyzes.
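This diary does not spell out how "waste" is computed. One common FinOps reading is CPU requested minus CPU actually used, and the sketch below assumes exactly that. The dictionary keys are hypothetical and are not the schema returned by /api/metrics.

```python
def cpu_waste(pods):
    """Return (pod name, wasted millicores) pairs, largest waste first.

    Each pod is a dict with hypothetical keys: name, cpu_request_m, cpu_usage_m.
    Waste is assumed to be request minus usage, floored at zero.
    """
    rows = []
    for pod in pods:
        wasted = max(pod["cpu_request_m"] - pod["cpu_usage_m"], 0)
        rows.append((pod["name"], wasted))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up values, for illustration only.
pods = [
    {"name": "worker", "cpu_request_m": 200, "cpu_usage_m": 250},
    {"name": "api", "cpu_request_m": 500, "cpu_usage_m": 120},
]
print(cpu_waste(pods))
```

Flooring at zero means a pod running above its request is treated as a sizing problem of a different kind, not as negative waste.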

The dashboard has four panels: a Node Health Map (honeycomb), Pod Distribution (donut chart), Waste Intelligence with savings opportunities per pod, and a Financial ROI Timeline with Budget vs Actual over the last 30 minutes.

The harness: treating LLM output as untrusted input

Between LinkedIn posts, my brother sent me an interesting article about Harness Engineering: basically, building security and reliability around the model, with proper infrastructure, security reviews, constant ReAct-style loops (Reason, Act) and feedback.

Article: Harness Engineering — Martin Fowler

One important architectural decision put the concept into practice: harness/validador_saida.py.

Every report generated by Claude Code goes through a gatekeeper before being written to disk. The validator blocks:

Rule	Blocked examples
Destructive commands	`rm -rf`, `kubectl delete`, `DROP TABLE`, fork bomb
Required structure	Reports without `## Executive Summary` are rejected
Minimum size	Content under 100 chars is rejected

# A non-zero exit code from the validator process means the report was rejected.
if result.returncode != 0:
    return {
        "status": "error",
        "message": f"Validator blocked the write: {result.stderr.strip()}",
        "file": None,
    }

If the validator rejects it, the file is not created. Period.
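To make the three rules concrete, here is a minimal sketch of what such a validator could look like. It is not the contents of harness/validador_saida.py: the regular expressions, constant names and function name are my assumptions, built only from the rules in the table.

```python
import re

# Patterns for the destructive commands listed in the rules table (assumed regexes).
DESTRUCTIVE_PATTERNS = [
    re.compile(r"\brm\s+-rf\b"),
    re.compile(r"\bkubectl\s+delete\b"),
    re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
    re.compile(r":\(\)\s*\{\s*:\|:&\s*\}\s*;\s*:"),  # classic bash fork bomb
]
REQUIRED_HEADING = re.compile(r"^## Executive Summary\b", re.MULTILINE)
MIN_LENGTH = 100


def validate_report(text):
    """Return a list of rule violations; an empty list means the report passes."""
    problems = []
    if len(text) < MIN_LENGTH:
        problems.append(f"content shorter than {MIN_LENGTH} characters")
    if not REQUIRED_HEADING.search(text):
        problems.append("missing '## Executive Summary' section")
    for pattern in DESTRUCTIVE_PATTERNS:
        if pattern.search(text):
            problems.append(f"destructive command matched: {pattern.pattern}")
    return problems
```

A command-line wrapper around `validate_report` would print the violations to stderr and exit non-zero, which is the signal the `returncode` check above appears to rely on. Pattern matching like this is a coarse filter: it catches the obvious cases, not every destructive command.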

This isn't paranoia — it's production architecture. Any system that uses an LLM to generate actions on infrastructure needs a gatekeeper.

What vibe coding means in practice

Vibe coding isn't "letting AI do everything". It's a specific way of working:

You define the what. The AI decides the how.

At each step, I knew the result I wanted and acted as the SRE, the decision maker:

"I want startup to check services and spin up missing port-forwards"
"I want namespace-grouped analysis with cross-namespace root cause correlation"
"I want a gatekeeper that blocks destructive commands before writing"

The agent implemented. I reviewed, questioned, redirected.

What surprised me: at no point did I write a single line of the Go agent, the dashboard, or the harness. But every architecture decision was mine. The AI was my senior software engineer.

What became clear about Claude Code

Strengths:

Quality and consistency of generated code, especially in Go
Security reasoning — correct HTTP headers, sanitization, file permissions
Ability to maintain project context via CLAUDE.md
Parallel sub-agents that genuinely execute in parallel
Polished UI generation, with fewer visual inconsistencies to fix by hand

Real limitations:

Smaller context window than competitors — large projects require frequent /compact
Higher cost per token — relevant in long sessions
The token instability from that week showed that total dependency on a single provider is a real risk

Journey timeline

Date	Milestone
Mar 8, 2026	Claude Code in Action certificate — Anthropic/Skilljar
Mar 8, 2026	v1.0: slash commands + Prometheus + K8s
Mar 9, 2026	v1.1: automatic startup + multiple namespaces
Week 2	Temporary migration to Gemini — global Anthropic token restriction
Week 2-3	v2.0: Go agent + PostgreSQL + dashboard + harness
Week 3	Return to Claude Code + renamed to Sentinel

Where it stands today

The project is open source, Apache 2.0:

👉 github.com/boccato85/Sentinel

Full stack:

Minikube / Kubernetes v1.35.1
Go agent with client-go
PostgreSQL
Claude Code (OpenCode + Sonnet 4.6)

If you work in SRE, CloudOps or FinOps and want to explore Claude Code in practice, this is a real starting point, not a hello world.

This project is part of a personal track: CKA → Claude Code → MLOps. Follow along here on dev.to.
