<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Marcel Boccato</title>
    <description>The latest articles on DEV Community by Marcel Boccato (@boccato85).</description>
    <link>https://dev.to/boccato85</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3862790%2Fbbc24de9-b0ae-4216-8c84-8997d8020c3f.jpeg</url>
      <title>DEV Community: Marcel Boccato</title>
      <link>https://dev.to/boccato85</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/boccato85"/>
    <language>en</language>
    <item>
      <title>Sentinel Diary #4: From Dashboard to Incident Response — The deterministic path to reliable SRE</title>
      <dc:creator>Marcel Boccato</dc:creator>
      <pubDate>Thu, 16 Apr 2026 21:43:57 +0000</pubDate>
      <link>https://dev.to/boccato85/-sentinel-diary-4-from-dashboard-to-incident-response-the-deterministic-path-to-reliable-sre-4b0f</link>
      <guid>https://dev.to/boccato85/-sentinel-diary-4-from-dashboard-to-incident-response-the-deterministic-path-to-reliable-sre-4b0f</guid>
      <description>&lt;h3&gt;
  
  
  Context: The "Vibe Coding" Evolution
&lt;/h3&gt;

&lt;p&gt;We are currently at &lt;strong&gt;v0.10.20&lt;/strong&gt;. Looking back at the last post, we were celebrating the FinOps module. Since then, the project has undergone a significant architectural shift.&lt;/p&gt;

&lt;p&gt;My development stack evolved: I started with &lt;strong&gt;Claude Code&lt;/strong&gt;, which generated the original monolith (&lt;code&gt;main.go&lt;/code&gt; reaching ~2,200 lines). I then used &lt;strong&gt;Gemini 3.1 Pro&lt;/strong&gt; to execute a massive refactoring, decomposing the monolith into a clean &lt;code&gt;pkg/&lt;/code&gt; structure (api, k8s, store, incidents). Finally, I integrated &lt;strong&gt;Minimax 2.7&lt;/strong&gt; (via Opencode) to push from v0.10.17 to v0.10.20, building the new "no-scroll" dashboard. I continue to use &lt;strong&gt;Gemini CLI&lt;/strong&gt; as my core orchestration layer. The result? Higher velocity, better code structure, and a dashboard that finally feels like an SRE tool, not a prototype.&lt;/p&gt;




&lt;h3&gt;
  
  
  M3: Deterministic Incident Intelligence
&lt;/h3&gt;

&lt;p&gt;The dashboard was excellent for viewing costs, but I realized it was "read-only." It showed the state, but it didn't &lt;em&gt;detect&lt;/em&gt; issues. I needed it to assist the operator, not just display data.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Code Refactoring (The end of the "God Object")
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;main.go&lt;/code&gt; file had grown to 2,282 lines, a single file carrying every responsibility. I guided the Gemini 3.1 Pro agent to refactor it into dedicated packages: &lt;code&gt;pkg/api&lt;/code&gt;, &lt;code&gt;pkg/k8s&lt;/code&gt;, &lt;code&gt;pkg/store&lt;/code&gt;, and &lt;code&gt;pkg/incidents&lt;/code&gt;. It now sits at a lean ~220 lines.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Deterministic Detection vs. LLM Hype
&lt;/h4&gt;

&lt;p&gt;I wanted the system to be useful even without an LLM. I implemented &lt;code&gt;/api/incidents&lt;/code&gt; using pure deterministic logic. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;The logic:&lt;/strong&gt; Correlating a &lt;code&gt;CrashLoopBackOff&lt;/code&gt; status with a spike in CPU usage.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The severity:&lt;/strong&gt; We now inject a &lt;code&gt;severity&lt;/code&gt; field, allowing the dashboard to prioritize what matters.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The UI:&lt;/strong&gt; The dashboard is now "no-scroll" and event-driven. You can toggle between FinOps and SRE views without leaving the context.&lt;/li&gt;
&lt;/ul&gt;
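&lt;p&gt;The rules behind &lt;code&gt;/api/incidents&lt;/code&gt; can be sketched in Go. The types, field names, and the 2x spike threshold below are illustrative assumptions, not Sentinel's actual code; the point is that detection and severity need nothing but arithmetic:&lt;/p&gt;

```go
package main

import "fmt"

// PodSample is an illustrative snapshot of what the collector already stores.
type PodSample struct {
	Namespace, Pod   string
	Status           string  // e.g. "Running", "CrashLoopBackOff"
	CPUUsageMilli    float64 // latest reading
	CPUBaselineMilli float64 // rolling average from history
}

// Incident mirrors the idea of /api/incidents: deterministic, no LLM involved.
type Incident struct {
	Pod      string
	Reason   string
	Severity string // injected so the dashboard can prioritize
}

// detectIncidents applies pure rules: a CrashLoopBackOff alone is a warning;
// correlated with a CPU spike (here, 2x the baseline) it escalates to critical.
func detectIncidents(samples []PodSample) []Incident {
	var out []Incident
	for _, s := range samples {
		if s.Status != "CrashLoopBackOff" {
			continue
		}
		inc := Incident{Pod: s.Pod, Reason: "CrashLoopBackOff", Severity: "warning"}
		if s.CPUBaselineMilli > 0 {
			if s.CPUUsageMilli > 2*s.CPUBaselineMilli {
				inc.Reason = "CrashLoopBackOff correlated with CPU spike"
				inc.Severity = "critical"
			}
		}
		out = append(out, inc)
	}
	return out
}

func main() {
	incidents := detectIncidents([]PodSample{
		{Pod: "checkout-7d9f", Status: "CrashLoopBackOff", CPUUsageMilli: 480, CPUBaselineMilli: 90},
		{Pod: "frontend-5c2a", Status: "Running", CPUUsageMilli: 40, CPUBaselineMilli: 35},
	})
	for _, i := range incidents {
		fmt.Println(i.Pod, i.Severity, i.Reason)
	}
}
```

&lt;p&gt;Because the logic is pure, it is trivially unit-testable, which is what lets the severity labels be trusted at 2am.&lt;/p&gt;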

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fboccato85%2FSentinel%2Fraw%2Fmain%2Fdocs%2Fscreenshots%2Fsentinel_ss_0.10.20%281%29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fboccato85%2FSentinel%2Fraw%2Fmain%2Fdocs%2Fscreenshots%2Fsentinel_ss_0.10.20%281%29.png" alt="Dashboard layout" width="800" height="458"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fboccato85%2FSentinel%2Fraw%2Fmain%2Fdocs%2Fscreenshots%2Fsentinel_ss_0.10.20%283%29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fboccato85%2FSentinel%2Fraw%2Fmain%2Fdocs%2Fscreenshots%2Fsentinel_ss_0.10.20%283%29.png" alt="Incidents drawer" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  M4: Security Hardening (WIP)
&lt;/h3&gt;

&lt;p&gt;Observability is a liability if your tool is an attack vector. We are currently executing Milestone 4, focusing on hardening the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;DOM-based XSS:&lt;/strong&gt; During a security audit via CodeQL, I received high-severity alerts indicating the dashboard was vulnerable to XSS due to dynamic rendering via &lt;code&gt;innerHTML&lt;/code&gt;. I instructed the AI to integrate &lt;strong&gt;DOMPurify&lt;/strong&gt; to sanitize inputs before rendering.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Data Resilience (In progress):&lt;/strong&gt; We are migrating from &lt;code&gt;emptyDir&lt;/code&gt; to &lt;code&gt;PersistentVolumeClaim&lt;/code&gt; (PVC) for PostgreSQL, ensuring pod restarts no longer result in data loss.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;CI/CD Pipeline (Completed):&lt;/strong&gt; I implemented GitHub Actions. Every &lt;code&gt;push&lt;/code&gt; or &lt;code&gt;pull_request&lt;/code&gt; now triggers &lt;code&gt;go test&lt;/code&gt; and &lt;code&gt;helm lint&lt;/code&gt;. If the build is red, it does not merge.&lt;/li&gt;
&lt;/ul&gt;
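&lt;p&gt;For reference, a minimal workflow of that shape. The file path, action versions, and chart location are my assumptions; the real pipeline may differ:&lt;/p&gt;

```yaml
# .github/workflows/ci.yml (illustrative)
name: ci
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: 'stable'
      - run: go test ./...
      - uses: azure/setup-helm@v4
      - run: helm lint charts/sentinel
```

&lt;p&gt;Marking the job as a required status check in branch protection is what actually enforces the "red build does not merge" rule.&lt;/p&gt;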




&lt;h3&gt;
  
  
  Lessons Learned
&lt;/h3&gt;

&lt;p&gt;"If the LLM goes down, Sentinel stays useful."&lt;/p&gt;

&lt;p&gt;The deterministic-first approach is not just a design choice; it is a necessity for SRE tools. I observed that agents (like Minimax/Gemini) are brilliant, but they shouldn't be the central nervous system of your reliability tool. They should be the "specialist on call"—highly valuable, but not required for the system to remain upright.&lt;/p&gt;

&lt;h3&gt;
  
  
  Current Cluster State
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Grade&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Efficiency&lt;/td&gt;
&lt;td&gt;A+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;✅ (Patched)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tests&lt;/td&gt;
&lt;td&gt;25 (Go) + 16 (Harness)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Next steps:&lt;/strong&gt; Prepare for M7, the real-world lab with Online Boutique (Chaos Engineering).&lt;/p&gt;





&lt;p&gt;Repository: &lt;a href="https://github.com/boccato85/Sentinel" rel="noopener noreferrer"&gt;https://github.com/boccato85/Sentinel&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>go</category>
      <category>sre</category>
    </item>
    <item>
      <title>Sentinel Diary #3: From Information to Action — When the Dashboard Learned to Think</title>
      <dc:creator>Marcel Boccato</dc:creator>
      <pubDate>Sun, 12 Apr 2026 18:06:36 +0000</pubDate>
      <link>https://dev.to/boccato85/sentinel-diary-3-from-information-to-action-when-the-dashboard-learned-to-think-123j</link>
      <guid>https://dev.to/boccato85/sentinel-diary-3-from-information-to-action-when-the-dashboard-learned-to-think-123j</guid>
      <description>&lt;h1&gt;
  
  
  Sentinel Diary #3: From Information to Action — When the Dashboard Learned to Think
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;A vibe coding journey: building a Kubernetes FinOps platform from scratch, one conversation at a time.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When I published Diary #2, the dashboard was finally telling the truth. The bugs were fixed, the data was real, the version badge was glowing cyan on hover. It felt like a finished thing.&lt;/p&gt;

&lt;p&gt;It wasn't. It was a read-only mirror of a cluster.&lt;/p&gt;

&lt;p&gt;Diary #3 is the story of turning that mirror into a tool — the session where Sentinel stopped showing data and started helping me act on it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where we left off
&lt;/h2&gt;

&lt;p&gt;At v0.7.3, Sentinel had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Go agent collecting metrics every ~10s&lt;/li&gt;
&lt;li&gt;PostgreSQL storing raw + hourly + daily aggregates&lt;/li&gt;
&lt;li&gt;A dashboard with cost timeline, pod health, CPU utilization&lt;/li&gt;
&lt;li&gt;22 automated tests&lt;/li&gt;
&lt;li&gt;Zero authentication (honest versioning: still &lt;code&gt;0.x&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Online Boutique (12 Google microservices) was already deployed in the &lt;code&gt;google-demo&lt;/code&gt; namespace, waiting. Twenty-four pods. Real workload distribution. Real waste candidates.&lt;/p&gt;

&lt;p&gt;I just couldn't &lt;em&gt;do&lt;/em&gt; anything about them from the dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fih47x86v6tbfzoc8h01r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fih47x86v6tbfzoc8h01r.png" alt="Sentinel before — v0.7.x dashboard" width="800" height="456"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The "before" — a beautiful read-only report.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  v0.10.0 — The forecast that scared me
&lt;/h2&gt;

&lt;p&gt;Before visual work, I wanted the dashboard to answer a question I kept asking manually: &lt;em&gt;"if this cluster runs through the weekend, how much will I spend?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I spec'd out the requirement: linear regression over historical cost data, with confidence bands. No external dependencies — pure Go. I handed it to Claude, and the result was &lt;code&gt;/api/forecast&lt;/code&gt;: a projection endpoint with ±1.5σ confidence bands.&lt;/p&gt;

&lt;p&gt;The chart came back with a dashed purple budget line, a cyan usage line, shaded confidence regions, and a projected waste card below. It looked like something from a Bloomberg terminal.&lt;/p&gt;

&lt;p&gt;Then I looked at the numbers.&lt;/p&gt;

&lt;p&gt;Projected waste: &lt;strong&gt;67% of budget&lt;/strong&gt;. Of every dollar spent on this cluster, sixty-seven cents went to pods with requests set far above actual consumption.&lt;/p&gt;

&lt;p&gt;The forecast didn't tell me something I didn't know. It told me something I knew but hadn't &lt;em&gt;seen&lt;/em&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  v0.10.1 — Closing M1
&lt;/h2&gt;

&lt;p&gt;Before going further with UI, I closed Milestone 1 properly. I had a checklist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/health&lt;/code&gt; endpoint with DB and collector status checks&lt;/li&gt;
&lt;li&gt;Structured logging with &lt;code&gt;slog&lt;/code&gt; (consistent fields across all components)&lt;/li&gt;
&lt;li&gt;Thresholds loaded from &lt;code&gt;config/thresholds.yaml&lt;/code&gt; via ConfigMap (no hardcoded values)&lt;/li&gt;
&lt;li&gt;Version badge reading dynamically from &lt;code&gt;/health&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Fallback data for long ranges (30d/90d/1y)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Claude implemented all of it in a single session.&lt;/p&gt;

&lt;p&gt;M1 criterion: &lt;em&gt;"Sentinel collects, persists, calculates waste, and reports its own health without manual intervention."&lt;/em&gt; ✅&lt;/p&gt;


&lt;h2&gt;
  
  
  The layout problem
&lt;/h2&gt;

&lt;p&gt;By v0.10.3, I had a confession to make to the dashboard.&lt;/p&gt;

&lt;p&gt;It was working. Every metric was real. But it was &lt;strong&gt;ugly in a specific way&lt;/strong&gt;: information arranged like a report, not like a tool. Everything equal weight. No hierarchy. No "look here first."&lt;/p&gt;

&lt;p&gt;I spent the next few versions doing something I rarely do consciously: thinking about information architecture before writing a single directive.&lt;/p&gt;

&lt;p&gt;The question wasn't "what data do we have?" It was "when someone opens this at 2am during an incident, where should their eyes go first?"&lt;/p&gt;

&lt;p&gt;Answer: KPIs. Then cluster health. Then cost. Then details.&lt;/p&gt;


&lt;h2&gt;
  
  
  v0.10.4–v0.10.8 — The great layout rework
&lt;/h2&gt;

&lt;p&gt;Version by version, I described what I needed and Claude shaped the layout:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;v0.10.4&lt;/strong&gt;: I wanted a dedicated Memory tile — a visual showing requested vs allocatable memory, with a drawer that broke risk down by namespace. Claude built a purple donut with OOM risk breakdown.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;v0.10.5&lt;/strong&gt;: Per-tile namespace filters — each tile (Pods, CPU, Memory) needed its own independent &lt;code&gt;&amp;lt;select&amp;gt;&lt;/code&gt; so filtering one wouldn't break the others. Financial Correlation grew to full-width with an orange FinOps border. The drawer got an interactive period selector and sortable columns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;v0.10.6–v0.10.7&lt;/strong&gt; reorganized the grid — I drew the hierarchy on paper first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;row-4&lt;/code&gt;: Node Health | Pod Distribution | CPU (compact) | Memory (compact)&lt;/li&gt;
&lt;li&gt;Financial Correlation: full-width, immediately below&lt;/li&gt;
&lt;li&gt;Waste Intelligence: full-width with scroll, at the bottom&lt;/li&gt;
&lt;li&gt;Active Alerts tile: removed (an always-empty tile is worse than no tile)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;v0.10.8&lt;/strong&gt;: An animated alert badge in the header — green dot for "All OK", orange for warnings, red pulsing for critical. All six KPI cards clickable, each opening its respective drawer. The dead "Active Alerts" KPI replaced with "Top Memory Consumer" — the actually useful metric.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67jjm4hsld6t8pd2zbwb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67jjm4hsld6t8pd2zbwb.png" alt="Sentinel v0.10.12 — full dashboard overview" width="800" height="518"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The "after" — v0.10.12 with unified layout, forecast chart and Top Workloads panel.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  v0.10.9 — The bug that crashed silently
&lt;/h2&gt;

&lt;p&gt;During testing, I noticed the KPI cards were showing &lt;code&gt;--&lt;/code&gt; for values. Not an error. Not a console warning. Just dashes.&lt;/p&gt;

&lt;p&gt;I flagged it to Claude, who traced it to a &lt;code&gt;ReferenceError&lt;/code&gt; in &lt;code&gt;updateOverview()&lt;/code&gt;. The code was doing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;pods&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But &lt;code&gt;/api/summary&lt;/code&gt; doesn't return a &lt;code&gt;pods&lt;/code&gt; array at all. It returns &lt;code&gt;podsByPhase&lt;/code&gt;, &lt;code&gt;failedPods&lt;/code&gt;, &lt;code&gt;pendingPods&lt;/code&gt;. The variable &lt;code&gt;pods&lt;/code&gt; didn't exist.&lt;/p&gt;

&lt;p&gt;The error was thrown, silently swallowed by the outer &lt;code&gt;try/catch&lt;/code&gt;, and execution stopped before updating &lt;code&gt;kT&lt;/code&gt;, &lt;code&gt;kMem&lt;/code&gt;, &lt;code&gt;kW&lt;/code&gt; — all the KPI values. They stayed at &lt;code&gt;--&lt;/code&gt; from initialization.&lt;/p&gt;

&lt;p&gt;Claude extracted &lt;code&gt;updatePodsAllNsTile()&lt;/code&gt; — a new async function that fetches &lt;code&gt;/api/pods&lt;/code&gt; separately, groups by namespace, and renders a namespace-distribution donut instead of the broken phase breakdown.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Silent failures are the worst kind. At least a loud crash tells you where to look.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  v0.10.10 — The column that was always zero
&lt;/h2&gt;

&lt;p&gt;The Memory drawer had a "Mem Request" column. It showed &lt;code&gt;N/A&lt;/code&gt; for every pod.&lt;/p&gt;

&lt;p&gt;I queried the DB directly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;mem_request&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
 &lt;span class="c1"&gt;-- 0&lt;/span&gt;
 &lt;span class="c1"&gt;-- 0&lt;/span&gt;
 &lt;span class="c1"&gt;-- 0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every row. Zero.&lt;/p&gt;

&lt;p&gt;Four versions back, when the DB INSERT was written, &lt;code&gt;mem_request&lt;/code&gt; was hardcoded to &lt;code&gt;0&lt;/code&gt;. The struct field existed, the column existed, the frontend expected data — but real values were never being written.&lt;/p&gt;

&lt;p&gt;I described the fix to Claude: collect memory requests per pod during the collection cycle and use those real values in the INSERT. Claude built &lt;code&gt;podMemRequestMap[namespace][pod]&lt;/code&gt;, summing memory requests across all containers. The INSERT now uses the real value.&lt;/p&gt;
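&lt;p&gt;The shape of the fix is simple. This sketch uses pared-down stand-in types (the real code walks the Kubernetes pod spec via client-go):&lt;/p&gt;

```go
package main

import "fmt"

const MiB int64 = 1048576 // bytes per mebibyte

// Container and Pod are stand-ins for the client-go spec types.
type Container struct {
	MemRequestBytes int64
}

type Pod struct {
	Namespace, Name string
	Containers      []Container
}

// buildMemRequestMap sums memory requests across all containers of each pod,
// keyed as [namespace][pod]: the value the INSERT had been hardcoding to 0.
func buildMemRequestMap(pods []Pod) map[string]map[string]int64 {
	m := make(map[string]map[string]int64)
	for _, p := range pods {
		if m[p.Namespace] == nil {
			m[p.Namespace] = make(map[string]int64)
		}
		var total int64
		for _, c := range p.Containers {
			total += c.MemRequestBytes
		}
		m[p.Namespace][p.Name] = total
	}
	return m
}

func main() {
	m := buildMemRequestMap([]Pod{
		{Namespace: "google-demo", Name: "cartservice-abc",
			Containers: []Container{{64 * MiB}, {32 * MiB}}},
	})
	fmt.Println(m["google-demo"]["cartservice-abc"] / MiB) // 96
}
```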

&lt;p&gt;Historical data stays zero — it's already written. But every new collection has the right number. A migration would fix history; I decided to let time heal it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzj0roax75gu93x9ma3a2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzj0roax75gu93x9ma3a2.png" alt="FinOps drawer — Financial Correlation detail" width="800" height="518"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;FinOps drawer: sortable history table with Budget, Actual, Waste and Waste% columns.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zns7uev0tc3qrk3wwlp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zns7uev0tc3qrk3wwlp.png" alt="Memory Resource drawer — OOM Risk breakdown" width="800" height="517"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Memory drawer: per-namespace breakdown with OOM risk indicator per pod.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  v0.10.11–v0.10.12 — From display to decision
&lt;/h2&gt;

&lt;p&gt;This is the part I'm most proud of.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;v0.10.11&lt;/strong&gt;: I wanted a tooltip on the "Connected" badge — hover to see cluster health at a glance without opening any drawer. Claude built a card showing Cluster, Endpoint, Version, Session uptime, Last sync, and Database status. Small detail. High signal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;v0.10.12&lt;/strong&gt;: I wanted to merge Waste Intelligence and Top Workloads into a single action-oriented panel: &lt;strong&gt;"Top Workloads — CPU &amp;amp; Waste Analysis"&lt;/strong&gt;. But the real ask was making pod names clickable.&lt;/p&gt;

&lt;p&gt;I defined the interaction: click a pod name → drawer opens with current usage, request, a utilization bar, and a concrete rightsizing recommendation. Claude built it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; kube-apiserver-minikube          sentinel          ⚠ Overprovisioned

 CPU Usage / Request              42m / 250m
 ████████░░░░░░░░░░░░░░░░░░░░    16.8%

 Memory Usage / Request           312 Mi / No request set

 ⚠ Savings Opportunity
 Potential CPU savings: -208m (83%)
 CPU request is significantly higher than actual usage.
 Consider reducing resources.requests.cpu to ~51m.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The number &lt;code&gt;~51m&lt;/code&gt; comes from &lt;code&gt;ceil(actualUsage × 1.2)&lt;/code&gt; — a 20% headroom buffer calculated at draw time. Not a generic recommendation. A concrete one, specific to that pod, at that moment.&lt;/p&gt;
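&lt;p&gt;The whole recommendation is two small functions. This is my reconstruction from the numbers shown in the drawer (the function names are mine):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
)

// recommendCPUMilli suggests resources.requests.cpu in millicores:
// actual usage plus a 20% headroom buffer, rounded up at draw time.
func recommendCPUMilli(usageMilli float64) int {
	return int(math.Ceil(usageMilli * 1.2))
}

// savings reports how much of the current request exceeds actual usage,
// in millicores and as a percentage of the request.
func savings(usageMilli, requestMilli float64) (milli, pct int) {
	saved := requestMilli - usageMilli
	if saved > 0 {
		return int(saved), int(math.Round(saved / requestMilli * 100))
	}
	return 0, 0
}

func main() {
	// The kube-apiserver example from the drawer: 42m used of a 250m request.
	fmt.Println(recommendCPUMilli(42)) // 51
	m, p := savings(42, 250)
	fmt.Printf("Potential CPU savings: -%dm (%d%%)\n", m, p) // -208m (83%)
}
```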

&lt;p&gt;Rows with waste are highlighted in amber. Rightsized pods get a green checkmark. The table became a prioritized action list.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftv393vb3lcewm29gnisw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftv393vb3lcewm29gnisw.png" alt="Pod Detail — Waste Analysis drawer" width="800" height="518"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The star of the show: click any pod name to get a concrete rightsizing recommendation.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Data without action is just reporting.&lt;/strong&gt; For the first three months of this project, Sentinel was a very nice report. The forecast was beautiful. The donuts were pretty. But you couldn't do anything &lt;em&gt;from&lt;/em&gt; the dashboard — you had to write it down, open a terminal, and kubectl edit something.&lt;/p&gt;

&lt;p&gt;The pod detail drawer is the first time Sentinel gives you a number you can directly use. That's a different category of tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Silent failures compound.&lt;/strong&gt; The &lt;code&gt;pods.forEach&lt;/code&gt; bug, the &lt;code&gt;mem_request = 0&lt;/code&gt; bug, the Database &lt;code&gt;--&lt;/code&gt; in the tooltip — none of them threw visible errors. They all degraded silently. I need better observability on the dashboard itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layout is product thinking.&lt;/strong&gt; I spent more time this session defining information hierarchy than requesting new features. That felt wasteful in the moment. In retrospect, a dashboard where your eyes know where to go is worth more than a dashboard with more features.&lt;/p&gt;


&lt;h2&gt;
  
  
  State of the cluster (v0.10.12)
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nodes:    1 (minikube) — Running
Pods:     24 Running (sentinel + google-demo namespaces)
CPU:      32.8% allocated
Waste:    20 pods with savings opportunities
DB:       ✓ OK
Version:  v0.10.12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;The roadmap points to M2 and M3:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Efficiency score per namespace&lt;/strong&gt; — not just "which pods waste" but "which namespace is worst"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/api/incidents&lt;/code&gt;&lt;/strong&gt; — deterministic violation detection without LLM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Online Boutique lab&lt;/strong&gt; — baseline → load → chaos → comparison (the post I promised in #2)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And eventually: auth. Because a dashboard with no auth is a tool that trusts everyone in the room.&lt;/p&gt;



&lt;p&gt;&lt;em&gt;Sentinel is open-source and honestly versioned. Still &lt;code&gt;0.x&lt;/code&gt;. Getting closer.&lt;/em&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Sentinel Diary #3: De Informação para Ação — Quando o Dashboard Aprendeu a Pensar
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Uma jornada de vibe coding: construindo uma plataforma FinOps para Kubernetes do zero, uma conversa por vez.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Quando publiquei o Diary #2, o dashboard finalmente estava dizendo a verdade. Os bugs tinham sido corrigidos, os dados eram reais, o badge de versão brilhava em cyan no hover. Parecia uma coisa pronta.&lt;/p&gt;

&lt;p&gt;Não estava. Era um espelho somente leitura de um cluster.&lt;/p&gt;

&lt;p&gt;O Diary #3 é a história de transformar esse espelho numa ferramenta — a sessão em que o Sentinel parou de mostrar dados e começou a me ajudar a agir sobre eles.&lt;/p&gt;


&lt;h2&gt;
  
  
  De onde paramos
&lt;/h2&gt;

&lt;p&gt;No v0.7.3, o Sentinel tinha:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Um agente Go coletando métricas a cada ~10s&lt;/li&gt;
&lt;li&gt;PostgreSQL armazenando dados raw + hourly + daily&lt;/li&gt;
&lt;li&gt;Dashboard com timeline de custo, saúde de pods, utilização de CPU&lt;/li&gt;
&lt;li&gt;22 testes automatizados&lt;/li&gt;
&lt;li&gt;Zero autenticação (versionamento honesto: ainda &lt;code&gt;0.x&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;O Online Boutique (12 microsserviços do Google) já estava deployado no namespace &lt;code&gt;google-demo&lt;/code&gt;, esperando. Vinte e quatro pods. Distribuição real de workload. Candidatos reais a rightsizing.&lt;/p&gt;

&lt;p&gt;Eu só não conseguia &lt;em&gt;fazer nada&lt;/em&gt; a respeito deles a partir do dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fih47x86v6tbfzoc8h01r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fih47x86v6tbfzoc8h01r.png" alt="Sentinel before — v0.7.x dashboard" width="800" height="456"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The "before" — a pretty report, but read-only.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  v0.10.0 — The forecast that scared me
&lt;/h2&gt;

&lt;p&gt;Before any visual work, I wanted the dashboard to answer a question I kept asking manually: &lt;em&gt;"if this cluster runs through the weekend, how much will I spend?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I defined the requirement: linear regression over the historical cost data, with confidence bands. No external dependencies — pure Go. I handed the spec to Claude, and the result was &lt;code&gt;/api/forecast&lt;/code&gt;: a projection endpoint with ±1.5σ confidence bands.&lt;/p&gt;
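&lt;p&gt;The endpoint's math is small enough to sketch: ordinary least squares plus a band of ±1.5 residual standard deviations, in pure Go as the spec demanded. This is an illustrative reconstruction, not Sentinel's actual implementation.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
)

// forecast fits y = a + b*x by ordinary least squares and projects the
// value at x, with a ±1.5σ band from the residual standard deviation.
// Hypothetical sketch of the /api/forecast math, not Sentinel's code.
func forecast(xs, ys []float64, x float64) (mid, lo, hi float64) {
	n := float64(len(xs))
	var sx, sy, sxx, sxy float64
	for i := range xs {
		sx += xs[i]
		sy += ys[i]
		sxx += xs[i] * xs[i]
		sxy += xs[i] * ys[i]
	}
	b := (n*sxy - sx*sy) / (n*sxx - sx*sx) // slope
	a := (sy - b*sx) / n                   // intercept

	var ss float64 // sum of squared residuals around the fitted line
	for i := range xs {
		r := ys[i] - (a + b*xs[i])
		ss += r * r
	}
	sigma := math.Sqrt(ss / n)

	mid = a + b*x
	return mid, mid - 1.5*sigma, mid + 1.5*sigma
}

func main() {
	// hourly cost points (index, $/h); project 5 steps out
	mid, lo, hi := forecast([]float64{0, 1, 2, 3}, []float64{1, 2, 3, 4}, 5)
	fmt.Printf("%.2f [%.2f, %.2f]\n", mid, lo, hi)
}
```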

&lt;p&gt;The chart came back with a dashed purple budget line, a cyan usage line, shaded confidence regions and a projected-waste card below. It looked like something out of a Bloomberg terminal.&lt;/p&gt;

&lt;p&gt;Then I looked at the numbers.&lt;/p&gt;

&lt;p&gt;Projected waste: &lt;strong&gt;67% of budget&lt;/strong&gt;. Sixty-seven cents of every dollar spent on the cluster went to pods whose requests were configured far above actual consumption.&lt;/p&gt;

&lt;p&gt;The forecast didn't tell me something I didn't know. It told me something I knew but hadn't &lt;em&gt;seen&lt;/em&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  v0.10.1 — Closing out M1
&lt;/h2&gt;

&lt;p&gt;Before pushing on with the UI, I closed out Milestone 1 properly. I had a list of criteria:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;code&gt;/health&lt;/code&gt; endpoint with DB and collector status checks&lt;/li&gt;
&lt;li&gt;Structured logging with &lt;code&gt;slog&lt;/code&gt; (consistent fields across every component)&lt;/li&gt;
&lt;li&gt;Thresholds loaded from &lt;code&gt;config/thresholds.yaml&lt;/code&gt; via ConfigMap (no hardcoded values)&lt;/li&gt;
&lt;li&gt;Version badge reading dynamically from &lt;code&gt;/health&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Data fallback for long ranges (30d/90d/1y)&lt;/li&gt;
&lt;/ul&gt;
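&lt;p&gt;A minimal sketch of what the first two items can look like together; the field names and handler shape here are my assumptions, not Sentinel's actual schema.&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"log/slog"
	"net/http"
)

// buildHealthResponse assembles the /health payload. Field names are
// illustrative, not Sentinel's real schema.
func buildHealthResponse(dbOK, collectorOK bool, version string) map[string]any {
	return map[string]any{
		"version":   version,
		"database":  dbOK,
		"collector": collectorOK,
	}
}

// healthHandler serves the payload and logs a structured slog line
// with consistent field names, as the M1 checklist asks for.
func healthHandler(dbOK, collectorOK func() bool, version string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		resp := buildHealthResponse(dbOK(), collectorOK(), version)
		slog.Info("health check",
			"component", "api",
			"db_ok", resp["database"],
			"collector_ok", resp["collector"])
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(resp)
	}
}

func main() {
	http.Handle("/health", healthHandler(
		func() bool { return true }, // a real check would ping the DB
		func() bool { return true }, // ...and the collector loop
		"v0.10.1"))
	// http.ListenAndServe(":8080", nil) // left out so the sketch exits cleanly
}
```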

&lt;p&gt;Claude implemented all of it in a single session.&lt;/p&gt;

&lt;p&gt;M1 criterion: &lt;em&gt;"Sentinel collects, persists, computes waste and reports its own health without manual intervention."&lt;/em&gt; ✅&lt;/p&gt;


&lt;h2&gt;
  
  
  The layout problem
&lt;/h2&gt;

&lt;p&gt;At v0.10.3, I had a confession to make to the dashboard.&lt;/p&gt;

&lt;p&gt;It worked. Every metric was real. But it was &lt;strong&gt;ugly in a specific way&lt;/strong&gt;: information arranged like a report, not a tool. Everything had the same weight. No hierarchy. No "look here first."&lt;/p&gt;

&lt;p&gt;I spent the next few versions doing something I rarely do consciously: thinking about information architecture before writing a single directive.&lt;/p&gt;

&lt;p&gt;The question wasn't "what data do we have?" It was "when someone opens this at 2 a.m. during an incident, where should their eyes go first?"&lt;/p&gt;

&lt;p&gt;Answer: KPIs. Then cluster health. Then cost. Then details.&lt;/p&gt;


&lt;h2&gt;
  
  
  v0.10.4–v0.10.8 — The big layout restructuring
&lt;/h2&gt;

&lt;p&gt;Version by version, I described what I needed and Claude shaped the layout:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;v0.10.4&lt;/strong&gt;: I wanted a dedicated Memory tile — a visual showing requested vs allocatable memory, with a drawer breaking the risk down by namespace. Claude built a purple donut with an OOM-risk breakdown.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;v0.10.5&lt;/strong&gt;: Per-tile namespace filters — each tile (Pods, CPU, Memory) needed its own independent &lt;code&gt;&amp;lt;select&amp;gt;&lt;/code&gt;, so filtering one wouldn't break the others. The Financial Correlation panel grew to full width with an orange FinOps border. The drawer gained an interactive period selector and sortable columns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;v0.10.6–v0.10.7&lt;/strong&gt;: I reorganized the hierarchy on paper first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;row-4&lt;/code&gt;: Node Health | Pod Distribution | CPU (compact) | Memory (compact)&lt;/li&gt;
&lt;li&gt;Financial Correlation: full width, immediately below&lt;/li&gt;
&lt;li&gt;Waste Intelligence: full width with scroll, at the bottom&lt;/li&gt;
&lt;li&gt;Active Alerts tile: removed (empty space is worse than no tile)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;v0.10.8&lt;/strong&gt;: I asked for an animated alert badge in the header — a green dot for "All OK", orange for warnings, pulsing red for critical. The six KPI cards became clickable, each opening its respective drawer. The dead "Active Alerts" KPI was replaced with "Top Memory Consumer" — the actually useful metric.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67jjm4hsld6t8pd2zbwb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67jjm4hsld6t8pd2zbwb.png" alt="Sentinel v0.10.12 — full overview" width="800" height="518"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The "after" — v0.10.12 with a unified layout, forecast chart and Top Workloads panel.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  v0.10.9 — The bug that failed silently
&lt;/h2&gt;

&lt;p&gt;During testing, I noticed the KPI cards showing &lt;code&gt;--&lt;/code&gt; for their values. Not an error. Not a console warning. Just dashes.&lt;/p&gt;

&lt;p&gt;I reported it to Claude, who traced it to a &lt;code&gt;ReferenceError&lt;/code&gt; in &lt;code&gt;updateOverview()&lt;/code&gt;. The code did:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;pods&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But &lt;code&gt;/api/summary&lt;/code&gt; doesn't return an individual array of pods. It returns &lt;code&gt;podsByPhase&lt;/code&gt;, &lt;code&gt;failedPods&lt;/code&gt;, &lt;code&gt;pendingPods&lt;/code&gt;. The variable &lt;code&gt;pods&lt;/code&gt; didn't exist.&lt;/p&gt;

&lt;p&gt;The error was thrown, silently swallowed by the outer &lt;code&gt;try/catch&lt;/code&gt;, and execution stopped before updating &lt;code&gt;kT&lt;/code&gt;, &lt;code&gt;kMem&lt;/code&gt;, &lt;code&gt;kW&lt;/code&gt; — all the KPI values. They had been sitting at &lt;code&gt;--&lt;/code&gt; since initialization.&lt;/p&gt;

&lt;p&gt;Claude extracted &lt;code&gt;updatePodsAllNsTile()&lt;/code&gt; — a new async function that fetches &lt;code&gt;/api/pods&lt;/code&gt; separately, groups by namespace and renders a per-namespace distribution donut.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Silent failures are the worst kind. At least a noisy crash tells you where to look.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  v0.10.10 — The column that was always zero
&lt;/h2&gt;

&lt;p&gt;The Memory drawer had a "Mem Request" column. It showed &lt;code&gt;N/A&lt;/code&gt; for every pod.&lt;/p&gt;

&lt;p&gt;I went to query the database directly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;mem_request&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
 &lt;span class="c1"&gt;-- 0&lt;/span&gt;
 &lt;span class="c1"&gt;-- 0&lt;/span&gt;
 &lt;span class="c1"&gt;-- 0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every row. Zero.&lt;/p&gt;

&lt;p&gt;Four versions earlier, when the DB INSERT was written, &lt;code&gt;mem_request&lt;/code&gt; had been hardcoded to &lt;code&gt;0&lt;/code&gt;. The struct field existed, the column existed, the frontend expected data — but real values were never written.&lt;/p&gt;

&lt;p&gt;I described the fix to Claude: collect per-pod memory requests during the collection cycle and use those real values in the INSERT. Claude built &lt;code&gt;podMemRequestMap[namespace][pod]&lt;/code&gt;, summing memory requests across all containers. The INSERT now uses the real value.&lt;/p&gt;
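&lt;p&gt;The core of the fix is a two-level map. A sketch of the idea with simplified stand-in types (the real collector walks corev1 pod specs via client-go):&lt;/p&gt;

```go
package main

import "fmt"

// Container and Pod are simplified stand-ins for the corev1 types the
// real collector walks; only the fields this sketch needs.
type Container struct {
	MemRequestBytes int64
}

type Pod struct {
	Namespace, Name string
	Containers      []Container
}

// memRequestByPod builds the namespace → pod → summed-request map the
// fixed INSERT reads from, the same idea as podMemRequestMap.
func memRequestByPod(pods []Pod) map[string]map[string]int64 {
	m := map[string]map[string]int64{}
	for _, p := range pods {
		if m[p.Namespace] == nil {
			m[p.Namespace] = map[string]int64{}
		}
		var total int64
		for _, c := range p.Containers { // sum requests across all containers
			total += c.MemRequestBytes
		}
		m[p.Namespace][p.Name] = total
	}
	return m
}

func main() {
	pods := []Pod{{Namespace: "google-demo", Name: "cartservice",
		Containers: []Container{{64 << 20}, {32 << 20}}}}
	fmt.Println(memRequestByPod(pods)["google-demo"]["cartservice"]) // total bytes
}
```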

&lt;p&gt;The historical data stays at zero — it was already written. But every new collection carries the right number. A migration would fix the history; I decided to let time heal it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzj0roax75gu93x9ma3a2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzj0roax75gu93x9ma3a2.png" alt="FinOps drawer — financial correlation detail" width="800" height="518"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;FinOps drawer: sortable historical table with Budget, Actual, Waste and Waste%.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zns7uev0tc3qrk3wwlp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zns7uev0tc3qrk3wwlp.png" alt="Memory Resource drawer — breakdown by namespace" width="800" height="517"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Memory drawer: per-namespace breakdown with a per-pod OOM-risk indicator.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  v0.10.11–v0.10.12 — From display to decision
&lt;/h2&gt;

&lt;p&gt;This is the part I'm proudest of.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;v0.10.11&lt;/strong&gt;: I wanted a tooltip on the "Connected" badge — hover to see cluster health without opening a single drawer. Claude built a card showing Cluster, Endpoint, Version, Session uptime, Last sync and Database status. Small detail. High signal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;v0.10.12&lt;/strong&gt;: I wanted to merge Waste Intelligence and Top Workloads into a single action-oriented panel: &lt;strong&gt;"Top Workloads — CPU &amp;amp; Waste Analysis"&lt;/strong&gt;. But the core request was making pod names clickable.&lt;/p&gt;

&lt;p&gt;I defined the interaction: click a pod → a drawer opens with current usage, request, a utilization bar and a concrete rightsizing recommendation. Claude implemented it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; kube-apiserver-minikube          sentinel          ⚠ Overprovisioned

 CPU Usage / Request              42m / 250m
 ████████░░░░░░░░░░░░░░░░░░░░    16.8%

 Memory Usage / Request           312 Mi / No request set

 ⚠ Savings Opportunity
 Potential CPU savings: -208m (83%)
 CPU request is significantly higher than actual usage.
 Consider reducing resources.requests.cpu to ~51m.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;~51m&lt;/code&gt; number comes from &lt;code&gt;ceil(actualUsage × 1.2)&lt;/code&gt; — a 20% headroom buffer computed at render time. It's not a generic recommendation. It's a concrete one, specific to that pod, at that moment.&lt;/p&gt;
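&lt;p&gt;The headroom rule itself is one line. A sketch, assuming millicores as the unit:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
)

// recommendCPURequest applies the headroom rule described above:
// actual usage in millicores times 1.2, rounded up.
func recommendCPURequest(usageMillicores float64) int64 {
	return int64(math.Ceil(usageMillicores * 1.2))
}

func main() {
	// the drawer's example pod: 42m observed CPU usage
	fmt.Printf("~%dm\n", recommendCPURequest(42)) // ~51m
}
```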

&lt;p&gt;Rows with waste are highlighted in amber. Rightsized pods get a green checkmark. The table became a prioritized action list.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftv393vb3lcewm29gnisw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftv393vb3lcewm29gnisw.png" alt="Pod Detail — Waste Analysis drawer" width="800" height="518"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The star of the show: click any pod name for a concrete rightsizing recommendation.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Data without action is just a report.&lt;/strong&gt; For the first months of this project, Sentinel was a very pretty report. The forecast was gorgeous. The donuts were pretty. But you couldn't do anything &lt;em&gt;from&lt;/em&gt; the dashboard — you had to write it down, open a terminal and kubectl-edit something.&lt;/p&gt;

&lt;p&gt;The pod detail drawer is the first time Sentinel hands you a number you can use directly. That's a different category of tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Silent failures pile up.&lt;/strong&gt; The &lt;code&gt;pods.forEach&lt;/code&gt; bug, the &lt;code&gt;mem_request = 0&lt;/code&gt; bug, the Database &lt;code&gt;--&lt;/code&gt; in the tooltip — none of them threw a visible error. All of them degraded silently. I need better observability in the dashboard itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layout is product thinking.&lt;/strong&gt; I spent more of this session defining information hierarchy than asking for new features. It felt like a waste at the time. In hindsight, a dashboard where your eyes know where to go is worth more than one with more features.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cluster state (v0.10.12)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nodes:    1 (minikube) — Running
Pods:     24 Running (sentinel + google-demo namespaces)
CPU:      32.8% allocated
Waste:    20 pods with savings opportunities
DB:       ✓ OK
Version:  v0.10.12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;The roadmap points to M2 and M3:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Per-namespace efficiency score&lt;/strong&gt; — not just "which pods are wasteful" but "which namespace is the worst"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/api/incidents&lt;/code&gt;&lt;/strong&gt; — deterministic violation detection, no LLM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Online Boutique lab&lt;/strong&gt; — baseline → load → chaos → comparison (the post I promised in #2)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And eventually: auth. Because a dashboard without auth is a tool that trusts everyone in the room.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Sentinel is open-source and honestly versioned. Still &lt;code&gt;0.x&lt;/code&gt;. Getting there.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>finops</category>
      <category>devops</category>
      <category>go</category>
    </item>
    <item>
      <title>Sentinel Diary #2: the day the dashboard lied (and other honest bugs)</title>
      <dc:creator>Marcel Boccato</dc:creator>
      <pubDate>Sun, 12 Apr 2026 03:07:42 +0000</pubDate>
      <link>https://dev.to/boccato85/diario-sentinel-o-dia-que-o-dashboard-mentiu-e-outros-bugs-honestos-38n8</link>
      <guid>https://dev.to/boccato85/diario-sentinel-o-dia-que-o-dashboard-mentiu-e-outros-bugs-honestos-38n8</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Series: vibe coding with Claude Code + Kubernetes&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Today was one of those days where you sit down to "do one small thing" and stand up three hours later with a commit log longer than planned. Spoiler: not a single line was typed manually.&lt;/p&gt;

&lt;p&gt;But before getting to the bugs, I need to talk about what changed in the session setup — because that's what made the day possible.&lt;/p&gt;




&lt;h2&gt;
  
  
  The new setup: GitHub Copilot Pro as a model hub
&lt;/h2&gt;

&lt;p&gt;After a while switching between Claude API directly, Gemini and other tools, I made a decision that completely changed my workflow: signing up for &lt;strong&gt;GitHub Copilot Pro&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The insight isn't obvious at first. Copilot Pro gives access to multiple models under a single subscription — and that's where things got interesting.&lt;/p&gt;

&lt;p&gt;The flow for the day looked roughly like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4.1 mini&lt;/strong&gt; — initial code review, fast and cheap, good for a first pass&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-5.3 Codex&lt;/strong&gt; — deep architecture review, where I needed denser reasoning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenCode with Claude Opus&lt;/strong&gt; — installed OpenCode and ran it with Opus for the first complex analyses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Migration to Claude Sonnet 4.6&lt;/strong&gt; — after comparing results, moved to Sonnet. Equivalent quality, significantly lower token consumption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using Copilot Pro as a hub — switching models depending on the task — was the turning point. Instead of paying for individual tokens or depending on a single provider, you have a menu and pick the right tool for each moment.&lt;/p&gt;

&lt;p&gt;Sonnet 4.6 specifically surprised me: across Sentinel development sessions, it delivered the same reasoning quality as Opus at a fraction of the consumption. For continuous work on long projects, that makes a real difference both on the wallet and on session flow.&lt;/p&gt;




&lt;h2&gt;
  
  
  What happened since the last post
&lt;/h2&gt;

&lt;p&gt;Sentinel grew quite a bit over the last week — and some decisions deserve a record before getting into today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Leaving Grafana/Prometheus.&lt;/strong&gt; The original stack depended on kube-prometheus-stack: Prometheus collecting metrics, Grafana displaying, AlertManager notifying. It worked, but was heavy for a local environment and created an infrastructure dependency I wanted to eliminate. The solution was my call: make the Go agent the single source of truth. Claude implemented it — collecting directly via &lt;code&gt;client-go&lt;/code&gt;, persisting to PostgreSQL, exposing the REST API. No sidecar. No scrape. No three port-forwards at startup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Helm chart.&lt;/strong&gt; I wanted Sentinel to be a first-class Kubernetes citizen. A single &lt;code&gt;helm install&lt;/code&gt; to bring everything up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;sentinel helm/sentinel &lt;span class="nt"&gt;-n&lt;/span&gt; sentinel &lt;span class="nt"&gt;--create-namespace&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude built the chart — Deployment, Service, ConfigMap, an initContainer that waits for PostgreSQL, and automatic InClusterConfig (the agent detects if it's running inside the cluster and uses the ServiceAccount, no kubeconfig needed).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three-tier retention policies.&lt;/strong&gt; With history growing, I needed a storage strategy that wouldn't blow up the local PostgreSQL. I defined the tiers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Granularity&lt;/th&gt;
&lt;th&gt;Retention&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Raw&lt;/td&gt;
&lt;td&gt;~10s&lt;/td&gt;
&lt;td&gt;24 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hourly&lt;/td&gt;
&lt;td&gt;1 hour&lt;/td&gt;
&lt;td&gt;30 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Daily&lt;/td&gt;
&lt;td&gt;1 day&lt;/td&gt;
&lt;td&gt;365 days&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Claude implemented the hourly aggregation job and extended &lt;code&gt;/api/history&lt;/code&gt; to support ranges from &lt;code&gt;30m&lt;/code&gt; to &lt;code&gt;365d&lt;/code&gt; — same API, transparent to the dashboard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security hardening.&lt;/strong&gt; GPT-5.3 Codex did a deep code review and flagged several issues: unbounded connection pool, missing rate limiting, bind address exposed on &lt;code&gt;0.0.0.0&lt;/code&gt; without configuration. I took those findings to Claude, who fixed all of them. The harness got Unicode normalization (NFKC), 10MB input limit and path traversal protection on the &lt;code&gt;--component&lt;/code&gt; parameter. 16 tests covering critical cases.&lt;/p&gt;




&lt;h2&gt;
  
  
  The mystery of the zeros
&lt;/h2&gt;

&lt;p&gt;Opened the Sentinel dashboard. Everything was zeroed out. All panels showing &lt;code&gt;--&lt;/code&gt;. Node Health Map empty. Pod Distribution gone. FinOps missing.&lt;/p&gt;

&lt;p&gt;The cluster was running. Pods were healthy. The port-forward had started. But JavaScript wasn't receiving anything.&lt;/p&gt;

&lt;p&gt;First hypothesis: Sentinel pod with a problem. &lt;code&gt;kubectl logs&lt;/code&gt; — normal.&lt;br&gt;
Second hypothesis: Metrics Server offline. Tested — working.&lt;br&gt;
Third hypothesis: something with the port-forward.&lt;/p&gt;

&lt;p&gt;Ran the command that solved everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lsof &lt;span class="nt"&gt;-i&lt;/span&gt; :8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;sentinel-agent  59321  boccatosantos  ...  IPv4  *:8080 (LISTEN)  &amp;lt;- the villain
kubectl         61204  boccatosantos  ...  IPv6  *:8080 (LISTEN)  &amp;lt;- correct
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A locally compiled &lt;code&gt;sentinel-agent&lt;/code&gt; instance had been running since 17:49, listening on IPv4. Firefox was connecting to that stale local binary, which had no access to the cluster at all. The &lt;code&gt;kubectl port-forward&lt;/code&gt; was there too, but on IPv6, and the browser preferred IPv4.&lt;/p&gt;

&lt;p&gt;Fix: &lt;code&gt;kill 59321&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The hard part was getting to that line. The fix itself took two seconds.&lt;/p&gt;




&lt;h2&gt;
  
  
  The dashboard still wouldn't load
&lt;/h2&gt;

&lt;p&gt;Killed the process. Refreshed the browser. Still no data.&lt;/p&gt;

&lt;p&gt;Opened DevTools and found this in the console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Refused to connect to /api/summary because it violates the Content Security Policy directive
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Go server had a &lt;code&gt;Content-Security-Policy&lt;/code&gt; header configured, but without &lt;code&gt;connect-src&lt;/code&gt;. The browser was silently blocking every &lt;code&gt;fetch()&lt;/code&gt; call from JavaScript. No visible error in the UI — just the console screaming for anyone paying attention.&lt;/p&gt;

&lt;p&gt;I described the issue to Claude, who updated &lt;code&gt;main.go&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// before&lt;/span&gt;
&lt;span class="s"&gt;"default-src 'self'; script-src ..."&lt;/span&gt;

&lt;span class="c"&gt;// after&lt;/span&gt;
&lt;span class="s"&gt;"default-src 'self'; connect-src 'self'; script-src ..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One word. One hour of diagnosis.&lt;/p&gt;
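&lt;p&gt;For the record, this class of fix fits in a tiny middleware. The directive list below is illustrative; Sentinel's real policy string has more entries.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"net/http"
)

// cspValue is the corrected policy: connect-src 'self' is what lets the
// dashboard's fetch() calls through. The directive list is illustrative.
func cspValue() string {
	return "default-src 'self'; connect-src 'self'; script-src 'self'"
}

// withCSP attaches the header to every response it serves.
func withCSP(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Security-Policy", cspValue())
		next.ServeHTTP(w, r)
	})
}

func main() {
	fmt.Println(cspValue())
}
```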




&lt;h2&gt;
  
  
  The bar that lied
&lt;/h2&gt;

&lt;p&gt;With the dashboard working, I noticed the Utilization bar in the Top Workloads panel was wrong. It looked right — showed percentages, had colors — but the calculation was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// heaviest pod in the cluster = 100%&lt;/span&gt;
&lt;span class="c1"&gt;// all others are relative to it&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cpu&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;maxConsumer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A pod using 10m CPU with a 1000m request appeared as 100% efficient if it happened to be the cluster's top consumer at that moment. Useless for FinOps.&lt;/p&gt;

&lt;p&gt;I explained the right semantics to Claude — usage vs the pod's own request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// how much the pod is using vs what IT REQUESTED&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;cpu&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Semantic colors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Green (&amp;gt;70%): well-sized pod&lt;/li&gt;
&lt;li&gt;Orange (40-70%): some waste&lt;/li&gt;
&lt;li&gt;Red (&amp;lt;40%): oversized, right-sizing candidate&lt;/li&gt;
&lt;/ul&gt;
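&lt;p&gt;The color mapping reduces to one switch. A Go sketch of the thresholds (the real logic lives in the dashboard's JavaScript):&lt;/p&gt;

```go
package main

import "fmt"

// utilizationColor maps a usage-vs-request percentage to the semantic
// colors listed above. Thresholds mirror the dashboard's.
func utilizationColor(pct float64) string {
	switch {
	case pct > 70:
		return "green" // well-sized pod
	case pct >= 40:
		return "orange" // some waste
	default:
		return "red" // oversized, right-sizing candidate
	}
}

func main() {
	fmt.Println(utilizationColor(16.8)) // a heavily overprovisioned pod
}
```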

&lt;p&gt;This bug had been there from the start. I only caught it when I stopped to actually look at the number.&lt;/p&gt;




&lt;h2&gt;
  
  
  Financial Correlation got context
&lt;/h2&gt;

&lt;p&gt;The ROI Timeline panel was showing only the chart. You could see the Budget vs Actual lines, but without value references — hard to know if the waste was $0.002/h or $2/h.&lt;/p&gt;

&lt;p&gt;I asked Claude to add a fixed summary above the chart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Budget  $0.0312/h  |  Actual  $0.0102/h  |  Waste  $0.0210/h (67.3%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each point's tooltip now shows all three values. Adaptive Y-axis precision — no more embarrassing &lt;code&gt;$0.000&lt;/code&gt; when values are milli-cents.&lt;/p&gt;
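&lt;p&gt;Adaptive precision is just a magnitude switch. A hypothetical sketch; the real formatting lives in the chart code.&lt;/p&gt;

```go
package main

import "fmt"

// formatUSD picks decimal places by magnitude, so milli-cent values
// don't collapse to an unreadable $0.000 on the axis.
func formatUSD(v float64) string {
	switch {
	case v >= 1:
		return fmt.Sprintf("$%.2f", v)
	case v >= 0.01:
		return fmt.Sprintf("$%.3f", v)
	default:
		return fmt.Sprintf("$%.4f", v)
	}
}

func main() {
	fmt.Println(formatUSD(0.0210)) // waste-per-hour scale
	fmt.Println(formatUSD(12.5))   // dollar scale
}
```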




&lt;h2&gt;
  
  
  The versioning decision
&lt;/h2&gt;

&lt;p&gt;This was the most honest moment of the day.&lt;/p&gt;

&lt;p&gt;The project was at &lt;code&gt;v1.7.3&lt;/code&gt;. But it has no auth, no configurable alerts, no tests. Calling it &lt;code&gt;v1.x&lt;/code&gt; implies stable API and feature-complete — and that's not what Sentinel is today.&lt;/p&gt;

&lt;p&gt;Decided to renumber everything: &lt;code&gt;1.x → 0.x&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;v1.1&lt;/td&gt;
&lt;td&gt;v0.1 — initial MVP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v1.3&lt;/td&gt;
&lt;td&gt;v0.3 — FinOps + PostgreSQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v1.5&lt;/td&gt;
&lt;td&gt;v0.5 — Security hardening&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v1.6&lt;/td&gt;
&lt;td&gt;v0.6 — Configurable retention&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v1.7&lt;/td&gt;
&lt;td&gt;v0.7 — Standalone, no Prometheus&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v1.7.3&lt;/td&gt;
&lt;td&gt;v0.7.3 — Today&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;v1.0&lt;/code&gt; will be the real milestone: when auth, alerts and tests are done. Until then, we're &lt;code&gt;0.x&lt;/code&gt; and proud of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The final touch
&lt;/h2&gt;

&lt;p&gt;To close the session, I asked Claude to add a small version badge in the top-right corner of the header:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Default state: discrete gray, mono font&lt;/li&gt;
&lt;li&gt;Hover: lights up in cyan&lt;/li&gt;
&lt;li&gt;Tooltip: &lt;code&gt;Sentinel v0.7.3 / Kubernetes Observability&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Six lines of CSS. But it gives that sense of a cared-for product.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final cluster state
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nodes:    1 (minikube) — Running
Pods:     24 Running, 0 Failed, 0 Pending
CPU:      2620m requested / 8000m allocatable (32.75% efficiency)
Commits:  11 today
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkuhvtr83y2xjsuslu4md.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkuhvtr83y2xjsuslu4md.png" alt="Sentinel v0.7.3 Dashboard" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Preparing the battlefield: Google Online Boutique
&lt;/h2&gt;

&lt;p&gt;At the end of the session, before closing the terminal, I did something that will pay off in the next episode: deploying &lt;strong&gt;Google Online Boutique&lt;/strong&gt; in a dedicated namespace.&lt;/p&gt;

&lt;p&gt;Online Boutique is Google's microservices demo — 12 services simulating a real e-commerce app (frontend, cart, checkout, payment, recommendation engine and more). It's the perfect stress-test target for Sentinel.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create namespace google-demo
kubectl apply &lt;span class="nt"&gt;-n&lt;/span&gt; google-demo &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/main/release/kubernetes-manifests.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two commands. Twelve services. A proper load to observe.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Namespace: google-demo
Pods:      12 Running
Services:  12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cluster went from 12 to 24 pods. Sentinel picked up everything without any configuration change — it monitors all namespaces by default.&lt;/p&gt;

&lt;p&gt;Why does this matter? Because Sentinel was built and tested with its own workload as the only reference. Now there's a realistic multi-service app to probe: uneven CPU distribution, idle pods, services with no requests, cost variance between workloads. Real FinOps territory.&lt;/p&gt;

&lt;p&gt;Next up: Sentinel Diary #3 — where we'll use Online Boutique as the lab. Capacity analysis, scaling automation on failure, request spike simulation. The cluster is set. Let's break things on purpose.&lt;/p&gt;




&lt;p&gt;Three bugs. Four improvements. A more honest versioning scheme. And not a single line typed manually.&lt;/p&gt;

&lt;p&gt;This is Sentinel &lt;code&gt;v0.7.3&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Full Changelog
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;v0.7.3 — today&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Utilization bar fixed — now shows real usage / request, not relative to top consumer&lt;/li&gt;
&lt;li&gt;Semantic colors: green (&amp;gt;70% efficient), orange (40-70%), red (&amp;lt;40% = waste)&lt;/li&gt;
&lt;li&gt;Financial Correlation improved — Budget / Actual / Waste summary above the chart&lt;/li&gt;
&lt;li&gt;Enriched tooltip — shows Budget, Actual and Waste per point on hover&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;v0.7 — fully standalone&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Removed all Prometheus, Grafana and AlertManager dependencies&lt;/li&gt;
&lt;li&gt;Resilient startup — initContainer waits for PostgreSQL + exponential backoff retry in Go&lt;/li&gt;
&lt;li&gt;CSP fix — added &lt;code&gt;connect-src 'self'&lt;/code&gt; to allow fetch requests in the dashboard&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tools/monitor.py&lt;/code&gt; rewritten to use Go agent API&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/startup&lt;/code&gt; simplified — only checks Minikube and Go agent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;v0.6 — configurable retention&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3-tier retention policy: raw (24h), hourly (30d), daily (365d) with automatic cleanup&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/api/history&lt;/code&gt; now supports ranges from 30m to 365d&lt;/li&gt;
&lt;li&gt;Hourly auto-aggregation compacting old metrics&lt;/li&gt;
&lt;li&gt;New tables: &lt;code&gt;metrics_hourly&lt;/code&gt;, &lt;code&gt;metrics_daily&lt;/code&gt;, &lt;code&gt;cost_history&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;v0.5 — Helm + security hardening&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Helm chart — full Kubernetes deploy with &lt;code&gt;helm install sentinel helm/sentinel -n sentinel&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;InClusterConfig — Go agent auto-detects if running inside the cluster&lt;/li&gt;
&lt;li&gt;Auto-schema — &lt;code&gt;metrics&lt;/code&gt; table created automatically on startup&lt;/li&gt;
&lt;li&gt;Security hardening — PostgreSQL connection pool, rate limiting (100 rps), configurable bind address&lt;/li&gt;
&lt;li&gt;Harness — Unicode normalization (NFKC), 10MB input limit, 16 tests&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--component&lt;/code&gt; sanitized against path traversal, timeout with safe clamping&lt;/li&gt;
&lt;/ul&gt;







&lt;blockquote&gt;
&lt;p&gt;Series: vibe coding with Claude Code + Kubernetes&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Today was one of those days when you sit down to do one small thing and get up three hours later with a bigger commit log than planned. Spoiler: not a single line was typed manually.&lt;/p&gt;

&lt;p&gt;But before getting to the bugs, I need to explain what changed in the session's infrastructure, because that is what made the day possible.&lt;/p&gt;




&lt;h2&gt;
  
  
  The new setup: GitHub Copilot Pro as a model hub
&lt;/h2&gt;

&lt;p&gt;After a while alternating between the Claude API directly, Gemini and other tools, I made a decision that completely changed my workflow: subscribing to &lt;strong&gt;GitHub Copilot Pro&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The catch is not obvious at first glance. Copilot Pro gives access to several models under a single subscription, and that is where things got interesting.&lt;/p&gt;

&lt;p&gt;The day's workflow went roughly like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4.1 mini&lt;/strong&gt;: initial code review, fast and cheap, good for a first pass&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-5.3 Codex&lt;/strong&gt;: deep architecture review, where I really needed denser reasoning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenCode with Claude Opus&lt;/strong&gt;: I installed OpenCode and ran it with Opus for the first, more complex analyses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Migrating to Claude Sonnet 4.6&lt;/strong&gt;: after comparing results I moved to Sonnet. Equivalent quality, significantly lower token consumption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This strategy of using Copilot Pro as a hub, switching models according to the type of task, was the game changer. Instead of paying for loose tokens or depending on a single provider, you have a menu and pick the right tool for each moment.&lt;/p&gt;

&lt;p&gt;Sonnet 4.6 in particular was a surprise: in the Sentinel development sessions it delivered the same reasoning quality as Opus at a fraction of the consumption. For continuous work on long projects, that makes a real difference both in cost and in session fluidity.&lt;/p&gt;




&lt;h2&gt;
  
  
  What happened since the last post
&lt;/h2&gt;

&lt;p&gt;Sentinel has grown a lot over the past weeks, and a few decisions deserve to be on record before getting into today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dropping Grafana/Prometheus.&lt;/strong&gt; The original stack depended on kube-prometheus-stack. It worked, but it was heavy and created infrastructure dependencies I wanted to eliminate. I decided to make the Go agent the single source of truth. Claude implemented it: collection directly via &lt;code&gt;client-go&lt;/code&gt;, persistence in PostgreSQL, and a REST API on top. No sidecar. No scraping. No three port-forwards at startup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Helm chart.&lt;/strong&gt; I wanted Sentinel to be a first-class citizen in Kubernetes. One &lt;code&gt;helm install&lt;/code&gt; to bring everything up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;sentinel helm/sentinel &lt;span class="nt"&gt;-n&lt;/span&gt; sentinel &lt;span class="nt"&gt;--create-namespace&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude built the chart: Deployment, Service, ConfigMap, an initContainer that waits for PostgreSQL, and automatic InClusterConfig (the agent detects whether it is running inside the cluster and uses the ServiceAccount, no mounted kubeconfig needed).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three-tier retention policy.&lt;/strong&gt; With the history growing, I needed a storage strategy that would not blow up the local PostgreSQL. I defined the tiers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Camada&lt;/th&gt;
&lt;th&gt;Granularidade&lt;/th&gt;
&lt;th&gt;Retenção&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Raw&lt;/td&gt;
&lt;td&gt;~10s&lt;/td&gt;
&lt;td&gt;24 horas&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hourly&lt;/td&gt;
&lt;td&gt;1 hora&lt;/td&gt;
&lt;td&gt;30 dias&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Daily&lt;/td&gt;
&lt;td&gt;1 dia&lt;/td&gt;
&lt;td&gt;365 dias&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Claude implemented the hourly aggregation job and extended &lt;code&gt;/api/history&lt;/code&gt; to support ranges from &lt;code&gt;30m&lt;/code&gt; to &lt;code&gt;365d&lt;/code&gt;: same API, transparent to the dashboard.&lt;/p&gt;
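Routing a requested range to the right tier can be sketched like this. One detail worth noting: Go's time.ParseDuration has no "d" unit, so a range like 365d needs manual handling. The table names mirror the changelog; the cutoffs are my assumption based on the retention table:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseRange accepts the API's range strings. time.ParseDuration has
// no "d" unit, so "365d" is converted to hours first.
func parseRange(s string) (time.Duration, error) {
	if strings.HasSuffix(s, "d") {
		days, err := strconv.Atoi(strings.TrimSuffix(s, "d"))
		if err != nil {
			return 0, err
		}
		return time.Duration(days) * 24 * time.Hour, nil
	}
	return time.ParseDuration(s)
}

// tierFor picks which table to read, mirroring the retention tiers:
// raw samples for up to 24h, hourly rollups up to 30d, daily beyond.
func tierFor(r time.Duration) string {
	switch {
	case r <= 24*time.Hour:
		return "metrics" // raw, ~10s samples
	case r <= 30*24*time.Hour:
		return "metrics_hourly"
	default:
		return "metrics_daily"
	}
}

func main() {
	for _, s := range []string{"30m", "24h", "7d", "365d"} {
		d, err := parseRange(s)
		if err != nil {
			panic(err)
		}
		fmt.Println(s, "->", tierFor(d))
	}
}
```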

&lt;p&gt;&lt;strong&gt;Security hardening.&lt;/strong&gt; GPT-5.3 Codex did a deep code review and flagged several issues: an unbounded connection pool, no rate limiting, a bind address exposed on &lt;code&gt;0.0.0.0&lt;/code&gt;. I took the findings to Claude, which fixed everything. The harness gained Unicode normalization (NFKC), a 10MB input limit and path traversal protection on the &lt;code&gt;--component&lt;/code&gt; parameter. 16 tests covering the critical cases.&lt;/p&gt;




&lt;h2&gt;
  
  
  The mystery of the zeros
&lt;/h2&gt;

&lt;p&gt;I opened the Sentinel dashboard. Everything was zeroed out. Every panel showing &lt;code&gt;--&lt;/code&gt;. Node Health Map empty. Pod Distribution with no data. FinOps gone.&lt;/p&gt;

&lt;p&gt;The cluster was running. The pods were healthy. The port-forward was up. But the JavaScript was receiving nothing.&lt;/p&gt;

&lt;p&gt;First hypothesis: a problem with the Sentinel pod. &lt;code&gt;kubectl logs&lt;/code&gt;: normal.&lt;br&gt;
Second hypothesis: Metrics Server offline. Tested it: working.&lt;br&gt;
Third hypothesis: something with the port-forward.&lt;/p&gt;

&lt;p&gt;I ran the command that solved everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lsof &lt;span class="nt"&gt;-i&lt;/span&gt; :8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;sentinel-agent  59321  boccatosantos  ...  IPv4  *:8080 (LISTEN)  &amp;lt;- vilão
kubectl         61204  boccatosantos  ...  IPv6  *:8080 (LISTEN)  &amp;lt;- correto
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A locally compiled instance of &lt;code&gt;sentinel-agent&lt;/code&gt; had been running since 17:49, listening on IPv4. Firefox was connecting to it, and it had no access to the cluster at all. The &lt;code&gt;kubectl port-forward&lt;/code&gt; was there too, but on IPv6, and the browser preferred IPv4.&lt;/p&gt;

&lt;p&gt;Fix: &lt;code&gt;kill 59321&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The hard part was getting to that line. The fix itself took two seconds.&lt;/p&gt;




&lt;h2&gt;
  
  
  The dashboard still would not load
&lt;/h2&gt;

&lt;p&gt;I killed the process. Refreshed the browser. Still no data.&lt;/p&gt;

&lt;p&gt;I opened DevTools and found this in the console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Refused to connect to /api/summary because it violates the Content Security Policy directive
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Go server had a &lt;code&gt;Content-Security-Policy&lt;/code&gt; header configured, but without &lt;code&gt;connect-src&lt;/code&gt;. The browser was silently blocking every JavaScript &lt;code&gt;fetch()&lt;/code&gt;. No visible error in the UI, just the console shouting at whoever looked.&lt;/p&gt;

&lt;p&gt;I described the problem to Claude, which updated &lt;code&gt;main.go&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// antes&lt;/span&gt;
&lt;span class="s"&gt;"default-src self; script-src ..."&lt;/span&gt;

&lt;span class="c"&gt;// depois&lt;/span&gt;
&lt;span class="s"&gt;"default-src self; connect-src self; script-src ..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One word. One hour of diagnosis.&lt;/p&gt;




&lt;h2&gt;
  
  
  The bar that lied
&lt;/h2&gt;

&lt;p&gt;With the dashboard working, I noticed that the Utilization bar in the Top Workloads panel was wrong. It looked right, with percentages and colors, but the calculation was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// o pod mais pesado do cluster = 100%&lt;/span&gt;
&lt;span class="c1"&gt;// todos os outros são relativos a ele&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cpu&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;maxConsumer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A pod using 10m of CPU against a 1000m request showed as 100% efficient if it happened to be the cluster's top consumer at that moment. Useless for FinOps.&lt;/p&gt;

&lt;p&gt;I explained the correct semantics to Claude: usage versus the pod's own request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// quanto o pod está usando do que ELE MESMO pediu&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;cpu&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Semantic colors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Green (&amp;gt;70%): well-sized pod&lt;/li&gt;
&lt;li&gt;Orange (40-70%): some waste&lt;/li&gt;
&lt;li&gt;Red (&amp;lt;40%): oversized, right-sizing candidate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That bug had been there from day one. I only noticed it when I stopped to actually look at the number.&lt;/p&gt;




&lt;h2&gt;
  
  
  Financial Correlation gained context
&lt;/h2&gt;

&lt;p&gt;The ROI Timeline panel showed only the chart. You could see the Budget vs Actual lines, but without any value reference it was hard to tell whether the waste was $0.002/h or $2/h.&lt;/p&gt;

&lt;p&gt;I asked Claude to add a fixed summary above the chart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Budget  $0.0312/h  |  Actual  $0.0102/h  |  Waste  $0.0210/h (67.3%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each point's tooltip now shows all three values. The Y axis got adaptive precision, so no more of that embarrassing &lt;code&gt;$0.000&lt;/code&gt; when the values are fractions of a cent.&lt;/p&gt;
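Adaptive precision just means choosing the number of decimal places from the magnitude. A sketch with illustrative cutoffs (the dashboard may use different ones):

```go
package main

import "fmt"

// formatDollars picks decimal places based on magnitude so tiny
// hourly costs do not collapse to $0.000.
func formatDollars(v float64) string {
	switch {
	case v >= 1:
		return fmt.Sprintf("$%.2f", v)
	case v >= 0.01:
		return fmt.Sprintf("$%.3f", v)
	default:
		return fmt.Sprintf("$%.4f", v)
	}
}

func main() {
	for _, v := range []float64{12.5, 0.0312, 0.0021} {
		fmt.Println(formatDollars(v))
	}
	// $12.50
	// $0.031
	// $0.0021
}
```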




&lt;h2&gt;
  
  
  The versioning decision
&lt;/h2&gt;

&lt;p&gt;This was the most honest moment of the day.&lt;/p&gt;

&lt;p&gt;The project was at &lt;code&gt;v1.7.3&lt;/code&gt;. Except it has no auth, no configurable alerts, no tests. Calling it &lt;code&gt;v1.x&lt;/code&gt; implies a stable, feature-complete API, and that is not what Sentinel is today.&lt;/p&gt;

&lt;p&gt;I decided to renumber everything: &lt;code&gt;1.x → 0.x&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Antes&lt;/th&gt;
&lt;th&gt;Depois&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;v1.1&lt;/td&gt;
&lt;td&gt;v0.1 — MVP inicial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v1.3&lt;/td&gt;
&lt;td&gt;v0.3 — FinOps + PostgreSQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v1.5&lt;/td&gt;
&lt;td&gt;v0.5 — Security hardening&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v1.6&lt;/td&gt;
&lt;td&gt;v0.6 — Retenção configurável&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v1.7&lt;/td&gt;
&lt;td&gt;v0.7 — Standalone, sem Prometheus&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v1.7.3&lt;/td&gt;
&lt;td&gt;v0.7.3 — Hoje&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;v1.0&lt;/code&gt; will be the real milestone: when there is auth, alerts and tests. Until then, we are &lt;code&gt;0.x&lt;/code&gt; and proud of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The finishing touch
&lt;/h2&gt;

&lt;p&gt;To close the session, I asked Claude to add a small version badge on the right side of the header:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Normal state: discreet gray, mono font&lt;/li&gt;
&lt;li&gt;Hover: lights up in cyan&lt;/li&gt;
&lt;li&gt;Tooltip: &lt;code&gt;Sentinel v0.7.3 / Kubernetes Observability&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Six lines of CSS. But it gives that cared-for product feeling.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final cluster state
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nodes:    1 (minikube) — Running
Pods:     24 Running, 0 Failed, 0 Pending
CPU:      2620m requested / 8000m allocatable (32.75% efficiency)
Commits:  11 today
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;







&lt;p&gt;&lt;em&gt;If you want to follow the project: &lt;a href="https://github.com/boccato85/Sentinel" rel="noopener noreferrer"&gt;github.com/boccato85/Sentinel&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>finops</category>
    </item>
    <item>
      <title>Sentinel Diary #1: from the Anthropic certificate to a FinOps tool in days</title>
      <dc:creator>Marcel Boccato</dc:creator>
      <pubDate>Sun, 05 Apr 2026 22:08:13 +0000</pubDate>
      <link>https://dev.to/boccato85/diario-de-vibe-coding-do-certificado-anthropic-a-uma-plataforma-de-finops-em-dias-1dib</link>
      <guid>https://dev.to/boccato85/diario-de-vibe-coding-do-certificado-anthropic-a-uma-plataforma-de-finops-em-dias-1dib</guid>
      <description>&lt;p&gt;This article is not a tutorial. It's a diary :)&lt;/p&gt;

&lt;p&gt;I'm taking the CKA (Certified Kubernetes Administrator) course on KodeKloud and wanted to get started with agentic AI using Claude. I ended up with a Kubernetes observability and FinOps tool built with Go, PostgreSQL and a real-time dashboard, without having planned any of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Day 1: the certificate and the $5 credit
&lt;/h2&gt;

&lt;p&gt;I decided to take the official Anthropic course on Skilljar: &lt;strong&gt;Claude Code in Action&lt;/strong&gt;. Free, with a certificate, and it took a few hours to complete.&lt;/p&gt;

&lt;p&gt;The course asks for an Anthropic Platform API key to run the examples. I created the key, spun up the local server, and got:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;error: Your credit balance is too low to access the Anthropic API.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;"But the course is free..."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not exactly. The course is free. The API calls consume real credits. They're two separate things — and the platform doesn't make that clear enough. I bought $5 in credits, cleared cache and sessions, recreated the key, and it worked.&lt;/p&gt;

&lt;p&gt;Lesson 1: capitalism always wins. But $5 goes a long way with Haiku, the entry-level model.&lt;/p&gt;

&lt;p&gt;With the certificate in hand the same day, I moved on to practice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Still Day 1: creating v1.0
&lt;/h2&gt;

&lt;p&gt;The idea was simple: a Claude Code agent that would monitor a Kubernetes cluster and automatically generate runbooks. No manual code. Just directives.&lt;/p&gt;

&lt;p&gt;I run Linux Fedora 43 KDE on an Acer Predator PHN16-72 laptop.&lt;/p&gt;

&lt;p&gt;Installed Minikube, spun up &lt;strong&gt;kube-prometheus-stack&lt;/strong&gt; via Helm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;prometheus-stack prometheus-community/kube-prometheus-stack &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; monitoring &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; grafana.adminPassword&lt;span class="o"&gt;=&lt;/span&gt;admin123
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Six pods coming up at once. Grafana with ready-made dashboards. Prometheus collecting real Minikube metrics.&lt;/p&gt;

&lt;p&gt;Then I created the base structure with slash commands in Markdown inside &lt;code&gt;.claude/commands/&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/sentinel&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Main orchestrator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/collect-metrics&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sub-agent A — queries Prometheus via PromQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/analyze-pods&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sub-agent B — checks pods via kubectl&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/correlate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sub-agent C — classifies severity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;CLAUDE.md&lt;/code&gt; became the agent's operational memory: endpoints, thresholds, namespaces, runbook template.&lt;/p&gt;
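For flavor, here is a hypothetical fragment of what such a CLAUDE.md could contain. All values below are invented for illustration, not the repo's actual file:

```markdown
# Sentinel operational memory

## Endpoints
- Prometheus: http://localhost:9090
- Grafana:    http://localhost:3000

## Thresholds
- CPU warning: 70%
- Memory critical: 90%

## Runbook template
1. Summary and severity
2. Affected components
3. Correlated events
4. Suggested remediation
```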

&lt;p&gt;Ran &lt;code&gt;/sentinel&lt;/code&gt; for the first time and it generated this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Severity: WARNING
CPU: 11.4% ✅ | Memory: 45.1% ✅ | Disk: 17.65% ✅
64 Warning events identified as residual from previous node restart
storage-provisioner: recent BackOff — requires monitoring
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent separated noise from signal on its own. It identified that the 64 Warning events were residual from a Minikube reboot — not real anomalies. That wasn't in the prompt. It was model reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;v1.0 on GitHub. Same day as the certificate.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Day 2: v1.1 and automatic startup
&lt;/h2&gt;

&lt;p&gt;The biggest friction in v1.0 was operational: every time I opened Claude Code, I had to remember to manually spin up three port-forwards before running any command.&lt;/p&gt;

&lt;p&gt;I created &lt;code&gt;/startup&lt;/code&gt; — an agent that checks whether Prometheus, Grafana and AlertManager are accessible and only starts the missing port-forwards, in the background, with up to 10 retries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;╔══════════════════════════════════════════╗
║     Sentinel — Startup                   ║
╚══════════════════════════════════════════╝
 Prometheus    (localhost:9090)  →  ✅ STARTED
 Grafana       (localhost:3000)  →  ✅ STARTED
 AlertManager  (localhost:9093)  →  ✅ STARTED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I asked Claude to expand &lt;code&gt;/analyze-pods&lt;/code&gt; to monitor multiple namespaces in parallel: &lt;code&gt;default&lt;/code&gt;, &lt;code&gt;monitoring&lt;/code&gt; and &lt;code&gt;kube-system&lt;/code&gt; — with results grouped by namespace and cross-namespace root cause correlation.&lt;/p&gt;

&lt;p&gt;The result looked like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Namespace&lt;/th&gt;
&lt;th&gt;Total&lt;/th&gt;
&lt;th&gt;Unhealthy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;default&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;monitoring&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;0 (all Running)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kube-system&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;0 (all Running)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The agent also identified &lt;code&gt;storage-provisioner&lt;/code&gt; with 21 restarts versus an average of 8 for other pods — flagging the genuinely anomalous component with no explicit instruction to do so.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;v1.1 on GitHub. Day 2.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Gemini week: when Claude Code went down
&lt;/h2&gt;

&lt;p&gt;The following week, I had a hard time using Claude — whether via web or via terminal. Tokens were being consumed much faster, even following all of Anthropic's own best practices. Complaints started piling up on Reddit, X, Stack Overflow. Apparently demand was too high for them to absorb. I'm on the $20 pro plan — not a big deal, but even for studying it became unworkable.&lt;/p&gt;

&lt;p&gt;I migrated to &lt;strong&gt;Gemini&lt;/strong&gt;. It worked for a good while — and that was the period when Sentinel started gaining its more complex features: the Go agent, PostgreSQL integration, the dashboard.&lt;/p&gt;

&lt;p&gt;But as the project grew, problems appeared:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dashboard UI&lt;/strong&gt;: Gemini generated functional HTML but with visual inconsistencies I had to fix manually&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code security&lt;/strong&gt;: I quickly ran Claude Code as a reviewer and found patterns in the generated code that worried me — missing input validation, absent security headers, SQL queries without proper sanitization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When Claude Code came back (they gave an additional ~$22 credit), I switched back. The cost per session is higher and the context window is smaller — but the quality and reliability of generated code, especially on security matters, make it worth it.&lt;/p&gt;

&lt;p&gt;Sentinel was born in Gemini. But it grew up in Claude Code.&lt;/p&gt;




&lt;h2&gt;
  
  
  The v2.0: when it became a real tool
&lt;/h2&gt;

&lt;p&gt;The turning point came with a question: &lt;em&gt;"what if I had the data already collected before calling Claude?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I wanted a real-time platform and to dive headfirst into vibe coding.&lt;/p&gt;

&lt;p&gt;The slash commands were querying Prometheus and kubectl in real time on every execution. That worked, but it was slow and stateless — no history, no trends, no real FinOps.&lt;/p&gt;

&lt;p&gt;The solution: a &lt;strong&gt;Go agent&lt;/strong&gt; running continuously in the background.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Collects every 10 seconds&lt;/span&gt;
&lt;span class="c"&gt;// Persists to PostgreSQL&lt;/span&gt;
&lt;span class="c"&gt;// Exposes REST API for Claude to consume&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using native &lt;code&gt;client-go&lt;/code&gt;, the agent collects CPU, memory and waste metrics per pod, persists them in batch transactions to PostgreSQL and exposes three endpoints:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Endpoint&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GET /api/summary&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cluster state: nodes, pods, CPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GET /api/metrics&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Per-pod metrics: CPU usage, waste&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GET /api/history&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cost history for the last 30 min&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
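&lt;p&gt;To make the batch-persistence pattern concrete, here is a hedged Python sketch. The agent itself is Go; SQLite stands in for PostgreSQL here, and the pod names and values are fabricated:&lt;/p&gt;

```python
import sqlite3
import time

# Hedged sketch of the agent's batch-persistence pattern.
# The real agent is Go + PostgreSQL; SQLite stands in here,
# and the sample pods and values are fabricated.

def persist_batch(conn, samples):
    """Write one 10-second collection cycle in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO pod_metrics (ts, pod, cpu_millicores, waste_pct) "
            "VALUES (?, ?, ?, ?)",
            samples,
        )

def summary(conn):
    """Roughly the aggregation GET /api/summary would return."""
    pods, avg_cpu = conn.execute(
        "SELECT COUNT(DISTINCT pod), AVG(cpu_millicores) FROM pod_metrics"
    ).fetchone()
    return {"pods": pods, "avg_cpu_millicores": avg_cpu}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pod_metrics (ts REAL, pod TEXT, cpu_millicores REAL, waste_pct REAL)"
)

now = time.time()
persist_batch(conn, [
    (now, "grafana-abc", 120.0, 40.0),
    (now, "prometheus-0", 350.0, 12.5),
])
print(summary(conn))  # {'pods': 2, 'avg_cpu_millicores': 235.0}
```

&lt;p&gt;One transaction per cycle keeps the write path cheap even with many pods; the history endpoint is then just a time-bounded query over the same table.&lt;/p&gt;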

&lt;p&gt;Claude Code started consuming these endpoints via &lt;code&gt;/incident&lt;/code&gt; instead of querying directly. Real layer separation: Go collects, Claude analyzes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5satzpeodhgftq9i4a4q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5satzpeodhgftq9i4a4q.png" alt="Screenshot Sentinel Dashboard" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The dashboard brings together a Node Health Map honeycomb, a Pod Distribution donut chart, Waste Intelligence with per-pod savings opportunities, and a Financial ROI Timeline plotting Budget vs. Actual over the last 30 minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  The harness: treating LLM output as untrusted input
&lt;/h2&gt;

&lt;p&gt;Between LinkedIn posts, my brother sent me an interesting article about Harness Engineering: in essence, building security and reliability around the model, with proper infrastructure, security reviews, constant ReAct (Reason, Act) loops, and feedback.&lt;/p&gt;

&lt;p&gt;Article: &lt;a href="https://martinfowler.com/articles/harness-engineering.html" rel="noopener noreferrer"&gt;Harness Engineering — Martin Fowler&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One key architectural decision put the concept into practice: &lt;code&gt;harness/validador_saida.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Every report generated by Claude Code goes through a gatekeeper before being written to disk. The validator blocks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Blocked examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Destructive commands&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;rm -rf&lt;/code&gt;, &lt;code&gt;kubectl delete&lt;/code&gt;, &lt;code&gt;DROP TABLE&lt;/code&gt;, fork bomb&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Required structure&lt;/td&gt;
&lt;td&gt;Reports without &lt;code&gt;## Executive Summary&lt;/code&gt; are rejected&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Minimum size&lt;/td&gt;
&lt;td&gt;Content under 100 chars is rejected&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;returncode&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Validator blocked the write: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the validator rejects it, the file is not created. Period.&lt;/p&gt;

&lt;p&gt;This isn't paranoia — it's production architecture. Any system that uses an LLM to generate actions on infrastructure needs a gatekeeper.&lt;/p&gt;
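&lt;p&gt;The rules above can be condensed into a small gatekeeper function. This is a hedged sketch, not the actual &lt;code&gt;validador_saida.py&lt;/code&gt;; the patterns and limits simply mirror the table:&lt;/p&gt;

```python
import re

# Hedged sketch of the report gatekeeper; the real validador_saida.py
# may implement these rules differently.
DESTRUCTIVE_PATTERNS = [
    r"\brm\s+-rf\b",
    r"\bkubectl\s+delete\b",
    r"\bDROP\s+TABLE\b",
    r":\(\)\s*{\s*:\|:&\s*};:",  # classic bash fork bomb
]

def validate_report(content: str) -> list[str]:
    """Return a list of violations; an empty list means the report may be written."""
    errors = []
    for pattern in DESTRUCTIVE_PATTERNS:
        if re.search(pattern, content, re.IGNORECASE):
            errors.append(f"destructive command matched: {pattern}")
    if "## Executive Summary" not in content:
        errors.append("missing required '## Executive Summary' section")
    if len(content) < 100:
        errors.append("content under 100 chars")
    return errors

good = "## Executive Summary\n" + "Cluster healthy, no action required. " * 3
assert validate_report(good) == []
assert validate_report("run rm -rf / now") != []
```

&lt;p&gt;Running this before every write means a prompt-injected &lt;code&gt;kubectl delete&lt;/code&gt; never reaches a runbook that someone might copy-paste at 3 a.m.&lt;/p&gt;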




&lt;h2&gt;
  
  
  What vibe coding means in practice
&lt;/h2&gt;

&lt;p&gt;Vibe coding isn't "letting AI do everything". It's a specific way of working:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You define the what. The AI decides the how.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At each step, I knew the result I wanted and acted as the SRE, the decision maker:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"I want startup to check services and spin up missing port-forwards"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"I want namespace-grouped analysis with cross-namespace root cause correlation"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"I want a gatekeeper that blocks destructive commands before writing"&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent implemented. I reviewed, questioned, redirected.&lt;/p&gt;

&lt;p&gt;What surprised me: at no point did I write a single line of the Go agent, the dashboard, or the harness. But every architecture decision was mine. The AI was my senior software engineer.&lt;/p&gt;




&lt;h2&gt;
  
  
  What became clear about Claude Code
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quality and consistency of generated code, especially in Go&lt;/li&gt;
&lt;li&gt;Security reasoning — correct HTTP headers, sanitization, file permissions&lt;/li&gt;
&lt;li&gt;Ability to maintain project context via &lt;code&gt;CLAUDE.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Parallel sub-agents that genuinely execute in parallel&lt;/li&gt;
&lt;li&gt;Optimized UI generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smaller context window than competitors — large projects require frequent &lt;code&gt;/compact&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Higher cost per token — relevant in long sessions&lt;/li&gt;
&lt;li&gt;The token instability of that week showed that total dependence on a single provider is a real risk&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Journey timeline
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Milestone&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mar 8, 2026&lt;/td&gt;
&lt;td&gt;Claude Code in Action certificate — Anthropic/Skilljar&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mar 8, 2026&lt;/td&gt;
&lt;td&gt;v1.0: slash commands + Prometheus + K8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mar 9, 2026&lt;/td&gt;
&lt;td&gt;v1.1: automatic startup + multiple namespaces&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Week 2&lt;/td&gt;
&lt;td&gt;Temporary migration to Gemini — global Anthropic token restriction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Week 2-3&lt;/td&gt;
&lt;td&gt;v2.0: Go agent + PostgreSQL + dashboard + harness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Week 3&lt;/td&gt;
&lt;td&gt;Return to Claude Code + renamed to Sentinel&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Where it stands today
&lt;/h2&gt;

&lt;p&gt;The project is open source, Apache 2.0:&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://github.com/boccato85/Sentinel" rel="noopener noreferrer"&gt;github.com/boccato85/Sentinel&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Full stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minikube / Kubernetes v1.35.1&lt;/li&gt;
&lt;li&gt;Go agent with &lt;code&gt;client-go&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;PostgreSQL&lt;/li&gt;
&lt;li&gt;Claude Code (OpenCode + Sonnet 4.6)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you work with SRE, CloudOps or FinOps and want to explore Claude Code in practice, this is a real starting point — not a hello world.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This project is part of a personal track: CKA → Claude Code → MLOps. Follow along here on dev.to.&lt;/em&gt;&lt;/p&gt;







&lt;p&gt;This article isn't a tutorial. It's a diary :)&lt;/p&gt;

&lt;p&gt;I'm taking the CKA (Certified Kubernetes Administrator) certification course on KodeKloud, and I wanted to step into the "agentic" AI world with Claude. I ended up with an observability and FinOps platform for Kubernetes running Go, PostgreSQL, and a real-time dashboard, without having planned any of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Day 1: the certificate and the $5 in credit
&lt;/h2&gt;

&lt;p&gt;I decided to take Anthropic's official (Skilljar) course: &lt;strong&gt;Claude Code in Action&lt;/strong&gt;. Free, with a certificate, and it took me a few hours to complete.&lt;/p&gt;

&lt;p&gt;The course asks for an Anthropic Platform API key to run the examples. I created the key, brought up the local server, and got:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;error: Your credit balance is too low to access the Anthropic API.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;"Mas o curso é gratuito..."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not quite. The course is free; the API calls consume real credits. They are two separate things, and the platform doesn't make that clear enough. I bought $5 in credit, cleared cache and sessions, recreated the key, and it worked.&lt;/p&gt;

&lt;p&gt;Lesson 1: capitalism always wins. But $5 goes a long way with Haiku, the "entry-level" model, so to speak.&lt;/p&gt;

&lt;p&gt;With the certificate in hand the same day, I moved on to practice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Still Day 1: building v1.0
&lt;/h2&gt;

&lt;p&gt;The idea was simple: a Claude Code agent that would monitor a Kubernetes cluster and generate runbooks automatically. No manual code. Only directives.&lt;/p&gt;

&lt;p&gt;I run Fedora Linux 43 KDE on an Acer Predator PHN16-72 notebook.&lt;/p&gt;

&lt;p&gt;I installed Minikube and deployed the &lt;strong&gt;kube-prometheus-stack&lt;/strong&gt; via Helm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;prometheus-stack prometheus-community/kube-prometheus-stack &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; monitoring &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; grafana.adminPassword&lt;span class="o"&gt;=&lt;/span&gt;admin123
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Six pods coming up at once. Grafana with ready-made dashboards. Prometheus collecting real metrics from Minikube.&lt;/p&gt;

&lt;p&gt;Then I created the base structure with slash commands in Markdown inside &lt;code&gt;.claude/commands/&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Comando&lt;/th&gt;
&lt;th&gt;Função&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/sentinel&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Orquestrador principal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/collect-metrics&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sub-agent A — consulta Prometheus via PromQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/analyze-pods&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sub-agent B — verifica pods via kubectl&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/correlate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sub-agent C — classifica severidade&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
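&lt;p&gt;Each slash command is just a Markdown prompt file. As an illustration (a hypothetical sketch, not the project's actual file), &lt;code&gt;.claude/commands/collect-metrics.md&lt;/code&gt; might contain:&lt;/p&gt;

```markdown
Query Prometheus at http://localhost:9090 using PromQL.
Collect node CPU %, memory %, disk usage %, and Warning events from the last hour.
Compare each value against the thresholds defined in CLAUDE.md and flag violations.
Return the results as a Markdown table, one row per metric.
```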

&lt;p&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt; became the agent's operational memory: endpoints, thresholds, namespaces, runbook template.&lt;/p&gt;

&lt;p&gt;I ran &lt;code&gt;/sentinel&lt;/code&gt; for the first time and it generated this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Severidade: WARNING
CPU: 11.4% ✅ | Memória: 45.1% ✅ | Disco: 17.65% ✅
64 Warning events identificados como residuais de restart anterior do nó
storage-provisioner: BackOff recente — requer monitoramento
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent separated noise from signal on its own. It identified that the 64 Warning events were residue from a Minikube reboot, not real anomalies. That wasn't in the prompt. It was the model's reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;v1.0 on GitHub. Same day as the certificate.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Day 2: v1.1 and automatic startup
&lt;/h2&gt;

&lt;p&gt;v1.0's biggest friction was operational: every time I opened Claude Code, I had to remember to bring up the three port-forwards manually before running any command.&lt;/p&gt;

&lt;p&gt;I created &lt;code&gt;/startup&lt;/code&gt;, an agent that checks whether Prometheus, Grafana and AlertManager are reachable and brings up only the missing port-forwards, in the background, with up to 10 retries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;╔══════════════════════════════════════════╗
║     Sentinel — Startup                   ║
╚══════════════════════════════════════════╝
 Prometheus    (localhost:9090)  →  ✅ STARTED
 Grafana       (localhost:3000)  →  ✅ STARTED
 AlertManager  (localhost:9093)  →  ✅ STARTED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
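&lt;p&gt;The check-then-retry loop behind &lt;code&gt;/startup&lt;/code&gt; can be sketched in Python. This is a hedged illustration with made-up helper names, not Sentinel's actual code:&lt;/p&gt;

```python
import socket

# Hedged sketch of the /startup logic; helper names are illustrative,
# not Sentinel's actual code.
SERVICES = {"Prometheus": 9090, "Grafana": 3000, "AlertManager": 9093}

def port_open(port, host="127.0.0.1", timeout=0.2):
    """True if something is already listening on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def ensure_up(check, start, retries=10):
    """Start the service only if its check fails, retrying up to `retries` times."""
    for _ in range(retries):
        if check():
            return True
        start()  # e.g. spawn `kubectl port-forward` in the background
    return check()

# Wiring (not executed here): for each service, check the port and
# only spawn a port-forward when it is missing.
# for name, port in SERVICES.items():
#     ensure_up(lambda p=port: port_open(p), lambda: spawn_port_forward(name))
```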



&lt;p&gt;I asked Claude to expand &lt;code&gt;/analyze-pods&lt;/code&gt; to monitor multiple namespaces in parallel (&lt;code&gt;default&lt;/code&gt;, &lt;code&gt;monitoring&lt;/code&gt; and &lt;code&gt;kube-system&lt;/code&gt;), with results grouped by namespace and cross-namespace root cause correlation.&lt;/p&gt;

&lt;p&gt;The result looked like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Namespace&lt;/th&gt;
&lt;th&gt;Total&lt;/th&gt;
&lt;th&gt;Unhealthy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;default&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;monitoring&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;0 (all Running)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kube-system&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;0 (all Running)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And the agent also flagged &lt;code&gt;storage-provisioner&lt;/code&gt; with 21 restarts versus an average of 8 for the other pods, singling out the genuinely anomalous component without any explicit instruction about it.&lt;/p&gt;
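&lt;p&gt;That flag is, at heart, simple outlier detection over restart counts. A hedged sketch of the idea (the 1.5x-mean threshold and the pod names are my illustration, not the agent's actual heuristic):&lt;/p&gt;

```python
# Hedged sketch: flag pods whose restart count sits far above the fleet mean.
# The 1.5x-mean threshold and pod names are illustrative, not Sentinel's heuristic.
def flag_anomalies(restarts, factor=1.5):
    mean = sum(restarts.values()) / len(restarts)
    return sorted(pod for pod, count in restarts.items() if count > factor * mean)

pods = {"coredns": 8, "kube-proxy": 7, "etcd": 8, "storage-provisioner": 21}
print(flag_anomalies(pods))  # ['storage-provisioner']
```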

&lt;p&gt;&lt;strong&gt;v1.1 on GitHub. Day 2.&lt;/strong&gt;&lt;/p&gt;





</description>
      <category>claudecode</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>finops</category>
    </item>
  </channel>
</rss>
