<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Carlos Mario Mora Restrepo</title>
    <description>The latest articles on DEV Community by Carlos Mario Mora Restrepo (@carlosmoradev).</description>
    <link>https://dev.to/carlosmoradev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3868105%2F4b24f81c-7f01-4759-9dd6-1df452c62c85.png</url>
      <title>DEV Community: Carlos Mario Mora Restrepo</title>
      <link>https://dev.to/carlosmoradev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/carlosmoradev"/>
    <language>en</language>
    <item>
      <title>AWS Cost Explorer Just Got Conversational — And That Changes the Workflow</title>
      <dc:creator>Carlos Mario Mora Restrepo</dc:creator>
      <pubDate>Thu, 09 Apr 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/carlosmoradev/aws-cost-explorer-just-got-conversational-and-that-changes-the-workflow-59f6</link>
      <guid>https://dev.to/carlosmoradev/aws-cost-explorer-just-got-conversational-and-that-changes-the-workflow-59f6</guid>
      <description>&lt;p&gt;AWS just closed the last friction gap in cost analysis.&lt;/p&gt;

&lt;p&gt;Natural language queries in Cost Explorer — powered by Amazon Q — launched this week. You ask, Cost Explorer updates its charts in real time. No filters. No manual groupings. No switching to a separate Q Developer chat.&lt;/p&gt;

&lt;p&gt;“How much did we spend on RDS last month compared to the previous one?” → instant answer + automatic visualization update.&lt;/p&gt;
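&lt;p&gt;For contrast, here is a sketch of the manual equivalent, the request an engineer would previously have assembled by hand against the Cost Explorer API (boto3 get_cost_and_usage; the dates and the service-dimension value are illustrative):&lt;/p&gt;

```python
# Sketch of the manual path the conversational flow replaces: building a
# Cost Explorer API request by hand. Dates are illustrative placeholders.

def rds_month_over_month_request(start, end):
    """Build GetCostAndUsage parameters comparing RDS spend across months."""
    return {
        "TimePeriod": {"Start": start, "End": end},
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "Filter": {
            "Dimensions": {
                "Key": "SERVICE",
                "Values": ["Amazon Relational Database Service"],
            }
        },
    }

params = rds_month_over_month_request("2026-02-01", "2026-04-01")
# Would then be passed as: boto3.client("ce").get_cost_and_usage(**params)
```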

&lt;h2&gt;
  
  
  The problem with cost tooling has always been friction
&lt;/h2&gt;

&lt;p&gt;As an SRE managing multi-cloud infrastructure, I’ve spent years building cost alert layers manually: tagging strategies, Budget alarms, custom Lambda parsers for anomaly detection. Each layer added complexity. Each handoff between tools added friction.&lt;/p&gt;

&lt;p&gt;The tooling was always capable. The problem was the interface — engineers had to translate between what they wanted to know and what the tool could show them. That translation cost was real, and it was killing adoption.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s actually new here
&lt;/h2&gt;

&lt;p&gt;Amazon Q has had Cost Explorer integration since late 2024. What changed isn’t the underlying capability — it’s the interface.&lt;/p&gt;

&lt;p&gt;The answer and the visualization now share the same surface, updating together and maintaining full conversation context across follow-up questions. You can ask a follow-up without resetting the query. The conversation persists.&lt;/p&gt;

&lt;p&gt;That sounds small. It isn’t. That’s the friction that was killing adoption.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for cost governance
&lt;/h2&gt;

&lt;p&gt;My &lt;a href="https://dev.to/carlosmoradev/multi-layer-cost-controls-for-cloud-data-platforms-24ac"&gt;first blog post on this site&lt;/a&gt; was about building a 4-layer cost defense strategy for cloud data platforms. At the time, building the alert pipeline was a manual exercise in connecting layers: resource monitors, warehouse sizing, connection pooling, and user education.&lt;/p&gt;

&lt;p&gt;Today AWS gives you natural language on top of those same layers. The layers still matter — you still need tagging discipline, budget boundaries, and anomaly detection. But the friction in analyzing and interrogating those layers just dropped dramatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  The next unlock
&lt;/h2&gt;

&lt;p&gt;The question I keep coming back to: if cost analysis is now conversational, what’s next?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proactive anomaly surfacing before the spike hits?&lt;/li&gt;
&lt;li&gt;Rightsizing recommendations that execute autonomously?&lt;/li&gt;
&lt;li&gt;Cost SLOs with automated enforcement?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The distance between “cost alert” and “autonomously governed cost” is closing fast. And for SREs who’ve been hand-building that infrastructure for years — that’s worth paying attention to.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you tried the natural language queries in Cost Explorer yet? Curious how teams are integrating this into their FinOps workflows.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>costoptimization</category>
      <category>sre</category>
    </item>
    <item>
      <title>From ticket to PR with agents: how to use Claude to automate platform changes without breaking SLOs</title>
      <dc:creator>Carlos Mario Mora Restrepo</dc:creator>
      <pubDate>Wed, 08 Apr 2026 15:42:10 +0000</pubDate>
      <link>https://dev.to/carlosmoradev/from-ticket-to-pr-with-agents-how-to-use-claude-to-automate-platform-changes-without-breaking-slos-48lg</link>
      <guid>https://dev.to/carlosmoradev/from-ticket-to-pr-with-agents-how-to-use-claude-to-automate-platform-changes-without-breaking-slos-48lg</guid>
      <description>&lt;p&gt;In Platform Engineering and SRE, the hardest part of change is rarely writing the change itself. The hard part is everything around it: understanding the intent behind a ticket or incident, locating the right context, identifying the systems involved, deciding what should change, validating the blast radius, documenting rollback, and making the result legible enough for someone else to review with confidence.&lt;/p&gt;

&lt;p&gt;That is why I think the real promise of Claude is not code generation. It is the ability to help close the loop between operational intent and reviewable execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  The translation problem
&lt;/h2&gt;

&lt;p&gt;A ticket, incident, or operational task expresses intent. But between that intent and a merged change, there is usually a long chain of manual translation. Engineers need to gather context from runbooks, infrastructure repositories, dashboards, previous incidents, documentation, and platform conventions. They need to decide whether the task requires a configuration tweak, an IaC change, a runbook update, or some combination of all three. They need to make the work explicit enough to review and safe enough to deploy.&lt;/p&gt;

&lt;p&gt;That translation layer is where Claude becomes interesting.&lt;/p&gt;

&lt;p&gt;Anthropic describes effective agents as systems that use tools dynamically, adapt based on feedback from the environment, and operate with clear stopping conditions and human oversight. That is a much more useful framing than treating Claude as a smarter autocomplete layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pattern
&lt;/h2&gt;

&lt;p&gt;Applied to Platform Engineering, the workflow looks something like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A ticket, incident, or task becomes the initial statement of intent.&lt;/li&gt;
&lt;li&gt;Claude gathers context from the relevant repos, documentation, and operational systems.&lt;/li&gt;
&lt;li&gt;It uses tools to inspect files, compare configurations, reason about likely changes, and validate assumptions.&lt;/li&gt;
&lt;li&gt;It produces a proposed change in a form that the team can actually govern — ideally as a pull request.&lt;/li&gt;
&lt;li&gt;Humans review the result, enforce policy, and decide whether it should ship.&lt;/li&gt;
&lt;/ol&gt;
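&lt;p&gt;The five steps above can be sketched as code. Everything below is stubbed and hypothetical (the helper names, the change format, the reviewer handle); the point is the shape of the loop, not a real Claude or GitHub integration:&lt;/p&gt;

```python
# Minimal sketch of the ticket-to-PR loop. Every helper is a hypothetical
# stand-in, not a real API call; intent goes in, a reviewable change set
# comes out, and the human gate is the last step.

def gather_context(ticket):
    # Stub: in practice, pull runbooks, IaC repos, and prior incidents.
    return {"ticket": ticket, "repos": ["infra-live"], "runbooks": ["scaling.md"]}

def propose_change(context):
    # Stub: in practice, the agent inspects files and drafts a diff.
    return {
        "diff": "autoscaling max_replicas: 10 -> 15",
        "rationale": "CPU saturation incident linked in " + context["ticket"],
        "rollback": "revert this commit; previous max_replicas was 10",
        "validation": ["terraform plan", "staging soak 30m"],
    }

def open_pull_request(change, reviewers):
    # The PR is the unit of governance, so it must carry everything a
    # reviewer needs: rationale, validation steps, rollback path.
    required = ("rationale", "rollback", "validation")
    assert all(k in change for k in required), "unreviewable change rejected"
    return {"change": change, "reviewers": reviewers, "status": "awaiting_review"}

pr = open_pull_request(propose_change(gather_context("INC-4512")), ["sre-oncall"])
```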

&lt;h2&gt;
  
  
  The pull request as the unit of governance
&lt;/h2&gt;

&lt;p&gt;The pull request is the key unit here.&lt;/p&gt;

&lt;p&gt;The real output of an agent in this workflow should not be a blob of generated code. It should be a reviewable change set with rationale, scope, validation steps, and rollback guidance. Once the output becomes a PR rather than a prompt response, the conversation shifts from "Can the model write this?" to "Can the organization safely absorb and govern this change?"&lt;/p&gt;

&lt;p&gt;That distinction matters because SRE is not optimized for novelty. It is optimized for reliability. A change that is fast but opaque is often worse than a change that is slower but auditable. If Claude is going to be useful in platform workflows, it has to increase clarity, not just speed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why SLOs matter in the title
&lt;/h2&gt;

&lt;p&gt;This is also why the phrase "without breaking SLOs" matters so much. It prevents the conversation from drifting into generic AI optimism. In a platform context, any serious use of agents has to be evaluated against reliability outcomes. Faster workflows are not automatically better workflows if they increase incident risk, reduce operator understanding, or blur accountability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Guardrails are not obstacles — they are the design
&lt;/h2&gt;

&lt;p&gt;A credible workflow therefore needs guardrails. At minimum, that means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear tool boundaries and scoped permissions&lt;/li&gt;
&lt;li&gt;Strong context about the system being changed&lt;/li&gt;
&lt;li&gt;Validation before merge&lt;/li&gt;
&lt;li&gt;Human review for sensitive or high-impact changes&lt;/li&gt;
&lt;li&gt;Explicit rollback paths&lt;/li&gt;
&lt;li&gt;Traceability from original intent to final diff&lt;/li&gt;
&lt;/ul&gt;
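&lt;p&gt;As a sketch, those minimum guardrails can be expressed as an explicit policy check. The rule names mirror the list above; the change format, the paths, and the sensitivity rules are all hypothetical:&lt;/p&gt;

```python
# Guardrails as code: return the gates a proposed change has not yet passed.
# Paths, thresholds, and the change format are hypothetical examples.

SENSITIVE_PATHS = ("prod/", "iam/", "dns/")

def check_guardrails(change):
    """Return the list of gates a proposed change must still clear."""
    gates = []
    if not change.get("rollback"):
        gates.append("missing explicit rollback path")
    if not change.get("source_ticket"):
        gates.append("no traceability to original intent")
    touches_sensitive = any(
        p.startswith(SENSITIVE_PATHS) for p in change.get("paths", [])
    )
    if touches_sensitive:
        gates.append("human review required (sensitive path)")
    return gates

change = {
    "paths": ["prod/eks/autoscaler.tf"],
    "rollback": "git revert",
    "source_ticket": "INC-4512",
}
gates = check_guardrails(change)
```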

&lt;p&gt;This guardrail-heavy framing is not anti-agent. It is what makes agents useful in production environments. Anthropic's own materials emphasize that agents work best when they can interact with the environment, test their assumptions, and operate inside structured limits rather than open-ended autonomy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real opportunity
&lt;/h2&gt;

&lt;p&gt;That is why I think the most interesting future for Claude in Platform Engineering is not "AI writes infrastructure code." It is "AI helps translate operational work into changes that humans can evaluate, approve, and ship with confidence."&lt;/p&gt;

&lt;p&gt;Seen this way, Claude is not just a writing assistant or coding assistant. It starts to look more like an operational interface — a system that sits between intent and execution, helping teams move from ticket to PR with more context, better traceability, and less manual translation overhead.&lt;/p&gt;

&lt;p&gt;Not replacing engineers.&lt;/p&gt;

&lt;p&gt;Not removing judgment.&lt;/p&gt;

&lt;p&gt;But reducing the distance between work that needs to happen and changes that are safe enough to review, govern, and deploy.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How are you thinking about AI agents in your platform workflows? Are you already using them for operational tasks, or still evaluating the risk?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>platformengineering</category>
      <category>sre</category>
      <category>aiagents</category>
      <category>automation</category>
    </item>
    <item>
      <title>AI Observability: the problem nobody is solving well in 2026</title>
      <dc:creator>Carlos Mario Mora Restrepo</dc:creator>
      <pubDate>Fri, 03 Apr 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/carlosmoradev/ai-observability-the-problem-nobody-is-solving-well-in-2026-5959</link>
      <guid>https://dev.to/carlosmoradev/ai-observability-the-problem-nobody-is-solving-well-in-2026-5959</guid>
      <description>&lt;p&gt;We’ve spent years building AIOps — using AI to observe infrastructure. But there’s a more urgent problem taking shape: who observes the AI itself?&lt;/p&gt;

&lt;p&gt;Monitoring hallucinations, prompt drift, MCP call latency, and inference costs in production is the new frontier of modern SRE. And almost nobody has a complete stack for it.&lt;/p&gt;

&lt;h2 id="the-monitoring-gap-is-structural-not-tactical"&gt;The monitoring gap is structural, not tactical&lt;/h2&gt;

&lt;p&gt;Your current observability stack was built for deterministic systems. A service either returns 200 or it doesn’t. Latency is measurable. Error rates are countable. SLOs make sense because “correct behavior” is definable.&lt;/p&gt;

&lt;p&gt;AI systems break all of these assumptions.&lt;/p&gt;

&lt;p&gt;The failure mode isn’t a 500 error — it’s a confident hallucination delivered with perfect latency and a 200 status code. Your dashboards are green. Your AI is producing garbage. A Fortune 100 bank misrouted 18% of critical cases without triggering a single alert.&lt;/p&gt;

&lt;p&gt;This isn’t a tooling gap you can close by adding a plugin to your existing stack. It’s a paradigm problem.&lt;/p&gt;

&lt;h2 id="the-current-landscape-15-tools-zero-consensus"&gt;The current landscape: 15+ tools, zero consensus&lt;/h2&gt;

&lt;p&gt;The AI observability market hit $510M in 2024, growing at 32% annually. That sounds like a mature space. It isn’t.&lt;/p&gt;

&lt;p&gt;The landscape splits into two camps that don’t talk to each other:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI-native platforms&lt;/strong&gt; (Langfuse, LangSmith, Arize Phoenix, Helicone, Braintrust) understand prompts, tokens, and semantic evaluation — but have no context about your infrastructure, your SLOs, or your cost centers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traditional APM vendors&lt;/strong&gt; (Datadog, New Relic, Dynatrace, Grafana) understand infrastructure deeply — but treat AI as just another microservice, missing everything that makes AI systems different.&lt;/p&gt;

&lt;p&gt;OpenTelemetry’s GenAI Semantic Conventions are the closest thing to a unifying standard — still experimental as of Q1 2026, not GA. Every major vendor has adopted them as a wire format while building proprietary analytics on top. The instrumentation layer is converging. Everything above it is fragmented.&lt;/p&gt;

&lt;h2 id="four-gaps-practitioners-cant-close"&gt;Four gaps practitioners can’t close&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Inference cost is invisible at the decision layer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AI inference cost is generated where routing decisions happen — model selection, retry logic, token budgets, context window management. Your observability monitors the infrastructure layer. These are different layers, and the gap between them is expensive.&lt;/p&gt;

&lt;p&gt;A typical pattern: a poorly optimized prompt costs more per day than the entire Kubernetes cluster running the application. One team discovered they were paying an LLM to be reminded of its job — sending the same system instructions hundreds of times daily. Reasoning models like o3 add internal “thinking tokens” that inflate consumption silently. Output tokens cost 3–10x more than input tokens.&lt;/p&gt;

&lt;p&gt;What looks like $500/month in a pilot becomes $15,000 at production scale. Before accounting for growth.&lt;/p&gt;
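&lt;p&gt;A back-of-envelope model makes the asymmetry concrete. The per-token prices below are illustrative placeholders, not any provider's real rates:&lt;/p&gt;

```python
# Back-of-envelope inference cost model. Prices are illustrative, not any
# provider's real rates; the point is the input/output asymmetry and how
# linearly a pilot number scales with traffic.

PRICE_PER_M_INPUT = 3.00    # USD per 1M input tokens (illustrative)
PRICE_PER_M_OUTPUT = 15.00  # USD per 1M output tokens (5x input, inside the 3-10x range)

def monthly_cost(requests_per_day, input_tokens, output_tokens):
    per_request = (
        input_tokens / 1e6 * PRICE_PER_M_INPUT
        + output_tokens / 1e6 * PRICE_PER_M_OUTPUT
    )
    return requests_per_day * 30 * per_request

# Same prompt shape, 30x the traffic: roughly the $500-to-$15,000 jump.
pilot = monthly_cost(requests_per_day=500, input_tokens=4000, output_tokens=1200)
scale = monthly_cost(requests_per_day=15000, input_tokens=4000, output_tokens=1200)
```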

&lt;p&gt;&lt;strong&gt;2. MCP traces break at the boundary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;97 million monthly SDK downloads. 5,800+ MCP servers in the ecosystem. And a fundamental tracing problem: when a user request flows from Agent → LLM Provider → MCP Server → External Tool, the trace breaks at the MCP boundary. Two disconnected traces. No correlation. No end-to-end visibility.&lt;/p&gt;

&lt;p&gt;Sentry shipped the first dedicated MCP monitoring tool in mid-2025 — after running their own MCP server at 50 million requests per month and discovering random user timeouts with no results and no errors. No way to even know how many users were affected.&lt;/p&gt;

&lt;p&gt;OpenTelemetry’s MCP semantic conventions remain in draft.&lt;/p&gt;
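&lt;p&gt;Until those conventions land, one stopgap teams use is manual context propagation: carry a W3C traceparent header in the MCP request metadata so the server-side span can join the caller's trace instead of starting a fresh one. The header layout below follows the Trace Context spec; the metadata key is a team convention, not part of MCP itself:&lt;/p&gt;

```python
# Manual trace propagation across the MCP boundary. The traceparent layout
# follows the W3C Trace Context spec; carrying it in MCP request metadata
# is a team convention, not anything MCP standardizes today.
import secrets

def make_traceparent(trace_id=None, span_id=None, sampled=True):
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 16 hex chars
    flags = "01" if sampled else "00"
    return "00-{}-{}-{}".format(trace_id, span_id, flags)

def parse_traceparent(header):
    version, trace_id, parent_span_id, flags = header.split("-")
    return {"trace_id": trace_id, "parent_span_id": parent_span_id, "flags": flags}

# Client side: attach the header to the MCP call metadata.
request_meta = {"traceparent": make_traceparent()}
# Server side: recover the caller's trace_id and parent the local span to it.
ctx = parse_traceparent(request_meta["traceparent"])
```

This does not replace real instrumentation, but it restores the one thing the broken traces lose: correlation.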

&lt;p&gt;&lt;strong&gt;3. Silent semantic failures don’t trigger alerts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A single user request can trigger 15+ LLM calls across embedding generation, vector retrieval, context assembly, reasoning steps, and response synthesis. Every traditional metric can look healthy while the output is meaningless.&lt;/p&gt;

&lt;p&gt;44% of organizations still rely on manual methods to monitor AI agent interactions. The current state-of-the-art for detecting semantic failures in production is largely “a human reads logs and guesses.” Most teams discover problems through downstream business metrics — weeks after the damage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. SLOs don’t exist for non-deterministic systems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the open question practitioners keep returning to. Traditional SRE practice assumes you can define expected behavior, measure deviation, and set error budgets. When the same input can legitimately produce different outputs, when “correct” requires semantic judgment, and when model providers silently update weights underneath you — the entire SLI/SLO framework needs rethinking.&lt;/p&gt;

&lt;p&gt;Nobody has solved this. The conversation is still at the “how do we even frame the problem” stage.&lt;/p&gt;
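&lt;p&gt;One way to even start framing it: sample production responses, score each with an evaluator, and treat the pass rate as a semantic SLI. A minimal sketch, with the evaluator stubbed (in practice an LLM judge or rubric) and the SLO target chosen arbitrarily:&lt;/p&gt;

```python
# Sketch of a semantic SLI: fraction of sampled responses an evaluator judges
# acceptable, measured against a semantic SLO target. The judge is a stub;
# the target is an arbitrary example. The budget math is standard SRE arithmetic.

def judge(response):
    # Stub evaluator: real systems would call an LLM judge or a rubric.
    return "refund policy" in response

SLO_TARGET = 0.95  # fraction of sampled responses that must be acceptable

def semantic_sli(sampled_responses):
    passed = sum(1 for r in sampled_responses if judge(r))
    return passed / len(sampled_responses)

def error_budget_remaining(sli):
    budget = 1.0 - SLO_TARGET          # allowed failure fraction
    burned = 1.0 - sli                 # observed failure fraction
    return (budget - burned) / budget  # 1.0 means untouched; negative means blown

samples = ["our refund policy is 30 days"] * 18 + ["confident nonsense"] * 2
sli = semantic_sli(samples)              # 0.9
remaining = error_budget_remaining(sli)  # negative: budget blown
```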

&lt;h2 id="the-cost-paradox"&gt;The cost paradox&lt;/h2&gt;

&lt;p&gt;Adding AI monitoring to Datadog increases observability bills by 40–200%. A typical RAG pipeline generates 10–50x more telemetry than an equivalent API call. LangSmith customers routinely sample down to 0.1% of production traffic to control costs.&lt;/p&gt;

&lt;p&gt;You end up paying significantly more to observe significantly less.&lt;/p&gt;

&lt;p&gt;Gartner predicts that more than 40% of agentic AI projects will be canceled by 2027. The Dynatrace 2026 Pulse of Agentic AI survey found that 51% of engineering leaders cite limited visibility into agent behavior as their top technical blocker.&lt;/p&gt;

&lt;h2 id="whats-actually-converging"&gt;What’s actually converging&lt;/h2&gt;

&lt;p&gt;OpenTelemetry is winning the instrumentation war. The GenAI SIG has defined semantic conventions for LLM spans, agent spans, tool execution, token metrics, and evaluation events. Every major vendor accepts OTel GenAI spans.&lt;/p&gt;

&lt;p&gt;That’s the one genuine convergence story. Everything above the wire format remains fragmented — comparable to cloud monitoring circa 2010–2012, except that OpenTelemetry’s existence may make consolidation arrive faster this time.&lt;/p&gt;

&lt;h2 id="the-practitioner-reality"&gt;The practitioner reality&lt;/h2&gt;

&lt;p&gt;This is the infrastructure monitoring crisis of 2010 all over again. The stakes are higher. The systems are non-deterministic. The failure modes are semantic rather than structural.&lt;/p&gt;

&lt;p&gt;If you’re an SRE or Platform Engineer who’s been handed responsibility for AI systems without the tools to properly operate them — that’s the actual state of the industry, not a gap in your skills or your team’s preparation.&lt;/p&gt;

&lt;p&gt;The tooling will converge. OpenTelemetry will help. The ecosystem is moving.&lt;/p&gt;

&lt;p&gt;But right now, in early 2026, most teams are flying partially blind — and the first step is naming the problem clearly enough to start solving it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Data points: Dynatrace 2026 Pulse of Agentic AI (919 leaders), KubeCon Atlanta 2025, OneUptime AI Observability Cost Analysis, Sentry MCP Server Monitoring launch, Gartner 2025–2027 predictions, Pydantic AI observability pricing analysis.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>observability</category>
      <category>aiagents</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>Multi-Layer Cost Controls for Cloud Data Platforms</title>
      <dc:creator>Carlos Mario Mora Restrepo</dc:creator>
      <pubDate>Sat, 10 Jan 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/carlosmoradev/multi-layer-cost-controls-for-cloud-data-platforms-24ac</link>
      <guid>https://dev.to/carlosmoradev/multi-layer-cost-controls-for-cloud-data-platforms-24ac</guid>
      <description>&lt;p&gt;Managing costs in cloud data platforms is challenging, especially in sandbox environments where analysts experiment freely. A single misconfigured query can run for hours, consuming resources and exploding budgets.&lt;/p&gt;

&lt;p&gt;After experiencing several unexpected cost spikes in a Snowflake sandbox environment, I implemented a &lt;strong&gt;4-layer defense strategy&lt;/strong&gt; that reduced overages by 60% while maintaining analyst productivity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Sandbox environments are critical for data teams. Analysts need freedom to experiment, test queries, and explore data without the constraints of production. However, this freedom comes with risks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Forgotten queries&lt;/strong&gt;: Analysts start a query, switch tasks, and forget to terminate it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inefficient SQL&lt;/strong&gt;: Experimentation means suboptimal queries that scan entire tables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large compute&lt;/strong&gt;: “Let me just use XLARGE for this one query” becomes the default&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No accountability&lt;/strong&gt;: Costs are aggregated, so individual users don’t see their impact&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditional approaches fail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Budget alerts only&lt;/strong&gt;: By the time you get the alert, you’ve already spent the money&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query timeouts&lt;/strong&gt;: Legitimate long-running analytics get killed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Restrictive permissions&lt;/strong&gt;: Kills innovation and analyst productivity&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The 4-Layer Defense Strategy
&lt;/h2&gt;

&lt;p&gt;Instead of relying on a single mechanism, I implemented &lt;strong&gt;redundant layers&lt;/strong&gt; so that if one fails, others catch the issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Warehouse Configuration (Prevention)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Aggressive auto-suspend and right-sizing:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="c1"&gt;-- Create warehouse with 60-second auto-suspend&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="n"&gt;SANDBOX_WAREHOUSE&lt;/span&gt;
  &lt;span class="n"&gt;WAREHOUSE_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'SMALL'&lt;/span&gt;
  &lt;span class="n"&gt;AUTO_SUSPEND&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
  &lt;span class="n"&gt;AUTO_RESUME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;
  &lt;span class="n"&gt;INITIALLY_SUSPENDED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why 60 seconds?&lt;/strong&gt; Most analysts iterate on queries with 1-2 minute gaps. 60 seconds catches forgotten warehouses while allowing workflow continuity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Default to SMALL:&lt;/strong&gt; Unless there’s a documented need, sandbox warehouses start at SMALL. Analysts can scale up temporarily, but the default prevents “XLARGE for everything.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average warehouse utilization: 15 minutes/day per analyst (down from 2+ hours)&lt;/li&gt;
&lt;li&gt;70% reduction in idle warehouse costs&lt;/li&gt;
&lt;/ul&gt;
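&lt;p&gt;The arithmetic behind those numbers is simple. The 2 credits/hour rate for a SMALL warehouse matches Snowflake's published sizing table, but treat the dollar rate below as an assumption to check against your own contract:&lt;/p&gt;

```python
# Rough monthly compute-cost arithmetic for one analyst's sandbox warehouse.
# SMALL = 2 credits/hour per Snowflake's sizing table; the per-credit price
# is an illustrative on-demand rate, not a quoted one.

CREDITS_PER_HOUR_SMALL = 2
USD_PER_CREDIT = 3.00  # illustrative; check your contract

def monthly_warehouse_cost(running_hours_per_day, workdays=22):
    return running_hours_per_day * workdays * CREDITS_PER_HOUR_SMALL * USD_PER_CREDIT

before = monthly_warehouse_cost(running_hours_per_day=2.0)   # 2+ hours/day, mostly forgotten
after = monthly_warehouse_cost(running_hours_per_day=0.25)   # 15 minutes/day
```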

&lt;h3&gt;
  
  
  Layer 2: Resource Monitors (Guardrails)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Per-user budget enforcement:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="c1"&gt;-- Create resource monitor for sandbox user&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;RESOURCE&lt;/span&gt; &lt;span class="n"&gt;MONITOR&lt;/span&gt; &lt;span class="n"&gt;USER_MONITOR&lt;/span&gt;
  &lt;span class="n"&gt;CREDIT_QUOTA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;MONTHLY_BUDGET&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="n"&gt;FREQUENCY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MONTHLY&lt;/span&gt;
  &lt;span class="n"&gt;START_TIMESTAMP&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;IMMEDIATELY&lt;/span&gt;
  &lt;span class="n"&gt;TRIGGERS&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="mi"&gt;75&lt;/span&gt; &lt;span class="n"&gt;PERCENT&lt;/span&gt; &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="k"&gt;NOTIFY&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="mi"&gt;90&lt;/span&gt; &lt;span class="n"&gt;PERCENT&lt;/span&gt; &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="k"&gt;NOTIFY&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="mi"&gt;95&lt;/span&gt; &lt;span class="n"&gt;PERCENT&lt;/span&gt; &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="k"&gt;NOTIFY&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="n"&gt;PERCENT&lt;/span&gt; &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="n"&gt;SUSPEND_IMMEDIATE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Assign to user's warehouse&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="n"&gt;SANDBOX_WAREHOUSE&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;RESOURCE_MONITOR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;USER_MONITOR&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why per-user monitors?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Individual accountability (users see their own consumption)&lt;/li&gt;
&lt;li&gt;Graceful degradation (one user hitting limit doesn’t affect others)&lt;/li&gt;
&lt;li&gt;Data for user education (who needs query optimization training?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Progressive notifications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;75%: “You’re on track, no action needed”&lt;/li&gt;
&lt;li&gt;90%: “Slow down, review your queries”&lt;/li&gt;
&lt;li&gt;95%: “Critical - optimize or your warehouse suspends at 100%”&lt;/li&gt;
&lt;li&gt;100%: Immediate suspension (prevents overage)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero budget overages since implementation&lt;/li&gt;
&lt;li&gt;Users self-optimize before hitting 90% threshold&lt;/li&gt;
&lt;li&gt;Average spending maintained well within budget limits&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Layer 3: Connection Pooling (Efficiency)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Reuse connections to reduce cold-start costs:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Python implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;snowflake.connector&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;contextlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;contextmanager&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SnowflakeConnectionPool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_connection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="nd"&gt;@contextmanager&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_connection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Reuse connection if available, create new if needed&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_connection&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_closed&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_connection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;snowflake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_connection&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Connection failed, reset for next attempt
&lt;/span&gt;            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_connection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt;

&lt;span class="c1"&gt;# Usage
&lt;/span&gt;&lt;span class="n"&gt;pool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SnowflakeConnectionPool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_connection&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each new connection can trigger a warehouse resume (and its cost) if the warehouse has auto-suspended&lt;/li&gt;
&lt;li&gt;Reusing pooled connections avoids repeated session setup and warehouse wake-ups&lt;/li&gt;
&lt;li&gt;The warehouse stays “warm” during analyst work sessions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;80% connection reuse rate&lt;/li&gt;
&lt;li&gt;30% reduction in warehouse resume events&lt;/li&gt;
&lt;li&gt;Faster query execution (no cold-start delay)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Layer 4: User Education (Culture)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Monthly cost transparency:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Send each analyst their personal cost report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Your Snowflake Usage - December 2025

Credits used: XX of YY (84%)
Queries executed: X,XXX
Most expensive query: X.X credits (view details)

Cost breakdown:
- Compute: XX credits
- Storage: X credits
- Data transfer: X credits

Top 3 expensive queries:
1. Full table scan on LARGE_TABLE (X.X credits)
   → Optimization: Add WHERE clause to filter data
2. Cartesian join (X.X credits)
   → Optimization: Add JOIN condition
3. Repeated aggregation (X.X credits)
   → Optimization: Materialize intermediate results

Tips for next month:
- Use LIMIT when exploring data
- Add filters before aggregations
- Check query profile before large runs

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
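&lt;p&gt;The report above can be generated automatically. A minimal, self-contained sketch (the function name, fields, and sample numbers are illustrative assumptions; in production the rows would come from &lt;code&gt;SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY&lt;/code&gt; rather than hard-coded dicts):&lt;/p&gt;

```python
# Hypothetical monthly report generator. Rows are plain dicts here so the
# sketch stays self-contained and runnable.

def build_cost_report(user, quota, queries):
    """Render a plaintext usage report for one analyst."""
    used = sum(q["credits"] for q in queries)
    pct = round(100 * used / quota)
    top = sorted(queries, key=lambda q: q["credits"], reverse=True)[:3]

    lines = [
        f"Your Snowflake Usage - {user}",
        f"Credits used: {used:.1f} of {quota} ({pct}%)",
        f"Queries executed: {len(queries)}",
        "Top 3 expensive queries:",
    ]
    for i, q in enumerate(top, 1):
        lines.append(f"{i}. {q['label']} ({q['credits']:.1f} credits)")
    return "\n".join(lines)

report = build_cost_report(
    "analyst_a",
    quota=50,
    queries=[
        {"label": "Full table scan on LARGE_TABLE", "credits": 4.2},
        {"label": "Cartesian join", "credits": 2.7},
        {"label": "Quick lookup", "credits": 0.1},
    ],
)
print(report)
```

&lt;p&gt;The per-query optimization tips in the real report would be attached per pattern (full scans, missing join conditions, repeated aggregations) once the offending queries are identified.&lt;/p&gt;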



&lt;p&gt;&lt;strong&gt;Results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;40% reduction in inefficient query patterns&lt;/li&gt;
&lt;li&gt;Users proactively optimize before hitting budget limits&lt;/li&gt;
&lt;li&gt;Cultural shift: “cost-aware” becomes default mindset&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Combined Impact
&lt;/h2&gt;

&lt;p&gt;The 4-layer strategy delivered:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;60% reduction&lt;/strong&gt; in unexpected sandbox overages&lt;/li&gt;
&lt;li&gt;Average spending per user maintained within budget&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero budget overruns&lt;/strong&gt; since implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Operational metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;80% connection pooling&lt;/strong&gt; efficiency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;70% reduction&lt;/strong&gt; in idle warehouse costs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;40% improvement&lt;/strong&gt; in query efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cultural metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users self-optimize at 90% threshold (before suspension)&lt;/li&gt;
&lt;li&gt;Proactive query profiling becomes standard practice&lt;/li&gt;
&lt;li&gt;Cost awareness embedded in daily workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Redundancy Matters
&lt;/h2&gt;

&lt;p&gt;Each layer catches different failure modes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure Scenario&lt;/th&gt;
&lt;th&gt;Layer That Catches It&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Analyst forgets to terminate warehouse&lt;/td&gt;
&lt;td&gt;Layer 1: Auto-suspend after 60 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inefficient query runs for hours&lt;/td&gt;
&lt;td&gt;Layer 2: Resource monitor suspends at 100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Many short queries throughout the day&lt;/td&gt;
&lt;td&gt;Layer 3: Connection pooling reduces resume costs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User habitually writes expensive queries&lt;/td&gt;
&lt;td&gt;Layer 4: Monthly report triggers education&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Healthcare compliance bonus:&lt;/strong&gt; In regulated environments, this approach provides audit trails showing cost governance without restricting legitimate data access.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Checklist
&lt;/h2&gt;

&lt;p&gt;Want to implement this in your organization? Here’s the checklist:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 1: Infrastructure (Layers 1-2)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configure warehouse auto-suspend (aggressive timing, e.g., 60 seconds)&lt;/li&gt;
&lt;li&gt;Set default warehouse size to SMALL&lt;/li&gt;
&lt;li&gt;Create per-user resource monitors with monthly quotas&lt;/li&gt;
&lt;li&gt;Set up progressive notification thresholds (75%, 90%, 95%, 100%)&lt;/li&gt;
&lt;/ul&gt;
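&lt;p&gt;The Week 1 items boil down to a handful of Snowflake DDL statements. A hedged sketch that emits them from Python (the warehouse/monitor naming scheme and the 10-credit quota are assumptions, not the article’s actual values):&lt;/p&gt;

```python
# Sketch: generate the Week-1 DDL for one analyst from a per-user config.

def week1_ddl(user, monthly_credits):
    monitor = f"RM_{user.upper()}"
    warehouse = f"WH_{user.upper()}"
    return [
        # Layer 1: aggressive auto-suspend, small default size
        f"CREATE WAREHOUSE IF NOT EXISTS {warehouse} "
        f"WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;",
        # Layer 2: per-user monthly quota with progressive notifications
        f"CREATE RESOURCE MONITOR IF NOT EXISTS {monitor} "
        f"WITH CREDIT_QUOTA = {monthly_credits} FREQUENCY = MONTHLY "
        f"START_TIMESTAMP = IMMEDIATELY "
        f"TRIGGERS ON 75 PERCENT DO NOTIFY "
        f"ON 90 PERCENT DO NOTIFY "
        f"ON 95 PERCENT DO NOTIFY "
        f"ON 100 PERCENT DO SUSPEND;",
        f"ALTER WAREHOUSE {warehouse} SET RESOURCE_MONITOR = {monitor};",
    ]

for stmt in week1_ddl("analyst_a", monthly_credits=10):
    print(stmt)
```

&lt;p&gt;Generating the statements from code keeps the per-user quotas in version control instead of scattered across the console.&lt;/p&gt;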

&lt;p&gt;&lt;strong&gt;Week 2: Optimization (Layer 3)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement connection pooling in application code&lt;/li&gt;
&lt;li&gt;Measure connection reuse rate&lt;/li&gt;
&lt;li&gt;Monitor warehouse resume events&lt;/li&gt;
&lt;/ul&gt;
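&lt;p&gt;For measuring the reuse rate, a pair of counters on the pool is enough. This instrumentation is a hypothetical sketch, separate from the pool class shown earlier:&lt;/p&gt;

```python
# Hypothetical instrumentation: count how often get_connection() reuses an
# existing session versus opening a new one.

class PoolMetrics:
    def __init__(self):
        self.hits = 0    # reused an existing connection
        self.misses = 0  # had to open a new connection

    def record(self, reused):
        if reused:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def reuse_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

metrics = PoolMetrics()
# Simulated session: first checkout opens a connection, the next four reuse it.
metrics.record(reused=False)
for _ in range(4):
    metrics.record(reused=True)
print(f"reuse rate: {metrics.reuse_rate:.0%}")  # reuse rate: 80%
```

&lt;p&gt;Calling &lt;code&gt;record()&lt;/code&gt; inside the pool’s checkout path yields the reuse-rate number to track against the 80% target.&lt;/p&gt;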

&lt;p&gt;&lt;strong&gt;Week 3: Culture (Layer 4)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build monthly cost report automation&lt;/li&gt;
&lt;li&gt;Include query optimization recommendations&lt;/li&gt;
&lt;li&gt;Send first round of reports&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Week 4: Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dashboard for cost trends&lt;/li&gt;
&lt;li&gt;Alert on anomalies (a user exceeding their historical average)&lt;/li&gt;
&lt;li&gt;Quarterly review and adjustment&lt;/li&gt;
&lt;/ul&gt;
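&lt;p&gt;The anomaly alert above can be as simple as comparing current spend to a trailing average. A minimal sketch (the 1.5x factor is an assumption to tune, not a value from the article):&lt;/p&gt;

```python
# Sketch of the anomaly rule: flag a user whose current-month credits
# exceed their trailing average by a configurable factor.
from statistics import mean

def is_anomalous(history, current, factor=1.5):
    """history: prior months' credit usage; current: this month so far."""
    if not history:
        return False
    return current > mean(history) * factor

print(is_anomalous(history=[10, 12, 11], current=25))  # True
```

&lt;p&gt;Wiring this into the Week 4 dashboard turns the quarterly review into a continuous check rather than a one-off audit.&lt;/p&gt;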

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No single mechanism is enough&lt;/strong&gt;: Redundant layers provide resilience&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make costs visible&lt;/strong&gt;: Users can’t optimize what they can’t see&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Default to small&lt;/strong&gt;: Scaling up is easier than justifying scale-down&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Progressive alerts work&lt;/strong&gt;: Users self-correct before hitting hard limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Culture beats controls&lt;/strong&gt;: Education changes behavior permanently&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Technologies Used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Snowflake resource monitors&lt;/li&gt;
&lt;li&gt;Python connection pooling&lt;/li&gt;
&lt;li&gt;Automated reporting (SQL + pandas)&lt;/li&gt;
&lt;li&gt;TOML configuration management&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Want to discuss cost optimization strategies?&lt;/strong&gt; Connect with me on &lt;a href="https://linkedin.com/in/carlosmoradev" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Related reading:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/projects/snowflake-governance"&gt;Multi-Account Data Warehouse Governance&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>finops</category>
      <category>costoptimization</category>
      <category>snowflake</category>
      <category>dataplatforms</category>
    </item>
  </channel>
</rss>
