<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Matt</title>
    <description>The latest articles on DEV Community by Matt (@dspv).</description>
    <link>https://dev.to/dspv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3782713%2Fa6b49eb1-89f2-4142-97df-c0dc96c281e1.png</url>
      <title>DEV Community: Matt</title>
      <link>https://dev.to/dspv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dspv"/>
    <language>en</language>
    <item>
      <title>Why Do AWS Staging Environments Cost So Much?</title>
      <dc:creator>Matt</dc:creator>
      <pubDate>Sun, 21 Jun 2026 15:01:58 +0000</pubDate>
      <link>https://dev.to/dspv/why-do-aws-staging-environments-cost-so-much-38nf</link>
      <guid>https://dev.to/dspv/why-do-aws-staging-environments-cost-so-much-38nf</guid>
      <description>&lt;h1&gt;
  
  
  Why AWS Staging Environments Cost So Much (2026 Guide)
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://fortem.dev/blog/aws-staging-environment-cost" rel="noopener noreferrer"&gt;https://fortem.dev/blog/aws-staging-environment-cost&lt;/a&gt;&lt;br&gt;
AWS staging environments run 168 hours a week. Your team works 40. Here's where the money goes on ECS Fargate — and how to cut it without touching production.&lt;/p&gt;
&lt;/blockquote&gt;





&lt;p&gt;You have 10 ECS environments. Most of them are staging, QA, or dev. No one is using them at 2am on Saturday. But Fargate bills by the second, and by the time the monthly invoice arrives the number is larger than expected. This isn't an infrastructure design problem — it's an idle compute problem. Here's where the money goes, and what moves the needle.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  Non-prod ECS environments run 168 hours a week. Your team works 40. That's 128 hrs/week of idle compute per environment.&lt;/li&gt;
&lt;li&gt;  Fargate compute is ~68% of your ECS bill. The rest (CloudWatch Logs, ALB baseline) doesn't stop when the environment sits idle.&lt;/li&gt;
&lt;li&gt;  NAT Gateway, VPC, and often ALB are shared across environments, so that overhead doesn't multiply. Compute does.&lt;/li&gt;
&lt;li&gt;  Fargate Spot cuts non-prod compute by up to 70% for fault-tolerant tasks. Not suitable for demo environments or shared QA sessions.&lt;/li&gt;
&lt;li&gt;  Business-hours scheduling (Mon–Fri 09:00–19:00) cuts active compute time to ~30% of the 24/7 baseline with zero architecture changes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ready to use — drop this into your Terraform today&lt;/p&gt;

&lt;p&gt;These Application Auto Scaling scheduled actions scale an ECS service to zero tasks at 19:00 and back up at 09:00, Mon–Fri (times are UTC). No Lambda required. Replace &lt;code&gt;your-cluster&lt;/code&gt; and &lt;code&gt;your-service&lt;/code&gt; with your values. Repeat the &lt;code&gt;aws_appautoscaling_*&lt;/code&gt; blocks for each service.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Register the ECS service as a scalable target&lt;/span&gt;
&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_appautoscaling_target"&lt;/span&gt; &lt;span class="s2"&gt;"staging_svc"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;max_capacity&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
  &lt;span class="nx"&gt;min_capacity&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="nx"&gt;resource_id&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"service/your-cluster/your-service"&lt;/span&gt;
  &lt;span class="nx"&gt;scalable_dimension&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ecs:service:DesiredCount"&lt;/span&gt;
  &lt;span class="nx"&gt;service_namespace&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ecs"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Stop at 19:00 UTC Mon–Fri&lt;/span&gt;
&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_appautoscaling_scheduled_action"&lt;/span&gt; &lt;span class="s2"&gt;"stop_evening"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"stop-staging-evening"&lt;/span&gt;
  &lt;span class="nx"&gt;service_namespace&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_appautoscaling_target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;staging_svc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;service_namespace&lt;/span&gt;
  &lt;span class="nx"&gt;resource_id&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_appautoscaling_target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;staging_svc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_id&lt;/span&gt;
  &lt;span class="nx"&gt;scalable_dimension&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_appautoscaling_target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;staging_svc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;scalable_dimension&lt;/span&gt;
  &lt;span class="nx"&gt;schedule&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"cron(0 19 ? * MON-FRI *)"&lt;/span&gt;

  &lt;span class="nx"&gt;scalable_target_action&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;min_capacity&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="nx"&gt;max_capacity&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Restart at 09:00 UTC Mon–Fri&lt;/span&gt;
&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_appautoscaling_scheduled_action"&lt;/span&gt; &lt;span class="s2"&gt;"start_morning"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"start-staging-morning"&lt;/span&gt;
  &lt;span class="nx"&gt;service_namespace&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_appautoscaling_target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;staging_svc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;service_namespace&lt;/span&gt;
  &lt;span class="nx"&gt;resource_id&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_appautoscaling_target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;staging_svc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource_id&lt;/span&gt;
  &lt;span class="nx"&gt;scalable_dimension&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_appautoscaling_target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;staging_svc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;scalable_dimension&lt;/span&gt;
  &lt;span class="nx"&gt;schedule&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"cron(0 9 ? * MON-FRI *)"&lt;/span&gt;

  &lt;span class="nx"&gt;scalable_target_action&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;min_capacity&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="nx"&gt;max_capacity&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Optional: Fargate Spot capacity provider for non-prod&lt;/span&gt;
&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecs_service"&lt;/span&gt; &lt;span class="s2"&gt;"staging_svc"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;# ... your existing service config ...&lt;/span&gt;

  &lt;span class="nx"&gt;capacity_provider_strategy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;capacity_provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"FARGATE_SPOT"&lt;/span&gt;
    &lt;span class="nx"&gt;weight&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;capacity_provider_strategy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;capacity_provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"FARGATE"&lt;/span&gt;
    &lt;span class="nx"&gt;weight&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="nx"&gt;base&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Monthly compute cost for 10 non-prod environments (80 services, 0.5 vCPU each), us-east-1, Linux x86, on-demand rates June 2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  24/7 on-demand: $1,442/mo&lt;/li&gt;
&lt;li&gt;  Business hours on-demand: $428/mo (-70%)&lt;/li&gt;
&lt;li&gt;  Business hours + Fargate Spot: $128/mo (-91%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Business hours = Mon–Fri 09:00–19:00 (50 hrs/wk, ~217 hrs/mo). Fargate Spot at 70% discount. Shared infrastructure (NAT Gateway, VPC, ALB) is not included because shared cost does not multiply per environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why non-prod spend stays invisible
&lt;/h2&gt;

&lt;p&gt;Non-prod costs get lumped into a single “infrastructure” line item with no per-environment breakdown. No one owns the number, so it doesn't get fixed.&lt;/p&gt;

&lt;p&gt;Production gets optimized after a big bill. Staging gets the same config it had when the second engineer joined and no one has touched it since. The reason isn't negligence — it's visibility. AWS Cost Explorer shows you ECS as a service total. Without &lt;a href="https://dev.to/blog/ecs-fargate-cost-visibility/"&gt;per-environment cost allocation tags&lt;/a&gt;, there's no way to see that your staging environment costs more than your QA environment, or that three dev environments have been running since February with no active work behind them.&lt;/p&gt;

&lt;p&gt;The result: non-prod spend is invisible in reviews, gets absorbed into the overall AWS bill, and is deferred indefinitely with “it's just staging, we'll fix it later.”&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;KEY INSIGHT:&lt;/strong&gt; “Nobody noticed because staging bills get lumped into ‘infrastructure costs’ and nobody questions them.” — practitioner, dev.to&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Where the money goes on Fargate
&lt;/h2&gt;

&lt;p&gt;Fargate compute is ~68% of a typical ECS bill at $0.04048/vCPU-hr and $0.004445/GB-hr. The remaining 32% — CloudWatch Logs at $0.50/GB ingested, ALB baseline at $0.0225/hr — doesn't scale to zero when tasks are idle.&lt;/p&gt;

&lt;p&gt;The big number is compute, and compute is the lever. But a few non-obvious charges compound the problem for non-prod environments specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;CloudWatch Logs — verbose by default&lt;/p&gt;

&lt;p&gt;Non-prod environments often run at DEBUG log level. A service generating 1 GB/day of logs costs $15/month in ingestion alone. Multiply by 8 services and 10 environments and you have a meaningful line item that has nothing to do with compute.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Container Insights — charged per observation&lt;/p&gt;

&lt;p&gt;Container Insights is on by default on many clusters. For non-prod, it adds cost without adding value. Turn it off on dev and staging clusters.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ALB dedicated to one environment&lt;/p&gt;

&lt;p&gt;If each environment has its own ALB, the $0.0225/hr base charge ($16.43/mo) runs regardless of traffic. Teams running 10 environments with dedicated ALBs pay $164/mo in ALB base charges before a single request is processed.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
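
&lt;p&gt;The non-compute figures above are simple rate arithmetic. The sketch below reproduces them with the rates quoted in this article; the 30-day month for logs and the "every service logs 1 GB/day" fleet case are illustrative assumptions, not measurements.&lt;/p&gt;

```python
HOURS_PER_MONTH = 730  # average month, the same figure used for the ALB numbers above


def log_ingestion_cost(gb_per_day, rate_per_gb=0.50, days=30):
    """Monthly CloudWatch Logs ingestion cost in USD (30-day month assumed)."""
    return gb_per_day * days * rate_per_gb


def alb_base_cost(alb_count, hourly_rate=0.0225):
    """Monthly ALB base charge in USD, before any traffic-based charges."""
    return alb_count * hourly_rate * HOURS_PER_MONTH


per_service_logs = log_ingestion_cost(1.0)   # one service at 1 GB/day
fleet_logs = per_service_logs * 8 * 10       # assumed worst case: all 80 services log 1 GB/day
dedicated_albs = alb_base_cost(10)           # one ALB per environment
shared_alb = alb_base_cost(1)                # one ALB with host-based routing

print(f"Logs, one service:  ${per_service_logs:.2f}/mo")
print(f"Logs, whole fleet:  ${fleet_logs:.2f}/mo")
print(f"ALB, 10 dedicated:  ${dedicated_albs:.2f}/mo")
print(f"ALB, 1 shared:      ${shared_alb:.2f}/mo")
```

&lt;p&gt;Swap in your own log volume and ALB count to see which of the two charges matters more for your fleet.&lt;/p&gt;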

&lt;h2&gt;
  
  
  The 168-hour problem
&lt;/h2&gt;

&lt;p&gt;A non-prod environment running 24/7 runs 168 hours a week. Your team works 40. That gap — 128 hours per week of idle compute per environment — is the real cost driver on Fargate.&lt;/p&gt;

&lt;p&gt;Let's do the math on a realistic fleet. Ten non-prod environments, each running 8 services at 0.5 vCPU and 1 GB memory:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Hrs/mo active&lt;/th&gt;
&lt;th&gt;Compute/mo&lt;/th&gt;
&lt;th&gt;vs 24/7&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;24/7 on-demand&lt;/td&gt;
&lt;td&gt;730&lt;/td&gt;
&lt;td&gt;$1,442&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business hours on-demand&lt;/td&gt;
&lt;td&gt;~217&lt;/td&gt;
&lt;td&gt;$428&lt;/td&gt;
&lt;td&gt;−70%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business hours + Spot&lt;/td&gt;
&lt;td&gt;~217&lt;/td&gt;
&lt;td&gt;~$128&lt;/td&gt;
&lt;td&gt;−91%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;80 services × 0.5 vCPU × $0.04048/hr + 80 × 1 GB × $0.004445/hr. Business hours = Mon–Fri 09:00–19:00 UTC (~217 hrs/mo).&lt;/p&gt;
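
&lt;p&gt;The table can be reproduced from the formula above. This is a minimal sketch using the rates quoted in this article; it treats the "up to 70%" Spot discount as a flat 70% and averages 50 hours a week over a year, both simplifying assumptions.&lt;/p&gt;

```python
VCPU_RATE = 0.04048   # USD per vCPU-hour, as quoted in this article
GB_RATE = 0.004445    # USD per GB-hour, as quoted in this article


def fleet_hourly_cost(services, vcpu_per_service, gb_per_service):
    """Hourly Fargate compute cost in USD for a fleet of identical services."""
    return services * (vcpu_per_service * VCPU_RATE + gb_per_service * GB_RATE)


hourly = fleet_hourly_cost(services=80, vcpu_per_service=0.5, gb_per_service=1.0)

always_on_hours = 730            # 24/7, average month
business_hours = 50 * 52 / 12    # Mon-Fri 09:00-19:00, about 217 hours a month
spot_discount = 0.70             # assumption: flat 70% off on-demand

always_on = hourly * always_on_hours
scheduled = hourly * business_hours
scheduled_spot = scheduled * (1 - spot_discount)

for label, cost in [
    ("24/7 on-demand", always_on),
    ("Business hours on-demand", scheduled),
    ("Business hours + Spot", scheduled_spot),
]:
    saving = 1 - cost / always_on
    print(f"{label:26s} ${cost:8.0f}/mo  saving {saving:.0%}")
```

&lt;p&gt;Change the service count or task size to model your own fleet; the ratio between the three rows depends only on active hours and the Spot discount.&lt;/p&gt;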

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;KEY INSIGHT:&lt;/strong&gt; The compute in a non-prod environment doesn't know it's 2am on Sunday. It charges the same rate as a Tuesday afternoon.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Fargate bills per second, with a one-minute minimum per task. A task stopped at 19:00 incurs no compute charge until it restarts at 09:00. That's not an approximation; it's how the billing model works. The savings from scheduling are immediate.&lt;/p&gt;

&lt;h2&gt;
  
  
  What shared infrastructure changes (and doesn't change)
&lt;/h2&gt;

&lt;p&gt;NAT Gateway, VPC, and often ALB are shared across environments. That overhead doesn't multiply per environment. What multiplies is compute — one set of running tasks per environment, billed independently.&lt;/p&gt;

&lt;p&gt;A well-structured ECS fleet shares:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;NAT Gateway&lt;/strong&gt;: one per VPC, ~$32.85/mo base. Shared across all environments. $3.29/env at 10 environments.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;ALB with host-based routing&lt;/strong&gt;: one ALB routes to all environments via hostname rules. $16.43/mo base total, not per environment.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;VPC, subnets, security groups&lt;/strong&gt;: no per-environment charge.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What doesn't share: Fargate task hours, CloudWatch Logs ingestion per environment, and ECR image pull data. These are the numbers that multiply at fleet scale — and they're all driven by idle compute.&lt;/p&gt;

&lt;p&gt;This is why the fix is scheduling tasks, not redesigning network architecture. Once you understand that shared infra is already cheap per environment, the question becomes: how do you stop paying for 128 idle compute hours per week?&lt;/p&gt;

&lt;p&gt;You can set up &lt;a href="https://dev.to/blog/aws-cost-anomaly-detection-ecs/"&gt;per-environment cost allocation tags with AWS Cost Anomaly Detection&lt;/a&gt; to get alerted when any single environment deviates from its historical spend baseline — useful once you have scheduling in place and want to catch drift.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fargate Spot for non-prod: when it works, when it doesn't
&lt;/h2&gt;

&lt;p&gt;Fargate Spot runs non-prod tasks on spare AWS capacity at up to 70% off on-demand rates. It works well for dev and QA. Avoid it for environments used for customer demos or with stateful in-memory work that can't tolerate a restart.&lt;/p&gt;

&lt;p&gt;The mechanics: AWS gives 2 minutes' warning via SIGTERM before reclaiming Spot capacity. ECS marks the task as &lt;code&gt;SPOT_INTERRUPTION&lt;/code&gt; and, if desired count is still &amp;gt; 0, launches a replacement.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Environment type&lt;/th&gt;
&lt;th&gt;Fargate Spot?&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dev environments&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;td&gt;Stateless, restartable, no active users&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feature branch preview&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;td&gt;Ephemeral, restartable on interrupt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI / integration tests&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;td&gt;Short-lived tasks, retry on failure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QA (automated)&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;td&gt;Tests restart automatically on failure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QA (live session)&lt;/td&gt;
&lt;td&gt;✗ Risky&lt;/td&gt;
&lt;td&gt;Interrupt kills active QA session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Demo environment&lt;/td&gt;
&lt;td&gt;✗ No&lt;/td&gt;
&lt;td&gt;Customer impact if interrupted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Staging (production-like)&lt;/td&gt;
&lt;td&gt;✗ Usually not&lt;/td&gt;
&lt;td&gt;Used for final validation, needs stability&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

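&lt;p&gt;The capacity provider strategy in the Terraform block above sets &lt;code&gt;FARGATE_SPOT weight=1, FARGATE weight=0&lt;/code&gt;, which is pure Spot. For environments that need more stability, set Spot weight to 3 and on-demand weight to 1, so roughly three of every four tasks run on Spot and the rest on on-demand. Weights set a placement ratio, not a fallback: ECS does not move tasks to on-demand when Spot capacity is unavailable.&lt;/p&gt;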
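
&lt;p&gt;Capacity provider weights express a relative share of the tasks a service launches (after any &lt;code&gt;base&lt;/code&gt; tasks are placed). The sketch below only computes that nominal share; the helper name is hypothetical, and the exact task-by-task placement ECS performs for small counts is not modeled here.&lt;/p&gt;

```python
def placement_shares(weights):
    """Nominal fraction of tasks (beyond any base) per capacity provider."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one capacity provider needs a non-zero weight")
    return {name: weight / total for name, weight in weights.items()}


pure_spot = placement_shares({"FARGATE_SPOT": 1, "FARGATE": 0})
mixed = placement_shares({"FARGATE_SPOT": 3, "FARGATE": 1})

print("pure Spot:", pure_spot)
print("3:1 mix:  ", mixed)
```

&lt;p&gt;With a 3:1 mix, about a quarter of the tasks stay on on-demand capacity at on-demand prices, so the blended discount is smaller than with pure Spot.&lt;/p&gt;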

&lt;h2&gt;
  
  
  Business-hours scheduling: the fastest ROI
&lt;/h2&gt;

&lt;p&gt;Scheduling ECS tasks to stop at 19:00 and restart at 09:00 Mon–Fri cuts active compute time from 730 hours/month to ~217 hours — a 70% reduction with no architecture changes required.&lt;/p&gt;

&lt;p&gt;The AWS-native approach uses ECS Application Auto Scaling scheduled actions. No Lambda function, no custom scheduler, no third-party tool — this is a first-class ECS feature. The Terraform block at the top of this article implements it exactly.&lt;/p&gt;

&lt;p&gt;A few operational details worth knowing before you deploy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Deregistration delay.&lt;/strong&gt; ALB target groups have a default 300-second deregistration delay. Reduce this to 30 seconds on non-prod target groups so environments stop promptly at 19:00 instead of draining for 5 minutes.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Stateful services.&lt;/strong&gt; RDS and ElastiCache run independently and are not stopped by this config. Data persists across task restarts. EFS mounts reattach on task start.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Timezone offset.&lt;/strong&gt; Scheduled-action cron expressions are evaluated in UTC unless you set a time zone on the action. Mon–Fri 09:00–19:00 Eastern is 13:00–23:00 UTC during daylight saving time (EDT) and 14:00–00:00 UTC in winter (EST). Either adjust the cron expressions for your team's offset or set the scheduled action's &lt;code&gt;timezone&lt;/code&gt; argument (for example &lt;code&gt;America/New_York&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Override capability.&lt;/strong&gt; The scheduled action sets desired count, and any engineer can manually set it back to 1 for an after-hours session. The schedule resumes as normal the next morning.&lt;/li&gt;
&lt;/ul&gt;
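
&lt;p&gt;If you convert local hours to UTC by hand, watch for the day boundary: a late stop time can land on the next UTC day, which shifts the weekday range in the cron expression. The helper below is a hypothetical sketch that assumes whole-hour UTC offsets and a Mon–Fri local week; it does not track daylight saving changes for you.&lt;/p&gt;

```python
DAYS = ["SUN", "MON", "TUE", "WED", "THU", "FRI", "SAT"]


def utc_cron(local_hour, utc_offset_hours, first_day="MON", last_day="FRI"):
    """Build a UTC cron string for a local weekday hour (whole-hour offsets only)."""
    raw_hour = local_hour - utc_offset_hours
    # divmod also handles negative hours, shifting the weekday range back one day
    day_shift, utc_hour = divmod(raw_hour, 24)
    first = DAYS[(DAYS.index(first_day) + day_shift) % 7]
    last = DAYS[(DAYS.index(last_day) + day_shift) % 7]
    return f"cron(0 {utc_hour} ? * {first}-{last} *)"


summer_start = utc_cron(9, -4)    # 09:00 EDT (UTC-4)
summer_stop = utc_cron(19, -4)    # 19:00 EDT
winter_start = utc_cron(9, -5)    # 09:00 EST (UTC-5)
winter_stop = utc_cron(19, -5)    # 19:00 EST is 00:00 UTC the following day

for name, expr in [
    ("summer start", summer_start),
    ("summer stop", summer_stop),
    ("winter start", winter_start),
    ("winter stop", winter_stop),
]:
    print(f"{name:13s} {expr}")
```

&lt;p&gt;The winter stop expression runs Tuesday through Saturday in UTC, which is easy to miss when editing the hour alone.&lt;/p&gt;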

&lt;h2&gt;
  
  
  At 10+ environments, this math becomes unavoidable
&lt;/h2&gt;

&lt;p&gt;One staging environment running 24/7 is an annoyance. Ten of them is a line item that starts appearing in board decks. The fix doesn't scale manually.&lt;/p&gt;

&lt;p&gt;Manual scheduling via the AWS console or one-off Terraform blocks works at 1–2 environments. At 10+, the operational overhead compounds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Schedule drift: different engineers set different start/stop times, no one audits&lt;/li&gt;
&lt;li&gt;  Environment-specific hours: the ML team needs their env at 6am, QA needs theirs until 9pm&lt;/li&gt;
&lt;li&gt;  On-demand overrides: “can you keep staging up tonight, we have a client demo” is sent in Slack and forgotten in Terraform&lt;/li&gt;
&lt;li&gt;  New environments inherit no schedule by default: the next dev environment someone spins up runs 24/7 until someone notices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where fleet-level tooling pays for itself. Fortem manages scheduling across all non-prod environments from one interface — with override capability per environment, audit log of who changed what, and defaults that apply to new environments automatically.&lt;/p&gt;

&lt;p&gt;See which environments in your fleet are burning budget right now.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/book/"&gt;Talk to us about your fleet&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Questions this article doesn't answer
&lt;/h2&gt;

&lt;p&gt;How do I actually see which environment is costing what in AWS?&lt;/p&gt;

&lt;p&gt;Enable cost allocation tags for your environment key in the AWS Billing console, then use Cost Explorer with a Group by filter on that tag. You'll see per-environment spend broken out as individual rows. Our article on per-environment cost visibility walks through the exact steps.&lt;/p&gt;
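
&lt;p&gt;The same grouping can be requested from the Cost Explorer API. The sketch below only builds the request parameters; the tag key &lt;code&gt;environment&lt;/code&gt; and the helper name are assumptions, so substitute the tag key you activated for cost allocation.&lt;/p&gt;

```python
def per_environment_cost_request(start, end, tag_key="environment"):
    """Build Cost Explorer GetCostAndUsage parameters grouped by an environment tag."""
    return {
        "TimePeriod": {"Start": start, "End": end},
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "Filter": {
            "Dimensions": {
                "Key": "SERVICE",
                "Values": ["Amazon Elastic Container Service"],
            }
        },
        # One result group per distinct value of the tag (assumed key: "environment")
        "GroupBy": [{"Type": "TAG", "Key": tag_key}],
    }


request = per_environment_cost_request("2026-05-01", "2026-06-01")
print(request)

# With boto3 installed and credentials configured (not executed here):
#   import boto3
#   response = boto3.client("ce").get_cost_and_usage(**request)
```

&lt;p&gt;Resources without the tag are reported together in a group with an empty tag value, which is a quick way to spot untagged environments.&lt;/p&gt;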

&lt;p&gt;Can I automatically stop ECS environments when there's no active deployment or open PR?&lt;/p&gt;

&lt;p&gt;Not with native ECS scheduling alone — you'd need to wire EventBridge to your CI/CD events. A GitHub Actions workflow can call the ECS UpdateService API to set desired count to 0 when a PR is closed and back to 1 when a new deployment completes. Some teams add this to their deploy pipeline directly.&lt;/p&gt;
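
&lt;p&gt;A pipeline step for this can be a few lines around the ECS UpdateService call. The sketch below takes the ECS client as a parameter so it runs without AWS access; the function names, the service names, and the recording stand-in are illustrative assumptions, and in a real workflow you would pass &lt;code&gt;boto3.client("ecs")&lt;/code&gt; instead.&lt;/p&gt;

```python
def set_desired_count(ecs_client, cluster, service, count):
    """Set the desired task count of one ECS service via UpdateService."""
    return ecs_client.update_service(cluster=cluster, service=service, desiredCount=count)


def stop_environment(ecs_client, cluster, services):
    """Scale every listed service in the cluster to zero tasks."""
    for service in services:
        set_desired_count(ecs_client, cluster, service, 0)


def start_environment(ecs_client, cluster, services, count=1):
    """Scale every listed service back up, for example after a deployment."""
    for service in services:
        set_desired_count(ecs_client, cluster, service, count)


class RecordingClient:
    """Stand-in for an ECS client that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def update_service(self, **kwargs):
        self.calls.append(kwargs)
        return {"service": kwargs}


client = RecordingClient()
stop_environment(client, "your-cluster", ["api", "worker"])   # e.g. when a PR closes
start_environment(client, "your-cluster", ["api"])            # e.g. after a deploy
print(client.calls)
```

&lt;p&gt;The service definition stays in place between calls, so starting an environment again needs nothing beyond the cluster and service names.&lt;/p&gt;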

&lt;p&gt;What's the difference between desired count = 0 and deleting the ECS service entirely?&lt;/p&gt;

&lt;p&gt;Setting desired count to 0 stops all running tasks but preserves the service definition, IAM roles, capacity provider strategies, and auto-scaling rules. The service restarts exactly as configured. Deleting the service removes all of this and you'd need to recreate it from Terraform. For scheduling, use desired count = 0 — not service deletion.&lt;/p&gt;

&lt;p&gt;Does stopping and restarting ECS tasks affect RDS or other stateful services?&lt;/p&gt;

&lt;p&gt;RDS, ElastiCache, and other stateful services run independently of ECS task count. Stopping tasks at 19:00 has no effect on your database — it continues running (and billing) until you separately stop it. Data persists across task restarts. EFS volumes reattach automatically when tasks start again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is Fargate Spot available for ECS services or only tasks?
&lt;/h3&gt;

&lt;p&gt;Fargate Spot is available for ECS services through capacity provider strategies. You set FARGATE_SPOT as a capacity provider with a weight in your ECS service definition. Tasks get scheduled on Spot capacity when available. If AWS needs the capacity back, tasks receive a SIGTERM with a 2-minute warning before SIGKILL.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does setting ECS desired count to 0 stop billing immediately?
&lt;/h3&gt;

&lt;p&gt;Yes. Once desired count reaches 0 and the running tasks drain and stop, Fargate compute billing stops. Fargate charges per second, with a one-minute minimum per task. However, other resources associated with the environment (ALB if dedicated, CloudWatch Log Groups, RDS) continue to incur charges independently.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I set up a schedule to stop ECS services on nights and weekends?
&lt;/h3&gt;

&lt;p&gt;Use ECS Application Auto Scaling scheduled actions — no Lambda required. Create a scalable target for each ECS service, then add two scheduled actions: one to set desired count to 0 at your stop time and one to restore it in the morning. EventBridge cron expressions handle the schedule. Terraform example is included in this article.&lt;/p&gt;

&lt;h3&gt;
  
  
  Will reducing non-prod ECS task size break anything?
&lt;/h3&gt;

&lt;p&gt;It depends on what the task does. For services that only handle QA traffic or automated tests, dropping from 1 vCPU to 0.5 vCPU rarely causes issues. The risk is for tasks that run build pipelines, data migrations, or integration tests under time constraints — those may fail or time out. Right-size based on actual observed CPU and memory utilization, not on what production uses.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does Fargate Spot handle interruptions in ECS?
&lt;/h3&gt;

&lt;p&gt;AWS sends a SIGTERM to the task 2 minutes before reclaiming capacity, then sends SIGKILL. ECS marks the task as stopped with reason SPOT_INTERRUPTION. If the ECS service has a desired count greater than 0, it launches a replacement task according to your capacity provider strategy. A pure-Spot strategy keeps retrying on Spot; ECS does not fall back to on-demand automatically, so services that must stay up need a non-zero on-demand weight or base.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;See your real per-env cost:&lt;/strong&gt; &lt;a href="https://fortem.dev/ecs-cost-calculator" rel="noopener noreferrer"&gt;fortem.dev/ecs-cost-calculator&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ecs</category>
      <category>fargate</category>
      <category>finops</category>
    </item>
    <item>
      <title>Humanitec or Fortem: Which ECS Platform Fits Your Team?</title>
      <dc:creator>Matt</dc:creator>
      <pubDate>Fri, 19 Jun 2026 23:51:03 +0000</pubDate>
      <link>https://dev.to/dspv/humanitec-or-fortem-which-ecs-platform-fits-your-team-211e</link>
      <guid>https://dev.to/dspv/humanitec-or-fortem-which-ecs-platform-fits-your-team-211e</guid>
      <description>&lt;h1&gt;
  
  
  Fortem vs Humanitec: ECS Ops vs General IDP
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://fortem.dev/blog/fortem-vs-humanitec" rel="noopener noreferrer"&gt;https://fortem.dev/blog/fortem-vs-humanitec&lt;/a&gt;&lt;br&gt;
Humanitec's Container Driver explicitly excludes ECS. If your problem is operating an ECS Fargate fleet, you're comparing the wrong category of tool. Here's the breakdown.&lt;/p&gt;
&lt;/blockquote&gt;





&lt;p&gt;Humanitec is the most-marketed IDP of 2025. If you run AWS ECS Fargate and searched “humanitec alternative,” you've likely seen it at the top of every comparison listicle. But the ECS team evaluating Humanitec is usually solving a different problem than the one Humanitec is built for. This article explains both tools precisely so you can figure out which problem you actually have.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  Humanitec is a platform orchestrator built for Kubernetes — the Container Driver (Jan 2025) explicitly states it "can only be used with managed clusters (EKS, AKS or GKE)" — ECS is not a supported workload target&lt;/li&gt;
&lt;li&gt;  Humanitec Teams is $2,199/mo for 5 users, 2 projects, 5 environments per project — structurally incompatible with a fleet of 20+ ECS environments before you even reach Pro at $5,499/mo&lt;/li&gt;
&lt;li&gt;  Fortem is purpose-built for ECS Fargate fleet operations: scheduling, cloning, fleet visibility, developer self-service — reads your existing AWS resources, no Terraform rewrite&lt;/li&gt;
&lt;li&gt;  If your problem is 'operate my ECS fleet at scale' → Fortem. If your problem is 'build a company-wide IDP across AWS, GCP, and Azure' → Humanitec&lt;/li&gt;
&lt;li&gt;  Humanitec requires substantial custom work to build the developer interface — right for a 5+ person platform team, not for a 1–2 person ops team on ECS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ready to use — run this evaluation today&lt;/p&gt;

&lt;p&gt;Before booking a demo with either vendor, answer these 5 questions. The answers tell you which category of tool fits — before any sales call.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;What runtimes does your production infrastructure actually run on?&lt;/p&gt;

&lt;p&gt;If the answer is 'ECS Fargate only' → operational layer. If 'ECS + EKS + Lambda + GCP' → platform orchestrator.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;How many ECS environments do you have, and are any of them running 24/7 without a workload?&lt;/p&gt;

&lt;p&gt;Count: &lt;code&gt;aws ecs list-clusters | jq '.clusterArns | length'&lt;/code&gt;. If &amp;gt;10 environments with idle time → scheduling ROI is immediate.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;How many full-time engineers are dedicated to the internal platform (not to product features)?&lt;/p&gt;

&lt;p&gt;1–2 engineers → operational layer. 5+ dedicated platform engineers → full IDP may make sense.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Are developers filing tickets to the platform team for environment restarts, clones, or access?&lt;/p&gt;

&lt;p&gt;If yes, that's an ops bottleneck — a self-service operations layer solves this faster than building an IDP.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;What is your current monthly Fargate compute spend on non-production environments?&lt;/p&gt;

&lt;p&gt;Run: &lt;code&gt;aws ce get-cost-and-usage --time-period Start=2026-05-01,End=2026-06-01 --granularity MONTHLY --filter '{"Dimensions":{"Key":"SERVICE","Values":["Amazon Elastic Container Service"]}}' --metrics BlendedCost&lt;/code&gt;. If &amp;gt;$1,000/mo → scheduling saves more than Fortem Starter costs.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What Humanitec actually is
&lt;/h2&gt;

&lt;p&gt;Humanitec is a graph-based platform orchestrator that enforces security, cost, and compliance policies on every deployment. It is Kubernetes-architected and cloud-agnostic.&lt;/p&gt;

&lt;p&gt;The current homepage headline is “Let AI build. On your terms.” — Humanitec has repositioned from “developer self-service platform” to “AI agent governance layer.” The product lets platform teams define resource templates (databases, caches, queues) that AI agents and human developers can provision in a standardized, policy-compliant way across multiple cloud providers.&lt;/p&gt;

&lt;p&gt;Compute targets claimed on the marketing site: EKS, GKE, AKS, VMs, Serverless. The reality for ECS teams is more nuanced. Humanitec's Container Driver — the mechanism that actually routes workloads to a compute target — was announced in January 2025 with an explicit restriction:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“As of today, the Container Driver can only be used with managed clusters (EKS, AKS or GKE.)”&lt;/p&gt;

&lt;p&gt;— &lt;a href="https://humanitec.com/blog/container-driver" rel="noopener noreferrer"&gt;Humanitec blog: Introducing the Container Driver (Jan 2025)&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AWS ECS and Fargate are not listed. The “serverless-ecs” runner type that appears in Humanitec docs refers to running Humanitec's own build agents on ECS compute — not routing your application workloads to ECS clusters. As of June 2026, humanitec.com/blog has zero posts about deploying to ECS.&lt;/p&gt;

&lt;p&gt;Humanitec's workload descriptor is called Score — a YAML format that abstracts a service away from any specific runtime. To deploy through Humanitec, teams rewrite their Terraform task definitions or Kubernetes manifests in Score. This is the right tradeoff for organizations standardizing across heterogeneous platforms. For ECS-only teams, it adds an abstraction layer on top of resources that already exist and work.&lt;/p&gt;

&lt;p&gt;Gartner Peer Insights reviewers describe the implementation honestly: Humanitec “requires you to build the developer interface and integrate existing tools. This means platform teams need to do substantial custom work to create the complete developer experience.” Humanitec is transparent about this — their MVP Program is a structured onboarding that guides a platform team through building that experience.&lt;/p&gt;

&lt;p&gt;Known customers: Western Union, BambooHR, Cimpress — large, multi-cloud organizations with dedicated platform teams. The product is enterprise-sold; it does not have a self-serve signup.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;KEY INSIGHT:&lt;/strong&gt; Humanitec is genuinely good at what it does — the absence of ECS support is not a flaw, it's a design decision. The question is whether the ECS team evaluating it is solving the same problem Humanitec was built to solve.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What Fortem actually is
&lt;/h2&gt;

&lt;p&gt;Fortem is a control plane for ECS Fargate fleet operations. It reads your existing AWS resources via tags and naming — Terraform stays your source of truth, nothing gets rewritten.&lt;/p&gt;

&lt;p&gt;Fortem connects to your AWS account via a read-heavy IAM role, discovers every ECS cluster, service, and task definition through tags and naming conventions, and adds an operations layer on top. It does not replace Terraform, does not manage deployments, and does not require you to learn a new workload descriptor format.&lt;/p&gt;

&lt;p&gt;What it adds: fleet-wide scheduling (stop all non-production environments at 7pm, restart at 9am, per-timezone), environment cloning (copy a staging environment with one API call), per-environment cost tracking, developer self-service with ECS-scoped RBAC so engineers can restart their own service without filing a ticket, and AI-assisted diagnostics that pull CloudWatch logs and task events when something is unhealthy.&lt;/p&gt;

&lt;p&gt;The typical customer is a platform engineering team of 1–3 people running 10–80 ECS Fargate environments across 1–3 AWS accounts. Their IaC is Terraform. They did not set out to build an internal developer platform — they wanted to stop being the bottleneck for every environment restart, clone, and schedule change. See how &lt;a href="https://dev.to/blog/ecs-fargate-cost-optimization/"&gt;ECS Fargate cost optimization&lt;/a&gt; compounds when scheduling, right-sizing, and Spot work together.&lt;/p&gt;

&lt;p&gt;What Fortem does not do: Fortem does not manage Kubernetes, does not provide a service catalog, does not handle multi-cloud deployments, and does not give you a Backstage-style developer portal. If those are your requirements, Fortem is not the right tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing — what you actually pay
&lt;/h2&gt;

&lt;p&gt;Humanitec Teams at $2,199/mo limits you to 5 users and 5 environments per project. For a fleet of 20+ ECS environments, you're on Pro at $5,499/mo before you've saved anything.&lt;/p&gt;

&lt;p&gt;Humanitec's pricing was verified June 2026 at &lt;a href="https://humanitec.com/pricing" rel="noopener noreferrer"&gt;humanitec.com/pricing&lt;/a&gt;. Both cloud-hosted tiers offer a free trial. An annual discount of 10% applies to both plans. Humanitec also lists a separate AWS Marketplace SKU at $999/mo for up to 15 users — the terms differ from the main site plans.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Humanitec Teams $2,199/mo -   5 users · 2 projects · 5 envs/project -   Email su&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fortem Starter $790/mo -   Up to 20 environments · 1 AWS account -   Unlimited u&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The ROI math for Fortem is straightforward. Scheduling 10 dev and staging environments (8 services each, 0.5 vCPU / 1 GB per task) to business hours saves $1,013/month in Fargate compute — using published AWS pricing at $0.04048/vCPU-hr. The Starter plan pays for itself before any other feature. For the full calculation, see the &lt;a href="https://dev.to/blog/aws-fargate-pricing-real-costs/"&gt;Fargate real cost breakdown&lt;/a&gt; — including the fixed overhead (ALB, NAT Gateway, CloudWatch) that scheduling doesn't eliminate.&lt;/p&gt;
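
&lt;p&gt;The arithmetic behind that figure can be sketched as follows. This is a rough estimate, not the article's exact model: the memory rate ($0.004445/GB-hr), the 730-hour month, and the 50 on-hours per week (10 hours on 5 weekdays) are assumptions added here, and the function name is hypothetical. Only the vCPU rate and the fleet shape come from the text above.&lt;/p&gt;

```shell
# Rough estimate of Fargate compute saved by stopping idle environments.
# Assumed inputs (not from the article): mem_rate, month_hours, on_hours_per_week.
fargate_idle_savings() {
  awk -v envs=10 -v services=8 -v vcpu=0.5 -v mem_gb=1 \
      -v vcpu_rate=0.04048 -v mem_rate=0.004445 \
      -v on_hours_per_week=50 'BEGIN {
    tasks       = envs * services                          # one task per service
    hourly_cost = tasks * (vcpu * vcpu_rate + mem_gb * mem_rate)
    month_hours = 730                                      # average hours per month
    on_hours    = on_hours_per_week * 52 / 12              # scheduled hours per month
    idle_hours  = month_hours - on_hours                   # hours no longer billed
    printf "%.2f\n", hourly_cost * idle_hours
  }'
}

fargate_idle_savings
```

&lt;p&gt;Under these assumptions the result lands within a dollar of the $1,013/month quoted above. Change the &lt;code&gt;-v&lt;/code&gt; values to match your own fleet and region.&lt;/p&gt;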

&lt;p&gt;Humanitec ROI comes from platform team leverage — fewer tickets, faster developer onboarding, consistent environments across teams. That is real value, but it requires a platform team large enough to build and maintain the developer experience layer Humanitec expects you to create. The Gartner review notes this plainly: “platform teams need to do substantial custom work to create the complete developer experience.”&lt;/p&gt;

&lt;h2&gt;
  
  
  ECS Fargate specifically
&lt;/h2&gt;

&lt;p&gt;Humanitec has no ECS-specific features. Fortem was built exclusively for ECS Fargate — every feature maps to a real ECS operational problem.&lt;/p&gt;

&lt;p&gt;Humanitec on ECS&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  ✗ No ECS workload deployment: Container Driver supports EKS, AKS, GKE only&lt;/li&gt;
&lt;li&gt;  ✗ No environment scheduling for ECS clusters or services&lt;/li&gt;
&lt;li&gt;  ✗ No ECS fleet visibility: no per-environment cost or status dashboard&lt;/li&gt;
&lt;li&gt;  ✗ No environment cloning for ECS task definitions&lt;/li&gt;
&lt;li&gt;  ✗ No ECS-specific diagnostics: no CloudWatch log integration&lt;/li&gt;
&lt;li&gt;  ✗ Zero ECS-specific blog posts or documentation as of June 2026&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fortem on ECS&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  ✓ Reads ECS services, task definitions, and CloudWatch metrics natively via AWS API&lt;/li&gt;
&lt;li&gt;  ✓ Fleet-wide scheduling: stop/start all environments by timezone on a cron schedule&lt;/li&gt;
&lt;li&gt;  ✓ Per-environment cost tracking using ECS service CPU/memory allocations&lt;/li&gt;
&lt;li&gt;  ✓ Environment cloning: copy a full ECS environment with one API call&lt;/li&gt;
&lt;li&gt;  ✓ Developer self-service: scoped IAM lets engineers restart their own services&lt;/li&gt;
&lt;li&gt;  ✓ AI diagnostics: surfaces unhealthy tasks with CloudWatch context automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;KEY INSIGHT:&lt;/strong&gt; The Humanitec “serverless-ecs” runner type that appears in their docs is not what it sounds like. It refers to running Humanitec's own CI build agents on ECS compute — not routing your application services to ECS clusters. If you found that in a Google search, it is not ECS workload support.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  When Humanitec is the right choice
&lt;/h2&gt;

&lt;p&gt;Humanitec is right when you need a company-wide platform across multiple clouds and platforms — not when your problem is operating an ECS fleet.&lt;/p&gt;

&lt;p&gt;The strongest signal that Humanitec fits: your engineering org runs workloads on at least two of EKS, GKE, AKS, Lambda, or VMs, and you want a single interface for developers to provision infrastructure regardless of which runtime it lands on. Score (Humanitec's workload descriptor) earns its abstraction cost when the same service needs to deploy to different runtimes depending on environment or team.&lt;/p&gt;

&lt;p&gt;Humanitec fits when&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  You need a formal IDP for 50+ engineers across AWS, GCP, and Azure&lt;/li&gt;
&lt;li&gt;  You're building a dedicated platform team with a charter to standardize all of engineering&lt;/li&gt;
&lt;li&gt;  You already use Backstage or Port and want an orchestration layer on top&lt;/li&gt;
&lt;li&gt;  You're willing to invest 2–6 months in implementation and have 5+ dedicated platform engineers&lt;/li&gt;
&lt;li&gt;  You need AI agent governance: controlled AI provisioning across heterogeneous platforms&lt;/li&gt;
&lt;li&gt;  Your compliance requirements need multi-cloud deployment standardization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Humanitec's MVP Program offers structured onboarding with a platform architect. Their documentation targets a first working example environment in the initial session — not a production IDP. The actual implementation timeline for a working internal developer platform, with real services, resource drivers, and a developer-facing interface, is measured in months.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Fortem is the right choice
&lt;/h2&gt;

&lt;p&gt;Fortem is right when your infrastructure is primarily AWS ECS Fargate and your problem is operational — managing environments, controlling costs, and enabling self-service without a full IDP build.&lt;/p&gt;

&lt;p&gt;Fortem fits when&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Your stack is primarily or entirely AWS ECS Fargate, with no multi-cloud requirement&lt;/li&gt;
&lt;li&gt;  You have 10–80 environments and growing compute spend on idle dev and staging&lt;/li&gt;
&lt;li&gt;  Your platform team is 1–3 people, not 5+ engineers dedicated to IDP work&lt;/li&gt;
&lt;li&gt;  You use Terraform and don't want to learn Score or rewrite task definitions&lt;/li&gt;
&lt;li&gt;  You need results in days, not months, with no multi-month implementation project&lt;/li&gt;
&lt;li&gt;  Developers are filing tickets for environment restarts, clones, or schedule changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fortem onboarding runs 7 business days: a Fortem engineer audits your AWS setup, tags environments, configures per-timezone schedules, and hands you the dashboard. No Score migration, no new abstraction layer. You keep your Terraform as the source of truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Side-by-side at a glance
&lt;/h2&gt;

&lt;p&gt;The table below shows the practical differences. If a row is missing from one column, that tool doesn't have that capability.&lt;/p&gt;

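&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;&lt;/th&gt;&lt;th&gt;Humanitec&lt;/th&gt;&lt;th&gt;Fortem&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Pricing&lt;/td&gt;&lt;td&gt;$2,199/mo (Teams) · $5,499/mo (Pro)&lt;/td&gt;&lt;td&gt;$790/mo (Starter) · $2,490/mo (Scale)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Users included&lt;/td&gt;&lt;td&gt;5 (Teams) · 50 (Pro)&lt;/td&gt;&lt;td&gt;Unlimited on all plans&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Environments&lt;/td&gt;&lt;td&gt;5/project (Teams) · Unlimited (Pro)&lt;/td&gt;&lt;td&gt;20 (Starter) · 80 (Scale)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;ECS Fargate support&lt;/td&gt;&lt;td&gt;Not a workload target&lt;/td&gt;&lt;td&gt;Purpose-built&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Kubernetes support&lt;/td&gt;&lt;td&gt;EKS, AKS, GKE via Container Driver&lt;/td&gt;&lt;td&gt;Not supported&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Terraform required&lt;/td&gt;&lt;td&gt;Score replaces task definitions&lt;/td&gt;&lt;td&gt;Reads existing state, no rewrite&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Self-service for devs&lt;/td&gt;&lt;td&gt;Build it (custom developer portal)&lt;/td&gt;&lt;td&gt;Included, with ECS-scoped RBAC&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Environment scheduling&lt;/td&gt;&lt;td&gt;Not available&lt;/td&gt;&lt;td&gt;Fleet-wide, per-timezone&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Environment cloning&lt;/td&gt;&lt;td&gt;Not available&lt;/td&gt;&lt;td&gt;Single API call&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Onboarding time&lt;/td&gt;&lt;td&gt;Months (IDP build required)&lt;/td&gt;&lt;td&gt;7 business days&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Right for&lt;/td&gt;&lt;td&gt;Multi-cloud, 50+ engineers, Kubernetes&lt;/td&gt;&lt;td&gt;ECS Fargate, 1–3 platform engineers&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;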

&lt;p&gt;Humanitec pricing: humanitec.com/pricing, verified June 2026. Fortem pricing: fortem.dev/pricing, verified June 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  If you read this, you might also want to know
&lt;/h2&gt;

&lt;p&gt;What does an actual platform engineering stack look like for an ECS team?&lt;/p&gt;

&lt;p&gt;Most ECS teams run Terraform for IaC, GitHub Actions or CircleCI for CI/CD, Datadog or CloudWatch for observability, and a Slack-connected ops runbook for manual operations. Fortem slots in as the fleet operations layer — the piece that handles scheduling, self-service, and cost tracking that none of those tools cover.&lt;/p&gt;

&lt;p&gt;Can I start with Fortem and add a developer portal like Backstage later?&lt;/p&gt;

&lt;p&gt;Yes. Fortem operates at the ECS fleet operations layer — it doesn't own your service catalog, developer portal, or deployment pipeline. Backstage (or Port, or Cortex) can sit alongside Fortem, each covering different surfaces. Many teams start with Fortem for immediate cost and ops wins, then add a portal when the team grows large enough to need a service catalog.&lt;/p&gt;

&lt;p&gt;What does Humanitec's Score language actually require me to rewrite?&lt;/p&gt;

&lt;p&gt;Score replaces the deployment configuration for each service — your ECS task definitions, environment variables, resource dependencies (databases, caches, queues). You'd define each service in a score.yaml and map its resources to Humanitec resource types. For a team with 40 ECS services, this is a significant migration. The payoff is portability across Kubernetes clusters and other runtimes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Does Fortem replace Humanitec?
&lt;/h3&gt;

&lt;p&gt;No — they solve different problems. Humanitec is a general-purpose platform orchestrator for standardizing deployments across multiple clouds and Kubernetes clusters. Fortem is a control plane specifically for AWS ECS Fargate fleet operations. If your stack is ECS Fargate, Fortem addresses the actual operational problems (scheduling, fleet visibility, self-service) without requiring you to build an IDP.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does Humanitec support ECS Fargate workload deployment natively?
&lt;/h3&gt;

&lt;p&gt;Not through its primary deployment mechanism. Humanitec's Container Driver — the tool used to route workloads to compute targets — explicitly states: 'the Container Driver can only be used with managed clusters (EKS, AKS or GKE).' ECS and Fargate are not supported workload targets. The 'serverless-ecs' runner type in Humanitec docs refers to running Humanitec's own build agents on ECS infrastructure, not deploying user applications to ECS.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use Fortem and Humanitec together?
&lt;/h3&gt;

&lt;p&gt;In theory, yes — they operate at different layers. If you're running both ECS Fargate and Kubernetes workloads in a hybrid architecture, Humanitec could handle Kubernetes orchestration while Fortem handles ECS fleet operations. In practice, most ECS-first teams don't need both — Humanitec's value proposition (standardized developer experience across multi-cloud, multi-platform infrastructure) is a different problem than what ECS teams typically face.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Humanitec's Score language and do I have to learn it?
&lt;/h3&gt;

&lt;p&gt;Score is Humanitec's workload descriptor — a YAML format that abstracts your service definition away from any specific infrastructure target. To use Humanitec fully, you describe your services in Score rather than in Terraform task definitions or Kubernetes manifests. For teams that need to deploy the same service across EKS, GKE, and Lambda, Score makes sense. For ECS-only teams who already have Terraform task definitions, Score is an additional abstraction layer with no payoff.&lt;/p&gt;

&lt;h3&gt;
  
  
  How long does Fortem onboarding take compared to Humanitec?
&lt;/h3&gt;

&lt;p&gt;Fortem onboarding runs 7 business days: a Fortem engineer audits your AWS setup, imports your environments, configures schedules per timezone, and hands you the keys. Humanitec's MVP Program (their structured onboarding) targets a working MVP after the first session but typically involves multiple months of platform team work to build the developer interface, configure resource drivers, and integrate existing tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running 10+ ECS environments?
&lt;/h2&gt;

&lt;p&gt;We'll show you what fleet operations looks like without building an IDP — scheduling, cloning, cost tracking, and dev self-service, live in your AWS account in 7 days.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/audit"&gt;Run Fleet Audit →&lt;/a&gt; · &lt;a href="https://dev.to/book"&gt;Book a 20-min call →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Response within 4 hours, weekdays.&lt;/p&gt;

&lt;p&gt;Worth reading&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/ecs-vs-eks/"&gt;Comparison · ECS vs EKS: Which AWS Container Service? Humanitec's Container Driver explicitly excludes ECS. If you're on Fargate, the comparison starts here.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/blog/ecs-fargate-cost-optimization/"&gt;Guide · How to Cut AWS ECS Fargate Costs by 65%. Scheduling, right-sizing, Spot, and orphaned environments: the four methods that take a 12-environment fleet from $1,730 to $380/month.&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;All ECS platform comparisons:&lt;/strong&gt; &lt;a href="https://fortem.dev/versus" rel="noopener noreferrer"&gt;fortem.dev/versus&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ecs</category>
      <category>platform</category>
      <category>comparison</category>
    </item>
    <item>
      <title>Who Restarted Prod? How to Find It in CloudTrail</title>
      <dc:creator>Matt</dc:creator>
      <pubDate>Fri, 19 Jun 2026 23:50:07 +0000</pubDate>
      <link>https://dev.to/dspv/who-restarted-prod-how-to-find-it-in-cloudtrail-2fda</link>
      <guid>https://dev.to/dspv/who-restarted-prod-how-to-find-it-in-cloudtrail-2fda</guid>
      <description>&lt;h1&gt;
  
  
  Who Restarted Prod? ECS Audit in CloudTrail
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://fortem.dev/blog/ecs-audit-log-compliance" rel="noopener noreferrer"&gt;https://fortem.dev/blog/ecs-audit-log-compliance&lt;/a&gt;&lt;br&gt;
Every ECS change — UpdateService, StopTask, RunTask — lands in CloudTrail with who, when, and from where. Three CLI commands find the culprit in under 2 minutes.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Use Case · June 16, 2026 · 8 min read&lt;/p&gt;

&lt;p&gt;ecs-audit-log · ecs-compliance-logging · aws-ecs-cloudtrail&lt;/p&gt;


&lt;p&gt;Your ECS service restarted. Or a task was manually stopped. Or desiredCount dropped to zero and nobody admits it. The ECS console shows WHAT happened — not WHO. CloudTrail has the answer, and three CLI commands get you there in under two minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  CloudTrail captures every ECS API call (UpdateService, StopTask, RunTask, RegisterTaskDefinition) with who, when, and from where.&lt;/li&gt;
&lt;li&gt;  Event History is free for the last 90 days. Three CLI commands find the culprit in under 2 minutes.&lt;/li&gt;
&lt;li&gt;  The userIdentity field tells you human vs CI/CD vs AWS service. Root account activity in ECS is always suspicious.&lt;/li&gt;
&lt;li&gt;  Download the skill file: an AI agent runs the full fleet audit and produces a structured report automatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why the ECS events tab doesn't tell you who did it
&lt;/h2&gt;

&lt;p&gt;ECS events show WHAT happened — "service updated", "task stopped" — but not WHO. The userIdentity lives in CloudTrail, not in the ECS console. That's the gap most teams waste an hour trying to bridge.&lt;/p&gt;

&lt;p&gt;You open the ECS service page. Under Events: "service my-api has started 1 tasks" at 14:23, "service my-api has stopped 1 running tasks" at 14:21. Something stopped your service and triggered a redeploy. The ECS console stops there — it doesn't record the API caller, the IAM identity, or whether it was a human clicking the console or Terraform applying a change.&lt;/p&gt;

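&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;ECS Events tab&lt;/th&gt;&lt;th&gt;CloudTrail&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Shows WHAT happened&lt;/td&gt;&lt;td&gt;Shows WHO did it, WHEN, and FROM WHERE&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Service-level messages only&lt;/td&gt;&lt;td&gt;All API calls including StopTask, UpdateService, RunTask&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;No API caller info&lt;/td&gt;&lt;td&gt;userIdentity: human, CI/CD role, or AWS service&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Kept for a few hours&lt;/td&gt;&lt;td&gt;90-day Event History, free&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Not queryable&lt;/td&gt;&lt;td&gt;Searchable by event name, username, resource, IP&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;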

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;KEY INSIGHT:&lt;/strong&gt; CloudTrail records every ECS API call automatically, with no setup required. The 90-day Event History is free: you aren't paying extra for it, it's already there. The only thing missing is knowing where to look.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Three commands to find the culprit in under 2 minutes
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;aws cloudtrail lookup-events&lt;/code&gt; with &lt;code&gt;AttributeKey=EventName&lt;/code&gt; filters to specific actions. Pipe through jq to extract userIdentity.userName, eventTime, and sourceIPAddress. Covers the last 90 days at no charge.&lt;/p&gt;

&lt;p&gt;Find who stopped a task&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws cloudtrail lookup-events &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--lookup-attributes&lt;/span&gt; &lt;span class="nv"&gt;AttributeKey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;EventName,AttributeValue&lt;span class="o"&gt;=&lt;/span&gt;StopTask &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Events[*].CloudTrailEvent'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; text | &lt;span class="se"&gt;\&lt;/span&gt;
jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'. | {
  time: .eventTime,
  who: (
    if .userIdentity.type == "IAMUser" then .userIdentity.userName
    elif .userIdentity.type == "AssumedRole" then .userIdentity.sessionContext.sessionIssuer.userName
    else .userIdentity.type
    end
  ),
  from: .sourceIPAddress,
  via: .userAgent,
  task: .requestParameters.task
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Find who updated a service (deployments, scale changes)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws cloudtrail lookup-events &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--lookup-attributes&lt;/span&gt; &lt;span class="nv"&gt;AttributeKey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;EventName,AttributeValue&lt;span class="o"&gt;=&lt;/span&gt;UpdateService &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Events[*].CloudTrailEvent'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; text | &lt;span class="se"&gt;\&lt;/span&gt;
jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'. | {
  time: .eventTime,
  who: (
    if .userIdentity.type == "IAMUser" then .userIdentity.userName
    elif .userIdentity.type == "AssumedRole" then .userIdentity.sessionContext.sessionIssuer.userName
    else .userIdentity.type
    end
  ),
  via: .userAgent,
  service: .requestParameters.service,
  desiredCount: .requestParameters.desiredCount
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Narrow by specific user or role&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find everything a specific IAM user did in the last 24h&lt;/span&gt;
aws cloudtrail lookup-events &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--lookup-attributes&lt;/span&gt; &lt;span class="nv"&gt;AttributeKey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Username,AttributeValue&lt;span class="o"&gt;=&lt;/span&gt;john.smith &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'24 hours ago'&lt;/span&gt; +%Y-%m-%dT%H:%M:%SZ 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-v-24H&lt;/span&gt; +%Y-%m-%dT%H:%M:%SZ&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Events[*].{Time:EventTime,Event:EventName}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rate limit: &lt;code&gt;lookup-events&lt;/code&gt; is capped at 2 requests/second per account per region. If you're scripting across many event types, add a 0.5s sleep between calls. Each API request returns at most 50 events; the AWS CLI follows pagination for you by default, and &lt;code&gt;--max-items&lt;/code&gt; with &lt;code&gt;--starting-token&lt;/code&gt; lets you page through results manually.&lt;/p&gt;
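
&lt;p&gt;A minimal sketch of the paced loop described above. The function name and the &lt;code&gt;AWS_CMD&lt;/code&gt; override are hypothetical; the override exists only so the loop can be dry-run without touching AWS. The 0.5s pause is the figure from the paragraph above and assumes a &lt;code&gt;sleep&lt;/code&gt; that accepts fractional seconds.&lt;/p&gt;

```shell
# Look up several ECS event types one after another, pausing between calls.
# AWS_CMD is a hypothetical override (defaults to the real aws CLI).
lookup_ecs_events() {
  for event_name in "$@"; do
    "${AWS_CMD:-aws}" cloudtrail lookup-events \
      --lookup-attributes "AttributeKey=EventName,AttributeValue=${event_name}" \
      --query 'Events[*].CloudTrailEvent' \
      --output text
    sleep 0.5   # keep separate lookups under 2 requests/second
  done
}
```

&lt;p&gt;For example, &lt;code&gt;lookup_ecs_events StopTask UpdateService DeleteService | jq -r '.eventName'&lt;/code&gt; would list the matching events in call order. Setting &lt;code&gt;AWS_CMD=echo&lt;/code&gt; prints the commands instead of running them.&lt;/p&gt;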

&lt;h2&gt;
  
  
  Which ECS events map to which actions
&lt;/h2&gt;

&lt;p&gt;UpdateService = scale change or deployment. StopTask = manual kill. RegisterTaskDefinition = new image or config. RunTask = standalone task launch. Each has a different userIdentity pattern worth knowing.&lt;/p&gt;

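&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;CloudTrail eventName&lt;/th&gt;&lt;th&gt;Who typically calls it&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Service scaled up/down&lt;/td&gt;&lt;td&gt;UpdateService&lt;/td&gt;&lt;td&gt;Human, CI/CD, autoscaler&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Deployment triggered&lt;/td&gt;&lt;td&gt;UpdateService + RunTask&lt;/td&gt;&lt;td&gt;CI/CD pipeline&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Task manually stopped&lt;/td&gt;&lt;td&gt;StopTask&lt;/td&gt;&lt;td&gt;Human, script, ECS agent&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;New task definition&lt;/td&gt;&lt;td&gt;RegisterTaskDefinition&lt;/td&gt;&lt;td&gt;CI/CD pipeline, human&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Service created/deleted&lt;/td&gt;&lt;td&gt;CreateService / DeleteService&lt;/td&gt;&lt;td&gt;Human, Terraform&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Cluster deleted&lt;/td&gt;&lt;td&gt;DeleteCluster&lt;/td&gt;&lt;td&gt;Human, Terraform&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;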

&lt;p&gt;The most ambiguous one is &lt;code&gt;StopTask&lt;/code&gt;. It appears in CloudTrail when a human manually stops a task, when a script does it, and when ECS itself stops a task during a rolling deployment. Check &lt;code&gt;userIdentity.invokedBy&lt;/code&gt; — if it says &lt;code&gt;ecs.amazonaws.com&lt;/code&gt;, ECS triggered the stop internally during service orchestration, not a human.&lt;/p&gt;
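
&lt;p&gt;One way to apply that check in bulk is a small jq filter over the StopTask events. This is a sketch: the function name and the two labels are made up for illustration, and it assumes the input is a stream of CloudTrail event objects like the one produced by the lookup command earlier.&lt;/p&gt;

```shell
# Label each StopTask event by initiator, based on userIdentity.invokedBy.
# Reads CloudTrail event JSON objects on stdin; prints time, label, task (TSV).
label_stop_initiator() {
  jq -r '
    (if (.userIdentity.invokedBy // "") == "ecs.amazonaws.com"
     then "ecs-internal"      # ECS stopped the task during orchestration
     else "caller"            # a human, script, or pipeline called StopTask
     end) as $initiator
    | [.eventTime, $initiator, (.requestParameters.task // "-")]
    | @tsv
  '
}
```

&lt;p&gt;Usage: append &lt;code&gt;| label_stop_initiator&lt;/code&gt; in place of the jq stage of the StopTask lookup, then filter for the &lt;code&gt;caller&lt;/code&gt; rows to see only stops that something outside ECS requested.&lt;/p&gt;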

&lt;h2&gt;
  
  
  Decoding userIdentity: human, CI/CD, or AWS service
&lt;/h2&gt;

&lt;p&gt;userIdentity.type tells you who called the API: IAMUser = human, AssumedRole = CI/CD or Lambda, AWSService = autoscaler or ECS itself. Root type should never appear in ECS — alert immediately if it does.&lt;/p&gt;

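&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;userIdentity.type&lt;/th&gt;&lt;th&gt;Meaning&lt;/th&gt;&lt;th&gt;How to extract the name&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;IAMUser&lt;/td&gt;&lt;td&gt;Human with IAM credentials&lt;/td&gt;&lt;td&gt;&lt;code&gt;.userIdentity.userName&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;AssumedRole&lt;/td&gt;&lt;td&gt;CI/CD, Lambda, or human via role&lt;/td&gt;&lt;td&gt;&lt;code&gt;.userIdentity.sessionContext.sessionIssuer.userName&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Root&lt;/td&gt;&lt;td&gt;AWS root account (alert immediately)&lt;/td&gt;&lt;td&gt;type = Root is the signal&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;AWSService&lt;/td&gt;&lt;td&gt;AWS-owned service (autoscaling, ECS agent)&lt;/td&gt;&lt;td&gt;&lt;code&gt;.userIdentity.invokedBy&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;AWSAccount&lt;/td&gt;&lt;td&gt;Cross-account call from another AWS account&lt;/td&gt;&lt;td&gt;&lt;code&gt;.userIdentity.accountId&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;FederatedUser&lt;/td&gt;&lt;td&gt;SSO / identity provider user&lt;/td&gt;&lt;td&gt;&lt;code&gt;.userIdentity.principalId&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;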

&lt;p&gt;The tricky one is &lt;code&gt;AssumedRole&lt;/code&gt;. When a GitHub Actions pipeline runs &lt;code&gt;aws ecs update-service&lt;/code&gt;, the CloudTrail event shows &lt;code&gt;type: AssumedRole&lt;/code&gt; and the ARN of the role. The human-readable role name is in &lt;code&gt;sessionContext.sessionIssuer.userName&lt;/code&gt;. That's the field to surface in your audit report — not the full ARN.&lt;/p&gt;
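
&lt;p&gt;The per-type extraction rules can be folded into one jq filter, sketched here. The function name is hypothetical and the &lt;code&gt;unknown&lt;/code&gt; fallback is an addition for events where the expected field is missing; the field paths are the ones listed for each &lt;code&gt;userIdentity.type&lt;/code&gt; above.&lt;/p&gt;

```shell
# Print a readable caller name for each CloudTrail event read from stdin.
resolve_caller() {
  jq -r '
    .userIdentity as $id
    | if   $id.type == "IAMUser"       then $id.userName
      elif $id.type == "AssumedRole"   then $id.sessionContext.sessionIssuer.userName
      elif $id.type == "AWSService"    then $id.invokedBy
      elif $id.type == "AWSAccount"    then $id.accountId
      elif $id.type == "FederatedUser" then $id.principalId
      else $id.type                    # Root and any other type: report the type
      end
    | . // "unknown"                   # assumed fallback when the field is absent
  '
}
```

&lt;p&gt;Piping any of the lookup commands from earlier into &lt;code&gt;resolve_caller&lt;/code&gt; yields one name per event, with &lt;code&gt;Root&lt;/code&gt; printed verbatim so it stands out in the output.&lt;/p&gt;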

&lt;p&gt;To distinguish console vs CLI vs Terraform, use the &lt;code&gt;userAgent&lt;/code&gt; field:&lt;/p&gt;

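&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;userAgent value&lt;/th&gt;&lt;th&gt;What called the API&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;console.amazonaws.com&lt;/code&gt;&lt;/td&gt;&lt;td&gt;AWS console (someone clicked)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;aws-cli/2.*&lt;/code&gt;&lt;/td&gt;&lt;td&gt;AWS CLI (manual or script)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;Terraform/1.* terraform-provider-aws/*&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Terraform apply&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;github-actions/*&lt;/code&gt;&lt;/td&gt;&lt;td&gt;GitHub Actions CI/CD&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;ECS Console&lt;/code&gt;&lt;/td&gt;&lt;td&gt;ECS service console actions&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;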
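
&lt;p&gt;The same mapping can be applied mechanically with jq, as in this sketch. The function name and the short labels are invented for illustration, and the Terraform check uses a case-insensitive substring match on the assumption that real user agent strings may carry extra tokens before the Terraform version.&lt;/p&gt;

```shell
# Classify the userAgent of each CloudTrail event read from stdin.
classify_user_agent() {
  jq -r '
    (.userAgent // "") as $ua
    | if   $ua == "console.amazonaws.com"        then "console"
      elif $ua == "ECS Console"                  then "ecs-console"
      elif ($ua | startswith("aws-cli/"))        then "aws-cli"
      elif ($ua | test("terraform"; "i"))        then "terraform"
      elif ($ua | startswith("github-actions/")) then "github-actions"
      else "other"
      end
  '
}
```

&lt;p&gt;Anything labeled &lt;code&gt;other&lt;/code&gt; is worth reading by hand: SDK calls from application code and third-party tools will land there.&lt;/p&gt;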

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;KEY INSIGHT:&lt;/strong&gt; If &lt;code&gt;userIdentity.type&lt;/code&gt; is &lt;code&gt;Root&lt;/code&gt;, stop everything else and investigate. Root credentials should never be used for routine ECS operations. A Root call in CloudTrail means either someone is using the root account directly (a security failure) or credentials were compromised.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Alerting in real time: EventBridge rule for critical ECS changes
&lt;/h2&gt;

&lt;p&gt;EventBridge can trigger a notification within seconds of a StopTask or UpdateService call, before you notice the incident. A Terraform rule and its target set this up; the only other piece you need is an SNS topic to receive the alert.&lt;/p&gt;

&lt;p&gt;Searching CloudTrail after an incident is reactive. EventBridge makes it proactive: you define a rule that matches specific CloudTrail events, and EventBridge triggers an SNS notification, Lambda, or Slack webhook immediately when the event occurs. For &lt;a href="https://dev.to/blog/ecs-fargate-best-practices/"&gt;teams running 10+ ECS environments&lt;/a&gt;, catching a DeleteService before the on-call rotation starts saves significant incident response time.&lt;/p&gt;

&lt;p&gt;Terraform: EventBridge rule for critical ECS events&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_cloudwatch_event_rule"&lt;/span&gt; &lt;span class="s2"&gt;"ecs_critical"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ecs-critical-changes"&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Alert on destructive or suspicious ECS API calls"&lt;/span&gt;

  &lt;span class="nx"&gt;event_pattern&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;source&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"aws.ecs"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;detail-type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"AWS API Call via CloudTrail"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;detail&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;eventSource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"ecs.amazonaws.com"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;eventName&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="s2"&gt;"StopTask"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"DeleteService"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"DeleteCluster"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"UpdateService"&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_cloudwatch_event_target"&lt;/span&gt; &lt;span class="s2"&gt;"ecs_critical_sns"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;rule&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_cloudwatch_event_rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ecs_critical&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;target_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"SendToSNS"&lt;/span&gt;
  &lt;span class="nx"&gt;arn&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_sns_topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;alerts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;

  &lt;span class="nx"&gt;input_transformer&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;input_paths&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;event&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"$.detail.eventName"&lt;/span&gt;
      &lt;span class="nx"&gt;who&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"$.detail.userIdentity.sessionContext.sessionIssuer.userName"&lt;/span&gt;
      &lt;span class="nx"&gt;time&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"$.time"&lt;/span&gt;
      &lt;span class="nx"&gt;service&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"$.detail.requestParameters.service"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;input_template&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"\"ECS alert: &amp;lt;event&amp;gt; on &amp;lt;service&amp;gt; by &amp;lt;who&amp;gt; at &amp;lt;time&amp;gt;\""&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For &lt;code&gt;UpdateService&lt;/code&gt;, add a second rule specifically for scale-to-zero: filter where &lt;code&gt;requestParameters.desiredCount = 0&lt;/code&gt;. That's the most common accidental incident — someone running a cleanup script that hits the wrong environment.&lt;/p&gt;
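
&lt;p&gt;A minimal sketch of that scale-to-zero rule, assuming the &lt;code&gt;aws_sns_topic.alerts&lt;/code&gt; topic used by the target above. The resource labels, rule name, and target ID are hypothetical. The pattern relies on an exact numeric match on &lt;code&gt;desiredCount&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;# Hypothetical names. Matches only UpdateService calls that set desiredCount to 0.
resource "aws_cloudwatch_event_rule" "ecs_scale_to_zero" {
  name        = "ecs-scale-to-zero"
  description = "UpdateService calls that scale an ECS service to zero"

  event_pattern = jsonencode({
    source        = ["aws.ecs"]
    "detail-type" = ["AWS API Call via CloudTrail"]
    detail = {
      eventSource = ["ecs.amazonaws.com"]
      eventName   = ["UpdateService"]
      requestParameters = {
        desiredCount = [0] # exact numeric match
      }
    }
  })
}

# Reuses the existing alerts topic; the topic policy must allow EventBridge to publish.
resource "aws_cloudwatch_event_target" "ecs_scale_to_zero_sns" {
  rule      = aws_cloudwatch_event_rule.ecs_scale_to_zero.name
  target_id = "ScaleToZeroToSNS"
  arn       = aws_sns_topic.alerts.arn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping this as a separate rule means routine &lt;code&gt;UpdateService&lt;/code&gt; deploys can go to a quieter channel while scale-to-zero pages someone.&lt;/p&gt;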

&lt;h2&gt;
  
  
  The Oct 2025 addition: ECS CloudTrail data events
&lt;/h2&gt;

&lt;p&gt;Since October 2025, ECS supports CloudTrail data events for ContainerInstance agent API activity (ecs:Poll, ecs:StartTelemetrySession). These aren't in Event History — they require a CloudTrail trail or CloudTrail Lake.&lt;/p&gt;

&lt;p&gt;AWS management events (UpdateService, StopTask, etc.) are what most teams need for incident response. The October 2025 addition is different: ECS now supports CloudTrail &lt;em&gt;data events&lt;/em&gt; for ContainerInstance agent API calls — the low-level polling activity between the ECS agent and the control plane.&lt;/p&gt;

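&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Management events&lt;/th&gt;
&lt;th&gt;Data events (Oct 2025)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What they capture&lt;/td&gt;
&lt;td&gt;UpdateService, StopTask, RunTask, etc.&lt;/td&gt;
&lt;td&gt;ecs:Poll, ecs:StartTelemetrySession, ecs:PutSystemLogEvents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Free (Event History)&lt;/td&gt;
&lt;td&gt;Additional CloudTrail charges&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;In Event History?&lt;/td&gt;
&lt;td&gt;Yes (90 days)&lt;/td&gt;
&lt;td&gt;No (trail or Lake required)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Who needs them&lt;/td&gt;
&lt;td&gt;Everyone (incident response)&lt;/td&gt;
&lt;td&gt;EC2 launch type, compliance auditing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resource type&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;AWS::ECS::ContainerInstance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;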

&lt;p&gt;For most ECS Fargate teams, data events aren't needed for incident response — management events cover UpdateService and StopTask which is where incidents come from. Data events matter if you run &lt;strong&gt;EC2 launch type&lt;/strong&gt; and need to audit ContainerInstance registration activity, or if compliance requires a full record of agent-to-control-plane communication. Enable them only if you have a specific requirement — at scale, ContainerInstance polling events generate significant volume and cost. Details in the &lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/logging-using-cloudtrail.html" rel="noopener noreferrer"&gt;ECS CloudTrail logging docs&lt;/a&gt;.&lt;/p&gt;
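
&lt;p&gt;If you do have that requirement, a trail with an advanced event selector is one way to turn the data events on. This is a sketch under stated assumptions: the trail name and bucket name are hypothetical, and the bucket is assumed to exist already with a policy that lets CloudTrail write to it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;# Hypothetical trail and bucket names.
resource "aws_cloudtrail" "ecs_data_events" {
  name           = "ecs-container-instance-data-events"
  s3_bucket_name = "example-cloudtrail-logs" # assumption: an existing bucket

  advanced_event_selector {
    name = "ECS ContainerInstance data events"

    # Data events only; management events are already in Event History.
    field_selector {
      field  = "eventCategory"
      equals = ["Data"]
    }

    # Resource type for the ECS agent API activity described above.
    field_selector {
      field  = "resources.type"
      equals = ["AWS::ECS::ContainerInstance"]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;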

&lt;h2&gt;
  
  
  Download the skill file — let the AI agent do the audit
&lt;/h2&gt;

&lt;p&gt;The skill file instructs an AI agent to pull all critical ECS CloudTrail events from the last 24 hours across every cluster in your account and produce a structured "who did what" report. Read-only — no changes applied.&lt;/p&gt;


&lt;p&gt;The agent lists all clusters, runs &lt;code&gt;lookup-events&lt;/code&gt; for each critical event type, decodes the userIdentity, and produces a structured output: "Service X was updated at HH:MM by role deploy-prod via GitHub Actions from IP 140.82.114.3." It also flags Root account activity, unexpected source IPs, and scale-to-zero incidents. For teams where "who did this?" is a recurring post-incident question, this is the 2-minute version of the 20-minute manual process.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"To identify the user who initiates a StopTask API call, view StopTask in AWS CloudTrail for userIdentity information."&lt;/p&gt;

&lt;p&gt;— &lt;a href="https://repost.aws/knowledge-center/ecs-running-task-count-change" rel="noopener noreferrer"&gt;AWS Knowledge Center: Troubleshoot running task count changes in ECS&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;strong&gt;Book a 20-min fleet walkthrough:&lt;/strong&gt; &lt;a href="https://fortem.dev/book" rel="noopener noreferrer"&gt;fortem.dev/book&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ecs</category>
      <category>cloudtrail</category>
      <category>devops</category>
    </item>
    <item>
      <title>What Is ECS Service Connect and Should You Use It?</title>
      <dc:creator>Matt</dc:creator>
      <pubDate>Fri, 19 Jun 2026 23:50:05 +0000</pubDate>
      <link>https://dev.to/dspv/what-is-ecs-service-connect-and-should-you-use-it-362</link>
      <guid>https://dev.to/dspv/what-is-ecs-service-connect-and-should-you-use-it-362</guid>
      <description>&lt;h1&gt;
  
  
  ECS Service Connect: Is It Worth the Envoy Tax?
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://fortem.dev/blog/ecs-service-connect-guide" rel="noopener noreferrer"&gt;https://fortem.dev/blog/ecs-service-connect-guide&lt;/a&gt;&lt;br&gt;
ECS Service Connect adds an Envoy sidecar to every Fargate task — free feature, real cost. When it beats Cloud Map, when it doesn't, and the July 2025 blue/green fix.&lt;/p&gt;
&lt;/blockquote&gt;





&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  ECS Service Connect injects an Envoy proxy into every task automatically. You don't add the sidecar to the task definition; you only name the port mapping, and ECS manages the rest.&lt;/li&gt;
&lt;li&gt;  The feature is free. The cost is the Envoy tax: AWS recommends budgeting +0.25 vCPU and +64 MiB per task on Fargate.&lt;/li&gt;
&lt;li&gt;  Native blue/green (launched July 2025) works with Service Connect. The older CodeDeploy-based blue/green controller still does not.&lt;/li&gt;
&lt;li&gt;  Under 5 services or stuck on CodeDeploy: use Cloud Map. 10+ services on native deploys: Service Connect. External traffic: ALB.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Service Connect launched in 2022. It spent more than two years blocked by a CodeDeploy incompatibility that sent most teams back to plain Cloud Map. That blocker was removed on July 17, 2025, when native blue/green shipped. Here's what Service Connect actually does, what it costs on Fargate, and the three situations where you should still skip it.&lt;/p&gt;

&lt;p&gt;Ready to use — Terraform Service Connect configuration&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecs_service"&lt;/span&gt; &lt;span class="s2"&gt;"api"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"api"&lt;/span&gt;
  &lt;span class="nx"&gt;cluster&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ecs_cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;task_definition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ecs_task_definition&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;desired_count&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
  &lt;span class="nx"&gt;launch_type&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"FARGATE"&lt;/span&gt;

  &lt;span class="nx"&gt;service_connect_configuration&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;enabled&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="nx"&gt;namespace&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:servicediscovery:us-east-1:123456789:namespace/ns-xxx"&lt;/span&gt;

    &lt;span class="nx"&gt;service&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;port_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"http"&lt;/span&gt;  &lt;span class="c1"&gt;# must match portMappings[].name in task definition&lt;/span&gt;

      &lt;span class="nx"&gt;client_alias&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;port&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8080&lt;/span&gt;
        &lt;span class="nx"&gt;dns_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"api"&lt;/span&gt;  &lt;span class="c1"&gt;# other services call: http://api:8080&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;network_configuration&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# In task definition portMappings, name the port:&lt;/span&gt;
&lt;span class="c1"&gt;# { "name": "http", "containerPort": 8080, "appProtocol": "http2" }&lt;/span&gt;

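&lt;span class="c1"&gt;# Account for the Envoy sidecar in your task CPU/memory:&lt;/span&gt;
&lt;span class="c1"&gt;# If your app needs 512 CPU / 1024 MiB, app plus sidecar comes to 768 CPU / 1088 MiB.&lt;/span&gt;
&lt;span class="c1"&gt;# Fargate only accepts fixed task sizes, so round up to 1024 CPU / 2048 MiB.&lt;/span&gt;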
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What ECS Service Connect actually does
&lt;/h2&gt;

&lt;p&gt;ECS injects an Envoy proxy sidecar into every task automatically. Apps call other services by short name (&lt;code&gt;http://checkout:8080&lt;/code&gt;); the proxy routes, load-balances, and health-checks in seconds — not minutes.&lt;/p&gt;

&lt;p&gt;The AWS docs are explicit about how the sidecar works: "This container isn't present in the task definition and you can't configure it. Amazon ECS manages the container configuration in the service." You don't add anything to your task definition (except naming the port mapping). ECS handles the rest at deploy time.&lt;/p&gt;
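
&lt;p&gt;For reference, here is roughly what "naming the port mapping" looks like in Terraform. This is a minimal sketch: the family, image, and task size are placeholders, and a real Fargate task usually also needs an execution role and log configuration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;# Hypothetical family, image, and sizes. The only Service Connect-specific
# part is the named port mapping inside container_definitions.
resource "aws_ecs_task_definition" "api" {
  family                   = "api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 1024
  memory                   = 2048

  container_definitions = jsonencode([
    {
      name      = "api"
      image     = "example/api:latest" # assumption: placeholder image
      essential = true
      portMappings = [
        {
          name          = "http" # referenced by port_name in the ECS service
          containerPort = 8080
          appProtocol   = "http2"
        }
      ]
    }
  ])
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that no proxy container appears in &lt;code&gt;container_definitions&lt;/code&gt;; ECS adds it when the service is deployed.&lt;/p&gt;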

&lt;p&gt;&lt;strong&gt;Short-name routing.&lt;/strong&gt; Services call each other by a name you define in the &lt;code&gt;client_alias.dns_name&lt;/code&gt; field — &lt;code&gt;http://checkout:8080&lt;/code&gt;, not an ALB DNS string or VPC Route 53 entry. The proxy resolves the endpoint from the Cloud Map namespace and load-balances across all healthy task IPs in round-robin.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Passive health checks.&lt;/strong&gt; The proxy marks individual tasks unhealthy and stops routing to them within seconds of detecting failures — no health check endpoint required on the calling service. This is the primary reliability advantage over plain Cloud Map DNS, where stale records can persist for the full DNS TTL (typically 15–60 seconds) after a task stops.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automatic CloudWatch metrics.&lt;/strong&gt; Request count, HTTP 4xx/5xx rates, and latency (p50, p99) are emitted per service pair without any instrumentation in your app. If your &lt;code&gt;appProtocol&lt;/code&gt; is &lt;code&gt;tcp&lt;/code&gt;, you get proxy activity but no per-call telemetry — only HTTP/1.1, HTTP/2, and gRPC get the full metric set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Namespace scope.&lt;/strong&gt; Service Connect manages its own namespace — it does not write to VPC DNS or Route 53. Short names are only resolvable from inside tasks enrolled in the same namespace. A Lambda function, an EC2 instance, or an ECS service not enrolled in that namespace cannot resolve those names.&lt;/p&gt;

&lt;h2&gt;
  
  
  Service Connect vs Cloud Map vs internal ALB — which is which
&lt;/h2&gt;

&lt;p&gt;Cloud Map is DNS-based discovery — cheap, simple, slow to fail over. Service Connect adds a proxy for sub-10s failover and per-call metrics. Internal ALBs are for external ingress or non-ECS callers. Most fleets need all three.&lt;/p&gt;

&lt;p&gt;The full picture of ECS service-to-service options — including when Cloud Map still wins — is in the &lt;a href="https://dev.to/blog/ecs-service-discovery-guide/"&gt;ECS service discovery decision guide&lt;/a&gt;. This article focuses specifically on Service Connect: what it costs, what it can't do, and when to use it.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Service Connect&lt;/th&gt;
&lt;th&gt;Cloud Map&lt;/th&gt;
&lt;th&gt;Internal ALB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Failover speed&lt;/td&gt;
&lt;td&gt;Seconds (proxy detects)&lt;/td&gt;
&lt;td&gt;DNS TTL — 15–60s stale&lt;/td&gt;
&lt;td&gt;Seconds (connection drain)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feature cost&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$0.10/resource/month&lt;/td&gt;
&lt;td&gt;$0.0225/ALB-hour (~$16/mo)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automatic retries&lt;/td&gt;
&lt;td&gt;2 retries on failure&lt;/td&gt;
&lt;td&gt;None — app-level only&lt;/td&gt;
&lt;td&gt;None — app-level only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-call metrics&lt;/td&gt;
&lt;td&gt;Built-in (HTTP/gRPC)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;ALB access logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-ECS callers&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Protocol telemetry&lt;/td&gt;
&lt;td&gt;HTTP/1.1, HTTP/2, gRPC&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;HTTP only (L7 log)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;When to use an internal ALB alongside Service Connect.&lt;/strong&gt; Lambda functions, EC2 instances, on-prem services via Direct Connect — anything not enrolled in the Service Connect namespace needs a stable HTTPS endpoint. That's the ALB's job. You can run Service Connect for ECS-to-ECS traffic and an internal ALB for inbound calls from outside ECS simultaneously on the same service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When Cloud Map still wins over Service Connect.&lt;/strong&gt; If your services have more than 1,000 tasks (Cloud Map hits a Route 53 quota at that scale; Service Connect does not), or if you need cross-account service discovery without AWS RAM, Cloud Map is the cleaner path. And if you have non-ECS callers anywhere in your stack, Cloud Map DNS is the only registry they can reach without a separate ALB.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Service Connect costs — the feature is free, the sidecar isn't
&lt;/h2&gt;

&lt;p&gt;Service Connect has no feature charge. The cost is the Envoy sidecar: AWS recommends +0.25 vCPU and +64 MiB per task. On a 0.25 vCPU task, that doubles the CPU line item.&lt;/p&gt;

&lt;p&gt;AWS's official recommendation is to reserve 256 CPU units (0.25 vCPU) and 64 MiB for the sidecar per task. Bump that to 512 CPU units if your service handles more than 500 requests per second at peak, and to 128 MiB if the namespace holds more than 100 Service Connect services or more than 2,000 tasks in total. These are minimums, not guarantees, so monitor actual sidecar container metrics in CloudWatch after enabling.&lt;/p&gt;

&lt;p&gt;Fleet math for a typical 10-service setup, 3 tasks per service, running 24/7 in us-east-1:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Calculation&lt;/th&gt;
&lt;th&gt;$/month&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sidecar CPU&lt;/td&gt;
&lt;td&gt;30 tasks × 0.25 vCPU × $0.04048/hr × 730 hrs&lt;/td&gt;
&lt;td&gt;~$221&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sidecar memory&lt;/td&gt;
&lt;td&gt;30 tasks × 0.064 GB × $0.004445/GB-hr × 730 hrs&lt;/td&gt;
&lt;td&gt;~$6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Service Connect feature&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total Envoy tax (10 services × 3 tasks)&lt;/td&gt;
&lt;td&gt;CPU + memory&lt;/td&gt;
&lt;td&gt;~$227&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
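
&lt;p&gt;The same arithmetic can be kept next to your infrastructure code as Terraform locals, so the estimate moves when the fleet does. This is only a sketch: the rates and counts are the ones from the table, and the local and output names are made up for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;# Inputs copied from the table; swap in your own fleet size and regional rates.
locals {
  tasks           = 30       # 10 services x 3 tasks
  sidecar_vcpu    = 0.25     # vCPU reserved for the proxy per task
  sidecar_gb      = 0.064    # 64 MiB, expressed in GB as in the table
  vcpu_hour_usd   = 0.04048  # Fargate vCPU-hour, us-east-1
  gb_hour_usd     = 0.004445 # Fargate GB-hour, us-east-1
  hours_per_month = 730

  sidecar_cpu_usd = local.tasks * local.sidecar_vcpu * local.vcpu_hour_usd * local.hours_per_month
  sidecar_mem_usd = local.tasks * local.sidecar_gb * local.gb_hour_usd * local.hours_per_month
}

# Monthly sidecar cost: CPU line plus memory line.
output "envoy_tax_usd_per_month" {
  value = local.sidecar_cpu_usd + local.sidecar_mem_usd
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This treats the sidecar as a purely incremental charge. Because Fargate bills on fixed task sizes, the real increase can be larger when the extra 0.25 vCPU pushes a task into the next size.&lt;/p&gt;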

&lt;p&gt;The ratio matters more than the absolute number. On a 2 vCPU task, the sidecar adds ~12% to the CPU line — negligible. On a 0.25 vCPU task, the sidecar doubles it. If your services are right-sized to small task sizes, run the math before enabling fleet-wide. The full breakdown of Fargate task pricing is in the &lt;a href="https://dev.to/blog/aws-fargate-pricing-real-costs/"&gt;AWS Fargate pricing breakdown&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;KEY INSIGHT:&lt;/strong&gt; The Envoy sidecar is the cost. Run the math on your actual task sizes before rolling Service Connect out fleet-wide. A 0.25 vCPU task doubles its CPU cost. A 1 vCPU task adds 25%. The break-even depends on how much your team values built-in retries and per-call CloudWatch metrics versus paying for those resources.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Cloud Map fees when using Service Connect.&lt;/strong&gt; ECS creates the Cloud Map namespace and registers service instances automatically when you enable Service Connect. Standard Cloud Map fees still apply — $0.10/registered resource/month for each service instance ECS registers. On a 10-service fleet with 3 tasks each, that is 30 registered resources × $0.10 = $3/month on top of the Envoy sidecar compute. It is a small line item but not zero.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;mTLS adds cost.&lt;/strong&gt; Service Connect supports mutual TLS via AWS Private CA. Certificates rotate every 5 days — roughly 6 rotations per service per month. Factor in Private CA per-certificate pricing if you plan to enable mTLS. Without it, traffic between tasks is unencrypted at the proxy level (VPC provides network isolation, but not encryption in transit).&lt;/p&gt;
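
&lt;p&gt;A hedged sketch of what enabling TLS looks like on the service, assuming an AWS provider version that exposes the &lt;code&gt;tls&lt;/code&gt; block under &lt;code&gt;service_connect_configuration&lt;/code&gt;. All four &lt;code&gt;var.*&lt;/code&gt; references are hypothetical variables you would declare yourself, and the network configuration is left out for brevity.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 3
  launch_type     = "FARGATE"

  service_connect_configuration {
    enabled   = true
    namespace = var.service_connect_namespace_arn # hypothetical variable

    service {
      port_name = "http"

      client_alias {
        port     = 8080
        dns_name = "api"
      }

      # Certificates are issued by the referenced AWS Private CA.
      tls {
        issuer_cert_authority {
          aws_pca_authority_arn = var.private_ca_arn # hypothetical variable
        }
        kms_key  = var.tls_kms_key_arn              # hypothetical variable
        role_arn = var.service_connect_tls_role_arn # hypothetical variable
      }
    }
  }

  # network_configuration omitted here; a Fargate service still requires it.
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;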

&lt;h2&gt;
  
  
  The July 2025 blue/green unblock — and the CodeDeploy trap that remains
&lt;/h2&gt;

&lt;p&gt;ECS native blue/green (July 17, 2025) supports Service Connect. The old &lt;code&gt;CODE_DEPLOY&lt;/code&gt; deployment controller still does not — and throws a hard error at deploy time if you combine them.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"DeploymentController#type CODE_DEPLOY is not supported by ECS Service Connect. I ended up going with Cloud Map-based Service Discovery — which initially felt like the 'old way' of doing things."&lt;/p&gt;

&lt;p&gt;— &lt;a href="https://dev.to/aws-builders/why-i-chose-service-discovery-over-service-connect-for-ecs-inter-service-communication-4l9d"&gt;Dev.to, AWS Builders&lt;/a&gt;, 2024 (CodeDeploy limitation — still accurate; blue/green via native controller now unblocked)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The dev.to article above drove a lot of teams away from Service Connect. Its conclusion — "use Cloud Map instead" — was correct at the time. The author hit a real, hard error. But the central reason for that conclusion changed on July 17, 2025, when AWS launched built-in blue/green deployments for ECS without a CodeDeploy dependency. Service Connect is explicitly supported in the new native blue/green controller.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What changed.&lt;/strong&gt; The &lt;code&gt;ECS&lt;/code&gt; deployment controller now offers two deployment strategies: &lt;code&gt;ROLLING&lt;/code&gt; (default) and &lt;code&gt;BLUE_GREEN&lt;/code&gt; (native blue/green, launched July 2025). &lt;code&gt;CODE_DEPLOY&lt;/code&gt; (legacy) is a separate controller type. Service Connect works with both strategies under the &lt;code&gt;ECS&lt;/code&gt; controller. It still fails with &lt;code&gt;CODE_DEPLOY&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to check your controller.&lt;/strong&gt; In Terraform, look for &lt;code&gt;deployment_controller { type = "CODE_DEPLOY" }&lt;/code&gt; in your &lt;code&gt;aws_ecs_service&lt;/code&gt; resources. In the AWS console: ECS → Clusters → your cluster → Services → select a service → Configuration tab → Deployment type. If it says "Blue/green deployment (powered by AWS CodeDeploy)", you're on the legacy controller.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The migration path.&lt;/strong&gt; Switch from &lt;code&gt;CODE_DEPLOY&lt;/code&gt; to &lt;code&gt;ECS&lt;/code&gt; native blue/green before enabling Service Connect. The blue/green deployment guide covers that migration in detail — including the differences in traffic routing, test header behavior, and rollback mechanics.&lt;/p&gt;
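
&lt;p&gt;In Terraform terms, the controller change is small, though the rollout around it is not. The sketch below assumes an AWS provider release that exposes the native blue/green strategy on &lt;code&gt;deployment_configuration&lt;/code&gt;. The load balancer and target group settings that blue/green also needs are omitted, and you should check &lt;code&gt;terraform plan&lt;/code&gt; to see whether your provider version replaces the service when the controller type changes.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;resource "aws_ecs_service" "api" {
  # Existing arguments (name, cluster, task_definition, load_balancer, ...) stay as they are.

  deployment_controller {
    type = "ECS" # previously "CODE_DEPLOY"
  }

  # Assumption: this block and its strategy argument are available in your provider version.
  deployment_configuration {
    strategy = "BLUE_GREEN"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;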

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;KEY INSIGHT:&lt;/strong&gt; If you're on CodeDeploy and want Service Connect, you have two options: migrate to the ECS native blue/green controller (recommended), or stay on Cloud Map for service discovery until the migration is done. Enabling Service Connect on a CodeDeploy service fails at deploy time with a hard error — not a warning. Budget the controller migration as a prerequisite.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Gotchas nobody warns you about
&lt;/h2&gt;

&lt;p&gt;Envoy sidecar memory can grow unbounded with gRPC traffic. &lt;code&gt;appProtocol&lt;/code&gt; is immutable after service creation. HTTP/1.0 is not supported. Windows containers and standalone tasks don't work with Service Connect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;gRPC memory growth.&lt;/strong&gt; There's a documented re:Post thread from August 2024 where teams noticed the Service Connect proxy container's memory growing without bound after redeploys with gRPC workloads. The app container's memory released normally — only the sidecar kept climbing until gRPC traffic eventually stalled. AWS's workaround: add a task-level memory limit (not just container-level) and over-provision sidecar memory. Also, pin a recent &lt;code&gt;ecs-service-connect-agent&lt;/code&gt; version — CVE-2024-34364 (Envoy mirror-response unbounded buffer allocation) can manifest as a memory leak in older agent releases. Monitor sidecar container memory in CloudWatch if you're serving gRPC.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Immutable appProtocol.&lt;/strong&gt; You set &lt;code&gt;appProtocol&lt;/code&gt; in your port mapping configuration — &lt;code&gt;http&lt;/code&gt;, &lt;code&gt;http2&lt;/code&gt;, &lt;code&gt;grpc&lt;/code&gt;, or &lt;code&gt;tcp&lt;/code&gt;. You cannot change it after the service is created. If you start with &lt;code&gt;http&lt;/code&gt; and later migrate to gRPC, you must delete and recreate the ECS service. Plan the protocol decision before your first deploy — this is the single most operationally painful gotcha in Service Connect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HTTP/1.0 not supported.&lt;/strong&gt; The Envoy proxy drops HTTP/1.0 traffic. Most modern clients speak HTTP/1.1 or higher, but legacy internal tools occasionally use 1.0. Verify your client HTTP version before enabling Service Connect on any service that receives internal API calls from older tooling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Namespace cleanup is manual.&lt;/strong&gt; ECS does not delete the Cloud Map namespace when you delete a cluster. The namespace (and its resource registrations) stays behind, and you pay $0.10/resource/month for every orphaned registration. Add Cloud Map namespace cleanup to your cluster teardown runbook.&lt;/p&gt;
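
&lt;p&gt;One way to keep the namespace from being orphaned is to declare it explicitly and point the cluster at it, so both live in the same Terraform state and are destroyed together. A minimal sketch with hypothetical names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;# Hypothetical names. Declaring the namespace yourself keeps it in state,
# so "terraform destroy" removes it along with the cluster.
resource "aws_service_discovery_http_namespace" "main" {
  name        = "internal"
  description = "Service Connect namespace for the main cluster"
}

resource "aws_ecs_cluster" "main" {
  name = "main"

  # Default namespace for Service Connect services in this cluster.
  service_connect_defaults {
    namespace = aws_service_discovery_http_namespace.main.arn
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;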

&lt;p&gt;&lt;strong&gt;What Service Connect doesn't support at all:&lt;/strong&gt; Windows containers, standalone task invocations (only ECS services — not &lt;code&gt;RunTask&lt;/code&gt;), and cross-Region routing. Services in different AWS Regions cannot communicate via Service Connect. Use an ALB or API Gateway with a custom domain for cross-Region traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mixed enrollment.&lt;/strong&gt; You can run some services on Service Connect and some on Cloud Map in the same cluster. The constraint is that non-enrolled services cannot resolve Service Connect short names. Migrate callers and callees together in the same deployment window, or keep a Cloud Map registration running in parallel during the cutover.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you actually get — retries, passive health checks, metrics
&lt;/h2&gt;

&lt;p&gt;Service Connect configures the proxy for 2 retries, passive outlier detection (eject after 5 failures in 30s), a 15s default per-request timeout, and per-call CloudWatch metrics. The retry and outlier-detection settings are fixed; only the timeouts can be overridden per service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retries.&lt;/strong&gt; The proxy automatically retries failed requests twice, routing each retry to a different task (not re-sending to the failing host). This is transparent to the calling app — it sends one request, the proxy handles the retry if the first task fails to respond. Retry count is 2; you cannot configure it higher or lower.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Passive outlier detection.&lt;/strong&gt; The proxy tracks failure rates per task. After 5 consecutive failures in a 30-second window, the task is ejected from the load-balancing pool for 30–300 seconds depending on how many consecutive ejections have occurred. This is "passive" — no active health check probes — and it fires within seconds of detecting a problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Timeout.&lt;/strong&gt; The default per-request timeout is 15 seconds. It can be raised, or disabled by setting it to 0, per service through the Service Connect timeout configuration (&lt;code&gt;perRequestTimeoutSeconds&lt;/code&gt;) for any &lt;code&gt;appProtocol&lt;/code&gt; other than &lt;code&gt;tcp&lt;/code&gt;. If your service has requests that legitimately run longer (batch processing, report generation), raise the timeout when you enable Service Connect; left at the default, those calls fail at 15 seconds.&lt;/p&gt;
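
&lt;p&gt;For long-running requests, the ECS API accepts a per-service timeout override on the Service Connect service definition. The sketch below assumes an AWS provider version that exposes the &lt;code&gt;timeout&lt;/code&gt; block; the 120-second and 300-second values and the namespace variable are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;resource "aws_ecs_service" "reports" {
  name            = "reports"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.reports.arn # hypothetical task definition
  desired_count   = 2
  launch_type     = "FARGATE"

  service_connect_configuration {
    enabled   = true
    namespace = var.service_connect_namespace_arn # hypothetical variable

    service {
      port_name = "http"

      client_alias {
        port     = 8080
        dns_name = "reports"
      }

      # Override the 15-second default for slow report generation.
      timeout {
        per_request_timeout_seconds = 120
        idle_timeout_seconds        = 300
      }
    }
  }

  # network_configuration omitted here; a Fargate service still requires it.
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;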

&lt;p&gt;&lt;strong&gt;CloudWatch metrics.&lt;/strong&gt; Without any instrumentation, you get per-service-pair: &lt;code&gt;RequestCount&lt;/code&gt;, &lt;code&gt;HTTPCode_Target_4XX_Count&lt;/code&gt;, &lt;code&gt;HTTPCode_Target_5XX_Count&lt;/code&gt;, and &lt;code&gt;TargetResponseTime&lt;/code&gt; (p50, p99). With &lt;code&gt;appProtocol = tcp&lt;/code&gt;, the proxy is active but emits only byte-level metrics, with no per-call telemetry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not a full service mesh.&lt;/strong&gt; Retry count and outlier detection parameters are AWS-managed fixed values; only the timeouts can be overridden per service. Unlike raw Envoy or the now-deprecated App Mesh, you cannot configure per-route retry policies or custom circuit-breaker thresholds. App Mesh (the configurable option) reaches end-of-life on September 30, 2026. If you need per-route circuit breaker configuration, the current path is a self-managed Envoy deployment, since there's no AWS-native equivalent for ECS once App Mesh retires.&lt;/p&gt;

&lt;h2&gt;
  
  
  Should you use it? A decision framework by fleet size
&lt;/h2&gt;

&lt;p&gt;Under 5 microservices or on the CodeDeploy controller: use Cloud Map. 10+ services on rolling or native blue/green: Service Connect. Need external routing or L7 features: ALB, often alongside Service Connect.&lt;/p&gt;

&lt;p&gt;Walk through these questions in order:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If your situation is…&lt;/th&gt;
&lt;th&gt;Use this&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;On CODE_DEPLOY deployment controller&lt;/td&gt;
&lt;td&gt;Cloud Map — migrate controller first if you want SC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-ECS callers (Lambda, EC2) need to reach the service&lt;/td&gt;
&lt;td&gt;Internal ALB (SC can't help here)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fewer than 5 ECS services calling each other&lt;/td&gt;
&lt;td&gt;Cloud Map — simpler, cheaper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10+ ECS services, rolling or native blue/green&lt;/td&gt;
&lt;td&gt;Service Connect — right default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Need mTLS between services&lt;/td&gt;
&lt;td&gt;Service Connect + AWS Private CA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Heavy gRPC, Fargate&lt;/td&gt;
&lt;td&gt;Service Connect with +128 MiB sidecar memory, pinned agent version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tasks need &amp;gt;15s per request&lt;/td&gt;
&lt;td&gt;Service Connect with a raised per-request timeout (default is 15s), or Cloud Map / ALB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Windows containers&lt;/td&gt;
&lt;td&gt;Cloud Map — SC doesn't support Windows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;"In cases where you don't need traffic insights and the service is a minor supporting service, Cloud Map Service Discovery can be a simpler solution."&lt;/p&gt;

&lt;p&gt;— &lt;a href="https://repost.aws/questions/" rel="noopener noreferrer"&gt;AWS re:Post community&lt;/a&gt;, verified June 2026&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Service Connect is the right default for new ECS-to-ECS traffic on rolling or native blue/green. But "right default" doesn't mean "apply to everything without checking." The three concrete situations where you should stay on Cloud Map: you're on the CodeDeploy controller and don't have time to migrate it, you have non-ECS callers that can't enroll in the namespace, or your tasks are sized at 0.25 vCPU and the doubled CPU cost isn't justified by the built-in retries.&lt;/p&gt;

&lt;p&gt;For teams already using Cloud Map — the bar for migration is not "Service Connect is available." The bar is "we have a concrete reason to switch": you want per-call CloudWatch metrics without adding instrumentation, you need built-in retries without client-side retry logic, or you're enabling mTLS and want the native Private CA integration. If none of those apply, an existing Cloud Map setup that works is not worth disrupting.&lt;/p&gt;

&lt;h2&gt;
  
  
  If you read this, you might also want to know
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Can I mix Service Connect and Cloud Map service discovery in the same cluster?
&lt;/h3&gt;

&lt;p&gt;Yes. Services in the same cluster can use different discovery mechanisms. The constraint is that a service not enrolled in the Service Connect namespace cannot resolve SC short names. Services on Cloud Map use Route 53 DNS and are reachable by any VPC resource regardless of namespace enrollment. Migrate callers and callees together to avoid resolution failures during cutover.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I migrate existing services from Cloud Map to Service Connect?
&lt;/h3&gt;

&lt;p&gt;Add the &lt;code&gt;service_connect_configuration&lt;/code&gt; block to your ECS service Terraform resource and redeploy. ECS creates a new namespace or enrolls in an existing one. Remove the &lt;code&gt;service_registries&lt;/code&gt; block after the service is healthy on Service Connect. One caveat: Cloud Map and Service Connect cannot be active on the same ECS service simultaneously. The cutover is per-service, so migrate the callee first, then callers in the same deploy window.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does Service Connect work with ECS tasks on EC2 launch type?
&lt;/h3&gt;

&lt;p&gt;Yes. Service Connect works with both Fargate and EC2 launch types. The Envoy sidecar overhead (0.25 vCPU, 64 MiB) applies either way, but on EC2 you're paying for the instance regardless, so the marginal sidecar cost is lower than on Fargate, where every CPU unit and MiB is billed directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happens to Service Connect if the Cloud Map namespace is deleted?
&lt;/h3&gt;

&lt;p&gt;Service Connect stops working. The Envoy proxy cannot resolve service names without the Cloud Map namespace. Services lose the ability to route to each other via short names. ECS does not auto-recreate the namespace. Treat the Cloud Map namespace as a dependency of your ECS cluster: include it in disaster recovery runbooks and don't delete it without disabling Service Connect first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Does ECS Service Connect replace Cloud Map?
&lt;/h3&gt;

&lt;p&gt;No — Service Connect uses Cloud Map as its backend registry. When you enable Service Connect, ECS creates the Cloud Map services automatically. If you're already using Cloud Map directly, Service Connect sits on top of it. You pay for Cloud Map resources whether or not you add the proxy layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much memory does Service Connect use?
&lt;/h3&gt;

&lt;p&gt;AWS recommends reserving an additional 64 MiB per task for the Envoy sidecar. Observed idle usage is typically under 60 MB, but production workloads with many concurrent connections can push higher. There's a documented issue with gRPC traffic causing unbounded sidecar memory growth on Fargate, so add a task-level memory limit and monitor sidecar container memory in CloudWatch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does Service Connect work with CodeDeploy blue/green?
&lt;/h3&gt;

&lt;p&gt;No. The ECS native blue/green controller (launched July 2025) works with Service Connect. The older CodeDeploy-based blue/green controller does not — you'll get the error 'DeploymentController#type CODE_DEPLOY is not supported by ECS Service Connect' at deployment time. If you're on CodeDeploy, use Cloud Map for service discovery until you migrate to the native controller.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can Service Connect route traffic between ECS clusters?
&lt;/h3&gt;

&lt;p&gt;Yes — within the same AWS Region and the same Cloud Map namespace. Cross-Region routing is not supported. Services in different clusters can communicate as long as they're registered in the same namespace and the networking (VPC, security groups) allows it.&lt;/p&gt;

&lt;h3&gt;
  
  
  What protocols does ECS Service Connect support?
&lt;/h3&gt;

&lt;p&gt;HTTP/1.1, HTTP/2, gRPC, and TCP. Set appProtocol in your service configuration. HTTP/2 and gRPC get full per-call telemetry (request count, error rate, latency). TCP mode works but gives no protocol-level metrics — just bytes transferred. HTTP/1.0 is not supported. Note: appProtocol cannot be changed after the service is created.&lt;/p&gt;

&lt;p&gt;Worth reading&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/aws-ecs-fargate/"&gt;Guide: AWS ECS Fargate: What It Is, How It Works, What It Costs&lt;/a&gt;. Service Connect in context: where it fits in the full Fargate architecture alongside task definitions, clusters, and cost.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/blog/ecs-service-discovery-guide/"&gt;Guide: ECS Service Discovery: Cloud Map, Service Connect, or Internal Load Balancer?&lt;/a&gt; The decision table for when to use Service Connect vs Cloud Map vs an internal ALB.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Book a 20-min fleet walkthrough:&lt;/strong&gt; &lt;a href="https://fortem.dev/book" rel="noopener noreferrer"&gt;fortem.dev/book&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ecs</category>
      <category>fargate</category>
      <category>devops</category>
    </item>
    <item>
      <title>How to Optimize AWS ECS Costs Beyond Reserved Instances</title>
      <dc:creator>Matt</dc:creator>
      <pubDate>Fri, 19 Jun 2026 23:49:23 +0000</pubDate>
      <link>https://dev.to/dspv/how-to-optimize-aws-ecs-costs-beyond-reserved-instances-1374</link>
      <guid>https://dev.to/dspv/how-to-optimize-aws-ecs-costs-beyond-reserved-instances-1374</guid>
      <description>&lt;h1&gt;
  
  
  AWS ECS Cost Optimization Beyond Spot and Savings Plans
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://fortem.dev/blog/aws-cost-optimization-ecs" rel="noopener noreferrer"&gt;https://fortem.dev/blog/aws-cost-optimization-ecs&lt;/a&gt;&lt;br&gt;
Spot and Savings Plans cover the first 30%. Five more levers most ECS teams miss: Graviton, VPC endpoints, Container Insights scoping, shared ALBs, Compute Optimizer.&lt;/p&gt;
&lt;/blockquote&gt;





&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  Fargate compute is only part of your ECS bill. ALBs, NAT Gateway, and Container Insights pushed the actual bill 30–52% above compute-only estimates in the fleet benchmarks cited below.&lt;/li&gt;
&lt;li&gt;  The S3 gateway endpoint is free — add it today. Every container image pull currently routes through NAT at $0.045/GB unless you have one.&lt;/li&gt;
&lt;li&gt;  ARM64/Graviton Fargate is $0.03238 vs $0.04048/vCPU-hr — a flat 20% reduction on all compute, no architectural change required.&lt;/li&gt;
&lt;li&gt;  AWS Compute Optimizer covers Fargate (free since Dec 2022). One CLI command returns right-sizing recommendations for every service in your fleet.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ready to use — copy this today&lt;/p&gt;

&lt;p&gt;Three commands you can run right now — before changing a single task definition.&lt;/p&gt;

&lt;p&gt;1. Add the free S3 gateway endpoint (eliminates NAT charges on ECR image pulls)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ec2 create-vpc-endpoint &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc-id&lt;/span&gt; &lt;span class="nv"&gt;$VPC_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--service-name&lt;/span&gt; com.amazonaws.us-east-1.s3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--route-table-ids&lt;/span&gt; &lt;span class="nv"&gt;$RTB_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc-endpoint-type&lt;/span&gt; Gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2. Pull Compute Optimizer recommendations fleet-wide&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws compute-optimizer get-ecs-service-recommendations &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'ecsServiceRecommendations[*].{
    Service:serviceArn,
    Finding:finding,
    vCPU:serviceRecommendationOptions[0].cpu
  }'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3. Check Container Insights status per cluster&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ecs list-clusters | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.clusterArns[]'&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
  xargs &lt;span class="nt"&gt;-I&lt;/span&gt;&lt;span class="o"&gt;{}&lt;/span&gt; aws ecs describe-clusters &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--clusters&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--include&lt;/span&gt; SETTINGS &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'clusters[0].{
      Name:clusterName,
      Insights:settings[?name==`containerInsights`].value|[0]
    }'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  You've done Spot and scheduling. Here's where the next 30% hides.
&lt;/h2&gt;

&lt;p&gt;Fargate compute is only part of the ECS bill. Teams that stop at Spot and scheduling still pay $0.045/GB in NAT data processing, $0.07/metric/month for Container Insights they forgot was on, and an x86 vCPU rate that is 25% higher than Graviton's. In the benchmarks cited here, the actual bill ran 30–52% above compute-only estimates.&lt;/p&gt;

&lt;p&gt;CloudBurn ran the numbers on a real Fargate fleet and found compute-only estimates of $181.77 against an actual bill of $276.27 — a 52% gap driven entirely by ALB base charges, NAT data processing, and CloudWatch metrics. The gap compounds across environments: 10 environments that look like a $1,800/month Fargate bill are actually closer to $2,700.&lt;/p&gt;
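&lt;p&gt;The gap arithmetic can be reproduced with a short sketch. It uses only the two CloudBurn figures quoted above; the function names are illustrative, and applying the same ratio to other fleets is an assumption, not a measured result.&lt;/p&gt;

```python
# Figures quoted in the text for the CloudBurn fleet.
COMPUTE_ONLY_ESTIMATE = 181.77  # USD per month, Fargate compute only
ACTUAL_BILL = 276.27            # USD per month, full invoice


def overhead_ratio(compute_estimate, actual_bill):
    """Non-compute charges expressed as a fraction of the compute estimate."""
    return (actual_bill - compute_estimate) / compute_estimate


def project_actual(compute_estimate, ratio):
    """Scale a compute-only estimate up by an observed overhead ratio."""
    return compute_estimate * (1 + ratio)


ratio = overhead_ratio(COMPUTE_ONLY_ESTIMATE, ACTUAL_BILL)
print(f"overhead on top of compute: {ratio:.0%}")
# Assumption: a ten-environment fleet carries the same overhead ratio.
print(f"projected bill for a $1,800 estimate: ${project_actual(1800, ratio):,.0f}")
```

&lt;p&gt;Note that the 52% is measured against the compute estimate, not against the final bill.&lt;/p&gt;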

&lt;p&gt;This article picks up where Spot and scheduling leave off. If you haven't covered those yet, start with &lt;a href="https://dev.to/blog/ecs-fargate-cost-optimization/"&gt;how to cut Fargate compute costs with Spot and scheduling&lt;/a&gt; — those two moves alone cut 60–70% before touching anything else. Come back here for the second layer.&lt;/p&gt;

&lt;p&gt;The five levers below — NAT/VPC endpoints, Graviton, Container Insights, ALB consolidation, and Compute Optimizer — are independent. You can apply any one of them this week without touching the others. Each section includes the dollar math for a 10-service fleet so you can rank them by impact before you start.&lt;/p&gt;

&lt;h2&gt;
  
  
  NAT Gateway is quietly billing $0.045/GB on every image pull
&lt;/h2&gt;

&lt;p&gt;Every container image pull and AWS API call from a private subnet runs through NAT at $0.045/GB data processing. A 403MB image pulled 32k times (crash-looping health check) costs ~$567 via NAT, versus $0.35 with an S3 gateway endpoint. The gateway endpoint is free. Add it before anything else.&lt;/p&gt;

&lt;p&gt;The crash-loop story is instructive. One team deployed a container with a health check misconfiguration: the task started, failed, restarted, and repeated 32,000 times over several days. Each restart pulled the 403 MB image through NAT at $0.045/GB data processing: 403 MB × 32,000 pulls ÷ 1,024 = 12,594 GB × $0.045 = ~$567. With the free S3 gateway endpoint routing ECR layer pulls over the AWS backbone instead of NAT, the same traffic costs $0.35. The endpoint takes 90 seconds to add.&lt;/p&gt;
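&lt;p&gt;To run the same arithmetic for your own image size and restart count, here is a minimal sketch. The rate is the NAT data processing price quoted above; the function name and parameters are illustrative.&lt;/p&gt;

```python
NAT_PROCESSING_USD_PER_GB = 0.045  # NAT Gateway data processing rate from the text


def nat_pull_cost(image_mb, pulls, rate_per_gb=NAT_PROCESSING_USD_PER_GB):
    """USD cost of pulling an image `pulls` times through a NAT Gateway."""
    total_gb = image_mb * pulls / 1024  # MB to GB, same conversion as the text
    return total_gb * rate_per_gb


# The crash-loop case: a 403 MB image pulled 32,000 times.
cost = nat_pull_cost(image_mb=403, pulls=32_000)
print(f"NAT data processing: ${cost:,.2f}")
```

&lt;p&gt;The sketch covers data processing only; NAT Gateway hourly charges are separate.&lt;/p&gt;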

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Option&lt;/th&gt;
&lt;th&gt;Hourly&lt;/th&gt;
&lt;th&gt;Data processing&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;NAT Gateway&lt;/td&gt;
&lt;td&gt;$0.045/hr&lt;/td&gt;
&lt;td&gt;$0.045/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interface endpoint (ECR, CW, etc.)&lt;/td&gt;
&lt;td&gt;$0.01/hr/AZ&lt;/td&gt;
&lt;td&gt;$0.01/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gateway endpoint (S3, DynamoDB)&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The nuance that trips teams up:&lt;/strong&gt; interface endpoints are not always cheaper than NAT. One interface endpoint costs $0.01/hr/AZ = ~$7.20/month per AZ. Add the five endpoints typically required for private ECS tasks (ECR API, ECR DKR, CloudWatch Logs, Secrets Manager, STS) across 3 AZs: that's 5 × 3 × $7.20 = $108/month in endpoint hourly charges alone — before counting data processing. The fourtheorem team documented exactly this: a setup that looked cheaper with endpoints ($43.84) flipped to $197/month once all required endpoints were counted, versus $100/month with NAT.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Each service that is not deployable to a VPC requires a new VPC Endpoint… the bills stack up quickly!”&lt;/p&gt;

&lt;p&gt;— &lt;a href="https://fourtheorem.com/amazon-ecs-hidden-costs/" rel="noopener noreferrer"&gt;fourtheorem, Amazon ECS Hidden Costs&lt;/a&gt;&lt;/p&gt;

&lt;/blockquote&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;KEY INSIGHT:&lt;/strong&gt; The S3 gateway endpoint is always free and takes 90 seconds to add, so do it unconditionally. Interface endpoints require explicit break-even math: calculate monthly NAT data charges vs. (number of endpoints × AZs × $7.20). At low data volumes, NAT is cheaper.&lt;/p&gt;
&lt;/blockquote&gt;
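&lt;p&gt;The break-even check can be sketched as follows. It uses the rates from the table above and a 720-hour month (which matches the $7.20 figure). It assumes the NAT Gateway stays in place for other traffic, so only the per-GB difference counts as savings. The function names are illustrative.&lt;/p&gt;

```python
HOURS_PER_MONTH = 720            # matches the $7.20/month per-AZ figure in the text
ENDPOINT_USD_PER_HOUR_AZ = 0.01  # interface endpoint hourly charge, per AZ
ENDPOINT_USD_PER_GB = 0.01       # interface endpoint data processing
NAT_USD_PER_GB = 0.045           # NAT Gateway data processing


def endpoint_fixed_cost(endpoints, azs):
    """Monthly hourly charges for a set of interface endpoints."""
    return endpoints * azs * ENDPOINT_USD_PER_HOUR_AZ * HOURS_PER_MONTH


def break_even_gb(endpoints, azs):
    """Monthly GB at which per-GB savings cover the endpoint hourly charges.

    Assumption: the NAT Gateway remains for other traffic, so its hourly
    charge is not saved and is left out of the comparison.
    """
    saving_per_gb = NAT_USD_PER_GB - ENDPOINT_USD_PER_GB
    return endpoint_fixed_cost(endpoints, azs) / saving_per_gb


fixed = endpoint_fixed_cost(endpoints=5, azs=3)
print(f"endpoint hourly charges: ${fixed:,.2f}/month")
print(f"break-even volume: {break_even_gb(5, 3):,.0f} GB/month")
```

&lt;p&gt;Below the break-even volume, routing that traffic through NAT is the cheaper option under these assumptions.&lt;/p&gt;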

&lt;p&gt;Check whether you already have the S3 gateway endpoint in place:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ec2 describe-vpc-endpoints &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filters&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"Name=service-name,Values=com.amazonaws.us-east-1.s3"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"Name=vpc-endpoint-type,Values=Gateway"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'VpcEndpoints[*].{State:State,VPC:VpcId}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An empty list (&lt;code&gt;[]&lt;/code&gt;) means you don't have one. The create command is in the Ready-to-use block above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Switch to Graviton and take 20% off every Fargate task
&lt;/h2&gt;

&lt;p&gt;ARM64 Fargate costs $0.03238/vCPU-hr vs $0.04048 for x86, exactly 20% less per vCPU. For 10 services × 3 tasks × 2 vCPU running 730 hours, that's $355/month saved with no infrastructure change: just rebuild images for linux/arm64.&lt;/p&gt;

&lt;p&gt;The math is clean. For a 10-service fleet where each service runs 3 tasks at 2 vCPU:&lt;/p&gt;

&lt;p&gt;Graviton savings — 10-service fleet&lt;/p&gt;

&lt;p&gt;x86: 10 × 3 × 2 vCPU × $0.04048 × 730 hr = $1,773/mo&lt;/p&gt;

&lt;p&gt;ARM64: 10 × 3 × 2 vCPU × $0.03238 × 730 hr = $1,418/mo&lt;/p&gt;

&lt;p&gt;Savings: $355/mo · $4,257/yr&lt;/p&gt;

&lt;p&gt;That's at 2 vCPU per task. At 0.5 vCPU (smaller tasks), the same formula yields ~$89/month, still worth it for zero architectural work. These figures count vCPU charges only; memory is billed separately per GB-hour, so check the Fargate pricing page for the ARM64 memory rate in your Region.&lt;/p&gt;
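&lt;p&gt;The fleet formula is easy to parameterize. The sketch below uses the two vCPU rates quoted above and a 730-hour month; it covers vCPU charges only, and the function names are illustrative.&lt;/p&gt;

```python
X86_USD_PER_VCPU_HOUR = 0.04048    # x86 Fargate rate from the text
ARM64_USD_PER_VCPU_HOUR = 0.03238  # ARM64 Fargate rate from the text
HOURS_PER_MONTH = 730


def monthly_vcpu_cost(services, tasks_per_service, vcpu_per_task, rate):
    """Monthly vCPU charge for an always-on fleet (memory not included)."""
    return services * tasks_per_service * vcpu_per_task * rate * HOURS_PER_MONTH


def graviton_saving(services, tasks_per_service, vcpu_per_task):
    """Monthly vCPU saving from moving the whole fleet from x86 to ARM64."""
    x86 = monthly_vcpu_cost(services, tasks_per_service, vcpu_per_task,
                            X86_USD_PER_VCPU_HOUR)
    arm = monthly_vcpu_cost(services, tasks_per_service, vcpu_per_task,
                            ARM64_USD_PER_VCPU_HOUR)
    return x86 - arm


for vcpu in (2, 0.5):
    saving = graviton_saving(services=10, tasks_per_service=3, vcpu_per_task=vcpu)
    print(f"{vcpu} vCPU per task: ${saving:,.0f}/month")
```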

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;KEY INSIGHT:&lt;/strong&gt; Graviton + Spot is the highest-impact combination on Fargate. The 20% Graviton discount stacks with the ~70% Spot discount for a combined ~76% reduction versus x86 on-demand. For dev environments that already run on Spot, switching to ARM64 pulls another 20% out of the remaining compute spend.&lt;/p&gt;
&lt;/blockquote&gt;
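&lt;p&gt;Discounts stack multiplicatively, not additively, which is why 20% plus 70% gives roughly 76% and not 90%. A minimal sketch of that calculation, taking the approximate 70% Spot figure from the text as given:&lt;/p&gt;

```python
def combined_discount(*discounts):
    """Combine independent fractional discounts applied one after another."""
    remaining = 1.0
    for discount in discounts:
        remaining *= 1 - discount  # each discount applies to what is left
    return 1 - remaining


# 20% Graviton discount, then the approximate 70% Spot discount.
total = combined_discount(0.20, 0.70)
print(f"combined reduction vs x86 on-demand: {total:.0%}")
```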

&lt;p&gt;&lt;strong&gt;What needs to change:&lt;/strong&gt; rebuild Docker images with &lt;code&gt;docker buildx build --platform linux/arm64&lt;/code&gt;, then update the ECS task definition's &lt;code&gt;runtimePlatform.cpuArchitecture&lt;/code&gt; to &lt;code&gt;ARM64&lt;/code&gt;. Most Python, Node.js, Go, Java, and Ruby stacks rebuild without any code changes. Test before flipping production — native code compiled for x86, some C-extension Python packages, and kernel-module-dependent workloads need verification.&lt;/p&gt;

&lt;p&gt;For the full Spot setup and capacity provider strategy, see &lt;a href="https://dev.to/blog/ecs-fargate-cost-optimization/"&gt;how to cut ECS Fargate costs by 65%&lt;/a&gt;. Graviton and Spot configure independently — you can add Graviton to an existing Spot setup today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Container Insights Enhanced: $0.07/metric/month, and it multiplies with fleet size
&lt;/h2&gt;

&lt;p&gt;Enhanced Container Insights for ECS charges $0.07 per metric per month. AWS's own example: 1 cluster, 5 services, 20 tasks, 50 containers = 2,264 metrics = $158.48/month. At 10 environments that's over $1,500/month for observability you may not be actively using.&lt;/p&gt;

&lt;p&gt;The metric count compounds with fleet size because each container generates multiple metrics: CPU utilization, memory utilization, network bytes in/out, storage read/write, task count. AWS's &lt;a href="https://aws.amazon.com/cloudwatch/pricing/" rel="noopener noreferrer"&gt;CloudWatch pricing page&lt;/a&gt; gives the worked example directly: 1 cluster with 5 services, 20 tasks, 50 containers generates 2,264 metrics. At $0.07 each: $158.48/month per cluster.&lt;/p&gt;

&lt;p&gt;Container Insights fleet math&lt;/p&gt;

&lt;p&gt;1 cluster (50 containers): 2,264 metrics × $0.07 = $158.48/mo&lt;/p&gt;

&lt;p&gt;5 clusters (dev + staging + prod × regions): ~$792/mo&lt;/p&gt;

&lt;p&gt;10 clusters (multi-environment fleet): ~$1,585/mo&lt;/p&gt;

&lt;p&gt;Fix: enable Enhanced on prod clusters only → $158/mo vs $1,585/mo&lt;/p&gt;
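&lt;p&gt;The fleet math above as a small sketch. It assumes every cluster matches the AWS pricing example (2,264 metrics), which real clusters will not; substitute your own metric counts. The names are illustrative.&lt;/p&gt;

```python
USD_PER_METRIC_MONTH = 0.07  # Enhanced Container Insights rate from the text
METRICS_PER_CLUSTER = 2264   # AWS pricing example: 5 services, 20 tasks, 50 containers


def insights_cost(clusters, metrics_per_cluster=METRICS_PER_CLUSTER):
    """Monthly Enhanced Container Insights charge for `clusters` clusters."""
    return clusters * metrics_per_cluster * USD_PER_METRIC_MONTH


all_clusters = insights_cost(10)  # Enhanced enabled everywhere
prod_only = insights_cost(1)      # Enhanced on a single production cluster
print(f"all 10 clusters: ${all_clusters:,.2f}/month")
print(f"prod only:       ${prod_only:,.2f}/month")
print(f"saved:           ${all_clusters - prod_only:,.2f}/month")
```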

&lt;p&gt;Enhanced Container Insights is opt-in per cluster — it is not on by default for all clusters. The problem is that teams often enable it when debugging a production issue and never disable it on the dev/staging clusters where they also turned it on. Standard CloudWatch metrics (CPU, memory at task level) still work without Enhanced — they're the base tier and are billed differently. Enhanced adds per-container-level metrics, ECS-specific dimensions, and storage metrics.&lt;/p&gt;

&lt;p&gt;Check which clusters have Enhanced enabled, then disable it on non-production clusters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check current status per cluster&lt;/span&gt;
aws ecs describe-clusters &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--clusters&lt;/span&gt; YOUR_CLUSTER_NAME &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; SETTINGS &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'clusters[0].settings'&lt;/span&gt;

&lt;span class="c"&gt;# Turn Container Insights off on a cluster (value=enabled keeps the standard tier)&lt;/span&gt;
aws ecs update-cluster-settings &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cluster&lt;/span&gt; YOUR_CLUSTER_NAME &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--settings&lt;/span&gt; &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;containerInsights,value&lt;span class="o"&gt;=&lt;/span&gt;disabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For controlling CloudWatch log costs across ECS at fleet scale, see &lt;a href="https://dev.to/blog/cloudwatch-costs-ecs/"&gt;controlling CloudWatch log costs for ECS&lt;/a&gt;, which covers log group retention, FireLens vs awslogs, and the log-volume math by service type.&lt;/p&gt;

&lt;h2&gt;
  
  
  One ALB per environment, not one per service
&lt;/h2&gt;

&lt;p&gt;Each ALB costs $16–20/month in base hourly charges before LCUs. Teams that provision one ALB per microservice per environment run 50–100+ ALBs. One team reduced from 270 ALBs to 9 by switching to host-based routing — one ALB per environment, listener rules route to services.&lt;/p&gt;

&lt;p&gt;The Signiant engineering team documented their ALB consolidation in detail: they had 270 ALBs across their infrastructure, running at roughly $16–20 each per month. Switching to a shared ALB model with host-based routing (&lt;code&gt;*.service.env.internal&lt;/code&gt; → listener rules → target groups) reduced that to 9 ALBs, eliminating 261 base charges. At $18/month average: $4,698/month removed from the bill.&lt;/p&gt;

&lt;p&gt;The anti-pattern is common: a Terraform module creates one ECS service and one ALB together, so teams end up with an ALB per service per environment by default. Shared ALBs require slightly more routing configuration but the cost argument is clear past 3–4 services per environment.&lt;/p&gt;

&lt;p&gt;ALB consolidation also reduces IPv4 address charges. Since February 2024, AWS charges $0.005/hr per public IPv4 address, and each internet-facing ALB holds at least one per enabled Availability Zone. Counting just one address per ALB, 270 ALBs cost 270 × $0.005 × 730 hr = ~$985/month in IPv4 charges alone.&lt;/p&gt;
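&lt;p&gt;A rough sketch of the consolidation savings, combining base charges and IPv4 charges. The $18 average and one public address per ALB are simplifying assumptions carried over from the text; LCU charges are left out because they depend on traffic, not on ALB count.&lt;/p&gt;

```python
def alb_monthly_saving(albs_before, albs_after, base_usd=18.0,
                       ipv4_per_alb=1, ipv4_usd_per_hour=0.005, hours=730):
    """Return (base_saving, ipv4_saving) in USD per month.

    Assumptions: a flat average base charge per ALB and a fixed number of
    public IPv4 addresses per ALB. LCU charges are not modeled.
    """
    removed = albs_before - albs_after
    base_saving = removed * base_usd
    ipv4_saving = removed * ipv4_per_alb * ipv4_usd_per_hour * hours
    return base_saving, ipv4_saving


base, ipv4 = alb_monthly_saving(albs_before=270, albs_after=9)
print(f"base charges removed: ${base:,.2f}/month")
print(f"IPv4 charges removed: ${ipv4:,.2f}/month")
```

&lt;p&gt;Raise &lt;code&gt;ipv4_per_alb&lt;/code&gt; to match the number of public subnets your ALBs span.&lt;/p&gt;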

&lt;p&gt;A shared ALB setup in Terraform:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Shared ALB — one per environment&lt;/span&gt;
&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_lb"&lt;/span&gt; &lt;span class="s2"&gt;"env"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-alb"&lt;/span&gt;
  &lt;span class="nx"&gt;internal&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="nx"&gt;load_balancer_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"application"&lt;/span&gt;
  &lt;span class="nx"&gt;subnets&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public_subnet_ids&lt;/span&gt;
  &lt;span class="nx"&gt;security_groups&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;alb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_lb_listener"&lt;/span&gt; &lt;span class="s2"&gt;"https"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;load_balancer_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_lb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;port&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;443&lt;/span&gt;
  &lt;span class="nx"&gt;protocol&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"HTTPS"&lt;/span&gt;
  &lt;span class="nx"&gt;ssl_policy&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ELBSecurityPolicy-TLS13-1-2-2021-06"&lt;/span&gt;
  &lt;span class="nx"&gt;certificate_arn&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;acm_cert_arn&lt;/span&gt;

  &lt;span class="nx"&gt;default_action&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"fixed-response"&lt;/span&gt;
    &lt;span class="nx"&gt;fixed_response&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;content_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"text/plain"&lt;/span&gt;
      &lt;span class="nx"&gt;message_body&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"no route"&lt;/span&gt;
      &lt;span class="nx"&gt;status_code&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"404"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Per-service listener rule — host-based routing&lt;/span&gt;
&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_lb_listener_rule"&lt;/span&gt; &lt;span class="s2"&gt;"api"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;listener_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_lb_listener&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;https&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;priority&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;

  &lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;type&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"forward"&lt;/span&gt;
    &lt;span class="nx"&gt;target_group_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_lb_target_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;condition&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;host_header&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;values&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"api.&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.example.com"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_lb_listener_rule"&lt;/span&gt; &lt;span class="s2"&gt;"worker"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;listener_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_lb_listener&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;https&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;priority&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;110&lt;/span&gt;

  &lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;type&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"forward"&lt;/span&gt;
    &lt;span class="nx"&gt;target_group_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_lb_target_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;condition&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;host_header&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;values&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"worker.&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.example.com"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Caveat: WebSocket services and services that require conflicting port bindings may need their own ALB. Otherwise, one ALB per environment handles up to 100 listener rules (the default soft limit, extendable via quota request).&lt;/p&gt;

&lt;h2&gt;
  
  
  Compute Optimizer runs a free fleet right-sizing pass — and it's scriptable
&lt;/h2&gt;

&lt;p&gt;AWS Compute Optimizer has supported Fargate since December 2022 at no charge. &lt;code&gt;get-ecs-service-recommendations&lt;/code&gt; returns CPU and memory recommendations at both task and container level. Script it across all services and diff against current task definitions to find over-provisioned tasks fleet-wide.&lt;/p&gt;

&lt;p&gt;Compute Optimizer launched ECS Fargate support on December 23, 2022 — it's free, and most teams haven't used it. The tool analyzes CloudWatch utilization metrics from the trailing 14 days and returns recommendations at two levels: the task definition (overall CPU/memory) and the individual container (container-level CPU/memory shares within the task). For over-provisioned long-running services, the claimed savings are 30–70% of compute spend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The gotcha that wastes time:&lt;/strong&gt; Compute Optimizer won't generate recommendations for a service if a target-tracking Auto Scaling policy is attached to CPU or memory for that service. If you check a service and get no recommendations, verify whether it has a target-tracking scaling policy attached. Recommendations require at least 24 hours of CloudWatch and ECS utilization data in the trailing 14-day window.&lt;/p&gt;

&lt;p&gt;Fleet script — loop over all clusters and return recommendations for every service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# Get Compute Optimizer recommendations for all ECS services&lt;/span&gt;
&lt;span class="c"&gt;# Requires: AWS CLI v2, jq&lt;/span&gt;

&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;AWS_DEFAULT_REGION&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;us&lt;/span&gt;&lt;span class="p"&gt;-east-1&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Fetching all clusters..."&lt;/span&gt;
&lt;span class="nv"&gt;CLUSTERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws ecs list-clusters &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$REGION&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'clusterArns[]'&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; text&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;CLUSTER &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nv"&gt;$CLUSTERS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;CLUSTER_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CLUSTER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=== &lt;/span&gt;&lt;span class="nv"&gt;$CLUSTER_NAME&lt;/span&gt;&lt;span class="s2"&gt; ==="&lt;/span&gt;

  &lt;span class="nv"&gt;SERVICES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws ecs list-services     &lt;span class="nt"&gt;--cluster&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CLUSTER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;     &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$REGION&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;     &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'serviceArns[]'&lt;/span&gt;     &lt;span class="nt"&gt;--output&lt;/span&gt; text&lt;span class="si"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;SERVICE_ARN &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nv"&gt;$SERVICES&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nv"&gt;RESULT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws compute-optimizer get-ecs-service-recommendations       &lt;span class="nt"&gt;--service-arns&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SERVICE_ARN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;       &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$REGION&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;       &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'ecsServiceRecommendations[0].{
        Finding: finding,
        CurrentCPU: currentServiceConfiguration.cpu,
        CurrentMem: currentServiceConfiguration.memory,
        RecommendedCPU: serviceRecommendationOptions[0].cpu,
        RecommendedMem: serviceRecommendationOptions[0].memory
      }'&lt;/span&gt;       &lt;span class="nt"&gt;--output&lt;/span&gt; json 2&amp;gt;/dev/null&lt;span class="si"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$RESULT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$RESULT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s2"&gt;"null"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
      &lt;/span&gt;&lt;span class="nv"&gt;SERVICE_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SERVICE_ARN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
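      &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SERVICE_NAME&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$RESULT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | jq &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;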
    &lt;span class="k"&gt;fi
  done
done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To use this in CI: run the script, parse the JSON output, and compare recommended vs current CPU/memory in each task definition. Flag as a PR comment or Slack alert when drift exceeds 20%. Teams that do this catch over-provisioning at deploy time before it accumulates.&lt;/p&gt;
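&lt;p&gt;As a rough sketch of that drift check, assuming the CPU and memory values have already been pulled out of the script's JSON as whole numbers, the comparison could look like the following. The helper names drift_pct and exceeds_drift are made up for illustration, the 20% limit is the figure quoted above, and the check only flags over-provisioning.&lt;/p&gt;

```shell
# Hypothetical helpers for a CI drift check; not part of the AWS CLI.

# Percent by which the current allocation exceeds the recommendation.
# Integer arithmetic is enough: ECS CPU units and MiB are whole numbers.
drift_pct() {
  local current="$1" recommended="$2"
  if [ "$current" -le 0 ]; then
    echo 0
    return
  fi
  echo $(( (current - recommended) * 100 / current ))
}

# Exit status 0 when drift is above the limit (default 20, an assumption
# taken from the text), so a CI step can branch on it.
exceeds_drift() {
  local current="$1" recommended="$2" limit="${3:-20}"
  [ "$(drift_pct "$current" "$recommended")" -gt "$limit" ]
}

# Example: a task defined with 1024 CPU units where 512 is recommended.
if exceeds_drift 1024 512; then
  echo "CPU over-provisioned by $(drift_pct 1024 512)%"
fi
```

&lt;p&gt;Because exceeds_drift returns a normal exit status, a CI step can use it to decide whether to post the PR comment or Slack alert.&lt;/p&gt;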

&lt;blockquote&gt;
&lt;p&gt;“If you've estimated your ECS costs based only on Fargate compute pricing, you're probably underestimating by 30–50%.”&lt;/p&gt;

&lt;p&gt;— &lt;a href="https://cloudburn.io/blog/aws-fargate-pricing" rel="noopener noreferrer"&gt;CloudBurn, AWS Fargate Pricing: Real Costs&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  If you read this, you might also want to know
&lt;/h2&gt;

&lt;p&gt;How do I know if my ECS tasks are right-sized without Compute Optimizer?&lt;/p&gt;

&lt;p&gt;Check CPU and memory utilization per service in CloudWatch. Look at p95 over a 2-week window. If p95 CPU stays below 30% of the task allocation, drop to the next Fargate size (e.g. 1 vCPU → 0.5 vCPU). If p95 memory stays below 40%, halve the memory allocation. The built-in service-level CPUUtilization and MemoryUtilization metrics in the AWS/ECS namespace cover this at no extra charge; Container Insights adds task-level detail, but its metrics are billed as CloudWatch custom metrics.&lt;/p&gt;
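&lt;p&gt;The rule of thumb above can be written down as a small check. This is only a sketch: rightsize_hint is a hypothetical helper, and it assumes you have already read the p95 CPU and memory figures off CloudWatch as whole-number percentages.&lt;/p&gt;

```shell
# Hypothetical helper that applies the 30% CPU / 40% memory rule of thumb.
# Arguments: p95 CPU utilization and p95 memory utilization, as whole percents.
rightsize_hint() {
  local p95_cpu="$1" p95_mem="$2" hint=""
  # p95 CPU below 30% of the allocation: step down one Fargate CPU size.
  if [ "$p95_cpu" -lt 30 ]; then
    hint="reduce-cpu"
  fi
  # p95 memory below 40% of the allocation: halve the memory.
  if [ "$p95_mem" -lt 40 ]; then
    hint="${hint:+$hint,}halve-memory"
  fi
  echo "${hint:-keep}"
}

rightsize_hint 22 35   # prints: reduce-cpu,halve-memory
rightsize_hint 55 35   # prints: halve-memory
rightsize_hint 55 70   # prints: keep
```

&lt;p&gt;Before applying a hint, confirm the new values are a valid Fargate CPU/memory combination, since each CPU size only accepts a fixed range of memory settings.&lt;/p&gt;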

&lt;p&gt;Can I mix x86 and ARM64 tasks in the same ECS cluster?&lt;/p&gt;

&lt;p&gt;Yes — ECS clusters are architecture-agnostic. Each task definition specifies its own runtimePlatform.cpuArchitecture. You can migrate services one at a time: flip a dev task definition to ARM64, test for a week, then move staging and production. No cluster changes required.&lt;/p&gt;

&lt;p&gt;How do I set up AZ-aware task placement to avoid cross-AZ charges?&lt;/p&gt;

&lt;p&gt;On the EC2 launch type, use the spread placement strategy with field=attribute:ecs.availability-zone to distribute tasks across AZs. Fargate does not support placement strategies; it spreads tasks across the AZs of the subnets you give the service. Either way, make sure requests reach a task in the same AZ as the load balancer node that received them. Cross-AZ traffic in the same region costs $0.01/GB in each direction. For high-throughput services, keeping the ALB-to-task hop inside one AZ eliminates this charge.&lt;/p&gt;

&lt;p&gt;What happens to in-flight requests when a Fargate Spot task is interrupted?&lt;/p&gt;

&lt;p&gt;ECS sends SIGTERM to the container and emits an ECS Task State Change event to EventBridge with stopCode SpotInterruption. You have up to 2 minutes before the task is forcibly stopped. Set stopTimeout to 120 seconds in your task definition to use the full window. Configure your ALB to deregister the target before SIGTERM (the connection draining period handles this). In-flight requests that complete within the draining window are served normally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Are VPC endpoints always cheaper than NAT Gateway for ECS?
&lt;/h3&gt;

&lt;p&gt;Not always. The free S3 gateway endpoint is always worth adding — it cuts NAT costs on ECR image pulls at zero cost. Interface endpoints for ECR API, CloudWatch Logs, and Secrets Manager each cost ~$7.20/month per AZ. If you need all five common endpoints in 3 AZs, that's ~$108/month — which can exceed your NAT costs at low data volumes. Run the math: monthly NAT data charges vs. (number of endpoints × AZs × $7.20). The break-even is roughly 1,200 GB/month of relevant traffic per interface endpoint.&lt;/p&gt;
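&lt;p&gt;To make the "run the math" step concrete, here is a minimal sketch in whole cents. The helper names endpoint_fixed_cents and cheaper_option are hypothetical, the 720-cent figure is the ~$7.20 per endpoint per AZ quoted above, and the comparison deliberately ignores any per-GB processing charge on the endpoints themselves.&lt;/p&gt;

```shell
# Hypothetical helpers; amounts are in whole cents to avoid floating point.

# Fixed monthly cost of interface endpoints: endpoints x AZs x 720 cents.
endpoint_fixed_cents() {
  local endpoints="$1" azs="$2"
  echo $(( endpoints * azs * 720 ))
}

# Compares monthly NAT data charges (in cents) with the endpoint fixed cost
# and prints whichever option is cheaper under that simplification.
cheaper_option() {
  local nat_cents="$1" endpoints="$2" azs="$3"
  if [ "$nat_cents" -gt "$(endpoint_fixed_cents "$endpoints" "$azs")" ]; then
    echo "endpoints"
  else
    echo "nat"
  fi
}

endpoint_fixed_cents 5 3    # prints: 10800 (that is, $108)
cheaper_option 4000 5 3     # $40 of NAT data charges; prints: nat
cheaper_option 25000 5 3    # $250 of NAT data charges; prints: endpoints
```

&lt;p&gt;Feed it the NAT data-processing line from your own bill rather than an estimate; the answer flips quickly at low traffic volumes.&lt;/p&gt;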

&lt;h3&gt;
  
  
  Does AWS Compute Optimizer work with ECS Fargate?
&lt;/h3&gt;

&lt;p&gt;Yes — Compute Optimizer has covered Fargate since December 23, 2022 at no charge. It recommends CPU and memory at both task and container level. Requirements: at least 24 hours of CloudWatch and ECS utilization data in the trailing 14 days. One gotcha: it won't generate recommendations if a target-tracking Auto Scaling policy is attached to CPU or memory for the service.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much does Container Insights cost for ECS?
&lt;/h3&gt;

&lt;p&gt;Enhanced Container Insights charges $0.07 per metric per month, prorated hourly. AWS's own example for a single cluster with 5 services, 20 tasks, and 50 containers: 2,264 metrics = $158.48/month. Standard CloudWatch custom metrics are a separate charge. Enhanced is opt-in per cluster — check which clusters have it enabled with: aws ecs describe-clusters --clusters CLUSTER --include SETTINGS&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I migrate ECS Fargate tasks to Graviton/ARM64?
&lt;/h3&gt;

&lt;p&gt;Rebuild your container images for linux/arm64 using Docker Buildx: docker buildx build --platform linux/arm64 -t myimage:arm64 . Then update your ECS task definition to set runtimePlatform.cpuArchitecture to ARM64 and runtimePlatform.operatingSystemFamily to LINUX. Most stacks (Python, Node.js, Java, Ruby, and Go via cross-compilation) rebuild without code changes. Check for native dependencies and binaries compiled for x86: they won't run on ARM64 without recompilation.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the Fargate Spot interruption rate?
&lt;/h3&gt;

&lt;p&gt;AWS does not publish a specific Fargate Spot interruption rate. In practice, Spot interruptions are infrequent for most workloads, typically a single-digit percentage of tasks per month in stable capacity regions, though this varies by AZ and instance family demand. ECS gives a 2-minute warning via both a SIGTERM to the container and an EventBridge event (source aws.ecs, detail-type ECS Task State Change, stopCode SpotInterruption). Set stopTimeout to 120 seconds or less in your task definition to handle the full window.&lt;/p&gt;

&lt;p&gt;Worth reading&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/ecs-cost-calculator/"&gt;Tool · AWS ECS Fargate Cost Calculator: Calculate your real Fargate bill before and after optimization. Enter fleet size, see the dollar impact of scheduling.&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/blog/ecs-fargate-cost-optimization/"&gt;Guide · How to Cut AWS ECS Fargate Costs by 65%. Scheduling, right-sizing, Spot, and orphaned environments: the four methods that take a 12-environment fleet from $1,730 to $380/month.&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;See your real per-env cost:&lt;/strong&gt; &lt;a href="https://fortem.dev/ecs-cost-calculator" rel="noopener noreferrer"&gt;fortem.dev/ecs-cost-calculator&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ecs</category>
      <category>fargate</category>
      <category>finops</category>
    </item>
    <item>
      <title>AWS Cost Anomaly Detection for ECS Teams: What It Catches, What It Misses, and How to Set It Up</title>
      <dc:creator>Matt</dc:creator>
      <pubDate>Fri, 19 Jun 2026 23:49:18 +0000</pubDate>
      <link>https://dev.to/dspv/aws-cost-anomaly-detection-for-ecs-teams-what-it-catches-what-it-misses-and-how-to-set-it-up-8i6</link>
      <guid>https://dev.to/dspv/aws-cost-anomaly-detection-for-ecs-teams-what-it-catches-what-it-misses-and-how-to-set-it-up-8i6</guid>
      <description>&lt;h1&gt;
  
  
  AWS Cost Anomaly Detection for ECS: Setup Guide 2026
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://fortem.dev/blog/aws-cost-anomaly-detection-ecs" rel="noopener noreferrer"&gt;https://fortem.dev/blog/aws-cost-anomaly-detection-ecs&lt;/a&gt;&lt;br&gt;
Set up AWS Cost Anomaly Detection for ECS Fargate fleets with per-environment tag monitors. Includes Terraform config, threshold strategy, and what the 24h delay means for your team.&lt;/p&gt;
&lt;/blockquote&gt;





&lt;p&gt;AWS Cost Anomaly Detection is free, ships with ML-based pattern detection, and can catch ECS spend spikes automatically. The catch: it runs on billing data that's up to 24 hours old, and the default setup monitors all ECS spend as one pooled number — not per environment. A spike in staging looks identical to a spike in prod. This guide covers how CAD actually works, how to wire it to your environment tags for per-environment alerts, what Terraform to drop in, and where the tool has real blind spots.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;CAD is free and uses ML: no static thresholds to maintain, no per-service configuration by default.&lt;/li&gt;
&lt;li&gt;The default AWS service monitor pools all ECS spend together; set up a tag-based monitor to get per-environment alerts.&lt;/li&gt;
&lt;li&gt;Detection takes up to 24 hours after a spike appears in billing data. Sub-12-hour spikes often go undetected.&lt;/li&gt;
&lt;li&gt;IMMEDIATE alerts require an SNS topic, not an email address; email-only subscriptions get a ValidationException.&lt;/li&gt;
&lt;li&gt;CAD is your monthly fire detector. It won't catch a runaway task that was killed before the billing data arrived.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ready to use — drop this into your Terraform today&lt;/p&gt;

&lt;p&gt;Tag-based monitor on your &lt;code&gt;environment&lt;/code&gt; key, SNS topic with correct IAM policy, and an IMMEDIATE subscription with combined $ + % threshold. Replace &lt;code&gt;alerts@yourcompany.com&lt;/code&gt; with your on-call address.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="c1"&gt;# SNS topic for cost anomaly alerts&lt;/span&gt;
&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_sns_topic"&lt;/span&gt; &lt;span class="s2"&gt;"cost_anomaly"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ecs-cost-anomaly-alerts"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

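&lt;span class="c1"&gt;# Account ID used in the SNS policy condition below&lt;/span&gt;
&lt;span class="k"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_caller_identity"&lt;/span&gt; &lt;span class="s2"&gt;"current"&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="c1"&gt;# Required: grant CAD permission to publish to SNS&lt;/span&gt;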
&lt;span class="k"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_policy_document"&lt;/span&gt; &lt;span class="s2"&gt;"cost_anomaly_sns"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;sid&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"AllowCostAnomalyDetection"&lt;/span&gt;
    &lt;span class="nx"&gt;effect&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
    &lt;span class="nx"&gt;actions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"SNS:Publish"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;principals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Service"&lt;/span&gt;
      &lt;span class="nx"&gt;identifiers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"costalerts.amazonaws.com"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;resources&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_sns_topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cost_anomaly&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;condition&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;test&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"StringEquals"&lt;/span&gt;
      &lt;span class="k"&gt;variable&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"aws:SourceAccount"&lt;/span&gt;
      &lt;span class="nx"&gt;values&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_caller_identity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_sns_topic_policy"&lt;/span&gt; &lt;span class="s2"&gt;"cost_anomaly"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;arn&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_sns_topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cost_anomaly&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_iam_policy_document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cost_anomaly_sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_sns_topic_subscription"&lt;/span&gt; &lt;span class="s2"&gt;"oncall_email"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;topic_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_sns_topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cost_anomaly&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;protocol&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"email"&lt;/span&gt;
  &lt;span class="nx"&gt;endpoint&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"alerts@yourcompany.com"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Tag-based monitor — one ML baseline per environment tag value&lt;/span&gt;
&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ce_anomaly_monitor"&lt;/span&gt; &lt;span class="s2"&gt;"env_monitor"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ecs-per-environment-monitor"&lt;/span&gt;
  &lt;span class="nx"&gt;monitor_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"CUSTOM"&lt;/span&gt;

  &lt;span class="nx"&gt;monitor_specification&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;Key&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"environment"&lt;/span&gt;      &lt;span class="c1"&gt;# must match your cost allocation tag key&lt;/span&gt;
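      &lt;span class="nx"&gt;Values&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"dev"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"staging"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"prod"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# example values (assumption): list your own tag values&lt;/span&gt;
      &lt;span class="nx"&gt;MatchOptions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"EQUALS"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;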
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Subscription: IMMEDIATE via SNS (email-only = ValidationException)&lt;/span&gt;
&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ce_anomaly_subscription"&lt;/span&gt; &lt;span class="s2"&gt;"env_alerts"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ecs-environment-anomaly-alerts"&lt;/span&gt;
  &lt;span class="nx"&gt;frequency&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"IMMEDIATE"&lt;/span&gt;

  &lt;span class="nx"&gt;monitor_arn_list&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_ce_anomaly_monitor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env_monitor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="nx"&gt;subscriber&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;type&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"SNS"&lt;/span&gt;
    &lt;span class="nx"&gt;address&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_sns_topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cost_anomaly&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;depends_on&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_sns_topic_policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cost_anomaly&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# AND logic: both conditions must be met to reduce alert noise&lt;/span&gt;
  &lt;span class="nx"&gt;threshold_expression&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;and&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;dimension&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;key&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ANOMALY_TOTAL_IMPACT_ABSOLUTE"&lt;/span&gt;
        &lt;span class="nx"&gt;values&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"30"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;           &lt;span class="c1"&gt;# $30 minimum impact&lt;/span&gt;
        &lt;span class="nx"&gt;match_options&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"GREATER_THAN_OR_EQUAL"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;and&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;dimension&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;key&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ANOMALY_TOTAL_IMPACT_PERCENTAGE"&lt;/span&gt;
        &lt;span class="nx"&gt;values&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"25"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;           &lt;span class="c1"&gt;# 25% above expected&lt;/span&gt;
        &lt;span class="nx"&gt;match_options&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"GREATER_THAN_OR_EQUAL"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How AWS Cost Anomaly Detection works
&lt;/h2&gt;

&lt;p&gt;CAD uses ML to model your normal spend per dimension, runs approximately 3× daily on billing data that's up to 24 hours old, and alerts when actual spend deviates from expected by more than your configured threshold.&lt;/p&gt;

&lt;p&gt;The service launched in 2020 and has been updated significantly since. The November 2025 update switched from calendar-day batches to rolling 24-hour windows — meaning the model now compares your current spend against the same time of day in previous periods, rather than against a full-day total. For ECS workloads with business-hours patterns, this reduces false positives on Monday mornings when spend jumps from a quiet weekend.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Monitor dimension&lt;/th&gt;
&lt;th&gt;What it tracks&lt;/th&gt;
&lt;th&gt;ECS use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AWS services&lt;/td&gt;
&lt;td&gt;All ECS spend pooled across all envs&lt;/td&gt;
&lt;td&gt;Default — too coarse for fleets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linked accounts&lt;/td&gt;
&lt;td&gt;Per AWS account spend&lt;/td&gt;
&lt;td&gt;Useful for account-per-env setups&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost allocation tags&lt;/td&gt;
&lt;td&gt;Per tag value (e.g. per environment)&lt;/td&gt;
&lt;td&gt;Best for ECS fleets using env tags&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost categories&lt;/td&gt;
&lt;td&gt;Per business unit or product&lt;/td&gt;
&lt;td&gt;Useful for multi-product orgs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The ML model adjusts for trends and seasonality automatically. You don't set a fixed budget cap — the model learns what "normal" looks like for your specific spend pattern and alerts only when that pattern breaks. The tradeoff: the model needs at least 10 days of history per dimension before it can fire. A brand-new ECS environment with zero history gets no anomaly alerts until day 11.&lt;/p&gt;

&lt;h2&gt;
  
  
  What ECS cost spikes CAD actually catches
&lt;/h2&gt;

&lt;p&gt;CAD catches ECS spend anomalies at AWS service level by default, meaning all your environments pooled together. It reliably catches sustained scale-out events, forgotten running environments, and spikes from Fargate Spot falling back to On-Demand that last longer than one billing cycle.&lt;/p&gt;

&lt;p&gt;The "sustained" qualifier matters. &lt;a href="https://dev.to/blog/ecs-fargate-cost-visibility/"&gt;Per-environment cost visibility&lt;/a&gt; on ECS is already hard, and CAD makes it harder when all environments share one anomaly baseline. A $200 spike in your dev environment looks like noise when your prod environment spends $2,000.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Spike type&lt;/th&gt;
&lt;th&gt;CAD (default)&lt;/th&gt;
&lt;th&gt;Delay&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dev env running 24/7 after sprint ends&lt;/td&gt;
&lt;td&gt;Sustained (3+ days)&lt;/td&gt;
&lt;td&gt;Catches it&lt;/td&gt;
&lt;td&gt;24–48 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fargate Spot falls back to On-Demand for 8 hours&lt;/td&gt;
&lt;td&gt;Sustained (8+ hours)&lt;/td&gt;
&lt;td&gt;Usually catches it&lt;/td&gt;
&lt;td&gt;24 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runaway task scales to 50 replicas for 3 hours&lt;/td&gt;
&lt;td&gt;Short spike (&amp;lt;6h)&lt;/td&gt;
&lt;td&gt;Often misses&lt;/td&gt;
&lt;td&gt;Task gone before detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NAT Gateway burst from one env's batch job&lt;/td&gt;
&lt;td&gt;Single-env spike&lt;/td&gt;
&lt;td&gt;Misses (pools with others)&lt;/td&gt;
&lt;td&gt;No per-env alert without tag monitor&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;KEY INSIGHT:&lt;/strong&gt; The 24-hour delay is the biggest constraint for ECS teams. A Fargate task that scaled out at 9am and was killed by 3pm generates no anomaly alert: the spending happens in a single billing period, and CAD reads billing data with a 24-hour lag. By the time the data arrives, the task is gone.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Setting up a per-environment tag monitor
&lt;/h2&gt;

&lt;p&gt;Create a Cost Allocation Tag monitor on your &lt;code&gt;environment&lt;/code&gt; key. Each tag value (dev, staging, prod) gets its own ML baseline and can fire independently without one environment's spend pattern polluting another's alert.&lt;/p&gt;

&lt;p&gt;Before creating a tag-based monitor, your cost allocation tags must be activated. Tags only appear in Cost Explorer — and therefore in CAD — after activation. This catches teams off guard: you've been tagging your ECS tasks for months, but none of that data flows into CAD until you flip the switch.&lt;/p&gt;

&lt;p&gt;Prerequisite — activate cost allocation tags&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open AWS Billing and Cost Management console&lt;/li&gt;
&lt;li&gt;Navigate to Cost allocation tags → "AWS-generated tags" and "User-defined tags" tabs&lt;/li&gt;
&lt;li&gt;Find your environment tag key (e.g. "environment") → click Activate&lt;/li&gt;
&lt;li&gt;Wait up to 24 hours for historical data to appear in Cost Explorer&lt;/li&gt;
&lt;li&gt;Then create your CAD monitor; do not create it before activation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Choose an &lt;strong&gt;AWS managed monitor&lt;/strong&gt; (not customer managed) for your environment tag. Managed monitors automatically discover new tag values as you add environments: if you spin up a new "staging-eu" environment next month, the monitor picks it up without any config change. The trade-off: all tag values share one alert threshold. If you need different thresholds for prod vs dev, use customer managed monitors, but note that they cap at 10 tag values per monitor.&lt;/p&gt;

&lt;p&gt;New tag values need 10 days of billing history before CAD can model normal spend and fire alerts. Plan for this when spinning up a new environment — don't expect anomaly alerts in the first two weeks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Terraform: the full config
&lt;/h2&gt;

&lt;p&gt;Two core resources: &lt;code&gt;aws_ce_anomaly_monitor&lt;/code&gt; (tag-based) and &lt;code&gt;aws_ce_anomaly_subscription&lt;/code&gt; (SNS, IMMEDIATE). Email-only subscriptions cannot use IMMEDIATE frequency — you need an SNS topic with the correct IAM policy first.&lt;/p&gt;

&lt;p&gt;The Terraform block at the top of this article is the complete production config. Two things to get right:&lt;/p&gt;

&lt;p&gt;SNS topic policy&lt;/p&gt;

&lt;p&gt;The SNS topic must explicitly grant &lt;code&gt;costalerts.amazonaws.com&lt;/code&gt; permission to publish. Without this policy, CAD sends no error — alerts fail silently and you get nothing. The &lt;code&gt;aws:SourceAccount&lt;/code&gt; condition limits the permission to your own account only.&lt;/p&gt;

&lt;p&gt;CUSTOM vs DIMENSIONAL monitor type&lt;/p&gt;

&lt;p&gt;Tag-based monitors use &lt;code&gt;monitor_type = "CUSTOM"&lt;/code&gt; with a &lt;code&gt;monitor_specification&lt;/code&gt; JSON block. Service-level monitors use &lt;code&gt;monitor_type = "DIMENSIONAL"&lt;/code&gt; with &lt;code&gt;monitor_dimension = "SERVICE"&lt;/code&gt;. These are different resource shapes in the Terraform provider — using the wrong type will error at apply time.&lt;/p&gt;

&lt;p&gt;For teams using &lt;a href="https://dev.to/blog/ecs-fargate-best-practices/"&gt;consistent cost allocation tagging across environments&lt;/a&gt;, the tag-based monitor can also be created as an AWS managed monitor (not customer managed) — which auto-discovers new environments. The Terraform resource for a managed TAG monitor looks slightly different: omit the &lt;code&gt;monitor_specification&lt;/code&gt; block and instead use &lt;code&gt;monitor_type = "DIMENSIONAL"&lt;/code&gt; with &lt;code&gt;monitor_dimension = "TAG"&lt;/code&gt;. Check the &lt;a href="https://docs.aws.amazon.com/cost-management/latest/userguide/getting-started-ad.html" rel="noopener noreferrer"&gt;AWS Cost Anomaly Detection docs&lt;/a&gt; for the current provider version syntax.&lt;/p&gt;

&lt;h2&gt;
  
  
  Threshold strategy for ECS fleets
&lt;/h2&gt;

&lt;p&gt;For a 10-environment fleet, start with $30 absolute AND 25% relative (AND logic). Production alone warrants a lower absolute threshold — $20 with 20% catches real incidents without drowning in dev noise. The AWS default (40% + $100) is too blunt for ECS environments with variable baselines.&lt;/p&gt;

&lt;p&gt;The problem with the $100 default: a dev environment spending $40/month on idle Fargate tasks can spike to $120 — a 200% increase — and never trigger an alert because the $100 absolute threshold isn't met. For small environments, percentage-based thresholds catch what dollar thresholds miss.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Environment&lt;/th&gt;
&lt;th&gt;Absolute ($)&lt;/th&gt;
&lt;th&gt;Percentage (%)&lt;/th&gt;
&lt;th&gt;Logic&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Production&lt;/td&gt;
&lt;td&gt;$20&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;AND&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Staging&lt;/td&gt;
&lt;td&gt;$30&lt;/td&gt;
&lt;td&gt;25%&lt;/td&gt;
&lt;td&gt;AND&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dev / ephemeral&lt;/td&gt;
&lt;td&gt;$15&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;AND&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;All environments (fallback)&lt;/td&gt;
&lt;td&gt;$30&lt;/td&gt;
&lt;td&gt;25%&lt;/td&gt;
&lt;td&gt;AND&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Use AND, not OR. OR logic on a percentage threshold fires every time a tiny environment has any activity after a quiet weekend, because any spend above a $0 baseline is an unbounded percentage increase. AND requires both the dollar amount and the percentage to be exceeded simultaneously, which dramatically reduces noise from small environments with variable usage.&lt;/p&gt;
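
&lt;p&gt;To make the AND versus OR difference concrete, here is a minimal Python sketch of the threshold check. It is an illustration only, not how CAD evaluates anomalies internally: the &lt;code&gt;Threshold&lt;/code&gt; class and &lt;code&gt;would_alert&lt;/code&gt; function are hypothetical names, and "expected" stands in for whatever baseline the ML model produces.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical container for one alert rule; not a CAD API type."""
    absolute_usd: float   # minimum dollar impact above expected spend
    percentage: float     # minimum percentage impact above expected spend
    use_and: bool = True  # True: both must be met; False: either is enough


def would_alert(expected: float, actual: float, rule: Threshold) -> bool:
    """Return True if the overage satisfies the rule."""
    impact_usd = actual - expected
    # A $0 baseline turns any spend into an unbounded percentage increase.
    if expected == 0:
        impact_pct = float("inf") if impact_usd > 0 else 0.0
    else:
        impact_pct = impact_usd / expected * 100
    dollars_met = impact_usd >= rule.absolute_usd
    percent_met = impact_pct >= rule.percentage
    if rule.use_and:
        return dollars_met and percent_met
    return dollars_met or percent_met


aws_default = Threshold(absolute_usd=100, percentage=40)
dev_rule = Threshold(absolute_usd=15, percentage=30)
dev_rule_or = Threshold(absolute_usd=15, percentage=30, use_and=False)

# Dev environment from the text: $40 expected, $120 actual.
print(would_alert(40, 120, aws_default))  # False: $80 overage is under $100
print(would_alert(40, 120, dev_rule))     # True: $80 and 200% both exceed the rule
# Quiet weekend, then $2 of activity on a $0 baseline.
print(would_alert(0, 2, dev_rule_or))     # True: the percentage alone fires
print(would_alert(0, 2, dev_rule))        # False: $2 is under the $15 floor
```

&lt;p&gt;The last two calls show why the table above uses AND everywhere: the dollar floor is what keeps near-zero baselines quiet.&lt;/p&gt;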

&lt;p&gt;After the first few weeks, mark detected anomalies as "Accurate anomaly" or "Not an issue" in the console. CAD uses this feedback to tune the model. A model trained on your team's feedback converges on your actual noise floor faster than one running without it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where CAD falls short for ECS teams
&lt;/h2&gt;

&lt;p&gt;CAD won't catch a Fargate task that scales to 50 replicas, runs for 6 hours, and is killed before billing data arrives. It also can't alert on per-service cost within an environment — only on per-environment total spend.&lt;/p&gt;

&lt;p&gt;Three hard limits to plan around:&lt;/p&gt;

&lt;p&gt;No real-time detection&lt;/p&gt;

&lt;p&gt;Cost Explorer has up to a 24-hour data lag. CAD runs 3× per day on that data. A Fargate task that spends $300 between 8am and 5pm on a Tuesday won't appear in CAD until Wednesday at the earliest — and only if the spending pattern looks anomalous relative to your history. Real-time cost monitoring requires CloudWatch metrics and billing alarms, which operate on estimated charges with a different (faster) refresh cycle.&lt;/p&gt;

&lt;p&gt;No service-level granularity within an environment&lt;/p&gt;

&lt;p&gt;The tag-based monitor fires when total spend for the "dev" tag value deviates from normal. It cannot tell you which ECS service within "dev" caused the spike. Root cause analysis surfaces up to 10 contributing factors (service, region, account, usage type) — but these are dimensions in Cost Explorer, not ECS service names. You still need Cost Explorer or a per-service tagging strategy to narrow it down.&lt;/p&gt;

&lt;p&gt;Scheduled environments create false anomalies&lt;/p&gt;

&lt;p&gt;If you schedule non-prod environments to stop outside business hours, CAD sees a cost of $0 at night and a spike every morning when they restart. The ML model learns this pattern over time — but the first 2–4 weeks after introducing scheduling will generate false positive alerts. Disable alerts during the model warm-up period or set a higher absolute threshold temporarily.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;KEY INSIGHT:&lt;/strong&gt; CAD is your monthly fire detector. It catches sustained burns: a forgotten environment left running, a Spot fallback that held for three days. Fortem's per-environment cost tracking is your smoke alarm: it sees what's happening now, before it becomes a billing-cycle problem.&lt;/p&gt;

&lt;p&gt;"AWS Cost Anomaly Detection runs approximately three times a day after your billing data is processed. Anomaly detection relies on the data from Cost Explorer which has a latency of up to 24 hours. Therefore, it can take up to 24 hours to detect an anomaly after the anomalous usage happens."&lt;/p&gt;

&lt;p&gt;— &lt;a href="https://aws.amazon.com/aws-cost-management/aws-cost-anomaly-detection/faqs" rel="noopener noreferrer"&gt;AWS Cost Anomaly Detection FAQ&lt;/a&gt;, verified June 2026&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  If you read this, you might also want to know
&lt;/h2&gt;

&lt;p&gt;Can I use CAD with a multi-account ECS setup?&lt;/p&gt;

&lt;p&gt;Yes. In a management account, create a linked account monitor to track per-member-account spend. Combine it with a tag-based monitor per account if you want both dimensions. Member accounts can only create an AWS service monitor — linked account and tag monitors require the management account.&lt;/p&gt;

&lt;p&gt;What if my ECS tasks don't have environment tags yet?&lt;/p&gt;

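&lt;p&gt;Tag-based monitoring only works on costs that are tagged. Untagged ECS tasks appear in the 'no tag value' bucket. The fastest path: add a &lt;code&gt;default_tags&lt;/code&gt; block to your Terraform AWS provider, so every resource gets the environment tag automatically without changing individual resource configs. Make sure the tag also propagates from each ECS service to its tasks and is activated as a cost allocation tag, otherwise Fargate task charges stay untagged in billing data.&lt;/p&gt;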

&lt;p&gt;Does CAD replace AWS Budgets?&lt;/p&gt;

&lt;p&gt;No — they answer different questions. Budgets: 'alert me when I cross $X.' CAD: 'alert me when I'm abnormally above my historical pattern, even if I haven't hit a fixed cap.' Use Budgets for hard financial limits and CAD for pattern deviation. A $50 spike in a normally-$10 environment is an anomaly even if it's well below your budget cap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is AWS Cost Anomaly Detection free?
&lt;/h3&gt;

&lt;p&gt;Yes. CAD itself is free — no charge for monitors, alert subscriptions, or email/SNS delivery. The underlying data comes from Cost Explorer, which charges $0.01 per API request for programmatic access. Console use is free. For most ECS teams, the total cost is $0.&lt;/p&gt;

&lt;h3&gt;
  
  
  How long does it take for Cost Anomaly Detection to start working?
&lt;/h3&gt;

&lt;p&gt;CAD needs at least 10 days of historical billing data per monitored dimension before it can model 'normal' spend. After setup, alerts can take up to 24 hours to fire — Cost Explorer (which CAD reads) has a built-in delay of up to 24 hours. A Fargate task that scales out and is killed within 12 hours may never trigger an alert.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can AWS Cost Anomaly Detection detect ECS Fargate cost spikes?
&lt;/h3&gt;

&lt;p&gt;Yes, with caveats. The default AWS service monitor sees all ECS spend pooled together — a spike in one environment dilutes across others. For per-environment detection, create a tag-based monitor on your 'environment' cost allocation tag. Each tag value gets its own ML baseline and can fire independently.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between AWS Cost Anomaly Detection and AWS Budgets?
&lt;/h3&gt;

&lt;p&gt;Budgets use static thresholds: 'alert me when I spend more than $500.' CAD uses ML to detect deviation from your historical pattern — it catches a 200% spike even if you're only at $50 total. Use both: Budgets for hard caps, CAD for pattern deviation. They answer different questions.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I get immediate alerts from Cost Anomaly Detection?
&lt;/h3&gt;

&lt;p&gt;Set alerting frequency to 'Individual alerts' — but this requires an SNS topic, not an email address. Attempting to use IMMEDIATE frequency with an email-only subscription triggers a ValidationException. Create an SNS topic, grant costalerts.amazonaws.com permission to publish, then subscribe your email to the SNS topic.&lt;/p&gt;
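
&lt;p&gt;The "grant permission to publish" step is the one most often skipped. As a rough sketch, the access policy for the SNS topic can be generated like this. The helper name, the &lt;code&gt;Sid&lt;/code&gt; label, and the topic ARN are placeholders for illustration; only the service principal comes from the text above.&lt;/p&gt;

```python
import json


def cost_alerts_topic_policy(topic_arn: str) -> dict:
    """Build an SNS access policy that lets Cost Anomaly Detection publish."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCostAnomalyDetectionPublish",  # arbitrary label
                "Effect": "Allow",
                "Principal": {"Service": "costalerts.amazonaws.com"},
                "Action": "SNS:Publish",
                "Resource": topic_arn,
            }
        ],
    }


# Hypothetical ARN; substitute your own topic.
policy = cost_alerts_topic_policy("arn:aws:sns:us-east-1:123456789012:cad-alerts")
print(json.dumps(policy, indent=2))
```

&lt;p&gt;The printed JSON can be set as the topic's &lt;code&gt;Policy&lt;/code&gt; attribute, for example with &lt;code&gt;aws sns set-topic-attributes&lt;/code&gt; or the equivalent Terraform resource. If the topic already has a policy, merge this statement into it instead of overwriting.&lt;/p&gt;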

&lt;h3&gt;
  
  
  What triggers a cost anomaly alert?
&lt;/h3&gt;

&lt;p&gt;By default: spend 40% above expected AND at least $100 above expected. Both conditions must be met. You can customize both thresholds — for ECS environments with variable usage, combine an absolute threshold ($20–$50) AND a percentage threshold (20–30%) using AND logic to avoid false positives on small environments.&lt;/p&gt;


&lt;p&gt;Worth reading&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/blog/ecs-fargate-cost-visibility/"&gt;Use Case · Why Can't You See Per-Environment AWS Costs?&lt;/a&gt; Cost Explorer shows you by service, by account, by region. Not by environment. Here's why, and what to do about it.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/blog/cloudwatch-costs-ecs/"&gt;Use Case · How to Control CloudWatch Logs Costs on ECS&lt;/a&gt; ECS creates log groups with no retention by default. 4 steps to cut CloudWatch costs by 60–80% without touching application code.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;See your real per-env cost:&lt;/strong&gt; &lt;a href="https://fortem.dev/ecs-cost-calculator" rel="noopener noreferrer"&gt;fortem.dev/ecs-cost-calculator&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ecs</category>
      <category>fargate</category>
      <category>finops</category>
    </item>
    <item>
      <title>Fortem vs Humanitec: ECS Fleet Operations vs General-Purpose IDP</title>
      <dc:creator>Matt</dc:creator>
      <pubDate>Fri, 19 Jun 2026 09:53:35 +0000</pubDate>
      <link>https://dev.to/dspv/fortem-vs-humanitec-ecs-fleet-operations-vs-general-purpose-idp-1607</link>
      <guid>https://dev.to/dspv/fortem-vs-humanitec-ecs-fleet-operations-vs-general-purpose-idp-1607</guid>
      <description>&lt;p&gt;Humanitec is the most-marketed IDP of 2025. If you run AWS ECS Fargate and searched "humanitec alternative," you've likely seen it at the top of every comparison listicle. But the ECS team evaluating Humanitec is usually solving a different problem than the one Humanitec is built for. This article explains both tools precisely so you can figure out which problem you actually have.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Humanitec is a platform orchestrator built for Kubernetes — the Container Driver (Jan 2025) explicitly states it "can only be used with managed clusters (EKS, AKS or GKE)" — ECS is not a supported workload target&lt;/li&gt;
&lt;li&gt;Humanitec Teams is $2,199/mo for 5 users, 2 projects, 5 environments per project — structurally incompatible with a fleet of 20+ ECS environments before you even reach Pro at $5,499/mo&lt;/li&gt;
&lt;li&gt;Fortem is purpose-built for ECS Fargate fleet operations: scheduling, cloning, fleet visibility, developer self-service — reads your existing AWS resources, no Terraform rewrite&lt;/li&gt;
&lt;li&gt;If your problem is "operate my ECS fleet at scale" → Fortem. If your problem is "build a company-wide IDP across AWS, GCP, and Azure" → Humanitec&lt;/li&gt;
&lt;li&gt;Humanitec requires substantial custom work to build the developer interface — right for a 5+ person platform team, not for a 1–2 person ops team on ECS&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Run this evaluation before booking any demo&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before talking to either vendor, answer these 5 questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What runtimes does your production infrastructure actually run on?&lt;/strong&gt;&lt;br&gt;
If the answer is "ECS Fargate only" → operational layer. If "ECS + EKS + Lambda + GCP" → platform orchestrator.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;How many ECS environments do you have, and are any running 24/7 without a workload?&lt;/strong&gt;&lt;br&gt;
Count: &lt;code&gt;aws ecs list-clusters | jq '.clusterArns | length'&lt;/code&gt;. If &amp;gt;10 environments with idle time → scheduling ROI is immediate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;How many full-time engineers are dedicated to the internal platform (not product features)?&lt;/strong&gt;&lt;br&gt;
1–2 engineers → operational layer. 5+ dedicated platform engineers → full IDP may make sense.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Are developers filing tickets for environment restarts, clones, or access?&lt;/strong&gt;&lt;br&gt;
If yes, that's an ops bottleneck — a self-service operations layer solves this faster than building an IDP.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What is your current monthly Fargate compute spend on non-production environments?&lt;/strong&gt;&lt;br&gt;
Run: &lt;code&gt;aws ce get-cost-and-usage --time-period Start=2026-05-01,End=2026-06-01 --granularity MONTHLY --filter '{"Dimensions":{"Key":"SERVICE","Values":["Amazon Elastic Container Service"]}}' --metrics BlendedCost&lt;/code&gt;&lt;br&gt;
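This returns ECS spend across all environments, so subtract production or add a cost allocation tag filter to isolate non-production. If &amp;gt;$1,000/mo → scheduling saves more than Fortem Starter costs.&lt;/p&gt;&lt;/li&gt;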
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What Humanitec actually is
&lt;/h2&gt;

&lt;p&gt;Humanitec is a graph-based platform orchestrator that enforces security, cost, and compliance policies on every deployment. It is Kubernetes-architected and cloud-agnostic.&lt;/p&gt;

&lt;p&gt;The current homepage headline is "Let AI build. On your terms." — Humanitec has repositioned from "developer self-service platform" to "AI agent governance layer." The product lets platform teams define resource templates (databases, caches, queues) that AI agents and human developers can provision in a standardized, policy-compliant way across multiple cloud providers.&lt;/p&gt;

&lt;p&gt;Compute targets claimed on the marketing site: EKS, GKE, AKS, VMs, Serverless. The reality for ECS teams is more nuanced. Humanitec's Container Driver — the mechanism that actually routes workloads to a compute target — was announced in January 2025 with an explicit restriction:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"As of today, the Container Driver can only be used with managed clusters (EKS, AKS or GKE.)"&lt;br&gt;
— &lt;a href="https://humanitec.com/blog/container-driver" rel="noopener noreferrer"&gt;Humanitec blog: Introducing the Container Driver (Jan 2025)&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AWS ECS and Fargate are not listed. The "serverless-ecs" runner type that appears in Humanitec docs refers to running Humanitec's own build agents on ECS compute — not routing your application workloads to ECS clusters. As of June 2026, humanitec.com/blog has zero posts about deploying to ECS.&lt;/p&gt;

&lt;p&gt;Humanitec's workload descriptor is called Score — a YAML format that abstracts a service away from any specific runtime. To deploy through Humanitec, teams rewrite their Terraform task definitions or Kubernetes manifests in Score. This is the right tradeoff for organizations standardizing across heterogeneous platforms. For ECS-only teams, it adds an abstraction layer on top of resources that already exist and work.&lt;/p&gt;

&lt;p&gt;Gartner Peer Insights reviewers describe the implementation honestly: Humanitec "requires you to build the developer interface and integrate existing tools. This means platform teams need to do substantial custom work to create the complete developer experience."&lt;/p&gt;

&lt;p&gt;Known customers: Western Union, BambooHR, Cimpress — large, multi-cloud organizations with dedicated platform teams. The product is enterprise-sold; it does not have a self-serve signup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; Humanitec is genuinely good at what it does — the absence of ECS support is not a flaw, it's a design decision. The question is whether the ECS team evaluating it is solving the same problem Humanitec was built to solve.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Fortem actually is
&lt;/h2&gt;

&lt;p&gt;Fortem is a control plane for ECS Fargate fleet operations. It reads your existing AWS resources via tags and naming — Terraform stays your source of truth, nothing gets rewritten.&lt;/p&gt;

&lt;p&gt;Fortem connects to your AWS account via a read-heavy IAM role, discovers every ECS cluster, service, and task definition through tags and naming conventions, and adds an operations layer on top. It does not replace Terraform, does not manage deployments, and does not require you to learn a new workload descriptor format.&lt;/p&gt;

&lt;p&gt;What it adds: fleet-wide scheduling (stop all non-production environments at 7pm, restart at 9am, per-timezone), environment cloning (copy a staging environment with one API call), per-environment cost tracking, developer self-service with ECS-scoped RBAC so engineers can restart their own service without filing a ticket, and AI-assisted diagnostics that pull CloudWatch logs and task events when something is unhealthy.&lt;/p&gt;

&lt;p&gt;The typical customer is a platform engineering team of 1–3 people running 10–80 ECS Fargate environments across 1–3 AWS accounts. Their IaC is Terraform. They did not set out to build an internal developer platform — they wanted to stop being the bottleneck for every environment restart, clone, and schedule change.&lt;/p&gt;

&lt;p&gt;What Fortem does not do: Fortem does not manage Kubernetes, does not provide a service catalog, does not handle multi-cloud deployments, and does not give you a Backstage-style developer portal. If those are your requirements, Fortem is not the right tool.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pricing — what you actually pay
&lt;/h2&gt;

&lt;p&gt;Humanitec Teams at $2,199/mo limits you to 5 users and 5 environments per project. For a fleet of 20+ ECS environments, you're on Pro at $5,499/mo before you've saved anything.&lt;/p&gt;

&lt;p&gt;Humanitec pricing was verified June 2026 at &lt;a href="https://humanitec.com/pricing" rel="noopener noreferrer"&gt;humanitec.com/pricing&lt;/a&gt;. Both cloud-hosted tiers offer a free trial. An annual discount of 10% applies to both plans. Humanitec also lists a separate AWS Marketplace SKU at $999/mo for up to 15 users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Humanitec pricing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Teams: $2,199/mo — 5 users, 2 projects, 5 envs/project&lt;/li&gt;
&lt;li&gt;Pro: $5,499/mo — 50 users, 10 projects, unlimited envs&lt;/li&gt;
&lt;li&gt;AWS Marketplace: $999/mo (up to 15 users, different terms)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fortem pricing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Starter: $790/mo — up to 20 environments, 1 AWS account, unlimited users&lt;/li&gt;
&lt;li&gt;Scale: $2,490/mo — up to 80 environments, 3 AWS accounts, SSO, priority support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ROI math for Fortem is straightforward. Scheduling 10 dev and staging environments (8 services each, 0.5 vCPU / 1 GB per task) to business hours saves &lt;strong&gt;$1,013/month&lt;/strong&gt; in Fargate compute — using published AWS pricing at $0.04048/vCPU-hr. The Starter plan pays for itself before any other feature.&lt;/p&gt;
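
&lt;p&gt;For readers who want to check the arithmetic, here is a rough Python reconstruction. Several inputs are assumptions not stated above: the memory rate (the published us-east-1 Linux/x86 price), one task per service, a 730-hour month, and "business hours" meaning 10 hours a day on weekdays. Under those assumptions the result lands close to the figure quoted above; different regions or schedules will shift it.&lt;/p&gt;

```python
VCPU_PRICE_PER_HOUR = 0.04048   # from the text
GB_PRICE_PER_HOUR = 0.004445    # assumption: us-east-1 Linux/x86 memory rate

ENVIRONMENTS = 10
SERVICES_PER_ENV = 8            # assumption: one running task per service
VCPU_PER_TASK = 0.5
GB_PER_TASK = 1.0

HOURS_PER_MONTH = 730           # assumption: average month
# Assumption: business hours are 10 h/day, 5 days/week, averaged over a year.
ON_HOURS_PER_MONTH = 10 * 5 * 52 / 12

# Hourly cost of one task, then of the whole non-production fleet.
task_hourly = VCPU_PER_TASK * VCPU_PRICE_PER_HOUR + GB_PER_TASK * GB_PRICE_PER_HOUR
fleet_hourly = task_hourly * ENVIRONMENTS * SERVICES_PER_ENV

always_on = fleet_hourly * HOURS_PER_MONTH
scheduled = fleet_hourly * ON_HOURS_PER_MONTH
savings = always_on - scheduled

print(f"always-on: ${always_on:,.2f}/month")
print(f"scheduled: ${scheduled:,.2f}/month")
print(f"savings:   ${savings:,.2f}/month")
```

&lt;p&gt;Swap in your own task sizes and schedule to see whether the savings clear the plan price for your fleet.&lt;/p&gt;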

&lt;p&gt;Humanitec ROI comes from platform team leverage — fewer tickets, faster developer onboarding, consistent environments across teams. That is real value, but it requires a platform team large enough to build and maintain the developer experience layer Humanitec expects you to create.&lt;/p&gt;




&lt;h2&gt;
  
  
  ECS Fargate specifically
&lt;/h2&gt;

&lt;p&gt;Humanitec has no ECS-specific features. Fortem was built exclusively for ECS Fargate — every feature maps to a real ECS operational problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Humanitec on ECS:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✗ No ECS workload deployment — Container Driver supports EKS, AKS, GKE only&lt;/li&gt;
&lt;li&gt;✗ No environment scheduling for ECS clusters or services&lt;/li&gt;
&lt;li&gt;✗ No ECS fleet visibility — no per-environment cost or status dashboard&lt;/li&gt;
&lt;li&gt;✗ No environment cloning for ECS task definitions&lt;/li&gt;
&lt;li&gt;✗ No ECS-specific diagnostics — no CloudWatch log integration&lt;/li&gt;
&lt;li&gt;✗ Zero ECS-specific blog posts or documentation as of June 2026&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fortem on ECS:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✓ Reads ECS services, task definitions, and CloudWatch metrics natively via AWS API&lt;/li&gt;
&lt;li&gt;✓ Fleet-wide scheduling: stop/start all environments by timezone on a cron schedule&lt;/li&gt;
&lt;li&gt;✓ Per-environment cost tracking using ECS service CPU/memory allocations&lt;/li&gt;
&lt;li&gt;✓ Environment cloning: copy a full ECS environment with one API call&lt;/li&gt;
&lt;li&gt;✓ Developer self-service: scoped IAM lets engineers restart their own services&lt;/li&gt;
&lt;li&gt;✓ AI diagnostics: surfaces unhealthy tasks with CloudWatch context automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worth clarifying:&lt;/strong&gt; The Humanitec "serverless-ecs" runner type that appears in their docs is not what it sounds like. It refers to running Humanitec's own CI build agents on ECS compute — not routing your application services to ECS clusters. If you found that in a Google search, it is not ECS workload support.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Humanitec is the right choice
&lt;/h2&gt;

&lt;p&gt;Humanitec is right when you need a company-wide platform across multiple clouds and platforms — not when your problem is operating an ECS fleet.&lt;/p&gt;

&lt;p&gt;The strongest signal that Humanitec fits: your engineering org runs workloads on at least two of EKS, GKE, AKS, Lambda, or VMs, and you want a single interface for developers to provision infrastructure regardless of which runtime it lands on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Humanitec fits when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need a formal IDP for 50+ engineers across AWS, GCP, and Azure&lt;/li&gt;
&lt;li&gt;You're building a dedicated platform team with a charter to standardize all of engineering&lt;/li&gt;
&lt;li&gt;You already use Backstage or Port and want an orchestration layer on top&lt;/li&gt;
&lt;li&gt;You're willing to invest 2–6 months in implementation and have 5+ dedicated platform engineers&lt;/li&gt;
&lt;li&gt;You need AI agent governance — controlled AI provisioning across heterogeneous platforms&lt;/li&gt;
&lt;li&gt;Your compliance requirements need multi-cloud deployment standardization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Humanitec's MVP Program offers structured onboarding with a platform architect. The actual implementation timeline for a working internal developer platform, with real services, resource drivers, and a developer-facing interface, is measured in months.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Fortem is the right choice
&lt;/h2&gt;

&lt;p&gt;Fortem is right when your infrastructure is primarily AWS ECS Fargate and your problem is operational — managing environments, controlling costs, and enabling self-service without a full IDP build.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fortem fits when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your stack is primarily or entirely AWS ECS Fargate — no multi-cloud requirement&lt;/li&gt;
&lt;li&gt;You have 10–80 environments and growing compute spend on idle dev and staging&lt;/li&gt;
&lt;li&gt;Your platform team is 1–3 people — not 5+ engineers dedicated to IDP work&lt;/li&gt;
&lt;li&gt;You use Terraform and don't want to learn Score or rewrite task definitions&lt;/li&gt;
&lt;li&gt;You need results in days, not months — no multi-month implementation project&lt;/li&gt;
&lt;li&gt;Developers are filing tickets for environment restarts, clones, or schedule changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fortem onboarding runs 7 business days: a Fortem engineer audits your AWS setup, tags environments, configures per-timezone schedules, and hands you the dashboard. No Score migration, no new abstraction layer. You keep your Terraform as the source of truth.&lt;/p&gt;




&lt;h2&gt;
  
  
  Side-by-side at a glance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Humanitec&lt;/th&gt;
&lt;th&gt;Fortem&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pricing&lt;/td&gt;
&lt;td&gt;$2,199/mo (Teams) · $5,499/mo (Pro)&lt;/td&gt;
&lt;td&gt;$790/mo (Starter) · $2,490/mo (Scale)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Users included&lt;/td&gt;
&lt;td&gt;5 (Teams) · 50 (Pro)&lt;/td&gt;
&lt;td&gt;Unlimited on all plans&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Environments&lt;/td&gt;
&lt;td&gt;5/project (Teams) · Unlimited (Pro)&lt;/td&gt;
&lt;td&gt;20 (Starter) · 80 (Scale)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECS Fargate support&lt;/td&gt;
&lt;td&gt;Not a workload target&lt;/td&gt;
&lt;td&gt;Purpose-built&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes support&lt;/td&gt;
&lt;td&gt;EKS, AKS, GKE via Container Driver&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Changes to existing Terraform&lt;/td&gt;
&lt;td&gt;Score replaces task definitions&lt;/td&gt;
&lt;td&gt;Reads existing state, no rewrite&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-service for devs&lt;/td&gt;
&lt;td&gt;Build it (custom developer portal)&lt;/td&gt;
&lt;td&gt;Included — ECS-scoped RBAC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Environment scheduling&lt;/td&gt;
&lt;td&gt;Not available&lt;/td&gt;
&lt;td&gt;Fleet-wide, per-timezone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Environment cloning&lt;/td&gt;
&lt;td&gt;Not available&lt;/td&gt;
&lt;td&gt;Single API call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Onboarding time&lt;/td&gt;
&lt;td&gt;Months (IDP build required)&lt;/td&gt;
&lt;td&gt;7 business days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Right for&lt;/td&gt;
&lt;td&gt;Multi-cloud, 50+ engineers, Kubernetes&lt;/td&gt;
&lt;td&gt;ECS Fargate, 1–3 platform engineers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Humanitec pricing: humanitec.com/pricing, verified June 2026. Fortem pricing: fortem.dev/pricing, verified June 2026.&lt;/p&gt;




&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Does Fortem replace Humanitec?&lt;/strong&gt;&lt;br&gt;
No — they solve different problems. Humanitec is a general-purpose platform orchestrator for standardizing deployments across multiple clouds and Kubernetes clusters. Fortem is a control plane specifically for AWS ECS Fargate fleet operations. If your stack is ECS Fargate, Fortem addresses the actual operational problems (scheduling, fleet visibility, self-service) without requiring you to build an IDP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does Humanitec support ECS Fargate workload deployment natively?&lt;/strong&gt;&lt;br&gt;
Not through its primary deployment mechanism. Humanitec's Container Driver explicitly states: "the Container Driver can only be used with managed clusters (EKS, AKS or GKE)." ECS and Fargate are not supported workload targets. The "serverless-ecs" runner type in Humanitec docs refers to running Humanitec's own build agents on ECS infrastructure — not deploying user applications to ECS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I use Fortem and Humanitec together?&lt;/strong&gt;&lt;br&gt;
In theory, yes — they operate at different layers. If you're running both ECS Fargate and Kubernetes workloads in a hybrid architecture, Humanitec could handle Kubernetes orchestration while Fortem handles ECS fleet operations. In practice, most ECS-first teams don't need both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Humanitec's Score language and do I have to learn it?&lt;/strong&gt;&lt;br&gt;
Score is Humanitec's workload descriptor — a YAML format that abstracts your service definition away from any specific infrastructure target. To use Humanitec fully, you describe your services in Score rather than in Terraform task definitions or Kubernetes manifests. For ECS-only teams who already have Terraform task definitions, Score is an additional abstraction layer with no payoff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How long does Fortem onboarding take compared to Humanitec?&lt;/strong&gt;&lt;br&gt;
Fortem onboarding runs 7 business days: a Fortem engineer audits your AWS setup, imports your environments, configures schedules per timezone, and hands you the keys. Humanitec's MVP Program targets a working MVP after the first session but typically involves multiple months of platform team work to build the developer interface, configure resource drivers, and integrate existing tooling.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Running 10+ ECS environments? &lt;a href="https://fortem.dev/audit" rel="noopener noreferrer"&gt;Run the free Fleet Audit&lt;/a&gt; — see your idle compute spend, scheduling potential, and per-environment costs in one report.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ecs</category>
      <category>devops</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>How to Debug AWS Fargate Containers with ECS Exec</title>
      <dc:creator>Matt</dc:creator>
      <pubDate>Tue, 16 Jun 2026 08:34:03 +0000</pubDate>
      <link>https://dev.to/dspv/how-to-debug-aws-fargate-containers-with-ecs-exec-k82</link>
      <guid>https://dev.to/dspv/how-to-debug-aws-fargate-containers-with-ecs-exec-k82</guid>
      <description>&lt;h1&gt;
  
  
  How to Debug AWS Fargate Containers with ECS Exec
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://fortem.dev/blog/ecs-exec-guide" rel="noopener noreferrer"&gt;https://fortem.dev/blog/ecs-exec-guide&lt;/a&gt;&lt;br&gt;
No more SSH into EC2 instances. ECS Exec gives you a shell into Fargate containers. The 5 IAM errors that catch everyone, copy-paste policy, and production audit setup.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Use Case · June 11, 2026 · 7 min read&lt;/p&gt;

&lt;p&gt;You moved to Fargate. No more SSH. No more docker exec. Your container is failing and you can't get inside. ECS Exec — AWS's answer to docker exec for Fargate — has been here since 2021. This guide covers setup, the 5 IAM permissions that catch everyone, and the commands that work.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;ECS Exec uses SSM Session Manager: it bind-mounts an agent into your container, no sidecar needed&lt;/li&gt;
&lt;li&gt;Requires 3 things: &lt;code&gt;--enable-execute-command&lt;/code&gt; on the service, an IAM task role with SSM permissions, and the SSM Session Manager plugin for your local AWS CLI&lt;/li&gt;
&lt;li&gt;The #1 failure point is IAM: the task role needs ssmmessages permissions, and ecs:ExecuteCommand on the caller alone is not enough&lt;/li&gt;
&lt;li&gt;20-minute idle timeout, 1 session per container, root user: know the limits before you rely on it in production&lt;/li&gt;
&lt;li&gt;CloudTrail logs every ExecuteCommand call. S3 and CloudWatch can capture command output for compliance&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why ECS Exec exists — the Fargate debugging gap
&lt;/h2&gt;

&lt;p&gt;Fargate has no hosts to SSH into. ECS Exec bind-mounts the SSM agent at runtime, giving you an interactive shell without ports, keys, or changing your task definition. Available since March 2021 and supported on both Linux and Windows containers.&lt;/p&gt;

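&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ECS on EC2 (before)&lt;/th&gt;
&lt;th&gt;ECS on Fargate (with ECS Exec)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SSH into EC2 instance&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aws ecs execute-command&lt;/code&gt; (no SSH needed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;docker exec -it container bash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/bin/bash&lt;/code&gt; inside container via SSM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open ports, manage SSH keys&lt;/td&gt;
&lt;td&gt;No ports, no keys; IAM controls access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Locate instance in ASG first&lt;/td&gt;
&lt;td&gt;Direct to task ID; always routable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security: instance-level access&lt;/td&gt;
&lt;td&gt;Security: per-task, per-container IAM&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;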

&lt;p&gt;Before ECS Exec (launched March 2021), debugging a Fargate container meant you couldn't get a shell at all — there are no EC2 instances to SSH into. Fargate runs your tasks on AWS-managed infrastructure, which has real implications for &lt;a href="https://dev.to/blog/ecs-fargate-best-practices/"&gt;operating Fargate at scale&lt;/a&gt;. ECS Exec was the &lt;a href="https://github.com/aws/containers-roadmap/issues/187" rel="noopener noreferrer"&gt;#1 most requested feature&lt;/a&gt; on the AWS Containers Roadmap for good reason.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;KEY INSIGHT:&lt;/strong&gt; ECS Exec is not a sidecar container or a separate service. It bind-mounts the SSM agent binaries into your existing container at runtime. Your task definition doesn't change; the ECS agent handles the plumbing transparently.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Download the skill file — check readiness first
&lt;/h2&gt;

&lt;p&gt;Before you hit one of the 5 errors below, there is a skill file your AI agent can run. It checks IAM permissions, the SSM plugin, networking, and the read-only-filesystem trap, all at once. Everything runs locally against your AWS account.&lt;/p&gt;


&lt;h2&gt;
  
  
  The 5 errors that catch everyone
&lt;/h2&gt;

&lt;p&gt;Every team hits these. The error messages are cryptic but the fixes are specific — missing enable flag, wrong IAM role, missing SSM plugin, no VPC route, or read-only root filesystem. Each has a one-line resolution.&lt;/p&gt;

&lt;p&gt;01. ExecuteCommandAgent not RUNNING&lt;/p&gt;

&lt;p&gt;Cause: You forgot --enable-execute-command when creating or updating the service. ECS Exec must be explicitly turned on per service or per standalone task.&lt;/p&gt;

&lt;p&gt;Fix&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Update the service to enable ECS Exec&lt;/span&gt;
aws ecs update-service &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--cluster&lt;/span&gt; your-cluster &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--service&lt;/span&gt; your-service &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--enable-execute-command&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--force-new-deployment&lt;/span&gt;

&lt;span class="c"&gt;# Or for a standalone task:&lt;/span&gt;
aws ecs run-task &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--cluster&lt;/span&gt; your-cluster &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--task-definition&lt;/span&gt; your-task &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--enable-execute-command&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify: aws ecs describe-tasks --cluster your-cluster --tasks task-id — check that enableExecuteCommand is true and ExecuteCommandAgent status is RUNNING&lt;/p&gt;
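&lt;p&gt;The same check can be scripted. The sketch below is a minimal example, assuming you have already saved the JSON output of aws ecs describe-tasks and parsed it into a dict. The function name exec_ready and the trimmed sample response are hypothetical; only the field names (enableExecuteCommand, managedAgents, lastStatus) come from the describe-tasks output.&lt;/p&gt;

```python
# Sketch: inspect parsed `aws ecs describe-tasks` output for ECS Exec readiness.
# exec_ready and the sample data are illustrative, not part of any AWS SDK.


def exec_ready(describe_tasks_output):
    """Return {task_arn: bool}. True means ECS Exec is enabled on the task and
    every container reports a RUNNING ExecuteCommandAgent."""
    result = {}
    for task in describe_tasks_output.get("tasks", []):
        enabled = task.get("enableExecuteCommand", False)
        containers = task.get("containers", [])
        agents_running = bool(containers) and all(
            any(
                agent.get("name") == "ExecuteCommandAgent"
                and agent.get("lastStatus") == "RUNNING"
                for agent in container.get("managedAgents", [])
            )
            for container in containers
        )
        result[task["taskArn"]] = enabled and agents_running
    return result


# Hypothetical, trimmed-down response used only for illustration.
sample = {
    "tasks": [
        {
            "taskArn": "arn:aws:ecs:us-east-1:123456789012:task/my-cluster/abc",
            "enableExecuteCommand": True,
            "containers": [
                {
                    "name": "nginx",
                    "managedAgents": [
                        {"name": "ExecuteCommandAgent", "lastStatus": "RUNNING"}
                    ],
                }
            ],
        },
        {
            "taskArn": "arn:aws:ecs:us-east-1:123456789012:task/my-cluster/def",
            "enableExecuteCommand": False,
            "containers": [{"name": "nginx", "managedAgents": []}],
        },
    ]
}

for arn, ready in exec_ready(sample).items():
    print(arn, "ready" if ready else "NOT ready")
```

&lt;p&gt;A task that reports NOT ready here usually needs a redeploy with the enable flag set, since the flag only applies to newly started tasks.&lt;/p&gt;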

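&lt;p&gt;02. AccessDeniedException: User is not authorized&lt;/p&gt;

&lt;p&gt;Cause: If the message names ecs:ExecuteCommand, the calling user or role lacks that permission. If the call is authorized but the session still fails (often surfacing as TargetNotConnectedException), your task IAM role doesn't have the SSM permissions the agent needs to open a session.&lt;/p&gt;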

&lt;p&gt;Fix&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"ssmmessages:CreateControlChannel"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"ssmmessages:CreateDataChannel"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"ssmmessages:OpenControlChannel"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"ssmmessages:OpenDataChannel"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify: Attach this policy to the task role (NOT the execution role). The SSM agent runs inside the container — it's the task that needs the permissions, not the service launching it.&lt;/p&gt;
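&lt;p&gt;A quick way to confirm a task role policy covers all four actions is to check the policy document itself. This is a minimal sketch, assuming the policy JSON has been loaded into a dict. The helper names are hypothetical, and the matching is deliberately simplified: it handles exact action names and trailing wildcards only, and ignores Deny statements, conditions, and resource scoping.&lt;/p&gt;

```python
# Sketch: report which ssmmessages actions a policy document fails to allow.
# Simplified matching; not a substitute for the IAM policy simulator.

REQUIRED_ACTIONS = {
    "ssmmessages:CreateControlChannel",
    "ssmmessages:CreateDataChannel",
    "ssmmessages:OpenControlChannel",
    "ssmmessages:OpenDataChannel",
}


def _matches(pattern, action):
    """IAM action names are case-insensitive; support a trailing '*' only."""
    if pattern.endswith("*"):
        return action.lower().startswith(pattern[:-1].lower())
    return pattern.lower() == action.lower()


def missing_exec_actions(policy_document):
    """Return a sorted list of required actions that no Allow statement grants."""
    statements = policy_document.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    granted = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        granted.extend(actions)
    return sorted(
        required
        for required in REQUIRED_ACTIONS
        if not any(_matches(pattern, required) for pattern in granted)
    )


partial_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssmmessages:CreateControlChannel",
                "ssmmessages:CreateDataChannel",
            ],
            "Resource": "*",
        }
    ],
}

print(missing_exec_actions(partial_policy))
```

&lt;p&gt;An empty list means the four actions are present; it does not prove the policy is attached to the task role rather than the execution role.&lt;/p&gt;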

&lt;p&gt;03. SessionManagerPlugin is not found&lt;/p&gt;

&lt;p&gt;Cause: The SSM Session Manager plugin is not installed on your local machine.&lt;/p&gt;

&lt;p&gt;Fix&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# macOS&lt;/span&gt;
curl &lt;span class="s2"&gt;"https://s3.amazonaws.com/session-manager-downloads/plugin/latest/mac/sessionmanager-bundle.zip"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s2"&gt;"session.zip"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
unzip session.zip &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo&lt;/span&gt; ./sessionmanager-bundle/install &lt;span class="nt"&gt;-i&lt;/span&gt; /usr/local/sessionmanagerplugin &lt;span class="nt"&gt;-b&lt;/span&gt; /usr/local/bin/session-manager-plugin

&lt;span class="c"&gt;# Linux&lt;/span&gt;
curl &lt;span class="s2"&gt;"https://s3.amazonaws.com/session-manager-downloads/plugin/latest/linux_64bit/session-manager-plugin.rpm"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s2"&gt;"plugin.rpm"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;yum &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; plugin.rpm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify: session-manager-plugin --version&lt;/p&gt;

&lt;p&gt;04. Timeout: session never connects&lt;/p&gt;

&lt;p&gt;Cause: Your Fargate task has no route to the SSM service endpoint. Either the task is in a private subnet with no NAT gateway, or VPC endpoints for SSM are missing.&lt;/p&gt;

&lt;p&gt;Fix&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Option A: Add NAT Gateway to route traffic to internet&lt;/span&gt;
&lt;span class="c"&gt;# Option B: Create VPC endpoints for SSM (recommended for private subnets)&lt;/span&gt;

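aws ec2 create-vpc-endpoint &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--vpc-id&lt;/span&gt; vpc-xxx &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--vpc-endpoint-type&lt;/span&gt; Interface &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--service-name&lt;/span&gt; com.amazonaws.region.ssmmessages &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--subnet-ids&lt;/span&gt; subnet-xxx &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--security-group-ids&lt;/span&gt; sg-xxx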
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify: Check task networking: aws ecs describe-tasks — the task must be able to reach ssmmessages.region.amazonaws.com. If you're in a private subnet with no NAT, you MUST have the VPC endpoint.&lt;/p&gt;

&lt;p&gt;05. Session starts but commands fail: 'cannot create directory'&lt;/p&gt;

&lt;p&gt;Cause: Your container's root filesystem is read-only (readonlyRootFilesystem: true). The SSM agent needs to create directories and files inside the container to function.&lt;/p&gt;

&lt;p&gt;Fix&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"readonlyRootFilesystem"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"linuxParameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"initProcessEnabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



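&lt;p&gt;Verify: The SSM agent writes to /var/lib/amazon/ssm/. If the root FS is read-only, ECS Exec won't work. There's no workaround: the agent needs writable storage. The initProcessEnabled flag is a separate recommendation; it adds an init process that cleans up zombie processes left behind by exec sessions.&lt;/p&gt;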

&lt;h2&gt;
  
  
  The happy path — step by step
&lt;/h2&gt;

&lt;p&gt;Step 1 — Install the Session Manager plugin&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# macOS&lt;/span&gt;
curl &lt;span class="s2"&gt;"https://s3.amazonaws.com/session-manager-downloads/plugin/latest/mac/sessionmanager-bundle.zip"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s2"&gt;"session.zip"&lt;/span&gt;
unzip session.zip
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./sessionmanager-bundle/install &lt;span class="nt"&gt;-i&lt;/span&gt; /usr/local/sessionmanagerplugin &lt;span class="nt"&gt;-b&lt;/span&gt; /usr/local/bin/session-manager-plugin

&lt;span class="c"&gt;# Verify&lt;/span&gt;
session-manager-plugin &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 2 — Create the task IAM role&lt;/p&gt;

&lt;p&gt;This is the policy the container needs to call SSM. Attach it to your ECS task role (NOT the execution role — that's for pulling images and writing logs).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"ssmmessages:CreateControlChannel"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"ssmmessages:CreateDataChannel"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"ssmmessages:OpenControlChannel"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"ssmmessages:OpenDataChannel"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 3 — Enable ECS Exec on your service&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ecs update-service &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--cluster&lt;/span&gt; my-cluster &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--service&lt;/span&gt; my-service &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--enable-execute-command&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--force-new-deployment&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 4 — Verify the agent is ready&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ecs describe-tasks &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--cluster&lt;/span&gt; my-cluster &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--tasks&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;aws ecs list-tasks &lt;span class="nt"&gt;--cluster&lt;/span&gt; my-cluster &lt;span class="nt"&gt;--service-name&lt;/span&gt; my-service &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'taskArns[0]'&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; text&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Look for:&lt;/span&gt;
&lt;span class="c"&gt;# "enableExecuteCommand": true&lt;/span&gt;
&lt;span class="c"&gt;# "lastStatus": "RUNNING" under ExecuteCommandAgent&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 5 — Execute&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Interactive shell&lt;/span&gt;
aws ecs execute-command &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--cluster&lt;/span&gt; my-cluster &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--task&lt;/span&gt; YOUR_TASK_ID &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--container&lt;/span&gt; nginx &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--command&lt;/span&gt; &lt;span class="s2"&gt;"/bin/bash"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--interactive&lt;/span&gt;

&lt;span class="c"&gt;# Single command (ECS Exec does not run it through a shell, so wrap pipes in sh -c)&lt;/span&gt;
aws ecs execute-command &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--cluster&lt;/span&gt; my-cluster &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--task&lt;/span&gt; YOUR_TASK_ID &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--container&lt;/span&gt; nginx &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--command&lt;/span&gt; &lt;span class="s2"&gt;"sh -c 'env | grep DATABASE'"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--interactive&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
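&lt;p&gt;If you call execute-command from a script, it helps to build the argument list in one place. The sketch below is one way to do that, assuming the AWS CLI is on the PATH. The function name build_exec_argv and the needs_shell switch are hypothetical; the CLI flags are the ones used in the steps above. Commands that rely on shell features such as pipes typically have to be wrapped in sh -c, because the command string is not interpreted by a shell inside the container.&lt;/p&gt;

```python
# Sketch: assemble an `aws ecs execute-command` argument list.
# build_exec_argv is an illustrative helper, not an AWS API.
import shlex


def build_exec_argv(cluster, task, container, command, needs_shell=False):
    """Return an argv list suitable for subprocess.run.

    needs_shell=True wraps the command in `sh -c '...'` so that pipes,
    redirects and globbing are interpreted inside the container.
    """
    if needs_shell:
        command = "sh -c " + shlex.quote(command)
    return [
        "aws", "ecs", "execute-command",
        "--cluster", cluster,
        "--task", task,
        "--container", container,
        "--command", command,
        "--interactive",
    ]


argv = build_exec_argv(
    "my-cluster", "YOUR_TASK_ID", "nginx", "env | grep DATABASE", needs_shell=True
)
print(" ".join(shlex.quote(part) for part in argv))
# To run it for real (requires credentials and the Session Manager plugin):
#   import subprocess; subprocess.run(argv, check=True)
```

&lt;p&gt;Passing a list to subprocess.run avoids a second layer of local shell quoting, which is where most escaping mistakes with exec commands come from.&lt;/p&gt;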



&lt;h2&gt;
  
  
  Production setup — logging, audit, security
&lt;/h2&gt;

&lt;p&gt;Three layers control ECS Exec in production: S3/CloudWatch logging captures every command, CloudTrail audits who ran what, and IAM conditions restrict exec by container name, cluster, and tags — including denying production access entirely.&lt;/p&gt;

&lt;p&gt;ECS Exec is powerful — and you need controls around it in production. Three layers: logging (what commands ran), auditing (who ran them), and access control (who CAN run them). If you enable CloudWatch logging, set a retention policy — &lt;a href="https://dev.to/blog/cloudwatch-costs-ecs/"&gt;unbounded CloudWatch log groups accumulate real cost&lt;/a&gt; on active clusters.&lt;/p&gt;

&lt;p&gt;1. Layer 1: Log command output to S3 and CloudWatch&lt;/p&gt;

&lt;p&gt;Configure at the cluster level. Two destinations: S3 for durable retention, CloudWatch for real-time search. CloudTrail separately logs the ExecuteCommand API call (who and when). Together they give you full visibility: CloudTrail = who executed. S3/CloudWatch = what they ran.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ecs update-cluster &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--cluster&lt;/span&gt; my-cluster &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--configuration&lt;/span&gt; &lt;span class="s1"&gt;'{
      "executeCommandConfiguration": {
        "logging": "OVERRIDE",
        "logConfiguration": {
          "cloudWatchLogGroupName": "/aws/ecs/my-cluster-exec",
          "s3BucketName": "my-exec-logs",
          "s3KeyPrefix": "exec-output"
        }
      }
    }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2. Layer 2: Restrict who can exec&lt;/p&gt;

&lt;p&gt;Use IAM condition keys on ecs:ExecuteCommand. This policy allows exec only on tasks tagged environment=development in a specific cluster. Production tasks are blocked — even if someone has the right IAM role.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ecs:ExecuteCommand"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:ecs:us-east-1:123456789012:cluster/my-cluster"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:ecs:us-east-1:123456789012:task/my-cluster/*"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ecs:ResourceTag/environment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"development"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3. Layer 3: Block production by container name&lt;/p&gt;

&lt;p&gt;Add a Deny policy that blocks exec on any container named production-app — regardless of IAM role. This is the safety net. Even if someone tags a task wrong, the container name catches it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Deny"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ecs:ExecuteCommand"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ecs:container-name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"production-app"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
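&lt;p&gt;To see how the Allow from Layer 2 and the Deny from Layer 3 interact, here is a deliberately simplified model of the evaluation rule that an explicit Deny always beats an Allow. It is a sketch, not the real IAM engine: it only understands StringEquals conditions, ignores Resource matching, and the function name is_exec_allowed and the request contexts are hypothetical.&lt;/p&gt;

```python
# Sketch: toy evaluation of the Layer 2 Allow and Layer 3 Deny statements.
# Models only "explicit Deny wins" with StringEquals conditions.

ALLOW_DEV = {
    "Effect": "Allow",
    "Action": "ecs:ExecuteCommand",
    "Condition": {"StringEquals": {"ecs:ResourceTag/environment": "development"}},
}
DENY_PROD_CONTAINER = {
    "Effect": "Deny",
    "Action": "ecs:ExecuteCommand",
    "Condition": {"StringEquals": {"ecs:container-name": "production-app"}},
}


def is_exec_allowed(statements, context):
    """Return True only if some Allow matches and no Deny matches."""

    def condition_matches(statement):
        expected = statement.get("Condition", {}).get("StringEquals", {})
        return all(context.get(key) == value for key, value in expected.items())

    matching = [
        s for s in statements
        if s["Action"] == "ecs:ExecuteCommand" and condition_matches(s)
    ]
    if any(s["Effect"] == "Deny" for s in matching):
        return False
    return any(s["Effect"] == "Allow" for s in matching)


policy = [ALLOW_DEV, DENY_PROD_CONTAINER]

dev_request = {
    "ecs:ResourceTag/environment": "development",
    "ecs:container-name": "api",
}
# A production container on a task that was tagged "development" by mistake.
mistagged_request = {
    "ecs:ResourceTag/environment": "development",
    "ecs:container-name": "production-app",
}

print(is_exec_allowed(policy, dev_request))
print(is_exec_allowed(policy, mistagged_request))
```

&lt;p&gt;The second request is the case Layer 3 exists for: the tag condition alone would have let it through, and the container-name Deny stops it.&lt;/p&gt;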



&lt;h2&gt;
  
  
  What ECS Exec can't do
&lt;/h2&gt;

&lt;p&gt;Seven hard limits: 20-minute idle timeout, one session per PID namespace, must be enabled at launch, read-only root FS blocks the agent, commands run as root, no AWS Console support, only tools in the image are available.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Limitation&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;20-minute idle timeout&lt;/td&gt;&lt;td&gt;The SSM session drops after 20 minutes of inactivity, and this is not configurable. Active commands keep it alive, but a paused shell will disconnect. Plan for reconnection.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1 session per PID namespace&lt;/td&gt;&lt;td&gt;If you share a PID namespace across containers in a task, you can only exec into one at a time. The second session will fail until the first exits.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Must be enabled at launch&lt;/td&gt;&lt;td&gt;You can't retroactively enable ECS Exec on an already-running task. If you forgot --enable-execute-command, you need to redeploy the task.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Read-only root FS breaks it&lt;/td&gt;&lt;td&gt;The SSM agent writes to /var/lib/amazon/ssm/ inside the container. readonlyRootFilesystem: true makes this impossible. No workaround.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Commands run as root&lt;/td&gt;&lt;td&gt;Even if your container runs as a non-root user, commands executed through ECS Exec run as root. The SSM agent and its children ignore the container's USER directive.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;No AWS Console support&lt;/td&gt;&lt;td&gt;ECS Exec is CLI/SDK only. You can't click a button in the Console to get a shell. AWS Copilot supports it (copilot svc exec), but the web Console doesn't.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Only tools in the image&lt;/td&gt;&lt;td&gt;If curl, netstat, or jq aren't in your container image, you can't use them during an exec session. ECS Exec doesn't inject tools; it only gives you access to what's already there.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;blockquote&gt;
&lt;p&gt;"ECS Exec sessions drop after 20 minutes of idle time — this timeout is not configurable. Only one session per container PID namespace is supported, and sessions always run as root regardless of the container USER directive."&lt;/p&gt;

&lt;p&gt;— &lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-exec.html" rel="noopener noreferrer"&gt;AWS ECS Exec documentation&lt;/a&gt;, verified June 2026&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;strong&gt;Map your fleet in 5 min:&lt;/strong&gt; &lt;a href="https://fortem.dev/audit" rel="noopener noreferrer"&gt;fortem.dev/audit&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ecs</category>
      <category>fargate</category>
      <category>debugging</category>
    </item>
    <item>
      <title>How to Control CloudWatch Logs Costs on ECS</title>
      <dc:creator>Matt</dc:creator>
      <pubDate>Tue, 16 Jun 2026 08:34:01 +0000</pubDate>
      <link>https://dev.to/dspv/how-to-control-cloudwatch-logs-costs-on-ecs-li3</link>
      <guid>https://dev.to/dspv/how-to-control-cloudwatch-logs-costs-on-ecs-li3</guid>
      <description>&lt;h1&gt;
  
  
  How to Control CloudWatch Logs Costs on ECS?
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://fortem.dev/blog/cloudwatch-costs-ecs" rel="noopener noreferrer"&gt;https://fortem.dev/blog/cloudwatch-costs-ecs&lt;/a&gt;&lt;br&gt;
ECS sends all logs to CloudWatch with retention set to Never Expire by default. 4 steps to cut your CloudWatch bill by 60-80%: retention, log level filtering, Insights queries, and per-service monitoring.&lt;/p&gt;
&lt;/blockquote&gt;





&lt;p&gt;Your AWS bill shows CloudWatch at $400 this month. You have 15 ECS services logging INFO-level to CloudWatch — with retention set to Never Expire. You didn't configure this. ECS did it by default. The fix takes 4 steps.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  01. The ECS default log driver sends everything to CloudWatch with retention = Never Expire. You didn't set this, ECS did&lt;/li&gt;
&lt;li&gt;  02. 4-step fix: set retention (90% impact), filter by log level (5%), Insights instead of streaming (3%), monitor per-service (2%)&lt;/li&gt;
&lt;li&gt;  03. One Terraform line: retention_in_days = 30 cuts storage cost by 60-80%&lt;/li&gt;
&lt;li&gt;  04. Real example: 15 services, 3 GB/day, $135/mo before and $30/mo after, a 78% saving&lt;/li&gt;
&lt;li&gt;  05. Download the skill file: your AI agent can audit and fix this for you in 5 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why CloudWatch is silently eating your AWS bill
&lt;/h2&gt;

&lt;p&gt;ECS creates CloudWatch log groups with no retention policy by default. Every container's stdout goes to CloudWatch at $0.50/GB ingestion plus $0.03/GB/month storage, and because nothing ever expires, the stored volume has no upper bound and the bill grows every month.&lt;/p&gt;

&lt;p&gt;The part that surprises most teams: &lt;strong&gt;ECS creates log groups with no retention policy.&lt;/strong&gt; No retention = Never Expire = logs accumulate forever = your bill grows every month. We audited a 15-service fleet where CloudWatch was $135/month — more than the compute cost for two of the environments combined. Retention is one lever; &lt;a href="https://dev.to/blog/ecs-fargate-cost-optimization"&gt;right-sizing and scheduling are the rest of the picture&lt;/a&gt;.&lt;/p&gt;

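&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Cost component&lt;/th&gt;&lt;th&gt;15 services, INFO level, 3 GB/day&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Ingestion ($0.50/GB)&lt;/td&gt;&lt;td&gt;$45/mo&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Storage ($0.03/GB/month)&lt;/td&gt;&lt;td&gt;$54/mo (grows every month)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Insights queries ($0.005/GB scanned)&lt;/td&gt;&lt;td&gt;$36/mo (5 queries/day)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Total&lt;/td&gt;&lt;td&gt;$135/mo&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;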

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;KEY INSIGHT:&lt;/strong&gt; Three separate charges on the same data. Ingestion is pay-what-you-send. Storage is pay-what-you-keep. Insights is pay-what-you-scan. ECS defaults mean you pay all three, with no upper bound, on every log line your application prints.&lt;/p&gt;
&lt;/blockquote&gt;
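&lt;p&gt;The ingestion and storage rows can be reproduced with a few lines of arithmetic. This is a rough sketch under stated assumptions: a constant 3 GB/day, 30-day months, the list prices quoted above, and no allowance for compression or the free tier. The function names are hypothetical. Under these assumptions a $54/mo storage charge corresponds to roughly 20 months of accumulated logs.&lt;/p&gt;

```python
# Sketch: CloudWatch Logs ingestion and storage cost at constant volume.
# Prices are the per-GB figures quoted in this article; adjust for your region.

INGEST_PER_GB = 0.50
STORAGE_PER_GB_MONTH = 0.03


def monthly_ingestion_cost(gb_per_day, days=30):
    """Ingestion is charged once, on everything sent in the month."""
    return round(gb_per_day * days * INGEST_PER_GB, 2)


def storage_cost_in_month(gb_per_day, month, retention_days=None, days=30):
    """Storage charge during the given month (1-based).

    With no retention policy the stored volume keeps growing; with a policy
    it is capped at retention_days worth of logs.
    """
    accumulated_days = month * days
    if retention_days is not None:
        accumulated_days = min(accumulated_days, retention_days)
    return round(gb_per_day * accumulated_days * STORAGE_PER_GB_MONTH, 2)


print("ingestion per month:", monthly_ingestion_cost(3))
print("storage in month 20, Never Expire:", storage_cost_in_month(3, 20))
print("storage in month 20, 30-day retention:", storage_cost_in_month(3, 20, 30))
```

&lt;p&gt;Ingestion stays flat whatever the retention setting, which is why retention and log-level filtering are treated as separate steps.&lt;/p&gt;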

&lt;h2&gt;
  
  
  Download the skill file — let AI fix it
&lt;/h2&gt;

&lt;p&gt;The downloadable skill file lets your AI agent scan all CloudWatch log groups, identify which ones lack retention, estimate monthly cost per group, and apply fixes — without writing a line of code. Everything runs locally on your machine against your AWS account.&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 1 — Set retention on every log group
&lt;/h2&gt;

&lt;p&gt;Adding retention_in_days = 30 to every aws_cloudwatch_log_group Terraform resource cuts CloudWatch storage cost by 60–80% once the expired data is purged. It is the single highest-impact change in this guide.&lt;/p&gt;

&lt;p&gt;Every log group set to Never Expire keeps accumulating data you will never query. The commands below find those groups and set a sensible ceiling.&lt;/p&gt;

&lt;p&gt;Find groups without retention:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws logs describe-log-groups &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'logGroups[?retentionInDays==`null`].[logGroupName,storedBytes]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--output&lt;/span&gt; table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set 30-day retention on one group:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws logs put-retention-policy &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--log-group-name&lt;/span&gt; &lt;span class="s2"&gt;"/aws/ecs/your-service"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--retention-in-days&lt;/span&gt; 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform — the one-liner that saves you $$$:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_cloudwatch_log_group"&lt;/span&gt; &lt;span class="s2"&gt;"ecs_service"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/ecs/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env_prefix&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;service_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nx"&gt;retention_in_days&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;  &lt;span class="c1"&gt;# ← was null (Never Expire). Now 30 days.&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



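&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Environment&lt;/th&gt;&lt;th&gt;Retention&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Production&lt;/td&gt;&lt;td&gt;90 days&lt;/td&gt;&lt;td&gt;Compliance + incident investigation&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Staging&lt;/td&gt;&lt;td&gt;30 days&lt;/td&gt;&lt;td&gt;Recent deploy history&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Dev / QA&lt;/td&gt;&lt;td&gt;7 days&lt;/td&gt;&lt;td&gt;Active development only&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;CI/CD / Build&lt;/td&gt;&lt;td&gt;1 day&lt;/td&gt;&lt;td&gt;Don't store ephemeral build logs&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;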
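&lt;p&gt;If your log groups follow a naming pattern, the retention tiers above can be applied mechanically. The sketch below assumes group names shaped like /ecs/{env}-{service}, matching the Terraform example, and short environment prefixes (prod, staging, dev, qa, ci) that are purely illustrative. It only builds the put-retention-policy argument lists; it does not call AWS.&lt;/p&gt;

```python
# Sketch: map log group names to retention tiers and build CLI arguments.
# The env prefixes and the naming convention are assumptions for illustration.

RETENTION_BY_ENV = {
    "prod": 90,     # compliance + incident investigation
    "staging": 30,  # recent deploy history
    "dev": 7,       # active development only
    "qa": 7,
    "ci": 1,        # ephemeral build logs
}


def retention_for(log_group_name, default_days=30):
    """Derive the environment from '/ecs/{env}-{service}' and look up its tier."""
    last_segment = log_group_name.rsplit("/", 1)[-1]
    env = last_segment.split("-", 1)[0]
    return RETENTION_BY_ENV.get(env, default_days)


def retention_command(log_group_name):
    """Return an argv list for `aws logs put-retention-policy`."""
    return [
        "aws", "logs", "put-retention-policy",
        "--log-group-name", log_group_name,
        "--retention-in-days", str(retention_for(log_group_name)),
    ]


for group in ["/ecs/prod-api", "/ecs/dev-worker", "/ecs/ci-build"]:
    print(" ".join(retention_command(group)))
```

&lt;p&gt;All four tiers (1, 7, 30, 90) are values CloudWatch accepts for retention; arbitrary day counts are rejected, so keep any additions to the documented set.&lt;/p&gt;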

&lt;h2&gt;
  
  
  Step 2 — Filter by log level
&lt;/h2&gt;

&lt;p&gt;Switching ECS production services from INFO to WARN log level reduces ingested log volume by one to two orders of magnitude, cutting both the $0.50/GB ingestion and $0.03/GB storage charges. Switch production to WARN, keep INFO for staging.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“CloudWatch Logs charges $0.50 per GB ingested, $0.03 per GB stored per month, and $0.50 per GB scanned by Logs Insights queries — beyond the 5 GB/month free tier.”&lt;/p&gt;

&lt;p&gt;— aws.amazon.com/cloudwatch/pricing, verified June 2026&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Spring Boot, Express, Django — they all default to INFO-level logging. That means every HTTP request, every database query, every cache hit generates a log line. Production doesn't need INFO. Switch to WARN.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
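&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find which log streams in this group emit the most events (last 7 days)&lt;/span&gt;
&lt;span class="c"&gt;# date -v-7d is BSD/macOS syntax; on GNU/Linux use: date -d '7 days ago' +%s&lt;/span&gt;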
aws logs start-query &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--log-group-name&lt;/span&gt; &lt;span class="s2"&gt;"/aws/ecs/prod-api"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-v-7d&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--end-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--query-string&lt;/span&gt; &lt;span class="s2"&gt;"stats count() by @logStream | sort count desc | limit 10"&lt;/span&gt;

&lt;span class="c"&gt;# Check your framework's log level:&lt;/span&gt;
&lt;span class="c"&gt;# Spring Boot: logging.level.root=WARN in application.properties&lt;/span&gt;
&lt;span class="c"&gt;# Express: set LOG_LEVEL=warn&lt;/span&gt;
&lt;span class="c"&gt;# Django: LOGGING['root']['level'] = 'WARNING'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
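&lt;p&gt;&lt;code&gt;aws logs start-query&lt;/code&gt; only returns a &lt;code&gt;queryId&lt;/code&gt;; the rows come from a separate &lt;code&gt;aws logs get-query-results&lt;/code&gt; call once the query finishes. A minimal polling sketch, where the function name and the default poll interval are assumptions:&lt;/p&gt;

```shell
# Poll a Logs Insights query until it leaves the Scheduled/Running states,
# then print the final result document as JSON.
# Argument 1: the queryId returned by `aws logs start-query`.
# Argument 2 (optional): seconds to sleep between polls, default 2.
wait_for_query() {
  query_id="$1"
  interval="${2:-2}"
  while :; do
    status=$(aws logs get-query-results \
      --query-id "$query_id" --query 'status' --output text)
    case "$status" in
      Scheduled|Running) sleep "$interval" ;;
      *) break ;;  # Complete, Failed, Cancelled, Timeout, Unknown
    esac
  done
  aws logs get-query-results --query-id "$query_id" --output json
}
```

&lt;p&gt;To use it, capture the ID from the start-query call by adding &lt;code&gt;--query 'queryId' --output text&lt;/code&gt; to that command, then pass the captured value to &lt;code&gt;wait_for_query&lt;/code&gt;.&lt;/p&gt;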



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;KEY INSIGHT:&lt;/strong&gt; An INFO-level web server can generate one to two orders of magnitude more log volume than the same server at WARN. If you're paying $0.50/GB for ingestion, every unnecessary log line costs you money twice: once to ingest, once to store.&lt;/p&gt;
&lt;/blockquote&gt;
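&lt;p&gt;To put a number on that, here is a rough sketch of the ingestion bill before and after a log-level change. It uses the $0.50/GB rate quoted above and ignores the free tier; the function name and the reduction factor are assumptions you should replace with measured values:&lt;/p&gt;

```shell
# Monthly CloudWatch ingestion cost in USD at $0.50/GB (the rate quoted above).
# Argument 1: GB ingested per month at INFO level.
# Argument 2: volume reduction factor after switching to WARN (10 = one order
#             of magnitude). The factor is an assumption; measure your own.
ingest_cost_after_reduction() {
  awk -v gb="$1" -v factor="$2" 'BEGIN {
    before = gb * 0.50
    after  = (gb / factor) * 0.50
    printf "before=%.2f after=%.2f saved=%.2f\n", before, after, before - after
  }'
}

ingest_cost_after_reduction 1500 10
# prints: before=750.00 after=75.00 saved=675.00
```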

&lt;h2&gt;
  
  
  Step 3 — Use Insights instead of streaming everything
&lt;/h2&gt;

&lt;p&gt;Use CloudWatch Logs Insights to query on demand at $0.50/GB scanned rather than streaming every log line to a third-party tool that charges separately for ingestion and indexing. For compliance, archive logs to S3 through a subscription filter.&lt;/p&gt;

&lt;p&gt;Datadog's log pricing is two-part: ingestion is billed separately from indexing (making logs searchable). Once you index everything for debugging — which is the point of streaming logs there — the combined cost per GB is several times CloudWatch's ingest ($0.50/GB) + storage ($0.03/GB) total. For debugging, use CloudWatch Logs Insights instead — query on demand, pay per GB scanned ($0.50/GB), not per GB ingested or indexed.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Datadog charges separately for log ingestion and for indexing logs to make them searchable — to query logs during incident response, they need to be indexed.”&lt;/p&gt;

&lt;p&gt;— datadoghq.com/pricing, verified June 2026&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find errors in the last hour in one log group&lt;/span&gt;
aws logs start-query &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--log-group-name&lt;/span&gt; &lt;span class="s2"&gt;"/aws/ecs/prod-api"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-v-1H&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--end-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--query-string&lt;/span&gt; &lt;span class="s2"&gt;"fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 50"&lt;/span&gt;

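&lt;span class="c"&gt;# For compliance: subscription filter → Firehose → S3 (cheap, durable)&lt;/span&gt;
&lt;span class="c"&gt;# Firehose destinations also require --role-arn (an IAM role CloudWatch Logs can assume)&lt;/span&gt;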
aws logs put-subscription-filter &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--log-group-name&lt;/span&gt; &lt;span class="s2"&gt;"/aws/ecs/prod-api"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--filter-name&lt;/span&gt; &lt;span class="s2"&gt;"AllToS3"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--filter-pattern&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--destination-arn&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:firehose:..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4 — Find which service costs the most
&lt;/h2&gt;

&lt;p&gt;One Insights query that groups by log stream and sorts by message size (&lt;code&gt;strlen(@message)&lt;/code&gt;, a close proxy for bytes) identifies which ECS service drives most of your CloudWatch bill. It takes under 5 minutes to run.&lt;/p&gt;

&lt;p&gt;Total CloudWatch cost is $400 — but which of your 15 services is responsible for $300 of it? This Insights query tells you in 5 minutes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Top log producers by byte volume (last 7 days)&lt;/span&gt;
aws logs start-query &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--log-group-name&lt;/span&gt; &lt;span class="s2"&gt;"/aws/ecs/prod-api"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-v-7d&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--end-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--query-string&lt;/span&gt; &lt;span class="s2"&gt;"stats sum(strlen(@message)) as totalBytes by @logStream | sort totalBytes desc | limit 10"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you know which service generates the most logs, go to that service and do three things: (1) check its log level, (2) check if it's logging stack traces on every request, (3) check if it's logging health check pings. Those three fix 90% of high-volume log problems. And when you're done with CloudWatch, &lt;a href="https://dev.to/blog/ecs-fargate-cost-visibility"&gt;the next invisible cost is per-environment attribution&lt;/a&gt;.&lt;/p&gt;
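&lt;p&gt;One way to check the health check suspicion is to export a sample of the noisy stream to a local file and measure how many lines match. A small sketch; the function name is an assumption, and &lt;code&gt;/health&lt;/code&gt; is a hypothetical health check path rather than one taken from a real service:&lt;/p&gt;

```shell
# Percentage of lines in a local log sample that match a pattern.
# Argument 1: path to a text file with one log line per row (for example,
#             a sample exported from the noisy log stream).
# Argument 2: an awk regular expression; "/health" below is a hypothetical
#             health check path.
matching_line_share() {
  awk -v pat="$2" '
    $0 ~ pat { hits++ }
    END {
      if (NR == 0) { print "0.0"; exit }
      printf "%.1f\n", 100 * hits / NR
    }' "$1"
}

# Usage: matching_line_share sample.log /health
```

&lt;p&gt;If the share is large, excluding that path from request logging is usually a smaller change than lowering the log level for the whole service.&lt;/p&gt;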




&lt;p&gt;&lt;strong&gt;Find what your fleet is spending:&lt;/strong&gt; &lt;a href="https://fortem.dev/audit" rel="noopener noreferrer"&gt;fortem.dev/audit&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ecs</category>
      <category>cloudwatch</category>
      <category>cost</category>
    </item>
    <item>
      <title>How Should You Set Up ECS Logging? (awslogs, FireLens, or Neither)</title>
      <dc:creator>Matt</dc:creator>
      <pubDate>Tue, 16 Jun 2026 08:33:06 +0000</pubDate>
      <link>https://dev.to/dspv/how-should-you-set-up-ecs-logging-awslogs-firelens-or-neither-2o</link>
      <guid>https://dev.to/dspv/how-should-you-set-up-ecs-logging-awslogs-firelens-or-neither-2o</guid>
      <description>&lt;h1&gt;
  
  
  How to Set Up ECS Logging the Right Way
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://fortem.dev/blog/aws-ecs-logging-guide" rel="noopener noreferrer"&gt;https://fortem.dev/blog/aws-ecs-logging-guide&lt;/a&gt;&lt;br&gt;
awslogs, FireLens, and the three decisions every ECS Fargate team gets wrong: blocking mode, Never Expire retention, and log group naming at fleet scale.&lt;/p&gt;
&lt;/blockquote&gt;





&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  ECS gives you three logging options: awslogs (CloudWatch only), FireLens (any destination, sidecar required), or none.&lt;/li&gt;
&lt;li&gt;  Default CloudWatch log retention is Never Expire — $0.03/GB/month accumulates silently. Set it to 30 days on every log group.&lt;/li&gt;
&lt;li&gt;  awslogs is synchronous by default. If CloudWatch is slow, your container blocks. Add mode: non-blocking to every task definition.&lt;/li&gt;
&lt;li&gt;  Name log groups /ecs/{environment}/{service}, not /ecs/{service}. Log groups cannot be renamed, so this is hard to fix once you have 30 environments.&lt;/li&gt;
&lt;li&gt;  Container Insights costs $0.07/metric/month — worth it in prod, skip it for dev environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ready to use — fix Never Expire retention on existing log groups&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find all log groups with no retention policy and set 30 days&lt;/span&gt;
aws logs describe-log-groups &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'logGroups[?retentionInDays==`null`].[logGroupName]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; text | &lt;span class="se"&gt;\&lt;/span&gt;
  xargs &lt;span class="nt"&gt;-I&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt; aws logs put-retention-policy &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--log-group-name&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--retention-in-days&lt;/span&gt; 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run once, then enforce via Terraform going forward (see &lt;code&gt;aws_cloudwatch_log_group&lt;/code&gt; below).&lt;/p&gt;

&lt;h2&gt;
  
  
  The three ECS logging options (and when each breaks)
&lt;/h2&gt;

&lt;p&gt;ECS supports awslogs (CloudWatch only, 3 params), FireLens (any destination, sidecar required), or none. Most teams start with awslogs and hit its limits somewhere between 5 and 10 services.&lt;/p&gt;

&lt;p&gt;The three options are not equally suited to every team. The right driver depends on where you want logs to go, whether you can tolerate blocking on delivery, and whether you run Windows containers. Here's the full comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Driver&lt;/th&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Destination&lt;/th&gt;
&lt;th&gt;Backpressure risk&lt;/th&gt;
&lt;th&gt;Windows&lt;/th&gt;
&lt;th&gt;Extra cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;awslogs&lt;/td&gt;
&lt;td&gt;3–4 params&lt;/td&gt;
&lt;td&gt;CloudWatch only&lt;/td&gt;
&lt;td&gt;Yes (blocking)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;$0 extra&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FireLens&lt;/td&gt;
&lt;td&gt;Sidecar + config&lt;/td&gt;
&lt;td&gt;CloudWatch + any&lt;/td&gt;
&lt;td&gt;No (buffered)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;~$0.005/task/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;td&gt;Nowhere&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;$0 (blind)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key decisions are: (1) do you need to send logs to anything other than CloudWatch? (2) do you run Windows containers? (3) can you tolerate blocking? If you answered no, no, and no, awslogs with non-blocking mode is the right setup. If you need another destination or stronger delivery guarantees, FireLens is the path, unless you run Windows containers, which FireLens does not support; in that case stay on awslogs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;KEY INSIGHT:&lt;/strong&gt; "None" is not zero-cost. A service with no logging driver is invisible during incidents. At 10+ environments, the time you spend reconstructing what happened from ALB access logs and CloudTrail is more expensive than the CloudWatch bill would have been.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Setting up awslogs: the 4 parameters you actually need
&lt;/h2&gt;

&lt;p&gt;awslogs requires three options: awslogs-group, awslogs-region, and awslogs-stream-prefix. If the log group is not created ahead of time (for example by Terraform, as below), add awslogs-create-group: true as the fourth, or the task fails to start when the group doesn't exist. The task execution role needs &lt;code&gt;logs:CreateLogStream&lt;/code&gt; and &lt;code&gt;logs:PutLogEvents&lt;/code&gt;, plus &lt;code&gt;logs:CreateLogGroup&lt;/code&gt; when awslogs-create-group is enabled.&lt;/p&gt;

&lt;p&gt;Missing IAM permissions cause silent log loss: the task starts, ECS shows it as healthy, but nothing appears in CloudWatch. No error in the task events. You find out during an incident. Below is the correct task definition snippet and the Terraform to create the log group with an enforced retention policy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Terraform — ECS task definition with correct awslogs config&lt;/span&gt;
&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_cloudwatch_log_group"&lt;/span&gt; &lt;span class="s2"&gt;"service"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/ecs/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;environment&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;service_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nx"&gt;retention_in_days&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Environment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;environment&lt;/span&gt;
    &lt;span class="nx"&gt;Service&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;service_name&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecs_task_definition"&lt;/span&gt; &lt;span class="s2"&gt;"service"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;# ... other fields ...&lt;/span&gt;

  &lt;span class="nx"&gt;container_definitions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;service_name&lt;/span&gt;
      &lt;span class="nx"&gt;image&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;container_image&lt;/span&gt;

      &lt;span class="nx"&gt;logConfiguration&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;logDriver&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"awslogs"&lt;/span&gt;
        &lt;span class="nx"&gt;options&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="s2"&gt;"awslogs-group"&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_cloudwatch_log_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
          &lt;span class="s2"&gt;"awslogs-region"&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_region&lt;/span&gt;
          &lt;span class="s2"&gt;"awslogs-stream-prefix"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ecs"&lt;/span&gt;
          &lt;span class="c1"&gt;# Non-blocking mode — prevents container blocking on CloudWatch hiccup&lt;/span&gt;
          &lt;span class="s2"&gt;"mode"&lt;/span&gt;                  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"non-blocking"&lt;/span&gt;
          &lt;span class="s2"&gt;"max-buffer-size"&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"25m"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the log group naming: &lt;code&gt;/ecs/{environment}/{service}&lt;/code&gt; — not &lt;code&gt;/ecs/{service}&lt;/code&gt;. This matters at scale, covered in the naming section below.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"logs:CreateLogStream and logs:PutLogEvents permission on the IAM role that you launch your container instances with."&lt;/p&gt;

&lt;p&gt;— &lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_awslogs.html" rel="noopener noreferrer"&gt;AWS ECS documentation&lt;/a&gt;, verified June 2026&lt;/p&gt;
&lt;/blockquote&gt;
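&lt;p&gt;Those two permissions can be scoped to a single log group. A sketch of a helper that prints such a policy document; the function name and the &lt;code&gt;Sid&lt;/code&gt; are assumptions, and you would add &lt;code&gt;logs:CreateLogGroup&lt;/code&gt; if you rely on &lt;code&gt;awslogs-create-group&lt;/code&gt;:&lt;/p&gt;

```shell
# Print a narrowly scoped IAM policy document for the awslogs driver.
# Argument 1: the log group ARN, without a trailing ":*".
# The Sid and the function name are illustrative assumptions.
awslogs_policy_json() {
  printf '%s\n' \
    '{' \
    '  "Version": "2012-10-17",' \
    '  "Statement": [{' \
    '    "Sid": "EcsAwslogsWrite",' \
    '    "Effect": "Allow",' \
    '    "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],' \
    "    \"Resource\": \"$1:*\"" \
    '  }]' \
    '}'
}

# Usage (account ID and names are placeholders, shown as a comment only):
#   aws iam put-role-policy \
#     --role-name my-task-execution-role \
#     --policy-name awslogs-write \
#     --policy-document "$(awslogs_policy_json arn:aws:logs:us-east-1:111122223333:log-group:/ecs/prod/api)"
```

&lt;p&gt;The trailing &lt;code&gt;:*&lt;/code&gt; on the resource covers the log streams inside the group, which is where the driver actually writes.&lt;/p&gt;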

&lt;h2&gt;
  
  
  The "Never Expire" retention tax
&lt;/h2&gt;

&lt;p&gt;Every log group created by ECS defaults to "Never Expire" retention. At $0.03/GB/month, a 10-service fleet accumulates $500+/month in pure storage costs by month 12 if you never set a policy.&lt;/p&gt;

&lt;p&gt;This is the most common silent cost in ECS fleets. New log groups, whether created by &lt;code&gt;awslogs-create-group: true&lt;/code&gt;, the console, or CloudFormation, all default to Never Expire. AWS sets this default because it's the safe option for them: you can never lose data you didn't intend to delete. For your bill, it means every GB ever ingested keeps accruing storage charges until you set a retention policy or delete the group.&lt;/p&gt;

&lt;p&gt;The math compounds fast. A fleet of 10 services producing 5 GB/day per service ingests 1,500 GB/month. After 12 months with no retention policy, you're storing ~18,000 GB. At $0.03/GB/month, that's $540/month just for storage — on top of the $750/month ingestion cost.&lt;/p&gt;
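&lt;p&gt;The accumulation is easy to reproduce. A rough sketch using the $0.03/GB/month rate from the text; it ignores compression and the free tier, and the function name is an assumption:&lt;/p&gt;

```shell
# Monthly storage charge after N months with no retention policy.
# Argument 1: GB ingested per month. Argument 2: months elapsed.
# Uses the $0.03/GB/month rate from the text; ignores compression and
# the free tier, so treat the result as a rough estimate.
storage_cost_at_month() {
  awk -v gb="$1" -v months="$2" 'BEGIN {
    printf "%.2f\n", gb * months * 0.03
  }'
}

# Print the storage bill at a few checkpoints for the 1,500 GB/month fleet.
for m in 1 6 12; do
  printf 'month %2d: $%s\n' "$m" "$(storage_cost_at_month 1500 "$m")"
done
```

&lt;p&gt;For the fleet above this prints $45.00 at month 1, $270.00 at month 6, and $540.00 at month 12. With a 30-day retention policy the bill stays at the month-1 figure.&lt;/p&gt;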

&lt;p&gt;We covered the full CloudWatch cost breakdown in more detail in &lt;a href="https://dev.to/blog/cloudwatch-costs-ecs/"&gt;how to control CloudWatch costs on ECS&lt;/a&gt; — including log-level filtering and Logs Insights query optimization. For most teams, retention alone cuts the bill by 30–40%.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;KEY INSIGHT:&lt;/strong&gt; 30 days covers 95% of incident investigations. 90 days covers compliance requirements for most regulated environments. "Never Expire" covers your AWS bill growing in perpetuity.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The awslogs backpressure problem
&lt;/h2&gt;

&lt;p&gt;awslogs is synchronous by default — if CloudWatch is slow or throttling, your container blocks waiting for the log write. Fix: set mode to non-blocking and max-buffer-size to 25m in logConfiguration.&lt;/p&gt;

&lt;p&gt;This is the production gotcha nobody documents until they've seen it. Under normal conditions, CloudWatch responds fast enough that you never notice the synchronous behavior. But when CloudWatch is throttling your account, when a region has degraded performance, or when you're logging at high throughput — your application containers pause waiting for log writes to complete. Requests time out. Health checks fail. ECS replaces the task.&lt;/p&gt;

&lt;p&gt;AWS confirmed this in their &lt;a href="https://aws.amazon.com/blogs/containers/choosing-container-logging-options-to-avoid-backpressure" rel="noopener noreferrer"&gt;container logging backpressure blog post&lt;/a&gt;: "an application can become blocked using the default awslogs driver." The fix is two lines in your task definition — add &lt;code&gt;mode: non-blocking&lt;/code&gt; and &lt;code&gt;max-buffer-size: 25m&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"logConfiguration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"logDriver"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"awslogs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"awslogs-group"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/ecs/prod/api"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"awslogs-region"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"awslogs-stream-prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ecs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"non-blocking"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"max-buffer-size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"25m"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The trade-off with non-blocking mode: if the buffer fills (25 MB as configured above), newer logs are dropped rather than blocking the container. This is the correct trade-off for production: a container that drops some logs under extreme pressure is better than one that stops serving requests. If you need guaranteed delivery, FireLens with filesystem buffering is the right answer.&lt;/p&gt;
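&lt;p&gt;How long a buffer lasts depends on how fast the container logs. A back-of-the-envelope sketch, assuming a steady log rate and ignoring per-message overhead (the function name is an assumption):&lt;/p&gt;

```shell
# Rough time in seconds that a non-blocking buffer can absorb a CloudWatch
# stall before it starts dropping log lines.
# Argument 1: max-buffer-size in MB.
# Argument 2: log volume in GB/day for one container.
# Assumes a steady log rate and ignores per-message overhead.
buffer_absorb_seconds() {
  awk -v buf_mb="$1" -v gb_day="$2" 'BEGIN {
    mb_per_sec = gb_day * 1024 / 86400
    printf "%.0f\n", buf_mb / mb_per_sec
  }'
}

buffer_absorb_seconds 25 5
# prints: 422
```

&lt;p&gt;At 5 GB/day from a single container, a 25 MB buffer covers roughly seven minutes of stalled delivery. Bursty workloads will fill it faster than the average suggests.&lt;/p&gt;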

&lt;h2&gt;
  
  
  When to switch to FireLens
&lt;/h2&gt;

&lt;p&gt;Switch to FireLens when you need multi-destination routing, log filtering at source, or guaranteed non-blocking delivery. Don't switch if CloudWatch is your only destination — awslogs is simpler and cheaper.&lt;/p&gt;

&lt;p&gt;FireLens runs a Fluent Bit sidecar container alongside your application. Your application's stdout goes to the Fluent Bit container (via the &lt;code&gt;awsfirelens&lt;/code&gt; log driver), and Fluent Bit routes it to one or more destinations using its configuration file. The sidecar approach adds ~10 MB of memory and a small amount of CPU, but it buys you filesystem buffering — Fluent Bit buffers logs to disk before delivery, so CloudWatch issues never block your application.&lt;/p&gt;

&lt;p&gt;Three specific cases where FireLens is the right call:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Multi-destination routing:&lt;/strong&gt; you want CloudWatch for compliance + S3 for long-term storage + Datadog for real-time analysis. awslogs sends to CloudWatch only. FireLens sends to all three simultaneously.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Filter before ingestion:&lt;/strong&gt; you have DEBUG logs that are useful locally but cost money in CloudWatch. Fluent Bit can drop DEBUG-level records before they're ingested, cutting your CloudWatch bill without changing application code.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Guaranteed delivery:&lt;/strong&gt; you're in a regulated environment where dropped logs are a compliance issue. Fluent Bit's filesystem buffer survives CloudWatch throttling, and Fluent Bit retries delivery automatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two important FireLens constraints: it does not support Windows containers on ECS, and it listens on port 24224 — block inbound traffic on that port in your task's security group or anyone on the same VPC can push logs to your sidecar.&lt;/p&gt;

&lt;h2&gt;
  
  
  Log group naming at fleet scale
&lt;/h2&gt;

&lt;p&gt;Name log groups &lt;code&gt;/ecs/{environment}/{service}&lt;/code&gt; — not &lt;code&gt;/ecs/{service}&lt;/code&gt;. Flat naming makes Logs Insights queries unworkable at 10+ environments and makes per-environment retention policies impossible to set.&lt;/p&gt;

&lt;p&gt;This is a convention decision that feels unimportant at 3 services and becomes a serious operational problem at 30. With flat naming (&lt;code&gt;/ecs/api&lt;/code&gt;, &lt;code&gt;/ecs/worker&lt;/code&gt;), you can't query "all prod logs" in a Logs Insights query without listing every group by name. You can't set a shorter retention on dev environments without affecting prod. You can't see cost by environment in the CloudWatch console.&lt;/p&gt;

&lt;p&gt;With hierarchical naming (&lt;code&gt;/ecs/prod/api&lt;/code&gt;, &lt;code&gt;/ecs/staging/api&lt;/code&gt;), Logs Insights can query all prod groups with &lt;code&gt;logGroupNamePrefix /ecs/prod&lt;/code&gt;. You can set 7-day retention on all dev groups using &lt;code&gt;describe-log-groups --log-group-name-prefix /ecs/dev&lt;/code&gt;. Naming is part of the broader &lt;a href="https://dev.to/blog/ecs-fargate-best-practices/"&gt;ECS Fargate best practices for fleet-scale operations&lt;/a&gt; — the same principle applies to CloudWatch metric namespaces, IAM role names, and ECS cluster names.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
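&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Query all prod logs at once (only practical with hierarchical naming).&lt;/span&gt;
&lt;span class="c"&gt;# start-query has no prefix flag, so resolve the prefix to group names first.&lt;/span&gt;
aws logs start-query &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--log-group-names&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;aws logs describe-log-groups &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--log-group-name-prefix&lt;/span&gt; &lt;span class="s2"&gt;"/ecs/prod"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'logGroups[].logGroupName'&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; text&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;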
  &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'1 hour ago'&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query-string&lt;/span&gt; &lt;span class="s1"&gt;'filter @message like /ERROR/ | stats count(*) by @logStream'&lt;/span&gt;

&lt;span class="c"&gt;# Set 7-day retention on all dev log groups&lt;/span&gt;
aws logs describe-log-groups &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--log-group-name-prefix&lt;/span&gt; &lt;span class="s2"&gt;"/ecs/dev"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'logGroups[].[logGroupName]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; text | &lt;span class="se"&gt;\&lt;/span&gt;
  xargs &lt;span class="nt"&gt;-I&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt; aws logs put-retention-policy &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--log-group-name&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--retention-in-days&lt;/span&gt; 7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
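&lt;p&gt;The convention is simple enough to enforce with two small helpers, one to build names and one to read the environment back out of an existing name. Both function names are assumptions for illustration:&lt;/p&gt;

```shell
# Build a hierarchical log group name: /ecs/{environment}/{service}.
log_group_name() {
  printf '/ecs/%s/%s\n' "$1" "$2"
}

# Extract the environment segment from a hierarchical log group name.
# Prints nothing for names that do not follow /ecs/{environment}/{service},
# which makes flat names such as /ecs/api easy to spot in an audit.
env_of_log_group() {
  case "$1" in
    /ecs/*/*) rest="${1#/ecs/}"; printf '%s\n' "${rest%%/*}" ;;
  esac
}

log_group_name prod api            # prints: /ecs/prod/api
env_of_log_group /ecs/staging/api  # prints: staging
```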



&lt;h2&gt;
  
  
  What ECS logging actually costs
&lt;/h2&gt;

&lt;p&gt;Ingestion costs $0.50/GB. Storage costs $0.03/GB/month. Logs Insights queries cost $0.12/GB scanned. A 10-service fleet producing 5 GB/day per service pays $750/month in ingestion before touching storage or queries.&lt;/p&gt;

&lt;p&gt;The full CloudWatch pricing breakdown (verified June 2026):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Line item&lt;/th&gt;
&lt;th&gt;Unit&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;th&gt;Note&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Log ingestion&lt;/td&gt;
&lt;td&gt;per GB&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;First 5 GB free/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Log storage&lt;/td&gt;
&lt;td&gt;per GB/month&lt;/td&gt;
&lt;td&gt;$0.03&lt;/td&gt;
&lt;td&gt;Never Expire = runs forever&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logs Insights queries&lt;/td&gt;
&lt;td&gt;per GB scanned&lt;/td&gt;
&lt;td&gt;$0.12&lt;/td&gt;
&lt;td&gt;Per GB, not per query&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Container Insights Enhanced&lt;/td&gt;
&lt;td&gt;per metric/month&lt;/td&gt;
&lt;td&gt;$0.07&lt;/td&gt;
&lt;td&gt;~30–50 metrics per 10 services&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Source: AWS CloudWatch pricing, verified June 2026&lt;/p&gt;

&lt;p&gt;Running the math on two real scenarios:&lt;/p&gt;

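&lt;p&gt;Scenario A: 10 services, 5 GB/day each&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Line item&lt;/th&gt;
&lt;th&gt;Amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Monthly log volume&lt;/td&gt;
&lt;td&gt;1,500 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ingestion cost&lt;/td&gt;
&lt;td&gt;$750/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage, no retention (month 12)&lt;/td&gt;
&lt;td&gt;$540/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage, 30-day retention&lt;/td&gt;
&lt;td&gt;$45/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logs Insights (100 GB/day queries)&lt;/td&gt;
&lt;td&gt;$360/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total without retention fix&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1,650/mo&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total with 30-day retention&lt;/td&gt;
&lt;td&gt;~$1,155/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Scenario B: 5 services, 1 GB/day each&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Line item&lt;/th&gt;
&lt;th&gt;Amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Monthly log volume&lt;/td&gt;
&lt;td&gt;150 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ingestion cost&lt;/td&gt;
&lt;td&gt;$72/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage, no retention (month 12)&lt;/td&gt;
&lt;td&gt;~$54/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage, 30-day retention&lt;/td&gt;
&lt;td&gt;~$4.50/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total without retention fix&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$126/mo&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total with 30-day retention&lt;/td&gt;
&lt;td&gt;~$77/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;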

&lt;p&gt;The ingestion cost ($0.50/GB) is largely unavoidable if you're sending logs to CloudWatch. The storage cost is 100% avoidable with a retention policy. The Insights cost scales with how much data you scan per query — shorter retention windows mean cheaper queries.&lt;/p&gt;
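&lt;p&gt;To run the same arithmetic for your own fleet, here is a small sketch that uses the three rates from the pricing table. It assumes a 30-day month and ignores the free tier; the function name and argument order are assumptions:&lt;/p&gt;

```shell
# Reproduce the scenario arithmetic using the rates from the pricing table.
# Arguments: number of services, GB/day per service, GB/day scanned by
# Logs Insights. Assumes a 30-day month and ignores the free tier.
scenario_costs() {
  awk -v n="$1" -v gb_day="$2" -v scan_day="$3" 'BEGIN {
    volume   = n * gb_day * 30       # GB ingested per month
    ingest   = volume * 0.50         # ingestion charge
    keep_all = volume * 12 * 0.03    # storage at month 12, Never Expire
    keep_30  = volume * 0.03         # storage with 30-day retention
    insights = scan_day * 30 * 0.12  # Logs Insights scans
    printf "no_retention=%.2f with_30d=%.2f\n",
      ingest + keep_all + insights, ingest + keep_30 + insights
  }'
}

scenario_costs 10 5 100
# prints: no_retention=1650.00 with_30d=1155.00
```

&lt;p&gt;Running it with &lt;code&gt;5 1 0&lt;/code&gt; gives 129.00 and 79.50, slightly above the Scenario B figures, which appear to net out the 5 GB free tier on ingestion.&lt;/p&gt;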

&lt;blockquote&gt;
&lt;p&gt;"When a new CloudWatch Log Group is created, its data retention policy is automatically set to 'Never Expire.' While this ensures logs are always available, it also results in unnecessary storage costs over time."&lt;/p&gt;

&lt;p&gt;— &lt;a href="https://towardsaws.com/cloudwatch-logs-are-eating-your-money-the-retention-setting-you-forgot-to-change-27856bb5b0f7" rel="noopener noreferrer"&gt;Towards AWS&lt;/a&gt;, verified June 2026&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  If you read this, you might also want to know
&lt;/h2&gt;

&lt;p&gt;What happens to logs already in CloudWatch if I switch from awslogs to FireLens?&lt;/p&gt;

&lt;p&gt;Nothing — existing logs stay in CloudWatch. FireLens only affects where new logs go. The switch requires updating the task definition and redeploying (new task revision). Old log streams stay queryable under the same log groups.&lt;/p&gt;

&lt;p&gt;Can I send ECS logs to Datadog without FireLens?&lt;/p&gt;

&lt;p&gt;Not directly. The awslogs driver sends to CloudWatch only. You can set up a Lambda function to forward CloudWatch logs to Datadog, but FireLens is the cleaner path — send directly from the container to Datadog without going through CloudWatch at all.&lt;/p&gt;

&lt;p&gt;How do I query logs across multiple ECS environments in Logs Insights?&lt;/p&gt;

&lt;p&gt;Use the log group name prefix filter in Logs Insights — logGroupNamePrefix /ecs/prod queries all groups under that prefix. This only works with hierarchical naming (/ecs/{environment}/{service}). Flat naming requires selecting each group individually.&lt;/p&gt;

&lt;p&gt;Should I use structured (JSON) logging or plaintext in ECS?&lt;/p&gt;

&lt;p&gt;JSON. CloudWatch Logs Insights can parse and filter JSON fields natively, which reduces the GB scanned per query. Plaintext requires grep-style filters that scan every byte. The switch to JSON in your application doesn't change ECS or CloudWatch config — it's an application-level change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Does ECS automatically create CloudWatch log groups?
&lt;/h3&gt;

&lt;p&gt;Yes, if you set awslogs-create-group to true in the log configuration and the task execution role has the logs:CreateLogGroup permission. Without the option, the task fails to start if the log group doesn't exist. Log groups created this way default to Never Expire retention, so set an explicit policy immediately.&lt;/p&gt;

&lt;h3&gt;
  
  
  What IAM permissions does ECS need for CloudWatch logging?
&lt;/h3&gt;

&lt;p&gt;The task execution role needs logs:CreateLogStream and logs:PutLogEvents. Missing these causes silent log loss — the task starts normally but no logs appear in CloudWatch. No error is surfaced in the ECS console.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use FireLens on Windows containers in ECS?
&lt;/h3&gt;

&lt;p&gt;No. FireLens is not supported for Windows containers on ECS. Use the awslogs driver for CloudWatch, or configure a third-party logging agent compatible with Windows containers.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much does CloudWatch Logs Insights cost?
&lt;/h3&gt;

&lt;p&gt;$0.12 per GB scanned. Querying 100 GB of logs costs $12. Cost scales with data volume scanned, not query count — use time ranges and log group filters to reduce GB scanned per query.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the default CloudWatch log retention for ECS?
&lt;/h3&gt;

&lt;p&gt;Never Expire. Every log group ECS creates defaults to Never Expire retention at $0.03/GB/month. A 10-service fleet logging 5 GB/day per service accumulates over $500/month in storage alone by month 12. Set a 30-day retention policy on every log group.&lt;/p&gt;
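&lt;p&gt;Finding the groups that still need a policy is a simple filter. The sketch below works on plain dictionaries shaped like the &lt;code&gt;logGroups&lt;/code&gt; entries that the CloudWatch Logs DescribeLogGroups API returns, where &lt;code&gt;retentionInDays&lt;/code&gt; is absent when a group never expires. The helper name, the &lt;code&gt;/ecs/&lt;/code&gt; prefix, and the sample data are assumptions for illustration.&lt;/p&gt;

```python
RETENTION_DAYS = 30  # target policy suggested in the article


def groups_missing_retention(log_groups, prefix="/ecs/"):
    """Return names of log groups under prefix that have no retention policy."""
    return [
        group["logGroupName"]
        for group in log_groups
        if group["logGroupName"].startswith(prefix)
        and "retentionInDays" not in group
    ]


# Hypothetical sample shaped like the "logGroups" list in a DescribeLogGroups response.
sample = [
    {"logGroupName": "/ecs/staging/payments", "storedBytes": 52_000_000_000},
    {"logGroupName": "/ecs/staging/orders", "retentionInDays": 30},
    {"logGroupName": "/aws/lambda/report", "storedBytes": 1_000_000},
]

for name in groups_missing_retention(sample):
    # With boto3, this is where you would call
    # logs.put_retention_policy(logGroupName=name, retentionInDays=RETENTION_DAYS)
    print(f"{name}: no retention policy, would set {RETENTION_DAYS} days")
```

&lt;p&gt;Running the check on a schedule, rather than once, catches groups that new services create later with the Never Expire default.&lt;/p&gt;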





&lt;p&gt;&lt;strong&gt;Map your fleet in 5 min:&lt;/strong&gt; &lt;a href="https://fortem.dev/audit" rel="noopener noreferrer"&gt;fortem.dev/audit&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ecs</category>
      <category>cloudwatch</category>
      <category>devops</category>
    </item>
    <item>
      <title>ECS Service Discovery: Cloud Map, Service Connect, or an Internal Load Balancer?</title>
      <dc:creator>Matt</dc:creator>
      <pubDate>Mon, 15 Jun 2026 11:40:38 +0000</pubDate>
      <link>https://dev.to/dspv/ecs-service-discovery-cloud-map-service-connect-or-an-internal-load-balancer-247a</link>
      <guid>https://dev.to/dspv/ecs-service-discovery-cloud-map-service-connect-or-an-internal-load-balancer-247a</guid>
      <description>&lt;h1&gt;
  
  
  ECS Service Discovery: Cloud Map vs Service Connect
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://fortem.dev/blog/ecs-service-discovery-guide" rel="noopener noreferrer"&gt;https://fortem.dev/blog/ecs-service-discovery-guide&lt;/a&gt;&lt;br&gt;
Cloud Map, Service Connect, or an internal ALB? A practical decision framework for ECS Fargate teams — with the July 2025 blue/green unblock, real cost math, and Terraform snippet.&lt;/p&gt;
&lt;/blockquote&gt;





&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  AWS docs say "We recommend Service Connect" for new ECS-to-ECS traffic — built-in retries, metrics, and mTLS without extra config.&lt;/li&gt;
&lt;li&gt;  Blue/green + Service Connect shipped July 17, 2025. The main reason teams stayed on Cloud Map is gone.&lt;/li&gt;
&lt;li&gt;  Cloud Map service discovery still wins for non-ECS callers (Lambda, EC2, on-prem), but it is capped at 1,000 tasks per service; Service Connect has no such limit.&lt;/li&gt;
&lt;li&gt;  An internal ALB wins when you need L7 routing rules or a stable endpoint callable by anything outside ECS.&lt;/li&gt;
&lt;li&gt;  Services not enrolled in Service Connect cannot resolve its short names — namespace membership is required on both sides.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your first ECS service called its dependency by ALB DNS name. That worked fine. Now you have 15 services, some team is asking about mTLS, someone noticed the console keeps surfacing "Service Connect," and a cross-account architecture is on the roadmap. Here's what the three options actually do, where each breaks, and the rule for choosing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What service discovery actually solves
&lt;/h2&gt;

&lt;p&gt;Service discovery maps a logical name to healthy task IPs as tasks start, stop, and reschedule — without you touching DNS entries or hardcoding addresses.&lt;/p&gt;

&lt;p&gt;ECS tasks on Fargate get ephemeral private IPs. The IP changes every time a task restarts or a new version deploys. If service A hardcodes the IP of service B, you get connection failures on every B deployment. An ALB solves this — the ALB DNS name stays stable — but you're paying ~$16–22/month per ALB and adding a network hop for every service-to-service call.&lt;/p&gt;

&lt;p&gt;Service discovery handles three things automatically: registration (ECS adds a new task to the registry on launch), discovery (your app resolves a name to the current set of healthy IPs), and deregistration (ECS removes a task on shutdown). Your app just calls &lt;code&gt;http://payments&lt;/code&gt; or &lt;code&gt;http://payments.internal.example.com&lt;/code&gt; and gets a healthy endpoint back.&lt;/p&gt;

&lt;p&gt;What it is not: a load balancer, an API gateway, or a circuit breaker. It finds services. Retry logic, traffic shaping, and rate limiting are separate concerns — though Service Connect handles the first two.&lt;/p&gt;

&lt;h2&gt;
  
  
  Option 1 — ECS Service Connect
&lt;/h2&gt;

&lt;p&gt;Service Connect is AWS's recommended approach for ECS-to-ECS traffic. ECS injects an Envoy sidecar per task that handles routing, retries, metrics, and optional mTLS — your app just calls a short name.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"We recommend Service Connect, which provides Amazon ECS configuration for service discovery, connectivity, and traffic monitoring."&lt;/p&gt;

&lt;p&gt;— &lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/networking-connecting-services.html" rel="noopener noreferrer"&gt;AWS ECS networking best practices&lt;/a&gt;, verified June 2026&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;How it works.&lt;/strong&gt; When you enable Service Connect on an ECS service, ECS automatically runs an Envoy proxy container inside each task. Services call each other by short name — &lt;code&gt;http://payments&lt;/code&gt; instead of &lt;code&gt;payments.internal.example.com&lt;/code&gt;. The proxy handles load balancing, retries on transient failures, and outlier detection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DNS mechanism.&lt;/strong&gt; Service Connect does not write to VPC DNS or Route 53. It manages its own namespace. Short names are only resolvable from inside tasks enrolled in the same namespace. A Lambda function, EC2 instance, or ECS service not in the namespace cannot resolve those names — they get a connection timeout instead of a useful error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost.&lt;/strong&gt; Service Connect itself is free. The Cloud Map backend usage is free when used via Service Connect. You pay for the Envoy sidecar's compute: AWS recommends 0.1 vCPU and 128 MB per task, which at On-Demand Fargate rates ($0.04048/vCPU-hr, $0.004445/GB-hr) works out to (0.1 × 0.04048 + 0.125 × 0.004445) × 730 hours ≈ $3.36/month per task. For a 20-service fleet with 3 tasks per service, that's ~$200/month of sidecar overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blue/green deployments.&lt;/strong&gt; As of July 17, 2025, ECS Native Blue/Green supports Service Connect. Test traffic routes to the green revision via the &lt;code&gt;x-amzn-ecs-blue-green-test&lt;/code&gt; header. The previous blocker — CodeDeploy blue/green was incompatible with Service Connect — no longer applies if you migrate to ECS Native Blue/Green.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;mTLS.&lt;/strong&gt; Native support via AWS Private CA. Service Connect rotates TLS certificates every 5 days — roughly 6 certificates per service per month. Factor in Private CA cost if you enable this.&lt;/p&gt;

&lt;p&gt;Known limitation&lt;/p&gt;

&lt;p&gt;If you create some services, wait more than 5 hours, then add more services to the same cluster, the original services may not resolve the new ones via DNS until you redeploy them. This is a known proxy registration timing issue — not a showstopper, but something to handle in your deployment runbook.&lt;/p&gt;

&lt;h2&gt;
  
  
  Option 2 — Cloud Map service discovery
&lt;/h2&gt;

&lt;p&gt;Cloud Map service discovery registers ECS tasks as Route 53 DNS A-records. Any VPC resource — Lambda, EC2, on-prem via Direct Connect — can resolve those names. No proxy overhead, no namespace enrollment requirement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works.&lt;/strong&gt; ECS registers each task's private IP with a Cloud Map service on launch and deregisters it on shutdown. Route 53 returns the current set of healthy task IPs for a DNS query like &lt;code&gt;payments.internal.example.com&lt;/code&gt;. Your app connects directly to the task IP, with no proxy in the path. The trade-off is caching: after a task stops, callers can keep using its stale IP until the DNS TTL expires, so a high TTL means a longer window of failed connections.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost.&lt;/strong&gt; Cloud Map charges $0.10 per registered resource per month, plus $1.00 per million HTTP API calls for non-DNS lookups. Route 53 adds $0.50/month per private hosted zone and $0.40/million DNS queries. For a fleet with 20 services × 3 tasks, that's ~$6/month in Cloud Map registry fees plus minimal Route 53 query costs. Compare that to Service Connect's sidecar overhead for the same fleet, roughly $200/month at 0.1 vCPU and 128 MB per task: Cloud Map is considerably cheaper at this scale.&lt;/p&gt;
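&lt;p&gt;This comparison is easy to get wrong by a decimal place, so here is the arithmetic spelled out as a small sketch. It uses the Fargate On-Demand rates and sidecar sizing quoted in this article; the 730-hour month and the function names are assumptions for illustration, and Route 53 query costs are left out.&lt;/p&gt;

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month

# Fargate On-Demand rates quoted in this article (USD)
VCPU_PER_HOUR = 0.04048
GB_PER_HOUR = 0.004445

# Sidecar sizing quoted in this article: 0.1 vCPU and 128 MB
SIDECAR_VCPU = 0.1
SIDECAR_GB = 128 / 1024

CLOUD_MAP_PER_INSTANCE_MONTH = 0.10


def sidecar_cost_per_task():
    """Monthly cost of one Envoy sidecar running 24/7."""
    hourly = SIDECAR_VCPU * VCPU_PER_HOUR + SIDECAR_GB * GB_PER_HOUR
    return hourly * HOURS_PER_MONTH


def fleet_costs(services, tasks_per_service):
    """Compare monthly sidecar overhead with Cloud Map registry fees."""
    tasks = services * tasks_per_service
    return {
        "tasks": tasks,
        "service_connect_sidecars": tasks * sidecar_cost_per_task(),
        "cloud_map_registry": tasks * CLOUD_MAP_PER_INSTANCE_MONTH,
    }


fleet = fleet_costs(services=20, tasks_per_service=3)
print(f"Per-task sidecar:   ${sidecar_cost_per_task():.2f}/month")              # $3.36
print(f"Sidecars, 60 tasks: ${fleet['service_connect_sidecars']:.2f}/month")    # $201.64
print(f"Cloud Map registry: ${fleet['cloud_map_registry']:.2f}/month")          # $6.00
```

&lt;p&gt;The sidecar figure is a marginal cost: it only shows up on the bill if the proxy pushes a task into a larger Fargate size than it would otherwise need.&lt;/p&gt;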

&lt;p&gt;&lt;strong&gt;1,000-task limit.&lt;/strong&gt; AWS docs state: "Services configured to use service discovery have a limit of 1,000 tasks per service. This is due to a Route 53 service quota." If you run high-scale services — a job runner, a stateless API serving burst traffic — Cloud Map service discovery hits this ceiling. Service Connect does not have this limit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manual cleanup.&lt;/strong&gt; AWS docs: "The AWS Cloud Map resources created when service discovery is used must be cleaned up manually." Delete an ECS service and its Cloud Map registration stays. Over time, stale records accumulate — you pay $0.10/month per orphaned registration. Add a cleanup step to your service teardown runbook.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DNS propagation.&lt;/strong&gt; There is a window between when a task stops and when Route 53 TTL expires. If your DNS TTL is 30 seconds and you scale down rapidly, callers may get dropped connections. AWS's own migration blog documents dropped requests under load during fast scale-in. Set TTL to 10 seconds or less for services with frequent task churn, and implement retry logic in your clients.&lt;/p&gt;
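&lt;p&gt;The client-side retry advice can be sketched as follows. The important detail is that the name is resolved again before every attempt, so a retry can land on a task that is still registered instead of the stale IP. The helper &lt;code&gt;call_with_fresh_dns&lt;/code&gt; and its parameters are hypothetical, not part of any AWS SDK, and an OS-level or runtime DNS cache can still delay the refresh.&lt;/p&gt;

```python
import socket
import time


def call_with_fresh_dns(host, port, send, attempts=3, base_delay=0.2,
                        resolve=socket.getaddrinfo, sleep=time.sleep):
    """Call send(ip, port), re-resolving host before every attempt.

    send is any callable that opens a connection and returns a result;
    it should raise OSError (for example ConnectionRefusedError) on failure.
    """
    if not attempts > 0:
        raise ValueError("attempts must be positive")
    last_error = None
    for attempt in range(attempts):
        try:
            # Resolve on every attempt instead of reusing a cached address.
            infos = resolve(host, port, type=socket.SOCK_STREAM)
            ip = infos[0][4][0]
            return send(ip, port)
        except OSError as exc:  # includes DNS failures and refused connections
            last_error = exc
            if attempt != attempts - 1:
                sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise last_error
```

&lt;p&gt;Keeping the total backoff close to the DNS TTL gives the record time to update before the final attempt.&lt;/p&gt;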

&lt;p&gt;When to choose Cloud Map: you have non-ECS callers; you need cross-account without setting up AWS RAM; or your existing Cloud Map setup already works and you don't need mTLS or built-in retries. The exception is any service that needs more than 1,000 tasks: it hits the Route 53 quota and belongs on Service Connect.&lt;/p&gt;

&lt;p&gt;On &lt;a href="https://dev.to/blog/ecs-multi-environment-strategy/"&gt;ECS multi-environment strategy&lt;/a&gt; and mixed-compute architectures, Cloud Map is the only option that gives all layers of the stack a single naming layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Option 3 — Internal ALB
&lt;/h2&gt;

&lt;p&gt;An internal Application Load Balancer gives any compute (Lambda, EC2, ECS, on-prem) a stable HTTP endpoint for an ECS service. The fixed hourly charge comes to roughly $16–22/month per ALB regardless of traffic, and LCU charges scale with usage on top of that.&lt;/p&gt;

&lt;p&gt;The ALB handles health checks and connection draining automatically. When an ECS task stops, it is deregistered from the target group and in-flight requests drain before the task terminates, so connections aren't dropped. You also get L7 routing rules: path-based routing can send &lt;code&gt;/api/users/*&lt;/code&gt; and &lt;code&gt;/api/orders/*&lt;/code&gt; to different ECS services behind a single ALB.&lt;/p&gt;

&lt;p&gt;When an internal ALB is the right answer: a Lambda function calls an ECS API and needs a stable HTTPS endpoint; you want path-based routing between services; or you need WebSocket support with sticky sessions across multiple ECS tasks.&lt;/p&gt;

&lt;p&gt;The cost adds up fast. At &lt;a href="https://dev.to/blog/aws-fargate-pricing-real-costs/"&gt;~$22/month per ALB&lt;/a&gt;, 10 internal ALBs (one per service) cost $220/month before any traffic. Sharing one ALB across services with path routing brings this to $22/month — but then you're managing routing rules as a shared resource.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision table
&lt;/h2&gt;

&lt;p&gt;Use Service Connect for pure ECS-to-ECS traffic in a single Region. Use Cloud Map when non-ECS services are involved. Use an internal ALB when you need L7 routing or a stable endpoint callable by anything.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Service Connect&lt;/th&gt;
&lt;th&gt;Cloud Map&lt;/th&gt;
&lt;th&gt;Internal ALB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ECS → ECS, same cluster&lt;/td&gt;
&lt;td&gt;✓ Recommended&lt;/td&gt;
&lt;td&gt;Works&lt;/td&gt;
&lt;td&gt;Works (extra cost)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECS → ECS, cross-cluster&lt;/td&gt;
&lt;td&gt;✓ (shared namespace)&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;Works&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda → ECS&lt;/td&gt;
&lt;td&gt;✗ (can't resolve names)&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EC2 → ECS&lt;/td&gt;
&lt;td&gt;✗ (not enrolled)&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-account (with RAM)&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;Works&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Service with 1,000+ tasks&lt;/td&gt;
&lt;td&gt;✓ (no task limit)&lt;/td&gt;
&lt;td&gt;✗ (Route 53 quota)&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mTLS required&lt;/td&gt;
&lt;td&gt;✓ (native, via Private CA)&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Blue/green deployments&lt;/td&gt;
&lt;td&gt;✓ (as of July 2025)&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;KEY INSIGHT:&lt;/strong&gt; If you're starting a new ECS service today and all its callers are also ECS services, choose Service Connect. If you have an existing Cloud Map setup that works and you don't need mTLS or built-in retries, leave it alone — the migration cost isn't worth it until you have a concrete reason to switch.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Migrating from Cloud Map to Service Connect
&lt;/h2&gt;

&lt;p&gt;AWS publishes a migration guide. The short version: add &lt;code&gt;service_connect_configuration&lt;/code&gt; to your ECS service, redeploy, then remove the &lt;code&gt;service_registries&lt;/code&gt; block. One deploy cycle per service.&lt;/p&gt;

&lt;p&gt;The catch: Service Connect and Cloud Map service discovery are not cross-compatible at the namespace level. A caller on Cloud Map cannot resolve Service Connect short names, and vice versa. Migrate the callee first, then migrate callers in the same deployment window — or keep the Cloud Map registration running in parallel during the cutover.&lt;/p&gt;

&lt;p&gt;Here is the Terraform diff for a single service:&lt;/p&gt;

&lt;p&gt;Ready to use — Terraform service_connect_configuration&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecs_service"&lt;/span&gt; &lt;span class="s2"&gt;"payments"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"payments"&lt;/span&gt;
  &lt;span class="nx"&gt;cluster&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ecs_cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;task_definition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ecs_task_definition&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;payments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;desired_count&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

  &lt;span class="c1"&gt;# Remove this block when migrating to Service Connect:&lt;/span&gt;
  &lt;span class="c1"&gt;# service_registries {&lt;/span&gt;
  &lt;span class="c1"&gt;#   registry_arn = aws_service_discovery_service.payments.arn&lt;/span&gt;
  &lt;span class="c1"&gt;# }&lt;/span&gt;

  &lt;span class="nx"&gt;service_connect_configuration&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;enabled&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="nx"&gt;namespace&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_service_discovery_private_dns_namespace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;

    &lt;span class="nx"&gt;service&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;port_name&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"http"&lt;/span&gt;  &lt;span class="c1"&gt;# must match the port mapping "name" in the task definition&lt;/span&gt;
      &lt;span class="nx"&gt;discovery_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"payments"&lt;/span&gt;

      &lt;span class="nx"&gt;client_alias&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;dns_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"payments"&lt;/span&gt;
        &lt;span class="nx"&gt;port&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8080&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Task definition: add a "name" to the port mapping (referenced by port_name)&lt;/span&gt;
&lt;span class="c1"&gt;# "portMappings": [&lt;/span&gt;
&lt;span class="c1"&gt;#   {&lt;/span&gt;
&lt;span class="c1"&gt;#     "name": "http",&lt;/span&gt;
&lt;span class="c1"&gt;#     "containerPort": 8080,&lt;/span&gt;
&lt;span class="c1"&gt;#     "protocol": "tcp"&lt;/span&gt;
&lt;span class="c1"&gt;#   }&lt;/span&gt;
&lt;span class="c1"&gt;# ]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What changes after the deploy: each task gets an Envoy sidecar container. The service is reachable at &lt;code&gt;http://payments:8080&lt;/code&gt; from any other Service Connect service in the same namespace. Your VPC, subnets, and security groups don't change; the only task definition change is the named port mapping shown in the comments above.&lt;/p&gt;

&lt;p&gt;One thing to watch: App Mesh reaches end-of-life on September 30, 2026. If you're still on App Mesh for ECS workloads, Service Connect is the replacement. (For EKS, AWS recommends VPC Lattice instead.)&lt;/p&gt;

&lt;h2&gt;
  
  
  If you read this, you might also want to know
&lt;/h2&gt;

&lt;p&gt;How does Service Connect compare to AWS App Mesh?&lt;/p&gt;

&lt;p&gt;App Mesh was the predecessor service mesh for ECS and EKS — it reaches end-of-life September 30, 2026. Service Connect is the ECS replacement: simpler to configure (no separate mesh/virtual service objects), fully managed Envoy injection, and integrated with ECS deployments. If you're on App Mesh, migrate to Service Connect for ECS workloads or VPC Lattice for EKS.&lt;/p&gt;

&lt;p&gt;Can I run Service Connect and Cloud Map service discovery on the same ECS service?&lt;/p&gt;

&lt;p&gt;Yes, during a migration. The migration path above has a service carry both service_registries (Cloud Map) and service_connect_configuration (Service Connect) for at least one deploy cycle, so callers on either mechanism keep working during the cutover. Treat this as a transitional state and remove service_registries once every caller has moved to Service Connect.&lt;/p&gt;

&lt;p&gt;What happens to Service Connect traffic when a task crashes mid-request?&lt;/p&gt;

&lt;p&gt;The Envoy proxy in the calling task detects the failed connection and retries on a different task instance using the configured retry policy. Outlier detection marks unhealthy tasks and stops routing to them until they recover. This is the main reliability advantage over Cloud Map, where retry logic lives entirely in your application code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Does Service Connect work with blue/green deployments?
&lt;/h3&gt;

&lt;p&gt;Yes, as of July 17, 2025. AWS shipped built-in blue/green deployments for ECS — including Service Connect. Test traffic routing uses the x-amzn-ecs-blue-green-test header. The previous blocker (CodeDeploy blue/green was incompatible with Service Connect) no longer applies with ECS Native Blue/Green.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use ECS Service Connect across AWS accounts?
&lt;/h3&gt;

&lt;p&gt;Yes, with AWS Resource Access Manager (RAM). Share the Cloud Map namespace across accounts and configure Service Connect in each account's ECS cluster to use the shared namespace. Without RAM, Cloud Map service discovery is simpler for cross-account: it uses Route 53 / VPC DNS, which you extend with VPC peering or Transit Gateway.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much does the Service Connect Envoy sidecar cost on Fargate?
&lt;/h3&gt;

&lt;p&gt;Service Connect itself is free: there is no charge for the feature or the Cloud Map backend. You pay only for the Envoy proxy sidecar container's CPU and memory. AWS recommends 0.1 vCPU and 128 MB per task. At Fargate On-Demand rates ($0.04048/vCPU-hr, $0.004445/GB-hr), that's roughly $3.36/month per task running 24/7: (0.1 × 0.04048 + 0.125 × 0.004445) × 730 hours.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why can't my service resolve Service Connect names?
&lt;/h3&gt;

&lt;p&gt;Service Connect manages its own namespace — it does NOT update VPC DNS. A service not enrolled in the same Service Connect namespace cannot resolve those short names. Fix: enroll the calling service in the same namespace, or switch the caller to Cloud Map service discovery (which uses Route 53 / VPC DNS and is accessible to any VPC resource).&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the difference between Cloud Map and ECS service discovery?
&lt;/h3&gt;

&lt;p&gt;They're the same thing. 'ECS service discovery' is the ECS feature that registers tasks into AWS Cloud Map namespaces using Route 53 DNS. Cloud Map is the underlying registry. Service Connect is a newer, higher-level abstraction built on top of Cloud Map — it adds an Envoy proxy per task for retries, metrics, and mTLS, without requiring your app to change its DNS lookup code.&lt;/p&gt;





&lt;p&gt;&lt;strong&gt;Map your fleet in 5 min:&lt;/strong&gt; &lt;a href="https://fortem.dev/audit" rel="noopener noreferrer"&gt;fortem.dev/audit&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ecs</category>
      <category>fargate</category>
      <category>microservices</category>
    </item>
    <item>
      <title>ArgoCD Alternatives in 2026: 5 Real Options (and One Nobody Mentions)</title>
      <dc:creator>Matt</dc:creator>
      <pubDate>Sun, 14 Jun 2026 09:11:11 +0000</pubDate>
      <link>https://dev.to/dspv/argocd-alternatives-in-2026-5-real-options-and-one-nobody-mentions-210k</link>
      <guid>https://dev.to/dspv/argocd-alternatives-in-2026-5-real-options-and-one-nobody-mentions-210k</guid>
      <description>&lt;h1&gt;
  
  
  ArgoCD Alternatives in 2026: 5 Real Options (and One Nobody Mentions)
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://fortem.dev/blog/argocd-alternative" rel="noopener noreferrer"&gt;https://fortem.dev/blog/argocd-alternative&lt;/a&gt;&lt;br&gt;
An honest comparison of ArgoCD alternatives: Flux, Fleet, Harness, Spinnaker, plain CI — and the option comparison posts skip: not needing GitOps at all.&lt;/p&gt;
&lt;/blockquote&gt;





&lt;p&gt;Most “ArgoCD alternatives” posts are written by people who have never debugged a stuck sync wave at 2 a.m. This one isn't.&lt;/p&gt;

&lt;p&gt;Teams leave ArgoCD for specific reasons: App-of-Apps complexity that grows faster than the fleet it manages, multi-cluster RBAC that fights you, drift detection that cries wolf, a UI that becomes the bottleneck instead of the helper. If one of those is your reason, the right alternative depends on which one. Here are the five real options — and a sixth that comparison posts never include, because it questions the premise.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  Flux CD: best direct replacement — lighter, composable, CNCF-graduated, active post-Weaveworks.&lt;/li&gt;
&lt;li&gt;  Rancher Fleet: best for managing dozens or hundreds of clusters — overkill below that.&lt;/li&gt;
&lt;li&gt;  Harness CD / managed: best when operating ArgoCD costs more than buying the service.&lt;/li&gt;
&lt;li&gt;  Spinnaker: enterprise pipelines, multi-cloud, canary analysis — serious operational overhead.&lt;/li&gt;
&lt;li&gt;  Plain CI + helm upgrade: underrated for small fleets where GitOps reconciliation is unnecessary.&lt;/li&gt;
&lt;li&gt;  ECS Fargate: if the pain is Kubernetes itself, the problem category disappears on ECS.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ready to use — decision checklist&lt;/p&gt;

&lt;p&gt;Before picking an alternative, answer these five questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Is the pain in ArgoCD itself (sync, RBAC, UI), or in Kubernetes operations (nodes, upgrades, cost)?&lt;/li&gt;
&lt;li&gt; How many clusters do you manage: fewer than 5, or dozens+?&lt;/li&gt;
&lt;li&gt; Does your team have time to assemble and maintain CD tooling, or do you need to buy that back?&lt;/li&gt;
&lt;li&gt; Do you need multi-cloud, approval gates, or canary analysis as first-class features?&lt;/li&gt;
&lt;li&gt; Are your workloads stateless services on AWS, or do they depend on K8s operators and CRDs?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Map your answers to the summary table at the bottom of this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Flux CD — GitOps primitives without the platform
&lt;/h2&gt;

&lt;p&gt;Flux is the closest direct ArgoCD replacement — CNCF-graduated, lighter on cluster resources, and composable instead of opinionated.&lt;/p&gt;

&lt;p&gt;Flux is the other CNCF-graduated GitOps project. Where ArgoCD ships an opinionated platform (UI, SSO, RBAC, ApplicationSets), Flux ships composable controllers (source, kustomize, helm, notification) that you assemble into the workflow you actually want. Reconciliation logic is comparable; the philosophy is not.&lt;/p&gt;

&lt;p&gt;Pick it if&lt;/p&gt;

&lt;p&gt;ArgoCD's resource weight and platform opinions are your problem. Flux controllers are lighter on cluster resources, the Kustomize/Helm integration feels native rather than bolted on, and there is no central UI server to scale, secure, and babysit.&lt;/p&gt;

&lt;p&gt;Skip it if&lt;/p&gt;

&lt;p&gt;Your developers live in the ArgoCD UI and you have no appetite to rebuild that experience.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;KEY INSIGHT:&lt;/strong&gt; The “Flux has no UI” objection is outdated. As of 2025–2026, the Flux Operator ships a Web UI with cluster dashboards and deep-dive views into Kustomizations and HelmReleases. The gap with ArgoCD's UI has narrowed substantially.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The “is Flux dead?” question is also settled. After Weaveworks shut down in 2024, core maintainers moved to ControlPlane, which employs them to work on the project. Flux is CNCF-graduated with a public roadmap and steady releases. If you ruled it out in 2024 over maintenance fears, re-evaluate. Flux 2.8 (GA February 2026) added native Helm v4 support with server-side apply — Helm-heavy shops should pay attention.&lt;/p&gt;

&lt;p&gt;The trade-off you accept: more assembly required. Multi-tenancy, dashboards-as-default, and the app-centric mental model that ArgoCD gives you out of the box are things you compose yourself with Flux.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Rancher Fleet — when the problem is cluster count, not GitOps
&lt;/h2&gt;

&lt;p&gt;Designed for managing configuration across very large cluster counts — dozens to hundreds. At 1–5 clusters, you're paying Fleet's complexity for a problem you don't have.&lt;/p&gt;

&lt;p&gt;Fleet is SUSE/Rancher's GitOps engine, designed from the start for managing configuration across very large numbers of clusters; its docs talk about managing up to a million clusters. It uses a hub-and-spoke model where a single management plane drives bundles of resources out to downstream clusters.&lt;/p&gt;

&lt;p&gt;Pick it if&lt;/p&gt;

&lt;p&gt;Your actual pain is ArgoCD's multi-cluster story — you're juggling dozens or hundreds of clusters (edge, retail, per-customer) and ApplicationSets plus cluster secrets are creaking. Fleet treats fleet-scale as the primary use case, not an extension.&lt;/p&gt;

&lt;p&gt;Skip it if&lt;/p&gt;

&lt;p&gt;You run one to five clusters. At that scale Fleet solves a problem you don't have, and you pay its complexity anyway.&lt;/p&gt;

&lt;p&gt;The trade-off you accept: a smaller ecosystem and community than either ArgoCD or Flux, and noticeable gravity toward the Rancher stack. Outside Rancher-managed environments you lose part of the value, and hiring engineers with Fleet experience is harder.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Harness CD / managed pipelines — make it someone else's pager
&lt;/h2&gt;

&lt;p&gt;Stop self-hosting CD entirely. Harness and similar platforms take upgrades, scaling, and RBAC wiring off your team — for a real cost, in both dollars and vendor coupling.&lt;/p&gt;

&lt;p&gt;The third answer to “ArgoCD alternative” is: stop self-hosting CD entirely. Harness CD offers a GitOps-as-a-service mode that actually runs ArgoCD under the hood, managed for you. Similar commercial platforms take the operational burden — upgrades, scaling, RBAC and SSO wiring, audit — off your team. Codefresh, long the other big name in managed GitOps, was acquired by Octopus Deploy in 2024; factor vendor trajectory into any evaluation there.&lt;/p&gt;

&lt;p&gt;Pick it if&lt;/p&gt;

&lt;p&gt;The problem was never ArgoCD's design. The problem is that operating ArgoCD became a part-time job nobody wanted: certificate rotations, controller scaling, upgrade regressions, SSO breakage. If your platform team is two people and CD tooling eats one of them, buying it back is rational.&lt;/p&gt;

&lt;p&gt;Skip it if&lt;/p&gt;

&lt;p&gt;The reason you chose GitOps was control and auditability of every change through Git. Managed platforms blur that line by design.&lt;/p&gt;

&lt;p&gt;The trade-off you accept: real money (these platforms price for enterprises), vendor coupling, and a less Git-purist workflow — pipelines and UI-driven configuration creep back in. Budget time for the procurement and security review cycle too.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Spinnaker — the heavyweight, for a narrow profile
&lt;/h2&gt;

&lt;p&gt;A deployment orchestrator built around pipelines and canary analysis — not GitOps in the reconciliation sense. Correct for enterprise multi-cloud; easy to regret adopting casually.&lt;/p&gt;

&lt;p&gt;Spinnaker predates the GitOps wave: it is a deployment orchestrator built around pipelines, multi-cloud targets, and sophisticated deployment strategies — automated canary analysis, staged rollouts, manual judgments. It is not GitOps in the reconciliation sense at all, which for some teams is exactly the point.&lt;/p&gt;

&lt;p&gt;Pick it if&lt;/p&gt;

&lt;p&gt;You're an enterprise with hard requirements for multi-cloud deployment pipelines, approval gates, and canary analysis as first-class citizens — and you have a platform team that can own a complex distributed system, because Spinnaker is one (a dozen-odd microservices to run and upgrade).&lt;/p&gt;

&lt;p&gt;Skip it if&lt;/p&gt;

&lt;p&gt;You're anything smaller than a platform organization. This is the easiest tool on this list to regret adopting casually.&lt;/p&gt;

&lt;p&gt;The trade-off you accept: Spinnaker is famously heavy to operate, and community momentum has visibly shifted toward GitOps-native tools over the last several years. Choosing it in 2026 means choosing against the current, which is fine if your requirements genuinely demand it.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Plain CI + &lt;code&gt;helm upgrade&lt;/code&gt; — the option listicles are embarrassed by
&lt;/h2&gt;

&lt;p&gt;Delete the CD layer and let CI run &lt;code&gt;helm upgrade --install&lt;/code&gt; directly. Underrated for small fleets where GitOps reconciliation solves a problem you don't have.&lt;/p&gt;

&lt;p&gt;Here is the alternative most posts omit, because there is nothing to affiliate-link: delete the CD layer and let your CI pipeline run &lt;code&gt;helm upgrade --install&lt;/code&gt; or &lt;code&gt;kubectl apply&lt;/code&gt; directly. Push-based deploys, straight from the pipeline that built the artifact.&lt;/p&gt;
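
&lt;p&gt;As a rough sketch of what that pipeline step can look like, the Python below assembles the &lt;code&gt;helm upgrade --install&lt;/code&gt; command for one release. The function names, the release, the chart path, and the &lt;code&gt;image.tag&lt;/code&gt; values key are illustrative assumptions, not taken from any particular setup; the flags used are standard Helm flags.&lt;/p&gt;

```python
import subprocess


def build_helm_command(release, chart, namespace, image_tag, timeout="5m"):
    """Return the argv list for a push-based deploy of one release.

    All values are hypothetical examples; `image.tag` assumes the chart
    exposes the image tag under that values key.
    """
    return [
        "helm", "upgrade", "--install", release, chart,
        "--namespace", namespace,
        "--create-namespace",      # create the namespace on first deploy
        "--set", f"image.tag={image_tag}",
        "--wait",                  # block until the release's resources are ready
        "--timeout", timeout,
    ]


def deploy(release, chart, namespace, image_tag):
    """Run the deploy; raises CalledProcessError if helm exits non-zero."""
    subprocess.run(build_helm_command(release, chart, namespace, image_tag), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without a cluster.
    print(" ".join(build_helm_command("api", "./charts/api", "staging", "abc1234")))
```

&lt;p&gt;Keeping the command construction separate from the call that executes it makes the deploy step easy to unit-test in CI without touching a cluster.&lt;/p&gt;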

&lt;p&gt;Pick it if&lt;/p&gt;

&lt;p&gt;You run a small fleet — a handful of services, a few environments — and you adopted ArgoCD because a conference talk said you should. GitOps reconciliation earns its complexity when many actors mutate cluster state and drift is a real threat. If deploys are the only thing that changes your clusters, a reconciliation loop is machinery without a purpose.&lt;/p&gt;

&lt;p&gt;Skip it if&lt;/p&gt;

&lt;p&gt;Multiple teams or operators touch the clusters, or compliance requires a continuously enforced desired state.&lt;/p&gt;

&lt;p&gt;The trade-off you accept: no drift detection, no self-healing, no declarative picture of “what should be running right now” outside your pipeline history. Cluster credentials live in CI, which widens your blast radius if the pipeline is compromised. You also give up the rollback ergonomics ArgoCD's history view provides.&lt;/p&gt;
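
&lt;p&gt;To make concrete what a reconciliation loop checks and a push-only pipeline does not, here is a toy drift check. It is a simplified illustration over plain dictionaries, not how ArgoCD or Flux implement reconciliation; the resource names and specs are made up.&lt;/p&gt;

```python
def detect_drift(desired, live):
    """Compare two {resource_name: spec} mappings and describe each difference.

    `desired` stands in for what Git declares, `live` for what the cluster
    actually runs. Both are hypothetical, flattened representations.
    """
    drift = {}
    for name in sorted(set(desired) | set(live)):
        if name not in live:
            drift[name] = "missing from cluster"
        elif name not in desired:
            drift[name] = "not declared in Git"
        elif desired[name] != live[name]:
            drift[name] = "spec differs"
    return drift


if __name__ == "__main__":
    desired = {"api": {"replicas": 3}, "worker": {"replicas": 2}}
    live = {"api": {"replicas": 5}, "debug-pod": {"replicas": 1}}
    for name, reason in detect_drift(desired, live).items():
        print(f"{name}: {reason}")
```

&lt;p&gt;With plain CI, nothing runs this comparison between deploys: a manual change to the cluster stays in place until the next pipeline run happens to overwrite it.&lt;/p&gt;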

&lt;h2&gt;
  
  
  Alternative #6: not needing GitOps at all
&lt;/h2&gt;

&lt;p&gt;If you're on AWS and your workloads are mostly stateless services, there's a version of your stack where the problem category that ArgoCD solves disappears entirely.&lt;/p&gt;

&lt;p&gt;ArgoCD exists to solve a Kubernetes-shaped problem: continuously reconciling a cluster's live state against Git. If you are on AWS and your workloads are mostly stateless services, there is a version of your stack where that entire problem category disappears.&lt;/p&gt;

&lt;p&gt;On ECS Fargate, a deploy is your CI pipeline updating a task definition. There is no cluster state to reconcile, no sync waves, no drift — because there is no node-level state to drift. You do not replace ArgoCD; you stop needing the thing ArgoCD does.&lt;/p&gt;
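
&lt;p&gt;A minimal sketch of that deploy step, assuming the task definition is handled as a plain dictionary in the shape ECS uses (&lt;code&gt;family&lt;/code&gt;, &lt;code&gt;containerDefinitions&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;image&lt;/code&gt;). The helper name, the family, and the registry URL are hypothetical, and the AWS calls themselves are deliberately left out.&lt;/p&gt;

```python
import copy


def with_new_image(task_definition, container_name, image):
    """Return a copy of a task definition with one container's image replaced.

    The input is left untouched so the previous revision stays available
    for comparison or rollback.
    """
    updated = copy.deepcopy(task_definition)
    for container in updated["containerDefinitions"]:
        if container["name"] == container_name:
            container["image"] = image
            return updated
    raise KeyError(f"no container named {container_name!r}")


if __name__ == "__main__":
    current = {
        "family": "api-staging",  # hypothetical task definition family
        "containerDefinitions": [
            {"name": "api", "image": "registry.example.com/api:old"},
        ],
    }
    updated = with_new_image(current, "api", "registry.example.com/api:abc1234")
    print(updated["containerDefinitions"][0]["image"])
```

&lt;p&gt;In a real pipeline the updated definition would then be registered as a new revision and the service pointed at it, for example with boto3's &lt;code&gt;register_task_definition&lt;/code&gt; and &lt;code&gt;update_service&lt;/code&gt;. A definition fetched with &lt;code&gt;describe_task_definition&lt;/code&gt; also carries read-only fields, such as the ARN and revision number, that have to be dropped before re-registering.&lt;/p&gt;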

&lt;p&gt;This is not the right move if your team runs Kubernetes well, or if you depend on operators and CRDs. It is worth a hard look if the pain is not ArgoCD — it is operating Kubernetes itself, on a small platform team, on AWS. We wrote up the full comparison: &lt;a href="https://dev.to/ecs-vs-eks/"&gt;ECS vs EKS — the cost and ops breakdown&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclosure:&lt;/strong&gt; this is the part where we are biased. Fortem is a control plane for teams running 10+ ECS Fargate environments — scheduling, cloning, fleet-wide visibility. If you do land on ECS, that is the tooling gap you will eventually hit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which one, then?
&lt;/h2&gt;

&lt;p&gt;Match the tool to the pain: Flux for platform weight, Fleet for cluster count, managed for ops burden, Spinnaker for enterprise pipelines, plain CI for small fleets, ECS if the pain is Kubernetes itself.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Your actual pain&lt;/th&gt;
&lt;th&gt;Best option&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ArgoCD's platform weight; you want primitives, not opinions&lt;/td&gt;
&lt;td&gt;Flux CD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitOps across dozens or hundreds of clusters&lt;/td&gt;
&lt;td&gt;Rancher Fleet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operating CD tooling eats your platform team&lt;/td&gt;
&lt;td&gt;Harness CD / managed GitOps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise pipelines, canary analysis, approval gates, multi-cloud&lt;/td&gt;
&lt;td&gt;Spinnaker&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Small fleet; GitOps reconciliation is solving a problem you don't have&lt;/td&gt;
&lt;td&gt;Plain CI + helm upgrade&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The pain is Kubernetes itself, and you're on AWS&lt;/td&gt;
&lt;td&gt;ECS Fargate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One closing heuristic: before replacing ArgoCD, write down the three incidents or frustrations that triggered the search. If all three mention ArgoCD specifically — sync behavior, UI, RBAC — pick from rows one through five. If they mention nodes, upgrades, or “why is our EKS bill like this,” your problem is one layer down, and no GitOps tool will fix it.&lt;/p&gt;

&lt;p&gt;Related questions&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/blog/ecs-fargate-best-practices/"&gt;How does ECS Fargate handle deployments without a GitOps layer?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/blog/aws-fargate-pricing-real-costs/"&gt;What does running 10+ ECS environments actually cost?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/ecs-vs-eks/"&gt;ECS vs EKS — which is right for your team?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;





</description>
      <category>aws</category>
      <category>ecs</category>
      <category>gitops</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
