ECS Service Discovery: Cloud Map vs Service Connect
Originally published at https://fortem.dev/blog/ecs-service-discovery-guide
Cloud Map, Service Connect, or an internal ALB? A practical decision framework for ECS Fargate teams, with the July 2025 blue/green unblock, real cost math, and a Terraform snippet.
TL;DR
- AWS docs say "We recommend Service Connect" for new ECS-to-ECS traffic — built-in retries, metrics, and mTLS without extra config.
- Blue/green + Service Connect shipped July 17, 2025. The main reason teams stayed on Cloud Map is gone.
- Cloud Map service discovery still wins for non-ECS callers (Lambda, EC2, on-prem), but it is capped at 1,000 tasks per service; Service Connect has no such limit.
- An internal ALB wins when you need L7 routing rules or a stable endpoint callable by anything outside ECS.
- Services not enrolled in Service Connect cannot resolve its short names — namespace membership is required on both sides.
Your first ECS service called its dependency by ALB DNS name. That worked fine. Now you have 15 services, some team is asking about mTLS, someone noticed the console keeps surfacing "Service Connect," and a cross-account architecture is on the roadmap. Here's what the three options actually do, where each breaks, and the rule for choosing.
What service discovery actually solves
Service discovery maps a logical name to healthy task IPs as tasks start, stop, and reschedule — without you touching DNS entries or hardcoding addresses.
ECS tasks on Fargate get ephemeral private IPs. The IP changes every time a task restarts or a new version deploys. If service A hardcodes the IP of service B, you get connection failures on every B deployment. An ALB solves this — the ALB DNS name stays stable — but you're paying ~$16–22/month per ALB and adding a network hop for every service-to-service call.
Service discovery handles three things automatically: registration (ECS adds a new task to the registry on launch), discovery (your app resolves a name to the current set of healthy IPs), and deregistration (ECS removes a task on shutdown). Your app just calls http://payments or http://payments.internal.example.com and gets a healthy endpoint back.
What it is not: a load balancer, an API gateway, or a circuit breaker. It finds services. Retry logic, traffic shaping, and rate limiting are separate concerns — though Service Connect handles the first two.
Option 1 — ECS Service Connect
Service Connect is AWS's recommended approach for ECS-to-ECS traffic. ECS injects an Envoy sidecar per task that handles routing, retries, metrics, and optional mTLS — your app just calls a short name.
"We recommend Service Connect, which provides Amazon ECS configuration for service discovery, connectivity, and traffic monitoring."
— AWS ECS networking best practices, verified June 2026
How it works. When you enable Service Connect on an ECS service, ECS automatically runs an Envoy proxy container inside each task. Services call each other by short name — http://payments instead of payments.internal.example.com. The proxy handles load balancing, retries on transient failures, and outlier detection.
DNS mechanism. Service Connect does not write to VPC DNS or Route 53. It manages its own namespace. Short names are only resolvable from inside tasks enrolled in the same namespace. A Lambda function, EC2 instance, or ECS service not in the namespace cannot resolve those names — they get a connection timeout instead of a useful error.
Cost. Service Connect itself is free, and the Cloud Map backend usage is free when used via Service Connect. You pay for the Envoy sidecar's compute: AWS recommends 0.1 vCPU and 128 MB per task. At On-Demand Fargate rates of $0.04048 per vCPU-hour and $0.004445 per GB-hour, that works out to about $3.36/month per task running 24/7 (730 hours). For a 20-service fleet with 3 tasks per service, that's roughly $200/month of sidecar overhead.
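The sidecar arithmetic can be worked through explicitly. This is a minimal sketch using Terraform locals as a calculator; the rates are assumed to be us-east-1 Linux/x86 On-Demand prices, 730 hours per month is an assumption, and every local name is illustrative. Evaluate it in its own scratch directory.

```hcl
# Sketch: monthly Fargate cost of the Service Connect sidecar.
# Assumed rates: us-east-1 Linux/x86 On-Demand. Check your Region's pricing.
locals {
  hours_per_month = 730      # assumption: average hours in a month
  vcpu_hour_usd   = 0.04048  # per vCPU-hour
  gb_hour_usd     = 0.004445 # per GB-hour

  sidecar_vcpu = 0.1
  sidecar_gb   = 128 / 1024 # 128 MB expressed in GB

  services          = 20
  tasks_per_service = 3

  # Per-task cost = hours * (CPU share + memory share)
  sidecar_per_task_usd = local.hours_per_month * (
    local.sidecar_vcpu * local.vcpu_hour_usd +
    local.sidecar_gb * local.gb_hour_usd
  )

  sidecar_fleet_usd = local.sidecar_per_task_usd * local.services * local.tasks_per_service
}

output "sidecar_per_task_usd" {
  value = format("%.2f", local.sidecar_per_task_usd) # "3.36"
}

output "sidecar_fleet_usd" {
  value = format("%.2f", local.sidecar_fleet_usd) # "201.64"
}
```

Fargate bills on the task's configured CPU and memory size, so the real increment depends on whether the sidecar pushes a task into the next larger size. Treat these figures as an upper-level estimate rather than a line item you will see on the bill.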
Blue/green deployments. As of July 17, 2025, ECS Native Blue/Green supports Service Connect. Test traffic routes to the green revision via the x-amzn-ecs-blue-green-test header. The previous blocker — CodeDeploy blue/green was incompatible with Service Connect — no longer applies if you migrate to ECS Native Blue/Green.
mTLS. Native support via AWS Private CA. Service Connect rotates TLS certificates every 5 days — roughly 6 certificates per service per month. Factor in Private CA cost if you enable this.
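As a rough sketch, enabling TLS in Terraform means adding a tls block inside the Service Connect service definition. Every variable below is a hypothetical input (the Private CA, KMS key, and IAM role are assumed to exist already), and launch type and network configuration are omitted for brevity.

```hcl
# Hypothetical inputs; none of these names come from the article.
variable "cluster_arn"         { type = string }
variable "task_definition_arn" { type = string }
variable "namespace_arn"       { type = string }
variable "private_ca_arn"      { type = string } # AWS Private CA that issues the certificates
variable "tls_kms_key_arn"     { type = string } # KMS key used to protect the private keys
variable "tls_role_arn"        { type = string } # IAM role ECS assumes to request certificates

resource "aws_ecs_service" "payments" {
  name            = "payments"
  cluster         = var.cluster_arn
  task_definition = var.task_definition_arn
  desired_count   = 2

  service_connect_configuration {
    enabled   = true
    namespace = var.namespace_arn

    service {
      port_name      = "http" # must match the named port in the task definition
      discovery_name = "payments"

      client_alias {
        dns_name = "payments"
        port     = 8080
      }

      # TLS between Service Connect proxies, with certificates from Private CA
      tls {
        kms_key  = var.tls_kms_key_arn
        role_arn = var.tls_role_arn

        issuer_cert_authority {
          aws_pca_authority_arn = var.private_ca_arn
        }
      }
    }
  }
}
```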
Known limitation
If you create some services, wait more than 5 hours, then add more services to the same cluster, the original services may not resolve the new ones via DNS until you redeploy them. This is a known proxy registration timing issue — not a showstopper, but something to handle in your deployment runbook.
Option 2 — Cloud Map service discovery
Cloud Map service discovery registers ECS tasks as Route 53 DNS A-records. Any VPC resource — Lambda, EC2, on-prem via Direct Connect — can resolve those names. No proxy overhead, no namespace enrollment requirement.
How it works. ECS registers each task's private IP with a Cloud Map service on launch and deregisters it on shutdown. Route 53 returns the current set of healthy task IPs for a DNS query like payments.internal.example.com. Your app connects directly to the task IP, with no proxy in the path. The trade-off is caching: after a task stops, clients can keep using its stale IP until the record's TTL expires, so a higher TTL means a longer window of failed connections.
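The registration flow described above could be wired up roughly like this in Terraform. It is a sketch under assumptions: the namespace name internal.example.com mirrors the example hostname, the variables are hypothetical inputs, the 10-second TTL is one possible choice for services with frequent task churn, and launch type and network configuration are omitted for brevity.

```hcl
# Hypothetical inputs; none of these names come from the article.
variable "vpc_id"              { type = string }
variable "cluster_arn"         { type = string }
variable "task_definition_arn" { type = string }

# Private DNS namespace: backed by a Route 53 private hosted zone in the VPC
resource "aws_service_discovery_private_dns_namespace" "main" {
  name = "internal.example.com"
  vpc  = var.vpc_id
}

# One Cloud Map service per ECS service; resolves as payments.internal.example.com
resource "aws_service_discovery_service" "payments" {
  name = "payments"

  dns_config {
    namespace_id   = aws_service_discovery_private_dns_namespace.main.id
    routing_policy = "MULTIVALUE"

    dns_records {
      type = "A"
      ttl  = 10 # seconds; a short TTL shrinks the stale-IP window
    }
  }
}

resource "aws_ecs_service" "payments" {
  name            = "payments"
  cluster         = var.cluster_arn
  task_definition = var.task_definition_arn
  desired_count   = 2

  # ECS registers and deregisters task IPs in this Cloud Map service
  service_registries {
    registry_arn = aws_service_discovery_service.payments.arn
  }
}
```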
Cost. Cloud Map charges $0.10 per registered resource per month, plus $1.00 per million HTTP API calls for non-DNS lookups. Route 53 adds $0.50/month per private hosted zone and $0.40/million DNS queries. For a fleet with 20 services × 3 tasks, that's ~$6/month in Cloud Map registry fees plus minimal Route 53 query costs. Compare that with Service Connect's sidecar overhead for the same fleet (about $3.36 per task per month at 0.1 vCPU and 128 MB, or roughly $200/month for 60 tasks): Cloud Map is cheaper at moderate scale.
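The same fleet's Cloud Map bill can be estimated with a few Terraform locals. This is a sketch only: the prices are the ones quoted above, while the query volume of 5 million DNS lookups per month is a made-up assumption to show how the query charge enters the total. Evaluate it in its own scratch directory.

```hcl
# Sketch: monthly Cloud Map + Route 53 cost for 20 services x 3 tasks.
locals {
  registered_tasks = 20 * 3

  registry_usd_per_task = 0.10 # per registered resource per month
  hosted_zone_usd       = 0.50 # one private hosted zone
  dns_usd_per_million   = 0.40

  dns_queries_millions = 5 # hypothetical monthly DNS query volume

  registry_usd = local.registered_tasks * local.registry_usd_per_task
  dns_usd      = local.dns_queries_millions * local.dns_usd_per_million

  cloud_map_total_usd = local.registry_usd + local.hosted_zone_usd + local.dns_usd
}

output "cloud_map_registry_usd" {
  value = format("%.2f", local.registry_usd) # "6.00"
}

output "cloud_map_total_usd" {
  value = format("%.2f", local.cloud_map_total_usd) # "8.50"
}
```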
1,000-task limit. AWS docs state: "Services configured to use service discovery have a limit of 1,000 tasks per service. This is due to a Route 53 service quota." If you run high-scale services — a job runner, a stateless API serving burst traffic — Cloud Map service discovery hits this ceiling. Service Connect does not have this limit.
Manual cleanup. AWS docs: "The AWS Cloud Map resources created when service discovery is used must be cleaned up manually." Delete an ECS service and its Cloud Map registration stays. Over time, stale records accumulate — you pay $0.10/month per orphaned registration. Add a cleanup step to your service teardown runbook.
DNS propagation. There is a window between when a task stops and when Route 53 TTL expires. If your DNS TTL is 30 seconds and you scale down rapidly, callers may get dropped connections. AWS's own migration blog documents dropped requests under load during fast scale-in. Set TTL to 10 seconds or less for services with frequent task churn, and implement retry logic in your clients.
When to choose Cloud Map: you have non-ECS callers; you need cross-account without setting up AWS RAM; or your existing Cloud Map setup already works and you don't need mTLS or built-in retries. In every case, each service has to stay under the 1,000-task limit.
For an ECS multi-environment strategy or a mixed-compute architecture, Cloud Map is the only option that gives every layer of the stack a single naming scheme.
Option 3 — Internal ALB
An internal Application Load Balancer gives any compute (Lambda, EC2, ECS, on-prem) a stable HTTP endpoint for an ECS service. The base charge is $16–22/month per ALB regardless of traffic, plus LCU charges that scale with usage.
The ALB handles health checks and connection draining automatically. When an ECS task stops, it is deregistered from the target group and in-flight requests are allowed to drain before the task terminates, which avoids most dropped connections. You also get L7 routing rules, such as path-based routing that sends /api/users/* and /api/orders/* to different ECS services behind a single ALB.
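Path-based routing on a shared internal ALB might look like the following in Terraform. This is a sketch assuming the listener and both target groups already exist; the variable names and rule priorities are hypothetical.

```hcl
# Hypothetical inputs; none of these names come from the article.
variable "listener_arn"            { type = string }
variable "users_target_group_arn"  { type = string }
variable "orders_target_group_arn" { type = string }

# /api/users/* goes to the users service
resource "aws_lb_listener_rule" "users" {
  listener_arn = var.listener_arn
  priority     = 10

  action {
    type             = "forward"
    target_group_arn = var.users_target_group_arn
  }

  condition {
    path_pattern {
      values = ["/api/users/*"]
    }
  }
}

# /api/orders/* goes to the orders service
resource "aws_lb_listener_rule" "orders" {
  listener_arn = var.listener_arn
  priority     = 20

  action {
    type             = "forward"
    target_group_arn = var.orders_target_group_arn
  }

  condition {
    path_pattern {
      values = ["/api/orders/*"]
    }
  }
}
```

Each ECS service then attaches to its own target group, and requests that match neither rule fall through to the listener's default action.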
When an internal ALB is the right answer: a Lambda function calls an ECS API and needs a stable HTTPS endpoint; you want path-based routing between services; or you need WebSocket support with sticky sessions across multiple ECS tasks.
The cost adds up fast. At ~$22/month per ALB, 10 internal ALBs (one per service) cost $220/month before any traffic. Sharing one ALB across services with path routing brings this to $22/month — but then you're managing routing rules as a shared resource.
Decision table
Use Service Connect for pure ECS-to-ECS traffic in a single Region. Use Cloud Map when non-ECS services are involved. Use an internal ALB when you need L7 routing or a stable endpoint callable by anything.
| Scenario | Service Connect | Cloud Map | Internal ALB |
|---|---|---|---|
| ECS → ECS, same cluster | ✓ Recommended | Works | Works (extra cost) |
| ECS → ECS, cross-cluster | ✓ (shared namespace) | ✓ | Works |
| Lambda → ECS | ✗ (can't resolve names) | ✓ | ✓ |
| EC2 → ECS | ✗ (not enrolled) | ✓ | ✓ |
| Cross-account (with RAM) | ✓ | ✓ | Works |
| Service with 1,000+ tasks | ✓ (no task limit) | ✗ (Route 53 quota) | ✓ |
| mTLS required | ✓ (native, via Private CA) | Manual | Manual |
| Blue/green deployments | ✓ (as of July 2025) | ✓ | ✓ |
KEY INSIGHT: If you're starting a new ECS service today and all its callers are also ECS services, choose Service Connect. If you have an existing Cloud Map setup that works and you don't need mTLS or built-in retries, leave it alone — the migration cost isn't worth it until you have a concrete reason to switch.
Migrating from Cloud Map to Service Connect
AWS publishes a migration guide. The short version: add service_connect_configuration to your ECS service, redeploy, then remove the service_registries block. One deploy cycle per service.
The catch: Service Connect and Cloud Map service discovery are not cross-compatible at the namespace level. A caller on Cloud Map cannot resolve Service Connect short names, and vice versa. Migrate the callee first, then migrate callers in the same deployment window — or keep the Cloud Map registration running in parallel during the cutover.
Here is the Terraform diff for a single service:
Ready to use — Terraform service_connect_configuration
```hcl
resource "aws_ecs_service" "payments" {
  name            = "payments"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.payments.arn
  desired_count   = 2

  # Remove this block when migrating to Service Connect:
  # service_registries {
  #   registry_arn = aws_service_discovery_service.payments.arn
  # }

  service_connect_configuration {
    enabled   = true
    namespace = aws_service_discovery_private_dns_namespace.main.arn

    service {
      port_name      = "http" # must match portName in task definition
      discovery_name = "payments"

      client_alias {
        dns_name = "payments"
        port     = 8080
      }
    }
  }
}

# Task definition: add portName to the port mapping
# "portMappings": [
#   {
#     "name": "http",
#     "containerPort": 8080,
#     "protocol": "tcp"
#   }
# ]
```
What changes after the deploy: each task gets an Envoy sidecar container. The service is reachable at http://payments:8080 from any other Service Connect service in the same namespace. Your VPC, subnets, and security groups don't change, and the only task definition change is the named port mapping.
One thing to watch: App Mesh reaches end-of-life on September 30, 2026. If you're still on App Mesh for ECS workloads, Service Connect is the replacement. (For EKS, AWS recommends VPC Lattice instead.)
If you read this, you might also want to know
How does Service Connect compare to AWS App Mesh?
App Mesh was the predecessor service mesh for ECS and EKS — it reaches end-of-life September 30, 2026. Service Connect is the ECS replacement: simpler to configure (no separate mesh/virtual service objects), fully managed Envoy injection, and integrated with ECS deployments. If you're on App Mesh, migrate to Service Connect for ECS workloads or VPC Lattice for EKS.
Can I run Service Connect and Cloud Map service discovery on the same ECS service?
No. A single ECS service can use either service_registries (Cloud Map) or service_connect_configuration (Service Connect), not both. During migration, you can run parallel ECS services — one on Cloud Map, one on Service Connect — but a single service object chooses one path.
What happens to Service Connect traffic when a task crashes mid-request?
The Envoy proxy in the calling task detects the failed connection and retries on a different task instance using the configured retry policy. Outlier detection marks unhealthy tasks and stops routing to them until they recover. This is the main reliability advantage over Cloud Map, where retry logic lives entirely in your application code.
Common questions
Does Service Connect work with blue/green deployments?
Yes, as of July 17, 2025. AWS shipped built-in blue/green deployments for ECS — including Service Connect. Test traffic routing uses the x-amzn-ecs-blue-green-test header. The previous blocker (CodeDeploy blue/green was incompatible with Service Connect) no longer applies with ECS Native Blue/Green.
Can I use ECS Service Connect across AWS accounts?
Yes, with AWS Resource Access Manager (RAM). Share the Cloud Map namespace across accounts and configure Service Connect in each account's ECS cluster to use the shared namespace. Without RAM, Cloud Map service discovery is simpler for cross-account: it uses Route 53 / VPC DNS, which you extend with VPC peering or Transit Gateway.
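Sharing the namespace through RAM could be sketched in Terraform as below, run in the account that owns the namespace. The variable names and the share name are hypothetical, and the consumer account still has to reference the shared namespace ARN in its own Service Connect configuration.

```hcl
# Hypothetical inputs; none of these names come from the article.
variable "namespace_arn"       { type = string } # Cloud Map namespace to share
variable "consumer_account_id" { type = string } # account that will use the namespace

resource "aws_ram_resource_share" "namespace" {
  name = "service-connect-namespace"

  # Set to true only if the consumer account is outside your AWS Organization
  allow_external_principals = false
}

# Put the namespace into the share
resource "aws_ram_resource_association" "namespace" {
  resource_share_arn = aws_ram_resource_share.namespace.arn
  resource_arn       = var.namespace_arn
}

# Grant the consumer account access to the share
resource "aws_ram_principal_association" "consumer" {
  resource_share_arn = aws_ram_resource_share.namespace.arn
  principal          = var.consumer_account_id
}
```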
How much does the Service Connect Envoy sidecar cost on Fargate?
Service Connect itself is free: there is no charge for the feature or the Cloud Map backend. You pay only for the Envoy proxy sidecar container's CPU and memory. AWS recommends 0.1 vCPU and 128 MB per task. At Fargate On-Demand rates ($0.04048/vCPU-hr, $0.004445/GB-hr), that's roughly $3.36/month per task running 24/7 ($2.955 for CPU plus $0.406 for memory over 730 hours).
Why can't my service resolve Service Connect names?
Service Connect manages its own namespace — it does NOT update VPC DNS. A service not enrolled in the same Service Connect namespace cannot resolve those short names. Fix: enroll the calling service in the same namespace, or switch the caller to Cloud Map service discovery (which uses Route 53 / VPC DNS and is accessible to any VPC resource).
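Enrolling a caller does not require it to publish an endpoint of its own. A minimal sketch of a client-only service, assuming a hypothetical checkout service and hypothetical variables for the cluster, task definition, and namespace (launch type and network configuration omitted for brevity):

```hcl
# Hypothetical inputs; none of these names come from the article.
variable "cluster_arn"                  { type = string }
variable "checkout_task_definition_arn" { type = string }
variable "namespace_arn"                { type = string } # same namespace as the callee

resource "aws_ecs_service" "checkout" {
  name            = "checkout"
  cluster         = var.cluster_arn
  task_definition = var.checkout_task_definition_arn
  desired_count   = 2

  service_connect_configuration {
    enabled   = true
    namespace = var.namespace_arn
    # No service block: this service only consumes short names such as
    # http://payments:8080 and does not publish one itself.
  }
}
```

After the next deployment, tasks in this service get the Envoy sidecar and can resolve the short names published in that namespace.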
What's the difference between Cloud Map and ECS service discovery?
They're the same thing. 'ECS service discovery' is the ECS feature that registers tasks into AWS Cloud Map namespaces using Route 53 DNS. Cloud Map is the underlying registry. Service Connect is a newer, higher-level abstraction built on top of Cloud Map — it adds an Envoy proxy per task for retries, metrics, and mTLS, without requiring your app to change its DNS lookup code.