DEV Community: Mouheb GHABRI

Debugging Hidden AWS Costs: From “Others” in Cost Explorer to a Real Root Cause

Mouheb GHABRI — Thu, 07 May 2026 15:59:53 +0000

I had an AWS account that was supposed to be almost empty.

Most of the resources had already been deleted. Some cleanup had been done manually, and some of it had been done through tools like Terraform. So the expectation was simple:

No active resources = no meaningful cost

But Cost Explorer was still showing small charges.

The amount was tiny, around a few fractions of a cent, but the problem was not the amount itself. The problem was that the charge was unclear.

In the AWS Cost Explorer UI, part of the cost appeared under vague categories like:

Others

That was not enough to understand what was happening.

The real question was:

Why is an account that looks empty still generating cost?

The actual problem

At first, the charge looked confusing because the UI did not clearly explain the source.

A small AWS bill can still hide important information:

- Is there an orphaned resource?
- Is AWS Config still recording resources?
- Is Control Tower recreating or managing something?
- Is CloudTrail charging money?
- Is a service making API calls in the background?
- Is there still a queue, log group, KMS key, bucket, or automation running?

So the goal was not to save $0.003.

The goal was to prove whether the cost was:

- Expected baseline activity
- Orphaned infrastructure
- Governance-related activity
- Service-generated API usage
- Unexpected user or automation activity
That is the real reason this investigation matters.

The mistake to avoid

It’s not a real mistake but sometimes the first instinct is often to start with CloudTrail.

That feels logical because CloudTrail shows activity and API calls, But CloudTrail does not answer the first billing question.

CloudTrail can tell you:

- Who did something?
- When did it happen?
- Which API was called?
- Which service was involved?

But CloudTrail does not directly tell you:

Which AWS billing line charged money?
So the better approach is to do not start with CloudTrail but start with Cost Explorer API because it tells you what was actually billed, while CloudTrail only tells you what activity happened.

Why Cost Explorer API and not Cost Expolorer console ?, because Cost Explorer API because it lets you query the exact billing dimensions directly, while the Console is better for visual overview but can hide small charges inside grouped categories.

What I did first

I isolated one exact billing day.

For example:

START_DATE="2026-05-06"
END_DATE="2026-05-07"

This matters because AWS Cost Explorer uses an exclusive end date.

So this range means only May 6, 2026. This avoids mixing multiple days together and makes the investigation cleaner.

Then I asked Cost Explorer the right question

Instead of relying on the UI, I queried Cost Explorer by:

- Service
- Usage Type
- Cost
The key command was:

aws ce get-cost-and-usage \
  --time-period Start=$START_DATE,End=$END_DATE \
  --granularity DAILY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=SERVICE Type=DIMENSION,Key=USAGE_TYPE \
  --output json | jq -r '
    .ResultsByTime[0].Groups[]
    | select((.Metrics.UnblendedCost.Amount | tonumber) > 0)
    | [
        .Keys[0],
        .Keys[1],
        .Metrics.UnblendedCost.Amount,
        .Metrics.UnblendedCost.Unit
      ]
    | @tsv
  ' | column -t -s "$(printf '\t')"

This changed the investigation.

Instead of seeing only vague UI categories, the API showed the real billing lines.

In this case, the important result was:

AWS Config  ConfigurationItemRecorded       0.003  USD
AWS Config  EUW2-ConfigurationItemRecorded  0.003  USD

That immediately narrowed the problem. The charge was not random.

It was not just “Others.” It was AWS Config.

What the usage type told me

The usage type was: ConfigurationItemRecorded

That means AWS Config recorded a configuration item.

In simple terms:

AWS Config saw a resource or configuration state and recorded it.
That recording generated a billable configuration item.

The second usage type included a region prefix: EUW2-ConfigurationItemRecorded

That pointed to: eu-west-2

So now the investigation had a real chain:

- Service = AWS Config
- Usage type = ConfigurationItemRecorded
- Region = eu-west-2
- Cost = about 0.006 USD total
This was already much better than the original Cost Explorer UI view.

Then I checked the region

To confirm where the cost happened, I grouped Cost Explorer by region and service:

aws ce get-cost-and-usage \
  --time-period Start=$START_DATE,End=$END_DATE \
  --granularity DAILY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=REGION Type=DIMENSION,Key=SERVICE \
  --output json | jq -r '
    .ResultsByTime[0].Groups[]
    | select((.Metrics.UnblendedCost.Amount | tonumber) > 0)
    | [
        (.Keys[0] // "NoRegion"),
        .Keys[1],
        .Metrics.UnblendedCost.Amount,
        .Metrics.UnblendedCost.Unit
      ]
    | @tsv
  ' | column -t -s "$(printf '\t')"

Then I grouped by region and usage type:

aws ce get-cost-and-usage \
  --time-period Start=$START_DATE,End=$END_DATE \
  --granularity DAILY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=REGION Type=DIMENSION,Key=USAGE_TYPE \
  --output json | jq -r '
    .ResultsByTime[0].Groups[]
    | select((.Metrics.UnblendedCost.Amount | tonumber) > 0)
    | [
        (.Keys[0] // "NoRegion"),
        .Keys[1],
        .Metrics.UnblendedCost.Amount,
        .Metrics.UnblendedCost.Unit
      ]
    | @tsv
  ' | column -t -s "$(printf '\t')"

This step is important because some AWS usage types include region prefixes, and others do not.

At this point, the direction was clear: Investigate AWS Config in eu-west-2

Then I checked AWS Config itself

Since Cost Explorer showed AWS Config, the next step was to inspect AWS Config directly.

Write on Medium
I checked whether a configuration recorder existed:

REGION="eu-west-2"

aws configservice describe-configuration-recorders \
  --region $REGION \
  --query 'ConfigurationRecorders[*].{Name:name,RoleARN:roleARN,AllSupported:recordingGroup.allSupported,IncludeGlobal:recordingGroup.includeGlobalResourceTypes}' \
  --output table

Then I checked whether it was recording:

aws configservice describe-configuration-recorder-status \
  --region $REGION \
  --query 'ConfigurationRecordersStatus[*].{Name:name,Recording:recording,LastStatus:lastStatus,LastStatusChange:lastStatusChangeTime}' \
  --output table

Then I checked discovered resource counts:

aws configservice get-discovered-resource-counts \
  --region $REGION \
  --output table

And then listed discovered resources:

{
  printf "RESOURCE_TYPE\tRESOURCE_ID\tRESOURCE_NAME\n"

  for type in $(aws configservice get-discovered-resource-counts \
    --region $REGION \
    --query 'resourceCounts[*].resourceType' \
    --output text); do

    aws configservice list-discovered-resources \
      --region $REGION \
      --resource-type "$type" \
      --output json | jq -r '
        .resourceIdentifiers[]
        | [
            .resourceType,
            .resourceId,
            (.resourceName // "-")
          ]
        | @tsv
      '

  done
} | column -t -s "$(printf '\t')"

This step answers:

Is AWS Config enabled?
Is it recording?
What resources does it still know about?
Is this coming from a managed baseline?

Then I used CloudTrail, but only after Cost Explorer

After identifying AWS Config as the billing service, CloudTrail became useful.

The purpose of CloudTrail was not to find the cost. The purpose was to correlate activity.

I checked events in the same region and same day:

aws cloudtrail lookup-events \
  --region $REGION \
  --start-time "${START_DATE}T00:00:00Z" \
  --end-time "${END_DATE}T00:00:00Z" \
  --query 'sort_by(Events,&EventTime)[*].[EventTime,Username,EventName,EventSource]' \
  --output table

Then I filtered for AWS Config activity:

aws cloudtrail lookup-events \
  --region $REGION \
  --lookup-attributes AttributeKey=EventSource,AttributeValue=config.amazonaws.com \
  --start-time "${START_DATE}T00:00:00Z" \
  --end-time "${END_DATE}T00:00:00Z" \
  --query 'sort_by(Events,&EventTime)[*].[EventTime,Username,EventName,EventSource]' \
  --output table

Then I checked the actor details:

aws cloudtrail lookup-events \
  --region $REGION \
  --start-time "${START_DATE}T00:00:00Z" \
  --end-time "${END_DATE}T00:00:00Z" \
  --output json | jq -r '
    .Events[]
    | .CloudTrailEvent
    | fromjson
    | [
        .eventTime,
        .eventSource,
        .eventName,
        .userIdentity.type,
        (.userIdentity.arn // "-"),
        (.userAgent // "-"),
        (.sourceIPAddress // "-")
      ]
    | @tsv
  ' | column -t -s "$(printf '\t')"

This helped separate possible sources:

- Human user
- AWSReservedSSO role
- AWS service
- Control Tower execution role
- CloudFormation StackSet
- Scheduled automation
- Application activity

What I discovered

The investigation showed that the paid service was AWS Config.

CloudTrail itself was not the source of the charge, this was an important correction.

The notes explicitly state that AWS Config caused the charge, while CloudTrail showed FreeEventsRecorded = 0 USD. So CloudTrail helped investigate the activity, but it was not the paid service.

The root cause chain became:

Cost Explorer UI showed unclear “Others”
→ Cost Explorer API exposed AWS Config charges
→ Usage type showed ConfigurationItemRecorded
→ Region prefix EUW2 pointed to eu-west-2
→ AWS Config showed discovered resources / recorder activity
→ CloudTrail helped correlate activity
→ Conclusion: AWS Config recording caused the cost

The final conclusion

The small charge was not random.

It came from: AWS Config recording configuration items

The likely reason was: Control Tower-managed baseline governance

So the account was not necessarily “dirty” with normal application resources. Instead, it still had governance/baseline services active.

That is a different type of problem.

The solution is not always to delete a bucket, queue, or instance. The solution may be to understand whether the account is still governed by Control Tower and whether AWS Config is intentionally managed as part of that baseline.

Why this investigation was worth doing
For a one-time charge of $0.003, spending a lot of time would not make sense.

But for a repeated tiny charge in an account that should be clean, it is worth investigating, not because of the money.

Because it answers:

- Is something still running?
- Is the account still governed?
- Is cleanup incomplete?
- Is a service being recreated?
- Is an automation still active?
- Is the billing source expected or unexpected?
In this case, the investigation was useful because it corrected a wrong assumption.

→ The charge was not CloudTrail.

→ The charge was AWS Config.

The reusable debugging model

The final model is:

Cost Explorer API = what was billed
Usage Type = why it was billed
Region grouping = where it was billed
Service API = what exists
CloudTrail = who or what acted
RCA = final explanation and mitigation This makes the process reusable for other services too.

For example:

KMS-Requests
→ Check KMS keys and CloudTrail KMS events

TimedStorage-ByteHrs
→ Check CloudWatch log groups and retention

Requests-Tier8
→ Check the related service API and CloudTrail events

DataTransfer-Out-Bytes
→ Check VPC, NAT Gateway, S3, CloudFront, ELB, or other traffic sources

The root cause format

A clean RCA should look like this:

Root cause:
AWS Config generated cost because ConfigurationItemRecorded occurred in eu-west-2 on May 6, 2026.
Evidence:

Cost Explorer showed: AWS Config / ConfigurationItemRecorded / 0.003 USD AWS Config / EUW2-ConfigurationItemRecorded / 0.003 USD

Region grouping showed: eu-west-2

AWS Config API showed: Configuration recorder and discovered resource state

CloudTrail showed: Related activity and actor context Impact: About 0.006 USD on May 6, 2026. Conclusion: The charge was expected if the account is still managed by Control Tower baseline governance. Mitigation: Leave it if Control Tower governance is required. If the account is being retired, remove it properly from Control Tower governance. Do not manually delete Control Tower-managed resources unless the governance impact is understood. ## Final takeaway

The issue was not the tiny cost.

The issue was lack of attribution.

The plan solved that by turning an unclear billing symptom into a technical explanation:

AWS was not charging randomly.
Cost Explorer UI was only unclear.
The API exposed AWS Config as the paid service.
The usage type explained the billing action.
The region pointed to eu-west-2.
CloudTrail helped confirm activity, but was not the cost source. That is why this debugging chain is useful.

Conculsion

This investigation showed that the real problem was not the small amount of money, but the lack of clear attribution. Cost Explorer Console gave a vague view, while the Cost Explorer API exposed the exact billing lines: the charge came from AWS Config recording configuration items, not from CloudTrail. CloudTrail was useful only after the billing source was identified, because it helped correlate activity and actors. The correct debugging chain is therefore: start with Cost Explorer API to find what was billed, use usage type and region to understand why and where it was billed, then use service APIs and CloudTrail to explain the technical root cause.

Control vs Responsibility: Choosing the Right Cloud Operating Model

Mouheb GHABRI — Thu, 26 Mar 2026 21:39:10 +0000

Cloud architecture is not a maturity ladder. It is a series of trade-offs between control and operational responsibility.

That distinction matters because many teams still frame the conversation the wrong way. They debate on-premises versus cloud as if one is inherently more advanced than the other. In practice, the better question is simpler: where does control create real business or technical value, and where is it just another system your team now has to operate?

The most expensive architecture decision is often the one made for comfort. Teams keep control they do not need, or hand off control they later realize was essential.

On-Prem: maximum control, maximum burden
On-premises still has a place, but only when the reasons are concrete and durable. If you own the hardware, network, and security boundaries end to end, you can optimize for determinism, isolation, and performance in ways public cloud often cannot match.

That control comes with a price: capacity planning, patching, hardware lifecycle management, resilience engineering, physical security, and slower adaptation when demand changes.

Where it still makes sense:

Strict regulatory or sovereignty requirements that require physical control.
Specialized hardware environments such as HPC, ultra-low-latency systems, or tightly coupled industrial workloads.
Workloads where stable, predictable performance matters more than elasticity.

A common mistake is keeping workloads on-prem because they are “critical.” Critical systems are not automatically better on hardware you manage yourself. In many cases, they are simply harder to modernize once they stay there too long.

IaaS - Infrastructure as a service: cloud infrastructure with familiar responsibilities
IaaS is where many organizations start because it feels familiar. You remove the burden of owning data centers, but you still manage operating systems, network design, patching strategy, runtime behavior, and much of the security model.

That makes IaaS a valid choice, not a temporary embarrassment. It works well when you need control at the OS or network layer, when refactoring is not yet justified, or when legacy workloads have to move without major redesign.

Where it fits well:

Lift-and-shift migrations with limited appetite for application change.
Workloads that require OS-level agents, custom networking, or specific runtime behavior.
Transitional architectures where modernization will happen in phases, not all at once.

The misconception is that moving to IaaS means you are “cloud-native.” Usually it means you changed the hosting model, not the operating model. If your team still builds brittle VMs, scales manually, and treats instances as pets, you have imported old problems into a new billing model.

PaaS - Platform as a service: Less control, more engineering leverage
PaaS starts to pay off when a team is willing to stop treating infrastructure customization as a default requirement. The platform takes over more of the operating stack, and in return you gain standardization, faster delivery, and less operational drag.

This is often where engineering productivity improves materially, especially for product teams that need to ship features rather than maintain bespoke environments.

Where PaaS works best:

Internal or external APIs with predictable deployment patterns.
Business applications that benefit from standard runtimes and managed services.
Teams that want faster delivery cycles without building a platform team too early.

The trade-off is straightforward: platform constraints are real. If your application architecture depends on deep OS tuning, unusual middleware behavior, or custom deployment assumptions, PaaS will feel restrictive.

A common mistake is adopting PaaS while still trying to preserve every legacy operational habit. If you spend your time fighting the platform, you are paying for abstraction without benefiting from it.

FaaS - Function as a service: Operational simplicity, architectural discipline
FaaS is useful when the problem is event-driven and the team understands the implications. It is not “no servers.” It is giving the platform responsibility for provisioning, scaling, and availability at the execution layer.

Used well, FaaS can be extremely effective:

Event processing from queues, streams, object storage, or scheduled triggers.
Spiky or unpredictable workloads where idle capacity would otherwise be wasteful.
Short-lived tasks such as enrichment, transformation, automation, and integration logic.

But serverless is not a shortcut around architecture. It demands better architecture. You need to think carefully about cold starts, concurrency limits, idempotency, retries, observability, timeouts, and downstream bottlenecks.

The biggest misconception is treating FaaS as a universal replacement for services. It is excellent for discrete event-driven work. It is often a poor fit for long-running processes, tightly coupled workflows, or systems that require consistent low-latency behavior under all conditions.

SaaS - Software as a service: outsource the undifferentiated
SaaS is the right answer more often than engineers like to admit. If a capability is not part of your competitive advantage, building and running it yourself is usually a distraction.

That is especially true for areas like collaboration, CRM, HR, ticketing, and many internal business workflows. The value comes from consuming the outcome, not rebuilding the machinery behind it.

Where SaaS is the obvious choice:

Commodity business capabilities that do not justify custom engineering.

Organizations that need to scale operations quickly without growing platform overhead.

Teams that want to focus engineering effort on differentiated products instead of internal tooling.

The common mistake is rejecting SaaS because it limits customization. That is often the point. Many organizations do not need more flexibility; they need fewer one-off processes and less operational sprawl.

What experienced teams understand The real world is rarely pure

A mature architecture usually combines several models at once:

Core systems may remain on-prem or on IaaS because control is genuinely required.
Product services often run on PaaS because delivery speed matters more than low-level tuning.
Event pipelines are strong candidates for FaaS.
Commodity capabilities should usually be SaaS unless there is a compelling reason otherwise.

The mistake is assuming every workload should move in the same direction. It should not. Architecture decisions should follow workload characteristics, team capability, compliance needs, latency requirements, and the cost of operating each layer over time.

My view is simple: control is not an achievement. It is a liability unless it solves a real problem. Abstraction is not automatically progress either. If you give up the wrong control, you can create constraints that are just as expensive as the complexity you were trying to avoid.

The best architectures are not built around ideology. They are built around deliberate choices about where to own complexity and where to buy it.

AWS Organizations + IAM Identity Center: The ‘multi-account + access’ combo I use everywhere

Mouheb GHABRI — Sat, 14 Feb 2026 17:37:37 +0000

If you’re running more than a couple AWS accounts, two services show up again and again:

AWS Organizations: how you structure accounts and apply org-wide governance.

AWS IAM Identity Center (formerly AWS SSO): how humans sign in once and get the right access across those accounts.

Together, they’re the backbone of a clean multi-account setup: governance on top, access in the middle, workloads underneath.

Why multi-account is the default

One AWS account feels simple… until you have:

multiple environments (dev / staging / prod)
multiple teams
different compliance boundaries
and a real need for blast-radius control

A multi-account strategy gives you isolation by default, and AWS Organizations gives you the control plane to manage it all.

AWS Organizations:

Think of AWS Organizations as your “org chart” for AWS.

You get:

A management account (the top-level admin account for the org)
Member accounts (where workloads live)
Organizational Units (OUs) (folders to group accounts like Prod, Dev, Security, Sandbox)

Consolidated billing (aka “one bill, many accounts, also you can benefit from savings”)

Organizations lets you consolidate billing so you can:
receive a single bill
track costs per account
benefit from pooled usage/discount sharing across accounts (depending on the discount model)

This is where governance meets FinOps: you keep accounts separated, but you don’t fragment your billing.

“All features” mode
Organizations has an “all features” mode that unlocks deeper governance controls. Most real-world orgs end up enabling it because it’s required for advanced policies like SCPs.

The real power: Service Control Policies (SCPs)

SCPs are guardrails.
A simple mental model:

IAM policies in an account say “what can this principal do?”
SCPs say “what is this account (or OU) allowed to do at all?”

So even if someone has AdministratorAccess in an account, an SCP can still block certain actions (like disabling CloudTrail, changing regions, or deleting security tooling).

Important nuance: SCPs are not permission grants. They only limit.

Beyond SCPs: other org policies you should know

Organizations supports additional policy types that help standardize at scale, like:

Tag policies (enforce tag keys/values)
Backup policies (baseline backup rules)
AI services opt-out policies (control data usage for certain AI services)

If you’re building a platform, these are your “consistency levers”.

Permission boundaries (the missing middle layer)

SCPs are great for account/OU-wide guardrails, but they apply to everyone in that scope.

When you need guardrails for specific IAM roles/users (especially roles that can create other roles/policies), use IAM permission boundaries.

A permission boundary is a managed policy that sets the maximum permissions an IAM user or role can ever have—no matter what identity-based policies you attach later.

Effective permissions become the intersection of: identity-based policy ∩ permission boundary ∩ SCP (and an explicit deny anywhere wins).

Where it shines:

Delegating IAM to teams safely (e.g., allow developers to create roles, but only within an approved boundary).
Preventing privilege escalation in “self-service” environments.

IAM Identity Center: how humans access all those accounts

IAM Identity Center is what you use when you want:

one place for users to sign in
centralized assignments across accounts
consistent role patterns without hand-crafting IAM roles everywhere

Permission sets = reusable access templates

Permission sets are the core concept.

They’re basically “role recipes” you can stamp across accounts:

ReadOnly
PowerUser
Billing
SecurityAudit
PlatformAdmin (careful with this one)

When you assign a permission set to a user/group for an account, Identity Center provisions the corresponding roles in that account and keeps them aligned when you update the permission set.
Identity sources: where your users live

You can plug Identity Center into:

its built-in directory
Active Directory
an external identity provider (often SAML for SSO, SCIM for provisioning)

Pick what matches how your company already manages identities.

How Organizations + Identity Center fit together

This is the pattern I see most:

Use Organizations to create/organize accounts into OUs (Prod, NonProd, Security, SharedServices).
Apply SCP guardrails at the OU level (strong in Prod, relaxed in Sandbox).
Use Identity Center permission sets to standardize human access.
Assign access by groups, not individuals (e.g., Developers, SRE, Security, Finance).
Optionally use delegated administration so a non-management account can administer Identity Center day-to-day (common in a dedicated “Identity” or “Shared Services” account).

That’s how you keep the management account locked down while still enabling operational ownership.
A practical baseline (what I’d implement first)

If I’m starting from scratch, I usually do this in order:

Create an Organization + enable all features.
Design OUs: Security, Infrastructure/Shared, Prod, NonProd, Sandbox.
Add SCPs:
- deny leaving the org (optional but common)
- protect logging/security services
- restrict regions if required
Enable IAM Identity Center.
Create permission sets:
- ReadOnly
- DeveloperPowerUser (no IAM)
- OpsAdmin (tightly controlled)
- SecurityAudit
Assign by groups to the right accounts/OUs.
Common mistakes (that hurt later)
Putting workloads in the management account.
Treating SCPs like IAM (they don’t grant permissions).
Giving broad permission sets and trying to “fix it later”.
Assigning access directly to users instead of groups.
No OU strategy → you end up with “policy spaghetti”.

Your turn

How do you split your accounts?
One account per app?
One per environment?
One per team?
Or a hybrid (my usual choice)?

Mastering Deployment Strategies on AWS: Big Bang, Rolling, Blue-Green, and Canary Explained

Mouheb GHABRI — Fri, 24 Oct 2025 10:42:56 +0000

Modern cloud applications are rarely static. They evolve continuously, new features, patches, infrastructure improvements. all require deployments that are safe, repeatable, and ideally, seamless. Choosing the right deployment strategy is essential to minimize downtime, reduce risk, and maintain user trust.

AWS provides powerful tools to implement various _deployment _approaches, from simple, all-at-once updates to advanced traffic-shifting releases. In this post, we’ll break down four common strategies, Big Bang, Rolling, Blue-Green, and Canary — and explore how each can be applied in AWS environments.

Big Bang Deployment

A single, all-at-once release where the old system is taken down and the new version is brought up.

How it works: Stop the old system, deploy everything, start the new system.
When to use: Small systems, low complexity, or when downtime is acceptable and coordination is straightforward. -Pros: Simple cutover, no need to maintain two versions in parallel.
Cons: Requires downtime. High blast radius if something goes wrong. Demands a solid rollback plan.
Example: A tightly coupled database schema migration that touches many services at once.

AWS Practical Example:
In AWS, a Big Bang deployment might involve updating all EC2 instances or ECS tasks simultaneously. You could stop your existing EC2 instances, deploy a new AMI with the updated application, and restart the environment. Similarly, a full CloudFormation stack update could replace all resources at once. It’s straightforward but can cause downtime while the new environment initializes.

Rolling Deployment

Gradually replace instances of the old version with the new version across your fleet.

How it works:
Update a subset of servers or pods at a time, wait for health checks and metrics, then continue.
When to use:
Horizontal fleets where instances are interchangeable.
Pros:
Minimal or no downtime. Limits the impact of defects. Easy to pause or roll back mid-rollout.
Cons: Longer rollout time. Mixed versions run concurrently, which can expose compatibility issues.
Example:
In a 10-server pool, drain one server, deploy the new version, validate, then proceed server by server.

AWS Practical Example:
In AWS, this can be achieved using an Auto Scaling Group (ASG) with an instance refresh. The ASG replaces instances gradually using a new AMI, verifying each one through load balancer health checks. In ECS, the service scheduler manages rolling updates automatically, it spins up new tasks with the updated container image and drains the old ones as they become healthy. In EKS, Kubernetes deployments natively handle this through progressive pod replacement.

Blue-Green Deployment

Run two identical production environments, “blue” and “green,” and switch traffic between them.

How it works: Keep one environment live while you deploy and validate the new version on the idle one. Flip traffic when ready.
When to use: You need near-zero downtime and fast, safe rollback.
Pros: Near-zero downtime cutover. Instant rollback by switching traffic back.
Cons: Doubles infrastructure costs while both are running. Requires data and config parity.
Example: Blue serves users; deploy to Green, run smoke and integration tests, then route traffic to Green. Blue becomes the fallback.

AWS Practical Example:
On AWS, this can be implemented using an Application Load Balancer (ALB) with two target groups, one for Blue and one for Green. You deploy the new version to the Green environment, run automated health checks, and once validated, switch the ALB’s routing to the Green target group. If any issue is detected, you can immediately redirect traffic back to Blue. CodeDeploy, ECS, and Lambda all support native Blue-Green deployment modes to make this process safer and automated.

Canary Deployment

Release to a small, representative subset of users or infrastructure first, then ramp up.

How it works: Start with a small percentage of traffic, a single region, or a small instance group. Monitor real-world signals, then increase exposure.
When to use: You want real user telemetry and progressive confidence before full rollout.
Pros: Early detection of regressions with limited impact. Flexible, data-driven ramp-up.
Cons: Requires robust observability and routing controls. Version skew can add complexity.
Example: Ship to 1% of users in one region, validate error rates and latency, then step up to 5%, 25%, and 100%.

AWS Practical Example:
In AWS, you can use traffic shifting to achieve canary deployments. For Lambda, CodeDeploy supports automatic canary rollouts by gradually increasing the percentage of traffic going to the new version over time. For ECS or EKS, canaries can be managed through an ALB or Route 53 weighted routing policies, where only a small portion of requests initially hit the new version. As monitoring through CloudWatch and X-Ray confirms stability, traffic is progressively increased until full rollout.

Choosing the Right Strategy

Each deployment strategy has its strengths and trade-offs, and the best choice depends on your system’s complexity, tolerance for downtime, and risk appetite.

Big Bang deployments are best suited for simple systems or infrequent releases where downtime is acceptable. They have high downtime, are difficult to roll back, but are low-cost since only one environment is maintained.

Rolling deployments work well for scalable fleets and web applications. They keep downtime low by updating instances in batches. Rollbacks are moderately complex but manageable, and costs remain low since no duplicate environments are needed.

Blue-Green deployments are ideal for mission-critical applications where near-zero downtime and safe rollbacks are required. They make it easy to revert by switching traffic back to the previous environment, though they temporarily increase costs by running two environments in parallel.

Canary deployments shine in continuous delivery pipelines and high-traffic systems. They introduce minimal downtime, allow easy rollback, and provide a controlled way to monitor new releases in production. Their cost impact is moderate, as they may temporarily run multiple versions while gradually shifting traffic.

Ultimately, Rolling deployments provide a balanced, low-risk starting point for most teams, while Blue-Green and Canary strategies deliver advanced safety for production-critical workloads.

Deployment strategy isn’t just about pushing code, it’s about managing risk, user experience, and reliability. AWS gives you the flexibility to implement whichever approach best fits your team’s needs. The real mastery lies in understanding when to use each one, and how to automate and monitor it effectively.

[Boost]

Mouheb GHABRI — Sun, 20 Apr 2025 22:01:13 +0000