Executive Summary
TL;DR: Unidentified, costly "zombie microservices" (metaphorically, "PPC campaigns" burning cash) often consume significant cloud resources with unknown dependencies, leading to high bills and fear of shutdown. Safely decommission these services using methods like gradual resource reduction ("Strangle and Observe"), thorough dependency mapping ("Archaeological Dig"), or a controlled, reversible "Scream Test" during low-traffic periods.
Key Takeaways
- The "Strangle and Observe" method involves cautiously reducing a service's allocated resources (e.g., scaling down EC2 instances or decreasing cron frequency) to identify hidden dependencies through monitoring system reactions and alerts, thereby minimizing immediate risk.
- The "Archaeological Dig" is the comprehensive approach, requiring the use of observability tools (e.g., DataDog, VPC flow logs) to map ingress/egress traffic and business functions, culminating in a formal deprecation plan and the removal of the associated infrastructure-as-code.
- For services with zero documentation or logs, a controlled "Scream Test" can be executed, first in staging and then in production during a low-traffic window with a rollback plan at the ready, to identify critical dependencies by observing which systems fail.
Struggling with "zombie" services and legacy processes racking up your cloud bill? Learn when and how to safely decommission infrastructure without causing a production outage.
My "PPC Campaign" is a Zombie Microservice: When to Pull the Plug
I remember staring at the monthly cloud bill. It was a five-figure number that made my stomach turn, and one line item stood out: a fleet of massive EC2 instances under a service named "DataAggregator-PROD". They were costing us nearly $4,000 a month, just humming along. I asked around. The new product manager had never heard of it. The junior devs thought it was "some legacy thing we don't touch." It was a ghost in the machine, a technical "PPC campaign" burning cash with zero measurable ROI. The problem? No one knew for sure what would happen if we turned it off. This is a story I've seen play out at nearly every company I've worked for.
The "Why": How We Create These Digital Ghosts
This isn't about blaming people. It's a natural consequence of growth, changing priorities, and team turnover. A project that was critical two years ago gets superseded. The original developers move on. The documentation, if it ever existed, is now a dead link in a forgotten Confluence space. We end up with these zombie services for a few key reasons:
- Fear of the Unknown: The primary reason. "What if this service quietly powers the checkout page and we cause a million-dollar outage?" It's easier to keep paying the bill than to risk being the one who broke production.
- Lack of Ownership: When a service belongs to everyone, it belongs to no one. Without a clear owner responsible for its lifecycle, it's destined to become technical debt.
- Poor Observability: If you can't easily see what's calling a service and what that service is calling, you're flying blind. You can't confidently decommission something you can't fully understand.
So you're stuck with this expensive, mysterious process. You know it's probably useless, but the risk of shutting it down feels too high. Let's walk through how we, in the trenches, actually solve this.
The Fixes: From Cautious Tweak to Calculated Gamble
1. The Quick Fix: The "Strangle and Observe" Method
This is my go-to first step when the political capital or time for a full investigation is low. It's a bit hacky, but it's effective. You don't kill the service; you starve it. The goal is to make it cheap and see who screams.
If it's an auto-scaling group of servers, scale the desired/min/max count down to one, on the smallest instance type possible. If it's a data pipeline, change its cron schedule from every hour to once a day at 3 AM. The service is still "running," which satisfies the nervous stakeholders, but your costs plummet. Now, you watch your monitoring dashboards like a hawk. Look for new error spikes in upstream or downstream services, check support ticket queues, and listen for whispers of "Hey, is the XYZ report running slow?"
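On AWS, the "starve it" step can be a couple of small Terraform changes. This is a rough sketch, not the author's actual setup: the resource names, old values, and instance types are all hypothetical.

```hcl
# Starve, don't kill: pin the fleet at a single small instance.
resource "aws_autoscaling_group" "data_aggregator" {
  name             = "DataAggregator-PROD"
  min_size         = 1 # was 4
  max_size         = 1 # was 12
  desired_capacity = 1 # was 8
  # ... launch template, subnets, etc. unchanged, except the launch
  # template now specifies a t3.small instead of the original large type
}

# Throttle the pipeline: hourly -> once a day at 3 AM UTC.
resource "aws_cloudwatch_event_rule" "aggregator_schedule" {
  name                = "data-aggregator-schedule"
  schedule_expression = "cron(0 3 * * ? *)" # was rate(1 hour)
}
```

Either change is one `terraform apply` to roll forward and one `git revert` to roll back, which is exactly the reversibility you want while you watch the dashboards.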
Pro Tip: Before you do this, make sure your alerting is top-notch. If legacy-api-gw-01 starts throwing 503 errors because its tiny instance is overwhelmed, you need to know immediately, not a day later when a customer complains.
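As a minimal sketch of that alerting, a CloudWatch alarm on target 5xx responses might look like this, assuming the service sits behind an ALB; the load balancer and SNS topic references are placeholders:

```hcl
resource "aws_cloudwatch_metric_alarm" "legacy_gw_5xx" {
  alarm_name          = "legacy-api-gw-01-5xx-spike"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 3
  threshold           = 5
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching" # no traffic is fine; errors are not

  dimensions = {
    LoadBalancer = aws_lb.legacy_gw.arn_suffix # placeholder reference
  }

  alarm_actions = [aws_sns_topic.oncall.arn] # page someone, don't just log
}
```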
2. The Permanent Fix: The "Archaeological Dig"
This is the "right" way to do it. It takes time and effort, but it minimizes risk and cleans up technical debt properly. You become a detective, tracing the service's digital footprint.
Your best friends here are your observability tools: think DataDog, New Relic, Honeycomb, or even just deep-diving into VPC flow logs and CloudWatch metrics. You need to answer three questions:
- Who calls this service? (Ingress traffic)
- What does this service call? (Egress traffic)
- What business function does it perform? (The "so what?")
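If you end up grepping raw VPC flow logs for the first two questions, the triage can be a few lines of Python. A minimal sketch: the default 14-field flow log record format is real, but the function name and sample data below are invented for illustration.

```python
from collections import Counter

def map_dependencies(flow_log_lines, service_ips):
    """Split ACCEPTed flows into ingress (who calls us) and egress (who we call).

    Expects the default VPC flow log format:
    version account-id interface-id srcaddr dstaddr srcport dstport
    protocol packets bytes start end action log-status
    """
    ingress, egress = Counter(), Counter()
    for line in flow_log_lines:
        fields = line.split()
        if len(fields) < 14 or fields[12] != "ACCEPT":
            continue  # skip malformed lines and REJECTed traffic
        src, dst, nbytes = fields[3], fields[4], int(fields[9])
        if dst in service_ips and src not in service_ips:
            ingress[src] += nbytes  # someone is calling the service
        elif src in service_ips and dst not in service_ips:
            egress[dst] += nbytes   # the service is calling someone
    return ingress, egress

# Two fabricated flow records for a service living at 10.0.1.5:
logs = [
    "2 123456789012 eni-0abc 10.0.2.9 10.0.1.5 49152 443 6 10 840 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0abc 10.0.1.5 10.0.3.7 55000 5432 6 5 500 1700000000 1700000060 ACCEPT OK",
]
ingress, egress = map_dependencies(logs, {"10.0.1.5"})
print(dict(ingress))  # callers, keyed by IP, with bytes transferred
print(dict(egress))   # downstream dependencies, keyed by IP
```

Every IP that shows up in `ingress` is a caller you have to account for in the deprecation plan; a non-empty `egress` tells you what else might be able to shut down once this service is gone.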
You'll build a dependency map. You'll put together a formal deprecation plan. You'll communicate it to every team whose services interact with it: "In Q3, we will be decommissioning the report-generator-worker-pool. Our telemetry shows it is only called by the now-retired v2-dashboard. Please contact us if you have any dependencies we've missed." You set a date, you execute the plan, and you delete the infrastructure-as-code that defined it. It's clean, professional, and safe.
```hcl
# Example Terraform Plan (The Goal)
#
# module "legacy_data_aggregator" {
#   source = "./modules/ec2-cluster"
#   ...
# }
#
# The above module will be removed in release v3.45.0 on 08/15.
# Ticket: DEVOPS-1234
# Reason: Service has been superseded by the 'realtime-metrics-api'.
# Contact: #devops-team on Slack
```
3. The "Nuclear" Option: The Scream Test
Let's be honest. Sometimes you have zero documentation, zero logs, and zero time. The service is an opaque box, and the "Archaeological Dig" would take months. In these rare cases, you can perform a controlled "scream test."
This is not a cowboy move. It is a calculated risk. First, you do it in a staging environment if you have one. Shut it down and leave it off for a full sprint. If the QA team doesn't notice anything, you have your first piece of evidence.
For production, you plan it like a surgical strike. You announce a maintenance window in a low-traffic period, like a Saturday at 2 AM. You have a rollback plan ready to go: literally a single command or button click to bring it back online. You shut it down. And then... you wait.
If nothing happens after an hour, you can be reasonably confident. If nothing happens after a week, you can be very confident. If the BI team calls you three weeks later because their quarterly report failed, you have your answer. You can bring it back online temporarily and transition to the "Permanent Fix" method, now armed with a known dependency.
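If the service is defined in Terraform, the "single command" rollback can be a boolean toggle rather than a deleted resource. A sketch, with invented variable and resource names:

```hcl
variable "scream_test" {
  description = "When true, scale the suspect service to zero without deleting anything."
  type        = bool
  default     = false
}

resource "aws_autoscaling_group" "suspect_service" {
  name             = "report-generator-worker-pool"
  min_size         = var.scream_test ? 0 : 2
  max_size         = var.scream_test ? 0 : 4
  desired_capacity = var.scream_test ? 0 : 2
  # ... everything else stays exactly as it was
}
```

Start the test with `terraform apply -var="scream_test=true"`; the rollback is the same command with `false`. Nothing is destroyed, so the week-long observation window costs you nothing but patience.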
Warning: Use this option sparingly. It can burn trust if it goes wrong. But sometimes, it's the only way to make progress on deeply entrenched technical debt and finally stop paying for ghosts.
Read the original article on TechResolve.blog
Support my work
If this article helped you, you can buy me a coffee:
