Executive Summary
TL;DR: Unidentified, costly "zombie microservices" (metaphorically, "PPC campaigns" burning cash) often consume significant cloud resources with unknown dependencies, leading to high bills and fear of shutdown. Safely decommission these services using methods like gradual resource reduction ("Strangle and Observe"), thorough dependency mapping ("Archaeological Dig"), or a controlled, reversible "Scream Test" during low-traffic periods.
Key Takeaways
- The "Strangle and Observe" method involves cautiously reducing a service's allocated resources (e.g., scaling down EC2 instances or decreasing cron frequency) to identify hidden dependencies through monitoring system reactions and alerts, thereby minimizing immediate risk.
- The "Archaeological Dig" is the comprehensive approach, requiring the use of observability tools (e.g., DataDog, VPC flow logs) to map ingress/egress traffic and business functions, culminating in a formal deprecation plan and the removal of the associated infrastructure-as-code.
- For services with zero documentation or logs, a controlled "Scream Test" can be executed, first in staging and then in production during a low-traffic window with a rollback plan at the ready, to identify critical dependencies by observing which systems fail.
Struggling with "zombie" services and legacy processes racking up your cloud bill? Learn when and how to safely decommission infrastructure without causing a production outage.
My "PPC Campaign" is a Zombie Microservice: When to Pull the Plug
I remember staring at the monthly cloud bill. It was a five-figure number that made my stomach turn, and one line item stood out: a fleet of massive EC2 instances under a service named "DataAggregator-PROD". They were costing us nearly $4,000 a month, just humming along. I asked around. The new product manager had never heard of it. The junior devs thought it was "some legacy thing we don't touch." It was a ghost in the machine, a technical "PPC campaign" burning cash with zero measurable ROI. The problem? No one knew for sure what would happen if we turned it off. This is a story I've seen play out at nearly every company I've worked for.
The "Why": How We Create These Digital Ghosts
This isn't about blaming people. It's a natural consequence of growth, changing priorities, and team turnover. A project that was critical two years ago gets superseded. The original developers move on. The documentation, if it ever existed, is now a dead link in a forgotten Confluence space. We end up with these zombie services for a few key reasons:
- Fear of the Unknown: The primary reason. "What if this service quietly powers the checkout page and we cause a million-dollar outage?" It's easier to keep paying the bill than to risk being the one who broke production.
- Lack of Ownership: When a service belongs to everyone, it belongs to no one. Without a clear owner responsible for its lifecycle, it's destined to become technical debt.
- Poor Observability: If you can't easily see what's calling a service and what that service is calling, you're flying blind. You can't confidently decommission something you can't fully understand.
So you're stuck with this expensive, mysterious process. You know it's probably useless, but the risk of shutting it down feels too high. Let's walk through how we, in the trenches, actually solve this.
The Fixes: From Cautious Tweak to Calculated Gamble
1. The Quick Fix: The "Strangle and Observe" Method
This is my go-to first step when the political capital or time for a full investigation is low. It's a bit hacky, but it's effective. You don't kill the service; you starve it. The goal is to make it cheap and see who screams.
If it's an auto-scaling group of servers, scale the desired/min/max count down to one, on the smallest instance type possible. If it's a data pipeline, change its cron schedule from every hour to once a day at 3 AM. The service is still "running," which satisfies the nervous stakeholders, but your costs plummet. Now, you watch your monitoring dashboards like a hawk. Look for new error spikes in upstream or downstream services, check support ticket queues, and listen for whispers of "Hey, is the XYZ report running slow?"
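On AWS, the "starve it" step can be a couple of small Terraform changes. This is a rough sketch, not the author's actual setup: the resource names, old values, and instance types are all hypothetical.

```hcl
# Starve, don't kill: pin the fleet at a single small instance.
resource "aws_autoscaling_group" "data_aggregator" {
  name             = "DataAggregator-PROD"
  min_size         = 1 # was 4
  max_size         = 1 # was 12
  desired_capacity = 1 # was 8
  # ... launch template, subnets, etc. unchanged, except the launch
  # template now specifies a t3.small instead of the original large type
}

# Throttle the pipeline: hourly -> once a day at 3 AM UTC.
resource "aws_cloudwatch_event_rule" "aggregator_schedule" {
  name                = "data-aggregator-schedule"
  schedule_expression = "cron(0 3 * * ? *)" # was rate(1 hour)
}
```

Either change is one `terraform apply` to roll forward and one `git revert` to roll back, which is exactly the reversibility you want while you watch the dashboards.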
Pro Tip: Before you do this, make sure your alerting is top-notch. If legacy-api-gw-01 starts throwing 503 errors because its tiny instance is overwhelmed, you need to know immediately, not a day later when a customer complains.
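As a minimal sketch of that alerting, a CloudWatch alarm on target 5xx responses might look like this, assuming the service sits behind an ALB; the load balancer and SNS topic references are placeholders:

```hcl
resource "aws_cloudwatch_metric_alarm" "legacy_gw_5xx" {
  alarm_name          = "legacy-api-gw-01-5xx-spike"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 3
  threshold           = 5
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching" # no traffic is fine; errors are not

  dimensions = {
    LoadBalancer = aws_lb.legacy_gw.arn_suffix # placeholder reference
  }

  alarm_actions = [aws_sns_topic.oncall.arn] # page someone, don't just log
}
```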
2. The Permanent Fix: The "Archaeological Dig"
This is the "right" way to do it. It takes time and effort, but it minimizes risk and cleans up technical debt properly. You become a detective, tracing the service's digital footprint.
Your best friends here are your observability tools: think DataDog, New Relic, Honeycomb, or even just deep-diving into VPC flow logs and CloudWatch metrics. You need to answer three questions:
- Who calls this service? (Ingress traffic)
- What does this service call? (Egress traffic)
- What business function does it perform? (The "so what?")
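If you end up grepping raw VPC flow logs for the first two questions, the triage can be a few lines of Python. A minimal sketch: the default 14-field flow log record format is real, but the function name and sample data below are invented for illustration.

```python
from collections import Counter

def map_dependencies(flow_log_lines, service_ips):
    """Split ACCEPTed flows into ingress (who calls us) and egress (who we call).

    Expects the default VPC flow log format:
    version account-id interface-id srcaddr dstaddr srcport dstport
    protocol packets bytes start end action log-status
    """
    ingress, egress = Counter(), Counter()
    for line in flow_log_lines:
        fields = line.split()
        if len(fields) < 14 or fields[12] != "ACCEPT":
            continue  # skip malformed lines and REJECTed traffic
        src, dst, nbytes = fields[3], fields[4], int(fields[9])
        if dst in service_ips and src not in service_ips:
            ingress[src] += nbytes  # someone is calling the service
        elif src in service_ips and dst not in service_ips:
            egress[dst] += nbytes   # the service is calling someone
    return ingress, egress

# Two fabricated flow records for a service living at 10.0.1.5:
logs = [
    "2 123456789012 eni-0abc 10.0.2.9 10.0.1.5 49152 443 6 10 840 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0abc 10.0.1.5 10.0.3.7 55000 5432 6 5 500 1700000000 1700000060 ACCEPT OK",
]
ingress, egress = map_dependencies(logs, {"10.0.1.5"})
print(dict(ingress))  # callers, keyed by IP, with bytes transferred
print(dict(egress))   # downstream dependencies, keyed by IP
```

Every IP that shows up in `ingress` is a caller you have to account for in the deprecation plan; a non-empty `egress` tells you what else might be able to shut down once this service is gone.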
You'll build a dependency map. You'll put together a formal deprecation plan. You'll communicate it to every team whose services interact with it: "In Q3, we will be decommissioning the report-generator-worker-pool. Our telemetry shows it is only called by the now-retired v2-dashboard. Please contact us if you have any dependencies we've missed." You set a date, you execute the plan, and you delete the infrastructure-as-code that defined it. It's clean, professional, and safe.
```hcl
# Example Terraform Plan (The Goal)
#
# module "legacy_data_aggregator" {
#   source = "./modules/ec2-cluster"
#   ...
# }
#
# The above module will be removed in release v3.45.0 on 08/15.
# Ticket: DEVOPS-1234
# Reason: Service has been superseded by the 'realtime-metrics-api'.
# Contact: #devops-team on Slack
```
3. The "Nuclear" Option: The Scream Test
Let's be honest. Sometimes you have zero documentation, zero logs, and zero time. The service is an opaque box, and the "Archaeological Dig" would take months. In these rare cases, you can perform a controlled "scream test."
This is not a cowboy move. It is a calculated risk. First, you do it in a staging environment if you have one. Shut it down and leave it off for a full sprint. If the QA team doesn't notice anything, you have your first piece of evidence.
For production, you plan it like a surgical strike. You announce a maintenance window in a low-traffic period, like a Saturday at 2 AM. You have a rollback plan ready to go: literally a single command or button click to bring it back online. You shut it down. And then... you wait.
If nothing happens after an hour, you can be reasonably confident. If nothing happens after a week, you can be very confident. If the BI team calls you three weeks later because their quarterly report failed, you have your answer. You can bring it back online temporarily and transition to the "Permanent Fix" method, now armed with a known dependency.
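If the service is defined in Terraform, the "single command" rollback can be a boolean toggle rather than a deleted resource. A sketch, with invented variable and resource names:

```hcl
variable "scream_test" {
  description = "When true, scale the suspect service to zero without deleting anything."
  type        = bool
  default     = false
}

resource "aws_autoscaling_group" "suspect_service" {
  name             = "report-generator-worker-pool"
  min_size         = var.scream_test ? 0 : 2
  max_size         = var.scream_test ? 0 : 4
  desired_capacity = var.scream_test ? 0 : 2
  # ... everything else stays exactly as it was
}
```

Start the test with `terraform apply -var="scream_test=true"`; the rollback is the same command with `false`. Nothing is destroyed, so the week-long observation window costs you nothing but patience.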
Warning: Use this option sparingly. It can burn trust if it goes wrong. But sometimes, it's the only way to make progress on deeply entrenched technical debt and finally stop paying for ghosts.
Read the original article on TechResolve.blog
Support my work
If this article helped you, you can buy me a coffee:
