Table of Contents
- Introduction
- What is a Race Condition?
- The Scenario: A DNS Management System
- The Race Condition: A Perfect Storm
- Visualizing the Race Condition
- The Core Bug Pattern
- Why This Bug is So Insidious
- How to Fix It: Proper Cleanup Logic
- Advanced Prevention: Fencing Tokens
- Lessons for Distributed Systems Engineers
- Real-World Applications
- The Broader Context: Why This Matters
- Practical Checklist for Your Systems
- Conclusion: The Humbling Reality of Distributed Systems
- Summary
Introduction
Imagine this scenario: Your production system has been running flawlessly for years. You have redundancy, automated recovery, and multiple independent components working in harmony. Then one day, under a perfect storm of conditions, everything crashes. Not because of a hardware failure, not because of malicious actors, but because of a tiny race condition that only manifests when several unlikely events happen simultaneously.
This isn't hypothetical. This exact pattern has caused major outages in global production systems, bringing down critical infrastructure and affecting millions of users. The culprit? A subtle timing bug in automation software that, under specific conditions, deletes resources that are actively being used.
In this article, we'll explore one of the most insidious types of race conditions in distributed systems: the "cleanup catastrophe." We'll break down exactly how it works, why it's so hard to catch, and most importantly, how to prevent it in your own systems.
Whether you're building microservices, distributed databases, or cloud infrastructure, understanding this pattern will help you avoid a category of bugs that can remain dormant for years before striking catastrophically.
What is a Race Condition?
Before diving into the complex scenario, let's establish the basics.
A race condition occurs when the behavior of a system depends on the sequence or timing of uncontrollable events. In software, this typically happens when multiple processes or threads access shared resources without proper coordination.
Simple Example: The Bank Account Problem
# Two threads trying to update the same account
balance = 1000
# Thread 1: Deposit $100
temp1 = balance # Reads 1000
temp1 = temp1 + 100 # Calculates 1100
balance = temp1 # Writes 1100
# Thread 2: Withdraw $50 (happens at the same time)
temp2 = balance # Reads 1000 (before Thread 1 writes!)
temp2 = temp2 - 50 # Calculates 950
balance = temp2 # Writes 950
# Final balance: $950
# Expected balance: $1050
# Lost: $100
This is a classic race condition. Both threads read the balance before either writes, so one update gets lost.
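The textbook fix is to make the read-modify-write sequence atomic. Below is a minimal sketch using a lock; the names mirror the example above and are illustrative only:
from threading import Lock

balance = 1000
balance_lock = Lock()

def deposit(amount):
    global balance
    with balance_lock:   # only one thread may read-modify-write at a time
        balance = balance + amount

def withdraw(amount):
    global balance
    with balance_lock:
        balance = balance - amount
With the lock held across both the read and the write, the two updates serialize and the final balance is $1,050 as expected.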
But the race condition we're exploring today is more subtle and more dangerous.
The Scenario: A DNS Management System
Let's examine a realistic distributed system that manages DNS records for a high-availability service. This pattern applies to many systems: configuration management, cache invalidation, distributed task scheduling, and more.
System Architecture
Our system has three key components designed for reliability and automation:
1. The Planner
- Purpose: Monitors infrastructure health (load balancers, servers, etc.)
- Function: Generates configuration "plans" based on current state
- Output: Versioned plans (Plan 1, Plan 2, Plan 3, etc.) specifying which IP addresses should be active
- Operating Model: Continuously running, generating new plans as infrastructure changes
2. The Enactors (3 instances)
- Purpose: Apply configuration plans to the production DNS system
- Function: Execute the actual DNS record updates based on plans from the Planner
- Redundancy: Three separate instances running independently for fault tolerance
- Operating Model: Pick up plans, validate them, apply changes to DNS
3. The Cleanup Process
- Purpose: Remove obsolete and "stale" plans to prevent confusion
- Function: Identifies plans that are no longer needed and deletes their associated resources
- Logic: If a plan's generation time is older than the most recently completed plan, mark it as stale
- Assumption: Plans complete quickly, so generation time ≈ completion time
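To make the moving parts concrete, here is a minimal, hypothetical sketch of the plan record these components exchange; the field names are illustrative, not taken from any particular system:
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Plan:
    plan_id: int
    generation_time: datetime                          # set by the Planner at creation
    ip_addresses: list = field(default_factory=list)   # IPs that should be active in DNS
    state: str = "CREATED"                              # lifecycle state, discussed later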
This architecture looks solid. Redundant Enactors provide fault tolerance. Continuous planning adapts to infrastructure changes. Cleanup keeps the system tidy.
But this design has a fatal flaw.
The Race Condition: A Perfect Storm
The race condition occurs when several things go wrong simultaneously, creating a timing window that the system's designers didn't anticipate.
Step 1: The Slowdown (Time T1)
Enactor #1 begins processing Plan A at time T1. Under normal circumstances, this takes a few seconds. But something unusual happens:
- Network congestion delays communication with DNS servers
- Resource contention on the Enactor instance slows processing
- Temporary throttling from the DNS API causes retries
- A software bug triggers unexpected retry logic
Whatever the cause, Enactor #1 is stuck processing Plan A far longer than expected.
Timeline:
T1: Enactor #1 starts applying Plan A (generated at T1)
    └─> Expected completion: T1 + 5 seconds
    └─> Actual completion: T1 + 5 minutes (unusually slow!)
Step 2: The World Keeps Turning (Time T2-T3)
While Enactor #1 struggles, the Planner doesn't know about the delay. It continues its job:
- At T2: Generates Plan B based on updated infrastructure state
- At T3: Generates Plan C with even newer information
The Planner assumes eventual consistency will work everything out. After all, that's what it was designed for.
Timeline:
T1: Enactor #1 starts applying Plan A (still in progress...)
T2: Planner generates Plan B
T3: Planner generates Plan C
Step 3: The Speedster (Time T4-T5)
Enactor #2, operating independently and unaware of Enactor #1's troubles, picks up the newer plans:
- At T4: Quickly processes and applies Plan B (completes in 5 seconds)
- At T5: Quickly processes and applies Plan C (completes in 5 seconds)
Enactor #2 is doing exactly what it was designed to do—applying the latest configuration as fast as possible.
Timeline:
T1: Enactor #1 starts applying Plan A (still in progress...)
T2: Planner generates Plan B
T3: Planner generates Plan C
T4: Enactor #2 applies Plan B (completes quickly)
T5: Enactor #2 applies Plan C (completes quickly)
Step 4: The Cleanup Catastrophe (Time T5)
Here's where everything falls apart. After completing Plan C, Enactor #2 (or the cleanup process) examines the system state:
# Cleanup logic at time T5
current_time = T5
most_recent_completed_plan = Plan_C # Generation time: T3
active_plans = [Plan_A, Plan_C]
# The fatal logic
for plan in active_plans:
    if plan.generation_time < most_recent_completed_plan.generation_time:
        # Plan A was generated at T1, which is < T3
        # Conclusion: Plan A is "stale" and "obsolete"
        delete_plan_resources(plan)
        # This deletes ALL IP addresses associated with Plan A
The cleanup logic sees:
- Plan A was generated at T1 (old)
- Plan C was generated at T3 and completed at T5 (new)
- Conclusion: Plan A is stale and should be deleted
The fatal assumption: The cleanup assumes that if a plan is "old" by generation time, it isn't being actively used. But Enactor #1 is still applying Plan A at T5!
Timeline:
T1: Enactor #1 starts applying Plan A (STILL in progress at T5!)
T2: Planner generates Plan B
T3: Planner generates Plan C
T4: Enactor #2 applies Plan B (completes)
T5: Enactor #2 applies Plan C (completes)
    └─> Cleanup runs: "Plan A is old, delete it!"
    └─> Deletes Plan A's resources (IP addresses, DNS records)
Step 5: The Collision (Time T6)
At T6, Enactor #1 finally completes its delayed processing of Plan A. But there's a catastrophic problem:
- Plan A's resources (IP addresses, DNS records) were just deleted at T5
- Enactor #1 tries to finalize Plan A with non-existent resources
- Result: Empty DNS record or corrupted configuration
Timeline:
T1: Enactor #1 starts applying Plan A
T5: Cleanup deletes Plan A's resources
T6: Enactor #1 finishes applying Plan A
    └─> ERROR: Resources don't exist!
    └─> Result: Empty/corrupted DNS state
Step 6: The Inconsistent State
The system now has:
- No valid IP addresses in the DNS record (or a corrupted configuration)
- Metadata indicating multiple plans in various states of completion
- An inconsistent state that prevents automated recovery
The Enactors can't fix this automatically because their logic assumes they can always start from a known good state. But there is no good state—the ground truth has been deleted.
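A hypothetical sketch of why the self-healing path dead-ends; the helper names follow the ones used later in this article, while plan_resources_exist and ManualInterventionRequired are invented for illustration:
class ManualInterventionRequired(Exception):
    pass

def automated_recovery():
    latest = get_latest_completed_plan()
    if latest is None or not plan_resources_exist(latest):
        # The "known good state" the recovery logic assumes no longer exists.
        raise ManualInterventionRequired(
            "No completed plan with intact resources; an operator must rebuild state"
        )
    dns_service.set_records(latest.dns_records)   # the normal self-heal path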
Visualizing the Race Condition
Here's a visual representation of the timing:
Time →       T1      T2      T3      T4      T5      T6
             |       |       |       |       |       |
Plan A:      [=======================================●]   ← Slow (Enactor #1)
                                             ↑       ↑
Plan B:                          [===●]      |       |
                                             |       |
Plan C:                                  [===●]      |
                                             |       |
Cleanup:                                [Deletes A]  |
                                                     |
Collision:                                   [Enactor #1 finishes]
                                             [Resources gone!]

Legend:
  [===] = Plan being applied
  ●     = Plan completed
  ↑     = Critical event
The problem: Cleanup deletes Plan A's resources at T5, while Enactor #1 is still using them, completing at T6.
The Core Bug Pattern
Here's the essential code pattern that causes this race condition:
# BUGGY: The Dangerous Cleanup Logic
def cleanup_old_plans(plans, most_recent_completed):
    for plan in plans:
        if plan.generation_time < most_recent_completed.generation_time:
            # BUG: Doesn't check if plan is currently being applied!
            delete_plan_resources(plan)
            # This can delete resources that are actively in use
Why this is dangerous:
- It assumes generation time correlates with application status
- It doesn't check if a plan is currently being applied
- It doesn't verify when the plan was last accessed
- It has no safety buffer or grace period
Why This Bug is So Insidious
This race condition is particularly dangerous because:
1. It Works 99.99% of the Time
Under normal conditions:
- Plans complete in seconds
- Generation time ≈ Completion time
- Cleanup never deletes in-use resources
- The system appears perfectly reliable
2. It's Invisible in Testing
To trigger this bug, you need:
- Unusual delays in one Enactor (hard to reproduce)
- Continued plan generation during the delay (timing-dependent)
- Another Enactor completing plans quickly (normal behavior)
- Cleanup running at exactly the wrong moment (probabilistic)
Standard tests won't catch this. Even stress tests might miss it.
3. It Fails Catastrophically
When this bug triggers:
- It doesn't cause a partial failure
- It doesn't throw an obvious error
- It creates an inconsistent state that automated systems can't fix
- Recovery requires manual intervention
4. It Can Lie Dormant for Years
The system could run for years without triggering this condition, then suddenly fail when the perfect storm of timing occurs.
How to Fix It: Proper Cleanup Logic
Here's how to implement safe cleanup that prevents this race condition:
# SAFE: Proper Cleanup Logic with Multiple Safeguards
def cleanup_old_plans(plans, most_recent_completed):
    for plan in plans:
        # Check 1: Is the plan old by generation time?
        if plan.generation_time >= most_recent_completed.generation_time:
            continue  # Plan is current, skip it

        # Check 2: Is the plan currently being applied?
        if plan.is_currently_being_applied():
            continue  # Plan is in use, don't delete!

        # Check 3: Was the plan accessed recently?
        if (current_time - plan.last_access_time) < SAFETY_BUFFER:
            continue  # Plan was accessed recently, be cautious

        # Check 4: How many other plans reference this one?
        if plan.reference_count > 0:
            continue  # Other plans depend on this, keep it

        # All checks passed, safe to delete
        delete_plan_resources(plan)
Key Improvements:
1. State Tracking
from enum import Enum

class PlanState(Enum):
    CREATED = 1
    PICKED_UP = 2
    IN_PROGRESS = 3
    COMPLETING = 4
    COMPLETED = 5
    FAILED = 6

def is_currently_being_applied(plan):
    return plan.state in [
        PlanState.PICKED_UP,
        PlanState.IN_PROGRESS,
        PlanState.COMPLETING
    ]
2. Last Access Tracking
# Update access time whenever a plan is touched
def apply_plan(plan):
    plan.last_access_time = current_time()
    plan.state = PlanState.IN_PROGRESS
    # ... apply the plan ...
    plan.last_access_time = current_time()
    plan.state = PlanState.COMPLETED
3. Safety Buffer
# Don't delete anything accessed in the last 10 minutes
SAFETY_BUFFER = timedelta(minutes=10)

# Even if a plan looks old, give it a grace period
if (current_time - plan.last_access_time) < SAFETY_BUFFER:
    # Too recent, keep it for safety
    continue
4. Reference Counting
# Track how many Enactors are using this plan
from threading import Lock

class Plan:
    def __init__(self):
        self.reference_count = 0
        self.lock = Lock()

    def acquire(self):
        with self.lock:
            self.reference_count += 1

    def release(self):
        with self.lock:
            self.reference_count -= 1

    def can_delete(self):
        return self.reference_count == 0
Advanced Prevention: Fencing Tokens
Another powerful technique is using fencing tokens to prevent stale operations from succeeding:
from threading import Lock

class StaleOperationError(Exception):
    """Raised when a plan was built against an outdated configuration version."""

class ConfigurationStore:
    def __init__(self):
        self.current_version = 0
        self.data = {}
        self.lock = Lock()

    def apply_plan(self, plan, expected_version):
        with self.lock:
            # Atomic check-and-set
            if self.current_version != expected_version:
                raise StaleOperationError(
                    f"Plan based on version {expected_version}, "
                    f"but current version is {self.current_version}"
                )
            # Apply the changes
            self.data.update(plan.changes)
            self.current_version += 1
            return self.current_version

# Usage
def enactor_apply_plan(plan):
    try:
        new_version = config_store.apply_plan(
            plan,
            expected_version=plan.base_version
        )
        print(f"Successfully applied plan, now at version {new_version}")
    except StaleOperationError as e:
        print(f"Plan is stale: {e}")
        # Don't apply this plan, it's based on old state
This ensures that even if a slow Enactor completes, it can't apply a stale plan because the version check will fail.
Lessons for Distributed Systems Engineers
This race condition offers several critical lessons:
1. Question Your Timing Assumptions
❌ Bad: "This operation usually takes 100ms, so timeout after 1 second"
✅ Good: "This operation usually takes 100ms, but timeout after 30 seconds and monitor p99"
❌ Bad: "If generation time is old, the plan is obsolete"
✅ Good: "If generation time is old AND it's not in use AND no one accessed it recently, it might be obsolete"
Key principle: Design for the exception, not the rule.
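As a sketch of the "generous timeout plus percentile monitoring" idea, using only the standard library; the names and thresholds here are assumptions, not a prescribed API:
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError
from statistics import quantiles

executor = ThreadPoolExecutor(max_workers=4)
latencies_ms = []   # in production this would feed your metrics system

def call_with_timeout(operation, timeout_s=30):
    """Run an operation that 'usually takes 100ms' with a generous ceiling."""
    start = time.monotonic()
    future = executor.submit(operation)
    try:
        return future.result(timeout=timeout_s)
    finally:
        latencies_ms.append((time.monotonic() - start) * 1000)

def p99_ms():
    """Watch this value for drift instead of trusting the 'usual' latency."""
    if len(latencies_ms) < 2:
        return None
    return quantiles(latencies_ms, n=100)[98]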
2. Implement Comprehensive State Tracking
Don't just track "started" and "completed." Track:
- CREATED: Plan exists but not yet picked up
- PICKED_UP: Enactor has claimed this plan
- IN_PROGRESS: Plan is actively being applied
- COMPLETING: Plan is in final stages
- COMPLETED: Plan finished successfully
- FAILED: Plan failed and is safe to clean up
This granular tracking makes it impossible to accidentally delete in-use resources.
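For instance, here is a small sketch of a transition guard that both enforces the legal lifecycle and records state_change_time, the field the consistency check later in this article relies on; the transition table itself is an assumption:
from datetime import datetime

LEGAL_TRANSITIONS = {
    PlanState.CREATED:     {PlanState.PICKED_UP},
    PlanState.PICKED_UP:   {PlanState.IN_PROGRESS, PlanState.FAILED},
    PlanState.IN_PROGRESS: {PlanState.COMPLETING, PlanState.FAILED},
    PlanState.COMPLETING:  {PlanState.COMPLETED, PlanState.FAILED},
    PlanState.COMPLETED:   set(),
    PlanState.FAILED:      set(),
}

def transition(plan, new_state):
    if new_state not in LEGAL_TRANSITIONS[plan.state]:
        raise ValueError(f"Illegal transition: {plan.state} -> {new_state}")
    plan.state = new_state
    plan.state_change_time = datetime.utcnow()   # consumed by consistency checks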
3. Add Circuit Breakers to Cleanup Operations
Before deleting critical resources, verify:
- Is this resource currently locked by another process?
- When was this resource last accessed?
- How many other resources will this deletion affect?
- Is there a way to soft-delete first and hard-delete later?
Example:
def safe_delete(resource):
    # Soft delete: mark as deleted but keep data
    resource.marked_for_deletion = True
    resource.deletion_time = current_time()

    # Schedule hard delete for later (24 hours)
    schedule_task(
        hard_delete_resource,
        resource.id,
        delay=timedelta(hours=24)
    )

def hard_delete_resource(resource_id):
    resource = get_resource(resource_id)

    # Final safety check before permanent deletion
    if resource.last_access_time > resource.deletion_time:
        # Someone accessed it after marking for deletion!
        resource.marked_for_deletion = False
        return

    # Safe to permanently delete
    permanently_delete(resource)
4. Use Optimistic Concurrency Control
Implement version numbers that prevent stale operations:
class ConcurrencyError(Exception):
    """Raised when an update is based on a stale version."""

class Resource:
    def __init__(self):
        self.version = 0
        self.data = {}

    def update(self, new_data, expected_version):
        if self.version != expected_version:
            raise ConcurrencyError(
                f"Expected version {expected_version}, "
                f"but current version is {self.version}"
            )
        self.data = new_data
        self.version += 1
        return self.version
5. Design for Observable Inconsistency
Make it easy to detect when the system is in an inconsistent state:
def verify_system_consistency():
    """Health check that verifies consistency"""
    issues = []

    # Check 1: Any plans in COMPLETING state for > 5 minutes?
    stuck_plans = [
        p for p in get_all_plans()
        if p.state == PlanState.COMPLETING
        and (current_time() - p.state_change_time) > timedelta(minutes=5)
    ]
    if stuck_plans:
        issues.append(f"Plans stuck in COMPLETING: {stuck_plans}")

    # Check 2: Do DNS records match the latest completed plan?
    latest_plan = get_latest_completed_plan()
    actual_dns = get_current_dns_state()
    if actual_dns != latest_plan.expected_dns:
        issues.append(f"DNS mismatch: expected {latest_plan.expected_dns}, got {actual_dns}")

    # Check 3: Any plans marked for deletion that are still being referenced?
    for plan in get_plans_marked_for_deletion():
        if plan.reference_count > 0:
            issues.append(f"Plan {plan.id} marked for deletion but still referenced")

    return issues
6. Test Timing Edge Cases
Your test suite should include:
def test_delayed_enactor_race_condition():
    """Reproduce the race condition with artificial delays"""
    # Create initial plan
    plan_a = planner.generate_plan()

    # Enactor 1 starts applying Plan A, but we'll delay it
    enactor_1 = SlowEnactor(delay=timedelta(minutes=5))
    enactor_1.apply_async(plan_a)

    # While Enactor 1 is delayed, generate newer plans
    time.sleep(1)
    plan_b = planner.generate_plan()
    plan_c = planner.generate_plan()

    # Enactor 2 applies newer plans quickly
    enactor_2 = FastEnactor()
    enactor_2.apply(plan_b)
    enactor_2.apply(plan_c)

    # Run cleanup (this should NOT delete Plan A's resources!)
    cleanup.run()

    # Wait for Enactor 1 to finish
    enactor_1.wait()

    # Verify: System should still be in consistent state
    assert verify_system_consistency() == []
    assert get_current_dns_state() is not None
7. Implement Gradual Rollouts for Automation
Even if automation has been running for years, treat changes carefully:
import random
import time

class CleanupAutomation:
    def __init__(self):
        self.enabled_percentage = 0.0  # Start disabled

    def run_cleanup(self):
        if random.random() > self.enabled_percentage:
            # Not enabled for this execution
            return
        # Cleanup logic here
        self.cleanup_old_plans()

    def gradual_rollout(self):
        """Gradually enable cleanup automation"""
        self.enabled_percentage = 0.01   # 1%
        time.sleep(24 * 60 * 60)         # Monitor for 24 hours
        self.enabled_percentage = 0.10   # 10%
        time.sleep(24 * 60 * 60)
        self.enabled_percentage = 0.50   # 50%
        time.sleep(24 * 60 * 60)
        self.enabled_percentage = 1.0    # 100%
8. Build Manual Overrides
No matter how reliable your automation is, have manual controls:
class EmergencyControls:
    """Emergency manual overrides for production issues"""

    @classmethod
    def stop_all_automation(cls):
        """Immediately stop all automated processes"""
        CleanupAutomation.enabled = False
        PlannerAutomation.enabled = False
        EnactorAutomation.enabled = False
        log.critical("ALL AUTOMATION DISABLED via emergency control")

    @classmethod
    def manual_restore_dns(cls, plan_id):
        """Manually restore DNS state from a specific plan"""
        plan = get_plan_by_id(plan_id)
        # Bypass all automation and directly set DNS
        dns_service.set_records(plan.dns_records)
        log.critical(f"DNS manually restored from plan {plan_id}")

    @classmethod
    def force_cleanup_plan(cls, plan_id):
        """Manually force deletion of a specific plan"""
        plan = get_plan_by_id(plan_id)
        # Override all safety checks
        delete_plan_resources(plan, force=True)
        log.critical(f"Plan {plan_id} force-deleted via manual override")
Real-World Applications
This race condition pattern applies to many distributed systems:
1. Distributed Cache Invalidation
# Similar bug in cache invalidation
def invalidate_cache(key, generation_time):
    if cache.get_generation_time(key) < generation_time:
        # BUG: What if another process is currently updating this key?
        cache.delete(key)
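One safer variant, sketched with a hypothetical check_and_delete; many caches expose something comparable as a compare-and-swap or conditional delete:
# SAFER: only delete if the generation we observed is still the stored one
def invalidate_cache_safely(key, generation_time):
    current = cache.get_generation_time(key)
    if current is None or current >= generation_time:
        return  # entry is already newer (or gone); leave it alone
    # Conditional delete: a concurrent writer bumping the generation wins
    cache.check_and_delete(key, expected_generation=current)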
2. Distributed Task Scheduling
# Similar bug in task cleanup
def cleanup_old_tasks(tasks):
    for task in tasks:
        if task.created_at < cutoff_time:
            # BUG: What if this task is currently executing?
            delete_task(task)
3. Configuration Management
# Similar bug in config rollback
def rollback_to_previous_version(current_version):
    if current_version.timestamp < latest_version.timestamp:
        # BUG: What if current_version is being applied right now?
        delete_version_resources(current_version)
4. Service Mesh/Load Balancer Updates
# Similar bug in endpoint updates
def update_endpoints(new_endpoints):
    old_endpoints = get_current_endpoints()
    if old_endpoints.version < new_endpoints.version:
        # BUG: What if requests are currently routing to old_endpoints?
        remove_endpoints(old_endpoints)
The Broader Context: Why This Matters
This race condition is more than just a specific bug. It's a window into fundamental challenges of distributed systems:
1. Complexity Breeds Fragility
The system was designed to be reliable through redundancy (3 Enactors) and automation (continuous planning). But the interaction between these components created new failure modes that weren't anticipated.
Key Insight: Adding redundancy doesn't always add reliability. Sometimes it adds new ways to fail.
2. Automation Moves Faster Than Humans
The race condition can happen in seconds. By the time humans can react, the system is in an unrecoverable state. Automation is powerful, but it needs safeguards against moving too fast.
Key Insight: Fast automation needs slow safeguards.
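One concrete form of a "slow safeguard" is a deletion cap. The sketch below, with an assumed threshold, limits how much any single cleanup pass is allowed to destroy:
MAX_DELETIONS_PER_RUN = 5   # assumption: tune to your system's normal churn rate

def rate_limited_cleanup(candidate_plans):
    deleted = 0
    for plan in candidate_plans:
        if deleted >= MAX_DELETIONS_PER_RUN:
            log.warning("Cleanup hit its per-run deletion cap; "
                        "remaining candidates left for review")
            break
        delete_plan_resources(plan)
        deleted += 1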
3. The Assumption Trap
The cleanup logic can work perfectly for years. It's only when unusual delays occur that the assumption about timing breaks down. Many systems have similar hidden assumptions.
Key Insight: Document your assumptions and test what happens when they're violated.
4. Timing is Everything
In distributed systems, you can't rely on operations completing in a predictable order or time frame. Networks have delays, processes get scheduled unpredictably, and timing assumptions break down.
Key Insight: Never assume operation A will complete before operation B starts, even if A started first.
Practical Checklist for Your Systems
Use this checklist to audit your own distributed systems for similar race conditions:
Cleanup Logic Audit
- [ ] Does cleanup check if resources are currently in use?
- [ ] Does cleanup use timestamps to determine staleness?
- [ ] Is there a safety buffer before deletion?
- [ ] Can you soft-delete before hard-deleting?
- [ ] Is there a manual override to stop cleanup?
State Tracking Audit
- [ ] Do you track in-flight operations separately from completed ones?
- [ ] Can you query "is resource X currently being used?"
- [ ] Do you log state transitions with correlation IDs?
- [ ] Can you reconstruct the timeline of operations?
Concurrency Control Audit
- [ ] Do you use version numbers or fencing tokens?
- [ ] Can stale operations succeed?
- [ ] Is there a way to detect when operations overlap?
- [ ] Do you have locks or mutexes for critical sections?
Testing Audit
- [ ] Do you test with artificial delays?
- [ ] Do you test operations completing out of order?
- [ ] Do you test what happens when automation runs during manual operations?
- [ ] Do you have chaos engineering tests?
Observability Audit
- [ ] Can you detect inconsistent state?
- [ ] Do you alert on unexpected state transitions?
- [ ] Can you trace a single operation through the entire system?
- [ ] Do you monitor p95/p99 latencies for timing issues?
Conclusion: The Humbling Reality of Distributed Systems
Race conditions like this one remind us that distributed systems are inherently complex. Even with redundancy, automation, and careful design, subtle timing bugs can create catastrophic failures.
But this isn't a story about inevitable failure. It's a story about learning from failure modes and designing systems that are resilient to timing edge cases.
The key takeaways:
- Never assume operation timing: What usually takes 100ms might occasionally take 10 minutes
- Track in-flight operations: Don't just know what's complete; know what's in progress
- Add safety buffers to cleanup: Aggressive cleanup is dangerous
- Use optimistic concurrency control: Version numbers prevent stale operations
- Test timing edge cases: The bugs you don't test for will find you in production
- Design for observability: Make inconsistent states detectable
- Build manual overrides: Automation needs kill switches
- Document your assumptions: Today's safe assumption is tomorrow's critical bug
In distributed systems, race conditions are inevitable. But understanding them is what separates systems that fail mysteriously from systems that fail safely and recover gracefully.
Every system has timing assumptions. The question is: Do you know what yours are?
Summary
This article was written to help developers understand a subtle but devastating category of race conditions in distributed systems. The pattern described here has caused real production outages, affecting millions of users.
If you found this useful, consider sharing it with your team. Review your own systems with the checklist provided. Test timing edge cases that you've been assuming "will never happen."
Because in distributed systems, the timing edge case you don't test for is the one that will take down your production system at 3 AM on a holiday weekend.
Let's build more reliable systems together.