Over the past few years, I've had the privilege of scaling systems at two major tech companies—from transforming an advertiser intelligence platform to building a notification system that processes 50K+ events per second. The journey from "this batch job runs overnight" to "we need real-time recommendations with 80%+ confidence" taught me some expensive lessons about scaling distributed systems.
Here's what I wish I'd known before our first real-time migration broke everything (twice).
Lesson 1: Your Database Will Betray You (Plan for It)
The Disaster: At my current company, we had a beautiful batch processing system for our advertiser intelligence platform. It ran nightly, processed advertiser data, generated insights, and everyone was happy. Then the business asked for real-time recommendations. "How hard could it be?" we thought. We tried to retrofit our existing system, and during the first load test, our database couldn't handle the sudden shift from batch writes to constant real-time queries. Advertiser dashboards went down, and we had some very unhappy customers.
What We Learned: The jump from batch to real-time isn't just about speed—it's a fundamental architectural shift.
Here's what actually worked for us:
From Batch to Real-Time: The Architecture Shift
Moving from our 24-hour batch processing to real-time APIs wasn't just about making things faster. We had to completely rethink our data flow:
# Old batch approach - worked great at night
def process_advertiser_insights():
    all_advertisers = get_all_advertisers()
    for advertiser in all_advertisers:
        insights = generate_insights(advertiser)
        batch_update_database(insights)

# New real-time approach - needed different thinking
def get_real_time_insights(advertiser_id):
    # Can't process everything—need smart caching
    cached_insights = redis.get(f"insights:{advertiser_id}")
    if cached_insights and not_stale(cached_insights):
        return cached_insights

    # Generate only what's needed, when needed
    fresh_insights = generate_targeted_insights(advertiser_id)
    redis.setex(f"insights:{advertiser_id}", 300, fresh_insights)
    return fresh_insights
Predictive Models Need Different Data Patterns
Our advertiser churn prediction models (which achieved 80%+ confidence) needed real-time feature engineering. This meant rethinking how we stored and accessed data:
# Feature engineering for real-time predictions
import json

from redis import Redis


class AdvertiserFeatureStore:
    def __init__(self):
        self.redis_client = Redis()
        self.feature_cache_ttl = 300  # 5 minutes

    def get_churn_features(self, advertiser_id):
        features = {}

        # Recent activity features (need to be fresh)
        recent_activity = self.get_recent_activity(advertiser_id)
        features['days_since_last_campaign'] = recent_activity.days_since_last
        features['spend_trend_7d'] = recent_activity.spend_trend

        # Historical features (can be cached longer)
        historical_key = f"historical_features:{advertiser_id}"
        cached_historical = self.redis_client.get(historical_key)
        if cached_historical:
            features.update(json.loads(cached_historical))
        else:
            historical = self.compute_historical_features(advertiser_id)
            self.redis_client.setex(historical_key, 3600, json.dumps(historical))
            features.update(historical)

        return features
Lesson 2: Caching is Your Best Friend and Worst Enemy
The Disaster: During the development of a notification platform, we had a brilliant caching strategy. We cached everything—user preferences, device states, notification templates. Then we had a cache invalidation bug during a deployment that caused emergency notifications to show cached, outdated device statuses. Imagine getting an "all clear" notification when your smoke detector was actually going off. That was a career-defining moment of terror.
What We Learned: Every cache is a potential safety hazard when you're dealing with critical systems.
The Rules We Live By Now:
- Never cache safety-critical data. Emergency states, device alerts, anything that could impact user safety—always fetch fresh.
- Cache at the right level. We started caching raw device data, but caching processed insights and recommendations was way more effective.
- Plan for cache failures. Your cache will go down. Your notification system should degrade gracefully, not silence critical alerts.
import json

from redis.exceptions import ConnectionError as RedisConnectionError


def get_advertiser_insights(advertiser_id):
    cache_key = f"insights:{advertiser_id}"

    # Try cache first
    try:
        cached_data = redis.get(cache_key)
        if cached_data:
            # But validate critical data is still fresh
            insights = json.loads(cached_data)
            if insights.get('includes_safety_data'):
                # Never trust cached safety-critical information
                fresh_safety_data = get_fresh_safety_metrics(advertiser_id)
                insights.update(fresh_safety_data)
            return insights
    except RedisConnectionError:
        # Cache is down, log but continue
        logger.warning("Cache unavailable, falling back to database")

    # Generate data (expensive operation)
    insights_data = generate_advertiser_insights(advertiser_id)

    # Try to cache result, but don't fail if cache is down
    try:
        redis.setex(cache_key, 300, json.dumps(insights_data))
    except RedisConnectionError:
        pass  # Silently fail cache writes

    return insights_data
Cache Warming Saved Our Bacon
During traffic spikes, cold cache misses were killing us. We now pre-populate cache during off-peak hours:
def warm_popular_content():
    """Run this job every hour to pre-populate cache"""
    popular_items = get_trending_content()
    for item in popular_items:
        # Trigger cache population
        get_content_with_cache(item.id)
Lesson 3: Message Queues: Your Async Lifeline (When They Work)
The Disaster: We were building a voice-enabled notification platform to handle 50K+ events per second. Initially, we were processing notifications synchronously as they came in. During a major smart home event (think: everyone's smoke detectors going off during a wildfire), our API servers got completely overwhelmed trying to process emergency notifications in real-time. The system became unresponsive right when people needed it most.
What We Learned: If it can be async, it should be async. Especially when lives depend on it.
Priority Queues Saved Our Emergency Platform
Not all async jobs are created equal. We built a sophisticated priority queue system using managed queuing services:
- P0 Queue: Emergency notifications (smoke, security, medical alerts)
- P1 Queue: Important notifications (low battery, device offline)
- P2 Queue: Informational notifications (weather, reminders)
# Route messages based on urgency and type
def route_notification(notification):
    priority = determine_priority(notification)

    if notification.type == 'emergency':
        emergency_queue.send_message(notification, priority=priority)
        # Also send immediate push for emergency
        send_immediate_push(notification)
    elif notification.type == 'device_alert':
        device_queue.send_message(notification, priority=priority)
    else:
        general_queue.send_message(notification, priority=priority)

def determine_priority(notification):
    if 'smoke' in notification.alert_type or 'emergency' in notification.alert_type:
        return 0  # Highest priority
    elif 'battery' in notification.alert_type:
        return 1  # Medium priority
    return 2  # Low priority
Managed Streaming and Event-Driven Architecture
We used managed streaming services (think Kafka-as-a-service) for our event-driven architecture. The key insight: different event types need different processing patterns:
# Different consumers for different event patterns
class NotificationConsumer:
    def __init__(self):
        self.emergency_consumer = StreamingConsumer('emergency-events')
        self.batch_consumer = StreamingConsumer('batch-events')

    def consume_emergency_events(self):
        # Process immediately, one at a time
        for message in self.emergency_consumer:
            try:
                process_emergency_notification(message.value)
            except Exception as e:
                # Failed emergency notifications go to immediate retry
                emergency_dlq.send(message.value)

    def consume_batch_events(self):
        # Process in batches for efficiency
        batch = []
        for message in self.batch_consumer:
            batch.append(message.value)
            if len(batch) >= 100:
                process_notification_batch(batch)
                batch = []
Dead Letter Queues Saved Us
Jobs will fail. Networks will partition. Services will crash. Dead letter queues let you debug what went wrong without losing data:
def process_image(job_data):
    try:
        # Process the image
        result = image_processor.process(job_data['image_url'])
        return result
    except Exception as e:
        # Log the error
        logger.error(f"Image processing failed: {e}", extra=job_data)

        # Send to dead letter queue for manual inspection
        dead_letter_queue.put({
            'original_job': job_data,
            'error': str(e),
            'timestamp': time.time(),
            'retry_count': job_data.get('retry_count', 0)
        })

        raise  # Re-raise so queue system knows it failed
Lesson 4: Monitoring: You Can't Fix What You Can't See
The Disaster: Our advertiser intelligence system was "running fine" according to our basic health checks. Advertisers were complaining about slow recommendation loading, but our monitoring showed green across the board. Turns out, our health checks were hitting cached endpoints while real advertiser queries were hitting complex predictive model APIs that were timing out.
What We Learned: Basic uptime monitoring is worse than useless—it gives you false confidence about user experience.
The Metrics That Actually Matter:
- Real user experience latency (P95, P99 of actual API calls, not health checks)
- Business-critical path success rates (advertiser onboarding, emergency notifications)
- Model performance in production (prediction confidence, feature freshness)
- Cross-service dependency health (because failure cascades are real)
# This is how we measure real advertiser experience now
import time
from collections import defaultdict

import numpy as np


class AdvertiserExperienceTracker:
    def __init__(self):
        self.metrics = defaultdict(list)

    def track_recommendation_request(self, advertiser_id, model_type, latency, confidence, success):
        timestamp = time.time()

        # Track the full user journey, not just individual API calls
        self.metrics[f"recommendation_journey_{model_type}"].append({
            'latency': latency,
            'confidence': confidence,
            'success': success,
            'timestamp': timestamp
        })

        # Alert on business-critical metrics
        if confidence < 0.8 and model_type == 'churn_prediction':
            self.alert(f"Churn prediction confidence dropped to {confidence}")

        # P95 latency for advertiser-facing features
        recent_requests = [
            req for req in self.metrics[f"recommendation_journey_{model_type}"]
            if timestamp - req['timestamp'] < 300
        ]
        if len(recent_requests) > 10:
            p95_latency = np.percentile([r['latency'] for r in recent_requests], 95)
            if p95_latency > 2000:  # 2 seconds is too slow for advertisers
                self.alert(f"High P95 latency for {model_type}: {p95_latency}ms")
Distributed Tracing Changed Everything
When you have 20+ services, finding bottlenecks is impossible without distributed tracing. We use Jaeger, and it's been a game-changer:
from jaeger_client import Config

def create_tracer():
    config = Config(
        config={
            'sampler': {'type': 'const', 'param': 1},
            'logging': True,
        },
        service_name='user-service',
    )
    return config.initialize_tracer()

tracer = create_tracer()

@traced_function
def process_user_request(user_id):
    with tracer.start_span('database_lookup') as span:
        user = get_user(user_id)
        span.set_tag('user_tier', user.tier)

    with tracer.start_span('permission_check'):
        permissions = check_permissions(user)

    return build_response(user, permissions)
Lesson 5: Circuit Breakers: Failing Fast to Stay Alive
The Disaster: During a major incident with our emergency assistance platform, one of our downstream services (the one that validates emergency contacts) started experiencing high latency. Instead of failing fast, every emergency call request started waiting 30 seconds for contact validation. During an actual emergency, this delay could be life-threatening. Our entire emergency response system became unusable because one non-critical validation step was slow.
What We Learned: Cascading failures will kill you faster than the original problem, especially in safety-critical systems.
Circuit Breakers for Life-Critical Systems
Now we wrap every external service call with a circuit breaker, but with different thresholds based on criticality:
from pybreaker import CircuitBreaker, CircuitBreakerError

# Different circuit breaker configs for different service types
emergency_contact_breaker = CircuitBreaker(
    fail_max=3,           # Very sensitive - open after 3 failures
    reset_timeout=30,     # Try again quickly - 30 seconds
    exclude=[ValueError]  # Don't count validation errors as failures
)

analytics_breaker = CircuitBreaker(
    fail_max=10,         # Less sensitive - this isn't life-critical
    reset_timeout=300,   # Wait longer - 5 minutes
    exclude=[ValueError]
)

@emergency_contact_breaker
def validate_emergency_contact(contact_info):
    return contact_service.validate(contact_info)

def handle_emergency_call(user_id, emergency_type):
    user_contacts = get_user_contacts(user_id)  # look up the user's saved emergency contacts
    try:
        # Try to validate contacts, but don't block on it
        contact_validation = validate_emergency_contact(user_contacts)
        return initiate_emergency_call(user_id, emergency_type, contact_validation)
    except CircuitBreakerError:
        # Contact service is down, proceed with emergency call anyway
        logger.warning("Contact validation service unavailable, proceeding with emergency call")
        return initiate_emergency_call(user_id, emergency_type, fallback_contacts=True)
Graceful Degradation for Critical Systems
When services fail, don't just return errors. Provide fallback experiences that maintain safety:
- Emergency contact service down? Use cached contacts and local emergency numbers
- Recommendation engine down? Show recently successful campaign patterns
- Analytics service down? Skip the metrics, complete the core action
- Device status service down? Assume devices are operational but alert users
The key insight: identify your system's absolute core functionality and protect it at all costs.
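To make that concrete, here's a minimal sketch of the pattern: wrap the non-critical dependency, and on failure serve a degraded response instead of an error. It's illustrative only; the names and sample data (with_fallback, get_recommendations, the fake campaign pattern) are placeholders made up for this post, not our production APIs.

import logging

logger = logging.getLogger(__name__)

def with_fallback(primary, fallback, description):
    """Run the primary path; on any failure, log it and serve the fallback.

    Both arguments are zero-argument callables, so the fallback work only
    happens when it's actually needed.
    """
    try:
        return primary()
    except Exception as exc:
        logger.warning("%s unavailable (%s), serving degraded response", description, exc)
        return fallback()

# Example: recommendation engine down -> recently successful campaign patterns.
# The inner functions are stand-ins for real service calls.
def get_recommendations(advertiser_id):
    def live():
        raise ConnectionError("recommendation service timed out")  # simulate an outage

    def recent_patterns():
        return [{"advertiser": advertiser_id, "campaign": "spring-sale", "ctr": 0.042}]  # cached, known-good data

    return with_fallback(live, recent_patterns, "Recommendation engine")

if __name__ == "__main__":
    print(get_recommendations("adv-123"))  # degraded output, but the core action completes

The detail that matters: the fallback path should rely only on data you already have on hand (a cache, a local table, sane defaults), so it can't fail for the same reason the primary path just did.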
The Meta-Lesson: Scale the Team and Organizational Practices
The biggest lesson from scaling teams at multiple tech companies? Technical scaling is only half the battle. As your system grows from a few services to dozens, you need to scale your team's ability to understand, debug, and evolve complex distributed systems.
What worked for us:
- Comprehensive runbooks for everything. When the pager goes off at 3 AM during an emergency platform incident, you want step-by-step instructions, not archaeology. I spent months creating runbooks that could guide any engineer through our most complex failure scenarios.
- Clear ownership boundaries. Each service has a dedicated team and on-call rotation. Shared ownership means no ownership, especially when you're supporting emergency services.
- Regular chaos engineering. We regularly break things in staging to practice our incident response. This was crucial for the emergency platform—we needed to know our failover mechanisms worked before real emergencies happened.
- Blameless post-mortems with action items. Focus on system improvements, not finger-pointing. After our contact validation incident, we implemented better circuit breaker patterns across all emergency-critical paths.
- Onboarding at scale. I designed onboarding programs for 100+ engineers, improving efficiency by 75%. The key: hands-on experience with real (safe) failure scenarios during onboarding.
The Honest Truth About Scaling at Big Tech
Scaling isn't about having the perfect architecture from day one. It's about building systems that can evolve, monitoring them obsessively, and learning from your inevitable failures. I've helped transform systems from 24-hour batch processing to real-time recommendations with 80%+ confidence, and built platforms that process 50K+ events per second while maintaining the reliability that emergency services demand.
We're still learning. Just last month, we had an incident where a machine learning model deployment caused prediction confidence to drop temporarily. But now we fail better—faster detection, automated rollbacks, and fewer users affected.
The journey from building advertiser intelligence systems to emergency assistance platforms taught me that scaling is less about fancy technology and more about discipline, monitoring, and planning for failure. Whether you're processing ad recommendations or emergency calls, the fundamentals remain the same: fail fast, degrade gracefully, and always prioritize your users' most critical needs.
Your mileage may vary, but hopefully, these lessons help you avoid some of the expensive mistakes we made along the way.
What's the most expensive scaling lesson you've learned? Have you had to migrate from batch to real-time processing? Drop your war stories in the comments—we're all in this together!