Database queries eating your budget? How we fixed N+1 problems and cut costs 40%
Ever seen your infrastructure costs climb while performance tanks? A European CRM platform with 50K users was burning €18,000 monthly on oversized servers while their API responses crawled at 2.8 seconds p95. Spoiler: throwing more CPU at the problem wasn't the answer.
## The problem: classic over-provisioning meets query hell
Their setup screamed "we scaled vertically and hoped for the best":
- Three oversized VMs running 24/7 at 12% CPU utilization
- Database server with 32 cores mostly waiting on inefficient queries
- 2.3M daily API requests, with database queries averaging 450ms each
- Cache hit rate sitting at a painful 23%
The real kicker? 78% of their traffic happened during European business hours, but they were paying for peak capacity around the clock.
## What the audit revealed
We dug into their query patterns and found the usual suspects:
**N+1 query nightmare:** a single dashboard load triggered 47 separate database queries. Classic lazy-loading problem (a sketch of the pattern follows below).
**Synchronous external API calls:** adding 200-800ms to user requests; these calls belonged in background jobs.
**Default Redis config:** treating it like basic key-value storage instead of a proper caching layer.
**Even load balancing:** distributing traffic equally regardless of actual server load or time-of-day patterns.
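The post doesn't name their ORM, but this failure mode looks the same in most of them: lazy-loaded relationships silently issue one query per row. Here's a minimal sketch of how those 47 queries arise, assuming SQLAlchemy-style models (all names here are illustrative, not the platform's actual schema):

```python
# Hypothetical SQLAlchemy models standing in for the CRM's real schema.
from sqlalchemy import ForeignKey
from sqlalchemy.orm import (
    DeclarativeBase, Mapped, Session, mapped_column, relationship,
)

class Base(DeclarativeBase):
    pass

class User(Base):
    __tablename__ = "users"
    id: Mapped[int] = mapped_column(primary_key=True)
    projects: Mapped[list["Project"]] = relationship()

class Project(Base):
    __tablename__ = "projects"
    id: Mapped[int] = mapped_column(primary_key=True)
    user_id: Mapped[int] = mapped_column(ForeignKey("users.id"))
    tasks: Mapped[list["Task"]] = relationship()

class Task(Base):
    __tablename__ = "tasks"
    id: Mapped[int] = mapped_column(primary_key=True)
    project_id: Mapped[int] = mapped_column(ForeignKey("projects.id"))

def load_dashboard(session: Session, user_id: int) -> list[tuple[int, int]]:
    rows = []
    user = session.get(User, user_id)   # query 1: the user
    for project in user.projects:       # query 2: lazy-loads the projects
        for task in project.tasks:      # +1 query *per project*
            rows.append((project.id, task.id))
    return rows                         # 45 projects -> 47 queries total
```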
## The fix: optimize first, scale second
Instead of adding more resources, we focused on making existing ones work smarter.
### Database optimization
Replaced N+1 patterns with batch queries and eager loading:
```sql
-- Before: 47 queries for the user dashboard
SELECT * FROM users WHERE id = ?;
SELECT * FROM projects WHERE user_id = ?;
SELECT * FROM tasks WHERE project_id = ?;
-- ...repeated for each project

-- After: one of the 3 optimized queries, using JOINs
SELECT u.*, p.*, t.*
FROM users u
LEFT JOIN projects p ON u.id = p.user_id
LEFT JOIN tasks t ON p.id = t.project_id
WHERE u.id = ?;
```
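At the application layer, the equivalent fix is eager loading. Reusing the hypothetical models from the audit sketch above, SQLAlchemy's `selectinload` issues one batched query per relationship, which lines up with the 3-query figure:

```python
# Same hypothetical models as the earlier sketch. selectinload() emits one
# IN-based query per relationship, so the count stays fixed no matter how
# many projects the user has.
from sqlalchemy import select
from sqlalchemy.orm import Session, selectinload

def load_dashboard(session: Session, user_id: int) -> User | None:
    stmt = (
        select(User)
        .options(selectinload(User.projects).selectinload(Project.tasks))
        .where(User.id == user_id)
    )
    # 1 query for users + 1 for projects + 1 for tasks = 3 total
    return session.scalars(stmt).one_or_none()
```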
### Redis caching strategy
Implemented multi-tier caching (a redis-py sketch follows the list):
- Query results: 15min TTL for user data, 1hr for reference data
- External API responses: 30min TTL with async refresh
- Session data: cached to eliminate repeated auth lookups
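Here's a minimal redis-py sketch of that tiering; the key scheme and the fetch callback are our own illustration, not the platform's actual code:

```python
# Tiered TTLs matching the strategy above; key naming is illustrative.
import json
from redis import Redis

cache = Redis.from_url("redis://localhost:6379/0")

TTL = {
    "user": 15 * 60,       # user data: 15 minutes
    "reference": 60 * 60,  # reference data: 1 hour
    "external": 30 * 60,   # external API responses: 30 minutes
}

def cached(kind: str, key: str, fetch):
    """Return the cached value for key, falling back to fetch() on a miss."""
    raw = cache.get(f"{kind}:{key}")
    if raw is not None:
        return json.loads(raw)
    value = fetch()
    cache.set(f"{kind}:{key}", json.dumps(value), ex=TTL[kind])
    return value

# Usage, with a hypothetical loader:
# user = cached("user", str(user_id), lambda: load_user(user_id))
```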
### Auto-scaling configuration
```yaml
min_instances: 1
max_instances: 4
scale_up_trigger: cpu > 70% for 5min
scale_down_trigger: cpu < 30% for 15min
instance_type: c5.large  # down from c5.2xlarge
```
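The post never names the cloud provider, though the c5 instance types point at AWS. Under that assumption, the triggers above translate roughly into EC2 simple-scaling policies plus CloudWatch alarms via boto3 (the group and policy names here are invented):

```python
# Hypothetical boto3 wiring for the triggers above. Group, policy, and
# alarm names are made up; the provider itself is an assumption.
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

ASG_NAME = "crm-app-asg"  # hypothetical Auto Scaling group

# Simple scaling policies: +1 instance on scale-up, -1 on scale-down.
scale_up = autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG_NAME,
    PolicyName="scale-up-on-cpu",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=300,
)
scale_down = autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG_NAME,
    PolicyName="scale-down-on-cpu",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=-1,
    Cooldown=300,
)

# Alarms matching the triggers: CPU > 70% for 5 min scales up,
# CPU < 30% for 15 min scales down.
cloudwatch.put_metric_alarm(
    AlarmName="crm-cpu-high",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
    Statistic="Average",
    Period=300,           # one 5-minute window
    EvaluationPeriods=1,
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[scale_up["PolicyARN"]],
)
cloudwatch.put_metric_alarm(
    AlarmName="crm-cpu-low",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,  # 3 x 5 min = 15 minutes below threshold
    Threshold=30.0,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=[scale_down["PolicyARN"]],
)
```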
### Background job processing
Moved external API calls, reports, and emails to async workers:
```python
# Before: blocking the user request on a third-party API call
import requests
from flask import render_template  # render_template suggests Flask

response = requests.get(external_api_url, timeout=5)
return render_template('dashboard.html', data=response.json())

# After: serve whatever is cached, refresh it out of band
data = cache.get(cache_key) or get_cached_default()
refresh_data_async.delay(cache_key)  # Celery task repopulates the cache
return render_template('dashboard.html', data=data)
```
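For completeness, the worker side of that `refresh_data_async` task might look like the sketch below. Only the task name and the 30-minute TTL come from the post; the broker wiring and the `external_api_url_for` helper are assumptions:

```python
# Hypothetical worker-side sketch of the refresh task referenced above.
import requests
from celery import Celery
from redis import Redis

celery_app = Celery("crm", broker="redis://localhost:6379/0")
cache = Redis.from_url("redis://localhost:6379/1")

@celery_app.task
def refresh_data_async(cache_key: str) -> None:
    # Fetch the slow external resource off the request path.
    url = external_api_url_for(cache_key)  # hypothetical lookup helper
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # 30-minute TTL, matching the caching strategy above.
    cache.set(cache_key, response.content, ex=1800)
```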
## Results that actually matter
**Performance improvements:**
- Database queries: 450ms → 89ms (80% faster)
- P95 response times: 2.8s → 1.1s (61% improvement)
- Cache hit rate: 23% → 87%
- Page load times: 4.2s → 2.1s average
**Cost reductions:**
- Total monthly spend: €18,000 → €11,000 (39% reduction)
- App server costs: 64% reduction through rightsizing
- Database costs: 47% reduction after optimization
- CPU utilization: 12% → 45-65% (properly sized resources)
## Key takeaways for your next optimization
- Profile before you scale: Query analysis revealed the real bottlenecks weren't resource constraints (a query-budget test sketch follows this list)
- Cache aggressively: 87% hit rates eliminated most database calls
- Scale horizontally with auto-scaling: Match your infrastructure to actual usage patterns
- Async everything external: Keep third-party API calls out of user request paths
- Monitor utilization: 12% CPU usage on oversized instances is just burning money
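A cheap way to make the first takeaway stick: assert a query budget in your test suite so a reintroduced N+1 fails CI. A minimal sketch, again assuming SQLAlchemy, with names of our own invention:

```python
# Hypothetical CI guard: count SQL statements issued inside a block so an
# accidental N+1 fails the test suite.
from contextlib import contextmanager

from sqlalchemy import event
from sqlalchemy.engine import Engine

@contextmanager
def assert_max_queries(engine: Engine, limit: int):
    statements: list[str] = []

    def record(conn, cursor, statement, parameters, context, executemany):
        statements.append(statement)

    event.listen(engine, "before_cursor_execute", record)
    try:
        yield statements
        assert len(statements) <= limit, (
            f"{len(statements)} queries executed, budget is {limit}:\n"
            + "\n".join(statements)
        )
    finally:
        event.remove(engine, "before_cursor_execute", record)

# Usage: the optimized dashboard should never regress past 3 queries.
# with assert_max_queries(engine, limit=3):
#     load_dashboard(session, user_id=42)
```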
The platform now handles 3.7M daily requests (60% more than before) with better performance and 39% lower costs. Sometimes the best scaling strategy is optimizing what you already have.
Originally published on binadit.com