A Weekend Crisis That Turned Into a Masterclass in Database Optimization
Keywords: Django Simple History, PostgreSQL optimization, database bloat, Django performance tuning, production debugging, ORM optimization, VACUUM FULL, historical records cleanup, freight management system
The Saturday Morning That Changed Everything
It was a regular Saturday morning when our production freight management platform suddenly crashed. No warnings. No obvious errors. Just... down.
Our DigitalOcean managed PostgreSQL database was struggling: CPU at 99.9%, queries timing out, and our logistics operations grinding to a halt. Shipments worth millions were in limbo, and we had no idea why.
This is the story of how we diagnosed and fixed one of the most insidious database problems in Django applications: 99.95% duplicate historical records causing massive database bloat.
The Initial Symptoms
Database Stats That Made Us Panic:
Database size: 10.2 GB (and growing)
CPU usage: Constant 99.9%
Load average: 7.37 (on a 2-core instance)
Query timeouts: Every few seconds
Application status: Completely down
We knew something was catastrophically wrong, but what?
The Investigation Begins
Step 1: Identifying the Culprit
Running a simple query to check table sizes revealed the smoking gun:
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
    pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) AS table_size,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename) - pg_relation_size(schemaname||'.'||tablename)) AS index_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
The Results Were Shocking:
Table | Total Size | Records |
---|---|---|
historical_shipment_containers | 3.7 GB | 20.7M |
historical_shipment_events | 2.2 GB | 11.2M |
historical_tracking_data | 1.7 GB | 6.1M |
historical_customs_records | 763 MB | 2.8M |
Total: 38 million historical records consuming around 8GB!
Our actual shipment data? About 1,500 shipments. Something was very, very wrong.
The Root Cause: Django Simple History Gone Wild
We use django-simple-history to track changes to shipment data - essential for freight operations where audit trails are critical. But our implementation had a fatal flaw.
What Went Wrong
Our system syncs shipment data from another service every 20 minutes via Celery background tasks. For each container, we were calling:
# The problematic code
container, created = ShipmentContainer.objects.update_or_create(
shipment=shipment,
container_number=container_number,
defaults={
'container_type': container_type,
'gate_out_date': gate_out_date,
'empty_return_date': empty_return_date,
}
)
The Hidden Problem: update_or_create() calls .save() EVERY TIME, even when the data hasn't changed!
With django-simple-history installed, every .save() creates a new historical record.
The Math:
150 containers being tracked
Synced every 20 minutes
72 syncs per day
10,800 duplicate history records PER DAY
Running for months...
Result: 6.8 million duplicate container history records!
The Failed First Attempt
Django Simple History provides a built-in cleanup command:
python manage.py clean_duplicate_history --auto
We tried it. And waited. And waited.
Removed 2432 historical records for ShipmentContainer
Removed 16741 historical records for ShipmentContainer
Removed 7877 historical records for ShipmentContainer
...
(1 hour 43 minutes later, still running)
Locally, this would take days. In production? Unacceptable.
The command processes records one-by-one, which is fine for thousands of duplicates. For millions? We needed a different approach.
The SQL Solution: PostgreSQL Window Functions to the Rescue
We needed something faster. Much faster. Enter: PostgreSQL window functions with batched deletion.
The Strategy
Instead of processing records one-by-one, we used SQL to identify and delete duplicates in batches:
-- Identify duplicates using window functions
WITH ranked_history AS (
SELECT
history_id,
ROW_NUMBER() OVER (
PARTITION BY shipment_id, container_number, container_type,
gate_out_date, empty_return_date
ORDER BY history_date DESC
) as rn
FROM historical_shipment_containers
)
DELETE FROM historical_shipment_containers
WHERE history_id IN (
SELECT history_id FROM ranked_history WHERE rn > 1
);
This query:
Groups identical records by all tracked fields
Numbers them chronologically (newest = 1)
Deletes everything except the newest record
Time to delete 6.8M records: ~2 hours (vs. days with the Django command)
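The single DELETE above is the core of the approach; in production we ran it in batches so that each transaction stayed short and no statement held locks for hours. Here is a minimal sketch of how such a batching loop can be driven from Django - the batch size, function name, and use of cursor.rowcount are illustrative assumptions, not our exact production script:

from django.db import connection, transaction

BATCH_SIZE = 50_000  # assumed batch size - tune for your instance

DEDUP_SQL = """
    WITH ranked_history AS (
        SELECT
            history_id,
            ROW_NUMBER() OVER (
                PARTITION BY shipment_id, container_number, container_type,
                             gate_out_date, empty_return_date
                ORDER BY history_date DESC
            ) AS rn
        FROM historical_shipment_containers
    )
    DELETE FROM historical_shipment_containers
    WHERE history_id IN (
        SELECT history_id FROM ranked_history WHERE rn > 1 LIMIT %s
    )
"""

def delete_duplicates_in_batches():
    """Run the windowed DELETE repeatedly until no duplicates remain."""
    total_deleted = 0
    while True:
        with transaction.atomic(), connection.cursor() as cursor:
            cursor.execute(DEDUP_SQL, [BATCH_SIZE])
            deleted = cursor.rowcount  # rows removed by this batch
        total_deleted += deleted
        if deleted == 0:
            break
    return total_deleted

Each pass deletes at most one batch of duplicates and commits; the loop ends when a pass removes nothing.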
How Window Functions Work
For those unfamiliar, ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) is like saying:
"Group records by these fields, then number each record within its group by date, keeping the newest as #1"
This lets us delete all records numbered 2, 3, 4, etc. (the duplicates) while keeping #1 (the most recent).
The Space Reclamation: VACUUM FULL
After deleting millions of rows, we discovered another gotcha: PostgreSQL doesn't automatically reclaim disk space!
The deleted rows were marked as "dead tuples" but still occupying disk space.
The Solution: VACUUM FULL
VACUUM FULL historical_tracking_data;
VACUUM FULL historical_shipment_containers;
VACUUM FULL historical_shipment_events;
⚠️ Warning: VACUUM FULL locks the table during operation. We put the application in maintenance mode for 15 minutes.
The Results:
Table | Before | After | Space Saved |
---|---|---|---|
Milestone | 1.7 GB | 13 MB | 99.94% |
Containers | 3.7 GB | 1.6 MB | 99.97% |
Events | 2.2 GB | 13 MB | 99.41% |
Total Saved: 7.6 GB
Preventing Future Bloat: The Real Fix
Cleaning the database was only half the battle. We needed to prevent this from happening again.
Problem 1: Unnecessary Saves in update_or_create()
Before:
# This saves EVERY TIME, even if nothing changed!
container, created = ShipmentContainer.objects.update_or_create(
shipment=shipment,
container_number=container_number,
defaults={
'container_type': container_type,
'gate_out_date': gate_out_date,
}
)
After:
def smart_update_container(shipment, container_data):
    """
    Only save if fields have actually changed.
    Prevents unnecessary history records.
    """
    try:
        container = ShipmentContainer.objects.get(
            shipment=shipment,
            container_number=container_data['number']
        )
    except ShipmentContainer.DoesNotExist:
        # Genuinely new container - creating it (and its first history record) is expected
        return ShipmentContainer.objects.create(
            shipment=shipment,
            container_number=container_data['number'],
            container_type=container_data['type'],
        )
    # Check if ANY field has actually changed
    needs_update = False
    if container.container_type != container_data['type']:
        container.container_type = container_data['type']
        needs_update = True
    # ONLY save if something actually changed
    if needs_update:
        container.save()
    return container
Impact: Reduced saves by 99%! API returns the same data 99% of the time, so only 1% of syncs actually need to save.
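The same idea generalizes beyond containers. Below is a hypothetical helper (not our exact production code) that compares a dict of incoming values against a model instance and saves only when something actually differs, using update_fields to keep the write small:

def update_if_changed(instance, new_values):
    """
    Set only the fields whose incoming values differ, then save once.
    Returns True if a save (and therefore a history record) happened.
    """
    changed_fields = []
    for field, new_value in new_values.items():
        if getattr(instance, field) != new_value:
            setattr(instance, field, new_value)
            changed_fields.append(field)
    if changed_fields:
        instance.save(update_fields=changed_fields)
        return True
    return False

# Example usage inside a sync task:
# update_if_changed(container, {
#     'container_type': container_data['type'],
#     'gate_out_date': container_data.get('gate_out_date'),
# })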
Problem 2: Type Mismatches Causing False Changes
We discovered the tracking API returns status codes as integers (50), but our database stores them as strings ("50").
# API returns: {"status_code": 50}
# Database has: "50"
# This comparison ALWAYS fails!
if tracking.status_code != api_data.get("status_code"): # 50 != "50" → True!
tracking.status_code = api_data.get("status_code")
tracking.save() # Unnecessary save every time!
Fix:
# Always convert to the same type before comparing
new_status = str(api_data.get("status_code")) if api_data.get("status_code") is not None else None
if tracking.status_code != new_status:
tracking.status_code = new_status
needs_update = True
Lesson: API integrations often return different types than your database stores. Always normalize types before comparison!
Problem 3: Timezone-Aware vs Timezone-Naive Datetime Comparison
Our date parsing function was returning timezone-naive datetimes, but PostgreSQL stores them as timezone-aware with Django's USE_TZ=True.
# API returns: "2025-08-22"
# Our parser created: datetime(2025, 8, 22, 0, 0, 0) # Naive!
# Database stores: datetime(2025, 8, 22, 0, 0, 0, tzinfo=UTC) # Aware!
# These are NEVER equal, triggering saves every sync!
This caused Django to log warnings:
RuntimeWarning: DateTimeField received a naive datetime while time zone support is active.
Fix:
from datetime import datetime
from dateutil import parser
from django.utils import timezone as tz

def parse_api_date(date_value):
    """Parse date from API and return timezone-aware datetime."""
    if isinstance(date_value, dict) and date_value.get("Date"):
        date_str = date_value["Date"]
    elif isinstance(date_value, str):
        date_str = date_value
    else:
        return None
    try:
        # Try ISO format first
        dt = parser.isoparse(date_str)
        if dt.tzinfo is None:
            dt = tz.make_aware(dt, tz.get_default_timezone())
        return dt
    except ValueError:
        # Fall back to simple date format
        try:
            date_obj = datetime.strptime(date_str, "%Y-%m-%d")
            # Convert to timezone-aware datetime
            return tz.make_aware(date_obj, tz.get_default_timezone())
        except ValueError:
            return None
Impact: Eliminated all timezone-related false changes and warnings!
Problem 4: Latitude/Longitude Causing Perpetual History Bloat
For shipment tracking, we store GPS coordinates showing where vessels are. These coordinates change constantly as ships move across the ocean - sometimes every few minutes!
We were tracking every single position change in history, which was pointless. We don't need an audit trail of "vessel moved 0.5 nautical miles" thousands of times.
Solution: Exclude frequently-changing, non-critical fields from history tracking.
class ShipmentEvent(models.Model):
# ... fields ...
latitude = models.DecimalField(max_digits=30, decimal_places=15, ...)
longitude = models.DecimalField(max_digits=30, decimal_places=15, ...)
# Exclude lat/long from history - they change too frequently!
# We still store current position, just don't track every change
history = HistoricalRecords(excluded_fields=['latitude', 'longitude'])
Impact: Prevents thousands of unnecessary history records as vessels sail!
The Final Results
Before
Database size: 10.2 GB
Historical records: 28 million
Duplicate records: 24.2 million (99.95%)
CPU usage: 99.9% constant
Application status: Down
Sync performance: Timing out
Query response time: 2-5 seconds
After
Database size: 1.6 GB (saved 8.6 GB)
Historical records: 1.8 million
Duplicate records: 0
CPU usage: <10% average
Application status: Stable
Sync performance: Sub-second
Query response time: 50-200ms
10x performance improvement across the board!
Key Lessons Learned
1. Monitor Your Historical Tables
Historical tables can silently grow to dwarf your actual data. Set up monitoring:
-- Weekly table size check
SELECT
    relname AS tablename,
    pg_size_pretty(pg_total_relation_size(relid)) AS size,
    n_live_tup AS approx_row_count
FROM pg_stat_user_tables
WHERE schemaname = 'public' AND relname LIKE '%historical%'
ORDER BY pg_total_relation_size(relid) DESC;
Add this to your weekly ops checklist or set up automated alerts when historical tables exceed expected sizes.
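One way to automate that alerting is a small periodic task that runs the size check and warns when any historical table crosses a threshold. A sketch, assuming Django's default database connection; the 1 GB threshold and plain logging (instead of a real alerting channel) are placeholders:

import logging

from django.db import connection

logger = logging.getLogger(__name__)

SIZE_THRESHOLD_BYTES = 1 * 1024 ** 3  # assumed 1 GB alert threshold

def check_historical_table_sizes():
    """Warn about any historical table larger than the threshold."""
    with connection.cursor() as cursor:
        cursor.execute(
            """
            SELECT relname, pg_total_relation_size(relid)
            FROM pg_stat_user_tables
            WHERE schemaname = 'public' AND relname LIKE %s
            """,
            ["%historical%"],
        )
        for table_name, size_bytes in cursor.fetchall():
            if size_bytes > SIZE_THRESHOLD_BYTES:
                logger.warning(
                    "Historical table %s is %.1f GB - check for duplicate bloat",
                    table_name, size_bytes / 1024 ** 3,
                )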
2. Never Blindly Use update_or_create() with History Tracking
Always check if fields have actually changed before saving:
# Bad - saves every time
obj, created = Model.objects.update_or_create(pk=1, defaults={'field': value})
# Good - only saves when changed
obj = Model.objects.get(pk=1)
if obj.field != new_value:
obj.field = new_value
obj.save()
3. Type Consistency Matters in API Integrations
API integrations often return different types than your database stores. Always normalize:
# Convert types before comparison
api_value = str(api_response.get("field")) # Ensure string
db_value = str(obj.field) # Ensure string
if api_value != db_value:
obj.field = api_value
obj.save()
4. Timezone Awareness is Critical
With USE_TZ=True in Django settings, always return timezone-aware datetimes from parsers:
from django.utils import timezone as tz
dt = datetime.strptime(date_str, "%Y-%m-%d")
dt = tz.make_aware(dt, tz.get_default_timezone()) # Make aware!
5. Not Everything Needs History
Exclude frequently-changing non-critical fields from history:
# Good candidates for exclusion:
# - updated_at timestamps
# - last_sync timestamps
# - GPS coordinates
# - view counts
# - Any high-frequency changing data
history = HistoricalRecords(
excluded_fields=['updated_at', 'last_sync', 'latitude', 'longitude']
)
6. PostgreSQL Window Functions Are Powerful
For bulk operations on millions of rows, SQL outperforms ORM by orders of magnitude.
Django ORM is great for application logic, but for data cleanup, bulk updates, and analytics, raw SQL with window functions is often 100x+ faster.
7. VACUUM FULL After Large Deletions
Don't forget to reclaim disk space after deleting millions of rows!
-- Regular VACUUM (non-blocking, safe for production)
VACUUM ANALYZE table_name;
-- VACUUM FULL (blocking, use in maintenance window)
VACUUM FULL table_name;
Regular VACUUM is safe to run anytime. VACUUM FULL requires downtime but reclaims more space.
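A quick way to judge whether a table needs vacuuming at all is to compare its dead and live tuple counts in pg_stat_user_tables. A small sketch using Django's database connection; the 20% ratio is just an assumed rule of thumb:

from django.db import connection

def tables_needing_vacuum(min_dead_ratio=0.2):
    """Return (table, live, dead) tuples where dead tuples exceed the ratio."""
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT relname, n_live_tup, n_dead_tup
            FROM pg_stat_user_tables
            WHERE n_dead_tup > 0
        """)
        rows = cursor.fetchall()
    return [
        (name, live, dead)
        for name, live, dead in rows
        if dead > max(live, 1) * min_dead_ratio
    ]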
Tools and Technologies Used
Django 5.0 with django-simple-history
PostgreSQL 15 (DigitalOcean Managed Database)
Celery for background task processing
Redis as Celery broker
pgAdmin 4 for database management
DigitalOcean infrastructure and monitoring
Python 3.12
Performance Impact
Before the Fix:
Database queries: 2-5 seconds average
Background sync tasks: 45-60 seconds
Admin page loads: 8-12 seconds
API response times: 1-3 seconds
CPU: Constant 99.9%
Database connections: Maxed out
After the Fix:
Database queries: 50-200ms average
Background sync tasks: 1-2 seconds
Admin page loads: 500ms-1s
API response times: 100-300ms
CPU: 5-10% average
Database connections: Healthy pool
10x performance improvement across the board!
Monitoring and Maintenance
We now run weekly checks to catch any regression early:
-- Check for duplicate accumulation
WITH ranked AS (
SELECT
COUNT(*) as total,
COUNT(DISTINCT (shipment_id, container_number, container_type, gate_out_date)) as unique_records
FROM historical_shipment_containers
)
SELECT
total,
unique_records,
(total - unique_records) as duplicates,
ROUND(100.0 * (total - unique_records) / total, 2) as duplicate_percentage
FROM ranked;
Healthy Result:
total: 13,812
unique_records: 13,802
duplicates: 10
duplicate_percentage: 0.07%
A small number of duplicates (< 1%) is normal due to legitimate updates. If this crosses 5%, we investigate.
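The same duplicate check drops neatly into a Celery periodic task so nobody has to remember to run it. A sketch follows; the task name, schedule wiring, and logging-based alert are assumptions, and the query mirrors the one above:

import logging

from celery import shared_task
from django.db import connection

logger = logging.getLogger(__name__)

DUPLICATE_ALERT_THRESHOLD = 5.0  # percent - matches the rule of thumb above

@shared_task
def check_container_history_duplicates():
    """Weekly check: warn if duplicate history records exceed the threshold."""
    with connection.cursor() as cursor:
        cursor.execute("""
            WITH counts AS (
                SELECT
                    COUNT(*) AS total,
                    COUNT(DISTINCT (shipment_id, container_number,
                                    container_type, gate_out_date)) AS unique_records
                FROM historical_shipment_containers
            )
            SELECT total, unique_records FROM counts
        """)
        total, unique_records = cursor.fetchone()
    duplicate_pct = 100.0 * (total - unique_records) / total if total else 0.0
    if duplicate_pct > DUPLICATE_ALERT_THRESHOLD:
        logger.warning("Duplicate history at %.2f%% - investigate the sync code", duplicate_pct)
    return duplicate_pct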
The Human Element
This incident happened on a Saturday morning. Our team:
Immediately put the platform in maintenance mode
Communicated transparently with stakeholders
Debugged collaboratively (huge props to the team!)
Implemented fixes incrementally
Tested thoroughly before reopening
Documented everything for future reference
Time to resolution: 14 hours (Saturday 7 AM → Saturday 9 PM)
Total downtime: 2 hours (maintenance mode during VACUUM operations)
We follow agile practices, which helped us:
Prioritize fixes (clean data first, prevent recurrence second)
Test in small increments
Roll back quickly if needed
Collaborate effectively under pressure
Conclusion
What started as a catastrophic Saturday morning crash turned into a masterclass in:
Database optimization
Django ORM performance tuning
PostgreSQL internals
Production debugging under pressure
Collaborative problem-solving
Key Takeaway: The most dangerous bugs are the ones that accumulate silently over months. Regular monitoring, performance testing, and database audits aren't optional - they're essential.
Our freight management platform is now faster, leaner, and ready to scale. And we learned valuable lessons that will inform our engineering practices for years to come.
If you're using django-simple-history (or any audit logging system), review your update patterns today. You might be sitting on a ticking time bomb!
About the Author: Software Engineer at Maalbardaar, building scalable freight management solutions. Passionate about database optimization, Django performance, and solving real-world logistics challenges.
Tech Stack: Django, PostgreSQL, Celery, Docker, DigitalOcean, React
Follow for more: Production war stories, Django tips, and database optimization techniques.
Have you faced similar database bloat issues? How did you solve them? Share your story in the comments!