A Weekend Crisis That Turned Into a Masterclass in Database Optimization
Keywords: Django Simple History, PostgreSQL optimization, database bloat, Django performance tuning, production debugging, ORM optimization, VACUUM FULL, historical records cleanup, freight management system
The Saturday Morning That Changed Everything
It was a regular Saturday morning when our production freight management platform suddenly crashed. No warnings. No obvious errors. Just... down.
Our DigitalOcean managed PostgreSQL database was struggling: CPU at 99.9%, queries timing out, and our logistics operations grinding to a halt. Shipments worth millions were in limbo, and we had no idea why.
This is the story of how we diagnosed and fixed one of the most insidious database problems in Django applications: 99.95% duplicate historical records causing massive database bloat.
The Initial Symptoms
Database Stats That Made Us Panic:
Database size: 10.2 GB (and growing)
CPU usage: Constant 99.9%
Load average: 7.37 (on a 2-core instance)
Query timeouts: Every few seconds
Application status: Completely down
We knew something was catastrophically wrong, but what?
The Investigation Begins
Step 1: Identifying the Culprit
Running a simple query to check table sizes revealed the smoking gun:
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
    pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) AS table_size,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename) - pg_relation_size(schemaname||'.'||tablename)) AS index_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
The Results Were Shocking:
Table | Total Size | Records |
---|---|---|
historical_shipment_containers | 3.7 GB | 20.7M |
historical_shipment_events | 2.2 GB | 11.2M |
historical_tracking_data | 1.7 GB | 6.1M |
historical_customs_records | 763 MB | 2.8M |
Total: 38 million historical records consuming around 8GB!
Our actual shipment data? About 1,500 shipments. Something was very, very wrong.
The Root Cause: Django Simple History Gone Wild
We use django-simple-history to track changes to shipment data - essential for freight operations where audit trails are critical. But our implementation had a fatal flaw.
What Went Wrong
Our system syncs shipment data from another service every 20 minutes via Celery background tasks. For each container, we were calling:
# The problematic code
container, created = ShipmentContainer.objects.update_or_create(
shipment=shipment,
container_number=container_number,
defaults={
'container_type': container_type,
'gate_out_date': gate_out_date,
'empty_return_date': empty_return_date,
}
)
The Hidden Problem: update_or_create() calls .save() EVERY TIME, even when the data hasn't changed!
With django-simple-history installed, every .save() creates a new historical record.
The Math:
150 containers being tracked
Synced every 20 minutes
72 syncs per day
10,800 duplicate history records PER DAY
Running for months...
Result: 6.8 million duplicate container history records!
The Failed First Attempt
Django Simple History provides a built-in cleanup command:
python manage.py clean_duplicate_history --auto
We tried it. And waited. And waited.
Removed 2432 historical records for ShipmentContainer
Removed 16741 historical records for ShipmentContainer
Removed 7877 historical records for ShipmentContainer
...
(1 hour 43 minutes later, still running)
Locally, this would take days. In production? Unacceptable.
The command processes records one-by-one, which is fine for thousands of duplicates. For millions? We needed a different approach.
The SQL Solution: PostgreSQL Window Functions to the Rescue
We needed something faster. Much faster. Enter: PostgreSQL window functions with batched deletion.
The Strategy
Instead of processing records one-by-one, we used SQL to identify and delete duplicates in batches:
-- Identify duplicates using window functions
WITH ranked_history AS (
SELECT
history_id,
ROW_NUMBER() OVER (
PARTITION BY shipment_id, container_number, container_type,
gate_out_date, empty_return_date
ORDER BY history_date DESC
) as rn
FROM historical_shipment_containers
)
DELETE FROM historical_shipment_containers
WHERE history_id IN (
SELECT history_id FROM ranked_history WHERE rn > 1
);
This query:
Groups identical records by all tracked fields
Numbers them chronologically (newest = 1)
Deletes everything except the newest record
Time to delete 6.8M records: ~2 hours (vs. days with the Django command)
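The single DELETE above is the core of the approach; in production we ran it in batches so that each transaction stayed short and no statement held locks for hours. Here is a minimal sketch of how such a batching loop can be driven from Django - the batch size, function name, and use of cursor.rowcount are illustrative assumptions, not our exact production script:

from django.db import connection, transaction

BATCH_SIZE = 50_000  # assumed batch size - tune for your instance

DEDUP_SQL = """
    WITH ranked_history AS (
        SELECT
            history_id,
            ROW_NUMBER() OVER (
                PARTITION BY shipment_id, container_number, container_type,
                             gate_out_date, empty_return_date
                ORDER BY history_date DESC
            ) AS rn
        FROM historical_shipment_containers
    )
    DELETE FROM historical_shipment_containers
    WHERE history_id IN (
        SELECT history_id FROM ranked_history WHERE rn > 1 LIMIT %s
    )
"""

def delete_duplicates_in_batches():
    """Run the windowed DELETE repeatedly until no duplicates remain."""
    total_deleted = 0
    while True:
        with transaction.atomic(), connection.cursor() as cursor:
            cursor.execute(DEDUP_SQL, [BATCH_SIZE])
            deleted = cursor.rowcount  # rows removed by this batch
        total_deleted += deleted
        if deleted == 0:
            break
    return total_deleted

Each pass deletes at most one batch of duplicates and commits; the loop ends when a pass removes nothing.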
How Window Functions Work
For those unfamiliar, ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) is like saying:
"Group records by these fields, then number each record within its group by date, keeping the newest as #1"
This lets us delete all records numbered 2, 3, 4, etc. (the duplicates) while keeping #1 (the most recent).
The Space Reclamation: VACUUM FULL
After deleting millions of rows, we discovered another gotcha: PostgreSQL doesn't automatically reclaim disk space!
The deleted rows were marked as "dead tuples" but still occupying disk space.
The Solution: VACUUM FULL
VACUUM FULL historical_tracking_data;
VACUUM FULL historical_shipment_containers;
VACUUM FULL historical_shipment_events;
⚠️ Warning: VACUUM FULL locks the table during operation. We put the application in maintenance mode for 15 minutes.
The Results:
Table | Before | After | Space Saved |
---|---|---|---|
Milestone | 1.7 GB | 13 MB | 99.94% |
Containers | 3.7 GB | 1.6 MB | 99.97% |
Events | 2.2 GB | 13 MB | 99.41% |
Total Saved: 7.6 GB
Preventing Future Bloat: The Real Fix
Cleaning the database was only half the battle. We needed to prevent this from happening again.
Problem 1: Unnecessary Saves in update_or_create()
Before:
# This saves EVERY TIME, even if nothing changed!
container, created = ShipmentContainer.objects.update_or_create(
shipment=shipment,
container_number=container_number,
defaults={
'container_type': container_type,
'gate_out_date': gate_out_date,
}
)
After:
def smart_update_container(shipment, container_data):
    """
    Only save if fields have actually changed.
    Prevents unnecessary history records.
    """
    try:
        container = ShipmentContainer.objects.get(
            shipment=shipment,
            container_number=container_data['number']
        )
    except ShipmentContainer.DoesNotExist:
        # Genuinely new container - creating it (and its first history record) is expected
        return ShipmentContainer.objects.create(
            shipment=shipment,
            container_number=container_data['number'],
            container_type=container_data['type'],
        )
    # Check if ANY field has actually changed
    needs_update = False
    if container.container_type != container_data['type']:
        container.container_type = container_data['type']
        needs_update = True
    # ONLY save if something actually changed
    if needs_update:
        container.save()
    return container
Impact: Reduced saves by 99%! API returns the same data 99% of the time, so only 1% of syncs actually need to save.
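The same idea generalizes beyond containers. Below is a hypothetical helper (not our exact production code) that compares a dict of incoming values against a model instance and saves only when something actually differs, using update_fields to keep the write small:

def update_if_changed(instance, new_values):
    """
    Set only the fields whose incoming values differ, then save once.
    Returns True if a save (and therefore a history record) happened.
    """
    changed_fields = []
    for field, new_value in new_values.items():
        if getattr(instance, field) != new_value:
            setattr(instance, field, new_value)
            changed_fields.append(field)
    if changed_fields:
        instance.save(update_fields=changed_fields)
        return True
    return False

# Example usage inside a sync task:
# update_if_changed(container, {
#     'container_type': container_data['type'],
#     'gate_out_date': container_data.get('gate_out_date'),
# })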
Problem 2: Type Mismatches Causing False Changes
We discovered the tracking API returns status codes as integers (50), but our database stores them as strings ("50").
# API returns: {"status_code": 50}
# Database has: "50"
# This comparison ALWAYS fails!
if tracking.status_code != api_data.get("status_code"): # 50 != "50" → True!
tracking.status_code = api_data.get("status_code")
tracking.save() # Unnecessary save every time!
Fix:
# Always convert to the same type before comparing
new_status = str(api_data.get("status_code")) if api_data.get("status_code") is not None else None
if tracking.status_code != new_status:
tracking.status_code = new_status
needs_update = True
Lesson: API integrations often return different types than your database stores. Always normalize types before comparison!
Problem 3: Timezone-Aware vs Timezone-Naive Datetime Comparison
Our date parsing function was returning timezone-naive datetimes, but PostgreSQL stores them as timezone-aware with Django's USE_TZ=True.
# API returns: "2025-08-22"
# Our parser created: datetime(2025, 8, 22, 0, 0, 0) # Naive!
# Database stores: datetime(2025, 8, 22, 0, 0, 0, tzinfo=UTC) # Aware!
# These are NEVER equal, triggering saves every sync!
This caused Django to log warnings:
RuntimeWarning: DateTimeField received a naive datetime while time zone support is active.
Fix:
from datetime import datetime
from dateutil import parser
from django.utils import timezone as tz

def parse_api_date(date_value):
    """Parse date from API and return timezone-aware datetime."""
    if isinstance(date_value, dict) and date_value.get("Date"):
        date_str = date_value["Date"]
    elif isinstance(date_value, str):
        date_str = date_value
    else:
        return None
    try:
        # Try ISO format first
        dt = parser.isoparse(date_str)
        if dt.tzinfo is None:
            dt = tz.make_aware(dt, tz.get_default_timezone())
        return dt
    except ValueError:
        # Fall back to simple date format
        try:
            date_obj = datetime.strptime(date_str, "%Y-%m-%d")
            # Convert to timezone-aware datetime
            return tz.make_aware(date_obj, tz.get_default_timezone())
        except ValueError:
            return None
Impact: Eliminated all timezone-related false changes and warnings!
Problem 4: Latitude/Longitude Causing Perpetual History Bloat
For shipment tracking, we store GPS coordinates showing where vessels are. These coordinates change constantly as ships move across the ocean - sometimes every few minutes!
We were tracking every single position change in history, which was pointless. We don't need an audit trail of "vessel moved 0.5 nautical miles" thousands of times.
Solution: Exclude frequently-changing, non-critical fields from history tracking.
class ShipmentEvent(models.Model):
# ... fields ...
latitude = models.DecimalField(max_digits=30, decimal_places=15, ...)
longitude = models.DecimalField(max_digits=30, decimal_places=15, ...)
# Exclude lat/long from history - they change too frequently!
# We still store current position, just don't track every change
history = HistoricalRecords(excluded_fields=['latitude', 'longitude'])
Impact: Prevents thousands of unnecessary history records as vessels sail!
The Final Results
Before
Database size: 10.2 GB
Historical records: 28 million
Duplicate records: 24.2 million (99.95%)
CPU usage: 99.9% constant
Application status: Down
Sync performance: Timing out
Query response time: 2-5 seconds
After
Database size: 1.6 GB (saved 8.6 GB)
Historical records: 1.8 million
Duplicate records: 0
CPU usage: <10% average
Application status: Stable
Sync performance: Sub-second
Query response time: 50-200ms
10x performance improvement across the board!
Key Lessons Learned
1. Monitor Your Historical Tables
Historical tables can silently grow to dwarf your actual data. Set up monitoring:
-- Weekly table size check
SELECT
    relname AS tablename,
    pg_size_pretty(pg_total_relation_size(relid)) AS size,
    n_live_tup AS approx_row_count
FROM pg_stat_user_tables
WHERE schemaname = 'public' AND relname LIKE '%historical%'
ORDER BY pg_total_relation_size(relid) DESC;
Add this to your weekly ops checklist or set up automated alerts when historical tables exceed expected sizes.
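One way to automate that alerting is a small periodic task that runs the size check and warns when any historical table crosses a threshold. A sketch, assuming Django's default database connection; the 1 GB threshold and plain logging (instead of a real alerting channel) are placeholders:

import logging

from django.db import connection

logger = logging.getLogger(__name__)

SIZE_THRESHOLD_BYTES = 1 * 1024 ** 3  # assumed 1 GB alert threshold

def check_historical_table_sizes():
    """Warn about any historical table larger than the threshold."""
    with connection.cursor() as cursor:
        cursor.execute(
            """
            SELECT relname, pg_total_relation_size(relid)
            FROM pg_stat_user_tables
            WHERE schemaname = 'public' AND relname LIKE %s
            """,
            ["%historical%"],
        )
        for table_name, size_bytes in cursor.fetchall():
            if size_bytes > SIZE_THRESHOLD_BYTES:
                logger.warning(
                    "Historical table %s is %.1f GB - check for duplicate bloat",
                    table_name, size_bytes / 1024 ** 3,
                )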
2. Never Blindly Use update_or_create() with History Tracking
Always check if fields have actually changed before saving:
# Bad - saves every time
obj, created = Model.objects.update_or_create(pk=1, defaults={'field': value})
# Good - only saves when changed
obj = Model.objects.get(pk=1)
if obj.field != new_value:
obj.field = new_value
obj.save()
3. Type Consistency Matters in API Integrations
API integrations often return different types than your database stores. Always normalize:
# Convert types before comparison
api_value = str(api_response.get("field")) # Ensure string
db_value = str(obj.field) # Ensure string
if api_value != db_value:
obj.field = api_value
obj.save()
4. Timezone Awareness is Critical
With USE_TZ=True in Django settings, always return timezone-aware datetimes from parsers:
from django.utils import timezone as tz
dt = datetime.strptime(date_str, "%Y-%m-%d")
dt = tz.make_aware(dt, tz.get_default_timezone()) # Make aware!
5. Not Everything Needs History
Exclude frequently-changing non-critical fields from history:
# Good candidates for exclusion:
# - updated_at timestamps
# - last_sync timestamps
# - GPS coordinates
# - view counts
# - Any high-frequency changing data
history = HistoricalRecords(
excluded_fields=['updated_at', 'last_sync', 'latitude', 'longitude']
)
6. PostgreSQL Window Functions Are Powerful
For bulk operations on millions of rows, SQL outperforms ORM by orders of magnitude.
Django ORM is great for application logic, but for data cleanup, bulk updates, and analytics, raw SQL with window functions is often 100x+ faster.
7. VACUUM FULL After Large Deletions
Don't forget to reclaim disk space after deleting millions of rows!
-- Regular VACUUM (non-blocking, safe for production)
VACUUM ANALYZE table_name;
-- VACUUM FULL (blocking, use in maintenance window)
VACUUM FULL table_name;
Regular VACUUM is safe to run anytime. VACUUM FULL requires downtime but reclaims more space.
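A quick way to judge whether a table needs vacuuming at all is to compare its dead and live tuple counts in pg_stat_user_tables. A small sketch using Django's database connection; the 20% ratio is just an assumed rule of thumb:

from django.db import connection

def tables_needing_vacuum(min_dead_ratio=0.2):
    """Return (table, live, dead) tuples where dead tuples exceed the ratio."""
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT relname, n_live_tup, n_dead_tup
            FROM pg_stat_user_tables
            WHERE n_dead_tup > 0
        """)
        rows = cursor.fetchall()
    return [
        (name, live, dead)
        for name, live, dead in rows
        if dead > max(live, 1) * min_dead_ratio
    ]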
Tools and Technologies Used
Django 5.0 with django-simple-history
PostgreSQL 15 (DigitalOcean Managed Database)
Celery for background task processing
Redis as Celery broker
pgAdmin 4 for database management
DigitalOcean infrastructure and monitoring
Python 3.12
Performance Impact
Before the Fix:
Database queries: 2-5 seconds average
Background sync tasks: 45-60 seconds
Admin page loads: 8-12 seconds
API response times: 1-3 seconds
CPU: Constant 99.9%
Database connections: Maxed out
After the Fix:
Database queries: 50-200ms average
Background sync tasks: 1-2 seconds
Admin page loads: 500ms-1s
API response times: 100-300ms
CPU: 5-10% average
Database connections: Healthy pool
10x performance improvement across the board!
Monitoring and Maintenance
We now run weekly checks to catch any regression early:
-- Check for duplicate accumulation
WITH ranked AS (
SELECT
COUNT(*) as total,
COUNT(DISTINCT (shipment_id, container_number, container_type, gate_out_date)) as unique_records
FROM historical_shipment_containers
)
SELECT
total,
unique_records,
(total - unique_records) as duplicates,
ROUND(100.0 * (total - unique_records) / total, 2) as duplicate_percentage
FROM ranked;
Healthy Result:
total: 13,812
unique_records: 13,802
duplicates: 10
duplicate_percentage: 0.07%
A small number of duplicates (< 1%) is normal due to legitimate updates. If this crosses 5%, we investigate.
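The same duplicate check drops neatly into a Celery periodic task so nobody has to remember to run it. A sketch follows; the task name, schedule wiring, and logging-based alert are assumptions, and the query mirrors the one above:

import logging

from celery import shared_task
from django.db import connection

logger = logging.getLogger(__name__)

DUPLICATE_ALERT_THRESHOLD = 5.0  # percent - matches the rule of thumb above

@shared_task
def check_container_history_duplicates():
    """Weekly check: warn if duplicate history records exceed the threshold."""
    with connection.cursor() as cursor:
        cursor.execute("""
            WITH counts AS (
                SELECT
                    COUNT(*) AS total,
                    COUNT(DISTINCT (shipment_id, container_number,
                                    container_type, gate_out_date)) AS unique_records
                FROM historical_shipment_containers
            )
            SELECT total, unique_records FROM counts
        """)
        total, unique_records = cursor.fetchone()
    duplicate_pct = 100.0 * (total - unique_records) / total if total else 0.0
    if duplicate_pct > DUPLICATE_ALERT_THRESHOLD:
        logger.warning("Duplicate history at %.2f%% - investigate the sync code", duplicate_pct)
    return duplicate_pct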
The Human Element
This incident happened on a Saturday morning. Our team:
Immediately put the platform in maintenance mode
Communicated transparently with stakeholders
Debugged collaboratively (huge props to the team!)
Implemented fixes incrementally
Tested thoroughly before reopening
Documented everything for future reference
Time to resolution: 14 hours (Saturday 7 AM → Saturday 9 PM)
Total downtime: 2 hours (maintenance mode during VACUUM operations)
We follow agile practices, which helped us:
Prioritize fixes (clean data first, prevent recurrence second)
Test in small increments
Roll back quickly if needed
Collaborate effectively under pressure
Conclusion
What started as a catastrophic Saturday morning crash turned into a masterclass in:
Database optimization
Django ORM performance tuning
PostgreSQL internals
Production debugging under pressure
Collaborative problem-solving
Key Takeaway: The most dangerous bugs are the ones that accumulate silently over months. Regular monitoring, performance testing, and database audits aren't optional - they're essential.
Our freight management platform is now faster, leaner, and ready to scale. And we learned valuable lessons that will inform our engineering practices for years to come.
If you're using django-simple-history (or any audit logging system), review your update patterns today. You might be sitting on a ticking time bomb!
About the Author: Software Engineer at Maalbardaar, building scalable freight management solutions. Passionate about database optimization, Django performance, and solving real-world logistics challenges.
Tech Stack: Django, PostgreSQL, Celery, Docker, DigitalOcean, React
Follow for more: Production war stories, Django tips, and database optimization techniques.
Have you faced similar database bloat issues? How did you solve them? Share your story in the comments!