Mohammad Ismail Mirza

The Silent Killer: How One Django Signal Can Crash Your AI Support Platform

You've been coding for hours.
The feature is elegant.
Tests pass.
Code review looks great.
You hit merge.

Forty-five minutes later, your Slack lights up. Not a polite ping—an explosion.

🚨 Red alerts.
😤 Customer complaints.
😰 Your CEO asking, "What's happening?"

Your AI support agents—the ones customers love—are frozen.

Conversations stuck.
ChromaDB timing out.
Database maxed out.

Everything you built is collapsing in real time.

By the end of this post, you'll trace it all back to three innocent lines of code written on a quiet Tuesday morning.


TL;DR — The One-Line Disaster

Calling .save() inside a Django signal caused:

  • Recursive writes
  • Cascading related signals
  • An overwhelmed database
  • Blocked Celery workers
  • A full system meltdown under load

The fix:

Signals should never modify models. Use them only to queue async work after commit.


The Setup: Everything Looked Perfect

We built a clean pipeline:

  • Users send support questions
  • LangGraph agents (OpenAI-powered) respond
  • Messages get stored
  • Embeddings added to ChromaDB for semantic search

Our models were straightforward:

class SupportConversation(models.Model):
    customer_id = models.CharField(max_length=100)
    agent_name = models.CharField(max_length=50)
    status = models.CharField(max_length=20, default='active')
    updated_at = models.DateTimeField(auto_now=True)

class ConversationMessage(models.Model):
    conversation = models.ForeignKey(SupportConversation, on_delete=models.CASCADE)
    role = models.CharField(max_length=20)  # 'user' or 'assistant'
    content = models.TextField()
    embedding_id = models.CharField(max_length=255, null=True)

Then one Tuesday morning, a developer added this:

@receiver(post_save, sender=ConversationMessage)
def handle_message_saved(sender, instance, created, **kwargs):
    if created and instance.role == 'assistant':
        generate_embeddings_task.delay(instance.id)
        instance.conversation.save()  # ← The ticking time bomb

It seemed harmless.
Tests passed.
Staging worked.
10 users? No problem.

And then Friday happened.


The Meltdown: How It All Fell Apart

11:35 AM — Marketing launches a flash promo.

"Try our AI support agents—free for 24 hours."

Within 60 seconds, 500 fresh chat sessions flood in.

What happens next:

A user sends a message →
AI responds →
ConversationMessage saves →
Signal fires →
Celery task queued →
conversation.save() triggers →
Another signal fires →
Database write →
Repeat… and repeat… and repeat.

Each conversation generates multiple unexpected writes.
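The multiplication is easy to see in a toy, Django-free simulation of the dispatch chain. Every name below is illustrative, not a real Django API — it just models "each save synchronously fires a handler, and the handler saves again":

```python
# Django-free toy simulation of the dispatch chain above.
# None of these functions are real Django APIs.
writes = 0

def save_message():
    global writes
    writes += 1            # INSERT for the ConversationMessage
    on_message_saved()     # post_save fires synchronously after save()

def on_message_saved():
    save_conversation()    # the nested .save() inside the handler

def save_conversation():
    global writes
    writes += 1            # UPDATE for the SupportConversation
    on_conversation_saved()  # its post_save fires too

def on_conversation_saved():
    # Any handler here that writes again would extend the chain further.
    pass

for _ in range(500):       # the flash-promo burst
    save_message()

print(writes)  # 1000 writes for 500 messages -- and that's the best case
```

With only one nested save, the write count doubles; every additional handler that writes multiplies it again, which is exactly what the pool exhaustion timeline shows.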

The chain reaction unfolds:

11:36:00 — DB connection pool: 48/50 in use
11:36:10 — Celery workers blocking, waiting for DB
11:36:20 — ChromaDB calls backing up behind DB queue
11:36:30 — 5,000 tasks in queue
11:36:50 — 25,000 tasks queued
11:37:00 — Everything is stuck

Messages stop saving.
Requests time out.
ChromaDB errors.

And then Slack explodes:

@oncall: "AI support is frozen."

@product: "Customers can't get replies."

@infra: "Why are Celery workers dead?"

@devops: "DB CPU at 100%, restarting now..."

The platform collapses.

Recovery time: 4+ hours.

The root cause? Nested signals triggering recursive .save() calls.


Root Cause: The Recursive Death Spiral

ConversationMessage.save()
    ↓
post_save signal fires
    ↓
instance.conversation.save()  ← nested save
    ↓
post_save (SupportConversation) fires
    ↓
Model writes again
    ↓
Triggers more signals
    ↓
More writes under load = exponential explosion

Under high concurrency, this becomes an accidental DDoS against your own database.


The Four Fixes That Saved Us

1. Always Guard with created

Ensure the handler fires only once:

from django.db.models.signals import post_save
from django.dispatch import receiver
from django.db import transaction

@receiver(post_save, sender=ConversationMessage)
def queue_message_processing(sender, instance, created, **kwargs):
    # ✅ Only run on NEW messages
    if not created:
        return

    if instance.role == 'assistant':
        # ✅ Queue AFTER transaction commits (no nested saves)
        transaction.on_commit(
            lambda: generate_embeddings_task.delay(instance.id)
        )

Why it works:

  • No repeated firing
  • No nested writes
  • The async task runs only after the DB transaction commits
  • Clean separation of concerns
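The ordering guarantee is the whole point of transaction.on_commit. Here's a Django-free sketch of that contract — on_commit and commit below are stand-ins for the real django.db.transaction machinery, not its API:

```python
# Django-free sketch of the on_commit contract:
# callbacks registered during the transaction run only after commit.
pending = []
log = []

def on_commit(callback):
    pending.append(callback)   # deferred, not executed yet

def save_assistant_message():
    log.append("INSERT message")
    # the signal handler queues work instead of writing:
    on_commit(lambda: log.append("enqueue generate_embeddings_task"))

def commit():
    for callback in pending:   # fires only once the writes are durable
        callback()
    pending.clear()

save_assistant_message()
assert log == ["INSERT message"]   # nothing enqueued mid-transaction
commit()
print(log)  # ['INSERT message', 'enqueue generate_embeddings_task']
```

If the transaction rolls back instead of committing, the pending callbacks are simply discarded — so Celery never processes a message that was never actually written.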

2. Let Celery Update Models Safely

Have the task write back through a QuerySet .update(), which never dispatches post_save:

from celery import shared_task

@shared_task
def generate_embeddings_task(message_id):
    message = ConversationMessage.objects.get(id=message_id)

    # Call OpenAI API
    embedding = openai.embeddings.create(
        model="text-embedding-3-small",
        input=message.content
    ).data[0].embedding

    # Store in ChromaDB
    chroma_client.add(
        ids=[f"msg_{message_id}"],
        embeddings=[embedding]
    )

    # ✅ CRITICAL: queryset .update() bypasses post_save signals entirely
    ConversationMessage.objects.filter(id=message_id).update(
        embedding_id=f"msg_{message_id}"
    )

As defense in depth, make the handler ignore this write-back too. This only matters if something later switches to instance.save(update_fields=...) — a queryset .update() never reaches the handler at all:

@receiver(post_save, sender=ConversationMessage)
def queue_message_processing(sender, instance, created, update_fields=None, **kwargs):
    # ✅ If only embedding_id changed, skip entirely
    if update_fields and set(update_fields) == {'embedding_id'}:
        return

    if not created:
        return

    # ... rest of handler

Why it works:

  • Celery can safely modify models
  • QuerySet .update() bypasses post_save signals entirely
  • No recursive loops even under heavy load

3. Use QuerySet Updates for Parent Stats

DON'T do this:

conversation.message_count = 5
conversation.save()  # ❌ triggers another signal!

DO this instead:

SupportConversation.objects.filter(id=conversation.id).update(
    message_count=conversation.conversationmessage_set.count()
)

Why:

  • QuerySet operations bypass signals entirely
  • No cascade
  • Atomic database operation
  • Noticeably faster: a single UPDATE, no model instantiation or signal dispatch
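The difference between the two paths, again as a Django-free sketch (signal dispatch is modeled by a plain synchronous function call, which is how Django actually fires post_save):

```python
# Django-free sketch: instance.save() dispatches post_save,
# queryset .update() issues one UPDATE with no dispatch at all.
signal_fires = 0

def post_save_handler():
    global signal_fires
    signal_fires += 1

def instance_save():
    # Model.save(): write the row, then dispatch post_save synchronously
    post_save_handler()

def queryset_update():
    # QuerySet.update(): one UPDATE statement, no signal dispatch
    pass

instance_save()
queryset_update()
print(signal_fires)  # 1 -- only the instance save fired the signal
```

This is why reaching for .update() inside handlers and tasks breaks the cascade at the root: there is simply no second dispatch to loop on.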

4. Batch Updates Instead of Per-Message

Instead of firing stats updates on every message, batch them:

from celery import shared_task
from django.db.models import Count

@shared_task
def batch_update_conversation_stats():
    """
    Run every 30 seconds via Celery Beat.
    Update all active conversations at once.
    """
    conversations = SupportConversation.objects.filter(
        status='active'
    ).annotate(msg_count=Count('conversationmessage'))

    for conv in conversations:
        SupportConversation.objects.filter(id=conv.id).update(
            message_count=conv.msg_count
        )

# In celery.py, schedule with:
# 'batch-update-stats': {
#     'task': 'myapp.tasks.batch_update_conversation_stats',
#     'schedule': 30.0,  # every 30 seconds
# }

The math:

  • 500 concurrent messages = 500 individual updates (❌ cascades)
  • 500 concurrent messages = 1 batch update every 30 seconds (✅ efficient)

Testing for Signal Cascades (Critical!)

You must test this:

from django.test import TestCase
from django.test.utils import CaptureQueriesContext
from django.db import connection

class ConversationSignalTests(TestCase):

    def test_message_creation_is_silent(self):
        """
        Verify: Creating ONE message = ONE database write.
        If this fails, you have a signal loop.
        """
        conv = SupportConversation.objects.create(customer_id="test_user")

        with CaptureQueriesContext(connection) as ctx:
            ConversationMessage.objects.create(
                conversation=conv,
                role='user',
                content="Help!"
            )

        writes = [q for q in ctx.captured_queries 
                  if q['sql'].startswith(('INSERT', 'UPDATE'))]

        self.assertEqual(len(writes), 1, 
            f"Expected 1 write, got {len(writes)}. Signal loop detected!")

    def test_load_no_cascade(self):
        """
        Simulate production load: 100 concurrent messages.
        Verify no cascade explosion.
        """
        convs = [
            SupportConversation.objects.create(customer_id=f"user_{i}")
            for i in range(100)
        ]

        with CaptureQueriesContext(connection) as ctx:
            for conv in convs:
                ConversationMessage.objects.create(
                    conversation=conv,
                    role='user',
                    content="Question"
                )

        writes = [q for q in ctx.captured_queries 
                  if q['sql'].startswith(('INSERT', 'UPDATE'))]

        self.assertLess(len(writes), 150,
            f"Too many writes: {len(writes)}. Cascade detected!")

Best Practices Checklist

✔ Always Do:

  • ✅ Use the created flag guard
  • ✅ Use transaction.on_commit() for async work
  • ✅ Keep signals pure—never modify models
  • ✅ Use queryset.update(), never .save() in handlers
  • ✅ Use update_fields when Celery updates models
  • ✅ Batch expensive tasks (stats, aggregations)
  • ✅ Stress test with concurrent load
  • ✅ Monitor signal depth in production
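"Monitor signal depth" can start as a simple in-process guard. The decorator below is a hypothetical helper, not a Django or library API — a per-thread depth counter that refuses to re-enter a handler, so a cascade dies on its first nested dispatch:

```python
import threading

# Hypothetical recursion guard for signal handlers (not a Django API):
# a per-thread depth counter that suppresses nested invocations.
_depth = threading.local()

def no_reentry(handler):
    def wrapper(*args, **kwargs):
        depth = getattr(_depth, "value", 0)
        if depth >= 1:
            return None        # already inside a handler: refuse to nest
        _depth.value = depth + 1
        try:
            return handler(*args, **kwargs)
        finally:
            _depth.value = depth
    return wrapper

calls = []

@no_reentry
def handler():
    calls.append(1)
    handler()                  # a nested dispatch is silently suppressed

handler()
print(len(calls))  # 1 -- the cascade never starts
```

In production you'd log or alert instead of silently returning, so the guard doubles as the monitoring the checklist asks for.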

✘ Never Do:

  • ❌ Call .save() inside a signal handler
  • ❌ Trigger parent model saves from child signals
  • ❌ Assume "it works locally" means "it scales"
  • ❌ Trust signals under load without testing
  • ❌ Create circular signal dependencies

Final Lesson: The Signal Rule

Django signals are powerful.
They're also silent footguns.

One innocent .save() buried in a signal handler can:

  • 💥 Hammer your database
  • 🧊 Freeze your async workers
  • 📉 Cascade into platform-wide downtime
  • 😞 Ruin your Friday

The rule is simple:

Signals should queue async work—not change database state.
Celery should handle all updates.

Follow this, and you'll never meet the monster we met at 11:36 AM on a Friday.

Signal fires
  ↓
Queue Celery task
  ↓
Commit transaction
  ↓
Celery task runs safely
  ↓
Updates database (signal-free)
  ↓
✅ No cascade

Last updated: December 1, 2025

Have your own signal horror story? Drop it in the comments—you're not alone. 🚀
