Mohammad Ismail Mirza

The Silent Killer: How One Django Signal Can Crash Your AI Support Platform

You've been coding for hours.
The feature is elegant.
Tests pass.
Code review looks great.
You hit merge.

Forty-five minutes later, your Slack lights up. Not a polite ping—an explosion.

🚨 Red alerts.
😤 Customer complaints.
😰 Your CEO asking, "What's happening?"

Your AI support agents—the ones customers love—are frozen.

Conversations stuck.
ChromaDB timing out.
Database maxed out.

Everything you built is collapsing in real time.

By the end of this post, you'll trace it all back to three innocent lines of code written on a quiet Tuesday morning.


TL;DR — The One-Line Disaster

Calling .save() inside a Django signal caused:

  • Recursive writes
  • Cascading related signals
  • An overwhelmed database
  • Blocked Celery workers
  • A full system meltdown under load

The fix:

Signals should never modify models. Use them only to queue async work after commit.


The Setup: Everything Looked Perfect

We built a clean pipeline:

  • Users send support questions
  • LangGraph agents (OpenAI-powered) respond
  • Messages get stored
  • Embeddings added to ChromaDB for semantic search

Our models were straightforward:

class SupportConversation(models.Model):
    customer_id = models.CharField(max_length=100)
    agent_name = models.CharField(max_length=50)
    status = models.CharField(max_length=20, default='active')
    updated_at = models.DateTimeField(auto_now=True)

class ConversationMessage(models.Model):
    conversation = models.ForeignKey(SupportConversation, on_delete=models.CASCADE)
    role = models.CharField(max_length=20)  # 'user' or 'assistant'
    content = models.TextField()
    embedding_id = models.CharField(max_length=255, null=True)

Then one Tuesday morning, a developer added this:

@receiver(post_save, sender=ConversationMessage)
def handle_message_saved(sender, instance, created, **kwargs):
    if created and instance.role == 'assistant':
        generate_embeddings_task.delay(instance.id)
        instance.conversation.save()  # ← The ticking time bomb

It seemed harmless.
Tests passed.
Staging worked.
10 users? No problem.

And then Friday happened.


The Meltdown: How It All Fell Apart

11:35 AM — Marketing launches a flash promo.

"Try our AI support agents—free for 24 hours."

Within 60 seconds, 500 fresh chat sessions flood in.

What happens next:

A user sends a message →
AI responds →
ConversationMessage saves →
Signal fires →
Celery task queued →
conversation.save() triggers →
Another signal fires →
Database write →
Repeat… and repeat… and repeat.

Each conversation generates multiple unexpected writes.
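The multiplication is easy to see in a toy, Django-free simulation of the dispatch chain. Every name below is illustrative, not a real Django API — it just models "each save synchronously fires a handler, and the handler saves again":

```python
# Django-free toy simulation of the dispatch chain above.
# None of these functions are real Django APIs.
writes = 0

def save_message():
    global writes
    writes += 1            # INSERT for the ConversationMessage
    on_message_saved()     # post_save fires synchronously after save()

def on_message_saved():
    save_conversation()    # the nested .save() inside the handler

def save_conversation():
    global writes
    writes += 1            # UPDATE for the SupportConversation
    on_conversation_saved()  # its post_save fires too

def on_conversation_saved():
    # Any handler here that writes again would extend the chain further.
    pass

for _ in range(500):       # the flash-promo burst
    save_message()

print(writes)  # 1000 writes for 500 messages -- and that's the best case
```

With only one nested save, the write count doubles; every additional handler that writes multiplies it again, which is exactly what the pool exhaustion timeline shows.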

The chain reaction unfolds:

11:36:00 — DB connection pool: 48/50 in use
11:36:10 — Celery workers blocking, waiting for DB
11:36:20 — ChromaDB calls backing up behind DB queue
11:36:30 — 5,000 tasks in queue
11:36:50 — 25,000 tasks queued
11:37:00 — Everything is stuck

Messages stop saving.
Requests time out.
ChromaDB errors.

And then Slack explodes:

@oncall: "AI support is frozen."

@product: "Customers can't get replies."

@infra: "Why are Celery workers dead?"

@devops: "DB CPU at 100%, restarting now..."

The platform collapses.

Recovery time: 4+ hours.

The root cause? Nested signals triggering recursive .save() calls.


Root Cause: The Recursive Death Spiral

ConversationMessage.save()
    ↓
post_save signal fires
    ↓
instance.conversation.save()  ← nested save
    ↓
post_save (SupportConversation) fires
    ↓
Model writes again
    ↓
Triggers more signals
    ↓
More writes under load = exponential explosion

Under high concurrency, this becomes an accidental DDoS against your own database.


The Four Fixes That Saved Us

1. Always Guard with created

Ensure the handler fires only once:

from django.db.models.signals import post_save
from django.dispatch import receiver
from django.db import transaction

@receiver(post_save, sender=ConversationMessage)
def queue_message_processing(sender, instance, created, **kwargs):
    # ✅ Only run on NEW messages
    if not created:
        return

    if instance.role == 'assistant':
        # ✅ Queue AFTER transaction commits (no nested saves)
        transaction.on_commit(
            lambda: generate_embeddings_task.delay(instance.id)
        )

Why it works:

  • No repeated firing
  • No nested writes
  • The async task runs only after the DB transaction commits
  • Clean separation of concerns
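The ordering guarantee is the whole point of transaction.on_commit. Here's a Django-free sketch of that contract — on_commit and commit below are stand-ins for the real django.db.transaction machinery, not its API:

```python
# Django-free sketch of the on_commit contract:
# callbacks registered during the transaction run only after commit.
pending = []
log = []

def on_commit(callback):
    pending.append(callback)   # deferred, not executed yet

def save_assistant_message():
    log.append("INSERT message")
    # the signal handler queues work instead of writing:
    on_commit(lambda: log.append("enqueue generate_embeddings_task"))

def commit():
    for callback in pending:   # fires only once the writes are durable
        callback()
    pending.clear()

save_assistant_message()
assert log == ["INSERT message"]   # nothing enqueued mid-transaction
commit()
print(log)  # ['INSERT message', 'enqueue generate_embeddings_task']
```

If the transaction rolls back instead of committing, the pending callbacks are simply discarded — so Celery never processes a message that was never actually written.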

2. Let Celery Update Models Safely

Have the task write back through a QuerySet .update(), which never dispatches post_save:

from celery import shared_task

@shared_task
def generate_embeddings_task(message_id):
    message = ConversationMessage.objects.get(id=message_id)

    # Call OpenAI API
    embedding = openai.embeddings.create(
        model="text-embedding-3-small",
        input=message.content
    ).data[0].embedding

    # Store in ChromaDB
    chroma_client.add(
        ids=[f"msg_{message_id}"],
        embeddings=[embedding]
    )

    # ✅ CRITICAL: queryset .update() bypasses post_save signals entirely
    ConversationMessage.objects.filter(id=message_id).update(
        embedding_id=f"msg_{message_id}"
    )

As defense in depth, make the handler ignore this write-back too. This only matters if something later switches to instance.save(update_fields=...) — a queryset .update() never reaches the handler at all:

@receiver(post_save, sender=ConversationMessage)
def queue_message_processing(sender, instance, created, update_fields=None, **kwargs):
    # ✅ If only embedding_id changed, skip entirely
    if update_fields and set(update_fields) == {'embedding_id'}:
        return

    if not created:
        return

    # ... rest of handler

Why it works:

  • Celery can safely modify models
  • QuerySet .update() bypasses post_save signals entirely
  • No recursive loops even under heavy load

3. Use QuerySet Updates for Parent Stats

DON'T do this:

conversation.message_count = 5
conversation.save()  # ❌ triggers another signal!

DO this instead:

SupportConversation.objects.filter(id=conversation.id).update(
    message_count=conversation.conversationmessage_set.count()
)

Why:

  • QuerySet operations bypass signals entirely
  • No cascade
  • Atomic database operation
  • Noticeably faster: a single UPDATE, no model instantiation or signal dispatch
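The difference between the two paths, again as a Django-free sketch (signal dispatch is modeled by a plain synchronous function call, which is how Django actually fires post_save):

```python
# Django-free sketch: instance.save() dispatches post_save,
# queryset .update() issues one UPDATE with no dispatch at all.
signal_fires = 0

def post_save_handler():
    global signal_fires
    signal_fires += 1

def instance_save():
    # Model.save(): write the row, then dispatch post_save synchronously
    post_save_handler()

def queryset_update():
    # QuerySet.update(): one UPDATE statement, no signal dispatch
    pass

instance_save()
queryset_update()
print(signal_fires)  # 1 -- only the instance save fired the signal
```

This is why reaching for .update() inside handlers and tasks breaks the cascade at the root: there is simply no second dispatch to loop on.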

4. Batch Updates Instead of Per-Message

Instead of firing stats updates on every message, batch them:

from celery import shared_task
from django.db.models import Count

@shared_task
def batch_update_conversation_stats():
    """
    Run every 30 seconds via Celery Beat.
    Update all active conversations at once.
    """
    conversations = SupportConversation.objects.filter(
        status='active'
    ).annotate(msg_count=Count('conversationmessage'))

    for conv in conversations:
        SupportConversation.objects.filter(id=conv.id).update(
            message_count=conv.msg_count
        )

# In celery.py, schedule with:
# 'batch-update-stats': {
#     'task': 'myapp.tasks.batch_update_conversation_stats',
#     'schedule': 30.0,  # every 30 seconds
# }

The math:

  • 500 concurrent messages = 500 individual updates (❌ cascades)
  • 500 concurrent messages = 1 batch update every 30 seconds (✅ efficient)

Testing for Signal Cascades (Critical!)

You must test this:

from django.test import TestCase
from django.test.utils import CaptureQueriesContext
from django.db import connection

class ConversationSignalTests(TestCase):

    def test_message_creation_is_silent(self):
        """
        Verify: Creating ONE message = ONE database write.
        If this fails, you have a signal loop.
        """
        conv = SupportConversation.objects.create(customer_id="test_user")

        with CaptureQueriesContext(connection) as ctx:
            ConversationMessage.objects.create(
                conversation=conv,
                role='user',
                content="Help!"
            )

        writes = [q for q in ctx.captured_queries 
                  if q['sql'].startswith(('INSERT', 'UPDATE'))]

        self.assertEqual(len(writes), 1, 
            f"Expected 1 write, got {len(writes)}. Signal loop detected!")

    def test_load_no_cascade(self):
        """
        Simulate production load: 100 concurrent messages.
        Verify no cascade explosion.
        """
        convs = [
            SupportConversation.objects.create(customer_id=f"user_{i}")
            for i in range(100)
        ]

        with CaptureQueriesContext(connection) as ctx:
            for conv in convs:
                ConversationMessage.objects.create(
                    conversation=conv,
                    role='user',
                    content="Question"
                )

        writes = [q for q in ctx.captured_queries 
                  if q['sql'].startswith(('INSERT', 'UPDATE'))]

        self.assertLess(len(writes), 150,
            f"Too many writes: {len(writes)}. Cascade detected!")

Best Practices Checklist

✔ Always Do:

  • ✅ Use the created flag guard
  • ✅ Use transaction.on_commit() for async work
  • ✅ Keep signals pure—never modify models
  • ✅ Use queryset.update(), never .save() in handlers
  • ✅ Use update_fields when Celery updates models
  • ✅ Batch expensive tasks (stats, aggregations)
  • ✅ Stress test with concurrent load
  • ✅ Monitor signal depth in production
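"Monitor signal depth" can start as a simple in-process guard. The decorator below is a hypothetical helper, not a Django or library API — a per-thread depth counter that refuses to re-enter a handler, so a cascade dies on its first nested dispatch:

```python
import threading

# Hypothetical recursion guard for signal handlers (not a Django API):
# a per-thread depth counter that suppresses nested invocations.
_depth = threading.local()

def no_reentry(handler):
    def wrapper(*args, **kwargs):
        depth = getattr(_depth, "value", 0)
        if depth >= 1:
            return None        # already inside a handler: refuse to nest
        _depth.value = depth + 1
        try:
            return handler(*args, **kwargs)
        finally:
            _depth.value = depth
    return wrapper

calls = []

@no_reentry
def handler():
    calls.append(1)
    handler()                  # a nested dispatch is silently suppressed

handler()
print(len(calls))  # 1 -- the cascade never starts
```

In production you'd log or alert instead of silently returning, so the guard doubles as the monitoring the checklist asks for.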

✘ Never Do:

  • ❌ Call .save() inside a signal handler
  • ❌ Trigger parent model saves from child signals
  • ❌ Assume "it works locally" means "it scales"
  • ❌ Trust signals under load without testing
  • ❌ Create circular signal dependencies

Final Lesson: The Signal Rule

Django signals are powerful.
They're also silent footguns.

One innocent .save() buried in a signal handler can:

  • 💥 Hammer your database
  • 🧊 Freeze your async workers
  • 📉 Cascade into platform-wide downtime
  • 😞 Ruin your Friday

The rule is simple:

Signals should queue async work—not change database state.
Celery should handle all updates.

Follow this, and you'll never meet the monster we met at 11:36 AM on a Friday.

Signal fires
  ↓
Queue Celery task
  ↓
Commit transaction
  ↓
Celery task runs safely
  ↓
Updates database (signal-free)
  ↓
✅ No cascade

Last updated: December 1, 2025

Have your own signal horror story? Drop it in the comments—you're not alone. 🚀
