Rosen Hristov


Syncing 60,000 Products Without Breaking Everything

A dental supply store with 60,000 products wants to sync their catalog to my search engine. Every product needs a vector embedding. Here's what I learned about not melting the server.

Why Webhooks, Not Pull

The first architectural decision was whether my system should pull data from stores or let stores push data to it.

Pull-based sync sounds simpler: hit the store's API on a schedule, diff the results, update what changed. In practice it falls apart. You need to poll every store on some interval, deal with pagination across different APIs, handle rate limits on the store's side, and figure out what changed since last time. If you support multiple platforms (Drupal, WooCommerce, Sylius, custom), every platform has a different API shape.

I went with webhook-only. Stores push events to a single endpoint: POST /webhooks/sync/{store_id}/. The payload is a list of typed events:

```json
[
  {"type": "product.created", "data": {"identification_number": "SKU-123", "sku": "...", "names": {"": {"en": "..."}}, ...}},
  {"type": "product.updated", "data": {"identification_number": "SKU-456", "sku": "...", "names": {"": {"en": "..."}}, ...}}
]
```

Nine event types total: product.created, product.updated, product.deleted, and the same for pages, plus sync.start, sync.complete, and order.completed. Every event goes through a Pydantic schema before anything happens. Invalid payload? 400 response before it ever hits a queue.
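The validation gate can be sketched with Pydantic models along these lines; the model and field names here are illustrative, not the project's actual schemas, and only the product events are shown:

```python
# Illustrative sketch of per-event validation (Pydantic v2).
# Model/field names are assumptions, not the real schemas.
from typing import Literal

from pydantic import BaseModel, ValidationError


class ProductData(BaseModel):
    identification_number: str
    sku: str
    names: dict[str, dict[str, str]] = {}


class ProductEvent(BaseModel):
    type: Literal["product.created", "product.updated", "product.deleted"]
    data: ProductData


def validate_events(payload: list[dict]) -> list[ProductEvent]:
    # In the view, a ValidationError here maps to an HTTP 400
    # before anything reaches the queue.
    return [ProductEvent.model_validate(e) for e in payload]
```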

The tradeoff is real: now every store integration has to implement webhook sending. But I ship official modules for Drupal and WooCommerce that handle it, and for everything else there's a documented webhook API. The upside is the system never needs to know how any store's API works.

The Processing Pipeline

A webhook request hits the endpoint and goes through five checks before anything gets queued:

  1. Rate limit (120 requests per 60 seconds per store, sliding window in Redis)
  2. Store lookup (does this store_id exist?)
  3. Subscription check (is the subscription active?)
  4. HMAC-SHA256 signature (every store has an encrypted webhook secret, auto-generated on creation)
  5. Schema validation (every event validated against its Pydantic model)

If all five pass, the events go to Celery:

```python
process_webhook_events.delay(store.id, events_data)
```

The endpoint returns 202 immediately. The store doesn't wait for embeddings or database writes. The Celery task picks it up, and WebhookProcessor.process_events() handles the actual work.
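The signature check (step 4) is standard-library territory. A minimal sketch, assuming the signature arrives as a hex digest of the raw request body; the exact header name and secret decryption are elided:

```python
# Sketch of the HMAC-SHA256 check from step 4. How the signature is
# transported (header name, encoding) is an assumption here.
import hashlib
import hmac


def verify_signature(raw_body: bytes, secret: str, received_sig: str) -> bool:
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # compare_digest is constant-time, which avoids timing side channels
    return hmac.compare_digest(expected, received_sig)
```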

One detail that cost me a bug: sync session item registration happens synchronously at webhook receipt time, before Celery picks up the events. I'll explain why in the reconciliation section.

Hash-Based Skip Logic

Generating an embedding for a product is the expensive part. The embedding model runs in a separate inference service (a FastAPI app with a sentence transformer), and each call takes real time. For 60,000 products, you can't regenerate every embedding on every sync.

Each product has a content hash: a SHA-256 of its embedding prompt (name, SKU, category, brand, description, attributes). When a product.updated event comes in, the processor looks up the existing product in Qdrant and compares hashes:

```python
existing = self.qdrant_service.get_product_by_identification_number(
    store.id, data.identification_number
)

if existing and existing.get("hash") == product.hash:
    if not self._product_has_non_hash_changes(product, existing):
        return  # nothing changed, skip entirely
    # only price/stock/images changed, reuse existing embedding
    product.embedding = existing.get("embedding")
```

Three outcomes:

  • Hash matches, no other changes: skip entirely. No embedding, no write.
  • Hash matches, but price/stock/images changed: reuse the existing embedding, update only the mutable fields.
  • Hash differs: generate new embedding, write everything.

In a typical "full resync" of 60,000 products where maybe 200 actually changed, this skips embedding generation for 59,800 of them.
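The hash itself can be sketched like this; the field names are illustrative, and the point is that mutable fields like price and stock are deliberately excluded so they never invalidate the hash:

```python
# Sketch of the content hash: SHA-256 over the fields that feed the
# embedding prompt. Field names are illustrative; price/stock/images
# are intentionally absent so changing them never forces a re-embed.
import hashlib


def content_hash(product: dict) -> str:
    prompt_fields = [
        product.get("name", ""),
        product.get("sku", ""),
        product.get("category", ""),
        product.get("brand", ""),
        product.get("description", ""),
    ]
    prompt = "\n".join(prompt_fields)
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()
```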

Batch Processing

Product create and update events get special treatment. Instead of processing one at a time, they're collected and run through a three-phase batch pipeline:

Phase 1 (Collect): Build product objects, check subscription limits, hash-check against Qdrant. Products that need new embeddings go in one list; products that can reuse existing embeddings go in another.

Phase 2 (Batch embed): One call to the inference service's /embed/batch endpoint per batch of 50 products. Instead of 200 individual HTTP calls, you get 4.

Phase 3 (Batch upsert): One Qdrant upsert for all products.

```python
texts = [p.get_embedding_prompt() for p in products_needing_embedding]
embeddings = embedding_service.generate_embeddings_batch(texts)

for product, embedding in zip(products_needing_embedding, embeddings, strict=True):
    product.embedding = embedding
```

The batch size of 50 for the inference service is a tuned number. Higher and the inference service starts timing out. Lower and the HTTP overhead adds up.

Full Sync Reconciliation

Incremental webhooks handle the common case: a store updates a product, sends product.updated, done. But what about products that got deleted on the store side and no one sent a product.deleted? What about data drift after weeks of intermittent sync failures?

That's what full sync reconciliation solves. The protocol is three steps:

  1. Store sends sync.start with a session_id, entity ("products" or "pages"), and channel
  2. Store sends every product as product.created or product.updated, each with a sync_session_id field — all translations consolidated in one event
  3. Store sends sync.complete

The system tracks which items it saw during the session. When sync.complete arrives, anything in Qdrant that wasn't seen gets soft-deleted.

The session state lives in Redis with a 24-hour TTL:

```python
# Key structure
f"sync_session:{store_id}:{entity}:{channel}"    # stores the session_id
f"sync_items:{store_id}:{entity}:{channel}:{session_id}"  # stores a set of seen IDs
```

When each event arrives at the webhook endpoint, the item's identification_number gets added to the seen-items set before the event is queued to Celery:

```python
def _register_sync_items(self, store_id, events):
    for event in events:
        sync_session_id = event.data.get("sync_session_id")
        if not sync_session_id:
            continue
        # entity, channel and identification_number are derived from
        # the event (derivation elided here)
        SyncSessionService.register_item(
            store_id, entity, channel, sync_session_id, identification_number
        )
```

This was a bug fix. Originally I registered items during Celery processing. But Celery tasks run asynchronously, and if the store sent sync.complete in a separate request that arrived while product events were still queued, the seen-items set was incomplete. Items got deleted that shouldn't have been. Moving registration to the synchronous webhook handler fixed the race condition.
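The registration logic itself amounts to adding IDs to a per-session set. A minimal sketch, with a plain dict standing in for Django's cache (the real code stores the set through the cache framework with the 24-hour TTL):

```python
# Sketch of seen-item registration. A module-level dict stands in for
# Django's cache framework here so the logic is runnable; the real
# implementation also sets a 24-hour TTL on each key.
_cache: dict[str, set] = {}


def register_item(store_id, entity, channel, session_id, identification_number):
    key = f"sync_items:{store_id}:{entity}:{channel}:{session_id}"
    _cache.setdefault(key, set()).add(identification_number)


def seen_items(store_id, entity, channel, session_id) -> set:
    key = f"sync_items:{store_id}:{entity}:{channel}:{session_id}"
    return _cache.get(key, set())
```

Because this runs in the request cycle, the set is guaranteed complete by the time the endpoint returns 202 — which is exactly what the race-condition fix required.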

On sync.complete, the processor gets all product IDs from Qdrant, diffs them against the seen set, and soft-deletes the rest:

```python
all_products = self.qdrant_service.get_all_product_ids(store.id, channel)
for identification_number in all_products:
    if identification_number not in seen_items:
        # soft-delete: set deleted=True in Qdrant
```
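Stripped of the Qdrant plumbing, the diff is a set-membership sweep. A runnable sketch operating on plain dicts in place of Qdrant payloads:

```python
# Sketch of the reconciliation diff on sync.complete: anything indexed
# that the session never saw gets soft-deleted. Plain dicts stand in
# for Qdrant payloads; the real code updates points in Qdrant.
def reconcile(indexed_products: dict[str, dict], seen: set[str]) -> list[str]:
    """Soft-delete indexed products absent from the seen-items set.

    Returns the IDs that were soft-deleted.
    """
    deleted = []
    for identification_number, payload in indexed_products.items():
        if identification_number not in seen:
            payload["deleted"] = True  # soft-delete, never a hard removal
            deleted.append(identification_number)
    return deleted
```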

What Broke Along the Way

The sync registration race condition was the worst one. A store would sync 60,000 products across hundreds of webhook requests, then send sync.complete. But Celery hadn't finished processing all the product events yet, so the seen-items set was missing thousands of entries. The cleanup step would delete products that were still being processed. I noticed because a store reported half their catalog vanishing after a sync. The fix was registering items at the webhook endpoint, not in the Celery worker.

Subscription limit tracking during batch processing had a similar counting bug. The _check_product_limit method queries Qdrant for the current count, but during batch processing, new products haven't been upserted yet. If a store on a 2,000-product plan sent 100 new products in one batch, the limit check saw the same count for all 100 and let them all through. I added an in-memory counter (new_product_counts dict, keyed by channel) that tracks how many new products have been approved within the current batch.
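The fix can be sketched as a small checker that combines the stale Qdrant count with a running in-batch counter; class and parameter names here follow the post's description but are otherwise illustrative:

```python
# Sketch of the per-batch limit fix: the Qdrant count is frozen until
# the final upsert, so an in-memory counter tracks approvals within the
# current batch. Names are illustrative, modeled on the post.
from collections import defaultdict


class BatchLimitChecker:
    def __init__(self, plan_limit: int, current_count_in_qdrant: int):
        self.plan_limit = plan_limit
        self.current_count = current_count_in_qdrant
        self.new_product_counts = defaultdict(int)  # keyed by channel

    def approve_new_product(self, channel: str) -> bool:
        projected = self.current_count + self.new_product_counts[channel]
        if projected >= self.plan_limit:
            return False  # would exceed the plan once upserted
        self.new_product_counts[channel] += 1
        return True
```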

Inference service timeouts with large batches. My first implementation sent all products needing embeddings in one HTTP call. With 500 products, the inference service would take 30+ seconds and the HTTP client would time out. Chunking into batches of 50 with a separate INFERENCE_SERVICE_BATCH_TIMEOUT setting fixed it.

What's Still Not Great

The seen-items set for sync sessions is stored as a Python set in Redis (via Django's cache framework). For a 60,000-product catalog, that's 60,000 strings in memory. A Redis set with SADD/SMEMBERS would be more memory-efficient. Haven't hit a wall yet, but it's on the list.

The reconciliation cleanup iterates over every product ID and does individual lookups and updates. For very large catalogs with thousands of deletions, this is slow. Qdrant's filtering doesn't support "not in this set of IDs" directly, so there's no bulk shortcut.

There's no retry logic at the event level. The Celery task retries up to 3 times on failure, but that retries the entire batch — re-doing hash checks and Qdrant lookups for products that already succeeded. Single-product failures get logged and skipped, which is correct. Full inference service outages are the painful case.

The Drupal side of this pipeline — entity hooks, queue workers, alter hooks for customization — is an official module on drupal.org. I wrote about how it handles Drupal's schema flexibility in Building a Chat Assistant Module for Drupal Commerce. You can see the full pipeline working with a free sandbox — syncs up to 100 products in about 2 minutes.
