Async Everything: Why Background Processing Matters for Operational Software

#backgroundjobs #saas #architecture #wordpress

The first version of most operational software runs everything synchronously. User clicks a button, the server does the work, the response comes back. It works fine until it does not. And when it stops working, it stops working suddenly and visibly, usually when a customer is watching.

I learned this building SampleHQ. The CRM setup flow that worked in development took thirty seconds in production because Salesforce metadata deployment is slow. The CSV import that handled fifty rows instantly choked on five hundred. The shipping label purchase that normally completed in two seconds occasionally timed out at eight seconds when the Shippo API was under load.

Every one of these was a synchronous operation that should have been asynchronous from the start.

What Breaks Synchronously

External API calls. Any operation that depends on a third-party API is at the mercy of that API's response time. Shippo label purchases, CRM data syncs, carrier tracking lookups, address validation. These calls typically take one to five seconds. But under load, or when the provider has an incident, they take ten, twenty, thirty seconds. The user stares at a spinner and assumes the system is broken.

Bulk operations. Importing five hundred samples from a CSV file means five hundred database inserts, five hundred image downloads, and five hundred validation checks. Running this synchronously locks the browser tab for minutes. The user cannot tell whether the import is progressing or stuck. If they close the tab, the import stops mid-way, leaving partial data.

Multi-step workflows. Connecting a CRM requires OAuth, then capability probing, then metadata deployment, then webhook subscription setup. Each step depends on the previous one, and each involves API calls that can take seconds. Running this as a synchronous chain means a setup flow that takes forty-five seconds of wall time with no feedback.

Nightly maintenance. Attribution backfill, webhook cleanup, deal cache refresh, email log pruning. These jobs run across every tenant in a multi-tenant system. Running them synchronously in a single request is not just slow. It is impossible. They need to run as scheduled background tasks.

Action Scheduler as a Job Queue

WordPress does not have a built-in job queue. WP-Cron is a scheduling mechanism, not a processing queue. It runs on page loads, has no retry logic, and cannot handle concurrent job execution. For operational software, it is insufficient.

Action Scheduler, originally built for WooCommerce, is the job queue I use. It stores pending jobs in the database, processes them on a schedule (every minute via system cron), supports retries with configurable backoff, and handles concurrent execution across multiple workers.

The key architectural decision was loading Action Scheduler early in the plugin bootstrap, before plugins_loaded. This ensures that the scheduler's own hooks register before any business logic that depends on them. Getting this load order wrong causes jobs to silently fail to schedule, which is a category of bug that only manifests in production and is extremely difficult to diagnose.
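To make the "jobs stored in the database, processed on a schedule" model concrete, here is a minimal sketch in Python. It is not Action Scheduler's API (which is PHP); the table layout and the names open_queue, enqueue, and run_due_jobs are assumptions made up for illustration. A runner invoked every minute by cron would call run_due_jobs with the current time.

```python
import json
import sqlite3


def open_queue() -> sqlite3.Connection:
    """Create an in-memory jobs table (a real queue would use the site's database)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE jobs ("
        " id INTEGER PRIMARY KEY,"
        " hook TEXT NOT NULL,"
        " args TEXT NOT NULL,"
        " run_at REAL NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return db


def enqueue(db: sqlite3.Connection, hook: str, args: dict, run_at: float) -> int:
    """Store a pending job and return its id. Nothing executes yet."""
    cur = db.execute(
        "INSERT INTO jobs (hook, args, run_at) VALUES (?, ?, ?)",
        (hook, json.dumps(args), run_at),
    )
    db.commit()
    return cur.lastrowid


def run_due_jobs(db: sqlite3.Connection, handlers: dict, now: float) -> int:
    """Run every pending job whose run_at has passed; return how many ran."""
    rows = db.execute(
        "SELECT id, hook, args FROM jobs"
        " WHERE status = 'pending' AND run_at <= ? ORDER BY run_at, id",
        (now,),
    ).fetchall()
    ran = 0
    for job_id, hook, args in rows:
        # Claim the row first so a second worker would skip it.
        claimed = db.execute(
            "UPDATE jobs SET status = 'running' WHERE id = ? AND status = 'pending'",
            (job_id,),
        ).rowcount
        if claimed == 0:
            continue
        try:
            handlers[hook](**json.loads(args))
            status = "complete"
        except Exception:
            status = "failed"  # a fuller version would schedule a retry here
        db.execute("UPDATE jobs SET status = ? WHERE id = ?", (status, job_id))
        ran += 1
    db.commit()
    return ran
```

The point of the sketch is the separation: the web request only inserts a row, and a separate runner does the slow work later.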

What We Queue

Eleven distinct background jobs run in the SampleHQ platform:

Attribution backfill runs nightly. It recomputes attribution snapshots for deals whose linked orders or deal amounts have changed. This ensures attribution reporting stays accurate even when CRM data is updated after the initial computation.

Webhook cleanup runs daily. It purges processed webhook event logs and expired idempotency keys. Without this, the webhook tables grow indefinitely in a multi-tenant environment.

Email queue processing runs continuously. Notification emails are never sent synchronously. They are queued and processed by a background worker, which prevents email delivery delays from blocking user-facing operations.

Shipping label purchase is queued per-order. When a user purchases a label through Shippo, the request is queued with retry logic. If the Shippo API is slow or returns an error, the job retries with exponential backoff rather than failing in the user's face.

CSV import processing runs as a background job with progress tracking. The user uploads the file and sees a progress indicator. The import runs in the background, processing rows in batches. When it completes, the user gets a notification with success and error counts.
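A rough sketch of batched import with progress tracking, in Python for illustration. The ImportProgress fields, the batch size of 100, and the save_row callback are all assumptions, not SampleHQ's actual code. In a real queue each batch would be its own job, with the progress record persisted so the UI can poll it.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class ImportProgress:
    total: int = 0
    processed: int = 0
    succeeded: int = 0
    errors: list = field(default_factory=list)  # (data row number, message)


def import_batch(rows, progress, start, batch_size, save_row):
    """Process one batch of rows; return the index where the next batch starts."""
    end = min(start + batch_size, len(rows))
    for i in range(start, end):
        try:
            save_row(rows[i])
            progress.succeeded += 1
        except ValueError as exc:
            # Record the failure and keep going; one bad row should not
            # abort the whole import. Row numbers exclude the header line.
            progress.errors.append((i + 1, str(exc)))
        progress.processed += 1
    return end


def run_import(csv_text, save_row, batch_size=100):
    """Drive all batches in a loop. A job queue would instead enqueue one
    job per batch and let the progress record survive between them."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    progress = ImportProgress(total=len(rows))
    cursor = 0
    while cursor < len(rows):
        cursor = import_batch(rows, progress, cursor, batch_size, save_row)
    return progress
```

The final progress object carries exactly what the completion notification needs: the success count and the list of rows that failed.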

Image download fallback runs every five minutes. When a CSV import includes image URLs, those images are downloaded asynchronously. If a download fails (URL unreachable, server timeout), the fallback job retries it on the next cycle.

Cancellation lifecycle runs daily. When a subscription is canceled, the system tracks the lifecycle: day three winback email, day seven second email, day fourteen workspace freeze, day thirty suspension, day ninety archive. Each step is a scheduled job triggered by the lifecycle timer.
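The lifecycle timer can be sketched as a lookup the daily job runs for each canceled subscription. This is an illustration in Python; the step names and the steps_due helper are hypothetical, and only the day thresholds come from the description above.

```python
from datetime import date

# (days since cancellation, step name). Names are illustrative.
LIFECYCLE_STEPS = [
    (3, "winback_email"),
    (7, "second_email"),
    (14, "freeze_workspace"),
    (30, "suspend"),
    (90, "archive"),
]


def steps_due(canceled_on: date, today: date, already_done: set) -> list:
    """Return the steps whose day threshold has passed and that have not run yet.

    Checking 'days >= threshold' instead of 'days == threshold' means a step
    missed because the daily job did not run still fires on the next run.
    """
    days = (today - canceled_on).days
    return [
        name
        for threshold, name in LIFECYCLE_STEPS
        if days >= threshold and name not in already_done
    ]
```

Tracking which steps have already run keeps the daily job idempotent: running it twice on the same day sends no second email.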

The remaining four jobs are site summary refresh, AI watch checks, ERP sync log cleanup, and magic link token cleanup.

The User Experience Difference

The difference between synchronous and async operations is trust. When a user clicks "Purchase Label" and sees an immediate confirmation with a "processing" status, they trust the system. When they click the same button and wait eight seconds with no feedback, they assume it failed and click again, which triggers a duplicate purchase.

Async operations with proper status feedback eliminate this class of user error. The job is queued instantly. The UI shows the job status. The user can continue working. When the job completes, the UI updates and the system optionally sends a notification.
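One way to make the double-click case harmless is to key each queued purchase on the order, so a second click returns the existing job instead of creating a new one. The sketch below is Python for illustration; LabelQueue and its methods are hypothetical names, and a real version would keep this state in the database.

```python
import uuid


class LabelQueue:
    """In-memory stand-in for a queue that deduplicates label purchases."""

    def __init__(self):
        self._jobs = {}    # job_id -> {"order_id": ..., "status": ...}
        self._by_key = {}  # idempotency key -> job_id

    def enqueue_purchase(self, order_id: str) -> dict:
        """Queue a label purchase and return immediately with a status.

        Assumes one label per order; a repeat request for the same order
        gets the existing job back rather than a second purchase.
        """
        key = f"label:{order_id}"
        if key in self._by_key:
            job_id = self._by_key[key]
            return {
                "job_id": job_id,
                "status": self._jobs[job_id]["status"],
                "created": False,
            }
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = {"order_id": order_id, "status": "processing"}
        self._by_key[key] = job_id
        return {"job_id": job_id, "status": "processing", "created": True}

    def job_count(self) -> int:
        return len(self._jobs)
```

The response comes back before any call to the shipping API is made, which is what lets the UI show "processing" right away.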

Retry Logic and Failure Handling

Every async job needs a failure strategy. The default should be retry with exponential backoff: first retry after ten seconds, second after one minute, third after ten minutes. If the third retry also fails, the job moves to a failed state and triggers an alert rather than continuing to retry indefinitely.
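The retry policy can be written down as a small table plus a decision function. This Python sketch assumes the job gives up once the third retry has failed (four attempts in total); next_action and run_with_retries are hypothetical names. The sleep parameter is injected so the example can run without waiting; a real job queue would reschedule the job for later instead of sleeping in the worker.

```python
# Delay before the first, second, and third retry, in seconds.
RETRY_DELAYS_SECONDS = [10, 60, 600]


def next_action(failures: int):
    """Decide what to do after a failure.

    'failures' counts failed attempts so far, including the initial one.
    Returns ("retry", delay_seconds) or ("fail_and_alert", None).
    """
    if failures <= len(RETRY_DELAYS_SECONDS):
        return ("retry", RETRY_DELAYS_SECONDS[failures - 1])
    return ("fail_and_alert", None)


def run_with_retries(job, alert, sleep):
    """Run job(); on failure, back off and retry, then alert and give up."""
    failures = 0
    while True:
        try:
            return job()
        except Exception as exc:
            failures += 1
            action, delay = next_action(failures)
            if action == "fail_and_alert":
                alert(exc)  # loud failure: someone gets told
                return None
            sleep(delay)
```

In production the call would pass time.sleep or, better, a function that re-enqueues the job with a future run time.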

The failure alert is important because silent failures are worse than loud ones. A shipping label that silently fails to purchase is worse than one that fails and notifies the fulfillment team. The notification gives them the chance to retry manually or investigate the issue.

For critical operations like attribution computation and CRM sync, I also run nightly reconciliation jobs that check for inconsistencies and re-queue failed operations. This safety net catches edge cases where the retry logic itself was insufficient.
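At its core, a reconciliation pass is a set difference: everything that should have been processed, minus what completed, minus what is still in flight. A minimal Python sketch, with a hypothetical function name and plain id lists standing in for database queries:

```python
def find_jobs_to_requeue(expected_ids, completed_ids, in_flight_ids):
    """Return ids that should have been processed but are neither done nor
    currently queued. These are the ones the nightly pass re-queues."""
    missing = set(expected_ids) - set(completed_ids) - set(in_flight_ids)
    return sorted(missing)
```

Excluding in-flight work matters: re-queueing a job that is merely waiting for its next retry would create the duplicate the retry logic was meant to avoid.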

The Rule

The rule I follow: if an operation depends on an external API, takes more than two seconds, or processes more than a handful of records, it should be async. The engineering cost of making it async is a few hours per operation. The cost of leaving it synchronous is user frustration, timeout errors, duplicate operations, and support tickets.
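Written as a predicate, the rule looks like this. It is a Python sketch with a hypothetical function name; the two-second limit comes from the rule itself, while treating "a handful" as ten records is an assumption chosen only to make the example concrete.

```python
def should_be_async(calls_external_api: bool,
                    expected_seconds: float,
                    record_count: int,
                    handful: int = 10) -> bool:
    """Return True if any one of the three conditions in the rule holds."""
    return (
        calls_external_api
        or expected_seconds > 2
        or record_count > handful
    )
```

Any single condition is enough: a fast operation that touches an external API still qualifies, because its speed is not under your control.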

SampleHQ runs eleven scheduled background jobs processing CRM syncs, shipping labels, attribution, imports, and email delivery. The result is a system that stays responsive under load and handles failures gracefully instead of showing spinners.