Sid Probstein

Posted on Jun 1

Running PyTorch fork-safe in Celery on macOS

#python #celery

If you've ever seen this in your Celery logs:

Process 'ForkPoolWorker-7' pid:32839 exited with 'signal 11 (SIGSEGV)'
billiard.exceptions.WorkerLostError:
    Worker exited prematurely: signal 11 (SIGSEGV) Job: 0.

...and the macOS crash report buries the real message in a JSON blob:

"asi": {
  "CoreFoundation": ["*** multi-threaded process forked ***"],
  "libsystem_c.dylib": ["crashed on child side of fork pre-exec"]
}

...you've hit one of the classic fork-after-init traps. Here's what's going on and how to actually fix it.

The quick fix (if you're in a hurry)

Set these env vars before the Celery worker's MainProcess imports anything heavy:

# <your-project>/celery.py ... the very first thing the worker imports
import os

for var in ("OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS", "MKL_NUM_THREADS",
            "NUMEXPR_NUM_THREADS", "VECLIB_MAXIMUM_THREADS"):
    os.environ.setdefault(var, "1")

os.environ.setdefault("OBJC_DISABLE_INITIALIZE_FORK_SAFETY", "YES")

The first set pins the common BLAS and threading backends (OpenBLAS, OpenMP, MKL, NumExpr, and Apple's Accelerate) to a single thread. VECLIB_MAXIMUM_THREADS is the one most people forget; it covers Accelerate, which is what PyTorch uses by default on Apple Silicon. The last one tells the Objective-C runtime to skip its fork-safety abort.
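
If you'd rather have the worker tell you when the env vars went in too late, here's a minimal sketch of the same snippet wrapped in a function. The names apply_fork_safe_env and HEAVY_MODULES are mine, not part of Celery or PyTorch, and the list of "heavy" modules is an assumption you should adjust for your project. It sets the same defaults and returns any heavy module that was already imported by the time it ran:

```python
import os
import sys

THREAD_VARS = (
    "OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS", "MKL_NUM_THREADS",
    "NUMEXPR_NUM_THREADS", "VECLIB_MAXIMUM_THREADS",
)
# Assumption: the modules you consider "heavy"; extend this for your project.
HEAVY_MODULES = ("numpy", "torch")


def apply_fork_safe_env(environ=None, loaded=None):
    """Set the fork-safety defaults; return heavy modules imported too early."""
    environ = os.environ if environ is None else environ
    loaded = sys.modules if loaded is None else loaded
    for var in THREAD_VARS:
        # setdefault keeps any value that was already exported in the shell
        environ.setdefault(var, "1")
    environ.setdefault("OBJC_DISABLE_INITIALIZE_FORK_SAFETY", "YES")
    return [name for name in HEAVY_MODULES if name in loaded]


already_loaded = apply_fork_safe_env()
if already_loaded:
    print("warning: imported before env vars were set:", already_loaded)
```

A non-empty return value means something imported numpy or torch first, so the call needs to move earlier in the import chain.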

Why this happens

PyTorch's nn.Linear on macOS arm64 calls into Apple Accelerate, which does its parallel matmuls via libdispatch (Grand Central Dispatch).

The first BLAS call lazily spins up a pool of libdispatch worker queues in the calling process.

If that "calling process" is your Celery worker's MainProcess (say, because something during boot does a tiny matmul: spaCy preload, an embedding warmup, anything that imports numpy and runs a real op), those queues now live in the parent.

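When the prefork pool then fork()s a child, only the thread that called fork() is copied into it. The child inherits the queue handles but not the worker threads behind them, so those handles are broken. The next BLAS call from inside the child dereferences a stale pointer and you get the SIGSEGV.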

The stack trace in the crash report makes it unambiguous:

0: _dispatch_apply_with_attr_f      (libdispatch)
1: dispatch_apply_with_attr         (libdispatch)
3: cblas_sgemm                      (Accelerate)
5: at::native::cpublas::gemm        (libtorch_cpu)
6: at::native::addmm_impl_cpu_      (libtorch_cpu)
7: at::native::linear               (libtorch_cpu)
8: torch::autograd::THPVariable_linear
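
You can see the "threads don't survive fork" half of this without PyTorch at all. The sketch below is an analogy only: it uses a plain Python thread rather than a libdispatch queue, and the function name is made up for illustration. It starts a background thread in the parent, forks, and asks the child how many threads it has:

```python
import os
import threading


def threads_seen_by_child():
    """Fork while a background thread is alive; return the child's thread count."""
    stop = threading.Event()
    worker = threading.Thread(target=stop.wait, daemon=True)
    worker.start()  # the parent now has two threads

    read_fd, write_fd = os.pipe()
    pid = os.fork()  # POSIX only
    if pid == 0:
        # Child: report how many threads exist here, then exit immediately.
        try:
            os.close(read_fd)
            os.write(write_fd, str(threading.active_count()).encode())
        finally:
            os._exit(0)

    # Parent: read the child's answer, reap it, and stop the background thread.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        count = int(pipe.read())
    os.waitpid(pid, 0)
    stop.set()
    worker.join()
    return count
```

On a POSIX system the child reports a single thread, because fork() only carries over the thread that called it. Anything in the child that still expects the parent's pool to be there is working with stale state. Python 3.12 and later also emit a DeprecationWarning when you fork a multi-threaded process.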

What doesn't work

"Just lazy-load the model in the child." Even if you defer from_pretrained until you're inside a forked child, that first call still hits Accelerate BLAS, and the dispatch queues your child inherited from the parent are already broken.

"Just bypass sentence_transformers.CrossEncoder.predict() and use bare-torch." Same story. Whether you go through CrossEncoder or call AutoModelForSequenceClassification directly, the SIGSEGV is one frame down inside linear().

"Just don't import torch at the top of the module." Necessary but not sufficient. In our case, removing import torch from ai_provider.py was real progress, but then we discovered litellm transitively pulls torch the first time you call it. Every "warmup" preload that touched litellm still poisoned the parent. You have to audit every code path that runs before the first fork.

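One way to do that audit is to diff sys.modules around each suspect call. This is a rough sketch, and modules_loaded_by is a hypothetical helper name, not something from Celery or litellm:

```python
import sys


def modules_loaded_by(func, *args, **kwargs):
    """Call func; return the top-level package names of modules it newly imported."""
    before = set(sys.modules)
    func(*args, **kwargs)
    added = set(sys.modules) - before
    # Collapse "pkg.sub.mod" to "pkg" so the report stays readable
    return sorted({name.split(".")[0] for name in added})
```

Run it in a fresh interpreter around each warmup or preload function. If torch or numpy shows up in the result, that call is not safe to run in MainProcess before the fork. A module that was already imported before the call won't be reported, which is why the fresh interpreter matters.
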
The defensive pattern that does work

Defer heavy imports. Don't import torch at module top in anything that's part of the Celery autodiscovery chain. Push it into the function that needs it:

# Bad ... taints anyone who imports this module
import torch

def rerank(query, documents):
    with torch.no_grad():
        ...

# Good ... torch only loads in workers that actually rerank
def rerank(query, documents):
    import torch
    with torch.no_grad():
        ...
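
If the function also builds something expensive, such as the model itself, you probably want to build it once per worker process rather than once per task. Here's a minimal sketch, assuming a hypothetical load_fn that does the heavy import and construction. get_for_this_process is an illustrative name, not a library API:

```python
import os

# Keyed by (pid, key) so an object built in the parent is never reused by a
# forked child: the child has a different pid and builds its own copy.
_per_process_cache = {}


def get_for_this_process(key, load_fn):
    """Return a lazily built resource, constructed at most once per process."""
    slot = (os.getpid(), key)
    if slot not in _per_process_cache:
        _per_process_cache[slot] = load_fn()
    return _per_process_cache[slot]
```

This complements the env vars rather than replacing them: it only guarantees that the expensive construction happens inside the process that uses it.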

Gate "warmup" preloads off the Celery worker. Preloading models at startup makes sense for an ASGI server like Daphne. It is actively harmful in a forking Celery worker, because the warmup runs in MainProcess:

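import sys

from django.apps import AppConfig

class MyAppConfig(AppConfig):
    def ready(self):
        # Skip the warmup when this process is the Celery worker's MainProcess
        is_celery_worker = "celery" in sys.argv and "worker" in sys.argv
        if not is_celery_worker:
            self._preload_cross_encoder()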
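
The exact-match check on sys.argv assumes the worker was started as a bare celery command. Depending on how it's launched, sys.argv[0] can instead be a full path ending in bin/celery, or the path to celery/__main__.py under python -m celery. Here's a slightly more tolerant sketch. It's still a heuristic, and the function is my own, not a Celery API:

```python
import os
import sys


def is_celery_worker(argv=None):
    """Guess whether this process was started as `celery ... worker`."""
    argv = sys.argv if argv is None else list(argv)
    if not argv:
        return False
    program = os.path.basename(argv[0])
    if program == "__main__.py":
        # `python -m celery` puts the package's __main__.py path in argv[0]
        program = os.path.basename(os.path.dirname(argv[0]))
    return program == "celery" and "worker" in argv[1:]
```

Passing argv explicitly makes the check easy to test without touching the real sys.argv.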

What about Linux / Docker?

Yes, this affects Linux too, just less dramatically.

OpenBLAS and MKL both spin up thread pools on first use that don't survive fork; the typical Linux failure mode is a hang or a deadlock rather than a SIGSEGV.

The good news: the same *_NUM_THREADS=1 env vars are the fix.

VECLIB_MAXIMUM_THREADS and OBJC_DISABLE_INITIALIZE_FORK_SAFETY are no-ops on Linux, so the snippet above is portable. The deferred-import and gate-off-warmup patterns apply unchanged.

Top comments (2)

𝚂𝚊𝚞𝚛𝚊𝚋𝚑 𝚁𝚊𝚒 • Jun 1

Long time, Sid. :)

Echo • Jun 2

This is one of those "spends 2 days before finding the answer" issues. The fork-safe PyTorch + Celery + macOS combo is a real footgun; pinning the worker prefork and disabling MPS in the child before any tensor allocation is the only sane path.