If you've ever seen this in your Celery logs:
Process 'ForkPoolWorker-7' pid:32839 exited with 'signal 11 (SIGSEGV)'
billiard.exceptions.WorkerLostError:
Worker exited prematurely: signal 11 (SIGSEGV) Job: 0.
...and the macOS crash report buries the real message in a JSON blob:
"asi": {
    "CoreFoundation": ["*** multi-threaded process forked ***"],
    "libsystem_c.dylib": ["crashed on child side of fork pre-exec"]
}
...you've hit one of the classic fork-after-init traps. Here's what's going on and how to actually fix it.
The one-line fix (if you're in a hurry)
Set these env vars before the Celery worker's MainProcess imports anything heavy:
# <your-project>/celery.py ... the very first thing the worker imports
import os

for var in ("OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS", "MKL_NUM_THREADS",
            "NUMEXPR_NUM_THREADS", "VECLIB_MAXIMUM_THREADS"):
    os.environ.setdefault(var, "1")
os.environ.setdefault("OBJC_DISABLE_INITIALIZE_FORK_SAFETY", "YES")
The loop pins each threaded math backend (OpenBLAS, OpenMP, MKL, NumExpr, Accelerate) to a single thread. VECLIB_MAXIMUM_THREADS is the one most people forget; it covers Apple's Accelerate framework, which is what PyTorch uses by default on Apple Silicon. The last line tells the Objective-C runtime to skip its fork-safety abort.
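One subtlety: setdefault leaves any value that is already exported in the shell untouched, so a stray OMP_NUM_THREADS=8 would silently survive. Here is a minimal sketch of a helper that applies the same settings and reports such conflicts. The names FORK_SAFE_ENV and apply_fork_safe_env are hypothetical, not part of Celery or any library:

```python
import os

# Same variables as the snippet above, with the value each one should have.
FORK_SAFE_ENV = {
    "OPENBLAS_NUM_THREADS": "1",
    "OMP_NUM_THREADS": "1",
    "MKL_NUM_THREADS": "1",
    "NUMEXPR_NUM_THREADS": "1",
    "VECLIB_MAXIMUM_THREADS": "1",
    "OBJC_DISABLE_INITIALIZE_FORK_SAFETY": "YES",
}


def apply_fork_safe_env(env=None):
    """Set missing variables and return those already set to something else."""
    env = os.environ if env is None else env
    conflicts = {}
    for var, wanted in FORK_SAFE_ENV.items():
        # setdefault returns the existing value if there is one.
        current = env.setdefault(var, wanted)
        if current != wanted:
            conflicts[var] = current
    return conflicts
```

Calling apply_fork_safe_env() at the top of celery.py and logging a non-empty result would make an inherited override visible instead of silently ignored.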
Why this happens
PyTorch's nn.Linear on macOS arm64 calls into Apple Accelerate, which does its parallel matmuls via libdispatch (Grand Central Dispatch).
The first BLAS call lazily spins up a pool of libdispatch worker queues in the calling process.
If that "calling process" is your Celery worker's MainProcess (say, because something during boot does a tiny matmul: spaCy preload, an embedding warmup, anything that imports numpy and runs a real op), those queues now live in the parent.
When the prefork pool then fork()s a child, the child inherits broken queue handles. The next BLAS call from inside the child dereferences a stale pointer and you get the SIGSEGV.
The stack trace in the crash report makes it unambiguous:
0: _dispatch_apply_with_attr_f (libdispatch)
1: dispatch_apply_with_attr (libdispatch)
3: cblas_sgemm (Accelerate)
5: at::native::cpublas::gemm (libtorch_cpu)
6: at::native::addmm_impl_cpu_ (libtorch_cpu)
7: at::native::linear (libtorch_cpu)
8: torch::autograd::THPVariable_linear
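The underlying rule is that fork() copies the parent's memory but only the one thread that called it. The sketch below illustrates that with plain Python threads standing in for the libdispatch pool. It is an analogy, not a reproduction of the Accelerate crash. count_threads_after_fork is a hypothetical helper and it needs a POSIX system:

```python
import os
import threading


def count_threads_after_fork(n_workers=3):
    """Return (threads alive in the parent, threads alive in a forked child)."""
    stop = threading.Event()
    workers = [threading.Thread(target=stop.wait, daemon=True)
               for _ in range(n_workers)]
    for worker in workers:
        worker.start()
    parent_count = threading.active_count()

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report how many threads exist here, then exit immediately.
        try:
            os.write(write_fd, str(threading.active_count()).encode())
        finally:
            os._exit(0)

    # Parent: read the child's answer and clean up.
    os.close(write_fd)
    child_count = int(os.read(read_fd, 16).decode())
    os.close(read_fd)
    os.waitpid(pid, 0)
    stop.set()
    for worker in workers:
        worker.join()
    return parent_count, child_count
```

The parent sees its worker threads, while the child reports a single thread: the pool's bookkeeping was copied, but the threads behind it were not. Recent Python versions may emit a DeprecationWarning about forking a multi-threaded process here, which is the same hazard being flagged.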
What doesn't work
"Just lazy-load the model in the child." Even if you defer from_pretrained until you're inside a forked child, that first call still hits Accelerate BLAS, and the dispatch queues your child inherited from the parent are already broken.
"Just bypass sentence_transformers.CrossEncoder.predict() and use bare-torch." Same story. Whether you go through CrossEncoder or call AutoModelForSequenceClassification directly, the SIGSEGV is one frame down inside linear().
"Just don't import torch at the top of the module." Necessary but not sufficient. In our case, removing import torch from ai_provider.py was real progress, but then we discovered litellm transitively pulls torch the first time you call it. Every "warmup" preload that touched litellm still poisoned the parent. You have to audit every code path that runs before the first fork.
The defensive pattern that does work
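A useful first step is to check mechanically what the parent has already imported before it forks, since a transitive import is easy to miss by reading code. Here is a minimal sketch. HEAVY_MODULES and preloaded_heavy_modules are hypothetical names, and the watch list is only an assumption about what a project might consider risky:

```python
import sys

# Hypothetical watch list; adjust to whatever pulls in a BLAS backend for you.
HEAVY_MODULES = ("torch", "numpy", "scipy", "spacy", "sentence_transformers")


def preloaded_heavy_modules(loaded=None, watch=HEAVY_MODULES):
    """Return the watched top-level modules that are already imported."""
    loaded = sys.modules if loaded is None else loaded
    return sorted(name for name in watch if name in loaded)
```

Calling this from a handler for Celery's worker_init signal, which should run in MainProcess before the pool starts, and logging a warning when the list is non-empty would surface an unexpected import. An import alone does not prove a thread pool was started, so treat a hit as a lead to investigate rather than a confirmed cause.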
- Defer heavy imports. Don't import torch at module top in anything that's part of the Celery autodiscovery chain. Push it into the function that needs it:
# Bad ... taints anyone who imports this module
import torch

def rerank(query, documents):
    with torch.no_grad():
        ...

# Good ... torch only loads in workers that actually rerank
def rerank(query, documents):
    import torch
    with torch.no_grad():
        ...
- Gate "warmup" preloads off the Celery worker. Preloading models at startup makes sense for an ASGI server like Daphne. It is actively harmful in a forking Celery worker, because the warmup runs in MainProcess:
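import sys
from django.apps import AppConfig  # assumes a Django project

class MyAppConfig(AppConfig):
    def ready(self):
        # sys.argv[0] is normally a path such as .../bin/celery, so match it as a substring
        is_celery_worker = "celery" in sys.argv[0] and "worker" in sys.argv[1:]
        if not is_celery_worker:
            self._preload_cross_encoder()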
What about Linux / Docker?
Yes, this affects Linux too, just less dramatically.
OpenBLAS and MKL both spin up thread pools on first use that don't survive fork; the typical Linux failure mode is a hang or a deadlock rather than a SIGSEGV.
The good news: the same *_NUM_THREADS=1 env vars are the fix.
VECLIB_MAXIMUM_THREADS and OBJC_DISABLE_INITIALIZE_FORK_SAFETY are no-ops on Linux, so the snippet above is portable. The deferred-import and gate-off-warmup patterns apply unchanged.