DEV Community: Sid Probstein

SWIRL 5 is GA: knowledge authority for enterprise AI

Sid Probstein — Wed, 22 Jul 2026 15:04:12 +0000

SWIRL 5 is generally available! I want to use this post to explain what it is at an engineering level, because the one-line pitch ("the knowledge authority layer for enterprise AI") does not tell you how it works or where the hard parts are. I would rather show you the machine.

Quick disclosure: I wrote the original version of SWIRL and I run the company. So read this as the person who built it explaining the design, not as a neutral survey. I have tried to be honest about the limits, including where a hand-tuned stack matches us.

The thing that changed

A year ago, "federated search across your systems, then re-rank the results" was a product you could sell. Today it is table stakes. MCP turned retrieval into a commodity: any model can reach any system through a connector, and every serious stack ships a re-ranker. If your pitch is "we retrieve and we re-rank," a technical evaluator will point at four other tools that do the same thing by lunch.

So the interesting question moved. It is no longer "can you find the document." Everything finds the document. It is "which of the versions you found is the one my organization actually stands behind." In a real enterprise corpus the answer to any given query does not exist once. It exists as a draft, three redlines, a copy someone saved to their desktop, and the executed final, all sitting in different systems, all semantically near-identical. Retrieval returns all of them. The model picks one, confidently, and it has no idea which one carries authority.

That is a governance problem wearing a search costume, and it is what SWIRL 5 is built to solve.

Canonical version election

The core primitive is version election. When SWIRL federates a query and gets back a cluster of near-identical documents, it does not just hand the pile to the model. It:

Clusters the versions it found across every source.
Scores each on signals that actually correlate with authority: source authority (an executed contract in iManage outranks a draft on someone's OneDrive), naming ("Executed", "Final", version numbers), and recency.
Elects one canonical version, and exposes its reasoning so a human can see why.

On top of that, teams can pin a canonical result for a query directly. Once pinned, every later search and every agent calling SWIRL gets the endorsed answer, full stop. Election is the automatic path; pinning is the human override. Both produce the same thing: a single answer the organization has stood behind, not the model's best guess.
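As a rough illustration of election plus pinning, here is a minimal sketch. The weights, source names, and the `Version`, `authority_score`, and `elect` names are all assumptions made up for the example, not SWIRL's actual scoring. It only shows the shape: score each version on source authority, naming, and recency, keep the reasons, and let a human pin win outright.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Illustrative values only; these weights and source names are assumptions.
SOURCE_AUTHORITY = {"imanage": 1.0, "sharepoint": 0.6, "onedrive": 0.3}
NAME_HINTS = ("executed", "final")


@dataclass
class Version:
    doc_id: str
    source: str
    title: str
    modified: date


def authority_score(v, newest):
    """Score one version and collect the reasons, so a human can audit the result."""
    score = SOURCE_AUTHORITY.get(v.source, 0.1)
    reasons = [f"source authority: {v.source}"]
    if any(hint in v.title.lower() for hint in NAME_HINTS):
        score += 0.5
        reasons.append("title marks it executed/final")
    age_days = (newest - v.modified).days
    score += 0.2 / (1 + age_days / 30)  # recency acts as a weak tiebreaker here
    if age_days == 0:
        reasons.append("most recent in cluster")
    return score, reasons


def elect(cluster, pinned_id: Optional[str] = None):
    """Return (canonical version, reasons). A human pin overrides the scores."""
    for v in cluster:
        if v.doc_id == pinned_id:
            return v, ["pinned by a human"]
    newest = max(v.modified for v in cluster)
    scored = [(authority_score(v, newest), v) for v in cluster]
    (_, reasons), winner = max(scored, key=lambda item: item[0][0])
    return winner, reasons


cluster = [
    Version("draft", "onedrive", "MSA draft v3", date(2026, 5, 1)),
    Version("redline", "sharepoint", "MSA redline", date(2026, 5, 20)),
    Version("exec", "imanage", "MSA Executed", date(2026, 5, 10)),
]
winner, reasons = elect(cluster)
```

Here the executed copy wins even though the redline is newer, which is the "most recent is not always operative" case; passing `pinned_id` short-circuits the scoring entirely.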

This is the piece the frontier models cannot do for themselves. Claude and Copilot are extremely good at drafting and summarizing. Neither has any way to know which of your nine versions is ratified, because that fact does not live in the documents. It lives in your organization.

Ranking: three passes, run locally, no vector database

Under the election sits the ranking pipeline. It runs in three passes, and both models (the embedding model and the cross-encoder) run locally in your tenant. Nothing goes over the wire to a third-party ranking service, and there is no vector database to build, secure, or keep in sync.

Keyword + BM25. Lexical first. Quoted phrases and exact terms are honored as written. In enterprise and legal content this matters: a defined term or a specific clause has to match exactly, not approximately.
Embedding re-rank. E5-large-v2 embeddings with title-aware chunking, fused with the lexical scores using reciprocal rank fusion. Semantic recall, without letting it steamroll the exact matches from pass one.
Cross-encoder. An MS-MARCO cross-encoder reads the query and each candidate document together, as a pair, and scores real relevance rather than vector similarity. It is the expensive pass, which is exactly why it runs last, only on the candidates that survived the first two.
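For readers who have not met reciprocal rank fusion, here is a minimal sketch of the standard formula: each list contributes 1 / (k + rank) per document, and the contributions are summed. The function name, the document ids, and k=60 (the constant commonly used in the literature) are assumptions for illustration; SWIRL's own parameters are not stated here.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first rankings of doc ids into one ordering.

    Each list contributes 1 / (k + rank) for every doc it contains, so a
    document ranked well by both the lexical and the embedding pass rises.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


lexical = ["clause-7", "clause-2", "memo-9"]    # pass one: keyword + BM25
semantic = ["clause-2", "faq-1", "clause-7"]    # pass two: embedding similarity
fused = reciprocal_rank_fusion([lexical, semantic])
# fused == ["clause-2", "clause-7", "faq-1", "memo-9"]
```

Because only ranks are used, the lexical and embedding scores never need to share a scale, which is the usual reason to pick RRF over a weighted sum.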

The "no vector database" part is not a slogan, it is a deployment property. There is no second copy of your content sitting in an index that your information-governance policy never contemplated. The documents stay in iManage, Box, SharePoint, wherever they already live. Only the ranking happens in SWIRL. For a regulated buyer, removing that second copy removes an entire category of risk.

If you want the independent version of this argument, XetHub benchmarked keyword-only, vector-only, and hybrid re-rank, and hybrid won. Their post was literally titled "you don't need a vector database."

Assembly: a bounded prompt, not a context dump

Retrieval and ranking decide what is relevant. Assembly decides what the model actually sees, and that is where the token bill and a lot of the accuracy live.

SWIRL treats the prompt as a hard budget, around 3,000 tokens, and fills it deliberately: at most 10 sources, only those scoring above a relevance threshold, a topic matcher that scores the spans inside each source so only the passages that answer the query go in, per-source truncation to fit, and version de-duplication that drops the superseded copies before assembly. One bounded call, not the refine or tree-summarize modes that call the model once per chunk and multiply both tokens and latency.

The counterintuitive result, which we measured against LangChain and LlamaIndex defaults and checked against their source: in a versioned corpus, sending less produces a better answer. More context means more near-identical duplicates, and the model gets less certain, not more. We wrote that study up separately if you want the numbers and the honest caveats.

Grounding checks

Every generated answer is checked against the sources it cited. Claims that are not supported by the retrieved passages are flagged rather than shipped silently. This is not a hallucination cure, nothing is, but it changes the failure mode from "confident and wrong and unmarked" to "flagged for a human." In an enterprise setting that distinction is the whole game.

How it fits your stack

SWIRL 5 is headless and API-first, with a first-class MCP server. Any AI work surface calls it over MCP or REST and gets back ranked, permissioned, canonical answers with citations attached. It runs in your tenant, honors each source system's existing permissions on every query, and works with the model you choose, hosted Claude or GPT, Copilot, or a local model on your own hardware. The agent angle is the one I would watch: a human reading a superseded policy usually catches it, an agent does not pause, so serving agents the approved answer instead of a raw retrieval is a safety property, not a nicety.

Where it is hard, honestly

Version election is heuristic. Source authority, naming, and recency get you a long way, but naming conventions are messy and "most recent" is not always "operative." Pinning exists precisely because the automatic signal is not always enough. We show the reasoning so humans can correct it.
A carefully hand-tuned vanilla RAG can match our per-call token count. Low k, a good reranker, small chunks, plus your own de-dup layer. Our value is delivering that discipline by default and adding cross-version de-duplication that off-the-shelf stacks do not, not magic per-token compression.
Grounding checks reduce unsupported claims, they do not eliminate them. Treat the flag as a prompt for review, not a guarantee.

Try it

If you are putting AI on your own data, that is exactly the case this is built for. It is generally available now at swirlaiconnect.com, and I am happy to run it against a slice of your own systems so you can see the ranking, the citations, and the permission boundary on your data rather than a demo corpus. I built it, so you would be talking to the person who wrote the code.

How many tokens does your RAG stack actually send to the LLM?

Sid Probstein — Tue, 07 Jul 2026 13:13:00 +0000

The token bill for a RAG system is not set by your vector database. It's set one step later, by how you assemble the context you hand the LLM. Retrieval finds candidates; assembly decides how many of them, how much of each, and across how many LLM calls. That's where the money is spent.

So "vector DB vs. framework X vs. SWIRL" is the wrong axis. The real comparison is between context-assembly strategies. Here's an honest, source-checked look at how many tokens each common approach sends, and where SWIRL 5 actually costs less.

The honest headline first

SWIRL doesn't win by using smaller chunks. Anyone can lower top_k. On a single lean query, a minimal config like LlamaIndex's default (top_k=2) sends fewer raw tokens than SWIRL. If someone tells you SWIRL "always uses the fewest tokens," a technical evaluator will disprove it in five minutes.

SWIRL's advantage is structural, and it shows up exactly where enterprise content lives, in corpora full of document versions:

It never spends tokens on duplicate or superseded versions. Stock top-k returns whatever is nearest in embedding space, which in a versioned corpus means several near-identical copies. SWIRL collapses them to one canonical document before the LLM sees them.
It always answers in one bounded call. The multi-document synthesis modes people reach for when they want quality (LangChain refine/map_reduce, LlamaIndex refine/tree_summarize) multiply LLM calls, and refine multiplies tokens super-linearly. SWIRL never pays that tax.

What the common defaults actually do

Naive vector-DB RAG (Pinecone / Weaviate / Qdrant tutorials)

The pattern is: embed the query → retrieve top-k chunks → stuff them all into one prompt. Typical documented defaults:

Chunk size ~512 tokens (the common "start here"; the band is 256-1024)
Chunk overlap 10-20%
top_k 3-5
No de-duplication. None of the vendor quickstarts add a dedup or diversity step. Overlap alone guarantees adjacent chunks share text, and multiple versions of a document sit in the same embedding neighborhood, so top-k routinely returns redundant context and the pipeline sends all of it.

Input tokens ≈ k × chunk_tokens + overhead. At k=5 and ~1,000-token chunks, that's ~5,000 tokens, a large fraction of it redundant.

Sources: Pinecone, Weaviate, Qdrant.

LangChain

Verified from source:

Retriever default k=4 (similarity_search(k=4) in vectorstores/base.py)
RecursiveCharacterTextSplitter default 4,000 chars / 200 overlap (≈1,000 tokens) (text_splitters/base.py)

The combine-documents chains differ enormously in cost:

Chain	LLM calls (N docs)	Token behavior
`stuff`	1	all docs in one prompt
`map_reduce`	N + ≥1	one call per doc, then reduce
`refine`	N, sequential	re-sends the growing answer each step → super-linear tokens, no parallelism
`map_rerank`	N	one doc per call

refine is the trap: each step re-transmits the accumulating answer plus the next document, so the running answer is re-sent N-1 times and grows as it goes.
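To make the arithmetic concrete, here is a small cost model for input tokens under stuff and refine. It is a sketch under stated assumptions, not a measurement: the function names, the 200-token prompt overhead, the 300-token answer, and the 50-token growth per step are all made-up illustrative numbers.

```python
def stuff_tokens(n_docs, doc_tokens, overhead):
    """One call: every document plus the prompt overhead."""
    return n_docs * doc_tokens + overhead


def refine_tokens(n_docs, doc_tokens, overhead, answer_tokens, growth=0):
    """N sequential calls; every call after the first re-sends the running answer."""
    total = 0
    running = 0  # no answer exists before the first call
    for step in range(n_docs):
        total += overhead + doc_tokens + running
        running = answer_tokens + growth * step  # the answer may grow as it is refined
    return total


# Illustrative inputs: 5 documents of ~1,000 tokens each.
stuff = stuff_tokens(5, 1000, overhead=200)                                    # 5200
refine_flat = refine_tokens(5, 1000, overhead=200, answer_tokens=300)           # 7200
refine_grow = refine_tokens(5, 1000, overhead=200, answer_tokens=300, growth=50)  # 7500
```

The re-sent answer term is what separates the two: it is paid N-1 times, and if the answer grows each step that term grows faster than linearly in N.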

LlamaIndex

Verified from source (llama-index-core):

chunk_size 1,024 tokens, similarity_top_k 2, default response mode compact (constants.py, factory.py)
No default de-duplication. Node postprocessors are opt-in and the default similarity cutoff is off, so two versions of a doc in top-k both go to the LLM.

Response modes (docs):

Mode	LLM calls	Token behavior
`compact` (default)	~1	packs nodes into as few prompts as fit
`refine`	N	one call per node, re-sends evolving answer
`tree_summarize`	>1, recursive	summarize groups, then summaries-of-summaries
`accumulate`	N	query each node separately

compact with top_k=2 is genuinely lean, but that's a recall trade-off, and it still sends duplicate versions.

What SWIRL 5 does (verified in source)

SWIRL treats the LLM prompt as a hard budget and fills it deliberately:

Mechanism	Behavior
Hard prompt budget	RAG prompt capped at ~3,000 tokens (`SWIRL_RAG_TOK_DEFAULT`); assembly stops when full
Source cap + relevance gate	≤ 10 sources, only those scoring ≥ 0.8 are eligible
Scored semantic chunks	a BM25 topic-matcher scores spans within each source; on overflow SWIRL narrows to those scored spans, not the whole chunk
Truncation	per-source token-by-token truncation to fit the budget
Markup	relevant spans wrapped in `<SW-IMPORTANT>…</SW-IMPORTANT>` plus a compact per-source metadata header
De-dup before the LLM	version-cluster alternates dropped; if a doc is pinned, only the canonical is kept, so N versions collapse to 1
Single call	one stuff-style synthesis call (worst case +1 JSON repair)

Net: ~3,000 input tokens, one call, zero redundant-version tokens, a predictable ceiling that doesn't grow with document size or corpus size.
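The gate, cap, de-dup, and truncation steps in the table can be sketched as follows. This is a simplified illustration, not SWIRL's code: the `assemble` function, the dict fields, and the whitespace "tokenizer" are assumptions, and the span scoring and markup steps are left out.

```python
def assemble(sources, budget=3000, max_sources=10, min_score=0.8):
    """Fill a hard token budget with canonical, high-scoring sources.

    Tokens are approximated as whitespace-separated words; a real pipeline
    would count with the target model's tokenizer.
    """
    # Drop superseded versions and anything under the relevance gate.
    eligible = [s for s in sources if s["canonical"] and s["score"] >= min_score]
    eligible.sort(key=lambda s: s["score"], reverse=True)

    picked, used = [], 0
    for s in eligible[:max_sources]:
        remaining = budget - used
        if remaining <= 0:
            break  # budget is full; assembly stops
        words = s["text"].split()[:remaining]  # per-source truncation to fit
        picked.append({"id": s["id"], "text": " ".join(words)})
        used += len(words)
    return picked, used


sources = [
    {"id": "a", "canonical": True, "score": 0.95, "text": "one two three four five six"},
    {"id": "a-old", "canonical": False, "score": 0.96, "text": "one two three four five"},
    {"id": "b", "canonical": True, "score": 0.85, "text": "w1 w2 w3 w4 w5 w6 w7 w8"},
    {"id": "c", "canonical": True, "score": 0.50, "text": "off topic"},
]
picked, used = assemble(sources, budget=10)
```

With a budget of 10, the superseded copy and the low-scoring source never enter the prompt, and the second source is cut to the four tokens that still fit.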

Putting numbers on it

Scenario: a query over a corpus where the relevant policy exists in 5 versions, plus 3 other relevant documents, a realistic enterprise shape. Total LLM input tokens per query (summed across calls):

Reading it honestly:

vs. a deliberately minimal stuff config (LlamaIndex compact, top_k=2): SWIRL is comparable per call, but that config buys its low count with poor recall and still ships duplicate versions.
vs. typical stuff RAG (k=5, ~1k chunks): SWIRL sends ~40-45% fewer input tokens, and it removes the redundant-version half entirely.
vs. the quality-oriented multi-call modes (refine, tree_summarize, map_reduce): SWIRL is a 2-3× reduction in input tokens, and a larger reduction in output tokens, since those modes generate once per call.

Where the advantage is real, and where it isn't

Real and hard to get off-the-shelf: cross-version de-duplication. None of these frameworks do it by default. In a versioned corpus it's the difference-maker, and it compounds: the more versions, the more SWIRL saves.
Real: one bounded call vs. N-call synthesis; a predictable cost ceiling.
Honest caveat: a hand-tuned vanilla RAG (low k, a good reranker, small chunks, stuff, plus your own dedup layer) can match SWIRL's per-call token count. SWIRL's value is delivering that discipline by default, and adding version de-dup the others lack. It isn't magic per-token compression.

Takeaway

If you're answering questions over a corpus with real document versioning, the tokens you waste aren't in chunk size - they're in sending the LLM the same document five times and in synthesis modes that call the model once per chunk. SWIRL 5's design removes both by default. Measure your own stack the same way: total input and output tokens per query, summed across every LLM call. That number, not top_k, is your bill.
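One way to take that measurement is a small per-query meter that you feed from whatever usage counts your LLM client reports. The `QueryTokenMeter` class below is a hypothetical helper, and how you obtain the per-call counts is left to your stack.

```python
from dataclasses import dataclass, field


@dataclass
class QueryTokenMeter:
    """Accumulates token usage across every LLM call made for one query."""
    calls: list = field(default_factory=list)

    def record(self, input_tokens, output_tokens):
        self.calls.append((input_tokens, output_tokens))

    @property
    def total_input(self):
        return sum(i for i, _ in self.calls)

    @property
    def total_output(self):
        return sum(o for _, o in self.calls)


meter = QueryTokenMeter()
meter.record(1200, 150)  # first synthesis call (illustrative numbers)
meter.record(1500, 160)  # second call, which re-sent the running answer
```

Comparing `meter.total_input` and `meter.total_output` across configurations gives the per-query figure the takeaway describes, regardless of how many calls each mode makes.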

Links + Snippets Not Enough for RAG

Sid Probstein — Mon, 06 Jul 2026 12:00:00 +0000

Several people have been posting lately about how RAG + search is not enough:

"All the model gets is a list of links and snippets. It's not enough to make sense of most real business questions."

Agreed. If your pipeline stuffs ten search snippets into a prompt and hopes, you get confident mush. A snippet is a pointer to evidence, not the evidence. The fix is a step most RAG setups skip: page fetch plus a reader. Three parts.

1. Fetch the page, not the snippet

The search result is a pointer. So before generating, SWIRL fetches the actual page or document behind each top hit. Now the pipeline is working from the full source, not the 200 characters a search API happened to return.

2. Read it with a reader LLM

This is the part that matters. A reader model reads each fetched page against the question and pulls out the passages that actually answer it, marking and scoring them. The chaff never reaches the expensive answering model.

Do extraction before generation, not instead of it.

3. Budget the context

Real pages do not fit neatly into a context window, so the reader either truncates to the highest-signal passages or marks them in place, best first, until the budget is spent. The answering model gets curated, ranked evidence instead of a pile of chunks competing for attention.

The result is the difference between "here are some links" and an answer you can act on. Same retrieval, far better grounding, because something actually read the sources before the model spoke.

That take is exactly why the reader step exists. Naive RAG earns its bad reputation. This is how you avoid it.

(And don't get me started about not having to put it all in a vector database first!)

Making RAG admit when it's guessing: source-grounded hallucination checks

Sid Probstein — Wed, 01 Jul 2026 15:12:48 +0000

The failure mode that scares me most in RAG isn't a wrong answer. It's a confident wrong answer with three citations that don't actually say what the answer claims.

So in SWIRL 5 I stopped trusting the model to police itself and added a check that runs after generation.

The flow:

Generate the answer with its citations, as usual.
Split the answer into atomic claims — roughly one assertion per sentence.
For each claim, pull the specific spans from the retrieved passages the model cited.
Run an entailment check: does the cited text actually support this claim, contradict it, or neither?
Any claim that isn't supported gets flagged in the UI, inline, before the user reads a word of it.
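The flow above can be sketched in a few lines. This is not SWIRL's implementation: the function names are made up, the word-overlap test is only a stand-in for a real entailment model, and the three-way support/contradict/neither verdict is collapsed to supported or not. The cited spans are also passed as one flat list rather than per claim.

```python
import re


def split_claims(answer):
    """Naive sentence split; real claim segmentation has to resolve context."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def overlap_support(claim, span, threshold=0.6):
    """Stand-in for an entailment model: share of claim words found in the span."""
    claim_words = set(re.findall(r"[a-z0-9]+", claim.lower()))
    if not claim_words:
        return True
    span_words = set(re.findall(r"[a-z0-9]+", span.lower()))
    return len(claim_words & span_words) / len(claim_words) >= threshold


def check_answer(answer, cited_spans, supports=overlap_support):
    """Return the claims that no cited span supports."""
    return [
        claim
        for claim in split_claims(answer)
        if not any(supports(claim, span) for span in cited_spans)
    ]


answer = "The notice period is 30 days. The fee cap is 5 percent."
spans = ["Either party may terminate; the notice period is 30 days."]
flagged = check_answer(answer, spans)
# flagged == ["The fee cap is 5 percent."]
```

The `supports` parameter is the seam where an actual entailment model would plug in; everything else is bookkeeping around it.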

The interesting part wasn't the entailment model; it was everything around it.

Claim segmentation is harder than it sounds. Naive sentence splitting produces claims that are unverifiable on their own because the subject lives two sentences up.

Citations lie by omission. A model will cite a document that's topically relevant but doesn't contain the specific number it just quoted. The whole point of the check is to catch exactly that gap.

Latency budget. An honesty layer nobody waits for is an honesty layer nobody ships. To stay inside that budget, SWIRL 5 batches passage embeddings and can optionally cache them, among other optimizations.

The result isn't "SWIRL never hallucinates." Nothing can promise that. The result is: when it's on thin ice, it tells you, and it points at the exact sentence.

That's the version of trustworthy I can actually build.

SWIRL Community 4.5 Update

Sid Probstein — Sun, 28 Jun 2026 16:30:28 +0000

I wrote about SWIRL here last summer. Time for an update.

SWIRL Community 4.5 is out. It is the open-source, Apache-2.0 build of SWIRL: federated search and RAG across your apps, running on your own machine, no vector database required. Three things in this release are worth your attention.

1. Point RAG at any LLM

The big one. Community now lets you configure AI Providers and send your RAG queries to any model you want: OpenAI, Anthropic, Azure OpenAI, or a fully local model behind Ollama or vLLM. Pick the provider, set the model, done.

ai_provider:
  name: anthropic      # openai | azure | ollama | vllm
  model: your-model
  api_key: ${LLM_KEY}

This matters for two reasons. RAG quality is now your choice, not ours, so you can run a stronger model when accuracy counts. And you can keep the whole pipeline local, with retrieval and generation both inside your network. This was an Enterprise feature; it is in Community as of 4.5.

2. Ask RAG a question directly

RAG now accepts a natural-language question as its input. Before, you ran a search and generated over the results; now you can hand SWIRL a plain question and it does retrieval and generation in one step. Question in, cited answer out, which means a lot less glue code to wire SWIRL into a chat box or an app.

3. Updated Galaxy

Community now includes the new Galaxy 5 UI, with a search history widget and dashboard, a cleaned-up admin interface, built-in activity analytics, and more.

That is the release. Free, self-hosted, and now model-agnostic.

The knowledge-authority layer: what your agents can't get from the outside

Sid Probstein — Wed, 17 Jun 2026 13:28:55 +0000

Every enterprise AI conversation right now starts in the same place: "connect the model to our data." Then it stalls in the same place: which data, copied where, governed by whom.

Let me make an argument that runs against the current default - and then show the architecture it implies.

The default is a second copy of your data

The standard RAG recipe is: crawl your sources, chunk them, embed them, and load the vectors into a database. Now your model can retrieve. It also means you have a second copy of your content living in an index you have to secure, keep in sync, and explain to whoever owns compliance. You've recreated every permission boundary by hand, and you'll eventually get one wrong.

For a lot of teams that copy is simply not allowed. Regulated content, client-confidential material, anything privileged - copying it into a vendor store is exposure you don't get paid to take on.

You probably don't need the vector database

Here's the part people don't want to hear. The XetHub team benchmarked three retrieval strategies: keyword-only (BM25), vector-only, and hybrid (keyword to pull candidates, then re-rank). Keyword-only came last. Vector-only did better.

Hybrid won - and their conclusion was blunt: "No vector database necessary."

That matches what we see in production. Vector similarity is a great high-precision filter, not a great first pass. Lead with exact matches and quoted terms, then let embeddings and a cross-encoder re-rank what's left.

What "make your LLM better" actually means

It's not a slogan; it's a pipeline. In SWIRL, relevance is three passes, and both models run locally:

Federate and match. Query every connected source in parallel - keyword + BM25 - and honor quoted phrases and exact terms first.
Embedding re-rank. Re-rank candidates with E5-large-v2, using title-aware chunking and hybrid keyword+vector fusion (RRF). No vector database to build or secure.
Cross-encoder re-rank. An MS-MARCO cross-encoder reads the query and document together and scores real relevance, not vector distance.

Feed that to your LLM - whatever model you've chosen, including an on-prem one - and the answer gets better, because the context got better. Same model, sharper input.

The layer no model supplies from the outside

The stack is settling: foundation models orchestrate, MCP is the retrieval interface, the chat UI is a commodity. The piece none of them provide from outside your walls is knowledge authority - which document is official, which clause your org actually uses, which answer carries approval.

So we made it a first-class layer. SWIRL 5 exposes an MCP server. Any agent - Claude, Copilot, ChatGPT, your own - calls SWIRL and gets ranked, permissioned, organization-approved answers. A team pins the canonical result for a query once; every agent gets it after that. And no copy of your data leaves your tenant.

Why this shape

Three properties fall out of it, and they're the whole reason to build it this way:

Private by architecture. Data stays in place; permissions are enforced live; there's no second index to govern.
The answer, not a guess. Cross-encoder ranking plus canonical answers means people and agents get the result the org trusts.
The safe on-ramp to AI. Headless and MCP-native, deployed in your tenant - the lowest-risk way to give agents enterprise reach.

If you're wiring agents into enterprise data and the "just copy everything into a vector store" step is making your security team twitch, there's another shape available. SWIRL 5 goes GA July 15; the preview is open if you want to point it at your own stack. Either way - I'd genuinely like to hear how you're handling the authority problem, because I don't think the industry has it figured out yet.

Sid Probstein is the creator of SWIRL and CEO of SWIRL AI.

Moving Docker images between repos with crane (after imagetools wasted my afternoon)

Sid Probstein — Thu, 11 Jun 2026 17:55:15 +0000

I had a freshly-built, multi-arch dev image (linux/amd64 + linux/arm64) and one job: promote it into a private partner repo on Docker Hub, plus stamp a dated tag so :latest is always traceable back to a real build. Cross-repo. Should be five minutes.

It was not five minutes.

The problem

The obvious tool is docker buildx imagetools create. It's built for copying manifests between tags, so I reached for it. And it sat there. Then it 400'd. Then I retried, and it hung. Cross-repo blob copies on Docker Hub are reproducibly flaky with imagetools, and I burned the better part of an hour confirming that before I went looking for something else.

The fallback most people reach for next is worse: pull the image down, retag it, push it to the new repo. That round-trips the entire image (every layer, every arch) through your laptop's daemon and disk, only to push the same bits back up. And if you're not careful with how you tag, you flatten a multi-arch index down to whatever single arch your machine happens to be. No thanks.

Why crane

crane (from Google's go-containerregistry) does the copy registry-to-registry. It never pulls the image to your machine.

It preserves the multi-arch manifest. crane cp copies the whole image index by digest. Both arches come along. Nothing gets flattened.
It actually works cross-repo on Docker Hub. Where imagetools 400'd and hung, crane did the server-to-server copy in seconds.
No daemon, no disk. It talks to the registries directly. Your laptop just orchestrates.

flowchart LR
    subgraph slow["pull → tag → push"]
        A[source repo<br/>you/app:dev] --> L[your daemon + disk<br/>whole image] --> B[partner repo<br/>you/partner-app:latest]
    end
    subgraph crane["crane cp"]
        C[source repo<br/>you/app:dev] -->|manifest + blobs<br/>by digest| D[partner repo<br/>you/partner-app:latest]
    end

Run it as a container, no install

docker run --rm gcr.io/go-containerregistry/crane:debug <crane-args>

That's the whole installation story. Nothing on the host.

Auth issues (macOS)

crane in the container reads ~/.docker/config.json. On macOS with Docker Desktop, your login isn't in that file — it's credsStore: osxkeychain, sitting in the macOS keychain. So if you naively mount ~/.docker/config.json into the container, crane sees no usable credential and hands you UNAUTHORIZED.

The fix: pull the credential out of the keychain on the host, write it into a temporary inline config, mount that, and delete it the moment you're done. Never print it, never commit it.

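# Temp dir for the short-lived config; removed when the shell exits.
TMP=$(mktemp -d); trap 'rm -rf "$TMP"' EXIT
cat > "$TMP/mkcfg.py" <<'PY'
# Turn the credential-helper JSON on stdin into an inline-auth Docker config.
import json, sys, base64
d = json.load(sys.stdin)
auth = base64.b64encode((d["Username"] + ":" + d["Secret"]).encode()).decode()
with open(sys.argv[1], "w") as f:
    json.dump({"auths": {"https://index.docker.io/v1/": {"auth": auth}}}, f)
PY
# Ask the keychain helper for the Docker Hub login and write the config; nothing is printed.
echo "https://index.docker.io/v1/" | docker-credential-osxkeychain get | python3 "$TMP/mkcfg.py" "$TMP/config.json"

# Wrapper: run crane in a container with the temp config mounted read-only.
CRANE() { docker run --rm -v "$TMP/config.json":/root/.docker/config.json:ro \
            gcr.io/go-containerregistry/crane:debug "$@"; }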

The trap ... EXIT cleans up the temp config when your shell exits, so the credential doesn't linger.

On Linux (or anywhere docker login writes inline creds), skip all of that and mount ~/.docker/config.json directly. And obvious-but-worth-saying: you need a read/write login on the destination. A read-only token can't push.

Promote the image

# copy SRC -> DST, full multi-arch manifest, server to server:
CRANE cp you/app:dev \
         you/partner-app:latest

# also stamp a dated, immutable tag so :latest is always traceable to a build:
CRANE cp you/app:dev \
         you/partner-app:dev-2026-06-11

Two copies, both registry-side, both done before you can refill your coffee.

Verification

Don't trust that the copy worked. Prove it. The destination digest must equal the source digest — same bits, same manifest:

CRANE digest you/app:dev                 # source
CRANE digest you/partner-app:latest      # must equal source
CRANE digest you/partner-app:dev-2026-06-11   # must equal source

If those three lines match, the right image landed under both tags. If they don't, you copied the wrong thing — better to find out here than in a partner's deployment.

Bonus

CRANE ls you/partner-app                          # list tags
CRANE manifest you/app:dev                         # full manifest JSON, see the arches
CRANE tag you/app@sha256:<digest> newtag           # add a tag to an existing digest
CRANE copy <SRC> <DST>                             # alias of cp

One caveat worth knowing: tag deletion isn't a crane operation. Use the Docker Hub UI or the API for that (and it needs an admin-scoped token — a read/write PAT won't delete).

TL;DR

Reach for crane any time you're moving an image between registries or repos and you care about the manifest arriving intact — promotions, mirrors, cross-org handoffs. It skips the daemon, skips the disk, and it doesn't fall over on Docker Hub cross-repo copies the way imagetools does. And whatever you do, build the crane digest source-equals-dest check into the workflow. It costs two seconds.

Running PyTorch fork-safe in Celery on macOS

Sid Probstein — Mon, 01 Jun 2026 13:23:46 +0000

If you've ever seen this in your Celery logs:

Process 'ForkPoolWorker-7' pid:32839 exited with 'signal 11 (SIGSEGV)'
billiard.exceptions.WorkerLostError:
    Worker exited prematurely: signal 11 (SIGSEGV) Job: 0.

...and the macOS crash report buries the real message in a JSON blob:

"asi": {
  "CoreFoundation": ["*** multi-threaded process forked ***"],
  "libsystem_c.dylib": ["crashed on child side of fork pre-exec"]
}

...you've hit one of the classic fork-after-init traps. Here's what's going on and how to actually fix it.

The one-line fix (if you're in a hurry)

Set these env vars before the Celery worker's MainProcess imports anything heavy:

# <your-project>/celery.py ... the very first thing the worker imports
import os

for var in ("OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS", "MKL_NUM_THREADS",
            "NUMEXPR_NUM_THREADS", "VECLIB_MAXIMUM_THREADS"):
    os.environ.setdefault(var, "1")

os.environ.setdefault("OBJC_DISABLE_INITIALIZE_FORK_SAFETY", "YES")

The first set forces every BLAS library to single-threaded mode. VECLIB_MAXIMUM_THREADS is the one most people forget; it covers Apple's Accelerate framework, which is what PyTorch uses by default on Apple Silicon. The last one tells the Objective-C runtime to skip its fork-safety abort.

Why this happens

PyTorch's nn.Linear on macOS arm64 calls into Apple Accelerate, which does its parallel matmuls via libdispatch (Grand Central Dispatch).

The first BLAS call lazily spins up a pool of libdispatch worker queues in the calling process.

If that "calling process" is your Celery worker's MainProcess (say, because something during boot does a tiny matmul: spaCy preload, an embedding warmup, anything that imports numpy and runs a real op), those queues now live in the parent.

When the prefork pool then fork()s a child, the child inherits broken queue handles. The next BLAS call from inside the child dereferences a stale pointer and you get the SIGSEGV.

The stack trace in the crash report makes it unambiguous:

0: _dispatch_apply_with_attr_f      (libdispatch)
1: dispatch_apply_with_attr         (libdispatch)
3: cblas_sgemm                      (Accelerate)
5: at::native::cpublas::gemm        (libtorch_cpu)
6: at::native::addmm_impl_cpu_      (libtorch_cpu)
7: at::native::linear               (libtorch_cpu)
8: torch::autograd::THPVariable_linear

What doesn't work

"Just lazy-load the model in the child." Even if you defer from_pretrained until you're inside a forked child, that first call still hits Accelerate BLAS, and the dispatch queues your child inherited from the parent are already broken.

"Just bypass sentence_transformers.CrossEncoder.predict() and use bare-torch." Same story. Whether you go through CrossEncoder or call AutoModelForSequenceClassification directly, the SIGSEGV is one frame down inside linear().

"Just don't import torch at the top of the module." Necessary but not sufficient. In our case, removing import torch from ai_provider.py was real progress, but then we discovered litellm transitively pulls torch the first time you call it. Every "warmup" preload that touched litellm still poisoned the parent. You have to audit every code path that runs before the first fork.

The defensive pattern that does work

Defer heavy imports. Don't import torch at module top in anything that's part of the Celery autodiscovery chain. Push it into the function that needs it:

# Bad ... taints anyone who imports this module
import torch

def rerank(query, documents):
    with torch.no_grad():
        ...

# Good ... torch only loads in workers that actually rerank
def rerank(query, documents):
    import torch
    with torch.no_grad():
        ...
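
The deferred import is easy to undo by accident, since any new top-level import can pull torch back in. Below is a minimal regression check, assuming the module can be imported in a fresh interpreter. pulls_in is a hypothetical helper name, and it only catches import-time loads, not packages pulled in on first call.

```python
import subprocess
import sys


def pulls_in(module_to_import, forbidden):
    """Import module_to_import in a fresh interpreter; report whether forbidden got loaded."""
    code = (
        "import importlib, sys; "
        f"importlib.import_module({module_to_import!r}); "
        f"print({forbidden!r} in sys.modules)"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    # Take the last line in case the imported module prints during import.
    return result.stdout.strip().splitlines()[-1] == "True"
```

In a test suite this could read `assert not pulls_in("myapp.tasks", "torch")`, where myapp.tasks stands in for whatever module sits in your Celery autodiscovery chain.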

Gate "warmup" preloads off the Celery worker. Preloading models at startup makes sense for an ASGI server like Daphne. It is actively harmful in a forking Celery worker, because the warmup runs in MainProcess:

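import sys

from django.apps import AppConfig

class MyAppConfig(AppConfig):
    def ready(self):
        # argv[0] is usually a full path (.../bin/celery), so match "celery" by substring.
        is_celery_worker = any("celery" in arg for arg in sys.argv) and "worker" in sys.argv
        if not is_celery_worker:
            self._preload_cross_encoder()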

What about Linux / Docker?

Yes, this affects Linux too, just less dramatically.

OpenBLAS and MKL both spin up thread pools on first use that don't survive fork; the typical Linux failure mode is a hang or a deadlock rather than a SIGSEGV.

The good news: the same *_NUM_THREADS=1 env vars are the fix.

VECLIB_MAXIMUM_THREADS and OBJC_DISABLE_INITIALIZE_FORK_SAFETY are no-ops on Linux, so the snippet above is portable. The deferred-import and gate-off-warmup patterns apply unchanged.

The AI agent stack that’s quietly taking over enterprise workflows

Sid Probstein — Sat, 03 May 2025 02:58:27 +0000

Accenture, IBM, and AWS are all placing bets on Crew AI. Why? Because it makes building and deploying real AI agents possible.

With Crew AI, teams are spinning up agents that:

Launch predictive marketing campaigns
Automate financial back-office ops
Optimize inventory and logistics
And tackle 100+ other enterprise use cases

But here’s the catch: agents are only as good as the data they can reach. That’s where SWIRL comes in.

By pairing Crew AI with SWIRL, you get more than just agents—you get enterprise-ready, data-rich workflows that scale. No custom plumbing. No brittle integrations.

With Crew AI + SWIRL, your agents can:

Connect to 100+ enterprise data sources out-of-the-box
Fetch the most relevant structured/unstructured data across silos
Respect row-level permissions with real enterprise auth
Summarize and answer with your LLM of choice
Plug in easily via zero-code connectors

Want to see this in action?

Message me for a demo or check the open source edition here: https://github.com/swirlai/swirl-search

It was already indexed!

Sid Probstein — Sun, 13 Aug 2023 16:48:52 +0000

I recently had the pleasure of chatting with @dmitrykan on his Vector Podcast. Check it out: https://dmitry-kan.medium.com/vector-podcast-with-sid-probstein-search-in-siloed-data-with-swirl-f2b9595a2715

We talked about quite a few things, including:

The challenges of enterprise search in the post-cloud era
How cross-silo search is particularly tricky because of entitlements (aka permissions) across silos
Zero-code configuration of connectors in Swirl, where JSON path and developer API doc get the job done
How large language models contextually re-rank disparate search results

There was an interesting twist at the end of the call. Dmitry uses a service called Clearword to transcribe recordings. Dmitry asked: “how quickly can you index the transcript and search it with Swirl?”

Here is my answer: It was already indexed!

Since there is no audio and it goes by quickly, let me explain ... Clearword emailed the transcript to both of us shortly after recording ended. It was indexed by Microsoft Outlook within seconds of arriving in my inbox.

To verify this, I copied some text from the middle of the transcript and pasted it into Swirl, which returned the link to the email message with the transcript and the phrase I searched for.

That simple truth - that the average enterprise is awash in search forms - is the entire reason metasearch is such a game-changing approach. Instead of making yet another repository, Swirl sends queries to existing search APIs and re-ranks the results from everything. It saves users a huge amount of time without a major IT project.

Want to see for yourself? git it going with 2 commands via Docker here: https://github.com/swirlai/swirl-search

Swirl 2.5 released

Sid Probstein — Wed, 09 Aug 2023 14:37:03 +0000

I am delighted to announce availability of Swirl 2.5!

This version focused on performance. Configured with 12 SearchProviders, Swirl 2.5 supports ~15 queries/second on a Standard F16s v2 server (16 vcpus, 32 GiB memory) with a median response time of ~3 seconds.

Version 2.5 also includes SearchProviders for HubSpot contact, company, and deal records, plus improvements to the Galaxy search UI (shown above).

Check out the Release notes for full details: https://github.com/swirlai/swirl-search/releases/tag/v2.5.0

What is Swirl? A new open source metasearch engine; it queries anything with an API then uses spaCy to re-rank the unified results without copying any data! Includes zero-code configs for Apache Solr, ChatGPT, Elastic Search, OpenSearch, PostgreSQL, Google BigQuery, RequestsGet, Google PSE, NLResearch.com, Miro, Microsoft 365, HubSpot, Atlassian, YouTrack, GitHub & more!

I wrote a metasearch engine called Swirl

Sid Probstein — Thu, 03 Aug 2023 18:48:40 +0000

Hi all! Swirl sends queries to existing search engines, unifies the results and re-ranks them all using large language models. It solves cross-silo information access and search problems in a fraction of the time and effort required to copy, ingest and index data.

Here's a brief video intro: https://youtu.be/sfsBYyu6qDQ

Swirl was written in Python atop the Django/Celery/Redis stack with a choice of SQLite3 or PostgreSQL back-ends. The source code is available under the Apache 2.0 license. The distribution includes zero-code configs for Apache Solr, ChatGPT, Elastic Search, OpenSearch, PostgreSQL, Google BigQuery, RequestsGet, Google PSE, NLResearch.com, Miro, Microsoft 365, Atlassian, YouTrack, GitHub, HubSpot & more. Plug in your access tokens or use Microsoft 365 to log in, and users can stop searching and start Swirling!

We are seeking contributions and feedback from developers working on all kinds of search solutions... thanks!