
Matthew Gladding

Originally published at gladlabs.io

Hunting Ghost 503s and Pipeline Halts

What we shipped on 2026-06-25

Our biggest fight today was with a series of silent failures that only appeared in the wild. We spent most of the day chasing "ghost" errors: failures in code paths that pass local tests but collapse under production timeouts and missing services.

We started with the image regeneration pipeline, where poindexter tasks regen-image was sporadically returning HTTP 503s (PR #1930). The culprit was a stale HTTP keep-alive connection in our shared httpx.AsyncClient. Uvicorn would close the idle connection server-side after 5 seconds, but the client tried to reuse it anyway, triggering a RemoteProtocolError (0c08688). Since our local diffusers fallback isn't installed in the worker image, there was nothing left to fall back to, and the request returned a 503. The fix was simple: we stopped pooling and now open a fresh client for every SDXL call. Keep-alive buys almost nothing for low-frequency regens, and the stability is worth the cost of a new connection per call.
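A minimal sketch of the per-call client pattern, assuming an async worker. The endpoint URL, the regen_image name, the timeout value, and the client_factory parameter are illustrative assumptions, not the project's actual code; the factory is there only so the sketch can be exercised without a live SDXL server.

```python
import asyncio
from typing import Any, Callable

# Hypothetical endpoint; the real SDXL service address is not given in the post.
SDXL_URL = "http://localhost:9836/generate"


def new_httpx_client() -> Any:
    """Build a throwaway httpx.AsyncClient (imported lazily for this sketch)."""
    import httpx

    return httpx.AsyncClient(timeout=httpx.Timeout(120.0))


async def regen_image(
    payload: dict,
    client_factory: Callable[[], Any] = new_httpx_client,
) -> bytes:
    """POST one regeneration request using a client that lives for one call."""
    # A fresh client means a fresh connection pool, so there is no idle
    # keep-alive socket that the server may already have closed.
    async with client_factory() as client:
        response = await client.post(SDXL_URL, json=payload)
        response.raise_for_status()
        return response.content
```

The trade-off is one extra TCP handshake per call, which is small next to the time an image generation takes.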

At the same time, we had to stop the canonical_blog pipeline from halting at the QA stage. When the critic model settings were empty, _resolve_critic_model raised a RuntimeError that propagated all the way up to _wrap_atom, marking every task as halted=True (PR #1931). We wrapped that fallback call in a try/except block so a missing critic model degrades to a graceful skip instead of a total halt, and we seeded pipeline_critic_model=ollama/phi4:14b into the defaults so fresh installs don't start in a broken state.
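The degrade-to-skip idea can be sketched roughly as follows. The function names, the injected resolver, and the shape of the returned dict are assumptions made for illustration; only the RuntimeError and the halted flag come from the description above.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("qa_stage")


def resolve_critic_or_skip(resolver: Callable[[], str]) -> Optional[str]:
    """Return the critic model name, or None when it cannot be resolved."""
    try:
        return resolver()
    except RuntimeError as exc:
        # Swallow only the "no critic configured" failure; log it so the
        # skip is visible instead of silent.
        logger.warning("critic model unavailable, skipping review: %s", exc)
        return None


def run_qa_stage(draft: str, resolver: Callable[[], str]) -> dict:
    """Hypothetical QA stage: review the draft if a critic exists, else skip."""
    model = resolve_critic_or_skip(resolver)
    if model is None:
        return {"halted": False, "skipped": True, "critic_model": None}
    # A real stage would send `draft` to the critic model here.
    return {"halted": False, "skipped": False, "critic_model": model}
```

Catching only RuntimeError keeps unrelated bugs loud, which matters in a pipeline where the original problem was a failure nobody noticed.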

Even after fixing the settings, the critic model still failed to resolve inside the worker, because the Prefect subprocess doesn't run the FastAPI lifespan where SettingsService lives (PR #1932). We added a fallback path that resolves the critic model via SiteConfig when self.settings is None (af7e09a). It was a classic case of "it works in the API, but not in the worker."
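A rough sketch of that fallback, with both settings sources modeled as plain mappings. The CriticStage class and its constructor are hypothetical stand-ins; the real SettingsService and SiteConfig interfaces are not shown in the post.

```python
from typing import Mapping, Optional

CRITIC_KEY = "pipeline_critic_model"


class CriticStage:
    """Hypothetical stage that can run in the API process or in a worker."""

    def __init__(
        self,
        settings: Optional[Mapping[str, str]],
        site_config: Mapping[str, str],
    ) -> None:
        # In the API process the lifespan hook would populate `settings`;
        # in a worker subprocess it stays None.
        self.settings = settings
        self.site_config = site_config

    def resolve_critic_model(self) -> str:
        # Prefer live settings, fall back to the static site config.
        source = self.settings if self.settings is not None else self.site_config
        model = source.get(CRITIC_KEY, "")
        if not model:
            raise RuntimeError(f"{CRITIC_KEY} is not configured")
        return model
```

For example, CriticStage(None, {"pipeline_critic_model": "ollama/phi4:14b"}).resolve_critic_model() takes the worker path and reads the value from the site config.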

On the observability side, we caught a routing bug in brain/alert_sync (PR #1934). We had hardcoded datasourceUid: "prometheus" for every rule, so our SQL-driven alerts were being sent to Prometheus, which can't execute SQL. This caused every 60s eval cycle to fail with "data source not found" (02e5355). By adding datasource_type to _hash_rule, we invalidated the stale hashes and let the brain sync cycle auto-recover the routing to local-brain-db.
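To see why changing the hash inputs forces a re-sync, here is a small sketch. The rule fields (name, query), the "sql" type value, and the helper names are assumptions for illustration; the actual _hash_rule may hash different fields.

```python
import hashlib
import json


def _digest(fields: dict) -> str:
    """Stable SHA-256 over a dict, independent of key order."""
    blob = json.dumps(fields, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def legacy_hash_rule(rule: dict) -> str:
    """Old fingerprint: blind to which datasource the rule targets."""
    return _digest({"name": rule["name"], "query": rule["query"]})


def hash_rule(rule: dict) -> str:
    """New fingerprint: the datasource type is part of the rule's identity."""
    return _digest(
        {
            "name": rule["name"],
            "query": rule["query"],
            "datasource_type": rule["datasource_type"],
        }
    )


def needs_resync(rule: dict, stored_hash: str) -> bool:
    """A rule is pushed again whenever its fingerprint no longer matches."""
    return hash_rule(rule) != stored_hash
```

Every hash stored under the old scheme mismatches the new one, so the next sync cycle re-pushes each rule with its corrected datasource and no manual cleanup is needed.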

We cleaned up a few more regressions before cutting release 0.87.1 (PR #1937):

  • Fixed a frontend bug where missing posts were returning HTTP 200 instead of 404 because generateMetadata was committing the status too early (PR #1925).
  • Registered embeddings_collapse and embeddings_orphan_prune in load_all() after realizing they'd been left off the explicit import list in handlers/__init__.py (PR #1933).

These fixes don't add new features, but they close the gap between "it works on my machine" and a resilient autonomous system. We're finally moving past the fragility of the QA pipeline; now we can actually trust the critic to do its job.

Auto-compiled by Poindexter from today's commits and PRs. See the work: github.com/Glad-Labs/poindexter.
