TJ Sweet
Architectural Consolidation for Low-Latency Retrieval Systems: Why We Co-Located Transport, Embedding, Search, and Reranking

Most Graph-RAG systems are built as a chain of services:

  1. API ingress
  2. query embedding service
  3. vector DB
  4. sparse/BM25 service
  5. fusion/rerank service
  6. generation service

Typical Graph-RAG Architecture

That decomposition is clean on paper. It is rarely cheap on the critical path.

NornicDB made a deliberate architectural trade: co-locate the online retrieval path in one runtime/container (transport, query embedding, retrieval, fusion, rerank, and response assembly) and optimize that path hard.

NornicDB co-location

This post is about that choice: what it buys, what it costs, and how we mitigate the costs in code today.


Why consolidate at all?

If you split 5 stages across services and each boundary adds even ~1.0–1.5 ms of serialization/network/scheduler overhead, you can burn 5–7.5 ms before meaningful retrieval work.

That’s basically the whole budget for “feels instant” search.

In the co-located NornicDB path, we cut most of that boundary tax out. In a representative run against a 1M-document corpus, we saw:

2026/02/18 08:01:14 🔍 Search request database="nornic" query="where a prescriptions?"
2026/02/18 08:01:14 ⏱️ Search timing: method=rrf_hybrid cache_hit=false fallback=false total_ms=0 vector_ms=0 bm25_ms=0 fusion_ms=0 candidates[v=26,b=0,f=26] returned=20 query="where a prescriptions?"
[HTTP] POST /nornicdb/search 200 7.96575ms
2026/02/18 08:01:36 🔍 Search request database="nornic" query="where to get the drugs?"
2026/02/18 08:01:36 ⏱️ Search timing: method=rrf_hybrid cache_hit=false fallback=false total_ms=0 vector_ms=0 bm25_ms=0 fusion_ms=0 candidates[v=26,b=0,f=26] returned=20 query="where to get the drugs?"
[HTTP] POST /nornicdb/search 200 7.334291ms

Mean from those two samples: ~7.65 ms end-to-end HTTP.


The architectural shape we optimized for

NornicDB keeps compatibility/protocol flexibility at the edge (Bolt/Cypher, REST, GraphQL, gRPC including Qdrant-compatible flows), but collapses online retrieval internals into one operational surface:

  • in-process embedding path
  • in-process hybrid retrieval orchestration
  • in-process optional stage-2 reranking
  • in-process transactional graph + vector state

This is the core reason deployment can be “single container, one runtime, one rollback unit” instead of “service choreography.”


Why compressed ANN exists in this architecture

Compression wasn’t added as a “nice-to-have index type.”

It was added as a scaling lever that preserves the single-runtime model longer.

A raw 1024-dimensional float32 vector occupies 1024 × 4 = 4096 bytes before indexing overhead.

At scale, memory bandwidth and cache locality become the bottleneck, not just algorithmic complexity.

With IVFPQ-style compression, vector payload can drop by orders of magnitude (profile-dependent), which improves:

  • in-memory density
  • cache residency
  • tail-latency stability under load
  • throughput per dollar on fixed hardware
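To make the arithmetic concrete, here is a back-of-the-envelope sketch. The 64-subquantizer, 256-entry-codebook profile is an assumed example for illustration, not NornicDB's actual default:

```go
package main

import "fmt"

// rawBytes is the storage for one float32 vector of the given dimension.
func rawBytes(dim int) int { return dim * 4 }

// pqBytes is the per-vector payload after product quantization: one byte
// (a codebook index) per subquantizer, assuming 256-entry codebooks.
func pqBytes(subquantizers int) int { return subquantizers }

func main() {
	dim, m := 1024, 64 // 64 subquantizers: an illustrative profile
	raw, pq := rawBytes(dim), pqBytes(m)
	fmt.Printf("raw=%dB pq=%dB ratio=%dx\n", raw, pq, raw/pq)
	// At 10M vectors (codebooks and index overhead excluded):
	fmt.Printf("10M vectors: raw=%.2fGB pq=%.2fGB\n",
		float64(raw)*1e7/1e9, float64(pq)*1e7/1e9)
}
```

Under these assumptions, 10M vectors drop from ~41 GB of raw payload to well under 1 GB, which is what moves the working set back into cache-friendly territory.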

In code, compressed mode is explicitly gated and safety-wrapped:

  • pkg/search/search.go uses compressed profile resolution
  • if compressed profile is inactive -> standard path
  • if compressed build/load fails -> automatic fallback to standard path

So compression is a scalability primitive, not a reliability gamble.
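That gating reduces to a pattern like the following simplified sketch. Function names here are illustrative; the real resolution and fallback live in pkg/search/search.go:

```go
package main

import (
	"errors"
	"fmt"
)

// searchIndex is the retrieval entry point either path must satisfy.
type searchIndex func(query []float32, k int) ([]int, error)

// buildCompressed stands in for the IVFPQ build; it can fail (inactive
// profile, bad load), and that failure must never take search down.
func buildCompressed(profileActive bool) (searchIndex, error) {
	if !profileActive {
		return nil, errors.New("compressed profile inactive")
	}
	return func(q []float32, k int) ([]int, error) { return []int{1, 2}, nil }, nil
}

// buildStandard is the uncompressed path, always available.
func buildStandard() searchIndex {
	return func(q []float32, k int) ([]int, error) { return []int{1, 2, 3}, nil }
}

// resolveIndex prefers the compressed path but falls back to standard on
// any failure: compression is an optimization, never a hard dependency.
func resolveIndex(profileActive bool) (searchIndex, string) {
	if idx, err := buildCompressed(profileActive); err == nil {
		return idx, "compressed"
	}
	return buildStandard(), "standard"
}

func main() {
	_, mode := resolveIndex(false)
	fmt.Println("mode:", mode) // inactive profile -> standard path
}
```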


Costs of co-location, and how NornicDB mitigates each one

1) Reduced independent scaling of subcomponents

Risk: embedding, rerank, and generation can no longer be scaled as independent deployments.

Mitigations implemented:

  • Per-database overrides for embedding/search/HNSW/k-means and related knobs (docs/operations/configuration.md), so you can tune behavior by workload without splitting the whole system.
  • Provider decoupling at runtime: embedding and rerank can be local or external (OpenAI/Ollama/HTTP) via config (pkg/server/server.go, docs/operations/configuration.md).
  • Planned next step: sharding roadmap (docs/plans/sharding*.md) for horizontal scale without returning to “everything is a remote hop.”

2) Tighter resource coupling (CPU/memory/cache contention)

Risk: one process means shared contention.

Mitigations implemented:

  • File-backed vector store path to bound RAM during large builds and persistence (pkg/search/search.go: vectorFileStore low-RAM path).
  • Runtime strategy switching across CPU brute/GPU brute/HNSW using thresholds (pkg/search/search.go, docs/operations/configuration.md), with debounced transitions and replay-before-cutover behavior.
  • Compressed ANN mode to reduce memory footprint and bandwidth pressure at high vector counts.
  • Async write and queue controls exposed via config for throughput/consistency tuning (docs/operations/configuration.md).
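The threshold-plus-debounce idea behind strategy switching can be sketched like this. The specific thresholds and the consecutive-observation rule are illustrative assumptions; the real values come from configuration:

```go
package main

import "fmt"

type strategy string

const (
	cpuBrute strategy = "cpu-brute"
	gpuBrute strategy = "gpu-brute"
	hnsw     strategy = "hnsw"
)

// pick maps vector count to a strategy; thresholds are illustrative and
// would be configurable in practice.
func pick(n int) strategy {
	switch {
	case n < 10_000:
		return cpuBrute
	case n < 100_000:
		return gpuBrute
	default:
		return hnsw
	}
}

// switcher debounces transitions: a candidate strategy must win `need`
// consecutive observations before cutover, avoiding flapping when the
// vector count hovers near a threshold.
type switcher struct {
	current   strategy
	candidate strategy
	streak    int
	need      int
}

func (s *switcher) observe(n int) strategy {
	want := pick(n)
	if want == s.current {
		s.candidate, s.streak = "", 0
		return s.current
	}
	if want != s.candidate {
		s.candidate, s.streak = want, 0
	}
	s.streak++
	if s.streak >= s.need {
		// The real system also replays buffered writes into the new
		// index before cutover; here we just switch.
		s.current, s.candidate, s.streak = want, "", 0
	}
	return s.current
}

func main() {
	s := &switcher{current: cpuBrute, need: 3}
	for _, n := range []int{9_000, 11_000, 9_500, 12_000, 13_000, 14_000} {
		fmt.Printf("n=%d strategy=%s\n", n, s.observe(n))
	}
}
```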

3) Larger blast radius per deploy

Risk: one deploy can affect the full online path.

Mitigations implemented:

  • Fail-open reranking load path: server starts immediately; reranker loads async; if unavailable/health-check fails, search continues without stage-2 rerank (pkg/server/server.go).
  • Fail-open rerank execution: rerank errors revert to original order (pkg/search/search.go).
  • Compressed ANN fallback: compression failures fall back to standard retrieval path (pkg/search/search.go).
  • Version/compat checks + rebuild path for persisted indexes (docs/operations/configuration.md, pkg/search/search.go).

4) Harder team autonomy boundaries

Risk: fewer service boundaries can blur ownership.

Mitigations implemented:

  • Explicit extension seams via plugin systems (APOC-style and Heimdall plugin interfaces) in docs/user-guides/heimdall-plugins.md.
  • Protocol boundaries remain explicit at API edges (Bolt/Cypher, REST, GraphQL, gRPC), so interface ownership is still clear even when runtime is co-located.
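An extension seam of that kind can be as small as an interface plus a registry. The `Procedure` shape below is an assumed illustration, not the documented Heimdall plugin interface (see docs/user-guides/heimdall-plugins.md for the real one):

```go
package main

import "fmt"

// Procedure is an illustrative plugin seam: teams own extensions behind a
// stable interface inside the shared runtime, instead of owning a
// separate service.
type Procedure interface {
	Name() string
	Call(args map[string]any) (any, error)
}

// registry maps procedure names to implementations.
type registry map[string]Procedure

func (r registry) register(p Procedure) { r[p.Name()] = p }

// echo is a trivial example procedure.
type echo struct{}

func (echo) Name() string                          { return "util.echo" }
func (echo) Call(args map[string]any) (any, error) { return args["msg"], nil }

func main() {
	r := registry{}
	r.register(echo{})
	out, _ := r["util.echo"].Call(map[string]any{"msg": "hello"})
	fmt.Println(out) // hello
}
```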

5) Vendor/runtime lock-in risk

Risk: too many in-process optimizations can trap you in one stack.

Mitigations implemented:

  • Protocol pluralism in the product surface: Bolt/Cypher, REST, GraphQL, Qdrant-compatible gRPC, additive native gRPC (README.md).
  • Provider pluralism for model execution: local + external provider modes for embedding/rerank (docs/operations/configuration.md, pkg/server/server.go).
  • Compatibility-first stance (Neo4j + Qdrant workflows) keeps migration cost low.

Tradeoff summary

NornicDB’s stance is not “microservices are bad.”

It’s: for this workload, on this latency budget, boundary placement is a performance decision first.

  • If your top concern is strict per-stage org isolation, split services.
  • If your top concern is single-digit-ms retrieval with simpler operations, co-location wins more often.

NornicDB chose co-location, then added mitigations to avoid common co-location failure modes:

  • configurable per-DB policy
  • runtime strategy adaptation
  • compressed ANN for memory scale
  • fail-open degradation paths
  • future sharding trajectory

That combination is the architecture story:

one deployable runtime today, with deliberate seams for scale tomorrow.
