DEV Community: TJ Sweet

Why NornicDB Uses Its Own Monotonic Counter for MVCC Ordering

TJ Sweet — Tue, 19 May 2026 15:45:16 +0000

TL;DR

NornicDB's MVCC layer assigns each committed write a (CommitTimestamp, CommitSequence) pair, where CommitTimestamp comes from time.Now().UnixNano() and CommitSequence comes from a process-wide atomic uint64 counter. Snapshot-isolation conflict detection orders versions by sequence first, not timestamp. We did this because:

Wall-clock nanoseconds are not monotonic. Linux clock_gettime(CLOCK_REALTIME) can step backward under NTP correction, and even between adjacent reads on different goroutines.
Our parser is faster than the wall clock's resolution. A simple Cypher MATCH (n) RETURN n parses+validates in 39 ns with zero allocations. Multiple commits routinely land inside the same UnixNano() bucket.
Go's built-in monotonic clock is per-time.Time, not global. It is stripped by UnixNano() and is undefined across time.Time values produced by independent time.Now() calls.

A uint64 counter incremented atomically per commit gives us a total order that nothing in the operating system can perturb. At one billion commits per second sustained, it overflows in ~584 years.

The Bug We Were Hunting

The regression that drove this work was an intermittent CI failure in TestExecuteCypher_SetInvalidatesManagedEmbeddings:

conflict: node 0x7f3a... changed after transaction start

The failure was a phantom conflict. No second writer existed. The transaction reading the node had been opened after the commit it was racing against — there should have been nothing to conflict with. But the snapshot-isolation check disagreed.

The check is, at its core, a comparison between two MVCCVersion records: the version at which a transaction began its read, and the version at which a row was last committed. If committedVersion > readVersion, the SI machinery flags the row as having been written after the transaction started.

MVCCVersion.Compare() ordered by timestamp first:

// pkg/storage/types.go
func (a MVCCVersion) Compare(b MVCCVersion) int {
    if a.CommitTimestamp != b.CommitTimestamp {
        if a.CommitTimestamp < b.CommitTimestamp {
            return -1
        }
        return 1
    }
    if a.CommitSequence < b.CommitSequence { return -1 }
    if a.CommitSequence > b.CommitSequence { return 1 }
    return 0
}

That ordering is only sound if CommitTimestamp is monotonic across the entire process. It isn't.

Why `time.Now().UnixNano()` Cannot Be Used as a Global Order

Wall-Clock Drift, Concretely

time.Now() on Linux ultimately calls clock_gettime(CLOCK_REALTIME). CLOCK_REALTIME is the wall clock and is subject to:

NTP slewing (adjtime): the kernel slows or speeds the clock by up to 500 ppm to converge on the reference time.
NTP stepping (settimeofday): if the offset exceeds the panic threshold (~128 ms by default), the clock jumps — possibly backward.
PTP corrections in containerized hosts where the hypervisor's clock is the reference.
VM live-migration, where the guest's CLOCK_REALTIME snaps to the destination host's clock on resume.
Per-CPU rdtsc skew: clock_gettime reads a per-CPU TSC and converts it. If two goroutines are scheduled on different cores, their reads can disagree by tens to hundreds of nanoseconds — and the disagreement is not guaranteed to be in any particular direction.

A concrete sequence that breaks timestamp ordering:

t = T₀                  goroutine A: commit row R, stamp = 1_700_000_000_000_000_100
t = T₀ + 5µs            kernel applies NTP step, clock jumps -500ns
t = T₀ + 6µs            goroutine B: BeginTransaction, samples 1_700_000_000_000_000_050
goroutine B reads R
SI check: committed (100) > readVersion (50) → CONFLICT

There was no concurrent writer. The reader simply sampled a clock that had moved backward in the interim.

The Parser Is Faster Than the Clock Tick

time.Now() claims nanosecond resolution, but the underlying TSC tick is the actual quantum, and the kernel's vDSO + syscall path has its own latency floor. On a typical x86_64 Linux host, two back-to-back time.Now() calls return identical UnixNano() values an appreciable fraction of the time — anywhere from one in a few to one in a few hundred, depending on hardware.

Our parser benchmarks make this concrete:

BenchmarkParserValidationIsolation/Nornic/simple_match-16     30,066,961    39.09 ns/op    0 B/op    0 allocs/op
BenchmarkParserValidationIsolation/Nornic/match_with_label-16 21,996,284    53.20 ns/op    0 B/op    0 allocs/op
BenchmarkParserValidationIsolation/Nornic/create_node-16      17,062,600    70.40 ns/op    0 B/op    0 allocs/op
BenchmarkParserValidationIsolation/Nornic/merge_node-16       17,167,636    71.85 ns/op    0 B/op    0 allocs/op

A MATCH (n) RETURN n parses and validates in 39 nanoseconds. That is ~25.6 million queries per second on a single goroutine, with zero heap allocations. Throughput across query shapes is 450–550 MB/s of source text.

Compare to ANTLR on the same machine, same queries:

BenchmarkParserValidationIsolation/ANTLR/simple_match-16        494,116    4,725 ns/op    5,708 B/op    53 allocs/op
BenchmarkParserValidationIsolation/ANTLR/create_node-16         425,192    5,649 ns/op    6,700 B/op    62 allocs/op
BenchmarkParserValidationIsolation/ANTLR/match_where_in-16      152,034   15,616 ns/op   17,748 B/op   149 allocs/op

ANTLR is ~120× slower and allocates per parse. The fact that our parser is fast is not incidental — it is the entire reason UnixNano() cannot order our writes. Slow parsers naturally space commits apart by microseconds, and microsecond gaps swamp clock skew. We don't have that luxury.

Same-Tick Math

Suppose clock_gettime has an effective resolution of R nanoseconds (typical: R ∈ [1, 40]) and we are sustaining Q commits per second. The probability that two commits land in the same tick is approximately:

P(collision) ≈ 1 - e^(-Q·R / 1e9)

At Q = 1,000,000 commits/sec and R = 20 ns, P ≈ 1 - e^(-0.02) ≈ 1.98%. Roughly one collision every fifty commits. Over a 10-second ingestion burst, that's hundreds of unordered pairs — and we still need a total order to reason about snapshot isolation correctly.

Why `time.Now()`'s Monotonic Reading Doesn't Help

Go's time.Now() does include a monotonic reading from CLOCK_MONOTONIC. It is real, and it is genuinely monotonic. But:

The monotonic component is stripped by UnixNano(). Per the Go docs: "Because the monotonic clock reading has no meaning outside the current process, serializing a t.UnixNano() value and parsing it back loses the monotonic reading."
It is only meaningful between two time.Time values that share a wall+monotonic pair, used by Sub, After, Before, Equal. We persist a single int64 to disk.
It is per-time.Time, not a process-global counter. Two independent time.Now() calls produce two independent monotonic samples that are not guaranteed to be totally ordered with respect to each other once you reduce them to scalars.

There is no public Go API that returns a single int64 of monotonic nanoseconds suitable for storage and cross-goroutine comparison. You can hack one with runtime.nanotime via //go:linkname, but it has the same per-process scope as time.Now()'s monotonic reading and ties us to runtime internals.

The Fix: A Process-Global Atomic Counter

pkg/storage/badger.go carries two atomic fields on the engine:

mvccSeq             atomic.Uint64  // strictly-increasing commit sequence
mvccHighWaterNanos  atomic.Int64   // max committed CommitTimestamp ever observed

Every commit calls allocateMVCCVersion(), which:

Atomically increments mvccSeq and reads the new value.
Reads time.Now().UnixNano().
Atomically advances mvccHighWaterNanos to max(highWater, now).
Returns MVCCVersion{CommitTimestamp: now, CommitSequence: seq}.

BeginTransaction() calls currentMVCCReadVersion(), which clamps now upward to mvccHighWaterNanos. A backward NTP step cannot make a new transaction observe a read timestamp earlier than something already committed.

Snapshot-isolation conflict detection in pkg/storage/badger_transaction.go then compares by sequence first:

func snapshotIsolationConflict(read, committed MVCCVersion) bool {
    if read.CommitSequence != maxMVCCCommitSequence &&
       committed.CommitSequence != maxMVCCCommitSequence {
        return committed.CommitSequence > read.CommitSequence
    }
    // Saturation fallback only — see below.
    return committed.CommitTimestamp > read.CommitTimestamp
}

Because mvccSeq is a single atomic that no commit can skip, the sequence is a true total order. The wall-clock timestamp is retained for two reasons: (a) it's human-readable in dumps and admin UIs, and (b) it's a fallback total order if and only if the sequence saturates.

Tests We Wrote

TestCurrentMVCCReadVersion_ClampsToHighWater — read timestamp never precedes the high-water mark.
TestAllocateMVCCVersion_AdvancesHighWater — high-water is monotonic even when wall-clock samples drift backward.
TestBeginTransaction_DoesNotConflictAfterClockSkew — the original regression: bump high-water two seconds ahead, then commit; must not raise a phantom conflict.
TestSnapshotIsolationConflict_UsesTimestampWhenSequenceSaturated — sequence-first ordering, with timestamp fallback only at saturation.
TestAllocateMVCCVersion_FallsBackToTimestampOrderingWhenSequenceExhausted — when mvccSeq == ^uint64(0), advance the high-water timestamp by 1 ns rather than wrapping the counter to zero.

Time Until Overflow

mvccSeq is a uint64. Its maximum value is:

2^64 − 1 = 18,446,744,073,709,551,615

At a sustained commit rate of Q commits/sec, time-to-saturation is:

T = (2^64 − 1) / Q   seconds

Sustained commit rate	Time to overflow
1,000 commits/sec (heavy OLTP)	~584,942,417 years
1,000,000 commits/sec (extreme ingest)	~584,942 years
1,000,000,000 commits/sec (1B/sec, hypothetical)	~584.94 years
1,000,000,000,000 commits/sec (1T/sec, impossible)	~213 days

For reference, the Earth's projected remaining time before the Sun renders the surface uninhabitable is on the order of 10⁹ years. At 1 billion commits/sec — three orders of magnitude beyond what any single-machine database currently sustains — mvccSeq would still outlast a human civilization several times over. The saturation fallback exists for completeness, not because we expect anyone to hit it.

For a more defensible upper-bound argument: even if every Cypher query in our parser benchmark were a single committing write at peak parser throughput (~25.6 M qps for simple_match), T ≈ 22,800 years.

Why Not Other Techniques

A few alternatives we considered and rejected:

Hybrid Logical Clocks (HLC). HLC pairs a wall-clock with a logical counter and bumps the logical part on causality violations. It works well for distributed systems where you need wall-clock-aligned timestamps that also respect causality. For a single-node MVCC ordering, an atomic counter does the job with one-third the code and no max-skew tuning knob.

TrueTime / interval clocks. Spanner's TrueTime exposes [earliest, latest] bounds and waits out the uncertainty. This requires a hardware time source we don't have and introduces commit latency proportional to clock uncertainty. Overkill for single-node ordering.

runtime.nanotime via //go:linkname. Gives us a process-monotonic int64 of nanoseconds. Functionally close to what we want, but: (a) ties us to runtime internals that have changed in past Go releases, (b) is still per-process — useless for the eventual cross-node case where we'll want a counter that can be partitioned and merged.

time.Now() with a "monotonic clamp" only, no counter. This is what the high-water-mark mechanism does on its own. It prevents backward sampling, but it does not solve same-nanosecond ties. Two commits that legitimately land in the same UnixNano() bucket are unordered, and SI requires a total order.

The atomic counter is the smallest mechanism that provides the guarantee we need.

Closing Note

A useful heuristic: any time you find yourself reasoning about wall-clock ordering in code that runs faster than the wall clock can resolve, you have a logic bug waiting for an NTP correction to expose it. The fact that our Cypher parser executes in 39 ns isn't just a performance number — it's a correctness constraint on every other system in the database that wants to observe its output in order.

Beyond the Vector Wall: The Case for Microsecond Graph-RAG

TJ Sweet — Sun, 12 Apr 2026 16:37:54 +0000

Last August, I was investigating agentic flows and I foresaw the RAG landscape was about to hit a ceiling.

We were all chasing better "vibes" through chunking strategies and embedding model swaps, but the underlying structural rot was becoming impossible to ignore. I ran across an article on context engineering that articulated a shift I’d been sensing for months:

"Graph-RAG represents a paradigm shift from retrieving unstructured text chunks to retrieving structured knowledge from a Knowledge Graph (KG)... This approach offers contextual richness, explainability, and multi-hop reasoning by traversing paths in the graph." — ikala

The Linearizability Crisis

The industry’s reliance on pure vector search introduces a fundamental flaw: Semantic Clobbering. In a high-velocity environment, you cannot simply "stuff" data into a vector store and expect logic to emerge. Without a linearizable data model, a high-scoring recent insertion can—and will—corrupt the retrieval logic of established facts simply because it shares a similar embedding space.
RAG shouldn't be a lottery. If we want agentic systems that can actually reason over complex datasets, we need the structural integrity of a Knowledge Graph where entities and relationships are first-class citizens, not just collateral of a top_k search.

The Latency Bottleneck

Historically, Graph-RAG has been dismissed as "latency-prohibitive." The orchestration overhead—querying a Graph DB, fetching vectors, linearizing the subgraph, and then hitting the LLM—creates a "death by a thousand round-trips." If your agent needs to influence token generation in real-time, waiting 500ms for retrieval is a non-starter.
To enable true agentic flows, we have to bring graph-retrieval latencies down to the microsecond level. This isn't just an optimization; it's a prerequisite for the next generation of database architecture.

Architectural Consolidation

We are seeing the consequences of architectural fragmentation everywhere. Developers are drowning in:

Retrieval Inconsistency: Data clobbering and ranking noise.
Service Bloat: Managing fragmented services for graph, vector, and logic.
Deployment Friction: The lack of manageable, consolidated systems that co-locate storage and compute. I started building my own solution last year because the writing was on the wall. The future of RAG isn't just "more data"—it's the consolidation of service layers into a high-performance, low-latency engine that treats the graph and the vector as a single, unified context source. We don't need better wrappers; we need to rethink how the data lives in the first place.

I’m looking for a small number of maintainers for NornicDB

TJ Sweet — Mon, 30 Mar 2026 23:18:15 +0000

NornicDB is a Neo4j-compatible graph + vector database with MVCC, auditability-oriented features, hybrid retrieval, and a strong bias toward performance and operational simplicity. It’s the kind of system where correctness, latency, storage behavior, and developer ergonomics all matter at the same time.

I’m not looking for “contributors” in the generic open source sense. I’m looking for people whose engineering habits match the shape of this project.

The people I work best with tend to have a few things in common:

they use agentic tooling well, but don’t use it as a substitute for taste or rigor
they like spec-driven development, not just coding until tests pass
they default to TDD or regression-first work when touching complex systems
they care about performance, memory behavior, query shapes, and hot paths
they care about developer experience, naming, docs, tooling, and maintainability
they can hold correctness and pragmatism in their head at the same time
they are comfortable working on systems that have database, query engine, protocol, and infrastructure concerns all mixed together

This is not a beginner-friendly maintenance surface. It’s a real database codebase, and a lot of the work sits in the uncomfortable middle where product expectations, compatibility, performance, and internal simplicity all pull in different directions.

The kinds of things maintainers might work on:

Cypher and Bolt compatibility
MVCC and transactional behavior
vector and hybrid retrieval execution paths
storage engine correctness and performance
audit/history/retention semantics
benchmarks, profiling, and allocation reduction
test infrastructure, spec coverage, and regression prevention
docs and tooling that improve the contributor experience

I care much more about engineering taste than resume pedigree.

If you’re the kind of person who:

writes tests for bugs before fixing them
gets annoyed by hidden allocations and avoidable abstractions
wants docs and tooling to be part of the product, not an afterthought
uses modern AI tooling to move faster, but still insists on clear specs and defensible code
likes the idea of maintaining infrastructure that other serious teams can trust

I’d like to talk.

If that sounds like you, reply here or DM me with:

a few lines on what kinds of systems work you like
links to anything you’ve built, maintained, or profiled
what parts of NornicDB you’d most want to touch

I’m intentionally looking for a small number of strong fits, not a large intake.

The "Boxing In" Strategy: How Go is the Goldilocks Language for AI-Assisted Engineering

TJ Sweet — Sun, 29 Mar 2026 17:20:24 +0000

There is a growing realization among developers using AI agents like Cursor, Windsurf, or GitHub Copilot: the choice of programming language is no longer just about runtime performance or ecosystem. It is now about LLM Steering.

During the development of NornicDB and other projects, I used AI-assisted engineering. I want to make a clear distinction here: this is not "vibe coding." To me, "vibing" is just going with whatever the AI suggests—a passive approach that often leads to technical debt.

AI-assisted engineering is a deliberate, high-rigor cycle: using AI for research and planning, drafting a spec, reviewing it, whiteboarding the logic, using the AI to validate the theory in isolated code, and then applying it to the project. In this workflow, Go is structurally unique. It doesn't just run well; it "boxes in" the AI during that final implementation phase, preventing the hallucination-filled "spaghetti" that often plagues AI-generated code in more flexible languages.

1. The "GPS" Effect: Forcing Explicit Intent

The greatest weakness of LLMs is abstraction drift. In languages with deep inheritance or highly flexible functional patterns (like TypeScript or Python), an AI often loses the architectural thread, suggesting three different ways to solve the same problem.

Go solves this by being intentionally limited:

Package Boundaries: Go’s strict folder-to-package mapping acts as a physical guardrail. The LLM is structurally discouraged from creating complex, circular dependencies.
No "Magic": Because Go lacks hidden meta-programming, complex decorators, or deep class hierarchies, the AI is forced to write explicit code.

My Opinion: I believe that for a probabilistic model like an LLM, "explicit" is synonymous with "predictable." By narrowing the solution space to a few idiomatic paths, Go acts as a structural GPS. It doesn't let the AI get "too clever," which is usually when logic begins to break down.

2. The OODA Loop: Validating Theory at Scale

A core part of my engineering process is using AI to validate a theory in code before it ever touches the main repository. Go’s near-instant compilation makes this Observe-Orient-Decide-Act (OODA) loop incredibly tight.

Instant Feedback: If a validation cycle takes 30 seconds (common in C++ or heavy Java apps), the momentum of the engineering process dies. Go allows me to test a theoretical concurrency pattern or a pointer-safety fix in milliseconds.
Tooling Synergy: Because go fmt, go test, and go race are standard and built-in, the AI can generate and run validation tests that match production standards immediately.

3. Logical Cross-Pollination (The C/C++ Factor)

I’ve noticed anecdotally that LLMs seem to leverage their massive training data in C and C++ to improve their Go logic. While the syntax differs, the underlying systems logic—concurrency patterns, pointer safety, and memory alignment—is highly transferable.

The Logic Transfer: Algorithmic patterns (like HNSW for vector search or MVCC for transaction isolation) translate beautifully from C++ logic into Go implementation.
The "Contamination" Risk (Criticism): You must be the "Adult in the Room." Because Go looks like the C-family, LLMs will occasionally try to write "Go-flavored C," attempting manual memory management or pointer arithmetic that fights Go’s garbage collector. This is why the Review and Whiteboarding stages of my process are non-negotiable.

Proof of Concept: The NornicDB Experience

When I implemented Snapshot Isolation (SI) and a BYOM (Bring Your Own Model) embedding engine into NornicDB, the AI didn't just "vibe" out the code. We went through a rigorous spec and validation phase.

Because Go handles concurrency through core keywords (channels/select), the AI-generated implementation of that spec was structurally sound from the first draft. In more permissive languages, the AI might have suggested five different async libraries; in Go, it just followed the spec into a select block.

The result? A hybrid system that hits ~0.6ms P50 for vector search and ~1.6ms for 1-hop graph traversals. The "box" didn't limit the performance—it ensured the AI built it correctly according to the plan.

Conclusion: Boxes, Not Blank Canvases

If you’re struggling with AI-assisted development, stop giving your agents a blank canvas. A blank canvas is where hallucinations happen. Give them a box.

Go is that box. It isn’t opinionated in a way that restricts your freedom, but it is foundational in a way that forces the AI to implement your validated vision with rigor. When the language enforces the boundaries, the engineer is finally free to focus on the high-level architecture and the deep planning that "vibe coding" often skips.

Is Go the perfect language? No. But for a rigorous AI-assisted engineering workflow, it’s the most reliable one we have.

I am the author of **NornicDB, an open-source hybrid database. You can see how these engineering patterns resulted in high-performance infrastructure at github.com/orneryd/NornicDB.

~1ms hybrid graph + vector queries (network is now the bottleneck)

TJ Sweet — Thu, 26 Mar 2026 00:16:17 +0000

I finally have benchmark results worth sharing.

TL;DR

~0.6ms p50 — vector search
~1.6ms p50 — vector + 1-hop graph traversal
~6k–15k req/s locally

When deployed remotely:

~110ms p50, which exactly matches network latency

→ The database is fast enough that the network dominates total latency

What was tested

Two query types:

Vector only (embedding similarity, top-k)
Vector + one-hop graph traversal (expand into knowledge graph)

Each run:

800 requests
noisy / real-ish text inputs
concurrent execution

Local (M3 Max 64GB Native MacOS Installer)

Vector only

p50: ~0.58ms
p95: ~0.80ms
~15.7k req/s

Vector + graph

p50: ~1.6ms
p95: ~2.3ms
~6k req/s

Remote (GCP, 8 cores, 32GB RAM)

Client → server latency: ~110ms

Vector only

p50: ~110.7ms

Vector + graph

p50: ~112.9ms

The delta between local and remote ≈ network RTT.

What’s interesting

Adding graph traversal costs ~1ms
Latency distribution is tight (low variance)
Hybrid queries behave almost like constant-time at small depth

Most systems treat this as:

vector DB + graph DB + glue code

This is:

one execution engine

How this compares (public numbers)

Vector DBs (Pinecone / Weaviate / Qdrant)

Typically 5–50ms p50 depending on index + scale
Often network + ANN dominates

Neo4j (graph + vector)

Graph queries: typically 10–100ms+ depending on traversal
Vector is a newer add-on layer

TigerGraph

Strong traversal performance (parallelized)
Still generally multi-ms to 10s of ms for real queries

Important caveats

These are single-node, in-memory-ish conditions
Dataset is not at billion-scale (yet)
Remote throughput is latency-bound, not compute-bound
Found a response consistency bug (fixed next)

What this suggests

If hybrid queries are:

~1–2ms compute
+100ms network

Then optimizing the DB further doesn’t matter unless:

you colocate compute
or batch / pipeline queries

Takeaway

We’re hitting a point where:

hybrid retrieval is cheaper than the network it rides on

Looking for feedback on:

deeper traversal benchmarks (2–3 hops)
scaling behavior (dataset + concurrency)
fair comparisons vs existing systems
real-world workloads (RAG, entity resolution, etc.)

If this resonates (or sounds wrong), I’d love to hear why.

Addendum: test setup + external verification

For anyone who wants to reproduce or challenge these numbers: the benchmark used a single-node dataset with 67,280 nodes, 40,921 edges, and 67,298 embeddings indexed with HNSW (CPU-only). Workload was 800 requests/query type, noisy natural-language prompts, concurrent clients, and two query shapes: (1) vector top-k, (2) vector top-k + 1-hop graph expansion over returned entities. Local runs were on an M3 Max locally with the native installer; remote runs were on GCP (8 vCPU, 32GB RAM).

The key observation is straightforward: local compute stayed in low-ms, while remote p50 tracked client↔server RTT (~110ms), so end-to-end latency was network-bound. If you run this yourself, please share p50/p95, dataset size, and hop depth so results are directly comparable.

Item	Value
Nodes	67,280
Edges	40,921
Embeddings	67,298
Vector index	HNSW, CPU-only
Request count	800 per query type
Query types	Vector top-k; Vector top-k + 1-hop traversal

Verification queries (same shape)

# Vector-only (same query shape as benchmark)
curl -s -u "$NORNIC_USERNAME:$NORNIC_PASSWORD" "$ENDPOINT" \
  -H "Content-Type: application/json" -H "Accept: application/json" \
  -d '{
    "statements":[
      {
        "statement":"CALL db.index.vector.queryNodes('\''idx_original_text'\'', $topK, $text) YIELD node, score RETURN node.originalText AS originalText, score ORDER BY score DESC LIMIT $topK",
        "parameters":{"text":"get it delivered","topK":5},
        "resultDataContents":["row"]
      }
    ]
  }'

# Vector + one-hop graph (same query shape as benchmark)
curl -s -u "$NORNIC_USERNAME:$NORNIC_PASSWORD" "$ENDPOINT" \
  -H "Content-Type: application/json" -H "Accept: application/json" \
  -d '{
    "statements":[
      {
        "statement":"CALL db.index.vector.queryNodes('\''idx_original_text'\'', $topK, $text) YIELD node, score MATCH (node:OriginalText)-[:TRANSLATES_TO]->(t:TranslatedText) WHERE t.language = $targetLang RETURN node.originalText AS originalText, score, t.language AS language, coalesce(t.auditedText, t.translatedText) AS translatedText ORDER BY score DESC, language LIMIT $topK",
        "parameters":{"text":"get it delivered","topK":5,"targetLang":"es"},
        "resultDataContents":["row"]
      }
    ]
  }'

https://github.com/orneryd/NornicDB/releases/tag/v1.0.33

Building a Low-Latency MVCC Graph+Vector Database: The Pitfalls That Actually Matter

TJ Sweet — Wed, 25 Mar 2026 15:01:22 +0000

Most posts about graph+vector systems focus on feature lists. The hard part is not features. It is maintaining low tail latency while preserving snapshot isolation, temporal history, and managed embeddings in one database runtime.

This post focuses on the non-obvious engineering problems that showed up in production-like conditions, and the techniques that actually resolved them.

1) Latency budgets are architecture budgets

For hybrid retrieval, every boundary in the online path (transport, embedding, retrieval, rerank, graph materialization) adds fixed cost. If you need “instant-feeling” responses, boundary placement is a performance decision, not just an org-chart decision.

The practical pattern is:

Keep protocol flexibility at the edge.
Keep the hot retrieval and consistency path tight.

2) Snapshot isolation for graphs requires topology-aware validation

In graph storage, SI is not just “row version check at commit.” You must validate graph structure races:

edge creation racing with endpoint deletion
concurrent adjacency mutations around node deletes
traversal visibility consistency across snapshot boundaries

Without topology-aware commit validation, you can pass SI-style checks and still commit structurally invalid graph states.

3) MVCC retention can create historical lookup cliffs

Once you introduce pruning, historical reads can degrade badly if lookup depends on sparse post-prune chains. This becomes visible only under real retention churn.

The fix is to persist per-key retention anchors in MVCC metadata and resolve historical visibility from deterministic retained floors, not optimistic chain walking. That stabilizes historical lookups even after repeated prune cycles.

4) “Current-only” indexing is mandatory when history exists

Temporal history and online retrieval have different goals:

temporal history exists for audit/reconstruction
online search exists for current relevance

If historical versions leak into live vector/keyword indexes, retrieval quality drifts and stale entities contaminate candidates. Current-only indexing for live search avoids that failure mode while preserving full historical queryability through MVCC/temporal APIs.

5) Async embeddings create intentional dual-latency behavior

When the database manages embeddings, write behavior naturally splits:

fast commit path for transactional state
deferred embedding work with longer completion windows

This is expected. The requirement is clear operational semantics and instrumentation that separates:

commit latency
queueing latency
embedding execution latency

Without that separation, healthy async behavior gets misdiagnosed as storage/query regressions.

6) NFS exposed lock contention that fast local storage hid

A key lesson from this release cycle: moving to Docker + NFS did not just make things slower, it changed what was visible.

On very fast local storage, some lock contention patterns were masked by short I/O stalls. Under NFS latency variance, those same code paths held locks across work that did not need to be in the critical section. Tail spikes made the contention obvious.

What changed:

We narrowed lock scope in storage API hot paths.
We applied targeted unlock/relock boundaries around non-critical, longer-running work.
We kept correctness-sensitive state transitions inside the minimal protected region.

The result was not “NFS became fast.” The result was that storage-path lock contention stopped amplifying NFS latency into avoidable tail spikes.

7) Conflict semantics and retries are part of performance, not just correctness

Under contention, raw engine-specific conflict leaks produce unstable client behavior and poor retry patterns. Normalizing conflicts into a consistent retryable class and using bounded retry helpers at API boundaries improves both correctness and latency predictability under concurrent load.

8) Timings must be interpreted by query shape, not averages

Mixed workloads contain fundamentally different operations:

point lookups
indexed reads
bulk scans/deletes
embedding-triggering writes
validation queries

Microsecond reads and multi-second maintenance or async-adjacent writes can coexist in the same healthy system. Performance analysis only makes sense when timings are tied to operation class and execution path.

Closing

The core challenge in this category is not “graph” and not “vector” in isolation. It is enforcing one coherent consistency and latency contract across transactional graph state, temporal history, and managed embedding workflows.

The pitfalls above are where that contract usually breaks. They are also where the most meaningful performance and reliability gains came from in practice.

In the New Agentic World: The Software Career Ladder Is Being Rewritten

TJ Sweet — Wed, 04 Mar 2026 16:39:23 +0000

I’m going to be direct: software is not going away, but the shape of software work is changing faster than most people are willing to admit.

I believe we are entering an agentic era where AI doesn’t just autocomplete code, it co-implements systems. That changes who gets hired, what skills are considered “core,” and where human judgment still matters.

This post is intentionally forward-looking. I’ll separate my opinion/projection from what is currently supported by evidence.

My Thesis (Opinion)

1) Software architecture becomes rarer, higher-stakes, and more formal

I expect fewer people to hold true architecture roles, and those roles to become more selective and possibly more credentialed over time. In an agentic world, architecture is no longer “draw boxes and arrows.” It becomes:

defining system boundaries AI can safely operate within,
setting policy and compliance constraints,
owning failure modes and rollback design,
deciding what must stay deterministic vs probabilistic.

In short: fewer architects, but more responsibility per architect.

2) Data engineering fluency becomes the new baseline for “software engineer”

I expect a big shift where what we currently call “data engineering” becomes normal engineering literacy. If your product has AI in it, then data quality, lineage, retrieval, embedding strategy, and observability are not specialist concerns - they’re table stakes.

My stronger take: the engineer who can’t reason about data pipelines and model interfaces will feel like a frontend engineer in 2008 who refused to learn JavaScript.

3) DevOps does not disappear - it evolves into AI governance in production

DevOps/SRE becomes even more critical. The work shifts toward validating AI-proposed changes, enforcing guardrails, and making sure “it worked in a prompt” doesn’t become “we took down prod.”

Infra will be increasingly generated, but trust will still be earned through verification, policy, and incident response.

4) The entry-level ladder is getting steeper, right now

The painful truth: a lot of junior-level code tasks are exactly what agents absorb first. That doesn’t mean juniors are useless; it means old apprenticeship pathways are breaking before new ones are built.

Large companies with structured graduate programs may keep hiring at scale. Everyone else may expect “AI-augmented mid-level output” from day one.

What Current Evidence Supports

Strong support: AI/data skills are rising fast

The WEF Future of Jobs 2025 reports strong growth in AI, big data, and software-related roles and skills. Source: World Economic Forum
BLS still projects strong software developer growth while data-centric occupations remain among the fastest-rising categories. Source: U.S. Bureau of Labor Statistics

Strong support: DevOps/platform quality becomes more important with AI

DORA findings suggest AI can improve parts of the development process, but delivery outcomes depend heavily on platform quality and operational fundamentals. Source: DORA 2024 Report

Moderate-to-strong support: junior role pressure is real, but uneven

There is growing evidence and credible discussion that entry-level pathways are under pressure as AI handles routine implementation work.
At the same time, hiring patterns are uneven by company type, geography, and maturity of internal training pipelines.

Strong support: AI coding tools increase productivity in many contexts - but not automatically

Studies around AI coding assistants show productivity and confidence gains in many settings.
Results vary by team process, review culture, and test rigor; quality regressions can occur without guardrails.
- GitHub Copilot research
- Stack Overflow 2024 AI survey

Where I’m Projecting Beyond the Data (And I’m Owning That)

These are my bets, not settled facts:

“Few software architects”: evidence shows role polarization, but not definitive proof of formal accreditation.
“Data engineering becomes the default SWE identity”: evidence supports convergence of skills, but not full replacement of traditional engineering tracks.
“Junior ladder pulled up”: directionally supported, but likely to be cyclical and industry-dependent rather than absolute.

A Practical Career Map for the Agentic Era

If you’re a student, junior, or mid-level engineer, here’s the adaptation path I believe matters most:

Learn systems + data together

Build things where retrieval, metrics, and model behavior are first-class concerns.
Treat AI output as untrusted code

Verification, testing, and failure analysis are career accelerators now.
Develop “prompt-to-production” judgment

Anyone can generate code; fewer people can make safe, maintainable, compliant systems.
Build in public with measurable outcomes

Show latency reductions, lower error rates, improved reliability - not just demos.
Get good at platform constraints

CI/CD, policy-as-code, secrets, observability, rollback plans: this is where human leverage compounds.

Final Take

I don’t think software engineering is dying. I think it’s splitting.

One path becomes high-trust engineering: architecture, data systems, platform reliability, and governance.

The other path becomes commoditized implementation mediated by agents.

My opinion is simple: the winners won’t be the people who “use AI.”

They’ll be the people who can direct, constrain, verify, and operationalize AI at system level.

That’s the new craft.

The Full Graph-RAG Stack As Declarative Pipelines in Cypher

TJ Sweet — Wed, 04 Mar 2026 16:15:04 +0000

Most RAG systems aren’t architected so much as assembled:

embedding service
vector search service
reranker
LLM endpoint
application glue for retries, timeouts, auth, and marshaling

It works, until you spend more time maintaining orchestration than improving retrieval quality.

https://github.com/orneryd/NornicDB/commits/main/?since=2026-03-03&until=2026-03-03

This update to NornicDB changes that model: retrieval, embedding, reranking, and inference are now first-class Cypher procedures. The important part is not “new APIs.” The important part is that these stages now execute as part of the query engine.

What landed

New Cypher primitives:

db.retrieve
db.rretrieve
db.rerank
db.infer
db.index.vector.embed

These are read-oriented pipeline operators designed to compose inside Cypher, not wrappers around separate app-tier flows.

Why this is materially different

1) The pipeline is now declarative and inspectable

Instead of this in app code:

const embedding = await embed(query)
const results = await vectorSearch(embedding)
const reranked = await rerank(query, results)
const answer = await infer(query, reranked)

you can express the same intent in Cypher:

CALL db.index.vector.embed($query) YIELD embedding
CALL db.index.vector.queryNodes('doc_index', 20, embedding) YIELD node, score
WITH collect({id: node.id, content: coalesce(node.content, toString(node)), score: score}) AS candidates
CALL db.rerank({query: $query, candidates: candidates, rerankTopK: 20}) YIELD id, final_score
RETURN id, final_score

That makes the pipeline:

versionable
benchmarkable
testable
visible to query execution semantics

2) Less orchestration overhead in the hot path

You still call models. But you remove a lot of unnecessary app-layer choreography between stages:

fewer service hops
less JSON marshalling back-and-forth
fewer per-hop retries/timeouts to coordinate

This reduces tail-latency and shrinks the operational failure surface.

3) Graph + semantic logic are fused in one plan

Because these are Cypher stages, you can combine semantic retrieval with graph constraints directly:

MATCH (u:User {id: $userId})-[:MEMBER_OF]->(g:Group)
CALL db.index.vector.embed($query) YIELD embedding
CALL db.index.vector.queryNodes('doc_index', 50, embedding) YIELD node, score
WHERE (node)-[:VISIBLE_TO]->(g)
WITH collect({id: node.id, content: coalesce(node.content, toString(node)), score: score}) AS candidates
CALL db.rerank({query: $query, candidates: candidates}) YIELD id, final_score
RETURN id, final_score

This is not “vector DB results, then post-filter in app code.” It’s one composable query flow.

Query planner + cache: why this is practical, not just ergonomic

Adding new procedures is easy. Making them production-usable is harder. The key enabler is how query planning and caching interact with these primitives.

Planning path

NornicDB already routes CALL procedures through the Cypher executor dispatch path. That means these RAG primitives participate in the same execution flow as other query stages, rather than being side-channel operations.

Query plan cache

NornicDB keeps a parsed/structured plan cache for repeated query shapes. For RAG workloads, this matters because many queries are template-like:

same Cypher structure
different parameters ($query, $userId, etc.)

So the engine avoids repeated parse/analysis overhead for the same pipeline shape, and only rebinds inputs.

Result cache policy (important boundaries)

Read-query result caching now treats these primitives intentionally:

cacheable by default:
- db.retrieve
- db.rretrieve
- db.rerank
- db.index.vector.embed
db.infer is not cached by default
- can be opted in per call (cache: true / cache_enabled: true)

This is the right split:

retrieval/rerank/embed are often deterministic enough for reuse under normal invalidation rules
inference can be non-deterministic and should require explicit opt-in

Correctness under writes

Cached read results follow normal invalidation behavior on writes. So this is not “cache forever”; it is “cache when safe, invalidate on data mutation.”

Net effect: you keep low overhead for repeated pipeline templates without pretending inference is always deterministic.

Procedure boundaries (clear contract)

db.retrieve: retrieval stage
db.rretrieve: retrieval shorthand, auto-rerank if configured/available
db.rerank: true Stage-2 API over caller-provided candidates (does not run retrieval)
db.index.vector.embed: returns embedding array for explicit manual pipeline control
db.infer: inference stage, default non-cached

That split keeps simple flows short and advanced flows explicit.

What this is not

This does not mean:

instant hosted model platform
one-line “AI solved” pipeline
no tradeoffs in model/provider choice

You still choose providers and quality/latency/cost tradeoffs. What changed is where orchestration logic lives.

Common patterns today:

vector systems with retrieval APIs, but app-driven orchestration
graph + external RAG glue
managed black-box pipelines with limited control

This approach is different: orchestration becomes query-native and composable in Cypher, with planner/cache semantics instead of ad-hoc application control flow.

Why this matters

The main gain is not syntactic convenience. It is reducing accidental complexity:

fewer moving parts outside the data layer
fewer duplicated pipelines across services/repos
better observability and repeatability for retrieval flows
easier benchmarkability of real pipeline templates

The strategic question shifts from:

“How should we glue these services together?”

to:

“Which query pipeline shape should we run for this workload?”

That is a better problem to have.

Architectural Consolidation for Low-Latency Retrieval Systems: Why We Co-Located Transport, Embedding, Search, and Reranking

TJ Sweet — Mon, 02 Mar 2026 18:13:10 +0000

Most Graph-RAG systems are built as a chain of services:

API ingress
query embedding service
vector DB
sparse/BM25 service
fusion/rerank service
generation service

That decomposition is clean on paper. It is rarely cheap on the critical path.

NornicDB made a deliberate architectural trade: co-locate the online retrieval path in one runtime/container (transport, query embedding, retrieval, fusion, rerank, and response assembly) and optimize that path hard.

This post is about that choice: what it buys, what it costs, and how we mitigate the costs in code today.

Why consolidate at all?

If you split 5 stages across services and each boundary adds even ~1.0–1.5 ms of serialization/network/scheduler overhead, you can burn 5–7.5 ms before meaningful retrieval work.

That’s basically the whole budget for “feels instant” search.

In the co-located NornicDB path, we cut most of that boundary tax out. In the run you shared (1M corpus setup), we saw:

2026/02/18 08:01:14 🔍 Search request database="nornic" query="where a prescriptions?"
2026/02/18 08:01:14 ⏱️ Search timing: method=rrf_hybrid cache_hit=false fallback=false total_ms=0 vector_ms=0 bm25_ms=0 fusion_ms=0 candidates[v=26,b=0,f=26] returned=20 query="where a prescriptions?"
[HTTP] POST /nornicdb/search 200 7.96575ms
2026/02/18 08:01:36 🔍 Search request database="nornic" query="where to get the drugs?"
2026/02/18 08:01:36 ⏱️ Search timing: method=rrf_hybrid cache_hit=false fallback=false total_ms=0 vector_ms=0 bm25_ms=0 fusion_ms=0 candidates[v=26,b=0,f=26] returned=20 query="where to get the drugs?"
[HTTP] POST /nornicdb/search 200 7.334291ms

Mean from those two samples: ~7.65 ms end-to-end HTTP.

The architectural shape we optimized for

NornicDB keeps compatibility/protocol flexibility at the edge (Bolt/Cypher, REST, GraphQL, gRPC including Qdrant-compatible flows), but collapses online retrieval internals into one operational surface:

in-process embedding path
in-process hybrid retrieval orchestration
in-process optional stage-2 reranking
in-process transactional graph + vector state

This is the core reason deployment can be “single container, one runtime, one rollback unit” instead of “service choreography.”

Why compressed ANN exists in this architecture

Compression wasn’t added as a “nice-to-have index type.”

It was added as a scaling lever that preserves the single-runtime model longer.

Raw 1024-d float32 vector = 1024 × 4 = 4096 bytes before indexing overhead.

At scale, memory bandwidth and cache locality become the bottleneck, not just algorithmic complexity.

With IVFPQ-style compression, vector payload can drop by orders of magnitude (profile-dependent), which improves:

in-memory density
cache residency
tail-latency stability under load
throughput per dollar on fixed hardware

In code, compressed mode is explicitly gated and safety-wrapped:

pkg/search/search.go uses compressed profile resolution
if compressed profile is inactive -> standard path
if compressed build/load fails -> automatic fallback to standard path

So compression is a scalability primitive, not a reliability gamble.

Costs of co-location, and how NornicDB mitigates each one

1) Reduced independent scaling of subcomponents

Risk: embedding/rerank/generation can’t be scaled as separate deployments as easily.

Mitigations implemented:

Per-database overrides for embedding/search/HNSW/k-means and related knobs (docs/operations/configuration.md), so you can tune behavior by workload without splitting the whole system.
Provider decoupling at runtime: embedding and rerank can be local or external (OpenAI/Ollama/HTTP) via config (pkg/server/server.go, docs/operations/configuration.md).
Planned next step: sharding roadmap (docs/plans/sharding*.md) for horizontal scale without returning to “everything is a remote hop.”

2) Tighter resource coupling (CPU/memory/cache contention)

Risk: one process means shared contention.

Mitigations implemented:

File-backed vector store path to bound RAM during large builds and persistence (pkg/search/search.go: vectorFileStore low-RAM path).
Runtime strategy switching across CPU brute/GPU brute/HNSW using thresholds (pkg/search/search.go, docs/operations/configuration.md), with debounced transitions and replay-before-cutover behavior.
Compressed ANN mode to reduce memory footprint and bandwidth pressure at high vector counts.
Async write and queue controls exposed via config for throughput/consistency tuning (docs/operations/configuration.md).

3) Larger blast radius per deploy

Risk: one deploy can affect the full online path.

Mitigations implemented:

Fail-open reranking load path: server starts immediately; reranker loads async; if unavailable/health-check fails, search continues without stage-2 rerank (pkg/server/server.go).
Fail-open rerank execution: rerank errors revert to original order (pkg/search/search.go).
Compressed ANN fallback: compression failures fall back to standard retrieval path (pkg/search/search.go).
Version/compat checks + rebuild path for persisted indexes (docs/operations/configuration.md, pkg/search/search.go).

4) Harder team autonomy boundaries

Risk: fewer service boundaries can blur ownership.

Mitigations implemented:

Explicit extension seams via plugin systems (APOC-style and Heimdall plugin interfaces) in docs/user-guides/heimdall-plugins.md.
Protocol boundaries remain explicit at API edges (Bolt/Cypher, REST, GraphQL, gRPC), so interface ownership is still clear even when runtime is co-located.

5) Vendor/runtime lock-in risk

Risk: too many in-process optimizations can trap you in one stack.

Mitigations implemented:

Protocol pluralism in the product surface: Bolt/Cypher, REST, GraphQL, Qdrant-compatible gRPC, additive native gRPC (README.md).
Provider pluralism for model execution: local + external provider modes for embedding/rerank (docs/operations/configuration.md, pkg/server/server.go).
Compatibility-first stance (Neo4j + Qdrant workflows) keeps migration cost low.

Tradeoff summary

NornicDB’s stance is not “microservices are bad.”

It’s: for this workload, on this latency budget, boundary placement is a performance decision first.

If your top concern is strict per-stage org isolation, split services.
If your top concern is single-digit-ms retrieval with simpler operations, co-location wins more often.

NornicDB chose co-location, then added mitigations to avoid common co-location failure modes:

configurable per-DB policy
runtime strategy adaptation
compressed ANN for memory scale
fail-open degradation paths
future sharding trajectory

That combination is the architecture story:

one deployable runtime today, with deliberate seams for scale tomorrow.

Cutting Cypher Latency: Streaming Traversal and Query-Shape Specialization in NornicDB

TJ Sweet — Thu, 26 Feb 2026 18:56:46 +0000

Below are the headline numbers that motivated the execution model choices in NornicDB. They’re presented first so you can calibrate the rest of the post: the goal is not “benchmarks as marketing,” but to show the scale of the overhead we’re targeting and then explain where it comes from.

Results at a glance (same hardware)

LDBC Social Network Benchmark (M3 Max, 64GB)

Query Type	NornicDB	Neo4j	Speedup
Message content lookup	6,389 ops/sec	518 ops/sec	12×
Recent messages (friends)	2,769 ops/sec	108 ops/sec	25×
Avg friends per city	4,713 ops/sec	91 ops/sec	52×
Tag co-occurrence	2,076 ops/sec	65 ops/sec	32×

Northwind Benchmark (M3 Max, 64GB)

Operation	NornicDB	Neo4j	Speedup
Index lookup	7,623 ops/sec	2,143 ops/sec	3.6×
Count nodes	5,253 ops/sec	798 ops/sec	6.6×
Write: node	5,578 ops/sec	1,690 ops/sec	3.3×
Write: edge	6,626 ops/sec	1,611 ops/sec	4.1×

Parser mode comparison (Northwind query suite)

NornicDB supports two Cypher parser modes that can be switched at runtime:

⚡ nornic (default): lightweight validation + direct execution
🌳 antlr: strict OpenCypher parsing + full parse tree (better diagnostics, higher overhead)

Query	⚡ nornic	🌳 antlr	Slowdown
Count all nodes	3,272 hz	45 hz	73×
Count all relationships	3,693 hz	50 hz	74×
Find customer by ID	4,213 hz	2,153 hz	2×
Products supplied by supplier	4,023 hz	53 hz	76×
Supplier→Category traversal	3,225 hz	22 hz	147×
Products with/without orders	3,881 hz	0.82 hz	4,753×
Create/delete relationship	3,974 hz	62 hz	64×

Suite runtime:

Mode	Total time
⚡ nornic	17.5s
🌳 antlr	35.3s

Those deltas—especially the big outliers—are what this post is about: where does that overhead come from, and what changes when you design around it?

The problem with “general” execution pipelines

Most mature databases follow a layered approach:

Parse query text into a syntax tree
Build a logical plan
Optimize the plan (often cost-based)
Produce a physical plan
Execute the plan using a generic operator runtime

That architecture has real advantages: flexibility, correctness, and a framework for optimizing complex queries. But it also has costs that show up in production for common graph workloads:

Row-by-row operator overhead (Volcano-style pipelines) can dominate lightweight traversals.
Intermediate materialization increases memory traffic.
Object churn and indirections increase GC pressure and cache misses.
Planning overhead becomes noticeable when queries are small but frequent.

For many real-world graph applications—lookups, short traversals, neighborhood expansions, and simple aggregations—those overheads can outweigh the actual graph work.

What we built: a hybrid engine with streaming fast paths

NornicDB takes a hybrid approach:

A general Cypher engine to support a wide set of queries.
Optimized streaming executors for common traversal + aggregation shapes.
Runtime-switchable parsing modes to trade strictness/debuggability for throughput.

The default production mode favors minimal overhead in the hot path. For query shapes we know are common, we aim to fuse pattern matching and aggregation into tight loops and avoid expensive intermediate structures.

Stream-parse-execute (default mode)

In the default “nornic” parser mode, the engine is designed around a stream-parse-execute approach. The intent is to avoid building heavy intermediate parse structures when we don’t need them, and to push execution decisions into a lightweight, shape-aware path.

This is not a claim that NornicDB has “no planning” anywhere. The codebase still contains analysis artifacts and caching for specific features. The claim is narrower and more useful:

For common traversal and aggregation shapes, NornicDB bypasses generic logical-plan execution and uses pattern-specialized, single-pass streaming executors.

Strict parsing when you want it: ANTLR mode

NornicDB also supports an ANTLR-based parser mode. This mode is stricter and provides better error reporting (line/column), which is valuable during development and debugging. It’s also more expensive: building full parse trees and walking them introduces overhead that can dominate certain query classes.

That tradeoff is intentional. The same engine can run in:

Production mode (lower overhead, practical throughput)
Debug mode (strict validation and better diagnostics)

Why this model performs well

Performance improvements come from removing layers of overhead on the path that matters most for many graph workloads: traversal + filter + aggregate.

1) Fused traversal and aggregation

For eligible query shapes, NornicDB executes traversal and aggregation in a single pass. Instead of producing intermediate row sets and feeding them through multiple generic operators, the executor performs direct scans and aggregates as it traverses.

2) Streaming execution and early termination

For a subset of query shapes, NornicDB’s execution can stream results and short-circuit work early—for example, when a query contains a LIMIT and the engine can stop once enough rows are produced.

A precise statement is:

Streaming traversal is real for optimized query classes, including LIMIT short-circuiting and selected no-materialization fast paths. This is shape-dependent, not universal for every Cypher query.

3) Fewer intermediate structures in hot paths

The largest gains often come not from clever algorithms, but from not doing unnecessary work:

Avoiding full path materialization when only aggregates are needed
Avoiding row-by-row operator dispatch
Avoiding heavy parse trees in the production fast path

In traversal-heavy workloads, these effects compound.

A note on correctness: constraints and transactions

Performance only matters if results are correct and operations are safe.

NornicDB is not just a query interpreter. It includes:

Schema constraints and validation logic
Explicit transaction control (BEGIN / COMMIT / ROLLBACK)
Storage-backed transaction handling for supported backends

A publication-safe way to state this is:

NornicDB enforces schema constraints and supports explicit storage-backed transactions, while also using optimized fast paths for eligible query shapes.

The real tradeoff: hot-path query shape management

The largest downside of shape-specialized execution isn’t performance—it’s organizational cost.

Every optimized path has a lifecycle:

Detect and classify the shape reliably
Implement an optimized executor
Prove semantic equivalence with the general engine
Add regression tests and performance baselines
Keep it correct as Cypher features expand

This is real management overhead, and historically it’s why many engines converge on generic operator runtimes.

Why this tradeoff looks different now

Historically, query-shape specialization has high human overhead. In an agent-driven world, the workload is more template-like, and agents can automate the specialization loop: mine top shapes, generate optimized executors, generate differential tests against a reference engine, and maintain coverage metrics. This shifts the work from manual tuning to automated verification and makes specialized execution economically viable again.

The key point isn’t that AI “writes the database for you.” It’s that:

Workloads become more template-like when generated by tools and agents.
Specialization can be treated as a pipeline: observe → prioritize → implement → verify → measure.

What this model is best at (and what it’s not)

This execution model shines when:

Queries are traversal-heavy and relatively structured
Workloads are dominated by a small set of templates
You care about low-latency and predictable performance
Aggregations can be fused into traversal

It’s not designed to claim universal dominance in every Cypher edge case. There will always be queries where a deep optimizer and a fully generalized runtime are the right tools. NornicDB’s approach is to optimize what matters most and retain a general path for everything else.

Closing thoughts

NornicDB’s execution model is a deliberate choice: remove overhead from the hot path by using streaming, shape-specialized executors for common Cypher patterns, while maintaining constraints and transactional boundaries.

If you’re curious, the best way to evaluate these claims is to run the benchmarks and inspect which queries hit optimized paths versus fallback behavior. Performance claims only matter when engineers can reproduce them—and that’s the bar we’re aiming for.

How I sped up HNSW construction ~2.7x

TJ Sweet — Mon, 23 Feb 2026 14:33:56 +0000

HNSW Build Time at 1M Embeddings: 27 Minutes to 10 Minutes by Fixing Insertion Order

For a 1M-embedding corpus, we reduced HNSW construction time from about 27 minutes to about 10 minutes (2.7x) without changing recall or graph quality.

This post explains:

the problem (where traversal work is wasted during construction),
the solution (BM25-seeded insertion order),
and the math behind the observed speedup.

All numbers in this writeup use the validated parameters from the current implementation:

M=16
ef_construction=100
seed set size = 256 * 8 = 2,048 nodes

Problem: Random insertion order creates traversal waste

HNSW build quality and build cost both depend on insertion order. With random insertion:

early nodes form accidental local hubs,
new nodes frequently enter a poor region first,
greedy search spends extra distance evaluations before finding useful neighbors.

That wasted traversal work compounds over time.

Visual A: Where random-order traversal waste comes from

In practice, this increases construction cost by a multiplicative overhead factor. I will call that factor beta:

build_time = ideal_time * beta

Where:

ideal_time is the minimum cost if each insert reaches good neighbors with minimal detours,
beta > 1 captures wasted traversal and repair work.

Baseline mechanics: why layer-0 dominates at 1M scale

Using M=16, the level distribution gives:

P(node at layer >= 1) = 1/M = 1/16 = 6.25%
so 93.75% of inserted nodes effectively do all meaningful work in layer 0.

For ef_construction=100, expected distance computations per insertion are:

Upper layers: 0.067 * 100 * 16   =   107
Layer 0:     1.000 * 100 * 32    = 3,200
                                    -----
Total per insert                  = 3,307

(32 above is 2*M, the layer-0 connection bound.)

So the primary optimization target is not exotic upper-layer behavior; it is reducing layer-0 traversal waste during insertion.

Solution: BM25-seeded insertion creates a backbone first

Instead of random insertion order, we pick a lexically diverse seed set from BM25 and insert those vectors first.

Seed extraction:

take high-IDF terms,
for each term, take top docs by term frequency,
defaults: NORNICDB_HNSW_LEXICAL_SEED_MAX_TERMS=256, NORNICDB_HNSW_LEXICAL_SEED_PER_TERM=8,
maximum seed set: 256 * 8 = 2,048 nodes.

Build order:

insert seed nodes first,
insert remaining N - seed_count nodes.

This gives the graph a broad early backbone, so later inserts find useful neighbors quickly instead of wandering.

Visual B: Seed-first construction reduces detours

The math check on the 27 -> 10 minute result

Use a conservative distance-op estimate for 1024-dim float32 vectors in Go:

compute plus memory effects: about 160 ns per distance operation in this workload class.

Then ideal floor for 1M insertions:

ideal_time
= 1,000,000 * 3,307 * 160 ns
= 529 s
= 8.8 min

Now map measured times to beta:

beta_random = 27 / 8.8 = 3.07
beta_seeded = 10 / 8.8 = 1.13
speedup     = beta_random / beta_seeded
            = 3.07 / 1.13
            = 2.72x ~= 2.7x

This is the key point: the reported speedup is exactly what you expect if seeded order mostly removes traversal waste.

Visual C: Overhead factor (`beta`) before vs after

beta (lower is better)

Random order   | ############################### 3.07
Seeded order   | ###########                     1.13

Interpretation:

random build spends about 3.07x the ideal work,
seeded build is close to the floor at about 1.13x.

Time decomposition for the 1M run

Total build time decomposition (minutes)

Case            Ideal floor   Overhead   Total
Random order      8.8          18.2      27.0
Seeded order      8.8           1.2      10.0

Visual D: Same floor, different overhead

Random order | [#########.................][##################] 27.0
Seeded order | [#########.................][#]                  10.0
               ideal floor (8.8 min)       overhead

Both runs share the same algorithmic floor; the difference is how much overhead is paid while traversing and wiring the graph.

Why this does not require a recall tradeoff

This change does not reduce ef_construction, M, or search-time quality knobs. It changes insertion order so the builder spends less effort reaching good neighborhoods.

That is why a large build-time gain can occur without reducing recall or graph quality: the graph is built with the same target connectivity constraints, but with less wasted traversal on the way there.

How to reproduce in your environment

Keep HNSW params fixed (M, ef_construction unchanged).
Build once with seeding disabled (or effectively zero seed set).
Build once with defaults:

NORNICDB_HNSW_LEXICAL_SEED_MAX_TERMS=256
NORNICDB_HNSW_LEXICAL_SEED_PER_TERM=8

Log and compare:
total build time,
insertion throughput,
recall on a fixed validation query set.

Use ratio as the primary cross-machine signal. Absolute minutes depend on CPU, memory bandwidth, cache behavior, and runtime effects.

Secondary effect: same seed mechanism helps k-means init

The same BM25-derived seed mechanism is also used by bm25+kmeans++ seed mode for centroid initialization. That improves initial centroid spread and typically reduces convergence iterations in the k-means phase.

The important architectural detail is reuse: one seed extraction pass supports both HNSW construction order and k-means initialization.

Closing

The 27-to-10 minute result is not a tuning artifact. It is a direct consequence of reducing traversal waste during graph construction:

keep the same quality parameters,
improve insertion geometry,
move beta from about 3.07 to about 1.13.

At 1M scale, this is enough to produce a repeatable 2.7x build-time improvement while preserving result quality.

https://github.com/orneryd/NornicDB

DEV Community: TJ Sweet

Why NornicDB Uses Its Own Monotonic Counter for MVCC Ordering

TL;DR

The Bug We Were Hunting

Why time.Now().UnixNano() Cannot Be Used as a Global Order

Wall-Clock Drift, Concretely

The Parser Is Faster Than the Clock Tick

Same-Tick Math

Why time.Now()'s Monotonic Reading Doesn't Help

The Fix: A Process-Global Atomic Counter

Tests We Wrote

Time Until Overflow

Why Not Other Techniques

Closing Note

Beyond the Vector Wall: The Case for Microsecond Graph-RAG

The Linearizability Crisis

The Latency Bottleneck

Architectural Consolidation

I’m looking for a small number of maintainers for NornicDB

The "Boxing In" Strategy: How Go is the Goldilocks Language for AI-Assisted Engineering

1. The "GPS" Effect: Forcing Explicit Intent

2. The OODA Loop: Validating Theory at Scale

3. Logical Cross-Pollination (The C/C++ Factor)

Proof of Concept: The NornicDB Experience

Conclusion: Boxes, Not Blank Canvases

~1ms hybrid graph + vector queries (network is now the bottleneck)

TL;DR

What was tested

Local (M3 Max 64GB Native MacOS Installer)

Remote (GCP, 8 cores, 32GB RAM)

What’s interesting

How this compares (public numbers)

Important caveats

What this suggests

Takeaway

Looking for feedback on:

Addendum: test setup + external verification

Verification queries (same shape)

Building a Low-Latency MVCC Graph+Vector Database: The Pitfalls That Actually Matter

1) Latency budgets are architecture budgets

2) Snapshot isolation for graphs requires topology-aware validation

3) MVCC retention can create historical lookup cliffs

4) “Current-only” indexing is mandatory when history exists

5) Async embeddings create intentional dual-latency behavior

6) NFS exposed lock contention that fast local storage hid

7) Conflict semantics and retries are part of performance, not just correctness

8) Timings must be interpreted by query shape, not averages

Closing

In the New Agentic World: The Software Career Ladder Is Being Rewritten

My Thesis (Opinion)

1) Software architecture becomes rarer, higher-stakes, and more formal

2) Data engineering fluency becomes the new baseline for “software engineer”

3) DevOps does not disappear - it evolves into AI governance in production

4) The entry-level ladder is getting steeper, right now

What Current Evidence Supports

Strong support: AI/data skills are rising fast

Strong support: DevOps/platform quality becomes more important with AI

Moderate-to-strong support: junior role pressure is real, but uneven

Strong support: AI coding tools increase productivity in many contexts - but not automatically

Where I’m Projecting Beyond the Data (And I’m Owning That)

A Practical Career Map for the Agentic Era

Final Take

The Full Graph-RAG Stack As Declarative Pipelines in Cypher

What landed

Why this is materially different

1) The pipeline is now declarative and inspectable

2) Less orchestration overhead in the hot path

3) Graph + semantic logic are fused in one plan

Query planner + cache: why this is practical, not just ergonomic

Planning path

Query plan cache

Result cache policy (important boundaries)

Correctness under writes

Procedure boundaries (clear contract)

What this is not

Common patterns today:

Why this matters

Architectural Consolidation for Low-Latency Retrieval Systems: Why We Co-Located Transport, Embedding, Search, and Reranking

Why consolidate at all?

The architectural shape we optimized for

Why `time.Now().UnixNano()` Cannot Be Used as a Global Order

Why `time.Now()`'s Monotonic Reading Doesn't Help

Visual C: Overhead factor (`beta`) before vs after