DEV Community: Sergey Nikolaev

Manticore Search 28.4.4: Faster KNN, better conversational search, easier installs and more faceting controls

Sergey Nikolaev — Tue, 14 Jul 2026 11:24:55 +0000

Manticore Search 28.4.4 has been released. This release brings faster KNN rescoring, more flexible conversational search, a simpler install and upgrade path, better faceting controls, per-table relevance defaults, and fixes across authentication, replication, SQL compatibility, distributed queries, and columnar/KNN internals.

This post is a catch-up for everything shipped from 27.2.0 through 28.4.4.

Upgrade Notes

Please review these before upgrading:

28.0.0 bumps the plugin ABI version SPH_UDF_VERSION to 12. External UDF, ranker, and token-filter plugin binaries must be rebuilt before loading them into this version. The change lets token-filter plugins receive long dict=keywords_32k tokens instead of silently bypassing values above the old 126-byte plugin limit. Existing table data and configuration remain compatible, and no index migration is required. Downgrade is possible if you also restore plugin binaries built for the older ABI and do not rely on the new long-token plugin behavior. (Issue #4667, PR #4668)
If authentication is enabled, prioritize the 28.4.x update. 28.4.2 fixes an authentication/authorization permission-check bypass in MySQL multi-statement execution. 28.3.4 also improves authenticated MySQL startup compatibility for Connector/J and PyMySQL, and 28.4.1 fixes authenticated replication-cluster restart recovery by persisting the stored replication user in cluster metadata. (PR #4713, Issue #4691, Issue #4705)
If you manage MCL separately from the daemon, upgrade it together with Manticore. This release includes MCL 13.7.0, the new embeddings_threads setting, and several columnar/KNN build, packaging, and cleanup fixes. Mixing an older library with a newer daemon is not recommended. (PR #169, PR #186, PR #4676)

Highlights

Faster KNN rescoring

KNN search now batches distance calculations during the rescore pass. After HNSW returns the candidate set, Manticore recomputes final full-precision distances and re-sorts the results. Batching that work reduces per-candidate overhead in the final stage of vector search.

For vector-heavy workloads, this takes work out of the part of the query that runs after candidate selection. Results do not change; the final ranking pass just has less overhead when many candidates are rescored.
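The shape of the rescore pass can be pictured with a small sketch. This is not Manticore's implementation: the function names, the squared-L2 metric, and the toy vectors are all assumptions made for illustration. It only shows the structure described above, where exact distances for the whole candidate set are computed in one pass and the results are then re-sorted.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def batched_sq_l2(query: Vector, candidates: List[Vector]) -> List[float]:
    """Exact squared L2 distance from the query to every candidate, in one call."""
    return [sum((q - c) ** 2 for q, c in zip(query, vec)) for vec in candidates]


def rescore(query: Vector, candidate_ids: List[int],
            full_vectors: Dict[int, Vector], k: int) -> List[Tuple[int, float]]:
    """Recompute full-precision distances for the candidate set and re-sort."""
    vectors = [full_vectors[doc_id] for doc_id in candidate_ids]
    distances = batched_sq_l2(query, vectors)  # one batch, not one call per candidate
    ranked = sorted(zip(candidate_ids, distances), key=lambda pair: pair[1])
    return ranked[:k]


# Hypothetical data: ids in the order an approximate index stage might return them.
full_vectors = {1: (0.0, 0.0), 2: (1.0, 1.0), 3: (0.2, 0.1), 4: (3.0, 4.0)}
top = rescore((0.0, 0.0), [2, 4, 3, 1], full_vectors, k=3)
print([doc_id for doc_id, _ in top])
```

Plain Python gains nothing from the batch itself; in a real engine the batch is where per-candidate overhead can be amortized. The ranking is the same either way, which matches the point that results do not change.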

Conversational search through SQL and HTTP

Conversational search is now available through the /search JSON API as well as SQL CALL CHAT. That makes it easier to use Manticore's chat flow from applications that already talk to the HTTP API and do not want to add a separate SQL path just for chat requests.

CREATE CHAT MODEL also gained custom_prompt support, so answers can follow application-specific instructions such as citation rules, tone, response length, or formatting. The feature is still built on the same Manticore Search flow: retrieve relevant documents from an existing vectorized table, build context, keep conversation history, and return an answer with supporting sources.

One-line installation

The quick-start install path is now simpler:

curl https://manticoresearch.com | sh

The same installer can also upgrade an existing installation, list available versions, switch between stable and development repositories, and install a selected version. Package managers still remain the source of truth for installed files, repositories, services, and dependencies; the new path just removes the manual setup steps around them.

For all options, run:

curl https://manticoresearch.com | sh -s help

Facets can keep zero-count buckets visible

Faceted search now supports zero-count facet buckets through SQL ZEROES and JSON "zeroes": true.

This is a small but important UI feature. In e-commerce-style filtering, you often want to keep an option visible even when the current filter combination gives it a count of 0. Combined with max-mode facet behavior, zero-count buckets make it easier to show selected, available, and currently unavailable choices without hiding part of the filter vocabulary from the user.
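The difference between the two behaviors is easy to model. The sketch below is an illustration only, not Manticore code: the `zeroes` flag mirrors the option name from this release, while the function name, the vocabulary list, and the sample data are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List


def facet_counts(values: Iterable[str], vocabulary: List[str],
                 zeroes: bool) -> Dict[str, int]:
    """Count facet values; with zeroes=True keep buckets whose count is 0."""
    counts = Counter(values)
    if zeroes:
        return {bucket: counts.get(bucket, 0) for bucket in vocabulary}
    return {bucket: counts[bucket] for bucket in vocabulary if counts[bucket] > 0}


# Hypothetical data: colors of the products that match the current filters.
vocabulary = ["black", "red", "white"]
matching = ["black", "black", "white"]

print(facet_counts(matching, vocabulary, zeroes=False))
print(facet_counts(matching, vocabulary, zeroes=True))
```

With `zeroes=True` the "red" bucket stays in the result with a count of 0, so a filter UI can render it as currently unavailable instead of dropping it.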

Better defaults for search relevance

Manticore now supports CREATE TABLE ... profile='relevance', plus stored per-table defaults for ranker and boolean_mode.

Based on our search quality tests, profile='relevance' and the ranking settings it enables improve relevance in many cases. The application also no longer needs to repeat the same ranking parameters in every request.

More control over embedding CPU usage

embeddings_threads caps the CPU threads used for auto-embedding inserts, ALTER TABLE ... REBUILD KNN, and text-to-vector KNN queries.

This matters on shared hosts and mixed workloads. Embedding generation and KNN rebuilds can be CPU-heavy; a server-level cap makes those jobs easier to schedule without letting them take over the whole machine.

Bug Fixes

This release includes 17 bug fixes. The most important ones are:

28.4.3 fixed incompatible multi-statement handling around COUNT(DISTINCT ...).
28.4.2 fixed an authentication/authorization permission-check bypass in MySQL multi-statement execution, so each statement in a multi-statement request is validated correctly under auth.
28.4.1 and 28.3.1 fixed replication-cluster recovery edge cases: authenticated cluster restart recovery and startup when a node was left alone in the cluster.
Distributed queries with remote stored fields now fail on remote GETFIELD fetch errors or malformed replies instead of returning apparently successful rows with empty or untrusted stored-field values.
Interrupted columnar/KNN merge cleanup now removes temporary component files, preventing orphaned .tmp.spc.* files from breaking later table rename, attach, or drop operations.
DBeaver compatibility improved by accepting simple single-table aliases in SELECT queries.
Connector/J and PyMySQL clients can now complete authenticated native-password login flows, and harmless session SET statements no longer fail under auth.
The built-in Ukrainian lemmatizer now normalizes apostrophe words correctly, so forms such as здоров'ям match здоров'я under lemmatize_uk_all.
Blended-keyword handling now honors the configured blend_mode, restoring separator-stripped variants consistently for indexing and keyword extraction.
RT auto-optimization no longer compacts a table below two disk chunks unless that lower cutoff is explicitly requested.
A crash involving percentiles aggregations together with terms aggregations in the same /search request on multi-chunk RT tables was fixed.
Long SQL parse errors now preserve UTF-8 character boundaries, so invalid queries containing Cyrillic and other multibyte text no longer produce truncated or corrupted error messages.
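The last fix in this list concerns a classic pitfall: cutting a UTF-8 byte string at an arbitrary offset can split a multibyte character. A minimal sketch of boundary-safe truncation follows. It is a generic illustration of the technique, not the daemon's code; the function name and the sample message are made up.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Truncate to at most max_bytes of UTF-8 without splitting a character."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    cut = max_bytes
    # Continuation bytes look like 0b10xxxxxx; step back until the byte at
    # the cut position starts a new character.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")


message = "syntax error near 'привет'"
print(truncate_utf8(message, 22))
```

Each Cyrillic letter here takes two bytes, so a naive cut at byte 22 would land in the middle of a character; the loop backs up one byte and the result decodes cleanly.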

For the complete list, see the changelog.

Need help or want to connect?

Join our Slack
Visit the Forum
Report issues or suggest features on GitHub
Email us at contact@manticoresearch.com

Sharding in Manticore Search: automatic distribution and replication

Sergey Nikolaev — Fri, 03 Jul 2026 03:32:41 +0000

Search systems often start simple: one table on one server. That works until one of two things happens. Either a single query stops being able to use all the CPU you paid for, or a single server stops being enough — for capacity, for throughput, or for the simple fact that a server can fail and take your data with it.

The automatic sharding built into Manticore Search, available since release 27.1.5, addresses both issues by splitting a table into several smaller physical pieces (shards) that can be searched in parallel and placed on different nodes:

On a single node, sharding spreads concurrent writes across independent pieces and keeps each one small enough to stay fast.
Across a cluster, sharding distributes data over multiple nodes and — this is the main point — automatically replicates each shard and keeps that replication factor intact as nodes fail and recover.

The second part is the real reason most people reach for sharding: high availability. You declare how many shards you want and how many copies of each should exist, and Manticore handles placement, replication, and rebalancing. You don't script failover.

Below: both use cases, the machinery without drowning in internals, the commands you'll run, and the current limits.

Short glossary

Key terms:

| Term | Meaning |
|---|---|
| Shard | One physical piece of a table: a real table that Manticore creates and manages for you. A table with `shards='4'` has four of them. |
| Replica | A copy of a shard on another node. Replicas are how data survives a node failure. |
| Replication factor (RF) | How many nodes hold a copy of each shard. `rf='2'` means every shard exists on two nodes. |
| Distributed table | The table you actually query. It has the name you gave it and transparently fans queries out to all shards. |
| Cluster | A Manticore replication cluster: the group of nodes between which data is replicated. |
| Master | The node that currently coordinates sharding operations (placement, rebalancing). Elected automatically. |
| Rebalancing | The automatic process that moves or copies shards when the set of nodes changes. |

How to create a sharded table

Sharding in Manticore is driven entirely by two simple options on CREATE TABLE:

CREATE TABLE products (id bigint, title text, price float) shards='4' rf='2'

shards='N' — split the table into N physical pieces.
rf='M' — keep M copies of each piece across the cluster (the replication factor).

In the common case, that one CREATE TABLE is all you write. There is no separate "make this distributed" step, no manual agent= lists as in older manual sharding setups, and no per-node table creation. Manticore creates the physical shards, places them, sets up replication, and creates a distributed table named products on every node, so the application can use the same table name from any cluster node.

Use case A: sharding on a single node

Start with the case where you have a single server — perhaps a big, many-core one — and no cluster yet. In this setup, sharding is not about storage durability; it helps use that one machine more effectively. If all writes go into one real-time table, concurrent INSERTs contend on the same internal table locks. As the table grows, RAM-chunk merges get heavier and can slow down ingestion. Splitting that table into several independent shards helps on both fronts: writes spread across the shards, and each piece stays small. High availability isn't part of the picture yet — that needs more than one node — so this is purely a performance play.

The simplest form has no cluster and rf='1':

CREATE TABLE logs (id bigint, message text, ts timestamp) shards='8' rf='1'

This creates eight physical shards on the one node and a distributed table logs that points at all of them. How does that help on a single machine?

More concurrent ingestion. Each shard is an independent real-time table, so concurrent writes spread across them instead of serializing on one table's locks — the win the benchmarks below measure directly.
Smaller pieces stay fast. Real-time tables periodically merge their internal RAM chunks. A table split into shards keeps each shard's chunks smaller, so those merges use fewer resources and are less likely to slow inserts down.
Query parallelism (usually a small gain on one node). A distributed table searches its shards in parallel across the server's worker thread pool, so a single query can use several cores instead of one — bounded by searchd.threads and the number of physical cores. On a single node, though, this overlaps with pseudo-sharding, and the gain is usually small (~5–12%) — see the read benchmarks below.

If you've used Manticore's pseudo-sharding before, the goal may be familiar — use all the cores for one query — but the mechanism is different. Pseudo-sharding parallelizes a single physical table automatically at query time. Explicit sharding creates real shards you control: you decide how many, they're separate tables you can reason about, and — crucially — the same sharded table can later be spread across nodes without changing how your application talks to it.

The two are complementary, but they don't stack for free. Physical sharding — a distributed table over several local tables — already keeps the worker threads busy, so if you've explicitly sharded a table, enabling pseudo_sharding on top usually adds little and can even cost a bit of throughput. Test it both ways with manticore-load: run your workload with and without pseudo_sharding, and if it adds nothing on top of explicit shards, turn it off.

On a single node the replication factor must be 1: there's only one node, so there's nowhere to put a second copy. That's also the catch — single-node sharding gives you parallelism, not durability. For durability you need more than one node.

Use case B: multi-node sharding and automatic replication

This is what sharding is really for. Start from a replication cluster of several nodes (see Setting up replication for how to create one), then create the table inside that cluster with the cluster: prefix and an RF greater than 1:

CREATE TABLE mycluster:products (id bigint, title text, price float) shards='4' rf='2'

Here's what Manticore does for you:

Creates four shards.
Places them across the cluster's nodes in a balanced way.
Creates a second copy of every shard on a different node, because rf='2'.
Wires up replication between each shard and its replica.
Creates a distributed table products on every node, so any node can serve reads and accept writes.

From the application's point of view, nothing changed — you still INSERT INTO products … and SELECT … FROM products. Reads fan out across the shards and the results are merged; writes are routed to a shard. But now every shard lives on two nodes, and that's the property you care about: any single node can fail and the table stays fully available with no data loss.
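The placement steps listed above can be pictured with a simple greedy model. This is a hypothetical sketch, not Manticore's placement algorithm: the function name and the "least-loaded node first" rule are assumptions. It only demonstrates the two properties the text describes, namely that copies of one shard land on distinct nodes and that the load stays balanced.

```python
from typing import Dict, List


def place_shards(shards: int, rf: int, nodes: List[str]) -> Dict[int, List[str]]:
    """Put each shard's rf copies on distinct nodes, least-loaded nodes first."""
    if rf > len(nodes):
        raise ValueError("rf cannot exceed the number of nodes")
    load = {node: 0 for node in nodes}
    placement: Dict[int, List[str]] = {}
    for shard in range(shards):
        # Sort by current load, then by name, so the result is deterministic.
        chosen = sorted(nodes, key=lambda n: (load[n], n))[:rf]
        for node in chosen:
            load[node] += 1
        placement[shard] = sorted(chosen)
    return placement


layout = place_shards(shards=4, rf=2, nodes=["node1", "node2", "node3"])
for shard, holders in layout.items():
    print(shard, holders)
```

The exact assignment a real cluster picks may differ; what matters is that eight shard copies spread over three nodes end up as 3 + 3 + 2, with no node holding two copies of the same shard.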

The replication factor scales with your durability needs and your node count:

| RF | Copies per shard | Survives | Typical use |
|---|---|---|---|
| 1 | 1 | nothing (a lost node loses its shards) | single-node parallelism, dev/test, data you can rebuild |
| 2 | 2 | one node failure | the common production choice |
| 3+ | 3 or more | multiple simultaneous failures | mission-critical, frequent-failure environments |

The constraint is simple: you can't ask for more copies than you have nodes. rf='3' needs at least three nodes in the cluster. Manticore checks this when you create the table and tells you if the cluster is too small.

-- 6 shards, 3 copies each, across a 3+ node cluster
CREATE TABLE mycluster:events (id bigint, body text) shards='6' rf='3'
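The constraint itself is a one-line check. The sketch below models it under stated assumptions: the function name and the error text are invented for illustration and are not Manticore's actual messages.

```python
def validate_sharding(shards: int, rf: int, node_count: int) -> None:
    """Raise ValueError if the requested layout cannot be satisfied."""
    if shards < 1:
        raise ValueError("shards must be at least 1")
    if rf < 1:
        raise ValueError("rf must be at least 1")
    if rf > node_count:
        raise ValueError(
            f"rf={rf} needs at least {rf} nodes, cluster has {node_count}"
        )


validate_sharding(shards=6, rf=3, node_count=3)  # fits: one copy per node
try:
    validate_sharding(shards=6, rf=3, node_count=2)
except ValueError as err:
    print(err)
```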

Putting it together: a multi-node walkthrough

Say you have a three-node replication cluster called mycluster (if you don't yet, Setting up replication walks through CREATE CLUSTER and JOIN CLUSTER). Create a sharded, replicated table from any node:

CREATE TABLE mycluster:products (id bigint, title text, price float) shards='4' rf='2'

Manticore creates four shards, puts two copies of each across the three nodes, and a distributed table products on every node. Check the placement:

SHOW SHARDING STATUS products;

-- illustrative output (abbreviated columns)
+-------+-------+--------+----+-----------+
| shard | node  | status | rf | rf_status |
+-------+-------+--------+----+-----------+
|     0 | node1 | active |  2 | ok        |
|     0 | node2 | active |  2 | ok        |
|     1 | node2 | active |  2 | ok        |
|     1 | node3 | active |  2 | ok        |
|     2 | node1 | active |  2 | ok        |
|     2 | node3 | active |  2 | ok        |
|     3 | node1 | active |  2 | ok        |
|     3 | node2 | active |  2 | ok        |
+-------+-------+--------+----+-----------+

Every shard appears on two distinct nodes — that's rf=2 — and every rf_status is ok. (The full result also includes table, cluster, and replication_cluster columns.) Now use it like any other table, from any node:

INSERT INTO products (id, title, price) VALUES (1, 'Wireless mouse', 19.99);
SELECT * FROM products WHERE MATCH('mouse');

The write is routed to a shard and replicated to that shard's other copy; the read fans out across all four shards and merges the results. Your application never names a shard.
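To make the two paths concrete, here is a toy model of routing and fan-out. It is a sketch under loud assumptions: the post does not say how Manticore chooses a shard for a write, so the modulo rule below is purely hypothetical, as are the function names and the sample scores. The merge step shows the general idea of combining per-shard result lists that are each already sorted.

```python
import heapq
from itertools import islice
from typing import Dict, List, Tuple

Row = Tuple[float, int]  # (score, document id)


def route_write(doc_id: int, shard_count: int) -> int:
    """Pick a shard for a write. Modulo routing is an assumption of this sketch."""
    return doc_id % shard_count


def fan_out_read(shard_results: Dict[int, List[Row]], limit: int) -> List[Row]:
    """Merge per-shard results (each sorted by descending score) into one list."""
    merged = heapq.merge(*shard_results.values(), key=lambda row: -row[0])
    return list(islice(merged, limit))


# Hypothetical per-shard answers to one query.
shard_results = {
    0: [(0.9, 4), (0.2, 8)],
    1: [(0.7, 1)],
    2: [],
    3: [(0.8, 3), (0.5, 7)],
}
print(route_write(1, shard_count=4))
print(fan_out_read(shard_results, limit=3))
```

The application sees only the merged list, which is why it never needs to name a shard.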

Maintaining the replication factor

Setting rf='2' is easy. The hard part in any distributed system is honoring that condition over time, as machines fail and come back and as you add capacity. But you no longer have to worry about that. Manticore Search automates this work.

In Manticore, the cluster elects a master node that runs a coordination loop. It monitors the cluster's topology (which nodes are alive) and reacts to changes:

A node fails

Its shards now have fewer copies than RF requires. The master detects the missing node and tries to rebuild the missing replicas. If the cluster still has at least rf active nodes after the failure, it places new copies on active nodes that don't already hold them, restoring the replication factor. Queries keep working as long as at least one copy of each shard is still available.

Continuing the walkthrough above — if node3 goes down, SHOW SHARDING STATUS products shows the affected shards as degraded (one copy down, one still up):

-- illustrative: node3 is down
+-------+-------+----------+----+-----------+
| shard | node  | status   | rf | rf_status |
+-------+-------+----------+----+-----------+
|     1 | node2 | active   |  2 | degraded  |
|     1 | node3 | inactive |  2 | degraded  |
|     2 | node1 | active   |  2 | degraded  |
|     2 | node3 | inactive |  2 | degraded  |
|   ... | ...   | ...      |    | ...       |
+-------+-------+----------+----+-----------+

There are still two active nodes (node1, node2) and rf=2, so the master creates the missing copies of shards 1 and 2 on the active node that lacks them. While a new copy is being built it shows up as pending; once replication catches up it becomes active and rf_status returns to ok.

The important caveat: Manticore can only restore RF if there is somewhere to put the new copy. If the active node count drops below rf, the requested RF cannot be met yet: affected shards stay degraded with their surviving copy until a node returns or you add one. Manticore won't create another copy of the same shard on the same node, and it won't silently pretend RF is met. If no live copy of a shard remains, its status becomes broken; that case is covered below. For rf='1', a failed node's shards are simply gone — there was never a second copy.
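The repair rule and its caveat can be expressed as a short planning function. This is an illustrative model rather than the master's real logic: the function name and data shapes are assumptions, and the placement used as input is the one from the walkthrough above.

```python
from typing import Dict, List, Set


def plan_repairs(copies: Dict[int, Set[str]], active: Set[str],
                 rf: int) -> Dict[int, List[str]]:
    """For each shard short of rf live copies, pick active nodes lacking it."""
    plan: Dict[int, List[str]] = {}
    for shard, holders in sorted(copies.items()):
        live = holders & active
        missing = rf - len(live)
        if not live or missing <= 0:
            continue  # nothing to copy from, or already healthy
        spare = sorted(active - holders)  # never a second copy on the same node
        if spare:
            plan[shard] = spare[:missing]
    return plan


copies = {
    0: {"node1", "node2"},
    1: {"node2", "node3"},
    2: {"node1", "node3"},
    3: {"node1", "node2"},
}
print(plan_repairs(copies, active={"node1", "node2"}, rf=2))  # node3 is down
print(plan_repairs(copies, active={"node1"}, rf=2))           # too few nodes left
```

With node3 down, the plan targets shards 1 and 2 and places each new copy on the surviving node that lacks it. With only node1 alive, the plan is empty: there is nowhere to put another copy, so the shards stay degraded.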

A node joins

New capacity should be used. The master rebalances so the new node takes its share of the load. How it does this depends on the RF:

RF = 1: shards must be moved (there's only one copy, so it can't just be duplicated). Manticore moves them safely using a temporary internal cluster: it copies the data to the new node first and removes it from the old one only after that, so the shard always has an available copy.
RF ≥ 2: shards are replicated to the new node using the cluster's existing replication, then the distribution is rebalanced. No risky data movement, because another copy always exists.

Every copy of a shard is down

If all nodes holding a given shard are lost at once, that shard's rf_status becomes broken — there's no surviving copy to serve or to replicate from. The rest of the table keeps working; the broken shard recovers when one of its nodes returns. RF reduces the chance of this case: with rf='2' it takes two simultaneous failures of the right nodes, with rf='3' three.

All of this happens through an internal, ordered, rollback-aware operation queue, so a rebalancing operation either completes or is cleanly rolled back — even if the master node itself dies mid-operation, the next master cleans up the half-finished work. The point for you as an operator: you set RF once, and the cluster works to keep it true.
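The "completes or is cleanly rolled back" property is the familiar do/undo pattern. The sketch below shows that pattern in miniature; it is not Manticore's queue. The step names, the `holders` set, and the simulated failure are all hypothetical, chosen to mirror the copy-first, remove-later move described for RF = 1.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None], Callable[[], None]]  # (name, do, undo)


def run_with_rollback(steps: List[Step]) -> bool:
    """Run steps in order; if one fails, undo the completed ones in reverse."""
    done: List[Step] = []
    for step in steps:
        _name, do, _undo = step
        try:
            do()
        except Exception:
            for _, _, undo in reversed(done):
                undo()
            return False
        done.append(step)
    return True


holders = {"node1"}  # nodes that currently hold the shard


def fail() -> None:
    raise RuntimeError("simulated failure")


steps: List[Step] = [
    ("copy shard to node2", lambda: holders.add("node2"), lambda: holders.discard("node2")),
    ("verify copy", fail, lambda: None),
    ("remove shard from node1", lambda: holders.discard("node1"), lambda: holders.add("node1")),
]
ok = run_with_rollback(steps)
print(ok, sorted(holders))
```

Because the removal step comes last and the failure happens before it, the shard never loses its original copy, and the half-made copy on node2 is undone.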

How it works under the hood (the short version)

You don't need this to use sharding, but it helps to know what's happening.

Physical shards are real tables. A table with four shards is backed by four real tables that Manticore creates and manages for you. You normally never touch them directly.
Your application talks to a distributed table. Manticore creates one named products on every node. In its internal definition, local shards are listed directly, and shards on other nodes are connected through agent. That's what makes SELECT … FROM products transparently hit everything.
Coordination state lives in the cluster. Manticore tracks its own internal metadata — shard placement, coordination state, and the pending-operation queue — so it always knows who holds what and what work is still outstanding. In a multi-node setup this state is replicated across the cluster, so every node shares the same view.
The master drives changes. Placement, replication setup, and rebalancing are computed by the master and pushed onto the queue as ordered commands with rollback instructions, then executed across nodes.
Replication reuses Manticore's clustering. The same proven replication mechanism Manticore already uses for clusters keeps shard replicas in sync.

Architecturally:

            CREATE TABLE ... shards='4' rf='2'
                          │
                          ▼
                  ┌────────────────┐
                  │ Manticore      │  computes placement,
                  │ elected master │  enqueues ordered ops
                  └────────┬───────┘
                          │
        ┌─────────────────┼─────────────────┐
        ▼                 ▼                 ▼
    ┌──────────┐     ┌──────────┐     ┌──────────┐
    │  node1   │     │  node2   │     │  node3   │
    │ s0 s2 s3 │◄───►│ s0 s1 s3 │◄───►│  s1 s2   │   (each shard on 2 nodes = rf 2)
    └──────────┘     └──────────┘     └──────────┘
    distributed table "products" exists on every node
    (same placement as the SHOW SHARDING STATUS output above)

Operating a sharded table

Everything you'd expect to work, works — and there are a couple of sharding-specific commands for visibility.

Inspect the schema. DESC and SHOW CREATE TABLE work on the logical table; Manticore resolves them through the underlying shards:

DESC products;
SHOW CREATE TABLE products;

See where every shard lives and whether the RF is healthy. This is the command you'll watch during failures and rebalancing:

SHOW SHARDING STATUS products;

+-------+-------+--------+----+-----------+
| shard | node  | status | rf | rf_status |
+-------+-------+--------+----+-----------+
|     0 | node1 | active |  2 | ok        |
|     0 | node2 | active |  2 | ok        |
|   ... | ...   | ...    |    | ...       |
+-------+-------+--------+----+-----------+

It reports one row per shard copy, with these columns:

| Column | Meaning |
|---|---|
| `table` | the logical table name |
| `shard` | shard number |
| `node` | the node holding this copy |
| `status` | `active`, `inactive` (node down), or `pending` (being created) |
| `cluster` | the replication cluster the table belongs to |
| `replication_cluster` | the internal cluster that keeps this shard's copies in sync |
| `rf` | how many copies this shard currently has |
| `rf_status` | `ok` (RF satisfied), `degraded` (some copies down but at least one up), or `broken` (no copies up) |

rf_status is the at-a-glance health signal: all ok means the cluster is meeting the replication factor you asked for; degraded means it's working but exposed; broken means a shard is down.
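Read as a rule, the three rf_status values follow from the per-copy statuses. The function below is a sketch of that reading, based only on the column descriptions above; the name and signature are assumptions, and the real computation inside Manticore may consider more than this.

```python
from typing import List


def rf_status(copy_statuses: List[str], rf: int) -> str:
    """Derive a shard's rf_status from the status of each of its copies."""
    active = sum(1 for status in copy_statuses if status == "active")
    if active == 0:
        return "broken"      # no copy can serve or be replicated from
    if active < rf:
        return "degraded"    # still serving, but below the requested RF
    return "ok"


print(rf_status(["active", "active"], rf=2))
print(rf_status(["active", "inactive"], rf=2))
print(rf_status(["active", "pending"], rf=2))
print(rf_status(["inactive", "inactive"], rf=2))
```

A copy that is still `pending` does not count as active in this model, which matches the description that rf_status returns to ok only once the new copy has caught up.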

Find the coordinator:

SHOW SHARDING MASTER;

Drop it cleanly. Dropping a sharded table works exactly like dropping a regular table — DROP TABLE removes the table and all its shards across the cluster:

DROP TABLE products;

Scale by changing the cluster. Because rebalancing is automatic, the way you scale a sharded table out is by adding nodes to the cluster (Adding a new node). The master notices the new node and rebalances onto it without any action on the table itself.

Choosing the shard count and replication factor

A few rules of thumb:

Shards for faster writes: a handful of shards is usually enough, well below your core count — in the benchmarks below a 16-core / 32-thread box peaked at 4–8 shards and by 32 shards was slower than no sharding at all. Start small (4–8), measure, and only add more if your own numbers say so. More shards than that rarely helps and adds per-shard overhead.
Shards for distribution: with multiple nodes, you want enough shards that they divide evenly across nodes and leave room to grow — a multiple of your node count is a good default. Don't go wild: each shard is a real table with its own overhead. (Manticore caps the shard count at 3000.)
RF for durability: rf='2' is the standard production choice — it survives any single node failure at 2× storage. Use rf='3' only when you genuinely need to survive simultaneous failures or have strict availability requirements, and remember it costs 3× the storage and more replication traffic.
RF=1 is for performance or throwaway data only. It has no fault tolerance. Use it on a single node for parallelism, or in a cluster only when you have an external way to rebuild lost data.

Benchmarks: does sharding actually speed up inserts?

A feature is only worth using if it earns its keep, so we measured. The question we wanted answered honestly: for the same workload, does sharding make ingestion faster — and if so, when? The methodology was simple: compare every sharded run against the same baseline — a regular table without sharding.

In short: on a 16-core box with 32 concurrent writers, sharding raised insert throughput by about 1.5× at its best (from ~163k to ~253k docs/s), but only when the shard count stayed small. The best results came at 4–8 shards; by 32 shards, throughput had fallen below the unsharded baseline. The binary log cost roughly 25% of write throughput, and in a separate two-machine test rf=2 replication cost about 30% of insert throughput. Both are fair prices for durability, but not free.

Setup. A dedicated server with no other significant CPU load — AMD Ryzen 9 5950X (16 cores / 32 threads), 128 GB RAM. Everything ran inside a single Docker container running a recent dev build with the sharding feature. Manticore ran with stock settings — no performance tuning: only listener ports, the data directory, and the binary log path were set; the thread pool, RT memory limits, and binary log behaviour were left at their defaults. Load came from manticore-load. Each run inserts the same documents — (id bigint, name text, type int) where name is 10–100 random words — in batches of 1000 into a real-time table. Only the shard count, the replication factor, and the binary log change between runs; the "no sharding" baseline is a plain RT table. We run the full single-node shard sweep twice — once with the binary log on (the default), once with it off — inserting 20,000,000 docs per run with 32 concurrent writers.

# the exact shape of every insert run (shards/rf vary)
manticore-load --batch-size=1000 --threads=32 --total=20000000 \
  --init="create table test(id bigint, name text, type int) shards='8' rf='1'" \
  --load="insert into test(id,name,type) values(<increment>,'<text/10/100>',<int/1/100>)"

Single node: throughput vs shard count

Docs inserted per second, 20M docs, 32 writers — both binary log modes:

| Shards | Binary log on (default), docs/s | Binary log off, docs/s |
|---|---|---|
| none (baseline) | 162,920 | 218,079 |
| 2 | 191,976 | 246,288 |
| 4 | 252,807 | 290,665 |
| 8 | 251,008 | 265,015 |
| 16 | 175,848 | 182,288 |
| 32 | 108,006 | 111,381 |

The chart plots the full sweep twice — binary log on (blue, the default) and off (orange). Three things stand out:

Concurrent inserts get a real boost. With 32 writers, splitting the table lifts throughput by ~1.5× at the peak — from 163k to 253k docs/s on the default (binary-log-on) line. Each shard is an independent real-time table, so spreading writes across several sharply cuts the lock contention a single RT table hits under concurrency, and lets the inserts use more cores.
The best range is small — 4–8 shards. The gain peaks there and then falls off fast. By 16 shards it's barely above baseline, and at 32 shards throughput is below the unsharded table (0.66×) — past that range, per-shard and coordination overhead outweighs the extra parallelism. More shards is emphatically not better. Both lines have the same shape and the same best range — the binary log doesn't move it.
Durability costs less as shards are added. Turning the binary log off (orange) lifts the whole curve, but the gap narrows steadily: about 55k docs/s on the unsharded baseline (163k → 218k), about 38k at 4 shards (253k → 291k), and almost nothing once the shard count is too high (108k vs 111k at 32 shards), where coordination overhead, not durability, is the bottleneck. We measure the binary-log impact separately below.
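The ratios quoted in these three points can be recomputed directly from the table. The snippet below does only that arithmetic; the dictionary layout and the use of `0` as a key for the unsharded baseline are conventions of this sketch.

```python
# Docs/s from the table above: shard count -> (binary log on, binary log off).
THROUGHPUT = {
    0: (162_920, 218_079),   # 0 stands for the unsharded baseline
    2: (191_976, 246_288),
    4: (252_807, 290_665),
    8: (251_008, 265_015),
    16: (175_848, 182_288),
    32: (108_006, 111_381),
}


def speedups(column: int) -> dict:
    """Throughput of each shard count relative to the unsharded baseline."""
    baseline = THROUGHPUT[0][column]
    return {n: round(row[column] / baseline, 2) for n, row in THROUGHPUT.items() if n}


print(speedups(0))  # binary log on
print(speedups(1))  # binary log off
```

On the binary-log-on column this gives about 1.55× at 4 shards and 0.66× at 32 shards, the two figures cited in the text.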

One caveat worth stating plainly: this speedup comes from concurrent writes. A single, strictly-sequential writer can't exploit parallel shards and will see no speedup (and a touch of distributed-layer overhead). Sharding pays off when many clients or consumers write at once — which is what real ingestion pipelines do.

The cost of durability: the binary log

Manticore's binary log makes inserts crash-safe (it can replay un-flushed transactions after an unclean shutdown). It's on by default. You can already see its cost in the chart above — the orange (binary-log-off) line runs above the blue one. Isolating it on the plain unsharded baseline, with only binlog_path changing:

Disabling the binary log raised baseline throughput from 163k to 218k docs/s, so crash-safe inserts cost roughly 25%. That's the price of not losing un-flushed writes on a hard crash; leave it on in production unless you can rebuild the data from source and want the extra speed during bulk loads.
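For readers who want to check the percentage, the calculation is the throughput given up relative to the faster configuration. The helper name below is an assumption; the numbers are taken from the single-node table, with two sharded rows added for comparison.

```python
def relative_cost(with_feature: float, without_feature: float) -> float:
    """Fraction of throughput given up by turning a feature on."""
    return 1 - with_feature / without_feature


# Docs/s from the single-node table: label -> (binary log on, binary log off).
runs = {
    "baseline": (162_920, 218_079),
    "4 shards": (252_807, 290_665),
    "32 shards": (108_006, 111_381),
}
for label, (binlog_on, binlog_off) in runs.items():
    print(f"{label}: {relative_cost(binlog_on, binlog_off):.1%}")
```

The baseline row reproduces the roughly 25% figure, and the sharded rows show the relative cost getting smaller as the shard count grows.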

The cost of replication on writes

Durability across machines isn't free either — and this is the one test that's only honest on real, separate hardware. So unlike the single-node benchmarks above, we ran the two replication tests on two distinct physical machines: two 4-core / 7 GB cloud VMs joined as a Manticore cluster. That way rf=1 and rf=2 never fight over the same cores or memory — the fair comparison a single box simply can't give.

rf=1 keeps a single copy on one machine; rf=2 keeps a full copy on both, so every insert is synchronously replicated across the network before it's acknowledged. Same 1M-doc load, 4 writer threads:

Replication cost about 30% of insert throughput (112k → 78k docs/s) — the price of copying every write to the second machine before acking. That's the rf trade-off in one number: write a bit slower, survive losing an entire machine.

Do sharding and replication speed up reads?

Reads are easy to measure wrong, so first the method. A trivial query (fetch 20 rows, no ranking) runs in well under a millisecond, so its cost is all fixed overhead; a distributed table that hops to an agent then looks ~3× slower (≈2,400 q/s vs ≈7,700 q/s without the hop here) purely because the network round-trip dwarfs the near-zero work. That's not a real query. A realistic full-text query (a few terms to match and rank, ~10–30 ms) tells the truth: the distribution overhead shrinks to a few percent, because now real work dominates the fixed cost. The numbers below all use realistic queries on the 2-node cluster, read from a single entry node.

# realistic read — what the numbers below use: full-text match-and-rank, ~10–30 ms each
manticore-load --threads=4 --total=5000 \
  --load="select id from <table> where match('<text/1/5> <text/1/5>')"

# trivial read — sub-ms, all fixed overhead, exaggerates distributed cost ~3x
manticore-load --threads=4 --total=20000 \
  --load="select id from <table> limit 20 option ranker=none"

<table> is one of: a plain RT table (no sharding), a type='distributed' table over local shards (single-node sharding — local='s0' local='s1' …), or a sharded table with shards='4' rf='1'/rf='2'. Every read is issued to one node — we never query both nodes as separate clients.
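Why the trivial query exaggerates the distributed cost can be shown with a back-of-envelope model. Everything here is an assumption layered on the figures quoted above: it treats the 4 load threads as a closed loop (so latency is threads divided by queries per second), and it treats the agent hop as a fixed per-query cost that does not grow with query work. Real systems are messier, so read this as a rough sketch, not a measurement.

```python
def latency_ms(threads: int, qps: float) -> float:
    """Per-query latency implied by a closed loop of `threads` clients."""
    return threads / qps * 1000


def slowdown(work_ms: float, overhead_ms: float) -> float:
    """How much slower a query gets when a fixed overhead is added."""
    return (work_ms + overhead_ms) / work_ms


THREADS = 4                                # --threads=4 in the commands above
local_ms = latency_ms(THREADS, 7_700)      # trivial query, no agent hop
remote_ms = latency_ms(THREADS, 2_400)     # trivial query through an agent
overhead_ms = remote_ms - local_ms         # fixed cost of the hop in this model

for work_ms in (local_ms, 10.0, 30.0):
    print(f"{work_ms:6.2f} ms of work -> {slowdown(work_ms, overhead_ms):.2f}x")
```

In this model the same fixed overhead that triples a sub-millisecond query adds only a small fraction to a query that does tens of milliseconds of real work, which is the reason the benchmarks use realistic queries.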

Sharding across nodes speeds reads up. A 4-shard rf=1 table spread over both machines does ~516 q/s vs ~315 for a single unsharded node — about 1.6× — because each query runs across both nodes' cores at once.
Replication keeps reads fast. rf=2 (every shard on both nodes) does ~410 q/s — a touch below rf=1. The reason is not that the load sticks to one node (it spreads across replicas); it's that with rf=2 every shard is reached through the agent/mirror path — even the copies that happen to be local — and that path costs the few percent noted above, whereas rf=1 reads its on-node shards in-process. Either way reads stay well above a single node, and the data now survives a machine loss.
On one box, sharding's read win is small. Splitting a table into 2–4 shards on a single node adds only ~5–12% (query parallelism), and only when there are spare cores — heavier queries help (≈12% vs ≈3% for light ones), but a fully-loaded box is a wash. The real read scaling comes from adding machines, not shards on one box.

Method note: the read and replication tests run on a separate pair of cloud VMs — 4 vCPU / 7 GB RAM each, two distinct physical machines — joined as a 2-node Manticore cluster, same recent dev build, stock settings, 1.5M docs. They are a different, much less powerful setup than the 16-core / 32-thread server used for the insert sweeps above, so draw conclusions within each test, not across them. The small core count is also why single-node sharding's read gain is modest here — there are few spare cores for a query to parallelize across.

Bottom line

On one box, sharding gives a ~1.5× concurrent insert speedup (163k → 253k docs/s here at 20M).
The best range here is 4–8 shards on this 16-core / 32-thread CPU. Too many shards hurts: at 32 shards throughput fell below the unsharded baseline. Match shard count to cores and write concurrency, not to a big round number.
The binary log costs ~25% of write throughput (and the gap shrinks to near-zero once the shard count is too high), and rf=2 replication ~30% on writes (measured on two separate machines) — both fair prices for crash safety and node-failure survival, respectively.
With realistic full-text queries, sharding across a 2-node cluster reads ~1.6× a single node, and rf=2 keeps reads fast while surviving a node loss. Trivial id-lookup queries exaggerate distributed overhead ~3× — always benchmark reads with realistic queries.
Absolute numbers depend on hardware and document shape; what travels between setups is the shape of these curves — so benchmark your own workload before committing to a shard count.

Limitations and things to know

There are sharp edges worth knowing up front:

rf is required. A sharded CREATE TABLE must specify rf=. If you omit it, CREATE TABLE fails.
rf greater than 1 requires a cluster. You can't create a multi-copy sharded table on a standalone node — there's nowhere to put the copies. Multi-copy tables must use the cluster:name form.
Local sharded tables can't be created on a node that's already in a cluster. If a node belongs to a replication cluster, create the table in that cluster (CREATE TABLE cluster:name …) rather than as a local table, or the sharding metadata won't be tracked correctly. Manticore detects this and tells you.
Maximum 3000 shards per table.
RF=1 means no fault tolerance. A lost node's shards are gone. This is inherent, not a bug — it's the trade-off you accept for rf='1'.
You need at least RF nodes. rf='M' requires a cluster of at least M nodes; creation fails otherwise.
Creation is synchronous up to a timeout. CREATE TABLE waits for the distribution to complete (default 30s). For very large shard counts, raise it with timeout='N' (seconds), e.g. shards='3000' rf='3' timeout='60'.
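
The rules above can be summarised as a small pre-flight check. This is only a sketch: check_sharded_table and its parameters are hypothetical names, not part of Manticore, and the server performs its own validation with its own error messages.

```python
from typing import List, Optional

MAX_SHARDS = 3000  # per-table limit quoted in the list above


def check_sharded_table(shards: int, rf: Optional[int],
                        cluster_nodes: int = 0) -> List[str]:
    """Return the rule violations for a planned sharded table.

    cluster_nodes is 0 for a standalone node, otherwise the cluster size.
    Hypothetical helper for illustration only.
    """
    if rf is None:
        return ["rf is required"]

    problems = []
    if not 1 <= shards <= MAX_SHARDS:
        problems.append(f"shards must be between 1 and {MAX_SHARDS}")
    if rf > 1 and cluster_nodes == 0:
        problems.append("rf > 1 requires a cluster")
    elif cluster_nodes and rf > cluster_nodes:
        problems.append(
            f"rf={rf} needs at least {rf} nodes, cluster has {cluster_nodes}")
    return problems


print(check_sharded_table(shards=4, rf=1))                   # standalone, one copy
print(check_sharded_table(shards=4, rf=2))                   # no cluster to hold copies
print(check_sharded_table(shards=4, rf=3, cluster_nodes=2))  # too few nodes
```

An empty list means none of the listed rules is violated; it does not guarantee that CREATE TABLE will succeed.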

Where this leaves you

Sharding in Manticore covers a wide span with a deliberately small interface. With shards='N' rf='1' on a single box, sharding spreads concurrent writes across independent pieces and keeps each one small. With shards='N' rf='M' inside a cluster, it gives you a distributed, replicated table that survives node failures and rebalances itself when the cluster changes — without you writing a line of failover logic. The same table definition grows from one node to many, and your application keeps talking to it the same way throughout. In practice, this means you can start by improving write throughput on one node and later move to a fault-tolerant cluster without changing the application.

Have a sharding question or a workload you'd like us to benchmark? Let us know.

Faster KNN index builds in Manticore

Sergey Nikolaev — Thu, 02 Jul 2026 08:48:19 +0000

TL;DR

Building a KNN index used to be the slow part of saving and merging tables with vector attributes. As of release v27.1.5, Manticore can use several CPU cores for this work during chunk saves, OPTIMIZE merges, auto-optimize, and ALTER TABLE ... REBUILD KNN. On a 16-core Ryzen 9 5950X, building a KNN index for 1 million 1536-dimensional vectors dropped from 8 minutes to 39 seconds.

Why HNSW build speed matters

Manticore uses HNSW graphs to power KNN search over float_vector attributes. You can think of an HNSW graph as a map that helps Manticore quickly find vectors that are close to the query vector.

Building that map can take a long time. For tables without KNN, saving or merging data is mostly about writing ordinary table data. For tables with KNN, Manticore also has to insert every vector into an HNSW graph, and that extra work can dominate the total time.

This matters most after large inserts and during maintenance. Fresh data is saved from memory to disk chunks. Existing chunks are later merged by OPTIMIZE or auto-optimize. Each saved or merged chunk needs its own HNSW graph, so faster graph building means shorter waits after bulk loading, faster background optimization, and less time spent on maintenance operations.

Disk chunk count matters for search too. Manticore stores RT tables as disk chunks, and each chunk has its own HNSW graph. A KNN query searches every chunk and merges the results, so fewer chunks usually mean faster KNN queries. The fastest layout is often one chunk with one graph.

Auto-optimize does not go all the way to one chunk by default. It merges chunks in the background, but stops when the table reaches a target chunk count. For ordinary tables, the target is 2 * num_logical_cpus; for tables with a KNN attribute, it is lower: num_physical_cpus / 2. On a 32-thread / 16-core host, that means 8 chunks for a KNN table instead of 64 for an ordinary table. The KNN target is lower because extra chunks hurt KNN search latency more, but the default still leaves more than one graph. To make auto-optimize converge to a single chunk, set optimize_cutoff to 1 server-wide, per table, or at runtime with SET GLOBAL optimize_cutoff = 1. You can also do it manually with OPTIMIZE TABLE ... OPTION cutoff = 1.
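
As a quick check of the arithmetic, the two default targets can be written down directly. This is a sketch of the formulas as stated in the paragraph above, not Manticore's code; the function name is hypothetical, and the integer division and the floor of 1 are assumptions.

```python
def auto_optimize_target(logical_cpus: int, physical_cpus: int,
                         has_knn: bool) -> int:
    """Default chunk count at which auto-optimize stops merging.

    Mirrors the formulas quoted in the text. Integer division and the
    minimum of 1 are assumptions made for this illustration.
    """
    if has_knn:
        return max(1, physical_cpus // 2)
    return 2 * logical_cpus


# The 32-thread / 16-core host from the text.
print(auto_optimize_target(32, 16, has_knn=True))   # KNN table
print(auto_optimize_target(32, 16, has_knn=False))  # ordinary table
```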

What used to happen

Freshly inserted documents first accumulate in a RAM chunk. When that RAM chunk reaches rt_mem_limit, which defaults to 128MB, Manticore saves it as a new disk chunk. For a table with a KNN attribute, that save includes building a fresh HNSW graph from the vectors in the RAM chunk.

The same kind of HNSW build happens when disk chunks are merged. OPTIMIZE TABLE and auto-optimize read live rows from existing chunks, write a new merged chunk, and build a new HNSW graph for that merged result. ALTER TABLE ... REBUILD KNN, and ALTER operations that add or drop a float_vector column, also rebuild the graph.

Before this change, the HNSW part of each individual save, merge, or rebuild used one worker:

A RAM-to-disk chunk save walked all live rows from the RAM chunk's segments one by one and inserted their vectors into one HNSW graph.
A chunk merge walked all live rows from the input disk chunks one by one and inserted their vectors into the new graph.
ALTER TABLE ... REBUILD KNN rebuilt each graph in one worker.

Manticore already had some parallelism around these operations. Up to 2 RAM-chunk saves can run at the same time. The optimizer can also run several chunk merges at once, controlled by parallel_chunk_merges. The default is 2 when the host has enough CPU cores. But inside each individual save or merge, the KNN graph build was still single-worker. On KNN-heavy tables, that single worker often determined how long the whole operation took.

What changed

Manticore now splits one KNN graph build across several workers. Each worker gets part of the rows, inserts its vectors into the same destination graph, and finishes independently. The graph-building library coordinates those concurrent inserts so the graph remains valid.

The exact split depends on the operation:

During RAM-to-disk saves, workers take RAM segments from a shared queue until all segments are processed.
During chunk merges and ALTER TABLE ... REBUILD KNN, Manticore divides the live rows into similarly sized ranges so the work is spread evenly.
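
To make the second case concrete, here is one way to cut a row count into similarly sized ranges. It is a minimal sketch of the general technique, not the code Manticore uses; split_rows is a hypothetical name.

```python
from typing import List, Tuple


def split_rows(total_rows: int, workers: int) -> List[Tuple[int, int]]:
    """Split [0, total_rows) into half-open ranges whose sizes differ by at most 1."""
    # Never create more ranges than rows, and always create at least one.
    workers = max(1, min(workers, total_rows))
    base, extra = divmod(total_rows, workers)

    ranges = []
    start = 0
    for i in range(workers):
        # The first `extra` ranges take one additional row each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


print(split_rows(10, 4))
```

Each worker would then insert the vectors of its own range into the shared destination graph. The RAM-to-disk case differs: there the workers pull whole segments from a queue instead of receiving precomputed ranges.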

Single-thread improvements

The same release also improves the single-worker path. Even when knn_parallel_build is set to 1, the benchmark below shows a 10% improvement before adding parallelism. That comes from three changes:

Two-pass neighbor processing. When inserting a vector, the algorithm walks through candidate neighbors and computes distances to them. The new code splits that into two passes: the first pass walks the neighbor list and prefetches the vector data, and the second pass computes the distances. This gives the CPU time to bring the vectors into cache before they are used.
Two comparisons at a time. Some distance calculations now process two candidate vectors together. This reduces repeated work in the inner loop where most of the build time is spent.
Compile-time distance dispatch in build mode. The builder now picks the right distance function once for the build, such as inner product vs. L2 and raw float vs. binary-quantized vectors. That avoids a function-pointer lookup on every distance call and lets the compiler optimize the inner loop more aggressively.

The default and the config

A new searchd setting, knn_parallel_build, controls how many workers one KNN build may use. The default is min(4, threads / 4), where threads is Manticore's threads setting - the size of the worker pool that runs queries and background tasks, which defaults to the number of logical CPU cores on the host.

In practical terms, that means one worker on small hosts and up to four workers by default on larger hosts: a 4-thread host gets one worker, an 8-thread host gets two, a 16-thread host gets four, and anything above that is also capped at four. The default is conservative because production machines often need to handle searches, inserts, and background work at the same time.
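
The default can be reproduced in a few lines. This is a sketch of the formula quoted above rather than Manticore's implementation; the integer division and the floor of one worker are assumptions, chosen to match the "one worker on small hosts" description.

```python
def default_knn_parallel_build(threads: int) -> int:
    """min(4, threads / 4) with an assumed lower bound of one worker."""
    return max(1, min(4, threads // 4))


for threads in (4, 8, 16, 32):
    print(threads, "threads ->", default_knn_parallel_build(threads), "worker(s)")
```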

You usually do not need to change it. Consider raising it when you are rebuilding or optimizing a KNN-heavy table on a host that is not serving live traffic:

SET GLOBAL knn_parallel_build = 16;

Set it to 1 if you need the old single-worker behavior:

SET GLOBAL knn_parallel_build = 1;

The value can also be set in the searchd config and checked with SHOW VARIABLES.

CPU usage

Multiple saves and merges can be active at the same time, and each one can use up to knn_parallel_build workers. These workers use Manticore's existing threads pool. They do not create an unlimited number of extra operating-system threads; if all pool threads are busy, extra work waits in the queue.

This is why the default leaves headroom. On a 32-thread host, the default is 4 workers per KNN build. If two chunk saves overlap, the KNN build work can use up to 8 workers, leaving the rest of the thread pool available for other work.

Benchmark

Setup:

AMD Ryzen 9 5950X (16 physical / 32 logical cores)
Dataset: dbpedia-openai-1M - 1M vectors, 1536 dimensions, cosine distance
Quantization: 1-bit (binary) quantization
Data inserted into an RT table, then OPTIMIZE to a single disk chunk
Measurement: ALTER TABLE knn_data REBUILD KNN, three runs per setting, single-chunk so timings are stable
HNSW settings: defaults

ALTER TABLE ... REBUILD KNN was used because it exercises the same parallel build path as chunk saves and chunk merges, while giving stable timings that are easy to reproduce.

Results:

With one worker, the new code is already 10% faster than the old code: 492 seconds dropped to 442 seconds.
With 16 workers, rebuild time dropped to 39 seconds, about 11x faster than the new one-worker result.
Going from 16 to 32 workers helped only a little: 39 seconds became 36 seconds. On this machine, the useful limit is close to the 16 physical CPU cores.
The default is meant for shared production hosts. For maintenance work on a dedicated host, raising knn_parallel_build can be worth it.
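
The ratios in these results follow from the timings quoted above. The short script below only redoes that arithmetic; "efficiency" here means speedup divided by worker count, a metric added for illustration that the benchmark itself does not report.

```python
OLD_ONE_WORKER_SEC = 492                # old code, one worker
NEW_SEC = {1: 442, 16: 39, 32: 36}      # new code, by knn_parallel_build


def speedup(baseline_sec: float, sec: float) -> float:
    """How many times faster `sec` is than `baseline_sec`."""
    return baseline_sec / sec


saved = 1 - NEW_SEC[1] / OLD_ONE_WORKER_SEC
print(f"single worker, new vs old: {saved:.0%} less time")

for workers, sec in NEW_SEC.items():
    s = speedup(NEW_SEC[1], sec)
    print(f"{workers:>2} workers: {sec:>3} s, {s:.1f}x vs new 1-worker, "
          f"efficiency {s / workers:.0%}")
```

The drop in efficiency between 16 and 32 workers is the same observation as above: past the 16 physical cores, extra workers add little.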

Migration

No action is required. Existing tables keep working, and KNN graphs built by the parallel path are functionally equivalent to graphs built by the old single-worker path.

One detail can matter for strict reproducibility: parallel workers may insert vectors in a different order, so Manticore's on-disk KNN graph file, stored with the .spknn extension, is not guaranteed to be byte-for-byte identical to a single-worker build. Search quality and query speed are expected to be the same. If byte-for-byte reproducibility matters, set knn_parallel_build = 1.

Conclusion

This change speeds up one of the slowest maintenance steps for KNN tables. Parallel graph building reduces the time needed to save chunks, merge chunks, and rebuild KNN data, while the improved single-worker path also speeds up builds on smaller systems. Existing tables continue to work without changes. When CPU resources are available during maintenance, knn_parallel_build can be raised to build KNN graphs faster.

Manticore Search under systemd: beyond fork, PID files, and guesswork

Sergey Nikolaev — Fri, 26 Jun 2026 12:12:48 +0000

If you run Manticore Search on Linux, systemd should be the default way to manage it.

That sentence sounds obvious now, but for a long time it was only partly true. Manticore could run under systemd, yes, though the relationship was a little awkward. The daemon model came from an earlier Unix world; systemd came later and wanted different things from a service. So the setup worked, but never in a very satisfying way.

What changed is simple enough: Manticore now supports native systemd notifications.

Why care? Because several mildly annoying operational problems get better at once:

systemctl status tells a more truthful story
startup and reload are easier to follow
logs fit naturally into journalctl
shutdown is safer when real-time tables are flushing data
PID files stop doing so much heavy lifting

That last point matters less than people think, until the day it matters a lot.

Start with shutdown, because that's where things usually get real

The nicest part of this change is not the nicest-looking part. It is shutdown behavior.

If Manticore is flushing data from real-time tables, shutdown may take a while. Older setups often fell back to searchd --stopwait and a certain amount of hope. Sometimes that was enough. Sometimes it really was not.

The failure mode is boring and nasty: systemd decides the service is stuck, waits long enough, then sends SIGKILL. If Manticore is in the middle of a flush, that is about the worst possible moment to force the process down.

The newer behavior is cooperative. While Manticore is still flushing data, it tells systemd that progress is ongoing and that more time is needed. Concretely, it sends timeout-extension notifications every 15 seconds, each asking for another 30 seconds.
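
The timing of that handshake can be sketched as a loop. This is an illustration in Python, not Manticore's code: flush_step, notify, and clock are hypothetical callables injected so the logic can be read and tested without systemd, while EXTEND_TIMEOUT_USEC is the standard sd_notify field for requesting more time.

```python
import time
from typing import Callable

EXTEND_EVERY_SEC = 15             # interval quoted in the text
EXTEND_BY_USEC = 30 * 1_000_000   # "another 30 seconds", in microseconds


def flush_with_extensions(flush_step: Callable[[], bool],
                          notify: Callable[[str], object],
                          clock: Callable[[], float] = time.monotonic) -> int:
    """Run flush_step() until it returns False, asking for more time as we go.

    flush_step does one slice of flushing and returns True while work remains.
    notify delivers one notification string to the service manager.
    Returns the number of extension requests sent.
    """
    last_sent = clock()
    sent = 0
    while flush_step():
        now = clock()
        if now - last_sent >= EXTEND_EVERY_SEC:
            notify(f"EXTEND_TIMEOUT_USEC={EXTEND_BY_USEC}")
            last_sent = now
            sent += 1
    return sent
```

Because each request asks for 30 seconds and a new one goes out after 15, the deadline keeps moving ahead of the work for as long as the flush is making progress.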

I would put this above almost every other benefit in the article. Better status output is nice. Cleaner reload is nice. Not getting your shutdown path mangled during an RT flush is nicer.

A slow stop is not automatically suspicious anymore. Sometimes it just means the server is finishing the part that actually matters.

What the old setup looked like

Historically, Manticore used the normal Unix daemon pattern: detach from the terminal, double-fork into the background, write a PID file, continue on. That was not some strange design choice. If portability matters, you end up there pretty quickly.

The friction came from mixing that model with systemd.

Support existed, but it was limited. The unit had to trust that the daemon forked correctly and that the PID file existed and still pointed to the right process. When everything behaved, fine. When it drifted even a little, supervision got fuzzy.

The usual weak spots were predictable:

process tracking depended on an external file
status reporting was indirect
startup progress was mostly invisible to systemd
reloads and shutdowns were harder to monitor cleanly

None of this sounds dramatic when written out calmly. In practice it leads to exactly the kind of question operators hate: is this service actually healthy, or just currently alive.

One specific example is worth keeping in mind. If Manticore's internal watchdog restarted searchd after a crash, systemd could keep supervising the original process relationship instead of the newly resurrected daemon. That is where warnings like Supervising process ... which is not our child came from.

A smaller but common failure mode was config drift: change the pid_file path in Manticore, forget to update the unit, and systemd ends up tracking the wrong place. The daemon might be up while systemctl status tells a much less helpful story.

That screenshot captures the old setup fairly well. systemd could see a process tree, but not always the service lifecycle you cared about.

The notify-based unit

The newer setup uses Type=notify, which lets Manticore report its own state directly instead of forcing systemd to infer too much from a PID file and guesswork.

The systemd unit now looks like this:

[Unit]
Description=Manticore Search Engine
...

[Service]
Type=notify
...

ExecStart=/usr/bin/searchd --config /etc/manticoresearch/manticore.conf --nodetach $_ADDITIONAL_SEARCHD_PARAMS
...

Restart=on-failure
...

A few details matter here.

Type=notify means systemd expects status updates from Manticore itself.

--nodetach keeps searchd in the foreground. Under systemd, that is the right default. Anything else is extra ceremony. The flag itself is not new; what changed is that it is now part of the packaged systemd setup rather than a debugging-only kind of option.

PIDFile is no longer the main supervision mechanism, but it can still be useful: you can add it to the unit manually if you have older tooling that expects it.

With this setup, Manticore can report states such as:

starting
loading tables
reloading
ready

That does not sound glamorous. Still, compared to the old "there is a process, probably fine" model, it is a real improvement.

A small thing I like: reload becomes less awkward

Reloading Manticore has traditionally meant sending SIGHUP to searchd. That is still what happens under systemd too. The packaged unit uses:

ExecReload=/bin/kill -HUP $MAINPID

So systemctl reload manticore does not restart the daemon. It sends SIGHUP to the running searchd, which initiates table rotation: Manticore reopens plain tables and switches to freshly built table files without doing a full daemon restart.

That distinction matters. reload is the low-risk way to trigger rotation without taking the service down first. Depending on the seamless_rotate setting, new queries may be briefly stalled and clients may see temporary errors during the rotation window. restart is a different operation entirely: it stops the service and then does a fresh start.

The nice part now is visibility. When rotation starts, Manticore reports RELOADING=1 to systemd; when it finishes, it reports READY=1 again. So systemctl status manticore can show that the daemon is actively rotating instead of just sitting there in a generic running state.
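
For readers who have not used the notification protocol: a service sends short KEY=VALUE datagrams to the Unix socket named in the NOTIFY_SOCKET environment variable. The sketch below shows the RELOADING=1 / READY=1 sequence in Python. It illustrates the protocol only and is not Manticore's implementation; rotate_with_status and its parameters are hypothetical.

```python
import os
import socket
from typing import Callable, Mapping


def sd_notify(message: str, env: Mapping[str, str] = os.environ) -> bool:
    """Send one notification datagram; return False when not run under systemd."""
    path = env.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading "@" denotes a Linux abstract-namespace socket.
    address = b"\0" + path[1:].encode() if path.startswith("@") else path
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), address)
    return True


def rotate_with_status(rotate: Callable[[], None],
                       notify: Callable[[str], object] = sd_notify) -> None:
    """Wrap a rotation in the two state changes described above."""
    notify("RELOADING=1")   # systemctl status now shows the service as reloading
    rotate()                # hypothetical callable that does the actual rotation
    notify("READY=1")       # back to the normal running state
```

Outside systemd, NOTIFY_SOCKET is unset and sd_notify simply returns False, so the same code path works when the process is started by hand.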

--nodetach also fixes the logging story

This option is not new. The environment around it changed.

When Manticore stays attached to systemd instead of disappearing into the background, logs go to the journal and the main process is the one systemd actually knows about. There is less indirection. That usually pays off later, not immediately. Before --nodetach, startup messages could begin in journalctl and then the rest of the story would move into Manticore's own log files after detach.

So for many deployments, the normal tools are enough:

journalctl -u manticore
journalctl -u manticore -f
journalctl -u manticore --since "1 hour ago"

I like boring logging setups. One place to look first is underrated.

If you are happy using the systemd journal as the main log destination, you may not need the older logging setup in manticore.conf. Often that is one less thing to think about.

About the internal watchdog

If Manticore runs under systemd, the simple setup is the one I would recommend: let systemd supervise the service, and do not enable Manticore's internal watchdog unless you actually need it.

Could you run both? Yes.
Would I choose that for a normal deployment? No.

If you explicitly enable the internal watchdog, the supervision model becomes less direct and you may see warnings like:

Supervising process ... which is not our child

That configuration is supported. It is also, for most people, just extra moving parts.

A few loose notes

PID files matter much less to systemd itself with notification-based supervision. You may be able to remove pid_file from the configuration, but only if nothing else in your environment still depends on it. Old scripts tend to survive longer than anyone expects.

If you use executable configuration files, including shebang-based configs such as #!/usr/bin/env python, this setup tends to behave more naturally than some older service-management approaches did.

Commands you will actually use

sudo systemctl start manticore
sudo systemctl stop manticore
sudo systemctl restart manticore
sudo systemctl reload manticore
sudo systemctl status manticore
sudo systemctl enable manticore

And for logs:

sudo journalctl -u manticore
sudo journalctl -u manticore -f
sudo journalctl -u manticore --since "1 hour ago"

That is basically it.

The practical result is not flashy: Manticore behaves more like a modern Linux service and less like a daemon from an older era that everyone learned to work around. But those are often the best infrastructure changes. They remove friction you had almost stopped noticing.

14× faster embeddings: how we rebuilt the ONNX path in Manticore

Sergey Nikolaev — Thu, 25 Jun 2026 11:50:04 +0000

When we shipped Auto Embeddings — the feature that turns any text column into a vector automatically, with no separate model service to run — the most common piece of feedback was about speed. The previous path went through SentenceTransformers on top of Candle, Hugging Face's pure-Rust ML inference runtime, and it left a lot of CPU on the floor: most workloads sat in the low-double-digits of docs/sec no matter how we fed them, and concurrent calls serialised on a single model session.

So we spent a few weeks rebuilding how Manticore runs ONNX models. The new ONNX Runtime backend shipped in Manticore Search 27.1.5. ONNX (Open Neural Network Exchange) is the portable model format that most of the popular open-source embedding models (MiniLM, BGE, E5, and friends) already publish. The result is a backend that's ~14× faster on average than the previous SentenceTransformers/Candle path on the same hardware (an ordinary, inexpensive 16-core / 32-thread server), with the same model and the same weights, averaged over the full threads × batch workload grid. That advantage holds whether you run 1 client thread or 32. The old path stayed in the 5–11 docs/sec range across the entire grid; the new one lives in the 70–230 docs/sec band.

This post is the engineering log: what we tried, what surprised us, what we threw away, and what the final design looks like.

TL;DR

~14× faster on average than the previous SentenceTransformers/Candle path, averaged across the full threads × batch workload grid (1 / 2 / 4 / 8 / 16 / 32 threads × batch sizes 1…128) on the same box (16 cores / 32 threads), same model, same weights.
Released in Manticore Search 27.1.5, the new ONNX path is now the default fast path for any HuggingFace model that ships an .onnx file.
On all-MiniLM-L12-v2, the old Candle path sat at 5–11 docs/sec across every configuration we tried. The new ONNX path lands in the 70–230 docs/sec range — the same ~14× margin holds whether you run 1 client thread or 32.
Single-insert latency on our test box: ~14 ms with a single client, ~56 ms under 8-way concurrent load — both well below the 200+ ms Candle was hitting.
Want maximum bulk ingest throughput? Use a high batch size (32–128) on a single client thread. The new backend parallelises inside the call, so client-side fan-out just piles coordination overhead on top — peak on our box was 233 docs/sec at 1 thread + batch=64.
The two changes that mattered most: turning intra_op_spinning off, and giving up on batching documents inside the worker.
No user-facing API changes. A table that already points at an ONNX-capable MODEL_NAME picks up the new path automatically. Switching an existing table to a different model isn't a one-liner — Manticore doesn't allow altering MODEL_NAME on a FLOAT_VECTOR field in place — but you don't have to recreate the whole table either: you can add a new column with the new model alongside, rebuild its embeddings, and drop the old one.

Why this matters

With auto-embeddings, the database itself runs the model on every INSERT. That means embedding speed is INSERT speed — your ingest throughput is whatever the embedding step can sustain.

The old SentenceTransformers/Candle path left performance on the table. Concurrency hit lock contention, batched calls plateaued because of padding overhead, and between calls the runtime parked threads in ways that prevented the next call from picking up where the previous one left off. The headline symptom was simple: top would show the box well under full load no matter what you threw at it. The whole sweep — single-row INSERTs, 128-row bulk INSERTs, one client thread, thirty-two client threads — sat at 5–11 docs/sec, because nothing about how you fed it could buy you more CPU.

The new ONNX path raises the floor by an order of magnitude and gives users meaningful performance tuning options. A single-thread, single-row INSERT now lands at 72 docs/sec, already ~7× the old Candle ceiling. Add concurrency or batch size and it climbs into the 130–230 docs/sec range, with the top of the grid at 233 docs/sec on a single client thread at --batch-size=64. Averaged across the whole threads × batch matrix, the new path is ~14× the old one.

Why ONNX, and not Candle

Manticore's embeddings library has supported a few backends for a while. The Candle path is great for correctness and easy to ship. But for production inference of small encoder models like the MiniLM and BGE family, ONNX Runtime is hard to beat:

ONNX Runtime (ORT for short: Microsoft's official, hand-tuned C++ inference engine for ONNX models) does graph fusion, constant folding, and kernel autotuning.
Most of the popular embedding models on HuggingFace already publish a pre-fused model.onnx in their onnx/ directory. The on-disk file is already in the shape ORT wants.

On the same all-MiniLM-L12-v2 weights, on CPU, the ONNX path is a noticeable step up over the Candle path. Same quality, much less per-document work.

The ORT session is created with a small set of opinions:

let session = ort::session::Session::builder()?
    .with_optimization_level(GraphOptimizationLevel::Level3)?
    .with_intra_threads(0)?            // let ORT pick (= all cores)
    .with_intra_op_spinning(false)?    // do NOT busy-wait between calls
    .with_flush_to_zero()?             // kill denormals on attention softmax
    .with_approximate_gelu()?          // ~10% faster activation, no quality loss
    .commit_from_file(&onnx_path)?;

Most of these are uncontroversial, "of course you turn that on" knobs. One is not: intra_op_spinning(false). We'll come back to it — it's the single biggest win in the whole branch, and it's not really an ORT setting so much as a load-shape decision.

The concurrency model — the part most readers will find new

If you give a Rust developer "make ONNX go fast" with no other constraints, they reach for one of two patterns. We tried both. They are both wrong for this workload.

Pattern 1: a single shared Session behind a Mutex (a Mutex is a lock that lets only one thread touch the session at a time). Easy to reason about, easy to get right. Throughput collapses under concurrency because every caller serialises on the lock. Fine for a CLI tool, awful for a database serving many concurrent INSERTs.

Pattern 2: a session pool, one Session per CPU. No more lock contention, but cold-start time multiplies, RAM use multiplies, and small inputs pay a dispatch cost just to land on a session. We had a working version of this in a development branch and it never quite delivered.

The thing that unlocked the design is something most Rust ONNX wrappers get wrong: on Linux and macOS, ORT's C Run() API is thread-safe. You can share one Session across many concurrent callers without any locking. The C++ side already serialises what needs serialising; the Rust API just hides it behind borrow-checker rules that do not match what the underlying library actually allows.

So we wrap the session in a small platform-aware type:

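use ort::session::Session;

// Non-Windows variant: one shared Session, no lock around it.
#[cfg(not(target_os = "windows"))]
struct SessionWrapper {
    inner: std::cell::UnsafeCell<Session>,
}

// SAFETY: relies on ORT's Run() being thread-safe on Linux and macOS.
#[cfg(not(target_os = "windows"))]
unsafe impl Sync for SessionWrapper {}
#[cfg(not(target_os = "windows"))]
unsafe impl Send for SessionWrapper {}

#[cfg(not(target_os = "windows"))]
impl SessionWrapper {
    fn with_session<R>(&self, f: impl FnOnce(&mut Session) -> R) -> R {
        // SAFETY: callers only invoke run() on the session; see the note above.
        f(unsafe { &mut *self.inner.get() })
    }
}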

Yes, this is unsafe. We're taking the borrow checker out of the loop because the underlying library is documented to be safe under the access pattern we're using. It's a deliberate unsafe with a one-line justification, not a foot-gun.

On Windows, ORT's threading model has known issues, so we serialise Run() with a Mutex. Importantly, the lock is held for the entire closure, not just the call to run() — that's what fixed the race we saw on Windows where one thread's SessionOutputs were still being read while another thread had already started a new run(). Closure-scoped locking, not call-scoped.

Adaptive parallelism — the wrong turns we took

This is the part of the work that took the longest, because every textbook says "to make ONNX fast, batch your inputs". So our first attempts followed the textbook.

We tokenized chunks of 8, 16, 32 documents at a time, padded them to max_len, and ran a single forward pass per worker thread. The throughput numbers came back lower than processing the same texts one-by-one through the same session. We ran it again. Same result. We spent a while trying to disprove it before accepting it. The revert commit 980b24b, "Revert: perf(model): batch inference in worker threads", is the moment we stopped fighting and rebuilt around what the profiler kept telling us.

Two things were behind the surprise.

The padding tax. A batch of mixed-length texts pads every row up to the longest row. The model then does work proportional to batch_size * max_len * hidden_dim, regardless of how much real content is in the batch. Real text inputs are highly variable in length: a typical batch of 8 random sentences might have one 60-token outlier and seven 8-token rows. The model spends most of its cycles multiplying padding tokens against attention weights. With one-doc batches, the model only does work proportional to that doc's actual token count. Per-document, "no batching" is cheaper than "batching" once the variance in input length is realistic.
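
The example batch makes the cost easy to quantify. The snippet below counts token slots only, following the batch_size * max_len proportionality stated above. It is a back-of-the-envelope sketch with hypothetical helper names, not a measurement, and it ignores hidden_dim because that factor is the same on both sides.

```python
from typing import Sequence


def padded_tokens(lengths: Sequence[int]) -> int:
    """Token slots the model processes when every row is padded to the longest."""
    return len(lengths) * max(lengths)


def real_tokens(lengths: Sequence[int]) -> int:
    """Token slots processed when each document is run on its own."""
    return sum(lengths)


def padding_share(lengths: Sequence[int]) -> float:
    """Fraction of the padded batch that is padding rather than content."""
    return 1 - real_tokens(lengths) / padded_tokens(lengths)


# The batch from the text: one 60-token outlier and seven 8-token rows.
batch = [60] + [8] * 7
print(padded_tokens(batch), real_tokens(batch), f"{padding_share(batch):.0%}")
```

For this batch that is 480 padded slots against 116 real ones, so roughly three quarters of the work goes to padding.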

Spinning. ORT's intra-op thread pool spins between dispatches by default: threads burn CPU in a tight loop waiting for the next chunk of work. With one big batch per session call this is invisible, because the thread is always busy with real work. With many concurrent small calls it becomes a disaster: every worker's intra-op pool is pinned at 100% CPU between calls, and there's no CPU left for anything else. We saw exactly this pattern in top: every core at 100%, yet throughput lower than with spinning turned off. This sounds wrong until you remember the rest of the system needs CPU time too: the tokenizer, the HNSW build, the rest of searchd. Setting with_intra_op_spinning(false) was a one-line change that immediately raised throughput and dropped CPU usage at the same time.
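To put a number on the padding tax described above, here is a minimal sketch that counts token slots for the example batch (one 60-token row and seven 8-token rows). The function names are ours, not part of the embeddings library, and the count deliberately ignores hidden_dim (it is the same factor on both sides) and any non-linear attention cost.

```rust
/// Token slots a padded batch pays for: every row is padded to the longest row.
fn padded_token_slots(lengths: &[usize]) -> usize {
    let max_len = lengths.iter().copied().max().unwrap_or(0);
    lengths.len() * max_len
}

/// Token slots paid for when every document is run on its own, without padding.
fn unpadded_token_slots(lengths: &[usize]) -> usize {
    lengths.iter().sum()
}
```

For lengths [60, 8, 8, 8, 8, 8, 8, 8] the padded batch covers 480 slots while the documents contain only 116 real tokens, so roughly three quarters of the batch is padding.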

So the final shape is the opposite of the textbook recipe:

One shared session, no pool.
One document per inference call, no batching inside the worker.
Many concurrent callers, scaled to CPU count.
No spinning between calls — yield the CPU like a polite citizen.

fn predict_pipelined(&self, texts: &[&str]) -> Result<Vec<Vec<f32>>, _> {
    let bs = batch_size();

    // Small input — single tokenize + infer, no thread overhead.
    // This is the path a 1-doc INSERT takes.
    if texts.len() <= bs {
        return Self::tokenize_and_infer(&self.session, &self.tokenizer, texts, ...);
    }

    // Large input — split across workers, each running 1-doc-at-a-time
    // through the SHARED session. This deliberately mimics the
    // many-concurrent-callers pattern that ORT is happiest with.
    let num_workers = (texts.len() / bs).min(available_cpus()).max(1);
    let docs_per_worker = texts.len().div_ceil(num_workers);

    std::thread::scope(|s| {
        for worker_texts in texts.chunks(docs_per_worker) {
            s.spawn(move || {
                for text in worker_texts {
                    Self::tokenize_and_infer(&session, &tokenizer,
                                             std::slice::from_ref(text), ...)?;
                }
                Ok(())
            });
        }
    });
    // ...
}

The two-branch design is on purpose. A 1-row INSERT comes in with texts.len() == 1, which is <= bs, so it takes the fast path with zero thread spawning, zero channel sends, zero coordination overhead. A bulk REPLACE INTO with thousands of rows takes the parallel branch and gets the throughput benefit. The cheap case stays cheap, the expensive case stays parallel.
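The same two-branch shape can be tried out with the standard library alone. The sketch below is not the production code: embed_one is a hypothetical stand-in for the tokenize-and-infer call, and batch_size and cpus are passed in as plain arguments (batch_size is assumed to be at least 1).

```rust
use std::thread;

/// Hypothetical stand-in for one tokenize + infer call on a single document.
fn embed_one(text: &str) -> Vec<f32> {
    vec![text.len() as f32]
}

/// Worker count and documents per worker, using the same arithmetic as above.
fn plan_workers(n_texts: usize, batch_size: usize, cpus: usize) -> (usize, usize) {
    let num_workers = (n_texts / batch_size).min(cpus).max(1);
    (num_workers, n_texts.div_ceil(num_workers))
}

fn embed_all(texts: &[&str], batch_size: usize, cpus: usize) -> Vec<Vec<f32>> {
    // Small input: stay on the calling thread, no spawning at all.
    if texts.len() <= batch_size {
        return texts.iter().map(|t| embed_one(t)).collect();
    }

    // Large input: one scoped thread per chunk, one document per call.
    let (_, docs_per_worker) = plan_workers(texts.len(), batch_size, cpus);
    thread::scope(|s| {
        // Spawn every worker first, then join them in chunk order.
        let handles: Vec<_> = texts
            .chunks(docs_per_worker)
            .map(|chunk| s.spawn(move || chunk.iter().map(|t| embed_one(t)).collect::<Vec<_>>()))
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}
```

Because the chunks are joined in the order they were created, the output rows line up with the input texts even though the workers run concurrently.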

We also enable parallel tokenization once at startup (TOKENIZERS_PARALLELISM=true) and pre-truncate inputs by character count before BPE, so a 100KB blob of text doesn't pin a CPU on the tokenizer for a second before the model even sees it.

Numbers

All runs on our standard benchmark box, using all-MiniLM-L12-v2-onnx, 1000 documents per run. Generated with manticore-load:

manticore-load --quiet --drop --batch-size=1 --threads=8 --total=1000 \
  --init="CREATE TABLE t (
    f text,
    v FLOAT_VECTOR KNN_TYPE='hnsw' HNSW_SIMILARITY='l2'
      MODEL_NAME='onnx-models/all-MiniLM-L12-v2-onnx' FROM=''
  )" \
  --load="INSERT INTO t(f) VALUES('<text/10/100>')"

Same command with --batch-size=2, 8, 32, 128, all at 8 threads:

--batch-size	docs/sec	avg call latency (ms)	per-doc latency (ms)
1	143	55.9	55.9
2	113	141.6	70.8
8	91	703.3	87.9
32	146	1753.4	54.8
128	147	6966.0	54.4

Compared against Candle at the same 8 threads — which sat flat at 10 docs/sec across every batch size — that's between 9× and 15× more documents per second depending on the batch you pick. The "avg call latency" column is the time for one full INSERT statement to return, not per document; divide by the batch size and the per-doc cost lands in the 55–90 ms band.
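The columns in the 8-thread table are tied together by simple arithmetic, which makes for a handy sanity check when you rerun the benchmark. The sketch below assumes all client threads stay busy for the whole run, so throughput is roughly threads × batch size / call latency. That relation is our reading of the numbers, not something manticore-load reports directly.

```rust
/// Per-document latency: one INSERT's latency spread over the rows it carries.
fn per_doc_latency_ms(call_latency_ms: f64, batch_size: u32) -> f64 {
    call_latency_ms / f64::from(batch_size)
}

/// Expected docs/sec if every client thread issues calls back to back.
fn expected_docs_per_sec(threads: u32, batch_size: u32, call_latency_ms: f64) -> f64 {
    f64::from(threads) * f64::from(batch_size) * 1000.0 / call_latency_ms
}
```

Plugging in the table's latencies at 8 threads reproduces the docs/sec column after rounding, for example 8 × 2 × 1000 / 141.6 ≈ 113.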

If you rerun the sweep with 1 client thread, which turns out to be the optimal configuration for bulk loading, large batches climb further: 72 / 76 / 93 / 175 / 233 / 222 docs/sec at batches 1 / 2 / 8 / 32 / 64 / 128. The peak in the entire grid is 233 docs/sec at 1 thread × batch=64, or about 4.3 ms per document (1000 ms / 233).

How to feed it for maximum throughput

If you're loading a lot of data in bulk and want maximum docs/sec, the recipe is straightforward: send large INSERT ... VALUES (..), (..), ... statements (batch 32–128) from a single client thread, not many small inserts from many threads. The new backend already parallelises inside the call (see the predict_pipelined code above), so client-side fan-out just piles coordination overhead on top of what ORT is already doing — that's why 1 thread × batch=64 (233 docs/sec) beats 8 threads × batch=128 (147 docs/sec) by a clear margin.

If your workload is naturally one-row-at-a-time (web requests, queue consumers, MCP servers), just use single-row INSERT INTO statements. The single-thread, single-row floor of 72 docs/sec is already ~7× the old Candle path, which is fast enough that this isn't a tier you need to optimise around any more.

Before vs after, across the whole grid

To make the before/after concrete, we also swept the full threads × batch grid against the old Candle/trans path on the same box, same weights:

Each X-axis tick is backend threads/batch-size. The left half (trans …) is the old Candle path — docs/sec sits at 5–11 across the entire grid no matter how many threads or how large the batch, while CPU is already pinned. The right half (onnx …) is the new path — docs/sec is an order of magnitude higher across the whole sweep. Within the new path: at small batches, adding client threads helps (1T/batch=1 = 72 → 8T/batch=1 = 143); at large batches, a single client thread wins (1T/batch=64 = 233 is the global peak).

Same sweep, but plotting efficiency (docs/sec per % CPU) alongside docs/sec. On the Candle (trans) side, both lines hug the floor — the box is spending CPU without producing documents. On the ONNX (onnx) side, efficiency is highest at 1–2 threads with mid-sized batches, where each percent of CPU buys the most embeddings, and it stays well above the old path even as we crank threads up to 32.

What's next

A few things are queued behind this work:

GPU path. The current ONNX setup is CPU-only. The _use_gpu parameter is plumbed through but not yet wired to the ORT CUDA execution provider.
Windows perf parity. We currently serialise on Windows because of an ORT threading bug. Once that bug is resolved upstream, Windows should get the same shared-session behaviour Linux/macOS already have.
More architectures down the ONNX path. Right now ONNX is the path for BERT-family encoders. T5, causal-LM and quantized GGUF models still go through Candle for now.

Try it

If your existing table is already pointed at an ONNX-capable model, the new path takes over once you upgrade to Manticore Search 27.1.5 or newer — no schema changes, no re-ingest. You should just see your INSERTs go faster.

If you're not on an ONNX model yet — or you want to move to a smaller / faster one to take maximum advantage of the new backend — note that you can't swap the model on an existing field. Manticore doesn't support altering MODEL_NAME on an existing FLOAT_VECTOR field, so migrating in place isn't an option. You have two practical paths to choose between, depending on what's easier in your setup:

Option A — dump, edit, reload. Even if you no longer have the original source data, you can mysqldump the existing table to a SQL file, edit the CREATE TABLE in that dump to point MODEL_NAME at the ONNX-optimised model you want, and replay the dump into a fresh table. Manticore will re-embed every row through the new path on the way in.

Option B — add a new column alongside, rebuild, drop the old one. If you'd rather stay in SQL and avoid the dump round-trip, add a new FLOAT_VECTOR column on the same table that points at the ONNX model, then trigger a one-shot re-embed of that column from the source text:

ALTER TABLE t ADD COLUMN v_new FLOAT_VECTOR KNN_TYPE='hnsw'
  HNSW_SIMILARITY='l2'
  MODEL_NAME='Xenova/all-MiniLM-L6-v2'
  FROM='text_field';

ALTER TABLE t REBUILD EMBEDDINGS v_new;
-- once you've cut over reads to v_new, drop the old column
ALTER TABLE t DROP COLUMN v_old;

See the Rebuilding embeddings section of the docs for the exact syntax and constraints.

On brand-new tables, none of this matters — just pick an ONNX-optimised MODEL_NAME from the start.

A good place to shop for ONNX-ready embedding models is the Xenova collection on Hugging Face — these are pre-converted to ONNX and ready to drop into MODEL_NAME='...'. Filter the list by the feature-extraction task to narrow it down to embedding-style models. Some sensible starting points:

Xenova/all-MiniLM-L6-v2 — small and fast, 384-dim, great default.
Xenova/all-MiniLM-L12-v2 — the model we benchmarked in this post, 384-dim, a step up in quality.
Xenova/bge-small-en-v1.5 — strong English retrieval, 384-dim.
Xenova/multilingual-e5-small — multilingual coverage, 384-dim.

If you aren't using auto-embeddings yet at all, the original announcement walks through the SQL from scratch.

📚 KNN search documentation
💬 Slack community — we'd love to see how the new path holds up on your data.

The Ukrainian lemmatizer is now built into Manticore Search

Sergey Nikolaev — Tue, 23 Jun 2026 11:14:40 +0000

In short

Starting with release 27.1.5, the Ukrainian lemmatizer no longer requires a separate Python stack.
Previously you had to install a separate package, Python 3.9, pymorphy2, and the Ukrainian dictionaries.
The good news is that the dictionary now ships with Manticore.

You only need to enable the morphology explicitly:

morphology='lemmatize_uk_all'

You also no longer need to add Ukrainian characters to charset_table separately: the standard non_cont set already contains mappings for є, і, ї, ґ.
The apostrophe matters for Ukrainian, but there is one important caveat. If you simply add it to charset_table, it can affect English-language data, where the apostrophe is also used.

That is why we recommend a separate table for Ukrainian texts, with its own charset_table that includes the apostrophe, rather than mixing Ukrainian with English or other languages in one table.

That is all you need to account for to get full Ukrainian support in Manticore Search. No dictionaries, packages, or scripts. Everything now works out of the box.

What a lemmatizer is

In full-text search you often need to find a word in more than the exact form the user typed. A document may contain мрії (an inflected form of мрія, "dream") while the user searches for мрія. Or the text has інтернет-магазину while the query is інтернет-магазин ("online store"). A person easily sees that these are forms of the same word. To a search engine without morphology, they are different tokens.

Search engines handle this with stemming and lemmatization.

A stemmer usually works by rules: it drops or replaces endings. That is fast, but the result can be crude and does not always look like a real word.

A lemmatizer relies on a dictionary and morphology to produce the normal form of a word. For Ukrainian this is especially noticeable because of cases, gender, and number.
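The difference is easy to see on a toy example. The sketch below is purely illustrative and has nothing to do with how Manticore or uk.pak work internally: the ending list and the three-entry dictionary are made up for the demonstration.

```rust
use std::collections::HashMap;

/// Toy rule-based stemmer: strips the first matching ending.
/// The ending list is hypothetical and far too small for real use.
fn toy_stem(word: &str) -> String {
    const ENDINGS: [&str; 3] = ["ії", "ого", "у"];
    for ending in ENDINGS.iter() {
        if let Some(stem) = word.strip_suffix(*ending) {
            if !stem.is_empty() {
                return stem.to_string();
            }
        }
    }
    word.to_string()
}

/// Toy dictionary lemmatizer: looks the form up, falls back to the input.
fn toy_lemmatize(word: &str, dict: &HashMap<&str, &str>) -> String {
    dict.get(word)
        .map(|lemma| lemma.to_string())
        .unwrap_or_else(|| word.to_string())
}

/// Hypothetical three-entry dictionary mapping word forms to normal forms.
fn sample_dict() -> HashMap<&'static str, &'static str> {
    let mut dict = HashMap::new();
    dict.insert("мрії", "мрія");
    dict.insert("червона", "червоний");
    dict.insert("магазину", "магазин");
    dict
}
```

With these rules the stemmer turns мрії into мр, which is not a word, while the dictionary lookup returns мрія. For магазину both approaches happen to agree on магазин.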

What has changed

If you have already tried Ukrainian lemmatization in Manticore, the problem may not have been the search itself but the installation:

a separate manticore-lemmatizer-uk package;
Python 3.9 built with --enable-shared;
pymorphy2 and pymorphy2-dicts-uk;
additional system dependencies.

The Ukrainian dictionary now ships as a regular language file, uk.pak, and Manticore loads it directly. All that is left is to configure the table: specify the morphology you need and carry on.

Minimal configuration

Let's create a table for Ukrainian texts:

CREATE TABLE uk_docs(title text)
  morphology='lemmatize_uk_all'
  charset_table='non_cont,U+0027';

The important part here is enabling the morphology:

morphology='lemmatize_uk_all' enables the Ukrainian lemmatizer and indexes all the normal forms it finds.

For Ukrainian we add only the apostrophe (U+0027), so that words such as обов'язковим are indexed as a single token.

If a single normal form is enough, use lemmatize_uk. To index all possible forms, choose lemmatize_uk_all.

Let's check with an example

Add a few documents:

INSERT INTO uk_docs VALUES
  (1, 'мрії про червону сукню'),
  (2, 'каталог інтернет-магазину'),
  (3, 'команд-учасниць запросили на зустріч');

The query мрія finds the document where the word is written as мрії:

SELECT id, title FROM uk_docs WHERE MATCH('мрія') ORDER BY id ASC;

+------+---------------------------+
| id   | title                     |
+------+---------------------------+
|    1 | мрії про червону сукню    |
+------+---------------------------+

The query червоний finds червону:

SELECT id, title FROM uk_docs WHERE MATCH('червоний') ORDER BY id ASC;

+------+---------------------------+
| id   | title                     |
+------+---------------------------+
|    1 | мрії про червону сукню    |
+------+---------------------------+

And інтернет-магазин finds інтернет-магазину:

SELECT id, title FROM uk_docs WHERE MATCH('інтернет-магазин') ORDER BY id ASC;

+------+---------------------------+
| id   | title                     |
+------+---------------------------+
|    2 | каталог інтернет-магазину |
+------+---------------------------+

What happens to the tokens

If you want to see the normalization itself, not just the search result, use CALL KEYWORDS:

CALL KEYWORDS(
  'мрії червона інтернет-магазину команд-учасниць',
  'uk_docs'
);

+------+--------------------+--------------+
| qpos | tokenized          | normalized   |
+------+--------------------+--------------+
| 1    | мрії               | мрія         |
| 2    | червона            | червоний     |
| 3    | інтернет           | інтернет     |
| 4    | магазину           | магазин      |
| 5    | команд             | команда      |
| 6    | учасниць           | учасниця     |
+------+--------------------+--------------+

Here the difference from simple ending-stripping is clear: the output is the normal forms of the words, which can then be searched. мрії becomes мрія, червона becomes червоний, магазину becomes магазин.

What to keep in mind

Using the Ukrainian lemmatizer has become simpler, but it still has to be enabled explicitly for each table via morphology.

The standard charset_table=non_cont already covers the Ukrainian characters є, і, ї, ґ. If you are defining a table specifically for Ukrainian texts, it is enough to add the apostrophe: charset_table='non_cont,U+0027'.

If you use the official Manticore Search packages or images of a current version, the Ukrainian uk.pak should already be in place. If you have a custom build or a non-standard file layout, check that lemmatizer_base points to the directory that contains uk.pak.

More details on configuring morphology are in the documentation: morphology.

Faster KNN search in Manticore: 2-pass HNSW, batched distances, and AVX-512

Sergey Nikolaev — Tue, 23 Jun 2026 10:43:59 +0000

TL;DR: Three changes to the HNSW search engine improve KNN throughput by up to 29% at high k, with over 20% gains under concurrent load. No API changes, no index rebuild, no configuration. Just faster searches.

Faster KNN search in Manticore

Manticore's KNN search is built on top of hnswlib, an open-source HNSW implementation. Historically, most of our KNN work focused on custom distance functions, such as those used for binary quantization, rather than on hnswlib's core search loop. We also added features like prefiltering with ACORN-1 and early termination, but the main search loop stayed the same: hnswlib still visited neighbors, computed distances, and maintained its set of candidates the same way.

These changes go further, modifying hnswlib's core search loop itself - restructuring how it traverses neighbors, how it calls distance functions, and how it interacts with the CPU's memory hierarchy. Combined with new AVX-512 distance implementations in the columnar library, these changes target three sources of overhead: inefficient memory access patterns, redundant data loads, and indirect function call overhead.

Compile-time distance function specialization

Previously, the distance function was a runtime function pointer stored in the HNSW index and called for every candidate. For large search budgets, that can mean a large number of indirect calls per query. Indirect calls prevent the compiler from inlining the distance function into the search loop, and they create branch prediction overhead.

The new code resolves the distance function at compile time using C++ templates. When the search begins, a single switch statement selects the right template specialization based on the distance metric and quantization settings. From that point on, the entire inner loop - neighbor traversal, distance computation, candidate set updates - runs as one monolithic function with the distance calculation fully inlined. The compiler can now optimize register allocation, instruction scheduling, and loop unrolling across the distance computation boundary.
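The same idea can be sketched in Rust, where generics are monomorphized much like C++ templates. This is an analogy under our own names (Distance, L2Dist, IpDist, nearest_with), not Manticore's or hnswlib's code, and a brute-force scan stands in for the HNSW inner loop.

```rust
/// A distance metric resolved at compile time through a generic parameter.
trait Distance {
    fn dist(a: &[f32], b: &[f32]) -> f32;
}

struct L2Dist;
struct IpDist;

impl Distance for L2Dist {
    #[inline(always)]
    fn dist(a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
    }
}

impl Distance for IpDist {
    #[inline(always)]
    fn dist(a: &[f32], b: &[f32]) -> f32 {
        1.0 - a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>()
    }
}

#[derive(Clone, Copy)]
enum Metric {
    L2,
    InnerProduct,
}

/// The hot loop: compiled once per metric, with the distance call inlinable.
fn nearest_with<D: Distance>(query: &[f32], vectors: &[Vec<f32>]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (id, v) in vectors.iter().enumerate() {
        let d = D::dist(query, v);
        if best.map_or(true, |(_, best_d)| d < best_d) {
            best = Some((id, d));
        }
    }
    best.map(|(id, _)| id)
}

/// A single runtime dispatch picks the specialization before the loop starts.
fn nearest(metric: Metric, query: &[f32], vectors: &[Vec<f32>]) -> Option<usize> {
    match metric {
        Metric::L2 => nearest_with::<L2Dist>(query, vectors),
        Metric::InnerProduct => nearest_with::<IpDist>(query, vectors),
    }
}
```

The match runs once per query. Inside nearest_with there is no function pointer left for the compiler to work around.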

2-pass neighbor processing

The HNSW algorithm explores the graph by visiting nodes and computing distances to their neighbors. In the original implementation, each neighbor was processed in a single pass: check if visited, fetch its vector data, compute distance, update the set of candidates. This meant that memory prefetch hints had little time to take effect before the data was needed.

The new implementation splits this into two passes. Pass 1 iterates all neighbors of the current node, skips already-visited ones, and collects the unvisited neighbors into a small batch array. As each neighbor is added to the batch, a prefetch hint is issued for its vector data. Pass 2 iterates the batch and computes distances. By the time Pass 2 reaches each vector, the prefetch from Pass 1 has had time to bring the data into cache.

Pass 2 walks a compact sequential array of candidate IDs, not the graph structure itself. The underlying vector loads are still scattered, but the data has been prefetched ahead of time.

For unfiltered queries (no WHERE clause on the KNN search), the new code also takes a fast path that eliminates the per-candidate filter check entirely.
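A simplified sketch of the two passes, assuming vectors stored as plain Vec<f32> rows and a boolean visited list. The real implementation works on hnswlib's internal structures and issues actual prefetch instructions, which are only marked by a comment here.

```rust
fn l2_squared(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Scores the not-yet-visited neighbors of one graph node in two passes.
fn score_unvisited_neighbors(
    query: &[f32],
    vectors: &[Vec<f32>],
    neighbors: &[usize],
    visited: &mut [bool],
) -> Vec<(usize, f32)> {
    // Pass 1: filter out visited neighbors and collect the rest into a batch.
    let mut batch: Vec<usize> = Vec::with_capacity(neighbors.len());
    for &id in neighbors {
        if visited[id] {
            continue;
        }
        visited[id] = true;
        // A real implementation would issue a prefetch hint for vectors[id] here.
        batch.push(id);
    }

    // Pass 2: walk the compact batch and compute distances.
    batch
        .iter()
        .map(|&id| (id, l2_squared(query, &vectors[id])))
        .collect()
}
```

The gap between a neighbor entering the batch and its distance being computed is what gives a prefetch time to complete.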

Batched distance computation

The 2-pass structure helps in two ways: it gives prefetching more time to work, and it makes batching easy. Once Pass 2 has a compact list of candidates, it can score them two at a time instead of one by one.

When scoring two candidates, the query vector is loaded once per SIMD iteration and reused for both distance computations, eliminating redundant loads.

This reduces repeated query-side loads and lets the scoring loop process candidates in pairs, with a fallback for an odd remainder. Batch-2 functions are provided for inner product, L2, and their binary-quantized variants.
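In scalar form, a batch-2 distance function looks roughly like this. It is a sketch of the pattern only: the real functions are SIMD code in the columnar library, and the names l2_batch2 and score_in_pairs are ours. All vectors are assumed to have the same length.

```rust
fn l2_single(query: &[f32], a: &[f32]) -> f32 {
    query.iter().zip(a).map(|(q, x)| (q - x) * (q - x)).sum()
}

/// Squared L2 distance from `query` to two candidates at once.
/// Each query element is read once and reused for both candidates.
fn l2_batch2(query: &[f32], a: &[f32], b: &[f32]) -> (f32, f32) {
    let (mut dist_a, mut dist_b) = (0.0f32, 0.0f32);
    for ((q, x), y) in query.iter().zip(a).zip(b) {
        let (da, db) = (q - x, q - y);
        dist_a += da * da;
        dist_b += db * db;
    }
    (dist_a, dist_b)
}

/// Scores candidates in pairs, with a single-vector fallback for an odd remainder.
fn score_in_pairs(query: &[f32], candidates: &[Vec<f32>]) -> Vec<f32> {
    let mut out = Vec::with_capacity(candidates.len());
    let mut pairs = candidates.chunks_exact(2);
    for pair in &mut pairs {
        let (da, db) = l2_batch2(query, &pair[0], &pair[1]);
        out.push(da);
        out.push(db);
    }
    for last in pairs.remainder() {
        out.push(l2_single(query, last));
    }
    out
}
```

In a SIMD version the saving comes from loading each query register once per iteration instead of once per candidate.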

AVX-512 support

The new AVX-512 distance code processes 16 floats per iteration instead of 8 with AVX2. For inner product and L2 distance, the core loop uses fused multiply-add (_mm512_fmadd_ps), which combines multiplication and accumulation in a single instruction. For binary-quantized vectors, the AVX-512 VPOPCNTDQ extension speeds up bit-counting operations used in distance calculation.

Manticore now ships three library variants: a baseline build, an AVX2 build, and an AVX-512 build. At startup, Manticore detects the CPU's capabilities and loads the appropriate library automatically. No configuration is needed.

Benchmark results

The following benchmarks were run on the dbpedia-openai-1M-1536-angular dataset (1M vectors, 1536 dimensions, cosine distance) on an AMD Ryzen 7 9700X (Zen 5, 8 physical cores / 16 logical cores). All data uses 1-bit binary quantization with oversampling and rescoring disabled. For multithreaded runs, throughput is reported as average per-thread queries per second: each worker runs its own batch of queries, its QPS is measured independently, and the final number is the average across workers. Each result is the average of 6 independent runs. Early termination was also disabled to isolate the effect of these optimizations on raw HNSW traversal.

Zen 5 was chosen because it supports AVX-512 with native 512-bit datapaths, avoiding the split-512 execution behavior and heavy AVX-512 downclocking associated with some older Intel processors. This helps isolate the algorithmic effects of these changes from CPU-specific AVX-512 throttling behavior.

Algorithmic improvements alone

The first chart isolates the effect of the algorithmic changes (2-pass processing, batched distances, compile-time dispatch) by comparing the new AVX2 build against the previous AVX2 build. Both builds use the same SIMD instruction set, so the difference is purely from the new code structure.

On a single thread, the gain grows steadily from +3% at k=10 to +24% at k=1000 as distance computation comes to dominate the search workload. With more threads competing for memory bandwidth, the per-thread gain shrinks: +9-10% at 4 or 8 threads, and only +2-5% at 16 threads.

The 16-thread case is SMT (each physical core runs two threads). Distance computation is memory-bound, so when two threads share a core's L1/L2 caches, the prefetching and batching wins are partially absorbed by shared-resource contention. The algorithmic improvements still help, but the headroom shrinks.

SIMD width benefit (AVX-512 vs AVX2)

The second chart isolates the effect of AVX-512 by comparing the AVX-512 build against the new AVX2 build (both share the same algorithmic improvements).

AVX-512 is slightly slower than AVX2 at k=10 (around -2%) regardless of thread count. This is specific to AVX-512: the algorithmic improvements alone don't show this regression, so it's not a uniform per-query overhead. From k=30 upward, AVX-512 pulls ahead at every thread count.

The interesting pattern is that AVX-512's benefit grows with thread count. Although this benchmark disables oversampling, the default Manticore KNN query uses LIMIT 20, and with the default oversampling=3.0 (which multiplies the effective HNSW search budget for rescoring after quantized search) that becomes k=60 internally. At k=60, AVX-512 vs AVX2 (new) is +1.2% on a single thread, +2.6% at 4 threads, +3.4% at 8 threads, and +6.5% at 16 threads.

Combined improvement (AVX-512 vs the old code)

The third chart shows the cumulative effect: AVX-512 with all the new code, compared against the previous AVX2 build. This is what a user upgrading from the previous Manticore version to the new one would see if their CPU supports AVX-512.

The single-thread curve climbs from +0.5% at k=10 to +29% at k=1000. The multi-thread curves all reach +22-24% at k=1000. The improvement is broadly distributed across thread counts - the algorithmic and SIMD gains compose differently at different concurrency levels, but the combined result is consistently large at moderate-to-high k.

Why the gain grows with k

All three charts show the same shape: small improvement at low k, large at high k. The reason is that low-k queries spend a larger share of their time on graph traversal (visiting nodes, checking visited bits, popping the candidate set) - work that scales with the graph structure, not k. As k grows, the effective search budget grows proportionally, and the queries spend more time on distance computation. The optimizations target distance computation and the loops around it, so their benefit scales with the share of work that distance computation represents.
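This is the familiar Amdahl's-law shape. A small sketch makes it concrete. The fractions and the 1.4× factor in the note below are hypothetical inputs chosen for illustration, not measurements from these benchmarks.

```rust
/// Overall query speedup when a fraction of query time (distance computation)
/// is made `distance_speedup` times faster and the rest is unchanged.
fn overall_speedup(distance_fraction: f64, distance_speedup: f64) -> f64 {
    1.0 / ((1.0 - distance_fraction) + distance_fraction / distance_speedup)
}
```

If distance computation were 20% of query time and became 1.4× faster, the query as a whole would speed up by about 6%. At 90% of query time the same 1.4× would yield about 35%.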

What this means for you

These improvements require no action. They are available in the recent Manticore Search 27.1.5 release; there are no API changes, no new configuration options, and no need to rebuild indexes.

The gains stack with the KNN early termination: early termination reduces the number of distance computations per query, and these optimizations make each computation faster.

The biggest improvements show up with:

High-dimensional vectors (more arithmetic per distance computation, more SIMD benefit)
Large k values (more total distance computations, more opportunity for batching and cache optimization)
Queries with oversampling (oversampling multiplies the effective k, pushing queries into the range where gains are largest)

Manticore Search 27.1.5: Authentication, sharded tables, conversational search and faster vector search

Sergey Nikolaev — Mon, 22 Jun 2026 02:19:45 +0000

Manticore Search 27.1.5 has been released. This release brings built-in authentication and authorization, sharded tables, conversational search, faster HNSW builds, better faceting and aggregations, and a long list of fixes across KNN, replication, protocol compatibility and other areas.

This post is a catch-up for everything shipped from 25.0.1 through 27.1.5.

Upgrade Notes

Please review these before upgrading:

27.0.0 adds built-in auth/authz, and enabling it changes access assumptions. Auth is not enabled by default, but once you enable it, anonymous access no longer works. Roll it out in stages: upgrade remote agents and replication peers first, then upgrade the masters that query or manage them, and enable auth only after the whole topology is on the new version. Distributed remote-agent and replication-related operations also need matching stored auth data across the participating daemons. A successful JOIN CLUSTER replaces the joining node's local auth data with the donor cluster's auth data. (Issue #2833, PR #3648)
26.0.0 changed replication storage layout. Incoming replicated tables now live under the normal data_dir/<table> layout instead of the cluster path. If you run replication clusters with a custom path, you may need to move or re-synchronize replicated tables after upgrade. Downgrade is only safe before the new layout is adopted. (Issue #4431, PR #4598)
If you manage MCL separately from the daemon, upgrade it together with Manticore. This release line moves through several MCL updates, from vector-performance work to multithreaded HNSW builds and later stability fixes. Mixing an older library with a newer daemon is not recommended. (25.2.0, 25.15.0, 26.0.3, 26.3.2, 27.1.0)

Highlights

Built-in authentication and authorization

Manticore now supports users, passwords, bearer tokens, and fine-grained permissions across MySQL, HTTP/HTTPS, distributed remote agents, and replication-related operations. This makes access control a first-class part of the product instead of something that always has to be handled outside the database.

Sharded tables

Manticore can now create and manage sharded tables, distribute inserts across shards, and handle more of the surrounding lifecycle in one place. That makes larger write-heavy deployments easier to operate and reduces the amount of sharding-specific logic that has to live outside the engine.

Conversational search

This release adds conversational search to Manticore Search. It is exposed through CREATE CHAT MODEL and CALL CHAT, so you can ask questions over an existing vectorized table instead of building a separate retrieval layer around the same data.

Under the hood, Manticore Search runs KNN on a FLOAT_VECTOR field, builds LLM context from that field's from='...' source columns, keeps conversation history by conversation_uuid, and returns both the answer and the supporting sources. If you already keep embeddings in Manticore, this makes document Q&A and support-style assistants much easier to wire up.

Faster vector builds and KNN improvements

Vector search kept improving throughout this cycle.

Manticore improved KNN performance, added local ONNX embeddings support, sped up ONNX inference, and then made HNSW build and rebuild work much faster with multithreaded index construction.

A few important steps in that work:

25.1.0 improved KNN distance calculation and AVX-512 loading.
25.2.0 added local ONNX embeddings support in MCL and improved vector-search performance further.
25.14.0 and 25.15.0 added multithreaded HNSW builds together with the required library support.

The biggest practical improvement here is much faster auto-embedding and shorter build and rebuild times for large vector tables. Initial KNN builds, chunk merges, and ALTER TABLE ... REBUILD KNN are all affected.

Better faceting and aggregations

Faceting and aggregations also became more useful.

facet_filter_mode makes it easier to build e-commerce-style filters that preserve selected, available, and unavailable buckets under active filtering.

On the analytics side:

date_histogram() gained time_zone and offset.
OpenSearch Dashboards support was added.
Manticore added statistical aggregations such as percentiles, percentile_ranks, and mad.

Other Notable Improvements

This release line also includes several smaller but useful additions:

searchd --check validates configuration before startup without side effects.
EXIT CLUSTER lets a node leave a replication cluster online without restarting.
dict=keywords_32k makes it possible to index very long machine-generated tokens such as hashes and message IDs without silent truncation.
The built-in Ukrainian lemmatizer expands native morphology support for Ukrainian text search.
Systemd Type=notify improves startup and shutdown supervision.
The searchd process under systemd management now logs to the systemd journal.
JOIN queries now support explicit left-table column prefixes.
OpenSearch Dashboards support.
manticore-load gained multi-query support.

Bug Fixes

This release line also includes 65 changelog-listed fixes. The latest follow-up releases added a few more worth calling out:

27.1.5 fixed a crash when fetching columnar float_vector attributes.
27.1.4 fixed ALTER TABLE ... RECONFIGURE and SHOW CREATE TABLE for one-way upgrades from dict='keywords' to dict=keywords_32k.
27.1.3 updated Buddy to 4.0.1 and tightened Queue-plugin mutation permission handling under auth.
KNN-by-doc_id queries now preserve offset and max_matches correctly.
KNN rescoring order was fixed, so explicit ORDER BY tie-breakers work again.
Hybrid fused queries with GROUP BY on columnar tables stopped crashing.
Replication and node-rejoin crash paths were cleaned up further.
Binary MySQL protocol behavior was fixed in 25.12.1, which matters for integrations that expect real client compatibility.
Fluent Bit bulk-ingest interoperability was fixed, preventing successful responses from being replayed as duplicate inserts.
27.1.2 fixed sql_attr_multi handling for plain indexes built from multiple source blocks.

For the complete list, see the changelog.

Need help or want to connect?

Join our Slack
Visit the Forum
Report issues or suggest features on GitHub
Email us at contact@manticoresearch.com

The Evolution of 'More Like This'

Sergey Nikolaev — Tue, 02 Jun 2026 04:01:20 +0000

In many search scenarios, the user does not start from an empty query box, but from an existing result.

A user opens an article and wants to find related material. A buyer views a product card and looks for close alternatives. A support engineer investigates an incident and wants to see earlier cases with the same symptoms. In all these situations, the user already has a relevant document to start from.

This scenario is traditionally called More Like This (MLT): a function for finding documents similar to the selected one. In this article, MLT means search that starts from a known document, not from a newly typed query.

The classic MLT approach, or similar-document search, was based on comparing textual matches. Modern implementations increasingly use embeddings: numerical representations of documents. A search index stores embeddings as vectors, and the search system can find documents with close vector representations.

Short glossary

To avoid repeating definitions throughout the article, here are the main terms:

Term	Meaning in this article
More Like This (MLT)	search for documents similar to an already selected document
embedding	a numerical representation of text, a product, an image, or another object
embedding vector	a numerical representation of an object, such as text or a product, stored in the index to find similar objects by vector proximity
KNN, nearest-neighbor search	search for nearest neighbors, meaning objects with close vectors
ANN, approximate nearest neighbors	approximate nearest-neighbor search; it speeds up KNN on large datasets without scanning every vector
RAG, Retrieval-Augmented Generation	an approach where the search system retrieves context for a generative model
hybrid search	combining full-text search and vector search in one scenario
reranking	an additional sorting step for already retrieved candidates using a more precise model or rule

What classic More Like This did

Classic MLT was lexical. It answered a simple question: which documents use similar important words?

The process usually looked like this:

The search system took the source document.
It analyzed its text.
It selected informative terms.
It built a query from those terms.
It searched for documents with a similar set of words.
It returned a list of similar documents.

Internally, this used familiar full-text search mechanisms: TF-IDF or BM25, term frequency, stopwords, field boosts, and document-frequency limits. That is why older MLT implementations exposed parameters such as min_term_freq, min_doc_freq, max_doc_freq, and max_query_terms.
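
The term-selection step can be sketched in a few lines of Python. This is a minimal illustration, not any engine's actual implementation: the function name select_mlt_terms, the smoothed IDF formula, and the tiny corpus are assumptions made for the example, while the parameter names mirror the ones listed above.

```python
import math
from collections import Counter

def select_mlt_terms(source_tokens, corpus, min_term_freq=2, min_doc_freq=1,
                     max_doc_freq=None, max_query_terms=5, stopwords=frozenset()):
    """Pick the most informative terms of a source document for an MLT query."""
    n_docs = len(corpus)
    # Document frequency: in how many corpus documents each term occurs.
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))
    # Term frequency inside the source document, ignoring stopwords.
    term_freq = Counter(t for t in source_tokens if t not in stopwords)
    scored = []
    for term, tf in term_freq.items():
        df = doc_freq.get(term, 0)
        if tf < min_term_freq or df < min_doc_freq:
            continue  # too rare to be informative
        if max_doc_freq is not None and df > max_doc_freq:
            continue  # too common to discriminate between documents
        idf = math.log((n_docs + 1) / (df + 1)) + 1.0  # smoothed IDF (assumed formula)
        scored.append((tf * idf, term))
    # Highest score first; ties are broken alphabetically for a stable result.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_query_terms]]

# Hypothetical three-document corpus, already tokenized.
corpus = [
    "memory leak in worker process memory grows".split(),
    "worker process restart after deploy".split(),
    "disk full on worker node".split(),
]
source = "memory leak memory leak in worker worker process".split()
terms = select_mlt_terms(source, corpus, min_term_freq=2, max_doc_freq=2)
print(terms)
```

Here worker is dropped because it occurs in every document, and the words that appear only once in the source are dropped by min_term_freq, so the generated query keeps only the terms that characterize this particular document.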

This was not just an interface element, but a full search mechanism. MLT was used for related articles and products, duplicate detection, support-ticket matching, legal search, patent research, and internal knowledge bases.

Where the lexical approach is still strong

Lexical MLT works well when specific words, identifiers, and stable formulations matter.

Examples:

error codes;
product SKUs;
part numbers;
function names;
stack traces;
legal wording;
nearly identical product or ticket descriptions.

The reason is that exact matching is critical here. If two incident reports contain the same error code or the same stack trace, full-text search sees a direct match. For example, when searching tickets with the code ERR_404, lexical MLT quickly finds every mention of that code, while vector search may return tickets that describe similar but not identical problems.

Lexical MLT had another advantage: it was cheap to run. The inverted index is already in the search engine. The analyzers are already configured. Ranking already works. There is no need to deploy separate search infrastructure just to support a “find similar” feature.

The limitation is also clear. If two documents describe the same thing in different words, lexical MLT may fail to connect them. Synonyms work unevenly. Paraphrases are harder. Cross-lingual similarity is usually unavailable. For example, memory leak and unbounded heap growth may describe the same problem, but a standard analyzer sees different tokens.

Lexical MLT efficiently finds documents with matching or similar wording. Semantic search helps when the meaning matches, not the words.

What embeddings change

Using embeddings — numerical representations of documents — changes the comparison principle: instead of words, the system compares vector representations.

A document no longer has to be represented only as a set of weighted terms. It can be stored as a dense vector. Nearby vectors usually correspond to documents that are similar in meaning, even if they are written in different words.

The lexical approach looks for matches by words and terms, while embedding search looks at the proximity of document vector representations. The first approach is optimal for exact matches such as error codes and SKUs. The second finds semantically close documents, even when they are phrased differently.

This expands the scope of this kind of search. You can compare not only articles, but also products, images, code fragments, user events, or context fragments in a RAG system. In RAG, the search system first retrieves relevant context, and then the generative model uses that context to produce an answer.

Lexical search does not disappear. Exact error codes, SKUs, names, statute references, and near duplicates are still better handled lexically. That is why production systems often use hybrid search: full-text search provides exact matches, vector search adds results by meaning, filters constrain the search space, and reranking refines the final order.

As shown in our comparison of lexical and vector search, the former wins on precise strict matches, while the latter improves coverage of semantic relationships.

MLT as lookup by a vector from the index

If a vector representation has already been computed for a document and stored in the index, modern MLT reduces to four steps:

Take the source document.
Retrieve its precomputed vector representation from the index.
Find the nearest vectors.
Return the documents those vectors belong to.

This is still More Like This: the user starts from one document and gets related results. Only the comparison method changes. Instead of extracting terms, the search system uses the vector representation of the source document.
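
The four steps above can be sketched as follows. This is a toy model under stated assumptions: the index is a plain dictionary, the scan is exhaustive rather than ANN-based, cosine distance is used, and the source document is excluded from its own results. None of this describes Manticore's internals.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; smaller means closer."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def more_like_this(index, source_id, k):
    """Return the ids of the k documents nearest to the stored vector of source_id."""
    source_vec = index[source_id]          # read the precomputed vector
    neighbors = [
        (cosine_distance(source_vec, vec), doc_id)
        for doc_id, vec in index.items()
        if doc_id != source_id             # assumption: skip the source itself
    ]
    neighbors.sort()                       # nearest first
    return [doc_id for _, doc_id in neighbors[:k]]

# Hypothetical 3-dimensional vectors keyed by document id.
index = {
    1: [1.0, 0.0, 0.0],
    2: [0.9, 0.1, 0.0],
    3: [0.0, 1.0, 0.0],
    4: [0.8, 0.0, 0.6],
}
similar = more_like_this(index, source_id=1, k=2)
print(similar)
```

The caller passes only a document ID and k; the vector itself never leaves the function, which is the property the rest of this section relies on.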

In Manticore Search, this operation can be performed directly at the search-engine level: the query specifies the ID of the source document, and Manticore takes its embedding vector from the index and runs KNN search. The application does not need to fetch the vector separately, serialize hundreds or thousands of numbers, and send them back in a second request.

A minimal SQL example looks like this:

SELECT id, title, knn_dist()
FROM products
WHERE knn(embedding, 10, 123)
LIMIT 10;

Here, embedding is the field with the precomputed embedding vector, 123 is the ID of the source document, and 10 is the number of nearest documents to return. The knn_dist() function returns the distance between vectors: a smaller value means greater semantic proximity to the source document. The same operation can be performed through the HTTP JSON API; the search logic does not change. The application passes the document ID, and Manticore performs lookup using that document’s vector from the index.

For large datasets, KNN is usually implemented through an ANN index. This speeds up search through approximate computation and avoids scanning every vector. For the user, the important part is not the internal structure of the index, but the result: quickly finding documents that are close to the source in meaning.

Why search is better handled in the engine

You can implement this scenario in the application: first fetch the document, then extract its vector, then send a separate KNN query, and then combine the result with filters.

That approach makes the system architecture more complex. The application has to:

pass the vector between services;
prevent accidental logging;
check the embedding model version;
keep data synchronized with the main index;
apply the same filters used in normal search.

When the search system performs the lookup itself, the path is shorter:

The application passes the ID of the source document.
The search system finds the precomputed vector representation in the index.
The search system runs nearest-neighbor search (KNN) or its approximate variant (ANN).
The search system returns the found documents with the same access filters and metadata.

Benefits of this approach:

fewer inter-service requests from the application;
large vectors do not have to be sent through external APIs;
filters stay close to search;
the result is easier to reproduce and debug;
the application does not need an additional layer for similarity calculation — everything runs inside the search engine.

This will not fix poor embeddings or remove the need to tune ranking. But it reduces the number of interacting components in the search chain, which makes the system easier to maintain.

Practical examples and the evolution of MLT

Search from an existing object is especially useful when the user has already found a relevant starting point.

Scenario	Source object	What to find
Support	ticket with an error	past tickets with similar symptoms and related fixes
Catalog	product card	close alternatives, similar models, or products from the same category
RAG	relevant fragment already found by the first search	context expansion: neighboring sections of the same document, related documentation fragments, or similar discussions
Developer tools	stack trace, diff, or bug description	related code changes, discussions, and past incidents

In these examples, there is no need to type a new query manually. The system uses the source object as a reference point and finds documents similar to it lexically, semantically, or by both criteria.

In the context of RAG, this is not about the primary search by the user’s query, but about subsequent context selection: the system has already found a relevant fragment and uses it as the reference object to collect surrounding context. This is useful when one fragment is too narrow: nearby content may include a term definition, a configuration example, a related discussion, or a neighboring section of the same guide.

In systems with personalization or AI agents, it is important to clearly define which data is used for search: the system may consider the user’s search-query history, the context of previous interactions, or saved working notes. This makes it clear which data participates in retrieval and why the result is considered similar.

The evolution of MLT can be described like this:

Period	What changed
2000s	MLT mostly relied on lexical analysis, TF-IDF, BM25, and term overlap.
2010s	Word2Vec and GloVe appeared and became widely used, making it possible to build semantic embeddings of words and texts.
Late 2010s to early 2020s	FAISS and similar ANN libraries made it possible to run vector search efficiently even on very large datasets.
Mid-2020s	RAG, recommendations, and search from an existing object made lookup by stored vectors a common product scenario.

The evolution of MLT is a shift from lexical comparison to matching document vector representations. But the practical request stayed the same: find documents relevant to the source result.

What to keep in mind

Semantic MLT does not replace all search engineering.

Production systems still need:

exact search for identifiers, error codes, and other strict matches;
embedding model metadata and versioning;
ACL filters: rules for document access by roles or users;
tenant filters: data isolation between customers or workspaces;
hybrid search when both meaning and exact matches matter;
reranking when result order is critical;
search-quality monitoring: precision and recall metrics, false-positive frequency, and missed relevant documents caused by ANN-index approximation errors.

Lexical MLT can miss documents that use different words. Vector search sometimes returns overly broad results, or false positives, and can miss relevant documents because of the approximate nature of ANN indexes. That is why the quality of this kind of search should be evaluated on real queries and real data.

Conclusion

More Like This has moved from purely lexical search to hybrid solutions that combine lexical, vector, and filtering mechanisms.

The core concept remains the same: the user selects a source document, and the system finds materials relevant to it, taking both lexical and semantic similarity into account.

KNN early termination in Manticore Search

Sergey Nikolaev — Mon, 01 Jun 2026 09:51:55 +0000

Modern search engines do more than match keywords. When you search for "cozy mystery set in Paris" and get results for "atmospheric detective novel in France", that's vector search at work: documents and queries are converted into lists of numbers, called embeddings, and the search engine finds the documents whose numbers are closest to the query's.

Manticore Search supports this natively. Under the hood, it uses a data structure called HNSW: a graph that connects nearby vectors, so it can find nearest neighbors quickly without scanning every document. That makes vector search fast enough to run on millions of documents in milliseconds.

But HNSW has an inefficiency. Early in the traversal, almost every distance computation finds a better candidate than the ones already in the result set.

As the search goes on, those improvements become rarer, but the algorithm keeps traversing the graph until it exhausts its exploration budget. By that point, the result set has often already converged, and the remaining work does little or nothing to improve it. Early termination fixes this by detecting that point and stopping early.

The effect becomes more noticeable as k grows, where k is the number of nearest neighbors the query asks Manticore to return. Returning more neighbors requires more graph exploration, and much of that extra work happens after the result set has already stabilized. That also makes early termination more valuable, because it has more unnecessary work to cut.

This gets more pronounced with vector quantization. Quantization compresses stored vectors to save memory, which slightly lowers search precision. To recover it, Manticore uses oversampling: it fetches 3x as many candidates as requested, then rescores them using the original full-precision vectors. With the default 3x oversampling, HNSW explores many more candidates per query. Large k values often come from this kind of candidate expansion: an application may ask the vector index for hundreds or thousands of candidates, then rescore, rerank, or filter them down to a much smaller final result set to improve recall and precision. That raises latency, and early termination helps win some of it back.

The waste is measurable. Benchmarks on a 1M-vector dataset show that with k=60 (the default result limit of 20 multiplied by the default 3x oversampling), early termination reduces distance computations to about 65% of the full search. At k=1000, computations drop to about 30%, and at k=10000 to about 20%. The search converges long before the exploration budget runs out, and the savings grow with k.

Early termination lets Manticore detect this convergence and stop. The algorithm was designed with a specific precision target: lose no more than 2-4% of result set precision compared to a full HNSW search.

How it works

The algorithm tracks a simple signal: discovery rate - the fraction of distance computations that actually improve the result set.

Each time a new node's distance is computed, one of two things happens: either it's good enough to enter the heap - the priority queue that holds the current best candidate neighbors - or it's worse than everything already there and gets discarded. Entering the heap counts as a "discovery." Early in the search, discoveries are frequent - the heap is filling up and most candidates are useful. As the search progresses and the heap saturates with good results, discoveries become rare. Most new distance computations just confirm that the algorithm has already found the best candidates.

Manticore monitors this transition. After each round of neighbor expansion, it computes the discovery rate:

discovery_rate = new_candidates_collected / distances_computed

If this rate stays below a threshold for several rounds in a row, the search stops.

The idea is simple: if the algorithm keeps computing distances but nothing improves the result, the search has converged.
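
A small Python sketch makes the signal concrete. It models the result heap as a bounded max-heap of distances and counts how many distance computations in one round actually enter it. The function name expand_round and the sample distances are invented for illustration; this is not Manticore's code.

```python
import heapq

def expand_round(heap, capacity, distances):
    """Process one round of distance computations and return its discovery rate.

    heap stores negated distances, so -heap[0] is the worst candidate kept so far.
    """
    discoveries = 0
    for dist in distances:
        if len(heap) < capacity:
            heapq.heappush(heap, -dist)       # heap still filling: always a discovery
            discoveries += 1
        elif dist < -heap[0]:
            heapq.heapreplace(heap, -dist)    # better than the current worst: discovery
            discoveries += 1
        # otherwise the candidate is discarded
    return discoveries / len(distances) if distances else 0.0

heap = []
early = expand_round(heap, capacity=4, distances=[0.9, 0.8, 0.7, 0.6])
late = expand_round(heap, capacity=4, distances=[0.95, 0.85, 0.65, 0.99])
print(early, late)
```

In the first round every candidate enters the heap, so the rate is 1.0. In the second round only two of four computations improve the result set, so the rate falls to 0.5.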

The threshold: quantile-based adaptation

That raises the obvious next question: what threshold should count as "low"? A fixed threshold wouldn't work well - different datasets and different regions of the same dataset have wildly different discovery rate distributions. What counts as "low" depends on context.

Manticore uses a quantile-based adaptive threshold. Instead of comparing the discovery rate against a fixed number, it continuously estimates a low percentile of recent rounds (20th percentile, or 14th percentile for L2 distance) and uses that as the baseline. This keeps the method lightweight while letting it adapt to different datasets and different regions of the graph.

In other words, the threshold adapts to the local search pattern. If the algorithm enters a sparse region of the graph, the threshold drops and avoids stopping too early. If it enters a richer region, the threshold rises.
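
The adaptation can be illustrated with a sliding-window percentile. The source does not describe how Manticore estimates the quantile, so the window size, the nearest-rank estimator, and the class name QuantileThreshold are all assumptions of this sketch; only the 20th and 14th percentile targets come from the text.

```python
import math
from collections import deque

class QuantileThreshold:
    """Tracks recent discovery rates and exposes a low percentile as the threshold."""

    def __init__(self, percentile=20, window=16):
        self.percentile = percentile          # 20 per the text; 14 for L2 distance
        self.recent = deque(maxlen=window)    # window size is an assumption

    def update(self, discovery_rate):
        self.recent.append(discovery_rate)

    def value(self):
        """Nearest-rank percentile of the rates currently in the window."""
        if not self.recent:
            return 0.0
        ordered = sorted(self.recent)
        rank = max(1, math.ceil(self.percentile * len(ordered) / 100))
        return ordered[rank - 1]

# Invented discovery rates for a rich and a sparse region of the graph.
rich = QuantileThreshold()
for rate in [0.50, 0.45, 0.40, 0.42, 0.38, 0.41, 0.44, 0.39, 0.43, 0.40]:
    rich.update(rate)

sparse = QuantileThreshold()
for rate in [0.10, 0.08, 0.06, 0.09, 0.05, 0.07, 0.08, 0.06, 0.09, 0.07]:
    sparse.update(rate)

print(rich.value(), sparse.value())
```

A rate of 0.10 would look low in the rich region and healthy in the sparse one, which is why a single fixed cut-off does not transfer between them.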

Patience: how many bad rounds before stopping

The threshold alone is not enough, though. A single round with a low discovery rate isn't enough to declare convergence. It could just be a temporary dip before the search finds a better path. Manticore uses a "patience counter" that requires multiple consecutive bad rounds before terminating.

The patience value decreases as ef grows, where ef is the HNSW exploration factor that controls how many candidates the search keeps exploring. For example, patience ranges from 9 at low ef values down to 6 at very high ef. Larger ef values mean more total rounds, so even with lower patience the algorithm has seen more evidence before deciding to stop. The counter resets to zero whenever a round has a healthy discovery rate, so a single good round restarts the patience window. This prevents the algorithm from stopping during a temporary plateau that leads to a productive region of the graph.
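
The counter logic can be sketched like this. For simplicity the sketch uses a fixed threshold instead of the adaptive one described earlier, and the function name and sample rates are invented for illustration.

```python
def rounds_until_stop(discovery_rates, threshold, patience):
    """Return the 1-based round at which the search would stop, or None."""
    bad_streak = 0
    for round_no, rate in enumerate(discovery_rates, start=1):
        if rate < threshold:
            bad_streak += 1
            if bad_streak >= patience:
                return round_no     # enough consecutive bad rounds: terminate
        else:
            bad_streak = 0          # a healthy round restarts the patience window
    return None                     # the search ran to the end of its budget

# Rounds 2-3 are a temporary dip, round 4 recovers, rounds 5-8 stay low.
rates = [0.40, 0.05, 0.04, 0.30, 0.05, 0.04, 0.03, 0.02]
stop_at = rounds_until_stop(rates, threshold=0.10, patience=3)
print(stop_at)
```

With patience 3, the dip in rounds 2 and 3 is forgiven because round 4 resets the counter, and the search stops only at round 7. With patience 5, the same sequence never triggers termination.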

Warm-up phase

The algorithm ignores the termination signal while the heap is still filling up, meaning fewer than ef candidates have been collected. During this phase, discovery rates are artificially high because almost everything enters the heap, so the signal is not useful. Early termination only starts once the heap is full and new candidates must replace existing ones.

Benchmark results

The quantile thresholds were tuned to keep precision loss within 2–4%. They were tuned separately for L2 and cosine/IP distance metrics, and validated across both quantized and non-quantized data.

The following benchmarks were run on the dbpedia-entities dataset (1M vectors, 768 dimensions) on a machine with 8 physical cores / 16 logical cores.

"Precision" here means the fraction of true k-nearest neighbors that appear in the result set (with fixed k, this is the same as recall@k).
"Precision ratio" is the precision of HNSW with early termination ("ET") divided by precision without it (1.0 means no precision loss).
"Visit ratio" is the fraction of distance computations performed compared to full HNSW search (lower is better).
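
These three definitions translate directly into code. The sketch below uses invented result sets and counts purely to show the arithmetic; they are not benchmark data.

```python
def precision_at_k(result_ids, true_ids):
    """Fraction of the true k nearest neighbors that appear in the result set."""
    truth = set(true_ids)
    return len(truth.intersection(result_ids)) / len(truth)

def precision_ratio(et_result, full_result, true_ids):
    """Precision with early termination divided by precision without it."""
    return precision_at_k(et_result, true_ids) / precision_at_k(full_result, true_ids)

def visit_ratio(et_distance_computations, full_distance_computations):
    """Share of the full search's distance computations that ET still performs."""
    return et_distance_computations / full_distance_computations

# Invented example with k=10.
true_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
full_result = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11]   # 9 of 10 true neighbors
et_result = [1, 2, 3, 4, 5, 6, 7, 8, 12, 11]    # 8 of 10 true neighbors

print(precision_ratio(et_result, full_result, true_ids))
print(visit_ratio(2900, 10000))
```

Note that the precision ratio compares early termination against full HNSW search, not against exact brute-force search, so a ratio of 1.0 means no additional loss on top of what HNSW already loses.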

Oversampling and rescoring were disabled to isolate the effect of early termination on raw HNSW traversal.

The green line on the chart (precision) stays almost flat across all k values, with precision ratio remaining above 0.97 throughout the benchmark. Meanwhile, the orange line (visit ratio) drops steeply. At k=100, early termination cuts distance computations nearly in half. At k=1000, it saves about 70%, and at k=10000 about 80%.

At k <= 10, early termination is disabled because the search is already cheap and the savings are too small to justify any precision loss. The savings grow with k, because larger result sets lead to more rounds of neighbor expansion and more chances to detect convergence early.

Performance under concurrent load

The benchmarks above show that early termination cuts distance computations a lot while preserving precision. But what does that mean for actual query latency, especially under concurrent load? The chart below shows latency ratios (ET / no ET) at 1, 8, and 16 concurrent threads on the same dbpedia dataset:

At k=1000, early termination reduces distance computations by 71% (ratio 0.29). The latency improvement depends on how many threads are running at the same time:

1 thread: 24% faster (ratio 0.76)
8 threads: 45% faster (ratio 0.55)
16 threads: 48% faster (ratio 0.52)

The distance computation savings stay the same regardless of thread count, but the latency benefit doubles from 1 to 16 threads (24% to 48%).
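
The percentages follow from the ratios by simple arithmetic: a latency ratio r means the query takes r times as long, so the reduction is (1 - r) x 100. A short check using the ratios quoted above (the helper name is invented):

```python
def percent_faster(latency_ratio):
    """Convert an ET / no-ET latency ratio into a rounded percentage reduction."""
    return round((1.0 - latency_ratio) * 100)

# threads -> latency ratio, as quoted in the list above
ratios = {1: 0.76, 8: 0.55, 16: 0.52}
reductions = {threads: percent_faster(r) for threads, r in ratios.items()}
print(reductions)
```

The same conversion applies to the visit ratio: 0.29 corresponds to 71% fewer distance computations.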

The main reason is lower pressure on the CPU memory system. Each distance computation pulls vector data and graph links into cache. When several threads run HNSW traversal at the same time, they compete for shared cache and memory bandwidth. Doing fewer distance computations per query reduces memory traffic, keeps each thread’s working set smaller, and lowers cache churn between queries. As a result, each thread finishes faster and interferes less with the others.

Single-thread benchmarks understate the benefit of early termination. Under production-like concurrent load, the percentage latency reduction is roughly twice as large.

When early termination kicks in (and when it doesn't)

Early termination is enabled by default and works on both quantized and non-quantized vector data. It is automatically disabled when k <= 10.

The benefit grows with the effective exploration budget, which is max(ef, k). Since hnswlib uses this internally as the number of candidates it keeps in play, larger k means more candidates, more rounds, and more chances to detect convergence.

Quantized vectors are typically used with rescoring and oversampling (both enabled by default) to recover precision lost from quantization. Oversampling (default 3x) multiplies the effective k passed to HNSW. For example, a query with k=100 uses 300 candidates internally when oversampling is 3x. That larger search budget gives early termination more room to detect convergence and stop early. Since the performance benefit of early termination grows with k, oversampling pushes queries into the range where the savings are largest.
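
Putting the two statements together, the budget for a single query can be estimated as below. This is a reading of the description above rather than Manticore's actual code: the function name, the way oversampling and max(ef, k) are combined, and the rounding are assumptions.

```python
import math

def effective_budget(k, ef, oversampling=3.0):
    """Estimate how many candidates HNSW keeps in play for one query."""
    effective_k = math.ceil(k * oversampling)  # rounding up is an assumption
    return max(ef, effective_k)

print(effective_budget(k=100, ef=200))                    # oversampled k dominates
print(effective_budget(k=20, ef=200))                     # ef dominates
print(effective_budget(k=100, ef=64, oversampling=1.0))   # no oversampling
```

With k=100 and 3x oversampling, the budget is 300 even though ef is 200, which matches the example in the paragraph above.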

Syntax

Early termination is on by default. To disable it, set early_termination to 0 in the SQL KNN options or to false in the JSON knn object:

SQL:

-- default: early termination enabled
SELECT id, knn_dist()
FROM products
WHERE knn(embedding, (0.12, 0.45, 0.78, 0.33));

-- explicitly disable early termination
SELECT id, knn_dist()
FROM products
WHERE knn(embedding, (0.12, 0.45, 0.78, 0.33), { early_termination=0 });

-- combine with other KNN options
SELECT id, knn_dist()
FROM products
WHERE knn(embedding, (0.12, 0.45, 0.78, 0.33), { ef=200, early_termination=0 });

JSON:

POST /search
{
    "table": "products",
    "knn": {
        "field": "embedding",
        "query": [0.12, 0.45, 0.78, 0.33],
        "early_termination": false
    }
}
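
From application code, the same request body can be built and sent with the Python standard library. This is a sketch under assumptions: the helper names are invented, and the base URL assumes a local instance listening on port 9308. Only the fields shown in the JSON example above are used.

```python
import json
import urllib.request

def build_knn_search(table, field, query_vector, early_termination=True):
    """Build the /search body; early_termination is only sent when disabling it."""
    knn = {"field": field, "query": list(query_vector)}
    if not early_termination:
        knn["early_termination"] = False
    return {"table": table, "knn": knn}

def send_search(body, base_url="http://localhost:9308"):
    """POST the body to /search and return the decoded JSON response."""
    request = urllib.request.Request(
        base_url + "/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

body = build_knn_search("products", "embedding", [0.12, 0.45, 0.78, 0.33],
                        early_termination=False)
print(json.dumps(body, indent=4))
# send_search(body) would submit it to a running instance.
```

Leaving the option out of the body when it is enabled keeps the request identical to a default query, so the same helper serves both cases.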

When to disable it

There are a few scenarios where you might want to turn early termination off:

Maximum precision is critical. Early termination trades a small amount of recall for speed. If your application requires the absolute best recall that HNSW can provide at a given ef, disable it.
Small k values (<= 30). The algorithm auto-disables for k <= 10, but even for k between 11 and 30, the performance benefit is modest. If you notice any recall difference in this range, disabling early termination costs little in latency.
Benchmarking HNSW recall. If you are measuring HNSW recall, you probably want deterministic behavior without adaptive shortcuts. Disable early termination to get a clean baseline.

How it relates to other KNN optimizations

Early termination is one of several optimizations that Manticore applies to KNN search. It works independently of and stacks with the others:

Prefiltering reduces wasted work by skipping filtered-out documents during HNSW traversal. Early termination reduces wasted work by stopping the traversal once the result set has converged. They solve different problems and work well together.
Oversampling retrieves more candidates than k to improve recall after rescoring. Early termination can reduce the cost of that expanded search by stopping once enough good candidates have been found.
Rescoring recalculates distances using full-precision vectors after the initial search with quantized vectors. Early termination operates during the initial quantized search phase, reducing the number of candidates evaluated before rescoring kicks in.
Automatic brute-force fallback skips HNSW entirely when a linear scan is cheaper. Early termination only applies when HNSW is actually used.

How to Make xt850 Match xt 850

Sergey Nikolaev — Fri, 08 May 2026 05:30:14 +0000

TL;DR

Since version 23.0.0, Manticore can make searches like xt850 match xt 850 using bigram_delimiter together with digit-aware bigram_index modes.

This solves a common tokenization mismatch in product search, where users remove spaces from model names but the source data stores them as separate tokens.

Assumptions and verification

This article assumes:

RT tables created with SQL examples exactly as shown
default tokenization unless the example explicitly changes a setting
ASCII digits in model names, because second_numeric and second_has_digit are digit-aware modes built around 0-9

All SQL examples and expected outputs in this article were verified against a real Manticore 23.0.0 instance before publishing, using fresh tables created from scratch for each scenario.

The broader search problem

Imagine a catalog containing:

xt 850 action camera
iphone 5se battery case
canon eos 80d body
thinkpad x1 carbon

Now imagine users searching for:

xt850
iphone5se
eos80d
thinkpadx1

From the user's point of view, these should obviously match. From the engine's point of view, they often do not, because the indexed text is tokenized as separate terms.

Search systems usually attack that mismatch in one of four ways:

index prefixes or infixes
add custom normalization rules
duplicate content into alternate normalized fields
index adjacent token pairs and optionally store glued variants too

Manticore's newer bigram functionality is a structured way to do the fourth option without awkward field duplication.

Baseline: why `xt850` fails by default

Here is the problem in its simplest form:

DROP TABLE IF EXISTS bi_default_demo;

CREATE TABLE bi_default_demo(title text);

INSERT INTO bi_default_demo VALUES
  (1,'xt 850 action camera');

SELECT id, title FROM bi_default_demo WHERE MATCH('xt850');

Expected result:

Empty set

Why does this fail?

Because the document is indexed as two separate tokens, xt and 850, while the query is a single token, xt850.

By default, Manticore does not assume that:

xt850 should be split into xt + 850
or xt + 850 should also be searchable as xt850

So this is not really a typo-tolerance problem or a phrase problem. It is a tokenization mismatch: the index sees two tokens, while the query provides one.
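
The mismatch is easy to reproduce outside the engine. The sketch below uses a deliberately simplified tokenizer (lowercase, split on anything that is not an ASCII letter or digit) as a stand-in for default tokenization; it is an illustration, not Manticore's tokenizer.

```python
import re

def tokenize(text):
    """Lowercase and split on anything that is not an ASCII letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_all_terms(document, query):
    """True when every query token is present among the document tokens."""
    doc_tokens = set(tokenize(document))
    return all(term in doc_tokens for term in tokenize(query))

document = "xt 850 action camera"
print(tokenize(document))                       # ['xt', '850', 'action', 'camera']
print(tokenize("xt850"))                        # ['xt850']
print(matches_all_terms(document, "xt850"))     # False
print(matches_all_terms(document, "xt 850"))    # True
```

The glued query produces one token that simply does not exist in the document's token set, no matter how the ranking is tuned.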

That is the gap the newer bigram settings are designed to close. They let Manticore index selected adjacent token pairs in a form that can also match glued queries.

Why bigrams help here

bigram_index can help with both phrase acceleration and model-name matching, and in this article we focus on the xt 850 vs xt850 problem.

The key idea is simple:

detect adjacent token pairs that look like model names
store those pairs in a glued form too
let queries such as xt850, iphone5se, or thinkpadx1 hit the spaced text

That is where bigram_delimiter matters.

A note about bigram_delimiter

bigram_index decides which adjacent pairs are eligible.

bigram_delimiter decides how eligible bigrams are stored:

true: internal delimited token only
none: glued token only, such as galaxy24
both: both forms

The practical difference is easiest to understand from the query side:

with true, Manticore keeps the internal bigram form used for phrase optimization, but it does not keep the glued user-facing form, so a query like xt850 will not match xt 850
with none, Manticore keeps only the glued form, so xt850 can match xt 850, but you are leaning entirely on the glued representation for those pairs
with both, Manticore keeps both the internal bigram representation and the glued form, so xt850 can match xt 850 without giving up ordinary phrase behavior

For this use case, both is usually the safer default because it covers the user-visible problem directly while keeping behavior less surprising for normal phrase queries and mixed workloads.

Mode 1: `second_numeric`

bigram_index = second_numeric
bigram_delimiter = both

This mode is aimed at model names where the second token is purely numeric.

That is common in product catalogs:

xt 850
galaxy 24
playstation 5
pixel 8

The idea is simple: users often search these as glued terms such as xt850, galaxy24, or playstation5, even though the source text stores them with a space.

second_numeric stores the pair only when the second token is ASCII digits only.

Use it when:

you have product generations and numbered models
users often remove spaces in search
the second token is usually just digits

Example

DROP TABLE IF EXISTS bi_second_numeric_demo;

CREATE TABLE bi_second_numeric_demo(title text)
  bigram_index='second_numeric'
  bigram_delimiter='both';

INSERT INTO bi_second_numeric_demo VALUES
  (1,'xt 850 action camera'),
  (2,'galaxy 24 ultra'),
  (3,'playstation 5 slim'),
  (4,'iphone 5se case'),
  (5,'canon eos 80d body'),
  (6,'thinkpad x1 carbon');

Then test the queries one by one:

SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('xt850');

+------+----------------------+
| id   | title                |
+------+----------------------+
|    1 | xt 850 action camera |
+------+----------------------+

SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('galaxy24');

+------+-----------------+
| id   | title           |
+------+-----------------+
|    2 | galaxy 24 ultra |
+------+-----------------+

SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('playstation5');

+------+--------------------+
| id   | title              |
+------+--------------------+
|    3 | playstation 5 slim |
+------+--------------------+

SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('iphone5se');

Empty set

SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('eos80d');

Empty set

SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('thinkpadx1');

Empty set

That boundary is the whole point of the mode:

24 and 5 qualify
5se, 80d, and x1 do not
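
The same boundary can be modeled in a few lines. This sketch only imitates the eligibility rule described above (second token is ASCII digits only) on a simplified tokenizer; the function names are invented, and it does not reflect how Manticore stores bigrams internally.

```python
import re

def is_ascii_digits(token):
    """True only for tokens made entirely of 0-9 (str.isdigit would also accept non-ASCII digits)."""
    return re.fullmatch(r"[0-9]+", token) is not None

def glued_pairs(text):
    """Glued forms of adjacent token pairs whose second token is purely numeric."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {
        first + second
        for first, second in zip(tokens, tokens[1:])
        if is_ascii_digits(second)
    }

print(glued_pairs("xt 850 action camera"))   # {'xt850'}
print(glued_pairs("playstation 5 slim"))     # {'playstation5'}
print(glued_pairs("iphone 5se case"))        # set()
```

A query matches in this model only if its single token appears in the glued set, which is why iphone5se, eos80d, and thinkpadx1 return nothing in this mode.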

Mode 2: `second_has_digit`

bigram_index = second_has_digit
bigram_delimiter = both

This mode is the more flexible sibling of second_numeric.

It stores the pair when the second token contains at least one ASCII digit. That makes it a much better fit for real product catalogs, where model identifiers are often mixed alphanumeric strings:

xt 850
iphone 5se
eos 80d
thinkpad x1

Use it when:

your model names mix letters and digits
users frequently remove spaces in their searches
you want catalog-friendly matching without indexing every pair in the table

Example

DROP TABLE IF EXISTS bi_second_has_digit_demo;

CREATE TABLE bi_second_has_digit_demo(title text)
  bigram_index='second_has_digit'
  bigram_delimiter='both';

INSERT INTO bi_second_has_digit_demo VALUES
  (1,'xt 850 action camera'),
  (2,'galaxy 24 ultra'),
  (3,'playstation 5 slim'),
  (4,'iphone 5se case'),
  (5,'canon eos 80d body'),
  (6,'thinkpad x1 carbon'),
  (7,'kindle paperwhite signature');

Then test the queries one by one:

SELECT id, title FROM bi_second_has_digit_demo WHERE MATCH('xt850');

+------+----------------------+
| id   | title                |
+------+----------------------+
|    1 | xt 850 action camera |
+------+----------------------+

SELECT id, title FROM bi_second_has_digit_demo WHERE MATCH('galaxy24');

+------+-----------------+
| id   | title           |
+------+-----------------+
|    2 | galaxy 24 ultra |
+------+-----------------+

SELECT id, title FROM bi_second_has_digit_demo WHERE MATCH('iphone5se');

+------+-----------------+
| id   | title           |
+------+-----------------+
|    4 | iphone 5se case |
+------+-----------------+

SELECT id, title FROM bi_second_has_digit_demo WHERE MATCH('eos80d');

+------+--------------------+
| id   | title              |
+------+--------------------+
|    5 | canon eos 80d body |
+------+--------------------+

SELECT id, title FROM bi_second_has_digit_demo WHERE MATCH('thinkpadx1');

+------+--------------------+
| id   | title              |
+------+--------------------+
|    6 | thinkpad x1 carbon |
+------+--------------------+

SELECT id, title FROM bi_second_has_digit_demo WHERE MATCH('kindlesignature');

Empty set

This is often the better fit for mixed model identifiers, because real catalog data frequently includes forms like 5se, 80d, or x1 rather than only clean numeric suffixes like 24.

How to choose between the two

If your search problem is specifically "How do I make xt850 find xt 850?", the practical rule is:

use second_numeric when the second token is digits-only
use second_has_digit when the second token may be mixed, like 5se, 80d, or x1
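
One way to check which mode your catalog needs is to classify the second tokens of your model names before changing any table settings. The helper below is a hypothetical pre-check written for this article's rules (digits only versus at least one ASCII digit); it is not part of Manticore.

```python
import re

def qualifying_modes(second_token):
    """List the digit-aware bigram_index modes under which this second token qualifies."""
    modes = []
    if re.fullmatch(r"[0-9]+", second_token):
        modes.append("second_numeric")       # ASCII digits only
    if re.search(r"[0-9]", second_token):
        modes.append("second_has_digit")     # at least one ASCII digit
    return modes

for token in ["850", "24", "5se", "80d", "x1", "carbon"]:
    print(token, qualifying_modes(token))
```

If a meaningful share of your model names lands only in second_has_digit, the stricter second_numeric mode will leave those glued queries unmatched.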

One practical note on other common text-processing settings: in the straightforward case, the feature is compatible with them. xt 850 still matches xt850 with morphology='stem_en' enabled and with a wordforms rule enabled.

But that does not mean those settings rewrite the glued query for you. In tests, iphones 5 matched iphones5, but not iphone5, even with stemming or a wordforms rule mapping iphones to iphone. So the short version is: basic xt 850 vs xt850 matching stays compatible with morphology and wordforms, but if you rely on them, test the exact query shape you care about.

Final takeaway

The xt850 problem is not really about one product name. It is about a broader mismatch between how users type model names and how search engines tokenize them.

Since version 23.0.0, Manticore gives you a built-in way to handle that mismatch with bigram_delimiter plus the digit-aware bigram_index modes, which is much cleaner than duplicating fields or inventing custom preprocessing pipelines.

If your main problem is phrase-search performance rather than glued model-name matching, see How to Speed Up Phrase Search with bigram_index.

How to Speed Up Phrase Search with bigram_index

Sergey Nikolaev — Thu, 07 May 2026 08:50:15 +0000

TL;DR

bigram_index can be used for several purposes, and in this article we focus specifically on phrase-search performance: on the 1M-document benchmark below, bigram_index='all' improved QPS by about 2.9x and cut average phrase-query latency by about 3.2x.

If your main problem is matching xt850 against xt 850 rather than speeding up phrase search, see How to Make xt850 Match xt 850.

Phrase search can be expensive. Even when a query is short, the engine still has to verify ordering and adjacency, and that work gets more noticeable when:

the individual words are common
the dataset is large
phrase queries are frequent in your workload

That is exactly what bigram_index is for.

What bigram indexing actually does

Normally, a phrase like "noise cancelling headphones" is handled as separate tokens that also need to appear in the right order and next to each other. Bigram indexing lets Manticore pre-store adjacent token pairs such as:

noise cancelling
cancelling headphones

That gives the engine a faster way to narrow down candidate documents during phrase matching.
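
To make "adjacent token pairs" concrete, here is a minimal sketch that lists the pairs for a phrase. Lowercasing and splitting on whitespace are simplifying assumptions that stand in for real tokenization, and adjacent_pairs is a hypothetical helper, not a Manticore function.

```python
def adjacent_pairs(text: str) -> list[tuple[str, str]]:
    """Return every pair of neighbouring tokens, in order."""
    tokens = text.lower().split()  # simplified tokenization (assumption)
    # zip(tokens, tokens[1:]) pairs each token with the one that follows it.
    return list(zip(tokens, tokens[1:]))


if __name__ == "__main__":
    for first, second in adjacent_pairs("noise cancelling headphones"):
        print(first, second)
```

A phrase of n tokens yields n-1 pairs, which is why indexing every pair costs extra work at indexing time.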

This article focuses specifically on phrase acceleration.

Important caveat: bigrams work at tokenization level

This is the part that is easy to miss when you only look at the happy-path speedup story.

bigram_index works at the tokenization level only. It does not account for later transformations such as morphology, wordforms, or stopwords, and that can materially change phrase-matching expectations.

The practical conclusion is simple: bigrams can be excellent for phrase speed, but if your index relies heavily on morphology, wordforms, or stopwords, test the actual phrase behavior you care about before rolling the setting out broadly.

Mode 1: Default behavior

This is the baseline. No explicit bigram indexing is enabled, so no bigram posting lists are stored.

Use it when:

phrase search is rare
documents are short
you want the leanest indexing path

Example

DROP TABLE IF EXISTS bi_none_demo;

CREATE TABLE bi_none_demo(title text);

INSERT INTO bi_none_demo VALUES
  (1,'wireless noise cancelling headphones'),
  (2,'noise cancelling microphone'),
  (3,'wireless gaming headset');

SELECT id, title FROM bi_none_demo WHERE MATCH('"noise cancelling"');

The query matches rows 1 and 2 as expected, but Manticore has no precomputed bigram posting lists to help resolve the phrase more efficiently.

Mode 2: `all`

bigram_index = all

This is the most aggressive phrase-acceleration mode. Every adjacent token pair gets indexed as a bigram.

Use it when:

exact phrase search is a core feature
phrase queries often include common words and produce many candidates
you want the strongest phrase acceleration
you do not want to tune a frequent-word list

Example

DROP TABLE IF EXISTS bi_all_demo;

CREATE TABLE bi_all_demo(title text)
  bigram_index='all';

INSERT INTO bi_all_demo VALUES
  (1,'lord of the rings trilogy'),
  (2,'house of the dragon season 2'),
  (3,'made for iphone charger');

SELECT id, title FROM bi_all_demo WHERE MATCH('"house of the dragon"');
SELECT id, title FROM bi_all_demo WHERE MATCH('"made for iphone"');

The important point here is not that the matches differ, but that the indexing strategy does: all stores every adjacent pair, so phrase queries have the maximum amount of bigram help available at search time.

Choose all when phrase search gets expensive because many documents match the individual words, which forces Manticore to do more positional verification to confirm the exact phrase. all helps by narrowing candidates earlier.

Mode 3: `first_freq`

bigram_index = first_freq
bigram_freq_words = for, of, the, with

This mode stores a pair only when the first token is in your frequent-word list.

Use it when:

phrase search matters
you want a lighter alternative to all
many phrases in your data contain words that are genuinely frequent in your own corpus

With the list above:

for iphone is eligible
of the is eligible
the dragon is eligible
made for is not eligible
lord of is not eligible

For production use, do not pick bigram_freq_words from memory. Derive it from your own data. A practical way is to dump dictionary stats with indextool using --dumpdict ... --stats, review the most frequent tokens, and then build a small bigram_freq_words list from those results.
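
If you only want to get a feel for the idea before running indextool, the same "count tokens, keep the most frequent" step can be sketched over a sample of raw text. This is not the indextool route described above: it uses naive whitespace tokenization (an assumption), and top_tokens is a hypothetical helper.

```python
from collections import Counter


def top_tokens(docs: list[str], n: int) -> list[str]:
    """Return the n most frequent tokens across the sample documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())  # simplified tokenization (assumption)
    return [token for token, _ in counts.most_common(n)]


if __name__ == "__main__":
    sample = [
        "made for iphone charger",
        "lord of the rings trilogy",
        "house of the dragon season 2",
    ]
    # Join into the comma-separated form used by bigram_freq_words.
    print(",".join(top_tokens(sample, 2)))
```

On a real corpus, treat the output as a list of candidates to review by hand rather than as a final bigram_freq_words value.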

Example

DROP TABLE IF EXISTS bi_first_freq_demo;

CREATE TABLE bi_first_freq_demo(title text)
  bigram_index='first_freq'
  bigram_freq_words='for,of,the,with';

INSERT INTO bi_first_freq_demo VALUES
  (1,'made for iphone charger'),
  (2,'lord of the rings trilogy'),
  (3,'house of the dragon season 2');

SELECT id, title FROM bi_first_freq_demo WHERE MATCH('"made for iphone"');
SELECT id, title FROM bi_first_freq_demo WHERE MATCH('"lord of the"');

The queries still return the expected rows. What changes is which pairs get indexed:

"made for iphone" benefits from for iphone
"lord of the" benefits from of the

This makes first_freq a lighter alternative to all when many useful phrases involve common bridge words.

Mode 4: `both_freq`

bigram_index = both_freq
bigram_freq_words = for, of, the, with

This is the narrowest frequency-based mode. A pair is stored only when both tokens are in the frequent-word list.

Use it when:

you want the most conservative bigram footprint
you mainly care about pairs built from words that are highly frequent in your corpus
you are tuning a large corpus and do not want to index every adjacent pair

With the same list:

of the is eligible
for iphone is not eligible
the dragon is not eligible
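
The eligibility rules for all, first_freq, and both_freq can be compared side by side with a short sketch. It models only the rules as described in this article; whitespace tokenization is an assumption, and stored_pairs is a hypothetical helper rather than anything Manticore exposes.

```python
FREQ_WORDS = frozenset({"for", "of", "the", "with"})


def stored_pairs(text: str, mode: str, freq_words=FREQ_WORDS) -> list[tuple[str, str]]:
    """Return the adjacent pairs a given bigram_index mode would store (toy model)."""
    tokens = text.lower().split()  # simplified tokenization (assumption)
    pairs = list(zip(tokens, tokens[1:]))
    if mode == "all":
        return pairs  # every adjacent pair
    if mode == "first_freq":
        return [p for p in pairs if p[0] in freq_words]  # first token is frequent
    if mode == "both_freq":
        return [p for p in pairs if p[0] in freq_words and p[1] in freq_words]  # both are frequent
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for title in ["house of the dragon", "made for iphone charger"]:
        for mode in ("all", "first_freq", "both_freq"):
            print(f"{title!r} {mode}: {stored_pairs(title, mode)}")
```

The shrinking lists from all to first_freq to both_freq are the footprint trade-off in miniature: fewer stored pairs means cheaper indexing, but fewer phrases get bigram help.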

Example

DROP TABLE IF EXISTS bi_both_freq_demo;

CREATE TABLE bi_both_freq_demo(title text)
  bigram_index='both_freq'
  bigram_freq_words='for,of,the,with';

INSERT INTO bi_both_freq_demo VALUES
  (1,'lord of the rings trilogy'),
  (2,'house of the dragon season 2'),
  (3,'made for iphone charger');

SELECT id, title FROM bi_both_freq_demo WHERE MATCH('"lord of the"');
SELECT id, title FROM bi_both_freq_demo WHERE MATCH('"made for iphone"');

The queries still match, but the internal selectivity differs:

"lord of the" includes of the, which both_freq is willing to store
"made for iphone" includes for iphone, which first_freq would cover but both_freq would not

Which performance mode should you choose?

The benchmark in this article shows that all can deliver a strong speedup, but it is still just one benchmark on one workload.

Manticore's own documentation says that for most use cases, both_freq is the best mode. That is a sensible default because it aims for a more balanced trade-off between phrase acceleration and indexing cost.

Use the modes like this:

choose both_freq as the default starting point for general phrase-search workloads
choose all when phrase search is especially important and you want the strongest acceleration, accepting higher indexing cost
choose first_freq when many useful phrases in your data involve common bridge words and you want something broader than both_freq
choose the default behavior when phrase acceleration is not important

Benchmark: does bigram indexing really speed up phrase search?

Yes. In a simple local benchmark, the difference was easy to measure.

I used manticore-load to build two 1M-document tables against the same Manticore instance:

one with no explicit bigram_index setting
one with bigram_index='all'

The documents were random 60-80 word texts, and the benchmark repeatedly ran random 2-word phrase queries.

For clarity, both indexing and search were run with --threads=1. Multi-threaded numbers would of course be higher, but single-thread runs make it easier to see what the feature changes on one CPU core.

SELECT COUNT(*) FROM bench_bigram_* WHERE MATCH('"<text/2/2>"')

Benchmark setup

Data load without bigrams:

manticore-load \
  --drop \
  --wait \
  --threads=1 \
  --batch-size=1000 \
  --total=1000000 \
  --init="CREATE TABLE bench_bigram_none_rand(title text)" \
  --load="INSERT INTO bench_bigram_none_rand(id,title) VALUES(<increment>,'<text/60/80>')"

Data load with all bigrams:

manticore-load \
  --drop \
  --wait \
  --threads=1 \
  --batch-size=1000 \
  --total=1000000 \
  --init="CREATE TABLE bench_bigram_all_rand(title text) bigram_index='all'" \
  --load="INSERT INTO bench_bigram_all_rand(id,title) VALUES(<increment>,'<text/60/80>')"

Search benchmark without bigrams:

manticore-load \
  --threads=1 \
  --total=5000 \
  --load="SELECT COUNT(*) FROM bench_bigram_none_rand WHERE MATCH('\\\"<text/2/2>\\\"')"

Search benchmark with all bigrams:

manticore-load \
  --threads=1 \
  --total=5000 \
  --load="SELECT COUNT(*) FROM bench_bigram_all_rand WHERE MATCH('\\\"<text/2/2>\\\"')"

What I observed

On this local run:

Table                     QPS     Avg latency
bench_bigram_none_rand    755     1.3 ms
bench_bigram_all_rand     2175    0.4 ms

That is roughly a 2.9x improvement in QPS and about a 3.2x improvement in average latency on the same 1M-document workload.

Indexing was slower with bigram_index='all', which is expected:

without bigrams: about 45k docs/sec
with all: about 17k docs/sec

That trade-off is exactly why multiple modes exist.
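
For anyone who wants to redo the arithmetic with their own numbers, the ratios follow directly from the figures reported above. The inputs are the rounded values from this run, so the indexing ratio in particular is approximate.

```python
# Figures reported in this article (rounded values from one local run).
qps_none, qps_all = 755, 2175
latency_none_ms, latency_all_ms = 1.3, 0.4
index_none_docs_per_sec, index_all_docs_per_sec = 45_000, 17_000

# Search side: higher QPS and lower latency with bigram_index='all'.
qps_gain = qps_all / qps_none
latency_gain = latency_none_ms / latency_all_ms

# Indexing side: how many times slower loading was with bigram_index='all'.
indexing_slowdown = index_none_docs_per_sec / index_all_docs_per_sec

if __name__ == "__main__":
    print(f"QPS gain:          {qps_gain:.2f}x")
    print(f"Latency gain:      {latency_gain:.2f}x")
    print(f"Indexing slowdown: {indexing_slowdown:.2f}x")
```

Swap in your own measurements to see whether the search-side gain justifies the indexing-side cost for your workload.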

Final takeaway

If your main problem is phrase-search performance, treat bigram_index first and foremost as an acceleration feature.

For most real workloads, start with both_freq and measure. Move to all if you need a stronger effect and can afford the extra indexing cost. Consider first_freq when your phrase workload is heavily shaped by common bridge words.
