Takahiro Sato

Posted on Jun 3

Query Markdown as a Database with mq-db: SQL, mq, and Interval Indexes

#markdown #database #sql #rust

Managing large collections of Markdown documents (documentation sites, knowledge bases, notes) often means resorting to a messy combination of grep and regex.

mq-db takes a different approach: it parses every Markdown element into a typed block, builds interval and secondary indexes, and exposes the result through SQL and a jq-inspired query language called mq.

What mq-db Does

mq-db turns Markdown files into a queryable, embedded database with zero external dependencies.

# Index a set of documents
mq-db index docs/ --recursive --output store.mq-db

# Run SQL across all indexed files
mq-db sql "SELECT block_type, count(*) FROM blocks GROUP BY block_type" --db store.mq-db

# Or use the interactive TUI
mq-db tui --db store.mq-db

Every Markdown element (heading, paragraph, code block, list item, table cell) becomes a Block row with a block_type, content, and structural metadata.

The Block Model

Parsing a Markdown file produces a flat vector of blocks:

struct Block {
    id: u32,
    document_id: u32,
    block_type: BlockType,  // Heading, Paragraph, Code, List, etc.
    content: String,
    span: Option<Span>,     // Source line/column
    pre: u32,               // Interval index: pre-order
    post: u32,              // Interval index: post-order
    properties: Properties, // depth, lang, ordered, etc.
}

The pre and post fields encode heading hierarchy using a Nested Set (Pre-Post Order) traversal. An ancestor check ("is block A inside section B?") reduces to a single integer comparison:

A is_under B  ↔  B.pre < A.pre  AND  A.post < B.post

This eliminates the need for expensive tree traversal at query time, keeping lookup performance at $O(1)$.

Index Layers

To ensure high performance, mq-db applies a cheapest-first indexing strategy across three layers before any blocks are read from disk:

Layer 1: Zone Maps. Per-document statistics stored in the .mq-db file. If a document cannot contain the requested heading text, language tag, or depth, it is skipped entirely.
Layer 2: Interval Index. Narrows the candidate range to a specific section using the (pre, post) pair of the anchor heading.
Layer 3: Secondary Indexes. Includes a BitmapIndex for block_type, a BTreeIndex for pre/post ranges, and a HashIndex for exact matches on content, lang, and depth. The SQL predicate pushdown picks an IndexHint automatically.

SQL Queries

The custom SQL engine is based on sqlparser and runs entirely in-memory with no SQLite dependency.

Block type distribution across all documents

SELECT block_type, count(*) AS total
FROM blocks
GROUP BY block_type
ORDER BY total DESC;

┌─────────────┬───────┐
│ block_type  │ total │
├─────────────┼───────┤
│ paragraph   │   241 │
│ heading     │    89 │
│ code        │    73 │
└─────────────┴───────┘

H1 titles per document

SELECT d.path, b.content
FROM blocks b
JOIN documents d ON d.id = b.document_id
WHERE b.block_type = 'heading' AND b.depth = 1;

All content inside a section with `under()`

The under() scalar function leverages the interval index for fast hierarchy queries:

SELECT b.block_type, b.content
FROM blocks b
WHERE under(
  b.pre, b.post,
  (SELECT pre  FROM blocks WHERE block_type = 'heading' AND content = 'Architecture'),
  (SELECT post FROM blocks WHERE block_type = 'heading' AND content = 'Architecture')
)
ORDER BY b.pre;

This instantly retrieves every block nested under the Architecture section across all indexed documents in their original order.

Inline mq with `mq()`

You can also run an mq program directly inside your SQL queries using the mq() scalar function:

SELECT mq('.h1 | to_text', content) AS title
FROM blocks
WHERE block_type = 'code' AND lang = 'markdown';

mq Queries

For quick command-line extractions, the mq engine accepts native jq-style expressions:

mq-db mq ".h1" --db store.mq-db
mq-db mq 'select(.code_lang == "rust")' --db store.mq-db
mq-db mq ".h | select(.depth == 2)" --db store.mq-db --format markdown

DDL: Custom In-Memory Tables

You can create custom tables from query results or define an explicit schema:

-- Snapshot headings into a custom table
CREATE TABLE headings AS
SELECT content, depth FROM blocks WHERE block_type = 'heading';

-- Explicit schema + insert
CREATE TABLE notes (id TEXT, body TEXT);
INSERT INTO notes VALUES ('1', 'Design decision: use interval index for hierarchy');

-- Inspect schema
SHOW TABLES;
DESC notes;

-- Clean up
DROP TABLE notes;

Interactive REPL

mq-db repl --db store.mq-db --mode sql

mq-db  (.help for commands  .quit to exit)
mode: sql  (.mode mq | .mode sql)

sql> SELECT content FROM blocks WHERE block_type = 'heading' LIMIT 3;
┌──────────────────┐
│ content          │
├──────────────────┤
│ Overview         │
│ Architecture     │
│ Query Engine     │
└──────────────────┘
(3 rows)

sql> .mode mq
→ mq mode
mq> .h2
## Architecture
## Query Engine

HTTP Server

mq-db serve --db store.mq-db  # Listens on 127.0.0.1:7878

The server exposes three simple endpoints:

Method	Path	Description
`GET`	`/health`	Returns internal engine status and document count
`POST`	`/sql`	Executes SQL queries and returns rows as JSON
`POST`	`/mq`	Evaluates mq expressions and returns results

curl -s -X POST http://127.0.0.1:7878/sql \
  -H 'Content-Type: application/json' \
  -d '{"query":"SELECT block_type, count(*) FROM blocks GROUP BY block_type"}'

Library API

mq-db can also be integrated directly into your Rust projects as a library:

use mq_db::{DocumentStore, SqlEngine, MqEngine};

let mut store = DocumentStore::new();
store.add_file("docs/DESIGN.md")?;
store.add_str("# Hello\n\n## Architecture\n\nDetails\n")?;

// Chainable query API
let chunks = store.query()
    .documents(|doc| doc.zone_maps.heading_contents.contains("Architecture"))
    .under_heading("Architecture", Some(2))
    .filter(|b| matches!(b.block_type, BlockType::Paragraph | BlockType::Code))
    .blocks();

// SQL Evaluation
let engine = SqlEngine::new(&store)?;
let out = engine.execute(
    "SELECT content FROM blocks WHERE block_type = 'heading' ORDER BY pre"
)?;
print!("{}", out.to_table());

// Serialization
store.save("store.mq-db")?;
let store = DocumentStore::load("store.mq-db")?;

Installation

# Install script (Linux / macOS)
curl -fsSL https://raw.githubusercontent.com/harehare/mq-db/main/bin/install.sh | bash

# Via Cargo
cargo install mq-db

Use Cases

RAG Preprocessing: Extract precise section content by heading for LLM embedding pipelines.
Documentation Search: Run fast SQL queries across extensive doc sites without deploying heavy search infrastructure.
Structural Linting: Easily detect style violations (e.g., missing subheadings, empty sections, or incorrect list placement).
Knowledge Base Analytics: Aggregate, group, and discover patterns across thousands of personal or team notes.

Resources

⭐ If you find mq-db useful, dropping a star on GitHub is highly appreciated!

Top comments (1)

mote • Jun 10

Nested Set Model (pre/post ordering) for Markdown hierarchy is clever. That O(1) parent-child check via integer comparison is way faster than recursive CTEs. Most people reaching for SQLite + recursive CTEs for document trees don't realize you can precompute the interval bounds at index time.

Building a custom SQL engine instead of embedding SQLite is a bold call. sqlparser is great for parsing, but implementing the full execution pipeline (optimizer, join strategies, aggregations) from scratch is a lot of surface area. Any particular reason you didn't embed SQLite via rusqlite and just expose the interval index as a virtual table?

Question: how does the .mq-db file format handle incremental updates? If a Markdown file changes, re-indexing the entire document tree means recomputing all pre/post values — do you do full rebuilds or is there a diff-based update path?