TL;DR — content-mill is an open-source CLI and library that reads static content — MkDocs sites, markdown directories, JSON files, HTML pages — and indexes it into Meilisearch, driven by a YAML config. You define the document shape; it handles extraction, templating, chunking, and atomic zero-downtime re-indexing. You still tune templates and debug extraction for your own content — that part's on you — but you stop maintaining bespoke scraper code.
```shell
npm install @centrali-io/content-mill
```
If you've ever tried to make your docs, blog posts, or changelogs searchable with Meilisearch, you know the drill: write a custom scraper, parse the content, transform it into the right shape, push it to an index, and hope you don't break search during re-indexing.
I got tired of writing that glue code for every project, so I built content-mill — a CLI and library that indexes static content into Meilisearch, driven by a YAML config.
## The problem
Meilisearch is fantastic for search, but getting your content into it is surprisingly manual. Every docs site, every changelog, every collection of markdown files needs its own extraction pipeline. And if you want zero-downtime re-indexing? That's more code on top.
Most existing solutions are either tightly coupled to a specific framework (like DocSearch for Algolia) or expect you to run a full crawler. Lighter-weight options exist — usually ad-hoc scripts people write once per project — but nothing I could find that's reusable across source types and explicit about document shape.
## What content-mill does
You describe your content sources and the document shape you want in a YAML config:
```yaml
meili:
  host: http://localhost:7700
  apiKey: ${MEILI_MASTER_KEY}

sources:
  - name: docs
    type: mkdocs
    config: ./mkdocs.yml
    index: docs
    document:
      primaryKey: id
      fields:
        id: "{{ slug }}"
        title: "{{ heading }}"
        content: "{{ body }}"
        section: "{{ nav_section }}"
        url: "{{ path }}"
        type: "docs"
    searchableAttributes: [title, content]
    filterableAttributes: [section, type]
```
Then run:
```shell
npx @centrali-io/content-mill index --config content-mill.yml
```
Once the config matches your content, re-running is a single command. You'll still spend time tuning templates and sanity-checking extraction (use `--dry-run` for that), but you're not maintaining scraper code anymore. content-mill handles extraction, templating, and atomic index swapping, so search never goes down during re-indexing.
## Four source types, one interface
content-mill ships with adapters for the content formats you're most likely already using:
- `mkdocs`: Reads your `mkdocs.yml`, follows the nav tree, and parses each markdown page. You get `nav_section` context so you know which part of the docs each page belongs to.
- `markdown-dir`: Recursively reads `.md` files from a directory. Supports YAML frontmatter, so you can pull version numbers, dates, or any metadata into your search index. Great for changelogs and blog posts.
- `json`: Reads a JSON array (or directory of JSON files). Every key in each object becomes a template variable. Perfect for structured data you already have lying around.
- `html`: Reads `.html` files, strips scripts/styles/nav/footer, and gives you clean text. Useful for indexing a built static site.
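As a concrete sketch, a changelog indexed via `markdown-dir` might look like the config below. The `path` key is my guess at the option name, and `frontmatter.version` assumes your files carry a `version:` frontmatter key; check the README for exact spellings.

```yaml
sources:
  - name: changelog
    type: markdown-dir
    path: ./changelog        # assumed option name for the source directory
    index: changelog
    document:
      primaryKey: id
      fields:
        id: "{{ slug }}"
        title: "{{ frontmatter.title }}"
        version: "{{ frontmatter.version }}"
        date: "{{ frontmatter.date }}"
        content: "{{ body }}"
```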
## Templating: you control the document shape
The key design decision is that you define what your Meilisearch documents look like. Source adapters extract raw variables (`slug`, `heading`, `body`, `path`, `frontmatter.*`, and so on), and you map them to fields using `{{ template }}` syntax:
```yaml
fields:
  id: "{{ slug }}-{{ chunk_index }}"
  title: "{{ chunk_heading }}"
  content: "{{ chunk_body }}"
  excerpt: "{{ body | truncate(200) }}"
  url: "{{ path }}#{{ chunk_heading | slugify }}"
```
Filters like `truncate`, `slugify`, `lower`, `upper`, and `strip_md` can be chained with pipes. This means you're not locked into someone else's schema — your search index looks exactly the way your frontend expects.
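To make the pipe semantics concrete, here's a minimal TypeScript sketch of how chained filters resolve. This is an illustration of the idea, not content-mill's actual implementation, and the `slugify` rules here are my own rough approximation.

```typescript
// Each filter is a string -> string function; a chain applies them left to right.
type Filter = (input: string, arg?: string) => string;

const filters: Record<string, Filter> = {
  lower: (s) => s.toLowerCase(),
  upper: (s) => s.toUpperCase(),
  truncate: (s, n) => s.slice(0, Number(n ?? 100)),
  // Rough slugify: lowercase, drop punctuation, hyphenate whitespace.
  slugify: (s) =>
    s.toLowerCase().replace(/[^a-z0-9\s-]/g, "").trim().replace(/\s+/g, "-"),
};

// Apply a chain like ["strip-of-filters"] parsed from "{{ value | a | b(arg) }}".
function applyChain(value: string, chain: string[]): string {
  return chain.reduce((acc, step) => {
    const m = step.match(/^(\w+)(?:\((.*)\))?$/);
    if (!m) throw new Error(`bad filter expression: ${step}`);
    const fn = filters[m[1]];
    if (!fn) throw new Error(`unknown filter: ${m[1]}`);
    return fn(acc, m[2]);
  }, value);
}

console.log(applyChain("Getting Started!", ["slugify"])); // "getting-started"
```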
## Chunking for granular results
Whole-page results are often too broad for docs search. content-mill can split pages by heading level:
```yaml
chunking:
  strategy: heading
  level: 2
```
This turns one long page into multiple documents, one per `##` section, each with its own `chunk_heading`, `chunk_body`, and `chunk_index`. Your search results can now link directly to the relevant section instead of dumping users at the top of a page.
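The splitting step itself is simple enough to sketch. Here's a rough TypeScript illustration of heading-level chunking — not the library's code, and it ignores content before the first heading for brevity:

```typescript
interface Chunk {
  chunk_index: number;
  chunk_heading: string;
  chunk_body: string;
}

// Split a markdown page into one chunk per heading of the given level
// (level 2 means "## " headings, matching the config above).
function chunkByHeading(markdown: string, level = 2): Chunk[] {
  const marker = "#".repeat(level) + " ";
  const chunks: Chunk[] = [];
  let current: Chunk | null = null;
  for (const line of markdown.split("\n")) {
    if (line.startsWith(marker)) {
      current = {
        chunk_index: chunks.length,
        chunk_heading: line.slice(marker.length).trim(),
        chunk_body: "",
      };
      chunks.push(current);
    } else if (current) {
      current.chunk_body += line + "\n";
    }
  }
  return chunks;
}
```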
## Zero-downtime re-indexing
Every indexing run uses Meilisearch's index swap:
1. Documents go into a temp index (`docs_tmp`)
2. Atomic swap with the live index (`docs`)
3. Old index gets cleaned up
If something fails mid-way, your live index is untouched. No maintenance windows needed.
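The pattern itself is plain Meilisearch: write into a temp index, then swap it with the live one via the `POST /swap-indexes` endpoint. A sketch of the plan those three steps imply — the `_tmp` suffix comes from the post; a hypothetical helper, not content-mill's API:

```typescript
// Compute the temp-index name, the payload for Meilisearch's
// /swap-indexes endpoint, and the index to delete afterwards.
interface SwapPlan {
  tmpIndex: string;                       // where documents are written first
  swap: { indexes: [string, string] }[];  // /swap-indexes payload
  cleanup: string;                        // after the swap, old docs live here
}

function buildSwapPlan(liveIndex: string): SwapPlan {
  const tmpIndex = `${liveIndex}_tmp`;
  return {
    tmpIndex,
    swap: [{ indexes: [tmpIndex, liveIndex] }],
    // The swap is atomic, so the old documents end up under the tmp name
    // and can be deleted without any window where search is down.
    cleanup: tmpIndex,
  };
}

console.log(buildSwapPlan("docs").tmpIndex); // "docs_tmp"
```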
## CI/CD in two lines
```yaml
# GitHub Actions
- name: Index docs
  env:
    MEILI_MASTER_KEY: ${{ secrets.MEILI_MASTER_KEY }}
  run: npx @centrali-io/content-mill index --config content-mill.yml
```
Hook this into your release pipeline and your search index stays in sync with every deploy.
## Use as a library
Don't need the CLI? Import it directly:
```typescript
import { loadConfig, indexAll } from '@centrali-io/content-mill';

const config = loadConfig('./content-mill.yml');
await indexAll(config, { dryRun: false });
```
Or build the config object in code if you prefer programmatic control.
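Building the config in code might look something like this. I'm assuming the object mirrors the YAML keys shown earlier — verify against the package's exported types before relying on it:

```typescript
// Hypothetical programmatic config mirroring the YAML example above.
const config = {
  meili: {
    host: process.env.MEILI_HOST ?? "http://localhost:7700",
    apiKey: process.env.MEILI_MASTER_KEY,
  },
  sources: [
    {
      name: "docs",
      type: "mkdocs",
      config: "./mkdocs.yml",
      index: "docs",
      document: {
        primaryKey: "id",
        fields: {
          id: "{{ slug }}",
          title: "{{ heading }}",
          content: "{{ body }}",
        },
      },
    },
  ],
};

// Then: await indexAll(config, { dryRun: true });
```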
## Why not docs-scraper, DocSearch, or a custom crawler?
- `docs-scraper` (the Meilisearch-native option) is a Scrapy-based web crawler. Works well for live sites, heavy for "I already have markdown in a repo."
- Algolia DocSearch is excellent, but framework-specific and indexes into Algolia — not useful if you've chosen Meilisearch.
- Custom scrapers work fine for one project. Painful when you have three of them to maintain across different repos.
content-mill is intentionally narrow: static content in, Meilisearch out, config-driven shape in between. If you're not already on Meilisearch, use something else.
## Getting started
1. Install the package:

   ```shell
   npm install @centrali-io/content-mill
   ```

2. Create a `content-mill.yml` with your Meilisearch connection and source definitions
3. Run with `--dry-run` first to preview the extracted documents
4. Run for real and check your Meilisearch dashboard
The full config reference and source type examples are in the README on GitHub.
content-mill is MIT-licensed and open source. If you use Meilisearch and have static content to index, try it — and if your source type isn't covered (AsciiDoc, RST, Notion export, whatever), open an issue and I'll look at adding an adapter.
