TL;DR — content-mill is an open-source CLI and library that reads static content — MkDocs sites, markdown directories, JSON files, HTML pages — and indexes it into Meilisearch, driven by a YAML config. You define the document shape; it handles extraction, templating, chunking, and atomic zero-downtime re-indexing. You still tune templates and debug extraction for your own content — that part's on you — but you stop maintaining bespoke scraper code.
```shell
npm install @centrali-io/content-mill
```
If you've ever tried to make your docs, blog posts, or changelogs searchable with Meilisearch, you know the drill: write a custom scraper, parse the content, transform it into the right shape, push it to an index, and hope you don't break search during re-indexing.
I got tired of writing that glue code for every project, so I built content-mill — a CLI and library that indexes static content into Meilisearch, driven by a YAML config.
## The problem
Meilisearch is fantastic for search, but getting your content into it is surprisingly manual. Every docs site, every changelog, every collection of markdown files needs its own extraction pipeline. And if you want zero-downtime re-indexing? That's more code on top.
Most existing solutions are either tightly coupled to a specific framework (like DocSearch for Algolia) or expect you to run a full crawler. Lighter-weight options exist — usually ad-hoc scripts people write once per project — but nothing I could find that's reusable across source types and explicit about document shape.
## What content-mill does
You describe your content sources and the document shape you want in a YAML config:
```yaml
meili:
  host: http://localhost:7700
  apiKey: ${MEILI_MASTER_KEY}

sources:
  - name: docs
    type: mkdocs
    config: ./mkdocs.yml
    index: docs
    document:
      primaryKey: id
      fields:
        id: "{{ slug }}"
        title: "{{ heading }}"
        content: "{{ body }}"
        section: "{{ nav_section }}"
        url: "{{ path }}"
        type: "docs"
    searchableAttributes: [title, content]
    filterableAttributes: [section, type]
```
Then run:
```shell
npx @centrali-io/content-mill index --config content-mill.yml
```
Once the config matches your content, re-running is a single command. You'll still spend time tuning templates and sanity-checking extraction (use `--dry-run` for that), but you're not maintaining scraper code anymore. content-mill handles extraction, templating, and atomic index swapping, so search never goes down during re-indexing.
## Four source types, one interface
content-mill ships with adapters for the content formats you're most likely already using:
- `mkdocs`: Reads your `mkdocs.yml`, follows the nav tree, and parses each markdown page. You get `nav_section` context so you know which part of the docs each page belongs to.
- `markdown-dir`: Recursively reads `.md` files from a directory. Supports YAML frontmatter, so you can pull version numbers, dates, or any metadata into your search index. Great for changelogs and blog posts.
- `json`: Reads a JSON array (or directory of JSON files). Every key in each object becomes a template variable. Perfect for structured data you already have lying around.
- `html`: Reads `.html` files, strips scripts/styles/nav/footer, and gives you clean text. Useful for indexing a built static site.
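As a concrete sketch, a changelog indexed via `markdown-dir` might look like the config below. The `path` key is my guess at the option name, and `frontmatter.version` assumes your files carry a `version:` frontmatter key; check the README for exact spellings.

```yaml
sources:
  - name: changelog
    type: markdown-dir
    path: ./changelog        # assumed option name for the source directory
    index: changelog
    document:
      primaryKey: id
      fields:
        id: "{{ slug }}"
        title: "{{ frontmatter.title }}"
        version: "{{ frontmatter.version }}"
        date: "{{ frontmatter.date }}"
        content: "{{ body }}"
```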
## Templating: you control the document shape
The key design decision is that you define what your Meilisearch documents look like. Source adapters extract raw variables (`slug`, `heading`, `body`, `path`, `frontmatter.*`, and so on), and you map them to fields using `{{ template }}` syntax:
```yaml
fields:
  id: "{{ slug }}-{{ chunk_index }}"
  title: "{{ chunk_heading }}"
  content: "{{ chunk_body }}"
  excerpt: "{{ body | truncate(200) }}"
  url: "{{ path }}#{{ chunk_heading | slugify }}"
```
Filters like `truncate`, `slugify`, `lower`, `upper`, and `strip_md` can be chained with pipes. This means you're not locked into someone else's schema — your search index looks exactly the way your frontend expects.
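To make the pipe semantics concrete, here's a minimal TypeScript sketch of how chained filters resolve. This is an illustration of the idea, not content-mill's actual implementation, and the `slugify` rules here are my own rough approximation.

```typescript
// Each filter is a string -> string function; a chain applies them left to right.
type Filter = (input: string, arg?: string) => string;

const filters: Record<string, Filter> = {
  lower: (s) => s.toLowerCase(),
  upper: (s) => s.toUpperCase(),
  truncate: (s, n) => s.slice(0, Number(n ?? 100)),
  // Rough slugify: lowercase, drop punctuation, hyphenate whitespace.
  slugify: (s) =>
    s.toLowerCase().replace(/[^a-z0-9\s-]/g, "").trim().replace(/\s+/g, "-"),
};

// Apply a chain like ["strip-of-filters"] parsed from "{{ value | a | b(arg) }}".
function applyChain(value: string, chain: string[]): string {
  return chain.reduce((acc, step) => {
    const m = step.match(/^(\w+)(?:\((.*)\))?$/);
    if (!m) throw new Error(`bad filter expression: ${step}`);
    const fn = filters[m[1]];
    if (!fn) throw new Error(`unknown filter: ${m[1]}`);
    return fn(acc, m[2]);
  }, value);
}

console.log(applyChain("Getting Started!", ["slugify"])); // "getting-started"
```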
## Chunking for granular results
Whole-page results are often too broad for docs search. content-mill can split pages by heading level:
```yaml
chunking:
  strategy: heading
  level: 2
```
This turns one long page into multiple documents, one per `##` section, each with its own `chunk_heading`, `chunk_body`, and `chunk_index`. Your search results can now link directly to the relevant section instead of dumping users at the top of a page.
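The splitting step itself is simple enough to sketch. Here's a rough TypeScript illustration of heading-level chunking — not the library's code, and it ignores content before the first heading for brevity:

```typescript
interface Chunk {
  chunk_index: number;
  chunk_heading: string;
  chunk_body: string;
}

// Split a markdown page into one chunk per heading of the given level
// (level 2 means "## " headings, matching the config above).
function chunkByHeading(markdown: string, level = 2): Chunk[] {
  const marker = "#".repeat(level) + " ";
  const chunks: Chunk[] = [];
  let current: Chunk | null = null;
  for (const line of markdown.split("\n")) {
    if (line.startsWith(marker)) {
      current = {
        chunk_index: chunks.length,
        chunk_heading: line.slice(marker.length).trim(),
        chunk_body: "",
      };
      chunks.push(current);
    } else if (current) {
      current.chunk_body += line + "\n";
    }
  }
  return chunks;
}
```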
## Zero-downtime re-indexing
Every indexing run uses Meilisearch's index swap:
1. Documents go into a temp index (`docs_tmp`)
2. Atomic swap with the live index (`docs`)
3. Old index gets cleaned up
If something fails mid-way, your live index is untouched. No maintenance windows needed.
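The pattern itself is plain Meilisearch: write into a temp index, then swap it with the live one via the `POST /swap-indexes` endpoint. A sketch of the plan those three steps imply — the `_tmp` suffix comes from the post; a hypothetical helper, not content-mill's API:

```typescript
// Compute the temp-index name, the payload for Meilisearch's
// /swap-indexes endpoint, and the index to delete afterwards.
interface SwapPlan {
  tmpIndex: string;                       // where documents are written first
  swap: { indexes: [string, string] }[];  // /swap-indexes payload
  cleanup: string;                        // after the swap, old docs live here
}

function buildSwapPlan(liveIndex: string): SwapPlan {
  const tmpIndex = `${liveIndex}_tmp`;
  return {
    tmpIndex,
    swap: [{ indexes: [tmpIndex, liveIndex] }],
    // The swap is atomic, so the old documents end up under the tmp name
    // and can be deleted without any window where search is down.
    cleanup: tmpIndex,
  };
}

console.log(buildSwapPlan("docs").tmpIndex); // "docs_tmp"
```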
## CI/CD in two lines
```yaml
# GitHub Actions
- name: Index docs
  env:
    MEILI_MASTER_KEY: ${{ secrets.MEILI_MASTER_KEY }}
  run: npx @centrali-io/content-mill index --config content-mill.yml
```
Hook this into your release pipeline and your search index stays in sync with every deploy.
## Use as a library
Don't need the CLI? Import it directly:
```typescript
import { loadConfig, indexAll } from '@centrali-io/content-mill';

const config = loadConfig('./content-mill.yml');
await indexAll(config, { dryRun: false });
```
Or build the config object in code if you prefer programmatic control.
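Building the config in code might look something like this. I'm assuming the object mirrors the YAML keys shown earlier — verify against the package's exported types before relying on it:

```typescript
// Hypothetical programmatic config mirroring the YAML example above.
const config = {
  meili: {
    host: process.env.MEILI_HOST ?? "http://localhost:7700",
    apiKey: process.env.MEILI_MASTER_KEY,
  },
  sources: [
    {
      name: "docs",
      type: "mkdocs",
      config: "./mkdocs.yml",
      index: "docs",
      document: {
        primaryKey: "id",
        fields: {
          id: "{{ slug }}",
          title: "{{ heading }}",
          content: "{{ body }}",
        },
      },
    },
  ],
};

// Then: await indexAll(config, { dryRun: true });
```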
## Why not docs-scraper, DocSearch, or a custom crawler?
- `docs-scraper` (the Meilisearch-native option) is a Scrapy-based web crawler. Works well for live sites, heavy for "I already have markdown in a repo."
- Algolia DocSearch is excellent, but framework-specific and indexes into Algolia — not useful if you've chosen Meilisearch.
- Custom scrapers work fine for one project. Painful when you have three of them to maintain across different repos.
content-mill is intentionally narrow: static content in, Meilisearch out, config-driven shape in between. If you're not already on Meilisearch, use something else.
## Getting started
1. Install the package:

   ```shell
   npm install @centrali-io/content-mill
   ```

2. Create a `content-mill.yml` with your Meilisearch connection and source definitions
3. Run with `--dry-run` first to preview the extracted documents
4. Run for real and check your Meilisearch dashboard
The full config reference and source type examples are in the README on GitHub.
content-mill is MIT-licensed and open source. If you use Meilisearch and have static content to index, try it — and if your source type isn't covered (AsciiDoc, RST, Notion export, whatever), open an issue and I'll look at adding an adapter.
