Mary Olowu

Stop Writing Custom Scrapers: Index Any Static Content into Meilisearch with One Config File

If you've ever tried to make your docs, blog posts, or changelogs searchable with Meilisearch, you know the drill: write a custom scraper, parse the content, transform it into the right shape, push it to an index, and hope you don't break search during re-indexing.

I got tired of writing that glue code for every project, so I built content-mill — a CLI and library that indexes static content into Meilisearch from a single YAML config.

The Problem

Meilisearch is fantastic for search, but getting your content into it is surprisingly manual. Every docs site, every changelog, every collection of markdown files needs its own extraction pipeline. And if you want zero-downtime re-indexing? That's more code on top.

Most existing solutions are either tightly coupled to a specific framework (like DocSearch for Algolia) or require you to write a full crawler. If you just have some markdown files and a Meilisearch instance, there's nothing lightweight that bridges the gap.

What content-mill Does

You describe your content sources and the document shape you want in a YAML config:

```yaml
meili:
  host: http://localhost:7700
  apiKey: ${MEILI_MASTER_KEY}

sources:
  - name: docs
    type: mkdocs
    config: ./mkdocs.yml
    index: docs
    document:
      primaryKey: id
      fields:
        id: "{{ slug }}"
        title: "{{ heading }}"
        content: "{{ body }}"
        section: "{{ nav_section }}"
        url: "{{ path }}"
        type: "docs"
      searchableAttributes: [title, content]
      filterableAttributes: [section, type]
```

Then run:

```shell
npx @centrali-io/content-mill index --config content-mill.yml
```

That's it. content-mill reads your sources, extracts content, applies your field templates, and pushes everything to Meilisearch with atomic index swapping (so search never goes down during re-indexing).

Four Source Types, One Interface

content-mill ships with adapters for the content formats you're most likely already using:

mkdocs — Reads your mkdocs.yml, follows the nav tree, and parses each markdown page. You get nav_section context so you know which part of the docs each page belongs to.

markdown-dir — Recursively reads .md files from a directory. Supports YAML frontmatter, so you can pull version numbers, dates, or any metadata into your search index. Great for changelogs and blog posts.

json — Reads a JSON array (or directory of JSON files). Every key in each object becomes a template variable. Perfect for structured data you already have lying around.

html — Reads .html files, strips scripts/styles/nav/footer, and gives you clean text. Useful for indexing a built static site.
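Under the hood, "one interface" suggests each adapter produces the same kind of output: a flat record of template variables per document. The sketch below shows what that could look like; the interface and class names here are illustrative assumptions, not content-mill's actual internals.

```typescript
// Hypothetical adapter interface -- names are illustrative, not
// content-mill's real internals.
type TemplateVars = Record<string, string>;

interface SourceAdapter {
  /** Extract one variable record per raw document. */
  extract(): Promise<TemplateVars[]>;
}

// A minimal adapter in the spirit of the `json` source type:
// every key of every object in the array becomes a template variable.
class JsonArrayAdapter implements SourceAdapter {
  constructor(private readonly records: Record<string, unknown>[]) {}

  async extract(): Promise<TemplateVars[]> {
    return this.records.map((rec) =>
      Object.fromEntries(
        Object.entries(rec).map(([key, value]) => [key, String(value)]),
      ),
    );
  }
}
```

Because every adapter funnels into the same variable-record shape, the templating and indexing stages never need to know which source type a document came from.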

Templating: You Control the Document Shape

The key design decision is that you define what your Meilisearch documents look like. Source adapters extract raw variables (slug, heading, body, path, frontmatter.*, etc.), and you map them to fields using {{ template }} syntax:

```yaml
fields:
  id: "{{ slug }}-{{ chunk_index }}"
  title: "{{ chunk_heading }}"
  content: "{{ chunk_body }}"
  excerpt: "{{ body | truncate(200) }}"
  url: "{{ path }}#{{ chunk_heading | slugify }}"
```

Filters like truncate, slugify, lower, upper, and strip_md can be chained with pipes. This means you're not locked into someone else's schema — your search index looks exactly the way your frontend expects.
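To make the pipe syntax concrete, here is a toy renderer for the `{{ var | filter(arg) }}` templates shown above. This is a sketch of the idea, not content-mill's actual template engine, and the filter implementations are simplified guesses at the behavior their names imply.

```typescript
// Toy re-implementation of {{ var | filter(arg) }} templating -- a sketch,
// not content-mill's actual engine.
type Filter = (input: string, arg?: string) => string;

const filters: Record<string, Filter> = {
  lower: (s) => s.toLowerCase(),
  upper: (s) => s.toUpperCase(),
  truncate: (s, n) => s.slice(0, Number(n ?? 100)),
  slugify: (s) =>
    s.toLowerCase().trim().replace(/[^a-z0-9]+/g, "-").replace(/^-|-$/g, ""),
};

function render(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{\s*(.+?)\s*\}\}/g, (_, expr: string) => {
    // First segment is the variable name, the rest are chained filters.
    const [name, ...pipes] = expr.split("|").map((p) => p.trim());
    let value = vars[name] ?? "";
    for (const pipe of pipes) {
      const m = pipe.match(/^(\w+)(?:\((.*?)\))?$/);
      if (m && filters[m[1]]) value = filters[m[1]](value, m[2]);
    }
    return value;
  });
}
```

With this, `render("{{ path }}#{{ heading | slugify }}", { path: "/docs/intro", heading: "Getting Started" })` yields a section-anchored URL.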

Chunking for Granular Results

Whole-page results are often too broad for docs search. content-mill can split pages by heading level:

```yaml
chunking:
  strategy: heading
  level: 2
```

This turns one long page into multiple documents — one per ## section — each with its own chunk_heading, chunk_body, and chunk_index. Your search results can now link directly to the relevant section instead of dumping users at the top of a page.
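The heading-split strategy can be sketched in a few lines. This is an assumed, simplified version of the behavior described above (content before the first matching heading is skipped here; the real implementation may handle it differently):

```typescript
// Sketch of heading-level chunking: split a markdown page at headings of
// the configured level, one chunk per section. Illustrative, not
// content-mill's actual code.
interface Chunk {
  chunk_index: number;
  chunk_heading: string;
  chunk_body: string;
}

function chunkByHeading(markdown: string, level = 2): Chunk[] {
  const marker = "#".repeat(level) + " "; // e.g. "## " for level 2
  const chunks: Chunk[] = [];
  let current: Chunk | null = null;

  for (const line of markdown.split("\n")) {
    if (line.startsWith(marker)) {
      // Start a new chunk at each matching heading.
      current = {
        chunk_index: chunks.length,
        chunk_heading: line.slice(marker.length).trim(),
        chunk_body: "",
      };
      chunks.push(current);
    } else if (current) {
      current.chunk_body += line + "\n";
    }
  }
  return chunks;
}
```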

Zero-Downtime Re-indexing

Every indexing run uses Meilisearch's index swap:

  1. Documents go into a temp index (docs_tmp)
  2. Atomic swap with the live index (docs)
  3. Old index gets cleaned up

If something fails midway, your live index is untouched. No maintenance window needed.
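The three steps above map onto the Meilisearch client's index operations. Here is the flow sketched against a tiny in-memory stand-in for the client, so the swap semantics are visible without a running server (the real JS client exposes similarly named `addDocuments`, `swapIndexes`, and `deleteIndex` operations, though their exact signatures differ and they return async tasks):

```typescript
// Minimal in-memory stand-in for a Meilisearch client, just enough to
// demonstrate the atomic-swap re-indexing flow. Not the real client API.
class FakeMeili {
  indexes = new Map<string, object[]>();

  addDocuments(index: string, docs: object[]): void {
    this.indexes.set(index, [...(this.indexes.get(index) ?? []), ...docs]);
  }

  swapIndexes(a: string, b: string): void {
    const docsA = this.indexes.get(a) ?? [];
    const docsB = this.indexes.get(b) ?? [];
    this.indexes.set(a, docsB);
    this.indexes.set(b, docsA);
  }

  deleteIndex(index: string): void {
    this.indexes.delete(index);
  }
}

function reindexAtomically(client: FakeMeili, index: string, docs: object[]): void {
  const tmp = `${index}_tmp`;
  client.addDocuments(tmp, docs); // 1. build the temp index
  client.swapIndexes(index, tmp); // 2. atomic swap with the live index
  client.deleteIndex(tmp);        // 3. clean up the old documents
}
```

Until step 2 runs, searches only ever hit the old live index; if step 1 throws, the temp index is all that's dirty.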

CI/CD in Two Lines

```yaml
# GitHub Actions
- name: Index docs
  env:
    MEILI_MASTER_KEY: ${{ secrets.MEILI_MASTER_KEY }}
  run: npx @centrali-io/content-mill index --config content-mill.yml
```

Hook this into your release pipeline and your search index stays in sync with every deploy.

Use as a Library

Don't need the CLI? Import it directly:

```typescript
import { loadConfig, indexAll } from '@centrali-io/content-mill';

const config = loadConfig('./content-mill.yml');
await indexAll(config, { dryRun: false });
```

Or build the config object in code if you prefer programmatic control.
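For the programmatic route, a plain object mirroring the YAML from earlier would look like this. The exact schema lives in the README; this shape is inferred from the YAML example above, so treat it as a sketch rather than the authoritative type:

```typescript
// Config object mirroring the earlier YAML example -- shape inferred from
// that example, not from content-mill's published types.
const config = {
  meili: {
    host: "http://localhost:7700",
    apiKey: process.env.MEILI_MASTER_KEY,
  },
  sources: [
    {
      name: "docs",
      type: "mkdocs",
      config: "./mkdocs.yml",
      index: "docs",
      document: {
        primaryKey: "id",
        fields: {
          id: "{{ slug }}",
          title: "{{ heading }}",
          content: "{{ body }}",
        },
        searchableAttributes: ["title", "content"],
        filterableAttributes: ["section", "type"],
      },
    },
  ],
};
```

You would then hand this object to `indexAll` in place of the result of `loadConfig`.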

Getting Started

```shell
npm install @centrali-io/content-mill
```
  1. Create a content-mill.yml with your Meilisearch connection and source definitions
  2. Run with --dry-run first to preview the extracted documents
  3. Run for real and check your Meilisearch dashboard

The full config reference and source type examples are in the README on GitHub.


content-mill is MIT licensed and open source. If you're using Meilisearch and have static content to index, I'd love to hear how it works for your use case. Issues and PRs welcome on GitHub.


Tags: #meilisearch #search #typescript #opensource
