TL;DR
If you have a markdown wiki and want to embed it for RAG, wiki42 does the chunking right: one chunk per page, frontmatter as metadata, [[wikilinks]] resolved, multilingual E5 embeddings.
pip install wiki42
from wiki42 import compile_wiki
chunks = compile_wiki("./my-wiki", backend="cloud") # or "local"
# → list of dicts ready for Pinecone, Chroma, Qdrant, FAISS, ...
Why one more chunker
Generic chunkers split on token count. Markdown wikis already have semantic units — pages. Splitting them in the middle breaks retrieval.
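To make the difference concrete, here is a minimal sketch that is not part of wiki42. It uses word count as a rough stand-in for token count, and the helper name `split_by_word_count` is hypothetical. It shows a fixed-size splitter cutting a short page mid-sentence, while a page-level chunker keeps it whole.

```python
def split_by_word_count(text: str, max_words: int) -> list[str]:
    """Naive fixed-size splitter: cuts every max_words words, ignoring structure."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


page = "Deploy guide. Run the migration first. Then restart the workers."

# Fixed-size splitting separates "Run the" from "migration first."
fixed_chunks = split_by_word_count(page, max_words=4)

# Page-level chunking keeps the whole page as one retrievable unit.
page_chunks = [page]

for chunk in fixed_chunks:
    print(repr(chunk))
```

With the fixed-size splitter, a query about the migration step can match a chunk that no longer says which guide it belongs to.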
wiki42:
- treats 1 page = 1 chunk (whatever its length)
- parses YAML frontmatter as searchable metadata
- resolves [[wikilinks]] as crossref for graph queries
- generates multilingual E5 embeddings out of the box
It's a drop-in, markdown-aware replacement for Pinecone's server-side embedding.
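The page-level approach in the list above can be sketched with the standard library alone. This is an illustration of the idea, not wiki42's actual implementation: the function `page_to_chunk`, the output keys (`id`, `text`, `metadata`, `links`), and the simplified frontmatter handling (flat `key: value` lines only, no nested YAML) are all assumptions made for the example. Embedding generation is left out.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#section]]; captures the target only.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def page_to_chunk(page_id: str, raw: str) -> dict:
    """Turn one markdown page into one chunk (hypothetical output shape)."""
    metadata = {}
    body = raw
    # Frontmatter is the block between a leading '---' line and the next '---' line.
    if raw.startswith("---\n"):
        header, sep, rest = raw[4:].partition("\n---\n")
        if sep:
            body = rest
            # Simplification: only flat "key: value" pairs, not full YAML.
            for line in header.splitlines():
                key, colon, value = line.partition(":")
                if colon and key.strip():
                    metadata[key.strip()] = value.strip()
    # Deduplicated link targets, usable as edges between pages.
    links = sorted({m.group(1).strip() for m in WIKILINK.finditer(body)})
    return {"id": page_id, "text": body.strip(), "metadata": metadata, "links": links}


raw_page = (
    "---\n"
    "title: Deploy guide\n"
    "lang: en\n"
    "---\n"
    "Run [[Migrations]] first, then see [[Workers|the worker docs]].\n"
)
chunk = page_to_chunk("deploy-guide", raw_page)
print(chunk["metadata"])
print(chunk["links"])
```

Keeping link targets as a separate field is what makes graph-style queries possible later: a retriever can fetch a page and then follow its `links` to neighboring pages.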
Why we open-sourced
Built at 42rows S.r.l. as the chunker behind our RAG products. We isolated and open-sourced the core because nobody else was solving wiki chunking properly.
- GitHub: 42ROWS/wiki42 — MIT
- Homepage: 42rows.com/guides/wiki42
Feedback and PRs welcome.