Alex Bogle

Posted on Jun 10

I Contributed to Chroma's Open-Source Docs — Here's What I Changed and What I Learned

#ai #database #opensource #rag

Chroma is one of the most widely used open-source vector databases in the AI ecosystem. It's the backbone of countless RAG pipelines, semantic search systems, and AI agent memory layers — including projects I've built myself. So when I found issue #3111 on their GitHub — a request to update their Haystack integration documentation — I decided to take it on.

The issue was straightforward: the existing docs were outdated, lacked clear installation steps, and had almost no practical examples for ChromaDocumentStore. But when I looked closer, I found two already-open PRs. That's where it got interesting.

The Existing PRs

Before writing anything, I checked what others had already done. There were two open PRs for this exact issue:

PR #3112: Deleted the entire documentation file. The contributor decided the best way to "update" the docs was to remove them entirely. This doesn't help users — it just creates a dead link.
PR #7229 (manikanta7cheruku): Fixed a Path bug in the code example and swapped the LLM from HuggingFace (Mixtral) to OpenAI (GPT-3.5-turbo). The Path fix was good. The LLM swap was questionable — it replaced a free, self-hosted model with one that requires a paid API key, which is a worse default for open-source documentation.

Neither PR actually addressed the core request: comprehensive, clear examples for ChromaDocumentStore. So I wrote my own.

What I Changed

I rewrote docs/mintlify/integrations/frameworks/haystack.mdx from 80 lines to 187 lines. Here's what was added:

Clearer Installation

The original docs only had pip install chroma-haystack. I added haystack-ai as an explicit dependency since the integration package doesn't always pull it in correctly.

Quick Start Example

A minimal 10-line example showing Document creation, writing, and counting — the fastest way to verify your setup works:

from haystack import Document
from haystack_integrations.document_stores.chroma import ChromaDocumentStore

document_store = ChromaDocumentStore()

docs = [
    Document(content="Chroma stores documents as vectors."),
    Document(content="Haystack provides the orchestration layer."),
]
document_store.write_documents(docs)
print(document_store.count_documents())  # Output: 2

All Three Deployment Modes

The original docs only showed in-memory mode. I documented all three: in-memory (development), persistent storage (disk-backed), and remote client-server (production with Docker or chroma run).

Fixed the Path Bug

The original code had "data" / Path(name) — a string divided by a Path object, which throws a TypeError. Fixed to Path("data") / name.

Retrieval Examples

Added three retrieval patterns: text query (auto-embedded by Chroma), embedding query (pre-computed vectors), and metadata filtering using Haystack's filter syntax.

Full RAG Pipeline

A complete end-to-end RAG example connecting ChromaDocumentStore → ChromaQueryRetriever → PromptBuilder → HuggingFaceTGIGenerator. I kept HuggingFace (free, self-hosted) instead of swapping to OpenAI.

Troubleshooting Section

Common issues: import errors, empty results, remote connection refused, HuggingFace token errors. Each with a one-line fix.

What I Learned

1. Check Existing PRs First

Two people had already worked on this issue before me. One deleted the file. One did minimal fixes. If I hadn't checked, I might have duplicated effort or missed context. Always review what's already in flight.

2. Import Paths Change

The old docs used from chroma_haystack import ChromaDocumentStore — that's the archived package. The current one is from haystack_integrations.document_stores.chroma import ChromaDocumentStore. Same class, different package. This is exactly the kind of thing that makes docs go stale.

3. You Can't Push Directly to Upstream

I tried git push origin my-branch and got a 403. You have to fork the repo first, push to your fork, then create the PR with gh pr create --repo chroma-core/chroma --head SaintChris:my-branch. Simple, but not obvious if you haven't contributed to a large project before.

4. Docs Are Code

A broken code example in documentation is the same as a broken function — it wastes hours of developer time. The Path bug in the original docs would have caused a TypeError for anyone who copied the example. Fixing that one line probably saves more developer hours than most code contributions.

The PR

PR #7230 is now open on the chroma-core/chroma repository. It's documentation-only — no code changes. The branch is at SaintChris/chroma.

Whether it gets merged or not, the process itself was the value. I now understand how Chroma's deployment modes work at a deeper level, I've read their documentation pipeline, and I have a contribution on one of the most widely-used vector databases in the AI ecosystem.

Why This Matters

Open-source documentation is infrastructure. When it's wrong, every developer who copies that example hits a wall. When it's incomplete, they spend hours reading source code to figure out what the docs should have said.

Contributing to docs isn't glamorous. But for a project like Chroma — which thousands of developers depend on for RAG pipelines, agent memory, and semantic search — a clear, comprehensive integration guide is worth more than most feature PRs.

DEV Community