Setting Up CocoIndex with Docker and pgvector - A Practical Guide
CocoIndex is a data transformation framework for AI that handles indexing with incremental processing. It uses a Rust engine with Python bindings, which means it's fast, but the setup has a few gotchas that aren't obvious from the docs. The project is open source on GitHub.
I spent an afternoon getting it running locally and hit every sharp edge so you don't have to. Here's what actually works.
What You'll Build
A pipeline that reads markdown files, chunks them, generates vector embeddings using sentence-transformers, and stores them in PostgreSQL with pgvector for semantic similarity search.
Prerequisites
- Python 3.11 to 3.13 (officially supported - 3.14 works but isn't listed yet)
- Docker
- About 10 minutes
Step 1: PostgreSQL with pgvector (not plain Postgres)
This is the first thing that will bite you. CocoIndex requires the vector extension for HNSW indexes. Plain postgres:16 or postgres:17 will fail with extension "vector" is not available.
CocoIndex provides a docker compose config you can use directly:
docker compose -f <(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/postgres.yaml) up -d
Or run the container manually:
docker run -d --name cocoindex-postgres \
-e POSTGRES_USER=cocoindex \
-e POSTGRES_PASSWORD=cocoindex \
-e POSTGRES_DB=cocoindex \
-p 5432:5432 \
pgvector/pgvector:pg17
If port 5432 is already in use, pick a different host port (e.g., -p 5450:5432) and adjust your connection string.
Port tip: Before picking a port, check nothing else is listening there. SSH tunnels can silently bind to the same port as Docker, causing misleading "password authentication failed" errors even when your credentials are correct. Verify with:
lsof -i :5432
You should only see Docker's com.docker process, not SSH or anything else.
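If lsof isn't available (or you want to script the check), the Python stdlib can do the same probe. This is a minimal sketch: it just attempts a TCP connect, so it tells you *whether* something is listening on the port, not *what* is listening.

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) == 0

print(port_in_use(5432))  # True if anything (Docker, SSH tunnel, ...) holds 5432
```

If it returns True but you expected the port to be free, fall back to lsof to find out who owns it.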
Step 2: Python Environment
mkdir cocoindex-quickstart && cd cocoindex-quickstart
python3 -m venv .venv
source .venv/bin/activate
pip install -U 'cocoindex[embeddings]'
The [embeddings] extra pulls in sentence-transformers and torch. It's a big download but gives you local embeddings with no API key needed.
Step 3: Configure the Database Connection
Create a .env file in your project root. CocoIndex reads it automatically via python-dotenv:
echo 'COCOINDEX_DATABASE_URL=postgresql://cocoindex:cocoindex@localhost:5432/cocoindex' > .env
Adjust the port if you mapped to something other than 5432 in Step 1.
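A quick way to sanity-check that the URL in .env points where you think it does is to parse it with the stdlib (example URL below uses the alternate port 5450 from Step 1 as an assumption):

```python
from urllib.parse import urlsplit

url = "postgresql://cocoindex:cocoindex@localhost:5450/cocoindex"
parts = urlsplit(url)

# urlsplit understands the user:password@host:port/dbname shape
host, port, dbname = parts.hostname, parts.port, parts.path.lstrip("/")
print(host, port, dbname)  # localhost 5450 cocoindex
```

If the parsed port doesn't match the host port you mapped in docker run, that's your connection problem right there.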
Step 4: Write the Pipeline
Create main.py:
import cocoindex


@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="markdown_files"))

    doc_embeddings = data_scope.add_collector()

    with data_scope["documents"].row() as doc:
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)

        with doc["chunks"].row() as chunk:
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"
                )
            )
            doc_embeddings.collect(
                filename=doc["filename"],
                location=chunk["location"],
                text=chunk["text"],
                embedding=chunk["embedding"],
            )

    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
            )
        ],
    )
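If the chunk_size/chunk_overlap parameters feel abstract, here's a naive pure-Python sketch of what overlapping chunking means. This is *not* CocoIndex's SplitRecursively (which respects markdown structure rather than cutting at fixed offsets), just an illustration of how the two parameters interact:

```python
def naive_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size sliding window: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so adjacent chunks share chunk_overlap
    characters of context."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

print(naive_chunks("abcdefghij", 4, 1))  # ['abcd', 'defg', 'ghij']
```

The overlap (500 characters in the pipeline above) exists so a sentence that straddles a chunk boundary still appears whole in at least one chunk, which helps retrieval quality.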
Step 5: Add Some Content
mkdir markdown_files
Drop some markdown files in there. For testing, even a couple of files work. The pipeline will chunk them, embed each chunk, and store the vectors.
Step 6: Run the Indexer
cocoindex update main.py
It will show you the tables it needs to create and ask for confirmation. Type yes.
You'll see it load the sentence-transformers model (first run downloads it from HuggingFace), create the pgvector extension, build the HNSW index, and process your files. Output looks like:
TextEmbedding.documents (batch update): 2/2 source rows: 2 added
Step 7: Query with Semantic Search
Install psycopg2 and create a simple query script:
pip install psycopg2-binary
# query.py
import sys

import psycopg2
from sentence_transformers import SentenceTransformer

DB_URL = "postgresql://cocoindex:cocoindex@localhost:5432/cocoindex"


def search(query: str, top_k: int = 3):
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    embedding = model.encode(query)
    # pgvector accepts a vector literal like "[0.1,0.2,...]"
    vec_str = "[" + ",".join(str(x) for x in embedding) + "]"
    conn = psycopg2.connect(DB_URL)
    cur = conn.cursor()
    cur.execute("""
        SELECT filename, left(text, 200),
               1 - (embedding <=> %s::vector) AS similarity
        FROM textembedding__doc_embeddings
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """, (vec_str, vec_str, top_k))
    results = cur.fetchall()
    cur.close()
    conn.close()
    return results


if __name__ == "__main__":
    query = " ".join(sys.argv[1:]) or "What is incremental processing?"
    print(f"\nQuery: {query}\n")
    for filename, text, score in search(query):
        print(f"Score: {score:.4f} | {filename}")
        print(f"  {text.strip()[:150]}...\n")
python query.py "which embedding models are popular?"
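A note on the math in that SQL: pgvector's <=> operator is cosine *distance*, so 1 - (a <=> b) is cosine *similarity* (1.0 for identical directions, 0.0 for orthogonal vectors). In pure Python:

```python
import math

def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between the vectors.
    This mirrors what pgvector's <=> operator computes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

# A 45-degree angle between vectors gives similarity cos(45°) ≈ 0.7071
similarity = 1 - cosine_distance([1.0, 0.0], [1.0, 1.0])
print(round(similarity, 4))  # 0.7071
```

That's why the ORDER BY sorts ascending on distance but the displayed score is similarity: closest match first, highest score first.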
Things That Tripped Me Up
Use pgvector/pgvector, not postgres. This is the number one issue. The plain Postgres Docker image doesn't include the vector extension. You need pgvector/pgvector:pg17 (or pg16). CocoIndex will fail at table creation without it.
Table naming is lowercase. Your flow is named TextEmbedding but the table is textembedding__doc_embeddings. CocoIndex lowercases the flow name. Keep this in mind when writing direct SQL queries.
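The convention, as observed in this setup (I haven't checked whether CocoIndex documents it as a guarantee), is lowercased flow name, double underscore, export name:

```python
def cocoindex_table_name(flow_name: str, export_name: str) -> str:
    """Reproduce the table-naming convention observed in this guide:
    lowercase(flow name) + "__" + export target name."""
    return f"{flow_name.lower()}__{export_name}"

print(cocoindex_table_name("TextEmbedding", "doc_embeddings"))
# textembedding__doc_embeddings
```

Handy when you're generating SQL against several flows and don't want to hardcode table names.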
The old main_fn API is gone. If you see examples using cocoindex.main_fn(), that's outdated. The current API (v0.3.36+) uses the cocoindex CLI command directly.
Docker volume persistence. If you change Postgres env vars (user/password) but reuse the container volume, the old credentials persist. Use docker rm -v to remove the volume when recreating.
The .env file wins. CocoIndex loads .env automatically via python-dotenv. If you set COCOINDEX_DATABASE_URL in your shell but have a different value in .env, the file takes precedence. This caught me when debugging connection issues.
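To make the precedence concrete, here's a toy loader with the file-wins semantics I observed (an illustration only — CocoIndex actually uses python-dotenv, whose behavior depends on how it's invoked):

```python
def load_env_file(path: str, environ: dict) -> None:
    """Toy .env loader with file-wins precedence: values from the file
    overwrite anything already present in `environ`."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                environ[key.strip()] = value.strip()
```

The practical takeaway: when a connection string mysteriously "won't change", check .env before blaming your shell exports.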
Port conflicts with SSH tunnels. If you're forwarding database ports over SSH (common with remote dev setups), an SSH tunnel can bind to the same port as Docker. The connection goes to the wrong Postgres, and you get auth failures that look like a password problem. Always verify the port with lsof.
Using CocoIndex with Claude Code
If you're using Claude Code, there are a couple of integrations worth knowing about.
Claude Code Skill
CocoIndex provides an official Claude Code skill that gives Claude built-in knowledge about CocoIndex's API, so it can help you write pipelines, create custom functions, and run CLI commands correctly. This would have saved me from hitting the deprecated main_fn API issue.
Install it from within Claude Code:
/plugin marketplace add cocoindex-io/cocoindex-claude
/plugin install cocoindex-skills@cocoindex
Once installed, Claude Code understands CocoIndex's current API and can generate correct pipeline code without relying on outdated examples.
MCP for Code Search
cocoindex-code is a lightweight MCP server for semantic code search:
pipx install cocoindex-code
claude mcp add cocoindex-code -- ccc mcp
It uses SQLite locally and runs its own embeddings - no Postgres required. It's a separate tool from the main CocoIndex library, focused on searching your codebase.
MCP for Postgres-backed Indexes
There is no official MCP server for the Postgres-backed pipeline we built in this guide. The main cocoindex library has a built-in HTTP server (cocoindex server main.py) that exposes REST APIs, but it uses a proprietary protocol for their CocoInsight UI, not the MCP standard.
If you need MCP access to your pgvector index, you'd need to write a thin wrapper. The query.py script above is essentially all the logic you need - wrap it in an MCP server and you're there. That's a good project for a follow-up post.
The full pipeline takes about 10 minutes to set up once you know the gotchas. The incremental processing means subsequent runs only reprocess changed files, which is where CocoIndex really shines over rebuilding indexes from scratch.