Steven Leggett

Setting Up CocoIndex with Docker and pgvector - A Practical Guide


CocoIndex is a data transformation framework for AI that handles indexing with incremental processing. It uses a Rust engine with Python bindings, which means it's fast, but the setup has a few gotchas that aren't obvious from the docs. The project is open source on GitHub.

I spent an afternoon getting it running locally and hit every sharp edge so you don't have to. Here's what actually works.

What You'll Build

A pipeline that reads markdown files, chunks them, generates vector embeddings using sentence-transformers, and stores them in PostgreSQL with pgvector for semantic similarity search.

Prerequisites

  • Python 3.11 to 3.13 (officially supported - 3.14 works but isn't listed yet)
  • Docker
  • About 10 minutes

Step 1: PostgreSQL with pgvector (not plain Postgres)

This is the first thing that will bite you. CocoIndex requires the vector extension for HNSW indexes. Plain postgres:16 or postgres:17 will fail with extension "vector" is not available.

CocoIndex provides a docker compose config you can use directly:

docker compose -f <(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/postgres.yaml) up -d

Or run the container manually:

docker run -d --name cocoindex-postgres \
  -e POSTGRES_USER=cocoindex \
  -e POSTGRES_PASSWORD=cocoindex \
  -e POSTGRES_DB=cocoindex \
  -p 5432:5432 \
  pgvector/pgvector:pg17

If port 5432 is already in use, pick a different host port (e.g., -p 5450:5432) and adjust your connection string.

Port tip: Before picking a port, check nothing else is listening there. SSH tunnels can silently bind to the same port as Docker, causing misleading "password authentication failed" errors even when your credentials are correct. Verify with:

lsof -i :5432

You should only see Docker's com.docker process, not SSH or anything else.
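If you'd rather check from Python than eyeball lsof output, a small stdlib probe works too. This is my own sketch, not part of CocoIndex:

```python
# Probe whether anything is listening on a TCP port (stdlib only).
# Returns True if a connection succeeds, i.e. the port is taken.
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        return sock.connect_ex((host, port)) == 0

print(port_in_use(5432))
```

This only tells you the port is taken, not by what, so lsof is still the tool for confirming whether it's Docker or an SSH tunnel on the other end.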

Step 2: Python Environment

mkdir cocoindex-quickstart && cd cocoindex-quickstart
python3 -m venv .venv
source .venv/bin/activate
pip install -U 'cocoindex[embeddings]'

The [embeddings] extra pulls in sentence-transformers and torch. It's a big download but gives you local embeddings with no API key needed.

Step 3: Configure the Database Connection

Create a .env file in your project root. CocoIndex reads it automatically via python-dotenv:

echo 'COCOINDEX_DATABASE_URL=postgresql://cocoindex:cocoindex@localhost:5432/cocoindex' > .env

Adjust the port if you mapped to something other than 5432 in Step 1.
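If you change the port, or your password contains special characters, it's easy to hand-build a malformed URL. A small helper (my own, not a CocoIndex API) that URL-encodes the credentials:

```python
# Build a Postgres connection URL, URL-encoding user and password so
# characters like '@' and ':' survive. Helper is illustrative only.
from urllib.parse import quote

def pg_url(user: str, password: str, host: str, port: int, db: str) -> str:
    return f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}/{db}"

print(pg_url("cocoindex", "cocoindex", "localhost", 5450, "cocoindex"))
# → postgresql://cocoindex:cocoindex@localhost:5450/cocoindex
```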

Step 4: Write the Pipeline

Create main.py:

import cocoindex

@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="markdown_files"))

    doc_embeddings = data_scope.add_collector()

    with data_scope["documents"].row() as doc:
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)

        with doc["chunks"].row() as chunk:
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"
                )
            )
            doc_embeddings.collect(
                filename=doc["filename"],
                location=chunk["location"],
                text=chunk["text"],
                embedding=chunk["embedding"],
            )

    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
            )
        ],
    )
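The chunk_size=2000 / chunk_overlap=500 parameters are worth understanding before you tune them. SplitRecursively is structure-aware (it tries to split on markdown boundaries), so the naive fixed-window splitter below is only an illustration of what the two numbers mean, not CocoIndex's actual algorithm:

```python
# Illustration only: each window holds up to chunk_size characters and
# starts chunk_size - chunk_overlap characters after the previous one.
# CocoIndex's SplitRecursively is smarter and respects document structure.
def naive_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

print(len(naive_chunks("x" * 5000, chunk_size=2000, chunk_overlap=500)))  # → 4
```

The overlap means each boundary appears in two chunks, so a sentence cut in half at one boundary is still intact in the neighboring chunk.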

Step 5: Add Some Content

mkdir markdown_files

Drop some markdown files in there. For testing, even a couple of files work. The pipeline will chunk them, embed each chunk, and store the vectors.

Step 6: Run the Indexer

cocoindex update main.py

It will show you the tables it needs to create and ask for confirmation. Type yes.

You'll see it load the sentence-transformers model (first run downloads it from HuggingFace), create the pgvector extension, build the HNSW index, and process your files. Output looks like:

TextEmbedding.documents (batch update): 2/2 source rows: 2 added

Step 7: Query with Semantic Search

Install psycopg2 and create a simple query script:

pip install psycopg2-binary
# query.py
import sys
from sentence_transformers import SentenceTransformer
import psycopg2

DB_URL = "postgresql://cocoindex:cocoindex@localhost:5432/cocoindex"

def search(query: str, top_k: int = 3):
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    embedding = model.encode(query)
    vec_str = "[" + ",".join(str(x) for x in embedding) + "]"

    conn = psycopg2.connect(DB_URL)
    cur = conn.cursor()
    cur.execute("""
        SELECT filename, left(text, 200),
               1 - (embedding <=> %s::vector) as similarity
        FROM textembedding__doc_embeddings
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """, (vec_str, vec_str, top_k))

    results = cur.fetchall()
    cur.close()
    conn.close()
    return results

if __name__ == "__main__":
    query = " ".join(sys.argv[1:]) or "What is incremental processing?"
    print(f"\nQuery: {query}\n")
    for filename, text, score in search(query):
        print(f"Score: {score:.4f} | {filename}")
        print(f"  {text.strip()[:150]}...\n")
Run it:

python query.py "which embedding models are popular?"
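One detail worth calling out: <=> in the SQL above is pgvector's cosine distance operator, which is why the script computes 1 - distance to get a similarity. The same relationship in plain Python:

```python
# pgvector's <=> returns cosine distance; similarity = 1 - distance.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim = cosine_similarity([1.0, 0.0], [1.0, 1.0])
print(round(sim, 4))      # → 0.7071
print(round(1 - sim, 4))  # cosine distance, what <=> would report
```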

Things That Tripped Me Up

Use pgvector/pgvector, not postgres. This is the number one issue. The plain Postgres Docker image doesn't include the vector extension. You need pgvector/pgvector:pg17 (or pg16). CocoIndex will fail at table creation without it.

Table naming is lowercase. Your flow is named TextEmbedding but the table is textembedding__doc_embeddings. CocoIndex lowercases the flow name. Keep this in mind when writing direct SQL queries.
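In other words, the table name is the flow name lowercased, a double underscore, then the export name. A throwaway helper to compute it (mine, not a CocoIndex API, and it assumes the export name is already lowercase as in this guide):

```python
# Naming rule observed above: lowercased flow name + "__" + export name.
def export_table_name(flow_name: str, export_name: str) -> str:
    return f"{flow_name.lower()}__{export_name}"

print(export_table_name("TextEmbedding", "doc_embeddings"))
# → textembedding__doc_embeddings
```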

The old main_fn API is gone. If you see examples using cocoindex.main_fn(), that's outdated. The current API (v0.3.36+) uses the cocoindex CLI command directly.

Docker volume persistence. If you change Postgres env vars (user/password) but reuse the container volume, the old credentials persist. Use docker rm -v to remove the volume when recreating.

The .env file wins. CocoIndex loads .env automatically via python-dotenv. If you set COCOINDEX_DATABASE_URL in your shell but have a different value in .env, the file takes precedence. This caught me when debugging connection issues.
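To make that precedence concrete, here's a stdlib sketch of the behavior I observed. Note that python-dotenv's plain load_dotenv() actually defaults to override=False, so a file winning over the shell corresponds to load_dotenv(override=True); the parsing here is deliberately simplistic (no export keyword, no multiline values):

```python
# Sketch of ".env wins over the shell": parse KEY=VALUE lines and
# overwrite os.environ, mirroring load_dotenv(override=True).
import os

def load_env_file(path: str = ".env") -> dict:
    loaded = {}
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            key, value = key.strip(), value.strip().strip("'\"")
            loaded[key] = value
            os.environ[key] = value  # file value overwrites the shell value
    return loaded
```

If you want your shell export to win while debugging, rename .env temporarily and pass the URL explicitly.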

Port conflicts with SSH tunnels. If you're forwarding database ports over SSH (common with remote dev setups), an SSH tunnel can bind to the same port as Docker. The connection goes to the wrong Postgres, and you get auth failures that look like a password problem. Always verify the port with lsof.

Using CocoIndex with Claude Code

If you're using Claude Code, there are a couple of integrations worth knowing about.

Claude Code Skill

CocoIndex provides an official Claude Code skill that gives Claude built-in knowledge about CocoIndex's API, so it can help you write pipelines, create custom functions, and run CLI commands correctly. This would have saved me from hitting the deprecated main_fn API issue.

Install it from within Claude Code:

/plugin marketplace add cocoindex-io/cocoindex-claude
/plugin install cocoindex-skills@cocoindex

Once installed, Claude Code understands CocoIndex's current API and can generate correct pipeline code without relying on outdated examples.

MCP for Code Search

cocoindex-code is a lightweight MCP server for semantic code search:

pipx install cocoindex-code
claude mcp add cocoindex-code -- ccc mcp

It uses SQLite locally and runs its own embeddings - no Postgres required. It's a separate tool from the main CocoIndex library, focused on searching your codebase.

MCP for Postgres-backed Indexes

There is no official MCP server for the Postgres-backed pipeline we built in this guide. The main cocoindex library has a built-in HTTP server (cocoindex server main.py) that exposes REST APIs, but it uses a proprietary protocol for their CocoInsight UI, not the MCP standard.

If you need MCP access to your pgvector index, you'd need to write a thin wrapper. The query.py script above is essentially all the logic you need - wrap it in an MCP server and you're there. That's a good project for a follow-up post.


The full pipeline takes about 10 minutes to set up once you know the gotchas. The incremental processing means subsequent runs only reprocess changed files, which is where CocoIndex really shines over rebuilding indexes from scratch.

Top comments (1)

klement Gunndu

The SSH tunnel port collision gotcha alone saved me debugging time — I've wasted hours on that exact "password auth failed" red herring with pgvector setups.