Setting Up CocoIndex with Docker and pgvector - A Practical Guide
CocoIndex is a data transformation framework for AI that handles indexing with incremental processing. It uses a Rust engine with Python bindings, which means it's fast, but the setup has a few gotchas that aren't obvious from the docs. The project is open source on GitHub.
I spent an afternoon getting it running locally and hit every sharp edge so you don't have to. Here's what actually works.
What You'll Build
A pipeline that reads markdown files, chunks them, generates vector embeddings using sentence-transformers, and stores them in PostgreSQL with pgvector for semantic similarity search.
Prerequisites
- Python 3.11 to 3.13 (officially supported - 3.14 works but isn't listed yet)
- Docker
- About 10 minutes
Step 1: PostgreSQL with pgvector (not plain Postgres)
This is the first thing that will bite you. CocoIndex requires the vector extension for HNSW indexes. Plain postgres:16 or postgres:17 will fail with extension "vector" is not available.
CocoIndex provides a docker compose config you can use directly:
docker compose -f <(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/postgres.yaml) up -d
Or run the container manually:
docker run -d --name cocoindex-postgres \
-e POSTGRES_USER=cocoindex \
-e POSTGRES_PASSWORD=cocoindex \
-e POSTGRES_DB=cocoindex \
-p 5432:5432 \
pgvector/pgvector:pg17
If port 5432 is already in use, pick a different host port (e.g., -p 5450:5432) and adjust your connection string.
Port tip: Before picking a port, check nothing else is listening there. SSH tunnels can silently bind to the same port as Docker, causing misleading "password authentication failed" errors even when your credentials are correct. Verify with:
lsof -i :5432
You should only see Docker's com.docker process, not SSH or anything else.
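If lsof isn't available (or you want to script the check), the Python stdlib can do the same probe. This is a minimal sketch: it just attempts a TCP connect, so it tells you *whether* something is listening on the port, not *what* is listening.

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) == 0

print(port_in_use(5432))  # True if anything (Docker, SSH tunnel, ...) holds 5432
```

If it returns True but you expected the port to be free, fall back to lsof to find out who owns it.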
Step 2: Python Environment
mkdir cocoindex-quickstart && cd cocoindex-quickstart
python3 -m venv .venv
source .venv/bin/activate
pip install -U 'cocoindex[embeddings]'
The [embeddings] extra pulls in sentence-transformers and torch. It's a big download but gives you local embeddings with no API key needed.
Step 3: Configure the Database Connection
Create a .env file in your project root. CocoIndex reads it automatically via python-dotenv:
echo 'COCOINDEX_DATABASE_URL=postgresql://cocoindex:cocoindex@localhost:5432/cocoindex' > .env
Adjust the port if you mapped to something other than 5432 in Step 1.
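A quick way to sanity-check that the URL in .env points where you think it does is to parse it with the stdlib (example URL below uses the alternate port 5450 from Step 1 as an assumption):

```python
from urllib.parse import urlsplit

url = "postgresql://cocoindex:cocoindex@localhost:5450/cocoindex"
parts = urlsplit(url)

# urlsplit understands the user:password@host:port/dbname shape
host, port, dbname = parts.hostname, parts.port, parts.path.lstrip("/")
print(host, port, dbname)  # localhost 5450 cocoindex
```

If the parsed port doesn't match the host port you mapped in docker run, that's your connection problem right there.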
Step 4: Write the Pipeline
Create main.py:
import cocoindex


@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="markdown_files"))

    doc_embeddings = data_scope.add_collector()

    with data_scope["documents"].row() as doc:
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)

        with doc["chunks"].row() as chunk:
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"
                )
            )
            doc_embeddings.collect(
                filename=doc["filename"],
                location=chunk["location"],
                text=chunk["text"],
                embedding=chunk["embedding"],
            )

    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
            )
        ],
    )
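If the chunk_size/chunk_overlap parameters feel abstract, here's a naive pure-Python sketch of what overlapping chunking means. This is *not* CocoIndex's SplitRecursively (which respects markdown structure rather than cutting at fixed offsets), just an illustration of how the two parameters interact:

```python
def naive_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size sliding window: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so adjacent chunks share chunk_overlap
    characters of context."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

print(naive_chunks("abcdefghij", 4, 1))  # ['abcd', 'defg', 'ghij']
```

The overlap (500 characters in the pipeline above) exists so a sentence that straddles a chunk boundary still appears whole in at least one chunk, which helps retrieval quality.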
Step 5: Add Some Content
mkdir markdown_files
Drop some markdown files in there. For testing, even a couple of files work. The pipeline will chunk them, embed each chunk, and store the vectors.
Step 6: Run the Indexer
cocoindex update main.py
It will show you the tables it needs to create and ask for confirmation. Type yes.
You'll see it load the sentence-transformers model (first run downloads it from HuggingFace), create the pgvector extension, build the HNSW index, and process your files. Output looks like:
TextEmbedding.documents (batch update): 2/2 source rows: 2 added
Step 7: Query with Semantic Search
Install psycopg2 and create a simple query script:
pip install psycopg2-binary
# query.py
import sys

import psycopg2
from sentence_transformers import SentenceTransformer

DB_URL = "postgresql://cocoindex:cocoindex@localhost:5432/cocoindex"


def search(query: str, top_k: int = 3):
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    embedding = model.encode(query)
    # pgvector accepts a vector literal like "[0.1,0.2,...]"
    vec_str = "[" + ",".join(str(x) for x in embedding) + "]"
    conn = psycopg2.connect(DB_URL)
    cur = conn.cursor()
    cur.execute("""
        SELECT filename, left(text, 200),
               1 - (embedding <=> %s::vector) AS similarity
        FROM textembedding__doc_embeddings
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """, (vec_str, vec_str, top_k))
    results = cur.fetchall()
    cur.close()
    conn.close()
    return results


if __name__ == "__main__":
    query = " ".join(sys.argv[1:]) or "What is incremental processing?"
    print(f"\nQuery: {query}\n")
    for filename, text, score in search(query):
        print(f"Score: {score:.4f} | {filename}")
        print(f"  {text.strip()[:150]}...\n")
python query.py "which embedding models are popular?"
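A note on the math in that SQL: pgvector's <=> operator is cosine *distance*, so 1 - (a <=> b) is cosine *similarity* (1.0 for identical directions, 0.0 for orthogonal vectors). In pure Python:

```python
import math

def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between the vectors.
    This mirrors what pgvector's <=> operator computes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

# A 45-degree angle between vectors gives similarity cos(45°) ≈ 0.7071
similarity = 1 - cosine_distance([1.0, 0.0], [1.0, 1.0])
print(round(similarity, 4))  # 0.7071
```

That's why the ORDER BY sorts ascending on distance but the displayed score is similarity: closest match first, highest score first.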
Things That Tripped Me Up
Use pgvector/pgvector, not postgres. This is the number one issue. The plain Postgres Docker image doesn't include the vector extension. You need pgvector/pgvector:pg17 (or pg16). CocoIndex will fail at table creation without it.
Table naming is lowercase. Your flow is named TextEmbedding but the table is textembedding__doc_embeddings. CocoIndex lowercases the flow name. Keep this in mind when writing direct SQL queries.
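The convention, as observed in this setup (I haven't checked whether CocoIndex documents it as a guarantee), is lowercased flow name, double underscore, export name:

```python
def cocoindex_table_name(flow_name: str, export_name: str) -> str:
    """Reproduce the table-naming convention observed in this guide:
    lowercase(flow name) + "__" + export target name."""
    return f"{flow_name.lower()}__{export_name}"

print(cocoindex_table_name("TextEmbedding", "doc_embeddings"))
# textembedding__doc_embeddings
```

Handy when you're generating SQL against several flows and don't want to hardcode table names.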
The old main_fn API is gone. If you see examples using cocoindex.main_fn(), that's outdated. The current API (v0.3.36+) uses the cocoindex CLI command directly.
Docker volume persistence. If you change Postgres env vars (user/password) but reuse the container volume, the old credentials persist. Use docker rm -v to remove the volume when recreating.
The .env file wins. CocoIndex loads .env automatically via python-dotenv. If you set COCOINDEX_DATABASE_URL in your shell but have a different value in .env, the file takes precedence. This caught me when debugging connection issues.
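To make the precedence concrete, here's a toy loader with the file-wins semantics I observed (an illustration only — CocoIndex actually uses python-dotenv, whose behavior depends on how it's invoked):

```python
def load_env_file(path: str, environ: dict) -> None:
    """Toy .env loader with file-wins precedence: values from the file
    overwrite anything already present in `environ`."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                environ[key.strip()] = value.strip()
```

The practical takeaway: when a connection string mysteriously "won't change", check .env before blaming your shell exports.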
Port conflicts with SSH tunnels. If you're forwarding database ports over SSH (common with remote dev setups), an SSH tunnel can bind to the same port as Docker. The connection goes to the wrong Postgres, and you get auth failures that look like a password problem. Always verify the port with lsof.
Using CocoIndex with Claude Code
If you're using Claude Code, there are a couple of integrations worth knowing about.
Claude Code Skill
CocoIndex provides an official Claude Code skill that gives Claude built-in knowledge about CocoIndex's API, so it can help you write pipelines, create custom functions, and run CLI commands correctly. This would have saved me from hitting the deprecated main_fn API issue.
Install it from within Claude Code:
/plugin marketplace add cocoindex-io/cocoindex-claude
/plugin install cocoindex-skills@cocoindex
Once installed, Claude Code understands CocoIndex's current API and can generate correct pipeline code without relying on outdated examples.
MCP for Code Search
cocoindex-code is a lightweight MCP server for semantic code search:
pipx install cocoindex-code
claude mcp add cocoindex-code -- ccc mcp
It uses SQLite locally and runs its own embeddings - no Postgres required. It's a separate tool from the main CocoIndex library, focused on searching your codebase.
MCP for Postgres-backed Indexes
There is no official MCP server for the Postgres-backed pipeline we built in this guide. The main cocoindex library has a built-in HTTP server (cocoindex server main.py) that exposes REST APIs, but it uses a proprietary protocol for their CocoInsight UI, not the MCP standard.
If you need MCP access to your pgvector index, you'd need to write a thin wrapper. The query.py script above is essentially all the logic you need - wrap it in an MCP server and you're there. That's a good project for a follow-up post.
The full pipeline takes about 10 minutes to set up once you know the gotchas. The incremental processing means subsequent runs only reprocess changed files, which is where CocoIndex really shines over rebuilding indexes from scratch.