Vector Database Toolkit
Vector databases are the backbone of every RAG pipeline, semantic search engine, and recommendation system — but each one has different APIs, indexing strategies, and operational quirks. This toolkit gives you unified setup guides, working code examples, and benchmarking scripts for ChromaDB, Pinecone, Weaviate, and pgvector, plus hybrid search patterns and production operations guides.
Key Features
- Multi-Database Support — Unified Python client abstraction for ChromaDB, Pinecone, Weaviate, and pgvector with consistent CRUD operations
- Setup & Migration Guides — Step-by-step setup for each database, including Docker configs, cloud provisioning, and schema migration scripts
- Indexing Strategies — HNSW, IVF, and PQ index configuration with tuning guides for recall vs. speed tradeoffs
- Hybrid Search — Combine dense vector search with sparse keyword search across all supported backends
- Benchmarking Scripts — Measure query latency, throughput, recall@K, and memory usage across databases with your own data
- Production Operations — Backup/restore procedures, monitoring queries, scaling guides, and cost estimation per database
- Embedding Pipeline — Batch embedding generation with rate limiting, retry logic, and incremental upsert support
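The batching, rate-limiting, and retry behavior described in the Embedding Pipeline feature can be sketched in plain Python. This is a minimal illustration of the pattern, not the toolkit's internals; `chunked` and `with_retries` are names I chose for the sketch.

```python
import time
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive fixed-size batches from an iterable."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

def with_retries(call: Callable[[], T], max_retries: int = 3,
                 delay_seconds: float = 1.0) -> T:
    """Invoke `call`, retrying with a linear backoff on any exception."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(delay_seconds * (attempt + 1))
    raise AssertionError("unreachable")
```

An embedding pipeline would then wrap each batch's API call in `with_retries` and iterate `chunked(documents, batch_size)`, tracking which IDs were already upserted to support incremental runs.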
Quick Start
```python
from vector_toolkit import VectorClient, EmbeddingPipeline

# 1. Initialize with any backend (same API for all)
client = VectorClient(
    backend="chromadb",
    connection={
        "persist_directory": "./chroma_db",
    },
    collection="product_catalog",
    embedding_model="text-embedding-3-small",
    dimensions=1536,
)

# 2. Index documents
documents = [
    {"id": "doc_1", "text": "Premium leather wallet with RFID blocking", "category": "accessories"},
    {"id": "doc_2", "text": "Wireless noise-canceling headphones", "category": "electronics"},
    {"id": "doc_3", "text": "Organic cotton crew neck t-shirt", "category": "apparel"},
]
client.upsert(documents, text_key="text", metadata_keys=["category"])

# 3. Search
results = client.search("high-quality audio equipment", top_k=5)
for r in results:
    print(f"[{r.score:.3f}] {r.id}: {r.text}")
```
Architecture
```
┌─────────────────────────────────────────────┐
│         VectorClient (Unified API)          │
│                                             │
│  upsert() │ search() │ delete() │ count()   │
└──────────────────────┬──────────────────────┘
                       │
     ┌──────────┬──────┴─────┬──────────┐
     │          │            │          │
     ▼          ▼            ▼          ▼
┌────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ChromaDB│ │ Pinecone │ │ Weaviate │ │ pgvector │
│(local) │ │ (cloud)  │ │ (hybrid) │ │  (SQL)   │
└────────┘ └──────────┘ └──────────┘ └──────────┘
                       │
                       ▼
             ┌─────────────────┐
             │EmbeddingPipeline│
             │  Batch + Rate   │
             │  Limit + Retry  │
             └─────────────────┘
```
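One way to read the architecture: each backend implements a small common interface and registers itself under a name, so `VectorClient` can dispatch without backend-specific code. The following is a sketch of that pattern under my own assumptions — `BackendBase`, `BACKENDS`, and the in-memory toy backend are illustrative, not the toolkit's documented internals.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class BackendBase(ABC):
    """Common interface every storage backend must implement."""

    @abstractmethod
    def upsert(self, docs: List[Dict[str, Any]]) -> None: ...
    @abstractmethod
    def search(self, vector: List[float], top_k: int) -> List[Dict[str, Any]]: ...
    @abstractmethod
    def delete(self, ids: List[str]) -> None: ...
    @abstractmethod
    def count(self) -> int: ...

BACKENDS: Dict[str, type] = {}

def register(name: str):
    """Class decorator that adds a backend to the registry."""
    def wrap(cls):
        BACKENDS[name] = cls
        return cls
    return wrap

@register("memory")
class InMemoryBackend(BackendBase):
    """Toy backend used here only to exercise the interface."""

    def __init__(self) -> None:
        self.docs: Dict[str, Dict[str, Any]] = {}

    def upsert(self, docs):
        for d in docs:
            self.docs[d["id"]] = d

    def search(self, vector, top_k):
        # Real backends rank by distance; the toy just truncates.
        return list(self.docs.values())[:top_k]

    def delete(self, ids):
        for i in ids:
            self.docs.pop(i, None)

    def count(self):
        return len(self.docs)
```

A unified client would then look up `BACKENDS[backend_name]` at construction time and forward every call to the instance.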
Usage Examples
Switch Backends Without Code Changes
```python
# Development: use ChromaDB (local, no setup)
dev_client = VectorClient(backend="chromadb", connection={"persist_directory": "./db"})

# Staging: use pgvector (existing Postgres)
staging_client = VectorClient(
    backend="pgvector",
    connection={
        "host": "localhost",
        "port": 5432,
        "database": "vectors",
        "user": "app_user",
        "password": "${PGVECTOR_PASSWORD}",
    },
)

# Production: use Pinecone (managed, scalable)
prod_client = VectorClient(
    backend="pinecone",
    connection={
        "api_key": "${PINECONE_API_KEY}",
        "environment": "us-east-1",
        "index_name": "product-catalog",
    },
)

# The same call works against any of the three clients:
results = prod_client.search("wireless headphones", top_k=5)
```
Hybrid Search (Dense + Sparse)
```python
from vector_toolkit.search import HybridSearch

hybrid = HybridSearch(
    client=client,
    dense_weight=0.7,
    sparse_weight=0.3,         # BM25 keyword matching
    fusion="reciprocal_rank",  # reciprocal_rank | weighted_sum
)

results = hybrid.search(
    query="error code ERR-4012 connection timeout",
    top_k=10,
    filters={"category": "troubleshooting"},
)

# Dense search finds semantically similar docs about connection issues;
# sparse search catches the exact error code "ERR-4012".
```
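The `reciprocal_rank` fusion option named above is a standard technique (reciprocal rank fusion): each result list contributes `1 / (k + rank)` per document, and documents are re-ranked by their summed score. A minimal standalone sketch, not the toolkit's actual implementation:

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Higher summed score = better consensus rank across the lists
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears near the top of both the dense and the sparse list outscores one that tops only a single list, which is exactly the behavior the error-code example relies on.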
Benchmarking
```python
from vector_toolkit.benchmark import Benchmark

bench = Benchmark(
    backends=["chromadb", "pgvector", "pinecone"],
    dataset="benchmark_data/1M_embeddings.npy",
    queries="benchmark_data/1000_queries.npy",
    ground_truth="benchmark_data/ground_truth.json",
    metrics=["latency_p50", "latency_p99", "recall_at_10", "throughput_qps"],
)

report = bench.run()
print(report.table())
# ┌───────────┬─────────────┬─────────────┬───────────┬──────┐
# │ Backend   │ Latency P50 │ Latency P99 │ Recall@10 │ QPS  │
# ├───────────┼─────────────┼─────────────┼───────────┼──────┤
# │ ChromaDB  │ 12ms        │ 45ms        │ 0.94      │ 180  │
# │ pgvector  │ 18ms        │ 62ms        │ 0.92      │ 250  │
# │ Pinecone  │ 22ms        │ 58ms        │ 0.96      │ 1200 │
# └───────────┴─────────────┴─────────────┴───────────┴──────┘
```
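The Recall@10 column is the standard recall@K metric: the fraction of the ground-truth nearest neighbors that actually appear in the top K retrieved results. It can be computed directly from ID lists; the function name here is mine, not the toolkit's.

```python
from typing import Sequence, Set

def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant IDs found in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)
```

Averaging this value over every query in the ground-truth file gives the single Recall@10 number shown per backend in the report.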
Index Tuning
```python
from vector_toolkit.indexing import IndexConfig

# HNSW (best for most use cases)
hnsw_config = IndexConfig(
    type="hnsw",
    m=16,                 # Connections per node (higher = better recall, more memory)
    ef_construction=200,  # Build-time accuracy (higher = better index, slower build)
    ef_search=100,        # Query-time accuracy (higher = better recall, slower query)
)

# IVF (better for very large datasets, > 10M vectors)
ivf_config = IndexConfig(
    type="ivf",
    nlist=1024,  # Number of clusters
    nprobe=32,   # Clusters to search (higher = better recall, slower)
)

client.create_index(config=hnsw_config)
```
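On the pgvector backend, these configs ultimately map to plain SQL. Assuming a table named `items` with an `embedding` column and cosine distance (both assumptions of this sketch, not something the toolkit specifies), the equivalent statements look like the strings below; pgvector exposes `ef_search` and `nprobe` as query-time settings rather than index parameters.

```python
# HNSW index (pgvector 0.5+)
hnsw_sql = """
CREATE INDEX ON items
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);
"""

# ef_search is set per session/query, not baked into the index:
ef_search_sql = "SET hnsw.ef_search = 100;"

# IVF equivalent (pgvector calls it ivfflat; nlist maps to lists)
ivf_sql = """
CREATE INDEX ON items
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1024);
"""
nprobe_sql = "SET ivfflat.probes = 32;"
```

Without one of these indexes, pgvector falls back to a sequential scan, which is also the fix listed in the Troubleshooting table below.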
Configuration
```yaml
# vector_toolkit_config.yaml
default_backend: "chromadb"

backends:
  chromadb:
    persist_directory: "./chroma_db"
    anonymized_telemetry: false
  pinecone:
    api_key: "${PINECONE_API_KEY}"
    environment: "us-east-1"
    index_name: "product-catalog"
    metric: "cosine"  # cosine | dotproduct | euclidean
    pod_type: "s1.x1"
  weaviate:
    url: "https://api.example.com"
    api_key: "${WEAVIATE_API_KEY}"
    schema_auto_create: true
  pgvector:
    host: "localhost"
    port: 5432
    database: "vectors"
    user: "${PG_USER}"
    password: "${PG_PASSWORD}"
    pool_size: 10

embedding:
  model: "text-embedding-3-small"
  dimensions: 1536
  batch_size: 100
  rate_limit_rpm: 3000
  retry_max: 3
  retry_delay_seconds: 1

indexing:
  type: "hnsw"
  m: 16
  ef_construction: 200
  ef_search: 100

search:
  default_top_k: 10
  hybrid_enabled: true
  dense_weight: 0.7
  sparse_weight: 0.3
  fusion_method: "reciprocal_rank"

benchmark:
  dataset_sizes: [10000, 100000, 1000000]
  query_count: 1000
  output_dir: "benchmark_results/"
```
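The `${PINECONE_API_KEY}`-style placeholders in the config imply environment-variable expansion at load time, which keeps secrets out of the YAML file. A minimal sketch of how that expansion could work — `expand_env` is an illustrative helper, not a documented toolkit function:

```python
import os
import re

# Matches ${NAME} where NAME is an uppercase env-var identifier
_ENV_PATTERN = re.compile(r"\$\{([A-Z0-9_]+)\}")

def expand_env(value: str) -> str:
    """Replace ${NAME} placeholders with values from the environment.

    Unset variables expand to an empty string; a real loader might
    prefer to raise so missing secrets fail fast at startup.
    """
    return _ENV_PATTERN.sub(lambda m: os.environ.get(m.group(1), ""), value)
```

A config loader would apply this to every string value after parsing the YAML, before handing the settings to `VectorClient`.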
Best Practices
- Start with ChromaDB locally, migrate to managed in production — ChromaDB requires zero setup for prototyping; switch to Pinecone/Weaviate when you need scale.
- Choose the right distance metric — Use cosine for normalized embeddings (most common), dotproduct for unnormalized embeddings, euclidean when absolute distances matter.
- Tune HNSW parameters for your recall target — The defaults m=16, ef_search=100 give roughly 95% recall. For 99%+ recall, increase ef_search to 200 or more.
- Use metadata filters before vector search — Filtering first, then searching the filtered subset is much faster than searching everything and post-filtering.
- Batch your upserts — Insert documents in batches of 100-500. Single-document inserts are 10-50x slower.
- Benchmark with YOUR data — Published benchmarks use synthetic data. Run the benchmarking scripts with your actual embeddings and query patterns.
Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| Search returns irrelevant results | Wrong distance metric or poor embedding model | Switch to cosine metric; try text-embedding-3-large for better quality |
| Upsert is extremely slow | Single-document inserts or no batching | Use client.upsert_batch() with batch_size=500 |
| pgvector queries slow on large tables | Missing HNSW or IVF index | Run client.create_index() — without an index, pgvector does brute-force scan |
| Pinecone returns timeout errors | Index not fully initialized or quota exceeded | Wait 2-3 minutes after index creation; check plan limits in Pinecone console |
This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete [Vector Database Toolkit] with all files, templates, and documentation for $39.
Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.