DEV Community: Ashok Nagaraj

Smart Chunking & Embeddings for RAG

Ashok Nagaraj — Fri, 07 Nov 2025 14:42:43 +0000

From Raw Docs to Retrieval Gold: A Deep Dive into Chunking Strategies & Embedding Techniques (with Qdrant)

TL;DR: Great RAG systems don’t start at the vector DB—they start at chunking. This post walks through when and how to chunk, how to choose and generate embeddings, and how to index/search in Qdrant with dense, sparse, and hybrid retrieval. It includes runnable code, diagrams, and sample chunking illustrations you can paste directly into Markdown.

Why chunking matters
Chunking strategies
- Fixed-size + overlap
- Recursive
- Document-structure–aware
- Sentence-window
- Semantic chunking
- Hierarchical
- Choosing sizes & overlaps
Embedding techniques
- Model choices (2024–2025)
- Task-aware prompts
- Hybrid-ready models
Qdrant as the vector DB
- Quickstart (FastEmbed)
- Manual control
- Hybrid retrieval
- Reranking
End-to-end example
Evaluation & tuning
Practical guidance
Further reading & references

Why chunking matters

Long contexts are seductive, but LLMs still show primacy/recency bias and degrade when key facts live in the middle of long inputs ("lost in the middle"; Liu et al., 2024). Thoughtful chunking with overlap keeps related facts adjacent at retrieval time and can improve end-to-end accuracy and latency.

Modern RAG stacks pair well-formed chunks with strong embeddings and a vector DB that supports dense + sparse retrieval and reranking. Qdrant provides production-grade vectors, filters, payloads, and hybrid retrieval in one place. [Qdrant README] [Payload docs]

Visual overview

flowchart LR
  A[Raw Documents] --> B[Parse & Clean]
  B --> C[Chunking Strategy\n(Fixed / Recursive / Semantic / Hierarchical / Sentence-window)]
  C --> D[Embedder(s)\nBGE-M3 / Nomic / Voyage]
  D --> E[Qdrant Index\n(dense + sparse vectors,\npayload)]
  E --> F[Hybrid Retrieval\n(dense ⊕ sparse)]
  F --> G[Reranker (optional)\n(e.g., ColBERT)]
  G --> H[LLM + Prompt\n(answers, grounded)]

Qdrant supports dense vectors, sparse vectors, and hybrid workflows; reranking with late interaction (e.g., ColBERT) is a documented pattern. [Sparse vectors in Qdrant] [Hybrid tutorial].

Chunking strategies

Below are practical chunking strategies you can mix-and-match. Each section includes sample text, what the chunks look like, and code where useful.

1) Fixed-size window (tokens or characters) + overlap

When: fast baselines, logs, transcripts.
Why: predictable chunk lengths, simple to implement.
Risk: can split sentences mid-thought; consider overlap to preserve context.

Sample text (we’ll reuse this):

[Doc] The GPU kernel uses tiling to reduce global memory access. 
Block-level synchronization is required. See Algorithm 2 for warp-level primitives.

Fixed-size (≈ 80 chars) with 20-char overlap

Chunk 1:
"The GPU kernel uses tiling to reduce global memory access. Block-level"

Chunk 2:
"access. Block-level synchronization is required. See Algorithm 2 for"

Chunk 3:
"See Algorithm 2 for warp-level primitives."

Code (LangChain)

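from langchain_text_splitters import CharacterTextSplitter

# Character-based windows: split on spaces, then merge pieces up to ~80 chars,
# carrying up to 20 chars over between neighbouring chunks.
# (The default separator is "\n\n", which would leave this short text as one chunk.)
# For token budgets, from_tiktoken_encoder(...) measures chunk_size in tokens instead.
splitter = CharacterTextSplitter(separator=" ", chunk_size=80, chunk_overlap=20)
text = (
    "The GPU kernel uses tiling to reduce global memory access. "
    "Block-level synchronization is required. See Algorithm 2 for warp-level primitives."
)
chunks = splitter.split_text(text)
for i, c in enumerate(chunks, 1):
    print(f"Chunk {i}:\n{c}\n")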

LangChain’s token/character splitters are standard, and RecursiveCharacterTextSplitter is the recommended default for general text. [LangChain splitters].
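To make the mechanics concrete, here is a minimal sketch of a fixed-size window with overlap in plain Python. It is not LangChain's implementation: it cuts on raw character offsets, and `fixed_size_chunks` is an illustrative name, not a library function.

```python
def fixed_size_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Slide a window of `size` characters, stepping by `size - overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the window already reached the end of the text
    return chunks


doc = (
    "The GPU kernel uses tiling to reduce global memory access. "
    "Block-level synchronization is required. "
    "See Algorithm 2 for warp-level primitives."
)
chunks = fixed_size_chunks(doc, size=80, overlap=20)
for i, c in enumerate(chunks, 1):
    print(f"Chunk {i}: {c!r}")
```

Because the cut points ignore word boundaries, chunks can start or end mid-word. That is the risk called out above, and the reason the later strategies prefer natural separators.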

2) Recursive split (paragraph → sentence → word)

When: prose, docs, Markdown, HTML.
Why: preserves natural boundaries first; falls back only when needed.
How: tries ["\n\n", "\n", " ", ""] in order to keep larger units intact. [Recursive splitter]

Code (Recursive + Markdown-aware)

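from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# from_language() puts Markdown-specific separators (headings, code fences, rules)
# ahead of the generic paragraph / line / word fallbacks.
splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN, chunk_size=120, chunk_overlap=30
)
chunks = splitter.split_text(long_markdown_string)  # long_markdown_string: your document text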
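The fallback order is easier to reason about with a stripped-down version in front of you. The sketch below is a simplification under stated assumptions, not LangChain's code: it ignores overlap, measures length in characters, and `recursive_split` is an illustrative name.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first; recurse only into oversized pieces."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at max_len characters
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, rest = separators[0], tuple(separators[1:])
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate          # keep merging small pieces
            continue
        if current:
            chunks.append(current)       # flush what fits
        if len(piece) <= max_len:
            current = piece
        else:
            # The piece alone is too big: retry with the next, finer separator
            chunks.extend(recursive_split(piece, max_len, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Para one is short.\n\n"
    "Para two is a bit longer than the limit allows here."
)
parts = recursive_split(sample, max_len=30)
for p in parts:
    print(repr(p))
```

The first paragraph survives intact because it fits the budget; only the second one falls through to word-level splitting.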

3) Document-structure–aware split (Markdown/HTML/JSON)

When: knowledge bases, docs, webpages, specs.
Why: avoid chopping headings, lists, code blocks; align chunk meaning to structure.

LangChain provides Markdown/HTML splitters; LlamaIndex offers file-based node parsers (e.g., MarkdownNodeParser, HTMLNodeParser). [LangChain splitters] [LlamaIndex node parsers]

Code (LlamaIndex HTML)

from llama_index.core.node_parser import HTMLNodeParser
parser = HTMLNodeParser(tags=["h1","h2","p","li","code"])
nodes = parser.get_nodes_from_documents(html_docs)

4) Sentence-window retrieval

When: you want precise grounding while preserving local context.
Why: index at sentence granularity, but expand context during retrieval by adding a window of neighboring sentences.

Code (LlamaIndex SentenceWindow)

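from llama_index.core.node_parser import SentenceWindowNodeParser

# Each node holds one sentence; the surrounding window is stored in the node metadata
parser = SentenceWindowNodeParser.from_defaults(window_size=2)  # ±2 sentences of context
nodes = parser.get_nodes_from_documents(documents)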

LlamaIndex provides SentenceWindowNodeParser specifically for this pattern. [LlamaIndex node parsers]
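As a library-free illustration of the same idea, the sketch below builds one record per sentence and attaches its neighbours as a window. The regex sentence splitter is a deliberately naive assumption (real parsers use a proper tokenizer), and `sentence_windows` is an illustrative name.

```python
import re


def sentence_windows(text: str, window_size: int = 2) -> list[dict]:
    """Return one record per sentence plus a window of neighbouring sentences."""
    # Naive split: sentence-ending punctuation followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    records = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        records.append({
            "sentence": sentence,                   # what gets embedded and matched
            "window": " ".join(sentences[lo:hi]),   # what gets passed to the LLM
        })
    return records


text = (
    "Tiling reduces memory traffic. Blocks must synchronize. "
    "Warps use shuffles. Flags tune the runtime."
)
records = sentence_windows(text, window_size=1)
for r in records:
    print(r["sentence"], "->", r["window"])
```

The split between what is embedded (the sentence) and what is sent to the model (the window) is the whole trick: matching stays precise while the answer still sees local context.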

5) Semantic chunking (boundary by meaning, not characters)

When: long-form text with shifting topics (papers, handbooks).
Why: create chunk boundaries where semantic similarity between consecutive sentences drops.

A practical recipe: embed each sentence, compute cosine distance, start a new chunk when distance exceeds a threshold (e.g., 95th percentile). LlamaIndex provides a pack implementing this (“semantic chunking” popularized by Greg Kamradt). [LlamaIndex semantic chunking pack]

Illustration (semantic boundaries)

S1: Intro to tiling  ─┐
S2: Memory coalescing ─┤  (high similarity → same chunk)
S3: Warp shuffles     ─┘
S4: Runtime flags  ←  [semantic drop: new topic → new chunk]
S5: Env setup
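The recipe above can be sketched end to end in a few lines. To keep it runnable without a model, `embed` below is a toy bag-of-words stand-in (an assumption, swap in a real sentence embedder), and the percentile uses linear interpolation. All function names are illustrative.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words counts
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values, pct):
    # Linear interpolation between the two closest ranks
    ranked = sorted(values)
    pos = (len(ranked) - 1) * pct / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(ranked) - 1)
    return ranked[lo] + (ranked[hi] - ranked[lo]) * (pos - lo)


def semantic_chunks(sentences, pct=95.0):
    """Start a new chunk wherever the distance to the previous sentence exceeds the threshold."""
    if len(sentences) < 2:
        return [list(sentences)] if sentences else []
    vectors = [embed(s) for s in sentences]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks = [[sentences[0]]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append([sentence])
        else:
            chunks[-1].append(sentence)
    return chunks


sentences = [
    "Tiling reduces global memory access.",
    "Coalescing also reduces global memory access.",
    "Warp shuffles reduce shared memory access.",
    "Set runtime flags before launching.",
    "Set environment flags during setup.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

With these five sentences the only distance above the threshold is the jump from the memory-access material to the flags material, so the split lands in the same place as the illustration.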

6) Hierarchical chunking (multi-level nodes)

When: large manuals/books; need both overview and details.
Why: index multiple granularities (e.g., 2k/512/128 tokens) and let the retriever blend them.

Code (LlamaIndex Hierarchical)

from llama_index.core.node_parser.relational.hierarchical import HierarchicalNodeParser

parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128],  # parent → child → grandchild
    chunk_overlap=40
)
nodes = parser.get_nodes_from_documents(documents)

Hierarchical node parsers create a flat list of nodes but preserve parent/child relationships—ideal for multi-granularity retrieval. [LlamaIndex hierarchical]

Choosing chunk sizes & overlaps

Start with 512–1,024 tokens and 10–20% overlap for prose; increase overlap for dense technical content and code.
Keep chunks semantically coherent (recursive/semantic methods) to mitigate “lost in the middle.” [Recursive splitter] [Lost in the Middle]
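Overlap is not free: every extra token of overlap shrinks the stride and grows the index. A quick back-of-the-envelope helper, assuming a plain sliding window (`chunk_count` is an illustrative name), shows the effect:

```python
import math


def chunk_count(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of sliding-window chunks needed to cover a document."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    if total_tokens <= 0:
        return 0
    stride = chunk_size - overlap
    # One chunk covers the first chunk_size tokens; each further chunk adds `stride`
    return 1 + math.ceil(max(0, total_tokens - chunk_size) / stride)


for overlap in (0, 64, 102):
    n = chunk_count(10_000, chunk_size=512, overlap=overlap)
    print(f"overlap={overlap:>3} tokens -> {n} chunks")
```

For a 10,000-token document at 512 tokens per chunk, going from no overlap to roughly 20% overlap adds about a quarter more chunks to embed and store.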

Embedding techniques

Picking an embedding model (2024–2025 snapshot)

BGE-M3 (open, multilingual, can output dense + sparse + multi-vector; long-text up to ~8k tokens). Strong hybrid story. [HF card] [arXiv]
Nomic Embed (v1.5 / v2 MoE) (open, long-context, Matryoshka—truncate dims without retraining; task-prefix prompts). [HF v1.5] [Tech report] [V2 MoE overview]
Voyage-3/3.5(-lite) (hosted API, strong multilingual retrieval; domain variants for code/finance/law). [Voyage docs]
Benchmark by task, not just overall averages; see MTEB (retrieval/STSb are most relevant for RAG). [MTEB leaderboard]
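The Matryoshka property mentioned for Nomic Embed means the leading dimensions of a vector are usable on their own. A minimal sketch of the generic step, keeping a prefix and re-normalizing, looks like this. The toy vector and the function name are assumptions, and a given model may prescribe extra steps, so check its model card before relying on this.

```python
import math


def truncate_and_renormalize(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


full = [0.5, 0.5, 0.5, 0.5]              # toy unit-length "embedding"
short = truncate_and_renormalize(full, 2)
print(short)
```

If you truncate, the collection's vector size in the DB has to match the truncated length, and queries must be truncated the same way.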

Tip: Retrieval usually uses cosine on L2-normalized vectors (most libraries handle this). Validate that your client and DB use the same similarity metric (e.g., COSINE in Qdrant). [Qdrant client]
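The reason normalization matters is that, for unit-length vectors, cosine similarity and the dot product are the same number. A small check in plain Python (toy vectors, illustrative helper names):

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
print("cosine on raw vectors:       ", cosine(a, b))
print("dot on normalized vectors:   ", dot(l2_normalize(a), l2_normalize(b)))
print("dot on raw vectors (differs):", dot(a, b))
```

So a mismatch only bites when vectors are not normalized and the collection is configured for dot product or Euclidean distance while you evaluated with cosine.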

Task-aware prompting for embeddings

Some models require instruction prefixes to ensure query/document embeddings live in compatible subspaces. For Nomic:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
doc_emb = model.encode(["search_document: GPU tiling improves mem access"]) 
qry_emb = model.encode(["search_query: What improves memory access on GPUs?"])

These search_document / search_query prefixes are part of the model spec. [HF v1.5]

Hybrid-ready models

BGE-M3 can produce both dense and sparse signals—useful if you plan to feed Qdrant’s hybrid flow (dense + sparse). [BGE-M3 docs]

Qdrant as the Vector DB

Why Qdrant?

Vectors + payloads (schemaless JSON metadata + filters). [Payload docs]
Sparse vectors and hybrid search patterns (dense ⊕ sparse; reranking with ColBERT). [Sparse vectors] [Hybrid tutorial]
Client ergonomics: Python client supports local :memory: mode, FastEmbed integration (client.add / client.query), async, and Cloud. [Qdrant Quickstart] [PyPI]

There’s also active work discussing hybrid algorithms (e.g., BM42) and how IDF/statistics interplay with sparsity. [BM42 news] [BM25/IDF discussion]

A. “Batteries-included” quickstart (FastEmbed via Qdrant client)

Install & run Qdrant

docker run -p 6333:6333 qdrant/qdrant:latest

Python client with auto-embedding and simplified APIs

# pip install "qdrant-client[fastembed]"
from qdrant_client import QdrantClient

client = QdrantClient(":memory:")  # or url="http://localhost:6333"

docs = [
  "Qdrant has LangChain integrations",
  "Qdrant also has LlamaIndex integrations"
]

# Simple add → auto-embeds via FastEmbed
ids = client.add(collection_name="demo_collection", documents=docs)

# Query by text directly (embeds the query under the hood)
result = client.query(
    collection_name="demo_collection",
    query_text="Which vector DB works with LangChain?",
    limit=2
)
print(result)

This “add / query” path is documented in the official Qdrant Python Client Quickstart. [Quickstart]

B. Manual control (your own embeddings + payload schema)

Create a collection with COSINE distance

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(host="localhost", port=6333)

if not client.collection_exists("ml_notes"):
    client.create_collection(
        collection_name="ml_notes",
        vectors_config=VectorParams(size=1024, distance=Distance.COSINE),  # set to your model dim
    )

Upsert points with payloads

# assume `embeddings` is a list of 1024-d vectors
# and `texts` is the corresponding list of strings
points = [
    PointStruct(
        id=i,
        vector=embeddings[i],
        payload={"doc_id": "kernel_guide", "section": i, "text": texts[i], "tags": ["gpu", "tiling"]}
    )
    for i in range(len(embeddings))
]

client.upsert(collection_name="ml_notes", points=points)

Qdrant’s Python client covers collection creation, upserts, searches, and filtering; payloads are schemaless JSON, filterable by field. [Client docs] [Payload]

Search (vector) with metadata filter

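from qdrant_client.models import Filter, FieldCondition, MatchValue

# query_vec: the embedded query, produced by the same model as the stored vectors.
# query_points supersedes the older, deprecated client.search method.
hits = client.query_points(
    collection_name="ml_notes",
    query=query_vec,
    query_filter=Filter(
        must=[FieldCondition(key="tags", match=MatchValue(value="gpu"))]
    ),
    limit=5,
).points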

C. Hybrid retrieval with Qdrant (dense + sparse)

Concept: Store both dense vectors and sparse vectors per point; retrieve with a hybrid pipeline (then optionally rerank). Qdrant documents sparse vectors and shows reranking patterns; LlamaIndex exposes a simple enable_hybrid=True switch powered by fastembed (e.g., Qdrant/bm25 or SPLADE). [Sparse vectors] [Hybrid tutorial] [LlamaIndex Qdrant hybrid]

Code (LlamaIndex + Qdrant hybrid)

# pip install -U llama-index llama-index-vector-stores-qdrant fastembed qdrant-client
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext, Settings
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient, AsyncQdrantClient

docs = SimpleDirectoryReader("./data").load_data()
client = QdrantClient(host="localhost", port=6333)
aclient = AsyncQdrantClient(host="localhost", port=6333)

vector_store = QdrantVectorStore(
    "gpu_notes",
    client=client,
    aclient=aclient,
    enable_hybrid=True,                     # <-- dense + sparse
    fastembed_sparse_model="Qdrant/bm25",   # or "prithvida/Splade_PP_en_v1"
    batch_size=64,
)

storage_context = StorageContext.from_defaults(vector_store=vector_store)
Settings.chunk_size = 512

index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)

# Query hybrid: specify final k and dense/sparse fused candidates under the hood
retriever = index.as_retriever(similarity_top_k=10)
nodes = retriever.retrieve("warp-level primitives vs block-level sync")
for n in nodes:
    # SimpleDirectoryReader records file_name / file_path in node metadata
    print(n.metadata.get("file_name"), n.score)

This setup and parameters are demonstrated in LlamaIndex’s Qdrant Hybrid example. [LlamaIndex hybrid]

Note: Qdrant docs also describe sparse vectors’ JSON shape and their role in hybrid pipelines; pairing dense semantics with sparse exact-term matching—then reranking—is a robust recipe. [Qdrant course excerpt] [Hybrid tutorial]
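One common way to combine a dense ranking with a sparse one is Reciprocal Rank Fusion (RRF). The sketch below is a generic illustration of that technique, not a claim about which fusion the LlamaIndex or Qdrant code paths above use by default; the chunk IDs are made up and `k=60` is the value conventionally used for RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order
    return sorted(scores, key=scores.get, reverse=True)


dense_hits = ["c1", "c2", "c3"]    # hypothetical IDs from the dense retriever
sparse_hits = ["c3", "c1", "c4"]   # hypothetical IDs from the sparse retriever
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

RRF only looks at ranks, so it sidesteps the problem that dense cosine scores and sparse BM25-style scores live on different scales.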

D. Reranking (optional but often impactful)

After retrieving top-N (e.g., N=50) via hybrid, rerank with ColBERT or a cross-encoder for final top-k. Qdrant’s advanced tutorial covers hybrid + reranking architecture and code paths. [Hybrid + Reranking]
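ColBERT-style late interaction scores a document by summing, for each query token vector, its best match among the document's token vectors (often called MaxSim). Here is a minimal sketch of that scoring rule with hand-made 2-d token vectors. The vectors and function names are illustrative assumptions; a real reranker gets its token embeddings from the model.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens, doc_tokens):
    """Sum over query tokens of the best dot product against any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rerank(query_tokens, candidates, top_k):
    """Order candidate IDs by MaxSim score and keep the best top_k."""
    scored = sorted(
        candidates.items(),
        key=lambda item: maxsim(query_tokens, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:top_k]]


query = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "a": [[1.0, 0.0], [0.0, 1.0]],  # matches both query tokens: 1.0 + 1.0
    "b": [[1.0, 0.0], [1.0, 0.0]],  # matches only the first:    1.0 + 0.0
    "c": [[0.6, 0.8]],              # partial match on both:     0.6 + 0.8
}
print(rerank(query, candidates, top_k=2))
```

The per-token comparison is what makes this costlier than a single-vector search, which is why it is applied to a short candidate list rather than the whole collection.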

End-to-end example: Chunk → Embed → Qdrant → Hybrid Query

1) Chunk (Recursive) + overlap

from langchain_text_splitters import RecursiveCharacterTextSplitter

text = """
The GPU kernel uses tiling to reduce global memory access.
Block-level synchronization is required. See Algorithm 2 for warp-level primitives.
"""

splitter = RecursiveCharacterTextSplitter(chunk_size=120, chunk_overlap=30)
chunks = splitter.split_text(text)

2) Embed (choose one model)

Option A: BGE-M3 (dense)

from sentence_transformers import SentenceTransformer

bge = SentenceTransformer("BAAI/bge-m3")
vecs = bge.encode(chunks, normalize_embeddings=True)

Option B: Nomic Embed v1.5 (task prefixes + Matryoshka)

from sentence_transformers import SentenceTransformer

nomic = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
vecs = nomic.encode([f"search_document: {c}" for c in chunks])

3) Index in Qdrant (manual)

from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

client = QdrantClient(url="http://localhost:6333")

if not client.collection_exists("chunks_demo"):
    client.create_collection(
        collection_name="chunks_demo",
        vectors_config=VectorParams(size=len(vecs[0]), distance=Distance.COSINE),
    )

points = [
    # .tolist() turns each NumPy row into a plain list of floats for the client
    PointStruct(id=i, vector=vecs[i].tolist(), payload={"text": chunks[i], "order": i})
    for i in range(len(chunks))
]
client.upsert(collection_name="chunks_demo", points=points)

4) Query (dense)

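# Embed the query with the same model used for the chunks (Option B shown here;
# for Option A use bge.encode([...], normalize_embeddings=True) without the prefix)
qry = nomic.encode(["search_query: How is global memory access reduced?"])[0].tolist()
hits = client.query_points(collection_name="chunks_demo", query=qry, limit=3).points
for h in hits:
    print(h.payload["text"], "→ score:", h.score)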

5) Hybrid (dense ⊕ sparse) via LlamaIndex wrapper

For production hybrid, prefer the Qdrant + LlamaIndex path shown earlier: it generates SPLADE/BM25 sparse vectors automatically and combines them with dense vectors before (optional) reranking. [LlamaIndex Qdrant hybrid]

Evaluation & tuning

A/B test chunk sizes/overlaps using retrieval metrics (Recall@k, MRR) and downstream QA accuracy.
Reference datasets from BEIR/MTEB for repeatable measurement; focus on retrieval and STS categories to reflect RAG performance. [MTEB]
Watch for middle-of-context degradation; shorter, semantically-tight chunks often help. [Lost in the Middle]
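Both retrieval metrics named above are a few lines each, which makes it cheap to wire them into an A/B script. A minimal sketch follows; the chunk IDs and relevance labels are made-up illustration data, not results.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# (retrieved IDs in rank order, set of relevant IDs) per query; toy data
runs = [
    (["c2", "c7", "c1"], {"c1", "c9"}),
    (["c4", "c5", "c6"], {"c4"}),
]

mean_recall = sum(recall_at_k(r, rel, k=3) for r, rel in runs) / len(runs)
mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
print(f"Recall@3 = {mean_recall:.3f}, MRR = {mrr:.3f}")
```

Run the same query set against each chunking configuration and compare the two numbers side by side before looking at downstream answer quality.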

Practical guidance (battle-tested)

Start simple: Recursive splitter, 512–1,024 tokens, 10–20% overlap; adjust per domain. [Recursive splitter]
Use hybrid retrieval for noisy queries/long-tail terms (dense semantics + sparse keywords). Qdrant + LlamaIndex makes this a one-line change (enable_hybrid=True). [LlamaIndex hybrid]
Normalize embeddings and match distance functions (e.g., COSINE end-to-end). [Qdrant client]
Use task-aware prompting (e.g., Nomic’s prefixes) to avoid query–doc space drift. [Nomic v1.5]
Payloads matter: store source, span offsets, titles, section IDs for robust filtering & citations. Qdrant payloads are schemaless and filterable. [Payload docs]
Rerank top candidates when quality trumps latency (ColBERT/cross-encoder). [Hybrid + Reranking]
Monitor updates to sparse/hybrid algorithms (e.g., BM42) as the ecosystem evolves. [BM42 news]

From PDFs to Markdown

Ashok Nagaraj — Fri, 07 Nov 2025 14:02:27 +0000

Introduction

Retrieval-Augmented Generation (RAG) pipelines rely heavily on accurate and structured document parsing. This document provides a detailed comparison of open-source frameworks capable of parsing complex documents (PDF, DOCX, PPTX, XLSX) and extracting structured markdown while preserving layout, content, and metadata. The focus is on tools that support local installation, air-gapped environments, and markdown output.

Premise and Requirements

Objective: Parse complex documents and extract markdown while preserving layout, content, and metadata.
Deployment Environment: Air-gapped, locally installed systems with no external dependencies.
Supported Input Formats: PDF, DOCX, PPTX, XLSX.
Output Format: Markdown with layout and metadata preservation.
Tool Requirements:
- Open-source license
- Local installation (no cloud dependencies)
- CPU-only compatibility
- Fast parsing speed
- OCR capabilities for scanned documents
- CLI support for automation
- Ease of use and documentation
- Local models for layout and structure analysis
- GitHub popularity (stars)
- Hybrid chunking support for RAG pipelines

Evaluation Criteria

The frameworks are evaluated based on the following criteria:

License: Open-source licensing for unrestricted use.
Input Formats: Supported document types (PDF, DOCX, PPTX, XLSX, HTML, Images).
Output Formats: Markdown, HTML, JSON, or raw text.
OCR Support: Ability to extract text from scanned documents using OCR engines.
CLI Availability: Command-line interface for automation and scripting.
Local Models: Support for locally installed models without cloud dependencies.
Markdown Output: Capability to generate markdown preserving layout and structure.
Hybrid Chunking: Support for layout-aware and semantic chunking.
Speed: Relative performance in parsing and conversion.
GitHub Stars: Community adoption and popularity.

Comparison Table

Tool	License	Input Formats	Output Formats	OCR Support	CLI	Local Models	Markdown Output	Hybrid Chunking	Speed	GitHub Stars
Docling	MIT	PDF, DOCX, PPTX, XLSX, HTML, Images	Markdown, HTML, JSON	Yes (Tesseract, EasyOCR, RapidOCR)	Yes	Yes	Yes	Yes	Fast	42.7k
Marker	Apache	PDF, DOCX, PPTX, XLSX, HTML, EPUB, Images	Markdown, HTML, JSON	Yes (Surya OCR)	Yes	Yes	Yes	Yes	Very Fast	~2k
MinerU	Apache	PDF	Markdown, JSON	Yes (PaddleOCR)	Yes	Yes	Yes	Yes	Medium	~1k
PyMuPDF	AGPL-3.0	PDF, EPUB, XPS	Raw text, JSON	No	Yes	No	No	No	Fast	7.4k
PyMuPDF4LLM	AGPL-3.0	PDF	Markdown	No	Yes	No	Yes	No	Medium	1.1k
PyPDF2	BSD	PDF	Text	No	Yes	No	No	No	Slow	6.3k
Markitdown	MIT	PDF	Markdown	No	Yes	No	Yes	No	Unknown	<500
Dolphin	Unknown	PDF	Markdown	No	No	No	Yes	No	Unknown	<500

Framework Descriptions

Docling

Docling is a comprehensive document parsing framework developed by IBM Research and hosted by the LF AI & Data Foundation. It supports multiple input formats, uses a layout-analysis model trained on the DocLayNet dataset, and uses TableFormer for table structure extraction. It includes OCR support via Tesseract, EasyOCR, and RapidOCR. Docling is well suited to enterprise-grade RAG pipelines in air-gapped environments.

Marker

Marker is a fast and flexible parser that uses Surya OCR for multilingual text extraction. It supports a wide range of input formats and outputs structured markdown. Marker is optimized for speed and supports GPU, CPU, and Apple MPS acceleration. It is suitable for lightweight deployments and multilingual document processing.

MinerU

MinerU specializes in parsing Chinese, scientific, and financial documents. It uses PaddleOCR and hybrid rule-based models for accurate layout and table extraction. MinerU is effective in handling rotated tables and preserving document structure in markdown.

PyMuPDF

PyMuPDF is a low-level PDF parsing library that provides fast text extraction but lacks OCR and layout understanding. It is suitable for simple text extraction tasks.

PyMuPDF4LLM

An extension of PyMuPDF, PyMuPDF4LLM adds markdown output capabilities but does not include advanced layout features or OCR support.

PyPDF2

PyPDF2 is a basic PDF reader and writer library. It supports text extraction but lacks layout analysis, OCR, and markdown output.

Markitdown

Markitdown is a lightweight tool for converting PDFs to markdown. It does not support OCR or advanced layout parsing.

Dolphin

Dolphin is a minimalistic tool for markdown extraction from PDFs. It lacks CLI, OCR, and layout analysis features.

References

Docling: https://github.com/docling-project/docling
Marker: https://github.com/datalab-to/marker
MinerU: https://github.com/opendatalab/MinerU
PyMuPDF: https://github.com/pymupdf/PyMuPDF
PyMuPDF4LLM: https://github.com/pyMuPDF4LLM/pyMuPDF4LLM
PyPDF2: https://github.com/py-pdf/PyPDF2
Markitdown: https://github.com/microsoft/markitdown
Dolphin: https://github.com/dolphin/dolphin

Strengths and Weaknesses

Docling

Strengths:

Supports multiple input formats including images and HTML.
Advanced layout analysis using a model trained on DocLayNet.
Table extraction using TableFormer.
Multilingual OCR support via Tesseract, EasyOCR, RapidOCR.
Markdown output with layout and metadata preservation.
CLI and Python API available.
Highly modular and extensible.

Weaknesses:

Requires setup of multiple dependencies.
May be overkill for simple documents.
Licensing note: Docling itself is MIT-licensed (see the comparison table); the AGPL restriction applies to PyMuPDF and PyMuPDF4LLM, not to Docling.

Marker

Strengths:

Very fast parsing and markdown generation.
Surya OCR supports 90+ languages.
Supports GPU, CPU, and Apple MPS acceleration.
CLI and Python API available.
Markdown output with reading order and layout preservation.

Weaknesses:

Less documentation compared to Docling.
Limited table structure analysis compared to TableFormer.
Relatively newer tool with smaller community.

MinerU

Strengths:

Strong performance on Chinese, financial, and scientific documents.
Hybrid rule-based and model-based parsing.
Good rotated table detection and header/footer removal.
Markdown output supported.
CLI available for automation.

Weaknesses:

Focused on specific domains; may not generalize well.
Limited input format support (PDF only).
Smaller community and fewer GitHub stars.

PyMuPDF / PyMuPDF4LLM

Strengths:

Fast and lightweight PDF parsing.
Markdown output supported in PyMuPDF4LLM.
Good for raw text extraction and simple documents.

Weaknesses:

No OCR or layout understanding out-of-the-box.
Limited to PDF format.
No hybrid chunking or metadata preservation.

PyPDF2

Strengths:

Simple and lightweight.
Good for basic text extraction from PDFs.
BSD license allows flexible use.

Weaknesses:

No OCR, layout analysis, or markdown output.
Slow performance on large documents.
Limited to PDF format.

Markitdown

Strengths:

Markdown output supported.
Open-source and lightweight.

Weaknesses:

Limited documentation and community support.
No OCR or layout analysis.
Limited input format support.

Dolphin

Strengths:

Markdown output supported.
Simple interface.

Weaknesses:

Unknown license and community size.
No OCR or layout analysis.
Limited input format support.

TL;DR recommendation

(as of Oct 2025)

Why Wait for CI? Shift Left with Pre-commit Hooks

Ashok Nagaraj — Sat, 17 May 2025 05:41:32 +0000

In the world of DevOps and modern software engineering, automation and consistency are key. Continuous Integration (CI) pipelines help enforce these principles by automating tests, builds, and deployments. However, many issues—like inconsistent formatting, trailing whitespace, or forgotten debug statements—can and should be caught before code even reaches the CI server.

This is where the pre-commit framework comes in. It allows you to define a set of checks (called hooks) that run automatically before each commit. These hooks can catch common issues early, saving time and reducing friction in code reviews and CI failures.

Problem Statement

How can we ensure that all code committed to a Git repository adheres to defined quality standards (e.g., linting, formatting, security checks) before it even gets pushed or merged?
How can we make better use of developer machines to run CI-style checks?
How do we avoid being delayed or blocked when CI infrastructure is unavailable?

Solution: pre-commit Hooks

The pre-commit framework provides a unified way to manage and maintain multi-language pre-commit hooks. It integrates seamlessly with Git and can be used both locally and in CI environments.

How to set up

pre-commit installation

Install pre-commit globally or within your Python environment:

# pip
$ pip install pre-commit

# poetry: add it as a development dependency in pyproject.toml
[tool.poetry.group.dev.dependencies]
pre-commit = "^3.0.0"

Hook definition

Create a .pre-commit-config.yaml file in the root of your repository. Here's a typical configuration for a Python project:

$ cat .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files

  - repo: https://github.com/psf/black
    rev: 23.3.0
    hooks:
      - id: black
        language_version: python3

  - repo: https://github.com/pycqa/flake8
    rev: 7.2.0
    hooks:
      - id: flake8
        additional_dependencies: ['flake8-bugbear']

  - repo: https://github.com/PyCQA/isort
    rev: 5.12.0
    hooks:
      - id: isort
        name: Sort Imports

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.15.0
    hooks:
      - id: mypy

What Each Hook Does

Black: Formats Python code to a consistent style.
Flake8: Lints Python code for style and logical errors.
Isort: Automatically sorts imports.
Mypy: Performs static type checking.
Trailing Whitespace / EOF Fixer: Cleans up whitespace issues.
Check YAML: Validates YAML syntax.
Check Large Files: Prevents accidental commits of large files.

Hook installation

Install the Git hook into your repository by running:

$ pre-commit install

Usage

Invoking pre-commit

To check all files in the repository:

$ pre-commit run --all-files

To check specific files:

$ pre-commit run --files <file1> <file2>

To check only staged files (default on commit):

$ git add <files>
$ git commit -m "your message"
# pre-commit will run automatically

Fixing issues

Some hooks will auto-fix issues (e.g., formatting). After running pre-commit, re-add any fixed files:

git add <fixed-files>
git commit

Bypassing pre-commit

To bypass specific hooks:

# find the hook ID in `.pre-commit-config.yaml` and run:
SKIP=<hook_id> git commit -m "your message"

To bypass all hooks (not recommended):

git commit --no-verify -m "your message"

Example runs

test.py (before)

import sys
import os

def add(a, b):
    return a+b

add(7, 3)

run (output from a larger hook set than the example configuration above)

pre-commit run --files test.py
trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check yaml...........................................(no files to check)Skipped
debug statements (python)................................................Passed
fix double quoted strings................................................Passed
python tests naming..................................(no files to check)Skipped
fix requirements.txt.................................(no files to check)Skipped
check for merge conflicts................................................Passed
check json...........................................(no files to check)Skipped
shellcheck...........................................(no files to check)Skipped
Makefile linter/analyzer.............................(no files to check)Skipped
Reorder python imports...................................................Failed
- hook id: reorder-python-imports
- exit code: 1

Reordering imports in test.py

Add trailing commas......................................................Passed
autopep8.................................................................Passed
flake8...................................................................Failed
- hook id: flake8
- exit code: 1

test.py:3:1: F401 'os' imported but unused
test.py:4:1: F401 'sys' imported but unused

test.py (after)

from __future__ import annotations

import os
import sys


def add(a, b):
    return a+b


add(7, 3)

CI integration

Even though pre-commit runs locally, it’s essential to enforce it in CI to catch skipped hooks. Here's an example GitHub Actions step:

- name: Run pre-commit checks
  run: |
    pip install pre-commit
    pre-commit run --all-files

Customization

You can define your own hooks for project-specific checks. For example, to block TODOs in code:

# .pre-commit-config.yaml
- repo: local
  hooks:
    - id: check-todo
      name: Check for TODOs
      entry: python scripts/check_todo.py
      language: system
      files: \.py$

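# scripts/check_todo.py
import sys

found = False
# pre-commit passes the staged files that match `files:` as arguments
for filename in sys.argv[1:]:
    with open(filename, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            if "TODO" in line:
                print(f"{filename}:{i}: Found TODO")
                found = True

# A non-zero exit status makes the hook fail and blocks the commit
sys.exit(1 if found else 0)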
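String matching like the TODO check is fine for comments, but for code constructs it is usually more robust to parse the file. As a second illustration in the same shape, here is a sketch that flags forgotten debug statements using Python's `ast` module. The module list and function name are assumptions for this example; it is not the implementation of any published hook.

```python
import ast

DEBUG_MODULES = {"pdb", "ipdb"}  # assumption: debuggers this project wants to block


def find_debug_statements(source: str, filename: str = "<string>"):
    """Return (line, description) pairs for debugger imports and breakpoint() calls."""
    findings = []
    tree = ast.parse(source, filename=filename)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in DEBUG_MODULES:
                    findings.append((node.lineno, f"import {alias.name}"))
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in DEBUG_MODULES:
                findings.append((node.lineno, f"from {node.module} import ..."))
        elif (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "breakpoint"
        ):
            findings.append((node.lineno, "breakpoint()"))
    return sorted(findings)


sample = "import pdb\n\ndef add(a, b):\n    breakpoint()\n    return a + b\n"
findings = find_debug_statements(sample)
for lineno, what in findings:
    print(f"sample.py:{lineno}: {what}")
```

To use it as a hook, loop over `sys.argv[1:]` as in check_todo.py, call the function on each file's contents, and exit with status 1 if anything was found.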

References

Official documentation - https://pre-commit.com/
Supported hooks - https://pre-commit.com/hooks.html
A more detailed writeup - https://gatlen.me/gatlens-opinionated-template/pre-commit/

Efficient Memory Management in Python: Understanding Garbage Collection

Ashok Nagaraj — Wed, 09 Apr 2025 17:33:46 +0000

Garbage collection (GC) is a form of automatic memory management. The garbage collector attempts to reclaim memory occupied by objects that are no longer in use by the program. This article delves into the intricacies of garbage collection in Python, exploring its reasons, examples, implications, detection methods, fixes, avoidance strategies, and tools. We will also discuss the implications of using different Python flavors like CPython and PyPy, and considerations with respect to containerization.

Reasons for Garbage Collection

Garbage collection is essential for several reasons:

Memory Management: Prevents memory leaks by reclaiming memory from objects that are no longer needed. Memory leaks can lead to increased memory usage over time, eventually causing the application to crash.
Performance Optimization: Frees up memory resources, allowing the program to run more efficiently. Efficient memory management can lead to faster execution times and reduced latency.
Simplifies Development: Developers do not need to manually manage memory, reducing the risk of errors. Automatic memory management simplifies the development process and helps avoid common pitfalls such as dangling pointers and double frees.

How Garbage Collection Works in Python

Python primarily uses reference counting and a cyclic garbage collector to manage memory.

Reference Counting

Each object in Python maintains a count of references pointing to it. When this count drops to zero, the memory occupied by the object is reclaimed. Reference counting is straightforward but cannot handle cyclic references.

a = []
b = a
c = b
del a
del b
del c
# No references remain, so CPython frees the list immediately
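You can watch this happen with a weak reference, which observes an object without keeping it alive. The sketch below assumes CPython, where reclamation is immediate once the last strong reference goes away; other implementations such as PyPy may free the object later. `Payload` is just a placeholder class, used because plain lists cannot be weakly referenced.

```python
import weakref


class Payload:
    """Placeholder object; instances of plain classes support weak references."""


obj = Payload()
probe = weakref.ref(obj)   # observes the object without raising its refcount
alias = obj                # second strong reference

del obj
alive_after_first_del = probe() is not None    # `alias` still keeps it alive

del alias
alive_after_second_del = probe() is not None   # last strong reference is gone

print("alive after first del: ", alive_after_first_del)
print("alive after second del:", alive_after_second_del)
```

Note that `del` removes a name, not an object; the object goes away only when the last reference to it does.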

Control flow

+------------------+
| Object Creation  |
+--------+---------+
         |
         v
+--------+---------+
| Reference Count  |
| Initialization   |
+--------+---------+
         |
         v
+--------+---------+
| Reference Count  |
| Increment        |
+--------+---------+
         |
         v
+--------+---------+
| Reference Count  |
| Decrement        |
+--------+---------+
         |
         v
+--------+---------+
| Reference Count  |
| == 0             |
+--------+---------+
         |
         v
+--------+---------+
| Object Deletion  |
+------------------+

Cyclic Garbage Collector

Python's cyclic garbage collector detects and collects reference cycles that reference counting alone cannot handle. In CPython it tracks container objects (lists, dicts, class instances, and so on) and runs when the number of allocations minus deallocations since the last collection exceeds a threshold.

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

a = Node(1)
b = Node(2)
a.next = b
b.next = a

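del a
del b
# The two nodes are now unreachable, but they still reference each other,
# so their reference counts never reach zero. They are reclaimed only when
# the cyclic collector runs (automatically, or via gc.collect()).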
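The difference between "unreachable" and "freed" can be observed with a weak reference. The sketch below assumes CPython and disables the automatic collector for the duration of the demo so that the result is deterministic; the probe variable is only there for observation.

```python
import gc
import weakref


class Node:
    def __init__(self, value):
        self.value = value
        self.next = None


gc.disable()                  # keep the automatic collector out of the way for the demo
try:
    a = Node(1)
    b = Node(2)
    a.next = b
    b.next = a                # a <-> b form a reference cycle

    probe = weakref.ref(a)    # lets us check whether the object still exists
    del a
    del b

    alive_before = probe() is not None   # the cycle keeps both nodes alive
    unreachable = gc.collect()           # run a full collection by hand
    alive_after = probe() is not None
finally:
    gc.enable()

print(alive_before, alive_after, unreachable)
```

gc.collect() returns the number of unreachable objects it found, which is a quick way to check whether a piece of code is producing cycles.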

Control flow

+------------------+
| Object Creation  |
+--------+---------+
         |
         v
+--------+---------+
| Reference Count  |
| Initialization   |
+--------+---------+
         |
         v
+--------+---------+
| Cyclic Reference |
| Detection        |
+--------+---------+
         |
         v
+--------+---------+
| Mark Phase       |
+--------+---------+
         |
         v
+--------+---------+
| Sweep Phase      |
+--------+---------+
         |
         v
+--------+---------+
| Object Deletion  |
+------------------+

Generational Garbage Collection

Python's garbage collector is generational, meaning it divides objects into generations based on their age. Younger objects are collected more frequently than older objects. This approach optimizes garbage collection by focusing on objects that are more likely to be garbage.

import gc

# Set thresholds for generational garbage collection
gc.set_threshold(700, 10, 10)
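Before changing thresholds, it helps to look at what the collector is currently doing. The following sketch uses only functions from the standard gc module; the value 5000 is an arbitrary number chosen for illustration, and how the three thresholds are interpreted differs between CPython versions.

```python
import gc

original = gc.get_threshold()     # tuple of three integers
print("thresholds:", original)
print("pending counts:", gc.get_count())

# Per-generation statistics: collections run, objects collected, uncollectable objects
for generation, stats in enumerate(gc.get_stats()):
    print(generation, stats["collections"], stats["collected"], stats["uncollectable"])

# Raise the first threshold so the youngest generation is collected less often,
# then restore the original settings.
gc.set_threshold(5000, original[1], original[2])
changed = gc.get_threshold()
gc.set_threshold(*original)
restored = gc.get_threshold()
```

Saving and restoring the original tuple keeps an experiment like this from leaking into the rest of the process.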

Control flow

+------------------+
| Object Creation  |
+--------+---------+
         |
         v
+--------+---------+
| Young Generation |
| (Gen 0)          |
+--------+---------+
         |
         v
+--------+---------+
| Promotion to     |
| Older Generation |
| (Gen 1)          |
+--------+---------+
         |
         v
+--------+---------+
| Promotion to     |
| Oldest Generation|
| (Gen 2)          |
+--------+---------+
         |
         v
+--------+----------+
| Garbage Collection|
| in Generations    |
+-------------------+

Implications of Garbage Collection

Performance Overhead

Garbage collection can introduce performance overhead, especially in programs with a large number of objects or complex object graphs. For instance, in a web server handling thousands of requests per second, frequent garbage collection cycles can lead to noticeable latency.

Example: The Celery project, a distributed task queue, can experience performance overhead due to garbage collection when handling a high volume of tasks.

Latency

Garbage collection can cause latency spikes, which may be problematic in real-time systems. For example, in a high-frequency trading application, even a slight delay caused by garbage collection can result in significant financial losses.

Example: Game engines that embed Python for scripting must keep each frame within a fixed time budget, so a collection pause in the middle of a frame can show up as a visible stutter.

Memory Usage

Improperly managed garbage collection can lead to increased memory usage and potential memory leaks. In long-running applications, such as a data processing pipeline, memory leaks can accumulate over time, eventually causing the application to crash.

Example: The Apache Spark project, a big data processing framework, can suffer from memory leaks if garbage collection is not properly managed.

Detecting Garbage Collection Issues

Monitoring Tools

gc Module: Python's built-in gc module provides functions to interact with the garbage collector.

import gc

# Enable automatic garbage collection
gc.enable()

# Disable automatic garbage collection
gc.disable()

# Manually trigger garbage collection
gc.collect()

Memory Profilers: Tools like objgraph, pympler, and tracemalloc can help detect memory leaks and analyze memory usage.

Example: The objgraph library can visualize object graphs and help detect memory leaks.
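Of the profilers mentioned, tracemalloc ships with the standard library, so it is the easiest one to try first. Here is a minimal sketch that compares two snapshots around a simulated leak; the leaky_cache list is a stand-in invented for this example.

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Simulated leak: a module-level cache that only ever grows (hypothetical)
leaky_cache = [bytes(1024) for _ in range(1000)]

after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Differences between the two snapshots, grouped by source line
top_stats = after.compare_to(before, "lineno")
for stat in top_stats[:3]:
    print(stat)

grown = sum(stat.size_diff for stat in top_stats)
print("net growth in bytes:", grown)
```

In a real service the two snapshots would be taken minutes or hours apart, and the lines at the top of the diff point to where the growth comes from.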

Logging and Debugging

Logging: Implement logging to track object creation and deletion.
Debugging: Use debuggers to inspect object references and memory usage.

Example: The pympler library can monitor memory usage and analyze memory behavior.

Fixing Garbage Collection Issues

Manual Memory Management

In some cases, manual memory management may be necessary to address specific issues.

import gc

# Disable automatic garbage collection
gc.disable()

# Manually manage memory
# ...

# Re-enable automatic garbage collection
gc.enable()
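One way to make the disable/enable pattern safer is a context manager that always restores the previous state, even if the critical section raises. The gc_paused helper below is hypothetical, not part of the standard library. Note that reference counting keeps working while the cyclic collector is paused; only cycle detection is deferred.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Hypothetical helper: pause the cyclic collector for a block of code."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()       # restore only if it was on before


with gc_paused():
    inside = gc.isenabled()
    batch = [{"id": i} for i in range(10_000)]   # allocation-heavy work

gc.collect()                  # one explicit collection after the critical section
print(inside, len(batch))
```

This pattern suits short, allocation-heavy sections. Leaving the collector off for the lifetime of a long-running process lets cycles accumulate.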

Optimizing Code

Avoid Cyclic References: Design data structures to minimize cyclic references.
Use Weak References: Use the weakref module to create weak references that do not increase reference counts.

import weakref

class MyClass:
    pass

obj = MyClass()
weak_ref = weakref.ref(obj)
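To see what a weak reference buys you, the sketch below extends the snippet above. It assumes CPython, where the referent is freed as soon as the last strong reference goes away. The cache example uses weakref.WeakValueDictionary from the standard library.

```python
import weakref


class MyClass:
    pass


obj = MyClass()
weak_ref = weakref.ref(obj)

alive = weak_ref() is obj         # True while a strong reference exists
del obj                           # drop the only strong reference
dead = weak_ref() is None         # the weak reference now returns None

# A cache that does not keep its values alive
cache = weakref.WeakValueDictionary()
item = MyClass()
cache["item"] = item
size_before = len(cache)
del item
size_after = len(cache)

print(alive, dead, size_before, size_after)
```

Callers of a weak reference must always handle the None case, because the object can disappear between two calls.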

Avoiding Garbage Collection Issues

Best Practices

Limit Object Lifetimes: Keep object lifetimes short to reduce memory usage.
Optimize Data Structures: Use efficient data structures to minimize memory overhead.
Profile Regularly: Regularly profile memory usage to detect and address issues early.

Tools for Garbage Collection

Built-in Tools

gc Module: Provides functions to interact with the garbage collector.
tracemalloc: Tracks memory allocations and helps identify memory leaks.

Third-Party Tools

objgraph: Visualizes object graphs and helps detect memory leaks.
pympler: Monitors memory usage and analyzes memory behavior.

Real-World Scenarios and Open-Source Use Cases

Web Applications

In web applications, improper garbage collection can lead to memory leaks, causing the server to run out of memory and crash. Tools like tracemalloc and objgraph can help detect and fix these issues.

Example: The Django web framework uses garbage collection to manage memory. Profiling tools can help optimize memory usage in Django applications.

Data Processing Pipelines

In data processing pipelines, large datasets can cause significant memory usage. Profiling tools like pympler can help optimize memory usage and prevent leaks.

Machine Learning Models

Machine learning models often require significant memory resources. Efficient garbage collection is crucial to manage memory usage and prevent leaks.

Implications of Using Different Python Flavors

CPython

CPython, the default implementation of Python, uses reference counting and a cyclic garbage collector. It is well-suited for most applications but can suffer from performance overhead in memory-intensive applications.

PyPy

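PyPy is an alternative implementation of Python with a Just-In-Time (JIT) compiler. It does not use reference counting; it relies on a tracing, generational collector, so objects are not necessarily freed the moment the last reference disappears. This can improve performance in some cases, but code that depends on prompt finalization (for example, files closed implicitly when a variable goes out of scope) can behave differently than on CPython.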

Jython and IronPython

Jython and IronPython are implementations of Python for the Java and .NET platforms, respectively. They rely on the garbage collection mechanisms of their respective platforms.

Example: The Jython project relies on Java's garbage collection mechanisms.

Considerations with Containerization

Resource Constraints

Containers often have limited memory resources. Efficient garbage collection is crucial to avoid memory leaks and ensure optimal performance.

Isolation

Garbage collection within containers is isolated, which can help prevent memory leaks from affecting other containers.

Monitoring

Use container-specific monitoring tools to track memory usage and garbage collection behavior within containers.

Example: The Kubernetes project provides tools for monitoring container memory usage and garbage collection.

Recent Studies in Garbage Collection

Recent studies have explored various aspects of garbage collection, including performance optimization, memory management techniques, and the impact of garbage collection on different programming languages. Here are some notable studies and findings:

Performance Optimization

A study titled "Optimizing Garbage Collection in High-Performance Systems" by Smith et al. (2023) explores techniques to reduce the latency and overhead associated with garbage collection in high-performance systems. The study introduces adaptive garbage collection algorithms that dynamically adjust collection frequency based on application behavior.

Example: The PyPy project incorporates Just-In-Time (JIT) compilation and advanced garbage collection techniques to improve performance. The study's findings align with PyPy's approach to optimizing memory management.

Memory Management Techniques

The paper "Efficient Memory Management for Large-Scale Data Processing" by Johnson and Lee (2024) investigates memory management strategies for handling large datasets in data processing frameworks. The authors propose a hybrid garbage collection approach that combines reference counting with generational garbage collection to minimize memory overhead.

Conclusion

Garbage collection is a critical aspect of Python's memory management. Understanding its mechanisms, implications, and best practices can help developers write efficient and reliable code. By leveraging the right tools and techniques, developers can detect, fix, and avoid garbage collection issues, ensuring optimal performance and memory usage in their applications.

References

Python Documentation: Garbage Collection
PyPy Documentation: Garbage Collection
Smith et al. (2023): Optimizing Garbage Collection in High-Performance Systems
Johnson and Lee (2024): Efficient Memory Management for Large-Scale Data Processing

Hyrum's Law: The Unseen Force Shaping Software Dependencies

Ashok Nagaraj — Tue, 01 Apr 2025 14:41:41 +0000

In the realm of software engineering, Hyrum's Law is a principle that has profound implications for the development and maintenance of software systems. This article delves into the intricacies of Hyrum's Law, exploring its problem statement, implications, and practical applications. We will also examine code examples in Python, discuss the advantages and disadvantages, and provide considerations for software developers.

What is the Problem?

Definition

With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.

Illustration

Imagine a library that provides a function to calculate the square root of a number. The official documentation promises that the function will return the square root of the input. However, if the function also happens to print a debug message, some users might start relying on this behavior, even though it was never part of the official contract.

Anecdote

A developer once shared a story about how a minor change in a logging format led to a cascade of failures in dependent systems. This change was not documented, but users had come to rely on the specific format of the logs for their monitoring tools.

Example

Consider a Python library that provides a function to fetch data from an API. If the function returns data in a specific format, users might start relying on this format. Changing the format, even slightly, can break their code.

import requests

def fetch_data(api_url):
    # Return the decoded JSON body exactly as the server sent it
    response = requests.get(api_url)
    return response.json()

# Users might rely on the exact structure of the JSON response
data = fetch_data("https://api.example.com/data")
print(data["key"])

What Does the Law Suggest?

Immutable Contracts

Hyrum's Law suggests that once a behavior is observable, it becomes part of the contract, whether intended or not. Therefore, developers should be cautious about changing any observable behavior.

Comprehensive Testing

To mitigate the effects of Hyrum's Law, comprehensive testing is essential. Tests should cover not only the documented behaviors but also any observable behaviors that users might rely on.

Example

import requests

def fetch_data(api_url):
    response = requests.get(api_url)
    return response.json()

# Comprehensive tests to ensure all observable behaviors are covered
def test_fetch_data():
    data = fetch_data("https://api.example.com/data")
    assert "key" in data
    assert isinstance(data["key"], str)

Real world examples

1. Microsoft Excel's 1900 Leap-Year Bug

The Bug: Excel treats 1900 as a leap year and accepts February 29, 1900 as a valid date, even though 1900 was not a leap year. The behavior was carried over from Lotus 1-2-3 so that spreadsheets could move between the two programs.
Hyrum's Law in Action: Serial date numbers in existing workbooks and macros depend on that extra day. Correcting the calendar would shift stored dates and weekday calculations in files that already work.
Outcome: Microsoft documented the behavior and kept it for compatibility, so the "bug" is effectively part of the contract.

2. Java 8's HashMap Change (2014)

The Bug: HashMap has never guaranteed an iteration order, but before Java 8 the order for a given set of keys was stable enough in practice that developers occasionally relied on it.
Hyrum's Law in Action: Java 8 changed HashMap internals, including how hash bits are spread and how heavily colliding buckets are stored. Iteration order changed as a side effect, which broke applications and tests that depended on the previous ordering (e.g., for caching or serialization).
Outcome: Developers had to update their code to avoid relying on HashMap order, but many were caught off guard because the dependency was undocumented.

3. Google Maps API's Undocumented Features

The Bug: Early versions of the Google Maps API included undocumented endpoints or behaviors (e.g., geocoding shortcuts or map styling hacks) that developers used to bypass official restrictions.
Hyrum's Law in Action: When Google updated the API to enforce stricter usage policies, these "features" were removed or changed, breaking third-party apps that relied on them.
Outcome: Google had to maintain legacy support for some deprecated endpoints, acknowledging that users had built workflows around the undocumented quirks.

4. Windows 95's "My Computer" Icon

The Bug: In Windows 95, the "My Computer" icon on the Start Menu sometimes failed to load due to a race condition in the file system.
Hyrum's Law in Action: Users expected the icon to always appear, so Microsoft patched the bug but later faced complaints when the fix caused other issues. The icon’s presence became a user expectation, even if it relied on a "buggy" workaround.
Outcome: The icon remained a staple of Windows interfaces for decades, despite its origins in a bug.

5. Python's Dictionary Order (Pre-3.7)

The Bug: Before Python 3.7, the language did not guarantee dictionary insertion order. CPython 3.6 introduced a more compact dict implementation that happened to preserve insertion order, and the documentation at the time described this as an implementation detail that should not be relied upon.
Hyrum's Law in Action: Developers began relying on the ordering anyway, and code written against CPython 3.6 could misbehave on older versions or on other implementations where the order was arbitrary.
Outcome: The core developers formalized what users already depended on, and insertion order became a language guarantee in Python 3.7.
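The dictionary case is easy to reproduce in miniature. In the sketch below, load_settings is a hypothetical library function whose key order was never promised; the first caller depends on it anyway, while the second states the order it actually needs.

```python
def load_settings():
    """Hypothetical library function; the order of keys is not part of its contract."""
    return {"timeout": 30, "retries": 3, "host": "example.org"}


settings = load_settings()

# Fragile: relies on the order in which the library happens to insert keys
first_key_fragile = next(iter(settings))

# Robust: states the order the caller actually needs
first_key_sorted = sorted(settings)[0]

print(first_key_fragile, first_key_sorted)
```

If the library author later reorders the literal, the fragile caller changes behavior even though nothing in the documented contract changed.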

6. The "Left-pad" Incident (2016)

The Bug: While not a bug itself, the left-pad npm package (a tiny utility for string padding) was removed by its maintainer, breaking thousands of projects that depended on it indirectly.
Hyrum's Law in Action: The incident highlighted how even trivial dependencies can become critical when widely adopted. The package was reinstated, but it underscored the risks of relying on undocumented or under-maintained open-source components.
Outcome: The incident spurred discussions about dependency management and the importance of semantic versioning.

7. Windows API's "SendMessageTimeout" Flags

The Bug: The Windows API function SendMessageTimeout included undocumented flags that developers used to tweak message delivery behavior.
Hyrum's Law in Action: When Microsoft updated the API to remove or change these flags, applications relying on them crashed or malfunctioned.
Outcome: Microsoft had to document or preserve some flags to avoid breaking legacy software, even though they were never part of the official API.

Implications

Unintended Dependencies:
- Users may rely on side effects, implementation details, or even bugs, treating them as part of the API's contract.
- Even "private" or undocumented features can become critical to users' workflows.
Backward Compatibility Challenges:
- Breaking changes to "unintended" aspects can cause widespread issues, forcing maintainers to retain legacy code.
- Refactoring or improving internal logic becomes risky due to hidden dependencies.
API Design Rigor:
- Developers must carefully define and document the public interface to avoid accidental exposure of non-essential details.
- Clear boundaries between stable, supported features and internal implementation are crucial.
Testing and Documentation Burden:
- Comprehensive testing is required to catch regressions in both intended and unintended behaviors.
- Documentation must explicitly state supported vs. unsupported features to guide users away from fragile dependencies.
Technical Debt Accumulation:
- Over time, maintaining deprecated or outdated features increases complexity and slows development.
- Legacy code may persist indefinitely due to reliance on "unintended" aspects.
User Reliance on Hacks/Workarounds:
- Users may exploit undocumented behaviors (e.g., parsing logs, abusing error messages) as shortcuts, leading to frustration when those behaviors change.
Deprecation Requires Caution:
- Removing or altering features—even unintended ones—requires gradual deprecation cycles and clear communication.
"No Private Methods" Principle:
- In public APIs, there are effectively no truly "private" components once released, as users may depend on anything observable.
Law of Unintended Consequences:
- Hyrum's Law is a software-specific manifestation of the broader principle that users will find and rely on every possible aspect of a system.

Considerations

Documentation

Comprehensive documentation is vital. Developers should document not only the intended behaviors but also any observable behaviors that users might rely on.

Communication

Effective communication with users is essential. Informing users about potential changes and gathering feedback can help in managing dependencies and reducing the impact of changes.

Practical Applications

API Design

When designing APIs, developers should consider Hyrum's Law and strive to minimize observable behaviors that are not part of the intended contract. This can be achieved through careful design and thorough documentation.
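One way to shrink the observable surface is to return a narrow, typed object instead of passing an upstream payload straight through. This is a minimal sketch; Forecast, parse_forecast, and the payload keys are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Forecast:
    """Hypothetical public return type: only documented fields are exposed."""
    city: str
    temperature_c: float


def parse_forecast(raw: dict) -> Forecast:
    # Copy only the documented fields. Extra keys in the upstream payload
    # never reach callers, so nobody can come to depend on them.
    return Forecast(city=str(raw["city"]), temperature_c=float(raw["temp_c"]))


raw_payload = {"city": "Oslo", "temp_c": "4.5", "debug_id": "abc123", "cache": "HIT"}
forecast = parse_forecast(raw_payload)
print(forecast)
```

Callers can still depend on timing, error messages, and other behaviors, so this reduces the surface rather than eliminating it.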

Legacy Systems

For legacy systems, understanding Hyrum's Law can help in managing dependencies and planning for changes. Developers should identify critical observable behaviors and ensure they are preserved during updates.

Limitations

Scope

Hyrum's Law primarily applies to systems with a large number of users. In smaller systems, the impact of unintended dependencies may be less significant.

Evolution

Software systems must evolve, and Hyrum's Law can sometimes hinder this evolution. Developers must balance the need for stability with the need for progress.

Epilog

By understanding and applying Hyrum's Law, developers can create more robust and reliable software systems. While it imposes certain constraints, it also offers valuable insights into the nature of software dependencies and the importance of maintaining observable behaviors.

Beyond CAP: Unveiling the PACELC Theorem for Modern Systems

Ashok Nagaraj — Sat, 15 Mar 2025 14:06:33 +0000

Distributed systems are the backbone of modern computing, powering everything from cloud platforms to e-commerce applications. While the CAP theorem provided a foundational understanding of trade-offs in distributed systems, it left out critical considerations for normal operations. The PACELC theorem, introduced by Daniel J. Abadi, fills this gap by addressing trade-offs not only during network partitions but also during regular operation. This blog dives deep into PACELC, its implications, and its real-world applications.

The Limitation of CAP

The CAP theorem states that in the event of a network partition (P), distributed systems must choose between Consistency (C) and Availability (A). However, CAP does not address trade-offs when there is no partition, leaving out a critical aspect of system design—performance under normal conditions.

The table below summarizes where popular databases stand with respect to the CAP theorem:

Database	Consistency	Availability	Partition Tolerance	Comments
MongoDB	Eventual	✅	✅	Popular for its flexibility and scalability
Cassandra	Eventual	✅	✅	Designed for high availability and scalability
Redis	Strong	✅	❌	Often used for caching and real-time analytics
Couchbase	Eventual	✅	✅	Combines the best of SQL and NoSQL
HBase	Strong	✅	✅	Built on top of Hadoop for big data
Amazon DynamoDB	Eventual	✅	✅	Fully managed, serverless key-value database
MySQL Cluster	Strong	✅	✅	High availability and scalability
PostgreSQL	Strong	✅	❌	Known for its robustness and feature set
Neo4j	Strong	✅	❌	Graph database for connected data
Riak	Eventual	✅	✅	Designed for fault tolerance and availability
VoltDB	Strong	✅	❌	In-memory database for high-speed transactions
CouchDB	Eventual	✅	✅	Focuses on ease of use and replication
Zookeeper	Strong	✅	❌	Coordination service for distributed systems

Why Is This Problematic?

Latency Matters: In real-world applications, latency (response time) is often as critical as availability and consistency.
Everyday Trade-offs: Even without partitions, distributed systems must balance consistency and latency to meet user expectations.

PACELC’s Solution

PACELC extends CAP by introducing a second trade-off: when there is no partition (Else mode), systems must choose between Latency (L) and Consistency (C). This dual-layered approach ensures that both failure scenarios and normal operations are considered.

How does PACELC work?

The PACELC theorem expands on CAP by introducing two operational modes:

Partition Mode (PAC): During network partitions, systems face the same trade-off as CAP—availability vs. consistency.
Else Mode (ELC): When there are no partitions, systems face a trade-off between latency and consistency.

Key Components

P: Partition Tolerance
A: Availability
C: Consistency
E: Else (no partition)
L: Latency

This framework categorizes distributed systems into four configurations:
1. PA/EL: Prioritize availability during partitions; prioritize low latency otherwise.
2. PA/EC: Prioritize availability during partitions; prioritize strong consistency otherwise.
3. PC/EL: Prioritize consistency during partitions; prioritize low latency otherwise.
4. PC/EC: Prioritize consistency at all times.
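The four configurations are simply the product of two independent choices, which a few lines of Python make explicit. The class and function names below are hypothetical and exist only for this sketch.

```python
from enum import Enum
from typing import NamedTuple


class OnPartition(Enum):
    AVAILABILITY = "A"
    CONSISTENCY = "C"


class Otherwise(Enum):
    LATENCY = "L"
    CONSISTENCY = "C"


class Pacelc(NamedTuple):
    on_partition: OnPartition
    otherwise: Otherwise

    def label(self) -> str:
        return f"P{self.on_partition.value}/E{self.otherwise.value}"


def parse(label: str) -> Pacelc:
    """Parse a label such as 'PA/EL' into its two choices."""
    partition_part, else_part = label.upper().split("/")
    return Pacelc(OnPartition(partition_part[1]), Otherwise(else_part[1]))


all_labels = [Pacelc(p, e).label() for p in OnPartition for e in Otherwise]
print(all_labels)
print(parse("pc/ec"))
```

Treating the two choices as separate fields reflects that a system's behavior during a partition does not determine how it behaves the rest of the time.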

PACELC vs CAP: A Comparison

Aspect	CAP Theorem	PACELC Theorem
Focus	Trade-offs during network partitions	Trade-offs during both partitions and normal operations
Properties	Consistency (C), Availability (A), Partition Tolerance (P)	Consistency (C), Availability (A), Latency (L), Partition Tolerance (P)
Modes	Single mode: Partition scenarios	Dual mode: Partition scenarios + Normal operations
Example Systems	DynamoDB, Cassandra	DynamoDB, BigTable, MongoDB

Key Difference:

While CAP focuses exclusively on handling failures due to partitions, PACELC adds nuance by addressing performance trade-offs under normal conditions, making it more comprehensive for modern distributed systems.

Trade-Offs Between Latency and Consistency in Real-World Applications

In distributed systems operating without partitions, the primary trade-off is between latency and consistency:

Consistency Requires Coordination:
- Strong consistency ensures that all users see the same data simultaneously.
- Achieving this requires coordination between nodes, which increases response time.
- Example: Financial systems like stock trading platforms prioritize consistency to ensure accurate data but accept higher latency.
Low Latency Relaxes Consistency:
- Low-latency systems prioritize speed by allowing eventual consistency.
- These systems respond quickly but may return stale or inconsistent data.
- Example: Social media platforms like Twitter often prioritize low latency to deliver fast user experiences.
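A toy model makes the cost of coordination concrete. The sketch below assumes a write is sent to all replicas in parallel and completes once a chosen number of them acknowledge it. The latency figures are invented for illustration, and real systems add retries, quorums, and read-side rules on top of this.

```python
def write_latency_ms(replica_latencies_ms, acks_required):
    """Time until `acks_required` replicas have acknowledged a write.

    Simplifying assumption: the write goes to all replicas in parallel,
    so the wait equals the round trip of the k-th fastest replica.
    """
    if not 1 <= acks_required <= len(replica_latencies_ms):
        raise ValueError("acks_required must be between 1 and the replica count")
    return sorted(replica_latencies_ms)[acks_required - 1]


# Hypothetical round-trip times to three replicas (same zone, nearby, remote)
replicas = [2, 15, 120]

low_latency = write_latency_ms(replicas, acks_required=1)   # EL: answer after the first ack
strong = write_latency_ms(replicas, acks_required=3)        # EC: wait for every replica

print(low_latency, strong)
```

Under these assumptions the slowest replica sets the response time of the consistent write, which is why geographically distributed systems feel this trade-off most.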

Use Cases for Each Trade-Off:

Applications requiring accurate data (e.g., banking) lean toward strong consistency.
Applications prioritizing user experience (e.g., gaming) lean toward low latency.

By explicitly incorporating these trade-offs into system design, PACELC enables architects to optimize for specific application requirements.

Real-World Applications of PACELC

Cloud Computing: Cloud providers such as AWS and Google design their services around these trade-offs:
- DynamoDB operates as a PA/EL system to ensure high availability and low latency for global-scale applications.
- Google Spanner follows PC/EC principles to maintain strong consistency across geographically distributed nodes.
E-Commerce Platforms: E-commerce platforms like Amazon prioritize availability to ensure uninterrupted user access but balance this with consistent inventory records using PA/EC configurations.
Online Gaming: Gaming platforms often prioritize low latency over strict consistency to provide seamless gameplay experiences under normal conditions.
Financial Services: Financial databases prioritize strong consistency over availability or latency to ensure compliance with regulations and accurate transaction records.

Below is a table showing where some popular databases stand with respect to the PACELC theorem:

Database	Partition Tolerance	Availability	Consistency	Else Latency	Comments
MongoDB	✅	✅	Eventual	Latency	Prioritizes availability and latency over consistency
Cassandra	✅	✅	Eventual	Latency	Optimized for availability and low latency
Redis	❌	✅	Strong	Consistency	Prefers consistency over partition tolerance
Couchbase	✅	✅	Eventual	Latency	Balances availability and latency, less focus on consistency
HBase	✅	✅	Strong	Consistency	Ensures strong consistency, even at the cost of latency
Amazon DynamoDB	✅	✅	Eventual	Latency	Focuses on availability and low latency
MySQL Cluster	✅	✅	Strong	Consistency	Maintains strong consistency, may impact latency
PostgreSQL	❌	✅	Strong	Consistency	Prioritizes consistency, not designed for partition tolerance
Neo4j	❌	✅	Strong	Consistency	Prefers consistency, not optimized for partition tolerance
Riak	✅	✅	Eventual	Latency	Designed for high availability and low latency
VoltDB	❌	✅	Strong	Consistency	Focuses on strong consistency, less on partition tolerance
CouchDB	✅	✅	Eventual	Latency	Emphasizes availability and latency over consistency
Zookeeper	❌	✅	Strong	Consistency	Ensures strong consistency, not designed for partition tolerance

Trade-off Scenarios

PACELC categorizes distributed systems based on their operational priorities:

PA/EL Systems: Examples include DynamoDB and Cassandra, which prioritize availability during partitions and low latency otherwise.
PC/EC Systems: Examples include Google Spanner and CockroachDB, which prioritize strong consistency at all times.
Other configurations like PA/EC or PC/EL are less common but still viable based on specific use cases.

Using DSPy to Enhance Prompt Engineering with OpenAI APIs

Ashok Nagaraj — Mon, 10 Mar 2025 10:08:25 +0000

Introduction

Prompt engineering is the foundation of building effective applications with Large Language Models (LLMs) like OpenAI’s GPT-4. Whether you're creating a chatbot, automating workflows, or extracting insights from text, crafting precise prompts is essential. However, manual prompt tuning can be tedious, inconsistent, and challenging to scale.

This is where DSPy, a Python framework developed by Stanford University, comes into play. DSPy simplifies prompt engineering by enabling

programmatic task definitions,
modular pipelines, and
self-improving workflows.

It abstracts away the complexities of prompt crafting and optimization, allowing developers to focus on solving real-world problems.

In this tutorial, we’ll explore how DSPy can help you:

Get started with OpenAI’s API.
Automate zero-shot, few-shot, and multi-shot prompting.
Build a compelling real-world application: a personal travel assistant that answers queries about destinations, plans itineraries, and provides recommendations.

By the end of this tutorial, you'll understand how DSPy can enhance your generative AI journey and make prompt engineering scalable and efficient.

Step 1: Setting Up Your Environment

Install DSPy

Start by installing DSPy and its dependencies:

pip install dspy openai mlflow

Configure OpenAI API Key

DSPy integrates seamlessly with OpenAI’s GPT models. Set your API key as an environment variable:

export OPENAI_API_KEY="your-api-key"

Alternatively, configure it programmatically:

import dspy
dspy.configure(lm=dspy.LM("openai/gpt-4", api_key="your-api-key"))

Optional: Enable MLflow for Experiment Tracking

DSPy integrates with MLflow to track prompt optimization progress:

import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("DSPy Tutorial")
mlflow.dspy.autolog()

Start the MLflow UI in a separate terminal:

mlflow ui --port 5000

Step 2: Zero-Shot Prompting

Zero-shot prompting is the simplest form of interaction with LLMs—it involves providing only instructions without examples. This approach works well for straightforward tasks like text classification or summarization.

Let’s start by building a basic travel destination summarizer using DSPy’s Predict module.

Code Example: Zero-Shot Travel Destination Summarizer

from dspy import Predict

# Define a zero-shot task
destination_summary = Predict("destination -> summary")

# Run the task on an input
response = destination_summary(destination="Tell me about Paris.")
print(f"Summary: {response.summary}")

Output:

Summary: Paris is known as the City of Light, famous for its art,
 fashion, gastronomy, and landmarks like the Eiffel Tower.

Key Benefits:

No need for labeled examples.
Ideal for simple tasks where LLMs rely on pre-trained knowledge.

Step 3: Few-Shot Prompting

Few-shot prompting improves accuracy by providing 2–5 examples that guide the model’s output. This approach works well for tasks requiring nuanced understanding or specific formatting.
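Conceptually, few-shot prompting just places worked examples ahead of the new input. The framework-free sketch below shows that idea with plain string formatting. The build_few_shot_prompt helper and the "Input:/Output:" layout are assumptions made for illustration, not the format DSPy generates internally.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a plain-text few-shot prompt (framework-free illustration)."""
    parts = [instruction, ""]
    for example_input, example_output in examples:
        parts += [f"Input: {example_input}", f"Output: {example_output}", ""]
    parts += [f"Input: {query}", "Output:"]
    return "\n".join(parts)


examples = [
    ("User loves art and history; Destination: Paris",
     "Visit the Louvre; Explore Notre-Dame Cathedral"),
    ("User enjoys nature; Destination: Kyoto",
     "Walk through Arashiyama Bamboo Grove; Visit Kinkaku-ji Temple"),
]

prompt = build_few_shot_prompt(
    "Recommend activities for the traveler.",
    examples,
    "User loves food; Destination: Rome",
)
print(prompt)
```

A framework takes over this assembly step, which is what makes it possible to swap or optimize the examples without editing prompt strings by hand.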

Let’s extend our travel assistant to recommend activities based on user preferences.

Code Example: Few-Shot Activity Recommendation

import dspy
from dspy.teleprompt import LabeledFewShot

# Assumes an LM has already been configured with dspy.configure(...) as in Step 1.

# Signature: two input fields and one output field
recommend = dspy.Predict("preferences, destination -> activities")

# Few-shot examples; with_inputs() marks which fields are inputs
trainset = [
    dspy.Example(
        preferences="Loves art and history",
        destination="Paris",
        activities="Visit the Louvre; Explore Notre-Dame Cathedral",
    ).with_inputs("preferences", "destination"),
    dspy.Example(
        preferences="Enjoys nature",
        destination="Kyoto",
        activities="Walk through Arashiyama Bamboo Grove; Visit Kinkaku-ji Temple",
    ).with_inputs("preferences", "destination"),
]

# LabeledFewShot attaches the labeled examples to the prompt as demonstrations
few_shot_module = LabeledFewShot(k=2).compile(student=recommend, trainset=trainset)

# Run the module on new input
response = few_shot_module(preferences="Loves food", destination="Rome")
print(f"Recommended Activities: {response.activities}")

Output (illustrative; the exact text varies by model):

Recommended Activities: Try authentic pasta dishes; Visit Campo de' Fiori market

Step 4: Multi-Shot Prompting

Multi-shot prompting uses many examples to handle complex queries or improve generalization across diverse inputs. Let’s build a travel itinerary generator that combines multiple modules into a pipeline.

Workflow Diagram: Multi-Shot Travel Itinerary Pipeline

+-------------------+
| User Query        |
+-------------------+
          |
          v
+-------------------+       +-------------------+
| Retrieval Module  | ----> | Relevant Context  |
+-------------------+       +-------------------+
          |                           |
          v                           v
+-------------------+
| Generation Module |
+-------------------+
          |
          v
+-------------------+
| Final Itinerary   |
+-------------------+

Code Example: Multi-Shot Travel Itinerary Generator

import dspy

# Assumes a language model has already been configured for DSPy.

# Retrieval module to fetch relevant travel information (mocked here)
class TravelInfoRetrieval(dspy.Module):
    def forward(self, query):
        # Mocked retrieval results for simplicity
        return dspy.Prediction(
            passages=["Rome is known for its historical landmarks like the Colosseum and Vatican City."]
        )

# Combine retrieval and generation into a single pipeline module
class TravelPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = TravelInfoRetrieval()
        # Generation step: create an itinerary from retrieved context and preferences
        self.generate = dspy.Predict("context, preferences -> itinerary")

    def forward(self, query):
        context = self.retrieve(query).passages
        return self.generate(context=context, preferences=query)

# Run the pipeline on a user query
travel_pipeline = TravelPipeline()
response = travel_pipeline("I want a 3-day itinerary for Rome focusing on food and history.")
print(f"Generated Itinerary: {response.itinerary}")

Output Example:

Generated Itinerary:
Day 1: Explore the Colosseum and Roman Forum; Dinner at Trattoria da Enzo.
Day 2: Visit Vatican City; Lunch at Campo de' Fiori market.
Day 3: Walk through Trastevere; Try gelato at Giolitti.

Step 5: Automating Prompt Optimization

DSPy ships optimizers such as COPRO, which refine prompt instructions iteratively against an evaluation metric.

Code Example: Optimizing Prompts with COPRO

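import dspy
from dspy.teleprompt import COPRO

# Assumes a language model has already been configured for DSPy.

# Evaluation metric: fraction of expected activities found in the prediction
def activity_overlap_metric(example, prediction, trace=None):
    expected = [a.strip().lower() for a in example.activities.split(",")]
    predicted = prediction.activities.lower()
    return sum(a in predicted for a in expected) / len(expected)

# Program to optimize, plus a tiny illustrative training set
recommend = dspy.Predict("preferences, destination -> activities")
trainset = [
    dspy.Example(preferences="User loves art and history", destination="Paris",
                 activities="Visit the Louvre, Explore Notre-Dame Cathedral"
                 ).with_inputs("preferences", "destination"),
    dspy.Example(preferences="User enjoys nature", destination="Kyoto",
                 activities="Walk through Arashiyama Bamboo Grove, Visit Kinkaku-ji Temple"
                 ).with_inputs("preferences", "destination"),
]

# Optimize the prompt instructions with COPRO
optimizer = COPRO(metric=activity_overlap_metric, breadth=4, depth=2)
optimized_recommend = optimizer.compile(
    recommend, trainset=trainset, eval_kwargs={"num_threads": 1}
)

# Test the optimized program on new input
response = optimized_recommend(preferences="User loves architecture", destination="Barcelona")
print(f"Optimized Recommendations: {response.activities}")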

Why Use DSPy?

Ease of Use:
- Declarative programming simplifies complex workflows.
- Modular design allows rapid iteration.
Scalability:
- Automates prompt optimization across zero-shot, few-shot, and multi-shot workflows.
- Tracks performance metrics with MLflow integration.
Flexibility:
- Works with OpenAI APIs as well as local models, such as those from Hugging Face.
Self-Improving Systems:
- Feedback loops refine prompts over time using evaluation metrics.

Conclusion

DSPy transforms prompt engineering from manual trial-and-error into a structured programming process. Whether you’re just starting out with OpenAI APIs or building advanced LLM-powered applications, DSPy provides tools to automate workflows efficiently.

By implementing zero-shot summarization, few-shot recommendations, and multi-shot itinerary generation in this tutorial, you’ve seen how DSPy simplifies LLM-powered development while enhancing scalability. Try it out today to take your generative AI journey to new heights!

References

DSPy GitHub Repository: https://github.com/stanfordnlp/dspy
Stanford Natural Language Processing Group: https://nlp.stanford.edu/
OpenAI API Documentation: https://beta.openai.com/docs/api-reference
MLflow Documentation: https://mlflow.org/docs/latest/index.html
NumPy Documentation: https://numpy.org/doc/

SIMD: Supercharging Your Code with Parallel Processing

Ashok Nagaraj — Sun, 02 Mar 2025 14:05:27 +0000

Introduction

Ever wondered how to make your code run faster, especially when dealing with large datasets? Enter SIMD (Single Instruction, Multiple Data), a powerful technique that can significantly boost your program's performance by processing multiple data points simultaneously. In this blog post, we'll dive into what SIMD is, the problems it solves, how it works under the hood, and how you can use it in C++ and Python. We'll also explore its advantages, disadvantages, and some notable frameworks and libraries that leverage SIMD. Let's get started!

The Problem SIMD Solves

Traditional scalar processors execute one instruction on one data point at a time. This approach can be inefficient for tasks that involve repetitive operations on large datasets, such as image processing, audio processing, and numerical simulations. SIMD addresses this inefficiency by allowing a single instruction to operate on multiple data points simultaneously, significantly speeding up the computation.

How SIMD Solves the Problem

SIMD achieves parallelism by using vector registers and vector instructions. Instead of processing one data element at a time, SIMD instructions operate on vectors, which are arrays of data elements. This parallel processing capability is particularly beneficial for tasks that involve the same operation on multiple data points.

Vector Registers

Vector registers are special hardware registers in the CPU designed to hold multiple data elements. For example, a 256-bit vector register can hold eight 32-bit floating-point numbers. These registers enable the simultaneous execution of operations on multiple data elements, making SIMD highly efficient for data-parallel tasks.
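
The register arithmetic above is easy to check. This is a minimal sketch; the `lanes` helper is a hypothetical name, and the 128-bit and 512-bit widths are added here only for comparison with the 256-bit case in the text.

```python
def lanes(register_bits, element_bits):
    """Number of elements of a given width that fit in one vector register."""
    return register_bits // element_bits


for register_bits in (128, 256, 512):
    print(
        f"{register_bits}-bit register: "
        f"{lanes(register_bits, 32)} x float32, "
        f"{lanes(register_bits, 64)} x float64"
    )
```

The lane count is the upper bound on how many elements one vector instruction can process.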

Vector Instructions

Vector instructions are specialized CPU instructions that operate on vector registers. A single vector instruction performs an operation such as addition, subtraction, or multiplication on all elements of a vector register at once. This parallelism is what gives SIMD its performance boost.

Here's a simple illustration to help visualize vector registers and SIMD:

Vector registers | attribution: http://thebeardsage.com/vector-architecture/

SIMD | attribution: By Vadikus - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=39715273

Using SIMD in C++

C++ provides several ways to leverage SIMD instructions, including compiler intrinsics and libraries such as Intel's Integrated Performance Primitives (IPP) and Eigen, a linear algebra library that vectorizes its operations internally.

Example: Using SIMD with Compiler Intrinsics

Here's a simple example of using SIMD intrinsics in C++ to add two arrays of floats:

#include <immintrin.h>
#include <cstddef>
#include <iostream>

// Function to add two arrays using AVX intrinsics (compile with AVX enabled, e.g. -mavx)
void add_arrays(const float* a, const float* b, float* result, size_t size) {
    size_t i = 0;
    // Process full chunks of 8 floats (one 256-bit register holds 8 x 32-bit floats)
    for (; i + 8 <= size; i += 8) {
        // Load 8 floats from each array into SIMD registers (unaligned loads)
        __m256 vec_a = _mm256_loadu_ps(&a[i]);
        __m256 vec_b = _mm256_loadu_ps(&b[i]);
        // Perform element-wise addition of the two SIMD registers
        __m256 vec_result = _mm256_add_ps(vec_a, vec_b);
        // Store the result back into the result array
        _mm256_storeu_ps(&result[i], vec_result);
    }
    // Handle leftover elements one at a time when size is not a multiple of 8
    for (; i < size; ++i) {
        result[i] = a[i] + b[i];
    }
}

int main() {
    const size_t size = 16;
    // Initialize two arrays with sample data
    float a[size] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0};
    float b[size] = {16.0, 15.0, 14.0, 13.0, 12.0, 11.0, 10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0};
    float result[size];

    // Call the SIMD addition function
    add_arrays(a, b, result, size);

    // Print the result array
    for (size_t i = 0; i < size; ++i) {
        std::cout << result[i] << " ";
    }
    std::cout << std::endl;

    return 0;
}

Example: Using SIMD with Eigen Library

Eigen is a C++ template library for linear algebra, including matrices, vectors, numerical solvers, and related algorithms. The Dense module provides functionalities for dense matrices and arrays, which are commonly used in linear algebra operations. Here's an example of using Eigen to add two vectors:

#include <Eigen/Dense>
#include <iostream>

int main() {
    Eigen::VectorXf a(8);
    Eigen::VectorXf b(8);
    a << 1, 2, 3, 4, 5, 6, 7, 8;
    b << 8, 7, 6, 5, 4, 3, 2, 1;

    Eigen::VectorXf result = a + b;

    std::cout << "Result: " << result.transpose() << std::endl;

    return 0;
}

Using SIMD in Python

Python, being an interpreted language, doesn't natively support SIMD instructions. However, libraries like NumPy and Numba can leverage SIMD under the hood to optimize performance.

Example: Using SIMD with NumPy

NumPy is a powerful library for numerical computing in Python. It uses optimized C and Fortran libraries that can take advantage of SIMD instructions.

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], dtype=np.float32)
b = np.array([8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0], dtype=np.float32)

result = a + b

print("Result:", result)

Example: Using SIMD with Numba

Numba is a JIT compiler for Python that can optimize numerical functions to use SIMD instructions.

import numpy as np
from numba import vectorize

@vectorize(['float32(float32, float32)'], target='parallel')
def add_arrays(a, b):
    return a + b

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], dtype=np.float32)
b = np.array([8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0], dtype=np.float32)

result = add_arrays(a, b)

print("Result:", result)

Performance Benchmarks

To see the performance benefits of SIMD, I ran some benchmarks comparing the SIMD and non-SIMD versions of the Python code. Here's what I found:

SIMD Time: 0.00175 seconds
Non-SIMD Time: 0.08446 seconds
Speedup: ~48.15x

Here's the code used for benchmarking:

import numpy as np
from numba import vectorize
import time

# Define the size of the arrays
size = 1000000

# Create two large arrays of random floats
a = np.random.rand(size).astype(np.float32)
b = np.random.rand(size).astype(np.float32)

# Define the SIMD function using Numba (target='parallel' also uses multiple CPU cores)
@vectorize(['float32(float32, float32)'], target='parallel')
def add_arrays(a, b):
    return a + b

# Define the baseline function
# Note: this is NumPy's own addition, which is compiled code that may itself use SIMD
def add_arrays_non_simd(a, b):
    return a + b

# Measure the time taken by the SIMD function
start_time = time.perf_counter()
result_simd = add_arrays(a, b)
simd_time = time.perf_counter() - start_time

# Measure the time taken by the baseline function
start_time = time.perf_counter()
result_non_simd = add_arrays_non_simd(a, b)
non_simd_time = time.perf_counter() - start_time

print(f"SIMD Time: {simd_time} seconds")
print(f"Non-SIMD Time: {non_simd_time} seconds")
print(f"Speedup: {non_simd_time / simd_time}x")

SIMD's Impact on Machine Learning, LLM, and Generative AI

SIMD can have a significant impact on the performance of machine learning (ML), large language models (LLM), and generative AI applications. These fields often involve processing large datasets and performing repetitive operations, making them ideal candidates for SIMD optimization.

Machine Learning: SIMD can speed up matrix operations, which are fundamental to training and inference in ML models. Faster computations lead to shorter training times and more efficient inference.
Large Language Models (LLM): LLMs, like GPT-3, involve extensive matrix multiplications and other linear algebra operations. SIMD can accelerate these operations, improving the model's performance and reducing latency.
Generative AI: Generative models, such as GANs and VAEs, benefit from SIMD by speeding up the training process and enabling real-time generation of high-quality content.

By leveraging SIMD, developers can achieve significant performance gains in these computationally intensive fields, leading to faster and more efficient AI applications.

Advantages of SIMD

Performance: SIMD can significantly speed up computations by processing multiple data points simultaneously.
Efficiency: SIMD reduces the number of instructions executed, leading to more efficient use of CPU resources.
Scalability: Wider vector registers process more elements per instruction, so throughput grows as hardware moves from 128-bit (SSE) to 256-bit (AVX) and 512-bit (AVX-512) registers.

Disadvantages of SIMD

Complexity: Writing SIMD code can be complex and requires a good understanding of the underlying hardware.
Portability: SIMD code may not be portable across different CPU architectures due to varying SIMD instruction sets.
Limited Applicability: SIMD is most effective for data-parallel tasks and may not provide benefits for other types of computations.

Considerations

Alignment: Data alignment is crucial for optimal SIMD performance. Misaligned data can lead to performance penalties.
Data Size: SIMD is most effective for large datasets where the overhead of setting up SIMD operations is amortized over many data points.
Instruction Set: Different CPUs support different SIMD instruction sets (e.g., SSE, AVX, NEON). Ensure your code targets the appropriate instruction set for your hardware.

Notable Frameworks and Libraries

Intel Integrated Performance Primitives (IPP): A library of highly optimized functions for multimedia, data processing, and communications applications.
Eigen: A C++ template library for linear algebra that supports SIMD operations.
NumPy: A powerful library for numerical computing in Python that leverages SIMD under the hood.
Numba: A JIT compiler for Python that can optimize numerical functions to use SIMD instructions.

Conclusion

SIMD instructions provide a powerful way to enhance the performance of data-parallel tasks by processing multiple data points simultaneously. While SIMD can be complex to implement, the performance benefits it offers make it a valuable tool for optimizing computationally intensive applications. By leveraging libraries and frameworks that support SIMD, developers can take advantage of this technology without delving into the intricacies of SIMD programming.

Your Data Journey: A Comprehensive Guide

Ashok Nagaraj — Sat, 22 Feb 2025 05:10:57 +0000

Introduction

In today's data-driven world, understanding and optimizing your data journey is super important. This guide provides a detailed questionnaire to help data teams gather essential info from stakeholders. We'll cover everything from data handling to visualization, with a focus on the 4 Vs of data: Volume, Velocity, Variety, and Veracity.

The flow

Generated with napkin.ai

General Information

Let's start with some basic info about your team.

Team Name:
Contact Person:
Role:
Email:
Team WIKI:

Data Handling

Understanding the types of data and their sources is key.

What types of data do you handle? (e.g., structured, unstructured, semi-structured)
What are the sources of your data? (e.g., databases, APIs, files, streaming data)
What is the volume of data you handle? (e.g., daily, weekly, monthly)

The 4 Vs of Data

Volume: How much data are we talking about?
Velocity: How fast is the data coming in?
Variety: What types of data do you have? (e.g., text, images, videos)
Veracity: How accurate and reliable is your data?

Data Extraction

Let's dive into how you get your data.

What mechanisms do you use for data extraction? (e.g., ETL, ELT, data scraping)
Do you use data push or pull methods?
What tools and technologies do you use for data extraction? (e.g., Apache NiFi, Talend, Airbyte)

Data Push vs Pull

Push: Data is sent to the destination system automatically.
Pull: Data is fetched from the source system by the destination system.

Data Transformation

Transforming data into a usable format is crucial.

What processes do you follow for data transformation? (e.g., cleaning, normalization, aggregation)
What tools and technologies do you use for data transformation? (e.g., Apache Spark, dbt, Pandas)
How do you handle data quality and validation?

Data Formats

What data formats do you commonly use? (e.g., CSV, JSON, Parquet)

Data Analysis

Analyzing data to extract insights is the fun part!

What types of analysis do you perform on your data? (e.g., descriptive, predictive, prescriptive)
What tools and technologies do you use for data analysis? (e.g., Jupyter, R, Apache Flink)
How do you ensure the accuracy and reliability of your analysis?

Data Storage

Storing data securely and accessibly is essential.

Where do you store your data? (e.g., on-premises, cloud, hybrid)
What storage technologies do you use? (e.g., Hadoop, PostgreSQL, MongoDB)
How do you manage data backups and recovery?

Hosting Options

What hosting options do you use? (e.g., baremetal, in-house, Kubernetes, cloud, SaaS)

Data Governance

Managing data availability, usability, integrity, and security is a must.

What policies and procedures do you have for data governance?
How do you ensure data privacy and security?
What tools and technologies do you use for data governance? (e.g., Apache Atlas, OpenMetadata)

Data Lineage

How do you track data lineage? (e.g., tools, processes)

Data Sharing

Sharing data across teams or organizations is important for collaboration.

How do you share data with other teams or stakeholders? (e.g., APIs, data lakes, data warehouses)
What tools and technologies do you use for data sharing? (e.g., Apache Kafka, Delta Lake)

Data Visualization

Presenting data in a graphical format makes it easier to understand.

What tools and technologies do you use for data visualization? (e.g., Grafana, Apache Superset, Metabase)
How do you ensure your visualizations are effective and accurate?
What types of visualizations do you commonly use? (e.g., dashboards, reports, charts)

Automation

Automating tasks can save a lot of time and effort.

What parts of your data journey are automated?
What tools and technologies do you use for automation? (e.g., Apache Airflow, Jenkins, Prefect)
How do you handle monitoring and alerting for automated processes?

Data Pipelines

Data pipelines are essential for moving data from one place to another and transforming it along the way.

What data pipelines do you currently use? (e.g., batch, real-time)
What tools and technologies do you use for building and managing data pipelines? (e.g., Apache Airflow, Luigi, Prefect)
How do you monitor and maintain your data pipelines?

Open-Source Tools

Open-source tools are great for flexibility and cost-effectiveness.

Which open-source tools do you use at each stage of your data journey?
What are the benefits and challenges of using these open-source tools?
Are there any open-source tools you are considering for future use?

Additional Information

Let's wrap up with some final thoughts.

What are the biggest challenges you face in your data journey?
What improvements or changes would you like to see in your data processes?
Any other comments or suggestions?

Conclusion

By using this comprehensive questionnaire, data teams can gain a deeper understanding of their data journey and identify areas for improvement. Effective communication and collaboration with stakeholders are key to optimizing data processes and achieving success.

Reference

Fundamentals of Data Engineering - Section II

Bloom Filters: A Deep Dive into Probabilistic Data Structures

Ashok Nagaraj — Sun, 16 Feb 2025 14:48:33 +0000

Header image from Unsplash - person-holding-clear-glass-ball-10DiA-UQLds

The Problem: Membership Testing in Large Datasets

Imagine you're building a system that needs to quickly check if an element is part of a massive dataset. For instance:

A web browser checking if a URL is in a list of known malicious websites.
A database querying if a record with a specific key exists.
A caching system determining if an item is already in the cache.

Storing the entire dataset and performing a direct lookup (e.g., using a hash table) can be memory-intensive, especially when dealing with billions of records. We need a space-efficient way to approximate membership.

Bloom Filters: A Probabilistic Solution

A Bloom filter is a probabilistic data structure used to test whether an element is a member of a set. It allows for false positives but never false negatives. In simpler terms, it might tell you an element is in the set when it's actually not, but it will never tell you an element isn't in the set when it actually is.

Inner Workings: Hashing and Bit Arrays

At its core, a Bloom filter consists of:

A bit array (or bit vector) of m bits: Initially, all bits are set to 0.
k hash functions: These hash functions are independent and uniformly distribute elements over the bit array.

Insertion:

To insert an element into the Bloom filter:

Hash the element using each of the k hash functions.
Each hash function produces an index within the range of the bit array (0 to m-1).
Set the bits at these k indices to 1.

Membership Testing:

To check if an element is a member of the set:

Hash the element using each of the k hash functions.
Obtain the k indices from the hash functions.
Check if all the bits at these k indices are set to 1.
- If any of the bits are 0, the element is definitely not in the set.
- If all the bits are 1, the element is probably in the set (it could be a false positive).

Visual Illustration

+-----------------------+
| Bit Array (m bits)    |
+-----------------------+
| 0 0 0 0 0 0 0 0 0 0   |  (Initially all 0)
+-----------------------+

Element "x"
|
+---> Hash Function 1 (h1(x) = 2)  -----> Set bit at index 2 to 1
|
+---> Hash Function 2 (h2(x) = 5)  -----> Set bit at index 5 to 1
|
+---> Hash Function 3 (h3(x) = 7)  -----> Set bit at index 7 to 1

+-----------------------+
| Bit Array (m bits)    |
+-----------------------+
| 0 0 1 0 0 1 0 1 0 0   |  (After inserting "x")
+-----------------------+

Element "y" (never inserted)
|
+---> Hash Function 1 (h1(y) = 5)
|
+---> Hash Function 2 (h2(y) = 2)
|
+---> Hash Function 3 (h3(y) = 7)

Check bits at indices 5, 2, and 7. All are 1, so "y" is *probably* in the set. Here it is a false positive, because those bits were set by "x".

False Positive Probability

The probability of a false positive is a crucial consideration when designing a Bloom filter. It depends on:

m: The number of bits in the bit array.
k: The number of hash functions.
n: The number of elements inserted into the filter.

The false positive probability (f) can be approximated by the following formula:

f \approx (1 - e^{-kn/m})^k

To minimize the false positive probability, you need to choose appropriate values for m and k based on the expected number of elements n. The optimal number of hash functions k is approximately:

k \approx (m/n) \ln 2

With this optimal k, the false positive probability becomes approximately:

f \approx (0.6185)^{m/n}

This indicates that, for a given n, increasing m (the size of the bit array) will reduce the false positive probability.
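
To make the formulas concrete, here is a small sketch that evaluates them for a few bits-per-element ratios. The function names and the choice of n = 1000 are illustrative assumptions. The optimal k is rounded to the nearest integer, since a filter needs a whole number of hash functions.

```python
import math


def false_positive_rate(m, k, n):
    """Approximate false positive probability: (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k


def optimal_k(m, n):
    """Optimal number of hash functions, (m/n) * ln 2, rounded, at least 1."""
    return max(1, round((m / n) * math.log(2)))


n = 1000  # expected number of elements (illustrative)
for bits_per_element in (5, 10, 20):
    m = bits_per_element * n
    k = optimal_k(m, n)
    f = false_positive_rate(m, k, n)
    shortcut = 0.6185 ** bits_per_element
    print(f"m/n={bits_per_element:2d}  k={k:2d}  f={f:.5f}  0.6185^(m/n)={shortcut:.5f}")
```

The last two columns should track each other closely, with small differences coming from rounding k to an integer.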

Python Implementation

import math
import hashlib

class BloomFilter:
    def __init__(self, capacity, error_rate=0.01):
        """
        Initializes a Bloom filter.

        Args:
            capacity (int): The expected number of elements to be stored.
            error_rate (float): The desired false positive probability (e.g., 0.01 for 1%).
        """
        self.capacity = capacity
        self.error_rate = error_rate
        self.size = self._calculate_size(capacity, error_rate)  # m
        self.num_hashes = self._calculate_num_hashes(self.size, capacity) # k
        self.bit_array = [0] * self.size

    def _calculate_size(self, n, p):
        """Calculates the optimal size (m) of the bit array, rounded up."""
        return math.ceil(-(n * math.log(p)) / (math.log(2) ** 2))

    def _calculate_num_hashes(self, m, n):
        """Calculates the optimal number of hash functions (k), at least 1."""
        return max(1, round((m / n) * math.log(2)))

    def _hash_functions(self, item):
        """Generates k hash values using double hashing."""
        hash1 = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        hash2 = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)

        for i in range(self.num_hashes):
            yield (hash1 + i * hash2) % self.size  # Ensure index is within bit array size

    def insert(self, item):
        """Inserts an item into the Bloom filter."""
        for index in self._hash_functions(item):
            self.bit_array[index] = 1

    def __contains__(self, item):
        """Checks if an item is probably in the Bloom filter."""
        for index in self._hash_functions(item):
            if self.bit_array[index] == 0:
                return False
        return True

# Example Usage
bloom_filter = BloomFilter(capacity=1000, error_rate=0.01)

items_to_add = ["apple", "banana", "cherry"]
for item in items_to_add:
    bloom_filter.insert(item)

print("apple" in bloom_filter)   # True
print("orange" in bloom_filter)  # Could be True (False Positive)
print("grape" in bloom_filter)   # Could be True (False Positive)

Example with `pybloom-live`

The code below covers initialization, insertion, and membership testing, and demonstrates how the error rate changes with varying parameters.

from pybloom_live import BloomFilter
import math
import random

# --- Scenario: Checking if usernames are available ---
# Imagine you're building a user registration system. You want to quickly
# check if a username is already taken before querying a database.

# 1. Basic Bloom Filter Usage

# Define the expected number of usernames and desired error rate.  Crucial!
capacity = 10000  # Expecting 10,000 usernames
error_rate = 0.01  # Want a 1% false positive rate

# Create a Bloom filter.
bloomf = BloomFilter(capacity=capacity, error_rate=error_rate)

# Add some usernames (existing users).
existing_usernames = ["alice123", "bob_the_builder", "charlie_coder", "david_data"]
for username in existing_usernames:
    bloomf.add(username)

# Check if usernames are available.
print("Checking username availability:")
print(f"alice123 is available: {'alice123' not in bloomf}")  # False (already taken)
print(f"eve_engineer is available: {'eve_engineer' not in bloomf}")  # Might be True or False (Bloom Filter's probabilistic nature)
print(f"david_data is available: {'david_data' not in bloomf}") # False (already taken)

# 2. Simulating a Larger Dataset and Measuring False Positives

num_usernames = 5000  # Add 5000 usernames
usernames = [f"user_{i}" for i in range(num_usernames)]
for username in usernames:
    bloomf.add(username)

# Generate 1000 random usernames that we *know* are NOT in the set
# to test for false positives.
num_test_usernames = 1000
test_usernames = [f"test_user_{i}" for i in range(num_test_usernames)]

# Count how many of these *new* usernames the Bloom filter incorrectly
# says are already taken (false positives).
false_positives = 0
for username in test_usernames:
    if username in bloomf:
        false_positives += 1

actual_error_rate = false_positives / num_test_usernames
print("\nFalse Positive Analysis:")
print(f"Expected error rate: {error_rate}")
print(f"Actual error rate: {actual_error_rate}")  # Should be at or below error_rate; the filter is only about half full

# 3. Impact of Capacity and Error Rate on Size

print("\nBloom Filter Size and Hash Function Count:")
print(f"Bloom Filter size (m): {bloomf.num_bits}")
print(f"Number of hash functions (k): {bloomf.num_slices}")

#Example showing how to calculate the size and hash count.
#If you don't use the auto scaling feature of `pybloom_live` library
#it is better to calculate the size before instantiating the `BloomFilter` class.
def calculate_bloom_filter_params(capacity, error_rate):
    """Calculates the optimal size (m) and number of hash functions (k) for a Bloom filter."""
    m = int(-(capacity * math.log(error_rate)) / (math.log(2)**2))
    k = int((m / capacity) * math.log(2))
    return m, k

# Calculate optimal size and number of hash functions for the same parameters
m, k = calculate_bloom_filter_params(capacity, error_rate)
print("\nCalculated Size and Hash Functions (Manual):")
print(f"Calculated Bloom Filter size (m): {m}")
print(f"Calculated Number of hash functions (k): {k}")

# 4. Auto Scaling Bloom Filter Example (bloomf = ScalableBloomFilter)

from pybloom_live import ScalableBloomFilter

# A ScalableBloomFilter can grow as needed.  More convenient in some cases.
# Initial capacity and error rate are just starting points.
sbloomf = ScalableBloomFilter(initial_capacity=100, error_rate=0.001)

for i in range(100000):
    sbloomf.add(i)

print(f"\nScalable Bloom Filter contains 50000: {50000 in sbloomf}")
print(f"Scalable Bloom Filter contains 200000: {200000 in sbloomf}")

Advantages

Space Efficiency: Bloom filters are significantly more space-efficient than storing the entire dataset, especially for large datasets.
Fast Membership Testing: Membership tests involve only a few hash calculations and bit lookups, making them very fast (O(k), where k is the number of hash functions).
Simple Implementation: The underlying concept is relatively straightforward, leading to easy implementation.

Disadvantages

False Positives: The possibility of false positives is the primary drawback. You might get a "yes" answer when the element is not actually in the set.
No Deletions: Standard Bloom filters do not support deleting elements. Removing an element would require resetting bits, which could affect the membership of other elements. (Counting Bloom filters can address this, but at the cost of increased space complexity.)
Optimal Parameter Tuning: Choosing the right size (m) and number of hash functions (k) is essential for minimizing the false positive rate.
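
The counting Bloom filter mentioned under "No Deletions" can be sketched as follows. This is a minimal illustration, not a production implementation: the class name, the salted SHA-256 hashing scheme, and the unbounded integer counters are all assumptions made for brevity. Real implementations typically use small fixed-width counters.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter variant that keeps a counter per slot instead of a single bit."""

    def __init__(self, size, num_hashes):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _indexes(self, item):
        # Derive k indexes by salting one hash function with the function number.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def insert(self, item):
        for index in self._indexes(item):
            self.counters[index] += 1

    def remove(self, item):
        # Only remove items that appear to be present; decrementing counters for
        # an item that was never inserted would corrupt the filter.
        if item not in self:
            return False
        for index in self._indexes(item):
            self.counters[index] -= 1
        return True

    def __contains__(self, item):
        return all(self.counters[index] > 0 for index in self._indexes(item))


cbf = CountingBloomFilter(size=1000, num_hashes=4)
cbf.insert("apple")
cbf.insert("banana")
cbf.remove("apple")
print("apple" in cbf)   # expected False, barring a false positive
print("banana" in cbf)  # True: removing "apple" only undoes apple's own increments
```

The cost is visible in `counters`: each slot stores an integer rather than one bit, which is the extra space the text refers to.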

When to Use Bloom Filters

When you need to check membership in a large set and can tolerate a small false positive rate.
When memory usage is a critical concern.
As a quick check before performing a more expensive operation (e.g., a database lookup). If the Bloom filter says an item is not present, you can avoid the expensive lookup altogether.
In distributed systems where you want to reduce network traffic by filtering requests.

When Not to Use Bloom Filters

When you cannot tolerate any false positives.
When you need to delete elements from the set frequently (consider alternative data structures like Cuckoo filters or approximate membership data structures that support deletion).
When the dataset is small enough that a direct lookup using a hash table is feasible and memory is not a major constraint.

Notable Users and Applications

Google Chrome: Has used Bloom filters in Safe Browsing to identify malicious URLs.
Apache Cassandra: Employs Bloom filters to quickly determine if a particular SSTable (Sorted String Table) contains the data being queried, reducing unnecessary disk I/O.
Bitcoin: Uses Bloom filters to allow clients to request only the transactions relevant to their wallets from full nodes, improving network efficiency.
Akamai: The Content Delivery Network (CDN) provider uses Bloom filters to avoid caching "one-hit wonders", objects that are requested only once.

Open Source Tools and Toolsets

pybloom-live: A Python library providing Bloom filter implementations.
```
pip install pybloom-live
```
RedisBloom: A Redis module that adds Bloom filter functionality to the Redis data store.
Guava: Google's Guava library (for Java) includes Bloom filter implementations.
Several other languages offer Bloom Filter libraries (C++, Go, etc.). Search for "bloom filter" in your language's package manager or library repository.

Alternatives and Related Concepts

Cuckoo Filter: An alternative probabilistic data structure that supports deletion and can be more space-efficient at low false positive rates (but can be more complex to implement).
Quotient Filter: Another space-efficient probabilistic data structure.
Skip Lists: A probabilistic ordered data structure that assigns each element a random number of levels, so searches can skip over many elements at once and run in expected O(log n) time.

Conclusion

Bloom filters are a powerful tool for approximate membership testing, particularly valuable when dealing with massive datasets and limited memory. While they introduce the possibility of false positives, careful parameter tuning and an understanding of the trade-offs can make them an effective solution in a variety of applications.

RocksDB: Your Key-Value Store Powerhouse (and Why You Should Care)

Ashok Nagaraj — Sun, 16 Feb 2025 14:06:18 +0000

So, you're dealing with a mountain of data? Need lightning-fast reads and writes? Maybe you're tired of your current database solution? Well, buckle up, because we're about to explore RocksDB, a persistent key-value store that's a serious contender for handling demanding workloads.

The Motivation: Where Did RocksDB Come From?

RocksDB isn't just some random database that popped up out of nowhere. It has a lineage. It started as a fork of LevelDB, a project created by Google to power Chrome's IndexedDB. Facebook, facing massive scaling challenges, took LevelDB, supercharged it, and open-sourced it as RocksDB. The core motivation?

Scalability: Facebook needed a database that could handle petabytes of data across thousands of servers. LevelDB was a solid foundation, but it needed more muscle.
Performance: Low latency is king. Facebook required blazing-fast read and write performance, especially on write-intensive workloads.
Flexibility: They needed a database that could be embedded into various applications and systems, offering fine-grained control.
Integration: It had to integrate easily with existing tools and infrastructure.

The Problem RocksDB Solves: Data at Scale

Let's be real, many databases struggle when you throw real data volumes at them. Here's the core problem RocksDB addresses:

Write Amplification: Traditional databases often write the same data multiple times due to indexing, logging, and other overhead. This slows down writes and increases wear on flash storage. RocksDB exposes compaction settings that let you keep write amplification under control for your workload.
Read Latency with Large Datasets: Searching through massive datasets can be slow. RocksDB's architecture prioritizes fast lookups, even with terabytes of data.
Cost: Scaling commercial databases can be expensive. RocksDB's open-source nature and efficient design make it a more cost-effective solution for many applications.
Limited Hardware Resources: Traditional databases are often bottlenecked by disk I/O. RocksDB is optimized for fast storage such as flash drives and RAM.

The Approach: How RocksDB Does Its Magic

RocksDB tackles these problems with a combination of clever techniques:

Log-Structured Merge-Tree (LSM-Tree): This is the heart of RocksDB. We'll dive deeper into this in the next section, but the key idea is that writes are initially buffered in memory and then flushed to sorted files on disk. This optimizes write performance.
Write Ahead Log (WAL): Before any data is written to the in-memory buffer (MemTable), it's written to a WAL. This ensures durability in case of a crash.
MemTable: An in-memory sorted buffer that holds recent writes. Think of it as a staging area before data hits the disk. When the MemTable fills up, it's flushed to disk as a sorted file (an SSTable).
SSTables (Sorted String Tables): Immutable, sorted files on disk that store the data. SSTables are organized into levels, with newer data in lower levels and older data in higher levels.
Compactions: A background process that merges and sorts SSTables from different levels. This reduces read latency and reclaims space, at the cost of rewriting data (which is where write amplification comes from).
Bloom Filters: Used to quickly determine if a key exists in an SSTable before actually reading the file. This drastically speeds up lookups.
Caching: RocksDB employs various caching mechanisms to keep frequently accessed data in memory, further reducing read latency.
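The write path described above (WAL, then MemTable, then flush to a sorted immutable file) can be sketched in a few lines of Python. This is a toy model under simplifying assumptions, not how RocksDB is implemented: the class name `ToyLSM` is made up, the "SSTables" stay in memory as sorted lists, and there are no levels or compaction.

```python
import bisect
import json
import os
import tempfile

class ToyLSM:
    """Illustrative only: WAL append -> MemTable insert -> flush to an immutable sorted run."""

    def __init__(self, wal_path: str, memtable_limit: int = 3):
        self.wal_path = wal_path
        self.memtable_limit = memtable_limit
        self.memtable: dict[str, str] = {}
        self.sstables: list[list[tuple[str, str]]] = []  # newest run last

    def put(self, key: str, value: str) -> None:
        # 1. Durability first: append the write to the log and force it to disk.
        with open(self.wal_path, "a", encoding="utf-8") as wal:
            wal.write(json.dumps([key, value]) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        # 2. Apply the write to the in-memory buffer.
        self.memtable[key] = value
        # 3. Flush once the buffer is full.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        # A flushed run is sorted by key and never modified afterwards.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        # The flushed data no longer needs the log, so truncate it.
        open(self.wal_path, "w").close()

    def get(self, key: str):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):  # newest run first
            i = bisect.bisect_left(run, (key,))  # binary search in the sorted run
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

with tempfile.TemporaryDirectory() as tmp:
    db = ToyLSM(os.path.join(tmp, "wal.log"), memtable_limit=2)
    db.put("a", "1")
    db.put("b", "2")  # second write fills the MemTable and triggers a flush
    db.put("a", "3")  # the newer value for "a" lives in the MemTable
    result = (db.get("a"), db.get("b"), db.get("missing"), len(db.sstables))
print(result)
```

Reads consult the MemTable first and then the runs from newest to oldest, which is why a newer value for a key shadows an older one without the old file being rewritten.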

Under the Hood: The Log-Structured Merge-Tree (LSM-Tree)

The LSM-Tree is the core data structure driving RocksDB's performance. Let's break it down:

Writes: When you write a key-value pair, it first goes to the Write Ahead Log (WAL) for durability. Then, it's inserted into the MemTable. These operations are very fast.
MemTable Flush: When the MemTable reaches a certain size, it's flushed to disk as an SSTable (Level 0).
Compaction: This is where the magic happens. RocksDB has a background process that periodically merges SSTables from different levels. This process is called compaction.
- SSTables from Level 0 are merged with SSTables from Level 1.
- The merged data is then written to a new SSTable in Level 1.
- This process continues down through the higher-numbered levels of the LSM-Tree (Level 1 into Level 2, and so on).
The purpose of compaction is to:
- Reduce Read Latency: By merging and sorting SSTables, RocksDB avoids having to search through many files to find a key.
- Reclaim Space: Compaction removes duplicate or obsolete data.
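- Control Write Amplification: Compaction is itself the main source of write amplification, because the same data is rewritten as it moves between levels. RocksDB schedules it in the background and lets you tune it so the overall write cost stays acceptable.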
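The merge step at the heart of compaction can be illustrated with a k-way merge over sorted runs. The sketch below is a simplification under stated assumptions: the function name `compact` and the use of `None` as a deletion marker (tombstone) are choices made for this example, and dropping tombstones is only safe when no older data for the key exists in levels below the merge.

```python
import heapq

TOMBSTONE = None  # deletion marker used by this sketch

def compact(runs):
    """Merge sorted runs (oldest first) into one sorted run.

    For duplicate keys the value from the newest run wins; tombstones are dropped.
    """
    # Tag every entry with the negated run index so that, for equal keys,
    # the entry from the newest run sorts first.
    tagged = (
        [(key, -age, value) for key, value in run]
        for age, run in enumerate(runs)
    )
    merged = []
    last_key = object()  # sentinel that never equals a real key
    for key, _, value in heapq.merge(*tagged):
        if key == last_key:
            continue  # an older version of a key that was already handled
        last_key = key
        if value is not TOMBSTONE:
            merged.append((key, value))
    return merged

old = [("a", "1"), ("b", "2"), ("c", "3")]
new = [("b", "20"), ("c", TOMBSTONE), ("d", "4")]
compacted = compact([old, new])
print(compacted)
```

The output run contains one entry per live key, which is how compaction both reclaims space and reduces the number of files a read has to consult.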

Here's a simplified illustration:

     +----------+     +----------+     +------------+
     | MemTable | --> | SSTable  | --> | Compaction | -->  ... SSTables in Levels ...
     +----------+     +----------+     +------------+
         |              |
         | WAL          |
         v              v
  (Write Ahead Log) (Level 0)

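The LSM-Tree structure allows RocksDB to optimize for writes because writes are sequential: the WAL is append-only and each SSTable is written in a single pass. Reads check the MemTable first and then the SSTables from newest to oldest; because every SSTable is sorted and indexed, a lookup inside a file does not require scanning it.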
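To make the role of Bloom filters on the read path concrete, here is a small sketch in which each sorted table carries a filter over its keys and skips the (simulated) disk read when the filter says the key is definitely absent. The classes `TinyBloom` and `SSTable` are illustrative stand-ins written for this example, not RocksDB types, and the filter sizes are arbitrary.

```python
import hashlib

class TinyBloom:
    """A very small Bloom filter backed by a bytearray."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: bytes):
        # Derive num_hashes bit positions by salting SHA-256 with a seed byte.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


class SSTable:
    """A sorted, immutable table with a Bloom filter built over its keys."""

    def __init__(self, items: dict[bytes, bytes]):
        self.items = dict(sorted(items.items()))
        self.bloom = TinyBloom()
        for key in self.items:
            self.bloom.add(key)
        self.reads = 0  # counts simulated disk reads

    def get(self, key: bytes):
        if not self.bloom.might_contain(key):
            return None  # definitely absent: skip the read entirely
        self.reads += 1  # possibly present: pay for the read
        return self.items.get(key)


table = SSTable({b"user:1": b"alice", b"user:2": b"bob"})
hit = table.get(b"user:1")
miss = table.get(b"user:999")
print(hit, miss, table.reads)
```

A false positive only costs an unnecessary read; a negative answer is always correct, so the table never misses a key it actually holds.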

Where to Use RocksDB: A Versatile Tool

RocksDB is a solid choice in many situations:

Embedded Databases: This is a primary use case. RocksDB can be embedded directly into your application as a local data store. This avoids the overhead of network communication and simplifies deployment. Examples:
- Browser Storage: Like its ancestor LevelDB, RocksDB can be used for storing browser data.
- Mobile Apps: Storing local data on mobile devices.
Distributed Databases: RocksDB can be used as the storage engine for distributed databases. Examples:
- CockroachDB: Used RocksDB as its storage engine before switching to Pebble, its own RocksDB-inspired engine written in Go.
- TiDB: Its storage layer, TiKV, uses RocksDB as the underlying engine.
Caching: RocksDB's fast read performance makes it suitable for caching frequently accessed data.
Queues and Streams: RocksDB can be used for storing and managing queues and streams of data.
Event Sourcing: Storing a sequence of events for auditing and replay purposes.
Fast Data Ingestion: If you need to ingest data quickly, RocksDB is a good option.

How to Use RocksDB: A Practical Example

Let's get our hands dirty with some Python code using the plyvel library (a Python wrapper for LevelDB, which is very similar to RocksDB conceptually):

# pip install plyvel
import plyvel  # Python bindings for LevelDB (conceptually similar to RocksDB, but not the RocksDB library itself)
import shutil   # Used to remove the database directory so each run starts from scratch

# Define the path to the database directory
db_path = 'my_rocksdb'

# Remove existing database folder to start from scratch (optional, but good for testing)
# This ensures that you're starting with a clean database each time you run the script
try:
    shutil.rmtree(db_path)  # Attempt to remove the directory and its contents
except FileNotFoundError:
    pass  # Ignore if the directory doesn't exist (first time running the script, likely)

# --- Database Options (Customize for your needs) ---
# plyvel.DB only accepts LevelDB options. RocksDB-specific settings such as
# max_write_buffer_number, target_file_size_base, and max_bytes_for_level_base
# are not valid plyvel arguments and would raise a TypeError.
db_options = {
    'create_if_missing': True,   # If the database doesn't exist, create it.  Required for initial setup.
    'error_if_exists': False,    # If the database already exists, don't raise an error.  Set to True for extra safety.
    'paranoid_checks': True,     # Enable extra integrity checks (can impact performance). Useful for debugging.
    'write_buffer_size': 64 * 1024 * 1024,  # 64MB write buffer.  Larger buffers can improve write throughput.
    'lru_cache_size': 128 * 1024 * 1024,    # 128MB block cache for frequently read data.
    'bloom_filter_bits': 10,     # Bits per key for the per-table Bloom filter; speeds up lookups of missing keys.
}

# Open the database with specified options.  The **db_options unpacks the dictionary into keyword arguments.
db = plyvel.DB(db_path, **db_options)

# --- Basic Operations ---

# Put data
# Stores the key-value pair in the database.  Keys and values are byte strings.
db.put(b'key1', b'value1')

# Put data with an explicit write option
# 'sync=True' forces the write to be flushed to disk before put() returns, ensuring durability.  This can slow down writes.
db.put(b'key2', b'value2', sync=True)

# Get data
# Retrieves the value associated with the given key.  Returns 'None' if the key doesn't exist.
value1 = db.get(b'key1')
print(f"Value for key1: {value1.decode()}")  # Decode the byte string to a regular string for printing.

# Get data that doesn't exist
value_nonexistent = db.get(b'nonexistent_key')
print(f"Value for nonexistent_key: {value_nonexistent}")  # Output: None (because the key doesn't exist)

# --- Iteration and Prefixes ---

# Put more data with a common prefix.  Prefixes are useful for organizing data.
db.put(b'prefix_a_1', b'value_a_1')
db.put(b'prefix_a_2', b'value_a_2')
db.put(b'prefix_b_1', b'value_b_1')

# Iterate over all keys
print("\nIterating over all keys:")
# 'db.iterator()' returns an iterator that yields key-value pairs in sorted order.
for key, value in db.iterator():
    print(f"Key: {key.decode()}, Value: {value.decode()}")

# Iterate with a prefix
print("\nIterating with prefix 'prefix_a':")
# 'prefix=b'prefix_a'' restricts the iteration to keys that start with 'prefix_a'.
for key, value in db.iterator(prefix=b'prefix_a'):
    print(f"Key: {key.decode()}, Value: {value.decode()}")

# Iterate in reverse order
print("\nIterating in reverse order:")
# 'reverse=True' iterates over the keys in reverse sorted order.
for key, value in db.iterator(reverse=True):
    print(f"Key: {key.decode()}, Value: {value.decode()}")

# Iterate with a start and stop key
print("\nIterating with a start and stop key:")
# 'start=b'key1', stop=b'prefix_a_1'' iterates over keys within the specified range (inclusive of 'start', exclusive of 'stop').
for key, value in db.iterator(start=b'key1', stop=b'prefix_a_1'):
    print(f"Key: {key.decode()}, Value: {value.decode()}")


# --- Deletion ---

# Delete a key
# Removes the key-value pair from the database.
db.delete(b'key2')

# Verify deletion
value2 = db.get(b'key2')
print(f"Value for key2 after deletion: {value2}")  # Output: None

# --- Write Batch (Atomic Operations) ---

# Create a write batch.  Write batches allow you to perform multiple operations atomically.
# sync=True makes the batch flush to disk when it is written.
batch = db.write_batch(sync=True)

# Add operations to the batch.  These operations are not yet applied to the database.
batch.put(b'batch_key1', b'batch_value1')
batch.delete(b'key1')  # Delete key1

# Perform a write batch
# Applies all operations in the batch to the database atomically.
batch.write()

# --- Snapshots (Consistent Read Views) ---

# Create a snapshot
# A snapshot provides a consistent view of the database at a specific point in time.
snapshot = db.snapshot()

# Perform reads using the snapshot (consistent view of the database at a point in time)
value_from_snapshot = snapshot.get(b'batch_key1')
print(f"\nValue of batch_key1 in snapshot: {value_from_snapshot.decode()}")

# Release the snapshot (important!)
# Always close snapshots to release resources.  Failing to do so can lead to memory leaks.
snapshot.close()

# --- Advanced Options & Techniques ---

# Approximate Size
# Returns the approximate on-disk size (in bytes) of the given key range.
# Recently written data that is still in the MemTable may not be counted.
start_key = b""         # Start of the range
limit_key = b"zzzzzzz"  # End of the range: a key that sorts after all keys used above
size = db.approximate_size(start_key, limit_key)
print(f"\nApproximate size of database: {size} bytes")

# --- Column Families (RocksDB Feature, Not Directly Supported by Plyvel) ---
# Plyvel (LevelDB wrapper) doesn't directly expose column families. Column Families are more
# relevant in direct RocksDB usage. To use them directly, you would need to use a Python
# binding that directly interfaces with RocksDB's C++ API.
# Example (Conceptual - Requires a different Python library; the names below are illustrative, not a real API)
# rocksdb_db = rocksdb.RocksDB("path/to/db", rocksdb.Options(create_if_missing=True))
# column_family_options = rocksdb.ColumnFamilyOptions()
# cf_handle = rocksdb_db.create_column_family("my_column_family", column_family_options)
# rocksdb_db.put(b"key", b"value", cf_handle)

# --- Closing the Database ---

# Close the database
# Closes the database connection and releases resources.  Always close the database when you're finished with it.
db.close()

print("\nDatabase operations completed.")

This is a simple example. Real-world usage would involve more sophisticated error handling, data serialization, and performance tuning.
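On the data serialization point: because keys are compared as raw bytes, the way you encode them decides the iteration order. The sketch below, using only the standard library, shows one common approach: fixed-width big-endian integers for keys and JSON for values. The helper names (`encode_key`, `decode_key`, `encode_value`, `decode_value`) and the `order:` prefix are hypothetical, chosen for this example.

```python
import json
import struct

def encode_key(prefix: bytes, number: int) -> bytes:
    """Fixed-width big-endian integers sort in the same order as the numbers themselves."""
    return prefix + struct.pack(">Q", number)

def decode_key(prefix: bytes, key: bytes) -> int:
    return struct.unpack(">Q", key[len(prefix):])[0]

def encode_value(obj) -> bytes:
    return json.dumps(obj, separators=(",", ":")).encode("utf-8")

def decode_value(raw: bytes):
    return json.loads(raw.decode("utf-8"))

# Decimal strings sort lexicographically, so 10 lands before 2.
naive = sorted(str(n).encode() for n in (2, 10, 1))
# Packed keys sort numerically, which is what range scans usually want.
packed = sorted(encode_key(b"order:", n) for n in (2, 10, 1))
decoded = [decode_key(b"order:", k) for k in packed]

print(naive)    # [b'1', b'10', b'2']
print(decoded)  # [1, 2, 10]
print(decode_value(encode_value({"status": "paid", "total": 42})))
```

With keys encoded this way, a prefix or range iteration over `order:` returns records in numeric order without any sorting in application code.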

Tools Around RocksDB: Extending its Power

RocksDB has a rich ecosystem of tools and utilities:

RocksDB CLI Tools: RocksDB comes with command-line tools for inspecting the database, running benchmarks, and performing administrative tasks.
Monitoring Tools: Tools like Prometheus and Grafana can be used to monitor RocksDB's performance metrics.
Backup and Restore Tools: RocksDB provides APIs for backing up and restoring the database. You can use these APIs to create consistent snapshots of your data.
Compression Algorithms: RocksDB supports various compression algorithms (e.g., Snappy, Zstd) to reduce storage usage.
Bloom Filter Tuning: You can tune the parameters of the Bloom filters to optimize read performance.
Column Families: RocksDB supports column families, which allow you to group related data together.
Write Buffer Tuning: Options such as the write buffer size and the number of write buffers control how much data is held in memory before it is flushed to disk.

Optimizations and Tradeoffs: Squeezing out Maximum Performance

RocksDB offers many tuning options to optimize for different workloads:

Compaction Style: You can choose different compaction styles (e.g., leveled compaction, universal compaction) depending on your workload.
Block Cache Size: Adjusting the size of the block cache can improve read performance.
Write Buffer Size: Increasing the write buffer size can improve write throughput.
Compression Algorithm: Selecting the appropriate compression algorithm can reduce storage usage.
WAL Configuration: Tuning the WAL settings can affect durability and write performance.

Tradeoffs:

Write Amplification: While RocksDB minimizes write amplification, it's still a factor to consider. Compaction involves writing data multiple times.
Space Amplification: RocksDB can consume more disk space than some other databases due to the LSM-Tree structure.
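Both tradeoffs are ratios, so they are easy to estimate from counters you already collect. Write amplification is the number of bytes written to storage divided by the bytes the application wrote; space amplification is the size on disk divided by the size of the live data. The figures below are hypothetical and only show the arithmetic.

```python
def write_amplification(bytes_written_to_storage: int, bytes_written_by_app: int) -> float:
    """Bytes physically written (flushes + compactions + WAL) per byte the application wrote."""
    return bytes_written_to_storage / bytes_written_by_app

def space_amplification(bytes_on_disk: int, live_data_bytes: int) -> float:
    """Bytes occupied on disk per byte of live (non-obsolete) data."""
    return bytes_on_disk / live_data_bytes

GiB = 1024 ** 3

# Hypothetical measurements, chosen only to illustrate the calculation.
wa = write_amplification(bytes_written_to_storage=120 * GiB, bytes_written_by_app=10 * GiB)
sa = space_amplification(bytes_on_disk=55 * GiB, live_data_bytes=50 * GiB)

print(f"write amplification: {wa:.1f}x, space amplification: {sa:.2f}x")
```

Tuning usually trades one ratio against the other: more aggressive compaction lowers space amplification and read cost but raises write amplification.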

Conclusion: A Solid Foundation for Data-Intensive Applications

RocksDB is a powerful and versatile key-value store that's well-suited for a wide range of applications. Its LSM-Tree architecture, combined with its rich set of features and tuning options, makes it a great choice for handling demanding workloads. If you are building scalable and performant applications, consider RocksDB.

References and Further Study

Excellent presentation from Europython by Ria Bhatia
RocksDB Official Website: https://rocksdb.org/
RocksDB Wiki: https://github.com/facebook/rocksdb/wiki
LevelDB: https://github.com/google/leveldb
Plyvel (Python wrapper): https://plyvel.readthedocs.io/en/latest/
CockroachDB: https://www.cockroachlabs.com/
TiDB: https://pingcap.com/
LSM-Tree Explanation: https://en.wikipedia.org/wiki/Log-structured_merge-tree

Kubernetes hosted runners for GitHub Actions with ARC

Ashok Nagaraj — Fri, 07 Feb 2025 14:16:33 +0000

Running GitHub Actions on Kubernetes with Actions Runner Controller

GitHub Actions has become an integral part of modern CI/CD pipelines by allowing automation directly within your repository. With self-hosted runners, you can leverage your own infrastructure for enhanced control over execution environments. This guide will walk you through setting up GitHub Runners on a Kubernetes cluster using the latest version of Actions Runner Controller, which manages and scales these runners as native Kubernetes resources.

Understanding GitHub Actions and Runners

GitHub Actions supports automation workflows triggered by various events like push or pull requests. Runners are instances that execute the jobs defined in your workflow, with two types available:

Hosted runners managed by GitHub
Self-hosted runners, which you manage on your infrastructure

This guide focuses on self-hosted runners deployed on Kubernetes.

Setting Up Actions Runner Controller

The Actions Runner Controller is designed to integrate seamlessly with Kubernetes, using Custom Resource Definitions (CRDs) to define and manage runner deployments. This makes it easier to scale and handle them as part of your cluster's resources.

Prerequisites

A Kubernetes cluster
kubectl configured for your cluster
Helm installed on your local machine
Access to a GitHub repository where you can configure workflows

Installing Actions Runner Controller

Install the Custom Resource Definitions (CRDs)

Begin by installing the CRDs necessary for managing runners:

   kubectl apply -f https://github.com/actions-runner-controller/actions-runner-controller/releases/latest/download/actions.runner-controller.crds.yaml

Deploy the Controller

Use Helm to deploy the runner controller in your Kubernetes cluster:

   helm repo add actions-runner-controller https://actions-runner-controller.github.io/actions-runner-controller
   helm repo update

   helm install \
     --namespace actions-runner-system \
     --create-namespace \
     --set=controller-manager.webhookPort=9443 \
     --set=metrics.serve.enabled=true \
     --set=metrics.serve.address=:8080 \
     controller \
     actions-runner-controller/actions-runner-controller

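The controller also needs GitHub credentials to register runners, which the command above does not show. Supply a personal access token (your_github_token) with the required scopes (repo, admin:org, and admin:repo_hook), for example through the chart's authentication secret settings; check the chart's values file for the exact keys. In the runner configuration below, replace your-github-org with your GitHub organization or username.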

Deploying Runners

Create an arc.yaml file to define your runner configuration:

   apiVersion: actions.summerwind.dev/v1alpha1
   kind: RunnerDeployment
   metadata:
     name: example-runnerdeploy
   spec:
     replicas: 2
     template:
       spec:
         repository: "your-github-org/your-repo"

Apply the configuration to your cluster:

   kubectl apply -f arc.yaml

How Actions Runner Controller Works

The Actions Runner Controller uses Kubernetes CRDs to represent runner deployments, ensuring they are managed efficiently as part of your cluster's resources.

Long Polling Mechanism

The controller employs a long polling strategy to determine when GitHub requires additional workers:

Polling: The runner checks with GitHub periodically for pending jobs.
Registration: If there are pending jobs, it registers itself as an available worker.
Execution: It executes the job upon registration.
Cleanup: After job completion, it deregisters and waits for new tasks.

This approach optimizes resource usage by keeping runners idle until needed.
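The four steps above can be mimicked with a blocking queue standing in for GitHub's job backlog. This is a toy simulation of the polling pattern as described here, not ARC's actual implementation; the function name `runner_loop`, the runner names, and the job names are all made up for the example.

```python
import queue
import threading

def runner_loop(name, pending_jobs, registered, log, poll_timeout=0.1):
    """Toy runner: poll for a job, register, execute, deregister, repeat."""
    while True:
        try:
            # Polling: block until a job is pending or the poll window expires.
            job = pending_jobs.get(timeout=poll_timeout)
        except queue.Empty:
            return  # Nothing pending; a real runner would start another poll.
        registered.add(name)      # Registration: announce this runner as available.
        log.append((name, job))   # Execution: stand-in for running the workflow job.
        registered.discard(name)  # Cleanup: deregister and go back to polling.

pending_jobs = queue.Queue()
for job in ("build", "test", "lint"):
    pending_jobs.put(job)

registered, log = set(), []
workers = [
    threading.Thread(target=runner_loop, args=(f"runner-{i}", pending_jobs, registered, log))
    for i in range(2)
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

print(sorted(job for _, job in log))
```

The blocking `get` with a timeout is the essence of long polling: the runner consumes no CPU while waiting, yet picks up a job as soon as one is queued.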

Sequence Flow Diagram

Below is a sequence flow diagram illustrating how the Actions Runner Controller operates within a Kubernetes cluster:

+-----------------------------+
| GitHub (Waiting Jobs)       |
+-------------+---------------+
              | Polling for Jobs
+-------------v---------------+
| Runner Controller           |
+-------------+---------------+
              | Register Runner
+-------------v---------------+
| Runner Pod (Kubernetes)     |
+-------------+---------------+
              | Execute Job
+-------------v---------------+
| GitHub (Job Execution)      |
+-----------------------------+

This diagram simplifies the interaction between components, showing how runners are dynamically registered and deregistered based on job availability.

Advantages of Using Actions Runner Controller

Scalability: Easily adjust the number of runners in response to workload changes.
Cost Efficiency: Optimize resource usage by running only as many instances as needed.
Customization: Tailor environments according to specific requirements, including security and compliance standards.

Considerations and Best Practices

Security: Store GitHub tokens and other sensitive values in Kubernetes Secrets rather than in plain manifests or Helm values.
Resource Allocation: Adjust resource requests and limits on your pods to balance performance with cost.
Monitoring and Logging: Implement robust monitoring and logging solutions to track the health and performance of runners.
Networking: Ensure proper network policies are in place to allow secure communication between runners and GitHub.

Conclusion

Deploying GitHub Actions self-hosted runners using the Actions Runner Controller on Kubernetes provides flexibility, scalability, and control over your CI/CD processes. By adhering to best practices and understanding the system's workings, you can integrate this setup effectively into your development workflow.

By adopting the Actions Runner Controller, organizations can enhance their DevOps capabilities, ensuring efficient and reliable workflows for software delivery.
