<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Piyush Choudhari</title>
    <description>The latest articles on DEV Community by Piyush Choudhari (@piyush_choudhari_a5b29f7f).</description>
    <link>https://dev.to/piyush_choudhari_a5b29f7f</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3425069%2F18dfe6a4-b639-4e16-af36-1ff03b933041.png</url>
      <title>DEV Community: Piyush Choudhari</title>
      <link>https://dev.to/piyush_choudhari_a5b29f7f</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/piyush_choudhari_a5b29f7f"/>
    <language>en</language>
    <item>
      <title>Making A Peer Review System for My Blogs Using Google-ADK &amp; Mem0</title>
      <dc:creator>Piyush Choudhari</dc:creator>
      <pubDate>Thu, 27 Nov 2025 04:18:24 +0000</pubDate>
      <link>https://dev.to/piyush_choudhari_a5b29f7f/making-a-peer-review-system-for-my-blogs-using-google-adk-mem0-3ejg</link>
      <guid>https://dev.to/piyush_choudhari_a5b29f7f/making-a-peer-review-system-for-my-blogs-using-google-adk-mem0-3ejg</guid>
      <description>&lt;h2&gt;
  
  
  My Process
&lt;/h2&gt;

&lt;p&gt;When writing my technical blogs, I have a &lt;strong&gt;very rigid process&lt;/strong&gt; I like to follow.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Research the topic I am interested in&lt;/li&gt;
&lt;li&gt;Create a &lt;strong&gt;structured research roadmap&lt;/strong&gt; for the knowledge I need to gain about the particular topic&lt;/li&gt;
&lt;li&gt;Go through the roadmap and try to &lt;strong&gt;learn/research the concepts&lt;/strong&gt; as in-depth as I can&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start coding&lt;/strong&gt; whatever the relevant implementation for that topic is&lt;/li&gt;
&lt;li&gt;Finally, &lt;strong&gt;start writing the blog&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But one thing always bugs me: &lt;strong&gt;&lt;em&gt;"Is my blog factually correct, and have I compromised the integrity of my blog anywhere?"&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That leads me to frantically go through my sources repeatedly and ask tools like Perplexity about the blog. So, I had the idea to &lt;strong&gt;automate this process by creating a Peer Review System.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What This System Does
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F47edtu6sxmqw7tl837j9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F47edtu6sxmqw7tl837j9.png" alt="flowchart" width="800" height="643"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. What the System Focuses On&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It behaves like a technical editor, not just a grammar checker.&lt;/li&gt;
&lt;li&gt;It evaluates writing for:

&lt;ul&gt;
&lt;li&gt;Structure&lt;/li&gt;
&lt;li&gt;Clarity&lt;/li&gt;
&lt;li&gt;Factual accuracy&lt;/li&gt;
&lt;li&gt;Tone correctness&lt;/li&gt;
&lt;li&gt;Proper use of supporting evidence&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;The purpose is to help the writer produce content that is accurate, readable, and consistent.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. How It Reviews Content&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The system doesn’t read content blindly.&lt;/li&gt;
&lt;li&gt;It uses uploaded reference files as a knowledge base.&lt;/li&gt;
&lt;li&gt;Relevant information from those files is retrieved using semantic search rather than keyword matching.&lt;/li&gt;
&lt;li&gt;If a statement appears in the writing:

&lt;ul&gt;
&lt;li&gt;The system first checks if it exists in the uploaded sources.&lt;/li&gt;
&lt;li&gt;If confirmed, the system becomes more confident in that claim.&lt;/li&gt;
&lt;li&gt;If not found, it triggers an external web-based fact check.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
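&lt;p&gt;The check-then-escalate logic above can be sketched in a few lines. This is an illustrative stand-in (word overlap in place of real embeddings, and a hypothetical &lt;code&gt;web_checker&lt;/code&gt; callable), not the actual project code:&lt;/p&gt;

```python
def overlap_score(claim, source):
    """Crude stand-in for semantic similarity: fraction of claim words found in the source."""
    claim_words = set(claim.lower().split())
    source_words = set(source.lower().split())
    if not claim_words:
        return 0.0
    return len(claim_words.intersection(source_words)) / len(claim_words)

def verify_claim(claim, sources, web_checker=None, threshold=0.5):
    """Check a claim against uploaded sources first; escalate to the web only if unsupported."""
    scored = [(overlap_score(claim, s), s) for s in sources]
    best_score, best_source = max(scored) if scored else (0.0, None)
    if best_score >= threshold:
        # Confirmed by an uploaded source: confidence in the claim rises.
        return {"status": "supported", "evidence": best_source}
    if web_checker is not None:
        # Not found in the sources: trigger the external web-based fact check.
        return {"status": "web-checked", "evidence": web_checker(claim)}
    return {"status": "unverified", "evidence": None}
```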

&lt;p&gt;&lt;strong&gt;3. How Memory Improves Review Quality&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Feedback adapts over time instead of resetting with each review.&lt;/li&gt;
&lt;li&gt;The system tracks repeated mistakes or patterns such as:

&lt;ul&gt;
&lt;li&gt;Missing citations&lt;/li&gt;
&lt;li&gt;Style inconsistencies&lt;/li&gt;
&lt;li&gt;Formatting issues&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;If the same issue shows up again, the system highlights it more firmly.&lt;/li&gt;

&lt;li&gt;This turns the review into a learning process rather than a one-time correction.&lt;/li&gt;

&lt;/ul&gt;
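&lt;p&gt;The escalation idea is simple to sketch: count how often an issue type appeared in past reviews and bump its severity accordingly. A minimal illustration, not the system's actual memory logic:&lt;/p&gt;

```python
from collections import Counter

SEVERITIES = ["note", "warning", "critical"]

def escalate(issue_type, history):
    """Raise severity for issues that keep recurring across past reviews."""
    seen = Counter(history)[issue_type]        # how often this issue appeared before
    level = min(seen, len(SEVERITIES) - 1)     # cap at the highest severity
    return SEVERITIES[level]
```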




&lt;h3&gt;
  
  
  Screenshots:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://drive.google.com/file/d/1VTvQBkQ4753NbpVFlVPr5v6_kjH3SXjZ/view?usp=sharing" rel="noopener noreferrer"&gt;https://drive.google.com/file/d/1VTvQBkQ4753NbpVFlVPr5v6_kjH3SXjZ/view?usp=sharing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://drive.google.com/file/d/14sfHrC0Lw0pvU4oX7U18Ydv61a8RNTu0/view?usp=sharing" rel="noopener noreferrer"&gt;https://drive.google.com/file/d/14sfHrC0Lw0pvU4oX7U18Ydv61a8RNTu0/view?usp=sharing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A Demo Peer Review Report:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://drive.google.com/file/d/1HQh5stEAj4tkh3E7Fyf1jOse52ZDw-bb/view?usp=sharing" rel="noopener noreferrer"&gt;https://drive.google.com/file/d/1HQh5stEAj4tkh3E7Fyf1jOse52ZDw-bb/view?usp=sharing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Workflow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36or22x0d9cs9k0u8to7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36or22x0d9cs9k0u8to7.png" alt="sequence" width="800" height="987"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1: Ingestion&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fetches content from URLs if needed&lt;/li&gt;
&lt;li&gt;Loads past review history for the project&lt;/li&gt;
&lt;li&gt;Examines uploaded source documents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Phase 2: Verification&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identifies all factual claims in the content&lt;/li&gt;
&lt;li&gt;Searches uploaded sources for supporting evidence&lt;/li&gt;
&lt;li&gt;Uses Google search for external fact-checking&lt;/li&gt;
&lt;li&gt;Validates technical assertions and statistics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Phase 3: Evaluation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Assesses clarity, flow, and structure&lt;/li&gt;
&lt;li&gt;Checks accuracy against evidence&lt;/li&gt;
&lt;li&gt;Evaluates tone for target audience&lt;/li&gt;
&lt;li&gt;Compares to past feedback to track improvement&lt;/li&gt;
&lt;li&gt;Flags recurring issues with escalated severity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Phase 4: Synthesis&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generates structured report&lt;/li&gt;
&lt;li&gt;Provides evidence for all major issues&lt;/li&gt;
&lt;li&gt;References past feedback when relevant&lt;/li&gt;
&lt;li&gt;Gives actionable, constructive feedback&lt;/li&gt;
&lt;/ul&gt;
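&lt;p&gt;The four phases can be pictured as plain functions chained together. The bodies below are toy stand-ins (the real agents do far more), but the data flow mirrors the diagram: each phase feeds the next.&lt;/p&gt;

```python
def ingest(draft, sources, history):
    """Phase 1: gather the draft, uploaded sources, and past review history."""
    return {"draft": draft, "sources": sources, "history": history}

def verify(context):
    """Phase 2: extract claims (here: a naive sentence split) for fact-checking."""
    return [s.strip() for s in context["draft"].split(".") if s.strip()]

def evaluate(context, claims):
    """Phase 3: judge the draft; recurring issues would escalate here."""
    return {"claims_checked": len(claims), "past_reviews": len(context["history"])}

def synthesize(findings):
    """Phase 4: fold the findings into a structured report."""
    return {"report": findings}

def run_review(draft, sources, history):
    context = ingest(draft, sources, history)
    claims = verify(context)
    findings = evaluate(context, claims)
    return synthesize(findings)
```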




&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Model Flexibility&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You aren’t locked into one AI provider.&lt;/li&gt;
&lt;li&gt;Switching between models like Gemini, Claude, GPT, or Ollama only requires changing one environment variable.&lt;/li&gt;
&lt;li&gt;This gives control over:

&lt;ul&gt;
&lt;li&gt;Cost&lt;/li&gt;
&lt;li&gt;Performance&lt;/li&gt;
&lt;li&gt;Privacy&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;The review logic remains consistent across models.&lt;/li&gt;

&lt;/ul&gt;
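&lt;p&gt;In spirit, the switch looks like reading one environment variable. The variable name and default below are illustrative assumptions, but the principle matches the post: one setting decides which model backs the reviewer, and the review logic never changes.&lt;/p&gt;

```python
import os

def get_review_model(default="gemini-2.0-flash"):
    """Read the model identifier from the environment, falling back to a default.

    REVIEW_MODEL is a hypothetical variable name used for illustration.
    """
    return os.environ.get("REVIEW_MODEL", default)
```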

&lt;p&gt;&lt;strong&gt;2. Context-Aware Retrieval&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uploaded reference files are stored in a vector database.&lt;/li&gt;
&lt;li&gt;The system breaks them into chunks, embeds them, and indexes them for efficient search.&lt;/li&gt;
&lt;li&gt;During review, it retrieves relevant sections using semantic similarity rather than simple keyword matching.&lt;/li&gt;
&lt;li&gt;This helps the system understand meaning, not just match exact text.&lt;/li&gt;
&lt;/ul&gt;
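&lt;p&gt;A minimal sketch of the chunk-and-index step, with &lt;code&gt;embed&lt;/code&gt; standing in for a sentence-transformers model (the real pipeline chunks by tokens, not words):&lt;/p&gt;

```python
def chunk_words(text, size=40, overlap=10):
    """Split a document into overlapping chunks so retrieval stays fine-grained."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def build_index(docs, embed):
    """Embed every chunk once and keep (chunk, vector) pairs for semantic search."""
    index = []
    for doc in docs:
        for ch in chunk_words(doc):
            index.append({"text": ch, "vec": embed(ch)})
    return index
```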

&lt;p&gt;&lt;strong&gt;3. Automated Fact Verification&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When a claim isn’t supported by uploaded sources, the system escalates verification.&lt;/li&gt;
&lt;li&gt;A separate search agent performs a structured web lookup.&lt;/li&gt;
&lt;li&gt;The goal is not to rewrite content, but to confirm whether the information is reliable and accurate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Built-In Memory&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The system remembers past reviews and writing patterns.&lt;/li&gt;
&lt;li&gt;If a mistake repeats, the system identifies it as a recurring issue.&lt;/li&gt;
&lt;li&gt;Instead of pointing it out repeatedly at the same level, the feedback becomes stronger and more specific.&lt;/li&gt;
&lt;li&gt;This encourages long-term improvement rather than one-off corrections.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The verification is only as good as the model plus the search results&lt;/li&gt;
&lt;li&gt;Source reliability isn’t enforced&lt;/li&gt;
&lt;li&gt;Web search can surface low quality or outdated material&lt;/li&gt;
&lt;li&gt;The model is still the final judge. It can misinterpret sources, over-trust weak evidence, or fabricate justification&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Implementation: &lt;a href="https://github.com/capybara-brain346/peer-review-agent" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/h3&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>automation</category>
    </item>
    <item>
      <title>Training a Mixture-of-Experts Router</title>
      <dc:creator>Piyush Choudhari</dc:creator>
      <pubDate>Mon, 17 Nov 2025 13:54:25 +0000</pubDate>
      <link>https://dev.to/piyush_choudhari_a5b29f7f/training-a-mixture-of-experts-router-5b4p</link>
      <guid>https://dev.to/piyush_choudhari_a5b29f7f/training-a-mixture-of-experts-router-5b4p</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddcxf5nb902zpkj62oea.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddcxf5nb902zpkj62oea.png" alt="moe-implementation" width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ever since DeepSeek-MoE put the MoE architecture in the spotlight, I was aware of it and saw its adoption across open-source and proprietary model providers, but I never tried to understand the idea more deeply. The idea that you can expand a model’s capacity without sending every token through a huge feed-forward block felt very interesting. &lt;/p&gt;

&lt;p&gt;I tried to write the whole thing myself, from data loading and tokenization to a GPT-style transformer with optional MoE layers. This included the dataset pipeline, transformer blocks, routing logic, expert modules, and a training loop that tracked timing, throughput, losses, and expert usage. &lt;/p&gt;

&lt;p&gt;This blog walks through what I learned, how each component fits together, and the results that stood out, with the hope that these insights help anyone curious about MoE models or planning to build one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Plan
&lt;/h2&gt;

&lt;p&gt;Before writing any code, I outlined the system I wanted. I needed a modular setup that let me swap dense layers for MoE layers, try different routing strategies, and run controlled comparisons without constant rewrites. That meant keeping data, model, MoE components, and training logic cleanly separated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtph3633ajd32uaj2us6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtph3633ajd32uaj2us6.png" alt="project plan" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I began with the dataset pipeline. A steady source of tokenized text is essential for any language model experiment, and I wanted the option to fall back to synthetic data. After that, I focused on the model architecture. I planned to build a small GPT-style transformer first, since it provided a stable baseline and a familiar structure to extend with MoE layers. The goal was to keep the dense path intact while making the MoE path a drop-in replacement so both versions could be compared under identical conditions.&lt;/p&gt;

&lt;p&gt;Next came the MoE module, which required the most iteration. I wanted per-token routing, top-k selection, load balancing, and expert statistics without creating a messy forward pass. I mapped out how the router, experts, and auxiliary loss would interact and built an interface that let transformer blocks treat dense and MoE layers the same.&lt;/p&gt;

&lt;p&gt;The final piece was the training loop. I needed detailed metrics: throughput, timing, auxiliary losses, temperature schedules, and expert usage. The Trainer class would handle epochs, collect metrics, and coordinate evaluation so experiments remained consistent.&lt;/p&gt;

&lt;p&gt;With the components defined, the plan was straightforward: build the dataset module, implement the transformer, add the MoE layer, wire everything in the Trainer, and run dense and MoE configurations under a shared framework.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ixmz84ckvn0wfmrrbag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ixmz84ckvn0wfmrrbag.png" alt="implementation" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the plan was set, the next step was turning each idea into a working module. I wanted the codebase to feel like a compact training stack with clear boundaries. That shaped how the transformer, MoE layer, and Trainer were built.&lt;/p&gt;

&lt;p&gt;The transformer came first, following a standard GPT-style decoder with token embeddings, positional embeddings, a stack of self-attention blocks, and a final projection. The key feature was a pluggable feed-forward sublayer. If a layer index matched an MoE position, the dense FFN was swapped for an MoE layer through a shared interface. This made it easy to alternate between dense and sparse configurations without altering the architecture.&lt;/p&gt;
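&lt;p&gt;The swap can be captured by a small factory: each block only decides which feed-forward sublayer to construct. A schematic sketch with illustrative names, not the project's actual code:&lt;/p&gt;

```python
def make_ffn(layer_idx, moe_positions, make_dense, make_moe):
    """Pick the feed-forward sublayer for a block: MoE at designated layers, dense elsewhere.

    Both factories must return objects with the same forward interface, so the
    rest of the transformer never needs to know which variant it got.
    """
    return make_moe() if layer_idx in moe_positions else make_dense()
```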

&lt;p&gt;The MoE layer required the most careful engineering. Routing occurs per token, so the implementation flattens batch and sequence dimensions, applies a linear router, then uses a temperature scaled softmax to produce expert probabilities. Tokens pick top experts, pass through identical feed-forward experts, and are recombined with routing weights. The layer also tracks usage, probabilities, and entropy, which was crucial for spotting imbalance and specialization patterns.&lt;/p&gt;
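&lt;p&gt;A toy NumPy rendition of that forward pass; the real layer is a PyTorch module with vectorised dispatch, but the steps are the same: flatten the batch and sequence dimensions, route with a temperature-scaled softmax, run the top-k experts, and recombine with the routing weights.&lt;/p&gt;

```python
import numpy as np

def moe_forward(x, w_router, experts, top_k=2, temperature=1.0):
    """Per-token MoE routing sketch.

    x: (batch, seq, d) activations; w_router: (d, n_experts); experts: list of
    callables mapping (m, d) arrays to (m, d) arrays.
    """
    b, s, d = x.shape
    tokens = x.reshape(-1, d)                     # flatten batch and sequence dims
    logits = tokens @ w_router / temperature      # temperature-scaled router logits
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = probs / probs.sum(axis=-1, keepdims=True)
    top = np.argsort(probs, axis=-1)[:, -top_k:]  # indices of the top-k experts
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        weights = probs[t, top[t]]
        weights = weights / weights.sum()         # renormalise over selected experts
        for w, e in zip(weights, top[t]):
            out[t] += w * experts[e](tokens[t:t + 1])[0]
    return out.reshape(b, s, d)
```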

&lt;p&gt;With the model ready, the Trainer handled timing, throughput, auxiliary losses, and temperature schedules. It recorded detailed metrics, measured forward and backward phases separately, and supported early stopping. Together, these components formed a focused environment for comparing dense and MoE models and revealing their trade-offs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Dataset
&lt;/h2&gt;

&lt;p&gt;For these experiments I needed a dataset that was structured enough to reveal real modeling behavior yet small enough for fast iteration. WikiText-2 fit well. It contains high-quality English text from Wikipedia, offering natural sentence structures, topic shifts, and long-range dependencies that make model behavior easy to inspect.&lt;/p&gt;

&lt;p&gt;I kept the raw text but used a custom tokenization pipeline instead of the original large vocabulary. I mapped everything into a fixed vocabulary of 10000 tokens. This kept the embedding matrix small and made the model lighter, while also shifting the dataset’s statistics. A smaller vocabulary increases sequence length due to more subword splits and raises the frequency of common tokens. That steeper distribution created clearer patterns early in training and made capacity differences between dense and MoE variants easier to observe.&lt;/p&gt;

&lt;p&gt;The dataset is stored in Parquet format and loaded with Polars. After tokenization, it produces about 2.1 million training tokens and about 217000 validation tokens. These totals stay similar after remapping, although some lines become longer and the token distribution grows more concentrated.&lt;/p&gt;

&lt;p&gt;A streaming dataset slices training sequences with sliding windows, keeping memory use low and avoiding preprocessing. I also added a synthetic fallback for development. Overall, WikiText-2 with a 10000-token vocabulary offered a realistic and efficient environment for comparing dense and MoE models.&lt;/p&gt;
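&lt;p&gt;The sliding-window slicing itself is tiny; a generator keeps memory use low because windows are produced lazily rather than materialised up front. A simplified version of the idea:&lt;/p&gt;

```python
def sliding_windows(token_ids, window=128, stride=64):
    """Yield fixed-length training sequences over a token stream, lazily."""
    for start in range(0, len(token_ids) - window + 1, stride):
        yield token_ids[start:start + window]
```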




&lt;h2&gt;
  
  
  MoE vs Dense: Experiment Results and Recommendations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddqjgb054uxmtt8jfhjb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddqjgb054uxmtt8jfhjb.jpg" alt="summary" width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The dense model achieved the best validation loss and the highest training throughput. It remains the strongest performer for this setup when evaluating generalization versus raw speed.&lt;/li&gt;
&lt;li&gt;Mixture-of-experts (MoE) variants increased model capacity (roughly 17.3M parameters vs 9.9M for the dense model) but introduced substantial runtime overhead, primarily in the backward pass. That overhead reduced tokens/sec compared with the dense baseline.&lt;/li&gt;
&lt;li&gt;Among MoE variants, &lt;code&gt;throughput_opt&lt;/code&gt; substantially reduced backward cost compared with the other MoEs and delivered the best throughput among MoEs, but it did not match the dense model in tokens/sec or in best validation loss.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;top2&lt;/code&gt; reached the lowest training loss but the worst validation loss, suggesting overfitting or instability in gating/generalization.&lt;/li&gt;
&lt;li&gt;Routing entropy and per-expert usage indicate reasonably balanced expert assignment across experiments, but layer- and experiment-level differences remain and may explain some of the generalization differences.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key numbers (single-epoch timing, parameters, best validation loss, avg throughput)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;th&gt;Best validation loss&lt;/th&gt;
&lt;th&gt;Avg throughput (tokens/sec)&lt;/th&gt;
&lt;th&gt;Forward (s)&lt;/th&gt;
&lt;th&gt;Backward (s)&lt;/th&gt;
&lt;th&gt;Optimizer (s)&lt;/th&gt;
&lt;th&gt;Total per-epoch (s)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dense&lt;/td&gt;
&lt;td&gt;9,924,608&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9.3698&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;68,825&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.40&lt;/td&gt;
&lt;td&gt;0.20&lt;/td&gt;
&lt;td&gt;0.03&lt;/td&gt;
&lt;td&gt;0.63&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MoE-baseline&lt;/td&gt;
&lt;td&gt;17,286,656&lt;/td&gt;
&lt;td&gt;9.3809&lt;/td&gt;
&lt;td&gt;44,976&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;td&gt;1.17&lt;/td&gt;
&lt;td&gt;0.05&lt;/td&gt;
&lt;td&gt;1.97&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MoE-top2&lt;/td&gt;
&lt;td&gt;17,286,656&lt;/td&gt;
&lt;td&gt;9.3921&lt;/td&gt;
&lt;td&gt;37,387&lt;/td&gt;
&lt;td&gt;0.90&lt;/td&gt;
&lt;td&gt;1.47&lt;/td&gt;
&lt;td&gt;0.05&lt;/td&gt;
&lt;td&gt;2.42&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MoE-strong_reg&lt;/td&gt;
&lt;td&gt;17,286,656&lt;/td&gt;
&lt;td&gt;9.3798&lt;/td&gt;
&lt;td&gt;44,243&lt;/td&gt;
&lt;td&gt;0.76&lt;/td&gt;
&lt;td&gt;1.19&lt;/td&gt;
&lt;td&gt;0.05&lt;/td&gt;
&lt;td&gt;2.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MoE-throughput_opt&lt;/td&gt;
&lt;td&gt;17,286,656&lt;/td&gt;
&lt;td&gt;9.3870&lt;/td&gt;
&lt;td&gt;54,784&lt;/td&gt;
&lt;td&gt;0.56&lt;/td&gt;
&lt;td&gt;0.96&lt;/td&gt;
&lt;td&gt;0.03&lt;/td&gt;
&lt;td&gt;1.55&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Training and validation dynamics
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9rtzeziudyy734jwigh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9rtzeziudyy734jwigh.png" alt="training loss" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxy53ugo4yykw5ylh8wew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxy53ugo4yykw5ylh8wew.png" alt="validation loss" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Training loss&lt;/strong&gt;: All variants show steadily decreasing training loss. &lt;code&gt;MoE-top2&lt;/code&gt; reaches the lowest training loss across epochs, indicating higher capacity or faster training fit.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Validation loss&lt;/strong&gt;: Dense achieves the lowest validation loss (9.3698). MoE variants either match or exceed (worse) the dense validation loss:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;MoE-baseline&lt;/code&gt; and &lt;code&gt;MoE-strong_reg&lt;/code&gt; are close to each other and only marginally worse than dense.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MoE-top2&lt;/code&gt; exhibits the largest gap and a clear upward trend in validation loss after epoch 4, indicating overfitting or gating instability.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MoE-throughput_opt&lt;/code&gt; shows stable validation behavior but does not surpass dense.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Interpretation: The MoE configurations give higher representational capacity but require careful gating/regularization and optimization to realize generalization benefits. Without such tuning, larger capacity can overfit or destabilize validation performance.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Throughput and timing breakdown
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1c3oawsgkswxlnyeig0t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1c3oawsgkswxlnyeig0t.png" alt="throughput" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The dominant cost for MoE models is the backward pass. Backward times:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;MoE-top2&lt;/code&gt;: 1.47 s&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MoE-baseline&lt;/code&gt;: 1.17 s&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MoE-strong_reg&lt;/code&gt;: 1.19 s&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MoE-throughput_opt&lt;/code&gt;: 0.96 s&lt;/li&gt;
&lt;li&gt;Dense: 0.20 s&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;code&gt;throughput_opt&lt;/code&gt; reduced the backward cost significantly compared with other MoEs, producing the best MoE throughput (54,784 tokens/sec), but still below dense.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Total per-epoch wall-clock is smallest for dense (0.63s) and largest for &lt;code&gt;top2&lt;/code&gt; (2.42s).&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Interpretation: MoE overheads are chiefly in expert-specific gradient/communication during backward. Optimizations that reduce communication or reduce expert work in backward propagate directly to throughput gains (as &lt;code&gt;throughput_opt&lt;/code&gt; demonstrates).&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Expert routing behavior: usage and entropy
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgf5hkku4l149z8169g83.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgf5hkku4l149z8169g83.jpg" alt="expert entropy" width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Entropy&lt;/strong&gt;: Routing entropy per layer is close to the theoretical maximum (max ≈ 2.08 for 8 experts). That indicates routing is using many experts rather than collapsing to a single expert. Specific observations:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;top2&lt;/code&gt; layer 2 entropy reaches very close to max early, which matches its aggressive expert usage behavior.&lt;/li&gt;
&lt;li&gt;Layer 4 entropies are slightly lower and more variable across experiments.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Per-expert usage&lt;/strong&gt;: Usage bars across models and layers show modest deviations from perfect uniformity (uniform = 12.5% for 8 experts). Most experiments show usage within roughly 10.6% to 14.7% per expert. A few spots show underused experts (for example &lt;code&gt;throughput_opt&lt;/code&gt; layer 4 had some experts near 9.8–9.9%).&lt;/li&gt;

&lt;li&gt;Interpretation: Routing is generally balanced, but small non-uniformities exist and could cause local specialization or slight load imbalance. Very skewed usage might lead to undertrained experts and affect generalization.&lt;/li&gt;

&lt;/ul&gt;
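&lt;p&gt;The theoretical maximum quoted above is just the entropy of a uniform distribution over 8 experts, ln 8 ≈ 2.079 nats, which a few lines confirm:&lt;/p&gt;

```python
import math

def routing_entropy(probs):
    """Shannon entropy (in nats) of an expert-assignment distribution."""
    return -sum(p * math.log(p) for p in probs if p)

# Uniform routing over 8 experts gives the theoretical maximum, ln 8.
max_entropy = routing_entropy([1.0 / 8] * 8)
```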

&lt;h3&gt;
  
  
  Trade-offs and interpretation
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Dense vs MoE capacity&lt;/strong&gt;: MoE increases the parameter count substantially, but this did not translate into better validation loss in these runs. Extra capacity alone is not sufficient to improve generalization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput trade-offs&lt;/strong&gt;: Dense model is faster and achieves better validation performance. For these experiments, MoE costs (especially backward step) made them slower despite higher capacity. &lt;code&gt;throughput_opt&lt;/code&gt; shows that MoE overhead can be reduced but not fully eliminated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MoE-top2 anomaly&lt;/strong&gt;: The &lt;code&gt;top2&lt;/code&gt; variant fits training data best but generalizes worst. This suggests gating choices (top-2 routing) can increase overfitting risk or cause training instability unless balanced with stronger regularization or gating temperature tuning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Balancing and entropy&lt;/strong&gt;: Entropy values close to the max indicate gating is not collapsing, which is good for utilization. However, small routing imbalances can still impact performance. Regularization techniques to encourage balanced loads may help.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Implementation: &lt;a href="https://github.com/capybara-brain346/moe-router" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Working through this project gave me a clearer sense of how MoE models behave in practice. I expected the extra capacity from experts to translate quickly into better validation performance, but the results were more nuanced. The dense baseline stayed the most stable and consistent in both speed and generalization, which reinforced the idea that complexity only helps when training dynamics are tuned to support it.&lt;/p&gt;

&lt;p&gt;The MoE experiments showed where the real challenges appear. Routing adds flexibility but also brings noise, imbalance, and overhead that a standard transformer avoids. Watching how validation loss shifted under different routing strategies, or how backward times rose as soon as experts were introduced, clarified why production MoE systems depend on tight engineering and regularization. Even in a small setup, these effects surfaced immediately.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gpt3</category>
      <category>mixtureofexperts</category>
    </item>
    <item>
      <title>Building a Vector Database from Scratch - CapybaraDB</title>
      <dc:creator>Piyush Choudhari</dc:creator>
      <pubDate>Tue, 11 Nov 2025 02:45:42 +0000</pubDate>
      <link>https://dev.to/piyush_choudhari_a5b29f7f/building-a-vector-database-from-scratch-capybaradb-ek8</link>
      <guid>https://dev.to/piyush_choudhari_a5b29f7f/building-a-vector-database-from-scratch-capybaradb-ek8</guid>
      <description>&lt;h2&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#introduction" rel="noopener noreferrer"&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Vector databases are among the most popular and widely used systems in the tech industry. Their market was valued at roughly $2.5 billion in 2024 and is projected to exceed $3 billion in 2025. Over 70% of organizations investing in or implementing AI use vector databases for search and embeddings.&lt;/p&gt;

&lt;p&gt;I have used vector databases in multiple use cases and projects, be it RAG, searching and filtering documents, or even feeding context to agents. After using several databases like FAISS, ChromaDB, Pinecone, and pgvector, I was fascinated by vector databases and their internal workings.&lt;/p&gt;

&lt;p&gt;Hence, I decided to implement one myself.&lt;/p&gt;

&lt;p&gt;CapybaraDB is a lightweight vector database built from scratch in Python:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Semantic search using sentence-transformers embeddings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Built-in token-based chunking.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CUDA acceleration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Precision control (float32, float16, binary).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;.npz file storage for persistence.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcpf7o3yreq4vmr31325w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcpf7o3yreq4vmr31325w.png" alt="capybaradb" width="800" height="622"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#what-is-a-vector-database" rel="noopener noreferrer"&gt;&lt;strong&gt;What is a Vector Database?&lt;/strong&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;A vector database is a specialized database optimized for storing and searching high-dimensional vector embeddings. Embeddings are numerical representations of data such as text, images, video, or audio. Structurally, an embedding is an array of floating-point numbers encoding the direction and magnitude of a vector.&lt;/p&gt;

&lt;p&gt;A traditional database looks for an exact match to a query, but a vector database finds items by measuring the distance between the query vector and the stored vectors in a multidimensional space. Metrics such as Euclidean distance or cosine similarity are used to measure these distances.&lt;/p&gt;

&lt;p&gt;They're essential for modern AI applications including semantic search (finding meaning, not just keywords), recommendation systems, RAG (Retrieval Augmented Generation) for chatbots, image similarity search, and anomaly detection.&lt;/p&gt;

&lt;p&gt;Popular examples include Pinecone, Weaviate, Milvus, Qdrant, and Chroma. They've become crucial infrastructure as AI applications need to search through millions of embeddings in milliseconds while maintaining accuracy.&lt;/p&gt;
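&lt;p&gt;The distance metrics above can be sketched in a few lines of NumPy (the vectors here are made-up toy values, not real embeddings):&lt;/p&gt;

```python
import numpy as np

# Two toy 4-dimensional "embeddings" (hypothetical values).
a = np.array([0.2, 0.7, 0.1, 0.5])
b = np.array([0.3, 0.6, 0.0, 0.4])

# Euclidean distance: straight-line distance between the two vectors.
euclidean = np.linalg.norm(a - b)

# Cosine similarity: angle-based closeness, independent of magnitude.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(euclidean, 3))  # 0.2
print(round(cosine, 3))     # roughly 0.98: nearly parallel vectors
```

&lt;p&gt;On L2-normalized embeddings (which CapybaraDB produces), cosine similarity reduces to a plain dot product, which is why the search code later uses matrix multiplication.&lt;/p&gt;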

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7oauwos37uqu3q6k5uz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7oauwos37uqu3q6k5uz.png" alt="vector db illustration" width="800" height="660"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#design-philosophy" rel="noopener noreferrer"&gt;&lt;strong&gt;Design Philosophy&lt;/strong&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Simplicity&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* A "toy" vector db implementation, aiming for minimal complexity

* Straightforward APIs (`add_document`, `search`, `get_document`)

* Minimal config to get started
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Flexibility&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* Utility support for multiple file formats

* Configurable precision levels (float32, float16 and binary)

* Choice to keep in-memory store or on disk

* GPU support
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Minimal dependencies&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* Core dependencies limited to essential libs

* Lightweight footprint for prototyping and learning
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Educational focus&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* Demonstrating fundamental vector database concepts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#metrics--benchmarks" rel="noopener noreferrer"&gt;&lt;strong&gt;Metrics &amp;amp; Benchmarks&lt;/strong&gt;&lt;/a&gt;
&lt;/h2&gt;
&lt;h3&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#indexing-performance" rel="noopener noreferrer"&gt;&lt;strong&gt;Indexing Performance&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Data source: &lt;a href="https://github.com/capybara-brain346/capybaradb/blob/main/benchmark_results/indexing_performance.json?utm_source=www.piyushchoudhari.me&amp;amp;utm_medium=portfolio_website&amp;amp;utm_campaign=referral" rel="noopener noreferrer"&gt;&lt;strong&gt;benchmark_results/indexing_performance.json&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Document counts tested: 10, 50, 100, 500, 1000&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Total times (s): 0.138, 1.015, 2.388, 23.126, 76.331&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Average time per doc (s): 0.0138, 0.0203, 0.0239, 0.0463, 0.0763&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage times remain small relative to embedding time even at 1k docs (≈0.122 s)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Index size (MB): 0.020, 0.089, 0.174, 0.859, 1.715&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Peak memory (MB): ~2.2–65.4 across scales&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Embedding dominates total indexing time. Storage overhead is negligible in comparison.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Linear growth with dataset size; average time per document rises as batches get larger and memory pressure appears.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Index size scales linearly and remains compact for thousands of chunks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Refer to &lt;code&gt;benchmark_results/indexing_performance.png&lt;/code&gt; for the trend lines and &lt;code&gt;indexing_performance_breakdown.png&lt;/code&gt; for stacked time components.&lt;/p&gt;


&lt;h3&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#query-performance" rel="noopener noreferrer"&gt;&lt;strong&gt;Query Performance&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Data source: &lt;a href="https://github.com/capybara-brain346/capybaradb/blob/main/benchmark_results/query_performance.json?utm_source=www.piyushchoudhari.me&amp;amp;utm_medium=portfolio_website&amp;amp;utm_campaign=referral" rel="noopener noreferrer"&gt;&lt;strong&gt;benchmark_results/query_performance.json&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Dataset sizes tested: 100, 500, 1000, 2500, 5000&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Average query latency (ms): 7.79, 7.54, 9.10, 8.52, 8.45&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Throughput (qps): 128.3, 132.6, 109.9, 117.4, 118.3&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;p50 latency (ms): 7.45–8.79&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;p95 latency (ms): 10.09–12.01&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;p99 latency (ms): 11.80–16.39&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Breakdown (avg):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Embedding time (ms): ~3.87–4.53&lt;/li&gt;
&lt;li&gt;Retrieval time (ms): ~3.50–4.57&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Latency remains stable and low (≈7–9 ms on average) from 100 to 5000 vectors for top-k search, reflecting efficient vectorized exact search.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Throughput remains &amp;gt;100 qps at all tested sizes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The split between query embedding and retrieval remains balanced; both contribute roughly half of total latency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Note: one anomalous value appears in &lt;code&gt;min_latency_ms&lt;/code&gt; at 500 (-524.27 ms). This is a measurement artifact and should be ignored; distributional statistics (p50/p95/p99) are consistent and reliable.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Charts: &lt;code&gt;benchmark_results/query_performance.png&lt;/code&gt; and &lt;code&gt;query_performance_breakdown.png&lt;/code&gt; visualize latency distributions and the embedding vs retrieval split.&lt;/p&gt;


&lt;h3&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#retrieval-quality-synthetic" rel="noopener noreferrer"&gt;&lt;strong&gt;Retrieval Quality (Synthetic)&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Data source: &lt;a href="https://github.com/capybara-brain346/capybaradb/blob/main/benchmark_results/retrieval_quality_synthetic.json?utm_source=www.piyushchoudhari.me&amp;amp;utm_medium=portfolio_website&amp;amp;utm_campaign=referral" rel="noopener noreferrer"&gt;&lt;strong&gt;benchmark_results/retrieval_quality_synthetic.json&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Dataset: Synthetic&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Chunk size: 512&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Quality metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Precision@k: P@1=1.00, P@3≈0.756, P@5≈0.480, P@10≈0.240&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Recall@k: R@1≈0.433, R@3≈0.956, R@5=1.00, R@10=1.00&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;F1@k: F1@1=0.60, F1@3≈0.836, F1@5≈0.643, F1@10≈0.385&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;nDCG@k: nDCG@1=1.00, nDCG@3≈0.954, nDCG@5≈0.979, nDCG@10≈0.979&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Interpretation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Very strong early precision (P@1=1.0) and nDCG across cutoffs indicate effective ranking of the most relevant content.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Near-perfect recall by k=5 shows top-5 captures essentially all relevant items.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See &lt;code&gt;benchmark_results/retrieval_quality_synthetic.png&lt;/code&gt; for the quality curves.&lt;/p&gt;
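&lt;p&gt;For reference, Precision@k and Recall@k for a single query can be computed as below (the document IDs and relevance judgments are hypothetical, not taken from the benchmark data):&lt;/p&gt;

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision@k and Recall@k for one query.

    retrieved: ranked list of result IDs; relevant: set of relevant IDs.
    """
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / k, hits / len(relevant)

# Hypothetical ranking with 3 relevant documents; the top hit is relevant.
retrieved = ["d1", "d7", "d2", "d9", "d3"]
relevant = {"d1", "d2", "d3"}

print(precision_recall_at_k(retrieved, relevant, 1))  # (1.0, ~0.33)
print(precision_recall_at_k(retrieved, relevant, 5))  # (0.6, 1.0)
```

&lt;p&gt;This also illustrates why P@k necessarily falls as k grows past the number of relevant items, exactly the pattern in the table above (P@5≈0.48 with ~2–3 relevant chunks per query).&lt;/p&gt;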

&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt; ⚠️: The documents in the dataset used here are relatively short (typically well under 512 tokens).&lt;br&gt;&lt;br&gt;
As a result, a chunk size of 512 effectively corresponds to &lt;em&gt;document-level embeddings&lt;/em&gt; — each document was indexed as a single vector.&lt;br&gt;&lt;br&gt;
While this setup is sufficient for small-scale or toy benchmarks, it may not generalize to longer documents where sub-document (passage-level) chunking becomes necessary for finer-grained retrieval.&lt;br&gt;&lt;br&gt;
Future evaluations will include experiments with smaller chunk sizes (e.g., 128–256) and longer document corpora to assess chunk-level retrieval effects.&lt;/p&gt;


&lt;h3&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#what-these-results-mean-from-a-perspective-of-a-toy-database" rel="noopener noreferrer"&gt;&lt;strong&gt;What These Results Mean from a Perspective of A "Toy Database"&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Small to medium collections (≤10k chunks): exact search is fast, simple, and accurate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Low latency: median ≈7–9 ms per query with &amp;gt;100 qps throughput in benchmarks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Strong quality: excellent early precision and recall on the synthetic task with coherent chunking.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scales linearly: indexing and index size grow linearly; storage overhead is minimal compared to embedding time.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#core-architecture" rel="noopener noreferrer"&gt;&lt;strong&gt;Core Architecture&lt;/strong&gt;&lt;/a&gt;
&lt;/h2&gt;
&lt;h3&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#1-baseindex" rel="noopener noreferrer"&gt;&lt;strong&gt;1. BaseIndex&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The main in-memory (temp) data store of CapybaraDB is the &lt;code&gt;BaseIndex&lt;/code&gt; class, a data structure that holds:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class BaseIndex:    documents: Dict[str, str]           # doc_id -&amp;gt; full document text    chunks: Dict[str, Dict[str, str]]   # chunk_id -&amp;gt; {text, doc_id}    vectors: Optional[torch.Tensor]     # All chunk embeddings    chunk_ids: List[str]                # Order-preserving chunk IDs    total_chunks: int    total_documents: int    embedding_dim: Optional[int]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This design keeps documents and their chunks separate while maintaining relationships through IDs. Why this separation? It allows us to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Return full documents when retrieving search results&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Track which chunk belongs to which document&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Maintain metadata without duplicating data&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
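&lt;p&gt;A minimal sketch of that separation, with hypothetical IDs, shows how a matched chunk is traced back to its parent document:&lt;/p&gt;

```python
# Documents and chunks live in separate dicts, linked only by IDs
# (the IDs and texts here are made up for illustration).
documents = {"doc-1": "Full text of the first document..."}
chunks = {
    "chunk-a": {"text": "Full text of", "doc_id": "doc-1"},
    "chunk-b": {"text": "the first document...", "doc_id": "doc-1"},
}

# A chunk that matched a query leads back to its parent document:
hit = chunks["chunk-b"]
parent = documents[hit["doc_id"]]
print(parent)  # the complete document, not just the matching chunk
```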

&lt;h3&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#2-index" rel="noopener noreferrer"&gt;&lt;strong&gt;2. Index&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;Index&lt;/code&gt; class extends &lt;code&gt;BaseIndex&lt;/code&gt; with persistence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Index(BaseIndex):    def __init__(self, storage_path: Optional[Path] = None):        super().__init__()        self.storage = Storage(storage_path)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where auto-loading happens. When you create an &lt;code&gt;Index&lt;/code&gt;, it checks whether a persisted version exists and loads it automatically. This is, of course, optional: if no path is provided, the database stays in-memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#3-capybaradb-the-main-interface" rel="noopener noreferrer"&gt;&lt;strong&gt;3. CapybaraDB: The Main Interface&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;CapybaraDB&lt;/code&gt; class which exposes the API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class CapybaraDB:    def __init__(        self,        collection: Optional[str] = None,        chunking: bool = False,        chunk_size: int = 512,        precision: Literal["binary", "float16", "float32"] = "float32",        device: Literal["cpu", "cuda"] = "cpu",    ):
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can create multiple collections, control chunking, adjust precision, and choose your compute device.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#embeddings" rel="noopener noreferrer"&gt;&lt;strong&gt;Embeddings&lt;/strong&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#1-architecture" rel="noopener noreferrer"&gt;&lt;strong&gt;1. Architecture&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;CapybaraDB uses &lt;code&gt;sentence-transformers/all-MiniLM-L6-v2&lt;/code&gt;, a lightweight transformer model that converts text into 384-dimensional vectors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class EmbeddingModel:    def __init__(        self,        precision: Literal["binary", "float16", "float32"] = "float32",        device: Literal["cpu", "cuda"] = "cpu",    ):        self.model_name = "sentence-transformers/all-MiniLM-L6-v2"        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)        self.model = AutoModel.from_pretrained(self.model_name).to(device)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model is initialized once and reused for all embeddings, keeping operations fast and memory-efficient.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#2-the-embedding-process" rel="noopener noreferrer"&gt;&lt;strong&gt;2. The Embedding Process&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;When you call &lt;code&gt;embed()&lt;/code&gt; on a document, here's what happens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def embed(self, documents: Union[str, List[str]]) -&amp;gt; torch.Tensor:    encoded_documents = self.tokenizer(        documents, padding=True, truncation=True, return_tensors="pt"    )        with torch.no_grad():        model_output = self.model(**encoded_documents)        sentence_embeddings = self._mean_pooling(        model_output, encoded_documents["attention_mask"]    )    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Tokenization: the model's tokenizer converts the input text into token IDs, with padding and truncation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generation: the transformer produces context-aware representations for each token position, which are mean-pooled into a single sentence vector&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Normalization: L2 normalization converts all vectors to unit length for accurate retrieval&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
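&lt;p&gt;The &lt;code&gt;_mean_pooling&lt;/code&gt; helper referenced above is not shown in the snippet; a NumPy sketch of the standard masked mean pooling it presumably performs looks like this:&lt;/p&gt;

```python
import numpy as np

def mean_pooling(token_embeddings, attention_mask):
    """Average token embeddings, ignoring padded positions via the mask.

    token_embeddings: (seq_len, dim); attention_mask: (seq_len,) of 0/1.
    """
    mask = attention_mask[:, None].astype(float)    # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)  # sum of real tokens only
    counts = np.clip(mask.sum(), 1e-9, None)        # avoid divide-by-zero
    return summed / counts

# Toy sequence: two real tokens plus one padded position.
tokens = np.array([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]])  # last row is padding
mask = np.array([1, 1, 0])
print(mean_pooling(tokens, mask))  # [2. 3.]
```

&lt;p&gt;The padded row contributes nothing: only the two real token vectors are averaged.&lt;/p&gt;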

&lt;h3&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#3-precision-modes" rel="noopener noreferrer"&gt;&lt;strong&gt;3. Precision Modes&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;CapybaraDB supports three precision modes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Float32 (default)&lt;/strong&gt;: Full precision, highest accuracy&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Float16&lt;/strong&gt;: Half precision, ~50% memory savings, minimal accuracy loss&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Binary&lt;/strong&gt;: Each dimension becomes 0 or 1, resulting in memory savings. The embedding process converts values &amp;gt; 0 to 1.0:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if self.precision == "binary":    sentence_embeddings = (sentence_embeddings &amp;gt; 0).float()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Binary embeddings use a scaled dot product during search to compensate for information loss.&lt;/p&gt;
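&lt;p&gt;A small NumPy sketch of this quantization and the scaled dot product (toy values, not the library's actual code):&lt;/p&gt;

```python
import numpy as np

# Hypothetical float embeddings, two stored vectors of dimension 4.
emb = np.array([[0.3, -0.2, 0.8, -0.1],
                [-0.5, 0.1, -0.4, 0.6]])
binary = (emb > 0).astype(float)  # each dimension becomes 0 or 1

# Quantize the query the same way.
query = (np.array([0.4, -0.3, 0.5, 0.2]) > 0).astype(float)

# Scaled dot product: fraction of dimensions where both vectors are "active".
scores = binary @ query / query.size
print(scores)  # scores of 0.5 and 0.25 for the two stored vectors
```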

&lt;h2&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#document-processing-pipeline" rel="noopener noreferrer"&gt;&lt;strong&gt;Document Processing Pipeline&lt;/strong&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#adding-a-document" rel="noopener noreferrer"&gt;&lt;strong&gt;Adding a Document&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;When you add a document, here's the full journey:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def add_document(self, text: str, doc_id: Optional[str] = None) -&amp;gt; str:    if doc_id is None:        doc_id = str(uuid.uuid4())        self.index.documents[doc_id] = text    self.index.total_documents += 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 1: ID Generation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If no ID is provided, we generate a UUID. This ensures every document is uniquely identifiable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Chunking (Optional)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If chunking is enabled, the document is split using token-based chunking:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if self.chunking:    enc = tiktoken.get_encoding("cl100k_base")    token_ids = enc.encode(text)    chunks = []    for i in range(0, len(token_ids), self.chunk_size):        tok_chunk = token_ids[i : i + self.chunk_size]        chunk_text = enc.decode(tok_chunk)        chunks.append(chunk_text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why token-based chunking instead of character-based?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Respects word boundaries&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Considers tokenizer structure&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Produces more semantically coherent chunks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Works better with the embedding model&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
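&lt;p&gt;The chunking loop above can be sketched without external dependencies; here a whitespace split stands in for tiktoken's &lt;code&gt;cl100k_base&lt;/code&gt; encode/decode, but the slicing logic is the same:&lt;/p&gt;

```python
def chunk_tokens(text, chunk_size):
    # Whitespace split stands in for enc.encode(text); " ".join for enc.decode.
    tokens = text.split()
    return [" ".join(tokens[i : i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]

doc = "vector databases store embeddings for fast semantic search"
print(chunk_tokens(doc, 3))
# ['vector databases store', 'embeddings for fast', 'semantic search']
```

&lt;p&gt;Note the last chunk may be shorter than &lt;code&gt;chunk_size&lt;/code&gt;; no text is dropped.&lt;/p&gt;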

&lt;p&gt;&lt;strong&gt;Step 3: Create Chunks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each chunk gets its own UUID and is stored with metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for chunk in chunks:    chunk_id = str(uuid.uuid4())    self.index.chunks[chunk_id] = {"text": chunk, "doc_id": doc_id}    chunk_ids.append(chunk_id)    self.index.total_chunks += 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4: Generate Embeddings&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All chunks are embedded in one batch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chunk_texts = [self.index.chunks[cid]["text"] for cid in chunk_ids]chunk_embeddings = self.model.embed(chunk_texts)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Batch processing is key to performance. Embedding 100 chunks together is much faster than 100 individual embeddings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Append to Vector Store&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where the vectors are added to the index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if self.index.vectors is None:    self.index.vectors = chunk_embeddings    self.index.chunk_ids = chunk_ids    self.index.embedding_dim = chunk_embeddings.size(1)else:    self.index.vectors = torch.cat(        [self.index.vectors, chunk_embeddings], dim=0    )    self.index.chunk_ids.extend(chunk_ids)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first document creates the tensor. Subsequent documents are concatenated along the batch dimension.&lt;/p&gt;
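&lt;p&gt;The same append pattern, shown with a NumPy analogue (&lt;code&gt;torch.cat&lt;/code&gt; behaves the same along dim 0; the batch shapes are made up):&lt;/p&gt;

```python
import numpy as np

# First batch creates the matrix; later batches are concatenated
# along the row (batch) dimension, mirroring the index logic above.
vectors = None
for batch in (np.ones((2, 4)), np.zeros((3, 4))):
    if vectors is None:
        vectors = batch
    else:
        vectors = np.concatenate([vectors, batch], axis=0)

print(vectors.shape)  # (5, 4)
```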

&lt;p&gt;&lt;strong&gt;Step 6: Persistence&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If not in-memory mode, the index is saved immediately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if not self.index.storage.in_memory:    self.index.save()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means you can add documents and they're persisted incrementally—no manual save needed!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#the-search-engine" rel="noopener noreferrer"&gt;&lt;strong&gt;The Search Engine&lt;/strong&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#the-search-process" rel="noopener noreferrer"&gt;&lt;strong&gt;The Search Process&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Here's how search works end-to-end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def search(self, query: str, top_k: int = 5):    if self.index.vectors is None:        return []        self.index.ensure_vectors_on_device(target_device)    indices, scores = self.model.search(query, self.index.vectors, top_k)        results = []    for idx, score in zip(indices.tolist(), scores.tolist()):        chunk_id = self.index.chunk_ids[idx]        chunk_info = self.index.chunks[chunk_id]        doc_id = chunk_info["doc_id"]                results.append({            "doc_id": doc_id,            "chunk_id": chunk_id,            "text": chunk_info["text"],            "score": score,            "document": self.index.documents[doc_id],        })        return results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 1: Query Embedding&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The query text is embedded using the same model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def search(self, query: str, embeddings: torch.Tensor, top_k: int):    query_embedding = self.embed(query)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Similarity Computation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The similarity between query and all stored vectors is computed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if self.precision == "binary":    similarities = torch.matmul(        embeddings.float(),         query_embedding.t().float()    ) / query_embedding.size(1)else:    similarities = torch.matmul(embeddings, query_embedding.t())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each query, the system computes similarity with all stored vectors. In &lt;strong&gt;binary precision&lt;/strong&gt;, it performs a dot product between 0/1 vectors, scaled by the embedding dimension—yielding the fraction of matching active bits. In &lt;strong&gt;float precision&lt;/strong&gt;, normalized embeddings use standard cosine similarity via matrix multiplication with the query embedding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Top-K Selection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We use &lt;code&gt;torch.topk&lt;/code&gt; to find the most similar vectors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scores, indices = torch.topk(    similarities.squeeze(),    min(top_k, embeddings.size(0)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4: Result Assembly&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For each result, we reconstruct the full context by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Looking up the chunk text&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finding the parent document ID&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Retrieving the full document&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This gives you both the specific chunk that matched and the full document context.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#storage-and-persistence" rel="noopener noreferrer"&gt;&lt;strong&gt;Storage and Persistence&lt;/strong&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#the-storage-layer" rel="noopener noreferrer"&gt;&lt;strong&gt;The Storage Layer&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;Storage&lt;/code&gt; class handles persistence with NumPy's compressed NPZ format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def save(self, index) -&amp;gt; None:    data = {        "vectors": index.vectors.cpu().numpy(),        "chunk_ids": np.array(index.chunk_ids),        "chunk_texts": np.array([index.chunks[cid]["text"] for cid in index.chunk_ids]),        "chunk_doc_ids": np.array([index.chunks[cid]["doc_id"] for cid in index.chunk_ids]),        "doc_ids": np.array(list(index.documents.keys())),        "doc_texts": np.array(list(index.documents.values())),        "total_chunks": index.total_chunks,        "total_documents": index.total_documents,        "embedding_dim": index.embedding_dim or 0,    }        np.savez_compressed(self.file_path, **data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why NPZ?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Compressed by default (saves space)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Efficient binary format&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Handles large arrays well&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cross-platform and language-agnostic&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
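&lt;p&gt;A quick round trip shows the NPZ format in action (the array names and sizes here are illustrative, not CapybaraDB's exact schema):&lt;/p&gt;

```python
import os
import tempfile

import numpy as np

# Save a small "index" with a vector matrix and string chunk IDs.
vectors = np.random.rand(10, 384).astype(np.float32)
chunk_ids = np.array(["c1", "c2"])

path = os.path.join(tempfile.mkdtemp(), "index.npz")
np.savez_compressed(path, vectors=vectors, chunk_ids=chunk_ids)

# Load it back: each keyword argument becomes a named array in the archive.
loaded = np.load(path)
assert np.array_equal(loaded["vectors"], vectors)
print(loaded["chunk_ids"])  # ['c1' 'c2']
```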

&lt;h3&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#in-memory-vs-persistent" rel="noopener noreferrer"&gt;&lt;strong&gt;In-Memory vs Persistent&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;CapybaraDB supports two modes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In-Memory&lt;/strong&gt;: No file path specified. Data stays in RAM, lost on exit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Persistent&lt;/strong&gt;: File path specified. Data is saved to disk after each &lt;code&gt;add_document()&lt;/code&gt; call.&lt;/p&gt;

&lt;p&gt;This dual-mode design enables both temporary experiments (in-memory) and production use (persistent).&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#putting-it-all-together" rel="noopener noreferrer"&gt;&lt;strong&gt;Putting It All Together&lt;/strong&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#example-simple-document-search" rel="noopener noreferrer"&gt;&lt;strong&gt;Example: Simple Document Search&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from capybaradb.main import CapybaraDB # Initializedb = CapybaraDB(    collection="research_papers",    chunking=True,    chunk_size=512,    device="cuda") # Add documentsdoc1_id = db.add_document("Machine learning is transforming NLP...")doc2_id = db.add_document("Deep neural networks excel at image recognition...") # Searchresults = db.search("artificial intelligence", top_k=2) # Use resultsfor result in results:    print(f"Score: {result['score']:.4f}")    print(f"Matched text: {result['text'][:100]}...")    print(f"Full document: {result['document']}")    print("---")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;a href="https://www.piyushchoudhari.me/blog/Building-A-Vector-Database-from-Scratch-CapybaraDB#conclusion" rel="noopener noreferrer"&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Implementation: &lt;a href="https://github.com/capybara-brain346/capybaradb.git?utm_source=www.piyushchoudhari.me&amp;amp;utm_medium=portfolio_website&amp;amp;utm_campaign=referral" rel="noopener noreferrer"&gt;&lt;strong&gt;GitHub&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This implementation of CapybaraDB was purely for educational purposes and my own learning. I had a great time figuring out the nitty-gritty details behind vector databases and will definitely take on more challenging implementations in the future.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vectordatabase</category>
      <category>rag</category>
    </item>
    <item>
      <title>Be Curious About Your Compute</title>
      <dc:creator>Piyush Choudhari</dc:creator>
      <pubDate>Wed, 24 Sep 2025 13:03:39 +0000</pubDate>
      <link>https://dev.to/piyush_choudhari_a5b29f7f/be-curious-about-your-compute-2gfh</link>
      <guid>https://dev.to/piyush_choudhari_a5b29f7f/be-curious-about-your-compute-2gfh</guid>
      <description>&lt;h2&gt;
  
  
  Hardware Often Takes The Backseat
&lt;/h2&gt;

&lt;p&gt;I've been a software guy throughout my journey and &lt;strong&gt;I've rarely tried to lift the curtain and focus on the hardware&lt;/strong&gt;. Those rare times probably happened during my IoT study sessions in my engineering course.&lt;/p&gt;

&lt;p&gt;Most of the blockers I've ever faced during my AI engineering journey have been related to software: broken drivers, unpatched source code, outdated libraries and, of course, classic Python quirks.&lt;/p&gt;

&lt;p&gt;But one day I faced a blocker, not knowing at the time that it was hardware related. At my internship, I was trying to &lt;strong&gt;draw inference from an Automatic1111 server on AWS&lt;/strong&gt;. The server was equipped with two specific diffusion models: SDXL (very heavy) and SD1.5 (lighter). However, &lt;strong&gt;drawing inference from one model caused the next inference from the second model to be very slow&lt;/strong&gt;. My first reaction was that it was because of caching.&lt;/p&gt;

&lt;p&gt;But after debugging the issue, I learned that this was &lt;strong&gt;expected behavior of the Automatic1111 module&lt;/strong&gt;. It loaded a model into VRAM and "kept it hot" for fast inference, so switching to the second model meant loading it from scratch, hence the slow inference.&lt;/p&gt;
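&lt;p&gt;For intuition, here is a rough back-of-envelope for what a cold model swap costs. The checkpoint sizes and bandwidths below are illustrative assumptions, not measurements from that server:&lt;/p&gt;

```python
# Back-of-envelope: why swapping diffusion models between requests is slow.
# All numbers below are illustrative assumptions, not measurements.

def swap_cost_seconds(model_size_gb: float, disk_gbps: float, pcie_gbps: float) -> float:
    """Time to cold-load a model into VRAM: read the weights from disk,
    then copy them over PCIe to the GPU."""
    return model_size_gb / disk_gbps + model_size_gb / pcie_gbps

sdxl_gb = 6.9     # assumed SDXL checkpoint size (GB)
sd15_gb = 4.0     # assumed SD1.5 checkpoint size (GB)
disk_gbps = 0.5   # assumed cloud-disk read bandwidth (GB/s)
pcie_gbps = 16.0  # assumed effective PCIe bandwidth (GB/s)

# A cold swap pays both transfers; a model already "kept hot" in VRAM pays ~0.
print(f"SDXL cold load:  ~{swap_cost_seconds(sdxl_gb, disk_gbps, pcie_gbps):.1f} s")
print(f"SD1.5 cold load: ~{swap_cost_seconds(sd15_gb, disk_gbps, pcie_gbps):.1f} s")
```

&lt;p&gt;The disk read dominates, which is exactly why Automatic1111 prefers to keep one model resident in VRAM.&lt;/p&gt;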

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpyjo37u8v6l7q4t71pn8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpyjo37u8v6l7q4t71pn8.png" alt="my story" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That day I learnt that understanding how software and hardware interact goes a long way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Not thinking about the hardware"&lt;/strong&gt; is something I observe a lot from peers and engineers. So, I had the idea to write this blog.&lt;/p&gt;




&lt;h2&gt;
  
  
  Types Of Available Hardware
&lt;/h2&gt;

&lt;h3&gt;
  
  
  CPU vs GPU vs TPU
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;CPU&lt;/th&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;TPU&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Role&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;General-purpose processor, brain of the computer&lt;/td&gt;
&lt;td&gt;Originally for graphics, now massively parallel compute&lt;/td&gt;
&lt;td&gt;Custom ASIC by Google for tensor operations &amp;amp; deep learning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cores&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Few powerful cores (2–64, more in servers)&lt;/td&gt;
&lt;td&gt;Thousands of lightweight cores (A100 ~7,000+)&lt;/td&gt;
&lt;td&gt;Systolic arrays with Matrix Multiply Units (MXUs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Execution Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low-latency, sequential, strong branch handling&lt;/td&gt;
&lt;td&gt;SIMT/SIMD, warp scheduling, high throughput&lt;/td&gt;
&lt;td&gt;Data flows across systolic array; highly specialized instructions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large cache hierarchy, moderate bandwidth&lt;/td&gt;
&lt;td&gt;High-bandwidth VRAM (HBM/GDDR), smaller caches, bulk data optimized&lt;/td&gt;
&lt;td&gt;HBM tightly coupled with compute; optimized for DL precision formats&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Strengths&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Versatile, single-thread performance, task switching&lt;/td&gt;
&lt;td&gt;Matrix/vector math, AI/ML training, rendering, high throughput&lt;/td&gt;
&lt;td&gt;Extremely efficient at AI training/inference, high perf-per-watt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited parallelism, not efficient for massive matrix ops&lt;/td&gt;
&lt;td&gt;Poor at branch-heavy sequential code, higher latency, needs CUDA/OpenCL&lt;/td&gt;
&lt;td&gt;Not general-purpose, tied to Google ecosystem, less flexible&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;AI, or any computationally expensive workload, &lt;strong&gt;requires extremely high throughput&lt;/strong&gt;, as the calculations are extremely straightforward (matmuls, gradient averages, dot products, etc.), unlike the work a CPU does, which involves complex branching logic and demands minimal latency. Hence GPUs and TPUs are extremely popular choices.&lt;/p&gt;
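&lt;p&gt;One way to see why these "straightforward" calculations suit GPUs and TPUs is to compute the arithmetic intensity of a matrix multiply: how many FLOPs you get per byte moved. A quick sketch (the matrix sizes are arbitrary):&lt;/p&gt;

```python
# Arithmetic intensity of C = A @ B, with A of shape (M, K) and B of shape (K, N).
# High intensity (many FLOPs per byte moved) is what throughput machines are built for.

def matmul_flops(m: int, k: int, n: int) -> int:
    return 2 * m * k * n  # one multiply + one add per inner-product term

def matmul_bytes(m: int, k: int, n: int, bytes_per_elem: int = 4) -> int:
    return (m * k + k * n + m * n) * bytes_per_elem  # read A, read B, write C

m = k = n = 4096
intensity = matmul_flops(m, k, n) / matmul_bytes(m, k, n)
print(f"FLOPs: {matmul_flops(m, k, n):.3e}")
print(f"Bytes moved (ideal): {matmul_bytes(m, k, n):.3e}")
print(f"Arithmetic intensity: {intensity:.0f} FLOPs/byte")
```

&lt;p&gt;Hundreds of FLOPs per byte is the regime where thousands of simple cores shine; branchy CPU code, by contrast, does only a handful of operations per byte it touches.&lt;/p&gt;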

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkgd9ecta4y4ersngdkc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkgd9ecta4y4ersngdkc.png" alt="cpu vs cpu vs tpu" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GPUs, in particular, have extremely high bandwidth and, unlike TPUs, &lt;strong&gt;interface well with the wider ecosystem: CUDA libraries are mature, and PyTorch and TensorFlow offer deep interoperability with them&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bottlenecks
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd75scf2s9p7tbca44yaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd75scf2s9p7tbca44yaj.png" alt="Bottlenecks" width="689" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is normal to focus on metrics like TFLOPs. But a system's performance and turnaround time is often limited by how quickly data can be moved and accessed, not just by how fast it can be processed.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory capacity (VRAM) and bandwidth are frequently more important than raw compute power&lt;/strong&gt;. A processor is inefficient if it sits idle waiting for data. VRAM capacity is a key constraint, as it determines whether a large model, such as an LLM, can fit onto a single GPU. For instance, a high-TFLOP consumer GPU like an RTX 4090 with 24 GB of VRAM cannot train models that require the 80 GB or more offered by datacenter GPUs. VRAM size also limits the training batch size, affecting efficiency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When scaling to multiple GPUs for larger models, &lt;strong&gt;the interconnect, the communication pathway between GPUs could become the main bottleneck&lt;/strong&gt;. Standard interconnects like PCIe have limited bandwidth (around 64 GB/s), which can get saturated when GPUs synchronize data, leading to diminishing returns when scaling beyond a few GPUs. In contrast, proprietary technologies like NVIDIA’s NVLink provide vastly superior bandwidth (up to 900 GB/s), resulting in efficient scaling for training massive foundation models.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The entire system can be constrained by &lt;strong&gt;slow data I/O from storage&lt;/strong&gt;. If data cannot be loaded from disks to the GPUs quickly enough, even the most powerful hardware will be wasted, creating a foundational bottleneck.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
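&lt;p&gt;Both the VRAM point and the interconnect point can be sanity-checked with rough arithmetic. The sketch below uses illustrative numbers (a 7B-parameter model, naive FP32 training state, and a ring all-reduce bandwidth model), not benchmarks:&lt;/p&gt;

```python
# 1) Why a 24 GB card can't train what an 80 GB card can: training memory is
# roughly weights + gradients + Adam optimizer states, before counting activations.

def training_vram_gb(params_billions: float, bytes_per_param: int = 4) -> float:
    weights = grads = params_billions * bytes_per_param  # GB, since 1e9 params
    adam_states = 2 * params_billions * 4  # two FP32 states (m, v) per parameter
    return weights + grads + adam_states

print(f"7B model, FP32 weights/grads + Adam: ~{training_vram_gb(7):.0f} GB")

# 2) Why the interconnect matters: time for a ring all-reduce of the gradients.
def allreduce_seconds(grad_gb: float, n_gpus: int, link_gbps: float) -> float:
    moved = 2 * (n_gpus - 1) / n_gpus * grad_gb  # GB sent per GPU in a ring
    return moved / link_gbps

grads_gb = 28.0  # FP32 gradients of a 7B model
print(f"PCIe (~64 GB/s):    {allreduce_seconds(grads_gb, 8, 64.0):.2f} s per step")
print(f"NVLink (~900 GB/s): {allreduce_seconds(grads_gb, 8, 900.0):.3f} s per step")
```

&lt;p&gt;Over 100 GB of training state for a 7B model makes the 24 GB vs 80 GB gap concrete, and the order-of-magnitude difference in synchronization time per step is exactly the "diminishing returns" you hit when scaling over PCIe.&lt;/p&gt;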




&lt;h2&gt;
  
  
  Accelerators
&lt;/h2&gt;

&lt;p&gt;Modern AI accelerators are designed around &lt;strong&gt;specific philosophies&lt;/strong&gt;, and each type of accelerator handles different types of workloads.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NVIDIA H100&lt;/strong&gt; focuses on cutting-edge training with its &lt;strong&gt;Transformer Engine&lt;/strong&gt; and &lt;strong&gt;FP8 support&lt;/strong&gt;, demanding extreme bandwidth and power efficiency for massive LLMs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google TPU v5p&lt;/strong&gt; uses a &lt;strong&gt;systolic array (MXU)&lt;/strong&gt; for extreme efficiency in large-scale, matrix-heavy workloads, tightly coupled with Google’s distributed infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AMD MI300&lt;/strong&gt; competes by integrating &lt;strong&gt;CPU and GPU components&lt;/strong&gt; into one package, offering a &lt;strong&gt;large unified memory pool&lt;/strong&gt; — attractive for diverse HPC and AI workloads where flexibility and capacity matter.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These differences reflect the &lt;strong&gt;architectural trade-offs&lt;/strong&gt; between general-purpose flexibility (GPU) and specialized efficiency (TPU), with AMD carving a hybrid path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Accelerator Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Accelerator&lt;/th&gt;
&lt;th&gt;Memory&lt;/th&gt;
&lt;th&gt;Bandwidth&lt;/th&gt;
&lt;th&gt;Key Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NVIDIA H100&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;80 GB HBM3&lt;/td&gt;
&lt;td&gt;~3.0 TB/s&lt;/td&gt;
&lt;td&gt;Cutting-edge LLM training, HPC workloads with extreme throughput needs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AMD MI300X&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;192 GB HBM3&lt;/td&gt;
&lt;td&gt;~5.3 TB/s&lt;/td&gt;
&lt;td&gt;Large-scale AI training, HPC, and very large models (fits datasets in memory)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google TPU v5p&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;95 GB HBM2e per chip&lt;/td&gt;
&lt;td&gt;~2.77 TB/s&lt;/td&gt;
&lt;td&gt;Large-scale AI training/inference; matrix-heavy workloads in Google ecosystem&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Practical Tips
&lt;/h2&gt;

&lt;p&gt;To optimize AI workloads, we need to focus on data movement too, not just compute speed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Profile your system:&lt;/strong&gt; The first step is to identify where the bottlenecks are. Is it memory capacity (VRAM), memory bandwidth, interconnect speed between GPUs, or slow data loading from storage?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use mixed precision (AMP):&lt;/strong&gt; Modern GPUs have specialized Tensor Cores that accelerate FP16/BF16 computations. Using Automatic Mixed Precision (AMP) in frameworks like PyTorch significantly speeds up training by using lower precision for matrix math while maintaining accuracy with FP32 for sensitive operations like loss calculation. This also reduces VRAM usage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tune batch size and data loaders:&lt;/strong&gt; VRAM capacity limits your batch size. A larger batch can improve hardware utilization, but a small one may be forced by memory constraints. Ensure your data loading pipeline from storage to GPU is not the bottleneck, as even the fastest GPU is wasted if it's waiting for data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leverage optimized frameworks:&lt;/strong&gt; For multi-GPU training, use libraries like PyTorch's &lt;code&gt;DistributedDataParallel&lt;/code&gt; with NCCL backends, or advanced frameworks like DeepSpeed, to efficiently manage gradient synchronization, which is often a key bottleneck. These tools are crucial for scaling effectively.&lt;/li&gt;
&lt;/ul&gt;
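&lt;p&gt;As a sketch of the batch-size tip: once weights, gradients and optimizer states claim their fixed share of VRAM, the batch size is whatever fits in the remainder. All sizes below are illustrative assumptions for a hypothetical model, not measurements:&lt;/p&gt;

```python
# Rough batch-size budgeting: how many samples fit once the model, gradients,
# and optimizer states have claimed their fixed share of VRAM.
# All sizes are illustrative assumptions for a hypothetical model.

def max_batch_size(vram_gb: float, fixed_gb: float, per_sample_gb: float) -> int:
    """Largest batch whose activations still fit in the remaining VRAM."""
    free = vram_gb - fixed_gb
    return max(0, int(free // per_sample_gb))

vram_gb = 24.0       # e.g. an RTX 4090
fixed_gb = 14.0      # assumed weights + grads + optimizer states
per_sample_gb = 0.6  # assumed activation memory per sample

print(f"Fits batch size: {max_batch_size(vram_gb, fixed_gb, per_sample_gb)}")
# Mixed precision roughly halves activation memory, so the batch can grow:
print(f"With AMP (~half activations): {max_batch_size(vram_gb, fixed_gb, per_sample_gb / 2)}")
```

&lt;p&gt;In practice you would refine these numbers by profiling, but the budgeting logic stays the same: it's the memory, not the TFLOPs, that sets your batch size.&lt;/p&gt;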




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;You don't need to be a GPU engineer or a chip designer to understand the nuances of hardware and the interaction between hardware and software. The only requirements are to keep the hardware in mind while designing systems and to &lt;strong&gt;&lt;em&gt;Be Curious About Your Compute&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>hardware</category>
      <category>gpu</category>
      <category>tpu</category>
    </item>
    <item>
      <title>How I Built an Automated Social Media Workflow with LangGraph</title>
      <dc:creator>Piyush Choudhari</dc:creator>
      <pubDate>Sun, 21 Sep 2025 02:25:06 +0000</pubDate>
      <link>https://dev.to/piyush_choudhari_a5b29f7f/how-i-built-an-automated-social-media-workflow-with-langgraph-58jp</link>
      <guid>https://dev.to/piyush_choudhari_a5b29f7f/how-i-built-an-automated-social-media-workflow-with-langgraph-58jp</guid>
      <description>&lt;h2&gt;
  
  
Why automate?
&lt;/h2&gt;

&lt;p&gt;I recently started blogging; more specifically, my interest is in technical blogging. I enjoy going through papers, articles and GitHub repos that pique my interest, which led me to the idea of posting about them. But I realized technical blogging can be &lt;strong&gt;more than just a medium to share my interests; it can be a powerful way to build up my personal brand.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Speaking of creating a personal brand, a lot goes into it, but in my simple viewpoint it &lt;strong&gt;only requires two things:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High value and high impact content&lt;/li&gt;
&lt;li&gt;A great distribution strategy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Social media is the "great distribution strategy".&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But writing content optimized for social media, specifically LinkedIn and X, was &lt;strong&gt;not an area I wanted to spend a lot of time in&lt;/strong&gt;, and &lt;strong&gt;being an engineer, I was attracted to automating it away.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
How it works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Central Idea&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;At the heart of this project is a &lt;strong&gt;LangGraph workflow&lt;/strong&gt; that automates my content pipeline: right from the idea stage to the draft and publish stage of my social media posts (not my blogs), the workflow takes care of the entire generation, answering questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;What kind of posts do I want?&lt;/em&gt; &lt;/li&gt;
&lt;li&gt;
&lt;em&gt;What type of post do I need depending on the day of the week?&lt;/em&gt; &lt;/li&gt;
&lt;li&gt;&lt;em&gt;Is the content of the post legitimate and factual?&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Does the post make any claims with no basis?&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Does the post state any statements without proper citations?&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
&lt;em&gt;Why LangGraph?&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;After using multiple tools for creating agents, like &lt;strong&gt;LCEL, CrewAI and DSPy&lt;/strong&gt;, I felt LangGraph was extremely easy to follow in terms of the workflow, as it &lt;strong&gt;maps very well to the flowcharts and UMLs&lt;/strong&gt; we encounter on a daily basis. Also, I am familiar with LangGraph, as &lt;strong&gt;I used it extensively at my internship&lt;/strong&gt; to build agentic systems. &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Flowchart&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pb4gkund70zvrm56kj0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pb4gkund70zvrm56kj0.png" alt="post automation flowchart" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Here’s a breakdown of the process:&lt;/em&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Capture &amp;amp; Research&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The workflow starts by capturing a raw idea (&lt;code&gt;capture_idea&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;It then pulls in notes and references from Obsidian (&lt;code&gt;obsidian_research&lt;/code&gt;), ensuring the idea is grounded in prior research.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Planning &amp;amp; Teaser Generation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;planner_agent&lt;/code&gt; structures the idea into a roadmap.&lt;/li&gt;
&lt;li&gt;Depending on the phase, the system may generate an early &lt;strong&gt;teaser post&lt;/strong&gt; (&lt;code&gt;teaser_generator&lt;/code&gt;) to share on Monday as a preview.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Drafting the Blog&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If we’re in the drafting phase, the &lt;code&gt;blog_drafter&lt;/code&gt; creates a long-form draft.&lt;/li&gt;
&lt;li&gt;Once the final blog URL is available, the system scrapes the published content (&lt;code&gt;scraper&lt;/code&gt;) and builds a concise summary (&lt;code&gt;summarizer&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Social Media Post Creation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using the blog summary, the workflow generates tailored LinkedIn posts (&lt;code&gt;final_post_generator&lt;/code&gt;) and X/Twitter posts (&lt;code&gt;x_generator&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Validation &amp;amp; Review&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Posts are validated for structure, tone, and platform-fit (&lt;code&gt;validator&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;If issues are found, they are sent for peer review (&lt;code&gt;peer_reviewer&lt;/code&gt;) and, if necessary, refined by the &lt;code&gt;content_improver&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The workflow limits improvements to 3 iterations to avoid endless loops.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Self-Evaluation &amp;amp; Recovery&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Before finalizing, posts undergo a self-check (&lt;code&gt;self_evaluator&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;If errors occur (e.g., missing blog URL, broken logic), a &lt;code&gt;recovery_agent&lt;/code&gt; steps in to fix or gracefully exit.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Completion&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Once validated, the system outputs &lt;strong&gt;ready-to-publish posts&lt;/strong&gt; for LinkedIn and X.&lt;/li&gt;
&lt;li&gt;If human oversight is required (e.g., controversial phrasing or ambiguous context), the workflow pauses and flags it for manual review.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
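&lt;p&gt;The iteration cap is the detail worth copying: without it, a strict validator and an eager improver can bounce a post back and forth forever. Here is a minimal, plain-Python sketch of that bounded routing decision; the state shape and helper names are illustrative, not my exact code:&lt;/p&gt;

```python
# Bounded review loop: validate -> improve -> validate, at most MAX_ITERATIONS times.
# This mirrors the validator/content_improver edges in the graph; the state dict
# and node names here are illustrative.

MAX_ITERATIONS = 3

def route_after_validation(state: dict) -> str:
    """Decide the next node after the validator runs."""
    if state["validation_passed"]:
        return "END"
    if state["iterations"] >= MAX_ITERATIONS:
        return "peer_reviewer"  # stop auto-fixing, escalate for review
    return "content_improver"

state = {"validation_passed": False, "iterations": 0}
while (nxt := route_after_validation(state)) == "content_improver":
    state["iterations"] += 1  # content_improver runs, then we re-validate
    state["validation_passed"] = state["iterations"] >= 2  # pretend the 2nd fix passes

print(nxt, state["iterations"])  # the loop terminates instead of spinning forever
```

&lt;p&gt;The same guard expressed as a conditional edge is what keeps the real graph from looping endlessly.&lt;/p&gt;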

&lt;p&gt;&lt;strong&gt;The workflow code snippet:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_workflow&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;workflow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AutomationState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capture_idea&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capture_idea&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;obsidian_research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;process_obsidian_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;planner_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;planner_agent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;teaser_generator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;teaser_generator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;blog_drafter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;blog_drafter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scrape_blog_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarizer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generate_blog_summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;final_post_generator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generate_linkedin_posts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x_generator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generate_x_posts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validate_posts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;peer_reviewer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;peer_review_agent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content_improver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_improver_agent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;self_evaluator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self_evaluator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recovery_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recovery_agent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_entry_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capture_idea&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capture_idea&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;obsidian_research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;obsidian_research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;planner_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;planner_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;should_generate_teaser&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;teaser_generator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;teaser_generator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;planner_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;planner_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;teaser_generator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;should_generate_blog_draft&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;blog_drafter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;blog_drafter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;planner_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;planner_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;blog_drafter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;should_scrape_blog&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;END&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarizer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarizer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;final_post_generator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;final_post_generator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x_generator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x_generator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;should_validate_or_end&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recovery_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recovery_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;END&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;should_improve_or_evaluate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;peer_reviewer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;peer_reviewer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;self_evaluator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;self_evaluator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recovery_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recovery_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;peer_reviewer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;should_improve_or_end&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content_improver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content_improver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;self_evaluator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;self_evaluator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recovery_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recovery_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content_improver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;self_evaluator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;should_loop_or_end&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recovery_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recovery_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;END&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recovery_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
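
&lt;p&gt;&lt;em&gt;For context on the conditional edges above:&lt;/em&gt; each router passed to &lt;code&gt;add_conditional_edges&lt;/code&gt; is just a plain function that inspects the graph state and returns one of the keys declared in the mapping. As a minimal sketch, &lt;code&gt;should_validate_or_end&lt;/code&gt; could look like the following (the &lt;code&gt;error&lt;/code&gt; and &lt;code&gt;x_post&lt;/code&gt; state keys are assumptions for illustration; the actual routers are in the full source):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def should_validate_or_end(state: dict) -&amp;gt; str:
    # Route from x_generator: recover on failure, validate a finished
    # post, or end the run. The returned string must match a key in the
    # mapping given to add_conditional_edges.
    if state.get("error"):
        return "recovery_agent"
    if state.get("x_post"):
        return "validator"
    return "END"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The same pattern applies to the other routers (&lt;code&gt;should_improve_or_evaluate&lt;/code&gt;, &lt;code&gt;should_loop_or_end&lt;/code&gt;, etc.): state in, mapping key out.&lt;/p&gt;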



&lt;h3&gt;
  
  
  &lt;em&gt;Entire code for my automation:&lt;/em&gt; &lt;a href="https://github.com/capybara-brain346/post-automation.git" rel="noopener noreferrer"&gt;&lt;u&gt;GitHub&lt;/u&gt;&lt;/a&gt;
&lt;/h3&gt;




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Building out this automation was a lot of fun. It's too early to call it a boon for my productivity, but I will be keeping a &lt;strong&gt;close eye on the gains (or losses) it produces, and I will certainly be posting about them.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>langgraph</category>
      <category>agents</category>
      <category>aiml</category>
      <category>automation</category>
    </item>
  </channel>
</rss>
