<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alok Ranjan Daftuar</title>
    <description>The latest articles on DEV Community by Alok Ranjan Daftuar (@aloknecessary).</description>
    <link>https://dev.to/aloknecessary</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3791551%2F62fbfeb5-1fba-4e79-bc4b-780b7ce52748.jpg</url>
      <title>DEV Community: Alok Ranjan Daftuar</title>
      <link>https://dev.to/aloknecessary</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aloknecessary"/>
    <language>en</language>
    <item>
      <title>Context Engineering: The Discipline That Determines What Your LLM Actually Sees</title>
      <dc:creator>Alok Ranjan Daftuar</dc:creator>
      <pubDate>Mon, 29 Jun 2026 08:11:02 +0000</pubDate>
      <link>https://dev.to/aloknecessary/context-engineering-the-discipline-that-determines-what-your-llm-actually-sees-569g</link>
      <guid>https://dev.to/aloknecessary/context-engineering-the-discipline-that-determines-what-your-llm-actually-sees-569g</guid>
      <description>&lt;p&gt;Prompt engineering asks: how do I phrase this instruction? Context engineering asks: what information does the model need, in what form, in what order, and how much of it — to produce a correct answer?&lt;/p&gt;

&lt;p&gt;For a long time, the implicit mental model was: give the LLM more context and it performs better. This is wrong. A 20,000-token window stuffed with weakly relevant content produces worse answers than a 4,000-token window with precisely curated information. Larger windows do not eliminate context quality problems — they amplify them.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Context Window Is a Budget
&lt;/h2&gt;

&lt;p&gt;Treat it as a budget with competing line items, not a container you fill. Start with the total window, subtract fixed allocations (system prompt, output reserve, safety margin), and what remains is your dynamic budget split across retrieved chunks, conversation history, and memory.&lt;/p&gt;

&lt;p&gt;The first question should always be: "can we get better at selecting less, rather than including more?"&lt;/p&gt;
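
&lt;p&gt;A minimal sketch of that budget arithmetic. The token counts and the 60/30/10 split across chunks, history, and memory are illustrative assumptions, and &lt;code&gt;ContextBudget&lt;/code&gt; is a hypothetical name, not a class from any library:&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextBudget:
    """Hypothetical budget model; all sizes are in tokens."""
    total_window: int
    system_prompt: int
    output_reserve: int
    safety_margin: int

    @property
    def dynamic(self) -> int:
        # Whatever is left after the fixed allocations.
        fixed = self.system_prompt + self.output_reserve + self.safety_margin
        return max(self.total_window - fixed, 0)

    def split(self, chunks_pct: int = 60, history_pct: int = 30) -> dict:
        # Memory takes the remainder so the three parts always sum to `dynamic`.
        chunk_tokens = self.dynamic * chunks_pct // 100
        history_tokens = self.dynamic * history_pct // 100
        memory_tokens = self.dynamic - chunk_tokens - history_tokens
        return {"chunks": chunk_tokens, "history": history_tokens, "memory": memory_tokens}


budget = ContextBudget(total_window=16000, system_prompt=1500,
                       output_reserve=2000, safety_margin=500)
allocation = budget.split()
print(budget.dynamic, allocation)
```

&lt;p&gt;Integer percentages keep the split exact, and giving memory the remainder means rounding never pushes the total over the dynamic budget.&lt;/p&gt;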




&lt;h2&gt;
  
  
  Four Memory Types, Four Purposes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Episodic&lt;/strong&gt; — conversation history. Highest priority for continuity. Grows unbounded — needs compression.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic&lt;/strong&gt; — durable facts about the user (role, team, preferences). Compact, injected in system prompt before retrieved content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Procedural&lt;/strong&gt; — reusable workflows and SOPs. Retrieved selectively when the query type matches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Working&lt;/strong&gt; — intermediate results within a single request (agentic loop output). Ephemeral, request-scoped.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each type has different durability, update frequency, and token cost. Conflating them into a single undifferentiated store is the most common memory architecture mistake.&lt;/p&gt;




&lt;h2&gt;
  
  
  Structured Injection Patterns
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;XML tags&lt;/strong&gt; for section boundaries (&lt;code&gt;&amp;lt;documents&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;user_context&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;instructions&amp;gt;&lt;/code&gt;) — gives the model clear anchors for where information types begin and end&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Indexed documents&lt;/strong&gt; — label chunks with indices so citations can be traced&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ordering matters&lt;/strong&gt; — most relevant content first (primacy effect), user query last (recency effect)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grounding instruction is not optional&lt;/strong&gt; — explicit instruction to use only provided context and signal when insufficient&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Lost-in-the-Middle
&lt;/h2&gt;

&lt;p&gt;Models attend more strongly to content near the beginning and end of the context window. Information buried in the middle receives less attention. Mitigations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Relevance-ordered injection (highest score first)&lt;/li&gt;
&lt;li&gt;Sandwich pattern (critical content at both start and end)&lt;/li&gt;
&lt;li&gt;Active relevance filtering (exclude low-scoring chunks even if they fit)&lt;/li&gt;
&lt;li&gt;Smaller, tighter windows (fewer high-quality chunks &amp;gt; more mediocre chunks)&lt;/li&gt;
&lt;/ol&gt;
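
&lt;p&gt;Mitigations 2 and 3 can be combined in a few lines. This is a sketch under stated assumptions: chunks arrive as (text, score) pairs, the 0.5 cutoff is arbitrary, and &lt;code&gt;filter_and_sandwich&lt;/code&gt; is a name made up for this example:&lt;/p&gt;

```python
def filter_and_sandwich(chunks, min_score=0.5):
    """chunks: list of (text, relevance_score) pairs.

    Drops low-scoring chunks, then places the strongest chunks at the
    edges of the context and the weakest in the middle.
    """
    kept = sorted((c for c in chunks if c[1] >= min_score),
                  key=lambda c: c[1], reverse=True)
    front, back = [], []
    for i, chunk in enumerate(kept):
        # Alternate: best goes first, second best goes last, and so on.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


ordered = filter_and_sandwich(
    [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6), ("e", 0.2)]
)
print([text for text, _ in ordered])
```

&lt;p&gt;With the sample scores, chunk "e" is excluded even though it would fit, and the two strongest chunks end up at the start and the end of the context.&lt;/p&gt;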




&lt;h2&gt;
  
  
  Conversation Compression
&lt;/h2&gt;

&lt;p&gt;Left unmanaged, a 100-turn conversation can consume the entire budget you meant for retrieved context. Naive truncation loses critical early constraints. Solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sliding window with pinned turns&lt;/strong&gt; — critical turns (user constraints, decisions) never truncated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Progressive summarization&lt;/strong&gt; — compress old segments into 3-5 sentence summaries using Haiku (cheap, mechanical task)&lt;/li&gt;
&lt;/ul&gt;
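
&lt;p&gt;One possible shape for a sliding window with pinned turns. The turn format (dicts with a "text" key and an optional "pinned" flag) and the whitespace token counter are assumptions made for this sketch; a real system would use the model's tokenizer:&lt;/p&gt;

```python
def trim_history(turns, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep every pinned turn, then fill the remaining budget with the
    most recent unpinned turns. Original order is preserved.

    Pinned turns are kept even if they alone exceed max_tokens.
    """
    keep = {i for i, t in enumerate(turns) if t.get("pinned")}
    used = sum(count_tokens(turns[i]["text"]) for i in keep)
    for i in range(len(turns) - 1, -1, -1):  # newest first
        if i in keep:
            continue
        cost = count_tokens(turns[i]["text"])
        if used + cost > max_tokens:
            break  # older turns are dropped once the budget is exhausted
        keep.add(i)
        used += cost
    return [turns[i] for i in sorted(keep)]


history = [
    {"text": "budget must stay under 10k", "pinned": True},
    {"text": "chit chat one two"},
    {"text": "more chat here"},
    {"text": "latest question"},
]
trimmed = trim_history(history, max_tokens=10)
print([t["text"] for t in trimmed])
```

&lt;p&gt;The dropped turns are the natural input for progressive summarization: summarize them once, then store the summary as a single pinned turn.&lt;/p&gt;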




&lt;h2&gt;
  
  
  Context Assembly Is Testable
&lt;/h2&gt;

&lt;p&gt;Unit test your assembly layer: budget compliance, ordering preserved, critical turns survive truncation, no mid-chunk truncation. Every assembly failure produces a predictable RAGAS metric signature — context precision drops point to noisy inclusion, faithfulness drops point to contradictions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Read the Full Article
&lt;/h2&gt;

&lt;p&gt;This is a summary of my deep dive into context engineering. The full article covers the complete discipline with production implementations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;👉 &lt;a href="https://aloknecessary.github.io/blogs/discipline-that-determines-what-your-llm-actually-sees/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication&amp;amp;utm_content=context-engineering" rel="noopener noreferrer"&gt;Context Engineering: The Discipline That Determines What Your LLM Actually Sees — Full Article&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full article includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context window budget accounting with Python dataclasses&lt;/li&gt;
&lt;li&gt;Four memory types with implementation patterns (episodic, semantic, procedural, working)&lt;/li&gt;
&lt;li&gt;Working memory bridge from agentic retrieval loops&lt;/li&gt;
&lt;li&gt;XML-structured injection with document indexing&lt;/li&gt;
&lt;li&gt;Primacy/recency ordering strategy&lt;/li&gt;
&lt;li&gt;Progressive summarization with critical turn pinning&lt;/li&gt;
&lt;li&gt;Lost-in-the-middle mitigation (4 strategies with code)&lt;/li&gt;
&lt;li&gt;Contradiction detection and resolution&lt;/li&gt;
&lt;li&gt;Noise taxonomy (stale, tangential, redundant, over-retrieved)&lt;/li&gt;
&lt;li&gt;Unit testing context assembly&lt;/li&gt;
&lt;li&gt;AssemblyMetadata integration with RAGAS eval pipeline&lt;/li&gt;
&lt;li&gt;RAGAS metric → assembly failure mapping table&lt;/li&gt;
&lt;li&gt;Production checklist (19 items)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>Agentic RAG: Designing Self-Correcting Retrieval Loops for Production</title>
      <dc:creator>Alok Ranjan Daftuar</dc:creator>
      <pubDate>Mon, 22 Jun 2026 05:59:20 +0000</pubDate>
      <link>https://dev.to/aloknecessary/agentic-rag-designing-self-correcting-retrieval-loops-for-production-2lbg</link>
      <guid>https://dev.to/aloknecessary/agentic-rag-designing-self-correcting-retrieval-loops-for-production-2lbg</guid>
      <description>&lt;p&gt;Standard RAG retrieves once and hopes for the best. Agentic RAG retrieves, reflects, decides it was wrong, and tries again — without being told to.&lt;/p&gt;

&lt;p&gt;Single-pass RAG has a fundamental flaw: it commits to its first retrieval attempt and generates forward regardless. It has no mechanism to check whether the retrieved chunks actually contain the answer. This works for simple factual queries. It breaks on multi-hop questions, ambiguous intent, and analytical queries requiring sequenced lookups.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;An agentic RAG system treats retrieval as a tool available to a reasoning loop. The LLM decides what to retrieve, evaluates what came back, and determines when to stop.&lt;/p&gt;

&lt;p&gt;The key component: a &lt;strong&gt;reflection agent&lt;/strong&gt; sits between retrieval and generation. It evaluates the quality and sufficiency of accumulated context and either terminates the loop or sends it back with a refined query.&lt;/p&gt;

&lt;p&gt;Three patterns in increasing complexity:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Iterative Query Refinement&lt;/strong&gt; — single tool, query rewritten per pass&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Tool Orchestration&lt;/strong&gt; — agent selects between keyword, semantic, hybrid, and filtered search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hierarchical Decomposition&lt;/strong&gt; — planner splits multi-hop queries into dependent sub-queries&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Routing: The Most Important Decision
&lt;/h2&gt;

&lt;p&gt;Sending every query through the agentic path is the most common mistake. Agentic retrieval adds 2-8s latency and 4-12x cost. Simple factual queries (60-75% of typical traffic) get no quality improvement from it.&lt;/p&gt;

&lt;p&gt;Use a hybrid router: deterministic rules first (regex patterns, length heuristics, keyword signals), LLM classification only for ambiguous cases. Use Haiku for routing — it's a classification task, not a reasoning task.&lt;/p&gt;
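
&lt;p&gt;A rules-first router might look like the sketch below. The patterns, keyword signals, and 12-word limit are placeholders rather than values from the article, and &lt;code&gt;classify_with_llm&lt;/code&gt; stands for any callable you supply that returns one of the two labels:&lt;/p&gt;

```python
import re

# Hypothetical rule set; real patterns should come from your own traffic.
SIMPLE_PATTERNS = [re.compile(r"^(what|who|when|where) (is|are|was|were)\b")]
AGENTIC_SIGNALS = ("compare", "versus", "trend", "why did", "step by step")
MAX_SIMPLE_WORDS = 12


def route(query, classify_with_llm=None):
    """Return "single_pass" or "agentic". Deterministic rules run first;
    the LLM classifier is only consulted for queries the rules cannot place."""
    text = query.strip().lower()
    if any(signal in text for signal in AGENTIC_SIGNALS):
        return "agentic"
    short_enough = MAX_SIMPLE_WORDS >= len(text.split())
    if short_enough and any(p.search(text) for p in SIMPLE_PATTERNS):
        return "single_pass"
    if classify_with_llm is not None:
        return classify_with_llm(query)
    return "single_pass"  # conservative default when no classifier is wired in


print(route("What is the refund window?"))
print(route("Compare Q1 and Q2 churn"))
```

&lt;p&gt;Because the rules run before any model call, the majority of simple queries are routed with no added latency or cost.&lt;/p&gt;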




&lt;h2&gt;
  
  
  Reflection Agent: Deciding When to Stop
&lt;/h2&gt;

&lt;p&gt;The reflection agent's judgment quality determines the entire system's utility. Calibrate it against real queries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Iteration 1:&lt;/strong&gt; 65-75% of queries should terminate (simple queries succeeding on first pass)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iteration 2:&lt;/strong&gt; 15-20% (needed one refinement)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iteration 3:&lt;/strong&gt; 5-10% (multi-hop or genuinely ambiguous)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iteration 4+:&lt;/strong&gt; &amp;lt;5% (forced termination — investigate these)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If significant traffic hits max iterations, either routing is broken or your corpus has coverage gaps.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure Isolation and Loop Bounding
&lt;/h2&gt;

&lt;p&gt;Without explicit bounding, misbehaving loops drive latency and cost to unacceptable levels. Non-negotiable limits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;max_iterations: 4&lt;/strong&gt; — never exceed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;timeout: 12s&lt;/strong&gt; — wall-clock for entire loop&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;min_new_chunks_per_iteration: 1&lt;/strong&gt; — if retrieval returns nothing new, break immediately&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;context token budget&lt;/strong&gt; — stop accepting chunks beyond the budget&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On timeout or max iterations: generate with accumulated context + caveat, never return a 500 error.&lt;/p&gt;
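
&lt;p&gt;The limits above can be enforced in one small loop. This sketch leaves out the reflection agent and assumes two caller-supplied functions: &lt;code&gt;retrieve(query)&lt;/code&gt;, returning (chunk_id, text, tokens) tuples, and &lt;code&gt;refine(query, context)&lt;/code&gt;, returning the next query. The 6000-token budget is an illustrative default, and the timeout is only checked between iterations:&lt;/p&gt;

```python
import time


def bounded_retrieval_loop(retrieve, refine, query,
                           max_iterations=4, timeout_s=12.0, token_budget=6000):
    """Run retrieval passes until a hard limit is hit; never raises on a forced stop."""
    start = time.monotonic()
    seen, context, used = set(), [], 0
    stop_reason = "max_iterations"
    for _ in range(max_iterations):
        if time.monotonic() - start > timeout_s:
            stop_reason = "timeout"
            break
        new = [c for c in retrieve(query) if c[0] not in seen]
        if not new:  # min_new_chunks_per_iteration: 1
            stop_reason = "no_new_chunks"
            break
        for chunk_id, text, tokens in new:
            seen.add(chunk_id)
            if used + tokens > token_budget:
                continue  # stop accepting chunks beyond the budget
            context.append(text)
            used += tokens
        query = refine(query, context)
    # The caller generates from `context` and adds a caveat for forced stops.
    return context, stop_reason


ctx, reason = bounded_retrieval_loop(
    lambda q: [("c1", "alpha", 10)], lambda q, c: q, "example query"
)
print(ctx, reason)
```

&lt;p&gt;Returning the stop reason alongside the context lets the generation step decide whether to attach a caveat, and gives telemetry a field to alert on.&lt;/p&gt;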




&lt;h2&gt;
  
  
  Cost Reality
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Single-pass RAG:     ~$0.003/request
Agentic (2 iter):    ~$0.006/request (2x)
Agentic (4 iter):    ~$0.010/request (3-4x)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If 25% of traffic goes agentic at 2.5x cost → roughly 37% total increase (acceptable). If 75% goes agentic → costs roughly double, about 2.1x (likely unacceptable). The router controls your bill.&lt;/p&gt;
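
&lt;p&gt;The blended figure can be checked with a few lines of arithmetic. This sketch assumes every agentic request costs a flat multiple of a single-pass request (the 2.5x used above) and that all other requests cost the baseline; &lt;code&gt;blended_cost_multiplier&lt;/code&gt; is a name introduced here for illustration:&lt;/p&gt;

```python
def blended_cost_multiplier(agentic_share, agentic_multiplier=2.5):
    """Average per-request cost relative to an all-single-pass baseline."""
    return (1 - agentic_share) * 1.0 + agentic_share * agentic_multiplier


for share in (0.25, 0.75):
    print(f"{share:.0%} agentic: {blended_cost_multiplier(share):.3f}x baseline")
```

&lt;p&gt;Plugging in your own measured multiplier and routing share turns the router threshold into a cost decision you can reason about before deploying.&lt;/p&gt;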




&lt;h2&gt;
  
  
  The Key Insight
&lt;/h2&gt;

&lt;p&gt;An agentic system with no observability is not an improvement over single-pass — it's a more expensive pipeline that's harder to debug. The loop delivers quality improvement only when it is instrumented, bounded, and its behavior is understood at the query level.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Agency without accountability is just unpredictability.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Read the Full Article
&lt;/h2&gt;

&lt;p&gt;This is a summary of my deep dive into agentic RAG architecture. The full article covers the complete system with production implementations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;👉 &lt;a href="https://aloknecessary.github.io/blogs/designing-self-correcting-retrieval-loops-for-production/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication&amp;amp;utm_content=agentic-rag-self-correcting-retrieval" rel="noopener noreferrer"&gt;Designing Self-Correcting Retrieval Loops for Production — Full Article&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full article includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full agentic RAG architecture diagram (router → planner → loop → generation)&lt;/li&gt;
&lt;li&gt;Query planner implementation with multi-hop decomposition (Python/Anthropic)&lt;/li&gt;
&lt;li&gt;Iterative retrieval loop with async timeout and dedup&lt;/li&gt;
&lt;li&gt;Reflection agent prompt and calibration patterns&lt;/li&gt;
&lt;li&gt;Multi-tool orchestration with Claude tool-use API&lt;/li&gt;
&lt;li&gt;Hybrid router (rules-first + LLM fallback)&lt;/li&gt;
&lt;li&gt;Loop bounding with five hard limits&lt;/li&gt;
&lt;li&gt;Graceful degradation with context caveats&lt;/li&gt;
&lt;li&gt;Per-request cost model (single-pass vs 2-iter vs 4-iter)&lt;/li&gt;
&lt;li&gt;Latency budget breakdown and streaming response pattern&lt;/li&gt;
&lt;li&gt;Structured loop telemetry with structlog&lt;/li&gt;
&lt;li&gt;Alerting metrics for agentic systems&lt;/li&gt;
&lt;li&gt;Production deployment checklist&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>LLM Evaluation in Production: Building the Eval Pipeline That Runs on Every Deploy</title>
      <dc:creator>Alok Ranjan Daftuar</dc:creator>
      <pubDate>Wed, 17 Jun 2026 13:22:36 +0000</pubDate>
      <link>https://dev.to/aloknecessary/llm-evaluation-in-production-building-the-eval-pipeline-that-runs-on-every-deploy-5eki</link>
      <guid>https://dev.to/aloknecessary/llm-evaluation-in-production-building-the-eval-pipeline-that-runs-on-every-deploy-5eki</guid>
      <description>&lt;p&gt;Everyone ships the RAG system. Almost nobody ships the eval system that tells them when the RAG system starts lying.&lt;/p&gt;

&lt;p&gt;You updated the embedding model. Tweaked the system prompt. Swapped the re-ranker. Metrics look fine. Three weeks later, support tickets arrive — the system is drawing inferences the source documents never made. No alarm fired. No test failed. The system drifted silently.&lt;/p&gt;

&lt;p&gt;This is not a model quality problem. It is an evaluation infrastructure problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Four Metrics That Matter
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Faithfulness&lt;/strong&gt; — of the claims in the response, what fraction are directly supported by the retrieved context? Your primary hallucination guard. Does not require ground truth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer Relevance&lt;/strong&gt; — how directly does the response address the user's question? Catches the "technically correct but useless" failure mode. Does not require ground truth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context Precision&lt;/strong&gt; — of the retrieved chunks, what fraction were actually relevant? Requires ground truth. Belongs in offline CI eval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer Correctness&lt;/strong&gt; — how factually accurate vs the reference answer? Most expensive, requires curated ground truth. Pre-deploy regression suite only.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational rule:&lt;/strong&gt; Faithfulness and Answer Relevance run on every deploy and on sampled production traffic. Context Precision and Answer Correctness run in CI against the golden dataset.&lt;/p&gt;




&lt;h2&gt;
  
  
  LLM-as-Judge: The Pattern and Pitfalls
&lt;/h2&gt;

&lt;p&gt;RAGAS uses an LLM to evaluate LLM output — the only practical way to evaluate semantic quality at scale.&lt;/p&gt;

&lt;p&gt;Pitfalls to manage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Positional bias&lt;/strong&gt; — randomize order in pairwise comparisons&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verbosity bias&lt;/strong&gt; — judge rates longer answers higher even when less accurate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-preference&lt;/strong&gt; — use a different model family as judge than the one generating answers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calibration drift&lt;/strong&gt; — pin judge model to a specific version; treat upgrades as baseline resets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Calibrate against human labels using Cohen's Kappa on 50-100 examples. A kappa below 0.4 means your judge prompt needs revision.&lt;/p&gt;
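
&lt;p&gt;Cohen's Kappa is simple enough to compute without a library. The sketch below uses the standard formula, (observed agreement minus chance agreement) divided by (1 minus chance agreement), on two hypothetical label lists:&lt;/p&gt;

```python
from collections import Counter


def cohens_kappa(human_labels, judge_labels):
    """Agreement between two raters over the same items, corrected for chance."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human_labels)
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    human_counts, judge_counts = Counter(human_labels), Counter(judge_labels)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_counts[label] * judge_counts[label]
                   for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(human, judge)
print(round(kappa, 3))
```

&lt;p&gt;Raw agreement here is 75%, but kappa is lower because half of that agreement would be expected by chance. That gap is the reason to calibrate on kappa rather than on accuracy.&lt;/p&gt;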




&lt;h2&gt;
  
  
  CI/CD Integration
&lt;/h2&gt;

&lt;p&gt;The eval pipeline triggers on every PR touching RAG code, prompts, or model configuration:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run RAG pipeline against golden dataset (100+ curated questions)&lt;/li&gt;
&lt;li&gt;Score with RAGAS (faithfulness, relevance, precision, correctness)&lt;/li&gt;
&lt;li&gt;Compare against baseline — block deploy if regression exceeds threshold&lt;/li&gt;
&lt;li&gt;Post results as PR comment with per-metric scores and pass/fail status&lt;/li&gt;
&lt;/ol&gt;
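
&lt;p&gt;Step 3, the baseline comparison, could be sketched as follows. The 0.03 maximum drop and the optional per-metric floors are illustrative thresholds, not values from the article, and &lt;code&gt;regression_gate&lt;/code&gt; is a hypothetical helper:&lt;/p&gt;

```python
def regression_gate(baseline, current, max_drop=0.03, floors=None):
    """Return (passed, failures) for per-metric scores in the 0..1 range."""
    floors = floors or {}
    failures = []
    for metric, base_score in baseline.items():
        score = current.get(metric)
        if score is None:
            failures.append(f"{metric}: missing from current run")
        elif base_score - score > max_drop:
            failures.append(f"{metric}: dropped {base_score - score:.3f} from baseline")
        elif floors.get(metric, 0.0) > score:
            failures.append(f"{metric}: {score:.3f} is below the absolute floor")
    return (not failures, failures)


baseline = {"faithfulness": 0.90, "answer_relevance": 0.85}
current = {"faithfulness": 0.84, "answer_relevance": 0.86}
passed, failures = regression_gate(baseline, current)
print(passed, failures)
```

&lt;p&gt;A CI job can exit non-zero when &lt;code&gt;passed&lt;/code&gt; is false and paste &lt;code&gt;failures&lt;/code&gt; into the PR comment.&lt;/p&gt;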

&lt;p&gt;Cost: ~$0.50-$2.00 per full eval run at Claude Sonnet pricing. On PRs, run only faithfulness + relevance (cheapest). Full suite runs nightly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Production Sampling
&lt;/h2&gt;

&lt;p&gt;CI catches regressions from code changes. Production sampling catches drift from corpus staleness, query distribution shift, and model behavior changes.&lt;/p&gt;

&lt;p&gt;Sample 5% of live traffic for async evaluation. Never evaluate synchronously — judge calls add 2-5s per request. Track 7-day rolling faithfulness and answer relevance. Alert when either drops by more than 0.05 from the monthly baseline.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Key Insight
&lt;/h2&gt;

&lt;p&gt;LLM systems do not have stable, deterministic behavior. They drift through corpus changes, model updates, prompt evolution, and query distribution shift. Evaluation is not a checkpoint — it is continuous infrastructure.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Build the eval system before you need it. By the time you need it, it is already too late — you will be debugging a production quality regression with no historical baseline and no automated detection.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Read the Full Article
&lt;/h2&gt;

&lt;p&gt;This is a summary of my deep dive into LLM evaluation infrastructure. The full article covers the complete eval stack with implementation examples:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;👉 &lt;a href="https://aloknecessary.github.io/blogs/llm-evaluation-in-production/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication&amp;amp;utm_content=llm-evaluation-in-production" rel="noopener noreferrer"&gt;LLM Evaluation in Production — Full Article&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full article includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Evaluation stack architecture (retrieval layer vs generation layer)&lt;/li&gt;
&lt;li&gt;Four metrics with RAGAS Python implementations&lt;/li&gt;
&lt;li&gt;LLM-as-Judge faithfulness prompt with claim-level scoring&lt;/li&gt;
&lt;li&gt;Judge calibration against human labels (Cohen's Kappa)&lt;/li&gt;
&lt;li&gt;RAGAS configuration with Claude as judge model&lt;/li&gt;
&lt;li&gt;Regression threshold framework (absolute + delta from baseline)&lt;/li&gt;
&lt;li&gt;Golden dataset generation, versioning, and holdout partitions&lt;/li&gt;
&lt;li&gt;Full GitHub Actions eval pipeline (YAML + runner scripts)&lt;/li&gt;
&lt;li&gt;Production sampling with async eval queue worker&lt;/li&gt;
&lt;li&gt;Eval observability dashboard schema (PostgreSQL)&lt;/li&gt;
&lt;li&gt;Eight failure modes in eval systems and mitigations&lt;/li&gt;
&lt;li&gt;Production deployment checklist&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>devops</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Building Reliable RAG Pipelines: From Prototype to Production</title>
      <dc:creator>Alok Ranjan Daftuar</dc:creator>
      <pubDate>Wed, 10 Jun 2026 06:20:00 +0000</pubDate>
      <link>https://dev.to/aloknecessary/building-reliable-rag-pipelines-from-prototype-to-production-2mcp</link>
      <guid>https://dev.to/aloknecessary/building-reliable-rag-pipelines-from-prototype-to-production-2mcp</guid>
      <description>&lt;p&gt;Most teams get RAG working in a notebook over a weekend. Very few get it working reliably in production. The gap is not model quality — it is engineering discipline.&lt;/p&gt;

&lt;p&gt;The RAG prototype is fifty lines of Python. It works. Then production happens — users ask unexpected questions, retrieval degrades as the corpus grows, and the model confidently synthesizes wrong answers from bad context. Nobody knows, because there is no instrumentation to catch it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chunking: The Foundation
&lt;/h2&gt;

&lt;p&gt;A poor chunking strategy cannot be compensated for downstream. If relevant information is split across chunks, or diluted inside a chunk that is too large, no retrieval algorithm will recover it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hierarchical chunking&lt;/strong&gt; is the production-grade pattern: maintain parent chunks (full sections) and child chunks (sentences/short paragraphs). Retrieve at child granularity for precision. Return parent text as LLM context for completeness.&lt;/p&gt;

&lt;p&gt;Every chunk must carry metadata — source document ID, version, content hash, embedding model version. &lt;code&gt;content_hash&lt;/code&gt; tells you when a chunk needs re-embedding because the source changed.&lt;/p&gt;
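
&lt;p&gt;A sketch of how &lt;code&gt;content_hash&lt;/code&gt; can drive re-embedding decisions. The metadata keys and the whitespace normalization are assumptions for this example; SHA-256 from the standard library is one reasonable choice of hash:&lt;/p&gt;

```python
import hashlib


def content_hash(text: str) -> str:
    # Collapse whitespace so formatting-only changes do not trigger re-embedding.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def needs_reembedding(chunk_meta: dict, current_text: str, current_model: str) -> bool:
    """True when the source text changed or the embedding model was upgraded."""
    return (chunk_meta["content_hash"] != content_hash(current_text)
            or chunk_meta["embedding_model"] != current_model)


meta = {
    "source_doc_id": "policy-042",
    "content_hash": content_hash("Refunds  within 30 days."),
    "embedding_model": "embed-v1",
}
print(needs_reembedding(meta, "Refunds within 30 days.", "embed-v1"))
print(needs_reembedding(meta, "Refunds within 14 days.", "embed-v1"))
```

&lt;p&gt;Storing the embedding model version next to the hash covers the second trigger: a model upgrade invalidates every stored vector even when the text is unchanged.&lt;/p&gt;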




&lt;h2&gt;
  
  
  Retrieval: Hybrid Is the Default
&lt;/h2&gt;

&lt;p&gt;Neither BM25 nor vector search alone is sufficient. Hybrid retrieval with Reciprocal Rank Fusion (RRF) is the baseline for production RAG.&lt;/p&gt;

&lt;p&gt;The pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Dense retrieval&lt;/strong&gt; (vector similarity) + &lt;strong&gt;Sparse retrieval&lt;/strong&gt; (BM25 keywords) in parallel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RRF merge&lt;/strong&gt; — rank-based fusion without score normalization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-encoder re-ranker&lt;/strong&gt; — precision pass on top candidates&lt;/li&gt;
&lt;/ol&gt;
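
&lt;p&gt;The RRF merge in step 2 is short enough to show in full. This is a generic sketch of the standard formula (each document scores the sum of 1 / (k + rank) across retrievers, with k = 60 as the commonly used constant), not the Qdrant-based implementation from the full article:&lt;/p&gt;

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked doc-id lists (best first), one per retriever."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense = ["d1", "d2", "d3"]
sparse = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

&lt;p&gt;Only ranks are used, so cosine similarities and BM25 scores never need to be put on a common scale. Documents found by both retrievers rise above documents found by only one.&lt;/p&gt;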

&lt;p&gt;Skipping the re-ranker is the most common mistake. Initial retrieval optimizes for recall. The re-ranker optimizes for precision — critical when your context window only fits top-5 chunks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Context Assembly: Where Pipelines Quietly Break
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Token budget management&lt;/strong&gt; — enforce a hard ceiling; never assume the chunks will fit&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deduplication&lt;/strong&gt; — hierarchical chunking and hybrid retrieval can surface the same content via multiple paths&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source attribution&lt;/strong&gt; — every chunk in context must carry its source ID for citation&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The "I Don't Know" Instruction Is Not Optional
&lt;/h2&gt;

&lt;p&gt;Without explicit grounding instructions, LLMs fill context gaps with plausible hallucinations. Your system prompt must instruct the model to acknowledge when context is insufficient — and to cite sources for every factual claim.&lt;/p&gt;




&lt;h2&gt;
  
  
  Evaluate Retrieval Independently
&lt;/h2&gt;

&lt;p&gt;The most common RAG debugging mistake: assuming a bad answer is a generation failure. Most bad RAG answers are &lt;strong&gt;retrieval failures&lt;/strong&gt; — the right chunk was not in the context.&lt;/p&gt;

&lt;p&gt;Measure &lt;strong&gt;Recall@K&lt;/strong&gt; and &lt;strong&gt;MRR&lt;/strong&gt; against a ground truth dataset of 50-100 queries. Fix retrieval before you blame the model.&lt;/p&gt;
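
&lt;p&gt;Both metrics are easy to compute once you have labeled queries. In this sketch each query contributes a list of retrieved ids (best first) and a set of ids a human marked relevant; the function names and sample ids are made up for the example:&lt;/p&gt;

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]).intersection(relevant)) / len(relevant)


def mean_reciprocal_rank(runs):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per query."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(runs) if runs else 0.0


runs = [(["x", "a"], {"a"}), (["b"], {"b"}), (["q"], {"z"})]
print(recall_at_k(["a", "b", "c"], {"a", "c", "z"}, k=2))
print(mean_reciprocal_rank(runs))
```

&lt;p&gt;Recall@K tells you whether the right chunk reached the context at all; MRR tells you how close to the top it landed. A low Recall@K means no prompt change will fix the answer.&lt;/p&gt;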




&lt;h2&gt;
  
  
  Production Observability
&lt;/h2&gt;

&lt;p&gt;A RAG pipeline without observability is a black box that silently degrades. Key signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"I don't know" rate&lt;/strong&gt; — drops below 80% signals retrieval degradation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunks dropped rate&lt;/strong&gt; — rising means context window pressure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval latency p99&lt;/strong&gt; — vector index performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Corpus staleness&lt;/strong&gt; — content hash mismatches between source docs and stored chunks&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Read the Full Article
&lt;/h2&gt;

&lt;p&gt;This is a summary of my deep dive into production RAG engineering. The full article covers every pipeline component with implementation examples:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;👉 &lt;a href="https://aloknecessary.github.io/blogs/rag_prototype_to_production/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication&amp;amp;utm_content=rag-prototype-to-production" rel="noopener noreferrer"&gt;Building Reliable RAG Pipelines — Full Article&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full article includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full pipeline architecture diagram (9 stages)&lt;/li&gt;
&lt;li&gt;Three chunking approaches with Python implementations (fixed, semantic, hierarchical)&lt;/li&gt;
&lt;li&gt;Hybrid retrieval with RRF implementation (Qdrant)&lt;/li&gt;
&lt;li&gt;Cross-encoder re-ranking (self-hosted and Cohere API)&lt;/li&gt;
&lt;li&gt;Context assembly with token budget management and deduplication&lt;/li&gt;
&lt;li&gt;Prompt construction with grounding and guardrails&lt;/li&gt;
&lt;li&gt;Retrieval evaluation framework (Recall@K, MRR, context relevance)&lt;/li&gt;
&lt;li&gt;Per-request tracing schema and aggregate alerting metrics&lt;/li&gt;
&lt;li&gt;Corpus staleness detection implementation&lt;/li&gt;
&lt;li&gt;Graceful degradation with BM25 fallback&lt;/li&gt;
&lt;li&gt;Production deployment checklist&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>Event-Driven Architecture: The Dual Write Problem and How to Solve It</title>
      <dc:creator>Alok Ranjan Daftuar</dc:creator>
      <pubDate>Thu, 04 Jun 2026 06:32:28 +0000</pubDate>
      <link>https://dev.to/aloknecessary/event-driven-architecture-the-dual-write-problem-and-how-to-solve-it-5266</link>
      <guid>https://dev.to/aloknecessary/event-driven-architecture-the-dual-write-problem-and-how-to-solve-it-5266</guid>
      <description>&lt;p&gt;You have a well-designed order service. It writes to the database and publishes an event to Kafka. Clean, decoupled, event-driven. Then Kafka has a brief network hiccup. The database write succeeds. The event publish fails. The order exists. Fulfillment never hears about it. No alert fires. Just a quietly broken order going nowhere.&lt;/p&gt;

&lt;p&gt;This is the dual write problem — an &lt;strong&gt;architectural correctness problem&lt;/strong&gt; that exists the moment you write to two separate systems without a coordination mechanism.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;A dual write occurs when your application writes to two separate systems as part of a single logical operation without atomicity across both. The dangerous failure modes are silent — the HTTP response returns 200, the client gets a success, and nothing downstream happens.&lt;/p&gt;

&lt;p&gt;The naive fixes don't work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Try/catch with retry&lt;/strong&gt; — introduces duplicate events; consumers must be idempotent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Publish first, then write DB&lt;/strong&gt; — just reverses which failure mode you're exposed to&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed transactions (2PC)&lt;/strong&gt; — sacrifices availability and introduces distributed locking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real solution: &lt;strong&gt;reduce to a single atomic write and derive the event from it&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Solution 1: Transactional Outbox Pattern
&lt;/h2&gt;

&lt;p&gt;Write the event as a row in an &lt;code&gt;outbox&lt;/code&gt; table in the same database transaction as your business data. A separate relay process reads from the outbox and publishes to the broker.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Both writes succeed or fail together (single DB transaction)&lt;/li&gt;
&lt;li&gt;Relay publishes and marks messages as published&lt;/li&gt;
&lt;li&gt;Guarantees at-least-once delivery — consumers must be idempotent&lt;/li&gt;
&lt;/ul&gt;
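
&lt;p&gt;The pattern fits in a short, runnable sketch. It uses Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; with an in-memory database purely for illustration (the full article's implementation is .NET with EF Core); the table layout and function names are assumptions, and &lt;code&gt;publish&lt;/code&gt; stands in for a real broker client:&lt;/p&gt;

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL);
    CREATE TABLE outbox (
        event_id TEXT PRIMARY KEY,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def place_order(order_id, total_cents):
    # One transaction: the order row and its event commit or roll back together.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total_cents))
        conn.execute(
            "INSERT INTO outbox (event_id, event_type, payload) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), "OrderPlaced", json.dumps({"order_id": order_id})),
        )


def relay_once(publish):
    # Publish first, mark second: a crash in between causes a re-send,
    # which is why delivery is at-least-once.
    rows = conn.execute(
        "SELECT event_id, event_type, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for event_id, event_type, payload in rows:
        publish(event_type, payload)
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE event_id = ?", (event_id,))
    return len(rows)


place_order("o-1", 4999)
sent = []
print(relay_once(lambda event_type, payload: sent.append((event_type, payload))))
```

&lt;p&gt;If the broker is down, &lt;code&gt;publish&lt;/code&gt; raises, the row stays unpublished, and the next relay pass picks it up again. The order is never lost; at worst the event is delivered late or twice.&lt;/p&gt;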

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; greenfield services, full control over event schema, teams wanting simplicity.&lt;/p&gt;




&lt;h2&gt;
  
  
  Solution 2: Change Data Capture (Debezium)
&lt;/h2&gt;

&lt;p&gt;Read directly from the database's transaction log (WAL/binlog). Every committed write is captured and streamed to Kafka automatically. No application code changes required.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sub-second publish latency (WAL-based, no polling)&lt;/li&gt;
&lt;li&gt;Captures all state changes, including those made by DB migrations and admin tools&lt;/li&gt;
&lt;li&gt;Requires infrastructure for Kafka Connect + Debezium&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; legacy systems, high-throughput services, capturing all state changes without code modification.&lt;/p&gt;




&lt;h2&gt;
  
  
  Solution 3: Event Sourcing
&lt;/h2&gt;

&lt;p&gt;The event log is the source of truth. The database is a derived projection. There is no dual write because there is only one write — appending events to the event store.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eliminates the problem entirely&lt;/li&gt;
&lt;li&gt;Introduces significant complexity (schema versioning, aggregate rehydration, eventual consistency)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; domains where history of state changes matters (financial systems, audit-heavy domains).&lt;/p&gt;




&lt;h2&gt;
  
  
  Operational Non-Negotiables
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Consumer idempotency&lt;/strong&gt; — at-least-once delivery means duplicates will arrive. Deduplicate on event ID.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outbox housekeeping&lt;/strong&gt; — purge published messages; don't let the table grow unbounded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replication slot monitoring&lt;/strong&gt; — for CDC, a stuck connector causes WAL accumulation and disk exhaustion.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Read the Full Article
&lt;/h2&gt;

&lt;p&gt;This is a summary of my deep dive into the dual write problem. The full article covers all three solutions with production implementation examples:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;👉 &lt;a href="https://aloknecessary.github.io/blogs/dual-write-problem-in-event-driven-architecture/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication&amp;amp;utm_content=dual-write-problem" rel="noopener noreferrer"&gt;The Dual Write Problem and How to Solve It — Full Article&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full article includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Four failure scenarios with a dual write matrix&lt;/li&gt;
&lt;li&gt;Transactional Outbox Pattern implementation (.NET with EF Core)&lt;/li&gt;
&lt;li&gt;Polling relay vs log-tailing relay comparison&lt;/li&gt;
&lt;li&gt;Debezium PostgreSQL connector configuration&lt;/li&gt;
&lt;li&gt;Event Sourcing with aggregate pattern (C#)&lt;/li&gt;
&lt;li&gt;Decision matrix for choosing between the three solutions&lt;/li&gt;
&lt;li&gt;Operational concerns: housekeeping, replication slot monitoring, consumer idempotency&lt;/li&gt;
&lt;li&gt;Production deployment checklist&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>microservices</category>
      <category>architecture</category>
      <category>eventdriven</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>AI-Assisted Data Reconciliation at Scale: Patterns for Distributed Systems</title>
      <dc:creator>Alok Ranjan Daftuar</dc:creator>
      <pubDate>Mon, 01 Jun 2026 08:41:05 +0000</pubDate>
      <link>https://dev.to/aloknecessary/ai-assisted-data-reconciliation-at-scale-patterns-for-distributed-systems-31l9</link>
      <guid>https://dev.to/aloknecessary/ai-assisted-data-reconciliation-at-scale-patterns-for-distributed-systems-31l9</guid>
      <description>&lt;p&gt;In any sufficiently large distributed system, data reconciliation is the dark matter of engineering — invisible, pervasive, and holding everything together through mechanisms nobody fully understands.&lt;/p&gt;

&lt;p&gt;Rule-based reconciliation works until it doesn't. Rule engines break on ambiguity, cannot handle semantic equivalence across schema versions, and generate false positives at scale that overwhelm operations teams. AI — specifically embedding-based similarity and LLM classification — fills the gap. Not as a replacement, but as a layer that handles what rules cannot.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Traditional Reconciliation Breaks Down
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Eventual consistency windows&lt;/strong&gt; — a naive reconciliation job that diffs at a point in time generates thousands of false positives that are self-healing within seconds. The rule engine cannot distinguish transient inconsistency from legitimate divergence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-service schema drift&lt;/strong&gt; — Service A stores an address as &lt;code&gt;{ street, city, state, zip }&lt;/code&gt;. Service B stores it as &lt;code&gt;{ addressLine1, municipality, postalCode }&lt;/code&gt;. Semantically equivalent. A field-level comparator flags every record as mismatched.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic equivalence in free-text&lt;/strong&gt; — &lt;code&gt;"Acme Corporation"&lt;/code&gt; vs &lt;code&gt;"ACME Corp."&lt;/code&gt; vs &lt;code&gt;"Acme Corp (formerly Roadrunner Supplies)"&lt;/code&gt;. Rule-based systems cannot reason about semantic identity at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Volume-driven false positive fatigue&lt;/strong&gt; — at millions of records per day, even 0.1% false positives generate thousands of alerts. Real issues get buried. The reconciliation system becomes theater.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: Rules First, AI at the Boundary
&lt;/h2&gt;

&lt;p&gt;The pattern is not AI-first. It is &lt;strong&gt;rules-first, AI at the boundary&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic mismatch detection&lt;/strong&gt; — checksums, field comparisons, primary key matching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-confidence matches/mismatches&lt;/strong&gt; — auto-resolve or route to correction (no AI needed)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ambiguous cases&lt;/strong&gt; → AI classification layer:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embedding similarity&lt;/strong&gt; — detect semantic equivalence across schema variations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM classification&lt;/strong&gt; — reason about &lt;em&gt;why&lt;/em&gt; a mismatch exists&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
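&lt;p&gt;The deterministic first stage might look like the sketch below. The rule that any differing numeric field counts as a hard mismatch is an assumption chosen for illustration, and &lt;code&gt;checksum&lt;/code&gt; and &lt;code&gt;deterministic_stage&lt;/code&gt; are hypothetical names:&lt;/p&gt;

```python
import hashlib
import json


def checksum(record: dict) -> str:
    """Stable hash of a record; key order does not affect the result."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deterministic_stage(a: dict, b: dict) -> str:
    """Return 'match', 'mismatch', or 'ambiguous' using rules only."""
    if checksum(a) == checksum(b):
        return "match"
    for key in set(a).intersection(b):
        # Assumed rule: numeric disagreements are never a semantic question.
        if isinstance(a[key], (int, float)) and a[key] != b[key]:
            return "mismatch"
    return "ambiguous"   # only these pairs reach the AI classification layer
```

&lt;p&gt;Only pairs that come back as &lt;code&gt;ambiguous&lt;/code&gt; would be handed to the embedding and LLM steps.&lt;/p&gt;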




&lt;h2&gt;
  
  
  Embedding-Based Similarity
&lt;/h2&gt;

&lt;p&gt;Serialize records into schema-agnostic text representations before embedding. Compute cosine similarity. Calibrate thresholds against labeled data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;≥ 0.95&lt;/strong&gt; → auto-resolve as equivalent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0.80 – 0.95&lt;/strong&gt; → route to LLM classification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt; 0.80&lt;/strong&gt; → high-confidence mismatch, route to correction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The thresholds are not universal — calibrate against 500–1000 manually classified record pairs from your actual data.&lt;/p&gt;
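&lt;p&gt;A small sketch of the similarity and routing step. Plain Python lists stand in for real embedding vectors, and the default thresholds are the ones listed above, to be replaced by your calibrated values; &lt;code&gt;route&lt;/code&gt; and its return labels are hypothetical:&lt;/p&gt;

```python
import math


def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def route(similarity, auto_resolve=0.95, ambiguous_floor=0.80):
    """Map a similarity score to the next step in the pipeline."""
    if similarity >= auto_resolve:
        return "auto_resolve_equivalent"
    if similarity >= ambiguous_floor:
        return "llm_classification"
    return "route_to_correction"


print(route(cosine_similarity([1.0, 2.0], [2.0, 4.0])))
```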




&lt;h2&gt;
  
  
  LLM Classification for the Ambiguous Band
&lt;/h2&gt;

&lt;p&gt;For the 5–15% of mismatches that fall in the ambiguous range, an LLM reasons about context that vector distance cannot capture. Classifications: &lt;code&gt;equivalent&lt;/code&gt;, &lt;code&gt;stale_copy&lt;/code&gt;, &lt;code&gt;legitimate_divergence&lt;/code&gt;, &lt;code&gt;data_corruption&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Cost management: route only the ambiguous band to the LLM. Batch where latency allows. Cache results for record pairs re-evaluated in subsequent cycles.&lt;/p&gt;
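&lt;p&gt;The caching part of that cost strategy could be sketched like this. The LLM call is injected as a plain callable, so no particular provider API is assumed; &lt;code&gt;CachedClassifier&lt;/code&gt; is a hypothetical name:&lt;/p&gt;

```python
import hashlib
import json

LABELS = {"equivalent", "stale_copy", "legitimate_divergence", "data_corruption"}


class CachedClassifier:
    """Wraps an LLM classifier so a repeated record pair costs one call."""

    def __init__(self, classify):
        self._classify = classify   # callable(record_a, record_b) -> label
        self._cache = {}
        self.llm_calls = 0

    def _key(self, a, b):
        blob = json.dumps([a, b], sort_keys=True)
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def classify(self, a, b):
        key = self._key(a, b)
        if key not in self._cache:
            label = self._classify(a, b)
            self.llm_calls += 1
            if label not in LABELS:
                raise ValueError(f"unexpected label: {label}")
            self._cache[key] = label
        return self._cache[key]
```

&lt;p&gt;Rejecting labels outside the fixed set keeps a malformed model response from silently entering the reconciliation results.&lt;/p&gt;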




&lt;h2&gt;
  
  
  Where AI Should Never Be Trusted
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Financial and compliance records&lt;/strong&gt; — dollar amount disagreements are correctness errors, not semantic questions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primary key and identity resolution&lt;/strong&gt; — AI suggestions acceptable; auto-resolution without human sign-off is not&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Any decision that must be explainable to a regulator&lt;/strong&gt; — "87% confidence" is not an audit-compliant explanation&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Key Insight
&lt;/h2&gt;

&lt;p&gt;AI in reconciliation is a &lt;strong&gt;judgment layer&lt;/strong&gt;, not a trust layer. It handles ambiguous cases that rules cannot, reduces volume reaching human review, and provides structured reasoning. The deterministic foundation must remain intact.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A reconciliation system you cannot audit is worse than one that generates false positives. Build the observability before you build the AI.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Read the Full Article
&lt;/h2&gt;

&lt;p&gt;This is a summary of my deep dive into AI-assisted data reconciliation. The full article covers the complete architecture with implementation examples:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;👉 &lt;a href="https://aloknecessary.github.io/blogs/ai_assisted_data_reconciliation/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication&amp;amp;utm_content=ai-data-reconciliation" rel="noopener noreferrer"&gt;AI-Assisted Data Reconciliation at Scale — Full Article&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full article includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where traditional reconciliation breaks down (4 failure modes)&lt;/li&gt;
&lt;li&gt;Full architecture diagram with rules-first, AI-at-boundary pattern&lt;/li&gt;
&lt;li&gt;Embedding-based similarity implementation (Python, OpenAI embeddings)&lt;/li&gt;
&lt;li&gt;LLM classification prompt pattern with structured JSON output&lt;/li&gt;
&lt;li&gt;Observation window pattern for filtering eventual consistency false positives&lt;/li&gt;
&lt;li&gt;Hard boundaries where AI should never auto-resolve&lt;/li&gt;
&lt;li&gt;Observability patterns with structured logging&lt;/li&gt;
&lt;li&gt;Production deployment checklist&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>distributedsystems</category>
      <category>architecture</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Why Lift-and-Shift Fails Quietly: Architectural Smells That Appear After Migration</title>
      <dc:creator>Alok Ranjan Daftuar</dc:creator>
      <pubDate>Fri, 29 May 2026 07:34:23 +0000</pubDate>
      <link>https://dev.to/aloknecessary/why-lift-and-shift-fails-quietly-architectural-smells-that-appear-after-migration-bdj</link>
      <guid>https://dev.to/aloknecessary/why-lift-and-shift-fails-quietly-architectural-smells-that-appear-after-migration-bdj</guid>
      <description>&lt;p&gt;Every cloud migration starts with a promise: &lt;em&gt;"We'll get onto cloud first, optimize later."&lt;/em&gt; That sentence is where the trouble begins.&lt;/p&gt;

&lt;p&gt;Lift-and-shift leaves on-premises assumptions baked into a system operating in a fundamentally different environment. The failure doesn't arrive on day one. It arrives three months later, in a Slack alert at 2am, or in an invoice that made a VP ask uncomfortable questions.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Latency Amplification
&lt;/h2&gt;

&lt;p&gt;On a physical LAN, a service call is sub-millisecond. In a cloud VPC, even same-AZ calls incur 1-3ms. A service making 40 synchronous downstream calls goes from roughly 4ms of network overhead (about 0.1ms per call) to 40-120ms, without any code change.&lt;/p&gt;

&lt;p&gt;Same call graph. Same code. An order of magnitude more network latency, purely from network topology.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; consolidate reads with batch APIs, introduce async messaging for non-critical paths, add caching for hot reference data.&lt;/p&gt;
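&lt;p&gt;The effect of batching on sequential network overhead can be estimated with simple arithmetic. The sketch assumes 2ms per round trip (a midpoint of the 1-3ms range) and ignores payload size and server-side processing time:&lt;/p&gt;

```python
import math


def sequential_overhead_ms(calls, per_call_ms):
    """Network overhead when every call is its own round trip."""
    return calls * per_call_ms


def batched_overhead_ms(calls, batch_size, per_call_ms):
    """Network overhead when calls are grouped into batch requests."""
    round_trips = math.ceil(calls / batch_size)
    return round_trips * per_call_ms


print(sequential_overhead_ms(40, 2.0), batched_overhead_ms(40, 20, 2.0))
```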




&lt;h2&gt;
  
  
  2. Chatty Services
&lt;/h2&gt;

&lt;p&gt;The N+1 problem at infrastructure scale. A service making 60 per-entity HTTP calls to render a dashboard is annoying on LAN. In cloud, it's a 300-600ms tax on every page load.&lt;/p&gt;

&lt;p&gt;Chatty patterns also exhaust connection pools faster — each call traverses the network and holds an open connection during transit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; batch endpoints on all internal APIs, DataLoader pattern, connection pool profiling under realistic concurrency.&lt;/p&gt;
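&lt;p&gt;A simplified, synchronous take on the DataLoader idea: collect the keys first, then resolve them in one batched call. &lt;code&gt;BatchLoader&lt;/code&gt; and &lt;code&gt;fetch_users&lt;/code&gt; are hypothetical, and the real DataLoader pattern usually defers dispatch with promises or an event loop:&lt;/p&gt;

```python
class BatchLoader:
    """Collects keys and resolves them with one batched call (sketch)."""

    def __init__(self, batch_fetch):
        self._batch_fetch = batch_fetch   # callable(list_of_ids) -> dict
        self._pending = []

    def load(self, entity_id):
        if entity_id not in self._pending:
            self._pending.append(entity_id)

    def dispatch(self):
        ids, self._pending = self._pending, []
        return self._batch_fetch(ids) if ids else {}


calls = []


def fetch_users(ids):
    calls.append(list(ids))   # stands in for one network round trip
    return {i: {"id": i, "name": f"user-{i}"} for i in ids}


loader = BatchLoader(fetch_users)
for user_id in range(60):
    loader.load(user_id)
users = loader.dispatch()
print(len(users), len(calls))
```

&lt;p&gt;Sixty entities are resolved with a single call to the batch endpoint instead of sixty per-entity requests.&lt;/p&gt;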




&lt;h2&gt;
  
  
  3. Cost Surprises
&lt;/h2&gt;

&lt;p&gt;The PoC cost $340. The first production month is $8,200. Nobody changed the architecture.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data egress&lt;/strong&gt; — free on-prem, metered in cloud. Cross-AZ, cross-region, and internet egress all bill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Over-provisioning&lt;/strong&gt; — on-prem sizing instincts (buy for 3-5 years) don't translate. Cloud bills for provisioned capacity whether or not it is used.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idle infrastructure&lt;/strong&gt; — dev/staging environments left running 24/7.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  4. Stateful Assumptions
&lt;/h2&gt;

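&lt;p&gt;In-memory session state works with a single server. The moment you auto-scale, some requests land on instances that hold no session for them: a third instance added behind a round-robin load balancer takes roughly 33% of the traffic on its own. Filesystem dependencies break when containers reschedule or pods restart.&lt;/p&gt;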

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; externalize session to Redis. Replace local filesystem writes with object storage at the upload boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. The Observability Void
&lt;/h2&gt;

&lt;p&gt;On-prem monitoring (Nagios, Zabbix) watches hardware metrics that mean nothing in cloud. What you need to observe is different: cold start times, managed service throttling, connection pool utilization, cost-per-request.&lt;/p&gt;

&lt;p&gt;The danger window is immediately after migration when legacy monitoring reports "all green" while user-facing metrics degrade invisibly.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. The Monolith in Microservice Clothing
&lt;/h2&gt;

&lt;p&gt;Containerized and deployed to Kubernetes with separate deployments per service. On the surface: microservices. Underneath: shared database schemas, synchronous HTTP chains, coordinated deployments. A distributed monolith you &lt;em&gt;think&lt;/em&gt; is clean is a production incident waiting to happen.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Realistic Migration Philosophy
&lt;/h2&gt;

&lt;p&gt;Lift-and-shift is not a failure state. It's a phase. The mistake is treating it as a destination. Every migrated workload should have a documented list of known architectural debts, an owner for each, and a timeline to address them — agreed &lt;em&gt;before&lt;/em&gt; the migration.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Moving to cloud does not modernize your architecture. It gives you a new environment in which your existing architectural decisions — good and bad — will be amplified.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Read the Full Article
&lt;/h2&gt;

&lt;p&gt;This is a summary of my deep dive into post-migration architectural smells. The full article covers all six patterns with diagnostics, mitigations, and a pre-migration review checklist:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;👉 &lt;a href="https://aloknecessary.github.io/blogs/lift-and-shift-fails-quietly/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication&amp;amp;utm_content=lift-and-shift-fails-quietly" rel="noopener noreferrer"&gt;Why Lift-and-Shift Fails Quietly — Full Article&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full article includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency amplification with SVG architecture diagram (on-prem vs cloud)&lt;/li&gt;
&lt;li&gt;Chatty services with before/after code examples and connection pool diagnostics&lt;/li&gt;
&lt;li&gt;Cost surprise breakdown with egress pricing tables&lt;/li&gt;
&lt;li&gt;Stateful assumptions with session externalization code (Node.js/Redis)&lt;/li&gt;
&lt;li&gt;Observability void with Prometheus recording rules for post-migration signals&lt;/li&gt;
&lt;li&gt;Distributed monolith diagnostic patterns&lt;/li&gt;
&lt;li&gt;Complete pre-migration architecture review checklist&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cloud</category>
      <category>architecture</category>
      <category>devops</category>
      <category>migration</category>
    </item>
    <item>
      <title>Designing Cloud-Native Systems That Survive Region-Level Failures</title>
      <dc:creator>Alok Ranjan Daftuar</dc:creator>
      <pubDate>Wed, 20 May 2026 10:56:50 +0000</pubDate>
      <link>https://dev.to/aloknecessary/designing-cloud-native-systems-that-survive-region-level-failures-4c3p</link>
      <guid>https://dev.to/aloknecessary/designing-cloud-native-systems-that-survive-region-level-failures-4c3p</guid>
      <description>&lt;p&gt;Most teams design for instance and zone failures but treat region-level outages as someone else's problem. Region-level failures are rare — but they are not theoretical. AWS us-east-1 has had multiple significant incidents. Azure AD suffered a global authentication outage in 2023. Google Cloud's europe-west9 went offline due to a data center fire.&lt;/p&gt;

&lt;p&gt;When a region fails, the blast radius is not one service. It is every workload, every database, every queue, and every control plane operation scoped to that region.&lt;/p&gt;




&lt;h2&gt;
  
  
  Multi-AZ Does Not Protect Against Regional Failures
&lt;/h2&gt;

&lt;p&gt;Multi-AZ protects against data center failures. It does not protect against:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Regional control plane failures&lt;/strong&gt; — the API that manages your resources is regional. If it degrades, you cannot scale or deploy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regional service outages&lt;/strong&gt; — SQS, Lambda, DynamoDB, Cosmos DB are all regional.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared fate dependencies&lt;/strong&gt; — IAM, Secrets Manager, Key Vault are regional. If your app cannot retrieve secrets, it doesn't matter that compute is healthy across three AZs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The December 2021 AWS us-east-1 incident demonstrated this. Services in unaffected AZs experienced degradation because their dependencies were not AZ-independent.&lt;/p&gt;




&lt;h2&gt;
  
  
  Multi-Region Architecture Patterns
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Pilot Light&lt;/strong&gt; — secondary region has minimum infrastructure (DB replicas, networking). Compute provisioned on failover. RTO: 15-60 min. Cost: ~10-15% of primary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Warm Standby&lt;/strong&gt; — secondary runs a scaled-down but fully functional copy. On failover, scale up and promote DB. RTO: 5-15 min. Cost: ~25-40% of primary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Active-Active&lt;/strong&gt; — both regions serve traffic simultaneously. No failover needed. Requires multi-region writes (DynamoDB Global Tables, Cosmos DB) and conflict resolution. RTO: near-zero. Cost: ~80-100%+ of primary.&lt;/p&gt;
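&lt;p&gt;One way to read the three profiles above is as a lookup: pick the cheapest pattern whose worst-case RTO still meets the target. The sketch uses the upper end of each RTO and cost range as a conservative assumption, treats "near-zero" as 0 minutes, and &lt;code&gt;cheapest_pattern&lt;/code&gt; is a hypothetical helper:&lt;/p&gt;

```python
PATTERNS = [
    # (name, worst-case RTO in minutes, approximate cost as fraction of primary)
    ("pilot_light", 60, 0.15),
    ("warm_standby", 15, 0.40),
    ("active_active", 0, 1.00),
]


def cheapest_pattern(rto_target_minutes):
    """Cheapest pattern whose worst-case RTO fits within the target."""
    candidates = [p for p in PATTERNS if rto_target_minutes >= p[1]]
    return min(candidates, key=lambda p: p[2])[0]


for target in (120, 30, 1):
    print(target, cheapest_pattern(target))
```

&lt;p&gt;RPO, data replication constraints, and operational maturity would all narrow the choice further; this only captures the RTO and cost dimension.&lt;/p&gt;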




&lt;h2&gt;
  
  
  Data Replication: The Hardest Problem
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Synchronous&lt;/strong&gt; — zero data loss, but adds 50-150ms to every write. Impractical for most workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Asynchronous&lt;/strong&gt; — no write latency impact, but creates a replication lag window where data can be lost if primary fails.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For active-active with async replication, you need a conflict resolution strategy. Last-writer-wins works for profiles and preferences. It silently drops writes for counters and balances — use application-level merge or CRDTs there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical rule:&lt;/strong&gt; if you cannot define a conflict resolution strategy for a data entity, route its writes to a single primary region.&lt;/p&gt;
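&lt;p&gt;To illustrate why last-writer-wins loses counter updates while a CRDT does not, here is a sketch with a grow-only counter (G-Counter). Region names and the &lt;code&gt;(timestamp, value)&lt;/code&gt; tuple shape are assumptions for the example:&lt;/p&gt;

```python
def lww_merge(a, b):
    """Last-writer-wins on (timestamp, value) pairs."""
    return a if a[0] >= b[0] else b


class GCounter:
    """Grow-only counter CRDT: one slot per region, merged by element-wise max."""

    def __init__(self):
        self.counts = {}

    def increment(self, region, amount=1):
        self.counts[region] = self.counts.get(region, 0) + amount

    def merge(self, other):
        merged = GCounter()
        for region in set(self.counts).union(other.counts):
            merged.counts[region] = max(self.counts.get(region, 0),
                                        other.counts.get(region, 0))
        return merged

    def value(self):
        return sum(self.counts.values())


# Two regions each record one increment of a counter that started at 0.
lww = lww_merge((1, 1), (2, 1))   # both regions wrote "counter = 1"
us, eu = GCounter(), GCounter()
us.increment("us-east-1")
eu.increment("eu-west-1")
print(lww[1], us.merge(eu).value())
```

&lt;p&gt;Last-writer-wins keeps one of the two writes and reports 1; the merged G-Counter reports 2 because each region only ever advances its own slot.&lt;/p&gt;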




&lt;h2&gt;
  
  
  Failover Automation
&lt;/h2&gt;

&lt;p&gt;Manual failover is not failover. Under the stress of a region-level incident, manual steps fail or take far longer than practiced.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DNS-based failover (Route 53, Traffic Manager) with health checks on actual regional functionality — not just process liveness&lt;/li&gt;
&lt;li&gt;Database promotion automated via API (Aurora Global: under 1 minute, RDS cross-region: 5-10 minutes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test quarterly&lt;/strong&gt; — not a tabletop exercise, an actual failover. Measure real RTO. Fix the gaps.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Untested failover&lt;/strong&gt; — an assumption, not a plan&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hidden regional dependencies&lt;/strong&gt; — auth provider or secrets manager pinned to one region&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same deployment pipeline for both regions&lt;/strong&gt; — a bad deploy takes down both simultaneously&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No capacity planning&lt;/strong&gt; — secondary region hits service quotas during scale-up&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Read the Full Article
&lt;/h2&gt;

&lt;p&gt;This is a summary of my deep dive into multi-region resilience. The full article covers all patterns with AWS and Azure architecture sketches, cost analysis, and a decision framework:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;👉 &lt;a href="https://aloknecessary.github.io/blogs/cloud_native_region_failure_architecture/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication&amp;amp;utm_content=cloud-native-region-failure" rel="noopener noreferrer"&gt;Designing Cloud-Native Systems That Survive Region-Level Failures — Full Article&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full article includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-AZ vs multi-region — what each actually protects (and what it doesn't)&lt;/li&gt;
&lt;li&gt;Three patterns with RTO/RPO/cost profiles (Pilot Light, Warm Standby, Active-Active)&lt;/li&gt;
&lt;li&gt;AWS and Azure architecture sketches for active-active&lt;/li&gt;
&lt;li&gt;Data replication deep dive (sync vs async, managed DB options, conflict resolution)&lt;/li&gt;
&lt;li&gt;Failover automation with Route 53 config and health check design&lt;/li&gt;
&lt;li&gt;Cost vs resilience decision framework with workload tiering&lt;/li&gt;
&lt;li&gt;Six common mistakes that break multi-region architectures&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cloud</category>
      <category>architecture</category>
      <category>aws</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Designing for Partial Failure: Why 'Everything is Highly Available' Is a Myth</title>
      <dc:creator>Alok Ranjan Daftuar</dc:creator>
      <pubDate>Mon, 11 May 2026 08:41:55 +0000</pubDate>
      <link>https://dev.to/aloknecessary/designing-for-partial-failure-why-everything-is-highly-available-is-a-myth-25bh</link>
      <guid>https://dev.to/aloknecessary/designing-for-partial-failure-why-everything-is-highly-available-is-a-myth-25bh</guid>
      <description>&lt;p&gt;Your system will fail. The question is whether it fails completely or gracefully — and that answer is decided at design time, not incident time.&lt;/p&gt;

&lt;p&gt;High availability is not a property of your system — it is an emergent behavior of how your system handles the inevitable failure of its parts. A cluster of five-nines components can still produce a zero-nines system if you haven't designed for what happens when one of them degrades.&lt;/p&gt;
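&lt;p&gt;A back-of-the-envelope sketch of how availability compounds. It assumes every component is a hard, serial dependency and that failures are independent, which real systems only approximate:&lt;/p&gt;

```python
import math


def serial_availability(availabilities):
    """Availability of a request path that needs every component to work."""
    return math.prod(availabilities)


healthy = serial_availability([0.99999] * 10)
degraded = serial_availability([0.99999] * 9 + [0.80])
print(round(healthy, 6), round(degraded, 6))
```

&lt;p&gt;Ten five-nines components in series already drop to roughly four nines, and a single dependency degraded to 80% drags the whole path down with it unless the design isolates that dependency.&lt;/p&gt;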




&lt;h2&gt;
  
  
  Partial Failure Is the Normal State
&lt;/h2&gt;

&lt;p&gt;The CAP theorem is clean in theory. In production, it manifests as &lt;strong&gt;partial unavailability&lt;/strong&gt; — some nodes respond, some don't, and your system has to decide what to do about it.&lt;/p&gt;

&lt;p&gt;Real-world partitions are never clean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A replica is reachable but 800ms slower than normal&lt;/li&gt;
&lt;li&gt;A downstream service responds to health checks but times out on actual requests&lt;/li&gt;
&lt;li&gt;A database primary is alive, but replication lag has grown to 45 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Degraded systems are harder to reason about than failed ones. A service that returns errors is visible. A service that returns stale data silently is far more dangerous.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical rule:&lt;/strong&gt; for every external dependency, explicitly answer — &lt;em&gt;"What does my service do when this dependency is unavailable for 30 seconds? For 5 minutes? For 30 minutes?"&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How Cascading Failures Actually Propagate
&lt;/h2&gt;

&lt;p&gt;Cascading failures rarely start big. They start with one slow service and end with everything down.&lt;/p&gt;

&lt;p&gt;The classic thread pool exhaustion cascade:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A payment service starts responding in 4s instead of 200ms&lt;/li&gt;
&lt;li&gt;Threads pile up — connection pool utilization climbs from 20% to 80%&lt;/li&gt;
&lt;li&gt;Latency bleeds upstream — unrelated operations slow down because they share the same thread pool&lt;/li&gt;
&lt;li&gt;Connection pools hit their limit — new requests fail immediately&lt;/li&gt;
&lt;li&gt;Health checks fail — load balancer removes the service from rotation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Your entire checkout is down because a payment service was &lt;em&gt;slow&lt;/em&gt; — not even failed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cascade enablers to eliminate:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Synchronous chains longer than 2–3 hops&lt;/li&gt;
&lt;li&gt;Shared thread pools across dependencies&lt;/li&gt;
&lt;li&gt;Missing or oversized timeouts (a 30s default in a p99 200ms service is a loaded gun)&lt;/li&gt;
&lt;li&gt;Retry storms without backoff&lt;/li&gt;
&lt;li&gt;No bulkhead isolation between operations&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Graceful Degradation Patterns
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Circuit Breaker&lt;/strong&gt; — monitors failure rate and opens the circuit above a threshold, immediately returning a fallback instead of attempting the call. States: Closed → Open → Half-Open.&lt;/p&gt;
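&lt;p&gt;A compact sketch of the three states. As a simplification it counts consecutive failures instead of tracking a failure rate, and the class and parameter names are hypothetical rather than those of Polly or opossum:&lt;/p&gt;

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None
        self.state = "closed"

    def call(self, func, fallback):
        if self.state == "open":
            if self._clock() - self._opened_at >= self.reset_timeout:
                self.state = "half_open"      # allow one trial call through
            else:
                return fallback()             # fail fast, no downstream call
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self.state == "half_open" or self._failures >= self.failure_threshold:
                self.state = "open"
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0
        self.state = "closed"
        return result
```

&lt;p&gt;Injecting the clock keeps the reset timeout testable without sleeping.&lt;/p&gt;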

&lt;p&gt;&lt;strong&gt;Bulkhead Isolation&lt;/strong&gt; — prevents one failing dependency from consuming all shared resources. Isolate thread/connection pools per downstream service. At the Kubernetes level, namespace ResourceQuotas enforce cluster-level isolation.&lt;/p&gt;
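&lt;p&gt;At the application level, a bulkhead can be as small as a semaphore per downstream dependency. In this sketch, &lt;code&gt;Bulkhead&lt;/code&gt; is a hypothetical name and callers that find the compartment full are rejected immediately instead of queueing:&lt;/p&gt;

```python
import threading


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot starve the rest."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, fallback):
        if not self._slots.acquire(blocking=False):
            return fallback()      # compartment full: reject, do not queue
        try:
            return func()
        finally:
            self._slots.release()


payment_bulkhead = Bulkhead(max_concurrent=10)
print(payment_bulkhead.run(lambda: "charged", lambda: "try again later"))
```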

&lt;p&gt;&lt;strong&gt;Timeout Hierarchy&lt;/strong&gt; — every downstream call's timeout must be shorter than the upstream caller's timeout. A 5s payment timeout inside a 20s checkout timeout inside a 30s user request timeout.&lt;/p&gt;
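&lt;p&gt;The hierarchy can be checked statically and enforced at call time by passing the remaining budget down. Both helpers below are hypothetical, and the 0.5s safety margin is an assumed value:&lt;/p&gt;

```python
def validate_hierarchy(timeouts_outer_to_inner):
    """True when every inner timeout is strictly shorter than its caller's."""
    pairs = zip(timeouts_outer_to_inner, timeouts_outer_to_inner[1:])
    return all(outer > inner for outer, inner in pairs)


def child_timeout(parent_remaining_s, configured_s, margin_s=0.5):
    """Timeout for a downstream call: never longer than what the caller has left."""
    budget = parent_remaining_s - margin_s
    if budget > 0:
        return min(configured_s, budget)
    raise TimeoutError("no budget left for the downstream call")


print(validate_hierarchy([30, 20, 5]), child_timeout(3.0, 5.0))
```

&lt;p&gt;With only 3s left in the checkout budget, the payment call is capped at 2.5s even though its configured timeout is 5s.&lt;/p&gt;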

&lt;p&gt;&lt;strong&gt;Fallback Responses&lt;/strong&gt; — return degraded but functional responses rather than errors. Pricing service down? Return last-cached price with a "price may vary" indicator. Feature flags unavailable? Return safe defaults.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retry with Exponential Backoff + Jitter&lt;/strong&gt; — retries without backoff amplify load on degraded services. Jitter prevents synchronized retry storms. Only retry on transient failures (5xx, timeouts) — never on 4xx.&lt;/p&gt;
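&lt;p&gt;A sketch of retry with exponential backoff and full jitter. &lt;code&gt;TransientError&lt;/code&gt; is a hypothetical exception standing in for 5xx responses and timeouts; any other exception, such as one mapped from a 4xx, propagates without a retry:&lt;/p&gt;

```python
import random
import time


class TransientError(Exception):
    """Stands in for a 5xx response or a timeout."""


def backoff_delays(attempts, base=0.1, cap=5.0, rng=random.random):
    """Full jitter: each delay is random, bounded by min(cap, base * 2**attempt)."""
    return [rng() * min(cap, base * (2 ** attempt)) for attempt in range(attempts)]


def retry(func, attempts=4, base=0.1, cap=5.0, sleep=time.sleep):
    delays = backoff_delays(attempts - 1, base, cap)
    for attempt in range(attempts):
        try:
            return func()
        except TransientError:
            if attempt == attempts - 1:
                raise                      # retries exhausted
            sleep(delays[attempt])
```

&lt;p&gt;Because each client draws its own random delay, clients that failed at the same moment do not retry at the same moment.&lt;/p&gt;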




&lt;h2&gt;
  
  
  Observability During Partial Failure
&lt;/h2&gt;

&lt;p&gt;Graceful degradation can mask serious problems for hours. The four signals that matter most:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Circuit breaker state transitions&lt;/strong&gt; — a circuit that opens at 2am with no alert is silent failure accumulating debt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-dependency error rates&lt;/strong&gt; — aggregate 0.5% looks fine; per-dependency 40% on payment tells you exactly what's wrong&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queue depth and consumer lag&lt;/strong&gt; — the leading indicator before errors surface at the API layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fallback invocation rate&lt;/strong&gt; — 0.1% is noise; 15% is a dependency in chronic distress being silently masked&lt;/li&gt;
&lt;/ul&gt;
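&lt;p&gt;The gap between aggregate and per-dependency error rates is easy to reproduce. The call counts below are made up to match the 0.5% and 40% figures above, and &lt;code&gt;error_rates&lt;/code&gt; is a hypothetical helper:&lt;/p&gt;

```python
from collections import Counter


def error_rates(events):
    """events: iterable of (dependency, ok) pairs.

    Returns (aggregate_rate, per_dependency_rates)."""
    total, errors = Counter(), Counter()
    for dependency, ok in events:
        total[dependency] += 1
        if not ok:
            errors[dependency] += 1
    if not total:
        return 0.0, {}
    per_dependency = {d: errors[d] / total[d] for d in total}
    aggregate = sum(errors.values()) / sum(total.values())
    return aggregate, per_dependency


events = ([("catalog", True)] * 9875
          + [("payment", False)] * 50
          + [("payment", True)] * 75)
aggregate, per_dependency = error_rates(events)
print(aggregate, per_dependency["payment"])
```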




&lt;h2&gt;
  
  
  The Key Insight
&lt;/h2&gt;

&lt;p&gt;Most architecture reviews start with the happy path. Flip that. &lt;strong&gt;Design your degraded states first&lt;/strong&gt; — what does this system look like when the payment service is down? When a network partition isolates one AZ?&lt;/p&gt;

&lt;p&gt;If you can answer with specific, tested, observable behaviors — you have a resilient system. If the answer is "it depends on what fails" — you have a system that will surprise you in production.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Partial failure is not an edge case. It is the normal operating condition of any distributed system at scale.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Read the Full Article
&lt;/h2&gt;

&lt;p&gt;This is a summary of my deep dive into designing for partial failure. The full article covers each pattern with production implementation examples, real-world cascade case studies, and a complete resilience checklist:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;👉 &lt;a href="https://aloknecessary.github.io/blogs/designing_for_partial_failure/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication&amp;amp;utm_content=designing-for-partial-failure" rel="noopener noreferrer"&gt;Designing for Partial Failure — Full Article&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full article includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CAP theorem applied to partial unavailability scenarios&lt;/li&gt;
&lt;li&gt;Real-world cascade case studies (AWS us-east-1 2021, Facebook 2021)&lt;/li&gt;
&lt;li&gt;Circuit breaker implementation in .NET (Polly) and Node.js (opossum)&lt;/li&gt;
&lt;li&gt;Bulkhead isolation with Kubernetes ResourceQuotas&lt;/li&gt;
&lt;li&gt;Timeout hierarchy design with named HttpClient factories&lt;/li&gt;
&lt;li&gt;Stale cache fallback implementation with Redis&lt;/li&gt;
&lt;li&gt;Retry with exponential backoff and jitter (TypeScript)&lt;/li&gt;
&lt;li&gt;Structured logging patterns for degraded responses&lt;/li&gt;
&lt;li&gt;Production resilience checklist&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>distributedsystems</category>
      <category>architecture</category>
      <category>systemdesign</category>
      <category>sre</category>
    </item>
    <item>
      <title>The CAP Theorem in Practice: Making the Right Trade-offs at Scale</title>
      <dc:creator>Alok Ranjan Daftuar</dc:creator>
      <pubDate>Mon, 04 May 2026 05:59:51 +0000</pubDate>
      <link>https://dev.to/aloknecessary/the-cap-theorem-in-practice-making-the-right-trade-offs-at-scale-1d5i</link>
      <guid>https://dev.to/aloknecessary/the-cap-theorem-in-practice-making-the-right-trade-offs-at-scale-1d5i</guid>
      <description>&lt;p&gt;Every distributed system you build is already taking a side in the CAP trade-off. The question is whether you made that choice deliberately or discover it during an incident.&lt;/p&gt;

&lt;p&gt;CAP states that a distributed system can guarantee at most two of three properties: &lt;strong&gt;Consistency&lt;/strong&gt;, &lt;strong&gt;Availability&lt;/strong&gt;, and &lt;strong&gt;Partition Tolerance&lt;/strong&gt;. The critical insight most teams miss — P is not optional. Networks fail. Pods crash. AZs go dark. You are choosing between &lt;strong&gt;CP&lt;/strong&gt; and &lt;strong&gt;AP&lt;/strong&gt;. Full stop.&lt;/p&gt;




&lt;h2&gt;
  
  
  CP vs AP: What You're Actually Trading
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;CP systems&lt;/strong&gt; (etcd, ZooKeeper, CockroachDB) refuse to serve requests during a partition rather than return stale data. Leader-based consensus ensures correctness. Choose CP for financial ledgers, inventory reservation, distributed locks — any domain where stale reads are more dangerous than errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AP systems&lt;/strong&gt; (Cassandra, DynamoDB, DNS) continue serving requests during a partition, accepting diverging state. Reconciliation happens later. Choose AP for user feeds, shopping carts, session data — any domain where temporary inconsistency is tolerable and availability is a hard SLA.&lt;/p&gt;

&lt;p&gt;Neither is universally correct. What is unacceptable is having no defined behavior.&lt;/p&gt;




&lt;h2&gt;
  
  
  PACELC: The Model That Actually Matches Production
&lt;/h2&gt;

&lt;p&gt;CAP only describes behavior during partitions. Your system spends most of its time healthy. PACELC extends CAP: even during normal operation, you are trading &lt;strong&gt;Latency&lt;/strong&gt; against &lt;strong&gt;Consistency&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A CP system with synchronous replication pays a latency tax on every write — all the time, not just during incidents. DynamoDB offers eventual consistency by default (low latency) or strong consistency per read (higher latency). The trade-off is continuous, not just during failures.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architectural Patterns Shaped by CAP
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Saga Pattern&lt;/strong&gt; — inherently AP. Each local transaction commits immediately (available). Global consistency is eventual. Compensating transactions are your consistency guarantee, not your database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CQRS + Event Sourcing&lt;/strong&gt; — assigns CP to commands (strong consistency via transactional aggregate root) and AP to queries (eventual consistency via denormalized projections). You are not picking one model — you are assigning different models per use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tunable Consistency (Cassandra)&lt;/strong&gt; — &lt;code&gt;CONSISTENCY QUORUM&lt;/code&gt; on reads and writes achieves CP behavior. &lt;code&gt;CONSISTENCY ONE&lt;/code&gt; maximizes AP. Tune per operation, not per cluster. User profile reads can tolerate eventual consistency. Payment status reads cannot.&lt;/p&gt;
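&lt;p&gt;The reasoning behind quorum reads and writes is the overlap rule R + W &gt; N: if the replicas contacted on read and on write must overlap, a read sees the latest acknowledged write. The sketch below only does that arithmetic for three consistency levels; the function name is hypothetical and it ignores details such as hinted handoff and multi-datacenter levels:&lt;/p&gt;

```python
REPLICAS_REQUIRED = {
    "ONE": lambda n: 1,
    "QUORUM": lambda n: n // 2 + 1,
    "ALL": lambda n: n,
}


def reads_see_latest_write(read_level, write_level, replication_factor):
    """True when read and write replica sets are guaranteed to overlap."""
    r = REPLICAS_REQUIRED[read_level](replication_factor)
    w = REPLICAS_REQUIRED[write_level](replication_factor)
    return r + w > replication_factor


print(reads_see_latest_write("QUORUM", "QUORUM", 3),
      reads_see_latest_write("ONE", "ONE", 3))
```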




&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Treating CAP as a database property&lt;/strong&gt; — it is a system property. Your retry logic, caching, and timeout behavior all participate in the trade-off.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assuming strong consistency is always safer&lt;/strong&gt; — CP under partition returns errors. Cascading timeouts from a blocked write path can cause a larger outage than serving stale data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One consistency model across the entire system&lt;/strong&gt; — your order service (CP), product catalog (AP), session store (AP), and audit log (CP) should not share a single strategy.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Read the Full Article
&lt;/h2&gt;

&lt;p&gt;This is a summary of my deep dive into CAP theorem trade-offs. The full article covers CP vs AP with canonical examples, the PACELC model, architectural patterns, and a decision framework:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;👉 &lt;a href="https://aloknecessary.github.io/blogs/cap_theorem_architecture/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication&amp;amp;utm_content=cap-theorem" rel="noopener noreferrer"&gt;The CAP Theorem in Practice — Full Article&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full article includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detailed CP vs AP comparison with canonical system examples&lt;/li&gt;
&lt;li&gt;PACELC model with system-by-system partition and normal operation behavior&lt;/li&gt;
&lt;li&gt;Saga and CQRS patterns analyzed through the CAP lens&lt;/li&gt;
&lt;li&gt;Tunable consistency deep dive with Cassandra&lt;/li&gt;
&lt;li&gt;Real-world decision framework for architecture reviews&lt;/li&gt;
&lt;li&gt;Four common mistakes architects make with CAP trade-offs&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>distributedsystems</category>
      <category>architecture</category>
      <category>database</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>BM25 vs. Vector Search: Choosing the Right Retrieval Strategy for Production Systems</title>
      <dc:creator>Alok Ranjan Daftuar</dc:creator>
      <pubDate>Tue, 07 Apr 2026 05:53:00 +0000</pubDate>
      <link>https://dev.to/aloknecessary/bm25-vs-vector-search-choosing-the-right-retrieval-strategy-for-production-systems-599n</link>
      <guid>https://dev.to/aloknecessary/bm25-vs-vector-search-choosing-the-right-retrieval-strategy-for-production-systems-599n</guid>
      <description>&lt;p&gt;Search is deceptively complex. You can stand up Elasticsearch in an afternoon and have something that works. Whether it surfaces the right document when a user asks "how do I reset my subscription?" instead of typing "subscription reset steps" is an entirely different problem.&lt;/p&gt;

&lt;p&gt;The two dominant retrieval paradigms — &lt;strong&gt;BM25&lt;/strong&gt; and &lt;strong&gt;Vector Search&lt;/strong&gt; — are both mature and production-proven. The real question is why one fails where the other succeeds, and how to combine them.&lt;/p&gt;




&lt;h2&gt;
  
  
  BM25: The Probabilistic Workhorse
&lt;/h2&gt;

&lt;p&gt;BM25 scores documents using term frequency, inverse document frequency, and document length normalization. It is still the default ranking algorithm in Elasticsearch and OpenSearch.&lt;/p&gt;
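
&lt;p&gt;To make the three ingredients concrete, here is a small sketch of the classic Okapi BM25 formula over pre-tokenized documents. It is an illustration, not the Lucene implementation: the function name is hypothetical, the IDF uses the non-negative variant, and &lt;code&gt;k1 = 1.2&lt;/code&gt; and &lt;code&gt;b = 0.75&lt;/code&gt; are the commonly used defaults.&lt;/p&gt;

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each tokenized document against the query terms."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized via b.
            norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


docs = [
    "reset your subscription from the billing page".split(),
    "cancel membership at any time".split(),
    "subscription reset steps for annual plans".split(),
]
scores = bm25_scores("subscription reset".split(), docs)
print(scores)
```

&lt;p&gt;The second document shares no terms with the query and scores zero, which is the vocabulary-mismatch weakness in miniature.&lt;/p&gt;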

&lt;p&gt;&lt;strong&gt;Excels at:&lt;/strong&gt; exact keyword matching (SKUs, error codes, CLI flags), transparent debuggable ranking, sub-millisecond latency at scale, zero GPU cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Breaks down at:&lt;/strong&gt; vocabulary mismatch ("cancel membership" vs "terminate subscription"), semantic intent, cross-language queries, conceptual similarity.&lt;/p&gt;

&lt;p&gt;BM25 is fundamentally a bag-of-words model. It has no understanding of meaning.&lt;/p&gt;




&lt;h2&gt;
  
  
  Vector Search: Semantic Retrieval via Embeddings
&lt;/h2&gt;

&lt;p&gt;Vector search transforms text into dense numerical vectors where semantically similar content is geometrically close. Retrieval becomes nearest-neighbor search in that space.&lt;/p&gt;
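
&lt;p&gt;A brute-force sketch of that idea, using cosine similarity over toy vectors. The three-dimensional "embeddings" below are made up for illustration; real systems use model-generated vectors with hundreds of dimensions and an approximate index such as HNSW instead of a full scan.&lt;/p&gt;

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_vec: list[float], index: dict[str, list[float]],
            top_k: int = 2) -> list[str]:
    """Brute-force nearest-neighbor search over {doc_id: vector}."""
    ranked = sorted(
        index,
        key=lambda doc_id: cosine_similarity(query_vec, index[doc_id]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical vectors: the first two documents point in a similar direction.
index = {
    "cancel-membership": [0.9, 0.1, 0.0],
    "terminate-subscription": [0.8, 0.2, 0.1],
    "reset-password": [0.1, 0.9, 0.2],
}
print(nearest([0.85, 0.15, 0.05], index))
```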

&lt;p&gt;&lt;strong&gt;Excels at:&lt;/strong&gt; semantic equivalence, natural language questions, cross-lingual retrieval, RAG pipelines where LLMs generate natural language queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Breaks down at:&lt;/strong&gt; exact term matching (&lt;code&gt;NullPointerException&lt;/code&gt;, order IDs), infrastructure cost (GPU for embedding, RAM for HNSW indexes), staleness when documents change frequently.&lt;/p&gt;

&lt;p&gt;The quality of vector search is largely determined by your embedding model: domain fit, dimensionality, and max token length all matter.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hybrid Search: The Production Reality
&lt;/h2&gt;

&lt;p&gt;In most real-world systems, neither alone is sufficient. The answer is &lt;strong&gt;hybrid retrieval&lt;/strong&gt;: running both in parallel and merging their ranked results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reciprocal Rank Fusion (RRF)&lt;/strong&gt; merges ranked lists using only each document's rank position, so there is no need to normalize cosine similarity against BM25 scores. Documents that rank well in both lists get a substantial boost.&lt;/p&gt;
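
&lt;p&gt;RRF is short enough to sketch in full: each document receives 1 / (k + rank) from every list it appears in, and the sums decide the final order. The function name and document IDs are illustrative, and &lt;code&gt;k = 60&lt;/code&gt; is a commonly used constant rather than a required value.&lt;/p&gt;

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs; best-first in, best-first out."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the position matters, never the raw retrieval score.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25_top = ["doc_a", "doc_b", "doc_c"]
vector_top = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([bm25_top, vector_top])
print(fused)
```

&lt;p&gt;Documents that appear in both lists (&lt;code&gt;doc_a&lt;/code&gt; and &lt;code&gt;doc_c&lt;/code&gt;) end up ahead of those found by only one retriever.&lt;/p&gt;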

&lt;p&gt;The full pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;BM25 + Vector Search&lt;/strong&gt; in parallel → top-K candidates each&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RRF merge&lt;/strong&gt; → combined ranked list&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-encoder re-ranker&lt;/strong&gt; → precision pass on top candidates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Final top-N&lt;/strong&gt; → LLM context or search results&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Skipping the re-ranker is a common mistake. Initial retrieval optimizes for recall. The re-ranker optimizes for precision, which is especially critical when context window limits mean you can only pass the top 5 results to an LLM.&lt;/p&gt;




&lt;h2&gt;
  
  
  Common Architecture Mistakes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vector search alone for RAG&lt;/strong&gt; — misses exact-match cases that BM25 catches, creating systematic blind spots&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignoring chunk boundaries&lt;/strong&gt; — embedding a 5000-token document as a single vector destroys retrieval specificity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;General-purpose embedding model on specialized corpus&lt;/strong&gt; — domain-specific models significantly outperform on legal, medical, or code retrieval&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not evaluating retrieval independently&lt;/strong&gt; — teams blame the LLM when retrieval is the actual failure point. Measure Recall@K and MRR before debugging generation.&lt;/li&gt;
&lt;/ul&gt;
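
&lt;p&gt;Both metrics named in the last point are easy to compute once you have a small labeled set of queries. A minimal sketch, assuming each query comes with its retrieved IDs in rank order and a set of known-relevant IDs (the function names and sample data are hypothetical):&lt;/p&gt;

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]).intersection(relevant))
    return hits / len(relevant)


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1 / rank of the first relevant result per query (0 if none)."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


runs = [
    (["d3", "d1", "d7"], {"d1", "d9"}),  # first relevant hit at rank 2
    (["d4", "d5", "d6"], {"d4"}),        # first relevant hit at rank 1
]
print(recall_at_k(runs[0][0], runs[0][1], k=3))
print(mean_reciprocal_rank(runs))
```

&lt;p&gt;Tracking these numbers per retriever (BM25 only, vector only, hybrid) shows whether a bad answer is a retrieval problem before any time is spent on prompts.&lt;/p&gt;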




&lt;h2&gt;
  
  
  Read the Full Article
&lt;/h2&gt;

&lt;p&gt;This is a summary of my deep dive into retrieval architecture. The full article covers BM25 mechanics, vector search internals, the complete tooling landscape, chunking strategies, and a decision framework:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;👉 &lt;a href="https://aloknecessary.github.io/blogs/bm25_vs_vector_search/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication&amp;amp;utm_content=bm25-vs-vector-search" rel="noopener noreferrer"&gt;BM25 vs. Vector Search — Full Article&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full article includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;BM25 scoring formula breakdown and tuning parameters&lt;/li&gt;
&lt;li&gt;Embedding model evaluation criteria and provider comparison&lt;/li&gt;
&lt;li&gt;Head-to-head scenario table showing when each wins&lt;/li&gt;
&lt;li&gt;Hybrid architecture pattern with RRF and re-ranking&lt;/li&gt;
&lt;li&gt;Complete tooling landscape (Pinecone, Qdrant, Weaviate, pgvector, Elasticsearch 8.x)&lt;/li&gt;
&lt;li&gt;Chunking strategy deep dive (fixed, semantic, hierarchical)&lt;/li&gt;
&lt;li&gt;Decision framework for choosing your retrieval architecture&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>search</category>
      <category>architecture</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Saga Orchestration vs. Choreography: Making the Right Trade-off in Event-Driven Systems</title>
      <dc:creator>Alok Ranjan Daftuar</dc:creator>
      <pubDate>Fri, 27 Mar 2026 09:04:14 +0000</pubDate>
      <link>https://dev.to/aloknecessary/saga-orchestration-vs-choreography-making-the-right-trade-off-in-event-driven-systems-5fmm</link>
      <guid>https://dev.to/aloknecessary/saga-orchestration-vs-choreography-making-the-right-trade-off-in-event-driven-systems-5fmm</guid>
      <description>&lt;p&gt;The saga pattern looks straightforward in diagrams. It becomes genuinely complex the moment you operate it in production.&lt;/p&gt;

&lt;p&gt;The central question — &lt;strong&gt;orchestration or choreography&lt;/strong&gt; — carries consequences that ripple through your codebase, your operational posture, and your team's cognitive load for years.&lt;/p&gt;

&lt;p&gt;This is not a "use orchestration for complex sagas, choreography for simple ones" post. The real trade-offs are more specific.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Baseline: What Both Approaches Must Solve
&lt;/h2&gt;

&lt;p&gt;Before choosing an approach, every saga implementation must handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
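&lt;strong&gt;Atomicity at step boundaries&lt;/strong&gt; — the database write and the event publication must succeed or fail together. Writing to the database and the broker separately is not atomic, so record the event in the same transaction as the state change (transactional outbox) or derive it from the commit log (CDC)&lt;/li&gt;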
&lt;li&gt;
&lt;strong&gt;Idempotent consumers&lt;/strong&gt; — at-least-once delivery means your steps &lt;em&gt;will&lt;/em&gt; be invoked more than once&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compensation correctness&lt;/strong&gt; — compensating transactions are not rollbacks; they undo changes in a world that has moved on&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt; — correlation IDs, structured logging, and queryable saga state&lt;/li&gt;
&lt;/ul&gt;
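
&lt;p&gt;The first requirement is the one most often hand-waved, so here is a rough sketch of a transactional outbox using SQLite from the Python standard library. The table names, event shape, and &lt;code&gt;publish&lt;/code&gt; callback are assumptions for illustration; a production relay would poll or tail the table and hand events to a real broker.&lt;/p&gt;

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "event_type TEXT, payload TEXT, published INTEGER DEFAULT 0)"
)


def place_order(order_id: str) -> None:
    """Write the state change and its event in one local transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, "PENDING"))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderPlaced", json.dumps({"order_id": order_id})),
        )


def relay_outbox(publish) -> int:
    """Publish pending rows in order; returns how many rows were pending."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))
        # A crash between publish and this update causes a redelivery,
        # which is why consumers have to be idempotent.
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


sent = []
place_order("o-1")
relay_outbox(lambda event_type, payload: sent.append((event_type, payload)))
print(sent)
```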

&lt;p&gt;These are table stakes, not optional concerns.&lt;/p&gt;




&lt;h2&gt;
  
  
  Orchestration: Central Control
&lt;/h2&gt;

&lt;p&gt;A dedicated orchestrator drives the saga — it knows the sequence, issues commands, waits for responses, and drives compensation.&lt;/p&gt;
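
&lt;p&gt;Stripped of messaging and persistence, the control flow looks roughly like the sketch below: run steps in order, and on failure compensate the completed ones in reverse. Everything here is hypothetical and in-process; a real orchestrator would persist saga state and issue commands over a broker or a workflow engine.&lt;/p&gt;

```python
from typing import Callable

# (name, action, compensation)
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: list[str] = []
    completed: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
            completed.append(step)
            log.append(f"done:{name}")
        except Exception:
            log.append(f"failed:{name}")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            break
    return log


def fail() -> None:
    raise RuntimeError("payment declined")


saga_log = run_saga([
    ("reserve_stock", lambda: None, lambda: None),
    ("charge_payment", fail, lambda: None),
    ("ship_order", lambda: None, lambda: None),
])
print(saga_log)
```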

&lt;p&gt;&lt;strong&gt;Shines when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Workflows have complex conditional branching&lt;/li&gt;
&lt;li&gt;Long-running sagas involve human steps or wait states&lt;/li&gt;
&lt;li&gt;Operational visibility and debugging matter most&lt;/li&gt;
&lt;li&gt;Compensation must be guaranteed and sequenced&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Breaks down when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The orchestrator becomes a throughput bottleneck&lt;/li&gt;
&lt;li&gt;Tight temporal coupling conflicts with event-driven decoupling goals&lt;/li&gt;
&lt;li&gt;Business logic gravitates into the orchestrator (god-object risk)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Choreography: Decentralized Reactions
&lt;/h2&gt;

&lt;p&gt;No central coordinator. Each service listens for events, performs its local transaction, and publishes events that others react to. The saga is an emergent property.&lt;/p&gt;
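
&lt;p&gt;A toy in-memory version shows what "emergent" means here. The event names and the tiny publish/subscribe bus are assumptions for illustration, standing in for separate services connected by a real broker.&lt;/p&gt;

```python
from collections import defaultdict

handlers = defaultdict(list)
trace = []


def subscribe(event_type, handler):
    handlers[event_type].append(handler)


def publish(event_type, payload):
    trace.append(event_type)
    for handler in handlers[event_type]:
        handler(payload)


# Each service only knows which event it reacts to and which it emits.
subscribe("OrderPlaced", lambda order: publish("StockReserved", order))
subscribe("StockReserved", lambda order: publish("PaymentCharged", order))
subscribe("PaymentCharged", lambda order: publish("OrderShipped", order))

publish("OrderPlaced", {"order_id": "o-1"})
print(trace)
```

&lt;p&gt;No single place in this code describes the whole workflow; the sequence exists only in the combination of subscriptions, which is the source of both the decoupling and the debugging cost.&lt;/p&gt;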

&lt;p&gt;&lt;strong&gt;Shines when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Services are genuinely independent&lt;/li&gt;
&lt;li&gt;Throughput is high and latency requirements are strict&lt;/li&gt;
&lt;li&gt;The workflow is stable and simple&lt;/li&gt;
&lt;li&gt;Independent deployability is valued over centralized visibility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Breaks down when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Saga state is implicit and debugging requires forensic log analysis&lt;/li&gt;
&lt;li&gt;Business logic is distributed across every participating service&lt;/li&gt;
&lt;li&gt;Compensation failures go undetected — no component knows a step was missed&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Failure Modes That Catch Teams Off Guard
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Lost compensation events&lt;/strong&gt; — a compensating transaction fails, lands in a DLQ, and the system stays inconsistent until someone investigates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pivot transaction ambiguity&lt;/strong&gt; — misidentifying the point of no return means the saga will try to compensate steps that cannot actually be reversed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Saga timeouts and orphaned state&lt;/strong&gt; — sagas that time out without completing compensation leave the system in a partially-applied state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Event schema evolution&lt;/strong&gt; — a schema change breaks consumers silently, causing sagas to proceed with incorrect data.&lt;/p&gt;




&lt;h2&gt;
  
  
  Making the Decision
&lt;/h2&gt;

&lt;p&gt;Most large systems use &lt;strong&gt;both&lt;/strong&gt; — choreography for high-throughput, loosely-coupled flows; orchestration for complex, stateful, business-critical workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The key insight:&lt;/strong&gt; neither approach eliminates the need for idempotent consumers, transactional outboxes, schema governance, DLQ monitoring, or explicit compensation design. The approach determines where control and visibility live — not whether your system is correct.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Get the baseline right. Then choose the approach that fits your operational context — not the one that looked better in the last conference talk you attended.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Read the Full Article
&lt;/h2&gt;

&lt;p&gt;This is a summary of my deep dive into saga patterns. The full article covers orchestration and choreography in detail with production failure scenarios, compensation strategies, and decision frameworks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;👉 &lt;a href="https://aloknecessary.github.io/blogs/saga-orchestration-vs-choreography/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_syndication&amp;amp;utm_content=saga-orchestration-vs-choreography" rel="noopener noreferrer"&gt;Saga Orchestration vs. Choreography — Full Article&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full article includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detailed comparison of both approaches with subsections on how they work, where they shine, and where they break down&lt;/li&gt;
&lt;li&gt;Four critical failure modes that affect both approaches&lt;/li&gt;
&lt;li&gt;Practical decision heuristics for choosing the right approach&lt;/li&gt;
&lt;li&gt;Baseline requirements every saga implementation must handle&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>distributedsystems</category>
      <category>architecture</category>
      <category>microservices</category>
      <category>eventdriven</category>
    </item>
  </channel>
</rss>
