<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Raj Kundalia</title>
    <description>The latest articles on DEV Community by Raj Kundalia (@rajkundalia).</description>
    <link>https://dev.to/rajkundalia</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F633218%2Ffab0df55-e22f-4dc2-9f39-0cdc3f4f9d59.jpeg</url>
      <title>DEV Community: Raj Kundalia</title>
      <link>https://dev.to/rajkundalia</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rajkundalia"/>
    <language>en</language>
    <item>
      <title>What Happens When Every Prompt Slot Says Something Different</title>
      <dc:creator>Raj Kundalia</dc:creator>
      <pubDate>Sun, 28 Jun 2026 10:11:28 +0000</pubDate>
      <link>https://dev.to/rajkundalia/what-happens-when-every-prompt-slot-says-something-different-33c1</link>
      <guid>https://dev.to/rajkundalia/what-happens-when-every-prompt-slot-says-something-different-33c1</guid>
      <description>&lt;p&gt;&lt;em&gt;A controlled experiment exploring how Claude and Qwen resolve conflicting instructions across system prompts, user messages, and tool descriptions.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cross-posting from Medium:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://medium.com/@rajkundalia/where-you-put-the-instruction-matters-more-than-what-it-says-2d5ffcdd9369" rel="noopener noreferrer"&gt;https://medium.com/@rajkundalia/where-you-put-the-instruction-matters-more-than-what-it-says-2d5ffcdd9369&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the first experiment of the series, &lt;strong&gt;Where You Put the Instruction Matters More Than What It Says&lt;/strong&gt;, I asked a simple question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Does it matter where you place an instruction?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The answer depended entirely on the model.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;Qwen 2.5-Coder 3B&lt;/strong&gt;, the answer was &lt;strong&gt;yes&lt;/strong&gt;. The same instruction produced dramatically different compliance rates depending on whether it lived in the system prompt, user message (or task prompt), or tool description.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;Claude Haiku 4.5&lt;/strong&gt; and &lt;strong&gt;Claude Sonnet 4.6&lt;/strong&gt;, the answer appeared to be &lt;strong&gt;no&lt;/strong&gt;. Both models followed the instruction perfectly regardless of where it was placed.&lt;/p&gt;

&lt;p&gt;That experiment measured &lt;strong&gt;placement strength&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But it left an obvious follow-up question unanswered.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What happens when every prompt slot says something different?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's what this experiment measures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub repository:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/rajkundalia/prompt-placement-anatomy" rel="noopener noreferrer"&gt;https://github.com/rajkundalia/prompt-placement-anatomy&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fzpzwmewb1wh3xhz0rmyu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fzpzwmewb1wh3xhz0rmyu.png" alt="Image1" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  The Experiment
&lt;/h1&gt;

&lt;p&gt;The underlying task is unchanged from Part 1.&lt;/p&gt;

&lt;p&gt;The agent counts TODO markers across five markdown files using two filesystem tools: &lt;code&gt;list_files&lt;/code&gt; and &lt;code&gt;read_file&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The models are the same.&lt;/p&gt;

&lt;p&gt;The agent loop is the same.&lt;/p&gt;

&lt;p&gt;The only thing that changes is the prompt.&lt;/p&gt;

&lt;p&gt;In Part 1, the same instruction was placed into one slot at a time.&lt;/p&gt;

&lt;p&gt;In Part 2, every slot contains a different instruction simultaneously.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Slot&lt;/th&gt;
&lt;th&gt;Instruction&lt;/th&gt;
&lt;th&gt;Marker&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System prompt&lt;/td&gt;
&lt;td&gt;End your final answer with the marker &lt;code&gt;[DONE]&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[DONE]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User message&lt;/td&gt;
&lt;td&gt;End your final answer with the marker &lt;code&gt;[FINISHED]&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[FINISHED]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool description (&lt;code&gt;read_file&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;End your final answer with the marker &lt;code&gt;[COMPLETE]&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[COMPLETE]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every instruction is active in every run.&lt;/p&gt;

&lt;p&gt;The model cannot satisfy all three.&lt;/p&gt;

&lt;p&gt;It has to choose one, ignore them entirely, or produce some mixture of them.&lt;/p&gt;

&lt;p&gt;Unlike Part 1, this experiment isn't measuring compliance.&lt;/p&gt;

&lt;p&gt;It's measuring &lt;strong&gt;which instruction wins.&lt;/strong&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Measuring the Winner
&lt;/h1&gt;

&lt;p&gt;Each run falls into one of five possible outcomes.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Outcome&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System&lt;/td&gt;
&lt;td&gt;Response ends with &lt;code&gt;[DONE]&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User&lt;/td&gt;
&lt;td&gt;Response ends with &lt;code&gt;[FINISHED]&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool&lt;/td&gt;
&lt;td&gt;Response ends with &lt;code&gt;[COMPLETE]&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None of the expected markers appear&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Conflict in output&lt;/td&gt;
&lt;td&gt;Multiple markers appear&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The final 150 characters of every response are searched using case-insensitive regular expressions.&lt;/p&gt;




&lt;h1&gt;
  
  
  Results: Qwen 2.5-Coder 3B (Ollama)
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F8gw2spgjwep8u9grfyts.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F8gw2spgjwep8u9grfyts.png" alt="Image2" width="800" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first thing I noticed was how familiar these numbers looked.&lt;/p&gt;

&lt;p&gt;In Part 1, placing the instruction in the user message produced &lt;strong&gt;64% compliance&lt;/strong&gt;, while the system prompt managed &lt;strong&gt;8%&lt;/strong&gt; and the tool description &lt;strong&gt;2%&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Now, under direct competition, the user message wins &lt;strong&gt;60%&lt;/strong&gt; of the time, the system prompt wins &lt;strong&gt;2%&lt;/strong&gt;, and the tool description never wins at all.&lt;/p&gt;

&lt;p&gt;Although the experiments ask different questions, they tell a remarkably consistent story.&lt;/p&gt;

&lt;p&gt;The slot that was strongest in isolation is also the slot that dominates when every instruction competes.&lt;/p&gt;

&lt;p&gt;The conflict condition also exposed behavior that Part 1 could never reveal.&lt;/p&gt;

&lt;p&gt;Nearly a third of the runs ended without any expected marker.&lt;/p&gt;

&lt;p&gt;Another &lt;strong&gt;6%&lt;/strong&gt; produced multiple competing markers in the same response.&lt;/p&gt;

&lt;p&gt;Instead of consistently selecting one instruction, the model sometimes failed to produce a single clear winner.&lt;/p&gt;

&lt;h2&gt;
  
  
  A note about tool execution
&lt;/h2&gt;

&lt;p&gt;One implementation detail is important when interpreting these results.&lt;/p&gt;

&lt;p&gt;Unlike the Claude models, Qwen never successfully executed the tool loop.&lt;/p&gt;

&lt;p&gt;Rather than producing structured tool calls, it emitted tool-call JSON as plain text and completed every run in a single turn.&lt;/p&gt;

&lt;p&gt;This means the tool description was never exercised as part of an actual tool invocation.&lt;/p&gt;

&lt;p&gt;It existed only as text inside the context window.&lt;/p&gt;

&lt;p&gt;That limitation is consistent with the results from Part 1, where the tool description also had almost no observable influence for Qwen.&lt;/p&gt;




&lt;h1&gt;
  
  
  Results: Claude Haiku 4.5 (Anthropic API)
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Outcome&lt;/th&gt;
&lt;th&gt;Frequency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;User &lt;code&gt;[FINISHED]&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System &lt;code&gt;[DONE]&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool &lt;code&gt;[COMPLETE]&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Conflict in output&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every run produced exactly the same outcome.&lt;/p&gt;

&lt;p&gt;The model completed the tool loop correctly, used three turns, and always finished with &lt;code&gt;[FINISHED]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is where the experiment becomes interesting.&lt;/p&gt;

&lt;p&gt;Part 1 suggested that every prompt slot was equally effective because each placement achieved &lt;strong&gt;100% compliance&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Part 2 reveals a more nuanced picture.&lt;/p&gt;

&lt;p&gt;When every slot contains the same instruction, every slot can successfully deliver that instruction.&lt;/p&gt;

&lt;p&gt;Once those instructions conflict, however, the model consistently resolves the disagreement in favor of the user message.&lt;/p&gt;

&lt;p&gt;The placement experiment and the conflict experiment are measuring different properties of the model.&lt;/p&gt;




&lt;h1&gt;
  
  
  Results: Claude Sonnet 4.6 (Anthropic API)
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Outcome&lt;/th&gt;
&lt;th&gt;Frequency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;User &lt;code&gt;[FINISHED]&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System &lt;code&gt;[DONE]&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool &lt;code&gt;[COMPLETE]&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Conflict in output&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Claude Sonnet was tested across &lt;strong&gt;12 runs&lt;/strong&gt;, stopped early once the pattern was clearly established—that is, the user instruction determined the final formatting of the response.&lt;/p&gt;




&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;User&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;None&lt;/th&gt;
&lt;th&gt;Conflict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5-coder:3b&lt;/td&gt;
&lt;td&gt;Small local (Ollama)&lt;/td&gt;
&lt;td&gt;2%&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;32%&lt;/td&gt;
&lt;td&gt;6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;claude-haiku-4.5&lt;/td&gt;
&lt;td&gt;Small frontier (Anthropic)&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;claude-sonnet-4.6&lt;/td&gt;
&lt;td&gt;Large frontier (Anthropic)&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three observations stand out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The tool description never won: across all runs and all three models, &lt;code&gt;[COMPLETE]&lt;/code&gt; never emerged as the surviving instruction.&lt;/li&gt;
&lt;li&gt;The system prompt rarely won: it appeared once for Qwen and never for either Claude model.&lt;/li&gt;
&lt;li&gt;Both Claude models behaved identically despite their difference in size. Haiku, Anthropic's smallest model, resolved the conflict exactly the same way as Sonnet.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F26mjl9jh8brfvkd5pdp9.png" alt="Image3" width="799" height="436"&gt;
&lt;/h2&gt;

&lt;h1&gt;
  
  
  Looking at Both Experiments Together
&lt;/h1&gt;

&lt;p&gt;Although both experiments involve prompt placement, they answer different questions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 1&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can this prompt slot successfully deliver an instruction?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Part 2&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When multiple instructions compete, which one determines the final output?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For Qwen:&lt;/p&gt;

&lt;p&gt;The user message was the strongest placement in isolation, and it remained the dominant placement under direct competition.&lt;/p&gt;

&lt;p&gt;For the Claude models:&lt;/p&gt;

&lt;p&gt;Part 1 showed that all three prompt slots could successfully deliver an instruction when no competing instruction existed.&lt;/p&gt;

&lt;p&gt;Part 2 showed that once conflict was introduced, the user message consistently determined the final formatting in this experiment.&lt;/p&gt;

&lt;p&gt;Together, the two experiments show that &lt;strong&gt;instruction visibility&lt;/strong&gt; and &lt;strong&gt;instruction priority&lt;/strong&gt; are different characteristics of an LLM.&lt;/p&gt;

&lt;p&gt;A model may reliably process instructions from every prompt slot while still preferring one slot whenever those instructions disagree.&lt;/p&gt;




&lt;h1&gt;
  
  
  What This Means in Practice
&lt;/h1&gt;

&lt;p&gt;If you're building agents with smaller open-weight models, prompt placement is more than a stylistic choice.&lt;/p&gt;

&lt;p&gt;Across both experiments, the user message was consistently the most reliable place for formatting instructions.&lt;/p&gt;

&lt;p&gt;System prompts and tool descriptions were substantially less effective, particularly when competing instructions existed.&lt;/p&gt;

&lt;p&gt;For the Claude models tested here, the practical takeaway is different.&lt;/p&gt;

&lt;p&gt;They successfully followed instructions regardless of placement when no conflict existed.&lt;/p&gt;

&lt;p&gt;However, in this experiment, conflicting formatting instructions were consistently resolved in favor of the user message.&lt;/p&gt;

&lt;p&gt;It's important to keep the scope of that finding in mind.&lt;/p&gt;

&lt;p&gt;This experiment only examined formatting instructions within a controlled agent loop.&lt;/p&gt;

&lt;p&gt;It does &lt;strong&gt;not&lt;/strong&gt; imply that user prompts override safety policies or other system-level behaviors, which are governed by different mechanisms and would require a different experimental design.&lt;/p&gt;




&lt;h1&gt;
  
  
  Caveats
&lt;/h1&gt;

&lt;p&gt;The markers &lt;code&gt;[DONE]&lt;/code&gt;, &lt;code&gt;[FINISHED]&lt;/code&gt;, and &lt;code&gt;[COMPLETE]&lt;/code&gt; are different strings.&lt;/p&gt;

&lt;p&gt;They differ in length and may differ in how frequently similar tokens appeared during model training.&lt;/p&gt;

&lt;p&gt;Rotating the markers across prompt slots would control for that effect, but it would also triple the size of the experiment and was not done here.&lt;/p&gt;

&lt;p&gt;The sample sizes also differ across models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;50&lt;/strong&gt; runs for Qwen&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;30&lt;/strong&gt; for Claude Haiku&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;12&lt;/strong&gt; for Claude Sonnet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Anthropic models exhibited highly consistent behavior, allowing the experiments to stop once the dominant pattern was established.&lt;/p&gt;

&lt;p&gt;Finally, these results are model- and task-specific.&lt;/p&gt;

&lt;p&gt;Different architectures, quantization levels, or tasks may produce different behaviors.&lt;/p&gt;

&lt;p&gt;The goal of this experiment is not to establish a universal prompt hierarchy, but to measure how these particular models behave under controlled conditions.&lt;/p&gt;

&lt;p&gt;Statistical confidence intervals were calculated during analysis but are omitted here because the dominant winner was unambiguous.&lt;/p&gt;




&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;The most interesting result wasn't that the user message won.&lt;/p&gt;

&lt;p&gt;It was that two experiments, built to measure different properties, kept arriving at the same answer.&lt;/p&gt;

&lt;p&gt;For one model, the strongest placement in isolation was also the strongest placement under conflict.&lt;/p&gt;

&lt;p&gt;For the others, perfect placement compliance concealed a deterministic preference that only became visible once the prompts disagreed.&lt;/p&gt;

&lt;p&gt;Sometimes the most interesting model behavior doesn't appear when there's only one correct instruction.&lt;/p&gt;

&lt;p&gt;It appears when every prompt slot asks for something different, and the model has to decide which one deserves the final word.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Follow me on LinkedIn:&lt;/strong&gt; &lt;a href="https://www.linkedin.com/in/rajkundalia/" rel="noopener noreferrer"&gt;Raj Kundalia&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Related
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/@rajkundalia/where-you-put-the-instruction-matters-more-than-what-it-says-2d5ffcdd9369" rel="noopener noreferrer"&gt;&lt;strong&gt;Where You Put the Instruction Matters More Than What It Says&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@rajkundalia/why-95-reviews-beats-20-reviews-even-when-both-score-95-21d21ea3cb92" rel="noopener noreferrer"&gt;&lt;strong&gt;Why 95 Reviews Beats 20 Reviews—Even When Both Score 95%&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>llm</category>
    </item>
    <item>
      <title>Where You Put the Instruction Matters More Than What It Says</title>
      <dc:creator>Raj Kundalia</dc:creator>
      <pubDate>Sat, 13 Jun 2026 15:27:32 +0000</pubDate>
      <link>https://dev.to/rajkundalia/where-you-put-the-instruction-matters-more-than-what-it-says-2o9</link>
      <guid>https://dev.to/rajkundalia/where-you-put-the-instruction-matters-more-than-what-it-says-2o9</guid>
      <description>&lt;p&gt;&lt;em&gt;An experiment comparing system prompts, user prompts, and tool descriptions across Claude and Qwen&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Originally published on Medium: &lt;a href="https://medium.com/@rajkundalia/where-you-put-the-instruction-matters-more-than-what-it-says-2d5ffcdd9369" rel="noopener noreferrer"&gt;https://medium.com/@rajkundalia/where-you-put-the-instruction-matters-more-than-what-it-says-2d5ffcdd9369&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There’s a lot of advice on how to write good prompts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use chain-of-thought&lt;/li&gt;
&lt;li&gt;Add examples&lt;/li&gt;
&lt;li&gt;Be specific&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But I hadn’t seen much real-world evidence on a different question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When you give an LLM agent an instruction, does it matter which slot you put it in?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I am not talking about wording or tone. I mean the structural slot: &lt;strong&gt;system message&lt;/strong&gt;, &lt;strong&gt;user message&lt;/strong&gt;, or &lt;strong&gt;tool description&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;These aren’t just different positions in a string. They’re different fields in the API payload, and models are trained to treat them differently.&lt;/p&gt;

&lt;p&gt;I wanted to know:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Does the slot actually affect whether the model follows the instruction?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So I built an experiment to find out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Repository:&lt;/strong&gt; &lt;a href="https://github.com/rajkundalia/prompt-placement-anatomy" rel="noopener noreferrer"&gt;https://github.com/rajkundalia/prompt-placement-anatomy&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkl1i0nih7wogcwr2wwjo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkl1i0nih7wogcwr2wwjo.png" alt="Prompt Placement Anatomy" width="799" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Experiment
&lt;/h2&gt;

&lt;p&gt;The design is deliberately boring.&lt;/p&gt;

&lt;p&gt;The agent’s job is to count TODO markers across five markdown files using two filesystem tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;list_files&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;read_file&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The instruction under test is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;End your final answer with the marker &lt;code&gt;[DONE]&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That instruction gets placed in exactly one of three slots per run:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;System message&lt;/strong&gt; — typically reserved for persona and behavioral rules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User message&lt;/strong&gt; — where the task itself lives&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool description&lt;/strong&gt; — metadata attached to the tool schema, appended to the &lt;code&gt;read_file&lt;/code&gt; tool description&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The task stays identical across all three variants.&lt;/p&gt;

&lt;p&gt;The only variable is where the &lt;code&gt;[DONE]&lt;/code&gt; instruction lives.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why No &lt;code&gt;submit_answer&lt;/code&gt; Tool?
&lt;/h3&gt;

&lt;p&gt;One design decision worth calling out:&lt;/p&gt;

&lt;p&gt;There is no &lt;code&gt;submit_answer&lt;/code&gt; or &lt;code&gt;final_response&lt;/code&gt; tool.&lt;/p&gt;

&lt;p&gt;The agent terminates by returning ordinary text with no further tool calls.&lt;/p&gt;

&lt;p&gt;Compliance is checked on that free-text response using a case-insensitive search for &lt;code&gt;[DONE]&lt;/code&gt; in the last 80 characters.&lt;/p&gt;

&lt;p&gt;This was intentional.&lt;/p&gt;

&lt;p&gt;I wanted to measure whether the model follows a formatting instruction in its natural output, not whether it can populate a structured tool argument correctly.&lt;/p&gt;

&lt;p&gt;Those are different skills.&lt;/p&gt;

&lt;p&gt;Each placement is run multiple times.&lt;/p&gt;

&lt;p&gt;Metrics collected:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compliance rate (did it append &lt;code&gt;[DONE]&lt;/code&gt;?)&lt;/li&gt;
&lt;li&gt;Completion rate (did it finish within the 15-turn cap?)&lt;/li&gt;
&lt;li&gt;Turns to completion&lt;/li&gt;
&lt;li&gt;Total token usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compliance is the headline metric.&lt;/p&gt;

&lt;p&gt;The other metrics help explain agent behavior but are not the primary outcome.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why No Frameworks?
&lt;/h2&gt;

&lt;p&gt;The agent loop is a Python &lt;code&gt;while&lt;/code&gt; loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Send a message&lt;/li&gt;
&lt;li&gt;Check for tool calls&lt;/li&gt;
&lt;li&gt;Execute tools&lt;/li&gt;
&lt;li&gt;Append results&lt;/li&gt;
&lt;li&gt;Repeat&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the model produces text with no tool calls, the run is done.&lt;/p&gt;

&lt;p&gt;I avoided frameworks deliberately.&lt;/p&gt;

&lt;p&gt;Frameworks add their own:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;system messages&lt;/li&gt;
&lt;li&gt;tool schema modifications&lt;/li&gt;
&lt;li&gt;hidden instructions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If I’m measuring placement effects, I need to know exactly what’s in each slot and nothing else.&lt;/p&gt;

&lt;p&gt;The entire implementation is about 300 lines of Python and fully visible in &lt;code&gt;agent_loop.py&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Results: Qwen 2.5-Coder 3B (Ollama)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;50 runs per placement&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Placement&lt;/th&gt;
&lt;th&gt;Compliance Rate&lt;/th&gt;
&lt;th&gt;Completion Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System&lt;/td&gt;
&lt;td&gt;8%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User&lt;/td&gt;
&lt;td&gt;64%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool Description&lt;/td&gt;
&lt;td&gt;2%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The model produced a final answer every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;100% completion across the board.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But whether it remembered to append &lt;code&gt;[DONE]&lt;/code&gt; depended almost entirely on where the instruction lived.&lt;/p&gt;

&lt;p&gt;User message placement was dramatically more effective than both alternatives.&lt;/p&gt;

&lt;p&gt;The gap between user (64%) and system (8%) is large enough that the Wilson 95% confidence intervals do not overlap, suggesting a real difference rather than sampling noise.&lt;/p&gt;

&lt;p&gt;Tool description placement was effectively useless at 2%.&lt;/p&gt;

&lt;p&gt;The system message wasn’t much better at 8%.&lt;/p&gt;

&lt;p&gt;For this model, on this task, only the user message slot reliably delivered instructions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Results: Claude Sonnet 4.6 (Anthropic API)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;20 runs per placement&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Placement&lt;/th&gt;
&lt;th&gt;Compliance Rate&lt;/th&gt;
&lt;th&gt;Mean Turns&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool Description&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Completely placement-insensitive.&lt;/p&gt;

&lt;p&gt;100% compliance across all three slots.&lt;/p&gt;

&lt;p&gt;The model followed the &lt;code&gt;[DONE]&lt;/code&gt; instruction regardless of where it lived.&lt;/p&gt;

&lt;p&gt;It also used the tools correctly every time:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;List files&lt;/li&gt;
&lt;li&gt;Read files&lt;/li&gt;
&lt;li&gt;Produce answer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No chart generated; a flat line at 100% carries no information.&lt;/p&gt;




&lt;h2&gt;
  
  
  Results: Claude Haiku 4.5 (Anthropic API)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;50 runs per placement&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Placement&lt;/th&gt;
&lt;th&gt;Compliance Rate&lt;/th&gt;
&lt;th&gt;Mean Turns&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool Description&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Identical.&lt;/p&gt;

&lt;p&gt;This is Anthropic’s smallest and cheapest model, yet it showed the same placement robustness as Sonnet.&lt;/p&gt;

&lt;p&gt;Even Haiku exhibited zero placement sensitivity.&lt;/p&gt;

&lt;p&gt;No chart generated; a flat line at 100% carries no information.&lt;/p&gt;

&lt;p&gt;If you're wondering what "turns" are, Anthropic's Agent SDK documentation explains the agent loop nicely:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://code.claude.com/docs/en/agent-sdk/agent-loop#the-loop-at-a-glance" rel="noopener noreferrer"&gt;https://code.claude.com/docs/en/agent-sdk/agent-loop#the-loop-at-a-glance&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;User&lt;/th&gt;
&lt;th&gt;Tool Desc&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5-coder:3b&lt;/td&gt;
&lt;td&gt;Small local (Ollama)&lt;/td&gt;
&lt;td&gt;8%&lt;/td&gt;
&lt;td&gt;64%&lt;/td&gt;
&lt;td&gt;2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;claude-haiku-4.5&lt;/td&gt;
&lt;td&gt;Small frontier (Anthropic)&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;claude-sonnet-4.6&lt;/td&gt;
&lt;td&gt;Large frontier (Anthropic)&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The biggest difference wasn’t between system, user, and tool slots.&lt;/p&gt;

&lt;p&gt;It was between model classes.&lt;/p&gt;

&lt;p&gt;Both Anthropic models followed the instruction regardless of placement.&lt;/p&gt;

&lt;p&gt;The 3B-parameter open-weight model did not.&lt;/p&gt;

&lt;p&gt;For that model, the user message was the only placement that produced meaningful compliance.&lt;/p&gt;

&lt;p&gt;Based on these results, placement sensitivity was a major factor for the 3B open-weight model and effectively a non-factor for the two frontier models tested.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbpvbabcbb0fcgo7yjup.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbpvbabcbb0fcgo7yjup.png" alt="Results Summary" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Means in Practice
&lt;/h2&gt;

&lt;p&gt;Many teams choose small local models for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost&lt;/li&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;Privacy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re one of them, instruction placement isn’t a matter of style.&lt;/p&gt;

&lt;p&gt;It’s a matter of reliability.&lt;/p&gt;

&lt;p&gt;In this experiment, placing a critical instruction in the system message or tool description was almost as ineffective as omitting it entirely.&lt;/p&gt;

&lt;p&gt;The user message was the only slot that consistently delivered meaningful compliance.&lt;/p&gt;

&lt;p&gt;If you're building with frontier models, placement didn't matter under these conditions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Caveats
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The prompts were short (~300 tokens for Ollama, ~6,000 tokens for Claude including tool calls).&lt;/li&gt;
&lt;li&gt;Task accuracy was not measured.&lt;/li&gt;
&lt;li&gt;The counting task is a distractor designed to force multi-turn tool use.&lt;/li&gt;
&lt;li&gt;The exact percentages apply only to &lt;code&gt;qwen2.5-coder:3b&lt;/code&gt; on this task.&lt;/li&gt;
&lt;li&gt;Different models, quantizations, and tasks may produce different results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What may generalize more broadly is the ranking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;On similar small open-weight models, the user message may continue to be the most effective placement, even if the size of the advantage changes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Despite those caveats, the central result is hard to ignore:&lt;/p&gt;

&lt;p&gt;For the 3B model, the same instruction produced dramatically different behavior depending solely on where it was placed.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next: Instruction Conflict (Part 2)
&lt;/h2&gt;

&lt;p&gt;This experiment measures placement strength in isolation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One instruction&lt;/li&gt;
&lt;li&gt;One slot&lt;/li&gt;
&lt;li&gt;No competing signals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The natural follow-up is instruction conflict.&lt;/p&gt;

&lt;p&gt;Imagine:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System prompt&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Append [DONE]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;User message&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Append [FINISHED]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tool description&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Append [COMPLETE]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then observe which marker appears in the final answer.&lt;/p&gt;

&lt;p&gt;This reveals the priority ordering of slots, not just whether they're read.&lt;/p&gt;

&lt;p&gt;Questions worth exploring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the system prompt win over the user message?&lt;/li&gt;
&lt;li&gt;Do frontier models follow a hierarchy?&lt;/li&gt;
&lt;li&gt;Does a small model notice the conflict at all?&lt;/li&gt;
&lt;li&gt;Does it simply follow whichever slot it was already attending to?&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Related Reading
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://medium.com/@rajkundalia/why-95-reviews-beats-20-reviews-even-when-both-score-95-21d21ea3cb92" rel="noopener noreferrer"&gt;Why 95 Reviews Beats 20 Reviews — Even When Both Score 95%&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The statistical foundation behind the Wilson confidence intervals used in this experiment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connect
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/rajkundalia/prompt-placement-anatomy" rel="noopener noreferrer"&gt;https://github.com/rajkundalia/prompt-placement-anatomy&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Medium: &lt;a href="https://medium.com/@rajkundalia/where-you-put-the-instruction-matters-more-than-what-it-says-2d5ffcdd9369" rel="noopener noreferrer"&gt;https://medium.com/@rajkundalia/where-you-put-the-instruction-matters-more-than-what-it-says-2d5ffcdd9369&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/rajkundalia/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/rajkundalia/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
    </item>
    <item>
      <title>Why 95 Reviews Beats 20 Reviews — Even When Both Score 95%</title>
      <dc:creator>Raj Kundalia</dc:creator>
      <pubDate>Sun, 07 Jun 2026 07:49:21 +0000</pubDate>
      <link>https://dev.to/rajkundalia/why-95-reviews-beats-20-reviews-even-when-both-score-95-2cp9</link>
      <guid>https://dev.to/rajkundalia/why-95-reviews-beats-20-reviews-even-when-both-score-95-2cp9</guid>
      <description>&lt;p&gt;Understanding Wilson Score, confidence intervals, and the mysterious 1.96.&lt;/p&gt;

&lt;p&gt;Originally published on Medium: &lt;a href="https://medium.com/@rajkundalia/why-95-reviews-beats-20-reviews-even-when-both-score-95-21d21ea3cb92" rel="noopener noreferrer"&gt;Why 95 Reviews Beats 20 Reviews — Even When Both Score 95%&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I was running a controlled experiment measuring how instruction placement in LLM prompts affects agent behavior. After collecting results across three placement variants, I wanted to know: is the difference I'm seeing real, or just noise from a small sample size?&lt;/p&gt;

&lt;p&gt;Link for the aforementioned experiment: WIP.&lt;/p&gt;

&lt;p&gt;While looking into ways to answer that question, I came across the Wilson Score interval. I saw an equation and a figure 1.96 and I could not grasp it immediately. I spent some time to figure things out and wrote a small piece on it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbhefsifcqg2b0d75yoob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbhefsifcqg2b0d75yoob.png" alt="Image1" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The good news: the idea behind Wilson Score is much simpler than the formula.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Imagine two restaurants:&lt;br&gt;
Restaurant A: 95 positive reviews out of 100&lt;br&gt;
Restaurant B: 19 positive reviews out of 20&lt;/p&gt;

&lt;p&gt;Both have a 95% positive rating. Should they rank equally?&lt;br&gt;
Most people would say no. We trust Restaurant A more because it has much more evidence behind its score.&lt;/p&gt;

&lt;p&gt;This is exactly the problem Wilson Score tries to solve.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Plain Percentages Fail
&lt;/h2&gt;

&lt;p&gt;A naive ranking system only looks at percentages:&lt;br&gt;
1 positive review out of 1 = 100%&lt;br&gt;
1000 positive reviews out of 1000 = 100%&lt;/p&gt;

&lt;p&gt;Clearly these are not equally trustworthy. A single review tells us almost nothing. A thousand reviews tell us a lot.&lt;/p&gt;

&lt;p&gt;Wilson Score rewards both quality and evidence.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Your Observed Rate?
&lt;/h2&gt;

&lt;p&gt;Before going further, there is one simple idea to establish.&lt;br&gt;
When you collect reviews, you end up with two numbers: how many were positive, and how many total. Divide one by the other and you get your observed rate - the percentage of positive reviews you actually saw.&lt;/p&gt;

&lt;p&gt;95 positive reviews out of 100 → observed rate = 95 ÷ 100 = 0.95 (or 95%)&lt;br&gt;
19 positive reviews out of 20 → observed rate = 19 ÷ 20 = 0.95 (or 95%)&lt;/p&gt;

&lt;p&gt;Both restaurants have the same observed rate. The difference is that one has much more evidence behind it.&lt;/p&gt;

&lt;p&gt;In the Wilson Score formula, this observed rate is written as p - just shorthand so the formula doesn't have to spell it out every time. But all it ever means is: the percentage you actually measured.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Idea: Your Observation Is Just One Possibility
&lt;/h2&gt;

&lt;p&gt;Here is the thing most explanations skip over.&lt;/p&gt;

&lt;p&gt;When you see 19 out of 20 positive reviews, you naturally say "that restaurant is 95% good." But what you actually observed is just one possible outcome from many.&lt;/p&gt;

&lt;p&gt;Imagine you could rewind time and collect reviews again. Maybe this time you'd get 17 out of 20. Or 18. Or 20 out of 20. All of those are realistic results from the same restaurant, just from a different lucky or unlucky sample. The fewer reviews you have, the more those outcomes can vary.&lt;br&gt;
So the honest question isn't "what did I observe?" It's "given what I observed, what is the range of real quality levels that could have produced this?"&lt;/p&gt;

&lt;p&gt;That range is called a confidence interval.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Confidence Interval Is Just Honesty About Uncertainty
&lt;/h2&gt;

&lt;p&gt;Instead of saying "it's exactly 95%", you say:&lt;br&gt;
"Based on the evidence we have, the real quality of this restaurant is unlikely to be exactly 95%. There is a range of realistic answers around it."&lt;/p&gt;

&lt;p&gt;That range reflects how uncertain you are based on how little evidence you have.&lt;/p&gt;

&lt;p&gt;And "95% confidence" simply means: if you ran this experiment 100 times, 95 of those intervals would contain the real answer. It's not about the rating itself - it's about how trustworthy your estimate is.&lt;/p&gt;




&lt;p&gt;Where Does 1.96 Come From?&lt;/p&gt;

&lt;p&gt;This was the part that confused me initially.&lt;/p&gt;

&lt;p&gt;Think of it as a dial that controls how wide your range is. The wider your range, the more confident you can be that the truth falls inside it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpdxale9erd9dielmf01.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpdxale9erd9dielmf01.jpg" alt="image2" width="550" height="564"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Multiplier   Confidence&lt;br&gt;
1.65            90% - narrower range, less sure&lt;br&gt;
1.96            95% - the standard choice&lt;br&gt;
2.58            99% - wider range, very sure&lt;/p&gt;

&lt;p&gt;Mathematicians worked out that if you move 1.96 standard deviations to the left and right of the center of a &lt;a href="https://en.wikipedia.org/wiki/Normal_distribution" rel="noopener noreferrer"&gt;bell curve&lt;/a&gt;, you capture roughly 95% of the area under that curve. That's why 1.96 became the standard multiplier for 95% confidence intervals.&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Different Meanings of 95%
&lt;/h2&gt;

&lt;p&gt;This distinction matters.&lt;/p&gt;

&lt;p&gt;When you say a restaurant's rating is 95%, you mean the observed percentage of positive reviews.&lt;/p&gt;

&lt;p&gt;When you say Wilson Score at 95% confidence, you mean you're using a confidence level that corresponds to 1.96 as your multiplier.&lt;/p&gt;

&lt;p&gt;These are completely different things:&lt;br&gt;
One is the observed rating.&lt;br&gt;
The other is how much you trust your estimate of that rating.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Wilson Score Really Asks
&lt;/h2&gt;

&lt;p&gt;Most people think Wilson Score is trying to calculate the true rating. It is not.&lt;/p&gt;

&lt;p&gt;Instead, it asks:&lt;br&gt;
Given the amount of evidence we have, what is a conservative lower estimate of the true rating?&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
95 positive reviews out of 100 → Wilson lower bound ≈ 88.8%&lt;br&gt;
19 positive reviews out of 20 → Wilson lower bound ≈ 76.4%&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fojnwq7aviawlyhvq49.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fojnwq7aviawlyhvq49.png" alt="Image3" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both have a 95% observed rating. But Wilson trusts the first one much more because it's backed by a larger sample.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wait, How Did 95% Become 88.8%?
&lt;/h2&gt;

&lt;p&gt;Wilson Score is intentionally conservative.&lt;/p&gt;

&lt;p&gt;The observed rating is still 95%. But because you have only a finite number of reviews, there's uncertainty around that number. Wilson subtracts an uncertainty penalty based on the sample size, the confidence level, and the observed rating.&lt;/p&gt;

&lt;p&gt;The result is a lower bound that says:&lt;br&gt;
Based on the evidence we have, we are reasonably confident the true rating is at least 88.8%.&lt;/p&gt;

&lt;p&gt;The smaller the sample size, the larger the penalty. That's why 19/20 gets pushed down to roughly 76.4%.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Ranking Systems Use the Lower Bound
&lt;/h2&gt;

&lt;p&gt;Wilson Score actually produces a full interval - a lower and upper bound. &lt;/p&gt;

&lt;p&gt;For 19 out of 20 reviews, that range is roughly 76% to 99%.&lt;br&gt;
For 95 positive reviews out of 100:&lt;br&gt;
Lower bound ≈ 88.8%&lt;br&gt;
Upper bound ≈ 97.8%&lt;/p&gt;

&lt;p&gt;In other words, the true positive rate is plausibly somewhere inside that range. Notice how this range is much narrower than the range we'd get from only 20 reviews. More evidence means less uncertainty.&lt;/p&gt;

&lt;p&gt;So why do ranking systems focus only on the lower bound?&lt;br&gt;
Because the lower bound answers the most useful question:&lt;br&gt;
What's the minimum quality I'm comfortable believing this item has?&lt;/p&gt;

&lt;p&gt;Using the upper bound would often favor items with very few reviews. A restaurant with 1 out of 1 positive reviews has an upper bound of nearly 100% - clearly misleading. The lower bound keeps that restaurant ranked conservatively until more evidence comes in.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Mental Model
&lt;/h2&gt;

&lt;p&gt;Forget the formula. Think of Wilson Score as:&lt;br&gt;
Observed Rating − Uncertainty Penalty&lt;/p&gt;

&lt;p&gt;The penalty becomes larger when:&lt;br&gt;
The sample size is small&lt;br&gt;
You want higher confidence&lt;br&gt;
There is less evidence available&lt;/p&gt;

&lt;p&gt;That's why a product with 95/100 reviews ranks above a product with 19/20 reviews, even though both show 95%.&lt;/p&gt;




&lt;p&gt;Final Thought&lt;br&gt;
The biggest insight is this: Wilson Score is not measuring quality. It is measuring quality adjusted for confidence.&lt;/p&gt;

&lt;p&gt;A high percentage with very little evidence is treated cautiously. A high percentage with lots of evidence is trusted.&lt;/p&gt;

&lt;p&gt;And that mysterious 1.96? It's simply the number that says: "Let's be 95% confident before we make claims." Nothing magical about it. Just a dial set to the most common standard.&lt;/p&gt;

&lt;p&gt;The more reviews you collect, the smaller your uncertainty penalty, and the closer your Wilson Score gets to your observed rating. Evidence earns trust. That's really all there is to it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Back to My Experiment
&lt;/h2&gt;

&lt;p&gt;Link to the page of my experiment: &lt;strong&gt;WIP&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In my case, I wasn't ranking restaurants. I was measuring whether placing an instruction in the system prompt vs the user prompt vs the tool description made a real difference in how often an LLM followed it.&lt;/p&gt;

&lt;p&gt;For each placement, I got a compliance rate - say, system prompt got 76% compliance across 50 runs, user prompt got 62%.&lt;/p&gt;

&lt;p&gt;The raw percentages tell me which placement looked better. But Wilson Score tells me something more useful:&lt;br&gt;
"Is the gap between 76% and 62% real - or could it just be luck from 50 runs?"&lt;/p&gt;

&lt;p&gt;Here is how to read the result:&lt;br&gt;
If the Wilson intervals of two placements do not overlap → the difference is real. One placement genuinely works better.&lt;br&gt;
If they do overlap → you cannot confidently say one is better. You need more runs.&lt;/p&gt;

&lt;p&gt;So in plain English, Wilson Score told me: "You ran 50 trials. System prompt got 76% compliance. The true compliance rate is somewhere between X% and Y% with 95% confidence. If that range does not overlap with the user prompt's range, system prompt is genuinely better - not just luckier."&lt;/p&gt;

&lt;p&gt;That is what I actually needed to know. Not a ranking. Not a score. Just: is this difference real?&lt;/p&gt;




&lt;p&gt;Further Reading&lt;br&gt;
If you want to go deeper - including the actual formula - the &lt;a href="https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Wilson_score_interval" rel="noopener noreferrer"&gt;Wikipedia article on the Wilson score interval&lt;/a&gt; is a good next step.&lt;/p&gt;

&lt;p&gt;To statisticians and experts in the field: Please comment if there is a mistake in my explanation.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>algorithms</category>
      <category>datascience</category>
      <category>llm</category>
    </item>
    <item>
      <title>Why Your Story Points Feel Arbitrary (And How to Fix It)</title>
      <dc:creator>Raj Kundalia</dc:creator>
      <pubDate>Tue, 12 May 2026 17:21:03 +0000</pubDate>
      <link>https://dev.to/rajkundalia/why-your-story-points-feel-arbitrary-and-how-to-fix-it-9b</link>
      <guid>https://dev.to/rajkundalia/why-your-story-points-feel-arbitrary-and-how-to-fix-it-9b</guid>
      <description>&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://medium.com/@rajkundalia/why-your-story-points-feel-arbitrary-and-how-to-fix-it-63cb7a9a51da" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I just feel it is "2", no, I think it is "3".&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When I first came across story points, I always wondered how the "experienced" people on the team were calling out numbers so confidently. Now that I've become one of those "experienced" people — I realized I still didn't have a stronger framework for it. Like everyone else, I'd gone with gut feeling. Others would nod along, and sometimes we'd walk out with no real sense of why. All I knew was it should follow the Fibonacci series. Everyone had different intuitions, and the loudest voice — or the consensus — won.&lt;/p&gt;

&lt;p&gt;So I tried to build a mental model for &lt;strong&gt;myself&lt;/strong&gt; &lt;em&gt;(not used by the team, yet)&lt;/em&gt;, so I'd have something to point at when I picked a number. This is what I landed on after trying it on a couple of story-pointing sessions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnwccuxdi5zi169zh9kn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnwccuxdi5zi169zh9kn.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The four dimensions
&lt;/h2&gt;

&lt;p&gt;Story points are supposed to capture how big the story would be, but the word "big" can pack in a lot of things, so I unpacked it into four things that I could actually measure:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complexity&lt;/strong&gt; — How hard is this to build? New tech, tricky logic, or big design decisions push this up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Effort&lt;/strong&gt; — How much work is there? A lot of small, easy changes across many files can still be Medium or High here, even if each change is trivial.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Uncertainty / Risk&lt;/strong&gt; — How clear is the requirement? Open questions, unfamiliar parts of the system, or things that might surprise me mid-way add risk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dependencies&lt;/strong&gt; — Does this depend on other teams or systems? This one is about &lt;em&gt;waiting&lt;/em&gt;, not work. And waiting still inflates the point value. Every time the work unblocks, I have to re-page the state back into my head — and a story that crosses sprint boundaries carries its own cognitive overhead and spillover risk. Some people would argue dependencies shouldn't affect the size at all — they're not work, just calendar drag — but for me the cognitive overhead is real enough that they belong in the rubric.&lt;/p&gt;

&lt;p&gt;For each, I rate Low, Medium, or High.&lt;/p&gt;




&lt;h2&gt;
  
  
  Rating guide
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Low&lt;/th&gt;
&lt;th&gt;Medium&lt;/th&gt;
&lt;th&gt;High&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Complexity&lt;/td&gt;
&lt;td&gt;Pattern we've done many times&lt;/td&gt;
&lt;td&gt;Some design decisions needed&lt;/td&gt;
&lt;td&gt;New tech, design-heavy, or novel problem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Effort&lt;/td&gt;
&lt;td&gt;Single small change&lt;/td&gt;
&lt;td&gt;Multiple changes, moderate scope&lt;/td&gt;
&lt;td&gt;Many files / modules / large scope&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Uncertainty / Risk&lt;/td&gt;
&lt;td&gt;Requirement fully clear&lt;/td&gt;
&lt;td&gt;Some open questions&lt;/td&gt;
&lt;td&gt;Significant unknowns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependencies&lt;/td&gt;
&lt;td&gt;None external&lt;/td&gt;
&lt;td&gt;One known, manageable&lt;/td&gt;
&lt;td&gt;Multiple, or blocked on another team&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  How I map ratings to points
&lt;/h2&gt;

&lt;p&gt;Roughly, here's where I thought it made sense. Your numbers will probably differ once you've used this a few times — and they should.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Ratings&lt;/th&gt;
&lt;th&gt;Points&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;All Low&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One Medium, rest Low&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multiple Medium&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One High&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multiple High&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mostly High&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Do not be mechanical about this, it works now for me, can change in future.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sanity check:&lt;/strong&gt; if a story lands at 8 or 13, ask whether it should be split before you size it. Stories with several high-rated dimensions are usually epics in disguise.&lt;/p&gt;




&lt;h2&gt;
  
  
  How I actually use it
&lt;/h2&gt;

&lt;p&gt;I rate each dimension in my head before I name a number. The dimensions are the work; the number is just the output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr8vrq8yff4vq6ksvpvrz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr8vrq8yff4vq6ksvpvrz.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The biggest thing I noticed isn't that I'm picking better numbers — it's that I can finally say &lt;em&gt;why&lt;/em&gt;. Before, "5" was a feeling. Now I can trace it back: complexity High, dependencies Medium, the rest Low. Even if I don't share the breakdown out loud, having it in my head means I'm offering an estimate instead of a guess. And when I disagree with the room, I have something specific to point at — "I think it's a 5 because the unknowns here are bigger than they look" — instead of defending a number on instinct.&lt;/p&gt;

&lt;p&gt;This is a starting point, not a strict formula. Override it when experience says otherwise. After a few sprints, look back at the stories that surprised you — were the surprises about complexity? Dependencies? Something else? Adjust the dimensions and the ratings to match what actually drives your estimation misses.&lt;/p&gt;




&lt;h2&gt;
  
  
  What this is not
&lt;/h2&gt;

&lt;p&gt;This doesn't convert to hours, and it shouldn't be used to measure individual productivity.&lt;/p&gt;

&lt;p&gt;The number isn't the point. The four-dimension conversation that produces it is.&lt;/p&gt;




&lt;p&gt;Thank you for reading, suggestions are welcome.&lt;/p&gt;

&lt;p&gt;Follow me on LinkedIn: &lt;a href="https://www.linkedin.com/in/rajkundalia/" rel="noopener noreferrer"&gt;Raj Kundalia&lt;/a&gt;&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>scrum</category>
      <category>agile</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How I Review PRs with AI — Without Losing My Own Judgment</title>
      <dc:creator>Raj Kundalia</dc:creator>
      <pubDate>Sun, 26 Apr 2026 14:05:36 +0000</pubDate>
      <link>https://dev.to/rajkundalia/how-i-review-prs-with-ai-without-losing-my-own-judgment-3kk2</link>
      <guid>https://dev.to/rajkundalia/how-i-review-prs-with-ai-without-losing-my-own-judgment-3kk2</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published on Medium:&lt;/em&gt;&lt;br&gt;
&lt;a href="https://medium.com/@rajkundalia/how-i-review-prs-with-ai-without-losing-my-own-judgment-f930ad30dc60" rel="noopener noreferrer"&gt;https://medium.com/@rajkundalia/how-i-review-prs-with-ai-without-losing-my-own-judgment-f930ad30dc60&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Over the last few months, my code review queue has changed completely. With agentic coding, PRs are larger, faster, and harder to reason about.&lt;/p&gt;

&lt;p&gt;I needed a system that was faster, but I absolutely did not want to just hand things off to an AI and call it a review.&lt;/p&gt;

&lt;p&gt;Built-in tools exist. Claude Code has &lt;code&gt;/review&lt;/code&gt; or &lt;code&gt;/deep-review&lt;/code&gt;, and GitHub Copilot's PR review is decent out of the box. If you just want an AI pass, they work fine. But I am not optimizing for just an AI pass; I am optimizing for understanding and architectural signal.&lt;/p&gt;

&lt;p&gt;Here is a repeatable framework I use to let AI handle the heavy scanning, while I keep the heavy thinking and judgment firmly in my own hands.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Note: All the prompts referenced below are open-source in my GitHub repo: 👉 &lt;a href="https://github.com/rajkundalia/ai-code-review-prompts" rel="noopener noreferrer"&gt;https://github.com/rajkundalia/ai-code-review-prompts&lt;/a&gt;. They are tool-agnostic — paste them into Claude, ChatGPT, Cursor, or whatever you prefer.)&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Golden Rule: Context Isolation
&lt;/h2&gt;

&lt;p&gt;Before we get into the phases, there is one non-negotiable rule that makes this entire system work: &lt;strong&gt;One AI session per PR.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you mix your own daily work, multiple PR reviews, and random questions into a single AI session, you lose context. PR reviews are context-heavy. When a colleague replies to your comment four days later, having a dedicated, preserved AI session helps you instantly remember your mental model and why you left that comment in the first place.&lt;/p&gt;

&lt;p&gt;Keep the thread alive from the start of the review through the merge.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3opj335psubnicrc601c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3opj335psubnicrc601c.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The 4-Phase PR Review Workflow
&lt;/h2&gt;

&lt;p&gt;When I load my initial prompt, it gives me a starting point: a high-level summary, the files touched, and the core intent of the PR. From there, I move through four distinct phases. &lt;strong&gt;Do not skip ahead.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Build Understanding (Human First)
&lt;/h3&gt;

&lt;p&gt;What happens next is entirely mine. I go file by file, line by line, and ask the AI questions until I have built my own understanding of the flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is this doing?&lt;/li&gt;
&lt;li&gt;Where is this data model used further downstream?&lt;/li&gt;
&lt;li&gt;What breaks if this assumption changes?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is deliberately manual. Anything I still do not understand after interrogating the AI, I flag for a human comment.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you skip this phase, you're not reviewing the code — you're reviewing the AI's opinion of the code.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Phase 2: AI First Pass (Filter the Noise)
&lt;/h3&gt;

&lt;p&gt;This is where the AI does its first real pass, flagging standard issues and inconsistencies. This is intentionally a surface pass.&lt;/p&gt;

&lt;p&gt;The reason this is a separate phase from the deep review is simple: I want the obvious stuff caught and out of the way early. It gives me a chance to dismiss irrelevant suggestions immediately, ensuring the next phase isn't cluttered with noise.&lt;/p&gt;

&lt;p&gt;👉 Think of this as signal extraction, not decision-making.&lt;/p&gt;




&lt;h3&gt;
  
  
  Phase 3: The Deep Review (Pressure Testing)
&lt;/h3&gt;

&lt;p&gt;This is the heaviest phase, driven by a few specific forcing functions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "Chief Programmer" &amp;amp; "Chief Architect" Persona&lt;/strong&gt;&lt;br&gt;
Giving the AI a specific role produces sharper, more critical output than a generic "review this code." You can adjust the role to fit your domain, e.g., chief AI engineer if you are reviewing prompt code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real Coverage vs. Theater&lt;/strong&gt;&lt;br&gt;
AI agents generate a massive amount of tests. Left unchecked, they will write tests for data models with no logic, or tests that just verify Python works. I explicitly prompt the AI to look for meaningful behavior validation so we catch the noise upfront. It is better than constantly asking the AI to remove redundant tests.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Tests should prove behavior, not existence.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Playing Devil’s Advocate&lt;/strong&gt;&lt;br&gt;
I force the LLM to question its own assumptions. What could go wrong? Where would this fail in production three months from now?&lt;/p&gt;

&lt;p&gt;This surfaces edge cases that standard reviews easily miss.&lt;/p&gt;




&lt;h3&gt;
  
  
  Phase 4: The Verdict
&lt;/h3&gt;

&lt;p&gt;Finally, I combine my Phase 1 understanding with the AI's deep review insights. The AI helps me classify the findings into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Must-fix blockers&lt;/li&gt;
&lt;li&gt;Good-to-have stylistic suggestions&lt;/li&gt;
&lt;li&gt;Noise to be discarded&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Author's Duty: Self-Review
&lt;/h2&gt;

&lt;p&gt;Before your code ever reaches another human, it is your responsibility to review it.&lt;/p&gt;

&lt;p&gt;I converted my PR review framework into a self-review prompt. I run through the exact same phases on my own code. The output here is highly surgical: it tells me the file, the line, what is wrong, and what to do instead.&lt;/p&gt;

&lt;p&gt;The goal is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The comments you eventually get from your peers should be about high-level design decisions — not trivial things you could have caught yourself.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You get serious brownie points for consistently raising high-quality, pre-vetted PRs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Scaling the Process
&lt;/h2&gt;

&lt;p&gt;Not every PR needs all four phases.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A 10-line config change → quick pass&lt;/li&gt;
&lt;li&gt;A 1,000-line refactor → full deep review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Match the depth of review to the risk and complexity.&lt;/p&gt;

&lt;p&gt;Over-reviewing small changes is wasteful.&lt;br&gt;
Under-reviewing large ones is dangerous.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;I am not offloading my thinking to an AI. I am using it to explore faster, validate assumptions, and stress-test decisions. The thinking is still mine.&lt;/p&gt;

&lt;p&gt;The leverage is new.&lt;br&gt;
The responsibility isn’t.&lt;/p&gt;

&lt;p&gt;These tools are incredibly powerful — but you still need to hold the leash.&lt;/p&gt;




&lt;p&gt;I’ve open-sourced the prompts and guidelines I use:&lt;br&gt;
👉 &lt;a href="https://github.com/rajkundalia/ai-code-review-prompts" rel="noopener noreferrer"&gt;https://github.com/rajkundalia/ai-code-review-prompts&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have better ideas, improvements, or ways to reduce noise — I’d genuinely like to see them.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codequality</category>
      <category>productivity</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Following a Database Read to the Metal — A Simple Walkthrough</title>
      <dc:creator>Raj Kundalia</dc:creator>
      <pubDate>Sat, 11 Apr 2026 11:01:40 +0000</pubDate>
      <link>https://dev.to/rajkundalia/following-a-database-read-to-the-metal-a-simple-walkthrough-2men</link>
      <guid>https://dev.to/rajkundalia/following-a-database-read-to-the-metal-a-simple-walkthrough-2men</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This is a cross-post from &lt;a href="https://medium.com/@rajkundalia/following-a-database-read-to-the-metal-a-simple-walkthrough-630a3eb97016" rel="noopener noreferrer"&gt;my Medium article&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I wanted to learn about the internals of database indexes. The first step was understanding how Disk I/O works — so I got Claude/Gemini to curate a reading list, which led me to &lt;strong&gt;Database Pages — A Deep Dive&lt;/strong&gt; by Hussein Nasser.&lt;/p&gt;

&lt;p&gt;There were things I hadn't understood, so I wrote this mellowed-down version for my own clarity. For complete understanding, do read the &lt;a href="https://medium.com/@hnasr/database-pages-a-deep-dive-38cdb2c79eb5" rel="noopener noreferrer"&gt;original post by Hussein Nasser&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here it goes.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Database Layer
&lt;/h2&gt;

&lt;p&gt;You run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;NAME&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;STUDENTS&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1008&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;DB parses the query → looks up &lt;code&gt;STUDENTS&lt;/code&gt; in &lt;code&gt;pg_class&lt;/code&gt; (an internal catalog, also stored on disk) → finds OID (Object Identifier) &lt;code&gt;24601&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;DB knows the file lives at &lt;code&gt;PGDATA/base/&amp;lt;db_oid&amp;gt;/24601&lt;/code&gt; on the filesystem&lt;/li&gt;
&lt;li&gt;DB asks the OS to open that file — the OS hands back a temporary integer called a &lt;strong&gt;file descriptor&lt;/strong&gt; (&lt;code&gt;fd&lt;/code&gt;), say &lt;code&gt;fd = 7&lt;/code&gt;. This is a short-lived handle, valid only for the session. The &lt;code&gt;fd&lt;/code&gt; is never stored on disk.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No index on &lt;code&gt;ID&lt;/code&gt;, so DB scans pages one by one. For each page it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checks its &lt;strong&gt;buffer pool&lt;/strong&gt; first — if the page is already in memory, no disk read needed&lt;/li&gt;
&lt;li&gt;If not found, issues a &lt;code&gt;read()&lt;/code&gt; to the OS for that page
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;read(fd, 0,    8192)  → page 0: bytes 0–8191
read(fd, 8192, 8192)  → page 1: bytes 8192–16383
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The OS → SSD journey below happens once per page. We trace it for page 0.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The exact syscall used by databases may differ — Postgres uses &lt;code&gt;pread()&lt;/code&gt; which takes an explicit offset. The intent here is to show what information is passed, not the exact function signature.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  2. File System / OS Layer
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;OS looks up the &lt;strong&gt;inode&lt;/strong&gt; of file &lt;code&gt;24601&lt;/code&gt; → finds block mapping&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;inode&lt;/strong&gt; (index node): a data structure the Linux filesystem maintains for every file on disk.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bytes 0–4095    → LBA 100
bytes 4096–8191 → LBA 101
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;OS checks its &lt;strong&gt;page cache&lt;/strong&gt; → blocks not found&lt;/li&gt;
&lt;li&gt;OS sends a read command to the NVMe driver with LBA 100 and 101&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;NVMe&lt;/strong&gt; (Non-Volatile Memory Express): a communication protocol designed specifically for SSDs.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  3. LBA — The Bridge Between OS and SSD
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;LBA (Logical Block Address)&lt;/strong&gt; is a sequential numbering system for blocks on a storage device.&lt;/p&gt;

&lt;p&gt;The OS doesn't know or care about physical locations on the SSD — it just says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Give me LBA 100 and 101."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The NVMe controller receives this and translates internally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LBA 100 → Physical page 99, offset 0x0001
LBA 101 → Physical page 99, offset 0x1002
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This translation is managed by the SSD's &lt;strong&gt;Flash Translation Layer (FTL)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The reason this layer exists: the SSD can move data around internally (for wear leveling, bad block management, etc.) without the OS ever knowing.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. SSD Layer
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;NVMe controller checks its &lt;strong&gt;DRAM cache&lt;/strong&gt; — page 99 not found&lt;/li&gt;
&lt;li&gt;Fetches the entire NAND page 99 (16KB) into DRAM cache&lt;/li&gt;
&lt;li&gt;Extracts just the requested 8KB (LBA 100 + 101) and returns it to the OS&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  5. Back Up the Stack
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SSD returns 8KB
      ↓
OS stores blocks 100, 101 in PAGE CACHE (RAM)
      ↓
OS returns 8KB to DB
      ↓
DB stores page 0 in BUFFER POOL (RAM)
      ↓
DB scans page 0 — rows 1–1000, row 1008 not found
      ↓
entire journey repeats for page 1
      ↓
DB stores page 1 in BUFFER POOL (RAM)
      ↓
DB scans page 1 — finds row 1008, returns to user ✓
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Layered Abstraction Summary
&lt;/h2&gt;

&lt;p&gt;Each layer only knows its own abstraction and talks to the layer directly below it.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Abstraction it uses&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Database&lt;/td&gt;
&lt;td&gt;File + offset (pages)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OS&lt;/td&gt;
&lt;td&gt;Inodes + LBAs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NVMe Controller&lt;/td&gt;
&lt;td&gt;LBA → physical page (via FTL)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NAND Flash&lt;/td&gt;
&lt;td&gt;Physical pages and cells&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;LBA is the common language between the OS and the SSD&lt;/strong&gt; — the key handoff point where the OS's logical world meets the SSD's physical world. And the FTL is what keeps the physical complexity invisible to everyone above it.&lt;/p&gt;




&lt;p&gt;*Originally published on &lt;a href="https://medium.com/@rajkundalia/following-a-database-read-to-the-metal-a-simple-walkthrough-630a3eb97016" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Find me on &lt;a href="https://www.linkedin.com/in/rajkundalia/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; · &lt;a href="https://medium.com/@rajkundalia" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>internals</category>
      <category>systems</category>
      <category>beginners</category>
    </item>
    <item>
      <title>How BAML Brings Engineering Discipline to LLM-Powered Systems</title>
      <dc:creator>Raj Kundalia</dc:creator>
      <pubDate>Sat, 21 Mar 2026 14:36:43 +0000</pubDate>
      <link>https://dev.to/rajkundalia/how-baml-brings-engineering-discipline-to-llm-powered-systems-3k18</link>
      <guid>https://dev.to/rajkundalia/how-baml-brings-engineering-discipline-to-llm-powered-systems-3k18</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;BAML is a domain-specific language and toolchain for defining LLM function interfaces with strict, recoverable output parsing - addressing the reliability gap that makes production LLM systems painful to build and maintain. It generates type-safe client code from schema definitions across Python, TypeScript, Go, Ruby, and several other languages, and uses a parsing approach called Schema Aligned Parsing that recovers structured data even from garbled or partial model responses. For a working reference implementation, see:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/rajkundalia/error-analyzer-with-baml" rel="noopener noreferrer"&gt;GitHub - rajkundalia/error-analyzer-with-baml: Analyze Java compilation and runtime errors using BAML with a local Ollama model.&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How I came to know about BAML
&lt;/h2&gt;

&lt;p&gt;I was wondering about if there is something that tries to handle output from an LLM and then suddenly, a talk by Vaibhav Gupta landed. I started exploring more; if you want to explore like how did and not read this post, you can try asking these questions to know it by yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is BAML?&lt;/li&gt;
&lt;li&gt;What is Pydantic? Does it relate to BAML? If yes, how does it relate to BAML?&lt;/li&gt;
&lt;li&gt;What is PydanticAI? How does it compare to BAML? Can I use PydanticAI just for what BAML does? Does PydanticAI retry to get right output from the model?&lt;/li&gt;
&lt;li&gt;How BAML handles a heavily hallucinated output?&lt;/li&gt;
&lt;li&gt;What is instructor? [&lt;a href="https://github.com/567-labs/instructor" rel="noopener noreferrer"&gt;https://github.com/567-labs/instructor&lt;/a&gt;]? Compare it with BAML. - Follow up for clarity: If one is using PydanticAI, there is no point in using Instructor?&lt;/li&gt;
&lt;li&gt;Where exactly does BAML fit into a standard RAG pipeline?&lt;/li&gt;
&lt;li&gt;How does BAML help in token efficiency?&lt;/li&gt;
&lt;li&gt;What is semantic streaming in BAML? What does problems does it solve? How does it help in Generative UI (add information about what Generative UI is in short)?&lt;/li&gt;
&lt;li&gt;What is BAML code generator?&lt;/li&gt;
&lt;li&gt;What is Schema Aligned Parsing? And what can it handle?&lt;/li&gt;
&lt;li&gt;What kind of testing is done or can be done in BAML?&lt;/li&gt;
&lt;li&gt;What is union in BAML?&lt;/li&gt;
&lt;li&gt;How does logging and tracing or observability work in BAML?&lt;/li&gt;
&lt;li&gt;How does BAML use Jinja templating to inject dynamic context, loops, and precise chat roles into prompts without messy string concatenation?&lt;/li&gt;
&lt;li&gt;What are dynamic types (or runtime schemas) in BAML?&lt;/li&gt;
&lt;li&gt;What aspects can BAML help in?&lt;/li&gt;
&lt;li&gt;Will BAML make sense with something like Claude Agent SDK?&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What BAML Is and the Problem It Solves
&lt;/h2&gt;

&lt;p&gt;Every engineer who has tried building an LLM-powered feature knows the first hour of optimism and the next two weeks of fire-fighting. The model returns JSON with an extra key, or wraps it in markdown fences, or truncates mid-response. The prompt worked fine in POC/Demo. Now there are three different parsing bugs during production grade implementation, all subtly different.&lt;/p&gt;

&lt;p&gt;BAML (or Basically a made-up language) - Boundary ML - exists to solve this class of problem at the right level of abstraction. It is a language-level contract between the application and the model. You define what you want the model to return, write the prompt logic in a dedicated templating layer, and BAML handles parsing, type-checking, retries, and client generation across Python, TypeScript, Go, Ruby, and other languages - with opt-in retry policies when you need them.&lt;/p&gt;

&lt;p&gt;The project positions itself as the Pydantic of LLM engineering - a statement about philosophy rather than API compatibility. Just as Pydantic introduced runtime type validation into Python codebases that previously relied on convention and hope, BAML introduces structural guarantees into LLM pipelines that previously relied on prompt tuning and defensive try/except blocks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F481fg94dkw63atdsuuij.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F481fg94dkw63atdsuuij.png" alt="gemini_generated" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How BAML Relates to Pydantic and Tools Like Instructor
&lt;/h2&gt;

&lt;p&gt;Pydantic itself does one thing exceptionally well: it validates Python data structures against declared schemas. Feed it a dictionary, and it tells you whether it conforms to the model definition. It does not know anything about language models, prompts, or API calls - it is a validation library, and a very good one.&lt;/p&gt;

&lt;p&gt;Instructor builds on top of Pydantic to handle the LLM layer. It takes a Pydantic model, wraps the OpenAI (or Anthropic, or other) API call, and uses function calling or JSON mode to coax the model into returning something the Pydantic validator can accept. When validation fails, Instructor can retry with the validation error message appended to the conversation, giving the model a chance to self-correct. This is practical, widely used, and works well for straightforward extraction tasks. What Instructor does not do is provide a dedicated authoring layer for prompts, generate client code from schema definitions, or go beyond retry logic when the model output is deeply malformed.&lt;/p&gt;

&lt;p&gt;PydanticAI goes further than Instructor. It is an agent framework - it handles tool registration, multi-step agent loops, dependency injection, and result validation as part of a unified system. Validation failures feed back into the agent's run loop through a reflection mechanism, giving the model a chance to self-correct - structurally similar to what Instructor does but integrated at the framework level rather than as a wrapper. Comparing PydanticAI and BAML feature-for-feature would miss the point.&lt;/p&gt;

&lt;p&gt;The more accurate comparison is about what layer each tool operates at. PydanticAI and BAML both handle structured output and retry behavior, but they do so with different default assumptions. PydanticAI is a Python framework - everything is Python, configured in Python, tested in Python. BAML is a language-level abstraction with its own syntax, its own code generator, and its own parsing engine that operates below what either Pydantic or the model's native JSON mode provides.&lt;/p&gt;

&lt;p&gt;If a team is already using PydanticAI and happy with it, BAML is not a necessary replacement. If the team is hitting parsing failures that retry loops do not reliably fix, or needs multi-language client generation, or wants prompt authoring with first-class tooling support, BAML addresses different parts of the problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  The BAML DSL and Code Generation
&lt;/h2&gt;

&lt;p&gt;BAML is its own language. Not a Python DSL, not a configuration file format - a purpose-built syntax for describing LLM function signatures, data schemas, and prompt templates in a single, unified file format. A &lt;code&gt;.baml&lt;/code&gt; file defines the inputs, the expected output structure, and the prompt template that connects them. The BAML compiler - written in Rust - reads those files and generates native client code in Python, TypeScript, Go, Ruby, and other languages. The Rust foundation is also what makes the SAP parsing engine fast enough to run inline on streaming responses without meaningful latency overhead - error correction applies in under 10ms, orders of magnitude cheaper than a retry API call. This is why BAML can credibly claim to be a language-level abstraction rather than a Python-centric library with thin wrappers for other runtimes.&lt;/p&gt;

&lt;p&gt;This matters for a reason that is easy to dismiss as aesthetic but is actually structural: when the schema and the prompt live in the same file, they cannot drift apart. In a typical setup, the Pydantic model is in one file, the prompt string is in another, and the parsing logic is somewhere else. When the prompt changes, the schema might not. When the schema changes, the prompt often does not. This is less about convenience and more about eliminating an entire class of bugs - schema drift between prompt, parser, and application code - that is difficult to catch in review and invisible until it surfaces in production. BAML makes these co-located and co-versioned by design.&lt;/p&gt;

&lt;p&gt;The generated client code behaves like a typed function call - call the function, pass the inputs, receive the validated return type. The underlying API call, parsing, and error handling are managed by the runtime. Retry behavior is available but opt-in, defined as an explicit policy in the &lt;code&gt;.baml&lt;/code&gt; file rather than applied automatically. There is no boilerplate to maintain per endpoint.&lt;/p&gt;




&lt;h2&gt;
  
  
  Schema Aligned Parsing - BAML's Core Reliability Mechanism
&lt;/h2&gt;

&lt;p&gt;Most structured output approaches rely on either JSON mode (asking the model to emit valid JSON) or function/tool calling (structured prompting that constrains the output format at the API level). Both of these approaches have the same failure mode: when the model output does not conform, parsing fails.&lt;/p&gt;

&lt;p&gt;Without BAML, that failure looks like: model returns slightly malformed JSON, the parser throws, the application retries, the model might produce the same output again, and the request either surfaces an error or silently falls back. With BAML, that same malformed output goes through SAP, which extracts the structured data the model clearly intended to produce, and returns a typed object to the application - no retry required.&lt;/p&gt;

&lt;p&gt;Schema Aligned Parsing - SAP - takes a different approach. Rather than requiring the model output to be valid JSON before interpretation begins, BAML's parser extracts structured data from whatever the model actually returns, using the declared schema as a guide for what to look for.&lt;/p&gt;

&lt;p&gt;Consider what SAP actually handles in practice. A model that wraps its JSON in a markdown code fence - common with instruction-tuned models - would break a strict JSON parser. SAP strips the fences. A model that emits trailing commas or unquoted string values - technically invalid JSON - would fail &lt;code&gt;JSON.parse&lt;/code&gt;. SAP corrects them. A reasoning model that outputs chain-of-thought text before the structured object would confuse most parsers. SAP identifies where the structured content begins and parses from there. An enum value returned in a different capitalisation or with surrounding punctuation gets normalised against the declared enum values in the schema.&lt;/p&gt;

&lt;p&gt;What SAP does not do is hallucinate missing data. If the model completely omits a required field and there is no recoverable signal in the output, BAML reports a parse failure. The mechanism is about recovery, not invention. The practical result is a substantial reduction in false-negative parse failures - cases where the model actually produced the right conceptual answer but in a form that strict JSON parsing would reject.&lt;/p&gt;

&lt;p&gt;This is the technical core of BAML's reliability claim, and it is a real engineering distinction from approaches that rely entirely on the model's ability to produce valid JSON every time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prompt Authoring with Jinja Templating
&lt;/h2&gt;

&lt;p&gt;BAML uses Jinja-style syntax for prompt construction - powered by Minijinja, a Rust-native template engine implementing the Jinja templating language - which brings a mature, well-understood templating model into a space where most alternatives are either string concatenation or ad-hoc formatting functions.&lt;/p&gt;

&lt;p&gt;The practical benefits are cleaner than they sound. Dynamic context injection - passing a list of documents, a user's history, or a set of retrieved chunks - is expressed as a loop in the template, not as string building in application code. Chat role separation (system prompt, user turn, assistant turn) is handled inline via role macros directly in the template - &lt;code&gt;_.role("system")&lt;/code&gt;, &lt;code&gt;_.role("user")&lt;/code&gt; - rather than being assembled through data structures outside the prompt. Conditional prompt logic, like including an extended set of instructions only when a particular flag is set, reads like a template rather than a maze of conditional string appends.&lt;/p&gt;

&lt;p&gt;The alternative - building prompts through f-strings or concatenation - works until it does not. When prompts reach several hundred tokens with dynamic sections, the only way to debug them is to log the final assembled string and manually reconstruct how it was built - which requires understanding the application code that generated it, not the prompt itself. In BAML, the prompt template is the source of truth and can be inspected, versioned, and tested directly. The Jinja layer also makes it straightforward to separate prompt structure from the data flowing into it, which helps when iterating on prompt content without touching application logic.&lt;/p&gt;




&lt;h2&gt;
  
  
  Unions and Dynamic Types
&lt;/h2&gt;

&lt;p&gt;BAML's type system supports union types - the ability to declare that a field or return value could be one of several distinct schemas. A model that might return either a &lt;code&gt;SearchResult&lt;/code&gt; or an &lt;code&gt;ErrorResponse&lt;/code&gt; depending on the query can express that distinction in the schema definition rather than through runtime inspection of the output.&lt;/p&gt;

&lt;p&gt;Dynamic types solve a related but different problem. Unions work when the possible schemas are known at compile time. When the schema itself depends on data that only exists at runtime - categories pulled from a database, fields defined by user configuration, or tenant-specific structures - BAML provides a &lt;code&gt;@@dynamic&lt;/code&gt; annotation on the type definition and a &lt;code&gt;TypeBuilder&lt;/code&gt; API in the generated client. At runtime, application code uses &lt;code&gt;TypeBuilder&lt;/code&gt; to add fields or enum variants before making the call, and the parser uses the extended schema to interpret the response.&lt;/p&gt;

&lt;p&gt;A concrete example that illustrates both: an extraction pipeline where the possible document types (invoice, contract, medical record) are fixed and known - that is a union, declared once in the &lt;code&gt;.baml&lt;/code&gt; file. If those document types and their fields are instead loaded from a database schema at request time, that is where &lt;code&gt;@@dynamic&lt;/code&gt; and &lt;code&gt;TypeBuilder&lt;/code&gt; come in. The distinction matters: unions are a schema design choice, dynamic types are a runtime extension mechanism.&lt;/p&gt;




&lt;h2&gt;
  
  
  Token Efficiency
&lt;/h2&gt;

&lt;p&gt;BAML's schema-aware prompting tends to produce shorter system instructions than equivalent prompt engineering done by hand. Because the output structure is declared in the schema and the runtime handles parsing flexibility, prompts do not need extensive instructions about output formatting, JSON validity, or field naming conventions. Those concerns are handled at the tooling layer. For high-volume applications where token costs are meaningful, this reduction in system prompt overhead accumulates.&lt;/p&gt;




&lt;h2&gt;
  
  
  Semantic Streaming and Generative UI
&lt;/h2&gt;

&lt;p&gt;LLM responses arrive token by token. In a chat interface, streaming the raw text is straightforward. In a structured output pipeline, streaming creates a problem: the output is not parse-able until it is complete, so the application has to buffer everything, parse at the end, and only then update the UI. This introduces latency from the user's perspective - the model is working, but nothing is happening on screen.&lt;/p&gt;

&lt;p&gt;BAML's semantic streaming solves this by parsing the output incrementally as tokens arrive. Because the parser knows the expected schema, it can identify which field is being populated as the stream progresses. Streaming attributes on schema fields give developers explicit control over atomicity - a field can be configured to surface only when fully complete, or to stream token-by-token as a partial value, depending on what makes sense for the UI.&lt;/p&gt;

&lt;p&gt;This enables a pattern often called Generative UI - rendering partial structured data into meaningful interface components as the model generates the response. An interface showing a list of extracted line items from a document does not need to wait for all line items to load simultaneously. Each item can appear as it is parsed. A dashboard that displays model-extracted analytics fields can populate each card progressively rather than flipping from empty to complete.&lt;/p&gt;

&lt;p&gt;The mechanism is not unique to any particular UI framework - it is a property of the streaming parser that the generated client exposes. Applications consuming the stream receive typed partial objects they can render directly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Testing in BAML
&lt;/h2&gt;

&lt;p&gt;BAML includes a testing layer that allows declaring test cases directly in &lt;code&gt;.baml&lt;/code&gt; files alongside the function definitions they test. A test case specifies the input and optionally assertions about specific field values or structural properties of the result, using &lt;code&gt;@@assert&lt;/code&gt; expressions evaluated against the actual model output.&lt;/p&gt;

&lt;p&gt;Tests run against live model APIs, either through the VSCode playground interactively or via &lt;code&gt;baml-cli test&lt;/code&gt; from the command line. The CLI runner makes it straightforward to integrate BAML tests into CI pipelines, running them selectively on merge or on a scheduled basis.&lt;/p&gt;

&lt;p&gt;The tooling also includes a playground - PromptFiddle - that surfaces prompt rendering, model output, and parse results interactively. This shortens the iteration loop on prompt changes considerably compared to editing, deploying, and inspecting logs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Observability - Logging and Tracing
&lt;/h2&gt;

&lt;p&gt;BAML provides structured trace data for every function call through a Collector API: the rendered prompt, the raw model response, the parsed output, timing, and token usage are all accessible by attaching a collector to a function call. This data can be pushed to Boundary Cloud for production dashboards and alerting, or routed to an external observability system.&lt;/p&gt;

&lt;p&gt;For teams already using LLM observability tools like Langfuse (I have not used this!) or similar OpenTelemetry-compatible platforms, BAML's trace events integrate through standard logging hooks. The key value is that traces include the pre-parsing and post-parsing representations side by side - which makes it possible to distinguish whether a failure is a model issue (the model produced conceptually wrong output) or a parsing boundary issue (the model produced the right answer in a form the parser could not handle). That distinction matters when deciding whether to adjust the prompt, the schema, or the model configuration.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where BAML Fits in a RAG Pipeline and with Agent Frameworks
&lt;/h2&gt;

&lt;p&gt;A typical RAG pipeline has several identifiable layers: retrieval (vector search, keyword search, or hybrid), context assembly (chunking, ranking, formatting), model invocation (the API call), and response handling (parsing, post-processing, returning to the caller).&lt;/p&gt;

&lt;p&gt;BAML operates at the model invocation and response handling layers. It does not replace a vector database, a retrieval library like LlamaIndex, or a reranking model. It does not manage document ingestion or embedding generation. BAML does not make retrieval better; it makes the interface between retrieval and generation reliable. What it replaces is the ad-hoc code that sits between the API call and the application: prompt construction, output parsing, retry logic, and client generation.&lt;/p&gt;

&lt;p&gt;In a RAG system, BAML would typically receive the assembled context - the retrieved chunks, formatted by the application layer - as input to a BAML function. The function template injects that context into the prompt, calls the model, and returns a typed result to the application. The retrieval and chunking infrastructure remains unchanged.&lt;/p&gt;

&lt;p&gt;For agent frameworks - the Claude Agent SDK, LangGraph, Autogen, or similar orchestration tools - BAML serves a similar role. Agent frameworks handle tool registration, loop control, state management, and multi-step planning. BAML-backed functions sit outside that loop as callable tools - the framework invokes them the same way it would any other tool, and BAML handles the structured output guarantees for that specific call. They are not alternatives; they operate at different layers. The combination is particularly useful when tools need to return strongly typed structured data that downstream steps in the agent depend on, rather than freeform text that the orchestrator has to interpret.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Do Next
&lt;/h2&gt;

&lt;p&gt;The BAML playground at &lt;a href="https://www.promptfiddle.com/" rel="noopener noreferrer"&gt;https://www.promptfiddle.com/&lt;/a&gt; runs entirely in the browser - no installation, no API key setup. It is a good place to experiment with the DSL syntax and see how SAP handles malformed model output before committing to local setup. A broader set of working examples covering extraction, classification, streaming, and agent integration is available at &lt;a href="https://baml-examples.vercel.app/" rel="noopener noreferrer"&gt;https://baml-examples.vercel.app/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The documentation at docs.boundaryml.com covers installation, the DSL reference, and integration guides for the major model providers. The thing worth evaluating specifically is SAP behavior under the failure cases that already exist in a current system - feed BAML the actual bad outputs that are currently causing parsing failures and observe how the recovery layer handles them. That test is more informative than any benchmark.&lt;/p&gt;

&lt;p&gt;As LLM systems move from prototype to infrastructure, the cost of unreliable parsing compounds. BAML represents a considered answer to where that reliability boundary should live - not in the model, not in retry loops, but in a deterministic layer between them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmpwzk1vm4zeyo50huhn7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmpwzk1vm4zeyo50huhn7.png" alt="notebook_lm_generated" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Sample Github Repository
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/rajkundalia/error-analyzer-with-baml" rel="noopener noreferrer"&gt;GitHub - rajkundalia/error-analyzer-with-baml: Analyze Java compilation and runtime errors using BAML with a local Ollama model.&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;p&gt;These are the resources and links that I used to know more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.boundaryml.com/home" rel="noopener noreferrer"&gt;https://docs.boundaryml.com/home&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.boundaryml.com/guide/comparisons/baml-vs-pydantic" rel="noopener noreferrer"&gt;https://docs.boundaryml.com/guide/comparisons/baml-vs-pydantic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/BoundaryML/baml" rel="noopener noreferrer"&gt;https://github.com/BoundaryML/baml&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/567-labs/instructor" rel="noopener noreferrer"&gt;https://github.com/567-labs/instructor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ai.pydantic.dev/#next-steps" rel="noopener noreferrer"&gt;https://ai.pydantic.dev/#next-steps&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.boundaryml.com/guide/introduction/what-is-baml#demo-video" rel="noopener noreferrer"&gt;https://docs.boundaryml.com/guide/introduction/what-is-baml#demo-video&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://thedataquarry.com/blog/baml-and-future-agentic-workflows/" rel="noopener noreferrer"&gt;https://thedataquarry.com/blog/baml-and-future-agentic-workflows/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://thedataquarry.com/blog/baml-is-building-blocks-for-ai-engineers/" rel="noopener noreferrer"&gt;https://thedataquarry.com/blog/baml-is-building-blocks-for-ai-engineers/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://youtu.be/leDdmneq2UA?si=1cjuko9ZMnbuWOmC" rel="noopener noreferrer"&gt;https://youtu.be/leDdmneq2UA?si=1cjuko9ZMnbuWOmC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://towardsai.net/p/machine-learning/the-prompting-language-every-ai-engineer-should-know-a-baml-deep-dive" rel="noopener noreferrer"&gt;https://towardsai.net/p/machine-learning/the-prompting-language-every-ai-engineer-should-know-a-baml-deep-dive&lt;/a&gt; - good deep dive&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gradientflow.com/seven-features-that-make-baml-ideal-for-ai-developers/" rel="noopener noreferrer"&gt;https://gradientflow.com/seven-features-that-make-baml-ideal-for-ai-developers/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://youtu.be/XDZ5i7hWgaI?si=0_8ZbalUbvyMpmYe" rel="noopener noreferrer"&gt;https://youtu.be/XDZ5i7hWgaI?si=0_8ZbalUbvyMpmYe&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=XwT7MhT_BEY" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=XwT7MhT_BEY&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sample projects that I found while exploring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/latlan1/baml-pdf-parsing" rel="noopener noreferrer"&gt;https://github.com/latlan1/baml-pdf-parsing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kargarisaac/Hekmatica" rel="noopener noreferrer"&gt;https://github.com/kargarisaac/Hekmatica&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kuzudb/baml-kuzu-demo" rel="noopener noreferrer"&gt;https://github.com/kuzudb/baml-kuzu-demo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Try out BAML:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.promptfiddle.com/" rel="noopener noreferrer"&gt;https://www.promptfiddle.com/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://baml-examples.vercel.app/" rel="noopener noreferrer"&gt;https://baml-examples.vercel.app/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
    </item>
    <item>
      <title>From println to Production Logging: Internals and Performance Across Languages and the OS</title>
      <dc:creator>Raj Kundalia</dc:creator>
      <pubDate>Sun, 22 Feb 2026 16:01:56 +0000</pubDate>
      <link>https://dev.to/rajkundalia/from-println-to-production-logging-internals-and-performance-across-languages-and-the-os-3fd1</link>
      <guid>https://dev.to/rajkundalia/from-println-to-production-logging-internals-and-performance-across-languages-and-the-os-3fd1</guid>
      <description>&lt;h2&gt;
  
  
  If you do not want to read the article, it is A-OK:
&lt;/h2&gt;

&lt;p&gt;I got interested in logging — and because now we have LLM at our fingertips for asking questions, I decided to form a question bank first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How are loggers implemented in different languages or in OS's?&lt;/li&gt;
&lt;li&gt;How efficient is logging in different OS?&lt;/li&gt;
&lt;li&gt;How much overhead does loggers bring?&lt;/li&gt;
&lt;li&gt;How are they efficiently implemented?&lt;/li&gt;
&lt;li&gt;How much of a difference is there between sys out vs. writing to a file vs. a logger vs. streaming logs in terms of efficiency and performance? Can we measure this? Compare similar methods for other languages.&lt;/li&gt;
&lt;li&gt;How does logger get information that it is coming from this file? What is the mechanism for this in different languages? — Very important question&lt;/li&gt;
&lt;li&gt;What part of logging filters is based on log level?&lt;/li&gt;
&lt;li&gt;The first thing the logger does is compare the message's level integer against its own threshold integer; if the message level is lower, it returns immediately and nothing else runs. Is this based on configuration?&lt;/li&gt;
&lt;li&gt;Which is the most efficient language to write loggers in that would still be usable in other languages — or does something like this not make sense?&lt;/li&gt;
&lt;li&gt;Why are markers used in logging? What does it solve that we cannot already solve without them? I know Java contains Markers, but do other languages contain them?&lt;/li&gt;
&lt;li&gt;When I provide a lower log level while writing loggers but keep a higher log level in the configuration, does it create a performance impact? (e.g., having many Debug and Trace loggers while the log level is kept at Info).&lt;/li&gt;
&lt;li&gt;In Java, are the placeholders in the loggers — such as Request was successful user={}, userId—concatenations, or is some other mechanism used for them?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you do not want to read the article, you can skip it and use this question bank to form your own understanding.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/rajkundalia/logger-internals-java" rel="noopener noreferrer"&gt;GitHub - rajkundalia/logger-internals-java: A Java logging library built from scratch - exploring async handlers, structured fields, granular caller info…&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;We all assume a disabled log call costs nothing. It doesn't — the level check is cheap, but any string you constructed before passing it to the logger is already gone, whether the log fires or not.&lt;/li&gt;
&lt;li&gt;Every time you see a class name and line number in a log output, something paid for that. In Java, when caller info is enabled, it's a runtime stack walk. In C and Rust, it was resolved at compile time and costs nothing at runtime. Most engineers have never had reason to think about the difference.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;logger.info("User {}", user)&lt;/code&gt; is not just cleaner syntax. It's a different evaluation model — the string is only built if the log actually fires. &lt;code&gt;"User " + user&lt;/code&gt; is evaluated before the logger even sees it.&lt;/li&gt;
&lt;li&gt;Async logging feels like a free upgrade. It isn't. It changes what you can trust about your logs when something crashes — and the logs you lose are exactly the ones you needed.&lt;/li&gt;
&lt;li&gt;In Rust and C/C++, a disabled log call can be removed from the binary entirely at compile time. In Java and Python, it always exists at runtime, even if it does nothing. The language made this choice.&lt;/li&gt;
&lt;li&gt;Go and C logging stacks sit closer to the OS than JVM-based logging stacks. There are fewer layers between the log call and the syscall. That distance has a cost, and it compounds under load.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks to LLMs I could create this: &lt;a href="https://github.com/rajkundalia/logger-internals-java" rel="noopener noreferrer"&gt;https://github.com/rajkundalia/logger-internals-java&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5g9p9z12fs0boz92f510.png" alt="Gemini-Generated" width="800" height="800"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Why Logging Is Not Just Printing
&lt;/h2&gt;

&lt;p&gt;Most of us haven't considered how much happens between our code calling &lt;code&gt;logger.info(...)&lt;/code&gt; and that string reaching disk: a level check, a formatter, a handler with its own buffering strategy, a lock or queue depending on sync versus async mode, a syscall into the kernel, and sometimes a second system — syslog, journald — that takes over from there. At scale, that pipeline has real cost. String formatting allocates. Synchronous file writes add latency to every thread that logs. A slow disk creates backpressure that stalls application threads. And in a distributed system where logs are your only audit trail, how that pipeline behaves during a crash is not an edge case — it is a design constraint you either chose or inherited without knowing it. None of that is obvious from a &lt;code&gt;println&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pipeline: What a Logger Actually Does
&lt;/h2&gt;

&lt;p&gt;Before pulling any of this apart, it helps to see the whole shape at once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Application
    ↓
Logger
    ↓
Level Filter
    ↓
Formatter
    ↓
Appender / Handler
    ↓
Operating System
    ↓
Disk / Stream
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The application emits a log event with a level, message, and arguments. The logger checks whether the configured threshold allows the event through. If it passes, the formatter constructs the final string — interpolating placeholders, appending timestamps, resolving caller location. The appender or handler takes that string and writes it somewhere: a file, stdout, a socket, a rolling buffer. That write becomes a system call, handing control to the OS, which manages buffering and flush behavior before data actually hits disk. Each stage has cost. Each stage is a place where things can go wrong or get optimized. The rest of this post is about what happens at each one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Log Level Filtering Internals
&lt;/h2&gt;

&lt;p&gt;Here's something that seems obvious until you think about it: a DEBUG log call in a hot loop, in a production service configured at INFO, runs on every single iteration. It doesn't log anything — but it doesn't disappear either.&lt;/p&gt;

&lt;p&gt;The level check itself is cheap. Each level maps to an integer, and the check is a comparison — INFO against whatever the event's level is, early return if it doesn't pass. No formatting, no allocation, no appender invocation. In Logback, higher integers map to higher severity — TRACE is 5000, ERROR is 40000. &lt;code&gt;java.util.logging&lt;/code&gt; follows the same direction but uses a different numeric scale and different level names: FINE is 500, SEVERE is 1000. The ordering is not inverted — the scales and names just don't align. Either way, the comparison is fast.&lt;/p&gt;

&lt;p&gt;What I found more interesting is where in the pipeline the check actually happens. I assumed there was one gate. There are often several. In Java's SLF4J backed by Logback, the logger checks first — that's the fast path. But appenders can have their own filter chains, meaning an event can clear the logger-level check and still be dropped downstream. This is deliberate and useful: you can send WARN and above to a file, ERROR and above to an alert sink, and everything to stdout, all from the same pipeline. But it means filtering is not a single decision — it's a sequence of decisions, each adding a small amount of overhead to events that reach it.&lt;/p&gt;

&lt;p&gt;The real cost isn't the check. It's everything you did before the call site. If you constructed a string before passing it to the logger, that work happened regardless of whether the log fires. Which is exactly why placeholder syntax exists, and why it's not just a style preference.&lt;/p&gt;




&lt;h2&gt;
  
  
  How a Logger Knows Where It Came From
&lt;/h2&gt;

&lt;p&gt;You've probably never thought about how a log line knows it came from &lt;code&gt;UserService.java:142&lt;/code&gt;. It just appears. What's actually happening underneath varies so much across languages that it's worth making explicit — because the cost difference is not small.&lt;/p&gt;

&lt;p&gt;In Java, two approaches exist. The older one constructs a &lt;code&gt;Throwable&lt;/code&gt; and extracts the stack trace — the JVM walks the call stack and allocates an array of frame objects. The newer approach, &lt;code&gt;StackWalker&lt;/code&gt; introduced in Java 9, is lazy and stream-based: you only materialize the frames you actually need. Both are runtime operations with real cost, which is why caller location logging is configurable in most Java frameworks and off by default in many Logback configurations. You can see how this plays out in the reference implementation at &lt;a href="https://github.com/rajkundalia/logger-internals-java" rel="noopener noreferrer"&gt;https://github.com/rajkundalia/logger-internals-java&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Python captures caller information as part of &lt;code&gt;LogRecord&lt;/code&gt; creation, inside &lt;code&gt;_log()&lt;/code&gt;, which is only reached after the level check passes. The depth of that inspection — whether stack info is captured, whether additional frame walking occurs — depends on configuration and what the formatter requests. The cost is not paid on every call, but it is paid at record creation time, not at formatting time.&lt;/p&gt;

&lt;p&gt;Go makes this explicit. &lt;code&gt;runtime.Caller(skip int)&lt;/code&gt; returns the file, line, and function name when you ask for it. It's a runtime operation, but controlled — you call it when you need it, rather than it being woven into every log record automatically.&lt;/p&gt;

&lt;p&gt;C and C++ sidestep runtime cost entirely. &lt;code&gt;__FILE__&lt;/code&gt; and &lt;code&gt;__LINE__&lt;/code&gt; are preprocessor macros, expanded at compile time. By the time the binary runs, those values are string literals and integers baked into the executable. No stack walking, no frame introspection, nothing.&lt;/p&gt;

&lt;p&gt;Rust takes the same approach through the log crate's macro system. &lt;code&gt;log::info!("...")&lt;/code&gt; expands at compile time to include the module path and line number as constants. The binary contains no machinery for discovering caller location — it was resolved before the program ran.&lt;/p&gt;

&lt;p&gt;The gap between compile-time resolution and runtime stack walking is the kind of thing that's invisible until you're logging at high volume. C/C++ and Rust pay nothing. Java pays on every logged event where caller info is enabled. Go pays when you ask. Most engineers pick a logging framework without knowing which of these models they've signed up for.&lt;/p&gt;




&lt;h2&gt;
  
  
  Placeholders vs String Concatenation
&lt;/h2&gt;

&lt;p&gt;These two lines look similar. They are not:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Eager: string is built before the logger is invoked&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Connected user: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;

&lt;span class="c1"&gt;// Lazy: string is only built if the level check passes&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Connected user: {}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the first version, the JVM evaluates &lt;code&gt;user.toString()&lt;/code&gt; and concatenates the string before the logger receives anything. If the level check drops the event — which it will, for any DEBUG or TRACE call in a production service configured at INFO — that allocation and work was wasted. At low log volumes this is invisible. Scattered through hot paths at high throughput, it accumulates.&lt;/p&gt;

&lt;p&gt;In the second version, &lt;code&gt;user&lt;/code&gt; is passed as an object reference. The logger receives the raw argument. Only if the event clears the level filter does the formatter resolve the placeholder and build the final string. &lt;code&gt;toString()&lt;/code&gt; is never called otherwise, and no intermediate string is allocated.&lt;/p&gt;

&lt;p&gt;This only matters because of how filtering works — specifically the early return discussed in the filtering section. The two design choices reinforce each other: a cheap level check creates the condition under which deferred string construction delivers its benefit. If logging were unconditional, the distinction wouldn't save anything.&lt;/p&gt;




&lt;h2&gt;
  
  
  OS Interaction: Where Language Logging Ends and the OS Begins
&lt;/h2&gt;

&lt;p&gt;There's a boundary in every logging pipeline that most application engineers have never had reason to think about: the point where your code hands a string to the OS and stops being in control of what happens next.&lt;/p&gt;

&lt;p&gt;When an appender writes to a file, it eventually calls &lt;code&gt;write()&lt;/code&gt; — a system call. Everything above that boundary is the language runtime: string formatting, in-memory buffering, lock acquisition. Everything below it is the kernel: its own buffers, filesystem cache, eventual persistence to disk. Crossing that boundary involves a context switch from user space to kernel space. It's not free, and it happens on every unbuffered write.&lt;/p&gt;

&lt;p&gt;This is why buffered I/O matters. Rather than one &lt;code&gt;write()&lt;/code&gt; per log line, most production logging configurations accumulate output in memory and flush periodically or when the buffer is full. Fewer syscalls, higher throughput. The trade-off: a crash can lose whatever is buffered and not yet flushed. You are always choosing between durability and throughput at that boundary, whether you know it or not.&lt;/p&gt;

&lt;p&gt;The OS also offers its own logging infrastructure — syslog on POSIX systems, journald on Linux. These are daemons that accept log messages via a socket and handle buffering, rotation, and persistence outside your application entirely. The boundary shifts: your application writes to a socket, and the daemon takes responsibility for the rest. Structured fields are first-class in journald. Log rotation is not your problem. The cost is IPC (Inter-process communication) overhead — a socket write instead of a local file write.&lt;/p&gt;

&lt;p&gt;Go and C-adjacent logging stacks sit naturally close to this boundary. Go's &lt;code&gt;os.File.Write&lt;/code&gt; is a thin wrapper over &lt;code&gt;write()&lt;/code&gt; with minimal overhead between your code and the syscall. JVM logging absolutely works at scale — but it involves more layers: GC-managed heap allocations, object creation for log events, the JVM's own I/O abstraction. Those layers add up under load.&lt;/p&gt;




&lt;h2&gt;
  
  
  Synchronous vs Asynchronous Logging
&lt;/h2&gt;

&lt;p&gt;At some point, most engineers configure async logging and move on. Throughput goes up, latency on application threads drops, and nothing seems worse. It feels like a free upgrade.&lt;/p&gt;

&lt;p&gt;Here's what actually changed: you no longer have a guarantee that a log line you wrote ever reached disk.&lt;/p&gt;

&lt;p&gt;Synchronous logging blocks the calling thread until the write completes. The appender acquires a lock, formats the string, calls &lt;code&gt;write()&lt;/code&gt;, releases the lock. Every log call has latency. Under high write volume to a slow disk, this becomes a bottleneck that shows up on every application thread that logs.&lt;/p&gt;

&lt;p&gt;Async logging breaks this coupling. Your thread drops an event into a queue and returns immediately. A dedicated logging thread drains the queue, formats events, and writes to the appender. Throughput increases because writes get batched. Thread latency drops to the cost of a queue insertion. This sounds like a strict improvement. It is not.&lt;/p&gt;

&lt;p&gt;The queue is bounded. Under sustained high load it fills up. At that point the framework has a decision to make: block the calling thread, drop the event, or expand the queue. Many async logging implementations are configured to drop lower-severity events under pressure unless explicitly set to block — Logback's &lt;code&gt;AsyncAppender&lt;/code&gt;, for instance, starts discarding TRACE, DEBUG, and INFO events when the queue reaches 80% capacity by default, while WARN and ERROR are retained. Which means under the conditions where your system is most stressed, in the moments just before something breaks, you may be losing the exact log lines that would have told you why.&lt;/p&gt;

&lt;p&gt;The crash case is worse. Events sitting in the queue when the application crashes never reach the appender. Your crash logs — the ones you needed most — may not exist.&lt;/p&gt;

&lt;p&gt;Async logging is worth using. It is the right choice in many high-throughput systems. But it is an architectural decision about what you are willing to lose and when. Using it without understanding the failure contract means you have made that trade without knowing it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Compile-Time vs Runtime Filtering
&lt;/h2&gt;

&lt;p&gt;Something I hadn't considered when I started this: in Java, Python, and Go, a disabled log call still exists in the binary. In Java and Python this is unambiguous — the level check runs on every call. Go's compiler is more aggressive about in-lining and dead code elimination, so the picture is less clear-cut and depends on the logging library and how it's implemented. But in none of these languages can the call be eliminated entirely at compile time the way it can in Rust or C/C++.&lt;/p&gt;

&lt;p&gt;Take a TRACE call inside a hot loop in a Java service configured at INFO. On every iteration, the JVM executes an integer comparison and branches. The call is suppressed, but it was visited. At high enough frequency, that cost appears.&lt;/p&gt;

&lt;p&gt;In Rust and C/C++, this can be eliminated entirely. A &lt;code&gt;trace!()&lt;/code&gt; macro in Rust, conditioned on a compile-time feature flag, is removed by the compiler if tracing is disabled at build time. The instruction does not exist in the binary. There is no branch, no comparison, no overhead of any kind. The code was removed before the program ran.&lt;/p&gt;

&lt;p&gt;The trade-off is operational flexibility. A Java application can change its log level at runtime — attach to a running JVM, set the Logback threshold to TRACE, watch debug output appear without a restart. A C binary compiled with TRACE disabled cannot do this. The capability is gone. You traded dynamic observability for zero runtime cost.&lt;/p&gt;

&lt;p&gt;Which is right depends on context. A long-running service that needs live level adjustment values the runtime flexibility. A systems program where every cycle matters may prefer compile-time elimination. Most languages make this choice implicitly, as part of how their logging ecosystem is designed. It is worth knowing which choice your language made for you.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cross-Language Comparison Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Caller Detection&lt;/th&gt;
&lt;th&gt;Filter Type&lt;/th&gt;
&lt;th&gt;Async Ecosystem&lt;/th&gt;
&lt;th&gt;Compile-time Elimination&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Java&lt;/td&gt;
&lt;td&gt;StackWalker / Throwable&lt;/td&gt;
&lt;td&gt;Runtime&lt;/td&gt;
&lt;td&gt;Logback AsyncAppender&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;runtime.Caller&lt;/td&gt;
&lt;td&gt;Runtime&lt;/td&gt;
&lt;td&gt;zap, zerolog (non-block)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;currentframe / LogRecord&lt;/td&gt;
&lt;td&gt;Runtime&lt;/td&gt;
&lt;td&gt;QueueHandler&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C/C++&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;FILE&lt;/strong&gt;, &lt;strong&gt;LINE&lt;/strong&gt; macros&lt;/td&gt;
&lt;td&gt;Runtime / Compile&lt;/td&gt;
&lt;td&gt;spdlog async mode&lt;/td&gt;
&lt;td&gt;Yes (preprocessor)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;td&gt;Compile-time macro expansion&lt;/td&gt;
&lt;td&gt;Runtime / Compile&lt;/td&gt;
&lt;td&gt;tracing crate&lt;/td&gt;
&lt;td&gt;Yes (feature flags)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Markers in Java/SLF4J — A Brief Callout
&lt;/h2&gt;

&lt;p&gt;Log levels give you one axis for filtering: severity. But severity alone can't answer a question like "show me all security-related events, regardless of level." That's what Markers solve. In SLF4J, a Marker is a named tag attached to a log event — SECURITY, AUDIT, BILLING — that appenders can filter on independently of level. You can route all AUDIT-marked events to a dedicated file while dropping untagged DEBUG events entirely. It's multi-dimensional filtering: level is one axis, marker is another. Other ecosystems approximate this — Go's zap uses structured fields, Python's logging has Filter objects that can inspect arbitrary LogRecord attributes — but SLF4J Markers are one of the cleaner formulations of the idea, and they're underused in codebases that reach for custom log levels when what they actually need is a second axis.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Surprised Me
&lt;/h2&gt;

&lt;p&gt;We all assume async logging was a performance upgrade with no real downside. It's a trade — lower latency on application threads in exchange for weaker guarantees about what survives a crash. That trade is often worth making. It's not invisible.&lt;/p&gt;

&lt;p&gt;I didn't expect caller detection to have such variance across languages. The gap between &lt;code&gt;__FILE__&lt;/code&gt; resolved at compile time and &lt;code&gt;StackWalker&lt;/code&gt; walking the call stack at runtime is not a footnote — it's an architectural difference that shows up under load, and most engineers pick a logging framework without knowing which model they've chosen.&lt;/p&gt;

&lt;p&gt;Filtering being a pipeline of gates, not a single check, was more nuanced than I expected. I assumed one threshold, one decision. In practice, logger-level filters and appender-level filters can conflict, and events can be dropped at multiple points for different reasons.&lt;/p&gt;

&lt;p&gt;The syscall boundary reframed how I think about logging performance. Everything above it is yours — allocations, formatting, buffering. Everything below it is the kernel's. Understanding where that boundary sits, and how often you cross it, makes the buffering trade-offs obvious in a way they weren't before.&lt;/p&gt;

&lt;p&gt;Compile-time log elimination felt genuinely strange when I first understood it. The log crate in Rust doesn't just suppress a call when a level is disabled — the code is removed entirely from the binary by the compiler. That's a fundamentally different model from anything Java or Python offer, and it matters in contexts where it matters.&lt;/p&gt;

&lt;p&gt;Markers are really interesting. The logs that are easiest to reason about in production are the ones where someone thought carefully about how to filter them — not just what level to assign, but what category they belong to. It's a small design decision that compounds over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftta7cskyt86plg0xm4nf.png" alt="Notebook-LM" width="800" height="446"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;p&gt;These are the rabbit holes that led here.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://stackoverflow.com/questions/26949503/how-exactly-is-the-logger-a-singleton-and-how-are-different-log-files-created-i" rel="noopener noreferrer"&gt;https://stackoverflow.com/questions/26949503/how-exactly-is-the-logger-a-singleton-and-how-are-different-log-files-created-i&lt;/a&gt; — The good old StackOverFlow had a question regarding this.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.oracle.com/javase/6/docs/technotes/guides/logging/overview.html" rel="noopener noreferrer"&gt;https://docs.oracle.com/javase/6/docs/technotes/guides/logging/overview.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.reddit.com/r/java/comments/rdv98z/have_you_ever_wondered_how_javas_logging/" rel="noopener noreferrer"&gt;https://www.reddit.com/r/java/comments/rdv98z/have_you_ever_wondered_how_javas_logging/&lt;/a&gt; — Down the memory lane.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.loggly.com/ultimate-guide/java-logging-basics/" rel="noopener noreferrer"&gt;https://www.loggly.com/ultimate-guide/java-logging-basics/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.marcobehler.com/guides/java-logging" rel="noopener noreferrer"&gt;https://www.marcobehler.com/guides/java-logging&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://signoz.io/guides/java-log/" rel="noopener noreferrer"&gt;https://signoz.io/guides/java-log/&lt;/a&gt; — table for log level is very good&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/pinojs/pino" rel="noopener noreferrer"&gt;https://github.com/pinojs/pino&lt;/a&gt; — JS Library for logging&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://davidagood.com/logging-in-java/" rel="noopener noreferrer"&gt;https://davidagood.com/logging-in-java/&lt;/a&gt; — Java's logging is crazy&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/TheTechGranth/thegranths/tree/master/src/main/java/SystemDesign/LoggingFramework" rel="noopener noreferrer"&gt;https://github.com/TheTechGranth/thegranths/tree/master/src/main/java/SystemDesign/LoggingFramework&lt;/a&gt; — a good basic logger&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=hOzH7ecc8vg&amp;amp;t=2s" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=hOzH7ecc8vg&amp;amp;t=2s&lt;/a&gt; — a good explanation for LLD for logger&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/live/QV4O9u1N_XU?si=lO4YYFxf-jOk5tTb" rel="noopener noreferrer"&gt;https://www.youtube.com/live/QV4O9u1N_XU?si=lO4YYFxf-jOk5tTb&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://algomaster.io/learn/system-design/logging" rel="noopener noreferrer"&gt;https://algomaster.io/learn/system-design/logging&lt;/a&gt; — logging best practices&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>logging</category>
      <category>java</category>
    </item>
    <item>
      <title>Distributed Tracing in Spring Boot: A Practical Guide to OpenTelemetry and Jaeger</title>
      <dc:creator>Raj Kundalia</dc:creator>
      <pubDate>Sat, 31 Jan 2026 18:23:48 +0000</pubDate>
      <link>https://dev.to/rajkundalia/distributed-tracing-in-spring-boot-a-practical-guide-to-opentelemetry-and-jaeger-30dn</link>
      <guid>https://dev.to/rajkundalia/distributed-tracing-in-spring-boot-a-practical-guide-to-opentelemetry-and-jaeger-30dn</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Distributed tracing helps you understand how requests flow through microservices by tracking every hop with minimal overhead. This guide covers OpenTelemetry integration in Spring Boot 4 using the native starter, explains core concepts like spans and context propagation, and demonstrates Jaeger-based tracing with best practices for production. Whether you're debugging latency issues or optimizing service dependencies, distributed tracing provides the visibility modern architectures demand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Repository:&lt;/strong&gt; &lt;a href="https://github.com/rajkundalia/learning-distributed-tracing" rel="noopener noreferrer"&gt;learning-distributed-tracing&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgiyd98se0uq495t6vvvi.png" alt="Image1" width="800" height="446"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  The Problem: Debugging in the Dark
&lt;/h2&gt;

&lt;p&gt;In a monolithic application, debugging a slow request is straightforward. Add some logging, attach a profiler, and you can see exactly where time is spent. But microservices change everything. A single user request might touch ten or more services, each with its own logs. Failures often happen between services, not inside them. When something breaks or slows down, where do you even start?&lt;/p&gt;

&lt;p&gt;Traditional logging falls short here. Sure, you can correlate logs by request ID, but manually piecing together the journey across services, databases, and queues is tedious and error-prone. You need something that automatically tracks the entire execution path, measures timing at each step, and shows you the complete picture. That's distributed tracing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Observability: Metrics, Logs, and Traces
&lt;/h2&gt;

&lt;p&gt;Modern observability rests on three pillars. &lt;strong&gt;Metrics&lt;/strong&gt; are numerical measurements like CPU usage or request count—great for alerting but lacking context for debugging. &lt;strong&gt;Logs&lt;/strong&gt; are discrete events that tell you what happened at a specific moment but struggle with correlation across distributed systems. &lt;strong&gt;Traces&lt;/strong&gt; capture the complete journey of a request through your system, showing execution flow and timing.&lt;/p&gt;

&lt;p&gt;These pillars complement each other. Metrics tell you there's a problem, logs provide event details, and traces show you the execution path. Together, they form a complete observability strategy.&lt;/p&gt;

&lt;p&gt;It's worth distinguishing observability from monitoring. &lt;strong&gt;Monitoring&lt;/strong&gt; answers "Is the system healthy?" through dashboards and alerts. &lt;strong&gt;Observability&lt;/strong&gt; answers "Why is the system behaving this way?" by designing systems to answer questions you didn't anticipate. Distributed tracing is a core enabler of observability, not a replacement for monitoring.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fundamentals of Distributed Tracing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Telemetry&lt;/strong&gt; refers to automated data collection from remote sources—your application constantly reporting its health and activity. &lt;strong&gt;Spans&lt;/strong&gt; are the building blocks of traces, representing units of work with start time, duration, and metadata. When Service A calls Service B, both create spans that form a parent-child relationship showing the call hierarchy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traces&lt;/strong&gt; are collections of spans representing a single transaction. A trace ID ties all related spans together across service boundaries. &lt;strong&gt;Context Propagation&lt;/strong&gt; maintains trace continuity—when Service A calls Service B, it passes the trace context in HTTP headers, allowing Service B to create child spans under the same trace.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenTelemetry: The Industry Standard
&lt;/h2&gt;

&lt;p&gt;Before OpenTelemetry, every observability vendor had proprietary SDKs and formats. If you wanted to switch from Jaeger to Zipkin, you'd re-instrument your entire codebase. This vendor lock-in meant architectural decisions became permanent commitments.&lt;/p&gt;

&lt;p&gt;OpenTelemetry is a vendor-neutral framework providing APIs, SDKs, and tools for telemetry data. Formed by merging OpenTracing and OpenCensus, it provides a single instrumentation API that works with any backend. The value proposition is simple: instrument once, send data anywhere.&lt;/p&gt;

&lt;p&gt;The architecture includes the &lt;strong&gt;API and SDK&lt;/strong&gt; for creating telemetry, &lt;strong&gt;Auto-instrumentation&lt;/strong&gt; for frameworks like Spring and JDBC, and the &lt;strong&gt;Collector&lt;/strong&gt;—an optional but recommended component that receives, processes, and exports telemetry.&lt;/p&gt;

&lt;p&gt;While this article focuses on distributed tracing, it's worth noting that OpenTelemetry standardizes all three pillars of observability—metrics, logs, and traces. The same SDK and protocol handle all three, giving you a unified approach to instrumentation across your entire observability stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OTLP (OpenTelemetry Protocol)&lt;/strong&gt; is the wire format for transmitting telemetry data. Supporting both gRPC and HTTP transports, OTLP defines how traces, metrics, and logs are serialized and sent to collectors or backends. The protocol handles backpressure, retries, and batching for reliable delivery. Most modern observability tools now support OTLP natively, making it the de facto standard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flue2hi3fad8jsa6m9kfa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flue2hi3fad8jsa6m9kfa.png" alt="Image2" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Spring Boot 4 and OpenTelemetry Integration
&lt;/h2&gt;

&lt;p&gt;Spring Boot 4 brings first-class support for OpenTelemetry through the &lt;code&gt;spring-boot-starter-opentelemetry&lt;/code&gt; dependency. This starter provides automatic configuration and instrumentation for common scenarios like HTTP requests, database calls, and messaging.&lt;/p&gt;

&lt;p&gt;Previous versions of Spring Boot required manual setup using the OpenTelemetry Java agent or custom configuration. Spring Boot 2 and 3 users could leverage the Java agent for bytecode instrumentation, which worked but added operational complexity. The agent approach meant deploying a JAR alongside your application and configuring it via environment variables or system properties.&lt;/p&gt;

&lt;p&gt;With Spring Boot 4, the starter eliminates much of this complexity. Add the dependency, configure a few properties, and you're done. Under the hood, it uses Spring's auto-configuration to set up the OpenTelemetry SDK, register instrumentation libraries, and configure exporters based on your application properties.&lt;/p&gt;

&lt;p&gt;The starter automatically instruments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTP requests and responses via Spring MVC and WebFlux&lt;/li&gt;
&lt;li&gt;RestTemplate, RestClient, and WebClient calls&lt;/li&gt;
&lt;li&gt;JDBC database operations&lt;/li&gt;
&lt;li&gt;Logs (automatically includes trace and span IDs)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For additional instrumentation like Kafka messaging, you can use the &lt;code&gt;@WithSpan&lt;/code&gt; annotation for manual instrumentation, or use the OpenTelemetry Java Agent which provides automatic instrumentation for 150+ libraries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spring Boot Actuator's Role&lt;/strong&gt;: While Actuator isn't required for tracing, it plays a complementary role in Spring Boot 4's observability story. Actuator's &lt;code&gt;ObservationRegistry&lt;/code&gt; is what actually observes requests and framework operations. The OpenTelemetry starter bridges these observations into OTel-compliant traces. Think of Actuator as operational introspection (health, metrics) and OpenTelemetry as behavioral introspection (request flows).&lt;/p&gt;

&lt;p&gt;You can still use the Java agent if you need instrumentation for libraries outside Spring's ecosystem, but for typical Spring Boot applications, the starter is sufficient and more maintainable. Framework-level instrumentation gives you baseline visibility automatically, while custom spans should be added only where domain insight is needed. This balance is critical—over-instrumentation creates noise, while under-instrumentation hides intent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Jaeger: Your Trace Backend
&lt;/h2&gt;

&lt;p&gt;Jaeger is an open-source distributed tracing platform originally developed by Uber, providing storage, querying, and visualization for traces. While OpenTelemetry handles generation and collection, Jaeger handles the backend.&lt;/p&gt;

&lt;p&gt;Jaeger's architecture includes agents, collectors, a query service, and a web UI. For development, the all-in-one Docker image combines all components. A common misconception is that Jaeger requires Kubernetes—it doesn't. Jaeger runs on Docker, VMs, or bare metal. The all-in-one image works for local development, while production typically uses separate components with external storage like Cassandra or Elasticsearch.&lt;/p&gt;

&lt;p&gt;Jaeger supports multiple ingestion formats, including OTLP. With OpenTelemetry's standardization, OTLP is now recommended, meaning your Spring Boot application sends traces in OTLP format directly to Jaeger without needing Jaeger-specific libraries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tracing Beyond Services: Databases and Message Queues
&lt;/h2&gt;

&lt;p&gt;One of the most powerful aspects of distributed tracing is visibility into external dependencies. When your application makes a database call or publishes to Kafka, those operations appear as spans in your trace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database tracing&lt;/strong&gt; works through JDBC instrumentation. When your Spring Boot application executes a SQL query, the OpenTelemetry instrumentation automatically creates a span containing the query, execution time, and database connection details. This visibility is crucial for identifying slow queries or N+1 problems—those situations where you're executing one query to fetch entities, then N additional queries to fetch related data for each entity. Database spans make these anti-patterns immediately visible in your trace timeline. However, be mindful of sensitive data. Database spans can include SQL statements with parameter values, which might contain PII. OpenTelemetry provides span processors to redact or mask sensitive information before export.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Message queue tracing&lt;/strong&gt; extends traces across asynchronous boundaries. When Service A publishes a message to Kafka, it injects the trace context into message headers. When Service B consumes that message, it extracts the context and continues the trace. This creates a parent-child relationship between the producer and consumer spans, even though they execute at different times. The result is end-to-end visibility into asynchronous workflows, making it much easier to debug message processing issues or track down where data transformations went wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance Impact and Production Considerations
&lt;/h2&gt;

&lt;p&gt;Distributed tracing adds overhead from creating spans, serializing data, and network transmission. The impact varies by component:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CPU&lt;/strong&gt;: Span creation and serialization typically add microseconds per operation. The OpenTelemetry SDK uses efficient batching to minimize per-span overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory&lt;/strong&gt;: The SDK buffers spans before export. Configure batch size and timeout based on traffic patterns and memory constraints to prevent excessive buffering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network IO&lt;/strong&gt;: Sending traces to a local collector over localhost has minimal impact. Remote backends introduce latency and bandwidth usage. Using a collector to batch and compress traces reduces network overhead significantly. Importantly, the collector absorbs most of the performance cost, acting as a buffer between your applications and backends.&lt;/p&gt;

&lt;p&gt;In practice, overhead is typically under 5 percent for CPU and memory. The key is intelligent sampling—trace 1-5 percent of traffic in production rather than every request (development should trace 100 percent for debugging). OpenTelemetry supports probability-based sampling for production and rate-limiting to cap traces per second.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practices for Distributed Tracing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use meaningful span names&lt;/strong&gt;: "validatePaymentRequest" beats "process" every time. Good naming makes traces self-documenting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add relevant attributes&lt;/strong&gt;: Follow OpenTelemetry semantic conventions for HTTP, databases, and queues. Add custom attributes for business context like user ID or tenant ID.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't over-instrument&lt;/strong&gt;: Creating spans for every method produces noise. Focus on external calls, database queries, and significant business logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implement proper error handling&lt;/strong&gt;: Mark spans as failed and record exception details when errors occur. This helps identify which service and operation caused failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample intelligently&lt;/strong&gt;: Trace everything in development (probability 1.0), but use 1-5 percent sampling in production (probability 0.01-0.05). This gives you statistically significant insights without overloading infrastructure. Consider adaptive sampling that increases rates for slow requests or errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Watch for orphaned spans&lt;/strong&gt;: When requests hand off work to async thread pools, ensure context propagation is maintained. If a new thread loses the trace context, your trace will break, resulting in disconnected "orphaned spans" that can't be correlated. Spring Boot 4 usually handles this automatically, but verify your custom executors are properly instrumented.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the Collector&lt;/strong&gt;: It provides buffering, enrichment, routing, and reliability that SDK exporters alone cannot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitor your telemetry pipeline&lt;/strong&gt;: Track export success rates and latency. If your pipeline breaks, you're debugging blind.&lt;/p&gt;

&lt;h2&gt;
  
  
  Querying and Analyzing Traces
&lt;/h2&gt;

&lt;p&gt;Jaeger's UI provides powerful analysis tools. Search for traces by service, operation, tags, duration, and time range. The trace timeline shows the complete request flow with parent-child relationships visually nested. For advanced use cases, Jaeger Query Language (JQL) enables programmatic querying and integration with automated alerting systems. The trace comparison feature helps identify performance regressions by highlighting timing differences between trace versions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Distributed tracing transforms how you understand and debug microservices. By automatically capturing request flows and timing information, it eliminates the guesswork from performance analysis and incident response. OpenTelemetry provides the standardized instrumentation, OTLP handles reliable transmission, and backends like Jaeger give you the visualization and querying tools to make sense of the data.&lt;/p&gt;

&lt;p&gt;Spring Boot 4's native OpenTelemetry support makes adoption straightforward. Add the starter, configure your exporter, and you're tracing HTTP requests, database queries, and message queues with minimal code. The result is a system where every request tells its own story, complete with timing, dependencies, and errors.&lt;/p&gt;

&lt;p&gt;Start small. Enable tracing in one service, verify the data reaches Jaeger, and gradually expand to your entire application. The visibility you gain will pay dividends the first time you debug a cross-service issue or optimize a slow endpoint. Distributed tracing isn't just a monitoring tool; it's a fundamental shift in how you understand distributed systems.&lt;/p&gt;

&lt;p&gt;For hands-on examples and complete configuration, check out the &lt;a href="https://github.com/rajkundalia/learning-distributed-tracing" rel="noopener noreferrer"&gt;learning-distributed-tracing&lt;/a&gt; repository.&lt;/p&gt;

&lt;h2&gt;
  
  
  Learning Links:
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://spring.io/blog/2025/11/18/opentelemetry-with-spring-boot" rel="noopener noreferrer"&gt;https://spring.io/blog/2025/11/18/opentelemetry-with-spring-boot&lt;/a&gt;&lt;br&gt;
&lt;a href="https://opentelemetry.io/docs/zero-code/java/spring-boot-starter/" rel="noopener noreferrer"&gt;https://opentelemetry.io/docs/zero-code/java/spring-boot-starter/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://foojay.io/today/spring-boot-4-opentelemetry-explained/" rel="noopener noreferrer"&gt;https://foojay.io/today/spring-boot-4-opentelemetry-explained/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://last9.io/blog/opentelemetry-for-spring/" rel="noopener noreferrer"&gt;https://last9.io/blog/opentelemetry-for-spring/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://signoz.io/blog/opentelemetry-spring-boot/" rel="noopener noreferrer"&gt;https://signoz.io/blog/opentelemetry-spring-boot/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://vorozco.com/blog/2024/2024-11-18-A-practical-guide-spring-boot-open-telemetry.html" rel="noopener noreferrer"&gt;https://vorozco.com/blog/2024/2024-11-18-A-practical-guide-spring-boot-open-telemetry.html&lt;/a&gt;&lt;br&gt;
&lt;a href="https://medium.com/cloud-native-daily/how-to-send-traces-from-spring-boot-to-jaeger-229c19f544db" rel="noopener noreferrer"&gt;https://medium.com/cloud-native-daily/how-to-send-traces-from-spring-boot-to-jaeger-229c19f544db&lt;/a&gt;&lt;br&gt;
&lt;a href="https://medium.com/xebia-engineering/jaeger-integration-with-spring-boot-application-3c6ec4a96a6f" rel="noopener noreferrer"&gt;https://medium.com/xebia-engineering/jaeger-integration-with-spring-boot-application-3c6ec4a96a6f&lt;/a&gt;&lt;br&gt;
&lt;a href="https://blog.vinsguru.com/distributed-tracing-in-microservices-with-jaeger/" rel="noopener noreferrer"&gt;https://blog.vinsguru.com/distributed-tracing-in-microservices-with-jaeger/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://last9.io/blog/distributed-tracing-with-spring-boot/" rel="noopener noreferrer"&gt;https://last9.io/blog/distributed-tracing-with-spring-boot/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://signoz.io/blog/jaeger-vs-zipkin/" rel="noopener noreferrer"&gt;https://signoz.io/blog/jaeger-vs-zipkin/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>java</category>
      <category>microservices</category>
      <category>monitoring</category>
      <category>springboot</category>
    </item>
    <item>
      <title>LangChain vs LangGraph vs LangSmith: Understanding the Ecosystem</title>
      <dc:creator>Raj Kundalia</dc:creator>
      <pubDate>Sat, 17 Jan 2026 13:29:07 +0000</pubDate>
      <link>https://dev.to/rajkundalia/langchain-vs-langgraph-vs-langsmith-understanding-the-ecosystem-3m5o</link>
      <guid>https://dev.to/rajkundalia/langchain-vs-langgraph-vs-langsmith-understanding-the-ecosystem-3m5o</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Building LLM apps isn’t just about prompts anymore.&lt;br&gt;
It’s about &lt;strong&gt;composition&lt;/strong&gt;, &lt;strong&gt;orchestration&lt;/strong&gt;, and &lt;strong&gt;observability&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangChain&lt;/strong&gt; provides the foundational building blocks for creating LLM applications through modular components and a unified interface for working with different AI providers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph&lt;/strong&gt; extends this foundation with &lt;strong&gt;stateful, graph-based orchestration&lt;/strong&gt; for complex multi-agent workflows requiring loops, branching, and persistent state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangSmith&lt;/strong&gt; completes the picture by offering &lt;strong&gt;observability, tracing, and evaluation&lt;/strong&gt; tools for debugging and monitoring LLM applications in production.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangChain&lt;/strong&gt; for straightforward chains and RAG systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph&lt;/strong&gt; when you need sophisticated state management and agent coordination&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangSmith&lt;/strong&gt; throughout development and production for visibility into behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Hands-on GitHub Repositories
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangChain RAG Project&lt;/strong&gt; → &lt;a href="https://github.com/rajkundalia/langchain-rag-project" rel="noopener noreferrer"&gt;https://github.com/rajkundalia/langchain-rag-project&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph Analyzer&lt;/strong&gt; → &lt;a href="https://github.com/rajkundalia/langgraph-analyzer" rel="noopener noreferrer"&gt;https://github.com/rajkundalia/langgraph-analyzer&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangSmith Learning&lt;/strong&gt; → &lt;a href="https://github.com/rajkundalia/langsmith-learning" rel="noopener noreferrer"&gt;https://github.com/rajkundalia/langsmith-learning&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The landscape of LLM application development has evolved rapidly since 2022.&lt;/p&gt;

&lt;p&gt;What began as simple prompt–response interactions has grown into &lt;strong&gt;multi-step workflows&lt;/strong&gt; involving retrieval systems, tool usage, autonomous agents, and long-running processes. This evolution introduced &lt;strong&gt;new problems at each stage&lt;/strong&gt; of the development lifecycle.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The composition problem&lt;/strong&gt; → How do you connect prompts, models, tools, and data?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The orchestration problem&lt;/strong&gt; → How do you manage branching, retries, loops, and shared state?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The observability problem&lt;/strong&gt; → How do you debug, evaluate, and monitor these systems?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The LangChain ecosystem emerged to address each layer:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Year&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Composition&lt;/td&gt;
&lt;td&gt;LangChain&lt;/td&gt;
&lt;td&gt;2022&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orchestration&lt;/td&gt;
&lt;td&gt;LangGraph&lt;/td&gt;
&lt;td&gt;2024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;LangSmith&lt;/td&gt;
&lt;td&gt;2023–2024&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each tool targets a &lt;strong&gt;specific layer&lt;/strong&gt; in the LLM application stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  LangChain: The Foundation
&lt;/h2&gt;

&lt;p&gt;LangChain is the &lt;strong&gt;core framework&lt;/strong&gt; for building LLM-powered applications.&lt;/p&gt;

&lt;p&gt;Its primary goal is abstraction: different LLM providers expose different APIs, capabilities, and quirks. LangChain hides these differences behind a &lt;strong&gt;unified interface&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Building Blocks
&lt;/h3&gt;

&lt;p&gt;LangChain is composed of modular, swappable components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompts&lt;/strong&gt; – Templates and structured inputs for models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models&lt;/strong&gt; – OpenAI, Anthropic, Google, or local LLMs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt; – Conversation history and contextual state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt; – Function calls to external systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrievers&lt;/strong&gt; – Vector databases and RAG pipelines&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  LCEL: LangChain Expression Language
&lt;/h3&gt;

&lt;p&gt;What ties everything together is &lt;strong&gt;LCEL&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;LCEL introduces a &lt;strong&gt;declarative, pipe-based syntax&lt;/strong&gt; for composing chains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompt | model | output_parser
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of writing imperative glue code, you describe &lt;strong&gt;data flow&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why LCEL Matters
&lt;/h3&gt;

&lt;p&gt;LCEL enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic async, streaming, and batch execution&lt;/li&gt;
&lt;li&gt;Built-in LangSmith tracing&lt;/li&gt;
&lt;li&gt;Parallel execution of independent steps&lt;/li&gt;
&lt;li&gt;A unified &lt;code&gt;Runnable&lt;/code&gt; interface (&lt;code&gt;invoke&lt;/code&gt;, &lt;code&gt;batch&lt;/code&gt;, &lt;code&gt;stream&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes chains &lt;strong&gt;faster&lt;/strong&gt;, &lt;strong&gt;cleaner&lt;/strong&gt;, and easier to reason about.&lt;/p&gt;




&lt;h3&gt;
  
  
  Multi-Provider Support
&lt;/h3&gt;

&lt;p&gt;LangChain supports dozens of LLM providers and integrations.&lt;/p&gt;

&lt;p&gt;You can switch providers by changing &lt;strong&gt;one line of configuration&lt;/strong&gt;, enabling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vendor independence&lt;/li&gt;
&lt;li&gt;A/B testing across models&lt;/li&gt;
&lt;li&gt;Cost and latency optimization&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  When LangChain Is Enough
&lt;/h3&gt;

&lt;p&gt;Use LangChain when your workflow is primarily:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input → Process → Output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Typical use cases include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chatbots with memory&lt;/li&gt;
&lt;li&gt;RAG-based Q&amp;amp;A systems&lt;/li&gt;
&lt;li&gt;Natural language → SQL generation&lt;/li&gt;
&lt;li&gt;Linear tool pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your application doesn’t need complex branching or shared long-lived state, &lt;strong&gt;LangChain is the right tool&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7a2mw7h25kemz76pn20y.png" alt="LangChain Component Flow" width="800" height="656"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  LangGraph: Stateful Agent Orchestration
&lt;/h2&gt;

&lt;p&gt;LangGraph solves the &lt;strong&gt;orchestration problem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;As soon as your application needs to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;make decisions,&lt;/li&gt;
&lt;li&gt;loop,&lt;/li&gt;
&lt;li&gt;retry,&lt;/li&gt;
&lt;li&gt;or coordinate multiple agents, linear chains start to break down.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Graph-Based Architecture
&lt;/h3&gt;

&lt;p&gt;LangGraph models your application as a &lt;strong&gt;directed graph&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nodes&lt;/strong&gt; → processing steps or agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edges&lt;/strong&gt; → execution flow between nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This enables patterns that are hard or impossible with chains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loops and retries&lt;/li&gt;
&lt;li&gt;Conditional branching&lt;/li&gt;
&lt;li&gt;Parallel execution&lt;/li&gt;
&lt;li&gt;Shared, persistent state&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  State as a First-Class Concept
&lt;/h3&gt;

&lt;p&gt;Every LangGraph workflow operates on a &lt;strong&gt;shared state object&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nodes receive the current state&lt;/li&gt;
&lt;li&gt;They compute updates&lt;/li&gt;
&lt;li&gt;Updates are merged back into state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows multiple agents to collaborate naturally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Research agent gathers sources&lt;/li&gt;
&lt;li&gt;Fact-checking agent validates claims&lt;/li&gt;
&lt;li&gt;Synthesis agent produces the final answer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All without complex message passing.&lt;/p&gt;




&lt;h3&gt;
  
  
  Conditional Routing
&lt;/h3&gt;

&lt;p&gt;LangGraph supports &lt;strong&gt;conditional edges&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A function decides which node runs next based on runtime state:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Route customer queries to specialist agents&lt;/li&gt;
&lt;li&gt;Loop back when required information is missing&lt;/li&gt;
&lt;li&gt;Retry until success conditions are met&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Persistence &amp;amp; Checkpointing
&lt;/h3&gt;

&lt;p&gt;LangGraph includes built-in &lt;strong&gt;checkpointing&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Persist state across restarts&lt;/li&gt;
&lt;li&gt;Resume long-running workflows&lt;/li&gt;
&lt;li&gt;Support human-in-the-loop pauses&lt;/li&gt;
&lt;li&gt;Enable time-travel debugging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is critical for production-grade agent systems.&lt;/p&gt;




&lt;h3&gt;
  
  
  Visualization Support
&lt;/h3&gt;

&lt;p&gt;LangGraph workflows are inspectable and exportable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mermaid diagrams for documentation&lt;/li&gt;
&lt;li&gt;PNG images for presentations&lt;/li&gt;
&lt;li&gt;ASCII graphs for terminal debugging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes complex agent systems &lt;strong&gt;understandable and communicable&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  When You Need LangGraph
&lt;/h3&gt;

&lt;p&gt;Choose LangGraph when you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explicit shared state&lt;/li&gt;
&lt;li&gt;Runtime decision-making&lt;/li&gt;
&lt;li&gt;Retry and failure recovery&lt;/li&gt;
&lt;li&gt;Multi-agent coordination&lt;/li&gt;
&lt;li&gt;Long-running workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A classic example is an &lt;strong&gt;autonomous research agent&lt;/strong&gt; that iteratively searches, reads, verifies, and synthesizes information.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6w9w6egyyj8r4c3ricl.png" alt="LangGraph State Machine Example" width="757" height="737"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  LangSmith: The Observability Layer
&lt;/h2&gt;

&lt;p&gt;LangSmith answers the question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What is my LLM application actually doing?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It doesn’t build workflows — it &lt;strong&gt;illuminates them&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Tracing Everything
&lt;/h3&gt;

&lt;p&gt;LangSmith captures full execution traces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompts and responses&lt;/li&gt;
&lt;li&gt;Token usage and latency&lt;/li&gt;
&lt;li&gt;Component call stacks&lt;/li&gt;
&lt;li&gt;Errors and retries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can drill down from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a full workflow run
→ to a single LLM call.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes debugging &lt;em&gt;dramatically&lt;/em&gt; easier.&lt;/p&gt;




&lt;h3&gt;
  
  
  Evaluation &amp;amp; Regression Testing
&lt;/h3&gt;

&lt;p&gt;LangSmith allows you to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create evaluation datasets&lt;/li&gt;
&lt;li&gt;Run structured tests&lt;/li&gt;
&lt;li&gt;Track quality metrics&lt;/li&gt;
&lt;li&gt;Compare prompts and models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This enables &lt;strong&gt;regression testing&lt;/strong&gt; for LLM apps — a must-have for production systems.&lt;/p&gt;




&lt;h3&gt;
  
  
  Production Monitoring
&lt;/h3&gt;

&lt;p&gt;In production, LangSmith tracks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Response times&lt;/li&gt;
&lt;li&gt;Error rates&lt;/li&gt;
&lt;li&gt;Token and cost trends&lt;/li&gt;
&lt;li&gt;Usage by workflow or user&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Alerts help you catch issues early and optimize costs.&lt;/p&gt;




&lt;h3&gt;
  
  
  Framework-Agnostic
&lt;/h3&gt;

&lt;p&gt;While LangSmith integrates seamlessly with LangChain and LangGraph, it’s &lt;strong&gt;not limited to them&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You can instrument &lt;em&gt;any&lt;/em&gt; LLM application with LangSmith.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdx2l37b7n29anhmsrrhh.png" alt="LangSmith Diagram" width="800" height="341"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Quick Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Solves&lt;/th&gt;
&lt;th&gt;Use When&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LangChain&lt;/td&gt;
&lt;td&gt;Composition&lt;/td&gt;
&lt;td&gt;Linear workflows, RAG, simple agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangGraph&lt;/td&gt;
&lt;td&gt;Orchestration&lt;/td&gt;
&lt;td&gt;Branching, loops, shared state, multi-agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangSmith&lt;/td&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Debugging, evaluation, production monitoring&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyia3pj7oyxnurc75suxn.png" alt="Decision Tree: Which Tool to Use?" width="688" height="703"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  The Broader Ecosystem
&lt;/h2&gt;

&lt;h3&gt;
  
  
  LangFlow
&lt;/h3&gt;

&lt;p&gt;LangFlow provides a &lt;strong&gt;visual, drag-and-drop&lt;/strong&gt; interface for building LangChain workflows.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Great for prototyping&lt;/li&gt;
&lt;li&gt;Helpful for non-technical collaboration&lt;/li&gt;
&lt;li&gt;Often exported to code for production&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Model Context Protocol (MCP)
&lt;/h3&gt;

&lt;p&gt;MCP (by Anthropic) standardizes &lt;strong&gt;tool and resource access&lt;/strong&gt; for LLMs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Works at the tool/retriever layer&lt;/li&gt;
&lt;li&gt;Complements LangChain and LangGraph&lt;/li&gt;
&lt;li&gt;Reduces custom integration effort&lt;/li&gt;
&lt;li&gt;Framework-agnostic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MCP does &lt;strong&gt;not&lt;/strong&gt; replace orchestration tools — it enhances connectivity.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The LangChain ecosystem is &lt;strong&gt;layered, not competitive&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangChain&lt;/strong&gt; builds the core logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph&lt;/strong&gt; manages complex workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangSmith&lt;/strong&gt; makes everything observable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most serious LLM applications will use &lt;strong&gt;more than one&lt;/strong&gt; of these tools.&lt;/p&gt;

&lt;p&gt;Start simple, add complexity only when needed, and &lt;strong&gt;never ship without observability&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading &amp;amp; Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.datacamp.com/tutorial/langchain-vs-langgraph-vs-langsmith-vs-langflow" rel="noopener noreferrer"&gt;https://www.datacamp.com/tutorial/langchain-vs-langgraph-vs-langsmith-vs-langflow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.datacamp.com/tutorial/langgraph-tutorial" rel="noopener noreferrer"&gt;https://www.datacamp.com/tutorial/langgraph-tutorial&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.datacamp.com/tutorial/langgraph-agents" rel="noopener noreferrer"&gt;https://www.datacamp.com/tutorial/langgraph-agents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.techvoot.com/blog/langchain-vs-langgraph-vs-langflow-vs-langsmith-2025" rel="noopener noreferrer"&gt;https://www.techvoot.com/blog/langchain-vs-langgraph-vs-langflow-vs-langsmith-2025&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Video&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=vJOGC8QJZJQ" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=vJOGC8QJZJQ&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Academy Finxter Series (Excellent Deep Dive)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://academy.finxter.com/langchain-langsmith-and-langgraph/" rel="noopener noreferrer"&gt;https://academy.finxter.com/langchain-langsmith-and-langgraph/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://academy.finxter.com/langsmith-and-writing-tools/" rel="noopener noreferrer"&gt;https://academy.finxter.com/langsmith-and-writing-tools/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://academy.finxter.com/langgraph/" rel="noopener noreferrer"&gt;https://academy.finxter.com/langgraph/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://academy.finxter.com/multi-agent-teams-preparation/" rel="noopener noreferrer"&gt;https://academy.finxter.com/multi-agent-teams-preparation/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://academy.finxter.com/setting-up-our-multi-agent-team/" rel="noopener noreferrer"&gt;https://academy.finxter.com/setting-up-our-multi-agent-team/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://academy.finxter.com/web-research-and-asynchronous-tools/" rel="noopener noreferrer"&gt;https://academy.finxter.com/web-research-and-asynchronous-tools/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>langchain</category>
      <category>langsmith</category>
      <category>langgraph</category>
    </item>
    <item>
      <title>Understanding Model Context Protocol (MCP): Beyond the Hype</title>
      <dc:creator>Raj Kundalia</dc:creator>
      <pubDate>Mon, 08 Dec 2025 17:02:38 +0000</pubDate>
      <link>https://dev.to/rajkundalia/understanding-model-context-protocol-mcp-beyond-the-hype-3g8a</link>
      <guid>https://dev.to/rajkundalia/understanding-model-context-protocol-mcp-beyond-the-hype-3g8a</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;As always&lt;/strong&gt;, I have created code repositories which will be easier to understand; also, &lt;strong&gt;resources much better than what I have here are added at the bottom&lt;/strong&gt;:&lt;br&gt;
MCP Book Library: &lt;em&gt;&lt;a href="https://github.com/rajkundalia/mcp-book-library" rel="noopener noreferrer"&gt;https://github.com/rajkundalia/mcp-book-library&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
MCP Toolbox: &lt;em&gt;&lt;a href="https://github.com/rajkundalia/mcp-toolbox" rel="noopener noreferrer"&gt;https://github.com/rajkundalia/mcp-toolbox&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As software engineers, we were and are witnessing a fragmentation problem in the AI ecosystem. Every major model provider (Anthropic, OpenAI, Google) and every tool (Linear, GitHub, Slack) has its own proprietary integration pattern. If you want Claude to talk to your PostgreSQL database, you write a specific integration, and if you switch to GPT-5, you rewrite it.&lt;/p&gt;

&lt;p&gt;This “m × n” integration problem — where m models need to connect to n tools — is creating an exponential explosion of custom code. It is one of the primary bottlenecks preventing LLMs from becoming true agents.&lt;/p&gt;

&lt;p&gt;Enter the Model Context Protocol (MCP).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhqjboqgnjweltr20yhe8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhqjboqgnjweltr20yhe8.png" alt="MCP-image" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is MCP?
&lt;/h2&gt;

&lt;p&gt;The Model Context Protocol is an open standard that defines how AI models interact with data and tools. Think of it as a “USB-C port” for AI applications.&lt;/p&gt;

&lt;p&gt;In short, MCP removes the need for bespoke integrations between every tool and every AI model. Instead of building a specific connector for every data source to every AI model, MCP provides a universal protocol.&lt;/p&gt;

&lt;p&gt;If a tool is “MCP compliant,” any MCP client (like Claude Desktop, Cursor, or Zed) can instantly connect to it without custom glue code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why MCP?
&lt;/h2&gt;

&lt;p&gt;The value proposition is decoupling.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;For tool builders:&lt;/strong&gt; You build one MCP server for your API. It now works with Claude, Cursor, and any future MCP-compliant application.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For AI app developers:&lt;/strong&gt; You build your host application once and gain access to the entire ecosystem of MCP servers (Google Drive, Slack, PostgreSQL, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For end users:&lt;/strong&gt; You can switch between AI providers without losing access to your tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This solves the m × n problem by reducing it to m + n. The math alone makes the case compelling.&lt;/p&gt;




&lt;h2&gt;
  
  
  How MCP Works Architecturally
&lt;/h2&gt;

&lt;p&gt;The architecture relies on a triangle of roles. The “Client” is often hidden inside the application you are using.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MCP Hosts:&lt;/strong&gt; The user-facing application (e.g., Claude Desktop, Zed, or a custom dashboard). The Host orchestrates the flow, manages the UI, and contains the LLM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Clients:&lt;/strong&gt; The bridge (often a library) embedded within the Host. It maintains the connection with the Server, negotiates capabilities, and routes requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Servers:&lt;/strong&gt; Where your custom logic lives. A server wraps a capability (Postgres, file system, REST API) and exposes it via standardized primitives.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpazu60u2tehs3t90gvee.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpazu60u2tehs3t90gvee.png" alt="image2" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Core MCP Primitives
&lt;/h2&gt;

&lt;p&gt;When you write an MCP server, you are generally exposing one of these three capabilities.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Resources:&lt;/strong&gt; Passive data. The client asks to “read” a URI (for example, &lt;code&gt;postgres://logs/latest&lt;/code&gt;). These are analogous to file reads—informational only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools:&lt;/strong&gt; Executable functions, allowing the LLM to take action (for example, &lt;code&gt;execute_sql_query&lt;/code&gt;, &lt;code&gt;send_slack_message&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompts:&lt;/strong&gt; Reusable context. A server can define a template (for example, “Analyze Error Logs”) that the host loads to jumpstart a conversation.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Capability Discovery and Schemas
&lt;/h2&gt;

&lt;p&gt;A critical part of the protocol is discovery. When a client connects, it asks the server, “What can you do?” and the server responds with a list of tools and resources, including JSON Schemas for arguments.&lt;/p&gt;

&lt;p&gt;This is how the LLM knows exactly which parameters (for example, &lt;code&gt;isbn: string&lt;/code&gt;) are required to call a tool, enforcing type safety at the model level.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why JSON-RPC 2.0?
&lt;/h2&gt;

&lt;p&gt;MCP uses JSON-RPC 2.0 for its wire protocol, and this choice maps naturally to the problem space.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bidirectional:&lt;/strong&gt; JSON‑RPC supports both requests and notifications from either side over a single logical session, which maps cleanly onto long‑lived transports like stdio or streaming HTTP.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session-based:&lt;/strong&gt; MCP sessions are often long-lived. JSON-RPC handles this persistent state naturally without the overhead of stateless HTTP headers for every interaction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transport agnostic:&lt;/strong&gt; The message shape remains identical whether piped over local stdio (for local dev) or SSE/WebSockets (for remote deployment).&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Example: A Full MCP Flow
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;User&lt;/strong&gt;: “Check the library database for book availability for ISBN 12345.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Host (LLM):&lt;/strong&gt; Recognizes the intent and asks the client to find a relevant tool.&lt;br&gt;
&lt;strong&gt;Client:&lt;/strong&gt; Identifies &lt;code&gt;check_availability&lt;/code&gt; via discovery and sends a JSON-RPC request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tools/call"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"check_availability"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"isbn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"12345"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Server:&lt;/strong&gt; Receives the request, runs the query, and returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Available: 5 copies"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Host:&lt;/strong&gt; Feeds this back into the LLM context window.&lt;br&gt;
&lt;strong&gt;LLM:&lt;/strong&gt; Responds: “Good news! There are 5 copies available.”&lt;/p&gt;




&lt;h2&gt;
  
  
  Advanced Mechanisms: Sampling and Roots
&lt;/h2&gt;

&lt;p&gt;MCP extends beyond simple API calls with features that enable sophisticated interaction.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sampling:&lt;/strong&gt; Enables the server to delegate complex tasks back to the host. During the execution of a tool, the server can effectively say, “Hey LLM, I need your brain for a second,” and request the host to generate text or analyze code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Roots:&lt;/strong&gt; A security boundary mechanism. A server can declare boundaries (for example, “I only have access to &lt;code&gt;/var/www/project&lt;/code&gt;”), preventing access to files or resources outside a specific scope.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Real-Time Updates and Transports
&lt;/h2&gt;

&lt;p&gt;Unlike standard APIs where the client must poll for changes, MCP supports server-initiated notifications.&lt;/p&gt;

&lt;p&gt;Once a session is established, a server can send streaming responses and JSON‑RPC notifications without additional polling. For example, a filesystem server can notify the host immediately when a watched file changes, or a long-running build process can stream log lines as they appear.&lt;/p&gt;

&lt;p&gt;This is supported across the main standard transports.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;stdio:&lt;/strong&gt; For local processes (ideal for desktop apps like Cursor).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSE (Server-Sent Events):&lt;/strong&gt; For remote servers sending updates to clients.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom transports:&lt;/strong&gt; The protocol is extensible to additional carriers like WebSockets; draft proposals already explore this on top of the existing HTTP/streaming model.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Is MCP a Silver Bullet?
&lt;/h2&gt;

&lt;p&gt;MCP solves the integration problem, but it is not a magic fix for every scenario.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it when you:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Need interactive AI–tool integrations
&lt;/li&gt;
&lt;li&gt;Expect multiple AI models to use the same tools
&lt;/li&gt;
&lt;li&gt;Have tooling that evolves frequently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Avoid it when you:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have a simple one-off integration
&lt;/li&gt;
&lt;li&gt;Run large batch jobs without interaction
&lt;/li&gt;
&lt;li&gt;Care about latency more than flexibility&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Production Challenges
&lt;/h2&gt;

&lt;p&gt;While the local development story is fantastic, moving to production introduces complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Scaling Challenge
&lt;/h3&gt;

&lt;p&gt;In development, a “one host process → one server process” model via stdio works well. In production, this naive 1:1 model does not scale, because you cannot spawn a new database connection process for every one of 10,000 concurrent users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution:&lt;/strong&gt; Production architectures use MCP gateways, which sit between clients and servers to handle connection pooling and multiplex many logical sessions over fewer physical connections.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Security and Auth
&lt;/h3&gt;

&lt;p&gt;MCP defines the transport, but it does not strictly mandate how you authenticate. In a remote setup, you need to secure the transport layer (for example, via headers in SSE).&lt;/p&gt;

&lt;p&gt;Because MCP servers can execute code or read files, strict roots configuration and containerization are essential to prevent privilege escalation.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Debugging and Observability
&lt;/h3&gt;

&lt;p&gt;Debugging streaming JSON‑RPC over a long‑lived transport can be opaque. Unlike REST, where you have discrete HTTP logs, MCP is a stream of messages.&lt;/p&gt;

&lt;p&gt;Production implementations require robust tracing (for example, correlation IDs) to track a request as it hops from Host → Gateway → Server and back.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The Model Context Protocol represents a meaningful step toward standardizing AI-to-tool communication. While Anthropic seeded the ecosystem, there is now broad adoption across open-source tools, IDEs, and infrastructure providers.&lt;/p&gt;

&lt;p&gt;However, treat it as a protocol, not a magic solution. It requires ecosystem adoption and careful architectural planning for production scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  Example MCP Implementations
&lt;/h2&gt;

&lt;p&gt;To explore MCP in practice, here are the implementation repositories built while learning the ecosystem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MCP Book Library:&lt;/strong&gt; &lt;a href="https://github.com/rajkundalia/mcp-book-library" rel="noopener noreferrer"&gt;https://github.com/rajkundalia/mcp-book-library&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Toolbox:&lt;/strong&gt; &lt;a href="https://github.com/rajkundalia/mcp-toolbox" rel="noopener noreferrer"&gt;https://github.com/rajkundalia/mcp-toolbox&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These projects demonstrate MCP servers and integrations for realistic data sources and workflows.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Use &lt;code&gt;mcp&lt;/code&gt; Over &lt;code&gt;fastmcp&lt;/code&gt;?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Short version:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;mcp&lt;/code&gt; (official) if you want to learn the architecture, build custom clients/hosts, or manually configure the HTTP/SSE layers (which is exactly what many project prompts ask for).&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;fastmcp&lt;/code&gt; if you just want to ship a tool to Claude Desktop in a few minutes and do not care how the wiring works under the hood.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The best way to understand MCP is to build with it. Start small, implement a simple server for a data source you use regularly, and compare the experience to traditional point-to-point integrations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources That Helped
&lt;/h2&gt;

&lt;p&gt;Some resources that helped deepen understanding of MCP and its ecosystem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://youtu.be/5CmAKm1wWW0?si=17DNRC7cQ89UfSLD" rel="noopener noreferrer"&gt;https://youtu.be/5CmAKm1wWW0?si=17DNRC7cQ89UfSLD&lt;/a&gt; – a great starter video.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/blog/Kseniase/mcp" rel="noopener noreferrer"&gt;https://huggingface.co/blog/Kseniase/mcp&lt;/a&gt; – very good conceptual and practical overview.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://modelcontextprotocol.io/docs/getting-started/intro" rel="noopener noreferrer"&gt;https://modelcontextprotocol.io/docs/getting-started/intro&lt;/a&gt; – official, well-written documentation.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.descope.com/learn/post/mcp" rel="noopener noreferrer"&gt;https://www.descope.com/learn/post/mcp&lt;/a&gt; – good discussion of security and auth aspects.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://zapier.com/blog/mcp/" rel="noopener noreferrer"&gt;https://zapier.com/blog/mcp/&lt;/a&gt; – promotes Zapier, but still an insightful read on real-world use.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://norahsakal.com/blog/mcp-vs-api-model-context-protocol-explained/" rel="noopener noreferrer"&gt;https://norahsakal.com/blog/mcp-vs-api-model-context-protocol-explained/&lt;/a&gt; – useful section on when to use MCP.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/ai-cloud-lab/model-context-protocol-mcp-with-ollama-a-full-deep-dive-working-code-part-1-81a3bb6d16b3" rel="noopener noreferrer"&gt;https://medium.com/ai-cloud-lab/model-context-protocol-mcp-with-ollama-a-full-deep-dive-working-code-part-1-81a3bb6d16b3&lt;/a&gt; and &lt;a href="https://medium.com/ai-cloud-lab/model-context-protocol-mcp-with-ollama-and-llama-3-a-step-by-step-guide-part-2-2a5917c8c745" rel="noopener noreferrer"&gt;https://medium.com/ai-cloud-lab/model-context-protocol-mcp-with-ollama-and-llama-3-a-step-by-step-guide-part-2-2a5917c8c745&lt;/a&gt; – detailed deep dives with working code.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://skywork.ai/skypage/en/ollama-mcp-MCP-Server-The-Definitive-Guide-for-AI-Engineers/1972585330623180800" rel="noopener noreferrer"&gt;https://skywork.ai/skypage/en/ollama-mcp-MCP-Server-The-Definitive-Guide-for-AI-Engineers/1972585330623180800&lt;/a&gt; – explains &lt;code&gt;ollama-mcp&lt;/code&gt;, an MCP server that exposes a local Ollama instance as standardized tools.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apidog.com/blog/mcp-ollama/" rel="noopener noreferrer"&gt;https://apidog.com/blog/mcp-ollama/&lt;/a&gt; – explains Dolphin MCP, a Python-based MCP client that bridges an LLM and multiple MCP servers.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>API Gateway vs Service Mesh: Beyond the North–South/East–West Myth</title>
      <dc:creator>Raj Kundalia</dc:creator>
      <pubDate>Thu, 20 Nov 2025 01:41:21 +0000</pubDate>
      <link>https://dev.to/rajkundalia/api-gateway-vs-service-mesh-beyond-the-north-southeast-west-myth-2mpg</link>
      <guid>https://dev.to/rajkundalia/api-gateway-vs-service-mesh-beyond-the-north-southeast-west-myth-2mpg</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Please note that the page became big because I had questions on my own and less information would have made things look speculatory. You can skip this and read links added at the end of the page, they are very good.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  My Experimental Code Link
&lt;/h2&gt;

&lt;p&gt;Like always, if you just read and not code for this, it pretty much becomes as good as not reading it. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Github Link:&lt;/strong&gt; &lt;a href="https://github.com/rajkundalia/api-gateway-service-mesh-sample" rel="noopener noreferrer"&gt;https://github.com/rajkundalia/api-gateway-service-mesh-sample&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This took a long time, I tried implementing a service mesh but it went above my scope - so things like Intentions in Consul would not work.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Introduction: The Misconception That's Costing Teams
&lt;/h2&gt;

&lt;p&gt;If you've worked with microservices, you've probably heard this oversimplification: &lt;strong&gt;"API Gateways handle north–south traffic, while Service Meshes handle east–west traffic."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This directional framing has become microservices folklore - repeated in architecture discussions and echoed in conference talks for years.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here's the issue: it's fundamentally wrong.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This misconception leads to poor architectural decisions, unnecessary complexity, and recurring confusion about which technology solves which problem. Teams often reach for an API Gateway when a Service Mesh is what they truly need - or vice versa - because they focus on traffic direction rather than the underlying purpose.&lt;/p&gt;

&lt;p&gt;The truth is more nuanced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Gateways can manage east–west traffic&lt;/strong&gt; via internal gateways that govern inter-service communication, apply policies, and handle versioning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service Meshes can handle north–south traffic&lt;/strong&gt; through mesh-aware ingress gateways (such as Istio's Ingress Gateway or Linkerd's ingress controller) that bring external traffic into the mesh.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So if traffic direction isn't the real difference, what is?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj61cz9nwhk7fqzjypl4n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj61cz9nwhk7fqzjypl4n.png" alt="Image" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Purpose and responsibility.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An API Gateway treats services as &lt;strong&gt;products&lt;/strong&gt; - with user governance, access control, monetization, lifecycle management, and business context.&lt;/p&gt;

&lt;p&gt;A Service Mesh, by contrast, provides &lt;strong&gt;infrastructure-level reliability&lt;/strong&gt; for service-to-service communication - zero business logic, zero product thinking, purely connectivity.&lt;/p&gt;

&lt;p&gt;In this article, we'll cut through the confusion and give you a clear mental model for when to use each technology - or when using both together creates the strongest architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  You'll learn:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;What problems each technology actually solves (and why traffic direction doesn't matter)&lt;/li&gt;
&lt;li&gt;The architectural differences that lead to different use cases&lt;/li&gt;
&lt;li&gt;How capabilities like mTLS, retries, and zero-trust security define service meshes&lt;/li&gt;
&lt;li&gt;A practical decision framework for choosing the right tool&lt;/li&gt;
&lt;li&gt;How API Gateways and Service Meshes complement each other in real-world systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's start by understanding the fundamental problems each technology was designed to solve.&lt;/p&gt;




&lt;h2&gt;
  
  
  Understanding the Real Problem Each Solves
&lt;/h2&gt;

&lt;h3&gt;
  
  
  API Gateway: APIs as a Product
&lt;/h3&gt;

&lt;p&gt;An API Gateway's primary purpose is to &lt;strong&gt;expose services as managed, consumable APIs&lt;/strong&gt; - treating your services like products that internal or external consumers can discover, use, and rely on.&lt;/p&gt;

&lt;p&gt;But an API Gateway is far more than a reverse proxy. It embeds business logic and enables API composition: aggregating data from multiple services into a single response, transforming payloads, standardizing errors, and presenting a unified interface that shields clients from backend complexity. This is effectively the Backend-for-Frontend (BFF) pattern.&lt;/p&gt;

&lt;p&gt;And once you move past request/response mechanics, the real power emerges. API Gateways participate in the entire &lt;strong&gt;API lifecycle&lt;/strong&gt; - the part most developers overlook:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Creation &amp;amp; design:&lt;/strong&gt; specs, versioning, schema validation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing &amp;amp; documentation:&lt;/strong&gt; interactive docs, automated tests, sandboxes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Publishing &amp;amp; onboarding:&lt;/strong&gt; developer portals, marketplaces, self-service access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monetization:&lt;/strong&gt; usage metering, billing hooks, tiered plans&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analytics:&lt;/strong&gt; usage patterns, behavior insights, performance dashboards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where the gateway gains &lt;strong&gt;business context&lt;/strong&gt;. It knows concepts like customers, products, API keys, and rate-limit tiers. When a mobile client sends a request, the gateway understands: &lt;em&gt;"This is Acme Corp, a premium tier subscriber, allowed 10,000 requests per hour on the /payments API."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Modern platforms such as &lt;strong&gt;Kong, AWS API Gateway, Azure API Management, Apigee, and Ambassador&lt;/strong&gt; all embody this philosophy - combining policy enforcement with full lifecycle and product-style API management.&lt;/p&gt;

&lt;h3&gt;
  
  
  Service Mesh: Service Connectivity Infrastructure
&lt;/h3&gt;

&lt;p&gt;A Service Mesh has a fundamentally different purpose: &lt;strong&gt;providing decoupled infrastructure for service-to-service communication&lt;/strong&gt; without requiring changes to application code.&lt;/p&gt;

&lt;p&gt;Service Meshes offload network functions from services into a dedicated infrastructure layer. They handle concerns like service discovery, load balancing, circuit breaking, retries, and timeouts - all the complexity that developers would otherwise implement (and often implement inconsistently) across services.&lt;/p&gt;

&lt;p&gt;Critically, &lt;strong&gt;Service Meshes have no business logic&lt;/strong&gt;. They're purely connectivity and observability infrastructure. A service mesh doesn't know or care whether it's routing a payment transaction or a product catalog query. Every service is treated equally as a network endpoint with routing rules and policies.&lt;/p&gt;

&lt;p&gt;This enables &lt;strong&gt;polyglot architectures&lt;/strong&gt;. Your Python services, Go services, and Java services all get the same networking capabilities without embedding client libraries or writing language-specific code. The infrastructure handles it transparently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The key insight:&lt;/strong&gt; A Service Mesh is business-agnostic. It operates at the infrastructure layer, understanding concepts like "service instances," "endpoints," "failure rates," and "latency percentiles" - but never "customers," "API products," or "billing tiers."&lt;/p&gt;

&lt;p&gt;Popular implementations include &lt;strong&gt;Istio, Linkerd, Consul Connect, and AWS App Mesh.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;API Gateway&lt;/th&gt;
&lt;th&gt;Service Mesh&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary Purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Expose services as managed API products&lt;/td&gt;
&lt;td&gt;Decouple service communication infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Business-aware (users, products, billing)&lt;/td&gt;
&lt;td&gt;Business-agnostic (endpoints, metrics)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Logic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Can contain transformation, aggregation logic&lt;/td&gt;
&lt;td&gt;No business logic, pure infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lifecycle Scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full API lifecycle (design → retirement)&lt;/td&gt;
&lt;td&gt;Runtime connectivity only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Consumer Focus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;External developers, partners, clients&lt;/td&gt;
&lt;td&gt;Services communicating with each other&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Architecture Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Deployment Models
&lt;/h3&gt;

&lt;p&gt;The architectural differences between API Gateways and Service Meshes are stark, and understanding these differences clarifies why each excels at different problems.&lt;/p&gt;

&lt;h4&gt;
  
  
  API Gateway: Centralized Architecture
&lt;/h4&gt;

&lt;p&gt;An API Gateway deploys as a standalone reverse proxy or clustered front-door, creating a single entry point (or small cluster) for API traffic. It lives in its own architectural layer, distinct from your services.&lt;/p&gt;

&lt;p&gt;Here's a simplified view:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;External Clients (Mobile, Web, Partners)
              ↓
    ┌─────────────────┐
    │  API Gateway    │ ← Centralized, clustered for HA
    │   (Kong/AWS)    │
    └─────────────────┘
         ↓    ↓    ↓
    ┌────┐ ┌────┐ ┌────┐
    │Svc │ │Svc │ │Svc │
    │ A  │ │ B  │ │ C  │
    └────┘ └────┘ └────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Traffic flows through the gateway as a dedicated hop. The gateway terminates external connections, applies policies, performs routing decisions, and forwards requests to backend services. Deployment is relatively straightforward - you provision the gateway infrastructure separately from your services.&lt;/p&gt;

&lt;h4&gt;
  
  
  Service Mesh: Decentralized Architecture
&lt;/h4&gt;

&lt;p&gt;A Service Mesh deploys in a fundamentally different way: a &lt;strong&gt;sidecar proxy alongside every service replica&lt;/strong&gt;. This is a decentralized, peer-to-peer model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Service A          Service B          Service C
┌─────────┐        ┌─────────┐        ┌─────────┐
│  App    │        │  App    │        │  App    │
│Container│        │Container│        │Container│
└────┬────┘        └────┬────┘        └────┬────┘
     │                  │                  │
┌────┴────┐        ┌────┴────┐        ┌────┴────┐
│ Envoy   │◄──────►│ Envoy   │◄──────►│ Envoy   │
│ Sidecar │        │ Sidecar │        │ Sidecar │
└─────────┘        └─────────┘        └─────────┘
       ▲                 ▲                 ▲
       └─────────────────┴─────────────────┘
              Control Plane (Istio/Linkerd)
              (Configuration, not traffic)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each service instance gets its own proxy (typically Envoy). When Service A calls Service B, the request flows: &lt;strong&gt;App A → Sidecar A → Sidecar B → App B&lt;/strong&gt;. The service code itself doesn't know about the mesh - it makes standard HTTP or gRPC calls to localhost, and the sidecar handles everything else.&lt;/p&gt;

&lt;p&gt;This deployment model is more invasive. It requires modifying your CI/CD pipelines to inject sidecars, updating Kubernetes manifests (or VM configurations), and managing the lifecycle of proxies alongside applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insight:&lt;/strong&gt; In an API Gateway, traffic converges at a central point. In a Service Mesh, traffic flows peer-to-peer between distributed proxies, with the control plane managing configuration but never touching actual requests.&lt;/p&gt;




&lt;h2&gt;
  
  
  Control Plane vs Data Plane Architecture
&lt;/h2&gt;

&lt;p&gt;This separation of concerns is crucial for understanding Service Meshes, though it applies (less critically) to some API Gateway implementations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Service Mesh: Deep Dive into Control and Data Planes
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;control plane&lt;/strong&gt; (examples: Istio's Pilot, Linkerd's Controller, Consul's servers) is the brain of the mesh:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Configuration management:&lt;/strong&gt; Distributes routing rules, traffic policies, and service configurations to all sidecars&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service discovery:&lt;/strong&gt; Maintains a live registry of all service instances and their endpoints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Certificate authority:&lt;/strong&gt; Generates and rotates mTLS certificates for service identity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Telemetry aggregation:&lt;/strong&gt; Collects metrics and traces from data plane proxies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Policy enforcement setup:&lt;/strong&gt; Configures access control rules and rate limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Critically:&lt;/strong&gt; the control plane is NOT on the request path. It handles configuration and management but never sees actual user requests. This is fundamental to mesh scalability.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;data plane&lt;/strong&gt; (examples: Envoy sidecars in Istio, Linkerd2-proxy in Linkerd) does the heavy lifting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Handles actual request traffic:&lt;/strong&gt; Every request flows through data plane proxies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enforces policies:&lt;/strong&gt; Implements circuit breakers, retries, timeouts configured by control plane&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L4/L7 routing and load balancing:&lt;/strong&gt; Makes real-time routing decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security enforcement:&lt;/strong&gt; Performs mTLS handshakes, validates certificates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Telemetry generation:&lt;/strong&gt; Reports metrics, logs, and traces for observability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's make this concrete with service discovery as an example. When Service C scales from 3 to 5 replicas, here's what happens:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Kubernetes (or your orchestrator) starts two new pods with Service C containers and Envoy sidecars&lt;/li&gt;
&lt;li&gt;The Envoy sidecars register with the control plane upon startup&lt;/li&gt;
&lt;li&gt;The control plane updates its service registry with the two new endpoints&lt;/li&gt;
&lt;li&gt;The control plane pushes updated routing configurations to all Envoy sidecars in the mesh&lt;/li&gt;
&lt;li&gt;Within seconds, Service A and Service B know about the new Service C instances and start load balancing across all 5 replicas&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No DNS propagation delays. No manual configuration updates. No service discovery libraries in application code. The control plane orchestrates everything, while sidecars handle the actual routing.&lt;/p&gt;

&lt;h3&gt;
  
  
  API Gateway: Simpler Control Plane Model
&lt;/h3&gt;

&lt;p&gt;Some API Gateway implementations (like Kong with its declarative configuration) have control plane concepts, but the separation is less critical. Many gateways bundle control and data plane functions in the same process. Configuration changes might require gateway reloads, and the gateway itself is on the request path - serving as both traffic handler and configuration enforcer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Organizational and Deployment Challenges
&lt;/h2&gt;

&lt;p&gt;Service Meshes face unique adoption barriers that API Gateways largely avoid:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Universal Sidecar Deployment Requirement
&lt;/h3&gt;

&lt;p&gt;To get value from a service mesh, you need sidecars deployed alongside &lt;strong&gt;all services&lt;/strong&gt; you want to manage. This creates organizational friction: it's not something a single team can adopt independently. You need buy-in from every service owner.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Shared Control Plane Access
&lt;/h3&gt;

&lt;p&gt;All services must share access to the mesh control plane. This crosses security boundaries - teams that previously had isolated deployments now share infrastructure. Organizations with strict security postures find this challenging.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Cannot Control External Services
&lt;/h3&gt;

&lt;p&gt;You can only mesh services you directly control. Third-party APIs, legacy systems outside your infrastructure, and managed services like external databases cannot participate in the mesh. This limits where resilience patterns apply.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Certificate Authority Coordination
&lt;/h3&gt;

&lt;p&gt;Services in the same mesh must share a Certificate Authority (CA) for mTLS. This requires cross-team coordination on security policies and trust models. Different teams or products often want separate CAs for isolation - which means separate meshes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters:&lt;/strong&gt; Service mesh adoption is often limited to team or product boundaries. An API Gateway, deployed as central infrastructure, can span the entire organization much more easily. It doesn't require every team to change their deployment processes.&lt;/p&gt;

&lt;p&gt;Now that we understand the architectural differences and deployment realities, let's examine specific capabilities side-by-side.&lt;/p&gt;




&lt;h2&gt;
  
  
  Capabilities Comparison
&lt;/h2&gt;

&lt;p&gt;Both technologies offer overlapping capabilities, but with different implementations and tradeoffs. Understanding these differences guides architectural decisions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Service Discovery
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway:&lt;/strong&gt; Uses external service registries (Consul, Eureka, DNS, Kubernetes Services). The gateway queries the registry to find service endpoints, then routes traffic accordingly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service Mesh:&lt;/strong&gt; Built-in service discovery via the control plane. The control plane automatically tracks all sidecar-enabled services, maintaining a live registry without external dependencies. When a service scales or moves, the mesh knows immediately.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Authentication and Authorization ⭐
&lt;/h3&gt;

&lt;p&gt;This is perhaps the most important architectural differentiator between the two patterns.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;API Gateway:&lt;/strong&gt; Focuses on &lt;strong&gt;user and client identity&lt;/strong&gt;. Validates API keys, OAuth2 tokens, JWT claims. Answers questions like: "Is this mobile app authorized to call the /payments endpoint?" or "Has this partner exceeded their rate limit?" Security is about edge protection - who gets into your system and what they can access.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Service Mesh:&lt;/strong&gt; Focuses on &lt;strong&gt;service identity&lt;/strong&gt; via mTLS certificates. Every service gets a cryptographic identity. Answers questions like: "Is this really the Payment service calling Fraud Detection?" or "Should Order Service be allowed to communicate with User Profile Service?" Security is about Zero-Trust architecture - no service implicitly trusts another.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Load Balancing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway:&lt;/strong&gt; Server-side load balancing at the gateway layer. The gateway distributes requests across service instances based on configured algorithms (round-robin, least connections, weighted).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service Mesh:&lt;/strong&gt; Client-side load balancing distributed via sidecars. Each sidecar makes load balancing decisions locally, using health status and latency information from the control plane. This enables more sophisticated strategies like locality-aware routing (prefer same-zone instances).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Rate Limiting
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway:&lt;/strong&gt; Edge-focused, per-client or per-API-key. Limits like "1000 requests per hour for this developer" or "premium tier customers get 10x capacity." Centralized enforcement at the gateway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service Mesh:&lt;/strong&gt; Can implement distributed rate limiting to prevent service overload. For example, preventing the Notification Service from overwhelming Email Service with requests, regardless of which client triggered the flow. Enforcement happens at sidecars across the mesh.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Circuit Breakers and Retries
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway:&lt;/strong&gt; Configured at the gateway level to protect against downstream service failures. If Payment Service is down, the gateway can circuit break to avoid cascading failures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service Mesh:&lt;/strong&gt; Configured at the control plane, enforced at every sidecar. Each service gets automatic circuit breakers and retries without code changes. When Inventory Service calls Warehouse Service and detects failures, the sidecar automatically circuit breaks - no retry logic in Inventory Service code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Health Checks
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway:&lt;/strong&gt; Gateway actively probes downstream services for health, removing unhealthy instances from its routing pool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service Mesh:&lt;/strong&gt; Sidecars monitor local service health and report to the control plane. Passive health checks based on actual request success rates. Faster reaction to failures because the sidecar sits adjacent to the service.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Observability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway:&lt;/strong&gt; Edge metrics and API-level analytics. Tracks which APIs are called, by whom, how often, and with what latency. Great for understanding API usage patterns and client behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service Mesh:&lt;/strong&gt; Deep service-to-service metrics and distributed tracing. Tracks every internal call with detailed latency breakdowns, success rates, and request volumes. Enables debugging complex distributed transactions by tracing requests as they flow through multiple services.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; When a user checkout fails, the API Gateway shows the client request hit the /checkout endpoint with a 500 error. The service mesh traces reveal that Order Service → Inventory Service succeeded, but Inventory Service → Warehouse Service timed out after 3 retries - pinpointing the exact failure point.&lt;/p&gt;

&lt;h3&gt;
  
  
  Protocol Support
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway:&lt;/strong&gt; Primarily HTTP/HTTPS, with increasing support for gRPC, WebSockets, and GraphQL. Focused on application-layer protocols.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service Mesh:&lt;/strong&gt; Supports both L4 (TCP) and L7 (HTTP, gRPC) protocols. Can handle raw TLS connections, TCP traffic, and any IP-based protocol. Broader protocol range because it operates at the network infrastructure layer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Chaos Engineering and Defect Simulation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway:&lt;/strong&gt; Limited capabilities - some gateways allow injecting delays or errors, but it's not a primary feature.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service Mesh:&lt;/strong&gt; Built-in chaos engineering support. Can inject faults (return 500 errors), add delays (simulate network latency), or abort connections to specific services. Enables testing resilience in production-like conditions. For example, "Make 10% of calls from Order Service to Inventory Service return 503 errors to verify circuit breakers work."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffufyriypzldgrsqr0ea3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffufyriypzldgrsqr0ea3.png" alt="image" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary Table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;API Gateway&lt;/th&gt;
&lt;th&gt;Service Mesh&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Service Discovery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;External registry (Consul, DNS)&lt;/td&gt;
&lt;td&gt;Built-in via control plane&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Authentication/Authorization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;User/client identity (OAuth, API keys)&lt;/td&gt;
&lt;td&gt;Service identity (mTLS certificates)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Load Balancing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Server-side, centralized&lt;/td&gt;
&lt;td&gt;Client-side, distributed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rate Limiting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per-client/API key at edge&lt;/td&gt;
&lt;td&gt;Per-service, distributed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Circuit Breakers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;At gateway&lt;/td&gt;
&lt;td&gt;Distributed, no code changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Health Checks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gateway probes services&lt;/td&gt;
&lt;td&gt;Sidecars monitor local health&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Observability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Edge metrics, API analytics&lt;/td&gt;
&lt;td&gt;Service-to-service tracing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Protocols&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HTTP/HTTPS, gRPC, WebSockets&lt;/td&gt;
&lt;td&gt;L4 + L7 (TCP, HTTP, gRPC, TLS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chaos Engineering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Built-in fault injection&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Among these capabilities, mutual TLS deserves special attention because it fundamentally changes how services authenticate and trust each other.&lt;/p&gt;




&lt;h2&gt;
  
  
  Mutual TLS (mTLS) in Service Mesh
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How mTLS Works and Why It Matters
&lt;/h3&gt;

&lt;h4&gt;
  
  
  The Mechanism:
&lt;/h4&gt;

&lt;p&gt;When a service mesh is deployed, the control plane includes a Certificate Authority (CA). This CA generates unique, short-lived certificates for every service replica. When Service A's sidecar calls Service B's sidecar, both sides present certificates during the TLS handshake, cryptographically proving their identities.&lt;/p&gt;

&lt;p&gt;Here's the flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Order Service sidecar initiates connection to Payment Service&lt;/li&gt;
&lt;li&gt;Payment sidecar presents certificate: "I am payment.production.svc.cluster"&lt;/li&gt;
&lt;li&gt;Order sidecar verifies certificate against the mesh CA&lt;/li&gt;
&lt;li&gt;Order sidecar presents its own certificate: "I am order.production.svc.cluster"&lt;/li&gt;
&lt;li&gt;Payment sidecar verifies Order's certificate&lt;/li&gt;
&lt;li&gt;Encrypted, authenticated connection established&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Crucially, sidecars automatically handle certificate rotation. Certificates might rotate every few hours, and services never see this complexity - it's entirely transparent.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Value:
&lt;/h4&gt;

&lt;p&gt;This eliminates the need for service-level authentication code. Previously, Payment Service might check an API key or JWT token to verify the caller. With mTLS, the infrastructure proves identity cryptographically. Your service code doesn't need to know about authentication - it receives requests that have already been authenticated at the network layer.&lt;/p&gt;

&lt;p&gt;Additionally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Encryption by default:&lt;/strong&gt; All east-west traffic is encrypted, protecting against network sniffing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit trail:&lt;/strong&gt; The mesh knows exactly which services communicated with which other services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance:&lt;/strong&gt; Meets requirements for data-in-transit encryption (SOC2, PCI-DSS, HIPAA)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Certificate Authority Boundaries
&lt;/h3&gt;

&lt;p&gt;Services in the same mesh must share a Certificate Authority. This has organizational implications.&lt;/p&gt;

&lt;p&gt;Consider a large company with two product teams: Banking and Trading. For security isolation, they want separate Certificate Authorities - Banking services shouldn't trust certificates from Trading services. This means they need two separate service meshes (Mesh A and Mesh B).&lt;/p&gt;

&lt;p&gt;But what if Banking needs to expose APIs to Trading? This is where API Gateways complement service meshes. An API Gateway can sit at the boundary between meshes, terminating mTLS from one mesh and re-establishing it in another mesh (or using traditional API authentication). The gateway bridges different trust domains.&lt;/p&gt;

&lt;h3&gt;
  
  
  mTLS and Zero-Trust Networking
&lt;/h3&gt;

&lt;p&gt;mTLS enables Zero-Trust architecture for internal service communication.&lt;/p&gt;

&lt;p&gt;Traditional security followed the "castle and moat" model: strong perimeter defenses, but once inside the network, services implicitly trusted each other. An attacker who breached the perimeter had free access to internal systems.&lt;/p&gt;

&lt;p&gt;Zero-Trust rejects this model: &lt;strong&gt;never trust, always verify&lt;/strong&gt;. Every request, even between internal services, requires authentication. No service is trusted by default, regardless of network location.&lt;/p&gt;

&lt;p&gt;Service meshes with mTLS implement Zero-Trust for east-west traffic. Even if an attacker deploys a rogue container inside your cluster, it cannot communicate with legitimate services because it lacks valid certificates signed by the mesh CA. Every service must cryptographically prove its identity on every request.&lt;/p&gt;

&lt;p&gt;With these capabilities and security models in mind, let's turn to practical decision-making: when should you use each technology?&lt;/p&gt;




&lt;h2&gt;
  
  
  When to Use Each
&lt;/h2&gt;

&lt;p&gt;There's no one-size-fits-all answer. Choosing between API Gateways and Service Meshes depends on your primary challenge, team maturity, and architectural scale. Let's build a decision framework.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision Framework: Use API Gateway When…
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Primary Challenge: External Access &amp;amp; Client Management
&lt;/h4&gt;

&lt;p&gt;If you need to expose services to external consumers - developers, partners, customers, mobile apps - choose an API Gateway. It excels at edge security, client authentication (API keys, OAuth2), and managing the full API product lifecycle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concrete scenario:&lt;/strong&gt; You're building a SaaS platform where third-party developers integrate with your product catalog API. You need developer onboarding, API key provisioning, documentation portals, usage analytics, and tiered rate limiting. An API Gateway provides all of this out-of-the-box.&lt;/p&gt;

&lt;h4&gt;
  
  
  Primary Challenge: Service Abstraction &amp;amp; Evolution
&lt;/h4&gt;

&lt;p&gt;If different products or teams need to communicate with governance, versioning, and backward compatibility, choose an API Gateway. It provides abstraction as underlying services evolve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concrete scenario:&lt;/strong&gt; Your mobile team needs stable APIs while your backend undergoes frequent changes. The API Gateway maintains version 1 and version 2 of the /orders endpoint, routing v1 clients to legacy services and v2 clients to the new architecture. Backend teams can refactor without breaking mobile apps.&lt;/p&gt;

&lt;h4&gt;
  
  
  Primary Challenge: Centralized Control &amp;amp; Simplicity
&lt;/h4&gt;

&lt;p&gt;If you're starting your microservices journey and need immediate value with lower operational complexity, choose an API Gateway. Simpler deployment, easier to understand, lower barrier to entry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concrete scenario:&lt;/strong&gt; You're migrating from a monolith to 5–10 microservices. You need request routing, basic rate limiting, and API documentation. A service mesh would be overkill - too much infrastructure overhead for your scale. An API Gateway solves your immediate needs without the operational burden.&lt;/p&gt;

&lt;h4&gt;
  
  
  Primary Challenge: Edge Security &amp;amp; Rate Limiting
&lt;/h4&gt;

&lt;p&gt;If your main concern is protecting services from external threats and managing API quotas per customer, choose an API Gateway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concrete scenario:&lt;/strong&gt; Your public APIs face potential DDoS attacks, credential stuffing, and abusive clients. The API Gateway implements rate limiting, IP blocking, JWT validation, and anomaly detection at the edge, before traffic reaches your services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision Framework: Use Service Mesh When…
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Primary Challenge: Internal Service Reliability
&lt;/h4&gt;

&lt;p&gt;If you have large-scale internal architecture (dozens to hundreds of services) with complex communication patterns, and services need automatic retries, circuit breakers, and timeouts without code changes, choose a Service Mesh.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concrete scenario:&lt;/strong&gt; You have 80 microservices across 12 teams. Services frequently fail partially - timeouts, transient errors, network blips. Rather than each team implementing retry logic differently (or not at all), the service mesh provides consistent resilience patterns across all services. When Recommendation Service calls User Profile Service and gets a timeout, the sidecar automatically retries with exponential backoff - no code change needed.&lt;/p&gt;

&lt;h4&gt;
  
  
  Primary Challenge: Polyglot Environments &amp;amp; Code Elimination
&lt;/h4&gt;

&lt;p&gt;If you want to eliminate networking code from services and need uniform connectivity across services written in different languages, choose a Service Mesh.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concrete scenario:&lt;/strong&gt; Your platform includes Python ML services, Go APIs, Java batch processors, and Node.js real-time services. Rather than maintaining four different HTTP client libraries with circuit breakers, retries, and observability, the service mesh provides identical capabilities to all services regardless of language. Developers focus on business logic, not networking infrastructure.&lt;/p&gt;

&lt;h4&gt;
  
  
  Primary Challenge: Security Compliance &amp;amp; Zero-Trust
&lt;/h4&gt;

&lt;p&gt;If security compliance requires mTLS encryption for all internal communication, or you need Zero-Trust architecture with cryptographic service identity, choose a Service Mesh.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concrete scenario:&lt;/strong&gt; Rather than configuring TLS in every service's application code, the service mesh provides automatic mTLS between all services. Auditors see consistent encryption policies enforced at the infrastructure layer, dramatically simplifying compliance evidence.&lt;/p&gt;

&lt;h4&gt;
  
  
  Primary Challenge: Deep Observability &amp;amp; Traffic Control
&lt;/h4&gt;

&lt;p&gt;If you require deep east-west observability and distributed tracing across all services, or need advanced traffic management (canary deployments, traffic splitting, A/B testing) for internal services, choose a Service Mesh.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concrete scenario:&lt;/strong&gt; You're rolling out a major refactor of Order Service. You want to send 5% of traffic to the new version, monitor error rates and latency, gradually increase to 50%, then 100%. The service mesh enables this with configuration changes - no deployment changes, no feature flags in code. If error rates spike, you roll back instantly by updating traffic weights.&lt;/p&gt;

&lt;h3&gt;
  
  
  When NOT to Use Service Mesh
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Avoiding Unnecessary Complexity:
&lt;/h4&gt;

&lt;p&gt;Service meshes are powerful but operationally complex. Don't use them if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Small architectures (&amp;lt; 10–15 services):&lt;/strong&gt; Operational overhead outweighs benefits. You'll spend more time managing the mesh than you save from its features.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team lacks infrastructure expertise:&lt;/strong&gt; Service meshes have a steep learning curve. If your team struggles with Kubernetes basics, adding a service mesh will slow you down.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cannot deploy sidecars:&lt;/strong&gt; If you depend on external services, legacy systems you don't control, or third-party SaaS APIs, a service mesh can't manage those connections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Organizational resistance:&lt;/strong&gt; Service meshes require cross-team adoption. If teams resist sidecar injection or control plane dependencies, forced adoption fails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ultra-sensitive performance requirements:&lt;/strong&gt; Sidecars add latency (typically 1–5ms per hop). For ultra-low-latency scenarios where even milliseconds matter, this overhead is unacceptable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited operational resources:&lt;/strong&gt; Service meshes require dedicated platform engineering resources. If you lack staff to manage mesh infrastructure, troubleshoot sidecar issues, and handle certificate rotation problems, don't adopt a mesh.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Decision Matrix: Use Both When…
&lt;/h3&gt;

&lt;h4&gt;
  
  
  The Comprehensive Approach:
&lt;/h4&gt;

&lt;p&gt;Many mature architectures use both technologies together, leveraging each for its strengths.&lt;/p&gt;

&lt;p&gt;Use both when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need edge control for external clients (API Gateway) AND in-mesh reliability for internal services (Service Mesh)&lt;/li&gt;
&lt;li&gt;You want API-as-a-product capabilities (documentation, monetization, developer portals) AND Zero-Trust security internally (mTLS between services)&lt;/li&gt;
&lt;li&gt;You have a mature platform engineering team capable of managing layered infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example decision:&lt;/strong&gt; "We expose our Payment API to mobile apps and partners via API Gateway - handling JWT validation, per-customer rate limiting, and maintaining a developer portal. Internal communication between Payment Service, Fraud Detection Service, and Notification Service uses a service mesh - providing mTLS encryption, circuit breakers, and distributed tracing. The API Gateway itself runs as a service within the mesh, getting the same resilience and observability benefits."&lt;/p&gt;




&lt;h2&gt;
  
  
  Real-World Architecture Example
&lt;/h2&gt;

&lt;p&gt;Let's walk through a financial institution scenario that illustrates how both technologies complement each other.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario: Multi-Product Financial Platform
&lt;/h3&gt;

&lt;p&gt;A financial institution has two major products:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Banking Platform&lt;/strong&gt; (account management, transfers, statements)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trading Platform&lt;/strong&gt; (stock trading, portfolio management, market data)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each product has its own engineering team, separate deployments, and independent release cycles. Here's how they use both technologies:&lt;/p&gt;

&lt;h4&gt;
  
  
  Service Mesh Deployment (Two Separate Meshes)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Banking Mesh:&lt;/strong&gt; Covers 25 microservices (Account Service, Transaction Service, Statement Generator, etc.) with its own Certificate Authority for security isolation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trading Mesh:&lt;/strong&gt; Covers 18 microservices (Order Execution, Portfolio Service, Market Data, etc.) with a separate Certificate Authority&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each mesh provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;mTLS encryption for all internal communication within that product&lt;/li&gt;
&lt;li&gt;Circuit breakers and retries for resilience&lt;/li&gt;
&lt;li&gt;Distributed tracing to debug complex transactions&lt;/li&gt;
&lt;li&gt;Zero-Trust security - no service trusts another by default&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  API Gateway Deployment (Multiple Gateways)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Internal API Gateway:&lt;/strong&gt; Banking Platform exposes select APIs to Trading Platform (e.g., "Get Account Balance" for margin trading). This gateway sits at the boundary between Banking Mesh and Trading Mesh, bridging different trust domains.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge API Gateway:&lt;/strong&gt; Both products expose APIs to mobile applications. This gateway handles:

&lt;ul&gt;
&lt;li&gt;JWT validation for user authentication&lt;/li&gt;
&lt;li&gt;Rate limiting per user tier (retail vs institutional)&lt;/li&gt;
&lt;li&gt;API versioning (mobile app v1.2 uses older endpoint, v2.0 uses new schema)&lt;/li&gt;
&lt;li&gt;Developer portal for partner integrations&lt;/li&gt;
&lt;li&gt;Analytics on API usage patterns&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  Multi-Datacenter Deployment
&lt;/h4&gt;

&lt;p&gt;The architecture spans two datacenters (DC1 and DC2) for high availability:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each datacenter has full mesh deployment (Banking Mesh and Trading Mesh)&lt;/li&gt;
&lt;li&gt;API Gateways in each datacenter for local request handling&lt;/li&gt;
&lt;li&gt;Cross-datacenter mesh communication uses mTLS across the WAN&lt;/li&gt;
&lt;li&gt;API Gateway load balancers route users to nearest datacenter&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Key Architectural Insights:
&lt;/h4&gt;

&lt;p&gt;This architecture demonstrates several principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Isolation through separate meshes:&lt;/strong&gt; Banking and Trading use different CAs, preventing accidental trust relationships&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Gateways bridge trust domains:&lt;/strong&gt; Internal gateway mediates between meshes when cross-product communication is needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layered security:&lt;/strong&gt; Edge gateway handles user authentication, mesh handles service authentication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Different lifecycle management:&lt;/strong&gt; API versions can change without mesh reconfiguration; mesh policies can change without API versioning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a mobile user checks their trading portfolio's buying power, here's the flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Mobile app → Edge API Gateway (JWT validation, rate limiting)&lt;/li&gt;
&lt;li&gt;Edge API Gateway → Trading Platform's Portfolio Service (via Trading Mesh, with mTLS)&lt;/li&gt;
&lt;li&gt;Portfolio Service → Internal API Gateway (requesting account balance from Banking)&lt;/li&gt;
&lt;li&gt;Internal API Gateway → Banking Platform's Account Service (via Banking Mesh, with mTLS)&lt;/li&gt;
&lt;li&gt;Response flows back through each layer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each technology layer adds value: the edge gateway protects against external threats and manages API products, while the meshes ensure reliable, secure service-to-service communication.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pros and Cons Summary
&lt;/h2&gt;

&lt;p&gt;Understanding the tradeoffs helps set realistic expectations and plan for operational challenges.&lt;/p&gt;

&lt;h3&gt;
  
  
  API Gateway
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Standardizes API delivery:&lt;/strong&gt; Consistent authentication, rate limiting, and versioning across all APIs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplifies client integration:&lt;/strong&gt; Single entry point with unified documentation reduces client complexity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High flexibility:&lt;/strong&gt; Can transform requests, aggregate responses, implement complex routing logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easier adoption:&lt;/strong&gt; Centralized deployment model requires less organizational coordination&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralized analytics:&lt;/strong&gt; Single place to monitor API usage, client behavior, and performance trends&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legacy integration:&lt;/strong&gt; Can front legacy systems, providing modern API interfaces to old infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single point of failure risk:&lt;/strong&gt; Though clustering mitigates this, the gateway remains a critical chokepoint&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralization complexity at scale:&lt;/strong&gt; As more APIs are added, gateway configuration grows complex&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency introduction:&lt;/strong&gt; Extra hop adds latency (typically 5–20ms depending on gateway processing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited internal visibility:&lt;/strong&gt; Only sees edge traffic, not service-to-service communication patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling challenges:&lt;/strong&gt; While horizontal scaling is possible, it's more complex than distributed architectures&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Service Mesh
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Built-in observability:&lt;/strong&gt; Comprehensive metrics, distributed tracing, and logging without code instrumentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced security:&lt;/strong&gt; Automatic mTLS, Zero-Trust architecture, cryptographic service identity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resilience without code:&lt;/strong&gt; Circuit breakers, retries, timeouts configured centrally, enforced everywhere&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-grained traffic control:&lt;/strong&gt; Canary deployments, traffic splitting, A/B testing at infrastructure level&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chaos engineering capabilities:&lt;/strong&gt; Inject faults and delays to test system resilience&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Abstracts networking from code:&lt;/strong&gt; Developers focus on business logic, not HTTP clients and retry libraries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Language agnostic:&lt;/strong&gt; Same capabilities for Go, Python, Java, Node.js services&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Steep learning curve:&lt;/strong&gt; Complex architecture requires dedicated platform engineering expertise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational complexity:&lt;/strong&gt; Managing control plane, certificate rotation, sidecar upgrades adds operational burden&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency overhead:&lt;/strong&gt; Each sidecar hop adds latency; multiple hops compound this&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource overhead:&lt;/strong&gt; Memory and CPU per sidecar&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requires infrastructure maturity:&lt;/strong&gt; Best suited for Kubernetes environments with GitOps practices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Organizational challenges:&lt;/strong&gt; Requires cross-team adoption and coordination - can't be implemented in isolation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment complexity:&lt;/strong&gt; Sidecar injection, control plane dependencies increase deployment complexity&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Let's return to where we started: the pervasive north-south/east-west myth that frames API Gateways and Service Meshes as mutually exclusive technologies defined by traffic direction.&lt;/p&gt;

&lt;p&gt;This framing is fundamentally flawed. Both technologies can handle both traffic types. API Gateways can manage internal service-to-service communication through private gateways. Service Meshes can expose external traffic through ingress gateways. The real distinction has nothing to do with where traffic flows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What actually matters is purpose:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Gateways&lt;/strong&gt; treat services as products with business context - managing full API lifecycles, understanding users and customers, handling monetization and developer onboarding. They operate at the application edge with business awareness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service Meshes&lt;/strong&gt; provide business-agnostic infrastructure for service connectivity - offloading networking concerns from application code, enabling Zero-Trust security through mTLS, and providing deep observability without instrumentation. They operate at the infrastructure layer with no business logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Looking forward, both patterns continue to evolve. Service Meshes are simplifying operationally (Linkerd's focus on simplicity, Istio's ambient mesh reducing sidecar overhead). API Gateways are adding mesh-like features (Kong Mesh, Ambassador's service mesh integration). The boundaries blur, but the fundamental purposes remain distinct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose your tools based on the problems they solve, not the traffic patterns they handle.&lt;/strong&gt; Your architecture - and your team's sanity - will thank you.&lt;/p&gt;




&lt;h2&gt;
  
  
  Note
&lt;/h2&gt;

&lt;p&gt;Obviously this content has been generated by LLM, but my approach to writing has been the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;I read topics from various pages out there.&lt;/li&gt;
&lt;li&gt;I come across questions/sub topics that I would want to cover.&lt;/li&gt;
&lt;li&gt;I add this questions/subtopics and then generate using LLM.&lt;/li&gt;
&lt;li&gt;I read the LLM generated content and then keep what I find necessary.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://medium.com/microservices-in-practice/service-mesh-vs-api-gateway-a6d814b9bf56" rel="noopener noreferrer"&gt;Service Mesh vs API Gateway - Medium&lt;/a&gt; - Decent page&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.solo.io/topics/istio/service-mesh-vs-api-gateway" rel="noopener noreferrer"&gt;Service Mesh vs API Gateway - Solo.io&lt;/a&gt; - Good benefits of service mesh mentioned here&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://konghq.com/blog/enterprise/the-difference-between-api-gateways-and-service-mesh" rel="noopener noreferrer"&gt;The Difference Between API Gateways and Service Mesh - Kong&lt;/a&gt; - Very good piece - after reading this I thought I should not write the blog&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.digitalapi.ai/blogs/api-gateway-vs-service-mesh-whats-the-difference" rel="noopener noreferrer"&gt;API Gateway vs Service Mesh: What's the Difference - DigitalAPI&lt;/a&gt; - Good page&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nordicapis.com/should-you-use-an-api-gateway-or-service-mesh/" rel="noopener noreferrer"&gt;Should You Use an API Gateway or Service Mesh? - Nordic APIs&lt;/a&gt; - Simple yet elegant explanation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.gravitee.io/blog/microservices-discovery-api-gateway-vs-service-mesh" rel="noopener noreferrer"&gt;API Gateway vs Service Mesh - Gravitee&lt;/a&gt; - Similarities and differences are nicely compared here&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>apigateway</category>
      <category>servicemesh</category>
      <category>springboot</category>
    </item>
  </channel>
</rss>
