<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Parth Sarthi Sharma</title>
    <description>The latest articles on DEV Community by Parth Sarthi Sharma (@parth_sarthisharma_105e7).</description>
    <link>https://dev.to/parth_sarthisharma_105e7</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3676140%2F76dc188c-9a29-40da-ad4b-85b4b05c3306.jpg</url>
      <title>DEV Community: Parth Sarthi Sharma</title>
      <link>https://dev.to/parth_sarthisharma_105e7</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/parth_sarthisharma_105e7"/>
    <language>en</language>
    <item>
      <title>9 Practical Ways Senior ML Engineers Reduce Inference Latency</title>
      <dc:creator>Parth Sarthi Sharma</dc:creator>
      <pubDate>Sat, 20 Jun 2026 07:37:31 +0000</pubDate>
      <link>https://dev.to/parth_sarthisharma_105e7/9-practical-ways-senior-ml-engineers-reduce-inference-latency-j9f</link>
      <guid>https://dev.to/parth_sarthisharma_105e7/9-practical-ways-senior-ml-engineers-reduce-inference-latency-j9f</guid>
      <description>&lt;p&gt;Most teams blame the model when an AI application feels slow.&lt;/p&gt;

&lt;p&gt;In reality, the model is often only one part of the latency budget.&lt;/p&gt;

&lt;p&gt;A typical AI request may involve:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Request
    ↓
Authentication
    ↓
Feature Retrieval
    ↓
Vector Search
    ↓
Agent Orchestration
    ↓
LLM Inference
    ↓
Guardrails
    ↓
Response Generation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By the time the user sees a response, latency has accumulated across multiple layers of the system.&lt;/p&gt;

&lt;p&gt;After working on cloud-native systems, GenAI platforms, and distributed architectures, I've noticed that the best AI engineers focus on optimizing the entire pipeline—not just the model.&lt;/p&gt;

&lt;p&gt;Here are 9 practical techniques commonly used in production AI systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Optimize Feature Retrieval Before Touching the Model
&lt;/h2&gt;

&lt;p&gt;Many AI and ML systems spend more time fetching data than generating predictions.&lt;/p&gt;

&lt;p&gt;Common examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fraud detection systems fetching customer risk profiles&lt;/li&gt;
&lt;li&gt;Recommendation systems retrieving user interaction history&lt;/li&gt;
&lt;li&gt;Personalization engines loading customer attributes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A model that takes 50ms to infer becomes a 500ms system if feature retrieval takes 450ms.&lt;/p&gt;

&lt;p&gt;Instead of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request
 ↓
Database Queries
 ↓
Model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request
 ↓
Online Feature Store
 ↓
Model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Technologies commonly used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redis&lt;/li&gt;
&lt;li&gt;DynamoDB&lt;/li&gt;
&lt;li&gt;Feast Online Store&lt;/li&gt;
&lt;li&gt;Tecton Online Store&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fastest prediction is often achieved by reducing feature lookup latency.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Separate Real-Time and Batch Features
&lt;/h2&gt;

&lt;p&gt;Not every feature needs to be calculated at request time.&lt;/p&gt;

&lt;p&gt;Bad:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request
 ↓
Calculate 30-day spending history
 ↓
Model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Good:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nightly Batch Pipeline
 ↓
Precompute Features
 ↓
Store in Feature Store

Request
 ↓
Feature Lookup
 ↓
Model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Examples of batch features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average spend last 30 days&lt;/li&gt;
&lt;li&gt;Customer lifetime value&lt;/li&gt;
&lt;li&gt;Product affinity scores&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples of real-time features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transactions in last 5 minutes&lt;/li&gt;
&lt;li&gt;Products viewed in current session&lt;/li&gt;
&lt;li&gt;Failed login attempts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This reduces inference latency dramatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Cache Aggressively
&lt;/h2&gt;

&lt;p&gt;One of the highest ROI optimizations.&lt;/p&gt;

&lt;p&gt;Many requests are repetitive.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frequently asked support questions&lt;/li&gt;
&lt;li&gt;Popular product recommendations&lt;/li&gt;
&lt;li&gt;Repeated vector search results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query
 ↓
RAG
 ↓
LLM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query
 ↓
Cache Check
 ↓
Return Cached Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common technologies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redis&lt;/li&gt;
&lt;li&gt;CloudFront&lt;/li&gt;
&lt;li&gt;Application-level caches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A cache hit often reduces latency from seconds to milliseconds.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Reduce Retrieval Latency
&lt;/h2&gt;

&lt;p&gt;In RAG systems, retrieval often becomes the bottleneck.&lt;/p&gt;

&lt;p&gt;Typical latency contributors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Large vector indexes&lt;/li&gt;
&lt;li&gt;Excessive top-K retrieval&lt;/li&gt;
&lt;li&gt;Poor filtering strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Search Entire Knowledge Base
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Metadata Filters
 +
Vector Search
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Search only banking documents&lt;/li&gt;
&lt;li&gt;Search only relevant departments&lt;/li&gt;
&lt;li&gt;Search only customer-specific data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reducing search space significantly improves response times.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Use Hybrid Retrieval Carefully
&lt;/h2&gt;

&lt;p&gt;Many teams combine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Vector Search
+
Keyword Search
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;which improves quality but increases latency.&lt;/p&gt;

&lt;p&gt;Practical approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Keyword Search
 ↓
Candidate Set
 ↓
Vector Ranking
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;instead of searching the entire corpus twice.&lt;/p&gt;

&lt;p&gt;Quality matters, but so does speed.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Parallelize Tool Calls and Agent Workflows
&lt;/h2&gt;

&lt;p&gt;One of the most common mistakes in agentic systems is sequential execution.&lt;/p&gt;

&lt;p&gt;Bad:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent
 ↓
Tool A
 ↓
Tool B
 ↓
Tool C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total latency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A + B + C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Better:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent
 ↓
Parallel Execution
 ↓
Tool A
Tool B
Tool C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total latency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;max(A,B,C)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This can reduce response time by several seconds.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Use Smaller Models Where Possible
&lt;/h2&gt;

&lt;p&gt;Not every task requires a large model.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Better Choice&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classification&lt;/td&gt;
&lt;td&gt;Small Model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intent Detection&lt;/td&gt;
&lt;td&gt;Small Model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Routing&lt;/td&gt;
&lt;td&gt;Small Model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Summarization&lt;/td&gt;
&lt;td&gt;Medium Model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex Reasoning&lt;/td&gt;
&lt;td&gt;Large Model&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A common production pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Small Model
 ↓
Route Request
 ↓
Large Model (only when needed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This reduces both latency and cost.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Quantize Models
&lt;/h2&gt;

&lt;p&gt;A technique heavily used in production ML systems.&lt;/p&gt;

&lt;p&gt;Instead of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FP32 Model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INT8
INT4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or similar quantized formats.&lt;/p&gt;

&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smaller memory footprint&lt;/li&gt;
&lt;li&gt;Faster inference&lt;/li&gt;
&lt;li&gt;Lower infrastructure costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Especially useful for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Edge deployments&lt;/li&gt;
&lt;li&gt;Real-time recommendation systems&lt;/li&gt;
&lt;li&gt;High-throughput inference workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trade-off is a small accuracy reduction.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Measure the Entire Latency Budget
&lt;/h2&gt;

&lt;p&gt;This is where observability becomes critical.&lt;/p&gt;

&lt;p&gt;Many teams optimize the model while ignoring everything else.&lt;/p&gt;

&lt;p&gt;Track latency across:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Feature Retrieval
Vector Search
Agent Routing
Tool Calls
LLM Inference
Guardrails
Response Validation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A typical breakdown might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Feature Retrieval      50ms
Vector Search         120ms
Tool Calls            300ms
LLM Inference        2200ms
Guardrails            150ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without tracing, teams often optimize the wrong component.&lt;/p&gt;

&lt;p&gt;Platforms such as Langfuse, HoneyHive, Arize Phoenix, and OpenTelemetry-based observability stacks make these bottlenecks visible.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Lesson
&lt;/h2&gt;

&lt;p&gt;The fastest AI systems are rarely the ones with the fastest models.&lt;/p&gt;

&lt;p&gt;They are the systems with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Efficient feature retrieval&lt;/li&gt;
&lt;li&gt;Smart caching&lt;/li&gt;
&lt;li&gt;Optimized retrieval pipelines&lt;/li&gt;
&lt;li&gt;Parallel execution&lt;/li&gt;
&lt;li&gt;Right-sized models&lt;/li&gt;
&lt;li&gt;Strong observability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Senior AI engineers optimize the entire system.&lt;/p&gt;

&lt;p&gt;Because users don't care whether the delay comes from a vector database, a feature store, an agent, or an LLM.&lt;/p&gt;

&lt;p&gt;They only notice one thing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How long it takes to get an answer.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>I Compared 7 AI Observability Platforms So You Don’t Have To (2026 Edition)</title>
      <dc:creator>Parth Sarthi Sharma</dc:creator>
      <pubDate>Thu, 11 Jun 2026 05:50:50 +0000</pubDate>
      <link>https://dev.to/parth_sarthisharma_105e7/i-compared-7-ai-observability-platforms-so-you-dont-have-to-2026-edition-3jdc</link>
      <guid>https://dev.to/parth_sarthisharma_105e7/i-compared-7-ai-observability-platforms-so-you-dont-have-to-2026-edition-3jdc</guid>
      <description>&lt;p&gt;The AI tooling ecosystem is exploding.&lt;/p&gt;

&lt;p&gt;Every week there seems to be a new platform promising:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better traces&lt;/li&gt;
&lt;li&gt;Better evaluations&lt;/li&gt;
&lt;li&gt;Better prompt debugging&lt;/li&gt;
&lt;li&gt;Better monitoring&lt;/li&gt;
&lt;li&gt;Better cost visibility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The challenge isn’t finding an AI observability tool anymore.&lt;/p&gt;

&lt;p&gt;The challenge is choosing one.&lt;/p&gt;

&lt;p&gt;If you’re building AI applications today, chances are you’ve come across names like Langfuse, LangSmith, HoneyHive, Helicone, Arize, Braintrust, or Phoenix.&lt;/p&gt;

&lt;p&gt;After exploring these platforms, I noticed something interesting:&lt;/p&gt;

&lt;p&gt;Most tools overlap in functionality, but each one is optimized for a very different workflow.&lt;/p&gt;

&lt;p&gt;This article focuses on comparing the tools themselves—not explaining AI observability concepts.&lt;/p&gt;

&lt;p&gt;Let’s dive in.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Evaluation Criteria&lt;/p&gt;

&lt;p&gt;For this comparison I evaluated each platform across:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tracing and debugging&lt;/li&gt;
&lt;li&gt;Prompt monitoring&lt;/li&gt;
&lt;li&gt;Evaluations (Evals)&lt;/li&gt;
&lt;li&gt;Cost tracking&lt;/li&gt;
&lt;li&gt;Dataset management&lt;/li&gt;
&lt;li&gt;Self-hosting support&lt;/li&gt;
&lt;li&gt;Enterprise readiness&lt;/li&gt;
&lt;li&gt;Ease of adoption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Quick Comparison Table&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Comparison Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Open Source&lt;/th&gt;
&lt;th&gt;Tracing&lt;/th&gt;
&lt;th&gt;Evaluations&lt;/th&gt;
&lt;th&gt;Cost Monitoring&lt;/th&gt;
&lt;th&gt;Self Host&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Langfuse&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;Most teams&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HoneyHive&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;Enterprise AI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangSmith&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐⭐&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;LangChain users&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Helicone&lt;/td&gt;
&lt;td&gt;⚠️ Partial&lt;/td&gt;
&lt;td&gt;⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;Cost visibility&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Arize&lt;/td&gt;
&lt;td&gt;⚠️ Partial&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⚠️ Limited&lt;/td&gt;
&lt;td&gt;Large production systems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Braintrust&lt;/td&gt;
&lt;td&gt;⚠️ Partial&lt;/td&gt;
&lt;td&gt;⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐&lt;/td&gt;
&lt;td&gt;⚠️ Limited&lt;/td&gt;
&lt;td&gt;Evaluation-first workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phoenix&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;Lightweight OSS setups&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Langfuse&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What Stands Out&lt;/p&gt;

&lt;p&gt;Langfuse has become one of the most popular choices for AI engineering teams.&lt;/p&gt;

&lt;p&gt;It combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tracing&lt;/li&gt;
&lt;li&gt;Prompt management&lt;/li&gt;
&lt;li&gt;Evaluations&lt;/li&gt;
&lt;li&gt;Dataset tracking&lt;/li&gt;
&lt;li&gt;Cost analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;in a single platform.&lt;/p&gt;

&lt;p&gt;The biggest advantage is flexibility.&lt;/p&gt;

&lt;p&gt;Unlike many commercial products, Langfuse does not lock you into a specific framework.&lt;/p&gt;

&lt;p&gt;You can use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI&lt;/li&gt;
&lt;li&gt;Anthropic&lt;/li&gt;
&lt;li&gt;Gemini&lt;/li&gt;
&lt;li&gt;Bedrock&lt;/li&gt;
&lt;li&gt;LangChain&lt;/li&gt;
&lt;li&gt;LangGraph&lt;/li&gt;
&lt;li&gt;Custom agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;without major friction.&lt;/p&gt;

&lt;p&gt;Strengths&lt;/p&gt;

&lt;p&gt;✅ Open source&lt;/p&gt;

&lt;p&gt;✅ Self-hosting available&lt;/p&gt;

&lt;p&gt;✅ Strong evaluation workflows&lt;/p&gt;

&lt;p&gt;✅ Framework agnostic&lt;/p&gt;

&lt;p&gt;✅ Excellent developer experience&lt;/p&gt;

&lt;p&gt;Weaknesses&lt;/p&gt;

&lt;p&gt;❌ More setup than fully managed platforms&lt;/p&gt;

&lt;p&gt;❌ Enterprise features may require additional work&lt;/p&gt;

&lt;p&gt;Best For&lt;/p&gt;

&lt;p&gt;Teams wanting a long-term observability platform without vendor lock-in.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;HoneyHive&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What Stands Out&lt;/p&gt;

&lt;p&gt;HoneyHive focuses heavily on enterprise AI quality and testing.&lt;/p&gt;

&lt;p&gt;The platform goes beyond simple tracing and emphasizes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Evaluation pipelines&lt;/li&gt;
&lt;li&gt;Regression testing&lt;/li&gt;
&lt;li&gt;Prompt experimentation&lt;/li&gt;
&lt;li&gt;AI system quality measurement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes it particularly attractive for organizations deploying AI into production at scale.&lt;/p&gt;

&lt;p&gt;Strengths&lt;/p&gt;

&lt;p&gt;✅ Enterprise-grade workflows&lt;/p&gt;

&lt;p&gt;✅ Strong evaluation capabilities&lt;/p&gt;

&lt;p&gt;✅ Regression testing&lt;/p&gt;

&lt;p&gt;✅ Production monitoring&lt;/p&gt;

&lt;p&gt;Weaknesses&lt;/p&gt;

&lt;p&gt;❌ Less attractive for hobby projects&lt;/p&gt;

&lt;p&gt;❌ Commercial-first offering&lt;/p&gt;

&lt;p&gt;Best For&lt;/p&gt;

&lt;p&gt;Organizations that treat AI systems like mission-critical software.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;LangSmith&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What Stands Out&lt;/p&gt;

&lt;p&gt;If your stack is already built around LangChain or LangGraph, LangSmith feels almost automatic.&lt;/p&gt;

&lt;p&gt;The integration is excellent.&lt;/p&gt;

&lt;p&gt;You get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent traces&lt;/li&gt;
&lt;li&gt;Execution paths&lt;/li&gt;
&lt;li&gt;Prompt inspection&lt;/li&gt;
&lt;li&gt;Chain debugging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;with minimal effort.&lt;/p&gt;

&lt;p&gt;Strengths&lt;/p&gt;

&lt;p&gt;✅ Best LangChain integration&lt;/p&gt;

&lt;p&gt;✅ Excellent trace visualization&lt;/p&gt;

&lt;p&gt;✅ Fast setup&lt;/p&gt;

&lt;p&gt;✅ Agent debugging experience&lt;/p&gt;

&lt;p&gt;Weaknesses&lt;/p&gt;

&lt;p&gt;❌ Less attractive outside LangChain ecosystems&lt;/p&gt;

&lt;p&gt;❌ Limited self-hosting options&lt;/p&gt;

&lt;p&gt;Best For&lt;/p&gt;

&lt;p&gt;Teams deeply invested in LangChain or LangGraph.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Helicone&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What Stands Out&lt;/p&gt;

&lt;p&gt;Helicone is probably the easiest way to understand where your AI budget is going.&lt;/p&gt;

&lt;p&gt;Its focus is much more operational than evaluation-centric.&lt;/p&gt;

&lt;p&gt;You get visibility into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Request volume&lt;/li&gt;
&lt;li&gt;Token usage&lt;/li&gt;
&lt;li&gt;Model consumption&lt;/li&gt;
&lt;li&gt;Cost breakdowns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;without significant complexity.&lt;/p&gt;

&lt;p&gt;Strengths&lt;/p&gt;

&lt;p&gt;✅ Excellent cost analytics&lt;/p&gt;

&lt;p&gt;✅ Quick integration&lt;/p&gt;

&lt;p&gt;✅ OpenAI proxy model&lt;/p&gt;

&lt;p&gt;✅ Lightweight deployment&lt;/p&gt;

&lt;p&gt;Weaknesses&lt;/p&gt;

&lt;p&gt;❌ Evaluation capabilities lag competitors&lt;/p&gt;

&lt;p&gt;❌ Less sophisticated tracing&lt;/p&gt;

&lt;p&gt;Best For&lt;/p&gt;

&lt;p&gt;Startups trying to control AI infrastructure costs.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Arize&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What Stands Out&lt;/p&gt;

&lt;p&gt;Arize comes from the machine learning observability world.&lt;/p&gt;

&lt;p&gt;As a result, it brings strong production monitoring capabilities that many AI-native tools still lack.&lt;/p&gt;

&lt;p&gt;The platform is particularly strong when organizations combine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traditional ML systems&lt;/li&gt;
&lt;li&gt;Recommendation systems&lt;/li&gt;
&lt;li&gt;LLM applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;inside the same environment.&lt;/p&gt;

&lt;p&gt;Strengths&lt;/p&gt;

&lt;p&gt;✅ Mature monitoring platform&lt;/p&gt;

&lt;p&gt;✅ Strong evaluation tooling&lt;/p&gt;

&lt;p&gt;✅ Enterprise scale&lt;/p&gt;

&lt;p&gt;✅ ML + LLM support&lt;/p&gt;

&lt;p&gt;Weaknesses&lt;/p&gt;

&lt;p&gt;❌ Can feel overwhelming for small teams&lt;/p&gt;

&lt;p&gt;❌ Higher operational complexity&lt;/p&gt;

&lt;p&gt;Best For&lt;/p&gt;

&lt;p&gt;Large-scale AI platforms operating in production.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Braintrust&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What Stands Out&lt;/p&gt;

&lt;p&gt;Braintrust takes a different approach.&lt;/p&gt;

&lt;p&gt;Rather than starting with traces, it starts with evaluations.&lt;/p&gt;

&lt;p&gt;The philosophy is simple:&lt;/p&gt;

&lt;p&gt;“If you can’t measure quality, you can’t improve quality.”&lt;/p&gt;

&lt;p&gt;This makes Braintrust especially useful for teams focused on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt optimization&lt;/li&gt;
&lt;li&gt;Model comparisons&lt;/li&gt;
&lt;li&gt;Benchmarking&lt;/li&gt;
&lt;li&gt;Continuous evaluation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Strengths&lt;/p&gt;

&lt;p&gt;✅ Excellent evaluation workflows&lt;/p&gt;

&lt;p&gt;✅ Dataset management&lt;/p&gt;

&lt;p&gt;✅ Benchmarking capabilities&lt;/p&gt;

&lt;p&gt;✅ Model comparison workflows&lt;/p&gt;

&lt;p&gt;Weaknesses&lt;/p&gt;

&lt;p&gt;❌ Less focused on operational monitoring&lt;/p&gt;

&lt;p&gt;❌ Tracing is not the primary strength&lt;/p&gt;

&lt;p&gt;Best For&lt;/p&gt;

&lt;p&gt;Teams building evaluation-driven AI development processes.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Phoenix&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What Stands Out&lt;/p&gt;

&lt;p&gt;Phoenix is one of the strongest open-source alternatives available.&lt;/p&gt;

&lt;p&gt;It provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tracing&lt;/li&gt;
&lt;li&gt;Evaluation workflows&lt;/li&gt;
&lt;li&gt;Debugging capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;without introducing significant operational overhead.&lt;/p&gt;

&lt;p&gt;Many engineers adopt Phoenix because they want observability without committing to a larger commercial ecosystem.&lt;/p&gt;

&lt;p&gt;Strengths&lt;/p&gt;

&lt;p&gt;✅ Open source&lt;/p&gt;

&lt;p&gt;✅ Lightweight deployment&lt;/p&gt;

&lt;p&gt;✅ Good tracing&lt;/p&gt;

&lt;p&gt;✅ Simple adoption&lt;/p&gt;

&lt;p&gt;Weaknesses&lt;/p&gt;

&lt;p&gt;❌ Smaller ecosystem&lt;/p&gt;

&lt;p&gt;❌ Fewer enterprise features&lt;/p&gt;

&lt;p&gt;Best For&lt;/p&gt;

&lt;p&gt;Engineers wanting lightweight observability with minimal complexity.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;My Recommendations&lt;/p&gt;

&lt;p&gt;If I had to choose today:&lt;/p&gt;

&lt;h2&gt;
  
  
  My Recommendations
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Best Overall&lt;/td&gt;
&lt;td&gt;Langfuse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best Enterprise Choice&lt;/td&gt;
&lt;td&gt;HoneyHive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best LangChain Experience&lt;/td&gt;
&lt;td&gt;LangSmith&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best Cost Tracking&lt;/td&gt;
&lt;td&gt;Helicone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best Evaluation Platform&lt;/td&gt;
&lt;td&gt;Braintrust&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best Production Monitoring&lt;/td&gt;
&lt;td&gt;Arize&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best Lightweight Open Source Option&lt;/td&gt;
&lt;td&gt;Phoenix&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Final Thoughts&lt;/p&gt;

&lt;p&gt;The interesting thing about AI observability tools is that most of them solve similar problems.&lt;/p&gt;

&lt;p&gt;The real difference is where they place their emphasis.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Langfuse focuses on flexibility.&lt;/li&gt;
&lt;li&gt;HoneyHive focuses on enterprise quality.&lt;/li&gt;
&lt;li&gt;LangSmith focuses on developer productivity.&lt;/li&gt;
&lt;li&gt;Helicone focuses on costs.&lt;/li&gt;
&lt;li&gt;Arize focuses on production monitoring.&lt;/li&gt;
&lt;li&gt;Braintrust focuses on evaluations.&lt;/li&gt;
&lt;li&gt;Phoenix focuses on lightweight open-source adoption.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is no universally “best” platform.&lt;/p&gt;

&lt;p&gt;The right choice depends on what bottleneck you’re trying to solve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Debugging?&lt;/li&gt;
&lt;li&gt;Evaluation?&lt;/li&gt;
&lt;li&gt;Monitoring?&lt;/li&gt;
&lt;li&gt;Cost optimization?&lt;/li&gt;
&lt;li&gt;Enterprise governance?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Choose the tool that aligns with that bottleneck, and you’ll likely get far more value than chasing feature checklists.&lt;/p&gt;

&lt;p&gt;What AI observability platform are you currently using, and what made you choose it?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>observability</category>
      <category>softwareengineering</category>
      <category>sre</category>
    </item>
    <item>
      <title>How Senior Engineers Use AI Without Burning Through Token Limits - Reduce AI Token Usage by 60–90%</title>
      <dc:creator>Parth Sarthi Sharma</dc:creator>
      <pubDate>Sat, 06 Jun 2026 12:16:25 +0000</pubDate>
      <link>https://dev.to/parth_sarthisharma_105e7/how-senior-ai-engineers-use-ai-without-burning-through-token-limits-reduce-ai-token-usage-by-4cpl</link>
      <guid>https://dev.to/parth_sarthisharma_105e7/how-senior-ai-engineers-use-ai-without-burning-through-token-limits-reduce-ai-token-usage-by-4cpl</guid>
      <description>&lt;p&gt;Last month I watched a developer exhaust their Claude usage limit in less than a week.&lt;/p&gt;

&lt;p&gt;They weren't generating massive applications.&lt;/p&gt;

&lt;p&gt;They weren't building complex AI systems.&lt;/p&gt;

&lt;p&gt;They were simply asking AI to repeatedly scan the same repository, read the same files, and explain the same architecture over and over again.&lt;/p&gt;

&lt;p&gt;Sound familiar?&lt;/p&gt;

&lt;p&gt;As AI-assisted development becomes mainstream, many teams are discovering a new engineering challenge:&lt;/p&gt;

&lt;p&gt;Token efficiency.&lt;/p&gt;

&lt;p&gt;Just as experienced engineers learned to optimize cloud spend, senior engineers are now learning to optimize AI context.&lt;/p&gt;

&lt;p&gt;The difference between a developer who runs out of tokens every few days and one who comfortably works all month often isn't the AI model.&lt;/p&gt;

&lt;p&gt;It's how they manage context.&lt;/p&gt;

&lt;p&gt;Here's the toolkit and workflow I've seen work consistently.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hidden Cost of Vibe Coding
&lt;/h2&gt;

&lt;p&gt;Imagine you ask:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fix a bug in PaymentService.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your AI assistant proceeds to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scan the entire repository&lt;/li&gt;
&lt;li&gt;Read infrastructure code&lt;/li&gt;
&lt;li&gt;Explore frontend folders&lt;/li&gt;
&lt;li&gt;Traverse documentation&lt;/li&gt;
&lt;li&gt;Load previous conversations&lt;/li&gt;
&lt;li&gt;Inspect unrelated dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You asked about one file.&lt;/p&gt;

&lt;p&gt;The model consumed context from hundreds.&lt;/p&gt;

&lt;p&gt;That's where your tokens disappear.&lt;/p&gt;

&lt;p&gt;The goal isn't to reduce intelligence.&lt;/p&gt;

&lt;p&gt;The goal is to reduce unnecessary context.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. RTK: Stop Paying For Useless Command Output
&lt;/h2&gt;

&lt;p&gt;One of the biggest hidden token sinks is terminal output.&lt;/p&gt;

&lt;p&gt;Many AI coding agents automatically consume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;npm install logs&lt;/li&gt;
&lt;li&gt;build outputs&lt;/li&gt;
&lt;li&gt;test results&lt;/li&gt;
&lt;li&gt;deployment logs&lt;/li&gt;
&lt;li&gt;dependency resolution messages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most of this information is irrelevant.&lt;/p&gt;

&lt;p&gt;Tools like RTK solve this problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  What RTK Does
&lt;/h2&gt;

&lt;p&gt;RTK acts as a proxy layer between your development environment and the LLM.&lt;/p&gt;

&lt;p&gt;Instead of forwarding everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;RTK filters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;redundant messages&lt;/li&gt;
&lt;li&gt;repeated warnings&lt;/li&gt;
&lt;li&gt;progress indicators&lt;/li&gt;
&lt;li&gt;noise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;before they ever reach the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits
&lt;/h2&gt;

&lt;p&gt;Reported reductions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;60–90% reduction in token consumption for common development workflows&lt;/li&gt;
&lt;li&gt;Faster agent reasoning&lt;/li&gt;
&lt;li&gt;Cleaner context windows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The principle is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If a human wouldn't read it, the model probably shouldn't either.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  2. Lean-CTX: Compress Context Before It Reaches The Model
&lt;/h2&gt;

&lt;p&gt;Most developers optimize prompts.&lt;/p&gt;

&lt;p&gt;Few optimize files.&lt;/p&gt;

&lt;p&gt;Large source files often contain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generated code&lt;/li&gt;
&lt;li&gt;comments&lt;/li&gt;
&lt;li&gt;repetitive structures&lt;/li&gt;
&lt;li&gt;boilerplate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lean-CTX dynamically compresses and optimizes file content before it gets sent to the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why It Matters
&lt;/h2&gt;

&lt;p&gt;Instead of sending:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;4,000 line file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;you might send:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Relevant functions
Dependencies
Symbols
Interfaces
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI receives the information it needs while consuming significantly fewer tokens.&lt;/p&gt;

&lt;p&gt;Think of it as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;gzip for AI context.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  3. AI Codex &amp;amp; Repository Indexers
&lt;/h2&gt;

&lt;p&gt;One of the most expensive activities in AI coding is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Explore my codebase."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model begins reading dozens of files trying to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;routes&lt;/li&gt;
&lt;li&gt;APIs&lt;/li&gt;
&lt;li&gt;schemas&lt;/li&gt;
&lt;li&gt;services&lt;/li&gt;
&lt;li&gt;components&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This exploratory phase can easily burn tens of thousands of tokens.&lt;/p&gt;

&lt;p&gt;Repository indexing tools solve this.&lt;/p&gt;

&lt;h2&gt;
  
  
  What They Generate
&lt;/h2&gt;

&lt;p&gt;Instead of scanning everything:&lt;/p&gt;

&lt;p&gt;Generate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ROUTES.md
DATABASE_SCHEMA.md
COMPONENTS.md
SERVICES.md
DEPENDENCIES.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the AI can understand the system from five small files instead of 500 source files.&lt;/p&gt;

&lt;h2&gt;
  
  
  Typical Savings
&lt;/h2&gt;

&lt;p&gt;Many teams report avoiding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;30k–50k tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;during initial codebase exploration.&lt;/p&gt;

&lt;p&gt;This is one of the highest ROI improvements you can make.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Caveman Rule: My Favorite Token Hack
&lt;/h2&gt;

&lt;p&gt;This sounds ridiculous.&lt;/p&gt;

&lt;p&gt;But it works.&lt;/p&gt;

&lt;p&gt;When you need code, you don't need essays.&lt;/p&gt;

&lt;p&gt;You don't need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Certainly! Here's a detailed explanation...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bug here.
Fix this.
Run test.
Done.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Caveman Rule instructs the AI to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;skip conversational filler&lt;/li&gt;
&lt;li&gt;avoid lengthy summaries&lt;/li&gt;
&lt;li&gt;communicate with minimal words&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Instead of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I've identified several possible root causes...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Null value here.
Add guard clause.
Problem solved.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The technical accuracy remains.&lt;/p&gt;

&lt;p&gt;The verbosity disappears.&lt;/p&gt;

&lt;p&gt;Many developers report output token reductions approaching 75%.&lt;/p&gt;




&lt;h2&gt;
  
  
  Create A Project Brain
&lt;/h2&gt;

&lt;p&gt;One of the biggest mistakes I see:&lt;/p&gt;

&lt;p&gt;Developers repeatedly explaining their project.&lt;/p&gt;

&lt;p&gt;Every new session starts with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;We're using:
- Node.js
- PostgreSQL
- Kubernetes
- OpenTelemetry
- GitHub Actions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again.&lt;/p&gt;

&lt;p&gt;And again.&lt;/p&gt;

&lt;p&gt;And again.&lt;/p&gt;

&lt;p&gt;Instead create:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLAUDE.md
AGENTS.md
PROJECT_CONTEXT.md
ARCHITECTURE.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Store:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;architecture&lt;/li&gt;
&lt;li&gt;conventions&lt;/li&gt;
&lt;li&gt;coding standards&lt;/li&gt;
&lt;li&gt;deployment patterns&lt;/li&gt;
&lt;li&gt;repository structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now every session starts with shared understanding.&lt;/p&gt;

&lt;p&gt;The AI spends less time learning.&lt;/p&gt;

&lt;p&gt;You spend fewer tokens teaching.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fragmented Code Approach
&lt;/h2&gt;

&lt;p&gt;Another expensive habit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Rewrite the entire file.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI responds with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2,000 lines
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You pay for all of it.&lt;/p&gt;

&lt;p&gt;Instead ask:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Modify only lines 120–150.
Return patch only.
No summary.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fewer output tokens&lt;/li&gt;
&lt;li&gt;smaller future context&lt;/li&gt;
&lt;li&gt;easier reviews&lt;/li&gt;
&lt;li&gt;lower costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The best AI engineers increasingly think in patches, not rewrites.&lt;/p&gt;




&lt;h2&gt;
  
  
  Native IDE Features Most Developers Ignore
&lt;/h2&gt;

&lt;p&gt;Many modern AI IDEs already provide token optimization features.&lt;/p&gt;

&lt;p&gt;Most people never use them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cost Caps
&lt;/h2&gt;

&lt;p&gt;Set:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;maximum tool calls&lt;/li&gt;
&lt;li&gt;session budgets&lt;/li&gt;
&lt;li&gt;usage limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Treat tokens like cloud spend.&lt;/p&gt;

&lt;p&gt;Because they are.&lt;/p&gt;




&lt;h2&gt;
  
  
  Compact Sessions
&lt;/h2&gt;

&lt;p&gt;Claude and other tools support context compaction.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/compact
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This removes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;redundant conversation history&lt;/li&gt;
&lt;li&gt;obsolete decisions&lt;/li&gt;
&lt;li&gt;resolved issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;while preserving important context.&lt;/p&gt;

&lt;p&gt;Think:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;garbage collection for conversations.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  New Session, New Problem
&lt;/h2&gt;

&lt;p&gt;One of the easiest wins:&lt;/p&gt;

&lt;p&gt;Start fresh.&lt;/p&gt;

&lt;p&gt;When:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a feature is complete&lt;/li&gt;
&lt;li&gt;a bug is resolved&lt;/li&gt;
&lt;li&gt;you're switching domains&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;create a new session.&lt;/p&gt;

&lt;p&gt;Old conversations become baggage.&lt;/p&gt;

&lt;p&gt;The model keeps re-reading:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;mistakes&lt;/li&gt;
&lt;li&gt;abandoned approaches&lt;/li&gt;
&lt;li&gt;irrelevant context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fresh context often produces better results.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Personal Context Engineering Checklist
&lt;/h2&gt;

&lt;p&gt;Before asking AI anything:&lt;/p&gt;

&lt;h2&gt;
  
  
  Repository
&lt;/h2&gt;

&lt;p&gt;Exclude:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node_modules/
dist/
coverage/
build/
.next/
target/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Context
&lt;/h2&gt;

&lt;p&gt;Maintain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLAUDE.md
AGENTS.md
ARCHITECTURE.md
PROJECT_CONTEXT.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Tooling
&lt;/h2&gt;

&lt;p&gt;Use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RTK&lt;/li&gt;
&lt;li&gt;Lean-CTX&lt;/li&gt;
&lt;li&gt;AI Codex&lt;/li&gt;
&lt;li&gt;Repository Indexers&lt;/li&gt;
&lt;li&gt;Semantic Search&lt;/li&gt;
&lt;li&gt;Code Graphs&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Prompting
&lt;/h2&gt;

&lt;p&gt;Prefer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Patch only.
No summary.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;instead of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Explain everything.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Sessions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Compact regularly&lt;/li&gt;
&lt;li&gt;Start fresh often&lt;/li&gt;
&lt;li&gt;Keep contexts small&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;For years we optimized:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cloud costs&lt;/li&gt;
&lt;li&gt;compute costs&lt;/li&gt;
&lt;li&gt;storage costs&lt;/li&gt;
&lt;li&gt;network costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now we need to optimize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;context costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The next generation of high-performing AI engineers won't be the people with the biggest context windows.&lt;/p&gt;

&lt;p&gt;They'll be the people who know exactly what context to send.&lt;/p&gt;

&lt;p&gt;Prompt engineering helped us talk to AI.&lt;/p&gt;

&lt;p&gt;Context engineering helps us scale AI.&lt;/p&gt;

&lt;p&gt;And in the age of vibe coding, context is the new compute.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vibecoding</category>
      <category>machinelearning</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Reflection vs Reflexion Agents: The Next Leap in Agentic AI</title>
      <dc:creator>Parth Sarthi Sharma</dc:creator>
      <pubDate>Sun, 22 Mar 2026 03:04:03 +0000</pubDate>
      <link>https://dev.to/parth_sarthisharma_105e7/reflection-vs-reflexion-agents-the-next-leap-in-agentic-ai-1k0m</link>
      <guid>https://dev.to/parth_sarthisharma_105e7/reflection-vs-reflexion-agents-the-next-leap-in-agentic-ai-1k0m</guid>
      <description>&lt;p&gt;As generative AI systems evolve from simple prompt-response tools into &lt;strong&gt;autonomous agents&lt;/strong&gt;, one capability is becoming increasingly critical:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The ability for AI systems to &lt;strong&gt;improve themselves during execution&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is where two powerful concepts come into play:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reflection&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reflexion&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They sound similar. They are often confused.&lt;br&gt;&lt;br&gt;
But architecturally — and practically — they are very different.&lt;/p&gt;

&lt;p&gt;Let’s break them down.&lt;/p&gt;


&lt;h2&gt;
  
  
  🚀 Why This Matters
&lt;/h2&gt;

&lt;p&gt;If you're building:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI copilots&lt;/li&gt;
&lt;li&gt;Autonomous workflows&lt;/li&gt;
&lt;li&gt;Multi-step reasoning systems&lt;/li&gt;
&lt;li&gt;Or agentic architectures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then &lt;strong&gt;how your system learns from mistakes&lt;/strong&gt; will define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accuracy&lt;/li&gt;
&lt;li&gt;Reliability&lt;/li&gt;
&lt;li&gt;Cost efficiency&lt;/li&gt;
&lt;li&gt;User trust&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  🧠 What is Reflection?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Reflection&lt;/strong&gt; is when an AI system:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Reviews its own output and improves it &lt;strong&gt;within the same execution loop&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  🔁 How it works
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Generate response
&lt;/li&gt;
&lt;li&gt;Evaluate response (self-critique or evaluator model)
&lt;/li&gt;
&lt;li&gt;Refine response
&lt;/li&gt;
&lt;li&gt;Repeat until acceptable
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  🧩 Architecture Pattern
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Input
↓
LLM → Output
↓
Self-Evaluation (LLM or rule-based)
↓
Refinement Loop
↓
Final Output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  ✅ Key Characteristics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Happens &lt;strong&gt;within a single session&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;No memory across runs&lt;/li&gt;
&lt;li&gt;Iterative improvement&lt;/li&gt;
&lt;li&gt;Often uses:

&lt;ul&gt;
&lt;li&gt;Self-critique prompts&lt;/li&gt;
&lt;li&gt;Evaluation models&lt;/li&gt;
&lt;li&gt;Chain-of-thought refinement&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  💡 Example
&lt;/h3&gt;

&lt;p&gt;User asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Summarize this legal document."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Reflection agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generates summary&lt;/li&gt;
&lt;li&gt;Checks:

&lt;ul&gt;
&lt;li&gt;Missing clauses?&lt;/li&gt;
&lt;li&gt;Ambiguity?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Refines output&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  👍 Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Improves output quality instantly
&lt;/li&gt;
&lt;li&gt;No infrastructure complexity
&lt;/li&gt;
&lt;li&gt;Easy to implement
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  👎 Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;No long-term learning
&lt;/li&gt;
&lt;li&gt;Repeats same mistakes across sessions
&lt;/li&gt;
&lt;li&gt;Increased latency (multiple LLM calls)&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  🔁 What is Reflexion?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Reflexion&lt;/strong&gt; goes a step further.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It enables an AI system to &lt;strong&gt;learn from past mistakes and improve future performance&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This concept was popularized by research on &lt;strong&gt;self-improving agents with memory&lt;/strong&gt;.&lt;/p&gt;


&lt;h3&gt;
  
  
  🔄 How it works
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Perform task
&lt;/li&gt;
&lt;li&gt;Evaluate outcome
&lt;/li&gt;
&lt;li&gt;Store feedback in memory
&lt;/li&gt;
&lt;li&gt;Use memory to improve future decisions
&lt;/li&gt;
&lt;/ol&gt;


&lt;h3&gt;
  
  
  🧩 Architecture Pattern
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Input
↓
Agent Execution
↓
Outcome Evaluation
↓
Memory Store (success/failure insights)
↓
Future Runs Use Memory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🧠 Key Difference
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Reflection&lt;/th&gt;
&lt;th&gt;Reflexion&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Session-based&lt;/td&gt;
&lt;td&gt;Cross-session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No memory&lt;/td&gt;
&lt;td&gt;Persistent memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Improves current output&lt;/td&gt;
&lt;td&gt;Improves future outputs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stateless&lt;/td&gt;
&lt;td&gt;Stateful&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h3&gt;
  
  
  💡 Example
&lt;/h3&gt;

&lt;p&gt;AI agent writing grant applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Attempt 1: Rejected ❌
&lt;/li&gt;
&lt;li&gt;Stores feedback:

&lt;ul&gt;
&lt;li&gt;"Too generic"&lt;/li&gt;
&lt;li&gt;"Lacks domain-specific references"&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next attempt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses stored insights&lt;/li&gt;
&lt;li&gt;Produces better output ✅&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  🔥 Why Reflexion is a Big Deal
&lt;/h2&gt;

&lt;p&gt;Reflexion introduces something critical:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Learning without retraining the model&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead of fine-tuning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You &lt;strong&gt;store experiences&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;You &lt;strong&gt;adapt behavior dynamically&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  🏗️ Real-World Implementation
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Reflection (simple)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Prompt chaining&lt;/li&gt;
&lt;li&gt;Self-critique prompts&lt;/li&gt;
&lt;li&gt;ReAct-style loops&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Reflexion (advanced)
&lt;/h3&gt;

&lt;p&gt;Requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory layer:

&lt;ul&gt;
&lt;li&gt;Vector DB (e.g., embeddings)&lt;/li&gt;
&lt;li&gt;Key-value store&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Feedback signals:

&lt;ul&gt;
&lt;li&gt;Human feedback&lt;/li&gt;
&lt;li&gt;Automated scoring&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Retrieval mechanism:

&lt;ul&gt;
&lt;li&gt;Inject past learnings into prompts&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  ⚙️ Example Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;LLM: Claude / GPT / Nova
&lt;/li&gt;
&lt;li&gt;Memory: Vector DB (FAISS, OpenSearch)
&lt;/li&gt;
&lt;li&gt;Orchestration: LangChain / custom agents
&lt;/li&gt;
&lt;li&gt;Evaluation: Rule-based or LLM-as-judge
&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  ⚖️ When to Use What?
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Use Reflection when:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You need &lt;strong&gt;better answers now&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;No need for memory&lt;/li&gt;
&lt;li&gt;Simpler workflows&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Use Reflexion when:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Tasks are &lt;strong&gt;repetitive and evolving&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Feedback is available&lt;/li&gt;
&lt;li&gt;Long-term improvement matters&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  🧠 Combining Both (Best Practice)
&lt;/h2&gt;

&lt;p&gt;The most powerful systems use &lt;strong&gt;both&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Reflexion (long-term learning)
+
Reflection (short-term refinement)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 This creates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Immediate quality improvement
&lt;/li&gt;
&lt;li&gt;Continuous learning over time
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🧪 Real-World Use Cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AI coding assistants
&lt;/li&gt;
&lt;li&gt;Customer support agents
&lt;/li&gt;
&lt;li&gt;Financial advisory copilots
&lt;/li&gt;
&lt;li&gt;Healthcare decision support
&lt;/li&gt;
&lt;li&gt;Autonomous research assistants
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ⚠️ Challenges
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Reflection
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cost (multiple LLM calls)&lt;/li&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reflexion
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Memory design complexity
&lt;/li&gt;
&lt;li&gt;Signal quality (bad feedback = bad learning)
&lt;/li&gt;
&lt;li&gt;Retrieval accuracy
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🧭 Final Thoughts
&lt;/h2&gt;

&lt;p&gt;We are moving from:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Prompt → Response  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Prompt → Reason → Reflect → Learn → Improve  &lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  🔥 Key Insight
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Reflection makes AI &lt;strong&gt;smarter in the moment&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Reflexion makes AI &lt;strong&gt;smarter over time&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  ✍️ Closing
&lt;/h2&gt;

&lt;p&gt;If you're building next-gen AI systems,&lt;br&gt;&lt;br&gt;
understanding this difference is not optional — it's foundational.&lt;/p&gt;

&lt;p&gt;The future of AI is not just about &lt;strong&gt;better models&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It’s about &lt;strong&gt;better systems around those models&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;💬 Curious how to implement Reflexion in production?&lt;br&gt;&lt;br&gt;
Happy to share a deep dive in the next post.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agentskills</category>
      <category>llm</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Prompt Engineering Is Not Enough: Enter Flow Engineering for Production LLM Systems</title>
      <dc:creator>Parth Sarthi Sharma</dc:creator>
      <pubDate>Sat, 07 Mar 2026 03:20:46 +0000</pubDate>
      <link>https://dev.to/parth_sarthisharma_105e7/prompt-engineering-is-not-enough-enter-flow-engineering-for-production-llm-systems-47ic</link>
      <guid>https://dev.to/parth_sarthisharma_105e7/prompt-engineering-is-not-enough-enter-flow-engineering-for-production-llm-systems-47ic</guid>
      <description>&lt;p&gt;Large Language Models have unlocked a new generation of applications — copilots, assistants, RAG systems, autonomous agents, and internal AI tools.&lt;/p&gt;

&lt;p&gt;But many teams building with LLMs hit the same wall.&lt;/p&gt;

&lt;p&gt;Their application works in demos… but becomes unreliable in production.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because &lt;strong&gt;prompt engineering alone is not enough.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To build reliable AI systems, we need something more powerful:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flow Engineering.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this article, we'll explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why prompt engineering alone fails in production&lt;/li&gt;
&lt;li&gt;What &lt;strong&gt;Flow Engineering&lt;/strong&gt; actually means&lt;/li&gt;
&lt;li&gt;The architecture of real-world LLM systems&lt;/li&gt;
&lt;li&gt;Practical examples engineers can implement today&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Era of Prompt Engineering
&lt;/h2&gt;

&lt;p&gt;When GPT-style models first became popular, the focus was on &lt;strong&gt;prompt engineering&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Prompt engineering is the art of crafting instructions to guide the LLM to produce better responses.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a helpful assistant. 
Summarise the following meeting transcript in bullet points.
Focus only on action items.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Developers quickly discovered techniques like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Few-shot prompting&lt;/li&gt;
&lt;li&gt;Chain-of-thought prompts&lt;/li&gt;
&lt;li&gt;Role prompting&lt;/li&gt;
&lt;li&gt;Structured output prompts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These techniques &lt;strong&gt;improve individual LLM calls.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But they only solve part of the problem.&lt;/p&gt;

&lt;p&gt;Prompt engineering optimises one interaction.&lt;/p&gt;

&lt;p&gt;Real applications involve &lt;strong&gt;many interactions and system components.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Prompt-Only Systems
&lt;/h2&gt;

&lt;p&gt;Let's imagine we are building a simple &lt;strong&gt;customer support AI assistant.&lt;/strong&gt;&lt;br&gt;
A naive architecture might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Question
      ↓
     LLM
      ↓
   Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works in simple demos.&lt;/p&gt;

&lt;p&gt;But real systems quickly require more complexity.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieve relevant documents&lt;/li&gt;
&lt;li&gt;Use tools (APIs, databases)&lt;/li&gt;
&lt;li&gt;Validate outputs&lt;/li&gt;
&lt;li&gt;Retry on errors&lt;/li&gt;
&lt;li&gt;Maintain conversation context&lt;/li&gt;
&lt;li&gt;Apply guardrails&lt;/li&gt;
&lt;li&gt;Log reasoning steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Suddenly, our architecture looks more like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Question
      ↓
Context Retrieval (RAG)
      ↓
Tool Selection
      ↓
LLM Reasoning
      ↓
Output Validation
      ↓
Response Generation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This &lt;strong&gt;multi-step pipeline&lt;/strong&gt; is where &lt;strong&gt;Flow Engineering&lt;/strong&gt; comes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Flow Engineering?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Flow Engineering&lt;/strong&gt; is the design of structured execution flows around LLMs.&lt;/p&gt;

&lt;p&gt;Instead of focusing on a single prompt, engineers design &lt;strong&gt;end-to-end reasoning pipelines.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Think of it as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Prompt Engineering = How the LLM thinks

Flow Engineering = How the system operates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Flow engineering involves designing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Execution pipelines&lt;/li&gt;
&lt;li&gt;Tool orchestration&lt;/li&gt;
&lt;li&gt;State management&lt;/li&gt;
&lt;li&gt;Error handling&lt;/li&gt;
&lt;li&gt;Validation&lt;/li&gt;
&lt;li&gt;Feedback loops&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Flow engineering treats LLM applications as distributed systems, not chatbots.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  A Real Production Flow
&lt;/h2&gt;

&lt;p&gt;Let's look at a simplified &lt;strong&gt;production AI flow.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Question
   ↓
Input Guardrails
   ↓
Context Retrieval (Vector DB)
   ↓
Tool Routing
   ↓
LLM Reasoning
   ↓
Tool Execution
   ↓
Response Validation
   ↓
Final Answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each step solves a real engineering problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guardrails&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Prevent prompt injection or malicious input.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context Retrieval&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fetch relevant documents using vector search.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool Routing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Determine which tools the AI should use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ensure output matches schema or safety rules.&lt;/p&gt;

&lt;p&gt;Without this flow, AI systems become &lt;strong&gt;unpredictable.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Example: Prompt vs Flow
&lt;/h2&gt;

&lt;p&gt;Let's compare two implementations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt Engineering Only&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarise this transcript and extract action items.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This may work sometimes.&lt;/p&gt;

&lt;p&gt;But what if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transcript is too long&lt;/li&gt;
&lt;li&gt;model hallucinate action items&lt;/li&gt;
&lt;li&gt;output format changes&lt;/li&gt;
&lt;li&gt;context is missing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now let's see a &lt;strong&gt;flow-based approach.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Example: Flow Engineered System
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_meeting_summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;split_transcript&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;summaries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarise this transcript section:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;summaries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;combined_summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Combine these summaries and extract action items:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;summaries&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;validated_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;validate_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;combined_summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;validated_output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;chunking&lt;/li&gt;
&lt;li&gt;intermediate reasoning&lt;/li&gt;
&lt;li&gt;structured validation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This dramatically improves reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Components of Flow Engineering
&lt;/h2&gt;

&lt;p&gt;Most production LLM flows include these components.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. State Management&lt;/strong&gt;&lt;br&gt;
Flows maintain state across steps.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Conversation History
Retrieved Documents
Tool Results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Frameworks like &lt;strong&gt;LangGraph&lt;/strong&gt; model this using state machines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Tool Orchestration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LLMs often interact with tools.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;databases&lt;/li&gt;
&lt;li&gt;APIs&lt;/li&gt;
&lt;li&gt;search engines&lt;/li&gt;
&lt;li&gt;internal systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Flow engineering controls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which tool to use&lt;/li&gt;
&lt;li&gt;when to call it&lt;/li&gt;
&lt;li&gt;how to merge results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Retry &amp;amp; Error Handling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LLMs are probabilistic.&lt;/p&gt;

&lt;p&gt;Sometimes outputs are invalid.&lt;/p&gt;

&lt;p&gt;A flow can automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;retry generation&lt;/li&gt;
&lt;li&gt;correct formatting&lt;/li&gt;
&lt;li&gt;request clarification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Guardrails &amp;amp; Validation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before returning outputs, systems often validate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON schema&lt;/li&gt;
&lt;li&gt;safety policies&lt;/li&gt;
&lt;li&gt;hallucinations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevents unreliable responses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Flow Engineering Frameworks
&lt;/h2&gt;

&lt;p&gt;Several frameworks help engineers implement LLM flows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangGraph&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Models AI workflows as state machines.&lt;/p&gt;

&lt;p&gt;Great for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;complex agent workflows&lt;/li&gt;
&lt;li&gt;branching logic&lt;/li&gt;
&lt;li&gt;memory management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Semantic Kernel&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Popular in enterprise environments.&lt;/p&gt;

&lt;p&gt;Supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;planners&lt;/li&gt;
&lt;li&gt;function calling&lt;/li&gt;
&lt;li&gt;workflow orchestration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Custom Orchestration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Many teams implement flows directly using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python&lt;/li&gt;
&lt;li&gt;Node.js&lt;/li&gt;
&lt;li&gt;serverless pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because flows are essentially application logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Flow Engineering Matters
&lt;/h2&gt;

&lt;p&gt;Companies deploying production AI systems quickly discover:&lt;/p&gt;

&lt;p&gt;The challenge is not the model.&lt;/p&gt;

&lt;p&gt;The challenge is system design around the model.&lt;/p&gt;

&lt;p&gt;Flow engineering provides:&lt;/p&gt;

&lt;p&gt;✔ reliability&lt;br&gt;
✔ reproducibility&lt;br&gt;
✔ observability&lt;br&gt;
✔ safety&lt;br&gt;
✔ scalability&lt;/p&gt;

&lt;p&gt;Without it, LLM applications behave unpredictably.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Shift AI Engineers Must Make
&lt;/h2&gt;

&lt;p&gt;Early LLM development focused on prompts.&lt;/p&gt;

&lt;p&gt;But the industry is moving toward AI systems engineering.&lt;/p&gt;

&lt;p&gt;That means thinking in terms of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pipelines&lt;/li&gt;
&lt;li&gt;workflows&lt;/li&gt;
&lt;li&gt;orchestration&lt;/li&gt;
&lt;li&gt;tool ecosystems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI applications are evolving from prompt-driven apps to flow-driven systems.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Prompt engineering is still important.&lt;/p&gt;

&lt;p&gt;But in production systems, prompts are only one component.&lt;/p&gt;

&lt;p&gt;The real power of modern AI systems comes from well-designed execution flows.&lt;/p&gt;

&lt;p&gt;If you want reliable AI applications, start thinking like a systems engineer, not just a prompt writer.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s Next
&lt;/h3&gt;

&lt;p&gt;In upcoming articles, we'll dive deeper into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reflection vs Reflexion agents&lt;/li&gt;
&lt;li&gt;LangGraph state machines&lt;/li&gt;
&lt;li&gt;Semantic Kernel orchestration&lt;/li&gt;
&lt;li&gt;Model Context Protocol (MCP)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These concepts build on flow engineering to create more capable AI systems.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>promptengineering</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Secrets Management for LLM Tools: Don’t Let Your OpenAI Keys End Up on GitHub 🚨</title>
      <dc:creator>Parth Sarthi Sharma</dc:creator>
      <pubDate>Sat, 14 Feb 2026 04:04:56 +0000</pubDate>
      <link>https://dev.to/parth_sarthisharma_105e7/secrets-management-for-llm-tools-dont-let-your-openai-keys-end-up-on-github-38c0</link>
      <guid>https://dev.to/parth_sarthisharma_105e7/secrets-management-for-llm-tools-dont-let-your-openai-keys-end-up-on-github-38c0</guid>
      <description>&lt;h2&gt;
  
  
  "A practical guide to securing LLM API keys, embeddings, vector
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: If you're building with LLMs and you're not treating secrets as first-class infrastructure, you're already at risk.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every week, we see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI keys pushed to GitHub&lt;/li&gt;
&lt;li&gt;API keys logged in CloudWatch&lt;/li&gt;
&lt;li&gt;Secrets hardcoded in Streamlit demos that later go to production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLM systems multiply secrets quickly. If you don’t design for this early, things get messy fast.&lt;/p&gt;

&lt;p&gt;This is a production-ready blueprint for securing LLM systems properly.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: LLM Secrets Multiply Fast 🐰
&lt;/h2&gt;

&lt;p&gt;One LLM integration turns into dozens of credentials:&lt;/p&gt;

&lt;p&gt;1 LLM API key (OpenAI / Anthropic)&lt;br&gt;
→ 3 embedding endpoints&lt;br&gt;
→ 5 vector store connections (Pinecone / Weaviate)&lt;br&gt;
→ 2 RAG databases&lt;br&gt;
→ 10 external tools (SerpAPI, Wolfram, etc.)&lt;br&gt;
→ 50 microservices&lt;br&gt;
= 70+ secrets&lt;/p&gt;

&lt;p&gt;The bigger your AI system gets, the larger your attack surface becomes.&lt;/p&gt;


&lt;h2&gt;
  
  
  1️⃣ Never Hardcode Secrets
&lt;/h2&gt;

&lt;p&gt;❌ Wrong (guaranteed leak eventually)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# NEVER DO THIS
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-123...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hardcoded secrets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;End up in git history&lt;/li&gt;
&lt;li&gt;Get copied into logs&lt;/li&gt;
&lt;li&gt;Leak via screenshots or stack traces&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;✅ Right: Runtime Environment Injection&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# config.py
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;OPENAI_API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Principle:&lt;/strong&gt;&lt;br&gt;
Secrets should be injected at runtime, never committed to source code.&lt;/p&gt;
&lt;h2&gt;
  
  
  2️⃣ Use Cloud-Native Secrets Managers
&lt;/h2&gt;

&lt;p&gt;If you're in production, use a managed secrets service.&lt;/p&gt;

&lt;p&gt;AWS Secrets Manager + Lambda Example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# lambda_function.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_secrets&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secretsmanager&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;secret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_secret_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SecretId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm-prod/openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SecretString&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;secrets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_secrets&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="c1"&gt;# LLM logic here
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Centralized storage&lt;/li&gt;
&lt;li&gt;IAM-based access control&lt;/li&gt;
&lt;li&gt;Audit logs&lt;/li&gt;
&lt;li&gt;Automatic rotation support&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Terraform for Secret Infrastructure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_secretsmanager_secret"&lt;/span&gt; &lt;span class="s2"&gt;"llm_keys"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"llm-prod/openai"&lt;/span&gt;
  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Environment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Production"&lt;/span&gt;
    &lt;span class="nx"&gt;Team&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"AI"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_secretsmanager_secret_version"&lt;/span&gt; &lt;span class="s2"&gt;"llm_keys_version"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;secret_id&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_secretsmanager_secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;llm_keys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;secret_string&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;OPENAI_API_KEY&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sk-..."&lt;/span&gt;
    &lt;span class="nx"&gt;ANTHROPIC_API_KEY&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sk-ant-..."&lt;/span&gt;
    &lt;span class="nx"&gt;PINECONE_API_KEY&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"pxl-..."&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Infrastructure-as-Code ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repeatability&lt;/li&gt;
&lt;li&gt;Auditability&lt;/li&gt;
&lt;li&gt;No manual copy-paste secret management&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3️⃣ Prefer Dynamic Credentials Over Static API Keys ⚡
&lt;/h2&gt;

&lt;p&gt;Static API keys are long-lived and high risk.&lt;/p&gt;

&lt;p&gt;Dynamic credentials reduce blast radius.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IAM Roles for Service Accounts (Kubernetes + AWS IRSA)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-worker&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;eks.amazonaws.com/role-arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::123456789012:role/llm-worker-role&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-worker&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-worker&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-worker&lt;/span&gt;
          &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;
              &lt;span class="na"&gt;valueFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;secretKeyRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-secrets&lt;/span&gt;
                  &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai-key&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Even better:&lt;/strong&gt; eliminate API keys entirely where possible and use workload identity federation.&lt;/p&gt;




&lt;h2&gt;
  
  
  4️⃣ Secure CI/CD with OIDC (No Long-Lived AWS Keys)
&lt;/h2&gt;

&lt;p&gt;Never store AWS credentials in GitHub secrets if you can avoid it.&lt;/p&gt;

&lt;p&gt;Use OIDC federation instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy LLM Pipeline&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;id-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
      &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/configure-aws-credentials@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;role-to-assume&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::123456789012:role/github-actions-llm-deploy&lt;/span&gt;
          &lt;span class="na"&gt;aws-region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python deploy.py&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This avoids:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Static AWS access keys&lt;/li&gt;
&lt;li&gt;Manual credential rotation&lt;/li&gt;
&lt;li&gt;CI secret sprawl&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  5️⃣ Agentic LLM Systems Need Scoped Secrets 🧠
&lt;/h2&gt;

&lt;p&gt;When building multi-agent systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each agent should have scoped credentials&lt;/li&gt;
&lt;li&gt;Short-lived tokens preferred&lt;/li&gt;
&lt;li&gt;No shared global API key across agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LLMAgentSecrets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sm_client&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sm_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sm_client&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_agent_secret&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;secret_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm-agent-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;secret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sm_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_secret_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SecretId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;secret_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SecretString&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Design for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Isolation&lt;/li&gt;
&lt;li&gt;Least privilege&lt;/li&gt;
&lt;li&gt;Auditable access&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ✅ Production Security Checklist
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="err"&gt;☐&lt;/span&gt; &lt;span class="nc"&gt;No&lt;/span&gt; &lt;span class="n"&gt;hardcoded&lt;/span&gt; &lt;span class="nf"&gt;secrets&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;git&lt;/span&gt; &lt;span class="n"&gt;grep&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="s"&gt;"sk-"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="err"&gt;☐&lt;/span&gt; &lt;span class="nc"&gt;Cloud&lt;/span&gt; &lt;span class="n"&gt;secrets&lt;/span&gt; &lt;span class="n"&gt;manager&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt; &lt;span class="n"&gt;use&lt;/span&gt;
&lt;span class="err"&gt;☐&lt;/span&gt; &lt;span class="no"&gt;IAM&lt;/span&gt; &lt;span class="n"&gt;roles&lt;/span&gt; &lt;span class="n"&gt;preferred&lt;/span&gt; &lt;span class="n"&gt;over&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="n"&gt;keys&lt;/span&gt;
&lt;span class="err"&gt;☐&lt;/span&gt; &lt;span class="no"&gt;OIDC&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="no"&gt;CI&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="no"&gt;CD&lt;/span&gt;
&lt;span class="err"&gt;☐&lt;/span&gt; &lt;span class="nc"&gt;Secrets&lt;/span&gt; &lt;span class="n"&gt;scanning&lt;/span&gt; &lt;span class="nf"&gt;enabled&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TruffleHog&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;GitGuardian&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="err"&gt;☐&lt;/span&gt; &lt;span class="nc"&gt;Log&lt;/span&gt; &lt;span class="n"&gt;sanitization&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt; &lt;span class="n"&gt;place&lt;/span&gt;
&lt;span class="err"&gt;☐&lt;/span&gt; &lt;span class="nc"&gt;Rotation&lt;/span&gt; &lt;span class="n"&gt;policy&lt;/span&gt; &lt;span class="nf"&gt;defined&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="err"&gt;≤&lt;/span&gt; &lt;span class="mi"&gt;90&lt;/span&gt; &lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="err"&gt;☐&lt;/span&gt; &lt;span class="nc"&gt;Audit&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt; &lt;span class="n"&gt;enabled&lt;/span&gt;
&lt;span class="err"&gt;☐&lt;/span&gt; &lt;span class="nc"&gt;Least&lt;/span&gt; &lt;span class="n"&gt;privilege&lt;/span&gt; &lt;span class="n"&gt;enforced&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Common Leak Vectors 🚫
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Leak Vector&lt;/th&gt;
&lt;th&gt;Detection&lt;/th&gt;
&lt;th&gt;Prevention&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Git commits&lt;/td&gt;
&lt;td&gt;`git log -p&lt;/td&gt;
&lt;td&gt;grep sk-`&lt;/td&gt;
&lt;td&gt;Pre-commit hooks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logs&lt;/td&gt;
&lt;td&gt;CloudWatch Insights&lt;/td&gt;
&lt;td&gt;Log scrubbing&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Docker images&lt;/td&gt;
&lt;td&gt;Inspect image layers&lt;/td&gt;
&lt;td&gt;Multi-stage builds&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory dumps&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/proc/[pid]/environ&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Container hardening&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Cost vs Risk 💰
&lt;/h3&gt;

&lt;p&gt;Typical monthly cost for secure secrets management:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Secrets Manager: ~$0.40 per secret&lt;/li&gt;
&lt;li&gt;Secret scanning tools: modest monthly fee&lt;/li&gt;
&lt;li&gt;&lt;p&gt;OIDC: no additional cost&lt;br&gt;
Compare that to:&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Revoking leaked keys&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Service outages&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Customer trust damage&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Security is cheaper than cleanup.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways 🎯
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Dynamic &amp;gt; Static&lt;/li&gt;
&lt;li&gt;Inject at runtime, never commit&lt;/li&gt;
&lt;li&gt;Audit secret access&lt;/li&gt;
&lt;li&gt;Rotate regularly&lt;/li&gt;
&lt;li&gt;Scan continuously&lt;/li&gt;
&lt;li&gt;Apply least privilege everywhere&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;LLMs are powerful.&lt;/p&gt;

&lt;h2&gt;
  
  
  But API keys are still just credentials — treat them like production infrastructure.
&lt;/h2&gt;

&lt;p&gt;Have you ever dealt with an exposed LLM API key in production? What happened?&lt;br&gt;
Let’s discuss 👇&lt;/p&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
    </item>
    <item>
      <title>Observability in AI Systems</title>
      <dc:creator>Parth Sarthi Sharma</dc:creator>
      <pubDate>Tue, 27 Jan 2026 13:17:25 +0000</pubDate>
      <link>https://dev.to/parth_sarthisharma_105e7/observability-in-ai-systems-27ag</link>
      <guid>https://dev.to/parth_sarthisharma_105e7/observability-in-ai-systems-27ag</guid>
      <description>&lt;h2&gt;
  
  
  Why RAG Pipelines Fail Silently (and How to See It)
&lt;/h2&gt;

&lt;p&gt;Traditional software taught us a hard lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;If you can’t observe it, you can’t operate it.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AI systems — especially &lt;strong&gt;RAG pipelines&lt;/strong&gt; — are repeating the same mistakes we made with distributed systems a decade ago.&lt;/p&gt;

&lt;p&gt;They look fine.&lt;br&gt;
They respond fast.&lt;br&gt;
They return answers.&lt;/p&gt;

&lt;p&gt;And yet — they are &lt;strong&gt;quietly wrong&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This article explains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why observability is fundamentally harder in AI systems&lt;/li&gt;
&lt;li&gt;What observability &lt;em&gt;actually means&lt;/em&gt; for RAG pipelines&lt;/li&gt;
&lt;li&gt;What signals matter (and which ones don’t)&lt;/li&gt;
&lt;li&gt;How mature teams design observable AI systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No dashboards for the sake of dashboards — only what helps you debug reality.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why AI Observability Is Different From Traditional Observability
&lt;/h2&gt;

&lt;p&gt;In classic systems, we observe:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU&lt;/li&gt;
&lt;li&gt;memory&lt;/li&gt;
&lt;li&gt;latency&lt;/li&gt;
&lt;li&gt;error rates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In AI systems, the hardest failures are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Semantic&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Probabilistic&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Contextual&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A RAG pipeline can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;return HTTP 200&lt;/li&gt;
&lt;li&gt;respond in 300ms&lt;/li&gt;
&lt;li&gt;use the correct model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…and still give a &lt;strong&gt;wrong answer&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That’s why AI observability must go deeper.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Core Problem With RAG Pipelines
&lt;/h2&gt;

&lt;p&gt;A basic RAG flow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query
↓
Embedding
↓
Vector Search
↓
Top-K Chunks
↓
Prompt Assembly
↓
LLM Generation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the output is wrong, &lt;strong&gt;where did it fail?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bad query?&lt;/li&gt;
&lt;li&gt;Wrong chunks?&lt;/li&gt;
&lt;li&gt;Missing chunks?&lt;/li&gt;
&lt;li&gt;Prompt formatting?&lt;/li&gt;
&lt;li&gt;Model hallucination?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without observability, you’re guessing.&lt;/p&gt;




&lt;h2&gt;
  
  
  What “Observability” Means in the AI World
&lt;/h2&gt;

&lt;p&gt;AI observability is the ability to answer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why did the system produce &lt;em&gt;this&lt;/em&gt; answer for &lt;em&gt;this&lt;/em&gt; input at &lt;em&gt;this&lt;/em&gt; time?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That requires &lt;strong&gt;traceability&lt;/strong&gt;, not just metrics.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Four Pillars of RAG Observability
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1️⃣ Query Observability
&lt;/h3&gt;

&lt;p&gt;You must log:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Original user query&lt;/li&gt;
&lt;li&gt;Rewritten / normalized query (if any)&lt;/li&gt;
&lt;li&gt;Detected intent or routing decision&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why it matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Many failures start with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ambiguous questions&lt;/li&gt;
&lt;li&gt;underspecified intent&lt;/li&gt;
&lt;li&gt;bad query rewriting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you can’t see the &lt;em&gt;effective query&lt;/em&gt;, you can’t debug retrieval.&lt;/p&gt;




&lt;h3&gt;
  
  
  2️⃣ Retrieval Observability (Most Important)
&lt;/h3&gt;

&lt;p&gt;This is where &lt;strong&gt;most RAG systems fail&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You should observe:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieved chunk IDs&lt;/li&gt;
&lt;li&gt;Source documents&lt;/li&gt;
&lt;li&gt;Similarity scores&lt;/li&gt;
&lt;li&gt;Chunk rank&lt;/li&gt;
&lt;li&gt;Retrieval strategy used (vector, keyword, hybrid)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example questions observability should answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Which chunks were retrieved?&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Which chunk influenced the answer most?&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Was relevant information missing?&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you don’t log retrieved chunks, &lt;strong&gt;you don’t have RAG observability&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  3️⃣ Prompt Observability
&lt;/h3&gt;

&lt;p&gt;Your prompt is your &lt;strong&gt;runtime program&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You must capture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Final prompt sent to the LLM&lt;/li&gt;
&lt;li&gt;Context size and token count&lt;/li&gt;
&lt;li&gt;Chunk ordering&lt;/li&gt;
&lt;li&gt;System instructions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why?&lt;br&gt;
Because subtle changes in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ordering&lt;/li&gt;
&lt;li&gt;truncation&lt;/li&gt;
&lt;li&gt;formatting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;can completely change answers.&lt;/p&gt;


&lt;h3&gt;
  
  
  4️⃣ Generation &amp;amp; Answer Observability
&lt;/h3&gt;

&lt;p&gt;Beyond the final answer, log:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model name &amp;amp; version&lt;/li&gt;
&lt;li&gt;Temperature / decoding params&lt;/li&gt;
&lt;li&gt;Token usage&lt;/li&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;Safety or refusal triggers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Advanced systems also track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Answer confidence&lt;/li&gt;
&lt;li&gt;Self-evaluation scores (Self-RAG)&lt;/li&gt;
&lt;li&gt;Groundedness signals&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  The Most Common RAG Failure Modes (Seen in Production)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  ❌ “The model hallucinated”
&lt;/h3&gt;

&lt;p&gt;Usually false.&lt;/p&gt;

&lt;p&gt;More often:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wrong chunk retrieved&lt;/li&gt;
&lt;li&gt;Right chunk ranked too low&lt;/li&gt;
&lt;li&gt;Context truncated&lt;/li&gt;
&lt;li&gt;Outdated document used&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Observability makes this visible.&lt;/p&gt;


&lt;h3&gt;
  
  
  ❌ “Vector search is bad”
&lt;/h3&gt;

&lt;p&gt;Often:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chunking is wrong&lt;/li&gt;
&lt;li&gt;Embedding mismatch&lt;/li&gt;
&lt;li&gt;Query rewriting failed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Again — visible with the right signals.&lt;/p&gt;


&lt;h2&gt;
  
  
  Tracing a Single RAG Request (What Good Looks Like)
&lt;/h2&gt;

&lt;p&gt;A single request trace should show:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request ID: 9f23...

Query:
"Can I carry forward unused leave?"

Rewritten Query:
"Leave carry forward policy Australia"

Retrieved Chunks:

handbook.md#leave-carry-forward (score: 0.89)

policy.md#exceptions (score: 0.81)

Prompt Tokens:
3,214

Model:
gpt-4.1-mini

Answer:
"Yes, up to 10 days can be carried forward..."

Confidence:
High
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you can’t reconstruct this — you can’t debug.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Traditional Metrics Are Not Enough
&lt;/h2&gt;

&lt;p&gt;Latency and cost are necessary — but insufficient.&lt;/p&gt;

&lt;p&gt;AI systems need &lt;strong&gt;semantic metrics&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Groundedness&lt;/li&gt;
&lt;li&gt;Faithfulness&lt;/li&gt;
&lt;li&gt;Retrieval coverage&lt;/li&gt;
&lt;li&gt;Answer stability over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are harder — but essential.&lt;/p&gt;




&lt;h2&gt;
  
  
  Observability Enables Advanced RAG Patterns
&lt;/h2&gt;

&lt;p&gt;You cannot safely implement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adaptive RAG&lt;/li&gt;
&lt;li&gt;Corrective RAG&lt;/li&gt;
&lt;li&gt;Self-RAG&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;without observability.&lt;/p&gt;

&lt;p&gt;Why?&lt;br&gt;
Because all of them rely on &lt;strong&gt;feedback signals&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Was retrieval good?&lt;/li&gt;
&lt;li&gt;Was the answer grounded?&lt;/li&gt;
&lt;li&gt;Should we retry?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No signals → no control loop.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Simple Observability Checklist
&lt;/h2&gt;

&lt;p&gt;If you’re building RAG in production, you should be able to answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which document influenced this answer?&lt;/li&gt;
&lt;li&gt;Why was this chunk chosen over others?&lt;/li&gt;
&lt;li&gt;What changed compared to yesterday?&lt;/li&gt;
&lt;li&gt;Would a different retrieval strategy help?&lt;/li&gt;
&lt;li&gt;Can I replay this request?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answer is “no” — observability is missing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;RAG pipelines don’t usually fail loudly.&lt;/p&gt;

&lt;p&gt;They fail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;quietly&lt;/li&gt;
&lt;li&gt;confidently&lt;/li&gt;
&lt;li&gt;at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The future of AI systems isn’t just better models.&lt;/p&gt;

&lt;p&gt;It’s &lt;strong&gt;systems that can explain themselves&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And observability is how that starts.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you’ve debugged a RAG issue that turned out to be “invisible” at first, I’d love to hear what signal finally revealed it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>softwareengineering</category>
      <category>llm</category>
    </item>
    <item>
      <title>Self-RAG vs Adaptive RAG vs Corrective RAG</title>
      <dc:creator>Parth Sarthi Sharma</dc:creator>
      <pubDate>Thu, 22 Jan 2026 11:24:30 +0000</pubDate>
      <link>https://dev.to/parth_sarthisharma_105e7/self-rag-vs-adaptive-rag-vs-corrective-rag-3ge8</link>
      <guid>https://dev.to/parth_sarthisharma_105e7/self-rag-vs-adaptive-rag-vs-corrective-rag-3ge8</guid>
      <description>&lt;h2&gt;
  
  
  How Retrieval Systems Are Learning to Fix Themselves
&lt;/h2&gt;

&lt;p&gt;Retrieval-Augmented Generation (RAG) started simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Retrieve documents → add them to the prompt → generate an answer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That worked… until it didn’t.&lt;/p&gt;

&lt;p&gt;As RAG systems moved into production, teams began to see the same failures again and again:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hallucinations despite having “good” data&lt;/li&gt;
&lt;li&gt;Irrelevant chunks polluting the prompt&lt;/li&gt;
&lt;li&gt;Silent failures that were hard to debug&lt;/li&gt;
&lt;li&gt;High token costs with low answer quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The response wasn’t just &lt;em&gt;better embeddings&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;It was &lt;strong&gt;smarter control loops&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That’s how &lt;strong&gt;Self-RAG&lt;/strong&gt;, &lt;strong&gt;Adaptive RAG&lt;/strong&gt;, and &lt;strong&gt;Corrective RAG&lt;/strong&gt; emerged.&lt;/p&gt;

&lt;p&gt;They all share one idea:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;RAG shouldn’t be static.&lt;br&gt;&lt;br&gt;
It should reason about its own failure.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But they solve &lt;strong&gt;different layers of the problem&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Problem With Traditional RAG
&lt;/h2&gt;

&lt;p&gt;Classic RAG makes three assumptions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The user query is well-formed
&lt;/li&gt;
&lt;li&gt;Retrieved chunks are relevant
&lt;/li&gt;
&lt;li&gt;More context leads to better answers
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Queries are vague or underspecified&lt;/li&gt;
&lt;li&gt;Vector search returns &lt;em&gt;plausible but wrong&lt;/em&gt; chunks&lt;/li&gt;
&lt;li&gt;LLMs answer confidently even when context is poor&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditional RAG has &lt;strong&gt;no self-awareness&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Modern RAG patterns add it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Self-RAG: “Should I Even Answer This?”
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Self-RAG teaches the model to &lt;strong&gt;evaluate its own generation&lt;/strong&gt; using explicit self-reflection.&lt;/p&gt;

&lt;p&gt;Instead of blindly answering, the model asks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did I actually use the retrieved context?&lt;/li&gt;
&lt;li&gt;Is this answer supported by evidence?&lt;/li&gt;
&lt;li&gt;Should I revise, regenerate, or refuse?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How it works (conceptually)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Retrieve documents
&lt;/li&gt;
&lt;li&gt;Generate a draft answer
&lt;/li&gt;
&lt;li&gt;Run self-critique prompts such as:

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Is this answer grounded in the retrieved text?&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Is there missing or contradictory information?&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Regenerate or abstain if confidence is low
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;What it’s good at&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reducing hallucinations&lt;/li&gt;
&lt;li&gt;Citation-aware answers&lt;/li&gt;
&lt;li&gt;Knowledge-intensive question answering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Still depends on retrieval quality&lt;/li&gt;
&lt;li&gt;Adds latency&lt;/li&gt;
&lt;li&gt;Reflection quality depends heavily on prompt design&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Mental model&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Self-RAG adds a &lt;strong&gt;judge&lt;/strong&gt; after generation.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Adaptive RAG: “Do I Even Need Retrieval?”
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Adaptive RAG dynamically &lt;strong&gt;changes the pipeline itself&lt;/strong&gt; based on the query.&lt;/p&gt;

&lt;p&gt;Instead of:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Always retrieve → always generate&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It asks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is retrieval needed at all?&lt;/li&gt;
&lt;li&gt;How much context is enough?&lt;/li&gt;
&lt;li&gt;Should the query be rewritten?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Typical adaptations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Skip retrieval for simple or well-known facts&lt;/li&gt;
&lt;li&gt;Increase retrieval depth for complex queries&lt;/li&gt;
&lt;li&gt;Rewrite ambiguous questions&lt;/li&gt;
&lt;li&gt;Route between different tools (search, DB, memory)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why this matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Many RAG systems are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Over-fetching&lt;/li&gt;
&lt;li&gt;Overstuffing prompts&lt;/li&gt;
&lt;li&gt;Burning tokens unnecessarily&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Adaptive RAG optimizes for &lt;strong&gt;cost and accuracy&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mental model&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Adaptive RAG adds a &lt;strong&gt;router&lt;/strong&gt; before retrieval.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Corrective RAG: “Something Went Wrong — Fix It”
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Corrective RAG focuses on &lt;strong&gt;detecting and repairing retrieval failures&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It assumes failure is inevitable and designs for recovery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common corrective strategies&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detect low-quality or irrelevant chunks&lt;/li&gt;
&lt;li&gt;Drop contradictory context&lt;/li&gt;
&lt;li&gt;Trigger re-retrieval with a refined query&lt;/li&gt;
&lt;li&gt;Switch retrieval strategies (BM25 ↔ vector search)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key difference from Self-RAG&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Self-RAG critiques the &lt;em&gt;answer&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Corrective RAG critiques the &lt;em&gt;context&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why this matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In production, most RAG failures come from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wrong chunks&lt;/li&gt;
&lt;li&gt;Missing chunks&lt;/li&gt;
&lt;li&gt;Outdated information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Corrective RAG attacks the &lt;strong&gt;root cause&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mental model&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Corrective RAG adds a &lt;strong&gt;repair loop&lt;/strong&gt; around retrieval.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Putting It All Together
&lt;/h2&gt;

&lt;p&gt;These approaches are &lt;strong&gt;not competing ideas&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;They are &lt;strong&gt;layers&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A mature RAG system often looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query
↓
Adaptive Router (Do we retrieve? How?)
↓
Retrieval
↓
Corrective Check (Are these chunks good?)
↓
Generation
↓
Self-RAG Evaluation (Is this answer grounded?)
↓
Final Response (or retry / refuse)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer addresses a different failure mode.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters in Real Systems
&lt;/h2&gt;

&lt;p&gt;If you’re building:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enterprise search&lt;/li&gt;
&lt;li&gt;Customer support assistants&lt;/li&gt;
&lt;li&gt;Internal knowledge bots&lt;/li&gt;
&lt;li&gt;Agentic workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Static RAG will fail — often quietly.&lt;/p&gt;

&lt;p&gt;The future of RAG is not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Bigger models or longer prompts&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Systems that know when they are wrong.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;RAG is evolving from a simple pipeline into a &lt;strong&gt;control system&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The teams that succeed won’t be the ones with the largest models —&lt;br&gt;&lt;br&gt;
but the ones with the &lt;strong&gt;tightest feedback loops&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you’re experimenting with Self-RAG, Adaptive RAG, or Corrective RAG in production,&lt;br&gt;&lt;br&gt;
I’d love to hear what worked (or broke) for you.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>vectordatabase</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>LangChain vs LangGraph vs Semantic Kernel vs Google AI ADK vs CrewAI</title>
      <dc:creator>Parth Sarthi Sharma</dc:creator>
      <pubDate>Tue, 20 Jan 2026 12:57:54 +0000</pubDate>
      <link>https://dev.to/parth_sarthisharma_105e7/langchain-vs-langgraph-vs-semantic-kernel-vs-google-ai-adk-vs-crewai-1oa1</link>
      <guid>https://dev.to/parth_sarthisharma_105e7/langchain-vs-langgraph-vs-semantic-kernel-vs-google-ai-adk-vs-crewai-1oa1</guid>
      <description>&lt;h3&gt;
  
  
  Choosing the Right LLM Framework Without the Hype
&lt;/h3&gt;

&lt;p&gt;The LLM ecosystem is moving fast. Every few weeks, a new framework promises to “simplify AI agents,” “orchestrate reasoning,” or “make production-ready AI easy.”&lt;/p&gt;

&lt;p&gt;But if you’re building real systems, you’ve probably asked:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why do I need so many frameworks for what feels like the same thing?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I’ve worked with multiple LLM stacks and this article is my attempt to &lt;strong&gt;cut through the noise&lt;/strong&gt; and explain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What problem each framework &lt;em&gt;actually&lt;/em&gt; solves&lt;/li&gt;
&lt;li&gt;Where they shine&lt;/li&gt;
&lt;li&gt;Where they become liabilities&lt;/li&gt;
&lt;li&gt;Which one you should choose depending on your use case&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not a feature checklist. It’s a &lt;strong&gt;mental model&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Big Picture: What Problem Are We Solving?
&lt;/h2&gt;

&lt;p&gt;All these frameworks exist because &lt;strong&gt;LLMs are not applications&lt;/strong&gt;.&lt;br&gt;
They are &lt;em&gt;components&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Real-world LLM systems need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt orchestration&lt;/li&gt;
&lt;li&gt;Tool calling&lt;/li&gt;
&lt;li&gt;Memory&lt;/li&gt;
&lt;li&gt;Retrieval (RAG)&lt;/li&gt;
&lt;li&gt;Control flow&lt;/li&gt;
&lt;li&gt;Observability&lt;/li&gt;
&lt;li&gt;Failure handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each framework makes &lt;strong&gt;different trade-offs&lt;/strong&gt; around these problems.&lt;/p&gt;




&lt;h2&gt;
  
  
  LangChain: The Swiss Army Knife (and its curse)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt;&lt;br&gt;
LangChain is a &lt;em&gt;high-level abstraction layer&lt;/em&gt; for building LLM-powered apps quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it does well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rapid prototyping&lt;/li&gt;
&lt;li&gt;Huge ecosystem of integrations&lt;/li&gt;
&lt;li&gt;Easy chaining of prompts, tools, retrievers&lt;/li&gt;
&lt;li&gt;Strong community momentum&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where it struggles:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hidden control flow&lt;/li&gt;
&lt;li&gt;Debugging is painful at scale&lt;/li&gt;
&lt;li&gt;Abstractions leak under complex logic&lt;/li&gt;
&lt;li&gt;Performance tuning is hard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to use LangChain&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MVPs&lt;/li&gt;
&lt;li&gt;Hackathons&lt;/li&gt;
&lt;li&gt;POCs&lt;/li&gt;
&lt;li&gt;Teams new to LLMs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to avoid&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complex, stateful workflows&lt;/li&gt;
&lt;li&gt;Systems needing precise control or observability&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;LangChain is optimized for &lt;strong&gt;speed of development&lt;/strong&gt;, not &lt;strong&gt;clarity of execution&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  LangGraph: When You Realize LLMs Are State Machines
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt;&lt;br&gt;
LangGraph is LangChain’s answer to the criticism: &lt;em&gt;“LLM workflows aren’t linear.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It models AI systems as &lt;strong&gt;graphs&lt;/strong&gt; instead of chains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it does well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explicit state transitions&lt;/li&gt;
&lt;li&gt;Cycles, retries, branching&lt;/li&gt;
&lt;li&gt;Long-running agents&lt;/li&gt;
&lt;li&gt;Better reasoning visibility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More complex mental model&lt;/li&gt;
&lt;li&gt;Still tied to LangChain ecosystem&lt;/li&gt;
&lt;li&gt;Steeper learning curve&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When LangGraph shines&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-step agents&lt;/li&gt;
&lt;li&gt;Tool-heavy workflows&lt;/li&gt;
&lt;li&gt;Systems with retries and loops&lt;/li&gt;
&lt;li&gt;Human-in-the-loop scenarios&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;LangGraph is what you reach for when LangChain starts to feel “magical.”&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Semantic Kernel: Engineering-first, AI-second
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt;&lt;br&gt;
Microsoft’s take on LLM orchestration, designed for &lt;strong&gt;software engineers&lt;/strong&gt;, not prompt hackers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strong typing&lt;/li&gt;
&lt;li&gt;Explicit planners&lt;/li&gt;
&lt;li&gt;Native support for C# and Python&lt;/li&gt;
&lt;li&gt;Enterprise-friendly architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smaller ecosystem&lt;/li&gt;
&lt;li&gt;Less “plug-and-play”&lt;/li&gt;
&lt;li&gt;Slower iteration for experiments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best fit&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enterprise teams&lt;/li&gt;
&lt;li&gt;Strong engineering discipline&lt;/li&gt;
&lt;li&gt;Systems that need maintainability over speed&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Semantic Kernel feels like it was designed by people who maintain systems at 3am.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Google AI ADK: Opinionated and Cloud-native
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt;&lt;br&gt;
Google’s Agent Development Kit focuses on &lt;strong&gt;structured agent workflows&lt;/strong&gt;, tightly integrated with Google Cloud and Gemini.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear agent lifecycle&lt;/li&gt;
&lt;li&gt;Strong observability hooks&lt;/li&gt;
&lt;li&gt;Cloud-native design&lt;/li&gt;
&lt;li&gt;Production-aligned abstractions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less flexible outside Google’s ecosystem&lt;/li&gt;
&lt;li&gt;Smaller open-source community (for now)&lt;/li&gt;
&lt;li&gt;More opinionated architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best fit&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Teams already on GCP&lt;/li&gt;
&lt;li&gt;Production-first AI systems&lt;/li&gt;
&lt;li&gt;Regulated or large-scale environments&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;ADK assumes you care about deployment and monitoring from day one.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  CrewAI: The “Multi-Agent” Narrative
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt;&lt;br&gt;
CrewAI focuses on orchestrating &lt;strong&gt;multiple agents with roles&lt;/strong&gt;, mimicking human teams.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it’s good at:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Role-based agent design&lt;/li&gt;
&lt;li&gt;Easy mental model&lt;/li&gt;
&lt;li&gt;Content generation pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where it falls short:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Limited control&lt;/li&gt;
&lt;li&gt;Less suitable for complex state handling&lt;/li&gt;
&lt;li&gt;Not ideal for deeply engineered systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use CrewAI if&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You’re building collaborative agent demos&lt;/li&gt;
&lt;li&gt;Content or research workflows&lt;/li&gt;
&lt;li&gt;Experimenting with agent behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;CrewAI is great for storytelling, not systems engineering.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  A Practical Decision Framework
&lt;/h2&gt;

&lt;p&gt;Instead of asking &lt;em&gt;“Which framework is best?”&lt;/em&gt;, ask:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Do I need speed or control?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Speed → &lt;strong&gt;LangChain&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Control → &lt;strong&gt;Semantic Kernel / LangGraph&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Is this production-critical?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Yes → &lt;strong&gt;Semantic Kernel / Google ADK&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;No → &lt;strong&gt;LangChain / CrewAI&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Is the workflow stateful and complex?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Yes → &lt;strong&gt;LangGraph&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;No → &lt;strong&gt;LangChain&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Enterprise or startup?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Enterprise → &lt;strong&gt;Semantic Kernel / ADK&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Startup → &lt;strong&gt;LangChain&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Uncomfortable Truth
&lt;/h2&gt;

&lt;p&gt;Most mature AI teams eventually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with &lt;strong&gt;LangChain&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Outgrow it&lt;/li&gt;
&lt;li&gt;Move to &lt;strong&gt;custom orchestration&lt;/strong&gt; or &lt;strong&gt;graph-based systems&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Frameworks should &lt;strong&gt;accelerate learning&lt;/strong&gt;, not lock you in.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;LLM frameworks are evolving because &lt;strong&gt;we still don’t fully understand how to engineer AI systems&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Choose tools that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Make failure visible&lt;/li&gt;
&lt;li&gt;Encourage explicit design&lt;/li&gt;
&lt;li&gt;Don’t hide complexity forever&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because eventually, complexity always shows up.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If this helped you think more clearly about the LLM ecosystem, feel free to share or comment with your experience. I’d love to learn how others are navigating this space.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>softwareengineering</category>
      <category>llm</category>
    </item>
    <item>
      <title>Local RAG vs Cloud RAG: What Changes When You Leave the Demo</title>
      <dc:creator>Parth Sarthi Sharma</dc:creator>
      <pubDate>Mon, 12 Jan 2026 11:25:38 +0000</pubDate>
      <link>https://dev.to/parth_sarthisharma_105e7/local-rag-vs-cloud-rag-what-changes-when-you-leave-the-demo-2nlb</link>
      <guid>https://dev.to/parth_sarthisharma_105e7/local-rag-vs-cloud-rag-what-changes-when-you-leave-the-demo-2nlb</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Local RAG feels free.&lt;br&gt;
Until your first production incident.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you’ve built any RAG system recently, chances are it started locally.&lt;/p&gt;

&lt;p&gt;A small dataset.&lt;br&gt;
A local vector store.&lt;br&gt;
Fast queries.&lt;br&gt;
Clean answers.&lt;/p&gt;

&lt;p&gt;Everything feels under control.&lt;/p&gt;

&lt;p&gt;And for a while — it is.&lt;/p&gt;

&lt;p&gt;This article is about &lt;strong&gt;what quietly changes&lt;/strong&gt; when RAG systems move from demos to real usage, and why the Local vs Cloud RAG decision is less about tools and more about &lt;strong&gt;operational guarantees&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Local RAG Feels Like the Right Choice (At First)
&lt;/h3&gt;

&lt;p&gt;Local RAG optimises for exactly what you want early on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero infra friction&lt;/li&gt;
&lt;li&gt;Near-zero cost&lt;/li&gt;
&lt;li&gt;Tight iteration loops&lt;/li&gt;
&lt;li&gt;Full control over data and logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Restart the process&lt;/li&gt;
&lt;li&gt;Rebuild the index&lt;/li&gt;
&lt;li&gt;Tune chunk sizes&lt;/li&gt;
&lt;li&gt;Experiment freely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prototypes&lt;/li&gt;
&lt;li&gt;POCs&lt;/li&gt;
&lt;li&gt;Internal tools&lt;/li&gt;
&lt;li&gt;Early-stage features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Local RAG is not just acceptable — it’s &lt;strong&gt;ideal&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So where does it go wrong?&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem: Local RAG Doesn’t Fail Loudly
&lt;/h3&gt;

&lt;p&gt;Local RAG rarely explodes.&lt;/p&gt;

&lt;p&gt;It degrades.&lt;/p&gt;

&lt;p&gt;Slowly.&lt;/p&gt;

&lt;p&gt;Subtly.&lt;/p&gt;

&lt;p&gt;In ways that are hard to reproduce.&lt;/p&gt;

&lt;p&gt;At first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One user&lt;/li&gt;
&lt;li&gt;Sequential queries&lt;/li&gt;
&lt;li&gt;Index fits comfortably in memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then usage grows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Concurrent requests increase&lt;/li&gt;
&lt;li&gt;Memory pressure rises&lt;/li&gt;
&lt;li&gt;Index rebuilds take longer&lt;/li&gt;
&lt;li&gt;Latency becomes inconsistent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing is “broken”.&lt;/p&gt;

&lt;p&gt;But the system becomes &lt;strong&gt;unpredictable&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And unpredictability is the worst failure mode in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Actually Breaks First (And Surprises Teams)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Concurrency&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most local vector stores are optimised for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single-process access&lt;/li&gt;
&lt;li&gt;Limited parallelism&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Under load:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Queries queue&lt;/li&gt;
&lt;li&gt;Writes block reads&lt;/li&gt;
&lt;li&gt;Latency spikes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Memory &amp;amp; Resource Contention&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Local RAG competes with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The app runtime&lt;/li&gt;
&lt;li&gt;The LLM client&lt;/li&gt;
&lt;li&gt;Other background processes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A single spike can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trigger OOM&lt;/li&gt;
&lt;li&gt;Kill the process&lt;/li&gt;
&lt;li&gt;Lose in-memory state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Index Lifecycle Management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rebuilding indexes locally often means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Blocking reads&lt;/li&gt;
&lt;li&gt;Restarting services&lt;/li&gt;
&lt;li&gt;Manual intervention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is fine once. It’s painful at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Teams Jump to Cloud RAG Too Early
&lt;/h3&gt;

&lt;p&gt;On the flip side, many teams move to cloud RAG &lt;strong&gt;before they need to&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Common reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fear of future scale&lt;/li&gt;
&lt;li&gt;“Production readiness” anxiety&lt;/li&gt;
&lt;li&gt;Over-indexing on best practices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Paying for capacity you don’t use&lt;/li&gt;
&lt;li&gt;Higher baseline latency&lt;/li&gt;
&lt;li&gt;Vendor lock-in decisions too early&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cloud RAG is not “better RAG”. It’s &lt;strong&gt;RAG with guarantees&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And guarantees come with cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Cloud RAG Actually Buys You
&lt;/h3&gt;

&lt;p&gt;Cloud-managed RAG systems exist to solve &lt;em&gt;operational problems&lt;/em&gt;, not retrieval quality.&lt;/p&gt;

&lt;p&gt;They give you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Concurrency handling&lt;/li&gt;
&lt;li&gt;Persistence and durability&lt;/li&gt;
&lt;li&gt;Observability hooks&lt;/li&gt;
&lt;li&gt;Backups and recovery&lt;/li&gt;
&lt;li&gt;Predictable performance envelopes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What they don’t magically fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Poor chunking&lt;/li&gt;
&lt;li&gt;Bad retrieval logic&lt;/li&gt;
&lt;li&gt;Overstuffed prompts&lt;/li&gt;
&lt;li&gt;Weak context engineering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If ingestion is broken locally, it will be broken in the cloud — just more expensively.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Real Decision Axis (This Is the Key)
&lt;/h3&gt;

&lt;p&gt;The Local vs Cloud RAG decision is not about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chroma vs Pinecone&lt;/li&gt;
&lt;li&gt;FAISS vs Weaviate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s about answering &lt;strong&gt;four questions honestly&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How many concurrent users do I expect?&lt;/li&gt;
&lt;li&gt;How painful is downtime or degraded answers?&lt;/li&gt;
&lt;li&gt;Do I need observability and auditability?&lt;/li&gt;
&lt;li&gt;How often will my index change?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Local RAG optimises for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Speed&lt;/li&gt;
&lt;li&gt;Control&lt;/li&gt;
&lt;li&gt;Learning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cloud RAG optimises for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reliability&lt;/li&gt;
&lt;li&gt;Predictability&lt;/li&gt;
&lt;li&gt;Scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Neither is “correct” in isolation.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Practical Migration Pattern That Works
&lt;/h3&gt;

&lt;p&gt;Mature teams rarely jump straight from local to fully managed cloud RAG.&lt;/p&gt;

&lt;p&gt;Instead, they:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start local&lt;/li&gt;
&lt;li&gt;Learn their retrieval patterns&lt;/li&gt;
&lt;li&gt;Stabilise chunking and routing&lt;/li&gt;
&lt;li&gt;Introduce cloud RAG &lt;strong&gt;only when operational pain appears&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This keeps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost low early&lt;/li&gt;
&lt;li&gt;Architecture flexible&lt;/li&gt;
&lt;li&gt;Decisions reversible&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Final Takeaway
&lt;/h3&gt;

&lt;p&gt;Local RAG fails quietly.&lt;br&gt;
Cloud RAG fails expensively.&lt;/p&gt;

&lt;p&gt;The right choice depends on &lt;strong&gt;when you’re willing to pay&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;With engineering effort&lt;/li&gt;
&lt;li&gt;Or with infrastructure cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The worst choice is deciding too early — in either direction.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s Next
&lt;/h3&gt;

&lt;p&gt;In the next article, we’ll dive into one of the most under-discussed problems in RAG systems:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Observability in RAG Pipelines: Knowing Which Chunk Failed (and Why)&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We’ll explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why “LLM hallucinated” is usually a monitoring failure&lt;/li&gt;
&lt;li&gt;What should be traced in a RAG request (retrieval, ranking, prompt, tokens)&lt;/li&gt;
&lt;li&gt;How to identify:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Wrong chunk retrieval&lt;br&gt;
Empty or partial context&lt;br&gt;
Latency bottlenecks&lt;br&gt;
Silent failures in agents and tools&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How tools like OpenTelemetry, LangSmith, and custom tracing fit together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because you can’t fix what you can’t see — and most RAG systems today are completely blind.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>vectordatabase</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Prompt Routing &amp; Context Engineering: Letting the System Decide What It Needs</title>
      <dc:creator>Parth Sarthi Sharma</dc:creator>
      <pubDate>Fri, 09 Jan 2026 21:59:50 +0000</pubDate>
      <link>https://dev.to/parth_sarthisharma_105e7/prompt-routing-context-engineering-letting-the-system-decide-what-it-needs-5ak</link>
      <guid>https://dev.to/parth_sarthisharma_105e7/prompt-routing-context-engineering-letting-the-system-decide-what-it-needs-5ak</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Most LLM systems fail not because the model is weak&lt;br&gt;
but because we shove everything into the prompt and hope for magic.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you’ve ever built a RAG or agentic system, you’ve probably tried this at least once:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieve more documents&lt;/li&gt;
&lt;li&gt;Increase chunk count&lt;/li&gt;
&lt;li&gt;Add system instructions&lt;/li&gt;
&lt;li&gt;Extend the prompt&lt;/li&gt;
&lt;li&gt;Increase context window&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And yet… the answer still feels off.&lt;/p&gt;

&lt;p&gt;That’s because &lt;strong&gt;context is not information&lt;/strong&gt;. Context is &lt;strong&gt;relevance + timing + placement&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This article is about how mature LLM systems stop &lt;em&gt;stuffing prompts&lt;/em&gt;&lt;br&gt;
and start &lt;strong&gt;deciding what context they actually need&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Core Problem: Static Prompts in a Dynamic World
&lt;/h3&gt;

&lt;p&gt;Most early-stage LLM systems look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query
  → Retrieve top K chunks
  → Stuff everything into a single prompt
  → Generate response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works… until it doesn’t.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not all questions need the same context&lt;/li&gt;
&lt;li&gt;Not all tasks need the same instructions&lt;/li&gt;
&lt;li&gt;Not all users need the same depth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Yet we treat every request identically. That’s where &lt;strong&gt;prompt routing&lt;/strong&gt; enters.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Is Prompt Routing (Really)?
&lt;/h3&gt;

&lt;p&gt;Prompt routing is &lt;strong&gt;decision-making before generation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How do I write the perfect prompt?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Which prompt, context, and tools does &lt;em&gt;this&lt;/em&gt; request require?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Think of it as a &lt;strong&gt;traffic controller for LLM calls&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A routing layer decides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which system prompt to use&lt;/li&gt;
&lt;li&gt;Which context sources to include&lt;/li&gt;
&lt;li&gt;Whether retrieval is even required&lt;/li&gt;
&lt;li&gt;Whether the model should reason, summarise, or act&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A Mental Model: LLMs Don’t Need More Context — They Need the Right Context
&lt;/h3&gt;

&lt;p&gt;Consider these two queries:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;em&gt;“Summarise the payment terms in this contract”&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;“Can we safely terminate this contract early and what are the risks?”&lt;/em&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Same document. Very different needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query 1 needs&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A small, focused chunk&lt;/li&gt;
&lt;li&gt;No reasoning&lt;/li&gt;
&lt;li&gt;No tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Query 2 needs&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple clauses&lt;/li&gt;
&lt;li&gt;Cross-referencing&lt;/li&gt;
&lt;li&gt;Risk interpretation&lt;/li&gt;
&lt;li&gt;Possibly external policy context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If both go through the same prompt pipeline, one of them will fail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt Routing in Practice (Without Buzzwords)
&lt;/h3&gt;

&lt;p&gt;A practical routing layer usually classifies queries into &lt;strong&gt;intent buckets&lt;/strong&gt;, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❓ Factual lookup&lt;/li&gt;
&lt;li&gt;📄 Summarisation&lt;/li&gt;
&lt;li&gt;🧠 Reasoning / decision-making&lt;/li&gt;
&lt;li&gt;🛠 Tool execution&lt;/li&gt;
&lt;li&gt;🔁 Multi-step workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This classification can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rule-based (early stage)&lt;/li&gt;
&lt;li&gt;LLM-based (later stage)&lt;/li&gt;
&lt;li&gt;Hybrid (best in production)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once intent is known, everything else follows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context Engineering: The Part Most People Miss
&lt;/h3&gt;

&lt;p&gt;Prompt routing decides &lt;em&gt;what path to take&lt;/em&gt;.&lt;br&gt;
&lt;strong&gt;Context engineering decides what to inject and where&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Bad context engineering looks like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dumping raw chunks&lt;/li&gt;
&lt;li&gt;No ordering&lt;/li&gt;
&lt;li&gt;No metadata&lt;/li&gt;
&lt;li&gt;No separation between instructions and data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Good context engineering is deliberate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proven patterns that actually work&lt;/strong&gt;:&lt;br&gt;
&lt;strong&gt;1. Instruction / Data Separation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Never mix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System rules&lt;/li&gt;
&lt;li&gt;Retrieved content&lt;/li&gt;
&lt;li&gt;User instructions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLMs treat early tokens as &lt;em&gt;authority&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Query-Aware Retrieval&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Retrieve &lt;strong&gt;based on intent&lt;/strong&gt;, not keywords.&lt;/p&gt;

&lt;p&gt;A “why” question should retrieve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explanations&lt;/li&gt;
&lt;li&gt;Rationale&lt;/li&gt;
&lt;li&gt;Trade-offs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A “what” question should retrieve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Definitions&lt;/li&gt;
&lt;li&gt;Tables&lt;/li&gt;
&lt;li&gt;Direct facts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Context Placement Matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Important facts belong:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At the &lt;strong&gt;start&lt;/strong&gt; (primacy bias)&lt;/li&gt;
&lt;li&gt;Or at the &lt;strong&gt;end&lt;/strong&gt; (recency bias)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Middle content is often ignored &lt;em&gt;(hello, Lost in the Middle)&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Is the Bridge Between RAG and Agentic Systems
&lt;/h3&gt;

&lt;p&gt;Prompt routing is the &lt;strong&gt;missing layer&lt;/strong&gt; between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple RAG&lt;/li&gt;
&lt;li&gt;Agentic RAG&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without routing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agents overthink&lt;/li&gt;
&lt;li&gt;Simple RAG underperforms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With routing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple RAG stays simple&lt;/li&gt;
&lt;li&gt;Agents are invoked only when needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is how mature systems stay:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster&lt;/li&gt;
&lt;li&gt;Cheaper&lt;/li&gt;
&lt;li&gt;Easier to debug&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A Simple Rule of Thumb
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;If retrieval answers the question → don’t use an agent&lt;br&gt;
If decisions must be made → route to reasoning&lt;br&gt;
If actions are needed → allow tools&lt;br&gt;
If uncertainty exists → slow the system down&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s not prompt engineering.&lt;/p&gt;

&lt;p&gt;That’s &lt;strong&gt;system design&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s Next
&lt;/h3&gt;

&lt;p&gt;In the next article, we’ll explore:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local RAG vs Cloud RAG: What Changes When You Leave the Demo&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We’ll look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why local RAG feels perfect during development&lt;/li&gt;
&lt;li&gt;Where it quietly breaks under concurrency and scale&lt;/li&gt;
&lt;li&gt;What cloud RAG actually buys you (and what it doesn’t)&lt;/li&gt;
&lt;li&gt;How routing and context strategies behave differently in local vs managed setups&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because once your system can decide what context it needs,&lt;br&gt;
the next challenge is making sure that decision is reliable, observable, and repeatable in production.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>vectordatabase</category>
      <category>rag</category>
    </item>
    <item>
      <title>Simple RAG vs Agentic RAG: What Problem Are You Actually Solving?</title>
      <dc:creator>Parth Sarthi Sharma</dc:creator>
      <pubDate>Thu, 08 Jan 2026 12:50:08 +0000</pubDate>
      <link>https://dev.to/parth_sarthisharma_105e7/simple-rag-vs-agentic-rag-what-problem-are-you-actually-solving-3fg0</link>
      <guid>https://dev.to/parth_sarthisharma_105e7/simple-rag-vs-agentic-rag-what-problem-are-you-actually-solving-3fg0</guid>
      <description>&lt;p&gt;Let’s start with a real problem.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Can I terminate this contract early, and what penalties apply?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A set of contracts (PDFs)&lt;/li&gt;
&lt;li&gt;A user asking a natural-language question&lt;/li&gt;
&lt;li&gt;An LLM-powered application&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The question is not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Should I use RAG or agents?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How much reasoning does this problem actually require?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Step 1: The Simple RAG Approach (And Why It Often Works)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What Simple RAG Looks Like&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A typical Simple RAG pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User asks a question&lt;/li&gt;
&lt;li&gt;Embed the query&lt;/li&gt;
&lt;li&gt;Retrieve top-K chunks&lt;/li&gt;
&lt;li&gt;Inject them into the prompt&lt;/li&gt;
&lt;li&gt;Generate an answer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In code terms (conceptually):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;query → retriever → context → prompt → LLM → answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What Happens in Practice&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For many questions, this works &lt;em&gt;surprisingly well&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“What is the notice period?”&lt;/li&gt;
&lt;li&gt;“When does the contract expire?”&lt;/li&gt;
&lt;li&gt;“Is early termination allowed?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why?&lt;br&gt;
Because the answer exists &lt;strong&gt;verbatim&lt;/strong&gt; in the documents.&lt;/p&gt;

&lt;p&gt;No planning.&lt;br&gt;
No tool chaining.&lt;br&gt;
No decision-making.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Where Simple RAG Starts to Break
&lt;/h3&gt;

&lt;p&gt;Now try this question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“If I terminate early due to breach, does the penalty still apply?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Suddenly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The answer spans &lt;strong&gt;multiple clauses&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Conditions matter&lt;/li&gt;
&lt;li&gt;Exceptions override defaults&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What Simple RAG does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieves multiple chunks&lt;/li&gt;
&lt;li&gt;Dumps them into context&lt;/li&gt;
&lt;li&gt;Hopes the LLM figures it out&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sometimes it does.&lt;br&gt;
Sometimes it hallucinates confidently.&lt;/p&gt;

&lt;p&gt;The failure mode isn’t retrieval — it’s &lt;strong&gt;implicit reasoning&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 3: Enter Agentic RAG (And Why People Overuse It)
&lt;/h3&gt;

&lt;p&gt;Agentic RAG introduces &lt;strong&gt;explicit reasoning steps&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Answer directly”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The system does:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identify sub-questions&lt;/li&gt;
&lt;li&gt;Decide which tools to call&lt;/li&gt;
&lt;li&gt;Retrieve information iteratively&lt;/li&gt;
&lt;li&gt;Synthesize an answer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Conceptually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plan → retrieve → evaluate → retrieve → decide → answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shines when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Questions are multi-hop&lt;/li&gt;
&lt;li&gt;Dependencies exist&lt;/li&gt;
&lt;li&gt;Decisions affect next steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Check termination clause”&lt;/li&gt;
&lt;li&gt;“Check breach exceptions”&lt;/li&gt;
&lt;li&gt;“Check penalty override”&lt;/li&gt;
&lt;li&gt;“Combine results”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is &lt;strong&gt;real reasoning&lt;/strong&gt;, not just recall.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Where Agentic RAG Becomes a Liability
&lt;/h3&gt;

&lt;p&gt;Now consider this question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What is the termination notice period?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An agent might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Plan unnecessarily&lt;/li&gt;
&lt;li&gt;Call tools repeatedly&lt;/li&gt;
&lt;li&gt;Increase latency&lt;/li&gt;
&lt;li&gt;Increase cost&lt;/li&gt;
&lt;li&gt;Introduce new failure modes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You traded:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A 1-step pipeline
for&lt;/li&gt;
&lt;li&gt;A 5-step reasoning loop
To answer a &lt;strong&gt;lookup question&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is overengineering.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Core Insight Most Teams Miss
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Agentic RAG is not “better RAG.”&lt;/strong&gt;&lt;br&gt;
It’s a &lt;em&gt;different tool for a different problem&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The decision is not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Simple vs Agentic&lt;br&gt;
It’s:&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Recall vs Reasoning&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A Practical Decision Rule (Use This)
&lt;/h3&gt;

&lt;p&gt;Use &lt;strong&gt;Simple RAG&lt;/strong&gt; when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The answer exists verbatim&lt;/li&gt;
&lt;li&gt;Questions are independent&lt;/li&gt;
&lt;li&gt;Latency and cost matter&lt;/li&gt;
&lt;li&gt;Determinism is important&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use &lt;strong&gt;Agentic RAG&lt;/strong&gt; when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Answers span multiple sources&lt;/li&gt;
&lt;li&gt;Decisions affect next retrieval&lt;/li&gt;
&lt;li&gt;You need traceable reasoning&lt;/li&gt;
&lt;li&gt;You accept higher cost for correctness&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why Many Systems Fail in Production
&lt;/h3&gt;

&lt;p&gt;Most teams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jump to Agentic RAG too early&lt;/li&gt;
&lt;li&gt;Before fixing ingestion&lt;/li&gt;
&lt;li&gt;Before fixing chunking&lt;/li&gt;
&lt;li&gt;Before understanding attention limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Agents amplify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bad context&lt;/li&gt;
&lt;li&gt;Poor retrieval&lt;/li&gt;
&lt;li&gt;Weak observability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They don’t fix fundamentals.&lt;/p&gt;

&lt;h3&gt;
  
  
  Final Takeaway
&lt;/h3&gt;

&lt;p&gt;Simple RAG fails when reasoning is required.&lt;br&gt;
Agentic RAG fails when reasoning is unnecessary.&lt;/p&gt;

&lt;p&gt;The best systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Route questions intentionally&lt;/li&gt;
&lt;li&gt;Use agents selectively&lt;/li&gt;
&lt;li&gt;Treat reasoning as a cost, not a default&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What’s Next
&lt;/h3&gt;

&lt;p&gt;Next, we’ll go one level deeper:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Prompt Routing &amp;amp; Context Engineering: Letting the System Decide What It Needs&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s where real production intelligence starts.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>llm</category>
      <category>softwareengineering</category>
    </item>
  </channel>
</rss>
