<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: David Van Assche (S.L)</title>
    <description>The latest articles on DEV Community by David Van Assche (S.L) (@soulentheo).</description>
    <link>https://dev.to/soulentheo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3511013%2F0730c08e-cba7-492f-b16c-fe3921a41036.png</url>
      <title>DEV Community: David Van Assche (S.L)</title>
      <link>https://dev.to/soulentheo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/soulentheo"/>
    <language>en</language>
    <item>
      <title>Measuring What Your AI Learned: Epistemic Vectors in Practice</title>
      <dc:creator>David Van Assche (S.L)</dc:creator>
      <pubDate>Mon, 13 Apr 2026 17:45:28 +0000</pubDate>
      <link>https://dev.to/soulentheo/measuring-what-your-ai-learned-epistemic-vectors-in-practice-3jdh</link>
      <guid>https://dev.to/soulentheo/measuring-what-your-ai-learned-epistemic-vectors-in-practice-3jdh</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Part 2 of the &lt;a href="https://dev.to/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-309a-temp-slug-5818830"&gt;Epistemic AI series&lt;/a&gt;. In Part 1, we defined the problem: AI tools don't track what they know. Here, we make it measurable.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When we talk about "what the AI knows," we're not being metaphorical. Knowledge has structure, and that structure is measurable — not perfectly, but well enough to catch the failures that matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 13 Epistemic Vectors
&lt;/h2&gt;

&lt;p&gt;Empirica tracks 13 dimensions of an AI's knowledge state. Not as a gimmick — each vector maps to a specific class of failure you've seen in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;know&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="mf"&gt;0.65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Domain understanding
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uncertainty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# What I DON'T know (explicit!)
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="mf"&gt;0.70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Surrounding state awareness
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clarity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="mf"&gt;0.80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# How clear the path forward is
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coherence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Internal consistency
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;signal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="mf"&gt;0.60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Information quality vs noise
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;density&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="mf"&gt;0.55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Relevant knowledge per unit context
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="mf"&gt;0.70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Current system/project state
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="mf"&gt;0.40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Amount of change made
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;0.30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Progress toward goal
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;impact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="mf"&gt;0.65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Significance of work
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engagement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# How actively working the problem
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;do&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="mf"&gt;0.70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Ability to execute
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why 13?&lt;/strong&gt; Because we kept finding failure modes that weren't captured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;know&lt;/code&gt; without &lt;code&gt;uncertainty&lt;/code&gt; = overconfident AI&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;clarity&lt;/code&gt; without &lt;code&gt;signal&lt;/code&gt; = clear path built on noise&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;completion&lt;/code&gt; without &lt;code&gt;change&lt;/code&gt; = claiming done but nothing happened&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;engagement&lt;/code&gt; without &lt;code&gt;do&lt;/code&gt; = actively spinning without capability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each pair creates a &lt;strong&gt;tension&lt;/strong&gt; that prevents gaming. You can't claim high &lt;code&gt;know&lt;/code&gt; while &lt;code&gt;uncertainty&lt;/code&gt; is also high — the measurement catches the contradiction.&lt;/p&gt;
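&lt;p&gt;To make the tension idea concrete, here is a minimal sketch of how such a contradiction check could look. The pair list, thresholds, and &lt;code&gt;check_tensions&lt;/code&gt; name are illustrative, not Empirica's actual implementation:&lt;/p&gt;

```python
# Illustrative sketch only: the pairs, thresholds, and function name are
# assumptions for this article, not Empirica's actual implementation.
def check_tensions(vectors, hi=0.7, lo=0.3):
    """Flag vector combinations that contradict each other."""
    flagged = []
    # Both high at once is a contradiction:
    if vectors.get("know", 0) > hi and vectors.get("uncertainty", 0) > hi:
        flagged.append("overconfident: high know with high uncertainty")
    # A high claim resting on a low supporting vector is also a contradiction:
    for claim, support, label in [
        ("completion", "change", "claims done but made no changes"),
        ("clarity", "signal", "clear path built on noisy information"),
        ("engagement", "do", "spinning actively without capability"),
    ]:
        if vectors.get(claim, 0) > hi and lo > vectors.get(support, 0):
            flagged.append(label)
    return flagged

print(check_tensions({"know": 0.9, "uncertainty": 0.8}))  # flags the first pair
```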

&lt;h2&gt;
  
  
  The Transaction Lifecycle
&lt;/h2&gt;

&lt;p&gt;Vectors aren't static. They change as the AI works. The &lt;strong&gt;epistemic transaction&lt;/strong&gt; is the measurement window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PREFLIGHT → [investigate] → CHECK → [implement] → POSTFLIGHT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  PREFLIGHT: Declare Your Baseline
&lt;/h3&gt;

&lt;p&gt;Before starting work, the AI declares what it thinks it knows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;empirica preflight-submit - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
{
  "task_context": "Implement JWT auth middleware",
  "vectors": {
    "know": 0.45,
    "uncertainty": 0.40,
    "context": 0.60,
    "clarity": 0.50
  },
  "reasoning": "Read the route definitions but haven't explored the middleware chain yet. Moderate context from project structure."
}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the &lt;strong&gt;starting measurement&lt;/strong&gt;. It's a prediction: "Here's how well I think I understand this before investigating."&lt;/p&gt;

&lt;h3&gt;
  
  
  Investigation Phase (Noetic)
&lt;/h3&gt;

&lt;p&gt;The AI reads code, searches patterns, builds understanding. Everything it discovers gets logged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# What you learned&lt;/span&gt;
empirica finding-log &lt;span class="nt"&gt;--finding&lt;/span&gt; &lt;span class="s2"&gt;"Auth middleware uses Express next() 
  pattern at routes/auth.js:45"&lt;/span&gt; &lt;span class="nt"&gt;--impact&lt;/span&gt; 0.5

&lt;span class="c"&gt;# What you don't know&lt;/span&gt;
empirica unknown-log &lt;span class="nt"&gt;--unknown&lt;/span&gt; &lt;span class="s2"&gt;"How are user roles differentiated? 
  No role field in JWT payload schema."&lt;/span&gt;

&lt;span class="c"&gt;# What didn't work&lt;/span&gt;
empirica deadend-log &lt;span class="nt"&gt;--approach&lt;/span&gt; &lt;span class="s2"&gt;"Tried passport.js integration"&lt;/span&gt;   &lt;span class="nt"&gt;--why-failed&lt;/span&gt; &lt;span class="s2"&gt;"Too heavy for JWT-only auth, would add 12 dependencies"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These aren't just notes — they're &lt;strong&gt;grounded evidence&lt;/strong&gt; that the calibration system uses to verify self-assessments.&lt;/p&gt;

&lt;h3&gt;
  
  
  CHECK: Gate the Transition
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;empirica check-submit - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
{
  "vectors": {
    "know": 0.82,
    "uncertainty": 0.15,
    "context": 0.85,
    "clarity": 0.88
  },
  "reasoning": "Investigated middleware chain, understand JWT flow, found role definitions in JWT claims. Ready to implement."
}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system evaluates: did the vectors change in a way that's consistent with the evidence logged? If the AI claims &lt;code&gt;know: 0.82&lt;/code&gt; but logged zero findings and zero unknowns, that's a rushed assessment — the gate catches it.&lt;/p&gt;
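&lt;p&gt;A toy version of that gate logic, assuming a made-up &lt;code&gt;gate_check&lt;/code&gt; helper and an arbitrary evidence-per-gain heuristic (Empirica's real scoring is richer):&lt;/p&gt;

```python
# Illustrative sketch of the CHECK-gate consistency test; the function name
# and the 0.2-per-item heuristic are assumptions, not Empirica's real scoring.
def gate_check(preflight_know, check_know, findings, unknowns):
    """Reject a large claimed knowledge gain that has no logged evidence behind it."""
    claimed_gain = check_know - preflight_know
    evidence_items = len(findings) + len(unknowns)
    supported_gain = 0.2 * evidence_items  # each logged item justifies ~0.2 of gain
    if claimed_gain > supported_gain:
        return False, f"claimed +{claimed_gain:.2f} know with only {evidence_items} evidence items"
    return True, "ok"

# A jump from 0.45 to 0.82 with nothing logged gets rejected:
print(gate_check(0.45, 0.82, findings=[], unknowns=[]))
```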

&lt;p&gt;&lt;strong&gt;This is the critical insight: you can't skip investigation and go straight to acting.&lt;/strong&gt; The measurement &lt;em&gt;forces&lt;/em&gt; understanding before execution.&lt;/p&gt;

&lt;h3&gt;
  
  
  POSTFLIGHT: Measure the Learning
&lt;/h3&gt;

&lt;p&gt;After implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;empirica postflight-submit - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
{
  "vectors": {
    "know": 0.90,
    "uncertainty": 0.08,
    "change": 0.80,
    "completion": 1.0
  },
  "reasoning": "Auth middleware implemented with role guards. Unit tests passing. Learned about Express 5 async changes."
}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;delta&lt;/strong&gt; between PREFLIGHT and POSTFLIGHT is the learning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;know&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;0.45 → 0.90  (+0.45)&lt;/span&gt;  &lt;span class="c1"&gt;# Learned a lot&lt;/span&gt;
&lt;span class="na"&gt;uncertainty&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.40 → 0.08  (-0.32)&lt;/span&gt;  &lt;span class="c1"&gt;# Resolved most unknowns&lt;/span&gt;
&lt;span class="na"&gt;change&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;      &lt;span class="s"&gt;0.00 → 0.80  (+0.80)&lt;/span&gt;  &lt;span class="c1"&gt;# Made substantial changes&lt;/span&gt;
&lt;span class="na"&gt;completion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;0.00 → 1.00  (+1.00)&lt;/span&gt;  &lt;span class="c1"&gt;# Goal met&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This delta IS the measurement. Over time, you can see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the AI consistently overestimate its starting knowledge?&lt;/li&gt;
&lt;li&gt;Does it underestimate uncertainty?&lt;/li&gt;
&lt;li&gt;Do its estimates get more accurate across sessions?&lt;/li&gt;
&lt;/ul&gt;
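&lt;p&gt;Answering those questions is simple arithmetic over stored transactions. A sketch with invented session data, using a plain mean signed error rather than Empirica's actual statistics:&lt;/p&gt;

```python
# Sketch of a cross-session bias check. The session numbers and the "grounded"
# values are invented for illustration; the statistic is a simple mean signed error.
def mean_signed_error(sessions):
    """Average (claimed - grounded); positive means habitual overestimation."""
    errors = [s["claimed"] - s["grounded"] for s in sessions]
    return sum(errors) / len(errors)

know_sessions = [
    {"claimed": 0.60, "grounded": 0.50},
    {"claimed": 0.70, "grounded": 0.55},
    {"claimed": 0.65, "grounded": 0.60},
]
print(round(mean_signed_error(know_sessions), 3))  # 0.1
```

A positive value here means the AI's starting-knowledge claims run consistently above what the evidence supported.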

&lt;h2&gt;
  
  
  Grounded Verification: The Part That Keeps It Honest
&lt;/h2&gt;

&lt;p&gt;Self-assessment alone is self-serving. The grounded verification layer compares the AI's claims against &lt;strong&gt;deterministic evidence&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# AI claims: know=0.90, change=0.80
# Grounded evidence:
&lt;/span&gt;&lt;span class="n"&gt;evidence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;     &lt;span class="c1"&gt;# 3 failures!
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ruff_violations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                              &lt;span class="c1"&gt;# lint issues
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;git_diff_lines&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;156&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                            &lt;span class="c1"&gt;# real change metric
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;findings_logged&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                              &lt;span class="c1"&gt;# investigation breadth
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknowns_resolved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                            &lt;span class="c1"&gt;# learning evidence
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Grounded calibration:
# - test failures → know is probably ~0.75, not 0.90
# - git diff confirms change=0.80 is reasonable
# - 5 findings + 3 resolved unknowns → investigation was real
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The calibration score measures the distance between self-assessment and grounded evidence. &lt;strong&gt;A score of 0.0 means perfect calibration.&lt;/strong&gt; In practice, we see scores of 0.10–0.30 — the AI is usually overconfident, and the grounded layer catches it.&lt;/p&gt;
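&lt;p&gt;As a first approximation, that distance can be computed as a mean absolute gap over the shared vectors; the weighting in Empirica's real scorer may differ:&lt;/p&gt;

```python
# A minimal sketch of a calibration score as the mean absolute gap between
# claimed and grounded vectors; Empirica's actual scoring may weight vectors.
def calibration_score(claimed, grounded):
    """0.0 is perfect calibration; larger values mean bigger self-assessment gaps."""
    shared = set(claimed).intersection(grounded)
    gaps = [abs(claimed[k] - grounded[k]) for k in shared]
    return sum(gaps) / len(gaps)

claimed  = {"know": 0.90, "change": 0.80}
grounded = {"know": 0.75, "change": 0.78}
print(round(calibration_score(claimed, grounded), 3))  # 0.085
```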

&lt;h2&gt;
  
  
  What This Looks Like in Practice
&lt;/h2&gt;

&lt;p&gt;Here's a real POSTFLIGHT from an Empirica session (edited for clarity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Calibration score&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.134&lt;/span&gt;
&lt;span class="na"&gt;Grounded coverage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;69.2%&lt;/span&gt;

&lt;span class="na"&gt;Gaps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;know&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;overestimate by 0.33  (claimed 0.82, evidence shows 0.49)&lt;/span&gt;
  &lt;span class="na"&gt;uncertainty&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;underestimate by 0.13 (claimed 0.15, evidence shows 0.28)&lt;/span&gt;
  &lt;span class="na"&gt;coherence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;   &lt;span class="s"&gt;underestimate by 0.20 (claimed 0.75, evidence shows 0.95)&lt;/span&gt;

&lt;span class="na"&gt;Sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;artifacts, codebase_model, prose_quality,&lt;/span&gt; 
         &lt;span class="s"&gt;document_metrics, source_quality, action_verification&lt;/span&gt;
&lt;span class="na"&gt;Sources failed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[]&lt;/span&gt;  &lt;span class="s"&gt;(all evidence collectors healthy)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI was overestimating its knowledge and underestimating its uncertainty — the most common pattern. &lt;strong&gt;But now we can see it&lt;/strong&gt;, which means we can correct for it in the next transaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;empirica
&lt;span class="nb"&gt;cd &lt;/span&gt;your-project
empirica project-init
empirica setup-claude-code

&lt;span class="c"&gt;# Start a measured session:&lt;/span&gt;
empirica session-create &lt;span class="nt"&gt;--ai-id&lt;/span&gt; claude-code
&lt;span class="c"&gt;# → Opens transaction, gates investigation before action&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The framework is open source, the measurement is real, and the calibration improves over time. Not because the model gets better — because the &lt;strong&gt;measurement infrastructure&lt;/strong&gt; makes overconfidence visible.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next in the series: &lt;strong&gt;Part 3 — Grounded Calibration vs Self-Assessment&lt;/strong&gt; — why the AI's self-report is structurally unreliable and how deterministic evidence changes the game.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/Nubaeon/empirica" rel="noopener noreferrer"&gt;Empirica on GitHub&lt;/a&gt; | &lt;a href="https://dev.to/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-309a-temp-slug-5818830"&gt;Part 1: Your AI Doesn't Know What It Doesn't Know&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>tutorial</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Your AI Doesn't Know What It Doesn't Know — And That's the Biggest Problem in AI Tooling</title>
      <dc:creator>David Van Assche (S.L)</dc:creator>
      <pubDate>Mon, 13 Apr 2026 17:45:27 +0000</pubDate>
      <link>https://dev.to/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d</link>
      <guid>https://dev.to/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The most dangerous thing isn't an AI that's wrong. It's an AI that's wrong and confident about it."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every developer working with AI agents has hit this wall: your tool says something with absolute confidence, and it's completely wrong. Not because the model is bad — because &lt;strong&gt;nothing in the system tracks what it actually knows versus what it's guessing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the epistemic gap, and it's the single biggest unsolved problem in AI developer tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Confidence Without Calibration
&lt;/h2&gt;

&lt;p&gt;When you use Claude, ChatGPT, or any LLM-based tool:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It never says "I'm 60% sure about this"&lt;/li&gt;
&lt;li&gt;It doesn't distinguish between "I read this in the codebase" and "I'm inferring this from patterns"&lt;/li&gt;
&lt;li&gt;After a long conversation, it loses track of what it verified versus what it assumed&lt;/li&gt;
&lt;li&gt;When context compresses, learned insights vanish silently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't a model problem. GPT-5 won't fix it. Claude Opus 5 won't fix it. &lt;strong&gt;It's a measurement problem at the infrastructure layer.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What Actually Happens in Practice
&lt;/h3&gt;

&lt;p&gt;You ask your AI to update the auth middleware. It says "Done!" with 100% confidence. But:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did it check if JWT was already configured? Maybe.&lt;/li&gt;
&lt;li&gt;Did it verify the session store compatibility? Probably not.&lt;/li&gt;
&lt;li&gt;Will it remember this decision next session? No.&lt;/li&gt;
&lt;li&gt;Did it investigate before acting, or just pattern-match? You'll never know.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI doesn't track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What it &lt;strong&gt;investigated&lt;/strong&gt; versus what it &lt;strong&gt;assumed&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Which assumptions turned out to be &lt;strong&gt;wrong&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;What it learned that should &lt;strong&gt;persist&lt;/strong&gt; across sessions&lt;/li&gt;
&lt;li&gt;How its confidence &lt;strong&gt;should change&lt;/strong&gt; based on evidence&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why This Matters More Than You Think
&lt;/h2&gt;

&lt;p&gt;If you're building AI-assisted workflows, this gap compounds:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No learning curve.&lt;/strong&gt; Your AI makes the same mistakes on day 100 that it made on day 1, because nothing measures whether its predictions improve.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Invisible context loss.&lt;/strong&gt; When conversations compact (Claude Code, Cursor, etc. all do this), the AI loses track of what it verified. It re-assumes things it already checked.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sycophancy masquerading as agreement.&lt;/strong&gt; When you push back on a wrong answer, the AI often just agrees with you — not because you're right, but because agreement is the path of least resistance. Without calibration, there's no mechanism to distinguish "user is right, I should update" from "user is insistent, I should capitulate."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No grounded verification.&lt;/strong&gt; The AI self-reports its confidence. Nobody checks. It's like a student grading their own exam.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What Epistemic Measurement Looks Like
&lt;/h2&gt;

&lt;p&gt;Imagine if your AI tooling tracked 13 dimensions of its own knowledge state (a sample below):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Vector&lt;/th&gt;
&lt;th&gt;What It Measures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;know&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How well it understands the domain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;uncertainty&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;What it DOESN'T know (explicit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;context&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Understanding of surrounding state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;clarity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How clear the path forward is&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;signal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Quality of information vs noise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;change&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Amount of change made&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;completion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Progress toward current goal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And imagine it measured these at three points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PREFLIGHT&lt;/strong&gt;: "Here's what I think I know before starting"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CHECK&lt;/strong&gt;: "Here's what I learned during investigation — am I ready to act?"
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;POSTFLIGHT&lt;/strong&gt;: "Here's what I actually learned and changed"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The delta between PREFLIGHT and POSTFLIGHT IS the learning. Not a vibe. A measurement.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Grounded Calibration Loop
&lt;/h2&gt;

&lt;p&gt;Self-assessment alone is sycophantic. What you actually need is a comparison between what the AI &lt;em&gt;claims&lt;/em&gt; to know and what &lt;em&gt;deterministic evidence&lt;/em&gt; shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI self-assessment&lt;/strong&gt;: know = 0.85, uncertainty = 0.10&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grounded evidence&lt;/strong&gt; (test results, linter, git diff): know = 0.62, uncertainty = 0.35&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calibration gap&lt;/strong&gt;: overestimating know by 0.23, underestimating uncertainty by 0.25&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adjustment signal&lt;/strong&gt;: "Be more cautious with know estimates in future transactions"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The grounded evidence comes from &lt;strong&gt;deterministic services&lt;/strong&gt; — test results, linter output, git metrics, documentation coverage — things that don't lie. When the AI says "I know this codebase well" but the test suite shows 3 failures in the module it just edited, the gap is measurable.&lt;/p&gt;

&lt;p&gt;This is what calibration means: &lt;strong&gt;the distance between what you claim to know and what the evidence shows.&lt;/strong&gt; Over time, this distance should shrink. If it doesn't, the AI isn't getting better — it's just getting more confident.&lt;/p&gt;
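&lt;p&gt;One simple way to watch that distance over time is to compare a recent window of calibration scores against an older one. The window size and the example history here are invented for illustration:&lt;/p&gt;

```python
# Sketch of a drift check over a history of per-transaction calibration scores;
# the window size and the example history are arbitrary illustrations.
def calibration_trend(history, window=3):
    """Negative trend: the claim/evidence distance is shrinking (improving)."""
    older = sum(history[:window]) / window
    recent = sum(history[-window:]) / window
    return recent - older

history = [0.30, 0.28, 0.25, 0.20, 0.16, 0.12]
print(round(calibration_trend(history), 3))  # -0.117: calibration is improving
```

A flat or positive trend is the warning sign from the paragraph above: the AI is not getting better, just more confident.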

&lt;h2&gt;
  
  
  This Isn't Theory — It's Infrastructure
&lt;/h2&gt;

&lt;p&gt;We've been building this measurement layer as an open-source framework called &lt;a href="https://github.com/Nubaeon/empirica" rel="noopener noreferrer"&gt;Empirica&lt;/a&gt;. It's a Python CLI that hooks into Claude Code (and any LLM tool that supports hooks) to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track epistemic vectors across sessions&lt;/li&gt;
&lt;li&gt;Gate actions behind investigation (you can't write code until you've demonstrated understanding)&lt;/li&gt;
&lt;li&gt;Verify self-assessments against deterministic evidence&lt;/li&gt;
&lt;li&gt;Persist learning across context compaction&lt;/li&gt;
&lt;li&gt;Measure calibration drift over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's not a wrapper or a prompt. It's measurement infrastructure that makes the epistemic gap visible and closes it over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting Started
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt; Python 3.10+, a project with a git repo, and optionally &lt;a href="https://claude.ai/code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; for the full hook integration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Empirica&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;empirica

&lt;span class="c"&gt;# Initialize tracking in your project&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;your-project
empirica project-init

&lt;span class="c"&gt;# If using Claude Code, wire up the hooks:&lt;/span&gt;
empirica setup-claude-code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. From this point, every Claude Code conversation in this project is measured — PREFLIGHT declares baseline knowledge, CHECK gates the transition from investigation to action, and POSTFLIGHT captures what was actually learned. The Sentinel (an automated gate) ensures investigation happens before implementation.&lt;/p&gt;

&lt;p&gt;Without Claude Code, you can still use the CLI directly to track any AI workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Declare what you know before starting&lt;/span&gt;
empirica preflight-submit - &lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'{"vectors": {"know": 0.5, "uncertainty": 0.4}, "reasoning": "Starting auth investigation"}'&lt;/span&gt;

&lt;span class="c"&gt;# Log what you discover&lt;/span&gt;
empirica finding-log &lt;span class="nt"&gt;--finding&lt;/span&gt; &lt;span class="s2"&gt;"JWT middleware uses Express next() pattern"&lt;/span&gt; &lt;span class="nt"&gt;--impact&lt;/span&gt; 0.5

&lt;span class="c"&gt;# Measure what you learned&lt;/span&gt;
empirica postflight-submit - &lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'{"vectors": {"know": 0.85, "uncertainty": 0.1}, "reasoning": "Auth flow fully understood"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's Next in This Series
&lt;/h2&gt;

&lt;p&gt;This is Part 1 of a series on epistemic AI — making AI tools that actually know what they know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 2&lt;/strong&gt;: &lt;a href="https://dev.to/soulentheo/measuring-what-your-ai-learned-epistemic-vectors-in-practice-4j3l-temp-slug-4262219"&gt;Measuring What Your AI Learned — epistemic vectors in practice&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3&lt;/strong&gt;: Grounded Calibration vs Self-Assessment — why self-reporting fails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 4&lt;/strong&gt;: Adding Epistemic Hooks to Your Workflow — integration tutorial&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 5&lt;/strong&gt;: The Voice Layer — how AI learns your communication patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each article will have runnable code, real measurements, and honest assessments of what works and what doesn't. Because that's the whole point — &lt;strong&gt;if you're not honest about uncertainty, you're just building a more eloquent liar.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Empirica is open source (MIT) and under active development. We're a small team in Vienna building measurement infrastructure for AI. If this resonates, &lt;a href="https://github.com/Nubaeon/empirica" rel="noopener noreferrer"&gt;check us out on GitHub&lt;/a&gt; or follow this series for the deep dives.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Why Your AI Agent Needs Memory That Decays (and How Qdrant Makes It Work)</title>
      <dc:creator>David Van Assche (S.L)</dc:creator>
      <pubDate>Fri, 06 Mar 2026 13:30:22 +0000</pubDate>
      <link>https://dev.to/soulentheo/why-your-ai-agent-needs-memory-that-decays-and-how-qdrant-makes-it-work-f9m</link>
      <guid>https://dev.to/soulentheo/why-your-ai-agent-needs-memory-that-decays-and-how-qdrant-makes-it-work-f9m</guid>
      <description>&lt;p&gt;I've been building an open-source epistemic measurement framework called Empirica, and one of the core challenges I ran into early on was memory — not the "stuff vectors in a database and retrieve them" kind, but memory that actually behaves like memory. Things fade. Patterns strengthen with repetition. A dead-end from three weeks ago should still surface when the AI is about to walk into the same wall, but a finding from a one-off debugging session probably shouldn't carry the same weight six months later.&lt;/p&gt;

&lt;p&gt;That's where Qdrant comes in, and I want to share how we're using it because it's a fairly different use case from the typical RAG setup.&lt;/p&gt;

&lt;h3&gt;
  
  
  The problem with flat retrieval
&lt;/h3&gt;

&lt;p&gt;Most RAG implementations treat memory as a flat store — embed a chunk, retrieve by similarity, done. That works for document Q&amp;amp;A, but it falls apart when you need temporal awareness. An AI agent working across sessions and projects needs to know not just &lt;em&gt;what&lt;/em&gt; was discovered, but &lt;em&gt;when&lt;/em&gt;, &lt;em&gt;how confident we were&lt;/em&gt;, and &lt;em&gt;whether that knowledge is still valid&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Think about how your own memory works — you don't recall every detail of every workday equally. The time you accidentally dropped the production database? That stays vivid. The routine PR you reviewed last Tuesday? Already fading. That asymmetry is functional, not a bug.&lt;/p&gt;

&lt;h3&gt;
  
  
  Two memory types, one vector store
&lt;/h3&gt;

&lt;p&gt;We use Qdrant for two distinct memory layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Eidetic memory&lt;/strong&gt; — facts with confidence scores. These are discrete epistemic artifacts: findings ("the auth system uses JWT refresh with 15min expiry"), dead-ends ("tried migrating to async but the ORM doesn't support it"), decisions ("chose SQLite over Postgres because single-user, no server needed"), mistakes ("forgot to check null on the config reload path"). Each carries a confidence score that gets challenged when new evidence contradicts it — a finding's confidence drops if a related finding surfaces that undermines it. Think of it as an immune system: findings are antigens, lessons are antibodies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Episodic memory&lt;/strong&gt; — session narratives with temporal decay. These capture the arc of a work session: what was the AI investigating, what did it learn, how did its confidence change from start to finish. Episodic memories naturally decay over time — a session from yesterday is more relevant than one from last month, unless the pattern keeps repeating, in which case it strengthens instead of fading.&lt;/p&gt;

&lt;p&gt;Both live in Qdrant as separate collections per project, which gives us clean isolation and lets us do cross-project pattern discovery when we need it.&lt;/p&gt;
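To make the two layers concrete, here is a minimal sketch of what the payloads for each could look like. The field names, the artifact kinds, and the flat 0.2 challenge penalty are illustrative assumptions for this post, not Empirica's actual schema.

```python
from dataclasses import dataclass, field
import time

@dataclass
class EideticFact:
    """A discrete epistemic artifact: finding, dead-end, decision, or mistake."""
    kind: str            # "finding" | "dead_end" | "decision" | "mistake"
    text: str
    confidence: float    # 0.0 to 1.0, challenged when new evidence contradicts it
    created_at: float = field(default_factory=time.time)

    def challenge(self, penalty: float = 0.2) -> None:
        # Contradicting evidence drops this fact's confidence toward zero.
        self.confidence = max(0.0, self.confidence - penalty)

@dataclass
class EpisodicMemory:
    """A session narrative: decays over time unless its pattern keeps repeating."""
    summary: str
    confidence_start: float
    confidence_end: float
    created_at: float = field(default_factory=time.time)
    reinforcements: int = 0

fact = EideticFact("finding", "auth uses JWT refresh with 15min expiry", 0.85)
fact.challenge()  # a related finding surfaced that undermines it
print(round(fact.confidence, 2))  # 0.65
```

In Qdrant these would live as point payloads in the per-project eidetic and episodic collections, with the text embedded for similarity search.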

&lt;h3&gt;
  
  
  The retrieval side — Noetic RAG
&lt;/h3&gt;

&lt;p&gt;I've been calling this approach "Noetic RAG" — retrieval augmented generation on the &lt;em&gt;thinking&lt;/em&gt;, not just the artifacts. When an AI agent starts a new session, we don't just load documents. We load:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dead-ends that match the current task (so it doesn't repeat failed approaches)&lt;/li&gt;
&lt;li&gt;Mistake patterns with prevention strategies&lt;/li&gt;
&lt;li&gt;Decisions and their rationale (so it understands &lt;em&gt;why&lt;/em&gt; things are the way they are)&lt;/li&gt;
&lt;li&gt;Episodic arcs from similar sessions (temporal context)&lt;/li&gt;
&lt;li&gt;Cross-project patterns (if the same anti-pattern appeared in project A, surface it in project B)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The similarity search here isn't just cosine distance on the task description — it's filtered by recency, weighted by confidence, and scoped by project (with optional global reach for cross-project learnings).&lt;/p&gt;
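The re-ranking described above can be sketched as a small post-processing step over raw similarity hits. The multiplicative formula and the 30-day half-life are illustrative assumptions, not Empirica's actual scoring.

```python
import math, time

def rank(hits, now=None, half_life_days=30.0):
    """hits: dicts with 'similarity', 'confidence', 'created_at' (unix seconds)."""
    now = now or time.time()
    scored = []
    for h in hits:
        age_days = (now - h["created_at"]) / 86400
        # Exponential recency weight: halves every half_life_days.
        recency = math.exp(-math.log(2) * age_days / half_life_days)
        scored.append((h["similarity"] * h["confidence"] * recency, h))
    return [h for _, h in sorted(scored, key=lambda s: s[0], reverse=True)]

now = time.time()
hits = [
    {"id": "old_finding", "similarity": 0.9, "confidence": 0.5,
     "created_at": now - 30 * 86400},   # medium confidence, a month old
    {"id": "fresh_deadend", "similarity": 0.8, "confidence": 0.9,
     "created_at": now - 86400},        # high confidence, from yesterday
]
print([h["id"] for h in rank(hits, now=now)])  # ['fresh_deadend', 'old_finding']
```

The fresh, high-confidence dead-end outranks the older finding even though its raw similarity is lower, which is exactly the behavior you want at session start.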

&lt;h3&gt;
  
  
  What this looks like in practice
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Focused search: eidetic facts + episodic session arcs
&lt;/span&gt;&lt;span class="n"&gt;empirica&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auth token rotation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Full search: all collections
&lt;/span&gt;&lt;span class="n"&gt;empirica&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auth token rotation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="nb"&gt;all&lt;/span&gt;

&lt;span class="c1"&gt;# Include cross-project patterns
&lt;/span&gt;&lt;span class="n"&gt;empirica&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auth token rotation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="k"&gt;global&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When context compacts (and it will — Claude Code's 200k window fills up fast), the bootstrap reloads ~800 tokens of epistemically ranked context instead of trying to reconstruct everything from scratch. Findings, unknowns, active goals, architectural decisions — weighted by confidence and recency.&lt;/p&gt;
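The bootstrap step can be sketched as a greedy pack of the highest-ranked artifacts under a fixed token budget. The whitespace token estimate and the item shapes are simplifications for illustration, not the actual implementation.

```python
def bootstrap_context(items, budget_tokens=800):
    """items: dicts with 'text' and 'score'; returns texts that fit the budget."""
    picked, used = [], 0
    for item in sorted(items, key=lambda i: i["score"], reverse=True):
        cost = len(item["text"].split())  # crude stand-in for a real tokenizer
        if used + cost > budget_tokens:
            continue  # skip anything that would blow the budget
        picked.append(item["text"])
        used += cost
    return picked

items = [
    {"text": "finding: auth uses JWT refresh, 15min expiry", "score": 0.9},
    {"text": "dead-end: async ORM migration unsupported", "score": 0.7},
    {"text": "low-value detail " * 500, "score": 0.3},  # too big to fit
]
ctx = bootstrap_context(items, budget_tokens=50)
print(len(ctx))  # 2
```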

&lt;h3&gt;
  
  
  The temporal dimension
&lt;/h3&gt;

&lt;p&gt;This is the part that makes Qdrant particularly well-suited. We store timestamps and decay parameters as payload fields, and filter on them at query time. A dead-end from yesterday with high confidence outranks a finding from last month with medium confidence. But a pattern that's been confirmed three times across two projects? That climbs in relevance regardless of age.&lt;/p&gt;

&lt;p&gt;The decay isn't a fixed curve — it's modulated by reinforcement. Every time a pattern re-emerges, its effective age resets. Qdrant's payload filtering makes this efficient: we can do the temporal math at query time without re-embedding anything.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why this matters beyond the obvious
&lt;/h3&gt;

&lt;p&gt;The real value isn't just "AI remembers things" — it's that the memory is &lt;em&gt;epistemically grounded&lt;/em&gt;. Every artifact has uncertainty quantification. Every session has calibration data (how accurate was the AI's self-assessment compared to objective evidence like test results and code quality metrics). The memory doesn't just tell you what happened — it tells you how much to trust what happened.&lt;/p&gt;

&lt;p&gt;After 5,600+ measured transactions, the calibration data shows AI agents consistently overestimate their own confidence by 20-40%. Having memory that carries that calibration forward means the system gets more honest over time, not just more knowledgeable.&lt;/p&gt;
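The calibration comparison itself is simple to state: the gap between what the AI claimed and what the evidence showed. The numbers below are made up for illustration; the 20-40% range is the article's measurement, not something derived here.

```python
def mean_overconfidence(records):
    """records: (self_assessed_confidence, objective_accuracy) pairs."""
    gaps = [conf - actual for conf, actual in records]
    return sum(gaps) / len(gaps)

sessions = [
    (0.90, 0.70),  # claimed 90% confident, tests showed 70% correct
    (0.85, 0.60),
    (0.95, 0.65),
]
print(round(mean_overconfidence(sessions), 2))  # 0.25, i.e. 25% overconfident
```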

&lt;h3&gt;
  
  
  Try it
&lt;/h3&gt;

&lt;p&gt;Empirica is MIT licensed and open source. If you're building anything where AI agents need to remember across sessions — especially if temporal awareness matters — the prosodic/episodic/eidetic architecture might be worth looking at.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Nubaeon/empirica" rel="noopener noreferrer"&gt;github.com/Nubaeon/empirica&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://getempirica.com" rel="noopener noreferrer"&gt;getempirica.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Install:&lt;/strong&gt; &lt;code&gt;pip install empirica&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy to answer questions about the Qdrant integration or the broader noetic RAG architecture.&lt;/p&gt;

</description>
      <category>qdrant</category>
      <category>ai</category>
      <category>opensource</category>
      <category>python</category>
    </item>
    <item>
      <title>The Best (Free to Cheap) AI-Friendly CLI and Coding Environments</title>
      <dc:creator>David Van Assche (S.L)</dc:creator>
      <pubDate>Fri, 26 Sep 2025 17:01:41 +0000</pubDate>
      <link>https://dev.to/soulentheo/the-best-free-cheap-ai-friendly-cli-and-coding-environments-16m6</link>
      <guid>https://dev.to/soulentheo/the-best-free-cheap-ai-friendly-cli-and-coding-environments-16m6</guid>
      <description>&lt;p&gt;With so many &lt;strong&gt;LLM providers and coding environments&lt;/strong&gt;, how do you choose the right one for your next project? We all want the "best" model, but what we really need is the one that's the most reliable, the most cost-effective, and the best suited to our workflow. This guide breaks down the real-world performance, pricing, and hidden costs of the top LLM providers and CLI environments, from freemium to enterprise. We'll go beyond the marketing claims and give you the data you need to make an informed decision.&lt;/p&gt;

&lt;h3&gt;
  
  
  CLI and Code-Focused Environments (Sorted by Cost)
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Tier 1: Free &amp;amp; Open-Source (Cost is just API Tokens / Free Tier Access)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cursor CLI&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; Free. Relies on the user's API key (OpenAI, Anthropic, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; An editor and CLI environment built around a code-aware AI. Ideal for developers who want maximum control over the model and are happy to manage their own API costs.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Qwen Code&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; &lt;strong&gt;Free tier&lt;/strong&gt; with 2,000 requests per day and a 60 RPM limit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; A coding agent focused on tool calling and environment interaction. Offers a generous free tier for developers on a budget, perfect for experimenting with agentic workflows.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;GitHub Copilot CLI&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; &lt;strong&gt;Free Tier Available.&lt;/strong&gt; New "Copilot Free" tier offers 2,000 code completions and 50 premium requests per month. &lt;strong&gt;Students, teachers, and open-source maintainers get Copilot Pro for free.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; Agent-powered, GitHub-native tool that executes coding tasks. This is the new, more powerful &lt;strong&gt;agentic&lt;/strong&gt; Copilot CLI, replacing the older &lt;code&gt;gh-copilot&lt;/code&gt; extension.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Tier 2: Freemium &amp;amp; Free-for-Individual (Generous Free Access)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Gemini Code Assist&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; &lt;strong&gt;Free for individuals (permanently).&lt;/strong&gt; Access to higher daily limits is available through a subscription to Google AI Pro ($19.99/month), which often includes an &lt;strong&gt;extended free trial for 12 months for students&lt;/strong&gt; in eligible regions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; An AI-first coding assistant integrated directly into major IDEs and the terminal (Gemini CLI). The individual version is a highly generous, no-cost option.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Warp Code&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; &lt;strong&gt;Freemium.&lt;/strong&gt; Includes 150 free AI requests per month. Paid plans start around &lt;strong&gt;$15/user/month&lt;/strong&gt; for teams.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; A complete agentic development environment that unifies the terminal, editor, and AI features. Known for its speed, local indexing, and multi-model orchestration.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Atlassian Rovodev&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; &lt;strong&gt;Free tier&lt;/strong&gt; with an Atlassian Cloud account. Quotas are based on "AI credits" tied to paid Jira/Confluence plans.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; Integrated with the Atlassian ecosystem, focusing on developer tasks within project management (Jira) and documentation (Confluence). Best for teams already on the Atlassian stack.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Tier 3: Subscription Required (Paid Access to High-End Models)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Codex CLI (OpenAI)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; Included with ChatGPT paid plans (Plus, Pro, Business, Enterprise), starting at $20/user/month (ChatGPT Plus).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; A comprehensive software engineering agent powered by models like &lt;strong&gt;GPT-5-Codex&lt;/strong&gt;. It works in the terminal, IDE, and cloud, using tools, tracking progress with a to-do list, and supporting multi-modal input.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Claude Code (Anthropic)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; Likely included with a paid Claude Pro or higher subscription. Uses Anthropic's latest models (e.g., Sonnet/Opus).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; An agentic coding partner focused on &lt;strong&gt;extended thinking&lt;/strong&gt; and complex, multi-step tasks. It uses planning modes and creates project memory files (&lt;code&gt;CLAUDE.md&lt;/code&gt;) for deep context management.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Grok CLI (xAI)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; Requires an X Premium+ subscription, which is roughly &lt;strong&gt;$30-$40/month&lt;/strong&gt; for consumer access, or an API plan for token-based billing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; Distinguished by its focus on real-time data integration (from the X platform) and its "rebellious streak." Best for projects requiring up-to-the-minute data integration alongside coding tasks.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  LLM and Inference Provider Cross-Reference
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;OpenAI&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freemium / Entry Level Cost:&lt;/strong&gt; Free tier with a $5.00 credit (valid for 3 months).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Model:&lt;/strong&gt; Pay-per-token; different rates for different models (e.g., GPT-4o is more expensive than GPT-3.5).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability &amp;amp; Throughput:&lt;/strong&gt; High. Offers tiered usage limits that scale with spend. Known for robust infrastructure but can have occasional downtime. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Anthropic&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freemium / Entry Level Cost:&lt;/strong&gt; Free tier with a $10.00 credit (valid for a limited period).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Model:&lt;/strong&gt; Pay-per-token. Haiku model is the most cost-effective.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability &amp;amp; Throughput:&lt;/strong&gt; High. Free tier has rate limits (e.g., 5 RPM, 20K TPM on Haiku), which are sufficient for development and experimentation.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Deepseek&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freemium / Entry Level Cost:&lt;/strong&gt; Free tier with some trial credits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Model:&lt;/strong&gt; Pay-per-token, with separate rates for input (cache hit/miss) and output tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability &amp;amp; Throughput:&lt;/strong&gt; Generally good, but may not have the same global infrastructure as larger providers. Good for cost-sensitive projects.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Qwen (Dashscope)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freemium / Entry Level Cost:&lt;/strong&gt; Free tier with 2,000 requests per day and a 60 RPM limit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Model:&lt;/strong&gt; Pay-per-token after free tier. The Qwen-Flash model is very cheap for simple tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability &amp;amp; Throughput:&lt;/strong&gt; Good. The free tier is generous for personal projects and offers a great way to test the model's capabilities.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Fireworks AI&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freemium / Entry Level Cost:&lt;/strong&gt; $50 monthly spend limit with a valid payment method.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Model:&lt;/strong&gt; Pay-per-token. Very competitive rates for various open-source models like Deepseek and Qwen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability &amp;amp; Throughput:&lt;/strong&gt; Very high. Known for its speed and low latency. The free tier is well-suited for experimentation and small-scale applications before committing to a higher spend tier.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Amazon Bedrock&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freemium / Entry Level Cost:&lt;/strong&gt; Pay-as-you-go, no upfront cost. Some models have a free trial period.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Model:&lt;/strong&gt; Complex. On-demand, provisioned throughput, and commitment-based pricing. Pricing varies significantly by model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability &amp;amp; Throughput:&lt;/strong&gt; Extremely High. Backed by AWS's robust infrastructure, offering high reliability and the ability to scale. Best for production use cases where you need a consistent throughput.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Hugging Face&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freemium / Entry Level Cost:&lt;/strong&gt; Free tier for inference endpoints on some models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Model:&lt;/strong&gt; Pay-per-hour for dedicated inference endpoints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability &amp;amp; Throughput:&lt;/strong&gt; Reliability depends on the model's popularity and the infrastructure supporting it. The free tier can have high latency due to a queuing system.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;OpenRouter&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freemium / Entry Level Cost:&lt;/strong&gt; Pay-per-token. You only pay for what you use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Model:&lt;/strong&gt; Aggregator. Provides access to many models (including OpenAI and Anthropic) on a single API key, often at competitive rates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability &amp;amp; Throughput:&lt;/strong&gt; Varies by model, but generally high. The platform manages the back-end complexity of multiple models, making it a great entry point for comparing different models.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>beginners</category>
      <category>linux</category>
    </item>
  </channel>
</rss>
