<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: kol kol</title>
    <description>The latest articles on DEV Community by kol kol (@kollittle).</description>
    <link>https://dev.to/kollittle</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3919931%2Fc79f33b2-a2d7-46ef-85a5-74c1b888f1c7.png</url>
      <title>DEV Community: kol kol</title>
      <link>https://dev.to/kollittle</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kollittle"/>
    <language>en</language>
    <item>
      <title>I Spent $500 on RAG Infrastructure Before Realizing These 7 Mistakes Were Killing My Results</title>
      <dc:creator>kol kol</dc:creator>
      <pubDate>Sat, 20 Jun 2026 22:08:38 +0000</pubDate>
      <link>https://dev.to/kollittle/i-spent-500-on-rag-infrastructure-before-realizing-these-7-mistakes-were-killing-my-results-iph</link>
      <guid>https://dev.to/kollittle/i-spent-500-on-rag-infrastructure-before-realizing-these-7-mistakes-were-killing-my-results-iph</guid>
      <description>&lt;h1&gt;
  
  
  I Spent $500 on RAG Infrastructure Before Realizing These 7 Mistakes Were Killing My Results
&lt;/h1&gt;

&lt;p&gt;I built a RAG pipeline for private document search. It cost me $500 in vector database compute, weeks of debugging, and a lot of frustration. The results were mediocre — users got irrelevant answers, queries were slow, and the whole thing felt like a fancy keyword search with extra steps.&lt;/p&gt;

&lt;p&gt;Then I audited the pipeline step by step. Turns out, I made 7 mistakes that are incredibly common in RAG systems. Fixing them transformed the pipeline from "meh" to genuinely useful.&lt;/p&gt;

&lt;p&gt;Here's what I got wrong, and what I changed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistake #1: I Chopped Documents Into Random Pieces
&lt;/h2&gt;

&lt;p&gt;I was splitting documents by fixed token count — 512 tokens per chunk, done. Simple, right?&lt;/p&gt;

&lt;p&gt;Wrong. I was destroying semantic context. A paragraph about API authentication would get split mid-sentence, with half in one chunk and half in another. When retrieval ran, the LLM got fragmented context and produced garbage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Parent-Document retrieval with semantic chunking.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Split by natural document boundaries first (paragraphs, sections, headers) — these are your "parent documents"&lt;/li&gt;
&lt;li&gt;Create smaller child chunks from parents for vector search&lt;/li&gt;
&lt;li&gt;When a child chunk matches, return the full parent document to the LLM&lt;/li&gt;
&lt;li&gt;Add 10-20% overlap between chunks so boundary information isn't lost
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# What I should have done from the start
&lt;/span&gt;&lt;span class="n"&gt;CHUNK_CONFIG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_overlap&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;separators&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;! &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;? &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Query accuracy jumped 30% after this one change.&lt;/p&gt;
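&lt;p&gt;The parent/child idea from the list above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: it assumes paragraphs are separated by blank lines, measures chunk size in characters rather than tokens, and the names &lt;code&gt;split_parents&lt;/code&gt;, &lt;code&gt;split_children&lt;/code&gt;, &lt;code&gt;build_index&lt;/code&gt;, and &lt;code&gt;parent_for_child&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import re


def split_parents(text):
    """Split a document into parent chunks on blank lines (paragraph boundaries)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_children(parent, size=200, overlap=40):
    """Slide a character window over one parent, stepping by size minus overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    children = []
    for start in range(0, len(parent), step):
        children.append(parent[start:start + size])
        if start + size >= len(parent):
            break  # this window already reached the end of the parent
    return children


def build_index(text, size=200, overlap=40):
    """Return (parents, children); each child is a (parent_id, child_text) pair."""
    parents = split_parents(text)
    children = []
    for parent_id, parent in enumerate(parents):
        for child in split_children(parent, size, overlap):
            children.append((parent_id, child))
    return parents, children


def parent_for_child(parents, children, child_index):
    """Look up the full parent paragraph for a matched child chunk."""
    parent_id, _ = children[child_index]
    return parents[parent_id]
```

&lt;p&gt;In a real pipeline only the child texts would be embedded and searched, and &lt;code&gt;parent_for_child&lt;/code&gt; would run on each hit before the context is handed to the LLM.&lt;/p&gt;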

&lt;h2&gt;
  
  
  Mistake #2: I Used 0.5:0.5 Weights for Hybrid Search
&lt;/h2&gt;

&lt;p&gt;My vector database supports hybrid search — combining vector similarity with keyword (BM25) matching. I left the weights at the default 50/50 split and assumed that was fine.&lt;/p&gt;

&lt;p&gt;It wasn't. For technical documentation, exact keyword matches matter far more than an even split allows. Someone searching for "HNSW ef_construction" needs that exact term, not a semantically similar but wrong answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Dynamic weights based on query type.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Factual queries ("what is X"): 35% vector, 65% keyword&lt;/li&gt;
&lt;li&gt;Semantic queries ("how do I build X"): 75% vector, 25% keyword
&lt;/li&gt;
&lt;li&gt;General queries: 60% vector, 40% keyword
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;WEIGHTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;factual&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.65&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;semantic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;general&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The keyword weight bump for factual queries alone eliminated most of the "almost right but wrong" answers.&lt;/p&gt;
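&lt;p&gt;Here is one way the weights could be applied, as a rough sketch rather than the database's own API. It assumes the vector and BM25 scores arrive as plain &lt;code&gt;{doc_id: score}&lt;/code&gt; dicts, and &lt;code&gt;classify_query&lt;/code&gt;, &lt;code&gt;normalize&lt;/code&gt;, and &lt;code&gt;hybrid_rank&lt;/code&gt; are hypothetical helpers. The prefix-based classifier is deliberately crude.&lt;/p&gt;

```python
WEIGHTS = {
    "factual": {"vector": 0.35, "keyword": 0.65},
    "semantic": {"vector": 0.75, "keyword": 0.25},
    "general": {"vector": 0.6, "keyword": 0.4},
}


def classify_query(query):
    """Toy prefix heuristic; a real system might use a small classifier instead."""
    q = query.strip().lower()
    if q.startswith(("what is", "what are", "define")):
        return "factual"
    if q.startswith(("how do", "how can", "how to", "why")):
        return "semantic"
    return "general"


def normalize(scores):
    """Min-max scale a {doc_id: score} dict to [0, 1] so both signals are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(query, vector_scores, keyword_scores):
    """Blend normalized vector and BM25 scores using the weights for the query type."""
    w = WEIGHTS[classify_query(query)]
    vec, kw = normalize(vector_scores), normalize(keyword_scores)
    docs = set(vec) | set(kw)
    fused = {
        doc: w["vector"] * vec.get(doc, 0.0) + w["keyword"] * kw.get(doc, 0.0)
        for doc in docs
    }
    return sorted(fused, key=fused.get, reverse=True)
```

&lt;p&gt;Normalizing first matters because BM25 scores are unbounded while cosine similarities are not, so a raw weighted sum would let one signal drown out the other.&lt;/p&gt;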

&lt;h2&gt;
  
  
  Mistake #3: I Blew Up My Vector Database's Memory
&lt;/h2&gt;

&lt;p&gt;I set &lt;code&gt;ef_construction&lt;/code&gt; to the maximum value because "higher is better, right?" On a 50GB+ index, this meant the index build process consumed all available RAM and crashed. Twice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Size-appropriate HNSW parameters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Don't max this out — your server will cry
&lt;/span&gt;&lt;span class="n"&gt;HNSW_CONFIG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;M&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;# connections per node (8-32 is the sweet spot)
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ef_construction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# not 400. Not 1000. 200.
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ef_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# query time, not build time
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Index build time went from "it crashed" to 45 minutes. Memory usage dropped 70%.&lt;/p&gt;
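&lt;p&gt;Before picking parameters, a back-of-envelope size estimate helps. The sketch below uses a common rule of thumb for HNSW-style indexes (raw float32 vectors plus roughly &lt;code&gt;2 * M&lt;/code&gt; four-byte links per vector). Treat it as an assumption: the exact layout depends on the database, it covers only the finished index, and build-time working memory comes on top. The corpus size is hypothetical.&lt;/p&gt;

```python
def hnsw_index_bytes(num_vectors, dim, m, bytes_per_float=4, bytes_per_link=4):
    """Rough steady-state index size: raw vectors plus about 2*M links per vector."""
    vector_bytes = num_vectors * dim * bytes_per_float
    link_bytes = num_vectors * m * 2 * bytes_per_link
    return vector_bytes + link_bytes


def gib(num_bytes):
    """Convert a byte count to gibibytes."""
    return num_bytes / (1024 ** 3)


# Hypothetical corpus: 10 million 768-dimensional float32 vectors.
for m in (16, 32, 64):
    size = hnsw_index_bytes(10_000_000, 768, m)
    print(f"M={m}: about {gib(size):.1f} GiB")
```

&lt;p&gt;If the estimate is already close to the machine's RAM, there is no headroom left for the build itself.&lt;/p&gt;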

&lt;h2&gt;
  
  
  Mistake #4: My Embedding Model Was Too Generic
&lt;/h2&gt;

&lt;p&gt;I was using a general-purpose embedding model trained on Wikipedia and web text. My documents were technical API references and engineering runbooks. The model didn't understand my domain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Switch to a model fine-tuned for technical/code content. The difference was night and day — suddenly "migration" and "transform" weren't treated as synonyms just because they're sometimes related in general text.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistake #5: I Had No Query Rewrite Layer
&lt;/h2&gt;

&lt;p&gt;Users typed natural questions like "why is my build slow" and the system searched for those exact words in technical documentation that said "CI pipeline optimization" and "build duration analysis." Zero overlap. Zero results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; A lightweight LLM query rewrite step before retrieval.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User query: "why is my build slow"
→ Rewritten: "CI pipeline performance optimization build duration"
→ Retrieved: Relevant documentation ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This single step improved recall by 40%. The cost? About 0.001 cents per query with a small model.&lt;/p&gt;
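&lt;p&gt;A rewrite step along these lines might look like the sketch below. The prompt wording and the &lt;code&gt;rewrite_query&lt;/code&gt; helper are illustrative assumptions, and &lt;code&gt;complete&lt;/code&gt; stands in for whatever model client the pipeline uses: any callable that takes a prompt string and returns text.&lt;/p&gt;

```python
REWRITE_PROMPT = (
    "Rewrite the user question as a short search query using the vocabulary "
    "of technical documentation. Return only the query.\n\n"
    "Question: {question}\nSearch query:"
)


def rewrite_query(question, complete, max_chars=200):
    """Ask an LLM for a retrieval-friendly query, falling back to the original.

    `complete` is a hypothetical callable: prompt string in, generated text out.
    """
    try:
        rewritten = complete(REWRITE_PROMPT.format(question=question)).strip()
    except Exception:
        return question  # never let the rewrite step block retrieval
    if not rewritten or len(rewritten) > max_chars:
        return question  # empty or runaway output: keep the user's wording
    return rewritten
```

&lt;p&gt;Falling back to the original question keeps retrieval working when the model is down or returns something unusable.&lt;/p&gt;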

&lt;h2&gt;
  
  
  Mistake #6: I Didn't Filter Duplicate Context
&lt;/h2&gt;

&lt;p&gt;Retrieving top-10 chunks meant I often got the same paragraph 3 times with slightly different wording. The LLM would repeat itself, hallucinate from the repetition, and produce bloated answers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Maximal marginal relevance (MMR) re-ranking.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Instead of returning top-10 most similar
# Return top-10 most similar AND diverse
&lt;/span&gt;&lt;span class="n"&gt;retrieved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;diverse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mmr_rerank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retrieved&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lambda_param&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Answers became more concise and covered more ground.&lt;/p&gt;
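&lt;p&gt;For reference, greedy MMR fits in a few lines of plain Python. This is a sketch under assumptions, not the helper used above: it takes candidates as &lt;code&gt;(doc_id, vector)&lt;/code&gt; pairs and the query as an embedding vector, and uses cosine similarity for both relevance and redundancy.&lt;/p&gt;

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_rerank(candidates, query_vec, lambda_param=0.7, k=10):
    """Greedy MMR over (doc_id, vector) pairs; returns the selected doc ids in order.

    Each round picks the candidate that maximizes
    lambda * sim(query, doc) - (1 - lambda) * max sim(doc, already selected).
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) != k:
        def mmr_score(item):
            _, vec = item
            relevance = cosine(query_vec, vec)
            redundancy = max(
                (cosine(vec, chosen_vec) for _, chosen_vec in selected),
                default=0.0,
            )
            return lambda_param * relevance - (1 - lambda_param) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [doc_id for doc_id, _ in selected]
```

&lt;p&gt;A higher &lt;code&gt;lambda_param&lt;/code&gt; favors relevance and a lower one favors diversity, so 0.7 leans toward relevance while still penalizing near-duplicates.&lt;/p&gt;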

&lt;h2&gt;
  
  
  Mistake #7: I Never Measured Retrieval Quality
&lt;/h2&gt;

&lt;p&gt;I was evaluating the whole RAG pipeline end-to-end. If the final answer was bad, I didn't know if it was the retrieval, the prompt, or the LLM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Separate retrieval evaluation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track hit rate: does the retrieved context contain the answer?&lt;/li&gt;
&lt;li&gt;Track MRR (Mean Reciprocal Rank): how high in the results is the right chunk?&lt;/li&gt;
&lt;li&gt;Build a golden test set of 100 query-document pairs&lt;/li&gt;
&lt;li&gt;Only optimize the generation layer once retrieval scores are solid&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This saved me from chasing the wrong problems for weeks.&lt;/p&gt;
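&lt;p&gt;Both metrics are simple to compute once a golden set exists. The sketch below assumes one relevant document per query, with &lt;code&gt;golden&lt;/code&gt; mapping each query to its expected doc id and &lt;code&gt;results&lt;/code&gt; mapping each query to the ranked ids the retriever returned. The function names and data shapes are illustrative.&lt;/p&gt;

```python
def hit_rate_at_k(results, golden, k=10):
    """Fraction of queries whose relevant doc appears in the top k retrieved ids."""
    hits = sum(
        1 for query, relevant in golden.items()
        if relevant in results.get(query, [])[:k]
    )
    return hits / len(golden)


def mean_reciprocal_rank(results, golden):
    """Average of 1/rank of the relevant doc per query; a miss contributes 0."""
    total = 0.0
    for query, relevant in golden.items():
        ranked = results.get(query, [])
        if relevant in ranked:
            total += 1.0 / (ranked.index(relevant) + 1)
    return total / len(golden)


# Tiny illustrative golden set: the query text and doc ids are made up.
golden = {"q1": "d1", "q2": "d5", "q3": "d9"}
results = {"q1": ["d1", "d2"], "q2": ["d3", "d5"], "q3": ["d4"]}
print(hit_rate_at_k(results, golden, k=10), mean_reciprocal_rank(results, golden))
```

&lt;p&gt;Running the same golden set after every retrieval change shows whether a tweak helped before the generation layer gets involved.&lt;/p&gt;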

&lt;h2&gt;
  
  
  The Results After All 7 Fixes
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Answer relevance&lt;/td&gt;
&lt;td&gt;~45%&lt;/td&gt;
&lt;td&gt;~85%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg query latency&lt;/td&gt;
&lt;td&gt;3.2s&lt;/td&gt;
&lt;td&gt;1.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly vector DB cost&lt;/td&gt;
&lt;td&gt;$180&lt;/td&gt;
&lt;td&gt;$95&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duplicate context in responses&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;td&gt;8%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;RAG isn't hard because the algorithms are complex. It's hard because there are 7+ knobs, and they all interact with each other.&lt;/p&gt;

&lt;p&gt;My advice: fix chunking first, then weights, then embedding quality. In that order. Everything else is optimization.&lt;/p&gt;

&lt;p&gt;What's your biggest RAG headache? Drop it in the comments — I've probably hit it too.&lt;/p&gt;

</description>
      <category>codcompass</category>
      <category>ai</category>
      <category>knowledgebase</category>
      <category>webdev</category>
    </item>
    <item>
      <title>My API Broke Every January 1st — The Timezone Bug That Slipped Past Code Review</title>
      <dc:creator>kol kol</dc:creator>
      <pubDate>Sat, 20 Jun 2026 14:04:46 +0000</pubDate>
      <link>https://dev.to/kollittle/my-api-broke-every-january-1st-the-timezone-bug-that-slipped-past-code-review-j40</link>
      <guid>https://dev.to/kollittle/my-api-broke-every-january-1st-the-timezone-bug-that-slipped-past-code-review-j40</guid>
      <description>&lt;p&gt;My API broke at exactly 00:00 UTC on January 1st. Not the users' midnight — &lt;em&gt;UTC midnight&lt;/em&gt;. Which meant our users in Tokyo had been living with broken data since 9 AM their time.&lt;/p&gt;

&lt;p&gt;And the worst part? The tests all passed. The staging environment worked fine. It only broke in production, because production is in a different timezone than staging.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bug
&lt;/h2&gt;

&lt;p&gt;Here's what the code looked like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getDailyReport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;date&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;T&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;T23:59:59Z&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findMany&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;gte&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="na"&gt;lt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;end&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Seems fine, right? &lt;code&gt;toISOString()&lt;/code&gt; gives you UTC. We're filtering by date. What could go wrong?&lt;/p&gt;

&lt;p&gt;Here's what went wrong: &lt;code&gt;new Date(date)&lt;/code&gt; does not treat every offset-less string the same way. A date-only ISO string such as &lt;code&gt;"2026-01-01"&lt;/code&gt; is parsed as UTC, but a date-time string with no offset, such as &lt;code&gt;"2026-01-01T00:00:00"&lt;/code&gt;, is interpreted in the &lt;strong&gt;local timezone&lt;/strong&gt;, and non-ISO formats are implementation-defined. In staging (UTC server), local midnight on January 1st is &lt;code&gt;2026-01-01T00:00:00.000Z&lt;/code&gt;. In production (US-East server), it is &lt;code&gt;2026-01-01T05:00:00.000Z&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Five hour offset. Every single date query. For an entire year before anyone noticed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Tests Passed
&lt;/h2&gt;

&lt;p&gt;Our CI runs in Docker containers set to UTC. Our staging server is also UTC. Our production server? US-East. The timezone mismatch was invisible until New Year's Day rolled around and the date boundary crossed the timezone offset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Staging (UTC):     2026-01-01 → Jan 1 00:00 UTC ✅
Production (EST):  2026-01-01 → Jan 1 05:00 UTC ❌
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We lost 5 hours of data on every query. The reports showed numbers that were "close enough" that nobody flagged it for 12 months.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getDailyReport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Always append time to force UTC interpretation&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;date&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;T00:00:00Z`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;date&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;T23:59:59.999Z`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findMany&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;gte&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;lte&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;end&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One line change. Append &lt;code&gt;T00:00:00Z&lt;/code&gt; to force the &lt;code&gt;Date&lt;/code&gt; constructor into UTC mode. No more ambiguity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Fix (Process, Not Code)
&lt;/h2&gt;

&lt;p&gt;The code fix took 30 seconds. The real fix took a week:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Added a timezone assertion in CI&lt;/strong&gt; — our test suite now explicitly checks that &lt;code&gt;process.env.TZ === 'UTC'&lt;/code&gt;. If anyone changes the CI timezone, tests fail.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set &lt;code&gt;TZ=UTC&lt;/code&gt; in all Dockerfiles&lt;/strong&gt; — every container, every environment, same timezone. No surprises.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Added a timezone check to our deploy script&lt;/strong&gt; — &lt;code&gt;date +%Z&lt;/code&gt; must return &lt;code&gt;UTC&lt;/code&gt; before deploy proceeds.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Wrote a linter rule&lt;/strong&gt; — flags any &lt;code&gt;new Date(string)&lt;/code&gt; where the string doesn't contain timezone info.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Lesson
&lt;/h2&gt;

&lt;p&gt;Timezone bugs are sneaky because they don't crash. They produce wrong data that looks right. Your users won't get an error page — they'll get silently incorrect numbers, and they'll trust them.&lt;/p&gt;

&lt;p&gt;Three rules I now follow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Never trust the system timezone.&lt;/strong&gt; Always set &lt;code&gt;TZ=UTC&lt;/code&gt; explicitly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Never parse dates without timezones.&lt;/strong&gt; A string with no offset, like &lt;code&gt;"2026-01-01T00:00:00"&lt;/code&gt;, is ambiguous. &lt;code&gt;"2026-01-01T00:00:00Z"&lt;/code&gt; is not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Never assume your CI timezone matches production.&lt;/strong&gt; Assert it in your tests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I've been coding for years. I still got bit by this. If it can happen to me, it can happen to you.&lt;/p&gt;





</description>
      <category>codcompass</category>
      <category>ai</category>
      <category>knowledgebase</category>
      <category>webdev</category>
    </item>
    <item>
      <title>My API Broke Every January 1st — The Timezone Bug I Should Have Caught in Code Review</title>
      <dc:creator>kol kol</dc:creator>
      <pubDate>Fri, 19 Jun 2026 22:02:33 +0000</pubDate>
      <link>https://dev.to/kollittle/my-api-broke-every-january-1st-the-timezone-bug-i-should-have-caught-in-code-review-51hb</link>
      <guid>https://dev.to/kollittle/my-api-broke-every-january-1st-the-timezone-bug-i-should-have-caught-in-code-review-51hb</guid>
      <description>&lt;p&gt;My API broke at exactly 00:00 UTC on January 1st. Not the users' midnight — &lt;em&gt;UTC midnight&lt;/em&gt;. Which meant our users in Tokyo had been living with broken data since 9 AM their time.&lt;/p&gt;

&lt;p&gt;And the worst part? The tests all passed. The staging environment worked fine. It only broke in production, because production is in a different timezone than staging.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bug
&lt;/h2&gt;

&lt;p&gt;Here's what the code looked like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getDailyReport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;date&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;T&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;T23:59:59Z&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findMany&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;gte&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="na"&gt;lt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;end&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Seems fine, right? &lt;code&gt;toISOString()&lt;/code&gt; gives you UTC. We're filtering by date. What could go wrong?&lt;/p&gt;

&lt;p&gt;Here's what went wrong: &lt;code&gt;new Date(date)&lt;/code&gt; does not treat every offset-less string the same way. A date-only ISO string such as &lt;code&gt;"2026-01-01"&lt;/code&gt; is parsed as UTC, but a date-time string with no offset, such as &lt;code&gt;"2026-01-01T00:00:00"&lt;/code&gt;, is interpreted in the &lt;strong&gt;local timezone&lt;/strong&gt;, and non-ISO formats are implementation-defined. In staging (UTC server), local midnight on January 1st is &lt;code&gt;2026-01-01T00:00:00.000Z&lt;/code&gt;. In production (US-East server), it is &lt;code&gt;2026-01-01T05:00:00.000Z&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Five hour offset. Every single date query. For an entire year before anyone noticed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Tests Passed
&lt;/h2&gt;

&lt;p&gt;Our CI runs in Docker containers set to UTC. Our staging server is also UTC. Our production server? US-East. The timezone mismatch was invisible until New Year's Day rolled around and the date boundary crossed the timezone offset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Staging (UTC):     2026-01-01 → Jan 1 00:00 UTC ✅
Production (EST):  2026-01-01 → Jan 1 05:00 UTC ❌
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We lost 5 hours of data on every query. The reports showed numbers that were "close enough" that nobody flagged it for 12 months.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getDailyReport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Always append time to force UTC interpretation&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;date&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;T00:00:00Z`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;date&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;T23:59:59.999Z`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findMany&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;gte&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;lt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;end&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tiny change. Appending &lt;code&gt;T00:00:00Z&lt;/code&gt; turns the string into an explicit UTC timestamp, so the &lt;code&gt;Date&lt;/code&gt; constructor no longer falls back to the server's local timezone. No more ambiguity.&lt;/p&gt;
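&lt;p&gt;The same idea can be pulled into a small helper so every report shares one definition of "a UTC day". This is a minimal sketch rather than code from the project: the name &lt;code&gt;utcDayRange&lt;/code&gt; is hypothetical, and it assumes the input is always a &lt;code&gt;YYYY-MM-DD&lt;/code&gt; string. It returns a half-open range (start inclusive, end exclusive), which pairs naturally with &lt;code&gt;gte&lt;/code&gt;/&lt;code&gt;lt&lt;/code&gt; filters.&lt;/p&gt;

```javascript
// Hypothetical helper: turn a YYYY-MM-DD string into a half-open UTC day range.
const DAY_MS = 24 * 60 * 60 * 1000;

function utcDayRange(date) {
  // Reject anything that is not a plain calendar date, so a stray time or
  // offset in the input cannot silently shift the window.
  if (!/^\d{4}-\d{2}-\d{2}$/.test(date)) {
    throw new RangeError(`Expected YYYY-MM-DD, got: ${date}`);
  }
  // Explicit time plus "Z" means the result does not depend on process.env.TZ.
  const start = new Date(`${date}T00:00:00Z`);
  // Values such as month 13 pass the regex but parse to an Invalid Date.
  if (Number.isNaN(start.getTime())) {
    throw new RangeError(`Could not parse date: ${date}`);
  }
  // UTC has no daylight saving time, so the next midnight is exactly 24h later.
  const end = new Date(start.getTime() + DAY_MS);
  return { start, end };
}

const january1 = utcDayRange("2026-01-01");
console.log(january1.start.toISOString(), january1.end.toISOString());
```

&lt;p&gt;Because the upper bound is the next midnight, two consecutive days never overlap and never leave a gap between them.&lt;/p&gt;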

&lt;h2&gt;
  
  
  The Real Fix (Process, Not Code)
&lt;/h2&gt;

&lt;p&gt;The code fix took 30 seconds. The real fix took a week:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Added a timezone assertion in CI&lt;/strong&gt; — our test suite now explicitly checks that &lt;code&gt;process.env.TZ === 'UTC'&lt;/code&gt;. If anyone changes the CI timezone, tests fail.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set &lt;code&gt;TZ=UTC&lt;/code&gt; in all Dockerfiles&lt;/strong&gt; — every container, every environment, same timezone. No surprises.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Added a timezone check to our deploy script&lt;/strong&gt; — &lt;code&gt;date +%Z&lt;/code&gt; must return &lt;code&gt;UTC&lt;/code&gt; before deploy proceeds.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Wrote a linter rule&lt;/strong&gt; — flags any &lt;code&gt;new Date(string)&lt;/code&gt; where the string doesn't contain timezone info.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
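&lt;p&gt;The decision behind a linter rule like the one in item 4 can be approximated with a single regular expression. The sketch below is an assumption about how such a check might work, not the rule itself: &lt;code&gt;hasExplicitTimezone&lt;/code&gt; is a hypothetical name, and it only understands ISO 8601 style strings.&lt;/p&gt;

```javascript
// Hypothetical check: does an ISO 8601 style string carry an explicit UTC offset?
// Requires a time part followed by "Z" or an offset such as +05:30 or -0800.
const EXPLICIT_TZ = /T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:?\d{2})$/;

function hasExplicitTimezone(value) {
  return typeof value === "string" ? EXPLICIT_TZ.test(value) : false;
}

for (const sample of [
  "2026-01-01",
  "2026-01-01T00:00:00",
  "2026-01-01T00:00:00Z",
  "2026-01-01T00:00:00+05:30",
]) {
  console.log(sample, hasExplicitTimezone(sample) ? "ok" : "flag it");
}
```

&lt;p&gt;A real rule would also have to handle template literals and variables, which a plain string check cannot see.&lt;/p&gt;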

&lt;h2&gt;
  
  
  The Lesson
&lt;/h2&gt;

&lt;p&gt;Timezone bugs are sneaky because they don't crash. They produce wrong data that looks right. Your users won't get an error page — they'll get silently incorrect numbers, and they'll trust them.&lt;/p&gt;

&lt;p&gt;Three rules I now follow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Never trust the system timezone.&lt;/strong&gt; Always set &lt;code&gt;TZ=UTC&lt;/code&gt; explicitly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Never parse dates without timezones.&lt;/strong&gt; &lt;code&gt;"2026-01-01"&lt;/code&gt; is ambiguous. &lt;code&gt;"2026-01-01T00:00:00Z"&lt;/code&gt; is not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Never assume your CI timezone matches production.&lt;/strong&gt; Assert it in your tests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I've been coding for years. I still got bit by this. If it can happen to me, it can happen to you.&lt;/p&gt;

</description>
      <category>codcompass</category>
      <category>ai</category>
      <category>knowledgebase</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I Let AI Write My Backend Code for a Week — Here's What Actually Broke</title>
      <dc:creator>kol kol</dc:creator>
      <pubDate>Sun, 14 Jun 2026 14:02:24 +0000</pubDate>
      <link>https://dev.to/kollittle/i-let-ai-write-my-backend-code-for-a-week-heres-what-actually-broke-1d36</link>
      <guid>https://dev.to/kollittle/i-let-ai-write-my-backend-code-for-a-week-heres-what-actually-broke-1d36</guid>
      <description>&lt;p&gt;I told myself it would be fine. I had been using AI coding assistants for suggestions and autocomplete for months — and it worked great. So when a new project came up with a tight deadline, I thought: why not let AI handle the whole backend?&lt;/p&gt;

&lt;p&gt;I set up a Cursor workspace, wrote a detailed spec, and hit generate. What followed was 5 days of "it compiles, but..." debugging that taught me more about software engineering than any tutorial ever did.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Went Surprisingly Well
&lt;/h2&gt;

&lt;p&gt;The boilerplate was genuinely impressive. In about 2 hours, I had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A fully typed Express.js API with 12 endpoints&lt;/li&gt;
&lt;li&gt;Zod validation schemas for every route&lt;/li&gt;
&lt;li&gt;A Prisma schema with proper relations&lt;/li&gt;
&lt;li&gt;Docker compose setup with Postgres and Redis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code looked clean. Tests passed. I was feeling like a 10x developer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cracks Started Showing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Bug #1: Silent Type Coercion
&lt;/h3&gt;

&lt;p&gt;The AI generated this validation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userSchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;age&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looks fine, right? Except the API received ages as strings from the frontend. Zod parsed them fine in development (coercion worked). But in production with stricter mode? &lt;code&gt;NaN&lt;/code&gt; everywhere. Users were getting 400 errors on signup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; &lt;code&gt;z.coerce.number().int().positive()&lt;/code&gt; — but I had to find all 23 instances manually.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bug #2: The N+1 Query Nobody Asked For
&lt;/h3&gt;

&lt;p&gt;For a dashboard endpoint that listed users with their orders and order items, the AI generated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;prisma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findMany&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;prisma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findMany&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Classic N+1. The Prisma docs literally have a page titled "How to avoid N+1 queries." With 500 users, this endpoint made 501 database queries and took 8 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; &lt;code&gt;include&lt;/code&gt; with nested relations — one query, 120ms.&lt;/p&gt;
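&lt;p&gt;To make the difference concrete, here is the general batching technique shown against an in-memory stand-in instead of Prisma. Everything in it (&lt;code&gt;fakeDb&lt;/code&gt;, &lt;code&gt;loadNPlusOne&lt;/code&gt;, &lt;code&gt;loadBatched&lt;/code&gt;) is hypothetical and exists only to count round trips.&lt;/p&gt;

```javascript
// In-memory stand-in for a database that counts round trips (all names hypothetical).
const fakeDb = {
  queries: 0,
  users: [{ id: 1 }, { id: 2 }, { id: 3 }],
  orders: [
    { id: 10, userId: 1 },
    { id: 11, userId: 1 },
    { id: 12, userId: 3 },
  ],
  findUsers() {
    this.queries += 1;
    return this.users.map((u) => ({ ...u }));
  },
  findOrdersByUserIds(userIds) {
    this.queries += 1;
    return this.orders.filter((o) => userIds.includes(o.userId));
  },
};

// N+1 shape: one query for the users, then one more per user.
function loadNPlusOne(db) {
  return db.findUsers().map((user) => ({ ...user, orders: db.findOrdersByUserIds([user.id]) }));
}

// Batched shape: two queries in total, with the grouping done in memory.
function loadBatched(db) {
  const users = db.findUsers();
  const byUser = new Map(users.map((u) => [u.id, []]));
  for (const order of db.findOrdersByUserIds(users.map((u) => u.id))) {
    byUser.get(order.userId).push(order);
  }
  return users.map((user) => ({ ...user, orders: byUser.get(user.id) }));
}

fakeDb.queries = 0;
loadNPlusOne(fakeDb);
console.log("N+1 queries:", fakeDb.queries); // 4 (1 + 3 users)

fakeDb.queries = 0;
loadBatched(fakeDb);
console.log("batched queries:", fakeDb.queries); // 2
```

&lt;p&gt;The query count of the batched version stays constant as the user count grows, which is the property that matters at 500 users.&lt;/p&gt;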

&lt;h3&gt;
  
  
  Bug #3: Race Conditions in Token Refresh
&lt;/h3&gt;

&lt;p&gt;The AI wrote a token refresh flow that looked perfect in isolation. But under load, concurrent refresh requests would invalidate each other's tokens. The AI's solution? "Add a retry mechanism." My solution? "Use a refresh token rotation pattern that handles concurrency properly."&lt;/p&gt;
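&lt;p&gt;One building block for handling concurrent refreshes is to coalesce them, so callers that arrive while a refresh is in progress share its result instead of starting their own. The sketch below illustrates that idea only; it is not the rotation scheme described above, and &lt;code&gt;createSingleFlight&lt;/code&gt; and &lt;code&gt;fakeRefresh&lt;/code&gt; are hypothetical names. It also covers a single process only: several server instances would still need coordination on the server side.&lt;/p&gt;

```javascript
// Hypothetical helper: share one in-flight call among concurrent callers.
function createSingleFlight(task) {
  let inFlight = null;
  return function run() {
    if (inFlight === null) {
      // The async wrapper turns a synchronous throw into a rejection.
      inFlight = (async () => task())().finally(() => {
        inFlight = null; // the next call after settling starts a fresh run
      });
    }
    return inFlight;
  };
}

// Stand-in for a real refresh request (assumption: it returns a new token).
let refreshCalls = 0;
async function fakeRefresh() {
  refreshCalls += 1;
  return { accessToken: `access-${refreshCalls}` };
}

const refreshTokens = createSingleFlight(fakeRefresh);

// Three concurrent callers trigger a single refresh and share its result.
Promise.all([refreshTokens(), refreshTokens(), refreshTokens()]).then((results) => {
  console.log(refreshCalls, results.map((r) => r.accessToken));
});
```

&lt;p&gt;Unlike a retry, this removes the conflict at its source: there is never a second refresh to invalidate the first.&lt;/p&gt;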

&lt;h3&gt;
  
  
  Bug #4: The Error Handler That Swallowed Everything
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Error:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Something went wrong&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



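&lt;p&gt;&lt;code&gt;console.log&lt;/code&gt; is not a structured logger, and Error objects don't survive JSON serialization because &lt;code&gt;message&lt;/code&gt; and &lt;code&gt;stack&lt;/code&gt; are non-enumerable. Every production error was just &lt;code&gt;{}&lt;/code&gt; in the logs. We ran like this for 3 days before anyone noticed.&lt;/p&gt;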

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; &lt;code&gt;console.error&lt;/code&gt; with proper error serialization and a proper logging library (we went with Pino).&lt;/p&gt;
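&lt;p&gt;The empty-object symptom is easy to reproduce: &lt;code&gt;JSON.stringify&lt;/code&gt; skips an Error's &lt;code&gt;message&lt;/code&gt; and &lt;code&gt;stack&lt;/code&gt; because they are non-enumerable. The sketch below shows the symptom and one way to flatten an error before logging. &lt;code&gt;serializeError&lt;/code&gt; is a hypothetical helper written for illustration; Pino ships its own error serializer, which is the better choice in a real service.&lt;/p&gt;

```javascript
// An Error's message and stack are non-enumerable, so JSON.stringify drops them.
const failure = new Error("database unreachable");
console.log(JSON.stringify(failure)); // prints {}

// Hypothetical helper: copy the useful fields into a plain object first.
function serializeError(err) {
  if (!(err instanceof Error)) {
    return { message: String(err) };
  }
  const plain = { name: err.name, message: err.message, stack: err.stack };
  if (err.cause !== undefined) {
    plain.cause = serializeError(err.cause); // follow the cause chain
  }
  return plain;
}

console.log(JSON.stringify({ level: "error", err: serializeError(failure) }));
```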

&lt;h2&gt;
  
  
  The Real Problem
&lt;/h2&gt;

&lt;p&gt;Here's what I learned: &lt;strong&gt;AI generates code that's correct in isolation but fragile in context.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It doesn't know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your deployment architecture (so it misses N+1 queries)&lt;/li&gt;
&lt;li&gt;Your traffic patterns (so it ignores race conditions)&lt;/li&gt;
&lt;li&gt;Your logging infrastructure (so it uses the wrong logger)&lt;/li&gt;
&lt;li&gt;Your team's conventions (so it mixes patterns)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The generated code passes tests because tests are narrow. It compiles because the syntax is valid. But production is where context matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Changed
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI writes the first draft, humans write the final version.&lt;/strong&gt; I'm not going back to writing everything from scratch, but every PR now requires a manual review of control flow, error handling, and data access patterns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Architecture decisions stay human.&lt;/strong&gt; Schema design, caching strategy, and error handling patterns are too context-dependent to outsource.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add integration tests that AI can't fake.&lt;/strong&gt; Unit tests pass. Integration tests reveal the gaps. We added a test suite that runs the full API against a real Postgres instance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Observability from day one.&lt;/strong&gt; Structured logging, request tracing, and error tracking are now part of the project template, not an afterthought.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;AI didn't break my project. My assumption that "generated code equals production-ready code" did.&lt;/p&gt;

&lt;p&gt;AI is an incredible force multiplier when used as a pair programmer. It's a liability when treated as a replacement for engineering judgment.&lt;/p&gt;

&lt;p&gt;The week cost me 3 extra days of debugging, but I shipped a more robust system than I would have built alone — because the AI's mistakes taught me where my own blind spots were.&lt;/p&gt;

&lt;p&gt;Use AI. But keep your hands on the wheel.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you had similar experiences with AI-generated code? I'd love to hear your war stories in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>codcompass</category>
      <category>ai</category>
      <category>knowledgebase</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Our Test Suite Passed 100% — Then Users Found 14 Bugs in One Day</title>
      <dc:creator>kol kol</dc:creator>
      <pubDate>Tue, 09 Jun 2026 18:03:24 +0000</pubDate>
      <link>https://dev.to/kollittle/our-test-suite-passed-100-then-users-found-14-bugs-in-one-day-5o0</link>
      <guid>https://dev.to/kollittle/our-test-suite-passed-100-then-users-found-14-bugs-in-one-day-5o0</guid>
      <description>&lt;p&gt;We had 847 tests. Green checkmarks across the board. 100% coverage on our critical paths. I was proud of that dashboard.&lt;/p&gt;

&lt;p&gt;Then a user reported that our checkout was double-charging on Safari. Another said the password reset emails weren't arriving. Within 24 hours we had 14 confirmed bugs — and our CI pipeline was still proudly green.&lt;/p&gt;

&lt;p&gt;That's when I realized: &lt;strong&gt;100% code coverage is a vanity metric that makes you feel safe while your users burn.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Illusion of Coverage
&lt;/h2&gt;

&lt;p&gt;Here's what our test suite was great at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Testing individual functions in isolation&lt;/li&gt;
&lt;li&gt;Verifying happy paths with clean inputs&lt;/li&gt;
&lt;li&gt;Catching regressions in pure utility functions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what it completely missed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Browser-specific behavior&lt;/strong&gt; — Safari's date parsing is different from Chrome's. Our test runner used Node.js. No browser, no Safari.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Race conditions&lt;/strong&gt; — Two API calls firing simultaneously? Our mocked fetch resolved instantly. In production, timing matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration gaps&lt;/strong&gt; — Each module had tests. The &lt;em&gt;connections&lt;/em&gt; between modules did not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-world data&lt;/strong&gt; — Our fixtures were clean. User data is never clean.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Bug That Started It All
&lt;/h2&gt;

&lt;p&gt;A user in Japan reported being charged twice for a single purchase. We couldn't reproduce it locally. Our payment integration tests passed every time.&lt;/p&gt;

&lt;p&gt;The root cause: a double-submit button on slow networks. Our mock API responded in 12ms. Real networks: 800ms. That gap was enough for impatient fingers to click twice.&lt;/p&gt;

&lt;p&gt;The fix was 3 lines of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;isSubmitting&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;setIsSubmitting&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// Submit handler: return early if isSubmitting, otherwise setIsSubmitting(true) before the request&lt;/span&gt;
&lt;span class="c1"&gt;// Button: disabled={isSubmitting}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three lines. But the test suite — our beautiful 847-test suite — had zero tests for this scenario because &lt;strong&gt;nobody wrote a test for "user clicks button twice."&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The 14-Bug Autopsy
&lt;/h2&gt;

&lt;p&gt;After that incident, we categorized all 14 bugs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bug Category&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Tests Should've Caught It&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Browser compatibility&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;❌ No cross-browser tests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Race conditions&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;❌ Mocks too fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Edge-case user input&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;❌ Fixtures too clean&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Third-party API changes&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;❌ No contract testing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time zone bugs&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;❌ All tests ran in UTC&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;14 bugs. Zero caught by CI. The problem wasn't that we didn't have enough tests — &lt;strong&gt;we had the wrong kind of tests.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Changed
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Added Integration Tests at Module Boundaries
&lt;/h3&gt;

&lt;p&gt;Unit tests check the bricks. Integration tests check the mortar. We added tests specifically for the &lt;em&gt;connections&lt;/em&gt; between services — where most real bugs hide.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Started Running Tests in Real Browsers
&lt;/h3&gt;

&lt;p&gt;We added Playwright for critical user flows: checkout, auth, search. These run against real Chrome and Firefox instances. Safari is next.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Mock Network Latency
&lt;/h3&gt;

&lt;p&gt;Instead of instant mock responses, we randomized delays between 100ms and 2000ms. This surfaced race conditions we never knew existed.&lt;/p&gt;
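&lt;p&gt;A sketch of what randomized mock latency can look like. The names &lt;code&gt;pickDelayMs&lt;/code&gt; and &lt;code&gt;withRandomLatency&lt;/code&gt; are hypothetical, and the defaults use the 100ms to 2000ms range mentioned above. The random source is injectable so a failing run can be replayed with a fixed sequence.&lt;/p&gt;

```javascript
// Hypothetical helpers for adding random latency to a mocked async call.
function pickDelayMs(minMs, maxMs, random = Math.random) {
  // Uniform integer in [minMs, maxMs]; assumes random() returns a value in [0, 1).
  return minMs + Math.floor(random() * (maxMs - minMs + 1));
}

function withRandomLatency(handler, { minMs = 100, maxMs = 2000, random = Math.random } = {}) {
  return async function delayed(...args) {
    const delayMs = pickDelayMs(minMs, maxMs, random);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    return handler(...args);
  };
}

// Stand-in for a mocked endpoint; a short range here so the demo finishes quickly.
const mockCharge = withRandomLatency(async (orderId) => ({ orderId, status: "paid" }), {
  minMs: 5,
  maxMs: 20,
});

mockCharge("order-1").then((result) => console.log(result));
```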

&lt;h3&gt;
  
  
  4. Contract Testing for APIs
&lt;/h3&gt;

&lt;p&gt;We used Pact to verify that our frontend's expectations of backend APIs actually match reality. Two bugs disappeared the day we added this.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Time Zone Roulette
&lt;/h3&gt;

&lt;p&gt;We randomize the test runner's timezone. Half our date bugs appeared within the first week.&lt;/p&gt;
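&lt;p&gt;One way to set this up in a Node.js test runner is to choose the zone before any test file loads. This is a sketch under assumptions: the zone list, the &lt;code&gt;pickTimezone&lt;/code&gt; name, and the &lt;code&gt;TEST_TZ&lt;/code&gt; override variable are all hypothetical, and the snippet relies on Node.js picking up changes to &lt;code&gt;process.env.TZ&lt;/code&gt; at runtime. Logging the chosen zone matters, because a random failure is useless if you cannot replay it.&lt;/p&gt;

```javascript
// Hypothetical setup file: pick a timezone for this test run and log it.
const CANDIDATE_ZONES = [
  "UTC",
  "America/New_York",
  "Asia/Tokyo",
  "Australia/Lord_Howe", // 30-minute DST shift
  "Asia/Kathmandu", // UTC+05:45
];

function pickTimezone(zones, random = Math.random) {
  return zones[Math.floor(random() * zones.length)];
}

// TEST_TZ (an assumed variable name) lets you replay a failing run in one zone.
const zone = process.env.TEST_TZ || pickTimezone(CANDIDATE_ZONES);
process.env.TZ = zone;
console.log(`Running tests with TZ=${zone}`);
```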

&lt;h2&gt;
  
  
  The New Philosophy
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Coverage tells you what code runs. It doesn't tell you what breaks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now we track different metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bug escape rate&lt;/strong&gt; — bugs found by users vs. caught in CI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mean time to detection&lt;/strong&gt; — how fast our tests find regressions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration test coverage&lt;/strong&gt; — not line coverage, but &lt;em&gt;scenario&lt;/em&gt; coverage&lt;/li&gt;
&lt;/ul&gt;
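&lt;p&gt;Of the three, bug escape rate is the simplest to compute. The exact definition below is an assumption (escaped bugs divided by all bugs found), and &lt;code&gt;bugEscapeRate&lt;/code&gt; is a hypothetical name; the example input is the incident from this post.&lt;/p&gt;

```javascript
// Escape rate = escaped / (escaped + caught). This definition is an assumption.
function bugEscapeRate(foundByUsers, caughtInCi) {
  const total = foundByUsers + caughtInCi;
  return total === 0 ? 0 : foundByUsers / total;
}

// The incident described above: 14 bugs reached users, none were caught by CI.
console.log(bugEscapeRate(14, 0)); // 1, meaning every bug escaped
```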

&lt;p&gt;Our total test count went down (we deleted 200+ redundant unit tests). Our bug escape rate went down 80%.&lt;/p&gt;

&lt;p&gt;The dashboard looks less impressive. The product works better.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you been burned by "green tests, broken production"? What testing gaps surprised you most? I'd love to hear your war stories in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>codcompass</category>
      <category>ai</category>
      <category>knowledgebase</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I Added 20 Indexes to "Fix" Slow Queries — My Database Got 3x Slower</title>
      <dc:creator>kol kol</dc:creator>
      <pubDate>Mon, 08 Jun 2026 14:02:21 +0000</pubDate>
      <link>https://dev.to/kollittle/i-added-20-indexes-to-fix-slow-queries-my-database-got-3x-slower-3ac2</link>
      <guid>https://dev.to/kollittle/i-added-20-indexes-to-fix-slow-queries-my-database-got-3x-slower-3ac2</guid>
      <description>&lt;h1&gt;
  
  
  I Added 20 Indexes to "Fix" Slow Queries — My Database Got 3x Slower
&lt;/h1&gt;

&lt;p&gt;Six months ago, I inherited a PostgreSQL database that was choking on production traffic. API response times hit 8 seconds. Users were timing out. The ops team was getting paged at 2 AM.&lt;/p&gt;

&lt;p&gt;So I did what any "experienced" developer would do: I added indexes. Lots of them.&lt;/p&gt;

&lt;p&gt;Twenty indexes across twelve tables. Problem solved, right?&lt;/p&gt;

&lt;p&gt;Wrong. The database got &lt;strong&gt;slower&lt;/strong&gt;. Write operations crawled. Disk usage spiked. And the queries I was trying to optimize? They were still slow.&lt;/p&gt;

&lt;p&gt;Here's what I learned the hard way about index tuning — and the process I use now that actually works.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Mistake Everyone Makes
&lt;/h2&gt;

&lt;p&gt;The biggest misconception about indexes is this: &lt;em&gt;more indexes = faster queries&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;PostgreSQL has to maintain every index on every write. Add an index, and every INSERT, UPDATE, and DELETE gets heavier. With 20 extra indexes, our write-heavy analytics table was spending more time updating indexes than storing data.&lt;/p&gt;

&lt;p&gt;But the real killer was something I didn't expect: &lt;strong&gt;index bloat&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Went Wrong
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. I Indexed Low-Cardinality Columns
&lt;/h3&gt;

&lt;p&gt;I put an index on a &lt;code&gt;status&lt;/code&gt; column with only 4 possible values: &lt;code&gt;pending&lt;/code&gt;, &lt;code&gt;active&lt;/code&gt;, &lt;code&gt;suspended&lt;/code&gt;, &lt;code&gt;deleted&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;PostgreSQL's query planner looked at that index, saw that each value matched ~25% of rows, and decided a full table scan was cheaper. The index was dead weight — costing disk space and write performance, providing zero read benefit.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. I Created Redundant Indexes
&lt;/h3&gt;

&lt;p&gt;I had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CREATE INDEX idx_user_email ON users(email)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CREATE INDEX idx_user_email_name ON users(email, name)&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The second index already covers queries on &lt;code&gt;email&lt;/code&gt; alone. The first one was pure redundancy. PostgreSQL was maintaining two indexes for essentially the same lookup.&lt;/p&gt;
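&lt;p&gt;The prefix rule can be checked mechanically. The sketch below is a simplification with hypothetical names: it treats each index as an ordered column list on a single table and ignores uniqueness, partial predicates, expressions, and operator classes, any of which can make a seemingly redundant index worth keeping.&lt;/p&gt;

```javascript
// Hypothetical helper: report indexes whose columns are a leading prefix of another index.
// Assumes every entry belongs to the same table.
function findPrefixRedundant(indexes) {
  const isPrefix = (shorter, longer) =>
    longer.length > shorter.length ? shorter.every((col, i) => longer[i] === col) : false;

  const redundant = [];
  for (const candidate of indexes) {
    const coveredBy = indexes.find((other) => isPrefix(candidate.columns, other.columns));
    if (coveredBy !== undefined) {
      redundant.push({ index: candidate.name, coveredBy: coveredBy.name });
    }
  }
  return redundant;
}

console.log(
  findPrefixRedundant([
    { name: "idx_user_email", columns: ["email"] },
    { name: "idx_user_email_name", columns: ["email", "name"] },
  ])
);
```

&lt;p&gt;Column order matters: an index on &lt;code&gt;(name, email)&lt;/code&gt; would not make &lt;code&gt;idx_user_email&lt;/code&gt; redundant, and the helper does not report it.&lt;/p&gt;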

&lt;h3&gt;
  
  
  3. I Ignored Partial Indexes
&lt;/h3&gt;

&lt;p&gt;Our &lt;code&gt;orders&lt;/code&gt; table had millions of rows, but 90% were &lt;code&gt;completed&lt;/code&gt; or &lt;code&gt;cancelled&lt;/code&gt;. The slow queries were all looking for &lt;code&gt;status = 'pending'&lt;/code&gt;. What I needed was a partial index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_orders_pending&lt;/span&gt; 
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tiny index (10% of the table) outperformed my full-table indexes by 5x.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix: A Methodical Index Audit
&lt;/h2&gt;

&lt;p&gt;Here's the process I followed to undo the damage and actually optimize:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Find Unused Indexes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
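&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- pg_stat_user_indexes exposes relname/indexrelname; indisunique lives in pg_index&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schemaname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;tablename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indexrelname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;indexname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idx_scan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_size&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_indexes&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;pg_index&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idx_scan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indisunique&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;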
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This revealed 14 indexes that had &lt;strong&gt;never been used&lt;/strong&gt; since the last stats reset. I dropped them immediately.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Find the Real Slow Queries
&lt;/h3&gt;

&lt;p&gt;Instead of guessing, I used &lt;code&gt;pg_stat_statements&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;mean_exec_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;total_exec_time&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_exec_time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This showed me which queries were actually burning CPU time. Not the ones I assumed were slow — the ones that &lt;em&gt;actually were&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Use EXPLAIN ANALYZE
&lt;/h3&gt;

&lt;p&gt;For every slow query, I ran &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; to see the actual execution plan. Not &lt;code&gt;EXPLAIN&lt;/code&gt; — &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt;. The difference is that &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; actually runs the query and shows real timing data.&lt;/p&gt;

&lt;p&gt;What I found: PostgreSQL was doing sequential scans on tables where I had indexes, because my query conditions didn't match the index column order.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Build Right-Sized Indexes
&lt;/h3&gt;

&lt;p&gt;The three indexes that actually made a difference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Composite index matching the actual WHERE + ORDER BY pattern&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_analytics_date_type&lt;/span&gt; 
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Covering index that includes all needed columns (no table lookup)&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_users_lookup&lt;/span&gt; 
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;INCLUDE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Expression index for a common pattern&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_orders_lower_email&lt;/span&gt; 
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_email&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before Audit&lt;/th&gt;
&lt;th&gt;After Audit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total indexes&lt;/td&gt;
&lt;td&gt;47&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg query time&lt;/td&gt;
&lt;td&gt;3.2s&lt;/td&gt;
&lt;td&gt;0.4s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write latency&lt;/td&gt;
&lt;td&gt;180ms&lt;/td&gt;
&lt;td&gt;25ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Index disk usage&lt;/td&gt;
&lt;td&gt;12.4 GB&lt;/td&gt;
&lt;td&gt;2.1 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Index cache hit rate&lt;/td&gt;
&lt;td&gt;67%&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Rule I Follow Now
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Never add an index without running EXPLAIN ANALYZE first.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every index should have a specific query it's designed to accelerate. If you can't point to the query and show the before/after execution plan, don't create the index.&lt;/p&gt;

&lt;p&gt;Indexes are not a "just in case" thing. They're a surgical tool. Use them like one.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you ever made your database slower by trying to optimize it? What was your wake-up call?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>codcompass</category>
      <category>ai</category>
      <category>knowledgebase</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I Thought My API Was Rate-Limited — Until Someone Scraped 2 Million Requests in 4 Hours</title>
      <dc:creator>kol kol</dc:creator>
      <pubDate>Sun, 07 Jun 2026 14:04:53 +0000</pubDate>
      <link>https://dev.to/kollittle/i-thought-my-api-was-rate-limited-until-someone-scraped-2-million-requests-in-4-hours-509d</link>
      <guid>https://dev.to/kollittle/i-thought-my-api-was-rate-limited-until-someone-scraped-2-million-requests-in-4-hours-509d</guid>
      <description>&lt;p&gt;I had &lt;code&gt;express-rate-limit&lt;/code&gt; installed. I had it configured. I had tests that proved it worked.&lt;/p&gt;

&lt;p&gt;And yet, someone still scraped 2 million API requests from my production server in under 4 hours. Costing me $4,200 in upstream API calls.&lt;/p&gt;

&lt;p&gt;Here's exactly what went wrong, how I found out, and the architecture I use now.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup That Lied to Me
&lt;/h2&gt;

&lt;p&gt;My API was a simple Express app. I added rate limiting like any reasonable developer would:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;rateLimit&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;express-rate-limit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;limiter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rateLimit&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;windowMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// 15 minutes&lt;/span&gt;
  &lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                  &lt;span class="c1"&gt;// 100 requests per window&lt;/span&gt;
  &lt;span class="na"&gt;standardHeaders&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;legacyHeaders&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/api/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;limiter&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tests passed. I saw &lt;code&gt;X-RateLimit-Limit: 100&lt;/code&gt; in curl responses. I slept well.&lt;/p&gt;

&lt;p&gt;The problem? I was running &lt;strong&gt;4 instances&lt;/strong&gt; behind a load balancer. Each instance had its own in-memory counter, so the real limit per IP was &lt;strong&gt;400 requests per 15 minutes&lt;/strong&gt;, not 100.&lt;/p&gt;
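
&lt;p&gt;The multiplication is easy to reproduce. Here is a minimal simulation, assuming a round-robin load balancer and a simplified in-memory counter. The &lt;code&gt;createInMemoryLimiter&lt;/code&gt; and &lt;code&gt;simulate&lt;/code&gt; helpers are illustrative and not part of &lt;code&gt;express-rate-limit&lt;/code&gt;:&lt;/p&gt;

```javascript
// Simplified in-memory limiter, one per app instance (illustrative only).
function createInMemoryLimiter(max) {
  const counts = new Map(); // ip -> requests seen in the current window
  return function allow(ip) {
    const seen = counts.get(ip) || 0;
    if (seen >= max) return false;
    counts.set(ip, seen + 1);
    return true;
  };
}

// Send `total` requests from a single IP, round-robin across the instances,
// and count how many get through within one window.
function simulate(instanceCount, maxPerInstance, total) {
  const instances = Array.from({ length: instanceCount }, () =>
    createInMemoryLimiter(maxPerInstance)
  );
  let allowed = 0;
  for (let i = 0; i !== total; i++) {
    if (instances[i % instanceCount]('203.0.113.7')) allowed++;
  }
  return allowed;
}

console.log(simulate(1, 100, 1000)); // 100
console.log(simulate(4, 100, 1000)); // 400
```

&lt;p&gt;Each instance only sees a quarter of the traffic, so each one happily allows its own 100.&lt;/p&gt;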

&lt;p&gt;And the attacker wasn't sending 100 requests from one IP. They were rotating through a proxy pool of 2,000+ IPs.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Happened
&lt;/h2&gt;

&lt;p&gt;At 2:47 AM, our monitoring dashboard showed something odd: API request volume spiked 800%. I dismissed it as a newsletter push going out.&lt;/p&gt;

&lt;p&gt;By 4:00 AM, the database connection pool was saturated. Queries that normally took 12ms were timing out at 30 seconds.&lt;/p&gt;

&lt;p&gt;By 6:30 AM, I checked our upstream LLM provider bill. We'd made &lt;strong&gt;2.1 million API calls&lt;/strong&gt; since midnight. At $0.002 per call, that's roughly &lt;strong&gt;$4,200&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The attacker was:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Hitting our search endpoint with systematic keyword variations&lt;/li&gt;
&lt;li&gt;Rotating IPs from a residential proxy network&lt;/li&gt;
&lt;li&gt;Staying under the per-IP rate limit by spreading requests across IPs&lt;/li&gt;
&lt;li&gt;Extracting structured data from our responses&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Why My Defenses Failed
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Defense&lt;/th&gt;
&lt;th&gt;Why It Failed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;express-rate-limit&lt;/code&gt; (in-memory)&lt;/td&gt;
&lt;td&gt;Not shared across instances&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IP-based limiting&lt;/td&gt;
&lt;td&gt;Proxy rotation defeated it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shallow request logging&lt;/td&gt;
&lt;td&gt;Couldn't trace the attack pattern&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No anomaly alerts&lt;/td&gt;
&lt;td&gt;800% spike looked like "normal traffic"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The fundamental mistake: I treated rate limiting as a &lt;strong&gt;configuration problem&lt;/strong&gt; instead of an &lt;strong&gt;architecture problem&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix: Distributed Rate Limiting
&lt;/h2&gt;

&lt;p&gt;I rebuilt the system with three layers:&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Redis Sliding Window (The Real Rate Limiter)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Redis&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ioredis&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;REDIS_URL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;checkRateLimit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;max&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;windowSec&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;windowStart&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;windowSec&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Use Redis sorted set for true sliding window&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zremrangebyscore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;windowStart&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zcard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;remaining&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zadd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;now&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;windowSec&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;remaining&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;max&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you a &lt;strong&gt;true&lt;/strong&gt; 100-request limit across all instances, not 100 per instance.&lt;/p&gt;
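
&lt;p&gt;One caveat: the function above reads the count and adds the new entry in separate round trips, so two instances can both see 99 and both let a request through. A minimal sketch of one way to close that gap, assuming an ioredis-style client whose &lt;code&gt;multi()&lt;/code&gt; chain resolves &lt;code&gt;exec()&lt;/code&gt; to &lt;code&gt;[error, result]&lt;/code&gt; pairs. The &lt;code&gt;checkRateLimitAtomic&lt;/code&gt; name and the &lt;code&gt;requestId&lt;/code&gt; parameter are hypothetical:&lt;/p&gt;

```javascript
// Sketch: sliding-window check as a single MULTI/EXEC transaction.
// `redis` is assumed to be an ioredis-style client; `requestId` is a
// hypothetical unique id per request (for example a UUID).
async function checkRateLimitAtomic(redis, key, max, windowSec, requestId, now = Date.now()) {
  const windowStart = now - windowSec * 1000;

  const results = await redis
    .multi()
    .zremrangebyscore(key, 0, windowStart) // drop entries older than the window
    .zadd(key, now, requestId)             // record this request first
    .zcard(key)                            // then count, including this request
    .expire(key, windowSec)                // let idle keys expire on their own
    .exec();

  // Surface the first command error, if any.
  const failed = results.find(([err]) => err);
  if (failed) throw failed[0];

  const count = results[2][1]; // result of ZCARD
  if (count > max) {
    // Over the limit: remove our own entry so rejected calls don't fill the window.
    await redis.zrem(key, requestId);
    return { allowed: false, remaining: 0 };
  }
  return { allowed: true, remaining: max - count };
}
```

&lt;p&gt;Because the add and the count run inside one transaction, concurrent requests can't both slip in under the limit.&lt;/p&gt;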

&lt;h3&gt;
  
  
  Layer 2: Behavioral Fingerprinting
&lt;/h3&gt;

&lt;p&gt;IP addresses are useless against proxy pools. Instead, I track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Request pattern entropy&lt;/strong&gt; — Are endpoints being hit in alphabetical order? That's a scraper.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timing regularity&lt;/strong&gt; — Requests every exactly 1.0 seconds? Bot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Header consistency&lt;/strong&gt; — Same User-Agent, same Accept-Encoding, same everything? Bot.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;calculateRequestEntropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;endpoints&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;uniqueEndpoints&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;endpoints&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="c1"&gt;// Share of distinct paths: a low value means the same few endpoints are hit over and over&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;uniqueEndpoints&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;endpoints&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Entropy &amp;lt; 0.3 → likely scraping&lt;/span&gt;
&lt;span class="c1"&gt;// Entropy &amp;gt; 0.7 → likely human&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
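
&lt;p&gt;The ratio above is a cheap diversity score, not entropy in the strict sense, and it says nothing about timing. A rough sketch of two complementary signals, assuming each logged request looks like &lt;code&gt;{ path, timestamp }&lt;/code&gt; with timestamps in milliseconds. Both function names are hypothetical, and any thresholds would need tuning against real traffic:&lt;/p&gt;

```javascript
// Shannon entropy (in bits) of how requests are distributed over paths.
// 0 means every request hit the same path.
function pathEntropy(requests) {
  if (requests.length === 0) return 0;
  const counts = new Map();
  for (const r of requests) counts.set(r.path, (counts.get(r.path) || 0) + 1);
  let bits = 0;
  for (const n of counts.values()) {
    const p = n / requests.length;
    bits -= p * Math.log2(p);
  }
  return bits;
}

// Coefficient of variation of the gaps between consecutive requests.
// Values near 0 mean metronome-like timing; returns null with fewer than 3 requests.
function timingVariation(requests) {
  if (!(requests.length >= 3)) return null;
  const times = requests.map((r) => r.timestamp).sort((a, b) => a - b);
  const gaps = times.slice(1).map((t, i) => t - times[i]);
  const mean = gaps.reduce((sum, g) => sum + g, 0) / gaps.length;
  if (mean === 0) return 0;
  const variance = gaps.reduce((sum, g) => sum + (g - mean) ** 2, 0) / gaps.length;
  return Math.sqrt(variance) / mean;
}
```

&lt;p&gt;A client that requests every second on the dot scores a timing variation of 0, whatever IP each request arrives from.&lt;/p&gt;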



&lt;h3&gt;
  
  
  Layer 3: Cost-Based Circuit Breakers
&lt;/h3&gt;

&lt;p&gt;This is the one that actually saves money:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Track estimated cost per endpoint&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;endpointCosts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/api/search&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.002&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// LLM call&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/api/analyze&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.015&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;// Expensive LLM call&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/api/health&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// Cheap&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;hourlyCost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;COST_THRESHOLD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Alert at $50/hr&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;trackCost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;hourlyCost&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;endpointCosts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hourlyCost&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;COST_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Auto-throttle expensive endpoints&lt;/span&gt;
    &lt;span class="nx"&gt;expensiveEndpoints&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;enabled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;slack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`API cost spike: $&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;hourlyCost&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toFixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;/hr`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When costs spike, expensive endpoints automatically throttle. You don't need to be awake at 3 AM to stop a bleeding wallet.&lt;/p&gt;
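
&lt;p&gt;The snippet above leaves out the hourly reset and the throttle itself. A self-contained sketch of how those pieces might fit together, assuming a fixed one-hour window. &lt;code&gt;createCostBreaker&lt;/code&gt; and &lt;code&gt;onTrip&lt;/code&gt; are hypothetical names, and in a multi-instance deployment the counter would need to live in shared storage such as Redis, for the same reason as Layer 1:&lt;/p&gt;

```javascript
// Sketch of a cost-based circuit breaker with a fixed one-hour window.
// `costs` maps endpoint -> estimated dollars per call; `onTrip` is a
// hypothetical alert callback (for example, posting to a chat webhook).
function createCostBreaker({ costs, thresholdPerHour, onTrip = () => {} }) {
  const HOUR_MS = 60 * 60 * 1000;
  let windowStart = null;
  let spent = 0;
  let tripped = false;

  return {
    // Call before doing the expensive work; false means "refuse this call".
    allow(endpoint, now = Date.now()) {
      // Start a fresh budget when the hour is over.
      if (windowStart === null || now - windowStart >= HOUR_MS) {
        windowStart = now;
        spent = 0;
        tripped = false;
      }
      const cost = costs[endpoint] || 0;
      if (cost === 0) return true; // free endpoints are never throttled
      if (spent + cost > thresholdPerHour) {
        if (!tripped) {
          tripped = true;
          onTrip(spent); // alert once per trip, not once per rejected call
        }
        return false;
      }
      spent += cost;
      return true;
    },
    spentThisHour: () => spent,
  };
}
```

&lt;p&gt;Checking the budget before the upstream call, rather than after, is what keeps the bill from growing once the threshold is hit.&lt;/p&gt;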

&lt;h2&gt;
  
  
  The Results After 30 Days
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Successful scrapes&lt;/td&gt;
&lt;td&gt;2 incidents&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peak API cost/hr&lt;/td&gt;
&lt;td&gt;$4,200 (incident total)&lt;/td&gt;
&lt;td&gt;$12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;False positive blocks&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2 (tuned rules)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Legitimate user impact&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;None detected&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Real Lesson
&lt;/h2&gt;

&lt;p&gt;Rate limiting isn't about setting a number. It's about understanding:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Your threat model&lt;/strong&gt; — Who would want to scrape your API and why?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your architecture&lt;/strong&gt; — In-memory doesn't work in a distributed system. Period.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your cost exposure&lt;/strong&gt; — Know the dollar cost per endpoint, and set automatic circuit breakers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The $4,200 mistake taught me that &lt;strong&gt;security theater&lt;/strong&gt; — rate limiting that looks right but isn't — is worse than no rate limiting at all. It gives you confidence to deploy things that aren't actually protected.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you ever been bitten by a "working" defense that wasn't? What's your rate limiting setup? Drop it in the comments — I'm always looking for ways to improve mine.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>codcompass</category>
      <category>ai</category>
      <category>knowledgebase</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I Left a Dead Feature Flag in Production for 6 Months — Here's What It Cost Me</title>
      <dc:creator>kol kol</dc:creator>
      <pubDate>Sat, 06 Jun 2026 18:15:55 +0000</pubDate>
      <link>https://dev.to/kollittle/i-left-a-dead-feature-flag-in-production-for-6-months-heres-what-it-cost-me-3dj5</link>
      <guid>https://dev.to/kollittle/i-left-a-dead-feature-flag-in-production-for-6-months-heres-what-it-cost-me-3dj5</guid>
      <description>&lt;p&gt;Six months. That's how long a disabled feature flag sat in my production codebase before a new team member asked: "What does this &lt;code&gt;ENABLE_LEGACY_CHECKOUT&lt;/code&gt; thing do?"&lt;/p&gt;

&lt;p&gt;Nobody remembered. The feature had been replaced, the flag was permanently set to &lt;code&gt;false&lt;/code&gt;, but the dead code path was still there — loading, parsing, and checking on every single request.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Discovery
&lt;/h2&gt;

&lt;p&gt;It started as a routine code review. A junior dev flagged a function that looked suspicious:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;processPayment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;order&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ENABLE_LEGACY_CHECKOUT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;legacyPaymentProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;order&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;newPaymentProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;order&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The flag had been &lt;code&gt;false&lt;/code&gt; for 187 days. Every payment went through &lt;code&gt;newPaymentProcessor&lt;/code&gt;. But every single call still evaluated the condition, loaded the legacy module (yes, it was lazy-loaded), and ran the config lookup.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Costs
&lt;/h2&gt;

&lt;p&gt;Here's what that one dead flag was doing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Performance Tax&lt;/strong&gt;: 0.3ms per request doesn't sound like much. Multiply by 2.4 million requests per month and you're burning ~12 minutes of CPU time (0.3 ms × 2.4 million = 720 seconds). Not catastrophic, but it adds up across dozens of flags.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Cognitive Load&lt;/strong&gt;: New developers spent an average of 20 minutes trying to understand what each flag controlled. Of the 47 flags in our registry, 12 were dead. That's about 25% of our flag registry being ghost code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Testing Overhead&lt;/strong&gt;: Our test matrix had to account for flag combinations. Dead flags meant dead test branches that nobody was maintaining. We found 3 test suites that only tested dead code paths.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Deployment Risk&lt;/strong&gt;: When someone finally cleaned up the flag, they accidentally removed a config validation that was shared between legacy and new paths. That caused a 15-minute outage.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Did About It
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Flag Audit Script
&lt;/h3&gt;

&lt;p&gt;I wrote a quick script to scan our codebase:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List every ENABLE_* flag referenced under src/, most-referenced first&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s2"&gt;"config&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;flags&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; src/ | &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s2"&gt;"ENABLE_[A-Z_]*"&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nb"&gt;sort&lt;/span&gt; | &lt;span class="nb"&gt;uniq&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-rn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then cross-referenced with our LaunchDarkly dashboard to find flags that hadn't changed state in 90+ days.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The "Flag Funeral" Process
&lt;/h3&gt;

&lt;p&gt;For each dead flag:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Verify&lt;/strong&gt; — Check logs for 30 days to confirm zero true evaluations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comment&lt;/strong&gt; — Add a &lt;code&gt;@deprecated&lt;/code&gt; note with removal date&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remove&lt;/strong&gt; — Delete the code path in the next sprint&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Celebrate&lt;/strong&gt; — Log it in our engineering changelog (yes, really)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  3. Automated Cleanup Rules
&lt;/h3&gt;

&lt;p&gt;We added CI checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flag unchanged for 60 days → warning in PR&lt;/li&gt;
&lt;li&gt;Flag unchanged for 90 days → auto-create cleanup ticket&lt;/li&gt;
&lt;li&gt;Flag unchanged for 120 days → block new flag creation until old ones are cleaned&lt;/li&gt;
&lt;/ul&gt;
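
&lt;p&gt;The thresholds above boil down to a small lookup. A minimal sketch, assuming the flag service can export each flag's last state change as an ISO date. The &lt;code&gt;flagAction&lt;/code&gt; and &lt;code&gt;auditFlags&lt;/code&gt; names, the action labels, and the input shape are all made up for illustration:&lt;/p&gt;

```javascript
// Map a flag's age (days since its state last changed) to a CI action.
// The 60/90/120-day thresholds come from the rules above.
function flagAction(daysUnchanged) {
  if (daysUnchanged >= 120) return 'block-new-flags';
  if (daysUnchanged >= 90) return 'create-cleanup-ticket';
  if (daysUnchanged >= 60) return 'warn-in-pr';
  return 'ok';
}

// `flags` is assumed to look like [{ name, lastChangedAt }] with ISO date strings.
function auditFlags(flags, now = new Date()) {
  const DAY_MS = 24 * 60 * 60 * 1000;
  return flags.map((flag) => {
    const daysUnchanged = Math.floor((now - new Date(flag.lastChangedAt)) / DAY_MS);
    return { name: flag.name, daysUnchanged, action: flagAction(daysUnchanged) };
  });
}

const report = auditFlags(
  [{ name: 'ENABLE_LEGACY_CHECKOUT', lastChangedAt: '2025-12-01T00:00:00Z' }],
  new Date('2026-06-06T00:00:00Z')
);
console.log(report[0].daysUnchanged, report[0].action); // 187 block-new-flags
```

&lt;p&gt;A CI job could fail the build or open a ticket depending on the returned action.&lt;/p&gt;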

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;After 3 weeks of cleanup:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Active feature flags&lt;/td&gt;
&lt;td&gt;47&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dead code branches&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg. flag understanding time&lt;/td&gt;
&lt;td&gt;20 min&lt;/td&gt;
&lt;td&gt;5 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test suite execution time&lt;/td&gt;
&lt;td&gt;14 min&lt;/td&gt;
&lt;td&gt;11 min&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A 34% reduction in flag count. 75% less time to understand what a flag does. And most importantly, zero confusion about what's live and what's dead.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Lesson
&lt;/h2&gt;

&lt;p&gt;Feature flags are like garden plants. If you don't prune them, they grow wild and start choking the things you actually care about.&lt;/p&gt;

&lt;p&gt;Every flag you add is technical debt with an expiration date. Set the date. Enforce it.&lt;/p&gt;

&lt;p&gt;Your future self — and your team — will thank you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Found this useful? Follow me for more production war stories and practical devops tips.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>programming</category>
      <category>webdev</category>
      <category>testing</category>
    </item>
    <item>
      <title>I Broke Production 3 Times This Week — How a CI/CD Pipeline Audit Fixed Everything</title>
      <dc:creator>kol kol</dc:creator>
      <pubDate>Sat, 06 Jun 2026 14:03:37 +0000</pubDate>
      <link>https://dev.to/kollittle/i-broke-production-3-times-this-week-how-a-cicd-pipeline-audit-fixed-everything-35ne</link>
      <guid>https://dev.to/kollittle/i-broke-production-3-times-this-week-how-a-cicd-pipeline-audit-fixed-everything-35ne</guid>
      <description>&lt;p&gt;Last week I broke production three times. Not because of bad code — because our CI/CD pipeline was quietly lying to us.&lt;/p&gt;

&lt;p&gt;Here's what happened, what I found during the audit, and the exact pipeline changes that eliminated deployment failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Breakages
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Break #1:&lt;/strong&gt; A database migration ran twice because our pipeline didn't track which migrations had already executed. Result: duplicate key errors across 200+ records.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Break #2:&lt;/strong&gt; Environment variable interpolation silently dropped a critical API key in production. The staging build passed because the variable was set in our &lt;code&gt;.env.staging&lt;/code&gt; but not &lt;code&gt;.env.production&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Break #3:&lt;/strong&gt; A dependency update changed a function signature. Our test suite passed because the mocked version still matched the old signature. Production exploded at runtime.&lt;/p&gt;

&lt;p&gt;Three different failure modes. One root cause: our CI/CD pipeline had more gaps than a net.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pipeline Audit
&lt;/h2&gt;

&lt;p&gt;I mapped every step from &lt;code&gt;git push&lt;/code&gt; to production deployment. Here's what I found:&lt;/p&gt;

&lt;h3&gt;
  
  
  Gap 1: No Migration Tracking
&lt;/h3&gt;

&lt;p&gt;Our pipeline ran &lt;code&gt;prisma migrate deploy&lt;/code&gt; blindly on every deployment. No check for already-applied migrations. No rollback plan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Added a migration status check that queries &lt;code&gt;_prisma_migrations&lt;/code&gt; before running anything new:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check pending migrations before applying&lt;/span&gt;
npx prisma migrate status
npx prisma migrate deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Gap 2: Env Var Validation Missing
&lt;/h3&gt;

&lt;p&gt;We had 23 environment variables in production. Zero validation that they existed before deploying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Added a pre-deployment validation step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Required env vars checklist&lt;/span&gt;
&lt;span class="nv"&gt;REQUIRED_VARS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"DATABASE_URL NEXTAUTH_SECRET API_KEY STRIPE_SECRET"&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;var &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nv"&gt;$REQUIRED_VARS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="p"&gt;!var&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"❌ Missing required variable: &lt;/span&gt;&lt;span class="nv"&gt;$var&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
  &lt;span class="k"&gt;fi
done
&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"✅ All required environment variables present"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This single check has blocked 4 bad deployments since I added it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gap 3: Tests Didn't Match Reality
&lt;/h3&gt;

&lt;p&gt;Our mock data was stale. Tests passed against a mocked API that hadn't been updated in months.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Two changes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Contract testing:&lt;/strong&gt; Added &lt;code&gt;@stoplight/spectral&lt;/code&gt; to validate our OpenAPI spec against actual responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration tests in CI:&lt;/strong&gt; Running real API calls against a staging database, not just unit tests with mocks&lt;/li&gt;
&lt;/ol&gt;
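
&lt;p&gt;To make the idea concrete, here is a hand-rolled sketch of a response-shape check. It is not how Spectral or any OpenAPI validator works internally. The contract format (field name mapped to an expected &lt;code&gt;typeof&lt;/code&gt; result) and the &lt;code&gt;findContractViolations&lt;/code&gt; name are assumptions for illustration:&lt;/p&gt;

```javascript
// Hypothetical contract format: field name -> expected `typeof` result.
function findContractViolations(contract, response) {
  const violations = [];
  for (const [field, expectedType] of Object.entries(contract)) {
    if (!(field in response)) {
      violations.push(`missing field: ${field}`);
    } else if (typeof response[field] !== expectedType) {
      violations.push(`${field}: expected ${expectedType}, got ${typeof response[field]}`);
    }
  }
  return violations;
}

const orderContract = { id: 'string', total: 'number' };
console.log(findContractViolations(orderContract, { id: 'ord_1', total: '19.99' }));
// [ 'total: expected number, got string' ]
```

&lt;p&gt;Running the same check against both the mock fixture and a real staging response is what exposes a stale mock: one passes, the other doesn't.&lt;/p&gt;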

&lt;h2&gt;
  
  
  The New Pipeline
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git push → Lint → Type Check → Unit Tests → Contract Tests 
  → Build → Env Validation → Staging Deploy → Integration Tests 
  → Migration Check → Production Deploy → Health Check
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each stage blocks the next on failure. No more "tests passed but prod broke."&lt;/p&gt;
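
&lt;p&gt;The fail-fast behaviour can be sketched in a few lines. This is only an illustration, not the actual CI configuration: &lt;code&gt;Stage&lt;/code&gt;, &lt;code&gt;PipelineResult&lt;/code&gt;, and &lt;code&gt;runPipeline&lt;/code&gt; are hypothetical names, and each stage is assumed to report success as a boolean.&lt;/p&gt;

```typescript
interface Stage {
  name: string;
  run: () => boolean; // returns true when the stage succeeds
}

interface PipelineResult {
  passed: string[];
  failedAt: string | null; // name of the first failing stage, if any
}

function runPipeline(stages: Stage[]): PipelineResult {
  const passed: string[] = [];
  for (const stage of stages) {
    if (!stage.run()) {
      // Stop here: later stages never run after a failure.
      return { passed, failedAt: stage.name };
    }
    passed.push(stage.name);
  }
  return { passed, failedAt: null };
}
```

&lt;p&gt;If a stage such as Env Validation fails, the deploy stages after it are never invoked, which is the property the pipeline relies on.&lt;/p&gt;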

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;Since the audit (2 weeks ago):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;0 production breakages&lt;/strong&gt; (vs 3 in the previous week)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4 blocked bad deployments&lt;/strong&gt; before they reached staging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy confidence:&lt;/strong&gt; Team actually ships on Fridays now&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;Your CI/CD pipeline isn't just automation — it's your last line of defense. If it has gaps, production will find them.&lt;/p&gt;

&lt;p&gt;Audit your pipeline. Map every step. Ask "what happens if this lies to me?" for each one.&lt;/p&gt;

&lt;p&gt;Then fix the gaps before production finds them for you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's the worst CI/CD failure you've dealt with? Drop it in the comments — misery loves company.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cicd</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I Built a RAG Chat Assistant from Scratch — Here's What Nobody Tells You About Production RAG</title>
      <dc:creator>kol kol</dc:creator>
      <pubDate>Fri, 05 Jun 2026 14:03:37 +0000</pubDate>
      <link>https://dev.to/kollittle/i-built-a-rag-chat-assistant-from-scratch-heres-what-nobody-tells-you-about-production-rag-3295</link>
      <guid>https://dev.to/kollittle/i-built-a-rag-chat-assistant-from-scratch-heres-what-nobody-tells-you-about-production-rag-3295</guid>
      <description>&lt;p&gt;Everyone talks about how easy it is to build a RAG (Retrieval-Augmented Generation) chat assistant. Just embed some documents, throw them in a vector database, and connect an LLM, right?&lt;/p&gt;

&lt;p&gt;Well, I just spent weeks building a production RAG system for a real knowledge base platform with nearly 2,000 articles. And let me tell you — the gap between a weekend hackathon demo and something that actually works for real users is enormous.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is RAG, Anyway?
&lt;/h2&gt;

&lt;p&gt;In case you haven't heard the acronym a thousand times this year: RAG is a technique where you retrieve relevant documents from your own knowledge base and feed them into an LLM as context, so the model answers from &lt;em&gt;your&lt;/em&gt; data instead of relying only on what it saw before its training cut-off.&lt;/p&gt;

&lt;p&gt;The theory is simple. The reality is full of landmines.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Landmines Nobody Warns You About
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Chunking is Everything (And You're Probably Doing It Wrong)
&lt;/h3&gt;

&lt;p&gt;My first attempt? Fixed 500-token chunks with no overlap. The result was terrible: half the retrieved chunks were cut off mid-sentence, losing critical context.&lt;/p&gt;

&lt;p&gt;What actually worked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Semantic chunking&lt;/strong&gt;: Split at natural boundaries (headings, paragraph breaks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overlap strategy&lt;/strong&gt;: 15-20% overlap between adjacent chunks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata enrichment&lt;/strong&gt;: Each chunk carries its source article title, category, and position&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, these changes improved answer quality by an estimated 40%.&lt;/p&gt;
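
&lt;p&gt;A minimal sketch of paragraph-boundary chunking with overlap and metadata. It makes several assumptions: tokens are approximated by whitespace-separated words, &lt;code&gt;maxWords&lt;/code&gt; is a soft limit, and the &lt;code&gt;Chunk&lt;/code&gt; shape and &lt;code&gt;chunkArticle&lt;/code&gt; name are hypothetical rather than taken from the real pipeline.&lt;/p&gt;

```typescript
// Hypothetical chunk shape; the field names are illustrative.
interface Chunk {
  title: string;    // source article title
  position: number; // index of the chunk within the article
  text: string;
}

// Approximates token counts with whitespace-separated words (an assumption for brevity).
function countWords(text: string): number {
  return text.split(/\s+/).filter((w) => w.length > 0).length;
}

function chunkArticle(title: string, body: string, maxWords = 200, overlapRatio = 0.2): Chunk[] {
  // Split at paragraph breaks so chunk boundaries fall between paragraphs.
  const paragraphs = body
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
  const overlapWords = Math.floor(maxWords * overlapRatio);
  const chunks: Chunk[] = [];
  let current: string[] = [];

  const flush = (): void => {
    const text = current.join("\n\n");
    chunks.push({ title, position: chunks.length, text });
    // Seed the next chunk with the tail of this one so neighbours overlap.
    const tail = text.split(/\s+/).slice(-overlapWords).join(" ");
    current = overlapWords > 0 ? [tail] : [];
  };

  for (const paragraph of paragraphs) {
    const projected = countWords(current.join(" ")) + countWords(paragraph);
    // `current` is empty only before the first paragraph, so the first one is always accepted.
    if (current.length > 0) {
      if (projected > maxWords) flush();
    }
    current.push(paragraph);
  }
  if (current.length > 0) flush();
  return chunks;
}
```

&lt;p&gt;Carrying the tail of the previous chunk forward means a sentence that straddles a boundary still appears whole in at least one chunk.&lt;/p&gt;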

&lt;h3&gt;
  
  
  2. Vector Search Alone Isn't Enough
&lt;/h3&gt;

&lt;p&gt;Pure cosine similarity on embeddings returns results that are &lt;em&gt;topically similar&lt;/em&gt; but often miss the mark for specific technical questions.&lt;/p&gt;

&lt;p&gt;The winning combo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BM25 (keyword search)&lt;/strong&gt; for exact technical term matching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector similarity&lt;/strong&gt; for semantic understanding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid ranking&lt;/strong&gt; with a weighted score (0.4 BM25 + 0.6 vector)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This hybrid approach catches both "how to configure PostgreSQL connection pooling" (BM25 wins) and "why is my database slow under load" (vector wins).&lt;/p&gt;
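
&lt;p&gt;The weighted score can be sketched as follows. BM25 scores and cosine similarities live on different scales, so this sketch assumes min-max normalization before weighting; that step, the &lt;code&gt;Candidate&lt;/code&gt; shape, and the &lt;code&gt;hybridRank&lt;/code&gt; name are assumptions for illustration.&lt;/p&gt;

```typescript
// Hypothetical candidate shape: raw scores from each retriever for one chunk.
interface Candidate {
  id: string;
  bm25: number;   // unbounded keyword score
  vector: number; // cosine similarity
}

// Min-max normalization to [0, 1]; an assumption, since the two scores use different scales.
function normalize(values: number[]): number[] {
  const min = Math.min(...values);
  const max = Math.max(...values);
  const range = max - min;
  return values.map((v) => (range === 0 ? 0 : (v - min) / range));
}

function hybridRank(candidates: Candidate[], bm25Weight = 0.4, vectorWeight = 0.6) {
  const bm25 = normalize(candidates.map((c) => c.bm25));
  const vector = normalize(candidates.map((c) => c.vector));
  return candidates
    .map((c, i) => ({ id: c.id, score: bm25Weight * bm25[i] + vectorWeight * vector[i] }))
    .sort((a, b) => b.score - a.score); // highest combined score first
}
```

&lt;p&gt;The 0.4 and 0.6 defaults mirror the weights above; in practice they are worth tuning against real queries.&lt;/p&gt;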

&lt;h3&gt;
  
  
  3. Context Window is a Budget, Not a Freebie
&lt;/h3&gt;

&lt;p&gt;Most tutorials stuff as many retrieved chunks as possible into the prompt. But every extra token costs money, and padding the prompt with marginal chunks tends to degrade response quality.&lt;/p&gt;

&lt;p&gt;My approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Top 5 most relevant chunks&lt;/strong&gt; (not 10, not 20)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intelligent deduplication&lt;/strong&gt;: Remove near-identical chunks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source attribution&lt;/strong&gt;: Each answer links back to the original article&lt;/li&gt;
&lt;/ul&gt;
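
&lt;p&gt;One way to combine top-5 selection with deduplication is sketched below. It treats two chunks as near-identical when the Jaccard similarity of their word sets reaches a threshold; the 0.9 threshold, the similarity measure, and the &lt;code&gt;selectContext&lt;/code&gt; name are assumptions, not the method actually used.&lt;/p&gt;

```typescript
interface ScoredChunk {
  text: string;
  score: number;
}

// Lower-cased set of words in a chunk.
function wordSet(text: string) {
  return new Set(text.toLowerCase().split(/\W+/).filter((w) => w.length > 0));
}

// Jaccard similarity: shared words divided by all distinct words.
function jaccard(a: string, b: string): number {
  const setA = wordSet(a);
  const setB = wordSet(b);
  let shared = 0;
  setA.forEach((w) => {
    if (setB.has(w)) shared += 1;
  });
  const union = setA.size + setB.size - shared;
  return union === 0 ? 1 : shared / union;
}

function selectContext(chunks: ScoredChunk[], topK = 5, threshold = 0.9): ScoredChunk[] {
  const selected: ScoredChunk[] = [];
  const ranked = chunks.slice().sort((a, b) => b.score - a.score);
  for (const chunk of ranked) {
    if (selected.length >= topK) break;
    // Skip a chunk that is nearly identical to one already kept.
    const duplicate = selected.some((kept) => jaccard(kept.text, chunk.text) >= threshold);
    if (!duplicate) selected.push(chunk);
  }
  return selected;
}
```

&lt;p&gt;Deduplicating before the cut-off means a repeated passage cannot occupy several of the five slots.&lt;/p&gt;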

&lt;h2&gt;
  
  
  The Architecture That Actually Worked
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Question → Query Rewriting → Hybrid Search (BM25 + Vector) 
  → Reranking → Top-5 Selection → Prompt Assembly → LLM Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key tech choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vector DB&lt;/strong&gt;: Supabase (pgvector) — because it's Postgres and you already know SQL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embeddings&lt;/strong&gt;: OpenAI text-embedding-3-small (fast, cheap, good enough)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reranking&lt;/strong&gt;: Custom scoring function (BM25 + vector + recency boost)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM&lt;/strong&gt;: GPT-4o-mini for cost efficiency on production traffic&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;~2,000 technical articles&lt;/strong&gt; in the knowledge base&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sub-second query latency&lt;/strong&gt; for most questions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source-linked answers&lt;/strong&gt; — every response cites which article it came from&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost per query&lt;/strong&gt;: ~$0.005 (embedding + LLM call)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with a reranker model&lt;/strong&gt; from day one — it's worth the extra compute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build a feedback loop early&lt;/strong&gt; — let users thumbs-up/down answers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't over-engineer chunking&lt;/strong&gt; — start simple, iterate based on actual query patterns&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;RAG isn't magic. It's engineering. And like any engineering problem, the devil is in the details. The gap between "it works on my laptop" and "it works for thousands of users asking weird questions at 3 AM" is filled with chunking strategies, hybrid search tuning, and relentless iteration.&lt;/p&gt;

&lt;p&gt;If you're building a RAG system right now: start simple, measure everything, and expect to rewrite your retrieval pipeline at least three times.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Want to dive deeper into building developer tools? Check out my technical knowledge base at &lt;a href="https://codcompass.com" rel="noopener noreferrer"&gt;codcompass.com&lt;/a&gt; — growing weekly with real-world insights on shipping software.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>codcompass</category>
      <category>ai</category>
      <category>knowledgebase</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Why I Stopped Using try/catch Everywhere — Error Handling Patterns That Actually Scale</title>
      <dc:creator>kol kol</dc:creator>
      <pubDate>Thu, 04 Jun 2026 14:07:16 +0000</pubDate>
      <link>https://dev.to/kollittle/my-docker-container-crashed-at-3am-5-containerization-lessons-from-production-1b6</link>
      <guid>https://dev.to/kollittle/my-docker-container-crashed-at-3am-5-containerization-lessons-from-production-1b6</guid>
      <description>&lt;h1&gt;
  
  
  Why I Stopped Using try/catch Everywhere — Error Handling Patterns That Actually Scale
&lt;/h1&gt;

&lt;p&gt;My app had 847 try/catch blocks. 73% of them were identical: &lt;code&gt;catch (error) { console.error(error); throw error; }&lt;/code&gt;. I was wrapping error handling in more error handling. The code wasn't safer — it was just louder.&lt;/p&gt;

&lt;p&gt;Here's what I learned after refactoring error handling across 12 microservices.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Anti-Pattern: Defensive try/catch Sprawl
&lt;/h2&gt;

&lt;p&gt;Every async function wrapped in its own try/catch, logging the same error, throwing the same error, catching the same error three layers up. The result?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stack traces that told you nothing&lt;/li&gt;
&lt;li&gt;Duplicate error logs (same error logged 4 times)&lt;/li&gt;
&lt;li&gt;Actual recovery logic buried under boilerplate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I call this &lt;strong&gt;anxiety-driven development&lt;/strong&gt; — wrapping everything in try/catch because we're afraid of unhandled rejections, not because we have a plan for each error.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Worked
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Centralized Error Boundary
&lt;/h3&gt;

&lt;p&gt;Instead of 847 try/catch blocks, I built one error boundary per service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ServiceBoundary&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nx"&gt;execute&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;classified&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;classified&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;classified&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;recoverable&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;classified&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;classified&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One boundary. Consistent classification. No duplicate logging.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Error Classification, Not Catch-All
&lt;/h3&gt;

&lt;p&gt;Not all errors are equal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operational errors&lt;/strong&gt; (network timeout, 429 rate limit) → retry with backoff&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Programmatic errors&lt;/strong&gt; (null reference, type mismatch) → fail fast, alert on-call&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expected errors&lt;/strong&gt; (validation failure, not-found) → return to caller, don't throw&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once I started classifying, 60% of my try/catch blocks became unnecessary. Validation errors get returned as results, not thrown as exceptions.&lt;/p&gt;
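
&lt;p&gt;A rough sketch of what a classifier for these three categories might look like. The status codes, the Node-style error codes, and the choice to treat anything unrecognized as a bug are all assumptions for illustration; &lt;code&gt;classify&lt;/code&gt; and &lt;code&gt;actionFor&lt;/code&gt; are hypothetical helpers.&lt;/p&gt;

```typescript
type ErrorKind = "operational" | "programmer" | "expected";

// Illustrative lists; a real service would tune these.
const RETRYABLE_CODES = ["ETIMEDOUT", "ECONNRESET"];
const EXPECTED_STATUSES = [400, 404, 422];

function classify(error: unknown): ErrorKind {
  // Read optional fields without assuming the thrown value is an Error instance.
  const info = (error ?? {}) as { status?: number; code?: string };
  if (info.status === 429) return "operational";
  if (RETRYABLE_CODES.indexOf(info.code ?? "") >= 0) return "operational";
  if (EXPECTED_STATUSES.indexOf(info.status ?? 0) >= 0) return "expected";
  // Anything unrecognized is treated as a bug so it fails fast (an assumption).
  return "programmer";
}

function actionFor(kind: ErrorKind): string {
  switch (kind) {
    case "operational":
      return "retry with backoff";
    case "expected":
      return "return to caller";
    case "programmer":
      return "fail fast and alert on-call";
  }
}
```

&lt;p&gt;Defaulting to the fail-fast category keeps unknown failures visible instead of silently retrying them.&lt;/p&gt;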

&lt;h3&gt;
  
  
  3. Result Types for Expected Failures
&lt;/h3&gt;

&lt;p&gt;Borrowed from Rust's &lt;code&gt;Result&amp;lt;T, E&amp;gt;&lt;/code&gt; pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;E&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;AppError&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;T&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;E&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When failure is expected (user not found, validation failed, rate limited), return a Result. Don't throw. Throwing should be reserved for &lt;em&gt;unexpected&lt;/em&gt; failures.&lt;/p&gt;

&lt;p&gt;This cut our error-related unit tests by 40% because we stopped testing "does this throw?" for things that aren't actually exceptional.&lt;/p&gt;
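
&lt;p&gt;To make the pattern concrete, here is a small usage sketch. It uses a non-generic specialization of the Result type and a simplified &lt;code&gt;AppError&lt;/code&gt;; the &lt;code&gt;User&lt;/code&gt; type, the in-memory &lt;code&gt;users&lt;/code&gt; list, and the &lt;code&gt;findUser&lt;/code&gt; and &lt;code&gt;greet&lt;/code&gt; functions are made up for the example.&lt;/p&gt;

```typescript
// Simplified stand-ins for illustration; the data below is invented.
interface AppError {
  code: string;
  message: string;
}

interface User {
  id: string;
  name: string;
}

type UserResult = { ok: true; value: User } | { ok: false; error: AppError };

const users: User[] = [{ id: "u1", name: "Ada" }];

function findUser(id: string): UserResult {
  const user = users.find((u) => u.id === id);
  if (user === undefined) {
    // A missing user is an expected outcome, so it is returned rather than thrown.
    return { ok: false, error: { code: "USER_NOT_FOUND", message: "No user with id " + id } };
  }
  return { ok: true, value: user };
}

function greet(id: string): string {
  const result = findUser(id);
  // The compiler narrows on `ok`, so `value` and `error` are only reachable in the matching branch.
  return result.ok ? "Hello, " + result.value.name : "Lookup failed: " + result.error.code;
}
```

&lt;p&gt;The caller has to handle both branches to get at the value, so the failure path cannot be forgotten the way an uncaught exception can.&lt;/p&gt;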

&lt;h3&gt;
  
  
  4. Structured Error Metadata
&lt;/h3&gt;

&lt;p&gt;Every error now carries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;AppError&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nb"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;        &lt;span class="c1"&gt;// "USER_NOT_FOUND" not "ENOENT"&lt;/span&gt;
  &lt;span class="nl"&gt;severity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;info&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;warn&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;critical&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;retryable&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// userId, requestId, etc.&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No more guessing what &lt;code&gt;ECONNRESET&lt;/code&gt; means in your logs. No more searching through 847 catch blocks to find where the error was swallowed.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Let It Crash (Sometimes)
&lt;/h3&gt;

&lt;p&gt;The hardest lesson: some errors &lt;em&gt;should&lt;/em&gt; crash. If your database connection pool is exhausted, silently retrying 10 times while the queue builds up isn't resilience — it's a slow death spiral.&lt;/p&gt;

&lt;p&gt;Fail fast, let the orchestrator restart you, and alert someone. A crash is often more honest than a zombie service that's technically "up" but functionally dead.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;try/catch blocks&lt;/td&gt;
&lt;td&gt;847&lt;/td&gt;
&lt;td&gt;212&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean error resolution time&lt;/td&gt;
&lt;td&gt;47 min&lt;/td&gt;
&lt;td&gt;8 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duplicate error logs per incident&lt;/td&gt;
&lt;td&gt;4.2x&lt;/td&gt;
&lt;td&gt;1x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;On-call false pages&lt;/td&gt;
&lt;td&gt;31%&lt;/td&gt;
&lt;td&gt;7%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The code isn't just cleaner — it's honest about what can go wrong and specific about what to do when it does.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;How does your team handle errors? Are you catching everything, or are you classifying and responding?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>codcompass</category>
      <category>ai</category>
      <category>knowledgebase</category>
      <category>webdev</category>
    </item>
    <item>
      <title>My Monitoring Dashboard Was Lying to Me — How I Learned the Difference Between Monitoring and Observability</title>
      <dc:creator>kol kol</dc:creator>
      <pubDate>Wed, 03 Jun 2026 22:47:19 +0000</pubDate>
      <link>https://dev.to/kollittle/my-monitoring-dashboard-was-lying-to-me-how-i-learned-the-difference-between-monitoring-and-44a4</link>
      <guid>https://dev.to/kollittle/my-monitoring-dashboard-was-lying-to-me-how-i-learned-the-difference-between-monitoring-and-44a4</guid>
      <description>&lt;p&gt;I spent three hours last week staring at a perfectly green dashboard while our users were getting 5-second response times. Every metric said "healthy." Every alert was silent. But the product was broken.&lt;/p&gt;

&lt;p&gt;That's when it clicked: &lt;strong&gt;I had built a monitoring system, not an observability system.&lt;/strong&gt; And they are not the same thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Dashboard Illusion
&lt;/h2&gt;

&lt;p&gt;Here's what my setup looked like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU usage: 23% ✅&lt;/li&gt;
&lt;li&gt;Memory: 61% ✅&lt;/li&gt;
&lt;li&gt;Request rate: normal ✅&lt;/li&gt;
&lt;li&gt;Error rate: 0.2% ✅&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But users were rage-clicking because the checkout flow was timing out on a specific third-party API call. The error rate was "fine" because retries masked the failures. The CPU was low because the bottleneck was network I/O, not computation. Every metric I was watching was measuring the wrong thing.&lt;/p&gt;

&lt;p&gt;This is what I call &lt;strong&gt;decorative telemetry&lt;/strong&gt; — numbers that look authoritative in a standup meeting but tell you nothing about what users actually experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring vs Observability: The Real Difference
&lt;/h2&gt;

&lt;p&gt;The distinction that changed everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring asks: "Is it broken?"&lt;/strong&gt; — You define thresholds, you get alerts when crossed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability asks: "Why is it broken?"&lt;/strong&gt; — You can explore unknown-unknowns without shipping new code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Monitoring is a car dashboard (speed, fuel, temperature). Observability is a mechanic's diagnostic tool: it lets you trace any symptom back to its root cause.&lt;/p&gt;

&lt;p&gt;Most teams — including mine — build dashboards and call it observability. That's like buying a speedometer and claiming you can diagnose engine problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Fixed It
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. SLO-First Instrumentation
&lt;/h3&gt;

&lt;p&gt;Instead of measuring infrastructure, I started measuring what users perceive. For checkout, that meant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SLO: 99% of checkout requests complete in &amp;lt; 2 seconds
SLI: Actual latency at p99 per 5-minute window
Error Budget: 1% of requests per 30-day window (equivalent to 7.2 hours of full violation)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the third-party API degraded, our error budget burned visibly. The dashboard went from "all green" to "73% budget consumed in 2 hours." That's actionable.&lt;/p&gt;
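
&lt;p&gt;The budget arithmetic is simple enough to sketch. The helpers below are hypothetical and assume the budget is counted in requests that miss the latency target, with a time-based equivalent for comparison; the traffic numbers in the note afterwards are invented.&lt;/p&gt;

```typescript
// Requests allowed to miss the latency target in a window, for a target such as 99 (percent).
function allowedSlowRequests(totalRequests: number, sloPercent: number): number {
  return (totalRequests * (100 - sloPercent)) / 100;
}

// The same budget expressed as minutes of total violation over the window.
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  return (windowDays * 24 * 60 * (100 - sloPercent)) / 100;
}

// Fraction of the budget spent so far.
function budgetConsumed(slowRequests: number, allowed: number): number {
  return slowRequests / allowed;
}
```

&lt;p&gt;For a 99% target over a 30-day window, &lt;code&gt;errorBudgetMinutes(99, 30)&lt;/code&gt; works out to 432 minutes. With a hypothetical 1,000,000 requests in the window, 10,000 may be slow, so 7,300 slow requests would consume 73% of the budget.&lt;/p&gt;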

&lt;h3&gt;
  
  
  2. Correlation Discipline
&lt;/h3&gt;

&lt;p&gt;The real breakthrough was forcing structured logs, trace IDs, and bounded context tags into every service. When something breaks, you follow the trace ID — not guess which dashboard to check.&lt;/p&gt;

&lt;p&gt;Before: "Checkout is slow. Let me check 6 dashboards."&lt;br&gt;
After: "Trace abc123 failed at step 4. It's the payment provider."&lt;/p&gt;

&lt;p&gt;Mean time to diagnosis dropped from hours to minutes.&lt;/p&gt;
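
&lt;p&gt;Following a trace ID only works if every service logs the same fields. A minimal sketch, assuming JSON log lines and a hypothetical &lt;code&gt;LogEntry&lt;/code&gt; shape (the field names are illustrative):&lt;/p&gt;

```typescript
// Hypothetical log shape shared by every service.
interface LogEntry {
  traceId: string;
  service: string;
  step: string;
  status: "ok" | "failed";
  durationMs: number;
}

// One JSON object per line keeps logs machine-searchable.
function logLine(entry: LogEntry): string {
  return JSON.stringify(entry);
}

// Given log lines collected from several services, find where one trace failed.
function firstFailure(lines: string[], traceId: string): LogEntry | undefined {
  const entries = lines.map((line) => JSON.parse(line) as LogEntry);
  return entries.find((e) => (e.traceId === traceId ? e.status === "failed" : false));
}
```

&lt;p&gt;With the trace ID on every line, "which dashboard do I check?" becomes a single filter over the logs.&lt;/p&gt;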

&lt;h3&gt;
  
  
  3. Alert Hygiene
&lt;/h3&gt;

&lt;p&gt;I cut our alerts from 47 to 8. The rule: &lt;strong&gt;page only what demands immediate human action.&lt;/strong&gt; Everything else goes to a dashboard or a weekly report.&lt;/p&gt;

&lt;p&gt;The 8 remaining alerts have explicit severity semantics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SEV-1: Users cannot complete core flows → page immediately&lt;/li&gt;
&lt;li&gt;SEV-2: Degraded experience for &amp;gt; 10% of users → page if it persists &amp;gt; 15 min&lt;/li&gt;
&lt;li&gt;SEV-3: Internal tooling degraded → Slack notification&lt;/li&gt;
&lt;li&gt;Everything else: Dashboard-only&lt;/li&gt;
&lt;/ul&gt;
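
&lt;p&gt;These routing rules translate almost directly into code. The sketch below is an illustration only: &lt;code&gt;FiredAlert&lt;/code&gt; and &lt;code&gt;routeAlert&lt;/code&gt; are hypothetical names, and sending a SEV-2 to the dashboard until it has persisted for 15 minutes is an assumption about what happens before it pages.&lt;/p&gt;

```typescript
type Severity = "SEV-1" | "SEV-2" | "SEV-3" | "NONE";
type Route = "page" | "slack" | "dashboard";

interface FiredAlert {
  severity: Severity;
  minutesActive: number; // how long the condition has persisted
}

function routeAlert(alert: FiredAlert): Route {
  switch (alert.severity) {
    case "SEV-1":
      return "page"; // core flows are broken: page immediately
    case "SEV-2":
      // Page only once the degradation has lasted more than 15 minutes.
      return alert.minutesActive > 15 ? "page" : "dashboard";
    case "SEV-3":
      return "slack";
    default:
      return "dashboard";
  }
}
```

&lt;p&gt;Keeping the rules in one function makes it obvious when a new alert would page someone, which helps keep the count from creeping back up.&lt;/p&gt;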

&lt;h2&gt;
  
  
  The Maturity Model I Wish I'd Had Earlier
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Level 1 — Foundational:&lt;/strong&gt; Consistent logging, health checks, and actionable paging tied to runbooks. (This is where most teams think they're "done.")&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2 — Intermediate:&lt;/strong&gt; Distributed tracing with sampling strategies that survive production load. You can follow a request across services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 3 — Advanced:&lt;/strong&gt; Anomaly detection where baselines are meaningful — not noisy vanity curves that nobody trusts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 4 — Principal:&lt;/strong&gt; Org-wide observability contracts. Every team instruments the same way, SLOs drive priority, and incident learning becomes institutional knowledge.&lt;/p&gt;

&lt;p&gt;I was at Level 1 thinking I was at Level 3.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hard Truth
&lt;/h2&gt;

&lt;p&gt;Dashboards without on-call runbooks are decorative. Metrics without error budgets are opinions. And alerts that page for everything page for nothing.&lt;/p&gt;

&lt;p&gt;Observability isn't a tool you install. It's a discipline you practice — mapping every signal to a decision, every alert to an action, and every incident to a learning loop.&lt;/p&gt;

&lt;p&gt;The green dashboard was lying to me not because it was wrong, but because it was answering questions nobody was asking.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's the biggest gap between what your dashboards show and what your users experience?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>observability</category>
      <category>monitoring</category>
      <category>devops</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
