<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Brittany </title>
    <description>The latest articles on DEV Community by Brittany  (@brittany_37606c0775530a57).</description>
    <link>https://dev.to/brittany_37606c0775530a57</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3262910%2F68ed7ba7-ff95-4943-b28e-eeb2ed4a2708.png</url>
      <title>DEV Community: Brittany </title>
      <link>https://dev.to/brittany_37606c0775530a57</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/brittany_37606c0775530a57"/>
    <language>en</language>
    <item>
      <title>Most ML accuracy issues aren’t model problems. They’re upstream SQL problems. JOIN granularity. Silent NULLs. Distorted aggregations. Sometimes the biggest ML improvement isn’t a new model — it’s a better query.</title>
      <dc:creator>Brittany </dc:creator>
      <pubDate>Sun, 22 Feb 2026 03:06:01 +0000</pubDate>
      <link>https://dev.to/brittany_37606c0775530a57/most-ml-accuracy-issues-arent-model-problems-theyre-upstream-sql-problems-join-4ejo</link>
      <guid>https://dev.to/brittany_37606c0775530a57/most-ml-accuracy-issues-arent-model-problems-theyre-upstream-sql-problems-join-4ejo</guid>
      <description>&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/brittany_37606c0775530a57/your-ml-model-isnt-wrong-your-sql-probably-is-42ba" class="crayons-story__hidden-navigation-link"&gt;Your ML Model Isn’t Wrong. Your SQL Probably Is.&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/brittany_37606c0775530a57" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3262910%2F68ed7ba7-ff95-4943-b28e-eeb2ed4a2708.png" alt="brittany_37606c0775530a57 profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/brittany_37606c0775530a57" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Brittany 
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Brittany 
                
              
              &lt;div id="story-author-preview-content-3274277" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/brittany_37606c0775530a57" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3262910%2F68ed7ba7-ff95-4943-b28e-eeb2ed4a2708.png" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Brittany &lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/brittany_37606c0775530a57/your-ml-model-isnt-wrong-your-sql-probably-is-42ba" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Feb 22&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/brittany_37606c0775530a57/your-ml-model-isnt-wrong-your-sql-probably-is-42ba" id="article-link-3274277"&gt;
          Your ML Model Isn’t Wrong. Your SQL Probably Is.
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/machinelearning"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;machinelearning&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/dataengineering"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;dataengineering&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/sql"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;sql&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/mlops"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;mlops&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
            &lt;a href="https://dev.to/brittany_37606c0775530a57/your-ml-model-isnt-wrong-your-sql-probably-is-42ba#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              1&lt;span class="hidden s:inline"&gt; comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            2 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;




</description>
      <category>machinelearning</category>
      <category>dataengineering</category>
      <category>sql</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Your ML Model Isn’t Wrong. Your SQL Probably Is.</title>
      <dc:creator>Brittany </dc:creator>
      <pubDate>Sun, 22 Feb 2026 02:54:36 +0000</pubDate>
      <link>https://dev.to/brittany_37606c0775530a57/your-ml-model-isnt-wrong-your-sql-probably-is-42ba</link>
      <guid>https://dev.to/brittany_37606c0775530a57/your-ml-model-isnt-wrong-your-sql-probably-is-42ba</guid>
      <description>&lt;p&gt;Your churn model isn’t degrading because the algorithm is weak.&lt;/p&gt;

&lt;p&gt;It might be degrading because of a JOIN.&lt;/p&gt;

&lt;p&gt;I’ve seen teams spend weeks tuning hyperparameters, switching architectures, and debating feature importance — only to discover the real issue was upstream data logic.&lt;/p&gt;

&lt;p&gt;Before you tune the model, check your SQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem Most Teams Misdiagnose&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When performance drops, we usually suspect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model drift&lt;/li&gt;
&lt;li&gt;Hyperparameter tuning&lt;/li&gt;
&lt;li&gt;Feature scaling&lt;/li&gt;
&lt;li&gt;Algorithm choice&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are valid concerns.&lt;/p&gt;

&lt;p&gt;But machine learning models don’t invent patterns.&lt;/p&gt;

&lt;p&gt;They learn from the data we feed them.&lt;/p&gt;

&lt;p&gt;If the dataset is flawed, the model will faithfully learn those flaws.&lt;/p&gt;

&lt;p&gt;Upstream data logic determines downstream model behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario: The “Failing” Churn Model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A churn prediction model starts underperforming.&lt;/p&gt;

&lt;p&gt;Same architecture.&lt;br&gt;
Same training pipeline.&lt;br&gt;
Same evaluation framework.&lt;/p&gt;

&lt;p&gt;Nothing obvious changed.&lt;/p&gt;

&lt;p&gt;After investigation, the issue wasn’t model complexity.&lt;/p&gt;

&lt;p&gt;It was this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT *
FROM customers c
JOIN orders o
ON c.customer_id = o.customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It looks harmless.&lt;/p&gt;

&lt;p&gt;It runs fast.&lt;br&gt;
It returns data.&lt;br&gt;
It passes basic tests.&lt;/p&gt;

&lt;p&gt;But customers with multiple orders are duplicated across rows.&lt;/p&gt;

&lt;p&gt;High-activity users become unintentionally overweighted in the training dataset.&lt;/p&gt;

&lt;p&gt;The model didn’t fail.&lt;/p&gt;

&lt;p&gt;It did exactly what we told it to do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake #1: Duplicate Rows from JOINs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your model expects one row per customer but your query returns one row per transaction, you’ve changed the learning problem.&lt;/p&gt;

&lt;p&gt;The issue isn’t SQL skill — it’s granularity awareness.&lt;/p&gt;

&lt;p&gt;A better approach:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT
    c.customer_id,
    COUNT(o.order_id) AS total_orders
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id
GROUP BY c.customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Aggregate intentionally before training.&lt;/p&gt;

&lt;p&gt;Define the learning unit.&lt;/p&gt;
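
&lt;p&gt;The fan-out is cheap to verify before training. A minimal sketch using Python’s built-in sqlite3 module (the table contents here are hypothetical):&lt;/p&gt;

```python
import sqlite3

# Hypothetical data: customer 1 has three orders, customer 2 has one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 1), (13, 2);
""")

# The raw JOIN returns one row per order, not one row per customer.
joined = conn.execute(
    "SELECT c.customer_id FROM customers c "
    "JOIN orders o ON c.customer_id = o.customer_id"
).fetchall()

# Aggregating first restores one row per customer.
aggregated = conn.execute(
    "SELECT c.customer_id, COUNT(o.order_id) AS total_orders "
    "FROM customers c LEFT JOIN orders o ON c.customer_id = o.customer_id "
    "GROUP BY c.customer_id ORDER BY c.customer_id"
).fetchall()

print(len(joined))  # 4 rows for 2 customers: customer 1 is triple-counted
print(aggregated)   # [(1, 3), (2, 1)]
```

&lt;p&gt;Asserting the expected row count per entity is a one-line check that catches most granularity bugs before they reach training.&lt;/p&gt;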

&lt;p&gt;&lt;strong&gt;Mistake #2: Silent NULL Handling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;NULLs rarely crash pipelines.&lt;/p&gt;

&lt;p&gt;They quietly distort them.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT income
FROM customers;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If income contains NULLs and you don’t handle them deliberately, the model sees noise.&lt;/p&gt;

&lt;p&gt;Even something simple like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT
    COALESCE(income, 0) AS income
FROM customers;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;forces you to define intent.&lt;/p&gt;

&lt;p&gt;The important part isn’t the function.&lt;/p&gt;

&lt;p&gt;It’s the decision.&lt;/p&gt;
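
&lt;p&gt;You can see the difference a NULL policy makes with a tiny sqlite3 sketch (the incomes are hypothetical):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, income REAL);
    -- Hypothetical: two known incomes and one missing value.
    INSERT INTO customers VALUES (1, 50000), (2, 70000), (3, NULL);
""")

# AVG() silently skips NULLs, so this averages 2 rows, not 3.
avg_skipping_nulls = conn.execute(
    "SELECT AVG(income) FROM customers"
).fetchone()[0]

# COALESCE makes the policy explicit: treat missing income as 0.
avg_nulls_as_zero = conn.execute(
    "SELECT AVG(COALESCE(income, 0)) FROM customers"
).fetchone()[0]

print(avg_skipping_nulls)  # 60000.0 (NULL row excluded)
print(avg_nulls_as_zero)   # 40000.0 (NULL counted as 0)
```

&lt;p&gt;Neither answer is “right” in general; the point is that the query should state which one you chose.&lt;/p&gt;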

&lt;p&gt;&lt;strong&gt;Mistake #3: Distorted Aggregations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Global averages can hide meaningful segmentation.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT AVG(transaction_amount)
FROM transactions;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It works.&lt;br&gt;
It returns a number.&lt;br&gt;
It feels reasonable.&lt;/p&gt;

&lt;p&gt;But a model trained on broad aggregates may underperform in production because it lacks entity-level context.&lt;/p&gt;

&lt;p&gt;Instead:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT
    customer_id,
    AVG(transaction_amount) AS avg_txn
FROM transactions
GROUP BY customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Aggregation logic should reflect the model objective — not convenience.&lt;/p&gt;

&lt;p&gt;Aggregation is feature construction.&lt;/p&gt;

&lt;p&gt;Feature construction is model behavior.&lt;/p&gt;
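
&lt;p&gt;A small sketch of why the granularity of the average matters (the data is hypothetical):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (customer_id INTEGER, transaction_amount REAL);
    -- Hypothetical: one small-ticket customer, one large-ticket customer.
    INSERT INTO transactions VALUES (1, 10), (1, 10), (1, 10), (2, 500);
""")

global_avg = conn.execute(
    "SELECT AVG(transaction_amount) FROM transactions"
).fetchone()[0]

per_customer = conn.execute(
    "SELECT customer_id, AVG(transaction_amount) AS avg_txn "
    "FROM transactions GROUP BY customer_id ORDER BY customer_id"
).fetchall()

print(global_avg)    # 132.5, which describes neither customer
print(per_customer)  # [(1, 10.0), (2, 500.0)]: entity-level features
```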

&lt;p&gt;&lt;strong&gt;The Bigger Pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Many ML failures blamed on “model accuracy” are actually upstream data logic issues.&lt;/p&gt;

&lt;p&gt;Strong ML systems require strong SQL foundations.&lt;/p&gt;

&lt;p&gt;Data pipelines are part of the model architecture — not just preprocessing.&lt;/p&gt;

&lt;p&gt;Strong models are built on strong data contracts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before You Tune the Model, Ask:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are joins intentional?&lt;/li&gt;
&lt;li&gt;Is entity granularity clearly defined?&lt;/li&gt;
&lt;li&gt;Are aggregations aligned with the objective?&lt;/li&gt;
&lt;li&gt;Are NULLs handled deliberately?&lt;/li&gt;
&lt;li&gt;Is the training dataset versioned?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sometimes the biggest ML improvement isn’t a new model.&lt;/p&gt;

&lt;p&gt;It’s a better query.&lt;/p&gt;

&lt;p&gt;If you’d like to see the structured breakdown with examples and commentary, I documented it here:&lt;/p&gt;

&lt;p&gt;👉 GitHub repository:&lt;br&gt;
&lt;a href="https://github.com/brie1807/sql-to-ml-pipeline-mistakes" rel="noopener noreferrer"&gt;https://github.com/brie1807/sql-to-ml-pipeline-mistakes&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>dataengineering</category>
      <category>sql</category>
      <category>mlops</category>
    </item>
    <item>
      <title>We debate neural networks, but SQL quietly shapes what a model is allowed to believe. A systems-level perspective on ML architecture.</title>
      <dc:creator>Brittany </dc:creator>
      <pubDate>Mon, 16 Feb 2026 23:30:52 +0000</pubDate>
      <link>https://dev.to/brittany_37606c0775530a57/we-debate-neural-networks-but-sql-quietly-shapes-what-a-model-is-allowed-to-believe-sharing-a-17n2</link>
      <guid>https://dev.to/brittany_37606c0775530a57/we-debate-neural-networks-but-sql-quietly-shapes-what-a-model-is-allowed-to-believe-sharing-a-17n2</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/brittany_37606c0775530a57" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3262910%2F68ed7ba7-ff95-4943-b28e-eeb2ed4a2708.png" alt="brittany_37606c0775530a57"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/brittany_37606c0775530a57/machine-learning-starts-with-a-where-clause-52da" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Machine Learning Starts With a WHERE Clause&lt;/h2&gt;
      &lt;h3&gt;Brittany  ・ Feb 16&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
    </item>
    <item>
      <title>Machine Learning Starts With a WHERE Clause</title>
      <dc:creator>Brittany </dc:creator>
      <pubDate>Mon, 16 Feb 2026 23:26:15 +0000</pubDate>
      <link>https://dev.to/brittany_37606c0775530a57/machine-learning-starts-with-a-where-clause-52da</link>
      <guid>https://dev.to/brittany_37606c0775530a57/machine-learning-starts-with-a-where-clause-52da</guid>
      <description>&lt;p&gt;Most people think machine learning starts with a model.&lt;/p&gt;

&lt;p&gt;It doesn’t.&lt;/p&gt;

&lt;p&gt;It starts with a query.&lt;/p&gt;

&lt;p&gt;Before SageMaker trains.&lt;br&gt;
Before scikit-learn fits.&lt;br&gt;
Before hyperparameters are tuned.&lt;/p&gt;

&lt;p&gt;Someone writes a WHERE clause.&lt;/p&gt;

&lt;p&gt;And that clause quietly decides what the model is allowed to learn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🏗️ SQL Is Architectural — Not Just Operational&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In real ML systems, SQL isn’t just for “getting data.”&lt;/p&gt;

&lt;p&gt;It defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which records are included&lt;/li&gt;
&lt;li&gt;Which time windows matter&lt;/li&gt;
&lt;li&gt;Which behaviors become features&lt;/li&gt;
&lt;li&gt;Which outcomes are excluded&lt;/li&gt;
&lt;li&gt;Which bias is unintentionally preserved&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT
    customer_id,
    COUNT(*) AS total_orders,
    SUM(amount) AS lifetime_value,
    MAX(order_date) AS last_purchase
FROM transactions
WHERE order_date &amp;gt;= '2024-01-01'
GROUP BY customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That single WHERE clause just decided:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The time boundary of learning&lt;/li&gt;
&lt;li&gt;What counts as “recent behavior”&lt;/li&gt;
&lt;li&gt;Whether seasonality exists&lt;/li&gt;
&lt;li&gt;Whether older patterns are erased&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model hasn’t trained yet.&lt;/p&gt;

&lt;p&gt;But its worldview has already been shaped.&lt;/p&gt;
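
&lt;p&gt;A quick sketch of how the cutoff date rewrites a feature, using Python’s built-in sqlite3 module (the purchase history is hypothetical):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (customer_id INTEGER, amount REAL, order_date TEXT);
    -- Hypothetical purchases straddling the 2024-01-01 cutoff.
    INSERT INTO transactions VALUES
        (1, 100, '2023-06-01'),
        (1, 100, '2023-12-15'),
        (1,  40, '2024-03-01');
""")

# The same "lifetime_value" feature, with and without the WHERE clause.
full_history = conn.execute(
    "SELECT SUM(amount) FROM transactions WHERE customer_id = 1"
).fetchone()[0]

recent_only = conn.execute(
    "SELECT SUM(amount) FROM transactions "
    "WHERE customer_id = 1 AND order_date >= '2024-01-01'"
).fetchone()[0]

print(full_history)  # 240.0: looks like a long-term, high-value customer
print(recent_only)   # 40.0: looks like a low-value customer
```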

&lt;p&gt;&lt;strong&gt;📊 Feature Engineering Happens Before Python&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most ML discussions focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Neural networks&lt;/li&gt;
&lt;li&gt;Gradient descent&lt;/li&gt;
&lt;li&gt;Model selection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But feature engineering often happens inside the database.&lt;/p&gt;

&lt;p&gt;Aggregations like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SUM()&lt;/li&gt;
&lt;li&gt;AVG()&lt;/li&gt;
&lt;li&gt;COUNT()&lt;/li&gt;
&lt;li&gt;Window functions&lt;/li&gt;
&lt;li&gt;Time-based grouping&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not “data prep steps.”&lt;/p&gt;

&lt;p&gt;They are architectural decisions.&lt;/p&gt;

&lt;p&gt;If you compute:&lt;/p&gt;

&lt;p&gt;AVG(amount)&lt;/p&gt;

&lt;p&gt;Instead of:&lt;/p&gt;

&lt;p&gt;SUM(amount)&lt;/p&gt;

&lt;p&gt;You change the scale of influence.&lt;/p&gt;

&lt;p&gt;If you group by week instead of month, you change volatility.&lt;/p&gt;

&lt;p&gt;If you filter out NULLs, you may remove entire demographic signals.&lt;/p&gt;

&lt;p&gt;SQL quietly determines signal strength.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️ Data Leakage Is Often a Query Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Many ML failures aren’t algorithmic.&lt;/p&gt;

&lt;p&gt;They’re temporal mistakes.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- leaky: keeps only rows whose outcome is already known
-- at prediction time
SELECT *
FROM training_data
WHERE prediction_date &amp;gt; outcome_date;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If your query accidentally includes future outcomes,&lt;br&gt;
you’ve created a perfect model.&lt;/p&gt;

&lt;p&gt;And a useless one.&lt;/p&gt;

&lt;p&gt;Leakage is rarely a Python issue.&lt;/p&gt;

&lt;p&gt;It’s usually a SQL design issue.&lt;/p&gt;
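
&lt;p&gt;One defensive habit is a temporal sanity check on the training set before fitting anything. A minimal sketch; the field names here are hypothetical:&lt;/p&gt;

```python
from datetime import date

# Each training row records when its features were computed (feature_asof)
# and the date the prediction is nominally made (prediction_date).
rows = [
    {"prediction_date": date(2024, 1, 1), "feature_asof": date(2023, 12, 31)},
    {"prediction_date": date(2024, 1, 1), "feature_asof": date(2024, 1, 5)},  # leaky
]

def leaky_rows(rows):
    """Return rows whose features were computed after the prediction date."""
    return [r for r in rows if r["feature_asof"] > r["prediction_date"]]

bad = leaky_rows(rows)
print(len(bad))  # 1: one row lets the model peek at the future
```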

&lt;p&gt;&lt;strong&gt;🧠 The System View&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Machine learning is often presented as:&lt;/p&gt;

&lt;p&gt;Data → Model → Prediction&lt;/p&gt;

&lt;p&gt;In reality, it’s:&lt;/p&gt;

&lt;p&gt;Raw Data → SQL Constraints → Engineered Features → Training Dataset → Model&lt;/p&gt;

&lt;p&gt;SQL is the gatekeeper.&lt;/p&gt;

&lt;p&gt;The model only sees what the query allows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💡 Why This Matters (Cost + Architecture)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In AWS environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bad queries increase Athena/Redshift cost&lt;/li&gt;
&lt;li&gt;Poor feature aggregation increases training time&lt;/li&gt;
&lt;li&gt;Overly wide datasets increase memory usage&lt;/li&gt;
&lt;li&gt;Incorrect joins inflate SageMaker compute bills&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SQL decisions scale financially.&lt;/p&gt;

&lt;p&gt;Models amplify whatever SQL defines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🛠 GitHub Companion Plan&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A companion repo, sql-ml-architecture-foundations, is planned to include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;queries.sql (example feature engineering queries)&lt;/li&gt;
&lt;li&gt;A small sample dataset (CSV)&lt;/li&gt;
&lt;li&gt;A README explaining how each query changes model behavior, how SQL affects cost, bias, and drift, and how this ties into ML pipelines (SageMaker, Glue, Feature Store)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The claim here is not simply that&lt;br&gt;
“SQL is important.”&lt;/p&gt;

&lt;p&gt;It’s that:&lt;/p&gt;

&lt;p&gt;SQL is the architectural layer that defines what a model is allowed to believe.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>sql</category>
    </item>
    <item>
      <title>Missing Data Isn’t a Cleanup Problem — It’s a Signal</title>
      <dc:creator>Brittany </dc:creator>
      <pubDate>Tue, 10 Feb 2026 19:56:29 +0000</pubDate>
      <link>https://dev.to/brittany_37606c0775530a57/missing-data-isnt-a-cleanup-problem-its-a-signal-407n</link>
      <guid>https://dev.to/brittany_37606c0775530a57/missing-data-isnt-a-cleanup-problem-its-a-signal-407n</guid>
      <description>&lt;p&gt;Most machine learning courses teach you how to handle missing data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fill it.&lt;br&gt;
Drop it.&lt;br&gt;
Impute it.&lt;br&gt;
Move on.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And for exams, that’s usually enough.&lt;/p&gt;

&lt;p&gt;But production systems tell a different story.&lt;/p&gt;

&lt;p&gt;In the real world, missing data isn’t just something to fix —&lt;br&gt;
it’s often the first signal that something upstream is breaking.&lt;/p&gt;

&lt;p&gt;This is where the gap between passing exams and building durable ML systems begins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Exams Teach About Missing Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In exam scenarios, missing values are treated as a technical inconvenience:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replace with the mean or median&lt;/li&gt;
&lt;li&gt;Forward-fill or backward-fill&lt;/li&gt;
&lt;li&gt;Drop rows with too many nulls&lt;/li&gt;
&lt;li&gt;Use models that tolerate missing values&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These techniques are valid.&lt;/p&gt;

&lt;p&gt;They’re also context-free.&lt;/p&gt;

&lt;p&gt;The exam assumes the data problem already happened —&lt;br&gt;
your job is just to make the model run.&lt;/p&gt;

&lt;p&gt;Production doesn’t care that your model runs.&lt;br&gt;
It cares that it keeps running.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Production Systems Teach Instead&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In production, missing data usually shows up for a reason.&lt;/p&gt;

&lt;p&gt;And that reason matters more than the fix.&lt;/p&gt;

&lt;p&gt;Missing values often mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A pipeline failed silently&lt;/li&gt;
&lt;li&gt;An upstream service timed out&lt;/li&gt;
&lt;li&gt;A schema changed without notice&lt;/li&gt;
&lt;li&gt;A feature stopped being generated&lt;/li&gt;
&lt;li&gt;A data source degraded slowly over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are modeling problems.&lt;br&gt;
They’re system problems.&lt;/p&gt;

&lt;p&gt;If you immediately impute and move on, the model may keep producing outputs —&lt;br&gt;
but now it’s learning from broken assumptions.&lt;/p&gt;

&lt;p&gt;That’s how models degrade quietly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing Data as a Diagnostic Signal&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Missing values are often symptoms, not errors.&lt;/p&gt;

&lt;p&gt;Instead of asking:&lt;/p&gt;

&lt;p&gt;“How do I fill this?”&lt;/p&gt;

&lt;p&gt;Production systems force you to ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why did this feature go missing?&lt;/li&gt;
&lt;li&gt;Is the missingness random or systematic?&lt;/li&gt;
&lt;li&gt;Did this appear suddenly or gradually?&lt;/li&gt;
&lt;li&gt;Does missing data correlate with certain users, times, or regions?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those questions don’t show up on exams.&lt;/p&gt;

&lt;p&gt;They do decide whether a system survives in the real world.&lt;/p&gt;
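
&lt;p&gt;A first diagnostic is simply to break the missingness rate down by segment. A sketch in plain Python (the records and field names are hypothetical):&lt;/p&gt;

```python
# Hypothetical records: income is missing only for mobile users,
# which suggests systematic (not random) missingness.
records = [
    {"channel": "web",    "income": 52000},
    {"channel": "web",    "income": 61000},
    {"channel": "mobile", "income": None},
    {"channel": "mobile", "income": None},
]

def missing_rate_by(records, segment_key, field):
    """Fraction of records with `field` missing, per segment."""
    totals, missing = {}, {}
    for r in records:
        seg = r[segment_key]
        totals[seg] = totals.get(seg, 0) + 1
        if r[field] is None:
            missing[seg] = missing.get(seg, 0) + 1
    return {seg: missing.get(seg, 0) / n for seg, n in totals.items()}

rates = missing_rate_by(records, "channel", "income")
print(rates)  # {'web': 0.0, 'mobile': 1.0}: clearly not random
```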

&lt;p&gt;&lt;strong&gt;Why Simple Methods Sometimes Win&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is why simpler techniques often outperform complex ones in production.&lt;/p&gt;

&lt;p&gt;Not because they’re smarter —&lt;br&gt;
but because they’re more stable when assumptions break.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mean imputation is predictable&lt;/li&gt;
&lt;li&gt;Dropping features is transparent&lt;/li&gt;
&lt;li&gt;Rule-based fallbacks are debuggable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Complex models can hide data issues by adapting too well —&lt;br&gt;
until performance suddenly collapses weeks later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Real Skill Gap&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Passing exams proves you know what to do when data is missing.&lt;/p&gt;

&lt;p&gt;Building durable ML systems requires knowing when missing data is trying to tell you something.&lt;/p&gt;

&lt;p&gt;That’s the gap.&lt;/p&gt;

&lt;p&gt;Exams ask: “What’s the correct technique?”&lt;br&gt;
Production asks: “Why is this happening now?”&lt;/p&gt;

&lt;p&gt;Exams optimize for correctness.&lt;br&gt;
Production optimizes for awareness.&lt;/p&gt;

&lt;p&gt;And awareness is what keeps models alive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Thought&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Missing data isn’t just a preprocessing step.&lt;/p&gt;

&lt;p&gt;It’s feedback.&lt;/p&gt;

&lt;p&gt;If you listen to it early, you fix pipelines.&lt;br&gt;
If you ignore it, you retrain models that are already drifting.&lt;/p&gt;

&lt;p&gt;And that’s where the real difference between learning ML&lt;br&gt;
and operating ML begins.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>featureengineering</category>
      <category>ai</category>
    </item>
    <item>
      <title>Missing Data in Machine Learning: A Practical Step-by-Step Approach</title>
      <dc:creator>Brittany </dc:creator>
      <pubDate>Sat, 07 Feb 2026 19:15:07 +0000</pubDate>
      <link>https://dev.to/brittany_37606c0775530a57/missing-data-in-machine-learning-a-practical-step-by-step-approach-1p85</link>
      <guid>https://dev.to/brittany_37606c0775530a57/missing-data-in-machine-learning-a-practical-step-by-step-approach-1p85</guid>
      <description>&lt;p&gt;Missing data breaks more machine learning models than bad algorithms — not because it’s hard to detect, but because it’s easy to overthink.&lt;/p&gt;

&lt;p&gt;When datasets contain NaNs, sparse features, or incomplete records, the default reaction is often to add complexity.&lt;br&gt;
In practice, stability usually matters more than sophistication.&lt;/p&gt;

&lt;p&gt;Here’s a practical, step-by-step way to think about missing data in real ML systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Assume Missing Data Is Normal&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In real systems, missing data isn’t an edge case.&lt;/p&gt;

&lt;p&gt;It comes from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;partially filled forms&lt;/li&gt;
&lt;li&gt;dropped logs or sensors&lt;/li&gt;
&lt;li&gt;schema changes over time&lt;/li&gt;
&lt;li&gt;merged datasets from different sources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you treat missing values as rare exceptions, you’ll design fragile pipelines.&lt;br&gt;
Instead, assume they’re part of the data distribution.&lt;/p&gt;

&lt;p&gt;Goal: design preprocessing that keeps working as systems evolve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Identify Why the Data Is Missing (Not Just Where)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not all missing data is random.&lt;/p&gt;

&lt;p&gt;Ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did users skip a field?&lt;/li&gt;
&lt;li&gt;Did a service fail?&lt;/li&gt;
&lt;li&gt;Did a logging or schema change occur?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When missingness is tied to behavior or infrastructure, it carries information — but it also introduces risk.&lt;br&gt;
Models trained on one missingness pattern may fail when that pattern changes.&lt;/p&gt;

&lt;p&gt;Goal: avoid baking temporary assumptions into your model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Start With the Simplest Stable Baseline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before reaching for advanced techniques, establish a stable baseline.&lt;/p&gt;

&lt;p&gt;Simple imputation methods (mean or median):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reduce variance&lt;/li&gt;
&lt;li&gt;preserve feature scale&lt;/li&gt;
&lt;li&gt;behave consistently over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They don’t adapt. They don’t infer.&lt;br&gt;
That predictability is exactly what makes them reliable in production.&lt;/p&gt;

&lt;p&gt;Goal: maximize stability before optimizing accuracy.&lt;/p&gt;
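
&lt;p&gt;As a concrete baseline, median imputation fits in a few lines of plain Python (a sketch, not a production implementation):&lt;/p&gt;

```python
def median_impute(values):
    """Replace None with the median of the observed values.

    Deliberately simple and predictable: compute one statistic
    from the non-missing values, then fill every gap with it.
    """
    observed = sorted(v for v in values if v is not None)
    if not observed:
        raise ValueError("no observed values to impute from")
    mid = len(observed) // 2
    if len(observed) % 2:
        median = observed[mid]
    else:
        median = (observed[mid - 1] + observed[mid]) / 2
    return [median if v is None else v for v in values]

print(median_impute([10, None, 30, 20, None]))  # [10, 20, 30, 20, 20]
```

&lt;p&gt;The median of [10, 20, 30] is 20, so both gaps get the same, explainable value.&lt;/p&gt;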

&lt;p&gt;&lt;strong&gt;Step 4: Be Careful With “Smart” Solutions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Advanced imputers, PCA, and neural networks can accidentally learn the pattern of missingness, not the underlying signal.&lt;/p&gt;

&lt;p&gt;Common failure modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;great validation metrics&lt;/li&gt;
&lt;li&gt;poor generalization&lt;/li&gt;
&lt;li&gt;silent performance decay after deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Complexity increases sensitivity to distribution shifts — especially when missing data is involved.&lt;/p&gt;

&lt;p&gt;Goal: avoid solutions that look good during training but fail quietly later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Use PCA and Deep Learning Only When the Pipeline Is Stable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Advanced techniques work best when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;missingness is minimal or well-understood&lt;/li&gt;
&lt;li&gt;feature definitions are consistent&lt;/li&gt;
&lt;li&gt;training data matches production patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PCA is useful for noise reduction — not for “fixing” missing values.&lt;br&gt;
Deep learning handles missing data well only when designed explicitly for it.&lt;/p&gt;

&lt;p&gt;Goal: earn complexity after stability is proven.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Treat Missing Data as System Feedback&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Missing values often signal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;broken pipelines&lt;/li&gt;
&lt;li&gt;misaligned teams&lt;/li&gt;
&lt;li&gt;shifting assumptions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feature stores help by enforcing consistent definitions and freshness.&lt;br&gt;
Monitoring helps detect when missingness patterns change.&lt;/p&gt;

&lt;p&gt;Fixing the system upstream is often more effective than adding intelligence downstream.&lt;/p&gt;

&lt;p&gt;Goal: solve the root cause, not just the symptom.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 7: Optimize for Long-Term Behavior, Not Short-Term Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A slightly less accurate model that behaves predictably will outperform a fragile one over time.&lt;/p&gt;

&lt;p&gt;This is why simple preprocessing approaches persist in production systems:&lt;br&gt;
they survive real-world variability.&lt;/p&gt;

&lt;p&gt;Goal: choose approaches that fail gracefully.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Takeaway&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When handling missing data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;assume it’s normal&lt;/li&gt;
&lt;li&gt;understand why it exists&lt;/li&gt;
&lt;li&gt;start simple&lt;/li&gt;
&lt;li&gt;earn complexity&lt;/li&gt;
&lt;li&gt;prioritize stability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Machine learning systems don’t fail because they’re not smart enough.&lt;br&gt;
They fail because they’re not stable enough.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>dataengineering</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Why Categorical Data Can Quietly Break Your ML Model</title>
      <dc:creator>Brittany </dc:creator>
      <pubDate>Fri, 06 Feb 2026 02:07:48 +0000</pubDate>
      <link>https://dev.to/brittany_37606c0775530a57/why-categorical-data-can-quietly-break-your-ml-model-53na</link>
      <guid>https://dev.to/brittany_37606c0775530a57/why-categorical-data-can-quietly-break-your-ml-model-53na</guid>
      <description>&lt;p&gt;Your model didn’t fail because of the algorithm.&lt;br&gt;
It failed because of how your data was represented.&lt;/p&gt;

&lt;p&gt;One of the easiest ways to break a machine learning model isn’t choosing the wrong algorithm.&lt;/p&gt;

&lt;p&gt;It’s feeding the model categorical data without thinking about how the model actually interprets numbers.&lt;/p&gt;

&lt;p&gt;This problem shows up constantly in real ML pipelines—especially when models perform well during training but behave unpredictably in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Hidden Problem with Categorical Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Machine learning models don’t understand categories.&lt;/p&gt;

&lt;p&gt;They understand numbers.&lt;/p&gt;

&lt;p&gt;When you pass categorical values like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;country&lt;/li&gt;
&lt;li&gt;product type&lt;/li&gt;
&lt;li&gt;customer segment&lt;/li&gt;
&lt;li&gt;status&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;you’re forced to decide how those categories are represented numerically.&lt;/p&gt;

&lt;p&gt;That decision matters more than many people realize.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why “Just Assigning Numbers” Is Dangerous&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A common mistake is encoding categories like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Red → 1&lt;/li&gt;
&lt;li&gt;Blue → 2&lt;/li&gt;
&lt;li&gt;Green → 3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To a human, these are just labels.&lt;/p&gt;

&lt;p&gt;To a model, they look like ordered values.&lt;/p&gt;

&lt;p&gt;The model now assumes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Green &amp;gt; Blue &amp;gt; Red&lt;/li&gt;
&lt;li&gt;The “distance” between categories has meaning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But in most real-world problems, that relationship doesn’t exist.&lt;/p&gt;

&lt;p&gt;This can quietly distort model behavior without throwing errors or warnings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What One-Hot Encoding Actually Fixes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One-hot encoding removes false relationships.&lt;/p&gt;

&lt;p&gt;Instead of a single numeric column, each category becomes its own binary feature:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Red → [1, 0, 0]&lt;/li&gt;
&lt;li&gt;Blue → [0, 1, 0]&lt;/li&gt;
&lt;li&gt;Green → [0, 0, 1]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now the model sees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No ordering&lt;/li&gt;
&lt;li&gt;No implied distance&lt;/li&gt;
&lt;li&gt;Each category as an independent signal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why one-hot encoding is often the default choice in many ML pipelines.&lt;/p&gt;
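
&lt;p&gt;The idea is small enough to write out in plain Python (a sketch; scikit-learn’s OneHotEncoder or pandas’ get_dummies are the usual production choices):&lt;/p&gt;

```python
def one_hot(values):
    """One-hot encode a list of categorical values.

    Categories are sorted so column order is deterministic across runs.
    """
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

cats, rows = one_hot(["Red", "Blue", "Green", "Blue"])
print(cats)  # ['Blue', 'Green', 'Red']
print(rows)  # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0]]
```

&lt;p&gt;Note there is no ordering left to misread: each category is just its own column.&lt;/p&gt;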

&lt;p&gt;&lt;strong&gt;When One-Hot Encoding Helps Most&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One-hot encoding works best when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Categories have no natural order&lt;/li&gt;
&lt;li&gt;Models assume numeric relationships (e.g., linear models)&lt;/li&gt;
&lt;li&gt;You want to avoid injecting unintended bias&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You’ll often see it used with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linear regression&lt;/li&gt;
&lt;li&gt;Logistic regression&lt;/li&gt;
&lt;li&gt;Feature engineering pipelines before training&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When One-Hot Encoding Creates New Problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One-hot encoding isn’t free.&lt;/p&gt;

&lt;p&gt;It introduces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High dimensionality&lt;/li&gt;
&lt;li&gt;Sparse data&lt;/li&gt;
&lt;li&gt;Increased memory and compute cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This becomes an issue when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Categories have high cardinality (thousands of values)&lt;/li&gt;
&lt;li&gt;You’re working with large datasets&lt;/li&gt;
&lt;li&gt;You’re deploying models with tight performance constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, encoding strategy becomes a system design decision, not just preprocessing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters in Real ML Systems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Encoding choices affect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model performance&lt;/li&gt;
&lt;li&gt;Training time&lt;/li&gt;
&lt;li&gt;Inference cost&lt;/li&gt;
&lt;li&gt;Data consistency between training and production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A model may look accurate in experiments and still fail quietly after deployment if encoding isn’t handled consistently.&lt;/p&gt;

&lt;p&gt;This is why many ML failures aren’t algorithm failures.&lt;/p&gt;

&lt;p&gt;They’re data representation failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Bigger Takeaway&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Feature engineering decisions shape how a model understands the world.&lt;/p&gt;

&lt;p&gt;One-hot encoding isn’t just a technical detail—it’s a way of protecting your model from learning relationships that don’t exist.&lt;/p&gt;

&lt;p&gt;If a model behaves strangely, don’t start by changing the algorithm.&lt;/p&gt;

&lt;p&gt;Start by asking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How is this data represented?&lt;/li&gt;
&lt;li&gt;What assumptions does this encoding introduce?&lt;/li&gt;
&lt;li&gt;Is the model learning real patterns—or artificial ones?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most ML issues begin there.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>mlops</category>
    </item>
  </channel>
</rss>
