<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vineet Chauhan</title>
    <description>The latest articles on DEV Community by Vineet Chauhan (@vineet_chauhan_a828338181).</description>
    <link>https://dev.to/vineet_chauhan_a828338181</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3935167%2F28e42c33-ffce-49c7-bd1b-b0c2c436d670.png</url>
      <title>DEV Community: Vineet Chauhan</title>
      <link>https://dev.to/vineet_chauhan_a828338181</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vineet_chauhan_a828338181"/>
    <language>en</language>
    <item>
      <title>Deep Learning Is More Logistic Regression Than You Think</title>
      <dc:creator>Vineet Chauhan</dc:creator>
      <pubDate>Sat, 06 Jun 2026 20:01:20 +0000</pubDate>
      <link>https://dev.to/vineet_chauhan_a828338181/deep-learning-is-more-logistic-regression-than-you-think-4bgj</link>
      <guid>https://dev.to/vineet_chauhan_a828338181/deep-learning-is-more-logistic-regression-than-you-think-4bgj</guid>
      <description>&lt;h2&gt;
  
  
  Why an Algorithm From the 1950s Still Powers Modern AI
&lt;/h2&gt;

&lt;p&gt;When I first learned Machine Learning, I treated Logistic Regression as a beginner algorithm.&lt;/p&gt;

&lt;p&gt;You learn it.&lt;/p&gt;

&lt;p&gt;You build a classifier.&lt;/p&gt;

&lt;p&gt;You get an accuracy score.&lt;/p&gt;

&lt;p&gt;Then you move on.&lt;/p&gt;

&lt;p&gt;At least that's what I thought.&lt;/p&gt;

&lt;p&gt;After Logistic Regression came:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Decision Trees&lt;/li&gt;
&lt;li&gt;Random Forests&lt;/li&gt;
&lt;li&gt;XGBoost&lt;/li&gt;
&lt;li&gt;Neural Networks&lt;/li&gt;
&lt;li&gt;Transformers&lt;/li&gt;
&lt;li&gt;Large Language Models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The journey seemed straightforward.&lt;/p&gt;

&lt;p&gt;Old algorithm → Better algorithm → Even better algorithm.&lt;/p&gt;

&lt;p&gt;But after studying Deep Learning in more depth, I discovered something surprising.&lt;/p&gt;

&lt;p&gt;The algorithm I thought I had left behind was everywhere.&lt;/p&gt;

&lt;p&gt;Not Decision Trees.&lt;/p&gt;

&lt;p&gt;Not Random Forests.&lt;/p&gt;

&lt;p&gt;Not SVMs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Logistic Regression.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And the deeper I looked, the more I realized that modern Deep Learning did not replace Logistic Regression.&lt;/p&gt;

&lt;p&gt;It scaled its ideas.&lt;/p&gt;




&lt;h2&gt;
  
  
  The First Time I Noticed It
&lt;/h2&gt;

&lt;p&gt;I was learning about neural networks.&lt;/p&gt;

&lt;p&gt;The instructor drew a neuron:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;w1&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;x1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;w2&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;x2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I stared at the equation for a few seconds.&lt;/p&gt;

&lt;p&gt;Then it hit me.&lt;/p&gt;

&lt;p&gt;That is literally Logistic Regression.&lt;/p&gt;

&lt;p&gt;The exact same weighted sum.&lt;/p&gt;

&lt;p&gt;The exact same sigmoid activation.&lt;/p&gt;

&lt;p&gt;The exact same probability output.&lt;/p&gt;

&lt;p&gt;The exact same optimization process.&lt;/p&gt;

&lt;p&gt;The only difference?&lt;/p&gt;

&lt;p&gt;A neural network has many of them.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Exactly Does Logistic Regression Do?
&lt;/h2&gt;

&lt;p&gt;At its core, Logistic Regression performs two operations.&lt;/p&gt;

&lt;p&gt;Step 1:&lt;/p&gt;

&lt;p&gt;Take a weighted sum.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;w1&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;x1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;w2&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;x2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;w3&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;x3&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 2:&lt;/p&gt;

&lt;p&gt;Convert it into a probability.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sigmoid(z) = 1 / (1 + e^(-z))&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;The sigmoid function transforms any number into a value between 0 and 1.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input = -10 → 0.00005

Input = 0 → 0.5

Input = 10 → 0.99995
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
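&lt;p&gt;The two steps fit in a few lines of plain Python. This is a minimal sketch using only the standard library; the weights, bias, and feature values are made up for illustration and do not come from a trained model.&lt;/p&gt;

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Step 1: weighted sum. Step 2: sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical weights and inputs, chosen only for illustration.
weights = [0.8, -0.4, 0.2]
bias = -0.1
features = [1.0, 2.0, 3.0]

for z in (-10, 0, 10):
    print(f"sigmoid({z}) = {sigmoid(z):.5f}")
print("P(class = 1) =", round(predict_proba(weights, bias, features), 3))
```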



&lt;p&gt;This probability becomes the final prediction.&lt;/p&gt;

&lt;p&gt;Simple.&lt;/p&gt;

&lt;p&gt;Elegant.&lt;/p&gt;

&lt;p&gt;Effective.&lt;/p&gt;




&lt;h2&gt;
  
  
  Now Look At A Neural Network
&lt;/h2&gt;

&lt;p&gt;A neuron performs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;w1&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;x1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;w2&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;x2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;activation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In early neural networks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;activation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sigmoid&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;which means:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Neuron
=
Logistic Regression Unit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The moment I realized this, neural networks became much easier to understand.&lt;/p&gt;

&lt;p&gt;Instead of imagining some magical AI machine, I started seeing thousands of Logistic Regression models stacked together.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Not Decision Trees?
&lt;/h2&gt;

&lt;p&gt;This question bothered me for a long time.&lt;/p&gt;

&lt;p&gt;Why didn't Deep Learning evolve from Decision Trees?&lt;/p&gt;

&lt;p&gt;Why not Random Forests?&lt;/p&gt;

&lt;p&gt;Why specifically Logistic Regression?&lt;/p&gt;

&lt;p&gt;The answer lies in mathematics.&lt;/p&gt;




&lt;h2&gt;
  
  
  Reason 1: Logistic Regression Is Differentiable
&lt;/h2&gt;

&lt;p&gt;Decision Trees make hard decisions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Age &amp;gt; 30 ?

Yes → Left

No → Right
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A tiny change in age can suddenly change the entire path.&lt;/p&gt;

&lt;p&gt;This creates discontinuities.&lt;/p&gt;

&lt;p&gt;Gradient Descent has nothing to follow: the output is flat on either side of the threshold and jumps at it.&lt;/p&gt;

&lt;p&gt;Logistic Regression is different.&lt;/p&gt;

&lt;p&gt;Its sigmoid curve is smooth.&lt;/p&gt;

&lt;p&gt;Every tiny change in the input produces a tiny change in the output.&lt;/p&gt;

&lt;p&gt;This makes gradients possible.&lt;/p&gt;
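&lt;p&gt;One way to see the difference is to estimate the slope of each function numerically. The sketch below is only an illustration: &lt;code&gt;hard_split&lt;/code&gt; and &lt;code&gt;soft_split&lt;/code&gt; are hypothetical stand-ins for a tree split and a sigmoid unit, and the threshold of 30 is taken from the age example above.&lt;/p&gt;

```python
import math

def hard_split(age, threshold=30.0):
    """Decision-tree style rule: the output jumps from 0 to 1 at the threshold."""
    return 1.0 if age > threshold else 0.0

def soft_split(age, threshold=30.0):
    """Sigmoid version of the same rule: the output changes smoothly."""
    return 1.0 / (1.0 + math.exp(-(age - threshold)))

def slope(f, x, eps=1e-3):
    """Central finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

for age in (25.0, 29.0, 35.0):
    print(age, slope(hard_split, age), round(slope(soft_split, age), 4))
```

&lt;p&gt;Away from the threshold the hard rule has a slope of exactly zero, so it gives an optimizer no hint about which direction to move. The sigmoid has a small but non-zero slope everywhere.&lt;/p&gt;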

&lt;p&gt;And gradients are the fuel of Deep Learning.&lt;/p&gt;

&lt;p&gt;Without gradients:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;No Backpropagation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without backpropagation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;No Neural Networks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without neural networks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;No ChatGPT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Reason 2: Logistic Regression Produces Probabilities
&lt;/h2&gt;

&lt;p&gt;A Decision Tree says:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Class A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Class B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Logistic Regression says:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;P(Class A) = 0.92
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Probability matters.&lt;/p&gt;

&lt;p&gt;Modern AI relies heavily on probabilities.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;p&gt;Spam Detection&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;98% Spam
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Medical Diagnosis&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;73% Cancer Risk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Language Models&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;P(next word = "cat")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Transformers are fundamentally probability machines.&lt;/p&gt;

&lt;p&gt;And Logistic Regression introduced that philosophy long ago.&lt;/p&gt;
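&lt;p&gt;The link to next-word probabilities can be made concrete. Language models turn raw scores into probabilities with softmax, and with only two classes softmax reduces to the sigmoid. A minimal sketch, where the three-word vocabulary and the scores are invented for illustration:&lt;/p&gt;

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next words.
vocab = ["cat", "dog", "car"]
probs = softmax([2.0, 1.0, 0.1])
print(dict(zip(vocab, (round(p, 3) for p in probs))))

# With two classes and the second score fixed at 0, softmax equals sigmoid.
z = 1.7
print(softmax([z, 0.0])[0], sigmoid(z))
```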




&lt;h2&gt;
  
  
  Reason 3: Cross Entropy Came From Logistic Regression
&lt;/h2&gt;

&lt;p&gt;One of the most important loss functions in Deep Learning is:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;L = -[y * log(p) + (1 - y) * log(1 - p)]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Almost every deep learning engineer uses it.&lt;/p&gt;

&lt;p&gt;Image Classification.&lt;/p&gt;

&lt;p&gt;Medical AI.&lt;/p&gt;

&lt;p&gt;Fraud Detection.&lt;/p&gt;

&lt;p&gt;NLP.&lt;/p&gt;

&lt;p&gt;Large Language Models.&lt;/p&gt;

&lt;p&gt;The interesting part?&lt;/p&gt;

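&lt;p&gt;This is the same loss function used to train Logistic Regression. The multi-class version used for image classification and language models is its direct generalization.&lt;/p&gt;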

&lt;p&gt;The entire deep learning world still depends on it.&lt;/p&gt;
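&lt;p&gt;The loss is easy to compute by hand. The sketch below averages it over a few examples; the labels and predicted probabilities are made-up numbers, and the small &lt;code&gt;eps&lt;/code&gt; clamp is an assumption added to keep the logarithm finite.&lt;/p&gt;

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average of -[y*log(p) + (1 - y)*log(1 - p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Made-up labels and predictions for illustration.
labels = [1, 0, 1, 0]
confident_and_right = [0.9, 0.1, 0.8, 0.2]
confident_and_wrong = [0.1, 0.9, 0.2, 0.8]

print("right:", round(binary_cross_entropy(labels, confident_and_right), 3))
print("wrong:", round(binary_cross_entropy(labels, confident_and_wrong), 3))
```

&lt;p&gt;Confident wrong answers are punished far more heavily than confident right answers are rewarded, which is what pushes the predicted probabilities toward the true labels.&lt;/p&gt;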




&lt;h2&gt;
  
  
  Reason 4: Logistic Regression Is A Single Neuron
&lt;/h2&gt;

&lt;p&gt;This was the biggest realization for me.&lt;/p&gt;

&lt;p&gt;A Logistic Regression model can be represented as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input
 ↓
Weighted Sum
 ↓
Sigmoid
 ↓
Output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now look at a neural network:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input
 ↓
100 Neurons
 ↓
100 Neurons
 ↓
100 Neurons
 ↓
Output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each neuron is doing a very similar operation.&lt;/p&gt;

&lt;p&gt;The network becomes powerful because thousands of these simple units collaborate.&lt;/p&gt;

&lt;p&gt;Deep Learning is not complexity replacing simplicity.&lt;/p&gt;

&lt;p&gt;It is simplicity repeated at scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  Let's Verify This With Code
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Logistic Regression
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

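&lt;span class="c1"&gt;# predict() returns hard class labels&lt;/span&gt;
&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# predict_proba() returns the sigmoid output, P(class = 1), in column 1&lt;/span&gt;
&lt;span class="n"&gt;probabilities&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict_proba&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;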
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now a neural network.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Sigmoid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look carefully.&lt;/p&gt;

&lt;p&gt;Both perform:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Weighted Sum&lt;/li&gt;
&lt;li&gt;Sigmoid Transformation&lt;/li&gt;
&lt;li&gt;Probability Prediction&lt;/li&gt;
&lt;/ol&gt;

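&lt;p&gt;Mathematically, the model is the same. The differences are in the training details: scikit-learn applies L2 regularization by default and uses its own solver, while PyTorch leaves the loss and the optimizer to you.&lt;/p&gt;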

&lt;p&gt;The PyTorch version is essentially Logistic Regression implemented as a neural network.&lt;/p&gt;
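&lt;p&gt;To make the equivalence concrete without either library, here is a rough sketch of a single sigmoid neuron trained with plain gradient descent on cross-entropy, which is Logistic Regression without regularization. The toy data, learning rate, and epoch count are arbitrary choices for illustration.&lt;/p&gt;

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(X, y, lr=0.5, epochs=2000):
    """Fit one sigmoid neuron with batch gradient descent on cross-entropy."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the cross-entropy loss w.r.t. z
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Made-up toy data: the label is 1 when both features are large.
X = [[0.1, 0.2], [0.4, 0.3], [0.3, 0.1], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
y = [0, 0, 0, 1, 1, 1]

w, b = train_neuron(X, y)
probs = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]
print([round(p, 2) for p in probs])
```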




&lt;h2&gt;
  
  
  The Real Difference
&lt;/h2&gt;

&lt;p&gt;If they are so similar, why use Deep Learning?&lt;/p&gt;

&lt;p&gt;Because Logistic Regression can only learn a linear decision boundary over the features it is given.&lt;/p&gt;

&lt;p&gt;Imagine separating red and blue dots.&lt;/p&gt;

&lt;p&gt;Logistic Regression creates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;One Straight Line
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deep Learning creates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Curves
Shapes
Complex Regions
Non-Linear Patterns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By stacking layers, the network gradually transforms simple linear boundaries into highly complex decision surfaces.&lt;/p&gt;

&lt;p&gt;That is the true power of Deep Learning.&lt;/p&gt;

&lt;p&gt;Not a different idea.&lt;/p&gt;

&lt;p&gt;A larger version of the same idea.&lt;/p&gt;
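&lt;p&gt;XOR is the classic case that one straight line cannot separate, yet three sigmoid units arranged in two layers can. The sketch below uses hand-picked weights rather than trained ones, purely to show that stacking is enough.&lt;/p&gt;

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(weights, bias, inputs):
    """One Logistic Regression unit: weighted sum followed by sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def tiny_network(x1, x2):
    """Two hidden sigmoid neurons feeding one output neuron."""
    h_or = neuron([20.0, 20.0], -10.0, [x1, x2])     # fires if either input is 1
    h_nand = neuron([-20.0, -20.0], 30.0, [x1, x2])  # fires unless both inputs are 1
    # The output fires only when both hidden units fire, which is XOR.
    return neuron([20.0, 20.0], -30.0, [h_or, h_nand])

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), round(tiny_network(x1, x2), 3))
```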




&lt;h2&gt;
  
  
  The Most Surprising Place I Found Logistic Regression
&lt;/h2&gt;

&lt;p&gt;LSTMs.&lt;/p&gt;

&lt;p&gt;The architecture behind many sequence models.&lt;/p&gt;

&lt;p&gt;Inside every LSTM cell are gates.&lt;/p&gt;

&lt;p&gt;Forget Gate.&lt;/p&gt;

&lt;p&gt;Input Gate.&lt;/p&gt;

&lt;p&gt;Output Gate.&lt;/p&gt;

&lt;p&gt;Guess what activation function they use?&lt;/p&gt;

&lt;p&gt;Sigmoid.&lt;/p&gt;

&lt;p&gt;Every gate outputs a value between 0 and 1.&lt;/p&gt;

&lt;p&gt;Every gate decides:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Keep Information?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Forget Information?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;using Logistic Regression principles.&lt;/p&gt;

&lt;p&gt;Even modern AI systems still carry its DNA.&lt;/p&gt;
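&lt;p&gt;A forget gate can be sketched as a single sigmoid unit whose output scales the old cell state. This is a simplified scalar illustration with invented weights; a real LSTM uses vectors and learned matrices, and also adds new information through the input gate, which is left out here.&lt;/p&gt;

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forget_gate(h_prev, x, w_h, w_x, b):
    """A sigmoid unit over the previous hidden state and the current input."""
    return sigmoid(w_h * h_prev + w_x * x + b)

# Hypothetical scalar values, chosen only for illustration.
c_prev = 2.0  # old cell state
f_keep = forget_gate(h_prev=0.5, x=1.0, w_h=2.0, w_x=3.0, b=1.0)
f_drop = forget_gate(h_prev=0.5, x=1.0, w_h=-2.0, w_x=-3.0, b=-1.0)

# A gate value near 1 keeps the old state; a value near 0 erases it.
print("kept:", round(f_keep * c_prev, 3))
print("dropped:", round(f_drop * c_prev, 3))
```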




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;When I first learned Logistic Regression, I thought it was something to finish and forget.&lt;/p&gt;

&lt;p&gt;Now I see it differently.&lt;/p&gt;

&lt;p&gt;I see it as the first neural network.&lt;/p&gt;

&lt;p&gt;I see it as the origin of probability-based learning.&lt;/p&gt;

&lt;p&gt;I see it as the mathematical foundation behind cross entropy, gradient descent, and backpropagation.&lt;/p&gt;

&lt;p&gt;The next time someone says Logistic Regression is an old algorithm, remember:&lt;/p&gt;

&lt;p&gt;Deep Learning did not replace Logistic Regression.&lt;/p&gt;

&lt;p&gt;Deep Learning scaled it.&lt;/p&gt;

&lt;p&gt;And some of the most advanced AI systems ever built still rely on ideas introduced by Logistic Regression decades ago.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Better Data Beats Better Algorithms: Before Changing the Model, Change the Data</title>
      <dc:creator>Vineet Chauhan</dc:creator>
      <pubDate>Sat, 06 Jun 2026 19:47:54 +0000</pubDate>
      <link>https://dev.to/vineet_chauhan_a828338181/better-data-beats-better-algorithms-before-changing-the-model-change-the-data-107k</link>
      <guid>https://dev.to/vineet_chauhan_a828338181/better-data-beats-better-algorithms-before-changing-the-model-change-the-data-107k</guid>
      <description>&lt;p&gt;&lt;em&gt;How Feature Engineering Taught Me That Better Data Often Beats Better Algorithms&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When I first started learning Machine Learning, I believed what many beginners believe:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If my model is not performing well, I need a better algorithm.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So I kept switching models.&lt;/p&gt;

&lt;p&gt;I moved from Logistic Regression to Decision Trees, then Random Forest, and later even started reading about XGBoost and Neural Networks.&lt;/p&gt;

&lt;p&gt;The results improved slightly, but never dramatically.&lt;/p&gt;

&lt;p&gt;What surprised me was that the biggest improvement didn't come from changing the algorithm.&lt;/p&gt;

&lt;p&gt;It came from changing the data.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;I was working on a dataset containing missing values, outliers, and categorical variables.&lt;/p&gt;

&lt;p&gt;Like many beginners, my first instinct was simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model trained successfully.&lt;/p&gt;

&lt;p&gt;The accuracy looked acceptable.&lt;/p&gt;

&lt;p&gt;But something felt wrong.&lt;/p&gt;

&lt;p&gt;The data itself was messy.&lt;/p&gt;

&lt;p&gt;Some columns contained missing values.&lt;/p&gt;

&lt;p&gt;Some numerical features had extreme outliers.&lt;/p&gt;

&lt;p&gt;Several categorical columns were represented as text.&lt;/p&gt;

&lt;p&gt;Yet I expected the model to magically learn everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  My First Experiment
&lt;/h2&gt;

&lt;p&gt;I trained a Logistic Regression model on the raw dataset.&lt;/p&gt;

&lt;p&gt;Results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Accuracy : 72%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not terrible.&lt;/p&gt;

&lt;p&gt;Not impressive either.&lt;/p&gt;

&lt;p&gt;Instead of changing the model, I decided to investigate the data.&lt;/p&gt;

&lt;p&gt;This turned out to be the most important decision of the entire project.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Handling Missing Values
&lt;/h2&gt;

&lt;p&gt;The dataset contained several missing values.&lt;/p&gt;

&lt;p&gt;At first I considered simply deleting rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem?&lt;/p&gt;

&lt;p&gt;I lost a significant portion of the data.&lt;/p&gt;

&lt;p&gt;So I experimented with multiple approaches:&lt;/p&gt;

&lt;h3&gt;
  
  
  Mean Imputation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.impute&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SimpleImputer&lt;/span&gt;

&lt;span class="n"&gt;imputer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimpleImputer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mean&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;imputer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Median Imputation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;imputer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimpleImputer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;median&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  KNN Imputation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.impute&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KNNImputer&lt;/span&gt;

&lt;span class="n"&gt;imputer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KNNImputer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_neighbors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;imputer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;KNN preserved relationships between records much better than simple averaging.&lt;/p&gt;

&lt;p&gt;This alone improved performance.&lt;/p&gt;
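&lt;p&gt;The choice of strategy matters most when a column has extreme values. Here is a small standard-library sketch of mean versus median imputation; the salary column is made up, and the &lt;code&gt;impute&lt;/code&gt; helper is a hypothetical stand-in for what &lt;code&gt;SimpleImputer&lt;/code&gt; does on one column.&lt;/p&gt;

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

# Made-up salary column with one extreme value and two gaps.
salary = [40, 42, None, 45, 41, None, 400]

print(impute(salary, "mean"))
print(impute(salary, "median"))
```

&lt;p&gt;The single extreme value drags the mean far above the typical salary, while the median stays close to it.&lt;/p&gt;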




&lt;h2&gt;
  
  
  Step 2: Fighting Outliers
&lt;/h2&gt;

&lt;p&gt;I then visualized the numerical columns.&lt;/p&gt;

&lt;p&gt;The boxplots looked terrible.&lt;/p&gt;

&lt;p&gt;A few extreme values were stretching entire distributions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
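&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;

&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;experience&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;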
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model was spending too much effort trying to fit a handful of unusual observations.&lt;/p&gt;

&lt;p&gt;I used IQR-based treatment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Q1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;experience&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Q3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;experience&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;IQR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Q3&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Q1&lt;/span&gt;

&lt;span class="n"&gt;lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Q1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;IQR&lt;/span&gt;
&lt;span class="n"&gt;upper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Q3&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;IQR&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;experience&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;experience&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After removing outliers, the data distribution became much cleaner.&lt;/p&gt;

&lt;p&gt;More importantly, the model began learning actual patterns instead of noise.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Encoding Categorical Features
&lt;/h2&gt;

&lt;p&gt;Most Machine Learning algorithms cannot work with raw text.&lt;/p&gt;

&lt;p&gt;They expect numbers.&lt;/p&gt;

&lt;p&gt;So columns like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Male
Female

Private
Public

Graduate
Masters
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;needed transformation.&lt;/p&gt;

&lt;p&gt;I applied One-Hot Encoding.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_dummies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gender&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and Ordinal Encoding where order mattered.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;education_level

High School → 0
Graduate    → 1
Masters     → 2
PhD         → 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This converted human-readable categories into machine-readable information.&lt;/p&gt;
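&lt;p&gt;Ordinal encoding amounts to a lookup table that respects the order of the categories. A minimal sketch, assuming the education levels listed above run from lowest to highest; the names &lt;code&gt;EDUCATION_ORDER&lt;/code&gt; and &lt;code&gt;encode_education&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
# Assumed ordering, lowest to highest.
EDUCATION_ORDER = ["High School", "Graduate", "Masters", "PhD"]
EDUCATION_RANK = {level: rank for rank, level in enumerate(EDUCATION_ORDER)}

def encode_education(levels):
    """Map each education level to its integer rank; unknown values become None."""
    return [EDUCATION_RANK.get(level) for level in levels]

print(encode_education(["Masters", "High School", "PhD", "Graduate"]))
```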




&lt;h2&gt;
  
  
  Step 4: Feature Scaling
&lt;/h2&gt;

&lt;p&gt;Some columns ranged between:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0 – 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;while others ranged between:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0 – 100000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



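&lt;p&gt;Distance-based algorithms become biased toward the larger values, and gradient-based models such as Logistic Regression tend to converge more slowly when feature scales differ this much.&lt;/p&gt;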

&lt;p&gt;I applied MinMax Scaling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MinMaxScaler&lt;/span&gt;

&lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MinMaxScaler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;X_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now every feature was mapped to the same 0 to 1 range (based on the training data), so no column dominated simply because of its units.&lt;/p&gt;
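&lt;p&gt;For intuition, min-max scaling computes (x - min) / (max - min) per column using statistics from the training data. The sketch below uses a hypothetical one-column array and checks the hand-written formula against MinMaxScaler:&lt;/p&gt;

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training column on a large scale (values are made up).
X_train = np.array([[0.0], [25000.0], [100000.0]])

# Min-max scaling by hand: (x - min) / (max - min), training statistics only.
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
manual = (X_train - col_min) / (col_max - col_min)

# The same transformation from scikit-learn.
scaled = MinMaxScaler().fit_transform(X_train)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

&lt;p&gt;Because the minimum and maximum come from the training split, test values outside that range can land slightly below 0 or above 1, which is expected.&lt;/p&gt;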




&lt;h2&gt;
  
  
  What Happened Next?
&lt;/h2&gt;

&lt;p&gt;I trained the exact same Logistic Regression model again.&lt;/p&gt;

&lt;p&gt;Nothing changed except the data.&lt;/p&gt;

&lt;p&gt;Results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before Feature Engineering : 72%

After Feature Engineering  : 86%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A gain of 14 percentage points.&lt;/p&gt;

&lt;p&gt;Without changing the algorithm.&lt;/p&gt;

&lt;p&gt;Without using deep learning.&lt;/p&gt;

&lt;p&gt;Without adding complexity.&lt;/p&gt;

&lt;p&gt;Just by improving the data.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Most Important Lesson
&lt;/h2&gt;

&lt;p&gt;This project changed the way I think about Machine Learning.&lt;/p&gt;

&lt;p&gt;Earlier I believed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Better Algorithm
       ↓
Better Results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now I believe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Better Data
       ↓
Better Features
       ↓
Better Results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most real-world machine learning problems are not algorithm problems.&lt;/p&gt;

&lt;p&gt;They are data problems.&lt;/p&gt;

&lt;p&gt;A powerful model trained on poor-quality data will still struggle.&lt;/p&gt;

&lt;p&gt;A simple model trained on clean, meaningful data can often outperform much more complex alternatives.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenges I Faced
&lt;/h2&gt;

&lt;p&gt;The hardest part was not training the model.&lt;/p&gt;

&lt;p&gt;The hardest part was preparing the data.&lt;/p&gt;

&lt;p&gt;Some difficulties included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Losing rows during Complete Case Analysis&lt;/li&gt;
&lt;li&gt;Choosing between Mean, Median, and KNN Imputation&lt;/li&gt;
&lt;li&gt;Combining transformed datasets&lt;/li&gt;
&lt;li&gt;Handling dimensionality after One-Hot Encoding&lt;/li&gt;
&lt;li&gt;Identifying genuine outliers versus valuable rare cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These challenges taught me more than model training ever did.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Feature Engineering is not the most glamorous part of Machine Learning.&lt;/p&gt;

&lt;p&gt;Nobody posts screenshots of missing value treatment on social media.&lt;/p&gt;

&lt;p&gt;Nobody celebrates scaling features.&lt;/p&gt;

&lt;p&gt;Yet this is where much of the real improvement happens.&lt;/p&gt;

&lt;p&gt;After this project, I stopped asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Which model should I use?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and started asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What is my data trying to tell me?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That single change in mindset improved my machine learning skills more than learning any new algorithm.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Feature Engineering is Not Just “Cleaning Data”: What I Learned While Building a Real ML Pipeline</title>
      <dc:creator>Vineet Chauhan</dc:creator>
      <pubDate>Thu, 28 May 2026 10:43:45 +0000</pubDate>
      <link>https://dev.to/vineet_chauhan_a828338181/feature-engineering-is-not-just-cleaning-data-what-i-learned-while-building-a-real-ml-pipeline-4ng3</link>
      <guid>https://dev.to/vineet_chauhan_a828338181/feature-engineering-is-not-just-cleaning-data-what-i-learned-while-building-a-real-ml-pipeline-4ng3</guid>
      <description>&lt;p&gt;Most machine learning tutorials make preprocessing look straightforward.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handle missing values.&lt;/li&gt;
&lt;li&gt;Encode categorical features.&lt;/li&gt;
&lt;li&gt;Train the model.&lt;/li&gt;
&lt;li&gt;Get accuracy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But while working on a real classification dataset, I realized feature engineering is far less about applying textbook techniques and far more about making careful decisions under uncertainty.&lt;/p&gt;

&lt;p&gt;This project completely changed how I think about preprocessing.&lt;/p&gt;

&lt;p&gt;Instead of writing another “complete guide to feature engineering,” I wanted to document the actual engineering problems I faced while building a preprocessing pipeline — including debugging mistakes, failed assumptions, distribution shifts, encoding challenges, and how preprocessing itself changed model behaviour.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Dataset Looked Simple at First&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Initially, the dataset looked manageable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Numerical features&lt;/li&gt;
&lt;li&gt;Categorical features&lt;/li&gt;
&lt;li&gt;Missing values&lt;/li&gt;
&lt;li&gt;Binary target variable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing seemed unusual.&lt;/p&gt;

&lt;p&gt;But the moment preprocessing started, the real complexity appeared.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The First Problem: Missing Values Were Uneven Everywhere&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the first things I checked was the percentage of missing values across columns.&lt;/p&gt;

&lt;p&gt;The missing rates varied widely across columns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;some had less than 1% missing values&lt;/li&gt;
&lt;li&gt;some had 5–10%&lt;/li&gt;
&lt;li&gt;others had more than 30%&lt;/li&gt;
&lt;/ul&gt;
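&lt;p&gt;A quick way to get these percentages is sketched below, assuming a pandas dataframe. The toy frame and its column names are placeholders, not the real dataset:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are placeholders.
df = pd.DataFrame({
    "training_hours": [40.0, 12.0, np.nan, 7.0],
    "gender": ["Male", None, None, "Female"],
    "experience": [3, 5, 10, 1],
})

# isna() marks missing cells; the column-wise mean is the missing fraction.
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

&lt;p&gt;Sorting the result makes it easy to group columns into low, medium, and high missingness before choosing a strategy for each group.&lt;/p&gt;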

&lt;p&gt;This immediately raised an important question:&lt;/p&gt;

&lt;p&gt;Should every missing value be handled using the same strategy?&lt;/p&gt;

&lt;p&gt;The answer quickly became no.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghsj30fvxbf98938g4h5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghsj30fvxbf98938g4h5.png" alt=" " width="350" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My Initial Mistake: Applying One Strategy Everywhere&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At first, I tried treating all missing values similarly.&lt;/p&gt;

&lt;p&gt;That approach failed quickly because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete Case Analysis removed too many rows&lt;/li&gt;
&lt;li&gt;KNN Imputation behaved poorly on categorical-heavy features&lt;/li&gt;
&lt;li&gt;Encoded categorical values introduced unrealistic numeric relationships&lt;/li&gt;
&lt;li&gt;Feature distributions started changing unexpectedly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was the first moment I realized:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Feature engineering is not a fixed recipe.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Different features require different preprocessing decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using Complete Case Analysis (CCA)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For columns with less than 5% missing values, I used Complete Case Analysis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xkue5c1zga1p9ttq9s5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xkue5c1zga1p9ttq9s5.png" alt=" " width="380" height="99"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At first, this seemed harmless.&lt;/p&gt;

&lt;p&gt;But then I decided to compare feature distributions before and after row deletion.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbsre4jjde2kzofr46jl1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbsre4jjde2kzofr46jl1.png" alt=" " width="800" height="579"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This turned out to be one of the most important observations in the project.&lt;/p&gt;

&lt;p&gt;Even small row deletions slightly changed feature density and distributions.&lt;/p&gt;

&lt;p&gt;That was the moment I understood:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Missing value handling is not only about removing NaNs — it can also reshape the dataset itself.&lt;/em&gt;&lt;/p&gt;
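&lt;p&gt;One way to check for this effect is to compare summary statistics of a fully observed column before and after deleting rows. The sketch below uses hypothetical data that is deliberately exaggerated: the missing values sit only in rows with high experience, so the shift is easy to see.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical data: "experience" is fully observed, "training_hours" has gaps.
df = pd.DataFrame({
    "experience": [1, 2, 3, 4, 20, 25],
    "training_hours": [10.0, 12.0, 11.0, 9.0, np.nan, np.nan],
})

# Complete Case Analysis: drop rows where the sparse column is missing.
df_cca = df.dropna(subset=["training_hours"])

# Compare summary statistics of the other column before and after deletion.
comparison = pd.DataFrame({
    "before": df["experience"].describe(),
    "after": df_cca["experience"].describe(),
})
print(comparison)
```

&lt;p&gt;When values are missing completely at random the two columns of this comparison stay close. A large gap is a hint that row deletion is removing a particular kind of record.&lt;/p&gt;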

&lt;p&gt;&lt;strong&gt;A Small Pandas Mistake That Broke My Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One debugging issue confused me for quite a while.&lt;/p&gt;

&lt;p&gt;Initially, I wrote:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;df[cols].dropna()&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This returned a new dataframe containing only the selected columns, so every other column was missing from the result.&lt;/p&gt;

&lt;p&gt;The correct approach was:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;df.dropna(subset=cols)&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The difference was tiny syntactically but huge logically.&lt;/p&gt;
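&lt;p&gt;The two calls can be compared side by side on a small hypothetical frame (the column names here are placeholders):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one column to clean and one column to keep.
df = pd.DataFrame({
    "experience": [1.0, np.nan, 3.0],
    "gender": ["Male", "Female", "Other"],
})
cols = ["experience"]

# Selecting first returns a frame that contains only the columns in cols.
only_selected = df[cols].dropna()

# subset= checks cols for NaN but keeps every column of the surviving rows.
rows_dropped = df.dropna(subset=cols)

print(list(only_selected.columns))
print(list(rows_dropped.columns))
```

&lt;p&gt;Both calls drop the same row, but only the second keeps the rest of the dataframe intact.&lt;/p&gt;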

&lt;p&gt;This taught me something surprisingly important:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Many machine learning problems are not model problems.&lt;br&gt;
They are dataframe manipulation problems.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why KNN Imputer Became Complicated&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Initially, I planned to use KNN Imputer for all remaining missing values.&lt;/p&gt;

&lt;p&gt;But another issue appeared immediately.&lt;/p&gt;

&lt;p&gt;KNN relies on distance calculations.&lt;/p&gt;

&lt;p&gt;Distance works naturally for numerical data, but categorical columns require encoding first.&lt;/p&gt;

&lt;p&gt;That introduced several complications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Label encoding created artificial numeric relationships&lt;/li&gt;
&lt;li&gt;One Hot Encoding exploded feature dimensionality&lt;/li&gt;
&lt;li&gt;NaN values were accidentally converted into strings during preprocessing&lt;/li&gt;
&lt;li&gt;Encoded categories distorted neighbor similarity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This made me realize:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;KNN Imputer works much better for numerical features than for heavily categorical datasets.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Eventually, I switched to a hybrid preprocessing strategy instead of forcing one solution everywhere.&lt;/p&gt;
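&lt;p&gt;A minimal sketch of restricting KNN imputation to numeric columns, assuming scikit-learn's KNNImputer and a hypothetical mixed-type frame, could look like this:&lt;/p&gt;

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical frame mixing numeric and categorical columns.
df = pd.DataFrame({
    "experience": [1.0, 2.0, np.nan, 10.0],
    "training_hours": [10.0, 12.0, 11.0, 90.0],
    "gender": ["Male", "Female", "Male", "Female"],
})

# Restrict KNN imputation to numeric columns, where distances are meaningful.
numeric_cols = df.select_dtypes(include="number").columns
imputer = KNNImputer(n_neighbors=2)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

print(df)
```

&lt;p&gt;The missing experience value is filled with the average of its two nearest rows, measured on training_hours. The categorical column is left untouched for a separate strategy.&lt;/p&gt;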

&lt;p&gt;&lt;strong&gt;Final Missing Value Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I ended up using:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Feature type → Strategy&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low-missing numerical → Complete Case Analysis&lt;/li&gt;
&lt;li&gt;High-missing numerical → Median/KNN Imputation&lt;/li&gt;
&lt;li&gt;High-missing categorical → Most Frequent / “Missing” category&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This hybrid approach worked far better than blindly applying one technique globally.&lt;/p&gt;
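&lt;p&gt;The imputation half of such a strategy can be sketched with plain pandas. The frame, column names, and category values below are hypothetical, and in a real train/test setup the median would be computed on the training split only:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical frame; names and values are placeholders.
df = pd.DataFrame({
    "training_hours": [10.0, np.nan, 30.0, 1000.0],
    "company_type": ["Pvt Ltd", None, "Pvt Ltd", "NGO"],
})

# Median is robust to the outlier (1000.0) that would drag the mean upward.
median_hours = df["training_hours"].median()
df["training_hours"] = df["training_hours"].fillna(median_hours)

# Keep missingness visible as its own category instead of guessing a value.
df["company_type"] = df["company_type"].fillna("Missing")

print(df)
```

&lt;p&gt;Using an explicit “Missing” category preserves the fact that the value was absent, which can itself carry signal for the model.&lt;/p&gt;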

&lt;p&gt;&lt;strong&gt;Encoding Was More Important Than I Expected&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Encoding looked simple in theory.&lt;/p&gt;

&lt;p&gt;But in practice, deciding between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One Hot Encoding&lt;/li&gt;
&lt;li&gt;Ordinal Encoding&lt;/li&gt;
&lt;li&gt;Label Encoding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;actually mattered a lot.&lt;/p&gt;

&lt;p&gt;I used:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One Hot Encoding&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;gender&lt;/li&gt;
&lt;li&gt;major_discipline&lt;/li&gt;
&lt;li&gt;company_type&lt;/li&gt;
&lt;li&gt;enrolled_university&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;because these categories had no natural order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ordinal Encoding&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;education_level&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;because educational levels actually contain ranking:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Primary School &amp;lt; High School &amp;lt; Graduate &amp;lt; Masters &amp;lt; PhD&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This distinction improved model behavior more than I initially expected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Most Interesting Observation: Preprocessing Changed the Models More Than the Models Changed Themselves&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I trained multiple models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logistic Regression&lt;/li&gt;
&lt;li&gt;Decision Tree&lt;/li&gt;
&lt;li&gt;Random Forest&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;both before and after preprocessing.&lt;/p&gt;

&lt;p&gt;The results were surprisingly different.&lt;/p&gt;

&lt;p&gt;Linear models improved heavily after scaling and proper encoding.&lt;/p&gt;

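&lt;p&gt;Random Forest remained comparatively stable with or without aggressive preprocessing, which fits how tree splits work: they depend on the ordering of feature values, not their scale.&lt;/p&gt;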

&lt;p&gt;That observation completely changed my perspective.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Data preprocessing often influences performance more than changing the algorithm itself.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Another Real Problem: Feature Explosion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After applying One Hot Encoding, the number of features increased rapidly.&lt;/p&gt;

&lt;p&gt;This was another practical challenge rarely discussed in beginner tutorials.&lt;/p&gt;

&lt;p&gt;Encoding solved categorical representation issues, but it also increased dimensionality and preprocessing complexity significantly.&lt;/p&gt;
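&lt;p&gt;The growth is easy to measure by comparing column counts before and after encoding. This sketch uses pandas get_dummies on a hypothetical frame with made-up category values:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical frame: two nominal columns with a handful of categories each.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Other", "Male"],
    "major_discipline": ["STEM", "Arts", "Business", "Humanities"],
    "experience": [1, 5, 3, 8],
})

encoded = pd.get_dummies(df, columns=["gender", "major_discipline"])

# Each nominal column is replaced by one indicator column per category.
print(df.shape[1], "columns before,", encoded.shape[1], "columns after")
```

&lt;p&gt;Here three gender categories and four disciplines turn three columns into eight. Columns with dozens of categories grow the feature space much faster.&lt;/p&gt;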

&lt;p&gt;&lt;strong&gt;Why Pipelines Became Necessary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At one point, preprocessing became chaotic.&lt;/p&gt;

&lt;p&gt;Different transformations were happening separately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scaling&lt;/li&gt;
&lt;li&gt;imputation&lt;/li&gt;
&lt;li&gt;encoding&lt;/li&gt;
&lt;li&gt;train-test transformations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tracking transformed columns manually became painful.&lt;/p&gt;

&lt;p&gt;This was when I finally understood why sklearn Pipelines and ColumnTransformers matter so much.&lt;/p&gt;

&lt;p&gt;Not because they look advanced —&lt;br&gt;
but because preprocessing becomes unmanageable very quickly in real projects.&lt;/p&gt;
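&lt;p&gt;As an illustration of how these steps can be bundled, here is a minimal sketch using scikit-learn's Pipeline and ColumnTransformer. The toy data, column names, and choice of steps are assumptions for the example, not the exact pipeline from this project:&lt;/p&gt;

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data; column names and values are placeholders.
X = pd.DataFrame({
    "training_hours": [10.0, np.nan, 30.0, 45.0, 5.0, 60.0],
    "gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
})
y = [0, 1, 0, 1, 0, 1]

# Numeric columns: impute with the median, then scale.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
])

# Route each column group to its own transformation.
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["training_hours"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classify", LogisticRegression()),
])

# fit learns imputation, scaling and encoding from the training data only.
model.fit(X, y)
print(model.predict(X))
```

&lt;p&gt;Because every transformation lives inside one object, calling predict on new data replays the same fitted steps in the same order, which removes the manual column tracking.&lt;/p&gt;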

&lt;p&gt;&lt;strong&gt;What This Project Changed for Me&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before this project, I thought feature engineering mostly meant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;removing null values&lt;/li&gt;
&lt;li&gt;encoding categories&lt;/li&gt;
&lt;li&gt;scaling features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now I think differently.&lt;/p&gt;

&lt;p&gt;Feature engineering is closer to:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;understanding how data behaves under transformation.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every preprocessing decision changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;distributions&lt;/li&gt;
&lt;li&gt;feature relationships&lt;/li&gt;
&lt;li&gt;dimensionality&lt;/li&gt;
&lt;li&gt;information retention&lt;/li&gt;
&lt;li&gt;model assumptions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even small preprocessing choices can significantly change model behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One thing became very clear after this project:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Machine learning is not just model training.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Most real effort goes into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;understanding data&lt;/li&gt;
&lt;li&gt;debugging preprocessing&lt;/li&gt;
&lt;li&gt;handling edge cases&lt;/li&gt;
&lt;li&gt;preserving useful information&lt;/li&gt;
&lt;li&gt;testing assumptions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feature engineering is where datasets stop behaving like clean classroom examples and start behaving like real systems.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;"And honestly, that is where machine learning starts becoming interesting."&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>datascience</category>
      <category>programming</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Building a Hantavirus Misinformation Detector: Challenges of NLP in Low-Data Health Domains</title>
      <dc:creator>Vineet Chauhan</dc:creator>
      <pubDate>Sat, 16 May 2026 17:00:47 +0000</pubDate>
      <link>https://dev.to/vineet_chauhan_a828338181/building-a-hantavirus-misinformation-detector-challenges-of-nlp-in-low-data-health-domains-3m5o</link>
      <guid>https://dev.to/vineet_chauhan_a828338181/building-a-hantavirus-misinformation-detector-challenges-of-nlp-in-low-data-health-domains-3m5o</guid>
      <description>&lt;p&gt;Most fake news detection projects rely on massive datasets containing thousands of examples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I wanted to explore something much more difficult:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Can a small NLP system detect misinformation around an emerging disease like Hantavirus?&lt;/p&gt;

&lt;p&gt;What made this project interesting was not the model itself, but the challenge of working in a low-data environment where reliable misinformation examples barely exist.&lt;/p&gt;

&lt;p&gt;Unlike COVID-19 misinformation datasets, hantavirus-related misinformation is extremely limited online. This forced me to manually curate both factual and misleading claims while understanding how health misinformation behaves linguistically.&lt;/p&gt;

&lt;p&gt;This project became less about achieving high accuracy and more about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;understanding NLP pipelines,&lt;/li&gt;
&lt;li&gt;handling imperfect datasets,&lt;/li&gt;
&lt;li&gt;and analyzing misinformation patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Understanding the Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Health misinformation spreads differently from normal fake news.&lt;/p&gt;

&lt;p&gt;Many misleading claims are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;partially believable,&lt;/li&gt;
&lt;li&gt;emotionally framed,&lt;/li&gt;
&lt;li&gt;or based on incomplete truths.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Natural remedies can cure hantavirus”&lt;/li&gt;
&lt;li&gt;“Governments are hiding outbreak data”&lt;/li&gt;
&lt;li&gt;“Hot water prevents infection”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The challenge was not simply classifying text as fake or real, but understanding how subtle misinformation patterns emerge in health-related discussions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Dataset Creation (The Hardest Part)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was by far the most difficult stage of the project.&lt;/p&gt;

&lt;p&gt;Unlike mainstream misinformation domains, there are very few structured datasets specifically related to hantavirus misinformation. Because of this, I manually curated a small dataset using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;trusted medical sources,&lt;/li&gt;
&lt;li&gt;news articles,&lt;/li&gt;
&lt;li&gt;and realistic misinformation patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Real Data Sources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I collected factual information from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;WHO&lt;/li&gt;
&lt;li&gt;CDC&lt;/li&gt;
&lt;li&gt;Reuters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transmission details,&lt;/li&gt;
&lt;li&gt;symptoms,&lt;/li&gt;
&lt;li&gt;prevention methods,&lt;/li&gt;
&lt;li&gt;and treatment limitations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Fake Data Construction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Finding real misinformation examples for hantavirus was difficult because the topic is relatively niche.&lt;/p&gt;

&lt;p&gt;Instead of generating random false statements, I focused on realistic misinformation patterns commonly seen in health-related fake news:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;miracle cures,&lt;/li&gt;
&lt;li&gt;conspiracy theories,&lt;/li&gt;
&lt;li&gt;exaggerated transmission claims,&lt;/li&gt;
&lt;li&gt;and misleading prevention methods.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Garlic water can completely cure hantavirus”&lt;/li&gt;
&lt;li&gt;“The virus spreads rapidly through city air systems”&lt;/li&gt;
&lt;li&gt;“A secret vaccine already exists”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;6. Dataset Structure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The dataset included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;text&lt;/li&gt;
&lt;li&gt;label&lt;/li&gt;
&lt;li&gt;source&lt;/li&gt;
&lt;li&gt;category&lt;/li&gt;
&lt;li&gt;difficulty&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This structure helped organize misinformation types and analyze which claims were easier or harder for the model to classify.&lt;/p&gt;
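&lt;p&gt;A sketch of what rows in this structure might look like is shown below. The label, source, category, and difficulty values are assumptions for illustration, not the exact vocabulary used in the dataset:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical rows; the label, category and difficulty vocabularies are
# assumptions for illustration, not the author's exact values.
records = [
    {"text": "Hantavirus is carried mainly by infected rodents",
     "label": "real", "source": "CDC",
     "category": "transmission", "difficulty": "easy"},
    {"text": "Garlic water can completely cure hantavirus",
     "label": "fake", "source": "constructed",
     "category": "miracle cure", "difficulty": "easy"},
    {"text": "Herbal remedies may reduce hantavirus symptoms",
     "label": "fake", "source": "constructed",
     "category": "miracle cure", "difficulty": "hard"},
]
df = pd.DataFrame(records)

# Cross-tabulate label against difficulty to see where the hard cases sit.
summary = pd.crosstab(df["label"], df["difficulty"])
print(summary)
```

&lt;p&gt;Keeping category and difficulty as separate columns makes it possible to slice model errors by claim type later on.&lt;/p&gt;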

&lt;p&gt;&lt;strong&gt;7. Dataset Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fake vs Real Distribution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllj7a23hu7x36h4uf4vv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllj7a23hu7x36h4uf4vv.png" alt=" " width="704" height="566"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Category Distribution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynva9uz1a9ds5xywkyzk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynva9uz1a9ds5xywkyzk.png" alt=" " width="800" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Difficulty Distribution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrobr4azjk7yn5nndwgv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrobr4azjk7yn5nndwgv.png" alt=" " width="703" height="566"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. NLP Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The NLP pipeline was intentionally kept simple to better understand the fundamentals.&lt;/p&gt;

&lt;p&gt;The workflow consisted of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Text preprocessing&lt;/li&gt;
&lt;li&gt;TF-IDF vectorization&lt;/li&gt;
&lt;li&gt;Logistic Regression classification&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;9. Text Preprocessing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first step involved cleaning the text data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;converting text to lowercase,&lt;/li&gt;
&lt;li&gt;removing punctuation,&lt;/li&gt;
&lt;li&gt;removing unnecessary spaces,&lt;/li&gt;
&lt;li&gt;and standardizing sentence structure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffi9i1jlluuimt4xvc191.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffi9i1jlluuimt4xvc191.png" alt=" " width="675" height="334"&gt;&lt;/a&gt;&lt;/p&gt;
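&lt;p&gt;The first three cleaning steps can be sketched with the Python standard library alone. The function name clean_text is hypothetical, and this version only strips ASCII punctuation:&lt;/p&gt;

```python
import re
import string


def clean_text(text):
    """Lowercase, strip ASCII punctuation, and collapse repeated whitespace."""
    text = text.lower()
    # Remove ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace into single spaces and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("  Garlic water can COMPLETELY cure hantavirus!!  "))
```

&lt;p&gt;Typographic quotes and other non-ASCII punctuation would pass through unchanged, so scraped text may need an extra normalization step.&lt;/p&gt;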

&lt;p&gt;&lt;strong&gt;10. TF-IDF Vectorization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Machine learning models cannot directly understand raw text.&lt;/p&gt;

&lt;p&gt;TF-IDF converts words into numerical representations based on their importance across the dataset.&lt;/p&gt;

&lt;p&gt;This allowed the model to identify patterns such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“secret cure”&lt;/li&gt;
&lt;li&gt;“government hiding”&lt;/li&gt;
&lt;li&gt;“supportive care”&lt;/li&gt;
&lt;li&gt;“WHO reports”&lt;/li&gt;
&lt;/ul&gt;
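&lt;p&gt;Two-word patterns like these only become features if the vectorizer is told to build bigrams. The sketch below assumes scikit-learn's TfidfVectorizer with ngram_range=(1, 2) feeding a Logistic Regression, trained on a tiny hypothetical corpus in the style of the claims quoted in this post:&lt;/p&gt;

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus built from claim styles quoted in this post.
texts = [
    "garlic water can completely cure hantavirus",
    "a secret vaccine already exists",
    "governments are hiding outbreak data",
    "who reports hantavirus is carried by infected rodents",
    "treatment is limited to supportive care",
    "hantavirus symptoms include fever and fatigue",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = misinformation, 0 = factual (assumed encoding)

# ngram_range=(1, 2) lets two-word phrases such as "supportive care" become features.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
model.fit(texts, labels)

vocabulary = model.named_steps["tfidfvectorizer"].vocabulary_
print("supportive care" in vocabulary)
print(model.predict(["a secret cure exists"]))
```

&lt;p&gt;With only six training sentences the prediction itself means little. The point is that the learned vocabulary contains both single words and phrases, which is all the classifier ever sees.&lt;/p&gt;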

&lt;p&gt;&lt;strong&gt;11. Most Interesting Observation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the most surprising findings was:&lt;/p&gt;

&lt;p&gt;believable misinformation is much harder to classify than extreme misinformation.&lt;/p&gt;

&lt;p&gt;Claims like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Herbal remedies may reduce hantavirus symptoms”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;were more difficult for the model than clearly absurd claims.&lt;/p&gt;

&lt;p&gt;This highlighted an important limitation of simple NLP models:&lt;br&gt;
they rely heavily on statistical language patterns rather than true medical understanding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;12. Limitations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This project has several limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;small dataset size,&lt;/li&gt;
&lt;li&gt;manually curated misinformation,&lt;/li&gt;
&lt;li&gt;limited real-world social media data,&lt;/li&gt;
&lt;li&gt;and no deep learning models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because of these constraints, the model should not be treated as a production-ready misinformation detector.&lt;/p&gt;

&lt;p&gt;Instead, this project should be viewed as:&lt;/p&gt;

&lt;p&gt;an exploratory NLP experiment in a low-data health misinformation domain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;13. Future Improvements&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are several directions for improving this project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;collecting real social media misinformation,&lt;/li&gt;
&lt;li&gt;increasing dataset size,&lt;/li&gt;
&lt;li&gt;using transformer-based models like BERT,&lt;/li&gt;
&lt;li&gt;multilingual misinformation detection,&lt;/li&gt;
&lt;li&gt;and explainable AI methods such as SHAP or LIME.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;14. Final Thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This project taught me that the hardest part of NLP is often not the model itself.&lt;/p&gt;

&lt;p&gt;It is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;collecting meaningful data,&lt;/li&gt;
&lt;li&gt;understanding ambiguity,&lt;/li&gt;
&lt;li&gt;and dealing with imperfect real-world information.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Working on a low-data problem like hantavirus misinformation made the project far more challenging — and far more educational — than simply training a model on a large public dataset.&lt;/p&gt;

&lt;p&gt;Even though the model itself was simple, the process revealed how difficult health misinformation detection actually is in practice.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyw0qhyzqg1xnqf555ugm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyw0qhyzqg1xnqf555ugm.png" alt=" " width="673" height="562"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>dataengineering</category>
    </item>
  </channel>
</rss>
