<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mubarak Mohamed</title>
    <description>The latest articles on DEV Community by Mubarak Mohamed (@moubarakmohame4).</description>
    <link>https://dev.to/moubarakmohame4</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F999794%2F2ee2e2bc-aedc-4707-8b95-6bb0e18ec328.png</url>
      <title>DEV Community: Mubarak Mohamed</title>
      <link>https://dev.to/moubarakmohame4</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/moubarakmohame4"/>
    <language>en</language>
    <item>
      <title>Why Decision Trees Don't Need Feature Scaling (And Why This Matters)</title>
      <dc:creator>Mubarak Mohamed</dc:creator>
      <pubDate>Tue, 24 Feb 2026 12:40:43 +0000</pubDate>
      <link>https://dev.to/moubarakmohame4/why-decision-trees-dont-need-feature-scaling-and-why-this-matters-91d</link>
      <guid>https://dev.to/moubarakmohame4/why-decision-trees-dont-need-feature-scaling-and-why-this-matters-91d</guid>
      <description>&lt;p&gt;Ever spent hours normalizing your dataset only to wonder if it was really necessary? If you're using tree-based algorithms, I've got news for you...&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Decision Trees, Random Forests, XGBoost, and LightGBM don't need feature scaling&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Distance- and gradient-based algorithms (k-NN, SVM, Neural Networks) absolutely do&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Why?&lt;/strong&gt; Trees use threshold comparisons, not distance calculations&lt;/p&gt;

&lt;p&gt;Let's dig into why this is the case and prove it with code!&lt;/p&gt;
&lt;h2&gt;
  
  
  Wait, What's Feature Scaling Again?
&lt;/h2&gt;

&lt;p&gt;Feature scaling transforms your numerical variables to a common scale. The two most popular methods:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Min-Max Scaling&lt;/strong&gt; → squashes values between 0 and 1&lt;br&gt;
&lt;strong&gt;Standardization (Z-score)&lt;/strong&gt; → centers data around 0 with std dev of 1&lt;/p&gt;
&lt;h3&gt;
  
  
  Quick example:
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before scaling
&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;25000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;75000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# After Min-Max scaling
&lt;/span&gt;&lt;span class="n"&gt;salary_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.33&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.67&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;age_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.61&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
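
&lt;p&gt;Here's what those two transforms look like in practice, a quick scikit-learn sketch using the salary/age values above:&lt;/p&gt;

```python
# Quick sketch of both transforms with scikit-learn, using the values above
# (columns: salary, age).
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = np.array([[25000, 22], [50000, 30], [75000, 45], [100000, 60]], dtype=float)

minmax = MinMaxScaler().fit_transform(data)    # each column squashed into [0, 1]
zscore = StandardScaler().fit_transform(data)  # each column: mean 0, std dev 1

print(minmax[:, 0])  # salary column: 0.0 to 1.0 in equal steps
print(minmax[:, 1])  # age column: [0.0, ~0.21, ~0.61, 1.0]
```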

&lt;h2&gt;
  
  
  🌲 How Decision Trees Actually Work
&lt;/h2&gt;

&lt;p&gt;Here's the key insight: &lt;strong&gt;Decision Trees make decisions based on threshold comparisons, not distances&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At each node, a tree asks questions like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Is salary &amp;gt; 50000?
  ├─ YES → Is age &amp;gt; 35?
  │        ├─ YES → Prediction A
  │        └─ NO → Prediction B
  └─ NO → Prediction C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The algorithm:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tests every possible threshold on every feature&lt;/li&gt;
&lt;li&gt;Calculates a purity metric (Gini, Entropy, or Variance)&lt;/li&gt;
&lt;li&gt;Picks the split that best separates the data&lt;/li&gt;
&lt;/ol&gt;
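
&lt;p&gt;Those three steps can be sketched in a few lines. This is a toy version of the split search (function names like &lt;code&gt;best_split&lt;/code&gt; are mine, not scikit-learn's API): test every candidate threshold on one feature and keep the split with the lowest weighted Gini impurity.&lt;/p&gt;

```python
# Toy split search: try every candidate threshold, score with weighted Gini.
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(feature, labels):
    """Return (threshold, weighted Gini) of the best binary split."""
    order = np.argsort(feature)
    x, y = feature[order], labels[order]
    best_thr, best_score = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no threshold can separate equal values
        thr = (x[i] + x[i - 1]) / 2  # midpoint between consecutive values
        left, right = y[:i], y[i:]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_thr, best_score = thr, score
    return best_thr, best_score

salary = np.array([30000, 45000, 60000, 75000, 90000])
labels = np.array([0, 0, 0, 1, 1])
thr, score = best_split(salary, labels)
print(f"best threshold: {thr}, weighted Gini: {score}")  # splits perfectly between 60000 and 75000
```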

&lt;h3&gt;
  
  
  The purity metrics:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Gini Impurity&lt;/strong&gt; (classification):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Gini&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Σ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_i&lt;/span&gt;&lt;span class="err"&gt;²&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Entropy&lt;/strong&gt; (classification):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Entropy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Σ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_i&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="err"&gt;₂&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_i&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Variance Reduction&lt;/strong&gt; (regression):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Variance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;Σ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;ȳ&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;²&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Critical point:&lt;/strong&gt; None of these calculations involve distances between observations!&lt;/p&gt;
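
&lt;p&gt;Written out directly, each metric is only a few lines (a sketch; the helper names are mine). Note that every input is a set of labels or targets, never a pair of feature vectors:&lt;/p&gt;

```python
# The three purity metrics above, written out directly.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)            # 1 - Σ p_i²

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))         # -Σ p_i × log₂(p_i)

def variance(y):
    y = np.asarray(y, dtype=float)
    return np.mean((y - y.mean()) ** 2)    # (1/n) × Σ(y_i - ȳ)²

mixed = np.array([0, 0, 1, 1])  # a perfectly mixed two-class node
print(gini(mixed))     # 0.5 (maximum impurity for two classes)
print(entropy(mixed))  # 1.0 (one full bit of uncertainty)
```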

&lt;h2&gt;
  
  
  The Magic: Why Scaling Doesn't Matter
&lt;/h2&gt;

&lt;p&gt;Let's say we're testing a split on salary:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Original data:&lt;/strong&gt; &lt;code&gt;salary &amp;gt; 60000&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Scaled data:&lt;/strong&gt; &lt;code&gt;salary_scaled &amp;gt; 0.5&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;These two conditions &lt;strong&gt;separate the exact same observations&lt;/strong&gt;! 🎯&lt;/p&gt;
&lt;h3&gt;
  
  
  Here's why:
&lt;/h3&gt;

&lt;p&gt;Scaling is a &lt;strong&gt;strictly monotonic transformation&lt;/strong&gt;: it preserves the order of the values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Original
&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;45000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;75000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;90000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# After Min-Max scaling  
&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.00&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The order stays the same: &lt;code&gt;30000 &amp;lt; 45000 &amp;lt; 60000&lt;/code&gt; → &lt;code&gt;0.00 &amp;lt; 0.25 &amp;lt; 0.50&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Since trees test all possible thresholds, they'll find the same optimal split regardless of scale!&lt;/p&gt;
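
&lt;p&gt;A two-line sanity check makes this concrete: the raw threshold and its scaled counterpart select exactly the same observations.&lt;/p&gt;

```python
# The raw split and the equivalent scaled split pick out the same rows.
import numpy as np

salary = np.array([30000, 45000, 60000, 75000, 90000])
scaled = (salary - salary.min()) / (salary.max() - salary.min())  # Min-Max

mask_raw = salary > 60000    # split on the original scale
mask_scaled = scaled > 0.50  # the equivalent split after scaling

print(np.array_equal(mask_raw, mask_scaled))  # True
```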

&lt;h2&gt;
  
  
  Proof Time: Let's Code!
&lt;/h2&gt;

&lt;p&gt;Let's prove this with a real experiment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_classification&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.tree&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DecisionTreeClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;accuracy_score&lt;/span&gt;

&lt;span class="c1"&gt;# Set random seed
&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate dataset with wildly different scales
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;make_classification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                          &lt;span class="n"&gt;n_informative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create different scales intentionally
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;      &lt;span class="c1"&gt;# Scale: 0-100
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;    &lt;span class="c1"&gt;# Scale: 0-10000
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;3000&lt;/span&gt;  &lt;span class="c1"&gt;# Scale: 1000-5000
&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Feature scales:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Feature 0: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Feature 1: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Feature 2: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Split data
&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Scale data
&lt;/span&gt;&lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;X_train_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X_test_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Model 1: WITHOUT scaling
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dt_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DecisionTreeClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dt_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;acc_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dt_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;cv_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dt_raw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WITHOUT scaling: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;acc_raw&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CV score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cv_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (+/- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cv_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Model 2: WITH scaling
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dt_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DecisionTreeClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dt_scaled&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train_scaled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;acc_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dt_scaled&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test_scaled&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;cv_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dt_scaled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train_scaled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WITH scaling: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;acc_scaled&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CV score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cv_scaled&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (+/- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cv_scaled&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Results:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITHOUT scaling: 0.9400
CV score: 0.9200 (+/- 0.0183)

WITH scaling: 0.9400
CV score: 0.9200 (+/- 0.0183)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Identical performance!&lt;/strong&gt; 🎉&lt;/p&gt;
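
&lt;p&gt;If you want to go one step further than matching accuracy, you can compare the learned trees themselves. Here's a sketch on a fresh toy dataset (not the experiment above): the two trees should split on the same feature at every node, and only the stored threshold values differ, because they live on different scales.&lt;/p&gt;

```python
# Sketch: fit identical trees on raw and Min-Max-scaled copies of one dataset
# and compare the learned structures node for node.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_scaled = MinMaxScaler().fit_transform(X)

tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_scaled, y)

# Same feature chosen at every node; thresholds differ only in scale.
same_features = np.array_equal(tree_raw.tree_.feature, tree_scaled.tree_.feature)
same_preds = np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print(same_features, same_preds)
```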

&lt;h2&gt;
  
  
  All Tree-Based Algorithms Follow This Rule
&lt;/h2&gt;

&lt;p&gt;This applies to the entire tree family:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Algorithm&lt;/th&gt;
&lt;th&gt;Needs Scaling?&lt;/th&gt;
&lt;th&gt;Why Not?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Decision Tree&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Threshold comparisons&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Random Forest&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Ensemble of decision trees&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extra Trees&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Random threshold selection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gradient Boosting&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Sequential tree building&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;XGBoost&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Optimized tree splits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LightGBM&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Binning preserves order&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CatBoost&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Categorical encoding + tree splits&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  But These Algorithms DO Need Scaling
&lt;/h2&gt;

&lt;p&gt;For contrast, here's why distance-based algorithms are picky:&lt;/p&gt;

&lt;h3&gt;
  
  
  k-Nearest Neighbors (k-NN)
&lt;/h3&gt;

&lt;p&gt;Uses Euclidean distance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;distance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;√&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="err"&gt;₁&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="err"&gt;₂&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;²&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="err"&gt;₁&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="err"&gt;₂&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;²&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;salary: 50000-51000&lt;/code&gt; and &lt;code&gt;age: 30-50&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;distance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;√&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;51000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;²&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;²&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;distance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;√&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;≈&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Age is completely dominated by salary!&lt;/strong&gt; Without scaling, age becomes irrelevant.&lt;/p&gt;
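
&lt;p&gt;Plugging those numbers in makes the imbalance concrete: dropping the age term entirely barely changes the distance.&lt;/p&gt;

```python
# The two distances from the example above, computed explicitly: removing the
# age term changes the result by only ~0.2 out of ~1000.
import math

d = math.sqrt((50000 - 51000) ** 2 + (30 - 50) ** 2)
d_salary_only = math.sqrt((50000 - 51000) ** 2)
print(round(d, 1), round(d_salary_only, 1))  # 1000.2 1000.0
```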

&lt;h3&gt;
  
  
  Let's prove it:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.neighbors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KNeighborsClassifier&lt;/span&gt;

&lt;span class="c1"&gt;# k-NN without scaling
&lt;/span&gt;&lt;span class="n"&gt;knn_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KNeighborsClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_neighbors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;knn_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;acc_knn_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;knn_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# k-NN with scaling
&lt;/span&gt;&lt;span class="n"&gt;knn_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KNeighborsClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_neighbors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;knn_scaled&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train_scaled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;acc_knn_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;knn_scaled&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test_scaled&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k-NN WITHOUT scaling: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;acc_knn_raw&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k-NN WITH scaling: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;acc_knn_scaled&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Results:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k-NN WITHOUT scaling: 0.8800
k-NN WITH scaling: 0.9633
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;That's an 8.3-point accuracy jump (a ~9.5% relative improvement)!&lt;/strong&gt; Scaling is critical for k-NN.&lt;/p&gt;

&lt;h3&gt;
  
  
  Other sensitive algorithms:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;SVM&lt;/strong&gt; → Optimizes geometric margins&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Logistic Regression&lt;/strong&gt; → Gradient descent sensitive to magnitude&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Neural Networks&lt;/strong&gt; → Gradient stability requires normalized inputs&lt;/p&gt;
&lt;h2&gt;
  
  
  🤓 Edge Cases: When You Might Still Scale Trees
&lt;/h2&gt;

&lt;p&gt;While not necessary, scaling can help in these scenarios:&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Feature Importance Interpretation
&lt;/h3&gt;

&lt;p&gt;Some implementations calculate importance based on total criterion reduction. Variables with larger ranges might appear artificially more important.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; Usually negligible, but worth checking in extreme cases (0-1 vs 0-1000000)&lt;/p&gt;
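&lt;p&gt;This is easy to check yourself. A minimal sketch on synthetic data (the feature ranges and target rule below are made up for illustration): fit the same Random Forest on raw and standardized inputs and compare the importances.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Feature 0 lives in [0, 1], feature 1 in [0, 1_000_000]
X = np.column_stack([rng.uniform(0, 1, 500), rng.uniform(0, 1_000_000, 500)])
y = ((X[:, 0] + X[:, 1] / 1_000_000) * 0.5).round().astype(int)

rf_raw = RandomForestClassifier(random_state=0).fit(X, y)
rf_scaled = RandomForestClassifier(random_state=0).fit(
    StandardScaler().fit_transform(X), y
)

# Standardization is monotonic per feature, so the trees pick the same
# splits and the impurity-based importances come out (near) identical
print(rf_raw.feature_importances_)
print(rf_scaled.feature_importances_)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;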
&lt;h3&gt;
  
  
  2. Regularization in Advanced Models
&lt;/h3&gt;

&lt;p&gt;XGBoost and LightGBM offer L1/L2 regularization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;xgboost&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;XGBClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;reg_alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# L1 
&lt;/span&gt;    &lt;span class="n"&gt;reg_lambda&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;   &lt;span class="c1"&gt;# L2
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These penalties can be slightly sensitive to feature scale, though the impact is usually marginal.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Mixed Model Pipelines
&lt;/h3&gt;

&lt;p&gt;When combining algorithms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VotingClassifier&lt;/span&gt;

&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VotingClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;  &lt;span class="c1"&gt;# Doesn't need scaling
&lt;/span&gt;        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;svm&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;SVC&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;                     &lt;span class="c1"&gt;# Needs scaling
&lt;/span&gt;        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lr&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;        &lt;span class="c1"&gt;# Needs scaling
&lt;/span&gt;    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Scale everything. It won't hurt the Random Forest, and the SVM and Logistic Regression need it!&lt;/p&gt;
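&lt;p&gt;Alternatively, if you'd rather not scale what doesn't need it, you can wrap only the sensitive estimators in their own pipelines. A sketch using scikit-learn's &lt;code&gt;make_pipeline&lt;/code&gt; (default hyperparameters, for illustration only):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier()),                 # scaling irrelevant
        ("svm", make_pipeline(StandardScaler(), SVC())),  # scaled internally
        ("lr", make_pipeline(StandardScaler(), LogisticRegression())),
    ]
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each pipeline fits its own scaler during training and reapplies it at prediction time, so the forest sees raw features while the SVM and Logistic Regression see standardized ones.&lt;/p&gt;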

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;h3&gt;
  
  
  When working with trees:
&lt;/h3&gt;

&lt;p&gt;✅ &lt;strong&gt;Skip scaling&lt;/strong&gt; → One less preprocessing step to build and maintain&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Focus on feature engineering&lt;/strong&gt; → Far more impact on model quality&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Tune hyperparameters&lt;/strong&gt; → &lt;code&gt;max_depth&lt;/code&gt;, &lt;code&gt;learning_rate&lt;/code&gt;, etc.&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Handle missing values&lt;/strong&gt; → Still critical!&lt;/p&gt;

&lt;h3&gt;
  
  
  When working with distance/gradient-based models:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Always scale&lt;/strong&gt; → Non-negotiable&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Standardization usually better&lt;/strong&gt; than Min-Max&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Check your pipeline&lt;/strong&gt; → Ensure consistent preprocessing&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Trees compare thresholds, not distances&lt;/strong&gt; → Scaling is irrelevant&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monotonic transformations preserve order&lt;/strong&gt; → Same splits regardless of scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;k-NN, SVM, Neural Nets need scaling&lt;/strong&gt; → Distance/gradient calculations are sensitive&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature engineering &amp;gt; Scaling&lt;/strong&gt; → Focus your efforts where they matter&lt;/li&gt;
&lt;/ol&gt;
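&lt;p&gt;Takeaway #2 is worth verifying once yourself. A minimal sketch: fit the same decision tree on raw and standardized versions of a synthetic dataset and compare the predictions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

tree_raw = DecisionTreeClassifier(random_state=42).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=42).fit(X_scaled, y)

# Standardization preserves the ordering of each feature, so both trees
# make the same splits and produce identical predictions
print((tree_raw.predict(X) == tree_scaled.predict(X_scaled)).all())  # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;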

&lt;h2&gt;
  
  
  🔗 Want to Go Deeper?
&lt;/h2&gt;

&lt;p&gt;Here are some great resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://scikit-learn.org/stable/modules/tree.html" rel="noopener noreferrer"&gt;Scikit-learn Tree Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://xgboost.readthedocs.io/en/stable/parameter.html" rel="noopener noreferrer"&gt;XGBoost Parameters Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=7VeUPuFGJHk" rel="noopener noreferrer"&gt;StatQuest: Decision Trees&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Have you ever wasted time scaling data for tree models? What's your preprocessing workflow? Drop a comment below! 👇&lt;/p&gt;

&lt;p&gt;If this helped you, consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❤️ Giving it a like&lt;/li&gt;
&lt;li&gt;🔖 Bookmarking for later&lt;/li&gt;
&lt;li&gt;🔄 Sharing with your team&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Happy coding!&lt;/strong&gt; 🎉&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Found a typo or have a suggestion? Leave a comment or reach out!&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  📊 Quick Reference Cheatsheet
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Don't waste time on this for trees
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;

&lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;X_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Unnecessary!
&lt;/span&gt;&lt;span class="n"&gt;rf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;rf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_scaled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Just do this instead
&lt;/span&gt;&lt;span class="n"&gt;rf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;rf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Works perfectly!
&lt;/span&gt;
&lt;span class="c1"&gt;# But DO scale for these
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.neighbors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KNeighborsClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.svm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SVC&lt;/span&gt;

&lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;X_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Critical!
&lt;/span&gt;
&lt;span class="n"&gt;knn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KNeighborsClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;knn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_scaled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;em&gt;Part of my Machine Learning Fundamentals series. Follow for more deep dives!&lt;/em&gt; 🚀&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>tutorial</category>
      <category>python</category>
    </item>
    <item>
      <title>2026: The Year Data Science Changed Forever (And What It Means for You)</title>
      <dc:creator>Mubarak Mohamed</dc:creator>
      <pubDate>Tue, 10 Feb 2026 12:08:16 +0000</pubDate>
      <link>https://dev.to/moubarakmohame4/2026-the-year-data-science-changed-forever-and-what-it-means-for-you-36ad</link>
      <guid>https://dev.to/moubarakmohame4/2026-the-year-data-science-changed-forever-and-what-it-means-for-you-36ad</guid>
      <description>&lt;p&gt;I've been in Data Science for 5 years, and 2026 feels different. Not "new tool different" — &lt;strong&gt;fundamentally different&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Last week, I watched a marketing manager with zero coding experience build a customer churn prediction model in 20 minutes using a conversational AI interface. Three years ago, that would've taken my team two weeks. &lt;/p&gt;

&lt;p&gt;This isn't just about tools getting better. The entire data profession is being redefined, and if you're not paying attention, you might miss the shift.&lt;/p&gt;

&lt;h2&gt;
  
  
  🚨 Why 2026 Actually Matters
&lt;/h2&gt;

&lt;p&gt;Let me be clear: I'm not here to tell you "AI is taking your job" (it's not). But ignoring what's happening would be like a web developer ignoring JavaScript frameworks in 2015.&lt;/p&gt;

&lt;p&gt;Three seismic shifts are converging &lt;strong&gt;right now&lt;/strong&gt;:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Generative AI isn't just answering questions anymore
&lt;/h3&gt;

&lt;p&gt;It's writing production SQL queries, generating entire analysis pipelines, and explaining statistical concepts better than most tutorials.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# What we used to do (2023)
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sales.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ... 50 lines of data cleaning ...
# ... 30 lines of visualization code ...
&lt;/span&gt;
&lt;span class="c1"&gt;# What happens now (2026)
# Prompt: "Analyze sales.csv, clean the data, and show me regional trends"
# AI generates the entire pipeline + explains every decision
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  2. AutoML reached production maturity
&lt;/h3&gt;

&lt;p&gt;Platforms like DataRobot and H2O.ai don't just train models — they:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handle feature engineering automatically&lt;/li&gt;
&lt;li&gt;Select optimal algorithms&lt;/li&gt;
&lt;li&gt;Deploy to production with monitoring&lt;/li&gt;
&lt;li&gt;Explain predictions in plain language&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The technical barrier to ML just collapsed.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  3. No-code ate the analytics market
&lt;/h3&gt;

&lt;p&gt;Your CEO can now ask Tableau: &lt;em&gt;"Why did Q1 revenue drop in the Southeast?"&lt;/em&gt; and get a structured answer with visualizations. No SQL. No Python. No data analyst in the loop.&lt;/p&gt;

&lt;p&gt;Does this mean Data Analysts are obsolete? &lt;strong&gt;Absolutely not.&lt;/strong&gt; But the job description just changed radically.&lt;/p&gt;
&lt;h2&gt;
  
  
  🔍 The 6 Trends You Can't Ignore
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Trend 1: AI Copilots in Every Tool
&lt;/h3&gt;

&lt;p&gt;Every major BI platform now has a conversational interface. This isn't a gimmick — it's changing &lt;strong&gt;who can do analytics&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Impact&lt;/strong&gt;: Your value shifts from &lt;em&gt;creating dashboards&lt;/em&gt; to &lt;em&gt;interpreting insights and guiding strategy&lt;/em&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Trend 2: Real-Time Analytics Becomes Standard
&lt;/h3&gt;

&lt;p&gt;Streaming data (Kafka, Flink) + cloud infrastructure means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dynamic pricing models that adjust in real-time&lt;/li&gt;
&lt;li&gt;Instant fraud detection&lt;/li&gt;
&lt;li&gt;Live personalization engines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The batch processing era is ending.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Trend 3: Augmented Analytics (AI That Thinks Ahead)
&lt;/h3&gt;

&lt;p&gt;This goes beyond automation. The system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Suggests analyses you didn't think to run&lt;/li&gt;
&lt;li&gt;Detects anomalies proactively&lt;/li&gt;
&lt;li&gt;Predicts questions before they're asked&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's like having a junior data scientist monitoring everything 24/7.&lt;/p&gt;
&lt;h3&gt;
  
  
  Trend 4: The Explainability Mandate
&lt;/h3&gt;

&lt;p&gt;With EU AI Act and similar regulations worldwide, "black box" models are becoming liabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;New essential skill&lt;/strong&gt;: Being able to explain &lt;em&gt;why&lt;/em&gt; the model made a decision, not just &lt;em&gt;what&lt;/em&gt; it predicted.&lt;/p&gt;
&lt;h3&gt;
  
  
  Trend 5: Data Governance Isn't Optional Anymore
&lt;/h3&gt;

&lt;p&gt;Privacy regulations (GDPR, CCPA, etc.) + AI ethics requirements mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need to track data lineage&lt;/li&gt;
&lt;li&gt;You must prevent algorithmic bias&lt;/li&gt;
&lt;li&gt;Transparency is legally required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This is creating entirely new roles&lt;/strong&gt; (AI Ethics Officer, Data Governance Specialist).&lt;/p&gt;
&lt;h3&gt;
  
  
  Trend 6: Role Evolution is Accelerating
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Old Role&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;New Focus&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data Analyst&lt;/td&gt;
&lt;td&gt;Strategic advisor + AI orchestrator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Scientist&lt;/td&gt;
&lt;td&gt;Complex problems + research + innovation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;em&gt;New:&lt;/em&gt; Analytics Engineer&lt;/td&gt;
&lt;td&gt;Bridge between data eng and analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;em&gt;New:&lt;/em&gt; AI Product Manager&lt;/td&gt;
&lt;td&gt;Build data-driven products&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  💡 What This Means For You
&lt;/h2&gt;
&lt;h3&gt;
  
  
  If you're a Data Analyst
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Don't panic.&lt;/strong&gt; Your job is evolving, not disappearing.&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;What to learn&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt engineering (seriously, it's a skill)&lt;/li&gt;
&lt;li&gt;Business acumen + domain expertise&lt;/li&gt;
&lt;li&gt;Data storytelling and communication&lt;/li&gt;
&lt;li&gt;Critical thinking to validate AI outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;❌ &lt;strong&gt;What's becoming less valuable&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Purely technical skills (SQL, Python) without context&lt;/li&gt;
&lt;li&gt;Repetitive dashboard creation&lt;/li&gt;
&lt;li&gt;Manual data cleaning&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  If you're learning Data Science
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Great timing actually.&lt;/strong&gt; The barrier to entry is lower, but the skill ceiling is higher.&lt;/p&gt;

&lt;p&gt;You can now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with no-code tools to learn concepts&lt;/li&gt;
&lt;li&gt;Gradually add technical depth where needed&lt;/li&gt;
&lt;li&gt;Focus on business impact from day one&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Hot take&lt;/strong&gt;: You might become more valuable by mastering Tableau + business strategy than by grinding LeetCode for 6 months.&lt;/p&gt;
&lt;h3&gt;
  
  
  If you're hiring
&lt;/h3&gt;

&lt;p&gt;Stop asking for "5 years Python + PhD in Statistics". &lt;/p&gt;

&lt;p&gt;Start looking for people who can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Translate business problems into data questions&lt;/li&gt;
&lt;li&gt;Critically evaluate AI-generated insights&lt;/li&gt;
&lt;li&gt;Communicate findings to non-technical stakeholders&lt;/li&gt;
&lt;li&gt;Navigate ethical implications of data use&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  🤔 "Should I Still Learn Data Science in 2026?"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Yes. But differently.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The essential skills now are:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Technical foundation&lt;/strong&gt; (still needed, just less time):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQL + data manipulation&lt;/li&gt;
&lt;li&gt;Statistical thinking&lt;/li&gt;
&lt;li&gt;One visualization tool deeply&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;New essentials&lt;/strong&gt; (invest heavily here):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt engineering for data tasks&lt;/li&gt;
&lt;li&gt;Model interpretation &amp;amp; validation&lt;/li&gt;
&lt;li&gt;Communication &amp;amp; storytelling&lt;/li&gt;
&lt;li&gt;Ethics &amp;amp; governance fundamentals&lt;/li&gt;
&lt;li&gt;Business domain knowledge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pro tip&lt;/strong&gt;: Spend 40% of your learning time on technical skills, 60% on context, communication, and judgment.&lt;/p&gt;
&lt;h2&gt;
  
  
  🎯 Key Takeaways
&lt;/h2&gt;

&lt;p&gt;📌 &lt;strong&gt;Remember&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2026 marks a structural shift, not just new tools&lt;/li&gt;
&lt;li&gt;Technical tasks are getting simpler; strategic thinking is getting more valuable&lt;/li&gt;
&lt;li&gt;Accessibility is increasing (good for beginners)&lt;/li&gt;
&lt;li&gt;New roles are emerging faster than old ones are disappearing&lt;/li&gt;
&lt;li&gt;The skill gap is widening between "technical operators" and "strategic data professionals"&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  🗣️ Let's Discuss
&lt;/h2&gt;

&lt;p&gt;I'm curious about your experience:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Have you used AI to generate code or analysis?&lt;/strong&gt; What worked? What didn't?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data professionals&lt;/strong&gt;: How has your role changed in the past year?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Beginners&lt;/strong&gt;: Does this make you more or less excited to enter the field?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Drop your thoughts in the comments. I genuinely want to hear different perspectives on this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P.S.&lt;/strong&gt; If you found this valuable, I write deep dives like this regularly on &lt;a href="https://coachdata.dev" rel="noopener noreferrer"&gt;coachdata.dev&lt;/a&gt;. We focus on practical skills and career navigation in the evolving data landscape.&lt;/p&gt;

&lt;p&gt;Happy learning 🚀&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.dev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/codecrafters-io" rel="noopener noreferrer"&gt;
        codecrafters-io
      &lt;/a&gt; / &lt;a href="https://github.com/codecrafters-io/build-your-own-x" rel="noopener noreferrer"&gt;
        build-your-own-x
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Master programming by recreating your favorite technologies from scratch.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;p&gt;&lt;a href="https://codecrafters.io/github-banner" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/d5519a56f2d0fb4feb658af2aaae80023bcacca77f1dcb0c984488cf30d16c80/68747470733a2f2f636f646563726166746572732e696f2f696d616765732f757064617465642d62796f782d62616e6e65722e676966" alt="Banner"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Build your own &amp;lt;insert-technology-here&amp;gt;&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;This repository is a compilation of well-written, step-by-step guides for re-creating our favorite technologies from scratch.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;What I cannot create, I do not understand — Richard Feynman.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It's a great way to learn.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-3d-renderer" rel="noopener noreferrer"&gt;3D Renderer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-augmented-reality" rel="noopener noreferrer"&gt;Augmented Reality&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-bittorrent-client" rel="noopener noreferrer"&gt;BitTorrent Client&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-blockchain--cryptocurrency" rel="noopener noreferrer"&gt;Blockchain / Cryptocurrency&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-bot" rel="noopener noreferrer"&gt;Bot&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-command-line-tool" rel="noopener noreferrer"&gt;Command-Line Tool&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-database" rel="noopener noreferrer"&gt;Database&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-docker" rel="noopener noreferrer"&gt;Docker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-emulator--virtual-machine" rel="noopener noreferrer"&gt;Emulator / Virtual Machine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-front-end-framework--library" rel="noopener noreferrer"&gt;Front-end Framework / Library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-game" rel="noopener noreferrer"&gt;Game&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-git" rel="noopener noreferrer"&gt;Git&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-network-stack" rel="noopener noreferrer"&gt;Network Stack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-neural-network" rel="noopener noreferrer"&gt;Neural Network&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-operating-system" rel="noopener noreferrer"&gt;Operating System&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-physics-engine" rel="noopener noreferrer"&gt;Physics Engine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-programming-language" rel="noopener noreferrer"&gt;Programming Language&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-regex-engine" rel="noopener noreferrer"&gt;Regex Engine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-search-engine" rel="noopener noreferrer"&gt;Search Engine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-shell" rel="noopener noreferrer"&gt;Shell&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-template-engine" rel="noopener noreferrer"&gt;Template Engine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-text-editor" rel="noopener noreferrer"&gt;Text Editor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-visual-recognition-system" rel="noopener noreferrer"&gt;Visual Recognition System&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-voxel-engine" rel="noopener noreferrer"&gt;Voxel Engine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-web-browser" rel="noopener noreferrer"&gt;Web Browser&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-web-server" rel="noopener noreferrer"&gt;Web Server&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#uncategorized" rel="noopener noreferrer"&gt;Uncategorized&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Tutorials&lt;/h2&gt;
&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h4 class="heading-element"&gt;Build your own &lt;code&gt;3D Renderer&lt;/code&gt;
&lt;/h4&gt;
&lt;/div&gt;


&lt;ul&gt;

&lt;li&gt;&lt;a href="https://www.scratchapixel.com/lessons/3d-basic-rendering/introduction-to-ray-tracing/how-does-it-work" rel="nofollow noopener noreferrer"&gt;&lt;strong&gt;C++&lt;/strong&gt;: &lt;em&gt;Introduction to Ray Tracing: a Simple Method for Creating 3D Images&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="https://github.com/ssloy/tinyrenderer/wiki" rel="noopener noreferrer"&gt;&lt;strong&gt;C++&lt;/strong&gt;: &lt;em&gt;How OpenGL works: software rendering in 500 lines of code&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="http://lodev.org/cgtutor/raycasting.html" rel="nofollow noopener noreferrer"&gt;&lt;strong&gt;C++&lt;/strong&gt;: &lt;em&gt;Raycasting engine of Wolfenstein 3D&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="http://www.pbr-book.org/" rel="nofollow noopener noreferrer"&gt;&lt;strong&gt;C++&lt;/strong&gt;: &lt;em&gt;Physically Based Rendering:From Theory To Implementation&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="https://raytracing.github.io/books/RayTracingInOneWeekend.html" rel="nofollow noopener noreferrer"&gt;&lt;strong&gt;C++&lt;/strong&gt;: &lt;em&gt;Ray Tracing in One Weekend&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="https://www.scratchapixel.com/lessons/3d-basic-rendering/rasterization-practical-implementation/overview-rasterization-algorithm" rel="nofollow noopener noreferrer"&gt;&lt;strong&gt;C++&lt;/strong&gt;: &lt;em&gt;Rasterization: a Practical Implementation&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://www.davrous.com/2013/06/13/tutorial-series-learning-how-to-write-a-3d-soft-engine-from-scratch-in-c-typescript-or-javascript/" rel="nofollow noopener noreferrer"&gt;&lt;strong&gt;C#&lt;/strong&gt;&lt;/a&gt;…&lt;/li&gt;

&lt;/ul&gt;
&lt;/div&gt;
&lt;br&gt;
  &lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/codecrafters-io/build-your-own-x" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;





&lt;p&gt;&lt;em&gt;What are you learning in 2026? Share your data journey below 👇&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>ai</category>
      <category>career</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The Data Analyst's Arsenal in 2025: Mastering the Tools, Data, and Trends to Stand Out</title>
      <dc:creator>Mubarak Mohamed</dc:creator>
      <pubDate>Tue, 02 Sep 2025 10:18:39 +0000</pubDate>
      <link>https://dev.to/moubarakmohame4/larsenal-du-data-analyst-en-2025-maitriser-les-outils-les-donnees-et-les-tendances-pour-se-4lb0</link>
      <guid>https://dev.to/moubarakmohame4/larsenal-du-data-analyst-en-2025-maitriser-les-outils-les-donnees-et-les-tendances-pour-se-4lb0</guid>
      <description>&lt;p&gt;The Data Analyst role is constantly evolving, and in 2025 it is more crucial than ever within companies. It is no longer just about crunching numbers, but about turning mountains of raw data into strategic insights, telling clear stories, and guiding decision-making. Excelling in this field takes more than solid technical skills; you also need to navigate an ecosystem in perpetual flux. From mastering the classic tools to the latest innovations in AI and the cloud, the modern Data Analyst must be versatile and commit to continuous learning.&lt;/p&gt;

&lt;p&gt;This guide is a comprehensive tour of the essential resources that shape the daily work of a Data Analyst in 2025. We will explore the must-have tools, the best data sources, training platforms, communities to follow, and the major trends transforming the profession.&lt;/p&gt;

&lt;h3&gt;
  
  
  Essential Tools: The Foundations of Data Analysis
&lt;/h3&gt;

&lt;p&gt;A Data Analyst is above all a craftsperson of data, and their effectiveness depends directly on the quality of their tools. In 2025, the arsenal has grown richer and more complex.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. The Ever-Powerful Basics: Excel and SQL
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fod3wyt3hita8dake7uv2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fod3wyt3hita8dake7uv2.jpg" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Microsoft Excel&lt;/strong&gt;: Far from obsolete, Excel remains a go-to tool for quick exploratory analysis, managing small to medium datasets, and building simple visualizations. Mastery of &lt;strong&gt;pivot tables&lt;/strong&gt;, functions such as &lt;code&gt;RECHERCHEV&lt;/code&gt; (VLOOKUP) or &lt;code&gt;INDEX+EQUIV&lt;/code&gt; (INDEX+MATCH), and VBA macros is still highly useful for automating repetitive tasks or cleaning data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SQL (Structured Query Language)&lt;/strong&gt;: The universal language of relational databases and the gateway for querying, extracting, and manipulating data. Mastering SQL is non-negotiable. In 2025, it is crucial to be able to write complex queries, use &lt;strong&gt;window functions&lt;/strong&gt; (&lt;code&gt;WINDOW FUNCTIONS&lt;/code&gt;) for advanced calculations, and understand query optimization.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
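As a quick illustration of the window functions mentioned above, here is a minimal, self-contained sketch using Python's built-in sqlite3 module (SQLite has supported window functions since version 3.25). The sales table and its columns are invented for the demo.

```python
import sqlite3

# In-memory demo database with a hypothetical "sales" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("2025-01-01", "North", 100.0),
     ("2025-01-02", "North", 150.0),
     ("2025-01-01", "South", 80.0),
     ("2025-01-02", "South", 120.0)],
)

# Window function: a running total per region, ordered by day.
rows = conn.execute("""
    SELECT region, day, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY day) AS running_total
    FROM sales
    ORDER BY region, day
""").fetchall()

for row in rows:
    print(row)
```

Unlike a plain GROUP BY, the window function keeps every row while adding the aggregate alongside it.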

&lt;h4&gt;
  
  
  2. The Era of Business Intelligence (BI)
&lt;/h4&gt;

&lt;p&gt;BI platforms are at the heart of the job. They make it possible to build dynamic dashboards and interactive reports that tell a story with data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff238u9h4164kfmnfdz0s.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff238u9h4164kfmnfdz0s.webp" alt=" " width="768" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Power BI&lt;/strong&gt;: Microsoft's tool is a must-have. It is powerful, integrates seamlessly with the Microsoft ecosystem (Excel, Azure), and offers robust data-modeling (the DAX language) and visualization features. It is ideal for companies already built on the Microsoft ecosystem and has a relatively gentle learning curve.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj216vns8gzs4puc7wvz8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj216vns8gzs4puc7wvz8.png" alt=" " width="575" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Tableau&lt;/strong&gt;: Known for producing striking, polished visualizations, Tableau remains a benchmark. It is particularly appreciated for its drag-and-drop simplicity and its ability to connect to a multitude of data sources.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. The Power of Code: Python and Its Libraries
&lt;/h4&gt;

&lt;p&gt;Python has established itself as the language of choice for advanced data analysis, machine learning, and automation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Pandas&lt;/strong&gt;: The indispensable library for manipulating and analyzing tabular data (DataFrames).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Matplotlib and Seaborn&lt;/strong&gt;: For custom data visualizations that go beyond what BI tools offer.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Jupyter Notebook&lt;/strong&gt;: The interactive environment of choice for data exploration. It combines code, visualizations, and explanatory text, producing clear, reproducible analyses.&lt;/li&gt;
&lt;/ul&gt;
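To give a feel for the Pandas workflow, here is a tiny, self-contained sketch; the dataset and column names are invented for the demo.

```python
import pandas as pd

# Tiny invented dataset, just to show the DataFrame workflow.
df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "revenue": [100.0, 150.0, 80.0, 120.0],
})

# One line of Pandas replaces a manual aggregation loop.
by_region = df.groupby("region")["revenue"].sum()
print(by_region)
```

From here, a single call like `by_region.plot(kind="bar")` hands the result to Matplotlib for a quick chart.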

&lt;h4&gt;
  
  
  4. AI at the Data Analyst's Service
&lt;/h4&gt;

&lt;p&gt;Copilots and generative-AI tools are the new stars of the Data Analyst's arsenal.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;ChatGPT&lt;/strong&gt; and &lt;strong&gt;other code assistants (e.g., Copilot)&lt;/strong&gt;: These tools do not replace the analyst; they dramatically boost productivity. They can help write complex SQL queries, generate Python code snippets, explain statistical concepts, or even summarize analyses. For example, you can ask ChatGPT to write a query to "compute total revenue by region and by month for 2024" and get ready-to-use code that you only need to adapt.&lt;/li&gt;
&lt;/ul&gt;
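For reference, an assistant-generated query along those lines might look like the following sketch, run here against SQLite via Python's standard library. The orders table and its columns are hypothetical; a real generated query would need adapting to your actual schema.

```python
import sqlite3

# Hypothetical "orders" table standing in for a real sales database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("2024-01-15", "North", 200.0),
     ("2024-01-20", "North", 100.0),
     ("2024-02-05", "South", 300.0)],
)

# Total revenue by region and by month for 2024.
rows = conn.execute("""
    SELECT region,
           strftime('%m', order_date) AS month,
           SUM(amount) AS total_revenue
    FROM orders
    WHERE strftime('%Y', order_date) = '2024'
    GROUP BY region, month
    ORDER BY region, month
""").fetchall()
print(rows)
```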

&lt;h3&gt;
  
  
  Data Sources: The Fuel of Analysis
&lt;/h3&gt;

&lt;p&gt;Without data, there is no analysis. The Data Analyst of 2025 knows where to find quality data, whether public or private.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Kaggle&lt;/strong&gt;: Much more than a data science competition platform, Kaggle is a goldmine of high-quality datasets on topics ranging from COVID-19 to movies. It is the perfect place to practice on real projects and see how other analysts have solved similar problems.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Google Dataset Search&lt;/strong&gt;: This dataset-specific search engine helps you find relevant datasets across the web, whether published by governments, universities, or individuals.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;National and international Open Data portals&lt;/strong&gt;: Each country has its own open data portal.

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;France:&lt;/strong&gt; &lt;code&gt;data.gouv.fr&lt;/code&gt; is the reference for French public data, with information on demographics, health, transport, and more.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;United States:&lt;/strong&gt; &lt;code&gt;data.gov&lt;/code&gt; gathers the US government's datasets.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;European Union:&lt;/strong&gt; &lt;code&gt;data.europa.eu&lt;/code&gt; is the EU's official open data portal.&lt;/li&gt;
&lt;/ul&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h3&gt;
  
  
  Training and Staying Current: Continuous Learning Is a Necessity
&lt;/h3&gt;

&lt;p&gt;The data world moves so fast that learning never stops. To stay relevant, you must commit to ongoing training.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Online training platforms&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Coursera&lt;/strong&gt;: Offers high-level specializations in partnership with prestigious universities. The "Google Data Analytics Professional Certificate," for example, is an excellent entry point.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Udemy&lt;/strong&gt;: Ideal for shorter courses focused on specific skills (e.g., a course on Power BI or automation with Python).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;DataCamp&lt;/strong&gt; and &lt;strong&gt;Dataquest&lt;/strong&gt;: Data-focused platforms offering interactive tracks for learning SQL, Python, or R right in the browser.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Specialized newsletters&lt;/strong&gt;: Subscribing to a few quality newsletters is the best way to stay informed without spending too much time on it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;Data Elixir&lt;/code&gt;: A weekly selection of the best articles, tools, and tutorials on data science and data analysis.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;Data Is Plural&lt;/code&gt;: A newsletter sharing interesting datasets every week. It is an excellent source of new projects to explore.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Podcasts&lt;/strong&gt;: Listening to experts discuss the field is a convenient way to keep up with the latest trends.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;DataGen&lt;/code&gt;: A French podcast that interviews data professionals.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;Super Data Science&lt;/code&gt;: Interviews with leaders in data, data science, and AI.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Communities and Networks: Building Your Human Capital
&lt;/h3&gt;

&lt;p&gt;Exchanging with other professionals is an invaluable resource for solving problems, finding a job, or simply staying motivated.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;LinkedIn&lt;/strong&gt;: The professional network is a prime platform for staying current and networking. Following thought leaders, joining data-focused groups, and sharing your projects and insights is essential for building your personal brand and staying visible in the community.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Slack and Discord&lt;/strong&gt;: Many communities gather on these platforms for real-time exchanges. Servers such as &lt;code&gt;The Data Science Community&lt;/code&gt; and the Slack channels of specialized companies let you ask questions, get help with technical problems, and share your findings.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;GitHub&lt;/strong&gt;: The modern Data Analyst's portfolio. Hosting your projects there (analyses, Jupyter notebooks, Python scripts) is an excellent way to show your skills to recruiters and collaborate with other developers.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Specialized forums&lt;/strong&gt;: Platforms like &lt;code&gt;Stack Overflow&lt;/code&gt; are first-rate resources for solving specific coding or modeling problems.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  2025 Trends: Anticipating the Future of Data
&lt;/h3&gt;

&lt;p&gt;Tomorrow's Data Analyst must not only master today's tools but also anticipate tomorrow's trends.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automation and Generative AI&lt;/strong&gt;: These technologies are transforming how analyses are carried out. Instead of spending hours cleaning data, analysts will increasingly rely on automated tools (AutoML) and copilots for repetitive tasks. The job is shifting toward higher-value activities: business understanding and storytelling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud Computing&lt;/strong&gt;: Companies are migrating their data infrastructure to the cloud. Mastering tools such as &lt;strong&gt;BigQuery&lt;/strong&gt; (Google Cloud) or &lt;strong&gt;Snowflake&lt;/strong&gt; is now a major asset. These platforms can manage and analyze petabytes of data at high speed, with no need to worry about the underlying infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Storytelling&lt;/strong&gt;: Producing complex reports is no longer enough. The Data Analyst of 2025 is a storyteller. The ability to turn figures and charts into a convincing, clear story tailored to a non-technical audience has become a crucial skill. Using tools like Power BI or Tableau to build "storyboards" is increasingly common.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Adventure of Continuous Learning
&lt;/h3&gt;

&lt;p&gt;Being a Data Analyst in 2025 is an exciting but demanding adventure. Tools evolve, technologies change, and new methods emerge every day. Versatility, curiosity, and the desire to learn are the qualities that make the difference.&lt;/p&gt;

&lt;p&gt;Build your toolbox by mastering fundamentals like SQL and Python, but don't be afraid to venture onto cloud platforms like BigQuery. Feed your mind through continuous training on platforms like Coursera or by listening to specialized podcasts. Above all, lean on the community: share your projects on GitHub and engage on LinkedIn. Your success as a Data Analyst will depend not only on what you know how to do, but on your ability to evolve with the world of data.&lt;/p&gt;

</description>
      <category>data</category>
      <category>datascience</category>
      <category>chatgpt</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Data Mesh: The Decentralized Revolution That Will Transform Your Data Architecture</title>
      <dc:creator>Mubarak Mohamed</dc:creator>
      <pubDate>Mon, 01 Sep 2025 10:47:55 +0000</pubDate>
      <link>https://dev.to/moubarakmohame4/data-mesh-the-decentralized-revolution-that-will-transform-your-data-architecture-1nn2</link>
      <guid>https://dev.to/moubarakmohame4/data-mesh-the-decentralized-revolution-that-will-transform-your-data-architecture-1nn2</guid>
      <description>&lt;p&gt;Imagine your data team as a bottleneck. Every time a business team needs to access, analyze, or update data, the request goes through this central team, causing delays, frustration, and a loss of agility. This model is the &lt;strong&gt;data monolith&lt;/strong&gt;, often embodied by a single, centralized &lt;strong&gt;data lake&lt;/strong&gt; or &lt;strong&gt;data warehouse&lt;/strong&gt; that quickly becomes unmanageable.&lt;/p&gt;

&lt;p&gt;Product teams are ready to innovate, but they are held back by dependence on a single source of truth, a single team, and a rigid process. The company's pace grinds to a crawl. So how do we solve this puzzle? Should we simply add more people to the central team? Or is the problem deeper, rooted in the very structure of our architecture?&lt;/p&gt;

&lt;h2&gt;
  
  
  Goodbye Monolith, Hello Mesh
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F592vblb8xapla1c2dzn8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F592vblb8xapla1c2dzn8.png" alt=" " width="800" height="530"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Data Mesh&lt;/strong&gt; is not a new technology. It is a &lt;strong&gt;paradigm shift&lt;/strong&gt; in architecture and organization. The idea is simple but powerful: instead of centralizing all data, why not decentralize it and organize it by business domain?&lt;/p&gt;

&lt;p&gt;Inspired by the &lt;strong&gt;Microservices Architecture&lt;/strong&gt;, Data Mesh proposes treating data not as a passive resource, but as a living &lt;strong&gt;product&lt;/strong&gt;. Each business domain (customers, products, logistics, etc.) becomes the owner and steward of its own data.&lt;/p&gt;

&lt;p&gt;This model is based on four fundamental principles that transform how we manage and use data.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Domain-oriented Data Ownership
&lt;/h3&gt;

&lt;p&gt;This is the core of Data Mesh. Instead of a central team that ingests all the organization's data, the business teams themselves are responsible for their data. The team in charge of products is responsible for product data. The marketing team manages data on advertising campaigns.&lt;/p&gt;

&lt;p&gt;This promotes greater &lt;strong&gt;accountability&lt;/strong&gt; and a better &lt;strong&gt;understanding&lt;/strong&gt; of the data's semantics. The people who create the data are also the ones who manage it, ensuring better quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Data as a Product
&lt;/h3&gt;

&lt;p&gt;In a Data Mesh architecture, data is not just files in a data lake. It's treated as a product in its own right, with clear characteristics and a focus on &lt;strong&gt;consumability&lt;/strong&gt;. A &lt;strong&gt;data product&lt;/strong&gt; must be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Discoverable&lt;/strong&gt;: Easy to find in a data catalog.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Addressable&lt;/strong&gt;: Accessible via a simple interface (API, Kafka stream, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interoperable&lt;/strong&gt;: With clear semantics and rich documentation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trustworthy &amp;amp; high quality&lt;/strong&gt;: Tested and maintained by the producing team.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure&lt;/strong&gt;: Compliant with security and governance policies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This principle ensures that data is no longer a chore but a valuable, ready-to-use resource for any other team.&lt;/p&gt;
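As an illustration only, these characteristics could be recorded as catalog metadata. The following Python sketch uses invented field names and is not a real Data Mesh API; it just shows how a minimal catalog entry might make a data product discoverable and addressable.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a minimal catalog entry capturing the data-product
# characteristics listed above (all names here are illustrative).
@dataclass
class DataProduct:
    name: str              # discoverable: listed in the catalog under this name
    owner_domain: str      # domain-oriented ownership
    endpoint: str          # addressable: e.g. an API URL or Kafka topic
    schema_doc: str        # interoperable: link to schema and documentation
    quality_checks: list = field(default_factory=list)  # trustworthy: producer-run tests
    access_policy: str = "restricted"                   # secure: governance policy applied

catalog = [
    DataProduct(
        name="customer-behavior",
        owner_domain="customers",
        endpoint="kafka://events/customer-behavior",
        schema_doc="https://example.com/schemas/customer-behavior",
        quality_checks=["no-null-ids", "freshness-under-1h"],
    ),
]

# Discoverability in action: a consumer looks a product up by name.
found = next(p for p in catalog if p.name == "customer-behavior")
print(found.owner_domain, found.endpoint)
```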

&lt;h3&gt;
  
  
  3. A Self-Serve Data Platform
&lt;/h3&gt;

&lt;p&gt;For business teams to be truly autonomous, they need tools. A Data Mesh requires a &lt;strong&gt;self-serve data platform&lt;/strong&gt; that provides the necessary infrastructure to create, manage, and expose their data products. This platform serves as an abstraction, allowing teams to focus on business logic without worrying about the technical details of the underlying infrastructure. It provides tools for ingestion, storage, processing, and governance but manages the complexity for end-users.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Federated Computational Governance
&lt;/h3&gt;

&lt;p&gt;If every team does what it wants, the result is chaos. This is where the last principle comes in. Governance is not centralized; it is &lt;strong&gt;federated&lt;/strong&gt;. A governance group defines global standards and rules (e.g., metadata formats, security policies), but the application of those rules is decentralized. "Computational" governance tools automate their enforcement. This ensures &lt;strong&gt;consistency&lt;/strong&gt; and &lt;strong&gt;security&lt;/strong&gt; while preserving team &lt;strong&gt;autonomy&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Concrete Business Case: Data Mesh at an E-Commerce Company
&lt;/h2&gt;

&lt;p&gt;Let's take the example of a large e-commerce platform. Traditionally, all sales, inventory, and customer data are centralized. With a Data Mesh, the organization could be divided into domains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"Products" Domain&lt;/strong&gt;: The team responsible for the product catalog owns the product data. It creates a &lt;strong&gt;"Catalog" data product&lt;/strong&gt; that includes descriptions, prices, categories, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Customers" Domain&lt;/strong&gt;: The customer relationship team manages data on customer behavior. It produces a &lt;strong&gt;"Customer Behavior" data product&lt;/strong&gt; containing purchase history, clicks, and reviews.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Logistics" Domain&lt;/strong&gt;: The supply chain team is responsible for inventory and delivery data. It exposes an &lt;strong&gt;"Inventory Status" data product&lt;/strong&gt; updated in real-time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each team exposes its data products in a standardized way (via REST APIs, Kafka streams, shared tables). The marketing team, for example, can consume the "Customer Behavior" data product to personalize campaigns and the "Inventory Status" data product to ensure they don't promote out-of-stock products. All this without going through a central team, in an &lt;strong&gt;autonomous and fast&lt;/strong&gt; way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Data Mesh Toolkit 🛠️
&lt;/h2&gt;

&lt;p&gt;Implementing a Data Mesh requires an appropriate technical architecture. Here are the types of tools needed, without limiting yourself to a single solution:&lt;/p&gt;

&lt;h3&gt;
  
  
  Ingestion and Streaming Tools
&lt;/h3&gt;

&lt;p&gt;To create and consume data products in real-time.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache Kafka&lt;/strong&gt;: The basis for most streaming architectures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confluent&lt;/strong&gt;: An enterprise platform built on Kafka, with connectors and simplified management.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Platforms
&lt;/h3&gt;

&lt;p&gt;For data storage and processing. Each domain can have its own space, but it must be interoperable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Databricks&lt;/strong&gt;: A powerful data processing engine that unifies data warehousing and machine learning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snowflake&lt;/strong&gt;: A data cloud that allows for great scalability for storage and analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Catalogs and Governance
&lt;/h3&gt;

&lt;p&gt;For data products to be discoverable and manageable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amundsen&lt;/strong&gt;: An open-source data catalog developed by Lyft.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collibra&lt;/strong&gt;: An enterprise data governance and management platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Orchestration Tools
&lt;/h3&gt;

&lt;p&gt;To automate data pipelines within each domain.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dagster&lt;/strong&gt;: A modern orchestrator focused on managing data products.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefect&lt;/strong&gt;: Another orchestration tool that focuses on flexibility and ease of use.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data Mesh is a concrete response to the limitations of traditional data architectures. By decentralizing data ownership, treating it as a product, and providing a self-serve platform, companies can unlock unprecedented agility and scalability.&lt;/p&gt;

&lt;p&gt;It's not a simple project and requires a cultural transformation. But the investment is worth it to free up your teams, accelerate innovation, and make data a true strategic asset.&lt;/p&gt;

&lt;p&gt;And you, how do you manage data in your organization? Would Data Mesh be a solution for your daily challenges? Share your thoughts in the comments below! 👇&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>algorithms</category>
      <category>dataengineering</category>
      <category>programming</category>
    </item>
    <item>
      <title>Google AI Studio: A Free Playground to Experiment with Gemini AI</title>
      <dc:creator>Mubarak Mohamed</dc:creator>
      <pubDate>Fri, 29 Aug 2025 15:38:44 +0000</pubDate>
      <link>https://dev.to/moubarakmohame4/google-ai-studio-a-free-playground-to-experiment-with-gemini-ai-19ki</link>
      <guid>https://dev.to/moubarakmohame4/google-ai-studio-a-free-playground-to-experiment-with-gemini-ai-19ki</guid>
      <description>&lt;p&gt;Artificial intelligence is evolving fast, and both developers and creators are looking for tools that make it easier to test, prototype, and integrate AI into their projects. &lt;strong&gt;Google AI Studio&lt;/strong&gt; is Google’s answer: a free, accessible platform that lets you experiment with &lt;strong&gt;Gemini AI models&lt;/strong&gt; — all without writing a single line of code.&lt;/p&gt;

&lt;p&gt;If you’re curious about AI but find integration too complex, or if you’re a developer who wants to quickly validate ideas before going into production, this tool is definitely worth a try.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features of Google AI Studio
&lt;/h2&gt;

&lt;p&gt;Google AI Studio comes with four main capabilities that unlock a wide range of possibilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Chat with AI
&lt;/h3&gt;

&lt;p&gt;A simple chat interface to interact with Gemini models, test their capabilities, and explore use cases like code explanations, brainstorming ideas, or assisted writing.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Real-Time Stream
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;stream mode&lt;/strong&gt; displays responses as they’re generated in real time. Perfect for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing a story or script and watching dialogues unfold live.&lt;/li&gt;
&lt;li&gt;Designing interactive tutorials where answers adapt as you type.&lt;/li&gt;
&lt;li&gt;Providing instant assistance in an application (e.g., customer support chatbots).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Generate Media
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;media generation&lt;/strong&gt; feature allows you to create images or videos from simple text prompts. Example: &lt;em&gt;“Create a futuristic illustration of a smart city at sunset”&lt;/em&gt; → Google AI Studio generates a ready-to-use image.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Build AI-Powered Apps
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Build&lt;/strong&gt; tab lets you turn experiments into fully working AI applications. With customizable &lt;strong&gt;run settings&lt;/strong&gt;, you can choose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which Gemini model to use,&lt;/li&gt;
&lt;li&gt;Output formats (text, JSON, etc.),&lt;/li&gt;
&lt;li&gt;Voice and resolution for multimedia content.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes it easy to create no-code or low-code AI-powered projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Use Cases
&lt;/h2&gt;

&lt;p&gt;With these features, Google AI Studio can be applied to many scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Content creators&lt;/strong&gt;: generate visuals for blog posts, scripts for YouTube videos, or dialogues for narrative games.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Educators &amp;amp; trainers&lt;/strong&gt;: design interactive tutorials where AI guides learners step by step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entrepreneurs &amp;amp; startups&lt;/strong&gt;: quickly prototype a chatbot, customer support interface, or decision-making assistant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developers&lt;/strong&gt;: test Gemini’s API before integrating it into larger applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Who Is Google AI Studio For?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Beginners&lt;/strong&gt;: The intuitive interface makes it possible to experiment without any coding knowledge. You can generate content or set up an assistant in just a few clicks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Experienced developers&lt;/strong&gt;: A fast sandbox to test prompts, configure advanced parameters, and validate ideas before implementing them in production.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why You Should Try Google AI Studio Now
&lt;/h2&gt;

&lt;p&gt;Google AI Studio lowers the barrier to entry for Gemini models: free, intuitive, and powerful enough to cover a wide range of needs. Whether you’re a developer, data analyst, content creator, or simply curious about AI, it’s an &lt;strong&gt;ideal starting point&lt;/strong&gt; to explore how AI can enhance your projects.&lt;/p&gt;

&lt;p&gt;👉 Start experimenting today: &lt;a href="https://aistudio.google.com/" rel="noopener noreferrer"&gt;Google AI Studio&lt;/a&gt; is live and ready to use.&lt;/p&gt;

</description>
      <category>learngoogleaistudio</category>
      <category>gemini</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Preparing Your Data for the ARIMA Model: The Secret Step to Reliable Forecasts</title>
      <dc:creator>Mubarak Mohamed</dc:creator>
      <pubDate>Thu, 28 Aug 2025 10:27:10 +0000</pubDate>
      <link>https://dev.to/moubarakmohame4/preparing-your-data-for-the-arima-model-the-secret-step-to-reliable-forecasts-2h9</link>
      <guid>https://dev.to/moubarakmohame4/preparing-your-data-for-the-arima-model-the-secret-step-to-reliable-forecasts-2h9</guid>
      <description>&lt;p&gt;Before making predictions, we need to make sure our data is ready.&lt;br&gt;
A raw time series often contains trends or fluctuations that can &lt;strong&gt;mislead a forecasting model&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;ARIMA&lt;/strong&gt; model has one key requirement: it only works properly with stationary series.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;stationary series&lt;/strong&gt; is one whose statistical properties (mean, variance, autocorrelation) remain stable over time.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;non-stationary series&lt;/strong&gt;, on the other hand, changes significantly over time (for example, with a strong trend or seasonality).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without this preparation, ARIMA may produce &lt;strong&gt;biased or unreliable forecasts&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In the previous article (&lt;a href="https://dev.to/moubarakmohame4/how-time-series-reveal-the-future-an-introduction-to-the-arima-model-2k81"&gt;How Time Series Reveal the Future: An Introduction to the ARIMA Model&lt;/a&gt;), we explored what a &lt;strong&gt;time series&lt;/strong&gt; is, its components (trend, seasonality, noise), and the intuition behind ARIMA.&lt;br&gt;
We also visualized the AirPassengers dataset, which showed a &lt;strong&gt;steady upward trend&lt;/strong&gt; and yearly seasonality.&lt;/p&gt;

&lt;p&gt;👉 But for ARIMA to work, our data must satisfy one key condition: &lt;strong&gt;stationarity&lt;/strong&gt;.&lt;br&gt;
That’s exactly what this article is about: &lt;strong&gt;transforming a non-stationary series into a stationary one&lt;/strong&gt; using simple techniques (differencing, statistical tests).&lt;br&gt;
In other words: after &lt;strong&gt;observing&lt;/strong&gt;, we now move on to &lt;strong&gt;preparing&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Simplified Theory
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;What is stationarity?&lt;/strong&gt;
A &lt;strong&gt;stationary series&lt;/strong&gt; is one whose statistical properties (mean, variance, autocorrelation) remain &lt;strong&gt;stable over time&lt;/strong&gt;.
👉 Example: daily winter temperatures in a city (around a stable mean with small fluctuations).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A &lt;strong&gt;non-stationary series&lt;/strong&gt; changes too much over time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Trend&lt;/strong&gt; (e.g., a constant increase in smartphone sales).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Seasonality&lt;/strong&gt; (e.g., ice cream sales peaking every summer).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ARIMA assumes the series is stationary: otherwise, it “believes” past trends will continue forever, leading to &lt;strong&gt;biased forecasts&lt;/strong&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Differencing&lt;/strong&gt;
To make a series stationary, we use differencing:
Y′_t = Y_t − Y_{t−1}
In other words, each value is replaced by the &lt;strong&gt;change between two successive periods&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;This removes linear trends.&lt;/li&gt;
&lt;li&gt;For strong seasonality, we can apply &lt;strong&gt;seasonal differencing&lt;/strong&gt; (e.g., the difference with the value one year earlier).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example: instead of analyzing raw monthly sales, we analyze the &lt;strong&gt;month-to-month change.&lt;/strong&gt;&lt;/p&gt;
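&lt;p&gt;A minimal sketch of this idea with pandas (the sales numbers are made up for illustration): a series with a steady upward trend turns into a constant series of changes after one difference.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical monthly sales with a steady upward trend (illustrative numbers)
sales = pd.Series([100, 110, 120, 130, 140])

# First difference: each value becomes the change from the previous month
change = sales.diff().dropna()
print(change.tolist())  # → [10.0, 10.0, 10.0, 10.0]: the trend is gone
```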

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Statistical tests (ADF &amp;amp; KPSS)&lt;/strong&gt;
To check whether a series is stationary, we use two complementary tests.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;ADF (Augmented Dickey-Fuller Test)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Null hypothesis (H₀): the series is non-stationary.&lt;/li&gt;
&lt;li&gt;If p-value &amp;lt; 0.05 → reject H₀ → the series is stationary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;KPSS (Kwiatkowski-Phillips-Schmidt-Shin Test)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Null hypothesis (H₀): the series is stationary.&lt;/li&gt;
&lt;li&gt;If p-value &amp;lt; 0.05 → reject H₀ → the series is non-stationary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In practice:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We apply &lt;strong&gt;both tests&lt;/strong&gt; for robustness.&lt;/li&gt;
&lt;li&gt;If ADF and KPSS disagree, we refine with additional transformations.&lt;/li&gt;
&lt;/ul&gt;
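&lt;p&gt;The combined reading of the two tests can be sketched as a small helper function. This is an illustrative convention (the &lt;code&gt;interpret&lt;/code&gt; name and the 0.05 threshold are our choices, not part of statsmodels):&lt;/p&gt;

```python
def interpret(adf_p, kpss_p, alpha=0.05):
    """Combine ADF and KPSS p-values into a single verdict.
    ADF H0: non-stationary. KPSS H0: stationary."""
    adf_says_stationary = alpha > adf_p      # ADF p-value below alpha: reject H0
    kpss_says_stationary = kpss_p >= alpha   # KPSS p-value above alpha: keep H0
    if adf_says_stationary and kpss_says_stationary:
        return "stationary"
    if not adf_says_stationary and not kpss_says_stationary:
        return "non-stationary"
    return "inconclusive: refine with further transformations"

print(interpret(0.01, 0.30))  # → stationary
print(interpret(0.60, 0.01))  # → non-stationary
```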
&lt;h2&gt;
  
  
  Hands-on in Python
&lt;/h2&gt;

&lt;p&gt;We’ll use a simple time series dataset: the annual flow of the Nile River (built into statsmodels).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.datasets import nile

# Load Nile dataset
data = nile.load_pandas().data
data.index = pd.date_range(start="1871", periods=len(data), freq="YS")  # annual; the bare "Y" alias is deprecated in recent pandas
series = data['volume']

# Plot series
plt.figure(figsize=(10,4))
plt.plot(series)
plt.title("Annual Nile River Flow (1871–1970)")
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8uvyfndfbn2dj4bed0k1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8uvyfndfbn2dj4bed0k1.png" alt=" " width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check stationarity with ADF &amp;amp; KPSS&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def adf_test(series):
    result = adfuller(series, autolag='AIC')
    print("ADF Statistic:", result[0])
    print("p-value:", result[1])
    if result[1] &amp;lt; 0.05:
        print("✅ The series is stationary (ADF).")
    else:
        print("❌ The series is NON-stationary (ADF).")

def kpss_test(series):
    result = kpss(series, regression='c', nlags="auto")
    print("KPSS Statistic:", result[0])
    print("p-value:", result[1])
    if result[1] &amp;lt; 0.05:
        print("❌ The series is NON-stationary (KPSS).")
    else:
        print("✅ The series is stationary (KPSS).")

print("ADF Test")
adf_test(series)
print("\nKPSS Test")
kpss_test(series)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzig3g98us8l6033wuohl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzig3g98us8l6033wuohl.png" alt=" " width="800" height="184"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apply differencing&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;series_diff = series.diff().dropna()

plt.figure(figsize=(10,4))
plt.plot(series_diff)
plt.title("Differenced series (1st difference)")
plt.show()

print("ADF Test after differencing")
adf_test(series_diff)
print("\nKPSS Test after differencing")
kpss_test(series_diff)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg430sm7hieyvlhefhjmc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg430sm7hieyvlhefhjmc.png" alt=" " width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding ARIMA and its parameters
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;ARIMA(p, d, q)&lt;/strong&gt; model combines three parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;AR (AutoRegressive, p)&lt;/strong&gt;&lt;br&gt;
Uses past values to predict the future.&lt;br&gt;
Example: if p = 2, the current value depends on the last 2 values.&lt;br&gt;
Formula: Y_t = ϕ₁Y_{t−1} + ϕ₂Y_{t−2} + ϵ_t&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;I (Integrated, d)&lt;/strong&gt;&lt;br&gt;
Number of differences applied to make the series stationary.&lt;br&gt;
Example: d = 0 → no differencing; d = 1 → one difference applied.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;MA (Moving Average, q)&lt;/strong&gt;&lt;br&gt;
Uses past errors (residuals) for prediction.&lt;br&gt;
Example: if q = 2, the prediction depends on the last two errors.&lt;br&gt;
Formula: Y_t = θ₁ϵ_{t−1} + θ₂ϵ_{t−2} + ϵ_t&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In short:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;p = past values memory&lt;/li&gt;
&lt;li&gt;d = differencing degree&lt;/li&gt;
&lt;li&gt;q = past errors memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example in Python&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from statsmodels.tsa.arima.model import ARIMA

# ARIMA(1,1,1)
model = ARIMA(series, order=(1,1,1))
fit = model.fit()

print(fit.summary())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Typical output:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AR (p) → past values effect&lt;/li&gt;
&lt;li&gt;I (d) → differencing applied&lt;/li&gt;
&lt;li&gt;MA (q) → past errors effect&lt;/li&gt;
&lt;li&gt;AIC/BIC → model quality (lower = better)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Choosing the best parameters (p,d,q)
&lt;/h3&gt;

&lt;p&gt;One of the main challenges with ARIMA is selecting the right p, d, q.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Choosing p and q with ACF &amp;amp; PACF&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ACF → helps to choose q (MA part).&lt;/li&gt;
&lt;li&gt;PACF → helps to choose p (AR part).
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vvxpy76j6yapb9uz8nj.png" alt=" " width="579" height="424"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Simple rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PACF cutoff → good candidate for p.&lt;/li&gt;
&lt;li&gt;ACF cutoff → good candidate for q.&lt;/li&gt;
&lt;/ul&gt;
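&lt;p&gt;To build intuition without plotting, pandas can compute individual autocorrelations directly via &lt;code&gt;Series.autocorr&lt;/code&gt;. On a synthetic AR(1) series (illustrative, not the Nile data), the ACF decays gradually instead of cutting off, which is the classic AR signature:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Synthetic AR(1) series with phi = 0.8 (illustrative)
rng = np.random.default_rng(42)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.8 * y[t - 1] + rng.normal()
s = pd.Series(y)

# Sample autocorrelations at lags 1-3 decay gradually (AR signature);
# a sharp PACF cutoff at lag 1 would then point to p = 1
acfs = [round(s.autocorr(lag=lag), 2) for lag in (1, 2, 3)]
print(acfs)
```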

</description>
      <category>beginners</category>
      <category>python</category>
      <category>programming</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>How Time Series Reveal the Future: An Introduction to the ARIMA Model</title>
      <dc:creator>Mubarak Mohamed</dc:creator>
      <pubDate>Wed, 27 Aug 2025 12:12:13 +0000</pubDate>
      <link>https://dev.to/moubarakmohame4/how-time-series-reveal-the-future-an-introduction-to-the-arima-model-2k81</link>
      <guid>https://dev.to/moubarakmohame4/how-time-series-reveal-the-future-an-introduction-to-the-arima-model-2k81</guid>
      <description>&lt;p&gt;Imagine you manage a supermarket. Every Monday, you must decide how much milk, rice, soap, and fruit to order for the week. Too little? Stockouts and unhappy customers. Too much? Excess inventory, waste, and unnecessary costs.&lt;br&gt;
So, how can you predict tomorrow’s demand from yesterday’s purchases?&lt;/p&gt;

&lt;p&gt;That’s the realm of time series—data ordered over time (day by day, month by month). And among the most widely used methods to forecast the future, there’s ARIMA — AutoRegressive Integrated Moving Average.&lt;br&gt;
ARIMA is popular because it’s both interpretable and effective across many real-world domains: sales, weather, energy, healthcare, finance, and more.&lt;/p&gt;

&lt;p&gt;In this opening article, you will:&lt;br&gt;
learn what a time series is and its components (trend, seasonality, noise);&lt;br&gt;
grasp the intuition behind ARIMA (AR, I, MA) without heavy math;&lt;br&gt;
plot a first series in Python to visually detect these patterns.&lt;/p&gt;

&lt;p&gt;Ready? Let’s take it step by step. 👇&lt;/p&gt;
&lt;h2&gt;
  
  
  Transition
&lt;/h2&gt;

&lt;p&gt;This article is the first episode of the series &lt;strong&gt;“Mastering ARIMA for Time Series Analysis and Forecasting.”&lt;/strong&gt;&lt;br&gt;
The goal is clear: to guide you step by step, from the basics to practical applications on real-world projects.&lt;/p&gt;

&lt;p&gt;In every article, you will find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a simplified theoretical section (to understand without heavy math),&lt;/li&gt;
&lt;li&gt;a hands-on Python example (to directly work with data),&lt;/li&gt;
&lt;li&gt;a small practical project (to apply your knowledge to a real case).
This way, you’ll progress in a &lt;strong&gt;logical, gradual, and practical&lt;/strong&gt; manner.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Simplified Theory (What is a time series? + ARIMA intuition)
&lt;/h2&gt;

&lt;p&gt;What is a time series?&lt;/p&gt;

&lt;p&gt;A time series is a sequence of data collected at regular intervals.&lt;br&gt;
Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;daily sales in a supermarket,&lt;/li&gt;
&lt;li&gt;hourly temperature,&lt;/li&gt;
&lt;li&gt;stock prices every minute,&lt;/li&gt;
&lt;li&gt;monthly internet subscriptions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three main components describe a time series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trend:&lt;/strong&gt; the long-term overall direction.
Example: gradual increase in internet subscribers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Seasonality:&lt;/strong&gt; recurring, regular fluctuations.
Example: ice cream sales peaking every summer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Noise:&lt;/strong&gt; unpredictable, random variations.
Example: a sudden sales spike due to an unexpected event.&lt;/li&gt;
&lt;/ul&gt;
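&lt;p&gt;These three components can be made concrete with a tiny synthetic series (all numbers are invented for illustration):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Four years of monthly data = trend + seasonality + noise
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
rng = np.random.default_rng(0)
trend = np.linspace(100.0, 200.0, 48)                      # long-term direction
seasonality = 10.0 * np.sin(2 * np.pi * (idx.month / 12))  # yearly cycle
noise = rng.normal(scale=3.0, size=48)                     # random variation
series = pd.Series(trend + seasonality + noise, index=idx)
print(series.round(1).head(3))
```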

&lt;p&gt;&lt;strong&gt;The intuition behind ARIMA&lt;/strong&gt;&lt;br&gt;
The &lt;strong&gt;ARIMA&lt;/strong&gt; model combines three simple yet powerful ideas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AR (Auto-Regressive):&lt;/strong&gt; future values depend on past values.
Example: today’s sales are partly influenced by yesterday’s sales.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I (Integrated)&lt;/strong&gt;: to make the series more stable, we remove trends through differencing (computing the changes between periods).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MA (Moving Average):&lt;/strong&gt; future values adjust based on past forecast errors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 In short:&lt;br&gt;
&lt;strong&gt;ARIMA = memory of the past + trend stabilization + error correction.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Hands-on Python
&lt;/h2&gt;

&lt;p&gt;Before diving into ARIMA, let’s take the first step: visualizing a time series.&lt;br&gt;
We’ll use the famous &lt;a href="https://www.kaggle.com/datasets/rakannimer/air-passengers/data" rel="noopener noreferrer"&gt;AirPassengers&lt;/a&gt; dataset (monthly airline passengers from 1949 to 1960).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import matplotlib.pyplot as plt

# Load AirPassengers dataset
url = "dataset/airline-passengers.csv"
data = pd.read_csv(url, parse_dates=['Month'], index_col='Month')

# Preview first rows
print(data.head())

# Visualization
plt.figure(figsize=(10,5))
plt.plot(data, label='Number of passengers')
plt.title("AirPassengers - Monthly Airline Passengers (1949-1960)")
plt.xlabel("Date")
plt.ylabel("Passengers")
plt.legend()
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8s9tewu82j5dmtav4086.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8s9tewu82j5dmtav4086.png" alt=" " width="800" height="421"&gt;&lt;/a&gt;&lt;br&gt;
Expected result:&lt;/p&gt;

&lt;p&gt;A curve showing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a rising trend (more passengers over the years),&lt;/li&gt;
&lt;li&gt;a yearly seasonality (summer peaks, winter drops).
This first visualization is essential: it helps us spot the patterns that ARIMA will later model.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Real-world use cases
&lt;/h2&gt;

&lt;p&gt;Time series models like &lt;strong&gt;ARIMA&lt;/strong&gt; are widely used to forecast the future based on past data. Here are some concrete examples:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predict stock prices or market indices.&lt;/li&gt;
&lt;li&gt;Anticipate trends to make better investment decisions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Sales / Retail:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Estimate product demand to avoid shortages or excess stock.&lt;/li&gt;
&lt;li&gt;Plan inventory and promotions according to seasonality.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Public Health:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track the progression of epidemics such as seasonal flu or COVID-19.&lt;/li&gt;
&lt;li&gt;Forecast medical resource needs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weather / Energy:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predict temperatures, rainfall, or electricity consumption.&lt;/li&gt;
&lt;li&gt;Help companies and municipalities manage resources efficiently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Transport / Logistics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Forecast traffic or public transport passenger numbers.&lt;/li&gt;
&lt;li&gt;Optimize schedules and resource allocation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;ARIMA&lt;/strong&gt; is not just theory: it is a &lt;strong&gt;practical tool&lt;/strong&gt; for solving real problems across almost all sectors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluation / Results
&lt;/h2&gt;

&lt;p&gt;At this stage, we haven’t applied the ARIMA model yet, but we can already &lt;strong&gt;draw important insights&lt;/strong&gt; from visual exploration:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identifying the trend:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The AirPassengers chart clearly shows a steady increase in passengers over the years.&lt;/li&gt;
&lt;li&gt;Understanding this trend helps prepare future forecasts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Identifying seasonality:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every year, there are recurring summer peaks, typical of yearly seasonality.&lt;/li&gt;
&lt;li&gt;Seasonality must be considered for accurate predictions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Recognizing noise:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unpredictable variations appear: some months deviate from the trend or seasonality.&lt;/li&gt;
&lt;li&gt;ARIMA will later help correct past errors and reduce the impact of noise.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Exploratory analysis is the first crucial step in any time series modeling.&lt;br&gt;
Before modeling with ARIMA, it’s essential to understand the series’ structure: trend, seasonality, and noise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion and recap
&lt;/h2&gt;

&lt;p&gt;In this introductory article, you’ve learned the basics to &lt;strong&gt;understand and explore time series:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;What a time series is:&lt;/strong&gt; data collected at regular intervals with trend, seasonality, and noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The intuition behind ARIMA:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AR (Auto-Regressive): memory of the past,&lt;/li&gt;
&lt;li&gt;I (Integrated): series stabilization,&lt;/li&gt;
&lt;li&gt;MA (Moving Average): error correction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Python visualization:&lt;/strong&gt; using the AirPassengers dataset to observe trend and seasonality.&lt;br&gt;
&lt;strong&gt;Real-world use cases:&lt;/strong&gt; finance, sales, health, weather, transport… ARIMA is everywhere forecasting is needed.&lt;br&gt;
&lt;strong&gt;Exploratory evaluation:&lt;/strong&gt; visualization already helps understand patterns and prepares for modeling.&lt;/p&gt;

&lt;p&gt;This first step is crucial: understanding your data comes before modeling. The quality of your forecasts depends directly on this understanding.&lt;/p&gt;

&lt;p&gt;Now that we’ve &lt;strong&gt;explored and understood the time series&lt;/strong&gt;, it’s time to move to the next step: preparing data for ARIMA.&lt;/p&gt;

&lt;p&gt;In the next article, we’ll cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how to check if a series is &lt;strong&gt;stationary&lt;/strong&gt;,&lt;/li&gt;
&lt;li&gt;how to apply &lt;strong&gt;differencing&lt;/strong&gt; to make a series stationary,&lt;/li&gt;
&lt;li&gt;which statistical tests to use: &lt;strong&gt;ADF (Augmented Dickey-Fuller)&lt;/strong&gt; and &lt;strong&gt;KPSS&lt;/strong&gt;,&lt;/li&gt;
&lt;li&gt;and how to interpret these tests to decide the parameters of our ARIMA model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This step is crucial: ARIMA works &lt;strong&gt;best on stationary series&lt;/strong&gt;, and proper data preparation ensures more accurate forecasts.&lt;/p&gt;

&lt;p&gt;See you in the next article to &lt;strong&gt;move from exploration to modeling!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>python</category>
      <category>deeplearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>Building an ETL Pipeline with Python Using CoinGecko API</title>
      <dc:creator>Mubarak Mohamed</dc:creator>
      <pubDate>Thu, 20 Feb 2025 08:20:02 +0000</pubDate>
      <link>https://dev.to/moubarakmohame4/building-an-etl-pipeline-with-python-using-coingecko-api-7oc</link>
      <guid>https://dev.to/moubarakmohame4/building-an-etl-pipeline-with-python-using-coingecko-api-7oc</guid>
      <description>&lt;p&gt;Extract, Transform, Load (ETL) is a fundamental process in data engineering used to collect data from various sources, process it, and store it in a structured format for analysis. In this tutorial, we will build a simple ETL pipeline using Python and the CoinGecko API to extract cryptocurrency market data, transform it into a structured format, and load it into a SQLite file for further use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;To follow along, you need to have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python installed (&amp;gt;=3.7)&lt;/li&gt;
&lt;li&gt;requests and pandas libraries installed&lt;/li&gt;
&lt;li&gt;A CoinGecko API key (optional but recommended for higher request limits)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can install the required libraries using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 1: Extract Data from CoinGecko API
&lt;/h2&gt;

&lt;p&gt;The extraction phase involves fetching cryptocurrency market data from the CoinGecko API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
import pandas as pd

def extract_data_from_api():
    url = "https://api.coingecko.com/api/v3/coins/markets?vs_currency=usd"

    headers = {
        "accept": "application/json",
        "x-cg-demo-api-key": "YOUR_API_KEY_HERE",  # Replace with your API key
    }

    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        data = response.json()
        return pd.json_normalize(data)  # Convert JSON response to DataFrame
    else:
        raise Exception("Error fetching data from API")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Transform the Data
&lt;/h2&gt;

&lt;p&gt;Transformation is necessary to clean and structure the data before loading it. We'll select relevant columns and rename them for clarity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def transform_data(df):
    df_transformed = df[["id", "symbol", "name", "current_price", "market_cap", "total_volume", "price_change_percentage_24h", "last_updated"]].copy()
    df_transformed.columns = ["id", "symbol", "name", "price_usd", "market_cap_usd", "volume_24h_usd", "price_change_24h_percent", "date"]
    df_transformed["date"] = pd.to_datetime(df_transformed["date"]).dt.date
    df_transformed = df_transformed.fillna(0)
    return df_transformed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
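&lt;p&gt;To see the transform in isolation, it can be exercised on a tiny hand-made row (the sample values are invented; the column names mirror the CoinGecko &lt;code&gt;/coins/markets&lt;/code&gt; response used above):&lt;/p&gt;

```python
import pandas as pd

# transform_data from the article, applied to a hand-made one-row sample
def transform_data(df):
    df_t = df[["id", "symbol", "name", "current_price", "market_cap",
               "total_volume", "price_change_percentage_24h", "last_updated"]].copy()
    df_t.columns = ["id", "symbol", "name", "price_usd", "market_cap_usd",
                    "volume_24h_usd", "price_change_24h_percent", "date"]
    df_t["date"] = pd.to_datetime(df_t["date"]).dt.date
    return df_t.fillna(0)

sample = pd.DataFrame([{
    "id": "bitcoin", "symbol": "btc", "name": "Bitcoin",
    "current_price": 97000.0, "market_cap": 1.9e12, "total_volume": 3.2e10,
    "price_change_percentage_24h": None,       # missing value on purpose
    "last_updated": "2025-02-20T08:00:00.000Z",
}])
out = transform_data(sample)
print(out.loc[0, "price_change_24h_percent"])  # → 0.0 (NaN filled with 0)
```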



&lt;h2&gt;
  
  
  Step 3: Load Data into SQLite Database
&lt;/h2&gt;

&lt;p&gt;The final step is to store the processed data into an SQLite database for further analysis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sqlite3

def load_data_to_sqlite(df, db_file, table_name):
    # Connect to the database
    conn = sqlite3.connect(db_file)
    cursor = conn.cursor()

    # Create the table if it does not exist
    cursor.execute(f"""
        CREATE TABLE IF NOT EXISTS {table_name} (
            id TEXT PRIMARY KEY,
            symbol TEXT,
            name TEXT,
            price_usd REAL,
            market_cap_usd REAL,
            volume_24h_usd REAL,
            price_change_24h_percent REAL,
            date TEXT
        )
    """)

    # Load the data into the table. Note: if_exists='replace' drops and
    # recreates the table, so the schema above mainly documents the layout.
    df.to_sql(table_name, conn, if_exists='replace', index=False)

    # Commit and close the connection
    conn.commit()
    conn.close()
    print(f"Data successfully loaded into table '{table_name}' in database {db_file}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Putting It All Together
&lt;/h2&gt;

&lt;p&gt;Now, we can orchestrate the entire ETL process using a main function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def etl_pipeline():
    # Extraction 
    df_crypto = extract_data_from_api()
    print("Extraction successful")

    # Transformation
    df_transformed = transform_data(df_crypto)
    print("Transformation successful")

    # Loading
    db_file = "database.db"
    table_name = "crypto_data"
    load_data_to_sqlite(df_transformed, db_file, table_name)
    print("Loading successful")

if __name__ == "__main__":
    etl_pipeline()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tutorial demonstrated how to build a simple ETL pipeline in Python using the CoinGecko API. We covered extracting data from the API, transforming it into a structured format, and loading it into an SQLite database. This pipeline can be extended to store data in a cloud database, automate execution using cron jobs, or integrate with data visualization tools.&lt;/p&gt;

&lt;p&gt;Happy coding!&lt;/p&gt;

</description>
      <category>data</category>
      <category>etl</category>
      <category>cryptocurrency</category>
      <category>programming</category>
    </item>
    <item>
      <title>Understanding Python Terminology: Module, Package, Library, and Framework</title>
      <dc:creator>Mubarak Mohamed</dc:creator>
      <pubDate>Thu, 26 Dec 2024 11:50:26 +0000</pubDate>
      <link>https://dev.to/moubarakmohame4/understanding-python-terminology-module-package-library-and-framework-424f</link>
      <guid>https://dev.to/moubarakmohame4/understanding-python-terminology-module-package-library-and-framework-424f</guid>
      <description>&lt;p&gt;When starting to learn a programming language, one of the first challenges is getting familiar with the terminology. In Python, terms like &lt;strong&gt;module&lt;/strong&gt;, &lt;strong&gt;package&lt;/strong&gt;, &lt;strong&gt;library&lt;/strong&gt;, and &lt;strong&gt;framework&lt;/strong&gt; are commonly used, but their distinctions aren’t always clear to beginners. This article aims to explain these concepts clearly and highlight their differences with examples.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. The Module
&lt;/h3&gt;

&lt;p&gt;A module in Python is simply a file that contains Python code. This file has a &lt;code&gt;.py&lt;/code&gt; extension and can include functions, classes, variables, and executable code. Modules allow you to reuse code by importing it into other files.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example:
&lt;/h4&gt;

&lt;p&gt;Let’s create a file &lt;code&gt;math_utils.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# math_utils.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;subtract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This module can then be imported and used in another script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;math_utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Outputs 8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  2. The Package
&lt;/h3&gt;

&lt;p&gt;A package is a folder containing multiple modules and a special file named &lt;code&gt;__init__.py&lt;/code&gt;. This file allows Python to treat the folder as a package. Packages are used to organize code by grouping related modules.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example:
&lt;/h4&gt;

&lt;p&gt;Package structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;math_tools/
    __init__.py
    algebra.py
    geometry.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;algebra.py&lt;/code&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;solve_linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;geometry.py&lt;/code&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;area_circle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;radius&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pi&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pi&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;radius&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;math_tools.algebra&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;solve_linear&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;math_tools.geometry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;area_circle&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;solve_linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# Outputs 2.0
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;area_circle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;      &lt;span class="c1"&gt;# Outputs 28.27
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  3. The Library
&lt;/h3&gt;

&lt;p&gt;The term &lt;strong&gt;library&lt;/strong&gt; is often used to describe a collection of ready-to-use packages or modules. A library can contain several packages serving various purposes.&lt;/p&gt;

&lt;p&gt;For example, &lt;strong&gt;Requests&lt;/strong&gt; is a popular Python library for making HTTP requests. It includes several internal modules and packages working together to provide a user-friendly interface.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://api.example.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Some people use the terms &lt;em&gt;library&lt;/em&gt; and &lt;em&gt;package&lt;/em&gt; interchangeably, and this confusion is understandable. The difference often lies in the scale and context of use.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. The Framework
&lt;/h3&gt;

&lt;p&gt;A framework is a structured library designed with a specific purpose. Unlike a simple library that provides tools, a framework enforces an architecture and a way of working. In Python, frameworks are commonly used for web development, data analysis, or artificial intelligence.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example: Flask (Web Framework)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;flask&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;home&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Welcome to my website!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;debug&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Flask imposes a minimalist structure but provides essential tools to develop a web application.&lt;/p&gt;




&lt;h3&gt;
  
  
  Summary of Differences
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Module&lt;/td&gt;
&lt;td&gt;Single Python file containing code.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;math_utils.py&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Package&lt;/td&gt;
&lt;td&gt;Folder containing multiple modules and an &lt;code&gt;__init__.py&lt;/code&gt; file.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;math_tools/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Library&lt;/td&gt;
&lt;td&gt;Collection of modules or packages for various needs.&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Requests&lt;/code&gt;, &lt;code&gt;NumPy&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Framework&lt;/td&gt;
&lt;td&gt;Structured library with an enforced architecture.&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Flask&lt;/code&gt;, &lt;code&gt;Django&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;These distinctions are essential to better understand the Python ecosystem and organize your projects effectively. However, the boundary between some terms, such as &lt;em&gt;library&lt;/em&gt; and &lt;em&gt;package&lt;/em&gt;, can be blurry, and their usage may vary from person to person.&lt;/p&gt;

&lt;p&gt;I am open to discussions and debates if you have a different perspective or points to add. Feel free to share your ideas or ask questions!&lt;/p&gt;

</description>
      <category>python</category>
      <category>module</category>
      <category>pip</category>
      <category>django</category>
    </item>
    <item>
      <title>10 Statistical Terms to Know as a Data Analyst</title>
      <dc:creator>Mubarak Mohamed</dc:creator>
      <pubDate>Sat, 21 Dec 2024 06:57:13 +0000</pubDate>
      <link>https://dev.to/moubarakmohame4/10-statistical-terms-to-know-as-a-data-analyst-15hh</link>
      <guid>https://dev.to/moubarakmohame4/10-statistical-terms-to-know-as-a-data-analyst-15hh</guid>
      <description>&lt;p&gt;As a data analyst, mastering statistical concepts is essential to explore, interpret, and effectively present data. Here are 10 key terms explained concisely with practical examples to illustrate their utility.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. &lt;strong&gt;Mean (or Arithmetic Mean)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The mean is calculated by dividing the sum of all values by the total number of values. It represents a central tendency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Suppose the daily sales of a product are: 100, 120, 140, 160, 180. The mean is:&lt;br&gt;
Mean = (100 + 120 + 140 + 160 + 180)/5 = 140.&lt;br&gt;
&lt;strong&gt;Utility:&lt;/strong&gt; The mean helps determine a representative value, for example, the average revenue per customer in a business.&lt;/p&gt;
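&lt;p&gt;The sales example can be checked in a couple of lines:&lt;/p&gt;

```python
daily_sales = [100, 120, 140, 160, 180]
mean = sum(daily_sales) / len(daily_sales)  # sum of values over their count
print(mean)  # 140.0
```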




&lt;h3&gt;
  
  
  2. &lt;strong&gt;Median&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The median is the middle value of a sorted dataset. If the number of values is even, it is the average of the two middle values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; For salaries of €1500, €2000, €2500, €3000, €8000, the median is €2500. It is not influenced by the extreme value of €8000.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Utility:&lt;/strong&gt; The median is useful for analyzing skewed data, such as salaries, often biased by high values.&lt;/p&gt;
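&lt;p&gt;Checking the salary example with the standard library's &lt;code&gt;statistics&lt;/code&gt; module makes the outlier effect visible:&lt;/p&gt;

```python
import statistics

salaries = [1500, 2000, 2500, 3000, 8000]
med = statistics.median(salaries)
avg = statistics.mean(salaries)
print(med)  # 2500 — unaffected by the 8000 outlier
print(avg)  # 3400 — pulled upward by it
```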




&lt;h3&gt;
  
  
  3. &lt;strong&gt;Variance and Standard Deviation&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Variance:&lt;/strong&gt; The average of the squared deviations from the mean; it measures how spread out the data are.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standard Deviation:&lt;/strong&gt; The square root of the variance, expressed in the same unit as the data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; If the usage times of a mobile app are: 10, 12, 10, 8, 15 minutes, a high standard deviation would indicate that the times vary greatly around the mean.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Utility:&lt;/strong&gt; These measures help understand performance stability, such as website loading times.&lt;/p&gt;
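&lt;p&gt;The app-usage example, computed with the &lt;code&gt;statistics&lt;/code&gt; module (&lt;code&gt;pstdev&lt;/code&gt; is the population standard deviation):&lt;/p&gt;

```python
import statistics

times = [10, 12, 10, 8, 15]  # app usage in minutes
spread = statistics.pstdev(times)  # square root of the population variance
print(round(spread, 2))  # 2.37 minutes around the mean of 11
```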




&lt;h3&gt;
  
  
  4. &lt;strong&gt;Normal Distribution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A symmetrical bell-shaped distribution around the mean.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Human heights often follow a normal distribution: most people have a height close to the mean, with fewer people being very tall or very short.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Utility:&lt;/strong&gt; Useful for predicting typical behaviors and applying statistical tests like the t-test.&lt;/p&gt;
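&lt;p&gt;A well-known consequence is the 68% rule: about 68% of values fall within one standard deviation of the mean. The standard library's &lt;code&gt;NormalDist&lt;/code&gt; lets us verify this directly:&lt;/p&gt;

```python
from statistics import NormalDist

nd = NormalDist(mu=0, sigma=1)  # standard normal distribution
within_1sd = nd.cdf(1) - nd.cdf(-1)  # probability mass between -1 and +1
print(round(within_1sd, 4))  # 0.6827
```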




&lt;h3&gt;
  
  
  5. &lt;strong&gt;Correlation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Correlation measures the strength of the linear relationship between two variables, expressed between -1 (perfect negative correlation) and +1 (perfect positive correlation).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A company may find a positive correlation between advertising budget and sales.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Utility:&lt;/strong&gt; Identifying potential relationships to make strategic decisions, such as optimizing marketing campaigns.&lt;/p&gt;
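&lt;p&gt;A quick sketch of the advertising example with made-up figures, using NumPy's &lt;code&gt;corrcoef&lt;/code&gt; (Pearson correlation):&lt;/p&gt;

```python
import numpy as np

ad_budget = [10, 20, 30, 40, 50]  # hypothetical monthly spend
sales = [12, 24, 33, 41, 52]      # hypothetical sales figures
r = np.corrcoef(ad_budget, sales)[0, 1]  # off-diagonal of the 2x2 matrix
print(round(r, 3))  # 0.998: a strong positive correlation
```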




&lt;h3&gt;
  
  
  6. &lt;strong&gt;Probability&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Probability assesses the chance of an event occurring, expressed between 0 (impossible) and 1 (certain).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; If an e-commerce site has 500 visitors and 50 make a purchase, the probability of conversion is:&lt;br&gt;
P(Conversion) = 50/500 = 0.1 = 10%.&lt;br&gt;
&lt;strong&gt;Utility:&lt;/strong&gt; Estimating the likelihood of success for an action, such as click-through rates for an ad campaign.&lt;/p&gt;




&lt;h3&gt;
  
  
  7. &lt;strong&gt;P-value&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The p-value is the probability of observing results at least as extreme as those obtained, assuming the null hypothesis is true.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; In an A/B test, if the p-value is less than 0.05, the null hypothesis (both versions are the same) is rejected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Utility:&lt;/strong&gt; Validates the effectiveness of a change (e.g., a design modification).&lt;/p&gt;
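&lt;p&gt;As a concrete illustration (a coin-flip example rather than the A/B test above), the one-sided p-value for observing 60 or more heads in 100 fair-coin flips can be computed exactly with the standard library:&lt;/p&gt;

```python
from math import comb

n, k = 100, 60
# P(X >= 60) under the null hypothesis of a fair coin
p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
print(round(p_value, 4))
```

Since this p-value is below 0.05, we would reject the hypothesis that the coin is fair.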




&lt;h3&gt;
  
  
  8. &lt;strong&gt;Histogram&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A graph representing the distribution of a variable using value ranges (bars).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A histogram can show the number of users by age range (20-30 years, 30-40 years, etc.).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Utility:&lt;/strong&gt; Quickly visualize data distribution to identify trends or anomalies.&lt;/p&gt;
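&lt;p&gt;A text-mode sketch of the idea, bucketing made-up user ages into decades with &lt;code&gt;Counter&lt;/code&gt;:&lt;/p&gt;

```python
from collections import Counter

ages = [22, 25, 31, 37, 45, 28, 33, 52, 24, 39]
bins = Counter((age // 10) * 10 for age in ages)  # bucket into 10-year ranges
for start in sorted(bins):
    print(f"{start}-{start + 9}: {'#' * bins[start]}")
```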




&lt;h3&gt;
  
  
  9. &lt;strong&gt;Binomial Distribution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Models the number of successes in a series of independent trials with two possible outcomes (success/failure).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; If a product has a 20% chance of being defective, the binomial distribution can predict how many out of 100 will be defective.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Utility:&lt;/strong&gt; Predict outcomes in repetitive processes, such as quality tests.&lt;/p&gt;
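&lt;p&gt;The defective-product example follows directly from the binomial formula P(X = k) = C(n, k) p^k (1-p)^(n-k):&lt;/p&gt;

```python
from math import comb

n, p = 100, 0.2
expected = n * p  # expected number of defective items out of 100
p_exactly_20 = comb(n, 20) * p**20 * (1 - p)**80  # binomial pmf at k = 20
print(expected)           # 20.0
print(round(p_exactly_20, 4))
```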




&lt;h3&gt;
  
  
  10. &lt;strong&gt;Hypothesis Testing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A statistical procedure for deciding whether sample data provide enough evidence to reject a hypothesis about a population.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A company tests if a new interface increases the conversion rate. The null hypothesis is: "The new interface does not improve the conversion rate."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Utility:&lt;/strong&gt; Enables data-driven decision-making while minimizing bias.&lt;/p&gt;




&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;These 10 statistical terms are fundamental for a data analyst. Mastering these concepts allows for effective understanding and communication of analysis results, facilitating data-driven decision-making.&lt;/p&gt;

</description>
      <category>data</category>
      <category>datascience</category>
      <category>statistics</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Build Your Own Artificial Neuron: A Practical Guide for AI Beginners</title>
      <dc:creator>Mubarak Mohamed</dc:creator>
      <pubDate>Wed, 24 Jul 2024 21:50:50 +0000</pubDate>
      <link>https://dev.to/moubarakmohame4/build-your-own-artificial-neuron-a-practical-guide-for-ai-beginners-2lg7</link>
      <guid>https://dev.to/moubarakmohame4/build-your-own-artificial-neuron-a-practical-guide-for-ai-beginners-2lg7</guid>
      <description>&lt;p&gt;Artificial intelligence (AI) is ubiquitous in our daily lives, from product recommendations on e-commerce websites to virtual assistants on our smartphones. But behind these sophisticated technologies lies a fundamental structure: the artificial neuron. Understanding and developing an artificial neuron is a crucial step for anyone looking to dive into the fascinating world of AI. In this article, we will guide you step-by-step through the process of creating your own artificial neuron, breaking down complex concepts into simple terms and providing concrete examples. Whether you're a curious beginner or a technology enthusiast, this practical guide will open the doors to a new dimension of innovation. Get ready to transform your understanding of AI and discover the limitless potential of this rapidly growing field.&lt;/p&gt;

&lt;p&gt;To develop our artificial neuron program, we will start with a dataset containing 100 rows and 2 columns. Each row can be thought of as a plant described by two features: the width and length of its leaves. Our goal is to train the program to distinguish toxic from non-toxic plants using this data. To achieve this, we will follow these steps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe5stklvxtcnvmyyiktxu.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe5stklvxtcnvmyyiktxu.PNG" alt="Image description" width="800" height="543"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 1: Data Acquisition (X, y)&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;X:&lt;/strong&gt; The input data&lt;br&gt;
This is the raw information that the model will process.&lt;br&gt;
Example: For an image recognition model, X could be an array of pixels representing an image. For a house price prediction model, X could include variables such as area, number of rooms, location, etc.&lt;br&gt;
&lt;strong&gt;y:&lt;/strong&gt; The labels&lt;br&gt;
These are the correct answers associated with the input data.&lt;br&gt;
Example: For image recognition, y would be the digit represented in the image. For house price prediction, y would be the actual price of the house.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X, y = make_blobs(n_samples=100, n_features=2, centers=2, random_state=0)
y = y.reshape((y.shape[0], 1))

print('dimensions of X:', X.shape)
print('dimensions of y:', y.shape)

plt.scatter(X[:,0], X[:, 1], c=y, cmap='summer')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0yw3iagttnhrhtkp2zi4.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0yw3iagttnhrhtkp2zi4.PNG" alt="Image description" width="395" height="306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Initialization&lt;/strong&gt;&lt;br&gt;
initialisation(X):&lt;br&gt;
This function randomly initializes the parameters W (one weight per input feature) and the bias b.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def initialisation(X):
    W = np.random.randn(X.shape[1], 1)
    b = np.random.randn(1)
    return (W, b)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Model Construction&lt;/strong&gt;&lt;br&gt;
Model(X, W, b):&lt;br&gt;
The model is a mathematical function that takes the input data X and the model parameters (W and b) to produce a prediction.&lt;br&gt;
W: Weight matrix&lt;br&gt;
Determines the relative importance of each input feature.&lt;br&gt;
b: Bias vector&lt;br&gt;
Allows the model output to be adjusted independently of the inputs.&lt;br&gt;
Activation function:&lt;br&gt;
Transforms the linear output of the model into a non-linear output, allowing for modeling complex relationships.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def model(X, W, b):
    Z = X.dot(W) + b
    A = 1 / (1 + np.exp(-Z))
    return A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
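&lt;p&gt;To make the model concrete, here is a hand-computed pass through it for a single made-up plant with two features and illustrative weights:&lt;/p&gt;

```python
import numpy as np

W = np.array([[1.0], [-1.0]])  # one weight per feature (illustrative values)
b = np.array([0.0])
X = np.array([[2.0, 1.0]])     # one sample with two features

Z = X.dot(W) + b            # linear part: 2*1 + 1*(-1) = 1
A = 1 / (1 + np.exp(-Z))    # sigmoid squashes Z into (0, 1)
print(round(float(A[0, 0]), 3))  # 0.731: the neuron's "probability" output
```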



&lt;p&gt;&lt;strong&gt;Step 4: Error Calculation&lt;/strong&gt;&lt;br&gt;
Cost(A, y):&lt;br&gt;
The cost function measures the discrepancy between the model's predictions (A) and the true labels (y).&lt;br&gt;
Examples of cost functions:&lt;br&gt;
Mean Squared Error (MSE): Used for regression problems.&lt;br&gt;
Cross-entropy: Used for classification problems.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def log_loss(A, y):
    return 1 / len(y) * np.sum(-y * np.log(A) - (1 - y) * np.log(1 - A))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
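&lt;p&gt;A quick numeric sanity check of the cost function: confident, correct predictions should give a loss close to zero (the values below are illustrative):&lt;/p&gt;

```python
import numpy as np

A = np.array([[0.9], [0.1]])  # confident, correct predictions
y = np.array([[1], [0]])      # true labels
loss = 1 / len(y) * np.sum(-y * np.log(A) - (1 - y) * np.log(1 - A))
print(round(loss, 4))  # 0.1054, i.e. -ln(0.9): small loss for good predictions
```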



&lt;p&gt;&lt;strong&gt;Step 5: Model Optimization&lt;/strong&gt;&lt;br&gt;
Gradients(A, X, y):&lt;br&gt;
The gradients indicate the direction in which the parameters W and b should be modified to minimize the cost function.&lt;br&gt;
Update(W, b, dW, db):&lt;br&gt;
The parameters are iteratively updated by following the opposite direction of the gradient.&lt;br&gt;
Optimization algorithms:&lt;br&gt;
Stochastic Gradient Descent (SGD): Updates the parameters after each individual training example.&lt;br&gt;
Batch Gradient Descent: Updates the parameters once per pass over the entire training set (the training loop below uses this approach).&lt;br&gt;
Mini-batch Gradient Descent: Updates on small batches of examples, a compromise between the two.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def gradients(A, X, y):
    dW = 1 / len(y) * np.dot(X.T, A - y)
    db = 1 / len(y) * np.sum(A - y)
    return (dW, db)

def update(dW, db, W, b, learning_rate):
    W = W - learning_rate * dW
    b = b - learning_rate * db
    return (W, b)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
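&lt;p&gt;As a sanity check that the update rule moves in the right direction, this sketch runs one batch-gradient step on a tiny made-up dataset and verifies that the log loss decreases:&lt;/p&gt;

```python
import numpy as np

X = np.array([[0.0], [1.0], [2.0], [3.0]])  # one feature, four samples
y = np.array([[0], [0], [1], [1]])
W = np.zeros((1, 1))
b = np.zeros(1)

def model(X, W, b):
    return 1 / (1 + np.exp(-(X.dot(W) + b)))

def log_loss(A, y):
    return 1 / len(y) * np.sum(-y * np.log(A) - (1 - y) * np.log(1 - A))

A = model(X, W, b)
before = log_loss(A, y)
dW = 1 / len(y) * np.dot(X.T, A - y)   # gradients as in the article
db = 1 / len(y) * np.sum(A - y)
W, b = W - 0.1 * dW, b - 0.1 * db      # one update step, learning_rate = 0.1
after = log_loss(model(X, W, b), y)
print(bool(before > after))  # True: the step lowered the loss
```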



&lt;p&gt;&lt;strong&gt;Step 6: Model Evaluation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.metrics import accuracy_score

def predict(X, W, b):
    A = model(X, W, b)
    return A &amp;gt;= 0.5

def artificial_neuron(X, y, learning_rate = 0.1, n_iter = 100):
    # initialisation W, b
    W, b = initialisation(X)

    Loss = []

    for i in range(n_iter):
        A = model(X, W, b)
        Loss.append(log_loss(A, y))
        dW, db = gradients(A, X, y)
        W, b = update(dW, db, W, b, learning_rate)

    y_pred = predict(X, W, b)
    print(accuracy_score(y, y_pred))

    plt.plot(Loss)
    plt.show()
    return (W, b)  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;W, b = artificial_neuron(X, y)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6f2yqpnh7wjkn27gw5rs.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6f2yqpnh7wjkn27gw5rs.PNG" alt="Image description" width="414" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision boundary&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fig, ax = plt.subplots(figsize=(9, 6))
ax.scatter(X[:,0], X[:, 1], c=y, cmap='summer')

x1 = np.linspace(-1, 4, 100)
x2 = ( - W[0] * x1 - b) / W[1]

ax.plot(x1, x2, c='orange', lw=3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5o0pwd6w6bdelcigvme7.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5o0pwd6w6bdelcigvme7.PNG" alt="Image description" width="558" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>ai</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Time Series in Data Science: Analysis of Bitcoin and Ethereum</title>
      <dc:creator>Mubarak Mohamed</dc:creator>
      <pubDate>Fri, 19 Jul 2024 11:14:46 +0000</pubDate>
      <link>https://dev.to/moubarakmohame4/time-series-in-data-science-analysis-of-bitcoin-and-ethereum-51n3</link>
      <guid>https://dev.to/moubarakmohame4/time-series-in-data-science-analysis-of-bitcoin-and-ethereum-51n3</guid>
      <description>&lt;p&gt;Time series play a crucial role in Data Science, especially when analyzing financial data. The price variations of cryptocurrencies like Bitcoin and Ethereum offer an excellent opportunity to explore time series. In this article, we will analyze the price variations of Bitcoin and Ethereum in euros, using datasets ranging from 2012 to 2019 for Bitcoin and from 2015 to 2019 for Ethereum. We will also illustrate the use of some basic time series techniques with concrete examples and practical recommendations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Importing Libraries and Loading Data&lt;/strong&gt;&lt;br&gt;
Before diving into the analysis, we need to import the necessary libraries and load the datasets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Loading Bitcoin data
btc = pd.read_csv("BTC-EUR.csv", index_col='Date', parse_dates=True)
btc.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqrwyjm2mtpnm3btkc6h.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqrwyjm2mtpnm3btkc6h.PNG" alt="Image description" width="346" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Exploration&lt;/strong&gt;&lt;br&gt;
Let's take a look at the first few rows of the data to get an idea of its structure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;btc.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpg6shkjdgw4oh9sedir.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpg6shkjdgw4oh9sedir.PNG" alt="Image description" width="346" height="169"&gt;&lt;/a&gt;&lt;br&gt;
This allows us to verify that the data has been correctly loaded and that date indexing has been successfully applied.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weekly Variation Analysis&lt;/strong&gt;&lt;br&gt;
Now, let's analyze the weekly variations of Bitcoin's closing prices.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;btc['Close'].resample('W').agg(['mean', 'std'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wq826d50ceqc13liyid.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wq826d50ceqc13liyid.PNG" alt="Image description" width="251" height="326"&gt;&lt;/a&gt;&lt;br&gt;
Recommendation: Resampling is a powerful technique to summarize data at different frequencies (daily, weekly, monthly, etc.). It helps to reveal hidden trends and patterns.&lt;/p&gt;
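&lt;p&gt;If you don't have the BTC-EUR.csv file at hand, the resampling idea can be reproduced on a synthetic series (the dates and values below are made up for illustration):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Fourteen synthetic daily closes; 'W' groups them into weeks ending on Sunday.
idx = pd.date_range("2019-01-01", periods=14, freq="D")
close = pd.Series(np.arange(14.0), index=idx)
weekly = close.resample("W").agg(["mean", "std"])
print(weekly["mean"].iloc[0])  # 2.5: mean of the six days up to Sunday Jan 6
```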

&lt;p&gt;&lt;strong&gt;Data Visualization&lt;/strong&gt;&lt;br&gt;
Visualizing data is crucial to understand trends and anomalies. Let's start by plotting Bitcoin's closing prices.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;btc['Close'].plot(figsize=(9, 6))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvh1dmybi088fkwai1uut.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvh1dmybi088fkwai1uut.PNG" alt="Image description" width="726" height="479"&gt;&lt;/a&gt;&lt;br&gt;
The first time I plotted financial data, I was surprised at how much detail can be hidden in a simple curve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Specific Period Data Analysis&lt;/strong&gt;&lt;br&gt;
We can also focus on specific periods for more detailed analysis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;btc['2019']['Close'].plot(figsize=(9, 6))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakj3q41v0n6yojnsqa8y.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakj3q41v0n6yojnsqa8y.PNG" alt="Image description" width="747" height="493"&gt;&lt;/a&gt;&lt;br&gt;
And for an even shorter period:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;btc['2019-09']['Close'].plot(figsize=(9, 6))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynqy0mdhzt6gvcp1qt7c.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynqy0mdhzt6gvcp1qt7c.PNG" alt="Image description" width="639" height="472"&gt;&lt;/a&gt;&lt;/p&gt;
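&lt;p&gt;This partial-string indexing works on any &lt;code&gt;DatetimeIndex&lt;/code&gt;; here is a self-contained sketch on a synthetic series straddling August and September:&lt;/p&gt;

```python
import pandas as pd

idx = pd.date_range("2019-08-25", periods=14, freq="D")
close = pd.Series(range(14), index=idx, name="Close")
september = close.loc["2019-09"]  # partial string selects the whole month
print(len(september))  # 7: Sep 1 through Sep 7
```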

&lt;p&gt;&lt;strong&gt;Comparison of Monthly and Weekly Averages&lt;/strong&gt;&lt;br&gt;
For deeper analysis, let's compare the monthly and weekly averages of closing prices for the year 2017.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.figure(figsize=(12, 9))
btc.loc['2017', 'Close'].plot()
btc.loc['2017', 'Close'].resample("M").mean().plot(label='Monthly average', lw=2, ls=':', alpha=0.8)
btc.loc['2017', 'Close'].resample("W").mean().plot(label='Weekly average', lw=2, ls='--', alpha=0.8)
plt.legend()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvvy5ftnegwdrnd4e65i.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvvy5ftnegwdrnd4e65i.PNG" alt="Image description" width="663" height="510"&gt;&lt;/a&gt;&lt;br&gt;
Tip: comparing averages at different frequencies can reveal seasonal trends or economic cycles that are invisible in the raw daily data.&lt;/p&gt;
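&lt;p&gt;To see concretely how the resampling frequency changes the number of points you plot, here is a small self-contained sketch on synthetic data (the &lt;code&gt;idx&lt;/code&gt; and &lt;code&gt;close&lt;/code&gt; names are illustrative stand-ins for the real &lt;code&gt;btc['Close']&lt;/code&gt; series, not part of the article's dataset):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Synthetic daily closing prices for 2017, standing in for btc['Close']
idx = pd.date_range('2017-01-01', '2017-12-31', freq='D')
close = pd.Series(100 + np.cumsum(np.random.randn(len(idx))),
                  index=idx, name='Close')

# Resampling to a coarser frequency averages away day-to-day noise:
monthly = close.resample('M').mean()  # 12 points, one per month
weekly = close.resample('W').mean()   # 53 points, one per week

print(len(close), len(weekly), len(monthly))
```

&lt;p&gt;Each resampled series has a far coarser index than the daily one, which is exactly why the dotted monthly curve in the plot above looks so much smoother than the raw prices.&lt;/p&gt;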

&lt;p&gt;&lt;strong&gt;Ethereum Analysis&lt;/strong&gt;&lt;br&gt;
Now, let's analyze the Ethereum data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eth = pd.read_csv('ETH-EUR.csv', index_col='Date', parse_dates=True)
eth.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Merging Bitcoin and Ethereum Data&lt;/strong&gt;&lt;br&gt;
For comparative analysis, we will merge the Bitcoin and Ethereum data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;btc_eth = pd.merge(btc, eth, how='inner', on='Date', suffixes=('_btc', '_eth'))
btc_eth.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr38jx736q5waqtlpl6rk.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr38jx736q5waqtlpl6rk.PNG" alt="Image description" width="796" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comparative Visualization of Variations&lt;/strong&gt;&lt;br&gt;
Finally, let's visualize the variations of both cryptocurrencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;btc_eth[['Close_btc', 'Close_eth']].plot(figsize=(12, 8), subplots=True)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9r5l1mt0us8wf0w1t6nr.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9r5l1mt0us8wf0w1t6nr.PNG" alt="Image description" width="752" height="502"&gt;&lt;/a&gt;&lt;br&gt;
Comparing data from different cryptocurrencies can give us insight into their relative behavior and correlation.&lt;/p&gt;
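&lt;p&gt;One way to quantify that relative behavior is to correlate daily &lt;em&gt;returns&lt;/em&gt; rather than raw prices: two trending price series are almost always correlated, while returns reveal actual day-to-day co-movement. A self-contained sketch on synthetic data (the generated &lt;code&gt;btc_close&lt;/code&gt; and &lt;code&gt;eth_close&lt;/code&gt; series are illustrative, not the article's real CSV data):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Synthetic prices standing in for the merged btc_eth DataFrame
idx = pd.date_range('2019-01-01', periods=200, freq='D')
rng = np.random.default_rng(42)
btc_close = pd.Series(8000 + np.cumsum(rng.normal(0, 50, 200)), index=idx)
# eth loosely follows btc, plus its own noise
eth_close = 150 + 0.1 * btc_close + rng.normal(0, 2, 200)
btc_eth = pd.DataFrame({'Close_btc': btc_close, 'Close_eth': eth_close})

# Correlate daily percentage returns, not raw prices
returns = btc_eth.pct_change().dropna()
corr = returns['Close_btc'].corr(returns['Close_eth'])
print(round(corr, 2))
```

&lt;p&gt;With real data, the same two lines (&lt;code&gt;pct_change()&lt;/code&gt; followed by &lt;code&gt;.corr()&lt;/code&gt;) give a single number summarizing how closely the two coins move together.&lt;/p&gt;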

&lt;p&gt;Time series analysis is an indispensable tool in Data Science, particularly for financial data. By using resampling, visualization, and comparison techniques, we can uncover trends and patterns hidden in the data. Cryptocurrencies, with their volatility and growing popularity, offer an ideal learning ground for these techniques.&lt;/p&gt;

</description>
      <category>timeseries</category>
      <category>datascience</category>
      <category>cryptocurrency</category>
      <category>python</category>
    </item>
  </channel>
</rss>
