<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kelvin Kipyegon</title>
    <description>The latest articles on DEV Community by Kelvin Kipyegon (@kelvin_kipyegon_c09cd9b69).</description>
    <link>https://dev.to/kelvin_kipyegon_c09cd9b69</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2847173%2F29c894ac-8e4d-4ca5-985f-bc5765823f87.png</url>
      <title>DEV Community: Kelvin Kipyegon</title>
      <link>https://dev.to/kelvin_kipyegon_c09cd9b69</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kelvin_kipyegon_c09cd9b69"/>
    <language>en</language>
    <item>
      <title>Inventory Forecasting System From Scratch</title>
      <dc:creator>Kelvin Kipyegon</dc:creator>
      <pubDate>Mon, 16 Mar 2026 05:43:42 +0000</pubDate>
      <link>https://dev.to/kelvin_kipyegon_c09cd9b69/inventory-forecasting-system-from-scratch-fei</link>
      <guid>https://dev.to/kelvin_kipyegon_c09cd9b69/inventory-forecasting-system-from-scratch-fei</guid>
      <description>&lt;h1&gt;
  
  
  Inventory Forecasting System From Scratch
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;By Kelvin Kipyegon&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;I am currently applying for a role that requires demand forecasting and inventory planning. I decided to do a project related to forecasting for my portfolio.&lt;/p&gt;

&lt;p&gt;This post documents exactly what I did, what went wrong, how I fixed it, and what the final result looks like.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Set Out to Build
&lt;/h2&gt;

&lt;p&gt;A system that could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take raw sales data and clean it properly&lt;/li&gt;
&lt;li&gt;Forecast weekly demand for multiple products&lt;/li&gt;
&lt;li&gt;Calculate safety stock and reorder points&lt;/li&gt;
&lt;li&gt;Flag which products needed attention&lt;/li&gt;
&lt;li&gt;Present everything in an Excel report and a live dashboard&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Data
&lt;/h2&gt;

&lt;p&gt;I used the UCI Online Retail dataset — 541,909 real transactions from a wholesale operation, freely available online. It felt close enough to what a real workshop or warehouse operation would have.&lt;/p&gt;

&lt;p&gt;First thing I did was look at what I was working with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_excel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rows: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, Columns: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Rows: 541909, Columns: 8

CustomerID     135080
Description      1454
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Roughly 135,000 rows had no customer ID, about 25% of the dataset. The data was messier than it looked at first glance.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cleaning the Data
&lt;/h2&gt;

&lt;p&gt;Four things needed fixing before I could send anything downstream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CustomerID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;InvoiceNo&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;C&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;df_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Quantity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;UnitPrice&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Revenue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Quantity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;UnitPrice&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After cleaning I had 397,884 rows, down from 541,909, a loss of about 26% of the data. Those rows were returns, cancellations, and test entries. Keeping them would have made every analysis after this point wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  Picking Products to Forecast
&lt;/h2&gt;

&lt;p&gt;I filtered to the top 10 products by total quantity sold, then checked how many weeks of sales history each one had. One product — PAPER CRAFT, LITTLE BIRDIE — had 80,995 units sold in a single week and nothing else. That is not demand history, that is a one-off bulk order. I dropped it.&lt;/p&gt;

&lt;p&gt;I ended up with 8 products, each with between 22 and 53 weeks of consistent weekly sales data.&lt;/p&gt;
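The selection step above can be sketched roughly as follows. This assumes the cleaned frame df_clean from earlier, with the UCI column names InvoiceDate, Description and Quantity; the function name and thresholds are illustrative choices, not my exact script:

```python
import pandas as pd

def select_products(df_clean, top_n=10, min_weeks=20):
    """Pick the top products by volume, then keep only those with
    enough distinct weeks of sales history to forecast."""
    # Top products by total quantity sold.
    top = (df_clean.groupby('Description')['Quantity']
           .sum()
           .nlargest(top_n)
           .index)

    # Weekly sales per selected product.
    weekly = (df_clean[df_clean['Description'].isin(top)]
              .set_index('InvoiceDate')
              .groupby('Description')
              .resample('W')['Quantity']
              .sum())

    # Count weeks with non-zero sales; this is what drops a one-off
    # bulk order like PAPER CRAFT, LITTLE BIRDIE.
    weeks_active = weekly[weekly > 0].groupby('Description').size()
    return weeks_active[weeks_active >= min_weeks].index.tolist()
```

The key move is counting active weeks rather than total volume, so a single huge invoice cannot qualify a product for forecasting.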




&lt;h2&gt;
  
  
  The Forecasting Model
&lt;/h2&gt;

&lt;p&gt;I started with Facebook Prophet, a time-series forecasting library. It did not work well. The reason, which I only understood after some debugging, is that Prophet needs at least two years of data to learn yearly seasonality properly. I had one year. The predictions were off by hundreds of percent on some products.&lt;/p&gt;

&lt;p&gt;So I switched to a 4-week rolling average. Simpler, more honest about what the data could support, and easier to explain to someone who does not work in Python.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;:].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;:])&lt;/span&gt;
    &lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I trained on 80% of the data and tested on the remaining 20%. Forecast error was measured with RMSE, which I then expressed as an accuracy percentage. I avoided MAPE because it breaks when actual demand hits zero, which happened on a few test weeks.&lt;/p&gt;
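For reference, one common way to turn RMSE into an accuracy percentage is to normalise it by average demand. This is a sketch of that idea; the exact formula and the flooring at zero are assumptions I am making here, and my script may differ in detail:

```python
import numpy as np

def forecast_accuracy(actual, predicted):
    """Accuracy = 100 * (1 - RMSE / mean(actual)), floored at 0 so a
    very bad forecast reads as 0% rather than a negative number."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rmse = np.sqrt(np.mean((actual - predicted) ** 2))
    return max(0.0, 100.0 * (1.0 - rmse / actual.mean()))
```

Under this definition a perfect forecast scores 100%, and a forecast whose RMSE matches average demand scores 0%, which is consistent with the 0% rows in the table.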

&lt;p&gt;&lt;strong&gt;Results:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Product&lt;/th&gt;
&lt;th&gt;Avg Weekly Demand&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Rabbit Night Light&lt;/td&gt;
&lt;td&gt;2,763 units&lt;/td&gt;
&lt;td&gt;66.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Assorted Bird Ornament&lt;/td&gt;
&lt;td&gt;899 units&lt;/td&gt;
&lt;td&gt;52.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jumbo Bag Red Retrospot&lt;/td&gt;
&lt;td&gt;1,168 units&lt;/td&gt;
&lt;td&gt;40.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;White Hanging Heart&lt;/td&gt;
&lt;td&gt;699 units&lt;/td&gt;
&lt;td&gt;36.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mini Paint Set Vintage&lt;/td&gt;
&lt;td&gt;625 units&lt;/td&gt;
&lt;td&gt;29.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Popcorn Holder&lt;/td&gt;
&lt;td&gt;1,914 units&lt;/td&gt;
&lt;td&gt;12.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cake Cases&lt;/td&gt;
&lt;td&gt;448 units&lt;/td&gt;
&lt;td&gt;0.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WW2 Gliders&lt;/td&gt;
&lt;td&gt;1,358 units&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Average accuracy: &lt;strong&gt;29.9%&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some of these are weak. The honest interpretation is that products like Cake Cases and WW2 Gliders have highly variable demand, and a rolling average struggles with spikes. The right response is not to force a better-looking number; it is to give those products a larger safety stock buffer, which is exactly what the next step does.&lt;/p&gt;




&lt;h2&gt;
  
  
  Safety Stock and Reorder Points
&lt;/h2&gt;

&lt;p&gt;These are standard inventory management formulas. I used a 95% service level and assumed a 1-week supplier lead time:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Safety Stock&lt;/strong&gt; = 1.65 × Standard Deviation of Demand × √(Lead Time)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reorder Point&lt;/strong&gt; = Average Weekly Demand × Lead Time + Safety Stock&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;safety_stock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.65&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;std_demand&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lead_time_weeks&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;rop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avg_demand&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;lead_time_weeks&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;safety_stock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_stock&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;safety_stock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CRITICAL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;current_stock&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;rop&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;REORDER NOW&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;OK&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Product&lt;/th&gt;
&lt;th&gt;Safety Stock&lt;/th&gt;
&lt;th&gt;Reorder Point&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Rabbit Night Light&lt;/td&gt;
&lt;td&gt;2,523&lt;/td&gt;
&lt;td&gt;3,760&lt;/td&gt;
&lt;td&gt;🔴 CRITICAL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WW2 Gliders&lt;/td&gt;
&lt;td&gt;1,611&lt;/td&gt;
&lt;td&gt;2,657&lt;/td&gt;
&lt;td&gt;🟡 REORDER NOW&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Popcorn Holder&lt;/td&gt;
&lt;td&gt;1,546&lt;/td&gt;
&lt;td&gt;2,474&lt;/td&gt;
&lt;td&gt;🟡 REORDER NOW&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bird Ornament&lt;/td&gt;
&lt;td&gt;895&lt;/td&gt;
&lt;td&gt;1,562&lt;/td&gt;
&lt;td&gt;🟡 REORDER NOW&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;White Hanging Heart&lt;/td&gt;
&lt;td&gt;903&lt;/td&gt;
&lt;td&gt;1,597&lt;/td&gt;
&lt;td&gt;🟡 REORDER NOW&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mini Paint Set&lt;/td&gt;
&lt;td&gt;658&lt;/td&gt;
&lt;td&gt;1,150&lt;/td&gt;
&lt;td&gt;🟡 REORDER NOW&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cake Cases&lt;/td&gt;
&lt;td&gt;642&lt;/td&gt;
&lt;td&gt;1,278&lt;/td&gt;
&lt;td&gt;🟡 REORDER NOW&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jumbo Bag&lt;/td&gt;
&lt;td&gt;791&lt;/td&gt;
&lt;td&gt;1,662&lt;/td&gt;
&lt;td&gt;🟢 OK&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Excel Report
&lt;/h2&gt;

&lt;p&gt;I exported the full table to a color-coded Excel file using openpyxl. The idea was that a procurement team should be able to open this file and know exactly what needs ordering without any explanation.&lt;/p&gt;
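A minimal sketch of that colour-coding step, assuming a results DataFrame with a Status column. The sheet name, column names, and hex colours here are illustrative choices, not my production script:

```python
import pandas as pd
from openpyxl.styles import PatternFill

# Illustrative status-to-colour mapping (hex colours are my choice here).
FILLS = {
    'CRITICAL': PatternFill('solid', start_color='FFC7CE'),     # red
    'REORDER NOW': PatternFill('solid', start_color='FFEB9C'),  # amber
    'OK': PatternFill('solid', start_color='C6EFCE'),           # green
}

def export_report(results, path='inventory_report.xlsx'):
    """Write the status table to Excel and tint each Status cell."""
    with pd.ExcelWriter(path, engine='openpyxl') as writer:
        results.to_excel(writer, sheet_name='Reorder Status', index=False)
        ws = writer.sheets['Reorder Status']
        status_col = list(results.columns).index('Status') + 1  # 1-based
        for i, status in enumerate(results['Status'], start=2):  # row 1 is header
            fill = FILLS.get(status)
            if fill is not None:
                ws.cell(row=i, column=status_col).fill = fill
```

The point of doing it in code rather than by hand is that regenerating the report after new sales data arrive costs nothing.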




&lt;h2&gt;
  
  
  The Dashboard
&lt;/h2&gt;

&lt;p&gt;Everything feeds into a live Streamlit dashboard with three pages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stock Overview&lt;/strong&gt; — headline numbers and charts comparing current stock against reorder points&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Demand Forecast&lt;/strong&gt; — select a product, see its history and 4-week forecast&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reorder Alerts&lt;/strong&gt; — the full status table with recommended order quantities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Live dashboard:&lt;/strong&gt; &lt;em&gt;(&lt;a href="https://forecasting-vv9qmj2kmug42tpct3jqv8.streamlit.app/" rel="noopener noreferrer"&gt;https://forecasting-vv9qmj2kmug42tpct3jqv8.streamlit.app/&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Took Away From This
&lt;/h2&gt;

&lt;p&gt;A few honest reflections:&lt;/p&gt;

&lt;p&gt;The data cleaning took longer than the modelling. That was not what I expected going in, but it makes sense. Garbage in, garbage out — and in a real operation, the stock records are often the messiest part.&lt;/p&gt;

&lt;p&gt;Choosing the simpler model was the right call. There is a temptation to use the most sophisticated tool available. Sometimes that is wrong.&lt;/p&gt;

&lt;p&gt;The low-accuracy products were the most interesting finding. If I had just reported average accuracy and moved on, I would have missed the point. The variability &lt;em&gt;is&lt;/em&gt; the insight — those products need different handling, not a better model.&lt;br&gt;
Next I plan to automate the whole workflow into a proper ETL pipeline.&lt;/p&gt;




&lt;p&gt;If you are working through something similar or have feedback on the approach, I would be glad to hear it.&lt;/p&gt;

&lt;p&gt;— Kelvin Kipyegon&lt;/p&gt;

</description>
    </item>
    <item>
      <title>1. "Python Program to Filter CSV Rows and Write Output to New File"</title>
      <dc:creator>Kelvin Kipyegon</dc:creator>
      <pubDate>Thu, 13 Feb 2025 11:35:57 +0000</pubDate>
      <link>https://dev.to/kelvin_kipyegon_c09cd9b69/1-python-program-to-filter-csv-rows-and-write-output-to-new-file-3oci</link>
      <guid>https://dev.to/kelvin_kipyegon_c09cd9b69/1-python-program-to-filter-csv-rows-and-write-output-to-new-file-3oci</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import csv

input_file = 'input.csv'
output_file = 'output.csv'
column_index = 1

with open(input_file, 'r') as infile:
    csv_reader = csv.reader(infile)
    header = next(csv_reader)
    filtered_rows = [header]

    for row in csv_reader:
        if float(row[column_index]) &amp;gt; 100:
            filtered_rows.append(row)

with open(output_file, 'w', newline='') as outfile:
    csv_writer = csv.writer(outfile)
    csv_writer.writerows(filtered_rows)

print("Filtered rows have been written to output.csv")


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code logic is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Imports the CSV module&lt;/strong&gt;: &lt;br&gt;
The code starts by importing the &lt;code&gt;csv&lt;/code&gt; module, which helps us read and write CSV files.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;File paths and column index&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;input_file = 'input.csv'&lt;/code&gt; tells the program where to find the file we want to read.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;output_file = 'output.csv'&lt;/code&gt; is where the program will save the filtered data.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;column_index = 1&lt;/code&gt; indicates the column where we will check the values (in this case, the second column because column counting starts from 0).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Open the input file&lt;/strong&gt;:&lt;br&gt;
The program opens the &lt;code&gt;input.csv&lt;/code&gt; file to read the data inside.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Read the header&lt;/strong&gt;:&lt;br&gt;
It reads the first row of the file, which contains the column names, and stores it in &lt;code&gt;header&lt;/code&gt;. This will be used later when writing to the new file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Filter the rows&lt;/strong&gt;:&lt;br&gt;
The program goes through each row of data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It checks if the number in the specified column (the second column) is greater than 100.&lt;/li&gt;
&lt;li&gt;If the number is greater than 100, the program keeps that row.&lt;/li&gt;
&lt;li&gt;If not, the row is skipped.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Write to the output file&lt;/strong&gt;:&lt;br&gt;
After filtering, the program writes the header and the remaining rows (that meet the condition) to a new file called &lt;code&gt;output.csv&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Print a message&lt;/strong&gt;:&lt;br&gt;
Finally, the program prints a message to let you know that the filtered data has been saved to the new file.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
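As an aside, the same filter can be written in a few lines with pandas. This is an alternative sketch added for comparison, not part of the original exercise; the function name and parameters are mine:

```python
import pandas as pd

def filter_csv(input_file, output_file, column_index=1, threshold=100):
    """Keep only rows whose value in the given column exceeds threshold."""
    df = pd.read_csv(input_file)
    col = df.columns[column_index]
    df[df[col] > threshold].to_csv(output_file, index=False)
```

The csv-module version is worth knowing because it streams row by row, while the pandas version loads the whole file into memory; for small files the pandas version is simply shorter.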

&lt;p&gt;&lt;strong&gt;2a. A Python multithreading solution to download multiple files simultaneously.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import threading
import requests

urls = [
    'https://example.com/file1.jpg',
    'https://example.com/file2.jpg',
    'https://example.com/file3.jpg'
]

def download_file(url):
    try:
        response = requests.get(url)
        filename = url.split('/')[-1]
        with open(filename, 'wb') as f:
            f.write(response.content)
        print(f"Downloaded: {filename}")
    except Exception as e:
        print(f"Failed to download {url}: {e}")

threads = []
for url in urls:
    thread = threading.Thread(target=download_file, args=(url,))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()

print("All downloads are complete.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Explanation of the code:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;URLs List&lt;/strong&gt;: &lt;code&gt;urls&lt;/code&gt; contains the list of file URLs you want to download.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Download Function&lt;/strong&gt;: &lt;code&gt;download_file(url)&lt;/code&gt; is a function that downloads a single file from a URL and saves it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thread Creation&lt;/strong&gt;: For each URL, a new thread is created using &lt;code&gt;threading.Thread&lt;/code&gt; to download the file at the same time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Starting Threads&lt;/strong&gt;: The &lt;code&gt;start()&lt;/code&gt; method is called on each thread to begin downloading the files.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Waiting for Completion&lt;/strong&gt;: &lt;code&gt;join()&lt;/code&gt; ensures the main program waits for all threads to finish before it prints "All downloads are complete."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;2b. A multiprocessing script to compute the factorial of numbers from 1 to 10.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;multiprocessing&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;factorial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Factorial of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; is &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;process&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;multiprocessing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;factorial&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
        &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All factorials have been computed.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Explanation:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;factorial(n)&lt;/code&gt; function&lt;/strong&gt;: Calculates the factorial of a number &lt;code&gt;n&lt;/code&gt; and prints the result.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Main Block&lt;/strong&gt;: In the &lt;code&gt;if __name__ == '__main__'&lt;/code&gt; block:

&lt;ul&gt;
&lt;li&gt;Loops through numbers from 1 to 10.&lt;/li&gt;
&lt;li&gt;For each number, creates a new process to compute its factorial.&lt;/li&gt;
&lt;li&gt;Starts each process and waits for it to finish using &lt;code&gt;process.join()&lt;/code&gt; before moving to the next.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
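Note that because each process is joined immediately after it starts, the factorials above actually run one at a time. A multiprocessing.Pool variant, sketched here as an alternative, distributes the work across processes and collects the results in input order:

```python
import multiprocessing

def factorial(n):
    """Compute n! iteratively and return it (rather than printing)."""
    result = 1
    for i in range(1, n + 1):
        result *= i
    return result

if __name__ == '__main__':
    # Pool.map farms the inputs out to worker processes and returns
    # the results in the same order as the inputs.
    with multiprocessing.Pool() as pool:
        results = pool.map(factorial, range(1, 11))
    for n, value in zip(range(1, 11), results):
        print(f"Factorial of {n} is {value}")
```

Returning the value instead of printing inside the worker also makes the function easy to test and reuse.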

&lt;p&gt;&lt;strong&gt;2c. A simple Python script that demonstrates how to modify a Pandas DataFrame in parallel using concurrent.futures:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;concurrent.futures&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;modify_row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;modified&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;concurrent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ThreadPoolExecutor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modify_row&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iterrows&lt;/span&gt;&lt;span class="p"&gt;()]))&lt;/span&gt;

    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Explanation:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;DataFrame&lt;/strong&gt;: A simple DataFrame &lt;code&gt;df&lt;/code&gt; is created with a column &lt;code&gt;'value'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;modify_row&lt;/code&gt; function&lt;/strong&gt;: This function modifies the row by adding a new column &lt;code&gt;'modified'&lt;/code&gt;, where the value is the original &lt;code&gt;'value'&lt;/code&gt; multiplied by 2.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ThreadPoolExecutor&lt;/strong&gt;: 

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;executor.map(modify_row, [...])&lt;/code&gt; runs &lt;code&gt;modify_row&lt;/code&gt; concurrently for each row of the DataFrame. Because of CPython's GIL, threads only speed up I/O-bound work; a CPU-bound transform like this one gains little from a thread pool.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result&lt;/strong&gt;: The modified DataFrame is printed at the end.&lt;/li&gt;
&lt;/ol&gt;
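&lt;p&gt;For a simple arithmetic transform like this, it's worth noting that a plain vectorized pandas operation is usually both simpler and faster than threading over rows. A minimal sketch of the same doubling done column-wise:&lt;/p&gt;

```python
import pandas as pd

def add_modified(df: pd.DataFrame) -> pd.DataFrame:
    # Vectorized: operates on the whole column at once, no per-row Python loop
    out = df.copy()
    out["modified"] = out["value"] * 2
    return out

df = pd.DataFrame({"value": [1, 2, 3, 4, 5]})
result = add_modified(df)
print(result)
```

&lt;p&gt;Vectorized column operations run in optimized C inside pandas, so there is no per-row Python overhead to parallelize away in the first place.&lt;/p&gt;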

</description>
      <category>python</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Mastering SQL for Data Engineering: Advanced Queries, Optimization, and Data Modeling Best Practices</title>
      <dc:creator>Kelvin Kipyegon</dc:creator>
      <pubDate>Tue, 11 Feb 2025 12:01:28 +0000</pubDate>
      <link>https://dev.to/kelvin_kipyegon_c09cd9b69/mastering-sql-for-data-engineering-advanced-queries-optimization-and-data-modeling-best-practices-2gjg</link>
      <guid>https://dev.to/kelvin_kipyegon_c09cd9b69/mastering-sql-for-data-engineering-advanced-queries-optimization-and-data-modeling-best-practices-2gjg</guid>
<description>&lt;p&gt;SQL is the main tool Data Engineers use to bring data together and run queries that turn raw data into useful business insights. Data Engineers use SQL to modify database objects such as tables, and to pull out specific data for different purposes. In this article, we will explore advanced SQL techniques, optimization strategies, and data modeling best practices that will help you handle complex data engineering tasks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Core SQL Concepts for Data Engineering
&lt;/h2&gt;

&lt;h3&gt;
  
  
  SELECT, WHERE, JOIN, GROUP BY, and HAVING
&lt;/h3&gt;

&lt;p&gt;The most basic SQL commands are essential for performing almost any data engineering task.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SELECT&lt;/strong&gt;: Retrieves data from a database.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query fetches all columns and rows from the employees table. The &lt;code&gt;*&lt;/code&gt; symbol indicates that all columns should be retrieved.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;WHERE&lt;/strong&gt;: Filters data based on conditions.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'HR'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query selects only those rows from the employees table where the department is 'HR'. It acts as a filter to get data based on a specific condition.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JOIN&lt;/strong&gt;: Combines data from multiple tables.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department_name&lt;/span&gt; 
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; 
  &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query combines the employees table and the departments table by matching each employee's department_id to the id column of departments. It retrieves employee names along with their department names.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GROUP BY&lt;/strong&gt;: Groups rows based on column values.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; 
  &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query groups employees by their department and counts how many employees are in each department.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HAVING&lt;/strong&gt;: Filters groups after applying GROUP BY.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; 
  &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt; 
  &lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query counts employees per department but only returns those departments where the count is greater than 5. The HAVING clause filters the result of GROUP BY.&lt;/p&gt;
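&lt;p&gt;To see &lt;code&gt;GROUP BY&lt;/code&gt; and &lt;code&gt;HAVING&lt;/code&gt; work end to end, here is a small runnable sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. The &lt;code&gt;employees&lt;/code&gt; rows are hypothetical sample data, chosen so only one department passes the filter (here with a threshold of 2 to keep the example short):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT)")
# Hypothetical sample data: 3 employees in HR, 1 in IT
rows = [("Ann", "HR"), ("Ben", "HR"), ("Cy", "HR"), ("Dee", "IT")]
conn.executemany("INSERT INTO employees VALUES (?, ?)", rows)

# Count per department, keep only departments with more than 2 employees
result = conn.execute(
    """
    SELECT department, COUNT(*)
    FROM employees
    GROUP BY department
    HAVING COUNT(*) > 2
    """
).fetchall()
print(result)  # only HR qualifies
```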




&lt;h2&gt;
  
  
  Advanced SQL Techniques
&lt;/h2&gt;

&lt;p&gt;Once you’re comfortable with basic SQL, you can explore more advanced techniques that make SQL even more powerful.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recursive Queries and Common Table Expressions (CTEs)
&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Introduction to Recursive Queries and Common Table Expressions (CTEs)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When working with SQL, sometimes you need to deal with data that has a hierarchy or structure like a family tree or an organization chart. Recursive Queries and Common Table Expressions (CTEs) are helpful tools to manage this type of data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CTE (Common Table Expression)&lt;/strong&gt;: Think of a CTE as a temporary table that you create in the middle of your query. It simplifies complex queries and makes them easier to read and maintain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Recursive Queries&lt;/strong&gt;: These are a special kind of CTE that allows you to reference the same table or CTE multiple times to build hierarchical data, such as parent-child relationships (e.g., employees and their managers).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  How CTEs Work
&lt;/h3&gt;

&lt;p&gt;Let’s start with a simple CTE. Here's an example of how to use a CTE to get a list of employees and their departments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;EmployeeDetails&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;EmployeeDetails&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;WITH&lt;/code&gt; keyword defines the CTE called &lt;code&gt;EmployeeDetails&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Inside the CTE, we select the &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;department&lt;/code&gt; from the &lt;code&gt;employees&lt;/code&gt; table.&lt;/li&gt;
&lt;li&gt;The second &lt;code&gt;SELECT&lt;/code&gt; retrieves data from the CTE we just created. It’s like working with a temporary table called &lt;code&gt;EmployeeDetails&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Recursive Query Example
&lt;/h3&gt;

&lt;p&gt;Now, let’s look at a recursive query, which is useful when working with data that has parent-child relationships, like managers and their employees.&lt;/p&gt;

&lt;p&gt;Here’s an example where we find all employees under a specific manager, even if those employees have their own subordinates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;RECURSIVE&lt;/span&gt; &lt;span class="n"&gt;EmployeeHierarchy&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="c1"&gt;-- Base case: Select the manager (top-level employee)&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;manager_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;

  &lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;

  &lt;span class="c1"&gt;-- Recursive case: Select employees under the current employees&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
  &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;EmployeeHierarchy&lt;/span&gt; &lt;span class="n"&gt;eh&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;EmployeeHierarchy&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The CTE &lt;code&gt;EmployeeHierarchy&lt;/code&gt; starts by selecting the top-level employees (those without managers) as the &lt;strong&gt;base case&lt;/strong&gt; (&lt;code&gt;manager_id IS NULL&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;UNION ALL&lt;/code&gt; combines the base case with the recursive part. The second part of the CTE selects employees whose &lt;code&gt;manager_id&lt;/code&gt; matches the &lt;code&gt;id&lt;/code&gt; of someone already in the hierarchy.&lt;/li&gt;
&lt;li&gt;The recursion repeats until a step produces no new rows, so the result covers every level of the hierarchy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result is a list of employees along with their managers, regardless of how many levels deep the hierarchy goes.&lt;/p&gt;
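&lt;p&gt;The recursive query above can be tried directly with Python's built-in &lt;code&gt;sqlite3&lt;/code&gt;, which supports &lt;code&gt;WITH RECURSIVE&lt;/code&gt;. The three-person chain below (Alice manages Bob, Bob manages Carol) is hypothetical sample data:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, manager_id INTEGER, name TEXT)")
# Hypothetical hierarchy: Alice (top) -> Bob -> Carol
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, None, "Alice"), (2, 1, "Bob"), (3, 2, "Carol")],
)

result = conn.execute(
    """
    WITH RECURSIVE EmployeeHierarchy AS (
      -- Base case: top-level employees with no manager
      SELECT id, manager_id, name FROM employees WHERE manager_id IS NULL
      UNION ALL
      -- Recursive case: employees whose manager is already in the hierarchy
      SELECT e.id, e.manager_id, e.name
      FROM employees e
      JOIN EmployeeHierarchy eh ON e.manager_id = eh.id
    )
    SELECT name FROM EmployeeHierarchy
    """
).fetchall()
print(result)
```

&lt;p&gt;Each recursive step pulls in the employees one level below those already found, so a single query returns all three levels.&lt;/p&gt;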




&lt;h3&gt;
  
  
  When to Use Recursive Queries and CTEs
&lt;/h3&gt;

&lt;p&gt;Recursive queries and CTEs are helpful when you need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Work with hierarchical data (e.g., organizational charts, categories of products).&lt;/li&gt;
&lt;li&gt;Simplify complex queries that would otherwise require multiple subqueries or joins.&lt;/li&gt;
&lt;li&gt;Improve the readability and maintenance of SQL queries.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Query Optimization and Performance Tuning
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Understanding Execution Plans and Query Profiling
&lt;/h3&gt;

&lt;p&gt;To optimize SQL queries, it's essential to understand how the database executes them. &lt;strong&gt;Execution plans&lt;/strong&gt; provide insights into how the query is processed, highlighting areas for improvement.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use the &lt;code&gt;EXPLAIN&lt;/code&gt; command to view the execution plan:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;  &lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Indexing Strategies to Speed Up Query Performance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Indexes&lt;/strong&gt; help speed up the retrieval of data. Properly indexed columns significantly reduce query times, especially for large datasets.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create an index on frequently queried columns:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;  &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_employee_department&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;department_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
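&lt;p&gt;Whether an index is actually used can be checked from the execution plan. The sketch below uses Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; with hypothetical data; note that SQLite spells the human-readable plan command &lt;code&gt;EXPLAIN QUERY PLAN&lt;/code&gt;, whereas bare &lt;code&gt;EXPLAIN&lt;/code&gt; (as in MySQL or PostgreSQL) shows low-level bytecode in SQLite:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, department_id INTEGER)")
# Hypothetical rows spread across three departments
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(i, i % 3) for i in range(30)],
)
conn.execute("CREATE INDEX idx_employee_department ON employees(department_id)")

# EXPLAIN QUERY PLAN returns rows whose last column describes the access strategy
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM employees WHERE department_id = 1"
).fetchall()
detail = " ".join(row[-1] for row in plan)
print(detail)  # mentions idx_employee_department when the index is used
```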



&lt;h3&gt;
  
  
  Techniques for Reducing Query Complexity and Improving Efficiency
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Avoid SELECT *&lt;/strong&gt;: Instead of selecting all columns, only select the ones you need.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limit Joins&lt;/strong&gt;: Keep joins to a minimum to reduce data complexity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize Subqueries&lt;/strong&gt;: Subqueries can sometimes be replaced by joins or temporary tables for better performance.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Data Modeling Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Normalization vs. Denormalization—When to Use Each Approach
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Normalization&lt;/strong&gt; organizes data into smaller tables to reduce redundancy and improve data integrity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Denormalization&lt;/strong&gt; combines tables to make queries faster at the cost of redundancy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In data engineering, &lt;strong&gt;denormalization&lt;/strong&gt; is often preferred in analytical systems for faster read operations, while &lt;strong&gt;normalization&lt;/strong&gt; is used in transactional systems to ensure data consistency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Designing Efficient Relational Schemas
&lt;/h3&gt;

&lt;p&gt;When designing a database schema, focus on &lt;strong&gt;scalability&lt;/strong&gt; and &lt;strong&gt;performance&lt;/strong&gt;. Use appropriate primary keys, foreign keys, and indexes to make data retrieval faster and more reliable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Star Schema vs. Snowflake Schema for Analytical Queries
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Star Schema&lt;/strong&gt;: Simple, with a central fact table and dimension tables connected directly. It’s fast for queries but may involve some redundancy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Snowflake Schema&lt;/strong&gt;: More complex, with dimension tables normalized into additional tables. It reduces redundancy but may require more joins in queries.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Example of Optimizing a Slow SQL Query
&lt;/h3&gt;

&lt;p&gt;Let’s say we have a query that calculates total sales for each product category, but it’s running too slow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; 
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One way to speed this up is to add an index on the &lt;code&gt;category&lt;/code&gt; column, which can let the database group rows via an index scan instead of sorting the whole table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_category&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Mastering SQL is essential for data engineers to handle complex data operations and optimize workflows. By understanding advanced SQL techniques, query optimization, and best practices for data modeling, you can improve the efficiency of your data pipelines and make better business decisions. Keep experimenting with different SQL features, and apply these techniques in real-world projects to continue improving your skills.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Hiii</title>
      <dc:creator>Kelvin Kipyegon</dc:creator>
      <pubDate>Tue, 11 Feb 2025 11:47:47 +0000</pubDate>
      <link>https://dev.to/kelvin_kipyegon_c09cd9b69/hiii-6ea</link>
      <guid>https://dev.to/kelvin_kipyegon_c09cd9b69/hiii-6ea</guid>
      <description>&lt;p&gt;place holder&lt;/p&gt;

</description>
    </item>
    <item>
      <title>place</title>
      <dc:creator>Kelvin Kipyegon</dc:creator>
      <pubDate>Tue, 11 Feb 2025 11:44:31 +0000</pubDate>
      <link>https://dev.to/kelvin_kipyegon_c09cd9b69/hello-everyonedata-engineering-i-hope-you-noticed-that-we-didnt-have-an-assignment-last-3b0e</link>
      <guid>https://dev.to/kelvin_kipyegon_c09cd9b69/hello-everyonedata-engineering-i-hope-you-noticed-that-we-didnt-have-an-assignment-last-3b0e</guid>
      <description></description>
    </item>
  </channel>
</rss>
