<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kelvin Kipyegon</title>
    <description>The latest articles on DEV Community by Kelvin Kipyegon (@kelvin_kipyegon_c09cd9b69).</description>
    <link>https://dev.to/kelvin_kipyegon_c09cd9b69</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2847173%2F29c894ac-8e4d-4ca5-985f-bc5765823f87.png</url>
      <title>DEV Community: Kelvin Kipyegon</title>
      <link>https://dev.to/kelvin_kipyegon_c09cd9b69</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kelvin_kipyegon_c09cd9b69"/>
    <language>en</language>
    <item>
      <title>Inventory Forecasting System From Scratch</title>
      <dc:creator>Kelvin Kipyegon</dc:creator>
      <pubDate>Mon, 16 Mar 2026 05:43:42 +0000</pubDate>
      <link>https://dev.to/kelvin_kipyegon_c09cd9b69/inventory-forecasting-system-from-scratch-fei</link>
      <guid>https://dev.to/kelvin_kipyegon_c09cd9b69/inventory-forecasting-system-from-scratch-fei</guid>
      <description>&lt;h1&gt;
  
  
  Inventory Forecasting System From Scratch
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;By Kelvin Kipyegon&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;I am currently applying for a role that requires demand forecasting and inventory planning. I decided to do a project related to forecasting for my portfolio.&lt;/p&gt;

&lt;p&gt;This post documents exactly what I did, what went wrong, how I fixed it, and what the final result looks like.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Set Out to Build
&lt;/h2&gt;

&lt;p&gt;A system that could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take raw sales data and clean it properly&lt;/li&gt;
&lt;li&gt;Forecast weekly demand for multiple products&lt;/li&gt;
&lt;li&gt;Calculate safety stock and reorder points&lt;/li&gt;
&lt;li&gt;Flag which products needed attention&lt;/li&gt;
&lt;li&gt;Present everything in an Excel report and a live dashboard&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Data
&lt;/h2&gt;

&lt;p&gt;I used the UCI Online Retail dataset — 541,909 real transactions from a wholesale operation, freely available online. It felt close enough to what a real workshop or warehouse operation would have.&lt;/p&gt;

&lt;p&gt;First thing I did was look at what I was working with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_excel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rows: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, Columns: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Rows: 541909, Columns: 8

CustomerID     135080
Description      1454
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Roughly 135,000 rows had no customer ID, about 25% of the dataset. The data was messier than it looked at first glance.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cleaning the Data
&lt;/h2&gt;

&lt;p&gt;Four things needed fixing before I could send anything downstream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CustomerID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;InvoiceNo&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;C&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;df_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Quantity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;UnitPrice&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Revenue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Quantity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;UnitPrice&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After cleaning I had 397,884 rows, down from 541,909, a loss of about 26% of the data. Those rows were returns, cancellations, and test entries. Keeping them would have made every analysis after this point wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  Picking Products to Forecast
&lt;/h2&gt;

&lt;p&gt;I filtered to the top 10 products by total quantity sold, then checked how many weeks of sales history each one had. One product — PAPER CRAFT, LITTLE BIRDIE — had 80,995 units sold in a single week and nothing else. That is not demand history, that is a one-off bulk order. I dropped it.&lt;/p&gt;

&lt;p&gt;I ended up with 8 products, each with between 22 and 53 weeks of consistent weekly sales data.&lt;/p&gt;
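The selection step above can be sketched roughly as follows. This assumes the cleaned frame df_clean from earlier, with the UCI column names InvoiceDate, Description and Quantity; the function name and thresholds are illustrative choices, not my exact script:

```python
import pandas as pd

def select_products(df_clean, top_n=10, min_weeks=20):
    """Pick the top products by volume, then keep only those with
    enough distinct weeks of sales history to forecast."""
    # Top products by total quantity sold.
    top = (df_clean.groupby('Description')['Quantity']
           .sum()
           .nlargest(top_n)
           .index)

    # Weekly sales per selected product.
    weekly = (df_clean[df_clean['Description'].isin(top)]
              .set_index('InvoiceDate')
              .groupby('Description')
              .resample('W')['Quantity']
              .sum())

    # Count weeks with non-zero sales; this is what drops a one-off
    # bulk order like PAPER CRAFT, LITTLE BIRDIE.
    weeks_active = weekly[weekly > 0].groupby('Description').size()
    return weeks_active[weeks_active >= min_weeks].index.tolist()
```

The key move is counting active weeks rather than total volume, so a single huge invoice cannot qualify a product for forecasting.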




&lt;h2&gt;
  
  
  The Forecasting Model
&lt;/h2&gt;

&lt;p&gt;I started with Facebook Prophet, a time-series forecasting library. It did not work well. The reason, which I only understood after some debugging, is that Prophet needs at least two years of data to learn yearly seasonality properly. I had one year. The predictions were off by hundreds of percent on some products.&lt;/p&gt;

&lt;p&gt;So I switched to a 4-week rolling average. Simpler, more honest about what the data could support, and easier to explain to someone who does not work in Python.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;:].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;:])&lt;/span&gt;
    &lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I trained on 80% of the data and tested on the remaining 20%. Forecast error was measured with RMSE, which I then expressed as an accuracy percentage. I avoided MAPE because it breaks when actual demand hits zero, which happened on a few test weeks.&lt;/p&gt;
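For reference, one common way to turn RMSE into an accuracy percentage is to normalise it by average demand. This is a sketch of that idea; the exact formula and the flooring at zero are assumptions I am making here, and my script may differ in detail:

```python
import numpy as np

def forecast_accuracy(actual, predicted):
    """Accuracy = 100 * (1 - RMSE / mean(actual)), floored at 0 so a
    very bad forecast reads as 0% rather than a negative number."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rmse = np.sqrt(np.mean((actual - predicted) ** 2))
    return max(0.0, 100.0 * (1.0 - rmse / actual.mean()))
```

Under this definition a perfect forecast scores 100%, and a forecast whose RMSE matches average demand scores 0%, which is consistent with the 0% rows in the table.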

&lt;p&gt;&lt;strong&gt;Results:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Product&lt;/th&gt;
&lt;th&gt;Avg Weekly Demand&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Rabbit Night Light&lt;/td&gt;
&lt;td&gt;2,763 units&lt;/td&gt;
&lt;td&gt;66.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Assorted Bird Ornament&lt;/td&gt;
&lt;td&gt;899 units&lt;/td&gt;
&lt;td&gt;52.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jumbo Bag Red Retrospot&lt;/td&gt;
&lt;td&gt;1,168 units&lt;/td&gt;
&lt;td&gt;40.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;White Hanging Heart&lt;/td&gt;
&lt;td&gt;699 units&lt;/td&gt;
&lt;td&gt;36.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mini Paint Set Vintage&lt;/td&gt;
&lt;td&gt;625 units&lt;/td&gt;
&lt;td&gt;29.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Popcorn Holder&lt;/td&gt;
&lt;td&gt;1,914 units&lt;/td&gt;
&lt;td&gt;12.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cake Cases&lt;/td&gt;
&lt;td&gt;448 units&lt;/td&gt;
&lt;td&gt;0.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WW2 Gliders&lt;/td&gt;
&lt;td&gt;1,358 units&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Average accuracy: &lt;strong&gt;29.9%&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some of these are weak. The honest interpretation is that products like Cake Cases and WW2 Gliders have highly variable demand, and a rolling average struggles with spikes. The right response is not to force a better-looking number; it is to give those products a larger safety stock buffer, which is exactly what the next step does.&lt;/p&gt;




&lt;h2&gt;
  
  
  Safety Stock and Reorder Points
&lt;/h2&gt;

&lt;p&gt;These are standard inventory management formulas. I used a 95% service level and assumed a 1-week supplier lead time:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Safety Stock&lt;/strong&gt; = 1.65 × Standard Deviation of Demand × √(Lead Time)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reorder Point&lt;/strong&gt; = Average Weekly Demand × Lead Time + Safety Stock&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;safety_stock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.65&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;std_demand&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lead_time_weeks&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;rop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avg_demand&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;lead_time_weeks&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;safety_stock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_stock&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;safety_stock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CRITICAL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;current_stock&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;rop&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;REORDER NOW&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;OK&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Product&lt;/th&gt;
&lt;th&gt;Safety Stock&lt;/th&gt;
&lt;th&gt;Reorder Point&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Rabbit Night Light&lt;/td&gt;
&lt;td&gt;2,523&lt;/td&gt;
&lt;td&gt;3,760&lt;/td&gt;
&lt;td&gt;🔴 CRITICAL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WW2 Gliders&lt;/td&gt;
&lt;td&gt;1,611&lt;/td&gt;
&lt;td&gt;2,657&lt;/td&gt;
&lt;td&gt;🟡 REORDER NOW&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Popcorn Holder&lt;/td&gt;
&lt;td&gt;1,546&lt;/td&gt;
&lt;td&gt;2,474&lt;/td&gt;
&lt;td&gt;🟡 REORDER NOW&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bird Ornament&lt;/td&gt;
&lt;td&gt;895&lt;/td&gt;
&lt;td&gt;1,562&lt;/td&gt;
&lt;td&gt;🟡 REORDER NOW&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;White Hanging Heart&lt;/td&gt;
&lt;td&gt;903&lt;/td&gt;
&lt;td&gt;1,597&lt;/td&gt;
&lt;td&gt;🟡 REORDER NOW&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mini Paint Set&lt;/td&gt;
&lt;td&gt;658&lt;/td&gt;
&lt;td&gt;1,150&lt;/td&gt;
&lt;td&gt;🟡 REORDER NOW&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cake Cases&lt;/td&gt;
&lt;td&gt;642&lt;/td&gt;
&lt;td&gt;1,278&lt;/td&gt;
&lt;td&gt;🟡 REORDER NOW&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jumbo Bag&lt;/td&gt;
&lt;td&gt;791&lt;/td&gt;
&lt;td&gt;1,662&lt;/td&gt;
&lt;td&gt;🟢 OK&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Excel Report
&lt;/h2&gt;

&lt;p&gt;I exported the full table to a color-coded Excel file using openpyxl. The idea was that a procurement team should be able to open this file and know exactly what needs ordering without any explanation.&lt;/p&gt;
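A minimal sketch of that colour-coding step, assuming a results DataFrame with a Status column. The sheet name, column names, and hex colours here are illustrative choices, not my production script:

```python
import pandas as pd
from openpyxl.styles import PatternFill

# Illustrative status-to-colour mapping (hex colours are my choice here).
FILLS = {
    'CRITICAL': PatternFill('solid', start_color='FFC7CE'),     # red
    'REORDER NOW': PatternFill('solid', start_color='FFEB9C'),  # amber
    'OK': PatternFill('solid', start_color='C6EFCE'),           # green
}

def export_report(results, path='inventory_report.xlsx'):
    """Write the status table to Excel and tint each Status cell."""
    with pd.ExcelWriter(path, engine='openpyxl') as writer:
        results.to_excel(writer, sheet_name='Reorder Status', index=False)
        ws = writer.sheets['Reorder Status']
        status_col = list(results.columns).index('Status') + 1  # 1-based
        for i, status in enumerate(results['Status'], start=2):  # row 1 is header
            fill = FILLS.get(status)
            if fill is not None:
                ws.cell(row=i, column=status_col).fill = fill
```

The point of doing it in code rather than by hand is that regenerating the report after new sales data arrive costs nothing.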




&lt;h2&gt;
  
  
  The Dashboard
&lt;/h2&gt;

&lt;p&gt;Everything feeds into a live Streamlit dashboard with three pages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stock Overview&lt;/strong&gt; — headline numbers and charts comparing current stock against reorder points&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Demand Forecast&lt;/strong&gt; — select a product, see its history and 4-week forecast&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reorder Alerts&lt;/strong&gt; — the full status table with recommended order quantities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Live dashboard:&lt;/strong&gt; &lt;em&gt;(&lt;a href="https://forecasting-vv9qmj2kmug42tpct3jqv8.streamlit.app/" rel="noopener noreferrer"&gt;https://forecasting-vv9qmj2kmug42tpct3jqv8.streamlit.app/&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Took Away From This
&lt;/h2&gt;

&lt;p&gt;A few honest reflections:&lt;/p&gt;

&lt;p&gt;The data cleaning took longer than the modelling. That was not what I expected going in, but it makes sense. Garbage in, garbage out — and in a real operation, the stock records are often the messiest part.&lt;/p&gt;

&lt;p&gt;Choosing the simpler model was the right call. There is a temptation to use the most sophisticated tool available. Sometimes that is wrong.&lt;/p&gt;

&lt;p&gt;The low-accuracy products were the most interesting finding. If I had just reported average accuracy and moved on, I would have missed the point. The variability &lt;em&gt;is&lt;/em&gt; the insight — those products need different handling, not a better model.&lt;br&gt;
Next I plan to automate the whole workflow into a proper ETL pipeline.&lt;/p&gt;




&lt;p&gt;If you are working through something similar or have feedback on the approach, I would be glad to hear it.&lt;/p&gt;

&lt;p&gt;— Kelvin Kipyegon&lt;/p&gt;

</description>
    </item>
    <item>
      <title>1. "Python Program to Filter CSV Rows and Write Output to New File"</title>
      <dc:creator>Kelvin Kipyegon</dc:creator>
      <pubDate>Thu, 13 Feb 2025 11:35:57 +0000</pubDate>
      <link>https://dev.to/kelvin_kipyegon_c09cd9b69/1-python-program-to-filter-csv-rows-and-write-output-to-new-file-3oci</link>
      <guid>https://dev.to/kelvin_kipyegon_c09cd9b69/1-python-program-to-filter-csv-rows-and-write-output-to-new-file-3oci</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import csv

input_file = 'input.csv'
output_file = 'output.csv'
column_index = 1

with open(input_file, 'r') as infile:
    csv_reader = csv.reader(infile)
    header = next(csv_reader)
    filtered_rows = [header]

    for row in csv_reader:
        if float(row[column_index]) &amp;gt; 100:
            filtered_rows.append(row)

with open(output_file, 'w', newline='') as outfile:
    csv_writer = csv.writer(outfile)
    csv_writer.writerows(filtered_rows)

print("Filtered rows have been written to output.csv")


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code logic is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Imports the CSV module&lt;/strong&gt;: &lt;br&gt;
The code starts by importing the &lt;code&gt;csv&lt;/code&gt; module, which helps us read and write CSV files.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;File paths and column index&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;input_file = 'input.csv'&lt;/code&gt; tells the program where to find the file we want to read.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;output_file = 'output.csv'&lt;/code&gt; is where the program will save the filtered data.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;column_index = 1&lt;/code&gt; indicates the column where we will check the values (in this case, the second column because column counting starts from 0).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Open the input file&lt;/strong&gt;:&lt;br&gt;
The program opens the &lt;code&gt;input.csv&lt;/code&gt; file to read the data inside.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Read the header&lt;/strong&gt;:&lt;br&gt;
It reads the first row of the file, which contains the column names, and stores it in &lt;code&gt;header&lt;/code&gt;. This will be used later when writing to the new file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Filter the rows&lt;/strong&gt;:&lt;br&gt;
The program goes through each row of data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It checks if the number in the specified column (the second column) is greater than 100.&lt;/li&gt;
&lt;li&gt;If the number is greater than 100, the program keeps that row.&lt;/li&gt;
&lt;li&gt;If not, the row is skipped.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Write to the output file&lt;/strong&gt;:&lt;br&gt;
After filtering, the program writes the header and the remaining rows (that meet the condition) to a new file called &lt;code&gt;output.csv&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Print a message&lt;/strong&gt;:&lt;br&gt;
Finally, the program prints a message to let you know that the filtered data has been saved to the new file.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
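As an aside, the same filter can be written in a few lines with pandas. This is an alternative sketch added for comparison, not part of the original exercise; the function name and parameters are mine:

```python
import pandas as pd

def filter_csv(input_file, output_file, column_index=1, threshold=100):
    """Keep only rows whose value in the given column exceeds threshold."""
    df = pd.read_csv(input_file)
    col = df.columns[column_index]
    df[df[col] > threshold].to_csv(output_file, index=False)
```

The csv-module version is worth knowing because it streams row by row, while the pandas version loads the whole file into memory; for small files the pandas version is simply shorter.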

&lt;p&gt;&lt;strong&gt;2a. A Python multithreading solution to download multiple files simultaneously.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import threading
import requests

urls = [
    'https://example.com/file1.jpg',
    'https://example.com/file2.jpg',
    'https://example.com/file3.jpg'
]

def download_file(url):
    try:
        response = requests.get(url)
        filename = url.split('/')[-1]
        with open(filename, 'wb') as f:
            f.write(response.content)
        print(f"Downloaded: {filename}")
    except Exception as e:
        print(f"Failed to download {url}: {e}")

threads = []
for url in urls:
    thread = threading.Thread(target=download_file, args=(url,))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()

print("All downloads are complete.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Explanation of the code:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;URLs List&lt;/strong&gt;: &lt;code&gt;urls&lt;/code&gt; contains the list of file URLs you want to download.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Download Function&lt;/strong&gt;: &lt;code&gt;download_file(url)&lt;/code&gt; is a function that downloads a single file from a URL and saves it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thread Creation&lt;/strong&gt;: For each URL, a new thread is created using &lt;code&gt;threading.Thread&lt;/code&gt; to download the file at the same time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Starting Threads&lt;/strong&gt;: The &lt;code&gt;start()&lt;/code&gt; method is called on each thread to begin downloading the files.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Waiting for Completion&lt;/strong&gt;: &lt;code&gt;join()&lt;/code&gt; ensures the main program waits for all threads to finish before it prints "All downloads are complete."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;2b. A multiprocessing script to compute the factorial of numbers from 1 to 10.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;multiprocessing&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;factorial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Factorial of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; is &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;process&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;multiprocessing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;factorial&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
        &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All factorials have been computed.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Explanation:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;factorial(n)&lt;/code&gt; function&lt;/strong&gt;: Calculates the factorial of a number &lt;code&gt;n&lt;/code&gt; and prints the result.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Main Block&lt;/strong&gt;: In the &lt;code&gt;if __name__ == '__main__'&lt;/code&gt; block:

&lt;ul&gt;
&lt;li&gt;Loops through numbers from 1 to 10.&lt;/li&gt;
&lt;li&gt;For each number, creates a new process to compute its factorial.&lt;/li&gt;
&lt;li&gt;Starts each process and waits for it to finish using &lt;code&gt;process.join()&lt;/code&gt; before moving to the next.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
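Note that because each process is joined immediately after it starts, the factorials above actually run one at a time. A multiprocessing.Pool variant, sketched here as an alternative, distributes the work across processes and collects the results in input order:

```python
import multiprocessing

def factorial(n):
    """Compute n! iteratively and return it (rather than printing)."""
    result = 1
    for i in range(1, n + 1):
        result *= i
    return result

if __name__ == '__main__':
    # Pool.map farms the inputs out to worker processes and returns
    # the results in the same order as the inputs.
    with multiprocessing.Pool() as pool:
        results = pool.map(factorial, range(1, 11))
    for n, value in zip(range(1, 11), results):
        print(f"Factorial of {n} is {value}")
```

Returning the value instead of printing inside the worker also makes the function easy to test and reuse.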

&lt;p&gt;&lt;strong&gt;2c. A simple Python script that demonstrates how to modify a Pandas DataFrame in parallel using concurrent.futures:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;concurrent.futures&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;modify_row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;modified&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;concurrent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ThreadPoolExecutor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modify_row&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iterrows&lt;/span&gt;&lt;span class="p"&gt;()]))&lt;/span&gt;

    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Explanation:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;DataFrame&lt;/strong&gt;: A simple DataFrame &lt;code&gt;df&lt;/code&gt; is created with a column &lt;code&gt;'value'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;modify_row&lt;/code&gt; function&lt;/strong&gt;: This function modifies the row by adding a new column &lt;code&gt;'modified'&lt;/code&gt;, where the value is the original &lt;code&gt;'value'&lt;/code&gt; multiplied by 2.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ThreadPoolExecutor&lt;/strong&gt;: 

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;executor.map(modify_row, [...])&lt;/code&gt; runs &lt;code&gt;modify_row&lt;/code&gt; concurrently for each row of the DataFrame. Because of CPython's GIL, threads only speed up I/O-bound work; a CPU-bound transform like this one gains little from a thread pool.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result&lt;/strong&gt;: The modified DataFrame is printed at the end.&lt;/li&gt;
&lt;/ol&gt;
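&lt;p&gt;For a simple arithmetic transform like this, it's worth noting that a plain vectorized pandas operation is usually both simpler and faster than threading over rows. A minimal sketch of the same doubling done column-wise:&lt;/p&gt;

```python
import pandas as pd

def add_modified(df: pd.DataFrame) -> pd.DataFrame:
    # Vectorized: operates on the whole column at once, no per-row Python loop
    out = df.copy()
    out["modified"] = out["value"] * 2
    return out

df = pd.DataFrame({"value": [1, 2, 3, 4, 5]})
result = add_modified(df)
print(result)
```

&lt;p&gt;Vectorized column operations run in optimized C inside pandas, so there is no per-row Python overhead to parallelize away in the first place.&lt;/p&gt;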

</description>
      <category>python</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Mastering SQL for Data Engineering: Advanced Queries, Optimization, and Data Modeling Best Practices</title>
      <dc:creator>Kelvin Kipyegon</dc:creator>
      <pubDate>Tue, 11 Feb 2025 12:01:28 +0000</pubDate>
      <link>https://dev.to/kelvin_kipyegon_c09cd9b69/mastering-sql-for-data-engineering-advanced-queries-optimization-and-data-modeling-best-practices-2gjg</link>
      <guid>https://dev.to/kelvin_kipyegon_c09cd9b69/mastering-sql-for-data-engineering-advanced-queries-optimization-and-data-modeling-best-practices-2gjg</guid>
<description>&lt;p&gt;SQL is the main tool Data Engineers use to bring data together and run queries that turn raw data into useful business insights. Data Engineers use SQL to modify database objects such as tables, and to pull out specific data for different purposes. In this article, we will explore advanced SQL techniques, optimization strategies, and data modeling best practices that will help you handle complex data engineering tasks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Core SQL Concepts for Data Engineering
&lt;/h2&gt;

&lt;h3&gt;
  
  
  SELECT, WHERE, JOIN, GROUP BY, and HAVING
&lt;/h3&gt;

&lt;p&gt;The most basic SQL commands are essential for performing almost any data engineering task.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SELECT&lt;/strong&gt;: Retrieves data from a database.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query fetches all columns and rows from the employees table. The &lt;code&gt;*&lt;/code&gt; symbol indicates that all columns should be retrieved.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;WHERE&lt;/strong&gt;: Filters data based on conditions.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'HR'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query selects only those rows from the employees table where the department is 'HR'. It acts as a filter to get data based on a specific condition.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JOIN&lt;/strong&gt;: Combines data from multiple tables.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department_name&lt;/span&gt; 
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; 
  &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query combines the employees table and the departments table by matching each employee's department_id to the id column of departments. It retrieves employee names along with their department names.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GROUP BY&lt;/strong&gt;: Groups rows based on column values.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; 
  &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query groups employees by their department and counts how many employees are in each department.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HAVING&lt;/strong&gt;: Filters groups after applying GROUP BY.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; 
  &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt; 
  &lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query counts employees per department but only returns those departments where the count is greater than 5. The HAVING clause filters the result of GROUP BY.&lt;/p&gt;
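&lt;p&gt;To see &lt;code&gt;GROUP BY&lt;/code&gt; and &lt;code&gt;HAVING&lt;/code&gt; work end to end, here is a small runnable sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. The &lt;code&gt;employees&lt;/code&gt; rows are hypothetical sample data, chosen so only one department passes the filter (here with a threshold of 2 to keep the example short):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT)")
# Hypothetical sample data: 3 employees in HR, 1 in IT
rows = [("Ann", "HR"), ("Ben", "HR"), ("Cy", "HR"), ("Dee", "IT")]
conn.executemany("INSERT INTO employees VALUES (?, ?)", rows)

# Count per department, keep only departments with more than 2 employees
result = conn.execute(
    """
    SELECT department, COUNT(*)
    FROM employees
    GROUP BY department
    HAVING COUNT(*) > 2
    """
).fetchall()
print(result)  # only HR qualifies
```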




&lt;h2&gt;
  
  
  Advanced SQL Techniques
&lt;/h2&gt;

&lt;p&gt;Once you’re comfortable with basic SQL, you can explore more advanced techniques that make SQL even more powerful.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recursive Queries and Common Table Expressions (CTEs)
&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Introduction to Recursive Queries and Common Table Expressions (CTEs)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When working with SQL, sometimes you need to deal with data that has a hierarchy or structure like a family tree or an organization chart. Recursive Queries and Common Table Expressions (CTEs) are helpful tools to manage this type of data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CTE (Common Table Expression)&lt;/strong&gt;: Think of a CTE as a temporary table that you create in the middle of your query. It simplifies complex queries and makes them easier to read and maintain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Recursive Queries&lt;/strong&gt;: These are a special kind of CTE that allows you to reference the same table or CTE multiple times to build hierarchical data, such as parent-child relationships (e.g., employees and their managers).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  How CTEs Work
&lt;/h3&gt;

&lt;p&gt;Let’s start with a simple CTE. Here's an example of how to use a CTE to get a list of employees and their departments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;EmployeeDetails&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;EmployeeDetails&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;WITH&lt;/code&gt; keyword defines the CTE called &lt;code&gt;EmployeeDetails&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Inside the CTE, we select the &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;department&lt;/code&gt; from the &lt;code&gt;employees&lt;/code&gt; table.&lt;/li&gt;
&lt;li&gt;The second &lt;code&gt;SELECT&lt;/code&gt; retrieves data from the CTE we just created. It’s like working with a temporary table called &lt;code&gt;EmployeeDetails&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Recursive Query Example
&lt;/h3&gt;

&lt;p&gt;Now, let’s look at a recursive query, which is useful when working with data that has parent-child relationships, like managers and their employees.&lt;/p&gt;

&lt;p&gt;Here’s an example where we find all employees under a specific manager, even if those employees have their own subordinates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;RECURSIVE&lt;/span&gt; &lt;span class="n"&gt;EmployeeHierarchy&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="c1"&gt;-- Base case: Select the manager (top-level employee)&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;manager_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;

  &lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;

  &lt;span class="c1"&gt;-- Recursive case: Select employees under the current employees&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
  &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;EmployeeHierarchy&lt;/span&gt; &lt;span class="n"&gt;eh&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;EmployeeHierarchy&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The CTE &lt;code&gt;EmployeeHierarchy&lt;/code&gt; starts by selecting the top-level employees (those without managers) as the &lt;strong&gt;base case&lt;/strong&gt; (&lt;code&gt;manager_id IS NULL&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;UNION ALL&lt;/code&gt; combines the base case with the recursive part. The second part of the CTE selects employees whose &lt;code&gt;manager_id&lt;/code&gt; matches the &lt;code&gt;id&lt;/code&gt; of someone already in the hierarchy.&lt;/li&gt;
&lt;li&gt;The recursion repeats until a step produces no new rows, so the result covers every level of the hierarchy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result is a list of employees along with their managers, regardless of how many levels deep the hierarchy goes.&lt;/p&gt;
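&lt;p&gt;The recursive query above can be tried directly with Python's built-in &lt;code&gt;sqlite3&lt;/code&gt;, which supports &lt;code&gt;WITH RECURSIVE&lt;/code&gt;. The three-person chain below (Alice manages Bob, Bob manages Carol) is hypothetical sample data:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, manager_id INTEGER, name TEXT)")
# Hypothetical hierarchy: Alice (top) -> Bob -> Carol
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, None, "Alice"), (2, 1, "Bob"), (3, 2, "Carol")],
)

result = conn.execute(
    """
    WITH RECURSIVE EmployeeHierarchy AS (
      -- Base case: top-level employees with no manager
      SELECT id, manager_id, name FROM employees WHERE manager_id IS NULL
      UNION ALL
      -- Recursive case: employees whose manager is already in the hierarchy
      SELECT e.id, e.manager_id, e.name
      FROM employees e
      JOIN EmployeeHierarchy eh ON e.manager_id = eh.id
    )
    SELECT name FROM EmployeeHierarchy
    """
).fetchall()
print(result)
```

&lt;p&gt;Each recursive step pulls in the employees one level below those already found, so a single query returns all three levels.&lt;/p&gt;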




&lt;h3&gt;
  
  
  When to Use Recursive Queries and CTEs
&lt;/h3&gt;

&lt;p&gt;Recursive queries and CTEs are helpful when you need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Work with hierarchical data (e.g., organizational charts, categories of products).&lt;/li&gt;
&lt;li&gt;Simplify complex queries that would otherwise require multiple subqueries or joins.&lt;/li&gt;
&lt;li&gt;Improve the readability and maintenance of SQL queries.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Query Optimization and Performance Tuning
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Understanding Execution Plans and Query Profiling
&lt;/h3&gt;

&lt;p&gt;To optimize SQL queries, it's essential to understand how the database executes them. &lt;strong&gt;Execution plans&lt;/strong&gt; provide insights into how the query is processed, highlighting areas for improvement.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use the &lt;code&gt;EXPLAIN&lt;/code&gt; command to view the execution plan:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;  &lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Indexing Strategies to Speed Up Query Performance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Indexes&lt;/strong&gt; help speed up the retrieval of data. Properly indexed columns significantly reduce query times, especially for large datasets.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create an index on frequently queried columns:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;  &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_employee_department&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;department_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
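&lt;p&gt;Whether an index is actually used can be checked from the execution plan. The sketch below uses Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; with hypothetical data; note that SQLite spells the human-readable plan command &lt;code&gt;EXPLAIN QUERY PLAN&lt;/code&gt;, whereas bare &lt;code&gt;EXPLAIN&lt;/code&gt; (as in MySQL or PostgreSQL) shows low-level bytecode in SQLite:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, department_id INTEGER)")
# Hypothetical rows spread across three departments
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(i, i % 3) for i in range(30)],
)
conn.execute("CREATE INDEX idx_employee_department ON employees(department_id)")

# EXPLAIN QUERY PLAN returns rows whose last column describes the access strategy
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM employees WHERE department_id = 1"
).fetchall()
detail = " ".join(row[-1] for row in plan)
print(detail)  # mentions idx_employee_department when the index is used
```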



&lt;h3&gt;
  
  
  Techniques for Reducing Query Complexity and Improving Efficiency
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Avoid SELECT *&lt;/strong&gt;: Instead of selecting all columns, only select the ones you need.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limit Joins&lt;/strong&gt;: Keep joins to a minimum to reduce data complexity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize Subqueries&lt;/strong&gt;: Subqueries can sometimes be replaced by joins or temporary tables for better performance.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Data Modeling Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Normalization vs. Denormalization—When to Use Each Approach
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Normalization&lt;/strong&gt; organizes data into smaller tables to reduce redundancy and improve data integrity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Denormalization&lt;/strong&gt; combines tables to make queries faster at the cost of redundancy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In data engineering, &lt;strong&gt;denormalization&lt;/strong&gt; is often preferred in analytical systems for faster read operations, while &lt;strong&gt;normalization&lt;/strong&gt; is used in transactional systems to ensure data consistency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Designing Efficient Relational Schemas
&lt;/h3&gt;

&lt;p&gt;When designing a database schema, focus on &lt;strong&gt;scalability&lt;/strong&gt; and &lt;strong&gt;performance&lt;/strong&gt;. Use appropriate primary keys, foreign keys, and indexes to make data retrieval faster and more reliable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Star Schema vs. Snowflake Schema for Analytical Queries
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Star Schema&lt;/strong&gt;: Simple, with a central fact table and dimension tables connected directly. It’s fast for queries but may involve some redundancy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Snowflake Schema&lt;/strong&gt;: More complex, with dimension tables normalized into additional tables. It reduces redundancy but may require more joins in queries.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Example of Optimizing a Slow SQL Query
&lt;/h3&gt;

&lt;p&gt;Let’s say we have a query that calculates total sales for each product category, but it’s running too slow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; 
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One way to speed this up is to add an index on the &lt;code&gt;category&lt;/code&gt; column, which can let the database group rows via an index scan instead of sorting the whole table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_category&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Mastering SQL is essential for data engineers to handle complex data operations and optimize workflows. By understanding advanced SQL techniques, query optimization, and best practices for data modeling, you can improve the efficiency of your data pipelines and make better business decisions. Keep experimenting with different SQL features, and apply these techniques in real-world projects to continue improving your skills.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Hiii</title>
      <dc:creator>Kelvin Kipyegon</dc:creator>
      <pubDate>Tue, 11 Feb 2025 11:47:47 +0000</pubDate>
      <link>https://dev.to/kelvin_kipyegon_c09cd9b69/hiii-6ea</link>
      <guid>https://dev.to/kelvin_kipyegon_c09cd9b69/hiii-6ea</guid>
      <description>&lt;p&gt;place holder&lt;/p&gt;

</description>
    </item>
    <item>
      <title>place</title>
      <dc:creator>Kelvin Kipyegon</dc:creator>
      <pubDate>Tue, 11 Feb 2025 11:44:31 +0000</pubDate>
      <link>https://dev.to/kelvin_kipyegon_c09cd9b69/hello-everyonedata-engineering-i-hope-you-noticed-that-we-didnt-have-an-assignment-last-3b0e</link>
      <guid>https://dev.to/kelvin_kipyegon_c09cd9b69/hello-everyonedata-engineering-i-hope-you-noticed-that-we-didnt-have-an-assignment-last-3b0e</guid>
      <description></description>
    </item>
  </channel>
</rss>
