DEV Community

Mate Technologies
Mate Technologies

Posted on

πŸ›‘οΈ Build a Smart Excel Data Cleaner in Python (Step-by-Step)

A beginner-friendly walkthrough of SmartExcelGuardian – a professional Excel data cleaning and reporting tool built with Python.

What you’ll build:

A desktop GUI app for Excel cleanup

Automatic missing value handling

Duplicate detection

Heuristic β€œdata health” scoring

Export results to Excel, PDF, JSON, and TXT

GitHub repo (full script):
πŸ‘‰ https://github.com/rogers-cyber/SmartExcelGuardian

🧰 Prerequisites

You should have:

Python 3.9+

Basic Python knowledge

Installed packages:

pip install pandas numpy openpyxl ttkbootstrap reportlab

πŸ“ Project Structure
SmartExcelGuardian/
│── main.py
│── logo.ico
│── excelguardian.log

1️⃣ Import Required Libraries

We start by importing everything we need.

import os, sys, threading, json
import tkinter as tk
from tkinter import filedialog
import ttkbootstrap as tb
from ttkbootstrap.constants import *
from datetime import datetime
Enter fullscreen mode Exit fullscreen mode

Why?

tkinter β†’ GUI foundation

ttkbootstrap β†’ modern dark UI

threading β†’ keep UI responsive during cleanup

Data & Excel Libraries

import pandas as pd
import numpy as np
import re
from openpyxl import Workbook
from openpyxl.styles import PatternFill, Font
from openpyxl.utils.dataframe import dataframe_to_rows
Enter fullscreen mode Exit fullscreen mode

These handle:

Data cleaning (pandas)

Excel export & formatting (openpyxl)

Column name normalization (re)

PDF Export Tools

from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import A4
from reportlab.lib.units import mm
from reportlab.lib.colors import red, orange, green, black
Enter fullscreen mode Exit fullscreen mode

Used for generating professional PDF reports.

2️⃣ Global State & Logging

stop_event = threading.Event()
cleanup_results = {}
log_file = os.path.join(os.getcwd(), "excelguardian.log")
Enter fullscreen mode Exit fullscreen mode

Purpose

stop_event β†’ allows canceling cleanup

cleanup_results β†’ shared export data

log_file β†’ error tracking

3️⃣ Utility Helper Functions
Resource Loader (for packaged apps)

def resource_path(file_name):
    base_path = getattr(sys, "_MEIPASS", os.path.dirname(os.path.abspath(__file__)))
    return os.path.join(base_path, file_name)
Enter fullscreen mode Exit fullscreen mode

This helps when bundling with PyInstaller.

Column Name Cleaner

def clean_column_name(name):
    name = name.strip().lower()
    name = re.sub(r"[^\w\s]", "", name)
    name = re.sub(r"\s+", "_", name)
    return name
Enter fullscreen mode Exit fullscreen mode

Example:

Original Cleaned
Total Sales ($) total_sales
NumPy β†’ JSON Converter

def convert_numpy(obj):
    if isinstance(obj, np.integer): return int(obj)
    if isinstance(obj, np.floating): return float(obj)
    if isinstance(obj, np.ndarray): return obj.tolist()
    raise TypeError
Enter fullscreen mode Exit fullscreen mode

Needed because JSON can’t serialize NumPy types.

4️⃣ Creating the Main Window

app = tb.Window(themename="darkly")
app.title("SmartExcelGuardian v1.1.0")
app.geometry("1100x650")
Enter fullscreen mode Exit fullscreen mode

Why ttkbootstrap?

Modern styling

Dark mode support

Responsive layouts

5️⃣ Title Section

tb.Label(app, text="SmartExcelGuardian",
         font=("Segoe UI", 22, "bold")).pack(pady=(10, 2))

tb.Label(app, text="Professional Excel Data Guardian Tool",
         font=("Segoe UI", 10, "italic"),
         foreground="#9ca3af").pack(pady=(0, 8))
Enter fullscreen mode Exit fullscreen mode

This creates the app header.

6️⃣ Excel File Selector

file_path = tk.StringVar()

tb.Entry(row1, textvariable=file_path, width=60).pack(side="left", padx=6)

tb.Button(
    row1,
    text="πŸ“„ Excel File",
    command=lambda: file_path.set(
        filedialog.askopenfilename(filetypes=[("Excel Files", "*.xlsx *.xls")])
    )
).pack(side="left")
Enter fullscreen mode Exit fullscreen mode

This lets users choose their Excel workbook.

7️⃣ Cleanup Control Buttons

start_btn = tb.Button(row2, text="πŸ›‘ CLEAN DATA", bootstyle="success")
stop_btn = tb.Button(row2, text="πŸ›‘ STOP", bootstyle="danger-outline", state="disabled")
Enter fullscreen mode Exit fullscreen mode

CLEAN DATA β†’ starts background thread

STOP β†’ safely halts processing

8️⃣ Results Table (TreeView)

cols = (
    "column", "original_type", "suggested_type",
    "cleaned_type", "missing_values",
    "duplicates_detected", "heuristic_score",
    "rename_suggestion"
)

tree = tb.Treeview(row3, columns=cols, show="headings")
Enter fullscreen mode Exit fullscreen mode

This displays column-by-column health analysis.

9️⃣ Heuristic Scoring System

def heuristic_score(missing, duplicates, type_issue):
    score = 0
    score += min(30, missing * 2)
    score += min(30, duplicates * 2)
    score += 40 if type_issue else 0
    return min(score, 100)
Enter fullscreen mode Exit fullscreen mode

Score Meaning
Score Risk
0–30 Healthy 🟒
31–70 Moderate 🟠
71–100 High Risk πŸ”΄
πŸ”Ÿ Data Cleaning Engine

def assess_and_clean(df):
    for col in df.columns:
        series = df[col]

Enter fullscreen mode Exit fullscreen mode

Numeric Columns

coerced = pd.to_numeric(series, errors="coerce")
cleaned_series = coerced.fillna(coerced.mean())
Enter fullscreen mode Exit fullscreen mode

βœ” Converts text β†’ numbers
βœ” Fills missing values with mean

Text Columns

cleaned_series = series.astype("string").fillna(series.mode()[0])
Enter fullscreen mode Exit fullscreen mode

βœ” Converts to strings
βœ” Fills missing values with most common value

1️⃣1️⃣ Threaded Cleanup Execution

threading.Thread(
    target=run_cleanup,
    daemon=True
).start()

Enter fullscreen mode Exit fullscreen mode

Why threading?

Keeps UI responsive

Prevents freezing on large Excel files

1️⃣2️⃣ Excel Export with Formulas

sum_formula = f"=SUM(A2:A{ws.max_row})"
mean_formula = f"=AVERAGE(A2:A{ws.max_row})"
Enter fullscreen mode Exit fullscreen mode

Automatically adds:

SUM

AVERAGE

to numeric columns.

1️⃣3️⃣ Conditional Formatting

fill = PatternFill(start_color="FF9999", fill_type="solid")
cell.font = Font(bold=True)
Enter fullscreen mode Exit fullscreen mode

High-risk columns are:

Highlighted πŸ”΄

Bolded for visibility

1️⃣4️⃣ PDF Report Export

def score_color(score):
    if score >= 71: return red
    elif score >= 31: return orange
    else: return green
Enter fullscreen mode Exit fullscreen mode

Creates a multi-page PDF audit report with:

Color-coded scores

Column summaries

Page numbers

1️⃣5️⃣ About & Help Window

tb.Label(frame, text="How to Use", font=("Segoe UI", 12, "bold"))
Enter fullscreen mode Exit fullscreen mode

Provides:

Feature overview

Usage steps

Developer info

πŸš€ Final Result

You now have:

A professional Excel cleaner

A desktop GUI

A heuristic scoring system

Multi-format export

πŸ“Œ Next Improvements

Add per-sheet selection

Add charts (data health trends)

Save user presets

Package as .exe

πŸ”— Full Source Code

πŸ‘‰ https://github.com/rogers-cyber/SmartExcelGuardian

SmartExcelGuardian

Top comments (0)