DEV Community

Mate Technologies
Mate Technologies

Posted on

๐Ÿ›ก๏ธ Build a Smart Excel Data Cleaner in Python (Step-by-Step)

A beginner-friendly walkthrough of SmartExcelGuardian โ€“ a professional Excel data cleaning and reporting tool built with Python.

What youโ€™ll build:

A desktop GUI app for Excel cleanup

Automatic missing value handling

Duplicate detection

Heuristic โ€œdata healthโ€ scoring

Export results to Excel, PDF, JSON, and TXT

GitHub repo (full script):
๐Ÿ‘‰ https://github.com/rogers-cyber/SmartExcelGuardian

๐Ÿงฐ Prerequisites

You should have:

Python 3.9+

Basic Python knowledge

Installed packages:

pip install pandas numpy openpyxl ttkbootstrap reportlab

๐Ÿ“ Project Structure
SmartExcelGuardian/
โ”‚โ”€โ”€ main.py
โ”‚โ”€โ”€ logo.ico
โ”‚โ”€โ”€ excelguardian.log

1๏ธโƒฃ Import Required Libraries

We start by importing everything we need.

import os, sys, threading, json
import tkinter as tk
from tkinter import filedialog
import ttkbootstrap as tb
from ttkbootstrap.constants import *
from datetime import datetime
Enter fullscreen mode Exit fullscreen mode

Why?

tkinter โ†’ GUI foundation

ttkbootstrap โ†’ modern dark UI

threading โ†’ keep UI responsive during cleanup

Data & Excel Libraries

import pandas as pd
import numpy as np
import re
from openpyxl import Workbook
from openpyxl.styles import PatternFill, Font
from openpyxl.utils.dataframe import dataframe_to_rows
Enter fullscreen mode Exit fullscreen mode

These handle:

Data cleaning (pandas)

Excel export & formatting (openpyxl)

Column name normalization (re)

PDF Export Tools

from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import A4
from reportlab.lib.units import mm
from reportlab.lib.colors import red, orange, green, black
Enter fullscreen mode Exit fullscreen mode

Used for generating professional PDF reports.

2๏ธโƒฃ Global State & Logging

stop_event = threading.Event()
cleanup_results = {}
log_file = os.path.join(os.getcwd(), "excelguardian.log")
Enter fullscreen mode Exit fullscreen mode

Purpose

stop_event โ†’ allows canceling cleanup

cleanup_results โ†’ shared export data

log_file โ†’ error tracking

3๏ธโƒฃ Utility Helper Functions
Resource Loader (for packaged apps)

def resource_path(file_name):
    base_path = getattr(sys, "_MEIPASS", os.path.dirname(os.path.abspath(__file__)))
    return os.path.join(base_path, file_name)
Enter fullscreen mode Exit fullscreen mode

This helps when bundling with PyInstaller.

Column Name Cleaner

def clean_column_name(name):
    name = name.strip().lower()
    name = re.sub(r"[^\w\s]", "", name)
    name = re.sub(r"\s+", "_", name)
    return name
Enter fullscreen mode Exit fullscreen mode

Example:

Original Cleaned
Total Sales ($) total_sales
NumPy โ†’ JSON Converter

def convert_numpy(obj):
    if isinstance(obj, np.integer): return int(obj)
    if isinstance(obj, np.floating): return float(obj)
    if isinstance(obj, np.ndarray): return obj.tolist()
    raise TypeError
Enter fullscreen mode Exit fullscreen mode

Needed because JSON canโ€™t serialize NumPy types.

4๏ธโƒฃ Creating the Main Window

app = tb.Window(themename="darkly")
app.title("SmartExcelGuardian v1.1.0")
app.geometry("1100x650")
Enter fullscreen mode Exit fullscreen mode

Why ttkbootstrap?

Modern styling

Dark mode support

Responsive layouts

5๏ธโƒฃ Title Section

tb.Label(app, text="SmartExcelGuardian",
         font=("Segoe UI", 22, "bold")).pack(pady=(10, 2))

tb.Label(app, text="Professional Excel Data Guardian Tool",
         font=("Segoe UI", 10, "italic"),
         foreground="#9ca3af").pack(pady=(0, 8))
Enter fullscreen mode Exit fullscreen mode

This creates the app header.

6๏ธโƒฃ Excel File Selector

file_path = tk.StringVar()

tb.Entry(row1, textvariable=file_path, width=60).pack(side="left", padx=6)

tb.Button(
    row1,
    text="๐Ÿ“„ Excel File",
    command=lambda: file_path.set(
        filedialog.askopenfilename(filetypes=[("Excel Files", "*.xlsx *.xls")])
    )
).pack(side="left")
Enter fullscreen mode Exit fullscreen mode

This lets users choose their Excel workbook.

7๏ธโƒฃ Cleanup Control Buttons

start_btn = tb.Button(row2, text="๐Ÿ›ก CLEAN DATA", bootstyle="success")
stop_btn = tb.Button(row2, text="๐Ÿ›‘ STOP", bootstyle="danger-outline", state="disabled")
Enter fullscreen mode Exit fullscreen mode

CLEAN DATA โ†’ starts background thread

STOP โ†’ safely halts processing

8๏ธโƒฃ Results Table (TreeView)

cols = (
    "column", "original_type", "suggested_type",
    "cleaned_type", "missing_values",
    "duplicates_detected", "heuristic_score",
    "rename_suggestion"
)

tree = tb.Treeview(row3, columns=cols, show="headings")
Enter fullscreen mode Exit fullscreen mode

This displays column-by-column health analysis.

9๏ธโƒฃ Heuristic Scoring System

def heuristic_score(missing, duplicates, type_issue):
    score = 0
    score += min(30, missing * 2)
    score += min(30, duplicates * 2)
    score += 40 if type_issue else 0
    return min(score, 100)
Enter fullscreen mode Exit fullscreen mode

Score Meaning
Score Risk
0โ€“30 Healthy ๐ŸŸข
31โ€“70 Moderate ๐ŸŸ 
71โ€“100 High Risk ๐Ÿ”ด
๐Ÿ”Ÿ Data Cleaning Engine

def assess_and_clean(df):
    for col in df.columns:
        series = df[col]

Enter fullscreen mode Exit fullscreen mode

Numeric Columns

coerced = pd.to_numeric(series, errors="coerce")
cleaned_series = coerced.fillna(coerced.mean())
Enter fullscreen mode Exit fullscreen mode

โœ” Converts text โ†’ numbers
โœ” Fills missing values with mean

Text Columns

cleaned_series = series.astype("string").fillna(series.mode()[0])
Enter fullscreen mode Exit fullscreen mode

โœ” Converts to strings
โœ” Fills missing values with most common value

1๏ธโƒฃ1๏ธโƒฃ Threaded Cleanup Execution

threading.Thread(
    target=run_cleanup,
    daemon=True
).start()

Enter fullscreen mode Exit fullscreen mode

Why threading?

Keeps UI responsive

Prevents freezing on large Excel files

1๏ธโƒฃ2๏ธโƒฃ Excel Export with Formulas

sum_formula = f"=SUM(A2:A{ws.max_row})"
mean_formula = f"=AVERAGE(A2:A{ws.max_row})"
Enter fullscreen mode Exit fullscreen mode

Automatically adds:

SUM

AVERAGE

to numeric columns.

1๏ธโƒฃ3๏ธโƒฃ Conditional Formatting

fill = PatternFill(start_color="FF9999", fill_type="solid")
cell.font = Font(bold=True)
Enter fullscreen mode Exit fullscreen mode

High-risk columns are:

Highlighted ๐Ÿ”ด

Bolded for visibility

1๏ธโƒฃ4๏ธโƒฃ PDF Report Export

def score_color(score):
    if score >= 71: return red
    elif score >= 31: return orange
    else: return green
Enter fullscreen mode Exit fullscreen mode

Creates a multi-page PDF audit report with:

Color-coded scores

Column summaries

Page numbers

1๏ธโƒฃ5๏ธโƒฃ About & Help Window

tb.Label(frame, text="How to Use", font=("Segoe UI", 12, "bold"))
Enter fullscreen mode Exit fullscreen mode

Provides:

Feature overview

Usage steps

Developer info

๐Ÿš€ Final Result

You now have:

A professional Excel cleaner

A desktop GUI

A heuristic scoring system

Multi-format export

๐Ÿ“Œ Next Improvements

Add per-sheet selection

Add charts (data health trends)

Save user presets

Package as .exe

๐Ÿ”— Full Source Code

๐Ÿ‘‰ https://github.com/rogers-cyber/SmartExcelGuardian

SmartExcelGuardian

Top comments (0)