A beginner-friendly walkthrough of SmartExcelGuardian – a professional Excel data cleaning and reporting tool built with Python.
What you’ll build:
A desktop GUI app for Excel cleanup
Automatic missing value handling
Duplicate detection
Heuristic “data health” scoring
Export results to Excel, PDF, JSON, and TXT
GitHub repo (full script):
👉 https://github.com/rogers-cyber/SmartExcelGuardian
🧰 Prerequisites
You should have:
Python 3.9+
Basic Python knowledge
Installed packages:
pip install pandas numpy openpyxl ttkbootstrap reportlab
📁 Project Structure
SmartExcelGuardian/
│── main.py
│── logo.ico
│── excelguardian.log
1️⃣ Import Required Libraries
We start by importing everything we need.
import os, sys, threading, json
import tkinter as tk
from tkinter import filedialog
import ttkbootstrap as tb
from ttkbootstrap.constants import *
from datetime import datetime
Why?
tkinter → GUI foundation
ttkbootstrap → modern dark UI
threading → keep UI responsive during cleanup
Data & Excel Libraries
import pandas as pd
import numpy as np
import re
from openpyxl import Workbook
from openpyxl.styles import PatternFill, Font
from openpyxl.utils.dataframe import dataframe_to_rows
These handle:
Data cleaning (pandas)
Excel export & formatting (openpyxl)
Column name normalization (re)
PDF Export Tools
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import A4
from reportlab.lib.units import mm
from reportlab.lib.colors import red, orange, green, black
Used for generating professional PDF reports.
2️⃣ Global State & Logging
stop_event = threading.Event()
cleanup_results = {}
log_file = os.path.join(os.getcwd(), "excelguardian.log")
Purpose
stop_event → allows canceling cleanup
cleanup_results → shared export data
log_file → error tracking
3️⃣ Utility Helper Functions
Resource Loader (for packaged apps)
def resource_path(file_name):
base_path = getattr(sys, "_MEIPASS", os.path.dirname(os.path.abspath(__file__)))
return os.path.join(base_path, file_name)
This helps when bundling with PyInstaller.
Column Name Cleaner
def clean_column_name(name):
name = name.strip().lower()
name = re.sub(r"[^\w\s]", "", name)
name = re.sub(r"\s+", "_", name)
return name
Example:
Original Cleaned
Total Sales ($) total_sales
NumPy → JSON Converter
def convert_numpy(obj):
if isinstance(obj, np.integer): return int(obj)
if isinstance(obj, np.floating): return float(obj)
if isinstance(obj, np.ndarray): return obj.tolist()
raise TypeError
Needed because JSON can’t serialize NumPy types.
4️⃣ Creating the Main Window
app = tb.Window(themename="darkly")
app.title("SmartExcelGuardian v1.1.0")
app.geometry("1100x650")
Why ttkbootstrap?
Modern styling
Dark mode support
Responsive layouts
5️⃣ Title Section
tb.Label(app, text="SmartExcelGuardian",
font=("Segoe UI", 22, "bold")).pack(pady=(10, 2))
tb.Label(app, text="Professional Excel Data Guardian Tool",
font=("Segoe UI", 10, "italic"),
foreground="#9ca3af").pack(pady=(0, 8))
This creates the app header.
6️⃣ Excel File Selector
file_path = tk.StringVar()
tb.Entry(row1, textvariable=file_path, width=60).pack(side="left", padx=6)
tb.Button(
row1,
text="📄 Excel File",
command=lambda: file_path.set(
filedialog.askopenfilename(filetypes=[("Excel Files", "*.xlsx *.xls")])
)
).pack(side="left")
This lets users choose their Excel workbook.
7️⃣ Cleanup Control Buttons
start_btn = tb.Button(row2, text="🛡 CLEAN DATA", bootstyle="success")
stop_btn = tb.Button(row2, text="🛑 STOP", bootstyle="danger-outline", state="disabled")
CLEAN DATA → starts background thread
STOP → safely halts processing
8️⃣ Results Table (TreeView)
cols = (
"column", "original_type", "suggested_type",
"cleaned_type", "missing_values",
"duplicates_detected", "heuristic_score",
"rename_suggestion"
)
tree = tb.Treeview(row3, columns=cols, show="headings")
This displays column-by-column health analysis.
9️⃣ Heuristic Scoring System
def heuristic_score(missing, duplicates, type_issue):
score = 0
score += min(30, missing * 2)
score += min(30, duplicates * 2)
score += 40 if type_issue else 0
return min(score, 100)
Score Meaning
Score Risk
0–30 Healthy 🟢
31–70 Moderate 🟠
71–100 High Risk 🔴
🔟 Data Cleaning Engine
def assess_and_clean(df):
for col in df.columns:
series = df[col]
Numeric Columns
coerced = pd.to_numeric(series, errors="coerce")
cleaned_series = coerced.fillna(coerced.mean())
✔ Converts text → numbers
✔ Fills missing values with mean
Text Columns
cleaned_series = series.astype("string").fillna(series.mode()[0])
✔ Converts to strings
✔ Fills missing values with most common value
1️⃣1️⃣ Threaded Cleanup Execution
threading.Thread(
target=run_cleanup,
daemon=True
).start()
Why threading?
Keeps UI responsive
Prevents freezing on large Excel files
1️⃣2️⃣ Excel Export with Formulas
sum_formula = f"=SUM(A2:A{ws.max_row})"
mean_formula = f"=AVERAGE(A2:A{ws.max_row})"
Automatically adds:
SUM
AVERAGE
to numeric columns.
1️⃣3️⃣ Conditional Formatting
fill = PatternFill(start_color="FF9999", fill_type="solid")
cell.font = Font(bold=True)
High-risk columns are:
Highlighted 🔴
Bolded for visibility
1️⃣4️⃣ PDF Report Export
def score_color(score):
if score >= 71: return red
elif score >= 31: return orange
else: return green
Creates a multi-page PDF audit report with:
Color-coded scores
Column summaries
Page numbers
1️⃣5️⃣ About & Help Window
tb.Label(frame, text="How to Use", font=("Segoe UI", 12, "bold"))
Provides:
Feature overview
Usage steps
Developer info
🚀 Final Result
You now have:
A professional Excel cleaner
A desktop GUI
A heuristic scoring system
Multi-format export
📌 Next Improvements
Add per-sheet selection
Add charts (data health trends)
Save user presets
Package as .exe
🔗 Full Source Code

Top comments (0)