I’ve just made my first substantial Python project public. It’s an Excel-to-SQL pipeline focused on cleaning messy spreadsheet data and preparing it for database ingestion.
This is still a work in progress, and I’m actively improving it.
Problem
Working with Excel data often means dealing with:
- Inconsistent date and time formats
- Mixed data types in a single column
- Missing or malformed values
- Subtle issues that only surface during database insertion
I wanted a reusable way to clean and standardise this kind of data before loading it into SQL.
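To make the date problem concrete, here is a minimal sketch (not code from the repo; `try_parse` and the sample values are illustrative) of how mixed date formats in one column defeat a single-format parser, and how trying a list of known formats surfaces malformed values early instead of at insert time:

```python
from datetime import datetime

# Hypothetical sample of the kinds of values one Excel column can contain
raw_dates = ["2024-01-15", "15/01/2024", "Jan 15 2024", "n/a"]

# Formats to attempt, in order; a value matching none becomes None
FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%b %d %Y"]

def try_parse(value):
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None  # malformed value is caught here, not during SQL insertion

parsed = [try_parse(v) for v in raw_dates]
```

All three valid spellings of the same date parse to proper `datetime` objects, while `"n/a"` is cleanly marked missing.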
Approach
The project focuses on:
- Column-wise cleaning functions (dates, times, text, etc.)
- Configurable parsing with strict vs permissive modes
- Clear error reporting with row-level context
- Separation between cleaning logic and pipeline orchestration
The goal is to make the pipeline predictable and easier to debug when something goes wrong.
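As a rough sketch of what strict vs permissive modes with row-level error context can look like (the function name and signature here are hypothetical, not the repo's actual API):

```python
def clean_column(values, parser, mode="permissive"):
    """Apply `parser` to every cell in a column.

    strict mode: raise on the first bad value, including its row index.
    permissive mode: substitute None and collect (row, value, error) tuples.
    """
    cleaned, errors = [], []
    for row, value in enumerate(values):
        try:
            cleaned.append(parser(value))
        except ValueError as exc:
            if mode == "strict":
                raise ValueError(f"row {row}: {exc}") from exc
            cleaned.append(None)
            errors.append((row, value, str(exc)))
    return cleaned, errors
```

Because the parser is passed in, the same orchestration works for dates, numbers, or text, which is one way to keep cleaning logic separate from pipeline wiring.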
Example
Input (Excel):
- Mixed date formats
- Numbers stored as text
- Invalid values scattered through columns
Output:
- Cleaned, typed data ready for SQL insertion
- Invalid values either coerced or flagged, depending on mode
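A toy version of the coerce-vs-flag behaviour for numbers stored as text might look like this (again a hypothetical helper, not the repo's implementation):

```python
def coerce_number(value, mode="coerce"):
    """Turn a text cell into a float.

    'coerce' mode: invalid values become None.
    'flag' mode: invalid values are returned tagged for later review.
    """
    try:
        # Strip whitespace and thousands separators before converting
        return float(str(value).replace(",", "").strip())
    except ValueError:
        if mode == "flag":
            return ("INVALID", value)
        return None

cells = ["1,234.5", " 42 ", "abc"]
coerced = [coerce_number(c) for c in cells]
flagged = [coerce_number(c, mode="flag") for c in cells]
```

The coerced column is ready for a numeric SQL type; the flagged variant preserves the bad cell so it can be reported instead of silently dropped.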
What I’m unsure about
I’d really value feedback on:
- Code structure and modularity
- Naming conventions and readability
- Error handling design
- Testing approach and coverage
- Overall project organisation
Next step
The next phase is building the SQL writer layer so the pipeline can automatically create tables in SQL Server and populate them with the cleaned data.
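One possible shape for that layer, sketched here purely as an assumption about the design (the type map and `build_create_table` helper are invented for illustration): infer a Python type per cleaned column, map it to a SQL Server type, and emit the DDL before bulk-inserting the rows.

```python
# Hypothetical mapping from inferred Python types to SQL Server column types
TYPE_MAP = {int: "INT", float: "FLOAT", str: "NVARCHAR(255)"}

def build_create_table(table, columns):
    """Emit a CREATE TABLE statement.

    columns: list of (name, python_type) pairs for the cleaned data.
    """
    cols = ", ".join(f"[{name}] {TYPE_MAP[pytype]}" for name, pytype in columns)
    return f"CREATE TABLE [{table}] ({cols});"

ddl = build_create_table("orders", [("id", int), ("amount", float), ("note", str)])
```

Generating the DDL from the cleaned, typed data keeps the table schema in sync with whatever the cleaning stage actually produced.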
Repo
https://github.com/juliana-albertyn/excel-to-sql
I’m learning as I build, so constructive criticism is very welcome.