DEV Community

Cover image for Building a CLI Tool to Automate Spreadsheet Data Cleaning with DeepSeek
Praveen Tech World
Praveen Tech World

Posted on • Originally published at praveentechworld.com

Building a CLI Tool to Automate Spreadsheet Data Cleaning with DeepSeek

How I Built a CLI Tool to Automate Spreadsheet Data Cleaning with DeepSeek

The short answer is: I used DeepSeek to create a Python CLI tool that cleans messy Excel files in under 5 seconds per file. The AI wrote 90% of the code, but I had to fix critical gaps in file handling and error recovery. Here’s the battle-tested process.

The Problem: My Spreadsheet Nightmare

Every Monday, I get 12-15 Excel files from field teams filled with:

  • Merged cells that break pandas
  • Inconsistent date formats (MM/DD/YYYY vs DD-MM-YY)
  • Empty header rows
  • "N/A", "NULL", and "---" all representing missing data

Manual cleaning took 3 hours weekly. I needed a tool where I could just run:

cleanxls --input messy_file.xlsx --output clean_file.csv
Enter fullscreen mode Exit fullscreen mode

The AI Attempt: First Prompt to DeepSeek

I started with this exact prompt:

"""
Prompt: Write a Python CLI tool using Click that:
1. Takes an Excel file as input
2. Handles merged cells by unmerging and filling values downward
3. Standardizes dates to YYYY-MM-DD
4. Replaces all null markers (N/A, NULL, ---) with empty strings
5. Outputs a clean CSV
Include error handling for corrupt files.
"""
Enter fullscreen mode Exit fullscreen mode

DeepSeek output a 127-line script using:

  • pandas for data manipulation
  • click for CLI interface
  • openpyxl to handle merged cells

First red flag: It used openpyxl's merged_cells property, which only works on .xlsx files, not .xls.

Where It Broke: 3 Critical Failures

  1. File Format Blindness: Crashed on older .xls files with:
   AttributeError: 'Workbook' object has no attribute 'merged_cells'
Enter fullscreen mode Exit fullscreen mode
  1. Date Disaster: Tried to force-convert everything resembling a date, turning "Order #2024-1001" into a datetime object.

  2. Silent Corruption: When encountering password-protected files, it created empty CSVs without warning.

What I Had to Fix

1. The File Format War

I modified the prompt to specify:

"""
Add support for both .xls (use xlrd) and .xlsx (openpyxl). 
First detect file extension, then use appropriate engine.
"""
Enter fullscreen mode Exit fullscreen mode

2. Date Detection Logic

Added a regex check before date conversion:

if re.match(r'^\d{1,4}[-\/]\d{1,2}[-\/]\d{1,4}$', str(cell_value)):  
    # Attempt date conversion
Enter fullscreen mode Exit fullscreen mode

3. Error Handling Overhaul

Added explicit checks for:

  • Password-protected files
  • Binary corruption (using magic numbers)
  • Permission issues

The Working Result

Final command structure:

cleanxls --input messy.* --output_dir ./clean \
         --date-format "%Y-%m-%d" \
         --null-markers "NULL,N/A,---"
Enter fullscreen mode Exit fullscreen mode

Sample output:

Processing 15 files...
✔ Success: 12 files (avg 4.2s/file)
✖ Failed: 3 files (password protected)
Clean files saved to ./clean/
Enter fullscreen mode Exit fullscreen mode

Key features added:

  • Batch processing with wildcards (messy.*)
  • Configurable null markers
  • Progress bar using tqdm

What I Learned

  1. AI Misses Edge Cases

    DeepSeek didn’t consider that .xls and .xlsx require different libraries. Always specify file formats.

  2. Prompt Iteration Beats Debugging

    Instead of fixing the code, I refined the prompt:

   """
   Modify to handle both .xls and .xlsx by:
   1. Checking file extension
   2. Using xlrd for .xls
   3. Using openpyxl for .xlsx
   """
Enter fullscreen mode Exit fullscreen mode

This got me 80% there vs. manual coding.

  1. Validation is Non-Negotiable The AI-generated script would silently overwrite files. I added:
   if os.path.exists(output_path):
       raise click.Confirm("Overwrite?")
Enter fullscreen mode Exit fullscreen mode

The Exact Prompt

Here’s the final prompt that worked:

"""
Write a Python CLI tool using Click that:
1. Accepts multiple input files via wildcard (e.g., 'data/*.xls*')
2. Detects file extension to use correct engine (xlrd for .xls, openpyxl for .xlsx)
3. Unmerges cells by filling values downward
4. Converts only valid dates to YYYY-MM-DD (skip strings with letters)
5. Replaces configurable null markers (default: NULL,N/A,---)
6. Validates output directory exists
7. Shows progress bar with tqdm
8. Skips password-protected files with warning
Output clean CSVs with '_clean' suffix.
"""
Enter fullscreen mode Exit fullscreen mode

FAQ

1. Can this handle 100MB+ Excel files?

Yes, but add --chunksize 10000 to process in chunks. The initial AI code loaded everything into memory.

2. How do I add custom cleaning rules?

Extend the clean_column() function. The AI structured this well for modifications.

3. Why Click instead of argparse?

DeepSeek defaulted to Click for better CLI UX. I kept it for the auto-help generation.

4. Can this run on schedule?

Yes! I combined it with a cron job to auto-process files in our SFTP drop folder.

5. What about Google Sheets?

That’s my next experiment-using gspread with API auth.

What spreadsheet headache would you automate with this approach? Drop a comment with your worst data cleaning war story.

Internal Links:

Related Guides

  • — The short answer is: I used DeepSeek to generate a Python script that extracts data from receipt ima
  • — The short answer is you can use a Python script with DeepSeek prompts to automatically pull grade da
  • — An IT Ops Lead used DeepSeek to build a Python script tracking AI API costs across providers. The AI

Related Guides

Top comments (0)