DEV Community

Toby Patrick
Issues of Multi-GB Spreadsheets in Data Lakes

Last Tuesday at 3:47 AM, our production data pipeline ground to a halt. The culprit? A 2.8 GB Excel file that a well-meaning finance analyst uploaded to our data ingestion endpoint. The file contained five years of transaction records meticulously maintained in a single worksheet with 1.2 million rows and 47 columns—complete with formulas, conditional formatting, and merged header cells.

Our pipeline, which had been humming along processing thousands of files daily, choked spectacularly. Memory usage spiked to 47 GB as our parser attempted to load the entire workbook into RAM. The container died. The job retried. It died again. By morning standup, we had a critical incident and a queue of 3,000 files waiting for processing.

If you're a data engineer working in the enterprise space, this scenario probably sounds familiar. Excel files are the cockroaches of the data world—they survive every attempt to eliminate them, and they keep getting bigger. Let’s explore why massive Excel files break traditional data pipelines and, more importantly, how to handle them in production.

Why Excel Files Grow Uncontrollably

Before diving into solutions, it’s worth understanding why Excel files become so problematic in the first place.

Unlike databases with normalized schemas or CSV files with simple delimited text, Excel files are complex binary containers (or in the case of .xlsx, zipped XML archives). A single Excel workbook can contain:

  • Multiple worksheets, each with millions of rows
  • Formula cells referencing other cells, sheets, or external files
  • Embedded charts, pivot tables, and images
  • Custom number formats and validation rules
  • VBA macros and custom functions
  • Conditional formatting rules across entire columns
  • Hidden rows, columns, and sheets
  • Comments and metadata

All of this is stored in a format optimized for interactive editing, not batch processing. When users treat Excel as a database—which they inevitably do—files balloon into sizes that break naive parsing approaches.

The worst part? Users don’t even realize their files are massive. Excel runs smoothly on modern systems with 16–32 GB RAM, so everything seems fine—until it hits your pipeline.
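You can see the container nature for yourself with nothing but the standard library. The sketch below builds a minimal stand-in for an .xlsx archive in memory (real workbooks contain many more parts) and lists its XML members:

```python
import io
import zipfile

# An .xlsx file is just a ZIP archive of XML parts. Build a tiny
# stand-in container to illustrate the layout.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("xl/workbook.xml", "<workbook/>")
    zf.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")
    zf.writestr("xl/sharedStrings.xml", "<sst/>")

with zipfile.ZipFile(buf) as zf:
    parts = zf.namelist()

print(parts)
```

Running `zipfile.ZipFile(path).namelist()` against a real upload shows the same structure, which is why every parser ultimately pays for ZIP decompression plus XML parsing.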

The Memory Explosion Problem

Traditional Excel parsing libraries load the entire workbook into memory, making them dangerous for large files.

Typical Pandas Approach

import pandas as pd

df = pd.read_excel('transactions.xlsx', sheet_name='Data')

What Happens Internally

  • Entire .xlsx file is loaded into memory
  • ZIP archive is decompressed
  • XML parsed into structures
  • Shared strings indexed
  • Cells converted into Python objects
  • Formulas processed
  • Formatting applied

📌 Result:
A 2 GB file can consume 15–20 GB RAM

Even worse:

  • Slow load times (10–15 minutes)
  • Crashes in container environments
  • Starvation of other processes
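Because .xlsx is compressed XML, on-disk size badly understates in-memory size. One cheap early-warning heuristic is to sum the uncompressed sizes inside the archive and multiply by a per-object overhead factor; the 8x default below is an illustrative assumption, not a measured constant, so calibrate it against your own files:

```python
import zipfile

def estimated_expansion_bytes(xlsx_path_or_file, overhead_factor=8.0):
    """Crude RAM estimate for fully loading an .xlsx:
    total uncompressed XML size times a per-object overhead factor.
    The 8x default is an assumption; calibrate against your data."""
    with zipfile.ZipFile(xlsx_path_or_file) as zf:
        uncompressed = sum(info.file_size for info in zf.infolist())
    return uncompressed * overhead_factor
```

Checking this estimate against the container's memory budget before parsing lets you reroute oversized uploads instead of discovering the problem via an OOM kill.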

Solution 1: Streaming Parsers (openpyxl)

Use read-only streaming mode to process Excel files row by row.

from openpyxl import load_workbook

wb = load_workbook('large_file.xlsx', read_only=True, data_only=True)
ws = wb['Sheet1']

for row in ws.rows:
    values = [cell.value for cell in row]
    process_row(values)

wb.close()

Why It Works

  • No full file loading
  • Processes data sequentially
  • Much lower memory usage

Performance Comparison

File Size   Standard pandas   openpyxl (read_only)   Reduction
500 MB      6.8 GB RAM        1.2 GB RAM             82%
1.5 GB      18.4 GB RAM       3.6 GB RAM             80%
2.8 GB      OOM killed        6.9 GB RAM             Success

Limitations

  • No random access
  • Cannot edit workbook
  • Only reads computed values
  • Sequential processing only
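Sequential-only access still lets you sample a file's head cheaply for schema sniffing. `itertools.islice` works on any iterator, including openpyxl's `ws.rows`; a plain iterator stands in for it in this sketch:

```python
from itertools import islice

def sample_head(rows, n=5):
    """Materialize only the first n rows of a (possibly huge) row iterator."""
    return [list(row) for row in islice(rows, n)]

# Stand-in for ws.rows; with openpyxl you would pass ws.rows directly
# and map cell.value over each row.
fake_rows = iter([("id", "amount"), (1, 9.99), (2, 4.50)])
print(sample_head(fake_rows, 2))  # [['id', 'amount'], [1, 9.99]]
```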

Solution 2: Chunked Processing

For extremely large files, use chunk-based processing to keep memory constant.

from openpyxl import load_workbook
import pyarrow as pa
import pyarrow.parquet as pq

CHUNK_SIZE = 10000

wb = load_workbook('massive_file.xlsx', read_only=True, data_only=True)
ws = wb['Sheet1']

# Use a single iterator so the header row is consumed exactly once
rows = ws.rows
headers = [cell.value for cell in next(rows)]

chunk = []
writer = None

def write_chunk(chunk, writer):
    table = pa.table({
        headers[i]: [r[i] for r in chunk]
        for i in range(len(headers))
    })
    if writer is None:
        writer = pq.ParquetWriter('output.parquet', table.schema)
    writer.write_table(table)
    return writer

for row in rows:
    chunk.append([cell.value for cell in row])

    if len(chunk) >= CHUNK_SIZE:
        writer = write_chunk(chunk, writer)
        chunk = []

# Flush the remainder; this also handles files smaller than one chunk,
# where the writer has not been created yet
if chunk:
    writer = write_chunk(chunk, writer)

if writer:
    writer.close()

wb.close()

Key Insight

  • Memory stays constant regardless of file size
  • 2.8 GB file processed using ~800 MB RAM

Optimal Chunk Size

Chunk Size     Effect
100            Too slow
100,000        Memory spikes
5,000–10,000   ✅ Best
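The chunking loop above can be factored into a reusable helper that works with any row iterator (openpyxl's `ws.rows` included) and keeps exactly one chunk in memory at a time:

```python
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of up to `size` items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# With openpyxl this would be: for batch in chunked(ws.rows, 10000): ...
for batch in chunked(range(25), 10):
    print(len(batch))  # 10, 10, 5
```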

Solution 3: Validation and Schema Detection

Large Excel files often lack consistent schemas. Building robust parsers requires understanding the variety of data structures engineers encounter in production. Data engineers with little to no spreadsheet experience should get familiar with the tool (often through video tutorials and online academies such as Practity) so they can spot the edge cases that break naive parsers. Hands-on practice with different spreadsheet structures translates directly into more resilient production systems.

Prevent issues by validating files before processing.

import os
import openpyxl

def validate_excel_file(file_path, max_size_mb=500, max_rows=500000):
    file_size_mb = os.path.getsize(file_path) / (1024 * 1024)

    if file_size_mb > max_size_mb:
        return False, f"File exceeds {max_size_mb}MB limit", None

    wb = openpyxl.load_workbook(file_path, read_only=True, data_only=True)

    metadata = {
        'sheets': len(wb.sheetnames),
        'file_size_mb': round(file_size_mb, 2)
    }

    for sheet_name in wb.sheetnames:
        ws = wb[sheet_name]

        # In read-only mode these can be None if the file omits dimension metadata
        max_row = ws.max_row or 0
        max_col = ws.max_column or 0

        if max_row > max_rows:
            wb.close()
            return False, f"Sheet '{sheet_name}' exceeds {max_rows} row limit", metadata

        metadata[sheet_name] = {
            'rows': max_row,
            'columns': max_col
        }

    wb.close()
    return True, "Valid", metadata

Usage

is_valid, message, metadata = validate_excel_file(uploaded_file)

if not is_valid:
    send_to_batch_processing_queue(uploaded_file, metadata)
else:
    process_immediately(uploaded_file)

Lessons Learned

  1. Users always push limits
  2. Memory limits must be enforced
  3. Validation prevents most failures
  4. Conversion (CSV/Parquet) is often better
  5. Monitor file trends over time
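Lesson 4 in practice: even without Parquet, streaming rows straight into CSV with the standard library keeps memory flat. The rows list below stands in for a streamed openpyxl row source:

```python
import csv
import io

def rows_to_csv(rows, out):
    """Write rows one at a time; nothing is buffered beyond the current row."""
    writer = csv.writer(out)
    for row in rows:
        writer.writerow(row)

buf = io.StringIO()
rows_to_csv([("id", "amount"), (1, 9.99), (2, 4.50)], buf)
print(buf.getvalue())
```

CSV loses types and sheets, but downstream tools parse it lazily, which is often all a pipeline needs.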

Conclusion

Excel files aren’t going anywhere.

Instead of trying to eliminate them, build systems that handle them properly:

  • Streaming parsers
  • Chunked processing
  • Intelligent validation
  • Format conversion

The same 2.8 GB file that crashed our pipeline at 3:47 AM now processes smoothly in 8 minutes using chunked streaming and Parquet conversion.

No crashes.
No retries.
Just reliable data ingestion.


Final Thought:
Excel may be messy, but with the right architecture, your pipeline doesn’t have to be.
