DEV Community

Toby Patrick
Issues of Multi-GB Spreadsheets in Data Lakes

Last Tuesday at 3:47 AM, our production data pipeline ground to a halt. The culprit? A 2.8 GB Excel file that a well-meaning finance analyst uploaded to our data ingestion endpoint. The file contained five years of transaction records meticulously maintained in a single worksheet with 1.2 million rows and 47 columns—complete with formulas, conditional formatting, and merged header cells.

Our pipeline, which had been humming along processing thousands of files daily, choked spectacularly. Memory usage spiked to 47 GB as our parser attempted to load the entire workbook into RAM. The container died. The job retried. It died again. By morning standup, we had a critical incident and a queue of 3,000 files waiting for processing.

If you're a data engineer working in the enterprise space, this scenario probably sounds familiar. Excel files are the cockroaches of the data world—they survive every attempt to eliminate them, and they keep getting bigger. Let’s explore why massive Excel files break traditional data pipelines and, more importantly, how to handle them in production.

Why Excel Files Grow Uncontrollably

Before diving into solutions, it’s worth understanding why Excel files become so problematic in the first place.

Unlike databases with normalized schemas or CSV files with simple delimited text, Excel files are complex binary containers (or in the case of .xlsx, zipped XML archives). A single Excel workbook can contain:

  • Multiple worksheets, each with millions of rows
  • Formula cells referencing other cells, sheets, or external files
  • Embedded charts, pivot tables, and images
  • Custom number formats and validation rules
  • VBA macros and custom functions
  • Conditional formatting rules across entire columns
  • Hidden rows, columns, and sheets
  • Comments and metadata

All of this is stored in a format optimized for interactive editing, not batch processing. When users treat Excel as a database—which they inevitably do—files balloon into sizes that break naive parsing approaches.

The worst part? Users don’t even realize their files are massive. Excel runs smoothly on modern systems with 16–32 GB RAM, so everything seems fine—until it hits your pipeline.
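You can see the container nature for yourself with nothing but the standard library. The sketch below builds a minimal stand-in for an .xlsx archive in memory (real workbooks contain many more parts) and lists its XML members:

```python
import io
import zipfile

# An .xlsx file is just a ZIP archive of XML parts. Build a tiny
# stand-in container to illustrate the layout.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("xl/workbook.xml", "<workbook/>")
    zf.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")
    zf.writestr("xl/sharedStrings.xml", "<sst/>")

with zipfile.ZipFile(buf) as zf:
    parts = zf.namelist()

print(parts)
```

Running `zipfile.ZipFile(path).namelist()` against a real upload shows the same structure, which is why every parser ultimately pays for ZIP decompression plus XML parsing.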

The Memory Explosion Problem

Traditional Excel parsing libraries load the entire workbook into memory, making them dangerous for large files.

Typical Pandas Approach

import pandas as pd

df = pd.read_excel('transactions.xlsx', sheet_name='Data')

What Happens Internally

  • Entire .xlsx file is loaded into memory
  • ZIP archive is decompressed
  • XML parsed into structures
  • Shared strings indexed
  • Cells converted into Python objects
  • Formulas processed
  • Formatting applied

📌 Result:
A 2 GB file can consume 15–20 GB RAM

Even worse:

  • Slow load times (10–15 minutes)
  • Crashes in container environments
  • Starvation of other processes
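Because .xlsx is compressed XML, on-disk size badly understates in-memory size. One cheap early-warning heuristic is to sum the uncompressed sizes inside the archive and multiply by a per-object overhead factor; the 8x default below is an illustrative assumption, not a measured constant, so calibrate it against your own files:

```python
import zipfile

def estimated_expansion_bytes(xlsx_path_or_file, overhead_factor=8.0):
    """Crude RAM estimate for fully loading an .xlsx:
    total uncompressed XML size times a per-object overhead factor.
    The 8x default is an assumption; calibrate against your data."""
    with zipfile.ZipFile(xlsx_path_or_file) as zf:
        uncompressed = sum(info.file_size for info in zf.infolist())
    return uncompressed * overhead_factor
```

Checking this estimate against the container's memory budget before parsing lets you reroute oversized uploads instead of discovering the problem via an OOM kill.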

Solution 1: Streaming Parsers (openpyxl)

Use read-only streaming mode to process Excel files row by row.

from openpyxl import load_workbook

wb = load_workbook('large_file.xlsx', read_only=True, data_only=True)
ws = wb['Sheet1']

for row in ws.rows:
    values = [cell.value for cell in row]
    process_row(values)

wb.close()

Why It Works

  • No full file loading
  • Processes data sequentially
  • Much lower memory usage

Performance Comparison

File Size   Standard pandas   openpyxl (read_only)   Reduction
500 MB      6.8 GB RAM        1.2 GB RAM             82%
1.5 GB      18.4 GB RAM       3.6 GB RAM             80%
2.8 GB      OOM killed        6.9 GB RAM             Success

Limitations

  • No random access
  • Cannot edit workbook
  • Only reads computed values
  • Sequential processing only
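Sequential-only access still lets you sample a file's head cheaply for schema sniffing. `itertools.islice` works on any iterator, including openpyxl's `ws.rows`; a plain iterator stands in for it in this sketch:

```python
from itertools import islice

def sample_head(rows, n=5):
    """Materialize only the first n rows of a (possibly huge) row iterator."""
    return [list(row) for row in islice(rows, n)]

# Stand-in for ws.rows; with openpyxl you would pass ws.rows directly
# and map cell.value over each row.
fake_rows = iter([("id", "amount"), (1, 9.99), (2, 4.50)])
print(sample_head(fake_rows, 2))  # [['id', 'amount'], [1, 9.99]]
```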

Solution 2: Chunked Processing

For extremely large files, use chunk-based processing to keep memory constant.

from openpyxl import load_workbook
import pyarrow as pa
import pyarrow.parquet as pq

CHUNK_SIZE = 10000

wb = load_workbook('massive_file.xlsx', read_only=True, data_only=True)
ws = wb['Sheet1']

# Use a single iterator so the header row is consumed exactly once
rows = ws.rows
headers = [cell.value for cell in next(rows)]

chunk = []
writer = None

def write_chunk(chunk, writer):
    table = pa.table({
        headers[i]: [r[i] for r in chunk]
        for i in range(len(headers))
    })
    if writer is None:
        writer = pq.ParquetWriter('output.parquet', table.schema)
    writer.write_table(table)
    return writer

for row in rows:
    chunk.append([cell.value for cell in row])

    if len(chunk) >= CHUNK_SIZE:
        writer = write_chunk(chunk, writer)
        chunk = []

# Flush the remainder; this also handles files smaller than one chunk,
# where the writer has not been created yet
if chunk:
    writer = write_chunk(chunk, writer)

if writer:
    writer.close()

wb.close()

Key Insight

  • Memory stays constant regardless of file size
  • 2.8 GB file processed using ~800 MB RAM

Optimal Chunk Size

Chunk Size     Effect
100            Too slow
100,000        Memory spikes
5,000–10,000   ✅ Best
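The chunking loop above can be factored into a reusable helper that works with any row iterator (openpyxl's `ws.rows` included) and keeps exactly one chunk in memory at a time:

```python
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of up to `size` items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# With openpyxl this would be: for batch in chunked(ws.rows, 10000): ...
for batch in chunked(range(25), 10):
    print(len(batch))  # 10, 10, 5
```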

Solution 3: Validation and Schema Detection

Large Excel files often lack consistent schemas. Building robust parsers requires understanding the variety of data structures engineers encounter in production. Data engineers with little to no spreadsheet experience should get familiar with the tool (often through video tutorials and online academies such as Practity) so they can spot the edge cases that break naive parsers. Hands-on practice with different spreadsheet structures translates directly into more resilient production systems.

Prevent issues by validating files before processing.

import os
import openpyxl

def validate_excel_file(file_path, max_size_mb=500, max_rows=500000):
    file_size_mb = os.path.getsize(file_path) / (1024 * 1024)

    if file_size_mb > max_size_mb:
        return False, f"File exceeds {max_size_mb}MB limit", None

    wb = openpyxl.load_workbook(file_path, read_only=True, data_only=True)

    metadata = {
        'sheets': len(wb.sheetnames),
        'file_size_mb': round(file_size_mb, 2)
    }

    for sheet_name in wb.sheetnames:
        ws = wb[sheet_name]

        # In read-only mode these can be None if the file omits dimension metadata
        max_row = ws.max_row or 0
        max_col = ws.max_column or 0

        if max_row > max_rows:
            wb.close()
            return False, f"Sheet '{sheet_name}' exceeds {max_rows} row limit", metadata

        metadata[sheet_name] = {
            'rows': max_row,
            'columns': max_col
        }

    wb.close()
    return True, "Valid", metadata

Usage

is_valid, message, metadata = validate_excel_file(uploaded_file)

if not is_valid:
    send_to_batch_processing_queue(uploaded_file, metadata)
else:
    process_immediately(uploaded_file)

Lessons Learned

  1. Users always push limits
  2. Memory limits must be enforced
  3. Validation prevents most failures
  4. Conversion (CSV/Parquet) is often better
  5. Monitor file trends over time
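Lesson 4 in practice: even without Parquet, streaming rows straight into CSV with the standard library keeps memory flat. The rows list below stands in for a streamed openpyxl row source:

```python
import csv
import io

def rows_to_csv(rows, out):
    """Write rows one at a time; nothing is buffered beyond the current row."""
    writer = csv.writer(out)
    for row in rows:
        writer.writerow(row)

buf = io.StringIO()
rows_to_csv([("id", "amount"), (1, 9.99), (2, 4.50)], buf)
print(buf.getvalue())
```

CSV loses types and sheets, but downstream tools parse it lazily, which is often all a pipeline needs.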

Conclusion

Excel files aren’t going anywhere.

Instead of trying to eliminate them, build systems that handle them properly:

  • Streaming parsers
  • Chunked processing
  • Intelligent validation
  • Format conversion

The same 2.8 GB file that crashed our pipeline at 3:47 AM now processes smoothly in 8 minutes using chunked streaming and Parquet conversion.

No crashes.
No retries.
Just reliable data ingestion.


Final Thought:
Excel may be messy, but with the right architecture, your pipeline doesn’t have to be.
