Last Tuesday at 3:47 AM, our production data pipeline ground to a halt. The culprit? A 2.8 GB Excel file that a well-meaning finance analyst uploaded to our data ingestion endpoint. The file contained five years of transaction records meticulously maintained in a single worksheet with 1.2 million rows and 47 columns—complete with formulas, conditional formatting, and merged header cells.
Our pipeline, which had been humming along processing thousands of files daily, choked spectacularly. Memory usage spiked to 47 GB as our parser attempted to load the entire workbook into RAM. The container died. The job retried. It died again. By morning standup, we had a critical incident and a queue of 3,000 files waiting for processing.
If you're a data engineer working in the enterprise space, this scenario probably sounds familiar. Excel files are the cockroaches of the data world—they survive every attempt to eliminate them, and they keep getting bigger. Let’s explore why massive Excel files break traditional data pipelines and, more importantly, how to handle them in production.
Why Excel Files Grow Uncontrollably
Before diving into solutions, it’s worth understanding why Excel files become so problematic in the first place.
Unlike databases with normalized schemas or CSV files with simple delimited text, Excel files are complex binary containers (or in the case of .xlsx, zipped XML archives). A single Excel workbook can contain:
- Multiple worksheets, each with millions of rows
- Formula cells referencing other cells, sheets, or external files
- Embedded charts, pivot tables, and images
- Custom number formats and validation rules
- VBA macros and custom functions
- Conditional formatting rules across entire columns
- Hidden rows, columns, and sheets
- Comments and metadata
All of this is stored in a format optimized for interactive editing, not batch processing. When users treat Excel as a database—which they inevitably do—files balloon into sizes that break naive parsing approaches.
The worst part? Users don’t even realize their files are massive. Excel runs smoothly on modern systems with 16–32 GB RAM, so everything seems fine—until it hits your pipeline.
The Memory Explosion Problem
Traditional Excel parsing libraries load the entire workbook into memory, making them dangerous for large files.
Typical Pandas Approach
```python
import pandas as pd

# Loads the entire workbook into memory at once
df = pd.read_excel('transactions.xlsx', sheet_name='Data')
```
What Happens Internally
- Entire `.xlsx` file is loaded into memory
- ZIP archive is decompressed
- XML parsed into in-memory structures
- Shared strings indexed
- Cells converted into Python objects
- Formulas processed
- Formatting applied
📌 Result: a 2 GB file can consume 15–20 GB of RAM.

Even worse:
- Slow load times (10–15 minutes)
- Crashes in container environments
- Starvation of other processes
Solution 1: Streaming Parsers (openpyxl)
Use read-only streaming mode to process Excel files row by row.
```python
from openpyxl import load_workbook

# read_only streams rows lazily; data_only returns cached formula results
wb = load_workbook('large_file.xlsx', read_only=True, data_only=True)
ws = wb['Sheet1']

for row in ws.rows:
    values = [cell.value for cell in row]
    process_row(values)  # your row handler

wb.close()
```
Why It Works
- No full file loading
- Processes data sequentially
- Much lower memory usage
Performance Comparison
| File Size | Standard pandas | openpyxl (read_only) | Reduction |
|---|---|---|---|
| 500 MB | 6.8 GB RAM | 1.2 GB RAM | 82% |
| 1.5 GB | 18.4 GB RAM | 3.6 GB RAM | 80% |
| 2.8 GB | OOM killed | 6.9 GB RAM | Success |
Limitations
- No random access
- Cannot edit workbook
- Only reads computed values
- Sequential processing only
Solution 2: Chunked Processing
For extremely large files, use chunk-based processing to keep memory constant.
```python
from openpyxl import load_workbook
import pyarrow as pa
import pyarrow.parquet as pq

CHUNK_SIZE = 10000

wb = load_workbook('massive_file.xlsx', read_only=True, data_only=True)
ws = wb['Sheet1']

rows = ws.iter_rows(values_only=True)
headers = list(next(rows))  # first row holds the column names

def flush(chunk, writer):
    """Write one chunk to the Parquet file, creating the writer on first use
    (the schema is inferred from the first chunk)."""
    table = pa.table({
        headers[i]: [row[i] for row in chunk]
        for i in range(len(headers))
    })
    if writer is None:
        writer = pq.ParquetWriter('output.parquet', table.schema)
    writer.write_table(table)
    return writer

chunk = []
writer = None

for row in rows:
    chunk.append(row)
    if len(chunk) >= CHUNK_SIZE:
        writer = flush(chunk, writer)
        chunk = []

if chunk:
    writer = flush(chunk, writer)  # final partial chunk

if writer is not None:
    writer.close()
wb.close()
```
Key Insight
- Memory stays constant regardless of file size
- 2.8 GB file processed using ~800 MB RAM
Optimal Chunk Size
| Chunk Size | Effect |
|---|---|
| 100 | Too slow |
| 100,000 | Memory spikes |
| 5,000–10,000 | ✅ Best |
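The chunking pattern is not Excel-specific; stripped to its core, it is just batching any row iterator. A stdlib-only sketch of the same idea (the `chunked` helper is mine):

```python
from itertools import islice

def chunked(rows, size):
    """Yield lists of at most `size` items from any iterator.

    Memory stays proportional to `size`, never to the total row count,
    which is the whole trick behind constant-memory processing.
    """
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

With a helper like this, the Parquet write loop collapses to `for batch in chunked(rows, CHUNK_SIZE): ...`, and the same utility works for CSV output or database inserts.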
Solution 3: Validation and Schema Detection
Large Excel files often lack consistent schemas. Building robust parsers requires understanding the variety of spreadsheet structures that turn up in production: merged headers, mixed types within a column, hidden sheets, inconsistent layouts. Data engineers with little spreadsheet experience should get hands-on practice with messy real-world files; working through varied spreadsheet structures is what teaches you to spot the edge cases that break naive parsers, and that experience translates directly into more resilient production systems.
Prevent issues by validating files before processing.
```python
import os
import openpyxl

def validate_excel_file(file_path, max_size_mb=500, max_rows=500000):
    file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
    if file_size_mb > max_size_mb:
        return False, f"File exceeds {max_size_mb}MB limit", None

    wb = openpyxl.load_workbook(file_path, read_only=True, data_only=True)
    metadata = {
        'sheets': len(wb.sheetnames),
        'file_size_mb': round(file_size_mb, 2)
    }

    for sheet_name in wb.sheetnames:
        ws = wb[sheet_name]
        if ws.max_row > max_rows:
            wb.close()
            return False, f"Sheet '{sheet_name}' exceeds {max_rows} row limit", metadata
        metadata[sheet_name] = {
            'rows': ws.max_row,
            'columns': ws.max_column
        }

    wb.close()
    return True, "Valid", metadata
```
Usage
```python
is_valid, message, metadata = validate_excel_file(uploaded_file)

if not is_valid:
    # Files that fail fast-path validation (too large) are routed
    # to a slower, memory-safe batch queue instead of being rejected
    send_to_batch_processing_queue(uploaded_file, metadata)
else:
    process_immediately(uploaded_file)
```
Lessons Learned
- Users always push limits
- Memory limits must be enforced
- Validation prevents most failures
- Conversion (CSV/Parquet) is often better
- Monitor file trends over time
Conclusion
Excel files aren’t going anywhere.
Instead of trying to eliminate them, build systems that handle them properly:
- Streaming parsers
- Chunked processing
- Intelligent validation
- Format conversion
The same 2.8 GB file that crashed our pipeline at 3:47 AM now processes smoothly in 8 minutes using chunked streaming and Parquet conversion.
No crashes.
No retries.
Just reliable data ingestion.
Final Thought:
Excel may be messy, but with the right architecture, your pipeline doesn’t have to be.