The Concept
When designing bulk data-ingestion pipelines, engineering teams often face a tension between strict data integrity and user experience. A naive bulk-import system treats the entire payload as a single transaction, if one row fails, the entire batch rolls back. While this guarantees data consistency, it creates a frustrating user bottleneck.
A mature bulk-import system should act as an intelligent filter: gracefully handling complex binary data (like embedded images), processing valid records in batches, and isolating failures so the user can fix and re-upload only what is broken.
The What
Our application features a critical tool allowing distributors to bulk-update products via XLSX spreadsheets. However, the initial implementation suffered from two major limitations:
- Lack of Validation & Error Isolation: Any single row error would halt the entire file execution, failing the upload completely.
- Missing Media Portability: Distributors opening new branches frequently requested the ability to automatically clone existing product images to the new branch profile via the spreadsheet.
The Why
We needed to redesign this pipeline to accomplish two goals:
- Fault-Tolerant Processing: Ensure that valid product updates are successfully committed to the database, while invalid rows are isolated and surfaced back to the distributor for targeted correction.
- Binary Data Extraction: Programmatically detect, extract, and reference product images embedded directly inside spreadsheet cells, automatically mapping them to the correct branch inventory records.
The How
1. File & Data Sanity Checking
Before any database operation occurs, the incoming payload undergoes strict sanitization:
-
MIME-Type Verification: The system strictly enforces that only valid
.xlsxfiles are accepted. - Data Harmonization: Price fields are stripped of special characters, currency symbols, and whitespace, then safely parsed into clean numeric types.
2. Extracting Embedded Images with excelize
Reading images from an Excel file in Go using the excelize library requires a two-step coordinate extraction process, because calling a cell directly using standard string retrieval returns a #VALUE! error.
-
Step 1: Coordinate Mapping: We utilize
GetPictureCells()to retrieve a list of all cell coordinates that actually contain image data. -
Step 2: Binary Extraction: We loop through these coordinates and invoke
GetPictures(). This function reads the raw cell data and returns a struct containing the rawFilebytes,Extension,Format, andInsertType. - Step 3: Database Referencing: The raw bytes are saved to our storage server. We generate a unique, collision-resistant filename for each image, inject this new filename reference back into our internal representation of the spreadsheet rows, and prepare it for database persistence.
3. Batch Processing and Partial-Success Strategy
To handle high concurrency and prevent the database from being bottlenecked by row-by-row INSERT queries, we implemented a batch-processing algorithm designed around pre-fetching and comparing:
[Distributor Uploads XLSX]
│
▼
[Extract all SKUs & Names]
│
▼
[Single Query: Select Existing from DB]
│
▼
[Comparison] ───────────────────┐
│ │
▼ (No Conflicts) ▼ (Conflicts)
[Batch Insert Valid] [Collect Errors]
│ │
▼ ▼
[Update Inventory & Stocks] [Return Error Report to User]
- Pre-fetching Constraints: The system loops through the spreadsheet memory cache to compile a list of all SKUs and product names.
-
Bulk Conflict Checking: We hit the database once using this compiled list to pull all existing records matching those SKUs or names, creating an in-memory lookup map of
existingProducts. - The Comparison: We compare the incoming spreadsheet rows against our database lookup map:
- Valid Products: Rows without conflicts are grouped into a Batch Insert transaction. Successful inserts automatically trigger downstream events to generate inventory records and historical stock movements.
- Invalid Products: Rows with data anomalies or conflicts are skipped, collected into an error array, and returned to the distributor at the end of the execution.
The Trade-offs
Pros:
- Drastically Improved UX: Distributors no longer have to guess which row broke the upload; they receive a clean report of what failed and can keep working seamlessly.
- Reduced Database I/O: Moving away from a "loop-and-insert" pattern to a single pre-fetch query followed by a batch insert significantly reduces database connection overhead and lock contention.
Cons / Risks:
- Memory Overhead: Holding an entire spreadsheet's rows, image bytes, and a database lookup map in memory simultaneously increases the service's RAM footprint. For exceptionally large files, this risks hitting container memory limits.
- Asynchronous Drift: Because valid rows are saved while invalid ones are rejected, the distributor's local spreadsheet becomes partially out-of-sync with the live database until they apply their fixes.
Layman's Terms: How It Works
Imagine a distributor hands you a giant box of 1,000 product sheets to log into your warehouse system.
Previously, if sheet #452 had a typo, you would throw your hands up, dump the entire box in the trash, and tell the distributor to fix the typo and bring all 1,000 sheets back. To make matters worse, if a sheet had a physical photo glued to it, you had no way of copying it to a new branch folder.
With our new system, we look at the whole box at once. We use a special tool to cleanly peel the photos off the paper and save them digitally. Then, we quickly check our computer to see which products look correct. If 950 products are perfect, we log them into the system immediately and update the warehouse shelves. For the 50 that had typos, we put them in a small folder and hand them back to the distributor saying, "These 950 are done. Just fix these 50 specific errors and give them back to us."
Conclusion
While bulk file handling with embedded media is a common hurdle in enterprise applications, approaching it with smart pre-fetching and a partial-success architecture creates a resilient system. If you are currently building a similar data-ingestion pipeline, implementing early cell-coordination mapping and in-memory comparison will save your users time and protect your database performance.
Top comments (0)