How changing requirements forced a 24x performance boost in a data pipeline.
The "Good Enough" Solution That Wasn't
Every project has that one task: "Just load this massive CSV into the database once, and we're done." That was my starting point. I had a 40 GB CSV file containing complex nested JSON structures (models, diagrams, hotspots, elements) that needed to be parsed and saved into a normalized SQL Server schema.
The first version of the importer was straightforward: read a row, parse the JSON, create entities, save to the DB. It worked, but it took over 3 days (~92 hours) to finish. For a one-time migration, this was painful but acceptable. You launch it on Friday, and hopefully, it's done by Monday.
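In outline, that first version boiled down to a loop like this (a simplified sketch, not the actual repository code; ReadCsvRows and MapToEntity stand in for the real CSV reading and mapping logic):

```csharp
// Naive importer: one JSON parse and one SaveChanges per CSV row.
// ReadCsvRows and MapToEntity are illustrative placeholders.
foreach (var row in ReadCsvRows(csvPath))
{
    using var doc = JsonDocument.Parse(row.JsonPayload); // System.Text.Json
    var entity = MapToEntity(doc.RootElement);

    context.Add(entity);
    context.SaveChanges(); // a full database round-trip for every single row
}
```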
Then the requirements changed.
The business decided this wasn't a one-off event. We needed to load several more files of similar size and potentially update them regularly. Suddenly, a 3-day runtime became a blocker. Loading a queue of files would take weeks, paralyzing analytics and development. The "naive" sequential importer was no longer just slow—it was unusable for the new workflow.
The Challenge: Why Was It Slow?
Parsing and inserting data sounds simple, but at scale (40 GB, millions of complex objects), the "standard" approach hits hard limits:
- Sequential Processing: Reading lines and parsing JSON one by one left the CPU idle while waiting for the DB, and vice versa.
- Database Round-Trips: Saving entities individually (or in tiny groups) caused massive overhead. The database spent more time managing transactions and network calls than actually storing data.
- Memory Pressure: Loading full JsonDocument objects for every row created huge GC pressure.
- Fragility: A single error after 2 days of processing could crash the whole pipeline, forcing a restart.
The Solution: High-Performance Architecture
To meet the new "multi-file" requirement, I redesigned the system to be parallel, batched, and resilient.
1. Controlled Parallelism with SemaphoreSlim
Instead of a single thread, I implemented a producer-consumer pattern using SemaphoreSlim to limit concurrency to 8 parallel workers.
- Why: It saturates the CPU and DB connection pool just enough to be fast without choking the server. Unbounded parallelism (Parallel.ForEach) would have killed database performance.
- Safety: Each worker gets its own DbContext via IDbContextFactory, ensuring thread safety without lock contention.
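Here is a minimal sketch of that worker pattern. ImportDbContext, CsvRow, and ImportBatchAsync are hypothetical placeholders for the project's own types; the batching itself happens on the producer side:

```csharp
// Bounded parallelism: at most 8 workers touch the database at any time.
// ImportDbContext, CsvRow and ImportBatchAsync are illustrative placeholders.
public async Task RunAsync(
    IDbContextFactory<ImportDbContext> contextFactory,
    IEnumerable<IReadOnlyList<CsvRow>> batches)
{
    var semaphore = new SemaphoreSlim(8);
    var tasks = new List<Task>();

    foreach (var batch in batches)
    {
        await semaphore.WaitAsync();            // wait for a free worker slot
        tasks.Add(Task.Run(async () =>
        {
            try
            {
                // One DbContext per worker: thread safety without shared locks.
                await using var context = await contextFactory.CreateDbContextAsync();
                await ImportBatchAsync(context, batch);
            }
            finally
            {
                semaphore.Release();
            }
        }));
    }

    await Task.WhenAll(tasks);
}
```

Because WaitAsync pauses the producer loop once all 8 slots are busy, the CSV reader never races far ahead of the database, which also keeps memory bounded.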
2. Batch Inserts via EF Core (The Big Win)
This was the most critical change. Instead of context.Add(entity); context.SaveChanges(); inside the loop, the new system accumulates entities in memory and flushes them in batches of 100+ rows.
- Impact: This reduces network round-trips by ~100x and drastically lowers transaction log overhead.
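A minimal sketch of that batching loop; the entity type, the ParseRow helper, and the batch size are illustrative rather than the exact repository code:

```csharp
const int BatchSize = 100;
var buffer = new List<ModelEntity>(BatchSize);   // ModelEntity is a placeholder type

foreach (var row in rows)
{
    buffer.Add(ParseRow(row));                   // build entities in memory only

    if (buffer.Count >= BatchSize)
    {
        context.AddRange(buffer);
        await context.SaveChangesAsync();        // one round-trip per 100 rows
        context.ChangeTracker.Clear();           // stop the change tracker from growing
        buffer.Clear();
    }
}

if (buffer.Count > 0)                            // flush the final partial batch
{
    context.AddRange(buffer);
    await context.SaveChangesAsync();
    context.ChangeTracker.Clear();
}
```

Clearing the change tracker between flushes also helps with the memory pressure described earlier, since EF Core no longer tracks millions of already-saved entities.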
3. Architecture & SOLID Principles
To keep the code maintainable, I split the parsing logic into independent Processors, each responsible for a specific part of the JSON (e.g., ModelProcessor, DiagramProcessor).
- SRP (Single Responsibility): Each processor handles only its slice of the domain.
- DIP (Dependency Inversion): High-level services depend on abstractions (IEntityFactory, IUnitOfWork), making the system easy to test and extend.
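To make that structure concrete, here is a hypothetical shape for one processor. The IEntityFactory and IUnitOfWork names come from the article, but IEntityProcessor and the members shown (Process, Add, CreateModel) are assumptions, not the repository's exact API:

```csharp
// Hypothetical processor abstraction: each implementation owns one slice of the JSON.
public interface IEntityProcessor
{
    void Process(JsonElement root, IUnitOfWork unitOfWork);
}

public sealed class ModelProcessor : IEntityProcessor
{
    private readonly IEntityFactory _factory;

    public ModelProcessor(IEntityFactory factory) => _factory = factory;

    public void Process(JsonElement root, IUnitOfWork unitOfWork)
    {
        // TryGetProperty keeps traversal allocation-free and tolerant of missing nodes.
        if (root.TryGetProperty("model", out JsonElement model))
        {
            unitOfWork.Add(_factory.CreateModel(model));
        }
    }
}
```

Supporting a new part of the JSON then means adding a new processor rather than touching the existing ones.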
4. Reliability Features
- Retry Policies: Up to 25 retries for transient DB errors (deadlocks, timeouts).
- Graceful Degradation: If one processor fails on bad data, it logs the error and continues, rather than crashing the entire import.
- Optimized Parsing: Switched to JsonElement and TryGetProperty for faster, low-allocation JSON traversal.
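A minimal sketch of such a retry wrapper, assuming a hypothetical ExecuteWithRetryAsync helper and a simplified transient-error check; the real project may classify errors and back off differently:

```csharp
// Retries transient DB failures (deadlocks, timeouts) with a capped, growing delay.
// SqlException comes from Microsoft.Data.SqlClient.
public static async Task<T> ExecuteWithRetryAsync<T>(
    Func<Task<T>> action, int maxRetries = 25, int baseDelayMs = 500)
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return await action();
        }
        catch (Exception ex) when (attempt < maxRetries && IsTransient(ex))
        {
            await Task.Delay(Math.Min(baseDelayMs * attempt, 30_000));
        }
    }
}

// Simplified check: deadlock victim (1205) and command timeout (-2) from SQL Server.
private static bool IsTransient(Exception ex) =>
    ex is TimeoutException
    || (ex is SqlException sql && (sql.Number == 1205 || sql.Number == -2))
    || (ex.InnerException is SqlException inner && (inner.Number == 1205 || inner.Number == -2));
```

In this sketch, each batch flush would be wrapped in the helper, so a deadlock or dropped connection turns into a logged pause rather than a dead pipeline.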
The Results: 24x Faster
The performance jump was massive, turning a "weekend task" into a "lunch break task".
| Metric | Original Version | Optimized Version | Improvement |
|---|---|---|---|
| Total Time (40 GB) | ~92 hours (3.8 days) | ~3.8 hours | ~24x |
| Throughput | 8–12 rows/sec | 192–300 rows/sec | ~25x |
| Time per 1000 rows | 83–125 sec | 3–5 sec | ~25x |
| Parallelism | 1 thread | 8 workers | 8x |
| Memory Usage | 2 GB+ | ~400 MB | ~5x |
Key Takeaways
- Context Matters: A 3-day script is fine once. It's fatal if you need to run it repeatedly. Always ask: "How often will we run this?"
- Batching is King: In EF Core, moving from single inserts to batching is often the single most effective performance upgrade you can make.
- Parallelism Needs Limits: Throwing 100 threads at SQL Server will just slow it down. Finding the "sweet spot" (e.g., 8 workers) is key.
- Resilience is a Feature: When running for hours, networks will blink and deadlocks will happen. Retry policies turn crashes into minor log warnings.
Future Plans
- Add comprehensive tests (xUnit + Moq) with 85%+ coverage for all processors.
- Profile individual pipeline stages to find the next bottleneck (likely JSON parsing CPU time).
- Expose configuration (batch size, thread count) to adapt to different server specs dynamically.
Check out the code: Link to GitHub Repository
