From 3+ Days to 3.8 Hours: Scaling a .NET CSV Importer for SQL Server

How changing requirements forced a 24x performance boost in a data pipeline.

The "Good Enough" Solution That Wasn't

Every project has that one task: "Just load this massive CSV into the database once, and we're done." That was my starting point. I had a 40 GB CSV file containing complex nested JSON structures (models, diagrams, hotspots, elements) that needed to be parsed and saved into a normalized SQL Server schema.

The first version of the importer was straightforward: read a row, parse the JSON, create entities, save to the DB. It worked, but it took over 3 days (~92 hours) to finish. For a one-time migration, this was painful but acceptable. You launch it on Friday, and hopefully, it's done by Monday.

Then the requirements changed.

The business decided this wasn't a one-off event. We needed to load several more files of similar size and potentially update them regularly. Suddenly, a 3-day runtime became a blocker. Loading a queue of files would take weeks, paralyzing analytics and development. The "naive" sequential importer was no longer just slow—it was unusable for the new workflow.

The Challenge: Why Was It Slow?

Parsing and inserting data sounds simple, but at scale (40 GB, millions of complex objects), the "standard" approach hits hard limits:

  1. Sequential Processing: Reading lines and parsing JSON one by one left the CPU idle while waiting for the DB, and vice versa.
  2. Database Round-Trips: Saving entities individually (or in tiny groups) caused massive overhead. The database spent more time managing transactions and network calls than actually storing data (the per-row pattern is sketched after this list).
  3. Memory Pressure: Loading full JsonDocument objects for every row created huge GC pressure.
  4. Fragility: A single error after 2 days of processing could crash the whole pipeline, forcing a restart.
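
For contrast, here is roughly what that per-row flow looks like in EF Core terms. This is a minimal sketch with made-up names (ImportDbContext, ExtractJsonColumn, BuildModel), not the original importer's actual code:

```csharp
// Roughly the shape of the original sequential importer (illustrative, not the
// actual repository code): one JSON parse, one entity graph, and one
// SaveChanges round-trip per CSV row.
static void ImportSequentially(ImportDbContext context, string csvPath)
{
    foreach (var line in File.ReadLines(csvPath))
    {
        // Hypothetical helpers: pull the JSON column out of the CSV line,
        // then map the parsed document onto an entity graph.
        using var doc = JsonDocument.Parse(ExtractJsonColumn(line));
        var model = BuildModel(doc.RootElement);

        context.Add(model);
        context.SaveChanges();   // a transaction and a network round-trip for every single row
    }
}
```

Every pain point above shows up in those few lines: the loop is single-threaded, every iteration pays a full round-trip, and every row allocates a fresh JsonDocument.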

The Solution: High-Performance Architecture

To meet the new "multi-file" requirement, I redesigned the system to be parallel, batched, and resilient.

1. Controlled Parallelism with SemaphoreSlim

Instead of a single thread, I implemented a producer-consumer pattern using SemaphoreSlim to limit concurrency to 8 parallel workers (a sketch follows the list below).

  • Why: It saturates the CPU and DB connection pool just enough to be fast without choking the server. Unbounded parallelism (Parallel.ForEach) would have killed the database performance.
  • Safety: Each worker gets its own DbContext via IDbContextFactory, ensuring thread safety without lock contention.
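
Here is a minimal sketch of that pattern. Only SemaphoreSlim, IDbContextFactory, and the 8-worker limit come from the description above; RowBatch, ImportDbContext, and ProcessBatchAsync are placeholders I'm assuming for illustration (.NET 6+ implicit usings assumed):

```csharp
using Microsoft.EntityFrameworkCore;

// Placeholder types so the sketch compiles; the real project's types differ.
public sealed record RowBatch;
public sealed class ImportDbContext : DbContext { }

public sealed class ParallelImporter
{
    // At most 8 batches are processed concurrently.
    private static readonly SemaphoreSlim Gate = new(initialCount: 8, maxCount: 8);

    private readonly IDbContextFactory<ImportDbContext> _contextFactory;

    public ParallelImporter(IDbContextFactory<ImportDbContext> contextFactory)
        => _contextFactory = contextFactory;

    public async Task ImportAsync(IAsyncEnumerable<RowBatch> batches, CancellationToken ct)
    {
        var inFlight = new List<Task>();

        await foreach (var batch in batches.WithCancellation(ct))
        {
            await Gate.WaitAsync(ct);                    // wait for a free worker slot

            inFlight.Add(Task.Run(async () =>
            {
                try
                {
                    // Each worker owns a private DbContext: thread-safe without locks.
                    await using var db = await _contextFactory.CreateDbContextAsync(ct);
                    await ProcessBatchAsync(db, batch, ct);
                }
                finally
                {
                    Gate.Release();                      // free the slot for the next batch
                }
            }, ct));
        }

        await Task.WhenAll(inFlight);
    }

    // Placeholder for the parse-and-save work each worker performs.
    private static Task ProcessBatchAsync(ImportDbContext db, RowBatch batch, CancellationToken ct)
        => Task.CompletedTask;
}
```

Eight is not a magic number; it's the sweet spot for this particular server, and the semaphore makes it a single value to tune.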

2. Batch Inserts via EF Core (The Big Win)

This was the most critical change. Instead of calling context.Add(entity); context.SaveChanges(); inside the loop, the new system accumulates entities in memory and flushes them in batches of 100+ rows (a sketch follows the list below).

  • Impact: This reduces network round-trips by ~100x and drastically lowers transaction log overhead.
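
A sketch of the accumulate-and-flush idea, reusing the placeholder ImportDbContext from above. ElementEntity, the 500-row batch size, and the ChangeTracker.Clear() call are my illustrative choices, not necessarily what the repository does:

```csharp
public static class BatchedInsert
{
    private const int BatchSize = 500;   // illustrative; the post mentions batches of 100+ rows

    public static async Task ImportAsync(
        ImportDbContext db,
        IAsyncEnumerable<ElementEntity> entities,   // hypothetical stream of parsed entities
        CancellationToken ct)
    {
        var buffer = new List<ElementEntity>(BatchSize);

        await foreach (var entity in entities.WithCancellation(ct))
        {
            buffer.Add(entity);
            if (buffer.Count >= BatchSize)
                await FlushAsync(db, buffer, ct);
        }

        if (buffer.Count > 0)
            await FlushAsync(db, buffer, ct);        // flush the final partial batch
    }

    private static async Task FlushAsync(ImportDbContext db, List<ElementEntity> buffer, CancellationToken ct)
    {
        db.AddRange(buffer);                // stage the whole batch at once
        await db.SaveChangesAsync(ct);      // EF Core groups the inserts into far fewer round-trips
        db.ChangeTracker.Clear();           // stop tracking flushed rows
        buffer.Clear();
    }
}

// Minimal placeholder entity for the sketch.
public sealed class ElementEntity
{
    public int Id { get; set; }
    public string? Payload { get; set; }
}
```

In a long-running import, clearing the change tracker between flushes also prevents the tracked-entity graph from growing without bound, which helps keep memory usage flat.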

3. Architecture & SOLID Principles

To keep the code maintainable, I split the parsing logic into independent Processors, each responsible for a specific part of the JSON (e.g., ModelProcessor, DiagramProcessor); a sketch of the shape follows the list below.

  • SRP (Single Responsibility): Each processor handles only its slice of the domain.
  • DIP (Dependency Inversion): High-level services depend on abstractions (IEntityFactory, IUnitOfWork), making the system easy to test and extend.
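
One possible shape for that split. IEntityFactory and IUnitOfWork are the abstractions named above, but their members here (along with IJsonSectionProcessor and DiagramProcessor's internals) are assumptions made so the example compiles, not the repository's real contracts:

```csharp
using System.Text.Json;

// Assumed minimal contracts for the sketch.
public interface IEntityFactory { Diagram CreateDiagram(JsonElement source); }
public interface IUnitOfWork    { Task AddAsync(object entity, CancellationToken ct); }
public sealed class Diagram     { public string? Name { get; set; } }

public interface IJsonSectionProcessor
{
    string SectionName { get; }   // which slice of the JSON this processor owns (SRP)
    Task ProcessAsync(JsonElement section, IUnitOfWork unitOfWork, CancellationToken ct);
}

public sealed class DiagramProcessor : IJsonSectionProcessor
{
    private readonly IEntityFactory _entityFactory;   // depend on abstractions, not concretions (DIP)

    public DiagramProcessor(IEntityFactory entityFactory) => _entityFactory = entityFactory;

    public string SectionName => "diagrams";

    public async Task ProcessAsync(JsonElement section, IUnitOfWork unitOfWork, CancellationToken ct)
    {
        foreach (var item in section.EnumerateArray())
        {
            var diagram = _entityFactory.CreateDiagram(item);
            await unitOfWork.AddAsync(diagram, ct);
        }
    }
}
```

Adding support for a new slice of the JSON then means adding a new processor, not touching the existing ones.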

4. Reliability Features

  • Retry Policies: Up to 25 retries for transient DB errors (deadlocks, timeouts).
  • Graceful Degradation: If one processor fails on bad data, it logs the error and continues, rather than crashing the entire import.
  • Optimized Parsing: Switched to JsonElement and TryGetProperty for faster, low-allocation JSON traversal (see the sketch below).
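
A small sketch of the TryGetProperty style of traversal, which also shows the "skip bad data instead of throwing" attitude from the graceful-degradation bullet. The Hotspot DTO and property names are made up for the example:

```csharp
using System.Text.Json;

static List<Hotspot> ReadHotspots(string json)
{
    var result = new List<Hotspot>();
    using var doc = JsonDocument.Parse(json);

    // TryGetProperty avoids exceptions on missing fields and keeps the traversal
    // allocation-light compared to deserializing full object graphs.
    if (!doc.RootElement.TryGetProperty("hotspots", out var hotspots) ||
        hotspots.ValueKind != JsonValueKind.Array)
    {
        return result;   // tolerate a missing/malformed section instead of crashing the import
    }

    foreach (var item in hotspots.EnumerateArray())
    {
        result.Add(new Hotspot
        {
            Id   = item.TryGetProperty("id", out var id) ? id.GetString() : null,
            Name = item.TryGetProperty("name", out var name) ? name.GetString() : null
        });
    }

    return result;
}

// Minimal illustrative DTO for the sketch above.
public sealed class Hotspot
{
    public string? Id { get; init; }
    public string? Name { get; init; }
}
```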

The Results: 24x Faster

The performance jump was massive, turning a "weekend task" into a "lunch break task".

| Metric | Original Version | Optimized Version | Improvement |
| --- | --- | --- | --- |
| Total Time (40 GB) | ~92 hours (3.8 days) | ~3.8 hours | ~24x |
| Throughput | 8–12 rows/sec | 192–300 rows/sec | ~25x |
| Time per 1000 rows | 83–125 sec | 3–5 sec | ~25x |
| Parallelism | 1 thread | 8 workers | 8x |
| Memory Usage | 2 GB+ | ~400 MB | ~5x |

Comparison of Import Performance: Original vs Optimized Architecture

Key Takeaways

  1. Context Matters: A 3-day script is fine once. It's fatal if you need to run it repeatedly. Always ask, "How often will we run this?"
  2. Batching is King: In EF Core, moving from single inserts to batching is often the single most effective performance upgrade you can make.
  3. Parallelism Needs Limits: Throwing 100 threads at SQL Server will just slow it down. Finding the "sweet spot" (e.g., 8 workers) is key.
  4. Resilience is a Feature: When running for hours, networks will blink and deadlocks will happen. Retry policies turn crashes into minor log warnings.

Future Plans

  • Add comprehensive tests (xUnit + Moq) with 85%+ coverage for all processors.
  • Profile individual pipeline stages to find the next bottleneck (likely JSON parsing CPU time).
  • Expose configuration (batch size, thread count) to adapt to different server specs dynamically (a possible shape is sketched below).
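
A possible shape for that configuration, assuming the standard .NET options pattern; the section name, property names, and defaults below are illustrative:

```csharp
// Strongly typed settings bound from configuration, so batch size and worker
// count can be tuned per environment without recompiling.
public sealed class ImportOptions
{
    public int BatchSize { get; set; } = 500;             // rows per SaveChanges flush
    public int MaxDegreeOfParallelism { get; set; } = 8;  // concurrent workers
}

// Registration in Program.cs:
// builder.Services.Configure<ImportOptions>(builder.Configuration.GetSection("Import"));
//
// appsettings.json:
// "Import": { "BatchSize": 500, "MaxDegreeOfParallelism": 8 }
```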

Check out the code: Link to GitHub Repository
