Zipping 15GB of S3 files in 6s. How the power of community made it possible.

In my first article, I showed how parallelizing zip assembly across multiple Lambdas can beat the single-Lambda bandwidth ceiling. I zipped 6.9GB in 35 seconds with just 5 workers.

Since then, Jérémie published a follow-up article where a contributor (Fitz) introduced a brilliant optimization: UploadPartCopy. Instead of downloading (or even streaming) big files through Lambda just to upload them back into the zip, you can tell S3 to copy them server-side. This halves the bandwidth requirement and brought his single-Lambda solution down to 106 seconds.

I took Fitz's UploadPartCopy idea and combined it with my parallel approach. Here's what happened.

What I took from Jérémie and Fitz

The UploadPartCopy insight is elegant: since ZIP STORE mode has deterministic offsets, we know exactly where each file's data lands in the final archive. For big files (≥5MB), we can:

  1. Write just the local file header (50 bytes) in an UploadPart
  2. Have S3 copy the file data directly via UploadPartCopy — no download, no upload, instant

This means workers barely use any memory or bandwidth for big files.
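To illustrate why STORE-mode offsets are deterministic, here is a minimal Go sketch (not the code from my repo). The names `Entry`, `Placement` and `Layout` are hypothetical. It assumes every local file header is made of the 30 fixed bytes, the file name, and a 20-byte ZIP64 extra field; a real layout would depend on which extra fields you actually write.

```go
package main

import "fmt"

const (
	locFixedSize   = 30 // fixed part of a ZIP local file header
	zip64ExtraSize = 20 // assumption: a ZIP64 extra field is written for every entry
)

// Entry is a hypothetical description of one file to archive.
type Entry struct {
	Name string
	Size int64 // STORE mode: stored size equals the original size
}

// Placement records where an entry's header and data land in the archive.
type Placement struct {
	HeaderOffset int64
	DataOffset   int64
}

// Layout computes every offset up front, without reading any file content.
// It returns the placements and the offset where the central directory would start.
func Layout(entries []Entry) ([]Placement, int64) {
	placements := make([]Placement, len(entries))
	var offset int64
	for i, e := range entries {
		headerLen := int64(locFixedSize + len(e.Name) + zip64ExtraSize)
		placements[i] = Placement{HeaderOffset: offset, DataOffset: offset + headerLen}
		offset += headerLen + e.Size
	}
	return placements, offset
}

func main() {
	entries := []Entry{{Name: "a.bin", Size: 10}, {Name: "bb.bin", Size: 20}}
	placements, end := Layout(entries)
	for i, p := range placements {
		fmt.Printf("%s: header at %d, data at %d\n", entries[i].Name, p.HeaderOffset, p.DataOffset)
	}
	fmt.Println("central directory starts at", end)
}
```

Because nothing here depends on file content, the whole plan can be computed from a listing alone, which is what lets the workers run independently.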

The only catch is that the S3 multipart upload API requires every part except the last one to be at least 5MB. The local file header (LOC) is far smaller than that, so it has to be appended to the data of another file (or group of files) in the same part.

My planner Lambda groups small files together until they reach 5MB, appends the LOC header of the next big file, then the worker fires an UploadPartCopy for that big file's data.

When we run out of small files, we stream the smallest remaining big file, append the LOC header of the largest remaining one, and pair that part with a server-side copy of the largest file's data.
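The pairing logic described above can be sketched roughly as follows. This is a simplified illustration, not the planner from the repo: `File`, `Assignment` and `Plan` are hypothetical names, header bytes are ignored when checking the 5MB threshold, and whatever is left over at the end is lumped into a single final assignment.

```go
package main

import (
	"fmt"
	"sort"
)

const minPartSize = 5 * 1024 * 1024 // S3 minimum size for every part except the last

// File is a hypothetical description of one source object.
type File struct {
	Key  string
	Size int64
}

// Assignment is the work for one worker: stream some files (followed by the
// LOC header of Copied), then copy Copied server-side. Copied is nil for leftovers.
type Assignment struct {
	Streamed []File
	Copied   *File
}

// Plan groups small files until they reach 5MB, then pairs each group with the
// largest remaining big file. When small files run out, the smallest remaining
// big file is streamed instead.
func Plan(files []File) []Assignment {
	var small, big []File
	for _, f := range files {
		if f.Size >= minPartSize {
			big = append(big, f)
		} else {
			small = append(small, f)
		}
	}
	sort.Slice(big, func(i, j int) bool { return big[i].Size < big[j].Size })

	var plan []Assignment
	var group []File
	var groupSize int64
	for len(big) > 0 {
		for groupSize < minPartSize && len(small) > 0 {
			group = append(group, small[0])
			groupSize += small[0].Size
			small = small[1:]
		}
		if groupSize < minPartSize {
			if len(big) == 1 {
				break // the last big file has no partner: it stays an orphan
			}
			group = append(group, big[0]) // stream the smallest big file
			groupSize += big[0].Size
			big = big[1:]
		}
		largest := big[len(big)-1]
		big = big[:len(big)-1]
		plan = append(plan, Assignment{Streamed: group, Copied: &largest})
		group, groupSize = nil, 0
	}
	// Leftovers (small files and possibly one orphan big file) are streamed last.
	group = append(group, small...)
	group = append(group, big...)
	if len(group) > 0 {
		plan = append(plan, Assignment{Streamed: group})
	}
	return plan
}

func main() {
	const mb = 1 << 20
	files := []File{
		{"a", 1 * mb}, {"b", 2 * mb}, {"c", 3 * mb},
		{"X", 10 * mb}, {"Y", 20 * mb}, {"Z", 6 * mb},
	}
	for i, a := range Plan(files) {
		copied := "(none)"
		if a.Copied != nil {
			copied = a.Copied.Key
		}
		fmt.Printf("assignment %d: stream %d file(s), copy %s\n", i, len(a.Streamed), copied)
	}
}
```

Pairing streamed groups with the largest files first means the expensive files never pass through a Lambda, while the streamed bytes stay as small as the 5MB rule allows.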

For the CRC32 (required in zip headers): files uploaded with recent AWS SDKs already have a CRC32 checksum stored by S3 alongside the object. A simple HeadObject call retrieves it, with no need to read the file.
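One detail worth sketching: S3 reports the CRC32 checksum as a base64 string of four big-endian bytes (and HeadObject only returns it when checksum mode is enabled on the request), whereas ZIP headers store the CRC32 as a little-endian 32-bit integer. The helper below, `DecodeS3CRC32`, is a hypothetical name for illustration. As far as I understand, objects that were themselves uploaded in multiple parts may expose a composite checksum rather than a full-object CRC32; this sketch simply rejects anything that is not four base64-encoded bytes.

```go
package main

import (
	"encoding/base64"
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// DecodeS3CRC32 converts a base64 checksum string, as returned by S3,
// into the uint32 value that goes into the ZIP headers.
func DecodeS3CRC32(b64 string) (uint32, error) {
	raw, err := base64.StdEncoding.DecodeString(b64)
	if err != nil {
		return 0, fmt.Errorf("decode checksum: %w", err)
	}
	if len(raw) != 4 {
		return 0, fmt.Errorf("expected 4 checksum bytes, got %d", len(raw))
	}
	return binary.BigEndian.Uint32(raw), nil
}

func main() {
	// Simulate what S3 would report for a small object.
	sum := crc32.ChecksumIEEE([]byte("hello world"))
	var be [4]byte
	binary.BigEndian.PutUint32(be[:], sum)
	fromS3 := base64.StdEncoding.EncodeToString(be[:])

	crc, err := DecodeS3CRC32(fromS3)
	if err != nil {
		panic(err)
	}

	// ZIP headers store the same value in little-endian order.
	var le [4]byte
	binary.LittleEndian.PutUint32(le[:], crc)
	fmt.Printf("S3 value %q -> crc 0x%08X -> zip bytes % X\n", fromS3, crc, le)
}
```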

Step Functions: three limitations

My original architecture used Step Functions to orchestrate workers. Here's what I hit.

1. Inline Map caps at ~40 concurrent iterations

The AWS documentation says the Inline Map state supports "up to 40 concurrent iterations." In practice I saw up to 55, but never more. With 1500 duos to process, Step Functions queued them in batches of 55.

I switched to Distributed Map, which launches Express child workflow executions. All 1120 iterations started within 2 seconds. Problem solved? Not quite.

2. Distributed Map: fast to dispatch, slow to collect

With Distributed Map, all workers started within 2 seconds. Every single one finished in under 1 second (mostly UploadPartCopy calls). Lambda compute per worker: ~500ms on average.

Yet the Map state took 38 seconds to complete.

The bottleneck? Step Functions' internal machinery for collecting and aggregating results from 1120 Express child executions. I confirmed it in the logs: all workers started at 10:06:52-53 and all had finished by 10:06:54, but the Map state didn't report success until 10:07:28. That is roughly 35 seconds of pure orchestration overhead.

3. The 256KB payload limit

Step Functions states can pass at most 256KB between them. With 3000 files:

  • The planner's assignment list exceeds 256KB → had to write to S3
  • The aggregated worker results exceed 256KB → had to write CRC32s to S3, read them back in the finalizer

This added complexity and latency: the finalizer had to read 1500 small S3 files, which took 29 seconds sequentially until I parallelized it down to 1.5s.
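A quick way to see when the limit bites is to measure the serialized payload before deciding whether to pass it between states or offload it to S3. The sketch below is only an illustration: `Assignment`, `FitsInline` and `sampleAssignments` are hypothetical names, and the JSON shape is made up rather than taken from my planner.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// maxStatePayload is the Step Functions limit on data passed between states.
const maxStatePayload = 256 * 1024

// Assignment is a made-up payload shape, for illustration only.
type Assignment struct {
	Key    string `json:"key"`
	Offset int64  `json:"offset"`
}

// FitsInline reports the JSON size of v and whether it fits in a state payload.
func FitsInline(v interface{}) (int, bool, error) {
	data, err := json.Marshal(v)
	if err != nil {
		return 0, false, err
	}
	return len(data), len(data) <= maxStatePayload, nil
}

// sampleAssignments builds n assignments whose keys are keyLen characters long.
func sampleAssignments(n, keyLen int) []Assignment {
	out := make([]Assignment, n)
	for i := range out {
		out[i] = Assignment{Key: strings.Repeat("k", keyLen), Offset: int64(i)}
	}
	return out
}

func main() {
	for _, n := range []int{10, 3000} {
		size, fits, err := FitsInline(sampleAssignments(n, 100))
		if err != nil {
			panic(err)
		}
		fmt.Printf("%d assignments: %d bytes, fits inline: %v\n", n, size, fits)
	}
}
```

With keys around 100 characters, a few thousand entries are already past the limit, which is why both the plan and the results ended up in S3.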

After all these fixes, the Step Functions version ran in 41 seconds for 3000 × 5MB files. Respectable — 2.5× faster than Jérémie's 106s — but I knew most of that time was Step Functions overhead, not actual work.

The final version: direct Lambda invocation

I stripped out Step Functions entirely and wrote a single orchestrator Lambda that:

  1. Lists files, computes zip layout (the job of the "planner" Lambda in my Step Functions architecture), and initiates multipart upload (~0.5s)
  2. Invokes all worker Lambdas synchronously in parallel using goroutines + the Lambda SDK (~0.5s to dispatch)
  3. Collects results (workers return inline, no S3 round-trip for parts)
  4. Reads CRC32 files from S3 in parallel, builds central directory, completes multipart upload (~1s)

Orchestrator Lambda (15min timeout, 1024MB)
    │
    ├─── goroutine → Invoke Worker 1 (sync) → return {parts}
    ├─── goroutine → Invoke Worker 2 (sync) → return {parts}
    ├─── ...
    └─── goroutine → Invoke Worker N (sync) → return {parts}
    │
    └─── All done → Build CD → CompleteMultipartUpload

The Lambda SDK's synchronous Invoke blocks until the worker returns. With 200 concurrent goroutines, all workers are dispatched instantly. No orchestration overhead, no state size limits for the parts (only CRC32s go to S3), no 35-second result aggregation.
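The fan-out/fan-in pattern itself fits in a few lines of Go. The sketch below is not the orchestrator from the repo: `Invoker` is a hypothetical function type standing in for a synchronous Lambda invocation (in real code it would wrap the SDK's Invoke call), and `echoInvoker` is a fake worker so the example runs without AWS.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// Invoker stands in for a synchronous Lambda invocation (hypothetical type).
type Invoker func(ctx context.Context, payload []byte) ([]byte, error)

// FanOut invokes one worker per payload, with at most maxInFlight calls running
// at once, and returns the results in the same order as the payloads.
func FanOut(ctx context.Context, invoke Invoker, payloads [][]byte, maxInFlight int) ([][]byte, error) {
	if maxInFlight < 1 {
		maxInFlight = 1
	}
	results := make([][]byte, len(payloads))
	errs := make([]error, len(payloads))
	sem := make(chan struct{}, maxInFlight)
	var wg sync.WaitGroup
	for i, p := range payloads {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxInFlight calls are already running
		go func(i int, p []byte) {
			defer wg.Done()
			defer func() { <-sem }()
			// Each goroutine writes only to its own slot, so no lock is needed.
			results[i], errs[i] = invoke(ctx, p)
		}(i, p)
	}
	wg.Wait()
	for i, err := range errs {
		if err != nil {
			return nil, fmt.Errorf("worker %d: %w", i, err)
		}
	}
	return results, nil
}

// echoInvoker is a fake worker used only for this demo.
func echoInvoker(_ context.Context, payload []byte) ([]byte, error) {
	return append([]byte("done:"), payload...), nil
}

// runDemo fans out three fake invocations with a limit of 200 in flight.
func runDemo() ([][]byte, error) {
	payloads := [][]byte{[]byte("w1"), []byte("w2"), []byte("w3")}
	return FanOut(context.Background(), echoInvoker, payloads, 200)
}

func main() {
	out, err := runDemo()
	if err != nil {
		panic(err)
	}
	for _, r := range out {
		fmt.Println(string(r))
	}
}
```

Results come back in memory, indexed by worker, so the orchestrator can go straight to building the central directory once `wg.Wait()` returns.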

The theoretical time is now the orchestration time plus the time to upload the smallest large file left orphaned once every other large file has been paired with a group of small files or with another large file.

Results: 3000 × 5MB benchmark

| Approach | Time | Notes |
|---|---|---|
| Jérémie Gen1 (Rust, streaming) | 212s | Single Lambda, 512MB |
| Jérémie Gen2 (Rust, UploadPartCopy) | 106s | Single Lambda, 640MB |
| My Step Functions version | 41s | Distributed Map, 1120 workers |
| My orchestrator Lambda | 6s | Direct invoke, ~1500 workers |

6 seconds to zip 15GB into a single valid ZIP64 archive. That's an 18× improvement over the optimized single-Lambda approach, and 35× over the original.

Worker stats:

  • Max memory: 85 MB (I initially allocated 3008MB — massively over-provisioned thanks to UploadPartCopy)
  • Average duration: 516ms per worker
  • Max duration: 1035ms

What I learned (round 2)

  1. Step Functions Map states add seconds, not milliseconds. For latency-sensitive fan-out/fan-in, direct Lambda invocation is faster. Step Functions shines when you need retries, error handling, visual debugging, or long-running workflows, and its step transitions are lightning fast. That performance only holds until you need more than 40 parallel iterations.

  2. UploadPartCopy is the killer optimization. When most files are ≥5MB, workers barely do any work — they just tell S3 to copy data server-side. Memory stays under 100MB regardless of file sizes.

  3. The orchestrator pattern is underrated. A single Lambda with goroutines can invoke hundreds of child Lambdas synchronously, collect results, and finalize — all within one execution context. No state machine, no payload limits between states, no aggregation overhead.

  4. Over-parallelization can hurt. 1500 separate assignments created more Step Functions overhead than the actual compute. Grouping into fewer, larger batches would have been better for the SFN approach.

Try it

Code: github.com/psantus/on-demand-archive-on-s3

The repo has both approaches: Step Functions (cmd/planner + cmd/worker + cmd/finalizer) and the orchestrator Lambda (cmd/orchestrator).

Jérémie's challenge repo: github.com/RustyServerless/demo-s3-archiving

What's next?

The theoretical minimum is bounded by Lambda cold start time (~200ms) plus the slowest UploadPart call (if we lack small files, we may need to upload a large file manually to append another file's LOC to it) plus orchestrator overhead (~500ms).

Your move, Jérémie 😏

Edit: with 73.2GB (15,000 files), my solution still performs quite acceptably: just 20s (probably limited by my account's default concurrency of 1000; it would likely be faster on an account with a higher limit :D)

Paul out.
