S3 zipper challenge: a parallel zip assembly that beats the single Lambda approach

#aws #serverless #s3 #zip

I recently read Jérémie Rodon's excellent article "On-Demand Archives on S3", where he describes an elegant Rust solution for zipping 3,000 × 5MB files from S3 within a single Lambda function.

His approach is impressive: streaming a ZIP archive through a custom Rotating Slab Buffer, saturating bandwidth with concurrent downloads, all within 512MB of RAM. The result: 3 minutes 35 seconds.

It looked like a good challenge to try to beat. His article ends with an open invitation: "do you think you can do better with your favorite language?" Well, my favorite language is neither Rust nor Go nor... however, I'm fluent in serverless ;) so I took a different angle entirely.

A Different Approach: Why Not Parallelize the Problem?

Jérémie's constraint was a single Lambda. That's elegant, but it means you're bound by one machine's network bandwidth (~600 Mbps). No matter how perfect your streaming is, physics wins: 15GB at 600 Mbps ≈ 200 seconds minimum.

My question was: what if we break that single-machine bottleneck?

The key insight is that ZIP files in STORE mode (no compression) have deterministic byte offsets. Each entry is exactly 50 + len(filename) + filesize bytes: a 30-byte fixed local header, the filename, a 20-byte ZIP64 extra field, then the raw data. If you know all filenames and sizes upfront, you can pre-calculate exactly where every file will land in the final archive, before downloading a single byte.
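Here is a minimal sketch of that offset calculation in Go. It is an illustration, not the project's actual code: the `Entry` type and the `entrySize` and `planOffsets` names are made up for this example, and it assumes every entry carries the ZIP64 extra field and no trailing data descriptor.

```go
package main

import "fmt"

// Entry describes one source object. Key and Size would come from an S3 listing.
// (Hypothetical type, for illustration only.)
type Entry struct {
	Key    string
	Size   int64
	Offset int64 // byte offset of the entry's local header in the final archive
}

const (
	localHeaderFixed = 30 // fixed part of a ZIP local file header
	zip64ExtraLen    = 20 // ZIP64 extra field: 2 (ID) + 2 (length) + 8 + 8 (sizes)
)

// entrySize returns the bytes one STORE-mode entry occupies:
// 30 + 20 + len(name) + size = 50 + len(name) + size.
// len(name) counts bytes, which is what the ZIP format stores.
func entrySize(name string, size int64) int64 {
	return localHeaderFixed + zip64ExtraLen + int64(len(name)) + size
}

// planOffsets fills in Offset for every entry (in archive order) and returns
// the offset at which the central directory would start.
func planOffsets(entries []Entry) int64 {
	var off int64
	for i := range entries {
		entries[i].Offset = off
		off += entrySize(entries[i].Key, entries[i].Size)
	}
	return off
}

func main() {
	entries := []Entry{
		{Key: "a.bin", Size: 5_000_000},
		{Key: "dir/b.bin", Size: 5_000_000},
	}
	cdStart := planOffsets(entries)
	for _, e := range entries {
		fmt.Printf("%-10s offset=%d\n", e.Key, e.Offset)
	}
	fmt.Println("central directory starts at", cdStart)
}
```

Nothing here touches S3: the whole plan comes from names and sizes, which is why it can run before any download starts.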

This means independent workers can each build their portion of the zip in parallel, and S3's multipart upload lets them write their chunks independently: parts can be uploaded in any order by different processes sharing the same upload ID, and S3 stitches them together in part-number order on completion.

Architecture

Planner Lambda → Step Functions Distributed Map → N Worker Lambdas → Finalizer Lambda
     │                        │ │ │                        │
     │ CreateMultipartUpload  │ │ │ UploadPart (parallel)  │ CompleteMultipartUpload
     ▼                        ▼ ▼ ▼                        ▼
                         S3 Output Bucket

Planner: Lists all source files, computes zip byte offsets, initiates multipart upload, divides work into balanced batches (equal data volume per worker).
Workers (N concurrent): Each downloads its assigned files, constructs zip local file headers + raw data, computes CRC32 on the fly, streams to S3 as multipart parts.
Finalizer: Builds the central directory with real CRC32 values, uploads it as the final part, calls CompleteMultipartUpload.
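The Planner's "balanced batches" step could look something like the sketch below. This is one possible greedy split, not necessarily what the project does: `File`, `Batch` and `splitBatches` are names invented for the example. It keeps each batch a contiguous run of files, since S3 assembles parts in part-number order and each worker's output therefore has to be a contiguous slice of the archive.

```go
package main

import "fmt"

// File is one source object (hypothetical type for this sketch).
type File struct {
	Key  string
	Size int64
}

// Batch is a contiguous run of files handled by one worker.
type Batch struct {
	Files []File
	Bytes int64
}

// splitBatches cuts files (kept in archive order) into at most n contiguous
// batches of roughly equal data volume.
func splitBatches(files []File, n int) []Batch {
	var total int64
	for _, f := range files {
		total += f.Size
	}
	var batches []Batch
	var cur Batch
	var done int64 // bytes assigned so far, including the open batch
	for _, f := range files {
		cur.Files = append(cur.Files, f)
		cur.Bytes += f.Size
		done += f.Size
		// Close the batch once the running total crosses the next 1/n boundary.
		// The last batch stays open and takes whatever remains.
		if len(batches) < n-1 && done*int64(n) >= total*int64(len(batches)+1) {
			batches = append(batches, cur)
			cur = Batch{}
		}
	}
	if len(cur.Files) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	files := make([]File, 10)
	for i := range files {
		files[i] = File{Key: fmt.Sprintf("file-%02d.bin", i), Size: 5_000_000}
	}
	for i, b := range splitBatches(files, 5) {
		fmt.Printf("worker %d: %d files, %d bytes\n", i, len(b.Files), b.Bytes)
	}
}
```

One thing a real planner would also need to respect: S3 requires every part except the last to be at least 5 MiB, so batches (or the parts within them) cannot be arbitrarily small.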

Results

On a quota-constrained training account (the concurrency limit was 10, so I used only 5 concurrent Lambdas at 3008MB each), zipping 6.9GB across 160 files took 35 seconds:

Metric	Single Lambda (Jérémie's Rust)	Parallel (this project)
Approach	Stream within 1 Lambda	Fan-out N workers
Time (15GB, 3000 files)	~215s	Estimated ~10-15s with 100+ workers
Time (6.9GB, 160 files, 5 workers)	-	35s
Memory per worker	512MB	3008MB (could be lower)
Language	Rust 🦀	Go

With a production account (1000 concurrent Lambdas), I estimate the 3000 × 5MB scenario would complete in under 15 seconds: with about 100 workers, each one handles ~150MB, so the download takes ~2s at 600Mbps and the upload another ~2s. The bottleneck shifts from bandwidth to Lambda cold start (~200ms for Go on ARM64).
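The back-of-the-envelope numbers are easy to reproduce. The snippet below is only arithmetic: it takes the ~600 Mbps per-Lambda figure at face value, uses decimal units, assumes 100 workers (15GB divided by ~150MB each), and ignores cold starts and orchestration overhead. `transferSeconds` is a helper written for this example.

```go
package main

import "fmt"

// transferSeconds returns how long it takes to move sizeBytes over a link of
// mbps megabits per second (decimal units, no protocol overhead).
func transferSeconds(sizeBytes, mbps float64) float64 {
	return sizeBytes * 8 / (mbps * 1e6)
}

func main() {
	const linkMbps = 600.0 // per-Lambda bandwidth figure used in this article
	total := 15e9          // 3000 files x 5MB
	workers := 100.0       // assumption: ~150MB per worker

	// Single Lambda: just downloading 15GB already takes this long.
	fmt.Printf("single Lambda, download only: %.0fs\n", transferSeconds(total, linkMbps))

	// Fan-out: each worker only moves its own slice, once down and once up.
	perWorker := total / workers
	fmt.Printf("per worker: %.0fMB, download %.1fs, upload %.1fs\n",
		perWorker/1e6,
		transferSeconds(perWorker, linkMbps),
		transferSeconds(perWorker, linkMbps))
}
```

The gap between those few seconds of transfer and the "under 15 seconds" estimate is the allowance for everything this arithmetic leaves out.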

Tradeoffs

Jérémie's approach is simpler to deploy (one Lambda, no orchestration) and cheaper per invocation (512MB × 215s vs N × 3008MB × a few seconds). It's the right choice when you want minimal infrastructure.

The parallel approach wins on wall-clock time, and dramatically so. It's the right choice when the user is waiting and you want the archive ready in seconds, not minutes.

	Single Lambda	Parallel Fan-Out
Wall-clock time	Bounded by bandwidth	Bounded by slowest worker
Complexity	Low	Medium (Step Functions + 3 Lambdas)
Cost per archive	Lower	Higher (more Lambda-seconds total)
Scalability	Fixed ceiling (~600Mbps)	Linear with concurrency
Memory efficiency	Excellent (512MB)	Good (3GB, could optimize)

If I were to use it in prod, there is plenty of room for optimization: our current Lambdas used at most 1875MB, well below the allocated 3GB, and we could use Jérémie's streaming optimizations to cut that by a factor of 10. Yet we'd probably still have some overhead compared to Jérémie's solution (cold starts, TLS negotiations...), and so far it's just a vanity project :)

What I Learned

ZIP STORE mode is embarrassingly parallel: deterministic offsets mean zero coordination between workers during the data phase.
S3 multipart upload is the perfect primitive: parts uploaded out of order, by different processes, assembled by S3 at the end.
Step Functions Distributed Map is ideal for this pattern: it handles fan-out, concurrency limits, retries, and result collection.
The real bottleneck at scale is Lambda concurrency limits, not bandwidth or compute. With sufficient concurrency, the wall-clock time for 15GB approaches the time one worker needs to download and re-upload its own small slice, plus orchestration overhead.