This post was updated 20 Sept 2022 to improve reliability with large numbers of files.
- Update the stream handling so streams are only opened to S3 when the file is ready to be processed by the Zip Archiver. This fixes timeouts that could be seen when processing a large number of files.
- Use keep-alive connections with S3 and limit the number of concurrent sockets.
It's a common requirement: packaging files on S3 into a Zip file so a user can download multiple files in a single package. Maybe it's common enough for AWS to offer the functionality themselves one day. Until then, you can write a short script to do it.
If you want to provide this service in a serverless environment such as AWS Lambda you have two main constraints that define the approach you can take.
1 - /tmp is only 512 MB. Your first idea might be to download the files from S3, zip them up, and upload the result. That works fine until you fill up /tmp with the temporary files!
2 - Memory is constrained to 3 GB. You could hold the temporary files on the heap instead, but you hit the same ceiling. Even in a regular server environment you don't want a simple zip function taking 3 GB of RAM!
So what can you do? The answer is to stream the data from S3, through an archiver and back onto S3.
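The principle is plain Node streams: a source Readable is piped through a PassThrough into a destination, so only small chunks are ever in memory at once. Here is a minimal, dependency-free sketch of that shape (no AWS involved; the source and destination here are stand-ins):

```typescript
import { Readable, PassThrough } from 'stream';

// Source: stands in for an S3 object read stream.
const source = Readable.from(['chunk-1 ', 'chunk-2 ', 'chunk-3']);

// Middle: stands in for the archiver; data passes through untouched here.
const pass = new PassThrough();

// Destination: stands in for the S3 upload; we just collect the bytes.
const received: Buffer[] = [];
pass.on('data', (chunk: Buffer) => received.push(chunk));

const done = new Promise<void>((resolve, reject) => {
    pass.on('end', () => resolve());
    pass.on('error', reject);
});

source.pipe(pass);

done.then(() => {
    console.log(Buffer.concat(received).toString()); // → prints "chunk-1 chunk-2 chunk-3"
});
```

The real pipeline below swaps the stand-ins for S3 read streams, the archiver, and an S3 upload, but the memory behaviour is the same: data flows chunk by chunk and is never fully buffered.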
Fortunately this Stack Overflow post and its comments pointed the way and this post is basically a rehash of it!
The code below is TypeScript, but the JavaScript is exactly the same with the types removed.
Start with the imports you need:
import * as Archiver from 'archiver';
import * as AWS from 'aws-sdk';
import * as https from 'https';
import { Readable, Stream } from 'stream';
import * as lazystream from 'lazystream';
First, configure the aws-sdk to use keep-alive connections when communicating with S3, and limit the maximum number of sockets. This improves efficiency and helps avoid hitting an unexpected connection limit. Alternatively, instead of this section you can set AWS_NODEJS_CONNECTION_REUSE_ENABLED in your Lambda environment.
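If you prefer the environment-variable route, you would set it in your function's environment configuration; shown here as a shell export for illustration (the variable name is real aws-sdk v2 behaviour, the export form is just a sketch):

```shell
# Enable HTTP keep-alive in the aws-sdk without code changes.
export AWS_NODEJS_CONNECTION_REUSE_ENABLED=1
```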
// Set the S3 config to use keep-alives
const agent = new https.Agent({ keepAlive: true, maxSockets: 16 });
AWS.config.update({ httpOptions: { agent } });
const s3 = new AWS.S3();
Let's start by creating the streams that fetch the data from S3. To prevent timeouts to S3, the streams are wrapped with lazystream; this delays the actual opening of the stream until the archiver is ready to read the data.
Let's assume you have a list of keys in keys. For each key we need to create a ReadStream. To track the keys and streams, let's create an S3DownloadStreamDetails type. The filename will ultimately be the filename inside the Zip, so you can do any transformation you need at this stage.
type S3DownloadStreamDetails = { stream: Readable; filename: string };
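For example, if your keys carry folder prefixes you might keep only the last path segment as the in-zip filename. A hypothetical helper (keyToFilename is my name, not from the original; adapt to your layout):

```typescript
// Derive the in-zip filename from an S3 key by keeping the last path segment.
// Falls back to the full key if the key ends in '/'.
function keyToFilename(key: string): string {
    const segment = key.split('/').pop();
    return segment && segment.length > 0 ? segment : key;
}

console.log(keyToFilename('uploads/2022/report.pdf')); // → 'report.pdf'
```

You would then use keyToFilename(key) instead of the bare key when building the filename field below.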
Now, for our array of keys, we can map over it to create the S3DownloadStreamDetails objects:
const s3DownloadStreams: S3DownloadStreamDetails[] = keys.map((key: string) => {
    return {
        stream: new lazystream.Readable(() => {
            console.log(`Creating read stream for ${key}`);
            return s3.getObject({ Bucket: 'Bucket Name', Key: key }).createReadStream();
        }),
        filename: key,
    };
});
Now prepare the upload side by creating a Stream.PassThrough object and assigning it as the Body of the params for an S3.PutObjectRequest.
const streamPassThrough = new Stream.PassThrough();
const params: AWS.S3.PutObjectRequest = {
    ACL: 'private',
    Body: streamPassThrough,
    Bucket: 'Bucket Name',
    ContentType: 'application/zip',
    Key: 'The Key on S3',
    StorageClass: 'STANDARD_IA', // Or as appropriate
};
Now we can start the upload process.
const s3Upload = s3.upload(params, (error: Error): void => {
    if (error) {
        console.error(`Got error creating stream to s3 ${error.name} ${error.message} ${error.stack}`);
        throw error;
    }
});
If you want to monitor the upload, for example to give feedback to users, you can attach a handler to httpUploadProgress like this:
s3Upload.on('httpUploadProgress', (progress: { loaded: number; total: number; part: number; key: string }): void => {
console.log(progress); // { loaded: 4915, total: 192915, part: 1, key: 'foo.jpg' }
});
Now create the archiver:
const archive = Archiver('zip');
archive.on('error', (error: Archiver.ArchiverError) => {
    throw new Error(`${error.name} ${error.code} ${error.message} ${error.path} ${error.stack}`);
});
Now we can connect the archiver so it pipes data into the upload stream, and append all the download streams to it. Note that the 'close', 'end', and 'error' handlers belong on the PassThrough stream, not on the upload object:
await new Promise((resolve, reject) => {
    console.log('Starting upload');
    streamPassThrough.on('close', resolve);
    streamPassThrough.on('end', resolve);
    streamPassThrough.on('error', reject);
    archive.pipe(streamPassThrough);
    s3DownloadStreams.forEach((streamDetails: S3DownloadStreamDetails) =>
        archive.append(streamDetails.stream, { name: streamDetails.filename }),
    );
    archive.finalize();
}).catch((error: { code: string; message: string; data: string }) => {
    throw new Error(`${error.code} ${error.message} ${error.data}`);
});
Finally, wait for the uploader to finish:
await s3Upload.promise();
And you're done.
I've tested this with 10 GB+ archives and it works like a charm. I hope this has helped you out.