This post was updated 20 Sept 2022 to improve reliability with large numbers of files.
- Update the stream handling so streams are only opened to S3 when the file is ready to be processed by the Zip Archiver. This fixes timeouts that could be seen when processing a large number of files.
- Use keep-alive connections with S3 and limit the number of concurrent sockets.
It's a common requirement: packaging files on S3 into a Zip file so a user can download multiple files in a single package. Maybe it's common enough for AWS to offer the functionality themselves one day. Until then, you can write a short script to do it.
If you want to provide this service in a serverless environment such as AWS Lambda you have two main constraints that define the approach you can take.
1 - /tmp is only 512 MB. Your first idea might be to download the files from S3, zip them up, and upload the result. That works fine until you fill up /tmp with the temporary files!
2 - Memory is constrained to 3 GB. You could hold the temporary files on the heap instead, but you hit the same ceiling. Even in a regular server environment you don't want a simple zip function taking 3 GB of RAM!
So what can you do? The answer is to stream the data from S3, through an archiver and back onto S3.
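The principle is plain Node streams: a source Readable is piped through a PassThrough into a destination, so only small chunks are ever in memory at once. Here is a minimal, dependency-free sketch of that shape (no AWS involved; the source and destination here are stand-ins):

```typescript
import { Readable, PassThrough } from 'stream';

// Source: stands in for an S3 object read stream.
const source = Readable.from(['chunk-1 ', 'chunk-2 ', 'chunk-3']);

// Middle: stands in for the archiver; data passes through untouched here.
const pass = new PassThrough();

// Destination: stands in for the S3 upload; we just collect the bytes.
const received: Buffer[] = [];
pass.on('data', (chunk: Buffer) => received.push(chunk));

const done = new Promise<void>((resolve, reject) => {
    pass.on('end', () => resolve());
    pass.on('error', reject);
});

source.pipe(pass);

done.then(() => {
    console.log(Buffer.concat(received).toString()); // → prints "chunk-1 chunk-2 chunk-3"
});
```

The real pipeline below swaps the stand-ins for S3 read streams, the archiver, and an S3 upload, but the memory behaviour is the same: data flows chunk by chunk and is never fully buffered.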
Fortunately this Stack Overflow post and its comments pointed the way and this post is basically a rehash of it!
The code below is TypeScript, but the JavaScript is exactly the same with the types removed.
Start with the imports you need:
import * as Archiver from 'archiver';
import * as AWS from 'aws-sdk';
import * as https from 'https';
import { Readable, Stream } from 'stream';
import * as lazystream from 'lazystream';
First, configure the aws-sdk to use keep-alive connections when communicating with S3, and limit the maximum number of sockets. This improves efficiency and helps avoid hitting an unexpected connection limit. Alternatively, instead of this section you can set AWS_NODEJS_CONNECTION_REUSE_ENABLED in your Lambda environment.
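If you prefer the environment-variable route, you would set it in your function's environment configuration; shown here as a shell export for illustration (the variable name is real aws-sdk v2 behaviour, the export form is just a sketch):

```shell
# Enable HTTP keep-alive in the aws-sdk without code changes.
export AWS_NODEJS_CONNECTION_REUSE_ENABLED=1
```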
// Set the S3 config to use keep-alives
const agent = new https.Agent({ keepAlive: true, maxSockets: 16 });
AWS.config.update({ httpOptions: { agent } });
const s3 = new AWS.S3();
Let's start by creating the streams that fetch the data from S3. To prevent timeouts to S3, the streams are wrapped with lazystream; this delays the actual opening of the stream until the archiver is ready to read the data.
Let's assume you have a list of keys in keys. For each key we need to create a ReadStream. To track the keys and streams, let's create an S3DownloadStreamDetails type. The filename will ultimately be the filename inside the Zip, so you can do any transformation you need at this stage.
type S3DownloadStreamDetails = { stream: Readable; filename: string };
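For example, if your keys carry folder prefixes you might keep only the last path segment as the in-zip filename. A hypothetical helper (keyToFilename is my name, not from the original; adapt to your layout):

```typescript
// Derive the in-zip filename from an S3 key by keeping the last path segment.
// Falls back to the full key if the key ends in '/'.
function keyToFilename(key: string): string {
    const segment = key.split('/').pop();
    return segment && segment.length > 0 ? segment : key;
}

console.log(keyToFilename('uploads/2022/report.pdf')); // → 'report.pdf'
```

You would then use keyToFilename(key) instead of the bare key when building the filename field below.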
Now, for our array of keys, we can map over it to create the S3DownloadStreamDetails objects:
const s3DownloadStreams: S3DownloadStreamDetails[] = keys.map((key: string) => {
    return {
        stream: new lazystream.Readable(() => {
            console.log(`Creating read stream for ${key}`);
            return s3.getObject({ Bucket: 'Bucket Name', Key: key }).createReadStream();
        }),
        filename: key,
    };
});
Now prepare the upload side by creating a Stream.PassThrough object and assigning it as the Body of the params for an S3.PutObjectRequest.
const streamPassThrough = new Stream.PassThrough();
const params: AWS.S3.PutObjectRequest = {
    ACL: 'private',
    Body: streamPassThrough,
    Bucket: 'Bucket Name',
    ContentType: 'application/zip',
    Key: 'The Key on S3',
    StorageClass: 'STANDARD_IA', // Or as appropriate
};
Now we can start the upload process.
const s3Upload = s3.upload(params, (error: Error): void => {
    if (error) {
        console.error(`Got error creating stream to s3 ${error.name} ${error.message} ${error.stack}`);
        throw error;
    }
});
If you want to monitor the upload, for example to give feedback to users, you can attach a handler to httpUploadProgress like this:
s3Upload.on('httpUploadProgress', (progress: { loaded: number; total: number; part: number; key: string }): void => {
console.log(progress); // { loaded: 4915, total: 192915, part: 1, key: 'foo.jpg' }
});
Now create the archiver:
const archive = Archiver('zip');
archive.on('error', (error: Archiver.ArchiverError) => {
    throw new Error(`${error.name} ${error.code} ${error.message} ${error.path} ${error.stack}`);
});
Now we can connect the archiver so it pipes data into the upload stream, and append all the download streams to it. Note that the 'close', 'end', and 'error' handlers belong on the PassThrough stream, not on the upload object:
await new Promise((resolve, reject) => {
    console.log('Starting upload');
    streamPassThrough.on('close', resolve);
    streamPassThrough.on('end', resolve);
    streamPassThrough.on('error', reject);
    archive.pipe(streamPassThrough);
    s3DownloadStreams.forEach((streamDetails: S3DownloadStreamDetails) =>
        archive.append(streamDetails.stream, { name: streamDetails.filename }),
    );
    archive.finalize();
}).catch((error: { code: string; message: string; data: string }) => {
    throw new Error(`${error.code} ${error.message} ${error.data}`);
});
Finally, wait for the uploader to finish:
await s3Upload.promise();
And you're done.
I've tested this with 10 GB+ archives and it works like a charm. I hope this has helped you out.