Leveraging Ephemeral Storage in AWS Lambda

#node #aws #lambda

What is Ephemeral Storage in AWS Lambda?

Ephemeral storage in AWS Lambda is temporary storage provided as a directory (/tmp) on the Lambda file system. This storage is unique to each Lambda execution environment.

You can read, write, and perform any other file operations in this directory. Multiple Lambda invocations can reuse the same execution environment, so even though the storage is temporary, its contents can be shared across those invocations.
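As a minimal sketch of that sharing, the helper below reads a value from /tmp if an earlier invocation in the same execution environment already stored it, and otherwise loads and stores it. The names `CACHE_DIR`, `readThroughTmpCache`, and `clearTmpCache` are hypothetical, and the sketch assumes the cached values are small strings.

```typescript
import { existsSync, mkdirSync, readFileSync, rmSync, writeFileSync } from 'fs';
import { basename, join } from 'path';

// Hypothetical cache location inside ephemeral storage.
const CACHE_DIR = '/tmp/cache';

// Returns the value cached under `name` if this execution environment already
// has it; otherwise calls `load`, stores the result in /tmp, and returns it.
export const readThroughTmpCache = (name: string, load: () => string): string => {
  const filePath = join(CACHE_DIR, basename(name));
  if (existsSync(filePath)) {
    return readFileSync(filePath, 'utf8'); // warm environment: cache hit
  }
  const value = load(); // cold environment (or first use): load the value
  mkdirSync(CACHE_DIR, { recursive: true });
  writeFileSync(filePath, value, 'utf8');
  return value;
};

// Removes everything this helper has written to /tmp.
export const clearTmpCache = (): void => {
  rmSync(CACHE_DIR, { recursive: true, force: true });
};
```

Because a new execution environment starts with an empty /tmp, a cache like this is only an optimization; the `load` callback must always be able to produce the value from scratch.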

By default, every Lambda function comes with 512MB of ephemeral storage. The storage can be extended up to 10,240MB in 1MB increments. The default 512MB comes at no extra cost.
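In the Lambda function configuration, the size is expressed in MB under the `EphemeralStorage.Size` field. The sketch below only validates a requested size against the limits mentioned above and builds that fragment; the helper name `ephemeralStorageConfig` is hypothetical, and how you apply the fragment (console, infrastructure-as-code, or the SDK) depends on your setup.

```typescript
const MIN_EPHEMERAL_MB = 512;
const MAX_EPHEMERAL_MB = 10240;

// Builds the `EphemeralStorage` fragment of a Lambda function configuration
// after checking that the requested size is a whole number of MB in range.
export const ephemeralStorageConfig = (sizeMb: number) => {
  if (!Number.isInteger(sizeMb) || sizeMb < MIN_EPHEMERAL_MB || sizeMb > MAX_EPHEMERAL_MB) {
    throw new RangeError(
      `Ephemeral storage must be a whole number between ${MIN_EPHEMERAL_MB} and ${MAX_EPHEMERAL_MB} MB`,
    );
  }
  return { EphemeralStorage: { Size: sizeMb } };
};
```

With the AWS SDK for JavaScript v3, a fragment like this could be spread into the input of `UpdateFunctionConfigurationCommand` from `@aws-sdk/client-lambda` together with the `FunctionName`.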

Why use Ephemeral Storage?

Well, it's available out of the box in your Lambda instance, so why not use it? 😄
There are several use cases for ephemeral storage in AWS Lambda. In general, any Lambda operation that needs a file system, or that shares temporary state across multiple invocations (caching 👀), can benefit from ephemeral storage.

Use Case: Zip Up S3 Files

Zipping is a common use case in many software applications that deliver bulk files to clients/customers efficiently over the internet. In this article, we will explore a practical example of leveraging ephemeral storage in AWS Lambda to zip S3 files. The example Lambda receives a list of S3 keys as input, zips up the files (leveraging the ephemeral storage), and uploads the zipped output to S3. Below is the source code of the Lambda, written in TypeScript.

import { GetObjectCommand, PutObjectCommand, S3Client } from '@aws-sdk/client-s3';
import { createReadStream, createWriteStream } from 'fs';
import { mkdir, rm } from 'fs/promises';
import path from 'path';
import { Readable } from 'stream';
import archiver from 'archiver';
import { randomUUID } from 'crypto';

const s3Bucket = 'zip-files-test';
const s3Client = new S3Client({ region: process.env.AWS_REGION });

const streamS3ObjectToFile = async (s3Key: string, filePath: string) => {
  const { Body } = await s3Client.send(new GetObjectCommand({
    Bucket: s3Bucket,
    Key: s3Key
  }));
  if (!Body) throw Error(`S3 object not found at: ${s3Key}`);
  const writeStream = createWriteStream(filePath);
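  return new Promise((res, rej) => {
    const body = Body as Readable;
    // pipe() does not forward errors from the source stream, so handle them here
    body.on('error', (error) => rej(error));
    body
      .pipe(writeStream)
      .on('error', (error) => rej(error))
      .on('close', () => res('ok'));
  })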
}

const archiveFiles = (filePaths: string[], outputFilePath: string) => {
  return new Promise((res, rej) => {
    const output = createWriteStream(outputFilePath);
    output.on('close', () => {
      console.log(archive.pointer() + ' total bytes');
      res('ok');
    });
    const archive = archiver('zip', { zlib: { level: 9 } });
    archive.on('error', (err) => rej(err));
    archive.pipe(output);
    filePaths.forEach(filePath => archive.file(filePath, { name: path.basename(filePath) }));
    archive.finalize();
  })
}

export const handler = async (event: { inputS3Keys: string[]; outputS3Key: string; }) => {
  const { inputS3Keys, outputS3Key } = event;
  // Basic validation of event data
  if (!Array.isArray(inputS3Keys) || typeof outputS3Key !== 'string') {
    throw Error('Provide list of s3 keys');
  }

  // create a sub-directory in ephemeral storage(/tmp)
  const tmpFolder = `/tmp/${randomUUID()}`;
  await mkdir(tmpFolder);

  try {
    // Stream S3 files to tmp storage
    const tmpFiles: string[] = [];
    const streamFilesAsynchronously = inputS3Keys.map(async (s3Key) => {
      const fileName = path.basename(s3Key);
      const filePath = `${tmpFolder}/${fileName}`;
      await streamS3ObjectToFile(s3Key, filePath);
      tmpFiles.push(filePath);
    });
    await Promise.all(streamFilesAsynchronously);

    // Zip files
    const zipFilePath = `${tmpFolder}/${path.basename(outputS3Key)}`;
    await archiveFiles(tmpFiles, zipFilePath);

    // Upload zip output
    await s3Client.send(new PutObjectCommand({
      Body: createReadStream(zipFilePath),
      Bucket: s3Bucket,
      Key: outputS3Key,
    }));
  } finally {
    // Remove all files written to /tmp, even if a step above failed
    await rm(tmpFolder, { recursive: true, force: true });
  }
  console.log('Done!');
};

In the source code above, there are 3 primary functions:

streamS3ObjectToFile: This function streams the S3 object identified by the s3Key parameter to the file path given by the filePath parameter.
archiveFiles: This function archives the list of files given by the filePaths parameter and writes the resulting zip output to the file given by the outputFilePath parameter.
handler: This is the core function executed when the Lambda is invoked. It extracts the inputs from the event object, calls streamS3ObjectToFile to stream the input files to the Lambda's ephemeral storage, archives the files, stores the archive in ephemeral storage, uploads the zipped file to S3, and then deletes the content written to the /tmp folder.

Testing:

Input files in S3 bucket:
Event object - Lambda invoked from AWS console:
Output zip file uploaded to S3 bucket:

Stream to File (ephemeral storage) vs In-Memory: To minimize memory usage in the Lambda function, I opted to stream all S3 objects to ephemeral storage in the /tmp directory instead of loading them into memory. The archiving step also streams its output to ephemeral storage. The alternative, loading the S3 objects and archiving them in memory using buffers, would have significantly increased the Lambda's memory requirements. For context, streaming to files allowed me to compress a collection of files totaling around 300MB using a Lambda with just 128MB of RAM (the minimum configuration). In contrast, handling the same files in memory would have required at least 300MB of memory just to load them, not to mention the additional memory needed for processing.
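To make the difference concrete, here is a small sketch that contrasts the two approaches on local files. It is not part of the Lambda above; the function names `copyByStreaming` and `copyInMemory` are hypothetical, and a plain file copy stands in for the S3 download.

```typescript
import { createReadStream, createWriteStream } from 'fs';
import { readFile, writeFile } from 'fs/promises';
import { pipeline } from 'stream/promises';

// Streaming: data moves from source to destination chunk by chunk, so memory
// use stays roughly constant regardless of the file size.
export const copyByStreaming = async (source: string, destination: string): Promise<void> => {
  await pipeline(createReadStream(source), createWriteStream(destination));
};

// In-memory: the whole file is loaded into a Buffer before it is written,
// so memory use grows with the file size.
export const copyInMemory = async (source: string, destination: string): Promise<void> => {
  const contents = await readFile(source);
  await writeFile(destination, contents);
};
```

`pipeline` also propagates errors from either stream and closes both, which is one reason to prefer it over a bare `pipe()` call.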

Pro Tip - Clean up /tmp files: While the /tmp folder is temporary, there's no guarantee of when its content will be destroyed. AWS won't auto-delete the content of the ephemeral storage when the Lambda finishes. In fact, the /tmp folder is shared across all invocations that use the same execution environment. For this reason, it's encouraged to clean up whatever you write to the /tmp folder, unless you deliberately want to share the data across multiple invocations, e.g. for caching.
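One way to make the cleanup hard to forget is to wrap the work in a helper that always removes its scratch directory. The sketch below is one possible shape for such a helper, not the only one; `withTmpDir`, `writeScratchFile`, and the `job-` prefix are hypothetical names.

```typescript
import { mkdtemp, rm, writeFile } from 'fs/promises';
import { join } from 'path';

// Creates a unique directory under /tmp, runs `work` with it, and removes the
// directory afterwards, whether `work` succeeds or throws.
export const withTmpDir = async <T>(work: (dir: string) => Promise<T>): Promise<T> => {
  const dir = await mkdtemp(join('/tmp', 'job-'));
  try {
    return await work(dir);
  } finally {
    await rm(dir, { recursive: true, force: true });
  }
};

// Example usage: write a scratch file and return its size in bytes.
// The scratch directory no longer exists once the promise settles.
export const writeScratchFile = (text: string): Promise<number> =>
  withTmpDir(async (dir) => {
    await writeFile(join(dir, 'scratch.txt'), text, 'utf8');
    return Buffer.byteLength(text, 'utf8');
  });
```

The `finally` block is what matters here: a cleanup call placed at the end of the handler is skipped whenever an earlier step throws, leaving files behind for the next invocation.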

Conclusion:

Ephemeral storage is a powerful feature that shouldn't be overlooked in AWS Lambda. I've found it particularly useful for heavy data processing and complex media/graphics processing tasks.

Are you leveraging ephemeral storage for something interesting? Please share in the comment section.