Mwanza Simi

for AWS Community Builders

Posted on May 21

S3 Files Killed My Least Favorite Lambda Pattern

#aws #s3 #lambda #cloud

Every Lambda function I have written that touches S3 has the same three lines of plumbing:

s3.download_file(bucket, key, "/tmp/input.csv")
process("/tmp/input.csv", "/tmp/output.csv")
s3.upload_file("/tmp/output.csv", bucket, output_key)

Download. Process. Upload. Clean up /tmp. Handle the edge case where /tmp is full from a previous invocation. Handle the edge case where the download fails halfway. Handle the edge case where you exceed the ephemeral storage limit (10 GB at most) because someone uploaded a file larger than you expected.

S3 Files makes all of that go away. You mount the bucket at /mnt/workspace and use open(). The file is right there. You write the output. It syncs to S3.

The Problem It Solves

Lambda functions that process files from S3 have always followed the same ritual:

Download the object from S3 to /tmp
Process it with whatever tool expects a file path
Upload the result back to S3
Clean up /tmp so the next invocation doesn't run out of space

This works. It also creates problems.

/tmp is ephemeral. It's capped at 10 GB, and defaults to 512 MB unless you configure more. It's not shared between invocations on different execution environments. If your function fails halfway through processing, the retry repeats the entire download. If multiple functions need the same reference file, each one downloads its own copy.
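To make the cost of that ritual concrete, here is a minimal sketch of what a defensive version of the old pattern tends to look like. The function name `process_via_tmp`, the `expected_bytes` parameter, and the "twice the input size" free-space rule are illustrative assumptions, not anything from the AWS docs. `s3` is expected to be a boto3 S3 client (or anything with the same `download_file` / `upload_file` methods).

```python
import shutil
import tempfile
from pathlib import Path


def process_via_tmp(s3, bucket, key, output_key, process, expected_bytes, tmp_root="/tmp"):
    """Download, process, and upload one object, cleaning up scratch files even on failure."""
    # Refuse early if the scratch volume cannot hold the input plus an output of
    # similar size (the 2x factor is an assumption for this sketch).
    free = shutil.disk_usage(tmp_root).free
    if free < 2 * expected_bytes:
        raise RuntimeError(f"not enough space in {tmp_root}: {free} bytes free")

    # TemporaryDirectory deletes everything inside it when the block exits,
    # including when the download, the processing step, or the upload raises.
    with tempfile.TemporaryDirectory(dir=tmp_root) as scratch:
        src = Path(scratch) / "input"
        dst = Path(scratch) / "output"
        s3.download_file(bucket, key, str(src))
        process(src, dst)
        s3.upload_file(str(dst), bucket, output_key)
    return output_key
```

None of this is business logic. It exists only because the object has to become a local file before anything can read it.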

For a single CSV transform, the overhead is tolerable. For a pipeline that processes PDFs, images, video, or runs tools like ffmpeg, imagemagick, trivy, or semgrep, the download-process-upload loop becomes the majority of your code and the majority of your execution time.

S3 Files eliminates the loop. Your function mounts the bucket and reads files directly.

How It Works

S3 Files is a managed NFS v4.1+ file system built on Amazon EFS that presents your S3 bucket as a directory tree. When you mount it on a Lambda function, the function sees files and directories at /mnt/your-path. Under the hood, the data still lives in S3.

The architecture uses what AWS calls "stage and commit":

Your function reads and writes files through the NFS mount
An EFS caching layer stores actively accessed data for low-latency access (~1ms)
Changes written through the mount are exported back to S3 within minutes
Changes made directly through the S3 API appear in the file system within seconds (sometimes longer)

The two layers are explicitly separate. The file system side gives you NFS close-to-open consistency. The S3 side gives you standard strong consistency. Each preserves its own semantics.

For large sequential reads (1 MiB or larger), S3 Files bypasses the cache entirely and streams data directly from S3 using parallel GET requests. This means ML training data, large CSVs, media files, and Parquet datasets get full S3 throughput without paying the cache premium. Files smaller than 128 KB (configurable) are the ones that get stored on the high-performance layer for low-latency access.

What It Looks Like

The old way:

import boto3
import csv
import os

s3 = boto3.client("s3")

def handler(event, context):
    bucket = event["bucket"]
    key = event["key"]
    output_key = key.replace("incoming/", "processed/")

    s3.download_file(bucket, key, "/tmp/input.csv")

    with open("/tmp/input.csv", newline="") as src, \
         open("/tmp/output.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["email", "account_id"])
        writer.writeheader()
        for row in reader:
            writer.writerow({
                "email": row["email"].strip().lower(),
                "account_id": row["account_id"].strip()
            })

    s3.upload_file("/tmp/output.csv", bucket, output_key)

    os.remove("/tmp/input.csv")
    os.remove("/tmp/output.csv")

    return {"output": output_key}

The new way:

import csv
from pathlib import Path

WORKSPACE = Path("/mnt/workspace")

def handler(event, context):
    source = WORKSPACE / event["relative_input_path"]
    target = WORKSPACE / event["relative_output_path"]
    target.parent.mkdir(parents=True, exist_ok=True)

    with source.open(newline="") as src, target.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["email", "account_id"])
        writer.writeheader()
        for row in reader:
            writer.writerow({
                "email": row["email"].strip().lower(),
                "account_id": row["account_id"].strip()
            })

    return {"output_path": str(target)}

No boto3. No temporary files. No cleanup. The output is written directly to the mounted S3 bucket and syncs back to S3 automatically.
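One thing the mounted version has to handle itself: `WORKSPACE / event["relative_input_path"]` trusts the event. An absolute path or a `..` segment would resolve outside the mount. A small guard like the following is one way to close that gap. The helper name `resolve_in_workspace` is hypothetical, and it relies on `Path.is_relative_to`, which needs Python 3.9 or newer.

```python
from pathlib import Path

WORKSPACE = Path("/mnt/workspace")


def resolve_in_workspace(relative_path, workspace=WORKSPACE):
    """Return workspace/relative_path, rejecting paths that escape the workspace."""
    root = Path(workspace).resolve()
    # resolve() collapses ".." segments and follows symlinks, so the check
    # below sees the location the path really points at.
    candidate = (root / relative_path).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"path escapes workspace: {relative_path!r}")
    return candidate
```

In the handler above, `source = resolve_in_workspace(event["relative_input_path"])` would replace the plain `WORKSPACE / ...` join.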

The real win shows up when the processing step isn't a simple CSV transform. If your function shells out to git, ripgrep, ffmpeg, trivy, or any tool that expects a filesystem path, a mounted workspace is simpler than teaching every tool to speak S3.
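As a rough sketch of what "shelling out" looks like against a mounted path, the helper below runs any command-line tool with the file path appended as the last argument. The name `run_tool` and the timeout default are assumptions for illustration, and the tool binary itself is assumed to be available in the function's layer or container image.

```python
import subprocess
from pathlib import Path


def run_tool(command, target, timeout_seconds=120):
    """Run an external tool against a path and return its stdout as text."""
    target = Path(target)
    if not target.exists():
        raise FileNotFoundError(target)
    # Arguments are passed as a list (no shell), so file names are never
    # interpreted by a shell.
    result = subprocess.run(
        [*command, str(target)],
        capture_output=True,
        text=True,
        timeout=timeout_seconds,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

A call such as `run_tool(["ffprobe", "-v", "error", "-show_format"], "/mnt/workspace/incoming/clip.mp4")` would then work without any download step, assuming that file exists on the mount. Some scanners use non-zero exit codes to signal findings rather than failure, so `check=True` may need relaxing for those tools.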

The Setup

S3 Files on Lambda requires more infrastructure than a plain S3 trigger. Here's what you need:

Requirements:

An S3 file system created on a general purpose bucket (S3 versioning must be enabled)
Mount targets in the same VPC and Availability Zones as your Lambda function
Security groups allowing NFS traffic on port 2049
Lambda function connected to the VPC
Execution role with s3files:ClientMount (and s3files:ClientWrite for write access)
s3:GetObject and s3:GetObjectVersion for direct read optimization
Function memory set to 512 MB or higher (required for direct reads from S3)

The SAM template:

Resources:
  ProcessingFunction:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: FileProcessorFunction
      CodeUri: ./src
      Handler: index.handler
      Runtime: python3.13
      MemorySize: 512
      Timeout: 300
      VpcConfig:
        SecurityGroupIds:
          - !Ref LambdaSecurityGroup
        SubnetIds:
          - !Ref PrivateSubnet1
          - !Ref PrivateSubnet2
      FileSystemConfigs:
        - Arn: !GetAtt S3FilesAccessPoint.Arn
          LocalMountPath: /mnt/workspace
      Policies:
        - Statement:
            - Effect: Allow
              Action:
                - s3files:ClientMount
                - s3files:ClientWrite
              Resource: "*"
            - Effect: Allow
              Action:
                - s3:GetObject
                - s3:GetObjectVersion
              Resource: !Sub "arn:aws:s3:::${BucketName}/*"

The VPC requirement is the biggest change from a standard Lambda + S3 setup. If your function isn't already in a VPC, you need to add subnets, security groups, and NAT gateways (if the function also needs internet access). That's not trivial for existing deployments.

When to Use S3 Files vs. the Old Pattern

Use S3 Files when:

Your function processes files with tools that expect filesystem paths (ffmpeg, imagemagick, PDF libraries, git, security scanners)
Multiple Lambda functions need shared access to the same working directory
You are tired of managing /tmp size limits and cleanup logic
The function reads large reference datasets that don't change between invocations
You want to eliminate the download-process-upload ceremony

Keep using GetObject + /tmp when:

Your function reads one object, transforms it in memory, and writes one object back
The function is a simple event handler that processes JSON payloads
You need the lowest possible cold start latency (VPC adds ~100-200ms)
Your function doesn't need filesystem semantics at all
The workload is latency-sensitive and can't tolerate the VPC mount dependency

The mental model is straightforward. If your code has download_file followed by upload_file and the processing step uses file paths, S3 Files removes that plumbing. If your code streams objects through memory without touching the filesystem, S3 Files adds complexity for no benefit.

What You Pay

S3 Files pricing has three layers on top of your existing S3 storage costs:

Component	Rate	What triggers it
S3 Standard storage	~$0.023/GB-month	All data in the bucket (unchanged)
High-performance cache	~$0.30/GB-month	Only actively cached data, not the whole bucket
Data access (reads)	~$0.03/GB	Small file reads from cache
Data access (writes)	~$0.06/GB	Writes through the mount

The critical detail: large sequential reads (1 MiB+) bypass the cache entirely and cost only standard S3 GET request pricing. No S3 Files surcharge.

Practical example: You have a 1 TB bucket. Your Lambda functions actively work with 50 GB of files through the mount. Most reads are large Parquet files.

Component	Cost
S3 storage (1 TB)	$23.55
Cache storage (50 GB active)	$15.00
Data access (small file reads)	~$0.50
Data access (writes, 20 GB)	~$1.20
Total	~$40/month

The same 1 TB on EFS would cost ~$300/month. S3 Files costs a fraction because you only pay the cache premium on the active working set, not the entire dataset.
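The arithmetic behind that table can be reproduced with a few lines of Python. The rates are the approximate ones quoted above. The article gives ~$0.50 for small-file reads without stating a volume, so the 17 GB used here is an assumption backed out from the $0.03/GB rate. 1 TB is taken as 1,024 GB.

```python
# Approximate rates quoted in this article, in USD.
S3_STORAGE_PER_GB_MONTH = 0.023
CACHE_PER_GB_MONTH = 0.30
READ_PER_GB = 0.03
WRITE_PER_GB = 0.06
EFS_PER_GB_MONTH = 0.30


def s3_files_monthly_cost(total_gb, cached_gb, small_read_gb, write_gb):
    """Estimate a monthly bill by component; large direct reads are left out
    because they are charged as ordinary S3 GET requests."""
    return {
        "s3_storage": total_gb * S3_STORAGE_PER_GB_MONTH,
        "cache_storage": cached_gb * CACHE_PER_GB_MONTH,
        "small_reads": small_read_gb * READ_PER_GB,
        "writes": write_gb * WRITE_PER_GB,
    }


def efs_monthly_cost(total_gb):
    """Cost of keeping the whole dataset on EFS at the rate quoted above."""
    return total_gb * EFS_PER_GB_MONTH


# small_read_gb=17 is an assumed volume, not a figure from the article.
estimate = s3_files_monthly_cost(total_gb=1024, cached_gb=50, small_read_gb=17, write_gb=20)
total = sum(estimate.values())
print(f"S3 Files: ${total:.2f}/month, EFS: ${efs_monthly_cost(1024):.2f}/month")
```

Changing `cached_gb` is the quickest way to see how sensitive the bill is to the size of the active working set.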

Small operations have metering minimums. Data access operations are metered at a minimum size (reported as 32 KB in early testing). Reading a 1-byte config file gets metered for more than 1 byte. For workloads with millions of tiny metadata-heavy operations, those minimums add up.
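A quick way to see how the minimum adds up is to model it. This sketch assumes the minimum works as a simple floor per operation (anything smaller than 32 KB is billed as 32 KB), that 32 KB means 32 × 1,024 bytes, and that a GB is 1,024³ bytes. The exact metering rules may differ, so treat the numbers as an order-of-magnitude estimate.

```python
MIN_METERED_BYTES = 32 * 1024  # assumed from "32 KB in early testing"
READ_PER_GB = 0.03             # approximate read rate quoted above, USD per GB


def metered_bytes(actual_bytes, minimum=MIN_METERED_BYTES):
    """Bytes billed for one operation when small operations are raised to a minimum."""
    return max(actual_bytes, minimum)


def read_cost(operation_sizes, rate_per_gb=READ_PER_GB):
    """Estimated cost in USD of a series of reads, given their sizes in bytes."""
    billed = sum(metered_bytes(size) for size in operation_sizes)
    return billed / 1024**3 * rate_per_gb


# One million 1-byte reads: about 1 MB of real data, billed as roughly 30.5 GB.
print(f"${read_cost([1] * 1_000_000):.2f}")
```

Under these assumptions, a million tiny reads cost about $0.92 instead of a fraction of a cent.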

Things to Know Before You Build

The sync window isn't instant. Changes written through the mount are exported back to S3 within minutes. Changes made directly in S3 appear in the file system within seconds, but can take a minute or longer. If your downstream system polls S3 for new objects, account for this lag. There's no manual flush API.
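Since there is no flush call, a downstream consumer that needs the exported object has to wait for it. Below is a minimal polling sketch. The name `wait_until` and its defaults are assumptions, and the `sleep` and `clock` parameters exist only so the loop can be tested without real delays.

```python
import time


def wait_until(predicate, timeout_seconds=300.0, interval_seconds=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call predicate() until it returns a truthy value or the timeout elapses.

    Returns True if the predicate succeeded, False if the deadline passed first.
    """
    deadline = clock() + timeout_seconds
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_seconds)
```

The predicate would typically wrap an S3 existence check for the expected key. If boto3 is already in use, its built-in waiter does the same job: `s3.get_waiter("object_exists").wait(Bucket=..., Key=...)`.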

Renames are expensive. S3 has no native rename. Renaming a file through the mount means copy + delete at the S3 layer. For a single file, fine. For a directory with 50,000 files, that's 50,000 copy-and-delete operations. Write final output paths directly. Don't use directory renames as workflow commits.

S3 versioning is required. You can't create an S3 file system on a bucket without versioning enabled. This increases storage costs from additional versions.

Glacier storage classes are incompatible. S3 Standard, Intelligent-Tiering, and Infrequent Access all work. Glacier does not.

No hard links. Symbolic links work. If your tool relies on hard links (some build systems and package managers do), it will break.
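If you plan to copy an existing directory tree (a build cache, a package store) onto the mount, it may be worth checking first whether it contains hard links. This is a small audit sketch to run against the current location of that tree. The function name is hypothetical.

```python
import os
import stat
from pathlib import Path


def find_hard_linked_files(root):
    """Return regular files under root that have more than one hard link."""
    linked = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            info = path.lstat()  # lstat so symbolic links are not followed
            if stat.S_ISREG(info.st_mode) and info.st_nlink > 1:
                linked.append(path)
    return sorted(linked)
```

An empty result means the tree itself has no hard links. It says nothing about whether a tool will try to create them at runtime.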

1,024-byte key length limit. Deeply nested directories with long filenames can hit this ceiling. Measure your path lengths before committing to a directory structure.
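Measuring is straightforward. The sketch below walks a local directory tree and reports files whose resulting S3 key would exceed the limit, counting UTF-8 bytes rather than characters. The function name and the optional `prefix` parameter (for cases where the tree will live under a key prefix) are assumptions for illustration.

```python
import os
from pathlib import Path

MAX_KEY_BYTES = 1024


def find_long_keys(root, prefix="", limit=MAX_KEY_BYTES):
    """Return (key, byte_length) pairs for files whose S3 key would exceed limit bytes."""
    root = Path(root)
    too_long = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            relative = (Path(dirpath) / name).relative_to(root).as_posix()
            key = f"{prefix}{relative}"
            length = len(key.encode("utf-8"))  # the limit is in bytes, not characters
            if length > limit:
                too_long.append((key, length))
    return sorted(too_long)
```

Non-ASCII file names use more than one byte per character, so a path that looks short can still be over the limit.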

Conflicts: S3 wins. If the same file is modified through both the mount and the S3 API simultaneously, the S3 version is treated as the source of truth. The file system version goes to a lost+found directory. Pick one writer per path.

Custom S3 metadata isn't visible. If your application sets x-amz-meta headers through the S3 API, those values don't appear as extended attributes on mounted files. POSIX attributes only.

Cache expiration defaults to 30 days. Data stays in the high-performance layer for 30 days after last access. For batch workloads that touch files once, drop this to 1-2 days to reduce cache storage costs.

S3 Files vs. EFS vs. Mountpoint

Lambda already supported EFS mounts, and Mountpoint for Amazon S3 exists for read-heavy workloads. Here's when each makes sense:

If you need...	Use this	Why
File paths backed by S3 data	S3 Files	S3 stays the source of truth, cache only for active data
General shared POSIX storage independent of S3	EFS	Mature, no sync lag, all data is hot
Read-only high-throughput access to S3	Mountpoint for S3	Simpler, no EFS layer, no write support needed
Enterprise NAS features (ONTAP, Windows)	FSx	Protocol-specific workloads

The key difference between S3 Files and EFS: with EFS, you pay ~$0.30/GB-month for everything stored. With S3 Files, you pay ~$0.30/GB-month only for the active working set and ~$0.023/GB-month for everything else in S3. The cost advantage grows as total data increases relative to the active subset.

The key difference between S3 Files and Mountpoint: Mountpoint is a FUSE client with limited write support and no managed caching layer (any caching is optional and configured on the client). S3 Files gives you full read-write NFS semantics with a managed cache. If you only need to read large files from S3, Mountpoint is simpler and cheaper.

A Practical Example: Image Processing Pipeline

A common Lambda pattern: S3 trigger fires when an image is uploaded, function generates thumbnails and optimized versions.

The old way requires downloading the source image, processing it with Pillow or ImageMagick, writing multiple outputs to /tmp, then uploading each one back to S3:

import boto3
from PIL import Image
from urllib.parse import unquote_plus
import os

s3 = boto3.client("s3")
SIZES = {"thumb": (150, 150), "medium": (800, 600), "large": (1920, 1080)}

def handler(event, context):
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    # Object keys in S3 event notifications are URL-encoded (a space arrives as "+").
    key = unquote_plus(event["Records"][0]["s3"]["object"]["key"])

    s3.download_file(bucket, key, "/tmp/source.jpg")
    img = Image.open("/tmp/source.jpg")

    for name, size in SIZES.items():
        resized = img.copy()
        resized.thumbnail(size)
        output_path = f"/tmp/{name}.jpg"
        resized.save(output_path, "JPEG", quality=85)
        s3.upload_file(output_path, bucket, f"processed/{name}/{key}")
        os.remove(output_path)

    os.remove("/tmp/source.jpg")
    return {"processed": list(SIZES.keys())}

With S3 Files:

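from PIL import Image
from pathlib import Path
from urllib.parse import unquote_plus

WORKSPACE = Path("/mnt/workspace")
SIZES = {"thumb": (150, 150), "medium": (800, 600), "large": (1920, 1080)}

def handler(event, context):
    # Object keys in S3 event notifications are URL-encoded (a space arrives as "+").
    key = unquote_plus(event["Records"][0]["s3"]["object"]["key"])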
    source = WORKSPACE / key

    img = Image.open(source)

    for name, size in SIZES.items():
        output = WORKSPACE / "processed" / name / key
        output.parent.mkdir(parents=True, exist_ok=True)
        resized = img.copy()
        resized.thumbnail(size)
        resized.save(output, "JPEG", quality=85)

    return {"processed": list(SIZES.keys())}

Less code. No boto3 import. No temporary file management. The source image is read directly from the mount. The outputs are written directly to the mount and sync to S3 within minutes.

Multi-Function Shared Workspace

The pattern that makes S3 Files most interesting isn't single-function file processing. It's multiple functions sharing a workspace.

Before S3 Files, if three Lambda functions needed to collaborate on the same set of files, each one had to download from S3, do its work, upload results, and the next function would download those results. With S3 Files, they all mount the same bucket and read each other's output directly.

Function A (security scan)
  reads  /mnt/workspace/repo/
  writes /mnt/workspace/reports/security.json

Function B (test analysis)
  reads  /mnt/workspace/repo/
  writes /mnt/workspace/reports/tests.json

Function C (merge reports)
  reads  /mnt/workspace/reports/*.json
  writes /mnt/workspace/final/summary.md

No intermediate S3 uploads between steps. No coordination logic to pass object keys between functions. The workspace is the coordination mechanism.

The rule for shared workspaces: one writer per file path. Don't have two functions writing to the same file. Use worker-specific output paths and let the orchestrator merge.
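As an illustration of the merge step, here is a minimal sketch of what Function C could look like. It assumes each report is a flat JSON object and that the summary is a simple Markdown list. Both are assumptions made for the example, as is the function name `merge_reports`.

```python
import json
from pathlib import Path


def merge_reports(workspace):
    """Combine every reports/*.json file into final/summary.md and return its path."""
    workspace = Path(workspace)
    lines = ["# Summary", ""]
    # sorted() keeps the output stable regardless of directory listing order.
    for report_path in sorted((workspace / "reports").glob("*.json")):
        report = json.loads(report_path.read_text(encoding="utf-8"))
        lines.append(f"## {report_path.stem}")
        for key, value in report.items():
            lines.append(f"- {key}: {value}")
        lines.append("")
    # Only this function writes under final/, which keeps one writer per path.
    summary = workspace / "final" / "summary.md"
    summary.parent.mkdir(parents=True, exist_ok=True)
    summary.write_text("\n".join(lines), encoding="utf-8")
    return summary
```

In a handler this would be called as `merge_reports("/mnt/workspace")`. Because the mount offers close-to-open consistency, the merge should run only after the writers have closed their report files, for example as a later step in the same orchestration.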

Who This Is For

S3 Files is for Lambda functions that have been pretending S3 objects are files. If your code downloads an object, gives it a file path, processes it with a tool that expects a file, and uploads the result, S3 Files removes the pretending.

The strongest use cases:

Media processing. Image resizing, video transcoding, audio conversion. These tools all expect file paths.

Document processing. PDF extraction, Office document conversion, OCR pipelines. Libraries like pdfplumber, python-docx, and Tesseract work with files.

Code analysis. Security scanners, linters, dependency checkers. Tools like trivy, semgrep, bandit, and eslint expect a directory to scan.

ML inference with reference data. Models that load large reference files (embeddings, lookup tables, feature stores) benefit from the shared mount. Load once, use across invocations.

AI agent workspaces. Agents that use filesystem tools (cat, grep, ls, find) can work directly on S3 data without custom S3 API wrappers.

The weakest use cases: simple JSON transforms, single-object streaming, anything that never touches the filesystem.

Getting Started

# Create the file system on your bucket (verify exact CLI syntax against docs)
aws s3api create-file-system --bucket my-bucket --file-system-name my-workspace

# Create mount targets in your VPC subnets
aws s3api create-mount-target \
  --file-system-id fs-abc123 \
  --subnet-id subnet-xyz \
  --security-groups sg-nfs-access

# Attach to your Lambda function via console or IaC
# Configuration > File systems > Add file system > S3 Files

The Lambda S3 Files documentation covers the full setup. The S3 Files user guide covers file system creation, access points, and synchronization configuration.

Should You Migrate?

For 20 years, the answer to "can I mount S3?" was no. Now it's yes, and the implementation is good enough for production Lambda workloads.

The download-process-upload pattern isn't gone from every codebase. It still makes sense for simple object transforms. But for file-heavy Lambda functions that spend more lines on S3 plumbing than on actual processing logic, S3 Files is a real simplification.

Mount the bucket. Read the file. Write the output.

DEV Community