DynamoDB has a 400KB item limit. That's fine for most data. But sometimes you need to store something bigger - a PDF, an image, a JSON blob that grew too large.
The usual solution? Store the file in S3, save the metadata in DynamoDB. It's a common pattern. But it takes work.
The manual way
Here's what you normally do:
- Upload the file to S3
- Get the bucket and key back
- Save those in DynamoDB
- When reading, fetch the S3 metadata
- Download from S3 if you need the content
- When deleting, remove from both places
- Handle errors if one succeeds and the other fails
That's a lot of code for something that should be simple. And you have to do it every time.
# The manual way - lots of code
import boto3
s3 = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('documents')
# Upload to S3
s3.put_object(
    Bucket='my-bucket',
    Key=f'documents/{doc_id}/file.pdf',
    Body=file_content,
    ContentType='application/pdf'
)
# Save metadata to DynamoDB
table.put_item(Item={
    'pk': f'DOC#{doc_id}',
    'sk': 'METADATA',
    's3_bucket': 'my-bucket',
    's3_key': f'documents/{doc_id}/file.pdf',
    'content_type': 'application/pdf',
    'size': len(file_content),
})
# Later, to read...
response = table.get_item(Key={'pk': f'DOC#{doc_id}', 'sk': 'METADATA'})
item = response['Item']
s3_response = s3.get_object(Bucket=item['s3_bucket'], Key=item['s3_key'])
content = s3_response['Body'].read()
# And to delete...
table.delete_item(Key={'pk': f'DOC#{doc_id}', 'sk': 'METADATA'})
s3.delete_object(Bucket='my-bucket', Key=f'documents/{doc_id}/file.pdf')
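And that still doesn't cover the last item on the list above: what happens when one write succeeds and the other fails. Here's a rough sketch of the kind of rollback you end up adding, reusing the bucket, key, and table names from the example (this is my own illustration, not part of any library):
from botocore.exceptions import ClientError
key = f'documents/{doc_id}/file.pdf'
s3.put_object(Bucket='my-bucket', Key=key, Body=file_content, ContentType='application/pdf')
try:
    table.put_item(Item={
        'pk': f'DOC#{doc_id}',
        'sk': 'METADATA',
        's3_bucket': 'my-bucket',
        's3_key': key,
        'content_type': 'application/pdf',
        'size': len(file_content),
    })
except ClientError:
    # The metadata write failed - remove the uploaded object so S3 doesn't keep an orphan
    s3.delete_object(Bucket='my-bucket', Key=key)
    raise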
It works. But you end up writing this boilerplate for every model that stores files. And in pure Python, all that serialization and network handling adds up - especially in Lambda, where you pay for every millisecond.
A better way
pydynox is a DynamoDB library with a Rust core. If you haven't heard of it, check out my intro post.
It has an S3Attribute that handles all of this. You define it once, and the library takes care of uploads, downloads, and cleanup.
from pydynox import Model, ModelConfig
from pydynox.attributes import StringAttribute, S3Attribute
class Document(Model):
    model_config = ModelConfig(table="documents")

    pk = StringAttribute(hash_key=True)
    sk = StringAttribute(range_key=True)
    title = StringAttribute()
    file = S3Attribute(bucket="my-bucket")
Now saving a file is one line:
from pydynox import S3File
doc = Document(
    pk="DOC#123",
    sk="METADATA",
    title="Contract",
    file=S3File(data=pdf_bytes, content_type="application/pdf"),
)
doc.save() # Uploads to S3, saves metadata to DynamoDB
Reading is just as simple:
doc = Document.get(pk="DOC#123", sk="METADATA")
# Get the S3 metadata
print(doc.file.bucket) # my-bucket
print(doc.file.key) # documents/DOC#123/file
print(doc.file.size) # 1234567
# Download the content when you need it
content = doc.file.download()
# Or get a presigned URL for direct access
url = doc.file.presigned_url(expires_in=3600)
Deleting cleans up both places:
doc.delete() # Removes from DynamoDB AND S3
How it works
When you call save():
- pydynox uploads the file to S3
- Stores the S3 metadata (bucket, key, size, etag) in DynamoDB
- If the upload fails, nothing is saved
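I won't pin down the exact exception pydynox raises on a failed upload, so treat this as a sketch of what that guarantee means in practice - if save() blows up, there's no half-written record to clean up:
doc = Document(
    pk="DOC#123",
    sk="METADATA",
    title="Contract",
    file=S3File(data=pdf_bytes, content_type="application/pdf"),
)
try:
    doc.save()
except Exception:
    # Upload failed before the DynamoDB write, so no item exists for DOC#123.
    # Log and retry as needed - there's nothing to roll back.
    raise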
When you call delete():
- Deletes from DynamoDB first
- Then deletes from S3
- If S3 delete fails, the DynamoDB record is already gone (orphaned S3 objects can be cleaned up with lifecycle rules)
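If you'd rather not wait for lifecycle rules, a small sweep with plain boto3 also works. This is only a sketch - it assumes the bucket, table, and key layout from the examples above (documents/{pk}/{filename}):
import boto3
s3 = boto3.client('s3')
table = boto3.resource('dynamodb').Table('documents')
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-bucket', Prefix='documents/'):
    for obj in page.get('Contents', []):
        pk = obj['Key'].split('/')[1]  # e.g. DOC#123
        item = table.get_item(Key={'pk': pk, 'sk': 'METADATA'}).get('Item')
        if item is None:
            # No DynamoDB record points at this object - it's an orphan
            s3.delete_object(Bucket='my-bucket', Key=obj['Key'])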
The S3 key is built from the partition key + the filename you pass in S3File. You can also set a prefix:
# Key will be: uploads/documents/{pk}/{sk}/report.pdf
file = S3Attribute(bucket="my-bucket", prefix="uploads/documents/")
# When saving:
doc.file = S3File(data=pdf_bytes, name="report.pdf")
Async works too
If you're using async:
doc = Document(
    pk="DOC#123",
    sk="METADATA",
    title="Contract",
    file=S3File(data=pdf_bytes),
)
await doc.async_save()
# Later
doc = await Document.async_get(pk="DOC#123", sk="METADATA")
content = await doc.file.async_download()
When to use this
Use S3Attribute when:
- Your data might exceed 400KB
- You're storing files (PDFs, images, JSON blobs)
- You want presigned URLs for direct downloads
- You don't want to manage S3 uploads manually
Don't use it when:
- Your data is always small (just use a regular attribute)
- You need complex S3 features (versioning, lifecycle rules on specific objects)
The pattern is common
This isn't a new idea. You've always been able to do this with boto3 - upload to S3, save the metadata in DynamoDB. The difference is that the library does it automatically instead of you writing the same code over and over.
One attribute. One line to save. One line to delete. The library handles the rest.
Links
- pydynox: https://github.com/leandrodamascena/pydynox
- S3Attribute docs: https://leandrodamascena.github.io/pydynox/guides/s3-attribute/
- Install:
pip install pydynox