In this blog post, I want to share my approach to downloading large files (>1GB) by scaling horizontally with AWS Step Functions.
For the full code, feel free to visit the GitHub repository.
Visual representation of an AWS Step Functions workflow parallelizing a large file download process.
To follow along, make sure you have AWS SAM installed. You can find the installation guide [here](https://aws.plainenglish.io/develop-aws-lambda-functions-locally-on-your-machine-ccdd37e10092).
Table of contents
· The Idea
· AWS Lambda for File Metadata
· AWS Lambda for Chunk Creation
· AWS Lambda for Downloading Chunks to S3
· AWS Step Function
∘ Step Function Definition
· Error Handling and Optimization
· Conclusion
The Idea
To handle large files effectively, we will use AWS Step Functions to orchestrate a series of Lambda functions that download the file in chunks from an SFTP server and upload them to S3 via a multipart upload.
Here is a high-level overview of how the solution works:
- Step 1: Retrieve metadata of the file from the SFTP server, such as file size.
- Step 2: Split the file into 50MB chunks.
- Step 3: Use a Step Function’s Map state to execute Lambda functions that download each chunk in parallel.
- Step 4: Upload each chunk to S3 as part of a multipart upload.
- Step 5: Once all chunks are uploaded, complete the multipart upload.
AWS Lambda for File Metadata
The first step is to create a Lambda function that connects to the SFTP server and fetches file metadata, such as the file size.
In template.yaml, we define the Lambda like this:
# template.yaml
Resources:
  ...
  GetFileMetadataFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/lambdas/get_file_metadata/
      Handler: app.lambda_handler
      Runtime: python3.12
      Timeout: 30
      Environment:
        Variables:
          SFTP_HOST: !Ref SftpHost
          SFTP_USERNAME: !Ref SftpUsername
          SFTP_PASSWORD: !Ref SftpPassword
      Policies:
        - Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Action:
                - s3:PutObject
                - s3:AbortMultipartUpload
              Resource: !Sub "${ExampleBucket.Arn}/*" # object-level actions need the /* object ARN
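The template only wires up the function; the handler itself lives in src/lambdas/get_file_metadata/app.py. Here is a minimal sketch of what it could look like, assuming paramiko is bundled as a dependency of the function; the return shape is chosen to match what the state machine reads later ($.file_metadata.Payload.body.metadata.size):

# src/lambdas/get_file_metadata/app.py (illustrative sketch, assumes paramiko is packaged)
import os

import paramiko


def lambda_handler(event, context):
    # Connect to the SFTP server with the credentials from the environment
    transport = paramiko.Transport((os.environ["SFTP_HOST"], 22))
    transport.connect(
        username=os.environ["SFTP_USERNAME"],
        password=os.environ["SFTP_PASSWORD"],
    )
    sftp = paramiko.SFTPClient.from_transport(transport)
    try:
        # stat() returns the remote file attributes, including its size in bytes
        attrs = sftp.stat(event["sftp_file_path"])
    finally:
        sftp.close()
        transport.close()

    # The state machine reads $.file_metadata.Payload.body.metadata.size
    return {"body": {"metadata": {"size": attrs.st_size}}}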
AWS Lambda for Chunk Creation
The next step is a Lambda function that splits the file into chunks based on its size and a configurable CHUNK_SIZE.
In template.yaml, we define the Lambda like this:
# template.yaml
Resources:
  ...
  CreateFileChunksFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/lambdas/create_file_chunks/
      Handler: app.lambda_handler
      Runtime: python3.12
      Timeout: 30
      Environment:
        Variables:
          CHUNK_SIZE: 52428800 # 50MB
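Since the chunking is pure arithmetic, the handler can stay tiny. A minimal sketch, assuming the chunk fields (chunk_number, start_byte, end_byte) that the Map state consumes below:

# src/lambdas/create_file_chunks/app.py (illustrative sketch)
import os


def lambda_handler(event, context):
    file_size = event["file_size"]
    chunk_size = int(os.environ["CHUNK_SIZE"])

    chunks = []
    # Multipart upload part numbers start at 1; end_byte is inclusive
    for chunk_number, start_byte in enumerate(range(0, file_size, chunk_size), start=1):
        chunks.append(
            {
                "chunk_number": chunk_number,
                "start_byte": start_byte,
                "end_byte": min(start_byte + chunk_size, file_size) - 1,
            }
        )

    # The Map state iterates over $.chunks.Payload.body.chunks
    return {"body": {"chunks": chunks}}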
AWS Lambda for Downloading Chunks to S3
The next step is a Lambda function that downloads a single chunk and uploads it to S3 as part of an S3 multipart upload.
In template.yaml, we define the Lambda like this:
# template.yaml
Resources:
  ...
  DownloadChunkFunction:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: !Sub ${AWS::StackName}-DownloadChunkFunction
      Timeout: 900
      CodeUri: src/lambdas/download_chunk/
      Handler: app.lambda_handler
      Runtime: python3.12
      Environment:
        Variables:
          SFTP_HOST: !Ref SftpHost
          SFTP_USERNAME: !Ref SftpUsername
          SFTP_PASSWORD: !Ref SftpPassword
      MemorySize: 1024
      Policies:
        - AmazonS3FullAccess
        - AWSLambdaBasicExecutionRole
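This function does the heavy lifting: it reads one byte range over SFTP and pushes it to S3 with upload_part. A hedged sketch of the handler, again assuming paramiko is packaged with the function; the returned body deliberately matches the part format the state machine collects below:

# src/lambdas/download_chunk/app.py (illustrative sketch, assumes paramiko is packaged)
import os

import boto3
import paramiko

s3 = boto3.client("s3")


def lambda_handler(event, context):
    transport = paramiko.Transport((os.environ["SFTP_HOST"], 22))
    transport.connect(
        username=os.environ["SFTP_USERNAME"],
        password=os.environ["SFTP_PASSWORD"],
    )
    sftp = paramiko.SFTPClient.from_transport(transport)
    try:
        with sftp.open(event["sftp_file_path"], "rb") as remote_file:
            # Jump to this chunk's offset and read only its (inclusive) byte range
            remote_file.seek(event["start_byte"])
            data = remote_file.read(event["end_byte"] - event["start_byte"] + 1)
    finally:
        sftp.close()
        transport.close()

    # Upload the byte range as one part of the multipart upload
    response = s3.upload_part(
        Bucket=event["bucket_name"],
        Key=event["s3_file_key"],
        PartNumber=event["chunk_number"],
        UploadId=event["upload_id"],
        Body=data,
    )

    # "Format multipart parts" gathers Payload.body from every iteration,
    # so return exactly the shape CompleteMultipartUpload expects for a part
    return {"body": {"ETag": response["ETag"], "PartNumber": event["chunk_number"]}}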
AWS Step Function
Now it's time to create the Step Function that orchestrates the entire process. We define a state machine that manages each Lambda invocation:
# template.yaml
Resources:
  ...
  SftpDownloadStateMachine:
    Type: AWS::Serverless::StateMachine
    Properties:
      DefinitionUri: src/state_machines/sftp_download_definition.json
      DefinitionSubstitutions:
        CreateFileChunksFunction: !GetAtt CreateFileChunksFunction.Arn
        GetFileMetadataFunction: !GetAtt GetFileMetadataFunction.Arn
        DownloadChunkFunction: !GetAtt DownloadChunkFunction.Arn
      Role: !GetAtt StepFunctionRole.Arn
Step Function Definition
The definition file tells AWS how to orchestrate the entire process: it defines the order of the Lambda invocations, and through the Map state it loops over the chunks in parallel, triggering the download and upload of each one.
# src/state_machines/sftp_download_definition.json
{
  "Comment": "Step function to process zip file",
  "StartAt": "Get File Metadata From SFTP",
  "States": {
    "Get File Metadata From SFTP": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "ResultPath": "$.file_metadata",
      "Parameters": {
        "Payload.$": "$",
        "FunctionName": "${GetFileMetadataFunction}"
      },
      "Next": "Create chunks"
    },
    "Create chunks": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "ResultPath": "$.chunks",
      "Parameters": {
        "Payload": {
          "file_size.$": "$.file_metadata.Payload.body.metadata.size"
        },
        "FunctionName": "${CreateFileChunksFunction}"
      },
      "Next": "Start multipart upload"
    },
    "Start multipart upload": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:s3:createMultipartUpload",
      "ResultPath": "$.multipart_upload",
      "Parameters": {
        "Bucket.$": "$.target_s3_bucket",
        "Key.$": "$.target_s3_key"
      },
      "Next": "Download chunks"
    },
    "Download chunks": {
      "Type": "Map",
      "ItemsPath": "$.chunks.Payload.body.chunks",
      "ResultPath": "$.download_chunk_results",
      "Parameters": {
        "bucket_name.$": "$.target_s3_bucket",
        "s3_file_key.$": "$.target_s3_key",
        "sftp_file_path.$": "$.sftp_file_path",
        "upload_id.$": "$.multipart_upload.UploadId",
        "chunk.$": "$$.Map.Item.Value"
      },
      "Iterator": {
        "StartAt": "Download single chunk from SFTP",
        "States": {
          "Download single chunk from SFTP": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {
              "Payload": {
                "bucket_name.$": "$.bucket_name",
                "s3_file_key.$": "$.s3_file_key",
                "sftp_file_path.$": "$.sftp_file_path",
                "chunk_number.$": "$.chunk.chunk_number",
                "start_byte.$": "$.chunk.start_byte",
                "end_byte.$": "$.chunk.end_byte",
                "upload_id.$": "$.upload_id"
              },
              "FunctionName": "${DownloadChunkFunction}"
            },
            "End": true
          }
        }
      },
      "Next": "Format multipart parts"
    },
    "Format multipart parts": {
      "Type": "Pass",
      "ResultPath": "$.formatted_parts",
      "Parameters": {
        "Parts.$": "$.download_chunk_results[*].Payload.body"
      },
      "Next": "Complete multipart upload"
    },
    "Complete multipart upload": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:s3:completeMultipartUpload",
      "ResultPath": null,
      "Parameters": {
        "Bucket.$": "$.target_s3_bucket",
        "Key.$": "$.target_s3_key",
        "UploadId.$": "$.multipart_upload.UploadId",
        "MultipartUpload": {
          "Parts.$": "$.formatted_parts.Parts"
        }
      },
      "Next": "Fine"
    },
    "Fine": {
      "Type": "Succeed"
    }
  }
}
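Once deployed, an execution only needs the three input fields the definition reads from "$": the SFTP path, the target bucket, and the target key. For example, a run could be started with boto3 like this (the state machine ARN and input values are placeholders):

# start_execution.py (illustrative; ARN and values are placeholders)
import json

import boto3

sfn = boto3.client("stepfunctions")

response = sfn.start_execution(
    stateMachineArn="arn:aws:states:eu-west-1:123456789012:stateMachine:SftpDownloadStateMachine",
    input=json.dumps(
        {
            # The keys the state machine definition above reads from "$"
            "sftp_file_path": "/uploads/big-file.zip",
            "target_s3_bucket": "example-bucket",
            "target_s3_key": "downloads/big-file.zip",
        }
    ),
)
print(response["executionArn"])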
Error Handling and Optimization
- Error Handling : Each state in the Step Function can be equipped with a retry policy. For example, if the download or upload of a specific chunk fails, attaching a Retry block to that Task state lets the Step Function retry the operation automatically:
"Retry": [
{
"ErrorEquals": ["States.ALL"],
"IntervalSeconds": 2,
"MaxAttempts": 3,
"BackoffRate": 2.0
}
]
- Optimization : You can tweak MaxConcurrency on the Map state to control how many chunks are processed in parallel. Increasing it can reduce the overall runtime, but it consumes more concurrent Lambda executions.
"MaxConcurrency": 10
Conclusion
By leveraging AWS Step Functions and Lambda in tandem, this architecture offers a scalable and robust solution for handling large files from SFTP servers. It ensures that files are downloaded in chunks and uploaded to S3 using multipart upload, making the process efficient and error-resistant.
For the full project, including all Lambda functions and deployment scripts, visit the GitHub repository:
[GitHub - Tammura/aws-step-function-large-file-download](https://github.com/Tammura/aws-step-function-large-file-download)