Tammura

Originally published at awstip.com
Efficiently Download Large Files into AWS S3 with Step Functions and Lambda

In this blog, I want to share my approach to downloading large files (>1GB) by scaling horizontally with AWS Step Functions.

For the full code, feel free to visit the GitHub repository.


Visual representation of an AWS Step Functions workflow parallelizing a large file download process.

To follow along, make sure you have AWS SAM installed. You can find the installation guide [here](https://aws.plainenglish.io/develop-aws-lambda-functions-locally-on-your-machine-ccdd37e10092).

Table of contents

· The Idea

· AWS Lambda for File Metadata

· AWS Lambda for Chunk Creation

· AWS Lambda for Downloading chunks to S3

· AWS StepFunction

Step Function Definition

· Error Handling and Optimization

· Conclusion

The Idea

To handle large files effectively, we will use AWS Step Functions to orchestrate a series of Lambda functions that download the file in chunks from an SFTP server and upload them to S3 via a multipart upload.

Here is a high-level overview of how the solution works:

  • Step 1: Retrieve metadata of the file from the SFTP server, such as file size.
  • Step 2: Split the file into 50MB chunks.
  • Step 3: Use a Step Function’s Map state to execute Lambda functions that download each chunk in parallel.
  • Step 4: Upload each chunk to S3 as part of a multipart upload.
  • Step 5: Once all chunks are uploaded, complete the multipart upload.

AWS Lambda for File Metadata

The first step is to create a Lambda function that connects to the SFTP server and fetches the file metadata (such as its size).

In the template.yaml, we define the Lambda like this:

# template.yaml
Resources:
  ...
  GetFileMetadataFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/lambdas/get_file_metadata/
      Handler: app.lambda_handler
      Runtime: python3.12
      Timeout: 30
      Environment:
        Variables:
          SFTP_HOST: !Ref SftpHost
          SFTP_USERNAME: !Ref SftpUsername
          SFTP_PASSWORD: !Ref SftpPassword
      Policies:
        - Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Action:
                - s3:PutObject
                - s3:AbortMultipartUpload
              # Object-level actions need the /* suffix on the bucket ARN
              Resource: !Sub "${ExampleBucket.Arn}/*"
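The handler itself is not shown here, but it could look like the following sketch. It uses paramiko (which would have to be bundled with the function, e.g. via a layer) to stat the remote file; the response shape matches the path the state machine reads later (`$.file_metadata.Payload.body.metadata.size`), while the port and the error handling are simplified assumptions:

```python
import os

def build_response(size: int) -> dict:
    # Shape consumed later by the state machine via
    # $.file_metadata.Payload.body.metadata.size
    return {"statusCode": 200, "body": {"metadata": {"size": size}}}

def lambda_handler(event, context):
    import paramiko  # imported lazily so the sketch imports without the dependency

    # Port 22 is an assumption; credentials come from the template's env vars
    transport = paramiko.Transport((os.environ["SFTP_HOST"], 22))
    transport.connect(username=os.environ["SFTP_USERNAME"],
                      password=os.environ["SFTP_PASSWORD"])
    sftp = paramiko.SFTPClient.from_transport(transport)
    try:
        size = sftp.stat(event["sftp_file_path"]).st_size
    finally:
        sftp.close()
        transport.close()
    return build_response(size)
```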

AWS Lambda for Chunk Creation

The next step is to create a Lambda function that splits the file into chunks based on its size and a configurable CHUNK_SIZE.

In the template.yaml, we define the Lambda like this:

# template.yaml
Resources:
  ...
  CreateFileChunksFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/lambdas/create_file_chunks/
      Handler: app.lambda_handler
      Runtime: python3.12
      Timeout: 30
      Environment:
        Variables:
          CHUNK_SIZE: 52428800 # 50MB
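The chunking itself is simple byte-range arithmetic. A possible handler is sketched below; the response shape matches what the state machine later iterates over (`$.chunks.Payload.body.chunks`), and the inclusive byte ranges and field names are assumptions based on the definition shown further down:

```python
import os

def make_chunks(file_size: int, chunk_size: int) -> list[dict]:
    # Inclusive byte ranges [start_byte, end_byte]
    chunks = []
    for number, start in enumerate(range(0, file_size, chunk_size), start=1):
        end = min(start + chunk_size, file_size) - 1
        chunks.append({"chunk_number": number,
                       "start_byte": start,
                       "end_byte": end})
    return chunks

def lambda_handler(event, context):
    # CHUNK_SIZE comes from the template; default to 50 MB
    chunk_size = int(os.environ.get("CHUNK_SIZE", 52428800))
    chunks = make_chunks(event["file_size"], chunk_size)
    return {"statusCode": 200, "body": {"chunks": chunks}}
```

For example, a 120 MB file yields three chunks: two full 50 MB ranges and a 20 MB remainder.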

AWS Lambda for Downloading chunks to S3

The next step is to create a Lambda function that downloads a single chunk and uploads it to S3 as part of an S3 multipart upload.

In the template.yaml, we define the Lambda like this:

# template.yaml
Resources:
  ...
  DownloadChunkFunction:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: !Sub ${AWS::StackName}-DownloadChunkFunction
      Timeout: 900
      CodeUri: src/lambdas/download_chunk/
      Handler: app.lambda_handler
      Runtime: python3.12
      Environment:
        Variables:
          SFTP_HOST: !Ref SftpHost
          SFTP_USERNAME: !Ref SftpUsername
          SFTP_PASSWORD: !Ref SftpPassword
      MemorySize: 1024
      Policies:
        - AmazonS3FullAccess
        - AWSLambdaBasicExecutionRole
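A sketch of this handler is below: it reads one byte range over SFTP, calls S3 `upload_part`, and returns the part descriptor that the state machine collects later (`$.download_chunk_results[*].Payload.body`). As before, paramiko, port 22, and the exact event field names are assumptions:

```python
import os

def part_result(etag: str, part_number: int) -> dict:
    # Shape collected later by the "Format multipart parts" Pass state
    return {"statusCode": 200, "body": {"ETag": etag, "PartNumber": part_number}}

def lambda_handler(event, context):
    import boto3      # available in the Lambda runtime
    import paramiko   # would have to be bundled with the function

    transport = paramiko.Transport((os.environ["SFTP_HOST"], 22))
    transport.connect(username=os.environ["SFTP_USERNAME"],
                      password=os.environ["SFTP_PASSWORD"])
    sftp = paramiko.SFTPClient.from_transport(transport)
    try:
        with sftp.open(event["sftp_file_path"], "rb") as remote:
            remote.seek(event["start_byte"])
            # end_byte is inclusive, so read (end - start + 1) bytes
            data = remote.read(event["end_byte"] - event["start_byte"] + 1)
    finally:
        sftp.close()
        transport.close()

    response = boto3.client("s3").upload_part(
        Bucket=event["bucket_name"],
        Key=event["s3_file_key"],
        PartNumber=event["chunk_number"],
        UploadId=event["upload_id"],
        Body=data,
    )
    return part_result(response["ETag"], event["chunk_number"])
```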

AWS StepFunction

Now it’s time to create the Step Function that orchestrates the entire process. We define a state machine that manages each Lambda invocation:

# template.yaml
Resources:
  ...
  SftpDownloadStateMachine:
    Type: AWS::Serverless::StateMachine
    Properties:
      DefinitionUri: src/state_machines/sftp_download_definition.json
      DefinitionSubstitutions:
        CreateFileChunksFunction: !GetAtt CreateFileChunksFunction.Arn
        GetFileMetadataFunction: !GetAtt GetFileMetadataFunction.Arn
        DownloadChunkFunction: !GetAtt DownloadChunkFunction.Arn
      Role: !GetAtt StepFunctionRole.Arn

Step Function Definition

The definition file tells AWS how to orchestrate the entire process: it defines the order of the Lambda invocations, and through the Map state it loops over the chunks in parallel, triggering the download and upload of each one.

# src/state_machines/sftp_download_definition.json
{
  "Comment": "Step function to process zip file",
  "StartAt": "Get File Metadata From SFTP",
  "States": {
    "Get File Metadata From SFTP": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "ResultPath": "$.file_metadata",
      "Parameters": {
        "Payload.$": "$",
        "FunctionName": "${GetFileMetadataFunction}"
      },
      "Next": "Create chunks"
    },
    "Create chunks": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "ResultPath": "$.chunks",
      "Parameters": {
        "Payload": {
          "file_size.$": "$.file_metadata.Payload.body.metadata.size"
        },
        "FunctionName": "${CreateFileChunksFunction}"
      },
      "Next": "Start multipart upload"
    },
    "Start multipart upload": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:s3:createMultipartUpload",
      "ResultPath": "$.multipart_upload",
      "Parameters": {
        "Bucket.$": "$.target_s3_bucket",
        "Key.$": "$.target_s3_key"
      },
      "Next": "Download chunks"
    },
    "Download chunks": {
      "Type": "Map",
      "ItemsPath": "$.chunks.Payload.body.chunks",
      "ResultPath": "$.download_chunk_results",
      "Parameters": {
        "bucket_name.$": "$.target_s3_bucket",
        "s3_file_key.$": "$.target_s3_key",
        "sftp_file_path.$": "$.sftp_file_path",
        "upload_id.$": "$.multipart_upload.UploadId",
        "chunk.$": "$$.Map.Item.Value"
      },
      "Iterator": {
        "StartAt": "Download single chunk from SFTP",
        "States": {
          "Download single chunk from SFTP": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {
              "Payload": {
                "bucket_name.$": "$.bucket_name",
                "s3_file_key.$": "$.s3_file_key",
                "sftp_file_path.$": "$.sftp_file_path",
                "chunk_number.$": "$.chunk.chunk_number",
                "start_byte.$": "$.chunk.start_byte",
                "end_byte.$": "$.chunk.end_byte",
                "upload_id.$": "$.upload_id"
              },
              "FunctionName": "${DownloadChunkFunction}"
            },
            "End": true
          }
        }
      },
      "Next": "Format multipart parts"
    },
    "Format multipart parts": {
      "Type": "Pass",
      "ResultPath": "$.formatted_parts",
      "Parameters": {
        "Parts.$": "$.download_chunk_results[*].Payload.body"
      },
      "Next": "Complete multipart upload"
    },
    "Complete multipart upload": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:s3:completeMultipartUpload",
      "ResultPath": null,
      "Parameters": {
        "Bucket.$": "$.target_s3_bucket",
        "Key.$": "$.target_s3_key",
        "UploadId.$": "$.multipart_upload.UploadId",
        "MultipartUpload": {
          "Parts.$": "$.formatted_parts.Parts"
        }
      },
      "Next": "Fine"
    },
    "Fine": {
      "Type": "Succeed"
    }
  }
}
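Note that the "Format multipart parts" Pass state only works if every Map iteration returns a part descriptor that S3 accepts: the keys must be exactly ETag and PartNumber, and CompleteMultipartUpload expects the parts in ascending PartNumber order. In Python terms, the collection step behaves roughly like this sketch (the input shape mirrors `$.download_chunk_results[*].Payload.body`; the sorting is defensive, since the Map state already preserves input order):

```python
def format_parts(results: list[dict]) -> dict:
    # Mirrors the Pass state's JSONPath:
    # Parts.$ = $.download_chunk_results[*].Payload.body
    parts = [r["Payload"]["body"] for r in results]
    # S3 requires ascending PartNumber in CompleteMultipartUpload
    return {"Parts": sorted(parts, key=lambda p: p["PartNumber"])}
```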

Error Handling and Optimization

  • Error Handling : Each state in the Step Function can be equipped with retry policies. For example, if the download or upload fails for a specific chunk, the Step Function can automatically retry the operation.
"Retry": [
  {
    "ErrorEquals": ["States.ALL"],
    "IntervalSeconds": 2,
    "MaxAttempts": 3,
    "BackoffRate": 2.0
  }
]
  • Optimization : You can tweak MaxConcurrency in the Map state to control how many chunks are processed in parallel. Increasing this number can reduce overall time but requires more resources.
"MaxConcurrency": 10

Conclusion

By leveraging AWS Step Functions and Lambda in tandem, this architecture offers a scalable and robust solution for handling large files from SFTP servers. It ensures that files are downloaded in chunks and uploaded to S3 using multipart upload, making the process efficient and error-resistant.
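To kick off a transfer, you start an execution of the state machine with the top-level input fields the definition references. A minimal sketch with boto3 (the field names come from the JSON definition above; the ARN and paths are placeholders you would substitute):

```python
import json

def build_execution_input(bucket: str, key: str, sftp_path: str) -> str:
    # Top-level fields read by the state machine:
    # $.target_s3_bucket, $.target_s3_key, $.sftp_file_path
    return json.dumps({"target_s3_bucket": bucket,
                       "target_s3_key": key,
                       "sftp_file_path": sftp_path})

def start_download(state_machine_arn: str, bucket: str, key: str, sftp_path: str):
    import boto3  # imported lazily so the sketch imports without AWS credentials
    return boto3.client("stepfunctions").start_execution(
        stateMachineArn=state_machine_arn,
        input=build_execution_input(bucket, key, sftp_path),
    )
```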

For the full project, including all Lambda functions and deployment scripts, visit the GitHub repository.

GitHub - Tammura/aws-step-function-large-file-download

