Process millions of media assets with FFmpeg on AWS
I need to process over 3 million multi-modal files to train a large language model (LLM) that can understand and generate audio, in order to launch generative AI based customer experiences. Training an audio LLM requires massive amounts of high-quality audio data so the model can learn acoustic patterns. The team has access to millions of audio files stored in Amazon S3, but processing them sequentially on a single Amazon EC2 instance does not scale.
To process this audio efficiently for LLM training, I improved the “AWS Batch with FFmpeg” sample, an open source audio/video processing solution built on AWS Batch and the open source tool FFmpeg.
In this article, I provide technical details on building a reliable, scalable processing workflow with AWS Step Functions and AWS Batch. The workflow uses the open source tool FFmpeg to process large volumes of media assets. I describe how to configure AWS Step Functions to orchestrate AWS Batch jobs, handle job failures gracefully, and work within service limits. This architecture shows how you can combine AWS services like Step Functions and AWS Batch with open source tools like FFmpeg to create a robust, managed processing pipeline.
The architecture
At AWS re:Invent 2022, AWS announced the availability of a distributed map for AWS Step Functions. This new state type extended support for orchestrating large-scale parallel workloads.
This state is ideal for processing workflows where many assets can be processed in parallel. I can compose any AWS service API supported by Step Functions into the workflow. In our use case, AWS Batch is invoked directly from the Amazon States Language to process assets in parallel without writing new code. This is achieved through Step Functions Service Integrations, which allow users to call supported services directly in the Resource field of a Task state.
The following code describes a Step Functions Task state submitting a new job to AWS Batch:
"SubmitJob": {
"Type": "Task",
"Resource": "arn:aws:states:::batch:submitJob.sync",
"Parameters": {
"JobName.$": "$.name",
"JobDefinition.$": "States.Format('arn:aws:batch:<region>:<account>:job-definition/batch-ffmpeg-job-definition-{}',$.compute)",
"JobQueue.$": "States.Format('arn:aws:batch:<region>:<account>:job-queue/batch-ffmpeg-job-queue-{}',$.compute)",
"Parameters.$": "$"
},
"End": true,
}
I build upon the existing “AWS Batch with FFmpeg” sample by encapsulating it in a Step Functions state machine. The state machine uses a distributed Map state optimized for Amazon S3 inputs. By configuring the S3 bucket and prefix directly in the map, the state machine lists and processes assets in parallel.
For each map item, the state machine invokes an AWS Batch job that runs FFmpeg to encode the audio.
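The following is a minimal sketch, in Amazon States Language, of what such a distributed Map state could look like. The state name ProcessMediaAssets, the ItemSelector fields, and the key field derived from each listed S3 object are illustrative assumptions; the deployed sample may shape per-item input differently. The inner SubmitJob task is the one shown earlier.
"ProcessMediaAssets": {
  "Type": "Map",
  "Comment": "Illustrative sketch: state names and ItemSelector fields are assumptions, not the exact sample code",
  "ItemReader": {
    "Resource": "arn:aws:states:::s3:listObjectsV2",
    "Parameters": {
      "Bucket.$": "$.input.s3_bucket",
      "Prefix.$": "$.input.s3_prefix"
    }
  },
  "ItemSelector": {
    "name.$": "$.name",
    "compute.$": "$.compute",
    "input.$": "$.input",
    "output.$": "$.output",
    "global.$": "$.global",
    "key.$": "$$.Map.Item.Value.Key"
  },
  "ItemProcessor": {
    "ProcessorConfig": {
      "Mode": "DISTRIBUTED",
      "ExecutionType": "STANDARD"
    },
    "StartAt": "SubmitJob",
    "States": {
      "SubmitJob": {
        "Type": "Task",
        "Resource": "arn:aws:states:::batch:submitJob.sync",
        "Parameters": {
          "JobName.$": "$.name",
          "JobDefinition.$": "States.Format('arn:aws:batch:<region>:<account>:job-definition/batch-ffmpeg-job-definition-{}',$.compute)",
          "JobQueue.$": "States.Format('arn:aws:batch:<region>:<account>:job-queue/batch-ffmpeg-job-queue-{}',$.compute)",
          "Parameters.$": "$"
        },
        "End": true
      }
    }
  },
  "MaxConcurrency": 5000,
  "End": true
}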
The following diagram shows how AWS Batch handles compute provisioning and job scheduling, while Step Functions orchestrates the workflow, all in a serverless model.
Beyond the limit
Retry mechanism
According to the Step Functions quotas documentation, the distributed map supports a maximum concurrency of up to 10,000 parallel child executions, which exceeds the concurrency limits of AWS Batch. When integrating Step Functions with other services, I must consider the downstream service's quotas and limits to avoid errors, as shown in the following screenshot.
AWS Batch has a quota of 1 million jobs in the SUBMITTED state and a limit of 50 transactions per second (TPS) for SubmitJob API calls. To avoid exceeding the TPS limit, I could configure the Step Functions distributed map's maximum concurrency to 50. However, at this rate, submitting the 3 million jobs in this use case would take over 16 hours (3,000,000 jobs at 50 jobs per second is 60,000 seconds, or roughly 16.7 hours), excluding processing time.
A better solution is to use Step Functions' enhanced error handling capabilities. These allow me to cap the retry interval to prevent excessive delays, and adding jitter introduces randomness into the retries, avoiding a retry storm that could overwhelm AWS Batch. Combined, this error handling keeps the retry rate under control during failures while still allowing the high concurrency of the distributed map during normal operation.
When configuring AWS Step Functions, it's important to set the maximum concurrency appropriately for the workload. Here, I set it to 5,000. Instead of retrying immediately and aggressively, Step Functions waits some amount of time between attempts. The most common pattern is exponential backoff, where the wait time (IntervalSeconds = 180 seconds) increases exponentially (BackoffRate = 3) after every attempt. Because exponential functions grow quickly, exponential backoff can lead to very long waits, so implementations typically cap the backoff at a maximum value (MaxDelaySeconds = 300) and limit the number of attempts (MaxAttempts = 10). This is called, predictably, “capped exponential backoff”; the blog post “Timeouts, retries, and backoff with jitter” from the Amazon Builders' Library explains the concept in detail. If all the failed calls back off for the same amount of time, they cause contention or overload again when they are retried. Jitter adds some randomness to the backoff to spread the retries out over time (JitterStrategy = "FULL").
The following code shows the Retry configuration applied to the Step Functions task:
"Retry": [
{
"ErrorEquals": [
"States.ALL"
],
"BackoffRate": 3,
"IntervalSeconds": 180,
"MaxAttempts": 10,
"Comment": "retry because of AWS Batch Quotas Issue",
"MaxDelaySeconds": 300,
"JitterStrategy": "FULL"
}
]
When the Step Functions workflow starts, the initial burst of AWS Batch SubmitJob API calls is throttled because it exceeds the TPS limit. But the built-in retry policy with capped exponential backoff and jitter allows all jobs to eventually succeed without failing the workflow. The backoff gives the throttling time to clear, while jitter spreads out the retries to avoid further throttling. This shows how Step Functions retry policies can gracefully handle temporary throttling or failures.
Application state data
AWS Step Functions stores application state data for each workflow invocation. The maximum size of this application state data is 256 KB per workflow invocation: the total size of all data loaded into the state machine and passed across transitions must stay below 256 KB. Exceeding this limit results in an exception and an aborted execution, as shown in the following screenshot.
Fortunately, AWS Step Functions provides a solution to consolidate large amounts of data from child workflow executions. It aggregates all child workflow execution data, including execution inputs, outputs, and status, and exports executions with the same status to their respective files in a specified Amazon S3 location. The ResultWriter field specifies the S3 bucket and prefix where Step Functions writes the aggregated results of all child workflows started by a Distributed Map state.
The following code shows the ResultWriter configuration that exports this application state data:
"ResultWriter": {
"Resource": "arn:aws:states:::s3:putObject",
"Parameters": {
"Bucket.$": "$.input.s3_bucket",
"Prefix": "batch-ffmpeg-state-machine/results-output/"
}
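Once the map run completes, I can inspect the aggregated results directly in Amazon S3 with the AWS CLI. This is a minimal sketch assuming the illustrative bucket name my-input-bucket used later in the sample input; Step Functions typically writes a manifest.json plus result files grouped by child execution status under this prefix.
# Minimal sketch: list the aggregated result files written by the Distributed Map ResultWriter.
# The bucket name is the illustrative one from the sample input ("my-input-bucket"); replace it
# with your own. Expect a manifest.json plus status-grouped result files
# (for example SUCCEEDED_0.json, FAILED_0.json).
aws s3 ls --recursive s3://my-input-bucket/batch-ffmpeg-state-machine/results-output/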
How to use it
Prerequisites
You will need the following prerequisites to set up the solution:
- An AWS account
- Latest version of AWS Cloud Development Kit (AWS CDK) with bootstrapping already done
- Latest version of Task
- Latest version of Docker
- Latest version of Python 3.
Deploy the sample code
Deploy the “AWS Batch with FFmpeg” sample code by following the README file in the GitHub repository: https://github.com/aws-samples/aws-batch-with-ffmpeg#deploy-the-solution-with-aws-cdk
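As a quick orientation, here is a minimal sketch of the deployment steps, under the assumption that the repository's Taskfile exposes a cdk:deploy target mirroring the task cdk:destroy target used in the clean-up section; the README linked above remains the authoritative reference.
# Minimal deployment sketch. Assumption: the Taskfile provides a cdk:deploy target
# (mirroring the cdk:destroy target used for clean up). Follow the README for the
# authoritative, up-to-date instructions.
git clone https://github.com/aws-samples/aws-batch-with-ffmpeg.git
cd aws-batch-with-ffmpeg
task cdk:deploy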
Use the solution
A Step Functions execution is triggered with a JSON document as input. In our case, here is the input.json file designed for the solution:
{
  "name": "pytest-sdk-audio",
  "compute": "intel",
  "input": {
    "s3_bucket": "my-input-bucket",
    "s3_prefix": "media-assets/",
    "file_options": "null"
  },
  "output": {
    "s3_bucket": "my-output-bucket",
    "s3_prefix": "output/",
    "s3_suffix": "",
    "file_options": "-ac 1 -ar 48000"
  },
  "global": {
    "options": "null"
  }
}
The parameters of this Step Functions execution input are:
- $.name: Metadata of this job for observability.
- $.compute: Instance family used to process the media assets: intel, arm, amd, nvidia, xilinx.
- $.input.s3_bucket and $.input.s3_prefix: Amazon S3 bucket and prefix of the S3 objects to be processed by FFmpeg.
- $.input.file_options: FFmpeg input file options, as described in the official FFmpeg documentation.
- $.output.s3_bucket and $.output.s3_prefix: S3 bucket and prefix where all processed media assets are stored.
- $.output.s3_suffix: Suffix added to all processed media assets stored in the output Amazon S3 bucket.
- $.output.file_options: FFmpeg output file options, as described in the official FFmpeg documentation.
- $.global.options: FFmpeg global options, as described in the official FFmpeg documentation.
I submit the Step Functions execution with the AWS CLI:
aws stepfunctions start-execution --state-machine-arn arn:aws:states:<region>:<account_id>:stateMachine:batch-ffmpeg-state-machine --name <execution-name> --input file://input.json
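To know when the workflow has finished, I can poll the execution status with the AWS CLI. This is a minimal sketch; the execution ARN shown is illustrative, and start-execution returns the actual executionArn to use.
# Minimal sketch: poll the execution status until it reaches SUCCEEDED or FAILED.
# The execution ARN below is illustrative; use the executionArn returned by start-execution.
aws stepfunctions describe-execution \
  --execution-arn arn:aws:states:<region>:<account_id>:execution:batch-ffmpeg-state-machine:<execution-name> \
  --query 'status' --output text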
When the Step Functions execution completes, the processed media assets are available in the S3 bucket configured as the output. The S3 path of each processed media file is: s3://{$.output.s3_bucket}/{$.output.s3_prefix}{Input S3 object key}{$.output.s3_suffix}
Now the team has access to millions of properly processed audio files and can proceed to train its audio LLM.
Cost
AWS Batch helps optimize compute costs because you only pay for the resources you use. Leveraging Spot Instances lets customers take advantage of unused EC2 capacity to achieve significant cost savings compared to On-Demand Instances. It's important to benchmark different instance types and sizes to find the optimal configuration for the workload. Testing options like GPU versus CPU encoding helps strike the right balance between performance and cost, as described in the blog post “Optimizing video encoding with FFmpeg using NVIDIA GPU-based Amazon EC2 instances”.
Clean up
To avoid incurring unnecessary charges after testing this solution, I have to clean up the resources I created by following these steps:
- Delete all objects in the Amazon S3 bucket used for testing. Remove these objects from the S3 console by selecting all objects and clicking “Delete.”
- Destroy the AWS CDK stack that was deployed for testing. Open a terminal in the Git repository and run: task cdk:destroy
- Verify that all resources have been removed by checking the AWS console. This ensures no resources are accidentally left running, which would lead to unexpected charges.
Summary
I leveraged the open source tool FFmpeg and multiple AWS services (AWS Batch, AWS Step Functions, and Amazon S3) to process millions of audio files in parallel. This serverless architecture overcame scalability and service quota challenges by combining AWS services with an open source technology:
- AWS Step Functions’ distributed map enabled large-scale parallel processing of assets stored in S3.
- Integrating AWS Batch into the Step Functions workflow provided scalable compute while Step Functions handled orchestration.
- Error handling strategies like retries with backoff and jitter in Step Functions helped avoid overloading the downstream AWS Batch service when executing high volumes of jobs.
In summary, the combination of AWS Batch, Step Functions, S3, and open source tools like FFmpeg allowed efficient, scalable, parallel processing of millions of assets.
The following screenshot shows the item processing status for 2 million audio files, which the team processed in nearly 2 days. A sequential execution would have taken several weeks to complete the same task.