Javid Jamae

Posted on Apr 24 • Originally published at ffmpeg-micro.com

How to Merge Audio and Video with FFmpeg (CLI and API)



You have a video file and a separate audio file. Maybe it's an AI voiceover you generated with ElevenLabs. Maybe it's a podcast intro you recorded separately. Either way, you need to combine them into one file, and FFmpeg is the tool that does it.

The problem? FFmpeg's muxing syntax is confusing. You need -map, -shortest, codec flags, and a server to run it on. If you're building this into an app or automation, you also need to figure out scaling, error handling, and file storage.

You can skip all of that with an API call.

What does "muxing" actually mean in FFmpeg?

Muxing (short for multiplexing) is the process of combining separate media streams into a single container file. When you merge an audio track with a video track, you're muxing them together into one MP4, WebM, or MKV file.

FFmpeg handles this with the -map flag, which tells it which streams to pull from which inputs. A typical CLI command looks like this:

ffmpeg -i video.mp4 -i voiceover.mp3 -map 0:v:0 -map 1:a:0 -c:v libx264 -c:a aac -shortest -y output.mp4

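That command takes the first video stream from input 0 (video.mp4) and the first audio stream from input 1 (voiceover.mp3), re-encodes them as H.264 and AAC, and writes both into output.mp4. The -shortest flag stops the output when the shorter of the two streams ends, so you don't get trailing silence or a frozen last frame, and -y overwrites output.mp4 without asking.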
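If you script this locally, it helps to build the command as an argument list instead of a shell string. Here's a minimal Python sketch; `build_mux_command` is a hypothetical helper name, and it only assembles the arguments from the command above without running anything.

```python
def build_mux_command(video: str, audio: str, output: str,
                      copy_video: bool = False) -> list[str]:
    """Return the ffmpeg argument list for muxing one video file with one audio file."""
    video_codec = "copy" if copy_video else "libx264"
    return [
        "ffmpeg",
        "-i", video,          # input 0: the video source
        "-i", audio,          # input 1: the audio source
        "-map", "0:v:0",      # first video stream of input 0
        "-map", "1:a:0",      # first audio stream of input 1
        "-c:v", video_codec,  # re-encode with x264, or copy the stream untouched
        "-c:a", "aac",        # encode audio as AAC for the MP4 container
        "-shortest",          # stop when the shorter stream ends
        "-y",                 # overwrite the output file without prompting
        output,
    ]
```

Pass the result to `subprocess.run(cmd, check=True)` once ffmpeg is on your PATH. A list avoids shell quoting problems with filenames that contain spaces.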

It works. But deploying this on a server, handling file uploads, and managing FFmpeg installations across environments is where things get painful.

How to merge audio and video with the FFmpeg Micro API

FFmpeg Micro is a cloud API that lets you run FFmpeg operations with a single HTTP call. No FFmpeg installation, no server management. You send your files, specify the operation, and get the result back.

For muxing, you pass two inputs (your video and your audio) and use the options array to control how the streams are mapped.

Step 1: Upload your files

FFmpeg Micro uses a 3-step upload flow. First, get a presigned URL. Then upload your file directly. Then confirm the upload.

# Get presigned URL for the video
curl -X POST https://api.ffmpeg-micro.com/v1/upload/presigned-url \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"filename": "video.mp4", "contentType": "video/mp4", "fileSize": 15000000}'

# Upload to the returned URL
curl -X PUT "PRESIGNED_URL_FROM_RESPONSE" \
  -H "Content-Type: video/mp4" \
  --data-binary @video.mp4

# Confirm the upload
curl -X POST https://api.ffmpeg-micro.com/v1/upload/confirm \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"filename": "video.mp4", "fileSize": 15000000}'

Repeat for your audio file. Each confirm response returns a fileUrl (a gs:// URL) that you'll use in the transcode request.
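Here's the same three-step flow as a minimal Python sketch using only the standard library. The presigned-URL response field is not named in this post, so `uploadUrl` below is an assumption you should check against the API docs. The `send` parameter and the `CONTENT_TYPES` table are also hypothetical, added so the flow can be tested without network access.

```python
import json
import os
import urllib.request
from typing import Callable, Optional

API_BASE = "https://api.ffmpeg-micro.com/v1"

# Hypothetical extension-to-MIME table; extend it for the formats you upload.
CONTENT_TYPES = {".mp4": "video/mp4", ".mp3": "audio/mpeg", ".wav": "audio/wav"}

# send(method, url, headers, body) -> parsed JSON dict, or None for non-JSON replies
Sender = Callable[[str, str, dict, Optional[bytes]], Optional[dict]]


def http_send(method: str, url: str, headers: dict, body: Optional[bytes]) -> Optional[dict]:
    """Default sender built on urllib; only parses the reply when it is JSON."""
    req = urllib.request.Request(url, data=body, headers=headers, method=method)
    with urllib.request.urlopen(req) as resp:
        if resp.headers.get_content_type() == "application/json":
            return json.loads(resp.read())
    return None


def upload_file(path: str, api_key: str, send: Sender = http_send) -> str:
    """Run the presign -> PUT -> confirm flow and return the confirmed fileUrl."""
    filename = os.path.basename(path)
    size = os.path.getsize(path)
    content_type = CONTENT_TYPES[os.path.splitext(filename)[1].lower()]
    api_headers = {"Authorization": f"Bearer {api_key}",
                   "Content-Type": "application/json"}

    # Step 1: ask the API for a presigned upload URL.
    presign_body = {"filename": filename, "contentType": content_type, "fileSize": size}
    presigned = send("POST", f"{API_BASE}/upload/presigned-url", api_headers,
                     json.dumps(presign_body).encode())
    upload_url = presigned["uploadUrl"]  # ASSUMPTION: field name not given in the post

    # Step 2: PUT the raw bytes to the presigned URL (no API key on this request).
    with open(path, "rb") as f:
        send("PUT", upload_url, {"Content-Type": content_type}, f.read())

    # Step 3: confirm; the reply carries the gs:// fileUrl used in transcode jobs.
    confirm_body = {"filename": filename, "fileSize": size}
    confirmed = send("POST", f"{API_BASE}/upload/confirm", api_headers,
                     json.dumps(confirm_body).encode())
    return confirmed["fileUrl"]
```

Call `upload_file("video.mp4", API_KEY)` and `upload_file("voiceover.mp3", API_KEY)` to get the two fileUrl values for Step 2. The sketch reads the whole file into memory, which is fine for short clips but worth replacing with a streamed upload for very large files.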

Step 2: Run the mux job

Now send both files to the transcode endpoint with -map options to control which streams go where:

POST /v1/transcodes
Authorization: Bearer YOUR_API_KEY

{
  "inputs": [
    {"url": "gs://your-bucket/video.mp4"},
    {"url": "gs://your-bucket/voiceover.mp3"}
  ],
  "outputFormat": "mp4",
  "options": [
    {"option": "-map", "argument": "0:v:0"},
    {"option": "-map", "argument": "1:a:0"},
    {"option": "-c:v", "argument": "libx264"},
    {"option": "-c:a", "argument": "aac"},
    {"option": "-shortest", "argument": ""}
  ]
}

This does exactly what the CLI command does, but without any server setup. The API accepts up to 10 inputs per job, so you could mux multiple audio tracks or combine several video segments in one request.
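Because the options array mirrors CLI flags one-for-one, you can generate it from the same flags you'd type on the command line. This Python sketch is a hypothetical helper, not part of any FFmpeg Micro SDK: `cli_to_options`, `transcode_body`, and the set of valueless flags are assumptions, while the 10-input limit comes from the paragraph above.

```python
# Flags that take no value; the options array still wants an "argument" key for them.
# Hypothetical, partial list: add any other valueless flags you use.
VALUELESS_FLAGS = {"-shortest", "-y", "-an", "-vn", "-sn"}
MAX_INPUTS = 10  # per-job input limit stated in this post


def cli_to_options(flags: list[str]) -> list[dict]:
    """Turn CLI-style flags (everything except the -i inputs) into the options array."""
    options, i = [], 0
    while i < len(flags):
        flag = flags[i]
        if flag in VALUELESS_FLAGS:
            options.append({"option": flag, "argument": ""})
            i += 1
        elif i + 1 < len(flags):
            options.append({"option": flag, "argument": flags[i + 1]})
            i += 2
        else:
            raise ValueError(f"{flag} is missing its value")
    return options


def transcode_body(input_urls: list[str], flags: list[str],
                   output_format: str = "mp4") -> dict:
    """Build the JSON body for POST /v1/transcodes."""
    if not 1 <= len(input_urls) <= MAX_INPUTS:
        raise ValueError(f"expected 1 to {MAX_INPUTS} inputs, got {len(input_urls)}")
    return {
        # List order sets the input index that -map refers to (0, 1, ...).
        "inputs": [{"url": url} for url in input_urls],
        "outputFormat": output_format,
        "options": cli_to_options(flags),
    }
```

Passing the Step 2 URLs and `"-map 0:v:0 -map 1:a:0 -c:v libx264 -c:a aac -shortest".split()` through `json.dumps(transcode_body(...))` reproduces the request body shown above. Splitting on spaces breaks for arguments that contain spaces, such as filter expressions, so build the list by hand in those cases.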

Step 3: Download the result

Poll the job status until it's complete, then grab the download URL:

# Check job status
curl https://api.ffmpeg-micro.com/v1/transcodes/JOB_ID \
  -H "Authorization: Bearer YOUR_API_KEY"

# Get download URL when status is "completed"
curl https://api.ffmpeg-micro.com/v1/transcodes/JOB_ID/download \
  -H "Authorization: Bearer YOUR_API_KEY"

The download endpoint returns a signed URL. Your file is ready.
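A polling loop for Step 3 can be sketched like this. It assumes `get_status` is a function you write around the status request that returns the job's status string, and that `"failed"` is the failure value; this post only names `"completed"`, so treat the failure check as an assumption.

```python
import time
from typing import Callable


def wait_for_job(get_status: Callable[[], str], interval: float = 5.0,
                 timeout: float = 600.0,
                 sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll until the job reports "completed"; raise on failure or timeout."""
    waited = 0.0
    while True:
        status = get_status()
        if status == "completed":
            return
        if status == "failed":  # ASSUMPTION: failure status name not given in the post
            raise RuntimeError("transcode job failed")
        if waited >= timeout:
            raise TimeoutError(f"job still {status!r} after {timeout:g}s")
        sleep(interval)
        waited += interval
```

Injecting `sleep` keeps the loop testable. In production, leave the default `time.sleep` and pick an interval that respects any rate limits on the status endpoint, then request the download URL once the function returns.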

Common muxing scenarios developers run into

Replacing audio on a video. This is the most common case. You have a screen recording or stock footage and need to swap in a different audio track. The -map 0:v:0 -map 1:a:0 pattern handles this.

Adding a voiceover to a silent video. Same approach, but your source video has no audio track at all. FFmpeg doesn't care. It pulls the video from input 0 and audio from input 1 regardless.

Keeping original audio and adding a second track. If you want both the original audio and a new track (like background music), you have two options. To keep them as separate tracks that a player can switch between, add -map 0:a:0 alongside -map 1:a:0. To blend them into a single track, you need -filter_complex with the amix filter. That's a more advanced use case, but the API supports it through the options array.
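Here's how the three scenarios translate to FFmpeg arguments, as a minimal Python sketch. The function names are hypothetical, and each one assumes the source video is H.264 in MP4 so it can use `-c:v copy`; swap in `libx264` if that's not true for your files.

```python
def replace_audio(video: str, audio: str, out: str) -> list[str]:
    """Video from input 0, audio only from input 1; the original audio is dropped."""
    return ["ffmpeg", "-i", video, "-i", audio,
            "-map", "0:v:0", "-map", "1:a:0",
            "-c:v", "copy", "-c:a", "aac", "-shortest", "-y", out]


def add_second_track(video: str, audio: str, out: str) -> list[str]:
    """Keep the original audio and add the new file as a separate second track.

    Fails if the source video has no audio stream, since 0:a:0 must exist.
    """
    return ["ffmpeg", "-i", video, "-i", audio,
            "-map", "0:v:0", "-map", "0:a:0", "-map", "1:a:0",
            "-c:v", "copy", "-c:a", "aac", "-y", out]


def mix_tracks(video: str, audio: str, out: str) -> list[str]:
    """Mix the original audio and the new file into one track with amix."""
    return ["ffmpeg", "-i", video, "-i", audio,
            # duration=first: the mixed track lasts as long as input 0's audio
            "-filter_complex", "[0:a][1:a]amix=inputs=2:duration=first[aout]",
            "-map", "0:v:0", "-map", "[aout]",
            "-c:v", "copy", "-c:a", "aac", "-y", out]
```

With `duration=first`, background music that runs longer than the original audio is cut off when the original audio ends.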

Why not just run FFmpeg on your server?

You can. Plenty of teams do. But if you're building this into a product or automation, you'll hit a few walls:

FFmpeg takes 500MB+ of disk space and has platform-specific dependencies
Video processing is CPU-intensive. One mux job won't kill your server, but ten concurrent ones will
Error handling for FFmpeg processes is tedious: exit codes, stderr parsing, timeout management
File storage and cleanup become your problem

FFmpeg Micro handles all of this. You pay per minute of video processed, and the infrastructure scales automatically. For a 1-minute video mux job, you're looking at a few seconds of processing time and a fraction of a cent.
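If you do run FFmpeg yourself, the exit-code, stderr, and timeout handling from the list above can be sketched in Python like this. `run_ffmpeg` and `FFmpegError` are hypothetical names; the wrapper accepts any argument list, which also lets you test it without FFmpeg installed.

```python
import subprocess


class FFmpegError(RuntimeError):
    """Raised when an ffmpeg run fails; the message keeps the tail of stderr."""


def run_ffmpeg(cmd: list[str], timeout: float = 300.0) -> None:
    """Run cmd, turning missing binaries, timeouts, and non-zero exits into FFmpegError."""
    try:
        # On timeout, subprocess.run kills the child before raising TimeoutExpired.
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except FileNotFoundError as exc:
        raise FFmpegError(f"{cmd[0]} not found on PATH") from exc
    except subprocess.TimeoutExpired as exc:
        raise FFmpegError(f"timed out after {timeout:g}s") from exc
    if result.returncode != 0:
        # ffmpeg writes its log to stderr; the last lines usually hold the cause.
        tail = "\n".join(result.stderr.strip().splitlines()[-5:])
        raise FFmpegError(f"exit code {result.returncode}: {tail}")
```

This still leaves concurrency, storage, and cleanup to you, which is the trade-off the list above describes.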

FAQ

Can I merge audio and video files that are different lengths?

Yes. Use the -shortest option to cut the output to the duration of the shorter stream. Without it, the output runs as long as the longer stream, so viewers get a frozen last frame or silence once the shorter one ends. In the API, add {"option": "-shortest", "argument": ""} to your options array.
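To see whether -shortest will change anything, compare the two durations first. Here's a minimal Python sketch, assuming ffprobe (which ships with FFmpeg) is on your PATH. The helper names are hypothetical, and the length estimate is approximate because cuts land on frame and packet boundaries.

```python
import subprocess


def probe_duration_cmd(path: str) -> list[str]:
    """ffprobe arguments that print only the container duration, in seconds."""
    return ["ffprobe", "-v", "error", "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1:nokey=1", path]


def parse_duration(stdout: str) -> float:
    return float(stdout.strip())


def probe_duration(path: str) -> float:
    """Run ffprobe on path and return its duration; requires ffprobe on PATH."""
    out = subprocess.run(probe_duration_cmd(path), capture_output=True,
                         text=True, check=True).stdout
    return parse_duration(out)


def expected_output_seconds(video_s: float, audio_s: float, shortest: bool) -> float:
    """Approximate muxed length: the shorter stream with -shortest, the longer without."""
    return min(video_s, audio_s) if shortest else max(video_s, audio_s)
```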

What audio formats work with FFmpeg muxing?

FFmpeg reads MP3, AAC, WAV, OGG, FLAC, Opus, and M4A, along with many other formats, and FFmpeg Micro supports all of these as inputs. For the output container, MP4 works best with AAC audio. WebM only accepts Opus or Vorbis audio, so Opus is the usual choice there.
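If you pick the output container at runtime, a small lookup keeps the audio encoder in step with it. This sketch assumes your FFmpeg build includes libopus, FFmpeg's usual Opus encoder; the table and function name are hypothetical.

```python
# Audio encoder to pair with each output container, following the guidance above.
AUDIO_ENCODER = {"mp4": "aac", "webm": "libopus"}


def audio_codec_args(container: str) -> list[str]:
    """Return the -c:a arguments for a container, or raise if none is configured."""
    try:
        return ["-c:a", AUDIO_ENCODER[container.lower()]]
    except KeyError:
        raise ValueError(f"no audio encoder configured for {container!r}") from None
```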

Do I need to re-encode the video when muxing?

Not always. If you use -c:v copy instead of -c:v libx264, FFmpeg copies the video stream without re-encoding. This is faster but only works if the codec is compatible with the output container. MP4 in, MP4 out with H.264 is a safe bet for stream copying.
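You can make the copy-or-re-encode decision automatically by probing the source codec. Here's a minimal Python sketch for MP4 output only. The whitelist is deliberately limited to H.264, the safe case named above, and the helper names are hypothetical.

```python
# Deliberately conservative: only H.264 into MP4 is treated as safe to stream-copy.
COPY_SAFE_FOR_MP4 = {"h264"}


def probe_video_codec_cmd(path: str) -> list[str]:
    """ffprobe arguments that print the first video stream's codec name (e.g. h264)."""
    return ["ffprobe", "-v", "error", "-select_streams", "v:0",
            "-show_entries", "stream=codec_name",
            "-of", "default=noprint_wrappers=1:nokey=1", path]


def video_codec_args_for_mp4(source_codec: str) -> list[str]:
    """Stream-copy when it is safe for MP4, otherwise re-encode with libx264."""
    if source_codec.strip().lower() in COPY_SAFE_FOR_MP4:
        return ["-c:v", "copy"]    # remux only: fast, no generation loss
    return ["-c:v", "libx264"]     # re-encode to H.264
```

Run the probe command, pass its stdout to `video_codec_args_for_mp4`, and splice the result into your mux command in place of `-c:v libx264`.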

How many inputs can I combine in one API call?

FFmpeg Micro accepts up to 10 inputs per transcode job. You can mux multiple audio tracks, combine video segments, or mix and match. Each input gets a zero-based index in the order you list it (0, 1, 2, ...), and that index is what -map references.

Is there a file size limit?

FFmpeg Micro supports files up to 500MB on the free tier. Paid plans handle larger files. Check the pricing page for current limits.

Sign up for a free FFmpeg Micro account and try muxing your first audio and video file. The free tier includes 10 minutes of processing per month, which is plenty for testing.
