Ooi Yee Fei

From Video to Voiceover in Seconds: Running MLX Swift on ARM-Based iOS Devices

I've been exploring ARM-based AI applications - how generative AI and machine learning models can run locally on ARM devices. The friction of video editing sparked the idea: how could I cut down the time it takes to polish demo videos?

That led to ScriptCraft - an iOS app that transcribes video, cleans up the script with an on-device LLM, generates new narration, and exports the final video. No cloud APIs. No uploads. Just your phone doing the work.


The Idea

I wanted to repurpose video content quickly: record something, get an AI-polished transcript, hear it narrated professionally, and export.

How it works:

  1. Import a video
  2. Transcribe the audio with on-device speech recognition
  3. Enhance the transcript with a local LLM
  4. Generate narration via TTS
  5. Replace the original audio and export

The result is a refined video, produced end to end on the device and ready to share.


MLX Swift: The Promise and The Reality

Apple's MLX framework lets you run ML models on device. MLX Swift brings this to iOS. I wanted to use it for the transcript enhancement step - clean up filler words, fix grammar, make it more readable.

The model: Qwen2.5 0.5B Instruct, 4-bit quantized. Small enough for mobile, supposedly capable enough for basic text tasks.

Setting it up looked straightforward:

let modelConfiguration = ModelConfiguration.configuration(id: "mlx-community/Qwen2.5-0.5B-Instruct-4bit")
let model = try await LLMModelFactory.shared.loadContainer(configuration: modelConfiguration)
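Generating text with the loaded container follows the pattern from the mlx-swift-examples LLMEval sample. The exact types and signatures (UserInput, GenerateParameters, the didGenerate callback) have shifted between releases, so treat this as a rough sketch rather than the definitive API:

// Rough sketch based on the mlx-swift-examples LLMEval sample; names may
// differ in newer releases of MLXLLM / MLXLMCommon.
let output = try await model.perform { context in
    let input = try await context.processor.prepare(input: UserInput(prompt: prompt))
    let result = try MLXLMCommon.generate(
        input: input,
        parameters: GenerateParameters(temperature: 0.3),
        context: context
    ) { _ in .more }   // return .stop to end generation early
    return result.output
}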

Setting Up Physical Device Testing

MLX requires a Metal GPU for inference, which means testing on a physical device. I used an iPhone 13 mini, which supports Metal.

Device setup:

  1. Enable Developer Mode: Settings > Privacy & Security > Developer Mode. Restart and confirm.
  2. Match Xcode version to iOS version to avoid build errors.
  3. For local servers (the TTS server, in my case), point the app at your Mac's network IP (e.g., http://10.0.0.100:5055) rather than localhost, and start the server bound to all interfaces with --host 0.0.0.0. A client sketch follows below.
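Here is a hypothetical client for the Mac-hosted TTS server - the endpoint path and JSON body are placeholders; only the host/port setup above is from the actual project:

import Foundation

// Hypothetical client for the Mac-hosted TTS server; "synthesize" and the
// JSON shape are placeholders, the base URL is the Mac's LAN IP from above.
struct TTSClient {
    let baseURL = URL(string: "http://10.0.0.100:5055")!   // not localhost - the phone needs the Mac's IP

    func synthesize(_ text: String) async throws -> Data {
        var request = URLRequest(url: baseURL.appendingPathComponent("synthesize"))
        request.httpMethod = "POST"
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        request.httpBody = try JSONEncoder().encode(["text": text])
        let (data, _) = try await URLSession.shared.data(for: request)
        return data   // raw audio bytes, written to disk before playback/composition
    }
}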

The Transcription Challenge

iOS has a built-in speech recognizer, so I started with that:

let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
let request = SFSpeechURLRecognitionRequest(url: audioURL)
request.requiresOnDeviceRecognition = true

// SFSpeechRecognizer only ships the callback-based recognitionTask(with:resultHandler:);
// this async call is a thin wrapper around it (see the transcribeWithMode sketch further down)
let result = try await recognizer.recognitionTask(with: request)

Worked great for short clips. For a 3-minute video? It returned maybe 30 seconds of text.

The issue: SFSpeechRecognizer has limits. Apple doesn't document them clearly, but around 1 minute of audio seems to be the practical ceiling for a single request.

My fix: chunk the audio.

private let chunkDuration: TimeInterval = 30.0

func transcribe(asset: AVAsset) async throws -> String {
    // asset.duration is deprecated on modern iOS; load it asynchronously instead
    let duration = CMTimeGetSeconds(try await asset.load(.duration))
    var chunks: [String] = []

    for startTime in stride(from: 0, to: duration, by: chunkDuration) {
        let chunkResult = try await transcribeChunk(asset: asset, from: startTime)
        chunks.append(chunkResult)
    }

    return chunks.joined(separator: " ")
}
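For reference, transcribeChunk is essentially "export a 30-second audio window, then transcribe the file." A sketch, assuming the transcribeWithMode helper sketched a bit further down; the temp-file naming and error type are illustrative:

private func transcribeChunk(asset: AVAsset, from startTime: TimeInterval) async throws -> String {
    // Export just this window of audio to a temporary .m4a file
    let chunkURL = FileManager.default.temporaryDirectory
        .appendingPathComponent("chunk-\(Int(startTime)).m4a")
    try? FileManager.default.removeItem(at: chunkURL)

    guard let export = AVAssetExportSession(asset: asset, presetName: AVAssetExportPresetAppleM4A) else {
        throw TranscriptionError.exportFailed   // hypothetical error type
    }
    export.outputURL = chunkURL
    export.outputFileType = .m4a
    // The last window may need clamping to the asset's actual duration
    export.timeRange = CMTimeRange(
        start: CMTime(seconds: startTime, preferredTimescale: 600),
        duration: CMTime(seconds: chunkDuration, preferredTimescale: 600)
    )
    await export.export()

    // Hand the chunk to the recognizer (on-device first; see the fallback below)
    return try await transcribeWithMode(url: chunkURL, onDevice: true)
}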

Better, but still inconsistent. Some chunks came back empty. The video had mixed audio sources (voice + background music from editing). The on-device recognizer struggles with that.

Added a fallback:

// Try on-device first
let onDeviceResult = try await transcribeWithMode(url: url, onDevice: true)
if !onDeviceResult.isEmpty {
    return onDeviceResult
}

// Fallback to server-based (uses Apple's servers)
return try await transcribeWithMode(url: url, onDevice: false)
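The transcribeWithMode helper bridges SFSpeechRecognizer's callback API into async/await. A minimal sketch (production code should guard against the continuation resuming twice):

private func transcribeWithMode(url: URL, onDevice: Bool) async throws -> String {
    guard let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US")),
          recognizer.isAvailable else { return "" }

    let request = SFSpeechURLRecognitionRequest(url: url)
    request.requiresOnDeviceRecognition = onDevice

    // Bridge the callback-based recognitionTask into async/await
    return try await withCheckedThrowingContinuation { continuation in
        recognizer.recognitionTask(with: request) { result, error in
            if let result, result.isFinal {
                continuation.resume(returning: result.bestTranscription.formattedString)
            } else if let error {
                continuation.resume(throwing: error)
            }
        }
    }
}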

What I learned: On-device speech recognition is good for clean audio, short clips. For real-world content with mixed sources, you need fallbacks.


The Hallucination Problem

Got transcription working. Fed it to Qwen 0.5B. The output?

"Sure, here's a rewritten version of your transcript with improved flow and engagement:

[Completely fabricated content about topics never mentioned in the original]"

The model hallucinated. My original prompt asked it to "enhance and improve" the script. The 0.5B model interpreted that as "make stuff up."

The fix was embarrassingly simple: ask for less.

private func buildPrompt(transcript: String) -> String {
    """
    Clean up this transcript. ONLY fix grammar and remove filler words.
    DO NOT add any new information or content.

    IMPORTANT: Output the same content, just cleaned. Do not invent or add anything.

    Original: \(transcript)

    Cleaned:
    """
}

No more "enhance." No more "improve." Just "clean up." The model stopped inventing.

Think of it like this: A 0.5B model is like a meticulous proofreader - excellent at catching typos and cleaning up grammar, but ask them to ghostwrite your memoir and they'll start making up your childhood. Keep the job description tight.

What I learned: Small LLMs need small tasks. The 0.5B model can clean text. It cannot creatively rewrite. Know your model's limits and prompt accordingly.


Audio Playback: One More Gotcha

Generated the narration. Called AVAudioPlayer.play(). Silence.

Turns out iOS needs the audio session configured before it will actually play anything:

let audioSession = AVAudioSession.sharedInstance()
try audioSession.setCategory(.playback, mode: .default)
try audioSession.setActive(true)

Without this, audio plays in the simulator but not on the device unless headphones are connected - the default session category respects the silent switch. Another simulator-vs-device difference.
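For completeness, the full playback sequence looks like this (narrationURL is whatever file the TTS step produced):

import AVFoundation

// Configure the session first, or playback can fail silently on a device
let session = AVAudioSession.sharedInstance()
try session.setCategory(.playback, mode: .default)
try session.setActive(true)

let player = try AVAudioPlayer(contentsOf: narrationURL)
player.prepareToPlay()
player.play()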


The Final Pipeline

Video Import
    |
[Extract Audio] --> AVAssetExportSession
    |
[Chunk Audio] --> 30-second segments
    |
[Transcribe] --> SFSpeechRecognizer (on-device + server fallback)
    |
[Enhance] --> MLX Swift + Qwen 0.5B (cleanup only)
    |
[Generate Speech] --> Kokoro TTS via mlx-audio
    |
[Compose Video] --> AVMutableComposition (original video + new audio)
    |
Export

Each step has its own service class. Each handles its own errors. The view coordinates them.
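The compose step, for example, boils down to AVMutableComposition plus an export session. A sketch, assuming the video and narration URLs come from earlier steps:

import AVFoundation

// Sketch of the compose step: original video track + generated narration track.
func composeVideo(videoURL: URL, narrationURL: URL, outputURL: URL) async throws {
    let videoAsset = AVURLAsset(url: videoURL)
    let audioAsset = AVURLAsset(url: narrationURL)

    let composition = AVMutableComposition()
    guard let videoTrack = try await videoAsset.loadTracks(withMediaType: .video).first,
          let audioTrack = try await audioAsset.loadTracks(withMediaType: .audio).first,
          let compVideo = composition.addMutableTrack(withMediaType: .video, preferredTrackID: kCMPersistentTrackID_Invalid),
          let compAudio = composition.addMutableTrack(withMediaType: .audio, preferredTrackID: kCMPersistentTrackID_Invalid)
    else { throw NSError(domain: "Compose", code: 1) }

    // Lay the original video on the timeline, then the new narration alongside it
    let videoDuration = try await videoAsset.load(.duration)
    let audioDuration = try await audioAsset.load(.duration)
    try compVideo.insertTimeRange(CMTimeRange(start: .zero, duration: videoDuration), of: videoTrack, at: .zero)
    try compAudio.insertTimeRange(CMTimeRange(start: .zero, duration: audioDuration), of: audioTrack, at: .zero)

    guard let export = AVAssetExportSession(asset: composition, presetName: AVAssetExportPresetHighestQuality) else {
        throw NSError(domain: "Compose", code: 2)
    }
    export.outputURL = outputURL
    export.outputFileType = .mp4
    await export.export()
}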


What Surprised Me

MLX Swift works. Once you're on a real device with a Metal GPU, inference is fast. The 0.5B model runs in under a second for short texts.

iOS has a lot of ML built in. SFSpeechRecognizer, Vision framework, Natural Language framework. You can build surprisingly capable apps without any external models.

MLX needs Metal. Set up physical device testing early when working with on-device ML.


What I'd Do Differently

  1. Explore ARM-native frameworks first. iOS has powerful built-in ML capabilities - SFSpeechRecognizer, Vision, Natural Language, Core ML. Understand what's already optimized for ARM before adding external models.

  2. Simpler prompts first. Start with "clean this text" and only add complexity if the model handles it.

  3. Test with real content. My test videos were clean screen recordings. Real videos have background noise, music, multiple speakers. Test with messy content early.


The Takeaway

On-device ML is powerful but unforgiving. The APIs exist. The models exist. But the gap between "works in simulator" and "works on device" is larger than I expected.

The reward: an app that processes video entirely locally. No cloud uploads. No API costs. No latency.

Worth the debugging sessions? Absolutely.


Tech Stack

Platform:      iOS 18+ (requires Metal GPU)
Language:      Swift
ML Framework:  MLX Swift
LLM:           Qwen 0.5B 4-bit (mlx-community)
Speech:        SFSpeechRecognizer (Apple)
TTS:           Kokoro via mlx-audio (server)
Video:         AVFoundation
