Harshit Diyora

MediaPipe on iOS: what the docs leave out

The MediaPipe Tasks documentation covers installation, initialization, and a working demo. It does not cover what happens when you try to ship a production app. Six months of building CareSpace AI — a real-time physiotherapy app that runs pose detection at 30fps on every frame — filled in those gaps.

Here's what the docs leave out.


AVCaptureSession setup that actually works

The docs show a generic camera setup. This is the production-tuned configuration that keeps latency low and prevents dropped frames:

import AVFoundation

enum CameraError: Error {
    case deviceUnavailable
    case configurationFailed
}

class CameraManager: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    private let captureSession = AVCaptureSession()
    private let videoOutput = AVCaptureVideoDataOutput()
    private let processingQueue = DispatchQueue(label: "com.app.camera", qos: .userInitiated)

    func configure() throws {
        captureSession.beginConfiguration()
        // Always pair beginConfiguration with commitConfiguration, even on throwing paths
        defer { captureSession.commitConfiguration() }

        // VGA is the MediaPipe sweet spot — higher resolution adds latency without accuracy gains
        captureSession.sessionPreset = .vga640x480

        guard let device = AVCaptureDevice.default(.builtInWideAngleCamera, for: .video, position: .front),
              let input = try? AVCaptureDeviceInput(device: device) else {
            throw CameraError.deviceUnavailable
        }
        guard captureSession.canAddInput(input) else { throw CameraError.configurationFailed }
        captureSession.addInput(input)

        // Lock frame rate — variable frame rate causes timestamp jitter in MediaPipe
        try device.lockForConfiguration()
        device.activeVideoMinFrameDuration = CMTime(value: 1, timescale: 30)
        device.activeVideoMaxFrameDuration = CMTime(value: 1, timescale: 30)
        device.unlockForConfiguration()

        // BGRA is what MediaPipe expects — do not use MJPEG or YUV without conversion
        videoOutput.videoSettings = [
            kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_32BGRA
        ]
        videoOutput.alwaysDiscardsLateVideoFrames = true
        videoOutput.setSampleBufferDelegate(self, queue: processingQueue)
        guard captureSession.canAddOutput(videoOutput) else { throw CameraError.configurationFailed }
        captureSession.addOutput(videoOutput)
    }
}

Two things that aren't obvious: frame rate locking and pixel format. Without frame rate locking, AVFoundation's automatic frame rate control kicks in when the CPU is under load — and sends MediaPipe frames with inconsistent timestamps, which breaks temporal smoothing. Without kCVPixelFormatType_32BGRA, you're doing an implicit conversion on every frame.
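
For context, this is roughly how those BGRA frames get from the sample buffer delegate into MediaPipe. A minimal sketch: the FrameForwarder name and the poseLandmarker property are placeholders rather than part of the capture code above; the two fixed requirements are wrapping the buffer in an MPImage and passing monotonically increasing millisecond timestamps in live-stream mode.

import AVFoundation
import MediaPipeTasksVision

final class FrameForwarder: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    // Placeholder: a PoseLandmarker created with runningMode = .liveStream
    var poseLandmarker: PoseLandmarker?

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        // Wrap the BGRA pixel buffer without an extra copy
        guard let image = try? MPImage(sampleBuffer: sampleBuffer) else { return }
        // Live-stream mode expects monotonically increasing millisecond timestamps
        let timestampMs = Int(CMSampleBufferGetPresentationTimeStamp(sampleBuffer).seconds * 1000)
        try? poseLandmarker?.detectAsync(image: image, timestampInMilliseconds: timestampMs)
    }
}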


GPU vs CPU delegate — it's not what you think

The docs say: "Use GPU delegate for better performance." That's true for benchmarks. In production, it's more complicated.

// GPU delegate — fast, but causes thermal issues in long sessions
let gpuOptions = PoseLandmarkerOptions()
gpuOptions.baseOptions.delegate = .GPU

// CPU delegate — slower per-frame, but thermally stable
let cpuOptions = PoseLandmarkerOptions()
cpuOptions.baseOptions.delegate = .CPU

The GPU delegate runs 20–30% faster per frame. But on an iPhone 12, after about 8 minutes of continuous use with the GPU delegate, the device throttles and frame rate drops from 30fps to 18fps. With the CPU delegate, the same session runs at a stable 28fps for 20+ minutes.
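
You don't have to wait for the frame rate to tell you this is happening; the OS reports it. A minimal sketch (the ThermalObserver name and callback are placeholders) that watches ProcessInfo's thermal state:

import Foundation

final class ThermalObserver {
    private var token: NSObjectProtocol?

    func start(onHighThermalPressure: @escaping () -> Void) {
        token = NotificationCenter.default.addObserver(
            forName: ProcessInfo.thermalStateDidChangeNotification,
            object: nil,
            queue: .main
        ) { _ in
            // .serious and .critical are the states where the OS throttles aggressively
            let state = ProcessInfo.processInfo.thermalState
            if state == .serious || state == .critical {
                onHighThermalPressure()
            }
        }
    }

    deinit {
        if let token {
            NotificationCenter.default.removeObserver(token)
        }
    }
}

The delegate is fixed when the PoseLandmarker is created, so reacting mid-session means recreating the landmarker with CPU options (or simply processing fewer frames) rather than flipping a flag.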

For a physiotherapy app where sessions can run 15–30 minutes, we use:

  • CPU delegate by default
  • GPU delegate only if the device is iPhone 14 Pro or newer (better thermal headroom)
// `modelIdentifier` is not a built-in UIDevice property; assume a small utsname()-based extension.
// Lexicographic comparison of identifier strings is unreliable, so check an explicit allow-list.
let gpuCapableModels: Set<String> = [
    "iPhone15,2", "iPhone15,3",  // iPhone 14 Pro / Pro Max
    "iPhone16,1", "iPhone16,2"   // iPhone 15 Pro / Pro Max
]

let delegate: Delegate = {
    if ProcessInfo.processInfo.thermalState == .nominal,
       gpuCapableModels.contains(UIDevice.current.modelIdentifier) {
        return .GPU
    }
    return .CPU
}()

Coordinate space mapping — the front camera trap

MediaPipe returns landmark coordinates in normalized space (0,0) to (1,1), relative to the camera frame. This seems straightforward until you try to overlay landmarks on a live preview with a front-facing camera.

Two transforms are required:

func convertToViewCoordinates(
    landmark: NormalizedLandmark,
    in viewBounds: CGRect,
    isFrontCamera: Bool
) -> CGPoint {
    var x = CGFloat(landmark.x)
    var y = CGFloat(landmark.y)

    // Front camera: mirror X coordinate
    // MediaPipe does NOT mirror automatically for the front camera
    if isFrontCamera {
        x = 1.0 - x
    }

    // Portrait mode: swap X and Y, then apply bounds
    // Camera captures in landscape; device is in portrait
    let rotatedX = y
    let rotatedY = 1.0 - x

    return CGPoint(
        x: rotatedX * viewBounds.width,
        y: rotatedY * viewBounds.height
    )
}

If you don't mirror the X coordinate for the front camera, the skeleton is laterally flipped. If you don't handle the portrait/landscape rotation, the skeleton is transposed to the wrong axis. Both problems are invisible in unit tests because they require a rendered UI to see.
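
The fastest way to surface them is to render the converted points as soon as the pipeline produces landmarks. A minimal overlay sketch; the SkeletonOverlayView name and the dot rendering are illustrative, and the only real dependency is the convertToViewCoordinates function above.

import UIKit
import MediaPipeTasksVision

final class SkeletonOverlayView: UIView {
    var isFrontCamera = true
    private let dotLayer = CAShapeLayer()

    override init(frame: CGRect) {
        super.init(frame: frame)
        isOpaque = false
        dotLayer.fillColor = UIColor.systemGreen.cgColor
        layer.addSublayer(dotLayer)
    }

    required init?(coder: NSCoder) { fatalError("init(coder:) has not been implemented") }

    // Call on the main thread with the landmarks from the latest result
    func render(_ landmarks: [NormalizedLandmark]) {
        let path = UIBezierPath()
        for landmark in landmarks {
            let point = convertToViewCoordinates(landmark: landmark,
                                                 in: bounds,
                                                 isFrontCamera: isFrontCamera)
            path.append(UIBezierPath(arcCenter: point, radius: 4,
                                     startAngle: 0, endAngle: .pi * 2, clockwise: true))
        }
        dotLayer.path = path.cgPath
    }
}

Point the front camera at yourself and raise your right arm: the dots should rise on the same side of the screen as the arm in the preview, which immediately exposes a missing mirror or axis swap.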


The retain cycle that leaks your session

MediaPipe's live stream detection mode delivers results through a delegate (poseLandmarkerLiveStreamDelegate) rather than a completion closure. If the landmarker stores that delegate strongly and the delegate owns the landmarker, you create a retain cycle that leaks the entire detection session.

import MediaPipeTasksVision

// WRONG — retain cycle
class PoseDetectionService: NSObject, PoseLandmarkerLiveStreamDelegate {
    var landmarker: PoseLandmarker?

    func setup() {
        let options = PoseLandmarkerOptions()
        options.runningMode = .liveStream
        // This captures self strongly inside MediaPipe's internal storage
        options.poseLandmarkerLiveStreamDelegate = self
        landmarker = try? PoseLandmarker(options: options)
    }

    func poseLandmarker(_ poseLandmarker: PoseLandmarker,
                        didFinishDetection result: PoseLandmarkerResult?,
                        timestampInMilliseconds: Int,
                        error: Error?) {
        // Handle the detection result here
    }

    deinit {
        // Never called — self is retained by landmarker's delegate reference
        print("PoseDetectionService deallocated")
    }
}

// CORRECT — use a wrapper to break the cycle
class WeakDelegateWrapper: NSObject, PoseLandmarkerLiveStreamDelegate {
    weak var delegate: PoseLandmarkerLiveStreamDelegate?

    func poseLandmarker(
        _ poseLandmarker: PoseLandmarker,
        didFinishDetection result: PoseLandmarkerResult?,
        timestampInMilliseconds: Int,
        error: Error?
    ) {
        // Forward to the real delegate without retaining it
        delegate?.poseLandmarker(poseLandmarker, didFinishDetection: result,
            timestampInMilliseconds: timestampInMilliseconds, error: error)
    }
}

// In setup (keep a strong reference to the wrapper for the lifetime of the landmarker):
let wrapper = WeakDelegateWrapper()
wrapper.delegate = self
options.poseLandmarkerLiveStreamDelegate = wrapper

This pattern is standard for any delegate-based API where you don't control the delegate storage. The docs don't mention it because the retention behaviour depends on how MediaPipe stores the delegate internally.


Model selection: Lite vs Full vs Heavy

Three models, three tradeoffs:

Model   Latency (iPhone 12, CPU)   Landmark accuracy   Visibility score quality
Lite    ~18ms                      Good                Low
Full    ~35ms                      Better              Medium
Heavy   ~55ms                      Best                High

For production:

  • Lite: Finger counting demos, casual fitness apps, situations where you only need rough body position
  • Full: Most physiotherapy and fitness use cases — good accuracy at 30fps on iPhone 12+
  • Heavy: Situations where visibility scores drive logic (like the gating in our noise-handling pipeline) — the scores are significantly more reliable

We ship Full by default and let users switch to Heavy in settings if they want higher accuracy at the cost of ~5fps.
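
Wiring the choice up is mostly a matter of pointing baseOptions at the right bundled .task file. A minimal sketch, assuming the model files ship in the app bundle under the names below (adjust to whatever you actually bundle):

import MediaPipeTasksVision

enum PoseModel: String {
    case full = "pose_landmarker_full"
    case heavy = "pose_landmarker_heavy"
}

func makeLandmarkerOptions(model: PoseModel) -> PoseLandmarkerOptions {
    let options = PoseLandmarkerOptions()
    options.runningMode = .liveStream
    // Assumes the .task files are bundled as plain app resources
    options.baseOptions.modelAssetPath = Bundle.main.path(forResource: model.rawValue,
                                                          ofType: "task") ?? ""
    return options
}

The live-stream delegate from the earlier section still has to be attached to these options before creating the PoseLandmarker.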


Testing in the simulator

The camera half of the pipeline doesn't exist in the simulator, and the GPU delegate isn't available there either. Inject a video file instead:

import AVFoundation
import UIKit

class SimulatorVideoSource: NSObject {
    private var displayLink: CADisplayLink?
    private var videoReader: AVAssetReader?
    private var trackOutput: AVAssetReaderTrackOutput?
    weak var delegate: AVCaptureVideoDataOutputSampleBufferDelegate?
    let queue = DispatchQueue(label: "simulator.video")

    // AVCaptureOutput is abstract and AVCaptureConnection has no parameterless initializer,
    // so build placeholder instances once and reuse them when invoking the delegate.
    private let placeholderOutput = AVCaptureVideoDataOutput()
    private lazy var placeholderConnection = AVCaptureConnection(inputPorts: [], output: placeholderOutput)

    func startPlayback(url: URL) {
        let asset = AVAsset(url: url)
        guard let track = asset.tracks(withMediaType: .video).first,
              let reader = try? AVAssetReader(asset: asset) else { return }

        let output = AVAssetReaderTrackOutput(
            track: track,
            outputSettings: [kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_32BGRA]
        )
        reader.add(output)
        reader.startReading()

        videoReader = reader
        trackOutput = output

        displayLink = CADisplayLink(target: self, selector: #selector(sendNextFrame))
        displayLink?.preferredFrameRateRange = CAFrameRateRange(minimum: 30, maximum: 30, preferred: 30)
        displayLink?.add(to: .main, forMode: .default)
    }

    @objc private func sendNextFrame() {
        guard let sampleBuffer = trackOutput?.copyNextSampleBuffer() else {
            displayLink?.invalidate()
            return
        }
        // Deliver off the main thread, mirroring the real capture pipeline
        queue.async { [weak self] in
            guard let self else { return }
            self.delegate?.captureOutput?(self.placeholderOutput,
                                          didOutput: sampleBuffer,
                                          from: self.placeholderConnection)
        }
    }
}

Add a test video (recorded from a device) to your asset catalogue. In #if targetEnvironment(simulator) blocks, use SimulatorVideoSource instead of AVCaptureSession. Your whole pipeline — MediaPipe, landmark smoothing, state machine — runs end-to-end in the simulator against a known video.
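
Concretely, the switch can sit behind a single factory. A minimal sketch, assuming the CameraManager and SimulatorVideoSource classes shown earlier and a bundled clip whose name here is made up:

import AVFoundation

// `receiver` is whatever object implements AVCaptureVideoDataOutputSampleBufferDelegate
// and feeds frames into MediaPipe; the caller must keep the returned source alive.
func makeFrameSource(receiver: AVCaptureVideoDataOutputSampleBufferDelegate) -> AnyObject {
    #if targetEnvironment(simulator)
    let source = SimulatorVideoSource()
    source.delegate = receiver
    // The clip name is hypothetical; use any video recorded on a device
    if let url = Bundle.main.url(forResource: "exercise_sample", withExtension: "mp4") {
        source.startPlayback(url: url)
    }
    return source
    #else
    // CameraManager sets itself as the sample buffer delegate in configure(),
    // so `receiver` is only used on the simulator path.
    let manager = CameraManager()
    try? manager.configure()
    return manager
    #endif
}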
