The MediaPipe Tasks documentation covers installation, initialization, and a working demo. It does not cover what happens when you try to ship a production app. Six months of building CareSpace AI — a real-time physiotherapy app that runs pose detection at 30fps on every frame — filled in those gaps.
Here's what the docs leave out.
AVCaptureSession setup that actually works
The docs show a generic camera setup. This is the production-tuned configuration that keeps latency low and prevents dropped frames:
import AVFoundation

enum CameraError: Error {
    case deviceUnavailable
}

class CameraManager: NSObject {
    private let captureSession = AVCaptureSession()
    private let videoOutput = AVCaptureVideoDataOutput()
    private let processingQueue = DispatchQueue(label: "com.app.camera", qos: .userInitiated)

    func configure() throws {
        captureSession.beginConfiguration()

        // VGA is the MediaPipe sweet spot — higher resolution adds latency without accuracy gains
        captureSession.sessionPreset = .vga640x480

        guard let device = AVCaptureDevice.default(.builtInWideAngleCamera, for: .video, position: .front),
              let input = try? AVCaptureDeviceInput(device: device) else {
            throw CameraError.deviceUnavailable
        }
        captureSession.addInput(input)

        // Lock frame rate — variable frame rate causes timestamp jitter in MediaPipe
        try device.lockForConfiguration()
        device.activeVideoMinFrameDuration = CMTime(value: 1, timescale: 30)
        device.activeVideoMaxFrameDuration = CMTime(value: 1, timescale: 30)
        device.unlockForConfiguration()

        // BGRA is what MediaPipe expects — do not use MJPEG or YUV without conversion
        videoOutput.videoSettings = [
            kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_32BGRA
        ]
        videoOutput.alwaysDiscardsLateVideoFrames = true
        videoOutput.setSampleBufferDelegate(self, queue: processingQueue)
        captureSession.addOutput(videoOutput)

        captureSession.commitConfiguration()
    }
}
Two things that aren't obvious: frame rate locking and pixel format. Without frame rate locking, AVFoundation's automatic frame rate control kicks in when the CPU is under load — and sends MediaPipe frames with inconsistent timestamps, which breaks temporal smoothing. Without kCVPixelFormatType_32BGRA, you're doing an implicit conversion on every frame.
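The config above also assumes CameraManager provides the sample buffer delegate conformance that setSampleBufferDelegate references. A minimal sketch of that delegate side, assuming a poseLandmarker property already created in live stream mode:

extension CameraManager: AVCaptureVideoDataOutputSampleBufferDelegate {
    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        // Live stream mode requires monotonically increasing millisecond
        // timestamps, so derive them from the buffer's presentation time
        let pts = CMSampleBufferGetPresentationTimeStamp(sampleBuffer)
        let timestampMs = Int(CMTimeGetSeconds(pts) * 1000)
        guard let image = try? MPImage(sampleBuffer: sampleBuffer) else { return }
        // poseLandmarker is an assumed property, configured with .liveStream running mode
        try? poseLandmarker?.detectAsync(image: image, timestampInMilliseconds: timestampMs)
    }
}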
GPU vs CPU delegate — it's not what you think
The docs say: "Use GPU delegate for better performance." That's true for benchmarks. In production, it's more complicated.
import MediaPipeTasksVision

// GPU delegate — fast, but causes thermal issues in long sessions
let gpuOptions = PoseLandmarkerOptions()
gpuOptions.baseOptions.delegate = .GPU

// CPU delegate — slower per-frame, but thermally stable
let cpuOptions = PoseLandmarkerOptions()
cpuOptions.baseOptions.delegate = .CPU
The GPU delegate runs 20–30% faster per frame. But on an iPhone 12, after about 8 minutes of continuous use with the GPU delegate, the device throttles and frame rate drops from 30fps to 18fps. With the CPU delegate, the same session runs at a stable 28fps for 20+ minutes.
For a physiotherapy app where sessions can run 15–30 minutes, we use:
- CPU delegate by default
- GPU delegate only if the device is iPhone 14 Pro or newer (better thermal headroom)
// `modelIdentifier` is assumed to be a custom UIDevice extension (e.g. reading
// sysctl "hw.machine"); it is not a UIKit API. Comparing the raw identifier
// strings lexicographically ("iPhone9,1" >= "iPhone14,2") gives wrong answers,
// and "iPhone14,2" is actually the iPhone 13 Pro. The iPhone 14 Pro is
// "iPhone15,2", so parse the major version and use 15 as the cutoff.
let delegate: Delegate = {
    let identifier = UIDevice.current.modelIdentifier
    let major = Int(identifier.trimmingCharacters(in: .letters)
        .split(separator: ",").first ?? "0") ?? 0
    if ProcessInfo.processInfo.thermalState == .nominal && major >= 15 {
        return .GPU
    }
    return .CPU
}()
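That check only picks the initial delegate. A sketch of downgrading mid-session as the device heats up, using Foundation's thermal state notification; rebuildLandmarker(with:) is a hypothetical helper that tears down and recreates the landmarker:

// Fall back to the CPU delegate once the device reports serious throttling
NotificationCenter.default.addObserver(
    forName: ProcessInfo.thermalStateDidChangeNotification,
    object: nil,
    queue: .main
) { _ in
    switch ProcessInfo.processInfo.thermalState {
    case .serious, .critical:
        rebuildLandmarker(with: .CPU)   // hypothetical helper
    default:
        break
    }
}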
Coordinate space mapping — the front camera trap
MediaPipe returns landmark coordinates in normalized space (0,0) to (1,1), relative to the camera frame. This seems straightforward until you try to overlay landmarks on a live preview with a front-facing camera.
Two transforms are required:
func convertToViewCoordinates(
    landmark: NormalizedLandmark,
    in viewBounds: CGRect,
    isFrontCamera: Bool
) -> CGPoint {
    var x = CGFloat(landmark.x)
    let y = CGFloat(landmark.y)

    // Front camera: mirror X coordinate
    // MediaPipe does NOT mirror automatically for the front camera
    if isFrontCamera {
        x = 1.0 - x
    }

    // Portrait mode: swap X and Y, then apply bounds
    // Camera captures in landscape; device is in portrait
    let rotatedX = y
    let rotatedY = 1.0 - x

    return CGPoint(
        x: rotatedX * viewBounds.width,
        y: rotatedY * viewBounds.height
    )
}
If you don't mirror the X coordinate for the front camera, the skeleton is laterally flipped. If you don't handle the portrait/landscape rotation, the skeleton is transposed to the wrong axis. Both problems are invisible in unit tests because they require a rendered UI to see.
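To make those failure modes visible on screen, here's a minimal sketch of rendering the converted points into a CAShapeLayer over the preview; the function name and 4pt dot radius are illustrative:

func drawLandmarks(_ landmarks: [NormalizedLandmark],
                   on layer: CAShapeLayer,
                   in bounds: CGRect) {
    // Must run on the main thread; MediaPipe delivers results on a background queue
    let path = UIBezierPath()
    for landmark in landmarks {
        let point = convertToViewCoordinates(
            landmark: landmark, in: bounds, isFrontCamera: true)
        path.move(to: point)
        path.addArc(withCenter: point, radius: 4,
                    startAngle: 0, endAngle: .pi * 2, clockwise: true)
    }
    layer.path = path.cgPath
}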
The retain cycle that leaks your session
MediaPipe's live stream detection mode reports results through a delegate (poseLandmarkerLiveStreamDelegate). If you assign self as that delegate and MediaPipe stores it strongly, you create a retain cycle that leaks the entire detection session.
// WRONG — retain cycle
class PoseDetectionService: NSObject, PoseLandmarkerLiveStreamDelegate {
    var landmarker: PoseLandmarker?

    func setup() {
        let options = PoseLandmarkerOptions()
        // This stores self strongly inside MediaPipe's internal storage
        options.poseLandmarkerLiveStreamDelegate = self
        landmarker = try? PoseLandmarker(options: options)
    }

    deinit {
        // Never called — self is retained by landmarker's delegate reference
        print("PoseDetectionService deallocated")
    }

    // ... required delegate method omitted for brevity
}
// CORRECT — use a wrapper to break the cycle
class WeakDelegateWrapper: NSObject, PoseLandmarkerLiveStreamDelegate {
    weak var delegate: PoseLandmarkerLiveStreamDelegate?

    func poseLandmarker(
        _ poseLandmarker: PoseLandmarker,
        didFinishDetection result: PoseLandmarkerResult?,
        timestampInMilliseconds: Int,
        error: Error?
    ) {
        delegate?.poseLandmarker(poseLandmarker, didFinishDetection: result,
                                 timestampInMilliseconds: timestampInMilliseconds, error: error)
    }
}
// In setup. Keep a strong reference to the wrapper (a stored property,
// not a local), otherwise it deallocates and the delegate chain goes dead:
self.delegateWrapper = WeakDelegateWrapper()
delegateWrapper.delegate = self
options.poseLandmarkerLiveStreamDelegate = delegateWrapper
This pattern is standard for any delegate-based API where you don't control the delegate storage. The docs don't mention it because the retention behaviour depends on how MediaPipe stores the delegate internally.
Model selection: Lite vs Full vs Heavy
Three models, three tradeoffs:
| Model | Latency (iPhone 12, CPU) | Landmark accuracy | Visibility score quality |
|---|---|---|---|
| Lite | ~18ms | Good | Low |
| Full | ~35ms | Better | Medium |
| Heavy | ~55ms | Best | High |
For production:
- Lite: Finger counting demos, casual fitness apps, situations where you only need rough body position
- Full: Most physiotherapy and fitness use cases — good accuracy at 30fps on iPhone 12+
- Heavy: Situations where visibility scores drive logic (like the gating in our noise-handling pipeline) — the scores are significantly more reliable
We ship Full by default and let users switch to Heavy in settings if they want higher accuracy at the cost of ~5fps.
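For reference, a minimal sketch of loading a variant by name. The .task file names follow Google's published pose landmarker bundles; shipping them in the app bundle rather than downloading them is an assumption here:

func makeLandmarker(variant: String) throws -> PoseLandmarker {
    // variant is one of "pose_landmarker_lite", "pose_landmarker_full",
    // "pose_landmarker_heavy" (the .task bundles from Google's model page)
    let options = PoseLandmarkerOptions()
    options.baseOptions.modelAssetPath = Bundle.main.path(
        forResource: variant, ofType: "task") ?? ""
    options.runningMode = .liveStream
    return try PoseLandmarker(options: options)
}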
Testing in the simulator
The simulator has no camera, and the GPU delegate isn't available there, so the live capture pipeline can't run as-is. Inject a video file instead:
import AVFoundation
import UIKit

class SimulatorVideoSource: NSObject {
    private var displayLink: CADisplayLink?
    private var videoReader: AVAssetReader?
    private var trackOutput: AVAssetReaderTrackOutput?
    private let queue = DispatchQueue(label: "simulator.video")

    // AVCaptureOutput and AVCaptureConnection can't be instantiated directly,
    // so rather than faking the capture delegate call, hand sample buffers
    // straight to the pipeline through a closure
    var frameHandler: ((CMSampleBuffer) -> Void)?

    func startPlayback(url: URL) {
        let asset = AVAsset(url: url)
        guard let track = asset.tracks(withMediaType: .video).first,
              let reader = try? AVAssetReader(asset: asset) else { return }

        let output = AVAssetReaderTrackOutput(
            track: track,
            outputSettings: [kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_32BGRA]
        )
        reader.add(output)
        reader.startReading()
        videoReader = reader
        trackOutput = output

        displayLink = CADisplayLink(target: self, selector: #selector(sendNextFrame))
        displayLink?.preferredFrameRateRange = CAFrameRateRange(minimum: 30, maximum: 30, preferred: 30)
        displayLink?.add(to: .main, forMode: .default)
    }

    @objc private func sendNextFrame() {
        guard let sampleBuffer = trackOutput?.copyNextSampleBuffer() else {
            displayLink?.invalidate()
            return
        }
        // Deliver off the main thread, like the real capture delegate queue does
        queue.async { self.frameHandler?(sampleBuffer) }
    }
}
Add a test video (recorded from a device) to your app bundle. In #if targetEnvironment(simulator) blocks, use SimulatorVideoSource instead of AVCaptureSession. Your whole pipeline — MediaPipe, landmark smoothing, state machine — runs end-to-end in the simulator against a known video, as in the sketch below.
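A rough sketch of that wiring; poseService.process(_:) and the video name are placeholders for your own pipeline entry point:

#if targetEnvironment(simulator)
let source = SimulatorVideoSource()
source.frameHandler = { sampleBuffer in
    poseService.process(sampleBuffer)   // hypothetical pipeline entry point
}
if let url = Bundle.main.url(forResource: "test_session", withExtension: "mov") {
    source.startPlayback(url: url)
}
#else
try cameraManager.configure()
#endif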