Daisuke Majima

Posted on Jun 2 • Originally published at qiita.com

A Swift library to run Segment Anything natively on iOS (SamKit)

#coreml #machinelearning #swift #ios

For a while I'd wanted to build a Swift Package that runs Meta's Segment Anything Model (SAM) on-device on iOS.

Cut out the object you tap
Cut out the object you box in
Cut out the object you specify by text

Any of these segments instantly, with all inference completing on-device. It even comes with ready-to-use UI components.

So I built it.

GitHub: https://github.com/john-rocky/SamKit

What it can do

Feature	Description
Point & Box	Tap for a point, drag for a box, then segment
Text Prompt	Type text like `"dog"` or `"red cup"` to detect and segment
Subject Lift	Long-press to lift an object out, Apple Photos–style; copy/save/share
Two backbones	MobileSAM (fast, 23MB) and SAM2 Tiny (accurate, 76MB)
Drop-in UI	Just embed the SwiftUI views as-is

Architecture

SAMKit/
├── SAMKit            # core inference engine (point/box)
├── SAMKitGrounding   # text detection (YOLO-World + CLIP)
└── SAMKitUI          # SwiftUI views (SamView / TextPromptView)

Split into three Swift Package products. Import only what you need.

Setup

1. Add the Swift Package

dependencies: [
    .package(url: "https://github.com/john-rocky/SamKit.git", from: "1.0.0")
]

2. Download the models

Get the .mlpackage files from Releases and add them to your Xcode project.

Model	Size	Use
MobileSAM	23 MB	point/box segmentation (required)
SAM2 Tiny	76 MB	higher-accuracy segmentation (optional)
Grounding (YOLO-World + CLIP)	148 MB	text detection (optional)

Usage

Point/box segmentation

The most basic use. Set an image and specify a point.

import SAMKit

// create a session (the model auto-loads from a bundled resource)
let session = try SamSession(
    model: .bundled(.mobileSam),
    config: .bestAvailable      // priority: Neural Engine > GPU > CPU
)

// encode the image (once; later predicts use the cache)
try session.setImage(cgImage)

// segment by point
let result = try session.predict(
    points: [SamPoint(x: 100, y: 200, label: .positive)]
)

// results
let mask = result.masks.first!
mask.cgImage   // segmentation mask image
mask.score     // IoU confidence score
mask.alpha     // alpha-channel data

You can also specify negative points (regions to exclude) and a bounding box:

let result = try session.predict(
    points: [
        SamPoint(x: 100, y: 200, label: .positive),   // point to include
        SamPoint(x: 300, y: 400, label: .negative)     // point to exclude
    ],
    box: SamBox(x0: 50, y0: 50, x1: 400, y1: 400)    // bounding box
)

Segment by text prompt

Combine SAM with text detection by YOLO-World + CLIP.

import SAMKit
import SAMKitGrounding

let session = try TextSegmentationSession(
    groundingModel: .bundled(),
    samModel: .bundled(.mobileSam)
)

try session.setImage(cgImage)

// search by text and segment
let result = try session.segment(query: "dog, cat")

result.detections   // detections (bounding box + label)
result.masks        // segmentation mask for each detection
result.scores       // confidence scores

Cutting out the object

You can generate a transparent PNG from the segmentation result.

// cut out from a single mask
let extracted = result.masks[0].extractObject(from: cgImage)
// → a CGImage with a transparent background

// composite cut-out from multiple masks
let combined = SamMask.extractObject(from: cgImage, masks: result.masks)

Embedding the SwiftUI views

You don't need to build the UI yourself. SAMKitUI includes ready-to-use views.

import SAMKitUI

// interactive segmentation by point/box
SamView(image: uiImage, model: try .bundled(.mobileSam))

// segmentation by text search
TextPromptView(image: uiImage, session: textSession)

These views include:

subject highlight after segmentation (dim background + subject at full brightness)
an animated glowing outline
long-press to lift the object → drag → Copy/Save/Share menu

How Subject Lift is implemented

A technical walkthrough of recreating Apple Photos' "lift the subject" feature.

1. Binarizing the mask

SAM's mask output is continuous sigmoid values, so convert it to a clean binary mask for display.

func binarizeMask(_ maskImage: CGImage) -> CGImage? {
    // get pixel data via CGContext
    let ctx = CGContext(data: nil, width: width, height: height, ...)
    ctx.draw(maskImage, in: rect)

    let pixels = ctx.data!.bindMemory(to: UInt8.self, capacity: width * height * 4)
    let threshold: UInt8 = 128  // 50% — SAM's standard threshold

    for i in 0..<(width * height) {
        let o = i * 4
        if pixels[o + 3] >= threshold {
            // fully opaque white
            pixels[o] = 255; pixels[o+1] = 255; pixels[o+2] = 255; pixels[o+3] = 255
        } else {
            // fully transparent
            pixels[o] = 0; pixels[o+1] = 0; pixels[o+2] = 0; pixels[o+3] = 0
        }
    }
    return ctx.makeImage()
}

At threshold 0 it picks up mask noise and cuts out most of the image. 128 (50%) is stable.

2. Generating the glowing outline

Extract the mask's contour with CGContext's shadow feature. Far faster than per-pixel dilation.

func generateOutline(from maskImage: CGImage) -> CGImage? {
    // Step 1: turn the mask into a solid-white silhouette
    ctx.draw(maskImage, in: rect)
    ctx.setBlendMode(.sourceIn)
    ctx.setFillColor(UIColor.white.cgColor)
    ctx.fill(rect)  // → white silhouette

    // Step 2: draw with a shadow, then erase the interior → only the contour remains
    outCtx.setShadow(offset: .zero, blur: glowRadius, color: UIColor.white.cgColor)
    outCtx.draw(whiteSilhouette, in: rect)   // shadow = the contour's glow

    outCtx.setBlendMode(.destinationOut)
    outCtx.draw(whiteSilhouette, in: rect)   // erase the interior → contour only
}

Key points:

setShadow makes the white glow (only two draws)
.destinationOut erases the interior, leaving only the outer glow
Far faster than a dilation loop (O(thickness² × pixels))

3. Shimmer animation

Use TimelineView and AngularGradient to make light travel around the contour.

TimelineView(.animation(minimumInterval: 1.0 / 30)) { timeline in
    let phase = timeline.date.timeIntervalSinceReferenceDate
        .truncatingRemainder(dividingBy: 2.5) / 2.5  // one lap in 2.5s

    ZStack {
        // soft glow (blurred cyan)
        outlineImage.colorMultiply(Color(red: 0.5, green: 0.85, blue: 1.0))
            .blur(radius: 5).opacity(0.8)

        // sharp outline
        outlineImage.colorMultiply(.white)

        // moving highlight
        outlineImage.colorMultiply(.white)
            .mask(
                AngularGradient(
                    colors: [.white, .white.opacity(0.5), .clear, .clear, ...],
                    center: .center,
                    startAngle: .degrees(phase * 360),
                    endAngle: .degrees(phase * 360 + 360)
                )
            )
    }
}

4. Unified gesture handler

Manage tap (add point), box drawing, and long-press lift all with a single DragGesture(minimumDistance: 0).

SwiftUI's onTapGesture + onLongPressGesture block each other, so I receive all touches in one gesture and classify them by time and movement.

DragGesture(minimumDistance: 0)
    .onChanged { value in
        // schedule a timer on the first touch
        if gestureStartTime == nil {
            gestureStartTime = Date()
            // decide long-press after 0.3s
            DispatchQueue.main.asyncAfter(deadline: .now() + 0.3) {
                guard gestureStartTime != nil, !isLifted, hasVisibleMasks else { return }
                let moved = hypot(lastTranslation.width, lastTranslation.height)
                guard moved < 15 else { return }  // if it moved, it's not a long-press
                handleLiftObject()  // start the lift!
            }
        }

        if isLifted {
            liftDragOffset = value.translation  // follow the drag
        }
    }
    .onEnded { value in
        if isLifted {
            // released → show menu
            showLiftMenu = true
        } else if elapsed < 0.3 && moved < 15 {
            // quick touch → add point
            addPoint(at: value.startLocation)
        }
    }

Classification logic:

Condition	Verdict
< 0.3s, < 15pt moved	tap → add point
≥ 0.3s, < 15pt moved	long-press → start lift
≥ 10pt moved (box mode)	drag → draw box
movement while lifted	lift-drag → move object

5. Subject highlight

Rather than overlaying a colored mask after segmentation, dim the background and show only the subject at its original brightness.

// darken the background
Color.black.opacity(0.25)

// show only the subject at the original image's brightness
Image(uiImage: image)
    .mask(Image(uiImage: UIImage(cgImage: binaryMask)))

This makes the transition to long-press lift natural (the dimming just deepens 0.25 → 0.4).

Performance

Image encoding: once per image; later predicts reuse the cache
Inference: accelerated on Neural Engine / GPU (FP16)
Outline generation: only two CGContext-shadow draws; no pixel loop
Networking: none. Fully on-device

Summary

With SAMKit you can add segmentation to an iOS app in a few lines.

// an interactive segmentation UI in one line
SamView(image: uiImage, model: try .bundled(.mobileSam))

Experiences like Subject Lift are built in too, so you can bring an Apple Photos–like UX into your own app immediately.