
Daisuke Majima

Originally published at qiita.com

A Swift library to run Segment Anything natively on iOS (SamKit)

SAMKit Demo

For a while I'd wanted to build a Swift Package that runs Meta's Segment Anything Model (SAM) on-device on iOS.

  • Cut out the object you tap
  • Cut out the object you box in
  • Cut out the object you specify by text

Each of these segments the object instantly, with all inference running on-device. The package even comes with ready-to-use UI components.

So I built it.

GitHub: https://github.com/john-rocky/SamKit

What it can do

Feature         Description
Point & Box     Tap for a point, drag for a box, then segment
Text Prompt     Type text like "dog" or "red cup" to detect and segment
Subject Lift    Long-press to lift an object out, Apple Photos–style; copy/save/share
Two backbones   MobileSAM (fast, 23 MB) and SAM2 Tiny (accurate, 76 MB)
Drop-in UI      Just embed the SwiftUI views as-is

Architecture

SAMKit/
├── SAMKit            # core inference engine (point/box)
├── SAMKitGrounding   # text detection (YOLO-World + CLIP)
└── SAMKitUI          # SwiftUI views (SamView / TextPromptView)

The package is split into three Swift Package products, so you can import only what you need.

Setup

1. Add the Swift Package

dependencies: [
    .package(url: "https://github.com/john-rocky/SamKit.git", from: "1.0.0")
]
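For a package-based project, a fuller manifest might look like the sketch below. The product names come from the Architecture listing; the package name "MyApp", the tools version, and the iOS minimum are assumptions for illustration, and the package identity "SamKit" is simply derived from the repository URL.

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyApp",                     // hypothetical package name
    platforms: [.iOS(.v16)],           // assumed minimum; match whatever SamKit itself requires
    dependencies: [
        .package(url: "https://github.com/john-rocky/SamKit.git", from: "1.0.0")
    ],
    targets: [
        .target(
            name: "MyApp",
            dependencies: [
                // Pull in only the products you use.
                .product(name: "SAMKit", package: "SamKit"),
                .product(name: "SAMKitUI", package: "SamKit")
            ]
        )
    ]
)
```

Add `SAMKitGrounding` to the target's dependencies in the same way only if you need text prompts.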

2. Download the models

Download the .mlpackage files from the repository's Releases page and add them to your Xcode project.

Model                           Size     Use
MobileSAM                       23 MB    point/box segmentation (required)
SAM2 Tiny                       76 MB    higher-accuracy segmentation (optional)
Grounding (YOLO-World + CLIP)   148 MB   text detection (optional)

Usage

Point/box segmentation

The most basic usage: set an image, then specify a point.

import SAMKit

// create a session (the model auto-loads from a bundled resource)
let session = try SamSession(
    model: .bundled(.mobileSam),
    config: .bestAvailable      // priority: Neural Engine > GPU > CPU
)

// encode the image (once; later predicts use the cache)
try session.setImage(cgImage)

// segment by point
let result = try session.predict(
    points: [SamPoint(x: 100, y: 200, label: .positive)]
)

// results
let mask = result.masks.first!
mask.cgImage   // segmentation mask image
mask.score     // IoU confidence score
mask.alpha     // alpha-channel data
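The snippet above force-unwraps `masks.first!`, which crashes if no mask comes back. A more defensive pattern is to pick the highest-scoring candidate and handle the empty case. The sketch below uses a hypothetical `ScoredMask` stand-in rather than SAMKit's real mask type, and it does not assume anything about the order in which the library returns masks.

```swift
// Hypothetical stand-in for a SAMKit mask; only the score matters here.
struct ScoredMask {
    let id: Int
    let score: Float
}

/// Returns the candidate with the highest score, or nil for an empty list.
func bestMask(in masks: [ScoredMask]) -> ScoredMask? {
    masks.max { $0.score < $1.score }
}

let candidates = [
    ScoredMask(id: 0, score: 0.71),
    ScoredMask(id: 1, score: 0.93),
    ScoredMask(id: 2, score: 0.64)
]

if let best = bestMask(in: candidates) {
    print("using mask \(best.id) with score \(best.score)")
} else {
    print("no mask returned")
}
```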

You can also specify negative points (regions to exclude) and a bounding box:

let result = try session.predict(
    points: [
        SamPoint(x: 100, y: 200, label: .positive),   // point to include
        SamPoint(x: 300, y: 400, label: .negative)     // point to exclude
    ],
    box: SamBox(x0: 50, y0: 50, x1: 400, y1: 400)    // bounding box
)

Segment by text prompt

This combines SAM with text-driven detection by YOLO-World + CLIP.

import SAMKit
import SAMKitGrounding

let session = try TextSegmentationSession(
    groundingModel: .bundled(),
    samModel: .bundled(.mobileSam)
)

try session.setImage(cgImage)

// search by text and segment
let result = try session.segment(query: "dog, cat")

result.detections   // detections (bounding box + label)
result.masks        // segmentation mask for each detection
result.scores       // confidence scores

Cutting out the object

You can generate a transparent PNG from the segmentation result.

// cut out from a single mask
let extracted = result.masks[0].extractObject(from: cgImage)
// → a CGImage with a transparent background

// composite cut-out from multiple masks
let combined = SamMask.extractObject(from: cgImage, masks: result.masks)
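Conceptually, a cut-out keeps the pixels inside the mask and makes everything else fully transparent. The sketch below shows that idea on a raw RGBA byte array; `cutOut` is a hypothetical helper for illustration, not how `extractObject` is actually implemented.

```swift
/// Applies a binary mask to RGBA pixels.
/// `rgba` holds 4 bytes per pixel; `mask` holds one Bool per pixel.
func cutOut(rgba: [UInt8], mask: [Bool]) -> [UInt8] {
    precondition(rgba.count == mask.count * 4, "buffer sizes must match")
    var out = rgba
    // Zero out every pixel that falls outside the mask.
    for (i, keep) in mask.enumerated() where !keep {
        let o = i * 4
        out[o] = 0; out[o + 1] = 0; out[o + 2] = 0; out[o + 3] = 0
    }
    return out
}

// Two pixels: the first is inside the mask, the second is outside.
let pixels: [UInt8] = [10, 20, 30, 255,  40, 50, 60, 255]
let cut = cutOut(rgba: pixels, mask: [true, false])
print(cut)   // [10, 20, 30, 255, 0, 0, 0, 0]
```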

Embedding the SwiftUI views

You don't need to build the UI yourself. SAMKitUI includes ready-to-use views.

import SAMKitUI

// interactive segmentation by point/box
SamView(image: uiImage, model: try .bundled(.mobileSam))

// segmentation by text search
TextPromptView(image: uiImage, session: textSession)

These views include:

  • subject highlight after segmentation (dim background + subject at full brightness)
  • an animated glowing outline
  • long-press to lift the object → drag → Copy/Save/Share menu

How Subject Lift is implemented

This section is a technical walkthrough of how SAMKit recreates Apple Photos' "lift the subject" feature.

1. Binarizing the mask

SAM's mask output is continuous sigmoid values, so convert it to a clean binary mask for display.

func binarizeMask(_ maskImage: CGImage) -> CGImage? {
    // get pixel data via CGContext
    let ctx = CGContext(data: nil, width: width, height: height, ...)
    ctx.draw(maskImage, in: rect)

    let pixels = ctx.data!.bindMemory(to: UInt8.self, capacity: width * height * 4)
    let threshold: UInt8 = 128  // 50% — SAM's standard threshold

    for i in 0..<(width * height) {
        let o = i * 4
        if pixels[o + 3] >= threshold {
            // fully opaque white
            pixels[o] = 255; pixels[o+1] = 255; pixels[o+2] = 255; pixels[o+3] = 255
        } else {
            // fully transparent
            pixels[o] = 0; pixels[o+1] = 0; pixels[o+2] = 0; pixels[o+3] = 0
        }
    }
    return ctx.makeImage()
}

With a threshold of 0, mask noise passes the test and most of the image gets cut out. A threshold of 128 (50%) gives stable results.
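The effect of the threshold is easy to check without CoreGraphics. The following sketch applies the same alpha test to a plain byte array; `binarize` and `keptCount` are hypothetical helpers written for this illustration.

```swift
/// Thresholds the alpha channel of an RGBA8 buffer in place.
func binarize(_ pixels: inout [UInt8], threshold: UInt8 = 128) {
    precondition(pixels.count % 4 == 0, "expected 4 bytes per pixel")
    for o in stride(from: 0, to: pixels.count, by: 4) {
        let v: UInt8 = pixels[o + 3] >= threshold ? 255 : 0
        pixels[o] = v; pixels[o + 1] = v; pixels[o + 2] = v; pixels[o + 3] = v
    }
}

/// Counts the pixels that ended up fully opaque.
func keptCount(_ pixels: [UInt8]) -> Int {
    stride(from: 3, to: pixels.count, by: 4).filter { pixels[$0] == 255 }.count
}

let source: [UInt8] = [
    255, 255, 255, 127,   // just under 50%
    255, 255, 255, 128,   // exactly at the threshold
    255, 255, 255, 3      // faint noise
]

var strict = source
binarize(&strict)                  // default threshold 128
var loose = source
binarize(&loose, threshold: 0)     // every pixel passes

print(keptCount(strict), keptCount(loose))   // 1 3
```

With the threshold at 0, even the faint-noise pixel is kept, which is the failure mode described above.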

2. Generating the glowing outline

Extract the mask's contour with CGContext's shadow feature. This is far faster than per-pixel dilation.

func generateOutline(from maskImage: CGImage) -> CGImage? {
    // Step 1: turn the mask into a solid-white silhouette
    ctx.draw(maskImage, in: rect)
    ctx.setBlendMode(.sourceIn)
    ctx.setFillColor(UIColor.white.cgColor)
    ctx.fill(rect)  // → white silhouette

    // Step 2: draw with a shadow, then erase the interior → only the contour remains
    outCtx.setShadow(offset: .zero, blur: glowRadius, color: UIColor.white.cgColor)
    outCtx.draw(whiteSilhouette, in: rect)   // shadow = the contour's glow

    outCtx.setBlendMode(.destinationOut)
    outCtx.draw(whiteSilhouette, in: rect)   // erase the interior → contour only
}

Key points:

  • setShadow makes the white glow (only two draws)
  • .destinationOut erases the interior, leaving only the outer glow
  • Far faster than a dilation loop (O(thickness² × pixels))
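For comparison, the dilation loop that the shadow trick replaces might look like the sketch below. This is not code from SAMKit. It is a naive square dilation on a `[Bool]` grid, included only to show where the O(thickness² × pixels) cost comes from.

```swift
/// Naive square dilation: a pixel becomes set if any pixel within `radius` is set.
/// Each pixel checks up to (2 * radius + 1)^2 neighbours.
func dilate(_ mask: [Bool], width: Int, height: Int, radius: Int) -> [Bool] {
    precondition(mask.count == width * height && radius >= 0)
    var out = [Bool](repeating: false, count: mask.count)
    for y in 0..<height {
        for x in 0..<width {
            search: for dy in -radius...radius {
                for dx in -radius...radius {
                    let nx = x + dx, ny = y + dy
                    guard nx >= 0, nx < width, ny >= 0, ny < height else { continue }
                    if mask[ny * width + nx] {
                        out[y * width + x] = true
                        break search
                    }
                }
            }
        }
    }
    return out
}

/// Outline = dilated mask minus the original mask.
func outline(_ mask: [Bool], width: Int, height: Int, radius: Int) -> [Bool] {
    let grown = dilate(mask, width: width, height: height, radius: radius)
    return zip(grown, mask).map { $0 && !$1 }
}

// 5×5 mask with a single set pixel in the centre.
var mask = [Bool](repeating: false, count: 25)
mask[12] = true
let ring = outline(mask, width: 5, height: 5, radius: 1)
print(ring.filter { $0 }.count)   // 8: the neighbours around the centre pixel
```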

3. Shimmer animation

Use TimelineView and AngularGradient to make light travel around the contour.

TimelineView(.animation(minimumInterval: 1.0 / 30)) { timeline in
    let phase = timeline.date.timeIntervalSinceReferenceDate
        .truncatingRemainder(dividingBy: 2.5) / 2.5  // one lap in 2.5s

    ZStack {
        // soft glow (blurred cyan)
        outlineImage.colorMultiply(Color(red: 0.5, green: 0.85, blue: 1.0))
            .blur(radius: 5).opacity(0.8)

        // sharp outline
        outlineImage.colorMultiply(.white)

        // moving highlight
        outlineImage.colorMultiply(.white)
            .mask(
                AngularGradient(
                    colors: [.white, .white.opacity(0.5), .clear, .clear, ...],
                    center: .center,
                    startAngle: .degrees(phase * 360),
                    endAngle: .degrees(phase * 360 + 360)
                )
            )
    }
}
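The phase arithmetic can be checked on its own. In the sketch below, `shimmerPhase` and `shimmerAngles` are hypothetical helper names; the 2.5-second period and the angle formulas are the ones used in the view code above.

```swift
import Foundation

/// Fraction of the current lap (0 up to, but not including, 1)
/// for a shimmer that repeats every `period` seconds.
func shimmerPhase(at time: TimeInterval, period: TimeInterval = 2.5) -> Double {
    time.truncatingRemainder(dividingBy: period) / period
}

/// Start and end angles in degrees for the rotating gradient.
func shimmerAngles(phase: Double) -> (start: Double, end: Double) {
    (phase * 360, phase * 360 + 360)
}

let phase = shimmerPhase(at: 6.25)        // 6.25 s = two full laps plus half a lap
let angles = shimmerAngles(phase: phase)
print(phase, angles.start, angles.end)    // 0.5 180.0 540.0
```

Because the start and end angles always stay 360° apart, the gradient covers the whole circle at every frame and only its rotation changes.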

4. Unified gesture handler

Manage tap (add point), box drawing, and long-press lift all with a single DragGesture(minimumDistance: 0).

SwiftUI's onTapGesture + onLongPressGesture block each other, so I receive all touches in one gesture and classify them by time and movement.

DragGesture(minimumDistance: 0)
    .onChanged { value in
        lastTranslation = value.translation  // remember the latest movement for the long-press check
        // schedule a timer on the first touch
        if gestureStartTime == nil {
            gestureStartTime = Date()
            // decide long-press after 0.3s
            DispatchQueue.main.asyncAfter(deadline: .now() + 0.3) {
                guard gestureStartTime != nil, !isLifted, hasVisibleMasks else { return }
                let moved = hypot(lastTranslation.width, lastTranslation.height)
                guard moved < 15 else { return }  // if it moved, it's not a long-press
                handleLiftObject()  // start the lift!
            }
        }

        if isLifted {
            liftDragOffset = value.translation  // follow the drag
        }
    }
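    .onEnded { value in
        // measure how long the touch lasted and how far it moved
        let elapsed = Date().timeIntervalSince(gestureStartTime ?? Date())
        let moved = hypot(value.translation.width, value.translation.height)
        gestureStartTime = nil  // reset so the next touch schedules a new timer

        if isLifted {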
            // released → show menu
            showLiftMenu = true
        } else if elapsed < 0.3 && moved < 15 {
            // quick touch → add point
            addPoint(at: value.startLocation)
        }
    }

Classification logic:

Condition                   Verdict
< 0.3s, < 15pt moved        tap → add point
≥ 0.3s, < 15pt moved        long-press → start lift
≥ 10pt moved (box mode)     drag → draw box
movement while lifted       lift-drag → move object
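The table can be expressed as a small pure function, which makes the thresholds easy to unit-test separately from SwiftUI. This is a sketch under assumptions: `TouchKind` and `classifyTouch` are hypothetical names, and the order in which the rules are checked (lift first, then box drag, then tap versus long-press) is my reading of the table rather than something the source states.

```swift
import Foundation

enum TouchKind: Equatable {
    case tap, longPress, boxDrag, liftDrag, ignored
}

/// Classifies a touch by duration (seconds) and distance moved (points).
func classifyTouch(duration: TimeInterval,
                   distance: Double,
                   boxMode: Bool,
                   isLifted: Bool) -> TouchKind {
    if isLifted { return .liftDrag }                  // movement while lifted moves the object
    if boxMode && distance >= 10 { return .boxDrag }  // box drawing starts at 10pt
    if distance < 15 {
        return duration < 0.3 ? .tap : .longPress     // 0.3s separates tap from long-press
    }
    return .ignored                                   // moved too far outside box mode
}

print(classifyTouch(duration: 0.1, distance: 2, boxMode: false, isLifted: false))   // tap
print(classifyTouch(duration: 0.5, distance: 2, boxMode: false, isLifted: false))   // longPress
print(classifyTouch(duration: 0.1, distance: 12, boxMode: true, isLifted: false))   // boxDrag
```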

5. Subject highlight

Rather than overlaying a colored mask after segmentation, dim the background and show only the subject at its original brightness.

// darken the background
Color.black.opacity(0.25)

// show only the subject at the original image's brightness
Image(uiImage: image)
    .mask(Image(uiImage: UIImage(cgImage: binaryMask)))

This makes the transition to long-press lift natural (the dimming just deepens 0.25 → 0.4).
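Per pixel, the effect amounts to blending black over the background at the given opacity while leaving subject pixels untouched. A simplified sketch of that arithmetic on a single 8-bit channel follows; `dimmed` and `highlighted` are hypothetical helpers, and colour-space and gamma handling are ignored.

```swift
/// Blends black over a channel value at the given opacity (0...1).
func dimmed(_ value: UInt8, opacity: Double) -> UInt8 {
    precondition((0.0...1.0).contains(opacity), "opacity must be in 0...1")
    return UInt8((Double(value) * (1 - opacity)).rounded())
}

/// Background pixels are dimmed; subject pixels keep their original value.
func highlighted(_ value: UInt8, isSubject: Bool, dimOpacity: Double = 0.25) -> UInt8 {
    isSubject ? value : dimmed(value, opacity: dimOpacity)
}

print(highlighted(200, isSubject: true))                     // 200
print(highlighted(200, isSubject: false))                    // 150
print(highlighted(200, isSubject: false, dimOpacity: 0.4))   // 120
```

The last line corresponds to the deeper dimming used once the object is lifted.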

Performance

  • Image encoding: once per image; later predicts reuse the cache
  • Inference: accelerated on Neural Engine / GPU (FP16)
  • Outline generation: only two CGContext-shadow draws; no pixel loop
  • Networking: none. Fully on-device
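The "encode once, predict many times" pattern from the first bullet can be sketched as follows. `CachedSession` is a hypothetical toy class, not SAMKit's `SamSession`. The multiplication stands in for the expensive encoder pass, and the counter only exists to show that repeated predictions do not re-encode.

```swift
/// Hypothetical session that mimics "encode once, predict many times".
final class CachedSession {
    private(set) var encodeCount = 0
    private var embedding: [Float]?

    /// Stand-in for the expensive image-encoder pass.
    func setImage(_ pixels: [Float]) {
        encodeCount += 1
        embedding = pixels.map { $0 * 0.5 }   // placeholder for real encoder output
    }

    /// Stand-in for the lightweight prompt decoder; reuses the cached embedding.
    func predict(pointIndex: Int) -> Float? {
        guard let embedding = embedding,
              embedding.indices.contains(pointIndex) else { return nil }
        return embedding[pointIndex]
    }
}

let session = CachedSession()
session.setImage([2, 4, 8])
let first = session.predict(pointIndex: 0)
let second = session.predict(pointIndex: 2)
print(session.encodeCount)   // 1: both predictions reused the same encoding
```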

Summary

With SAMKit you can add segmentation to an iOS app in a few lines.

// an interactive segmentation UI in one line
SamView(image: uiImage, model: try .bundled(.mobileSam))

Experiences like Subject Lift are built in too, so you can bring an Apple Photos–like UX into your own app immediately.

GitHub: https://github.com/john-rocky/SamKit

Feedback and issues welcome!


Originally published in Japanese on Qiita.
