Cool Light Shop Co,. LTD

Posted on May 26

On-Device AI in iOS: How I Built a Screenshot Classifier Without Any Cloud Calls

#ios #swift #ai #privacy

Why On-Device AI Matters for Screenshots

Screenshots are sensitive. They contain prices, flight details, personal conversations, banking info. Uploading them to a cloud AI is a non-starter for most users.

The good news: iOS gives you everything you need to build a capable AI pipeline that runs entirely on-device. Here's how I did it for Snaap, an AI screenshot cleaner.

The Pipeline

Step 1: Find the Screenshots

let options = PHFetchOptions()
options.predicate = NSPredicate(
    format: "mediaType == %d AND (mediaSubtypes & %d) != 0",
    PHAssetMediaType.image.rawValue,
    PHAssetMediaSubtype.photoScreenshot.rawValue
)
options.sortDescriptors = [NSSortDescriptor(key: "creationDate", ascending: false)]
let screenshots = PHAsset.fetchAssets(with: options)

iOS natively tags screenshots — no ML model needed for detection.
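
The `(mediaSubtypes & %d) != 0` part of the predicate is a plain bitmask test. Here is a minimal sketch of the same check in pure Swift. It uses a hypothetical `MediaSubtype` option set as a stand-in for `PHAssetMediaSubtype`; the bit positions are illustrative and are not claimed to be PhotoKit's actual raw values.

```swift
/// Hypothetical stand-in for an option set such as PHAssetMediaSubtype.
/// The bit positions below are illustrative only.
struct MediaSubtype: OptionSet {
    let rawValue: UInt
    static let panorama   = MediaSubtype(rawValue: 1 << 0)
    static let hdr        = MediaSubtype(rawValue: 1 << 1)
    static let screenshot = MediaSubtype(rawValue: 1 << 2)
}

/// Same test the predicate performs: is the screenshot bit set?
func isScreenshot(_ subtypes: MediaSubtype) -> Bool {
    (subtypes.rawValue & MediaSubtype.screenshot.rawValue) != 0
}

print(isScreenshot([.screenshot, .hdr]))  // true
print(isScreenshot([.panorama]))          // false
```

Because subtypes are flags, an asset can carry several at once, which is why the predicate masks the value instead of comparing it for equality.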

Step 2: Extract Text with Vision OCR

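let request = VNRecognizeTextRequest { request, error in
    guard error == nil else { return }  // recognition failed, nothing to read
    let text = request.results?
        .compactMap { $0 as? VNRecognizedTextObservation }
        .compactMap { $0.topCandidates(1).first?.string }
        .joined(separator: "\n") ?? ""
    // Hand `text` to the classifier from here.
}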
request.recognitionLevel = .accurate
request.usesLanguageCorrection = true

VNRecognizeTextRequest with the .accurate level catches even small text on product screenshots. Processing ~600 images takes about 90 seconds on an iPhone 14 Pro, or roughly 0.15 seconds per image.
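
The snippet above configures the request but does not show it being run. Here is a minimal sketch of one way to run it, assuming a `CGImage` has already been loaded for the asset. The function name `recognizeText(in:)` is hypothetical and not taken from Snaap, and the typed `results` array assumes the iOS 15+ SDK.

```swift
import Vision
import CoreGraphics

/// Runs text recognition on one image and returns the recognized lines
/// joined by newlines. Hypothetical helper, not Snaap's actual code.
func recognizeText(in image: CGImage) throws -> String {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate
    request.usesLanguageCorrection = true

    // perform(_:) is synchronous, so call this off the main thread.
    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])

    // On the iOS 15+ SDK, results is already [VNRecognizedTextObservation]?.
    let observations = request.results ?? []
    return observations
        .compactMap { $0.topCandidates(1).first?.string }
        .joined(separator: "\n")
}
```

Running the request synchronously keeps the per-image flow easy to follow. For a whole library you would likely call this from a background queue or task.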

Step 3: Rule-Based Classification

func classify(ocrText: String) -> Category {
    let text = ocrText.lowercased()

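    // Travel: look for flight codes like VN123.
    // The code pattern runs on the original text because `text` is lowercased
    // and [A-Z] would never match it.
    if text.contains("boarding pass") || text.contains("flight") ||
       ocrText.range(of: "[A-Z]{2}\\d{3,4}", options: .regularExpression) != nil {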
        return .travel
    }

    // Receipt: price patterns + keywords
    // Raw string literal: \s and \d are not valid escapes in a normal Swift string.
    let pricePattern = #"[$£€¥]\s*\d+[.,]\d{2}"#
    if text.contains("total") &&
       text.range(of: pricePattern, options: .regularExpression) != nil {
        return .receipt
    }

    // Recipe: multiple cooking keywords
    let recipeWords = ["ingredients", "tbsp", "preheat", "bake", "simmer"]
    if recipeWords.filter({ text.contains($0) }).count >= 2 {
        return .recipe
    }

    // Code: programming keywords
    let codeWords = ["func ", "const ", "import ", "async", "await"]
    if codeWords.filter({ text.contains($0) }).count >= 2 {
        return .code
    }

    return .other
}

The key insight: screenshots of the same category share highly predictable vocabulary. A flight booking almost always says "boarding pass" or "flight." A receipt almost always has a price and the word "total." You don't need an LLM for this. In a domain this narrow, simple heuristics do the job better.
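
To see how these rules behave on real-looking text, here is a small sketch that exercises the same patterns on sample strings. The helper names `keywordHits` and `matches` and the sample OCR strings are made up for illustration. Note that the flight-code pattern is case-sensitive, so it only matches text that has not been lowercased.

```swift
import Foundation

/// Counts how many of `keywords` occur in `text`, ignoring case.
func keywordHits(in text: String, keywords: [String]) -> Int {
    let lowered = text.lowercased()
    return keywords.filter { lowered.contains($0) }.count
}

/// True when `text` contains a match for the regular expression `pattern`.
func matches(_ text: String, pattern: String) -> Bool {
    text.range(of: pattern, options: .regularExpression) != nil
}

let recipeWords = ["ingredients", "tbsp", "preheat", "bake", "simmer"]
let pricePattern = #"[$£€¥]\s*\d+[.,]\d{2}"#
let flightCodePattern = "[A-Z]{2}\\d{3,4}"

// Made-up OCR output for three screenshots.
let recipe = "Ingredients: 2 tbsp butter. Preheat the oven to 180C."
let receipt = "Subtotal 10.00\nTOTAL $ 12.50"
let boarding = "Gate 12 VN123 Seat 14A"

print(keywordHits(in: recipe, keywords: recipeWords))             // 3
print(matches(receipt, pattern: pricePattern))                    // true
print(matches(boarding, pattern: flightCodePattern))              // true
print(matches(boarding.lowercased(), pattern: flightCodePattern)) // false
```

Keeping keyword counting and pattern matching in small helpers makes each rule easy to unit test against saved OCR samples.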

Step 4: Context Generation

func generateSentence(for screenshot: Screenshot) -> String {
    // destination, product, price, weeks, and source are elided in this excerpt;
    // assume they were extracted from the OCR text earlier.
    switch screenshot.category {
    case .travel:
        if isDatePast(screenshot.extractedDate) {
            return "Flight to \(destination). You already landed."
        }
        return "Flight to \(destination) — \(formatDate(screenshot.extractedDate))."
    case .product:
        if weeksAgo(screenshot.createdAt) > 4 {
            return "\(product) · \(price). Saved \(weeks) weeks — still want it?"
        }
        return "\(product) · \(price) from \(source)."
    // ... etc
    }
}

The sentences are designed to prompt a decision. "You already landed" makes it safe to delete. "Still want it?" keeps the door open. The goal isn't perfect accuracy — it's removing the fear of deleting.
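
Step 4 calls `isDatePast` and `weeksAgo` without showing them. Here is one possible sketch of the two helpers. The signatures, the optional date, and the injectable `now` parameter are assumptions for illustration, not Snaap's actual code.

```swift
import Foundation

/// True when `date` is strictly earlier than `now`.
/// A missing date counts as "not past", so nothing is flagged by mistake.
func isDatePast(_ date: Date?, now: Date = Date()) -> Bool {
    guard let date = date else { return false }
    return date < now
}

/// Whole weeks elapsed between `date` and `now` (0 for future dates).
func weeksAgo(_ date: Date, now: Date = Date()) -> Int {
    let secondsPerWeek: TimeInterval = 7 * 24 * 60 * 60
    return max(0, Int(now.timeIntervalSince(date) / secondsPerWeek))
}

// Fixed reference time so the results do not depend on the clock.
let now = Date(timeIntervalSince1970: 1_700_000_000)
print(isDatePast(now.addingTimeInterval(-60), now: now))        // true
print(weeksAgo(now.addingTimeInterval(-35 * 86_400), now: now)) // 5
```

Passing `now` as a parameter makes the "already landed" and "saved N weeks" branches testable without waiting for real time to pass.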

Step 5: Duplicate Detection with Perceptual Hashing

func computeHash(for image: UIImage) -> String? {
    // Resize to 8x8 grayscale
    // Compute average brightness
    // Build 64-bit string: each bit = pixel > average
    // Hamming distance < 10 = duplicate
}

The steps above describe an average hash (aHash), the simplest kind of perceptual hash. It catches visually identical screenshots even if one is slightly cropped or has a different timestamp. It found 42 duplicates in my library that I never knew existed.
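
The hashing and comparison steps can be sketched in pure Swift once the image has been reduced to 64 grayscale values. This is a sketch under assumptions: it takes the 8x8 pixels as a `[UInt8]` and leaves the resizing out, and it returns a `UInt64` where the stub above returns a string. The function names are hypothetical.

```swift
/// Average hash of an 8x8 grayscale thumbnail (64 brightness values, 0...255).
/// Bit i is 1 when pixel i is brighter than the mean of all 64 pixels.
func averageHash(grayPixels: [UInt8]) -> UInt64? {
    guard grayPixels.count == 64 else { return nil }
    let mean = grayPixels.reduce(0) { $0 + Int($1) } / 64
    var hash: UInt64 = 0
    for (index, pixel) in grayPixels.enumerated() where Int(pixel) > mean {
        hash |= UInt64(1) << UInt64(index)
    }
    return hash
}

/// Number of bit positions where the two hashes differ.
func hammingDistance(_ a: UInt64, _ b: UInt64) -> Int {
    (a ^ b).nonzeroBitCount
}

/// Applies the threshold from the comments above: fewer than 10 differing bits.
func isLikelyDuplicate(_ a: UInt64, _ b: UInt64, threshold: Int = 10) -> Bool {
    hammingDistance(a, b) < threshold
}

// Two thumbnails that differ in a single pixel.
let original = [UInt8](repeating: 200, count: 32) + [UInt8](repeating: 20, count: 32)
var edited = original
edited[0] = 10

if let a = averageHash(grayPixels: original), let b = averageHash(grayPixels: edited) {
    print(hammingDistance(a, b))    // 1
    print(isLikelyDuplicate(a, b))  // true
}
```

Storing the hash as a `UInt64` makes the Hamming distance a single XOR plus a bit count, which stays cheap when every screenshot is compared against every other.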

Why Not Use an LLM?

Speed: Rule-based classification is instant. No API latency.
Privacy: Nothing leaves the device. Critical for screenshot content.
Cost: $0 vs. paying per token.
Reliability: No hallucinations, no API outages.
Offline: Works on airplanes, in subways, anywhere.

For a constrained domain like screenshot classification, LLMs are overkill. The vocabulary is predictable, the categories are well-defined, and the cost of a misclassification is low (the user just taps "other").

Results & App

Snaap is free on the App Store: https://apps.apple.com/app/snaap-voucher-reminder-ai/id6770817204

The entire AI pipeline — OCR, classification, context generation, duplicate detection, expiry checking — runs in about 0.15 seconds per screenshot on device. No network calls, no backend, no user accounts.

If you're building an iOS app that touches user data, I'd strongly recommend exploring on-device AI first. The frameworks are solid, the privacy story is compelling, and users genuinely appreciate it.

Built with Vision, PhotoKit, GRDB, SwiftUI + UIKit. iOS 16+.
