Why On-Device AI Matters for Screenshots
Screenshots are sensitive. They contain prices, flight details, personal conversations, banking info. Uploading them to a cloud AI is a non-starter for most users.
The good news: iOS gives you everything you need to build a capable AI pipeline that runs entirely on-device. Here's how I did it for Snaap, an AI screenshot cleaner.
The Pipeline
Step 1: Find the Screenshots
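```swift
import Photos

// Requires photo library authorization (PHPhotoLibrary.requestAuthorization) first.
// Screenshots are images whose mediaSubtypes bitmask includes .photoScreenshot.
let options = PHFetchOptions()
options.predicate = NSPredicate(
    format: "mediaType == %d AND (mediaSubtypes & %d) != 0",
    PHAssetMediaType.image.rawValue,
    PHAssetMediaSubtype.photoScreenshot.rawValue
)
// Newest screenshots first
options.sortDescriptors = [NSSortDescriptor(key: "creationDate", ascending: false)]
let screenshots = PHAsset.fetchAssets(with: options)
```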
iOS natively tags screenshots with the photoScreenshot media subtype, so no ML model is needed for detection.
Step 2: Extract Text with Vision OCR
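```swift
import Vision

// Synchronous OCR for one image; call it off the main thread.
func recognizeText(in cgImage: CGImage) throws -> String {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate      // slower than .fast, but reads small text
    request.usesLanguageCorrection = true
    // perform(_:) blocks until the request finishes and fills request.results.
    try VNImageRequestHandler(cgImage: cgImage).perform([request])
    return (request.results ?? [])
        .compactMap { $0.topCandidates(1).first?.string }
        .joined(separator: "\n")
}
```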
VNRecognizeTextRequest at the .accurate recognition level catches even small text on product screenshots. Processing ~600 images takes about 90 seconds on an iPhone 14 Pro, roughly 0.15 seconds per image.
Step 3: Rule-Based Classification
```swift
import Foundation

// Assumed definition (not shown in the post); .product is used in Step 4.
enum Category { case travel, receipt, recipe, code, product, other }

func classify(ocrText: String) -> Category {
    let text = ocrText.lowercased()

    // Travel: keywords, or a flight code like VN123. The code regex runs on the
    // original-case text, because [A-Z] can never match after lowercasing.
    if text.contains("boarding pass") || text.contains("flight") ||
        ocrText.range(of: #"[A-Z]{2}\d{3,4}"#, options: .regularExpression) != nil {
        return .travel
    }

    // Receipt: the word "total" plus a price such as $12.99 or €4,50.
    // A raw string (#"..."#) keeps the regex backslashes valid in Swift.
    let pricePattern = #"[$£€¥]\s*\d+[.,]\d{2}"#
    if text.contains("total") &&
        text.range(of: pricePattern, options: .regularExpression) != nil {
        return .receipt
    }

    // Recipe: at least two cooking keywords
    let recipeWords = ["ingredients", "tbsp", "preheat", "bake", "simmer"]
    if recipeWords.filter({ text.contains($0) }).count >= 2 {
        return .recipe
    }

    // Code: at least two programming keywords
    let codeWords = ["func ", "const ", "import ", "async", "await"]
    if codeWords.filter({ text.contains($0) }).count >= 2 {
        return .code
    }

    return .other
}
```
The key insight: screenshots of the same category share highly predictable vocabulary. A flight booking nearly always says "boarding pass" or "gate." A receipt nearly always has a price and the word "total." You don't need an LLM for this; domain-specific heuristics work better.
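Because each keyword category boils down to a word list plus a minimum hit count, the same rules could also live in data instead of an if-chain. Here is a minimal sketch of that refactor, assuming first-match-wins ordering like the function above. KeywordCategory, KeywordRule, rules, and classifyByKeywords are hypothetical names. The regex checks (flight codes, prices) are left out, so the receipt category is omitted.

```swift
import Foundation

// Hypothetical categories for this sketch (receipt omitted: it needs a regex).
enum KeywordCategory { case travel, recipe, code, other }

// One row per category: the keywords to look for and how many must appear.
struct KeywordRule {
    let category: KeywordCategory
    let keywords: [String]
    let minHits: Int
}

// Order matters: the first rule with enough hits wins, mirroring the if-chain.
let rules: [KeywordRule] = [
    KeywordRule(category: .travel, keywords: ["boarding pass", "flight"], minHits: 1),
    KeywordRule(category: .recipe, keywords: ["ingredients", "tbsp", "preheat", "bake", "simmer"], minHits: 2),
    KeywordRule(category: .code, keywords: ["func ", "const ", "import ", "async", "await"], minHits: 2),
]

func classifyByKeywords(_ ocrText: String, rules: [KeywordRule]) -> KeywordCategory {
    let text = ocrText.lowercased()
    // Count how many of each rule's keywords occur in the text.
    for rule in rules where rule.keywords.filter({ text.contains($0) }).count >= rule.minHits {
        return rule.category
    }
    return .other
}
```

With this layout, adding a category means adding a row, and the array order encodes priority.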
Step 4: Context Generation
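```swift
// destination, product, price, and source are values extracted from the OCR text;
// isDatePast, formatDate, and weeksAgo are app helpers. None are shown in this post.
func generateSentence(for screenshot: Screenshot) -> String {
    switch screenshot.category {
    case .travel:
        if isDatePast(screenshot.extractedDate) {
            return "Flight to \(destination). You already landed."
        }
        return "Flight to \(destination) · \(formatDate(screenshot.extractedDate))."
    case .product:
        let weeks = weeksAgo(screenshot.createdAt)  // weeks since the screenshot was saved
        if weeks > 4 {
            return "\(product) · \(price). Saved \(weeks) weeks ago. Still want it?"
        }
        return "\(product) · \(price) from \(source)."
    // ... etc
    }
}
```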
The sentences are designed to prompt a decision. "You already landed" makes it safe to delete. "Still want it?" keeps the door open. The goal isn't perfect accuracy; it's removing the fear of deleting.
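The post doesn't show isDatePast or weeksAgo. Here is a minimal sketch of what they might look like. It assumes extractedDate is an optional Date and createdAt is the asset's creation Date. The signatures, and the injectable now and calendar parameters (added for testability), are assumptions.

```swift
import Foundation

// Hypothetical helper: true only when the date is known and earlier than `now`.
// A missing date counts as "not past", so an unparsed date is never labeled as past.
func isDatePast(_ date: Date?, now: Date = Date()) -> Bool {
    guard let date = date else { return false }
    return date < now
}

// Hypothetical helper: whole weeks between `date` and `now`, clamped to 0 for future dates.
func weeksAgo(_ date: Date, now: Date = Date(), calendar: Calendar = .current) -> Int {
    let days = calendar.dateComponents([.day], from: date, to: now).day ?? 0
    return max(0, days / 7)
}
```

Injecting now and calendar keeps both helpers deterministic in unit tests. In the app, the defaults apply.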
Step 5: Duplicate Detection with Perceptual Hashing
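```swift
import UIKit
// aHash: bit i set if pixel i of an 8x8 gray thumbnail > mean; Hamming distance < 10 => duplicate.
func computeHash(for image: UIImage) -> UInt64? {
    guard let cgImage = image.cgImage,
          let ctx = CGContext(data: nil, width: 8, height: 8, bitsPerComponent: 8, bytesPerRow: 8,
                              space: CGColorSpaceCreateDeviceGray(), bitmapInfo: CGImageAlphaInfo.none.rawValue),
          let data = ctx.data else { return nil }
    ctx.draw(cgImage, in: CGRect(x: 0, y: 0, width: 8, height: 8))  // downscale + grayscale in one step
    let pixels = (0..<64).map { Int(data.load(fromByteOffset: $0, as: UInt8.self)) }
    let mean = pixels.reduce(0, +) / 64
    return pixels.enumerated().reduce(UInt64(0)) { $1.element > mean ? $0 | (UInt64(1) << $1.offset) : $0 }
}
```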
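This average hash (a simple perceptual hash; the name "pHash" usually refers to a DCT-based variant) catches visually identical screenshots even if one is slightly cropped or shows a different time in the status bar. It found 42 duplicates in my library that I never knew existed.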
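The thresholding and comparison steps of the hash above are plain integer math, so they can be sketched without UIKit and checked in isolation. This assumes the 64 grayscale values were already read row-major from an 8x8 thumbnail. averageHash(grayscale:), hammingDistance, isLikelyDuplicate, and duplicatePairs are hypothetical names. The threshold of 10 differing bits is the one quoted in the post.

```swift
// Average hash over 64 grayscale samples: bit i is set when sample i is
// brighter than the mean of all samples.
func averageHash(grayscale: [UInt8]) -> UInt64? {
    guard grayscale.count == 64 else { return nil }
    let mean = grayscale.reduce(0) { $0 + Int($1) } / 64
    var hash: UInt64 = 0
    for (i, value) in grayscale.enumerated() where Int(value) > mean {
        hash |= UInt64(1) << i
    }
    return hash
}

// Number of bit positions where two hashes differ.
func hammingDistance(_ a: UInt64, _ b: UInt64) -> Int {
    (a ^ b).nonzeroBitCount
}

// Treat two screenshots as duplicates when fewer than `threshold` bits differ.
func isLikelyDuplicate(_ a: UInt64, _ b: UInt64, threshold: Int = 10) -> Bool {
    hammingDistance(a, b) < threshold
}

// All index pairs (i, j) with i < j whose hashes are within the threshold.
// Pairwise O(n^2) comparison of 64-bit integers.
func duplicatePairs(in hashes: [UInt64], threshold: Int = 10) -> [(Int, Int)] {
    var pairs: [(Int, Int)] = []
    for i in hashes.indices {
        for j in hashes.indices where j > i && isLikelyDuplicate(hashes[i], hashes[j], threshold: threshold) {
            pairs.append((i, j))
        }
    }
    return pairs
}
```

Storing the hash as a UInt64 rather than a string makes each comparison a single XOR plus a popcount. That keeps the pairwise pass cheap for a library of a few thousand screenshots.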
Why Not Use an LLM?
- Speed: Rule-based classification is instant. No API latency.
- Privacy: Nothing leaves the device. Critical for screenshot content.
- Cost: $0 vs. paying per token.
- Reliability: No hallucinations, no API outages.
- Offline: Works on airplanes, in subways, anywhere.
For a constrained domain like screenshot classification, LLMs are overkill. The vocabulary is predictable, the categories are well-defined, and the cost of a misclassification is low (user just taps "other").
Results & App
Snaap is free on the App Store: https://apps.apple.com/app/snaap-voucher-reminder-ai/id6770817204
The entire AI pipeline (OCR, classification, context generation, duplicate detection, expiry checking) runs in about 0.15 seconds per screenshot on device. No network calls, no backend, no user accounts.
If you're building an iOS app that touches user data, I'd strongly recommend exploring on-device AI first. The frameworks are solid, the privacy story is compelling, and users genuinely appreciate it.
Built with Vision, PhotoKit, GRDB, SwiftUI + UIKit. iOS 16+.