DEV Community

Beck_Moulton

Skin Health at the Edge: Deploying Custom Vision Transformers (ViT) on iOS with CoreML

Have you ever wondered if that weird rash on your arm is just a heat rash or something that needs a doctor's visit? While we should always consult professionals, the power of Edge AI is making preliminary screenings faster and more private than ever.

In this tutorial, we are bridging the gap between state-of-the-art research (Vision Transformers) and real-world mobile utility. We will transform a custom-trained Vision Transformer (ViT) into a high-performance CoreML model, enabling millisecond-latency, offline skin lesion classification directly on an iPhone.

By leveraging the Vision framework, CoreML, and SwiftUI, we are moving away from sluggish cloud APIs and embracing the "Privacy First" era of iOS development.

The Architecture: From Research to Pocket

The workflow involves training (or fine-tuning) a ViT model in Python, converting it to the .mlpackage format, and integrating it into a native iOS environment.

graph TD
    A[Pre-trained ViT Model - PyTorch/HF] --> B[coremltools Conversion]
    B --> C{CoreML Model .mlpackage}
    C --> D[SwiftUI App Bundle]
    D --> E[Vision Framework Pipeline]
    E --> F[VNCoreMLRequest]
    F --> G[On-Device Neural Engine Inference]
    G --> H[UI Update: Probabilities & Labels]
    style C fill:#f9f,stroke:#333,stroke-width:2px
    style G fill:#00ff00,stroke:#333,stroke-width:2px

Prerequisites

To follow along, you'll need:

  • Python 3.9+ with torch, timm, and coremltools.
  • Xcode 15+.
  • A physical iPhone (Neural Engine is required for the best ViT performance!).
  • Basic knowledge of SwiftUI and Machine Learning on iOS.

Step 1: Converting ViT to CoreML (Python)

Vision Transformers are computationally expensive because of the self-attention mechanism. However, Apple’s Neural Engine (ANE) handles them surprisingly well if converted correctly.
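The quadratic cost is easy to quantify: a 224×224 input cut into 16×16 patches yields under 200 tokens, so the attention matrices stay small enough for mobile. A back-of-the-envelope sketch (the depth, heads, and patch size below are vit_tiny's published config, not values read from the model):

```python
# Rough cost of self-attention for vit_tiny_patch16_224
image_size, patch_size = 224, 16
depth, heads = 12, 3  # vit_tiny config

tokens = (image_size // patch_size) ** 2 + 1  # 196 patches + 1 class token
# Attention scores alone: one (tokens x tokens) matrix per head per layer
attn_entries = depth * heads * tokens ** 2

print(f"tokens per image: {tokens}")
print(f"attention entries per forward pass: {attn_entries:,}")
```

Doubling the input resolution would quadruple the token count and grow the attention matrices sixteenfold, which is why we stick to 224×224 on device.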

First, let's convert our PyTorch ViT model using coremltools.

import torch
import timm
import coremltools as ct

# 1. Load your custom-trained skin lesion model
model = timm.create_model('vit_tiny_patch16_224', pretrained=False, num_classes=7)
model.load_state_dict(torch.load('skin_lesion_vit.pth'))
model.eval()

# 2. Trace the model with a dummy input
example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)

# 3. Convert to CoreML
mlmodel = ct.convert(
    traced_model,
    inputs=[ct.ImageType(
        name="image",
        shape=example_input.shape,
        # CoreML applies: output = scale * pixel + bias. Since scale is a
        # single scalar, fold in the average ImageNet std (~0.226) as well.
        scale=1 / (255.0 * 0.226),
        bias=[-0.485 / 0.229, -0.456 / 0.224, -0.406 / 0.225],
    )],
    classifier_config=ct.ClassifierConfig(['Actinic', 'Basal Cell', 'Dermatofibroma', 'Melanoma', 'Nevus', 'Pigmented', 'Vascular']),
    minimum_deployment_target=ct.target.iOS17
)

mlmodel.save("SkinViT.mlpackage")
print("✅ CoreML model exported successfully!")
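Getting scale and bias right is the most common conversion bug: CoreML applies `output = scale * pixel + bias` with a single scalar scale, so the per-channel ImageNet std has to be folded in approximately. A quick sanity check of that approximation (a pure NumPy sketch, independent of coremltools):

```python
import numpy as np

# ImageNet normalization constants the model was trained with
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

# Exact preprocessing: (pixel/255 - mean) / std, per channel
pixels = np.random.randint(0, 256, size=(3, 8, 8)).astype(np.float64)
exact = (pixels / 255.0 - mean[:, None, None]) / std[:, None, None]

# CoreML's ImageType takes only a scalar scale, so use the average std
scale = 1.0 / (255.0 * 0.226)
bias = -mean / std
approx = scale * pixels + bias[:, None, None]

max_err = np.abs(exact - approx).max()
print(f"max per-pixel error: {max_err:.4f}")
```

The worst-case error is a few hundredths of a normalized unit, which is negligible for classification; if your model turns out to be sensitive to it, you can instead bake exact per-channel normalization into the PyTorch model before tracing.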

Step 2: The "Official" Way to Optimize Production Apps

Before we dive into the Swift code, it's worth noting that deploying medical-grade AI requires more than just a conversion script. You need to handle data drift, model versioning, and rigorous validation.

For advanced architectural patterns and more production-ready examples of AI integration in healthcare apps, I highly recommend checking out the technical deep-dives at WellAlly Blog. They cover how to scale these edge solutions and integrate them into enterprise health ecosystems.


Step 3: Integrating into SwiftUI with Vision Framework

Once you drop your .mlpackage into Xcode, it generates a Swift class automatically. Now, we use the Vision Framework to handle image scaling and orientation—saving us from manual pixel buffer headaches.

The Inference Logic

import CoreML
import UIKit
import Vision

class SkinClassifier: ObservableObject {
    @Published var classificationLabel: String = "Scan a lesion"

    func performInference(uiImage: UIImage) {
        guard let ciImage = CIImage(image: uiImage) else { return }

        do {
            // Load the generated model class
            let config = MLModelConfiguration()
            let model = try VNCoreMLModel(for: SkinViT(configuration: config).model)

            let request = VNCoreMLRequest(model: model) { [weak self] request, _ in
                guard let results = request.results as? [VNClassificationObservation],
                      let topResult = results.first else {
                    DispatchQueue.main.async {
                        self?.classificationLabel = "Unable to classify"
                    }
                    return
                }

                DispatchQueue.main.async {
                    self?.classificationLabel = "\(topResult.identifier): \(Int(topResult.confidence * 100))%"
                }
            }
            // Match the center-crop resizing used during training
            request.imageCropAndScaleOption = .centerCrop

            let handler = VNImageRequestHandler(ciImage: ciImage)
            try handler.perform([request])

        } catch {
            print("Failed to perform inference: \(error)")
        }
    }
}

The Minimalist UI

struct ContentView: View {
    @StateObject private var classifier = SkinClassifier()
    @State private var inputImage: UIImage?

    var body: some View {
        VStack(spacing: 20) {
            Text("DermAI Scan")
                .font(.largeTitle.bold())

            if let image = inputImage {
                Image(uiImage: image)
                    .resizable()
                    .scaledToFit()
                    .frame(height: 300)
                    .cornerRadius(12)
            }

            Text(classifier.classificationLabel)
                .font(.headline)
                .padding()
                .background(Color.secondary.opacity(0.1))
                .cornerRadius(10)

            Button("Capture Image") {
                // Trigger camera/photo library here
                // For demo, we just call the classifier
                if let img = inputImage { classifier.performInference(uiImage: img) }
            }
            .buttonStyle(.borderedProminent)
        }
        .padding()
    }
}

Why Vision Transformers (ViT)?

Traditionally, CNNs (like MobileNet) were the kings of mobile vision. However, ViTs offer a global receptive field from the very first layer. This means for skin lesions, where the relationship between the central lesion and the surrounding skin texture is crucial, ViTs often provide higher sensitivity.
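That "global receptive field" claim is concrete: the image is cut into non-overlapping 16×16 patches, each one becomes a token, and layer-1 attention already mixes every token with every other. A minimal NumPy sketch of the patchify step (illustrative only; a real ViT follows this with a learned linear projection):

```python
import numpy as np

# Stand-in for a 224x224 RGB input
image = np.random.rand(224, 224, 3)

# Cut into non-overlapping 16x16 patches -> 14x14 = 196 tokens
P = 16
patches = image.reshape(224 // P, P, 224 // P, P, 3).swapaxes(1, 2)
tokens = patches.reshape(-1, P * P * 3)  # each row is one flattened patch

# Every one of the 196 tokens attends to every other at the first layer,
# so lesion-center and surrounding-skin patches interact immediately
print(tokens.shape)
```

A CNN, by contrast, needs several layers of stacked convolutions before a lesion-center pixel and a far-away skin-texture pixel fall inside the same receptive field.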

Performance Tip

When deploying ViTs on iOS:

  1. Keep it Tiny: Use vit_tiny or vit_small. The base models are too heavy for real-time mobile use.
  2. Use Quantization: ML-program models already store weights as Float16 by default; the coremltools.optimize.coreml APIs can further compress weights to 8 bits (linear quantization or palettization), cutting model size by roughly 50-75% relative to Float32.
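To see what 8-bit linear quantization actually does to a weight tensor (conceptually; coremltools performs this for you), here is a standalone sketch that quantizes a randomly generated, ViT-sized layer to int8 and measures reconstruction error. The values are hypothetical, not taken from the real model:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0, 0.02, size=(192, 192)).astype(np.float32)

# Symmetric linear quantization to int8: w ~= scale * q
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequant = q.astype(np.float32) * scale

# 4 bytes/weight -> 1 byte/weight: a 75% reduction
size_ratio = q.nbytes / weights.nbytes
max_err = np.abs(weights - dequant).max()
print(f"size ratio: {size_ratio:.2f}, max abs error: {max_err:.6f}")
```

The maximum error is bounded by half a quantization step (scale / 2), which is why 8-bit weights usually cost little to no accuracy for classification heads, though you should always re-validate on your held-out lesion set after compressing.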

Conclusion

We've successfully taken a sophisticated Vision Transformer, compressed it for the iPhone, and wrapped it in a SwiftUI interface. This offline-first approach ensures that sensitive health data never leaves the user's device, providing both speed and security.

Ready to take your Edge AI skills further? Head over to wellally.tech/blog for more insights on building the future of digital health.

What are you building with CoreML? Let me know in the comments! 👇
