Training a Personalised ML Model On-Device with CreateMLComponents

#ios #swift #machinelearning #coreml

Most on-device AI content focuses on inference — you ship a pre-trained model in your app bundle and run it locally. That's well-covered ground. What's less talked about is training a personalised model on the device, from the user's own data, without any server involvement.

I built exactly that recently — a health tracking app that trains a flare risk predictor from each user's biometric history. Here's how it works and what I learned.

The Problem With Generic Models

Predicting health outcomes from biometrics is noisy. A drop in HRV means something different for a 25-year-old athlete than for someone tracking hormonal health. A universal model is mediocre for everyone. A personalised one, trained on your data, is actually useful.

The constraint: this data is sensitive. Shipping it to a server — even your own — is a non-starter for privacy-first health apps. So the model has to live and train on-device.

CreateMLComponents + CoreML

Apple's CreateMLComponents framework (iOS 16+) lets you train models programmatically at runtime. It's different from the Create ML app or the older MLDataTable APIs — it's composable, async, and designed for this kind of on-device training use case.

The core training loop is straightforward:

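import CoreML
import CreateMLComponents

// Fit on the user's examples, export an .mlpackage, then compile it for Core ML.
let regressor = LinearRegressor<Double>()
let fitted = try await regressor.fitted(to: examples)
try fitted.export(to: tempURL)
let compiledURL = try await MLModel.compileModel(at: tempURL)  // URL of the temporary .mlmodelc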

examples is a sequence of AnnotatedFeature<MLShapedArray<Double>, Double>: features in, score out. The model trains in a background task, exports as an .mlpackage, and compiles to a .mlmodelc. compileModel(at:) returns the URL of the compiled model in a temporary location, so the result is then moved into the App Group container, where the widget can read it too.
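To make the shape of examples concrete, here is a minimal sketch of how daily rows could be turned into annotated features. TrainingRow and makeExamples are hypothetical names used only for illustration; the app's real types are not shown in this post.

```swift
import CoreML
import CreateMLComponents

/// Hypothetical container for one day of engineered features and its target score.
struct TrainingRow {
    var features: [Double]
    var flareScore: Double
}

/// Wraps each row as a one-dimensional shaped array annotated with its score.
func makeExamples(from rows: [TrainingRow]) -> [AnnotatedFeature<MLShapedArray<Double>, Double>] {
    rows.map { row in
        AnnotatedFeature(
            feature: MLShapedArray<Double>(scalars: row.features, shape: [row.features.count]),
            annotation: row.flareScore
        )
    }
}
```

Every row needs the same number of features in the same order, since the regressor learns one weight per position.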

Total training time on an iPhone: a few seconds for 30-90 days of daily logs.

Feature Engineering Matters More Than Model Choice

With limited data (30-90 rows), the model architecture barely matters. Feature quality does. A few things that made a difference:

Cyclical encoding for time. Day of week and cycle day are cyclical, not linear: day 7 is next to day 1, not far from it. Encoding each as a sin/cos pair puts the last and first day of a cycle next to each other instead of at opposite ends of a numeric range.

vec.cycleDaySin = sin(2 * .pi * Double(cycleDay) / 28.0)
vec.cycleDayCos = cos(2 * .pi * Double(cycleDay) / 28.0)
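A small sketch of the same idea as a reusable helper, with a check that the wrap-around behaves as described. The names cyclicalEncoding and distance are made up for this illustration and are not taken from the app's code.

```swift
import Foundation

/// Maps a position in a repeating cycle onto the unit circle.
/// `period` is the cycle length: 7 for day of week, 28 for the cycle-day lines above.
func cyclicalEncoding(position: Int, period: Int) -> (sine: Double, cosine: Double) {
    let angle = 2 * Double.pi * Double(position) / Double(period)
    return (sine: sin(angle), cosine: cos(angle))
}

/// Straight-line distance between two encoded positions.
func distance(_ a: (sine: Double, cosine: Double),
              _ b: (sine: Double, cosine: Double)) -> Double {
    hypot(a.sine - b.sine, a.cosine - b.cosine)
}

let day1 = cyclicalEncoding(position: 1, period: 7)
let day4 = cyclicalEncoding(position: 4, period: 7)
let day7 = cyclicalEncoding(position: 7, period: 7)

// Day 7 sits next to day 1 on the circle; day 4 is almost opposite.
print(distance(day7, day1) < distance(day4, day1))  // true
```

With a raw integer feature, day 7 and day 1 would be six units apart, which is the largest gap in the week.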

Delta features over raw values. An absolute HRV of 45 ms might be fine for one person and low for another. But a 15% drop from your own 7-day rolling mean is meaningful regardless of baseline. I compute deltas for all continuous biometrics (HRV, resting HR, basal body temperature).
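One way such a delta could be computed is sketched below. relativeDelta is a hypothetical helper, and it assumes the baseline is the mean of the 7 days before today; whether the app's rolling window includes the current day is not stated here.

```swift
/// Relative change of `today` versus the mean of the last `window` values in `history`.
/// Returns nil when there is not enough history or the baseline is zero.
func relativeDelta(today: Double, history: [Double], window: Int = 7) -> Double? {
    let recent = history.suffix(window)
    guard window > 0, recent.count == window else { return nil }
    let mean = recent.reduce(0, +) / Double(window)
    guard mean != 0 else { return nil }
    return (today - mean) / mean
}

// Seven days at 50 ms, then 42.5 ms today: a 15% drop from the personal baseline.
let hrvDelta = relativeDelta(today: 42.5, history: Array(repeating: 50.0, count: 7))
```

Returning nil for short histories keeps the caller honest about missing baselines instead of silently feeding the model a zero.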

Log-normalise high-variance features. Step count varies by an order of magnitude — 800 steps on a sick day, 12,000 on an active one. Log normalisation keeps it from dominating the linear model.
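As a rough sketch, a log transform for step counts might look like this. The exact transform used in the app is not given; log1p is an assumption chosen here because it maps zero steps to zero.

```swift
import Foundation

/// Compresses step counts with log(1 + steps); negative inputs are clamped to zero.
func logSteps(_ steps: Int) -> Double {
    log1p(Double(max(steps, 0)))
}

let sickDay = logSteps(800)       // roughly 6.7
let activeDay = logSteps(12_000)  // roughly 9.4
```

The raw values differ by a factor of 15, while the transformed ones differ by less than a factor of 2.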

The Confidence Gradient

New users have no data, so you can't run the model immediately. I handle this with a TrainingState enum:

enum TrainingState {
    case insufficient   // fewer than 30 days of data
    case idle
    case training
    case trained(Date)
    case fallback       // using rule-based scorer
}

Under 30 days, a rule-based fallback runs instead — simple thresholds on HRV, deep sleep, and resting HR. It's less accurate but honest about its limitations. The confidence label shown to the user goes from "Building" to "Moderate" to "High" as data accumulates.

Confidence follows a capped linear formula: min(1.0, 0.5 + (logCount - 30) / 120.0). It starts at 50% with 30 days of data and reaches 100% at 90 days. Honest and explainable.
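Evaluating the formula as written at a few log counts shows where the cap kicks in. The function name confidence is illustrative only.

```swift
/// Confidence from the number of daily logs, using the formula quoted above.
/// Intended for logCount >= 30; below that the rule-based fallback runs instead.
func confidence(logCount: Int) -> Double {
    min(1.0, 0.5 + Double(logCount - 30) / 120.0)
}

print(confidence(logCount: 30))   // 0.5
print(confidence(logCount: 60))   // 0.75
print(confidence(logCount: 90))   // 1.0
print(confidence(logCount: 150))  // 1.0
```

Note the explicit Double conversion: in Swift, an Int log count cannot be divided by 120.0 directly.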

Prediction to Notification Pipeline

Once the model runs, high-risk predictions trigger a local notification. No server involved at any stage — data never leaves the device:

DailyLog history -> FeatureVector -> MLModel.prediction(from:) -> PredictionResult
-> WidgetKit reload + flare warning notification (if high risk)
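The last stage of that pipeline could be sketched roughly as follows. The 0.7 threshold, the assumption that the score lies in 0...1, the notification text, and the function names are all hypothetical; the post does not specify them. It also assumes notification permission has already been granted.

```swift
import UserNotifications
import WidgetKit

/// Hypothetical cut-off; what counts as "high risk" is not stated in this post.
let highRiskThreshold = 0.7

/// Returns the warning text for a score, or nil when no alert is warranted.
func flareWarningBody(forRisk score: Double) -> String? {
    guard score >= highRiskThreshold else { return nil }
    let percent = Int((score * 100).rounded())
    return "Your flare risk looks elevated today (\(percent)%)."
}

/// Refreshes widgets and, for high scores, posts a local notification.
func publish(riskScore: Double) async throws {
    // Ask WidgetKit to rebuild timelines so the widget shows the new score.
    WidgetCenter.shared.reloadAllTimelines()

    guard let body = flareWarningBody(forRisk: riskScore) else { return }

    let content = UNMutableNotificationContent()
    content.title = "Flare risk warning"
    content.body = body

    // A nil trigger delivers the notification right away.
    let request = UNNotificationRequest(identifier: "flare-warning",
                                        content: content,
                                        trigger: nil)
    try await UNUserNotificationCenter.current().add(request)
}
```

Reusing a fixed identifier means a newer warning replaces an older pending one rather than stacking up.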

The compiled model is stored in the App Group container so both the main app and the widget read the same model file.
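A minimal sketch of moving a compiled model into a shared directory is shown below. installCompiledModel and the file name FlareRisk.mlmodelc are hypothetical. In an app, the directory would come from FileManager.default.containerURL(forSecurityApplicationGroupIdentifier:) with your own group identifier.

```swift
import Foundation

/// Moves a compiled model (a .mlmodelc directory) into `directory`,
/// replacing any previous version stored under the same name.
/// Note: remove-then-move is simple but not atomic.
func installCompiledModel(from compiledURL: URL,
                          into directory: URL,
                          named name: String = "FlareRisk.mlmodelc") throws -> URL {
    let fileManager = FileManager.default
    try fileManager.createDirectory(at: directory, withIntermediateDirectories: true)

    let destination = directory.appendingPathComponent(name)
    if fileManager.fileExists(atPath: destination.path) {
        try fileManager.removeItem(at: destination)
    }
    try fileManager.moveItem(at: compiledURL, to: destination)
    return destination
}
```

Both the app and the widget extension can then load the model from the returned URL with MLModel(contentsOf:).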

What I'd Do Differently

Quantile regression instead of point estimates. A flare risk score with a confidence interval is more useful than a precise-sounding number.
Federated fine-tuning (for a population-level baseline, if needed later). Right now the model is purely individual — no shared signal at all.
More aggressive retrain scheduling. The model currently retrains on app open when new logs exist. Scheduling retrains as system background tasks would keep it up to date even when the app isn't opened for a few days.

If you're building health or fitness apps that need personalised predictions and can't touch a server, CreateMLComponents is worth a serious look. The API is clean, async throughout, and the trained models drop straight into the standard CoreML inference path.