DEV Community

SoftwareDevs mvpfactory.io
Posted on • Originally published at mvpfactory.io

Apple Foundation Models SDK with Claude Code: Building Hybrid On-Device/Cloud AI Pipelines for iOS Apps in Swift

---
title: "Hybrid AI Pipelines: Apple On-Device + Claude Cloud"
published: true
description: "Build a tiered AI inference pipeline in Swift: route simple tasks to Apple's on-device models, escalate complex reasoning to the Claude API, with an adapter pattern that keeps your feature layer clean."
tags: swift, ios, architecture, mobile
canonical_url: https://blog.mvpfactory.co/hybrid-ai-pipelines-apple-on-device-claude-cloud
---

## What We're Building

Let me show you a pattern I use in every project that ships AI features on iOS. We're going to architect a **tiered inference pipeline** that routes simple tasks — classification, short extraction — to Apple's Foundation Models framework on-device, and escalates complex reasoning to Claude's API. The glue is a protocol-based adapter so your feature layer never knows which provider answered.

By the end, you'll have: a unified `AIProvider` protocol, an intelligent router, token budget management, and Combine-based streaming that keeps your UI responsive regardless of provider.

## Prerequisites

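- Xcode 26 or later with Swift 6+ (the Foundation Models framework ships in the iOS 26 / macOS 26 SDKs)
- A device that supports Apple Intelligence, with Apple Intelligence enabled (required for on-device inference)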
- An Anthropic API key for Claude access
- Familiarity with Swift concurrency (`async/await`, `AsyncThrowingStream`)

## Step 1: Define the Provider Protocol

The core abstraction is a protocol that both providers conform to. This is what makes swapping providers a config change instead of a rewrite.

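```swift
/// Common interface that both the on-device and cloud providers conform to.
protocol AIProvider {
    /// Returns the complete response once generation has finished.
    func generate(prompt: String, maxTokens: Int) async throws -> String
    /// Yields response text chunk by chunk as it is generated.
    func stream(prompt: String) -> AsyncThrowingStream<String, Error>
    /// The most demanding task tier this provider is expected to handle.
    var estimatedCapabilityTier: CapabilityTier { get }
}

enum CapabilityTier: Int, Comparable {
    case basic = 0     // Classification, short extraction
    case standard = 1  // Summarization, simple generation
    case advanced = 2  // Multi-step reasoning, long-form analysis

    // Raw-value enums don't get a synthesized `<`, so order tiers by raw value.
    static func < (lhs: CapabilityTier, rhs: CapabilityTier) -> Bool {
        lhs.rawValue < rhs.rawValue
    }
}
```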

Your on-device provider wraps Apple's `LanguageModelSession`. Your cloud provider wraps the Anthropic SDK. Same interface, different engines.
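To make "same interface, different engines" concrete, here is a minimal sketch of one conformance. `EchoProvider` and `summarize(_:using:)` are hypothetical names of mine, and the stub calls neither `LanguageModelSession` nor the Anthropic SDK; a real adapter would put those calls inside `generate` and `stream`. The protocol and enum are repeated so the snippet compiles on its own, and I've added a `Sendable` requirement (an assumption, not part of the protocol above) so providers can cross concurrency domains under strict checking.

```swift
enum CapabilityTier: Int, Comparable {
    case basic = 0, standard = 1, advanced = 2
    static func < (lhs: CapabilityTier, rhs: CapabilityTier) -> Bool {
        lhs.rawValue < rhs.rawValue
    }
}

// `Sendable` is an addition for this sketch, not part of the article's protocol.
protocol AIProvider: Sendable {
    func generate(prompt: String, maxTokens: Int) async throws -> String
    func stream(prompt: String) -> AsyncThrowingStream<String, Error>
    var estimatedCapabilityTier: CapabilityTier { get }
}

/// Hypothetical stand-in provider: echoes the prompt instead of running a model.
struct EchoProvider: AIProvider {
    let name: String
    let estimatedCapabilityTier: CapabilityTier

    func generate(prompt: String, maxTokens: Int) async throws -> String {
        // A real adapter would call its underlying model here.
        "[\(name)] \(prompt)"
    }

    func stream(prompt: String) -> AsyncThrowingStream<String, Error> {
        // Emit the echoed text one word at a time to imitate token streaming.
        let words = "[\(name)] \(prompt)".split(separator: " ").map(String.init)
        return AsyncThrowingStream { continuation in
            for word in words { continuation.yield(word) }
            continuation.finish()
        }
    }
}

/// Feature code depends only on the protocol, never on a concrete provider.
func summarize(_ text: String, using provider: any AIProvider) async throws -> String {
    try await provider.generate(prompt: "Summarize: \(text)", maxTokens: 128)
}
```

Swapping `EchoProvider(name: "on-device", ...)` for `EchoProvider(name: "cloud", ...)` changes nothing in `summarize`, which is the property the real adapters should preserve.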

## Step 2: Build the Router

Here is the minimal setup to get this working. The router checks task complexity and token estimates, then picks a provider:

```swift
struct InferenceRouter {
    let onDevice: any AIProvider
    let cloud: any AIProvider

    /// Sends small, low-tier tasks on-device and escalates everything else to the cloud.
    func route(task: AITask) async throws -> String {
        // On-device handles the task only if it is both capable enough and short enough.
        if task.requiredTier <= onDevice.estimatedCapabilityTier
            && task.estimatedTokens < 512 {
            return try await onDevice.generate(
                prompt: task.prompt, maxTokens: task.estimatedTokens
            )
        }
        return try await cloud.generate(
            prompt: task.prompt, maxTokens: task.estimatedTokens
        )
    }
}
```


The split is intuitive. For anything that fits a short context and needs a quick answer (sentiment, entity extraction, autocomplete), on-device wins on latency (~50–200ms), cost (zero), and privacy (data never leaves the device). The moment you need chain-of-thought reasoning, long document analysis, or nuanced generation, send it to Claude.
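The router reads three fields from `AITask`, which the article never spells out. Here is one possible shape, inferred from how the router uses it, plus the routing condition pulled out into a pure function so you can sanity-check the split. `Destination`, `destination(for:onDeviceTier:onDeviceTokenLimit:)`, and the sample tasks are illustrative assumptions.

```swift
enum CapabilityTier: Int, Comparable {
    case basic = 0, standard = 1, advanced = 2
    static func < (lhs: CapabilityTier, rhs: CapabilityTier) -> Bool {
        lhs.rawValue < rhs.rawValue
    }
}

/// Assumed task shape: only the three fields the router reads.
struct AITask {
    let prompt: String
    let requiredTier: CapabilityTier
    let estimatedTokens: Int
}

enum Destination { case onDevice, cloud }

/// Mirrors the router's condition as a pure function so it can be checked in isolation.
func destination(
    for task: AITask,
    onDeviceTier: CapabilityTier,
    onDeviceTokenLimit: Int = 512
) -> Destination {
    let capable = task.requiredTier <= onDeviceTier
    let shortEnough = task.estimatedTokens < onDeviceTokenLimit
    return capable && shortEnough ? .onDevice : .cloud
}

let sentiment = AITask(
    prompt: "Is this review positive?", requiredTier: .basic, estimatedTokens: 40)
let contractReview = AITask(
    prompt: "Compare these two contracts.", requiredTier: .advanced, estimatedTokens: 4_000)
let longTagging = AITask(
    prompt: "Tag every paragraph.", requiredTier: .basic, estimatedTokens: 900)

print(destination(for: sentiment, onDeviceTier: .standard))      // onDevice
print(destination(for: contractReview, onDeviceTier: .standard)) // cloud
print(destination(for: longTagging, onDeviceTier: .standard))    // cloud (simple, but too long)
```

Note the third case: a basic-tier task still goes to the cloud once its token estimate crosses the limit, because both conditions must hold for on-device.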

## Step 3: Add Combine-Based Streaming

Both providers can stream tokens. Wrapping them in a Combine pipeline keeps your UI responsive regardless of which provider is active:

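```swift
import Combine
import Foundation

func streamResponse(for task: AITask) -> AnyPublisher<String, Error> {
    // Assumes `router.routeStreaming(task:)` returns an `AnyPublisher<String, Error>`
    // that emits one text chunk at a time from whichever provider was chosen.
    let stream = router.routeStreaming(task: task)

    return stream
        .receive(on: DispatchQueue.main) // deliver on the main thread for UI updates
        .scan("") { accumulated, chunk in accumulated + chunk } // running transcript
        .eraseToAnyPublisher()
}
```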


Your SwiftUI view subscribes to a single publisher. It doesn't care whether tokens come from Apple Silicon or Claude's API.
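If the `scan("")` step is unfamiliar: it turns individual chunks into a running transcript, so the view always receives the full text so far rather than a fragment. The sketch below reproduces that accumulation without Combine, directly over an `AsyncThrowingStream`, purely to show what values the subscriber would see. `runningText(from:)` and `demoChunks()` are hypothetical helpers, and the three chunks are made-up provider output.

```swift
/// Collects every intermediate value a `scan("")`-style accumulation would emit.
func runningText(
    from chunks: AsyncThrowingStream<String, Error>
) async throws -> [String] {
    var accumulated = ""
    var snapshots: [String] = []
    for try await chunk in chunks {
        accumulated += chunk
        snapshots.append(accumulated) // what the view would render at this point
    }
    return snapshots
}

/// Hypothetical provider output: three chunks, then completion.
func demoChunks() -> AsyncThrowingStream<String, Error> {
    AsyncThrowingStream { continuation in
        for chunk in ["Hel", "lo", " world"] { continuation.yield(chunk) }
        continuation.finish()
    }
}

let snapshots = try await runningText(from: demoChunks())
print(snapshots) // ["Hel", "Hello", "Hello world"]
```

Each snapshot is a complete string, which is why the view can bind to the latest value and ignore everything before it.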

## Step 4: Guard Your Cloud Budget

In production, you need guardrails. This actor prevents runaway costs:

```swift
actor TokenBudgetManager {
    // Running total of cloud tokens spent; resetting it each day is not shown here.
    private var dailyCloudTokensUsed: Int = 0
    private let dailyLimit: Int = 100_000

    /// True if the request would still fit inside today's cloud budget.
    func canUseCloud(estimatedTokens: Int) -> Bool {
        dailyCloudTokensUsed + estimatedTokens <= dailyLimit
    }

    /// Call after a cloud request completes, with the tokens it consumed.
    func recordUsage(_ tokens: Int) {
        dailyCloudTokensUsed += tokens
    }
}
```


When the budget runs out, the router gracefully degrades to on-device only. Users still get responses — just simpler ones. Far better than a hard failure or a surprise bill.
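Here is a rough sketch of how that degradation could be wired into routing. Everything in it is an assumption for illustration: `TextProvider` is a trimmed-down provider interface, `StubProvider` stands in for real models, and `reserve(_:)` is a variation on the manager above that checks and records in a single actor call, so two concurrent requests can't both pass the check before either one records its usage.

```swift
actor TokenBudgetManager {
    private var dailyCloudTokensUsed = 0
    private let dailyLimit: Int

    init(dailyLimit: Int) { self.dailyLimit = dailyLimit }

    /// Checks and records in one actor-isolated step; returns false if over budget.
    func reserve(_ tokens: Int) -> Bool {
        guard dailyCloudTokensUsed + tokens <= dailyLimit else { return false }
        dailyCloudTokensUsed += tokens
        return true
    }
}

/// Trimmed provider interface (only `generate`) to keep the sketch short.
protocol TextProvider: Sendable {
    func generate(prompt: String, maxTokens: Int) async throws -> String
}

/// Stand-in provider that returns its own label instead of calling a model.
struct StubProvider: TextProvider {
    let label: String
    func generate(prompt: String, maxTokens: Int) async throws -> String { label }
}

struct BudgetAwareRouter {
    let onDevice: any TextProvider
    let cloud: any TextProvider
    let budget: TokenBudgetManager

    func route(prompt: String, estimatedTokens: Int, prefersCloud: Bool) async throws -> String {
        if prefersCloud {
            let reserved = await budget.reserve(estimatedTokens)
            if reserved {
                return try await cloud.generate(prompt: prompt, maxTokens: estimatedTokens)
            }
        }
        // Simple task, or cloud budget exhausted: answer on-device.
        return try await onDevice.generate(prompt: prompt, maxTokens: estimatedTokens)
    }
}
```

With a limit of 100 tokens, a first 80-token cloud request goes through, and a second one falls back to on-device because it would push the total to 160.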

## Step 5: Enforce Privacy Boundaries

The docs don't mention this, but this is the architectural decision that matters most. Define a clear data classification in your routing logic, **not** in your feature code:

- **Tier 1 (on-device only):** health data, financial records, personal messages — anything covered by privacy regulations.
- **Tier 2 (cloud-eligible):** generic content generation, public data analysis, non-personal queries.

The feature layer asks for "summarize this text." The router checks data classification *before* picking a provider. Privacy enforcement stays centralized and auditable.
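One way to express that check in code is to attach a classification to each task and let it override every other routing signal. This is a sketch under my own naming: `DataClassification`, `ClassifiedTask`, `Route`, and `privacyAwareRoute(for:)` are all hypothetical, and how you assign the classification in the first place is up to your app.

```swift
/// Hypothetical sensitivity labels matching the two tiers above.
enum DataClassification {
    case onDeviceOnly   // Tier 1: health data, financial records, personal messages
    case cloudEligible  // Tier 2: generic or public content
}

enum Route { case onDevice, cloud }

struct ClassifiedTask {
    let prompt: String
    let classification: DataClassification
    let needsAdvancedReasoning: Bool
}

/// Privacy is checked first, so restricted data can never reach the cloud branch.
func privacyAwareRoute(for task: ClassifiedTask) -> Route {
    switch task.classification {
    case .onDeviceOnly:
        // Even demanding tasks stay local; the answer may be simpler, but the data stays put.
        return .onDevice
    case .cloudEligible:
        return task.needsAdvancedReasoning ? .cloud : .onDevice
    }
}

let labResults = ClassifiedTask(
    prompt: "Summarize these lab results.",
    classification: .onDeviceOnly,
    needsAdvancedReasoning: true
)
let pressRelease = ClassifiedTask(
    prompt: "Analyze this public press release.",
    classification: .cloudEligible,
    needsAdvancedReasoning: true
)

print(privacyAwareRoute(for: labResults))   // onDevice
print(privacyAwareRoute(for: pressRelease)) // cloud
```

Because the `switch` is exhaustive, adding a third classification later forces you to decide its routing at compile time instead of discovering a gap in production.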

## Gotchas

- **Don't route by gut feeling.** Classify each AI task by complexity requirements. Let the router decide based on token estimates, reasoning depth, and privacy constraints — not hardcoded provider choices scattered through your codebase.
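- **On-device context windows are small.** Apple's documentation lists a 4,096-token window per session for the on-device model at the time of writing, while Claude supports up to 200K tokens. If you're sending long documents to on-device and getting truncated results or context-size errors, that's why.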
- **Apple's `@Generable` macro handles structured output on-device**, while Claude typically produces structured output through tool use with a JSON schema. Your adapter needs to normalize both into the same response shape.
- **Design for offline-first.** Treat on-device as your baseline and cloud as your escalation path. When the network is unavailable or the token budget is spent, your app still works.
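The offline-first point can be sketched as a small fallback wrapper: try the cloud, and if it throws (no network, rate limit, server error), answer on-device instead. The names here (`Responder`, `OfflineCloud`, `LocalStub`, `generateWithFallback`) are hypothetical, and the error type is a placeholder for whatever your networking layer throws.

```swift
/// Trimmed provider interface for this sketch.
protocol Responder: Sendable {
    func generate(prompt: String) async throws -> String
}

/// Placeholder for a networking failure.
struct OfflineError: Error {}

/// Simulates a cloud provider with no connectivity.
struct OfflineCloud: Responder {
    func generate(prompt: String) async throws -> String { throw OfflineError() }
}

/// Simulates an on-device provider that always answers.
struct LocalStub: Responder {
    func generate(prompt: String) async throws -> String { "local answer" }
}

/// Tries the cloud first and falls back to on-device on any failure except cancellation.
func generateWithFallback(
    prompt: String,
    cloud: any Responder,
    onDevice: any Responder
) async throws -> String {
    do {
        return try await cloud.generate(prompt: prompt)
    } catch is CancellationError {
        // The caller gave up; don't start a second request on its behalf.
        throw CancellationError()
    } catch {
        return try await onDevice.generate(prompt: prompt)
    }
}
```

Cancellation is rethrown on purpose: if the user navigated away, quietly kicking off an on-device generation would waste battery for a result nobody reads.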

## Wrapping Up

Build the adapter layer now — even if you only use one provider today. The protocol-based abstraction is nearly free and saves you a rewrite later. The hybrid approach is the right default for most iOS apps shipping AI features because you're optimizing across latency, cost, privacy, and capability simultaneously instead of picking one axis and hoping the others work out.

Start with the `AIProvider` protocol, get on-device working first, then add Claude as your escalation path. You'll have an AI layer that's resilient, cost-aware, and privacy-respecting from day one.