DEV Community

Benjamin Pires

How I Built a Multi-Sport AI Coach on iOS as a Solo Developer — Architecture Decisions That Actually Mattered

Most articles about building AI apps focus on the model. This one focuses on everything around the model — the architecture decisions that determined whether the product would actually ship, actually perform, and actually retain users.

SportsReflector is an AI coaching app that analyzes athletic form across 22 sports and every common gym exercise. It uses on-device pose estimation to extract body landmarks from video, calculates biomechanical metrics against sport-specific benchmarks, and returns a 0-100 form score with corrective coaching feedback. I built it solo.

Here are the architecture decisions that mattered most — and the ones I got wrong initially.

Decision 1: On-Device vs Cloud Inference
The first prototype sent video frames to a cloud GPU for pose estimation. It worked. It was also unusable.
Round-trip latency for a single frame was 200-400ms depending on network conditions. For real-time AR overlay at 30fps, you need sub-33ms inference per frame. Cloud inference was 10x too slow for the core feature.
The fix was moving to Apple's Vision framework with VNDetectHumanBodyPoseRequest running entirely on-device via CoreML. On an iPhone 12 or newer, single-frame pose estimation runs in 8-15ms — fast enough for real-time AR overlay at 60fps on ProMotion devices.
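The on-device path can be sketched with the Vision framework directly. This is a minimal, illustrative version — the real pipeline would reuse the request across frames and handle errors and orientation properly:

```swift
import Vision

// Sketch: run body-pose estimation on a single camera frame, entirely on-device.
func detectPose(in pixelBuffer: CVPixelBuffer) -> VNHumanBodyPoseObservation? {
    let request = VNDetectHumanBodyPoseRequest()
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
    try? handler.perform([request])
    return request.results?.first
}

// Pull out one landmark with its confidence score.
func rightWristPoint(from observation: VNHumanBodyPoseObservation) -> CGPoint? {
    guard let wrist = try? observation.recognizedPoint(.rightWrist),
          wrist.confidence > 0.3 else { return nil }
    // Vision returns normalized coordinates (0...1), origin at the bottom-left.
    return wrist.location
}
```

Because the request runs through CoreML under the hood, the per-frame cost is what lands in the 8-15ms range on modern hardware.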
The business implications of this decision were massive:
Cloud inference at scale would have cost roughly $0.02-0.05 per analysis. At 10,000 daily active users doing 3 analyses each, that's $600-1,500/day in GPU costs before the business generates meaningful revenue. On-device inference costs exactly $0.00 per analysis regardless of user volume. Gross margin scales with subscriptions, not usage.
The tradeoff: on-device models are smaller and less accurate than cloud models. Lightweight single-person models in the class of MoveNet SinglePose Thunder extract 17 keypoints per frame; richer models like MediaPipe BlazePose extract 33. For consumer coaching (not clinical biomechanics), 17 keypoints is sufficient to score form, detect asymmetries, and identify common technique errors. The accuracy ceiling matters less than the latency floor for user experience.
What I'd tell other developers: default to on-device inference for any consumer AI product. Cloud inference is for batch processing, enterprise workflows, and use cases where latency tolerance is measured in seconds. Consumer apps need sub-100ms response times. On-device delivers that. Cloud doesn't.

Decision 2: Sport-Specific vs Generic Analysis
The naive approach to multi-sport analysis is building one model that scores all movements generically: detect joints, measure angles, score deviations from some universal "correct" standard.
This doesn't work because biomechanics is sport-specific. A deep squat is correct form for powerlifting but incorrect for Olympic weightlifting (where you want to catch at parallel). A wide elbow flare is wrong for bench press but correct for a boxing hook. Knee valgus is a red flag in a squat but a natural movement pattern in certain tennis footwork transitions.
The architecture that works is a shared pose estimation layer feeding into sport-specific analysis modules. The pose data (17 keypoints with x, y coordinates and confidence scores per frame) is identical regardless of sport. The interpretation layer is modular — each sport has its own:

Biomechanical benchmark definitions
Phase detection logic (setup → load → execute → follow-through)
Failure mode catalog
Corrective drill mappings
Scoring weight distributions

Adding a new sport means writing a new analysis module, not retraining the pose estimation model. The 23rd sport is incremental engineering. The first sport was the hard part — designing the module interface that all future sports plug into.
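A module interface along these lines can be sketched as a Swift protocol. All names here are illustrative, not the app's actual types:

```swift
import Foundation

// Shared output of the pose layer: identical for every sport.
struct PoseFrame {
    let keypoints: [String: (x: Double, y: Double, confidence: Double)]
    let timestamp: TimeInterval
}

enum MovementPhase { case setup, load, execute, followThrough }

// The interface every sport-specific module conforms to.
protocol SportAnalysisModule {
    var sportName: String { get }
    func detectPhase(_ frame: PoseFrame, history: [PoseFrame]) -> MovementPhase
    func score(_ frames: [PoseFrame]) -> Int              // 0-100 form score
    func failureModes(_ frames: [PoseFrame]) -> [String]
    func correctiveDrills(for failures: [String]) -> [String]
}

// Adding sport #23 means one new conforming type; the pose layer is untouched.
struct SquatModule: SportAnalysisModule {
    let sportName = "Powerlifting Squat"
    func detectPhase(_ frame: PoseFrame, history: [PoseFrame]) -> MovementPhase { .execute }
    func score(_ frames: [PoseFrame]) -> Int { frames.isEmpty ? 0 : 85 } // stub
    func failureModes(_ frames: [PoseFrame]) -> [String] { ["knee valgus"] }
    func correctiveDrills(for failures: [String]) -> [String] {
        failures.map { "Corrective drill for \($0)" }
    }
}
```

The protocol is the contract: the pose estimation layer never knows which sport is active, and a sport module never touches raw video.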
What I'd tell other developers: if you're building any multi-category AI product (not just sports), invest heavily in the interface between your core ML layer and your domain-specific interpretation layer. That interface is your architecture. Get it right and scaling to new categories is additive. Get it wrong and every new category is a rewrite.

Decision 3: Synchronous vs Asynchronous Analysis
The first version analyzed video synchronously — user taps "analyze," the app freezes for 3-8 seconds while processing, then displays results. Users hated it. The perceived wait felt broken even though the actual processing time was reasonable.
The fix was splitting analysis into two paths:
Real-time path (AR workouts, training partner): pose estimation runs every frame at 30fps. No analysis latency — feedback is continuous. The tradeoff is shallower analysis since you can only compute what fits in a 33ms budget per frame.
Deferred path (video analysis, form scoring): video is recorded first, then analyzed frame-by-frame in the background. A progress indicator shows the user their analysis is cooking. Results appear as a notification or in-app card when ready. The user can do other things while waiting.
The deferred path allowed much deeper analysis per frame because there's no real-time constraint. Phase detection, kinetic chain analysis, asymmetry measurement, and LLM-generated coaching feedback all run sequentially without blocking the UI.
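The deferred path can be sketched with Swift concurrency — heavy per-frame work runs off the main actor while progress updates hop back to it. Names are illustrative:

```swift
import Foundation
import Combine

struct PoseFrame { let timestamp: TimeInterval }   // placeholder for real pose data

@MainActor
final class DeferredAnalyzer: ObservableObject {
    @Published var progress: Double = 0   // drives the "analysis is cooking" indicator
    @Published var formScore: Int?        // set when the full analysis finishes

    func analyze(_ frames: [PoseFrame]) {
        Task.detached(priority: .userInitiated) {
            var scores: [Int] = []
            for (i, frame) in frames.enumerated() {
                scores.append(Self.deepScore(frame))   // expensive per-frame work
                let done = Double(i + 1) / Double(frames.count)
                await MainActor.run { self.progress = done }
            }
            let final = scores.isEmpty ? 0 : scores.reduce(0, +) / scores.count
            await MainActor.run { self.formScore = final }
        }
    }

    // Stand-in for phase detection, kinetic-chain and asymmetry analysis.
    nonisolated static func deepScore(_ frame: PoseFrame) -> Int { 80 }
}
```

The UI observes `progress` and `formScore`; nothing in the view layer ever waits on inference.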
What I'd tell other developers: never freeze UI for ML inference. Either run inference continuously (real-time path) or run it in the background with progress feedback (deferred path). The perceived performance of your app is more important than the actual inference speed.

Decision 4: Monolithic vs Modular Feature Architecture
SportsReflector has a lot of features: video analysis, AR workouts, AI training partner, workout planner, sports planner, drills library, calorie tracker, coach dashboard. Building these as a monolith would have been faster initially but catastrophic for iteration speed.
Each feature is a semi-independent module with defined interfaces:

Video Analysis Module: accepts video frames, returns pose data + form scores
AR Module: accepts real-time pose data, renders overlays
Training Partner Module: accepts real-time pose data, manages rep counting + voice coaching
Planner Module: accepts user preferences, returns workout plans, reads form scores for adaptation
Calorie Module: accepts food photos, returns nutrition data, reads training data for macro recommendations
Coach Module: accepts athlete data, returns dashboards + reports

The modules share data through a central store but don't directly depend on each other. The workout planner reads form scores but doesn't import the video analysis module. The calorie tracker reads training load but doesn't import the planner.
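The shared-store pattern can be sketched as a small thread-safe key-value store that modules publish to and read from, with no module importing another. This is a simplified illustration, not the app's actual persistence layer:

```swift
import Foundation

// Minimal shared store: modules communicate through typed reads/writes here.
final class AppStore {
    static let shared = AppStore()
    private var storage: [String: Any] = [:]
    private let queue = DispatchQueue(label: "app.store")   // serializes access

    func write<T>(_ value: T, forKey key: String) {
        queue.sync { storage[key] = value }
    }
    func read<T>(_ key: String, as type: T.Type) -> T? {
        queue.sync { storage[key] as? T }
    }
}

// Video analysis publishes scores; the planner reads them — no direct dependency.
AppStore.shared.write([72, 85, 91], forKey: "formScores")
let scores = AppStore.shared.read("formScores", as: [Int].self) ?? []
```

A production version would use typed keys and change notifications rather than strings and `Any`, but the dependency shape is the point: both modules depend on the store, never on each other.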
This modularity meant I could ship features independently. The workout planner shipped in version 1.3. Cardio training shipped in 1.4. The calorie tracker shipped in 1.4.1. Each feature was developed, tested, and released without touching the other modules.
What I'd tell other developers: resist the temptation to build features that deeply intertwine. Define interfaces early. Ship modules independently. The velocity advantage of modular architecture compounds over months — you're not debugging the entire app every time you ship a new feature.

Decision 5: Launch Time Optimization
The first production build launched in 2.1 seconds cold. For an app that users might open 3-5 times per day at the gym, that's an eternity of staring at a splash screen.
The optimization was straightforward in concept, tedious in practice:

Defer all network calls to after the first frame renders
Lazy-initialize the ML model (don't load it until the user opens the analysis screen)
Cache the home screen layout so the first render uses pre-computed dimensions
Move analytics initialization to a background queue
Pre-warm the camera session on a background thread after the home screen appears

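The lazy-initialization item can be sketched simply: wrap the heavy resource so that launch touches nothing and the first use pays the cost instead. Names are illustrative:

```swift
import Foundation

struct PoseModel {
    static func load() -> PoseModel {
        // In the real app this would be a CoreML model load — typically
        // hundreds of milliseconds you don't want on the launch path.
        PoseModel()
    }
}

final class ModelProvider {
    static let shared = ModelProvider()
    // `lazy` means nothing loads at cold launch; the first call to model()
    // from the analysis screen triggers the load instead.
    private lazy var poseModel = PoseModel.load()
    func model() -> PoseModel { poseModel }
}
```

The same shape applies to the other items on the list: analytics, network calls, and camera pre-warm all move behind "after first frame" triggers instead of running in the app delegate.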
Cold launch dropped to under 500ms. Users perceive the app as "instant." The compound effect on retention is real — users who experience fast launches open the app more frequently, which drives engagement metrics, which drives App Store ranking.
What I'd tell other developers: measure your cold launch time. If it's over 1 second, you're losing users to perceived sluggishness. The fix is almost always "stop doing things before the first frame renders."
What I Got Wrong
Underestimating localization. I launched English-only and added localization in version 1.4.2 across 25 languages. Should have done it from day one. International markets (Japan, Korea, Germany, Brazil) have high willingness to pay for fitness apps. Every month without localization was revenue left on the table.
Overengineering onboarding. The first onboarding flow was six screens explaining features. Users bounced before reaching the first analysis. The current flow is two screens: pick your sport, start recording. Feature discovery happens through usage, not tutorials.
Not triggering review prompts early enough. The app launched with zero App Store ratings for weeks. I should have triggered the SKStoreReviewController prompt after the first completed analysis from day one. Ratings velocity is one of the biggest factors in App Store discoverability. Every day without ratings is a day your app is invisible.
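The review trigger is a few lines with StoreKit's SKStoreReviewController. Note that Apple throttles this system prompt, so calling it is a request, not a guarantee:

```swift
import StoreKit
import UIKit

// Sketch: ask for a rating right after the user's first completed analysis.
func maybeRequestReview(completedAnalyses: Int) {
    guard completedAnalyses == 1 else { return }
    if let scene = UIApplication.shared.connectedScenes
        .first(where: { $0.activationState == .foregroundActive }) as? UIWindowScene {
        SKStoreReviewController.requestReview(in: scene)
    }
}
```
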
The Stack
For developers curious about the technical stack:

UI: SwiftUI with UIKit bridges for camera and AR views
Pose Estimation: Apple Vision framework (VNDetectHumanBodyPoseRequest)
AR: ARKit with body tracking, RealityKit for 3D overlays
ML Runtime: CoreML for on-device inference
AI Coaching: LLM API for generating sport-specific feedback
Food Recognition: Vision-based photo analysis for calorie tracking
Backend: Firebase (auth, database, cloud functions)
Analytics: MetricKit for performance monitoring
Haptics: CoreHaptics for custom feedback patterns
Voice: AVSpeechSynthesizer for training partner voice coaching

The app is built primarily with Rork, an AI-assisted iOS development platform, which significantly accelerated development speed for a solo developer.
The Takeaway
The model matters less than the architecture around it. Pose estimation is a solved problem — Apple gives you a production-ready model for free. The hard part is everything else: making inference fast enough for real-time use, building modular sport-specific analysis that scales to new sports, designing UI that doesn't block on ML processing, and shipping features independently without breaking existing ones.
If you're building an AI-powered consumer app, focus less on model accuracy and more on perceived performance, architectural modularity, and launch speed. Those are the decisions that determine whether users come back tomorrow.

SportsReflector is available on the iOS App Store. If you're working on pose estimation, CoreML, or ARKit and want to compare notes, reach out — I'm always happy to talk shop with other developers building in this space.
