<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Daisuke Majima</title>
    <description>The latest articles on DEV Community by Daisuke Majima (@john-rocky).</description>
    <link>https://dev.to/john-rocky</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3963789%2F963e7445-7057-4827-8fed-3aa6d9d42dfc.png</url>
      <title>DEV Community: Daisuke Majima</title>
      <link>https://dev.to/john-rocky</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/john-rocky"/>
    <language>en</language>
    <item>
      <title>A Swift library to run Segment Anything natively on iOS (SamKit)</title>
      <dc:creator>Daisuke Majima</dc:creator>
      <pubDate>Tue, 02 Jun 2026 06:16:54 +0000</pubDate>
      <link>https://dev.to/john-rocky/a-swift-library-to-run-segment-anything-natively-on-ios-samkit-5eho</link>
      <guid>https://dev.to/john-rocky/a-swift-library-to-run-segment-anything-natively-on-ios-samkit-5eho</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8c1hl5d9jx3kjzie8c7t.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8c1hl5d9jx3kjzie8c7t.gif" alt="SAMKit Demo" width="390" height="802"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For a while I'd wanted to build a Swift Package that runs Meta's &lt;a href="https://github.com/facebookresearch/segment-anything" rel="noopener noreferrer"&gt;Segment Anything Model (SAM)&lt;/a&gt; on-device on iOS.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cut out the object you tap&lt;/li&gt;
&lt;li&gt;Cut out the object you box in&lt;/li&gt;
&lt;li&gt;Cut out the object you specify by text&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Any of these prompts segments the object instantly, with all inference running on-device. The package even comes with ready-to-use UI components.&lt;/p&gt;

&lt;p&gt;So I built it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/john-rocky/SamKit" rel="noopener noreferrer"&gt;https://github.com/john-rocky/SamKit&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What it can do
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Point &amp;amp; Box&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tap for a point, drag for a box, then segment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Text Prompt&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Type text like &lt;code&gt;"dog"&lt;/code&gt; or &lt;code&gt;"red cup"&lt;/code&gt; to detect and segment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Subject Lift&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Long-press to lift an object out, Apple Photos–style; copy/save/share&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Two backbones&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MobileSAM (fast, 23 MB) and SAM2 Tiny (accurate, 76 MB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Drop-in UI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Just embed the SwiftUI views as-is&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SAMKit/
├── SAMKit            # core inference engine (point/box)
├── SAMKitGrounding   # text detection (YOLO-World + CLIP)
└── SAMKitUI          # SwiftUI views (SamView / TextPromptView)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The package is split into three Swift Package products, so you can import only what you need.&lt;/p&gt;
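
&lt;p&gt;As a rough sketch of importing only what you need, an app that wants point/box segmentation and the drop-in views, but not text prompts, might depend on just two of the three products. The target name &lt;code&gt;MyApp&lt;/code&gt; and the tools version are placeholders, and the &lt;code&gt;package: "SamKit"&lt;/code&gt; identity is an assumption based on the repository name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyApp",   // placeholder name
    // add a platforms: entry that matches SamKit's minimum iOS version
    dependencies: [
        .package(url: "https://github.com/john-rocky/SamKit.git", from: "1.0.0")
    ],
    targets: [
        .target(
            name: "MyApp",
            dependencies: [
                // core engine and SwiftUI views only; SAMKitGrounding is left out
                .product(name: "SAMKit", package: "SamKit"),
                .product(name: "SAMKitUI", package: "SamKit")
            ]
        )
    ]
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If text prompts turn out to be needed later, adding a third &lt;code&gt;.product&lt;/code&gt; entry for &lt;code&gt;SAMKitGrounding&lt;/code&gt; should be the only manifest change.&lt;/p&gt;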

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Add the Swift Package
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="nv"&gt;dependencies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;package&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"https://github.com/john-rocky/SamKit.git"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"1.0.0"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Download the models
&lt;/h3&gt;

&lt;p&gt;Get the &lt;code&gt;.mlpackage&lt;/code&gt; files from &lt;a href="https://github.com/john-rocky/SamKit/releases" rel="noopener noreferrer"&gt;Releases&lt;/a&gt; and add them to your Xcode project.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MobileSAM&lt;/td&gt;
&lt;td&gt;23 MB&lt;/td&gt;
&lt;td&gt;point/box segmentation (required)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SAM2 Tiny&lt;/td&gt;
&lt;td&gt;76 MB&lt;/td&gt;
&lt;td&gt;higher-accuracy segmentation (optional)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grounding (YOLO-World + CLIP)&lt;/td&gt;
&lt;td&gt;148 MB&lt;/td&gt;
&lt;td&gt;text detection (optional)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Point/box segmentation
&lt;/h3&gt;

&lt;p&gt;The most basic use: set an image, then specify a point.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;SAMKit&lt;/span&gt;

&lt;span class="c1"&gt;// create a session (the model auto-loads from a bundled resource)&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="kt"&gt;SamSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bundled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mobileSam&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nv"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bestAvailable&lt;/span&gt;      &lt;span class="c1"&gt;// priority: Neural Engine &amp;gt; GPU &amp;gt; CPU&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// encode the image (once; later predicts use the cache)&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setImage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cgImage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// segment by point&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;points&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;SamPoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;positive&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// results&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;masks&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;
&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cgImage&lt;/span&gt;   &lt;span class="c1"&gt;// segmentation mask image&lt;/span&gt;
&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;     &lt;span class="c1"&gt;// IoU confidence score&lt;/span&gt;
&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;     &lt;span class="c1"&gt;// alpha-channel data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also specify negative points (regions to exclude) and a bounding box:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;points&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="kt"&gt;SamPoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;positive&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;   &lt;span class="c1"&gt;// point to include&lt;/span&gt;
        &lt;span class="kt"&gt;SamPoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;negative&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;// point to exclude&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="nv"&gt;box&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;SamBox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;x0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;y0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;x1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;y1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;// bounding box&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Segment by text prompt
&lt;/h3&gt;

&lt;p&gt;For text prompts, YOLO-World + CLIP detect the objects named in the query, and SAM segments each detection.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;SAMKit&lt;/span&gt;
&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;SAMKitGrounding&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="kt"&gt;TextSegmentationSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;groundingModel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bundled&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="nv"&gt;samModel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bundled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mobileSam&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setImage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cgImage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// search by text and segment&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;segment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"dog, cat"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;detections&lt;/span&gt;   &lt;span class="c1"&gt;// detections (bounding box + label)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;masks&lt;/span&gt;        &lt;span class="c1"&gt;// segmentation mask for each detection&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;       &lt;span class="c1"&gt;// confidence scores&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cutting out the object
&lt;/h3&gt;

&lt;p&gt;You can cut the object out of the original image as a &lt;code&gt;CGImage&lt;/code&gt; with a transparent background, ready to be saved as a transparent PNG.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="c1"&gt;// cut out from a single mask&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;extracted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;masks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extractObject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cgImage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;// → a CGImage with a transparent background&lt;/span&gt;

&lt;span class="c1"&gt;// composite cut-out from multiple masks&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;SamMask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extractObject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cgImage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;masks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;masks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
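
&lt;p&gt;The snippet above stops at a &lt;code&gt;CGImage&lt;/code&gt;. To get actual PNG bytes, one option is UIKit's &lt;code&gt;pngData()&lt;/code&gt;, which keeps the alpha channel. The helpers &lt;code&gt;pngData(fromCutout:)&lt;/code&gt; and &lt;code&gt;saveCutout(_:named:)&lt;/code&gt; below are hypothetical names for illustration, not part of SamKit, and the sketch assumes the cut-out has already been unwrapped if the library returns an optional:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;import UIKit

/// Encodes a cut-out as PNG data, preserving transparency.
/// (hypothetical helper, not a SamKit API)
func pngData(fromCutout cutout: CGImage) -&amp;gt; Data? {
    UIImage(cgImage: cutout).pngData()
}

/// Writes the PNG into the temporary directory and returns the file URL,
/// or nil if the image could not be encoded.
func saveCutout(_ cutout: CGImage, named name: String = "cutout.png") throws -&amp;gt; URL? {
    guard let data = pngData(fromCutout: cutout) else { return nil }
    let url = FileManager.default.temporaryDirectory.appendingPathComponent(name)
    try data.write(to: url, options: .atomic)
    return url
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The returned URL can then be handed to a share sheet or copied elsewhere as needed.&lt;/p&gt;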



&lt;h3&gt;
  
  
  Embedding the SwiftUI views
&lt;/h3&gt;

&lt;p&gt;You don't need to build the UI yourself. &lt;code&gt;SAMKitUI&lt;/code&gt; includes ready-to-use views.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;SAMKitUI&lt;/span&gt;

&lt;span class="c1"&gt;// interactive segmentation by point/box&lt;/span&gt;
&lt;span class="kt"&gt;SamView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;uiImage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bundled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mobileSam&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;// segmentation by text search&lt;/span&gt;
&lt;span class="kt"&gt;TextPromptView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;uiImage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;textSession&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These views include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;subject highlight after segmentation (dim background + subject at full brightness)&lt;/li&gt;
&lt;li&gt;an animated glowing outline&lt;/li&gt;
&lt;li&gt;long-press to lift the object → drag → Copy/Save/Share menu&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How Subject Lift is implemented
&lt;/h2&gt;

&lt;p&gt;This section is a technical walkthrough of how SamKit recreates Apple Photos' "lift the subject" feature.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Binarizing the mask
&lt;/h3&gt;

&lt;p&gt;SAM's mask output consists of continuous sigmoid values, so it is first converted to a clean binary mask for display.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;binarizeMask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="nv"&gt;maskImage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;CGImage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;CGImage&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// get pixel data via CGContext&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;CGContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;draw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maskImage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;in&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;pixels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;!.&lt;/span&gt;&lt;span class="nf"&gt;bindMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;UInt8&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;capacity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;height&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;UInt8&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;  &lt;span class="c1"&gt;// 50% — SAM's standard threshold&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;o&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pixels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// fully opaque white&lt;/span&gt;
            &lt;span class="n"&gt;pixels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;pixels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;pixels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;pixels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;255&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// fully transparent&lt;/span&gt;
            &lt;span class="n"&gt;pixels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;pixels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;pixels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;pixels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;makeImage&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With a threshold of 0, even faint mask noise counts as foreground and most of the image gets cut out. A threshold of 128 (50%) is stable.&lt;/p&gt;
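
&lt;p&gt;To make the effect of the threshold concrete, here is a minimal sketch that applies the same &lt;code&gt;&amp;gt;=&lt;/code&gt; test to a handful of alpha values. The function &lt;code&gt;binarize(alpha:threshold:)&lt;/code&gt; and the sample values are made up for illustration; they are not taken from SamKit:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;/// Returns a binary alpha channel: 255 where alpha &amp;gt;= threshold, otherwise 0.
func binarize(alpha: [UInt8], threshold: UInt8) -&amp;gt; [UInt8] {
    alpha.map { $0 &amp;gt;= threshold ? 255 : 0 }
}

// hypothetical samples: background (0), faint noise (3, 12),
// a soft edge pixel (127), and confident foreground (200, 255)
let alpha: [UInt8] = [0, 3, 12, 127, 200, 255]

let stable = binarize(alpha: alpha, threshold: 128)
// [0, 0, 0, 0, 255, 255] : only the confident pixels survive

let noisy = binarize(alpha: alpha, threshold: 0)
// [255, 255, 255, 255, 255, 255] : every pixel passes the &amp;gt;= test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With this comparison, a threshold of 0 accepts every pixel, including pure background, which is why the cut-out balloons.&lt;/p&gt;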

&lt;h3&gt;
  
  
  2. Generating the glowing outline
&lt;/h3&gt;

&lt;p&gt;The mask's contour is extracted with CGContext's shadow feature, which is far faster than per-pixel dilation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;generateOutline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from&lt;/span&gt; &lt;span class="nv"&gt;maskImage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;CGImage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;CGImage&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Step 1: turn the mask into a solid-white silhouette&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;draw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maskImage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;in&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setBlendMode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sourceIn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setFillColor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;UIColor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;white&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cgColor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// → white silhouette&lt;/span&gt;

    &lt;span class="c1"&gt;// Step 2: draw with a shadow, then erase the interior → only the contour remains&lt;/span&gt;
    &lt;span class="n"&gt;outCtx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setShadow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zero&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;blur&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;glowRadius&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;UIColor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;white&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cgColor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;outCtx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;draw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;whiteSilhouette&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;in&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;// shadow = the contour's glow&lt;/span&gt;

    &lt;span class="n"&gt;outCtx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setBlendMode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;destinationOut&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;outCtx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;draw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;whiteSilhouette&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;in&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;// erase the interior → contour only&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;setShadow&lt;/code&gt; produces the white glow, so the outline needs only two draws&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.destinationOut&lt;/code&gt; erases the interior, leaving only the outer glow&lt;/li&gt;
&lt;li&gt;This is far faster than a dilation loop, which costs O(thickness² × pixels)&lt;/li&gt;
&lt;/ul&gt;
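
&lt;p&gt;Putting both steps together, a self-contained version could look roughly like the sketch below. It is not SamKit's actual source: the function name &lt;code&gt;makeGlowOutline(from:glowRadius:)&lt;/code&gt;, the RGBA bitmap format, and the default radius are assumptions, and the mask is assumed to carry its shape in the alpha channel. The shadow is also scoped with &lt;code&gt;saveGState&lt;/code&gt;/&lt;code&gt;restoreGState&lt;/code&gt; so that the erase pass does not cast a shadow of its own:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;import CoreGraphics

/// Builds a glow-only outline from a mask whose alpha channel marks the subject.
/// (hypothetical helper, sketched from the two steps described above)
func makeGlowOutline(from maskImage: CGImage, glowRadius: CGFloat = 8) -&amp;gt; CGImage? {
    let width = maskImage.width
    let height = maskImage.height
    let rect = CGRect(x: 0, y: 0, width: width, height: height)
    let white = CGColor(red: 1, green: 1, blue: 1, alpha: 1)

    // assumed bitmap format: 8-bit RGBA with premultiplied alpha
    func makeContext() -&amp;gt; CGContext? {
        CGContext(data: nil, width: width, height: height,
                  bitsPerComponent: 8, bytesPerRow: 0,
                  space: CGColorSpaceCreateDeviceRGB(),
                  bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue)
    }

    // Step 1: tint the mask solid white while keeping its alpha
    guard let silhouetteCtx = makeContext() else { return nil }
    silhouetteCtx.draw(maskImage, in: rect)
    silhouetteCtx.setBlendMode(.sourceIn)
    silhouetteCtx.setFillColor(white)
    silhouetteCtx.fill(rect)
    guard let whiteSilhouette = silhouetteCtx.makeImage(),
          let outCtx = makeContext() else { return nil }

    // Step 2a: draw the silhouette with a white shadow (the glow)
    outCtx.saveGState()
    outCtx.setShadow(offset: .zero, blur: glowRadius, color: white)
    outCtx.draw(whiteSilhouette, in: rect)
    outCtx.restoreGState()   // shadow is switched off again here

    // Step 2b: erase the interior so only the outer glow remains
    outCtx.setBlendMode(.destinationOut)
    outCtx.draw(whiteSilhouette, in: rect)
    return outCtx.makeImage()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The glow is clipped at the canvas edge, so a subject touching the image border may need some padding around the mask before it is passed in.&lt;/p&gt;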

&lt;h3&gt;
  
  
  3. Shimmer animation
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;TimelineView&lt;/code&gt; and &lt;code&gt;AngularGradient&lt;/code&gt; to make light travel around the contour.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kt"&gt;TimelineView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;animation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;minimumInterval&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;timeline&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;phase&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;timeline&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timeIntervalSinceReferenceDate&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;truncatingRemainder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;dividingBy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;2.5&lt;/span&gt;  &lt;span class="c1"&gt;// one lap in 2.5s&lt;/span&gt;

    &lt;span class="kt"&gt;ZStack&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// soft glow (blurred cyan)&lt;/span&gt;
        &lt;span class="n"&gt;outlineImage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;colorMultiply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;Color&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;red&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;green&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;blue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;blur&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;radius&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;opacity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;// sharp outline&lt;/span&gt;
        &lt;span class="n"&gt;outlineImage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;colorMultiply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;white&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;// moving highlight&lt;/span&gt;
        &lt;span class="n"&gt;outlineImage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;colorMultiply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;white&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="kt"&gt;AngularGradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="nv"&gt;colors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;white&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;white&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;opacity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clear&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clear&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="nv"&gt;center&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;center&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="nv"&gt;startAngle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;degrees&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phase&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;360&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="nv"&gt;endAngle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;degrees&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phase&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;360&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;360&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
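
&lt;p&gt;To make the phase arithmetic concrete, here is a minimal sketch of it in isolation. The helper name &lt;code&gt;shimmerAngles&lt;/code&gt; and its parameters are hypothetical (they are not part of SamKit); it only assumes the same 2.5-second lap as the snippet above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;import Foundation

/// Maps a timestamp to the gradient's start and end angles, in degrees.
/// `period` is the duration of one full lap in seconds (2.5 s above).
/// Hypothetical helper, for illustration only.
func shimmerAngles(at time: TimeInterval, period: TimeInterval = 2.5) -&amp;gt; (start: Double, end: Double) {
    // phase runs from 0 up to (but not including) 1 and wraps every `period` seconds
    let phase = time.truncatingRemainder(dividingBy: period) / period
    let start = phase * 360
    return (start, start + 360)  // the gradient always spans one full turn
}

let quarterLap = shimmerAngles(at: 0.625)   // a quarter of the 2.5 s lap
print(quarterLap.start, quarterLap.end)     // 90.0 450.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the end angle is always the start angle plus 360°, the gradient keeps its shape and simply rotates as the phase advances.&lt;/p&gt;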



&lt;h3&gt;
  
  
  4. Unified gesture handler
&lt;/h3&gt;

&lt;p&gt;Manage tap (add point), box drawing, and long-press lift &lt;strong&gt;all with a single &lt;code&gt;DragGesture(minimumDistance: 0)&lt;/code&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SwiftUI's &lt;code&gt;onTapGesture&lt;/code&gt; + &lt;code&gt;onLongPressGesture&lt;/code&gt; block each other, so I receive all touches in one gesture and classify them by time and movement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kt"&gt;DragGesture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;minimumDistance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;onChanged&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt;
        &lt;span class="c1"&gt;// schedule a timer on the first touch&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;gestureStartTime&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="kc"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;gestureStartTime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="c1"&gt;// decide long-press after 0.3s&lt;/span&gt;
            &lt;span class="kt"&gt;DispatchQueue&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;asyncAfter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;deadline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;guard&lt;/span&gt; &lt;span class="n"&gt;gestureStartTime&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;isLifted&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hasVisibleMasks&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;moved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;hypot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lastTranslation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lastTranslation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;guard&lt;/span&gt; &lt;span class="n"&gt;moved&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;// if it moved, it's not a long-press&lt;/span&gt;
                &lt;span class="nf"&gt;handleLiftObject&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;// start the lift!&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

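        &lt;span class="n"&gt;lastTranslation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;translation&lt;/span&gt;  &lt;span class="c1"&gt;// remember the latest travel for the long-press check&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;isLifted&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;liftDragOffset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;translation&lt;/span&gt;  &lt;span class="c1"&gt;// follow the drag&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;onEnded&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt;
        &lt;span class="c1"&gt;// measure duration and travel, then reset for the next touch&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gestureStartTime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="kt"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeIntervalSince&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;moved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;hypot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;translation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;translation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;gestureStartTime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;nil&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;isLifted&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;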
            &lt;span class="c1"&gt;// released → show menu&lt;/span&gt;
            &lt;span class="n"&gt;showLiftMenu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;moved&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// quick touch → add point&lt;/span&gt;
            &lt;span class="nf"&gt;addPoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startLocation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Classification logic:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 0.3s, &amp;lt; 15pt moved&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;tap&lt;/strong&gt; → add point&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;≥ 0.3s, &amp;lt; 15pt moved&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;long-press&lt;/strong&gt; → start lift&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;≥ 10pt moved (box mode)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;drag&lt;/strong&gt; → draw box&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;movement while lifted&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;lift-drag&lt;/strong&gt; → move object&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
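
&lt;p&gt;The table can also be read as a small pure function. The sketch below is only an illustration: the names &lt;code&gt;TouchVerdict&lt;/code&gt; and &lt;code&gt;classify&lt;/code&gt; are hypothetical, and the order in which the rules are checked is an assumption, since the table does not say which rule wins when several apply.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;import Foundation

// Hypothetical names; thresholds are taken from the table above.
enum TouchVerdict { case tap, longPress, boxDrag, liftDrag, ignored }

/// Classifies a touch from its duration (seconds) and travel (points).
/// Assumed precedence: lifted state first, then box mode, then tap vs. long-press.
func classify(elapsed: TimeInterval, moved: Double, isLifted: Bool, boxMode: Bool) -&amp;gt; TouchVerdict {
    if isLifted { return .liftDrag }                    // movement while lifted moves the object
    if boxMode, moved &amp;gt;= 10 { return .boxDrag }      // box mode: 10pt or more of travel draws a box
    if moved &amp;gt;= 15 { return .ignored }               // too much travel for a tap or a long-press
    return elapsed &amp;gt;= 0.3 ? .longPress : .tap        // held for 0.3s or more means long-press
}

print(classify(elapsed: 0.1, moved: 3, isLifted: false, boxMode: false))  // tap
print(classify(elapsed: 0.5, moved: 3, isLifted: false, boxMode: false))  // longPress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping the decision in one function like this makes the thresholds easy to unit-test without driving a real gesture.&lt;/p&gt;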

&lt;h3&gt;
  
  
  5. Subject highlight
&lt;/h3&gt;

&lt;p&gt;Rather than overlaying a colored mask after segmentation, &lt;strong&gt;dim the background and show only the subject at its original brightness.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="c1"&gt;// darken the background&lt;/span&gt;
&lt;span class="kt"&gt;Color&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;black&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;opacity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// show only the subject at the original image's brightness&lt;/span&gt;
&lt;span class="kt"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;uiImage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;uiImage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;UIImage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;cgImage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;binaryMask&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes the transition to long-press lift natural (the dimming just deepens 0.25 → 0.4).&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Image encoding&lt;/strong&gt;: runs once per image; later predictions reuse the cached embedding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inference&lt;/strong&gt;: accelerated on Neural Engine / GPU (FP16)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outline generation&lt;/strong&gt;: only two CGContext-shadow draws; no pixel loop&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Networking&lt;/strong&gt;: none. Fully on-device&lt;/li&gt;
&lt;/ul&gt;
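
&lt;p&gt;The encode-once idea from the first bullet can be sketched generically. This is not SamKit's internal code: &lt;code&gt;EncodeOnceCache&lt;/code&gt; and its method are hypothetical names, and the closure simply stands in for the expensive image encoder.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;import Foundation

/// Caches one expensive result per key so repeated prompts skip the encoder.
/// Hypothetical helper, not part of SamKit.
final class EncodeOnceCache&amp;lt;Key: Hashable, Embedding&amp;gt; {
    private var storage: [Key: Embedding] = [:]
    private(set) var encodeCount = 0   // how many times the encoder actually ran

    /// Returns the cached embedding, running `encode` only on the first request for `key`.
    func embedding(for key: Key, encode: () -&amp;gt; Embedding) -&amp;gt; Embedding {
        if let cached = storage[key] { return cached }
        encodeCount += 1
        let fresh = encode()
        storage[key] = fresh
        return fresh
    }
}

let cache = EncodeOnceCache&amp;lt;String, [Float]&amp;gt;()
_ = cache.embedding(for: "photo-1") { [0.1, 0.2] }   // first request: runs the encoder
_ = cache.embedding(for: "photo-1") { [9.9] }        // second request: served from the cache
print(cache.encodeCount)                             // 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;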

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;With SAMKit you can add segmentation to an iOS app in a few lines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="c1"&gt;// an interactive segmentation UI in one line&lt;/span&gt;
&lt;span class="kt"&gt;SamView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;uiImage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bundled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mobileSam&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Experiences like Subject Lift are built in too, so you can bring an Apple Photos–like UX into your own app immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/john-rocky/SamKit" rel="noopener noreferrer"&gt;https://github.com/john-rocky/SamKit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feedback and issues welcome!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published in Japanese on &lt;a href="https://qiita.com/john-rocky/items/83a5aa73edfeb577f5ea" rel="noopener noreferrer"&gt;Qiita&lt;/a&gt;. &lt;a href="https://github.com/john-rocky" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; / &lt;a href="https://twitter.com/JackdeS11" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ios</category>
      <category>swift</category>
      <category>machinelearning</category>
      <category>coreml</category>
    </item>
    <item>
      <title>Type 'dog' to detect a dog: running YOLO-World on iPhone</title>
      <dc:creator>Daisuke Majima</dc:creator>
      <pubDate>Tue, 02 Jun 2026 06:16:52 +0000</pubDate>
      <link>https://dev.to/john-rocky/type-dog-to-detect-a-dog-running-yolo-world-on-iphone-2h2e</link>
      <guid>https://dev.to/john-rocky/type-dog-to-detect-a-dog-running-yolo-world-on-iphone-2h2e</guid>
      <description>&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;p&gt;Type text like "person, red car, coffee cup" and it detects those objects in the camera view in real time. No class list needed. &lt;strong&gt;You can specify any words you like, up to 80 queries at a time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgjpcz7ij7abal04obh8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgjpcz7ij7abal04obh8.png" width="800" height="1739"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is YOLO-World's "Open-Vocabulary Detection." Presented at CVPR 2024, it's a fundamentally different approach from the conventional "fixed 80 classes" YOLO.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;text input ──→ CLIP Text Encoder ──→ text features [1,80,512]
                                            │
camera feed ──→ YOLO-World Detector ────────┤──→ boxes [1,4,8400]
                                            └──→ scores [1,80,8400]
                                                     │
                                                 NMS + Filter ──→ bounding boxes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It pairs CLIP's language understanding with YOLO's detection speed: the text is converted into embedding vectors, and each detection is scored by how well those vectors match the features extracted from the image.&lt;/p&gt;

&lt;p&gt;The CLIP encoder re-runs only when the query text changes; per-frame camera inference uses only the visual detector. The heavy text encoding therefore never runs inside the frame loop.&lt;/p&gt;
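
&lt;p&gt;As a rough mental model of "matching score," the sketch below scores one image feature against a few text embeddings with a plain dot product. It is a simplification for illustration: the real model computes its scores inside the network, and &lt;code&gt;dot&lt;/code&gt; and &lt;code&gt;bestMatch&lt;/code&gt; are hypothetical helpers with toy 2-dimensional vectors instead of 512-dimensional ones.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;import Foundation

/// Dot product of two equal-length vectors.
/// For L2-normalized inputs this equals the cosine similarity.
func dot(_ a: [Float], _ b: [Float]) -&amp;gt; Float {
    zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
}

/// Returns the index and score of the query that best matches one image feature,
/// or nil when there are no queries. Hypothetical helper, for illustration only.
func bestMatch(imageFeature: [Float], textFeatures: [[Float]]) -&amp;gt; (index: Int, score: Float)? {
    textFeatures.enumerated()
        .map { (index: $0.offset, score: dot(imageFeature, $0.element)) }
        .max { $0.score &amp;lt; $1.score }
}

let queries: [[Float]] = [[1, 0], [0, 1]]   // toy embeddings for two queries
let region: [Float] = [0.6, 0.8]            // toy image feature, already unit length
if let hit = bestMatch(imageFeature: region, textFeatures: queries) {
    print(hit.index, hit.score)             // 1 0.8
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;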

&lt;h2&gt;
  
  
  Preparing the CoreML models
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Download (ready to use)
&lt;/h3&gt;

&lt;p&gt;Download these three files from the release assets of the &lt;a href="https://github.com/john-rocky/CoreML-Models" rel="noopener noreferrer"&gt;CoreML-Models&lt;/a&gt; repository:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/john-rocky/CoreML-Models/releases/download/yolo-models-v1/yoloworld_detector.mlpackage.zip" rel="noopener noreferrer"&gt;yoloworld_detector.mlpackage&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;25 MB&lt;/td&gt;
&lt;td&gt;YOLO-World V2-S (image → boxes+scores)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/john-rocky/CoreML-Models/releases/download/yolo-models-v1/clip_text_encoder.mlpackage.zip" rel="noopener noreferrer"&gt;clip_text_encoder.mlpackage&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;121 MB&lt;/td&gt;
&lt;td&gt;CLIP ViT-B/32 (text → embedding)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/john-rocky/CoreML-Models/releases/download/yolo-models-v1/clip_vocab.json.zip" rel="noopener noreferrer"&gt;clip_vocab.json&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;1.6 MB&lt;/td&gt;
&lt;td&gt;BPE tokenizer vocabulary&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Convert it yourself
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;ultralytics open_clip_torch &lt;span class="nv"&gt;coremltools&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;8.1
python convert_models.py &lt;span class="nt"&gt;--size&lt;/span&gt; s  &lt;span class="c"&gt;# s/m/l/x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The conversion script does three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Unwrap YOLO-World V2's Detect head&lt;/strong&gt; — output &lt;code&gt;boxes [1,4,8400]&lt;/code&gt; and &lt;code&gt;scores [1,NC,8400]&lt;/code&gt; directly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Convert CLIP's text encoder standalone&lt;/strong&gt; — patch MultiheadAttention to be CoreML-compatible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export the BPE vocab as JSON&lt;/strong&gt; — for the Swift-side tokenizer&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  iOS implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Architecture overview
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TextGroundingDetector (ObservableObject)
├── visualModel: MLModel    — YOLO-World detector
├── textEncoder: MLModel    — CLIP text encoder
├── tokenizer: CLIPTokenizer — BPE tokenizer
└── cachedTxtFeats: MLMultiArray — text-feature cache
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Encoding text
&lt;/h3&gt;

&lt;p&gt;This runs only when the user changes the query; the result is cached.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;updateQueries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="nv"&gt;queryString&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;throws&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;queryString&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;separator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;","&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;$0&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trimmingCharacters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;in&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;whitespaces&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// tokenize each query → CLIP encoder → 512-dim vector&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;txtFeats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="kt"&gt;MLMultiArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nv"&gt;dataType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enumerated&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;// ... textEncoder.prediction() via MLDictionaryFeatureProvider ...&lt;/span&gt;
        &lt;span class="c1"&gt;// L2-normalize the result and store into txtFeats[i]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;cachedTxtFeats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;txtFeats&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Up to 80 queries can be detected at once&lt;/li&gt;
&lt;li&gt;L2 normalization is important: CLIP similarity is cosine similarity, so each embedding has to be scaled to unit length before it is matched&lt;/li&gt;
&lt;li&gt;Fast normalization with Accelerate via &lt;code&gt;vDSP_svesq&lt;/code&gt; + &lt;code&gt;vDSP_vsmul&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
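
&lt;p&gt;The normalization mentioned in the last bullet might look like the following. This is a minimal sketch rather than the app's exact code: &lt;code&gt;l2Normalized&lt;/code&gt; is a hypothetical helper that works on a plain &lt;code&gt;[Float]&lt;/code&gt;, whereas the real implementation would write into the &lt;code&gt;MLMultiArray&lt;/code&gt; row for each query.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;import Accelerate

/// Returns `v` scaled to unit L2 length; a zero vector is returned unchanged.
/// Hypothetical helper, for illustration only.
func l2Normalized(_ v: [Float]) -&amp;gt; [Float] {
    // sum of squares in a single vectorized pass
    var sumOfSquares: Float = 0
    vDSP_svesq(v, 1, &amp;amp;sumOfSquares, vDSP_Length(v.count))

    let norm = sumOfSquares.squareRoot()
    guard norm &amp;gt; 0 else { return v }   // avoid dividing by zero

    // multiply every element by 1 / norm
    var scale = 1 / norm
    var result = [Float](repeating: 0, count: v.count)
    vDSP_vsmul(v, 1, &amp;amp;scale, &amp;amp;result, 1, vDSP_Length(v.count))
    return result
}

let unit = l2Normalized([3, 4])   // approximately [0.6, 0.8]
print(unit)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;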

&lt;h3&gt;
  
  
  Image preprocessing
&lt;/h3&gt;

&lt;p&gt;YOLO-World requires letterbox preprocessing (keep aspect ratio + padding):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;preprocessImage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="nv"&gt;cgImage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;CGImage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;throws&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;MLMultiArray&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// source size in pixels&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;imgW&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cgImage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;imgH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cgImage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;scale&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;640&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="kt"&gt;Float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;imgW&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;imgH&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;scaledW&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;Float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;imgW&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;scaledH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;Float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;imgH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;padX&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;640&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;scaledW&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;padY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;640&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;scaledH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

    &lt;span class="c1"&gt;// draw onto a 640x640 canvas padded with gray (0.5)&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setFillColor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;gray&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;CGRect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;640&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;640&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;draw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cgImage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;in&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;CGRect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;padX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;padY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;scaledW&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;scaledH&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

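    &lt;span class="c1"&gt;// RGBA → CHW Float32 [0,1] (setup of ctx, src and dst is omitted here)&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;hw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;640&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;640&lt;/span&gt;  &lt;span class="c1"&gt;// pixels per channel plane&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;hw&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;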
        &lt;span class="n"&gt;dst&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;hw&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;255&lt;/span&gt;  &lt;span class="c1"&gt;// R&lt;/span&gt;
        &lt;span class="n"&gt;dst&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;hw&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;255&lt;/span&gt;  &lt;span class="c1"&gt;// G&lt;/span&gt;
        &lt;span class="n"&gt;dst&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;hw&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;255&lt;/span&gt;  &lt;span class="c1"&gt;// B&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Don't use &lt;code&gt;.scaleFill&lt;/code&gt;: it stretches the image and distorts the aspect ratio. Letterboxing keeps the aspect ratio but shifts coordinates by the padding, so you have to subtract the padding back out of the output coordinates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inference and post-processing
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="kt"&gt;MLDictionaryFeatureProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;dictionary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"txt_feats"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cachedTxtFeats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// cached text features&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;visualModel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;boxes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;featureValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;for&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"boxes"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;!.&lt;/span&gt;&lt;span class="n"&gt;multiArrayValue&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;   &lt;span class="c1"&gt;// [1,4,8400]&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;featureValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;for&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"scores"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;!.&lt;/span&gt;&lt;span class="n"&gt;multiArrayValue&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="c1"&gt;// [1,NC,8400]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;qi&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;queryCount&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;anchor&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;8400&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;qi&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;8400&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;anchor&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;floatValue&lt;/span&gt;
        &lt;span class="k"&gt;guard&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;continue&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;cx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boxes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;8400&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;anchor&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;floatValue&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;cy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boxes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;8400&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;anchor&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;floatValue&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;bw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boxes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;8400&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;anchor&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;floatValue&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;bh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boxes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;8400&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;anchor&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;floatValue&lt;/span&gt;

        &lt;span class="c1"&gt;// remove padding and convert to normalized coordinates&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;nx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cx&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;bw&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;padX&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;imgW&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;ny&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cy&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;bh&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;padY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;imgH&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
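
&lt;p&gt;The loop above only computes the top-left corner. As a minimal sketch of the full conversion, assuming the box values are in 640×640 letterboxed pixel space and that you want a rect normalized to the original image, a hypothetical helper (&lt;code&gt;unletterbox&lt;/code&gt; is my name, not part of the sample app) could look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;import CoreGraphics

/// Hypothetical helper: maps a center-format box in letterboxed space
/// to a rect normalized (0...1) to the original image.
func unletterbox(cx: CGFloat, cy: CGFloat, bw: CGFloat, bh: CGFloat,
                 padX: CGFloat, padY: CGFloat,
                 scale: CGFloat, imgW: CGFloat, imgH: CGFloat) -&amp;gt; CGRect {
    let scaledW = imgW * scale  // width of the image inside the letterbox
    let scaledH = imgH * scale  // height of the image inside the letterbox
    return CGRect(x: (cx - bw / 2 - padX) / scaledW,
                  y: (cy - bh / 2 - padY) / scaledH,
                  width: bw / scaledW,
                  height: bh / scaledH)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Width and height only need the division, because padding shifts a box but doesn't resize it.&lt;/p&gt;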



&lt;p&gt;The output scores are sigmoid values already computed by the BNContrastiveHead, so you can use them directly as confidence.&lt;/p&gt;

&lt;h3&gt;
  
  
  NMS
&lt;/h3&gt;

&lt;p&gt;Apply NMS per query (per-class):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="n"&gt;allDets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sort&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;$0&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;kept&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;allDets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indices&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;suppress&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ki&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;kept&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;allDets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classIndex&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;allDets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ki&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classIndex&lt;/span&gt;
            &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nf"&gt;iou&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;allDets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;allDets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ki&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;suppress&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;suppress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;kept&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
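
&lt;p&gt;The &lt;code&gt;iou&lt;/code&gt; helper isn't shown above. A minimal sketch, assuming &lt;code&gt;rect&lt;/code&gt; is a &lt;code&gt;CGRect&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;import CoreGraphics

/// Intersection-over-union of two axis-aligned rectangles.
func iou(_ a: CGRect, _ b: CGRect) -&amp;gt; CGFloat {
    let inter = a.intersection(b)
    // intersection(_:) returns a null rect when the two don't overlap
    guard !inter.isNull else { return 0 }
    let interArea = inter.width * inter.height
    let unionArea = a.width * a.height + b.width * b.height - interArea
    return unionArea &amp;gt; 0 ? interArea / unionArea : 0
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;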



&lt;h3&gt;
  
  
  BPE tokenizer (Swift)
&lt;/h3&gt;

&lt;p&gt;You need to implement CLIP's tokenizer in Swift. Load the BPE merge rules and vocabulary from &lt;code&gt;clip_vocab.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="kt"&gt;CLIPTokenizer&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;contextLength&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;  &lt;span class="c1"&gt;// 77&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;bpeRanks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;// key is "first second"; Swift tuples aren't Hashable&lt;/span&gt;

    &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="nv"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"&amp;lt;|startoftext|&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="c1"&gt;// lowercase text → split into characters → BPE merge → token IDs&lt;/span&gt;
        &lt;span class="c1"&gt;// ...&lt;/span&gt;
        &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"&amp;lt;|endoftext|&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;// pad to contextLength (77)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="kt"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;repeating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;contextLength&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
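
&lt;p&gt;The elided "BPE merge" step can be sketched roughly as follows. This is an illustration, not the sample app's code: &lt;code&gt;bpeMerge&lt;/code&gt; is a hypothetical name, the ranks are assumed to be keyed by the pair joined with a space, and it merges one occurrence per pass, whereas the reference CLIP tokenizer merges every occurrence of the best pair at once and also appends &lt;code&gt;&amp;lt;/w&amp;gt;&lt;/code&gt; to the last symbol of each word.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;/// Hypothetical sketch: repeatedly merge the adjacent pair with the lowest rank.
func bpeMerge(_ symbols: [String], ranks: [String: Int]) -&amp;gt; [String] {
    var word = symbols
    while word.count &amp;gt; 1 {
        var bestIndex = -1
        var bestRank = Int.max
        // find the adjacent pair that was learned earliest (lowest rank)
        for i in 0..&amp;lt;(word.count - 1) {
            if let rank = ranks[word[i] + " " + word[i + 1]], rank &amp;lt; bestRank {
                bestIndex = i
                bestRank = rank
            }
        }
        if bestIndex &amp;lt; 0 { break }  // no known merge left
        let merged = word[bestIndex] + word[bestIndex + 1]
        word.replaceSubrange(bestIndex...(bestIndex + 1), with: [merged])
    }
    return word
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For example, &lt;code&gt;bpeMerge(["l", "o", "w"], ranks: ["l o": 0, "lo w": 1])&lt;/code&gt; returns &lt;code&gt;["low"]&lt;/code&gt;. Each resulting symbol is then looked up in &lt;code&gt;encoder&lt;/code&gt; to get its token ID.&lt;/p&gt;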



&lt;h2&gt;
  
  
  Compared with ordinary YOLO
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;YOLO-World (Open-Vocabulary)&lt;/th&gt;
&lt;th&gt;YOLO26 (fixed classes)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Detection target&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;any text&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;fixed COCO 80 classes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model setup&lt;/td&gt;
&lt;td&gt;Detector + CLIP Encoder + Vocab&lt;/td&gt;
&lt;td&gt;one model only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total size&lt;/td&gt;
&lt;td&gt;~148 MB&lt;/td&gt;
&lt;td&gt;~18 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NMS&lt;/td&gt;
&lt;td&gt;implemented app-side&lt;/td&gt;
&lt;td&gt;none (End-to-End)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use for&lt;/td&gt;
&lt;td&gt;flexible detection / search / grounding&lt;/td&gt;
&lt;td&gt;general object detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;a bit slower (CLIP overhead)&lt;/td&gt;
&lt;td&gt;fastest&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Practical scenarios
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Search by "red sneakers"&lt;/strong&gt; — visual search in an e-commerce app&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detect "cracks"&lt;/strong&gt; — infrastructure inspection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detect "dog, cat, hamster" simultaneously&lt;/strong&gt; — pet tracking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Let users freely specify what to detect&lt;/strong&gt; — deploy without customization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With fixed-class YOLO you had to collect a dataset and retrain to detect "cracks." With YOLO-World you just change the text.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sample app
&lt;/h2&gt;

&lt;p&gt;A complete sample app is in &lt;code&gt;sample_apps/YOLOWorldDemo/&lt;/code&gt; of the &lt;a href="https://github.com/john-rocky/CoreML-Models" rel="noopener noreferrer"&gt;CoreML-Models&lt;/a&gt; repository.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3 modes: camera / photo / video&lt;/li&gt;
&lt;li&gt;freely change the query in a text field&lt;/li&gt;
&lt;li&gt;real-time filtering with a confidence slider&lt;/li&gt;
&lt;li&gt;download the models from release assets and drag into Xcode&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conversion tips
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;coremltools 8.1&lt;/strong&gt; (9.0 has a bug)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You need to patch &lt;code&gt;torch.nn.MultiheadAttention.forward&lt;/code&gt;&lt;/strong&gt; — CoreML can't convert the default PyTorch MHA well; monkey-patch it to call &lt;code&gt;F.multi_head_attention_forward&lt;/code&gt; directly&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;YOLO-World V2&lt;/strong&gt; (faster and more accurate than V1)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;compute_precision=ct.precision.FLOAT16&lt;/code&gt; roughly halves the model size compared with Float32&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;YOLO-World delivers intuitive, powerful object detection where you "specify what you want to detect by text." Run it on the iPhone's Neural Engine and it works server-free, offline, with low latency.&lt;/p&gt;

&lt;p&gt;When to use which:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Speed-first, COCO 80 classes is enough → &lt;strong&gt;YOLO26&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Want to flexibly change targets → &lt;strong&gt;YOLO-World&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2401.17270" rel="noopener noreferrer"&gt;YOLO-World paper (CVPR 2024)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/AILab-CVC/YOLO-World" rel="noopener noreferrer"&gt;AILab-CVC/YOLO-World (GitHub)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.ultralytics.com/models/yolo-world/" rel="noopener noreferrer"&gt;Ultralytics YOLO-World Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/john-rocky/CoreML-Models" rel="noopener noreferrer"&gt;CoreML-Models repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/openai/CLIP" rel="noopener noreferrer"&gt;OpenAI CLIP&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published in Japanese on &lt;a href="https://qiita.com/john-rocky/items/f573001a4fec4451ced0" rel="noopener noreferrer"&gt;Qiita&lt;/a&gt;. &lt;a href="https://github.com/john-rocky" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; / &lt;a href="https://twitter.com/JackdeS11" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ios</category>
      <category>machinelearning</category>
      <category>computervision</category>
      <category>coreml</category>
    </item>
    <item>
      <title>Real-time relighting of Gaussian Splatting reflections on iPhone (Metal)</title>
      <dc:creator>Daisuke Majima</dc:creator>
      <pubDate>Tue, 02 Jun 2026 06:11:50 +0000</pubDate>
      <link>https://dev.to/john-rocky/real-time-relighting-of-gaussian-splatting-reflections-on-iphone-metal-3b26</link>
      <guid>https://dev.to/john-rocky/real-time-relighting-of-gaussian-splatting-reflections-on-iphone-metal-3b26</guid>
      <description>&lt;p&gt;I built a Metal viewer on iPhone that &lt;strong&gt;re-lights an already-captured 3D scene with any lighting you like.&lt;/strong&gt; Swap or rotate the HDR environment map, and the object's reflections follow in real time, with that environment also drawn into the background.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fee7g9orkg1964chxkend.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fee7g9orkg1964chxkend.gif" width="360" height="783"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why relighting is valuable (the practical, commercial meaning)
&lt;/h2&gt;

&lt;p&gt;Ordinary Gaussian Splatting has the &lt;strong&gt;lighting from capture time baked in.&lt;/strong&gt; So if you place the captured object somewhere else, the highlights and shadows clash with the new surroundings and look fake. Relighting strips that baked-in light off and &lt;strong&gt;recovers the underlying material&lt;/strong&gt;, so you can place the captured object &lt;strong&gt;under any lighting.&lt;/strong&gt; This matters in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;E-commerce / product visuals&lt;/strong&gt;: capture a product once and show it under any light — showroom, outdoors, the customer's room (AR). Strong for "texture sells" goods like furniture, cars, jewelry, sneakers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Film / virtual production&lt;/strong&gt;: place a real capture into a new scene, consistent with that scene's lighting (HDRI / LED wall). No reshoot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AR / spatial computing&lt;/strong&gt;: an object placed in a real room &lt;strong&gt;only blends in once it's lit by the room's light.&lt;/strong&gt; Relighting is a precondition for realistic AR placement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Games / real-time 3D&lt;/strong&gt;: instead of a baked-in fixed look, you get a photoreal asset that reacts to dynamic in-game light (day/night, moving lights).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: an asset captured in minutes becomes "usable under real production lighting," a result that otherwise takes days of manual modeling and material authoring.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, relighting &lt;strong&gt;turns a captured 3D scene into an actually usable asset.&lt;/strong&gt; This article is a record of bringing it from desktop research code (which assumes CUDA) to a &lt;strong&gt;real-time Metal implementation on iPhone.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;The first half collects &lt;strong&gt;background knowledge&lt;/strong&gt;; the second half covers the &lt;strong&gt;implementation and four bugs I hit.&lt;/strong&gt; I define jargon as it appears, so you can follow even without a Gaussian Splatting / PBR background.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Repo: &lt;a href="https://github.com/john-rocky/MetalGaussianSplatRelighting" rel="noopener noreferrer"&gt;https://github.com/john-rocky/MetalGaussianSplatRelighting&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h1&gt;
  
  
  Background
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1. What is 3D Gaussian Splatting
&lt;/h2&gt;

&lt;p&gt;3D Gaussian Splatting is a method that reconstructs a 3D scene from a set of photos and renders it in real time. The scene is represented as a huge set of translucent ellipsoids called &lt;strong&gt;splats.&lt;/strong&gt; Each splat has "position, shape &amp;amp; orientation (rotation), color, opacity." Rendering projects each splat to the screen as an ellipse and &lt;strong&gt;alpha-composites them front-to-back in depth order.&lt;/strong&gt; Color changes with viewing angle (view-dependent).&lt;/p&gt;

&lt;p&gt;Key point: ordinary Gaussian Splatting holds the &lt;strong&gt;appearance (color) itself&lt;/strong&gt;, with the &lt;strong&gt;capture-time lighting baked in.&lt;/strong&gt; So you can't change the lighting afterward.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Lighting and relighting
&lt;/h2&gt;

&lt;p&gt;Ordinary GS directly learns "the color of that spot photographed under that light." So lighting is fixed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Relightable GS&lt;/strong&gt; thinks differently. Per splat, it holds not "color" but a decomposed &lt;strong&gt;material&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Albedo&lt;/strong&gt;: the base color of the material itself, with lighting removed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normal&lt;/strong&gt;: the direction the surface faces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Roughness&lt;/strong&gt;: surface micro-roughness&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reflectance&lt;/strong&gt;: strength of specular reflection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With the material, you can &lt;strong&gt;recompute the color on the fly under any environment light.&lt;/strong&gt; That's relighting. The &lt;a href="https://github.com/fudan-zvg/ref-gaussian" rel="noopener noreferrer"&gt;Ref-Gaussian&lt;/a&gt; I used learns this material decomposition.&lt;/p&gt;




&lt;p&gt;If you take away "splats hold material, and we want to re-light them in a new environment," that's enough. &lt;strong&gt;The rest, #3–#5, just answer one question:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Given that material and the environment, how do we compute the color of one pixel?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  3. Reflection is split into two and added
&lt;/h2&gt;

&lt;p&gt;Light hitting a surface returns in two ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Diffuse&lt;/strong&gt;: returns light evenly in all directions → you see the &lt;strong&gt;albedo color itself.&lt;/strong&gt; No reflections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specular&lt;/strong&gt;: returns only in a specific direction (the reflection of the incidence) → the &lt;strong&gt;environment is reflected.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So &lt;code&gt;color = diffuse + specular&lt;/code&gt;. The splat's &lt;strong&gt;reflectance&lt;/strong&gt; decides the blend (how strong the specular is). And &lt;strong&gt;roughness decides how blurry the specular is&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;low roughness → the environment reflects crisply, like a mirror&lt;/li&gt;
&lt;li&gt;high roughness → blurry; the bright parts of the environment just appear as a &lt;strong&gt;vague blob&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matters later. In fact, a glossy car (roughness ~0.2) just shows the environment lighting as a &lt;strong&gt;blurry white blob&lt;/strong&gt; that moves, without resolving beams or window shapes. It's not "no reflection," it's a &lt;strong&gt;blurry reflection&lt;/strong&gt;, and that's physically correct. Drop the roughness way down and the same car becomes mirror-like, clearly reflecting the room.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. How to hold the environment (light source)
&lt;/h2&gt;

&lt;p&gt;Instead of placing point bulbs, use a &lt;strong&gt;360° image&lt;/strong&gt; as the light: in effect, a table of what color of light arrives from each direction.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HDR&lt;/strong&gt;: an image that can hold brightness above 1 (needed because windows and lights are orders of magnitude brighter than paper)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;equirect / cubemap&lt;/strong&gt;: names for the &lt;strong&gt;storage format&lt;/strong&gt; of that 360° image. Both store the same content: "direction → light color."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Swapping the environment map = swapping the lighting = relighting.&lt;/p&gt;
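
&lt;p&gt;To make "direction → light color" concrete, here is a rough sketch of the equirect lookup on the CPU side. The axis convention (Y-up, with &lt;code&gt;u&lt;/code&gt; wrapping around the Y axis) and the name &lt;code&gt;equirectUV&lt;/code&gt; are assumptions for illustration; the repo's shader may use a different convention.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;import Foundation
import simd

/// Hypothetical helper: maps a direction to equirect texture coordinates in 0...1.
/// Assumes Y-up; v = 0 is straight up, v = 1 is straight down.
func equirectUV(_ direction: SIMD3&amp;lt;Float&amp;gt;) -&amp;gt; SIMD2&amp;lt;Float&amp;gt; {
    let d = simd_normalize(direction)
    let u = atan2(d.z, d.x) / (2 * Float.pi) + 0.5  // longitude
    let v = acos(max(-1, min(1, d.y))) / Float.pi   // latitude, clamped for safety
    return SIMD2&amp;lt;Float&amp;gt;(u, v)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A cubemap answers the same question with a face index plus a 2D coordinate, which is why the two formats are interchangeable as light sources.&lt;/p&gt;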

&lt;h2&gt;
  
  
  5. Getting diffuse and specular from the environment (split-sum)
&lt;/h2&gt;

&lt;p&gt;We want to compute #3's "diffuse" and "specular" from #4's environment map. Done naively, you integrate the environment per pixel — heavy every frame. So UE4's &lt;strong&gt;split-sum&lt;/strong&gt; &lt;strong&gt;precomputes two images&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;For diffuse (irradiance)&lt;/strong&gt;: the environment averaged, cosine-weighted, over the hemisphere around each direction → one lookup in the &lt;strong&gt;normal direction&lt;/strong&gt; gives diffuse light.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For specular (prefiltered)&lt;/strong&gt;: the environment &lt;strong&gt;blurred per roughness level&lt;/strong&gt; (stored progressively in mips) → one lookup in the &lt;strong&gt;reflection direction&lt;/strong&gt; gives specular light (blur level = roughness).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At runtime you just sample these two textures. Replacing a heavy integral with "look up a pre-blurred image" is the heart of split-sum. (Auxiliary: a small table that fine-tunes specular strength by angle — the BRDF LUT — is also precomputed.)&lt;/p&gt;

&lt;p&gt;I implemented this precompute kernel in Metal.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Deferred shading (a splat-specific issue)
&lt;/h2&gt;

&lt;p&gt;Splats are translucent and overlap, so neighboring splats' normals vary and get &lt;strong&gt;noisy.&lt;/strong&gt; If you shade each splat individually and then composite, that noise comes straight through.&lt;/p&gt;

&lt;p&gt;So &lt;strong&gt;change the order&lt;/strong&gt;: first accumulate each splat's material (color, normal, roughness, etc.) into a screen buffer (&lt;strong&gt;G-buffer&lt;/strong&gt;) and &lt;strong&gt;blend = average&lt;/strong&gt;, then shade &lt;strong&gt;once per pixel.&lt;/strong&gt; Computing after normals are averaged reduces noise. In Metal, use &lt;strong&gt;tile memory (imageblock)&lt;/strong&gt; to keep this buffer on the GPU and process it fast.&lt;/p&gt;
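
&lt;p&gt;As a CPU-side illustration of "blend first, shade once," here is a minimal sketch for a single pixel. The &lt;code&gt;Splat&lt;/code&gt; type and &lt;code&gt;blendedNormal&lt;/code&gt; are hypothetical names, and the splats are assumed to be already sorted front-to-back; the real version runs on the GPU in tile memory.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;import simd

// Hypothetical per-pixel splat sample: only what the blend needs.
struct Splat {
    var normal: SIMD3&amp;lt;Float&amp;gt;
    var alpha: Float
}

/// Front-to-back alpha compositing of normals for one pixel.
/// Returns nil when nothing covers the pixel.
func blendedNormal(_ sortedSplats: [Splat]) -&amp;gt; SIMD3&amp;lt;Float&amp;gt;? {
    var accum = SIMD3&amp;lt;Float&amp;gt;(repeating: 0)
    var transmittance: Float = 1  // how much light still passes through
    for s in sortedSplats {
        accum += s.normal * (s.alpha * transmittance)
        transmittance *= 1 - s.alpha
        if transmittance &amp;lt; 0.001 { break }  // pixel is effectively opaque
    }
    let len = simd_length(accum)
    return len &amp;gt; 0 ? accum / len : nil  // renormalize the averaged normal
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Shading then runs once on the returned normal (and on albedo, roughness, and reflectance blended the same way) instead of once per splat.&lt;/p&gt;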

&lt;h2&gt;
  
  
  7. Normals and coordinate systems
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Normal&lt;/strong&gt;: the unit vector perpendicular to the surface. The reflection direction depends on it, so if it's off, all reflections are off. It's reconstructed from the splat's orientation (rotation quaternion).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Up-axis convention&lt;/strong&gt;: there's &lt;strong&gt;Y-up&lt;/strong&gt; (many viewers) and &lt;strong&gt;Z-up&lt;/strong&gt; (Blender-family). A mismatch between data and viewer tips the object over (→ bug 3).&lt;/li&gt;
&lt;/ul&gt;
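&lt;p&gt;A minimal numpy sketch of reconstructing a normal from a rotation quaternion. It assumes a &lt;code&gt;(w, x, y, z)&lt;/code&gt; component order and that the surfel's local +Z axis is its normal; both are assumptions to check against the actual &lt;code&gt;.ply&lt;/code&gt; layout. The function names are hypothetical.&lt;/p&gt;

```python
import numpy as np


def quat_to_matrix(q):
    # q = (w, x, y, z); normalized here so non-unit input is tolerated.
    q = np.asarray(q, dtype=float)
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])


def splat_normal(q, splat_pos, camera_pos):
    # Assumption: the third column (rotated local +Z) is the surfel normal.
    normal = quat_to_matrix(q)[:, 2]
    to_camera = np.asarray(camera_pos, dtype=float) - np.asarray(splat_pos, dtype=float)
    # Flip the normal so it faces the camera.
    return normal if np.dot(normal, to_camera) >= 0 else -normal


# Identity rotation, camera behind the splat: the normal gets flipped to -Z.
print(splat_normal([1, 0, 0, 0], [0, 0, 0], [0, 0, -5]))
```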




&lt;h1&gt;
  
  
  Implementation: the shading equations
&lt;/h1&gt;

&lt;p&gt;With the background in place, here are the equations that Ref-Gaussian's deferred surfel renderer (&lt;code&gt;render_surfel&lt;/code&gt;) computes per pixel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;F0&lt;/span&gt;       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.04&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;reflectance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;albedo&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;reflectance&lt;/span&gt;
&lt;span class="n"&gt;specular&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prefiltered&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reflect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;roughness&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;fg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;fg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;final&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;reflectance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;base_color&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;specular&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;reflect(V, N)&lt;/code&gt;: view direction V reflected about normal N. Sampling the prefiltered map (#5) in this direction gives the environment as seen in the specular reflection.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fg&lt;/code&gt;: a lookup into the BRDF LUT (#5).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;final&lt;/code&gt;: &lt;strong&gt;uses base_color directly for diffuse, and only computes specular from the environment and adds it&lt;/strong&gt; (= #3's "diffuse + specular"). This matters in bug 2.&lt;/li&gt;
&lt;/ul&gt;
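&lt;p&gt;The three equations above, transcribed into numpy as a sketch. The texture lookups are replaced by already-sampled values (&lt;code&gt;prefiltered_rgb&lt;/code&gt;, &lt;code&gt;fg&lt;/code&gt;), and the convention that V points from the surface toward the camera is an assumption; the function names are hypothetical.&lt;/p&gt;

```python
import numpy as np


def reflect_view(view_dir, normal):
    # Assumption: view_dir points from the surface toward the camera,
    # and both vectors are unit length.
    return 2.0 * np.dot(normal, view_dir) * normal - view_dir


def shade_pixel(base_color, albedo, reflectance, prefiltered_rgb, fg):
    """Per-pixel shading following the three equations above.

    prefiltered_rgb: the prefiltered environment already sampled along
                     reflect(V, N) at the pixel's roughness.
    fg:              the two BRDF LUT values (scale, bias).
    """
    f0 = 0.04 * (1.0 - reflectance) + albedo * reflectance
    specular = prefiltered_rgb * (f0 * fg[0] + fg[1])
    # Diffuse is base_color as-is; only the specular term is relit.
    return (1.0 - reflectance) * base_color + specular


base_color = np.array([0.8, 0.7, 0.5])
albedo = np.array([1.0, 0.5, 0.0])
env = np.array([2.0, 2.0, 2.0])

# With zero reflectance and an empty LUT entry, the pixel is just base_color.
print(shade_pixel(base_color, albedo, 0.0, env, (0.0, 0.0)))
```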

&lt;p&gt;Whole pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Ref-Gaussian .ply --&amp;gt; load --&amp;gt; per-splat material (normal, roughness, reflectance, albedo)
                                       |
HDR env --&amp;gt; IBL precompute --&amp;gt; prefiltered (specular) + irradiance (diffuse) + BRDF LUT
                                       |
                         +-------------+--------------+
                         v                            v
                 G-buffer pass               postprocess pass
        (blend color/normal/material      (per-pixel split-sum IBL
          into tile memory = #6)            + skybox compositing)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It sounds clean, until you run it. Now the real story: four bugs.&lt;/p&gt;




&lt;h1&gt;
  
  
  Bug 1: the normal map is a rainbow sandstorm
&lt;/h1&gt;

&lt;p&gt;The shading was patchy and flickering. The "Normal" debug view (normals visualized as color) was &lt;strong&gt;rainbow noise&lt;/strong&gt;, not a smooth gradient.&lt;/p&gt;

&lt;p&gt;My first thought — "2D-surfel normals are just inherently noisy" — was &lt;strong&gt;wrong&lt;/strong&gt;, and I nearly wasted a stack of device builds on it.&lt;/p&gt;

&lt;p&gt;What saved me was discipline: &lt;strong&gt;first draw the normals offline and verify.&lt;/strong&gt; Compositing each splat's geometric normal (#7: reconstructed from the rotation quaternion and flipped to face the camera) with a small numpy script gave a &lt;strong&gt;smooth&lt;/strong&gt; result (median gradient 0.006). So the data was correct and the renderer was buggy.&lt;/p&gt;

&lt;p&gt;Culprit: at load time MetalSplatter reorders splats for cache efficiency (Morton order, &lt;code&gt;sortByLocality&lt;/code&gt;). It reorders the splat buffer &lt;strong&gt;and&lt;/strong&gt; the SH-coefficient buffer, but the &lt;strong&gt;material buffer (which holds the normals)&lt;/strong&gt; I'd added later was a separate buffer and &lt;strong&gt;wasn't reordered.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So after sorting, &lt;code&gt;splats[i]&lt;/code&gt; corresponded to &lt;code&gt;materials[some other j]&lt;/code&gt;, and &lt;strong&gt;every splat held someone else's normal.&lt;/strong&gt; Color (the view-dependent color of #1) was in an already-reordered buffer, so it stayed consistent, and &lt;strong&gt;only the normals and specular looked broken&lt;/strong&gt; — which made it hard to isolate.&lt;/p&gt;

&lt;p&gt;The fix was one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="n"&gt;materialsBuffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reorderInPlace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;fromSourceIndices&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
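&lt;p&gt;The same idea as a small numpy sketch: one permutation, applied to every per-splat buffer together. It assumes the sorted index list means "new slot i takes the element from old slot &lt;code&gt;sorted[i]&lt;/code&gt;", which is what the argument label &lt;code&gt;fromSourceIndices&lt;/code&gt; suggests. &lt;code&gt;reorder_all&lt;/code&gt; is a hypothetical helper, not MetalSplatter's API.&lt;/p&gt;

```python
import numpy as np


def reorder_all(sorted_indices, *buffers):
    """Apply one permutation to every per-splat buffer so they stay in step."""
    count = len(sorted_indices)
    for buffer in buffers:
        if len(buffer) != count:
            raise ValueError("every parallel buffer must have one entry per splat")
    # new[i] = old[sorted_indices[i]] for each buffer
    return tuple(buffer[sorted_indices] for buffer in buffers)


splats = np.array([10, 11, 12, 13])             # stand-in for per-splat geometry
materials = np.array(["a", "b", "c", "d"])      # stand-in for per-splat normals etc.
sorted_indices = np.array([2, 0, 3, 1])         # e.g. the result of a locality sort

splats, materials = reorder_all(sorted_indices, splats, materials)
print(splats, materials)
```

&lt;p&gt;Routing every buffer through one helper means a buffer added later cannot silently skip the reorder, and the length check catches a buffer that was never sized per splat.&lt;/p&gt;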



&lt;p&gt;Lesson: when you bolt a "per-element parallel buffer" onto someone else's pipeline, apply the same reordering to it in &lt;strong&gt;every&lt;/strong&gt; place the source data gets reordered.&lt;/p&gt;

&lt;h1&gt;
  
  
  Bug 2: I was re-lighting the diffuse with environment light (don't)
&lt;/h1&gt;

&lt;p&gt;Even after fixing normals, the body was a "watery," patchy yellow.&lt;/p&gt;

&lt;p&gt;I'd written the diffuse term by the book as &lt;code&gt;albedo × irradiance&lt;/code&gt; (multiplying by #5's diffuse image). But Ref-Gaussian's equation uses &lt;strong&gt;base_color directly&lt;/strong&gt; for diffuse; &lt;strong&gt;only the specular is relit.&lt;/strong&gt; I was painting a pattern onto the cream-colored body with my own irradiance. Worse, I was tinting the specular F0 with the &lt;strong&gt;view-dependent color&lt;/strong&gt; (which has capture-time reflections baked in) instead of the learned albedo.&lt;/p&gt;

&lt;p&gt;Matching the reference equation exactly fixed it. Lesson: before improvising "correct" PBR, &lt;strong&gt;read the reference implementation's source and match it line by line.&lt;/strong&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Bug 3: the car is on its side (Z-up vs Y-up)
&lt;/h1&gt;

&lt;p&gt;When I trained and loaded a reflective car, it rendered &lt;strong&gt;90° on its side.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of guessing, I measured the point cloud's bounding box: the shortest axis was Z (= height), the longest was Y (= length), and the dark tire splats were on the −Z side. The data was &lt;strong&gt;Z-up&lt;/strong&gt; (#7, Blender-family). The viewer assumed &lt;strong&gt;Y-up.&lt;/strong&gt; The 90° offset that didn't show on a round helmet was suddenly exposed by the car.&lt;/p&gt;
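&lt;p&gt;The bounding-box measurement is only a few lines of numpy. This sketch assumes the splat centers are available as an N×3 array; &lt;code&gt;axis_extents&lt;/code&gt; and the sample points are made up for illustration.&lt;/p&gt;

```python
import numpy as np


def axis_extents(points):
    """Return per-axis bounding-box sizes plus the shortest and longest axis names."""
    points = np.asarray(points, dtype=float)
    extents = points.max(axis=0) - points.min(axis=0)
    order = np.argsort(extents)
    names = "XYZ"
    return extents, names[order[0]], names[order[-1]]


# Two opposite corners of a car-shaped box: 2 wide (X), 4.5 long (Y), 1.5 tall (Z).
car_like = np.array([[0.0, 0.0, 0.0], [2.0, 4.5, 1.5]])
extents, shortest, longest = axis_extents(car_like)
print(extents, shortest, longest)
```

&lt;p&gt;For a car, the shortest axis is the height, so the shortest axis is a strong hint for the up-axis. That heuristic would not work for a tall object, which is why the tire position was checked as well.&lt;/p&gt;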

&lt;p&gt;Adding a Z-up→Y-up correction (−90° about X) to the camera fixed the car. Then a second head sprouted: now the &lt;strong&gt;background environment was 90° off.&lt;/strong&gt; The equirectangular map (#4) assumes Y-up, but the skybox rays and reflections are computed in the scene's Z-up frame.&lt;/p&gt;

&lt;p&gt;The fix: convert the &lt;strong&gt;sample direction&lt;/strong&gt; for the environment into the environment's Y-up frame, and apply the slider's rotation about the scene's up-axis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;environmentRotation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Rx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="err"&gt;°&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;Rz&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slider&lt;/span&gt; &lt;span class="n"&gt;angle&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The skybox and reflections sample with the same matrix, so they always match. I verified the mapping numerically before building, and validated the skybox itself by an offline render that reconstructs rays from the inverse view-projection. That also caught an old top-bottom flip (double-flipped) I'd previously baked into the HDR.&lt;/p&gt;
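&lt;p&gt;The numeric check of that mapping can be sketched in numpy. The rotation matrices are standard; the equirect convention in &lt;code&gt;equirect_uv&lt;/code&gt; (where longitude zero sits and which way v runs) is an assumption and may differ from the app's shader.&lt;/p&gt;

```python
import numpy as np


def rot_x(degrees):
    c, s = np.cos(np.radians(degrees)), np.sin(np.radians(degrees))
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])


def rot_z(degrees):
    c, s = np.cos(np.radians(degrees)), np.sin(np.radians(degrees))
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def environment_rotation(slider_degrees):
    # Spin about the scene's Z up-axis first, then convert Z-up to Y-up.
    return rot_x(-90.0) @ rot_z(slider_degrees)


def equirect_uv(direction):
    # Assumed convention: Y-up, v = 0 at the top of the image.
    x, y, z = direction / np.linalg.norm(direction)
    u = np.arctan2(x, -z) / (2.0 * np.pi) + 0.5
    v = np.arccos(np.clip(y, -1.0, 1.0)) / np.pi
    return u, v


scene_up = np.array([0.0, 0.0, 1.0])
for angle in (0.0, 45.0, 180.0):
    # The scene's up vector must land on the environment's +Y for any slider angle.
    print(angle, np.round(environment_rotation(angle) @ scene_up, 6))
```

&lt;p&gt;Because the skybox ray and the reflection vector both go through &lt;code&gt;environment_rotation&lt;/code&gt; before the lookup, they cannot drift apart.&lt;/p&gt;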

&lt;h1&gt;
  
  
  Bug 4: a "successful" build that runs old code
&lt;/h1&gt;

&lt;p&gt;After the orientation fix, a report came in: "the car is still on its side." The offline render had already proven the math correct, so the running binary must be stale — but why?&lt;/p&gt;

&lt;p&gt;I'd been type-checking with &lt;code&gt;xcodebuild -destination platform=macOS&lt;/code&gt;. That only compiles the &lt;code&gt;#if os(macOS)&lt;/code&gt; branch and Mac architectures. When I built for the iOS simulator, existing code revealed a &lt;strong&gt;compile error&lt;/strong&gt;: I was assigning &lt;code&gt;Float16(x).bitPattern&lt;/code&gt; into a &lt;code&gt;[UInt16]&lt;/code&gt; array.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;arm64 (device): native &lt;code&gt;Float16&lt;/code&gt; exists, &lt;code&gt;bitPattern&lt;/code&gt; is &lt;code&gt;UInt16&lt;/code&gt; → compiles&lt;/li&gt;
&lt;li&gt;the simulator's x86_64 slice: &lt;code&gt;Float16&lt;/code&gt; falls back, &lt;code&gt;bitPattern&lt;/code&gt; is &lt;code&gt;UInt32&lt;/code&gt; → type error&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The iOS build was failing, so the device kept running the &lt;strong&gt;previous binary.&lt;/strong&gt; Holding &lt;code&gt;[Float16]&lt;/code&gt; directly fixed it.&lt;/p&gt;

&lt;p&gt;Lesson: &lt;strong&gt;type-check for the platform you ship to.&lt;/strong&gt; A macOS-only &lt;code&gt;xcodebuild&lt;/code&gt; will happily lie about your iOS app. And "nothing changes on device" is almost always a sign the &lt;strong&gt;binary isn't new&lt;/strong&gt; — suspect that before re-debugging your logic.&lt;/p&gt;




&lt;h1&gt;
  
  
  The methodology that actually worked
&lt;/h1&gt;

&lt;p&gt;What all four bugs share: &lt;strong&gt;establish ground truth before judging your output.&lt;/strong&gt; Early on, my offline numpy preview was overly smooth and misled me. What worked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run the &lt;em&gt;reference&lt;/em&gt; renderer (Ref-Gaussian's &lt;code&gt;eval.py&lt;/code&gt; or training-time visualizations) on the same asset and compare. If the reference is clean and yours is dirty, it's a bug in your renderer.&lt;/li&gt;
&lt;li&gt;Reproduce the transforms (normals, skybox rays) exactly offline and &lt;strong&gt;look at them&lt;/strong&gt; before building for device.&lt;/li&gt;
&lt;li&gt;Verify on a clean synthetic asset (a hand-made chrome sphere) to separate "renderer bug" from "asset bug."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every time I skipped this and settled for "looks fine," I was wrong.&lt;/p&gt;

&lt;h1&gt;
  
  
  Aside: is the look "correct"?
&lt;/h1&gt;

&lt;p&gt;After finishing, I felt "the car's reflections are dull, it doesn't look like it's reflecting the environment." That's &lt;strong&gt;not a bug — it's the material's nature.&lt;/strong&gt; To be clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The trained car is roughness ~0.2 = &lt;strong&gt;semi-gloss.&lt;/strong&gt; Semi-gloss reflects the environment &lt;strong&gt;blurrily&lt;/strong&gt; (not a mirror). So bright lighting appearing as a blurry blob is correct. &lt;strong&gt;Any renderer, lit by the same environment, shows it equally dull.&lt;/strong&gt; Even the official renderer's output (ground truth) shows a clean reconstruction (sharp car, smooth normals); the material is fine.&lt;/li&gt;
&lt;li&gt;Turning off the app's "Use trained material" and lowering roughness lets you &lt;strong&gt;override&lt;/strong&gt; all splats to a uniform specular, clearly reflecting the room. That's flashy but &lt;strong&gt;not present on a real car.&lt;/strong&gt; ON = the learned real material, OFF = a manual override.&lt;/li&gt;
&lt;li&gt;Note that material decomposition has an inherent &lt;strong&gt;ambiguity&lt;/strong&gt;: a blue car's albedo can decompose as yellowish (splitting blue into "blue light + yellow material"). This is normal in inverse rendering; the rendered result itself is clean, so the harm is small, but it's not a "physically perfect decomposition."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So "dull look = the material behaving correctly," not an error in the iOS renderer.&lt;/p&gt;

&lt;h1&gt;
  
  
  Result
&lt;/h1&gt;

&lt;p&gt;A reflective Ref-Gaussian scene, relit in real time on iPhone: switch and rotate the HDR environment and the reflections and skybox follow. The base is MetalSplatter, the relighting model is Ref-Gaussian, and split-sum IBL is from UE4.&lt;/p&gt;

&lt;p&gt;Code, demo, details: &lt;a href="https://github.com/john-rocky/MetalGaussianSplatRelighting" rel="noopener noreferrer"&gt;https://github.com/john-rocky/MetalGaussianSplatRelighting&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next: light the splats from the &lt;em&gt;actual&lt;/em&gt; environment via ARKit's environment probe — place a relightable object in your own room and reflect the room.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Implemented in Swift + Metal / iOS. Credits: &lt;a href="https://github.com/scier/MetalSplatter" rel="noopener noreferrer"&gt;MetalSplatter&lt;/a&gt; (Sean Cier, MIT), &lt;a href="https://github.com/fudan-zvg/ref-gaussian" rel="noopener noreferrer"&gt;Ref-Gaussian&lt;/a&gt;, HDRIs from &lt;a href="https://polyhaven.com" rel="noopener noreferrer"&gt;Poly Haven&lt;/a&gt; (CC0). Originally published in Japanese on &lt;a href="https://qiita.com/john-rocky/items/3c452476e66bdc29c290" rel="noopener noreferrer"&gt;Qiita&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ios</category>
      <category>metal</category>
      <category>graphics</category>
      <category>3d</category>
    </item>
    <item>
      <title>Adding UI to Google Colab: forms, sliders, buttons and more</title>
      <dc:creator>Daisuke Majima</dc:creator>
      <pubDate>Tue, 02 Jun 2026 06:11:49 +0000</pubDate>
      <link>https://dev.to/john-rocky/adding-ui-to-google-colab-forms-sliders-buttons-and-more-1ad2</link>
      <guid>https://dev.to/john-rocky/adding-ui-to-google-colab-forms-sliders-buttons-and-more-1ad2</guid>
      <description>&lt;h2&gt;
  
  
  Adding UI to Colab
&lt;/h2&gt;

&lt;p&gt;You can &lt;strong&gt;show UI&lt;/strong&gt; in a Google Colaboratory notebook. Input forms, buttons, and so on are handy when other people use your notebook. &lt;strong&gt;The value entered in a form is assigned to the corresponding variable in the cell.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://colab.research.google.com/drive/1mTkZQ0daeGyvVtHYoprznAP_tnYl2rJD?usp=sharing" rel="noopener noreferrer"&gt;&lt;strong&gt;Live Colab notebook sample&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3scwqbclptinvbcznnmk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3scwqbclptinvbcznnmk.png" width="600" height="870"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flkoluqgr3sex1bx89qy2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flkoluqgr3sex1bx89qy2.png" width="600" height="208"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Cell title
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#@title cell title
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ncxa8i2r0bpoubxxnki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ncxa8i2r0bpoubxxnki.png" width="600" height="189"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Input form
&lt;/h3&gt;

&lt;p&gt;The form's content is assigned to a variable in the cell.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;variable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the form is reflected into the variable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;#@param {type:"string"}
# for a number: #@param {type:"number"}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
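&lt;p&gt;Form variables are ordinary Python variables, so the rest of the cell can use them directly. A small sketch with made-up variable names; since the &lt;code&gt;#@param&lt;/code&gt; annotations are plain comments to Python, the cell also runs unchanged outside Colab.&lt;/p&gt;

```python
#@title Greeting form (hypothetical example)
name = "Colab"  #@param {type:"string"}
repeat = 2  #@param {type:"number"}

# Editing the form fields rewrites the literals above; the code below just reads them.
message = " ".join(["Hello, " + name] * int(repeat))
print(message)
```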



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyyfiz1n6pose0rwprxot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyyfiz1n6pose0rwprxot.png" width="600" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Select box
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dropdown&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="c1"&gt;#@param ["1st option", "2nd option", "3rd option"] {allow-input: true}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fov0m7nutubtai8yzz2yp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fov0m7nutubtai8yzz2yp.png" width="600" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Date input
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;date_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2018-03-22&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="c1"&gt;#@param {type:"date"}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foo2ta01vgj030xf8dhu1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foo2ta01vgj030xf8dhu1.png" width="600" height="126"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Slider
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;number_slider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt; &lt;span class="c1"&gt;#@param {type:"slider", min:-1, max:1, step:0.1}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwmvousjymvr4pyu8io33.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwmvousjymvr4pyu8io33.png" width="600" height="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Checkbox
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;boolean_checkbox&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt; &lt;span class="c1"&gt;#@param {type:"boolean"}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgl26xke8gzj1vjrmfcec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgl26xke8gzj1vjrmfcec.png" width="600" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Markdown
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#@markdown ---
#@markdown #Big
#@markdown ###Middle
#@markdown #####Little
#@markdown ---
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hfjkbo24y6fyr5qt3kv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hfjkbo24y6fyr5qt3kv.png" width="600" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  A button via the DOM
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;IPython.display&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Javascript&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.colab&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.colab.output&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;eval_js&lt;/span&gt;

&lt;span class="n"&gt;js&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Javascript&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;
            async function load_image() {
                const div = document.createElement(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;div&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;);
                var button = document.createElement(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;button&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;);
                var log = document.createElement(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;div&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;);

                button.textContent = &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;button&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;;
                button.onclick = function(){
                    log.innerHTML = &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Button Clicked.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;;
                }
                div.appendChild(button)
                div.appendChild(log)

                document.querySelector(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#output-area&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;).appendChild(div);
                return
                } &lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;js&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;eval_js&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;load_image()&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft29lqs44yc7sxihtwnw3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft29lqs44yc7sxihtwnw3.png" width="358" height="124"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published in Japanese on &lt;a href="https://qiita.com/john-rocky/items/e5802cdd15dc2e34cb84" rel="noopener noreferrer"&gt;Qiita&lt;/a&gt;. I build apps with Core ML and write about machine learning. &lt;a href="https://github.com/john-rocky" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; / &lt;a href="https://twitter.com/JackdeS11" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>colab</category>
      <category>machinelearning</category>
      <category>jupyter</category>
    </item>
    <item>
      <title>Make a 3D model on iPhone just by taking photos (RealityKit Photogrammetry)</title>
      <dc:creator>Daisuke Majima</dc:creator>
      <pubDate>Tue, 02 Jun 2026 06:06:47 +0000</pubDate>
      <link>https://dev.to/john-rocky/make-a-3d-model-on-iphone-just-by-taking-photos-realitykit-photogrammetry-aj9</link>
      <guid>https://dev.to/john-rocky/make-a-3d-model-on-iphone-just-by-taking-photos-realitykit-photogrammetry-aj9</guid>
      <description>&lt;h2&gt;
  
  
  Make a realistic 3D model just by taking photos
&lt;/h2&gt;

&lt;p&gt;Using Apple's tool, you can easily create 3D models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhed94xik8okn5g5gg67l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhed94xik8okn5g5gg67l.png" width="200" height="433"&gt;&lt;/a&gt; &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqybgbfwm3sq5y3dioix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqybgbfwm3sq5y3dioix.png" width="800" height="647"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xrcw58vbxqg7yq6flwo.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xrcw58vbxqg7yq6flwo.gif" width="324" height="662"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A turtle, and the captured turtle.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  I want realistic 3D models
&lt;/h2&gt;

&lt;p&gt;If you have plenty of usable 3D models, you can use them in AR and game apps. But (in my experience) freely usable 3D models for AR/games are surprisingly scarce — on download sites, the nice content is often paid.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making 3D models seems hard
&lt;/h2&gt;

&lt;p&gt;And making them yourself — modeling and converting in 3D software — seems difficult.&lt;/p&gt;

&lt;h2&gt;
  
  
  With Apple's RealityKit, just taking photos makes a model
&lt;/h2&gt;

&lt;p&gt;In 2021 Apple released a tool that builds a 3D model just from photos. It's fairly easy and produces realistic objects. It recognizes the salient object and cleanly separates it from the background and floor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Method
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Take photos
&lt;/h3&gt;

&lt;p&gt;With a handheld camera (iPhone is fine), photograph the thing you want in 3D from every direction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhed94xik8okn5g5gg67l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhed94xik8okn5g5gg67l.png" width="200" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I simply put a stuffed animal on the carpet and shot 360° from the side and from above, covering it like a hemisphere. I shot under these conditions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;in as bright a place as possible&lt;/li&gt;
&lt;li&gt;with the whole subject as large in frame as possible&lt;/li&gt;
&lt;li&gt;shooting frequently so consecutive photos overlap by 70%+&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I took 200 photos and gathered them into a folder on my Mac.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Download Apple's tool
&lt;/h3&gt;

&lt;p&gt;You can get it from the developer site:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://developer.apple.com/documentation/realitykit/creating_a_photogrammetry_command-line_app/" rel="noopener noreferrer"&gt;https://developer.apple.com/documentation/realitykit/creating_a_photogrammetry_command-line_app/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Build it in Xcode and a &lt;code&gt;HelloPhotogrammetry&lt;/code&gt; executable appears in the build products folder. Reveal it in Finder to get its full path.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkzhn90v8rul2bnjmmudb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkzhn90v8rul2bnjmmudb.png" width="800" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Run
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&amp;lt;path to HelloPhotogrammetry&amp;gt; &amp;lt;input image folder&amp;gt; &amp;lt;output file path, with .usdz&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;* Absolute paths from root are required.&lt;/p&gt;

&lt;p&gt;This generates a USDZ model. With 200 photos it took about 20 minutes. Open the link below on an iPhone to view the turtle USDZ model in AR:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://firebasestorage.googleapis.com/v0/b/sincere-nirvana-292404.appspot.com/o/model.usdz?alt=media&amp;amp;token=96083418-ed43-435c-bce4-6fed085bbd7b" rel="noopener noreferrer"&gt;https://firebasestorage.googleapis.com/v0/b/sincere-nirvana-292404.appspot.com/o/model.usdz?alt=media&amp;amp;token=96083418-ed43-435c-bce4-6fed085bbd7b&lt;/a&gt;&lt;/p&gt;
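&lt;p&gt;If you'd rather call the API yourself than use the sample binary, here is a minimal sketch of the same job with RealityKit's &lt;code&gt;PhotogrammetrySession&lt;/code&gt; on macOS. The two paths are placeholders I made up, and it assumes the code lives in the &lt;code&gt;main.swift&lt;/code&gt; of a command-line target. Treat it as a starting point, not as Apple's sample code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;import Foundation
import RealityKit

// Hypothetical absolute paths; replace them with your own.
let inputFolder = URL(fileURLWithPath: "/Users/me/Pictures/turtle", isDirectory: true)
let outputFile = URL(fileURLWithPath: "/Users/me/Models/turtle.usdz")

do {
    // Default configuration; the session reads the images in the folder.
    let session = try PhotogrammetrySession(
        input: inputFolder,
        configuration: PhotogrammetrySession.Configuration()
    )

    // Start listening for progress and completion before queuing the work.
    let listener = Task {
        do {
            for try await output in session.outputs {
                switch output {
                case .requestProgress(_, let fractionComplete):
                    print("Progress: \(Int(fractionComplete * 100))%")
                case .requestError(_, let error):
                    print("Request failed: \(error)")
                case .processingComplete:
                    print("Saved \(outputFile.path)")
                    exit(0)
                default:
                    break
                }
            }
        } catch {
            print("Session error: \(error)")
            exit(1)
        }
    }

    // Request one USDZ file at medium detail, then keep the process alive.
    try withExtendedLifetime((session, listener)) {
        try session.process(requests: [.modelFile(url: outputFile, detail: .medium)])
        RunLoop.main.run()
    }
} catch {
    print("Could not start photogrammetry: \(error)")
    exit(1)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;detail&lt;/code&gt; level trades quality against processing time and file size, so it is the first knob to try if a run takes too long.&lt;/p&gt;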

&lt;h2&gt;
  
  
  Toward richer 3D content
&lt;/h2&gt;

&lt;p&gt;Some people upload models made this way to 3D-model sharing sites. It'd be great if 3D models and content grew richer as lots of people scan lots of things.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published in Japanese on &lt;a href="https://qiita.com/john-rocky/items/5de42f382f0fce950093" rel="noopener noreferrer"&gt;Qiita&lt;/a&gt;. I build apps with Core ML and ARKit and write about ML/AR. &lt;a href="https://github.com/john-rocky" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; / &lt;a href="https://twitter.com/JackdeS11" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ios</category>
      <category>ar</category>
      <category>swift</category>
      <category>3d</category>
    </item>
    <item>
      <title>Japanese OCR on iPhone with the Vision framework</title>
      <dc:creator>Daisuke Majima</dc:creator>
      <pubDate>Tue, 02 Jun 2026 06:06:46 +0000</pubDate>
      <link>https://dev.to/john-rocky/japanese-ocr-on-iphone-with-the-vision-framework-446c</link>
      <guid>https://dev.to/john-rocky/japanese-ocr-on-iphone-with-the-vision-framework-446c</guid>
      <description>&lt;h2&gt;
  
  
  Easy on-device text recognition
&lt;/h2&gt;

&lt;p&gt;If you can recognize text on iPhone, you can build handy things like transcribing whiteboards or reading signs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb569w6gf9fgcgmur2ojv.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb569w6gf9fgcgmur2ojv.gif" width="450" height="784"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A 2022 update made Japanese available
&lt;/h2&gt;

&lt;p&gt;Since iOS 16 (2022), Japanese text recognition is possible — using only the built-in framework. The accuracy is quite good; personally I think it's usable in production for many apps.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to use it
&lt;/h2&gt;

&lt;p&gt;Use Vision's &lt;code&gt;VNRecognizeTextRequest&lt;/code&gt;. &lt;strong&gt;Set &lt;code&gt;recognitionLanguages&lt;/code&gt; to &lt;code&gt;"ja"&lt;/code&gt;.&lt;/strong&gt; Requires macOS 13, Xcode 14, iOS 16 or later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;Vision&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;VNRecognizeTextRequest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;recognitionLanguages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"ja"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;// specify Japanese&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;VNImageRequestHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;cvPixelBuffer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pixelBuffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perform&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;guard&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;observations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="c1"&gt;// [VNRecognizedTextObservation]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;observation&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;observations&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;box&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boundingBox&lt;/span&gt; &lt;span class="c1"&gt;// normalized bounding box (0...1, origin at bottom-left)&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;topCandidate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;topCandidates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;recognizedText&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;topCandidate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="c1"&gt;// recognized text&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recognizedText&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Feed it an image or a camera frame and it recognizes the text.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcws5sgkfqgtzhhy9zbu4.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcws5sgkfqgtzhhy9zbu4.gif" width="442" height="784"&gt;&lt;/a&gt;&lt;/p&gt;
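&lt;p&gt;One thing that trips people up: &lt;code&gt;boundingBox&lt;/code&gt; is normalized (0 to 1) with the origin at the bottom-left, so it has to be scaled and flipped before you can draw it over the image. Below is a small sketch of that conversion. The helper name &lt;code&gt;pixelRect(fromNormalized:imageWidth:imageHeight:)&lt;/code&gt; is my own, not a Vision API. Vision also ships &lt;code&gt;VNImageRectForNormalizedRect&lt;/code&gt;, which does the scaling part.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;import Foundation

// Convert a normalized rect (0...1, origin at bottom-left) into a pixel rect
// with a top-left origin, which is what UIKit drawing expects.
func pixelRect(fromNormalized box: CGRect, imageWidth: CGFloat, imageHeight: CGFloat) -&amp;gt; CGRect {
    let width = box.width * imageWidth
    let height = box.height * imageHeight
    let x = box.minX * imageWidth
    // Flip the y axis: the box is measured from the bottom, UIKit from the top.
    let y = (1 - box.maxY) * imageHeight
    return CGRect(x: x, y: y, width: width, height: height)
}

// Example with made-up numbers: a box in a 1000 x 500 image.
let rect = pixelRect(
    fromNormalized: CGRect(x: 0.25, y: 0.5, width: 0.5, height: 0.25),
    imageWidth: 1000,
    imageHeight: 500
)
// rect is (x: 250, y: 125, width: 500, height: 125)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;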




&lt;p&gt;&lt;em&gt;Originally published in Japanese on &lt;a href="https://qiita.com/john-rocky/items/c8abb7fa7aebdf19d9a3" rel="noopener noreferrer"&gt;Qiita&lt;/a&gt;. I build apps with Core ML and write about machine learning. &lt;a href="https://github.com/john-rocky" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; / &lt;a href="https://twitter.com/JackdeS11" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ios</category>
      <category>swift</category>
      <category>ocr</category>
      <category>vision</category>
    </item>
    <item>
      <title>How I became a freelance engineer earning 1M yen/month from zero experience</title>
      <dc:creator>Daisuke Majima</dc:creator>
      <pubDate>Tue, 02 Jun 2026 05:56:38 +0000</pubDate>
      <link>https://dev.to/john-rocky/how-i-became-a-freelance-engineer-earning-1m-yenmonth-from-zero-experience-58op</link>
      <guid>https://dev.to/john-rocky/how-i-became-a-freelance-engineer-earning-1m-yenmonth-from-zero-experience-58op</guid>
      <description>&lt;p&gt;I think there were three keys to going from zero experience to making a living as a freelancer:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Make time&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Grasp the big picture&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Put yourself out there&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let me go through each.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Make time
&lt;/h2&gt;

&lt;p&gt;When deep snow piles up, you want to go out and play, right? Build a snowman, dig a kamakura, have a snowball fight. It's the same: when you have room in your schedule, the urge to experiment shows up. Fail once and you can go "okay, let me try it this way" — trial and error. Like a kid rolling around in the snow. You get to fail. Wasted time matters. Heck, curling up under a kotatsu is fine too, lol.&lt;/p&gt;

&lt;p&gt;I'd dabbled in programming in the gaps of my old job, but it never came together. When you fail once in the cracks of a day job, there's no "next try."&lt;/p&gt;

&lt;p&gt;So I became unemployed. Did I immediately throw myself into programming because I had time? Not really, lol. For a while I watched Netflix and went for walks. Gradually that got old, and the programming I picked up out of "isn't there something to do?" started to get fun. I think that's about the pace at which self-motivation is born.&lt;/p&gt;

&lt;p&gt;That said, making a big block of free time is hard for most people, so a rule like "make one hour of space a day" is fine too. Once you start, habitual momentum builds.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Grasp the big picture
&lt;/h2&gt;

&lt;p&gt;When things aren't going well, I think it's usually because you can't see the whole picture. A kid who can't do a pullover on the horizontal bar will practice the first kick, but can't picture how that kick connects to the rest of the motion or where it sits in the whole. So they just keep kicking higher and higher. People stuck at the beginner level are the same: they can tackle one element but don't know where it fits in the whole. Knowledge you can't connect to anything else doesn't get you anywhere.&lt;/p&gt;

&lt;p&gt;To get good at something, I think it's important to &lt;em&gt;feel&lt;/em&gt; the whole picture, even vaguely. With the big picture in place, you can connect the element in front of you organically.&lt;/p&gt;

&lt;p&gt;When you start learning to program and do some "copy the code" practice, you don't know how to actually use it. Classic beginner stuff. "1+1 prints like this" — &lt;strong&gt;"but how do I get from here to building an app??"&lt;/strong&gt; I was the same. I took online courses, bought O'Reilly books.&lt;/p&gt;

&lt;p&gt;How did I break out? I took Stanford's course: &lt;strong&gt;&lt;a href="https://cs193p.sites.stanford.edu/" rel="noopener noreferrer"&gt;CS193p, the iOS course&lt;/a&gt;&lt;/strong&gt;. After taking it, I could build apps. It covers the knowledge you need to build an app, and you come away really understanding it. As you'd expect from a top university, the teaching is exceptionally clear. It's structured so that, hands-on, you get a feel for the landmarks of building an app and how they connect. I genuinely became able to build.&lt;/p&gt;

&lt;p&gt;Finding something that shows you the whole picture as one set really matters. Like a carpenter's apprenticeship — at first you just plane wood until the shavings come out clean, but the planing itself isn't the point; being next to the master and watching them do the job from 1 to 10 is. Imitating people is important for grasping the whole.&lt;/p&gt;

&lt;p&gt;Once you understand the whole of "building an app," then whenever you want to make something, you know which knowledge you're missing. After that you just pick up the needed pieces from the web and drop them into the big picture. The course's final project was "build your own app." I decided to add some cutting-edge tech to what I'd learned, hunted for interesting frameworks online, and tried building an app using machine learning. In the end, whether it's ML or iOS or anything, if you search the web there's documentation, prose, and code, and if you read it properly (as long as you have the big picture), you can understand it. Knowing the big picture and where the information lives massively expands what you can do.&lt;/p&gt;

&lt;p&gt;Back to CS193p: all the lectures are free on YouTube. English-wise, YouTube's English captions got me through. Watching the whole of "Breaking Bad" in English helped too, lol. Understanding English lets you read docs in the original, which is handy. Though Google Translate is excellent now, so English is much less of a barrier.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Put yourself out there
&lt;/h2&gt;

&lt;p&gt;Like the cat in the box, until you're observed, to others you may as well not exist. Ability alone doesn't bring work. You have to break out of the box, stick your head out, and go "meow." An efficient way to shout "I'm here!" is to &lt;strong&gt;publish information online.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Write technical articles, publish the information that is your ability. If those articles reach a client's eyes, work comes. You have to connect yourself with the people who'll pay. Doing that suddenly turned programming into money. I &lt;a href="https://qiita.com/john-rocky" rel="noopener noreferrer"&gt;publish technical articles on Qiita&lt;/a&gt;, write the same articles on the English service &lt;a href="https://medium.com/@rockyshikoku" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;, and &lt;a href="https://github.com/john-rocky" rel="noopener noreferrer"&gt;put code on GitHub&lt;/a&gt;. Companies at home and abroad saw them and sent development requests, and as I worked through them I passed ¥1M/month in three months.&lt;/p&gt;

&lt;p&gt;Before that, there was about half a year where I earned ¥0. I &lt;a href="https://apps.apple.com/gb/developer/daisuke-majima/id1350309854?l=ja" rel="noopener noreferrer"&gt;uploaded personal apps to the App Store hoping to hit it with ad revenue, releasing about ten&lt;/a&gt; — an app that computes facial similarity, an AR virtual-background app, an app that strips ML edits off photos, an app that animifies photos. I'd undergone a chimeric, monster-cat evolution inside the box. Scary, right?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apps.apple.com/gb/developer/daisuke-majima/id1350309854?l=ja" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw32nrsrbih12o0ivav9h.png" width="800" height="570"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But my information never reached the people who'd actually pay. In my case, the payers weren't consumers. Marketing that properly reaches people who can pay matters.&lt;/p&gt;

&lt;p&gt;A big part of starting to broadcast was joining a community. There was a "broadcasting course" inside it, taught by someone who'd written tons of technical articles and books, and they clearly explained the whole arc of how that leads to work. Like the carpenter's apprenticeship — being shown the whole path to landing work was huge. They held one-on-one consultation events, gave me a chance to talk with a company's recruiter, pushed me to go freelance, and proofread my articles, which became a real asset. They even taught me how to negotiate with clients. I happened to have time, so I compiled a "best quotes" of community members and helped edit a YouTube channel — that probably earned me some goodwill too. Again, something I could only do because I had time.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Make plenty of time&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Grasp the big picture&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Put yourself out there&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These feel usable for things beyond "making money with programming," too.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published in Japanese on &lt;a href="https://qiita.com/john-rocky/items/4bdb278c1a7714191fd7" rel="noopener noreferrer"&gt;Qiita&lt;/a&gt;. I build apps with Core ML and ARKit and write about ML/AR. &lt;a href="https://github.com/john-rocky" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; / &lt;a href="https://twitter.com/JackdeS11" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>career</category>
      <category>beginners</category>
      <category>freelance</category>
      <category>ios</category>
    </item>
    <item>
      <title>On-device LLM on iPhone: which runtime is fastest? MLX vs llama.cpp vs LiteRT-LM vs CoreML</title>
      <dc:creator>Daisuke Majima</dc:creator>
      <pubDate>Tue, 02 Jun 2026 05:56:37 +0000</pubDate>
      <link>https://dev.to/john-rocky/on-device-llm-on-iphone-which-runtime-is-fastest-mlx-vs-llamacpp-vs-litert-lm-vs-coreml-1b42</link>
      <guid>https://dev.to/john-rocky/on-device-llm-on-iphone-which-runtime-is-fastest-mlx-vs-llamacpp-vs-litert-lm-vs-coreml-1b42</guid>
      <description>&lt;p&gt;&lt;strong&gt;I want to run an LLM on iPhone.&lt;/strong&gt;&lt;br&gt;
But &lt;strong&gt;there are several runtimes and it's not obvious which to pick.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And I couldn't find many head-to-head benchmarks.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Runtime&lt;/th&gt;
&lt;th&gt;In a nutshell&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MLX&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apple's own framework. Apple is pushing hard into the on-device-LLM scene.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;llama.cpp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The mature, battle-tested community standard for local LLMs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LiteRT-LM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemma 4 only, but it is Google's heavyweight runtime, finally usable from Swift.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CoreML-LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lets you use the Apple Neural Engine, which the GPU/Metal-dominated LLM world tends to overlook. I built it — can it even compete...?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Fine, let's just do it.&lt;/strong&gt; On an iPhone 17 Pro (A19 Pro), I ran the same model on four on-device inference runtimes and measured decode speed and memory.&lt;/p&gt;

&lt;p&gt;The conclusion:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"For local LLMs on iPhone, MLX by default."&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;"For Gemma 4 specifically, LiteRT-LM is unbeatable."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3xpqq41chtjcguhjpc4.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3xpqq41chtjcguhjpc4.jpeg" width="799" height="686"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion first
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Decode speed&lt;/strong&gt;:&lt;br&gt;
Qwen 3.5 2B is &lt;strong&gt;fastest on MLX&lt;/strong&gt; (61 tok/s).&lt;br&gt;
Gemma 4 E2B is a &lt;strong&gt;decisive win for LiteRT-LM&lt;/strong&gt; (55 tok/s).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory&lt;/strong&gt;:&lt;br&gt;
&lt;strong&gt;On Qwen 3.5 2B, CoreML / ANE (Apple Neural Engine) wins by a landslide.&lt;/strong&gt; It runs the model in just &lt;strong&gt;241 MB&lt;/strong&gt; (about 1/5 of MLX). On Gemma 4 E2B, LiteRT-LM uses the least memory (641 MB). CoreML is the slowest on speed, though. Nice effort, CoreML.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use-case recommendations at the end.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Test conditions
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Device&lt;/td&gt;
&lt;td&gt;iPhone 17 Pro (A19 Pro / iOS 26.4.2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runtimes&lt;/td&gt;
&lt;td&gt;MLX Swift / llama.cpp / LiteRT-LM / CoreML(ANE)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Models&lt;/td&gt;
&lt;td&gt;Gemma 4 E2B, Qwen 3.5 2B (both ~4-bit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task&lt;/td&gt;
&lt;td&gt;short-chat (128-token generation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aggregation&lt;/td&gt;
&lt;td&gt;median of 3 cold runs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metrics&lt;/td&gt;
&lt;td&gt;decode tok/s (higher = better), peak memory MB (lower = better)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
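&lt;p&gt;To make the two aggregation steps concrete, here is a rough sketch of how a decode rate and a median over three cold runs can be computed. The &lt;code&gt;DecodeRun&lt;/code&gt; type and the timings are invented for illustration. They are not the benchmark harness or real measurements.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;import Foundation

// One cold run of one runtime × model pair. Field names are hypothetical.
struct DecodeRun {
    let generatedTokens: Int
    let decodeSeconds: Double

    // Decode speed is a rate, so runs of different lengths stay comparable.
    var tokensPerSecond: Double {
        Double(generatedTokens) / decodeSeconds
    }
}

// Median of a list of values; nil for an empty list.
func median(_ values: [Double]) -&amp;gt; Double? {
    guard !values.isEmpty else { return nil }
    let sorted = values.sorted()
    let middle = sorted.count / 2
    if sorted.count % 2 == 1 {
        return sorted[middle]
    }
    return (sorted[middle - 1] + sorted[middle]) / 2
}

// Made-up timings for three cold runs, not measurements from this article.
let runs = [
    DecodeRun(generatedTokens: 128, decodeSeconds: 2.0),
    DecodeRun(generatedTokens: 128, decodeSeconds: 4.0),
    DecodeRun(generatedTokens: 128, decodeSeconds: 2.5),
]
let medianTokensPerSecond = median(runs.map { $0.tokensPerSecond })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Taking the median rather than the mean keeps one unusually slow cold start from dragging the reported number down.&lt;/p&gt;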

&lt;h2&gt;
  
  
  Result 1: decode speed (tok/s, higher is better)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Runtime&lt;/th&gt;
&lt;th&gt;Gemma 4 E2B&lt;/th&gt;
&lt;th&gt;Qwen 3.5 2B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🔴 LiteRT-LM&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;55.4&lt;/strong&gt; 🏆&lt;/td&gt;
&lt;td&gt;— (Gemma only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟣 MLX-Swift&lt;/td&gt;
&lt;td&gt;47.5&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;61.2&lt;/strong&gt; 🏆&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🔵 llama.cpp&lt;/td&gt;
&lt;td&gt;37.8&lt;/td&gt;
&lt;td&gt;39.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟠 CoreML/ANE&lt;/td&gt;
&lt;td&gt;33.4&lt;/td&gt;
&lt;td&gt;27.9&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;For Gemma 4 E2B, LiteRT-LM dominates.&lt;/strong&gt; It's Google's on-device runtime, running Gemma in its own &lt;code&gt;.litertlm&lt;/code&gt; (INT4 QAT) format on the GPU — &lt;strong&gt;first-party model × first-party runtime&lt;/strong&gt; optimization paying off. The Swift API was in development for ages; nice work, whoever shipped it.&lt;/p&gt;

&lt;p&gt;Meanwhile &lt;strong&gt;for Qwen 3.5 2B, MLX is fastest (61 tok/s).&lt;/strong&gt; Apple is clearly competing seriously on local LLMs. (LiteRT-LM's catalog is Gemma-only (&lt;code&gt;.litertlm&lt;/code&gt;), so it doesn't compete on Qwen.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Result 2: peak memory (MB, lower is better)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Runtime&lt;/th&gt;
&lt;th&gt;Gemma 4 E2B&lt;/th&gt;
&lt;th&gt;Qwen 3.5 2B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🔴 LiteRT-LM&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;641&lt;/strong&gt; 🏆&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟣 MLX-Swift&lt;/td&gt;
&lt;td&gt;2,900&lt;/td&gt;
&lt;td&gt;1,279&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🔵 llama.cpp&lt;/td&gt;
&lt;td&gt;3,156&lt;/td&gt;
&lt;td&gt;1,479&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟠 CoreML/ANE&lt;/td&gt;
&lt;td&gt;1,187&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;241&lt;/strong&gt; 🏆&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;On Qwen 3.5 2B, CoreML / ANE wins by a landslide&lt;/strong&gt;: just &lt;strong&gt;241 MB&lt;/strong&gt;, about 1/5 of MLX (1,279) and 1/6 of llama.cpp (1,479). That comes from a &lt;strong&gt;chunked-MLKV&lt;/strong&gt; approach (CoreML-LLM's &lt;code&gt;Qwen35MLKVGenerator&lt;/code&gt;) that chunks the weights and KV cache onto the ANE. On Gemma 4 E2B, LiteRT-LM is the lightest at 641 MB, with CoreML / ANE second at 1,187 MB.&lt;/p&gt;

&lt;p&gt;If you want to run a 2B-class model on a memory-constrained iPhone, or keep the model from competing with the rest of your app for memory, the ANE is a very strong option.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fairness notes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CoreML/ANE&lt;/strong&gt;: the ANE is designed to favor memory and power efficiency over throughput. The first load triggers ANE compilation, so load time is longer. Decode speed is approximated by counting generated pieces (≈ tokens).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiteRT-LM&lt;/strong&gt;: there's no max-output-tokens API, so it generates until EOS (≈ 458-token full response); the others cut off at 128. But decode is a &lt;em&gt;rate&lt;/em&gt;, so the comparison still holds. Numbers come from LiteRT-LM's own benchmark counter (&lt;code&gt;getBenchmarkInfo&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;All are ~4-bit, but the quantization schemes differ slightly per runtime (MLX 4bit / GGUF Q4_K_M / LiteRT INT4-QAT / CoreML INT4-palettized / INT8).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Recommendations by use case
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Just fast, general-purpose, lots of models → MLX Swift.&lt;/strong&gt; Fastest on Qwen, easy from Swift, and &lt;code&gt;mlx-community&lt;/code&gt; has tons of models. The first choice for local LLMs on Apple devices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemma as fast as possible → LiteRT-LM.&lt;/strong&gt; For the Gemma family, strongest on both speed and memory. Can't beat it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory first (any device / coexisting with other features) → CoreML / ANE.&lt;/strong&gt; 241 MB on Qwen 3.5 2B is exceptional. If you can tolerate the speed, it's the strongest for low memory and power.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portability / run anywhere → llama.cpp.&lt;/strong&gt; It has the GGUF model ecosystem and runs on every platform. Not flashy, but solid.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Method and reproducibility
&lt;/h2&gt;

&lt;p&gt;Every run was executed headlessly from a Mac via &lt;code&gt;devicectl&lt;/code&gt; (no on-device tapping); models were side-loaded from the Mac. The raw result JSONL and charts are in the repo. One line = "1 runtime × 1 model × 1 device" — PRs welcome:&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://github.com/john-rocky/apple-silicon-llm-bench" rel="noopener noreferrer"&gt;https://github.com/john-rocky/apple-silicon-llm-bench&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I'll write a separate article on the behind-the-scenes of full measurement automation and the build fights (git-LFS / SwiftPM unsafe-flags / &lt;code&gt;@preconcurrency&lt;/code&gt;, etc.).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;On-device LLM on iPhone: &lt;strong&gt;"MLX / LiteRT-LM for speed, CoreML/ANE for memory."&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hope it helps your local-LLM development!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published in Japanese on &lt;a href="https://qiita.com/john-rocky/items/800bb43b21f9f6da44c4" rel="noopener noreferrer"&gt;Qiita&lt;/a&gt;. I do mobile AI / CoreML / ARKit development and write about it. &lt;a href="https://github.com/john-rocky" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; / &lt;a href="https://twitter.com/JackdeS11" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ios</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>swift</category>
    </item>
    <item>
      <title>A relightable Gaussian Splatting x AR product viewer on iPhone</title>
      <dc:creator>Daisuke Majima</dc:creator>
      <pubDate>Tue, 02 Jun 2026 05:44:57 +0000</pubDate>
      <link>https://dev.to/john-rocky/a-relightable-gaussian-splatting-x-ar-product-viewer-on-iphone-47i</link>
      <guid>https://dev.to/john-rocky/a-relightable-gaussian-splatting-x-ar-product-viewer-on-iphone-47i</guid>
      <description>&lt;p&gt;In online shopping, you can only imagine how a product actually feels.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"I won't really know until it arrives."&lt;/li&gt;
&lt;li&gt;"Since I can't tell, I'll just go with whatever has good reviews."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A lot of purchases get put on hold like this.&lt;/p&gt;

&lt;p&gt;What if customers could feel a product's texture through the screen?&lt;/p&gt;

&lt;p&gt;This article shows a way to let customers &lt;em&gt;see&lt;/em&gt; a product's texture on iPhone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We deliver it right in front of the customer, in AR.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The value of AR commerce: real texture
&lt;/h2&gt;

&lt;p&gt;The value of AR commerce is being able to bring a product into the customer's own space. This project goes one step further: &lt;strong&gt;it lights the product with the real light of the customer's room and reflects their environment in it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqegwhvjkik9zop0jyok.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqegwhvjkik9zop0jyok.gif" width="280" height="608"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Texture comes alive in the room's light&lt;/strong&gt;: the environment's lighting is captured and the product is re-lit. Glossy surfaces reflect the surroundings; a dark room produces calm, subdued shading — &lt;strong&gt;the texture "reacts to the environment."&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Placed on the floor at true scale&lt;/strong&gt;: the floor is detected and the product is placed at real size. Walk around it and you instantly grasp its size and its look from every angle.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A real "it's actually there" feeling&lt;/strong&gt;: a contact shadow at its base, plus occlusion (people or furniture in front hide the product), make the placement look real rather than composited.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0t0uzzvcajrurkteqqrg.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0t0uzzvcajrurkteqqrg.gif" width="300" height="652"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The goal: let customers experience, in their own environment, "the texture you couldn't tell from photos."&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Gaussian Splatting — how the model is made
&lt;/h2&gt;

&lt;p&gt;First, you need to capture the product as 3D. Here we use &lt;strong&gt;3D Gaussian Splatting (3DGS)&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It reconstructs a &lt;strong&gt;realistic 3D model&lt;/strong&gt; from photos/video taken from multiple viewpoints (no manual modeling).&lt;/li&gt;
&lt;li&gt;It's good at &lt;strong&gt;view-dependent&lt;/strong&gt; representation, where appearance changes with viewing angle — well-suited to reproducing gloss and highlights.&lt;/li&gt;
&lt;li&gt;Being point-based, it can be &lt;strong&gt;rendered in real time even on a phone.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this project I built the 3D model with &lt;strong&gt;Ref-Gaussian&lt;/strong&gt;, which learns the material properties (normal, roughness, reflectance, albedo) separately. That makes relighting possible: strip off the light from capture time and re-light the product with the light of wherever you place it. &lt;strong&gt;This is the decisive difference from ordinary photogrammetry, which bakes lighting in.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation: real-time AR on iPhone with Metal
&lt;/h2&gt;

&lt;p&gt;Rendering is hand-written in &lt;strong&gt;Metal&lt;/strong&gt; (based on &lt;a href="https://github.com/scier/MetalSplatter" rel="noopener noreferrer"&gt;scier/MetalSplatter&lt;/a&gt;, with split-sum IBL relighting and deferred PBR added). AR is ARKit. Main features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Relight &amp;amp; reflect with the room's light&lt;/strong&gt; (feeding &lt;code&gt;AREnvironmentProbeAnchor&lt;/code&gt;'s environment cubemap into IBL)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;True scale, floor placement, walk-around&lt;/strong&gt; (plane detection + raycast, ARCamera 6DOF)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contact shadow / depth occlusion&lt;/strong&gt; (LiDAR &lt;code&gt;sceneDepth&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Color / finish try-on (configurator)&lt;/strong&gt;: switch the same model's color and finish (matte / gloss / mirror / metallic), correctly re-lit by the room's light. Change the color and the &lt;strong&gt;white logo or pattern stays white&lt;/strong&gt; (only the hue is replaced).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Turntable display&lt;/strong&gt;: hands-free, slow 360° auto-rotation.&lt;/li&gt;
&lt;/ul&gt;
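
&lt;p&gt;As a rough sketch of how the ARKit side of the features above could be switched on (this is not taken from the repository, and &lt;code&gt;makeARConfiguration&lt;/code&gt; is a hypothetical helper name), a world-tracking configuration can enable horizontal plane detection, automatic environment texturing, and scene depth where the device supports it:&lt;/p&gt;

```swift
import ARKit

// A minimal sketch, not the project's actual setup code.
// makeARConfiguration() is a hypothetical helper name.
func makeARConfiguration() -> ARWorldTrackingConfiguration {
    let configuration = ARWorldTrackingConfiguration()

    // Detect horizontal planes so the product can be raycast onto the floor.
    configuration.planeDetection = [.horizontal]

    // Ask ARKit to generate AREnvironmentProbeAnchor cubemaps automatically.
    configuration.environmentTexturing = .automatic

    // Scene depth is not available on every device, so check before enabling.
    if ARWorldTrackingConfiguration.supportsFrameSemantics(.sceneDepth) {
        configuration.frameSemantics.insert(.sceneDepth)
    }
    return configuration
}
```

&lt;p&gt;The result would be passed to &lt;code&gt;ARSession.run(_:)&lt;/code&gt;. The capability check matters because &lt;code&gt;sceneDepth&lt;/code&gt; requires a LiDAR-equipped device.&lt;/p&gt;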

&lt;blockquote&gt;
&lt;p&gt;The technical key is that &lt;strong&gt;lighting is computed at render time.&lt;/strong&gt; That's why texture reacts correctly, in real time, to "the customer's room light" and "the color the customer chose."&lt;/p&gt;
&lt;/blockquote&gt;
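
&lt;p&gt;To illustrate why a white logo survives a color change, here is a minimal CPU-side sketch of a hue swap. It assumes the edited color is an RGB value with components in 0...1. The &lt;code&gt;RGB&lt;/code&gt; type and both function names are made up for this example, and the real renderer may well do this differently. White has zero saturation, so replacing its hue leaves it unchanged:&lt;/p&gt;

```swift
import Foundation

// Hypothetical illustration, not the project's shader code.
// Colors are RGB triples with components in 0...1.
struct RGB {
    var r: Double
    var g: Double
    var b: Double
}

// Convert HSV (hue expressed in 0...1) to RGB.
func rgb(hue: Double, saturation: Double, value: Double) -> RGB {
    let h = (hue - floor(hue)) * 6.0   // wrap the hue, then scale to 6 sectors
    let sector = Int(h) % 6
    let f = h - floor(h)               // position inside the sector
    let p = value * (1.0 - saturation)
    let q = value * (1.0 - saturation * f)
    let t = value * (1.0 - saturation * (1.0 - f))
    switch sector {
    case 0: return RGB(r: value, g: t, b: p)
    case 1: return RGB(r: q, g: value, b: p)
    case 2: return RGB(r: p, g: value, b: t)
    case 3: return RGB(r: p, g: q, b: value)
    case 4: return RGB(r: t, g: p, b: value)
    default: return RGB(r: value, g: p, b: q)
    }
}

// Replace only the hue of `color`, keeping its saturation and value.
func replacingHue(of color: RGB, with newHue: Double) -> RGB {
    let maxC = max(color.r, color.g, color.b)
    let minC = min(color.r, color.g, color.b)
    let saturation = maxC > 0 ? (maxC - minC) / maxC : 0
    return rgb(hue: newHue, saturation: saturation, value: maxC)
}
```

&lt;p&gt;With this sketch, &lt;code&gt;replacingHue(of: RGB(r: 1, g: 1, b: 1), with: 0.5)&lt;/code&gt; returns white again, while pure red with the same target hue becomes cyan.&lt;/p&gt;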

&lt;h2&gt;
  
  
  Applicable to many products
&lt;/h2&gt;

&lt;p&gt;The more a product depends on texture and size, the better this lands.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cars&lt;/strong&gt;: body-paint reflections, showroom-style presentation, fit-check for a parking space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Furniture / interior&lt;/strong&gt;: gloss and color under the room's light, sense of size when placed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jewelry / watches&lt;/strong&gt;: metal and glass reflections (products where texture differences map directly to price).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Appliances / kitchenware&lt;/strong&gt;: the feel of stainless steel or glossy plastic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apparel accessories&lt;/strong&gt;: the sheen of leather and patent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In every case, the customer resolves both "texture that photos can't convey" and "a size you can't judge until you place it" in their own environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;E-commerce has a "size and texture don't come across" problem. The approach here is to &lt;strong&gt;capture the real product with Gaussian Splatting&lt;/strong&gt; and &lt;strong&gt;display it in AR on iPhone at true scale, under the customer's room light&lt;/strong&gt;, to create a "check the texture before you buy" experience that applies to many products.&lt;/p&gt;

&lt;p&gt;The code (relightable GS viewer / Metal + ARKit) is public:&lt;br&gt;
&lt;strong&gt;&lt;a href="https://github.com/john-rocky/MetalGaussianSplatRelighting" rel="noopener noreferrer"&gt;https://github.com/john-rocky/MetalGaussianSplatRelighting&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published in Japanese on &lt;a href="https://qiita.com/john-rocky/items/54ee967a757342a9d1ca" rel="noopener noreferrer"&gt;Qiita&lt;/a&gt;. I build apps with machine learning and AR, and write about both. &lt;a href="https://github.com/john-rocky" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; / &lt;a href="https://twitter.com/JackdeS11" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ios</category>
      <category>ar</category>
      <category>metal</category>
      <category>3d</category>
    </item>
    <item>
      <title>Fine-tuning a VLM to build an on-device fashion-scoring app</title>
      <dc:creator>Daisuke Majima</dc:creator>
      <pubDate>Tue, 02 Jun 2026 05:44:56 +0000</pubDate>
      <link>https://dev.to/john-rocky/fine-tuning-a-vlm-to-build-an-on-device-fashion-scoring-app-5kj</link>
      <guid>https://dev.to/john-rocky/fine-tuning-a-vlm-to-build-an-on-device-fashion-scoring-app-5kj</guid>
      <description>&lt;p&gt;Scoring outfits with AI. Offline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can it be done?&lt;/strong&gt;&lt;br&gt;
Style is qualitative. There isn't a single answer. AI can give a generic answer, but can it answer something like fashion, where the criteria vary by culture?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There is a way.&lt;/strong&gt;&lt;br&gt;
This article is a record of building a fully offline fashion-scoring app on iPhone using a vision-language model (VLM).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59rs9wauoruc1sn7y32w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59rs9wauoruc1sn7y32w.png" width="800" height="1739"&gt;&lt;/a&gt; &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2jmvyrciwfaqezvh8jf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2jmvyrciwfaqezvh8jf.png" width="800" height="1739"&gt;&lt;/a&gt; &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7u9664jw9mw97dso386u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7u9664jw9mw97dso386u.png" width="800" height="1739"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The approach
&lt;/h2&gt;

&lt;p&gt;Use a &lt;em&gt;closed&lt;/em&gt; system of evaluation criteria.&lt;/p&gt;

&lt;p&gt;Every aesthetic or philosophical judgment has many schools of thought, and &lt;strong&gt;it's hard to produce an open answer that satisfies every possible criterion.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But within a single school — whether in fashion, sports, or specialized work — &lt;strong&gt;there are cases where the correct answer is determined inside a closed system.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For example, here I drew on the idea of "the balance between dressy and casual," popularized for a general audience by the Japanese men's-fashion influencer "MB." I took "an outfit looks stylish when its dressy-to-casual balance is close to 7:3" as the scoring axis and scored input images against it. (This is my own interpretation from reading MB's blog and other material.)&lt;/p&gt;

&lt;p&gt;Each item of an outfit (tops, bottoms, shoes) is &lt;strong&gt;scored against a somewhat systematized standard.&lt;/strong&gt; An AI (LLM) can do this, and it does it quite well. Fewer than 1,000 training examples are enough. The model doesn't need to see every possible item; it extrapolates to unseen ones.&lt;/p&gt;

&lt;p&gt;That's the real subject here. More than scoring fashion itself, the theme is &lt;strong&gt;how well-suited an LLM is to handling a "closed system."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Small models that fit on an iPhone are well-suited to this kind of domain-specific fine-tuning. With fewer parameters, training is cheap.&lt;/p&gt;

&lt;p&gt;This approach works not just for fashion but for anything where the answer is established within a closed system of a given school — makeup, sports form, fortune-telling, and so on.&lt;/p&gt;
&lt;h2&gt;
  
  
  How it's built
&lt;/h2&gt;

&lt;p&gt;Fine-tune by knowledge distillation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Teacher = large model (Qwen3-VL-235B-A22B)&lt;/li&gt;
&lt;li&gt;Student = small model (Qwen3-VL-2B)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feed a theory document (~10KB: definitions of 5 axes + baseline tables + aggregation rules + output rules) to the large model as a prompt, and have it score the training images according to that document.&lt;/p&gt;

&lt;p&gt;Only the large model can do this; the small model can't hold the entire theory document.&lt;/p&gt;

&lt;p&gt;Then fine-tune the small model on the resulting pairs: (image given to the large model, output the large model produced).&lt;/p&gt;

&lt;p&gt;Now the small model can produce output grounded in the theoretical system. It doesn't &lt;em&gt;know&lt;/em&gt; the theory document, but it can perform the baked-in process.&lt;/p&gt;

&lt;p&gt;For this one closed-domain evaluation alone, the small model can imitate the behavior of a model 10×–100× its size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input: image
Output: fixed-schema JSON label
Train Qwen3-VL 2B on (image, fixed question, JSON) triplets via LoRA fine-tuning (student)
Convert to CoreML -&amp;gt; iPhone -&amp;gt; fully offline scoring
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Because it's "closed," ~800 images are enough.&lt;/strong&gt; The mapping has low entropy, so if the teacher emits labels under consistent rules, even a small set lets the student reconstruct those rules.&lt;/p&gt;

&lt;h3&gt;
  
  
  The highest-leverage part is the "theory document"
&lt;/h3&gt;

&lt;p&gt;The most influential file in this pipeline is neither the training script nor the model definition — it's the &lt;strong&gt;theory document (the instructions to the teacher).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Writing a genuine theory document is the one thing you can't skimp on.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Output schema (author's reconstruction)
&lt;/h3&gt;

&lt;p&gt;The JSON I had the student emit looks roughly like this (an implementation structure, not text from any original source):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"items"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tops"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"white cotton dress shirt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"scores"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"silhouette"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"material"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"design"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"item_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"item_dress_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;4.2&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bottoms"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"black skinny trousers"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"scores"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"silhouette"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"material"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"design"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"item_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"item_dress_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;4.3&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"overall_dress_ratio"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.71&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"coordinate_silhouette"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"I"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"style_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"rationale"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"target_ratio"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"verdict"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Near-ideal 7:3 for street wear. Dressy edges ahead slightly; clean enough."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"advice"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
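
&lt;p&gt;On the app side, a fixed schema like this can be mirrored by &lt;code&gt;Decodable&lt;/code&gt; types, so that any drift in the student's output fails loudly at decode time. The following is a sketch under assumptions: the type names, &lt;code&gt;ratioGap&lt;/code&gt;, and &lt;code&gt;decodeOutfitScore&lt;/code&gt; are hypothetical and not from the app's source. Only the JSON keys come from the example above.&lt;/p&gt;

```swift
import Foundation

// Hypothetical Swift mirror of the JSON above; all type names are illustrative.
struct OutfitScore: Decodable {
    struct AxisScores: Decodable {
        let color: Int
        let silhouette: Int
        let material: Int
        let design: Int
        let itemType: Int
        enum CodingKeys: String, CodingKey {
            case color, silhouette, material, design
            case itemType = "item_type"
        }
    }
    struct Item: Decodable {
        let category: String
        let description: String
        let scores: AxisScores
        let itemDressScore: Double
        enum CodingKeys: String, CodingKey {
            case category, description, scores
            case itemDressScore = "item_dress_score"
        }
    }
    struct Silhouette: Decodable {
        let type: String
        let styleScore: Int
        let rationale: String
        enum CodingKeys: String, CodingKey {
            case type, rationale
            case styleScore = "style_score"
        }
    }

    let items: [Item]
    let overallDressRatio: Double
    let coordinateSilhouette: Silhouette
    let targetRatio: Double
    let verdict: String
    let advice: String
    enum CodingKeys: String, CodingKey {
        case items, verdict, advice
        case overallDressRatio = "overall_dress_ratio"
        case coordinateSilhouette = "coordinate_silhouette"
        case targetRatio = "target_ratio"
    }

    // Distance between the measured dressy share and the target (0 = on target).
    var ratioGap: Double { abs(overallDressRatio - targetRatio) }
}

// Decode raw model output; throws if a key is missing or has the wrong type.
func decodeOutfitScore(from json: String) throws -> OutfitScore {
    return try JSONDecoder().decode(OutfitScore.self, from: Data(json.utf8))
}
```

&lt;p&gt;Explicit &lt;code&gt;CodingKeys&lt;/code&gt; keep the snake_case keys in the JSON and the camelCase names in Swift visibly paired, which makes a schema change easy to spot in review.&lt;/p&gt;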



&lt;h3&gt;
  
  
  Implementation stack and numbers
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Choice&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Base model&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Qwen/Qwen3-VL-2B-Instruct&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;fp16/int8 stable on Apple Silicon; shipped&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(compared) alt base&lt;/td&gt;
&lt;td&gt;&lt;code&gt;google/gemma-4-E2B-it&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;schema collapse at int4; passed over for FT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Teacher labeler&lt;/td&gt;
&lt;td&gt;Qwen3-VL-235B-A22B&lt;/td&gt;
&lt;td&gt;reads the theory document and emits JSON labels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training&lt;/td&gt;
&lt;td&gt;LoRA rank 16 / alpha 32, &lt;code&gt;language_model.*&lt;/code&gt; only, vision frozen&lt;/td&gt;
&lt;td&gt;~25 min on Colab A100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Conversion&lt;/td&gt;
&lt;td&gt;coreml-llm Qwen3-VL stateful pipeline&lt;/td&gt;
&lt;td&gt;MLState + slice_update KV&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Device&lt;/td&gt;
&lt;td&gt;iPhone 17 Pro (A19 ANE)&lt;/td&gt;
&lt;td&gt;2.3GB int8 / ~24 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Training data was ~800–900 full-body outfit photos from Unsplash + Pexels (~750 used for training). One iteration (collect → label → train → convert → transfer) takes roughly &lt;strong&gt;2.5 hours.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing: a "dedicated scorer" that fits in your pocket
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Specialized knowledge that can be written as a closed system runs faster, cheaper, more consistently, and more privately when distilled wholesale into a 2B model on-device than when thrown at a giant API.&lt;/strong&gt; If a giant general model is "an advisor who knows a little about everything," what I built here is a way to put "a scorer who has drilled one certification standard into their body" in your pocket. Scoring, assessment, certification, fixed-schema extraction — the world has surprisingly many "closed systems," and any of them might be bakeable to device size with the same pattern.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;※ To repeat: this implementation is not supervised or endorsed by any specific individual or organization; it is an independent reconstruction of publicly known ideas for technical validation. The scores do not represent any definitive "correct answer."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Note
&lt;/h2&gt;

&lt;p&gt;The idea this article builds on — "shifting the dressy-to-casual balance toward 7:3" — references a &lt;strong&gt;publicly and widely known idea in Japanese men's fashion.&lt;/strong&gt; The scoring axes, JSON schema, prompt design, and aggregation rules here are &lt;strong&gt;my own reconstruction for technical validation&lt;/strong&gt;, not a quotation or reproduction of any original text, figures, or images. This implementation is &lt;strong&gt;not an official, supervised, partnered, or endorsed app of any individual or organization&lt;/strong&gt;, nor is it intended as an accurate explanation of the theory. It is purely a technical experiment in "how to internalize a subjective evaluation axis into an image-understanding model," and the scores do not constitute anyone's definitive judgment. The value of this article lies not in fashion theory itself but in the &lt;strong&gt;methodology of distilling a closed system into a small model.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published in Japanese on &lt;a href="https://qiita.com/john-rocky/items/99638048a4864f0d798b" rel="noopener noreferrer"&gt;Qiita&lt;/a&gt;. I build apps with machine learning and AR, and write about both. &lt;a href="https://github.com/john-rocky" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; / &lt;a href="https://twitter.com/JackdeS11" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>ios</category>
      <category>llm</category>
    </item>
    <item>
      <title>The simplest possible SwiftUI MVVM</title>
      <dc:creator>Daisuke Majima</dc:creator>
      <pubDate>Tue, 02 Jun 2026 05:38:07 +0000</pubDate>
      <link>https://dev.to/john-rocky/the-simplest-possible-swiftui-mvvm-14aa</link>
      <guid>https://dev.to/john-rocky/the-simplest-possible-swiftui-mvvm-14aa</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;"I keep hearing about MVVM, but I only half get it..."&lt;br&gt;
"I really should start playing with SwiftUI, but the steps and where to put the code look complicated..."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If that's you, this article will give you the backbone of SwiftUI + MVVM design.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is MVVM?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;MVVM&lt;/strong&gt; is a code design pattern whose main goal is to &lt;strong&gt;separate the Model from the View&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The Model is the actual substance of what the app does.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The View is how the app is presented to the user.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A &lt;strong&gt;ViewModel translates and relays the changes between them&lt;/strong&gt;. That way the Model can keep the app's substance clear and single-sourced, and the View can present the Model's state to the user without delay.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is MVVM required?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;MVVM is not required to build an app with SwiftUI&lt;/strong&gt;, but using it lets you take the declarative approach — "hand a bundle of content changes to the View and let the View figure out how to render it" — which makes writing apps smooth. (By contrast, telling the View "do this, now do that" on every update is the imperative style.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's understand it with the simplest possible example
&lt;/h2&gt;

&lt;p&gt;It's only human to feel &lt;strong&gt;"so what on earth do I actually write in a ViewModel?"&lt;/strong&gt; and &lt;strong&gt;"SwiftUI keeps throwing new characters at me like @ObservedObject and @Published — scary."&lt;/strong&gt; (I felt exactly that.) So I wrote a tiny case study that combines SwiftUI and MVVM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We'll build the MVVM pattern with the bare-minimum Model, View and ViewModel.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The case study is a switch that toggles between dog 🐶 and cat 🐱 when tapped.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A simple switch that toggles on tap:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuvy41763mjnhr72q2w1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuvy41763mjnhr72q2w1.png" width="78" height="78"&gt;&lt;/a&gt; tap ⇄ &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vpqhcj5w97pq4pzapff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vpqhcj5w97pq4pzapff.png" width="78" height="78"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(The background is green just for clarity.)&lt;/p&gt;

&lt;h2&gt;
  
  
  A simple MVVM
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Writing the Model
&lt;/h3&gt;

&lt;p&gt;The Model of this sample — the substance of this app — is switching between dog and cat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model.swift&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;Foundation&lt;/span&gt; &lt;span class="c1"&gt;// the Model does NOT import SwiftUI&lt;/span&gt;

&lt;span class="kd"&gt;struct&lt;/span&gt; &lt;span class="kt"&gt;Model&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="kd"&gt;enum&lt;/span&gt; &lt;span class="kt"&gt;Pet&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="c1"&gt;// the case is either dog or cat&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;🐶&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;🐱&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;pet&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Pet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;🐶&lt;/span&gt; &lt;span class="c1"&gt;// default is dog&lt;/span&gt;

    &lt;span class="k"&gt;mutating&lt;/span&gt; &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;switchPet&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="c1"&gt;// toggle dog and cat&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pet&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;🐶&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;pet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;🐱&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;pet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;🐶&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Model does not import SwiftUI, because it is the app's substance — independent of the UI.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The substance of this app is switching between dog and cat, so the Model consists of a &lt;code&gt;pet&lt;/code&gt; variable (dog or cat) and a &lt;code&gt;switchPet&lt;/code&gt; function that toggles them. (A &lt;code&gt;struct&lt;/code&gt; uses a &lt;code&gt;mutating func&lt;/code&gt; to mutate itself.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's the entire Model of our app.&lt;/strong&gt;&lt;/p&gt;
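
&lt;p&gt;Because the Model has no UI dependency, it can be exercised on its own, for example in a playground or a command-line script. A minimal sketch (the Model is repeated in compact form so the snippet is self-contained):&lt;/p&gt;

```swift
import Foundation

// Compact copy of the article's Model so this snippet stands alone.
struct Model {
    enum Pet: String {
        case 🐶
        case 🐱
    }

    var pet: Pet = .🐶

    mutating func switchPet() {
        pet = (pet == .🐶) ? .🐱 : .🐶
    }
}

var model = Model()          // must be `var`: switchPet() mutates the struct
print(model.pet.rawValue)    // starts as dog
model.switchPet()
print(model.pet.rawValue)    // now cat
model.switchPet()
print(model.pet.rawValue)    // back to dog
```

&lt;p&gt;Nothing here touches SwiftUI, which is the point of keeping the Model separate.&lt;/p&gt;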

&lt;h3&gt;
  
  
  Writing the View
&lt;/h3&gt;

&lt;p&gt;The View renders the Model's &lt;code&gt;pet&lt;/code&gt; as a Text view, and when the Text view is tapped, it switches the Model's &lt;code&gt;pet&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ContentView.swift&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;SwiftUI&lt;/span&gt;

&lt;span class="kd"&gt;struct&lt;/span&gt; &lt;span class="kt"&gt;ContentView&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;View&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kd"&gt;some&lt;/span&gt; &lt;span class="kt"&gt;View&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"reflect the Model's pet here"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;onTapGesture&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="c1"&gt;// switch the model's pet here&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our View is responsible for displaying the Model's content to the user and for accepting the user's tap.&lt;/p&gt;

&lt;p&gt;If we ignore the MVVM pattern, &lt;strong&gt;we could hold the Model directly inside the View:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ContentView.swift (holding the Model directly)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;SwiftUI&lt;/span&gt;

&lt;span class="kd"&gt;struct&lt;/span&gt; &lt;span class="kt"&gt;ContentView&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;View&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;@State&lt;/span&gt; &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
     &lt;span class="c1"&gt;// @State lets you change view-state values and reflect them instantly&lt;/span&gt;

    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kd"&gt;some&lt;/span&gt; &lt;span class="kt"&gt;View&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pet&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rawValue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;onTapGesture&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;switchPet&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Going further, you could of course hold the &lt;code&gt;pet&lt;/code&gt; variable and the toggle function in the View itself:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ContentView.swift (holding pet and switchPet directly)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;SwiftUI&lt;/span&gt;

&lt;span class="kd"&gt;struct&lt;/span&gt; &lt;span class="kt"&gt;ContentView&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;View&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="kd"&gt;enum&lt;/span&gt; &lt;span class="kt"&gt;Pet&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;🐶&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;🐱&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;@State&lt;/span&gt; &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;pet&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Pet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;🐶&lt;/span&gt;
     &lt;span class="c1"&gt;// @State lets you change view-state values and reflect them instantly&lt;/span&gt;

    &lt;span class="k"&gt;mutating&lt;/span&gt; &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;switchPet&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pet&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;🐶&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;pet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;🐱&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;pet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;🐶&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kd"&gt;some&lt;/span&gt; &lt;span class="kt"&gt;View&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pet&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rawValue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;onTapGesture&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="nf"&gt;switchPet&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this &lt;code&gt;pet&lt;/code&gt; variable only represents transient view state, maybe that's fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But once several Views need the same state, each holding its own copy makes it hard to keep a single source of truth for the Model. MVVM exists to avoid exactly that.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Writing the ViewModel
&lt;/h3&gt;

&lt;p&gt;The ViewModel's job is to be the interpreter between View and Model: relaying the user's tap from the View to the Model, and relaying the Model's state back to the View.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ViewModel.swift&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;Foundation&lt;/span&gt;
&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;SwiftUI&lt;/span&gt;

&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="kt"&gt;ViewModel&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;// holds the Model&lt;/span&gt;

    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;pet&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pet&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rawValue&lt;/span&gt; &lt;span class="c1"&gt;// return the Model's pet as the String the View needs&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;switchPet&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;switchPet&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;// call the Model's switchPet&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Accessing the ViewModel from the View
&lt;/h3&gt;

&lt;p&gt;The View tells the ViewModel about the user's tap, the ViewModel calls the Model's toggle function, and the View reads the Model's &lt;code&gt;pet&lt;/code&gt; value back through the ViewModel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ContentView.swift&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;SwiftUI&lt;/span&gt;

&lt;span class="kd"&gt;struct&lt;/span&gt; &lt;span class="kt"&gt;ContentView&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;View&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;viewModel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;ViewModel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kd"&gt;some&lt;/span&gt; &lt;span class="kt"&gt;View&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;ZStack&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;viewModel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;onTapGesture&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;viewModel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;switchPet&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this and… &lt;strong&gt;the UI doesn't change.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To check, let's add a print to the Model's &lt;code&gt;switchPet&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model.swift&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;    &lt;span class="k"&gt;mutating&lt;/span&gt; &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;switchPet&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pet&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;🐶&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;pet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;🐱&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;pet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;🐶&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🐱
🐶
🐱
🐶
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Model's &lt;code&gt;pet&lt;/code&gt; is toggling, but the UI isn't updating. In terms of the information flow above:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The View tells the ViewModel about the user's tap (✅ done)&lt;/li&gt;
&lt;li&gt;The ViewModel calls the Model's toggle function (✅ done)&lt;/li&gt;
&lt;li&gt;
&lt;del&gt;The View reads the Model's &lt;code&gt;pet&lt;/code&gt; change back through the ViewModel&lt;/del&gt; (❌ this part isn't arriving)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In MVVM, the ViewModel &lt;strong&gt;publishes&lt;/strong&gt; Model changes to everyone, and the View &lt;strong&gt;subscribes&lt;/strong&gt; to whatever information it cares about — that's how the View receives Model updates.&lt;/p&gt;

&lt;p&gt;This is where SwiftUI's property wrappers come in.&lt;/p&gt;

&lt;h3&gt;
  
  
  The ViewModel publishes changes, and the View subscribes
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ViewModel.swift&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;Foundation&lt;/span&gt;
&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;SwiftUI&lt;/span&gt;

&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="kt"&gt;ViewModel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;ObservableObject&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="c1"&gt;// conform to ObservableObject&lt;/span&gt;
    &lt;span class="kd"&gt;@Published&lt;/span&gt; &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;// mark it @Published&lt;/span&gt;

    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;pet&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pet&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rawValue&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;switchPet&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;switchPet&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By conforming to &lt;code&gt;ObservableObject&lt;/code&gt;, the ViewModel becomes observable: it gains an &lt;code&gt;objectWillChange&lt;/code&gt; publisher that anything willing to observe it can subscribe to.&lt;/p&gt;

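&lt;p&gt;By adding &lt;code&gt;@Published&lt;/code&gt;, the ViewModel (an &lt;code&gt;ObservableObject&lt;/code&gt;) announces a change to every subscriber whenever &lt;code&gt;model&lt;/code&gt; is set. Because the Model is a &lt;code&gt;struct&lt;/code&gt;, mutating it through &lt;code&gt;switchPet()&lt;/code&gt; counts as setting the property, so each toggle is announced.&lt;/p&gt;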
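
&lt;p&gt;To see the publishing on its own, here is a minimal sketch that runs without any View (on Apple platforms, where Combine is available). It assumes the &lt;code&gt;Pet&lt;/code&gt; and &lt;code&gt;Model&lt;/code&gt; restated below match the ones defined earlier, and it subscribes by hand with Combine's &lt;code&gt;sink&lt;/code&gt;. The &lt;code&gt;notifications&lt;/code&gt; counter is a hypothetical name used only for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;import Combine

// Restated from earlier in the article (assumed to match Model.swift)
enum Pet: String {
    case 🐶
    case 🐱
}

struct Model {
    var pet: Pet = .🐶

    mutating func switchPet() {
        pet = (pet == .🐶) ? .🐱 : .🐶
    }
}

final class ViewModel: ObservableObject {
    @Published var model = Model()

    var pet: String { model.pet.rawValue }

    func switchPet() { model.switchPet() }
}

let viewModel = ViewModel()
var notifications = 0 // hypothetical counter, for illustration only

// objectWillChange emits just before a @Published property is set
let subscription = viewModel.objectWillChange.sink { _ in
    notifications += 1
}

viewModel.switchPet() // 🐶 to 🐱
viewModel.switchPet() // 🐱 back to 🐶

print(notifications, viewModel.pet) // prints "2 🐶"
subscription.cancel()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each call to &lt;code&gt;switchPet()&lt;/code&gt; sets the &lt;code&gt;@Published&lt;/code&gt; property once, so the subscriber is notified once per toggle.&lt;/p&gt;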

&lt;p&gt;Then the View subscribes to that published change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ContentView.swift&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;SwiftUI&lt;/span&gt;

&lt;span class="kd"&gt;struct&lt;/span&gt; &lt;span class="kt"&gt;ContentView&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;View&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;@ObservedObject&lt;/span&gt; &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;viewModel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;ViewModel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;// add @ObservedObject&lt;/span&gt;

    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kd"&gt;some&lt;/span&gt; &lt;span class="kt"&gt;View&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;ZStack&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;viewModel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;onTapGesture&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;viewModel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;switchPet&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By adding &lt;code&gt;@ObservedObject&lt;/code&gt; to the &lt;code&gt;viewModel&lt;/code&gt; property, the View subscribes to the ViewModel: whenever the &lt;code&gt;ObservableObject&lt;/code&gt; publishes a change, SwiftUI re-evaluates the View's &lt;code&gt;body&lt;/code&gt; and updates the affected UI.&lt;/p&gt;

&lt;p&gt;Now the "the View reads the Model's &lt;code&gt;pet&lt;/code&gt; value through the ViewModel" part works, and tapping updates the UI.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Dog/cat updating on tap:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpyet9olnwmovum5r42dq.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpyet9olnwmovum5r42dq.gif" width="152" height="118"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the simplest possible MVVM.&lt;/strong&gt;&lt;br&gt;
There's a lot more to it, but I think the basic building blocks of the pattern are all here.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published in Japanese on &lt;a href="https://qiita.com/john-rocky/items/88b45e18bd48e3dbc87c" rel="noopener noreferrer"&gt;Qiita&lt;/a&gt;. I build apps with Core ML and write about machine learning. &lt;a href="https://github.com/john-rocky" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; / &lt;a href="https://twitter.com/JackdeS11" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ios</category>
      <category>swift</category>
      <category>swiftui</category>
      <category>mvvm</category>
    </item>
    <item>
      <title>Prompt tips for realistic human images with Stable Diffusion</title>
      <dc:creator>Daisuke Majima</dc:creator>
      <pubDate>Tue, 02 Jun 2026 05:38:06 +0000</pubDate>
      <link>https://dev.to/john-rocky/prompt-tips-for-realistic-human-images-with-stable-diffusion-h9</link>
      <guid>https://dev.to/john-rocky/prompt-tips-for-realistic-human-images-with-stable-diffusion-h9</guid>
      <description>&lt;h2&gt;
  
  
  Tips for generating good images
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fepcddgt4wdk0dtj9vh7e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fepcddgt4wdk0dtj9vh7e.png" width="512" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5iqbk2n65uupjxkcxak.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5iqbk2n65uupjxkcxak.png" width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Stable Diffusion is the talk of the town, but generating the kind of high-quality, realistic portraits you often see on social media takes a bit of know-how. In this article I'll show, with examples, which words help you generate high-quality images.&lt;/p&gt;

&lt;p&gt;The Web UI is the convenient way to try Stable Diffusion. You can learn how to use it here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://qiita.com/john-rocky/items/1b6dd780d38c63bb64cd" rel="noopener noreferrer"&gt;https://qiita.com/john-rocky/items/1b6dd780d38c63bb64cd&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AUTOMATIC1111/stable-diffusion-webui" rel="noopener noreferrer"&gt;https://github.com/AUTOMATIC1111/stable-diffusion-webui&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Just typing the text of the image you want isn't enough
&lt;/h2&gt;

&lt;p&gt;Say you want an image of a girl. If you just type &lt;code&gt;girl&lt;/code&gt;, you get something like this. It's not bad, but it looks a little CG-ish — we want something more photographic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;girl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4siuinhxb70mun59v82.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4siuinhxb70mun59v82.png" width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Add words that evoke high quality
&lt;/h2&gt;

&lt;p&gt;Now, pile in words that evoke high quality — &lt;code&gt;best quality&lt;/code&gt;, &lt;code&gt;high resolution&lt;/code&gt;, and so on — almost to an absurd degree. It might surprise you, but &lt;strong&gt;stuffing in lots of comma-separated words like this is the first tip.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;girl, RAW photo, best quality, realistic, photo-realistic, best quality, masterpiece, an extremely delicate and beautiful, extremely detailed, 2k wallpaper, Amazing, finely detail, 8k wallpaper, huge filesize, ultra-detailed, highres, extremely detailed, realistic, 8K, Ultra-High Definition, highest quality, ultra high resolution, (realistic:1.4), High quality texture,
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2ilnjpebo77g72p9pr5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2ilnjpebo77g72p9pr5.png" width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now you get a crisp, realistic image like the one above. But it still looks a bit like a painting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Add negative words
&lt;/h2&gt;

&lt;p&gt;Add words you do &lt;strong&gt;not&lt;/strong&gt; want in the image. This time, &lt;strong&gt;we want to remove the painterly feel and get closer to a photo, so we put painting-evoking words into the negative prompt.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Negative prompt&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EasyNegative, paintings, sketches, (worst quality:2), (low quality:2), (normal quality:2), lowres, normal quality, ((monochrome)),
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F296d6zudu34s4bn1suh1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F296d6zudu34s4bn1suh1.png" width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This gets you much closer to a photo. The negative prompt is really important.&lt;/p&gt;

&lt;h2&gt;
  
  
  Add words for fine detail
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(extremely detailed eyes and face)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4i3j9il5oyb3mvftkww7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4i3j9il5oyb3mvftkww7.png" width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Add low-quality terms to the negative prompt
&lt;/h2&gt;

&lt;p&gt;Drive the point home by adding words meaning the opposite of high quality to the negative prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Negative prompt&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;worst quality, low quality, normal quality, jpegartifacts, signature, watermark, blurry, cropped, poorly draw, poorly draw, worst quality, low quality, lowres,
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpe1a9u3prq4fe0y1573.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpe1a9u3prq4fe0y1573.png" width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Weight specific words
&lt;/h2&gt;

&lt;p&gt;You can emphasize a word by wrapping it in parentheses with a weight. In the Web UI, a value above 1 strengthens the term and a value below 1 weakens it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(detailed clothes:1.2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;em&gt;Originally published in Japanese on &lt;a href="https://qiita.com/john-rocky/items/08bfffb1d0ca2a5f3637" rel="noopener noreferrer"&gt;Qiita&lt;/a&gt;. I build apps with machine learning and AR, and write about both. &lt;a href="https://github.com/john-rocky" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; / &lt;a href="https://twitter.com/JackdeS11" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>stablediffusion</category>
      <category>machinelearning</category>
      <category>art</category>
    </item>
  </channel>
</rss>
