DEV Community: Daisuke Majima

iPhone on-device LLM: the GPU wins the sprint, the Neural Engine wins the marathon

Daisuke Majima — Thu, 04 Jun 2026 09:38:23 +0000

The follow-up to my on-device runtime speed benchmark — because burst tok/s only tells half the story.

I benchmark on-device LLMs on iPhone, and while shipping real apps I kept noticing that the GPU runtimes start fast and then fade. So I measured decode rate over time: 10 minutes of continuous generation from a cold start. Conditions were identical across runtimes: same model (Gemma 4 E2B, 4-bit) and same iPhone 17 Pro (A19 Pro).

The GPU runtimes (MLX, LiteRT-LM) that crush the burst test collapse 50%+ within ~60 seconds as the phone heats. The slowest starter — the Apple Neural Engine (CoreML) — barely throttles and overtakes them.

Decode rate: cold burst vs sustained (Gemma 4 E2B 4-bit)

Runtime (compute)	Burst tok/s	Sustained tok/s (10 min)	Retained
CoreML / ANE	33	22	67%
MLX / GPU	48	18	38%
LiteRT-LM / GPU	56	27	48%

The GPU runtimes fall to 38% / 48% of their own peak after 10 minutes. The ANE holds 67% — ending up faster than MLX outright, and shrinking LiteRT's lead from +70% to +23%.
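The percentages above follow directly from the table. A quick check of the arithmetic, using only the rounded tok/s figures quoted in this post (the dictionary and function names are illustrative, not taken from the benchmark repo):

```python
# Rounded decode rates (tok/s) copied from the table above.
burst = {"CoreML / ANE": 33, "MLX / GPU": 48, "LiteRT-LM / GPU": 56}
sustained = {"CoreML / ANE": 22, "MLX / GPU": 18, "LiteRT-LM / GPU": 27}


def retained_pct(runtime: str) -> int:
    """Sustained rate as a percentage of the same runtime's burst rate."""
    return round(100 * sustained[runtime] / burst[runtime])


def lead_pct(rates: dict, leader: str, other: str) -> int:
    """How much faster `leader` is than `other`, in percent."""
    return round(100 * (rates[leader] / rates[other] - 1))


for name in burst:
    print(f"{name}: retains {retained_pct(name)}% of its burst rate")

print("LiteRT-LM lead over ANE (burst):",
      lead_pct(burst, "LiteRT-LM / GPU", "CoreML / ANE"), "%")
print("LiteRT-LM lead over ANE (sustained):",
      lead_pct(sustained, "LiteRT-LM / GPU", "CoreML / ANE"), "%")
```

Because the inputs are already rounded, the raw JSONL in the repo may give slightly different decimals.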

How fast they fade (vs each runtime's own peak)

Runtime (compute)	Time to -10%	Time to -25%	Floor
MLX / GPU	5 s	35 s	~18 tok/s
LiteRT-LM / GPU	13 s	40 s	~27 tok/s
CoreML / ANE	93 s	390 s	~22 tok/s

The GPU runtimes are down 25% in well under a minute. The ANE takes over 90 seconds just to lose 10%.
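For readers who want to compute the same "time to -10% / -25%" figures from their own runs, here is a minimal sketch. It assumes the run is available as (seconds, tok/s) samples and that the clock for "fading" only starts once the peak has been reached. Both are assumptions of this example, and the sample data is made up; the repo's script may define the thresholds differently.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (seconds since start, decode rate in tok/s)


def time_to_drop(samples: Sequence[Sample], drop: float) -> Optional[float]:
    """Return the first timestamp, after the peak, where the rate has
    fallen by at least `drop` (a fraction, e.g. 0.10 for -10%) relative
    to the run's own peak. Returns None if it never falls that far."""
    peak = max(rate for _, rate in samples)
    threshold = peak * (1.0 - drop)
    peak_seen = False
    for t, rate in samples:
        if rate == peak:
            peak_seen = True
        if peak_seen and rate <= threshold:
            return t
    return None


# Made-up samples for illustration only, not measured data.
demo = [(0, 40.0), (5, 48.0), (10, 44.0), (20, 42.0), (35, 36.0), (60, 30.0)]
print("time to -10%:", time_to_drop(demo, 0.10))
print("time to -25%:", time_to_drop(demo, 0.25))
```

Ignoring samples before the peak keeps a slow warm-up from being counted as throttling.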

Why — it's a power story

iOS won't give you per-subsystem watts, and over a long run the phone throttles every backend down to the same sustainable thermal envelope — so iPhone battery-delta can't separate them. But the same model on Mac (M4 Max), where powermetrics exposes package power, shows the cause cleanly:

CoreML / ANE: 12.7 W
MLX / GPU: 24.7 W
llama.cpp / GPU: 24.5 W

The ANE path draws ~half the GPU's power at full decode. Lower power → heats slower → the thermally-constrained iPhone doesn't have to throttle it. Two independent GPU runtimes — Apple's MLX and Google's LiteRT-LM — collapsing the same way says this is a GPU-thermal property of the phone, not a quirk of one runtime.

Takeaway: GPU for the sprint, ANE for the marathon

Quick chat reply → the GPU wins outright; burst speed is the experience.
Sustained load (long generation, agentic loops, batch/background jobs) → the GPU's burst lead largely evaporates. MLX ends up slower than the ANE; LiteRT keeps only a slim lead after shedding half its speed.
And the ANE draws ~half the power and leaves the GPU free for the rest of the app (rendering, camera, other ML).

That's the real case for running an LLM on the ANE, even though its peak decode is lower.

Caveats (so you can poke holes)

Burst = one cold ~128-token generation. Sustained = a 600 s re-prompt loop, decode rate from a rolling window. The in-loop re-prompt overhead shaves a little off the absolute rate (more for CoreML, whose prefill is relatively costlier), so I quote burst from the clean single shot.
iOS only exposes battery in 1% steps, and under sustained load the SoC pins every backend to the same thermal-budget power. The power chart is measured on Mac (M4 Max, powermetrics, same model) — the mechanism behind the iPhone throttling, on a device where per-unit watts are observable. The throttle curves themselves are iPhone-measured and clean.
LiteRT-LM has no output-token cap, so its per-call generations run longer than the 128 tokens used for CoreML and MLX. That run also happened to start in the "fair" rather than the "nominal" thermal state (its burst figure was unaffected).
CoreML-LLM uses sliding-window attention, part of why its decode cost stays flat (bounded context) — it trades some long-range recall for that.
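The "decode rate from a rolling window" in the first caveat can be sketched as follows. This assumes each generated token is logged with a timestamp and uses a 10-second trailing window; the window length and the function name are assumptions of this example, not values taken from the repo.

```python
from collections import deque
from typing import Iterable, List, Tuple


def rolling_decode_rate(token_times: Iterable[float],
                        window_s: float = 10.0) -> List[Tuple[float, float]]:
    """Return (time, tok/s) pairs, counting tokens whose timestamp falls
    in the trailing window (t - window_s, t]."""
    window = deque()
    rates = []
    for t in token_times:
        window.append(t)
        # Drop tokens that have slid out of the trailing window.
        while window[0] <= t - window_s:
            window.popleft()
        if t >= window_s:  # skip the first, only partially filled window
            rates.append((t, len(window) / window_s))
    return rates


# Synthetic timestamps: 8 tok/s for 30 s, then 4 tok/s for 20 s.
fast = [i * 0.125 for i in range(1, 241)]
slow = [30 + i * 0.25 for i in range(1, 81)]
curve = rolling_decode_rate(fast + slow)
print("first window:", curve[0])
print("last window: ", curve[-1])
```

A longer window smooths the curve but reacts more slowly to the start of throttling.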

Repo (raw data + scripts)

Raw JSONL for all three runs and the script that draws these curves: https://github.com/john-rocky/apple-silicon-llm-bench

Does this match what you see on Android (Snapdragon Hexagon / Tensor NPU vs the GPU)? And for your workload — does the GPU's burst advantage or the NPU's endurance matter more?

Filling object detection's last mile: peaceofcake (D-FINE & RF-DETR, Apache 2.0)

Daisuke Majima — Tue, 02 Jun 2026 23:37:28 +0000

What happened to object detection in 2024

YOLOv1 stunned the world in 2016. Eight years on, detection accuracy has improved dramatically.

But let me ask one thing.

Can you actually embed the latest object-detection model into your own app?

You've read the paper. You ran the demo. But the moment it becomes "train on my data and run it on iPhone," you stall. Writing dozens of lines of YAML config, resolving dependency conflicts, writing an export script from scratch, building an inference pipeline in Swift — the road is longer than you'd think.

In 2024, DETR-based detection entered a new phase. Two models — D-FINE and RF-DETR — achieved breakthroughs in both accuracy and real-time performance. But to deliver those benefits to a real product, a "last mile" remains.

peaceofcake is a library to fill that mile.

pip install peaceofcake

It starts in a single line.

Object detection in 3 lines

from peaceofcake import DFINE

model = DFINE("dfine-l-coco")
results = model("image.jpg")

That's it. Weights are auto-downloaded and cached. It runs on GPU if you have one, CPU otherwise. Results come back as a structured object with bounding boxes, class labels, and confidence scores.

RF-DETR uses the same interface:

from peaceofcake import RFDETR

model = RFDETR("rfdetr-l-coco")
results = model("image.jpg")

Different model, same API. That's the core of peaceofcake's design philosophy.

D-FINE — from "fuzzy" to "sharp"

D-FINE's full name is "DETR with Fine-grained Distribution Refinement." As the name suggests, the heart of this model is how it predicts bounding boxes.

Conventional DETR-family models try to nail the location in one shot as four numbers (x, y, w, h). By analogy: asked "how tall is that building?", you instantly answer "37.2 meters!"

D-FINE is different. First it produces a probability distribution ("maybe 30–40 meters"), then narrows ("35–38 meters"), and finally arrives at "37.2 meters." It progressively sharpens a fuzzy estimate.

coarse distribution → refine → refine more → final prediction

With this Fine-grained Distribution Refinement, D-FINE achieves accuracy beyond RT-DETR at real-time inference speed.
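To make the "fuzzy to sharp" idea concrete, here is a one-dimensional toy version of the building-height analogy. It is not D-FINE's actual decoder (which refines distributions for the four box edges with learned logits); the bins, the residual logits, and the helper names are all made up for illustration. Each step adds residual logits to the previous ones instead of predicting from scratch, and the estimate is the expected value of the resulting distribution.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_and_var(bins: Sequence[float], probs: Sequence[float]) -> Tuple[float, float]:
    """Expected value and variance of a discrete distribution over `bins`."""
    mean = sum(b * p for b, p in zip(bins, probs))
    var = sum(p * (b - mean) ** 2 for b, p in zip(bins, probs))
    return mean, var


bins = [float(b) for b in range(30, 41)]  # candidate heights: 30..40 m
logits = [0.0] * len(bins)                # start with a uniform guess

# Hypothetical residual logits, reused at every step, favouring ~37 m.
residual = [-0.5 * (b - 37.0) ** 2 for b in bins]

history = []
for step in range(3):
    history.append(mean_and_var(bins, softmax(logits)))
    # Refine: add to the previous logits rather than replacing them.
    logits = [old + delta for old, delta in zip(logits, residual)]

for step, (mean, var) in enumerate(history):
    print(f"step {step}: estimate {mean:.2f} m, variance {var:.2f}")
```

The variance shrinks at every step while the estimate settles near 37 m, which is the coarse, refine, refine-more progression described above.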

The architecture is clear:

image → HGNetv2 (Backbone) → HybridEncoder → DFINE Decoder → detections

The HGNetv2 backbone comes in 5 sizes (Nano/Small/Medium/Large/XLarge), covering everything from lightweight mobile inference to accuracy-first server-side, with one architecture.

RF-DETR — the "foundation model" wind reaches detection

RF-DETR is a "Real-Time Foundational Object Detection" model Roboflow released in late 2024.

Note the word "Foundational." Just as "foundation models" revolutionized the LLM world, that wave is reaching object detection: pretrain general detection ability on massive data, then fine-tune on a small amount of domain-specific data. RF-DETR brought that paradigm to real-time detection.

peaceofcake makes these two state-of-the-art models usable through the same 3-line API.

Why "yet another wrapper" is needed

The world is full of model-wrapper libraries. So what's different about peaceofcake?

The answer is scope.

Many wrappers stop at "make inference easy." peaceofcake doesn't. It covers the full product-development cycle — train → export → on-device mobile inference — in one package.

Training: it eats YOLO format as-is

model = DFINE("dfine-m-coco")
model.train(data="dataset.yaml", epochs=50, batch_size=16)

Annotate in Roboflow or Label Studio, export in YOLO format, and pass it straight in. Automatic conversion to COCO format runs internally, and training-schedule scaling, EMA, and AMP are applied transparently.

COCO-format datasets work as-is too. You don't have to think about the format.
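For context on what a YOLO-to-COCO conversion involves: a YOLO label row stores a class id plus a normalized box center and size, while a COCO annotation stores a pixel-space `[x_min, y_min, width, height]` box. A minimal sketch of that per-box step (the function names are hypothetical and this is not peaceofcake's internal converter):

```python
from typing import List


def yolo_to_coco_bbox(cx: float, cy: float, w: float, h: float,
                      img_w: int, img_h: int) -> List[float]:
    """Convert a normalized YOLO box (center x/y, width, height) into a
    COCO box [x_min, y_min, width, height] in pixels."""
    box_w = w * img_w
    box_h = h * img_h
    x_min = cx * img_w - box_w / 2
    y_min = cy * img_h - box_h / 2
    return [x_min, y_min, box_w, box_h]


def parse_yolo_label_line(line: str, img_w: int, img_h: int) -> dict:
    """Parse one 'class cx cy w h' row into a COCO-style annotation dict."""
    class_id, cx, cy, w, h = line.split()
    return {
        "category_id": int(class_id),  # kept as-is; remap if your ids differ
        "bbox": yolo_to_coco_bbox(float(cx), float(cy), float(w), float(h),
                                  img_w, img_h),
    }


print(parse_yolo_label_line("3 0.5 0.5 0.25 0.5", img_w=640, img_h=480))
```

A full converter also has to build the `images` and `categories` sections of the COCO JSON, which are omitted here.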

Export: three formats from one method

model.export("onnx")                          # server-side inference
model.export("coreml", precision="FLOAT16")   # iOS / macOS
model.export("tensorrt")                       # NVIDIA GPU optimization

On CoreML export, denoising mechanisms and auxiliary heads needed only at training time are automatically removed. The output is confidence + coordinates, directly compatible with iOS's Vision framework. NMS is done outside the model, so you can change the confidence threshold from the UI in real time.

There's a CLI too

Without writing Python, you can access all features from the terminal.

poc predict source=photo.jpg conf=0.3
poc train model=dfine-m-coco data=my_dataset.yaml epochs=100
poc export model=dfine-l-coco format=coreml precision=FLOAT16

Design highlights — where to hide the complexity

Three points in peaceofcake's internal design feel especially clever.

1. Extensibility via the Strategy pattern

class DFINE(BaseModel):
    @property
    def task_map(self):
        return {
            "detect": {
                "predictor": DFINEPredictor,
                "trainer": DFINETrainer,
                "exporter": DFINEExporter,
                "validator": DFINEValidator,
            }
        }

BaseModel receives predict(), train(), export(), val() calls and dispatches to the right class via task_map. D-FINE and RF-DETR have completely different Predictor/Trainer/Exporter implementations, but the API the user sees is identical. If segmentation or pose tasks are added later, you just add an entry to task_map.
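The dispatch side of this pattern can be sketched in a few lines. The classes below are toy stand-ins written for this example (only the `task_map` shape comes from the snippet above); the real BaseModel is certainly more involved.

```python
class BaseModel:
    """Toy base class: looks up the component for the current task and
    role in task_map, instantiates it, and delegates the call."""

    task = "detect"

    @property
    def task_map(self) -> dict:
        raise NotImplementedError("subclasses provide their own task_map")

    def _component(self, role: str):
        return self.task_map[self.task][role]()

    def predict(self, source: str) -> str:
        return self._component("predictor").run(source)

    def export(self, fmt: str) -> str:
        return self._component("exporter").run(fmt)


class ToyPredictor:
    def run(self, source: str) -> str:
        return f"toy detections for {source}"


class ToyExporter:
    def run(self, fmt: str) -> str:
        return f"toy model exported as {fmt}"


class ToyModel(BaseModel):
    @property
    def task_map(self) -> dict:
        return {"detect": {"predictor": ToyPredictor, "exporter": ToyExporter}}


model = ToyModel()
print(model.predict("image.jpg"))
print(model.export("onnx"))
```

A second model family would only need its own component classes and `task_map`; the calling code stays unchanged.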

2. Lazy imports for a light startup

# peaceofcake/__init__.py
def __getattr__(name):
    if name == "RFDETR":
        from peaceofcake.models.rfdetr import RFDETR
        globals()["RFDETR"] = RFDETR
        return RFDETR
    raise AttributeError(...)

RF-DETR depends on the transformers library, which takes seconds just to import. peaceofcake uses __getattr__ to defer the import until RFDETR is actually referenced. Users of only D-FINE are fine even without transformers installed. The src.data, src.optim, src.nn.criterion needed only for training are also imported inside train() — if you only do inference, those modules are never loaded.
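The mechanism here is the module-level `__getattr__` hook from PEP 562. The following self-contained sketch reproduces it without needing a package on disk: it builds a module object at runtime and lazily resolves `Fraction` from the standard library as a stand-in for a heavy dependency. The helper name `make_lazy_module` and the module name `demo_pkg` are invented for this example.

```python
import importlib
import types


def make_lazy_module(name: str, lazy_attrs: dict) -> types.ModuleType:
    """Create a module whose listed attributes are imported on first access.

    lazy_attrs maps attribute name -> (module to import, attribute in it).
    """
    module = types.ModuleType(name)

    def __getattr__(attr: str):
        # Only called when normal attribute lookup on the module fails.
        if attr in lazy_attrs:
            target_module, target_name = lazy_attrs[attr]
            value = getattr(importlib.import_module(target_module), target_name)
            setattr(module, attr, value)  # cache so the hook is skipped next time
            return value
        raise AttributeError(f"module {name!r} has no attribute {attr!r}")

    module.__getattr__ = __getattr__
    return module


demo_pkg = make_lazy_module("demo_pkg", {"Fraction": ("fractions", "Fraction")})
print("resolved before access:", "Fraction" in vars(demo_pkg))
print(demo_pkg.Fraction(1, 3))
print("resolved after access: ", "Fraction" in vars(demo_pkg))
```

In a real package the same function body would simply live in `__init__.py`, as in the snippet above.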

3. "Intent inference" for model names

DFINE("dfine-l-coco")           # registry name → auto-download
DFINE("path/to/custom.pth")     # local file
DFINE("dfine_l_coco.pth")       # filename only → matched against registry
DFINE("dfine-n")                 # no weights → random init

A single string goes through a multi-stage fallback: registry match → local-file check → filename match → size estimation. For a local checkpoint, it tries to infer model size from the filename, and if that fails, it back-calculates the size from the parameter count:

n = sum(v.numel() for v in model_state.values())
if n < 6_000_000: return "n"     # Nano
elif n < 15_000_000: return "s"  # Small
elif n < 25_000_000: return "m"  # Medium
...

So that the user can "just pass a path without thinking," all this inference logic runs behind the scenes.
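The fallback order can be sketched as a small resolver. Everything here is hypothetical (the registry contents, the return labels, and the rule that underscores in a filename map to hyphens in a registry name), and the final size-estimation stage is reduced to a placeholder because its details are not given in full above.

```python
from pathlib import Path
from typing import Callable, Dict, Tuple

# Hypothetical registry: model name -> weights location placeholder.
REGISTRY: Dict[str, str] = {
    "dfine-l-coco": "<download location for dfine-l-coco>",
    "dfine-m-coco": "<download location for dfine-m-coco>",
}


def resolve_model_spec(
    spec: str,
    registry: Dict[str, str] = REGISTRY,
    file_exists: Callable[[str], bool] = lambda p: Path(p).is_file(),
) -> Tuple[str, str]:
    """Return (kind, value) describing how `spec` was interpreted."""
    # 1. Exact registry name.
    if spec in registry:
        return ("registry", spec)
    # 2. Path to an existing local checkpoint.
    if file_exists(spec):
        return ("local_file", spec)
    # 3. Bare filename that corresponds to a registry entry.
    stem = Path(spec).stem.replace("_", "-")
    if stem in registry:
        return ("registry", stem)
    # 4. Placeholder for the last stage (architecture only, no weights).
    return ("random_init", spec)


for example in ["dfine-l-coco", "dfine_l_coco.pth", "dfine-n"]:
    print(example, "->", resolve_model_spec(example, file_exists=lambda p: False))
```

Passing `file_exists` as a parameter keeps the resolver testable without touching the filesystem.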

"Field wisdom" carved into the commit log

There's trial-and-error in the commit log that you can't see by reading the code alone — real lessons of bringing research code into production.

CUDA hangs

Early in training, the model can output bounding boxes with negative width or height. The original D-FINE implementation detected this with an assert. On CPU that just throws. But an assert failure inside a CUDA kernel hangs the entire GPU device.

peaceofcake replaced it with clamp. But the first fix used an in-place operation and broke the autograd graph. The second commit finally fixed it properly. Fix one bug and another shows up. That's the reality of putting research code into battle.

YOLO format has no "spec"

YOLO format has a de facto standard but no strict spec. How to specify paths (train: images/train vs path: ./ + train: images/train), blank lines in label files, how to write the class count — each tool differs subtly. To support all of Roboflow/Ultralytics/Label Studio output, peaceofcake spends over 50 lines on path-resolution logic alone.
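A condensed sketch of the kind of tolerance this requires, covering the three variations named above. It takes the already-parsed YAML as a plain dict so that it stays dependency-free; the function names are invented and this is not the library's actual resolution code.

```python
from pathlib import Path
from typing import List


def resolve_split_dir(config: dict, split: str, yaml_path: str) -> Path:
    """Resolve `train:` / `val:` whether or not a `path:` root is given.

    Relative roots and split paths are taken relative to the YAML file.
    """
    yaml_dir = Path(yaml_path).parent
    root = Path(config.get("path", "."))
    if not root.is_absolute():
        root = yaml_dir / root
    split_dir = Path(config[split])
    return split_dir if split_dir.is_absolute() else root / split_dir


def read_label_rows(text: str) -> List[List[str]]:
    """Split a label file into rows, ignoring blank or whitespace-only lines."""
    return [line.split() for line in text.splitlines() if line.strip()]


def class_count(config: dict) -> int:
    """Accept an explicit `nc:` or fall back to the length of `names`."""
    if "nc" in config:
        return int(config["nc"])
    return len(config["names"])


plain = {"train": "images/train", "names": ["cat", "dog"]}
rooted = {"path": "./", "train": "images/train", "nc": 2}
print(resolve_split_dir(plain, "train", "/data/ds/dataset.yaml"))
print(resolve_split_dir(rooted, "train", "/data/ds/dataset.yaml"))
print(class_count(plain), class_count(rooted))
```

Both layouts resolve to the same directory, which is the point: the caller should not need to know which tool produced the file.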

Embed class names in the checkpoint

When distributing a model trained on a custom dataset, the class-name info tends to get lost. peaceofcake embeds a class_names key directly in the .pth checkpoint. No separate metadata file to manage.

How many miles from Python to device?

Let's lay out the distance to getting an object-detection model "usable" in a table.

Step	Traditional workflow	peaceofcake
Install	clone repo, set up env	`pip install peaceofcake`
Inference	write config, run script	3 lines of Python
Training	convert data, tune config	`model.train(data="data.yaml")`
Export	write a dedicated script	`model.export("coreml")`
iOS device	implement inference pipeline from scratch	build `DFINEDemo/`

At every one of the 5 steps, it's a one-liner or pre-built code. That's what filling the "last mile" means.

All Apache 2.0 — the shortest path to commercial use

When embedding object detection into a product, license matters as much as accuracy and API ergonomics.

This is where many developers trip up. The latest YOLO series (Ultralytics YOLO) is licensed under AGPL-3.0, so shipping it in a closed-source commercial product requires buying a separate paid license. A great model is right in front of you, and you give up on adopting it because of the license wall. Many developers have had that experience.

The D-FINE and RF-DETR that peaceofcake adopts are both Apache 2.0. peaceofcake itself is Apache 2.0 too. So the library, the models, and the trained weights are all free for commercial use. Modify, redistribute, embed — anything goes. A patent clause is included too, reducing patent-litigation risk from contributors.

Library	Model license	Commercial use
Ultralytics (YOLO)	AGPL-3.0	paid license required for closed-source use
peaceofcake (D-FINE)	Apache 2.0	free
peaceofcake (RF-DETR)	Apache 2.0	free

A startup embedding it in their product. A contractor delivering to a client. Integrating it into embedded-device firmware. In every scenario, license cost is zero.

Beats YOLO on accuracy, and the license is completely free. That's another reason to choose DETR-based models.

The future of object detection, and the role of tools

Computer vision's history is shifting from a "race for accuracy" to a "race for accessibility."

Since AlexNet in 2012, improving ImageNet accuracy was research's main battlefield. In detection too, raising COCO-benchmark mAP was seen as the primary contribution of a paper. But in 2024, as D-FINE and RF-DETR showed, DETR-based models have reached the stage of combining sufficient accuracy with real-time performance.

The next axis of competition is "anyone can use it."

Ultralytics (YOLOv8/YOLO11) was the pioneer, building an ecosystem where pip install ultralytics gives access to all of YOLO's features. But the AGPL license is a hurdle for commercial use. peaceofcake delivers the "easy to use" experience Ultralytics established — with Apache 2.0 DETR-based models.

In an era where model accuracy commoditizes, the source of differentiation moves to "how fast and how freely you can embed it into a product." The time and cost from a researcher posting to arXiv to an app developer delivering that model into users' hands — driving both toward zero. That's the role tools like peaceofcake play.

Object detection isn't hard anymore. And it isn't expensive anymore.

pip install peaceofcake

A piece of cake. Peace 🍰

Repository: peaceofcake
PyPI: peaceofcake
License: Apache 2.0

Originally published in Japanese on Qiita. GitHub / X

How to use the TensorFlow Object Detection API (inference, with Colab)

Daisuke Majima — Tue, 02 Jun 2026 23:37:26 +0000

This article shows how to use the TensorFlow Object Detection API (the inference part). You can do it in Colab.

Colab sample: run the cells in order to try the TensorFlow Object Detection API. To detect objects in your own image, change image_path in the last cell.

The official TensorFlow Model Zoo has many kinds of models.

(For training a model: Train an object detection model with the TensorFlow Object Detection API)
(For quick training with just a few images: Quick-train an object detection model with the TensorFlow Object Detection API)

Steps

0. Install TensorFlow 2

!pip install -U --pre tensorflow=="2.2.0"

1. Clone the official TensorFlow Models from GitHub

import os
import pathlib

# If "models" is in the current directory path, move there. Otherwise clone it.
if "models" in pathlib.Path.cwd().parts:
  while "models" in pathlib.Path.cwd().parts:
    os.chdir('..')
elif not pathlib.Path('models').exists():
  !git clone --depth 1 https://github.com/tensorflow/models

2. Install the Object Detection API and required modules

%%bash
# Cell magic: run the rest of this cell as a bash script
cd models/research/
protoc object_detection/protos/*.proto --python_out=.
cp object_detection/packages/tf2/setup.py .
python -m pip install .

3. Import modules

import matplotlib
import matplotlib.pyplot as plt

from io import BytesIO
import numpy as np
from PIL import Image, ImageDraw, ImageFont

import tensorflow as tf

from object_detection.utils import label_map_util
from object_detection.utils import config_util
from object_detection.utils import visualization_utils as viz_utils
from object_detection.builders import model_builder

%matplotlib inline

4. Image-loading function

def load_image_into_numpy_array(path):
  """Load an image into a numpy array.

  Puts image into numpy array to feed into tensorflow graph.
  Note that by convention we put it into a numpy array with shape
  (height, width, channels), where channels=3 for RGB.

  Args:
    path: the file path to the image

  Returns:
    uint8 numpy array with shape (img_height, img_width, 3)
  """
  img_data = tf.io.gfile.GFile(path, 'rb').read()
  image = Image.open(BytesIO(img_data))
  (im_width, im_height) = image.size
  return np.array(image.getdata()).reshape(
      (im_height, im_width, 3)).astype(np.uint8)

def get_keypoint_tuples(eval_config):
  """Return a tuple list of keypoint edges from the eval config.

  Args:
    eval_config: an eval config containing the keypoint edges

  Returns:
    a list of edge tuples, each in the format (start, end)
  """
  tuple_list = []
  kp_list = eval_config.keypoint_edge
  for edge in kp_list:
    tuple_list.append((edge.start, edge.end))
  return tuple_list

5. Download a model

!wget http://download.tensorflow.org/models/object_detection/tf2/20200713/centernet_hg104_512x512_coco17_tpu-8.tar.gz
!tar -xf centernet_hg104_512x512_coco17_tpu-8.tar.gz

Download any model you like from the official Model Zoo. Hover over a model name there to see its download URL.

It's fun just looking at the performance comparisons. Once download and extraction finish, you get a folder containing checkpoint, saved_model, and pipeline.config.

6. Read the pipeline config and build the model

# Path to the config file. The repo has a folder of config files, but the model
# names are slightly abbreviated, so the downloaded one is more reliable.
pipeline_config = "./centernet_hg104_512x512_coco17_tpu-8/pipeline.config"
# Path to the checkpoint
model_dir = "./centernet_hg104_512x512_coco17_tpu-8/checkpoint"

# Load the model config
configs = config_util.get_configs_from_pipeline_file(pipeline_config)
model_config = configs['model']

# Build the model from the loaded config
detection_model = model_builder.build(
      model_config=model_config, is_training=False)

# Restore weights from the checkpoint
ckpt = tf.compat.v2.train.Checkpoint(model=detection_model)
ckpt.restore(os.path.join(model_dir, 'ckpt-0')).expect_partial()

7. Prepare the inference function

def get_model_detection_function(model):
  """Get a tf.function for detection."""

  @tf.function
  def detect_fn(image):
    """Detect objects in image."""

    image, shapes = model.preprocess(image)
    prediction_dict = model.predict(image, shapes)
    detections = model.postprocess(prediction_dict, shapes)

    return detections, prediction_dict, tf.reshape(shapes, [-1])

  return detect_fn

detect_fn = get_model_detection_function(detection_model)

8. Prepare labels

Inference needs the object labels the model was trained on. The labels are in the official repo at models/research/object_detection/data/. This model was trained on COCO, so we use mscoco_label_map.pbtxt.

label_map_path = './models/research/object_detection/data/mscoco_label_map.pbtxt'
label_map = label_map_util.load_labelmap(label_map_path)
categories = label_map_util.convert_label_map_to_categories(
    label_map,
    max_num_classes=label_map_util.get_max_label_map_index(label_map),
    use_display_name=True)
category_index = label_map_util.create_category_index(categories)
label_map_dict = label_map_util.get_label_map_dict(label_map, use_display_name=True)

9. Run detection on your image

Upload any image to Colab and set its path as image_path. By the way, images with an alpha channel seem to need converting to 3 channels first.

image_dir = 'models/research/object_detection/test_images/'
image_path = os.path.join(image_dir, 'image2.jpg')
image_np = load_image_into_numpy_array(image_path)

# Things to try:
# Flip horizontally
# image_np = np.fliplr(image_np).copy()

# Convert image to grayscale
# image_np = np.tile(
#     np.mean(image_np, 2, keepdims=True), (1, 1, 3)).astype(np.uint8)

input_tensor = tf.convert_to_tensor(
    np.expand_dims(image_np, 0), dtype=tf.float32)
detections, predictions_dict, shapes = detect_fn(input_tensor)

label_id_offset = 1
image_np_with_detections = image_np.copy()

# Use keypoints if available in detections
keypoints, keypoint_scores = None, None
if 'detection_keypoints' in detections:
  keypoints = detections['detection_keypoints'][0].numpy()
  keypoint_scores = detections['detection_keypoint_scores'][0].numpy()

viz_utils.visualize_boxes_and_labels_on_image_array(
      image_np_with_detections,
      detections['detection_boxes'][0].numpy(),
      (detections['detection_classes'][0].numpy() + label_id_offset).astype(int),
      detections['detection_scores'][0].numpy(),
      category_index,
      use_normalized_coordinates=True,
      max_boxes_to_draw=200,
      min_score_thresh=.30,
      agnostic_mode=False,
      keypoints=keypoints,
      keypoint_scores=keypoint_scores,
      keypoint_edges=get_keypoint_tuples(configs['eval_config']))

plt.figure(figsize=(12,16))
plt.imshow(image_np_with_detections)
plt.show()

Boxes, labels, and confidence scores are displayed.

Originally published in Japanese on Qiita. I build apps with Core ML and write about machine learning. GitHub / X

Brightening dark images with machine learning (GLADNet)

Daisuke Majima — Tue, 02 Jun 2026 23:32:24 +0000

How to use a machine-learning model that brightens low-light images

Using a model called GLADNet, we brighten dark images. It's easy to run in Python.

I want to extract information from dark images

If you can brighten images that came out too dark, or footage from night-time surveillance or dashcams, you can sometimes extract useful information.

But simply adjusting brightness with a filter doesn't add any information.

GLADNet brightens them cleanly

Just feed an image into GLADNet and a bright image comes back. It really looks like it was actually shot brighter.

Usage

Clone the GLADNet repository and run with the following command:

python main.py --use_gpu=1 --gpu_idx=0 --gpu_mem=0.5 --phase=test --test_dir=path/to/input_images --save_dir=path/to/output

For various uses

With techniques like this you can capture the information you need without missing it — and even "ghost photos" stop being scary.

Originally published in Japanese on Qiita. I build apps with Core ML and ARKit and write about ML/AR. GitHub / X

Resuming training in PyTorch: save and load the optimizer too

Daisuke Majima — Tue, 02 Jun 2026 23:32:23 +0000

How to save/load a model in PyTorch and resume training from a checkpoint

Sometimes you want to pause training partway (for machine or human reasons) and resume later. In environments with a continuous-use time limit like Colab, or when you want to train beyond your initial epoch count, you want to save the model weights to a file and load them later to resume.

Saving the model alone won't resume from the same accuracy

Saving and loading a model goes like this — but this only works for plain inference in eval mode. If you try to resume training, you'll notice the loss and accuracy don't continue from before saving; they revert to initial values.

# save
torch.save(model.state_dict(), PATH)

# load
model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH))
model.eval()

You should save the optimizer too

To resume training, you need to save/load the optimizer's state in addition to the model weights.

# save
save_path = "my_model_training_state.pt"
torch.save({'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': loss,},
           save_path)

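# load
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler

model = TheModelClass(*args, **kwargs)
# Build the optimizer from the same model whose weights are restored below
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)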

PATH = "my_model_training_state.pt"
checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Move the optimizer state to the current device. Without this you can get a
# device mismatch between before and after saving.
for state in optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.to(device)
epoch = checkpoint['epoch']
loss = checkpoint['loss']

# model.eval()
# # - or -
model.train()

model = model.to(device)
criterion = nn.CrossEntropyLoss()
exp_lr_scheduler = lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

Now you can resume from the loss and accuracy you had before saving.

Originally published in Japanese on Qiita. I build apps with Core ML and ARKit and write about ML/AR. GitHub / X

Run a ChatGPT-class chatbot in 5 minutes with FlexGen (in Colab)

Daisuke Majima — Tue, 02 Jun 2026 23:27:21 +0000

In about 5 minutes, you can launch and chat with a chatbot app.

https://github.com/FMInference/FlexGen

How to use it

Install:

git clone https://github.com/FMInference/FlexGen.git
cd FlexGen
pip3 install -e .

Run:

cd apps
python3 chatbot.py --model facebook/opt-6.7b

That alone launches the chatbot.

Here's a conversation I had with FlexGen's assistant:

Human: Hi, I'm Daisuke.
AI: Hi, I'm Jiro.
Human: Are you an AI?
AI: Yes, I'm an AI.
Human: How are you?
AI: Feeling good.
Human: What do you think about ChatGPT?
AI: I think it's very good.
Human: Which is better, you or ChatGPT?
AI: The AI assistant is superior.
Human: Wow.
AI: I've been watching humans for a long time.
Human: How long?
AI: Since I was a baby.
Human: What kind of baby were you?
AI: A human baby.
Human: Are you a human?
AI: Yes.
Human: Doubtful.
AI: Why?
Human: I think you're an AI, not a baby.
AI: Correct. I'm an AI baby.
Human: By the way, I'm an AI programmer and I'm wondering how to make money. Any advice?
AI: I don't know.
Human: What should I do to make money?
AI: You should learn how to program AI.
Human: What kind of AI programming should I learn?
AI: I don't know.
Human: What do you think about the Red Hot Chili Peppers? I like them.
AI: I don't know.

Originally published in Japanese on Qiita. I build machine-learning and AR apps (web/iOS) and write about them. GitHub / X

I built a CLI that reads Claude Code's responses aloud (Zundamon TTS)

Daisuke Majima — Tue, 02 Jun 2026 23:27:20 +0000

Have Zundamon read it aloud

Claude Code is great. But when a long response comes back, you end up staring at the screen the whole time.

"What if it just read this to me, so I don't have to watch the screen?"

So I built it: voicevoice, a CLI that reads Claude Code's responses aloud in real time.

The default voice is Zundamon — a popular Japanese synthetic-voice character. That voice reads Claude's responses to you.

Demo video (GitHub)

Setup in 3 lines

brew install john-rocky/tap/voicevoice
voicevoice setup    # also auto-installs VOICEVOX here
voicevoice on

That's it. Just launch claude as usual, and Zundamon reads each response as it comes back.

What it does

Auto-reads Claude Code's responses
Switch among 50+ character voices (Zundamon, Shikoku Metan, Aoyama Ryusei, etc.)
Fully local. No cloud API, completely free
When OFF, zero impact on Claude Code (checks one file and exits immediately, ~0.1ms)
One uninstall fully restores your environment

Usage

Basics

# turn reading ON and launch Claude
voicevoice on
claude

To toggle mid-conversation:

! voicevoice off    # mute (can run during a Claude conversation)
! voicevoice on     # resume

Works standalone too

You can use it as a plain text-to-speech tool without Claude Code.

voicevoice "Hello, nice weather today"

# pipes work too
echo "Build succeeded!" | voicevoice

# for a male voice
voicevoice -s 13 "Good work"

Change the voice

# list available characters
voicevoice -l

Shikoku Metan: Normal(id=2), Sweet(id=0), Tsundere(id=6), Sexy(id=4), Whisper(id=36), Murmur(id=37)
Zundamon: Normal(id=3), Sweet(id=1), Tsundere(id=7), Sexy(id=5), Whisper(id=22), Murmur(id=38), Exhausted(id=75), Tearful(id=76)
Kasukabe Tsumugi: Normal(id=8)
Aoyama Ryusei: Normal(id=13), Passionate(id=81), Grumpy(id=82)
...

Pick from 50+ characters × multiple styles. Save your favorite and it's used for the hook-driven auto-reading too.

voicevoice config speaker 13    # switch to Aoyama Ryusei (saved)
voicevoice config speed 1.3     # a bit faster

How it works

Claude Code (response complete)
    ↓ Stop hook
voicevoice-hook.sh
    ↓ check if enabled
voicevoice CLI
    ↓ HTTP (localhost)
VOICEVOX engine
    ↓
audio playback

It uses Claude Code's Stop hook. Each time a response completes, the hook script fires, grabs the last message, and passes it to VOICEVOX.
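The last hop in that diagram, the HTTP call to the local VOICEVOX engine, is a two-step exchange: `audio_query` turns text into synthesis parameters, and `synthesis` turns those parameters into WAV audio. A minimal sketch using only the standard library follows. It assumes the engine is listening on its default local port 50021, and it illustrates the engine's API rather than how voicevoice itself is implemented.

```python
import json
import urllib.parse
import urllib.request

ENGINE = "http://127.0.0.1:50021"  # assumed: VOICEVOX engine's default address


def audio_query_request(text: str, speaker: int) -> urllib.request.Request:
    """Step 1: ask the engine for synthesis parameters for `text`."""
    params = urllib.parse.urlencode({"text": text, "speaker": speaker})
    return urllib.request.Request(f"{ENGINE}/audio_query?{params}", method="POST")


def synthesis_request(query: dict, speaker: int,
                      speed: float = 1.0) -> urllib.request.Request:
    """Step 2: send the (optionally tweaked) query back to get WAV bytes."""
    body = dict(query, speedScale=speed)  # speedScale controls speaking rate
    return urllib.request.Request(
        f"{ENGINE}/synthesis?speaker={speaker}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, speaker: int = 3, speed: float = 1.0) -> bytes:
    """Return WAV bytes for `text`. Needs a running VOICEVOX engine."""
    with urllib.request.urlopen(audio_query_request(text, speaker)) as resp:
        query = json.load(resp)
    with urllib.request.urlopen(synthesis_request(query, speaker, speed)) as resp:
        return resp.read()


print(audio_query_request("Build succeeded!", 3).full_url)
```

Speaker id 3 is Zundamon's Normal style in the character list shown earlier; `synthesize()` is only defined here, not called, so the snippet runs without the engine.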

Key points:

Reading runs in the background. You can type your next input even while it's playing.
Long responses are cut at 500 characters with "…(truncated)."
Audio from multiple sessions queues via a file lock. Voices never overlap.
Everything runs on your Mac. No internet needed.
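Two of the points above, the 500-character cut and the file lock that keeps sessions from talking over each other, can be sketched like this. The lock-file location is up to the caller, and the use of `fcntl.flock` is an assumption of this example; the post only says that a file lock is used.

```python
import contextlib
import fcntl  # POSIX only, which fits a macOS-only tool

MAX_CHARS = 500
SUFFIX = "…(truncated)"


def truncate_for_speech(text: str, limit: int = MAX_CHARS) -> str:
    """Cut long responses at `limit` characters and mark the cut."""
    if len(text) <= limit:
        return text
    return text[:limit] + SUFFIX


@contextlib.contextmanager
def playback_lock(lock_path: str):
    """Hold an exclusive lock for the duration of one playback.

    A second process entering this block waits until the first one
    leaves it, so audio from parallel sessions plays back to back.
    """
    with open(lock_path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks while another holder exists
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


print(len(truncate_for_speech("a" * 600)))
```

Playback would then run inside `with playback_lock(path): ...`, with every session pointing at the same lock file.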

What `voicevoice setup` does

The setup command automatically:

Installs VOICEVOX if missing (downloads the DMG → copies to /Applications)
Generates the hook script (~/.claude/hooks/voicevoice-hook.sh)
Registers the hook in Claude Code's settings.json

It won't break existing settings. Running it twice won't double-register.
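"Won't break existing settings" and "won't double-register" boil down to an idempotent merge into the settings JSON. A sketch of that merge follows, operating on an already-loaded dict. The nested `hooks` → `Stop` → `hooks` → `command` layout reflects my understanding of Claude Code's hook configuration and should be checked against its documentation; the function name is invented.

```python
import copy


def register_stop_hook(settings: dict, command: str) -> dict:
    """Return a copy of `settings` with `command` registered as a Stop hook.

    Existing keys are preserved, and calling this again with the same
    command does not add a second entry.
    """
    updated = copy.deepcopy(settings)
    stop_groups = updated.setdefault("hooks", {}).setdefault("Stop", [])
    already_registered = any(
        hook.get("command") == command
        for group in stop_groups
        for hook in group.get("hooks", [])
    )
    if not already_registered:
        stop_groups.append({"hooks": [{"type": "command", "command": command}]})
    return updated


hook_cmd = "~/.claude/hooks/voicevoice-hook.sh"
once = register_stop_hook({"model": "example"}, hook_cmd)
twice = register_stop_hook(once, hook_cmd)
print(once == twice, len(twice["hooks"]["Stop"]))
```

Working on a deep copy means a failure partway through never leaves the user's original settings half-edited.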

Uninstall

voicevoice uninstall

This alone:

removes the hook registration from settings.json (other settings untouched)
deletes the hook script
deletes all config and flag files

You return to exactly the same environment as before setup. Removal is a single command, so feel free to try it.

Requirements

macOS 14+ (Apple Silicon)
Claude Code
VOICEVOX — auto-installed by voicevoice setup
jq — brew install jq

Command list

Command	Description
`voicevoice setup`	first-time setup (incl. VOICEVOX auto-install)
`voicevoice on`	reading ON
`voicevoice off`	reading OFF
`voicevoice status`	check current state
`voicevoice config`	view/change settings
`voicevoice -l`	list characters
`voicevoice uninstall`	full removal

Closing

You don't have to watch the screen the whole time you're coding — Zundamon tells you the gist.

Stretching, making coffee, listening to Claude's responses. That kind of dev experience is honestly pretty nice.

Repo:

https://github.com/john-rocky/voicevoice

brew install john-rocky/tap/voicevoice

If you use VOICEVOX character voices publicly, credit is required (e.g. VOICEVOX:Zundamon). See the VOICEVOX terms.

Originally published in Japanese on Qiita. GitHub / X

A Swift library to run Segment Anything natively on iOS (SamKit)

Daisuke Majima — Tue, 02 Jun 2026 06:16:54 +0000

For a while I'd wanted to build a Swift Package that runs Meta's Segment Anything Model (SAM) on-device on iOS.

Cut out the object you tap
Cut out the object you box in
Cut out the object you specify by text

Each of these produces a segmentation instantly, with all inference running on-device. It even comes with ready-to-use UI components.

So I built it.

GitHub: https://github.com/john-rocky/SamKit

What it can do

Feature	Description
Point & Box	Tap for a point, drag for a box, then segment
Text Prompt	Type text like `"dog"` or `"red cup"` to detect and segment
Subject Lift	Long-press to lift an object out, Apple Photos–style; copy/save/share
Two backbones	MobileSAM (fast, 23MB) and SAM2 Tiny (accurate, 76MB)
Drop-in UI	Just embed the SwiftUI views as-is

Architecture

SAMKit/
├── SAMKit            # core inference engine (point/box)
├── SAMKitGrounding   # text detection (YOLO-World + CLIP)
└── SAMKitUI          # SwiftUI views (SamView / TextPromptView)

Split into three Swift Package products. Import only what you need.

Setup

1. Add the Swift Package

dependencies: [
    .package(url: "https://github.com/john-rocky/SamKit.git", from: "1.0.0")
]

2. Download the models

Get the .mlpackage files from Releases and add them to your Xcode project.

Model	Size	Use
MobileSAM	23 MB	point/box segmentation (required)
SAM2 Tiny	76 MB	higher-accuracy segmentation (optional)
Grounding (YOLO-World + CLIP)	148 MB	text detection (optional)

Usage

Point/box segmentation

The most basic use. Set an image and specify a point.

import SAMKit

// create a session (the model auto-loads from a bundled resource)
let session = try SamSession(
    model: .bundled(.mobileSam),
    config: .bestAvailable      // priority: Neural Engine > GPU > CPU
)

// encode the image (once; later predicts use the cache)
try session.setImage(cgImage)

// segment by point
let result = try session.predict(
    points: [SamPoint(x: 100, y: 200, label: .positive)]
)

// results
let mask = result.masks.first!
mask.cgImage   // segmentation mask image
mask.score     // IoU confidence score
mask.alpha     // alpha-channel data

You can also specify negative points (regions to exclude) and a bounding box:

let result = try session.predict(
    points: [
        SamPoint(x: 100, y: 200, label: .positive),   // point to include
        SamPoint(x: 300, y: 400, label: .negative)     // point to exclude
    ],
    box: SamBox(x0: 50, y0: 50, x1: 400, y1: 400)    // bounding box
)

Segment by text prompt

Combine SAM with text detection by YOLO-World + CLIP.

import SAMKit
import SAMKitGrounding

let session = try TextSegmentationSession(
    groundingModel: .bundled(),
    samModel: .bundled(.mobileSam)
)

try session.setImage(cgImage)

// search by text and segment
let result = try session.segment(query: "dog, cat")

result.detections   // detections (bounding box + label)
result.masks        // segmentation mask for each detection
result.scores       // confidence scores

Cutting out the object

You can generate a transparent PNG from the segmentation result.

// cut out from a single mask
let extracted = result.masks[0].extractObject(from: cgImage)
// → a CGImage with a transparent background

// composite cut-out from multiple masks
let combined = SamMask.extractObject(from: cgImage, masks: result.masks)

Embedding the SwiftUI views

You don't need to build the UI yourself. SAMKitUI includes ready-to-use views.

import SAMKitUI

// interactive segmentation by point/box
SamView(image: uiImage, model: try .bundled(.mobileSam))

// segmentation by text search
TextPromptView(image: uiImage, session: textSession)

These views include:

subject highlight after segmentation (dim background + subject at full brightness)
an animated glowing outline
long-press to lift the object → drag → Copy/Save/Share menu

How Subject Lift is implemented

A technical walkthrough of recreating Apple Photos' "lift the subject" feature.

1. Binarizing the mask

SAM's mask output is continuous sigmoid values, so convert it to a clean binary mask for display.

func binarizeMask(_ maskImage: CGImage) -> CGImage? {
    let width = maskImage.width, height = maskImage.height
    let rect = CGRect(x: 0, y: 0, width: width, height: height)

    // get pixel data via an RGBA CGContext (remaining arguments elided)
    guard let ctx = CGContext(data: nil, width: width, height: height, ...),
          let data = ctx.data else { return nil }
    ctx.draw(maskImage, in: rect)

    let pixels = data.bindMemory(to: UInt8.self, capacity: width * height * 4)
    let threshold: UInt8 = 128  // 50%, SAM's standard threshold

    for i in 0..<(width * height) {
        let o = i * 4
        if pixels[o + 3] >= threshold {
            // fully opaque white
            pixels[o] = 255; pixels[o+1] = 255; pixels[o+2] = 255; pixels[o+3] = 255
        } else {
            // fully transparent
            pixels[o] = 0; pixels[o+1] = 0; pixels[o+2] = 0; pixels[o+3] = 0
        }
    }
    return ctx.makeImage()
}

At threshold 0 it picks up mask noise and cuts out most of the image. 128 (50%) is stable.
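
To see why the threshold matters, here is the same binarization on a toy flat RGBA list in Python. It's a sketch for illustration only; `binarize_alpha`, `coverage`, and the sample pixels are made up.

```python
def binarize_alpha(rgba, threshold=128):
    """rgba is a flat [r, g, b, a, r, g, b, a, ...] list of 0-255 values."""
    out = []
    for i in range(0, len(rgba), 4):
        opaque = rgba[i + 3] >= threshold
        out.extend([255, 255, 255, 255] if opaque else [0, 0, 0, 0])
    return out


def coverage(rgba):
    """Fraction of pixels that ended up fully opaque."""
    pixels = len(rgba) // 4
    return sum(1 for i in range(3, len(rgba), 4) if rgba[i] == 255) / pixels


# four pixels whose alpha values are 0, 3, 127 and 200
noisy = [10, 10, 10, 0,  10, 10, 10, 3,  10, 10, 10, 127,  10, 10, 10, 200]

at_half = coverage(binarize_alpha(noisy))      # only the alpha-200 pixel survives
at_zero = coverage(binarize_alpha(noisy, 0))   # every pixel passes ">= 0"
```

With a threshold of 0 every pixel counts as subject, which matches the "cuts out most of the image" behavior described above.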

2. Generating the glowing outline

Extract the mask's contour with CGContext's shadow feature. Far faster than per-pixel dilation.

func generateOutline(from maskImage: CGImage) -> CGImage? {
    // Step 1: turn the mask into a solid-white silhouette
    ctx.draw(maskImage, in: rect)
    ctx.setBlendMode(.sourceIn)
    ctx.setFillColor(UIColor.white.cgColor)
    ctx.fill(rect)  // → white silhouette

    // Step 2: draw with a shadow, then erase the interior → only the contour remains
    outCtx.setShadow(offset: .zero, blur: glowRadius, color: UIColor.white.cgColor)
    outCtx.draw(whiteSilhouette, in: rect)   // shadow = the contour's glow

    outCtx.setBlendMode(.destinationOut)
    outCtx.draw(whiteSilhouette, in: rect)   // erase the interior → contour only
}

Key points:

setShadow makes the white glow (only two draws)
.destinationOut erases the interior, leaving only the outer glow
Far faster than a dilation loop (O(thickness² × pixels))

3. Shimmer animation

Use TimelineView and AngularGradient to make light travel around the contour.

TimelineView(.animation(minimumInterval: 1.0 / 30)) { timeline in
    let phase = timeline.date.timeIntervalSinceReferenceDate
        .truncatingRemainder(dividingBy: 2.5) / 2.5  // one lap in 2.5s

    ZStack {
        // soft glow (blurred cyan)
        outlineImage.colorMultiply(Color(red: 0.5, green: 0.85, blue: 1.0))
            .blur(radius: 5).opacity(0.8)

        // sharp outline
        outlineImage.colorMultiply(.white)

        // moving highlight
        outlineImage.colorMultiply(.white)
            .mask(
                AngularGradient(
                    colors: [.white, .white.opacity(0.5), .clear, .clear, ...],
                    center: .center,
                    startAngle: .degrees(phase * 360),
                    endAngle: .degrees(phase * 360 + 360)
                )
            )
    }
}
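
The phase math in that snippet is easy to check on its own. A small Python sketch of the same calculation (`shimmer_angles` is a hypothetical name; for non-negative times Python's `%` agrees with Swift's `truncatingRemainder`):

```python
LAP_SECONDS = 2.5  # one lap in 2.5 s, as in the Swift snippet


def shimmer_angles(seconds, lap=LAP_SECONDS):
    """Return (startAngle, endAngle) in degrees for the moving highlight."""
    phase = (seconds % lap) / lap      # 0.0 up to, but not including, 1.0
    start = phase * 360.0
    return start, start + 360.0


halfway = shimmer_angles(1.25)    # half a lap in
wrapped = shimmer_angles(2.5)     # a full lap wraps back to the start
```

Keeping the end angle exactly 360° ahead of the start means the gradient always covers a full turn, so only its orientation changes from frame to frame.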

4. Unified gesture handler

Manage tap (add point), box drawing, and long-press lift all with a single DragGesture(minimumDistance: 0).

SwiftUI's onTapGesture + onLongPressGesture block each other, so I receive all touches in one gesture and classify them by time and movement.

DragGesture(minimumDistance: 0)
    .onChanged { value in
        lastTranslation = value.translation  // read later by the long-press check
        // schedule a timer on the first touch
        if gestureStartTime == nil {
            gestureStartTime = Date()
            // decide long-press after 0.3s
            DispatchQueue.main.asyncAfter(deadline: .now() + 0.3) {
                guard gestureStartTime != nil, !isLifted, hasVisibleMasks else { return }
                let moved = hypot(lastTranslation.width, lastTranslation.height)
                guard moved < 15 else { return }  // if it moved, it's not a long-press
                handleLiftObject()  // start the lift!
            }
        }

        if isLifted {
            liftDragOffset = value.translation  // follow the drag
        }
    }
    .onEnded { value in
        // measure the gesture before clearing its start time
        let elapsed = Date().timeIntervalSince(gestureStartTime ?? Date())
        let moved = hypot(value.translation.width, value.translation.height)
        gestureStartTime = nil  // also cancels a pending long-press check

        if isLifted {
            // released → show menu
            showLiftMenu = true
        } else if elapsed < 0.3 && moved < 15 {
            // quick touch → add point
            addPoint(at: value.startLocation)
        }
    }

Classification logic:

Condition	Verdict
< 0.3s, < 15pt moved	tap → add point
≥ 0.3s, < 15pt moved	long-press → start lift
≥ 10pt moved (box mode)	drag → draw box
movement while lifted	lift-drag → move object
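
The table can be read as a small pure function. A sketch in Python, assuming the rows are checked in the order lifted, box drag, then tap versus long-press; that ordering, the `classify` name, and the `"ignored"` fallback are my assumptions, not part of SAMKit.

```python
TAP_MAX_SECONDS = 0.3
TAP_MAX_MOVE = 15.0    # points
BOX_MIN_MOVE = 10.0    # points, box mode only


def classify(elapsed, moved, box_mode=False, lifted=False):
    """Map a touch's duration and travel distance to a verdict from the table."""
    if lifted:
        return "lift-drag"
    if box_mode and moved >= BOX_MIN_MOVE:
        return "box"
    if moved < TAP_MAX_MOVE:
        return "tap" if elapsed < TAP_MAX_SECONDS else "long-press"
    return "ignored"


verdicts = [
    classify(0.1, 3.0),                   # quick, barely moved
    classify(0.5, 3.0),                   # held in place
    classify(0.5, 40.0, box_mode=True),   # dragged in box mode
    classify(1.0, 80.0, lifted=True),     # dragging a lifted subject
]
```

Keeping the classification free of UI state makes the thresholds easy to unit-test before wiring them into the gesture.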

5. Subject highlight

Rather than overlaying a colored mask after segmentation, dim the background and show only the subject at its original brightness.

// darken the background
Color.black.opacity(0.25)

// show only the subject at the original image's brightness
Image(uiImage: image)
    .mask(Image(uiImage: UIImage(cgImage: binaryMask)))

This makes the transition to long-press lift natural (the dimming just deepens 0.25 → 0.4).

Performance

Image encoding: once per image; later predicts reuse the cache
Inference: accelerated on Neural Engine / GPU (FP16)
Outline generation: only two CGContext-shadow draws; no pixel loop
Networking: none. Fully on-device

Summary

With SAMKit you can add segmentation to an iOS app in a few lines.

// an interactive segmentation UI in one line
SamView(image: uiImage, model: try .bundled(.mobileSam))

Experiences like Subject Lift are built in too, so you can bring an Apple Photos–like UX into your own app immediately.

GitHub: https://github.com/john-rocky/SamKit

Feedback and issues welcome!

Originally published in Japanese on Qiita. GitHub / X

Type 'dog' to detect a dog: running YOLO-World on iPhone

Daisuke Majima — Tue, 02 Jun 2026 06:16:52 +0000

What it does

Type text like "person, red car, coffee cup" and it detects those objects in the camera view in real time. No class list needed. You can specify any words you like, as many as you like.

This is YOLO-World's "Open-Vocabulary Detection." Presented at CVPR 2024, it's a fundamentally different approach from the conventional "fixed 80 classes" YOLO.

How it works

text input ──→ CLIP Text Encoder ──→ text features [1,80,512]
                                            │
camera feed ──→ YOLO-World Detector ────────┤──→ boxes [1,4,8400]
                                            └──→ scores [1,80,8400]
                                                     │
                                                 NMS + Filter ──→ bounding boxes

It pairs CLIP's language understanding with YOLO's detection speed: text is converted into embedding vectors, and each detection is scored by how well those vectors match features extracted from the image.

Changing the query text only re-runs the CLIP encoder; camera-frame inference uses only the visual detector. No heavy recompute runs every time the text changes.

Preparing the CoreML models

Download (ready to use)

Download 3 files from the release assets of the CoreML-Models repository:

File	Size	Role
yoloworld_detector.mlpackage	25 MB	YOLO-World V2-S (image → boxes+scores)
clip_text_encoder.mlpackage	121 MB	CLIP ViT-B/32 (text → embedding)
clip_vocab.json	1.6 MB	BPE tokenizer vocabulary

Convert it yourself

pip install ultralytics open_clip_torch coremltools==8.1
python convert_models.py --size s  # s/m/l/x

The conversion script does:

Unwrap YOLO-World V2's Detect head — output boxes [1,4,8400] and scores [1,NC,8400] directly
Convert CLIP's text encoder standalone — patch MultiheadAttention to be CoreML-compatible
Export the BPE vocab as JSON — for the Swift-side tokenizer

iOS implementation

Architecture overview

TextGroundingDetector (ObservableObject)
├── visualModel: MLModel    — YOLO-World detector
├── textEncoder: MLModel    — CLIP text encoder
├── tokenizer: CLIPTokenizer — BPE tokenizer
└── cachedTxtFeats: MLMultiArray — text-feature cache

Encoding text

Run only when the user changes the query; the result is cached.

func updateQueries(_ queryString: String) throws {
    let queries = queryString.split(separator: ",")
        .map { $0.trimmingCharacters(in: .whitespaces) }

    // tokenize each query → CLIP encoder → 512-dim vector
    let txtFeats = try MLMultiArray(shape: [1, 80, 512], dataType: .float32)

    for (i, query) in queries.prefix(80).enumerated() {
        let tokens = tokenizer.tokenize(query)
        // ... textEncoder.prediction() via MLDictionaryFeatureProvider ...
        // L2-normalize the result and store into txtFeats[i]
    }
    cachedTxtFeats = txtFeats
}

Key points:

Up to 80 queries can be detected at once
L2 normalization is important — CLIP outputs live in a normalized cosine-similarity space
Fast normalization with Accelerate via vDSP_svesq + vDSP_vsmul
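
For reference, this is what the two vDSP calls amount to, written as plain Python. It's a sketch of the math only; `l2_normalize` and `cosine` are illustrative names, and the epsilon guard is my addition.

```python
import math


def l2_normalize(vector, eps=1e-12):
    sum_sq = sum(x * x for x in vector)            # what vDSP_svesq computes
    scale = 1.0 / math.sqrt(max(sum_sq, eps))      # guard against a zero vector
    return [x * scale for x in vector]             # what vDSP_vsmul applies


def cosine(a, b):
    """Dot product of the normalized vectors, i.e. cosine similarity."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


unit = l2_normalize([3.0, 4.0])   # length 5 scaled down to length 1
```

Once every text feature has unit length, a plain dot product against image features behaves as a cosine similarity regardless of the raw embedding magnitude.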

Image preprocessing

YOLO-World requires letterbox preprocessing (keep aspect ratio + padding):

func preprocessImage(_ cgImage: CGImage) throws -> MLMultiArray {
    let scale = Float(640) / Float(max(imgW, imgH))
    let scaledW = Int(Float(imgW) * scale)
    let scaledH = Int(Float(imgH) * scale)
    let padX = (640 - scaledW) / 2
    let padY = (640 - scaledH) / 2

    // draw onto a 640x640 canvas padded with gray (0.5)
    ctx.setFillColor(gray: 0.5, alpha: 1.0)
    ctx.fill(CGRect(x: 0, y: 0, width: 640, height: 640))
    ctx.draw(cgImage, in: CGRect(x: padX, y: padY, width: scaledW, height: scaledH))

    // RGBA → CHW Float32 [0,1]
    for i in 0..<(640*640) {
        dst[0 * hw + i] = Float(src[i * 4 + 0]) / 255  // R
        dst[1 * hw + i] = Float(src[i * 4 + 1]) / 255  // G
        dst[2 * hw + i] = Float(src[i * 4 + 2]) / 255  // B
    }
}

You can't use a stretch-to-fill option such as .scaleFill, because it does not preserve the aspect ratio. Letterboxing shifts coordinates by the padding, so you have to subtract the padding back out of the output coordinates.
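
The letterbox arithmetic and its inverse are worth checking with numbers. A Python sketch that mirrors the Swift math above; the function names and the 1280×720 example are illustrative.

```python
SIZE = 640  # model input edge length


def letterbox(img_w, img_h, size=SIZE):
    """Return (scale, pad_x, pad_y) for fitting the image into a square canvas."""
    scale = size / max(img_w, img_h)
    scaled_w, scaled_h = int(img_w * scale), int(img_h * scale)
    pad_x, pad_y = (size - scaled_w) // 2, (size - scaled_h) // 2
    return scale, pad_x, pad_y


def to_image_coords(x, y, img_w, img_h):
    """Map a point in 640x640 letterbox space back to original image pixels."""
    scale, pad_x, pad_y = letterbox(img_w, img_h)
    return (x - pad_x) / scale, (y - pad_y) / scale


params = letterbox(1280, 720)                       # landscape frame
center = to_image_coords(320.0, 320.0, 1280, 720)   # canvas center
```

For a 1280×720 frame the image is halved to 640×360 and padded by 140 px above and below, so the canvas center maps back to the image center.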

Inference and post-processing

let input = try MLDictionaryFeatureProvider(dictionary: [
    "image": tensor,
    "txt_feats": cachedTxtFeats,  // cached text features
])
let output = try visualModel.prediction(from: input)

let boxes = output.featureValue(for: "boxes")!.multiArrayValue!   // [1,4,8400]
let scores = output.featureValue(for: "scores")!.multiArrayValue! // [1,NC,8400]

for qi in 0..<queryCount {
    for anchor in 0..<8400 {
        // MLMultiArray's Int subscript returns NSNumber, so read it as Float
        let score = scores[qi * 8400 + anchor].floatValue
        guard score >= threshold else { continue }

        // box center and size, in 640x640 letterboxed pixels
        let cx = boxes[0 * 8400 + anchor].floatValue
        let cy = boxes[1 * 8400 + anchor].floatValue
        let bw = boxes[2 * 8400 + anchor].floatValue
        let bh = boxes[3 * 8400 + anchor].floatValue

        // remove padding and convert to normalized coordinates
        let nx = (cx - bw/2 - Float(padX)) / (Float(imgW) * scale)
        let ny = (cy - bh/2 - Float(padY)) / (Float(imgH) * scale)
    }
}

The output scores are sigmoid values already computed by the BNContrastiveHead, so you can use them directly as confidence.

NMS

Apply NMS per query (per-class):

allDets.sort { $0.confidence > $1.confidence }
var kept: [Int] = []
for i in allDets.indices {
    var suppress = false
    for ki in kept {
        if allDets[i].classIndex == allDets[ki].classIndex
            && iou(allDets[i].rect, allDets[ki].rect) > 0.5 {
            suppress = true; break
        }
    }
    if !suppress { kept.append(i) }
}
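
The Swift loop relies on an `iou` helper that isn't shown. Here is a self-contained Python sketch of per-class NMS including one possible `iou`, assuming rectangles are (x, y, width, height); the `Detection` type and sample boxes are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    class_index: int
    confidence: float
    rect: tuple  # (x, y, width, height)


def iou(a, b):
    """Intersection over union of two (x, y, width, height) rectangles."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def nms(dets, iou_threshold=0.5):
    """Keep the highest-confidence box among same-class boxes that overlap."""
    kept = []
    for det in sorted(dets, key=lambda d: d.confidence, reverse=True):
        clash = any(k.class_index == det.class_index
                    and iou(k.rect, det.rect) > iou_threshold for k in kept)
        if not clash:
            kept.append(det)
    return kept


dets = [
    Detection(0, 0.9, (0, 0, 10, 10)),
    Detection(0, 0.8, (1, 1, 10, 10)),    # overlaps the first, same class
    Detection(1, 0.7, (1, 1, 10, 10)),    # same place, different class
    Detection(0, 0.6, (50, 50, 10, 10)),  # far away
]
kept = nms(dets)
```

The class-1 box survives even though it overlaps the class-0 box, which is the point of running NMS per query.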

BPE tokenizer (Swift)

You need to implement CLIP's tokenizer in Swift. Load the BPE merge rules and vocabulary from clip_vocab.json:

class CLIPTokenizer {
    let contextLength: Int  // 77
    private let encoder: [String: Int]
    // Swift tuples aren't Hashable, so key each merge pair as "first second"
    private let bpeRanks: [String: Int]

    func tokenize(_ text: String) -> [Int] {
        var tokens = [encoder["<|startoftext|>"]!]
        // lowercase text → split into characters → BPE merge → token IDs
        // ...
        // truncate so the end token always fits within contextLength
        tokens = Array(tokens.prefix(contextLength - 1))
        tokens.append(encoder["<|endoftext|>"]!)
        // pad to contextLength (77)
        return tokens + Array(repeating: 0, count: contextLength - tokens.count)
    }
}
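
The elided "BPE merge" step is the core of the tokenizer, so here is a minimal Python sketch of it. The vocabulary and merge ranks below are toy values for illustration only (the real ones come from clip_vocab.json), and the real CLIP tokenizer also does byte-level encoding and regex word splitting, which this skips.

```python
CONTEXT_LENGTH = 77
# Toy tables, NOT real CLIP ids or merges.
ENCODER = {"<|startoftext|>": 1, "<|endoftext|>": 2,
           "d": 3, "o": 4, "g</w>": 5, "do": 6, "dog</w>": 7}
BPE_RANKS = {("d", "o"): 0, ("do", "g</w>"): 1}


def bpe(word):
    """Repeatedly merge the adjacent pair with the lowest rank."""
    symbols = list(word[:-1]) + [word[-1] + "</w>"]   # mark the word end
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda p: BPE_RANKS.get(p, float("inf")))
        if best not in BPE_RANKS:
            break                                     # nothing left to merge
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


def tokenize(text):
    ids = [ENCODER["<|startoftext|>"]]
    for word in text.lower().split():
        ids += [ENCODER[s] for s in bpe(word)]
    ids = ids[:CONTEXT_LENGTH - 1] + [ENCODER["<|endoftext|>"]]   # always fits
    return ids + [0] * (CONTEXT_LENGTH - len(ids))                # zero padding


tokens = tokenize("Dog")
```

Truncating before appending the end token keeps the output at exactly 77 entries even for very long queries.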

Compared with ordinary YOLO

	YOLO-World (Open-Vocabulary)	YOLO26 (fixed classes)
Detection target	any text	fixed COCO 80 classes
Model setup	Detector + CLIP Encoder + Vocab	one model only
Total size	~148 MB	~18 MB
NMS	implemented app-side	none (End-to-End)
Use for	flexible detection / search / grounding	general object detection
Speed	a bit slower (CLIP overhead)	fastest

Practical scenarios

Search by "red sneakers" — visual search in an e-commerce app
Detect "cracks" — infrastructure inspection
Detect "dog, cat, hamster" simultaneously — pet tracking
Let users freely specify what to detect — deploy without customization

With fixed-class YOLO you had to collect a dataset and retrain to detect "cracks." With YOLO-World you just change the text.

Sample app

A complete sample app is in sample_apps/YOLOWorldDemo/ of the CoreML-Models repository.

3 modes: camera / photo / video
freely change the query in a text field
real-time filtering with a confidence slider
download the models from release assets and drag into Xcode

Conversion tips

Use coremltools 8.1 (9.0 has a bug)
You need to patch torch.nn.MultiheadAttention.forward — CoreML can't convert the default PyTorch MHA well; monkey-patch it to call F.multi_head_attention_forward directly
Use YOLO-World V2 (faster and more accurate than V1)
compute_precision=ct.precision.FLOAT16 halves the model size

Summary

YOLO-World delivers intuitive, powerful object detection where you "specify what you want to detect by text." Run it on the iPhone's Neural Engine and it works server-free, offline, with low latency.

When to use which:

Speed-first, COCO 80 classes is enough → YOLO26
Want to flexibly change targets → YOLO-World

Originally published in Japanese on Qiita. GitHub / X

Real-time relighting of Gaussian Splatting reflections on iPhone (Metal)

Daisuke Majima — Tue, 02 Jun 2026 06:11:50 +0000

I built a Metal viewer on iPhone that re-lights an already-captured 3D scene with any lighting you like. Swap or rotate the HDR environment map, and the object's reflections follow in real time, with that environment also drawn into the background.

Why relighting is valuable (the practical, commercial meaning)

Ordinary Gaussian Splatting has the lighting from capture time baked in. So if you place the captured object somewhere else, the highlights and shadows clash with the new surroundings and look fake. Relighting strips out that baked-in light and recovers the underlying material, so you can place the captured object under any lighting. This matters in practice:

E-commerce / product visuals: capture a product once and show it under any light — showroom, outdoors, the customer's room (AR). Strong for "texture sells" goods like furniture, cars, jewelry, sneakers.
Film / virtual production: place a real capture into a new scene, consistent with that scene's lighting (HDRI / LED wall). No reshoot.
AR / spatial computing: an object placed in a real room only blends in once it's lit by the room's light. Relighting is a precondition for realistic AR placement.
Games / real-time 3D: instead of a baked-in fixed look, you get a photoreal asset that reacts to dynamic in-game light (day/night, moving lights).
Cost: an asset captured in minutes becomes usable under real production lighting, a result that otherwise takes days of manual modeling and material authoring.

In short, relighting turns a captured 3D scene into an actually usable asset. This article is a record of bringing it from desktop research (CUDA-assumed) to a real-time Metal implementation on iPhone.

The first half collects background knowledge; the second half covers the implementation and four bugs I hit. I define jargon as it appears, so you can follow even without a Gaussian Splatting / PBR background.

Repo: https://github.com/john-rocky/MetalGaussianSplatRelighting

Background

1. What is 3D Gaussian Splatting

A method that reconstructs a 3D scene from a set of photos and renders it in real time. The scene is represented as a huge set of translucent ellipsoids called splats. Each splat has "position, shape & orientation (rotation), color, opacity." Rendering projects each splat to the screen as an ellipse and alpha-composites them front-to-back in depth order. Color changes with viewing angle (view-dependent).

Key point: ordinary Gaussian Splatting holds the appearance (color) itself, with the capture-time lighting baked in. So you can't change the lighting afterward.

2. Lighting and relighting

Ordinary GS directly learns "the color of that spot photographed under that light." So lighting is fixed.

Relightable GS thinks differently. Per splat, it holds not "color" but a decomposed material:

Albedo: the base color of the material itself, with lighting removed
Normal: the direction the surface faces
Roughness: surface micro-roughness
Reflectance: strength of specular reflection

With the material, you can recompute the color on the fly under any environment light. That's relighting. The Ref-Gaussian I used learns this material decomposition.

If all you take from this is "splats hold material, and we want to re-light them in a new environment," that's enough. The rest, #3–#5, answers one question:

Given that material and the environment, how do we compute the color of one pixel?

3. Reflection is split into two and added

Light hitting a surface returns in two ways:

Diffuse: returns light evenly in all directions → you see the albedo color itself. No reflections.
Specular: returns light only in a specific direction (the mirror reflection of the incoming direction) → the environment is reflected.

So color = diffuse + specular. The splat's reflectance decides the blend (how strong the specular is). And roughness decides how blurry the specular is:

low roughness → the environment reflects crisply, like a mirror
high roughness → blurry; the bright parts of the environment just appear as a vague blob

This matters later. In fact, a glossy car (roughness ~0.2) just shows the environment lighting as a blurry white blob that moves, without resolving beams or window shapes. It's not "no reflection," it's a blurry reflection, and that's physically correct. Drop the roughness way down and the same car becomes mirror-like, clearly reflecting the room.

4. How to hold the environment (light source)

Instead of placing point lights, use a 360° image as the light source: in effect, a table of what color of light arrives from each direction.

HDR: an image that can hold brightness above 1 (needed because windows and lights are orders of magnitude brighter than paper)
equirect / cubemap: names for the storage format of that 360° image. Both store the same content: "direction → light color."

Swapping the environment map = swapping the lighting = relighting.
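
"Direction → light color" becomes concrete once you see how a direction picks a pixel in an equirect image. A Python sketch under an assumed convention (Y-up, with −Z at the horizontal center of the image); other viewers place the seam elsewhere, so treat the exact formula as illustrative.

```python
import math


def direction_to_equirect_uv(x, y, z):
    """Map a direction to (u, v) in [0, 1], with v = 0 at straight up."""
    length = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / length, y / length, z / length
    u = math.atan2(x, -z) / (2.0 * math.pi) + 0.5       # longitude
    v = math.acos(max(-1.0, min(1.0, y))) / math.pi     # angle from the up axis
    return u, v


forward = direction_to_equirect_uv(0.0, 0.0, -1.0)   # image center
down = direction_to_equirect_uv(0.0, -2.0, 0.0)      # bottom row
```

Rotating the lighting is then just rotating the direction before this lookup, which is why one rotation matrix can drive both the skybox and the reflections.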

5. Getting diffuse and specular from the environment (split-sum)

We want to compute #3's "diffuse" and "specular" from #4's environment map. Done naively, you integrate the environment per pixel, which is too heavy to do every frame. So UE4's split-sum approximation precomputes two images:

For diffuse (irradiance): for each normal direction, the environment averaged over the hemisphere facing that direction (cosine-weighted) → one lookup in the normal direction gives diffuse light.
For specular (prefiltered): the environment blurred per roughness level (stored progressively in mips) → one lookup in the reflection direction gives specular light (blur level = roughness).

At runtime you just sample these two textures. Replacing a heavy integral with "look up a pre-blurred image" is the heart of split-sum. (Auxiliary: a small table that fine-tunes specular strength by angle — the BRDF LUT — is also precomputed.)

I implemented this precompute kernel in Metal.

6. Deferred shading (a splat-specific issue)

Splats are translucent and overlap, so neighboring splats' normals vary and get noisy. If you shade each splat individually and then composite, that noise comes straight through.

So change the order: first accumulate each splat's material (color, normal, roughness, etc.) into a screen buffer (G-buffer) and blend = average, then shade once per pixel. Computing after normals are averaged reduces noise. In Metal, use tile memory (imageblock) to keep this buffer on the GPU and process it fast.
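
Here is the "blend first, shade once" idea for normals in isolation. It's a Python sketch assuming standard front-to-back alpha compositing weights; `composite_normals` and the two sample splats are illustrative, not the Metal G-buffer code.

```python
import math


def composite_normals(splats):
    """splats: front-to-back list of (alpha, (nx, ny, nz)) covering one pixel."""
    transmittance = 1.0
    acc = [0.0, 0.0, 0.0]
    for alpha, normal in splats:
        weight = alpha * transmittance          # this splat's visible share
        acc = [a + weight * n for a, n in zip(acc, normal)]
        transmittance *= 1.0 - alpha
    length = math.sqrt(sum(a * a for a in acc))
    return tuple(a / length for a in acc) if length > 0 else (0.0, 0.0, 0.0)


# two overlapping splats whose normals tilt in opposite directions
pixel_normal = composite_normals([
    (0.5, (0.6, 0.0, 0.8)),
    (1.0, (-0.6, 0.0, 0.8)),
])
```

Shading each splat separately would produce two different highlights here; shading the single averaged normal produces one, which is the noise reduction described above.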

7. Normals and coordinate systems

Normal: the unit vector of the surface direction. The reflection direction depends on it, so if it's off, all reflections are off. It's reconstructed from the splat's orientation (rotation quaternion).
Up-axis convention: there's Y-up (many viewers) and Z-up (Blender-family). A mismatch between data and viewer tips the object over (→ bug 3).
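
A sketch of recovering a camera-facing normal from a rotation quaternion, in Python. It assumes the splat's local Z axis is its flat (normal) axis and that quaternions are stored as (w, x, y, z); both are assumptions for illustration, so check them against your data.

```python
def rotate(q, v):
    """Rotate vector v by unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    vx, vy, vz = v
    # t = 2 * cross(q.xyz, v)
    tx, ty, tz = 2 * (y * vz - z * vy), 2 * (z * vx - x * vz), 2 * (x * vy - y * vx)
    # v' = v + w * t + cross(q.xyz, t)
    return (vx + w * tx + (y * tz - z * ty),
            vy + w * ty + (z * tx - x * tz),
            vz + w * tz + (x * ty - y * tx))


def splat_normal(q, splat_pos, camera_pos, local_axis=(0.0, 0.0, 1.0)):
    """Rotate the local normal axis into world space, flipped to face the camera."""
    n = rotate(q, local_axis)
    to_camera = tuple(c - p for c, p in zip(camera_pos, splat_pos))
    facing = sum(a * b for a, b in zip(n, to_camera))
    return n if facing >= 0 else tuple(-a for a in n)


identity = (1.0, 0.0, 0.0, 0.0)
front = splat_normal(identity, (0, 0, 0), (0, 0, 5))
back = splat_normal(identity, (0, 0, 0), (0, 0, -5))   # flipped toward the camera
```

The flip matters because a flat splat has no inside or outside; without it, half the normals would point away from the viewer and reflect the wrong part of the environment.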

Implementation: the shading equations

With the background in place, read the equations Ref-Gaussian's deferred surfel renderer (render_surfel) computes per pixel:

F0       = 0.04*(1 - reflectance) + albedo * reflectance
specular = prefiltered(reflect(V, N), roughness) * (F0 * fg.x + fg.y)
final    = (1 - reflectance) * base_color + specular

reflect(V, N): view direction V reflected about normal N. Look up prefiltered (#5) here = the environment reflected in the specular.
fg: a lookup into the BRDF LUT (#5).
final: uses base_color directly for diffuse, and only computes specular from the environment and adds it (= #3's "diffuse + specular"). This matters in bug 2.
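
The three equations translate directly into code. A Python sketch with the texture lookups replaced by plain arguments: `prefiltered_rgb` stands for the already-sampled prefiltered environment color and `fg` for the already-sampled BRDF LUT pair, and the sample values are made up.

```python
def reflect(v, n):
    """Mirror vector v about unit normal n (v points toward the surface)."""
    d = sum(a * b for a, b in zip(v, n))
    return tuple(a - 2.0 * d * b for a, b in zip(v, n))


def shade(albedo, base_color, reflectance, prefiltered_rgb, fg):
    """Per-pixel color; all colors are (r, g, b), fg is the (x, y) LUT sample."""
    f0 = tuple(0.04 * (1.0 - reflectance) + a * reflectance for a in albedo)
    specular = tuple(p * (f * fg[0] + fg[1]) for p, f in zip(prefiltered_rgb, f0))
    return tuple((1.0 - reflectance) * b + s for b, s in zip(base_color, specular))


base = (0.8, 0.7, 0.5)
env = (0.5, 0.25, 0.0)

matte = shade((1.0, 1.0, 1.0), base, 0.0, env, (0.0, 0.0))    # no specular at all
mirror = shade((1.0, 1.0, 1.0), base, 1.0, env, (1.0, 0.0))   # environment only
bounce = reflect((0.0, 0.0, -1.0), (0.0, 0.0, 1.0))
```

Note that `base_color` goes into the result unmodified apart from the `(1 - reflectance)` factor; there is no irradiance term anywhere.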

Whole pipeline:

Ref-Gaussian .ply --> load --> per-splat material (normal, roughness, reflectance, albedo)
                                       |
HDR env --> IBL precompute --> prefiltered (specular) + irradiance (diffuse) + BRDF LUT
                                       |
                         +-------------+--------------+
                         v                            v
                 G-buffer pass               postprocess pass
        (blend color/normal/material      (per-pixel split-sum IBL
          into tile memory = #6)            + skybox compositing)

It sounds clean, until you run it. Now the real story: four bugs.

Bug 1: the normal map is a rainbow sandstorm

The shading was patchy and flickering. The "Normal" debug view (normals visualized as color) was rainbow noise, not a smooth gradient.

My first thought — "2D-surfel normals are just inherently noisy" — was wrong, and I nearly wasted a stack of device builds on it.

What saved me was discipline: first draw the normals offline and verify. Compositing each splat's geometric normal (#7: reconstructed from the rotation quaternion and flipped to face the camera) with a small numpy script gave a smooth result (median gradient 0.006). So the data was correct and the renderer was buggy.

Culprit: at load time MetalSplatter reorders splats for cache efficiency (Morton order, sortByLocality). It reorders the splat buffer and the SH-coefficient buffer, but the material buffer (which holds the normals) I'd added later was a separate buffer and wasn't reordered.

So after sorting, splats[i] corresponded to materials[some other j], and every splat held someone else's normal. Color (the view-dependent color of #1) was in an already-reordered buffer, so it stayed consistent, and only the normals and specular looked broken — which made it hard to isolate.

The fix was one line:

materialsBuffer.values.reorderInPlace(fromSourceIndices: sorted)

Lesson: when you bolt a "per-element parallel buffer" onto someone else's pipeline, fix every place the source data gets reordered.
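
The failure mode is easy to reproduce with two toy lists. A Python sketch (the names and the permutation are made up; `reorder` only mirrors the idea of applying the sort's source indices to a buffer):

```python
def reorder(values, source_indices):
    """new[i] = values[source_indices[i]], i.e. apply a sort's permutation."""
    return [values[j] for j in source_indices]


splats = ["s0", "s1", "s2", "s3"]
materials = ["m0", "m1", "m2", "m3"]   # parallel to splats before sorting
order = [2, 0, 3, 1]                   # stand-in for the locality sort's result

sorted_splats = reorder(splats, order)
stale_pairs = list(zip(sorted_splats, materials))                  # the bug
fixed_pairs = list(zip(sorted_splats, reorder(materials, order)))  # the fix
```

In `stale_pairs`, splat `s2` is matched with material `m0`: every element is valid on its own, which is why nothing crashes and only the look goes wrong.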

Bug 2: I was re-lighting the diffuse with environment light (don't)

Even after fixing normals, the body was a "watery," patchy yellow.

I'd written the diffuse term by the textbook as albedo × irradiance (multiplying by #5's diffuse image). But Ref-Gaussian's equation uses base_color directly for diffuse — only the specular is relit. I was painting a pattern onto the cream-colored body with my own irradiance. Worse, I was tinting the specular F0 with the view-dependent color (capture-time reflections baked in) instead of the learned albedo.

Matching the reference equation exactly fixed it. Lesson: before improvising "correct" PBR, read the reference implementation's source and match it line by line.

Bug 3: the car is on its side (Z-up vs Y-up)

When I trained and loaded a reflective car, it rendered 90° on its side.

Instead of guessing, I measured the point cloud's bounding box: the shortest axis was Z (= height), the longest was Y (= length), and the dark tire splats were on the −Z side. The data was Z-up (#7, Blender-family). The viewer assumed Y-up. The 90° offset that didn't show on a round helmet was suddenly exposed by the car.
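
That measurement is a few lines of Python. This sketch uses made-up car-like dimensions and a simple heuristic (the shortest bounding-box axis is "up"), which only holds for objects that are flatter than they are long or wide; the tire check from the text is what confirms the sign.

```python
def axis_extents(points):
    """Bounding-box size along X, Y and Z."""
    return [max(p[i] for p in points) - min(p[i] for p in points)
            for i in range(3)]


def guess_up_axis(points):
    """Heuristic: for a car-like object, the shortest extent is the height."""
    extents = axis_extents(points)
    return "XYZ"[extents.index(min(extents))]


# two opposite corners of a box about 1.8 wide, 4.5 long and 1.4 tall
car_like = [(0.0, 0.0, 0.0), (1.8, 4.5, 1.4)]
up = guess_up_axis(car_like)
```

On a near-spherical object like a helmet all three extents are similar, which is why the same check tells you nothing there.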

Adding a Z-up→Y-up correction (−90° about X) to the camera fixed the car. Then a second head sprouted: now the background environment was 90° off. equirect (#4) assumes Y-up, but the skybox rays and reflections are computed in the scene's Z-up frame.

The fix: convert the sample direction for the environment into the environment's Y-up frame, and rotate the slider about the scene's up-axis:

environmentRotation = Rx(-90°) * Rz(slider angle)

The skybox and reflections sample with the same matrix, so they always match. I verified the mapping numerically before building, and validated the skybox itself by an offline render that reconstructs rays from the inverse view-projection. That also caught an old top-bottom flip (double-flipped) I'd previously baked into the HDR.
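
This kind of mapping can be verified numerically before building. A Python sketch of the check, assuming right-handed rotations applied to column vectors (the app's actual matrix convention may differ):

```python
import math


def rx(deg):
    """Rotation about X."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rz(deg):
    """Rotation about Z."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply(m, v):
    return tuple(sum(m[i][k] * v[k] for k in range(3)) for i in range(3))


def environment_rotation(slider_deg):
    # Rz spins about the scene's Z-up axis first, then Rx(-90°) turns Z-up into Y-up
    return matmul(rx(-90.0), rz(slider_deg))


scene_up = (0.0, 0.0, 1.0)
env_up = apply(environment_rotation(37.0), scene_up)
```

Whatever the slider angle, the scene's up vector lands on the environment's +Y, so the horizon stays level while the lighting spins.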

Bug 4: a "successful" build that runs old code

After the orientation fix, a report came in: "the car is still on its side." The offline render had already proven the math correct, so the running binary must be stale — but why?

I'd been type-checking with xcodebuild -destination platform=macOS. That only compiles the #if os(macOS) branch and Mac architectures. When I built for the iOS simulator, existing code revealed a compile error: I was assigning Float16(x).bitPattern into a [UInt16] array.

arm64 (device): native Float16 exists, bitPattern is UInt16 → compiles
the simulator's x86_64 slice: Float16 falls back, bitPattern is UInt32 → type error

The iOS build was failing, so the device kept running the previous binary. Holding [Float16] directly fixed it.

Lesson: type-check for the platform you ship to. A macOS-only xcodebuild will happily lie about your iOS app. And "nothing changes on device" is almost always a sign the binary isn't new — suspect that before re-debugging your logic.

The methodology that actually worked

What all four bugs share: establish ground truth before judging your output. Early on, my offline numpy preview was overly smooth and misled me. What worked:

Run the reference renderer (Ref-Gaussian's eval.py or training-time visualizations) on the same asset and compare. If the reference is clean and yours is dirty, it's a bug in your renderer.
Reproduce the transforms (normals, skybox rays) exactly offline and look at them before building for device.
Verify on a clean synthetic asset (a hand-made chrome sphere) to separate "renderer bug" from "asset bug."

Every time I skipped this and settled for "looks fine," I was wrong.

Aside: is the look "correct"?

After finishing, I felt "the car's reflections are dull, it doesn't look like it's reflecting the environment." That's not a bug — it's the material's nature. To be clear:

The trained car is roughness ~0.2 = semi-gloss. Semi-gloss reflects the environment blurrily (not a mirror). So bright lighting appearing as a blurry blob is correct. Any renderer, lit by the same environment, shows it equally dull. Even the official renderer's output (ground truth) shows a clean reconstruction (sharp car, smooth normals); the material is fine.
Turning off the app's "Use trained material" and lowering roughness lets you override all splats to a uniform specular, clearly reflecting the room. That's flashy but not present on a real car. ON = the learned real material, OFF = a manual override.
Note that material decomposition has an inherent ambiguity: a blue car's albedo can decompose as yellowish (splitting blue into "blue light + yellow material"). This is normal in inverse rendering; the rendered result itself is clean, so the harm is small, but it's not a "physically perfect decomposition."

So "dull look = the material behaving correctly," not an error in the iOS renderer.

Result

A reflective Ref-Gaussian scene, relit in real time on iPhone: switch and rotate the HDR environment and the reflections and skybox follow. The base is MetalSplatter, the relighting model is Ref-Gaussian, and split-sum IBL is from UE4.

Code, demo, details: https://github.com/john-rocky/MetalGaussianSplatRelighting

Next: light the splats from the actual environment via ARKit's environment probe — place a relightable object in your own room and reflect the room.

Implemented in Swift + Metal / iOS. Credits: MetalSplatter (Sean Cier, MIT), Ref-Gaussian, HDRIs from Poly Haven (CC0). Originally published in Japanese on Qiita.

Adding UI to Google Colab: forms, sliders, buttons and more

Daisuke Majima — Tue, 02 Jun 2026 06:11:49 +0000

Adding UI to Colab

You can display UI elements in a Google Colaboratory notebook. Input forms, buttons, and similar controls are handy when other people use your notebook. Whatever is entered in a form is written back to the variable in that cell.

Live Colab notebook sample

Cell title

#@title cell title

Input form

The form's content is written into the variable on the same line of the cell.

variable = "the form is reflected into the variable" #@param {type:"string"}
# for a number: #@param {type:"number"}
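
Because the `#@param` annotation is an ordinary Python comment, the rest of the cell can read the variable like any other. A small sketch (the variable names `user_name` and `repeat_count` and the helper `build_greeting` are made up for illustration):

```python
user_name = "Colab"  #@param {type:"string"}
repeat_count = 2  #@param {type:"number"}


def build_greeting(name: str, times: float) -> str:
    """Build a greeting from the form values; a number form may hold a float."""
    return " ".join([f"Hello, {name}!"] * int(times))


print(build_greeting(user_name, repeat_count))
```

The same cell also runs unchanged outside Colab, where the annotations are simply ignored.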

Select box

dropdown = 'value' #@param ["1st option", "2nd option", "3rd option"] {allow-input: true}

Date input

date_input = '2018-03-22' #@param {type:"date"}
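
The date form stores its value as a string, so it usually needs parsing before any date arithmetic. A sketch using only the standard library (`days_since` is a hypothetical helper, not a Colab function):

```python
from datetime import date

date_input = '2018-03-22'  #@param {type:"date"}


def days_since(date_text: str, today: date) -> int:
    """Parse a YYYY-MM-DD form value and count the days up to `today`."""
    return (today - date.fromisoformat(date_text)).days


print(days_since(date_input, date.today()))
```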

Slider

number_slider = 0.1 #@param {type:"slider", min:-1, max:1, step:0.1}

Checkbox

boolean_checkbox = True #@param {type:"boolean"}

Markdown

#@markdown ---
#@markdown # Big
#@markdown ### Middle
#@markdown ##### Little
#@markdown ---

A button via the DOM

from IPython.display import display, Javascript
from google.colab.output import eval_js

# Define a JavaScript function that adds a button and a log line to the cell output.
js = Javascript('''
    async function add_button() {
        const div = document.createElement('div');
        const button = document.createElement('button');
        const log = document.createElement('div');

        button.textContent = "button";
        // Show a message in the log element when the button is clicked.
        button.onclick = function() {
            log.textContent = "Button Clicked.";
        };
        div.appendChild(button);
        div.appendChild(log);

        document.querySelector("#output-area").appendChild(div);
    }
''')

# Load the function into the output frame, then call it.
display(js)
eval_js('add_button()')

Originally published in Japanese on Qiita. I build apps with Core ML and write about machine learning. GitHub / X

Make a 3D model on iPhone just by taking photos (RealityKit Photogrammetry)

Daisuke Majima — Tue, 02 Jun 2026 06:06:47 +0000

Make a realistic 3D model just by taking photos

Using Apple's tool, you can easily create 3D models.

A turtle, and the captured turtle.

I want realistic 3D models

If you have plenty of usable 3D models, you can use them in AR and game apps. But (in my experience) freely usable 3D models for AR/games are surprisingly scarce — on download sites, the nice content is often paid.

Making 3D models seems hard

And making them yourself — modeling and converting in 3D software — seems difficult.

With Apple's RealityKit, just taking photos makes a model

In 2021 Apple released Object Capture, a RealityKit API that builds a 3D model from photos alone. It's fairly easy to use and produces realistic objects. It recognizes the salient object and cleanly separates it from the background and floor.

Method

1. Take photos

With a handheld camera (iPhone is fine), photograph the thing you want in 3D from every direction.

I simply put a stuffed animal on the carpet and shot 360° from the side and from above, covering it like a hemisphere. The conditions:

in as bright a place as possible
with the whole subject as large in frame as possible
shooting frequently so consecutive photos overlap by 70%+

I took 200 photos and gathered them into a folder on my Mac.

2. Download Apple's tool

You can get it from the developer site:

https://developer.apple.com/documentation/realitykit/creating_a_photogrammetry_command-line_app/

Build it in Xcode and a HelloPhotogrammetry executable appears in the build products folder; reveal it in Finder to find its path.

3. Run

<path to HelloPhotogrammetry> <input image folder> <output file path, with .usdz>

* Absolute paths from root are required.
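
Since absolute paths are required, it can help to resolve every path before invoking the tool. A minimal Python sketch that only assembles the command (the three paths are placeholders, and `build_photogrammetry_command` is a hypothetical helper, not part of Apple's sample):

```python
from pathlib import Path


def build_photogrammetry_command(tool, image_dir, output_file):
    """Return the argument list with every path made absolute."""
    out = Path(output_file).expanduser().resolve()
    if out.suffix.lower() != ".usdz":
        raise ValueError("output file must end in .usdz")
    return [
        str(Path(tool).expanduser().resolve()),
        str(Path(image_dir).expanduser().resolve()),
        str(out),
    ]


# Placeholder paths; replace them with your own locations.
cmd = build_photogrammetry_command("HelloPhotogrammetry", "photos", "model.usdz")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on the Mac where the tool was built.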

This generates a USDZ model. With 200 photos it took about 20 minutes. Click the link below on an iPhone to try the turtle USDZ model in AR:

https://firebasestorage.googleapis.com/v0/b/sincere-nirvana-292404.appspot.com/o/model.usdz?alt=media&token=96083418-ed43-435c-bce4-6fed085bbd7b

Toward richer 3D content

Some people upload models made this way to 3D-model sharing sites. It'd be great if 3D models and content grew richer as lots of people scan lots of things.

Originally published in Japanese on Qiita. I build apps with Core ML and ARKit and write about ML/AR. GitHub / X


iPhone on-device LLM: the GPU wins the sprint, the Neural Engine wins the marathon

Decode rate: cold burst vs sustained (Gemma 4 E2B 4-bit)

How fast they fade (vs each runtime's own peak)

Why — it's a power story

Takeaway: GPU for the sprint, ANE for the marathon

Caveats (so you can poke holes)

Repo (raw data + scripts)

Filling object detection's last mile: peaceofcake (D-FINE & RF-DETR, Apache 2.0)

What happened to object detection in 2024

Object detection in 3 lines

D-FINE — from "fuzzy" to "sharp"

RF-DETR — the "foundation model" wind reaches detection

Why "yet another wrapper" is needed

Training: it eats YOLO format as-is

Export: three formats from one method

There's a CLI too

Design highlights — where to hide the complexity

1. Extensibility via the Strategy pattern

2. Lazy imports for a light startup

3. "Intent inference" for model names

"Field wisdom" carved into the commit log

CUDA hangs

YOLO format has no "spec"

Embed class names in the checkpoint

How many miles from Python to device?

All Apache 2.0 — the shortest path to commercial use

The future of object detection, and the role of tools

How to use the TensorFlow Object Detection API (inference, with Colab)

Steps

0. Install TensorFlow 2

1. Clone the official TensorFlow Models from GitHub

2. Install the Object Detection API and required modules

3. Import modules

4. Image-loading function

5. Download a model

6. Read the pipeline config and build the model

7. Prepare the inference function

8. Prepare labels

9. Run detection on your image

Brightening dark images with machine learning (GLADNet)

How to use a machine-learning model that brightens low-light images

I want to extract information from dark images

GLADNet brightens them cleanly

Usage

For various uses

Resuming training in PyTorch: save and load the optimizer too

How to save/load a model in PyTorch and resume training from a checkpoint

Saving the model alone won't resume from the same accuracy

You should save the optimizer too

Run a ChatGPT-class chatbot in 5 minutes with FlexGen (in Colab)

How to use it

I built a CLI that reads Claude Code's responses aloud (Zundamon TTS)

Have Zundamon read it aloud

Setup in 3 lines

What it does

Usage

Basics

Works standalone too

Change the voice

How it works

What voicevoice setup does

Uninstall

Requirements

Command list

Closing

A Swift library to run Segment Anything natively on iOS (SamKit)

What it can do

Architecture

Setup

1. Add the Swift Package

2. Download the models

Usage

Point/box segmentation

Segment by text prompt

Cutting out the object

Embedding the SwiftUI views

How Subject Lift is implemented

1. Binarizing the mask

2. Generating the glowing outline
