Daisuke Majima

Posted on Jun 2 • Originally published at qiita.com

Filling object detection's last mile: peaceofcake (D-FINE & RF-DETR, Apache 2.0)

#python #machinelearning #computervision #opensource

What happened to object detection in 2024

YOLOv1 stunned the world in 2016. Eight years on, detection accuracy has improved dramatically.

But let me ask one thing.

Can you actually embed the latest object-detection model into your own app?

You've read the paper. You ran the demo. But the moment it becomes "train on my data and run it on iPhone," you stall. Writing dozens of lines of YAML config, resolving dependency conflicts, writing an export script from scratch, building an inference pipeline in Swift — the road is longer than you'd think.

Starting in 2024, DETR-based detection entered a new phase. Two models, D-FINE and RF-DETR, achieved breakthroughs in both accuracy and real-time performance. But to deliver those benefits to a real product, a "last mile" remains.

peaceofcake is a library to fill that mile.

pip install peaceofcake

It starts in a single line.

Object detection in 3 lines

from peaceofcake import DFINE

model = DFINE("dfine-l-coco")
results = model("image.jpg")

That's it. Weights are auto-downloaded and cached. It runs on GPU if you have one, CPU otherwise. Results come back as a structured object with bounding boxes, class labels, and confidence scores.

RF-DETR uses the same interface:

from peaceofcake import RFDETR

model = RFDETR("rfdetr-l-coco")
results = model("image.jpg")

Different model, same API. That's the core of peaceofcake's design philosophy.

D-FINE — from "fuzzy" to "sharp"

D-FINE's full name is "DETR with Fine-grained Distribution Refinement." As the name suggests, the heart of this model is how it predicts bounding boxes.

Conventional DETR-family models try to nail the location in one shot as four numbers (x, y, w, h). By analogy: asked "how tall is that building?", you instantly answer "37.2 meters!"

D-FINE is different. First it produces a probability distribution ("maybe 30–40 meters"), then narrows ("35–38 meters"), and finally arrives at "37.2 meters." It progressively sharpens a fuzzy estimate.

coarse distribution → refine → refine more → final prediction

With this Fine-grained Distribution Refinement, D-FINE achieves accuracy beyond RT-DETR at real-time inference speed.
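
The progressive sharpening is easier to see with numbers. What follows is a minimal sketch of the idea, not D-FINE's implementation: it reuses the building-height analogy, uses evenly spaced bins instead of the paper's weighting function, and the logits and residuals are made-up values. The real model refines distributions over offsets for the four box edges. The part that mirrors the paper is the update rule: each layer adds residual logits to the previous layer's logits.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_value(logits, bin_centers):
    """Read one number out of the distribution: its probability-weighted mean."""
    return sum(p * c for p, c in zip(softmax(logits), bin_centers))

def entropy(logits):
    """Spread of the distribution; lower means a sharper estimate."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0)

# Candidate heights in meters, one bin every 2 m (illustrative values).
bin_centers = [30.0, 32.0, 34.0, 36.0, 38.0, 40.0]

# The first layer starts with a nearly flat guess ("maybe 30-40 meters").
logits = [0.0, 0.2, 0.5, 0.8, 0.6, 0.1]

# Each later layer adds residual logits to the previous layer's logits,
# moving probability mass toward the bins it believes in.
residuals_per_layer = [
    [-1.0, -0.5, 0.5, 1.5, 1.0, -1.0],
    [-2.0, -1.0, 0.0, 2.5, 1.5, -2.0],
]

history = [(expected_value(logits, bin_centers), entropy(logits))]
for residual in residuals_per_layer:
    logits = [l + r for l, r in zip(logits, residual)]
    history.append((expected_value(logits, bin_centers), entropy(logits)))

for step, (estimate, spread) in enumerate(history):
    print(f"layer {step}: estimate={estimate:.2f} m, entropy={spread:.3f}")
```

The entropy drops at every layer while the estimate settles, which is the "fuzzy to sharp" behavior in miniature.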

The architecture is clear:

image → HGNetv2 (Backbone) → HybridEncoder → DFINE Decoder → detections

The HGNetv2 backbone comes in 5 sizes (Nano/Small/Medium/Large/XLarge), covering everything from lightweight mobile inference to accuracy-first server-side, with one architecture.

RF-DETR — the "foundation model" wind reaches detection

RF-DETR is a "Real-Time Foundational Object Detection" model that Roboflow released in March 2025.

Note the word "Foundational." Just as "foundation models" revolutionized the LLM world, that wave is reaching object detection: pretrain general detection ability on massive data, then fine-tune on a small amount of domain-specific data. RF-DETR brought that paradigm to real-time detection.

peaceofcake makes these two state-of-the-art models usable through the same 3-line API.

Why "yet another wrapper" is needed

The world is full of model-wrapper libraries. So what's different about peaceofcake?

The answer is scope.

Many wrappers stop at "make inference easy." peaceofcake doesn't. It covers the full product-development cycle — train → export → on-device mobile inference — in one package.

Training: it eats YOLO format as-is

model = DFINE("dfine-m-coco")
model.train(data="dataset.yaml", epochs=50, batch_size=16)

Annotate in Roboflow or Label Studio, export in YOLO format, and pass it straight in. Automatic conversion to COCO format runs internally, and training-schedule scaling, EMA, and AMP are applied transparently.

COCO-format datasets work as-is too. You don't have to think about the format.
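
For a sense of what the YOLO-to-COCO conversion involves, here is a minimal sketch of the per-box arithmetic. The function names are hypothetical and this is not peaceofcake's converter. It relies only on the two formats themselves: YOLO stores a normalized center and size, COCO stores the top-left corner and size in pixels.

```python
def yolo_line_to_coco_bbox(line, image_width, image_height):
    """Convert one YOLO label line to (class_id, [x_min, y_min, width, height])."""
    class_id, cx, cy, w, h = line.split()
    box_w = float(w) * image_width
    box_h = float(h) * image_height
    # YOLO gives the box center; COCO wants the top-left corner.
    x_min = float(cx) * image_width - box_w / 2
    y_min = float(cy) * image_height - box_h / 2
    return int(class_id), [x_min, y_min, box_w, box_h]

def parse_yolo_labels(text, image_width, image_height):
    """Parse a whole label file, skipping the blank lines some tools emit."""
    return [
        yolo_line_to_coco_bbox(line, image_width, image_height)
        for line in text.splitlines()
        if line.strip()
    ]

labels = "0 0.5 0.5 0.25 0.5\n\n2 0.25 0.75 0.5 0.5\n"
annotations = parse_yolo_labels(labels, image_width=640, image_height=480)
print(annotations)
```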

Export: three formats from one method

model.export("onnx")                          # server-side inference
model.export("coreml", precision="FLOAT16")   # iOS / macOS
model.export("tensorrt")                       # NVIDIA GPU optimization

On CoreML export, denoising mechanisms and auxiliary heads needed only at training time are automatically removed. The output is confidence + coordinates, directly compatible with iOS's Vision framework. NMS is done outside the model, so you can change the confidence threshold from the UI in real time.
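
Because filtering happens outside the model, changing the threshold is just a re-run of a cheap loop over the raw output. Here is a sketch of that step, written in Python for consistency with the rest of this post (on device it would live in Swift). The output layout, with one row of per-class confidences and one box per candidate, is an assumption for illustration, and `filter_detections` is a hypothetical helper.

```python
def filter_detections(confidences, boxes, class_names, threshold):
    """Keep candidates whose best class score reaches the threshold."""
    kept = []
    for scores, box in zip(confidences, boxes):
        # Index of the highest-scoring class for this candidate.
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            kept.append(
                {"label": class_names[best], "score": scores[best], "box": box}
            )
    return kept

class_names = ["cat", "dog"]
confidences = [[0.90, 0.05], [0.10, 0.35], [0.02, 0.03]]
boxes = [
    [0.50, 0.50, 0.20, 0.30],
    [0.30, 0.60, 0.10, 0.10],
    [0.80, 0.20, 0.05, 0.05],
]

# The same raw output, filtered twice with different slider positions.
strict = filter_detections(confidences, boxes, class_names, threshold=0.5)
loose = filter_detections(confidences, boxes, class_names, threshold=0.3)
print(len(strict), len(loose))
```

The model runs once per frame; only this loop needs to run again when the user moves the slider.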

There's a CLI too

Without writing Python, you can access all features from the terminal.

poc predict source=photo.jpg conf=0.3
poc train model=dfine-m-coco data=my_dataset.yaml epochs=100
poc export model=dfine-l-coco format=coreml precision=FLOAT16
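
The `key=value` argument style is simple to parse. As a rough sketch of how such a command line could be turned into a command plus typed options (`parse_cli` and `_coerce` are hypothetical helpers, not the code behind `poc`):

```python
def parse_cli(argv):
    """Split ['train', 'epochs=100', ...] into a command and typed options."""
    command, *pairs = argv
    options = {}
    for pair in pairs:
        key, sep, raw = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        options[key] = _coerce(raw)
    return command, options

def _coerce(raw):
    """Turn '100' into 100 and '0.3' into 0.3; leave everything else a string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw

command, options = parse_cli(
    "train model=dfine-m-coco data=my_dataset.yaml epochs=100".split()
)
print(command, options)
```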

Design highlights — where to hide the complexity

Three points in peaceofcake's internal design feel especially clever.

1. Extensibility via the Strategy pattern

class DFINE(BaseModel):
    @property
    def task_map(self):
        return {
            "detect": {
                "predictor": DFINEPredictor,
                "trainer": DFINETrainer,
                "exporter": DFINEExporter,
                "validator": DFINEValidator,
            }
        }

BaseModel receives predict(), train(), export(), val() calls and dispatches to the right class via task_map. D-FINE and RF-DETR have completely different Predictor/Trainer/Exporter implementations, but the API the user sees is identical. If segmentation or pose tasks are added later, you just add an entry to task_map.
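
To make the dispatch concrete, here is a self-contained toy version of the pattern. Only the `task_map` shape comes from the snippet above. The `_component` helper, the `run` method, and the toy classes are assumptions for illustration, not peaceofcake's actual internals.

```python
class BaseModel:
    """Routes each public call to the component registered in task_map."""

    task = "detect"

    @property
    def task_map(self):
        raise NotImplementedError("subclasses describe their components here")

    def _component(self, role):
        # Look up the class registered for this task and role, then build it.
        return self.task_map[self.task][role]()

    def predict(self, source):
        return self._component("predictor").run(source)

    def export(self, fmt):
        return self._component("exporter").run(fmt)

    def __call__(self, source):
        return self.predict(source)


class ToyPredictorA:
    def run(self, source):
        return f"model A detections for {source}"


class ToyExporterA:
    def run(self, fmt):
        return f"model A exported as {fmt}"


class ToyPredictorB:
    def run(self, source):
        return f"model B detections for {source}"


class ToyExporterB:
    def run(self, fmt):
        return f"model B exported as {fmt}"


class ModelA(BaseModel):
    @property
    def task_map(self):
        return {"detect": {"predictor": ToyPredictorA, "exporter": ToyExporterA}}


class ModelB(BaseModel):
    @property
    def task_map(self):
        return {"detect": {"predictor": ToyPredictorB, "exporter": ToyExporterB}}


# Same calls, different implementations behind them.
for model in (ModelA(), ModelB()):
    print(model("image.jpg"), "|", model.export("onnx"))
```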

2. Lazy imports for a light startup

# peaceofcake/__init__.py
def __getattr__(name):
    if name == "RFDETR":
        from peaceofcake.models.rfdetr import RFDETR
        globals()["RFDETR"] = RFDETR
        return RFDETR
    raise AttributeError(...)

RF-DETR depends on the transformers library, which takes seconds just to import. peaceofcake uses a module-level __getattr__ to defer that import until RFDETR is actually referenced, so users who only need D-FINE don't even need transformers installed. The src.data, src.optim, and src.nn.criterion modules, which are needed only for training, are likewise imported inside train(); if you only run inference, they are never loaded.
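
You can try the mechanism without creating a package. The sketch below builds a throwaway module object and attaches a module-level `__getattr__` (PEP 562) to it. The helper name `make_lazy_module` and the use of `fractions.Fraction` as the "heavy" import are stand-ins for illustration.

```python
import importlib
import types

load_log = []  # records which lazy names were actually resolved

def make_lazy_module(name, lazy_attrs):
    """Build a module whose listed attributes are imported on first access."""
    mod = types.ModuleType(name)

    def __getattr__(attr):
        # Called only when normal attribute lookup on the module fails.
        if attr in lazy_attrs:
            module_name, obj_name = lazy_attrs[attr]
            obj = getattr(importlib.import_module(module_name), obj_name)
            setattr(mod, attr, obj)  # cache, so later lookups skip this hook
            load_log.append(attr)
            return obj
        raise AttributeError(f"module {name!r} has no attribute {attr!r}")

    mod.__getattr__ = __getattr__
    return mod

demo = make_lazy_module("demo_pkg", {"Fraction": ("fractions", "Fraction")})
resolved_before = list(load_log)   # nothing has been resolved yet
half = demo.Fraction(1, 2)         # first access triggers the import
same_class = demo.Fraction         # second access hits the cached attribute
print(resolved_before, load_log, half)
```

Caching the resolved object on the module is the same trick as the `globals()["RFDETR"] = RFDETR` line above: the hook runs once, and every later lookup is an ordinary attribute access.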

3. "Intent inference" for model names

DFINE("dfine-l-coco")           # registry name → auto-download
DFINE("path/to/custom.pth")     # local file
DFINE("dfine_l_coco.pth")       # filename only → matched against registry
DFINE("dfine-n")                 # no weights → random init

A single string goes through a multi-stage fallback: registry match → local-file check → filename match → size estimation. For a local checkpoint, peaceofcake first tries to infer the model size from the filename; if that fails, it back-calculates the size from the parameter count:

n = sum(v.numel() for v in model_state.values())
if n < 6_000_000: return "n"     # Nano
elif n < 15_000_000: return "s"  # Small
elif n < 25_000_000: return "m"  # Medium
...

So that the user can "just pass a path without thinking," all this inference logic runs behind the scenes.
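
Here is a sketch of how such a fallback chain could be written. It covers only the four example strings shown above and leaves out size estimation for checkpoints. The registry contents, the return format, and the function name are all hypothetical.

```python
from pathlib import Path

KNOWN_WEIGHTS = {"dfine-l-coco", "dfine-m-coco"}  # hypothetical registry
KNOWN_SIZES = {"n", "s", "m", "l", "x"}

def resolve_model_spec(spec):
    """Return (kind, value) describing how the string was interpreted."""
    # 1. Exact registry name: weights would be downloaded and cached.
    if spec in KNOWN_WEIGHTS:
        return ("registry", spec)

    # 2. An existing local file: load it directly.
    path = Path(spec)
    if path.is_file():
        return ("local_file", str(path))

    # 3. A bare filename that matches a registry entry once normalized.
    normalized = path.stem.lower().replace("_", "-")
    if normalized in KNOWN_WEIGHTS:
        return ("registry", normalized)

    # 4. Architecture plus size only: build the model with random weights.
    family, _, size = spec.lower().partition("-")
    if family == "dfine" and size in KNOWN_SIZES:
        return ("random_init", size)

    raise ValueError(f"cannot interpret model spec {spec!r}")

for spec in ("dfine-l-coco", "dfine_l_coco.pth", "dfine-n"):
    print(spec, "->", resolve_model_spec(spec))
```

The order matters: cheap, unambiguous checks come first, and the guesses come last.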

"Field wisdom" carved into the commit log

There's trial-and-error in the commit log that you can't see by reading the code alone — real lessons of bringing research code into production.

CUDA hangs

Early in training, the model can output bounding boxes with negative width or height. The original D-FINE implementation guarded against this with an assert. On CPU, a failed assert just raises an exception. But an assert failure inside a CUDA kernel hangs the entire GPU device.

peaceofcake replaced it with clamp. But the first fix used an in-place operation and broke the autograd graph. The second commit finally fixed it properly. Fix one bug and another shows up. That's the reality of putting research code into battle.

YOLO format has no "spec"

YOLO format has a de facto standard but no strict spec. How to specify paths (train: images/train vs path: ./ + train: images/train), blank lines in label files, how to write the class count — each tool differs subtly. To support all of Roboflow/Ultralytics/Label Studio output, peaceofcake spends over 50 lines on path-resolution logic alone.
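
To give a flavor of that path-resolution logic, here is a small sketch that handles just the two layouts mentioned above. It takes the already-parsed YAML as a dict so it needs no YAML library. `resolve_split_dir` is a hypothetical helper and far simpler than what a real loader needs.

```python
from pathlib import PurePosixPath

def resolve_split_dir(config, yaml_dir, split):
    """Return the image directory for a split such as 'train' or 'val'.

    Handles both layouts:
      train: images/train                (relative to the YAML file)
      path: ./  plus  train: images/train  (relative to an explicit root)
    """
    root = PurePosixPath(yaml_dir)
    base = config.get("path")
    if base:
        base = PurePosixPath(base)
        # A relative 'path' is taken relative to the YAML file's directory.
        root = base if base.is_absolute() else root / base
    target = PurePosixPath(config[split])
    return target if target.is_absolute() else root / target

plain = resolve_split_dir({"train": "images/train"}, "/data/pets", "train")
rooted = resolve_split_dir(
    {"path": "./", "train": "images/train"}, "/data/pets", "train"
)
absolute = resolve_split_dir(
    {"path": "/mnt/ds", "val": "images/val"}, "/data/pets", "val"
)
print(plain, rooted, absolute)
```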

Embed class names in the checkpoint

When distributing a model trained on a custom dataset, the class-name info tends to get lost. peaceofcake embeds a class_names key directly in the .pth checkpoint. No separate metadata file to manage.
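
The idea is simply "one dictionary, one file." The sketch below uses `pickle` and an in-memory buffer as a stand-in for writing a real .pth file, so it runs without PyTorch. Only the `class_names` key comes from the description above; the `"model"` key and both function names are assumptions.

```python
import io
import pickle

def save_checkpoint(buffer, weights, class_names):
    """Write weights and class names together as one dictionary."""
    pickle.dump({"model": weights, "class_names": list(class_names)}, buffer)

def load_class_names(buffer, default=None):
    """Read the checkpoint back; fall back for files saved without names."""
    checkpoint = pickle.load(buffer)
    return checkpoint.get("class_names", default)

# Round trip through an in-memory "file".
buffer = io.BytesIO()
save_checkpoint(buffer, weights={"layer.weight": [0.1, 0.2]}, class_names=["helmet", "vest"])
buffer.seek(0)
restored = load_class_names(buffer)

# An older checkpoint without the key still loads, using the fallback.
legacy = io.BytesIO(pickle.dumps({"model": {}}))
legacy_names = load_class_names(legacy, default=[])
print(restored, legacy_names)
```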

How many miles from Python to device?

Let's lay out, step by step, how far it is from Python code to a detection model that is actually "usable."

Step	Traditional workflow	peaceofcake
Install	clone repo, set up env	`pip install peaceofcake`
Inference	write config, run script	3 lines of Python
Training	convert data, tune config	`model.train(data="data.yaml")`
Export	write a dedicated script	`model.export("coreml")`
iOS device	implement inference pipeline from scratch	build `DFINEDemo/`

At every one of the 5 steps, it's a one-liner or pre-built code. That's what filling the "last mile" means.

All Apache 2.0 — the shortest path to commercial use

When embedding object detection into a product, license matters as much as accuracy and API ergonomics.

This is where many developers trip up. The latest YOLO series (Ultralytics YOLO) is AGPL-3.0; shipping it in a closed-source commercial product requires buying a paid enterprise license. A great model right in front of you, and you give up on adopting it because of the license wall. Many have had that experience.

The D-FINE and RF-DETR that peaceofcake adopts are both Apache 2.0. peaceofcake itself is Apache 2.0 too. So the library, the models, and the trained weights are all free for commercial use. You can modify, redistribute, and embed them, as long as you keep the license and attribution notices. The license also includes a patent grant, which reduces the risk of patent litigation from contributors.

Library	Model license	Commercial use
Ultralytics (YOLO)	AGPL-3.0	paid license required
peaceofcake (D-FINE)	Apache 2.0	free
peaceofcake (RF-DETR)	Apache 2.0	free

A startup embedding it in their product. A contractor delivering to a client. Integrating it into embedded-device firmware. In every scenario, license cost is zero.

Beats YOLO on accuracy, and the license is completely free. That's another reason to choose DETR-based models.

The future of object detection, and the role of tools

Computer vision's history is shifting from a "race for accuracy" to a "race for accessibility."

Since AlexNet in 2012, improving ImageNet accuracy was research's main battlefield. In detection too, raising COCO-benchmark mAP was seen as the primary contribution of a paper. But by 2025, as D-FINE and RF-DETR showed, DETR-based models have reached the stage of combining sufficient accuracy with real-time performance.

The next axis of competition is "anyone can use it."

Ultralytics (YOLOv8/YOLO11) was the pioneer, building an ecosystem where pip install ultralytics gives access to all of YOLO's features. But the AGPL license is a hurdle for commercial use. peaceofcake delivers the "easy to use" experience Ultralytics established — with Apache 2.0 DETR-based models.

In an era where model accuracy commoditizes, the source of differentiation moves to "how fast and how freely you can embed it into a product." The time and cost from a researcher posting to arXiv to an app developer delivering that model into users' hands — driving both toward zero. That's the role tools like peaceofcake play.

Object detection isn't hard anymore. And it isn't expensive anymore.

pip install peaceofcake

A piece of cake. Peace 🍰

Repository: peaceofcake
PyPI: peaceofcake
License: Apache 2.0

Originally published in Japanese on Qiita.
