Serhii Kalyna

Posted on Jun 29

I added AI background removal to my image converter in a week, in Rust, no Python

#rust #machinelearning #webdev #buildinpublic

Native ONNX implementation for CPU-only VPS

Part of an ongoing build-in-public series on Convertify, a free image/file converter I build solo. This week: background removal. The honest version, with the walls I hit.

Most "remove image background" tutorials end with pip install rembg and a happy screenshot. Mine started with a constraint: my whole backend is Rust, and I did not want to bolt a Python process onto it just to run one model.

Here is how the week went: the good parts, and the three or four times I stared at a compiler error wondering whether the constraint was worth it.

The starting point

Convertify is a free image converter. The backend is Rust + Axum + libvips, the model has to run CPU-only on a modest VPS, and there is no GPU anywhere in the budget. The obvious path for background removal is rembg, which is excellent, but it is Python and ships as a separate server process. Adding it would mean a second runtime, a second thing to deploy, a second thing to crash at 3am.

So the question for the week was simple: can I run the same models rembg uses, but natively in Rust?

Short answer: yes. rembg is, under the hood, just ONNX models plus some image pre- and post-processing. The models (u2net, isnet, silueta) are all .onnx files. If I can run ONNX in Rust and do the image work in libvips (which I already have), there is no Python in the picture at all.

The plan

The pipeline for background removal is not magic; it is five boring steps:

1. Decode the image and resize a copy to the model input size
2. Normalize the pixels into a tensor
3. Run inference to get a mask (one value per pixel: subject or background)
4. Normalize the mask and resize it back to the original size
5. Composite the mask onto the original as an alpha channel and export a transparent PNG

Steps 1, 4, 5 are libvips, which I already use everywhere. Step 3 is ONNX Runtime via the ort crate. Step 2 is a tight Rust loop. No Python anywhere.
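To make the arithmetic in steps 2, 4 and 5 concrete, here is a minimal pure-Rust sketch that runs without libvips or ort. It assumes tightly packed RGB8 input, a model that takes a 1 x 3 x H x W float tensor and returns one float per pixel, and pixels scaled by 1/255 before a per-channel mean/std. Those scaling constants are assumptions that differ between models, so check the preprocessing for the specific .onnx file before reusing them. The function names are hypothetical, and resizing is left to libvips as in the post.

```rust
/// Step 2: interleaved RGB8 pixels -> planar f32 buffer in NCHW order (batch = 1).
/// Assumption: scale to [0, 1] by dividing by 255, then apply per-channel mean/stddev.
fn rgb_to_nchw(pixels: &[u8], width: usize, height: usize, mean: [f32; 3], stddev: [f32; 3]) -> Vec<f32> {
    assert_eq!(pixels.len(), width * height * 3, "expected tightly packed RGB8");
    let plane = width * height;
    let mut tensor = vec![0.0f32; 3 * plane];
    for (i, px) in pixels.chunks_exact(3).enumerate() {
        for c in 0..3 {
            let v = px[c] as f32 / 255.0;
            // Channel c occupies its own contiguous plane of width * height values.
            tensor[c * plane + i] = (v - mean[c]) / stddev[c];
        }
    }
    tensor
}

/// Step 4 (before resizing): min-max normalize raw model output into an 8-bit mask.
fn mask_to_u8(pred: &[f32]) -> Vec<u8> {
    let (lo, hi) = pred
        .iter()
        .fold((f32::INFINITY, f32::NEG_INFINITY), |(lo, hi), &v| (lo.min(v), hi.max(v)));
    let range = hi - lo;
    pred.iter()
        .map(|&v| {
            // A flat prediction carries no information; treat it as background.
            let t = if range > 0.0 { (v - lo) / range } else { 0.0 };
            (t * 255.0).round() as u8
        })
        .collect()
}

/// Step 5: interleave the mask as the alpha channel, RGB8 + A8 -> RGBA8.
fn attach_alpha(rgb: &[u8], alpha: &[u8]) -> Vec<u8> {
    assert_eq!(rgb.len(), alpha.len() * 3, "mask must match the image size");
    rgb.chunks_exact(3)
        .zip(alpha)
        .flat_map(|(px, &a)| [px[0], px[1], px[2], a])
        .collect()
}
```

The planar layout is the main thing to get right in step 2: libvips hands you interleaved RGBRGB bytes, while most vision models expect all red values first, then all green, then all blue.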

[dependencies]
ort = { version = "=2.0.0-rc.12", features = ["download-binaries"] }

The download-binaries feature pulls a CPU build of ONNX Runtime at build time, so there is nothing to install on the box. That alone deleted half the "deploy a Python service" anxiety.

Wall #1: the model name lied about its size

I grabbed isnet-general-use.onnx from the rembg releases, expecting ~44 MB. What landed was 171 MB. My first thought was a broken download or an HTML error page renamed to .onnx. Quick check:

file models/isnet-general-use.onnx
head -c 200 models/isnet-general-use.onnx | xxd | head

The header showed a genuine ONNX producer string (pytorch 1.13.1) and tensor names like input_image and conv_in.weight. So it was a valid ONNX model, just heavier than I expected. Lesson: verify that the file is actually what you think it is before you spend an hour debugging "why is RAM so high."
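The same sanity check can run in code at startup. The sketch below is a hypothetical helper, not part of ort or libvips: it flags a file that starts like an HTML page, and otherwise checks whether the first byte is 0x08, the protobuf tag for field 1 (ir_version) of an ONNX ModelProto. That second check is a heuristic that assumes the exporter wrote ir_version first, which is common but not guaranteed by protobuf.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

#[derive(Debug, PartialEq)]
enum ModelSniff {
    LooksLikeHtml,
    LooksLikeProtobuf,
    Unknown,
}

/// Classify the first bytes of a supposed .onnx file.
fn sniff_model_header(header: &[u8]) -> ModelSniff {
    // Skip leading whitespace and lowercase a short prefix for the HTML check.
    let prefix: Vec<u8> = header
        .iter()
        .copied()
        .skip_while(|b| b.is_ascii_whitespace())
        .take(16)
        .map(|b| b.to_ascii_lowercase())
        .collect();
    if prefix.starts_with(b"<!doctype") || prefix.starts_with(b"<html") {
        ModelSniff::LooksLikeHtml
    } else if header.first() == Some(&0x08) {
        // 0x08 = field 1 (ir_version) with varint wire type in an ONNX ModelProto.
        ModelSniff::LooksLikeProtobuf
    } else {
        ModelSniff::Unknown
    }
}

/// Return the file size plus a verdict based on its first 64 bytes.
fn sniff_model_file(path: &Path) -> io::Result<(u64, ModelSniff)> {
    let size = std::fs::metadata(path)?.len();
    let mut header = Vec::with_capacity(64);
    File::open(path)?.take(64).read_to_end(&mut header)?;
    Ok((size, sniff_model_header(&header)))
}
```

Logging the size next to the verdict would have answered both questions from this wall at once: "is it HTML?" and "why is it 171 MB?".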

Wall #2: `ort` errors are not `Send + Sync`

My first compile against anyhow hit me with this:

the trait `Sync` is not implemented for `NonNull<OrtSessionOptions>`
required for `anyhow::Error` to implement `From<ort::Error<SessionBuilder>>`

anyhow::Error needs Send + Sync. The ort error type holds raw pointers into the ONNX Runtime C++ session, which are not Sync, so using ? to convert it straight into anyhow::Error does not compile.

The fix is to stringify the error at the boundary. Display gives you a String, and String is Send + Sync:

let session = build(model_path, intra_threads)
    .map_err(|e| anyhow!("ort session init: {e}"))?;

The pointer never leaves, only the message does. Once I understood why, the pattern was mechanical: every ort ? that crosses into anyhow gets a .map_err(|e| anyhow!("...: {e}"))?.
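Here is a std-only sketch of the same boundary that runs without ort or anyhow. FfiError is a hypothetical stand-in for an error that holds a raw pointer, and Box<dyn Error + Send + Sync> stands in for anyhow::Error because it has the same Send + Sync requirement. Formatting the error into a String is enough for ? to work, because the standard library converts a String into that box.

```rust
use std::error::Error;
use std::fmt;
use std::ptr::NonNull;

/// Hypothetical stand-in for an FFI error that keeps a raw pointer to native
/// state; `NonNull` makes it neither Send nor Sync, like the ort error above.
#[derive(Debug)]
struct FfiError {
    _handle: NonNull<u8>,
    msg: &'static str,
}

impl fmt::Display for FfiError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.msg)
    }
}

impl Error for FfiError {}

/// Same Send + Sync requirement as `anyhow::Error`.
type BoxError = Box<dyn Error + Send + Sync>;

/// Hypothetical session constructor that fails on demand.
fn build_session(fail: bool) -> Result<u32, FfiError> {
    if fail {
        Err(FfiError { _handle: NonNull::dangling(), msg: "model file not found" })
    } else {
        Ok(42)
    }
}

fn init(fail: bool) -> Result<u32, BoxError> {
    // `build_session(fail)?` would not compile here: FfiError is not Send + Sync.
    // Formatting it keeps only the message, and String converts into BoxError.
    let session = build_session(fail).map_err(|e| format!("ort session init: {e}"))?;
    Ok(session)
}
```

The trade-off is that the original error type is gone after map_err, so any code that needs to match on specific failure kinds has to do that before the conversion.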

Wall #3: `run` takes `&mut self`

This one actually changed my architecture. In this ort version, Session::run takes &mut self. I had the session behind an Arc in my Axum app state so it could be shared. You cannot get &mut through an Arc.

cannot borrow `self.session` as mutable, as it is behind a `&` reference

The options were a session pool or a Mutex. Since my traffic is low and I already gate inference to one job at a time, I wrapped the session in a Mutex:

use std::sync::Mutex;
use ort::session::Session;

pub struct BgRemover {
    session: Mutex<Session>,
}

remove(&self) stays &self, so Arc<BgRemover> still works in app state. The Mutex hands out the &mut for the single inference call. With a one-permit semaphore in front, the mutex never even contends. When traffic grows, the upgrade path is a pool of sessions, but that is a future-me problem.
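To show why the Mutex keeps remove at &self, and what the pool-of-sessions upgrade could look like, here is a std-only sketch. FakeSession is a hypothetical stand-in for ort's Session whose only relevant property is that run takes &mut self; SessionPool and PooledSession are illustrative names, not an existing crate's API. A pool with one session behaves like the Mutex version; with N sessions, up to N inferences run at once and later callers wait on a Condvar until a session comes back.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Hypothetical stand-in for `ort::session::Session`: `run` needs `&mut self`.
struct FakeSession {
    runs: u32,
}

impl FakeSession {
    fn run(&mut self, input: &[f32]) -> Vec<f32> {
        self.runs += 1;
        input.iter().map(|x| x * 0.5).collect()
    }
}

/// The approach from the post: one session behind a Mutex.
struct BgRemover {
    session: Mutex<FakeSession>,
}

impl BgRemover {
    /// `&self` is enough; the lock hands out `&mut FakeSession` for one call.
    fn remove(&self, input: &[f32]) -> Vec<f32> {
        let mut session = self.session.lock().unwrap();
        session.run(input)
    }
}

/// Share one remover across threads, the way `Arc<BgRemover>` sits in app state.
fn run_concurrently(remover: Arc<BgRemover>, n: usize) -> Vec<Vec<f32>> {
    let handles: Vec<_> = (0..n)
        .map(|i| {
            let r = Arc::clone(&remover);
            thread::spawn(move || r.remove(&[i as f32]))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

/// The pool upgrade: callers wait only when every session is busy.
struct SessionPool {
    free: Mutex<Vec<FakeSession>>,
    returned: Condvar,
}

/// A checked-out session; it goes back to the pool when dropped.
struct PooledSession<'a> {
    pool: &'a SessionPool,
    session: Option<FakeSession>,
}

impl SessionPool {
    fn new(n: usize) -> Self {
        let sessions = (0..n).map(|_| FakeSession { runs: 0 }).collect();
        SessionPool { free: Mutex::new(sessions), returned: Condvar::new() }
    }

    fn checkout(&self) -> PooledSession<'_> {
        let mut free = self.free.lock().unwrap();
        while free.is_empty() {
            free = self.returned.wait(free).unwrap();
        }
        // The pool lock is held only for this pop, never during inference.
        let session = free.pop();
        PooledSession { pool: self, session }
    }
}

impl PooledSession<'_> {
    fn run(&mut self, input: &[f32]) -> Vec<f32> {
        self.session.as_mut().expect("present until drop").run(input)
    }
}

impl Drop for PooledSession<'_> {
    fn drop(&mut self) {
        if let Some(session) = self.session.take() {
            self.pool.free.lock().unwrap().push(session);
            self.pool.returned.notify_one();
        }
    }
}
```

Returning the session in Drop means an early return or an error path in the handler still puts the session back, so the pool cannot leak capacity.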

Wall #4: `*mut VipsImage` is not `Send`

libvips image pointers are not Send, which means they cannot be held across an .await. If I ran inference directly in the async handler, the compiler would stop me (Axum requires handler futures to be Send), and even if it did not, a multi-second CPU inference on an async worker thread would tie up that worker and stall every other task scheduled on it.

The answer is spawn_blocking. The entire libvips + inference chain runs on a dedicated blocking thread and returns finished PNG bytes (which are Send):

let png = tokio::task::spawn_blocking(move || {
    let _permit = permit;       // hold the semaphore for the whole job
    remover.remove(&bytes)
}).await??;

Every VipsImage is created and dropped inside that closure, never crossing an await point. The async runtime stays free to serve everything else while one image is being cut out.
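The rule that makes this work is that only owned, Send data crosses the thread boundary, while the non-Send handle is born and dies inside the closure. The sketch below shows that shape with std::thread::spawn standing in for tokio::task::spawn_blocking (both require a Send + 'static closure and a Send return value). FakeVipsImage is a hypothetical stand-in whose PhantomData<*mut u8> marker makes it !Send like a real *mut VipsImage, and invert is a placeholder for the real resize, inference and composite chain.

```rust
use std::marker::PhantomData;
use std::thread;

/// Hypothetical stand-in for a libvips image: the `*mut u8` marker makes it
/// !Send, the same property that `*mut VipsImage` has.
struct FakeVipsImage {
    pixels: Vec<u8>,
    _not_send: PhantomData<*mut u8>,
}

impl FakeVipsImage {
    fn decode(bytes: &[u8]) -> Self {
        FakeVipsImage { pixels: bytes.to_vec(), _not_send: PhantomData }
    }

    /// Placeholder for the real resize / inference / composite chain.
    fn invert(&self) -> FakeVipsImage {
        let pixels = self.pixels.iter().map(|p| 255 - p).collect();
        FakeVipsImage { pixels, _not_send: PhantomData }
    }

    fn encode(&self) -> Vec<u8> {
        self.pixels.clone()
    }
}

/// Only owned, Send data (`Vec<u8>`) enters and leaves the worker closure;
/// every `FakeVipsImage` is created and dropped inside it.
fn process_off_thread(bytes: Vec<u8>) -> Vec<u8> {
    thread::spawn(move || {
        let img = FakeVipsImage::decode(&bytes);
        img.invert().encode()
    })
    .join()
    .expect("worker thread panicked")
}
```

Decoding outside the closure and capturing the image instead would fail to compile, because the closure would then capture a !Send value; that is the same class of error the libvips pointers trigger in an async handler.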

The part that surprised me: privacy came for free

Because the handler returns the PNG straight in the HTTP response, the image is never written to disk. It comes in as multipart bytes, gets processed in memory, and the result streams back. Nothing is stored, nothing is queued, nothing to clean up.

I did not plan that as a feature; it fell out of the architecture. But "your photo is processed in memory and never saved" is a genuinely strong thing to be able to say, and it is true, not marketing.

Does it actually work?

Yes. The first real test through Postman, with a HEIC photo, returned 200 OK and a transparent PNG. The model is ISNet (the IS-Net dichotomous segmentation architecture), and on clean subjects such as products, people and logos, the cutout is sharp.

What I would tell past-me

rembg is "just" ONNX + image ops. If you already have an image library, you can skip the Python entirely with ort.
The ort 2.0 API churns between rc versions. Pin the exact version and expect to fix one or two method names.
spawn_blocking is not optional for CPU-heavy, non-Send work. It is the whole reason the server stays responsive.
Constraints ("no Python") are annoying in the moment and clarifying in hindsight. The Rust-native version is one binary, one deploy, nothing to babysit.

If you want to see the result, background removal is live and free (no signup, no watermark) on Convertify. Upload a photo, get a transparent PNG. It runs the exact pipeline above.

Next week: turning one tool into a set of use-case pages (passport photos, product shots) without drowning in duplicate content. That one is more SEO than Rust, but the build-in-public log continues.

What would you have done differently on the &mut self session problem? A pool, a mutex, something smarter? Curious how others handle shared ONNX sessions under load.

Top comments (9)

UnitBuilds • Jun 29

You had me hooked at Rust. Note: this was a test to see whether it would fail, not whether it would work. I had already removed the background; I wanted to see whether it would strip the sky or garble the wave that extended out of the circle. Good job!

Serhii Kalyna • Jun 29

Thank you for the feedback, it is very important to me!

mote • Jun 30

I want to push back gently on the Mutex choice, only because I went down the same path last year and hit a real cost.

I started with the exact same setup: Arc with Mutex and a 1-permit semaphore. It worked fine until I added a batch endpoint. The mutex turned what should have been parallel CPU work (libvips normalize step in particular) into strict serialization. Latency on batch-4 was 3.8x single request, not 4x. The semaphore was masking it.

What I ended up doing: ditched the Mutex and used an r2d2-style pool of N pre-warmed sessions (N = num_cpus::get() / 2, since I also serve web traffic on the box). Each request checks out a session, runs inference, and returns it. The pool itself sits behind a Mutex, but checkout is a quick atomic, not an inference call. The Mutex around the session was the wrong granularity.

For the &mut Session, you can also sidestep the Mutex entirely by going through unsafe (UnsafeCell) since ort guarantees internal thread safety, but the pool felt less cursed and the perf was within 2%.

Did you measure how much of your single-request latency was actually ort inference vs libvips pre/post? On my box u2net was ~80% inference, so even an oversized pool doesn't help if your batch size is small.

Serhii Kalyna • Jul 1

No, I didn't measure how the latency splits between inference and libvips. Thank you for sharing your experience.

Valentyn Kit • Jul 2

You can keep ort::Error typed longer than you did here — inference is a single synchronous call, so stringify only where it actually crosses a thread or await boundary, not at the source. Convert too early and you lose the ability to match on it, e.g. treating "model failed to load" differently from "bad input tensor" once you add retries.

Serhii Kalyna • Jul 2

Thank you for your opinion; it sounds reasonable.

mote • Jun 30

The decision to stay in Rust and avoid bolting on a Python process is the right call. I went through the same calculus when adding ML inference to an edge deployment. The moment you introduce a second runtime, your operational complexity doubles. You now have two things to monitor, two things that can fail, two startup sequences. On a modest VPS that tradeoff is brutal.

The ort crate is solid for this. I hit the same Send + Sync wall you described. The workaround I found was wrapping the session in an Arc and spawning inference on a dedicated thread, then passing results back through a channel. Not elegant, but it works and the latency is acceptable for single-image processing.

The 171MB model size is a real issue for deployment. Did you look into ONNX quantization at all? I managed to get a u2net variant down to about 43MB with dynamic quantization without noticeable quality loss for background removal. The ort crate supports quantized models without extra config.

For the persistence layer question, I have been working on moteDB which is an embedded database in Rust designed for exactly this kind of edge AI workload. The use case here would be caching inference results and managing model metadata without a separate database server. Would love to hear your thoughts on what storage patterns you are using for the processed images.

How are you handling the model loading at startup? Does it block the server or do you lazy-load on first request?

Serhii Kalyna • Jul 1

I load the model at startup. For now, that is enough.

Hossein Yazdi • Jun 30

Nice write-up. I actually like that you shared the parts that didn't work instead of just the final result.

Keeping everything in Rust also makes deployment much cleaner. One runtime, one binary, and one less service to maintain is a pretty big win, especially for solo projects.

Also, for those who're interested in image editing tools, here are some interesting ones in this collection: 11 Best Image Editing Tools