Virtual try-on has been a "coming soon" feature for most AI APIs. The models that actually work well either carried non-commercial licenses, needed 48GB+ of VRAM, or required DensePose infrastructure that nobody explains how to set up.
We shipped it this week on PixelAPI. Here's an honest breakdown of what we built, what we learned, and what the results actually look like.
What We Built
Endpoint: POST /v1/virtual-tryon
Pricing: 50 credits per try-on ($0.05)
Categories: upperbody, lowerbody, dress
The pipeline: you send a person image + a garment image → you get back the person wearing the garment. That's the promise. Here's the reality of what makes it actually work.
The Tech Stack
We evaluated several models before picking one:
| Model | License | VRAM | Notes |
|---|---|---|---|
| CatVTON | CC BY-NC-SA ❌ | 8GB | Non-commercial only |
| OOTDiffusion | Apache 2.0 ✅ | 12GB | Decent quality |
| Leffa | MIT ✅ | 20-24GB | CVPR 2025, best quality |
We went with Leffa. MIT licensed, state-of-the-art quality, and its attention flow mechanism genuinely preserves fine garment details like text, patterns, and hardware.
The full pipeline on our infrastructure:
- Gateway: FastAPI on our load balancer
- Inference: RTX 6000 Ada (48GB) on LLM3 — the only machine in our cluster with enough VRAM
- Queue: Redis (pixelapi:vton) → dedicated worker
- Preprocessing: SCHP (body parsing) + DensePose (body UV maps) + MediaPipe (pose landmarks)
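The gateway-to-worker handoff is a standard job-queue pattern. Here's a minimal in-process sketch of it; we use Python's queue.Queue as a stand-in for the Redis list (in production this would be LPUSH/BRPOP on pixelapi:vton), and the names enqueue_job and vton_worker are illustrative, not our actual code:

```python
import json
import queue
import threading
import uuid

# Stand-in for the Redis list pixelapi:vton. job_store stands in for the
# Redis hash that the polling endpoint reads.
job_queue = queue.Queue()
job_store = {}

def enqueue_job(person_b64, garment_b64, category):
    """Gateway side: register the job, push it onto the queue, return an ID."""
    job_id = str(uuid.uuid4())
    job_store[job_id] = {"status": "queued"}
    job_queue.put(json.dumps({
        "job_id": job_id,
        "person_image": person_b64,
        "garment_image": garment_b64,
        "category": category,
    }))
    return job_id

def vton_worker():
    """GPU worker side: pop one job, run inference, store the result."""
    payload = json.loads(job_queue.get(timeout=1))
    job_id = payload["job_id"]
    job_store[job_id]["status"] = "processing"
    # ... preprocessing (SCHP + DensePose + MediaPipe) + Leffa inference ...
    job_store[job_id].update(status="completed", result_image_b64="...")

job_id = enqueue_job("...", "...", "upperbody")
worker = threading.Thread(target=vton_worker)
worker.start()
worker.join()
print(job_store[job_id]["status"])  # completed
```

The point of the queue is backpressure: the gateway answers instantly with a job ID, and the single GPU worker drains jobs at whatever rate the hardware allows.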
The Results
Upperbody — Graphic Tee
Model wearing a Kickers shirt → try on a graphic tee:
The logo, typography, and sleeve details from the flat-lay garment are faithfully transferred. Face, pose, and lower body preserved exactly.
Upperbody — Branded Shirt
Same person, different garment:
Lowerbody — Dark Denim Jeans
Black trousers → dark denim. Waistband hardware (copper rivets), pocket structure, and fabric texture all transfer:
How to Use It
```python
import base64
import time

import requests

api_key = "your_api_key"
headers = {"Authorization": f"Bearer {api_key}"}

def image_to_base64(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

# Submit job
response = requests.post(
    "https://api.pixelapi.dev/v1/virtual-tryon",
    headers=headers,
    json={
        "person_image": image_to_base64("person.jpg"),
        "garment_image": image_to_base64("shirt.jpg"),
        "category": "upperbody",  # upperbody | lowerbody | dress
        "n_steps": 30,            # 20-50, more = better quality
        "image_scale": 2.5,       # guidance scale
    },
)
job_id = response.json()["job_id"]

# Poll for the result
while True:
    result = requests.get(
        f"https://api.pixelapi.dev/v1/virtual-tryon/jobs/{job_id}",
        headers=headers,
    ).json()
    if result["status"] == "completed":
        with open("output.jpg", "wb") as f:
            f.write(base64.b64decode(result["result_image_b64"]))
        print("Done! Credits used:", result["credits_used"])
        break
    elif result["status"] == "failed":
        print("Error:", result["error_message"])
        break
    time.sleep(5)
```
What Makes Inputs Work Well
After extensive testing, we found that input quality is the single biggest factor in result quality. Here's what we learned the hard way:
Person Image
✅ Full-body or torso-visible standing pose
✅ Front-facing
✅ Clean/plain background
✅ Good lighting, person clearly visible
❌ Headshots or portraits (no torso = mask goes on face)
❌ Complex backgrounds
❌ Multiple people
Garment Image
✅ Single garment only
✅ Flat-lay on white/plain background
✅ Ghost mannequin or product photography style
✅ Front view, full garment visible
❌ Garment worn on a person (model wearing it)
❌ Multiple garments in one shot
❌ Folded or partially visible garment
❌ Complex busy backgrounds
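You can catch most bad person images before spending credits. This is an illustrative client-side sketch, not part of the API: the function name, thresholds, and the corner-brightness heuristic for "busy background" are all our assumptions, using Pillow:

```python
import statistics

from PIL import Image

def precheck_person_image(path, min_side=512):
    """Cheap sanity checks before submitting. Returns a list of issues
    (empty list = probably fine). Heuristics are illustrative only."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    issues = []
    if min(w, h) < min_side:
        issues.append(f"image is small ({w}x{h}); use a larger photo")
    if w > h:
        issues.append("landscape orientation; full-body shots are usually portrait")
    # Rough background check: sample the four corners. A plain backdrop
    # keeps their brightness close together; a busy scene usually doesn't.
    corners = [img.getpixel(p) for p in [(0, 0), (w - 1, 0), (0, h - 1), (w - 1, h - 1)]]
    brightness = [sum(c) / 3 for c in corners]
    if statistics.pstdev(brightness) > 40:
        issues.append("corner brightness varies a lot; background may be too busy")
    return issues

# Usage: submit only if precheck_person_image("person.jpg") returns []
```

It won't catch everything (it can't tell a headshot from a torso shot), but it filters the cheap failures for free.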
What We Had to Figure Out
DensePose is non-negotiable. The Leffa model uses body UV maps (DensePose IUV) to understand where each body part is in 3D space. Without it, the garment texture lands in completely wrong places. We tried substituting with color segmentation maps — garbage results. Real DensePose only.
SCHP for masking beats heuristics. We tried MediaPipe pose landmarks → polygon masks for the garment-agnostic region. Worked on neutral poses, completely broke when someone's arms were raised or in an unusual position. SCHP body-part segmentation is the right approach — it follows the actual clothing boundary regardless of pose.
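Concretely, SCHP outputs a per-pixel parsing map of integer labels, and the garment-agnostic mask is the union of the clothing labels, dilated slightly so the new garment's silhouette isn't clipped by the old one. A minimal numpy sketch; the label IDs are from the LIP label set and are an assumption (they differ per SCHP checkpoint), and a real pipeline would use cv2.dilate instead of the roll trick:

```python
import numpy as np

# Illustrative label IDs (LIP-style); actual values depend on which
# SCHP checkpoint produced the parsing map.
UPPER_CLOTHES, DRESS, COAT = 5, 6, 7

def garment_agnostic_mask(parsing, dilate=8):
    """Return a boolean mask of pixels to repaint: the union of clothing
    labels, grown by `dilate` pixels in every direction."""
    mask = np.isin(parsing, [UPPER_CLOTHES, DRESS, COAT])
    # Cheap dilation by shifting the mask around and OR-ing. np.roll wraps
    # at the image borders, which is fine for a sketch but not production.
    out = mask.copy()
    for dy in range(-dilate, dilate + 1):
        for dx in range(-dilate, dilate + 1):
            out |= np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    return out
```

Because the mask follows the parsing labels rather than a pose-derived polygon, it tracks the actual clothing boundary no matter how the arms are positioned.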
Mushika GPU coordination. Our RTX 6000 Ada also runs another rendering pipeline (Mushika). We built a pause/resume mechanism so VTON inference gets full GPU priority, then Mushika restarts cleanly. Processing time: ~30 seconds at 30 inference steps.
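The pause/resume handshake reduces to a shared flag both sides respect. Here's a sketch of the pattern with threading.Event standing in for the cross-process signal (in production that would be something like a Redis flag or process signals); run_vton_job and mushika_render_step are illustrative names, not our actual code:

```python
import threading

# When set, Mushika may use the GPU; a try-on job clears it for the
# duration of inference.
gpu_free = threading.Event()
gpu_free.set()

def run_vton_job(infer):
    """Claim the GPU, run inference, then hand the GPU back."""
    gpu_free.clear()       # ask Mushika to pause at its next checkpoint
    try:
        return infer()     # ~30 s at 30 inference steps
    finally:
        gpu_free.set()     # Mushika resumes, even if inference raised

def mushika_render_step(render):
    """Mushika checks the flag between render steps and blocks while
    a try-on job holds the GPU."""
    gpu_free.wait()
    return render()
```

The try/finally matters: a crashed inference job must still release the GPU, or the rendering pipeline stays paused forever.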
Pricing
50 credits per try-on = $0.05
For comparison: most commercial VTON APIs charge $0.10–$0.20 per call, or require enterprise contracts. We're 2–4x cheaper with production-grade quality.
Get 10,000 free credits when you sign up: pixelapi.dev
What's Next
- Webhook support for async callbacks (instead of polling)
- Batch processing (multiple garments on same person)
- Video try-on (Wan 2.1 T2V — already partially integrated)
The API is live now. If you're building a fashion app, e-commerce product visualization, or anything that needs virtual try-on, we'd love to have you try it.
PixelAPI is an AI image and video API built for developers. We price 2x cheaper than the mainstream competitors because we think AI tools should be accessible.


