Virtual try-on has been a "coming soon" feature for most AI APIs. The models that actually work well either carried non-commercial licenses, needed 48GB+ of VRAM, or required DensePose infrastructure that nobody explains how to set up.
We shipped it this week on PixelAPI. Here's an honest breakdown of what we built, what we learned, and what the results actually look like.
What We Built
Endpoint: POST /v1/virtual-tryon
Pricing: 50 credits per try-on ($0.05)
Categories: upperbody, lowerbody, dress
The pipeline: you send a person image + a garment image → you get back the person wearing the garment. That's the promise. Here's the reality of what makes it actually work.
The Tech Stack
We evaluated several models before picking one:
| Model | License | VRAM | Notes |
|---|---|---|---|
| CatVTON | CC BY-NC-SA ❌ | 8GB | Non-commercial only |
| OOTDiffusion | Apache 2.0 ✅ | 12GB | Decent quality |
| Leffa | MIT ✅ | 20-24GB | CVPR 2025, best quality |
We went with Leffa. MIT licensed, state-of-the-art quality, and its attention flow mechanism genuinely preserves fine garment details like text, patterns, and hardware.
The full pipeline on our infrastructure:
- Gateway: FastAPI on our load balancer
- Inference: RTX 6000 Ada (48GB) on LLM3 — the only machine in our cluster with enough VRAM
- Queue: Redis (pixelapi:vton) → dedicated worker
- Preprocessing: SCHP (body parsing) + DensePose (body UV maps) + MediaPipe (pose landmarks)
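The gateway-to-worker handoff is a standard job-queue pattern. Here's a minimal in-process sketch of it; we use Python's queue.Queue as a stand-in for the Redis list (in production this would be LPUSH/BRPOP on pixelapi:vton), and the names enqueue_job and vton_worker are illustrative, not our actual code:

```python
import json
import queue
import threading
import uuid

# Stand-in for the Redis list pixelapi:vton. job_store stands in for the
# Redis hash that the polling endpoint reads.
job_queue = queue.Queue()
job_store = {}

def enqueue_job(person_b64, garment_b64, category):
    """Gateway side: register the job, push it onto the queue, return an ID."""
    job_id = str(uuid.uuid4())
    job_store[job_id] = {"status": "queued"}
    job_queue.put(json.dumps({
        "job_id": job_id,
        "person_image": person_b64,
        "garment_image": garment_b64,
        "category": category,
    }))
    return job_id

def vton_worker():
    """GPU worker side: pop one job, run inference, store the result."""
    payload = json.loads(job_queue.get(timeout=1))
    job_id = payload["job_id"]
    job_store[job_id]["status"] = "processing"
    # ... preprocessing (SCHP + DensePose + MediaPipe) + Leffa inference ...
    job_store[job_id].update(status="completed", result_image_b64="...")

job_id = enqueue_job("...", "...", "upperbody")
worker = threading.Thread(target=vton_worker)
worker.start()
worker.join()
print(job_store[job_id]["status"])  # completed
```

The point of the queue is backpressure: the gateway answers instantly with a job ID, and the single GPU worker drains jobs at whatever rate the hardware allows.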
The Results
Upperbody — Graphic Tee
Model wearing a Kickers shirt → try on a graphic tee:
The logo, typography, and sleeve details from the flat-lay garment are faithfully transferred. Face, pose, and lower body preserved exactly.
Upperbody — Branded Shirt
Same person, different garment:
Lowerbody — Dark Denim Jeans
Black trousers → dark denim. Waistband hardware (copper rivets), pocket structure, and fabric texture all transfer:
How to Use It
```python
import base64
import time

import requests

api_key = "your_api_key"
headers = {"Authorization": f"Bearer {api_key}"}

def image_to_base64(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

# Submit job
response = requests.post(
    "https://api.pixelapi.dev/v1/virtual-tryon",
    headers=headers,
    json={
        "person_image": image_to_base64("person.jpg"),
        "garment_image": image_to_base64("shirt.jpg"),
        "category": "upperbody",  # upperbody | lowerbody | dress
        "n_steps": 30,            # 20-50, more = better quality
        "image_scale": 2.5,       # guidance scale
    },
)
job_id = response.json()["job_id"]

# Poll for the result
while True:
    result = requests.get(
        f"https://api.pixelapi.dev/v1/virtual-tryon/jobs/{job_id}",
        headers=headers,
    ).json()
    if result["status"] == "completed":
        with open("output.jpg", "wb") as f:
            f.write(base64.b64decode(result["result_image_b64"]))
        print("Done! Credits used:", result["credits_used"])
        break
    elif result["status"] == "failed":
        print("Error:", result["error_message"])
        break
    time.sleep(5)
```
What Makes Inputs Work Well
After extensive testing, we found that input quality is the single biggest factor in result quality. Here's what we learned the hard way:
Person Image
✅ Full-body or torso-visible standing pose
✅ Front-facing
✅ Clean/plain background
✅ Good lighting, person clearly visible
❌ Headshots or portraits (no torso = mask goes on face)
❌ Complex backgrounds
❌ Multiple people
Garment Image
✅ Single garment only
✅ Flat-lay on white/plain background
✅ Ghost mannequin or product photography style
✅ Front view, full garment visible
❌ Garment worn on a person (model wearing it)
❌ Multiple garments in one shot
❌ Folded or partially visible garment
❌ Complex busy backgrounds
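You can catch most bad person images before spending credits. This is an illustrative client-side sketch, not part of the API: the function name, thresholds, and the corner-brightness heuristic for "busy background" are all our assumptions, using Pillow:

```python
import statistics

from PIL import Image

def precheck_person_image(path, min_side=512):
    """Cheap sanity checks before submitting. Returns a list of issues
    (empty list = probably fine). Heuristics are illustrative only."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    issues = []
    if min(w, h) < min_side:
        issues.append(f"image is small ({w}x{h}); use a larger photo")
    if w > h:
        issues.append("landscape orientation; full-body shots are usually portrait")
    # Rough background check: sample the four corners. A plain backdrop
    # keeps their brightness close together; a busy scene usually doesn't.
    corners = [img.getpixel(p) for p in [(0, 0), (w - 1, 0), (0, h - 1), (w - 1, h - 1)]]
    brightness = [sum(c) / 3 for c in corners]
    if statistics.pstdev(brightness) > 40:
        issues.append("corner brightness varies a lot; background may be too busy")
    return issues

# Usage: submit only if precheck_person_image("person.jpg") returns []
```

It won't catch everything (it can't tell a headshot from a torso shot), but it filters the cheap failures for free.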
What We Had to Figure Out
DensePose is non-negotiable. The Leffa model uses body UV maps (DensePose IUV) to understand where each body part is in 3D space. Without it, the garment texture lands in completely wrong places. We tried substituting with color segmentation maps — garbage results. Real DensePose only.
SCHP for masking beats heuristics. We tried MediaPipe pose landmarks → polygon masks for the garment-agnostic region. Worked on neutral poses, completely broke when someone's arms were raised or in an unusual position. SCHP body-part segmentation is the right approach — it follows the actual clothing boundary regardless of pose.
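Concretely, SCHP outputs a per-pixel parsing map of integer labels, and the garment-agnostic mask is the union of the clothing labels, dilated slightly so the new garment's silhouette isn't clipped by the old one. A minimal numpy sketch; the label IDs are from the LIP label set and are an assumption (they differ per SCHP checkpoint), and a real pipeline would use cv2.dilate instead of the roll trick:

```python
import numpy as np

# Illustrative label IDs (LIP-style); actual values depend on which
# SCHP checkpoint produced the parsing map.
UPPER_CLOTHES, DRESS, COAT = 5, 6, 7

def garment_agnostic_mask(parsing, dilate=8):
    """Return a boolean mask of pixels to repaint: the union of clothing
    labels, grown by `dilate` pixels in every direction."""
    mask = np.isin(parsing, [UPPER_CLOTHES, DRESS, COAT])
    # Cheap dilation by shifting the mask around and OR-ing. np.roll wraps
    # at the image borders, which is fine for a sketch but not production.
    out = mask.copy()
    for dy in range(-dilate, dilate + 1):
        for dx in range(-dilate, dilate + 1):
            out |= np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    return out
```

Because the mask follows the parsing labels rather than a pose-derived polygon, it tracks the actual clothing boundary no matter how the arms are positioned.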
Mushika GPU coordination. Our RTX 6000 Ada also runs another rendering pipeline (Mushika). We built a pause/resume mechanism so VTON inference gets full GPU priority, then Mushika restarts cleanly. Processing time: ~30 seconds at 30 inference steps.
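The pause/resume handshake reduces to a shared flag both sides respect. Here's a sketch of the pattern with threading.Event standing in for the cross-process signal (in production that would be something like a Redis flag or process signals); run_vton_job and mushika_render_step are illustrative names, not our actual code:

```python
import threading

# When set, Mushika may use the GPU; a try-on job clears it for the
# duration of inference.
gpu_free = threading.Event()
gpu_free.set()

def run_vton_job(infer):
    """Claim the GPU, run inference, then hand the GPU back."""
    gpu_free.clear()       # ask Mushika to pause at its next checkpoint
    try:
        return infer()     # ~30 s at 30 inference steps
    finally:
        gpu_free.set()     # Mushika resumes, even if inference raised

def mushika_render_step(render):
    """Mushika checks the flag between render steps and blocks while
    a try-on job holds the GPU."""
    gpu_free.wait()
    return render()
```

The try/finally matters: a crashed inference job must still release the GPU, or the rendering pipeline stays paused forever.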
Pricing
50 credits per try-on = $0.05
For comparison: most commercial VTON APIs charge $0.10–$0.20 per call, or require enterprise contracts. We're 2–4x cheaper with production-grade quality.
Get 10,000 free credits when you sign up: pixelapi.dev
What's Next
- Webhook support for async callbacks (instead of polling)
- Batch processing (multiple garments on same person)
- Video try-on (Wan 2.1 T2V — already partially integrated)
The API is live now. If you're building a fashion app, e-commerce product visualization, or anything that needs virtual try-on, we'd love to have you try it.
PixelAPI is an AI image and video API built for developers. We price 2x cheaper than the mainstream competitors because we think AI tools should be accessible.


