DEV Community

TildAlice

Originally published at tildalice.io

Triton vs TorchServe vs TFServing: 3 GPU Batch Tests

TorchServe Failed at Batch Size 8

Tested three inference servers (Triton, TorchServe, TensorFlow Serving) with a ResNet-50 model on an A10G GPU. Sent 1000 requests with batch sizes 1, 8, and 32. TorchServe crashed at batch 32, Triton handled all three, and TF Serving added 40ms overhead even at batch 1.

The goal: find which server gives the best throughput-per-dollar for a side project serving image classification. Running costs matter when you're paying hourly for GPU instances.


Why Batch Inference Servers Exist

You could just wrap PyTorch in FastAPI and call it a day. For low-traffic services, that works. But once you hit 10+ requests per second, you need dynamic batching — the server waits a few milliseconds to collect multiple requests, runs them as a single batch through the GPU, then fans out the results.
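To make the fan-in/fan-out concrete, here is a minimal sketch of the dynamic-batching loop these servers run internally, using plain asyncio. The names (`DynamicBatcher`, `infer_fn`) are illustrative, not any server's actual API, and `infer_fn` stands in for a batched model forward pass:

```python
import asyncio

class DynamicBatcher:
    """Collect requests briefly, run them as one batch, fan results back out."""

    def __init__(self, infer_fn, max_batch=32, max_wait_ms=5):
        self.infer_fn = infer_fn      # list of inputs -> list of outputs (one GPU pass)
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000.0
        self.queue = asyncio.Queue()

    async def predict(self, x):
        """Called per request; resolves once the batch containing x has run."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut

    async def run(self):
        """Worker loop: block for the first request, then wait up to
        max_wait_ms for more before dispatching the batch."""
        while True:
            batch = [await self.queue.get()]
            deadline = asyncio.get_running_loop().time() + self.max_wait
            while len(batch) < self.max_batch:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            outputs = self.infer_fn([x for x, _ in batch])
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)
```

The `max_wait_ms` knob is the core trade-off: a longer wait fills bigger batches (more throughput) at the cost of added tail latency for the first request in each batch.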

The math: a single ResNet-50 forward pass on batch size 1 takes ~8ms. Batch size 16 takes ~22ms. That's not 16× the time — GPUs love parallelism. Throughput goes from 125 req/s to ~700 req/s.
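Plugging the post's latency numbers in shows where those throughput figures come from:

```python
lat_b1 = 0.008    # seconds per forward pass, batch size 1 (from the post)
lat_b16 = 0.022   # seconds per forward pass, batch size 16 (from the post)

print(1 / lat_b1)      # 125.0 req/s at batch size 1
print(16 / lat_b16)    # ~727 req/s at batch size 16, roughly 5.8x
```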

Three main contenders:
- NVIDIA Triton Inference Server
- TorchServe
- TensorFlow Serving


