
Alex Spinov

BentoML Has a Free API: Deploy ML Models to Production in 5 Minutes

What is BentoML?

BentoML is an open-source framework for serving machine learning models. It turns any Python ML model into a production-ready API with batching, GPU support, and Docker packaging — without writing any infrastructure code.

Why BentoML?

  • Free and open-source — Apache 2.0 license
  • Any framework — PyTorch, TensorFlow, scikit-learn, HuggingFace, XGBoost
  • Adaptive batching — automatically batch requests for GPU efficiency
  • Docker-ready — one command to containerize
  • BentoCloud — managed deployment with a free tier
  • OpenLLM — specialized serving for large language models

Quick Start

pip install bentoml
# service.py
import bentoml
from transformers import pipeline

@bentoml.service(
    resources={"gpu": 1, "memory": "4Gi"},
    traffic={"timeout": 60}
)
class SentimentAnalysis:
    def __init__(self):
        self.classifier = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=0  # GPU
        )

    @bentoml.api
    def classify(self, text: str) -> dict:
        result = self.classifier(text)[0]
        return {"label": result["label"], "score": round(result["score"], 4)}

    @bentoml.api
    def batch_classify(self, texts: list[str]) -> list[dict]:
        results = self.classifier(texts)
        return [{"label": r["label"], "score": round(r["score"], 4)} for r in results]
# Run locally
bentoml serve service:SentimentAnalysis

# Test
curl -X POST http://localhost:3000/classify \
  -H 'Content-Type: application/json' \
  -d '{"text": "This product is amazing!"}'
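The same request can be sent from Python. A minimal stdlib-only client sketch — the `build_classify_request` helper and the base URL are illustrative, not part of BentoML:

```python
import json
import urllib.request

def build_classify_request(text: str, base_url: str = "http://localhost:3000"):
    # Illustrative helper: constructs the POST request /classify expects
    body = json.dumps({"text": text}).encode()
    return urllib.request.Request(
        f"{base_url}/classify",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_classify_request("This product is amazing!")
# With the service running: print(urllib.request.urlopen(req).read())
```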

Serve LLMs with OpenLLM

pip install openllm

# Serve any HuggingFace model
openllm start meta-llama/Meta-Llama-3-8B-Instruct

# OpenAI-compatible endpoint at localhost:3000
curl http://localhost:3000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'

Adaptive Batching

import bentoml
import numpy as np

@bentoml.service(traffic={"timeout": 60})
class ImageClassifier:
    # Batching options live on the API method, not in the traffic config
    @bentoml.api(
        batchable=True,
        max_batch_size=32,
        max_latency_ms=500,  # wait up to 500ms to fill a batch
    )
    def predict(self, images: list[np.ndarray]) -> list[str]:
        # BentoML automatically batches individual requests:
        # 100 individual API calls become ~4 batched GPU operations
        return self.model.predict(np.stack(images))
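The "100 calls become ~4 GPU operations" figure is just ceiling division over the batch size; a quick back-of-the-envelope check:

```python
import math

def batches_needed(n_requests: int, max_batch_size: int = 32) -> int:
    # Each batch holds at most max_batch_size queued requests
    return math.ceil(n_requests / max_batch_size)

print(batches_needed(100))  # 4 batched GPU operations for 100 individual calls
```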

Containerize and Deploy

# Build a Bento (production package)
bentoml build

# Containerize
bentoml containerize sentiment_analysis:latest

# Run with Docker
docker run -p 3000:3000 sentiment_analysis:latest

# Or deploy to BentoCloud
bentoml deploy .

BentoML vs Alternatives

| Feature | BentoML | FastAPI | TorchServe | Triton |
|---|---|---|---|---|
| ML-specific | Yes | General-purpose | PyTorch only | Multi-framework |
| Adaptive batching | Built-in | Manual | Built-in | Built-in |
| Docker packaging | One command | Manual | Manual | Manual |
| GPU management | Automatic | Manual | Automatic | Automatic |
| OpenAI-compatible API | Via OpenLLM | Manual | No | No |
| Learning curve | Low | Low | High | Very high |

Real-World Impact

An e-commerce company served image classification models via Flask. At 1,000 requests/sec, each image was processed individually, and GPU utilization sat at 15%. After migrating to BentoML with adaptive batching, the same hardware handled 10,000 requests/sec at 85% GPU utilization. They cancelled their planned GPU scale-out and saved $15K/month.
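Taking the article's figures at face value, the gains work out as follows — a sanity check on the reported numbers, not an independent measurement:

```python
# Figures from the case study above
flask_rps, bento_rps = 1_000, 10_000
flask_util, bento_util = 0.15, 0.85

throughput_gain = bento_rps / flask_rps  # 10x on the same hardware
# Requests served per unit of fully-utilized GPU time:
flask_eff = flask_rps / flask_util
bento_eff = bento_rps / bento_util
print(throughput_gain, round(bento_eff / flask_eff, 2))
```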


Deploying ML models to production? I help teams build efficient serving infrastructure. Contact spinov001@gmail.com or explore my data tools on Apify.
