## What is BentoML?
BentoML is an open-source framework for serving machine learning models. It turns any Python ML model into a production-ready API with batching, GPU support, and Docker packaging — without writing any infrastructure code.
## Why BentoML?
- Free and open-source — Apache 2.0 license
- Any framework — PyTorch, TensorFlow, scikit-learn, HuggingFace, XGBoost
- Adaptive batching — automatically batch requests for GPU efficiency
- Docker-ready — one command to containerize
- BentoCloud — managed deployment with free tier
- OpenLLM — specialized serving for large language models
## Quick Start

```bash
pip install bentoml
```
```python
# service.py
import bentoml
from transformers import pipeline

@bentoml.service(
    resources={"gpu": 1, "memory": "4Gi"},
    traffic={"timeout": 60},
)
class SentimentAnalysis:
    def __init__(self):
        self.classifier = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=0,  # run on the first GPU
        )

    @bentoml.api
    def classify(self, text: str) -> dict:
        result = self.classifier(text)[0]
        return {"label": result["label"], "score": round(result["score"], 4)}

    @bentoml.api
    def batch_classify(self, texts: list[str]) -> list[dict]:
        results = self.classifier(texts)
        return [{"label": r["label"], "score": round(r["score"], 4)} for r in results]
```
```bash
# Run locally
bentoml serve service:SentimentAnalysis

# Test
curl -X POST http://localhost:3000/classify \
  -H 'Content-Type: application/json' \
  -d '{"text": "This product is amazing!"}'
```
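The same endpoint can also be called from Python. A minimal sketch, assuming the Quick Start server is running on `localhost:3000` and using BentoML's `SyncHTTPClient`; the `build_request` and `classify_remote` helper names are illustrative, not part of BentoML:

```python
def build_request(text: str) -> dict:
    """Payload shape the /classify HTTP endpoint expects."""
    return {"text": text}

def classify_remote(text: str, base_url: str = "http://localhost:3000") -> dict:
    """Call the running SentimentAnalysis service.

    Assumes `pip install bentoml` and the server from the Quick Start
    running; bentoml is imported lazily so this sketch loads without it.
    """
    import bentoml
    # SyncHTTPClient maps each @bentoml.api method to a client method
    with bentoml.SyncHTTPClient(base_url) as client:
        return client.classify(text=text)
```

With the server up, `classify_remote("This product is amazing!")` should return the same label/score dict as the curl call above.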
## Serve LLMs with OpenLLM

```bash
pip install openllm

# Serve any HuggingFace model
openllm start meta-llama/Llama-3-8b-chat-hf

# OpenAI-compatible endpoint at localhost:3000
curl http://localhost:3000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "meta-llama/Llama-3-8b-chat-hf", "messages": [{"role": "user", "content": "Hello!"}]}'
```
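Because the endpoint speaks the OpenAI wire format, the official `openai` Python SDK can point at it directly. A sketch, assuming `pip install openai` and the server above running; the `api_key` value is a dummy placeholder, since a local server typically ignores it:

```python
def build_messages(prompt: str) -> list[dict]:
    """One user turn in the OpenAI chat messages format."""
    return [{"role": "user", "content": prompt}]

def chat(prompt: str, base_url: str = "http://localhost:3000/v1") -> str:
    """Send one chat turn to the local OpenLLM server."""
    from openai import OpenAI  # imported lazily; requires `pip install openai`
    client = OpenAI(base_url=base_url, api_key="na")  # key unused locally
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3-8b-chat-hf",
        messages=build_messages(prompt),
    )
    return resp.choices[0].message.content
```

Swapping `base_url` is the only change needed to move the same client code between a local OpenLLM server and the hosted OpenAI API.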
Adaptive Batching
@bentoml.service(
traffic={
"timeout": 60,
"max_batch_size": 32,
"batch_wait_timeout": 0.5 # Wait up to 500ms to fill batch
}
)
class ImageClassifier:
@bentoml.api(batchable=True)
def predict(self, images: list[np.ndarray]) -> list[str]:
# BentoML automatically batches individual requests
# 100 individual API calls become ~4 batched GPU operations
return self.model.predict(np.stack(images))
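To make the batching arithmetic concrete, here is a toy model (plain Python, not BentoML internals) of how individual requests collapse into batches once the wait window fills; `group_into_batches` is a hypothetical helper, not a BentoML API:

```python
def group_into_batches(n_requests: int, max_batch_size: int) -> list[int]:
    """Toy model of adaptive batching: n individual requests are grouped
    into batches of at most max_batch_size, assuming the wait window
    always fills a batch when enough requests are queued."""
    batches = []
    remaining = n_requests
    while remaining > 0:
        take = min(remaining, max_batch_size)
        batches.append(take)
        remaining -= take
    return batches

# 100 individual calls with max_batch_size=32 -> 4 GPU invocations
print(group_into_batches(100, 32))       # [32, 32, 32, 4]
print(len(group_into_batches(100, 32)))  # 4
```

In the real system batch sizes vary with arrival rate and the latency cap, but the effect is the same: the GPU runs a handful of large operations instead of hundreds of tiny ones.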
## Containerize and Deploy

```bash
# Build a Bento (production package)
bentoml build

# Containerize
bentoml containerize sentiment_analysis:latest

# Run with Docker
docker run -p 3000:3000 sentiment_analysis:latest

# Or deploy to BentoCloud
bentoml deploy .
```
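`bentoml build` reads a `bentofile.yaml` in the project root. A minimal sketch for the Quick Start service; the label values and package list are illustrative, so adjust them to your project:

```yaml
service: "service:SentimentAnalysis"
labels:
  owner: ml-team
include:
  - "service.py"
python:
  packages:
    - torch
    - transformers
```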
## BentoML vs Alternatives
| Feature | BentoML | FastAPI | TorchServe | Triton |
|---|---|---|---|---|
| ML-specific | Yes | General | PyTorch only | Multi-framework |
| Adaptive batching | Built-in | Manual | Built-in | Built-in |
| Docker packaging | One command | Manual | Manual | Manual |
| GPU management | Automatic | Manual | Automatic | Automatic |
| OpenAI-compatible | Via OpenLLM | Manual | No | No |
| Learning curve | Low | Low | High | Very high |
## Real-World Impact
An e-commerce company served image classification models via Flask. At 1,000 requests/sec, each image was processed individually and GPU utilization sat at 15%. After migrating to BentoML with adaptive batching, the same hardware handled 10,000 requests/sec at 85% GPU utilization. They cancelled their GPU scale-out plan, saving $15K/month.
Deploying ML models to production? I help teams build efficient serving infrastructure. Contact spinov001@gmail.com or explore my data tools on Apify.