## What is BentoML?
BentoML is an open-source framework for serving machine learning models. It turns any Python ML model into a production-ready API with batching, GPU support, and Docker packaging — without writing any infrastructure code.
## Why BentoML?
- Free and open-source — Apache 2.0 license
- Any framework — PyTorch, TensorFlow, scikit-learn, HuggingFace, XGBoost
- Adaptive batching — automatically batch requests for GPU efficiency
- Docker-ready — one command to containerize
- BentoCloud — managed deployment with free tier
- OpenLLM — specialized serving for large language models
## Quick Start

```bash
pip install bentoml
```
```python
# service.py
import bentoml
from transformers import pipeline

@bentoml.service(
    resources={"gpu": 1, "memory": "4Gi"},
    traffic={"timeout": 60},
)
class SentimentAnalysis:
    def __init__(self):
        self.classifier = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=0,  # run on the first GPU
        )

    @bentoml.api
    def classify(self, text: str) -> dict:
        result = self.classifier(text)[0]
        return {"label": result["label"], "score": round(result["score"], 4)}

    @bentoml.api
    def batch_classify(self, texts: list[str]) -> list[dict]:
        results = self.classifier(texts)
        return [{"label": r["label"], "score": round(r["score"], 4)} for r in results]
```
```bash
# Run locally
bentoml serve service:SentimentAnalysis

# Test
curl -X POST http://localhost:3000/classify \
  -H 'Content-Type: application/json' \
  -d '{"text": "This product is amazing!"}'
```
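The same endpoint can also be called from Python. A minimal sketch, assuming the Quick Start server is running on `localhost:3000` and using BentoML's `SyncHTTPClient`; the `build_request` and `classify_remote` helper names are illustrative, not part of BentoML:

```python
def build_request(text: str) -> dict:
    """Payload shape the /classify HTTP endpoint expects."""
    return {"text": text}

def classify_remote(text: str, base_url: str = "http://localhost:3000") -> dict:
    """Call the running SentimentAnalysis service.

    Assumes `pip install bentoml` and the server from the Quick Start
    running; bentoml is imported lazily so this sketch loads without it.
    """
    import bentoml
    # SyncHTTPClient maps each @bentoml.api method to a client method
    with bentoml.SyncHTTPClient(base_url) as client:
        return client.classify(text=text)
```

With the server up, `classify_remote("This product is amazing!")` should return the same label/score dict as the curl call above.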
## Serve LLMs with OpenLLM

```bash
pip install openllm

# Serve any HuggingFace model
openllm start meta-llama/Llama-3-8b-chat-hf

# OpenAI-compatible endpoint at localhost:3000
curl http://localhost:3000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "meta-llama/Llama-3-8b-chat-hf", "messages": [{"role": "user", "content": "Hello!"}]}'
```
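Because the endpoint speaks the OpenAI wire format, the official `openai` Python SDK can point at it directly. A sketch, assuming `pip install openai` and the server above running; the `api_key` value is a dummy placeholder, since a local server typically ignores it:

```python
def build_messages(prompt: str) -> list[dict]:
    """One user turn in the OpenAI chat messages format."""
    return [{"role": "user", "content": prompt}]

def chat(prompt: str, base_url: str = "http://localhost:3000/v1") -> str:
    """Send one chat turn to the local OpenLLM server."""
    from openai import OpenAI  # imported lazily; requires `pip install openai`
    client = OpenAI(base_url=base_url, api_key="na")  # key unused locally
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3-8b-chat-hf",
        messages=build_messages(prompt),
    )
    return resp.choices[0].message.content
```

Swapping `base_url` is the only change needed to move the same client code between a local OpenLLM server and the hosted OpenAI API.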
Adaptive Batching
@bentoml.service(
traffic={
"timeout": 60,
"max_batch_size": 32,
"batch_wait_timeout": 0.5 # Wait up to 500ms to fill batch
}
)
class ImageClassifier:
@bentoml.api(batchable=True)
def predict(self, images: list[np.ndarray]) -> list[str]:
# BentoML automatically batches individual requests
# 100 individual API calls become ~4 batched GPU operations
return self.model.predict(np.stack(images))
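To make the batching arithmetic concrete, here is a toy model (plain Python, not BentoML internals) of how individual requests collapse into batches once the wait window fills; `group_into_batches` is a hypothetical helper, not a BentoML API:

```python
def group_into_batches(n_requests: int, max_batch_size: int) -> list[int]:
    """Toy model of adaptive batching: n individual requests are grouped
    into batches of at most max_batch_size, assuming the wait window
    always fills a batch when enough requests are queued."""
    batches = []
    remaining = n_requests
    while remaining > 0:
        take = min(remaining, max_batch_size)
        batches.append(take)
        remaining -= take
    return batches

# 100 individual calls with max_batch_size=32 -> 4 GPU invocations
print(group_into_batches(100, 32))       # [32, 32, 32, 4]
print(len(group_into_batches(100, 32)))  # 4
```

In the real system batch sizes vary with arrival rate and the latency cap, but the effect is the same: the GPU runs a handful of large operations instead of hundreds of tiny ones.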
## Containerize and Deploy

```bash
# Build a Bento (production package)
bentoml build

# Containerize
bentoml containerize sentiment_analysis:latest

# Run with Docker
docker run -p 3000:3000 sentiment_analysis:latest

# Or deploy to BentoCloud
bentoml deploy .
```
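`bentoml build` reads a `bentofile.yaml` in the project root. A minimal sketch for the Quick Start service; the label values and package list are illustrative, so adjust them to your project:

```yaml
service: "service:SentimentAnalysis"
labels:
  owner: ml-team
include:
  - "service.py"
python:
  packages:
    - torch
    - transformers
```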
## BentoML vs Alternatives
| Feature | BentoML | FastAPI | TorchServe | Triton |
|---|---|---|---|---|
| ML-specific | Yes | General | PyTorch only | Multi-framework |
| Adaptive batching | Built-in | Manual | Built-in | Built-in |
| Docker packaging | One command | Manual | Manual | Manual |
| GPU management | Automatic | Manual | Automatic | Automatic |
| OpenAI-compatible | Via OpenLLM | Manual | No | No |
| Learning curve | Low | Low | High | Very high |
## Real-World Impact
An e-commerce company served image classification models via Flask. At 1,000 requests/sec, each image was processed individually and GPU utilization sat at 15%. After migrating to BentoML with adaptive batching, the same hardware handled 10,000 requests/sec at 85% GPU utilization. They cancelled their GPU scale-out plan, saving $15K/month.
Deploying ML models to production? I help teams build efficient serving infrastructure. Contact spinov001@gmail.com or explore my data tools on Apify.