Joseph Tobi

How to Serve a PyTorch Model with FastAPI: A Complete Guide
Most machine learning tutorials stop at model training. You get a trained model, a good validation score, and then — nothing. No one tells you how to actually use that model in a real application.
In this tutorial I'll show you exactly how to take a trained PyTorch model and serve it as a REST API using FastAPI. By the end you'll have a working inference endpoint that any frontend or application can call to get predictions.
I built this exact pipeline for my house price estimator project — a PyTorch MLP model served via FastAPI with a React frontend. Everything in this tutorial comes from real production experience.
Prerequisites
Python 3.8+
Basic PyTorch knowledge
Basic understanding of REST APIs
pip installed
What We're Building
Trained PyTorch Model (.pth file)
        ↓
FastAPI Server
        ↓
POST /predict endpoint
        ↓
Returns prediction as JSON
A client sends input features as JSON. FastAPI preprocesses them, runs inference through the model, and returns the prediction. Simple, clean, production-ready.
Step 1 — Train and Save Your Model
First let's define a simple MLP model and save it after training.

# model.py

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),  # Regularization; disabled later by model.eval()
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, output_dim)
        )

    def forward(self, x):
        return self.net(x)
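Before training, it's worth a quick shape check. A minimal sketch, assuming the same dimensions used later in main.py (8 input features, hidden width 128, a single output):

import torch
from model import MLP

model = MLP(input_dim=8, hidden_dim=128, output_dim=1)
dummy = torch.randn(1, 8)   # One fake sample with 8 features
print(model(dummy).shape)   # torch.Size([1, 1])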

After training, save the model weights and scaler:

# train.py

import torch
import joblib
from model import MLP

# --- your training loop here ---

# Save model weights
torch.save(model.state_dict(), "model.pth")

# Save scaler — critical for consistent preprocessing
joblib.dump(scaler, "scaler.pkl")

print("Model and scaler saved successfully")
Two things are saved — the model weights and the scaler. Both are required for consistent inference.
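The training loop itself is elided above. If you want a runnable starting point, here is a minimal sketch; the random data, learning rate, and epoch count are placeholders rather than values from the original project:

import numpy as np
import torch
import torch.nn as nn
from sklearn.preprocessing import StandardScaler
from model import MLP

# Placeholder data: swap in your real features and targets
X = np.random.rand(500, 8).astype(np.float32)
y = np.random.rand(500, 1).astype(np.float32)

# Fit the scaler on training data only; this is the artifact you save
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = MLP(input_dim=8, hidden_dim=128, output_dim=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

inputs = torch.tensor(X_scaled, dtype=torch.float32)
targets = torch.tensor(y)

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()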
Step 2 — Understand Why You Save Both
This is the most important concept in production ML and the one most tutorials skip.
During training you fit a StandardScaler on your training data. This scaler learns the mean and standard deviation of each feature. During inference you must apply the exact same transformation using the exact same statistics.
If you refit the scaler on new data during inference, your features will be scaled differently from how the model was trained. The model receives input it has never seen before and predictions become unreliable.
Always save your fitted scaler. Always load it at inference time. Never refit it.
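To make the failure mode concrete, here is a small sketch with made-up numbers:

import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[100.0], [200.0], [300.0]])   # mean 200, std ~81.6
scaler = StandardScaler().fit(train)

new_sample = np.array([[250.0]])

# Correct: transform with the scaler fitted on training data
print(scaler.transform(new_sample))   # ~[[0.61]]

# Wrong: refitting centers the new data on itself
print(StandardScaler().fit_transform(new_sample))   # [[0.]] (meaningless to the model)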
Step 3 — Install Dependencies
pip install fastapi uvicorn torch joblib numpy scikit-learn
Step 4 — Build the FastAPI Application

# main.py

import torch
import joblib
import numpy as np
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from model import MLP

app = FastAPI(title="PyTorch Model API")

# Allow frontend applications to call this API
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

# --- Load model and scaler once on startup ---
# Loading inside the predict function would reload
# on every request — slow and inefficient

INPUT_DIM = 8
HIDDEN_DIM = 128
OUTPUT_DIM = 1

model = MLP(INPUT_DIM, HIDDEN_DIM, OUTPUT_DIM)
model.load_state_dict(
    torch.load("model.pth", map_location="cpu")
)
model.eval()  # Disables dropout during inference

scaler = joblib.load("scaler.pkl")

# --- Input schema ---

class PredictionInput(BaseModel):
    feature1: float
    feature2: float
    feature3: float
    feature4: float
    feature5: float
    feature6: float
    feature7: float
    feature8: float

# --- Prediction endpoint ---

@app.get("/")
def root():
    return {"status": "Model API is running"}

@app.post("/predict")
def predict(data: PredictionInput):
    try:
        # Build feature array
        features = np.array([[
            data.feature1,
            data.feature2,
            data.feature3,
            data.feature4,
            data.feature5,
            data.feature6,
            data.feature7,
            data.feature8,
        ]])

        # Preprocess using saved scaler
        features_scaled = scaler.transform(features)

        # Convert to tensor
        tensor = torch.tensor(
            features_scaled,
            dtype=torch.float32
        )

        # Run inference
        with torch.no_grad():
            # torch.no_grad() tells PyTorch not to
            # track gradients — faster and uses less memory
            prediction = model(tensor)

            # If you trained on log(target), reverse the
            # transformation; otherwise use prediction.item() directly
            result = torch.exp(prediction).item()

        return {
            "prediction": round(result, 2),
            "status": "success"
        }

    except Exception as e:
        raise HTTPException(
            status_code=500,
            detail=str(e)
        )

Step 5 — Run the Server
uvicorn main:app --reload --host 0.0.0.0 --port 8000
Your API is now running at http://localhost:8000
Visit http://localhost:8000/docs to see the automatic interactive documentation FastAPI generates. You can test your endpoint directly from the browser.
Step 6 — Test Your Endpoint
Using curl:
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "feature1": 1500,
    "feature2": 2005,
    "feature3": 7,
    "feature4": 3,
    "feature5": 2,
    "feature6": 2,
    "feature7": 850,
    "feature8": 1
  }'
Expected response:
{
  "prediction": 185432.50,
  "status": "success"
}
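You can also test without a running server using FastAPI's TestClient, which is convenient in CI. A minimal sketch (depending on your FastAPI version, you may need to pip install httpx and pytest):

# test_api.py
from fastapi.testclient import TestClient
from main import app

client = TestClient(app)

def test_predict():
    payload = {
        "feature1": 1500, "feature2": 2005, "feature3": 7, "feature4": 3,
        "feature5": 2, "feature6": 2, "feature7": 850, "feature8": 1,
    }
    response = client.post("/predict", json=payload)
    assert response.status_code == 200
    assert response.json()["status"] == "success"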
Step 7 — Connect a React Frontend
In your React component, call the API like this:
const getPrediction = async (formData) => {
  const response = await fetch("http://localhost:8000/predict", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(formData)
  });

  // Surface server-side errors instead of parsing a failed response
  if (!response.ok) {
    throw new Error(`Prediction failed with status ${response.status}`);
  }

  const data = await response.json();
  return data.prediction;
};
Step 8 — Deploy to Production
Frontend → Deploy to Vercel (free)
Backend → Deploy to Render.com (free tier available)
On Render, set your start command to:
uvicorn main:app --host 0.0.0.0 --port $PORT
Make sure your model.pth and scaler.pkl files are included in your GitHub repository so Render can access them during deployment.
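Render also needs a requirements.txt at the repository root so it can install your dependencies during the build. A minimal, unpinned sketch; in a real deployment, pin the versions you have actually tested:

# requirements.txt
fastapi
uvicorn
torch
joblib
numpy
scikit-learn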
Update your React frontend to use the Render URL instead of localhost:
const API_URL = "https://your-app.onrender.com/predict";
Key Concepts to Remember
model.eval() — Always call this after loading your model. It disables dropout and switches batch normalization to its stored running statistics; both layers behave differently during training versus inference. The sketch after this list shows the effect.
torch.no_grad() — Always wrap inference in this context manager. It disables gradient tracking which saves memory and speeds up inference significantly.
Scaler consistency — Save your fitted scaler during training and load the same artifact during inference. Never refit on new data.
Load once on startup — Load your model and scaler at the top of main.py, not inside the predict function. Loading on every request is slow and wasteful.
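Here is a small sketch demonstrating the first two points with a standalone dropout layer:

import torch
import torch.nn as nn

drop = nn.Dropout(0.5)
x = torch.ones(1, 4, requires_grad=True)

drop.train()      # Training mode: roughly half the values zeroed, the rest scaled by 2
print(drop(x))    # e.g. tensor([[2., 0., 2., 0.]])

drop.eval()       # Inference mode: dropout is a no-op
print(drop(x))    # tensor([[1., 1., 1., 1.]])

with torch.no_grad():   # No computation graph is built inside this block
    out = drop(x)
print(out.requires_grad)   # False: nothing to backpropagate through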
Conclusion
Serving a PyTorch model with FastAPI follows a consistent pattern regardless of your model architecture or problem type. Train your model, save both the weights and preprocessing artifacts, load them once on server startup, and expose a clean prediction endpoint.
This pattern is what separates ML engineers who build demo notebooks from those who build production systems. The model is only half the job — getting it into a working API that real applications can consume is the other half.
The complete code for this tutorial is available on my GitHub at github.com/josephtobimayokun
Joseph Tobi Mayokun is a full-stack developer and ML engineer, founder of Microlink — an AI-focused tech startup building intelligent software.
