Ayush

🧠 GenAI as a Backend Engineer: Part 1 - Model Serving

This series is meant to serve as a guide to understanding and getting started with the concepts surrounding AI. I am writing it as a backend engineer trying to level up.

I am using Google (yep, old school) and AI tools to form a roadmap, ask questions, and understand the concepts. The goal is to document the steps along the way.


🤔 Questions that will (hopefully) be answered by the end:

  • What are LLMs, Agents, DAGs, RAGs, Vector DBs, etc.?
  • How do these things really work? Not the math or the neural nets, but rather all the flows built around LLMs, etc.
  • More importantly: how can I, as a backend engineer, get started and contribute to building AI tools?

If you're also exploring, or already working in the field, do share your feedback, corrections, or suggestions in the comments.


πŸ—ΊοΈ Current plan:

Model Serving --> Airflow --> Vector DBs --> RAG Style Q&A + Llama Index


🚀 Part 1: Model Serving with FastAPI & TorchVision

In this step, I learned how a model is served behind an API. That's it: the same models from the earlier "ML" days (e.g., classification models), now exposed cleanly via an API.

Serving is about making the model available for real-time or batch predictions, efficiently, securely, and at scale.


🔹 What is a Model?

A model is the core of an ML system: a program trained on data to recognize patterns and make predictions.


🔹 What is Model Serving?

Model Serving is the process of putting that trained model behind an API (e.g., FastAPI), so it can take input and return predictions (inference).

Instead of bundling the model inside every client app, you host it once, centrally.

πŸ” Real-World Examples:

  • πŸ–ΌοΈ Image β†’ API β†’ Model returns: "cat" or "dog"
  • πŸ’¬ Chatbot message β†’ API β†’ LLM replies
  • πŸ“„ Transaction β†’ Fraud model β†’ "fraud" or "legit"

📚 Key Terms I Came Across

  • Inference → Running the model on new (unseen) input data
  • Model Hosting → Putting the model on a server (local or cloud) and exposing an API

πŸ“ Read: Model Serving 101 (Paywalled...)

🔥 Key Takeaways from the Article:

  • Model serving introduces a distinct set of challenges compared to a typical CRUD backend. It's as if the heavy-lifting data pipelines we used to run in the background now need to respond to client requests in real time, with strict performance and scalability requirements.

  • Key factors to balance:

    • 🚀 Throughput – predictions/sec
    • ⏱️ Latency – response time
    • 💰 Cost – infra & compute
  • 3 Fundamental Deployment Types (a sketch of the asynchronous flavour follows this list):

  1. Online Real-Time Inference
  2. Asynchronous Inference
  3. Offline Batch Transform
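
To make the contrast with the real-time endpoint we build below concrete, here is a minimal sketch of the asynchronous flavour using FastAPI's BackgroundTasks. The endpoint names, the in-memory results dict, and the run_inference placeholder are my own illustrative assumptions; a production setup would hand jobs to a worker via a proper queue (e.g., Celery or SQS) and a persistent store.

import uuid

from fastapi import BackgroundTasks, FastAPI

app = FastAPI()
results = {}  # in-memory job store, for illustration only

def run_inference(job_id: str, payload: str) -> None:
    # Stand-in for a slow model call
    results[job_id] = f"prediction for {payload}"

@app.post("/predict-async")
async def submit(payload: str, background_tasks: BackgroundTasks):
    # Respond immediately with a job id; inference runs after the response is sent
    job_id = str(uuid.uuid4())
    results[job_id] = "pending"
    background_tasks.add_task(run_inference, job_id, payload)
    return {"job_id": job_id}

@app.get("/predict-async/{job_id}")
async def poll(job_id: str):
    # Clients poll until the prediction replaces "pending"
    return {"status": results.get(job_id, "unknown")}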

🧰 Tools Used

⚡ FastAPI

A high-performance Python web framework for building APIs.

📖 Official Tutorial: FastAPI Docs

πŸ–ΌοΈ TorchVision

A interesting PyTorch library that provides:

  • Pretrained computer vision models (like resnet18, mobilenet) used for image or object classification
  • Tools for image transformations and loading

💡 Why it's great: You don't need to train from scratch. You can just load and serve a powerful image model in minutes.
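
For instance, swapping in a different pretrained architecture is just a one-line change (a sketch; MobileNetV3 uses the same ImageNet labels and preprocessing as the ResNet-18 we serve below):

from torchvision import models

# Any classification model from torchvision.models works the same way
model = models.mobilenet_v3_small(weights=models.MobileNet_V3_Small_Weights.DEFAULT)
model.eval()  # switch to inference mode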

📖 Read: TorchVision Basics


βš™οΈ Step-by-Step: Serving a Model via API

πŸ”Ή Step 1: Setup Environment

mkdir model-serving && cd model-serving
python3 -m venv venv
venv\Scripts\activate  # On Mac/Linux: source venv/bin/activate
pip install fastapi uvicorn torch torchvision pillow requests

🔹 Step 2: Create the Model Loader

📄 model.py

import torch
from torchvision import models, transforms
from PIL import Image
import requests

# Load ResNet-18 with ImageNet-pretrained weights and switch to inference mode
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

# Standard ImageNet preprocessing: resize, convert to a tensor, normalize
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    )
])

# Fetch the 1000 ImageNet class labels from the file PyTorch maintains
LABELS_URL = "https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt"
response = requests.get(LABELS_URL, timeout=10)
response.raise_for_status()
labels = [line.strip() for line in response.text.splitlines()]

def predict(image_path):
    """Return the single most likely label for the image."""
    image = Image.open(image_path).convert("RGB")
    input_tensor = transform(image).unsqueeze(0)  # add a batch dimension

    with torch.no_grad():  # no gradients needed for inference
        output = model(input_tensor)
    pred_index = output.argmax().item()
    return labels[pred_index]

def predict_topk(img_path):
    """Return the top 5 labels with their confidence scores."""
    image = Image.open(img_path).convert("RGB")
    input_tensor = transform(image).unsqueeze(0)

    with torch.no_grad():
        output = model(input_tensor)
    probs = torch.nn.functional.softmax(output[0], dim=0)  # logits -> probabilities
    top_p, top_i = torch.topk(probs, 5)
    top_labels = [(labels[idx], round(prob.item(), 4)) for idx, prob in zip(top_i, top_p)]
    return top_labels

🔎 Interesting learning: models.ResNet18_Weights.DEFAULT loads a model pre-trained on 1000 categories. These labels come from a public file maintained by PyTorch (based on the ImageNet dataset). The model outputs a score for each category; after softmax these become probabilities, and the index with the highest score maps to the predicted label.

🔍 I have included both predict and predict_topk methods to demonstrate how you can work with the model's output. While predict gives you just the top result, predict_topk provides the top 5 predictions along with confidence scores. This is useful when you want more insight into what the model "thinks" the image could be, especially in ambiguous cases.


🔹 Step 3: Create FastAPI App

📄 app.py

import os
import shutil
import tempfile

from fastapi import FastAPI, UploadFile, File
from model import predict

app = FastAPI()

@app.post("/predict")
async def classify_image(file: UploadFile = File(...)):
    # Save the upload to a temporary file; tempfile picks a valid location
    # on every OS (a hard-coded /tmp path breaks on Windows)
    suffix = os.path.splitext(file.filename or "")[1]
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as buffer:
        shutil.copyfileobj(file.file, buffer)
        temp_path = buffer.name

    try:
        result = predict(temp_path)  # or predict_topk(temp_path)
    finally:
        os.remove(temp_path)  # always clean up the temp file

    return {"prediction": result}

🔹 Step 4: Run the API

uvicorn app:app --reload

Go to: http://127.0.0.1:8000/docs

📤 Upload any image → get a prediction
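
You can also hit the endpoint from code instead of the Swagger UI. A minimal client sketch using requests ("cat.jpg" is a placeholder path):

import requests

# Post a local image to the running API
with open("cat.jpg", "rb") as f:
    response = requests.post(
        "http://127.0.0.1:8000/predict",
        files={"file": ("cat.jpg", f, "image/jpeg")},
    )
print(response.json())  # e.g. {"prediction": "tabby"}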


✅ That's it! You've served your first model. Now you can integrate this into real-world applications or scale it using cloud services.


💻 GitHub Repo

🔗 Other Parts:


🪜 Coming Up Next

Next, I plan to explore Apache Airflow and how it's used for ML workflows and pipelines, going one layer deeper each time 💡

Due to technical issues on Windows, this topic had to be pushed back. Take a look at RAG and Vector DBs instead.
