Ayush

🧠 GenAI as a Backend Engineer: Part 1 - Model Serving

This series is meant to serve as a guide to understanding and getting started with the concepts surrounding AI. I am writing it as a backend engineer trying to level up.

I am using Google (yep, old school) and AI tools to form a roadmap, ask questions, and understand the concepts. The goal is to document the steps along the way.


🤔 Questions that will (hopefully) be answered by the end:

  • What are LLMs, Agents, DAGs, RAGs, Vector DBs, etc.?
  • How do these things really work? Not the math or the neural nets, but rather all the flows built around LLMs, etc.
  • More importantly: how can I, as a backend engineer, get started and contribute to building AI tools?

If you're also exploring, or already working in the field, do share your feedback, corrections, or suggestions in the comments.


πŸ—ΊοΈ Current plan:

Model Serving --> Airflow --> Vector DBs --> RAG Style Q&A + Llama Index


🚀 Part 1: Model Serving with FastAPI & TorchVision

In this step, I learned how a model is served behind an API. That's it: the same models from the earlier "ML" days (e.g., classification models), now exposed cleanly via an API.

Serving is about making the model available for real-time or batch predictions, efficiently, securely, and at scale.


🔹 What is a Model?

A model is the core of an ML system: a program trained on data to recognize patterns and make predictions.


🔹 What is Model Serving?

Model Serving is the process of putting that trained model behind an API (e.g., FastAPI), so it can take input and return predictions (inference).

Instead of bundling the model inside every client app, you host it once, centrally.

πŸ” Real-World Examples:

  • πŸ–ΌοΈ Image β†’ API β†’ Model returns: "cat" or "dog"
  • πŸ’¬ Chatbot message β†’ API β†’ LLM replies
  • πŸ“„ Transaction β†’ Fraud model β†’ "fraud" or "legit"

📚 Key Terms I Came Across

  • Inference → Running the model on new (unseen) input data
  • Model Hosting → Putting the model on a server (local or cloud) and exposing an API

πŸ“ Read: Model Serving 101 (Paywalled...)

🔥 Key Takeaways from the Article:

  • Model serving introduces a distinct set of challenges compared to a typical CRUD backend. It's as if the heavy-lifting data pipelines we used to run in the background now need to respond to client requests in real time, with strict performance and scalability requirements.

  • Key factors to balance:

    • 🚀 Throughput – predictions/sec
    • ⏱️ Latency – response time
    • 💰 Cost – infra & compute
  • 3 Fundamental Deployment Types (a sketch of the asynchronous flavour follows this list):

  1. Online Real-Time Inference
  2. Asynchronous Inference
  3. Offline Batch Transform
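
To make the contrast with the real-time endpoint we build below concrete, here is a minimal sketch of the asynchronous flavour using FastAPI's BackgroundTasks. The endpoint names, the in-memory results dict, and the run_inference placeholder are my own illustrative assumptions; a production setup would hand jobs to a worker via a proper queue (e.g., Celery or SQS) and a persistent store.

import uuid

from fastapi import BackgroundTasks, FastAPI

app = FastAPI()
results = {}  # in-memory job store, for illustration only

def run_inference(job_id: str, payload: str) -> None:
    # Stand-in for a slow model call
    results[job_id] = f"prediction for {payload}"

@app.post("/predict-async")
async def submit(payload: str, background_tasks: BackgroundTasks):
    # Respond immediately with a job id; inference runs after the response is sent
    job_id = str(uuid.uuid4())
    results[job_id] = "pending"
    background_tasks.add_task(run_inference, job_id, payload)
    return {"job_id": job_id}

@app.get("/predict-async/{job_id}")
async def poll(job_id: str):
    # Clients poll until the prediction replaces "pending"
    return {"status": results.get(job_id, "unknown")}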

🧰 Tools Used

⚡ FastAPI

A high-performance Python web framework for building APIs.

📖 Official Tutorial: FastAPI Docs

πŸ–ΌοΈ TorchVision

A interesting PyTorch library that provides:

  • Pretrained computer vision models (like resnet18, mobilenet) used for image or object classification
  • Tools for image transformations and loading

💡 Why it's great: You don't need to train from scratch. You can just load and serve a powerful image model in minutes.
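
For instance, swapping in a different pretrained architecture is just a one-line change (a sketch; MobileNetV3 uses the same ImageNet labels and preprocessing as the ResNet-18 we serve below):

from torchvision import models

# Any classification model from torchvision.models works the same way
model = models.mobilenet_v3_small(weights=models.MobileNet_V3_Small_Weights.DEFAULT)
model.eval()  # switch to inference mode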

📖 Read: TorchVision Basics


βš™οΈ Step-by-Step: Serving a Model via API

πŸ”Ή Step 1: Setup Environment

mkdir model-serving && cd model-serving
python3 -m venv venv
venv\Scripts\activate  # On Mac/Linux: source venv/bin/activate
pip install fastapi uvicorn torch torchvision pillow requests

🔹 Step 2: Create the Model Loader

📄 model.py

import torch
from torchvision import models, transforms
from PIL import Image
import requests

# Load ResNet-18 with ImageNet-pretrained weights and switch to inference mode
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

# Standard ImageNet preprocessing: resize, convert to a tensor, normalize
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    )
])

# Fetch the 1000 ImageNet class labels from the file PyTorch maintains
LABELS_URL = "https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt"
response = requests.get(LABELS_URL, timeout=10)
response.raise_for_status()
labels = [line.strip() for line in response.text.splitlines()]

def predict(image_path):
    """Return the single most likely label for the image."""
    image = Image.open(image_path).convert("RGB")
    input_tensor = transform(image).unsqueeze(0)  # add a batch dimension

    with torch.no_grad():  # no gradients needed for inference
        output = model(input_tensor)
    pred_index = output.argmax().item()
    return labels[pred_index]

def predict_topk(img_path):
    """Return the top 5 labels with their confidence scores."""
    image = Image.open(img_path).convert("RGB")
    input_tensor = transform(image).unsqueeze(0)

    with torch.no_grad():
        output = model(input_tensor)
    probs = torch.nn.functional.softmax(output[0], dim=0)  # logits -> probabilities
    top_p, top_i = torch.topk(probs, 5)
    top_labels = [(labels[idx], round(prob.item(), 4)) for idx, prob in zip(top_i, top_p)]
    return top_labels

🔎 Interesting learning: models.ResNet18_Weights.DEFAULT loads a model pre-trained on 1000 categories. These labels come from a public file maintained by PyTorch (based on the ImageNet dataset). The model outputs a score for each category; after softmax these become probabilities, and the index with the highest score maps to the predicted label.

🔍 I have included both predict and predict_topk methods to demonstrate how you can work with the model's output. While predict gives you just the top result, predict_topk provides the top 5 predictions along with confidence scores. This is useful when you want more insight into what the model "thinks" the image could be, especially in ambiguous cases.


🔹 Step 3: Create FastAPI App

📄 app.py

import os
import shutil
import tempfile

from fastapi import FastAPI, UploadFile, File
from model import predict

app = FastAPI()

@app.post("/predict")
async def classify_image(file: UploadFile = File(...)):
    # Save the upload to a temporary file; tempfile picks a valid location
    # on every OS (a hard-coded /tmp path breaks on Windows)
    suffix = os.path.splitext(file.filename or "")[1]
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as buffer:
        shutil.copyfileobj(file.file, buffer)
        temp_path = buffer.name

    try:
        result = predict(temp_path)  # or predict_topk(temp_path)
    finally:
        os.remove(temp_path)  # always clean up the temp file

    return {"prediction": result}

🔹 Step 4: Run the API

uvicorn app:app --reload

Go to: http://127.0.0.1:8000/docs

📤 Upload any image → get a prediction
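
You can also hit the endpoint from code instead of the Swagger UI. A minimal client sketch using requests ("cat.jpg" is a placeholder path):

import requests

# Post a local image to the running API
with open("cat.jpg", "rb") as f:
    response = requests.post(
        "http://127.0.0.1:8000/predict",
        files={"file": ("cat.jpg", f, "image/jpeg")},
    )
print(response.json())  # e.g. {"prediction": "tabby"}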


✅ That's it! You've served your first model. Now you can integrate this into real-world applications or scale it using cloud services.


💻 GitHub Repo

🔗 Other Parts:


🪜 Coming Up Next

Next, I plan to explore Apache Airflow and how it's used for ML workflows and pipelines, going one layer deeper each time 💡

Due to technical issues on Windows, this topic had to be pushed back. Take a look at RAG and Vector DBs instead.
