SHIVAM UPADHYAY

Posted on Jun 23

Building a Self-Hosted MLOps Platform from Scratch with FastAPI, PostgreSQL, GCS, and Docker

#ai #devops #python #mlops

Introduction

Over the past few months, I set out to answer a simple question:

What does it take to build a production-style MLOps platform from scratch?

While tools like MLflow, SageMaker Model Registry, and Kubeflow provide powerful capabilities, I wanted to understand the underlying architecture and engineering decisions behind them. Instead of only using existing platforms, I decided to build my own.

The result is Kimchi, a self-hosted MLOps platform that supports:

Model Registration
Model Versioning
Artifact Management
Experiment Tracking
Audit Logging
Role-Based Access Control (RBAC)
Python SDK
CI/CD Automation
Google Cloud Storage Integration

This project was built using:

FastAPI
PostgreSQL
SQLAlchemy
Alembic
Docker
Google Cloud Storage (GCS)
GitHub Actions
JWT Authentication

In this article, I'll share how the project evolved from a simple model registry into a multi-phase MLOps platform.

The Problem

In many machine learning projects, trained models end up scattered across ad hoc locations:

Local machines
Shared drives
Cloud storage buckets
Team chat channels

Over time, teams start asking difficult questions:

Which model version is currently in production?
Who promoted this version?
Which dataset was used to train it?
What accuracy did it achieve?
Where is the model artifact stored?

Without a centralized platform, there is no reliable way to answer them.

I wanted a system that could act as a single source of truth for the ML lifecycle.

Phase 1 — Building the Foundation

The first phase focused on creating a Model Registry.

Key capabilities:

Model registration
Model versioning
Artifact upload and download
JWT Authentication
Role-Based Access Control
Google Cloud Storage integration
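To make the access-control idea concrete, here is a minimal sketch of a role check, not Kimchi's actual implementation. It assumes the JWT has already been verified and decoded into a claims dict with hypothetical "sub" and "role" fields; the role names, their ordering, and the function names are all assumptions made for illustration.

```python
from functools import wraps

# Hypothetical role hierarchy: a higher number includes every lower permission.
ROLE_LEVELS = {"viewer": 1, "editor": 2, "admin": 3}


class PermissionDenied(Exception):
    """Raised when the caller's role ranks below the required one."""


def require_role(minimum: str):
    """Decorator that rejects callers whose role ranks below `minimum`."""
    def decorator(func):
        @wraps(func)
        def wrapper(user: dict, *args, **kwargs):
            # `user` stands in for the decoded claims of a verified JWT.
            level = ROLE_LEVELS.get(user.get("role", ""), 0)
            if level < ROLE_LEVELS[minimum]:
                raise PermissionDenied(f"{user.get('sub')} needs role {minimum!r}")
            return func(user, *args, **kwargs)
        return wrapper
    return decorator


@require_role("editor")
def upload_artifact(user: dict, model: str) -> str:
    return f"{user['sub']} uploaded an artifact for {model}"


@require_role("admin")
def delete_version(user: dict, model: str, version: int) -> str:
    return f"{user['sub']} deleted {model} v{version}"
```

In a FastAPI service the same check would more naturally live in a dependency, but the core comparison stays the same: look up the caller's role and refuse the operation if it ranks too low.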

Architecture

The platform separates metadata from artifacts:

PostgreSQL stores model metadata
GCS stores model artifacts
FastAPI exposes REST APIs

This mirrors a common pattern in production MLOps systems: small, queryable metadata lives in a relational database, while large binary artifacts live in object storage.
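The split between metadata and artifacts can be sketched in a few lines. This is an illustration under stated assumptions, not the project's schema: an in-memory SQLite database stands in for PostgreSQL, a plain dict stands in for the GCS bucket, and the table names, columns, and the bucket name example-bucket are all hypothetical.

```python
import hashlib
import sqlite3

# Stand-ins so the sketch runs anywhere: SQLite for PostgreSQL, a dict for GCS.
db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.executescript("""
CREATE TABLE models (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE model_versions (
    id           INTEGER PRIMARY KEY,
    model_id     INTEGER NOT NULL REFERENCES models(id),
    version      INTEGER NOT NULL,
    artifact_uri TEXT NOT NULL,
    sha256       TEXT NOT NULL,
    UNIQUE (model_id, version)
);
""")
bucket: dict = {}


def register_version(model_name: str, artifact: bytes) -> tuple:
    """Store the bytes in the bucket and only a pointer to them in the database."""
    db.execute("INSERT OR IGNORE INTO models (name) VALUES (?)", (model_name,))
    (model_id,) = db.execute(
        "SELECT id FROM models WHERE name = ?", (model_name,)
    ).fetchone()
    (latest,) = db.execute(
        "SELECT COALESCE(MAX(version), 0) FROM model_versions WHERE model_id = ?",
        (model_id,),
    ).fetchone()
    version = latest + 1

    # The artifact goes to object storage under a predictable key.
    key = f"models/{model_name}/v{version}/model.bin"
    bucket[key] = artifact

    # The database row records where the artifact lives and its checksum.
    uri = f"gs://example-bucket/{key}"
    db.execute(
        "INSERT INTO model_versions (model_id, version, artifact_uri, sha256) "
        "VALUES (?, ?, ?, ?)",
        (model_id, version, uri, hashlib.sha256(artifact).hexdigest()),
    )
    db.commit()
    return version, uri
```

Keeping only the URI and checksum in the database keeps rows small and queries fast, while the checksum lets a later download be verified against what was originally uploaded.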

Phase 2 — Governance Through Audit Logging

Once the registry was functional, I realized something important:

The platform had no memory.

If someone promoted a model to production or deleted a version, there was no way to know:

Who made the change
What changed
When it happened

To solve this, I implemented an Audit Logging system.

Every write operation now creates an immutable audit record. The recorded actions are:

CREATE
UPDATE
DELETE
PROMOTE

This allows teams to trace every important lifecycle event.

Example:

Shivam
PROMOTE
Model Version 5
staging → production
2026-06-16 05:01

This feature brought governance and accountability to the platform.
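An append-only audit trail like the one described above could look roughly like this. It is a simplified in-memory sketch, assuming hypothetical field names (actor, action, resource, change); a real deployment would persist the records in the database rather than a Python list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

ACTIONS = {"CREATE", "UPDATE", "DELETE", "PROMOTE"}


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class AuditRecord:
    actor: str
    action: str
    resource: str
    change: str
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class AuditLog:
    """Append-only log: records can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._records = []

    def record(self, actor: str, action: str, resource: str, change: str) -> AuditRecord:
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action}")
        entry = AuditRecord(actor, action, resource, change)
        self._records.append(entry)
        return entry

    def history(self, resource: str) -> tuple:
        # Return a tuple so callers cannot mutate the underlying list.
        return tuple(r for r in self._records if r.resource == resource)
```

Calling `log.record("Shivam", "PROMOTE", "Model Version 5", "staging -> production")` would reproduce the example entry, with the timestamp filled in automatically in UTC.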

Phase 3 — Experiment Tracking

At this stage, models could be registered, but training information was still buried inside artifact files.

For example, files such as metrics.json and params.json could be uploaded, but the platform couldn't answer:

Which model has the highest accuracy?
Which version achieved the best F1 score?
What hyperparameters were used?

To solve this, I introduced Experiment Tracking.

A new Training Run layer stores:

Hyperparameters
Metrics
Dataset hashes
Framework information
Training duration

This enables powerful queries such as:

GET /experiments?min_accuracy=0.90

and

GET /models/{id}/versions/compare

The platform now supports experiment search, filtering, and version comparison.
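The logic behind a filter like min_accuracy and a version comparison can be sketched as plain functions over training-run records. The TrainingRun fields and function names below are assumptions for illustration; in the real platform these queries would run against the database rather than a Python list.

```python
from dataclasses import dataclass


@dataclass
class TrainingRun:
    version: int
    params: dict          # hyperparameters
    metrics: dict         # e.g. {"accuracy": 0.93, "f1": 0.91}
    dataset_sha256: str
    framework: str
    duration_s: float


def filter_runs(runs, min_accuracy: float):
    """Keep runs whose recorded accuracy is at least `min_accuracy`."""
    return [r for r in runs if r.metrics.get("accuracy", 0.0) >= min_accuracy]


def best_run(runs, metric: str) -> TrainingRun:
    """Return the run with the highest value for `metric` (ValueError if none has it)."""
    return max((r for r in runs if metric in r.metrics), key=lambda r: r.metrics[metric])


def compare_runs(a: TrainingRun, b: TrainingRun) -> dict:
    """Metric deltas (b minus a) and the hyperparameters that differ."""
    shared = a.metrics.keys() & b.metrics.keys()
    metric_delta = {k: round(b.metrics[k] - a.metrics[k], 6) for k in shared}
    all_params = a.params.keys() | b.params.keys()
    param_changes = {
        k: (a.params.get(k), b.params.get(k))
        for k in all_params
        if a.params.get(k) != b.params.get(k)
    }
    return {"metrics": metric_delta, "params": param_changes}
```

Storing metrics and hyperparameters as structured fields is what makes these questions cheap to answer; with only uploaded JSON files, each query would require downloading and parsing every artifact.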

Phase 4 — Improving Developer Experience

After building the core platform, I shifted focus to usability.

Python SDK

Instead of interacting with the API using raw curl commands, users can now work with a Python SDK.

Example:

from kimchi_sdk import ModelRegistry

registry = ModelRegistry(
    url="http://localhost:8000",
    username="user",
    password="password"
)

registry.create_model(
    name="fraud-detector"
)

GitHub Actions CI/CD

Every push automatically triggers:

Dependency installation
Automated test execution
Validation checks

This ensures platform stability and catches regressions early.

Admin APIs

Role management is now handled through APIs instead of direct database modifications.

Refresh Tokens

Refresh tokens provide a smoother authentication experience and prepare the platform for future UI integrations.
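The refresh flow can be illustrated with a small sketch. Everything here is an assumption rather than Kimchi's implementation: opaque random strings stand in for the signed JWT access token, the token lifetimes are made up, and an in-memory dict replaces a persistent token store. The sketch also rotates refresh tokens (each one is valid for a single use), which is a common hardening choice rather than something the platform is confirmed to do.

```python
import secrets
import time
from typing import Optional

ACCESS_TTL_S = 15 * 60            # assumed: short-lived access token
REFRESH_TTL_S = 7 * 24 * 3600     # assumed: long-lived refresh token

# refresh token -> (username, expires_at); a real service would persist this.
_refresh_store: dict = {}


def issue_tokens(username: str, now: Optional[float] = None) -> dict:
    """Issue a fresh access/refresh pair for `username`."""
    now = time.time() if now is None else now
    refresh_token = secrets.token_urlsafe(32)
    _refresh_store[refresh_token] = (username, now + REFRESH_TTL_S)
    return {
        "access_token": secrets.token_urlsafe(32),  # stand-in for a signed JWT
        "access_expires_at": now + ACCESS_TTL_S,
        "refresh_token": refresh_token,
    }


def rotate_tokens(refresh_token: str, now: Optional[float] = None) -> dict:
    """Exchange a valid refresh token for a new pair; the old token is consumed."""
    now = time.time() if now is None else now
    entry = _refresh_store.pop(refresh_token, None)  # pop makes it single-use
    if entry is None:
        raise PermissionError("unknown or already used refresh token")
    username, expires_at = entry
    if now >= expires_at:
        raise PermissionError("refresh token expired")
    return issue_tokens(username, now)
```

The client keeps using the access token until it expires, then calls the refresh endpoint once instead of asking the user to log in again.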

Lessons Learned

Building this platform taught me much more than how to write APIs.

The key lessons were:

Designing for Evolution

Each phase built on top of the previous one.

Instead of trying to build everything at once, I focused on creating a strong foundation and then extending it incrementally.

Governance Matters

Tracking who changed what becomes increasingly important as teams grow.

Audit logging turned out to be one of the most valuable additions.

Experiment Metadata Is First-Class Data

Metrics and hyperparameters should not live only inside artifact files.

Making them queryable dramatically improves usability.

Developer Experience Matters

Features like SDKs and CI/CD pipelines are often overlooked but are critical for adoption.

What's Next?

The next phase focuses on Governance and Enterprise Readiness:

Signed GCS URLs
Model Cards
Dataset Lineage
Drift Monitoring
Kubernetes Deployments
Terraform Infrastructure

The goal is to continue evolving Kimchi into a production-grade MLOps platform.

Final Thoughts

This project started as an exercise to understand how model registries work internally.

Over time, it evolved into a hands-on exploration of:

MLOps
Backend Engineering
Cloud Infrastructure
Governance
Experiment Tracking
Developer Experience

Building systems from scratch is one of the best ways to understand the trade-offs behind production software.

If you're interested in MLOps, backend engineering, or cloud-native systems, I highly recommend building something similar yourself.

Project Repository

GitHub Repository:

[Add your GitHub repository URL here]

If you have suggestions, feedback, or ideas for future phases, feel free to connect with me.

DEV Community