DEV Community

SHIVAM UPADHYAY


Building a Self-Hosted MLOps Platform from Scratch with FastAPI, PostgreSQL, GCS, and Docker

Introduction

Over the past few months, I set out to answer a simple question:

What does it take to build a production-style MLOps platform from scratch?

While tools like MLflow, SageMaker Model Registry, and Kubeflow provide powerful capabilities, I wanted to understand the underlying architecture and engineering decisions behind them. Instead of only using existing platforms, I decided to build my own.

The result is Kimchi, a self-hosted MLOps platform that supports:

  • Model Registration
  • Model Versioning
  • Artifact Management
  • Experiment Tracking
  • Audit Logging
  • Role-Based Access Control (RBAC)
  • Python SDK
  • CI/CD Automation
  • Google Cloud Storage Integration

This project was built using:

  • FastAPI
  • PostgreSQL
  • SQLAlchemy
  • Alembic
  • Docker
  • Google Cloud Storage (GCS)
  • GitHub Actions
  • JWT Authentication

In this article, I'll share how the project evolved from a simple model registry into a multi-phase MLOps platform.


The Problem

In many machine learning projects, model artifacts end up scattered across ad hoc locations:

  • Local machines
  • Shared drives
  • Cloud storage buckets
  • Team chat channels

Over time, teams start asking difficult questions:

  • Which model version is currently in production?
  • Who promoted this version?
  • Which dataset was used to train it?
  • What accuracy did it achieve?
  • Where is the model artifact stored?

Without a centralized platform, answering these questions becomes difficult.

I wanted a system that could act as a single source of truth for the ML lifecycle.


Phase 1 — Building the Foundation

The first phase focused on creating a Model Registry.

Key capabilities:

  • Model registration
  • Model versioning
  • Artifact upload and download
  • JWT Authentication
  • Role-Based Access Control
  • Google Cloud Storage integration

Architecture

The platform separates metadata from artifacts:

  • PostgreSQL stores model metadata
  • GCS stores model artifacts
  • FastAPI exposes REST APIs

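This mirrors a common pattern in production MLOps systems. MLflow, for example, keeps run and registry metadata in a database-backed store and model files in a separate artifact store.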
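To make the metadata/artifact split concrete, here is a minimal sketch using only the standard library. It is not Kimchi's actual code: an in-memory SQLite database stands in for PostgreSQL, a plain dict stands in for a GCS bucket, and the table name, column names, bucket name, and both functions are hypothetical.

```python
import hashlib
import sqlite3

# Hypothetical stand-in for a GCS bucket: maps object URI -> raw bytes.
artifact_store: dict[str, bytes] = {}

# In-memory SQLite stands in for PostgreSQL in this sketch.
db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE model_versions (
        id INTEGER PRIMARY KEY,
        model_name TEXT NOT NULL,
        version INTEGER NOT NULL,
        artifact_uri TEXT NOT NULL,
        sha256 TEXT NOT NULL,
        UNIQUE (model_name, version)
    )
    """
)


def register_version(model_name: str, payload: bytes) -> int:
    """Store the artifact bytes, then record only a pointer and a checksum."""
    row = db.execute(
        "SELECT COALESCE(MAX(version), 0) + 1 FROM model_versions "
        "WHERE model_name = ?",
        (model_name,),
    ).fetchone()
    version = row[0]
    # Illustrative URI layout; the bucket name is made up.
    uri = f"gs://example-bucket/{model_name}/v{version}/model.bin"
    artifact_store[uri] = payload
    db.execute(
        "INSERT INTO model_versions (model_name, version, artifact_uri, sha256) "
        "VALUES (?, ?, ?, ?)",
        (model_name, version, uri, hashlib.sha256(payload).hexdigest()),
    )
    db.commit()
    return version


def download_version(model_name: str, version: int) -> bytes:
    """Look up the pointer in the metadata store, then fetch and verify the bytes."""
    row = db.execute(
        "SELECT artifact_uri, sha256 FROM model_versions "
        "WHERE model_name = ? AND version = ?",
        (model_name, version),
    ).fetchone()
    if row is None:
        raise KeyError(f"{model_name} v{version} is not registered")
    uri, expected = row
    payload = artifact_store[uri]
    if hashlib.sha256(payload).hexdigest() != expected:
        raise ValueError("artifact checksum mismatch")
    return payload
```

The database row never holds the model bytes, only where they live and what they should hash to, so the metadata store stays small and queryable while large files sit in object storage.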


Phase 2 — Governance Through Audit Logging

Once the registry was functional, I realized something important:

The platform had no memory.

If someone promoted a model to production or deleted a version, there was no way to know:

  • Who made the change
  • What changed
  • When it happened

To solve this, I implemented an Audit Logging system.

Every write operation now creates an immutable audit record tagged with one of four action types:

  • CREATE
  • UPDATE
  • DELETE
  • PROMOTE

This allows teams to trace every important lifecycle event.

Example:

Shivam
PROMOTE
Model Version 5
staging → production
2026-06-16 05:01

This feature brought governance and accountability to the platform.
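One way to model an append-only audit trail is sketched below. This is an illustration, not the platform's implementation: the `AuditRecord` and `AuditLog` names and their fields are assumptions, and the list-backed storage stands in for a database table.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class AuditRecord:
    actor: str
    action: str  # CREATE, UPDATE, DELETE, or PROMOTE
    resource: str
    old_value: Optional[str]
    new_value: Optional[str]
    timestamp: datetime


class AuditLog:
    """Append-only log: records can be added and read, never edited."""

    ALLOWED_ACTIONS = {"CREATE", "UPDATE", "DELETE", "PROMOTE"}

    def __init__(self) -> None:
        self._records: list[AuditRecord] = []

    def record(
        self,
        actor: str,
        action: str,
        resource: str,
        old_value: Optional[str] = None,
        new_value: Optional[str] = None,
    ) -> AuditRecord:
        if action not in self.ALLOWED_ACTIONS:
            raise ValueError(f"unknown audit action: {action}")
        entry = AuditRecord(
            actor=actor,
            action=action,
            resource=resource,
            old_value=old_value,
            new_value=new_value,
            timestamp=datetime.now(timezone.utc),
        )
        self._records.append(entry)
        return entry

    def history(self, resource: str) -> tuple[AuditRecord, ...]:
        """Return a read-only snapshot of every record for one resource."""
        return tuple(r for r in self._records if r.resource == resource)
```

The promotion shown earlier would be written as `log.record("Shivam", "PROMOTE", "Model Version 5", "staging", "production")`. In a real database, immutability would also have to be enforced at the storage layer, for example by not granting UPDATE or DELETE on the audit table.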


Phase 3 — Experiment Tracking

At this stage, models could be registered, but training information was still buried inside artifact files.

For example:

metrics.json
params.json

could be uploaded, but the platform couldn't answer:

  • Which model has the highest accuracy?
  • Which version achieved the best F1 score?
  • What hyperparameters were used?

To solve this, I introduced Experiment Tracking.

A new Training Run layer stores:

  • Hyperparameters
  • Metrics
  • Dataset hashes
  • Framework information
  • Training duration

This enables queries such as:

GET /experiments?min_accuracy=0.90

and

GET /models/{id}/versions/compare

The platform now supports experiment search, filtering, and version comparison.
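The logic behind those two endpoints can be sketched in plain Python. The `TrainingRun` fields follow the list above, but the class itself, `filter_runs`, `compare_versions`, and the shape of the comparison result are assumptions for illustration. A real implementation would push the filtering into a SQL query.

```python
from dataclasses import dataclass


@dataclass
class TrainingRun:
    model_id: int
    version: int
    params: dict
    metrics: dict
    dataset_hash: str = ""
    framework: str = ""
    duration_seconds: float = 0.0


def filter_runs(runs: list, min_accuracy: float) -> list:
    """Keep runs whose recorded accuracy meets the threshold.

    Runs with no accuracy metric are excluded.
    """
    return [
        r for r in runs
        if r.metrics.get("accuracy", float("-inf")) >= min_accuracy
    ]


def compare_versions(a: TrainingRun, b: TrainingRun) -> dict:
    """Report metric deltas (b minus a) and hyperparameters that differ."""
    shared_metrics = sorted(set(a.metrics) & set(b.metrics))
    metric_delta = {k: b.metrics[k] - a.metrics[k] for k in shared_metrics}
    all_params = sorted(set(a.params) | set(b.params))
    changed_params = {
        k: (a.params.get(k), b.params.get(k))
        for k in all_params
        if a.params.get(k) != b.params.get(k)
    }
    return {"metric_delta": metric_delta, "changed_params": changed_params}
```

Because metrics and hyperparameters are structured fields rather than blobs inside `metrics.json`, both operations reduce to simple lookups.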


Phase 4 — Improving Developer Experience

After building the core platform, I shifted focus to usability.

Python SDK

Instead of interacting with the API using raw curl commands, users can now work with a Python SDK.

Example:

from kimchi_sdk import ModelRegistry

registry = ModelRegistry(
    url="http://localhost:8000",
    username="user",
    password="password"
)

registry.create_model(
    name="fraud-detector"
)

GitHub Actions CI/CD

Every push automatically triggers:

  • Dependency installation
  • Automated test execution
  • Validation checks

This helps keep the platform stable and catches regressions early.

Admin APIs

Role management is now handled through APIs instead of direct database modifications.
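As a rough sketch of what an admin-only role change might check, consider the following. The role names (`VIEWER`, `EDITOR`, `ADMIN`), their ordering, and the `require_role` and `set_role` helpers are all hypothetical. The article does not specify Kimchi's actual roles, and a real service would read them from the database and the authenticated JWT rather than a module-level dict.

```python
from enum import IntEnum


class Role(IntEnum):
    # Hypothetical hierarchy: a higher value implies more privileges.
    VIEWER = 1
    EDITOR = 2
    ADMIN = 3


# Stand-in for a users table.
user_roles: dict[str, Role] = {"alice": Role.ADMIN, "bob": Role.VIEWER}


def require_role(username: str, minimum: Role) -> None:
    """Raise PermissionError unless the user holds at least `minimum`."""
    role = user_roles.get(username)
    if role is None or role < minimum:
        raise PermissionError(f"{username} needs role {minimum.name} or higher")


def set_role(actor: str, target: str, new_role: Role) -> None:
    """Admin-only operation: change another user's role."""
    require_role(actor, Role.ADMIN)
    user_roles[target] = new_role
```

Routing role changes through a function like `set_role` means the permission check always runs, which direct database edits cannot guarantee.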

Refresh Tokens

Refresh tokens provide a smoother authentication experience and prepare the platform for future UI integrations.
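The sketch below shows one common refresh-token pattern, single-use rotation, using opaque random tokens held in memory. It is an assumption about how such a flow could work, not a description of Kimchi's implementation: the lifetime, the storage dict, and the function names are made up, and the access token (a JWT in this platform) is left out entirely.

```python
import secrets
import time
from typing import Optional

REFRESH_TTL_SECONDS = 7 * 24 * 3600  # illustrative lifetime, not from the article

# Stand-in for a database table: token -> (username, expiry timestamp).
_refresh_tokens: dict[str, tuple[str, float]] = {}


def issue_refresh_token(username: str, now: Optional[float] = None) -> str:
    """Create and store a new opaque refresh token for the user."""
    now = time.time() if now is None else now
    token = secrets.token_urlsafe(32)
    _refresh_tokens[token] = (username, now + REFRESH_TTL_SECONDS)
    return token


def rotate_refresh_token(token: str, now: Optional[float] = None) -> tuple[str, str]:
    """Exchange a valid refresh token for a new one.

    The presented token is removed first, so it can never be reused.
    Returns (username, new_refresh_token).
    """
    now = time.time() if now is None else now
    entry = _refresh_tokens.pop(token, None)
    if entry is None:
        raise PermissionError("unknown or already-used refresh token")
    username, expires_at = entry
    if now >= expires_at:
        raise PermissionError("refresh token expired")
    return username, issue_refresh_token(username, now)
```

With rotation, a client stays logged in without resending a password, and a stolen refresh token stops working as soon as the legitimate client uses it.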


Lessons Learned

Building this platform taught me much more than simply writing APIs.

Some of the key lessons included:

Designing for Evolution

Each phase built on top of the previous one.

Instead of trying to build everything at once, I focused on creating a strong foundation and then extending it incrementally.

Governance Matters

Tracking who changed what becomes increasingly important as teams grow.

Audit logging turned out to be one of the most valuable additions.

Experiment Metadata Is First-Class Data

Metrics and hyperparameters should not live only inside artifact files.

Making them queryable is what lets the platform answer questions such as which version achieved the best F1 score.

Developer Experience Matters

Features like SDKs and CI/CD pipelines are often overlooked but are critical for adoption.


What's Next?

The next phase focuses on Governance and Enterprise Readiness:

  • Signed GCS URLs
  • Model Cards
  • Dataset Lineage
  • Drift Monitoring
  • Kubernetes Deployments
  • Terraform Infrastructure

The goal is to continue evolving Kimchi into a production-grade MLOps platform.


Final Thoughts

This project started as an exercise to understand how model registries work internally.

Over time, it evolved into a hands-on exploration of:

  • MLOps
  • Backend Engineering
  • Cloud Infrastructure
  • Governance
  • Experiment Tracking
  • Developer Experience

Building systems from scratch is one of the best ways to understand the trade-offs behind production software.

If you're interested in MLOps, backend engineering, or cloud-native systems, I highly recommend building something similar yourself.

Project Repository

GitHub Repository:

[Add your GitHub repository URL here]

If you have suggestions, feedback, or ideas for future phases, feel free to connect with me.
