Introduction
Over the past few months, I set out to answer a simple question:
What does it take to build a production-style MLOps platform from scratch?
While tools like MLflow, SageMaker Model Registry, and Kubeflow provide powerful capabilities, I wanted to understand the underlying architecture and engineering decisions behind them. Instead of only using existing platforms, I decided to build my own.
The result is Kimchi, a self-hosted MLOps platform that supports:
- Model Registration
- Model Versioning
- Artifact Management
- Experiment Tracking
- Audit Logging
- Role-Based Access Control (RBAC)
- Python SDK
- CI/CD Automation
- Google Cloud Storage Integration
This project was built using:
- FastAPI
- PostgreSQL
- SQLAlchemy
- Alembic
- Docker
- Google Cloud Storage (GCS)
- GitHub Actions
- JWT Authentication
In this article, I'll share how the project evolved from a simple model registry into a multi-phase MLOps platform.
The Problem
In many machine learning projects, models end up scattered across ad hoc locations:
- Local machines
- Shared drives
- Cloud storage buckets
- Team chat channels
Over time, teams start asking difficult questions:
- Which model version is currently in production?
- Who promoted this version?
- Which dataset was used to train it?
- What accuracy did it achieve?
- Where is the model artifact stored?
Without a centralized platform, answering these questions becomes difficult.
I wanted a system that could act as a single source of truth for the ML lifecycle.
Phase 1 — Building the Foundation
The first phase focused on creating a Model Registry.
Key capabilities:
- Model registration
- Model versioning
- Artifact upload and download
- JWT Authentication
- Role-Based Access Control
- Google Cloud Storage integration
Architecture
The platform separates metadata from artifacts:
- PostgreSQL stores model metadata
- GCS stores model artifacts
- FastAPI exposes REST APIs
This split mirrors how many production MLOps systems are designed. MLflow, for example, keeps run metadata in a backend store and model files in a separate artifact store.
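The metadata/artifact split described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Kimchi's actual schema: `sqlite3` stands in for PostgreSQL, a plain dictionary stands in for a GCS bucket, and the table, column, and bucket names are hypothetical.

```python
import hashlib
import sqlite3

# Stand-in for a GCS bucket (assumption): object path -> raw bytes.
fake_bucket: dict[str, bytes] = {}


def init_db(conn: sqlite3.Connection) -> None:
    """Create a minimal metadata table; column names are illustrative."""
    conn.execute(
        """
        CREATE TABLE model_versions (
            id INTEGER PRIMARY KEY,
            model_name TEXT NOT NULL,
            version INTEGER NOT NULL,
            artifact_uri TEXT NOT NULL,
            sha256 TEXT NOT NULL,
            UNIQUE (model_name, version)
        )
        """
    )


def register_version(
    conn: sqlite3.Connection, model_name: str, payload: bytes
) -> tuple[int, str]:
    """Store the artifact bytes in the 'bucket' and only a pointer in the database."""
    # Next version number for this model (1 if none exist yet).
    version = conn.execute(
        "SELECT COALESCE(MAX(version), 0) + 1 FROM model_versions WHERE model_name = ?",
        (model_name,),
    ).fetchone()[0]

    # The artifact itself goes to object storage.
    object_path = f"models/{model_name}/v{version}/model.bin"
    fake_bucket[object_path] = payload

    # The database keeps the URI and a checksum, never the bytes.
    uri = f"gs://example-bucket/{object_path}"
    conn.execute(
        "INSERT INTO model_versions (model_name, version, artifact_uri, sha256) "
        "VALUES (?, ?, ?, ?)",
        (model_name, version, uri, hashlib.sha256(payload).hexdigest()),
    )
    conn.commit()
    return version, uri


conn = sqlite3.connect(":memory:")
init_db(conn)
first = register_version(conn, "fraud-detector", b"weights-v1")
second = register_version(conn, "fraud-detector", b"weights-v2")
```

Keeping only a URI and checksum in the relational store keeps rows small and lets the artifact store scale independently of the metadata.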
Phase 2 — Governance Through Audit Logging
Once the registry was functional, I realized something important:
The platform had no memory.
If someone promoted a model to production or deleted a version, there was no way to know:
- Who made the change
- What changed
- When it happened
To solve this, I implemented an Audit Logging system.
Every write operation now creates an immutable audit record tagged with one of four action types:
- CREATE
- UPDATE
- DELETE
- PROMOTE
This allows teams to trace every important lifecycle event.
Example:
- User: Shivam
- Action: PROMOTE
- Target: Model Version 5
- Change: staging → production
- Timestamp: 2026-06-16 05:01
This feature brought governance and accountability to the platform.
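One way to picture an append-only audit trail is sketched below. It is a minimal in-memory sketch, not the platform's implementation: the `AuditRecord` and `AuditLog` names and their fields are hypothetical, and a real deployment would enforce immutability in the database rather than in Python objects.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# The four action types listed in the article.
ALLOWED_ACTIONS = {"CREATE", "UPDATE", "DELETE", "PROMOTE"}


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class AuditRecord:
    actor: str
    action: str
    target: str
    before: Optional[str]
    after: Optional[str]
    timestamp: datetime


class AuditLog:
    """Append-only log: records can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._records: list[AuditRecord] = []

    def record(
        self,
        actor: str,
        action: str,
        target: str,
        before: Optional[str] = None,
        after: Optional[str] = None,
    ) -> AuditRecord:
        if action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown audit action: {action}")
        entry = AuditRecord(
            actor, action, target, before, after, datetime.now(timezone.utc)
        )
        self._records.append(entry)
        return entry

    def history(self, target: str) -> list[AuditRecord]:
        """All records for one target, in the order they were written."""
        return [r for r in self._records if r.target == target]


log = AuditLog()
log.record("Shivam", "CREATE", "model_version:5")
promotion = log.record("Shivam", "PROMOTE", "model_version:5", "staging", "production")
```

Calling `log.history("model_version:5")` returns both entries in write order, which answers the who, what, and when questions for that version.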
Phase 3 — Experiment Tracking
At this stage, models could be registered, but training information was still buried inside artifact files.
For example, metrics.json and params.json could be uploaded, but the platform couldn't answer:
- Which model has the highest accuracy?
- Which version achieved the best F1 score?
- What hyperparameters were used?
To solve this, I introduced Experiment Tracking.
A new Training Run layer stores:
- Hyperparameters
- Metrics
- Dataset hashes
- Framework information
- Training duration
This enables powerful queries such as:
GET /experiments?min_accuracy=0.90
and
GET /models/{id}/versions/compare
The platform now supports experiment search, filtering, and version comparison.
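The kind of logic that could sit behind search and comparison endpoints like these is sketched below. It is an in-memory approximation under stated assumptions: the `TrainingRun` fields mirror the list above, but the function names are hypothetical and the metric values are made up for illustration, not results from the project.

```python
from dataclasses import dataclass


@dataclass
class TrainingRun:
    model_version: int
    params: dict
    metrics: dict
    dataset_hash: str
    framework: str
    duration_seconds: float


def search_runs(runs, min_accuracy):
    """Return runs whose recorded accuracy is at least min_accuracy."""
    return [
        r for r in runs if r.metrics.get("accuracy", float("-inf")) >= min_accuracy
    ]


def best_run(runs, metric):
    """Run with the highest value for a metric, ignoring runs that lack it."""
    scored = [r for r in runs if metric in r.metrics]
    return max(scored, key=lambda r: r.metrics[metric]) if scored else None


def compare_runs(base, candidate):
    """Metric deltas (candidate minus base) for metrics both runs recorded."""
    shared = sorted(set(base.metrics) & set(candidate.metrics))
    return {
        name: round(candidate.metrics[name] - base.metrics[name], 6)
        for name in shared
    }


# Illustrative data only.
runs = [
    TrainingRun(1, {"max_depth": 4}, {"accuracy": 0.88, "f1": 0.81},
                "sha256:example-a", "scikit-learn", 42.0),
    TrainingRun(2, {"max_depth": 8}, {"accuracy": 0.93, "f1": 0.90},
                "sha256:example-a", "scikit-learn", 61.5),
]

high_accuracy = search_runs(runs, min_accuracy=0.90)
top_f1 = best_run(runs, "f1")
delta = compare_runs(runs[0], runs[1])
```

Because metrics live in structured fields instead of inside an uploaded file, each of the three questions above becomes a filter, a max, or a subtraction.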
Phase 4 — Improving Developer Experience
After building the core platform, I shifted focus to usability.
Python SDK
Instead of interacting with the API using raw curl commands, users can now work with a Python SDK.
Example:
from kimchi_sdk import ModelRegistry

# Connect to a Kimchi instance with user credentials.
registry = ModelRegistry(
    url="http://localhost:8000",
    username="user",
    password="password",
)

# Register a new model by name.
registry.create_model(
    name="fraud-detector"
)
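To show what an SDK like this saves the user from writing, here is a rough sketch of a thin client that turns a method call into an HTTP request. It is not the actual kimchi_sdk code: the `SketchRegistryClient` class, the `/models` path, and the bearer-token header are assumptions made for illustration, and the request is only built, never sent.

```python
import json
import urllib.request
from typing import Optional


class SketchRegistryClient:
    """Hypothetical thin client; not the real kimchi_sdk implementation."""

    def __init__(self, url: str, token: Optional[str] = None) -> None:
        self.base_url = url.rstrip("/")
        self.token = token

    def build_request(
        self, method: str, path: str, body: Optional[dict] = None
    ) -> urllib.request.Request:
        """Assemble URL, JSON body, and auth header in one place."""
        headers = {"Accept": "application/json"}
        data = None
        if body is not None:
            data = json.dumps(body).encode("utf-8")
            headers["Content-Type"] = "application/json"
        if self.token:
            headers["Authorization"] = f"Bearer {self.token}"
        return urllib.request.Request(
            self.base_url + path, data=data, headers=headers, method=method
        )

    def create_model_request(self, name: str) -> urllib.request.Request:
        # Assumed endpoint: POST /models with a JSON body containing the name.
        return self.build_request("POST", "/models", {"name": name})


client = SketchRegistryClient("http://localhost:8000/", token="example-token")
request = client.create_model_request("fraud-detector")
```

Passing the resulting object to `urllib.request.urlopen` would send it. Centralizing header and body handling in one method is what keeps the user-facing calls short.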
GitHub Actions CI/CD
Every push automatically triggers:
- Dependency installation
- Automated test execution
- Validation checks
This helps keep the platform stable and catches regressions early.
Admin APIs
Role management is now handled through APIs instead of direct database modifications.
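A role check guarding a role-management operation might look roughly like the sketch below. The role names, their ordering, and the `require_role` decorator are all hypothetical, since the article does not list Kimchi's actual roles, and an in-memory dictionary stands in for the users table.

```python
from functools import wraps

# Hypothetical roles, ordered from least to most privileged.
ROLE_RANK = {"viewer": 0, "editor": 1, "admin": 2}

# Stand-in for the users table.
users = {
    "alice": {"name": "alice", "role": "admin"},
    "bob": {"name": "bob", "role": "viewer"},
}


class PermissionDenied(Exception):
    """Raised when the caller's role is below the required minimum."""


def require_role(minimum: str):
    """Decorator: reject the call unless current_user has at least `minimum`."""
    def decorator(func):
        @wraps(func)
        def wrapper(current_user, *args, **kwargs):
            if ROLE_RANK[current_user["role"]] < ROLE_RANK[minimum]:
                raise PermissionDenied(
                    f"{current_user['name']} needs role '{minimum}' or higher"
                )
            return func(current_user, *args, **kwargs)
        return wrapper
    return decorator


@require_role("admin")
def set_role(current_user, target_name: str, new_role: str) -> dict:
    """Change another user's role; only admins may call this."""
    if new_role not in ROLE_RANK:
        raise ValueError(f"unknown role: {new_role}")
    users[target_name]["role"] = new_role
    return users[target_name]


updated = set_role(users["alice"], "bob", "editor")
```

Routing role changes through a guarded function like this, rather than editing rows by hand, also gives each change a single place where it can be validated and logged.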
Refresh Tokens
Refresh tokens provide a smoother authentication experience and prepare the platform for future UI integrations.
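The access/refresh flow can be sketched with HS256-signed JWTs built from the standard library alone. This is a teaching sketch, not the platform's code: the secret, the token lifetimes, and the custom `type` claim are assumptions, and in practice a maintained JWT library would normally handle signing and validation.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"change-me"          # assumption: loaded from configuration in practice
ACCESS_TTL = 15 * 60           # assumed lifetime: 15 minutes
REFRESH_TTL = 7 * 24 * 3600    # assumed lifetime: 7 days


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str) -> bytes:
    return hmac.new(SECRET, signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_token(subject: str, token_type: str, ttl: int, now=None) -> str:
    """Build a compact JWT: base64url(header).base64url(payload).base64url(sig)."""
    now = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": subject, "type": token_type, "iat": now, "exp": now + ttl}
    parts = [
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    ]
    signing_input = ".".join(parts)
    return signing_input + "." + _b64url(_sign(signing_input))


def verify_token(token: str, expected_type: str, now=None) -> dict:
    """Check signature, token type, and expiry; return the payload if valid."""
    now = int(time.time()) if now is None else now
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    expected = _sign(f"{header_b64}.{payload_b64}")
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    payload = json.loads(_b64url_decode(payload_b64))
    if payload.get("type") != expected_type:
        raise ValueError("wrong token type")
    if payload["exp"] <= now:
        raise ValueError("token expired")
    return payload


def refresh_access_token(refresh_token: str, now=None) -> str:
    """Exchange a valid refresh token for a new short-lived access token."""
    payload = verify_token(refresh_token, "refresh", now)
    return issue_token(payload["sub"], "access", ACCESS_TTL, now)
```

The short-lived access token limits the damage from a leaked credential, while the longer-lived refresh token spares a UI from prompting for the password every few minutes.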
Lessons Learned
Building this platform taught me much more than simply writing APIs.
Some of the key lessons included:
Designing for Evolution
Each phase built on top of the previous one.
Instead of trying to build everything at once, I focused on creating a strong foundation and then extending it incrementally.
Governance Matters
Tracking who changed what becomes increasingly important as teams grow.
Audit logging turned out to be one of the most valuable additions.
Experiment Metadata is First-Class Data
Metrics and hyperparameters should not live only inside artifact files.
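Making them queryable lets the platform answer questions like "which version achieved the best F1 score?" directly, instead of forcing someone to download and open each file.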
Developer Experience Matters
Features like SDKs and CI/CD pipelines are often overlooked but are critical for adoption.
What's Next?
The next phase focuses on Governance and Enterprise Readiness:
- Signed GCS URLs
- Model Cards
- Dataset Lineage
- Drift Monitoring
- Kubernetes Deployments
- Terraform Infrastructure
The goal is to continue evolving Kimchi into a production-grade MLOps platform.
Final Thoughts
This project started as an exercise to understand how model registries work internally.
Over time, it evolved into a hands-on exploration of:
- MLOps
- Backend Engineering
- Cloud Infrastructure
- Governance
- Experiment Tracking
- Developer Experience
Building systems from scratch is one of the best ways to understand the trade-offs behind production software.
If you're interested in MLOps, backend engineering, or cloud-native systems, I highly recommend building something similar yourself.
Project Repository
GitHub Repository:
[Add your GitHub repository URL here]
If you have suggestions, feedback, or ideas for future phases, feel free to connect with me.