
NACAMURA Mitsuhiro

Originally published at m15a.dev

"As Cloud-like as Possible" Data Science: Local MLOps with Docker Compose

Emulates cloud-native MLOps locally

I have built a data science environment that allows me to construct data pipelines and manage machine learning experiments on a local PC, while providing a user experience that is as cloud-like as possible.

According to Gemini:

This environment functions as a sandbox for learning a "cloud-native development style" by replacing major components used in cloud environments—such as S3, orchestrators, and ML tracking services—with local Docker containers.

Alright. If you say so, then this can serve as educational content.

The source code (docker-compose.yaml, etc.) is available on GitHub.

System components

  • Data store: Versity S3 Gateway emulates S3 storage.
  • Source code repository: Just a Git daemon.
  • Pipeline: Prefect orchestrates pipelines.
  • Experiment management: MLflow tracks models, parameters, and results.
  • Notebook: Marimo for interactive development.

I initially used MinIO as the S3 alternative, but since MinIO has stopped distributing binaries and Docker images, I switched to the Versity S3 Gateway, which exposes the local file system via the S3 protocol.
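As a quick illustration, any standard S3 client can talk to the gateway. Here is a minimal sketch using boto3 (my choice, not part of the repo), assuming the endpoint http://localhost:7070 from the Usage section below and credentials provided via the usual AWS environment variables as described in the README:

import boto3

# Assumptions: the Versity S3 Gateway listens on http://localhost:7070 (see
# Usage below) and AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are set in the
# environment, as described in the GitHub README.
s3 = boto3.client("s3", endpoint_url="http://localhost:7070")

# List the buckets that the gateway exposes from the local file system.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])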

The system components collaborate as shown in the following diagram:

The system components diagram

(Marimo is omitted for clarity as it is loosely coupled with the rest of the system.)

Usage

The steps for running experiments are as follows:

  1. (If necessary) Upload the data to S3 (http://localhost:7070); see the sketch after this list.
  2. Implement the Prefect workflow (experiments/*.py).
  3. git push the code to the Git repo (http://localhost:9010/repo).
  4. Deploy the workflow to Prefect (prefect deploy).
  5. Run the workflow via Prefect (prefect deployment run).
  6. Check the experiment results in MLflow (http://localhost:5001).
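For step 1, uploading a dataset can be done with any S3 client. The following is a minimal boto3 sketch (boto3, the local ./iris.csv path, and the credentials setup are assumptions on my part; the data bucket and iris.csv key match the s3://data/iris.csv path read by the example flow below):

import boto3

# Assumptions: gateway endpoint from step 1, credentials from the environment,
# and a local copy of the iris dataset at ./iris.csv.
s3 = boto3.client("s3", endpoint_url="http://localhost:7070")

# Create the bucket on first use, then upload the CSV so the flow can read it
# back as s3://data/iris.csv.
s3.create_bucket(Bucket="data")
s3.upload_file("iris.csv", "data", "iris.csv")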

Detailed prerequisite steps, such as setting environment variables, are written in the GitHub README.

Implementing Prefect workflows

Define your experiment workflows and tasks using Prefect decorators (@flow, @task). Insert MLflow functions such as mlflow.log_params() to track everything in MLflow!

from datetime import datetime

import mlflow
import mlflow.data
import mlflow.models
import mlflow.sklearn
import pandas as pd
from prefect import flow, task
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: the MLflow tracking server from this setup (see the GitHub README).
MLFLOW_TRACKING_URI = "http://localhost:5001"


@flow
def example(n_tests: int = 1) -> None:
    """An example experiment as a Prefect flow with MLflow tracking."""
    mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
    mlflow.set_experiment(f"example-experiment-{datetime.now()}")

    df = pd.read_csv("s3://data/iris.csv")
    x = df.drop(columns=["target", "species"])
    y = df["target"]
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

    test_data = x_test.copy()
    test_data["target"] = y_test
    test_dataset = mlflow.data.from_pandas(test_data, targets="target")

    @task
    def run(n_estimators, criterion) -> None:
        """A run in the experiment."""
        with mlflow.start_run():
            params = {"n_estimators": n_estimators, "criterion": criterion}
            mlflow.log_params(params)
            model = RandomForestClassifier(**params)
            model.fit(x_train, y_train)
            model_info = mlflow.sklearn.log_model(model, input_example=x_train)
            mlflow.models.evaluate(
                model=model_info.model_uri,
                data=test_dataset,
                model_type="classifier",
            )

    for n_estimators in [2**i for i in range(5, 8)]:
        for criterion in ["gini", "entropy", "log_loss"]:
            for _ in range(n_tests):
                run(n_estimators, criterion)

Deploying and running Prefect workflows

Register the workflow with the Prefect server by declaring the deployment configuration in prefect.yaml and running prefect deploy.

deployments:
  - name: experiment
    tags: [test]
    parameters:
      n_tests: 5
    entrypoint: experiments/example.py:example
    work_pool:
      name: sandbox
      job_variables: {}

After deploying, run it with the following command:

$ prefect deployment run example/experiment

You can view the execution status in real time by accessing the Prefect Web UI (http://localhost:4200/runs).
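If you prefer Python over the CLI, Prefect's run_deployment helper can trigger the same deployment programmatically. A minimal sketch follows; the n_tests override simply mirrors the value in prefect.yaml:

from prefect.deployments import run_deployment

# Trigger the "experiment" deployment of the "example" flow.
# timeout=0 submits the run and returns immediately instead of waiting for it.
flow_run = run_deployment(
    name="example/experiment",
    parameters={"n_tests": 5},
    timeout=0,
)
print(f"Submitted flow run {flow_run.id}")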

Checking and analyzing results in MLflow

Access the MLflow Web UI (http://localhost:5001) to review the tracked experiment results and perform comparative analysis based on parameters.
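The tracking server can also be queried from Python for more custom analysis. Here is a rough sketch using mlflow.search_runs, assuming the tracking URI of the local MLflow server; the exact metric columns depend on what mlflow.models.evaluate logged:

import mlflow

# Assumption: the MLflow tracking server from this setup.
mlflow.set_tracking_uri("http://localhost:5001")

# Fetch all runs across experiments as a pandas DataFrame.
runs = mlflow.search_runs(search_all_experiments=True)

# Compare runs by the logged parameters (n_estimators, criterion) and metrics.
print(runs.filter(regex=r"^(params|metrics)\.").head())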

Conclusion

I have created a local environment that feels "as cloud-like as possible." Adding something like a data lake (e.g., DuckLake) would likely make it feel even more like the cloud.


Translated from the original post at https://m15a.dev/ja/posts/acap-datasci-env/.
