Sanskriti Harmukh for Vultr

Posted on Jul 2 with Aashish Chaurasiya • Originally published at docs.vultr.com

Deploying ClearML as an AWS SageMaker Alternative on Ubuntu

#mlops #docker #ai #devops

ClearML is an open-source MLOps platform that combines experiment tracking, pipelines, hyperparameter optimisation, and model serving, making it a self-hosted alternative to AWS SageMaker. This guide deploys the ClearML server with Docker Compose, fronts the web, API, and file servers with Traefik on three subdomains, registers an agent, runs a sample experiment, builds a pipeline, runs an HPO sweep, and deploys a serving stack. By the end, you'll have ClearML covering the full ML lifecycle, served over HTTPS at your domain.

Prerequisites: an Ubuntu host with Docker and Docker Compose installed, and DNS A records for app.clearml.example.com, api.clearml.example.com, and files.clearml.example.com pointing at it. Install the NVIDIA Container Toolkit on the host if you plan to run GPU workloads.

Prepare the Host

1. Bump Elasticsearch's virtual memory limit:

$ echo "vm.max_map_count=524288" | sudo tee /etc/sysctl.d/99-clearml.conf
$ sudo sysctl --system
$ sudo systemctl restart docker
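
To confirm that the new limit is active, the kernel value can be read back from /proc/sys/vm/max_map_count. The following is a minimal sketch, assuming Python 3 is available on the host; the function names are hypothetical and the threshold is simply the value used in the command above.

```python
from pathlib import Path

# Threshold taken from this guide's sysctl drop-in file (an assumption of the sketch).
REQUIRED = 524288
PROC_FILE = Path("/proc/sys/vm/max_map_count")


def meets_limit(raw: str, required: int = REQUIRED) -> bool:
    """Return True if the raw /proc value is at least `required`."""
    return int(raw.strip()) >= required


def check_host() -> bool:
    """Read the live kernel setting and compare it with the threshold."""
    return meets_limit(PROC_FILE.read_text())
```

Calling check_host() on the Ubuntu host should return True once the setting has been applied.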

2. Create data directories with the expected ownership:

$ sudo mkdir -p /opt/clearml/{data/elastic_7,data/mongo_4/db,data/mongo_4/configdb,data/redis,data/fileserver,logs,config}
$ sudo chown -R 1000:1000 /opt/clearml

Deploy the ClearML Server

1. Create the project directory:

$ mkdir -p ~/clearml && cd ~/clearml

2. Download the official Compose file:

$ curl -fsSL https://raw.githubusercontent.com/clearml/clearml-server/master/docker/docker-compose.yml -o docker-compose.yml

3. Edit docker-compose.yml:

Comment out the ports: blocks under apiserver, webserver, and fileserver (Traefik will publish them).
Replace the networks: section with explicitly named networks, so the Traefik stack can attach to clearml_frontend as an external network:

networks:
  backend:
    name: clearml_backend
    driver: bridge
  frontend:
    name: clearml_frontend
    driver: bridge

4. Create the env file with public hostnames:

$ nano .env

CLEARML_WEB_HOST=https://app.clearml.example.com
CLEARML_API_HOST=https://api.clearml.example.com
CLEARML_FILES_HOST=https://files.clearml.example.com
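
A typo in one of these hostnames is easy to miss, so a quick sanity check before starting the stack can help. The sketch below uses only the standard library and handles only simple KEY=VALUE lines like the ones above; parse_env and find_problems are hypothetical helper names, not part of ClearML.

```python
from urllib.parse import urlparse

# The three variables this guide places in .env.
REQUIRED_KEYS = ("CLEARML_WEB_HOST", "CLEARML_API_HOST", "CLEARML_FILES_HOST")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip().strip('"')
    return env


def find_problems(env: dict[str, str]) -> list[str]:
    """Return one message for each missing or malformed host entry."""
    problems = []
    for key in REQUIRED_KEYS:
        url = urlparse(env.get(key, ""))
        if url.scheme != "https" or not url.hostname:
            problems.append(f"{key} should be an https:// URL with a hostname")
    return problems
```

For example, find_problems(parse_env(open(".env").read())) returns an empty list when all three entries look well formed.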

5. Start the stack:

$ docker compose up -d
$ docker compose ps
$ docker compose logs --tail 50

Front the Stack with Traefik

1. Create the Traefik project directory:

$ mkdir -p ~/clearml/traefik && cd ~/clearml/traefik
$ mkdir -p letsencrypt && touch letsencrypt/acme.json && chmod 600 letsencrypt/acme.json

2. Create .env:

LETSENCRYPT_EMAIL=admin@example.com

3. Create the Traefik Compose file:

$ nano docker-compose.yml

services:
  traefik:
    image: traefik:v3.6
    container_name: traefik
    command:
      - "--log.level=INFO"
      - "--providers.file.filename=/etc/traefik/dynamic_conf.yml"
      - "--entryPoints.web.address=:80"
      - "--entryPoints.websecure.address=:443"
      - "--entryPoints.web.http.redirections.entrypoint.to=websecure"
      - "--certificatesResolvers.le.acme.httpChallenge.entryPoint=web"
      - "--certificatesResolvers.le.acme.email=${LETSENCRYPT_EMAIL}"
      - "--certificatesResolvers.le.acme.storage=/letsencrypt/acme.json"
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - "./letsencrypt:/letsencrypt"
      - "./dynamic_conf.yml:/etc/traefik/dynamic_conf.yml:ro"
    networks:
      - clearml-frontend
    restart: unless-stopped

networks:
  clearml-frontend:
    name: clearml_frontend
    external: true

4. Create the routing file:

$ nano dynamic_conf.yml

http:
  routers:
    clearml-web:
      rule: "Host(`app.clearml.example.com`)"
      entryPoints: [websecure]
      service: clearml-web
      tls: {certResolver: le}
    clearml-api:
      rule: "Host(`api.clearml.example.com`)"
      entryPoints: [websecure]
      service: clearml-api
      tls: {certResolver: le}
    clearml-files:
      rule: "Host(`files.clearml.example.com`)"
      entryPoints: [websecure]
      service: clearml-files
      tls: {certResolver: le}
  services:
    clearml-web:
      loadBalancer:
        servers: [{url: "http://clearml-webserver:80"}]
    clearml-api:
      loadBalancer:
        servers: [{url: "http://clearml-apiserver:8008"}]
    clearml-files:
      loadBalancer:
        servers: [{url: "http://clearml-fileserver:8081"}]

5. Start Traefik:

$ docker compose up -d
$ docker logs traefik 2>&1 | grep -i certificate
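
Once Traefik has obtained certificates, each subdomain should answer over HTTPS. The sketch below probes the three hostnames from this guide with the standard library. The /debug.ping path on the API server is an assumption about ClearML's health-check endpoint, so verify it against your server version; the fetch parameter is a hypothetical hook that makes the function easy to test. Any HTTP response, even a non-200 one, at least indicates that DNS, routing, and TLS are working.

```python
import urllib.request
from typing import Callable

# Hostnames from this guide; the /debug.ping path is an assumption.
ENDPOINTS = {
    "web": "https://app.clearml.example.com/",
    "api": "https://api.clearml.example.com/debug.ping",
    "files": "https://files.clearml.example.com/",
}


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET request to `url`."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def probe(endpoints: dict[str, str],
          fetch: Callable[[str], int] = http_status) -> dict[str, str]:
    """Map each endpoint name to 'ok', 'http <code>', or an error summary."""
    results = {}
    for name, url in endpoints.items():
        try:
            code = fetch(url)
            results[name] = "ok" if code == 200 else f"http {code}"
        except OSError as exc:  # URLError and HTTPError both subclass OSError
            results[name] = f"error: {exc}"
    return results
```

Running print(probe(ENDPOINTS)) from any machine that can resolve the three hostnames gives a one-line summary per service.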

Create an Admin and API Credentials

1. Open https://app.clearml.example.com and register the administrator (name + company).
2. Go to Settings → Workspace and select Create new credentials.
3. Copy the generated block; you'll paste it into the agent and SDK configuration:

api {
  web_server: https://app.clearml.example.com
  api_server: https://api.clearml.example.com
  files_server: https://files.clearml.example.com
  credentials {
    "access_key" = "YOUR_ACCESS_KEY"
    "secret_key" = "YOUR_SECRET_KEY"
  }
}
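
Where a configuration file is inconvenient (a CI job, for instance), the same values can be supplied through the CLEARML_* environment variables that also appear in the serving .env file later in this guide. The sketch below converts the block above into that form. credentials_to_env is a hypothetical helper, and its regular expression handles only the simple layout shown here, not general HOCON.

```python
import re

# Keys in the credentials block mapped to the matching environment variables.
KEY_TO_ENV = {
    "web_server": "CLEARML_WEB_HOST",
    "api_server": "CLEARML_API_HOST",
    "files_server": "CLEARML_FILES_HOST",
    "access_key": "CLEARML_API_ACCESS_KEY",
    "secret_key": "CLEARML_API_SECRET_KEY",
}

# Matches `key: value` and `"key" = "value"` lines; braces never match.
LINE = re.compile(r'^\s*"?(\w+)"?\s*[:=]\s*"?([^"\s]+)"?\s*$')


def credentials_to_env(block: str) -> dict[str, str]:
    """Extract the known keys from a credentials block as env-var pairs."""
    env = {}
    for line in block.splitlines():
        match = LINE.match(line)
        if match and match.group(1) in KEY_TO_ENV:
            env[KEY_TO_ENV[match.group(1)]] = match.group(2)
    return env
```

The resulting dictionary can be passed to os.environ.update() before importing clearml, or written out as export lines for a shell.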

Register a ClearML Agent

$ mkdir -p ~/clearml-agent && cd ~/clearml-agent
$ sudo apt install python3.12-venv -y
$ python3 -m venv clearml_venv
$ source clearml_venv/bin/activate
$ pip install clearml-agent
$ clearml-agent init

Paste the credentials block when prompted. Then start the agent:

$ clearml-agent daemon --queue default --detached

For a GPU host:

$ clearml-agent daemon --gpus 0,1 --queue default --detached

Confirm that the agent appears in the Web UI under Workers & Queues → Workers.

Run a Sample Experiment

1. Install the SDK in the same venv:

$ pip install clearml scikit-learn joblib pandas
$ clearml-init

2. Save the experiment:

$ nano 01_first_experiment.py

import joblib
from clearml import Task
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

task = Task.init(project_name='ClearML Tutorial', task_name='01_First_Experiment',
                 tags=['tutorial', 'random-forest'])

hp = {'n_estimators': 100, 'max_depth': 5, 'random_state': 42}
task.connect(hp)

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

clf = RandomForestClassifier(**hp).fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
task.get_logger().report_scalar('Performance', 'Accuracy', value=acc, iteration=1)

joblib.dump(clf, 'iris_rf_model.pkl')
task.upload_artifact('trained_model', 'iris_rf_model.pkl')
task.close()

3. Run it:

$ python3 01_first_experiment.py

The Web UI's ClearML Tutorial project now shows execution metadata, hyperparameters, scalars, and the uploaded artifact.
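
The uploaded artifact can also be pulled back from the server and reused, for example on another machine. The following is a minimal sketch, assuming clearml and joblib are installed in the active venv and that the task name is unique within the project; load_trained_model is a hypothetical helper name.

```python
def load_trained_model(project: str = "ClearML Tutorial",
                       task_name: str = "01_First_Experiment",
                       artifact: str = "trained_model"):
    """Download the named artifact from the task and load it with joblib."""
    # Imported lazily so this module can be imported without ClearML configured.
    import joblib
    from clearml import Task

    task = Task.get_task(project_name=project, task_name=task_name)
    # get_local_copy() downloads the file and returns its local path.
    local_path = task.artifacts[artifact].get_local_copy()
    return joblib.load(local_path)
```

With the defaults above, clf = load_trained_model() returns the fitted RandomForestClassifier, which can then be used with clf.predict(...).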

Build a Pipeline

$ nano 02_pipeline.py

from clearml import PipelineController

def step_one(pickle_data_url):
    import pickle
    import pandas as pd
    from clearml import StorageManager
    local = StorageManager.get_local_copy(remote_url=pickle_data_url)
    with open(local, 'rb') as f:
        iris = pickle.load(f)
    df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
    df['target'] = iris['target']
    return df

def step_two(data_frame, test_size=0.2, random_state=42):
    from sklearn.model_selection import train_test_split
    y = data_frame['target']
    X = data_frame.drop(columns=['target'])
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

def step_three(data):
    from sklearn.linear_model import LogisticRegression
    X_train, X_test, y_train, y_test = data
    return LogisticRegression(solver='lbfgs', max_iter=1000).fit(X_train, y_train)

if __name__ == '__main__':
    pipe = PipelineController(project='ClearML Tutorial', name='02_Pipeline_Experiment',
                              version='1.0', add_pipeline_tags=True)
    pipe.add_parameter('url', 'https://github.com/allegroai/events/raw/master/odsc20-east/generic/iris_dataset.pkl')
    pipe.add_function_step('step_one',   step_one,   function_kwargs={'pickle_data_url': '${pipeline.url}'}, function_return=['data_frame'])
    pipe.add_function_step('step_two',   step_two,   function_kwargs={'data_frame': '${step_one.data_frame}'}, function_return=['processed_data'])
    pipe.add_function_step('step_three', step_three, function_kwargs={'data': '${step_two.processed_data}'}, function_return=['model'])
    pipe.start_locally(run_pipeline_steps_locally=True)

$ python3 02_pipeline.py
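
None of the steps in 02_pipeline.py passes an explicit parent list, so the execution order follows from the ${step.output} references in function_kwargs. The toy sketch below illustrates that idea with a small topological sort. It is not ClearML's implementation, and infer_order is a hypothetical name.

```python
import re

# Captures the step name in references such as ${step_one.data_frame}.
REF = re.compile(r"\$\{(\w+)\.\w+\}")


def infer_order(steps: dict[str, dict[str, str]]) -> list[str]:
    """Order steps so each runs after the steps its kwargs refer to."""
    # ${pipeline.*} refers to pipeline parameters, not to another step.
    deps = {
        name: {m for v in kwargs.values() for m in REF.findall(v)} - {"pipeline"}
        for name, kwargs in steps.items()
    }
    order = []
    while deps:
        ready = sorted(n for n, d in deps.items() if d <= set(order))
        if not ready:
            raise ValueError("cyclic or unknown step reference")
        order.extend(ready)
        for name in ready:
            del deps[name]
    return order


steps = {
    "step_three": {"data": "${step_two.processed_data}"},
    "step_two": {"data_frame": "${step_one.data_frame}"},
    "step_one": {"pickle_data_url": "${pipeline.url}"},
}
print(infer_order(steps))
```

Even though the dictionary lists the steps in reverse, the references alone are enough to recover the intended sequence.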

Run a Hyperparameter Sweep

$ nano 03_hpo.py

from clearml import Task
from clearml.automation import (
    HyperParameterOptimizer, UniformIntegerParameterRange,
    DiscreteParameterRange, RandomSearch,
)

base = Task.get_tasks(project_name='ClearML Tutorial', task_name='01_First_Experiment')[-1]
Task.init(project_name='ClearML Tutorial', task_name='03_Hyperparameter_Optimization',
          task_type=Task.TaskTypes.optimizer)

opt = HyperParameterOptimizer(
    base_task_id=base.id,
    hyper_parameters=[
        UniformIntegerParameterRange('General/n_estimators', min_value=10, max_value=200, step_size=20),
        DiscreteParameterRange('General/max_depth', values=[3, 5, 7, 10]),
    ],
    objective_metric_title='Performance', objective_metric_series='Accuracy',
    objective_metric_sign='max', optimizer_class=RandomSearch,
    max_number_of_concurrent_tasks=2, total_max_jobs=6,
)
opt.start()
opt.wait()  # block until the sweep finishes
print(opt.get_top_experiments(1)[0].get_parameters_as_dict())
opt.stop()  # shut down the optimizer once results are collected

$ python3 03_hpo.py
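
Conceptually, a random search draws each job's configuration independently from the declared ranges. The sketch below mirrors the search space from 03_hpo.py using only the standard library. It is an illustration of the technique, not ClearML's RandomSearch class, and it assumes the integer range is a grid starting at min_value with the given step size; sample_configs is a hypothetical name.

```python
import random
from typing import Optional

# Search space mirroring the ranges passed to HyperParameterOptimizer.
SPACE = {
    "General/n_estimators": list(range(10, 201, 20)),  # 10, 30, ..., 190
    "General/max_depth": [3, 5, 7, 10],
}


def sample_configs(space: dict[str, list[int]], total_jobs: int,
                   seed: Optional[int] = None) -> list[dict[str, int]]:
    """Draw `total_jobs` independent random configurations from `space`."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(total_jobs)]


for config in sample_configs(SPACE, total_jobs=6, seed=42):
    print(config)
```

With 10 values for n_estimators and 4 for max_depth, six jobs cover only a small part of the 40 possible combinations, which is why total_max_jobs is the main knob for trading cost against coverage.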

Deploy a Model with ClearML Serving

$ cd ~/clearml
$ git clone https://github.com/clearml/clearml-serving.git
$ pip install clearml-serving
$ clearml-serving create --name "serving-example"

Set credentials in clearml-serving/docker/.env, replacing SERVING_SERVICE_ID with the service ID printed by clearml-serving create:

CLEARML_WEB_HOST="https://app.clearml.example.com"
CLEARML_API_HOST="https://api.clearml.example.com"
CLEARML_FILES_HOST="https://files.clearml.example.com"
CLEARML_API_ACCESS_KEY="YOUR_ACCESS_KEY"
CLEARML_API_SECRET_KEY="YOUR_SECRET_KEY"
CLEARML_SERVING_TASK_ID="SERVING_SERVICE_ID"

Start the serving stack:

$ cd ~/clearml/clearml-serving/docker
$ docker compose --env-file .env -f docker-compose-triton.yml up -d

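Train and register a sample model, then add the endpoint, replacing SERVING_SERVICE_ID and MODEL_ID with your serving service ID and the ID of the newly registered model: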

$ pip install -r ~/clearml/clearml-serving/examples/pytorch/requirements.txt
$ python3 ~/clearml/clearml-serving/examples/pytorch/train_pytorch_mnist.py

$ clearml-serving --id SERVING_SERVICE_ID model add \
    --engine triton \
    --endpoint "test_model_pytorch" \
    --preprocess "$HOME/clearml/clearml-serving/examples/pytorch/preprocess.py" \
    --model-id MODEL_ID \
    --input-size 1 28 28 --input-name "INPUT__0" --input-type float32 \
    --output-size 10 --output-name "OUTPUT__0" --output-type float32

$ docker compose --env-file .env -f docker-compose-triton.yml restart

Test inference:

$ curl -X POST "http://SERVER_IP:8080/serve/test_model_pytorch" \
    -H "Content-Type: application/json" \
    -d '{"url": "https://raw.githubusercontent.com/clearml/clearml-serving/main/examples/pytorch/5.jpg"}'
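
The same request can be sent from Python, which is convenient for smoke tests. This is a minimal standard-library sketch that mirrors the curl command above. The function names are hypothetical, and it assumes the endpoint replies with a JSON body.

```python
import json
import urllib.request


def build_inference_request(server: str, endpoint: str, image_url: str,
                            port: int = 8080) -> urllib.request.Request:
    """Build the POST request that the curl command above sends."""
    body = json.dumps({"url": image_url}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{server}:{port}/serve/{endpoint}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(req: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, predict(build_inference_request("SERVER_IP", "test_model_pytorch", IMAGE_URL)) sends the request, with SERVER_IP and IMAGE_URL replaced by your host address and the sample image URL.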

Next Steps

ClearML is running with tracking, agents, pipelines, HPO, and serving. From here you can:

Add more agents on GPU hosts and assign them to dedicated queues
Mirror tracking data into S3-compatible storage for long-term retention
Wire ClearML into CI to log every training run automatically

For the full guide with additional tips, visit the original article on Vultr Docs.

Top comments (1)

Aldo • Jul 7

We've definitely explored similar paths ourselves when trying to optimize MLOps costs and gain more control over the stack. The appeal of open-source platforms like ClearML, especially when you're looking to sidestep the full AWS SageMaker suite, is undeniable. There's a lot to be said for owning your data and compute orchestration without being tied to a specific vendor's opinionated framework, particularly when dealing with bespoke model deployments or very specific compliance requirements.

However, the operational overhead often becomes the real crucible. While getting ClearML up and running on Ubuntu instances can seem straightforward initially, the long-term maintenance is a different beast. We've found that scaling the underlying infrastructure, managing database backups, ensuring high availability for the ClearML services, and handling security patches across the stack quickly adds up in engineering hours. This is especially true when your ML workloads start hitting production scale, requiring robust monitoring and auto-scaling for your experiment tracking and pipeline execution engines.

The balance often shifts from the direct cost of SageMaker's managed services to the indirect cost of dedicated DevOps and MLOps engineers needed to keep a self-hosted solution robust and up-to-date. For teams with strong internal infrastructure capabilities and a clear need for deep customization, it's a viable route. Otherwise, the perceived savings can quickly evaporate into unforeseen operational complexities and slower iteration cycles for the data science team.