
Sanskriti Harmukh for Vultr

With Aashish Chaurasiya • Originally published at docs.vultr.com

Deploying ClearML as an AWS SageMaker Alternative on Ubuntu

ClearML is an open-source MLOps platform that combines experiment tracking, pipelines, hyperparameter optimisation, and model serving, which makes it a self-hosted alternative to AWS SageMaker. This guide deploys the ClearML server with Docker Compose, fronts the web, API, and file servers with Traefik on three subdomains, registers an agent, runs a sample experiment, builds a pipeline, runs a hyperparameter optimisation (HPO) sweep, and deploys a serving stack. By the end, you'll have ClearML covering the full ML lifecycle, served over HTTPS at your domain.

Prerequisites: an Ubuntu host with Docker and Docker Compose installed, and DNS A records for app.clearml.example.com, api.clearml.example.com, and files.clearml.example.com that point at the host. If you plan to run GPU workloads, also install the NVIDIA Container Toolkit on the host.


Prepare the Host

1. Raise vm.max_map_count, the kernel limit on memory-mapped areas that Elasticsearch depends on:

$ echo "vm.max_map_count=524288" | sudo tee /etc/sysctl.d/99-clearml.conf
$ sudo sysctl --system
$ sudo systemctl restart docker

2. Create data directories with the expected ownership:

$ sudo mkdir -p /opt/clearml/{data/elastic_7,data/mongo_4/db,data/mongo_4/configdb,data/redis,data/fileserver,logs,config}
$ sudo chown -R 1000:1000 /opt/clearml
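Before moving on, it can help to confirm that both host changes took effect. The following is a minimal sketch, assuming the value 524288 and the 1000:1000 ownership from the two steps above; the helper names (map_count_ok, wrong_owner, main) are made up for this illustration and are not part of ClearML.

```python
from pathlib import Path

REQUIRED_MAP_COUNT = 524288          # value written to 99-clearml.conf in step 1
DATA_ROOT = Path("/opt/clearml")     # directory tree created in step 2
EXPECTED_UID = EXPECTED_GID = 1000   # owner set by chown -R 1000:1000


def map_count_ok(raw: str, minimum: int = REQUIRED_MAP_COUNT) -> bool:
    """Return True if the sysctl value (as read from /proc) meets the minimum."""
    return int(raw.strip()) >= minimum


def wrong_owner(root: Path, uid: int, gid: int) -> list[Path]:
    """List every path under root (root included) that is not owned by uid:gid."""
    bad = []
    for path in [root, *root.rglob("*")]:
        st = path.lstat()
        if (st.st_uid, st.st_gid) != (uid, gid):
            bad.append(path)
    return bad


def main() -> None:
    """Print the state of both checks; skips a check if its path is absent."""
    proc = Path("/proc/sys/vm/max_map_count")
    if proc.exists():
        print("vm.max_map_count ok:", map_count_ok(proc.read_text()))
    if DATA_ROOT.exists():
        for path in wrong_owner(DATA_ROOT, EXPECTED_UID, EXPECTED_GID):
            print("unexpected owner:", path)
```

Call main() on the host (with sudo if the data directories are not readable by your user). No "unexpected owner" lines means the ownership matches.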

Deploy the ClearML Server

1. Create the project directory:

$ mkdir -p ~/clearml && cd ~/clearml

2. Download the official Compose file:

$ curl -fsSL https://raw.githubusercontent.com/clearml/clearml-server/master/docker/docker-compose.yml -o docker-compose.yml

3. Edit docker-compose.yml:

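  • Comment out the ports: blocks under apiserver, webserver, and fileserver (Traefik will publish these services instead).
  • Replace the networks: section with explicitly named networks, so that the Traefik project can later join clearml_frontend as an external network: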
networks:
  backend:
    name: clearml_backend
    driver: bridge
  frontend:
    name: clearml_frontend
    driver: bridge

4. Create the env file with public hostnames:

$ nano .env
CLEARML_WEB_HOST=https://app.clearml.example.com
CLEARML_API_HOST=https://api.clearml.example.com
CLEARML_FILES_HOST=https://files.clearml.example.com

5. Start the stack:

$ docker compose up -d
$ docker compose ps
$ docker compose logs --tail 50
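A typo in the .env file from step 4 (a missing key, or http:// instead of https://) tends to surface only later as broken links in the UI. As a rough sketch, the file can be sanity-checked with a few lines of Python. The parser below handles only plain KEY=VALUE lines, and the function names are hypothetical helpers, not part of ClearML or Docker Compose.

```python
from urllib.parse import urlparse

# The three variables set in the .env file from step 4
REQUIRED_KEYS = ("CLEARML_WEB_HOST", "CLEARML_API_HOST", "CLEARML_FILES_HOST")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def env_problems(values: dict[str, str]) -> list[str]:
    """Return one message per required key that is missing or not an https URL."""
    problems = []
    for key in REQUIRED_KEYS:
        url = urlparse(values.get(key, ""))
        if url.scheme != "https" or not url.hostname:
            problems.append(f"{key} should be an https:// URL")
    return problems
```

For example, env_problems(parse_env(open(".env").read())) returns an empty list when all three hosts are set to https:// URLs.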

Front the Stack with Traefik

1. Create the Traefik project directory:

$ mkdir -p ~/clearml/traefik && cd ~/clearml/traefik
$ mkdir -p letsencrypt && touch letsencrypt/acme.json && chmod 600 letsencrypt/acme.json

2. Create .env:

LETSENCRYPT_EMAIL=admin@example.com

3. Create the Traefik Compose file:

$ nano docker-compose.yml
services:
  traefik:
    image: traefik:v3.6
    container_name: traefik
    command:
      - "--log.level=INFO"
      - "--providers.file.filename=/etc/traefik/dynamic_conf.yml"
      - "--entryPoints.web.address=:80"
      - "--entryPoints.websecure.address=:443"
      - "--entryPoints.web.http.redirections.entrypoint.to=websecure"
      - "--certificatesResolvers.le.acme.httpChallenge.entryPoint=web"
      - "--certificatesResolvers.le.acme.email=${LETSENCRYPT_EMAIL}"
      - "--certificatesResolvers.le.acme.storage=/letsencrypt/acme.json"
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - "./letsencrypt:/letsencrypt"
      - "./dynamic_conf.yml:/etc/traefik/dynamic_conf.yml:ro"
    networks:
      - clearml-frontend
    restart: unless-stopped

networks:
  clearml-frontend:
    name: clearml_frontend
    external: true

4. Create the routing file:

$ nano dynamic_conf.yml
http:
  routers:
    clearml-web:
      rule: "Host(`app.clearml.example.com`)"
      entryPoints: [websecure]
      service: clearml-web
      tls: {certResolver: le}
    clearml-api:
      rule: "Host(`api.clearml.example.com`)"
      entryPoints: [websecure]
      service: clearml-api
      tls: {certResolver: le}
    clearml-files:
      rule: "Host(`files.clearml.example.com`)"
      entryPoints: [websecure]
      service: clearml-files
      tls: {certResolver: le}
  services:
    clearml-web:
      loadBalancer:
        servers: [{url: "http://clearml-webserver:80"}]
    clearml-api:
      loadBalancer:
        servers: [{url: "http://clearml-apiserver:8008"}]
    clearml-files:
      loadBalancer:
        servers: [{url: "http://clearml-fileserver:8081"}]

5. Start Traefik:

$ docker compose up -d
$ docker logs traefik 2>&1 | grep -i certificate
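Once Traefik reports that certificates were issued, a quick probe of the three subdomains shows whether TLS and routing work end to end. This is a sketch using only the standard library; it assumes the example hostnames used throughout this guide and treats any HTTP status (even an error status from the backend) as proof that Traefik routed the request. The names HOSTS, probe, and report are illustrative.

```python
import urllib.error
import urllib.request

# The three routers defined in dynamic_conf.yml
HOSTS = {
    "web": "https://app.clearml.example.com",
    "api": "https://api.clearml.example.com",
    "files": "https://files.clearml.example.com",
}


def probe(url: str, timeout: float = 10.0) -> str:
    """Request url and summarise the outcome as a short string."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # TLS and routing worked; the backend answered with an error status
        return f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout, or certificate error
        return f"failed: {exc}"


def report() -> None:
    """Probe every host and print one line per service."""
    for name, url in HOSTS.items():
        print(f"{name:5s} {url} -> {probe(url)}")
```

A "failed" line that mentions certificate verification usually means the ACME challenge has not completed yet, so checking the Traefik logs again is a reasonable next step.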

Create an Admin and API Credentials

  1. Open https://app.clearml.example.com and register the administrator (name + company).
  2. Go to Settings → Workspace and select Create new credentials.
  3. Copy the generated block; you'll paste it into the agent and SDK configuration:
api {
  web_server: https://app.clearml.example.com
  api_server: https://api.clearml.example.com
  files_server: https://files.clearml.example.com
  credentials {
    "access_key" = "YOUR_ACCESS_KEY"
    "secret_key" = "YOUR_SECRET_KEY"
  }
}
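On machines where you would rather not keep a configuration file (CI runners, containers), the same five values can be passed as environment variables. The variable names below are the ones used in the serving .env file later in this guide; to my knowledge the ClearML SDK reads them as well, but treat that as an assumption and verify it against your SDK version. The to_exports helper is hypothetical and only renders shell export lines.

```python
import shlex

# Field in the credentials block -> environment variable name
ENV_NAMES = {
    "web_server": "CLEARML_WEB_HOST",
    "api_server": "CLEARML_API_HOST",
    "files_server": "CLEARML_FILES_HOST",
    "access_key": "CLEARML_API_ACCESS_KEY",
    "secret_key": "CLEARML_API_SECRET_KEY",
}


def to_exports(settings: dict[str, str]) -> str:
    """Render shell export lines, quoting values that need it."""
    lines = []
    for field, env_name in ENV_NAMES.items():
        lines.append(f"export {env_name}={shlex.quote(settings[field])}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_exports({
        "web_server": "https://app.clearml.example.com",
        "api_server": "https://api.clearml.example.com",
        "files_server": "https://files.clearml.example.com",
        "access_key": "YOUR_ACCESS_KEY",   # placeholder from the block above
        "secret_key": "YOUR_SECRET_KEY",   # placeholder from the block above
    }))
```

Keep the rendered output out of version control, since it contains the secret key.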

Register a ClearML Agent

$ mkdir -p ~/clearml-agent && cd ~/clearml-agent
$ sudo apt install python3-venv -y
$ python3 -m venv clearml_venv
$ source clearml_venv/bin/activate
$ pip install clearml-agent
$ clearml-agent init

Paste the credentials block when prompted. Then start the agent:

$ clearml-agent daemon --queue default --detached

For a GPU host:

$ clearml-agent daemon --gpus 0,1 --queue default --detached

Confirm that the agent is listed in the Web UI under Workers & Queues → Workers.
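If you end up starting several agents, generating the command lines from one place avoids flag typos. A small sketch follows; it uses only the flags shown above (--queue, --gpus, --detached), and the agent_command helper is an illustration, not a ClearML API.

```python
import shlex
from typing import Optional


def agent_command(queue: str, gpus: Optional[list[int]] = None) -> list[str]:
    """Build the clearml-agent daemon argument list for one queue."""
    cmd = ["clearml-agent", "daemon"]
    if gpus:
        # GPU indices are passed as a comma-separated list, e.g. 0,1
        cmd += ["--gpus", ",".join(str(g) for g in gpus)]
    cmd += ["--queue", queue, "--detached"]
    return cmd


if __name__ == "__main__":
    # Print the two commands from this section
    print(shlex.join(agent_command("default")))
    print(shlex.join(agent_command("default", gpus=[0, 1])))
```

The returned list can be passed to subprocess.run directly, which sidesteps shell quoting.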


Run a Sample Experiment

1. Install the SDK in the same venv:

$ pip install clearml scikit-learn joblib pandas
$ clearml-init

2. Save the experiment:

$ nano 01_first_experiment.py
import joblib
from clearml import Task
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Register this run with the ClearML server
task = Task.init(project_name='ClearML Tutorial', task_name='01_First_Experiment',
                 tags=['tutorial', 'random-forest'])

# Connect the hyperparameters so they are logged and can be overridden on cloned runs
hp = {'n_estimators': 100, 'max_depth': 5, 'random_state': 42}
task.connect(hp)

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Train the classifier and report test accuracy as a scalar
clf = RandomForestClassifier(**hp).fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
task.get_logger().report_scalar('Performance', 'Accuracy', value=acc, iteration=1)

# Save the model and attach it to the task as an artifact
joblib.dump(clf, 'iris_rf_model.pkl')
task.upload_artifact('trained_model', 'iris_rf_model.pkl')
task.close()

3. Run it:

$ python3 01_first_experiment.py

The Web UI's ClearML Tutorial project now shows execution metadata, hyperparameters, scalars, and the uploaded artifact.
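The script above reports a single scalar at iteration 1, so the Scalars tab shows one point. If your training loop produces a value per epoch or per round, reporting each one under the same title and series should give you a curve instead. A minimal sketch, assuming you pass in the object returned by task.get_logger(); report_curve is a hypothetical helper that relies only on the report_scalar call already used in the script.

```python
from typing import Iterable


def report_curve(logger, title: str, series: str, values: Iterable[float]) -> int:
    """Report each value as its own iteration and return how many were sent.

    `logger` is any object with a report_scalar(title, series, value=, iteration=)
    method, such as the one returned by task.get_logger().
    """
    count = 0
    for iteration, value in enumerate(values, start=1):
        logger.report_scalar(title, series, value=float(value), iteration=iteration)
        count += 1
    return count
```

For example, report_curve(task.get_logger(), 'Performance', 'Accuracy', accuracies) would report one point per entry in a list named accuracies (a name assumed here for illustration).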


Build a Pipeline

$ nano 02_pipeline.py
from clearml import PipelineController

def step_one(pickle_data_url):
    # Imports are kept inside each function so the step can run as a standalone task
    import pickle
    import pandas as pd
    from clearml import StorageManager
    local = StorageManager.get_local_copy(remote_url=pickle_data_url)
    with open(local, 'rb') as f:
        iris = pickle.load(f)
    df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
    df['target'] = iris['target']
    return df

def step_two(data_frame, test_size=0.2, random_state=42):
    from sklearn.model_selection import train_test_split
    y = data_frame['target']
    X = data_frame.drop(columns=['target'])
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

def step_three(data):
    from sklearn.linear_model import LogisticRegression
    X_train, X_test, y_train, y_test = data
    return LogisticRegression(solver='lbfgs', max_iter=1000).fit(X_train, y_train)

if __name__ == '__main__':
    pipe = PipelineController(project='ClearML Tutorial', name='02_Pipeline_Experiment',
                              version='1.0', add_pipeline_tags=True)
    pipe.add_parameter('url', 'https://github.com/allegroai/events/raw/master/odsc20-east/generic/iris_dataset.pkl')
    pipe.add_function_step('step_one',   step_one,   function_kwargs={'pickle_data_url': '${pipeline.url}'}, function_return=['data_frame'])
    pipe.add_function_step('step_two',   step_two,   function_kwargs={'data_frame': '${step_one.data_frame}'}, function_return=['processed_data'])
    pipe.add_function_step('step_three', step_three, function_kwargs={'data': '${step_two.processed_data}'}, function_return=['model'])
    pipe.start_locally(run_pipeline_steps_locally=True)
Run the pipeline:
$ python3 02_pipeline.py
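The '${step_one.data_frame}' style strings in function_kwargs are what tie the steps together: a step that references another step's return value has to run after it. The sketch below illustrates that idea with the standard library only. It is not ClearML's implementation; STEPS simply mirrors the function_kwargs from 02_pipeline.py, and the helper names are made up.

```python
import re
from graphlib import TopologicalSorter

# Matches references such as ${step_one.data_frame} or ${pipeline.url}
REF = re.compile(r"\$\{(\w+)\.(\w+)\}")

# function_kwargs of each step, copied from 02_pipeline.py
STEPS = {
    "step_one": {"pickle_data_url": "${pipeline.url}"},
    "step_two": {"data_frame": "${step_one.data_frame}"},
    "step_three": {"data": "${step_two.processed_data}"},
}


def parents(kwargs: dict[str, str]) -> set[str]:
    """Return the steps whose outputs these kwargs reference."""
    found = set()
    for value in kwargs.values():
        for source, _name in REF.findall(value):
            if source != "pipeline":   # ${pipeline.*} is a parameter, not a step
                found.add(source)
    return found


def execution_order(steps: dict[str, dict[str, str]]) -> list[str]:
    """Order the steps so that every step comes after the steps it references."""
    graph = {name: parents(kwargs) for name, kwargs in steps.items()}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(STEPS))
```

Here the references form a simple chain, so the derived order matches the order in which the steps were added.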

Run a Hyperparameter Sweep

$ nano 03_hpo.py
from clearml import Task
from clearml.automation import (
    HyperParameterOptimizer, UniformIntegerParameterRange,
    DiscreteParameterRange, RandomSearch,
)

base = Task.get_tasks(project_name='ClearML Tutorial', task_name='01_First_Experiment')[-1]
Task.init(project_name='ClearML Tutorial', task_name='03_Hyperparameter_Optimization',
          task_type=Task.TaskTypes.optimizer)

opt = HyperParameterOptimizer(
    base_task_id=base.id,
    hyper_parameters=[
        UniformIntegerParameterRange('General/n_estimators', min_value=10, max_value=200, step_size=20),
        DiscreteParameterRange('General/max_depth', values=[3, 5, 7, 10]),
    ],
    objective_metric_title='Performance', objective_metric_series='Accuracy',
    objective_metric_sign='max', optimizer_class=RandomSearch,
    max_number_of_concurrent_tasks=2, total_max_jobs=6,
)
opt.start()
opt.wait()   # block until the sweep finishes
print(opt.get_top_experiments(1)[0].get_parameters_as_dict())
opt.stop()   # make sure the background optimization has stopped
Run the sweep:
$ python3 03_hpo.py
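To get a feel for what a random search over this space does, here is a standalone sketch that draws configurations from the two ranges in 03_hpo.py. It is an illustration only and does not reproduce ClearML's RandomSearch internals. In particular, it assumes the integer range is sampled on the grid 10, 30, ..., 190 and makes no claim about how ClearML treats the upper bound of 200 or repeated draws.

```python
import random
from typing import Optional

# Assumed grid for n_estimators: min 10, step 20, staying at or below 200
N_ESTIMATORS = list(range(10, 201, 20))
# Discrete choices for max_depth, as listed in 03_hpo.py
MAX_DEPTH = [3, 5, 7, 10]


def sample_configs(total_jobs: int, seed: Optional[int] = None) -> list[dict[str, int]]:
    """Draw total_jobs independent random configurations from the space."""
    rng = random.Random(seed)
    return [
        {
            "General/n_estimators": rng.choice(N_ESTIMATORS),
            "General/max_depth": rng.choice(MAX_DEPTH),
        }
        for _ in range(total_jobs)
    ]


if __name__ == "__main__":
    # total_max_jobs=6 in 03_hpo.py, so draw six configurations
    for config in sample_configs(6, seed=0):
        print(config)
```

Under these assumptions the space holds 40 combinations, so six jobs cover only a small part of it; raising total_max_jobs trades more agent time for better coverage.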

Deploy a Model with ClearML Serving

$ cd ~/clearml
$ git clone https://github.com/clearml/clearml-serving.git
$ pip install clearml-serving
$ clearml-serving create --name "serving-example"

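Set the server URLs and credentials in clearml-serving/docker/.env, replacing SERVING_SERVICE_ID with the service ID printed by clearml-serving create: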

CLEARML_WEB_HOST="https://app.clearml.example.com"
CLEARML_API_HOST="https://api.clearml.example.com"
CLEARML_FILES_HOST="https://files.clearml.example.com"
CLEARML_API_ACCESS_KEY="YOUR_ACCESS_KEY"
CLEARML_API_SECRET_KEY="YOUR_SECRET_KEY"
CLEARML_SERVING_TASK_ID="SERVING_SERVICE_ID"

Start the serving stack:

$ cd ~/clearml/clearml-serving/docker
$ docker compose --env-file .env -f docker-compose-triton.yml up -d

Train and register a sample model, then add the endpoint, replacing MODEL_ID with the ID of the trained model shown in the Web UI:

$ pip install -r ~/clearml/clearml-serving/examples/pytorch/requirements.txt
$ python3 ~/clearml/clearml-serving/examples/pytorch/train_pytorch_mnist.py

$ clearml-serving --id SERVING_SERVICE_ID model add \
    --engine triton \
    --endpoint "test_model_pytorch" \
    --preprocess "$HOME/clearml/clearml-serving/examples/pytorch/preprocess.py" \
    --model-id MODEL_ID \
    --input-size 1 28 28 --input-name "INPUT__0" --input-type float32 \
    --output-size 10 --output-name "OUTPUT__0" --output-type float32

$ docker compose --env-file .env -f docker-compose-triton.yml restart

Test inference:

$ curl -X POST "http://SERVER_IP:8080/serve/test_model_pytorch" \
    -H "Content-Type: application/json" \
    -d '{"url": "https://raw.githubusercontent.com/clearml/clearml-serving/main/examples/pytorch/5.jpg"}'
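The same request can be sent from Python, which is handy once inference becomes part of an application or a smoke test. This sketch mirrors the curl call above using only the standard library; port 8080 and the /serve/ path come from that command, the function names are illustrative, and the response is returned as whatever JSON the endpoint produces, since its exact shape is not assumed here.

```python
import json
import urllib.request


def build_request(server: str, endpoint: str, image_url: str) -> urllib.request.Request:
    """Build the POST request that the curl command above sends."""
    body = json.dumps({"url": image_url}).encode("utf-8")
    return urllib.request.Request(
        f"http://{server}:8080/serve/{endpoint}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(req: urllib.request.Request, timeout: float = 30.0):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

For example, predict(build_request("SERVER_IP", "test_model_pytorch", image_url)) sends the request, where SERVER_IP is the same placeholder as in the curl command and image_url is the image address used there.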

Next Steps

ClearML is running with tracking, agents, pipelines, HPO, and serving. From here you can:

  • Add more agents on GPU hosts and assign them to dedicated queues
  • Mirror tracking data into S3-compatible storage for long-term retention
  • Wire ClearML into CI to log every training run automatically

For the full guide with additional tips, visit the original article on Vultr Docs.
