ClearML is an open-source MLOps platform that combines experiment tracking, pipelines, hyperparameter optimisation, and model serving, which makes it a self-hosted alternative to AWS SageMaker. This guide deploys the ClearML server with Docker Compose, fronts the web, API, and file servers with Traefik on three subdomains, registers an agent, runs a sample experiment, builds a pipeline, runs an HPO sweep, and deploys a serving stack. By the end, you'll have ClearML covering the full ML lifecycle over HTTPS at your own domain.
Prerequisites: an Ubuntu host with Docker and Docker Compose installed, and DNS A records for app.clearml.example.com, api.clearml.example.com, and files.clearml.example.com. Install the NVIDIA Container Toolkit on the host if you plan to run GPU workloads.
Prepare the Host
1. Bump Elasticsearch's virtual memory limit:
$ echo "vm.max_map_count=524288" | sudo tee /etc/sysctl.d/99-clearml.conf
$ sudo sysctl --system
$ sudo systemctl restart docker
2. Create data directories with the expected ownership:
$ sudo mkdir -p /opt/clearml/{data/elastic_7,data/mongo_4/db,data/mongo_4/configdb,data/redis,data/fileserver,logs,config}
$ sudo chown -R 1000:1000 /opt/clearml
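Before moving on, you may want to confirm that both host changes took effect. The following is a minimal sketch, assuming the standard Linux path /proc/sys/vm/max_map_count and the /opt/clearml tree created above; the helper names map_count_ok and wrong_owner are made up for this example.

```python
import os
from pathlib import Path

REQUIRED_MAP_COUNT = 524288  # the value written to 99-clearml.conf above


def map_count_ok(raw: str, required: int = REQUIRED_MAP_COUNT) -> bool:
    """Return True when the kernel value (read as text) meets the required limit."""
    return int(raw.strip()) >= required


def wrong_owner(root: str, uid: int = 1000, gid: int = 1000) -> list[str]:
    """List every directory and file under root that is not owned by uid:gid."""
    bad = []
    for dirpath, _dirs, files in os.walk(root):
        for path in [dirpath] + [os.path.join(dirpath, f) for f in files]:
            st = os.lstat(path)  # lstat so a dangling symlink does not raise
            if (st.st_uid, st.st_gid) != (uid, gid):
                bad.append(path)
    return bad


if __name__ == "__main__":
    proc = Path("/proc/sys/vm/max_map_count")
    if proc.exists():
        print("vm.max_map_count ok:", map_count_ok(proc.read_text()))
    # An empty list means everything under /opt/clearml belongs to 1000:1000.
    print("unexpected owner:", wrong_owner("/opt/clearml"))
```

If the first line prints False, re-run sudo sysctl --system; if the list is not empty, repeat the chown step.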
Deploy the ClearML Server
1. Create the project directory:
$ mkdir -p ~/clearml && cd ~/clearml
2. Download the official Compose file:
$ curl -fsSL https://raw.githubusercontent.com/clearml/clearml-server/master/docker/docker-compose.yml -o docker-compose.yml
3. Edit docker-compose.yml:
- Comment out the ports: blocks under apiserver, webserver, and fileserver (Traefik will publish them).
- Replace the networks: section with explicitly named networks, so that Traefik can attach to clearml_frontend later:
networks:
  backend:
    name: clearml_backend
    driver: bridge
  frontend:
    name: clearml_frontend
    driver: bridge
4. Create the env file with public hostnames:
$ nano .env
CLEARML_WEB_HOST=https://app.clearml.example.com
CLEARML_API_HOST=https://api.clearml.example.com
CLEARML_FILES_HOST=https://files.clearml.example.com
5. Start the stack:
$ docker compose up -d
$ docker compose ps
$ docker compose logs --tail 50
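The docker compose ps check above can also be scripted, which is handy if you want to gate later steps on a healthy stack. This is a sketch under assumptions: it relies on the JSON output of docker compose ps carrying Name and State fields, and it accepts both output shapes (a single array, or one object per line) because Compose releases differ. The function names are hypothetical.

```python
import json
import subprocess


def parse_ps(text: str) -> list[dict]:
    """Parse `docker compose ps --format json` output in either known shape."""
    text = text.strip()
    if not text:
        return []
    if text.startswith("["):
        # Some Compose releases print one JSON array.
        return json.loads(text)
    # Others print one JSON object per line.
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def not_running(containers: list[dict]) -> list[str]:
    """Names of containers whose State is anything other than 'running'."""
    return [c.get("Name", "?") for c in containers if c.get("State") != "running"]


def main() -> None:
    """Query Compose in the current directory and report stopped containers."""
    out = subprocess.run(
        ["docker", "compose", "ps", "--all", "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    print("not running:", not_running(parse_ps(out)))
```

Call main() from ~/clearml; an empty list means every container in the stack reports a running state.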
Front the Stack with Traefik
1. Create the Traefik project directory:
$ mkdir -p ~/clearml/traefik && cd ~/clearml/traefik
$ mkdir -p letsencrypt && touch letsencrypt/acme.json && chmod 600 letsencrypt/acme.json
2. Create .env:
LETSENCRYPT_EMAIL=admin@example.com
3. Create the Traefik Compose file:
$ nano docker-compose.yml
services:
  traefik:
    image: traefik:v3.6
    container_name: traefik
    command:
      - "--log.level=INFO"
      - "--providers.file.filename=/etc/traefik/dynamic_conf.yml"
      - "--entryPoints.web.address=:80"
      - "--entryPoints.websecure.address=:443"
      - "--entryPoints.web.http.redirections.entrypoint.to=websecure"
      - "--certificatesResolvers.le.acme.httpChallenge.entryPoint=web"
      - "--certificatesResolvers.le.acme.email=${LETSENCRYPT_EMAIL}"
      - "--certificatesResolvers.le.acme.storage=/letsencrypt/acme.json"
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - "./letsencrypt:/letsencrypt"
      - "./dynamic_conf.yml:/etc/traefik/dynamic_conf.yml:ro"
    networks:
      - clearml-frontend
    restart: unless-stopped
networks:
  clearml-frontend:
    name: clearml_frontend
    external: true
4. Create the routing file:
$ nano dynamic_conf.yml
http:
  routers:
    clearml-web:
      rule: "Host(`app.clearml.example.com`)"
      entryPoints: [websecure]
      service: clearml-web
      tls: {certResolver: le}
    clearml-api:
      rule: "Host(`api.clearml.example.com`)"
      entryPoints: [websecure]
      service: clearml-api
      tls: {certResolver: le}
    clearml-files:
      rule: "Host(`files.clearml.example.com`)"
      entryPoints: [websecure]
      service: clearml-files
      tls: {certResolver: le}
  services:
    clearml-web:
      loadBalancer:
        servers: [{url: "http://clearml-webserver:80"}]
    clearml-api:
      loadBalancer:
        servers: [{url: "http://clearml-apiserver:8008"}]
    clearml-files:
      loadBalancer:
        servers: [{url: "http://clearml-fileserver:8081"}]
5. Start Traefik:
$ docker compose up -d
$ docker logs traefik 2>&1 | grep -i certificate
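Grepping the Traefik log tells you a certificate was requested, but not that each subdomain actually serves a trusted one. The sketch below uses only the Python standard library to open a verified TLS connection to each of the three hostnames and report how many days remain on the certificate. The hostnames are the placeholders used throughout this guide, and the function names are invented for illustration.

```python
import socket
import ssl
import time

HOSTS = [
    "app.clearml.example.com",
    "api.clearml.example.com",
    "files.clearml.example.com",
]


def days_left(not_after: str, now: float) -> int:
    """Whole days between now (epoch seconds) and a certificate's notAfter string."""
    return int((ssl.cert_time_to_seconds(not_after) - now) // 86400)


def cert_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect with full verification and return the certificate's notAfter field."""
    ctx = ssl.create_default_context()  # verifies the chain and the hostname
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def main() -> None:
    """Print the remaining certificate lifetime, or the failure, for each host."""
    for host in HOSTS:
        try:
            remaining = days_left(cert_not_after(host), time.time())
            print(f"{host}: certificate valid for {remaining} more days")
        except OSError as exc:  # ssl.SSLError is a subclass of OSError
            print(f"{host}: FAILED ({exc})")
```

Call main() once DNS has propagated. A verification failure usually means Traefik is still serving its default self-signed certificate, in which case the ACME challenge has not completed yet.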
Create an Admin and API Credentials
- Open https://app.clearml.example.com and register the administrator (name + company).
- Go to Settings → Workspace → Create new credentials.
- Copy the generated block; you'll paste it into the agent and SDK configuration:
api {
  web_server: https://app.clearml.example.com
  api_server: https://api.clearml.example.com
  files_server: https://files.clearml.example.com
  credentials {
    "access_key" = "YOUR_ACCESS_KEY"
    "secret_key" = "YOUR_SECRET_KEY"
  }
}
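On machines where you would rather not keep a clearml.conf file (a CI runner, for example), the same values can be supplied through environment variables, using the same CLEARML_* names that the serving .env later in this guide uses. The helper below is a hypothetical convenience, not part of the ClearML SDK, and it assumes the variables are set before the SDK is first imported or initialised.

```python
import os


def clearml_env(access_key: str, secret_key: str,
                domain: str = "clearml.example.com") -> dict[str, str]:
    """Build the CLEARML_* variables for the three subdomains used in this guide."""
    return {
        "CLEARML_WEB_HOST": f"https://app.{domain}",
        "CLEARML_API_HOST": f"https://api.{domain}",
        "CLEARML_FILES_HOST": f"https://files.{domain}",
        "CLEARML_API_ACCESS_KEY": access_key,
        "CLEARML_API_SECRET_KEY": secret_key,
    }


def apply_env(env: dict[str, str]) -> None:
    """Export the variables into the current process."""
    os.environ.update(env)


if __name__ == "__main__":
    # Placeholder keys: substitute the pair generated in the Web UI.
    apply_env(clearml_env("YOUR_ACCESS_KEY", "YOUR_SECRET_KEY"))
    print(os.environ["CLEARML_API_HOST"])
```

In practice the keys would come from your CI system's secret store rather than from source code.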
Register a ClearML Agent
$ mkdir -p ~/clearml-agent && cd ~/clearml-agent
$ sudo apt install python3.12-venv -y
$ python3 -m venv clearml_venv
$ source clearml_venv/bin/activate
$ pip install clearml-agent
$ clearml-agent init
Paste the credentials block when prompted. Then start the agent:
$ clearml-agent daemon --queue default --detached
For a GPU host:
$ clearml-agent daemon --gpus 0,1 --queue default --detached
Confirm in Web UI → Workers & Queues → Workers.
Run a Sample Experiment
1. Install the SDK in the same venv:
$ pip install clearml scikit-learn joblib pandas
$ clearml-init
2. Save the experiment:
$ nano 01_first_experiment.py
import joblib
from clearml import Task
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
task = Task.init(project_name='ClearML Tutorial', task_name='01_First_Experiment',
                 tags=['tutorial', 'random-forest'])
hp = {'n_estimators': 100, 'max_depth': 5, 'random_state': 42}
task.connect(hp)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
clf = RandomForestClassifier(**hp).fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
task.get_logger().report_scalar('Performance', 'Accuracy', value=acc, iteration=1)
joblib.dump(clf, 'iris_rf_model.pkl')
task.upload_artifact('trained_model', 'iris_rf_model.pkl')
task.close()
3. Run it:
$ python3 01_first_experiment.py
The Web UI's ClearML Tutorial project now shows execution metadata, hyperparameters, scalars, and the uploaded artifact.
Build a Pipeline
$ nano 02_pipeline.py
from clearml import PipelineController
def step_one(pickle_data_url):
    # Imports live inside each step so the step can run as a standalone task.
    import pickle
    import pandas as pd
    from clearml import StorageManager
    local = StorageManager.get_local_copy(remote_url=pickle_data_url)
    with open(local, 'rb') as f:
        iris = pickle.load(f)
    df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
    df['target'] = iris['target']
    return df

def step_two(data_frame, test_size=0.2, random_state=42):
    from sklearn.model_selection import train_test_split
    y = data_frame['target']
    X = data_frame.drop(columns=['target'])
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

def step_three(data):
    from sklearn.linear_model import LogisticRegression
    X_train, X_test, y_train, y_test = data
    return LogisticRegression(solver='lbfgs', max_iter=1000).fit(X_train, y_train)

if __name__ == '__main__':
    pipe = PipelineController(project='ClearML Tutorial', name='02_Pipeline_Experiment',
                              version='1.0', add_pipeline_tags=True)
    pipe.add_parameter('url', 'https://github.com/allegroai/events/raw/master/odsc20-east/generic/iris_dataset.pkl')
    pipe.add_function_step('step_one', step_one, function_kwargs={'pickle_data_url': '${pipeline.url}'}, function_return=['data_frame'])
    pipe.add_function_step('step_two', step_two, function_kwargs={'data_frame': '${step_one.data_frame}'}, function_return=['processed_data'])
    pipe.add_function_step('step_three', step_three, function_kwargs={'data': '${step_two.processed_data}'}, function_return=['model'])
    pipe.start_locally(run_pipeline_steps_locally=True)
$ python3 02_pipeline.py
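The three steps in 02_pipeline.py are tied together only by the ${step.output} strings in function_kwargs. To make that dependency idea concrete, here is a plain-Python illustration that extracts those references and derives an execution order from them. It is not ClearML's own resolver, only a sketch of the concept; the STEPS dictionary simply mirrors the kwargs from the script above.

```python
import re
from graphlib import TopologicalSorter

# Matches "${name.output}" and captures the part before the dot.
REF = re.compile(r"\$\{([A-Za-z0-9_]+)\.[A-Za-z0-9_]+\}")

# The function_kwargs of each step, copied from 02_pipeline.py.
STEPS = {
    "step_one": {"pickle_data_url": "${pipeline.url}"},
    "step_two": {"data_frame": "${step_one.data_frame}"},
    "step_three": {"data": "${step_two.processed_data}"},
}


def parents(kwargs: dict, step_names: set[str]) -> set[str]:
    """Steps referenced by a step's kwargs (pipeline parameters are ignored)."""
    found = set()
    for value in kwargs.values():
        if isinstance(value, str):
            found.update(name for name in REF.findall(value) if name in step_names)
    return found


def execution_order(steps: dict[str, dict]) -> list[str]:
    """Order the steps so that each one runs after the steps it references."""
    names = set(steps)
    graph = {name: parents(kwargs, names) for name, kwargs in steps.items()}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(STEPS))
```

Note that ${pipeline.url} does not create a dependency: pipeline is the parameter namespace filled by add_parameter, not a step.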
Run a Hyperparameter Sweep
$ nano 03_hpo.py
from clearml import Task
from clearml.automation import (
    HyperParameterOptimizer, UniformIntegerParameterRange,
    DiscreteParameterRange, RandomSearch,
)
base = Task.get_tasks(project_name='ClearML Tutorial', task_name='01_First_Experiment')[-1]
Task.init(project_name='ClearML Tutorial', task_name='03_Hyperparameter_Optimization',
          task_type=Task.TaskTypes.optimizer)
opt = HyperParameterOptimizer(
    base_task_id=base.id,
    hyper_parameters=[
        UniformIntegerParameterRange('General/n_estimators', min_value=10, max_value=200, step_size=20),
        DiscreteParameterRange('General/max_depth', values=[3, 5, 7, 10]),
    ],
    objective_metric_title='Performance', objective_metric_series='Accuracy',
    objective_metric_sign='max', optimizer_class=RandomSearch,
    max_number_of_concurrent_tasks=2, total_max_jobs=6,
)
opt.start()
opt.wait()
print(opt.get_top_experiments(1)[0].get_parameters_as_dict())
$ python3 03_hpo.py
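It helps to see how small a slice of the search space six jobs cover. The sketch below reproduces the idea of random search in plain Python: enumerate the candidate combinations, then draw total_max_jobs of them. It is a simplified illustration, not ClearML's RandomSearch implementation. Two assumptions are baked in: the n_estimators grid is taken as range(10, 200, 20), so whether ClearML also includes the upper bound is not modelled, and combinations are drawn without repetition.

```python
import itertools
import random

# Assumed grids, mirroring the ranges declared in 03_hpo.py.
N_ESTIMATORS = list(range(10, 200, 20))   # 10, 30, ..., 190
MAX_DEPTH = [3, 5, 7, 10]


def sample_jobs(total_max_jobs: int = 6, seed: int = 0) -> list[dict]:
    """Draw distinct random hyperparameter combinations from the full grid."""
    grid = list(itertools.product(N_ESTIMATORS, MAX_DEPTH))
    rng = random.Random(seed)  # seeded so repeated runs pick the same jobs
    return [
        {"General/n_estimators": n, "General/max_depth": d}
        for n, d in rng.sample(grid, total_max_jobs)
    ]


if __name__ == "__main__":
    total = len(N_ESTIMATORS) * len(MAX_DEPTH)
    print(f"{total} combinations in the grid, sampling 6 of them:")
    for job in sample_jobs():
        print(job)
```

The General/ prefix on the parameter names corresponds to the section in which task.connect(hp) stored the dictionary in the first experiment, which is why the sweep can override those values on each cloned task.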
Deploy a Model with ClearML Serving
$ cd ~/clearml
$ git clone https://github.com/clearml/clearml-serving.git
$ pip install clearml-serving
$ clearml-serving create --name "serving-example"
Set the server addresses, credentials, and the serving service ID (printed by the create command above) in clearml-serving/docker/.env:
CLEARML_WEB_HOST="https://app.clearml.example.com"
CLEARML_API_HOST="https://api.clearml.example.com"
CLEARML_FILES_HOST="https://files.clearml.example.com"
CLEARML_API_ACCESS_KEY="YOUR_ACCESS_KEY"
CLEARML_API_SECRET_KEY="YOUR_SECRET_KEY"
CLEARML_SERVING_TASK_ID="SERVING_SERVICE_ID"
Start the serving stack:
$ cd ~/clearml/clearml-serving/docker
$ docker compose --env-file .env -f docker-compose-triton.yml up -d
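Train and register a sample model, then add the endpoint. Replace SERVING_SERVICE_ID with the ID printed by clearml-serving create, and MODEL_ID with the ID of the model registered by the training script (shown in the Web UI):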
$ pip install -r ~/clearml/clearml-serving/examples/pytorch/requirements.txt
$ python3 ~/clearml/clearml-serving/examples/pytorch/train_pytorch_mnist.py
$ clearml-serving --id SERVING_SERVICE_ID model add \
--engine triton \
--endpoint "test_model_pytorch" \
--preprocess "$HOME/clearml/clearml-serving/examples/pytorch/preprocess.py" \
--model-id MODEL_ID \
--input-size 1 28 28 --input-name "INPUT__0" --input-type float32 \
--output-size 10 --output-name "OUTPUT__0" --output-type float32
$ docker compose --env-file .env -f docker-compose-triton.yml restart
Test inference:
$ curl -X POST "http://SERVER_IP:8080/serve/test_model_pytorch" \
-H "Content-Type: application/json" \
-d '{"url": "https://raw.githubusercontent.com/clearml/clearml-serving/main/examples/pytorch/5.jpg"}'
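If you want to call the endpoint from application code rather than curl, the same request can be built with the Python standard library. This is a minimal sketch assuming the endpoint name and port from the curl command above and assuming the service answers with a JSON body; the exact response fields depend on the preprocess script, so the function just returns whatever JSON comes back. build_request and predict are hypothetical names.

```python
import json
import urllib.request


def build_request(server: str, endpoint: str, image_url: str) -> urllib.request.Request:
    """Build the same POST request the curl example sends."""
    body = json.dumps({"url": image_url}).encode("utf-8")
    return urllib.request.Request(
        f"http://{server}:8080/serve/{endpoint}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(server: str, endpoint: str, image_url: str, timeout: float = 30.0):
    """Send the request and decode the JSON response."""
    req = build_request(server, endpoint, image_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, predict("SERVER_IP", "test_model_pytorch", "https://raw.githubusercontent.com/clearml/clearml-serving/main/examples/pytorch/5.jpg") sends the same payload as the curl command, with SERVER_IP replaced by your host's address.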
Next Steps
ClearML is running with tracking, agents, pipelines, HPO, and serving. From here you can:
- Add more agents on GPU hosts and assign them to dedicated queues
- Mirror tracking data into S3-compatible storage for long-term retention
- Wire ClearML into CI to log every training run automatically
For the full guide with additional tips, visit the original article on Vultr Docs.