Adding Chaos to ML Compute Targets

The cloud provides the compute resources for machine learning jobs such as training, hyperparameter tuning, and inference, all of which demand high CPU and memory utilisation. How do we ensure that the compute targets can withstand CPU pressure and safely execute these jobs? This blog walks you through applying Chaos Engineering approaches in Machine Learning Operations (MLOps) to establish a well-architected Azure solution.

This blog covers two main sections:

  1. Attaching compute targets to an Azure ML workspace
  2. Creating and running a chaos experiment on an Azure VM

If you are new to Chaos Engineering, do check out the basic concepts in the previous blog here

Attaching remote VM for Machine Learning Jobs

Create a Data Science Linux Virtual Machine

Go to Azure Portal -> Virtual Machines and create one.
Find the virtual machine listing by typing "data science virtual machine" into the search box and selecting "Data Science Virtual Machine- Ubuntu 18.04". You can find the info here

Create Azure Machine Learning workspace

In Azure Portal, search for the Machine Learning resource and create a workspace.


Attach Remote VM as our compute target

To attach our Data Science Linux Virtual Machine to the Machine Learning workspace as a compute target, follow the steps below.

  • Navigate to the Azure Machine Learning workspace
  • Go to Compute -> Attached Computes
  • Click New and add a Virtual Machine
  • Enter the relevant details of our Virtual Machine; once it is attached, you will see the compute listed in the portal.
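
The portal steps above can also be sketched with the Azure CLI `ml` extension. This is a minimal sketch, not the author's method: the workspace name, VM name, subscription ID, admin user, and SSH key path are all placeholder assumptions, and only the resource group name `chaos` comes from this post. The script prints the command by default; set `DRY_RUN=0` to actually run it.

```shell
#!/bin/sh
# Placeholder values (assumptions), override them via environment variables.
RESOURCE_GROUP="${RESOURCE_GROUP:-chaos}"
WORKSPACE="${WORKSPACE:-my-ml-workspace}"
VM_NAME="${VM_NAME:-my-dsvm}"
SUBSCRIPTION_ID="${SUBSCRIPTION_ID:-00000000-0000-0000-0000-000000000000}"
SSH_KEY="${SSH_KEY:-$HOME/.ssh/id_rsa}"

# Build the ARM resource ID of a VM: subscription, resource group, VM name.
vm_resource_id() {
  echo "/subscriptions/$1/resourceGroups/$2/providers/Microsoft.Compute/virtualMachines/$3"
}

# Print the command unless DRY_RUN=0, in which case execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Attach the existing VM to the workspace as a compute target.
attach_vm() {
  run az ml compute attach \
    --resource-group "$RESOURCE_GROUP" \
    --workspace-name "$WORKSPACE" \
    --type VirtualMachine \
    --name "$VM_NAME" \
    --resource-id "$(vm_resource_id "$SUBSCRIPTION_ID" "$RESOURCE_GROUP" "$VM_NAME")" \
    --admin-username azureuser \
    --ssh-private-key-file "$SSH_KEY"
}

attach_vm
```

Printing first makes it easy to review the resource ID before anything is changed in the workspace.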


Create and Run Chaos experiment on Azure VM

We will cause a high CPU event on our Linux virtual machine using a chaos experiment in Azure Chaos Studio. Running this experiment can help you defend against an application becoming resource-starved.

Once the VM is up and running, SSH into it.

Install stress-ng

For our Chaos Experiment we will use stress-ng, an open-source application that can cause various stress events on a virtual machine.

From the SSH session on the VM you created, install stress-ng by running the following command in the terminal.

sudo apt-get update && sudo apt-get -y install unzip && sudo apt-get -y install stress-ng
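
Before moving on, it can help to confirm that the install succeeded. The helper below is a small sketch (the function name `check_tool` is my own, not part of stress-ng or Azure) that reports whether a command is available on the PATH.

```shell
#!/bin/sh
# Report whether a tool is on PATH; returns non-zero if it is missing.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: installed"
  else
    echo "$1: missing"
    return 1
  fi
}

# Check the tool this experiment depends on ("|| true" keeps the script going).
check_tool stress-ng || true
```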

Create Managed Identity

Create a Managed Identity resource, then navigate to Access control (IAM).

Add a role assignment that grants the Contributor role on our VM.

Set Target in Chaos Studio

Go to Chaos Studio, select Targets, and enable our VM as a target.

Create Chaos Experiment

Click Chaos Experiment and create one.

Under Experiment Designer, we are going to design two different faults:

  • Branch 1_Step1: CPU Pressure
    Set the pressure to 95% for 10 minutes

  • Branch 1_Step2: Physical Memory Pressure
    Set the pressure to 95% for 10 minutes

Review and create the experiment.
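
To get a feel for what these two faults do, you can reproduce similar pressure by hand with stress-ng. The sketch below only prints the commands. The flags are real stress-ng options, but treating them as equivalent to the Chaos Studio faults is my assumption; the agent may invoke stress-ng differently.

```shell
#!/bin/sh
# Build stress-ng commands that roughly mirror the two faults in the experiment.
# Assumption: these flags approximate the Chaos Studio faults; they are not
# taken from Chaos Studio itself.

# CPU pressure: one worker per online CPU (--cpu 0), each loaded to $1 percent
# for $2 seconds.
cpu_pressure_cmd() {
  echo "stress-ng --cpu 0 --cpu-load $1 --timeout ${2}s"
}

# Memory pressure: one vm worker that maps $1 percent of available memory
# for $2 seconds.
mem_pressure_cmd() {
  echo "stress-ng --vm 1 --vm-bytes ${1}% --timeout ${2}s"
}

# Print the commands for 95% pressure over 10 minutes (600 seconds).
cpu_pressure_cmd 95 600
mem_pressure_cmd 95 600
```

Copy a printed command into the VM terminal to run it, ideally with a shorter timeout for a first try.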


Give permissions to Chaos Experiment

Navigate to our VM and grant the Chaos Experiment Contributor access on it.

Start the Chaos Experiment

Navigate back to the Chaos Experiment and start it. You can click on the details of the run to see the injected faults.
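
If you would rather start the experiment from a terminal, the Azure CLI can call the experiment's start endpoint through `az rest`. This is a sketch under assumptions: the experiment resource ID is a placeholder you must fill in, and the API version shown is one I believe is valid, so check it against the Chaos Studio REST reference. The command is printed by default; set `DRY_RUN=0` to send the request.

```shell
#!/bin/sh
# Placeholder (assumption): the full ARM resource ID of your experiment.
EXPERIMENT_ID="${EXPERIMENT_ID:-/subscriptions/<subscription-id>/resourceGroups/chaos/providers/Microsoft.Chaos/experiments/<experiment-name>}"
# Assumption: an API version accepted by the Chaos Studio REST API.
API_VERSION="${API_VERSION:-2023-11-01}"

# Build the start URI from an experiment resource ID and an API version.
start_uri() {
  echo "https://management.azure.com$1/start?api-version=$2"
}

# Print the command unless DRY_RUN=0, in which case execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

run az rest --method post --uri "$(start_uri "$EXPERIMENT_ID" "$API_VERSION")"
```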

Monitor the Experiment

Once the experiment has started, you can monitor its state.

CPU Pressure

SSH into our VM and run the top command to view the CPU utilisation of the VM. As per our fault injection, stress-ng exerts 95% CPU pressure on the VM.
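
If you want a single number to log instead of watching top, CPU utilisation can be estimated from two samples of the aggregate `cpu` line in /proc/stat. This is a minimal sketch: the function name and the one-second sampling interval are my own choices, and the figure will not match top exactly.

```shell
#!/bin/sh
# Compute the busy CPU percentage between two aggregate "cpu" lines from
# /proc/stat. After splitting, fields 2..9 are user, nice, system, idle,
# iowait, irq, softirq, steal; idle and iowait count as not busy.
cpu_busy_percent() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, s1, " "); split(b, s2, " ")
    for (i = 2; i <= 9; i++) { t1 += s1[i]; t2 += s2[i] }
    idle = (s2[5] + s2[6]) - (s1[5] + s1[6])
    total = t2 - t1
    if (total <= 0) { print 0; exit }
    printf "%.0f\n", 100 * (total - idle) / total
  }'
}

# Sample twice, one second apart, and report utilisation over that window.
if [ -r /proc/stat ]; then
  first="$(grep '^cpu ' /proc/stat)"
  sleep 1
  second="$(grep '^cpu ' /proc/stat)"
  echo "CPU busy: $(cpu_busy_percent "$first" "$second")%"
fi
```

While the CPU fault is active, the reported value should sit close to the configured pressure.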

Physical Memory Pressure

Using the same command, you can see that free memory drops below 5%, as per the memory fault injected into the VM.
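
The same check can be scripted by reading /proc/meminfo. The sketch below reports MemAvailable as a percentage of MemTotal. Using MemAvailable rather than the free column shown by top is my own choice, since it also accounts for reclaimable cache, so the two numbers can differ.

```shell
#!/bin/sh
# Read /proc/meminfo-style text on stdin and print MemAvailable as a
# percentage of MemTotal, or "unknown" if MemTotal is absent.
mem_available_percent() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total <= 0) { print "unknown"; exit }
      printf "%.1f\n", 100 * avail / total
    }'
}

if [ -r /proc/meminfo ]; then
  echo "Memory available: $(mem_available_percent < /proc/meminfo)%"
fi
```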

Once the experiment is over, the VM returns to its normal state.

Experiment Results


Both faults were successfully injected by our Chaos Experiment.


We inferred that the VM can handle high CPU and memory pressure, which makes it suitable for Machine Learning jobs such as model training and hyperparameter tuning.


You can set up alerts, load balancing, and a backup VM as mitigation plans.


We have successfully performed load testing on our Data Science Virtual Machine to ensure that it is resilient under high CPU and memory utilisation.

With Chaos Engineering, you can ensure that your MLOps setup is stable, robust, and resilient to faults and failures.

Once you are done, delete the resource group chaos to prevent additional charges.
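
The cleanup can be done with a single Azure CLI command. The sketch below prints it by default because deleting a resource group is irreversible; set `DRY_RUN=0` to execute it. It assumes everything for this walkthrough lives in the resource group named chaos.

```shell
#!/bin/sh
# Resource group to delete; "chaos" is the name used in this post.
RESOURCE_GROUP="${RESOURCE_GROUP:-chaos}"

# Print the command unless DRY_RUN=0, in which case execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# --yes skips the confirmation prompt; --no-wait returns before deletion finishes.
run az group delete --name "$RESOURCE_GROUP" --yes --no-wait
```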

If you liked this article, show some love by dropping a heart and sharing it across your social handles.

Let's add more chaos to our Machine Learning in upcoming blogs, stay tuned!
