Vivek0712

Posted on Feb 15, 2022

Adding Chaos to ML Compute Targets

#azure #machinelearning #chaos #tutorial

The cloud provides the compute resources for machine learning jobs such as training, hyperparameter tuning, and inference, all of which demand high CPU and memory utilisation. How do we ensure that the compute targets can withstand that pressure and safely execute these jobs? This blog walks you through applying Chaos Engineering approaches in Machine Learning Operations (MLOps) to establish a well-architected Azure solution.

This blog covers two main sections:

Attaching compute targets to Azure ML Workspace
Create and Run Chaos experiment on Azure VM

If you are new to Chaos Engineering, do check out the basic concepts in the previous blog here

Attaching remote VM for Machine Learning Jobs

Create a Data Science Linux Virtual Machine

Go to Azure Portal -> Virtual Machine and create one.
Find the virtual machine listing by typing "data science virtual machine" and selecting "Data Science Virtual Machine- Ubuntu 18.04". You can find more information here

Create Azure Machine Learning workspace

In the Azure Portal, search for the Machine Learning resource and create one!

Attach Remote VM as our compute target

To attach our Data Science Linux Virtual Machine to the Machine Learning workspace as a compute target, follow the steps below.

Navigate to Azure Machine Learning Workspace
Go to compute -> Attached Computes
Click new and add Virtual Machine
Enter the relevant details of our Virtual Machine, and you will see the attached compute listed in the portal.

Create and Run Chaos experiment on Azure VM

We will cause a high-CPU event on our Linux virtual machine using a chaos experiment in Azure Chaos Studio. Running this experiment can help you defend against an application becoming resource-starved.

Once the VM is up and running, SSH into it.

Install stress-ng

For our Chaos Experiment we will use stress-ng, an open-source application that can cause various stress events on a virtual machine.

SSH into the VM you have created (refer) and install stress-ng by running the following command in the VM terminal.

sudo apt-get update && sudo apt-get -y install unzip && sudo apt-get -y install stress-ng
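Before moving on, it may be worth confirming that the install actually succeeded. The following is a minimal sketch of such a check; the helper name `require_cmd` is hypothetical and not part of the original steps.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the named command is on PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if require_cmd stress-ng; then
  # Print the installed version as a sanity check.
  stress-ng --version
else
  echo "stress-ng not found; re-run the install command above" >&2
fi
```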

Create Managed Identity

Create a Managed Identity resource and navigate to Identity and Access Management (IAM).

Add a role assignment of Contributor to our VM.

Set Target in Chaos Studio

Go to Chaos Studio and select Targets. Select our VM.

Create Chaos Experiment

Click Chaos Experiment and create one.

Under Experiment Designer, we are going to design two different faults:

Branch 1_Step1: CPU Pressure
Set the pressure to 95% for 10 mins
Branch 1_Step2: Physical Memory Pressure
Set the pressure to 95% for 10 mins
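For intuition about what these two faults do, similar pressure can be produced by hand with stress-ng. The sketch below only approximates the faults manually; it is not how Chaos Studio itself is configured. The helper names are hypothetical, and each helper prints the command rather than running it so that nothing is stressed by accident.

```shell
#!/bin/sh
# Hypothetical helpers that print (not run) a stress-ng command line.

# cpu_pressure_cmd LEVEL DURATION: load every CPU to roughly LEVEL percent.
cpu_pressure_cmd() {
  # --cpu 0 starts one worker per online CPU.
  echo "stress-ng --cpu 0 --cpu-load $1 --timeout $2"
}

# mem_pressure_cmd LEVEL DURATION: one worker mapping LEVEL percent of available memory.
mem_pressure_cmd() {
  echo "stress-ng --vm 1 --vm-bytes $1% --timeout $2"
}

# Mirror the experiment settings: 95% for 10 minutes.
cpu_pressure_cmd 95 10m
mem_pressure_cmd 95 10m
```

To actually apply the load on a disposable VM, the printed line can be piped to the shell, for example `cpu_pressure_cmd 95 10m | sh`.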

Review and create the experiment

Give permissions to Chaos Experiment

Navigate to our VM and give Contributor access to the Chaos Experiment.

Start the Chaos Experiment

Navigate back to the Chaos Experiment and start it. You can click on the details of the run to see the injected faults.

Monitor the Experiment

Once the experiment has started, you can observe its state.

CPU Pressure

SSH into our VM and use the top command to view the CPU utilisation of the VM. As per our fault injection, stress-ng exerts 95% CPU pressure on the VM.
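If you would rather capture a single number than watch an interactive screen, CPU utilisation can also be estimated from the kernel counters in /proc/stat. This is a minimal sketch, assuming a Linux VM; the function name `busy_pct` is hypothetical, and it counts idle and iowait time as "not busy".

```shell
#!/bin/sh
# busy_pct PREV CURR: each argument is an aggregate "cpu ..." line from /proc/stat.
# Prints the whole-number percentage of non-idle time between the two samples.
busy_pct() {
  awk -v prev="$1" -v curr="$2" 'BEGIN {
    split(prev, p, " "); split(curr, c, " ")
    # Fields 2..9: user, nice, system, idle, iowait, irq, softirq, steal.
    for (i = 2; i <= 9; i++) total += c[i] - p[i]
    idle = (c[5] - p[5]) + (c[6] - p[6])   # idle + iowait
    if (total > 0) printf "%d\n", (total - idle) * 100 / total
  }'
}

# Sample the counters one second apart when /proc/stat is readable.
if [ -r /proc/stat ]; then
  first=$(grep '^cpu ' /proc/stat)
  sleep 1
  second=$(grep '^cpu ' /proc/stat)
  echo "CPU busy: $(busy_pct "$first" "$second")%"
fi
```

While the CPU fault is active, the reported value should sit near the configured pressure level.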

Physical Memory Pressure

Using the same command, you can see that free memory is below 5%, as per the fault injected into the VM.
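The memory headroom can likewise be read as a single figure from /proc/meminfo. The sketch below is an assumption-laden stand-in for reading top: it uses the kernel's MemAvailable estimate, which is not identical to the "free" column, and the function name `mem_available_pct` is hypothetical.

```shell
#!/bin/sh
# mem_available_pct [FILE]: print MemAvailable as a whole-number percentage of MemTotal.
# FILE defaults to /proc/meminfo; the parameter exists only to make the helper testable.
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
  echo "Memory available: $(mem_available_pct)%"
fi
```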

Once the experiment is over, the VM returns to its normal state.

Experiment Results

Observation:

We noticed that both faults were successfully injected by our Chaos Experiment.

Inference:

We inferred that the VM is capable of handling high CPU and memory pressure, which makes it suitable for Machine Learning jobs such as model training, hyperparameter tuning, etc.

Mitigation:

You can set up alerts, load balancing, and a backup VM as mitigation plans.

Conclusion

We have successfully performed load testing on our Data Science Virtual Machine to ensure that the VM is resilient enough to handle high CPU and memory utilisation.

With Chaos Engineering, you can ensure that your MLOps is stable, robust and resilient to faults and failures.

Delete the resource group chaos once you are done to prevent additional charges.
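If you prefer the Azure CLI to the portal for cleanup, something like the following sketch could be used. It assumes the CLI is installed and logged in, and the wrapper name `delete_resource_group` and the `CONFIRM` variable are hypothetical; by default it only prints the command so nothing is deleted by accident.

```shell
#!/bin/sh
# Hypothetical wrapper: prints the cleanup command unless CONFIRM=yes is set.
delete_resource_group() {
  if [ "${CONFIRM:-no}" = "yes" ]; then
    # --yes skips the interactive prompt; --no-wait returns without waiting for completion.
    az group delete --name "$1" --yes --no-wait
  else
    echo "az group delete --name $1 --yes --no-wait"
  fi
}

delete_resource_group chaos
```

Setting `CONFIRM=yes` in the environment before running the script performs the deletion for real.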

If you liked this article, show some love by dropping a heart and sharing it across your social handles.

Let's add more chaos to our Machine Learning in upcoming blogs, stay tuned!

DEV Community