Shubham Gupta

MLOps

Overview:

Industry reports suggest that nearly 70 to 80% of machine learning models never actually make it to production. These models struggle with factors such as version control, reproducibility, scaling and deployment. Machine learning models also demand substantial compute and storage, and therefore cost, which has to be managed in parallel. Manual configuration and tuning of a model increases the chances of inaccuracy, overfitting and poor results. Incorrect decisions about the data, selection of an inappropriate algorithm and skipped cross-validation checks are other possible reasons.
Machine learning operations (MLOps) bridges this gap by streamlining and automating the complete process, i.e., development, deployment and maintenance of machine learning models. This ensures an efficient machine learning lifecycle and fosters collaboration among data scientists, developers and IT engineers.

What is MLOps?

MLOps isn’t a tool. It’s a practice, much like any other software development lifecycle (SDLC). Or rather, machine learning operations is a set of practices designed specifically for building, running and deploying machine learning models in a streamlined, automated fashion. It’s a concept specific to machine learning model development and does not extend to general software application development.
With MLOps, one can extend agile principles to machine learning projects. It is slowly becoming a standard engineering practice in the IT sector, leveraging the three core disciplines involved in the journey: machine learning, data science and DevOps.
It facilitates model creation by leveraging core principles such as automation through CI/CD, version control of datasets and code, model evaluation, monitoring and continuous feedback.

Relationship to Machine Learning

Before understanding how MLOps relates to machine learning, we need to understand the typical process a machine learning model goes through before reaching production servers.

In any machine learning model creation, a company starts with data sourcing and data aggregation. The nature of the data obtained from different sources may differ drastically, so organizing it into specific groups becomes an overhead.

Then the data undergoes standard machine learning processes such as data wrangling (also known as data munging or data cleaning). Here it is validated against null or duplicate values, or any value that doesn’t match the category it belongs to. Standard operations like normalization and regularization are applied: normalization scales variables into a desirable range, while regularization helps control overfitting. The data is then checked for bias and, if any is found, it is mitigated immediately.
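As a rough illustration of this wrangling step, here is a minimal sketch using pandas and scikit-learn; the file name and column names (customers.csv, plan, age, monthly_spend) are hypothetical.

```python
# Minimal data-wrangling sketch: drop duplicates/nulls, validate a categorical column,
# and normalize numeric columns into the 0-1 range.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("customers.csv")                 # hypothetical raw dataset
df = df.drop_duplicates().dropna()                # remove duplicate and null rows

valid_plans = {"basic", "premium", "enterprise"}  # assumed set of valid categories
df = df[df["plan"].isin(valid_plans)]             # drop values outside the expected category

numeric_cols = ["age", "monthly_spend"]
df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])  # normalization
```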

After preprocessing, the model undergoes algorithm selection, where different algorithms are tried through trial and error. The model is then trained locally. Once it is trained, cross-validation checks are performed to measure and compare accuracy, precision and F-scores. This is first done in an isolated environment and then in a hosted environment. Confusion matrices are also created to identify data points where the model is inadequate or inaccurate.
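A minimal sketch of these cross-validation checks with scikit-learn is shown below; it uses a bundled toy dataset and logistic regression purely for illustration, not any specific model from the text.

```python
# Cross-validation metrics plus a confusion matrix to spot where the model misclassifies.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_validate

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

scores = cross_validate(model, X, y, cv=5, scoring=["accuracy", "precision", "f1"])
print("accuracy:", scores["test_accuracy"].mean(), "f1:", scores["test_f1"].mean())

preds = cross_val_predict(model, X, y, cv=5)      # out-of-fold predictions
print(confusion_matrix(y, preds))                 # rows = actual, columns = predicted
```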

With the weak spots identified, the model is tuned with specific hyperparameters to achieve the desired results in those areas as well. The model is then exposed for testing via an API endpoint and then proceeds to deployment.
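Hyperparameter tuning of this kind is often done with a grid or random search; here is a minimal scikit-learn sketch where the estimator and parameter grid are illustrative assumptions.

```python
# Minimal hyperparameter-tuning sketch with GridSearchCV.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {"n_estimators": [100, 200], "max_depth": [4, 8, None]}  # assumed grid
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                      cv=5, scoring="f1")
search.fit(X, y)
print(search.best_params_, search.best_score_)    # best combination found
```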

Now that we know the process, how do we differentiate ML from MLOps? Where machine learning focuses on technically creating models at each phase, MLOps focuses on streamlining and automating the complete machine learning process; it covers the implementation and management of these operations across the machine learning lifecycle. With MLOps, a model is far more likely to perform at its best and to adapt over time to variations in data and model parameters.

Core Principles:

  1. Collaboration: promotes engagement and clarity among developers, data scientists and IT experts, fostering communication and an effective understanding of the process.

  2. Automation: repetitive tasks such as preparing data, training the model and deployment can be automated, freeing data scientists to focus on evaluation and innovation.

  3. Reproducibility: results must be reproducible for each and every test/experiment performed. This supports debugging, comparison and validation of experiment results (a minimal sketch follows this list).

  4. Versioning: helps keep track of the changes made to ML models so that rolling back to a previous version is possible in case of failure or a drop in accuracy.

  5. Continuity: this refers to a set of streamlined activities that run continuously to preserve the context of MLOps:
     - Continuous integration for validating data and testing models with the help of ML pipelines.
     - Continuous training to retrain on incoming data in (near) real time and improve on poor outcomes.
     - Continuous deployment to automatically roll out new changes to the deployed model.
     - Continuous monitoring to keep frequent checks on system resources as well as the model.

  6. Governance and security: managing the complete process with the help of the necessary documentation, protecting sensitive data, collecting feedback and retraining the models with tuned changes. It also covers promoting collaboration among developers, data scientists and stakeholders for the sake of clarity of the process and transparency in the work.
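To make the reproducibility and versioning ideas above concrete, here is a minimal sketch of pinning random seeds and fingerprinting the training data so an experiment can be rerun on exactly the same inputs; the file path and seed value are assumptions for illustration.

```python
# Minimal reproducibility sketch: fix random seeds and record a hash of the training
# data so the exact data version behind an experiment can be traced later.
import hashlib
import random

import numpy as np

SEED = 42                    # assumed seed value
random.seed(SEED)
np.random.seed(SEED)

def data_fingerprint(path: str) -> str:
    """Return a SHA-256 hash of the raw training file (path is hypothetical)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

print({"seed": SEED, "data_sha256": data_fingerprint("train.csv")})
```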

Tools: the tools used at each stage are called out step by step in the implementation below.

Implementation:

  1. Data Collection: data is obtained from various sources such as APIs, databases, logs, etc., and then stored in cloud-native storage such as Azure Data Lake or Amazon S3 (see the upload sketch after this list).

  2. Data Preprocessing and Feature Engineering: the raw data is cleaned, structured, and the necessary transformations are performed. Features are extracted and stored for reuse across models. Tools used include Azure Data Factory, Databricks, etc.

  3. Model Creation: the model is developed using libraries like TensorFlow or PyTorch and experimented with across different data and architectures. Necessary metrics are logged. Tools involved: Jupyter Notebook for local development, MLflow for experiment tracking and Git for version control (a tracking sketch follows this list).

  4. CI Pipeline: the continuous integration pipeline is triggered when code is pushed to the Git repository. Data schemas are validated (see the schema-check sketch after this list) and the training code is packaged using Docker, YAML pipeline definitions, etc.

  5. Training and tuning: automated training runs on the code packaged by the CI pipeline using Kubeflow or MLflow, and scheduled training jobs are executed on cloud compute such as Azure ML or SageMaker.

  6. Evaluation and Registry: trained models are validated against the test data. Once the accuracy check passes, the model is logged, versioned and registered in a model registry (a registration sketch follows this list).

  7. CD Pipeline: after approval, the model is deployed via continuous delivery pipelines. The deployment may be in batches (scheduled) or real time, e.g. behind a FastAPI endpoint (a serving sketch follows this list). Canary or blue-green strategies are preferred for such deployments.

  8. Monitoring and Logs: the model’s behaviour is recorded in production. Factors such as latency, drift in input features and accuracy are monitored. The logs obtained are stored and analysed, and alerts are configured for any anomalies or threshold breaches (a simple drift-check sketch follows this list).

  9. Feedback Loop: production feedback is monitored and used to automatically trigger the pipeline again with tuned changes. The retrained model is evaluated, compared with the previous one and re-deployed if it performs better.
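For the data collection step (1), here is a minimal sketch of landing a collected file in Amazon S3 with boto3; the file, bucket and key names are placeholders, and Azure Data Lake has an equivalent SDK.

```python
# Upload a raw data file pulled from an API/log source into S3.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="raw_events_2024-01-01.json",   # local file (hypothetical)
    Bucket="my-ml-raw-data",                 # hypothetical bucket name
    Key="raw/events/2024-01-01.json",        # object key inside the bucket
)
```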
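For model creation (step 3), here is a minimal sketch of experiment tracking with MLflow; the experiment name, parameters and metric value are illustrative.

```python
# Log parameters and metrics for one training run so experiments stay comparable.
import mlflow

mlflow.set_experiment("churn-model")          # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("n_estimators", 200)
    mlflow.log_param("max_depth", 8)
    # ... train the model here ...
    mlflow.log_metric("f1", 0.91)             # illustrative metric value
```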
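For the CI pipeline (step 4), here is a minimal sketch of the kind of data-schema validation a CI job might run before packaging the training code; the column names, dtypes and file path are assumptions.

```python
# Fail the CI job if the sample training data does not match the expected schema.
import sys

import pandas as pd

EXPECTED = {"age": "int64", "monthly_spend": "float64", "plan": "object"}  # assumed schema

df = pd.read_csv("data/train_sample.csv")     # hypothetical sample checked in CI

missing = set(EXPECTED) - set(df.columns)
bad_types = {c: str(df[c].dtype) for c in EXPECTED
             if c in df.columns and str(df[c].dtype) != EXPECTED[c]}

if missing or bad_types:
    print(f"Schema check failed: missing={missing}, wrong dtypes={bad_types}")
    sys.exit(1)                               # non-zero exit fails the pipeline
print("Schema check passed")
```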
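For evaluation and registry (step 6), here is a minimal sketch of registering a validated model in the MLflow Model Registry; the run ID and model name are placeholders.

```python
# Register the model logged by a successful run; each registration creates a new version.
import mlflow

run_id = "abc123"                                            # placeholder run ID
result = mlflow.register_model(f"runs:/{run_id}/model", "churn-model")
print(result.name, result.version)
```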
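For the CD pipeline (step 7), here is a minimal sketch of real-time serving behind FastAPI; the model file and feature names are assumptions, and the model is assumed to be a pickled scikit-learn estimator. It can be run locally with uvicorn.

```python
# Serve predictions from a previously trained, pickled model over HTTP.
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

with open("model.pkl", "rb") as f:            # hypothetical model artifact
    model = pickle.load(f)

class Features(BaseModel):
    age: float
    monthly_spend: float

@app.post("/predict")
def predict(features: Features):
    pred = model.predict([[features.age, features.monthly_spend]])
    return {"prediction": int(pred[0])}
```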
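For monitoring (step 8), here is a naive drift check on a single input feature: compare its recent mean against a baseline captured at training time and alert past a threshold. All numbers are illustrative.

```python
# Simple input-drift check; in practice the recent values would come from production logs.
import numpy as np

TRAIN_MEAN = 54.2        # baseline feature mean recorded at training time (assumed)
THRESHOLD = 0.15         # allow up to 15% relative shift before alerting

recent_values = np.array([58.1, 61.3, 57.8, 60.2])   # illustrative production samples
drift = abs(recent_values.mean() - TRAIN_MEAN) / TRAIN_MEAN

if drift > THRESHOLD:
    print(f"ALERT: input drift {drift:.1%} exceeds threshold")  # hook into alerting here
```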

Business Value and Benefits:

  1. Faster time to market with lower operational costs: automating model creation and deployment is extremely helpful in delivering business value to the organization and provides a strategic advantage.
  2. Efficiency in managing models in production and in reproducing behaviour for troubleshooting; model versions are easy to track and manage.
  3. With CI/CD, performance degradation is minimised, thereby maintaining the quality of the model.
  4. Improved quality and accuracy of the model, leading to better business decisions, as a result of continuous evaluation, testing and monitoring. For example, this helps maintain recommendation accuracy despite changing customer needs.
  5. Easy scaling from a single model to hundreds across regions as required; teams can operate and innovate independently without bottlenecks or failures, so ML capabilities grow along with the business.

Conclusion:

MLOps is a practice similar to DevOps but designed specifically for machine learning projects. The question you should now ask is: “How can I implement this process in my projects?” Well, we have a ton of tools and cloud services for that. One such service is AWS SageMaker, a fully managed service to build, train and deploy ML models. It provides fully managed infrastructure and fits a wide range of use cases. You can learn more at “The center for all your data, analytics, and AI – Amazon SageMaker – AWS”. So start streamlining your ML workflow right now!
