<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ed Shee</title>
    <description>The latest articles on DEV Community by Ed Shee (@ukcloudman).</description>
    <link>https://dev.to/ukcloudman</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F670870%2Fe33dc8fa-d51b-4b8c-a2b4-6b0b2e05158c.png</url>
      <title>DEV Community: Ed Shee</title>
      <link>https://dev.to/ukcloudman</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ukcloudman"/>
    <language>en</language>
    <item>
      <title>Serving Python Machine Learning Models With Ease</title>
      <dc:creator>Ed Shee</dc:creator>
      <pubDate>Tue, 12 Apr 2022 10:48:35 +0000</pubDate>
      <link>https://dev.to/ukcloudman/serving-python-machine-learning-models-with-ease-37kh</link>
      <guid>https://dev.to/ukcloudman/serving-python-machine-learning-models-with-ease-37kh</guid>
      <description>&lt;p&gt;Ever trained a new model and just wanted to use it through an API straight away? Sometimes you don't want to bother writing Flask code or containerizing your model and running it in Docker. If that sounds like you, you definitely want to check out &lt;a href="https://github.com/seldonio/mlserver"&gt;MLServer&lt;/a&gt;. It's a python based inference server that &lt;a href="https://www.seldon.io/introducing-mlserver"&gt;recently went GA&lt;/a&gt; and what's really neat about it is that it's a highly-performant server designed for production environments too. That means that, by serving models locally, you are running in the exact same environment as they will be in when they get to production. &lt;/p&gt;

&lt;p&gt;This blog walks you through how to use MLServer by using a couple of image models as examples...&lt;/p&gt;

&lt;h2&gt;Dataset&lt;/h2&gt;

&lt;p&gt;The dataset we're going to work with is the &lt;a href="https://www.kaggle.com/zalando-research/fashionmnist"&gt;Fashion MNIST dataset&lt;/a&gt;. It contains 70,000 greyscale images of clothing, each 28x28 pixels, spread across 10 classes (top, dress, coat, trouser, etc.). &lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you want to reproduce the code from this blog, make sure you download the files and extract them into a folder named &lt;code&gt;data&lt;/code&gt;. They have been omitted from the GitHub repo because they are quite large.&lt;/em&gt;&lt;/p&gt;
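
&lt;p&gt;&lt;em&gt;In the CSV files, each row is one image: a &lt;code&gt;label&lt;/code&gt; column followed by 784 pixel columns. As a quick sketch of that layout (using made-up pixel values rather than the real files), a row can be reshaped back into a 28x28 image like so:&lt;/em&gt;&lt;/p&gt;

```python
import numpy as np

# A made-up row of 784 greyscale pixel values (0-255), standing in for
# one row of fashion-mnist_train.csv with the label column dropped
pixels = np.arange(784) % 256

# Reshape the flat row back into the original 28x28 image
image = pixels.reshape(28, 28)
print(image.shape)  # (28, 28)
```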

&lt;h2&gt;Training the Scikit-learn Model&lt;/h2&gt;

&lt;p&gt;First up, we're going to train a support vector machine (SVM) model using the &lt;a href="https://scikit-learn.org/"&gt;scikit-learn&lt;/a&gt; framework. We'll then save the model to a file named &lt;code&gt;Fashion_MNIST.joblib&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;svm&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;joblib&lt;/span&gt;

&lt;span class="c1"&gt;#Load Training Data
&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'../../data/fashion-mnist_train.csv'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'label'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;X_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;'label'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;classifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;svm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SVC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"poly"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;degree&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gamma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Train Model
&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;exec_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'Execution time: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;exec_time&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; seconds'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Save Model
&lt;/span&gt;&lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Fashion-MNIST.joblib"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: The SVM algorithm is not particularly well suited to large datasets because of its quadratic training complexity. Depending on your hardware, the model in this example will take a couple of minutes to train.&lt;/p&gt;

&lt;h2&gt;Serving the Scikit-learn Model&lt;/h2&gt;

&lt;p&gt;Ok, so we've now got a saved model file &lt;code&gt;Fashion_MNIST.joblib&lt;/code&gt;. Let's take a look at how we can serve that using MLServer...&lt;/p&gt;

&lt;p&gt;First up, we need to install MLServer.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install mlserver&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The additional runtimes are optional but they make life really easy when serving models. We'll install the Scikit-Learn and XGBoost ones too:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install mlserver-sklearn mlserver-xgboost&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;You can find details on all of the inference runtimes &lt;a href="https://mlserver.readthedocs.io/en/latest/runtimes/index.html#included-inference-runtimes"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once we've done that, all we need to do is add two configuration files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;settings.json&lt;/code&gt; - This contains the configuration for the server itself.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;model-settings.json&lt;/code&gt; - As the name suggests, this file contains configuration for the model we want to run. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For our &lt;code&gt;settings.json&lt;/code&gt; file it's enough to define a single parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"debug"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;model-settings.json&lt;/code&gt; file requires a few more bits of info as it needs to know about the model we're trying to serve:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fashion-sklearn"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"implementation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mlserver_sklearn.SKLearnModel"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"uri"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./Fashion_MNIST.joblib"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"v1"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;name&lt;/code&gt; parameter should be self-explanatory. It gives MLServer a unique identifier which is particularly useful when serving multiple models (we'll come to that in a bit). The &lt;code&gt;implementation&lt;/code&gt; defines which pre-built server, if any, to use. It is heavily coupled to the machine learning framework used to train your model. In our case we trained the model using scikit-learn so we're going to use the scikit-learn implementation for MLServer. For model &lt;code&gt;parameters&lt;/code&gt; we just need to provide the location of our model file as well as a version number.&lt;/p&gt;

&lt;p&gt;That's it, two small config files and we're ready to serve our model using the command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mlserver start .&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Boom, we've now got our model running locally on a production-ready server. It's ready to accept requests over HTTP and gRPC (default ports &lt;code&gt;8080&lt;/code&gt; and &lt;code&gt;8081&lt;/code&gt; respectively).&lt;/p&gt;
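
&lt;p&gt;&lt;em&gt;As a quick sanity check (a sketch of ours, not part of the original repo), you can poll the server's V2 readiness endpoint before sending any inference requests:&lt;/em&gt;&lt;/p&gt;

```python
import requests

def ready_url(host="localhost", port=8080):
    # KServe V2 readiness endpoint exposed by MLServer over HTTP
    return f"http://{host}:{port}/v2/health/ready"

def server_is_ready(host="localhost", port=8080, timeout=2.0):
    """Return True if an MLServer instance is up and ready to serve."""
    try:
        return requests.get(ready_url(host, port), timeout=timeout).status_code == 200
    except requests.exceptions.RequestException:
        # Connection refused / timed out: the server isn't up yet
        return False
```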

&lt;h2&gt;Testing the Model&lt;/h2&gt;

&lt;p&gt;Now that our model is up and running, let's send some requests to see it in action.&lt;/p&gt;

&lt;p&gt;To make predictions on our model, we need to send a POST request to the following URL:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;http://localhost:8080/v2/models/&amp;lt;MODEL_NAME&amp;gt;/versions/&amp;lt;VERSION&amp;gt;/infer&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That means to access our scikit-learn model that we trained earlier, we need to replace the &lt;code&gt;MODEL_NAME&lt;/code&gt; with &lt;code&gt;fashion-sklearn&lt;/code&gt; and &lt;code&gt;VERSION&lt;/code&gt; with &lt;code&gt;v1&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;The code below shows how to import the test data, make a request to the model server and then compare the result with the actual label:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt;

&lt;span class="c1"&gt;#Import test data, grab the first row and corresponding label
&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'../../data/fashion-mnist_test.csv'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'label'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;X_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;'label'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;#Prediction request parameters
&lt;/span&gt;&lt;span class="n"&gt;inference_request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;"inputs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"predict"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s"&gt;"shape"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s"&gt;"datatype"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"FP64"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:8080/v2/models/fashion-sklearn/versions/v1/infer"&lt;/span&gt;

&lt;span class="c1"&gt;#Make request and print response
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inference_request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When running the &lt;code&gt;test.py&lt;/code&gt; code above we get the following response from MLServer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fashion-sklearn"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"31c3fa70-2e56-49b1-bcec-294452dbe73c"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"outputs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"predict"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"shape"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"datatype"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"INT64"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll notice that MLServer has generated a request id and automatically added metadata about the model and version that was used to serve our request. Capturing this kind of metadata is super important once our model gets to production; it allows us to log every request for audit and troubleshooting purposes. &lt;/p&gt;

&lt;p&gt;You might also notice that MLServer has returned an array for &lt;code&gt;outputs&lt;/code&gt;. In our request we only sent one row of data but MLServer also handles batch requests and returns them together. You can even use a technique called &lt;a href="https://mlserver.readthedocs.io/en/latest/user-guide/adaptive-batching.html"&gt;adaptive batching&lt;/a&gt; to optimise the way multiple requests are handled in production environments. &lt;/p&gt;

&lt;p&gt;In our example above, the model's prediction can be found in &lt;code&gt;outputs[0].data&lt;/code&gt;, which shows that the model has labeled this sample with the category &lt;code&gt;0&lt;/code&gt; (the value &lt;code&gt;0&lt;/code&gt; corresponds to the category &lt;code&gt;t-shirt/top&lt;/code&gt;). The true label for that sample was also &lt;code&gt;0&lt;/code&gt;, so the model got this prediction correct!&lt;/p&gt;
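
&lt;p&gt;&lt;em&gt;To make responses human readable, you can map the integer class ids back to the Fashion MNIST category names. A minimal sketch (the &lt;code&gt;decode_predictions&lt;/code&gt; helper is ours, not part of MLServer):&lt;/em&gt;&lt;/p&gt;

```python
# Fashion MNIST class names, indexed by label id (0-9)
LABELS = ["t-shirt/top", "trouser", "pullover", "dress", "coat",
          "sandal", "shirt", "sneaker", "bag", "ankle boot"]

def decode_predictions(response_json):
    """Translate the integer ids in a V2 inference response into label names."""
    return [LABELS[i] for i in response_json["outputs"][0]["data"]]

# An abridged version of the response we got back from MLServer above
response_json = {"outputs": [{"name": "predict", "data": [0]}]}
print(decode_predictions(response_json))  # ['t-shirt/top']
```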

&lt;h2&gt;Training the XGBoost Model&lt;/h2&gt;

&lt;p&gt;Now that we've seen how to create and serve a single model using MLServer, let's take a look at how we'd handle multiple models trained in different frameworks. &lt;/p&gt;

&lt;p&gt;We'll be using the same Fashion MNIST dataset but, this time, we'll train an &lt;a href="https://xgboost.readthedocs.io/en/stable/"&gt;XGBoost&lt;/a&gt; model instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;xgboost&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt;

&lt;span class="c1"&gt;#Load Training Data
&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'../../data/fashion-mnist_train.csv'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'label'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;X_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;'label'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dtrain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DMatrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Train Model
&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;'max_depth'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'eta'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'verbosity'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'objective'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'multi:softmax'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'num_class'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;num_round&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;

&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;bstmodel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtrain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_round&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evals&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;dtrain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'label'&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt; &lt;span class="n"&gt;verbose_eval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;exec_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'Execution time: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;exec_time&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; seconds'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Save Model
&lt;/span&gt;&lt;span class="n"&gt;bstmodel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;save_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Fashion_MNIST.json'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code above, used to train the XGBoost model, is similar to the code we used earlier to train the scikit-learn model but this time our model has been saved in an XGBoost-compatible format as &lt;code&gt;Fashion_MNIST.json&lt;/code&gt;.&lt;/p&gt;
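
&lt;p&gt;&lt;em&gt;The XGBoost model needs its own &lt;code&gt;model-settings.json&lt;/code&gt;, pointing at the &lt;code&gt;mlserver_xgboost&lt;/code&gt; runtime. It would look something along these lines (a sketch; the name &lt;code&gt;fashion-xgboost&lt;/code&gt; matches the server logs later in this post):&lt;/em&gt;&lt;/p&gt;

```json
{
    "name": "fashion-xgboost",
    "implementation": "mlserver_xgboost.XGBoostModel",
    "parameters": {
        "uri": "./Fashion_MNIST.json",
        "version": "v1"
    }
}
```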

&lt;h2&gt;Serving Multiple Models&lt;/h2&gt;

&lt;p&gt;One of the cool things about MLServer is that it supports &lt;a href="https://mlserver.readthedocs.io/en/latest/examples/mms/README.html"&gt;multi-model serving&lt;/a&gt;. This means you don't have to create or run a new server for each ML model you want to deploy. We'll use this feature to serve both of the models we built above at once.&lt;/p&gt;

&lt;p&gt;When MLServer starts up, it will search the directory (and any subdirectories) for &lt;code&gt;model-settings.json&lt;/code&gt; files. If you've got multiple &lt;code&gt;model-settings.json&lt;/code&gt; files then it'll automatically serve them all. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: you still only need a single &lt;code&gt;settings.json&lt;/code&gt; (server config) file in the root directory.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here's a breakdown of my directory structure for reference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
├── data
│   ├── fashion-mnist_test.csv
│   └── fashion-mnist_train.csv
├── models
│   ├── sklearn
│   │   ├── Fashion_MNIST.joblib
│   │   ├── model-settings.json
│   │   ├── test.py
│   │   └── train.py
│   └── xgboost
│       ├── Fashion_MNIST.json
│       ├── model-settings.json
│       ├── test.py
│       └── train.py
├── README.md
├── settings.json
└── test_models.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that there are two &lt;code&gt;model-settings.json&lt;/code&gt; files - one for the scikit-learn model and one for the XGBoost model. &lt;/p&gt;

&lt;p&gt;We can now just run &lt;code&gt;mlserver start .&lt;/code&gt; and it will start handling requests for both models.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;mlserver] INFO - Loaded model &lt;span class="s1"&gt;'fashion-sklearn'&lt;/span&gt; succesfully.
&lt;span class="o"&gt;[&lt;/span&gt;mlserver] INFO - Loaded model &lt;span class="s1"&gt;'fashion-xgboost'&lt;/span&gt; succesfully.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Testing Accuracy of Multiple Models&lt;/h2&gt;

&lt;p&gt;With both models now up and running on MLServer, we can use the samples from our test set to validate how accurate each of our models is. &lt;/p&gt;

&lt;p&gt;The following code sends a batch request (containing the full test set) to each of the models and then compares the predictions received to the true labels. Doing this across the whole test set gives us a reasonably good measure for each model's accuracy, which gets printed at the end.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;#Import the test data and split the data from the labels
&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'./data/fashion-mnist_test.csv'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'label'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;X_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;'label'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Build the inference request
&lt;/span&gt;&lt;span class="n"&gt;inference_request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;"inputs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"predict"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s"&gt;"shape"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s"&gt;"datatype"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"FP64"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;#Send the prediction request to the relevant model, compare responses to training labels and calculate accuracy
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;infer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"http://localhost:8080/v2/models/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/versions/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/infer"&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inference_request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;#calculate accuracy
&lt;/span&gt;    &lt;span class="n"&gt;correct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prediction&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s"&gt;'outputs'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'data'&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;correct&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;correct&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'Model Accuracy for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;infer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"fashion-xgboost"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"v1"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;infer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"fashion-sklearn"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"v1"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The results show that the XGBoost model slightly outperforms the SVM scikit-learn one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model Accuracy for fashion-xgboost: 0.8953
Model Accuracy for fashion-sklearn: 0.864
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Hopefully by now you've gained an understanding of how easy it is to serve models using &lt;a href="https://mlserver.readthedocs.io/en/latest/index.html"&gt;MLServer&lt;/a&gt;. For further info it's worth reading the &lt;a href="https://mlserver.readthedocs.io/en/latest/index.html"&gt;docs&lt;/a&gt; and taking a look at the &lt;a href="https://mlserver.readthedocs.io/en/latest/examples/index.html"&gt;examples for different frameworks&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;For &lt;a href="https://mlflow.org/"&gt;MLFlow&lt;/a&gt; users you can now serve &lt;a href="https://www.mlflow.org/docs/latest/models.html#serving-with-mlserver-experimental"&gt;models directly in MLFlow using MLServer&lt;/a&gt; and if you're a &lt;a href="https://kubernetes.io/"&gt;Kubernetes&lt;/a&gt; user you should definitely check out &lt;a href="https://docs.seldon.io/projects/seldon-core/en/latest/index.html"&gt;Seldon Core&lt;/a&gt; - an open source tool that deploys models to Kubernetes (it uses MLServer under the covers). &lt;/p&gt;

&lt;p&gt;All of the code from this example can be found &lt;a href="https://github.com/edshee/mlserver-example"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>datascience</category>
      <category>devops</category>
    </item>
    <item>
      <title>6 Types of AI Bias Everyone Should Know</title>
      <dc:creator>Ed Shee</dc:creator>
      <pubDate>Mon, 11 Oct 2021 08:30:40 +0000</pubDate>
      <link>https://dev.to/ukcloudman/6-types-of-ai-bias-everyone-should-know-3i95</link>
      <guid>https://dev.to/ukcloudman/6-types-of-ai-bias-everyone-should-know-3i95</guid>
      <description>&lt;p&gt;In my previous blog we looked at &lt;a href="https://medium.com/r/?url=https%3A%2F%2Fpub.towardsai.net%2Fbias-vs-fairness-vs-explainability-in-ai-5e0f37ffb22"&gt;the difference between Bias, Fairness and Explainability in AI&lt;/a&gt;. I included a high level view of what Bias is but this time we'll go in to more detail.&lt;/p&gt;

&lt;p&gt;Bias appears in machine learning in lots of different forms. The important thing to consider is that training a machine learning model is a lot like bringing up a child.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pXPypISV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/24j32ngaekkebab9whwi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pXPypISV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/24j32ngaekkebab9whwi.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When a child develops, they use senses like hearing, vision and touch to learn from the world around them. Their understanding of the world, their opinions, and the decisions they end up making are all heavily influenced by their upbringing. For example, a child that grows up and lives in a sexist community may never realise there is anything biased about the way they view different genders. Machine learning models are exactly the same. Instead of using senses as inputs, they use data - data that &lt;em&gt;we&lt;/em&gt; give them! This is why it's so important to try and avoid bias in the data used for training machine learning models. Let's take a closer look at some of the most common forms of bias in machine learning:&lt;/p&gt;

&lt;h3&gt;
  
  
  Historical Bias
&lt;/h3&gt;

&lt;p&gt;While gathering data for training a machine learning algorithm, grabbing historical data is almost always the easiest place to start. If we're not careful, however, it's very easy to include bias that was present in the historical data.&lt;/p&gt;

&lt;p&gt;Take Amazon, for example: in 2014 they set out to build a system for automatically screening job applicants. The idea was to feed the system hundreds of CVs and have the top candidates picked out automatically. The system was trained on 10 years' worth of job applications and their outcomes. The problem? Most employees at Amazon were male (particularly in technical roles). The algorithm learned that, because there were more men than women at Amazon, men were more suitable candidates, and it actively discriminated against non-male applicants. By 2015 the whole project had to be scrapped. &lt;/p&gt;

&lt;h3&gt;
  
  
  Sample Bias
&lt;/h3&gt;

&lt;p&gt;Sample bias happens when your training data does not accurately reflect the makeup of the real world usage of your model. Usually one population is either heavily overrepresented or underrepresented.&lt;/p&gt;

&lt;p&gt;I recently saw a talk from David Keene and he gave a really good example of sample bias.&lt;/p&gt;

&lt;p&gt;When training a speech-to-text system, you need lots of audio clips together with their corresponding transcriptions. Where better to get lots of this data than audiobooks? What could be wrong with that approach?&lt;/p&gt;

&lt;p&gt;Well, it turns out that the vast majority of audiobooks are narrated by well-educated, middle-aged white men. Unsurprisingly, speech recognition software trained using this approach underperforms when the user is from a different socio-economic or ethnic background.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uwUXRPcl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/edpnk6eg3vsje6tgblm3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uwUXRPcl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/edpnk6eg3vsje6tgblm3.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The chart above shows the word error rate (WER) for speech recognition systems from big tech companies. You can clearly see that all of the algorithms underperform for black voices compared to white ones.&lt;/p&gt;
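&lt;p&gt;As an aside, the word error rate metric itself is easy to compute: it's the word-level edit distance between the system's transcript and a reference transcript, divided by the length of the reference. A minimal sketch in Python (the &lt;code&gt;wer&lt;/code&gt; helper and the example sentences are made up for illustration):&lt;/p&gt;

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance, computed over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,                 # deletion
                          d[i][j - 1] + 1,                 # insertion
                          d[i - 1][j - 1] + substitution)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of a six-word reference: WER = 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```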

&lt;h3&gt;
  
  
  Label Bias
&lt;/h3&gt;

&lt;p&gt;A lot of the data required to train ML algorithms needs to be labelled before it is useful. You actually do this yourself quite a lot when you log in to websites. Been asked to identify the squares that contain traffic lights? You're actually confirming a set of labels for that image to help train visual recognition models. The way in which we label data, however, varies a lot and inconsistencies in labelling can introduce bias into the system.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wbohJfBb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e6luwsdtoml85ued1knq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wbohJfBb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e6luwsdtoml85ued1knq.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
Imagine you train a system by labeling lions using the boxes on the images above. You then show your system this image:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1Bqul3Sd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rqb3no4odap4nrwaskmz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1Bqul3Sd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rqb3no4odap4nrwaskmz.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
Annoyingly, it is unable to identify the very obvious lion in the picture. By labeling faces only, you've inadvertently made the system biased toward front-facing lion pictures!&lt;/p&gt;

&lt;h3&gt;
  
  
  Aggregation Bias
&lt;/h3&gt;

&lt;p&gt;Sometimes we aggregate data to simplify it, or present it in a particular fashion. This can lead to bias regardless of whether it happens before or after creating our model. Take a look at this chart, for example:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YcLy9Iou--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/si6htgrumtkjeazfacw8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YcLy9Iou--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/si6htgrumtkjeazfacw8.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
It shows how salary increases with the number of years worked in a job. There's a pretty strong correlation here: the longer you work, the more you get paid. Let's now look at the data that was used to create this aggregate though:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YhpCULGp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k7meu4f1l7a6pw4v3q4p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YhpCULGp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k7meu4f1l7a6pw4v3q4p.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
We see that for athletes the complete opposite is true. They are able to earn high salaries early in their careers, while they are still at their physical peak, but their pay then drops off as they stop competing. By aggregating them with other professions we're making our algorithm biased against them.&lt;/p&gt;
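&lt;p&gt;This is the classic Simpson's paradox, and it's easy to reproduce with a toy dataset (all numbers invented for illustration). Fitting a single trend line to everyone shows pay rising with experience, even though for the athletes alone the trend runs the other way:&lt;/p&gt;

```python
# (years_worked, salary) pairs, invented for illustration
office_workers = [(1, 30), (5, 50), (10, 80), (20, 140)]  # pay rises with tenure
athletes = [(1, 60), (5, 55), (10, 45), (20, 30)]         # pay falls after their peak

def slope(points):
    """Least-squares slope of salary against years worked."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in points)
    denominator = sum((x - mean_x) ** 2 for x, _ in points)
    return numerator / denominator

print(slope(office_workers))             # positive
print(slope(athletes))                   # negative
print(slope(office_workers + athletes))  # positive: aggregation hides the athletes' trend
```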

&lt;h3&gt;
  
  
  Confirmation Bias
&lt;/h3&gt;

&lt;p&gt;Simply put, confirmation bias is our tendency to trust information that confirms our existing beliefs and discard information that doesn't. Theoretically, you could build the most accurate ML system ever, with no bias in either the data or the modelling, but if a human reviewer overrides its results based on their own "gut feel", then none of that matters.&lt;/p&gt;

&lt;p&gt;Confirmation bias is particularly prevalent in applications of machine learning where human review is required before any action is taken. The use of AI in healthcare has seen doctors dismiss algorithmic diagnoses because they don't match their own experience or understanding. Often, when investigated, it turns out that the doctors haven't read the most recent research literature, which points to slightly different symptoms, techniques or outcomes. Ultimately, there are only so many research journals one doctor can read (particularly while saving lives full-time), but an ML system can ingest them all.&lt;/p&gt;

&lt;h3&gt;
  
  
  Evaluation Bias
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--scvs6R7n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fmb3deohygmzgwuasn61.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--scvs6R7n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fmb3deohygmzgwuasn61.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
Let's imagine you're building a machine learning model to predict voting turnout across the country during a general election. You're hoping that, by taking a series of features like age, profession, income and political alignment, you can accurately predict whether someone will vote or not. You build your model, use your local election to test it out, and are really pleased by your results. It seems you can correctly predict whether someone will vote or not 95% of the time.&lt;/p&gt;

&lt;p&gt;As the general election rolls around, you are suddenly very disappointed. The model you spent ages designing and testing was only correct 55% of the time - performing only marginally better than a random guess. The poor results are an example of evaluation bias. By only evaluating your model on people in your local area, you have inadvertently designed a system that only works well for them. Other areas of the country, with totally different voting patterns, haven't been properly accounted for, even if they were included in your initial training data. &lt;/p&gt;
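&lt;p&gt;A simple guard against evaluation bias is to report accuracy per group as well as overall, so a strong headline number can't hide a group the model fails on. A sketch of the idea (the voting results below are invented for illustration):&lt;/p&gt;

```python
from collections import defaultdict

# (region, predicted_to_vote, actually_voted) - invented evaluation results
results = [
    ("local", True, True), ("local", False, False), ("local", True, True),
    ("local", False, False), ("local", True, True),
    ("rest_of_country", True, False), ("rest_of_country", False, True),
    ("rest_of_country", True, True), ("rest_of_country", False, False),
    ("rest_of_country", True, False),
]

def accuracy(rows):
    return sum(predicted == actual for _, predicted, actual in rows) / len(rows)

# Group the results by region before scoring
by_region = defaultdict(list)
for row in results:
    by_region[row[0]].append(row)

print(f"overall: {accuracy(results):.0%}")  # 70% - looks respectable
for region, rows in by_region.items():
    print(f"{region}: {accuracy(rows):.0%}")  # 100% locally, only 40% elsewhere
```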

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;You've now seen six different ways that bias can impact machine learning. Whilst it's not an exhaustive list, it should give you a good understanding of the most common ways in which ML systems end up becoming biased. If you're interested in reading further, I'd recommend &lt;a href="https://medium.com/r/?url=https%3A%2F%2Farxiv.org%2Fpdf%2F1908.09635.pdf"&gt;this paper&lt;/a&gt; from Mehrabi et al.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>Bias vs Fairness vs Explainability in AI</title>
      <dc:creator>Ed Shee</dc:creator>
      <pubDate>Tue, 31 Aug 2021 09:12:18 +0000</pubDate>
      <link>https://dev.to/ukcloudman/bias-vs-fairness-vs-explainability-in-ai-193k</link>
      <guid>https://dev.to/ukcloudman/bias-vs-fairness-vs-explainability-in-ai-193k</guid>
      <description>&lt;p&gt;Over the last few years, there has been a distinct focus on building machine learning systems that are, in some way, responsible and ethical. The terms “Bias”, “Fairness” and “Explainability” come up all over the place but their definitions are usually pretty fuzzy and they are widely misunderstood to mean the same thing. This blog aims to clear that up…&lt;/p&gt;

&lt;h2&gt;
  
  
  Bias
&lt;/h2&gt;

&lt;p&gt;Before we look at how bias appears in machine learning, let’s start with the dictionary definition for the word:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“inclination or prejudice for or against one person or group, especially in a way considered to be unfair”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Look! The definition of bias includes the word “unfair”. It’s easy to see why the terms bias and fairness get confused for each other a lot.&lt;/p&gt;

&lt;p&gt;Bias can impact machine learning systems at pretty much every stage. Here’s an example of how historical bias from the world around us can creep into your data:&lt;/p&gt;

&lt;p&gt;Imagine you’re building a model to predict the next word in a sequence of text. To make sure you’ve got lots of training data, you give it every book written in the last 50 years. You then ask it to predict the next word in this sentence:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“The CEO’s name is ____.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You then notice, perhaps unsurprisingly, that your model is much more likely to predict male names for the CEO than female ones. What has happened is you’ve unintentionally taken the historical stereotypes that exist in our society and baked them into your model.&lt;/p&gt;

&lt;p&gt;Bias doesn’t just occur in the data though, it can appear in the model too. If the data used to test a model doesn’t accurately represent the real world, you end up with what’s called evaluation bias.&lt;/p&gt;

&lt;p&gt;A good example of this would be training a facial recognition system and then using photos from Instagram to test it. Your model might have really high accuracy on the test set but it is likely to underperform in the real world because the majority of Instagram users are between the ages of 18 and 35. Your model is now biased towards that age group and will perform worse on the faces of older or younger people.&lt;/p&gt;

&lt;p&gt;There are actually loads of different types of bias in machine learning; I’ll cover all of those in a separate blog.&lt;/p&gt;

&lt;p&gt;The word bias almost always comes with negative connotations but it’s important to note that this isn’t always the case in machine learning. Having prior knowledge of the problem you’re trying to solve can help you to select relevant features during modeling. This introduces human bias but can often speed up or improve the modeling process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UkEH6VK4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vrnxxpbsyv0mf6qlz7tj.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UkEH6VK4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vrnxxpbsyv0mf6qlz7tj.jpeg" alt="question mark"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Explainability
&lt;/h2&gt;

&lt;p&gt;Sometimes referred to as interpretability, explainability attempts to explain &lt;em&gt;how&lt;/em&gt; a machine learning model makes predictions. It is about interrogating a model, gathering information on why a particular prediction (or series of predictions) was made, and then presenting this information back to humans in a comprehensible manner.&lt;/p&gt;

&lt;p&gt;There are typically two situations you’ll be in when trying to explain how a model works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Black Box&lt;/strong&gt; — You have no access or information about the underlying model. The inputs and outputs of the model are all you can use to generate an explanation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;White Box&lt;/strong&gt; — You have access to the underlying model so it’s easier to provide information about exactly why a certain prediction was made.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the whole, “white box” models tend to be simpler in design, sometimes deliberately, so that explanations can be easily generated. The downside is that using a simpler, more interpretable model might fail to capture the complexity of the relationships in your data which means you could be faced with a tradeoff between interpretability and model performance.&lt;/p&gt;

&lt;p&gt;When doing explainability, we’re typically interested in one of two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model View&lt;/strong&gt; — Overall, what features are more important than others to the model?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instance View&lt;/strong&gt; — For a particular prediction, what factors contributed?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The techniques used for explainability depend on whether your model is a black box or white box, whether you’re interested in the model view or instance view, and also depends on the type of data you’re exploring. &lt;a href="https://docs.seldon.io/projects/alibi/en/latest/overview/algorithms.html"&gt;The open source library Alibi&lt;/a&gt; does a great job of explaining these techniques in further detail.&lt;/p&gt;
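&lt;p&gt;To make the “black box” case concrete, one common trick for the model view is permutation importance: shuffle one input at a time, re-run the model, and measure how much its predictions move. Everything below (the model, the data and the &lt;code&gt;permutation_importance&lt;/code&gt; helper) is invented for illustration:&lt;/p&gt;

```python
import random

# A "black box": we can call it but, as explainers, we pretend we can't read it.
# Its hidden rule ignores shoe_size entirely.
def black_box(income, age, shoe_size):
    return 2.0 * income + 0.5 * age

random.seed(0)
data = [(random.uniform(0, 100), random.uniform(18, 90), random.uniform(35, 48))
        for _ in range(200)]

def permutation_importance(model, rows, feature_idx):
    """Mean absolute change in predictions when one feature is shuffled."""
    baseline = [model(*row) for row in rows]
    column = [row[feature_idx] for row in rows]
    random.shuffle(column)
    shuffled = []
    for row, new_value in zip(rows, column):
        row = list(row)
        row[feature_idx] = new_value
        shuffled.append(model(*row))
    return sum(abs(a - b) for a, b in zip(baseline, shuffled)) / len(rows)

for idx, name in enumerate(["income", "age", "shoe_size"]):
    print(name, permutation_importance(black_box, data, idx))
# income moves the predictions the most; shoe_size not at all
```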

&lt;p&gt;Personally, I like to think of white-box models as “Interpretability” (because of the requirement for an interpretable model) and black-box models as “Explainability” (because we are attempting to explain the unknown). Sadly, however, there is no official definition and the words are often used interchangeably.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XSqj4ugJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qehzh17uqj3f6g0f7nvh.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XSqj4ugJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qehzh17uqj3f6g0f7nvh.jpeg" alt="Weighing Scales"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Fairness
&lt;/h2&gt;

&lt;p&gt;Fairness is by far the most subjective of the three terms. As we did for bias, let’s glance at its everyday definition before looking at how it’s applied in machine learning:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“impartial and just treatment or behaviour without favouritism or discrimination.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Applying this to the context of machine learning, the definition I like to use is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“An algorithm is fair if it makes predictions that do not favour or discriminate against certain individuals or groups based on sensitive characteristics.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most definitions you’ll see (including mine above) tend to narrow the scope to machine learning that affects humans. Typically this is where AI can have disastrous consequences, and so fairness is super important. Something like a mortgage approval or a healthcare diagnosis is such a life-changing event that it’s critical we handle predictions in a fair and responsible way.&lt;/p&gt;

&lt;p&gt;You’re probably asking yourself “What’s a ‘sensitive characteristic’ though?”, which is a very good question. The interpretation of the definition depends heavily on what you class as sensitive. Some obvious examples tend to be things like race, gender, sexual orientation, disability, etc.&lt;/p&gt;

&lt;p&gt;One approach is to just remove all “sensitive” attributes when building a model. This seems like a sensible thing to do at first but there are actually multiple issues with this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The sensitive features might actually be critical to the model.&lt;/strong&gt; Imagine you’re trying to predict the height a child will be when they are fully grown. Removing sensitive attributes like age and sex will make your predictions useless.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fairness is not necessarily about being agnostic.&lt;/strong&gt; Sometimes it’s important to include sensitive features in order to favor those who might be discriminated against in other features. An example of this is university admissions, where raw grades alone may not be the best way to find the brightest pupils. Those who had access to fewer resources or a lower quality of education might have had better scores otherwise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sensitive features might be hidden in other attributes.&lt;/strong&gt; It is often possible to determine the values for sensitive features using a combination of non-sensitive ones. For example, an applicant’s full name might allow a machine learning model to infer their race, nationality, or gender.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The reality is that AI fairness is an incredibly difficult field. It requires policymakers to define what “fair” looks like for each use case which can sometimes be very subjective. Often there is also a trade-off between group fairness and individual fairness. Using the university admissions example from earlier, making your algorithm fairer for an underprivileged group who didn’t have the same educational resources (group fairness) comes at the cost of those who had a good educational background and whose grades are now no longer quite good enough (individual fairness).&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In summary, bias, explainability, and fairness are &lt;strong&gt;not&lt;/strong&gt; the same thing. Whilst trying to explain all or part of a machine learning model, you might find that the model contains bias. The existence of that bias might even mean that your model is unfair. That doesn’t, however, mean that explainability, bias, and fairness are the same thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Bias&lt;/strong&gt; is a preference or prejudice against a particular group, individual, or feature and comes in many forms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explainability&lt;/strong&gt; is the ability to explain how or why a model makes a prediction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fairness&lt;/strong&gt; is the subjective practice of using AI without favoritism or discrimination, particularly pertaining to humans.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Developer Relations Explained</title>
      <dc:creator>Ed Shee</dc:creator>
      <pubDate>Fri, 06 Aug 2021 07:19:12 +0000</pubDate>
      <link>https://dev.to/ukcloudman/developer-relations-explained-2an4</link>
      <guid>https://dev.to/ukcloudman/developer-relations-explained-2an4</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5Ri_tR02--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9gipq380voyid7871rir.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5Ri_tR02--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9gipq380voyid7871rir.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A couple of weeks ago I published a &lt;a href="https://faun.pub/what-is-mlops-9c1ad1723c1b"&gt;blog about MLOps&lt;/a&gt; and mentioned how I'd "given up trying to explain what I do to non-technical friends and family". I foolishly implied that the fault is theirs for not understanding the technology I work with or my niche job role. The reality is that it's &lt;strong&gt;totally my fault&lt;/strong&gt; because I clearly haven't figured out a concise way to explain what I do. The more I think about it, the more I realise that being able to explain my job in simple terms actually helps &lt;em&gt;me&lt;/em&gt; to define my role far more than it helps anyone else. Being totally honest, you probably don't even care what I do anyway. So here goes…&lt;/p&gt;

&lt;h2&gt;
  
  
  Developer Relations
&lt;/h2&gt;

&lt;p&gt;Let's start with, arguably, the easier part - Developer Relations. Sometimes called Developer Advocacy, Developer Evangelism or DevRel (if you're one of the cool kids 😎), the job is a weird blend of responsibilities.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;Developer&lt;/code&gt; part of it is pretty easy to comprehend - basically everything DevRel does is related to software developers. This is actually a really important distinction because, as you'll come to see, a lot of the functions DevRel provides overlap with more traditional job roles.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;Relations&lt;/code&gt; term is where it becomes a lot more unclear. I usually try to explain this bit using these three commonly understood jobs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Marketing&lt;/strong&gt; - driving awareness and usage of your product or brand&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Software Engineering&lt;/strong&gt; - writing code and documentation for your product&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Product Management&lt;/strong&gt; - gathering user feedback and understanding market trends to improve your product&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But those three jobs already exist, so why do I need Developer Relations? Very good question! In fact, the answer is that very few companies &lt;em&gt;do&lt;/em&gt; need developer relations (I told you it was niche!). The ones that do are almost always those who offer highly technical products. When this is the case, the barriers between the traditional roles tend to break down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Marketing can't reach the right audience, who demand highly technical content and actively avoid the usual sales or marketing channels&lt;/li&gt;
&lt;li&gt;Product Managers struggle to understand new industry trends without being experts in the domain&lt;/li&gt;
&lt;li&gt;The Engineering team are busy building the product and may not have the time or skill set to do everything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KL4sBcEX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n4d2zub8xa9t4donoogr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KL4sBcEX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n4d2zub8xa9t4donoogr.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Developer Relations is the department that glues everything together and solves these challenges. They're experts in the product and the industry, with the soft skills to take on public speaking engagements and manage communities.&lt;/p&gt;

&lt;p&gt;People often ask what a typical day looks like for me but the reality is that no two days are ever the same. I might write a code sample one day, speak at a conference the next, and then help a customer architect a solution the day after. The variation is what makes the job so much fun 😀.&lt;/p&gt;

&lt;p&gt;Hopefully I've at least given you a rough idea of what &lt;strong&gt;I do&lt;/strong&gt;. Now for the hard part: what &lt;strong&gt;my company does&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Seldon
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bdu33zxy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hug6qka1g8nvcdgabi9d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bdu33zxy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hug6qka1g8nvcdgabi9d.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The simplest explanation is that we're an "AI startup". It definitely makes us sound cool, but it also doesn't disentangle us from the thousands of other AI startups out there. Most AI startups focus on a specific industry and then either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provide software that has machine learning built into it&lt;/li&gt;
&lt;li&gt;Do machine learning for clients in the industry as consultants&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://seldon.io"&gt;Seldon&lt;/a&gt; does neither of those. In fact, if you work at one of those AI startups, there's a good chance you've heard of Seldon, because what we do is provide tools for the people who do AI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NSySrXX2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/040mdp8umh9cnknj691k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NSySrXX2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/040mdp8umh9cnknj691k.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's an analogy; imagine machine learning is pizza instead 🍕. While most companies in the pizza industry are busy making pizza, coming up with recipes and sourcing the best ingredients, we're the ones who supply the wood-fired ovens, the paddles and the pizza cutting wheels. It doesn't really matter if you're making Italian-style, deep dish, sourdough or calzone, having an awesome oven lets you cook more pizza, faster and more reliably.&lt;/p&gt;

&lt;p&gt;What does that mean in terms of machine learning? Well, the output of machine learning is what we call a model. It's the set of complex algorithms that allow you to predict or classify something. In order to actually use that model we do what's called "putting it into production" (&lt;code&gt;production&lt;/code&gt; is just a term used in the software industry for stuff that's running &lt;em&gt;for real&lt;/em&gt; rather than as an &lt;em&gt;experiment or prototype&lt;/em&gt;).&lt;/p&gt;
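
&lt;p&gt;To make the "model" part concrete, here's a minimal sketch of training a model and then using it to classify something. The dataset and classifier are illustrative choices (scikit-learn's bundled iris data), not anything Seldon-specific.&lt;/p&gt;

```python
# A tiny sketch of what "a model" is: something you train on data and
# then use to predict or classify. Iris stands in for real business data.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier().fit(X, y)  # "training" produces the model

# "Using" the model = feeding it new inputs and getting predictions back.
# This sample is the first iris flower from the training data.
prediction = model.predict([[5.1, 3.5, 1.4, 0.2]])
print(prediction)
```

&lt;p&gt;"Putting it into production" means taking that &lt;code&gt;model.predict(...)&lt;/code&gt; call and running it behind an API, reliably and at scale.&lt;/p&gt;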

&lt;p&gt;Running models "in production" is actually &lt;em&gt;really&lt;/em&gt; hard. Seldon provides tools that make it a lot easier, particularly at scale (think hundreds of thousands of bank transactions being monitored every second for fraud, for example).&lt;/p&gt;

&lt;p&gt;So there you have it, hopefully, a relatively simple explanation of what I get up to and what my company does.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; - I do lots of nerdy stuff that looks a bit like marketing, product management and software development all at once for an AI startup in London. 💻&lt;/p&gt;

</description>
      <category>devrel</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>startup</category>
    </item>
    <item>
      <title>What is MLOps?</title>
      <dc:creator>Ed Shee</dc:creator>
      <pubDate>Tue, 20 Jul 2021 15:50:02 +0000</pubDate>
      <link>https://dev.to/ukcloudman/what-is-mlops-37lm</link>
      <guid>https://dev.to/ukcloudman/what-is-mlops-37lm</guid>
      <description>&lt;p&gt;I recently started a new job at a Machine Learning startup. I've given up trying to explain what I do to non-technical friends and family (my mum still just tells people I work with computers). For those of you who at least understand that "AI" is just an overused marketing term for Machine Learning, I can break it down for you using the latest buzzword in the field:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;MLOps&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The term "MLOps" (a compound of Machine Learning and Operations) refers to the practice of deploying, managing and monitoring machine learning models in production. It takes best practices from the field of DevOps and utilises them for the unique challenges that arise running machine learning systems in production.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mvFEOSn3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tiqsjood75zj9b0qy45c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mvFEOSn3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tiqsjood75zj9b0qy45c.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Search interest for “MLOps” over the past 5 years. Source: Google Trends&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The term is relatively new, but its usage has grown rapidly over the last year as a direct result of a maturing Machine Learning landscape. As businesses get good at collecting data and designing and training ML models, their focus shifts towards integrating those models into their software estates. This brings all sorts of new challenges around infrastructure, scalability, performance and monitoring that most data science teams are not traditionally equipped to deal with.&lt;/p&gt;

&lt;p&gt;One approach is to segregate duties between Data Science and DevOps like so:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data Science - design, build and evaluate the models&lt;/li&gt;
&lt;li&gt;DevOps - deploy, monitor and manage the models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This seems like a good idea at first, but we only need to ask a few questions to see where we might struggle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When do we retrain a model and deploy a new version?&lt;/li&gt;
&lt;li&gt;What are the expected input/output formats of the model? Do we need to validate them?&lt;/li&gt;
&lt;li&gt;Can the model's performance be optimised by utilising a GPU?&lt;/li&gt;
&lt;li&gt;How do we allow models to be continually tested?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Answering any of these questions requires knowledge of both the model itself and the complex environment it’s deployed in.&lt;/p&gt;
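
&lt;p&gt;The input/output question, for instance, often boils down to validation code sitting in front of the model. Here's a hedged sketch; the expected shape and the error handling are made up for illustration:&lt;/p&gt;

```python
# A sketch of validating a request payload before it reaches the model.
# EXPECTED_FEATURES is an invented constant: whoever deploys the model
# needs to know what shape of input it was trained on.
import numpy as np

EXPECTED_FEATURES = 4

def validate_input(payload):
    """Reject payloads that don't match the model's expected input shape."""
    arr = np.asarray(payload, dtype=float)
    if arr.ndim != 2 or arr.shape[1] != EXPECTED_FEATURES:
        raise ValueError(
            f"expected shape (n, {EXPECTED_FEATURES}), got {arr.shape}"
        )
    return arr

validated = validate_input([[5.1, 3.5, 1.4, 0.2]])  # passes validation
```

&lt;p&gt;Writing this check requires knowing the model (how many features?) &lt;em&gt;and&lt;/em&gt; the serving environment (where does the payload arrive?) - which is exactly why splitting the work cleanly between two teams is hard.&lt;/p&gt;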

&lt;p&gt;The reality is that the whole lifecycle of an ML system is tightly coupled and highly iterative in nature. Production ML is hard and requires expertise in Data Engineering, Data Science and DevOps. The umbrella term “MLOps” provides an easy way to refer to the techniques, tools and skilled engineers who inhabit the growing space between these disciplines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--k2VzukL5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jkawq1jtfqjmpvc9nu4m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k2VzukL5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jkawq1jtfqjmpvc9nu4m.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Mandatory Venn diagram - Source: Wikipedia&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Is MLOps just another buzzword? Absolutely! But for now it’s the best we’ve got and it serves an important purpose. &lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>devops</category>
      <category>datascience</category>
      <category>cloudnative</category>
    </item>
  </channel>
</rss>
