DEV Community

Hikaru

Let's build a simple MLOps workflow on AWS! #1 - ML model preparation

About this post

In this series, I'll explain how I implemented a simple MLOps workflow on AWS.

Intro

Currently, I'm working as a cloud engineer, but I was studying machine learning when I was in graduate school.
Back then, I trained ML models by manually submitting jobs to an on-premises server. So, whenever I changed experimental settings such as the training dataset or hyperparameters, I had to run the job manually again. I wasn't even familiar with concepts like version control, containers, and CI/CD, so I had no clue how I could make this training process more efficient.
However, after starting my career in my current position, I noticed that there are so many useful technologies or concepts to accelerate the development process. Nowadays, I guess it's pretty common to run ML tasks in containerized applications on container orchestration tools like Kubernetes and automate testing and model deployment processes using CI/CD tools. So I felt like connecting the dots by doing some experiments that combine these two different areas of knowledge I've learned so far.

Training task

I trained a pretty basic CNN-based image classification model in PyTorch using the CIFAR-10 dataset. To implement the model, I followed the official PyTorch tutorial below.

Training a Classifier — PyTorch Tutorials 2.3.0+cu121 documentation
https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html

The CIFAR-10 dataset is a collection of 60,000 32x32 color images in 10 classes, with 6,000 images per class. It is one of the most widely used datasets for image classification tasks in the field of machine learning and computer vision. The images in this dataset look like this:

Image description

CIFAR-10 and CIFAR-100 datasets
https://www.cs.toronto.edu/~kriz/cifar.html

For the image classification model, I chose a convolutional neural network (CNN), as it's commonly used for image classification tasks. The model architecture is as follows:

import torch.nn as nn
import torch.nn.functional as F


# Define CNN
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()

        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(
            in_channels=32, out_channels=64, kernel_size=3, padding=1
        )
        self.conv3 = nn.Conv2d(
            in_channels=64, out_channels=128, kernel_size=3, padding=1
        )
        self.pool = nn.MaxPool2d(2, 2)
        self.dropout = nn.Dropout(0.5)
        self.fc1 = nn.Linear(128 * 4 * 4, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = x.view(-1, 128 * 4 * 4)  # flatten tensor
        x = self.dropout(F.relu(self.fc1(x)))
        x = self.dropout(F.relu(self.fc2(x)))
        x = self.fc3(x)

        return x

The main difference from a basic CNN architecture is that I added dropout after the first two fully connected layers to prevent overfitting. Since the main goal of this series is automating the training process, I won't dive deep into the model itself or the accuracy of the classification task. However, we at least need to confirm that the training process works properly.
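To see where the 128 * 4 * 4 input size of fc1 comes from, it can help to trace the feature-map size through the three conv + pool stages. The following is a minimal sketch in plain Python and not part of the original repository; the helper names conv_out, pool_out, and trace_shapes are made up for illustration, and the sketch assumes the usual output-size formula floor((size + 2 * padding - kernel) / stride) + 1.

```python
def conv_out(size: int, kernel: int = 3, padding: int = 1, stride: int = 1) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size: int, kernel: int = 2, stride: int = 2) -> int:
    """Spatial output size of a max-pooling layer along one dimension."""
    return (size - kernel) // stride + 1


def trace_shapes(input_size: int = 32, channels=(32, 64, 128)):
    """Return (channels, height, width) after each conv + pool stage."""
    shapes = []
    size = input_size
    for out_channels in channels:
        size = conv_out(size)  # 3x3 conv with padding=1 keeps the size
        size = pool_out(size)  # 2x2 max pooling halves it
        shapes.append((out_channels, size, size))
    return shapes


shapes = trace_shapes()
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes)     # [(32, 16, 16), (64, 8, 8), (128, 4, 4)]
print(flattened)  # 2048, i.e. 128 * 4 * 4
```

Each pooling step halves the 32x32 input (32, 16, 8, 4), so the last feature map has 128 channels of 4x4 values, which is what the flatten step in forward expects.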
Let's add the training setting. The basic training settings are as follows:

  • Loss function: Cross-entropy loss function
  • Optimizer: SGD (stochastic gradient descent) with momentum
  • Learning rate: 0.001
  • Momentum: 0.9
  • The number of training epochs: 10
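
For readers unfamiliar with the momentum setting, here is a rough sketch of the update that optim.SGD performs when momentum is non-zero, assuming the default dampening of 0 and no Nesterov term. It is written in plain Python for a single scalar parameter; the function name sgd_momentum_step and the toy loss f(w) = w^2 are illustrative assumptions, not code from the repository.

```python
def sgd_momentum_step(param, grad, velocity, lr=0.001, momentum=0.9):
    """One SGD-with-momentum update for a single scalar parameter."""
    velocity = momentum * velocity + grad  # accumulate past gradients
    param = param - lr * velocity          # step along the accumulated direction
    return param, velocity


# Toy problem: minimize f(w) = w^2, whose gradient is 2w.
w, v = 1.0, 0.0
history = []
for _ in range(3):
    w, v = sgd_momentum_step(w, 2 * w, v)
    history.append(w)

print(history)
```

In this toy run the second step is larger than the first because the velocity still carries most of the previous gradient, which is the effect the momentum value of 0.9 controls.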

An overview of the training process is like this:

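    # Excerpt from the training code: net, dataloaders_dict, device, lr,
    # momentum and num_epochs are assumed to be defined by the surrounding code.
    # Define loss function and optimization method
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=lr, momentum=momentum)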

    for epoch in range(num_epochs):  # Loop for the specified number of times
        print(f"Epoch {epoch+1}/{num_epochs}")

        for phase in ["train", "val"]:
            if phase == "train":
                net.train()
            else:
                net.eval()
            epoch_loss = 0.0
            epoch_corrects = 0

            for input_data, label in tqdm(dataloaders_dict[phase]):
                # Move the batch to the selected device (CPU or GPU)
                input_data, label = input_data.to(device), label.to(device)
                optimizer.zero_grad()

                # Compute gradient in training phase
                with torch.set_grad_enabled(phase == "train"):
                    predicted_label = net(input_data)
                    loss = criterion(predicted_label, label)
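                    # Predicted class = index of the largest logit per sample (dim=1).
                    # Assumes label holds class indices of shape (batch,), as
                    # torchvision's CIFAR-10 dataset provides, so it can be
                    # compared with the predictions directly.
                    _, pred_index = torch.max(predicted_label, dim=1)

                    if phase == "train":
                        loss.backward()
                        optimizer.step()

                    # Update loss summary
                    epoch_loss += loss.item() * input_data.size(0)
                    # Update the number of correct predictions
                    epoch_corrects += torch.sum(pred_index == label).item()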

            # show loss and accuracy for each epoch
            epoch_loss = epoch_loss / len(dataloaders_dict[phase].dataset)
            epoch_acc = epoch_corrects / len(dataloaders_dict[phase].dataset)

            print(f"{phase} Loss: {epoch_loss} Acc: {epoch_acc}")

To see the whole source, please refer to the repository below:

https://github.com/hikarunakatani/cifar10-model

Each epoch consists of a training phase and a validation phase. For each phase, the loss and the accuracy are computed and printed to standard output.
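
The loop multiplies each batch loss by the batch size before summing because nn.CrossEntropyLoss returns the mean loss over the batch by default, and the last batch of an epoch can be smaller than the others. A small sketch with made-up numbers (the batch losses and sizes below are hypothetical, not measured results) shows how this weighted average differs from a naive mean of the batch losses:

```python
# Hypothetical (mean loss, batch size) pairs for one epoch;
# the last batch is smaller than the others.
batches = [(2.0, 4), (1.0, 4), (4.0, 2)]

dataset_size = sum(size for _, size in batches)

# Same bookkeeping as the training loop: undo the per-batch mean,
# sum over all samples, then divide by the dataset size.
epoch_loss = sum(loss * size for loss, size in batches) / dataset_size

# Naive alternative: average the batch means directly.
naive_loss = sum(loss for loss, _ in batches) / len(batches)

print(epoch_loss)  # 2.0
print(naive_loss)  # about 2.33
```

The naive mean gives the small last batch the same weight as the full batches, so it overstates the per-sample loss in this example.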

Results

Let's run the code and see how it works. I added an option to run this task on GPU, but you can run this model on the CPU as it doesn't require that much computing resources.
In my environment, I used TensorBoard to see the logs of the training process graphically. The results of the training task are as follows:

Image description

The improvement in accuracy is quite small, but at least we can see the accuracy improving steadily and the loss decreasing.

Testing the model

Finally, let's prepare an arbitrary image and see which category the model classifies it into. The test code is as follows:

import os
import torch
import torchvision.transforms as transforms
from PIL import Image
import model
from model import MODEL_PATH


def main():
    # Convert to a tensor in [0, 1], then normalize each channel to [-1, 1]
    transform = transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
    )
    classes = (
        "plane",
        "car",
        "bird",
        "cat",
        "deer",
        "dog",
        "frog",
        "horse",
        "ship",
        "truck",
    )

    net = model.CNN()
    net.cpu()
    if os.path.exists(MODEL_PATH):
        state_dict = torch.load(MODEL_PATH, map_location=torch.device("cpu"))
        net.load_state_dict(state_dict)
        print("Model loaded successfully.")
    else:
        print("No model checkpoint found at the specified path.")
    # Switch to evaluation mode so that dropout is disabled during inference
    net.eval()

    # Force three channels so grayscale or RGBA files also match the model input
    img = Image.open("bird.jpg").convert("RGB")
    img = img.resize((32, 32))
    img = transform(img).float()
    img = img.unsqueeze(0)  # add a batch dimension: (1, 3, 32, 32)

    with torch.no_grad():
        outputs = net(img)
        # Index of the largest logit is the predicted class
        _, predicted = torch.max(outputs, 1)
        print(classes[predicted.item()])


if __name__ == "__main__":
    main()

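The transform in the test script first converts the image to a tensor with values in [0, 1] (ToTensor) and then applies Normalize with a mean of 0.5 and a standard deviation of 0.5 per channel, which maps the values to [-1, 1]. The arithmetic can be sketched in plain Python; the normalize helper below only illustrates the formula (value - mean) / std and is not the torchvision implementation.

```python
def normalize(value: float, mean: float = 0.5, std: float = 0.5) -> float:
    """Apply the per-channel normalization formula to one pixel value."""
    return (value - mean) / std


# 8-bit pixel intensities, scaled to [0, 1] the way ToTensor does.
pixels = [0, 128, 255]
scaled = [p / 255 for p in pixels]
normalized = [normalize(s) for s in scaled]

print(normalized)
```

Black (0) maps to -1, white (255) maps to 1, and mid-gray lands close to 0, so the network sees inputs centered around zero.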

I prepared the bird image below to test the model.

Image description

The result of the test is as follows:

$ python test.py
Model loaded successfully.
deer

Hmm, as could be expected from the training results, it misrecognized the bird as a deer. I ran the script several times, and about half of the runs misrecognized it as a deer.
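
Run-to-run variation of this kind is what one would expect when dropout is still active during inference, which happens if the model is not switched to evaluation mode with net.eval() before predicting. The following plain-Python sketch mimics that behavior under this assumption; the dropout function is a simplified stand-in written for illustration, not PyTorch's implementation.

```python
import random


def dropout(values, p=0.5, training=True, rng=random):
    """Simplified inverted dropout over a list of activations."""
    if not training:
        return list(values)  # evaluation mode: pass values through unchanged
    # Training mode: zero each value with probability p, rescale the rest.
    return [0.0 if rng.random() < p else v / (1 - p) for v in values]


activations = [1.0, 2.0, 3.0, 4.0]

# Evaluation mode is deterministic.
print(dropout(activations, training=False))  # [1.0, 2.0, 3.0, 4.0]

# Training mode depends on the random draws, so repeated calls can differ.
print(dropout(activations, rng=random.Random(0)))
print(dropout(activations, rng=random.Random(1)))
```

With the same weights and the same input image, a model in evaluation mode always produces the same logits, so any remaining misclassification would then reflect the model's accuracy rather than random masking.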

In the next article, I'll introduce how we can run this as a containerized application on AWS.

Let's build a simple MLOps workflow on AWS! #2 - Building infrastructure on AWS - DEV Community
https://dev.to/hikarunakatani/lets-build-a-simple-mlops-workflow-on-aws-2-building-infrastructure-on-aws-3h2j
