Introduction
In this blog we develop a logistic regression model for breast cancer identification, using Cape Privacy’s confidential computing platform to ensure that the sensitive medical data used for training the model remains private.
The Issue of Privacy of Medical Data
Performing data analysis and modeling on medical data can provide extremely useful insights into both public and individual health. However, there are two primary challenges when it comes to running statistical analyses or developing predictive models with medical data. The first challenge is the size of medical data sets. Medical trials often include a number of participants that may be too small for creating complex machine learning models. The second challenge is the fact that medical data are governed by various privacy rules and laws such as HIPAA.
One approach to this issue is to use differential privacy techniques, which obscure data points related to specific individuals in order to preserve the privacy of their information. However, the downside is that this allows for studying only aggregated data at a high level. Moreover, the noise added to individual data points to ensure privacy may distort the data too far from the original. There is therefore a trade-off, controlled by the privacy budget ϵ, between the strength of the privacy protection and the utility of the data: more noise means stronger privacy but less accurate results. The figure below demonstrates this trade-off with an example of an employee database query that returns the total number of employees in two different months. As the privacy protection is strengthened, the reported headcount becomes less accurate; this may help to hide private information, such as the termination of a specific individual, but at the expense of accurately reporting the total headcount.
Figure 1: Differential privacy techniques have a trade-off between privacy protection and data utility (Chang et al., 2021)
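To make the trade-off concrete, here is a minimal sketch of the standard Laplace mechanism applied to a headcount query. The epsilon values and the count of 100 employees are made up for illustration and are not taken from the figure above:

import numpy as np

def noisy_count(true_count, epsilon, sensitivity=1.0):
    # Laplace mechanism: smaller epsilon -> more noise -> stronger privacy, lower utility
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

for eps in [0.1, 1.0, 10.0]:
    print(f"epsilon={eps}: reported headcount ~ {noisy_count(100, eps):.1f}")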
This is where Cape’s confidential computing platform, built on AWS Nitro Enclaves, comes in.
How Does Cape Ensure that Confidential Data Remains Private?
Cape’s confidential computing platform allows its users to process data in a privacy-preserving manner without having to compromise between data privacy and utility. With Cape, you don’t have to use differential privacy methods; instead, you can process your original data as is, because your data will be encrypted and processed in a secure enclave in the cloud.
Cape provides a CLI that enables its users to encrypt their input data, and to deploy and run serverless functions with simple commands: cape encrypt, cape deploy, and cape run. Cape also provides two SDKs, pycape and cape-js, which allow you to use Cape within Python and JavaScript programs, respectively.
Training a Breast Cancer Identification Model with Cape
In this blog we will use a publicly available breast cancer dataset, which contains tabular data with several attributes describing a breast tumor (e.g., its size and shape), along with a classification of the tumor as malignant or benign. For example, a tumor that is uniform and round in shape typically indicates that it is noncancerous.
While this dataset is publicly available, most medical data is not, and we will use it as an example to demonstrate how Cape can be leveraged for private medical data processing.
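If you want to follow along, one way to obtain a comparable CSV locally is from the copy of the Wisconsin breast cancer dataset that ships with scikit-learn. This is only a sketch of an assumed file layout (an id column first, the target column last); the exact file used in this blog may differ:

from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer(as_frame=True)
df = data.frame                      # 30 feature columns plus a "target" column
df.insert(0, "id", range(len(df)))   # hypothetical id column so the first column is not a feature
df.to_csv("breast_cancer_data.csv", index=False)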
Logistic Regression
Since the model that we wish to develop is a binary classification model that identifies breast tumors as malignant or benign and the number of data points is not very large, a logistic regression model is suitable.
Logistic regression is a classification model that uses input attributes to predict a categorical variable, e.g., yes or no. In this demonstration we focus on binary classification since there are only two possible outcomes.
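Concretely, logistic regression passes a weighted sum of the input attributes through the sigmoid function, which squashes any real-valued score into a probability between 0 and 1 that can then be thresholded at 0.5. A small illustrative sketch (the scores below are made up):

import numpy as np

def sigmoid(z):
    # maps any real number to the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])    # hypothetical weighted sums
probabilities = sigmoid(scores)
labels = (probabilities > 0.5).astype(int)        # 1 = positive class, 0 = negative class
print(probabilities, labels)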
Create a Function that Trains a Logistic Regression Model
Any function deployed with Cape must live in a file named app.py, and app.py must contain a function called cape_handler() that takes the input to be processed and returns the results. In this case the input is the breast cancer dataset that serves as training data, and the output is the trained logistic regression model.
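As a minimal sketch of that required structure (the placeholder logic here just echoes the input size; the full version follows below):

# app.py -- minimal skeleton of a Cape function (placeholder logic only)
def cape_handler(input_data):
    # input_data arrives as bytes; return any JSON-serializable result
    return {"input_length": len(input_data)}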
The code snippet below shows our app.py. First, we import some libraries as follows:
import pandas as pd
import numpy as np
import copy
Then we define a logistic regression class with methods for training the model and for computing its accuracy and loss:
class LogisticRegression():
    def __init__(self):
        self.losses = []
        self.train_accuracies = []

    def accuracy_score(self, y_true, y_pred):
        correct = np.sum(y_true == y_pred)
        accuracy = correct / y_true.shape[0]
        return accuracy

    def fit(self, x, y, epochs):
        x = self._transform_x(x)
        y = self._transform_y(y)

        self.weights = np.zeros(x.shape[1])
        self.bias = 0

        for i in range(epochs):
            # forward pass: linear combination of the features followed by the sigmoid
            x_dot_weights = np.matmul(self.weights, x.transpose()) + self.bias
            pred = self._sigmoid(x_dot_weights)
            loss = self.compute_loss(y, pred)
            error_w, error_b = self.compute_gradients(x, y, pred)
            self.update_model_parameters(error_w, error_b)

            pred_to_class = [1 if p > 0.5 else 0 for p in pred]
            self.train_accuracies.append(self.accuracy_score(y, pred_to_class))
            self.losses.append(loss)

    def compute_loss(self, y_true, y_pred):
        # binary cross entropy
        y_one_loss = y_true * np.log(y_pred + 1e-9)
        y_zero_loss = (1 - y_true) * np.log(1 - y_pred + 1e-9)
        return -np.mean(y_zero_loss + y_one_loss)

    def compute_gradients(self, x, y_true, y_pred):
        # derivative of binary cross entropy, averaged over the training samples
        difference = y_pred - y_true
        gradient_b = np.mean(difference)
        gradients_w = np.matmul(x.transpose(), difference) / y_true.shape[0]
        return gradients_w, gradient_b

    def update_model_parameters(self, error_w, error_b):
        # gradient descent step with a fixed learning rate of 0.1
        self.weights = self.weights - 0.1 * error_w
        self.bias = self.bias - 0.1 * error_b

    def predict(self, x):
        x_dot_weights = np.matmul(x, self.weights.transpose()) + self.bias
        probabilities = self._sigmoid(x_dot_weights)
        return [1 if p > 0.5 else 0 for p in probabilities]

    def _sigmoid(self, x):
        return np.array([self._sigmoid_function(value) for value in x])

    def _sigmoid_function(self, x):
        # numerically stable sigmoid
        if x >= 0:
            z = np.exp(-x)
            return 1 / (1 + z)
        else:
            z = np.exp(x)
            return z / (1 + z)

    def _transform_x(self, x):
        x = copy.deepcopy(x)
        return x.values

    def _transform_y(self, y):
        # flatten to a 1-D array so it broadcasts correctly against the predictions
        y = copy.deepcopy(y)
        return y.values.reshape(-1)
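Before wiring this class into Cape, it can be sanity-checked locally on a small synthetic dataset; the feature names and values below are made up purely for illustration:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
y = pd.Series((X["f1"] + X["f2"] > 0).astype(int), name="target")

lr = LogisticRegression()
lr.fit(X, y, epochs=150)
print("final training accuracy:", lr.train_accuracies[-1])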
In addition to the logistic regression class, our app.py also contains the required cape_handler function, which takes the training data as input, splits it into a train and a test set, instantiates the logistic regression class defined above, performs training, and outputs the trained model along with its accuracy.
def cape_handler(input_data):
    # the input arrives as raw bytes; decode it and normalize the separators
    # (the file contents may arrive with escaped tab/newline sequences)
    csv = input_data.decode("utf-8")
    csv = csv.replace('\\t', ',').replace('\\n', '\n')
    f = open('data.csv', 'w')
    f.write(csv)
    f.close()

    data = pd.read_csv('data.csv')

    # hold out a random 33% of the rows as a test set
    data_size = data.shape[0]
    test_split = 0.33
    test_size = int(data_size * test_split)
    choices = np.arange(0, data_size)
    test = np.random.choice(choices, test_size, replace=False)
    train = np.delete(choices, test)
    test_set = data.iloc[test]
    train_set = data.iloc[train]

    # assumes the first column is an id and the last column is the target label
    column_names = list(data.columns.values)
    features = column_names[1:len(column_names)-1]
    y_train = train_set["target"]
    y_test = test_set["target"]
    X_train = train_set[features]
    X_test = test_set[features]

    # train the model and evaluate it on the held-out test set
    lr = LogisticRegression()
    lr.fit(X_train, y_train, epochs=150)
    pred = lr.predict(X_test)
    accuracy = lr.accuracy_score(y_test, pred)

    # return the trained model parameters along with the test accuracy
    model = {"accuracy": accuracy, "weights": lr.weights.tolist(), "bias": lr.bias.tolist()}
    return model
Deploy with Cape
To deploy our function with Cape, we first need to create a folder that contains all needed dependencies. For this logistic regression training app, that deployment folder needs to contain the app.py above. Additionally, because app.py imports some external libraries (in this case numpy and pandas), the deployment folder needs to include those as well. We can list those dependencies in a requirements.txt file and use Docker to install them into our deployment folder, called app, as follows:
sudo docker run -v `pwd`:/build -w /build --rm -it python:3.9-slim-bullseye pip install -r requirements.txt --target ./app/
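For reference, the requirements.txt used here only needs to list the two external libraries that app.py imports:

numpy
pandas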
Now that we have everything ready, we can log into Cape:
cape login
Your CLI confirmation code is: GZPN-KHMT
Visit this URL to complete the login process: https://login.capeprivacy.com/activate?user_code=GZPN-KHMT
Congratulations, you're all set!
And after that we can deploy the app:
cape deploy ./app
Deploying function to Cape ...
Success! Deployed function to Cape
Function Checksum ➜ 348ea2008f014b4d62562b4256bf2ddbbebcbd8b958981de5c2e01a973f690f8
Function Id ➜ 5wggR4ZaEBdfHQSbV2RcN5
Invoke with Cape
Now that the app is deployed, we can pass it an input and invoke it with cape run:
cape run 5wggR4ZaEBdfHQSbV2RcN5 -f breast_cancer_data.csv
{'accuracy': 0.9197860962566845, 'weights': [10256.691270418847, 19071.613672774896, 63157.95554188486, 97842.31573298419, 106.154850842932, 43.29810217015701, -44.1862890971466, -22.519840356544492, 198.12010662303672, 78.6238754895288, 48.39822623036688, 1508.6634081937177, 342.695612801048, -22814.6600120419, 8.905474463874354, 16.958969184554977, 18.625567417774857, 7.857666827748692, 25.00139435235602, 4.305377619109947, 9667.094831413606, 24077.953801047104, 59698.82218324606, -91019.69570680606, 137.85512994764406, 64.23315269371734, -35.801829085602265, 1.0606119748691598, 287.2889897905756, 89.52499975392664], 'bias': 3.247905759162303}
The output above lists the parameters of the trained model, i.e., its weights and bias, which define the model and can be used to perform inference. We can also see that the trained model’s accuracy on the test data is about 92%.
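As a rough sketch of how those parameters could be used for inference outside of Cape, the snippet below re-implements the same scoring logic; the weights array is truncated here for brevity, and the sample passed to predict must contain the same 30 feature values, in the same order, as the training CSV:

import numpy as np

# fill in the full lists printed by cape run above (truncated placeholders shown)
weights = np.array([10256.69, 19071.61, 63157.96])   # ...all 30 weights, in order
bias = 3.247905759162303

def predict(sample):
    # same linear score + sigmoid + 0.5 threshold used during training
    score = np.dot(weights, sample) + bias
    probability = 1.0 / (1.0 + np.exp(-score))
    return 1 if probability > 0.5 else 0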
Conclusion
In this blog we discussed the challenges of developing predictive models on medical data and how Cape’s confidential computing platform can alleviate the privacy issues associated with medical data processing. We defined a logistic regression model and trained it to identify breast tumors as malignant or benign while keeping the medical data used for training confidential.