<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ellie Kloberdanz</title>
    <description>The latest articles on DEV Community by Ellie Kloberdanz (@ekloberdanz).</description>
    <link>https://dev.to/ekloberdanz</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F977258%2Fe077352d-53d9-4270-980e-05b02f1a7407.png</url>
      <title>DEV Community: Ellie Kloberdanz</title>
      <link>https://dev.to/ekloberdanz</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ekloberdanz"/>
    <language>en</language>
    <item>
      <title>Demystifying Encryption: Symmetric Encryption, Public-key Encryption, and Hybrid Encryption</title>
      <dc:creator>Ellie Kloberdanz</dc:creator>
      <pubDate>Wed, 11 Jan 2023 16:26:03 +0000</pubDate>
      <link>https://dev.to/ekloberdanz/demystifying-encryption-symmetric-encryption-public-key-encryption-and-hybrid-encryption-2270</link>
      <guid>https://dev.to/ekloberdanz/demystifying-encryption-symmetric-encryption-public-key-encryption-and-hybrid-encryption-2270</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;In this blog I provide a technical overview of encryption, focusing primarily on two of the most widely used encryption algorithms today: AES and RSA.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is Encryption
&lt;/h1&gt;

&lt;p&gt;Encryption is the process of encoding information, which converts the original representation of the information known as plaintext into an alternative form known as ciphertext. The goal of encryption is to ensure confidentiality of information.&lt;/p&gt;

&lt;p&gt;There are two essential components to encryption: the encryption algorithm, called a cipher, and a secret value, called the key. The algorithm describes the step-by-step process by which data is encrypted, and the key is a randomly generated value. Because generating truly random values is very difficult in practice, an encryption scheme usually uses a pseudo-randomly generated key.&lt;/p&gt;

&lt;p&gt;Encryption itself does not prevent interference; instead, it denies the intelligible content to a would-be interceptor. The key used for decrypting encrypted information is shared only with authorized parties; therefore, only authorized parties can decipher a ciphertext back to plaintext and access the original information. The security of an encryption scheme can be evaluated by assessing how likely it is that unauthorized parties could guess or reconstruct the decryption key. In modern encryption techniques, cracking the encryption by guessing the decryption key is practically impossible, because keys are generated in such a way that guessing them is computationally infeasible.&lt;/p&gt;

&lt;p&gt;Two of the most widely used encryption algorithms today are AES (Advanced Encryption Standard) and RSA (Rivest-Shamir-Adleman), which represent two different encryption schemes. AES uses a symmetric key scheme, where the encryption and decryption keys are the same. RSA uses an asymmetric scheme (also known as public-key scheme), where the encryption and decryption keys are different and the encryption key is public, while the decryption key is kept private.&lt;/p&gt;

&lt;p&gt;The primary security challenge of AES encryption is the distribution of the encryption key, which can both encrypt and decrypt information and therefore needs to remain secret. Because RSA encryption uses two separate keys, it alleviates this challenge. However, compared to AES, RSA encryption is computationally expensive. Therefore, it is common to combine RSA and AES into a hybrid encryption scheme, e.g., TLS or SSH, which are described further below.&lt;/p&gt;

&lt;h1&gt;
  
  
  Symmetric Encryption
&lt;/h1&gt;

&lt;p&gt;Symmetric encryption relies on a single key that is used for both encryption and decryption. Some examples of symmetric encryption algorithms are DES, GOST 28147-89, AES, and the One-Time Pad. DES (Data Encryption Standard) was a predecessor of AES, which is the most commonly used symmetric encryption algorithm today.&lt;/p&gt;

&lt;h3&gt;
  
  
  AES
&lt;/h3&gt;

&lt;p&gt;AES (Advanced Encryption Standard) is an encryption algorithm, or cipher, that falls under the category of block ciphers, which process data in fixed-size blocks.&lt;/p&gt;

&lt;p&gt;Block ciphers encrypt and decrypt data in blocks, as opposed to bit by bit, and are characterized by two attributes: block size and key size. Their algorithms are based on a repetition of rounds, each a sequence of transformations applied to the input data. Each round is parameterized by a key, called a round key, which must be unique for each round to ensure the security of the encryption: performing the same transformations with different round keys yields different results. The round keys are derived using a key schedule prescribed by the specific algorithm. There are two main constructions used in block ciphers: (1) the substitution-permutation network (e.g., AES), and (2) the Feistel scheme (e.g., DES).&lt;/p&gt;

&lt;p&gt;In addition to using a different key for each round, a secure block cipher must add enough confusion and diffusion to the input being encrypted. Confusion means that the input undergoes complex transformations; diffusion means that the transformations depend on all bits of the input equally. These two concepts can be viewed as the depth and breadth of the transformations performed.&lt;/p&gt;
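&lt;p&gt;To make the round structure concrete, the toy Python sketch below (purely illustrative, not real cryptography; the S-box, permutation, and round keys are made up for this example) shows the shape of one substitution-permutation round: a table lookup for confusion, a byte rotation for diffusion, and a round-key mix:&lt;/p&gt;

```python
# Toy 4-byte substitution-permutation rounds; an illustration of the
# structure only, NOT real cryptography.
SBOX = [(x * 7 + 3) % 256 for x in range(256)]  # toy, invertible S-box

def toy_round(block, round_key):
    block = [SBOX[b] for b in block]                  # substitution: confusion
    block = block[1:] + block[:1]                     # permutation: diffusion
    return [b ^ k for b, k in zip(block, round_key)]  # mix in the round key

state = [0x32, 0x88, 0x31, 0xE0]
for rk in ([1, 2, 3, 4], [5, 6, 7, 8]):               # one toy key per round
    state = toy_round(state, rk)
```

Because every step is invertible when the round keys are known, decryption simply applies the inverse steps in reverse order, just as in a real block cipher.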

&lt;p&gt;AES is a variant of the Rijndael block cipher with a fixed block size of 128 bits and a key size of 128, 192, or 256 bits. The key size determines the number of transformation rounds that convert the input (plaintext) into the final output (ciphertext): 10 rounds for a 128-bit key, 12 rounds for a 192-bit key, and 14 rounds for a 256-bit key.&lt;/p&gt;

&lt;p&gt;The AES algorithm can be described as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Divide plaintext into blocks&lt;/strong&gt;&lt;br&gt;
Each block holds 16 bytes arranged in a four-by-four grid. Since one byte contains 8 bits, we get the 128-bit block size (16 × 8 = 128)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Key expansion&lt;/strong&gt;&lt;br&gt;
Produce the round keys with Rijndael’s key schedule, each represented as a block&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add key&lt;/strong&gt;&lt;br&gt;
XOR each block of the plaintext with the initial round key&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Repeat 9, 11, or 13 rounds:&lt;/strong&gt;&lt;br&gt;
Step 1: Byte substitution&lt;br&gt;
&lt;em&gt;Substitute every byte in the blocks produced with a code based on the Rijndael S-box&lt;/em&gt;&lt;br&gt;
Step 2: Row shifting&lt;br&gt;
&lt;em&gt;Shift bytes in the 1st row to the left by 0 bytes, by 1 byte in the 2nd row, by 2 bytes in the 3rd row, by 3 bytes in the 4th row&lt;/em&gt;&lt;br&gt;
Step 3: Mix columns&lt;br&gt;
&lt;em&gt;Multiply each column by a matrix&lt;/em&gt;&lt;br&gt;
Step 4: Add key&lt;br&gt;
&lt;em&gt;XOR the round key into the cipher blocks&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Final round&lt;/strong&gt;&lt;br&gt;
Step 1: Byte substitution&lt;br&gt;
Step 2: Row shifting&lt;br&gt;
Step 3: Add key&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
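&lt;p&gt;These rounds are never hand-coded in applications; one calls a vetted library instead. As a minimal sketch, here is authenticated AES encryption using the third-party Python &lt;code&gt;cryptography&lt;/code&gt; package (an assumed dependency; any audited AES implementation would do):&lt;/p&gt;

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=128)  # 128-bit key, i.e. 10 rounds
aesgcm = AESGCM(key)
nonce = os.urandom(12)                     # must be unique per message

ciphertext = aesgcm.encrypt(nonce, b"attack at dawn", None)
plaintext = aesgcm.decrypt(nonce, ciphertext, None)
assert plaintext == b"attack at dawn"
```

AES-GCM also authenticates the ciphertext, so any tampering is detected at decryption time.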

&lt;p&gt;In practice, AES encryption is implemented with special techniques such as table-based implementations and native CPU instructions (e.g., AES-NI), which make it very fast.&lt;/p&gt;

&lt;p&gt;AES encryption is very safe. Even a brute-force attack on the smallest key size of 128 bits is infeasible: it would require checking 2&lt;sup&gt;128&lt;/sup&gt; possibilities, which would take more than 100 trillion years even on a supercomputer.&lt;/p&gt;

&lt;p&gt;A major issue with AES is that, as a symmetric algorithm, it requires both the encryptor and the decryptor to use the same key. This gives rise to a crucial key management problem: how can that all-important secret key be distributed to perhaps hundreds of recipients around the world without a huge risk of it being carelessly or deliberately compromised somewhere along the way? The answer is to combine the strengths of AES and RSA encryption, as described in the section on hybrid encryption later in this article.&lt;/p&gt;

&lt;h3&gt;
  
  
  One-Time Pad Encryption
&lt;/h3&gt;

&lt;p&gt;Another symmetric encryption scheme worth mentioning is the one-time pad (OTP). It guarantees perfect secrecy as long as the key is truly random, at least as long as the plaintext it encrypts, and used only once. These characteristics make OTP a perfect cipher, but ironically they also make it impractical: a fresh random key must be generated for every message, and the key must be the same length as the message. Imagine encrypting a 1 TB hard drive; that would require a 1 TB key!&lt;/p&gt;
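&lt;p&gt;Because a one-time pad is nothing more than an XOR with a key as long as the message, it fits in a few lines of standard-library Python (using a cryptographically strong pseudo-random pad as a stand-in for true randomness):&lt;/p&gt;

```python
import secrets

message = b"meet me at noon"
pad = secrets.token_bytes(len(message))  # key as long as the message

ciphertext = bytes(m ^ k for m, k in zip(message, pad))   # encrypt
recovered = bytes(c ^ k for c, k in zip(ciphertext, pad))  # decrypt
assert recovered == message  # XOR-ing with the same pad twice round-trips
```

Reusing the pad for a second message would break the perfect-secrecy guarantee, which is exactly why OTP key management is so burdensome.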

&lt;h1&gt;
  
  
  Asymmetric Encryption
&lt;/h1&gt;

&lt;p&gt;Asymmetric encryption (also known as public-key encryption) uses a pair of keys: a public key that is used to encrypt and a private key that is used to decrypt information.&lt;/p&gt;

&lt;h3&gt;
  
  
  RSA
&lt;/h3&gt;

&lt;p&gt;RSA (Rivest-Shamir-Adleman), named after its inventors, is one of the most prominent asymmetric encryption schemes. It consists of four steps: key generation, key distribution, encryption, and decryption.&lt;/p&gt;

&lt;p&gt;An RSA user creates and publishes a public key based on the product of two large prime numbers. The prime numbers are kept secret; only their product is made public. Messages can be encrypted by anyone via the public key, but can only be decoded by someone who knows the two prime numbers. Because there is no known efficient method of factoring such large numbers, only the creator of the public key can derive the private key required for decryption.&lt;/p&gt;

&lt;p&gt;The RSA algorithm can be summarized as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate two very large prime numbers: p and q, which are kept secret&lt;/li&gt;
&lt;li&gt;Compute their product, n = p * q; n is released as part of the public key&lt;/li&gt;
&lt;li&gt;Compute λ(n), where λ is Carmichael's totient function&lt;/li&gt;
&lt;li&gt;Choose an integer e such that 1 &amp;lt; e &amp;lt; λ(n) and the greatest common divisor gcd(e, λ(n)) = 1; that is, e and λ(n) are coprime (they share no positive integer factors other than 1). The pair (n, e) forms the public key&lt;/li&gt;
&lt;li&gt;Find d, the modular multiplicative inverse of e modulo λ(n), which serves as the private key&lt;/li&gt;
&lt;/ol&gt;
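&lt;p&gt;The five steps can be traced end to end with deliberately tiny primes (Python's built-in &lt;code&gt;pow&lt;/code&gt; handles both the modular inverse and modular exponentiation); real keys use primes hundreds of digits long:&lt;/p&gt;

```python
import math

# Toy RSA with tiny primes, for illustration only.
p, q = 61, 53                 # 1. two (tiny) secret primes
n = p * q                     # 2. modulus, published as part of the public key
lam = math.lcm(p - 1, q - 1)  # 3. Carmichael's totient lambda(n) = 780
e = 17                        # 4. public exponent, coprime with lambda(n)
assert math.gcd(e, lam) == 1
d = pow(e, -1, lam)           # 5. private exponent: inverse of e mod lambda(n)

m = 65                        # the message, which must be smaller than n
c = pow(m, e, n)              # encryption: c = m^e mod n
assert pow(c, d, n) == m      # decryption: m = c^d mod n
```

Recovering d from (n, e) alone would require factoring n into p and q, which is exactly the hard problem RSA's security rests on.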

&lt;p&gt;The security of RSA relies on the practical difficulty of factoring the product of two large prime numbers. The downside of RSA, however, is that it is a relatively slow algorithm, so it is not commonly used to encrypt user data directly. More often, RSA is used to transmit shared keys for symmetric-key cryptography, which are then used for bulk encryption and decryption.&lt;/p&gt;

&lt;h1&gt;
  
  
  Hybrid Encryption
&lt;/h1&gt;

&lt;p&gt;Both AES and RSA encryption schemes have advantages and disadvantages. As discussed above, AES encryption is fast, but distributing the AES key without compromising the security of the encryption is a challenge. This is less of a problem with RSA, because it uses a key pair: one public key and one private key. RSA encryption, however, is computationally expensive.&lt;/p&gt;

&lt;p&gt;Hybrid encryption combines the best of both worlds: the efficiency of symmetric encryption and the convenience of public-key (asymmetric) encryption. In cloud computing, the commonly used term for hybrid encryption is envelope encryption.&lt;/p&gt;

&lt;p&gt;Envelope encryption uses two kinds of keys: (1) data encryption keys (DEKs) and (2) key encryption keys (KEKs). A DEK is used to encrypt the data using AES, while a KEK is used to encrypt the DEK using RSA. The procedure can be described as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate a DEK locally&lt;/li&gt;
&lt;li&gt;Encrypt your data using the DEK&lt;/li&gt;
&lt;li&gt;Issue a request to encrypt the DEK with a KEK stored in a secure service called a key management system (KMS)&lt;/li&gt;
&lt;li&gt;Send the DEK-encrypted data along with the KEK-encrypted DEK&lt;/li&gt;
&lt;li&gt;The receiver decrypts the DEK with the private key stored in the KMS, then uses the DEK to decrypt the data&lt;/li&gt;
&lt;/ol&gt;
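&lt;p&gt;A minimal local sketch of this procedure in Python, again using the third-party &lt;code&gt;cryptography&lt;/code&gt; package and simulating the KMS with an RSA key pair held in the same process (an assumption for the sketch; in a real deployment the private half never leaves the KMS):&lt;/p&gt;

```python
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# The KMS's key-encryption key (KEK), simulated locally for this sketch.
kek = rsa.generate_private_key(public_exponent=65537, key_size=2048)
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# 1. Generate a data-encryption key (DEK) locally.
dek = AESGCM.generate_key(bit_length=256)

# 2. Encrypt the data with the DEK (AES-GCM).
nonce = os.urandom(12)
ciphertext = AESGCM(dek).encrypt(nonce, b"patient record", None)

# 3./4. Wrap the DEK with the KEK's public half (RSA-OAEP) and send both.
wrapped_dek = kek.public_key().encrypt(dek, oaep)

# 5. The receiver has the KMS unwrap the DEK, then decrypts the data.
unwrapped_dek = kek.decrypt(wrapped_dek, oaep)
plaintext = AESGCM(unwrapped_dek).decrypt(nonce, ciphertext, None)
assert plaintext == b"patient record"
```

Only the small DEK ever passes through the slow RSA operation; the bulk data is handled by fast AES, which is the whole point of the hybrid scheme.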

&lt;p&gt;Two widely used protocols that employ hybrid encryption are TLS and SSH.&lt;/p&gt;

&lt;h3&gt;
  
  
  TLS (Transport Layer Security)
&lt;/h3&gt;

&lt;p&gt;TLS is an encryption and authentication protocol designed to secure Internet communications. It sits between the transport layer (e.g., TCP) and the application layer (e.g., HTTP) and protects the connection between servers and clients by establishing secure channels, allowing for, e.g., secure credit card transactions. If you go to a website whose address begins with https, the s stands for “secure” and signifies that the communication between your device and the server hosting the website is encrypted with TLS. When a server and client communicate using TLS, no third party can eavesdrop on or tamper with their messages.&lt;/p&gt;

&lt;p&gt;An integral part of TLS is the handshake protocol, during which the client and server that wish to communicate exchange messages to acknowledge each other, verify each other (authentication), establish the cryptographic algorithms they will use (cipher suite negotiation), and agree on session keys (session key exchange).&lt;/p&gt;

&lt;p&gt;The verification step, during which the server authenticates itself to the client, is an important part of the TLS protocol. It relies on a public key certificate (also known as a digital or identity certificate), which contains information about the key, the owner's identity, and the digital signature of the issuer that verified the certificate. Typically, the issuer that guarantees the certificate's authenticity is a trusted third party called a certificate authority (CA), e.g., Let’s Encrypt, Comodo, or DigiCert. During the verification step, the server sends its certificate to the client, which then verifies the CA's signature on it to confirm that the server is who it claims to be.&lt;/p&gt;
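&lt;p&gt;From an application's point of view, most of this machinery is a library default. For example, Python's standard &lt;code&gt;ssl&lt;/code&gt; module creates a client context that already enforces certificate verification against the system CA store and hostname checking:&lt;/p&gt;

```python
import ssl

# Stdlib defaults: verify the server certificate chain and its hostname.
ctx = ssl.create_default_context()
assert ctx.verify_mode == ssl.CERT_REQUIRED
assert ctx.check_hostname

# Wrapping a TCP socket with this context would run the TLS handshake
# described above, e.g.:
#   with socket.create_connection(("example.com", 443)) as sock:
#       with ctx.wrap_socket(sock, server_hostname="example.com") as tls:
#           print(tls.version())
```

Disabling either check (as some tutorials suggest) removes exactly the authentication step that makes TLS trustworthy.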

&lt;h3&gt;
  
  
  SSH (Secure Shell Protocol)
&lt;/h3&gt;

&lt;p&gt;SSH is a network communication protocol that enables two computers to communicate and share data. SSH ensures that all user authentication, commands, output, and file transfers are encrypted to protect against attacks in the network. The most commonly used applications of SSH are remote login and command-line execution.&lt;/p&gt;

&lt;p&gt;The SSH protocol can be summarized as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The client contacts the server&lt;/li&gt;
&lt;li&gt;The server sends its public SSH key&lt;/li&gt;
&lt;li&gt;The client checks the server's public key against its list of known hosts to authenticate the server (and may in turn authenticate itself to the server, e.g., with its own key pair)&lt;/li&gt;
&lt;li&gt;The client and server negotiate and agree upon the symmetric encryption algorithm to be used for their communication and generate the encryption key to be used&lt;/li&gt;
&lt;li&gt;The client and server establish a secure connection&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>gratitude</category>
    </item>
    <item>
      <title>Secure Sentiment Analysis with Enclaves</title>
      <dc:creator>Ellie Kloberdanz</dc:creator>
      <pubDate>Tue, 22 Nov 2022 20:24:13 +0000</pubDate>
      <link>https://dev.to/ekloberdanz/secure-sentiment-analysis-with-enclaves-3gp5</link>
      <guid>https://dev.to/ekloberdanz/secure-sentiment-analysis-with-enclaves-3gp5</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this blog we build a secure sentiment analysis app using TensorFlow Lite and deploy it on the Cape Privacy secure cloud enclave system. The app is available to use &lt;a href="https://demos.capeprivacy.com/?utm_source=devto&amp;amp;utm_medium=blog+post&amp;amp;utm_term=sentiment&amp;amp;utm_content=blog+educational&amp;amp;utm_campaign=blog+ongoing" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Sentiment Analysis and How Does Cape Make it Secure?
&lt;/h2&gt;

&lt;p&gt;Sentiment analysis is an application of natural language processing (NLP) that classifies the sentiment of text, typically as either positive or negative. Because vast amounts of data exist in textual form, sentiment analysis has many practical applications, including social media monitoring, customer feedback analysis, news analysis, and market research. Processing this type of data in an automated manner therefore allows valuable information to be extracted efficiently.&lt;/p&gt;

&lt;p&gt;However, what if the textual input data that we need to analyze is sensitive or needs to stay confidential? This is where the Cape Privacy secure enclave system comes in. Cape Privacy provides a &lt;a href="https://capeprivacy.com/blog/what-is-confidential-computing/?utm_source=devto&amp;amp;utm_medium=blog+post&amp;amp;utm_term=sentiment&amp;amp;utm_content=blog+educational&amp;amp;utm_campaign=blog+ongoing" rel="noopener noreferrer"&gt;confidential computing platform&lt;/a&gt; based on AWS Nitro enclaves for security and privacy-minded developers. Cape allows for running serverless functions on encrypted data and ensures that sensitive data or intellectual property within apps is protected. &lt;/p&gt;

&lt;p&gt;Cape provides a command line interface (CLI) as well as Python and JavaScript software development kits (SDKs), called &lt;a href="https://github.com/capeprivacy/pycape/?utm_source=devto&amp;amp;utm_medium=blog+post&amp;amp;utm_term=sentiment&amp;amp;utm_content=blog+educational&amp;amp;utm_campaign=blog+ongoing" rel="noopener noreferrer"&gt;pycape&lt;/a&gt; and &lt;a href="https://github.com/capeprivacy/cape-js/?utm_source=devto&amp;amp;utm_medium=blog+post&amp;amp;utm_term=sentiment&amp;amp;utm_content=blog+educational&amp;amp;utm_campaign=blog+ongoing" rel="noopener noreferrer"&gt;cape-js&lt;/a&gt;, that allow developers to deploy their apps and users to interact with them in a secure manner.&lt;/p&gt;

&lt;p&gt;There are three essential commands that enable this: cape encrypt, cape deploy, and cape run. The command cape encrypt encrypts inputs that can be sent into the Cape enclave for processing, cape deploy performs all the actions needed to deploy a function into the enclave, and cape run invokes the deployed function on an input that was previously encrypted with cape encrypt. Learn more on the &lt;a href="https://docs.capeprivacy.com/?utm_source=devto&amp;amp;utm_medium=blog+post&amp;amp;utm_term=sentiment&amp;amp;utm_content=blog+educational&amp;amp;utm_campaign=blog+ongoing" rel="noopener noreferrer"&gt;Cape docs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Build a Sentiment Analysis App with Cape?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Training a Text Classification Model
&lt;/h3&gt;

&lt;p&gt;The function that we wish to deploy and run is, in the case of sentiment analysis, a text classification model; therefore, we first need to define its architecture and train it.&lt;/p&gt;

&lt;p&gt;To make the model lightweight we use TensorFlow Lite and its Model Maker library. The training data is SST-2 (Stanford Sentiment Treebank), a commonly used dataset published by Socher et al. (2013) that consists of over 60,000 movie-review sentences labeled as positive or negative. For the model architecture we use an average word embedding, which produces a model that is small and can therefore perform fast inference. The following code snippets show the model definition and training procedure and export the trained model and its vocabulary as a TensorFlow Lite model.&lt;/p&gt;

&lt;p&gt;Install dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!sudo apt -y install libportaudio2
!pip install -q tflite-model-maker-nightly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Import libraries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import os

from tflite_model_maker import model_spec
from tflite_model_maker import text_classifier
from tflite_model_maker.config import ExportFormat
from tflite_model_maker.text_classifier import AverageWordVecSpec
from tflite_model_maker.text_classifier import DataLoader

import tensorflow as tf
import pandas as pd
assert tf.__version__.startswith('2')
tf.get_logger().setLevel('ERROR')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prepare training data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; df.to_csv(new_file)

# Replace the label name for both the training and test dataset. Then write the
# updated CSV dataset to the current folder.
replace_label(os.path.join(os.path.join(data_dir, 'train.tsv')), 'train.csv')
replace_label(os.path.join(os.path.join(data_dir, 'dev.tsv')), 'dev.csv')

spec = model_spec.get('average_word_vec')

train_data = DataLoader.from_csv(
     filename='train.csv',
     text_column='sentence',
     label_column='label',
     model_spec=spec,
     is_training=True)
test_data = DataLoader.from_csv(
     filename='dev.csv',
     text_column='sentence',
     label_column='label',
     model_spec=spec,
     is_training=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Train model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model = text_classifier.create(train_data, model_spec=spec, epochs=10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Evaluate model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;loss, acc = model.evaluate(test_data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Export model as Tensorflow Lite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model.export(export_dir='model')
model.export(export_dir='model', export_format=[ExportFormat.LABEL, ExportFormat.VOCAB])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Create a Function
&lt;/h3&gt;

&lt;p&gt;Any function deployed with Cape needs to live in a file named app.py, which must contain a function called cape_handler() that takes the input to process and returns the results. In the case of the sentiment analysis app, the input is the text that we wish to classify and the output is the predicted sentiment, either negative or positive.&lt;/p&gt;

&lt;p&gt;The code snippet below shows our app.py. We can see that the cape_handler() function loads the TensorFlow Lite model that we previously trained along with its vocabulary. The handler also vectorizes the text inputs using the vocabulary, so that the inputs are encoded as numeric vectors before inference. The model then runs on this encoded text and outputs the predicted sentiment.&lt;/p&gt;

&lt;p&gt;Import libraries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
from tflite_runtime.interpreter import Interpreter
import contractions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Load vocabulary function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def load_vocab(path):
    vocabulary = {}
    with open(path, "r") as f:
        for i, line in enumerate(f.readlines()):
            item = line.strip().split(" ")
            word = item[0]
            encoding = int(item[1])
            vocabulary[word] = encoding
    return vocabulary
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Text vectorization function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def vectorize_text(text, vocabulary, input_shape):
    encoded_text = []

    # Fix contractions
    expanded_words = []
    for word in text.split():
        expanded_words.append(contractions.fix(word))
    text = " ".join(expanded_words)

    text = text.split(" ")
    for word in text:
        word = word.lower()  # convert to lower case
        # account for words not in vocabulary
        if word in vocabulary.keys():
            word_encoding = vocabulary[word]
        else:
            word_encoding = vocabulary["&amp;lt;UNKNOWN&amp;gt;"]
        encoded_text.append(word_encoding)
    encoded_text = np.array(encoded_text, dtype=np.int32)
    encoded_text = np.pad(
        encoded_text, (0, input_shape[1] - len(encoded_text)), "constant"
    )
    encoded_text = np.reshape(encoded_text, (input_shape[0], input_shape[1]))
    return encoded_text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cape handler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def cape_handler(text):
    text = text.decode("utf-8")

    # Load vocabulary
    vocabulary = load_vocab("./vocab.txt")

    # Load the TFLite model and allocate tensors.
    interpreter = Interpreter(model_path="./model.tflite")
    interpreter.allocate_tensors()

    # Get input and output tensors.
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Predict
    input_shape = input_details[0]["shape"]
    input_data = vectorize_text(
        text=text, vocabulary=vocabulary, input_shape=input_shape
    )
    interpreter.set_tensor(input_details[0]["index"], input_data)
    interpreter.invoke()

    output_data = interpreter.get_tensor(output_details[0]["index"])
    output_result = np.argmax(output_data)

    if output_result == 1:
        result = "positive"
    else:
        result = "negative"

    prob = output_data[0][output_result] * 100
    return f"{prob:.2f}% {result}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Deploy with Cape
&lt;/h3&gt;

&lt;p&gt;To deploy our function with Cape, we first need to create a folder that contains all needed dependencies. For this sentiment analysis app, the deployment folder needs to contain the app.py above, the trained TFLite model, and its vocabulary. Additionally, because the app.py program imports some external libraries, the deployment folder needs to include those as well. We can save a list of those dependencies in a requirements.txt file and use Docker to install them into our deployment folder, called app, as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo docker run -v `pwd`:/build -w /build --rm -it python:3.9-slim-bullseye pip install -r requirements.txt --target ./app/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have everything ready, we can log into Cape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cape login

Your CLI confirmation code is: GZPN-KHMT
Visit this URL to complete the login process: https://login.capeprivacy.com/activate?user_code=GZPN-KHMT
Congratulations, you're all set!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And after that we can deploy the app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cape deploy ./app

Deploying function to Cape ...
Success! Deployed function to Cape
Function ID ➜ CzFFUHDyjq6Uqm8MCVfdVc
Checksum ➜ eb989a5ef2fabf377a11ad5464b81b67757fada90a268c8c6d8f2d95013c4681
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Invoke with Cape
&lt;/h3&gt;

&lt;p&gt;Now that the app is deployed, we can pass it an input and invoke it with cape run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cape run CzFFUHDyjq6Uqm8MCVfdVc "This was a great film"

78.08% positive
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  JavaScript Front-end with Cape SDK
&lt;/h3&gt;

&lt;p&gt;In addition to the CLI, Cape also provides Python and JavaScript SDKs. Moreover, the CLI allows developers to generate tokens for their functions as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cape token &amp;lt;function ID&amp;gt; -- expires &amp;lt;number of seconds&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can then use cape-js to invoke the function deployed in the enclave. First, we need to install cape-js with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install @capeprivacy/cape-sdk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;yarn add @capeprivacy/cape-sdk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we can import it to our JavaScript program:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { Cape } from '@capeprivacy/cape-sdk';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Within our JavaScript program that we used to create the front-end, we can use the function token to connect to the enclave using cape-js as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const client = new Cape ({ functionToken: &amp;lt;function token&amp;gt;});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function ID is then used to run the function that we previously deployed in the enclave with cape deploy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;await client.run({ id: '&amp;lt;FUNCTION_ID&amp;gt;', data: 'input' });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using JavaScript and cape-js, we have created a front-end for the sentiment analysis application that allows users to go to a website, enter any text, click a button, and see the predicted sentiment. Go ahead, and try it yourself &lt;a href="https://demos.capeprivacy.com/?utm_source=devto&amp;amp;utm_medium=blog+post&amp;amp;utm_term=sentiment&amp;amp;utm_content=blog+educational&amp;amp;utm_campaign=blog+ongoing" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this blog we walked through one example use case for Cape’s confidential computing platform based on AWS Nitro enclaves. Specifically, we built a sentiment analysis application with TensorFlow Lite that classifies the sentiment of any text as positive or negative. We showed how this app can be seamlessly deployed with Cape’s CLI to ensure that the textual data is processed securely. In addition to the CLI, we also showcased how cape-js, Cape’s JavaScript SDK, can connect to an enclave from within a JavaScript program and run any deployed function. The front-end that we built gives Cape’s users a graphical interface for interacting with the &lt;a href="https://demos.capeprivacy.com/?utm_source=devto&amp;amp;utm_medium=blog+post&amp;amp;utm_term=sentiment&amp;amp;utm_content=blog+educational&amp;amp;utm_campaign=blog+ongoing" rel="noopener noreferrer"&gt;sentiment analysis app&lt;/a&gt; in addition to the CLI.&lt;/p&gt;

</description>
      <category>react</category>
    </item>
    <item>
      <title>Secure Breast Cancer Identification with Enclaves</title>
      <dc:creator>Ellie Kloberdanz</dc:creator>
      <pubDate>Tue, 22 Nov 2022 20:07:02 +0000</pubDate>
      <link>https://dev.to/ekloberdanz/secure-breast-cancer-identification-with-enclaves-4ic1</link>
      <guid>https://dev.to/ekloberdanz/secure-breast-cancer-identification-with-enclaves-4ic1</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this blog we develop a logistic regression model for breast cancer identification while ensuring that the sensitive medical data used for training the model remains private using &lt;a href="https://capeprivacy.com/?utm_source=devto&amp;amp;utm_medium=blog+post&amp;amp;utm_term=breastcancerid&amp;amp;utm_content=blog+educational&amp;amp;utm_campaign=blog+ongoing" rel="noopener noreferrer"&gt;Cape Privacy&lt;/a&gt;’s confidential computing platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Issue of Privacy of Medical Data
&lt;/h2&gt;

&lt;p&gt;Performing data analysis and modeling on medical data can provide extremely useful insights into both public and individual health. However, there are two primary challenges when it comes to running statistical analyses or developing predictive models on medical data. The first is the size of medical data sets: medical trials often include too few participants to train complex machine learning models. The second is the fact that medical data are governed by various privacy rules and laws such as HIPAA. &lt;/p&gt;

&lt;p&gt;One approach to solving this issue is differential privacy, which obscures data points related to specific individuals by adding random noise. However, the downside is that this allows for studying only aggregated data at a high level. Moreover, the noise added to individual data points to ensure privacy may push the data too far from the original values. There is therefore a trade-off, controlled by the privacy budget ϵ, between the strength of the privacy protection and the utility of the data. The figure below demonstrates this trade-off with an example of an employee database query that returns the total number of employees in two different months. We can see that as stronger privacy protection is applied, the reported headcount becomes less accurate, which may help to hide private information such as the termination of a specific individual, but at the expense of accurately reporting the total headcount. &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa46r7l605j0jfzlahmgc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa46r7l605j0jfzlahmgc.png" alt="Figure 1: Differential privacy techniques have a trade-off between privacy protection and data utility (Chang et al., 2021)" width="800" height="347"&gt;&lt;/a&gt; &lt;em&gt;Figure 1: Differential privacy techniques have a trade-off between privacy protection and data utility (Chang et al., 2021)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is where Cape’s confidential computing platform based on AWS Nitro enclaves comes in.&lt;/p&gt;
&lt;h2&gt;
  
  
  How Does Cape Ensure that Confidential Data Remains Private?
&lt;/h2&gt;

&lt;p&gt;Cape’s confidential computing platform allows its users to process data in a privacy-preserving manner without having to compromise between data privacy and utility. With Cape, you don’t have to use differential privacy methods; instead, you can process your original data as is, because your data is encrypted and processed in a secure enclave in the cloud.&lt;/p&gt;

&lt;p&gt;Cape provides a CLI that enables its users to encrypt their input data and to deploy and run serverless functions with simple commands: cape encrypt, cape deploy, and cape run. Additionally, Cape provides two SDKs: &lt;a href="https://github.com/capeprivacy/pycape/?utm_source=devto&amp;amp;utm_medium=blog+post&amp;amp;utm_term=breastcancerid&amp;amp;utm_content=blog+educational&amp;amp;utm_campaign=blog+ongoing" rel="noopener noreferrer"&gt;pycape&lt;/a&gt; and &lt;a href="https://github.com/capeprivacy/cape-js/?utm_source=devto&amp;amp;utm_medium=blog+post&amp;amp;utm_term=breastcancerid&amp;amp;utm_content=blog+educational&amp;amp;utm_campaign=blog+ongoing" rel="noopener noreferrer"&gt;cape-js&lt;/a&gt;, which allow for using Cape within Python and JavaScript programs, respectively.&lt;/p&gt;
&lt;h2&gt;
  
  
  Training a Breast Cancer Identification Model with Cape
&lt;/h2&gt;

&lt;p&gt;In this blog we will use a publicly available breast cancer dataset, which contains tabular data with several attributes describing each breast tumor (e.g., its size and shape) along with a classification of the tumor as malignant or benign. For example, a tumor that is uniform and round in shape typically indicates that it is noncancerous. &lt;/p&gt;

&lt;p&gt;While this dataset is publicly available, most medical data is not, and we will use it as an example to demonstrate how Cape can be leveraged for private medical data processing.&lt;/p&gt;
&lt;h3&gt;
  
  
  Logistic Regression
&lt;/h3&gt;

&lt;p&gt;Since the model that we wish to develop is a binary classification model that identifies breast tumors as malignant or benign and the number of data points is not very large, a logistic regression model is suitable.&lt;/p&gt;

&lt;p&gt;Logistic regression is a classification model that uses input attributes to predict a categorical variable, e.g., yes or no. In this demonstration we focus on binary classification, since there are only two possible outcomes.&lt;/p&gt;
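&lt;p&gt;Concretely, the model computes a weighted sum of the input attributes, squashes it through the sigmoid function into a probability, and thresholds that probability at 0.5. A toy sketch with made-up weights and inputs (the real model learns one weight per tumor attribute):&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    # maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical weights, bias, and input attributes for illustration
weights = np.array([0.4, -1.2])
bias = 0.1
x = np.array([2.0, 0.5])

z = np.dot(weights, x) + bias    # z = 0.8 - 0.6 + 0.1 = 0.3
prob = sigmoid(z)                # probability of the positive class, about 0.57
label = 1 if prob > 0.5 else 0   # threshold at 0.5 -> class 1
print(prob, label)
```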
&lt;h3&gt;
  
  
  Create a Function that Trains a Logistic Regression Model
&lt;/h3&gt;

&lt;p&gt;Any function deployed with Cape must live in a file named app.py that contains a function called cape_handler(), which takes the function’s input and returns the results. In this case the input is the breast cancer dataset that serves as training data, and the output is the trained logistic regression model.&lt;/p&gt;
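&lt;p&gt;In its simplest form, the contract looks like this (a toy example, unrelated to the training function we build next):&lt;/p&gt;

```python
# app.py -- minimal shape of a Cape function (toy example)
def cape_handler(input_data):
    # Cape passes the caller's input as bytes and sends the return
    # value back to the caller
    text = input_data.decode("utf-8")
    return text.upper()
```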

&lt;p&gt;The code snippet below shows our app.py. First, we import some libraries as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import numpy as np
import copy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we define a logistic regression class with methods that can perform training or compute model accuracy and loss:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class LogisticRegression():
    def __init__(self):
        self.losses = []
        self.train_accuracies = []

    def accuracy_score(self, y_true, y_pred):
        correct = np.sum(y_true == y_pred)
        accuracy = correct/y_true.shape[0]
        return accuracy

    def fit(self, x, y, epochs):
        x = self._transform_x(x)
        y = self._transform_y(y)

        self.weights = np.zeros(x.shape[1])
        self.bias = 0

        for i in range(epochs):
            x_dot_weights = np.matmul(self.weights, x.transpose()) + self.bias
            pred = self._sigmoid(x_dot_weights)
            loss = self.compute_loss(y, pred)
            error_w, error_b = self.compute_gradients(x, y, pred)
            self.update_model_parameters(error_w, error_b)

            pred_to_class = [1 if p &amp;gt; 0.5 else 0 for p in pred]
            self.train_accuracies.append(self.accuracy_score(y, pred_to_class))
            self.losses.append(loss)

    def compute_loss(self, y_true, y_pred):
        # binary cross entropy (the 1e-9 guards against log(0))
        y_one_loss = y_true * np.log(y_pred + 1e-9)
        y_zero_loss = (1 - y_true) * np.log(1 - y_pred + 1e-9)
        return -np.mean(y_one_loss + y_zero_loss)

    def compute_gradients(self, x, y_true, y_pred):
        # derivative of binary cross entropy, averaged over the samples
        difference = y_pred - y_true
        gradient_b = np.mean(difference)
        gradients_w = np.matmul(x.transpose(), difference) / x.shape[0]

        return gradients_w, gradient_b

    def update_model_parameters(self, error_w, error_b):
        # plain gradient descent step with a fixed learning rate of 0.1
        self.weights = self.weights - 0.1 * error_w
        self.bias = self.bias - 0.1 * error_b

    def predict(self, x):
        x_dot_weights = np.matmul(x, self.weights.transpose()) + self.bias
        probabilities = self._sigmoid(x_dot_weights)
        return [1 if p &amp;gt; 0.5 else 0 for p in probabilities]

    def _sigmoid(self, x):
        return np.array([self._sigmoid_function(value) for value in x])

    def _sigmoid_function(self, x):
        # numerically stable sigmoid that avoids overflow for large |x|
        if x &amp;gt;= 0:
            z = np.exp(-x)
            return 1 / (1 + z)
        else:
            z = np.exp(x)
            return z / (1 + z)

    def _transform_x(self, x):
        x = copy.deepcopy(x)
        return x.values

    def _transform_y(self, y):
        # keep y one-dimensional so it aligns element-wise with the
        # one-dimensional prediction vector
        y = copy.deepcopy(y)
        return y.values
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition to the logistic regression class, our app.py also contains the required cape_handler function, which takes the training data as input, splits it into a train and test set, instantiates the above defined logistic regression class, performs training, and outputs the trained model along with its accuracy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def cape_handler(input_data):
    csv = input_data.decode("utf-8")
    csv = csv.replace('\\t', ',').replace('\\n', '\n')
    f = open('data.csv', 'w')
    f.write(csv)
    f.close()

    data = pd.read_csv('data.csv')
    data_size = data.shape[0]
    test_split = 0.33
    test_size = int(data_size * test_split)

    choices = np.arange(0, data_size)
    test = np.random.choice(choices, test_size, replace=False)
    train = np.delete(choices, test)

    test_set = data.iloc[test]
    train_set = data.iloc[train]

    column_names = list(data.columns.values)
    features = column_names[1:len(column_names)-1]

    y_train = train_set["target"]
    y_test = test_set["target"]
    X_train = train_set[features]
    X_test = test_set[features]

    lr = LogisticRegression()
    lr.fit(X_train, y_train, epochs=150)
    pred = lr.predict(X_test)

    accuracy = lr.accuracy_score(y_test, pred)

    # trained model
    model = {"accuracy": accuracy, "weights": lr.weights.tolist(), "bias": lr.bias.tolist()}

    return model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Deploy with Cape
&lt;/h2&gt;

&lt;p&gt;To deploy our function with Cape, we first need to create a folder that contains all needed dependencies. For this logistic regression training app, the deployment folder needs to contain the app.py above. Additionally, because app.py imports some external libraries (in this case numpy and pandas), the deployment folder needs to include those as well. We can save a list of those dependencies into a requirements.txt file and run Docker to install them into our deployment folder called app as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo docker run -v `pwd`:/build -w /build --rm -it python:3.9-slim-bullseye pip install -r requirements.txt --target ./app/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
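&lt;p&gt;For reference, the requirements.txt for this training function only needs to list the external libraries that app.py imports (copy is part of the standard library):&lt;/p&gt;

```text
pandas
numpy
```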



&lt;p&gt;Now that we have everything ready, we can log into Cape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cape login

Your CLI confirmation code is: GZPN-KHMT
Visit this URL to complete the login process: https://login.capeprivacy.com/activate?user_code=GZPN-KHMT
Congratulations, you're all set!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And after that we can deploy the app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cape deploy ./app

Deploying function to Cape ...
Success! Deployed function to Cape
Function Checksum ➜ 348ea2008f014b4d62562b4256bf2ddbbebcbd8b958981de5c2e01a973f690f8
Function Id ➜ 5wggR4ZaEBdfHQSbV2RcN5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Invoke with Cape
&lt;/h2&gt;

&lt;p&gt;Now that the app is deployed, we can pass it an input and invoke it with cape run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cape run 5wggR4ZaEBdfHQSbV2RcN5 -f breast_cancer_data.csv

{'accuracy': 0.9197860962566845, 'weights': [10256.691270418847, 19071.613672774896, 63157.95554188486, 97842.31573298419, 106.154850842932, 43.29810217015701, -44.1862890971466, -22.519840356544492, 198.12010662303672, 78.6238754895288, 48.39822623036688, 1508.6634081937177, 342.695612801048, -22814.6600120419, 8.905474463874354, 16.958969184554977, 18.625567417774857, 7.857666827748692, 25.00139435235602, 4.305377619109947, 9667.094831413606, 24077.953801047104, 59698.82218324606, -91019.69570680606, 137.85512994764406, 64.23315269371734, -35.801829085602265, 1.0606119748691598, 287.2889897905756, 89.52499975392664], 'bias': 3.247905759162303}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output above lists the parameters of the trained model, i.e., its weights and bias, which define the model and can be used to perform inference. We can also see that the trained model’s accuracy on the testing data is 92%.&lt;/p&gt;
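&lt;p&gt;Because the returned weights and bias fully define the model, anyone holding them can reproduce its predictions locally. A minimal sketch (with made-up two-dimensional parameters; the real model has one weight per feature):&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(model, x):
    # same decision rule the training code uses: sigmoid(w.x + b) > 0.5
    z = np.dot(np.asarray(model["weights"]), x) + model["bias"]
    return 1 if sigmoid(z) > 0.5 else 0

# hypothetical dict shaped like the output of cape run above
model = {"weights": [0.8, -0.3], "bias": 0.05}
print(predict(model, np.array([1.0, 2.0])))   # z = 0.25 -> class 1
print(predict(model, np.array([-1.0, 2.0])))  # z = -1.35 -> class 0
```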

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this blog we discussed the challenges of developing predictive models on medical data and how Cape’s confidential computing platform can alleviate the privacy issues associated with medical data processing. We defined a logistic regression model and trained it to identify breast tumors as malignant or benign while keeping the medical data used for training confidential.&lt;/p&gt;

</description>
      <category>ruby</category>
      <category>jokes</category>
    </item>
    <item>
      <title>Using a Random Forest Model for Fraud Detection in Confidential Computing</title>
      <dc:creator>Ellie Kloberdanz</dc:creator>
      <pubDate>Tue, 22 Nov 2022 19:30:46 +0000</pubDate>
      <link>https://dev.to/ekloberdanz/using-a-random-forest-model-for-fraud-detection-in-confidential-computing-1fej</link>
      <guid>https://dev.to/ekloberdanz/using-a-random-forest-model-for-fraud-detection-in-confidential-computing-1fej</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this blog we develop a random forest model for detecting credit card fraud and deploy it on &lt;a href="https://capeprivacy.com/?utm_source=devto&amp;amp;utm_medium=blog+post&amp;amp;utm_term=cc+fraud&amp;amp;utm_content=blog+educational&amp;amp;utm_campaign=blog+ongoing"&gt;Cape Privacy&lt;/a&gt;'s confidential computing platform to perform secure inference. Cape ensures that both the model and the examined credit card transactions remain private during inference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Credit Card Fraud
&lt;/h2&gt;

&lt;p&gt;Credit card fraud is a form of identity theft that involves using another person’s credit card to make purchases or withdraw cash advances without the card owner's consent. Fraudsters may either obtain your physical credit card or just steal your credit card information, such as the account number, cardholder name, and CVV code, and use it to take over your account. &lt;/p&gt;

&lt;p&gt;In fact, according to the Federal Trade Commission, credit card fraud was the most frequently reported type of identity theft in 2022 [1]. The good news is that most major credit card providers such as Visa, Mastercard, or American Express offer $0 liability protection to their customers, which means that individuals whose credit card information has been stolen aren’t personally liable for fraudulent transactions. However, having your identity stolen and mitigating the repercussions is still no fun. Therefore, timely credit card fraud detection is paramount for protecting cardholders against identity theft and for limiting the financial losses that the credit card industry suffers due to fraud.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fraud Detection
&lt;/h2&gt;

&lt;p&gt;To minimize their losses and keep their customers satisfied, credit card companies employ a variety of methods to prevent and detect credit card fraud. Modern solutions leverage machine learning to detect suspicious transactions quickly and stop fraud [2].&lt;/p&gt;

&lt;h3&gt;
  
  
  Privacy Intrusion in Fraud Detection
&lt;/h3&gt;

&lt;p&gt;As a credit card holder, I want my account to be maximally protected against credit card fraud, but how do I feel about my credit card transactions data being collected and processed? What if that data is not securely handled and leaks? This is where Cape’s confidential computing platform can help.&lt;/p&gt;

&lt;h3&gt;
  
  
  Confidential Computing with Cape
&lt;/h3&gt;

&lt;p&gt;Cape’s confidential computing platform based on secure &lt;a href="https://aws.amazon.com/ec2/nitro/nitro-enclaves/"&gt;AWS Nitro enclaves&lt;/a&gt; allows its users to process data in a privacy-preserving manner in the cloud. A secure enclave is an environment that isolates code and data from the operating system using hardware-based, CPU-level isolation. Secure enclaves also offer a process called attestation to verify that the CPU and the applications running inside are genuine and unaltered. Secure enclaves therefore enable confidential data processing and ensure the privacy of both the code and the data within the enclave.&lt;/p&gt;

&lt;p&gt;In addition to the platform itself, Cape provides a CLI that enables its users to easily encrypt their input data and to deploy and run serverless functions with simple commands: cape encrypt, cape deploy, and cape run. Cape also provides two SDKs: &lt;a href="https://github.com/capeprivacy/pycape/?utm_source=devto&amp;amp;utm_medium=blog+post&amp;amp;utm_term=cc+fraud&amp;amp;utm_content=blog+educational&amp;amp;utm_campaign=blog+ongoing"&gt;pycape&lt;/a&gt; and &lt;a href="https://github.com/capeprivacy/cape-js/?utm_source=devto&amp;amp;utm_medium=blog+post&amp;amp;utm_term=cc+fraud&amp;amp;utm_content=blog+educational&amp;amp;utm_campaign=blog+ongoing"&gt;cape-js&lt;/a&gt;, which allow for using Cape within Python and JavaScript programs, respectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Secure Credit Card Fraud Inference with Cape
&lt;/h2&gt;

&lt;p&gt;In this blog we will train a credit card fraud detection model with &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html"&gt;Sklearn’s random forest classifier&lt;/a&gt; and deploy it on Cape’s secure cloud platform to ensure that the model and input data processed during inference remain confidential. &lt;/p&gt;

&lt;h3&gt;
  
  
  Train a Model
&lt;/h3&gt;

&lt;p&gt;First, we train a simple random forest classifier and save the model as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from joblib import dump, load

data = pd.read_csv("creditcard.csv")
X = data.drop(['Class'], axis=1)
Y = data["Class"]
X_data = X.values
Y_data = Y.values
X_train, X_test, Y_train, Y_test = train_test_split(X_data, Y_data, test_size = 0.2, random_state = 42)

model = RandomForestClassifier()
model.fit(X_train, Y_train)

y_pred = model.predict(X_test)
print(accuracy_score(Y_test, y_pred))

# save model
dump(model, 'model.joblib') 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above model has a testing accuracy of 99.9%. (Fraud datasets are typically highly imbalanced, so metrics such as precision and recall are also worth examining.)&lt;/p&gt;

&lt;h3&gt;
  
  
  Create an Inference Cape Function
&lt;/h3&gt;

&lt;p&gt;The code snippet below shows our app.py. First, we import the libraries we need for our app.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from joblib import load
import pandas as pd
import sklearn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we define a &lt;a href="https://docs.capeprivacy.com/tutorials/writing/?utm_source=devto&amp;amp;utm_medium=blog+post&amp;amp;utm_term=cc+fraud&amp;amp;utm_content=blog+educational&amp;amp;utm_campaign=blog+ongoing"&gt;cape handler function&lt;/a&gt;, which accepts a credit card transaction as input, invokes the previously trained model and outputs a prediction indicating if the transaction is legitimate or fraudulent. &lt;/p&gt;

&lt;p&gt;Please note that any function deployed with Cape must live in a file named app.py that contains a function called cape_handler(), which takes the function’s input and returns the results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def cape_handler(input_data):
    csv = input_data.decode("utf-8")
    csv = csv.replace("\\t", ",").replace("\\n", "\n")
    f = open("data.csv", "w")
    f.write(csv)
    f.close()

    data = pd.read_csv("data.csv")
    clf = load('model.joblib')
    y_pred = clf.predict(data)
    # predict() returns one prediction per input row; the input here is
    # a single transaction
    if y_pred[0] == 0:
        return "This credit card transaction is legitimate"
    else:
        return "This credit card transaction is fraudulent"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Deploy with Cape
&lt;/h3&gt;

&lt;p&gt;To deploy our function with Cape, we first need to create a folder that contains all needed dependencies. For this app, the deployment folder needs to contain the app.py above and also the trained model, which we saved as model.joblib. Additionally, because app.py imports some external libraries (in this case sklearn, pandas, and joblib), the deployment folder needs to include those as well. We can save a list of those dependencies into a requirements.txt file and run Docker to install them into our deployment folder called app as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo docker run -v `pwd`:/build -w /build --rm -it python:3.9-slim-bullseye pip install -r requirements.txt --target ./app/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have everything ready, we can log into Cape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cape login

Your CLI confirmation code is: GZPN-KHMT
Visit this URL to complete the login process: https://login.capeprivacy.com/activate?user_code=GZPN-KHMT
Congratulations, you're all set!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And after that we can deploy the app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cape deploy app

Deploying function to Cape ...
Success! Deployed function to Cape.
Function ID ➜  YdVYPwWkTw2TmP6u7JEF6i
Function Checksum ➜  26ebbba7e81391b9a40ea35f8b29eb969726417897dbfbe5d069973344a5e831
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Run with Cape
&lt;/h3&gt;

&lt;p&gt;Now that the app is deployed, we can pass it an input and invoke it with cape run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cape run YdVYPwWkTw2TmP6u7JEF6i -f fraudulent_transaction.csv --insecure -u https://k8s-cape-enclaver-750003af11-e3080498c852b366.elb.us-east-1.amazonaws.com

This credit card transaction is fraudulent

cape run YdVYPwWkTw2TmP6u7JEF6i -f legitimate_transaction.csv --insecure -u https://k8s-cape-enclaver-750003af11-e3080498c852b366.elb.us-east-1.amazonaws.com

This credit card transaction is legitimate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this blog we discussed the importance of timely credit card fraud detection; credit card fraud was the most common form of identity theft in 2022 [1]. Modern fraud detection tools leverage machine learning models, which require large-scale collection of credit card transaction data. The challenge is to ensure that this sensitive data is handled securely to prevent leaks that bad actors can take advantage of. To ensure that the credit card transactions being examined remain private during inference, we leveraged Cape’s confidential computing platform. &lt;/p&gt;

&lt;p&gt;Specifically, we deployed a trained random forest classifier to Cape’s secure enclave from where we ran inference on some example transactions. Go to &lt;a href="https://docs.capeprivacy.com/getting-started/?utm_source=devto&amp;amp;utm_medium=blog+post&amp;amp;utm_term=cc+fraud&amp;amp;utm_content=blog+educational&amp;amp;utm_campaign=blog+ongoing"&gt;Cape’s 5 Minute Quickstart&lt;/a&gt; to try out Cape yourself!&lt;/p&gt;

&lt;p&gt;[1] &lt;a href="https://public.tableau.com/app/profile/federal.trade.commission/viz/TheBigViewAllSentinelReports/TopReports"&gt;https://public.tableau.com/app/profile/federal.trade.commission/viz/TheBigViewAllSentinelReports/TopReports&lt;/a&gt;&lt;br&gt;
[2] &lt;a href="https://journalofbigdata.springeropen.com/articles/10.1186/s40537-022-00573-8"&gt;https://journalofbigdata.springeropen.com/articles/10.1186/s40537-022-00573-8&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>security</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
