<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Temiloluwa Adeoti</title>
    <description>The latest articles on DEV Community by Temiloluwa Adeoti (@temmie).</description>
    <link>https://dev.to/temmie</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F846974%2F96fcdb73-0f99-43a2-a3e3-3bbf4a4b2f7e.jpeg</url>
      <title>DEV Community: Temiloluwa Adeoti</title>
      <link>https://dev.to/temmie</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/temmie"/>
    <language>en</language>
    <item>
      <title>Installing Kubeflow on Amazon EKS (December 2023)</title>
      <dc:creator>Temiloluwa Adeoti</dc:creator>
      <pubDate>Tue, 05 Dec 2023 09:18:23 +0000</pubDate>
      <link>https://dev.to/aws-builders/installing-kubeflow-on-amazon-eks-december-2023-1j68</link>
      <guid>https://dev.to/aws-builders/installing-kubeflow-on-amazon-eks-december-2023-1j68</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;The primary objective of this post is to share my experience installing Kubeflow on Amazon EKS in December 2023. Before diving into the Kubeflow installation process, you must first set up an EKS cluster through the AWS CLI or the AWS console. &lt;/p&gt;

&lt;p&gt;There are two Kubeflow installation approaches I explored: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Using Juju by Canonical&lt;/li&gt;
&lt;li&gt;Using the official guide from AWS Labs (&lt;a href="https://awslabs.github.io/kubeflow-manifests/"&gt;link&lt;/a&gt;)
&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Setting Up Your EKS Cluster
&lt;/h1&gt;

&lt;p&gt;According to the official &lt;a href="https://v0-6.kubeflow.org/docs/started/k8s/overview/"&gt;documentation&lt;/a&gt;, Kubeflow requires a minimum of 12GB of memory and 4 CPUs. Here are the configuration options I used for the nodes (an eksctl equivalent is sketched further below):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes version&lt;/strong&gt;: 1.25&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimum number of nodes&lt;/strong&gt;: 2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instance type&lt;/strong&gt;: t2.xlarge (16GB RAM, 4 vCPUs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AMI&lt;/strong&gt;: Amazon Linux 2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EBS&lt;/strong&gt;: 100GB gp3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once deployed, Kubeflow creates at least 70 pods on these nodes. Since a single t2.xlarge instance supports at most 44 pods, at least 2 nodes are required.&lt;/p&gt;
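
&lt;p&gt;For reference, the same node configuration can be expressed with a single &lt;a href="https://eksctl.io/"&gt;eksctl&lt;/a&gt; command instead of clicking through the console. The command below is a minimal sketch rather than the exact one I ran: the cluster and node group names are placeholders, and the flags should be checked against your eksctl version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# create a cluster with a matching managed node group (sketch)&lt;/span&gt;
eksctl create cluster --name kubeflow-cluster --version 1.25 --region &amp;lt;region-code&amp;gt; \
  --nodegroup-name kubeflow-nodes --node-type t2.xlarge --nodes 2 \
  --node-ami-family AmazonLinux2 --node-volume-size 100 --node-volume-type gp3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;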

&lt;h3&gt;
  
  
  Important EKS Configuration Details
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;EKS Cluster Creation&lt;/strong&gt;:&lt;br&gt;
Creating an EKS Cluster using the AWS console involves a two-step process. First, you create the cluster, which takes approximately 20 minutes to complete. Subsequently, you add a node group to the cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Roles Requirement&lt;/strong&gt;:&lt;br&gt;
To accomplish these tasks, you'll need a &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/service_IAM_role.html#create-service-role"&gt;Cluster Service Role&lt;/a&gt; for cluster creation and a &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/create-node-role.html"&gt;Node IAM role&lt;/a&gt; for node group creation. Ensure these roles are configured correctly to facilitate the setup process (a CLI sketch follows this list).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
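
&lt;p&gt;For illustration, this is roughly what the two steps look like with the AWS CLI, with the two roles plugged in. Treat it as a sketch only: the cluster name, account id, role names, and subnets are placeholders to replace with your own values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. create the cluster with the Cluster Service Role (sketch)&lt;/span&gt;
aws eks create-cluster --name &amp;lt;my-cluster&amp;gt; \
  --role-arn arn:aws:iam::&amp;lt;account-id&amp;gt;:role/&amp;lt;cluster-service-role&amp;gt; \
  --resources-vpc-config subnetIds=&amp;lt;subnet-1&amp;gt;,&amp;lt;subnet-2&amp;gt;

&lt;span class="c"&gt;# 2. once the cluster is active, add a managed node group with the Node IAM role&lt;/span&gt;
aws eks create-nodegroup --cluster-name &amp;lt;my-cluster&amp;gt; --nodegroup-name kubeflow-nodes \
  --node-role arn:aws:iam::&amp;lt;account-id&amp;gt;:role/&amp;lt;node-role&amp;gt; \
  --subnets &amp;lt;subnet-1&amp;gt; &amp;lt;subnet-2&amp;gt; \
  --instance-types t2.xlarge --disk-size 100 \
  --scaling-config minSize=2,maxSize=2,desiredSize=2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;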

&lt;h3&gt;
  
  
  Persistent Volumes with the Amazon EBS CSI Driver Addon
&lt;/h3&gt;

&lt;p&gt;As soon as the cluster and nodes are ready, the Amazon EBS CSI driver addon has to be installed for the &lt;a href="https://kubernetes.io/docs/concepts/storage/dynamic-provisioning/"&gt;dynamic provisioning of persistent volumes&lt;/a&gt;. Kubeflow uses the addon to provision persistent volumes for pods like MySQL and the Notebook servers. EKS comes with four &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/eks-add-ons.html"&gt;Addons&lt;/a&gt; by default, and I wonder why this important addon is not one of them. &lt;br&gt;
To install the Amazon EBS CSI driver (a CLI sketch follows these steps):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create an IAM role according to the &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/csi-iam-role.html"&gt;guide&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Install the Addon on the Nodegroup using the &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/managing-ebs-csi.html#adding-ebs-csi-eks-add-on"&gt;AWS Console or AWSCLI&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;
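
&lt;p&gt;Putting the two steps together, the commands look roughly like this. This is a sketch based on the linked guides rather than the exact commands I ran; the cluster name, account id, and role name are placeholders.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. associate an OIDC provider and create the IAM role for the CSI driver (sketch)&lt;/span&gt;
eksctl utils associate-iam-oidc-provider --cluster &amp;lt;my-cluster&amp;gt; --approve

eksctl create iamserviceaccount --name ebs-csi-controller-sa --namespace kube-system \
  --cluster &amp;lt;my-cluster&amp;gt; \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --approve --role-only --role-name AmazonEKS_EBS_CSI_DriverRole

&lt;span class="c"&gt;# 2. install the addon with that role&lt;/span&gt;
aws eks create-addon --cluster-name &amp;lt;my-cluster&amp;gt; --addon-name aws-ebs-csi-driver \
  --service-account-role-arn arn:aws:iam::&amp;lt;account-id&amp;gt;:role/AmazonEKS_EBS_CSI_DriverRole
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
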
&lt;h3&gt;
  
  
  Interacting with EKS Using Eksctl, Kubectl, and K9s
&lt;/h3&gt;

&lt;p&gt;These command-line tools are recommended for interacting with your EKS cluster: Eksctl, Kubectl, and K9s.&lt;br&gt;
&lt;a href="https://eksctl.io/"&gt;Eksctl&lt;/a&gt; is the official CLI for Amazon EKS, but I prefer using Kubectl and K9s.&lt;br&gt;
K9s is amazing! It provides a GUI-like interface in the terminal.&lt;br&gt;
For Kubectl and K9s to detect your EKS cluster, the kubeconfig file must be updated with the cluster configuration.&lt;br&gt;
Run the following AWS CLI command to achieve this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# update kubeconfig file with cluster configuration&lt;/span&gt;
aws eks update-kubeconfig &lt;span class="nt"&gt;--region&lt;/span&gt; &amp;lt;region-code&amp;gt; &lt;span class="nt"&gt;--name&lt;/span&gt; &amp;lt;my-cluster&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Installation of Kubeflow with Juju by Canonical
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;An EKS Cluster must have been set up before you proceed&lt;/strong&gt; &lt;br&gt;&lt;br&gt;
&lt;strong&gt;Dependencies for this method&lt;/strong&gt;: Juju&lt;/p&gt;

&lt;p&gt;Juju is an open-source orchestration engine for deploying infrastructure and configuring applications in on-premises and cloud environments.&lt;br&gt;
It is not only an IaC tool like Terraform; it can also install applications such as WordPress and Kubeflow with a single command.&lt;br&gt;
Juju charms are artifacts that encapsulate the deployment and configuration details of applications.&lt;br&gt;
Visit the &lt;a href="https://charmhub.io/"&gt;Charm Hub&lt;/a&gt; to see the full list of available charms.&lt;/p&gt;

&lt;p&gt;To install the Kubeflow charm, we need a controller. &lt;br&gt;
A controller in Juju oversees the deployment of a charm; it can run on a separate EC2 instance or as a pod in our target Kubernetes cluster.&lt;br&gt;
These are the steps for deploying Kubeflow with Juju:&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Install juju on your local machine
&lt;/h3&gt;

&lt;p&gt;Install the Juju CLI on either a Mac or Linux machine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  &lt;span class="c"&gt;# mac&lt;/span&gt;
  brew &lt;span class="nb"&gt;install &lt;/span&gt;juju
  &lt;span class="c"&gt;# linux&lt;/span&gt;
  &lt;span class="nb"&gt;sudo &lt;/span&gt;snap &lt;span class="nb"&gt;install &lt;/span&gt;juju &lt;span class="nt"&gt;--classic&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Add your AWS Programmatic credentials
&lt;/h3&gt;

&lt;p&gt;Add your AWS credentials to Juju. If you have not yet configured programmatic access to your AWS account, do so first with &lt;code&gt;aws configure&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;juju add-credential aws   
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Create your Juju controller
&lt;/h3&gt;

&lt;p&gt;I chose to create a Kubernetes controller as a pod in the cluster.&lt;br&gt;
If an EC2 controller is desired, Juju creates an M-type instance by default (see the constraint sketch after the commands below).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ec2 controller&lt;/span&gt;
juju bootstrap aws aws-controller

&lt;span class="c"&gt;# eks pod controller: first register the EKS cluster from your kubeconfig as a k8s cloud named kubeflow&lt;/span&gt;
juju add-k8s kubeflow
juju bootstrap kubeflow kubeflow-controller
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
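
&lt;p&gt;If you do want an EC2 controller but on a smaller machine, bootstrap constraints can pin the instance type. This is a hedged example; check the supported constraint keys against your Juju version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# example: pin the controller instance type (sketch)&lt;/span&gt;
juju bootstrap aws aws-controller --bootstrap-constraints instance-type=t3.large
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;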



&lt;h3&gt;
  
  
  Step 4: Create a Model on the Controller
&lt;/h3&gt;

&lt;p&gt;A Juju &lt;a href="https://juju.is/docs/juju/model"&gt;model&lt;/a&gt; is a user-defined collection of applications and all the components&lt;br&gt;
that support their functionality, such as storage and networking. &lt;/p&gt;

&lt;p&gt;The model is named &lt;strong&gt;kubeflow&lt;/strong&gt;; the Charmed Kubeflow bundle expects this model name.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;juju add-model kubeflow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Deploy the Charm
&lt;/h3&gt;

&lt;p&gt;The Kubeflow version deployed here is &lt;strong&gt;1.7/stable&lt;/strong&gt;.&lt;br&gt;
Other versions can be found on &lt;a href="https://charmhub.io/kubeflow"&gt;Charmhub&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;juju deploy kubeflow &lt;span class="nt"&gt;--channel&lt;/span&gt; 1.7/stable  &lt;span class="nt"&gt;--trust&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configuring Dex and OIDC to Log In to the Kubeflow Dashboard
&lt;/h3&gt;

&lt;p&gt;The installation progress can be monitored in the terminal while the status of the created pods is best visualized using &lt;strong&gt;K9s&lt;/strong&gt;.&lt;br&gt;
Once all the pods are ready, &lt;strong&gt;Dex&lt;/strong&gt; and &lt;strong&gt;OIDC&lt;/strong&gt; must be configured with the commands below to access the Kubeflow Dashboard.&lt;/p&gt;
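
&lt;p&gt;For example, these are two ways I would watch the rollout; a small sketch, assuming the Kubernetes namespace matches the &lt;strong&gt;kubeflow&lt;/strong&gt; model created earlier.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# watch the charm deployment from the juju side&lt;/span&gt;
juju status --watch 5s

&lt;span class="c"&gt;# or watch the pods directly&lt;/span&gt;
kubectl get pods -n kubeflow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;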

&lt;p&gt;Juju creates two LoadBalancers upon Kubeflow deployment, one for ingress with &lt;strong&gt;Istio&lt;/strong&gt; and another for the &lt;strong&gt;Kubernetes Controller&lt;/strong&gt;.&lt;br&gt;
Log in to the dashboard with the &lt;strong&gt;Istio Ingress Gateway&lt;/strong&gt; service URL or its Load Balancer URL retrieved from the AWS web console.&lt;br&gt;
The default user email address is &lt;a href="mailto:user@example.com"&gt;user@example.com&lt;/a&gt; and the default password is 12341234.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  &lt;span class="c"&gt;# set variables&lt;/span&gt;
  &lt;span class="nv"&gt;load_balancer_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;istio-ingressgateway url&amp;gt;
  &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;new password&amp;gt;
  &lt;span class="nv"&gt;email&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;new email&amp;gt;

  &lt;span class="c"&gt;# configure dex&lt;/span&gt;
  juju config dex-auth public-url&lt;span class="o"&gt;=&lt;/span&gt;$load_balancer_url
  juju config oidc-gatekeeper public-url&lt;span class="o"&gt;=&lt;/span&gt;$load_balancer_url
  juju config dex-auth static-password&lt;span class="o"&gt;=&lt;/span&gt;$password
  juju config dex-auth static-username&lt;span class="o"&gt;=&lt;/span&gt;$email
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Installation of Kubeflow Using the Official AWS Guide
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;An EKS Cluster must have been set up before you proceed&lt;/strong&gt; &lt;br&gt;&lt;br&gt;
&lt;strong&gt;Dependencies for this method&lt;/strong&gt;: Make, Python 3.8, and Kustomize&lt;/p&gt;

&lt;p&gt;The prerequisites section of the &lt;a href="https://awslabs.github.io/kubeflow-manifests/docs/deployment/prerequisites/"&gt;AWS Kubeflow official documentation&lt;/a&gt; offers three options for creating an Ubuntu environment to deploy Kubeflow.&lt;br&gt;
Any option works fine, but I decided instead to execute the installation from my local machine (Mac or Linux).&lt;br&gt;
These are the important prerequisites (a setup sketch follows the list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Clone the official &lt;a href="https://awslabs.github.io/kubeflow-manifests/docs/deployment/prerequisites/"&gt;AWS labs Kubeflow manifest repository&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Install a &lt;strong&gt;Python 3.8&lt;/strong&gt; environment using &lt;a href="https://docs.conda.io/projects/miniconda/en/latest/"&gt;Miniconda&lt;/a&gt;. The choice of Python 3.8 is crucial. Miniconda offers the ability to select a desired Python version for an environment. &lt;/li&gt;
&lt;li&gt;Install &lt;a href="https://kubectl.docs.kubernetes.io/installation/kustomize/"&gt;Kustomize&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;
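
&lt;p&gt;A minimal sketch of that setup on a Mac, assuming Miniconda and Homebrew are already installed; the environment name is arbitrary, and on Linux Kustomize can be installed from its release binaries instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# create and activate a Python 3.8 environment (sketch)&lt;/span&gt;
conda create -n kubeflow-aws python=3.8
conda activate kubeflow-aws

&lt;span class="c"&gt;# install kustomize on a mac; on linux, follow the Kustomize installation docs&lt;/span&gt;
brew install kustomize
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
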
&lt;h2&gt;
  
  
  Installation with Kubeflow Manifests
&lt;/h2&gt;

&lt;p&gt;The cloned repository contains a &lt;strong&gt;Makefile&lt;/strong&gt; in its root directory, so &lt;strong&gt;Make&lt;/strong&gt; must be installed on your system to run its commands. I chose not to execute the &lt;code&gt;make install-tools&lt;/code&gt; command from the documentation because it installs dependencies like Python and Kubectl that I had already installed.&lt;/p&gt;

&lt;p&gt;Once the prerequisites are fulfilled, the documentation recommends one of two Kubeflow deployment methods: &lt;strong&gt;Terraform&lt;/strong&gt; or &lt;strong&gt;Manifests&lt;/strong&gt;. Terraform would have been installed by the &lt;code&gt;make install-tools&lt;/code&gt; command, but I opted for Manifests. The manifests method runs a Python script that executes shell commands using Kustomize to deploy Kubeflow; the script can be found at &lt;code&gt;tests/e2e/utils/kubeflow_installation.py&lt;/code&gt;. The conda environment file I used is shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# conda environment yaml file&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubeflow-aws&lt;/span&gt;
&lt;span class="na"&gt;channels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;conda-forge&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;defaults&lt;/span&gt;
&lt;span class="na"&gt;dependencies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;boto3=1.33.6&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;numpy=1.23.5&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;pandas=1.5.3&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;sagemaker=2.198.0&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pip&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;black==23.11.0&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;kfp==2.4.0&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;kubernetes==26.1.0&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mysql-connector-python==8.2.0&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;pytest==7.4.3&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;pyyaml==6.0.1&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;requests==2.31.0&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;retrying==1.3.4&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In preparation for the deployment, the Python libraries in the yaml above must be installed into the Python environment. &lt;br&gt;
Finally, the deployment is started with &lt;code&gt;make deploy-kubeflow&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;
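
&lt;p&gt;One way to install those libraries, assuming the yaml above is saved as &lt;code&gt;environment.yml&lt;/code&gt; and the &lt;strong&gt;kubeflow-aws&lt;/strong&gt; environment from the prerequisites already exists (a sketch):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# install the dependencies from the yaml file into the existing environment (sketch)&lt;/span&gt;
conda env update -n kubeflow-aws -f environment.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;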

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  &lt;span class="c"&gt;# set environmental variables for vanilla deployment&lt;/span&gt;
  &lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DEPLOYMENT_OPTION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;vanilla 
  &lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;INSTALLATION_OPTION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;kustomize 

  &lt;span class="c"&gt;# install kubeflow using the python script at tests/e2e/utils/kubeflow_installation.py&lt;/span&gt;
  make deploy-kubeflow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The installation progress can be monitored in the terminal.&lt;br&gt;
Once all the pods are ready, port forward the Istio ingress gateway pod to your local port 8085 to access the Kubeflow dashboard.&lt;br&gt;
The default user email address is &lt;a href="mailto:user@example.com"&gt;user@example.com&lt;/a&gt; and the default password is 12341234.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward svc/istio-ingressgateway &lt;span class="nt"&gt;-n&lt;/span&gt; istio-system 8085:80

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Comparison of the Two Kubeflow Deployment Methods
&lt;/h2&gt;

&lt;p&gt;From a cost perspective, the official AWS installation method is preferable: it needs no controller, whereas Juju creates an extra Load Balancer or EC2 instance for one.&lt;br&gt;
On the other hand, once Juju is set up, installing and maintaining Kubeflow is easier.&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://awslabs.github.io/kubeflow-manifests/docs/"&gt;Official Kubeflow on AWS Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://juju.is/"&gt;Canonical Juju&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>eks</category>
      <category>kubeflow</category>
      <category>mlops</category>
      <category>aws</category>
    </item>
    <item>
      <title>A Practical Introduction to Amazon SageMaker Python SDK</title>
      <dc:creator>Temiloluwa Adeoti</dc:creator>
      <pubDate>Mon, 27 Feb 2023 22:03:39 +0000</pubDate>
      <link>https://dev.to/temmie/a-practical-introduction-to-amazon-sagemaker-python-sdk-1l9o</link>
      <guid>https://dev.to/temmie/a-practical-introduction-to-amazon-sagemaker-python-sdk-1l9o</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;On the 12th of October, 2022, I presented a Knowledge Share to my colleagues at &lt;a href="https://www.reply.com/machine-learning-reply/de/"&gt;Machine Learning Reply GmbH&lt;/a&gt; titled &lt;a href="https://www.slideshare.net/TemiReply/mldevelopmentwithsagemakerpptx"&gt;"Developing Solutions with Sagemaker"&lt;/a&gt;. Knowledge Sharing is a tradition we observe weekly at &lt;a href="https://www.reply.com/machine-learning-reply/de/"&gt;Machine Learning Reply GmbH&lt;/a&gt; that helps us as consultants develop a broad range of skill sets. There was little time to delve into the Sagemaker Python SDK on the day. With this follow-up blog post, I would like to explore the Estimator API, Model API, Preprocessor API, and Predictor API using the AWS Sagemaker Python SDK.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Sagemaker Python SDK
&lt;/h2&gt;

&lt;p&gt;The Amazon SageMaker Python SDK is the recommended library for developing solutions in Sagemaker. The other ways of interacting with Sagemaker are the AWS CLI, Boto3, and the AWS web console.&lt;br&gt;
In theory, the SDK should offer the best developer experience, but I found there is a learning curve before you can hit the ground running with it.&lt;/p&gt;

&lt;p&gt;This post walks through a simple regression task that showcases the important APIs in the SDK.&lt;br&gt;
I also highlight "gotchas" encountered while developing this solution. The entire codebase is found &lt;a href="https://github.com/Temiloluwa/sagemaker_auto_mpg"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Regression Task: Fuel Consumption Prediction
&lt;/h2&gt;

&lt;p&gt;I selected a regression task I tackled as a budding Data scientist  (&lt;a href="https://github.com/Temiloluwa/ML-database-auto-mpg-prediction/blob/master/solution.ipynb"&gt;notebook link&lt;/a&gt;): to predict fuel consumption of vehicles in MPG (&lt;a href="https://archive.ics.uci.edu/ml/datasets/auto+mpg"&gt;problem definition&lt;/a&gt;). I broke down the problem into three stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A preprocessing stage for feature engineering&lt;/li&gt;
&lt;li&gt;A model training and evaluation stage&lt;/li&gt;
&lt;li&gt;A model inferencing stage&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each of these stages produces reusable artifacts that are stored in S3.&lt;/p&gt;
&lt;h2&gt;
  
  
  Sagemaker Preprocessing and Training
&lt;/h2&gt;

&lt;p&gt;Two things are king in Sagemaker: S3 and Docker containers. S3 is the primary location for storing training data and the destination for exporting training artifacts like models. The SDK provides &lt;a href="https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_processing.html"&gt;Preprocessors&lt;/a&gt; and &lt;a href="https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html"&gt;Estimators&lt;/a&gt; as the fundamental interfaces for data preprocessing and model training. These two APIs are simply wrappers for Sagemaker Docker containers. This is what happens under the hood when a preprocessing job is created with a Preprocessor or training job with an Estimator:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data is transferred from S3 into the Sagemaker Docker container&lt;/li&gt;
&lt;li&gt;The Job (training or preprocessing) is executed in the container that runs on the compute instance you have specified for the job&lt;/li&gt;
&lt;li&gt;Output artifacts (models, preprocessed features) are exported to S3 when the job is concluded&lt;/li&gt;
&lt;/ol&gt;




  &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2nh6xG4n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sagemaker.readthedocs.io/en/stable/_images/amazon_sagemaker_processing_image1.png" alt="Sagemaker Preprocessing Container" width="663" height="261"&gt;This image depicts data transfer into and out of a preprocessing container from S3.
  

&lt;h3&gt;
  
  
  Sagemaker Containers
&lt;/h3&gt;

&lt;p&gt;It is crucial to get familiar with the environment variables and pre-configured path locations in Sagemaker containers. More information can be found on the Sagemaker Containers' &lt;a href="https://github.com/aws/sagemaker-containers"&gt;Github page&lt;/a&gt;. For example, Preprocessors receive data from S3 into &lt;code&gt;/opt/ml/processing/input&lt;/code&gt; while Estimators store training data in &lt;code&gt;/opt/ml/input/data/train&lt;/code&gt;. Some environment variables include &lt;code&gt;SM_MODEL_DIR&lt;/code&gt; for exporting models, &lt;code&gt;SM_NUM_CPUS&lt;/code&gt;, and &lt;code&gt;SM_HP_{hyperparameter_name}&lt;/code&gt;.&lt;/p&gt;
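
&lt;p&gt;For orientation, these are the kinds of values you would typically see inside a training container. This is an illustrative sketch rather than output from a real job, and &lt;code&gt;SM_HP_EPOCHS&lt;/code&gt; assumes a hyperparameter named &lt;code&gt;epochs&lt;/code&gt; was passed to the Estimator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# typical environment variables inside a SageMaker training container (illustrative)&lt;/span&gt;
echo "$SM_MODEL_DIR"       &lt;span class="c"&gt;# /opt/ml/model&lt;/span&gt;
echo "$SM_CHANNEL_TRAIN"   &lt;span class="c"&gt;# /opt/ml/input/data/train&lt;/span&gt;
echo "$SM_NUM_CPUS"        &lt;span class="c"&gt;# number of CPUs on the instance, e.g. 4&lt;/span&gt;
echo "$SM_HP_EPOCHS"       &lt;span class="c"&gt;# value of the "epochs" hyperparameter, if one was defined&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
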
&lt;h2&gt;
  
  
  Project Folder Structure
&lt;/h2&gt;

&lt;p&gt;The diagram below shows the project's folder structure. The main script is the python notebook &lt;code&gt;auto_mpg_prediction.ipynb&lt;/code&gt; whose cells are executed in Sagemaker Studio. Training and preprocessing scripts are located in the &lt;code&gt;scripts&lt;/code&gt; folder.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;├── Blog.md
├── LICENSE
├── README.md
├── auto_mpg_prediction.ipynb
└── scripts
    ├── model
    │   ├── inference.py
    │   └── train.py
    └── preprocessor
        ├── custom_preprocessor.py
        ├── inference.py
        └── train.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Preliminary Steps
&lt;/h2&gt;

&lt;p&gt;Let's start by initializing a Sagemaker session, followed by the boilerplate steps of getting the region, execution role, and default bucket. I create prefixes for the key S3 locations used for data storage and for exporting preprocessed features and models.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sagemaker&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sagemaker&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_execution_role&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;io&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StringIO&lt;/span&gt;

&lt;span class="c1"&gt;# initialize sagemaker session 
&lt;/span&gt;&lt;span class="n"&gt;sess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sagemaker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boto_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_name&lt;/span&gt;
&lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;default_bucket&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;span class="n"&gt;role&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_execution_role&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# boto3 client
&lt;/span&gt;&lt;span class="n"&gt;sm_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'sagemaker'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;prefix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"auto_mpg"&lt;/span&gt;

&lt;span class="c1"&gt;# raw data path
&lt;/span&gt;&lt;span class="n"&gt;raw_train_prefix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/data/bronze/train"&lt;/span&gt;
&lt;span class="n"&gt;raw_val_prefix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/data/bronze/val"&lt;/span&gt;
&lt;span class="n"&gt;raw_test_prefix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/data/bronze/test"&lt;/span&gt;

&lt;span class="c1"&gt;# preprocessed features path
&lt;/span&gt;&lt;span class="n"&gt;pp_train_prefix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/data/gold/train"&lt;/span&gt;
&lt;span class="n"&gt;pp_val_prefix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/data/gold/val"&lt;/span&gt;
&lt;span class="n"&gt;pp_test_prefix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/data/gold/test"&lt;/span&gt;

&lt;span class="c1"&gt;# preprocessor and ml models
&lt;/span&gt;&lt;span class="n"&gt;pp_model_prefix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/models/preprocessor"&lt;/span&gt;
&lt;span class="n"&gt;ml_model_prefix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/models/ml"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_s3_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;""" get full path in s3 """&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Raw Data Transfer to S3
&lt;/h2&gt;

&lt;p&gt;Next, we have to transfer our raw data to S3. In a production setting, an ETL job would set an S3 bucket as the final data destination. I have implemented a function that downloads the raw data, splits it into train, validation, and test sets, then uploads each to its respective S3 path in the default bucket based on the pre-defined prefixes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;upload_raw_data_to_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sess&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;raw_train_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;raw_train_prefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;raw_val_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;raw_val_prefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;raw_test_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;raw_test_prefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                          &lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""
    Read MPG dataset, perform train test split, then upload to s3
    """&lt;/span&gt;
    &lt;span class="c1"&gt;# filenames
&lt;/span&gt;    &lt;span class="n"&gt;train_fn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"train.csv"&lt;/span&gt;
    &lt;span class="n"&gt;val_fn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"val.csv"&lt;/span&gt;
    &lt;span class="n"&gt;test_fn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"test.csv"&lt;/span&gt;

    &lt;span class="c1"&gt;# download data
&lt;/span&gt;    &lt;span class="n"&gt;data_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"&lt;/span&gt;
    &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;StringIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# read data
&lt;/span&gt;    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delimiter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'\s+'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;low_memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;na_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"?"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data_frame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data_frame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data_frame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_frame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;data_frame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data_frame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"mpg"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;"cylinders"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"displacement"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"horsepower"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;"acceleration"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;"model year"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;"origin"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="c1"&gt;# train - test - split
&lt;/span&gt;    &lt;span class="n"&gt;train_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data_frame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frac&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data_frame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# take the last 10 rows of test_df as the test data and the rest as the validation data
&lt;/span&gt;    &lt;span class="n"&gt;val_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="n"&gt;intersection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;([]),&lt;/span&gt; &lt;span class="s"&gt;"overlap between train and test"&lt;/span&gt;

    &lt;span class="c1"&gt;# save data locally and upload data to s3
&lt;/span&gt;    &lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;','&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'utf-8'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;train_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;upload_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;raw_train_prefix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;val_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;','&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'utf-8'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;val_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;upload_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;val_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;raw_val_prefix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;','&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'utf-8'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;test_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;upload_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;raw_test_prefix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# delete local versions of the data
&lt;/span&gt;    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;remove&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_fn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;remove&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_fn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;remove&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_fn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Path to raw train data:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Path to raw val data:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Path to raw test data:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;train_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_path&lt;/span&gt;

&lt;span class="n"&gt;train_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;upload_raw_data_to_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sess&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Stage 1: Feature Engineering
&lt;/h2&gt;

&lt;p&gt;The preprocessing steps are implemented using the Sklearn python library. These are the goals of this stage:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Preprocess the raw train and validation &lt;code&gt;.csv&lt;/code&gt; data into features and export them to s3 in &lt;a href="https://numpy.org/doc/stable/reference/generated/numpy.lib.format.html#module-numpy.lib.format"&gt;&lt;code&gt;.npy&lt;/code&gt;&lt;/a&gt; format&lt;/li&gt;
&lt;li&gt;Save the preprocessing model using &lt;a href="https://scikit-learn.org/stable/model_persistence.html"&gt;&lt;code&gt;joblib&lt;/code&gt;&lt;/a&gt; and export it to s3. This saved model will be deployed as the first step of our inference pipeline.  During inferencing, its task will be to generate features (.npy) for raw test data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Sagemaker Python SDK offers &lt;a href="https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#sagemaker.sklearn.processing.SKLearnProcessor"&gt;Sklearn Preprocessors&lt;/a&gt; and &lt;a href="https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.spark.processing.PySparkProcessor"&gt;PySpark Preprocessors&lt;/a&gt;. These are preprocessors that come with Sklearn and Pyspark pre-installed. Unfortunately, I discovered it is not possible to use custom scripts or dependencies with either of them. Therefore, I had to use the &lt;a href="https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.FrameworkProcessor"&gt;Framework Preprocessor&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;To instantiate the Framework Preprocessor with the Sklearn library, I supplied the &lt;a href="https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html"&gt;&lt;code&gt;SKlearn estimator&lt;/code&gt;&lt;/a&gt; class to the &lt;code&gt;estimator_cls&lt;/code&gt; parameter. The &lt;code&gt;.run&lt;/code&gt; method of the preprocessor comes with a &lt;code&gt;code&lt;/code&gt; parameter for specifying the entry point script and a &lt;code&gt;source_dir&lt;/code&gt; parameter for indicating the directory that contains all custom scripts.&lt;/p&gt;

&lt;p&gt;Pay close attention to how data is transferred into and exported out of the preprocessing container using the &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProcessingInput.html"&gt;ProcessingInput&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProcessingOutput.html"&gt;ProcessingOutput&lt;/a&gt; APIs. You will see how the container (&lt;code&gt;/opt/ml/*&lt;/code&gt;) and S3 paths for data transfer are specified. Note that unlike Estimators, which are executed using a &lt;code&gt;.fit&lt;/code&gt; method, Preprocessors use a &lt;code&gt;.run&lt;/code&gt; method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sagemaker.sklearn.estimator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SKLearn&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sagemaker.processing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FrameworkProcessor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sagemaker.processing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ProcessingInput&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ProcessingOutput&lt;/span&gt;

&lt;span class="n"&gt;current_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%d-%b-%Y-%H:%M:%S"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;":"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"-"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;TRAIN_FN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'train.csv'&lt;/span&gt;
&lt;span class="n"&gt;VAL_FN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'val.csv'&lt;/span&gt;
&lt;span class="n"&gt;TRAIN_FEATS_FN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'train_feats.npy'&lt;/span&gt;
&lt;span class="n"&gt;VAL_FEATS_FN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'val_feats.npy'&lt;/span&gt;


&lt;span class="n"&gt;sklearn_processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FrameworkProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_job_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"auto-mpg-feature-eng-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;current_time&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;framework_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"1.0-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instance_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"ml.m5.xlarge"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instance_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;estimator_cls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SKLearn&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;sklearn_processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"train.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;source_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"scripts/preprocessor/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;ProcessingInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;get_s3_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_train_prefix&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;destination&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"/opt/ml/processing/input/train"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;ProcessingInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;get_s3_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_val_prefix&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;destination&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"/opt/ml/processing/input/test"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;ProcessingOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"train_features"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"/opt/ml/processing/train"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;destination&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;get_s3_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pp_train_prefix&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="n"&gt;ProcessingOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"val_features"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"/opt/ml/processing/test"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;destination&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;get_s3_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pp_val_prefix&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="n"&gt;ProcessingOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"preprocessor_model"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"/opt/ml/processing/output"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;destination&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;get_s3_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pp_model_prefix&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"--train-filename"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TRAIN_FN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="s"&gt;"--test-filename"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;VAL_FN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="s"&gt;"--train-feats-filename"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TRAIN_FEATS_FN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="s"&gt;"--test-feats-filename"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;VAL_FEATS_FN&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Custom Preprocessor
&lt;/h3&gt;

&lt;p&gt;I wanted to implement some custom preprocessing logic, so I created a custom preprocessor class and configured it to follow the familiar Sklearn &lt;code&gt;.fit&lt;/code&gt; and &lt;code&gt;.transform&lt;/code&gt; interface by extending &lt;code&gt;BaseEstimator&lt;/code&gt; and &lt;code&gt;TransformerMixin&lt;/code&gt;. The preprocessor engineers the &lt;code&gt;Model Year&lt;/code&gt; feature into &lt;code&gt;Age&lt;/code&gt; and makes the &lt;code&gt;Origin&lt;/code&gt; and &lt;code&gt;Cylinders&lt;/code&gt; features categorical. It is vital that this custom transformer be stored in a separate file and imported by the main preprocessing script; the reason will be explained during the inferencing step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;writefile&lt;/span&gt; &lt;span class="n"&gt;scripts&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;preprocessor&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;custom_preprocessor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.base&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseEstimator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TransformerMixin&lt;/span&gt;

&lt;span class="n"&gt;original_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'cylinders'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="s"&gt;'displacement'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="s"&gt;'horsepower'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="s"&gt;'weight'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="s"&gt;'acceleration'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="s"&gt;'model year'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="s"&gt;'origin'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CustomFeaturePreprocessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseEstimator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TransformerMixin&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""
    This is a custom transformer class that does the following

        1. converts model year to age
        2. converts data type of categorical columns
    """&lt;/span&gt;

    &lt;span class="n"&gt;feat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;original_features&lt;/span&gt;
    &lt;span class="n"&gt;new_datatypes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'cylinders'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'category'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'origin'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'category'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="s"&gt;""" Fit function"""&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="s"&gt;""" Transform Dataset """&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;feat&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;\
                    &lt;span class="o"&gt;==&lt;/span&gt;  &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;([]),&lt;/span&gt; &lt;span class="s"&gt;"input does have the right features"&lt;/span&gt;

        &lt;span class="c1"&gt;# convert model year to age
&lt;/span&gt;        &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"model year"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;82&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"model year"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# change data types of cylinders and origin 
&lt;/span&gt;        &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new_datatypes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="s"&gt;""" Fit transform function """&lt;/span&gt;
        &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Preprocessing Job
&lt;/h3&gt;

&lt;p&gt;The preprocessing script at &lt;code&gt;scripts/preprocessor/train.py&lt;/code&gt; is executed in the preprocessing container to perform the feature engineering. I create a &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline"&gt;Sklearn Pipeline&lt;/a&gt; model using my &lt;code&gt;CustomFeaturePreprocessor&lt;/code&gt; class as its first step, followed by a &lt;code&gt;OneHotEncoder&lt;/code&gt; for transforming the categorical columns and finally a &lt;code&gt;StandardScaler&lt;/code&gt; for the numerical columns. A Sklearn pipeline is an easy way to chain multiple Sklearn transformers together.&lt;/p&gt;

&lt;p&gt;As a good ML Engineer, you should avoid &lt;a href="https://en.wikipedia.org/wiki/Leakage_(machine_learning)"&gt;data leakage&lt;/a&gt; during feature engineering. Since the same transformer is applied to both the train and validation sets, I excluded the first column of the pandas dataframes, which is the target. I also fitted the model on the train set only.&lt;/p&gt;

&lt;p&gt;After the model is saved using &lt;code&gt;joblib&lt;/code&gt;, it is imperative that it be compressed into a &lt;a href="https://docs.python.org/3/library/tarfile.html"&gt;&lt;code&gt;tar&lt;/code&gt;&lt;/a&gt; file. Sagemaker models are archived as tarfiles; otherwise, an error will be thrown when loading the model during inferencing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;writefile&lt;/span&gt; &lt;span class="n"&gt;scripts&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;preprocessor&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;joblib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;argparse&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tarfile&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.base&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseEstimator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TransformerMixin&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.compose&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_column_selector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ColumnTransformer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OneHotEncoder&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;custom_preprocessor&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;original_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CustomFeaturePreprocessor&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;save_numpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np_array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;""" save np array """&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'wb'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np_array&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;'__main__'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;CONTAINER_TRAIN_INPUT_PATH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/opt/ml/processing/input/train"&lt;/span&gt;
    &lt;span class="n"&gt;CONTAINER_VAL_INPUT_PATH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/opt/ml/processing/input/test"&lt;/span&gt;
    &lt;span class="n"&gt;CONTAINER_TRAIN_OUTPUT_PATH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/opt/ml/processing/train"&lt;/span&gt;
    &lt;span class="n"&gt;CONTAINER_VAL_OUTPUT_PATH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/opt/ml/processing/test"&lt;/span&gt;
    &lt;span class="n"&gt;CONTAINER_OUTPUT_PATH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/opt/ml/processing/output"&lt;/span&gt;

    &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'--train-filename'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'train.csv'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'--val-filename'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'val.csv'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'--model-filename'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'model.tar.gz'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'--train-feats-filename'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'train_feats.npy'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'--val-feats-filename'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'val_feats.npy'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# one hot categorical features
&lt;/span&gt;    &lt;span class="c1"&gt;# apply standard scaler to numerical features
&lt;/span&gt;    &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ColumnTransformer&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="s"&gt;"categorical-feats"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OneHotEncoder&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;make_column_selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dtype_include&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"numerical-feats"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;make_column_selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dtype_exclude&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;))])&lt;/span&gt;

    &lt;span class="c1"&gt;# apply custom preprocessing
&lt;/span&gt;    &lt;span class="n"&gt;pl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="s"&gt;"custom-preprocessing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CustomFeaturePreprocessor&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"column-preprocessing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;

    &lt;span class="n"&gt;train_data_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CONTAINER_TRAIN_INPUT_PATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train_filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;val_data_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CONTAINER_VAL_INPUT_PATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;val_filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# preprocess features, first column is target and the rest are features
&lt;/span&gt;    &lt;span class="n"&gt;train_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_data_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;train_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="n"&gt;train_target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"mpg"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;train_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;val_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_data_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;val_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="n"&gt;val_target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"mpg"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;val_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# save features in output path with the container
&lt;/span&gt;    &lt;span class="n"&gt;train_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;train_target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;val_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;val_target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_features&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;train_features_save_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CONTAINER_TRAIN_OUTPUT_PATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train_feats_filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;val_features_save_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CONTAINER_VAL_OUTPUT_PATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;val_feats_filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;save_numpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_features_save_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;save_numpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_features_save_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# save preprocessor model
&lt;/span&gt;    &lt;span class="n"&gt;model_save_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CONTAINER_OUTPUT_PATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# save model should be a tarfile so it can be loaded in the future
&lt;/span&gt;    &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"model.joblib"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tarfile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_save_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"w:gz"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tar_handle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tar_handle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"model.joblib"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
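
&lt;p&gt;As a quick sanity check (and a preview of why the custom transformer must live in its own file), the archive produced above can be unpacked and the pipeline reloaded locally. This is a minimal sketch: it assumes you have downloaded &lt;code&gt;model.tar.gz&lt;/code&gt; from S3 into a folder that also contains &lt;code&gt;custom_preprocessor.py&lt;/code&gt;, because unpickling the pipeline requires the &lt;code&gt;CustomFeaturePreprocessor&lt;/code&gt; class to be importable.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# minimal local sanity check, assuming model.tar.gz was downloaded from S3
# and custom_preprocessor.py sits next to this script
import tarfile
import joblib
from custom_preprocessor import CustomFeaturePreprocessor  # needed for unpickling

# extract model.joblib from the archive created by the preprocessing job
with tarfile.open("model.tar.gz", "r:gz") as tar_handle:
    tar_handle.extractall(".")

# reload the fitted Sklearn pipeline and inspect its steps
pipeline = joblib.load("model.joblib")
print(pipeline.named_steps)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;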



&lt;h2&gt;
  
  
  Stage 2: Model Training and Evaluation
&lt;/h2&gt;

&lt;p&gt;A separate Python library, &lt;code&gt;smexperiments&lt;/code&gt;, is used for experiment tracking in Sagemaker. A Trial in Sagemaker is synonymous with an MLFlow run. A trial could consist of multiple ML workflow stages, depending on the complexity of the solution. For example, model training and evaluation, or just a single hyperparameter optimization step, could be considered a trial. What's important is that metrics are logged for each trial run to enable the comparison of different trials.&lt;/p&gt;

&lt;p&gt;I created a trial for just the training step and attributed it to the created experiment using the &lt;code&gt;experiment_name&lt;/code&gt; parameter in the &lt;code&gt;Trial.create&lt;/code&gt; call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;smexperiments.experiment&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Experiment&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;smexperiments.trial&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Trial&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;smexperiments.trial_component&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TrialComponent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;smexperiments.tracker&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Tracker&lt;/span&gt;

&lt;span class="n"&gt;current_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%d-%b-%Y-%H:%M:%S"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;":"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"-"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;experiment_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"auto-mg-experiment"&lt;/span&gt;
&lt;span class="c1"&gt;# create a new experiment a load one if it exists
&lt;/span&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;auto_experiment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Experiment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;experiment_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;experiment_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'experiment &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;experiment_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; was loaded'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s"&gt;"ResourceNotFound"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;auto_experiment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Experiment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;experiment_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;experiment_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                            &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Regression on Auto MPG dataset"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                            &lt;span class="n"&gt;tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="s"&gt;'Key'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'Name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Value'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"auto-mg-experiment-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;current_time&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                                                    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'Key'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'MLEngineer'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Value'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Temiloluwa Adeoti"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                                                   &lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'experiment &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;experiment_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; was created'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sagemaker.sklearn.estimator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SKLearn&lt;/span&gt;

&lt;span class="n"&gt;current_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%d-%b-%Y-%H:%M:%S"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;":"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"-"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;n_estimators&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="n"&gt;trail_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"auto-mg-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-estimators"&lt;/span&gt;
&lt;span class="c1"&gt;# create a trial for the training job
&lt;/span&gt;&lt;span class="n"&gt;training_job_trial&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Trial&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trial_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;trail_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;current_time&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="n"&gt;experiment_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;auto_experiment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;experiment_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="n"&gt;sagemaker_boto_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sm_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="n"&gt;tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="s"&gt;'Key'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'Name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Value'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"auto-mg-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;current_time&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                                       &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'Key'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'MLEngineer'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Value'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Temiloluwa Adeoti"&lt;/span&gt;&lt;span class="p"&gt;}])&lt;/span&gt;

&lt;span class="c1"&gt;# configure the estimator
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SKLearn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;entry_point&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"train.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;source_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"./scripts/model"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;framework_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"1.0-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;instance_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"ml.m5.xlarge"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_s3_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ml_model_prefix&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;# model output path
&lt;/span&gt;    &lt;span class="n"&gt;hyperparameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;"n_estimators"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;n_estimators&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;metric_definitions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"Name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"train:mae"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Regex"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"train_mae=(.*?);"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"Name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"test:mae"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Regex"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"test_mae=(.*?);"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"Name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"train:mse"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Regex"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"train_mse=(.*?);"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"Name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"test:mse"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Regex"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"test_mse=(.*?);"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"Name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"train:rmse"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Regex"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"train_rmse=(.*?);"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"Name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"test:rmse"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Regex"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"test_rmse=(.*?);"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;enable_sagemaker_metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# fit the estimator
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"auto-mpg-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;current_time&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"train"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;get_s3_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pp_train_prefix&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
                    &lt;span class="s"&gt;"test"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;get_s3_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pp_val_prefix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                   &lt;span class="p"&gt;},&lt;/span&gt; 
          &lt;span class="n"&gt;experiment_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s"&gt;"TrialName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;training_job_trial&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trial_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"TrialComponentDisplayName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Training-auto-mg-run-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;current_time&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"All"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I find logging metrics in Sagemaker a bit more involved than in other frameworks. This is how custom training metrics are captured:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a logger that streams to standard output (&lt;code&gt;logging.StreamHandler(sys.stdout)&lt;/code&gt;). The streamed logs are automatically captured by AWS CloudWatch.&lt;/li&gt;
&lt;li&gt;Log metrics in a predetermined format, e.g. &lt;code&gt;metric-name=metric-value&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;When creating the estimator that runs the training script, pass a regex pattern that matches your metric logging format to the &lt;code&gt;metric_definitions&lt;/code&gt; parameter (illustrated just below).&lt;/li&gt;
&lt;/ol&gt;
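
&lt;p&gt;For illustration, here is a minimal sketch of how one of the regex patterns passed to &lt;code&gt;metric_definitions&lt;/code&gt; above would extract a metric value from a log line written in the &lt;code&gt;metric-name=metric-value;&lt;/code&gt; format (the sample value is made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# a log line the training script would emit via the stdout logger (hypothetical value)
log_line = "train_mae=1.92;"

# the same pattern supplied for the "train:mae" metric definition
pattern = r"train_mae=(.*?);"

match = re.search(pattern, log_line)
print(match.group(1))  # prints 1.92
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;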

&lt;p&gt;The training job is executed by running the &lt;code&gt;scripts/model/train.py&lt;/code&gt; file within an Sklearn Estimator container on a compute instance (ml.m5.xlarge in this case). Pay attention to how inputs are supplied to estimators using the &lt;code&gt;inputs&lt;/code&gt; parameter and how the training job is assigned to the created trial using the &lt;code&gt;experiment_config&lt;/code&gt; parameter. &lt;/p&gt;

&lt;p&gt;The script trains a &lt;code&gt;RandomForestRegressor&lt;/code&gt; on the preprocessed &lt;code&gt;.npy&lt;/code&gt; train features and evaluates the model on the validation features. I will explain later why I did not save this model as a tarfile.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;writefile&lt;/span&gt; &lt;span class="n"&gt;scripts&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;argparse&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;joblib&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestRegressor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mean_absolute_error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean_squared_error&lt;/span&gt; 

&lt;span class="c1"&gt;# configure logger to standard output
&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;setLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DEBUG&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;stream_handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StreamHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;stream_handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;setFormatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Formatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%(asctime)s %(name)-12s %(levelname)-8s %(message)s"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;addHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream_handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_y_pred&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""
    Return train and val metrics
    """&lt;/span&gt;

    &lt;span class="c1"&gt;# mae
&lt;/span&gt;    &lt;span class="n"&gt;t_mae&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mean_absolute_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ts_mae&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mean_absolute_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# mse
&lt;/span&gt;    &lt;span class="n"&gt;t_mse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;squared&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ts_mse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;squared&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# rmse
&lt;/span&gt;    &lt;span class="n"&gt;t_rmse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;squared&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ts_rmse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;squared&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;t_mae&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts_mae&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t_mse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts_mse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t_rmse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts_rmse&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="s"&gt;'__main__'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Sagemaker specific arguments. Defaults are set in the environment variables.
&lt;/span&gt;    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'--output-data-dir'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'SM_OUTPUT_DATA_DIR'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="c1"&gt;# location in container: '/opt/ml/model'
&lt;/span&gt;    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'--model-dir'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'SM_MODEL_DIR'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="c1"&gt;# location in container: '/opt/ml/input/data/train'
&lt;/span&gt;    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'--train'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'SM_CHANNEL_TRAIN'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="c1"&gt;# location in container: '/opt/ml/input/data/test'
&lt;/span&gt;    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'--test'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'SM_CHANNEL_TEST'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="c1"&gt;# model filename
&lt;/span&gt;    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'--model-filename'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"model.joblib"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Hyperparameters are described here.
&lt;/span&gt;    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'--n_estimators'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Number of Estimators: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Load numpy features saved in S3 
&lt;/span&gt;    &lt;span class="c1"&gt;# Targets are the first column, features are other columns
&lt;/span&gt;    &lt;span class="n"&gt;train_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"train_feats.npy"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;train_feats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="n"&gt;train_target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;val_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"val_feats.npy"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;val_feats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val_data&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="n"&gt;val_target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val_data&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Train random forest model
&lt;/span&gt;    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomForestRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_feats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Model Trained "&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;train_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_feats&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;val_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_feats&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Evaluate Model
&lt;/span&gt;    &lt;span class="n"&gt;train_mae&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_mae&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_mse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_mse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_rmse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_rmse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; \
        &lt;span class="n"&gt;get_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"train_mae=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;train_mae&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;;  val_mae=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;val_mae&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"train_mse=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;train_mse&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;;  val_mse=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;val_mse&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"train_rmse=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;train_rmse&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;; val_rmse=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;val_rmse&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Save the Model
&lt;/span&gt;    &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_filename&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Stage 3: Model Inferencing
&lt;/h2&gt;

&lt;p&gt;Sagemaker provides the &lt;a href="https://sagemaker.readthedocs.io/en/stable/api/inference/model.html"&gt;Model&lt;/a&gt; API for deploying a model to an endpoint and the &lt;a href="https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html"&gt;Predictor&lt;/a&gt; API for making predictions against that endpoint. Since we have two models, the preprocessor and the regressor, we need a Sagemaker pipeline model to chain them and deploy them together. &lt;/p&gt;

&lt;p&gt;I selected the &lt;code&gt;SKLearnModel&lt;/code&gt; (a Model with Sklearn dependencies pre-installed) for the preprocessor. To prepare it, I supplied the path to the saved model tarball in S3, the inference script that serves as its entry point, and &lt;code&gt;custom_preprocessor.py&lt;/code&gt; as a dependency.&lt;/p&gt;

&lt;p&gt;Repeating the &lt;code&gt;CustomFeaturePreprocessor&lt;/code&gt; class in both the preprocessor training script (&lt;code&gt;scripts/preprocessor/train.py&lt;/code&gt;) and the inference script (&lt;code&gt;scripts/preprocessor/inference.py&lt;/code&gt;) did not work. I needed to import the class from a separate file (&lt;code&gt;scripts/preprocessor/custom_preprocessor.py&lt;/code&gt;) for inference. This was the error I encountered:&lt;br&gt;
&lt;code&gt;Can't get attribute 'CustomFeaturePreprocessor' on  &lt;/code&gt;. It is a common problem during model deployment, and you can read more about it in this &lt;a href="https://stackoverflow.com/questions/49621169/joblib-load-main-attributeerror"&gt;Stackoverflow post&lt;/a&gt;.&lt;/p&gt;
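
&lt;p&gt;The root cause is that joblib pickles a custom class by reference to the module where it was defined. When the class only exists inside the training script, it gets recorded as &lt;code&gt;__main__.CustomFeaturePreprocessor&lt;/code&gt;, which the serving container cannot resolve. A minimal sketch of the fix is shown below; the file contents are illustrative and not the article's actual code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# custom_preprocessor.py defines CustomFeaturePreprocessor (and its helpers).
# Both the training and inference scripts import it from there, so joblib
# records "custom_preprocessor.CustomFeaturePreprocessor" instead of
# "__main__.CustomFeaturePreprocessor".

# train.py
from custom_preprocessor import CustomFeaturePreprocessor
import joblib

preprocessor = CustomFeaturePreprocessor()
# ... preprocessor.fit(train_df) ...
joblib.dump(preprocessor, "model.joblib")

# inference.py
from custom_preprocessor import CustomFeaturePreprocessor  # same import path
import joblib

model = joblib.load("model.joblib")  # the class now resolves correctly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;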

&lt;p&gt;I did not need to save the regressor as a tarball and load it back from S3. Instead, I created a Sagemaker Model directly from the trained Estimator with the &lt;code&gt;.create_model&lt;/code&gt; method (another way to create Models).&lt;/p&gt;

&lt;p&gt;When the two models are provided as a list in the pipeline model definition, as shown below, Sagemaker automatically feeds the output of the preprocessor model as input to the regressor model. I deployed the pipeline to an &lt;code&gt;ml.c4.xlarge&lt;/code&gt; instance and used a &lt;a href="https://sagemaker.readthedocs.io/en/stable/api/inference/serializers.html"&gt;&lt;code&gt;CSVSerializer&lt;/code&gt;&lt;/a&gt; for input requests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sagemaker.sklearn.model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SKLearnModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sagemaker.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PipelineModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sagemaker.serializers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CSVSerializer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="n"&gt;current_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%d-%b-%Y-%H:%M:%S"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;":"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"-"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"inference-pipeline-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;current_time&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;endpoint_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"inference-pipeline-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;current_time&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;pp_model_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_s3_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pp_model_prefix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"/model.tar.gz"&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"preprocessor model path "&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pp_model_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# preprocessor
&lt;/span&gt;&lt;span class="n"&gt;sklearn_processor_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SKLearnModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                             &lt;span class="n"&gt;model_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pp_model_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="n"&gt;entry_point&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"scripts/preprocessor/inference.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="n"&gt;dependencies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"scripts/preprocessor/custom_preprocessor.py"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                             &lt;span class="n"&gt;framework_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"1.0-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="n"&gt;sagemaker_session&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sess&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# regression model
&lt;/span&gt;&lt;span class="n"&gt;reg_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry_point&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"inference.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                               &lt;span class="n"&gt;source_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"./scripts/model"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# create a pipeline model with the two models    
&lt;/span&gt;&lt;span class="n"&gt;inference_pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PipelineModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;sklearn_processor_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reg_model&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;sagemaker_session&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sess&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# deploy model
&lt;/span&gt;&lt;span class="n"&gt;predictor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inference_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;deploy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;initial_instance_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                                      &lt;span class="n"&gt;instance_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"ml.c4.xlarge"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                                      &lt;span class="n"&gt;endpoint_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;endpoint_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                      &lt;span class="n"&gt;serializer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CSVSerializer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# to ensure input is csv
&lt;/span&gt;                                     &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sagemaker Inference Script Structure
&lt;/h3&gt;

&lt;p&gt;In Sagemaker, the model server relies on four functions to serve a deployed model.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;model_fn&lt;/code&gt; loads the model file, e.g. a &lt;code&gt;.joblib&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;input_fn&lt;/code&gt; parses the input request for the model. Data deserialization and transformation happen here to prepare the request for the model.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;predict_fn&lt;/code&gt; makes the actual prediction with the model, e.g. calls &lt;code&gt;model.predict&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;output_fn&lt;/code&gt; processes the prediction and returns the response to the caller in the desired format.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Our Scikit-learn model server already has default implementations of these functions which can be overridden in our inference script.&lt;/p&gt;
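
&lt;p&gt;Put together, a minimal skeleton that overrides all four handlers looks roughly like the sketch below; the signatures match what the container expects, while the bodies are only placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
import json
import joblib
import pandas as pd
from io import StringIO


def model_fn(model_dir):
    # Load the serialized model from the model directory (/opt/ml/model)
    return joblib.load(os.path.join(model_dir, "model.joblib"))


def input_fn(input_data, content_type):
    # Deserialize the request payload into something the model understands
    if content_type == "text/csv":
        return pd.read_csv(StringIO(input_data), header=None)
    raise ValueError(f"Unsupported content type: {content_type}")


def predict_fn(input_data, model):
    # Run the actual prediction
    return model.predict(input_data)


def output_fn(prediction, accept):
    # Serialize the prediction for the caller
    return json.dumps(prediction.tolist())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;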

&lt;h4&gt;
  
  
  Preprocessor Inference Script
&lt;/h4&gt;

&lt;p&gt;In my inference script for the preprocessor, I imported the custom dependencies to avoid the error mentioned earlier. The &lt;code&gt;model_fn&lt;/code&gt; loads the &lt;code&gt;.joblib&lt;/code&gt; model, while the &lt;code&gt;input_fn&lt;/code&gt; ensures the request is in &lt;code&gt;text/csv&lt;/code&gt; format and transforms the data into a Pandas DataFrame. By default, the &lt;code&gt;predict_fn&lt;/code&gt; would make a &lt;code&gt;.predict&lt;/code&gt; call on the model, but since the preprocessor is a transformer, the &lt;code&gt;.transform&lt;/code&gt; method is used instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;writefile&lt;/span&gt; &lt;span class="n"&gt;scripts&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;preprocessor&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;inference&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;joblib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;argparse&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;io&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StringIO&lt;/span&gt;
&lt;span class="c1"&gt;# import the custom dependencies
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;custom_preprocessor&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;original_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CustomFeaturePreprocessor&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;input_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;""" Preprocess input data """&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;'text/csv'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StringIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;original_features&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;original_features&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Unsupported content type: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;predict_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;""" Call the transform method instead of the default predict method"""&lt;/span&gt;
    &lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;model_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_dir&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""Load the model"""&lt;/span&gt;
    &lt;span class="n"&gt;model_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"model.joblib"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;loaded_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;loaded_model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Regressor Inference Script
&lt;/h4&gt;

&lt;p&gt;The regressor uses a much simpler inference script. Here the &lt;code&gt;output_fn&lt;/code&gt; is implemented to provide a JSON response to the input request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;writefile&lt;/span&gt; &lt;span class="n"&gt;scripts&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;inference&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;joblib&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sagemaker_containers.beta.framework&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;worker&lt;/span&gt;

&lt;span class="c1"&gt;# configure logger to standard output
&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;setLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DEBUG&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;stream_handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StreamHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;stream_handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;setFormatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Formatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%(asctime)s %(name)-12s %(levelname)-8s %(message)s"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;addHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream_handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;model_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_dir&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""Deserialize fitted model
    """&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"model.joblib"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;output_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;accept&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""
    Preprocess numpy array to return JSON output
    """&lt;/span&gt;
    &lt;span class="n"&gt;pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
        &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"prediction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Making Predictions
&lt;/h3&gt;

&lt;p&gt;With all the major work done, making predictions is a straightforward process with the &lt;a href="https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html"&gt;&lt;code&gt;Predictor&lt;/code&gt; API&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;In the code snippet below, I download the raw test set from S3 and store each line of the CSV, except the header, as a string in a list called &lt;code&gt;test_data&lt;/code&gt;. After instantiating the &lt;code&gt;Predictor&lt;/code&gt; with the &lt;code&gt;endpoint_name&lt;/code&gt; and &lt;code&gt;sagemaker_session&lt;/code&gt;, I make predictions by calling the &lt;code&gt;.predict&lt;/code&gt; method on the &lt;code&gt;predictor&lt;/code&gt; instance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pprint&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pprint&lt;/span&gt;

&lt;span class="c1"&gt;# download raw test data and read text files
&lt;/span&gt;&lt;span class="n"&gt;sess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;download_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;raw_test_prefix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"test.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"r"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;test_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readlines&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;

&lt;span class="c1"&gt;# make predictions with the deployed endpoint
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sagemaker.predictor&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Predictor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sagemaker.deserializers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;JSONLinesDeserializer&lt;/span&gt;

&lt;span class="n"&gt;predictor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Predictor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;endpoint_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;endpoint_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sagemaker_session&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sess&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;serializer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CSVSerializer&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;deserializer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;JSONLinesDeserializer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;num_of_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;predictor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;num_of_samples&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post, we have explored the Estimator API, Model API, Preprocessor API, and Predictor API using the AWS Sagemaker Python SDK by training and deploying a regression model. These are the fundamental APIs that are required for developing solutions in Sagemaker.&lt;/p&gt;

&lt;p&gt;Sagemaker comes with over 20 features that cover most of the ML life cycle, from data annotation and preprocessing to model deployment. A full feature list is available in the Sagemaker documentation. Some of these features, as we saw with Sagemaker Experiments, have separate Python libraries, which can make Sagemaker challenging to learn. From my experience, it helps to study the official Sagemaker examples when working with unfamiliar features. &lt;/p&gt;

&lt;p&gt;Since most Sagemaker jobs run within a Docker container, it's vital to have logging activated to debug errors. Logs streamed to standard output, along with the default Sagemaker metrics, are captured by CloudWatch. I guarantee that this will save you a lot of pain when working with Sagemaker.&lt;/p&gt;

</description>
      <category>sagemaker</category>
      <category>machinelearning</category>
      <category>aws</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Email Notifications for S3 Events with CloudFormation</title>
      <dc:creator>Temiloluwa Adeoti</dc:creator>
      <pubDate>Sat, 18 Jun 2022 22:19:05 +0000</pubDate>
      <link>https://dev.to/aws-builders/email-notifications-for-s3-events-with-cloudformation-5fnh</link>
      <guid>https://dev.to/aws-builders/email-notifications-for-s3-events-with-cloudformation-5fnh</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Amazon S3 buckets can be configured to emit event messages in response to actions. Object creation and removal are examples of events for which S3 publishes messages; the full list is found &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html"&gt;here&lt;/a&gt;. This post describes how to configure custom email notifications for S3 events using CloudFormation. In summary, the solution deployed by the CloudFormation template works as follows: S3 events are consumed by a Lambda function, which then sends custom notification emails using Simple Email Service (SES). Although Simple Notification Service (SNS) comes to mind first for notifications, it lacks the facility to send customised emails. &lt;/p&gt;

&lt;h2&gt;
  
  
  Challenge
&lt;/h2&gt;

&lt;p&gt;This solution is simple to implement in the Amazon Console but becomes complicated when automated with CloudFormation. The three main resources to be created by CloudFormation are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The S3 bucket that emits notification events&lt;/li&gt;
&lt;li&gt;The Lambda function that sends emails with SES&lt;/li&gt;
&lt;li&gt;A custom resource that updates the S3 bucket's notification configuration so the Lambda function can consume the bucket's notification events&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the notification configuration is applied directly to the S3 bucket resource, a circular dependency occurs. The problem is clearly explained &lt;a href="https://aws.amazon.com/blogs/mt/resolving-circular-dependency-in-provisioning-of-amazon-s3-buckets-with-aws-lambda-event-notifications/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Approach
&lt;/h2&gt;

&lt;p&gt;The problem is tackled by first creating each of the above-mentioned resources in isolation. Next, a custom resource applies the notification configuration to the bucket using a Lambda function.&lt;/p&gt;

&lt;p&gt;Custom resources allow developers to implement features that are not natively supported by CloudFormation. Under the hood, a custom resource makes API calls to perform CRUD operations on AWS resources. There are libraries written in Python and JavaScript that make constructing these API calls and their responses easier.&lt;/p&gt;
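
&lt;p&gt;As a rough illustration (not the template's actual code), a custom-resource handler typically inspects the request type, makes the API call, and then reports success or failure back to CloudFormation with a PUT request to the pre-signed &lt;code&gt;ResponseURL&lt;/code&gt; in the event; this reporting step is what the helper libraries wrap for you.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import urllib.request
import boto3

s3 = boto3.client("s3")


def send_response(event, context, status, reason=""):
    # CloudFormation waits for a PUT to the pre-signed ResponseURL
    body = json.dumps({
        "Status": status,
        "Reason": reason or f"See CloudWatch log stream: {context.log_stream_name}",
        "PhysicalResourceId": context.log_stream_name,
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
    }).encode("utf-8")
    request = urllib.request.Request(
        event["ResponseURL"], data=body, method="PUT",
        headers={"Content-Type": ""},
    )
    urllib.request.urlopen(request)


def handler(event, context):
    try:
        props = event["ResourceProperties"]
        if event["RequestType"] in ("Create", "Update"):
            # Apply the notification configuration to the bucket
            s3.put_bucket_notification_configuration(
                Bucket=props["S3BucketName"],
                NotificationConfiguration={
                    "LambdaFunctionConfigurations": [{
                        "LambdaFunctionArn": props["FunctionARN"],
                        "Events": ["s3:ObjectCreated:*"],
                    }]
                },
            )
        send_response(event, context, "SUCCESS")
    except Exception as exc:
        send_response(event, context, "FAILED", str(exc))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;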

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;Let's walk through a CloudFormation template that creates resources for sending emails when files are uploaded to an S3 bucket. &lt;/p&gt;

&lt;p&gt;The template begins by defining the following parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;BucketName&lt;/code&gt; for naming the S3 bucket&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;NotificationLambdaFnName&lt;/code&gt; for naming the Lambda function that sends emails&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CustomLambdaFnName&lt;/code&gt; for naming the Lambda function that is executed by the custom resource&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SenderEmail&lt;/code&gt; the email address that sends out the notification emails&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;RecipientEmail&lt;/code&gt; the email address that receives the notification emails&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AWSREGION&lt;/code&gt; the AWS region in which the email addresses are verified with SES
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;AWSTemplateFormatVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2010-09-09&lt;/span&gt;
&lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Send email notifications when files are uploaded to an s3 bucket&lt;/span&gt;

&lt;span class="na"&gt;Parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

  &lt;span class="na"&gt;BucketName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;String&lt;/span&gt;
    &lt;span class="na"&gt;Default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bkt-notify-email-1001&lt;/span&gt;

  &lt;span class="na"&gt;NotificationLambdaFnName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;String&lt;/span&gt;
    &lt;span class="na"&gt;Default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fn-notify-email-1001&lt;/span&gt;

  &lt;span class="na"&gt;CustomLambdaFnName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;String&lt;/span&gt;
    &lt;span class="na"&gt;Default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;custom-fn-notify-email-1001&lt;/span&gt;

  &lt;span class="c1"&gt;# modify value to a verified SES email&lt;/span&gt;
  &lt;span class="na"&gt;SenderEmail&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;String&lt;/span&gt;
    &lt;span class="na"&gt;Default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fromsomeone@gmail.com&lt;/span&gt;

  &lt;span class="c1"&gt;# modify value to a verified SES email&lt;/span&gt;
  &lt;span class="na"&gt;RecipientEmail&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;String&lt;/span&gt;
    &lt;span class="na"&gt;Default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tosomeone@gmail.com&lt;/span&gt;

  &lt;span class="na"&gt;AWSREGION&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;String&lt;/span&gt;
    &lt;span class="na"&gt;Default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rest of the template consists of resources. The first four resources are the S3 bucket (&lt;code&gt;S3Bucket&lt;/code&gt;), the Lambda function that sends out emails (&lt;code&gt;NotificationFunction&lt;/code&gt;), the Lambda function that applies the notification configuration to the S3 bucket (&lt;code&gt;ApplyS3Notification&lt;/code&gt;), and the custom resource (&lt;code&gt;ApplyNotification&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The code for the Lambda functions is written in Python; a rough, illustrative sketch of the email-sending handler is shown after the resources below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;
&lt;span class="na"&gt;Resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

  &lt;span class="na"&gt;S3Bucket&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AWS::S3::Bucket'&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;AccessControl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Private&lt;/span&gt;
      &lt;span class="na"&gt;BucketName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;BucketName&lt;/span&gt;

  &lt;span class="c1"&gt;# lambda function that sends emails&lt;/span&gt;
  &lt;span class="na"&gt;NotificationFunction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::Lambda::Function&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Function that sends email notifications for s3 bucket file uploads&lt;/span&gt;
      &lt;span class="na"&gt;FunctionName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;NotificationLambdaFnName&lt;/span&gt;
      &lt;span class="na"&gt;Environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;sender&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;SenderEmail&lt;/span&gt;
          &lt;span class="na"&gt;recipient&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;RecipientEmail&lt;/span&gt;
          &lt;span class="na"&gt;awsregion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;AWSREGION&lt;/span&gt;
      &lt;span class="na"&gt;Handler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;email_sender.handler&lt;/span&gt;
      &lt;span class="na"&gt;Runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python3.8&lt;/span&gt;
      &lt;span class="na"&gt;Role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!GetAtt&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NotificationFunctionRole.Arn'&lt;/span&gt;
      &lt;span class="na"&gt;Timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;240&lt;/span&gt;
      &lt;span class="na"&gt;Code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;lambda-email/&lt;/span&gt;

  &lt;span class="c1"&gt;# lambda function for custom resource&lt;/span&gt;
  &lt;span class="na"&gt;ApplyS3Notification&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::Lambda::Function&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Function that attaches creates S3 notification config&lt;/span&gt;
      &lt;span class="na"&gt;FunctionName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;CustomLambdaFnName&lt;/span&gt;
      &lt;span class="na"&gt;Handler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;lambda_fn.handler&lt;/span&gt;
      &lt;span class="na"&gt;Runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python3.8&lt;/span&gt;
      &lt;span class="na"&gt;Role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!GetAtt&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ApplyS3NotificationFuncRole.Arn'&lt;/span&gt;
      &lt;span class="na"&gt;Timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;240&lt;/span&gt;
      &lt;span class="na"&gt;Code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;lambda-notify/&lt;/span&gt;


  &lt;span class="c1"&gt;# Custom resource to apply notification configuration to s3 bucket&lt;/span&gt;
  &lt;span class="na"&gt;ApplyNotification&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Custom::ApplyNotification&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;ServiceToken&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!GetAtt&lt;/span&gt; &lt;span class="s"&gt;ApplyS3Notification.Arn&lt;/span&gt;
      &lt;span class="na"&gt;S3BucketName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;BucketName&lt;/span&gt;
      &lt;span class="na"&gt;FunctionARN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!GetAtt&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NotificationFunction.Arn'&lt;/span&gt;

    &lt;span class="na"&gt;DependsOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;S3Bucket&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
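
&lt;p&gt;For reference, here is a rough, illustrative sketch (not the article's actual code) of what &lt;code&gt;email_sender.handler&lt;/code&gt; in &lt;code&gt;lambda-email/&lt;/code&gt; might look like: it reads the bucket and object key from each S3 event record and sends an email through SES using the environment variables defined above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
import boto3

ses = boto3.client("ses", region_name=os.environ["awsregion"])


def handler(event, context):
    # Each record describes one S3 object event
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        ses.send_email(
            Source=os.environ["sender"],
            Destination={"ToAddresses": [os.environ["recipient"]]},
            Message={
                "Subject": {"Data": f"New upload in {bucket}"},
                "Body": {"Text": {"Data": f"File '{key}' was uploaded to '{bucket}'."}},
            },
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;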



&lt;p&gt;Yes, you keen observer! I know the roles referenced by the &lt;code&gt;!GetAtt&lt;/code&gt; functions are missing. They are defined in the code snippet below, which contains the following resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;NotificationFunctionRole&lt;/code&gt;: A role assumed by the Lambda function that sends emails using SES. It has the AWS-managed &lt;code&gt;AWSLambdaBasicExecutionRole&lt;/code&gt; policy and an attached SES policy for sending emails.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ApplyS3NotificationFuncRole&lt;/code&gt;: A role assumed by the custom resource's Lambda function that permits it to apply notification configuration changes to the S3 bucket. It has the AWS-managed &lt;code&gt;AWSLambdaBasicExecutionRole&lt;/code&gt; policy and an attached policy with the &lt;code&gt;s3:PutBucketNotification&lt;/code&gt; action.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;S3ToLambdaPermission&lt;/code&gt;: A permission resource applied to the Lambda function that sends out emails, allowing it to be invoked by notification events from the S3 bucket.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="c1"&gt;# Role for lambda function that sends email. Requires SES policy&lt;/span&gt;
  &lt;span class="na"&gt;NotificationFunctionRole&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::IAM::Role&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;AssumeRolePolicyDocument&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2012-10-17'&lt;/span&gt;
        &lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Allow&lt;/span&gt;
            &lt;span class="na"&gt;Principal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;Service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;lambda.amazonaws.com&lt;/span&gt;
            &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sts:AssumeRole&lt;/span&gt;
      &lt;span class="na"&gt;ManagedPolicyArns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole&lt;/span&gt;
      &lt;span class="na"&gt;Path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
      &lt;span class="na"&gt;Policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;PolicyName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PolicySendEmailWithSES&lt;/span&gt;
          &lt;span class="na"&gt;PolicyDocument&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;Version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2012-10-17'&lt;/span&gt;
            &lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Sid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SendEmailswithSES&lt;/span&gt;
                &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Allow&lt;/span&gt;
                &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ses:SendEmail&lt;/span&gt;
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ses:SendRawEmail&lt;/span&gt;
                &lt;span class="na"&gt;Resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*'&lt;/span&gt;

  &lt;span class="c1"&gt;# Role that allows custom resource lambda function to apply notification configurations to s3 bucket&lt;/span&gt;
  &lt;span class="na"&gt;ApplyS3NotificationFuncRole&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::IAM::Role&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;AssumeRolePolicyDocument&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2012-10-17'&lt;/span&gt;
        &lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Allow&lt;/span&gt;
            &lt;span class="na"&gt;Principal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;Service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;lambda.amazonaws.com&lt;/span&gt;
            &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sts:AssumeRole&lt;/span&gt;
      &lt;span class="na"&gt;ManagedPolicyArns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole&lt;/span&gt;
      &lt;span class="na"&gt;Path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
      &lt;span class="na"&gt;Policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;PolicyName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;S3BucketNotificationPolicy&lt;/span&gt;
          &lt;span class="na"&gt;PolicyDocument&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;Version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2012-10-17'&lt;/span&gt;
            &lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Sid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AllowBucketNotification&lt;/span&gt;
                &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Allow&lt;/span&gt;
                &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3:PutBucketNotification&lt;/span&gt;
                &lt;span class="na"&gt;Resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;!Sub&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;arn:aws:s3:::${S3Bucket}'&lt;/span&gt;
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;!Sub&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;arn:aws:s3:::${S3Bucket}/*'&lt;/span&gt;


  &lt;span class="c1"&gt;# permission applied to lambda function to allow it to read s3 notifications&lt;/span&gt;
  &lt;span class="na"&gt;S3ToLambdaPermission&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::Lambda::Permission&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;lambda:invokeFunction&lt;/span&gt;
      &lt;span class="na"&gt;SourceAccount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;AWS::AccountId&lt;/span&gt;
      &lt;span class="na"&gt;FunctionName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;NotificationLambdaFnName&lt;/span&gt;
      &lt;span class="na"&gt;SourceArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!GetAtt&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;S3Bucket.Arn'&lt;/span&gt;
      &lt;span class="na"&gt;Principal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3.amazonaws.com&lt;/span&gt;

    &lt;span class="na"&gt;DependsOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;NotificationFunction&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Custom Resource
&lt;/h2&gt;

&lt;p&gt;Our custom resource invokes a lambda function that applies notification configurations to the S3 bucket. The lambda function's ARN serves as the service token for the custom resource. Two parameters are defined and passed to the lambda function: &lt;code&gt;S3BucketName&lt;/code&gt; and &lt;code&gt;FunctionARN&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The lambda function executes a Python script that uses the &lt;code&gt;boto3&lt;/code&gt; library to apply the notification configuration to the S3 bucket. Signalling success or failure back to CloudFormation is handled by the custom resource Python library called &lt;a href="https://github.com/aws-cloudformation/custom-resource-helper"&gt;Custom Resource Helper&lt;/a&gt;.&lt;/p&gt;
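&lt;p&gt;As a minimal, hedged sketch of how such a handler could be written with &lt;code&gt;crhelper&lt;/code&gt; and &lt;code&gt;boto3&lt;/code&gt; (the actual script in the repository may differ; only the &lt;code&gt;S3BucketName&lt;/code&gt; and &lt;code&gt;FunctionARN&lt;/code&gt; properties come from the template above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch of the custom resource handler; details are assumptions
import boto3
from crhelper import CfnResource

helper = CfnResource()
s3 = boto3.client("s3")


@helper.create
@helper.update
def apply_notification(event, _):
    # Properties passed in from the CloudFormation custom resource
    props = event["ResourceProperties"]
    bucket = props["S3BucketName"]
    function_arn = props["FunctionARN"]

    # Point object-created events at the email-sending lambda function
    s3.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration={
            "LambdaFunctionConfigurations": [
                {"LambdaFunctionArn": function_arn, "Events": ["s3:ObjectCreated:*"]}
            ]
        },
    )


@helper.delete
def remove_notification(event, _):
    # Clear the notification configuration when the stack is deleted
    bucket = event["ResourceProperties"]["S3BucketName"]
    s3.put_bucket_notification_configuration(Bucket=bucket, NotificationConfiguration={})


def handler(event, context):
    # crhelper sends the success/failure response back to CloudFormation
    helper(event, context)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;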

&lt;h2&gt;
  
  
  Sending Emails with Lambda and SES
&lt;/h2&gt;

&lt;p&gt;The second lambda function sends out emails using SES when files are uploaded to the S3 bucket. The sender and recipient emails must already be verified in SES. If your SES account is no longer in sandbox mode, only the sender email address needs to be verified.&lt;/p&gt;

&lt;p&gt;The email addresses and the AWS region in which they were verified in SES are passed as environment variables to the lambda function.&lt;/p&gt;
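&lt;p&gt;A rough sketch of this handler is shown below; the environment variable names are illustrative and not necessarily those used in the repository:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch of the email-sending handler; variable names are assumptions
import os
import boto3

# Sender, recipient and the SES region come from environment variables
SENDER = os.environ["SENDER_EMAIL"]
RECIPIENT = os.environ["RECIPIENT_EMAIL"]
SES_REGION = os.environ["SES_REGION"]

ses = boto3.client("ses", region_name=SES_REGION)


def handler(event, _):
    # Each S3 event record describes one uploaded object
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        ses.send_email(
            Source=SENDER,
            Destination={"ToAddresses": [RECIPIENT]},
            Message={
                "Subject": {"Data": f"New file uploaded to {bucket}"},
                "Body": {"Text": {"Data": f"The object {key} was uploaded to {bucket}."}},
            },
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;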

&lt;h2&gt;
  
  
  Template Deployment
&lt;/h2&gt;

&lt;p&gt;An S3 bucket to which the packaged CloudFormation artifacts will be uploaded is required. &lt;br&gt;
Such a bucket may already exist if you have packaged CloudFormation stacks in the past; otherwise, create one.&lt;/p&gt;

&lt;p&gt;Next, Python dependencies must be downloaded to the target folder of the lambda function. In our use case, the &lt;code&gt;Custom Resource Helper&lt;/code&gt; library is downloaded to the custom resource lambda function's directory.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;aws cloudformation package&lt;/code&gt; command is then run to package the template and lambda functions and to generate an output template file. The source template file name, the CloudFormation S3 bucket name, and the output template file name are passed to the command.&lt;/p&gt;

&lt;p&gt;Finally, the &lt;code&gt;aws cloudformation deploy&lt;/code&gt; command deploys the generated output template file.&lt;/p&gt;

&lt;p&gt;The following code snippet displays the deployment steps.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# create s3 bucket to deploy cloud formation template&lt;/span&gt;
aws s3 mb &lt;span class="s2"&gt;"s3://&lt;/span&gt;&lt;span class="nv"&gt;$bucket&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$region&lt;/span&gt;

&lt;span class="c"&gt;# install python dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; lambda-notify/requirements.txt &lt;span class="nt"&gt;--target&lt;/span&gt; lambda-notify

&lt;span class="c"&gt;# package template&lt;/span&gt;
aws cloudformation package &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--template-file&lt;/span&gt; &lt;span class="nv"&gt;$template&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--s3-bucket&lt;/span&gt; &lt;span class="nv"&gt;$bucket&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--output-template-file&lt;/span&gt; &lt;span class="nv"&gt;$output&lt;/span&gt;

&lt;span class="c"&gt;# deploy template&lt;/span&gt;
aws cloudformation deploy &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--capabilities&lt;/span&gt; CAPABILITY_IAM &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--template-file&lt;/span&gt; &lt;span class="nv"&gt;$output&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--stack-name&lt;/span&gt; &lt;span class="nv"&gt;$stackname&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;The status of the stack deployment can be monitored in the CloudFormation console or from the command line. Test the deployment by uploading a file to the S3 bucket and confirming that the recipient receives a notification email.&lt;/p&gt;
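&lt;p&gt;For instance, a quick smoke test could be run from a Python shell; the bucket, file, and stack names below are placeholders for your own values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Quick smoke test: upload a file and expect a notification email shortly after
# "my-stack-name", "my-notification-bucket" and "hello.txt" are placeholders
import boto3

# Confirm the stack finished deploying
cfn = boto3.client("cloudformation")
status = cfn.describe_stacks(StackName="my-stack-name")["Stacks"][0]["StackStatus"]
print(status)  # e.g. CREATE_COMPLETE

# Upload a file to trigger the S3 notification and the email
s3 = boto3.client("s3")
s3.upload_file("hello.txt", "my-notification-bucket", "hello.txt")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;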

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This article explains how to configure a Lambda function to send out custom emails in response to events on an S3 bucket, with the whole process automated through a CloudFormation template. The complete codebase for this solution is available in the GitHub repo &lt;a href="https://github.com/Temiloluwa/cloud-infrastructure-templates.git"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/mt/resolving-circular-dependency-in-provisioning-of-amazon-s3-buckets-with-aws-lambda-event-notifications/"&gt;Resolving circular dependency in provisioning of Amazon S3 buckets with AWS Lambda event notifications&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/infrastructure-and-automation/aws-cloudformation-custom-resource-creation-with-python-aws-lambda-and-crhelper/"&gt;AWS CloudFormation custom resource creation with Python, AWS Lambda, and crhelper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/aws-cloudformation/custom-resource-helper"&gt;Custom Resource Helper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/lambda-send-email-ses/"&gt;How do I send email using Lambda and Amazon SES&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>cloudformation</category>
      <category>lambda</category>
      <category>s3</category>
      <category>ses</category>
    </item>
    <item>
      <title>A Quick Start to Databricks on AWS</title>
      <dc:creator>Temiloluwa Adeoti</dc:creator>
      <pubDate>Sun, 24 Apr 2022 13:11:14 +0000</pubDate>
      <link>https://dev.to/aws-builders/a-quick-start-to-databricks-on-aws-4a</link>
      <guid>https://dev.to/aws-builders/a-quick-start-to-databricks-on-aws-4a</guid>
      <description>&lt;p&gt;Big data technologies have evolved rapidly over the last two decades making it difficult to define clear-cut skillsets for data roles. For example, Data Scientists need to pre-process data before building models. But to what extent does this differ from Data Engineering. It is common to find role overlaps between Data Analysts and Data Scientists, Data Scientists and Machine Learning Engineers, and Machine Learning Engineers and DevOps Engineers.&lt;/p&gt;

&lt;p&gt;This challenge extends to the development infrastructure. Organizations run isolated infrastructure stacks with tools dedicated to a single purpose such as Web Development, Data Engineering, or Data Science. This siloed setup limits collaboration across teams, introduces data security risks, and hinders effective data governance.&lt;/p&gt;

&lt;p&gt;What if a centralized data platform existed where authorized persons, irrespective of their technical background and intended use case, could access data efficiently with consistency guarantees?&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Databricks?
&lt;/h2&gt;

&lt;p&gt;Databricks is a SaaS platform for developing cloud-agnostic AI and data analytics solutions. Its creators are responsible for successful open-source projects like Apache Spark and MLflow. Over 5000 companies currently use Databricks, and it integrates with over 450 partner technologies like Tableau, Qlik, SageMaker, and MathWorks.&lt;/p&gt;

&lt;p&gt;On Databricks, development teams can set up git repositories and run notebooks for Apache Spark applications in Python, Scala, R, and SQL. All-purpose clusters can be provisioned as the development environment, or job clusters, which are managed by the Databricks job scheduler, can be used for running automated jobs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use Databricks?
&lt;/h2&gt;

&lt;p&gt;Databricks currently has no streaming data ingestion offering like Kinesis. The competing services on AWS are EMR and Glue for running Spark jobs, and SageMaker for Spark machine learning. Given these services, is there any reason to consider using Databricks?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MVHzdU5S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rckrdnx596gf9x3l0jyq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MVHzdU5S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rckrdnx596gf9x3l0jyq.png" alt="Datalakehouse" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the core of Databricks is the Data Lakehouse platform, which is founded on Delta Lake. Delta Lake provides high-performance ACID table storage for cloud object stores. Since S3 is a cost-effective solution for storing structured and unstructured data, a Delta Lake can be built on top of it to provide the following benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;A unified environment to power diverse teams such as Data Analytics, Machine Learning, and Data Engineering.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consistency is guaranteed when performing multi-object updates on a table that is backed by multiple files in an object store.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Perform rollbacks for unsuccessful transactions and query point-in-time snapshots, as shown in the sketch after this list.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
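&lt;p&gt;As a minimal, hedged sketch of points 2 and 3, to be run on a Databricks cluster where Delta Lake and the &lt;code&gt;spark&lt;/code&gt; session are preconfigured; the S3 path below is a placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal Delta Lake sketch; the S3 path is a placeholder
path = "s3a://my-bucket/delta/events"

# Write a small table as a Delta table backed by files in the object store
df = spark.createDataFrame([(1, "created"), (2, "updated")], ["id", "action"])
df.write.format("delta").mode("overwrite").save(path)

# Append more rows; Delta commits this as a new atomic table version
spark.createDataFrame([(3, "deleted")], ["id", "action"]) \
    .write.format("delta").mode("append").save(path)

# Time travel: query the table as it was at version 0, before the append
old_df = spark.read.format("delta").option("versionAsOf", 0).load(path)
old_df.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;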

&lt;h2&gt;
  
  
  Databricks on AWS
&lt;/h2&gt;

&lt;p&gt;Unlike the Azure Databricks service offering, a search for the Databricks service on the AWS portal yields no result. To quickly get started with Databricks on AWS, there are two options available:&lt;/p&gt;

&lt;h3&gt;
  
  
  Full Data Platform
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Go to &lt;a href="https://databricks.com/"&gt;Databricks&lt;/a&gt; and click the &lt;code&gt;Try Databricks&lt;/code&gt; button. Fill in the form and Select AWS as your desired platform afterward.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select a Databricks subscription plan: either Standard, Premium, or Enterprise.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set up a workspace using your AWS account. A workspace is simply a collaboration environment for your Databricks resources. You will be redirected to log in to your AWS account.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Authorize a CloudFormation Stack to create Databricks resources. By default, a cluster of three i3.xlarge EC2 instances is provisioned for the Spark cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A URL will be sent to your email when your workspace is ready to start development.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Do not forget to terminate the default cluster when you sign out to prevent incurring unwanted costs.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Databricks Community Edition and S3
&lt;/h3&gt;

&lt;p&gt;If you don't want to link your AWS account to Databricks or simply want to try it out, you can use the Databricks Community Edition. You can run IPython notebooks for free on 15GB clusters that are fully managed by Databricks.&lt;/p&gt;

&lt;p&gt;You can mount your S3 buckets in your Databricks notebooks through Databricks File System (DBFS). A guide to implementing this can be found &lt;a href="https://docs.databricks.com/data/data-sources/aws/amazon-s3.html"&gt;here&lt;/a&gt;.&lt;/p&gt;
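&lt;p&gt;As a hedged sketch, the mount can look like the notebook cell below, assuming the cluster already has access to the bucket (for example through an instance profile, as described in the linked guide); the bucket and mount names are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Run inside a Databricks notebook, where dbutils and display are available
# "my-bucket" and "my-mount" are placeholders
aws_bucket_name = "my-bucket"
mount_name = "my-mount"

# Mount the S3 bucket into DBFS under the /mnt namespace
dbutils.fs.mount(f"s3a://{aws_bucket_name}", f"/mnt/{mount_name}")

# List the mounted files to confirm the mount works
display(dbutils.fs.ls(f"/mnt/{mount_name}"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;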

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Databricks provides a unified platform for data teams to build AI and analytics applications. It is founded on the Data Lakehouse platform, which is powered by Delta Lake, a technology that provides high-performance ACID properties for object stores. You can set up Databricks on your AWS account or mount S3 in notebooks on the Databricks Community Edition.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Armbrust, M., Das, T., Sun, L., Yavuz, B., Zhu, S., Murthy, M., ... &amp;amp; Zaharia, M. (2020). Delta lake: high-performance ACID table storage over cloud object stores. Proceedings of the VLDB Endowment, 13(12), 3411-3424.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://databricks.com/blog/2021/06/22/get-your-free-copy-of-delta-lake-the-definitive-guide-early-release.html"&gt;Lee, D., Das, T., &amp;amp; Jaiswal, V. (2022). Delta Lake The Definitive Guide [E-book]. O’Reilly Media.&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>aws</category>
      <category>databricks</category>
      <category>spark</category>
      <category>awscommunitybuilder</category>
    </item>
  </channel>
</rss>
