🟥🕷🦈presents: EKS fun for Data Science (Part 1)

CrimsonSpiderShark — Sun, 12 Jan 2025 18:25:19 +0000

I added the 'fun' to the title because I'm going to get a little silly with it. Just a bit, though: there will be plenty of serious stuff in it for you data science sickos.

So, cliffnotes version of the regular spiel on Kubernetes and EKS: Kubernetes is how you get a bunch of small services to work together without one crashing the other, EKS is how you use Kubernetes in AWS. AWS, by volume, is essentially the Internet. Today, we'll deploy a simple image size reduction service in EKS (this is part 1, after all).

Write the Flask code in Python

You don't need it to be Flask or Python, that's just the example I'm going to use. The example looks something like this:

app.py

from flask import Flask, request, send_file
from PIL import Image
import io
import os

app = Flask(__name__)

@app.route('/resize', methods=['POST'])
def resize_image():
    if 'image' not in request.files:
        return "No image file provided", 400

    image_file = request.files['image']

    try:
        img = Image.open(image_file)
        max_size = int(request.form.get('size', 512))

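        # JPEG cannot store alpha or palette modes (e.g. RGBA, P), so convert first
        if img.mode not in ('RGB', 'L'):
            img = img.convert('RGB')

        # thumbnail() resizes in place and preserves the aspect ratio
        img.thumbnail((max_size, max_size))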
        img_io = io.BytesIO()
        img.save(img_io, 'JPEG')
        img_io.seek(0)

        return send_file(img_io, mimetype='image/jpeg')

    except Exception as e:
        return f"Error processing image: {str(e)}", 500

if __name__ == "__main__":
    app.run(debug=True, host="0.0.0.0", port=int(os.environ.get("PORT", 3000)))
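The resize step in app.py leans on Pillow's thumbnail(), which shrinks an image to fit inside a box while keeping its aspect ratio and never enlarges it. Here is a rough sketch of that size math in plain Python. The helper name fit_within is made up for illustration, and Pillow's own rounding may differ by a pixel, so treat it as an approximation rather than what the library does internally.

```python
def fit_within(width: int, height: int, max_size: int) -> tuple[int, int]:
    """Approximate the size img.thumbnail((max_size, max_size)) would produce."""
    # Use the tighter of the two constraints; cap at 1.0 so small images are not enlarged.
    scale = min(max_size / width, max_size / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(1000, 500, 200))  # (200, 100)
print(fit_within(100, 50, 512))    # (100, 50), already small enough
```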

requirements.txt

pip
autopep8
Flask==3.0.3
gunicorn==22.0.0
Werkzeug==3.0.6
Pillow

Dockerfile

FROM python:3.9-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY app.py .

EXPOSE 3000

CMD ["python", "app.py"]

Put those three in a single folder and make the magic happen.

This image sums it up pretty well: just a list of commands to set up the things you'll need to build the image for your Kubernetes cluster.

Push to your ECR repository

Oh, right. ECR is where you keep all your container images for Docker and Kubernetes services. Forgot to mention that one. But here's the link in AWS where you create one.

I also just noticed this blog is about using images (the non-typical kind) to reduce images (the typical kind).

Anyways, create the repository in your console, then go to the list of repos and click on 'View push commands'.

Then, in the directory where you have all those previous files, run the commands you get from that console. And voila, you've got your repo up and running.
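If you're curious what those push commands usually look like, here is a small sketch that prints them. The account ID, region, and repository name are placeholders I made up, and your console may show slightly different commands, so trust the console over this.

```python
def ecr_push_commands(account_id: str, region: str, repo: str, tag: str = "latest") -> list[str]:
    """Build the usual login/build/tag/push sequence for an ECR repository."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    remote = f"{registry}/{repo}:{tag}"
    return [
        # Log Docker in to the private registry with a short-lived password
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {registry}",
        # Build from the folder holding app.py, requirements.txt and the Dockerfile
        f"docker build -t {repo}:{tag} .",
        # Give the local image its registry name, then push it
        f"docker tag {repo}:{tag} {remote}",
        f"docker push {remote}",
    ]


# Placeholder values, swap in your own
for command in ecr_push_commands("123456789012", "us-east-1", "image-reducer"):
    print(command)
```

The last command's target (everything after "docker push") is what goes in the image field of the deployment.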

Create and test your Kubernetes Cluster

Use this deployment code after modifying it:

deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: image-reducer
spec:
  replicas: 2
  selector:
    matchLabels:
      app: image-reducer
  template:
    metadata:
      labels:
        app: image-reducer
    spec:
      containers:
      - name: image-reducer
        image: <your_image_name>
        ports:
        - containerPort: 3000
        resources:
          requests:
            cpu: "100m"
            memory: "256Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"
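One thing worth double-checking before you apply the deployment: Kubernetes rejects a Deployment whose spec.selector.matchLabels does not match the labels on the pod template. Here is a tiny sketch of that check on a manifest already loaded into a Python dict. The function name is made up, and the dicts below only mirror the relevant parts of the manifest.

```python
def selector_matches(deployment: dict) -> bool:
    """True when every selector label appears, with the same value, on the pod template."""
    spec = deployment["spec"]
    wanted = spec["selector"].get("matchLabels") or {}
    labels = spec["template"]["metadata"].get("labels") or {}
    if not wanted:
        return False  # this sketch treats an empty selector as a mistake
    return all(value is not None and labels.get(key) == value
               for key, value in wanted.items())


template = {"metadata": {"labels": {"app": "image-reducer"}}}
good = {"spec": {"selector": {"matchLabels": {"app": "image-reducer"}}, "template": template}}
# A blank "app:" in YAML parses as null, which is None in Python
bad = {"spec": {"selector": {"matchLabels": {"app": None}}, "template": template}}

print(selector_matches(good))  # True
print(selector_matches(bad))   # False
```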

And also one for load balancing (save it as service.yaml):

apiVersion: v1
kind: Service
metadata:
  name: image-reducer-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
spec:
  selector:
    app: image-reducer
  ports:
  - protocol: TCP
    port: 80
    targetPort: 3000
  type: LoadBalancer

Then run the deployment commands:

kubectl apply -f deployment.yaml
kubectl apply -f service.yaml

Get the external address for your endpoint:

kubectl get service image-reducer-service

The EXTERNAL-IP column gives you the address (on AWS this is usually the load balancer's DNS name rather than a bare IP). Send a POST request with an image to its /resize endpoint and you get a resized image back.

curl -X POST -F "image=@<your_image_path>" -F "size=200" http://<your-external-ip>/resize -o resized.jpg
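If you'd rather call the service from Python than from curl, here is a minimal sketch using only the standard library. It posts the file under the image form field that app.py reads, plus the size field. The helper names, the octet-stream content type, and the output file name are my own choices, not anything the service requires.

```python
import os
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, file_bytes, boundary=None):
    """Encode form fields plus one file as a multipart/form-data body."""
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        head = (
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'
        )
        parts.append(head.encode())
    file_head = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        'Content-Type: application/octet-stream\r\n\r\n'
    )
    parts.append(file_head.encode() + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())  # closing boundary
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def resize_remote(url, image_path, size=200, out_path="resized.jpg"):
    """POST an image to the /resize endpoint and save the JPEG that comes back."""
    with open(image_path, "rb") as fh:
        body, content_type = build_multipart(
            {"size": size}, "image", os.path.basename(image_path), fh.read()
        )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())


# Example call, with the address taken from `kubectl get service`:
# resize_remote("http://<your-external-ip>/resize", "<your_image_path>", size=200)
```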

And that's it, simple as pie. It gets you a nice little sample image resizer. Chomp Chomp.

🟥🕷🦈 brings to you: AWS Glue DataBrew

CrimsonSpiderShark — Thu, 02 Jan 2025 14:10:53 +0000

Welcome to the first blog post from this blog for 2025 (and also ever, but who's counting). Today I'm going to sink my teeth and my fangs into AWS Glue DataBrew. And as this is the first blog (of many), we're going to start it off in the right way: no explanation, just implementation. That's a lie, there will be an explanation, but after the implementation.

In the Glue DataBrew console, create a sample project

This sample project contains all the data we will explore. Let's pick up the dataset for Chess moves.

As always, create a new AWS IAM role with the appropriate permissions for the task. I said this blog works backwards, right? In reality, this is best practice, but you only do it after you have determined the scope with some nice almost-admin permissions. At least that's the shark's reality.

Start processing the data from the project console

This is a load of data: 17 columns and 2500 rows. Let's reduce some of this down and give it some quality. We're looking for the most common opening move for which black wins when two players are within 22 points of each other in ratings (close games), according to this table:

To do this, we have to remove pretty much everything that's not those things.

Remove duplicates

Click the three dots on the column, and you'll find your answer:

Apply the changes and continue.

Remove unnecessary columns

Look at the columns and remove everything that is not related to the ratings, the opening move, the winner, or the ID. Your final columns should look like this:

Filter out games black did not win

Filter out every row where the winner is not black, using the filter icon:

Create difference column

Create a column to calculate the differences between ratings as follows:

Let's filter this column as we did before, with two filters this time: one for -22 and another for 22 in ratings difference.

Not a lot of data, 176 rows, but it helps us with our point. We can even find the frequency of opening moves right there in the console:

The opening that wins the most for black is A00 (Benko's opening), which makes sense since it is unconventional and not very advantageous for white:
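For those who think better in code, here is a rough Python sketch of the same cleaning steps. DataBrew does all of this in the console without any code. The column names and the handful of rows below are made up for illustration and are not the sample dataset's real headers or data. I'm also assuming the difference is white's rating minus black's, and that the -22 and 22 bounds are inclusive.

```python
from collections import Counter

# Made-up rows; the real sample dataset has 17 columns and different headers.
games = [
    {"id": "g1", "winner": "black", "white_rating": 1500, "black_rating": 1510, "opening_eco": "A00", "turns": 40},
    {"id": "g1", "winner": "black", "white_rating": 1500, "black_rating": 1510, "opening_eco": "A00", "turns": 40},
    {"id": "g2", "winner": "white", "white_rating": 1400, "black_rating": 1405, "opening_eco": "C20", "turns": 31},
    {"id": "g3", "winner": "black", "white_rating": 1600, "black_rating": 1590, "opening_eco": "A00", "turns": 55},
    {"id": "g4", "winner": "black", "white_rating": 1200, "black_rating": 1500, "opening_eco": "B01", "turns": 22},
    {"id": "g5", "winner": "black", "white_rating": 1700, "black_rating": 1722, "opening_eco": "C20", "turns": 61},
]

KEEP = ("id", "winner", "white_rating", "black_rating", "opening_eco")


def brew(rows, max_diff=22):
    """Apply the steps in order: dedupe, drop columns, filter winner, add and filter the difference."""
    seen, cleaned = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # remove duplicates
        seen.add(key)
        slim = {column: row[column] for column in KEEP}  # remove unnecessary columns
        if slim["winner"] != "black":
            continue  # keep only games black won
        slim["rating_diff"] = slim["white_rating"] - slim["black_rating"]  # difference column
        if -max_diff <= slim["rating_diff"] <= max_diff:  # close games only
            cleaned.append(slim)
    return cleaned


close_black_wins = brew(games)
print(Counter(game["opening_eco"] for game in close_black_wins).most_common(1))  # [('A00', 2)]
```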

Introduction to Glue DataBrew

So, now we can introduce DataBrew based on what we understood up there. It is, in its simplest form, a way to clean (or brew) data and make it easier to consume. That is abundantly clear from the tutorial, but it's good to put it into words. Your final result is a series of data cleaning steps:

DataBrew calls this series of steps a recipe. You can use a recipe to create jobs that apply the same steps to similar data.

There are, however, things we could do better, which I would encourage you to look at:

Provision for potential missing values and filter them out.
Widen the search to a larger ratings range.
Adjust the range so that games where black is the far more experienced player (a far greater range) are also reflected in the dataset.
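To make those three ideas a bit more concrete, here is a small sketch of a close-game filter with adjustable bounds. The function name, the white-minus-black direction of the difference, and the example bounds of 100 and 200 are all my own assumptions, so pick whatever fits your data.

```python
def close_game(white_rating, black_rating, lower=-22, upper=22):
    """True when (white - black) falls inside [lower, upper]; rows with missing ratings are dropped."""
    if white_rating is None or black_rating is None:
        return False  # provision for missing values
    return lower <= white_rating - black_rating <= upper


print(close_game(1500, None))                         # False, rating missing
print(close_game(1500, 1600))                         # False with the original 22-point range
print(close_game(1500, 1600, lower=-100, upper=100))  # True with a wider range
print(close_game(1500, 1700, lower=-200))             # True, black may be far stronger
print(close_game(1700, 1500, lower=-200))             # False, white is far stronger
```

With this direction of the difference, a negative value means black is the higher-rated player, so lowering only the lower bound is what lets in games with a much more experienced black.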

But I digress. You can start with this and improve on it. And remember, I didn't become a SpiderShark just for the fun of it; I did it so I would have 10 appendages to type with, just like a human! ChompChomp!

DEV Community: CrimsonSpiderShark