DEV Community

CrimsonSpiderShark


🟥🕷🦈 brings to you: AWS Glue DataBrew

Welcome to the first blog post from this blog for 2025 (and also ever, but who's counting). Today I'm going to sink my teeth and my fangs into AWS Glue DataBrew. And as this is the first blog (of many), we're going to start it off the right way: no explanation, just implementation. That's a lie: there will be an explanation, but after the implementation.

In the Glue DataBrew console, create a sample project

I love sampling almost as much as hip-hop producers

This sample project contains all the data we will explore. Let's pick the chess moves dataset.

I love chess pieces! So easy to hunt since they can only move one move at a time

As always, create a new AWS IAM role with the appropriate permissions for the task. I said this blog works backwards, right? In reality, this is best practice, but you only do it after you have determined the scope with some nice almost-admin permissions. At least that's the shark's reality.
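For the curious, here is a minimal sketch of what a scoped-down role for DataBrew might contain, written as plain Python dictionaries so nothing is actually created in your account. The bucket name and prefix are hypothetical placeholders, and the S3 actions are only an assumption about what a read-only project like this one needs.

```python
import json


def databrew_trust_policy() -> dict:
    """Trust policy that lets the DataBrew service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "databrew.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def scoped_s3_read_policy(bucket: str, prefix: str) -> dict:
    """Read-only access to one prefix of one bucket (both names are hypothetical)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }


# Hypothetical bucket and prefix, purely for illustration.
print(json.dumps(databrew_trust_policy(), indent=2))
print(json.dumps(scoped_s3_read_policy("my-chess-bucket", "games"), indent=2))
```

Once you know the scope, this is the shape you would shrink those almost-admin permissions down to.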

Start processing the data from the project console

Why do people want to con soles? The shoelaces are much more gullible.

This is a load of data: 17 columns and 2500 rows. Let's reduce some of this down and give it some quality. We're looking for the most common opening move with which black wins when the two players are within 22 rating points of each other (close games), according to this table:

Image description

To do this, we have to remove pretty much everything that's not those things.

Remove duplicates

I'm stopping with the descriptions now, it's way too much work for a joke

Click the three dots on the column, and you'll find your answer

Image description

Apply the changes and continue.

Remove unnecessary columns

Look at the columns and remove everything that is not related to the ratings, the opening move, the winner, or the ID. Your final columns should look like this:

Image description
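If you prefer to see the first two steps as code, here is a rough equivalent in plain Python (no extra installs needed). The rows and column names below are made up for illustration and are only a guess at what the sample dataset's schema looks like.

```python
# Hypothetical rows; the column names are assumptions, not the real schema.
rows = [
    {"id": "g1", "rated": True, "winner": "black",
     "white_rating": 1500, "black_rating": 1510, "opening_eco": "A00"},
    {"id": "g1", "rated": True, "winner": "black",
     "white_rating": 1500, "black_rating": 1510, "opening_eco": "A00"},  # exact duplicate
    {"id": "g2", "rated": False, "winner": "white",
     "white_rating": 1600, "black_rating": 1590, "opening_eco": "C20"},
]

# The columns we care about: ID, winner, both ratings, and the opening.
KEEP = ("id", "winner", "white_rating", "black_rating", "opening_eco")


def remove_duplicates(rows):
    """Drop rows that are identical in every column, keeping the first copy."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def keep_columns(rows, columns):
    """Return new rows containing only the named columns."""
    return [{column: row[column] for column in columns} for row in rows]


cleaned = keep_columns(remove_duplicates(rows), KEEP)
for row in cleaned:
    print(row)
```

The order matters a little: deduplicating before dropping columns means two different games never collapse into one just because the surviving columns happen to match.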

Filter out games black did not win

Filter out the rows where black did not win using the filter icon:

Image description

Create difference column

Create a column to calculate the differences between ratings as follows:

Image description

Let's filter this column as we did before, with two filters this time: one with -22 as the lower bound and another with 22 as the upper bound of the ratings difference.

Image description

Not a lot of data, only 176 rows, but it helps us make our point. We can even find the frequency of opening moves right there in the console:

Image description
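The last three steps (black wins only, the difference column, and the range filter) plus the frequency count can be sketched in plain Python like this. The rows and column names are invented, and two details are assumptions on my part: the difference is taken as white minus black, and both bounds are treated as inclusive.

```python
from collections import Counter

# Hypothetical games; column names are assumptions, not the real schema.
games = [
    {"id": "g1", "winner": "black", "white_rating": 1500, "black_rating": 1510, "opening_eco": "A00"},
    {"id": "g2", "winner": "white", "white_rating": 1600, "black_rating": 1590, "opening_eco": "C20"},
    {"id": "g3", "winner": "black", "white_rating": 1400, "black_rating": 1700, "opening_eco": "B01"},
    {"id": "g4", "winner": "black", "white_rating": 1822, "black_rating": 1800, "opening_eco": "A00"},
    {"id": "g5", "winner": "black", "white_rating": 1300, "black_rating": 1290, "opening_eco": "C20"},
]

MAX_GAP = 22  # "close game" threshold from the walkthrough


def close_black_wins(games, max_gap=MAX_GAP):
    """Keep games black won where the ratings differ by at most max_gap."""
    result = []
    for game in games:
        if game["winner"] != "black":
            continue
        # Assumption: difference is white minus black, bounds inclusive.
        diff = game["white_rating"] - game["black_rating"]
        if -max_gap <= diff <= max_gap:
            result.append({**game, "rating_diff": diff})
    return result


def opening_frequency(games):
    """Count how often each opening code appears, most frequent first."""
    return Counter(game["opening_eco"] for game in games).most_common()


close = close_black_wins(games)
print([game["id"] for game in close])
print(opening_frequency(close))
```

On this toy data, g2 drops out because white won and g3 drops out because the ratings are 300 points apart.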

The opening with the most black wins is A00, the ECO code for irregular openings, which includes Benko's Opening. That makes sense, since these openings are unconventional and not very advantageous for white:

Image description

Introduction to Glue DataBrew

So, now we can introduce DataBrew based on what we learned up there. It is, in its simplest form, a way to clean (or brew) data and make it easier to consume. That is abundantly clear from the tutorial, but it's good to put it into words. Your final result is a series of data cleaning steps, called a recipe:

Image description

You can use these recipes to create data jobs that will follow this recipe for similar data.
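To make the idea of "same recipe, different data" concrete, here is a toy model in plain Python: a recipe is just an ordered list of steps, and a job runs those steps over whatever dataset you hand it. This is only an analogy for how DataBrew behaves, not its actual API, and the step names, rows, and column names are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Step = Callable[[List[Row]], List[Row]]


def only_black_wins(rows: List[Row]) -> List[Row]:
    return [row for row in rows if row["winner"] == "black"]


def add_rating_diff(rows: List[Row]) -> List[Row]:
    # Assumption: difference is white minus black.
    return [{**row, "rating_diff": row["white_rating"] - row["black_rating"]} for row in rows]


def only_close_games(rows: List[Row]) -> List[Row]:
    return [row for row in rows if abs(row["rating_diff"]) <= 22]


# The "recipe": an ordered list of cleaning steps.
RECIPE: List[Step] = [only_black_wins, add_rating_diff, only_close_games]


def run_recipe(recipe: List[Step], rows: List[Row]) -> List[Row]:
    """Apply every step in order, like a job following a recipe."""
    for step in recipe:
        rows = step(rows)
    return rows


# Two hypothetical batches with the same shape.
batch_one = [
    {"id": "a1", "winner": "black", "white_rating": 1500, "black_rating": 1510},
    {"id": "a2", "winner": "white", "white_rating": 1500, "black_rating": 1505},
]
batch_two = [
    {"id": "b1", "winner": "black", "white_rating": 1400, "black_rating": 1700},
    {"id": "b2", "winner": "black", "white_rating": 1610, "black_rating": 1600},
]

print([row["id"] for row in run_recipe(RECIPE, batch_one)])
print([row["id"] for row in run_recipe(RECIPE, batch_two)])
```

The steps never mention a particular file, which is what lets one recipe be reused on any dataset with the same columns.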

There are, however, things we could do better, which I would encourage you to look at:

  • Provision for potential missing values and filter them out.
  • Widen the search to a larger ratings range.
  • Adjust the range so that games where black was the more experienced player (a far greater range) are reflected in the dataset.
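Here is one way those three improvements could look in plain Python. It is a sketch under assumptions: the column names and rows are invented, the difference is taken as white minus black, and I read the third point as allowing a different bound on each side of the range.

```python
# Hypothetical games, all won by black; None and "" stand in for missing values.
games = [
    {"id": "g1", "winner": "black", "white_rating": 1500, "black_rating": 1510, "opening_eco": "A00"},
    {"id": "g2", "winner": "black", "white_rating": None, "black_rating": 1600, "opening_eco": "B01"},
    {"id": "g3", "winner": "black", "white_rating": 1400, "black_rating": 1700, "opening_eco": "C20"},
    {"id": "g4", "winner": "black", "white_rating": 1560, "black_rating": 1500, "opening_eco": "A00"},
    {"id": "g5", "winner": "black", "white_rating": 1500, "black_rating": 1490, "opening_eco": ""},
]


def filter_games(games, max_white_lead=22, max_black_lead=22):
    """Keep black wins with complete data and a rating gap inside the bounds.

    max_white_lead: how far white may out-rate black.
    max_black_lead: how far black may out-rate white.
    """
    kept = []
    for game in games:
        white = game.get("white_rating")
        black = game.get("black_rating")
        # Improvement 1: skip rows with missing ratings or a missing opening.
        if white is None or black is None or not game.get("opening_eco"):
            continue
        if game.get("winner") != "black":
            continue
        diff = white - black  # assumption: white minus black
        if -max_black_lead <= diff <= max_white_lead:
            kept.append({**game, "rating_diff": diff})
    return kept


# Original range, but now safe against missing values.
print([g["id"] for g in filter_games(games)])
# Improvement 2: a wider symmetric range.
print([g["id"] for g in filter_games(games, max_white_lead=100, max_black_lead=100)])
# Improvement 3: let black be far stronger while keeping white's lead small.
print([g["id"] for g in filter_games(games, max_white_lead=22, max_black_lead=400)])
```

Splitting the bound into two parameters keeps the original behaviour as the default while letting you experiment with either side on its own.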

But I digress. You can start with this and improve on it. And remember, I didn't become a SpiderShark just for the fun of it; I did it so I would have 10 appendages to type with, just like a human! ChompChomp!
