<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: orlando ramirez</title>
    <description>The latest articles on DEV Community by orlando ramirez (@oramirezperera).</description>
    <link>https://dev.to/oramirezperera</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F160655%2Feef6ca66-f468-47ef-a42b-2ef0f745235a.jpeg</url>
      <title>DEV Community: orlando ramirez</title>
      <link>https://dev.to/oramirezperera</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/oramirezperera"/>
    <language>en</language>
    <item>
      <title>Building an ELT Pipeline with Python and SQL Server: A Netflix Dataset Walkthrough</title>
      <dc:creator>orlando ramirez</dc:creator>
      <pubDate>Mon, 07 Apr 2025 16:03:56 +0000</pubDate>
      <link>https://dev.to/oramirezperera/building-an-elt-pipeline-with-python-and-sql-server-a-netflix-dataset-walkthrough-3hje</link>
      <guid>https://dev.to/oramirezperera/building-an-elt-pipeline-with-python-and-sql-server-a-netflix-dataset-walkthrough-3hje</guid>
      <description>&lt;p&gt;&lt;strong&gt;Hi! In today’s article, we’ll walk through a small ELT project, revisiting each step of the process and diving deep into the details.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, hop on and let’s get started!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstatic.wikia.nocookie.net%2Fevangelion%2Fimages%2Fd%2Fdc%2FNGE01_30.png%2Frevision%2Flatest%2Fscale-to-width-down%2F1000%3Fcb%3D20190901081329" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstatic.wikia.nocookie.net%2Fevangelion%2Fimages%2Fd%2Fdc%2FNGE01_30.png%2Frevision%2Flatest%2Fscale-to-width-down%2F1000%3Fcb%3D20190901081329" alt="Image of Misato Katsuragi picking up Shinji in a blue car" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Extract&lt;/h2&gt;

&lt;p&gt;As always, our first step is to &lt;strong&gt;extract&lt;/strong&gt; or find the data we want to work with. For this project, we’ll be using the &lt;strong&gt;Netflix Movies and TV Shows dataset&lt;/strong&gt;, which you can find on Kaggle.&lt;br&gt;
 Here’s the link: &lt;a href="https://www.kaggle.com/datasets/shivamb/netflix-shows" rel="noopener noreferrer"&gt;Netflix Movies and TV Shows&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Load&lt;/h2&gt;

&lt;p&gt;Now, we move on to the &lt;strong&gt;load&lt;/strong&gt; step, where we import the dataset into a database. For this example, I’m using &lt;strong&gt;Microsoft SQL Server&lt;/strong&gt; running on Windows 11 via &lt;strong&gt;SQL Server Management Studio 20&lt;/strong&gt;. All scripts and code are being executed from &lt;strong&gt;Ubuntu 24.04&lt;/strong&gt; using &lt;strong&gt;WSL&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Although everything runs on the same computer, Windows and WSL are separate environments. To connect them, I set up communication over IP, which meant enabling TCP/IP and opening the necessary ports in SQL Server so it would accept connections from the Ubuntu WSL instance.&lt;/p&gt;
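&lt;p&gt;As a rough sketch, the connection string for that cross-environment setup might be built like this. Everything below (IP, credentials, database name) is a placeholder, and pyodbc with ODBC Driver 17 is assumed as the DBAPI:&lt;/p&gt;

```python
from urllib.parse import quote_plus


def get_connection() -> str:
    """Build a SQLAlchemy URL for SQL Server over pyodbc.

    Every value below is a placeholder; in the project these would
    live in an untracked secrets module.
    """
    user = "sa"
    password = quote_plus("my_p@ssword!")   # escape special characters
    host = "172.20.0.1"                     # Windows host IP as seen from WSL
    database = "netflix"
    driver = quote_plus("ODBC Driver 17 for SQL Server")
    return (f"mssql+pyodbc://{user}:{password}@{host}:1433/"
            f"{database}?driver={driver}")


print(get_connection())
```

&lt;p&gt;A URL in this shape can then be passed straight to &lt;code&gt;sqlalchemy.create_engine&lt;/code&gt;.&lt;/p&gt;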

&lt;p&gt;Then, I created a &lt;strong&gt;Python script&lt;/strong&gt; to load the CSV file (downloaded from Kaggle) into SQL Server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sal&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;secret&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;connection&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./data/netflix_titles.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_connection&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;netflix_raw&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The connection logic is separated into another Python file where I manage the database credentials and connection string.&lt;br&gt;
Once the script is executed, we can check the table created by Python. Here's the schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dbo&lt;/span&gt;&lt;span class="p"&gt;].[&lt;/span&gt;&lt;span class="n"&gt;netflix_raw&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;show_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;director&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;date_added&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;release_year&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;listed_in&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see, the initial schema is decent, but by declaring every text column as &lt;code&gt;VARCHAR(MAX)&lt;/code&gt; we’re allocating more space than necessary. To optimize, we can use a Jupyter Notebook to inspect the actual maximum lengths in each column and adjust accordingly.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;max &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show_id&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
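&lt;p&gt;Rather than running that line once per column, the same check can be looped over every text column. A small sketch, with a couple of sample rows standing in for the Kaggle CSV:&lt;/p&gt;

```python
import pandas as pd

# A few sample rows standing in for the Kaggle CSV.
df = pd.DataFrame({
    "show_id": ["s1", "s2"],
    "title": ["Dick Johnson Is Dead", "Blood & Water"],
})

# Longest value per text column -- a guide for right-sizing VARCHAR(n).
max_lengths = {col: int(df[col].str.len().max())
               for col in df.select_dtypes(include="object").columns}
print(max_lengths)
```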



&lt;p&gt;After checking all columns, we get a more appropriate schema. For instance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;release_year&lt;/code&gt; changes from &lt;code&gt;BIGINT&lt;/code&gt; to &lt;code&gt;INT&lt;/code&gt; since we’re only dealing with year values.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;title&lt;/code&gt; changes from &lt;code&gt;VARCHAR&lt;/code&gt; to &lt;code&gt;NVARCHAR&lt;/code&gt;, which allows us to store foreign-language titles properly.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this point, we can also check for duplicates in the &lt;code&gt;show_id&lt;/code&gt; column, since we plan to use it as our &lt;code&gt;PRIMARY KEY&lt;/code&gt;. We confirm there are no duplicates, so we can safely assign it as such.&lt;/p&gt;
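&lt;p&gt;That duplicate check can be done in the same notebook. A minimal sketch with sample ids:&lt;/p&gt;

```python
import pandas as pd

# Sample ids standing in for the full show_id column.
df = pd.DataFrame({"show_id": ["s1", "s2", "s3"]})

# Any True here would rule out show_id as a PRIMARY KEY candidate.
has_duplicates = bool(df["show_id"].duplicated().any())
print(has_duplicates)
```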

&lt;p&gt;Here's the new schema:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuo61ywn7omq4bkyr99b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuo61ywn7omq4bkyr99b.png" alt="Image showing netflix_raw table new schema" width="308" height="563"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;show_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;nvarchar&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;director&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;date_added&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;release_year&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;listed_in&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, we:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Drop the table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Recreate it with the updated schema.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Update our Python script to change the &lt;code&gt;if_exists&lt;/code&gt; parameter to &lt;code&gt;'append'&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./data/netflix_titles.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_connection&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;netflix_raw&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;Run the script again.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftc1fmwwftefk7a8dj2h8.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftc1fmwwftefk7a8dj2h8.gif" alt="Image of an old windows copying files from C: to D:" width="381" height="162"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we're ready for the next step in our journey.&lt;/p&gt;

&lt;h2&gt;Transform&lt;/h2&gt;

&lt;p&gt;Next comes the &lt;strong&gt;transformation&lt;/strong&gt; step in our ELT pipeline.&lt;/p&gt;

&lt;p&gt;We start by checking for &lt;strong&gt;duplicate records&lt;/strong&gt; using the &lt;code&gt;title&lt;/code&gt; column. In doing so, we identify a few &lt;code&gt;show_id&lt;/code&gt;s that share the same title.&lt;/p&gt;

&lt;p&gt;These are the affected &lt;code&gt;show_id&lt;/code&gt;s:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s2639
s8775
s7102
s4915
s5319
s5752
s6706
s304
s5034
s5096
s160
s7346
s8023
s1271
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
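&lt;p&gt;The same duplicate-title check can be sketched in pandas; the titles below are sample placeholders, not the real offending rows:&lt;/p&gt;

```python
import pandas as pd

# Sample rows; the real check runs over the full netflix_raw table.
df = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "title": ["Love in a Puff", "Love in a Puff", "Esperando la carroza"],
})

# keep=False flags every row whose title appears more than once.
dupes = df[df["title"].duplicated(keep=False)]
print(dupes["show_id"].tolist())
```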



&lt;p&gt;On closer inspection, we notice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Some are TV shows and movies with the same name.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Others are remakes or reboots with different casts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A couple are identical entries, with the only difference being the &lt;code&gt;date_added&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the &lt;code&gt;show_id&lt;/code&gt; pairs that still have issues:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s6706
s304
And
s160
s7346
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For these cases, we keep the latest version, assuming the older entry was removed and later re-added to the platform.&lt;/p&gt;

&lt;p&gt;We exclude the older versions and re-query the dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dbo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;netflix_raw&lt;/span&gt;
&lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;show_id&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'s160'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'s304'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we analyze columns like &lt;code&gt;listed_in&lt;/code&gt;, &lt;code&gt;director&lt;/code&gt;, &lt;code&gt;country&lt;/code&gt;, and &lt;code&gt;cast&lt;/code&gt;. These contain &lt;strong&gt;comma-separated lists&lt;/strong&gt;, which are fine for production storage but not great for analysis.&lt;/p&gt;

&lt;p&gt;To make them more useful for analytics, we split them into &lt;strong&gt;separate tables&lt;/strong&gt;, creating &lt;strong&gt;one row per entry&lt;/strong&gt; (e.g., one director per row, tied to the corresponding &lt;code&gt;show_id&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;This is done using SQL queries like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;show_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;genres&lt;/span&gt;
&lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;netflix_genres&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dbo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;netflix_raw&lt;/span&gt;
&lt;span class="k"&gt;cross&lt;/span&gt; &lt;span class="n"&gt;apply&lt;/span&gt; &lt;span class="n"&gt;string_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;listed_in&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;','&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;show_id&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'s160'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'s304'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once done, we’ll have clean relational tables for &lt;code&gt;directors&lt;/code&gt;, &lt;code&gt;genres&lt;/code&gt;, &lt;code&gt;countries&lt;/code&gt;, and &lt;code&gt;casts&lt;/code&gt;.&lt;/p&gt;
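&lt;p&gt;The same split-into-rows transformation can be sketched in pandas with &lt;code&gt;str.split&lt;/code&gt; plus &lt;code&gt;explode&lt;/code&gt;, mirroring what &lt;code&gt;STRING_SPLIT&lt;/code&gt; and &lt;code&gt;TRIM&lt;/code&gt; do on the SQL side (sample rows only):&lt;/p&gt;

```python
import pandas as pd

# Sample rows standing in for netflix_raw.
df = pd.DataFrame({
    "show_id": ["s1", "s2"],
    "listed_in": ["Dramas, International Movies", "Documentaries"],
})

# One (show_id, genre) row per comma-separated entry.
genres = (df.assign(genre=df["listed_in"].str.split(","))
            .explode("genre")
            .assign(genre=lambda d: d["genre"].str.strip())
            [["show_id", "genre"]]
            .reset_index(drop=True))
print(genres["genre"].tolist())
```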

&lt;p&gt;Here's how our schema looks with these changes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52yr84v10tuwqzjbk82i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52yr84v10tuwqzjbk82i.png" alt="Image of the new Database Schema with the new tables we created" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Handling Missing Values&lt;/h2&gt;

&lt;p&gt;We then check for &lt;strong&gt;null values&lt;/strong&gt;. For example, the &lt;code&gt;country&lt;/code&gt; column has several missing entries. There are multiple ways to handle this, depending on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The time you have.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The source and reliability of the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Whether you understand &lt;strong&gt;why&lt;/strong&gt; the data is missing (this is crucial in production).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this example, let’s try an approximation: if a &lt;strong&gt;director&lt;/strong&gt; has another movie with a known country, we’ll use that to fill in the missing country values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;netflix_countries&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;show_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;netflix_raw&lt;/span&gt; &lt;span class="n"&gt;nr&lt;/span&gt;
&lt;span class="k"&gt;inner&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;director&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dbo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;netflix_countries&lt;/span&gt; &lt;span class="n"&gt;nc&lt;/span&gt;
        &lt;span class="k"&gt;inner&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;netflix_directors&lt;/span&gt; &lt;span class="n"&gt;nd&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;nc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show_id&lt;/span&gt;
&lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;director&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;  &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;nr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;director&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;director&lt;/span&gt;
&lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;nr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
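&lt;p&gt;The same director-based fill can be sketched in pandas. The director and country values below are sample placeholders:&lt;/p&gt;

```python
import pandas as pd

# Sample rows: s2 is missing its country but shares a director with s1.
df = pd.DataFrame({
    "show_id": ["s1", "s2"],
    "director": ["Ava DuVernay", "Ava DuVernay"],
    "country": ["United States", None],
})

# Map each director to a country taken from their other titles.
known = (df.dropna(subset=["country"])
           .drop_duplicates("director")
           .set_index("director")["country"])
df["country"] = df["country"].fillna(df["director"].map(known))
print(df["country"].tolist())
```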



&lt;p&gt;We also found 3 nulls in the &lt;code&gt;duration&lt;/code&gt; column. Interestingly, for &lt;code&gt;show_id&lt;/code&gt;s s5542, s5795, and s5814, the &lt;strong&gt;duration value is incorrectly placed in the &lt;code&gt;rating&lt;/code&gt; column&lt;/strong&gt;, likely a data formatting issue.&lt;/p&gt;

&lt;p&gt;We also notice that the &lt;code&gt;date_added&lt;/code&gt; column is stored as a &lt;strong&gt;VARCHAR&lt;/strong&gt;, but we need it as a proper &lt;strong&gt;DATE&lt;/strong&gt; type.&lt;/p&gt;

&lt;p&gt;To fix all of this, we:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Create a new &lt;strong&gt;intermediate table&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Eliminate columns we already separated into other tables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Transform &lt;code&gt;date_added&lt;/code&gt; to a date type.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fix the duration issue.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Remove duplicated entries.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
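&lt;p&gt;Two of those fixes, the date cast and the shifted duration, can be sketched in pandas as well. The dates below are sample values, not the real rows:&lt;/p&gt;

```python
import pandas as pd

# Sample rows: the first mimics a record whose duration landed in rating.
df = pd.DataFrame({
    "show_id": ["s5542", "s10"],
    "date_added": ["April 15, 2019", "July 1, 2021"],
    "rating": ["74 min", "TV-MA"],
    "duration": [None, "90 min"],
})

# date_added: VARCHAR -> real date.
df["date_added"] = pd.to_datetime(df["date_added"], format="%B %d, %Y")

# When duration is null, the value was shifted into rating.
df["duration"] = df["duration"].fillna(df["rating"])
print(df["duration"].tolist())
```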

&lt;p&gt;All of this is done with a SQL query like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;show_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_added&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;date_added&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;release_year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt; 
            &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="n"&gt;rating&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;
            &lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;description&lt;/span&gt;
            &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;netflix_intermediate&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;netflix_raw&lt;/span&gt; 
&lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;show_id&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'s160'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'s304'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this, we now have a &lt;strong&gt;clean, optimized intermediate table&lt;/strong&gt;, ready for insights and exploratory analysis.&lt;/p&gt;
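&lt;p&gt;As a quick sanity check, the &lt;code&gt;case when&lt;/code&gt; fix can be reproduced on a couple of toy rows. This sketch uses SQLite (the project itself uses SQL Server, where &lt;code&gt;select ... into&lt;/code&gt; replaces &lt;code&gt;create table as&lt;/code&gt;); the sample values are made up:&lt;/p&gt;

```python
import sqlite3

# Toy rows mimicking the issue: for show_id s5542 the duration value
# ("74 min") was loaded into the rating column, leaving duration null.
con = sqlite3.connect(":memory:")
con.execute("create table netflix_raw (show_id text, rating text, duration text)")
con.executemany("insert into netflix_raw values (?, ?, ?)", [
    ("s1",    "PG-13",  "90 min"),
    ("s5542", "74 min", None),
])

# Same case-when as the article's query (SQLite needs create table as
# instead of SQL Server's select ... into).
con.execute("""
    create table netflix_intermediate as
    select show_id,
           case when duration is null then rating else duration end as duration
    from netflix_raw
""")

rows = dict(con.execute("select show_id, duration from netflix_intermediate"))
print(rows)  # {'s1': '90 min', 's5542': '74 min'}
```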

&lt;p&gt;This is the final schema for now:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fos0c4lrfp9hajkqzbiyh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fos0c4lrfp9hajkqzbiyh.png" alt="Image of the final Schema with the changes on the columns of netflix_raw and the creation of all the other tables" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can go even further, like filling remaining nulls, building dashboards, and running deeper analytics. I’ll cover those in &lt;strong&gt;future articles&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Want to try this yourself? Download the dataset and follow along on your own!&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;This was a comprehensive walk-through of an ELT process, from raw data to a clean database schema ready for analytics.&lt;/p&gt;

&lt;p&gt;We extracted the data, loaded it into a database, transformed it for analytics, and prepped it for insight generation.&lt;/p&gt;

&lt;p&gt;In upcoming posts, I’ll focus on &lt;strong&gt;data visualization&lt;/strong&gt;, &lt;strong&gt;insight generation&lt;/strong&gt;, and maybe even using &lt;strong&gt;BI tools&lt;/strong&gt; to create dashboards from this dataset.&lt;/p&gt;

&lt;p&gt;Thanks for reading — stay tuned!&lt;/p&gt;

&lt;p&gt;Have questions about this process or want to share your own approach? Drop a comment or reach out. I love talking about data!&lt;/p&gt;

&lt;p&gt;In the next part, I’ll show you how to create stunning visualizations and dashboards using this cleaned dataset.&lt;/p&gt;

&lt;p&gt;Subscribe or follow me so you don’t miss it!&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/datasets/shivamb/netflix-shows" rel="noopener noreferrer"&gt;Netflix Movies and TV Shows Dataset – Kaggle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.microsoft.com/en-us/sql-server/sql-server-downloads" rel="noopener noreferrer"&gt;Microsoft SQL Server Download&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/docs/" rel="noopener noreferrer"&gt;Pandas Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.sqlalchemy.org/en/20/" rel="noopener noreferrer"&gt;SQLAlchemy Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/oramirezperera/netflix_data_transformation" rel="noopener noreferrer"&gt;This project GitHub Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=ZnQwO6V7pec" rel="noopener noreferrer"&gt;Netflix Data Cleaning and Analysis Project (Ankit Bansal)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>elt</category>
    </item>
    <item>
      <title>OLAP vs OLTP: The war that is not meant to be</title>
      <dc:creator>orlando ramirez</dc:creator>
      <pubDate>Tue, 25 Mar 2025 20:06:12 +0000</pubDate>
      <link>https://dev.to/oramirezperera/olap-vs-oltp-the-war-that-is-not-meant-to-be-516b</link>
      <guid>https://dev.to/oramirezperera/olap-vs-oltp-the-war-that-is-not-meant-to-be-516b</guid>
      <description>&lt;p&gt;In previous articles, we covered what &lt;a href="https://dev.to/oramirezperera/oltp-explained-speed-integrity-and-high-availability-in-databases-3n67"&gt;OLTP&lt;/a&gt; and &lt;a href="https://dev.to/oramirezperera/olap-unlocking-the-power-of-analytical-data-processing-5edd"&gt;OLAP&lt;/a&gt; are, their benefits, key features, and use cases.&lt;br&gt;
This might make it seem like there's a 'fight' or 'battle' between these technologies, but that’s not the case.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia2.giphy.com%2Fmedia%2Fv1.Y2lkPTc5MGI3NjExbzk2bjlzYWpia2hjd3UxbjN1aDVlYWc4bGN0ZTF0MHVmN2J6YXNiciZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw%2Fu4th4weIsJfuE%2Fgiphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia2.giphy.com%2Fmedia%2Fv1.Y2lkPTc5MGI3NjExbzk2bjlzYWpia2hjd3UxbjN1aDVlYWc4bGN0ZTF0MHVmN2J6YXNiciZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw%2Fu4th4weIsJfuE%2Fgiphy.gif" alt="street fighter gif" width="500" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OLAP and OLTP are different technologies, built for different use cases.&lt;br&gt;
For example, if you need to process fast transactions or anticipate high-frequency insert, update, or delete operations, OLTP is the right choice. On the other hand, if you need to analyze historical data, group information by customers, or extract insights, OLAP is the better option.&lt;/p&gt;
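&lt;p&gt;The split is easy to picture with two queries against the same toy database: the OLTP side records individual sales one row at a time, while the OLAP side scans the history and aggregates it. A minimal sketch with SQLite; the table and values are made up:&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table sales (sale_id integer primary key, customer text, amount integer)")

# OLTP-style workload: many small, fast writes, one sale at a time.
for customer, amount in [("ana", 20), ("ana", 5), ("luis", 12)]:
    con.execute("insert into sales (customer, amount) values (?, ?)", (customer, amount))

# OLAP-style workload: scan the whole history and aggregate per customer.
totals = dict(con.execute(
    "select customer, sum(amount) from sales group by customer"))
print(totals)  # total spent per customer
```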

&lt;p&gt;So, when do you need which one? That’s a great question. If you're a small business, you might think you can save money by skipping an OLAP system and querying the transactional database directly. That might work at first without issues, but as you grow you will run into problems: slow load times because your app is querying large amounts of data in real time, or worse, database crashes caused by complex queries that overload the system.&lt;/p&gt;

&lt;p&gt;That’s why it’s important to have a good data architecture and a plan to follow as you and your company, app, or service grow. OLAP and OLTP aren’t rivals; they were designed to fulfill different needs. Using them correctly will save you time, money, and headaches compared to relying on inefficient workarounds.&lt;/p&gt;

&lt;p&gt;A solid data architecture and a clear understanding of the data engineering lifecycle are essential at any stage of your business, whether you’re just starting or scaling up. Having the right tool for the right job will make life easier for you, your customers, and everyone involved. Scaling your data infrastructure alongside your code will improve performance, stability, and long-term success.&lt;/p&gt;

&lt;p&gt;How does your company handle transactional and analytical data? Let’s discuss your experiences in the comments!&lt;/p&gt;

&lt;p&gt;Do you think businesses should invest in OLAP early on, or wait until they scale? Share your thoughts!&lt;/p&gt;

&lt;p&gt;Have you ever faced performance issues because of querying transactional data for analytics? Let’s talk about best practices!&lt;/p&gt;

&lt;p&gt;Want to learn more about data architecture and best practices? Follow me for future insights!&lt;/p&gt;

</description>
      <category>olap</category>
      <category>oltp</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>OLTP Explained: Speed, Integrity, and High Availability in Databases</title>
      <dc:creator>orlando ramirez</dc:creator>
      <pubDate>Thu, 13 Feb 2025 19:23:08 +0000</pubDate>
      <link>https://dev.to/oramirezperera/oltp-explained-speed-integrity-and-high-availability-in-databases-3n67</link>
      <guid>https://dev.to/oramirezperera/oltp-explained-speed-integrity-and-high-availability-in-databases-3n67</guid>
      <description>&lt;h2&gt;
  
  
  OLTP
&lt;/h2&gt;

&lt;p&gt;First things first, what does OLTP mean? OLTP stands for Online Transaction Processing, a type of database system that is very common in the tech industry. It’s mainly used because it allows real-time, accurate data processing for a large number of users.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7cna1lib2box8xkeghj.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7cna1lib2box8xkeghj.gif" alt="dog question" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The main difference between this and &lt;a href="https://dev.to/oramirezperera/olap-unlocking-the-power-of-analytical-data-processing-5edd"&gt;OLAP&lt;/a&gt; is that the latter is more focused on the analytical side of the data.&lt;/p&gt;

&lt;p&gt;OLTP is built to handle many requests and many users doing insertions, updates, and deletions in the database. Maintaining data integrity is critical, and one way to maintain it is through concurrency control: no two transactions can modify the same data at the same time. That way, two people querying the same information are guaranteed to get the same result.&lt;/p&gt;

&lt;p&gt;Additionally, OLTP uses indexed datasets to speed up searches and queries, and it relies on backups so the data is available all the time.&lt;/p&gt;

&lt;p&gt;It focuses on transaction speed, with everything designed to ensure the fastest response time between the user and the database.&lt;/p&gt;

&lt;p&gt;Some people say that OLAP is the evolution of OLTP, and in some ways it is, but it’s important to understand that they are different and have different goals: one is focused on transactions and speed, the other on analytics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Transactions
&lt;/h2&gt;

&lt;p&gt;Don’t let the word transaction trick you. OLTP is heavily used in banks, e-commerce, ATMs, and airlines, so you may think the term refers to financial exchanges, but it actually refers to computational transactions: an atomic change of state in the database.&lt;/p&gt;

&lt;p&gt;Another important aspect is that transactions either succeed or fail as a whole; they can’t remain in an intermediate or pending state. For the database, the transaction either worked or it didn’t. Within your business rules an economic transaction can have a pending state, but in the data flow it can’t.&lt;/p&gt;
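&lt;p&gt;Atomicity is easy to see in code: either every statement in the transaction commits, or the whole thing is rolled back. A minimal sketch with SQLite (account names and balances are made up):&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""create table accounts (
    name text primary key,
    balance integer not null check (balance >= 0))""")
con.executemany("insert into accounts values (?, ?)",
                [("alice", 100), ("bob", 50)])
con.commit()

# Transfer more than alice has: the check constraint rejects the first
# update, and the rollback undoes the whole transfer.
try:
    with con:  # opens a transaction; commits on success, rolls back on error
        con.execute("update accounts set balance = balance - 200 where name = 'alice'")
        con.execute("update accounts set balance = balance + 200 where name = 'bob'")
except sqlite3.IntegrityError:
    pass  # the transaction failed as a whole

balances = dict(con.execute("select name, balance from accounts"))
print(balances)  # {'alice': 100, 'bob': 50} -- no intermediate state survives
```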

&lt;h2&gt;
  
  
  In other words, OLTP:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Processes large quantities of transactions.&lt;/li&gt;
&lt;li&gt;Multiple users can access the same data with data integrity.&lt;/li&gt;
&lt;li&gt;Rapid processes, usually measured in milliseconds.&lt;/li&gt;
&lt;li&gt;Uses indexed data sets.&lt;/li&gt;
&lt;li&gt;Has to be available all the time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In another article I will discuss OLTP vs OLAP as technologies for data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have you worked with OLTP systems before? How do they impact your daily work? Share your thoughts in the comments!&lt;/li&gt;
&lt;li&gt;Want to dive deeper into OLAP vs. OLTP? Stay tuned for my next article where I compare both in detail!&lt;/li&gt;
&lt;li&gt;If you're interested in learning more about databases and data processing, follow me for more insights!&lt;/li&gt;
&lt;li&gt;Need help optimizing your OLTP database? Let’s connect and exchange ideas!&lt;/li&gt;
&lt;li&gt;Do you think OLTP is evolving with modern data needs? Let’s discuss it below!&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dataengineering</category>
      <category>sql</category>
      <category>oltp</category>
    </item>
    <item>
      <title>OLAP: Unlocking the Power of Analytical Data Processing</title>
      <dc:creator>orlando ramirez</dc:creator>
      <pubDate>Tue, 04 Feb 2025 13:48:35 +0000</pubDate>
      <link>https://dev.to/oramirezperera/olap-unlocking-the-power-of-analytical-data-processing-5edd</link>
      <guid>https://dev.to/oramirezperera/olap-unlocking-the-power-of-analytical-data-processing-5edd</guid>
      <description>&lt;p&gt;Hearing the term OLAP out of the blue can be confusing, but once you understand its meaning and purpose, it becomes much easier to grasp. It is easier to get used to it brings a lot of options and brings new opportunities to get the most out of your data. That’s why I’m writing today about OLAP.&lt;/p&gt;

&lt;p&gt;The acronym OLAP means Online Analytical Processing, and the keyword here is Analytical; that’s the main difference with OLTP, which we will cover in another article. As the name says, OLAP is focused on the analytical side of your data, and its main idea is to optimize querying large quantities of data, reducing processing time and computational resource consumption. The result is saved time and money, and more data availability.&lt;/p&gt;

&lt;h2&gt;
  
  
  OLAP Cube
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx3n77gx529ry8iwm93n7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx3n77gx529ry8iwm93n7.png" alt="Rubik's cube" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the main things about OLAP technologies is the OLAP cube, a technique for analyzing data to look for insights. The cube is a multi-dimensional dataset where some of the data is summarized in a way that brings useful information for businesses.&lt;br&gt;
A central element in OLAP is having a fact table, arranged in a star or snowflake schema, that connects to other tables called dimensions.&lt;/p&gt;
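&lt;p&gt;A minimal star schema looks like one fact table pointing at its dimension tables. A sketch in SQLite with made-up table names:&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- dimension table: one row per calendar date
    create table dim_date (
        date_id   integer primary key,
        full_date text,
        day_name  text
    );

    -- fact table: one row per sale, pointing at the dimension
    create table fact_sales (
        sale_id integer primary key,
        date_id integer references dim_date(date_id),
        amount  integer
    );
""")

tables = {name for (name,) in con.execute(
    "select name from sqlite_master where type = 'table'")}
print(tables)  # contains 'dim_date' and 'fact_sales'
```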

&lt;p&gt;OLAP cubes are optimized for complex analytical queries, allowing users to retrieve insights quickly compared to traditional relational databases. Since data is pre-aggregated, queries that would take minutes or hours in a standard database can be executed in seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Analytical Operations
&lt;/h2&gt;

&lt;p&gt;Another important thing in OLAP is the analytical operations you can perform. These are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consolidation&lt;/li&gt;
&lt;li&gt;Drill-Down&lt;/li&gt;
&lt;li&gt;Slicing and Dicing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Consolidation
&lt;/h2&gt;

&lt;p&gt;Also called roll-up, this operation aggregates data, making it easier to analyze trends and consume later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Drill-Down
&lt;/h2&gt;

&lt;p&gt;Lets the user navigate from the summarized data down to the underlying details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Slicing and Dicing
&lt;/h2&gt;

&lt;p&gt;Lets you select specific parts of the data in the OLAP cube (slicing) and look at the data from different viewpoints (dicing).&lt;/p&gt;

&lt;p&gt;A simple example of this technology, which is even in the Wikipedia article, is grouping all the sales of a store in a table that becomes your fact table, with an id column referencing another table containing the date and time of each sale, which would be our dimension table.&lt;br&gt;
By grouping the data this way, we can see all the sales in general and analyze them, but we can also drill down and group by periods, by days of the week, and so on, which lets us look for patterns, group the data in a way that is useful for us, or simply present it better in a data visualization tool.&lt;/p&gt;
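&lt;p&gt;That sales example can be sketched end to end: load a few sales into a fact table, join to the date dimension, and roll up by day of the week (SQLite again, with made-up data):&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table dim_date (date_id integer primary key, day_name text)")
con.execute("create table fact_sales (date_id integer, amount integer)")
con.executemany("insert into dim_date values (?, ?)", [(1, "Monday"), (2, "Tuesday")])
con.executemany("insert into fact_sales values (?, ?)", [(1, 10), (1, 15), (2, 7)])

# Roll-up (consolidation): aggregate individual sales per day of the week.
rollup = dict(con.execute("""
    select d.day_name, sum(f.amount)
    from fact_sales f
    join dim_date d on d.date_id = f.date_id
    group by d.day_name
"""))
print(rollup)  # {'Monday': 25, 'Tuesday': 7}
```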

&lt;p&gt;The way the data is stored in OLAP makes this kind of query, and retrieving that sales information, a lot faster and less processing-intensive than in an ordinary relational database. Another important thing to keep in mind while working with OLAP is to define up front all the relations you need between your fact table and your dimension tables.&lt;/p&gt;

&lt;p&gt;Another good thing about OLAP technologies is their better integration with BI tools.&lt;/p&gt;

&lt;p&gt;OLAP cubes are widely supported by Business Intelligence (BI) tools like Tableau, Power BI, and Pentaho. Their structured format enhances data visualization, making it easier to generate meaningful reports and dashboards. &lt;/p&gt;

&lt;p&gt;That being said, I really like OLAP technology and think it can be very useful for businesses; it’s a great evolution of OLTP, bringing benefits for your data visualizations and the final consumers of the data. OLAP is ideal for tracking historical data trends over time: businesses can compare sales, performance, or customer behavior across different periods, facilitating better forecasting and strategic planning.&lt;/p&gt;

&lt;p&gt;Have you worked with OLAP technologies before? Do you think investing in data preparation for OLAP is worthwhile?&lt;/p&gt;

&lt;p&gt;Bibliography:&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=iw-5kFzIdgY" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=iw-5kFzIdgY&lt;/a&gt;&lt;br&gt;
&lt;a href="https://en.wikipedia.org/wiki/Online_analytical_processing" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Online_analytical_processing&lt;/a&gt;&lt;br&gt;
&lt;a href="https://es.wikipedia.org/wiki/Cubo_OLAP" rel="noopener noreferrer"&gt;https://es.wikipedia.org/wiki/Cubo_OLAP&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>sql</category>
      <category>olap</category>
    </item>
    <item>
      <title>Know your tools</title>
      <dc:creator>orlando ramirez</dc:creator>
      <pubDate>Fri, 24 Jan 2025 15:59:47 +0000</pubDate>
      <link>https://dev.to/oramirezperera/know-your-tools-44oj</link>
      <guid>https://dev.to/oramirezperera/know-your-tools-44oj</guid>
      <description>&lt;p&gt;Yesterday I was doing some challenges in codewars and I got a challenge where I had to “Create a function with two arguments that will return an array of the first n multiples of x.” At first I did what I usually do in these cases, break down the problem into simple steps to solve it easier.&lt;br&gt;
The thing is even broken down I had to do a lot of steps including using a loop that I knew it will be using a lot of resources. If it is the only way and you can't avoid it then do it that way. But then I thought each database administrator or database system has its functions and different ways to get the job done.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fji5kt56h6llhx8aq31cc.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fji5kt56h6llhx8aq31cc.gif" alt="Thinking lady meme" width="504" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s when I got the idea. Codewars uses PostgreSQL as its database system for these challenges, so I started searching for functions that could help me here. After some research, I found the &lt;code&gt;generate_series&lt;/code&gt; function, which lets you create a series from a number or timestamp; it accepts a start number, a stop number, and a step. That’s when a really difficult problem became a really easy one.&lt;/p&gt;
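&lt;p&gt;In PostgreSQL the whole kata collapses into a single &lt;code&gt;generate_series(start, stop, step)&lt;/code&gt; call. Here is a sketch of the idea in Python, using a recursive CTE in SQLite as a stand-in, since the built-in &lt;code&gt;sqlite3&lt;/code&gt; module has no &lt;code&gt;generate_series&lt;/code&gt;:&lt;/p&gt;

```python
import sqlite3

def first_multiples(x, n):
    """Return the first n multiples of x (assumes x > 0).

    In PostgreSQL this is just: select generate_series(x, x * n, x);
    here a recursive CTE in SQLite stands in for it.
    """
    con = sqlite3.connect(":memory:")
    rows = con.execute("""
        with recursive multiples(v) as (
            select :x
            union all
            select v + :x from multiples where v < :x * :n
        )
        select v from multiples
    """, {"x": x, "n": n})
    return [v for (v,) in rows]

print(first_multiples(3, 4))  # [3, 6, 9, 12]
```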

&lt;p&gt;That’s why it’s important to know your tools and think about which is the best tool for the problem you have.&lt;br&gt;
If you use the wrong tool, you can make the problem even bigger or more time- and money-consuming; and if you don’t research the tool you are already using, you may miss an easier way to solve the thing you are facing.&lt;br&gt;
Practicing, learning, and continuously improving will make your life and your teammates’ lives easier, so my advice is to constantly read and keep up with the new features and advances in your tools.&lt;/p&gt;

&lt;p&gt;So, tell me, do you do katas on Codewars to practice?&lt;/p&gt;

&lt;p&gt;Did you face a problem like I did?&lt;/p&gt;

&lt;p&gt;Tell me if you have any other stories like mine with another language.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>quickthoughts</category>
      <category>codewars</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Tips for naming variables in Python</title>
      <dc:creator>orlando ramirez</dc:creator>
      <pubDate>Wed, 14 Jul 2021 23:26:35 +0000</pubDate>
      <link>https://dev.to/oramirezperera/tips-for-naming-variables-in-python-1je8</link>
      <guid>https://dev.to/oramirezperera/tips-for-naming-variables-in-python-1je8</guid>
      <description>&lt;h2&gt;
  
  
  Python
&lt;/h2&gt;

&lt;p&gt;Python is a &lt;a href="https://en.wikipedia.org/wiki/Dynamic_programming_language"&gt;dynamic programming language&lt;/a&gt;, and if you have been programming for some time, or are just starting, you’ve probably heard that naming variables is very important. It’s not only about giving long names to your variables: names have to be meaningful, telling you what the program is doing or where you are going to use them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python conventions PEP8
&lt;/h2&gt;

&lt;p&gt;Python uses language conventions for many things, and one of them is naming variables. These conventions are grouped in the &lt;a href="https://www.python.org/dev/peps/pep-0008/"&gt;PEP 8&lt;/a&gt; Style Guide for Python Code, which says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“This document gives coding conventions for the Python code comprising the standard library in the main Python distribution.” &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Reserved words or keywords
&lt;/h2&gt;

&lt;p&gt;There are words that you can’t use; they are called reserved keywords. These keywords can be listed inside Python by importing the &lt;code&gt;keyword&lt;/code&gt; module and printing the list &lt;code&gt;keyword.kwlist&lt;/code&gt;.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tp-V4dlV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/thaw4jv8sic6nsw1b5s3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tp-V4dlV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/thaw4jv8sic6nsw1b5s3.png" alt="example code of keyword.kwlist function"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L0GlF9UN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nkpvoc3vtwby44k0kh3z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L0GlF9UN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nkpvoc3vtwby44k0kh3z.png" alt="The return of the function kwlist a list of the python keywords"&gt;&lt;/a&gt;&lt;br&gt;
These are some of the keywords in Python:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_4E8qdHb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u2qswboad3n13wbp1y3t.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_4E8qdHb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u2qswboad3n13wbp1y3t.jpg" alt="some of the Python keywords"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;None of these words can be used as variable names. You can check whether a word is a keyword by using the function &lt;code&gt;keyword.iskeyword()&lt;/code&gt; and passing the word inside the parentheses, for example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5pdaJ3v5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o7utlbuyfj5utnrrfqfh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5pdaJ3v5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o7utlbuyfj5utnrrfqfh.png" alt="example code of the iskeyword function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As in &lt;code&gt;keyword.iskeyword('else')&lt;/code&gt;, this returns a Boolean, in this case &lt;code&gt;True&lt;/code&gt;.&lt;/p&gt;
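&lt;p&gt;Both checks fit in a few lines using only the standard library:&lt;/p&gt;

```python
import keyword

# keyword.kwlist is a plain list attribute with every reserved word.
print(len(keyword.kwlist), "reserved keywords in this Python version")

# keyword.iskeyword() answers for a single word.
print(keyword.iskeyword("else"))      # True
print(keyword.iskeyword("greeting"))  # False
```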

&lt;h2&gt;
  
  
  Tips for naming variables in Python
&lt;/h2&gt;

&lt;p&gt;Some tips to naming variables in Python are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can use numbers in names, but make names as descriptive as you can.&lt;/li&gt;
&lt;li&gt;Don’t use special characters like @ or $.&lt;/li&gt;
&lt;li&gt;Variable names can’t start with numbers.&lt;/li&gt;
&lt;li&gt;To join words in variable names you can use camel case, but the PEP 8 convention is to use underscores (snake case), like &lt;code&gt;variable_name&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In Python you don’t need to declare explicitly that something is a variable, as in other languages, but when you write the name of a variable you can use some conventions to communicate what kind of variable it is.&lt;/p&gt;

&lt;p&gt;The variable types are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Public variables: these are the “normal” variables. You declare them by writing the variable name, an equals sign, and what you want to store. e.g. greeting = ‘hello’&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Private variables: variables that aren’t part of the public interface of the program, commonly seen in APIs. In Python you can still access a private variable at any time, but the leading underscore tells you that changing its content may break the program or cause malfunctions. e.g. _age = 20.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Constants: by convention written in all upper case; values that are not going to change over time. e.g. PI = 3.14159.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Super private” variables: totally private, or “please don’t touch me”, variables. These start with a double underscore (inside a class, Python even mangles the name to make accidental access harder). If you change anything inside these variables, you will probably break your whole program. e.g. __really_important_number = 15.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
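&lt;p&gt;The four conventions side by side (the names are just examples; note that double-underscore name mangling only kicks in inside a class):&lt;/p&gt;

```python
PI = 3.14159          # constant: all caps, not meant to change

greeting = "hello"    # public variable: a plain, descriptive name
_age = 20             # private by convention: outside code should not touch it

class Config:
    __token = "secret"   # double underscore: Python mangles the name

# Outside the class, __token is only reachable under its mangled name.
print(hasattr(Config, "__token"))  # False
print(Config._Config__token)       # secret
```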

&lt;p&gt;This was a quick summary of naming variables and variable types. Python is a great programming language and can be used in many fields. Tell me, have you ever seen any of these variable types? Do you write Python code? Which is your favorite programming language?&lt;/p&gt;

&lt;p&gt;A big thanks to ThisIsEngineering from Pexels for the cover photo.&lt;/p&gt;

</description>
      <category>python</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
