DEV Community

Hiswill Thompson

Introduction to Data Engineering: Setting Up Python for ETL

Hey there!

If you missed my earlier articles, “Introduction to Data Engineering” and “Understanding ETL Pipelines”, I recommend checking them out on my profile. As an aspiring data engineer, you will find them handy on your data engineering journey.

In this article, we will talk about how to set up Python for ETL. Before that, let’s have an overview of what data engineering and the ETL process are all about.

Data engineering is the practice of building, maintaining and optimizing the systems that move and store data. It involves gathering information from various sources (APIs, flat files, websites, etc.), processing and streamlining it into useful, meaningful information, and then making it available to users such as data scientists or data analysts.

The ETL process, on the other hand, is an integral part of data engineering. ETL stands for Extract, Transform and Load. The Extract stage is basically “fetching” data from various sources or platforms. The Transform stage is the most demanding part of ETL: the gathered data is streamlined into useful information.
That information is then stored in data repositories where users can access it; this is the Load stage.
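To make the three stages concrete, here is a minimal sketch using only Python's standard library. The sample data, the `sales` table and the function names are made up for illustration; a real pipeline would read from actual files or APIs and load into a real database.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a flat-file source.
RAW_CSV = """name,city,amount
 Ada ,lagos,120.50
Tunde,ABUJA,
Chidi,Lagos,80
"""


def extract(text):
    """Extract: read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, normalise city names, drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append((row["name"].strip(), row["city"].strip().title(), float(amount)))
    return cleaned


def load(records, conn):
    """Load: store the cleaned records in a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, city TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute("SELECT name, city, amount FROM sales ORDER BY name").fetchall()
print(rows)
```

Each stage is a separate function, so you can swap the source or the destination without touching the cleaning logic.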

Now that we have an idea of what data engineering and ETL are, let’s look at what Python is and the significance it has in the ETL process before we actually delve into setting it up. That’s fair enough, isn’t it?

Python is one of the most popular programming languages. It is an open-source, high-level, object-oriented programming language.
It is simple, readable and easy to learn. It is also user-friendly, which is one reason many IT professionals choose it over other programming languages.
Python has versatile and powerful features that help data engineers in the ETL process. It has various libraries that are well suited to data engineering needs.
Examples include pandas, NumPy, Apache Airflow, scikit-learn, Beautiful Soup, etc.

These libraries have a lot to do with the ETL process. Their uses range from collecting data from various sources to streamlining data, merging datasets, classifying data, etc.
For instance, pandas helps in extracting, processing and even loading datasets, PySpark helps in working with large datasets, and SQLAlchemy, with its flexibility, helps in database interaction.
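As a rough illustration of pandas covering all three stages, here is a small sketch. It assumes pandas is already installed, and the sample data and the `city_totals` table name are hypothetical. It loads into an in-memory SQLite database through Python's built-in `sqlite3` module, so SQLAlchemy is not needed for this particular example.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical sample data standing in for a CSV file.
RAW_CSV = """name,city,amount
Ada,lagos,120.50
Tunde,ABUJA,
Chidi,Lagos,80
"""

# Extract: read the CSV text into a DataFrame.
df = pd.read_csv(io.StringIO(RAW_CSV))

# Transform: drop rows with no amount, tidy city names, total the amount per city.
df = df.dropna(subset=["amount"])
df["city"] = df["city"].str.title()
totals = df.groupby("city", as_index=False)["amount"].sum()

# Load: write the result to a database table, then read it back to check.
conn = sqlite3.connect(":memory:")
totals.to_sql("city_totals", conn, index=False, if_exists="replace")
loaded = pd.read_sql("SELECT city, amount FROM city_totals", conn)
print(loaded)
```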

With this, let’s delve into how to set up Python on your operating system in order to use it for your ETL operations.
Below is the Python installation guide:

1. Open your favorite browser and search for “Python download”.
2. The official Python website, python.org, will appear in the results.
3. Choose the version you want to download, preferably the latest version.
4. Select the installer for your operating system (the OS options will be displayed).
5. Click Download.
6. Run the installer; you can choose to customize the installation.
7. Tick the two boxes displayed below the install options.
8. These are “Use admin privileges when installing py.exe” and “Add python.exe to PATH”.
9. If you customize the installation, the optional features will be displayed.
10. Click Next.
11. The advanced options will be shown, including the installation location.
12. Click Install and wait for the installation to complete.
13. Click Close; Python is now set up for you.
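
Once the installation finishes, a quick way to confirm that everything works is to run a short script like the sketch below. It prints the Python version and reports which ETL libraries are installed. The list of import names is only an illustrative selection, and the helper name `check_libraries` is made up for this example.

```python
import importlib.util
import sys

# Import names for some libraries mentioned in this article (illustrative selection).
LIBRARIES = ["pandas", "numpy", "sklearn", "bs4", "sqlalchemy", "pyspark"]


def check_libraries(names):
    """Return a dict mapping each import name to True if it is installed."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


print("Python version:", sys.version.split()[0])
status = check_libraries(LIBRARIES)
for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

Anything reported as missing can be installed with pip. Note that the package name sometimes differs from the import name: `sklearn` is installed as `scikit-learn` and `bs4` as `beautifulsoup4`.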

In conclusion, Python plays an important role in data engineering and will continue to have a great impact on data engineering tasks like ETL. It is worth learning it and making it part of your data engineering journey.
Recommended reading:
https://www.astera.com/type/blog/etl-using-python/

https://medium.com/@godswillthompson16/understanding-etl-pipelines-extract-transform-load-in-data-engineering-814472d71646?source=user_profile_page---------1-------------d1624a597f9d---------------
