DEV Community

Haji Rufai


Introduction to Python for Data Engineering

Chart on Python basics for data engineering
Hello! As organizations grow more interested in data engineering expertise, demand for data engineers has risen, and Python has become one of the main pillars of the field.

Well, what is Python? Why is it preferred for data engineering? And finally, most importantly, the scope - and how to get started.



What is Python?

Python is a high-level, dynamically typed, general-purpose programming language. Being high-level, it is easier to learn and understand.

Python's use has grown thanks to its ease of use and flexibility. You'll get a nice introduction to Python here.
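To make "dynamically typed" concrete, here is a minimal sketch. The variable and function names are made up for illustration.

```python
# A name is not tied to one type; the type belongs to the value it refers to.
value = 42
print(type(value).__name__)   # int

value = "forty-two"           # the same name can be rebound to a string
print(type(value).__name__)   # str


def describe(item):
    """Return a short description of any object, whatever its type."""
    return f"{item!r} is a {type(item).__name__}"


descriptions = [describe(x) for x in (1, 2.5, "csv", [1, 2])]
print(descriptions)
```

No type declarations are needed; Python checks types while the program runs.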


Why Python for data engineering?

  1. A data engineer's job involves working with many data formats, and Python handles them well: its standard library makes common formats simple to read and write. One of the most popular data file types is CSV.

  2. A data engineer is often required to use APIs to retrieve data from databases and other services. The data in such cases is usually exchanged in JSON (JavaScript Object Notation) format, and Python's standard library includes the json module to handle this type of data.

  3. The responsibility of a data engineer is not only to obtain data from different sources but also to process it. One of the most popular data processing engines is Apache Spark, which provides its own DataFrames and offers a Python API, PySpark, for building scalable big data projects.

  4. Directed acyclic graphs (DAGs) are used with data engineering tools like Apache Airflow, Apache NiFi, etc. In Airflow, for example, a DAG is nothing more than a task specification written in Python. Data engineers will therefore be better able to use these technologies by learning Python.

  5. Last but not least, Python has a large number of libraries that a data engineer finds useful:

    Some of the Python libraries for data engineering:

    1. Pandas

    Pandas is a Python library popular among data analysts and data scientists. It is equally useful for data engineers, who use it for reading, writing, querying, and manipulating data. Pandas DataFrames work well with two popular data formats: CSV and JSON.

    2. psycopg2, pyodbc, SQLAlchemy

    When someone hears the word "database," they often picture information kept in tables with rows and columns. This kind of database is called a relational database (RDB).
    There are many ways to communicate with these databases, and the majority of them rely on Structured Query Language (SQL). PostgreSQL is one such database that is well liked by data engineers, and Python has a number of libraries to connect to it, including pyodbc, SQLAlchemy, and psycopg2.

    3. Scientific Python (SciPy)

    As its name suggests, SciPy is a Python library that provides functions for fast mathematical and scientific computation. It allows a data engineer to run mathematical computations on their data for more accurate analysis.

    4. BeautifulSoup

    This well-known library is used for web scraping and data mining. Data engineers use it to extract information from websites by parsing HTML and XML documents when preparing their data.

    5. Petl

    Petl is a Python package for extracting, transforming, and loading tabular data. Data engineers use this library for building ETL (Extract, Transform, and Load) pipelines.
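Several of the points above (CSV files, JSON, SQL databases, and ETL) can be sketched in one small pipeline. This is a minimal illustration that uses only the standard library so it runs anywhere: csv for parsing, sqlite3 as a stand-in for a PostgreSQL connection, and json for the output. The sample data, table name, and function names are hypothetical, and this is not how petl or pandas work internally. Drivers such as psycopg2 follow a similar connect, execute, commit pattern, but their placeholder syntax differs.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample data; in practice this would come from a file or an API.
RAW_CSV = """name,city,amount
Amina,Nairobi,120.50
Brian,Mombasa,80
Amina,Nairobi,30.25
"""


def extract(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Convert the amount column from text to float."""
    return [{**row, "amount": float(row["amount"])} for row in rows]


def load(rows, conn):
    """Create a table and insert the rows using parameterised SQL."""
    conn.execute("CREATE TABLE payments (name TEXT, city TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO payments VALUES (:name, :city, :amount)", rows
    )
    conn.commit()


# An in-memory SQLite database stands in for a real PostgreSQL server.
conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

totals = conn.execute(
    "SELECT name, SUM(amount) FROM payments GROUP BY name ORDER BY name"
).fetchall()
conn.close()

# Serialise the result as JSON, the format most web APIs exchange.
report = json.dumps([{"name": name, "total": total} for name, total in totals])
print(report)
```

Keeping extract, transform, and load as separate functions makes each step easy to test and replace, for example by swapping the SQLite connection for a real database connection later.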


Scope (and let's get started)

Python is a general-purpose programming language used in many fields: web development, automation, networking, you name it.

A data engineer does not need to master every area of Python. Each one is a large field in its own right and a separate journey that is not on our roadmap, so we do not need to know it in detail.

For example, a data engineer does not need to go deep into Python for web development (Flask and Django) or machine learning.

0. Getting started

Install Anaconda on your machine

Anaconda/download
No compromising here.

Anaconda is a free-to-download Python distribution that bundles the packages and browser-based tools, such as Jupyter, that you'll need.

After installing Anaconda you'll automatically have Jupyter Notebook installed. It is a great Python environment that saves your source files as .ipynb files.

Jupyter runs in your browser:

Illustration



Jupyter notebook demo appearance
The image above shows how Jupyter will look on your machine when opened. You can navigate to the folder where you want to create your .ipynb file.



Jupyter notebook outlook
Once you are in your desired directory, click 'New' and then select 'Python 3 (ipykernel)' to open a new .ipynb file.



Jupyter notebook rename illustration
It will open an untitled .ipynb file (a notebook file), which will look like the picture above.
You can rename your notebook file by clicking on 'Untitled' as shown.



Working with Jupyter notebook
Start your work there! Press 'Shift + Enter' to run a cell (the rectangular input field for your code).
All the best.
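As a first cell to try, something like the sketch below confirms that the kernel is working and shows which Python version it runs. The variable names are arbitrary.

```python
import sys

# A first notebook cell: confirm which Python the kernel is running.
major, minor = sys.version_info[:2]
greeting = f"Hello from Python {major}.{minor}"
print(greeting)
```

Type it into the cell, press 'Shift + Enter', and the output appears directly below the cell.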

1. Python basics

Where to Learn

learnpython.org
It is a nice, interactive, beginner-friendly website for learning Python. The topics are arranged in order, and each one ends with a coding exercise to test whether you have mastered it.

The good part (not the lazy part): there are solutions to all the exercises!

2. Data structures and algorithms

Learning data structures and algorithms is essential for a good data engineer, and it will also make you a better programmer. These concepts should be in your RAM!

Where to Learn

Google/free/Udacity/data structures and algorithms
This comprehensive course will give you an in-depth grasp of data structures and algorithms. The good part is that it is taught in Python. And yes, it is free!
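To give a flavour of why this topic matters, here is a small sketch of two classic ideas: binary search on a sorted list, and a set for fast duplicate detection. The sample values and names are made up for illustration.

```python
from bisect import bisect_left


def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    index = bisect_left(sorted_items, target)
    if index < len(sorted_items) and sorted_items[index] == target:
        return index
    return -1


record_ids = [3, 8, 15, 23, 42]          # binary search needs sorted input
print(binary_search(record_ids, 23))     # 3
print(binary_search(record_ids, 7))      # -1

# A set gives average constant-time membership checks, handy for deduplication.
seen = set()
unique_events = []
for event in ["load", "clean", "load", "export"]:
    if event not in seen:
        seen.add(event)
        unique_events.append(event)
print(unique_events)
```

Binary search halves the search range at every step, so it stays fast on large sorted datasets where a linear scan would not.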

3. Python Statistics

A data engineer needs a foundation in the mathematics of data and should have a grounding in:

  1. Descriptive and inferential statistics.
  2. Probability distributions
  3. Hypothesis testing
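The topics above can be explored with Python's built-in statistics module. The sketch below uses a hypothetical sample of daily row counts, and it assumes the counts are roughly normally distributed. The z-score check at the end is a rule of thumb, not a formal hypothesis test.

```python
from statistics import NormalDist, mean, median, stdev

# Hypothetical sample: daily row counts from a pipeline.
row_counts = [980, 1010, 995, 1020, 1005, 990, 1000]

# Descriptive statistics summarise the sample.
avg = mean(row_counts)
mid = median(row_counts)
spread = stdev(row_counts)   # sample standard deviation
print(avg, mid, round(spread, 2))

# Probability distribution: model the counts as normal (an assumption).
model = NormalDist(mu=avg, sigma=spread)
p_below_970 = model.cdf(970)   # probability of a day with fewer than 970 rows

# Rough check on a new observation using its z-score.
new_count = 1100
z_score = (new_count - avg) / spread
is_unusual = abs(z_score) > 3   # common rule of thumb, not a formal test
print(round(p_below_970, 4), round(z_score, 2), is_unusual)
```

A check like this can flag a pipeline run that loaded far more or far fewer rows than usual.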

Where to learn:

Resource: Brief/Comprehensive/pdf

4. Python Developer

You have been coding for a while. Now you need to learn how to write clean code.

That is where Python Enhancement Proposal 8 (PEP 8) comes in. It is a document written in 2001 by Guido van Rossum (Python's creator), Barry Warsaw, and Nick Coghlan.

The primary focus of PEP 8 is to improve the readability and consistency of Python code.

Code is read more often than it is written.

Guido van Rossum, the creator of the Python programming language.

Where to learn

Official/Documentation
realpython.com/pep8
Yes, you need to learn how to write clean code. You will also need to know how to properly document your functions, methods, etc.

Micro Illustration

Properly writing assignment code example
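The original illustration is not reproduced here. As a stand-in, the sketch below shows a few PEP 8 conventions for assignments, naming, and docstrings. The names are hypothetical.

```python
# Discouraged by PEP 8: inconsistent spacing around "=" and unclear names.
# x=1
# y   = 2
# maxRetries=   3

# Preferred: one space on each side of "=", descriptive snake_case names.
row_count = 1
default_batch_size = 2
max_retries = 3


def count_batches(total_rows, batch_size=100):
    """Return how many batches are needed to process total_rows.

    Note: PEP 8 uses no spaces around "=" for default values
    and keyword arguments.
    """
    full_batches, remainder = divmod(total_rows, batch_size)
    return full_batches + (1 if remainder else 0)


print(count_batches(250))                  # 3
print(count_batches(250, batch_size=50))   # 5
```

The docstring directly under the def line is what tools such as help() display, so it doubles as documentation for the function.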



CONCLUSION

Given its libraries and tooling, we can conclude that Python is a natural first-choice programming language for a data engineer. Congratulations on making it this far!




So, have you started your track yet?
