It is great to be back sharing my thoughts on the next step I want for my career. Right now, I want to go all in on data engineering, with a focus on mastery.
Data is as critical in this modern time as it has always been. Social media platforms like LinkedIn and Instagram have made it explicitly known that they will use your data for training if you are comfortable with it.
In this data challenge series, I will go from the basics to the complex to grasp the full picture of what it is to be a data engineer: ingesting, storing, cleaning, and processing data.
Who are Data Engineers?
From the brief intro above, data engineers are an integral part of a company. They maintain systems that ingest data from both internal and external sources, such as databases and APIs, store this data for further processing, and clean and process it through a series of transformation steps.
How Spark works
To use Spark, you first need a Spark cluster. A cluster is a collection of computers running the Spark software.
For a Spark application, the cluster consists of two components:
- driver: orchestrates the data processing
- executors: these process the data itself (see the sketch below)
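To make this split concrete, here is a minimal sketch, assuming a local PySpark installation: the driver only describes the computation, and nothing is processed until an action sends tasks to the executors.

from pyspark.sql import SparkSession

# runs on the driver: create a local session (driver and executors share one machine)
spark = SparkSession.builder.appName('driver-executor-sketch').getOrCreate()

# still on the driver: these lines only describe the computation (Spark is lazy)
numbers = spark.range(0, 1_000_000)
doubled = numbers.selectExpr('id * 2 AS doubled')

# the action below ships tasks to the executors, which process the data's
# partitions in parallel and send the result back to the driver
print(doubled.count())                 # 1000000
print(numbers.rdd.getNumPartitions())  # how many chunks the executors work on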
Installing Spark
# install Spark with Homebrew (macOS)
brew install apache-spark
# after installation, verify it by starting the PySpark shell:
pyspark
# install Jupyter in the virtual environment
pip install jupyter
# configure pyspark to launch JupyterLab when we start it
# by defining these two environment variables
export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='lab'
# run or start the development server in the venv
pyspark
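Once pyspark opens JupyterLab, a quick sanity check confirms the session is alive. A small sketch: the pyspark launcher pre-creates the spark variable for you.

# inside the notebook: `spark` is pre-created by the pyspark launcher
print(spark.version)               # the installed Spark release
print(spark.sparkContext.master)   # 'local[*]' when running on one machine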
Practical demo
For you to become better, you must practice. This demo uses Apache Spark, the data processing framework, with the development server running in Jupyter Notebook inside a virtual environment.
This documentation is a good starting guide to writing efficient Spark applications and exploring Spark's functionality.
Now, let's create a SparkSession and read actual data from a CSV file. To follow along, download a sample dataset from this website.
from pyspark.sql import SparkSession

# create (or reuse) a SparkSession, the entry point to any Spark application
spark = SparkSession \
    .builder \
    .appName('Read inside airbnb data') \
    .getOrCreate()

# read the compressed CSV; the quote/escape/multiLine options handle
# quoted fields that span multiple lines
listings = spark.read.csv('data/listings.csv.gz',
                          header=True,       # first row holds the column names
                          inferSchema=True,  # sample the data to guess column types
                          sep=',',
                          quote='"',
                          escape='"',
                          multiLine=True,
                          mode='PERMISSIVE'  # keep malformed rows, filling nulls
                          )
listings.printSchema()
# iterate over the schema fields (use `field`, not the built-in name `list`)
for field in listings.schema:
    print(field)
# select a single column and show the first 20 rows without truncating long text
description = listings.select(listings.description)
description.show(20, truncate=False)
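To take the demo one step further, here is a sketch of a typical transformation and aggregation. The column names price and neighbourhood_cleansed are assumptions based on the Inside Airbnb schema, so adjust them to match your download.

from pyspark.sql import functions as F

# strip '$' and ',' from the price string and cast it to a number
# (the `price` and `neighbourhood_cleansed` columns are assumed to exist)
priced = listings.withColumn(
    'price_num',
    F.regexp_replace(F.col('price'), '[$,]', '').cast('double')
)

# average price per neighbourhood, highest first
priced \
    .groupBy('neighbourhood_cleansed') \
    .agg(F.avg('price_num').alias('avg_price')) \
    .orderBy(F.desc('avg_price')) \
    .show(10, truncate=False)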
When not to work with Apache Spark
- when working on small amounts of data (see the pandas sketch below)
- when processing data in real time, where a dedicated streaming system is a better fit
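For the small-data case, a single-machine library like pandas is usually the simpler choice. A minimal sketch, assuming the same listings file as above (the name and price column names are assumptions from the Inside Airbnb schema):

import pandas as pd

# pandas reads the gzipped CSV directly; no cluster required for small data
listings_pd = pd.read_csv('data/listings.csv.gz')
print(listings_pd[['name', 'price']].head())  # column names assumed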
Where to find out more about this journey
The public repo is available here:
https://github.com/Terieyenike/data-engineering
If you want to connect and reach out, I am active on LinkedIn:
