<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rahul Dubey</title>
    <description>The latest articles on DEV Community by Rahul Dubey (@rahuldubey391).</description>
    <link>https://dev.to/rahuldubey391</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1029406%2F8da2b2ed-7447-410e-823a-74bb6b6ebd82.png</url>
      <title>DEV Community: Rahul Dubey</title>
      <link>https://dev.to/rahuldubey391</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rahuldubey391"/>
    <language>en</language>
    <item>
      <title>Metadata for win — Apache Parquet</title>
      <dc:creator>Rahul Dubey</dc:creator>
      <pubDate>Sat, 25 May 2024 08:36:09 +0000</pubDate>
      <link>https://dev.to/rahuldubey391/metadata-for-win-apache-parquet-3mb5</link>
      <guid>https://dev.to/rahuldubey391/metadata-for-win-apache-parquet-3mb5</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfprkiffa2suu2h6zxu5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfprkiffa2suu2h6zxu5.jpg" alt="https://github.com/RahulDubey391/parquet-reader" width="800" height="670"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You read the title right! Apache Parquet provides some of the best data-layout properties for optimizing a processing engine's capabilities. Popular distributed computing solutions such as Apache Spark and Presto exploit these properties of Apache Parquet to read and write data faster.&lt;/p&gt;

&lt;p&gt;Enterprise solutions on the market, such as Databricks, also provide ACID guarantees on top of the Apache Parquet format to build Delta tables. Newer table formats have arrived as well, such as Apache Iceberg and Apache Hudi.&lt;/p&gt;

&lt;p&gt;But how does it work? And what if you have to write your own custom processing solution, when what you need is a needle instead of a sword like Apache Spark?&lt;/p&gt;

&lt;p&gt;Setting up Apache Spark is often another big elephant, one that most people shy away from when they just want to process a manageable amount of data on a single system.&lt;/p&gt;

&lt;p&gt;For such cases, Apache Arrow is a great fit. Although it is a language-agnostic platform, it works well for single-machine processing. Other libraries, like Polars, can also be used on a single machine.&lt;/p&gt;

&lt;p&gt;But how do these frameworks make the best of the Apache Parquet format? The answer lies in the columnar layout and the inherent structure of the file. The file is organized so that retrieving columns is much more efficient than row-based retrieval. In fact, analytical queries typically retrieve only the columns they need instead of selecting all columns as a whole.&lt;/p&gt;

&lt;h2&gt;
  
  
  How is the file structured?
&lt;/h2&gt;

&lt;p&gt;Apache Parquet is optimized for analytical queries, which is why it follows a columnar format. Referring to the official illustration below, I'll explain how it works:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklkmxw8wiz3dfdbwxppn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklkmxw8wiz3dfdbwxppn.png" alt="Source — https://parquet.apache.org/docs/file-format/" width="601" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What’s going on in the above illustration? Let me explain.&lt;/p&gt;

&lt;p&gt;The file is structured in parts that carry five crucial pieces of information:&lt;/p&gt;

&lt;h2&gt;
  
  
  Header
&lt;/h2&gt;

&lt;p&gt;The header carries the official 4-byte magic number, "PAR1". Its main purpose is to designate the file as Parquet format.&lt;/p&gt;
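
&lt;p&gt;A quick sanity check for the magic number takes only a few lines of Python. This is a sketch; the byte string below is fabricated, not a real Parquet file:&lt;/p&gt;

```python
# Sketch: a Parquet file both starts AND ends with the 4-byte magic "PAR1".
def looks_like_parquet(buf: bytes) -> bool:
    return len(buf) >= 8 and buf[:4] == b"PAR1" and buf[-4:] == b"PAR1"

# Fabricated bytes standing in for a real file's contents.
fake = b"PAR1" + b"\x00" * 32 + b"PAR1"
print(looks_like_parquet(fake))          # True
print(looks_like_parquet(b"CSV,data"))   # False
```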

&lt;h2&gt;
  
  
  Data
&lt;/h2&gt;

&lt;p&gt;The actual data is stored in the Data section of the file. It is a combination of row groups, column chunks, and pages. Don't worry, we'll come back to these when we discuss the Footer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Footer
&lt;/h2&gt;

&lt;p&gt;The footer is the main block of information we are interested in, and it is what this article is about. It holds critical information that processing frameworks exploit to optimize read and write operations.&lt;/p&gt;

&lt;p&gt;It contains metadata about the whole file, written at file write time so that it can be used at read time. The high-level metadata we are interested in is as follows:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FileMetadata&lt;/strong&gt; — metadata about the file as a whole, such as the schema and the Parquet format version.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RowGroupMetadata&lt;/strong&gt; — holds metadata about each row group, such as its column chunks and the number of records per row group.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ColumnMetadata&lt;/strong&gt; — describes each column chunk: its name, compression type, offset to the data pages, byte size, and column statistics such as min and max values (and distinct counts if enabled).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PageMetadata&lt;/strong&gt; — the data within a column chunk is broken down into multiple pages, and each page's metadata contains the offset needed to access the next page.&lt;/p&gt;

&lt;p&gt;There is further metadata, such as column indexes, that can be set while writing the file.&lt;/p&gt;

&lt;p&gt;If you want to know how the file looks when its bytes are decoded to a string, the structure is similar to the following:&lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://parquet.apache.org/docs/file-format/metadata/"&gt;https://parquet.apache.org/docs/file-format/metadata/&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;4-byte magic number "PAR1"
&amp;lt;Column 1 Chunk 1 + Column Metadata&amp;gt;
&amp;lt;Column 2 Chunk 1 + Column Metadata&amp;gt;
...
&amp;lt;Column N Chunk 1 + Column Metadata&amp;gt;
&amp;lt;Column 1 Chunk 2 + Column Metadata&amp;gt;
&amp;lt;Column 2 Chunk 2 + Column Metadata&amp;gt;
...
&amp;lt;Column N Chunk 2 + Column Metadata&amp;gt;
...
&amp;lt;Column 1 Chunk M + Column Metadata&amp;gt;
&amp;lt;Column 2 Chunk M + Column Metadata&amp;gt;
...
&amp;lt;Column N Chunk M + Column Metadata&amp;gt;
File Metadata
4-byte length in bytes of file metadata (little endian)
4-byte magic number "PAR1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;h2&gt;
  
  
  Using metadata in practice
&lt;/h2&gt;

&lt;p&gt;The file structure and metadata discussed above can be used to build efficient systems for retrieving and processing large numbers of files. In this section we discuss how you can use this information to speed up your read and processing logic.&lt;/p&gt;

&lt;p&gt;Often the mistakes are made at write time. Many people ignore the benefits of the Parquet format and treat it like any other file format, such as CSV or TSV. Done right, a few extra parameters at write time unlock these properties: sorting on low-cardinality columns, indexing on a column, or sorting high-cardinality metric columns so that nearby values are co-located within a range.&lt;/p&gt;

&lt;p&gt;Let’s not waste time and dive into a real example.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data in practice
&lt;/h2&gt;

&lt;p&gt;We are going to use a script to generate some random, structured data. The script will produce 100k records per file, with 100 files in total. We will pass some extra parameters to force conditions that favor efficient reads, like sorting on certain columns. The data will be stored in a GCS bucket, since most production environments live in the cloud, where blob storage systems handle the concurrent reads and writes issued by processing solutions.&lt;/p&gt;

&lt;p&gt;Data points are scattered randomly across the files. This is a challenge for most processing engines, since it spreads reads across 80-90% of the files in question, and it serves as a worst case for testing our processing system.&lt;/p&gt;

&lt;p&gt;Refer to the code below to generate some data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metadata Collection and Filtering Processes
&lt;/h2&gt;

&lt;p&gt;We’ll divide our codebase into two parts. The first is the metadata collection process, responsible for reading metadata across all 100 files and writing it to a metadata folder at the same path where the data files live. The second is the filtering process, which takes some query parameters as arguments, searches the collected metadata, and filters down to only those parts of the files we actually need to read records from.&lt;/p&gt;

&lt;p&gt;We are going to use Python's multiprocessing module to parallelize the reading and writing of metadata. Mind you, we are also going to use Apache Arrow, both for reading just the metadata and for memory-mapping the files when reading the actual data.&lt;/p&gt;

&lt;p&gt;In the cover image, the processes are split across two classes: the collection and write processes belong to the MetadataCollector class, while the filter and data collection processes belong to the MetadataProcessor class.&lt;/p&gt;

&lt;p&gt;Both the classes provide executable methods for multiprocessing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Experiment in Consideration
&lt;/h2&gt;

&lt;p&gt;I took around 20 GB of user-click data generated from a script, with the following configuration:&lt;/p&gt;

&lt;p&gt;No partitiong logic — random splits&lt;br&gt;
No ordering of the columns — predicate pushdown process most probably read all files while filtering&lt;br&gt;
Row Group size is kept 100K records per page&lt;br&gt;
Number of Rows per file — 1M&lt;br&gt;
Compression Type — Snappy&lt;/p&gt;

&lt;h2&gt;
  
  
  Compute Engine
&lt;/h2&gt;

&lt;p&gt;The Google Cloud Compute Engine service is used to run the module. It has the following configuration:&lt;/p&gt;

&lt;p&gt;4 vCPUs (2 Cores)&lt;br&gt;
16 GB Memory&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;The whole process took around a minute: 29 seconds for metadata collection and 34 seconds for metadata filtering and data collection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Although processing in Python suffers from a number of drawbacks, the module provides a basic understanding of how Apache Parquet can be used for more efficient I/O. Future work could add Bloom filters, sort-merge compaction, Z-ordering (key co-location per file), and other tricks to make it more mature.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code Repository
&lt;/h2&gt;

&lt;p&gt;Refer to the GitHub link below to check out the code and run your own process. The code is not mature and lacks proper constructs, but it's a work in progress, so feel free to add suggestions or open a PR.&lt;/p&gt;

&lt;p&gt;parquet-reader — &lt;a href="https://github.com/RahulDubey391/parquet-reader"&gt;https://github.com/RahulDubey391/parquet-reader&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>bigdata</category>
      <category>datascience</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>💻Ephemeral UI: React + Nginx + Docker + Cloud Run☁</title>
      <dc:creator>Rahul Dubey</dc:creator>
      <pubDate>Tue, 11 Apr 2023 08:00:57 +0000</pubDate>
      <link>https://dev.to/rahuldubey391/ephemeral-ui-react-nginx-docker-cloud-run-30o0</link>
      <guid>https://dev.to/rahuldubey391/ephemeral-ui-react-nginx-docker-cloud-run-30o0</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ktnc7bdzphkwt6umw7y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ktnc7bdzphkwt6umw7y.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hi🤗, In this tutorial, we are going to implement and deploy a basic React App on Google Cloud Run using Docker, Google Cloud Shell and of course a GCP account!&lt;/p&gt;

&lt;p&gt;Before going forward, you might ask: what's Cloud Run? And what does "Ephemeral UI" mean here?&lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud Run - Developer's best friend 😎
&lt;/h2&gt;

&lt;p&gt;Google Cloud Platform provides several services for deploying lightweight applications, such as Cloud Functions, App Engine and finally Cloud Run. All of these services are serverless and don't maintain application state. &lt;/p&gt;

&lt;p&gt;Behind the scenes, these are Knative applications deployed on invocation over Kubernetes Engine, aka GKE. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wait, what🧐? If it's serverless, how is it useful for a front-end application?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where the term &lt;strong&gt;Ephemeral UI&lt;/strong&gt; comes into the picture. These applications serve short sessions in which a quick summary or piece of information needs to be delivered before the consumer moves on.&lt;/p&gt;

&lt;p&gt;Such is the case when you, as a consumer, want to do some quick analysis, say collecting a future sales prediction for a product, then download the report sheet and just whoosh away!&lt;/p&gt;

&lt;p&gt;Beyond such cursory usage, these lightweight serverless services are also a good place to develop and deploy quick UI prototypes, since they won't cost you anything while sitting idle!&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Before moving forward, you need the following things set up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GCP Account&lt;/li&gt;
&lt;li&gt;Google Cloud Shell&lt;/li&gt;
&lt;li&gt;NodeJS&lt;/li&gt;
&lt;li&gt;ReactJS&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setup Simple React App
&lt;/h2&gt;

&lt;p&gt;In this section, we will create a simple React app that takes the user's name as input and greets them with "Hello" and their name.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzl25zmgr5j5rc3fyzji.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzl25zmgr5j5rc3fyzji.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo01bkddnox12obrvvfal.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo01bkddnox12obrvvfal.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To create this step, use the following code:&lt;/p&gt;

&lt;p&gt;Install React&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install react react-dom
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once installed, change to some directory and type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npx create-react-app hello-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this, change into the directory and run the app; it will show the React logo animation on the page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;cd hello-app
&amp;gt;npm start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once it is confirmed that React is installed correctly, use the below code in &lt;code&gt;App.js&lt;/code&gt; file to create the desired app.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import React, { useState } from 'react';

function App() {
  const [inputValue, setInputValue] = useState('');

  const handleInputChange = (event) =&amp;gt; {
    setInputValue(event.target.value);
  };

  const handleSubmit = (event) =&amp;gt; {
    event.preventDefault();
    alert(`Hello ${inputValue}!`);
  };

  return (
    &amp;lt;div&amp;gt;
      &amp;lt;h1&amp;gt;Enter your name:&amp;lt;/h1&amp;gt;
      &amp;lt;form onSubmit={handleSubmit}&amp;gt;
        &amp;lt;input type="text" value={inputValue} onChange={handleInputChange} /&amp;gt;
        &amp;lt;button type="submit"&amp;gt;Submit&amp;lt;/button&amp;gt;
      &amp;lt;/form&amp;gt;
    &amp;lt;/div&amp;gt;
  );
}

export default App;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you save the file, the changes will be reloaded automatically and visible on localhost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting-Up Nginx Configuration
&lt;/h2&gt;

&lt;p&gt;While it is good to test the app locally before deploying it, you have to follow some extra steps to make the application discoverable on the internet. Nginx provides a way to serve your app in a production setting.&lt;/p&gt;

&lt;p&gt;But before that, you have to set up an Nginx configuration file to use when deploying in a container. In your root directory, create a file named &lt;code&gt;nginx.conf&lt;/code&gt; and paste the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;server {
  listen 80;
  sendfile on;
  default_type application/octet-stream;

  gzip on;
  gzip_http_version 1.1;
  gzip_disable      "MSIE [1-6]\.";
  gzip_min_length   256;
  gzip_vary         on;
  gzip_proxied      expired no-cache no-store private auth;
  gzip_types        text/plain text/css application/json application/javascript application/x-javascript text/xml application/xml application/xml+rss text/javascript;
  gzip_comp_level   9;

  root /usr/share/nginx/html;

  location / {
    try_files $uri $uri/ /index.html =404;
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Finally, Dockerfile!
&lt;/h2&gt;

&lt;p&gt;At last, one more step before deployment: containerize the application in a Docker container using a Dockerfile. The file contains commands similar to Linux commands, which fits Docker's role as an OS-level virtualization layer.&lt;/p&gt;

&lt;p&gt;Create a file named &lt;code&gt;Dockerfile&lt;/code&gt; at the root of your app's directory and paste the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM node:16.13.1 as build

WORKDIR /app

COPY package*.json ./

RUN npm install

COPY . .

RUN npm run build --prod

FROM nginx:latest AS ngi

COPY --from=build /app/build /usr/share/nginx/html

COPY /nginx.conf  /etc/nginx/conf.d/default.conf

EXPOSE 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we define a two-stage build: the first stage runs the npm build, and the second copies the build output into an Nginx image to serve it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploy with Google Cloud Shell 🚀
&lt;/h2&gt;

&lt;p&gt;Before moving ahead, make sure you have access to Google Cloud Shell and service-account JSON credentials. Using Google Cloud Shell lets you skip running Cloud Build or npm build locally, since the build runs in the container itself.&lt;/p&gt;

&lt;p&gt;Type the following command to discover and activate the appropriate service account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gcloud auth list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will show a list of all authenticated accounts. If nothing is shown, use this instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gcloud auth activate-service-account &amp;lt;SERVICE-ACCOUNT-NAME&amp;gt; --key-file=&amp;lt;PATH/TO/JSON&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once it is done, you will get an activation confirmation. After this, change to the root of the app directory and type the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gcloud run deploy hello-app --port 80 --allow-unauthenticated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once it is done, you will see the build spinner as the container is initialized, followed by service feedback.&lt;/p&gt;

&lt;h2&gt;
  
  
  Time to Go!
&lt;/h2&gt;

&lt;p&gt;That's it, folks! Here we wrap up the tutorial, in which we learned end-to-end deployment of a React application on GCP. In the next tutorial, we will show how to set up a CI/CD pipeline to automate the build and deployment of the application.&lt;/p&gt;

&lt;p&gt;Take a moment to cherish that you have come this far!👾&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Link to Github&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://dev.tourl"&gt;https://github.com/RahulDubey391/React-On-Cloud-Run&lt;/a&gt; &lt;/p&gt;

</description>
      <category>docker</category>
      <category>googlecloud</category>
      <category>react</category>
    </item>
    <item>
      <title>Make it Go! - Introduction</title>
      <dc:creator>Rahul Dubey</dc:creator>
      <pubDate>Sun, 12 Mar 2023 15:58:35 +0000</pubDate>
      <link>https://dev.to/rahuldubey391/make-it-go-introduction-4am5</link>
      <guid>https://dev.to/rahuldubey391/make-it-go-introduction-4am5</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--s5FIj3Js--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4l5dprl2z4av7bwpi2ao.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s5FIj3Js--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4l5dprl2z4av7bwpi2ao.png" alt="Image description" width="299" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When it comes to choosing a programming language for an app that is data-intensive and needs low computation latency, developers have to choose between readability and fast development on one hand and high performance on the other. Often the languages with better readability and fast prototyping lack performance, and that becomes the main bottleneck when your app needs to serve millions of user requests.&lt;/p&gt;

&lt;p&gt;This is where Go comes in. Before moving ahead, I must tell you that I have always been a Python guy, due to its simplicity, fast development and the sheer number of community packages, and of course, data! Python has a huge ecosystem for ML/AI, data science, data analytics and big data tasks. But it still has a lot of drawbacks. Some of them, in my view, are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;First, it's an &lt;strong&gt;interpreted, dynamic language&lt;/strong&gt;. Dynamic because types can mutate at runtime, which is inherently slow and can add significant latency. Portability is another issue: distributing a Python application requires consumer machines to have a Python interpreter installed locally.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;No first-class mechanism for &lt;strong&gt;Inter-Process Communication (IPC)&lt;/strong&gt;. Processes running in parallel need a common queue or buffer in place to share information with each other, usually via middleware like Celery or Redis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;And of course, &lt;strong&gt;concurrency&lt;/strong&gt;! By default, Python threads are not truly concurrent, due to the infamous &lt;strong&gt;GIL (Global Interpreter Lock)&lt;/strong&gt;, which prevents multiple threads from executing Python bytecode at the same time. Libraries such as asyncio and concurrent.futures work around this, but they sidestep the GIL rather than remove it.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
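
&lt;p&gt;To make the IPC point concrete, here is a minimal Python sketch: two processes can only talk through an explicit middleman such as a multiprocessing.Queue. This assumes a Unix-like system where fork is the default start method.&lt;/p&gt;

```python
# Sketch: Python IPC needs an explicit queue object shared between processes.
import multiprocessing

def worker(q: multiprocessing.Queue) -> None:
    q.put("hello from the child process")

q = multiprocessing.Queue()
p = multiprocessing.Process(target=worker, args=(q,))
p.start()
msg = q.get()  # blocks until the child sends
p.join()
print(msg)  # hello from the child process
```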

&lt;p&gt;And the list of drawbacks can go on. My job requires me to build and deploy a lot of custom applications for big data problems. Often I have to deal with the small-files problem, which is also a bane for many existing big data frameworks like Apache Hadoop and Apache Spark. There are many new players in the field, like Ray, Dask and Modin, which let you write thread-safe, concurrent and parallel applications for big data problems.&lt;/p&gt;

&lt;p&gt;In my view, choosing an alternative to Python can be a great advantage when it comes to writing highly performant code. For me, Go is the first choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  A little history about Go
&lt;/h2&gt;

&lt;p&gt;Go was developed at Google to support its ongoing projects. Three Google engineers, Robert Griesemer, Rob Pike, and Ken Thompson, came up with the plan for a programming language that would solve their issues with existing languages like C++. They wanted simplicity of syntax along with the performance and features of a language such as C. Development started on September 21, 2007, and about two years later the language was released as an open-source project.&lt;/p&gt;

&lt;p&gt;Today many projects use Go, such as:&lt;/p&gt;

&lt;h2&gt;
  
  
  Docker
&lt;/h2&gt;

&lt;p&gt;Go is popular in the DevOps community, especially thanks to Docker, whose codebase is roughly 90% Go. Docker is popular for application containerization and is widely adopted by the community and organizations alike.&lt;br&gt;
&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://github.com/docker" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--_kZeTDZQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://avatars.githubusercontent.com/u/5429470%3Fs%3D280%26v%3D4" height="280" class="m-0" width="280"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://github.com/docker" rel="noopener noreferrer" class="c-link"&gt;
          Docker · GitHub
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          Docker helps developers bring their ideas to life by conquering the complexity of app development. - Docker
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://res.cloudinary.com/practicaldev/image/fetch/s--uPIa4SpL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.githubassets.com/favicons/favicon.svg" width="32" height="32"&gt;
        github.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  Kubernetes
&lt;/h2&gt;

&lt;p&gt;Kubernetes is another project developed in Go. It is one of the most popular platforms for orchestrating Docker containers, providing automated container scaling and much more.&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--566lAguM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/kubernetes"&gt;
        kubernetes
      &lt;/a&gt; / &lt;a href="https://github.com/kubernetes/kubernetes"&gt;
        kubernetes
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Production-Grade Container Scheduling and Management
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;h1&gt;
Kubernetes (K8s)&lt;/h1&gt;
&lt;p&gt;&lt;a href="https://bestpractices.coreinfrastructure.org/projects/569" rel="nofollow"&gt;&lt;img src="https://camo.githubusercontent.com/0238a7ac2531a99a681c6eb6d9be15f5fa67411da6e40453fc676245e2bc073f/68747470733a2f2f626573747072616374696365732e636f7265696e6672617374727563747572652e6f72672f70726f6a656374732f3536392f6261646765" alt="CII Best Practices"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/kubernetes/kubernetes/raw/master/logo/logo.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bh6lgk38--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/kubernetes/kubernetes/raw/master/logo/logo.png" width="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kubernetes, also known as K8s, is an open source system for managing &lt;a href="https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/" rel="nofollow"&gt;containerized applications&lt;/a&gt;
across multiple hosts. It provides basic mechanisms for deployment, maintenance
and scaling of applications.&lt;/p&gt;
&lt;p&gt;Kubernetes builds upon a decade and a half of experience at Google running
production workloads at scale using a system called &lt;a href="https://research.google.com/pubs/pub43438.html" rel="nofollow"&gt;Borg&lt;/a&gt;
combined with best-of-breed ideas and practices from the community.&lt;/p&gt;
&lt;p&gt;Kubernetes is hosted by the Cloud Native Computing Foundation (&lt;a href="https://www.cncf.io/about" rel="nofollow"&gt;CNCF&lt;/a&gt;)
If your company wants to help shape the evolution of
technologies that are container-packaged, dynamically scheduled,
and microservices-oriented, consider joining the CNCF.
For details about who's involved and how Kubernetes plays a role,
read the CNCF &lt;a href="https://cncf.io/news/announcement/2015/07/new-cloud-native-computing-foundation-drive-alignment-among-container" rel="nofollow"&gt;announcement&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
To start using K8s&lt;/h2&gt;
&lt;p&gt;See our documentation on &lt;a href="https://kubernetes.io" rel="nofollow"&gt;kubernetes.io&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Try our &lt;a href="https://kubernetes.io/docs/tutorials/kubernetes-basics" rel="nofollow"&gt;interactive tutorial&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Take a free course on &lt;a href="https://www.udacity.com/course/scalable-microservices-with-kubernetes--ud615" rel="nofollow"&gt;Scalable Microservices with Kubernetes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To use Kubernetes code as a library in other applications, see the &lt;a href="https://git.k8s.io/kubernetes/staging/README.md" rel="nofollow"&gt;list&lt;/a&gt;…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/kubernetes/kubernetes"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;h2&gt;
Benefits and Drawbacks&lt;/h2&gt;

&lt;p&gt;There are a lot of benefits to Go when I compare it to Python. Some of these are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Statically Compiled Language&lt;/strong&gt;. Unlike Python, Go is compiled and statically typed: every variable has a declared type that cannot change during runtime.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Code Readability&lt;/strong&gt; is close to Python's, so if you are transitioning from Python to Go, the learning curve is less steep. This also enables faster application development and keeps developers productive.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IPC&lt;/strong&gt; is much easier than in Python and essentially native to Go: channels allow goroutines running concurrently to communicate safely.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Of course, &lt;strong&gt;Concurrency&lt;/strong&gt; is the main benefit of Go, handled through goroutines and channels.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
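&lt;p&gt;To make the goroutines-and-channels point concrete, here is a minimal sketch (my own illustration, not tied to any particular project) that fans work out to several goroutines and collects the results over a channel:&lt;/p&gt;

```go
package main

import "fmt"

// sumOfSquares fans the numbers 1..n out to n worker goroutines
// over one channel and collects the squared results over another.
func sumOfSquares(n int) int {
	jobs := make(chan int, n)
	results := make(chan int, n)

	// Each worker reads jobs until the channel is closed.
	for w := 0; w < n; w++ {
		go func() {
			for j := range jobs {
				results <- j * j
			}
		}()
	}

	for i := 1; i <= n; i++ {
		jobs <- i
	}
	close(jobs)

	sum := 0
	for i := 0; i < n; i++ {
		sum += <-results
	}
	return sum
}

func main() {
	fmt.Println(sumOfSquares(3)) // prints 14 (1 + 4 + 9)
}
```

&lt;p&gt;No locks or shared state are needed; the channels do all the coordination between the goroutines.&lt;/p&gt;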

&lt;p&gt;That said, transitioning to Go can be a bit difficult in the beginning for the following reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No classical OOP&lt;/strong&gt; in Go. Go is a procedural language that sits somewhere in the family of Fortran, C and their relatives; there is no class-based inheritance, though Go does have generics, interfaces and modules.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Go doesn’t have &lt;strong&gt;Exceptions&lt;/strong&gt; and instead signals failures by returning &lt;strong&gt;Errors&lt;/strong&gt; as values.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Go also has a &lt;strong&gt;Garbage Collector&lt;/strong&gt;, similar to Python, which makes it less performant than Rust or C++.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Go allows &lt;strong&gt;Pointers&lt;/strong&gt;, which can be daunting if you have never used them before.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
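&lt;p&gt;The errors-instead-of-exceptions point deserves a quick illustration. In this small, hypothetical example, a failing call returns an ordinary error value that the caller must check explicitly, rather than raising an exception that unwinds the stack:&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
)

// divide returns an error value instead of raising an exception,
// which is Go's idiomatic way of signalling failure.
func divide(a, b float64) (float64, error) {
	if b == 0 {
		return 0, errors.New("division by zero")
	}
	return a / b, nil
}

func main() {
	if q, err := divide(10, 4); err == nil {
		fmt.Println(q) // prints 2.5
	}
	// The failure case is an ordinary value, not a control-flow jump.
	if _, err := divide(1, 0); err != nil {
		fmt.Println("error:", err)
	}
}
```

&lt;p&gt;Coming from Python's try/except, the repeated &lt;code&gt;if err != nil&lt;/code&gt; checks feel verbose at first, but they make every failure path visible at the call site.&lt;/p&gt;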

&lt;p&gt;Despite these drawbacks, Go can be your best friend if you want to write highly performant backend systems, with performance close to C and readability similar to Python. Go is also becoming a popular language, and many organizations are shifting to it for their data-intensive backend systems, which means plenty of demand for Go developers in the market, with fat paychecks to match.&lt;/p&gt;

&lt;p&gt;Monetary value aside, I think you should learn Go if you are already a Python developer. It can give you an edge when it comes to building high-performance systems.&lt;/p&gt;

&lt;h2&gt;
Conclusion&lt;/h2&gt;

&lt;p&gt;In this article, we discussed how Go can be your best bet when you want to write highly performant code. The language is still fresh and growing day by day, and the community has been contributing to the project, providing some of the best frameworks and libraries.&lt;/p&gt;

&lt;p&gt;In the next article, we will discuss how to create basic programs in Go. Till then, Goodbye!!&lt;/p&gt;

</description>
      <category>go</category>
      <category>beginners</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Using DAG to deal with Zip file for Snowflake</title>
      <dc:creator>Rahul Dubey</dc:creator>
      <pubDate>Sun, 19 Feb 2023 08:10:49 +0000</pubDate>
      <link>https://dev.to/rahuldubey391/using-dag-to-deal-with-zip-file-for-snowflake-31lo</link>
      <guid>https://dev.to/rahuldubey391/using-dag-to-deal-with-zip-file-for-snowflake-31lo</guid>
      <description>&lt;p&gt;In this article, we are going to cover how to deal with Zip files when loading data into Snowflake using Airflow in GCP Composer. It's a typical data pipeline, but it can still be tricky if you are a beginner or have never dealt with ETL before.&lt;/p&gt;

&lt;p&gt;Before going further, we assume that you have the following in place:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google Cloud Platform Account&lt;/li&gt;
&lt;li&gt;Snowflake Account&lt;/li&gt;
&lt;li&gt;GCP Composer Environment with basic integration setup with Snowflake&lt;/li&gt;
&lt;li&gt;Code Editor&lt;/li&gt;
&lt;li&gt;Python Installed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Case Study&lt;/strong&gt;&lt;br&gt;
Suppose, you are working for a renowned organization that has recently shifted their data platform from BigQuery to Snowflake. Now all your organization's data is housed in Snowflake and all BI/DataOps happens exclusively in Snowflake.&lt;/p&gt;

&lt;p&gt;One fine morning, you are assigned a task to build a data pipeline for Attribution Analytics. All the attribution data is dropped into a GCS bucket as Zip files by the organization's partner. I know what you are thinking: just write a 'COPY INTO' statement with a File Format that sets 'COMPRESSION=ZIP'. But that won't work: you can't use 'ZIP' directly in a File Format, and the 'DEFLATE' type won't read a Zip archive either.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to do?&lt;/strong&gt;&lt;br&gt;
You can utilize GCP's capabilities to orchestrate and automate the data loading. But first, you have to ensure that a storage integration object is created in Snowflake for the GCS bucket. After that, you create an external stage pointing at the path in the GCS bucket for direct loading.&lt;/p&gt;
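&lt;p&gt;As a rough sketch, the one-time setup on the Snowflake side looks something like the following. All names here (integration, stage, bucket, schema) are placeholders, and your security team may require additional options:&lt;/p&gt;

```python
# Hypothetical one-time setup SQL for Snowflake, run once by an
# account admin. Integration, stage, bucket and schema names are
# placeholders -- substitute your own.
CREATE_INTEGRATION = """
CREATE STORAGE INTEGRATION gcs_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'GCS'
  ENABLED = TRUE
  STORAGE_ALLOWED_LOCATIONS = ('gcs://my-attribution-bucket/')
"""

CREATE_STAGE = """
CREATE STAGE MY_DB.MY_SCHEMA.ATTRIBUTION_STAGE
  URL = 'gcs://my-attribution-bucket/'
  STORAGE_INTEGRATION = gcs_integration
"""
```

&lt;p&gt;After creating the integration, &lt;code&gt;DESC STORAGE INTEGRATION&lt;/code&gt; shows the service account Snowflake will use, which you then grant read access on the bucket.&lt;/p&gt;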

&lt;p&gt;Once the above things are taken care of, you can implement a DAG script for GCP Composer. GCP Composer is a managed Apache Airflow service that enables quick deployment of Airflow on top of Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Airflow to the rescue&lt;/strong&gt;&lt;br&gt;
Apache Airflow is an elegant task-scheduling service that allows data operations to be handled in an effective and efficient manner. It provides an intuitive web UI for managing task workflows, and you can create parallel workflows without the headache of writing your own application to deal with multiprocessing, threads and concurrency.&lt;/p&gt;

&lt;p&gt;Enough with the introductions, let's start coding.&lt;/p&gt;

&lt;p&gt;First, open your code editor and create a Python file. Name it &lt;code&gt;DAG_Sflk_loader.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;After the above step, import all the necessary packages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import datetime,timedelta,date
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.snowflake_operator import SnowflakeOperator
import pandas as pd
from google.cloud import storage
import zipfile
import io
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To declare a DAG script, you have to use DAG object from airflow package like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;default_args = {
    'owner': 'ORGANIZATION',
    'start_date': datetime(2023, 2, 19),
    'email': ['username@email.com'],
    'email_on_failure': True,
    'email_on_retry': False
}

dag = DAG('SFLK_ZIP_LOAD', description='This DAG loads ZIP files to Snowflake', max_active_runs=1, catchup=False, default_args=default_args)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above code snippet, we first define the arguments to be passed to the DAG object, like 'owner', 'start_date', 'email', 'email_on_failure' etc. After this, we instantiate the DAG object itself; it will be used as a context manager at the end of the script to tie the tasks together.&lt;/p&gt;

&lt;p&gt;Alright, now it is time to start defining custom tasks in Python and Snowflake. For this, we use Operators. Operators are individual task units and come in many kinds: Snowflake SQL, Python, Bash commands, gsutil etc. For our discussion, we will only use the Python and Snowflake operators.&lt;/p&gt;

&lt;p&gt;We will distribute our data pipeline in the following way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TRUNCATE_TABLE_TASK --&amp;gt; UNZIP_FILES_IN_GCS_TASK --&amp;gt; LOAD_FILES_TO_TABLE_TASK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;TRUNCATE TASK&lt;/strong&gt;&lt;br&gt;
Before loading the data into Snowflake, we first clear out anything that would be duplicated. Since this is an incremental load, we won't use &lt;code&gt;TRUNCATE TABLE &amp;lt;TABLE_NAME&amp;gt;&lt;/code&gt; directly; we will just delete any rows already present for &lt;code&gt;CURRENT_DATE&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TRUNC_QUERY = '''DELETE FROM &amp;lt;DATABASE_NAME.SCHEMA_NAME.TABLE_NAME&amp;gt; WHERE &amp;lt;DATE_FIELD&amp;gt; = CURRENT_DATE'''

trunc_task = SnowflakeOperator(
               task_id='TRUNCATE_TASK',
               sql=[TRUNC_QUERY],
               snowflake_conn_id='&amp;lt;connection_id&amp;gt;',
               database='&amp;lt;DATABASE_NAME&amp;gt;',
               schema='&amp;lt;SCHEMA_NAME&amp;gt;',
               warehouse = '&amp;lt;DATAWAREHOUSE_NAME&amp;gt;',
               role = '&amp;lt;ROLE_NAME&amp;gt;',
               dag=dag) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;UNZIPPING TASK&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For unzipping files in GCS bucket, we will use three libraries&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;zipfile&lt;/li&gt;
&lt;li&gt;io&lt;/li&gt;
&lt;li&gt;google-cloud-storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here, we define this task as a Python callable, which is then invoked by the PythonOperator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def unzip_file_in_gcs(**context):
    #Define GCS Client parameters
    bucket_name = '&amp;lt;BUCKET_NAME&amp;gt;'
    file_name = '&amp;lt;FILE_NAME&amp;gt;.zip'

    # Connect to the GCS bucket
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(file_name)

    # Download the zip file to memory
    zip_file_content = blob.download_as_string()

    # Unzip the file
    zip_file = zipfile.ZipFile(io.BytesIO(zip_file_content))
    zip_file.extractall(path='/home/airflow/gcs/data/temp/')

    # Upload the extracted CSV file back to the GCS bucket
    with open('/home/airflow/gcs/data/temp/&amp;lt;FILE_NAME&amp;gt;.csv', 'rb') as f:
        file_content = f.read()
        new_blob = bucket.blob('&amp;lt;FILE_NAME&amp;gt;.csv')
        new_blob.upload_from_string(file_content)

unzip_task = PythonOperator(
        task_id="UNZIP_TASK",
        python_callable=unzip_file_in_gcs,
        provide_context=True,
        dag=dag
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;LOADING TASK&lt;/strong&gt;&lt;br&gt;
Once unzipping is done, now you can use &lt;code&gt;COPY INTO &amp;lt;TABLE_NAME&amp;gt;&lt;/code&gt; statement to load the data into Snowflake table.&lt;/p&gt;

&lt;p&gt;Here is the task definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LOAD_QUERY = '''COPY INTO &amp;lt;DATABASE_NAME&amp;gt;.&amp;lt;SCHEMA_NAME&amp;gt;.&amp;lt;TABLE_NAME&amp;gt; FROM @&amp;lt;DATABASE_NAME&amp;gt;.&amp;lt;SCHEMA_NAME&amp;gt;.&amp;lt;STAGE_NAME&amp;gt;/&amp;lt;FILE_NAME&amp;gt;.csv
file_format = (format_name = &amp;lt;DATABASE_NAME&amp;gt;.&amp;lt;SCHEMA_NAME&amp;gt;.FF_CSV)'''

load_task = SnowflakeOperator(
        task_id='LOAD_TASK',
        sql=[LOAD_QUERY],
        snowflake_conn_id='&amp;lt;connection_id&amp;gt;',
        database='&amp;lt;DATABASE_NAME&amp;gt;',
        schema='&amp;lt;SCHEMA_NAME&amp;gt;',
        warehouse = '&amp;lt;DATAWAREHOUSE_NAME&amp;gt;',
        role = '&amp;lt;ROLE_NAME&amp;gt;',
        dag=dag) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Providing Task Flow&lt;/strong&gt;&lt;br&gt;
At last, you have to bring all the tasks together to define the execution order. Add this snippet at the end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with dag:
  trunc_task &amp;gt;&amp;gt; unzip_task &amp;gt;&amp;gt; load_task
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Upload and Run the DAG script&lt;/strong&gt;&lt;br&gt;
Now upload the DAG script you just created into the GCS bucket attached to the Composer environment. The Apache Airflow web UI will automatically reflect the new DAG after a few minutes, and it will start running.&lt;/p&gt;
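&lt;p&gt;If you prefer to script the upload as well, here is a small sketch using the google-cloud-storage client. The bucket name is a placeholder, and note that Composer only parses files under the bucket's &lt;code&gt;dags/&lt;/code&gt; prefix:&lt;/p&gt;

```python
def dag_destination(dag_file: str) -> str:
    """Return the object path where Composer expects DAG files."""
    # Composer's scheduler only picks up files under dags/.
    return "dags/" + dag_file


def upload_dag(bucket_name: str, dag_file: str) -> None:
    # Imported here so the pure helper above has no dependencies.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(dag_destination(dag_file))
    blob.upload_from_filename(dag_file)


# Example usage (hypothetical bucket name):
# upload_dag("us-central1-my-env-bucket", "DAG_Sflk_loader.py")
```

&lt;p&gt;You can find the environment's bucket name on the Composer environment details page in the GCP console.&lt;/p&gt;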

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
In this article, we learnt how to use Apache Airflow to load Zip files stored in a GCS bucket into a Snowflake table. We also went through the creation and deployment of a DAG in GCP Composer. In future articles, we will explore other integration methods in Apache Airflow.&lt;/p&gt;

&lt;p&gt;Till then, Goodbye!!&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
