Ranga Bashyam G

A Lightweight Big Data Stack for Python Engineers

Hi and greetings to the dev.to community!

This is my very first blog here, and I'm excited to share my thoughts and experiences with you all.

Over the years, I've primarily worked with Python-based technologies, so I’m quite comfortable with tools and libraries like Flask, Apache Airflow (DAGs), Pandas, PyArrow, and DuckDB. While I haven’t focused much on tools like PySpark or Hadoop, I’ve been deeply involved in handling large-scale data using Parquet files, performing data cleaning, designing robust pipelines, and deploying data workflows in a modular and scalable way.

Though my core expertise lies in Artificial Intelligence and Data Science, I’ve also taken on the role of a Data Engineer for several years, working across backend systems and real-time pipelines.

I'm happy to be part of this community, and I look forward to sharing more technical insights and learning from all of you.

Let’s dive into the world of Data Engineering!

What is Data Engineering?

Data Engineering is a critical discipline within the broader data ecosystem that focuses on building and maintaining the architecture, pipelines, and systems necessary for the collection, storage, and processing of large volumes of data. It is the foundation that supports data science, analytics, and machine learning operations. At its core, data engineering deals with designing robust and scalable systems that move data from various sources into forms that are usable by downstream applications.

A data engineer is responsible for ensuring that data is not only collected but also cleaned, structured, and made available in a timely manner. This involves a strong understanding of databases, distributed systems, scripting, and workflow orchestration tools. It is not a one-size-fits-all role; depending on the scale and nature of the organization, a data engineer might wear many hats—ranging from data ingestion and transformation to cloud infrastructure setup and pipeline optimization.


ETL and ELT Pipelines

A central task in data engineering is the implementation of ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) pipelines.

ETL is the traditional method where data is first extracted from source systems such as transactional databases, APIs, or flat files. It is then transformed—cleaned, aggregated, and reshaped—before being loaded into a destination system, often a data warehouse. This approach works well when transformation is done outside the warehouse, especially when transformation logic is complex or the warehouse is compute-constrained.

On the other hand, ELT has gained popularity with the rise of cloud-native data warehouses like Snowflake, BigQuery, and Redshift. In ELT, raw data is loaded directly into the warehouse, and all transformations are performed post-load. This method benefits from the massive parallelism and compute power of modern data warehouses and keeps raw data accessible for reprocessing.
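To make the ELT pattern concrete, here is a minimal sketch that uses DuckDB as a stand-in for the warehouse; the file name and columns are purely illustrative, not tied to any real source system.

import duckdb

# A local DuckDB file standing in for the warehouse
con = duckdb.connect("warehouse.duckdb")

# Load: copy the raw extract into the warehouse untouched
con.execute("""
    CREATE OR REPLACE TABLE raw_orders AS
    SELECT * FROM read_csv_auto('orders_export.csv')
""")

# Transform: build a cleaned model on top of the raw table, inside the engine
con.execute("""
    CREATE OR REPLACE TABLE orders_clean AS
    SELECT order_id, CAST(order_ts AS TIMESTAMP) AS order_ts, amount
    FROM raw_orders
    WHERE amount IS NOT NULL
""")

Because the raw table is kept as-is, the transformation can be rerun or revised later without going back to the source system.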


Working with Data Lakes (IBM Cloud Object Storage)

A data lake is a centralized repository designed to store all structured, semi-structured, and unstructured data at scale. Unlike a data warehouse that requires predefined schemas, data lakes support schema-on-read, allowing more flexibility for exploration and modeling. That said, schema definitions still matter: a consistent folder layout keeps the lake navigable, and defining an explicit schema is especially important when writing Parquet files.
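As a minimal sketch of what an explicit schema looks like in practice, here is how a Parquet file might be written with PyArrow; the column names, types, and file name are illustrative, not tied to any particular dataset.

from datetime import datetime
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative schema: declare column names and types explicitly
schema = pa.schema([
    ("event_id", pa.int64()),
    ("event_time", pa.timestamp("us")),
    ("payload", pa.string()),
    ("amount", pa.float64()),
])

# Build a small table conforming to the schema and write it out as Parquet
table = pa.table(
    {
        "event_id": [1, 2],
        "event_time": [datetime(2023, 1, 1, 8, 0), datetime(2023, 1, 1, 9, 30)],
        "payload": ["a", "b"],
        "amount": [9.99, 12.50],
    },
    schema=schema,
)
pq.write_table(table, "events.parquet")

# Downstream readers can rely on the declared types instead of inferring them
print(pq.read_schema("events.parquet"))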

IBM Cloud Object Storage (COS) is a popular choice for building data lakes, especially in hybrid cloud environments. It offers durability, scalability, and support for open data formats like Parquet and ORC. Engineers often use IBM COS as a staging ground for raw and processed data before it is ingested into analytics or machine learning workflows.

In practice, data engineers use services like IBM COS to store logs, streaming data, and backup files. The stored data is accessed using Python libraries such as boto3 or ibm_boto3.

Accessing a COS bucket from Python:

import ibm_boto3
from ibm_botocore.client import Config

# Create a COS client; replace API_KEY and SERVICE_ID with your IBM Cloud
# credentials and use the endpoint for your bucket's region
cos = ibm_boto3.client(
    "s3",
    ibm_api_key_id="API_KEY",
    ibm_service_instance_id="SERVICE_ID",
    config=Config(signature_version="oauth"),
    endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud"
)

# List files in a bucket
objects = cos.list_objects_v2(Bucket="my-bucket").get("Contents", [])
for obj in objects:
    print(obj["Key"], obj["Size"])

Pandas, SQL, Parquet, DuckDB, and PyArrow

Data engineering often involves working with various tools and formats for transformation and storage. Here’s how these technologies fit into a typical stack:

  • Pandas: A go-to Python library for data manipulation, ideal for small to medium datasets. While not optimal for big data, it is excellent for rapid prototyping and local transformations.

  • SQL: Structured Query Language remains a cornerstone of data transformations. Whether running in PostgreSQL, Snowflake, or embedded systems like DuckDB, SQL is used to clean, join, filter, and aggregate data.

  • Parquet: A columnar storage format that supports efficient querying and compression. It is widely used for storing processed datasets in data lakes due to its performance benefits in analytics.

  • DuckDB: An in-process SQL OLAP database that can query Parquet files directly without loading them fully into memory. It allows data engineers to write complex SQL queries on large datasets stored in files, making it excellent for fast, local experimentation.

  • PyArrow: A Python binding for Apache Arrow, enabling efficient serialization of data between systems. PyArrow is used under the hood by many libraries (including Pandas and DuckDB) to enable zero-copy reads and writes, boosting performance.

Together, these tools form a powerful suite for local and scalable data processing. A typical use case might involve reading a Parquet file from IBM COS using PyArrow, manipulating it with Pandas or DuckDB, and exporting it to a data warehouse via an ELT pipeline.
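Here's a rough sketch of that flow: download a Parquet object from COS into an Arrow table and query it with DuckDB. It reuses the cos client from the earlier snippet, and the bucket and key names are placeholders.

import io
import duckdb
import pyarrow.parquet as pq

# Download a Parquet object from COS into an Arrow table
# (bucket and key are placeholders)
obj = cos.get_object(Bucket="my-bucket", Key="processed/trips.parquet")
trips = pq.read_table(io.BytesIO(obj["Body"].read()))

# Register the Arrow table with DuckDB and query it with plain SQL
con = duckdb.connect()
con.register("trips", trips)
summary = con.execute("SELECT COUNT(*) AS row_count FROM trips").fetchdf()
print(summary)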

Why DuckDB is a Better Fit Than Modin or Vaex for Large-Scale Data Processing

Using DuckDB over Modin or Vaex is often a more robust and scalable approach when working with large datasets—particularly in Parquet format. DuckDB is highly efficient at processing queries directly on disk without loading the full dataset into memory. Attempting to perform complex operations like correlation or aggregations directly on massive in-memory dataframes is not only memory-intensive but can also be error-prone or slow. A better pattern is to convert the DataFrame to Parquet and use DuckDB to query and process the data efficiently. This offers both speed and scalability in a single-node environment.
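A minimal sketch of that pattern, assuming a pandas DataFrame df standing in for a large in-memory dataset:

import duckdb
import pandas as pd

# Hypothetical DataFrame standing in for a much larger dataset
df = pd.DataFrame({
    "category": ["a", "b", "a", "b"],
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.1, 6.2, 7.9],
})

# Persist it as Parquet once...
df.to_parquet("big_table.parquet")

# ...then let DuckDB aggregate and correlate on disk instead of in RAM
result = duckdb.sql("""
    SELECT category, COUNT(*) AS n, CORR(x, y) AS xy_correlation
    FROM read_parquet('big_table.parquet')
    GROUP BY category
""").df()
print(result)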

Moreover, complex operations like joins, filters, aggregations, and correlations are optimized inside DuckDB’s vectorized execution engine. It handles query planning and execution more efficiently than the implicit operations in Modin or Vaex, which often delegate tasks to backends like Dask or rely on caching in RAM.


Some Basic Code-Level Implementations

Here's a concise overview of the dataset we'll be working with: the NYC Yellow Taxi Trip Data.


Dataset Overview: NYC Yellow Taxi Trip Records (2023)

The NYC Yellow Taxi Trip Dataset is a public dataset provided by the New York City Taxi & Limousine Commission (TLC). It contains detailed records of individual taxi trips taken in NYC, collected directly from the taxi meters and GPS systems.

For this example, we're using data from January 2023.

Parquet file URL:
https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet


Schema (Sample Columns)

| Column Name | Description |
| --- | --- |
| vendorid | ID of the taxi provider (1 or 2) |
| tpep_pickup_datetime | Timestamp when the trip started |
| tpep_dropoff_datetime | Timestamp when the trip ended |
| passenger_count | Number of passengers |
| trip_distance | Distance of the trip in miles |
| ratecodeid | Rate type (standard, JFK, Newark, etc.) |
| store_and_fwd_flag | Whether the trip record was stored and forwarded due to loss of signal |
| payment_type | Type of payment (credit card, cash, etc.) |
| fare_amount | Base fare of the trip |
| extra | Additional charges (e.g., peak hour) |
| mta_tax | NY MTA tax |
| tip_amount | Tip paid by the passenger |
| tolls_amount | Tolls charged during the trip |
| improvement_surcharge | Fixed surcharge |
| total_amount | Total charged to the passenger |

Size and Volume

  • ~7 to 10 million rows per month
  • File size ranges from 300MB to 1GB+ in Parquet format
  • Data is stored in a columnar format, making it efficient for analytics
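Before running anything heavier, it's worth sanity-checking the file straight from the URL. Here's a small sketch with DuckDB; recent DuckDB versions auto-load the httpfs extension for HTTPS reads, while older ones may need an explicit INSTALL httpfs; LOAD httpfs; first.

import duckdb

url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet"

# Inspect column names and types without downloading the full file
print(duckdb.sql(f"DESCRIBE SELECT * FROM read_parquet('{url}')"))

# Confirm the rough row count quoted above
print(duckdb.sql(f"SELECT COUNT(*) AS row_count FROM read_parquet('{url}')"))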

We’ll cover:

  1. Trip duration and fare by hour of day
  2. Average fare per distance bucket
  3. Vendor-wise earnings

Basic configuration:

import duckdb

# Create an in-memory DuckDB connection
con = duckdb.connect()

# Public TLC Parquet file for January 2023, read over HTTPS
parquet_url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet"

1. Trip duration and fare by hour of day:

# Calculate average trip duration and fare per hour of the day
query1 = f"""
    SELECT 
        EXTRACT(hour FROM tpep_pickup_datetime) AS pickup_hour,
        COUNT(*) AS total_trips,
        AVG(DATE_DIFF('minute', tpep_pickup_datetime, tpep_dropoff_datetime)) AS avg_trip_duration_min,
        AVG(total_amount) AS avg_total_fare
    FROM read_parquet('{parquet_url}')
    WHERE 
        tpep_dropoff_datetime > tpep_pickup_datetime 
        AND total_amount > 0
    GROUP BY pickup_hour
    ORDER BY pickup_hour
"""

result1 = con.execute(query1).fetchdf()
print("\nTrip Duration & Fare by Hour of Day:")
print(result1)

Output: one row per pickup hour, with the number of trips, average trip duration in minutes, and average total fare.


2. Average fare per distance bucket:

# Bucket trip distances and calculate average fare per bucket
query2 = f"""
    SELECT 
        CASE 
            WHEN trip_distance BETWEEN 0 AND 1 THEN '0-1 mi'
            WHEN trip_distance BETWEEN 1 AND 3 THEN '1-3 mi'
            WHEN trip_distance BETWEEN 3 AND 5 THEN '3-5 mi'
            WHEN trip_distance BETWEEN 5 AND 10 THEN '5-10 mi'
            ELSE '>10 mi'
        END AS distance_bucket,
        COUNT(*) AS num_trips,
        AVG(total_amount) AS avg_fare
    FROM read_parquet('{parquet_url}')
    WHERE total_amount > 0 AND trip_distance > 0
    GROUP BY distance_bucket
    ORDER BY num_trips DESC
"""

result2 = con.execute(query2).fetchdf()
print("\nFare vs Distance Buckets:")
print(result2)

Output: one row per distance bucket, with the number of trips and the average fare, ordered by trip count.


3. Vendor-wise earnings:

# Vendor-wise Earnings
query3 = f"""
    SELECT
        vendorid,
        COUNT(*) AS num_trips,
        SUM(total_amount) AS total_revenue,
        AVG(total_amount) AS avg_fare
    FROM read_parquet('{parquet_url}')
    WHERE total_amount > 0
    GROUP BY vendorid
    ORDER BY total_revenue DESC
    LIMIT 5
"""

result3 = con.execute(query3).fetchdf()
print("\nVendor-wise Earnings:")
print(result3)

Output: one row per vendor ID, with the number of trips, total revenue, and average fare, ordered by revenue.


Orchestrating Pipelines with Apache Airflow

Once data pipelines are defined, orchestrating them becomes a challenge—especially when multiple tasks need to be scheduled, retried on failure, and monitored. Apache Airflow solves this by allowing engineers to define workflows as Directed Acyclic Graphs (DAGs) using Python.

Airflow supports task dependencies, scheduling, retries, logging, and alerting out-of-the-box. Each task in Airflow is executed by an operator. For instance, a PythonOperator might run a transformation script, while a BashOperator could trigger a shell script to ingest data.

A typical Airflow pipeline might look like this:

  1. Pull data from an API using a Python script.
  2. Load the raw data into IBM Cloud Object Storage.
  3. Run a DuckDB transformation on the stored Parquet files.
  4. Export the clean data into a data warehouse.
  5. Trigger a Slack or email notification on completion.

Airflow's extensibility, combined with its scheduling and monitoring features, makes it the default choice for modern data engineering teams.
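As a rough sketch of how such a pipeline might be wired up, assuming Airflow 2.4+ (where schedule= replaces schedule_interval=); the task callables are placeholders for your own extraction, COS upload, and DuckDB transformation logic.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_from_api():
    # Placeholder: call the source API and stage the raw payload locally
    ...

def load_raw_to_cos():
    # Placeholder: upload the raw file to IBM COS with ibm_boto3
    ...

def transform_with_duckdb():
    # Placeholder: run DuckDB SQL over the staged Parquet files
    ...

with DAG(
    dag_id="taxi_elt_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_from_api", python_callable=extract_from_api)
    load_raw = PythonOperator(task_id="load_raw_to_cos", python_callable=load_raw_to_cos)
    transform = PythonOperator(task_id="transform_with_duckdb", python_callable=transform_with_duckdb)

    # Task dependencies define the DAG edges
    extract >> load_raw >> transform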


Conclusion

Data Engineering is both a foundation and a force multiplier for modern analytics and AI systems. From building reliable ETL/ELT pipelines to managing petabytes of data in cloud storage, the role demands a mix of software engineering, data modeling, and system design skills.

As the data landscape evolves, so do the tools and techniques. Technologies like DuckDB and PyArrow are transforming how we process data locally, while orchestrators like Airflow and cloud platforms like IBM COS make it easier to scale and automate data workflows. A successful data engineer needs to stay deeply technical, understand the underlying principles, and always design systems with scalability, reliability, and maintainability in mind.
