I built a small end-to-end pipeline to simulate a common data engineering scenario: ingesting new files from cloud storage into a data platform automatically.
The pipeline:
- extracts trending songs data from Kworb
- writes the data as Parquet files
- uploads them to Google Cloud Storage (GCS)
- uses Databricks Autoloader to ingest new files incrementally
Architecture
The flow is straightforward:
Extract data from the source (Kworb) -> Store locally as Parquet -> Upload files to GCS -> Autoloader detects new files -> Data is written to a raw table
Data extraction
I used a Python script to collect trending songs and structure the data before saving it as Parquet.
```shell
uv run python src/extract_songs.py
```
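The extraction script itself isn't shown in the post. A minimal sketch of what it might look like, assuming the Kworb chart is a plain HTML table and hypothetical column names (`position`, `artist`, `title`, `streams`); the URL is an assumption, not the actual source:

```python
# Hypothetical sketch of src/extract_songs.py: scrape a Kworb chart table
# and persist it as Parquet. URL and column names are assumptions.
import pandas as pd

KWORB_URL = "https://kworb.net/spotify/country/global_daily.html"  # assumed chart page


def build_songs_frame(rows: list[dict]) -> pd.DataFrame:
    """Normalize scraped rows into a typed DataFrame before writing Parquet."""
    df = pd.DataFrame(rows)
    df["position"] = df["position"].astype("int64")
    df["streams"] = df["streams"].astype("int64")
    return df


def extract(path: str = "data/trending_songs.parquet") -> pd.DataFrame:
    # pandas.read_html returns every <table> on the page; take the first one.
    df = pd.read_html(KWORB_URL)[0]
    df.to_parquet(path, index=False)  # requires pyarrow or fastparquet
    return df
```

Typing the columns before writing matters here: Parquet is a typed format, so fixing types at extraction time keeps the downstream raw table consistent.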
Upload to GCS
To simulate a real ingestion layer, files are uploaded to a GCS bucket using Application Default Credentials.
```shell
gcloud auth login
gcloud config set project <YOUR_PROJECT_ID>
gcloud auth application-default login
```
Upload step:
```shell
uv run python src/load_files.py
```
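A sketch of what `src/load_files.py` might contain, using the `google-cloud-storage` client with ADC. The bucket prefix (`raw/songs`) and local directory (`data`) are assumptions for illustration:

```python
# Hypothetical sketch of src/load_files.py: upload local Parquet files to GCS.
# Relies on Application Default Credentials set up via
# `gcloud auth application-default login`.
from pathlib import Path


def blob_name(local_path: Path, base_dir: Path, prefix: str = "raw/songs") -> str:
    """Map a local file path to its destination object name in the bucket."""
    return f"{prefix}/{local_path.relative_to(base_dir).as_posix()}"


def upload_parquet(bucket_name: str, base_dir: str = "data") -> int:
    # Imported lazily so the path helper above works without the client installed.
    from google.cloud import storage  # pip install google-cloud-storage

    client = storage.Client()  # picks up ADC automatically
    bucket = client.bucket(bucket_name)
    count = 0
    for path in Path(base_dir).rglob("*.parquet"):
        bucket.blob(blob_name(path, Path(base_dir))).upload_from_filename(str(path))
        count += 1
    return count
```

Keeping the object-name mapping in its own function makes the bucket layout easy to change later, which matters because Autoloader's file discovery cost depends on how the bucket is organized.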
Autoloader ingestion
On the Databricks side, Autoloader is used to ingest new files incrementally.
- Source: GCS bucket
- Format: Parquet
- Target:
  - Catalog: songs_trending
  - Schema: raw
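The catalog and schema need to exist before the stream first writes to them. A one-time setup sketch, assuming Unity Catalog is enabled and run in a Databricks notebook where a `spark` session is available:

```python
def ensure_target(spark):
    # Idempotent thanks to IF NOT EXISTS: safe to re-run before each deployment.
    spark.sql("CREATE CATALOG IF NOT EXISTS songs_trending")
    spark.sql("CREATE SCHEMA IF NOT EXISTS songs_trending.raw")
```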
Basic example:
```python
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    # Autoloader needs a schema location to persist the inferred schema across runs
    .option("cloudFiles.schemaLocation", "gs://<bucket>/_schemas/songs_raw")
    .load("gs://<bucket>/path")
)
```
This avoids reprocessing files and handles file discovery automatically.
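The read stream above only defines the source; the checkpoint on the write side is what actually prevents reprocessing. A hedged sketch of that step, assuming a checkpoint path and table name I made up for illustration (Databricks-only; `df` is the stream from the example above):

```python
def write_raw(df):
    # The checkpoint records which files were already ingested,
    # so restarting the job never reprocesses them.
    return (
        df.writeStream
        .option("checkpointLocation", "gs://<bucket>/_checkpoints/songs_raw")
        .trigger(availableNow=True)  # process all pending files, then stop
        .toTable("songs_trending.raw.songs")
    )
```

`availableNow` makes the stream behave like an incremental batch job, which fits a pipeline triggered on a schedule rather than running continuously.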
Takeaways
- Parquet simplifies ingestion and downstream usage
- Autoloader reduces the need for manual orchestration
- Authentication via ADC is straightforward but easy to misconfigure locally
- File organization in the bucket impacts performance and scalability
This is a minimal setup, but it reflects a common pattern in data engineering: data lands in object storage and is incrementally ingested into a platform.
Building small pipelines like this is a good way to practice real-world concepts beyond theory.