I built a small end-to-end pipeline to simulate a common data engineering scenario: ingesting new files from cloud storage into a data platform automatically.
The pipeline:
- extracts trending songs data from Kworb
- writes the data as Parquet files
- uploads them to Google Cloud Storage (GCS)
- uses Databricks Autoloader to ingest new files incrementally
Architecture
The flow is straightforward:
Extract data from the source (Kworb) -> Store locally as Parquet -> Upload files to GCS -> Autoloader detects new files -> Data is written to a raw table
Data extraction
I used a Python script to collect trending songs and structure the data before saving it as Parquet.
```shell
uv run python src/extract_songs.py
```
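The extraction script itself isn't shown in the post. A minimal sketch of what it might look like, assuming the Kworb chart is a plain HTML table and hypothetical column names (`position`, `artist`, `title`, `streams`); the URL is an assumption, not the actual source:

```python
# Hypothetical sketch of src/extract_songs.py: scrape a Kworb chart table
# and persist it as Parquet. URL and column names are assumptions.
import pandas as pd

KWORB_URL = "https://kworb.net/spotify/country/global_daily.html"  # assumed chart page


def build_songs_frame(rows: list[dict]) -> pd.DataFrame:
    """Normalize scraped rows into a typed DataFrame before writing Parquet."""
    df = pd.DataFrame(rows)
    df["position"] = df["position"].astype("int64")
    df["streams"] = df["streams"].astype("int64")
    return df


def extract(path: str = "data/trending_songs.parquet") -> pd.DataFrame:
    # pandas.read_html returns every <table> on the page; take the first one.
    df = pd.read_html(KWORB_URL)[0]
    df.to_parquet(path, index=False)  # requires pyarrow or fastparquet
    return df
```

Typing the columns before writing matters here: Parquet is a typed format, so fixing types at extraction time keeps the downstream raw table consistent.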
Upload to GCS
To simulate a real ingestion layer, files are uploaded to a GCS bucket using Application Default Credentials.
```shell
gcloud auth login
gcloud config set project <YOUR_PROJECT_ID>
gcloud auth application-default login
```
Upload step:
```shell
uv run python src/load_files.py
```
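A sketch of what `src/load_files.py` might contain, using the `google-cloud-storage` client with ADC. The bucket prefix (`raw/songs`) and local directory (`data`) are assumptions for illustration:

```python
# Hypothetical sketch of src/load_files.py: upload local Parquet files to GCS.
# Relies on Application Default Credentials set up via
# `gcloud auth application-default login`.
from pathlib import Path


def blob_name(local_path: Path, base_dir: Path, prefix: str = "raw/songs") -> str:
    """Map a local file path to its destination object name in the bucket."""
    return f"{prefix}/{local_path.relative_to(base_dir).as_posix()}"


def upload_parquet(bucket_name: str, base_dir: str = "data") -> int:
    # Imported lazily so the path helper above works without the client installed.
    from google.cloud import storage  # pip install google-cloud-storage

    client = storage.Client()  # picks up ADC automatically
    bucket = client.bucket(bucket_name)
    count = 0
    for path in Path(base_dir).rglob("*.parquet"):
        bucket.blob(blob_name(path, Path(base_dir))).upload_from_filename(str(path))
        count += 1
    return count
```

Keeping the object-name mapping in its own function makes the bucket layout easy to change later, which matters because Autoloader's file discovery cost depends on how the bucket is organized.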
Autoloader ingestion
On the Databricks side, Autoloader is used to ingest new files incrementally.
- Source: GCS bucket
- Format: Parquet
- Target:
  - Catalog: songs_trending
  - Schema: raw
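The catalog and schema need to exist before the stream first writes to them. A one-time setup sketch, assuming Unity Catalog is enabled and run in a Databricks notebook where a `spark` session is available:

```python
def ensure_target(spark):
    # Idempotent thanks to IF NOT EXISTS: safe to re-run before each deployment.
    spark.sql("CREATE CATALOG IF NOT EXISTS songs_trending")
    spark.sql("CREATE SCHEMA IF NOT EXISTS songs_trending.raw")
```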
Basic example:
```python
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    # Autoloader needs a schema location to persist the inferred schema across runs
    .option("cloudFiles.schemaLocation", "gs://<bucket>/_schemas/songs_raw")
    .load("gs://<bucket>/path")
)
```
This avoids reprocessing files and handles file discovery automatically.
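The read stream above only defines the source; the checkpoint on the write side is what actually prevents reprocessing. A hedged sketch of that step, assuming a checkpoint path and table name I made up for illustration (Databricks-only; `df` is the stream from the example above):

```python
def write_raw(df):
    # The checkpoint records which files were already ingested,
    # so restarting the job never reprocesses them.
    return (
        df.writeStream
        .option("checkpointLocation", "gs://<bucket>/_checkpoints/songs_raw")
        .trigger(availableNow=True)  # process all pending files, then stop
        .toTable("songs_trending.raw.songs")
    )
```

`availableNow` makes the stream behave like an incremental batch job, which fits a pipeline triggered on a schedule rather than running continuously.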
Takeaways
- Parquet simplifies ingestion and downstream usage
- Autoloader reduces the need for manual orchestration
- Authentication via ADC is straightforward but easy to misconfigure locally
- File organization in the bucket impacts performance and scalability
This is a minimal setup, but it reflects a common pattern in data engineering: data lands in object storage and is incrementally ingested into a platform.
Building small pipelines like this is a good way to practice real-world concepts beyond theory.