<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yasmim</title>
    <description>The latest articles on DEV Community by Yasmim (@tinyhero13).</description>
    <link>https://dev.to/tinyhero13</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F550912%2Fe1040881-8649-4e88-9701-0a36053c184b.png</url>
      <title>DEV Community: Yasmim</title>
      <link>https://dev.to/tinyhero13</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tinyhero13"/>
    <language>en</language>
    <item>
      <title>Building a simple data pipeline with GCS and Databricks Autoloader</title>
      <dc:creator>Yasmim</dc:creator>
      <pubDate>Tue, 21 Apr 2026 14:15:57 +0000</pubDate>
      <link>https://dev.to/tinyhero13/building-a-simple-data-pipeline-with-gcs-and-databricks-autoloader-350d</link>
      <guid>https://dev.to/tinyhero13/building-a-simple-data-pipeline-with-gcs-and-databricks-autoloader-350d</guid>
      <description>&lt;p&gt;I built a small end-to-end pipeline to simulate a common data engineering scenario: ingesting new files from cloud storage into a data platform automatically.&lt;/p&gt;

&lt;h2&gt;The pipeline&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;extracts trending songs data from Kworb&lt;/li&gt;
  &lt;li&gt;writes the data as Parquet files&lt;/li&gt;
  &lt;li&gt;uploads them to Google Cloud Storage (GCS)&lt;/li&gt;
  &lt;li&gt;uses Databricks Autoloader to ingest new files incrementally&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Architecture&lt;/h2&gt;

&lt;p&gt;The flow is straightforward:&lt;/p&gt;

&lt;p&gt;Extract data from the source (Kworb) -&amp;gt; Store locally as Parquet -&amp;gt; Upload files to GCS -&amp;gt; Autoloader detects new files -&amp;gt; Data is written to a raw table&lt;/p&gt;

&lt;h2&gt;Data extraction&lt;/h2&gt;

&lt;p&gt;I used a Python script to collect trending songs and structure the data before saving it as Parquet.&lt;/p&gt;
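&lt;p&gt;The actual &lt;code&gt;src/extract_songs.py&lt;/code&gt; is not shown here, but the step can be sketched roughly as follows. Everything in this sketch is an assumption on my part: the helper names, the column handling, and the idea of reading the Kworb page with &lt;code&gt;pandas.read_html&lt;/code&gt; (Kworb publishes plain HTML tables). Writing Parquet requires pyarrow or fastparquet to be installed.&lt;/p&gt;

```python
from datetime import date
from pathlib import Path

import pandas as pd


def tidy_songs(raw: pd.DataFrame) -> pd.DataFrame:
    """Normalize column names and stamp each row with the extraction date."""
    df = raw.copy()
    df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]
    df["extracted_at"] = date.today().isoformat()
    return df


def extract_songs(url: str, out_dir: Path) -> Path:
    """Read the first HTML table on the page and save it locally as Parquet."""
    raw = pd.read_html(url)[0]       # Kworb trending pages are plain HTML tables
    df = tidy_songs(raw)
    out_dir.mkdir(parents=True, exist_ok=True)
    out = out_dir / f"songs_{date.today().isoformat()}.parquet"
    df.to_parquet(out, index=False)  # requires pyarrow or fastparquet
    return out
```

&lt;p&gt;Keeping the cleanup in its own function makes the transform easy to test without touching the network.&lt;/p&gt;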

&lt;pre&gt;&lt;code&gt;uv run python src/extract_songs.py&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Upload to GCS&lt;/h2&gt;

&lt;p&gt;To simulate a real ingestion layer, files are uploaded to a GCS bucket using Application Default Credentials.&lt;/p&gt;
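&lt;p&gt;The upload script itself is not shown; a minimal sketch of what it might look like with the &lt;code&gt;google-cloud-storage&lt;/code&gt; client (which picks up Application Default Credentials automatically) is below. The bucket name, prefix, and helper names are placeholders, not the article's actual code.&lt;/p&gt;

```python
from pathlib import Path


def blob_name_for(prefix: str, filename: str) -> str:
    """Destination object name inside the bucket, e.g. raw/songs/file.parquet."""
    return f"{prefix.strip('/')}/{filename}"


def upload_parquet_files(local_dir: Path, bucket_name: str, prefix: str) -> list:
    """Upload every local Parquet file to the bucket, keeping file names."""
    # Imported lazily so the pure path logic above works without the SDK installed.
    from google.cloud import storage  # authenticates via Application Default Credentials

    client = storage.Client()
    bucket = client.bucket(bucket_name)
    uploaded = []
    for path in sorted(local_dir.glob("*.parquet")):
        blob_name = blob_name_for(prefix, path.name)
        bucket.blob(blob_name).upload_from_filename(str(path))
        uploaded.append(blob_name)
    return uploaded
```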

&lt;pre&gt;&lt;code&gt;gcloud auth login
gcloud config set project &amp;lt;YOUR_PROJECT_ID&amp;gt;
gcloud auth application-default login&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Upload step:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;uv run python src/load_files.py&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Autoloader ingestion&lt;/h2&gt;

&lt;p&gt;On the Databricks side, Autoloader is used to ingest new files incrementally.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Source: GCS bucket&lt;/li&gt;
  &lt;li&gt;Format: Parquet&lt;/li&gt;
  &lt;li&gt;Target catalog: songs_trending&lt;/li&gt;
  &lt;li&gt;Target schema: raw&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Basic example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        # Schema inference on a stream needs somewhere to persist the schema:
        .option("cloudFiles.schemaLocation", "gs://&amp;lt;bucket&amp;gt;/_schemas")
        .load("gs://&amp;lt;bucket&amp;gt;/path")
)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This avoids reprocessing files and handles file discovery automatically.&lt;/p&gt;
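&lt;p&gt;Reading the stream alone does nothing until a &lt;code&gt;writeStream&lt;/code&gt; is attached. A sketch of the full read-and-write side follows; it only runs inside a Databricks/Spark session, and the paths and target table name are placeholders rather than the article's actual values.&lt;/p&gt;

```python
def start_ingestion(spark, source_path: str, checkpoint_path: str, target_table: str):
    """Incrementally ingest new Parquet files from GCS into a raw Delta table.

    Sketch only: all four arguments are placeholders, and a live Spark
    session on Databricks is required to actually run this.
    """
    df = (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .option("cloudFiles.schemaLocation", checkpoint_path)  # persists inferred schema
        .load(source_path)
    )
    return (
        df.writeStream
        .option("checkpointLocation", checkpoint_path)  # tracks already-processed files
        .trigger(availableNow=True)                     # drain pending files, then stop
        .toTable(target_table)                          # e.g. songs_trending.raw.songs
    )
```

&lt;p&gt;The &lt;code&gt;availableNow&lt;/code&gt; trigger makes the stream behave like a batch job, which fits a scheduled ingestion pattern like this one.&lt;/p&gt;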

&lt;p&gt;A few takeaways:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Parquet simplifies ingestion and downstream usage&lt;/li&gt;
  &lt;li&gt;Autoloader reduces the need for manual orchestration&lt;/li&gt;
  &lt;li&gt;Authentication via ADC is straightforward but easy to misconfigure locally&lt;/li&gt;
  &lt;li&gt;File organization in the bucket impacts performance and scalability&lt;/li&gt;
&lt;/ul&gt;
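&lt;p&gt;On that last point, one common layout choice is Hive-style date partitioning of object names, so that downstream readers can prune by date. This is a sketch of my own, not the article's layout; the prefix and partition column name are arbitrary.&lt;/p&gt;

```python
from datetime import date


def partitioned_blob_name(prefix: str, filename: str, run_date: date) -> str:
    """Hive-style date partition so readers can prune objects by ingest_date."""
    return f"{prefix}/ingest_date={run_date.isoformat()}/{filename}"
```

&lt;p&gt;With this layout, each daily run lands under its own &lt;code&gt;ingest_date=...&lt;/code&gt; prefix instead of piling every file into one flat directory.&lt;/p&gt;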

&lt;p&gt;This is a minimal setup, but it reflects a common pattern in data engineering: data lands in object storage and is incrementally ingested into a platform.&lt;/p&gt;

&lt;p&gt;Building small pipelines like this is a good way to practice real-world concepts beyond theory.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>dataengineering</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
