<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Luiz Oliveira</title>
    <description>The latest articles on DEV Community by Luiz Oliveira (@lholiv).</description>
    <link>https://dev.to/lholiv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3899605%2Fb12cadf9-a120-4e01-b497-948f696dc1eb.jpg</url>
      <title>DEV Community: Luiz Oliveira</title>
      <link>https://dev.to/lholiv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lholiv"/>
    <language>en</language>
    <item>
      <title>The real problem with ingesting MongoDB into Delta Lake (and how I built a library to fix it)</title>
      <dc:creator>Luiz Oliveira</dc:creator>
      <pubDate>Tue, 05 May 2026 11:52:47 +0000</pubDate>
      <link>https://dev.to/lholiv/the-real-problem-with-ingesting-mongodb-into-delta-lake-and-how-i-built-a-library-to-fix-it-48o5</link>
      <guid>https://dev.to/lholiv/the-real-problem-with-ingesting-mongodb-into-delta-lake-and-how-i-built-a-library-to-fix-it-48o5</guid>
      <description>&lt;p&gt;If you've ever built ETL pipelines pulling data from MongoDB into Delta Lake using Spark, you've probably hit this wall. The pipeline works fine — until it doesn't. A single document with an unexpected shape is enough to break the entire write, leave the table in an inconsistent state, and send your on-call engineer digging through Spark logs at 11pm.&lt;/p&gt;

&lt;p&gt;I built and maintained more than 10 of these jobs in my last role. After solving the same problem manually across every single one, I decided to build the abstraction that should have existed from the start: &lt;strong&gt;nosql-delta-bridge&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;nosql-delta-bridge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The problem isn't bad data — it's structural
&lt;/h2&gt;

&lt;p&gt;MongoDB's schema-free nature is a feature for application developers. For pipelines, it's a minefield. The problems came in three flavors:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Polymorphic fields
&lt;/h3&gt;

&lt;p&gt;Some collections had fields typed as &lt;code&gt;anyOf[object|bool|string]&lt;/code&gt; in the JSON Schema — completely valid from the application's perspective. A &lt;code&gt;status&lt;/code&gt; field might be a string in older documents and an integer in newer ones. A &lt;code&gt;value&lt;/code&gt; field might be a number, a boolean, or a nested object depending on which part of the application wrote it.&lt;/p&gt;

&lt;p&gt;Spark infers the schema from a sample at read time, commits to it, and the moment a document outside that sample has a different type, the entire write fails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AnalysisException: Cannot cast StringType to IntegerType
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The only safe workaround was casting everything to &lt;code&gt;StringType&lt;/code&gt; defensively — which meant no type guarantees in the raw Delta table and re-casting in every downstream job.&lt;/p&gt;
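&lt;p&gt;A stripped-down sketch of that failure mode in plain Python rather than Spark (the &lt;code&gt;infer_type&lt;/code&gt; helper and the documents are hypothetical, not from any library): infer a type from a sample, watch a later document violate it, and note what the stringify workaround costs:&lt;/p&gt;

```python
# Hypothetical sketch of sample-based type inference, as Spark does at read time.
def infer_type(sample, field):
    """Commit to the single type observed in a sample of documents."""
    types = {type(doc[field]) for doc in sample if field in doc}
    assert len(types) == 1, "sample itself is already polymorphic"
    return types.pop()

docs = [{"status": "paid"}] * 100 + [{"status": 3}]  # one late outlier

committed = infer_type(docs[:100], "status")  # the sample commits to str
assert committed is str

# The outlier breaks the contract. In Spark this surfaces as
# "AnalysisException: Cannot cast StringType to IntegerType".
bad = [d for d in docs if not isinstance(d["status"], committed)]
assert len(bad) == 1

# The defensive workaround: cast everything to string. The write succeeds,
# but the raw table now carries no type guarantees at all.
stringified = [{"status": str(d["status"])} for d in docs]
assert all(isinstance(d["status"], str) for d in stringified)
```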

&lt;h3&gt;
  
  
  2. Inconsistent nested structs
&lt;/h3&gt;

&lt;p&gt;Arrays of structs where fields appeared or disappeared depending on the document version. A subfield present in some documents, missing in others. Nested structs with subfields that changed shape across batches.&lt;/p&gt;

&lt;p&gt;Every job ended up with the same boilerplate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rebuild_struct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nf"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="nf"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rebuild the struct by hand. Cast every field explicitly. Handle missing fields with &lt;code&gt;lit(None)&lt;/code&gt;. Drop fields that appeared in some batches but not others. Repeat across every collection.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Silent failures
&lt;/h3&gt;

&lt;p&gt;When the pipeline didn't crash outright, bad documents were silently coerced or dropped. There was no dead-letter queue, no audit trail, no contract that said "this field must be this type." Problems surfaced three jobs downstream — not at the ingestion boundary where they actually happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  What existing tools don't solve
&lt;/h2&gt;

&lt;p&gt;A common suggestion in this space is to use a data observability tool like Elementary. Elementary is genuinely useful — but it operates at the table/model level. It tells you the table is unhealthy, not which document made it unhealthy.&lt;/p&gt;

&lt;p&gt;The investigation workflow without document-level isolation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Elementary fires an alert — table freshness failed&lt;/li&gt;
&lt;li&gt;Engineer checks Spark logs — finds a cast error&lt;/li&gt;
&lt;li&gt;Engineer traces back to MongoDB — tries to identify the offending document in a batch of 100k records&lt;/li&gt;
&lt;li&gt;Even after finding it, correcting the cast in Spark is either impossible or requires significant rework when the schema is inconsistent enough&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The inspection step is entirely manual, and finding the problematic document can take hours. And once you find it, you still have to figure out what to do with it while the rest of the batch sits unwritten.&lt;/p&gt;





&lt;h2&gt;
  
  
  How nosql-delta-bridge works
&lt;/h2&gt;

&lt;p&gt;The core idea is simple: &lt;strong&gt;every document either lands in the Delta table or goes to a dead-letter queue with an explicit rejection reason. Nothing is silently dropped. Nothing silently crashes the pipeline.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The workflow has two steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — Infer a schema contract from known-good historical data&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bridge infer historical.json &lt;span class="nt"&gt;--output&lt;/span&gt; payments.schema.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This generates a schema contract from a sample of documents you trust. The inference engine handles type conflicts using a configurable strategy — by default, the widest type wins and fields are nullable.&lt;/p&gt;
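&lt;p&gt;The exact widening rules aren't spelled out here, so the following is only a plausible sketch of a widest-type-wins merge under an assumed widening lattice, not the library's actual implementation:&lt;/p&gt;

```python
# Sketch of a widest-type-wins merge, assuming a simple widening lattice:
# numeric types widen together, anything irreconcilable falls back to string.
WIDENING = {
    frozenset({"int"}): "int",
    frozenset({"float"}): "float",
    frozenset({"int", "float"}): "float",  # numeric types widen together
}

def widest_type(observed):
    """Resolve a set of observed type names to one column type."""
    return WIDENING.get(frozenset(observed), "string")  # fallback: string

assert widest_type({"int"}) == "int"
assert widest_type({"int", "float"}) == "float"
# A genuinely polymorphic field (string in old docs, int in new ones)
# falls back to string: nullable, but explicit in the contract.
assert widest_type({"string", "int"}) == "string"
```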

&lt;p&gt;&lt;strong&gt;Step 2 — Ingest with validation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bridge ingest incoming.json ./delta/payments &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--schema&lt;/span&gt; payments.schema.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dlq&lt;/span&gt; rejected.ndjson
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;incoming.json · 1,000 documents · schema: payments.schema.json
  written:   994  →  delta/payments
  rejected:    6  →  rejected.ndjson
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 994 valid documents land in Delta Lake. The 6 that couldn't be reconciled go to the DLQ — with an explicit reason attached to each one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"abc123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"99.90"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"_dlq_reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cast failed on 'amount': expected double, got string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"_dlq_stage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"coerce"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"_dlq_ts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-04-28T14:32:01Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No log archaeology. No manual document hunting. The bad document is already isolated, already labeled, at the exact moment ingestion ran.&lt;/p&gt;
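&lt;p&gt;And because each rejected document carries its reason inline, triage is a few lines of stdlib Python. A sketch that groups a DLQ file by reason (field names follow the record format above; the inlined sample stands in for &lt;code&gt;rejected.ndjson&lt;/code&gt;):&lt;/p&gt;

```python
# Group a DLQ file by rejection reason using only the stdlib.
import json
from collections import Counter
from io import StringIO

# Stand-in for open("rejected.ndjson"): two rejected documents.
dlq_file = StringIO(
    '{"_id": "abc123", "_dlq_reason": "cast failed on \'amount\': expected double, got string"}\n'
    '{"_id": "def456", "_dlq_reason": "cast failed on \'amount\': expected double, got string"}\n'
)

reasons = Counter(json.loads(line)["_dlq_reason"] for line in dlq_file)
top_reason, count = reasons.most_common(1)[0]
assert count == 2  # both rejections share one root cause
```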




&lt;h2&gt;
  
  
  What it handles
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Field type mismatch (castable)&lt;/td&gt;
&lt;td&gt;Cast applied, document written&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Field type mismatch (not castable)&lt;/td&gt;
&lt;td&gt;Document → DLQ with reason&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Missing required field&lt;/td&gt;
&lt;td&gt;Document → DLQ with reason&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New field not in schema&lt;/td&gt;
&lt;td&gt;Configurable: reject or evolve schema&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full type migration (all docs changed type)&lt;/td&gt;
&lt;td&gt;0 written, all → DLQ + warning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nested struct with missing subfield&lt;/td&gt;
&lt;td&gt;Filled with null, document written&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Array of mixed types&lt;/td&gt;
&lt;td&gt;Configurable: cast to widest or reject&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
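&lt;p&gt;The first three rows of the table can be read as a per-field decision procedure. A hypothetical sketch of that logic, not the library's actual code:&lt;/p&gt;

```python
# Hypothetical sketch of the first three table rows as a per-field decision:
# castable mismatch -> cast and write; uncastable mismatch or missing
# required field -> DLQ with an explicit reason.
def check_field(doc, name, expected, required=True):
    """Return (value, None) on success or (None, dlq_reason) on rejection."""
    if name not in doc:
        if required:
            return None, f"missing required field '{name}'"
        return None, None  # optional and absent: fill with null
    value = doc[name]
    if isinstance(value, expected):
        return value, None
    try:
        return expected(value), None  # castable mismatch: cast, then write
    except (TypeError, ValueError):
        return None, f"cast failed on '{name}': expected {expected.__name__}"

ok, err = check_field({"amount": "99.90"}, "amount", float)
assert (ok, err) == (99.9, None)  # "99.90" casts cleanly, document written

ok, err = check_field({"amount": "ninety-nine"}, "amount", float)
assert ok is None and "cast failed" in err  # uncastable, goes to the DLQ

ok, err = check_field({}, "amount", float)
assert "missing required" in err
```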




&lt;h2&gt;
  
  
  Why pure Python and not Spark
&lt;/h2&gt;

&lt;p&gt;The MongoDB Connector for Apache Spark is the standard approach — but it requires a cluster. Most teams running smaller MongoDB collections don't need a full Spark environment just to move data into Delta Lake.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;nosql-delta-bridge&lt;/code&gt; uses &lt;code&gt;delta-rs&lt;/code&gt; under the hood: the Rust implementation of the Delta Lake protocol, exposed to Python through the &lt;code&gt;deltalake&lt;/code&gt; package. No JVM, no cluster. It runs locally, in a Docker container, or on a small VM. Anyone can clone the repo and run the examples in minutes.&lt;/p&gt;

&lt;p&gt;For large-scale production workloads that already run on Spark, the library-style design means you can wrap it or use its schema inference and coercion logic independently.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where it fits in your stack
&lt;/h2&gt;

&lt;p&gt;If you're using observability tools downstream, this fits cleanly upstream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MongoDB
  ↓
nosql-delta-bridge    ← structural validation, DLQ, schema contract
  ↓
Delta Lake
  ↓
dbt models
  ↓
Elementary / Monte Carlo    ← business-level anomaly detection
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Elementary tells you the table is sick. nosql-delta-bridge keeps the table from getting sick from a bad document in the first place, and when a bad document does arrive, it tells you exactly which one and why, before it ever touches the table.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;nosql-delta-bridge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you work with MongoDB → Delta Lake pipelines and want to stress-test it against your own collections, I'd genuinely appreciate it. Especially interested in edge cases — deeply nested structs, arrays of structs with inconsistent shapes, or collections with heavy &lt;code&gt;anyOf&lt;/code&gt; variance.&lt;/p&gt;

&lt;p&gt;Open an issue on GitHub or leave a comment describing your scenario.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/lhrick/nosql-delta-bridge" rel="noopener noreferrer"&gt;https://github.com/lhrick/nosql-delta-bridge&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/nosql-delta-bridge/" rel="noopener noreferrer"&gt;https://pypi.org/project/nosql-delta-bridge/&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built this because I got tired of writing the same defensive boilerplate across every MongoDB collection I touched. If you've felt the same pain, I'd love to hear how you've handled it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>python</category>
      <category>opensource</category>
      <category>mongodb</category>
    </item>
  </channel>
</rss>
