If you've ever built ETL pipelines pulling data from MongoDB into Delta Lake using Spark, you've probably hit this wall. The pipeline works fine — until it doesn't. A single document with an unexpected shape is enough to break the entire write, leave the table in an inconsistent state, and send your on-call engineer digging through Spark logs at 11pm.
I built and maintained more than 10 of these jobs in my last role. After solving the same problem manually across every single one, I decided to build the abstraction that should have existed from the start: nosql-delta-bridge.
```shell
pip install nosql-delta-bridge
```
The problem isn't bad data — it's structural
MongoDB's schema-free nature is a feature for application developers. For pipelines, it's a minefield. The problems came in three flavors:
1. Polymorphic fields
Some collections had fields typed as anyOf[object|bool|string] in the JSON Schema — completely valid from the application's perspective. A status field might be a string in older documents and an integer in newer ones. A value field might be a number, a boolean, or a nested object depending on which part of the application wrote it.
Spark infers the schema from a sample at read time, commits to it, and the moment a document outside that sample has a different type, the entire write fails:
```
AnalysisException: Cannot cast StringType to IntegerType
```
The only safe workaround was casting everything to StringType defensively — which meant no type guarantees in the raw Delta table and re-casting in every downstream job.
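To see why sampling is the root of the problem, here is a stdlib-only sketch of the failure mode. This is not Spark's actual inference logic (which is far more sophisticated) — it only illustrates how a type committed from a sample gets violated by a later document:

```python
# Illustrative sketch only: commit to a type from a sample, then watch
# a document outside the sample violate it. Not Spark's real algorithm.

def infer_type(docs, field, sample_size=100):
    """Commit to the first type observed in the sample."""
    for doc in docs[:sample_size]:
        if field in doc:
            return type(doc[field])
    return type(None)

# 100 old-style documents, then one written by a newer code path
docs = [{"status": "active"}] * 100 + [{"status": 3}]
committed = infer_type(docs, "status")

# Document 101 violates the committed type -- in Spark, this is the
# point where the entire write fails with a cast error.
violations = [d for d in docs if not isinstance(d["status"], committed)]
print(committed, len(violations))  # <class 'str'> 1
```

The sample looks perfectly consistent; the contract it implies is simply wrong for the full collection.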
2. Inconsistent nested structs
Arrays of structs where fields appeared or disappeared depending on the document version. A subfield present in some documents, missing in others. Nested structs with subfields that changed shape across batches.
Every job ended up with the same boilerplate:
```python
from pyspark.sql.functions import coalesce, col, lit, struct

def rebuild_struct(df, field, schema):
    # Rebuild a struct column against a fixed schema: cast every expected
    # subfield explicitly, and fill subfields missing from this batch
    # with typed nulls so the Delta write never sees a shape mismatch.
    return df.withColumn(
        field,
        struct([
            coalesce(col(f"{field}.{f}"), lit(None).cast(t)).alias(f)
            for f, t in schema.items()
        ]),
    )
```
Rebuild the struct by hand. Cast every field explicitly. Handle missing fields with lit(None). Drop fields that appeared in some batches but not others. Repeat across every collection.
3. Silent failures
When the pipeline didn't crash outright, bad documents were silently coerced or dropped. There was no dead-letter queue, no audit trail, no contract that said "this field must be this type." Problems surfaced three jobs downstream — not at the ingestion boundary where they actually happened.
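The fix the rest of this post argues for is a contract enforced at the ingestion boundary. Here is a minimal stdlib sketch of the idea — the contract format and function names are illustrative, not the library's actual API:

```python
# Hypothetical sketch of an ingestion-boundary contract check.
# Every document gets an explicit verdict; nothing is silently
# coerced or dropped.

CONTRACT = {"amount": float, "status": str}

def check(doc):
    """Return (ok, reason) -- a rejection always carries a reason."""
    for field, expected in CONTRACT.items():
        if field not in doc:
            return False, f"missing required field '{field}'"
        if not isinstance(doc[field], expected):
            return False, (f"type mismatch on '{field}': expected "
                           f"{expected.__name__}, got {type(doc[field]).__name__}")
    return True, None

print(check({"amount": 99.9, "status": "paid"}))    # (True, None)
print(check({"amount": "99.90", "status": "paid"})) # ok=False, reason names the field
```

The point is where the verdict happens: at ingestion, per document, with a reason attached — not three jobs downstream.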
What existing tools don't solve
A common suggestion in this space is to use a data observability tool like Elementary. Elementary is genuinely useful — but it operates at the table/model level. It tells you the table is unhealthy, not which document made it unhealthy.
The investigation workflow without document-level isolation:
- Elementary fires an alert — table freshness failed
- Engineer checks Spark logs — finds a cast error
- Engineer traces back to MongoDB — tries to identify the offending document in a batch of 100k records
- Even after finding it — casting it correctly in Spark is either impossible or takes significant work when the schema is inconsistent enough
The inspection step is entirely manual, and finding the problematic document can take hours. And once you find it, you still have to figure out what to do with it while the rest of the batch sits unwritten.
How nosql-delta-bridge works
The core idea is simple: every document either lands in the Delta table or goes to a dead-letter queue with an explicit rejection reason. Nothing is silently dropped. Nothing silently crashes the pipeline.
The workflow has two steps:
Step 1 — Infer a schema contract from known-good historical data
```shell
bridge infer historical.json --output payments.schema.json
```
This generates a schema contract from a sample of documents you trust. The inference engine handles type conflicts using a configurable strategy — by default, the widest type wins and fields are nullable.
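Conceptually, "widest type wins, fields are nullable" looks like the sketch below. The widening order (int < float < str) is an assumption for illustration; the library's actual strategy is configurable and this is not its internal code:

```python
# Illustrative "widest type wins" inference over a trusted sample.
# The widening order here is an assumption, not the library's.

WIDTH = {int: 0, float: 1, str: 2}

def infer_field(docs, field):
    seen = {type(d[field]) for d in docs if field in d}
    widest = max(seen, key=WIDTH.get)             # widest observed type wins
    nullable = any(field not in d for d in docs)  # missing anywhere => nullable
    return widest, nullable

docs = [{"amount": 10}, {"amount": 9.5}, {"other": 1}]
print(infer_field(docs, "amount"))  # (<class 'float'>, True)
```

An `amount` that is sometimes `int` and sometimes `float` becomes a nullable `float` in the contract, instead of a runtime cast error.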
Step 2 — Ingest with validation
```shell
bridge ingest incoming.json ./delta/payments \
  --schema payments.schema.json \
  --dlq rejected.ndjson
```

```
incoming.json · 1,000 documents · schema: payments.schema.json
written: 994 → delta/payments
rejected: 6 → rejected.ndjson
```
The 994 valid documents land in Delta Lake. The 6 that couldn't be reconciled go to the DLQ — with an explicit reason attached to each one:
```json
{
  "_id": "abc123",
  "amount": "99.90",
  "_dlq_reason": "cast failed on 'amount': expected double, got string",
  "_dlq_stage": "coerce",
  "_dlq_ts": "2025-04-28T14:32:01Z"
}
```
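Because the DLQ is plain NDJSON, triage needs nothing beyond the standard library. Assuming records shaped like the one above (the sample lines here are fabricated for illustration), grouping rejects by reason is a few lines:

```python
import json
from collections import Counter

# Fabricated sample of DLQ lines, shaped like the record above.
sample = "\n".join([
    '{"_id": "a", "_dlq_reason": "cast failed on \'amount\': expected double, got string"}',
    '{"_id": "b", "_dlq_reason": "cast failed on \'amount\': expected double, got string"}',
    '{"_id": "c", "_dlq_reason": "missing required field \'status\'"}',
])

# Group rejected documents by rejection reason for triage.
reasons = Counter(json.loads(line)["_dlq_reason"] for line in sample.splitlines())
for reason, count in reasons.most_common():
    print(count, reason)
```

In practice you would read `rejected.ndjson` line by line instead of a string; the point is that the failure data is structured from the start.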
No log archaeology. No manual document hunting. The bad document is already isolated, already labeled, at the exact moment ingestion ran.
What it handles
| Scenario | Behavior |
|---|---|
| Field type mismatch (castable) | Cast applied, document written |
| Field type mismatch (not castable) | Document → DLQ with reason |
| Missing required field | Document → DLQ with reason |
| New field not in schema | Configurable: reject or evolve schema |
| Full type migration (all docs changed type) | 0 written, all → DLQ + warning |
| Nested struct with missing subfield | Filled with null, document written |
| Array of mixed types | Configurable: cast to widest or reject |
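For the mixed-array case, the "cast to widest" option behaves roughly like this sketch — again with an assumed widening order (int < float < str), not the library's internal logic:

```python
# Illustrative "cast to widest" for arrays of mixed types.
# Widening order is an assumption for the sketch.

WIDEST = [int, float, str]

def coerce_array(values):
    widest = max((type(v) for v in values), key=WIDEST.index)
    try:
        return [widest(v) for v in values]  # cast every element
    except (TypeError, ValueError):
        return None                         # not castable -> reject to DLQ

print(coerce_array([1, 2.5, 3]))  # [1.0, 2.5, 3.0]
print(coerce_array([1, "two"]))   # ['1', 'two']
```

The "reject" branch is what feeds the DLQ instead of crashing the batch.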
Why pure Python and not Spark
The MongoDB Connector for Apache Spark is the standard approach — but it requires a cluster. Most teams running smaller MongoDB collections don't need a full Spark environment just to move data into Delta Lake.
nosql-delta-bridge uses delta-rs under the hood: a Rust implementation of the Delta Lake protocol with Python bindings (the deltalake package). No cluster required. It runs locally, in a Docker container, or on a small VM. Anyone can clone the repo and run the examples in minutes.
For large-scale production workloads that already run on Spark, the library-style design means you can wrap it or use its schema inference and coercion logic independently.
Where it fits in your stack
If you're using observability tools downstream, this fits cleanly upstream:
```
MongoDB
   ↓
nosql-delta-bridge   ← structural validation, DLQ, schema contract
   ↓
Delta Lake
   ↓
dbt models
   ↓
Elementary / Monte Carlo   ← business-level anomaly detection
```
Elementary tells you the table is sick. nosql-delta-bridge makes sure the table never gets sick from a bad document in the first place — and when it does, tells you exactly which document and why, before it ever touched the table.
Try it
```shell
pip install nosql-delta-bridge
```
If you work with MongoDB → Delta Lake pipelines and want to stress-test it against your own collections, I'd genuinely appreciate it. Especially interested in edge cases — deeply nested structs, arrays of structs with inconsistent shapes, or collections with heavy anyOf variance.
Open an issue on GitHub or leave a comment describing your scenario.
GitHub: https://github.com/lhrick/nosql-delta-bridge
PyPI: https://pypi.org/project/nosql-delta-bridge/
Built this because I got tired of writing the same defensive boilerplate across every MongoDB collection I touched. If you've felt the same pain, I'd love to hear how you've handled it.