DEV Community

Sara A.


My Data Lake Runs on MongoDB and PostgreSQL and I’m Not Sorry

A Brief History of How I Angered Absolutely No One (So Far)

Before we ever talked about technologies, we (a team working on a project) were dealing with a fairly uncomfortable data problem. We were collecting large volumes of data from multiple external sources, many of which we did not fully control or even fully understand. At the point of collection, the data arrived with little reliable context, inconsistent structure, and no clear guarantees about meaning.

More importantly, the data could not be meaningfully analysed at ingestion time. Individual values were not inherently valid or invalid, useful or useless. Their significance only emerged later, once they were combined with other datasets, configuration, and a set of calculations applied in a subsequent processing phase.

To make this harder, the same incoming data could eventually belong to different processing “plans”. Each plan defined its own expectations around data types, granularity, frequency, and validation rules. New plans could appear over time, existing ones could evolve, and at the moment the data was collected there was no reliable way to know which plan would ultimately apply.

There was also a strict traceability requirement: any calculation or recalculation had to be fully attributable to the original input values. This meant that once data was ingested, the values themselves could not be updated or rewritten. Corrections and reinterpretations had to be expressed as new processing steps, not mutations of the original records.

Faced with this, the textbook answer is a data lake. A “proper” data lake, at least in theory, is a place where raw data can be stored cheaply and indefinitely, in its original form, without forcing early decisions about structure or schema. The promise is simple: ingest first, understand later.

In practice, this usually means object storage at the core: large, inexpensive buckets where data is written once and rarely touched directly. Data is organised by conventions rather than rigid models, and meaning is derived at read time using analytical engines rather than enforced at write time. Around this storage layer sits an ecosystem of supporting technology - distributed query engines, batch processing frameworks, metadata catalogues, and orchestration tools - while the lake itself remains deliberately passive.

On paper, this fits our problem almost perfectly. We had uncertain data, evolving interpretation rules, and no safe way to apply strict schemas up front. Deferring meaning was a requirement.

And yet, this is also where the theory starts to fray. Our problem was not just storing raw data for later analytics; it was needing to interact with that raw data operationally, correct it, investigate it, and understand it while the system was running. The data lake model explains where to put uncertain data, but it says much less about how to live with it day to day.

So the question stopped being “where do we dump raw data cheaply?” and became “how do we live with raw data every day?”. We needed to inspect and query raw data operationally, without introducing extra ETL processes that would exist only to make the data usable. We needed to correlate records to plans once the missing context became available, and to support corrections and investigations without rewriting history. The point was to keep the original values immutable for traceability, while still being able to attach new metadata and interpretations close to the original information - not in a separate universe that requires rebuilding pipelines every time we learn something new.

In a classic data lake stack, that usually means object storage plus a surrounding ecosystem: file formats like Parquet or Avro; table layers such as Iceberg, Delta, or Hudi; catalogues like Glue, Hive, or Unity; batch processing engines like Spark or Flink; and interactive query engines such as Trino or Presto, all held together by orchestration on top. That world is powerful, but it also tends to move complexity into “process”: jobs, compaction, reprocessing, schema evolution, and the operational overhead of making raw data convenient to interrogate. Cheap storage is great, but we also cared about cheap processes and cheap debugging - and getting all three (cheap storage, cheap compute, cheap operations) is basically an engineering utopia.

So we chose a simpler operational centre of gravity: store raw, schema-flexible data in MongoDB in a way that remains queryable day one, and treat plan correlation and corrections as additional metadata and derived layers rather than rewrites of the original values. It’s not the canonical data lake implementation, but it matched the shape of the problem we actually had: uncertainty first, meaning later, and a constant need to inspect and evolve the story without losing the original facts.

Two Databases Walk Into a Bar

Data lakes, despite their reputation for being “just storage”, almost never exist without some notion of a catalogue. Once raw data starts accumulating, you very quickly need a way to answer basic questions: what datasets exist, what they represent, where they live, and how they are meant to be used. Without that layer, a lake stops being flexible and starts being opaque.

In most standard data lake architectures, this catalogue emerges as a separate system. Technologies like Hive Metastore, AWS Glue, Unity Catalog, or similar services exist to map logical datasets to physical storage, track schemas, and help query engines make sense of otherwise passive files. The catalogue doesn’t replace the lake; it makes it navigable.

PostgreSQL is not what usually comes to mind when people talk about data lake catalogues. But at its core, a catalogue is simply structured metadata: names, identifiers, relationships, and lifecycle information that need to be queryable, consistent, and understandable by both humans and systems.

Seen through that lens, PostgreSQL works exceptionally well. It gives us strong consistency, a rich query model, transactional updates, and a familiar interface for expressing relationships and constraints. Instead of discovering metadata by scanning storage or inferring schemas after the fact, we explicitly record what exists and how it should be interpreted. The result is not a less capable catalogue, but a more intentional one that is built around accessibility and correctness rather than engine integration.

PostgreSQL became the place where we indexed meaning: which datasets exist, what context applies, and how raw collections should be interpreted at a given point in time.

And that, more or less, is how two databases walked into a bar and agreed to share custody of the data.

Every Data Lake Is a Zoo, Mine Just Has Signs

Yes, I could discover everything by parsing collection names.
I could also parse raw bytes in hex.
I choose civilisation.

A data lake without a catalogue is not empty, it’s just loud. All the information is technically there, but understanding it requires effort, context, and a tolerance for archaeology. Storage can be self-describing in theory, yet still deeply unfriendly in practice.

Our raw data is, by design, physically self-describing. Collection names encode the same information you would normally express through object-storage paths. For example, a MongoDB collection like `raw_123_foo_2026-01` carries the same meaning as a more traditional data lake layout using partitioned object storage, such as `~/qualifier=123/dataset=foo/date=2026-01/`.

In both cases, the dataset, scope, and time window are embedded directly into the storage structure. With enough convention and enough discipline, the data can always be rediscovered by inspecting storage alone.

The problem is not whether this works, but where that logic lives. Without a catalogue, every consumer needs to know how to parse collection names, reconstruct time ranges, and apply the same concatenation rules consistently. That logic inevitably leaks into multiple services, scripts, and mental models.
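To make that leak concrete, here is a minimal sketch of the convention-parsing logic that, without a catalogue, every consumer would have to reimplement. The function name, the `RawCollection` tuple, and the exact naming pattern are illustrative, not lifted from the real system:

```python
import re
from typing import NamedTuple

class RawCollection(NamedTuple):
    qualifier: str   # scope identifier, e.g. "123"
    dataset: str     # logical dataset name, e.g. "foo"
    month: str       # time window encoded in the name, YYYY-MM

# Assumed convention: raw_<qualifier>_<dataset>_<YYYY-MM>
_PATTERN = re.compile(r"^raw_(?P<qualifier>[^_]+)_(?P<dataset>.+)_(?P<month>\d{4}-\d{2})$")

def parse_collection_name(name: str) -> RawCollection:
    """Recover dataset, scope, and time window from a collection name."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a raw collection name: {name!r}")
    return RawCollection(**match.groupdict())
```

So `parse_collection_name("raw_123_foo_2026-01")` yields `("123", "foo", "2026-01")`. Harmless in one script; a liability once five services each carry their own slightly different copy.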

Instead, we centralise that knowledge in the catalogue. Rather than forcing every consumer to understand naming conventions or storage layouts, we record the relationships explicitly:

| plan | dataset_type | from       | to         | collection          |
|------|--------------|------------|------------|---------------------|
| 123  | raw          | 2026-01-01 | 2026-01-31 | raw_123_foo_2026-01 |

Consumers no longer need to care whether the underlying data lives in MongoDB collections or in partitioned object storage (for example AWS S3 or Azure Data Lake Storage). They simply ask the catalogue what data exists for a given plan and time range, and receive the location that matches. Storage remains self-describing; the catalogue just makes that description immediately accessible.
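That lookup is ordinary relational SQL. The sketch below uses an in-memory SQLite database purely so the example is self-contained and runnable; in the real system the catalogue lives in PostgreSQL, and the column names mirror the table above (`from`/`to` are renamed because they are SQL keywords):

```python
import sqlite3

# Illustrative catalogue; PostgreSQL in production, SQLite here for portability.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE catalogue (
        plan         TEXT NOT NULL,
        dataset_type TEXT NOT NULL,
        date_from    TEXT NOT NULL,  -- 'from' is a reserved word, renamed
        date_to      TEXT NOT NULL,
        collection   TEXT NOT NULL
    )
""")
db.execute(
    "INSERT INTO catalogue VALUES (?, ?, ?, ?, ?)",
    ("123", "raw", "2026-01-01", "2026-01-31", "raw_123_foo_2026-01"),
)

def collections_for(plan: str, day: str) -> list:
    """Which collections hold raw data for this plan on this day?"""
    rows = db.execute(
        "SELECT collection FROM catalogue "
        "WHERE plan = ? AND dataset_type = 'raw' "
        "AND date_from <= ? AND date_to >= ?",
        (plan, day, day),
    )
    return [collection for (collection,) in rows]
```

A consumer asking `collections_for("123", "2026-01-15")` gets back the collection name without ever knowing the naming convention exists.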

This does not replace self-description with abstraction. If the catalogue disappears, the data is still there and still interpretable by inspecting collection names directly. The system degrades into inconvenience, not failure. What the catalogue removes is the need to repeatedly rediscover the same meaning through convention and duplication.

The same principle applies inside the data. Raw values are immutable, but their meaning over time is not. When a value needs to be corrected, we do not overwrite it. Instead, we attach context and let the data describe its own history:

```
{
  "timestamp": "2026-01-01T00:00:00Z",
  "name": "my_value",
  "value": "12",
  "metadata": {
    "status": "REPLACED",
    "collected_at": "2026-01-20T00:00:00Z",
    "...": {... a set of audit information}
  }
},
{
  "timestamp": "2026-01-01T00:00:00Z",
  "name": "my_value",
  "value": "2",
  "metadata": {
    "status": "CURRENT",
    "collected_at": "2026-01-22T00:00:00Z",
    "...": {... a set of audit information}
  }
}
```

The original value still exists, unchanged. What changes is not the data itself, but the context around it: a status that marks when it was superseded, an audit trail that records when and why that happened, and a new entry that assumes the role of the current value. Nothing is erased, nothing is rewritten, and every recalculation can always be traced back to the original facts.

This is immutability with signage. Instead of forcing consumers to infer intent from timestamps or absence, the data makes its own history explicit. It tells you which value was current at any given time, which one replaced it, and under what circumstances - with meaning attached directly to the records themselves.
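The correction flow itself is small enough to sketch. The `supersede` helper below is illustrative (its name and signature are not from the real system), but it captures the rule: the old record keeps its value untouched, only its status flips, and a new record is appended as the current one:

```python
from datetime import datetime, timezone

def supersede(records, name, timestamp, new_value, audit):
    """Correct a value without rewriting it: mark the matching CURRENT
    record as REPLACED and append a new record that becomes CURRENT."""
    now = datetime.now(timezone.utc).isoformat()
    for record in records:
        if (record["name"] == name
                and record["timestamp"] == timestamp
                and record["metadata"]["status"] == "CURRENT"):
            # Status is metadata; the original value is never touched.
            record["metadata"]["status"] = "REPLACED"
    records.append({
        "timestamp": timestamp,          # same logical point in time
        "name": name,
        "value": new_value,
        "metadata": {
            "status": "CURRENT",
            "collected_at": now,
            "audit": audit,              # who corrected it, and why
        },
    })
    return records
```

Because nothing is deleted, replaying the history of a value is just filtering records by `name` and `timestamp` and reading the statuses in `collected_at` order.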

Every data lake eventually becomes a zoo: full of valuable, unfamiliar creatures. Some rely on visitors memorising the animals. Mine just puts the names on the enclosures.

My data lake is self-describing.
PostgreSQL just adds subtitles so humans can watch it without suffering.

This Was Designed Under Real-World Constraints

No data engineers were harmed in the making of this architecture.
Mostly because there were none present to stop me.

This system was not designed by a specialised data platform team or under ideal conditions. It was built under a very concrete set of constraints that shaped most of the architectural decisions:

  • We are a small team, responsible for both building and maintaining the system, so any solution had to remain understandable and maintainable without requiring deep domain expertise in data platforms.
  • There is very limited dedicated data management knowledge within the team, which made highly specialised data platforms (analytics stacks, lakehouse ecosystems, and custom metadata or governance platforms) unrealistic from both a development and operational perspective.
  • The system had to be fully cloud-based, which meant that a baseline level of cost was unavoidable regardless of technology choices, making the focus more about cost predictability than absolute cost minimisation.
  • Debuggability was treated as a first-class architectural requirement. When a calculation produces an unexpected result, the system must allow us to inspect the exact raw inputs that contributed to it immediately, not after a batch job finishes or a pipeline is re-run.
  • The primary goal was therefore not to implement a “perfect” data lake architecture, but to build something that could be operated reliably, debugged easily, and evolved incrementally as the system and the team mature.

Many established data lake patterns make sense in organisations optimising for large-scale analytics and long-running batch workloads. Our constraints were different: operational clarity mattered more than theoretical optimality.

This Is Not Free, I Just Know What I Am Paying For

The primary trade-off in this design is storage cost. Storing raw, immutable data in MongoDB is more expensive per gigabyte than using object storage, and that cost is not always trivial to estimate upfront. The volume, shape, and retention of data vary significantly depending on the plans applied to it, and those plans can evolve or appear over time. As a result, the exact storage footprint cannot be predicted with complete confidence at the outset.

However, this uncertainty is still easier to reason about than the alternative. Storage growth is largely linear and visible: data arrives, it is stored, and its cost accumulates in a predictable way over time. There are no hidden bursts of compute, no surprise cluster spin-ups, and no indirect costs tied to how often questions are asked. While the total cost may be higher, it is simpler to analyse, easier to attribute, and more transparent to operate than a model where storage is cheap but every interaction with the data incurs additional, variable processing cost.

The comparison below reflects a conscious trade-off: higher storage costs, offset by more predictable spending and simpler day-to-day operations.

| Topic | MongoDB-based Lake | Object Storage + Lake Stack |
|---|---|---|
| Storage cost | Higher per GB | Low per GB |
| Compute model | Always-on database | On-demand distributed compute |
| Query latency | Low (interactive) | Medium to high (job/cluster spin-up) |
| Query pattern | Point lookups, filtered queries | Large scans, batch analytics |
| Update model | Metadata/status updates, new records | Rewrite / compaction jobs |
| Data mutability | Immutable raw values, contextual updates | Append-only, rewrite-based |
| Reprocessing | Selective, record-level logic | Batch pipeline re-execution |
| Metadata management | Explicit catalogue (Postgres) | External metastore (Glue/Hive/Unity) |
| Governance | Explicit, application-level | Platform-assisted |
| Operational complexity | Moderate | High (multiple systems) |
| Cost predictability | High | Variable (compute-driven) |
| Scaling pattern | Vertical + sharding | Horizontal compute clusters |
| Debugging | Direct data access | Indirect via jobs |
| Traceability | Record-level, immediate | Pipeline- and job-level |
| Time to investigate issues | Minutes to hours | Hours to days |
| Typical team size | Small to medium | Medium to large |
| Primary optimisation | Operational access | Analytical throughput |

I Didn’t Build a Perfect Data Lake, I Built One I Can Explain

Future me will hate parts of this system.
Present me at least knows where the data is.

This architecture is not an attempt to redefine what a data lake should be, nor a manifesto against established patterns. In fact, we didn’t even come up with the idea of calling it a data lake. Using a data lake was a requirement. The name arrived fully formed, and our job was simply to make something exist behind it.

With that constraint in place, we did not spend much time debating what a data lake should look like in theory. The requirement was simply that one existed. Our job was to make something that fit the description without making the system unnecessarily painful to build, operate, or explain. That’s where the cheekiness started.

We kept the promises usually associated with a data lake: raw data retention, deferred interpretation, immutability, traceability - and quietly ignored the assumption that this automatically combined with large-scale distributed processing stacks and a supporting cast of coordination, orchestration, and metadata services. Not because those tools are bad, but because they optimise for a different class of problems.

There is a persistent idea that data lakes are cheap. In practice, only the storage is cheap. Everything required to make that storage usable, such as distributed query engines, batch processing frameworks, orchestration layers, catalogues, and the expertise needed to operate them, carries a real and often variable cost. Object storage is inexpensive precisely because it does nothing; the moment you want to understand your data, you start paying in compute, coordination, and operational overhead.

In our case, a fully cloud-based solution was also a requirement, so a baseline level of cost was unavoidable regardless of the architecture. The real choice was therefore not between cheap and expensive, but between different kinds of expense. We chose higher storage costs in exchange for simpler processes, faster feedback, and the ability to inspect and explain data directly, without spinning up clusters or waiting for pipelines to finish.

So yes, this is a data lake. It stores raw data, defers meaning, preserves immutability, and supports recalculation. It just does so while being a little irreverent about the tooling, and very serious about day-to-day operability.

If that makes purists uncomfortable, that’s fine.
They weren’t in the room.
