Building a modern analytical platform usually rhymes with one thing: migrating to the cloud.
But what happens when operational realities, budget constraints, or strict compliance requirements demand a 100% on-premise solution?
Until recently, maintaining a local Data Lake meant choosing between two painful paths:
Wrestling with heavy, outdated Hadoop clusters.
Paying exorbitant licensing fees for proprietary software.
Today, the game has completely changed.
The open-source ecosystem has evolved to a point where you can build a robust, high-performance, and entirely decoupled Data Lakehouse running on bare metal servers.
No proprietary tech. No exorbitant fees. And absolutely no vendor lock-in.
In this article, I will break down the architecture, the technology stack, and the governance model we used to build a cutting-edge, in-house Data Lakehouse.
The Stack: Building an "Open-Source Databricks"
To ensure modularity and scalability, we applied the fundamental principle of modern data architecture: the separation of compute and storage.
Our stack was assembled using the absolute best the current open-source ecosystem has to offer:
Storage: MinIO
We replaced HDFS and AWS S3 with MinIO. It offers full compatibility with the S3 API but runs locally. Installed on NVMe drives with an XFS file system, it guarantees the incredibly high I/O throughput required for heavy analytical workloads.
Table Format: Apache Iceberg
Iceberg is the beating heart of our Lakehouse. It brings relational database features directly to the Data Lake, such as ACID transactions, painless schema evolution, and Time Travel (allowing us to query the exact state of the data at a specific point in the past).
Metadata Catalog: Project Nessie
Nessie acts as our Iceberg catalog. Its killer feature is providing a Git-like experience for data—allowing for branches, commits, and tags right at the data level.
SQL Engine: Trino
For the compute layer, we utilize Trino (formerly PrestoSQL). It is a lightning-fast, distributed query engine. Beyond querying our Lakehouse, Trino features query federation connectors. This allows us to join Lakehouse data directly with operational databases (SQL Server, MySQL) in a single query, avoiding unnecessary data duplication.
Ingestion & Transformation: dlt and dbt Core
To extract data from our sources, we use the Python library dlt (data load tool), which makes building resilient pipelines a breeze. For transformation and modeling, we rely on the industry-standard dbt Core, turning SQL code into a software engineering pipeline complete with version control, testing, and documentation.
Infrastructure: Bare Metal vs. Containers
To extract maximum performance exactly where it matters, we divided our infrastructure across two distinct servers:
The Core Server (Bare Metal):
This is where the heavy lifting happens. Services like MinIO, Nessie, and Trino run natively on Ubuntu Server 24.04, managed via systemd. We deliberately chose not to use Docker in this layer to avoid any virtualized network overhead or disk I/O bottlenecks. We ensure the SQL engine and storage communicate at the absolute maximum speed the hardware allows.The Support Server (Docker):
On a secondary server, we run our entire support ecosystem containerized via Docker. This is home to tools like:
Dagster: Our data pipeline orchestrator.
Grafana & Prometheus: For monitoring server health metrics.
CloudBeaver: Our web interface for SQL exploration.
The Medallion Architecture and Decentralized Governance
The biggest challenge of a Data Lakehouse isn't technological; it's organizational.
To prevent our Lakehouse from turning into a "data swamp"—and to stop the IT team from becoming a bottleneck for report requests—we adopted the Medallion Architecture with strictly divided responsibilities:
Bronze Layer (Raw): Exact replicas of the source tables, with no business rules applied. Managed by IT using incremental polling via dlt.
Silver Layer (Cleaned): Also managed by IT. Here, data is cast to correct types, deduplicated, and standardized using dbt.
Gold Layer (Business): This is where the magic of decentralized governance happens. The modeling (Star Schema, aggregations, KPIs) is the exclusive responsibility of the QA and BI teams.
How does this workflow operate in practice?
When the BI area needs a new metric, they don't open a ticket begging Data Engineering for help. They develop the SQL model themselves in dbt, test it locally against Trino, and open a Pull Request in Git.
IT simply reviews the SQL code quality, approves the PR, and triggers the deploy. From that moment on, Dagster takes over and automatically orchestrates the schedule for that new table.
The Future: Evolving from Batches to Real-Time (CDC)
Currently, our architecture operates on V1, based on scheduled incremental polling. This delivers a latency that is perfectly acceptable for our current business needs.
However, our architectural roadmap is already prepped for V2. We will replace scheduled extractions with CDC (Change Data Capture) using Debezium and Kafka to read transaction logs directly from our operational databases. Data will land in Iceberg in a matter of seconds.
The beauty of this modern stack is that by swapping the ingestion engine at the Bronze layer, all our dbt models in the Silver and Gold layers will continue to function seamlessly, without us having to rewrite a single line of transformation code.
What’s Next?
Building this data infrastructure solved our analytical scalability issues, but it also opened the doors to new initiatives—especially the intensive use of Artificial Intelligence to generate internal applications and mini-systems integrated with this data.
With systems being rapidly generated by AI, a new (and critical) challenge emerged: How do we guarantee that an AI-generated portal accesses the Lakehouse securely?
In future posts, I will dive deep into how we solved this exact problem, covering Security, IAM (Identity and Access Management), API Gateways, and Row-Level Security in Trino.
🔔 Follow me here to stay updated on future articles! Drop a comment below: what are your thoughts on this architecture, and how do you handle the On-Premise vs. Cloud dilemma in your own projects?
Top comments (0)