1. Introduction
In many companies, data comes from different systems such as ERP, CRM, application databases, and web logs. This data feeds reports, dashboards, and business decisions. To make it usable and trustworthy, we need a data architecture: an agreed way to collect, store, process, and serve the data.
There are two main types of data architecture:
- Traditional Data Architecture (ETL + Data Warehouse)
- Modern Data Architecture (ELT + Data Lake + Lakehouse)
This document explains both approaches. It also explains where data lakes and data warehouses fit in, and why we use tools like Spark, Databricks, Delta Lake, Iceberg, Snowflake, BigQuery, Redshift, ADLS, GCS, S3, and Datadog.
2. High-Level Data Flow
Data Sources → Ingestion → Data Lake → Processing → Lakehouse Tables → Data Warehouse → BI & Reports → Monitoring
This means:
- Data comes from source systems.
- Data is ingested (copied) into the platform.
- Raw data is stored in a data lake.
- Data is cleaned and transformed using processing tools.
- Clean and reliable tables are created.
- Final data is loaded into a data warehouse for reports.
- The full system is monitored using monitoring tools.
3. Data Sources (Where data comes from)
Examples:
- ERP systems: Finance, HR, inventory data
- CRM systems: Customer and sales data
- OLTP databases: Application transaction data (orders, payments)
- Web logs: Website or app activity (clicks, errors, requests)
Why we use them:
These systems run the business. They create the data that we later analyze.
When we use them:
All the time. These are live systems used daily by the business.
4. Traditional Data Architecture (ETL)
4.1 What is Traditional Architecture?
In traditional architecture, data is transformed before it is loaded into the data warehouse.
Flow:
Sources → ETL Tool → Data Warehouse → BI/Reports
4.2 What is ETL?
ETL = Extract → Transform → Load
- Extract: Take data from source systems.
- Transform: Clean the data, fix formats, remove duplicates, join tables, and calculate metrics.
- Load: Put the clean data into the data warehouse.
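The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a real ETL tool: the sample rows are made up, and an in-memory SQLite database stands in for the data warehouse. The point it shows is the order of the steps: the data is cleaned first, and only the cleaned rows reach the warehouse.

```python
import sqlite3

# Hypothetical rows as they might arrive from a source system (made up for illustration).
source_rows = [
    {"order_id": 1, "customer": " Alice ", "amount": "120.50"},
    {"order_id": 2, "customer": "bob", "amount": "80"},
    {"order_id": 2, "customer": "bob", "amount": "80"},  # duplicate record
]


def extract():
    """Extract: read rows from the source (here, an in-memory list)."""
    return list(source_rows)


def transform(rows):
    """Transform: fix formats and remove duplicates before loading."""
    seen = set()
    clean = []
    for row in rows:
        if row["order_id"] in seen:
            continue  # skip duplicates
        seen.add(row["order_id"])
        clean.append(
            (row["order_id"], row["customer"].strip().title(), float(row["amount"]))
        )
    return clean


def load(conn, rows):
    """Load: write only the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a real data warehouse
load(warehouse, transform(extract()))

etl_result = warehouse.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(etl_result)
```

Note that the duplicate row and the original text formatting never reach the warehouse, so they cannot be recovered from it later.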
4.3 Why Traditional Architecture was used
- Data warehouses were expensive.
- Storage and compute were limited.
- Only clean data was allowed in the warehouse.
4.4 Limitations
- Hard to scale to very large data volumes.
- Raw data is usually not kept after transformation, so it cannot be reprocessed later.
- Not flexible enough for machine learning and advanced analytics, which often need the raw data.
5. Modern Data Architecture (ELT + Data Lake + Lakehouse)
5.1 What is Modern Architecture?
In modern architecture, raw data is first stored in a data lake. Transformations happen later using powerful compute engines.
Flow:
Sources → Ingestion → Data Lake → Transform (Spark/Databricks) → Lakehouse Tables → Data Warehouse → BI & ML → Monitoring
5.2 What is ELT?
ELT = Extract → Load → Transform
- Extract: Take data from sources.
- Load: Store raw data directly in the data lake.
- Transform: Clean and process data later using Spark or Databricks.
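As a rough sketch of the ELT order of steps, the Python below loads raw records unchanged and transforms them afterwards. It is an illustration under simplifying assumptions: the sample rows are made up, and a single in-memory SQLite database stands in for both the lake storage and the compute engine, where a real platform would use object storage plus Spark or Databricks.

```python
import sqlite3

# Hypothetical raw records, kept exactly as the source produced them.
raw_rows = [
    ("1", " Alice ", "120.50"),
    ("2", "bob", "80"),
    ("2", "bob", "80"),  # duplicate kept on purpose: raw data is loaded as-is
]

platform = sqlite3.connect(":memory:")  # stand-in for lake storage plus a compute engine

# Extract + Load: store the raw records unchanged, every column as text.
platform.execute("CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)")
platform.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# Transform: runs later, inside the platform, and leaves raw_orders untouched.
platform.execute(
    """
    CREATE TABLE clean_orders AS
    SELECT DISTINCT
        CAST(order_id AS INTEGER) AS order_id,
        TRIM(customer)            AS customer,
        CAST(amount AS REAL)      AS amount
    FROM raw_orders
    """
)

raw_count = platform.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
clean_orders = platform.execute(
    "SELECT order_id, customer, amount FROM clean_orders ORDER BY order_id"
).fetchall()
print(raw_count, clean_orders)
```

Because the raw table is still there after the transformation, a new cleaning rule or a new use case can be applied to the original records at any time.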
5.3 Why Modern Architecture is used
- Cloud storage is cheap and scalable.
- We can store raw data and use it later for new use cases.
- We can support both analytics and machine learning.
- Compute can scale up and down based on need.
6. Data Lake (S3, ADLS, GCS)
What is a Data Lake?
A data lake is a storage system that stores raw data in any format (CSV, JSON, images, logs).
Why we use it:
- Cheap storage
- Store raw data for future use
- Useful for big data and machine learning
Where we use it:
- Amazon S3 (AWS cloud)
- Azure Data Lake Storage, ADLS (Azure cloud)
- Google Cloud Storage, GCS (GCP cloud)
7. Data Warehouse (Snowflake, BigQuery, Redshift, Synapse)
What is a Data Warehouse?
A data warehouse stores clean, structured data for analytics and reporting.
Why we use it:
- Fast SQL queries
- Business reports and dashboards
- Used by analysts and managers
Examples:
- Snowflake
- Google BigQuery
- Amazon Redshift
- Azure Synapse Analytics
8. Data Lakehouse (Delta Lake, Apache Iceberg)
What is a Lakehouse?
A lakehouse combines the low-cost storage of a data lake with the reliability of a data warehouse.
Why we use Delta Lake and Iceberg:
- ACID transactions (safe updates and deletes)
- Schema changes without breaking pipelines
- Time travel (see old versions of data)
Where we use it:
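On top of the data lake. Delta Lake is most often used with Spark and Databricks. Iceberg also works with Spark and with other engines such as Trino and Flink.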
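To make "ACID transactions" and "time travel" more concrete, here is a toy model in plain Python. It is not how Delta Lake or Iceberg are implemented (they track versions through log and metadata files on the data lake, not in-memory copies), and the `VersionedTable` class is a hypothetical name used only for this sketch. It shows two ideas: a write becomes visible only if it fully succeeds, and older versions stay readable.

```python
import copy


class VersionedTable:
    """Toy table that keeps every committed version, so old versions stay readable."""

    def __init__(self):
        self._versions = [[]]  # version 0 is the empty table

    def commit(self, change):
        """Apply `change` to a copy of the latest rows; publish only if it succeeds."""
        rows = copy.deepcopy(self._versions[-1])
        change(rows)                 # if this raises, nothing is published
        self._versions.append(rows)  # the new version becomes visible in one step
        return len(self._versions) - 1

    def read(self, version=None):
        """Read the latest version, or an older one ("time travel")."""
        index = len(self._versions) - 1 if version is None else version
        return copy.deepcopy(self._versions[index])


table = VersionedTable()
v1 = table.commit(lambda rows: rows.append({"id": 1, "status": "new"}))
v2 = table.commit(lambda rows: rows[0].update(status="paid"))


def bad_change(rows):
    rows.clear()
    raise ValueError("simulated failure halfway through a write")


try:
    table.commit(bad_change)
except ValueError:
    pass  # the failed write left no partial version behind

latest = table.read()
old = table.read(version=v1)
print(latest)
print(old)
```

The failed write does not damage the table: the latest version still holds the paid order, and version 1 still shows the order as it was before the update.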
9. Processing Layer (Spark and Databricks)
What is Spark?
Apache Spark is a distributed engine that processes large datasets in parallel across a cluster of machines.
What is Databricks?
Databricks is a platform that manages Spark and provides notebooks, clusters, and job scheduling.
Why we use them:
- To clean and transform large data
- To run batch and streaming jobs
- To build machine learning pipelines
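The core idea behind distributed processing can be sketched without Spark itself. The plain-Python example below splits data into partitions, processes each partition in parallel, and combines the partial results. It uses threads on one machine where Spark would use executors on a cluster, and the sample web-log records and function names are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical web-log records, made up for illustration.
events = [
    {"page": "/home", "status": 200},
    {"page": "/cart", "status": 500},
    {"page": "/home", "status": 200},
    {"page": "/cart", "status": 200},
    {"page": "/home", "status": 404},
    {"page": "/pay", "status": 200},
]


def split(records, n_partitions):
    """Split the data into roughly equal partitions."""
    return [records[i::n_partitions] for i in range(n_partitions)]


def count_ok_by_page(partition):
    """Per-partition work: drop failed requests, then count hits per page."""
    return Counter(r["page"] for r in partition if r["status"] == 200)


# Process every partition in parallel (threads here, cluster workers in Spark).
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_ok_by_page, split(events, 3)))

# Combine the partial results into one final answer.
page_hits = sum(partial_counts, Counter())
print(dict(page_hits))
```

Each partition is cleaned and counted independently, which is why this style of job can grow by adding more workers instead of a bigger single machine.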
10. File Formats (Avro, Parquet, ORC)
Avro:
Row-based format. Used for data movement and streaming. Good for schema evolution.
Parquet:
Column-based format. Fast for analytics queries because a query reads only the columns it needs.
ORC:
Column-based format. Used in big data systems like Hive and Spark.
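The difference between row-based and column-based layouts can be shown with ordinary Python data structures. This is only a conceptual sketch with made-up sample data; the real formats also add schemas, encoding, and compression, none of which is modeled here.

```python
# Hypothetical order records, made up for illustration.
rows = [
    {"order_id": 1, "customer": "Alice", "amount": 120.5},
    {"order_id": 2, "customer": "Bob", "amount": 80.0},
    {"order_id": 3, "customer": "Cara", "amount": 42.0},
]

# Row-oriented layout (the Avro style): each record is stored together.
row_layout = [tuple(r.values()) for r in rows]

# Column-oriented layout (the Parquet/ORC style): each column is stored together.
column_layout = {key: [r[key] for r in rows] for key in rows[0]}

# An analytics query such as SUM(amount) needs only one column.
total_from_rows = sum(record[2] for record in row_layout)  # walks every full record
total_from_columns = sum(column_layout["amount"])          # reads one column only

print(row_layout[0])
print(column_layout["amount"])
print(total_from_rows, total_from_columns)
```

Both layouts give the same total, but the column layout reaches it without touching the `order_id` and `customer` values. The row layout is the better fit when whole records are written or sent one at a time, as in streaming.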
11. OLTP vs OLAP
OLTP:
Used by applications for daily transactions (orders, payments).
OLAP:
Used for analytics and reporting (data warehouse queries).
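The two workload styles look different even at the query level. The sketch below runs both against one in-memory SQLite database purely for illustration; in practice OLTP queries run on the application database and OLAP queries run on the data warehouse. The table and sample rows are made up.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# OLTP style: many small writes and single-row lookups made by the application.
with db:  # one short transaction per business action
    db.execute("INSERT INTO orders VALUES (1, 'Alice', 120.5)")
with db:
    db.execute("INSERT INTO orders VALUES (2, 'Bob', 80.0)")
with db:
    db.execute("INSERT INTO orders VALUES (3, 'Alice', 42.0)")

one_order = db.execute(
    "SELECT customer, amount FROM orders WHERE order_id = ?", (2,)
).fetchone()

# OLAP style: a read-only query that scans many rows and aggregates them.
revenue_by_customer = db.execute(
    "SELECT customer, COUNT(*), SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()

print(one_order)
print(revenue_by_customer)
```

The OLTP queries touch one row at a time and must be fast for each user action, while the OLAP query reads the whole table to answer a business question.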
12. Monitoring with Datadog
What is Datadog?
Datadog is a monitoring and observability tool.
Why we use it:
- Monitor data pipelines
- Monitor Spark jobs
- Monitor servers and applications
- Get alerts when something fails
When we use it:
In production environments to keep the system healthy and reliable.
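What "monitor pipelines and alert on failure" means can be sketched in plain Python. The example below does not use the Datadog API: `send_alert` and `run_monitored` are hypothetical helpers, and the alert is only appended to a list where a real setup would send metrics and events to the monitoring tool.

```python
import time

alerts = []  # stand-in for an alerting channel


def send_alert(message):
    """Hypothetical alert hook; here it only records the message."""
    alerts.append(message)


def run_monitored(job_name, job):
    """Run a pipeline step, record its status and duration, and alert on failure."""
    started = time.perf_counter()
    try:
        job()
        status = "success"
    except Exception as exc:
        status = "failed"
        send_alert(f"{job_name} failed: {exc}")
    duration = time.perf_counter() - started
    return {"job": job_name, "status": status, "seconds": duration}


def healthy_job():
    """Hypothetical pipeline step that completes normally."""
    return sum(range(1000))


def broken_job():
    """Hypothetical pipeline step that fails."""
    raise RuntimeError("source file missing")


metrics = [
    run_monitored("clean_orders", healthy_job),
    run_monitored("load_warehouse", broken_job),
]
print([(m["job"], m["status"]) for m in metrics])
print(alerts)
```

Status and duration per job are the kind of measurements a monitoring tool collects over time, so that slow or failing jobs are noticed before business users see stale reports.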
13. ETL vs ELT
| Feature | ETL (Traditional) | ELT (Modern) |
|---|---|---|
| Transform | Before load | After load |
| Storage | Data Warehouse | Data Lake |
| Scalability | Limited | High |
| Cost | Higher (expensive warehouse storage) | Lower (cheap cloud storage, elastic compute) |
| Use Cases | Reports | Reports + ML |
14. Example End-to-End Use Case
Data from ERP systems, CRM systems, and web logs is ingested into a data lake on AWS S3. Raw data is stored in Parquet format. Spark on Databricks cleans and processes the data. The clean tables are stored using Delta Lake. The final analytics data is loaded into Snowflake, where business users view reports through dashboards. Datadog monitors the pipelines and sends alerts when jobs fail.
15. Key Takeaways
- Traditional architecture uses ETL + Data Warehouse.
- Modern architecture uses ELT + Data Lake + Lakehouse.
- Data Lake stores raw data.
- Data Warehouse stores clean data for reporting.
- Spark and Databricks handle large-scale processing.
- Delta Lake and Iceberg make data lakes reliable.
- Datadog monitors the pipelines, jobs, servers, and applications, and sends alerts when something fails.

