This post explains data fundamentals, databases, data warehousing, data lakes, and modern lakehouse architecture.
What is Data?
Data is the raw facts collected from applications, users, and machines. On its own, data has little meaning. When we process, clean, and analyze it, it becomes useful information for business decisions.
Examples of data:
- Customer name, email
- Order amount, order time
- Website clicks, error logs
- Sensor readings
An e-commerce app stores every order as data. When analysts look at monthly sales trends and top-selling products, that processed data becomes insights.
Types of Data (Structured, Semi-Structured, Unstructured)
| Aspect | Structured Data | Semi-Structured Data | Unstructured Data |
|---|---|---|---|
| What it means | Data stored in rows and columns with a fixed schema. | Data with some structure (keys/tags), but no fixed table schema. | Data with no predefined structure. |
| Where it is stored | Relational Databases, Data Warehouses | Data Lakes, modern warehouses | Object storage, file systems |
| How easy to query | Very easy with SQL | Needs parsing/flattening | Needs preprocessing/AI-ML |
| Examples | Customer table, Orders table | JSON from APIs, Web logs, Avro/Parquet | Images, videos, PDFs, emails |
Databases and Data Storage
Databases are systems used to store and manage structured data for applications.
Data storage includes databases plus file systems and cloud object storage (for raw files).
Examples:
- Databases: PostgreSQL, MySQL, Oracle
- Object Storage: AWS S3, Azure ADLS, Google GCS
SQL & Relational Databases
What is SQL?
SQL (Structured Query Language) is used to read and write data in relational databases.
What is a Relational Database (RDBMS)?
An RDBMS stores data in tables with relationships (primary keys and foreign keys).
Examples: PostgreSQL, MySQL, SQL Server
DDL vs DML
| Category | What it does | Examples |
|---|---|---|
| DDL (Data Definition Language) | Defines or changes table structure | CREATE, ALTER, DROP |
| DML (Data Manipulation Language) | Reads and modifies data inside tables | INSERT, UPDATE, DELETE, SELECT |
Example (DDL):

```sql
CREATE TABLE customers (id INT, name VARCHAR(100));
```

Example (DML):

```sql
INSERT INTO customers VALUES (1, 'Salma');
SELECT * FROM customers;
```
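You can try the DDL and DML statements above end-to-end with Python's built-in `sqlite3` module. SQLite is just one convenient RDBMS for a quick sketch; the same SQL works in PostgreSQL or MySQL:

```python
import sqlite3

# In-memory SQLite database for a quick demonstration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the table structure.
cur.execute("CREATE TABLE customers (id INT, name VARCHAR(100))")

# DML: insert a row, then read it back.
cur.execute("INSERT INTO customers VALUES (1, 'Salma')")
rows = cur.execute("SELECT * FROM customers").fetchall()
print(rows)  # [(1, 'Salma')]
```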
OLTP vs OLAP (Databases vs Analytics)
| Aspect | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
|---|---|---|
| Main purpose | Run daily transactions for apps | Run analytics and reporting |
| Query pattern | Many small, fast writes/reads | Large scans and aggregations |
| Data | Current operational data | Historical, aggregated data |
| Typical systems | PostgreSQL, MySQL | Snowflake, BigQuery, Redshift |
| Example | Placing an order | Yearly sales analysis |
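The two query patterns in the table can be contrasted on one small `orders` table (a hypothetical schema for illustration, again using `sqlite3` as a stand-in database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INT, customer_id INT, amount REAL, year INT)")
cur.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, 10, 25.0, 2024), (2, 11, 40.0, 2024), (3, 10, 15.0, 2025)],
)

# OLTP-style query: fetch one order by key (small, fast, selective).
one_order = cur.execute("SELECT * FROM orders WHERE id = 2").fetchone()

# OLAP-style query: scan the history and aggregate (yearly sales analysis).
by_year = cur.execute(
    "SELECT year, SUM(amount) FROM orders GROUP BY year ORDER BY year"
).fetchall()
print(one_order)  # (2, 11, 40.0, 2024)
print(by_year)    # [(2024, 65.0), (2025, 15.0)]
```

At real scale, OLTP systems optimize for millions of the first kind of query per day, while OLAP systems optimize for the second kind over billions of rows.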
ACID Transactions (Why Databases are Reliable)
| Term | Meaning |
|---|---|
| Atomicity | A transaction is all-or-nothing. |
| Consistency | Data stays valid and correct. |
| Isolation | Parallel users do not interfere. |
| Durability | Once saved, data will not be lost. |
Example:
If a payment fails halfway, Atomicity ensures the whole transaction is rolled back.
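Atomicity is easy to see in code. In this sketch, `sqlite3`'s connection context manager commits on success and rolls back on any exception, so a simulated mid-payment failure leaves both balances untouched:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100.0), ('bob', 50.0)")
conn.commit()

try:
    with conn:  # transaction: commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        # Simulate a crash before the matching credit to bob ever runs.
        raise RuntimeError("payment gateway failed halfway")
except RuntimeError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts").fetchall())
print(balances)  # {'alice': 100.0, 'bob': 50.0} -- the debit was rolled back
```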
Data Warehouse vs Data Lake
| Aspect | Data Warehouse | Data Lake |
|---|---|---|
| Data types | Structured only | Structured, semi-structured, unstructured |
| Schema | Schema-on-write (define before load) | Schema-on-read (define at query time) |
| Cost | Higher storage cost | Lower storage cost |
| Main use | BI reports, dashboards | Raw storage, ML/AI, exploration |
| Examples | Snowflake, Redshift, BigQuery | AWS S3, Azure ADLS, Google GCS |
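Schema-on-read is the key difference in practice: raw records land in the lake as-is, and structure is applied only when you query. A minimal sketch with made-up JSON events (a warehouse would instead require defining columns before loading):

```python
import json

# Raw JSON lands in the lake untouched; no table schema is defined up front.
raw_events = [
    '{"user": "salma", "action": "click", "page": "/home"}',
    '{"user": "omar", "action": "purchase", "amount": 49.9}',  # extra field, no schema change needed
]

# Schema-on-read: parse and shape the data at query time.
parsed = [json.loads(e) for e in raw_events]
purchases = [e for e in parsed if e["action"] == "purchase"]
print(purchases)  # [{'user': 'omar', 'action': 'purchase', 'amount': 49.9}]
```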
Data Formats: Avro vs Parquet vs ORC
| Format | Storage Style | Best for | Example use |
|---|---|---|---|
| Avro | Row-based | Streaming, fast writes | Kafka pipelines |
| Parquet | Column-based | Analytics, fast reads | BI queries in Spark |
| ORC | Column-based | Analytics with compression | Hive/Spark |
Row-Based vs Column-Based Storage
| Aspect | Row-Based Storage | Column-Based Storage |
|---|---|---|
| How data is stored | Entire rows together | Same columns together |
| Best for | OLTP transactions | OLAP analytics |
| Typical systems | PostgreSQL, MySQL | BigQuery, Redshift, Parquet |
| Example | Fetch one customer record | Aggregate one column across millions of rows |
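The two layouts can be mocked with plain Python containers to show why each access pattern is cheap in one layout and expensive in the other:

```python
# Row-based layout: each record is stored together (good for fetching one record).
rows = [
    (1, "Salma", 120.0),
    (2, "Omar", 80.0),
    (3, "Lina", 200.0),
]

# Column-based layout: each column is stored together (good for aggregating one column).
columns = {
    "id":     [1, 2, 3],
    "name":   ["Salma", "Omar", "Lina"],
    "amount": [120.0, 80.0, 200.0],
}

# OLTP-style access: fetch one full record -- one contiguous read in the row layout.
record = rows[1]

# OLAP-style access: sum one column -- one contiguous read in the column layout,
# but it would have to touch every row in the row layout.
total = sum(columns["amount"])
print(record, total)  # (2, 'Omar', 80.0) 400.0
```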
RDBMS (Row-Based) vs Columnar Databases
| Aspect | RDBMS (Row-Based) | Columnar Databases |
|---|---|---|
| Workload | Transactions | Analytics |
| Writes | Fast | Slower |
| Reads (aggregations) | Slower | Very fast |
| Example | PostgreSQL | BigQuery, Redshift |
Data Warehousing Concepts: Facts and Dimensions
What are Fact Tables?
Fact tables store measurable numbers (metrics).
Examples: sales_amount, quantity, revenue
What are Dimension Tables?
Dimension tables store descriptive attributes to analyze facts.
Examples: customer, product, date, location
Types of Facts
| Type | Meaning | Example |
|---|---|---|
| Transactional Fact | One row per transaction | Each order |
| Snapshot Fact | State at a point in time | Daily inventory |
| Accumulating Fact | Tracks process over time | Order lifecycle |
Characteristics of Fact vs Dimension Tables
| Aspect | Fact Table | Dimension Table |
|---|---|---|
| What it stores | Metrics (numbers) | Descriptions (attributes) |
| Size | Very large | Smaller |
| Keys | Foreign keys to dimensions | Primary keys |
| Example | Sales fact | Customer dimension |
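A typical star-schema query joins the fact table to a dimension and aggregates the metric. A small sketch with hypothetical `fact_sales` and `dim_customer` tables (SQLite standing in for a real warehouse):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: descriptive attributes with a primary key.
cur.execute("CREATE TABLE dim_customer (customer_id INT PRIMARY KEY, name TEXT, country TEXT)")
cur.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)",
                [(1, "Salma", "EG"), (2, "Omar", "AE")])

# Fact table: metrics plus foreign keys to the dimensions.
cur.execute("CREATE TABLE fact_sales (sale_id INT, customer_id INT, sales_amount REAL, quantity INT)")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(100, 1, 50.0, 2), (101, 1, 30.0, 1), (102, 2, 70.0, 3)])

# Join fact to dimension, aggregate the metric by a dimension attribute.
report = cur.execute("""
    SELECT d.country, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_customer d ON f.customer_id = d.customer_id
    GROUP BY d.country
    ORDER BY d.country
""").fetchall()
print(report)  # [('AE', 70.0), ('EG', 80.0)]
```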
Data Lakehouse Architecture
Source → Ingestion → Data Lake Storage → Lakehouse Layer → BI / ML / Analytics
What the Lakehouse layer adds:
- ACID transactions for reliability
- Indexing for faster queries
- Metadata for governance and discovery
- Performance optimizations for analytics
Examples of Lakehouse Technologies:
- Delta Lake (Databricks)
- Apache Iceberg
- Apache Hudi
What is Informatica?
Informatica is an enterprise ETL tool used to extract, transform, and load data from source systems into data warehouses or data lakes.
Example:
Move sales data from PostgreSQL → clean it → load into Snowflake.
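Informatica itself is a visual, enterprise tool, but the extract → transform → load flow it automates can be sketched in plain Python. Here two in-memory SQLite databases stand in for PostgreSQL (source) and Snowflake (target), and the "clean" step is a simple trim/cast/normalize:

```python
import sqlite3

# Source system (standing in for PostgreSQL), with messy values.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE sales (id INT, amount TEXT, region TEXT)")
src.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [(1, " 120.5 ", "north"), (2, "80", "SOUTH")])

# Target system (standing in for Snowflake).
tgt = sqlite3.connect(":memory:")
tgt.execute("CREATE TABLE sales_clean (id INT, amount REAL, region TEXT)")

# Extract -> Transform (trim, cast, normalize case) -> Load.
for id_, amount, region in src.execute("SELECT id, amount, region FROM sales"):
    tgt.execute("INSERT INTO sales_clean VALUES (?, ?, ?)",
                (id_, float(amount.strip()), region.strip().lower()))
tgt.commit()

loaded = tgt.execute("SELECT * FROM sales_clean ORDER BY id").fetchall()
print(loaded)  # [(1, 120.5, 'north'), (2, 80.0, 'south')]
```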
Final End-to-End Summary
- OLTP databases run daily business transactions.
- OLAP systems (data warehouses) support analytics and reporting.
- Data lakes store raw data of all types.
- Lakehouse architecture combines low-cost storage with fast analytics.
- Facts and dimensions organize data for reporting.
- File formats (Avro/Parquet/ORC) and row vs column storage determine read and write performance.