DEV Community

Salma Aga Shaik
Salma Aga Shaik

Posted on

Data Engineering Basics: From What is Data to Modern Lakehouse Architecture

This post explains data fundamentals, databases, data warehousing, data lakes, and modern lakehouse architecture.


What is Data?

Data is raw facts or raw information collected from applications, users, and machines. On its own, data has little meaning. When we process, clean, and analyze data, it becomes useful information for business decisions.

Examples of data:

  • Customer name, email,Order amount, order time,Website clicks, error logs,Sensor readings

An e-commerce app stores every order as data. When analysts look at monthly sales trends and top-selling products, that processed data becomes insights.


Types of Data (Structured, Semi-Structured, Unstructured)

Aspect Structured Data Semi-Structured Data Unstructured Data
What it means Data stored in rows and columns with a fixed schema. Data with some structure (keys/tags), but no fixed table schema. Data with no predefined structure.
Where it is stored Relational Databases, Data Warehouses Data Lakes, modern warehouses Object storage, file systems
How easy to query Very easy with SQL Needs parsing/flattening Needs preprocessing/AI-ML
Examples Customer table, Orders table JSON from APIs, Web logs, Avro/Parquet Images, videos, PDFs, emails

Databases and Data Storage

Databases are systems used to store and manage structured data for applications.
Data storage includes databases plus file systems and cloud object storage (for raw files).

Examples:

  • Databases: PostgreSQL, MySQL, Oracle
  • Object Storage: AWS S3, Azure ADLS, Google GCS

SQL & Relational Databases

What is SQL?

SQL (Structured Query Language) is used to read and write data in relational databases.

What is a Relational Database (RDBMS)?

An RDBMS stores data in tables with relationships (primary keys and foreign keys).

Examples: PostgreSQL, MySQL, SQL Server


DDL vs DML

Category What it does Examples
DDL (Data Definition Language) Defines or changes table structure CREATE, ALTER, DROP
DML (Data Manipulation Language) Reads and modifies data inside tables INSERT, UPDATE, DELETE, SELECT

Example (DDL):

CREATE TABLE customers (id INT, name VARCHAR(100));
Enter fullscreen mode Exit fullscreen mode

Example (DML):

INSERT INTO customers VALUES (1, 'Salma');
SELECT * FROM customers;
Enter fullscreen mode Exit fullscreen mode

OLTP vs OLAP (Databases vs Analytics)

Aspect OLTP (Online Transaction Processing) OLAP (Online Analytical Processing)
Main purpose Run daily transactions for apps Run analytics and reporting
Query pattern Many small, fast writes/reads Large scans and aggregations
Data Current operational data Historical, aggregated data
Typical systems PostgreSQL, MySQL Snowflake, BigQuery, Redshift
Example Placing an order Yearly sales analysis

ACID Transactions (Why Databases are Reliable)

Term Meaning
Atomicity A transaction is all-or-nothing.
Consistency Data stays valid and correct.
Isolation Parallel users do not interfere.
Durability Once saved, data will not be lost.

Example:
If a payment fails halfway, Atomicity ensures the whole transaction is rolled back.


Data Warehouse vs Data Lake

Aspect Data Warehouse Data Lake
Data types Structured only Structured, semi-structured, unstructured
Schema Schema-on-write (define before load) Schema-on-read (define at query time)
Cost Higher storage cost Lower storage cost
Main use BI reports, dashboards Raw storage, ML/AI, exploration
Examples Snowflake, Redshift, BigQuery AWS S3, Azure ADLS, Google GCS

Data Formats: Avro vs Parquet vs ORC

Format Storage Style Best for Example use
Avro Row-based Streaming, fast writes Kafka pipelines
Parquet Column-based Analytics, fast reads BI queries in Spark
ORC Column-based Analytics with compression Hive/Spark

Row-Based vs Column-Based Storage

Aspect Row-Based Storage Column-Based Storage
How data is stored Entire rows together Same columns together
Best for OLTP transactions OLAP analytics
Typical systems PostgreSQL, MySQL BigQuery, Redshift, Parquet
Example Fetch one customer record Aggregate one column across millions of rows

RDBMS (Row-Based) vs Columnar Databases

Aspect RDBMS (Row-Based) Columnar Databases
Workload Transactions Analytics
Writes Fast Slower
Reads (aggregations) Slower Very fast
Example PostgreSQL BigQuery, Redshift

Data Warehousing Concepts: Facts and Dimensions

What are Fact Tables?

Fact tables store measurable numbers (metrics).

Examples: sales_amount, quantity, revenue

What are Dimension Tables?

Dimension tables store descriptive attributes to analyze facts.

Examples: customer, product, date, location


Types of Facts

Type Meaning Example
Transactional Fact One row per transaction Each order
Snapshot Fact State at a point in time Daily inventory
Accumulating Fact Tracks process over time Order lifecycle

Characteristics of Fact vs Dimension Tables

Aspect Fact Table Dimension Table
What it stores Metrics (numbers) Descriptions (attributes)
Size Very large Smaller
Keys Foreign keys to dimensions Primary keys
Example Sales fact Customer dimension

Data Lakehouse Architecture

Source → Ingestion → Data Lake Storage → Lakehouse Layer → BI / ML / Analytics

What the Lakehouse layer adds:

  • ACID transactions for reliability
  • Indexing for faster queries
  • Metadata for governance and discovery
  • Performance optimizations for analytics

Examples of Lakehouse Technologies:

  • Delta Lake (Databricks)
  • Apache Iceberg
  • Apache Hudi

What is Informatica?

Informatica is an enterprise ETL tool used to extract, transform, and load data from source systems into data warehouses or data lakes.

Example:
Move sales data from PostgreSQL → clean it → load into Snowflake.


Final End-to-End Summary

  • OLTP databases run daily business transactions.
  • OLAP systems (data warehouses) support analytics and reporting.
  • Data lakes store raw data of all types.
  • Lakehouse architecture combines low-cost storage with fast analytics.
  • Facts and dimensions organize data for reporting.
  • Avro/Parquet/ORC and row vs column storage decide performance.

Top comments (0)