This post explains data fundamentals, databases, data warehousing, data lakes, and modern lakehouse architecture.
What is Data?
Data is the raw facts collected from applications, users, and machines. On its own, data has little meaning. When we process, clean, and analyze it, it becomes useful information for business decisions.
Examples of data:
- Customer name, email
- Order amount, order time
- Website clicks, error logs
- Sensor readings
An e-commerce app stores every order as data. When analysts look at monthly sales trends and top-selling products, that processed data becomes insights.
Types of Data (Structured, Semi-Structured, Unstructured)
| Aspect | Structured Data | Semi-Structured Data | Unstructured Data |
|---|---|---|---|
| What it means | Data stored in rows and columns with a fixed schema. | Data with some structure (keys/tags), but no fixed table schema. | Data with no predefined structure. |
| Where it is stored | Relational Databases, Data Warehouses | Data Lakes, modern warehouses | Object storage, file systems |
| How easy to query | Very easy with SQL | Needs parsing/flattening | Needs preprocessing/AI-ML |
| Examples | Customer table, Orders table | JSON from APIs, Web logs, Avro/Parquet | Images, videos, PDFs, emails |
Databases and Data Storage
Databases are systems used to store and manage structured data for applications.
Data storage includes databases plus file systems and cloud object storage (for raw files).
Examples:
- Databases: PostgreSQL, MySQL, Oracle
- Object Storage: AWS S3, Azure ADLS, Google GCS
SQL & Relational Databases
What is SQL?
SQL (Structured Query Language) is used to read and write data in relational databases.
What is a Relational Database (RDBMS)?
An RDBMS stores data in tables with relationships (primary keys and foreign keys).
Examples: PostgreSQL, MySQL, SQL Server
DDL vs DML
| Category | What it does | Examples |
|---|---|---|
| DDL (Data Definition Language) | Defines or changes table structure | CREATE, ALTER, DROP |
| DML (Data Manipulation Language) | Reads and modifies data inside tables | INSERT, UPDATE, DELETE, SELECT |
Example (DDL):

```sql
CREATE TABLE customers (id INT, name VARCHAR(100));
```

Example (DML):

```sql
INSERT INTO customers VALUES (1, 'Salma');
SELECT * FROM customers;
```
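You can try the DDL and DML statements above end-to-end with Python's built-in `sqlite3` module. SQLite is just one convenient RDBMS for a quick sketch; the same SQL works in PostgreSQL or MySQL:

```python
import sqlite3

# In-memory SQLite database for a quick demonstration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the table structure.
cur.execute("CREATE TABLE customers (id INT, name VARCHAR(100))")

# DML: insert a row, then read it back.
cur.execute("INSERT INTO customers VALUES (1, 'Salma')")
rows = cur.execute("SELECT * FROM customers").fetchall()
print(rows)  # [(1, 'Salma')]
```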
OLTP vs OLAP (Databases vs Analytics)
| Aspect | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
|---|---|---|
| Main purpose | Run daily transactions for apps | Run analytics and reporting |
| Query pattern | Many small, fast writes/reads | Large scans and aggregations |
| Data | Current operational data | Historical, aggregated data |
| Typical systems | PostgreSQL, MySQL | Snowflake, BigQuery, Redshift |
| Example | Placing an order | Yearly sales analysis |
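The two query patterns in the table can be contrasted on one small `orders` table (a hypothetical schema for illustration, again using `sqlite3` as a stand-in database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INT, customer_id INT, amount REAL, year INT)")
cur.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, 10, 25.0, 2024), (2, 11, 40.0, 2024), (3, 10, 15.0, 2025)],
)

# OLTP-style query: fetch one order by key (small, fast, selective).
one_order = cur.execute("SELECT * FROM orders WHERE id = 2").fetchone()

# OLAP-style query: scan the history and aggregate (yearly sales analysis).
by_year = cur.execute(
    "SELECT year, SUM(amount) FROM orders GROUP BY year ORDER BY year"
).fetchall()
print(one_order)  # (2, 11, 40.0, 2024)
print(by_year)    # [(2024, 65.0), (2025, 15.0)]
```

At real scale, OLTP systems optimize for millions of the first kind of query per day, while OLAP systems optimize for the second kind over billions of rows.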
ACID Transactions (Why Databases are Reliable)
| Term | Meaning |
|---|---|
| Atomicity | A transaction is all-or-nothing. |
| Consistency | Data stays valid and correct. |
| Isolation | Parallel users do not interfere. |
| Durability | Once saved, data will not be lost. |
Example:
If a payment fails halfway, Atomicity ensures the whole transaction is rolled back.
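Atomicity is easy to see in code. In this sketch, `sqlite3`'s connection context manager commits on success and rolls back on any exception, so a simulated mid-payment failure leaves both balances untouched:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100.0), ('bob', 50.0)")
conn.commit()

try:
    with conn:  # transaction: commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        # Simulate a crash before the matching credit to bob ever runs.
        raise RuntimeError("payment gateway failed halfway")
except RuntimeError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts").fetchall())
print(balances)  # {'alice': 100.0, 'bob': 50.0} -- the debit was rolled back
```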
Data Warehouse vs Data Lake
| Aspect | Data Warehouse | Data Lake |
|---|---|---|
| Data types | Structured only | Structured, semi-structured, unstructured |
| Schema | Schema-on-write (define before load) | Schema-on-read (define at query time) |
| Cost | Higher storage cost | Lower storage cost |
| Main use | BI reports, dashboards | Raw storage, ML/AI, exploration |
| Examples | Snowflake, Redshift, BigQuery | AWS S3, Azure ADLS, Google GCS |
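Schema-on-read is the key difference in practice: raw records land in the lake as-is, and structure is applied only when you query. A minimal sketch with made-up JSON events (a warehouse would instead require defining columns before loading):

```python
import json

# Raw JSON lands in the lake untouched; no table schema is defined up front.
raw_events = [
    '{"user": "salma", "action": "click", "page": "/home"}',
    '{"user": "omar", "action": "purchase", "amount": 49.9}',  # extra field, no schema change needed
]

# Schema-on-read: parse and shape the data at query time.
parsed = [json.loads(e) for e in raw_events]
purchases = [e for e in parsed if e["action"] == "purchase"]
print(purchases)  # [{'user': 'omar', 'action': 'purchase', 'amount': 49.9}]
```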
Data Formats: Avro vs Parquet vs ORC
| Format | Storage Style | Best for | Example use |
|---|---|---|---|
| Avro | Row-based | Streaming, fast writes | Kafka pipelines |
| Parquet | Column-based | Analytics, fast reads | BI queries in Spark |
| ORC | Column-based | Analytics with compression | Hive/Spark |
Row-Based vs Column-Based Storage
| Aspect | Row-Based Storage | Column-Based Storage |
|---|---|---|
| How data is stored | Entire rows together | Same columns together |
| Best for | OLTP transactions | OLAP analytics |
| Typical systems | PostgreSQL, MySQL | BigQuery, Redshift, Parquet |
| Example | Fetch one customer record | Aggregate one column across millions of rows |
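The two layouts can be mocked with plain Python containers to show why each access pattern is cheap in one layout and expensive in the other:

```python
# Row-based layout: each record is stored together (good for fetching one record).
rows = [
    (1, "Salma", 120.0),
    (2, "Omar", 80.0),
    (3, "Lina", 200.0),
]

# Column-based layout: each column is stored together (good for aggregating one column).
columns = {
    "id":     [1, 2, 3],
    "name":   ["Salma", "Omar", "Lina"],
    "amount": [120.0, 80.0, 200.0],
}

# OLTP-style access: fetch one full record -- one contiguous read in the row layout.
record = rows[1]

# OLAP-style access: sum one column -- one contiguous read in the column layout,
# but it would have to touch every row in the row layout.
total = sum(columns["amount"])
print(record, total)  # (2, 'Omar', 80.0) 400.0
```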
RDBMS (Row-Based) vs Columnar Databases
| Aspect | RDBMS (Row-Based) | Columnar Databases |
|---|---|---|
| Workload | Transactions | Analytics |
| Writes | Fast | Slower |
| Reads (aggregations) | Slower | Very fast |
| Example | PostgreSQL | BigQuery, Redshift |
Data Warehousing Concepts: Facts and Dimensions
What are Fact Tables?
Fact tables store measurable numbers (metrics).
Examples: sales_amount, quantity, revenue
What are Dimension Tables?
Dimension tables store descriptive attributes to analyze facts.
Examples: customer, product, date, location
Types of Facts
| Type | Meaning | Example |
|---|---|---|
| Transactional Fact | One row per transaction | Each order |
| Snapshot Fact | State at a point in time | Daily inventory |
| Accumulating Fact | Tracks process over time | Order lifecycle |
Characteristics of Fact vs Dimension Tables
| Aspect | Fact Table | Dimension Table |
|---|---|---|
| What it stores | Metrics (numbers) | Descriptions (attributes) |
| Size | Very large | Smaller |
| Keys | Foreign keys to dimensions | Primary keys |
| Example | Sales fact | Customer dimension |
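A typical star-schema query joins the fact table to a dimension and aggregates the metric. A small sketch with hypothetical `fact_sales` and `dim_customer` tables (SQLite standing in for a real warehouse):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: descriptive attributes with a primary key.
cur.execute("CREATE TABLE dim_customer (customer_id INT PRIMARY KEY, name TEXT, country TEXT)")
cur.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)",
                [(1, "Salma", "EG"), (2, "Omar", "AE")])

# Fact table: metrics plus foreign keys to the dimensions.
cur.execute("CREATE TABLE fact_sales (sale_id INT, customer_id INT, sales_amount REAL, quantity INT)")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(100, 1, 50.0, 2), (101, 1, 30.0, 1), (102, 2, 70.0, 3)])

# Join fact to dimension, aggregate the metric by a dimension attribute.
report = cur.execute("""
    SELECT d.country, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_customer d ON f.customer_id = d.customer_id
    GROUP BY d.country
    ORDER BY d.country
""").fetchall()
print(report)  # [('AE', 70.0), ('EG', 80.0)]
```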
Data Lakehouse Architecture
Source → Ingestion → Data Lake Storage → Lakehouse Layer → BI / ML / Analytics
What the Lakehouse layer adds:
- ACID transactions for reliability
- Indexing for faster queries
- Metadata for governance and discovery
- Performance optimizations for analytics
Examples of Lakehouse Technologies:
- Delta Lake (Databricks)
- Apache Iceberg
- Apache Hudi
What is Informatica?
Informatica is an enterprise ETL tool used to extract, transform, and load data from source systems into data warehouses or data lakes.
Example:
Move sales data from PostgreSQL → clean it → load into Snowflake.
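Informatica itself is a visual, enterprise tool, but the extract → transform → load flow it automates can be sketched in plain Python. Here two in-memory SQLite databases stand in for PostgreSQL (source) and Snowflake (target), and the "clean" step is a simple trim/cast/normalize:

```python
import sqlite3

# Source system (standing in for PostgreSQL), with messy values.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE sales (id INT, amount TEXT, region TEXT)")
src.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [(1, " 120.5 ", "north"), (2, "80", "SOUTH")])

# Target system (standing in for Snowflake).
tgt = sqlite3.connect(":memory:")
tgt.execute("CREATE TABLE sales_clean (id INT, amount REAL, region TEXT)")

# Extract -> Transform (trim, cast, normalize case) -> Load.
for id_, amount, region in src.execute("SELECT id, amount, region FROM sales"):
    tgt.execute("INSERT INTO sales_clean VALUES (?, ?, ?)",
                (id_, float(amount.strip()), region.strip().lower()))
tgt.commit()

loaded = tgt.execute("SELECT * FROM sales_clean ORDER BY id").fetchall()
print(loaded)  # [(1, 120.5, 'north'), (2, 80.0, 'south')]
```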
Final End-to-End Summary
- OLTP databases run daily business transactions.
- OLAP systems (data warehouses) support analytics and reporting.
- Data lakes store raw data of all types.
- Lakehouse architecture combines low-cost storage with fast analytics.
- Facts and dimensions organize data for reporting.
- File formats (Avro/Parquet/ORC) and row vs column storage determine read and write performance.