Before diving into machine learning models, dashboards, or real-time pipelines, every Data Engineer must first master how data is stored, moved, and protected.
This foundation — databases, file systems, and distributed storage — forms the invisible infrastructure behind every analytics and AI system.
In this post, we’ll explore how modern data systems organize, process, and secure information, starting with the heart of it all — Databases — and then moving into File Systems and Data Formats.
🗄️💡 What Is a Database?
A database is an organized collection of data that allows efficient storage, retrieval, and management of information. It’s the backbone of every data-driven system, from your favorite mobile app to massive enterprise data warehouses.
At its core, a database enables both data organization and data protection — ensuring information remains consistent, accessible, and secure.
🎯 Why Do We Use Databases?
Databases are not just about storing data; they are about ensuring trust, structure, and safety. They make sure data stays accurate, available, and protected, even when systems crash or users make mistakes.
Purpose | Description | Example |
---|---|---|
Data Integrity | Ensures accuracy and consistency through constraints and validation rules. | Enforcing unique email addresses per user. |
Data Security | Uses authentication, authorization, and encryption to control access. | Only admins can update salary information. |
Backup & Recovery | Enables restoration after deletion or corruption. | MySQL binary logs for point-in-time recovery. |
Data Relationships | Connects related data using primary and foreign keys. | Linking users and orders by user_id. |
Data Sharing | Allows safe, concurrent access by multiple users or systems. | Microservices sharing a centralized DB. |
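To make the data-integrity row concrete, here is a small sketch using Python's built-in sqlite3 module (an embedded database from the list below). The table and email values are illustrative, not from the article; the point is that the database itself rejects the bad write:

```python
import sqlite3

# Illustrative only: SQLite enforcing the "unique email per user"
# integrity rule from the table above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")
conn.execute("INSERT INTO users (email) VALUES ('alice@example.com')")

try:
    # A second insert with the same email violates the UNIQUE constraint.
    conn.execute("INSERT INTO users (email) VALUES ('alice@example.com')")
except sqlite3.IntegrityError as e:
    print("Rejected:", e)  # the database protects integrity for us
```

The application never has to check for duplicates itself: the constraint guarantees the rule holds even under concurrent writers.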
🧩 Types of Databases
Before diving into how databases store data (Row-based vs Column-based), let’s zoom out and explore the different types of databases used across modern data systems.
Each type is built to solve a particular kind of problem — from handling financial transactions to analyzing time-series data, storing documents, or powering AI search.
Here’s a quick overview of the 15 common database types you’ll encounter as a Data Engineer:
Credit: blog.algomaster.io
# | Database Type | Description | Common Use Cases | Examples |
---|---|---|---|---|
1 | Relational (RDBMS) | Stores structured data in tables with rows & columns, supports SQL and ACID transactions. | Banking, ERP, CRM | MySQL, PostgreSQL, Oracle |
2 | Key-Value Store | Stores data as key–value pairs for ultra-fast access and caching. | Session storage, caching, real-time apps | Redis, DynamoDB |
3 | Document Database | Stores semi-structured data as JSON/XML documents; flexible schema. | CMS, product catalogs, IoT data | MongoDB, Couchbase |
4 | Graph Database | Focuses on relationships between data entities (nodes & edges). | Social networks, recommendations | Neo4j, Amazon Neptune |
5 | Wide-Column Store | Stores data in columns under dynamic column families; designed for distributed scale. | Real-time analytics, logs | Cassandra, HBase, Bigtable |
6 | In-Memory Database | Keeps data in RAM for low-latency performance. | Gaming, trading systems, caching | Redis, Memcached |
7 | Time-Series Database | Optimized for time-stamped data and metric tracking. | Monitoring, IoT, stock analysis | InfluxDB, TimescaleDB, Prometheus |
8 | Object-Oriented Database | Stores data as objects (like OOP classes), preserving data + behavior. | Multimedia, simulation, CAD | ObjectDB, db4o |
9 | Text Search Database | Indexes and queries unstructured text efficiently. | Search engines, log analytics | Elasticsearch, Solr |
10 | Spatial Database | Handles geographical and geometric data. | Maps, logistics, location-based apps | PostGIS, Oracle Spatial |
11 | Blob Datastore | Manages large binary data like videos or images. | CDNs, archives, backups | Amazon S3, Azure Blob, HDFS |
12 | Ledger Database | Immutable, append-only transaction records (blockchain-like). | Supply chain, finance, healthcare | Amazon QLDB, Hyperledger |
13 | Hierarchical Database | Organizes data in parent–child (tree) structures. | File systems, organizational data | IBM IMS, Windows Registry |
14 | Vector Database | Stores and searches high-dimensional vectors for similarity search (AI). | Recommendations, semantic search, anomaly detection | Pinecone, Milvus, FAISS |
15 | Embedded Database | Lightweight DB embedded inside an application. | Mobile apps, desktop software | SQLite, RocksDB |
💡 Choosing the Right Type
Factor | What It Means | Example Decision |
---|---|---|
Data Model | How your data is structured — relational, document, graph, or vector. | Social network → Graph DB |
Workload Type | Transactional (OLTP) vs Analytical (OLAP). | E-commerce orders → RDBMS |
Scale & Latency Needs | Distributed? Real-time? Global scale? | Real-time metrics → Time-series DB |
🧮 Types of Databases by Storage Model
Databases generally fall into two categories — Row-based and Column-based.
Each is optimized for a specific purpose:
- OLTP (Online Transaction Processing) → frequent small writes and updates
- OLAP (Online Analytical Processing) → large-scale data analysis
🧱 1. Row-Based Databases (OLTP)
Used for: Real-time transactions and fast updates
Examples: MySQL, PostgreSQL, SQL Server, Oracle
```sql
CREATE TABLE users (
    user_id SERIAL PRIMARY KEY,
    name    VARCHAR(50),
    age     INT,
    email   VARCHAR(100),
    city    VARCHAR(50)
);
```
✅ Best For: Banking systems, e-commerce orders, CRM, and user data management
📊 2. Column-Based Databases (OLAP)
Used for: Analytical workloads and complex aggregations
Examples: Snowflake, BigQuery, Amazon Redshift, ClickHouse
✅ Best For:
- Business Intelligence dashboards
- Machine Learning feature extraction
- Data Warehousing
🔁 Row vs. Column Summary
Feature | Row-Based (OLTP) | Column-Based (OLAP) |
---|---|---|
Storage | Stores complete rows | Stores each column separately |
Best For | Transactions | Analytics |
Performance | Fast writes/updates | Fast aggregations |
Examples | MySQL, PostgreSQL | BigQuery, Redshift |
Use Case | E-commerce, banking | BI dashboards, ML pipelines |
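The storage difference is easy to model in plain Python. This is a toy sketch only (real engines like MySQL or BigQuery are vastly more sophisticated), but it shows why each layout wins at its job:

```python
# Toy model: the same three records in row layout vs. column layout.

# Row-based: each record stored together -- ideal for "fetch user 2".
row_store = [
    {"name": "John", "age": 30, "city": "New York"},
    {"name": "Alice", "age": 25, "city": "Los Angeles"},
    {"name": "Bob", "age": 35, "city": "Chicago"},
]

# Column-based: each column stored together -- ideal for aggregations.
column_store = {
    "name": ["John", "Alice", "Bob"],
    "age": [30, 25, 35],
    "city": ["New York", "Los Angeles", "Chicago"],
}

# OLTP-style lookup: one row, all fields, in a single read.
print(row_store[1])

# OLAP-style aggregation: scans only the age column, never touching
# the name or city data at all.
avg_age = sum(column_store["age"]) / len(column_store["age"])
print(avg_age)  # 30.0
```

Scale those three rows up to billions, and skipping the unneeded columns is exactly where columnar warehouses get their speed.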
🧰 Data Engineering Perspective
Area | Focus | Example |
---|---|---|
Data Ingestion | Load data into OLTP or OLAP depending on purpose | API → MySQL (OLTP), Kafka → Redshift (OLAP) |
ETL Pipelines | Transform data from transactional to analytical systems | Airflow job: MySQL → Snowflake |
Data Lakehouse | Combine both OLTP and OLAP benefits | Delta Lake, Apache Iceberg |
Performance Tuning | Optimize storage and query efficiency | Partitioning, indexing, caching |
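The ETL row above can be sketched as three tiny functions. Everything here is hypothetical (a real pipeline would be an Airflow DAG moving data from MySQL to Snowflake); the sketch only shows the extract → transform → load shape:

```python
# Hypothetical ETL sketch: names and data are illustrative only.

def extract():
    # Pretend these rows came from an OLTP source such as MySQL.
    return [
        {"order_id": 1, "amount": "19.99", "country": "us"},
        {"order_id": 2, "amount": "5.00", "country": "de"},
    ]

def transform(rows):
    # Cast types and normalize values for the analytical store.
    return [
        {"order_id": r["order_id"],
         "amount": float(r["amount"]),
         "country": r["country"].upper()}
        for r in rows
    ]

def load(rows, warehouse):
    # Pretend the target is an OLAP warehouse such as Snowflake.
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse[0])  # {'order_id': 1, 'amount': 19.99, 'country': 'US'}
```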
🔐 Data Security in Databases
Security Area | Description | Example |
---|---|---|
Authentication | Verify user identity | IAM, Kerberos |
Authorization | Role-based access control | GRANT SELECT ON sales TO analyst_role; |
Encryption at Rest | Secure stored data | Transparent Data Encryption (TDE) |
Encryption in Transit | Protect data during transfer | SSL/TLS connections |
Auditing & Logging | Track access and changes | PostgreSQL audit logs |
Backup Security | Secure backup files | Encrypted S3 storage with access policies |
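The authorization row maps to a simple mental model: a set of grants checked before every action. The sketch below is a toy, purely for intuition; in practice you rely on the database's own GRANT system, IAM, or ACLs rather than application code:

```python
# Toy role-based access control, mirroring the GRANT example above.
# Illustrative only -- real systems use the database's native ACLs/IAM.
GRANTS = {
    "analyst_role": {("SELECT", "sales")},
    "admin_role": {("SELECT", "sales"), ("UPDATE", "salaries")},
}

def is_authorized(role: str, action: str, table: str) -> bool:
    """Return True only if the role was explicitly granted this action."""
    return (action, table) in GRANTS.get(role, set())

print(is_authorized("analyst_role", "SELECT", "sales"))     # True
print(is_authorized("analyst_role", "UPDATE", "salaries"))  # False
```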
🗂️ File Systems & Data Formats in Data Engineering & Security
After data is moved or exported from databases, it is stored in files — usually in Amazon S3, Google Cloud Storage, or HDFS (Hadoop Distributed File System) — where data engineers can process it in bulk.
🧭 What Are File Systems in Data Engineering?
A file system manages how data files (like logs, CSVs, or Parquet files) are stored, organized, and accessed.
In modern pipelines, this could mean local storage for testing or distributed cloud storage for petabyte-scale data.
🧱 Common Distributed File Systems
File System | Type | Description | Use Case |
---|---|---|---|
HDFS | On-prem / Cluster | Stores large files across Hadoop nodes. | Spark/Hive batch jobs |
Amazon S3 | Cloud | Highly durable object storage with virtually unlimited capacity. | AWS Data Lake |
GCS | Cloud | Google’s object storage, integrates with BigQuery. | Cloud analytics |
ADLS | Cloud | Azure-native analytics storage. | ETL for Synapse |
Local/NFS | On-prem | For small-scale or dev testing. | Local development |
📦 Data Formats — How Data Is Stored, Read, and Secured
🧾 Overview Table
Format | Type | Pros | Cons | Common Use Case |
---|---|---|---|---|
CSV | Text (Row-based) | ✅ Simple, human-readable<br>✅ Widely supported<br>✅ Easy to share | ❌ No schema<br>❌ Inefficient for large data<br>❌ No compression | Data import/export, spreadsheets, ETL staging |
JSON | Semi-structured text | ✅ Flexible, supports nested data<br>✅ Ideal for APIs and logs<br>✅ Easy to parse | ❌ Verbose<br>❌ Larger size<br>❌ Slower for analytics | Web APIs, event logs, NoSQL data exchange |
Avro | Binary (Row-based) | ✅ Compact and fast<br>✅ Supports schema evolution<br>✅ Ideal for streaming | ❌ Not human-readable<br>❌ Needs schema registry | Kafka pipelines, inter-service communication |
Parquet | Binary (Columnar) | ✅ High compression<br>✅ Reads only needed columns<br>✅ Perfect for OLAP | ❌ Inefficient for updates<br>❌ Not human-readable | Data warehouses, analytics, ML feature stores |
ORC | Binary (Columnar) | ✅ Strong compression<br>✅ Predicate pushdown<br>✅ Metadata stats | ❌ Hadoop-dependent<br>❌ Less flexible in cloud | Hive, Presto, Hadoop-based analytics |
XML | Structured text | ✅ Self-descriptive<br>✅ Great for hierarchical data<br>✅ Legacy-compatible | ❌ Verbose<br>❌ Parsing overhead<br>❌ Storage-heavy | Configuration files, web services |
Delta Lake / Iceberg / Hudi | Table formats (built on Parquet/ORC) | ✅ ACID transactions<br>✅ Time travel<br>✅ Schema evolution<br>✅ Combines OLTP + OLAP | ❌ Setup complexity<br>❌ Metadata overhead | Data Lakehouse, unified analytics |
🧩 Format Examples
1️⃣ CSV — Simple and Ubiquitous
```csv
name,age,city
John,30,New York
Alice,25,Los Angeles
Bob,35,Chicago
```
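Reading the sample above takes a few lines with Python's standard csv module, and it also exposes CSV's main weakness from the table: there is no schema, so every value arrives as a string:

```python
import csv
import io

# Parse the CSV sample above (io.StringIO stands in for an open file).
raw = """name,age,city
John,30,New York
Alice,25,Los Angeles
Bob,35,Chicago
"""

rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["city"])       # New York
print(type(rows[0]["age"]))  # <class 'str'> -- no schema, no types
```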
2️⃣ JSON — Flexible and Web-Friendly
```json
{
  "name": "Alice",
  "age": 25,
  "skills": ["Python", "SQL"],
  "city": "Los Angeles"
}
```
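Unlike CSV, JSON keeps nested structure and native types, which is why it dominates APIs and event logs. A quick round trip with the standard json module:

```python
import json

# Parse the JSON document above; nested arrays map to Python lists.
doc = '{"name": "Alice", "age": 25, "skills": ["Python", "SQL"], "city": "Los Angeles"}'
user = json.loads(doc)

print(user["skills"][0])  # Python
print(user["age"] + 1)    # 26 -- age is a real integer, not a string

# Serialize back to text, e.g. for an API response or a log line.
payload = json.dumps(user, indent=2)
```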
3️⃣ Avro — Schema-Controlled Serialization
Avro files are compact binary and not human-readable; what an Avro file actually contains is records like the JSON below, serialized against a schema that travels with the data (or lives in a schema registry):
```json
{ "name": "Alice", "age": 25, "city": "LA" }
```
4️⃣ Parquet — Analytics Powerhouse
Conceptually, Parquet lays the same records out column by column (the actual file is binary, with compression and per-chunk statistics):
```text
name: [John, Alice, Bob]
age:  [30, 25, 35]
city: [NY, LA, Chicago]
```
5️⃣ ORC — Hadoop’s Heavy Lifter
ORC stores the same columns (name, age, city) column-wise, adding strong compression, lightweight indexes, and per-stripe min/max statistics that enable predicate pushdown.
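Those per-chunk statistics are what make predicate pushdown work in both Parquet (row groups) and ORC (stripes): the reader consults the min/max metadata and skips whole chunks without decoding them. A toy model of the idea, illustrative only:

```python
# Toy model of predicate pushdown. Each "chunk" stands in for a
# Parquet row group or ORC stripe, with min/max stats in its footer.
chunks = [
    {"stats": {"min": 18, "max": 29}, "age": [18, 22, 29]},
    {"stats": {"min": 30, "max": 41}, "age": [30, 35, 41]},
    {"stats": {"min": 42, "max": 60}, "age": [42, 55, 60]},
]

def ages_over(threshold):
    result = []
    for chunk in chunks:
        if chunk["stats"]["max"] <= threshold:
            continue  # entire chunk eliminated via metadata alone
        result.extend(a for a in chunk["age"] if a > threshold)
    return result

print(ages_over(40))  # [41, 42, 55, 60] -- first chunk never scanned
```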
6️⃣ XML — Old but Gold
```xml
<user>
  <name>John</name>
  <age>30</age>
  <city>New York</city>
</user>
```
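Hierarchical XML like this parses cleanly with the standard library; note that, like CSV, every value comes back as text and must be cast explicitly:

```python
import xml.etree.ElementTree as ET

# Parse the XML snippet above into an element tree.
xml_doc = """<user>
  <name>John</name>
  <age>30</age>
  <city>New York</city>
</user>"""

user = ET.fromstring(xml_doc)
print(user.find("name").text)         # John
print(int(user.find("age").text) + 1) # 31 -- values arrive as strings
```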
7️⃣ Delta Lake / Iceberg / Hudi — The Modern Lakehouse Formats
```sql
MERGE INTO customers AS c
USING updates AS u
ON c.id = u.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
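The MERGE (upsert) semantics are easy to picture in plain Python. This sketch only mirrors the matched/not-matched logic; what table formats like Delta Lake, Iceberg, and Hudi add on top is doing this atomically, with ACID guarantees, across files in object storage:

```python
# Toy upsert keyed by id, mirroring the MERGE statement above.
customers = {1: {"id": 1, "name": "Ana"}, 2: {"id": 2, "name": "Ben"}}
updates = [{"id": 2, "name": "Benjamin"}, {"id": 3, "name": "Cara"}]

for u in updates:
    if u["id"] in customers:
        customers[u["id"]].update(u)  # WHEN MATCHED THEN UPDATE
    else:
        customers[u["id"]] = dict(u)  # WHEN NOT MATCHED THEN INSERT

print(sorted(customers))     # [1, 2, 3]
print(customers[2]["name"])  # Benjamin
```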
🔐 Data Security Notes for All Formats
Aspect | Best Practice | Example Tools |
---|---|---|
Encryption at Rest | Encrypt files in storage | AWS KMS, Azure Key Vault |
Encryption in Transit | Secure file transfers | HTTPS, TLS |
Access Control | Restrict read/write by role | IAM roles, ACLs |
Schema Validation | Enforce strict schemas | Avro schema registry |
Compression & Privacy | Compress and mask sensitive data | Parquet + Snappy, Data masking in Spark |
📚 Reference
- Nnamdi Samuel, “Data File Formats Explained”, 2024 — used for core explanations of CSV, JSON, Parquet, Avro, etc.
- Ashish Pratap Singh, “15 Types of Databases and When to Use Them”, AlgoMaster.io, March 2024 — source of the database type classifications & inspiration for that section.
- Apache Hadoop / HDFS documentation — for HDFS design, block replication, encryption zones (for deeper technical reference).
- Official AWS, GCP, Azure documentation — for S3, GCS, ADLS design, IAM, encryption, and audit logging best practices.
- Databricks / Delta Lake, Apache Iceberg, Apache Hudi official docs — for transactional table formats, schema evolution, time travel.