DEV Community

Sajjad Rahman
Data Engineering 101: Understanding Databases, Storage, and Security

Before diving into machine learning models, dashboards, or real-time pipelines, every Data Engineer must first master how data is stored, moved, and protected.

This foundation — databases, file systems, and distributed storage — forms the invisible infrastructure behind every analytics and AI system.

In this post, we’ll explore how modern data systems organize, process, and secure information, starting with the heart of it all — Databases — and then moving into File Systems and Data Formats.


🗄️💡 What Is a Database?

A database is an organized collection of data that allows efficient storage, retrieval, and management of information. It’s the backbone of every data-driven system, from your favorite mobile app to massive enterprise data warehouses.

At its core, a database enables both data organization and data protection — ensuring information remains consistent, accessible, and secure.


🎯 Why Do We Use Databases?

Databases are not just about storing data; they are about ensuring trust, structure, and safety. They make sure data stays accurate, available, and protected, even when systems crash or users make mistakes.

| Purpose | Description | Example |
| --- | --- | --- |
| Data Integrity | Ensures accuracy and consistency through constraints and validation rules. | Enforcing unique email addresses per user. |
| Data Security | Uses authentication, authorization, and encryption to control access. | Only admins can update salary information. |
| Backup & Recovery | Enables restoration after deletion or corruption. | MySQL binary logs for point-in-time recovery. |
| Data Relationships | Connects related data using primary and foreign keys. | Linking users and orders by user_id. |
| Data Sharing | Allows safe, concurrent access by multiple users or systems. | Microservices sharing a centralized DB. |
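Constraint enforcement is easy to see in action. Here is a minimal sketch using Python's built-in sqlite3 module and a hypothetical users table; production engines like MySQL or PostgreSQL enforce UNIQUE constraints the same way:

```python
import sqlite3

# In-memory database for illustration; any RDBMS enforces
# constraints at the engine level in the same way.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        email   TEXT NOT NULL UNIQUE
    )
""")
conn.execute("INSERT INTO users (email) VALUES ('alice@example.com')")

# A second row with the same email violates the UNIQUE constraint,
# so the database rejects it instead of silently storing bad data.
try:
    conn.execute("INSERT INTO users (email) VALUES ('alice@example.com')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The application never has to deduplicate emails itself; the integrity rule lives with the data.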

🧩 Types of Databases

Before diving into how databases store data (Row-based vs Column-based), let’s zoom out and explore the different types of databases used across modern data systems.

Each type is built to solve a particular kind of problem — from handling financial transactions to analyzing time-series data, storing documents, or powering AI search.

Here’s a quick overview of the 15 common database types you’ll encounter as a Data Engineer:

(Image: overview of the 15 common database types. Credit: blog.algomaster.io)

| # | Database Type | Description | Common Use Cases | Examples |
| --- | --- | --- | --- | --- |
| 1 | Relational (RDBMS) | Stores structured data in tables with rows & columns; supports SQL and ACID transactions. | Banking, ERP, CRM | MySQL, PostgreSQL, Oracle |
| 2 | Key-Value Store | Stores data as key–value pairs for ultra-fast access and caching. | Session storage, caching, real-time apps | Redis, DynamoDB |
| 3 | Document Database | Stores semi-structured data as JSON/XML documents; flexible schema. | CMS, product catalogs, IoT data | MongoDB, Couchbase |
| 4 | Graph Database | Focuses on relationships between data entities (nodes & edges). | Social networks, recommendations | Neo4j, Amazon Neptune |
| 5 | Wide-Column Store | Stores data in columns under dynamic column families; designed for distributed scale. | Real-time analytics, logs | Cassandra, HBase, Bigtable |
| 6 | In-Memory Database | Keeps data in RAM for low-latency performance. | Gaming, trading systems, caching | Redis, Memcached |
| 7 | Time-Series Database | Optimized for time-stamped data and metric tracking. | Monitoring, IoT, stock analysis | InfluxDB, TimescaleDB, Prometheus |
| 8 | Object-Oriented Database | Stores data as objects (like OOP classes), preserving data + behavior. | Multimedia, simulation, CAD | ObjectDB, db4o |
| 9 | Text Search Database | Indexes and queries unstructured text efficiently. | Search engines, log analytics | Elasticsearch, Solr |
| 10 | Spatial Database | Handles geographical and geometric data. | Maps, logistics, location-based apps | PostGIS, Oracle Spatial |
| 11 | Blob Datastore | Manages large binary data like videos or images. | CDNs, archives, backups | Amazon S3, Azure Blob, HDFS |
| 12 | Ledger Database | Immutable, append-only transaction records (blockchain-like). | Supply chain, finance, healthcare | Amazon QLDB, Hyperledger |
| 13 | Hierarchical Database | Organizes data in parent–child (tree) structures. | File systems, organizational data | IBM IMS, Windows Registry |
| 14 | Vector Database | Stores and searches high-dimensional vectors for similarity search (AI). | Recommendations, semantic search, anomaly detection | Pinecone, Milvus, FAISS |
| 15 | Embedded Database | Lightweight DB embedded inside an application. | Mobile apps, desktop software | SQLite, RocksDB |

💡 Choosing the Right Type

| Factor | What It Means | Example Decision |
| --- | --- | --- |
| Data Model | How your data is structured — relational, document, graph, or vector. | Social network → Graph DB |
| Workload Type | Transactional (OLTP) vs Analytical (OLAP). | E-commerce orders → RDBMS |
| Scale & Latency Needs | Distributed? Real-time? Global scale? | Real-time metrics → Time-series DB |

🧮 Types of Databases by Storage Model

Databases generally fall into two categories — Row-based and Column-based.

Each is optimized for a specific purpose:

  • OLTP (Online Transaction Processing) → frequent small writes and updates
  • OLAP (Online Analytical Processing) → large-scale data analysis
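The two workload styles can be contrasted in a few lines. This is a minimal sketch using Python's sqlite3 module and a hypothetical orders table: a targeted single-row write stands in for OLTP, and a whole-table aggregation stands in for OLAP:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (status, amount) VALUES (?, ?)",
    [("shipped", 40.0), ("pending", 25.0), ("pending", 35.0)],
)

# OLTP-style: a small, frequent write that touches exactly one row.
conn.execute("UPDATE orders SET status = 'shipped' WHERE order_id = 2")

# OLAP-style: an aggregation that scans every row in the table.
total, = conn.execute("SELECT SUM(amount) FROM orders").fetchone()
print(total)  # 100.0
```

Row-based engines make the first query cheap; column-based engines make the second one cheap.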

🧱 1. Row-Based Databases (OLTP)

Used for: Real-time transactions and fast updates

Examples: MySQL, PostgreSQL, SQL Server, Oracle

CREATE TABLE users (
    user_id SERIAL PRIMARY KEY,
    name VARCHAR(50),
    age INT,
    email VARCHAR(100),
    city VARCHAR(50)
);

Best For: Banking systems, e-commerce orders, CRM, and user data management


📊 2. Column-Based Databases (OLAP)

Used for: Analytical workloads and complex aggregations
Examples: Snowflake, BigQuery, Amazon Redshift, ClickHouse

Best For:

  • Business Intelligence dashboards
  • Machine Learning feature extraction
  • Data Warehousing

🔁 Row vs. Column Summary

| Feature | Row-Based (OLTP) | Column-Based (OLAP) |
| --- | --- | --- |
| Storage | Stores complete rows | Stores each column separately |
| Best For | Transactions | Analytics |
| Performance | Fast writes/updates | Fast aggregations |
| Examples | MySQL, PostgreSQL | BigQuery, Redshift |
| Use Case | E-commerce, banking | BI dashboards, ML pipelines |
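The storage difference is easiest to see with the same records laid out both ways. This plain-Python sketch (with made-up sample data) shows why an aggregation is cheaper when each column is stored separately:

```python
# The same three records in the two layouts.
rows = [
    {"name": "John", "age": 30, "city": "New York"},
    {"name": "Alice", "age": 25, "city": "Los Angeles"},
    {"name": "Bob", "age": 35, "city": "Chicago"},
]
columns = {
    "name": ["John", "Alice", "Bob"],
    "age": [30, 25, 35],
    "city": ["New York", "Los Angeles", "Chicago"],
}

# Row layout: averaging ages touches every field of every record.
avg_row = sum(r["age"] for r in rows) / len(rows)

# Column layout: the query reads only the one column it needs,
# which is why columnar engines excel at aggregations.
avg_col = sum(columns["age"]) / len(columns["age"])

print(avg_row, avg_col)  # 30.0 30.0
```

Real columnar formats add compression and metadata on top, but the access pattern is the core idea.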

🧰 Data Engineering Perspective

| Area | Focus | Example |
| --- | --- | --- |
| Data Ingestion | Load data into OLTP or OLAP depending on purpose | API → MySQL (OLTP), Kafka → Redshift (OLAP) |
| ETL Pipelines | Transform data from transactional to analytical systems | Airflow job: MySQL → Snowflake |
| Data Lakehouse | Combine both OLTP and OLAP benefits | Delta Lake, Apache Iceberg |
| Performance Tuning | Optimize storage and query efficiency | Partitioning, indexing, caching |
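An ETL pipeline in miniature looks like this. The sketch below uses sqlite3 and invented table names (orders, sales_by_city) as a stand-in for a real setup, where the source and target would be separate systems such as MySQL feeding Snowflake via an Airflow task:

```python
import sqlite3

# One in-memory DB plays both roles here for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, city TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (city, amount) VALUES (?, ?)",
    [("NY", 10.0), ("NY", 20.0), ("LA", 5.0)],
)

# Extract + Transform: aggregate transactional rows per city.
summary = conn.execute(
    "SELECT city, SUM(amount) FROM orders GROUP BY city ORDER BY city"
).fetchall()

# Load: write the summary into an analytics-friendly table.
conn.execute("CREATE TABLE sales_by_city (city TEXT, total REAL)")
conn.executemany("INSERT INTO sales_by_city VALUES (?, ?)", summary)

print(conn.execute("SELECT * FROM sales_by_city ORDER BY city").fetchall())
# [('LA', 5.0), ('NY', 30.0)]
```

Orchestration tools like Airflow schedule and monitor exactly this extract-transform-load shape, just across real systems.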

🔐 Data Security in Databases

| Security Area | Description | Example |
| --- | --- | --- |
| Authentication | Verify user identity | IAM, Kerberos |
| Authorization | Role-based access control | GRANT SELECT ON sales TO analyst_role; |
| Encryption at Rest | Secure stored data | Transparent Data Encryption (TDE) |
| Encryption in Transit | Protect data during transfer | SSL/TLS connections |
| Auditing & Logging | Track access and changes | PostgreSQL audit logs |
| Backup Security | Secure backup files | Encrypted S3 storage with access policies |
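On the authentication side, one principle applies everywhere: credentials are stored as salted hashes, never as plaintext. A minimal sketch with Python's standard library (the function name and password are made up for illustration):

```python
import hashlib
import hmac
import os

# The database persists only a salted hash, never the password itself.
def hash_password(password: str, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

salt = os.urandom(16)
stored = hash_password("s3cret", salt)       # what gets persisted

# At login, recompute the hash and compare in constant time.
attempt = hash_password("s3cret", salt)
print(hmac.compare_digest(stored, attempt))  # True
```

Managed systems (IAM, Kerberos) handle this for you, but the same idea underlies any password-based login a database exposes.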

🗂️ File Systems & Data Formats in Data Engineering & Security

After data is moved or exported from databases, it is stored in files — usually in Amazon S3, Google Cloud Storage, or HDFS (Hadoop Distributed File System) — where data engineers can process it in bulk.


🧭 What Are File Systems in Data Engineering?

A file system manages how data files (like logs, CSVs, or Parquet files) are stored, organized, and accessed.
In modern pipelines, this could mean local storage for testing or distributed cloud storage for petabyte-scale data.


🧱 Common Distributed File Systems

| File System | Type | Description | Use Case |
| --- | --- | --- | --- |
| HDFS | On-prem / Cluster | Stores large files across Hadoop nodes. | Spark/Hive batch jobs |
| Amazon S3 | Cloud | Object storage with near-infinite scalability. | AWS Data Lake |
| GCS | Cloud | Google’s object storage; integrates with BigQuery. | Cloud analytics |
| ADLS | Cloud | Azure-native analytics storage. | ETL for Synapse |
| Local/NFS | On-prem | For small-scale or dev testing. | Local development |
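Object stores like S3 are actually flat key spaces, but pipelines organize data with key prefixes that mimic a partitioned directory layout (e.g., year=2024/month=01/). This local sketch, using a temporary directory and made-up event records, writes that same Hive-style layout:

```python
import csv
import tempfile
from pathlib import Path

base = Path(tempfile.mkdtemp()) / "events"

records = [
    {"year": "2024", "month": "01", "event": "login"},
    {"year": "2024", "month": "02", "event": "purchase"},
]
for rec in records:
    # Hive-style partition path: events/year=YYYY/month=MM/
    part_dir = base / f"year={rec['year']}" / f"month={rec['month']}"
    part_dir.mkdir(parents=True, exist_ok=True)
    with open(part_dir / "part-0000.csv", "w", newline="") as f:
        csv.writer(f).writerow([rec["event"]])

# Engines like Spark and Hive prune partitions by path, so a query
# for month=01 never reads the month=02 files at all.
print(sorted(p.relative_to(base).as_posix() for p in base.rglob("*.csv")))
```

The same prefixes work unchanged as S3, GCS, or ADLS object keys.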

📦 Data Formats — How Data Is Stored, Read, and Secured

🧾 Overview Table

| Format | Type | Pros | Cons | Common Use Case |
| --- | --- | --- | --- | --- |
| CSV | Text (Row-based) | ✅ Simple, human-readable<br>✅ Widely supported<br>✅ Easy to share | ❌ No schema<br>❌ Inefficient for large data<br>❌ No compression | Data import/export, spreadsheets, ETL staging |
| JSON | Semi-structured text | ✅ Flexible, supports nested data<br>✅ Ideal for APIs and logs<br>✅ Easy to parse | ❌ Verbose<br>❌ Larger size<br>❌ Slower for analytics | Web APIs, event logs, NoSQL data exchange |
| Avro | Binary (Row-based) | ✅ Compact and fast<br>✅ Supports schema evolution<br>✅ Ideal for streaming | ❌ Not human-readable<br>❌ Needs schema registry | Kafka pipelines, inter-service communication |
| Parquet | Binary (Columnar) | ✅ High compression<br>✅ Reads only needed columns<br>✅ Perfect for OLAP | ❌ Inefficient for updates<br>❌ Not human-readable | Data warehouses, analytics, ML feature stores |
| ORC | Binary (Columnar) | ✅ Strong compression<br>✅ Predicate pushdown<br>✅ Metadata stats | ❌ Hadoop-dependent<br>❌ Less flexible in cloud | Hive, Presto, Hadoop-based analytics |
| XML | Structured text | ✅ Self-descriptive<br>✅ Great for hierarchical data<br>✅ Legacy-compatible | ❌ Verbose<br>❌ Parsing overhead<br>❌ Storage-heavy | Configuration files, web services |
| Delta Lake / Iceberg / Hudi | Table formats (built on Parquet/ORC) | ✅ ACID transactions<br>✅ Time travel<br>✅ Schema evolution<br>✅ Combines OLTP + OLAP | ❌ Setup complexity<br>❌ Metadata overhead | Data Lakehouse, unified analytics |

🧩 Format Examples

1️⃣ CSV — Simple and Ubiquitous

name,age,city
John,30,New York
Alice,25,Los Angeles
Bob,35,Chicago

2️⃣ JSON — Flexible and Web-Friendly

{
  "name": "Alice",
  "age": 25,
  "skills": ["Python", "SQL"],
  "city": "Los Angeles"
}

3️⃣ Avro — Schema-Controlled Serialization

Unlike the JSON example above, Avro data is not human-readable: records are stored in a compact binary encoding, described by a JSON schema such as:

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "city", "type": "string"}
  ]
}

A record like { "name": "Alice", "age": 25, "city": "LA" } is serialized against this schema, which is also what makes schema evolution possible.

4️⃣ Parquet — Analytics Powerhouse

name: [John, Alice, Bob]
age: [30, 25, 35]
city: [NY, LA, Chicago]

5️⃣ ORC — Hadoop’s Heavy Lifter

Columns: name, age, city
Data stored column-wise with compression and indexes

6️⃣ XML — Old but Gold

<user>
  <name>John</name>
  <age>30</age>
  <city>New York</city>
</user>

7️⃣ Delta Lake / Iceberg / Hudi — The Modern Lakehouse Formats

MERGE INTO customers AS c
USING updates AS u
ON c.id = u.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

🔐 Data Security Notes for All Formats

| Aspect | Best Practice | Example Tools |
| --- | --- | --- |
| Encryption at Rest | Encrypt files in storage | AWS KMS, Azure Key Vault |
| Encryption in Transit | Secure file transfers | HTTPS, TLS |
| Access Control | Restrict read/write by role | IAM roles, ACLs |
| Schema Validation | Enforce strict schemas | Avro schema registry |
| Compression & Privacy | Compress and mask sensitive data | Parquet + Snappy, data masking in Spark |
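Masking is the one practice you can sketch format-agnostically. The toy example below, using only Python's standard library and made-up customer data, replaces an email column with a truncated SHA-256 digest before the file leaves the secure zone; real pipelines do this in Spark or at the warehouse layer, usually with a keyed hash:

```python
import csv
import hashlib
import io

raw = "name,email\nAlice,alice@example.com\nBob,bob@example.com\n"

out = io.StringIO()
reader = csv.DictReader(io.StringIO(raw))
writer = csv.DictWriter(out, fieldnames=["name", "email"])
writer.writeheader()
for row in reader:
    # Replace the raw email with a short digest; joins on the masked
    # column still work, but the original value is unrecoverable.
    row["email"] = hashlib.sha256(row["email"].encode()).hexdigest()[:12]
    writer.writerow(row)

print(out.getvalue())
```

The same transform applies whether the file lands as CSV, JSON, or Parquet; masking happens before serialization.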

📚 Reference

  1. Nnamdi Samuel, “Data File Formats Explained”, 2024 — core explanations of CSV, JSON, Parquet, Avro, etc.
  2. Ashish Pratap Singh, “15 Types of Databases and When to Use Them”, AlgoMaster.io, March 2024 — source of the database type classifications in this post.
  3. Apache Hadoop / HDFS documentation — HDFS design, block replication, and encryption zones.
  4. Official AWS, GCP, and Azure documentation — S3, GCS, and ADLS design, IAM, encryption, and audit logging best practices.
  5. Delta Lake, Apache Iceberg, and Apache Hudi official docs — transactional table formats, schema evolution, and time travel.
