DEV Community

Cover image for Data Engineering 101: Understanding Databases, Storage, and Security
Sajjad Rahman
Sajjad Rahman

Posted on

Data Engineering 101: Understanding Databases, Storage, and Security

Before diving into machine learning models, dashboards, or real-time pipelines, every Data Engineer must first master how data is stored, moved, and protected.

This foundation โ€” databases, file systems, and distributed storage โ€” forms the invisible infrastructure behind every analytics and AI system.

In this post, weโ€™ll explore how modern data systems organize, process, and secure information, starting with the heart of it all โ€” Databases โ€” and then moving into File Systems and Data Formats.


๐Ÿ—„๏ธ๐Ÿ’ก What Is a Database?

A database is an organized collection of data that allows efficient storage, retrieval, and management of information. Itโ€™s the backbone of every data-driven system, from your favorite mobile app to massive enterprise data warehouses.

At its core, a database enables both data organization and data protection โ€” ensuring information remains consistent, accessible, and secure.


๐ŸŽฏ Why Do We Use Databases?

Databases are not just about storing data; they are about ensuring trust, structure, and safety. They make sure data stays accurate, available, and protected, even when systems crash or users make mistakes.

Purpose Description Example
Data Integrity Ensures accuracy and consistency through constraints and validation rules. Enforcing unique email addresses per user.
Data Security Uses authentication, authorization, and encryption to control access. Only admins can update salary information.
Backup & Recovery Enables restoration after deletion or corruption. MySQL binary logs for point-in-time recovery.
Data Relationships Connects related data using primary and foreign keys. Linking users and orders by user_id.
Data Sharing Allows safe, concurrent access by multiple users or systems. Microservices sharing a centralized DB.

๐Ÿงฉ Types of Databases

Before diving into how databases store data (Row-based vs Column-based), letโ€™s zoom out and explore the different types of databases used across modern data systems.

Each type is built to solve a particular kind of problem โ€” from handling financial transactions to analyzing time-series data, storing documents, or powering AI search.

Hereโ€™s a quick overview of the 15 common database types youโ€™ll encounter as a Data Engineer:

Type of Database

Credit: blog.algomaster.io

# Database Type Description Common Use Cases Examples
1 Relational (RDBMS) Stores structured data in tables with rows & columns, supports SQL and ACID transactions. Banking, ERP, CRM MySQL, PostgreSQL, Oracle
2 Key-Value Store Stores data as keyโ€“value pairs for ultra-fast access and caching. Session storage, caching, real-time apps Redis, DynamoDB
3 Document Database Stores semi-structured data as JSON/XML documents; flexible schema. CMS, product catalogs, IoT data MongoDB, Couchbase
4 Graph Database Focuses on relationships between data entities (nodes & edges). Social networks, recommendations Neo4j, Amazon Neptune
5 Wide-Column Store Stores data in columns under dynamic column families; designed for distributed scale. Real-time analytics, logs Cassandra, HBase, Bigtable
6 In-Memory Database Keeps data in RAM for low-latency performance. Gaming, trading systems, caching Redis, Memcached
7 Time-Series Database Optimized for time-stamped data and metric tracking. Monitoring, IoT, stock analysis InfluxDB, TimescaleDB, Prometheus
8 Object-Oriented Database Stores data as objects (like OOP classes), preserving data + behavior. Multimedia, simulation, CAD ObjectDB, db4o
9 Text Search Database Indexes and queries unstructured text efficiently. Search engines, log analytics Elasticsearch, Solr
10 Spatial Database Handles geographical and geometric data. Maps, logistics, location-based apps PostGIS, Oracle Spatial
11 Blob Datastore Manages large binary data like videos or images. CDNs, archives, backups Amazon S3, Azure Blob, HDFS
12 Ledger Database Immutable, append-only transaction records (blockchain-like). Supply chain, finance, healthcare Amazon QLDB, Hyperledger
13 Hierarchical Database Organizes data in parentโ€“child (tree) structures. File systems, organizational data IBM IMS, Windows Registry
14 Vector Database Stores and searches high-dimensional vectors for similarity search (AI). Recommendations, semantic search, anomaly detection Pinecone, Milvus, FAISS
15 Embedded Database Lightweight DB embedded inside an application. Mobile apps, desktop software SQLite, RocksDB

๐Ÿ’ก Choosing the Right Type

Factor What It Means Example Decision
Data Model How your data is structured โ€” relational, document, graph, or vector. Social network โ†’ Graph DB
Workload Type Transactional (OLTP) vs Analytical (OLAP). E-commerce orders โ†’ RDBMS
Scale & Latency Needs Distributed? Real-time? Global scale? Real-time metrics โ†’ Time-series DB

๐Ÿงฎ Types of Databases by Storage Model

Databases generally fall into two categories โ€” Row-based and Column-based.

Each is optimized for a specific purpose:

  • OLTP (Online Transaction Processing) โ†’ frequent small writes and updates
  • OLAP (Online Analytical Processing) โ†’ large-scale data analysis

๐Ÿงฑ 1. Row-Based Databases (OLTP)

Used for: Real-time transactions and fast updates

Examples: MySQL, PostgreSQL, SQL Server, Oracle

CREATE TABLE users (
    user_id SERIAL PRIMARY KEY,
    name VARCHAR(50),
    age INT,
    email VARCHAR(100),
    city VARCHAR(50)
);
Enter fullscreen mode Exit fullscreen mode

โœ… Best For: Banking systems, e-commerce orders, CRM, and user data management


๐Ÿ“Š 2. Column-Based Databases (OLAP)

Used for: Analytical workloads and complex aggregations
Examples: Snowflake, BigQuery, Amazon Redshift, ClickHouse

โœ… Best For:

  • Business Intelligence dashboards
  • Machine Learning feature extraction
  • Data Warehousing

๐Ÿ” Row vs. Column Summary

Feature Row-Based (OLTP) Column-Based (OLAP)
Storage Stores complete rows Stores each column separately
Best For Transactions Analytics
Performance Fast writes/updates Fast aggregations
Examples MySQL, PostgreSQL BigQuery, Redshift
Use Case E-commerce, banking BI dashboards, ML pipelines

๐Ÿงฐ Data Engineering Perspective

Area Focus Example
Data Ingestion Load data into OLTP or OLAP depending on purpose API โ†’ MySQL (OLTP), Kafka โ†’ Redshift (OLAP)
ETL Pipelines Transform data from transactional to analytical systems Airflow job: MySQL โ†’ Snowflake
Data Lakehouse Combine both OLTP and OLAP benefits Delta Lake, Apache Iceberg
Performance Tuning Optimize storage and query efficiency Partitioning, indexing, caching

๐Ÿ” Data Security in Databases

Security Area Description Example
Authentication Verify user identity IAM, Kerberos
Authorization Role-based access control GRANT SELECT ON sales TO analyst_role;
Encryption at Rest Secure stored data Transparent Data Encryption (TDE)
Encryption in Transit Protect data during transfer SSL/TLS connections
Auditing & Logging Track access and changes PostgreSQL audit logs
Backup Security Secure backup files Encrypted S3 storage with access policies

๐Ÿ—‚๏ธ File Systems & Data Formats in Data Engineering & Security

After data is moved or exported from databases, it is stored in files โ€” usually in Amazon S3, Google Cloud Storage, or HDFS (Hadoop Distributed File System) โ€” where data engineers can process it in bulk.


๐Ÿงญ What Are File Systems in Data Engineering?

A file system manages how data files (like logs, CSVs, or Parquet files) are stored, organized, and accessed.
In modern pipelines, this could mean local storage for testing or distributed cloud storage for petabyte-scale data.


๐Ÿงฑ Common Distributed File Systems

File System Type Description Use Case
HDFS On-prem / Cluster Stores large files across Hadoop nodes. Spark/Hive batch jobs
Amazon S3 Cloud Object storage with infinite scalability. AWS Data Lake
GCS Cloud Googleโ€™s object storage, integrates with BigQuery. Cloud analytics
ADLS Cloud Azure-native analytics storage. ETL for Synapse
Local/NFS On-prem For small-scale or dev testing. Local development

๐Ÿ“ฆ Data Formats โ€” How Data Is Stored, Read, and Secured

๐Ÿงพ Overview Table

Format Type Pros Cons Common Use Case
CSV Text (Row-based) โœ… Simple, human-readable
โœ… Widely supported
โœ… Easy to share
โŒ No schema
โŒ Inefficient for large data
โŒ No compression
Data import/export, spreadsheets, ETL staging
JSON Semi-structured text โœ… Flexible, supports nested data
โœ… Ideal for APIs and logs
โœ… Easy to parse
โŒ Verbose
โŒ Larger size
โŒ Slower for analytics
Web APIs, event logs, NoSQL data exchange
Avro Binary (Row-based) โœ… Compact and fast
โœ… Supports schema evolution
โœ… Ideal for streaming
โŒ Not human-readable
โŒ Needs schema registry
Kafka pipelines, inter-service communication
Parquet Binary (Columnar) โœ… High compression
โœ… Reads only needed columns
โœ… Perfect for OLAP
โŒ Inefficient for updates
โŒ Not human-readable
Data warehouses, analytics, ML feature stores
ORC Binary (Columnar) โœ… Strong compression
โœ… Predicate pushdown
โœ… Metadata stats
โŒ Hadoop-dependent
โŒ Less flexible in cloud
Hive, Presto, Hadoop-based analytics
XML Structured text โœ… Self-descriptive
โœ… Great for hierarchical data
โœ… Legacy-compatible
โŒ Verbose
โŒ Parsing overhead
โŒ Storage-heavy
Configuration files, web services
Delta Lake / Iceberg / Hudi Table formats (built on Parquet/ORC) โœ… ACID transactions
โœ… Time travel
โœ… Schema evolution
โœ… Combines OLTP + OLAP
โŒ Setup complexity
โŒ Metadata overhead
Data Lakehouse, unified analytics

๐Ÿงฉ Format Examples

1๏ธโƒฃ CSV โ€” Simple and Ubiquitous

name,age,city
John,30,New York
Alice,25,Los Angeles
Bob,35,Chicago
Enter fullscreen mode Exit fullscreen mode

2๏ธโƒฃ JSON โ€” Flexible and Web-Friendly

{
  "name": "Alice",
  "age": 25,
  "skills": ["Python", "SQL"],
  "city": "Los Angeles"
}
Enter fullscreen mode Exit fullscreen mode

3๏ธโƒฃ Avro โ€” Schema-Controlled Serialization

{ "name": "Alice", "age": 25, "city": "LA" }
Enter fullscreen mode Exit fullscreen mode

4๏ธโƒฃ Parquet โ€” Analytics Powerhouse

name: [John, Alice, Bob]
age: [30, 25, 35]
city: [NY, LA, Chicago]
Enter fullscreen mode Exit fullscreen mode

5๏ธโƒฃ ORC โ€” Hadoopโ€™s Heavy Lifter

Columns: name, age, city
Data stored column-wise with compression and indexes
Enter fullscreen mode Exit fullscreen mode

6๏ธโƒฃ XML โ€” Old but Gold

<user>
  <name>John</name>
  <age>30</age>
  <city>New York</city>
</user>
Enter fullscreen mode Exit fullscreen mode

7๏ธโƒฃ Delta Lake / Iceberg / Hudi โ€” The Modern Lakehouse Formats

MERGE INTO customers AS c
USING updates AS u
ON c.id = u.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
Enter fullscreen mode Exit fullscreen mode

๐Ÿ” Data Security Notes for All Formats

Aspect Best Practice Example Tools
Encryption at Rest Encrypt files in storage AWS KMS, Azure Key Vault
Encryption in Transit Secure file transfers HTTPS, TLS
Access Control Restrict read/write by role IAM roles, ACLs
Schema Validation Enforce strict schemas Avro schema registry
Compression & Privacy Compress and mask sensitive data Parquet + Snappy, Data masking in Spark

๐Ÿ“š Reference

_

  1. Nnamdi Samuel, โ€œData File Formats Explainedโ€, 2024 โ€” used for core explanations of CSV, JSON, Parquet, Avro, etc.
  2. Ashish Pratap Singh, โ€œ15 Types of Databases and When to Use Themโ€, AlgoMaster.io, March 2024 โ€” source of the database type classifications & inspiration for that section.
  3. Apache Hadoop / HDFS documentation โ€” for HDFS design, block replication, encryption zones (for deeper technical reference).
  4. Official AWS, GCP, Azure documentation โ€” for S3, GCS, ADLS design, IAM, encryption, and audit logging best practices.
  5. Databricks / Delta Lake, Apache Iceberg, Apache Hudi official docs โ€” for transactional table formats, schema evolution, time travel. _

Top comments (0)