🗄️ Database Essentials
Databases sit at the core of every modern application. Whether you're building a social media platform, an online store, or a data pipeline, you need a reliable way to store, organize, and access information. This guide covers the core concepts every developer and data engineer should have solid before going deeper.
What is a Database?
A database is an organized collection of data that can be stored, managed, and retrieved efficiently. Instead of scattering data across files or spreadsheets, databases provide a structured system where data can be queried, updated, and maintained consistently.
Types of Databases
Databases come in many flavors, but the two most common categories are SQL (relational) and NoSQL (non-relational).
SQL Databases
- Structure: Tables with rows and columns
- Examples: PostgreSQL, MySQL, Oracle, SQL Server
-
Strengths:
- Strong consistency and reliability
- Support for complex queries and relationships (joins)
- Schema-based design enforces data integrity
NoSQL Databases
- Structure: Can be document-based, key-value pairs, wide-column stores, or graph databases
- Examples: MongoDB (document), Redis (key-value), Cassandra (wide-column), Neo4j (graph)
-
Strengths:
- Flexible schema; data doesn't need to fit into fixed tables
- Handles unstructured or semi-structured data well
- Scales horizontally, often preferred for big data and real-time apps
Schemas
A schema is the blueprint of a database. It defines how data is organized, what data types are allowed, and how different entities relate to each other.
In SQL databases, schemas are strict and defined before data is inserted. In NoSQL databases, schemas are flexible and each document or record can have different fields.
SQL Schema Example
Real pipelines often separate raw ingestion from cleaned data. A two-schema setup looks like this:
-- Stock price table (raw ingestion layer)
CREATE TABLE IF NOT EXISTS stock_prices (
id SERIAL PRIMARY KEY,
symbol VARCHAR(20) NOT NULL,
close_price NUMERIC(12, 4),
volume BIGINT,
timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
source VARCHAR(50),
CONSTRAINT uq_symbol_ts UNIQUE (symbol, timestamp),
CONSTRAINT chk_price CHECK (close_price > 0)
);
-- Listings table (with foreign key reference)
CREATE TABLE IF NOT EXISTS listing_images (
id BIGSERIAL PRIMARY KEY,
listing_id BIGINT NOT NULL REFERENCES listings(id) ON DELETE CASCADE,
url TEXT NOT NULL,
uploaded TIMESTAMPTZ DEFAULT NOW()
);
Notice the CHECK and UNIQUE constraints. These aren't optional extras in a production schema. They're what keeps bad data out without writing application logic to handle it.
NoSQL Schema Example
{
"id": "listing_001",
"location": "Westlands, Nairobi",
"price": 4500000,
"bedrooms": 3,
"images": [
{"url": "https://cdn.example.com/img1.jpg"},
{"url": "https://cdn.example.com/img2.jpg"}
]
}
The same listing in a relational database would require a listings table and a separate listing_images table joined by a foreign key. NoSQL embeds related data directly in one document.
When to Use SQL vs NoSQL
Use SQL when:
- Data is highly structured with clear relationships
- You need ACID transactions (banking, financial pipelines, e-commerce checkout)
- Queries involve complex joins and aggregations
Use NoSQL when:
- Data is semi-structured, rapidly changing, or unstructured
- Applications need high scalability and throughput at massive scale
- You're working with big data, caching, or real-time event streams
SQL favors consistency and structure. NoSQL favors flexibility and speed. Most serious data platforms end up using both.
DDL vs DML
Two key categories of SQL commands you'll work with constantly:
DDL (Data Definition Language) defines and manages database structures.
| Command | What it does |
|---|---|
CREATE |
Create tables, schemas, indexes |
ALTER |
Modify an existing table (add column, rename, change type) |
DROP |
Delete a table or schema permanently |
TRUNCATE |
Remove all rows from a table (faster than DELETE) |
DML (Data Manipulation Language) works with the actual data inside those structures.
| Command | What it does |
|---|---|
INSERT |
Add new rows |
UPDATE |
Modify existing rows |
DELETE |
Remove rows |
SELECT |
Query and return data |
DDL and DML in Practice
Here is a realistic sequence showing both in action:
-- DDL: define the structure
CREATE TABLE IF NOT EXISTS forex_rates (
id SERIAL PRIMARY KEY,
pair VARCHAR(10) NOT NULL,
rate NUMERIC(12, 6),
fetched_at TIMESTAMPTZ NOT NULL
);
-- DML: insert data
INSERT INTO forex_rates (pair, rate, fetched_at)
VALUES ('USD/KES', 129.5, NOW());
-- DML: query it back
SELECT pair, rate, fetched_at
FROM forex_rates
WHERE pair = 'USD/KES'
ORDER BY fetched_at DESC
LIMIT 10;
One DML pattern worth knowing early is UPSERT. In a data pipeline you often don't know whether a record already exists, so you insert it and handle conflicts in the same statement:
INSERT INTO forex_rates (pair, rate, fetched_at)
VALUES ('USD/KES', 130.1, NOW())
ON CONFLICT (pair)
DO UPDATE SET
rate = EXCLUDED.rate,
fetched_at = EXCLUDED.fetched_at;
This runs every pipeline run without failing or creating duplicates. It's one of the most useful patterns in production SQL.
Transactions and ACID
SQL databases guarantee ACID properties:
- Atomicity: All operations in a transaction succeed or none do
- Consistency: The database always moves from one valid state to another
- Isolation: Concurrent transactions don't interfere with each other
- Durability: Committed data survives crashes
A transaction in practice:
BEGIN;
UPDATE accounts SET balance = balance - 500 WHERE id = 1;
UPDATE accounts SET balance = balance + 500 WHERE id = 2;
COMMIT;
If the second UPDATE fails, the whole transaction rolls back. Neither account changes. This is why SQL databases are the default choice for financial systems, payment pipelines, and anything where partial writes are unacceptable.
Quick Summary
| SQL | NoSQL | |
|---|---|---|
| Structure | Fixed schema, tables | Flexible schema |
| Best for | Relationships, ACID transactions | Scale, unstructured data |
| Examples | PostgreSQL, MySQL | MongoDB, Redis, Cassandra |
| Query language | SQL | Varies (JSON queries, key lookups) |
These fundamentals apply whether you're building a web app, a data pipeline, or an analytics warehouse. The tools change but the tradeoffs don't.
Follow me on dev.to for more data engineering content, or browse the project code at github.com/declerke.
Top comments (0)