DEV Community: Emilio Ochieng

Introduction to SQL: Understanding Databases, Data Types, Constraints, and Core SQL Concepts

Emilio Ochieng — Mon, 20 Jul 2026 16:44:03 +0000

SQL (Structured Query Language) is one of the most fundamental skills for anyone interested in Data Analytics, Data Engineering, Software Development, or Database Administration. Whether you're building a web application or analyzing business data, SQL enables you to communicate with relational databases efficiently.

What is Data?

Data refers to raw, unorganized facts and figures. It can exist in many forms, including:

Numbers
Text
Images
Audio
Videos
Dates and times

On its own, data has little meaning until it is organized and processed.

What is a Database?

A database is an organized collection of data that is structured for easy access, management, and updating.

Instead of storing information in multiple spreadsheets, databases keep related information together, making it easier to search, retrieve, and maintain.

For example, a school database may contain:

Students
Teachers
Subjects
Exam Results

Understanding Database Architecture

A database system is organized into different layers.

1. Server

The server is the top-most level.

It is the actual software process (such as PostgreSQL or MySQL) running on a computer.

Think of the server as the entire building.

2. Database

Inside a server are one or more databases.

A database acts like a separate floor within the building, storing data for a specific application or organization.

3. Schema

A schema organizes database objects inside a database.

Think of a schema as rooms on a floor, helping separate tables based on their purpose.

For example:

Server
│
├── Greenwood Academy Database
│ ├── Students Schema
│ ├── Finance Schema
│ └── Library Schema

What is a DBMS?

A Database Management System (DBMS) is software that enables users to create, manage, update, and interact with databases.

Popular DBMSs include:

PostgreSQL
MySQL
Oracle Database
Microsoft SQL Server

A DBMS provides the tools needed to store, organize, and retrieve information efficiently.

What is SQL?

SQL (Structured Query Language) is the standard language used to communicate with relational databases.

Think of SQL as the language you use to "talk" to your database.

Using SQL, you can:

Create databases
Create tables
Insert data
Update records
Delete records
Retrieve information
Manage users and permissions

Types of SQL Commands

SQL is divided into several categories depending on the task being performed.

1. DDL (Data Definition Language)

DDL focuses on creating and modifying database structures.

Common commands include:

CREATE
ALTER
DROP

Example:

CREATE TABLE students (
    student_id INT PRIMARY KEY,
    first_name VARCHAR(50)
);

2. DML (Data Manipulation Language)

DML works with the data stored inside existing tables.

Common commands include:

INSERT
UPDATE
DELETE

Example:

INSERT INTO students (student_id, first_name)
VALUES (1, 'Amina');

3. DQL (Data Query Language)

DQL is used to retrieve information from a database.

The primary command is:

SELECT

Example:

SELECT *
FROM students;

This is the command you'll use most frequently when analyzing data.

4. DCL (Data Control Language)

DCL controls access to the database.

Examples include granting or revoking user permissions.

Common commands:

GRANT
REVOKE

5. TCL (Transaction Control Language)

TCL manages database transactions.

Common commands include:

COMMIT
ROLLBACK
SAVEPOINT

These commands ensure data consistency when multiple operations are performed.
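
The transaction commands above can be tried out with a short script. This is a minimal sketch, assuming Python's built-in sqlite3 module and an in-memory SQLite database as a stand-in for a full server; the savepoint name before_second_insert and the second student are made up for illustration, and syntax can differ slightly in PostgreSQL or MySQL.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# BEGIN / SAVEPOINT / ROLLBACK / COMMIT statements below are the only
# transaction control that happens.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE students (student_id INTEGER PRIMARY KEY, first_name TEXT)")

conn.execute("BEGIN")
conn.execute("INSERT INTO students VALUES (1, 'Amina')")
conn.execute("SAVEPOINT before_second_insert")  # hypothetical savepoint name
conn.execute("INSERT INTO students VALUES (2, 'Brian')")
conn.execute("ROLLBACK TO SAVEPOINT before_second_insert")  # undo only the second insert
conn.execute("COMMIT")  # make the first insert permanent

names = [row[0] for row in conn.execute("SELECT first_name FROM students ORDER BY student_id")]
print(names)  # → ['Amina']
```

Only the row inserted before the savepoint survives, which is the behaviour that keeps multi-step changes consistent.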

Understanding Data Types

Every column in a database must specify the type of data it will store.

Choosing the correct data type improves performance, accuracy, and storage efficiency.

Numeric Data Types

INT

Stores whole numbers.

Example:

25
100
500

DECIMAL / NUMERIC

Stores numbers with fixed decimal places.

Ideal for:

Prices
Salaries
Financial records

Example:

2500.75
99.99

SERIAL

Automatically generates sequential integers. SERIAL is a PostgreSQL type; other systems offer equivalents such as AUTO_INCREMENT in MySQL.

Commonly used for:

Primary Keys
IDs

Text Data Types

CHAR(n)

Stores text of a fixed length; shorter values are padded with spaces.

Example:

CHAR(10)

Useful for fixed-length values like codes.

VARCHAR(n)

Stores text with a maximum length.

Example:

VARCHAR(50)

Suitable for names and addresses.

TEXT

Stores large amounts of text without a predefined limit.

Useful for:

Descriptions
Articles
Comments

Date and Time Data Types

DATE

Stores only the date.

Example:

2025-07-20

TIME

Stores only the time.

Example:

14:30:00

TIMESTAMP

Stores both the date and time.

Example:

2025-07-20 14:30:00

BOOLEAN

A Boolean column stores only two possible values:

TRUE
FALSE

Example:

is_active = TRUE

Understanding Constraints

Constraints are rules applied to tables or columns to ensure data integrity and prevent invalid information from entering the database.

NOT NULL

Ensures a column cannot be left empty.

Example:

Every student must have a first name.

DEFAULT

Provides a default value when none is supplied.

Example:

status VARCHAR(20) DEFAULT 'Active'

UNIQUE

Ensures duplicate values are not allowed.

Example:

Two users cannot register with the same email address.

PRIMARY KEY

Uniquely identifies each row in a table.

A Primary Key:

Cannot be NULL
Must be unique

Example:

student_id

FOREIGN KEY

Links one table to another.

For example:

Students: student_id (primary key)
Exam Results: student_id (foreign key)

Every student_id in the Exam Results table must already exist in the Students table.

This relationship prevents invalid references.

CHECK

Applies custom validation rules.

Example:

CHECK (marks >= 0)

CHECK (price > 0)

The database will reject values that violate these conditions.
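
To see the constraints above in action, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The exam_results table, the email column, the sample rows, and the try_insert helper are assumptions made for illustration; SQLite also needs foreign-key enforcement switched on explicitly, which PostgreSQL and MySQL (InnoDB) do not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when this is on

conn.executescript("""
CREATE TABLE students (
    student_id INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    email      TEXT UNIQUE,
    status     TEXT DEFAULT 'Active'
);
CREATE TABLE exam_results (
    result_id  INTEGER PRIMARY KEY,
    student_id INTEGER NOT NULL REFERENCES students(student_id),
    marks      INTEGER CHECK (marks >= 0)
);
INSERT INTO students (student_id, first_name, email)
VALUES (1, 'Amina', 'amina@example.com');
""")


def try_insert(sql):
    """Run one statement and report whether the database accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


outcomes = {
    "missing first_name": try_insert("INSERT INTO students (student_id) VALUES (2)"),
    "duplicate email": try_insert(
        "INSERT INTO students (student_id, first_name, email) "
        "VALUES (3, 'Brian', 'amina@example.com')"
    ),
    "unknown student": try_insert("INSERT INTO exam_results (student_id, marks) VALUES (99, 70)"),
    "negative marks": try_insert("INSERT INTO exam_results (student_id, marks) VALUES (1, -5)"),
    "valid result": try_insert("INSERT INTO exam_results (student_id, marks) VALUES (1, 85)"),
}
for case, outcome in outcomes.items():
    print(f"{case}: {outcome}")

# The DEFAULT constraint filled in the status that the first insert left out.
default_status = conn.execute(
    "SELECT status FROM students WHERE student_id = 1"
).fetchone()[0]
print(default_status)  # → Active
```

Only the last insert is accepted; each of the others violates exactly one constraint (NOT NULL, UNIQUE, FOREIGN KEY, CHECK).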

Why These Concepts Matter

Before writing complex SQL queries involving joins, aggregations, or window functions, it's essential to understand:

How databases are organized.
What schemas and tables are for.
How to choose the correct data types.
How constraints keep data clean and reliable.
What the different categories of SQL commands do.

1. JOINs – Combining Data from Multiple Tables

In a relational database, information is often split across multiple tables to reduce redundancy and improve organization. JOINs allow you to combine related data from these tables into a single result.

For example, imagine a school database with three tables:

Students
Subjects
Exam Results

Instead of storing all information in one table, the database links them using keys.
A JOIN lets you answer questions like:

Which subjects is each student taking?
What marks did each student score?
Who teaches each subject?

Common types of JOINs include:

INNER JOIN – Returns only matching records from both tables.
LEFT JOIN – Returns all records from the left table and matching records from the right table.
RIGHT JOIN – Returns all records from the right table and matching records from the left table.
FULL OUTER JOIN – Returns all records from both tables, whether they match or not.

JOINs are among the most frequently used SQL operations because real-world databases almost always store information across multiple related tables.
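
The difference between INNER JOIN and LEFT JOIN is easiest to see on a tiny dataset. The sketch below assumes Python's built-in sqlite3 module and two hypothetical tables, students and exam_results, with invented rows; one student deliberately has no exam result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (student_id INTEGER PRIMARY KEY, first_name TEXT);
CREATE TABLE exam_results (student_id INTEGER, subject TEXT, marks INTEGER);
INSERT INTO students VALUES (1, 'Amina'), (2, 'Brian'), (3, 'Cynthia');
INSERT INTO exam_results VALUES (1, 'Maths', 82), (2, 'Maths', 67);
""")

# INNER JOIN: only students that have a matching exam result.
inner_rows = conn.execute("""
    SELECT s.first_name, r.subject, r.marks
    FROM students s
    INNER JOIN exam_results r ON r.student_id = s.student_id
    ORDER BY s.student_id
""").fetchall()

# LEFT JOIN: every student; missing results come back as NULL (None in Python).
left_rows = conn.execute("""
    SELECT s.first_name, r.subject, r.marks
    FROM students s
    LEFT JOIN exam_results r ON r.student_id = s.student_id
    ORDER BY s.student_id
""").fetchall()

print(inner_rows)  # → [('Amina', 'Maths', 82), ('Brian', 'Maths', 67)]
print(left_rows)   # → the same two rows plus ('Cynthia', None, None)
```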

2. Aggregate Functions – Summarizing Data

Aggregate functions calculate values across multiple rows and return a single result. Instead of viewing individual records, aggregates help summarize and analyze data.

Some common aggregate functions include:

COUNT() – Counts the number of records.
SUM() – Calculates the total.
AVG() – Finds the average.
MIN() – Returns the smallest value.
MAX() – Returns the largest value.

For example, a school administrator might want to know:

How many students are enrolled?
What is the average exam score?
Which student scored the highest mark?
How many students are in each class?

Aggregate functions are widely used in business intelligence, reporting, and dashboard development.
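
A minimal sketch of the administrator's questions, assuming Python's built-in sqlite3 module and a hypothetical exam_results table with invented names, classes, and marks. The second query adds GROUP BY so that the aggregates are computed once per class rather than over the whole table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE exam_results (student TEXT, class TEXT, marks INTEGER);
INSERT INTO exam_results VALUES
    ('Amina', 'Form 1', 82), ('Brian', 'Form 1', 67),
    ('Cynthia', 'Form 2', 91), ('David', 'Form 2', 74), ('Esther', 'Form 2', 58);
""")

# One summary row for the whole table.
overall = conn.execute(
    "SELECT COUNT(*), AVG(marks), MIN(marks), MAX(marks) FROM exam_results"
).fetchone()

# One summary row per class.
per_class = conn.execute("""
    SELECT class, COUNT(*) AS students, AVG(marks) AS average_marks
    FROM exam_results
    GROUP BY class
    ORDER BY class
""").fetchall()

print(overall)
for row in per_class:
    print(row)
```

With this data the overall query reports 5 students and marks ranging from 58 to 91, while the grouped query returns one row each for Form 1 and Form 2.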

3. Subqueries – Queries Within Queries

A subquery is a SQL query nested inside another SQL query. It allows you to use the result of one query as input for another.

Subqueries are useful when solving more complex problems, such as:

Finding students who scored above the class average.
Identifying products with sales higher than the average.
Listing employees earning more than their department's average salary.

Instead of performing multiple separate queries, SQL can handle everything in one statement.

Subqueries make SQL more flexible and allow you to solve sophisticated analytical problems with minimal code.
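
The first problem in the list, students who scored above the class average, can be sketched as follows. The example assumes Python's built-in sqlite3 module and a hypothetical exam_results table with invented rows; the inner SELECT computes the average once and the outer query compares every row against it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE exam_results (student TEXT, marks INTEGER);
INSERT INTO exam_results VALUES
    ('Amina', 82), ('Brian', 67), ('Cynthia', 91), ('David', 74), ('Esther', 58);
""")

above_average = conn.execute("""
    SELECT student, marks
    FROM exam_results
    WHERE marks > (SELECT AVG(marks) FROM exam_results)  -- subquery: the overall average
    ORDER BY marks DESC
""").fetchall()

print(above_average)  # → [('Cynthia', 91), ('Amina', 82)]
```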

4. Common Table Expressions (CTEs) – Writing Cleaner SQL

As SQL queries become longer, they can become difficult to read and maintain.

Common Table Expressions (CTEs) help organize complex queries by breaking them into logical sections.

A CTE acts like a temporary named result set that exists only during the execution of a query.

Benefits of using CTEs include:

Improved readability.
Easier debugging.
Better organization of complex logic.
Simplified maintenance.

Instead of writing deeply nested subqueries, you can separate each logical step into its own CTE, making your SQL easier for both you and your teammates to understand.

CTEs are especially useful in reporting, analytics, and data engineering workflows.
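
As an illustration, the sketch below names an intermediate step with a CTE and then joins against it. It assumes Python's built-in sqlite3 module; the exam_results table, its rows, and the CTE name class_average are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE exam_results (student TEXT, class TEXT, marks INTEGER);
INSERT INTO exam_results VALUES
    ('Amina', 'Form 1', 82), ('Brian', 'Form 1', 67),
    ('Cynthia', 'Form 2', 91), ('David', 'Form 2', 74), ('Esther', 'Form 2', 58);
""")

rows = conn.execute("""
    WITH class_average AS (            -- step 1: average marks per class
        SELECT class, AVG(marks) AS avg_marks
        FROM exam_results
        GROUP BY class
    )
    SELECT r.student, r.class, r.marks -- step 2: students above their class average
    FROM exam_results r
    JOIN class_average c ON c.class = r.class
    WHERE r.marks > c.avg_marks
    ORDER BY r.class, r.student
""").fetchall()

print(rows)  # → [('Amina', 'Form 1', 82), ('Cynthia', 'Form 2', 91)]
```

The same result could be written with a nested subquery, but the named step makes the intent of each part easier to read.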

5. Window Functions – Performing Advanced Analytics

Window functions are among the most powerful features in SQL. Unlike aggregate functions, which reduce multiple rows into one result, window functions perform calculations across related rows while keeping every individual row in the output.

This makes them ideal for analytical tasks such as:

Ranking students by exam score.
Comparing a student's mark to the class average.
Calculating running totals.
Finding previous or next values.
Identifying top-performing products or employees.

Common window functions include:

ROW_NUMBER()
RANK()
DENSE_RANK()
LAG()
LEAD()
FIRST_VALUE()
LAST_VALUE()

Window functions are heavily used in business intelligence, financial reporting, customer analytics, and machine learning data preparation.
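
A short sketch of three of these functions, assuming Python's built-in sqlite3 module linked against SQLite 3.25 or newer (the first release with window functions) and a hypothetical exam_results table with invented rows. Note that all five input rows are still present in the output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE exam_results (student TEXT, class TEXT, marks INTEGER);
INSERT INTO exam_results VALUES
    ('Amina', 'Form 1', 82), ('Brian', 'Form 1', 67),
    ('Cynthia', 'Form 2', 91), ('David', 'Form 2', 74), ('Esther', 'Form 2', 58);
""")

rows = conn.execute("""
    SELECT student, class, marks,
           RANK()     OVER (PARTITION BY class ORDER BY marks DESC) AS class_rank,
           AVG(marks) OVER (PARTITION BY class)                     AS class_avg,
           LAG(marks) OVER (PARTITION BY class ORDER BY marks DESC) AS next_higher_mark
    FROM exam_results
    ORDER BY class, class_rank
""").fetchall()

for row in rows:
    print(row)
```

Each row keeps the student's own mark alongside the rank within the class, the class average, and the mark of the student ranked immediately above (None for the top student in each class).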

How These Concepts Work Together

Consider a school management system.

You might:

Use JOINs to combine student, subject, and exam data.
Apply aggregate functions to calculate average marks.
Use a subquery to identify students scoring above average.
Organize the logic using a CTE for better readability.
Apply window functions to rank students from highest to lowest.

Each concept builds upon the previous one, enabling increasingly sophisticated analysis.

Why Every Data Professional Should Learn Advanced SQL

Modern organizations rely heavily on data-driven decision-making. Advanced SQL skills enable professionals to:

Build interactive dashboards.
Generate business reports.
Analyze customer behavior.
Monitor financial performance.
Prepare datasets for machine learning.
Design efficient data pipelines.
Support business intelligence initiatives.

Whether you're working in healthcare, finance, agriculture, education, or e-commerce, these SQL techniques are essential for extracting meaningful insights from data.

Conclusion

Mastering advanced SQL is a natural progression after learning the fundamentals. Concepts like databases, data types, constraints, SQL command categories, JOINs, aggregate functions, subqueries, Common Table Expressions (CTEs), and window functions enable you to move beyond basic queries and solve complex, real-world data challenges.

These skills are not only valuable for writing efficient SQL—they are also core competencies for careers in Data Analytics, Data Engineering, Business Intelligence, Database Administration, and Software Development.

Every advanced SQL expert started with the basics. By practicing these concepts consistently and applying them to real projects, you'll build the confidence and expertise needed to work with large datasets, develop insightful reports, and create data-driven solutions that make a meaningful impact.

Connecting Power BI to SQL Databases: From Local Servers to Cloud Platforms

Emilio Ochieng — Fri, 10 Jul 2026 13:22:09 +0000

Introduction

Connecting Power BI directly to a PostgreSQL database offers several advantages. It eliminates repetitive manual imports, improves data consistency, supports larger datasets, and allows reports to be refreshed whenever the underlying data changes.

This guide walks through the process of connecting Power BI to both a local PostgreSQL database and a cloud-hosted PostgreSQL database on Aiven, including how to configure SSL for secure cloud connections.

What is PostgreSQL?

PostgreSQL is one of the world's most popular open-source relational database management systems. It is trusted by startups, enterprises, and cloud providers because it is:

Open-source and free
Reliable and highly scalable
Secure
Excellent for analytics and reporting
Supported directly by Power BI

Requirements

Before starting, make sure you have:

PostgreSQL installed
DBeaver installed
Power BI Desktop
Aiven PostgreSQL account (for cloud connection)

Part 1: Connecting Power BI to a Local PostgreSQL Database

Step 1: Create Your PostgreSQL Database

Install PostgreSQL and create your database.

Open DBeaver and connect using:

Host: localhost
Port: 5432
Database: postgres
Username: postgres
Password: Your PostgreSQL password

Step 2: Import Your Dataset

Once connected:

Expand Schemas.
Open the public schema.
Right-click and select Import Data.
Choose your CSV or Excel dataset.
Complete the import wizard.

Your tables should now appear inside the PostgreSQL database.

Step 3: Connect Power BI

Open Power BI Desktop.

Navigate to:

Home → Get Data → PostgreSQL Database

Enter:

Server

127.0.0.1:5432

Database

postgres

Since this database is running locally, leave Use Encrypted Connection unchecked.

Authenticate using your PostgreSQL credentials.

Power BI will display the available tables.

Select the tables you need and click Load.

Part 2: Connecting Power BI to a Cloud PostgreSQL Database (Aiven)

Cloud databases allow you to work from anywhere while providing security, scalability, backups, and high availability.

After creating a PostgreSQL service in Aiven, you'll receive:

Host
Port
Database
Username
Password
SSL Certificate

Step 1: Download the SSL Certificate

Cloud databases encrypt communication between your computer and the server.

Download the CA Certificate (ca.pem) from the Aiven dashboard.

Step 2: Import the Certificate into Windows

Search for:

Manage Computer Certificates

Navigate to:

Trusted Root Certification Authorities → Certificates

Right-click Certificates and select:

All Tasks → Import

Choose the downloaded ca.pem file. In the file browser, set the file-type filter to All Files so that the .pem file is visible, then click Next and complete the Certificate Import Wizard.

Windows will confirm that the import was successful.

Step 3: Connect Power BI

Open Power BI Desktop.

Navigate to:

Home → Get Data → PostgreSQL Database

Enter:

Server

your-hostname:port

Database

defaultdb

Check Use Encrypted Connection.

Authenticate using the username and password provided by Aiven.

Power BI will connect securely and display the available tables.

Select the required tables and click Load.

Local vs Cloud PostgreSQL

Local PostgreSQL	Cloud PostgreSQL (Aiven)
Runs on your computer	Hosted on cloud infrastructure
Uses `localhost`	Uses a public hostname
SSL not required	SSL certificate required
Ideal for development	Ideal for production and collaboration
Limited remote access	Accessible from anywhere

Common Connection Issues

If Power BI cannot connect to PostgreSQL, check the following:

Is PostgreSQL running?
Are the server name and port correct?
Are the username and password correct?
Has the SSL certificate been imported (for cloud databases)?
Is the PostgreSQL driver installed?
Is your firewall blocking the connection?

Most connection issues can be resolved by verifying these settings.
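
Several items on this checklist (server not running, wrong host or port, firewall) come down to whether the database port is reachable at all. A minimal sketch of that check, assuming Python's standard socket module; the helper name port_is_open is made up for illustration, and a successful TCP connection says nothing about credentials or SSL.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or host name could not be resolved
        return False


# 127.0.0.1:5432 matches the local setup in Part 1; for the cloud case,
# substitute the host and port shown in your Aiven dashboard.
print("PostgreSQL reachable:", port_is_open("127.0.0.1", 5432))
```

If this prints False for the local server, check that PostgreSQL is running; if it prints False only for the cloud host, a firewall or a mistyped hostname is the more likely cause.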

Why Connect Power BI Directly to PostgreSQL?

Connecting Power BI directly to PostgreSQL offers several benefits:

Faster reporting
Centralized data management
Improved security
Reduced manual work
Better scalability
Support for scheduled refreshes
Near real-time insights when combined with DirectQuery

For organizations managing growing datasets, connecting Power BI directly to a database is a more efficient approach than repeatedly importing Excel files.

Final Thoughts

Learning how to connect Power BI to PostgreSQL is an important milestone for anyone pursuing a career in Data Analytics or Data Engineering.

Whether you're working with a local PostgreSQL server during development or a cloud-hosted PostgreSQL instance such as Aiven, the overall workflow remains the same: establish a connection, authenticate securely, load the required tables, and begin transforming raw data into actionable insights.

If you're learning Power BI, I encourage you to move beyond spreadsheets and start working directly with databases. It's a skill that mirrors real-world data environments and prepares you for more advanced analytics projects.

Thank you for reading! If you found this guide helpful, feel free to share your experience or ask questions in the comments.

Data Modeling, Joins, Relationships, and Different Schemas

Emilio Ochieng — Fri, 19 Jun 2026 08:33:38 +0000

Data Modeling

What is Data Modeling?

Data modeling is the process of designing and organizing data structures to define how data is stored, connected, and accessed within a database system.

A data model serves as a blueprint for creating databases by identifying:

Data entities
Attributes
Relationships
Constraints
Business rules

The primary goal of data modeling is to ensure data consistency, accuracy, efficiency, and scalability.

Example

Consider a university system:

Students

Student ID
Name
Email

Courses

Course ID
Course Name

Enrollments

Student ID
Course ID

This model defines how students interact with courses through enrollments.

Types of Data Models

1. Conceptual Data Model

The conceptual model provides a high-level view of business entities and relationships.

Example:

Student → Enrolls In → Course

Characteristics:

Business-focused
No technical details
Easy for stakeholders to understand

2. Logical Data Model

The logical model defines attributes, primary keys, and relationships.

Example:

Student

Student_ID (PK)
Name
Email

Course

Course_ID (PK)
Course_Name

Enrollment

Enrollment_ID (PK)
Student_ID (FK)
Course_ID (FK)

3. Physical Data Model

The physical model describes how data is implemented in a database system.

Example:

CREATE TABLE Student (
    Student_ID INT PRIMARY KEY,
    Name VARCHAR(100),
    Email VARCHAR(100)
);

Characteristics:

Database-specific
Includes indexes and storage details
Optimized for performance

Database Relationships

Relationships define how tables interact with each other.

One-to-One Relationship (1:1)

Each record in one table relates to one record in another table.

Example:

Person ↔ Passport

Person ID	Name
1	John

Passport ID	Person ID
P123	1

Each person has at most one passport, and each passport belongs to exactly one person.

One-to-Many Relationship (1:M)

One record can relate to many records.

Example:

Customer → Orders

One customer can place many orders.

Customer ID	Name
101	Emilio

Order ID	Customer ID
1	101
2	101

Many-to-Many Relationship (M:M)

Many records relate to many records.

Example:

Students ↔ Courses

A student can take multiple courses.
A course can have multiple students.

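This requires a bridge table. Here, Enrollment sits between Student and Course and stores one row per student-course pair, with a foreign key to each:

Student → Enrollment ← Course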
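
The bridge-table idea can be sketched with Python's built-in sqlite3 module. The sample students and courses are invented, and the composite primary key on (student_id, course_id) is one common design choice; a separate Enrollment_ID key, as in the logical model earlier, works equally well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, course_name TEXT);
CREATE TABLE enrollment (                       -- bridge table
    student_id INTEGER REFERENCES student(student_id),
    course_id  INTEGER REFERENCES course(course_id),
    PRIMARY KEY (student_id, course_id)         -- each pair can appear only once
);
INSERT INTO student VALUES (1, 'John'), (2, 'Emilio');
INSERT INTO course  VALUES (10, 'Databases'), (20, 'Statistics');
INSERT INTO enrollment VALUES (1, 10), (1, 20), (2, 10);
""")

# Resolve the many-to-many relationship by joining through the bridge table.
pairs = conn.execute("""
    SELECT s.name, c.course_name
    FROM enrollment e
    JOIN student s ON s.student_id = e.student_id
    JOIN course  c ON c.course_id  = e.course_id
    ORDER BY s.name, c.course_name
""").fetchall()

print(pairs)  # → [('Emilio', 'Databases'), ('John', 'Databases'), ('John', 'Statistics')]
```

John appears in two courses and Databases has two students, yet neither the student nor the course table stores any repeated data.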

Primary Keys and Foreign Keys

Primary Key (PK)

A unique identifier for records in a table.

Example:

Student_ID

Characteristics:

Unique
Cannot be null

Foreign Key (FK)

A field that references a primary key in another table.

Example:

Student_ID in the Enrollment table references Student(Student_ID).

Purpose:

Maintains data integrity
Creates relationships

Joins

Joins combine data from multiple tables based on related columns.

INNER JOIN

Returns matching records from both tables.

SELECT *
FROM Customers c
INNER JOIN Orders o
ON c.CustomerID = o.CustomerID;

Result:

Only customers who have placed orders appear.

LEFT JOIN

Returns all records from the left table and matching records from the right table.

SELECT *
FROM Customers c
LEFT JOIN Orders o
ON c.CustomerID = o.CustomerID;

Result:

All customers appear, even those without orders.

RIGHT JOIN

Returns all records from the right table and matching records from the left table.

SELECT *
FROM Customers c
RIGHT JOIN Orders o
ON c.CustomerID = o.CustomerID;

FULL OUTER JOIN

Returns all records from both tables.

SELECT *
FROM Customers c
FULL OUTER JOIN Orders o
ON c.CustomerID = o.CustomerID;

CROSS JOIN

Produces every possible combination of rows from the two tables (the Cartesian product).

SELECT *
FROM Products
CROSS JOIN Stores;

Useful for:

Simulations
Testing
Matrix generation

Schemas in Data Warehousing

A schema defines how tables are structured and connected within a database or data warehouse.

Star Schema

The most common schema in Business Intelligence and Power BI.

Structure:

          Customer
              |
Product -- Fact Sales -- Date
              |
           Store

Characteristics:

One central fact table
Multiple dimension tables
Simple structure
Fast query performance

Advantages:

Easy to understand
Optimized for reporting
Ideal for Power BI
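
A typical star-schema query joins the fact table to the dimensions it needs and then aggregates. The sketch below assumes Python's built-in sqlite3 module; the table names (fact_sales, dim_product, dim_store), columns, and figures are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE fact_sales (                        -- central fact table: keys plus measures
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    quantity   INTEGER,
    amount     REAL
);
INSERT INTO dim_product VALUES
    (1, 'Notebook', 'Stationery'), (2, 'Pen', 'Stationery'), (3, 'Backpack', 'Bags');
INSERT INTO dim_store VALUES (1, 'Nairobi'), (2, 'Mombasa');
INSERT INTO fact_sales VALUES
    (1, 1, 10, 50.0), (2, 1, 20, 20.0), (3, 2, 2, 60.0), (1, 2, 4, 20.0);
""")

# One join per dimension, then group by the descriptive attributes.
sales_by_category_city = conn.execute("""
    SELECT p.category, s.city, SUM(f.amount) AS total_amount
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_store   s ON s.store_id   = f.store_id
    GROUP BY p.category, s.city
    ORDER BY p.category, s.city
""").fetchall()

for row in sales_by_category_city:
    print(row)
# → ('Bags', 'Mombasa', 60.0)
# → ('Stationery', 'Mombasa', 20.0)
# → ('Stationery', 'Nairobi', 70.0)
```

Every dimension is exactly one join away from the fact table, which is what keeps star-schema queries short.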

Snowflake Schema

A normalized version of the Star Schema.

Structure:

Fact Sales
   |
Product
   |
Category

Characteristics:

Dimension tables are split further
Reduces redundancy
More complex joins

Advantages:

Better data integrity
Reduced storage

Disadvantages:

More joins
Slightly slower queries

Galaxy Schema (Fact Constellation)

Contains multiple fact tables sharing dimension tables.

Example:

Fact Sales
      |
Customer
      |
Fact Inventory

Used when:

Multiple business processes must be modeled together
An enterprise-level data warehouse is being built

Advantages:

Supports complex analytics
Highly scalable

Relationships in Power BI

Power BI relies heavily on relationships between tables.

Common Relationship Types:

One-to-Many

Most common.

Example:

Customers → Orders, related through CustomerID

Many-to-One

Reverse of one-to-many.

Many-to-Many

Used when multiple records match across tables.

Requires careful management to avoid ambiguity.

Best Practices

Use Star Schema whenever possible.
Create meaningful primary keys.
Avoid unnecessary many-to-many relationships.
Use surrogate keys in data warehouses.
Keep fact tables narrow and dimension tables descriptive.
Optimize joins for performance.
Document all relationships clearly.

Conclusion

Data modeling, joins, relationships, and schemas are fundamental concepts in database design and data engineering. Data modeling provides structure, relationships define how data interacts, joins retrieve meaningful information, and schemas organize data efficiently for analytics and reporting.

Linux Fundamentals for Data Engineers

Emilio Ochieng — Thu, 18 Jun 2026 13:02:38 +0000

The Essential Guide

In the world of data engineering, Python, SQL, and Spark often steal the spotlight. Yet underneath these tools lies the operating system that powers most data platforms: Linux. Whether you're managing Airflow on an EC2 instance, troubleshooting a Kafka cluster, or building ETL pipelines in a Docker container, Linux proficiency directly impacts your productivity and reliability as a data engineer.

This guide covers the Linux fundamentals every data engineer should master.

1. Why Linux Matters in Data Engineering

Most cloud data platforms (AWS, GCP, Azure) run on Linux. Self-hosted tools like Apache Airflow, dbt, Spark, Kafka, Flink, and PostgreSQL are designed for Linux environments. Data engineers who understand Linux can:

Debug infrastructure issues faster
Write more efficient automation scripts
Secure data pipelines properly
Optimize resource usage
Reduce dependency on DevOps teams

Mastering Linux turns you from a "SQL + Python" engineer into a true infrastructure-aware data professional.

2. Installation & User Management

Choosing the Right Distribution

For data engineering, Ubuntu LTS (22.04 or 24.04) is the most popular choice due to its stability and vast package ecosystem. CentOS/Rocky Linux/AlmaLinux are common in enterprise environments.

Creating a Dedicated User

Never run data pipelines as root.

Create a dedicated user:

sudo adduser dataeng
sudo usermod -aG sudo dataeng # Optional: grant sudo access

SSH Key Authentication (Best Practice)

ssh-keygen -t ed25519 -C "dataeng@workstation"
ssh-copy-id dataeng@your-server-ip

Once key-based login works, disable password authentication by setting PasswordAuthentication no in /etc/ssh/sshd_config and restarting the SSH service.

3. File System & Permissions

Understanding the Linux Filesystem Hierarchy

/home – User files
/var/log – Application and system logs (critical for debugging)
/etc – Configuration files
/opt – Third-party software
/tmp – Temporary files (typically cleaned on reboot)

Permissions Deep Dive

ls -la
chmod 755 script.sh # Owner: rwx, Group/Other: r-x
chown dataeng:dataeng /opt/pipeline

Special Permissions for Data Work

Use umask to control default file permissions and setfacl for complex shared directories in team environments.

Practical Example:

# Create a shared data directory
sudo mkdir -p /data/lakehouse
sudo chown -R dataeng:dataeng /data
sudo chmod -R 775 /data
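
The octal modes used above (755, 775) and the effect of umask are easier to reason about when decoded. This is a small sketch using Python's standard stat module; the helper names describe and effective_mode are made up for illustration, and the umask values 022 and 002 are common defaults rather than anything specific to this setup.

```python
import stat


def describe(mode):
    """Render octal permission bits as ls -l shows them, minus the file-type character."""
    return stat.filemode(mode)[1:]


def effective_mode(requested, umask):
    """Permissions a new file or directory actually gets: requested bits minus the umask bits."""
    return requested & ~umask


print(describe(0o755))  # → rwxr-xr-x  (owner: rwx, group/other: r-x)
print(describe(0o775))  # → rwxrwxr-x  (group can also write, as in /data above)

# New files are requested with 666 and new directories with 777.
print(oct(effective_mode(0o666, 0o022)))  # → 0o644
print(oct(effective_mode(0o777, 0o002)))  # → 0o775
```

With a umask of 002, directories created inside a shared location come out group-writable by default, which fits the 775 layout used for /data.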

4. Process & Resource Management

Essential commands

ps aux | grep spark # Find processes
top # Interactive monitoring (htop is a friendlier alternative, if installed)
kill -9 <PID> # Force kill (use carefully)

Systemd – The Modern Init System

Most data tools run as systemd services:

sudo systemctl status postgresql
sudo systemctl restart airflow
sudo journalctl -u airflow -f # Live logs