"Deep Diving the SQL Iceberg" is a blog series designed to demystify the submerged complexities of SQL databases. Taking this educational journey I hope to become more proficient in the field of database technology, focusing on PostgreSQL as the exemplar. Each post will surface a few concepts from the vast array of topics encapsulated in the "SQL Iceberg" by Jordan Lewis memeβfrom foundational elements like "JOIN" and "indexes" to advanced intricacies such as "query plans and EXPLAIN" and "The Halloween Problem."
I've seen firsthand that the Achilles' heel of many web applications often lies in suboptimal database architecture and interaction, or in flawed API design (a topic for another series). My objective is to deepen my expertise and evolve as a professional and as an architect. By making my insights and discoveries accessible to others, I aspire not only to enhance my own skills but also to potentially assist others on similar quests for improvement.
Thanks to Jordan Lewis & friends for creating the "SQL Iceberg"
Most images are generated by ChatGPT
Level 0: Sky Zone
Concepts:
ORMs
data types
CREATE TABLE
indexes
foreign keys
SELECT / INSERT / UPDATE / DELETE
JOIN
ORDER BY / GROUP BY
LIMIT and OFFSET
NULL
ORMs (Object-Relational Mapping)
Definition: ORMs are programming libraries that facilitate the conversion of data between incompatible type systems in object-oriented programming languages and relational databases. They enable developers to interact with a database using high-level programming constructs instead of direct SQL queries.
Example: In the context of PostgreSQL, an ORM like Hibernate (Java) or Sequelize (Node.js) can be used. For instance, using Sequelize, you can define a model for a users table like this:
const User = sequelize.define('user', {
id: { type: Sequelize.INTEGER, primaryKey: true },
name: Sequelize.STRING
});
Pros of Using ORMs
Abstraction from SQL: ORMs abstract away the complexity of raw SQL queries, allowing developers to interact with the database using the object-oriented paradigm of their programming language.
Increased Productivity: By automating the task of writing SQL queries and handling database connections, ORMs can significantly speed up the development process, especially for standard database operations.
Database Agnosticism: ORMs often allow applications to become largely independent of the underlying database system. This means you can switch databases with minimal changes in the application code.
Automated Schema Management: Many ORMs provide tools for automatically generating and updating database schemas based on the object model, which can simplify migrations and schema changes.
Support for Complex Relationships and Operations: ORMs generally provide a straightforward way to handle complex joins, nested selects, and transactions.
Cons of Using ORMs
Performance Overhead: ORMs add an extra layer of abstraction which can lead to performance issues, especially with large datasets or complex queries. The automatically generated SQL might not be as optimized as hand-written queries.
Complex Queries Can Be Challenging: For complex database operations, ORM queries can become cumbersome and less efficient than native SQL. Sometimes, the abstraction can limit the ability to perform certain highly specialized operations.
Steep Learning Curve: While ORMs simplify database interactions, they often have their own complexities. Learning to use an ORM effectively can require a significant investment of time and effort.
Less Control Over the Execution: ORMs tend to hide the details of data retrieval and manipulation, which can be a disadvantage when fine-tuning queries for performance optimizations.
Potential for Poor Database Design: Relying heavily on ORMs can sometimes lead to suboptimal database design, as the database schema is often a reflection of the object model, which may not always be the most efficient representation of the data.
In summary, while ORMs offer a high level of convenience and productivity gains, especially for standard CRUD (Create, Read, Update, Delete) operations, they also introduce complexities and performance considerations.
One of the main reasons for starting this series is my frustration with ORMs. They provide a great medium for inserting data. However, abstracting the queries away means developers often stop thinking about the impact each call has on the overall performance of the application. Far too often I've encountered queries called inside a loop, or the famous N+1 problem.
RECOMMENDATION: use ORMs for inserting data and basic queries, but switch to native queries for more complex work, and use the ORM's mapping capabilities to convert the results into entities or projection objects. Be careful with eager and lazy fetching and with what data you actually need in your business logic. A sketch of the N+1 problem and its native-query fix follows below.
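To make the N+1 problem concrete, here is a minimal sketch using a customers/orders pair of tables like the ones later in this article (table and column names are illustrative):

-- The N+1 pattern an ORM can silently generate: one query for the parent rows...
SELECT customer_id, name FROM customers;
-- ...then one query per customer, issued inside an application loop:
SELECT order_id, order_date FROM orders WHERE customer_id = 1;
SELECT order_id, order_date FROM orders WHERE customer_id = 2;
-- ...and so on, N more round-trips.

-- A single native query with a JOIN returns the same data in one round-trip:
SELECT c.customer_id, c.name, o.order_id, o.order_date
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id;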
Data types
To be honest, I don't think there is a need for me to write about SQL data types. Depending on what database you use, you can easily find the information in its documentation. Here is a link for PostgreSQL.
A few recommendations:

Floating-Point Precision:
- Be cautious with floating-point types (REAL and DOUBLE PRECISION). They can introduce rounding errors and are not suitable for storing exact values like monetary amounts.
- Use DECIMAL or NUMERIC for precise calculations where rounding errors are unacceptable (currency, etc.). See the sketch after this list.

Character Types:
- VARCHAR(n) and CHAR(n) have a performance difference: CHAR(n) pads spaces to the defined length, which can increase storage usage unnecessarily for variable-length strings.
- TEXT and VARCHAR are very similar, but TEXT is preferable when you don't have a specific limit on string length. Internally, PostgreSQL treats VARCHAR without a length specifier as TEXT.

Date and Time Types:
- Be mindful of timezone handling with TIMESTAMP WITH TIME ZONE and TIMESTAMP WITHOUT TIME ZONE. The former stores the value normalized to UTC and converts it to the session's timezone on output, while the latter performs no conversion at all.
- DATE, TIME, and TIMESTAMP types have different precision levels and functions, and choosing the correct type for your needs is crucial.

Array Types:
- PostgreSQL supports arrays for most data types. However, managing and querying arrays can be more complex than simple scalar types.
- Use arrays judiciously, as they can complicate query writing and impact performance.

JSON vs. JSONB:
- JSON and JSONB are both used to store JSON data, but they behave differently. JSONB stores data in a decomposed binary format, offering better performance for querying and indexing at the cost of slightly slower write times and more storage space.
- JSON preserves the exact input format, which might be necessary in some cases.

UUID Type:
- Using UUID as a primary key can be beneficial for distributed systems, but it's larger and slower to generate compared to traditional integer types. Here is a great video about it.
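A quick sketch of why floating-point types are risky for exact amounts (output shown is what recent PostgreSQL versions display):

SELECT 0.1::double precision + 0.2::double precision;  -- 0.30000000000000004
SELECT 0.1::numeric + 0.2::numeric;                    -- 0.3, exact decimal arithmetic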
CREATE TABLE
The CREATE TABLE command in SQL is used to create a new table in a specific database. This command defines the table structure, specifying the columns and the types of data each column can hold. It's akin to laying the blueprint for a building: you're setting the foundation and structure on which your data will reside.
Syntax and Usage
The basic syntax of the CREATE TABLE command is straightforward, yet powerful:
CREATE TABLE employees (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL,
email VARCHAR(100) UNIQUE,
join_date DATE NOT NULL
);
In this example, we create a table named employees with four columns: id, name, email, and join_date. The id column is a serial type, automatically incrementing and serving as the primary key.
Best Practices:
Naming Conventions: Use clear, descriptive names for tables and columns.
Normalizing Data: Aim for a normalized database design to reduce data redundancy and improve data integrity.
Defining Constraints: Use constraints wisely to enforce data integrity (see the sketch after this list).
Planning for Scalability: Consider how your tables will scale and perform as data grows. Thinking about sharding keys might come in handy.
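As a minimal sketch of the constraints point above, here is an illustrative salaries table (not part of the schema defined earlier) combining NOT NULL, CHECK, DEFAULT, UNIQUE, and a foreign key:

CREATE TABLE salaries (
  employee_id INTEGER NOT NULL REFERENCES employees(id),     -- must point to an existing employee
  amount NUMERIC(10, 2) NOT NULL CHECK (amount > 0),         -- reject zero or negative salaries
  effective_from DATE NOT NULL DEFAULT CURRENT_DATE,         -- default to today when omitted
  UNIQUE (employee_id, effective_from)                       -- one salary change per employee per day
);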
Indexes
Indexes in PostgreSQL are special database objects that are primarily used to improve the speed of data retrieval operations. Think of them like the index in a book - instead of reading the whole book to find a topic, you refer to the index to quickly locate the pages you need. In a database, indexes help the database server to find and retrieve specific rows much faster than it could do by scanning the entire table.
The Mechanism of Indexes on Insert Operations
While indexes are widely known for enhancing query performance, their impact on data insertion is a critical aspect often overlooked. Here's what happens during an insert operation in a PostgreSQL table with indexes:
Data Insertion and Index Update: When a new record is inserted into a table, PostgreSQL not only saves this record into the table but also updates all the indexes that the table is associated with. Each index must reflect the change to ensure it remains accurate for read operations.
Movement and Sorting: For certain types of indexes, particularly those like B-tree (the default in PostgreSQL), the new data must be correctly sorted into the existing structure. This might involve moving and rearranging data within the index to maintain its order and integrity.
Write Amplification: Each additional index results in more write operations. This phenomenon, known as write amplification, can significantly impact the performance of insert operations, especially in write-heavy databases.
Choosing the Right Indexes: A Balance Act
Given the overhead indexes can create during data insertion, choosing the right indexes becomes a crucial decision. Here's why careful index selection is important:
Performance Trade-off: Every index you add can improve query performance but might slow down data insertion. It's essential to strike a balance - create indexes that will benefit your most common and critical queries without overly hindering write operations.
Index Selectivity: Highly selective indexes, where the index entries have unique or near-unique values, are generally more effective. They allow PostgreSQL to quickly narrow down the search to a few rows, enhancing query performance significantly.
Index Types Matter: Different index types, such as B-tree (the default), Hash, GIN, and GiST, have different characteristics and use cases. Understanding the nature of your data and queries can guide you to choose the most appropriate type.
Maintenance and Monitoring: Regularly monitor the performance of your indexes. Unused or less frequently used indexes might be candidates for removal to improve write performance.
Partial Indexes: In cases where you only query a subset of your data, consider using partial indexes. These indexes are smaller and faster to update, as they only index a portion of the table.
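A short sketch of creating a regular and a partial index in PostgreSQL (the table, the status column, and the index names are illustrative assumptions):

-- Default B-tree index on a column that is frequently filtered or joined on:
CREATE INDEX idx_orders_customer_id ON orders (customer_id);

-- Partial index: only rows matching the WHERE clause are indexed,
-- keeping the index smaller and cheaper to maintain on inserts.
CREATE INDEX idx_orders_open ON orders (order_date) WHERE status = 'open';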
Foreign Keys
Foreign keys in PostgreSQL are a type of constraint used to link two tables together. This link is based on the relationship between columns in these tables - typically, the primary key in one table becomes a foreign key in another. They play a crucial role in preventing invalid data entry and maintaining the consistency of data across tables.
When to Use Foreign Keys
Enforcing Referential Integrity: Use foreign keys when you need to ensure that a relationship exists between rows in different tables. For example, an order should always be linked to a valid customer.
Representing Relationships: They are essential for representing real-world relationships in your database design, such as one-to-many or many-to-one relationships.
Examples of Foreign Keys in Action
E-Commerce Database:
- In an e-commerce database, you might have customers and orders tables.
- The orders table can have a foreign key that references the customer_id in the customers table, ensuring that every order is associated with a valid customer.

CREATE TABLE customers (
  customer_id SERIAL PRIMARY KEY,
  name VARCHAR(100),
  email VARCHAR(100)
);

CREATE TABLE orders (
  order_id SERIAL PRIMARY KEY,
  order_date DATE NOT NULL,
  customer_id INTEGER REFERENCES customers(customer_id)
);
Best Practices for Using Foreign Keys
Careful Planning: Design your database schema with the relationships between tables in mind. Ensure that foreign keys are used to enforce meaningful and necessary relationships.
Indexing Foreign Key Columns: Indexes on foreign key columns can significantly improve the performance of joins and other queries that involve these columns.
Avoiding Cascade Deletes: While cascade deletes can be convenient, they can also lead to unintentional loss of data. Use them judiciously and ensure that your application logic accounts for their effects (a short sketch follows this list).
Consistent Data Types: Ensure that the data type of the foreign key column matches the data type of the referenced primary key column.
Regular Integrity Checks: Periodically check the integrity of your relationships, especially if you perform bulk data operations that might bypass foreign key constraints.
Documenting Relationships: Maintain clear documentation of your database schema, including the foreign key relationships. This aids in understanding the data model and troubleshooting issues.
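To make the cascade-delete point concrete, here is a sketch of the orders table from the example above with an explicit delete rule on its foreign key (a variant for illustration, not the exact definition used earlier):

-- ON DELETE RESTRICT rejects deleting a customer that still has orders;
-- swapping it for ON DELETE CASCADE would silently delete those orders instead.
CREATE TABLE orders (
  order_id SERIAL PRIMARY KEY,
  order_date DATE NOT NULL,
  customer_id INTEGER REFERENCES customers(customer_id) ON DELETE RESTRICT
);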
Select / Insert / Update / Delete
SELECT, INSERT, UPDATE, and DELETE are the commands that form the core of data manipulation and querying.
The examples I give in this section only scratch the surface, so I recommend that you read about the different operations from other sources. The most important part when it comes to these operations is time in the saddle: the more you use them, the better you'll get at writing them.
1. SELECT - Fetching Data from the Database. (Postgres Docs)
Usage: The SELECT statement is used to query and retrieve data from a database table.
Example:
SELECT name, email FROM customers;
This query retrieves the name and email columns from the customers table.
Best Practices:
Avoid SELECT *: Instead of using SELECT *, specify the columns you need. This reduces the data load, especially when the table has many columns.
Use WHERE for Filtering: Leverage the WHERE clause to filter data. This minimizes the amount of data processed and transferred.
2. INSERT - Adding New Records
Usage: INSERT is used to add new rows to a table.
Example:
INSERT INTO orders (order_date, customer_id) VALUES ('2023-01-15', 1);
This adds a new order with the specified date and customer ID.
Best Practices:
Specify Column Names: Always specify column names for clarity and to avoid errors if the table structure changes.
Use Batch Inserts for Multiple Rows: For inserting multiple rows, use batch inserts to reduce the number of database round-trips.
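For example, a single multi-row insert into the orders table above (values are illustrative):

INSERT INTO orders (order_date, customer_id)
VALUES
  ('2023-01-15', 1),
  ('2023-01-15', 2),
  ('2023-01-16', 3);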
3. UPDATE - Modifying Existing Data
Usage: UPDATE modifies existing records in a table.
Example:
UPDATE orders SET order_date = '2023-01-16' WHERE order_id = 1;
This changes the order date for the order with ID 1.
Best Practices:
Be Specific with WHERE: Always use a WHERE clause to target specific rows. Without it, you risk updating all rows in the table.
Limit the Scope: When updating large tables, limit the scope of your update to avoid performance issues.
4. DELETE - Removing Records
Usage: DELETE is used to remove rows from a table.
Example:
DELETE FROM orders WHERE order_id = 1;
This deletes the order with ID 1 from the orders table.
Best Practices:
Use WHERE with Caution: Similar to UPDATE, always use a WHERE clause to specify which rows to delete.
Consider Using Soft Deletes: Instead of physically deleting records, you can mark them as inactive. This allows for data recovery and audit trails.
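A minimal soft-delete sketch, assuming you add a deleted_at column to the orders table (an assumption, not part of the original schema):

ALTER TABLE orders ADD COLUMN deleted_at TIMESTAMP;

-- "Delete" by marking the row instead of removing it:
UPDATE orders SET deleted_at = NOW() WHERE order_id = 1;

-- Read queries then filter out soft-deleted rows:
SELECT * FROM orders WHERE deleted_at IS NULL;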
Join
JOINs are used to retrieve data from multiple tables by linking them through a common field. The most common types of JOINs are INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN.
Here is a good article that explains it in detail. The examples I give in this section only scratch the surface, so I recommend that you read about joins from other sources. The most important part when it comes to these operations is time in the saddle: the more you use them, the better you'll get at writing them.
INNER JOIN: Selects records with matching values in both tables. Intersection between the two tables.
LEFT (OUTER) JOIN: Selects all records from the left table, and the matched records from the right table. For rows on the left that don't have a match on the right, null will be placed instead.
RIGHT (OUTER) JOIN: Selects all records from the right table, and the matched records from the left table. For rows on the right that don't have a match on the left, null will be placed instead.
FULL (OUTER) JOIN: Selects all records when there is a match in either the left or right table.
Examples of JOINs
Consider two tables in an e-commerce database: customers and orders. The customers table contains customer information, and the orders table contains order details.
- INNER JOIN Example:
- Fetch all orders along with customer names.
SELECT customers.name, orders.order_id
FROM customers
INNER JOIN orders ON customers.customer_id = orders.customer_id;
- LEFT JOIN Example:
- Retrieve all customers and their orders, including customers who have not placed any orders.
SELECT customers.name, orders.order_id
FROM customers
LEFT JOIN orders ON customers.customer_id = orders.customer_id;
- RIGHT JOIN Example:
- Fetch all orders, including those not linked to a customer in the customers table.
SELECT customers.name, orders.order_id
FROM customers
RIGHT JOIN orders ON customers.customer_id = orders.customer_id;
- FULL JOIN Example:
- Retrieve all records of customers and orders, with or without a match.
SELECT customers.name, orders.order_id
FROM customers
FULL OUTER JOIN orders ON customers.customer_id = orders.customer_id;
Best Practices for Using JOINs
Use Specific Columns in SELECT: Instead of using SELECT *, specify the columns you need to retrieve. This enhances performance, especially in large tables.
Optimize ON Conditions: Ensure that the ON condition in your JOIN is efficient and uses indexed columns if possible. This can significantly improve query performance.
Avoid Unnecessary JOINs: Only JOIN tables that are essential for your query. Extra JOINs can complicate the query and degrade performance.
Understand the Data Model: A clear understanding of the relationships between your tables is crucial. This helps in determining the right type of JOIN to use.
Be Aware of NULL Values: In LEFT or RIGHT JOINs, if there is no match, the result set will contain NULLs. It's important to handle these cases in your queries.
Use Aliases for Clarity: When dealing with tables having similar column names, use aliases to make your query more readable and avoid confusion.
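For instance, the INNER JOIN example above reads more compactly with aliases:

SELECT c.name, o.order_id
FROM customers AS c
INNER JOIN orders AS o ON c.customer_id = o.customer_id;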
ORDER BY / GROUP BY
In SQL, two of the most powerful clauses for organizing and summarizing data are ORDER BY and GROUP BY. Both play a crucial role in data analysis and reporting.
ORDER BY: Sorting Your Data
Functionality: The ORDER BY clause is used to sort the result set of a query by one or more columns. It can sort data in ascending (ASC) or descending (DESC) order. By default, ORDER BY sorts in ascending order.
Example:
SELECT name, department, salary FROM employees
ORDER BY department ASC, salary DESC;
This query sorts employees first by department in ascending order and then by salary in descending order within each department.
Be careful of:
Performance Overhead: Sorting large datasets can be resource-intensive and slow down query execution.
Limited by Memory: Extremely large sorts may exceed memory limits and require disk-based sorting, which is slower.
Best Practices:
Index Usage: Use indexes on sorting columns to speed up ORDER BY operations.
Limit Sorted Data: When working with large datasets, try to limit the result set before applying ORDER BY.
GROUP BY: Aggregating Your Data
Functionality: The GROUP BY clause is used to group rows that have the same values in specified columns into summary rows, like "find the number of customers in each country."
Example:
SELECT department, AVG(salary) AS average_salary
FROM employees
WHERE age > 30
GROUP BY department
HAVING AVG(salary) > 50000;
This query calculates the average salary per department for employees over 30 years old, but only for departments where this average is over 50,000.
Be careful of:
Complexity in Queries: Can make queries more complex, especially with the addition of HAVING clauses.
Potential for Misinterpretation: Incorrect use can lead to misleading results, especially if the SELECT statement includes columns not in the GROUP BY clause.
Best Practices:
Clear Grouping Criteria: Ensure that your GROUP BY criteria align with your analytical goals.
Use Aggregate Functions Wisely: Functions like COUNT(), AVG(), SUM(), etc., should be used thoughtfully to avoid incorrect interpretations.
Consider Using HAVING: Use the HAVING clause for filtering groups, but be aware of its additional processing overhead.
While ORDER BY enhances the readability and presentation of query results, GROUP BY allows for powerful data aggregation and summary. Understanding their strengths and limitations, and following best practices, can help you harness their full potential, enabling you to conduct more effective and insightful data analysis.
LIMIT and OFFSET
LIMIT: This clause is used to specify the maximum number of records to return in a query result. It's particularly useful for limiting the size of the result set to a manageable number.
OFFSET: OFFSET is used alongside LIMIT to skip a specified number of records before starting to return records from the query.
Use case
Paginating Query Results: Imagine an e-commerce platform with a products table containing thousands of entries. To display these products over multiple pages, you would use both LIMIT and OFFSET.
SELECT * FROM products ORDER BY name LIMIT 10 OFFSET 20;
This query retrieves 10 products, skipping the first 20 (essentially showing the third page of results if each page shows 10 products).
Sequential Data Processing: In scenarios where you need to process data in batches (like sending out batches of emails), LIMIT and OFFSET can be used to manage this.
SELECT email FROM customers WHERE subscription_status = 'active' ORDER BY signup_date LIMIT 100 OFFSET 200;
Here, you would process the third batch of 100 active subscribers.
Pros and Cons
Pros:
Improved Performance: By limiting the amount of data processed and returned, you can significantly reduce server load and improve response times.
Enhanced User Experience: In applications with large datasets, like web interfaces, using LIMIT and OFFSET for pagination can drastically improve user experience.
Be careful of:
Offset Performance Issue: OFFSET can be inefficient on large tables because it still has to read through all the preceding rows, which becomes more costly as the offset increases.
Risk of Inconsistent Data: When data is added or removed between queries, OFFSET can lead to skipping or duplicating rows.
Best Practices
Use with ORDER BY: Always use LIMIT and OFFSET in conjunction with ORDER BY to ensure a consistent order in paginated results.
Be Cautious with Large OFFSETs: For large datasets, consider alternative pagination strategies like keyset pagination (using a WHERE clause with a column value) instead of OFFSET to improve performance (see the sketch after this list).
Avoid Using OFFSET for Sequential Processing: For sequential processing of data, it's often more efficient to use a WHERE clause to specify a range or condition, rather than OFFSET.
Indexed Columns in ORDER BY: Ensure that the column used in the ORDER BY clause is indexed, particularly when using LIMIT and OFFSET, to optimize the query performance.
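A minimal keyset-pagination sketch on the products example above, assuming product_id is a unique, indexed column (an illustrative name):

-- First page:
SELECT product_id, name FROM products
ORDER BY product_id
LIMIT 10;

-- Next page: remember the last product_id from the previous page (say 42)
-- and continue from there instead of using OFFSET:
SELECT product_id, name FROM products
WHERE product_id > 42
ORDER BY product_id
LIMIT 10;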
NULL
NULL: The Billion Dollar Mistake
In SQL, NULL represents a unique concept: the absence of a value. It's neither zero nor an empty string but a distinct marker indicating that a value is unknown or undefined. While NULL is a powerful aspect of SQL, its proper use requires understanding its behavior, advantages, and potential pitfalls.
Understanding NULL in SQL
NULL is used in SQL to denote that a value is missing or unknown. It's a state that signifies the lack of a value, and therefore NULL is not equal to NULL (since both represent unknown values, they can't be considered equal).
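A quick illustration of this behavior in PostgreSQL:

SELECT NULL = NULL;   -- yields NULL, not TRUE
SELECT NULL IS NULL;  -- yields TRUE; use IS NULL / IS NOT NULL for checks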
Examples of NULL in SQL
Creating a Table with Nullable Columns:

CREATE TABLE employees (
  id INT PRIMARY KEY,
  name VARCHAR(100) NOT NULL,
  manager_id INT NULL
);

In this example, manager_id can be NULL, indicating that some employees might not have a manager.
Inserting NULL Values:
INSERT INTO employees (id, name, manager_id) VALUES (1, 'John Doe', NULL);
This adds an employee who doesn't have a manager assigned.
Querying with NULL Values:
SELECT * FROM employees WHERE manager_id IS NULL;
This query selects all employees without a manager.
Pros and Cons of Using NULL
Pros:
Flexibility: NULL allows for flexibility in database design by providing a way to represent unknown or inapplicable information.
Data Integrity: It helps maintain data integrity by differentiating between "unknown" and "empty" or "zero".
Cons:
Complexity in Queries: NULL values can complicate query logic, especially in calculations and comparisons.
Ambiguity: The meaning of NULL can be ambiguous: is the data unknown, not applicable, or simply not entered yet?
Unexpected Results: Operations involving NULL (like NULL + 1) result in NULL, which can lead to unexpected outcomes in calculations.
Best Practices for Handling NULL
Explicit Handling in Queries: Always handle NULL explicitly in your SQL queries. Use IS NULL and IS NOT NULL for checking NULL values.
Consider Using COALESCE: COALESCE can be used to provide default values when dealing with NULL, ensuring more predictable results.
SELECT COALESCE(manager_id, 0) FROM employees;
Avoid NULL in Calculations: Be cautious with calculations involving NULL values. They can propagate through your expressions and result in NULL.
Use NOT NULL Constraints: When designing your database schema, use NOT NULL constraints where a value is always required, to ensure data completeness.
Clear Definition of NULL: Define what NULL represents in each column of your database to avoid ambiguity and misuse.
NULL in Indexes: How indexes treat NULL varies between databases; PostgreSQL's B-tree indexes do include NULL values, but some other systems leave them out. If you frequently query for NULL values, check how your database handles them and consider strategies such as partial indexes.
This is a long article, and I didn't even go too deep into most topics. Note to self for the future: split into multiple articles if the topics are complex.