"Deep Diving the SQL Iceberg" is a blog series designed to demystify the submerged complexities of SQL databases. Taking this educational journey I hope to become more proficient in the field of database technology, focusing on PostgreSQL as the exemplar. Each post will surface a few concepts from the vast array of topics encapsulated in the "SQL Iceberg" by Jordan Lewis meme—from foundational elements like "JOIN" and "indexes" to advanced intricacies such as "query plans and EXPLAIN" and "The Halloween Problem."
I've seen firsthand that the Achilles' heel of many web applications often lies in suboptimal database architecture and interaction, or in flawed API design(topic for another series). My objective is to deepen my expertise and evolve as a professional and as an architect. By making my insights and discoveries accessible to others, I aspire not only to enhance my own skills but also to potentially assist others on similar quests for improvement.
Thanks to Jordan Lewis & friends for creating the "SQL Iceberg"
Most images are generated by ChatGPT
Level 0: Sky Zone
Concepts:
ORMs
data types
CREATE TABLE
indexes
foreign keys
SELECT / INSERT / UPDATE / DELETE
JOIN
ORDER BY / GROUP BY
LIMIT and OFFSET
NULL
ORMs (Object-Relational Mapping)
Definition: ORMs are programming libraries that facilitate the conversion of data between incompatible type systems in object-oriented programming languages and relational databases. They enable developers to interact with a database using high-level programming constructs instead of direct SQL queries.
Example: In the context of PostgreSQL, an ORM like Hibernate (Java) or Sequelize (Node.js) can be used. For instance, using Sequelize, you can define a model for a table users
like this:
const User = sequelize.define('user', {
id: { type: Sequelize.INTEGER, primaryKey: true },
name: Sequelize.STRING
});
Pros of Using ORMs
Abstraction from SQL: ORMs abstract away the complexity of raw SQL queries, allowing developers to interact with the database using the object-oriented paradigm of their programming language.
Increased Productivity: By automating the task of writing SQL queries and handling database connections, ORMs can significantly speed up the development process, especially for standard database operations.
Database Agnosticism: ORMs often allow applications to become largely independent of the underlying database system. This means you can switch databases with minimal changes in the application code.
Automated Schema Management: Many ORMs provide tools for automatically generating and updating database schemas based on the object model, which can simplify migrations and schema changes.
Support for Complex Relationships and Operations: ORMs generally provide a straightforward way to handle complex joins, nested selects, and transactions.
Cons of Using ORMs
Performance Overhead: ORMs add an extra layer of abstraction which can lead to performance issues, especially with large datasets or complex queries. The automatically generated SQL might not be as optimized as hand-written queries.
Complex Queries Can Be Challenging: For complex database operations, ORM queries can become cumbersome and less efficient than native SQL. Sometimes, the abstraction can limit the ability to perform certain highly specialized operations.
Steep Learning Curve: While ORMs simplify database interactions, they often have their own complexities. Learning to use an ORM effectively can require a significant investment of time and effort.
Less Control Over the Execution: ORMs tend to hide the details of data retrieval and manipulation, which can be a disadvantage when fine-tuning queries for performance optimizations.
Potential for Poor Database Design: Relying heavily on ORMs can sometimes lead to suboptimal database design, as the database schema is often a reflection of the object model, which may not always be the most efficient representation of the data.
In summary, while ORMs offer a high level of convenience and productivity gains, especially for standard CRUD (Create, Read, Update, Delete) operations, they also introduce complexities and performance considerations.
One of the main reasons for starting this series is my frustration with ORNs. They provide a great medium for inserting data. However, abstracting the queries makes the developers not think about the impact that each call has on the overall performance of the application. I've encountered far too often queries called inside a loop or the famous n+1 problem.
RECOMMENDATION: use ORMs for inserting data and basic queries but switch to native queries for more complex stuff and use the ORM mapping capabilities to convert the result to entities or projection objects. Be careful with eager and lazy fetching and what data you need in your business logic.
Data types
To be honest I don't think there is a need for me to write about SQL data types. Depending on what database you use you can easily find the information in their documentation. Here is a link for PostgreSQL.
A few recommendations:
-
Floating-Point Precision:
- Be cautious with floating-point types (
REAL
andDOUBLE PRECISION
). They can introduce rounding errors and are not suitable for storing exact values like monetary amounts. - Use
DECIMAL
orNUMERIC
for precise calculations where rounding errors are unacceptable. (Currency, etc)
- Be cautious with floating-point types (
-
Character Types:
-
VARCHAR(n)
andCHAR(n)
have a performance difference:CHAR(n)
pads spaces to the defined length, which can increase storage usage unnecessarily for variable-length strings. -
TEXT
andVARCHAR
are very similar, butTEXT
is preferable when you don't have a specific limit on string length. Internally, PostgreSQL treatsVARCHAR
without a length specifier asTEXT
.
-
-
Date and Time Types:
- Be mindful of timezone handling with
TIMESTAMP WITH TIME ZONE
andTIMESTAMP WITHOUT TIME ZONE
. The former converts the stored time to the current timezone of the server, while the latter does not. -
DATE
,TIME
, andTIMESTAMP
types have different precision levels and functions, and choosing the correct type for your needs is crucial.
- Be mindful of timezone handling with
-
Array Types:
- PostgreSQL supports arrays for most data types. However, managing and querying arrays can be more complex than simple scalar types.
- Use arrays judiciously, as they can complicate query writing and impact performance.
-
JSON vs. JSONB:
-
JSON
andJSONB
are both used to store JSON data, but they behave differently.JSONB
stores data in a decomposed binary format, offering better performance for querying and indexing at the cost of slightly slower write times and more storage space. -
JSON
preserves the exact input format, which might be necessary in some cases.
-
-
UUID Type:
- Using
UUID
as a primary key can be beneficial for distributed systems, but it's larger and slower to generate compared to traditional integer types. Here is a great video about it.
- Using
CREATE TABLE
The CREATE TABLE
command in SQL is used to create a new table in a specific database. This command defines the table structure, specifying the columns and the types of data each column can hold. It's akin to laying the blueprint for a building — you're setting the foundation and structure on which your data will reside.
Syntax and Usage
The basic syntax of the CREATE TABLE
command is straightforward, yet powerful:
CREATE TABLE employees (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL,
email VARCHAR(100) UNIQUE,
join_date DATE NOT NULL
);
In this example, we create a table named employees
with four columns: id
, name
, email
, and join_date
. The id
column is a serial type, automatically incrementing and serving as the primary key.
Best Practices:
Naming Conventions: Use clear, descriptive names for tables and columns.
Normalizing Data: Aim for a normalized database design to reduce data redundancy and improve data integrity.
Defining Constraints: Use constraints wisely to enforce data integrity.
Planning for Scalability: Consider how your tables will scale and perform as data grows. Thinking about sharding keys might come in handy.
Indexes
Indexes in PostgreSQL are special database objects that are primarily used to improve the speed of data retrieval operations. Think of them like the index in a book - instead of reading the whole book to find a topic, you refer to the index to quickly locate the pages you need. In a database, indexes help the database server to find and retrieve specific rows much faster than it could do by scanning the entire table.
The Mechanism of Indexes on Insert Operations
While indexes are widely known for enhancing query performance, their impact on data insertion is a critical aspect often overlooked. Here's what happens during an insert operation in a PostgreSQL table with indexes:
Data Insertion and Index Update: When a new record is inserted into a table, PostgreSQL not only saves this record into the table but also updates all the indexes that the table is associated with. Each index must reflect the change to ensure it remains accurate for read operations.
Movement and Sorting: For certain types of indexes, particularly those like B-tree (the default in PostgreSQL), the new data must be correctly sorted into the existing structure. This might involve moving and rearranging data within the index to maintain its order and integrity.
Write Amplification: Each additional index results in more write operations. This phenomenon, known as write amplification, can significantly impact the performance of insert operations, especially in write-heavy databases.
Choosing the Right Indexes: A Balance Act
Given the overhead indexes can create during data insertion, choosing the right indexes becomes a crucial decision. Here's why careful index selection is important:
Performance Trade-off: Every index you add can improve query performance but might slow down data insertion. It's essential to strike a balance - create indexes that will benefit your most common and critical queries without overly hindering write operations.
Index Selectivity: Highly selective indexes, where the index entries have unique or near-unique values, are generally more effective. They allow PostgreSQL to quickly narrow down the search to a few rows, enhancing query performance significantly.
Index Types Matter: Different index types (like B-tree (default), Hash, GIN, and GiST) have different characteristics and use cases. Understanding the nature of your data and queries can guide you to choose the most appropriate type.
Maintenance and Monitoring: Regularly monitor the performance of your indexes. Unused or less frequently used indexes might be candidates for removal to improve write performance.
Partial Indexes: In cases where you only query a subset of your data, consider using partial indexes. These indexes are smaller and faster to update, as they only index a portion of the table.
Foreign Keys
Foreign keys in PostgreSQL are a type of constraint used to link two tables together. This link is based on the relationship between columns in these tables - typically, the primary key in one table becomes a foreign key in another. They play a crucial role in preventing invalid data entry and maintaining the consistency of data across tables.
When to Use Foreign Keys
Enforcing Referential Integrity: Use foreign keys when you need to ensure that a relationship exists between rows in different tables. For example, an
order
should always be linked to a validcustomer
.Representing Relationships: They are essential for representing real-world relationships in your database design, such as one-to-many or many-to-one relationships.
Examples of Foreign Keys in Action
-
E-Commerce Database:
- In an e-commerce database, you might have
customers
andorders
tables. - The
orders
table can have a foreign key that references thecustomer_id
in thecustomers
table, ensuring that every order is associated with a valid customer.
CREATE TABLE customers ( customer_id SERIAL PRIMARY KEY, name VARCHAR(100), email VARCHAR(100) ); CREATE TABLE orders ( order_id SERIAL PRIMARY KEY, order_date DATE NOT NULL, customer_id INTEGER REFERENCES customers(customer_id) );
- In an e-commerce database, you might have
Best Practices for Using Foreign Keys
Careful Planning: Design your database schema with the relationships between tables in mind. Ensure that foreign keys are used to enforce meaningful and necessary relationships.
Indexing Foreign Key Columns: Indexes on foreign key columns can significantly improve the performance of joins and other queries that involve these columns.
Avoiding Cascade Deletes: While cascade deletes can be convenient, they can also lead to unintentional loss of data. Use them judiciously and ensure that your application logic accounts for their effects.
Consistent Data Types: Ensure that the data type of the foreign key column matches the data type of the referenced primary key column.
Regular Integrity Checks: Periodically check the integrity of your relationships, especially if you perform bulk data operations that might bypass foreign key constraints.
Documenting Relationships: Maintain clear documentation of your database schema, including the foreign key relationships. This aids in understanding the data model and troubleshooting issues.
Select / Insert / Update / Delete
SELECT, INSERT, UPDATE, and DELETE – are commands that form the core of data manipulation and querying.
The example I give in this section only scratches the surface, so I recommend that you read about the different operations from other sources. The most important part when it comes to these operations is time in the saddle. The more you need them the better you'll get at writing them.
1. SELECT - Fetching Data from the Database. (Postgres Docs)
Usage: The SELECT statement is used to query and retrieve data from a database table.
Example:
SELECT name, email FROM customers;
This query retrieves the name and email columns from the customers table.
Best Practices:
Avoid
SELECT *
: Instead of usingSELECT *
, specify the columns you need. This reduces the data load, especially when the table has many columns.Use WHERE for Filtering: Leverage the WHERE clause to filter data. This minimizes the amount of data processed and transferred.
2. INSERT - Adding New Records
Usage: INSERT is used to add new rows to a table.
Example:
INSERT INTO orders (order_date, customer_id) VALUES ('2023-01-15', 1);
This adds a new order with the specified date and customer ID.
Best Practices:
Specify Column Names: Always specify column names for clarity and to avoid errors if the table structure changes.
Use Batch Inserts for Multiple Rows: For inserting multiple rows, use batch inserts to reduce the number of database round-trips.
3. UPDATE - Modifying Existing Data
Usage: UPDATE modifies existing records in a table.
Example:
UPDATE orders SET order_date = '2023-01-16' WHERE order_id = 1;
This changes the order date for the order with ID 1.
Best Practices:
Be Specific with WHERE: Always use a WHERE clause to target specific rows. Without it, you risk updating all rows in the table.
Limit the Scope: When updating large tables, limit the scope of your update to avoid performance issues.
4. DELETE - Removing Records
Usage: DELETE is used to remove rows from a table.
Example:
DELETE FROM orders WHERE order_id = 1;
This deletes the order with ID 1 from the orders table.
Best Practices:
Use WHERE with Caution: Similar to UPDATE, always use a WHERE clause to specify which rows to delete.
Consider Using Soft Deletes: Instead of physically deleting records, you can mark them as inactive. This allows for data recovery and audit trails.
Join
JOINs are used to retrieve data from multiple tables by linking them through a common field. The most common types of JOINs are INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN.
Here is a good article that explains it in detail. The example I give in this section only scratches the surface, so I recommend that you read about joins from other sources. The most important part when it comes to these operations is time in the saddle. The more you need them the better you'll get at writing them.
INNER JOIN: Selects records with matching values in both tables. Intersection between the two tables.
LEFT (OUTER) JOIN: Selects all records from the left table, and the matched records from the right table. For rows on the left that don't have a match on the right,
null
will be placed instead.RIGHT (OUTER) JOIN: Selects all records from the right table, and the matched records from the left table. For rows on the right that don't have a match on the left,
null
will be placed instead.FULL (OUTER) JOIN: Selects all records when there is a match in either the left or right table.
Examples of JOINs
Consider two tables in an e-commerce database: customers
and orders
. The customers
table contains customer information, and the orders
table contains order details.
- INNER JOIN Example:
- Fetch all orders along with customer names.
SELECT customers.name, orders.order_id
FROM customers
INNER JOIN orders ON customers.customer_id = orders.customer_id;
- LEFT JOIN Example:
- Retrieve all customers and their orders, including customers who have not placed any orders.
SELECT customers.name, orders.order_id
FROM customers
LEFT JOIN orders ON customers.customer_id = orders.customer_id;
- RIGHT JOIN Example:
- Fetch all orders, including those not linked to a customer in the
customers
table.
SELECT customers.name, orders.order_id
FROM customers
RIGHT JOIN orders ON customers.customer_id = orders.customer_id;
- FULL JOIN Example:
- Retrieve all records of customers and orders, with or without a match.
SELECT customers.name, orders.order_id
FROM customers
FULL OUTER JOIN orders ON customers.customer_id = orders.customer_id;
Best Practices for Using JOINs
Use Specific Columns in SELECT: Instead of using
SELECT *
, specify the columns you need to retrieve. This enhances performance, especially in large tables.Optimize ON Conditions: Ensure that the ON condition in your JOIN is efficient and uses indexed columns if possible. This can significantly improve query performance.
Avoid Unnecessary JOINs: Only JOIN tables that are essential for your query. Extra JOINs can complicate the query and degrade performance.
Understand the Data Model: A clear understanding of the relationships between your tables is crucial. This helps in determining the right type of JOIN to use.
Be Aware of NULL Values: In LEFT or RIGHT JOINs, if there is no match, the result set will contain NULLs. It’s important to handle these cases in your queries.
Use Aliases for Clarity: When dealing with tables having similar column names, use aliases to make your query more readable and avoid confusion.
ORDER BY / GROUP BY
In SQL, two of the most powerful clauses for organizing and summarizing data are ORDER BY
and GROUP BY
. Both play a crucial role in data analysis and reporting.
ORDER BY: Sorting Your Data
Functionality: The ORDER BY
clause is used to sort the result set of a query by one or more columns. It can sort data in ascending (ASC
) or descending (DESC
) order. By default, ORDER BY
sorts in ascending order.
Example:
SELECT name, department, salary FROM employees
ORDER BY department ASC, salary DESC;
This query sorts employees first by department in ascending order and then by salary in descending order within each department.
Be careful of:
Performance Overhead: Sorting large datasets can be resource-intensive and slow down query execution.
Limited by Memory: Extremely large sorts may exceed memory limits and require disk-based sorting, which is slower.
Best Practices:
Index Usage: Use indexes on sorting columns to speed up
ORDER BY
operations.Limit Sorted Data: When working with large datasets, try to limit the result set before applying
ORDER BY
.
GROUP BY: Aggregating Your Data
Functionality: The GROUP BY
clause is used to group rows that have the same values in specified columns into summary rows, like "find the number of customers in each country."
Example:
SELECT department, AVG(salary) AS average_salary
FROM employees
WHERE age > 30
GROUP BY department
HAVING AVG(salary) > 50000;
This query calculates the average salary per department for employees over 30 years old, but only for departments where this average is over 50,000.
Be careful of:
Complexity in Queries: Can make queries more complex, especially with the addition of
HAVING
clauses.Potential for Misinterpretation: Incorrect use can lead to misleading results, especially if the SELECT statement includes columns not in the
GROUP BY
clause.
Best Practices:
Clear Grouping Criteria: Ensure that your
GROUP BY
criteria align with your analytical goals.Use Aggregative Functions Wisely: Functions like
COUNT()
,AVG()
,SUM()
, etc., should be used thoughtfully to avoid incorrect interpretations.Consider Using
HAVING
: Use theHAVING
clause for filtering groups, but be aware of its additional processing overhead.
While ORDER BY
enhances the readability and presentation of query results, GROUP BY
allows for powerful data aggregation and summary. Understanding their strengths and limitations, and following best practices, can help you harness their full potential, enabling you to conduct more effective and insightful data analysis.
Limit and offset
LIMIT: This clause is used to specify the maximum number of records to return in a query result. It's particularly useful for limiting the size of the result set to a manageable number.
OFFSET: OFFSET is used alongside LIMIT to skip a specified number of records before starting to return records from the query.
Use case
-
Paginating Query Results: Imagine an e-commerce platform with a products table containing thousands of entries. To display these products over multiple pages, you would use both LIMIT and OFFSET.
SELECT * FROM products ORDER BY name LIMIT 10 OFFSET 20;
This query retrieves 10 products, skipping the first 20 (essentially showing the third page of results if each page shows 10 products).
-
Sequential Data Processing: In scenarios where you need to process data in batches (like sending out batches of emails), LIMIT and OFFSET can be used to manage this.
SELECT email FROM customers WHERE subscription_status = 'active' ORDER BY signup_date LIMIT 100 OFFSET 200;
Here, you would process the third batch of 100 active subscribers.
Pros and Cons
Pros:
Improved Performance: By limiting the amount of data processed and returned, you can significantly reduce server load and improve response times.
Enhanced User Experience: In applications with large datasets, like web interfaces, using LIMIT and OFFSET for pagination can drastically improve user experience.
Be careful of:
Offset Performance Issue: OFFSET can be inefficient on large tables because it still has to read through all the preceding rows, which becomes more costly as the offset increases.
Risk of Inconsistent Data: When data is added or removed between queries, OFFSET can lead to skipping or duplicating rows.
Best Practices
Use with ORDER BY: Always use LIMIT and OFFSET in conjunction with ORDER BY to ensure a consistent order in paginated results.
Be Cautious with Large OFFSETs: For large datasets, consider alternative pagination strategies like keyset pagination (using a WHERE clause with a column value) instead of OFFSET to improve performance.
Avoid Using OFFSET for Sequential Processing: For sequential processing of data, it's often more efficient to use a WHERE clause to specify a range or condition, rather than OFFSET.
Indexed Columns in ORDER BY: Ensure that the column used in the ORDER BY clause is indexed, particularly when using LIMIT and OFFSET, to optimize the query performance.
NULL
NULL: The Billion Dollar Mistake
In SQL, NULL
represents a unique concept: the absence of a value. It's neither zero nor an empty string but a distinct marker indicating that a value is unknown or undefined. While NULL
is a powerful aspect of SQL, its proper use requires understanding its behavior, advantages, and potential pitfalls.
Understanding NULL in SQL
NULL
is used in SQL to denote that a value is missing or unknown. It's a state that signifies the lack of a value, and therefore, NULL
is not equal to NULL
(since both represent unknown values, they can't be considered equal).
Examples of NULL in SQL
-
Creating a Table with Nullable Columns:
CREATE TABLE employees ( id INT PRIMARY KEY, name VARCHAR(100) NOT NULL, manager_id INT NULL );
In this example,
manager_id
can beNULL
, indicating that some employees might not have a manager. -
Inserting NULL Values:
INSERT INTO employees (id, name, manager_id) VALUES (1, 'John Doe', NULL);
This adds an employee who doesn’t have a manager assigned.
-
Querying with NULL Values:
SELECT * FROM employees WHERE manager_id IS NULL;
This query selects all employees without a manager.
Pros and Cons of Using NULL
Pros:
Flexibility:
NULL
allows for flexibility in database design by providing a way to represent unknown or inapplicable information.Data Integrity: It helps maintain data integrity by differentiating between "unknown" and "empty" or "zero".
Cons:
Complexity in Queries:
NULL
values can complicate query logic, especially in calculations and comparisons.Ambiguity: The meaning of
NULL
can be ambiguous – is the data unknown, not applicable, or simply not entered yet?Unexpected Results: Operations involving
NULL
(likeNULL + 1
) result inNULL
, which can lead to unexpected outcomes in calculations.
Best Practices for Handling NULL
Explicit Handling in Queries: Always handle
NULL
explicitly in your SQL queries. UseIS NULL
andIS NOT NULL
for checkingNULL
values.-
Consider Using COALESCE:
COALESCE
can be used to provide default values when dealing withNULL
, ensuring more predictable results.
SELECT COALESCE(manager_id, 0) FROM employees;
Avoid NULL in Calculations: Be cautious with calculations involving
NULL
values. They can propagate through your expressions and result inNULL
.Use NOT NULL Constraints: When designing your database schema, use
NOT NULL
constraints where a value is always required, to ensure data completeness.Clear Definition of NULL: Define what
NULL
represents in each column of your database to avoid ambiguity and misuse.NULL in Indexes: Remember that
NULL
values are typically not included in indexes. If you need to frequently query forNULL
values, consider indexing strategies that account for them.
This is a long article and I didn't even went to deep in most topics. Note to self for the future, split into multiple articles if the topics are complex.
Top comments (0)