Mohammad Atif

Posted on Feb 28

Clustered vs Non-Clustered Index in SQL(Complete Guide with Examples)

#database #sql #clusteredvsnonclusteredindex #postgres

1. Introduction: Why Do We Need Indexes?

Speed and efficiency are key when it comes to querying a database. An index is a data structure that acts as a guide for your database, giving you quick access to a particular data set to speed up your queries significantly.

To understand indexes, let’s consider a couple of analogies.

The Book: You don’t want to go through each page of a book to find a particular chapter or data; instead, you look in the back of the book to find that chapter quickly.
The Hotel: You don’t want to knock on every hotel room on every floor to find room 5001; instead, you look at a map in the lobby to quickly locate that room.

2. The Foundation: Data Pages and The Heap Structure

Before we move on to indexes, we should know how SQL stores data logically as well as physically.

Data Pages:

When you create a new table in your SQL database and start adding data to it, SQL doesn’t store data as a spreadsheet; instead, it stores data in data files on your hard disk, divided into fixed-size 8-Kilobyte blocks called “Pages.”

A Data Page Consists Of:

Page Header: This contains metadata related to that particular page.
Data Rows: This contains your actual data. The amount of data that you can store here is determined by your data size.
Offset Array: This is a small internal map that quickly tracks where your data starts on that particular page.

The Heap Structure (No Index)

If you don't create an index on your table, your data is stored in a Heap Structure.

Write Logic: When new data is added to your table, SQL just throws it onto the next available Data Page in the order it was written. They're not sorted.

Read Logic (Full Table Scan): Because your data is unsorted, when you want to read a particular record, say Customer ID 14, SQL doesn't know where it is. It has to perform a Full Table Scan, reading every single Data Page until it finds your record.

Trade-off: Fast write speed (no sorting necessary), terrible read speed.

3. The Clustered Index

A Clustered Index changes everything about how your data is stored on your hard disk.

Physical Sorting:

A Clustered Index on a particular column (e.g., User_ID) causes SQL to physically sort ALL existing data pages on that column from lowest to highest.

The B-Tree Logic:

SQL uses a default data structure called a B-Tree (Balanced Tree) to navigate this sorted data:

Root Node (Top): An index page that contains a pointer to intermediate pages.
Intermediate Nodes: Index pages that contain a pointer to a particular set of data (e.g., "IDs 1-10 go left, IDs 11-20 go right").
Leaf Nodes (Bottom): In a Clustered Index, this is your actual Data Page.

Characteristics: Limit: You can only have ONE Clustered Index per table because your data can only be sorted in ONE way.

Ideal Candidate:

Primary Keys are perfect for Primary Keys. They are unique, and they are never updated. (Updating a clustered index column requires the database to move the row to maintain the sort order, which is very resource intensive).

4. The Non-Clustered Index

A Non-Clustered Index makes reads faster but never changes the physical order of the underlying table.

The Pointer Structure

When you create a non-clustered index, SQL leaves the actual Data Pages exactly as they are, whether they are a Heap or sorted by a clustered index. It simply creates an entirely separate B-Tree structure.

The B-Tree Logic

Leaf Nodes: Unlike the clustered index, these leaf nodes are Index Pages, not Data Pages. These leaf nodes are simply a list of the actual index values, along with a Row Identifier (RID) [00:17:42]. This RID is an exact address, like File ID, Page Number, and then an actual Offset, which points to where the actual row lives in the actual table.

Navigation: To retrieve your actual row, SQL uses the non-clustered index B-Tree, navigates down to the leaf node, finds the RID, and then makes one "jump" to the actual Data Page to retrieve the actual row.

Characteristics

Limit: You may have multiple non-clustered indexes on one single table.
Ideal Candidate: Columns you often use in your WHERE clause, like "Last_Name," or often use in your JOIN clause.

5. Direct Comparison: Clustered vs. Non-Clustered

6. SQL Syntax and Implementation

Creating indexes in SQL Server uses straightforward syntax. By default, if Primary Keys are defined, SQL automatically creates a clustered index for them and uses B-Tree as default data structure.

Creating a Clustered Index:

CREATE CLUSTERED INDEX idx_customers_id ON sales.customers (customer_id);

Creating a Non-Clustered Index:

CREATE NONCLUSTERED INDEX idx_customers_lastname ON sales.customers (last_name);
(Note: If you do not specify "NONCLUSTERED", SQL Server defaults to Non-Clustered.)

I have another post in which I have talked about how to achieve best unique identifier key for your table's cluster index for the fastest Read/Write. Click Me to redirect to the relevant post.

DEV Community