Most modern database management systems follow a client–server architecture. The database acts as the server, responsible for storing, processing, and protecting data, while applications act as clients that send requests, usually SQL queries, over the network.
At a high level, a DBMS can be broken into four major components:
- Query Processor
- Query Optimizer
- Execution Engine
- Storage Engine
Query Processor
The query processor is responsible for understanding and validating incoming SQL queries before any data is touched.
Query Parser
When a client sends a SQL query, it is first passed from the transport layer to the query parser. The parser’s job is to ensure that the query is valid and meaningful.
This happens in several stages:
- Lexical analysis: The raw SQL string is broken into tokens such as keywords, identifiers, operators, and literals.
- Syntax checking: The parser verifies that the tokens follow the grammar rules of SQL.
- Semantic analysis: The database checks that referenced tables, columns, functions, and operations actually exist and make sense.
- Access control checks: The system ensures the user has the required permissions on the referenced objects.
If all checks pass, the parser produces an abstract syntax tree (AST). This is then transformed into a logical query plan, which represents the query in relational terms and is passed to the query optimizer.
Query Optimizer
The query optimizer’s goal is to find the most efficient way to execute the query.
Given a logical query plan, the optimizer generates multiple physical execution plans. These plans may differ in join order, join algorithms, index usage, or access paths. Each plan is assigned a cost, and the plan with the lowest estimated cost is chosen.
Cost is an estimate of how expensive a query plan is to run. Lower cost usually means faster execution.
How Cost Estimation Works
The optimizer relies heavily on database statistics, such as:
- Number of rows in a table
- Number of distinct values per column
- Data distribution and histograms
- Index availability
- Percentage of NULL values
Using these statistics, the optimizer estimates:
- I/O cost (disk reads and writes, typically the most expensive)
- CPU cost (hashing, comparisons, sorting)
- Memory usage
For very large tables, collecting exact statistics can be expensive. To manage this, many databases sample random data pages and extrapolate statistics from the sample. This balances accuracy with performance.
Execution Engine
The execution engine is responsible for actually running the chosen execution plan.
It walks the execution plan tree and executes each operator, such as:
- Table scans or index scans
- Filters and projections
- Joins
- Sorting and aggregation
- Set operations
During execution, the engine:
- Requests data pages from the storage engine
- Manages memory for operations like sorting and hashing
- Handles set operations and aggregations
- Formats the final result set and returns the result back to the client.
Storage Engine
The storage engine handles how data is physically stored, retrieved, and protected on disk. It has several components:
- Buffer Manager: Caches frequently accessed data pages in memory to reduce disk I/O.
- Access Methods: Define how data is organized and accessed on disk, supporting structures such as sequential files, B-Trees, and LSM Trees.
- Transaction Manager: Enforces ACID properties, ensuring that transactions are atomic, consistent, isolated, and durable.
- Lock Manager: Coordinates concurrent access to data. In many modern systems, this is implemented using MVCC (Multi-Version Concurrency Control) rather than traditional locking.
- Recovery Manager: Manages write-ahead logging (WAL) and ensures the database can recover to a consistent state after crashes or failures.
A page is the fundamental unit of storage in a database. Each page contains multiple rows, with the exact number depending on page size and row layout.
Conclusion
A DBMS is more than just a place to store data. From parsing SQL into a syntax tree, turning it into a logical plan, optimizing for best execution paths, managing memory, ensuring consistency and recovery from crashes, each layer has a critical role.
Undertanding this architecture will help you write better queries and give you insights into performance, debugging and designing systems at scale.
Top comments (0)