<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yash Jadhav</title>
    <description>The latest articles on DEV Community by Yash Jadhav (@yash_jadhav_cf439de0c2991).</description>
    <link>https://dev.to/yash_jadhav_cf439de0c2991</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3703080%2Fc51c4b2b-1797-4a18-94b2-eb46dbb43d90.png</url>
      <title>DEV Community: Yash Jadhav</title>
      <link>https://dev.to/yash_jadhav_cf439de0c2991</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yash_jadhav_cf439de0c2991"/>
    <language>en</language>
    <item>
      <title>How I Built a SQL Server From Scratch</title>
      <dc:creator>Yash Jadhav</dc:creator>
      <pubDate>Sat, 16 May 2026 19:20:06 +0000</pubDate>
      <link>https://dev.to/yash_jadhav_cf439de0c2991/how-i-built-a-sql-server-from-scratch-2fm</link>
      <guid>https://dev.to/yash_jadhav_cf439de0c2991/how-i-built-a-sql-server-from-scratch-2fm</guid>
      <description>&lt;p&gt;So I wanted to learn how to use SQL databases and what better way to than to build one and learn how things works, well couldn't cover as much as planned still touched core  components and basic queries. Here is how I started.&lt;/p&gt;

&lt;h6&gt;
  
  
  Resources that I referred to
&lt;/h6&gt;

&lt;ul&gt;
&lt;li&gt;Architecture of a Database System (primary reference)&lt;/li&gt;
&lt;li&gt;Database Management Systems by Raghu Ramakrishnan (for the query planner &amp;amp;
optimization)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Source
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/lightsigma96/db_scratch" rel="noopener noreferrer"&gt;https://github.com/lightsigma96/db_scratch&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/lightsigma96" rel="noopener noreferrer"&gt;
        lightsigma96
      &lt;/a&gt; / &lt;a href="https://github.com/lightsigma96/db_scratch" rel="noopener noreferrer"&gt;
        db_scratch
      &lt;/a&gt;
    &lt;/h2&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;p&gt;This is a simple single-threaded database built for learning purposes.&lt;/p&gt;
&lt;p&gt;It supports basic operations such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Creating schemas&lt;/li&gt;
&lt;li&gt;Creating tables&lt;/li&gt;
&lt;li&gt;Inserting rows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Supported data types:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;INT&lt;/li&gt;
&lt;li&gt;STRING&lt;/li&gt;
&lt;li&gt;FLOAT&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Build and Run (Local)&lt;/h2&gt;

&lt;/div&gt;
&lt;p&gt;Build the project using CMake:&lt;/p&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;cmake -S . -B build
cmake --build build
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Run the database:&lt;/p&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;./build/db_scratch
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Using Docker&lt;/h2&gt;

&lt;/div&gt;
&lt;p&gt;A Dockerfile is provided to build and run the project inside a container.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Build the image&lt;/h3&gt;

&lt;/div&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;docker build -t db_scratch .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Run the container (interactive)&lt;/h3&gt;

&lt;/div&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;docker run -it db_scratch
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Inside the container:&lt;/p&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;./build/db_scratch
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Notes&lt;/h2&gt;

&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The Docker image includes build tools and is intended for both development and debugging.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you modify the source code, rebuild the image:&lt;/p&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;docker build -t db_scratch .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;CMake&lt;/h2&gt;

&lt;/div&gt;
&lt;p&gt;Uses a basic CMake setup for building the project.&lt;/p&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/lightsigma96/db_scratch" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;h3&gt;
  
  
  Storage Manager
&lt;/h3&gt;

&lt;p&gt;This is the core of a disk-based database; its job is to interact with the disk,&lt;br&gt;
handle the various DB files, and manage pages.&lt;/p&gt;

&lt;h4&gt;
  
  
  Buffer Pool
&lt;/h4&gt;

&lt;p&gt;The buffer pool caches pages from heap files (which store the rows) in memory. Pages are fixed-size fragments of a heap file (4 KB in my case).&lt;/p&gt;

&lt;p&gt;Reading from or writing to a heap file on disk can be slow, so pages from the heap&lt;br&gt;
file are loaded into memory (RAM) through the buffer pool, which is much faster.&lt;/p&gt;

&lt;p&gt;Now obviously we cannot load the entire file into memory, as that could lead to an out-of-memory error. Hence the file is loaded page by page into the buffer pool, which sits in memory.&lt;/p&gt;

&lt;p&gt;The buffer pool has a page replacement policy that evicts pages when the pool is full.&lt;/p&gt;

&lt;p&gt;In my case I implemented a simple FIFO replacement policy that tracks pages in order of insertion into the buffer pool. On eviction it checks whether a page is &lt;em&gt;pinned&lt;/em&gt; (pages that are being used by some component are pinned; the pin count shows how many components are using that page). If pinned, the page is sent to the back of the queue; if not, it is flushed (written to disk) and evicted.&lt;/p&gt;

&lt;p&gt;The buffer pool also contains a disk operator responsible for writing to files and related tasks (implementations can vary).&lt;/p&gt;
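&lt;p&gt;The FIFO policy above can be sketched like this (Python for illustration; class and method names are hypothetical, the actual db_scratch implementation is in C++ and differs in detail):&lt;/p&gt;

```python
from collections import deque

# Minimal FIFO page-replacement sketch for a pool that holds at most
# `capacity` pages. Pinned pages are rotated to the back of the queue
# instead of being evicted.
class BufferPool:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = {}        # page_id mapped to page bytes
        self.pin_count = {}    # page_id mapped to number of users
        self.fifo = deque()    # insertion order, used for eviction

    def fetch(self, page_id, read_from_disk):
        if page_id in self.pages:
            self.pin_count[page_id] += 1
            return self.pages[page_id]
        if len(self.pages) >= self.capacity:
            self._evict()
        self.pages[page_id] = read_from_disk(page_id)
        self.pin_count[page_id] = 1
        self.fifo.append(page_id)
        return self.pages[page_id]

    def unpin(self, page_id):
        self.pin_count[page_id] -= 1

    def _evict(self):
        # Skip pinned pages; evict the first unpinned one in FIFO order.
        for _ in range(len(self.fifo)):
            victim = self.fifo.popleft()
            if self.pin_count[victim] > 0:
                self.fifo.append(victim)   # still in use, retry later
            else:
                # a real pool would flush the dirty page to disk here
                del self.pages[victim]
                del self.pin_count[victim]
                return
        raise RuntimeError("all pages pinned")
```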

&lt;h3&gt;
  
  
  Access Methods
&lt;/h3&gt;

&lt;p&gt;Access methods provide the data structures and algorithms needed to organize and access data.&lt;/p&gt;

&lt;p&gt;In a heap file rows are stored one after another, and a &lt;em&gt;heap scan&lt;/em&gt; from the access methods scans each of them to find the matching rows for a given SARG.&lt;/p&gt;

&lt;p&gt;SARGs (Search Arguments) describe the condition for finding a row; each one consists of a column, an operator, and a constant.&lt;/p&gt;
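&lt;p&gt;A SARG and a heap scan can be sketched like this (Python for illustration; the real db_scratch structures differ):&lt;/p&gt;

```python
import operator

# A SARG is modeled as a (column, operator, constant) triple; these
# names are illustrative, not the actual db_scratch types.
OPS = {"=": operator.eq, "!=": operator.ne, ">": operator.gt}

def matches(row, sarg):
    column, op, constant = sarg
    return OPS[op](row[column], constant)

def heap_scan(rows, sarg):
    # Visits every row in the heap file: O(n) in the number of rows.
    return [row for row in rows if matches(row, sarg)]
```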

&lt;p&gt;A heap scan may require scanning all rows across multiple pages, resulting in O(n) search complexity; hence an &lt;em&gt;index scan&lt;/em&gt; can be implemented using B+ Trees.&lt;/p&gt;

&lt;p&gt;A B+ Tree consists of internal nodes and leaf nodes. Leaf nodes store a key and an RID (Row Identifier) that points to the exact row in the heap file, while internal nodes guide the search to the right leaf.&lt;/p&gt;

&lt;p&gt;A B+ Tree reduces the index traversal cost from scanning many pages to traversing a small number of pages proportional to the tree height, after which the corresponding heap pages can be accessed using the RID.&lt;/p&gt;

&lt;p&gt;The downside of an index scan is that index keys are built on specific columns. If the query predicates do not use indexed columns, the planner may fall back to a heap scan.&lt;/p&gt;
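&lt;p&gt;As a stand-in for the B+ Tree leaf level, a sorted list of (key, RID) pairs shows the shape of an index lookup (Python for illustration; names are hypothetical and the real structure is a tree of pages, not a flat list):&lt;/p&gt;

```python
import bisect

# The sorted (key, rid) list plays the role of the B+ Tree leaf level:
# bisect gives the logarithmic descent, and the RID then locates the
# exact row in the heap file.
class Index:
    def __init__(self):
        self.entries = []              # sorted (key, rid) pairs

    def insert(self, key, rid):
        bisect.insort(self.entries, (key, rid))

    def lookup(self, key):
        # Find the first entry for `key`, then collect all its RIDs.
        i = bisect.bisect_left(self.entries, (key, 0))
        rids = []
        while len(self.entries) > i and self.entries[i][0] == key:
            rids.append(self.entries[i][1])
            i += 1
        return rids
```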

&lt;h3&gt;
  
  
  Query Planner and Optimizer
&lt;/h3&gt;

&lt;p&gt;After the buffer pool and access methods, rows can be stored on disk and retrieved. Next comes the query planner &amp;amp; optimizer, which makes a plan for executing the query and then optimizes that plan.&lt;/p&gt;

&lt;h4&gt;
  
  
  Lexer &amp;amp; Parser
&lt;/h4&gt;

&lt;p&gt;Queries are tokenized into keywords, identifiers, literals, and special symbols, which are then passed to the parser.&lt;/p&gt;

&lt;p&gt;The parser is responsible for checking the grammar and constructing an AST from the tokens passed to it.&lt;/p&gt;

&lt;p&gt;The parser's implementation can vary; in my case the lexer provides a tokenizer that is used to iterate over tokens forward/backward and to peek at tokens.&lt;/p&gt;

&lt;p&gt;The parser uses these tokens to check the grammar of the different queries.&lt;/p&gt;

&lt;p&gt;Once this is done, the parser produces an AST specific to each query.&lt;/p&gt;
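&lt;p&gt;A toy tokenizer illustrating these token categories (Python and regex-based for brevity; the actual db_scratch lexer is hand-written C++):&lt;/p&gt;

```python
import re

# Token categories tried in order; KEYWORD must come before IDENT so
# that reserved words are not swallowed as identifiers.
TOKEN_SPEC = [
    ("KEYWORD", re.compile(r"(SELECT|FROM|WHERE|INSERT|INTO|VALUES)\b", re.I)),
    ("NUMBER",  re.compile(r"\d+(\.\d+)?")),
    ("IDENT",   re.compile(r"[A-Za-z_]\w*")),
    ("SYMBOL",  re.compile(r"[(),;*=]")),
]

def tokenize(query):
    tokens, i = [], 0
    while len(query) > i:
        if query[i].isspace():          # skip whitespace between tokens
            i += 1
            continue
        for name, rx in TOKEN_SPEC:
            m = rx.match(query, i)
            if m:
                tokens.append((name, m.group()))
                i = m.end()
                break
        else:
            raise ValueError(f"unexpected character {query[i]!r}")
    return tokens
```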

&lt;h4&gt;
  
  
  Planner and Optimizer
&lt;/h4&gt;

&lt;p&gt;The AST produced by the parser is consumed by the planner, which has a plan defined for each query.&lt;/p&gt;

&lt;p&gt;For example, the plan for a select query involves three steps (from top to bottom):&lt;br&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Projection: selects specific columns from the tuples returned by the child operator.&lt;br&gt;&lt;br&gt;
Filter: based on SARG matching, this step filters out rows returned by the scan.&lt;br&gt;&lt;br&gt;
Scan: calls the heap scan and passes its output up the chain.&lt;br&gt;&lt;br&gt;
(Rows flow upward from Scan → Filter → Projection)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Each operator pulls one tuple at a time from its child operator.&lt;/p&gt;
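&lt;p&gt;The pull-based chain above can be sketched with generators (Python for illustration; the operator names follow the plan steps, not the actual db_scratch code):&lt;/p&gt;

```python
# Each generator stands in for an operator's next() method: every value
# pulled from the top operator pulls one tuple at a time from its child.
def scan(rows):
    for row in rows:
        yield row

def filter_op(child, predicate):
    for row in child:
        if predicate(row):
            yield row

def projection(child, columns):
    for row in child:
        yield {c: row[c] for c in columns}

def run_select(rows, predicate, columns):
    # Build the plan bottom-up: Scan, then Filter, then Projection.
    return list(projection(filter_op(scan(rows), predicate), columns))
```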

&lt;p&gt;Optimization depends on the type of query; for example, if an index exists on the predicate columns, an index scan can be used instead of a heap scan.&lt;/p&gt;

&lt;h3&gt;
  
  
  Catalog Manager
&lt;/h3&gt;

&lt;p&gt;Catalog Manager is logically separate from query execution components and is&lt;br&gt;
responsible for schema metadata management.&lt;/p&gt;

&lt;p&gt;It is used to enforce the schema and to handle writing to and reading from a schema file.&lt;/p&gt;

&lt;p&gt;To write the schema to a file, a serializer has to be made that writes the schema in this format:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schema name size&lt;/li&gt;
&lt;li&gt;Schema name
&lt;ul&gt;
&lt;li&gt;Heap page offset&lt;/li&gt;
&lt;li&gt;Heap page id&lt;/li&gt;
&lt;li&gt;Table name size&lt;/li&gt;
&lt;li&gt;Table name
&lt;ul&gt;
&lt;li&gt;Column size&lt;/li&gt;
&lt;li&gt;Column name&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The only caveat here is that before every name I had to store its size, because I used C++ strings, which store a pointer to the actual string allocated in memory; if I wrote one to the file directly, I would be writing a garbage memory address.&lt;/p&gt;

&lt;p&gt;The deserializer reads this back from the file.&lt;/p&gt;
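&lt;p&gt;The length-prefixed layout can be sketched like this (Python for illustration; the actual db_scratch serializer is C++ and writes raw bytes directly, so the exact byte layout here is an assumption):&lt;/p&gt;

```python
import io
import struct

# Each name is written as a fixed-width length followed by the bytes,
# so the deserializer knows exactly how many bytes to read back.
def write_string(buf, s):
    data = s.encode("utf-8")
    buf.write(struct.pack(">I", len(data)))    # name size first...
    buf.write(data)                            # ...then the name bytes

def read_string(buf):
    (size,) = struct.unpack(">I", buf.read(4))
    return buf.read(size).decode("utf-8")
```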

&lt;h3&gt;
  
  
  Main Server
&lt;/h3&gt;

&lt;p&gt;This initializes all the necessary objects: the buffer manager, access methods, query planner, and catalog manager.&lt;/p&gt;

&lt;p&gt;I have two pipelines, a DB pipeline and a TUI pipeline. Initially I wanted a client-server architecture with a TCP server calling into the DB, but I went with direct function calls instead.&lt;/p&gt;

&lt;h4&gt;
  
  
  DB Pipeline
&lt;/h4&gt;

&lt;p&gt;This function is responsible for orchestrating the database pipeline (a TCP server could also be added, responsible for preprocessing CLI input and calling the DB).&lt;/p&gt;

&lt;p&gt;First the input query is tokenized and checked for correctness by the lexer and parser.&lt;/p&gt;

&lt;p&gt;If the query is valid, an AST is generated, which then goes through its respective processing (for example, a SELECT query is passed through the select plan, while CREATE TABLE is handled by the catalog manager).&lt;/p&gt;

&lt;p&gt;Finally, this function returns a response; the implementation of this pipeline can vary.&lt;/p&gt;

&lt;p&gt;With everything above, the database can support the basics: creating schemas, tables, and columns, and executing queries like SELECT, INSERT, and DELETE (the same architecture can be extended to support JOINs and stored procedures).&lt;/p&gt;

&lt;p&gt;Some other core components include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;WAL (Write-Ahead Log): ensures durability and crash recovery by logging changes before modified pages are written to disk (the durability part of ACID).&lt;/li&gt;
&lt;li&gt;Lock Manager: used to acquire locks for multi-threaded access.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>database</category>
    </item>
  </channel>
</rss>
