DEV Community

Haji Rufai
Haji Rufai

Posted on • Originally published at github.com

Building a SQL Query Engine from Scratch in Python: Lexer Parser Optimizer Executor

Have you ever wondered what happens when you type SELECT * FROM employees WHERE salary > 80000? How does a database turn that string of text into actual results?

I built QueryForge — a complete SQL query engine from scratch in Python — to find out. No SQLite, no DuckDB, no database dependencies at all. Just pure Python implementing every stage of the SQL pipeline.

GitHub: github.com/hajirufai/queryforge

The Architecture

Every SQL query passes through five stages:

SQL String → Lexer → Parser → Planner → Optimizer → Executor → Results
Enter fullscreen mode Exit fullscreen mode
┌─────────┐    ┌──────────┐    ┌──────────┐    ┌───────────┐    ┌──────────┐
│  Lexer  │───▶│  Parser  │───▶│ Planner  │───▶│ Optimizer │───▶│ Executor │
│ (tokens)│    │  (AST)   │    │  (plan)  │    │(optimized)│    │ (results)│
└─────────┘    └──────────┘    └──────────┘    └───────────┘    └──────────┘
Enter fullscreen mode Exit fullscreen mode

Let me walk through each stage with real code.


Stage 1: The Lexer — Turning Text into Tokens

The lexer is the simplest piece conceptually, but getting it right matters. It scans the SQL string character-by-character and produces a stream of tokens.

from dataclasses import dataclass
from enum import Enum, auto

class TokenType(Enum):
    SELECT = auto()
    FROM = auto()
    WHERE = auto()
    INTEGER = auto()
    FLOAT = auto()
    STRING = auto()
    IDENTIFIER = auto()
    STAR = auto()
    EQUALS = auto()
    # ... 60+ token types total
Enter fullscreen mode Exit fullscreen mode

The lexer handles some tricky cases:

  • Keywords are case-insensitiveSELECT, select, and SeLeCt are identical
  • String escaping'it''s' becomes it's
  • Comments — both -- line and /* block */ styles
  • Quoted identifiers"column with spaces" is a valid column name
class Lexer:
    def __init__(self, source: str):
        self.source = source
        self.pos = 0
        self.tokens: list[Token] = []

    def tokenize(self) -> list[Token]:
        while self.pos < len(self.source):
            ch = self.source[self.pos]
            if ch.isspace():
                self.pos += 1
            elif ch.isdigit():
                self._read_number()
            elif ch == "'":
                self._read_string()
            elif ch.isalpha() or ch == '_':
                self._read_identifier_or_keyword()
            # ... operators, punctuation, etc.
        self.tokens.append(Token(TokenType.EOF, None))
        return self.tokens
Enter fullscreen mode Exit fullscreen mode

For SELECT name, salary FROM employees WHERE salary > 80000, the lexer produces:

[SELECT, IDENTIFIER("name"), COMMA, IDENTIFIER("salary"), FROM,
 IDENTIFIER("employees"), WHERE, IDENTIFIER("salary"), GREATER,
 INTEGER(80000), EOF]
Enter fullscreen mode Exit fullscreen mode

Stage 2: The Parser — Building an Abstract Syntax Tree

This is where it gets interesting. I used a recursive-descent parser — each SQL clause is handled by a dedicated method that can call other parsing methods.

The AST node for a SELECT statement captures the full query structure:

@dataclass
class SelectStatement:
    columns: list[AliasedExpr]     # SELECT ...
    from_table: TableRef | None    # FROM ...
    joins: list[JoinClause]        # JOIN ...
    where: Expression | None       # WHERE ...
    group_by: list[Expression]     # GROUP BY ...
    having: Expression | None      # HAVING ...
    order_by: list[OrderByItem]    # ORDER BY ...
    limit: int | None              # LIMIT ...
    offset: int | None             # OFFSET ...
    distinct: bool = False         # DISTINCT
Enter fullscreen mode Exit fullscreen mode

Expression parsing uses precedence climbing. The key insight: OR has lower precedence than AND, which has lower precedence than comparisons:

def _parse_expression(self) -> Expression:
    return self._parse_or()

def _parse_or(self):
    left = self._parse_and()
    while self._match(TokenType.OR):
        right = self._parse_and()
        left = BinaryOp(left, "OR", right)
    return left

def _parse_and(self):
    left = self._parse_not()
    while self._match(TokenType.AND):
        right = self._parse_not()
        left = BinaryOp(left, "AND", right)
    return left
Enter fullscreen mode Exit fullscreen mode

The parser handles the full SQL grammar including subqueries, CASE expressions, and DML/DDL:

-- All of these parse correctly:
SELECT e.name, d.name AS dept
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE e.salary > (SELECT AVG(salary) FROM employees)
  AND e.dept_id IN (SELECT id FROM departments WHERE budget > 300000)
ORDER BY e.salary DESC
LIMIT 10;
Enter fullscreen mode Exit fullscreen mode

Stage 3: The Planner — From AST to Logical Plan

The planner converts the AST into a tree of plan nodes. This is where relational algebra meets code:

@dataclass
class ScanNode:
    table_name: str
    alias: str | None = None

@dataclass
class FilterNode:
    child: PlanNode
    predicate: Expression

@dataclass
class JoinNode:
    left: PlanNode
    right: PlanNode
    join_type: str          # INNER, LEFT, RIGHT, CROSS
    condition: Expression | None

@dataclass
class AggregateNode:
    child: PlanNode
    group_by: list[Expression]
    aggregates: list[AliasedExpr]
Enter fullscreen mode Exit fullscreen mode

For a query like:

SELECT dept_id, COUNT(*) AS cnt
FROM employees
WHERE salary > 50000
GROUP BY dept_id
ORDER BY cnt DESC
LIMIT 5
Enter fullscreen mode Exit fullscreen mode

The planner builds:

Limit(5)
  └── Sort(cnt DESC)
       └── Aggregate(GROUP BY dept_id)
            └── Filter(salary > 50000)
                 └── Scan(employees)
Enter fullscreen mode Exit fullscreen mode

Stage 4: The Optimizer — Predicate Pushdown

Even a simple optimizer can make a huge difference. QueryForge implements predicate pushdown — the most impactful optimization in query engines.

The idea: push filter conditions as close to the data source as possible. If you're joining two tables but only need rows where salary > 80000, filter before the join instead of after.

class Optimizer:
    def optimize(self, plan: PlanNode) -> PlanNode:
        plan = self._push_predicates(plan)
        plan = self._eliminate_redundant_projects(plan)
        return plan

    def _push_predicates(self, node: PlanNode) -> PlanNode:
        if isinstance(node, FilterNode):
            if isinstance(node.child, JoinNode):
                join = node.child
                # If predicate only references left table, push it down
                if self._references_only(node.predicate, join.left):
                    join.left = FilterNode(join.left, node.predicate)
                    return join
        return node
Enter fullscreen mode Exit fullscreen mode

Before optimization:

Filter(e.salary > 80000)
  └── Join(e.dept_id = d.id)
       ├── Scan(employees AS e)
       └── Scan(departments AS d)
Enter fullscreen mode Exit fullscreen mode

After optimization:

Join(e.dept_id = d.id)
  ├── Filter(e.salary > 80000)
  │    └── Scan(employees AS e)
  └── Scan(departments AS d)
Enter fullscreen mode Exit fullscreen mode

Stage 5: The Executor — Running the Plan

The executor walks the plan tree recursively. Each node type has its own execution logic:

class Executor:
    def execute(self, node: PlanNode) -> tuple[list[str], list[Row]]:
        if isinstance(node, ScanNode):
            return self._exec_scan(node)
        elif isinstance(node, FilterNode):
            return self._exec_filter(node)
        elif isinstance(node, JoinNode):
            return self._exec_join(node)
        elif isinstance(node, AggregateNode):
            return self._exec_aggregate(node)
        # ... etc.
Enter fullscreen mode Exit fullscreen mode

JOIN execution uses nested-loop joins (simple but correct):

def _exec_join(self, node: JoinNode):
    left_cols, left_rows = self.execute(node.left)
    right_cols, right_rows = self.execute(node.right)

    result = []
    for l_row in left_rows:
        matched = False
        for r_row in right_rows:
            combined = {**l_row, **r_row}
            if node.condition is None or self._eval_expr(node.condition, combined):
                result.append(combined)
                matched = True

        # LEFT JOIN: include unmatched left rows with NULLs
        if not matched and node.join_type == "LEFT":
            null_right = {c: None for c in right_cols}
            result.append({**l_row, **null_right})

    return left_cols + right_cols, result
Enter fullscreen mode Exit fullscreen mode

The executor also includes 20+ scalar functions (UPPER, LOWER, COALESCE, ABS, ROUND, SUBSTR, etc.), LIKE pattern matching with wildcards, and full NULL handling.


What You Can Do with It

QueryForge supports real-world SQL:

from queryforge import Engine

engine = Engine()
engine.load_csv("employees", "employees.csv")
engine.load_csv("departments", "departments.csv")

# Complex analytical query
result = engine.execute("""
    SELECT d.name AS department,
           COUNT(*) AS headcount,
           AVG(e.salary) AS avg_salary,
           CASE WHEN AVG(e.salary) > 85000 THEN 'High'
                WHEN AVG(e.salary) > 70000 THEN 'Mid'
                ELSE 'Low' END AS pay_band
    FROM employees e
    JOIN departments d ON e.department_id = d.id
    WHERE e.is_active = TRUE
    GROUP BY d.name
    HAVING COUNT(*) >= 2
    ORDER BY avg_salary DESC
""")

print(result.to_table())
Enter fullscreen mode Exit fullscreen mode

It also includes:

  • Rich CLI with interactive REPL
  • FastAPI REST API for querying over HTTP
  • Docker support for deployment
  • 157 passing tests across all components

What I Learned

Building a SQL engine taught me more about databases than years of using them:

  1. Parsing SQL is harder than it looks — the grammar is context-sensitive (is IN a keyword or identifier?) and full of edge cases
  2. The optimizer matters enormously — even simple predicate pushdown can be the difference between scanning 1,000 rows vs. 1,000,000
  3. NULL handling is tricky — three-valued logic means NULL = NULL is NULL, not TRUE
  4. Aggregate vs. scalar context — knowing when you're inside a GROUP BY vs. a regular SELECT changes everything

Try It

git clone https://github.com/hajirufai/queryforge.git
cd queryforge
pip install -e ".[dev]"

# Interactive REPL
queryforge repl data/employees.csv -t employees

# One-shot query
queryforge query "SELECT * FROM employees ORDER BY salary DESC LIMIT 5" data/employees.csv -t employees
Enter fullscreen mode Exit fullscreen mode

The full source is ~4,400 lines of Python with 157 tests. Every stage is documented and tested independently.

GitHub: github.com/hajirufai/queryforge


This is part of my "Building from Scratch" series where I implement complex systems from first principles. Previously: VectorLite (vector search engine) and TaskFlow (DAG workflow orchestration).

Top comments (0)