Have you ever wondered what happens when you type SELECT * FROM employees WHERE salary > 80000? How does a database turn that string of text into actual results?
I built QueryForge — a complete SQL query engine from scratch in Python — to find out. No SQLite, no DuckDB, no database dependencies at all. Just pure Python implementing every stage of the SQL pipeline.
GitHub: github.com/hajirufai/queryforge ⭐
The Architecture
Every SQL query passes through five stages:
SQL String → Lexer → Parser → Planner → Optimizer → Executor → Results
┌─────────┐ ┌──────────┐ ┌──────────┐ ┌───────────┐ ┌──────────┐
│ Lexer │───▶│ Parser │───▶│ Planner │───▶│ Optimizer │───▶│ Executor │
│ (tokens)│ │ (AST) │ │ (plan) │ │(optimized)│ │ (results)│
└─────────┘ └──────────┘ └──────────┘ └───────────┘ └──────────┘
Let me walk through each stage with real code.
Stage 1: The Lexer — Turning Text into Tokens
The lexer is the simplest piece conceptually, but getting it right matters. It scans the SQL string character-by-character and produces a stream of tokens.
from dataclasses import dataclass
from enum import Enum, auto
class TokenType(Enum):
SELECT = auto()
FROM = auto()
WHERE = auto()
INTEGER = auto()
FLOAT = auto()
STRING = auto()
IDENTIFIER = auto()
STAR = auto()
EQUALS = auto()
# ... 60+ token types total
The lexer handles some tricky cases:
-
Keywords are case-insensitive —
SELECT,select, andSeLeCtare identical -
String escaping —
'it''s'becomesit's -
Comments — both
-- lineand/* block */styles -
Quoted identifiers —
"column with spaces"is a valid column name
class Lexer:
def __init__(self, source: str):
self.source = source
self.pos = 0
self.tokens: list[Token] = []
def tokenize(self) -> list[Token]:
while self.pos < len(self.source):
ch = self.source[self.pos]
if ch.isspace():
self.pos += 1
elif ch.isdigit():
self._read_number()
elif ch == "'":
self._read_string()
elif ch.isalpha() or ch == '_':
self._read_identifier_or_keyword()
# ... operators, punctuation, etc.
self.tokens.append(Token(TokenType.EOF, None))
return self.tokens
For SELECT name, salary FROM employees WHERE salary > 80000, the lexer produces:
[SELECT, IDENTIFIER("name"), COMMA, IDENTIFIER("salary"), FROM,
IDENTIFIER("employees"), WHERE, IDENTIFIER("salary"), GREATER,
INTEGER(80000), EOF]
Stage 2: The Parser — Building an Abstract Syntax Tree
This is where it gets interesting. I used a recursive-descent parser — each SQL clause is handled by a dedicated method that can call other parsing methods.
The AST node for a SELECT statement captures the full query structure:
@dataclass
class SelectStatement:
columns: list[AliasedExpr] # SELECT ...
from_table: TableRef | None # FROM ...
joins: list[JoinClause] # JOIN ...
where: Expression | None # WHERE ...
group_by: list[Expression] # GROUP BY ...
having: Expression | None # HAVING ...
order_by: list[OrderByItem] # ORDER BY ...
limit: int | None # LIMIT ...
offset: int | None # OFFSET ...
distinct: bool = False # DISTINCT
Expression parsing uses precedence climbing. The key insight: OR has lower precedence than AND, which has lower precedence than comparisons:
def _parse_expression(self) -> Expression:
return self._parse_or()
def _parse_or(self):
left = self._parse_and()
while self._match(TokenType.OR):
right = self._parse_and()
left = BinaryOp(left, "OR", right)
return left
def _parse_and(self):
left = self._parse_not()
while self._match(TokenType.AND):
right = self._parse_not()
left = BinaryOp(left, "AND", right)
return left
The parser handles the full SQL grammar including subqueries, CASE expressions, and DML/DDL:
-- All of these parse correctly:
SELECT e.name, d.name AS dept
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE e.salary > (SELECT AVG(salary) FROM employees)
AND e.dept_id IN (SELECT id FROM departments WHERE budget > 300000)
ORDER BY e.salary DESC
LIMIT 10;
Stage 3: The Planner — From AST to Logical Plan
The planner converts the AST into a tree of plan nodes. This is where relational algebra meets code:
@dataclass
class ScanNode:
table_name: str
alias: str | None = None
@dataclass
class FilterNode:
child: PlanNode
predicate: Expression
@dataclass
class JoinNode:
left: PlanNode
right: PlanNode
join_type: str # INNER, LEFT, RIGHT, CROSS
condition: Expression | None
@dataclass
class AggregateNode:
child: PlanNode
group_by: list[Expression]
aggregates: list[AliasedExpr]
For a query like:
SELECT dept_id, COUNT(*) AS cnt
FROM employees
WHERE salary > 50000
GROUP BY dept_id
ORDER BY cnt DESC
LIMIT 5
The planner builds:
Limit(5)
└── Sort(cnt DESC)
└── Aggregate(GROUP BY dept_id)
└── Filter(salary > 50000)
└── Scan(employees)
Stage 4: The Optimizer — Predicate Pushdown
Even a simple optimizer can make a huge difference. QueryForge implements predicate pushdown — the most impactful optimization in query engines.
The idea: push filter conditions as close to the data source as possible. If you're joining two tables but only need rows where salary > 80000, filter before the join instead of after.
class Optimizer:
def optimize(self, plan: PlanNode) -> PlanNode:
plan = self._push_predicates(plan)
plan = self._eliminate_redundant_projects(plan)
return plan
def _push_predicates(self, node: PlanNode) -> PlanNode:
if isinstance(node, FilterNode):
if isinstance(node.child, JoinNode):
join = node.child
# If predicate only references left table, push it down
if self._references_only(node.predicate, join.left):
join.left = FilterNode(join.left, node.predicate)
return join
return node
Before optimization:
Filter(e.salary > 80000)
└── Join(e.dept_id = d.id)
├── Scan(employees AS e)
└── Scan(departments AS d)
After optimization:
Join(e.dept_id = d.id)
├── Filter(e.salary > 80000)
│ └── Scan(employees AS e)
└── Scan(departments AS d)
Stage 5: The Executor — Running the Plan
The executor walks the plan tree recursively. Each node type has its own execution logic:
class Executor:
def execute(self, node: PlanNode) -> tuple[list[str], list[Row]]:
if isinstance(node, ScanNode):
return self._exec_scan(node)
elif isinstance(node, FilterNode):
return self._exec_filter(node)
elif isinstance(node, JoinNode):
return self._exec_join(node)
elif isinstance(node, AggregateNode):
return self._exec_aggregate(node)
# ... etc.
JOIN execution uses nested-loop joins (simple but correct):
def _exec_join(self, node: JoinNode):
left_cols, left_rows = self.execute(node.left)
right_cols, right_rows = self.execute(node.right)
result = []
for l_row in left_rows:
matched = False
for r_row in right_rows:
combined = {**l_row, **r_row}
if node.condition is None or self._eval_expr(node.condition, combined):
result.append(combined)
matched = True
# LEFT JOIN: include unmatched left rows with NULLs
if not matched and node.join_type == "LEFT":
null_right = {c: None for c in right_cols}
result.append({**l_row, **null_right})
return left_cols + right_cols, result
The executor also includes 20+ scalar functions (UPPER, LOWER, COALESCE, ABS, ROUND, SUBSTR, etc.), LIKE pattern matching with wildcards, and full NULL handling.
What You Can Do with It
QueryForge supports real-world SQL:
from queryforge import Engine
engine = Engine()
engine.load_csv("employees", "employees.csv")
engine.load_csv("departments", "departments.csv")
# Complex analytical query
result = engine.execute("""
SELECT d.name AS department,
COUNT(*) AS headcount,
AVG(e.salary) AS avg_salary,
CASE WHEN AVG(e.salary) > 85000 THEN 'High'
WHEN AVG(e.salary) > 70000 THEN 'Mid'
ELSE 'Low' END AS pay_band
FROM employees e
JOIN departments d ON e.department_id = d.id
WHERE e.is_active = TRUE
GROUP BY d.name
HAVING COUNT(*) >= 2
ORDER BY avg_salary DESC
""")
print(result.to_table())
It also includes:
- Rich CLI with interactive REPL
- FastAPI REST API for querying over HTTP
- Docker support for deployment
- 157 passing tests across all components
What I Learned
Building a SQL engine taught me more about databases than years of using them:
-
Parsing SQL is harder than it looks — the grammar is context-sensitive (is
INa keyword or identifier?) and full of edge cases - The optimizer matters enormously — even simple predicate pushdown can be the difference between scanning 1,000 rows vs. 1,000,000
-
NULL handling is tricky — three-valued logic means
NULL = NULLisNULL, notTRUE - Aggregate vs. scalar context — knowing when you're inside a GROUP BY vs. a regular SELECT changes everything
Try It
git clone https://github.com/hajirufai/queryforge.git
cd queryforge
pip install -e ".[dev]"
# Interactive REPL
queryforge repl data/employees.csv -t employees
# One-shot query
queryforge query "SELECT * FROM employees ORDER BY salary DESC LIMIT 5" data/employees.csv -t employees
The full source is ~4,400 lines of Python with 157 tests. Every stage is documented and tested independently.
GitHub: github.com/hajirufai/queryforge
This is part of my "Building from Scratch" series where I implement complex systems from first principles. Previously: VectorLite (vector search engine) and TaskFlow (DAG workflow orchestration).
Top comments (0)