Build a Data Science Query Language in Python using Lark
What if you could write something like this:
```plaintext
DATA [1, 2, 3, 4, 5]
SUM
MEAN
STD
```
…and have it behave like a mini data science engine?
In this tutorial, we’ll build a **Domain-Specific Language (DSL)** for data analysis using:
- Python
- Lark (parser library)
- NumPy
---
# What Are We Building?
We are creating a **custom query language** that:
- Accepts a dataset
- Runs statistical commands
- Prints results
---
# Step 1: Install Dependencies
```bash
pip install lark numpy
```
# Step 2: Define the Grammar
The grammar defines how our language looks.
```python
from lark import Lark, Transformer
import numpy as np

grammar = """
start: data command+

data: "DATA" list

command: "SUM" -> sum
       | "MEAN" -> mean
       | "STD" -> std
       | "MAX" -> max
       | "MIN" -> min

list: "[" NUMBER ("," NUMBER)* "]"

%import common.NUMBER
%import common.WS
%ignore WS
"""
```
**Explanation**

`start: data command+`

- A program must start with `DATA`
- Followed by one or more commands

`data: "DATA" list`

- Defines the dataset input
- Example:

```plaintext
DATA [1, 2, 3]
```
**Commands**

```plaintext
SUM → sum
MEAN → mean
STD → std
MAX → max
MIN → min
```

- These aliases map keywords to function names
- `-> sum` means "call `sum()` in the Transformer"
**List Rule**

```plaintext
list: "[" NUMBER ("," NUMBER)* "]"
```

- Accepts `[1]`, `[1, 2, 3]`, and so on
- `("," NUMBER)*` means the comma-number pair can repeat
**Ignore Spaces**

```plaintext
%ignore WS
```

- Allows flexible formatting
# ⚙️ Step 3: Build the Interpreter
Now we convert parsed text into execution.
```python
class DLangInterpreter(Transformer):
    def data(self, items):
        self.data = np.array([float(x) for x in items[0]])
        return self.data
```
**Explanation**

- `items[0]` → the list of numbers
- Converted to a NumPy array
- Stored in `self.data` for reuse by later commands
# Step 4: Add Operations
**SUM**

```python
def sum(self, _):
    print(np.sum(self.data))
```

**MEAN**

```python
def mean(self, _):
    print(np.mean(self.data))
```

**STD**

```python
def std(self, _):
    print(np.std(self.data))
```

**MAX**

```python
def max(self, _):
    print(np.max(self.data))
```

**MIN**

```python
def min(self, _):
    print(np.min(self.data))
```
**Explanation**

- Each method name matches a grammar alias
- `_` holds the (unused) child nodes
- NumPy performs the computation
- The result is printed immediately
# Step 5: Parse List

```python
def list(self, items):
    return items
```
**Explanation**

- Returns the list of number tokens
- The result is passed to the `data()` method
# Step 6: Create the Parser

```python
parser = Lark(grammar, parser="lalr", transformer=DLangInterpreter())
```
**Explanation**

- `parser="lalr"` → a fast parsing algorithm
- `transformer=...` → runs the interpreter methods automatically during parsing
# Step 7: Read Input File

```python
with open("example.dl") as f:
    code = f.read()

parser.parse(code)
```
**Example `example.dl`**

```plaintext
DATA [10, 20, 30, 40]
SUM
MEAN
MAX
```
**✅ Output**

```plaintext
100.0
25.0
40.0
```

(Every value is converted with `float()`, so NumPy returns floats for all operations.)
**How It Works (Flow)**

```plaintext
Text Input
    ↓
Parser (Lark)
    ↓
Grammar Rules Match
    ↓
Transformer Methods Trigger
    ↓
NumPy Executes
    ↓
Output Printed
```
# ✨ Why This Is Powerful

- You built a mini programming language
- Clean separation of syntax (the grammar) and execution (the Transformer)
- Easily extensible
# Next Features You Can Add

1. Filtering

```plaintext
FILTER > 10
```

2. Sorting

```plaintext
SORT ASC
```

3. CSV Support

```plaintext
DATA file.csv
```

4. Chaining

```plaintext
DATA [1,2,3,4]
FILTER > 2
MEAN
```
# Final Thought

This is how real systems like:

- SQL
- the Pandas query engine
- Spark

…start at a basic level. You just built the foundation of a data query engine.
If You Liked This
Drop a like ❤️
Follow for more AI + Systems content
And try extending this DSL yourself!