Prasoon Jadon

Build a Data Science Query Language in Python using Lark

What if you could write something like this:

```plaintext
DATA [1, 2, 3, 4, 5]
SUM
MEAN
STD
```

…and have it behave like a mini data science engine?

In this tutorial, we’ll build a **Domain-Specific Language (DSL)** for data analysis using:

- Python   
- Lark (parser library)   
- NumPy   

---

# What Are We Building?

We are creating a **custom query language** that:

- Accepts a dataset
- Runs statistical commands
- Prints results

---

# Step 1: Install Dependencies

```bash
pip install lark numpy
```

# Step 2: Define the Grammar

The grammar defines the syntax of our language: which programs are valid and how they break down into parts.


```python
from lark import Lark, Transformer
import numpy as np

grammar = """
start: data command+

data: "DATA" list

command: "SUM" -> sum
       | "MEAN" -> mean
       | "STD" -> std
       | "MAX" -> max
       | "MIN" -> min

list: "[" NUMBER ("," NUMBER)* "]"

%import common.NUMBER
%import common.WS
%ignore WS
"""
```

## Explanation

`start: data command+`

- A program must start with a `DATA` statement
- It is followed by one or more commands

`data: "DATA" list`

- Defines the dataset input
- Example:

```plaintext
DATA [1, 2, 3]
```

### Commands

```plaintext
SUM → sum
MEAN → mean
STD → std
MAX → max
MIN → min
```

- Each keyword maps to an alias, and the alias names the method to run
- `-> sum` renames the matched `command` node to `sum`, so the Transformer calls its `sum()` method for that node

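### List Rule

```plaintext
list: "[" NUMBER ("," NUMBER)* "]"
```

- Accepts:
    - `[1]`
    - `[1, 2, 3]`
- `("," NUMBER)*` matches zero or more further numbers, each preceded by a comma
- `NUMBER` is unsigned, so negative values are rejected; importing `common.SIGNED_NUMBER` instead would allow them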


### Ignore Spaces

```plaintext
%ignore WS
```

- Whitespace (spaces, tabs, newlines) between tokens is skipped, so commands can share one line or be spread across several

# ⚙️ Step 3: Build the Interpreter

Now we turn the parsed program into actions by attaching a Python method to each grammar rule.


```python
class DLangInterpreter(Transformer):

    def data(self, items):
        # items[0] is the value returned for the `list` rule: the NUMBER tokens
        self.data = np.array([float(x) for x in items[0]])
        return self.data
```

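## Explanation

- `items[0]` → the list of number tokens produced by the `list` rule
- Each token is converted to `float`, and the result is stored as a NumPy array
- The array is kept in `self.data` so later commands can reuse it
- Caveat: once set, the attribute `self.data` shadows the `data` method on the instance; a distinct name such as `self.values` is safer if the interpreter object is reused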

# Step 4: Add Operations

Each operation below is a method of `DLangInterpreter`, so indent it inside the class body.

## SUM

```python
def sum(self, _):
    print(np.sum(self.data))
```

## MEAN

```python
def mean(self, _):
    print(np.mean(self.data))
```

## STD

```python
def std(self, _):
    print(np.std(self.data))
```

## MAX

```python
def max(self, _):
    print(np.max(self.data))
```

## MIN

```python
def min(self, _):
    print(np.min(self.data))
```

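## Explanation

- Each method name matches an alias in the grammar (`sum`, `mean`, `std`, `max`, `min`)
- `_` is the list of child nodes, which is empty for these commands and therefore unused
- NumPy does the computation; note that `np.std` returns the population standard deviation by default (`ddof=0`)
- Each result is printed as soon as its command is parsed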
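One detail worth checking when relying on NumPy for `STD`: `np.std` divides by N by default, while the sample standard deviation divides by N - 1. The short sketch below compares the two on the dataset from the opening example; the variable names are illustrative.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # dataset from the opening example

# Default ddof=0: population standard deviation (divide by N).
population_std = np.std(values)

# ddof=1: sample standard deviation (divide by N - 1).
sample_std = np.std(values, ddof=1)

print(population_std)  # sqrt(10 / 5) = sqrt(2)
print(sample_std)      # sqrt(10 / 4) = sqrt(2.5)
```

If the DSL should report the sample statistic instead, passing `ddof=1` inside the `std` method would be one way to do it.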

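# Step 5: Parse List

```python
def list(self, items):
    return items
```

## Explanation

- This is also a method of `DLangInterpreter`
- `items` contains only the `NUMBER` tokens, because Lark filters out the brackets and commas
- The returned list is passed to `data()` as `items[0]`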

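# Step 6: Create the Parser

```python
parser = Lark(grammar, parser="lalr", transformer=DLangInterpreter())
```

## Explanation

- `parser="lalr"` → selects Lark's LALR(1) parser, which is faster than the default Earley parser
- `transformer=...` → the interpreter's methods run while the input is being parsed, so no separate `transform()` call is needed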

# Step 7: Read Input File

```python
# Read the DSL program from disk
with open("example.dl") as f:
    code = f.read()

# Parsing also executes the commands, because the transformer is embedded
parser.parse(code)
```

## Example `example.dl`

```plaintext
DATA [10, 20, 30, 40]
SUM
MEAN
MAX
```

## ✅ Output

```plaintext
100.0
25.0
40.0
```

All three values print as floats because `data()` converts every number with `float()`.

# How It Works (Flow)

```plaintext
Text Input
   ↓
Parser (Lark)
   ↓
Grammar Rules Match
   ↓
Transformer Methods Trigger
   ↓
NumPy Executes
   ↓
Output Printed
```

# ✨ Why This Is Powerful

- You built a mini programming language
- Clean separation of:
    - Syntax (grammar)
    - Execution (Transformer)
- Easy to extend: a new command needs one grammar alternative and one method


# Next Features You Can Add

## 1. Filtering

```plaintext
FILTER > 10
```

## 2. Sorting

```plaintext
SORT ASC
```

## 3. CSV Support

```plaintext
DATA file.csv
```

## 4. Chaining

```plaintext
DATA [1,2,3,4]
FILTER > 2
MEAN
```

# Final Thought

Real systems such as:

- SQL
- Pandas query engine
- Spark

…rest on the same basic idea: parse a query, then execute it.

You just built the foundation of a data query engine.


# If You Liked This

Drop a like ❤️
Follow for more AI + Systems content
And try extending this DSL yourself!
