Aleksandar Milenkovic

Posted on Jul 26, 2023

Building a simple but advanced Jsonic language using Python

#python #json #algorithms #javascript

Blog on LinkedIn

Prerequisites

Source code for this blog
Python support

Layout of this guide

Motivation
Introduction
Character Stream
Prefix Checking (peek, current, advance)
Recursive Descent
Error handling
Skipping white space
Imports
From Back-end To Front-end
The Object Scope
Flags
Method Chaining
Interpreting empty object
Scoping
Expressions
Wrapping up Method Chaining
Conclusion
References

Source code for this blog:

Source Code

the little utility library we use:

utility library

Python Support

In this blog, we use Python v3.8.x but you can use v3.5 or greater. Note that the type annotation has been supported since 3.5.

This blog is a good quick start for any developer interested in building the first prototype of their own language. Advanced Python skills are, of course, optimal.

Motivation

While working for Voxgig, I have taken a lot of interest in how a prototype of cue language we have been using for our model-driven applications works behind the scenes. Just the basics of it. But even the basics seemed complicated in my eyes. So I wanted to make it seem simple and sharpen up my understanding of the language along the way. This has worked wonders for me in many ways. From why we can write

foo:bar:zoo: 1

to why JavaScript throws an error on:

console.log('hi!')
(10 + 10) * 1


/*
 * Uncaught TypeError: console.log(...) is not a function
     at <anonymous>:3:1
*/

Introduction

As developers, we’d always like to try our hand at building and customizing our own toolkit – whether that is out of curiosity or out of necessity.

This blog will give you some guidelines and basic knowledge on how to build your own basic language, in our case a Jsonic language, and customize it to your needs. A Jsonic Language is a JSON-like language with fun but useful syntax features such as expressions, scoping, calling functions, etc.

Upon reading this blog, not only will you understand the basics of how to build an advanced Jsonic language but you will use simple algorithms to do so, which will increase productivity and decrease confusion!

The prototype of our language is in Python but we try to be as abstract as possible so that you can pick a language of your choice to build your own version of this language. Of course, things like garbage collectors are built-in so we won’t cover that if you want to work with languages like C.

Character Stream

To cut right to the chase, we will leave out the implementation of the CharStream class and just demonstrate its functionality. You have the source code as your friend – in fact, the CharStream class is very straightforward and is the core of our project to begin with.

char_stream = CharStream("welcome!")
char_stream.peek() == 'w' # True
char_stream.advance() 
char_stream.peek() == 'e' # True

For prefix checking, we use peek() and current methods of the CharStream class and then consume characters with advance() that moves one character ahead and returns the current one.
Compare it to Python built-in iter:

stream = iter("welcome")

next(stream) == 'w'

next(stream) == 'e' # it is 'e' now - there isn't really a way to check the current char but not advance

Prefix checking

If the character stream peeks for a certain character, we could easily determine the type of that token right away, roll with it until the while loop is done, and just evaluate it.

We will be applying scanning, parsing and evaluation all in one go and on the fly – that's to say we are not going to generate Lexer and AST per ser. Instead, we jump straight to interpreting our object.

For a start:

def interpret_obj(char_stream: CharStream, advance: bool, ...):
  if advance:
    char_stream.advance()
  skip_space(char_stream)
  if char_stream.peek().isdigit():
    value = ''
    while char_stream.peek().isdigit():
      value += char_stream.advance()
    skip_space(char_stream)
    return int(value, 10) # base 10
...

      Snippet 1

Here, the bool advance parameter will be a help because each time we match a certain character, we might forget to consume it – so we pass True and, if necessary, False.

We come back to this logic to sort out identifiers and also introduce the flags.

def scan_id(char_stream: CharStream, advance: bool):
  if advance:
    char_stream.advance()
  value = ''
  if not char_stream.peek().isalpha():
    raise SyntaxError("Expected an identifier to start with an alpha")
  while char_stream.peek().isalpha() or char_stream.peek().isdigit():
    value += char_stream.advance()
  return value

def interpret_obj(char_stream: CharStream, advance: bool, scope: dict, flags: utils.Map):
  ...
  elif char_stream.peek().isalpha() and flags.temFlag:
    identifier = scan_id(char_stream, False)
    skip_space(char_stream) # clear up the following white space
    if flags.isKey:
      return identifier
    if identifier not in scope.keys():
      raise NameError(identifier + " not defined!")
    return scope[identifier]
  ...

      Snippet 2

What’s more, utils.Map is similar to typical dict except that its values are not accessed just using obj['key1'] but using either obj.key1 or obj['key1'] which is, more JavaScript-like.

Check out the Python docs on setattr, getattribute and getitem, dict. More on Flags – on Flags section.

RECURSIVE DESCENT

Our little language will only use one main algorithm and still be good enough to support advanced syntax features such as: scoping, method chaining, expressions, and calling functions.

In this series, we prioritize typical JSON data that is: null, Number, String, Object, and Array.

However, this guide only shows you how to support Numbers as it is easier to make complete sense of it and you can use those same methods to scan the other types. They will easily fit the rest of the code – you just need to write a few conditions to identify those types.

For example, Arrays start with the character [ and end with ] while strings start and end with ".

Don't sweat it! This is a blog series so there will be more to come.

Error handling

It is important to detect the syntax errors. The way to handle them is by throwing exceptions!

Throwing exceptions for this kind of algorithm is ideal as it will stop all the recursive calls up to the base call no matter the depth.

Skipping white-space

To skip the white space, we simply peek for ' ' or '\n' and advance to the next character.

def skip_space(char_stream: CharStream):
  while not char_stream.is_over() and (char_stream.peek() == ' ' or char_stream.peek() == '\n'):
    char_stream.advance() # consume the white space

      Snippet 3

Imports

In this blog, we use: sys, utils and parser.

import sys # built-in python sys module
import utils # this-project-specific set of utilities – see examples: https://github.com/BeAllAround/py_utils
from parser import CharStream # library that will help us with scanning/lexical analysis

      Snippet 4

We recommend that you write your own back-end libraries as much as you can. It makes your language open to bigger features like merging nested objects such as: { a: 1, { b: 2 } } and others.

From back-end to front-end

For us, the front-end would be scanning and parsing the text so that we could evaluate it with the back-end modules:

front-end – scanning and parsing
back-end – evaluation
front-end + back-end – interpretation

Python itself is the back-end engine of this language for its built-in error handling system and methods under the hood of dict, str, list such as add, eq, getitem, etc. but we are going to take a step back and focus on the modules that are not built-in.

When scanning and parsing, we have to think about the evaluation. That is when utilities come into play depending on what we want to do.

A great example would be the deep extend function for merging objects.

merge({foo: {a: 1}}, {foo: {b:2}}) => {foo: {a: 1, b: 2}}

With that in mind, we would need a function to merge two objects, as the ones in the example above.

utils.deep_update(obj1, obj2)

The Object Scope

The aim of our language would be interpreting the object. Now that we know how to apply the prefix checking, we are all set to kick off!

If we try to interpret: "{1: 1}"

def interpret_obj(char_stream: CharStream, advance: bool, scope: dict, flags: utils.Map):
  if advance:
    char_stream.advance();
  ...
  elif char_stream.peek() == '{':
    obj = {}
    while True:
      key = interpret_obj(char_stream, True, scope, utils.Map({'temFlag': True, 'isKey': True,}) ) # key == 1
      skip_space(char_stream)
      if char_stream.peek() == ':':
        value = interpret_obj(char_stream, True, obj, utils.Map({'temFlag': True, 'isKey': False,}) )
        utils.deep_update(obj, { key: value }) # obj now equals to "{1: 1}" – see docs: https://github.com/BeAllAround/py_utils
        if char_stream.peek() == '}':
          char_stream.advance() # eat '}'
          skip_space(char_stream)
          return obj
        elif char_stream.peek() == ',':
          continue
        else:
          raise SyntaxError("Unexpected char " + char_stream.current)
      else:
        raise SyntaxError("Unexpected char " + char_stream.current)
  else:
    raise SyntaxError("Unexpected char " + char_stream.current)

      Snippet 5

Flags

To enable and register features, we could simply use an instance of utils.Map that will contain the flags for us to use.

We could pass Booleans to our function but that would be very hard to maintain.

Compare the following two snippets:

def interpret_obj(char_stream: CharStream, ..., flags: utils.Map):
  ...
    value = interpret_obj(char_stream, ..., utils.Map({'temFlag': True, 
                                                   'isKey': False,}) )
  ...

      Snippet 6

def interpret_obj(char_stream: CharStream, ..., temFlag = False, isKey = False):
  ...
    value = interpret_obj(char_stream, ..., True, False)
  ...

      Snippet 7

As you can see, the former significantly improves readability and would simplify adding new flags as you pick up new features.

Still, this approach isn’t great and would require a bit of refactoring, which we can look at in the next volume of this series.

Method Chaining

For method chaining, we will need another infinite loop that breaks if none of our existing conditions match.

value_ptr – is just an array of one element that tries to mimic a pointer in Python or other scripting languages that don’t really have an equal.

It will help us achieve the following:

a = 1
def modify_a(a):
  a = 2
modify_a(a)
> a # still 1 but we need it to be 2

      Snippet 8

But at least, it mimics pointers in that regard:

a = [ 1 ]
def modify_a(value_ptr):
  value_ptr[0] = 2
modify(a)
> a[0] # our main value is now 2

      Snippet 9

def chain_obj(char_stream: CharStream, value_ptr: list, scope: dict):
  while True:
    if callable(value_ptr[0]) and char_stream.peek() == '(':
      args = [] # a place to store the evaluated arguments

      # evaluating "func()"
      cs = CharStream(char_stream.source, char_stream.c)
      char_stream.advance()
      skip_space(char_stream) # evaluate "func(  )"
      if char_stream.peek() == ')':
        char_stream.advance() # eat ')'
    skip_space(char_stream)
    value_ptr[0] = value_ptr[0]() # evaluate the function with no arguments passed
    continue
      else:
        char_stream.set_cs(cs) # reset the char stream to last matched '('
      while True:
        args.append(parse_obj(char_stream, True, scope, utils.Map({'temFlag': True, 'isKey': False,}) )
    if char_stream.peek() == ',':
      continue
    elif char_stream.peek() == ')':
      char_stream.advance() # eat ')'
      skip_space(char_stream)
      value_ptr[0] = value_ptr[0](*args) # evaluate the function with x number of arguments passed
      break
    else:
      raise SyntaxError("Unexpected char while evaluating a function: " + char_stream.current)

    elif type(value_ptr[0]) == dict and char_stream.peek() == '.':
      identifier = scan_id(char_stream, True)
      skip_space(char_stream)
      value_ptr[0] = value_ptr[0][identifier]
      continue

    else:
      break

      Snippet 10

This code implements obj.a.b, obj.a().b, func(1, 2), "func()() - calling functions" and vice versa depending on what you want to do.
Since our language does not support function definition, we need to pass in Python functions. After all, this will come in handy as you can add up any Python code and use it from the language code itself.
Formally, we provide some "context".

For example:

def json_export():
  scope =  {'func': lambda x,y: x+y,} # our language can call functions
  default_flags = utils.Map({})
  utils.export_json( interpret_obj(CharStream('{' + input() + '}'), False, scope, default_flags) )
  # Note that 'utils.export_json' is a custom alternative to 'json.dump' from Python Standard Library
  # more details on that here: https://docs.python.org/3/library/json.html

      Snippet 11

While on the point of printing out our final object, there is an interesting detail to point out. That is, Python actually uses sys.stdout.write(obj.repr() + '\n') to convert an object to a string and, then, show it to the console.

In this snippet, we used utils.export_json just that print uses single quotation marks to encode Python strings. "a" will be 'a' – which isn't standard JSON.

Generating out the JSON data for reuse:

To further customize our formatting, you could code your own object_to_string function:

def object_to_string(obj: dict):
  # handle obj
  return str(...)

# Afterwards, you can add it up like this:
print(object_to_string(interpret_obj(CharStream('{' + input() + '}'), False, scope, default_flags)))

      Snippet 12

To get a better sense of how you could format your object with somewhat a simple set of rules, check out the implementation of utils.export_json from the source code.

To enable the method chaining, we would turn a part of the code from Snippet 5 into:

...
  if char_stream.peek() == ':':
    value = interpret_obj(char_stream, True, scope, utils.Map({ 'isKey': False, 'temFlag': True, }) )
    value_ptr = [ value ]
    chain_main(char_stream, value_ptr, scope)
    value = value_ptr[0]
    utils.deep_update(obj, { key: value }) 
...

      Snippet 13

Or we could wrap it inside a single function for reusability.

def interpret_obj_and_chain(char_stream: CharStream, advance: bool, scope: dict, flags: utils.Map):
  value = interpret_obj(char_stream, advance, scope, flags)
  value_ptr = [ value ]
  chain_obj(char_stream, value_ptr, scope)
  return value_ptr[0] # return the final value

      Snippet 14

Now, a part of Snippet 5 translates into

...
if char_stream.peek() == ':':
  value = interpret_obj_and_chain(char_stream, True, scope, utils.Map({'temFlag': True, 'isKey': False,}) )
  utils.deep_update(obj, { key: value })
...

      Snippet 15

Interpreting empty object

There is a special case we have yet to set. That is, interpreting {}

Let's turn a piece of Snippet 5 into

def interpret_obj(char_stream: CharStream, advance: bool, scope: dict, flags: utils.Map):
  ...
  elif char_stream.peek() == '{':
    obj = {}
    # interpreting "{}"
    cs = CharStream(char_stream.source, char_stream.c)
    char_stream.advance()
    skip_space(char_stream) # with this, we can interpret "{  }"
    if char_stream.peek() == '}':
      char_stream.advance() # eat '}'
      skip_space(char_stream)
      return obj
    else:
      char_stream.set_cs(cs) # trace the char_stream back to last matched '{' and proceed
  ...

      Snippet 16

Scoping

Scoping, most of all, enables you to use the predefined context as detailed in Snippet 11.

To make sure we don’t mutate our context, we are going to make a copy of it into obj_scope and, from then on, use it as a dictionary to get items from.

Let’s work on Snippet 5 a bit further.

def interpret_obj(char_stream: CharStream, advance: bool, scope: dict, flags: utils.Map):
  ...
  elif char_stream.peek() == '{':
    obj = {}
    obj_scope = {}
    ...
    utils.deep_update(obj_scope, scope)
    ...
      if char_stream.peek() == ':':
        value = interpret_obj_and_chain(char_stream, True, obj_scope, utils.Map({ 'temFlag': True, 'isKey': False, }) )
        utils.deep_update(obj_scope, { key: value })
        utils.deep_update(obj, { key: value })
      ...
    ...
  ...

      Snippet 17

Expressions

For a start, think of it as a desk calculator! We first have to handle basic arithmetic operations: multiplication, division, addition, and subtraction.

Consider: (3 + 3) * ( 3 * 3)

To achieve this, we need to prioritize multiplication and division over addition and subtraction; expr revolving around term with term revolving around prim.

In computer science, the term is Operator-precedence parser or just precedence.

def expr(char_stream: CharStream, advance: bool, scope: dict, flags: utils.Map):
  left = term(char_stream, advance, scope, flags)
  while True:
    if char_stream.peek() == '+':
      left += term(char_stream, True, scope, flags)
    elif char_stream.peek() == '-':
      left -= term(char_stream, True, scope, flags)
    else:
      return left

def term(char_stream: CharStream, advance: bool, scope: dict, flags: utils.Map):
  left = prim(char_stream, advance, scope, flags)
  while True:
    if char_stream.peek() == '*':
      left *= prim(char_stream, True, scope, flags)
    elif char_stream.peek() == '/':
      left /= prim(char_stream, True, scope, flags)
    else:
      return left

def prim(char_stream: CharStream, advance: bool, scope: dict, flags: utils.Map):
  if advance:
    char_stream.advance()
  skip_space(char_stream)
  if char_stream.peek() == '(': # enable ((2 * 2) + 1)  
    value = expr(char_stream, True, scope, flags)
    if char_stream.peek() != ')':
      raise SyntaxError("Missing ')'")
    char_stream.advance() # eat ')'
    skip_space(char_stream)
    return value 
  return interpret_obj_and_chain(char_stream, False, scope, flags) # use the complete variant of interpret_obj with method chaining

      Snippet 18

Our function can be illustrated using either AST or Parse Tree.

Visually, we have:

To cap it off, let’s replace interpret_obj_and_chain with this new function!
Back to Snippet 5:

def interpret_obj(char_stream: CharStream, advance: bool, scope: dict, flags: utils.Map):
  ...
  elif char_stream.peek() == '{':
    ...
      if char_stream.peek() == ':':
        value = expr(char_stream, True, obj_scope, utils.Map({ 'temFlag': True, 'isKey': False, }) )
        ...
    ...
  ...

      Snippet 19

Additionally, don’t forget to call expr from chain_obj as well – particularly instead of interpret_obj when getting a function arguments.

For example, interpreting { f: func(1+1, 2*2) } would require:

args.append( expr(char_stream, True, scope, utils.Map({'temFlag': True, 'isKey': False,}) )

Wrapping Up Method Chaining

Now, if we write {c: {b: 1, d: {a: 21}}, d1: (c.d.a)*2+1 } – that works giving us the correct result. But what if we tried to write ((c.d).a) instead of just (c.d.a). Well, that easy to implement! If you retrace your steps back to prim – you will find the part that matches parentheses and hit the nail on the head! That’s the block of code we need to work on!

def prim(char_stream: CharStream, advance: bool, scope: dict, flags: utils.Map):
  ...
  if char_stream.peek() == '(': # enable ((2 * 2) + 1)  
    value = expr(char_stream, True, scope, flags)
    value_ptr = [ value ]
    if char_stream.peek() != ')':
      raise SyntaxError("Missing ')'")
    char_stream.advance() # eat ')'
    skip_space(char_stream)
    chain_obj(char_stream, value_ptr, scope) # just enable method-chaining
    value = value_ptr[0] # the final value
    return value
  ...

      Snippet 20

Next, go ahead and try (c.d).a !

Because of our recursion logic, we also must not forget to check for unmatched characters if there are extra characters, which should produce an error - for example: {}}, ()) and similar.

In that case, we have:

def interpret_object(stream: CharStream, scope: dict): 
    default_flags = utils.Map({})
    obj = interpret_obj(stream, False, scope, default_flags)
    if not stream.is_over():
        raise SyntaxError('unmatched ' + stream.current)
    return obj

def main():
    text = CharStream(r'''{}''')
    # text = Char_stream(r'''{}}''') # we now get "SyntaxError('unmatched }')"
    scope = { 'func': lambda x,y: x+y, 'func1': lambda: print('hi!'), }

    obj = interpret_object(text, scope)
    utils.export_json(obj)

      Snippet 21

Conclusion

Understanding how programming languages work opens up many opportunities in the computer science world by introducing you to natural language processing, pattern matching, evaluation, compilers, interpreters, etc.

It helps you picture how each expression you have written is processed before you even hit the enter key trying to run just a few lines of code – it can inspire you to put the programming language you are using under the microscope for the first time.

There is plenty of room to structure and improve our code. One of the ways is to lean towards OOP, which is exactly what the next volume will be about – one step at a time!

We wish you luck with the prototype of your language and look forward to sharing more advanced guidance with you soon.

Let us know about any typos and your takeaways in the comments.

References

Top comments (1)

Tori Hevnilva • Sep 8 '23

I am always here for a good parser article.

DEV Community