DEV Community

sma
sma

Posted on • Edited on

1

Lets build a simple interpreter from scratch in python, pt.10: Tokenizer

Coding a program using only recursive lists is very difficult and tedious. This kind of hard work must be done by computers. The first challenge we have to overcome is that we somehow transform human-readable program code into a structure that can be easily processed by the interpreter. Symbols, words, and values ​​must be separated into small meaningful structures, which are called tokens. In this post we will convert this:

a=1;
Enter fullscreen mode Exit fullscreen mode

to this:

[["name","a"], ["eq","="], ["number",1], ["semi",";"]]
Enter fullscreen mode Exit fullscreen mode

["name","a"] is a token like [type, value] . We currently don't need to create a Token class for simplicity's sake.

Warning: The code below does not check for edge cases. It was written to reach the result as soon as possible instead of wasting time to trying to be idiomatic.
Lets try:

# We need a generator function that yields items of strings or lists:
def gen(xs):
    for x in xs:
        yield x

def tokenize(xs):
    isalpha=lambda x: (x>="a" and x<="z") or (x>="A" and x<="Z") or x=="_"
    reserved="function,set,get,print,return,if,else,for,while,break,continue".split(",")
    symbols={"(":"lparen",")":"rparen","{":"lcurly","}":"rcurly",",":"comma",";":"semi",
             "=":"equal"}
    g=gen(xs)
    i=next(g)
    tokens=[]
    buf=""
    try:
        while True:
            # Skip whitespace;
            while i in " \r\n\t":
                i=next(g);

            # Token starts with alphas ;
            if isalpha(i):
                # and continues with alphanumeric
                while isalpha(i) or i in "0123456789":
                    buf+=i
                    i=next(g)
                if buf in reserved:
                    tokens.append( [buf,buf] )
                else:
                    tokens.append( ["name",buf] )
                buf=""

            # We found a string;
            if i in "\"'":
                delimiter=i
                i=next(g)
                while i!=delimiter:
                    buf+=i
                    i=next(g)
                # Skip delimiter;
                i=next(g) 
                tokens.append( ["string",buf] )
                buf=""

            # We found a symbol;
            if i in "(){};=,":
                tokens.append( [symbols[i],i] )
                i=next(g)

            # We found a number;
            if i in "+-0123456789":
                while i in "+-0123456789.e":
                    buf+=i
                    i=next(g)
                if "e" in buf or "." in buf:
                    tokens.append( ["number",float(buf)] )
                else:
                    tokens.append( ["number",int(buf)] )
                buf=""

    except StopIteration:
        pass
    return tokens

# test it;
print(tokenize("if (gt(a,b)) return a; else return b;"))
Enter fullscreen mode Exit fullscreen mode

output:

[['if', 'if'], ['lparen', '('], ['name', 'gt'], ['lparen', '('], ['name', 'a'], ['comma', ','], ['name', 'b'], ['rparen', ')'], ['rparen', ')'], ['return', 'return'], ['name', 'a'], ['semi', ';'], ['else', 'else'], ['return', 'return'], ['name', 'b'], ['semi', ';']]
Enter fullscreen mode Exit fullscreen mode

It works :P

In the next post we will write a parser class that converts 'a=1;' to '["Set","a",1]'

Image of Datadog

The Essential Toolkit for Front-end Developers

Take a user-centric approach to front-end monitoring that evolves alongside increasingly complex frameworks and single-page applications.

Get The Kit

Top comments (1)

Collapse
 
muhammadzaki693 profile image
muhammadzaki693

I like you man

A Workflow Copilot. Tailored to You.

Pieces.app image

Our desktop app, with its intelligent copilot, streamlines coding by generating snippets, extracting code from screenshots, and accelerating problem-solving.

Read the docs

👋 Kindness is contagious

Please leave a ❤️ or a friendly comment on this post if you found it helpful!

Okay