DEV Community

sma
sma

Posted on • Updated on

Lets build a simple interpreter from scratch in python, pt.10: Tokenizer

Coding a program using only recursive lists is very difficult and tedious. This kind of hard work must be done by computers. The first challenge we have to overcome is that we somehow transform human-readable program code into a structure that can be easily processed by the interpreter. Symbols, words, and values ​​must be separated into small meaningful structures, which are called tokens. In this post we will convert this:

a=1;
Enter fullscreen mode Exit fullscreen mode

to this:

[["name","a"], ["eq","="], ["number",1], ["semi",";"]]
Enter fullscreen mode Exit fullscreen mode

["name","a"] is a token like [type, value] . We currently don't need to create a Token class for simplicity's sake.

Warning: The code below does not check for edge cases. It was written to reach the result as soon as possible instead of wasting time to trying to be idiomatic.
Lets try:

# We need a generator function that yields items of strings or lists:
def gen(xs):
    for x in xs:
        yield x

def tokenize(xs):
    isalpha=lambda x: (x>="a" and x<="z") or (x>="A" and x<="Z") or x=="_"
    reserved="function,set,get,print,return,if,else,for,while,break,continue".split(",")
    symbols={"(":"lparen",")":"rparen","{":"lcurly","}":"rcurly",",":"comma",";":"semi",
             "=":"equal"}
    g=gen(xs)
    i=next(g)
    tokens=[]
    buf=""
    try:
        while True:
            # Skip whitespace;
            while i in " \r\n\t":
                i=next(g);

            # Token starts with alphas ;
            if isalpha(i):
                # and continues with alphanumeric
                while isalpha(i) or i in "0123456789":
                    buf+=i
                    i=next(g)
                if buf in reserved:
                    tokens.append( [buf,buf] )
                else:
                    tokens.append( ["name",buf] )
                buf=""

            # We found a string;
            if i in "\"'":
                delimiter=i
                i=next(g)
                while i!=delimiter:
                    buf+=i
                    i=next(g)
                # Skip delimiter;
                i=next(g) 
                tokens.append( ["string",buf] )
                buf=""

            # We found a symbol;
            if i in "(){};=,":
                tokens.append( [symbols[i],i] )
                i=next(g)

            # We found a number;
            if i in "+-0123456789":
                while i in "+-0123456789.e":
                    buf+=i
                    i=next(g)
                if "e" in buf or "." in buf:
                    tokens.append( ["number",float(buf)] )
                else:
                    tokens.append( ["number",int(buf)] )
                buf=""

    except StopIteration:
        pass
    return tokens

# test it;
print(tokenize("if (gt(a,b)) return a; else return b;"))
Enter fullscreen mode Exit fullscreen mode

output:

[['if', 'if'], ['lparen', '('], ['name', 'gt'], ['lparen', '('], ['name', 'a'], ['comma', ','], ['name', 'b'], ['rparen', ')'], ['rparen', ')'], ['return', 'return'], ['name', 'a'], ['semi', ';'], ['else', 'else'], ['return', 'return'], ['name', 'b'], ['semi', ';']]
Enter fullscreen mode Exit fullscreen mode

It works :P

In the next post we will write a parser class that converts 'a=1;' to '["Set","a",1]'

Links: Twitter,Patreon

Discussion (1)

Collapse
muhammadzaki693 profile image
muhammadzaki693

I like you man