AndréL

Making a Language - Tokenizer

To begin

We are going to build an interpreter for our language, but before writing the interpreter itself we need to build the tokenizer, the part that breaks source code into tokens. This step is essential for any interpreter or compiler.

Every language has a defined syntax, so you must design one for yours. Ours will be a syntax I created called regius, which is a mixture of Python and C++:

Hello world: print "Hello world";
Variables: int age = 10;
Show variables: print age;
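
Putting those together, a short regius program might look like this (a sketch of the syntax only; our tokenizer won't handle all of it yet):

int age = 10;
print "Hello world";
print age;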

Making the Types

Let's write the main function and define the token types for our language:

# language types
STRING = "STRING"
INT = "INT"
IDENTIFY = "IDENTIFY"
NULL = "NULL"

def main():
    Interpreter = interpreter()
    while True:
        line = input(">> ")             # get input
        Interpreter.main(line)          # call interpreter

main()

Remember that the token and interpreter classes must go between the type definitions and the main function, for example:

# language types
STRING = "STRING"
INT = "INT"
IDENTIFY = "IDENTIFY"
NULL = "NULL"

class token:
    ...

class interpreter:
    ...

def main():
    Interpreter = interpreter()
    while True:
        line = input(">> ")             # get input
        Interpreter.main(line)          # call interpreter

main()

So now let's make a token class, taking the token's type and value as arguments:

class token:
    def __init__(self, type, value):
        self.type = type
        self.value = value
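
Just to illustrate (a quick test of my own, not part of the final program), we can create a token and print its fields:

t = token(INT, "42")
print(t.type, t.value)   # prints: INT 42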

And we create a class for our interpreter, with the line passed as an argument to the interpreter's main method:

class interpreter:
    def __init__(self):
        pass

    def main(self, line):
        pass

Now let's implement the reading loop of the interpreter; each step is explained in the comments:

class interpreter:
    def __init__(self):
        self.vars = {}  # A "hash table" (dict) to store variables

    def main(self, line):
        # Important variables
        self.i = 0   # Current index in the line
        self.c = ''  # Current char of the line
        self.line = line  # Save the string in a class var
        self.actual_token = token(NULL, '')  # The token being built
        self.tokens = []  # Array of tokens
        self.in_string = False  # bool flag for strings

        while True:
            self.c = self.line[self.i]  # Get char
            print(self.c)

            if self.c == ';':  # Line end
                break

            self.i += 1  # Go to next index

This way, our interpreter reads a line and prints each character of it to the console until it reaches ; and stops the lexical analysis.
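
For example, typing ab; at the prompt prints each character on its own line (the ; included, since it is read before the loop breaks):

>> ab;
a
b
;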

Now let's generate the tokens of the lexical analysis, log them to the console, and remove the per-character printing:

class interpreter:
    def __init__(self):
        self.vars = {}  # A "hash table" (dict) to store variables

    def main(self, line):
        self.i = 0   # Current index
        self.c = ''  # Current char
        self.line = line  # Save the line
        self.actual_token = token(NULL, '')  # The token being built
        self.tokens = []  # Array of tokens
        self.in_string = False  # bool flag for strings

        while True:
            self.c = self.line[self.i]  # Get char

            if self.c == ';':  # End of line
                break
            elif self.c.isdigit() and self.actual_token.type in (INT, NULL):  # A digit, and we are inside an INT or starting a token
                self.actual_token.type = INT  # Token type is INT
                self.actual_token.value = self.actual_token.value + self.c  # Add char to token value

                if self.line[self.i + 1].isdigit():  # If next char is a digit, keep going
                    pass
                else:  # Else, save the current token and reset it
                    self.tokens.append(self.actual_token)
                    self.actual_token = token(NULL, '')
            elif self.c == '"':  # If current char is "
                if self.in_string:  # If already in a string, close it: save the token and reset it
                    self.in_string = False
                    self.tokens.append(self.actual_token)
                    self.actual_token = token(NULL, '')
                else:  # Else open a string: the token is a STRING
                    self.actual_token.type = STRING
                    self.in_string = True
            elif self.c.isalnum():  # If char is alphanumeric
                if self.in_string:  # Inside a string, just append it
                    self.actual_token.value = self.actual_token.value + self.c
                else:  # Else it's an identifier (e.g. a function name)
                    self.actual_token.type = IDENTIFY
                    self.actual_token.value = self.actual_token.value + self.c

                    if self.line[self.i + 1] in [' ', ';']:  # If next char is a space or ;, the identifier ends
                        self.tokens.append(self.actual_token)
                        self.actual_token = token(NULL, '')

            elif self.c == ' ' and self.in_string:  # A space inside a string is part of it
                self.actual_token.value = self.actual_token.value + self.c

            self.i += 1  # Next index

        # Log tokens
        for __token__ in self.tokens:
            print(__token__.type, __token__.value)

Now, if we run it and type print "hello world" 123; the expected output is:

IDENTIFY print
STRING hello world
INT 123

We use the IDENTIFY type for function names. That finishes our tokenizer; we are ready to build the interpreter itself, but that's for the next post.
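
As one last check (my own test, not one of the post's steps), feeding the tokenizer the variable declaration from the syntax section reveals a limitation to fix later: there is no rule for =, so it is silently skipped:

>> int age = 10;
IDENTIFY int
IDENTIFY age
INT 10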
