
Sean Charles

How's Anything Work Anyway -1- The Lexer

I recently had a brain fart that led to this article,
https://dev.to/emacstheviking/how-does-any-of-it-work-anyway-4oa0
at the end of which I made a thinly veiled threat about writing something on "lexical analysis". This article is me making good on that threat: an attempt to explain, in reasonably straightforward terms, how any application you interact with actually works when you type "something" at the keyboard and the computer "does something" in return.

From the top ...

Well, actually, it's always from the bottom up when it comes to this sort of thing. Let's first take a high-level view of the process:

  1. you type in a command and some data
  2. the computer says "thanks"
  3. the computer works out what you meant
  4. the computer executes your desired command

Note that by any application, I mean any application, like the web browser you are using right now: when you type something into the address bar... it starts here!

Step 3 is our area of interest. Let's say, for example, that you are sitting in the REPL of your favourite language and you type something like:

    load killerapp.src[RETURN]

OK, first we can see that the intent is that you want your system to somehow "load" the contents of a file called "killerapp.src" into the running environment and then return control to you, unless of course something in the file causes control to be passed to the killer application.

From first principles then we can see that what you did was give the computer a sequence of characters that contain your intent but sadly, the computer sees only a bunch of consecutive bytes that have no value over and above any other bunch of bytes.
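To make that concrete, here is a minimal standalone sketch (call it bytes.c; it's not part of our lexer) that prints the byte value of every character in its argument:

#include <stdio.h>

int main(int argc, char* argv[]) {
    if (argc < 2) {
        printf("usage: %s \"some text\"\n", argv[0]);
        return 1;
    }
    /* To the machine, the command is just consecutive byte values. */
    for (char *p = argv[1]; *p; p++) {
        printf("'%c' = %d\n", *p, *p);
    }
    return 0;
}

Run it against our command and you'll see, for example, that the space between "load" and "killerapp.src" is just the number 32. It's our rules, not the machine, that will make it special.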

The meaning will come from our instructions on how to process the sequence and in order to do that we must first determine the rules that allow us to separate out those bytes into tokens. Have you heard of Henry Moore? I am a big fan of his work, and one of my favourite quotes of his is this one,

To know one thing, you must know the opposite.

I mentioned "tokens" in the previous paragraph, and if I said that tokens are contiguous groups of "non-whitespace" characters, that would be great... if we knew what whitespace was! Traditionally it is the set of characters with these ASCII codes:

32 (SPACE)
9 (TAB)
10 (LF)
13 (CR)
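With that nailed down, the whitespace test we'll be applying is a one-liner. A minimal sketch (the helper name is_whitespace is my own choice):

#include <stdbool.h>

/* True if the byte is one of our chosen separator characters. */
bool is_whitespace(char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}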

Now that we have a definition of "whitespace" for our purposes, we can see that the tokens are thus the strings:

- load
- killerapp.src

OK, so how would we go about turning our observations into actual code? Let's do it! It really isn't that hard: most languages provide some means of splitting a string of characters by one or more delimiters -- that's the key point here. Being an old timer and something of a fan of "C" to this day, I am going to go with that.

#include <stdio.h>
#include <string.h>

int main(int argc, char* argv[]) {
    /* We need a string to tokenise and a set of delimiter characters. */
    if (argc < 3) {
        printf("usage: %s \"string\" \"delimiters\"\n", argv[0]);
        return 1;
    }
    /* strtok() splits in place: the first call takes the string,
       subsequent calls pass NULL to continue where it left off. */
    char *token = strtok(argv[1], argv[2]);
    while (token) {
        printf("TOKEN: %s\n", token);
        token = strtok(NULL, argv[2]);
    }
    return 0;
}

This little program lets you tokenise the first string by the characters in the second string. Remember that you will need to surround both strings in double quotes on the command line to ensure the program sees the correct input (the argument check at the top saves us from a segfault if we forget one). When we do run it, we can see that this incantation:

cc minilex.c -o minilex
./minilex "load killerapp.src" " "
TOKEN: load
TOKEN: killerapp.src

Well, that worked, but it's pretty simple and probably won't stay useful for long. For example, what if we wanted to implement another command that could print a "string" of characters containing spaces? How would we do that if space is our token separator? It's not long before lexing becomes non-trivial!
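To see the problem concretely, imagine a hypothetical print command (made up purely for illustration) and push it through minilex; the quotes do nothing, because strtok only knows about our separator:

./minilex "print \"hello world\" now" " "
TOKEN: print
TOKEN: "hello
TOKEN: world"
TOKEN: now

The string we meant to keep whole has been chopped at every space. Handling that properly means the lexer has to track whether it is inside quotes, and that is where things start to get interesting.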

It doesn't take too much imagination to see that we already have the basics for creating our own REPL / command interpreter. By switching on the first token and consuming subsequent tokens we can quickly build something useful. Or not. This tiny tiny example shows how you could begin to write a dead simple calculator that only does addition:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char* argv[]) {
  /* We expect a single quoted expression, e.g. "+ 1 2 3". */
  if (argc < 2) {
    printf("usage: %s \"+ n1 n2 ...\"\n", argv[0]);
    return 1;
  }
  /* The first token is the command; we only know about "+". */
  char *token = strtok(argv[1], " ");
  if (token && 0 == strcmp("+", token)) {
    int total = 0;
    /* Every remaining token is treated as a number to add. */
    token = strtok(NULL, " ");
    while (token) {
      total += atoi(token);
      token = strtok(NULL, " ");
    }
    printf(" = %d\n", total);
  }
  return 0;
}

Compiling the above little program and running it, we can see:

cc minilex.c -o minilex
./minilex "+"
 = 0
./minilex "+ 32 10"
 = 42
./minilex "+ 1 2 3 4 5 6"
 = 21

Of course, there is zero error checking so entering things that aren't numbers isn't recommended.
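If you did want to be stricter, the standard library's strtol is the usual tool: unlike atoi, it tells you where parsing stopped, so anything left over means the token wasn't a number. A minimal sketch (the helper name parse_int is my own):

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

/* Parse a token as a base-10 integer; false if any of it isn't numeric. */
bool parse_int(const char *token, long *out) {
    char *end;
    *out = strtol(token, &end, 10);
    return end != token && *end == '\0';
}

int main(void) {
    long n;
    printf("42 -> %s\n", parse_int("42", &n) ? "ok" : "not a number");
    printf("4x -> %s\n", parse_int("4x", &n) ? "ok" : "not a number");
    return 0;
}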

Summing up (pun intended)

Lexing, then, is the process of turning a sequence of bytes into a sequence of tokens, ready for further processing. This has been a deliberately simple attempt; in the real world it's rare that a functional lexer can be this simple, although Forth just might be the exception to that!

In the next thrilling instalment we'll go into a more complex lexer, one that can recognise "data types" such as strings and brackets, which means we will be classifying the tokens so we can better interpret them.
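As a taste of what "classifying" means, a token stops being a bare string and carries a tag saying what kind of thing it is. A minimal sketch of the shape such a thing might take (the names here are my own, not a commitment to the next article's code):

/* One possible shape for a classified token. */
typedef enum {
    TOK_WORD,     /* load, print, + ... */
    TOK_STRING,   /* "hello world"      */
    TOK_LBRACKET, /* (                  */
    TOK_RBRACKET  /* )                  */
} TokenType;

typedef struct {
    TokenType type;
    const char *text;
} Token;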

Next time, we will outline a more formal but simple lexer rule set, and then we will discuss something called finite state machines, which are an important thing to know about in the world of software.
