I've spent the past few weeks going deep on Unix and Linux systems programming — reading TLPI, working through CS:APP, writing C daily. Today I started the project that ties it all together: building a shell from scratch in C.
The shell has a modular structure. Each component has one job. The first component I wrote today is the lexer.
**What a lexer does**
The user types a string. Your program needs structured data. The lexer bridges that gap.
It takes a raw input string like:
```bash
ls -la | grep txt > out.txt
```
And converts it into an array of typed tokens:
```
[WORD:"ls"] [WORD:"-la"] [PIPE] [WORD:"grep"] [WORD:"txt"] [REDIR_OUT] [WORD:"out.txt"] [EOF]
```
Each token has a type, plus a value string for WORD tokens. The parser takes this token array and figures out what it means.
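The post doesn't show `shell.h`, but based on how the lexer uses the token names, a minimal version of the types might look like this (the `MAX_INPUT` value and the struct layout are my assumptions):

```c
#define MAX_INPUT 1024          /* assumed buffer size; not shown in the post */

typedef enum {
    TOKEN_WORD,         /* command name or argument */
    TOKEN_PIPE,         /* |  */
    TOKEN_REDIR_OUT,    /* >  */
    TOKEN_REDIR_APPEND, /* >> */
    TOKEN_REDIR_IN,     /* <  */
    TOKEN_EOF           /* end of the token stream */
} TokenType;

typedef struct {
    TokenType type;
    char value[MAX_INPUT];      /* only meaningful for TOKEN_WORD */
} Token;
```

A fixed-size `value` buffer keeps the lexer allocation-free, at the cost of capping word length at `MAX_INPUT - 1` characters.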
**The implementation**
The lexer is a state machine. It reads characters one at a time and decides what to do:
- whitespace — finish any word in progress, then skip it
- `|` — emit a PIPE token
- `>` — peek at the next character: `>>` emits REDIR_APPEND, `>` alone emits REDIR_OUT
- `<` — emit a REDIR_IN token
- anything else — accumulate the character into the current word
```c
#include "shell.h"
#include <string.h>

/* Append the in-progress word (if any) as a WORD token. */
static void flush_word(Token *tokens, int *t, char *word, int *w) {
    if (*w == 0)
        return;
    word[*w] = '\0';
    tokens[*t].type = TOKEN_WORD;
    strncpy(tokens[*t].value, word, MAX_INPUT - 1);
    tokens[*t].value[MAX_INPUT - 1] = '\0';  /* strncpy may not terminate */
    (*t)++;
    *w = 0;
}

int lexer(const char *input, Token *tokens, int max_tokens) {
    int i = 0;              /* position in the input string */
    int t = 0;              /* tokens emitted so far */
    int w = 0;              /* length of the word in progress */
    char word[MAX_INPUT];

    /* Reserve two slots: one for a final word flush, one for EOF. */
    while (input[i] != '\0' && t < max_tokens - 2) {
        if (input[i] == ' ' || input[i] == '\t') {
            flush_word(tokens, &t, word, &w);
            i++;
            continue;
        }
        if (input[i] == '|') {
            flush_word(tokens, &t, word, &w);
            tokens[t].type = TOKEN_PIPE;
            tokens[t].value[0] = '\0';
            t++;
            i++;
            continue;
        }
        if (input[i] == '>') {
            flush_word(tokens, &t, word, &w);
            if (input[i + 1] == '>') {
                tokens[t].type = TOKEN_REDIR_APPEND;
                i += 2;
            } else {
                tokens[t].type = TOKEN_REDIR_OUT;
                i++;
            }
            tokens[t].value[0] = '\0';
            t++;
            continue;
        }
        if (input[i] == '<') {
            flush_word(tokens, &t, word, &w);
            tokens[t].type = TOKEN_REDIR_IN;
            tokens[t].value[0] = '\0';
            t++;
            i++;
            continue;
        }
        if (w < MAX_INPUT - 1)  /* drop characters past the buffer limit */
            word[w++] = input[i];
        i++;
    }

    flush_word(tokens, &t, word, &w);
    tokens[t].type = TOKEN_EOF;
    tokens[t].value[0] = '\0';
    t++;
    return t;
}
```

The word-flushing logic is pulled into a helper because every branch needs it, and the loop reserves two slots so the final flush and the EOF token can never write past `max_tokens`.
**Why separate the lexer from the parser?**
While designing this, I learned about splitting responsibilities.
The lexer only cares about characters and character sequences. It knows nothing about valid structure; that is the parser's problem.
The parser asks what a token sequence means, and never looks at the raw characters.
If you combine them into one function you get something complex and hard to test. Separate them and each is small enough to hold in your head at once. I can test the lexer by printing its output before the parser ever runs.