DEV Community

Jay Damani
Jay Damani

Posted on

JShell - Making my own shell

A few months back, I decided to create JShell as a fun project to learn more about the good old C. Today, I want to reflect back and share what I learned along the way.

But what exactly is shell?

(Skip if you already know)

A shell simply reads user input (commands), executes programs if needed, prints the output and repeats this.

With above description, a shell looks like a simple program running commands like ls but it is not too far from a full fledged programming language. It can redirect output of a command to another command or file. It can also support loops, if conditions and variables. Heck, there exists a http server written in pure bash. (Curl is an external applicaton so does not count as pure bash)

Also note that shell and terminals are 2 separate applications that always interact with each other. Terminals (like kitty and Windows Terminal) are the front-end handling the GUI and shells (like Bash and PowerShell) are the back-end handling command parsing and execution.

Finally, to keep every developer's sanity, it is really important that different shells follow a consistent behaviour. The POSIX standard handles this responsibility of defining what the shell as a language should look like. Most mainstream shells follow this standard, with the exception of Windows Command Prompt and PowerShell but for the scope of this series, we are going to ignore them like they ignored the standard :D

Understanding Input

The first part of the shell is taking user input and converting it into something structured. JShell converts user input into a linked list of simple_command structured below.

typedef struct simple_command {
  sc_arg *words; // linked list of words (first member is command name)
  int wordc; // number of words
  redirection *redirects; // linked list of redirects (explained later)
  int redirc; // number of redirects
  struct simple_command *next; // next command to execute
} simple_command;

Enter fullscreen mode Exit fullscreen mode

This gives us a clear picture of our output, now let's look at our input:

# just a simple command with 2 arguments
rm -rf /boot 
# command with output redirection or some other operator
ls dir > list.txt 
# command with 1 argument, spaces inside quotes should not split arguments
cat "file name with spaces"
# backslash should escape quotes 
cat "file name with \" in middle"
Enter fullscreen mode Exit fullscreen mode

The first naive approach might be to split string by newline, space or something else. Due to convoluted escaping like \\" will evaluate to \" and as we add more operators and complex features, this approach will run into a lot of issues.

The more correct approach used by most compilers and interpreters is to read the input character by character to group them into tokens like a word, text inside double quotes or an operator. Then it will parse this list of tokens into a tree like data structure known as AST (Abstract Syntax Tree). For JShell, the simple_command stuct shown above acts as the AST.

Lexing

The first part of grouping characters into tokens is called lexing. This can be done on demand while parsing (like lazy loading) or it can be applied to the whole input at once.

JShell initially did lexing and parsing separately for simplicity but I later changed it to only lex as needed because that felt more efficient.

Here is a simplified version of the current lexer broken into components.

#include <stdbool.h>

// Token types
enum TokenType {
  T_WORD,      // Regular word/identifier
  T_GTR,       // > redirect
  T_DGTR,      // >> append redirect
  T_LINEBREAK, // Newline or semicolon
  T_END        // End of input
};

// Token structure that will be returned by the lexer
struct token {
  const char *str;
  int len;
  enum TokenType type;
};
Enter fullscreen mode Exit fullscreen mode

The token structure is the output returned by lexer with all the necessary details about a token.

// Stores state of the lexer for a given input string
struct lexer {
  const char *curr;  // Current position in input
  struct token tk;   // Current token
};
Enter fullscreen mode Exit fullscreen mode

lexer structure stores the state of the lexer. Mainly, the index of the current character to start lexing from and some memory space to store the token.

The main lexing function starts from current position in the lexing state and finds next token using these steps:

  1. Skip over any blank spaces
  2. Skip over comments (if current character is # keep moving till we encounter newline)
  3. Check for fixed character tokens like newline, end of input and operators
  4. If the current character does not match anything it is counted as "word" and the lexer keeps moving till it finds an end of the word (characters like space, new line, ;, &) Also, shells have a lesser known feature which is the fact that consecutive strings with no space between them are concanated into one string. For example, echo "a"b'c' prints abc because it evaluates "a", b and 'c' as a, b and c then concatenates them into 1 abc

Here is my implementation for above:

enum L_STATE nextToken(struct lexer *l) {
  struct token *t = &l->tk;
  t->flags = 0;

  while (isBlank(*l->curr))
    l->curr++;

  if (*l->curr == '#') {
    while (*l->curr != '\0' && *l->curr != '\n')
      l->curr++;
  }
  if (*l->curr == '\n' || *l->curr == ';') {
    t->str = l->curr;
    t->len = 1;
    t->type = T_LINEBREAK;
    l->curr++;
    return l->state = L_EOC;
  }
  if (*l->curr == '\0') {
    t->type = T_END;
    return l->state = L_EOF;
  }

  const char *start = l->curr;

  if (*l->curr >= '0' && *l->curr <= '9') {
    l->curr++;
  }
  if (strncmp(l->curr, ">>", 2) == 0) {
    t->type = T_DGTR;
    l->curr += 2;
    if (*l->curr == '&') {
      l->curr++;
    }
    t->str = start;
    t->len = l->curr - start;
    return l->state = L_CONTINUE;
  } else if (*l->curr == '>') {
    t->type = T_GTR;
    l->curr++;
    if (*l->curr == '&') {
      l->curr++;
    }
    t->str = start;
    t->len = l->curr - start;
    return l->state = L_CONTINUE;
  } else {
    // go back to consume as identifiers
    l->curr = start;
  }
  t->type = T_WORD;
  do {
    if (*l->curr == '"') {
      while (*l->curr != '"') {
        if (*++l->curr == '\\' && *++l->curr != '\0')
          l->curr++;
        if (*l->curr == '\0')
          return l->state = L_EOF_DQUOTE;
      }
    } else if (*l->curr == '\'') {
      do {
        if (*++l->curr == '\0')
          return l->state = L_EOF_QUOTE;
      } while (*l->curr != '\'');
    } else if (*l->curr == '\\') {
      if (*(l->curr + 1) != 0)
        l->curr++;
    }
    l->curr++;
  } while (isNotSpecialChar(*l->curr));
  t->str = start;
  t->len = l->curr - start;
  return l->state = L_CONTINUE;
}
Enter fullscreen mode Exit fullscreen mode

To be continued

There is lot more to talk about but since this single blog is getting too long. I am splitting it into multiple parts with next parts talking about parser, auto completetion and some more stuff.

Top comments (0)