From Wikipedia, an LL parser (Left-to-right, leftmost derivation) is a top-down parser for a restricted context-free language. An LL parser is called an LL(k) parser if it uses k tokens of lookahead when parsing a sentence; an LL(1) parser therefore has 1 token of lookahead. The lookahead allows you to make decisions based on upcoming tokens without removing them from the input. Many modern-day parsers are top-down parsers that take a source file, turn it into a stream of tokens, and parse those tokens into an abstract syntax tree or some other data structure.
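To see what one token of lookahead buys you, here is a tiny hypothetical sketch: the parser inspects the next token to choose a production, without consuming it.

// LL(1) in miniature: a single token of lookahead selects the production
// to apply (the tokens and rule names here are hypothetical)
def chooseRule(tokens: List[String]): String =
  tokens.headOption match
    case Some("if")    => "conditional" // lookahead says: parse a conditional
    case Some("while") => "loop"        // lookahead says: parse a loop
    case _             => "expression"  // default production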
Scala has a Source object that allows you to create an Iterator[Char] from files or strings. We will use this as the basis of the input for our parser. Unfortunately, iterators do not allow you to peek or look ahead: once you call next(), you have consumed the next Char. What we need is an iterator that allows you to peek at the next item without consuming it. Here is a PeekableIterator class that wraps an iterator and provides the peek() and poll() methods.
import scala.util.Try

/**
 * Create a peekable iterator to allow the caller to peek() the next item without moving the current position.
 *
 * @param it an Iterator
 * @tparam T the type of value in the iterator
 */
class PeekableIterator[T](private val it: Iterator[T]) extends Peekable[T]:
  // the pre-read item; a Failure marks exhausted input
  private var nextToken: Try[T] = Try(it.next())
  def peek: Try[T] = nextToken
  def poll: Try[T] =
    val currentToken = nextToken
    nextToken = Try(it.next())
    currentToken
PeekableIterator() takes an iterator parameter and immediately calls next() to load the first token. nextToken is a Try[T] so that it captures either the successful token or the failure. peek() returns the nextToken but does not call next() on the iterator. poll(), on the other hand, returns the nextToken, calls next() on the iterator, and moves to the next token.
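The Peekable trait itself is not shown here, but its shape follows from the class: it just declares peek and poll. A minimal sketch of the trait, with a usage example:

import scala.util.Try

// assumed shape of the Peekable trait (implied, but not shown, by the article)
trait Peekable[T]:
  def peek: Try[T]
  def poll: Try[T]

val chars = new PeekableIterator("ab".iterator)
chars.peek // Success(a): inspect the next item without consuming it
chars.peek // Success(a): still the same item
chars.poll // Success(a): consume it and advance
chars.poll // Success(b)
chars.peek // Failure(NoSuchElementException): input is exhausted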
Now that we can look ahead, let's define the goal of the parser. The parser should parse English text and return a list of sentences where each sentence is a list of words. The return type is Vector[Vector[String]], as Vector has better append performance than List. So how do we go from a Peekable[Char] to the Vector[Vector[String]]?
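For example, the (hypothetical) input "Hello world. How are you?" should produce Vector(Vector("Hello", "world"), Vector("How", "are", "you")).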
Most parsers start with a lexical analysis (tokenizing) step and then a parsing step. Our parsing goal is simple and does not require an LL(1) parser, but we use LL(1) parsing techniques to give readers an example they can use for more complicated parsing. Therefore we will process the source file using both a tokenizing step and a parsing step. The tokenizing step is just another parser, but is usually referred to as a lexer. A first implementation could make step 1 a lexer from Char to Vector[String] and step 2 a parser from Vector[String] to Vector[Vector[String]].
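Expressed as signatures, that eager two-step design would look something like this (a sketch; EagerPipeline, lex, and parseTokens are hypothetical names):

trait EagerPipeline:
  // step 1: materialize every token before parsing begins
  def lex(source: Iterator[Char]): Vector[String]
  // step 2: group the tokens into sentences
  def parseTokens(tokens: Vector[String]): Vector[Vector[String]]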
The shortcoming of this design is that the entire file is turned into tokens before any tokens are consumed by the parser. If you try to parse an extremely large file, you could run out of memory building the Vector[String] without ever parsing a single token. It would be more efficient, in both time and space, to transform the source file into tokens in a stream-like way, where tokens are created lazily and consume the source input as needed.
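Scala's iterators already behave this way; a small sketch of that laziness:

// iterator transformations are lazy: no input is read until an element is requested
val lazyChars: Iterator[Char] = "lazy example".iterator
val upper = lazyChars.map(_.toUpper) // nothing has been consumed yet
val first = upper.take(4).mkString   // reads only the first 4 chars: "LAZY"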
Three ways to solve this problem were considered: LazyList, Akka Streams, and chained iterators. The addition of Akka Streams seemed like overkill for a simple parser, but it may be the subject of a future article. LazyList's immutability and memoization made it complicated to create and consume just one token at a time. Hence, I was left with chaining iterators together.
The goal of iterator chaining is to allow one iterator to use another iterator to lazily create its next item. To demonstrate this I will chain an Iterator[Char] to an Iterator[String]. The chaining is done as an apply on the PeekableIterator companion object:
/**
 * Companion object for creating PeekableIterators from an iterator or by chaining iterators.
 */
object PeekableIterator:
  def apply[T](it: Iterator[T]): PeekableIterator[T] = new PeekableIterator[T](it)
  def apply[TInput, TOutput](it1: Iterator[TInput], getNext: PeekableIterator[TInput] => TOutput): PeekableIterator[TOutput] =
    val input = PeekableIterator[TInput](it1)
    PeekableIterator[TOutput](new Iterator[TOutput] {
      // there is another output item as long as the wrapped input has one
      override def hasNext: Boolean = input.peek.isSuccess
      // lazily build the next output item from the wrapped input
      override def next(): TOutput = getNext(input)
    })
The second apply method takes an Iterator and a getNext function, which are used to dynamically create an iterator that chains, or wraps, the first iterator. Its use is shown in the creation of the lexer for the parser:

val lexer = PeekableIterator[Char, String](source, lazyLexer)
lazyLexer is the method used as the getNext() for the dynamically created iterator. It takes a Peekable[Char] as its parameter and returns a String:
// method used to do the actual tokenizing. It builds and returns one token
private def lazyLexer(tokenizer: Peekable[Char]): String =
  // ignore the whitespace and look for an end of sentence
  while tokenizer.peek.isSuccess && (tokenizer.peek.get.isWhitespace || tokenizer.peek.get.isOtherPunctuation) do
    tokenizer.poll
  // represent the end of sentence as an empty String
  if tokenizer.peek.isSuccess && tokenizer.peek.get.isEndOfSentence then
    tokenizer.poll
    return ""
  // if we are at the end of input, pass the failure on
  if tokenizer.peek.isFailure then
    throw tokenizer.peek.failed.get
  // build a word
  val sb = new StringBuilder()
  while tokenizer.peek.isSuccess && !(tokenizer.peek.get.isWhitespace || tokenizer.peek.get.isPunctuation) do
    sb.append(tokenizer.poll.get)
  // return the word
  sb.toString()
end lazyLexer
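The Char helpers isEndOfSentence, isOtherPunctuation, and isPunctuation are not shown in the listing above; a plausible sketch, assuming '.', '!', and '?' terminate a sentence:

// assumed Char classification helpers (not part of the article's listing)
extension (c: Char)
  // characters that terminate a sentence
  def isEndOfSentence: Boolean = c == '.' || c == '!' || c == '?'
  // any punctuation: neither part of a word nor whitespace
  def isPunctuation: Boolean = !c.isLetterOrDigit && !c.isWhitespace
  // punctuation that does not terminate a sentence, e.g. ',' or ';'
  def isOtherPunctuation: Boolean = isPunctuation && !isEndOfSentence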
Finally, the parser takes the words from the lexer and builds a Vector[Vector[String]] to represent a list of sentences. It is written to ignore empty sentences.
def parse(source: Source): Vector[Vector[String]] =
  var sentences: Vector[Vector[String]] = Vector.empty
  var sentence: Vector[String] = Vector.empty
  // create the lexer by chaining the source to lazyLexer
  val lexer = PeekableIterator[Char, String](source, lazyLexer)
  while lexer.peek.isSuccess do
    if lexer.peek.get.isEmpty then
      // an empty token marks the end of a sentence; ignore empty sentences
      if sentence.nonEmpty then
        sentences = sentences :+ sentence
        sentence = Vector.empty
      end if
      lexer.poll
    else
      sentence = sentence :+ lexer.poll.get
    end if
  end while
  // flush a trailing sentence that had no end-of-sentence marker
  if sentence.nonEmpty then
    sentences = sentences :+ sentence
  end if
  sentences
end parse
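Putting it all together, a quick end-to-end run over a short string (assuming the Char helpers sketched above):

import scala.io.Source

@main def demo(): Unit =
  val result = parse(Source.fromString("Hello world. How are you?"))
  println(result) // Vector(Vector(Hello, world), Vector(How, are, you))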
The code from this article can be found here: ll1-parser