Gabriel Lebec

Posted on Nov 2, 2018 • Edited on Jan 28, 2021

Monadic Parser Combinators: an Interactive JS Tutorial (Pt. 1)

#parsers #monads #functional #combinators

Parsing analyzes serial data into a structural result, and is a key step in static code analysis and compilation. Parsers also demonstrate various functional concepts including purity, composition, and monads.

In this interactive tutorial, we will walk through implementing a simple parser combinator library. Combinators allow programmers to easily build up complex programs through the use of a small number of cooperative utility functions.

const color  = P.either(P.literal('red'), P.literal('blue'))
const animal = P.either(P.literal('zebra'), P.literal('giraffe'))
const spaces = P.many1(P.literal(' '))
const tallTale =
    color .chain(c =>
    spaces.chain(_ =>
    animal.chain(a => P.of(`There was a ${c} ${a}.`))))

The intent is partly to learn about parsing and partly to learn about functional programming. No prior experience in either is assumed.

Motivations

Serial data, e.g. from a file or network payload, often must be analyzed into a result before a program can work with it. For example:

HTML is a string format for page contents. Browsers parse HTML into the DOM, a tree of in-memory nodes that can be queried and manipulated to affect the rendered web page.
JSON is a string format for nested data, often used for configuration files or network payloads. Programs can use JSON.parse to convert a JSON string into a JavaScript Object, which can be easily read and updated during runtime.
ESLint is a static code analysis tool for detecting errors and style deviations. ESLint uses a JavaScript parser (Espree) to read the author's program (a text file) and identify potential problems.

In addition, a class of programs called compilers go one step further, folding a parse tree back down into a converted string format. Compilers thus act as translators between strings in one formal language to strings in another.

const codeES2018 = 'num => num + 1'
const codeES5 = compileToES5(codeES2018)
console.log(codeES5) // 'function (num) { return num + 1 }'

V8 is a JavaScript engine that uses a compiler to translate JavaScript into executable machine code.
Babel compiles JavaScript written in modern and/or domain-specific syntax (e.g. ES2018 + JSX) into older syntax (e.g. ES5).
Prettier compiles JavaScript with non-standard formatting into JavaScript with consistent formatting.

In a typical compiler, parsing is referred to as the front end and code generation is called the back end.

    infix       FE (parser)       +      BE (generator)      postfix
 "2 + 3 * 5"         ->          / \           ->          "2 3 5 * +"
                                2   *
                               / \
                              3   5

However, it is also possible to generate results during the parsing step. In that case, no explicit tree is built, though the parser implicitly follows a tree structure while recursing through the input string.

Parsers

Generally, a parser extracts structure from the beginning of some serial input. The input can be an ordinary string, though some parsers may expect a stream (sequentially consumable source, e.g. generator or observable) of tokens (objects recording both linguistic type and lexical content).

Let's start simple and consider a parser to be a function from string to result.

// parse :: String -> *
const parse = s => ?

This begs the question of what our parser produces. The answer depends on the problem. Many parsers produce tree nodes, but parsers can be written to build any result desired, including raw strings, numbers, functions, whatever.

Dealing With Failure

For now let's just capture the literal string value (or lexeme) that our parser matches. Here is an (incomplete) parser intended to match on and result in the lexeme "Triceratops":

// parseTri :: String -> String
const parseTri = s => "Triceratops"

Sadly, this parser is broken. What if our string contains the wrong dinosaur? Parsing "T. Rex" shouldn't result in "Triceratops". We need a way to signal failure.

How would you address this issue? Come up with your own approach; your parsing function should be able to take the following strings and either result in "Triceratops" or else indicate (somehow) that it failed to parse.

Signaling Failure: Maybe?

type Parser = String -> Maybe String

An experienced functional programmer would likely reach for a sum type such as the Maybe monad. However, to properly address that concept now would derail this article substantially, so we will circle back to Maybe later on.

Signaling Failure: Array?

type Parser = String -> [String]

Functional parsers commonly represent their results as lists. Failure can thus be represented by the empty list [], and multiple successes (from an ambiguous grammar) can be captured if necessary. Ambiguous grammars lie outside the scope of this article, so we don't strictly need lists.

Signaling Failure: Null

type Parser = String -> String | Null

Both Maybe and lists are good ways of dealing with failure, but for now we will stick with a method JS developers are familiar with: null. The null value, whose inventor Tony Hoare once called his "billion dollar mistake", has serious pitfalls. However, it is idiomatic, approachable, and will suffice for the moment.

Here is our new parser. Try it on some strings and see what happens:

Dealing with Leftovers

When we parse a result from a string, in most cases we will only consume part of the input. The remainder will have other values we can parse, which means we need to keep track of where to resume. In a mutable setting like JavaScript, one might be tempted to create a shared index value.

let index = 0
const parseTri = s => {
    const remainder = s.slice(index)
    if (remainder.startsWith('Triceratops')) {
        index += 'Triceratops'.length
        return 'Triceratops'
    }
    return null
}

This works fine for parsing a single string:

However, shared mutable state can quickly cause problems. What if we want to subsequently parse a different string?

There are solutions, of course. We could create a Parser class whose instances manage internal state, or a higher-order makeParser function which closes over state. Both of those solutions bury state without actually eliminating it; debugging such hidden state is sometimes even more challenging.

Pure Token Consumption

It turns out that for this problem, we don't actually need mutable state. In functional programming, we prefer pure solutions.

Pure functions are stateless deterministic mappings from input to output, with no side effects. In other words, pure functions have output, but neither depend on nor change the external universe. If you can define your function as a (potentially infinite) table of inputs and outputs, it's pure. Try to induce how f1, f2, and f3 below are defined:

f1 in	f1 out	f2 in	f2 out	f3 in	f3 out
`'hi'`	`2`	`[a, b]`	`b`	`0`	`-2`
`'what'`	`4`	`[0, 9, 1]`	`9`	`1`	`-1`
`'pow'`	`3`	`[0]`	`undefined`	`2`	`2`
`'forest'`	`6`	`[]`	`undefined`	`3`	`7`
`'a'`	`1`	`[1, 2, 3]`	`2`	`4`	`14`
`''`	`0`	`[z, y]`	`y`	`5`	`23`
…	…	…	…	…	…

If a parser needs to indicate where the next parse should commence, that implies that the leftover input should itself be a return value. So, our parsers will return two things: a parsed result, and remaining input.

type Parser = String -> String & String

How can you write a function that returns two things? There is more than one viable way; come up with your own approach below, then read on for our take.

Tuples and Friends

Because functional programming involves a lot of data flowing in and out of functions, functional languages often feature a lightweight data structure for packaging values together: the tuple.

-- Haskell
myTuple = ("hello", 99) -- a 2-tuple
str = fst myTuple -- "hello"
num = snd myTuple -- 99

JavaScript doesn't have tuples, but it does have Objects and Arrays (which are actually a type of Object). We can emulate tuples easily enough:

const myTuple = ['hello', 99]
const str = myTuple[0] // 'hello'
const num = myTuple[1] // false

const myTuple = { str: 'hello', num: 99 }
const { str } = myTuple // 'hello'
const { num } = myTuple // false

Arrays are more concise, but Objects are more expressive. You probably used one of these in your solution above; we'll use Objects.

Side Note: Isomorphisms

n-Tuples and n-element JS Arrays (when used solely to store data) are actually isomorphic. Sets A and B are isomorphic if you can define a pair of functions a2b and b2a, such that:

a2b maps all values in A to values in B
b2a maps all values in B to values in A
b2a(a2b(a)) = a
a2b(b2a(b)) = b

Isomorphisms are a rich topic with some delightful results, but the upshot is that it is nice to know when two data types are equivalent for a given use case.

Thought puzzle: why might two-valued objects not be isomorphic to 2-el arrays? What information would be lost in translation?

Threading Parsers

Our parser can now report which part of the input was not consumed, but how do we actually use that information? The leftover string from one parse becomes the input to the next parse. Complete the code below.

To see our solution, you can log the hidden variable solution.

Generalizing

We have a single parser which can parse "Triceratops" – not very interesting. Write a few parsers for other dinosaurs.

This is pretty repetitive. Let's generalize and make a parseLiteral function.

Solving the above demonstrates a common functional technique. Higher-order functions take and/or return functions themselves; parseLiteral does the latter. Instead of hard-coding individual parsers for every string, parseLiteral generalizes, mapping from strings to parsers:

string	`parseLiteral(string)`
`'hi'`	parser for `'hi'`
`'yo'`	parser for `'yo'`
`'wow'`	parser for `'wow'`
…	…

Different Result Types

Parsers match on strings, but they don't have to result in them. Write parsers for digit strings which result in numbers.

Once we start using our parsers for practical purposes, it will be convenient to transform raw lexemes into processed results specific to our use case.

Types, Signatures, & Aliases

We now have seen two different kinds of parsers: ones that result in strings, and ones that result in numbers.

parseRex  :: String -> { result: String, remainder: String } | Null
parseNum1 :: String -> { result: Number, remainder: String } | Null

We haven't drawn special attention to these comments yet, preferring to teach by example, but these are examples of type signatures. A type signature documents what set a value belongs to. For example, the Bool type is a set of two values: true and false. You can read the notation :: as "has type" or "is in the set":

true  :: Bool -- `true`  is in the set `Bool`
false :: Bool -- `false` is in the set `Bool`

Functions also belong to types. The ! (or "not") function happens to be in the set of functions which map values in Bool to values in Bool. We signify this as (!) :: Bool -> Bool. Here is the ! function defined:

Bool input	Bool output
`true`	`false`
`false`	`true`

Thought puzzle: there are precisely three other functions of type Bool -> Bool. Can you define all of them? Remember, each function will be a table of two rows.

Functions are mappings from one type to another type (e.g. Int -> Bool, String -> String, or Object -> String). Accordingly, functional languages often emphasize their type system. ML and related languages like OCaml, F#, ReasonML, Miranda, Haskell, PureScript, Idris, and Elm are examples of languages with sophisticated typing, including algebraic data types and complete type inference.

JavaScript has a (sometimes infamous) dynamic type system, and lacks an official type notation. The syntax here is borrowed from Haskell and is similar to the system used by Ramda, a JS utility library. It is explained well by both the Ramda wiki and Fantasy Land docs.

We bring this up now because it is beginning to get cumbersome to write out the entire type of our parsers. The parseRex and parseNum1 type signatures differ at only one location, result:

parseRex  :: String -> { result: String, remainder: String } | Null
parseNum1 :: String -> { result: Number, remainder: String } | Null

Accordingly, from now on we will use Parser a to mean "a parsing function whose result is of type a":

type Parser a = String -> { result: a, remainder: String } | Null

This type alias is just a shorthand we will use for documentation purposes. It will clean up our comments considerably and make it easier to talk about parsers. For example, now we can categorize parseRex and parseNum1 more concisely:

parseRex  :: Parser String
parseNum1 :: Parser Number

Try it yourself. Simplify our parseLiteral signature using the shorthand:

Combinators at Last

What if we want to match on more than one possibility? Implement the following parsers. Existing parsers are in scope, use them in your definition:

Again, this is getting repetitive. What does the functional programmer do when faced with repetition? Generalize! Write a function either which takes two parsers and returns a new parser that matches either of them.

Congratulations, you've written a combinator.

Combi-who?

Combinator, like many programming terms, is overloaded – it means different things in different contexts.

Combinatory logic is the study of functions which act on functions. For example, the I combinator (x => x) takes in a function and returns the same function. In this field, "combinator" formally means a function with no free (unbound) variables.
The combinator pattern is a strategy for building flexible libraries, in which a small number of cooperative utility functions build up rich results (which in turn can be new inputs to those same functions).
- For example, CSS "combinators" take in CSS selectors and yield new, more complex selectors. div and h1 are both primitive selectors; using the direct child (>) combinator, you can create the new div > h1 selector.

We will define a parser combinator as a function which takes in one or more parsers, and returns a new parser. Some examples:

either :: (Parser a, Parser b) -> Parser (a | b)
both   :: (Parser a, Parser b) -> Parser [a, b]
any    :: (...Parser *)        -> Parser *
many   ::  Parser a            -> Parser [a]

You've already defined either; take a crack at both and any next. For manual testing, parseLiteral and all of our dino parsers are in scope:

Also, either is in scope if you want it:

This is a powerful concept. Instead of hand-coding ad-hoc parser functions for every conceivable purpose, we can define a few abstract combinators. Feed simple parsers in, get more powerful parsers out; those new powerful parsers can become, in turn, inputs to the same combinators. We can therefore build up rich behavior from a few easy-to-use tools. This is the essence of composition – construction via combination.

Conclusion to Part 1

Below is the current state of our parser combinator library.

// type Parser a = String -> { result: a, remainder: String } | Null

// parseLiteral :: String -> Parser String
const parseLiteral = literal => s => s.startsWith(literal)
    ? { result: literal, remainder: s.slice(literal.length) }
    : null

// either :: (Parser a, Parser b) -> Parser (a | b)
const either = (p1, p2) => s => p1(s) || p2(s)

// both :: (Parser a, Parser b) -> Parser [a, b]
const both = (p1, p2) => s => {
    const r1 = p1(s)
    if (!r1) return null
    const r2 = p2(r1.remainder)
    if (!r2) return null
    return {
        result: [r1.result, r2.result],
        remainder: r2.remainder
    }
}

// any :: (...Parser *) -> Parser *
const any = (...parsers) => parsers.reduce(either)

In this first installment, we covered:

what parsing is
some use cases for parsers
one (non-functional) way to deal with failure
one (functional) way to eliminate state
function purity
using tuples (or arrays/objects) as return values
threading parser results sequentially
types, signatures, and aliases
basic parser combinators

In an upcoming sequel to this article, we will cover some more challenging issues:

dealing with longer sequences
dealing with choice
functors and monads
simulating infix function notation
formal grammars & parsing theory
recursive descent
laziness

Part 2 will be linked here once published.

About the Author

Gabriel Lebec is an instructor for Fullstack Academy of Code, an intensive web applications development program. He has a B.A. in Mathematics and Studio Art, and too many hobbies, including the art-historical study of antique Japanese swords. You can see more by Gabriel at Medium, YouTube, GitHub, and Twitter.

Top comments (5)

Arun Karikalan • Sep 26 '19

Seriously, I am very excited when I read this. please post part 2. I am waiting for that eagerly

Gabriel Lebec • Aug 23 '20 • Edited

NOTE: on 2020-08-13, github.com/forem/forem/issues/9773 noted that some Runkit embeds are failing to load correctly. This article is affected; apologies to any readers who may come here before that issue is resolved.

Gabriel Lebec • Aug 23 '20

NOTE: on 2020-08-13, github.com/forem/forem/issues/9773 noted that some Runkit embeds are failing to load correctly. This article is affected; apologies to any readers while that issue is being resolved.

Pelle Schöld • Oct 25 '19

Really liked this tutorial, keep up the good work. Excited for part 2

Joan Llenas Masó • Sep 16 '19

Very well written and approachable, thanks for putting all this together.
The interactive aspect of it was also very helpful.
Any plans for releasing the 2nd part?