Parsing analyzes serial data into a structural result, and is a key step in static code analysis and compilation. Parsers also demonstrate various functional concepts including purity, composition, and monads.
In this interactive tutorial, we will walk through implementing a simple parser combinator library. Combinators allow programmers to easily build up complex programs through the use of a small number of cooperative utility functions.
const color = P.either(P.literal('red'), P.literal('blue'))
const animal = P.either(P.literal('zebra'), P.literal('giraffe'))
const spaces = P.many1(P.literal(' '))
const tallTale =
color .chain(c =>
spaces.chain(_ =>
animal.chain(a => P.of(`There was a ${c} ${a}.`))))
The intent is partly to learn about parsing and partly to learn about functional programming. No prior experience in either is assumed.
Motivations
Serial data, e.g. from a file or network payload, often must be analyzed into a result before a program can work with it. For example:
- HTML is a string format for page contents. Browsers parse HTML into the DOM, a tree of in-memory nodes that can be queried and manipulated to affect the rendered web page.
-
JSON is a string format for nested data, often used for configuration files or network payloads. Programs can use
JSON.parse
to convert a JSON string into a JavaScript Object, which can be easily read and updated during runtime. - ESLint is a static code analysis tool for detecting errors and style deviations. ESLint uses a JavaScript parser (Espree) to read the author's program (a text file) and identify potential problems.
In addition, a class of programs called compilers go one step further, folding a parse tree back down into a converted string format. Compilers thus act as translators between strings in one formal language to strings in another.
const codeES2018 = 'num => num + 1'
const codeES5 = compileToES5(codeES2018)
console.log(codeES5) // 'function (num) { return num + 1 }'
- V8 is a JavaScript engine that uses a compiler to translate JavaScript into executable machine code.
- Babel compiles JavaScript written in modern and/or domain-specific syntax (e.g. ES2018 + JSX) into older syntax (e.g. ES5).
- Prettier compiles JavaScript with non-standard formatting into JavaScript with consistent formatting.
In a typical compiler, parsing is referred to as the front end and code generation is called the back end.
infix FE (parser) + BE (generator) postfix
"2 + 3 * 5" -> / \ -> "2 3 5 * +"
2 *
/ \
3 5
However, it is also possible to generate results during the parsing step. In that case, no explicit tree is built, though the parser implicitly follows a tree structure while recursing through the input string.
Parsers
Generally, a parser extracts structure from the beginning of some serial input. The input can be an ordinary string, though some parsers may expect a stream (sequentially consumable source, e.g. generator or observable) of tokens (objects recording both linguistic type and lexical content).
Let's start simple and consider a parser to be a function from string to result.
// parse :: String -> *
const parse = s => ?
This begs the question of what our parser produces. The answer depends on the problem. Many parsers produce tree nodes, but parsers can be written to build any result desired, including raw strings, numbers, functions, whatever.
Dealing With Failure
For now let's just capture the literal string value (or lexeme) that our parser matches. Here is an (incomplete) parser intended to match on and result in the lexeme "Triceratops":
// parseTri :: String -> String
const parseTri = s => "Triceratops"
Sadly, this parser is broken. What if our string contains the wrong dinosaur? Parsing "T. Rex" shouldn't result in "Triceratops". We need a way to signal failure.
How would you address this issue? Come up with your own approach; your parsing function should be able to take the following strings and either result in "Triceratops" or else indicate (somehow) that it failed to parse.
Signaling Failure: Maybe?
type Parser = String -> Maybe String
An experienced functional programmer would likely reach for a sum type such as the Maybe
monad. However, to properly address that concept now would derail this article substantially, so we will circle back to Maybe
later on.
Signaling Failure: Array?
type Parser = String -> [String]
Functional parsers commonly represent their results as lists. Failure can thus be represented by the empty list []
, and multiple successes (from an ambiguous grammar) can be captured if necessary. Ambiguous grammars lie outside the scope of this article, so we don't strictly need lists.
Signaling Failure: Null
type Parser = String -> String | Null
Both Maybe
and lists are good ways of dealing with failure, but for now we will stick with a method JS developers are familiar with: null
. The null
value, whose inventor Tony Hoare once called his "billion dollar mistake", has serious pitfalls. However, it is idiomatic, approachable, and will suffice for the moment.
Here is our new parser. Try it on some strings and see what happens:
Dealing with Leftovers
When we parse a result from a string, in most cases we will only consume part of the input. The remainder will have other values we can parse, which means we need to keep track of where to resume. In a mutable setting like JavaScript, one might be tempted to create a shared index
value.
let index = 0
const parseTri = s => {
const remainder = s.slice(index)
if (remainder.startsWith('Triceratops')) {
index += 'Triceratops'.length
return 'Triceratops'
}
return null
}
This works fine for parsing a single string:
However, shared mutable state can quickly cause problems. What if we want to subsequently parse a different string?
There are solutions, of course. We could create a Parser
class whose instances manage internal state, or a higher-order makeParser
function which closes over state. Both of those solutions bury state without actually eliminating it; debugging such hidden state is sometimes even more challenging.
Pure Token Consumption
It turns out that for this problem, we don't actually need mutable state. In functional programming, we prefer pure solutions.
Pure functions are stateless deterministic mappings from input to output, with no side effects. In other words, pure functions have output, but neither depend on nor change the external universe. If you can define your function as a (potentially infinite) table of inputs and outputs, it's pure. Try to induce how f1
, f2
, and f3
below are defined:
f1 in | f1 out | f2 in | f2 out | f3 in | f3 out |
---|---|---|---|---|---|
'hi' |
2 |
[a, b] |
b |
0 |
-2 |
'what' |
4 |
[0, 9, 1] |
9 |
1 |
-1 |
'pow' |
3 |
[0] |
undefined |
2 |
2 |
'forest' |
6 |
[] |
undefined |
3 |
7 |
'a' |
1 |
[1, 2, 3] |
2 |
4 |
14 |
'' |
0 |
[z, y] |
y |
5 |
23 |
… | … | … | … | … | … |
If a parser needs to indicate where the next parse should commence, that implies that the leftover input should itself be a return value. So, our parsers will return two things: a parsed result, and remaining input.
type Parser = String -> String & String
How can you write a function that returns two things? There is more than one viable way; come up with your own approach below, then read on for our take.
Tuples and Friends
Because functional programming involves a lot of data flowing in and out of functions, functional languages often feature a lightweight data structure for packaging values together: the tuple.
-- Haskell
myTuple = ("hello", 99) -- a 2-tuple
str = fst myTuple -- "hello"
num = snd myTuple -- 99
JavaScript doesn't have tuples, but it does have Objects and Arrays (which are actually a type of Object). We can emulate tuples easily enough:
const myTuple = ['hello', 99]
const str = myTuple[0] // 'hello'
const num = myTuple[1] // false
const myTuple = { str: 'hello', num: 99 }
const { str } = myTuple // 'hello'
const { num } = myTuple // false
Arrays are more concise, but Objects are more expressive. You probably used one of these in your solution above; we'll use Objects.
Side Note: Isomorphisms
n-Tuples and n-element JS Arrays (when used solely to store data) are actually isomorphic. Sets A
and B
are isomorphic if you can define a pair of functions a2b
and b2a
, such that:
-
a2b
maps all values inA
to values inB
-
b2a
maps all values inB
to values inA
b2a(a2b(a)) = a
a2b(b2a(b)) = b
Isomorphisms are a rich topic with some delightful results, but the upshot is that it is nice to know when two data types are equivalent for a given use case.
Thought puzzle: why might two-valued objects not be isomorphic to 2-el arrays? What information would be lost in translation?
Threading Parsers
Our parser can now report which part of the input was not consumed, but how do we actually use that information? The leftover string from one parse becomes the input to the next parse. Complete the code below.
To see our solution, you can log the hidden variable solution
.
Generalizing
We have a single parser which can parse "Triceratops" – not very interesting. Write a few parsers for other dinosaurs.
This is pretty repetitive. Let's generalize and make a parseLiteral
function.
Solving the above demonstrates a common functional technique. Higher-order functions take and/or return functions themselves; parseLiteral
does the latter. Instead of hard-coding individual parsers for every string, parseLiteral
generalizes, mapping from strings to parsers:
string | parseLiteral(string) |
---|---|
'hi' |
parser for 'hi'
|
'yo' |
parser for 'yo'
|
'wow' |
parser for 'wow'
|
… | … |
Different Result Types
Parsers match on strings, but they don't have to result in them. Write parsers for digit strings which result in numbers.
Once we start using our parsers for practical purposes, it will be convenient to transform raw lexemes into processed results specific to our use case.
Types, Signatures, & Aliases
We now have seen two different kinds of parsers: ones that result in strings, and ones that result in numbers.
parseRex :: String -> { result: String, remainder: String } | Null
parseNum1 :: String -> { result: Number, remainder: String } | Null
We haven't drawn special attention to these comments yet, preferring to teach by example, but these are examples of type signatures. A type signature documents what set a value belongs to. For example, the Bool
type is a set of two values: true
and false
. You can read the notation ::
as "has type" or "is in the set":
true :: Bool -- `true` is in the set `Bool`
false :: Bool -- `false` is in the set `Bool`
Functions also belong to types. The !
(or "not") function happens to be in the set of functions which map values in Bool
to values in Bool
. We signify this as (!) :: Bool -> Bool
. Here is the !
function defined:
Bool input | Bool output |
---|---|
true |
false |
false |
true |
Thought puzzle: there are precisely three other functions of type Bool -> Bool
. Can you define all of them? Remember, each function will be a table of two rows.
Functions are mappings from one type to another type (e.g. Int -> Bool
, String -> String
, or Object -> String
). Accordingly, functional languages often emphasize their type system. ML and related languages like OCaml, F#, ReasonML, Miranda, Haskell, PureScript, Idris, and Elm are examples of languages with sophisticated typing, including algebraic data types and complete type inference.
JavaScript has a (sometimes infamous) dynamic type system, and lacks an official type notation. The syntax here is borrowed from Haskell and is similar to the system used by Ramda, a JS utility library. It is explained well by both the Ramda wiki and Fantasy Land docs.
We bring this up now because it is beginning to get cumbersome to write out the entire type of our parsers. The parseRex
and parseNum1
type signatures differ at only one location, result
:
parseRex :: String -> { result: String, remainder: String } | Null
parseNum1 :: String -> { result: Number, remainder: String } | Null
Accordingly, from now on we will use Parser a
to mean "a parsing function whose result
is of type a
":
type Parser a = String -> { result: a, remainder: String } | Null
This type alias is just a shorthand we will use for documentation purposes. It will clean up our comments considerably and make it easier to talk about parsers. For example, now we can categorize parseRex
and parseNum1
more concisely:
parseRex :: Parser String
parseNum1 :: Parser Number
Try it yourself. Simplify our parseLiteral
signature using the shorthand:
Combinators at Last
What if we want to match on more than one possibility? Implement the following parsers. Existing parsers are in scope, use them in your definition:
Again, this is getting repetitive. What does the functional programmer do when faced with repetition? Generalize! Write a function either
which takes two parsers and returns a new parser that matches either of them.
Congratulations, you've written a combinator.
Combi-who?
Combinator, like many programming terms, is overloaded – it means different things in different contexts.
-
Combinatory logic is the study of functions which act on functions. For example, the
I
combinator (x => x
) takes in a function and returns the same function. In this field, "combinator" formally means a function with no free (unbound) variables. -
The combinator pattern is a strategy for building flexible libraries, in which a small number of cooperative utility functions build up rich results (which in turn can be new inputs to those same functions).
- For example, CSS "combinators" take in CSS selectors and yield new, more complex selectors.
div
andh1
are both primitive selectors; using the direct child (>
) combinator, you can create the newdiv > h1
selector.
- For example, CSS "combinators" take in CSS selectors and yield new, more complex selectors.
We will define a parser combinator as a function which takes in one or more parsers, and returns a new parser. Some examples:
either :: (Parser a, Parser b) -> Parser (a | b)
both :: (Parser a, Parser b) -> Parser [a, b]
any :: (...Parser *) -> Parser *
many :: Parser a -> Parser [a]
You've already defined either
; take a crack at both
and any
next. For manual testing, parseLiteral
and all of our dino parsers are in scope:
Also, either
is in scope if you want it:
This is a powerful concept. Instead of hand-coding ad-hoc parser functions for every conceivable purpose, we can define a few abstract combinators. Feed simple parsers in, get more powerful parsers out; those new powerful parsers can become, in turn, inputs to the same combinators. We can therefore build up rich behavior from a few easy-to-use tools. This is the essence of composition – construction via combination.
Conclusion to Part 1
Below is the current state of our parser combinator library.
// type Parser a = String -> { result: a, remainder: String } | Null
// parseLiteral :: String -> Parser String
const parseLiteral = literal => s => s.startsWith(literal)
? { result: literal, remainder: s.slice(literal.length) }
: null
// either :: (Parser a, Parser b) -> Parser (a | b)
const either = (p1, p2) => s => p1(s) || p2(s)
// both :: (Parser a, Parser b) -> Parser [a, b]
const both = (p1, p2) => s => {
const r1 = p1(s)
if (!r1) return null
const r2 = p2(r1.remainder)
if (!r2) return null
return {
result: [r1.result, r2.result],
remainder: r2.remainder
}
}
// any :: (...Parser *) -> Parser *
const any = (...parsers) => parsers.reduce(either)
In this first installment, we covered:
- what parsing is
- some use cases for parsers
- one (non-functional) way to deal with failure
- one (functional) way to eliminate state
- function purity
- using tuples (or arrays/objects) as return values
- threading parser results sequentially
- types, signatures, and aliases
- basic parser combinators
In an upcoming sequel to this article, we will cover some more challenging issues:
- dealing with longer sequences
- dealing with choice
- functors and monads
- simulating infix function notation
- formal grammars & parsing theory
- recursive descent
- laziness
Part 2 will be linked here once published.
About the Author
Gabriel Lebec is an instructor for Fullstack Academy of Code, an intensive web applications development program. He has a B.A. in Mathematics and Studio Art, and too many hobbies, including the art-historical study of antique Japanese swords. You can see more by Gabriel at Medium, YouTube, GitHub, and Twitter.
Top comments (5)
NOTE: on 2020-08-13, github.com/forem/forem/issues/9773 noted that some Runkit embeds are failing to load correctly. This article is affected; apologies to any readers while that issue is being resolved.
NOTE: on 2020-08-13, github.com/forem/forem/issues/9773 noted that some Runkit embeds are failing to load correctly. This article is affected; apologies to any readers who may come here before that issue is resolved.
Seriously, I am very excited when I read this. please post part 2. I am waiting for that eagerly
Really liked this tutorial, keep up the good work. Excited for part 2
Very well written and approachable, thanks for putting all this together.
The interactive aspect of it was also very helpful.
Any plans for releasing the 2nd part?