DEV Community

loading...
Kabisa Software Artisans

Learning functional programming #4: Parsing strings and numbers

Matthijs Groen
Originally published at kabisa.nl on ・10 min read

This post is part 4 of the series “Learning Functional Programming using JavaScript”. When the next part will be published, a link to part 5 will be added here in the top.

Earlier parts: part 1, part 2, part 3.

This is a continuation of the Functional Programming challenge I gave myself: Write a JSON parser in JavaScript, usingthe parser combinator principles from the blogpost of Scott Wlaschin.

So far we defined the challenge, added some rules to adhere, created a character parser and we are able to parse the first literals of JSON: null, true andfalse.

Quoted string parser

Its time for the next type to parse: Strings.

An EBNF notation of a JSON String looks like this:


1
2
3
4
5
6

string = '"',
  { ? Any unicode character except " or \ or control character ?
  | "\", (
    '"' | "\" | "/" | "b" | "f" | "n" | "r" | "t" |
    "u", 4 * ? hexadeximal digit ?
  ) }, '"';

Enter fullscreen mode Exit fullscreen mode

Since our example data does not have \uXXXX unicode in it, I will skip the implementation of it.

Let us start with parsing the quote:


1
2

const quoteParser = characterParser('"');
console.log(quoteParser('"test"')); // ['"', [ 't', 'e', 's', 't', '"'] ]

Enter fullscreen mode Exit fullscreen mode

The string could contain any character, so having to define them all using thecharacterParser would be madness. Time to change the strategy here, by changing the characterParser into a satisfy method, that parses / consumes the head of our stream when it satisfies a predicate.

What is a predicate?

A Predicate is a function that will simply return true or false based on the input. It is commonly used in [].filter(), [].some() and [].every().


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

const parseError = (message) => (error) => [FAILED, message, error];
const genericParseError = parseError("Error parsing:");

const satisfy = (predicate) => ([head, ...tail]) =>
  head
    ? predicate(head)
      ? [head, tail]
      : genericParseError(`Unexpected '${head}'`)
    : genericParseError("Unexpected EOF");

const addLabel = (label) => (parser) => (stream) =>
  onFailure(parser(stream))(([, , error]) =>
    parseError(`Error parsing '${label}':`)(error)
  );

const characterParser = (character) =>
  addLabel(character)(satisfy((c) => c === character));

Enter fullscreen mode Exit fullscreen mode

Since we now use a predicate that needs to return a true or false to determine if the head of the data must be consumed, it is no longer possible to say which character we were trying to parse. I updated the error message therefor to a real generic one: “Error parsing:”.

After adjusting the addLabel as well, the characterParser would still stay compatible. And now we can create a new parser. The stringCharParser that would ‘satisfy’ on each character except the double quote ".


1
2
3
4
5
6

const stringCharParser = satisfy((c) => c !== '"');

const quotedStringParser = andThenRight(quoteParser)(stringCharParser);

console.log(quotedStringParser('"test"')); // ['t', [ 'e', 's', 't', '"'] ]
console.log(quotedStringParser('""')); // [Symbol(Failed), 'Error parsing:', `Unexpected '"'`]

Enter fullscreen mode Exit fullscreen mode

We could repeat a parser until it fails, to parse all the characters in the string. If it fails, we will return an empty result, to allow empty strings, and create the repeat process using recursion:


1
2

const many = (parser) => (stream) =>
  orElse(andThen(parser)(many(parser)))(resultParser([]))(stream);

Enter fullscreen mode Exit fullscreen mode

The parser will try to apply itself, and on success call itself to try the next character. We are breaking one of the rules of the challenge now: Recursion to own function. For now, I just want to focus on completing the JSON parser, so I will park this challenge for later.

We are able to parse the whole (test) string now:


1
2
3

const quotedStringParser = andThenRight(quoteParser)(many(stringCharParser));

console.log(quotedStringParser('"test"')); // [[ 't', [ 'e', [Array] ] ], ['"'] ]

Enter fullscreen mode Exit fullscreen mode

now only parsing the right quote, by adding a andThenLeft parser:


1
2
3
4
5
6
7
8

const andThenLeft = (parserA) => (parserB) =>
  mapResult((result) => result[0])(andThen(parserA)(parserB));

const quotedStringParser = andThenLeft(
  andThenRight(quoteParser)(many(stringCharParser))
)(quoteParser);

console.log(quotedStringParser('"test"')); // [[ 't', [ 'e', [Array] ] ], [] ]

Enter fullscreen mode Exit fullscreen mode

The whole string is parsed, but the result is not a nice string as hoped. It has a nesting structure, caused by the andThen construct.


1

["t", ["e", ["s", "t"]]];

Enter fullscreen mode Exit fullscreen mode

Lets change that, using the earlier built mapResult:


1

const toList = mapResult((result) => [result[0]].concat(result[1]));

Enter fullscreen mode Exit fullscreen mode

And now wrap it around the andThen call in the many parser, to keep flattening the result:


1
2
3
4

const many = (parser) => (stream) =>
  orElse(toList(andThen(parser)(many(parser))))(resultParser([]))(stream);

console.log(quotedStringParser('"test"')); // [[ 't', 'e', 's', 't'], [] ]

Enter fullscreen mode Exit fullscreen mode

Much better! Now add another mapResult in the quotedStringParser to join all the characters:


1
2
3
4
5
6
7
8

const toString = mapResult((result) => result.join(""));

const quotedStringParser = andThenLeft(
  andThenRight(quoteParser)(toString(many(stringCharParser)))
)(quoteParser);

console.log(quotedStringParser('"test"')); // ['test', [] ]
console.log(quotedStringParser('""')); // ['', [] ]

Enter fullscreen mode Exit fullscreen mode

Simple strings can now be parsed, time for some refactoring:


1
2
3
4
5
6

const between = (parserLeft) => (parserMiddle) => (parserRight) =>
  andThenLeft(andThenRight(parserLeft)(parserMiddle))(parserRight);

const quotedStringParser = between(quoteParser)(
  toString(many(stringCharParser))
)(quoteParser);

Enter fullscreen mode Exit fullscreen mode

The currying of between seems logical, because it semantically looks like a string parser between quote parsers. But I found out it is more logical to have the argument that could change the most to be at the end. This way you can profit the most from “partially applied functions”.


1
2
3
4
5
6

const between = (parserLeft) => (parserRight) => (parserMiddle) =>
  andThenLeft(andThenRight(parserLeft)(parserMiddle))(parserRight);

const quotedStringParser = between(quoteParser)(quoteParser)(
  toString(many(stringCharParser))
);

Enter fullscreen mode Exit fullscreen mode

Lets say we would also like to create a quoted boolean parser, we could do it more easily when the middle section would be its last argument:


1
2
3
4

const quotedParser = between(quoteParser)(quoteParser);

const quotedStringParser = quotedParser(toString(many(stringCharParser)));
const quotedBooleanParser = quotedParser(boolParser);

Enter fullscreen mode Exit fullscreen mode

Lets continue with the escape characters of our string:


1
2

console.log(quotedStringParser('"test\\nmultiple \\"lines\\""'));
// ['test\\nmultiple \\', [ 'l', 'i', 'n', 'e', 's', '\\', '"', '"'] ]

Enter fullscreen mode Exit fullscreen mode

We could start by listing all special characters we have, and what result they should produce. Then create a choice parser, that is a orElse for a list of parsers. (Just like chain is for the andThen)


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22

const quoteParser = characterParser('"');
const stringCharParser = satisfy((c) => c !== '"' && c !== "\\");
const specialCharacters = [
  ['\\"', '"'],
  ["\\\\", "\\"],
  ["\\/", "/"],
  ["\\b", "\b"],
  ["\\f", "\f"],
  ["\\n", "\n"],
  ["\\r", "\r"],
  ["\\t", "\t"],
];
const choice = (parsers) => parsers.reduce((a, b) => orElse(a)(b));
const escapedCharParser = choice(
  specialCharacters.map(([match, result]) => parseStringResult(match)(result))
);

const quotedStringParser = between(quoteParser)(quoteParser)(
  toString(many(orElse(stringCharParser)(escapedCharParser)))
);
console.log(quotedStringParser('"test\\nmultiple \\"lines\\""'));
// ['test\nmultiple "lines"', [] ]

Enter fullscreen mode Exit fullscreen mode

So the string was in there as well now. I really like how the chain andchoice are alike:


1
2

const chain = (parsers) => parsers.reduce((a, b) => andThen(a)(b));
const choice = (parsers) => parsers.reduce((a, b) => orElse(a)(b));

Enter fullscreen mode Exit fullscreen mode

What I did not like however, was that the function passed into the .reduce had to accept 2 parameters. Time to force us into another direction, by adding yet another selector to our no-restricted-syntax eslint rule, in ourpackage.json:


1
2
3
4
5
6
7
8
9
10
11
12
13

{
  "eslintConfig": {
    "rules": {
      "no-restricted-syntax": [
        "error",
        {
          "selector": "ArrowFunctionExpression[params.length > 1]",
          "message": "Only 1 argument allowed. Please use currying"
        }
      ]
    }
  }
}

Enter fullscreen mode Exit fullscreen mode

The following lines are now giving errors:


1
2

const chain = (parsers) => parsers.reduce((a, b) => andThen(a)(b));
const choice = (parsers) => parsers.reduce((a, b) => orElse(a)(b));

Enter fullscreen mode Exit fullscreen mode

So we build our own, recursive reducer:


1
2
3
4
5
6
7

const doReduce = (reducer) => (start) => ([head, ...tail]) =>
  head ? doReduce(reducer)(reducer(start)(head))(tail) : start;

const reduce = (reducer) => ([head, ...tail]) => doReduce(reducer)(head)(tail);

console.log(doReduce((a) => (b) => a + b)(1)([2, 3, 4])); // 10
console.log(reduce((a) => (b) => a + b)([1, 2, 3, 4])); // 10

Enter fullscreen mode Exit fullscreen mode

The doReduce function will accept a reducer, a start value and a list. The reducer should have the signature acc => elem => {}. The reduce function is a convenient method to use the first value of the list as initial accumulated value.

Having our new reducer function, we are now actually able to write:


1
2

const chain = (list) => reduce((a) => (b) => andThen(a)(b))(list);
const choice = (list) => reduce((a) => (b) => orElse(a)(b))(list);

Enter fullscreen mode Exit fullscreen mode

Which is actually the same as:


1
2

const chain = reduce(andThen);
const choice = reduce(orElse);

Enter fullscreen mode Exit fullscreen mode

🤯

The Number parser

An EBNF notation of a JSON Number looks like this:


1
2
3
4

number = ["-"],
  ( "0" | ? digit 1-9 ?, { ? digit ? } ),
  [".", ? digit ?, { ? digit ? }],
  [( "e" | "E" ), [ "+" | "-"], ? digit ?, { ? digit ? } ];

Enter fullscreen mode Exit fullscreen mode

Lets idenfity the elements:

  • An optional sign “-“
  • An integer part
  • An optional fractional part
  • An optional exponent part

A number has a lot of optional parts, and a lot of ‘choices’ within them.

Time to introduce new parser tools for them:

  • AnyOf. Each character in the string is becoming an allowed character. This will help us define ‘digits’
  • Opt. If parsing fails, succeed anyway. (Thus making the first parse optional)

1
2

const anyOf = (string) => choice([...string].map(characterParser));
const opt = (parser) => orElse(parser)(resultParser([]));

Enter fullscreen mode Exit fullscreen mode

We can now create the sign parser:


1

const optSign = opt(characterParser("-"));

Enter fullscreen mode Exit fullscreen mode

And the integer parser:


1
2
3
4
5
6

const zero = characterParser("0");
const digitOneNine = anyOf("123456789");
const digit = anyOf("0123456789");

const nonZeroInt = toString(andThen(digitOneNine)(toString(many(digit))));
const intPart = orElse(nonZeroInt)(zero);

Enter fullscreen mode Exit fullscreen mode

We use the toString to concat the results together. And we will use chain to add our individual pieces into a single parser:


1
2
3
4
5
6
7

const numberParser = chain([optSign, intPart]);

console.log(numberParser("-123.45e6"));
// [[ '-', '123'], ['.', '4', '5', 'e', '6'] ]

console.log(numberParser("123.45e-6"));
// [[ [], '123' ], ['.', '4', '5', 'e', '-', '6'] ]

Enter fullscreen mode Exit fullscreen mode

Now the fraction part:


1
2
3
4
5
6
7
8
9
10

const some = (parser) => toList(andThen(parser)(many(parser)));
const point = characterParser(".");
const fractionPart = toString(andThen(point)(toString(some(digit))));
const numberParser = chain([optSign, intPart, opt(fractionPart)]);

console.log(numberParser("-123.45e6"));
// [[ [ '-', '123'], '.45' ], ['e', '6'] ]

console.log(numberParser("123.45e-6"));
// [[ [ [], '123' ], '.45' ], ['e', '-', '6'] ]

Enter fullscreen mode Exit fullscreen mode

Now only the exponent part:


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

const e = anyOf("eE");
const optPlusMinus = opt(anyOf("+-"));
const exponentPart = toString(chain([e, optPlusMinus, toString(some(digit))]));

const numberParser = chain([
  optSign,
  intPart,
  opt(fractionPart),
  opt(exponentPart),
]);

console.log(numberParser("-123.45e6"));
// [[ [ [Array], '.45' ], 'e,6' ], [] ]

console.log(numberParser("123.45e-6"));
// [[ [ [Array], '.45' ], 'e,-6' ], [] ]

Enter fullscreen mode Exit fullscreen mode

The exponent part is not quite what we expected. It has returned e,6 instead of e6. This has to do with the toString over the chain function. The rest of the result is also acting like a russion doll. We need to flatten te result during the chain process, to prevent the nested andThen constructs to produce this nested result.


1
2
3
4
5
6
7
8
9
10
11

const chain = reduce((parserA) => (parserB) =>
  mapResult(([resultA, resultB]) => [...resultA, resultB])(
    andThen(parserA)(parserB)
  )
);

console.log(numberParser("-123.45e6"));
// [[ '-', '123', '.45', 'e6'], [] ]

console.log(numberParser("123.45e-6"));
// [[ '123', '.45', 'e-6'], [] ]

Enter fullscreen mode Exit fullscreen mode

Much better! now we should convert it to an actual JavaScript number. To make it easy for use, we will use the Number() to typecast our parsed string to a number. The final number parser now looks like this:


1
2
3
4
5
6
7
8
9
10
11

const numberParser = addLabel("number")(
  mapResult((result) => Number(result))(
    toString(chain([optSign, intPart, opt(fractionPart), opt(exponentPart)]))
  )
);

console.log(numberParser("-123.45e6"));
// [-123450000, [] ]

console.log(numberParser("123.45e-6"));
// [0.00012345, [] ]

Enter fullscreen mode Exit fullscreen mode

This is one seemed quite easy to build, even if a number has a lot of seperate parts.

The implementation so far:


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132

const FAILED = Symbol("Failed");
const PARSED = 0;
const REMAINING = 1;

const parseError = (message) => (error) => [FAILED, message, error];
const genericParseError = parseError("Error parsing:");

const satisfy = (predicate) => ([head, ...tail]) =>
  head
    ? predicate(head)
      ? [head, tail]
      : genericParseError(`Unexpected '${head}'`)
    : genericParseError("Unexpected EOF");

const onSuccess = (result) => (next) =>
  result[PARSED] !== FAILED ? next(result) : result;

const onFailure = (result) => (next) =>
  result[PARSED] === FAILED ? next(result) : result;

const addLabel = (label) => (parser) => (stream) =>
  onFailure(parser(stream))(([, , error]) =>
    parseError(`Error parsing '${label}':`)(error)
  );

const characterParser = (character) =>
  addLabel(character)(satisfy((c) => c === character));

const combineResult = (resultA) => (resultB) =>
  onSuccess(resultB)(() => [
    [resultA[PARSED], resultB[PARSED]],
    resultB[REMAINING],
  ]);

const andThen = (parserA) => (parserB) => (stream) =>
  onSuccess(parserA(stream))((result) =>
    combineResult(result)(parserB(result[REMAINING]))
  );

const orElse = (parserA) => (parserB) => (stream) =>
  onFailure(parserA(stream))(() => parserB(stream));

const resultParser = (result) => (stream) => [result, stream];

const mapResult = (transform) => (parser) => (stream) =>
  onSuccess(parser(stream))((result) => [
    transform(result[PARSED]),
    result[REMAINING],
  ]);

const andThenLeft = (parserA) => (parserB) =>
  mapResult((result) => result[0])(andThen(parserA)(parserB));

const andThenRight = (parserA) => (parserB) =>
  mapResult((result) => result[1])(andThen(parserA)(parserB));

const between = (parserLeft) => (parserRight) => (parserMiddle) =>
  andThenLeft(andThenRight(parserLeft)(parserMiddle))(parserRight);

const doReduce = (reducer) => (start) => ([head, ...tail]) =>
  head ? doReduce(reducer)(reducer(start)(head))(tail) : start;
const reduce = (reducer) => ([head, ...tail]) => doReduce(reducer)(head)(tail);

const chain = reduce((parserA) => (parserB) =>
  mapResult(([resultA, resultB]) => [...resultA, resultB])(
    andThen(parserA)(parserB)
  )
);
const choice = reduce(orElse);

const toList = mapResult((result) => [result[0]].concat(result[1]));
const toString = mapResult((result) => result.join(""));

const many = (parser) => (stream) =>
  orElse(toList(andThen(parser)(many(parser))))(resultParser([]))(stream);

const stringParser = (string) =>
  addLabel(string)(chain([...string].map((char) => characterParser(char))));

const parseStringResult = (string) => (result) =>
  andThenRight(stringParser(string))(resultParser(result));

const nullParser = parseStringResult("null")(null);
const boolParser = orElse(parseStringResult("true")(true))(
  parseStringResult("false")(false)
);

const quoteParser = characterParser('"');
const stringCharParser = satisfy((c) => c !== '"' && c !== "\\");
const specialCharacters = [
  ['\\"', '"'],
  ["\\\\", "\\"],
  ["\\/", "/"],
  ["\\b", "\b"],
  ["\\f", "\f"],
  ["\\n", "\n"],
  ["\\r", "\r"],
  ["\\t", "\t"],
];
const escapedCharParser = choice(
  specialCharacters.map(([match, result]) => parseStringResult(match)(result))
);

const quotedStringParser = addLabel("string")(
  between(quoteParser)(quoteParser)(
    toString(many(orElse(stringCharParser)(escapedCharParser)))
  )
);

const anyOf = (string) => choice([...string].map(characterParser));
const opt = (parser) => orElse(parser)(resultParser([]));
const some = (parser) => toList(andThen(parser)(many(parser)));
const optSign = opt(characterParser("-"));

const digitOneNine = anyOf("123456789");
const digit = anyOf("0123456789");

const point = characterParser(".");
const optPlusMinus = opt(anyOf("+-"));
const e = anyOf("eE");

const zero = characterParser("0");
const nonZeroInt = toString(andThen(digitOneNine)(toString(many(digit))));
const intPart = orElse(nonZeroInt)(zero);
const fractionPart = toString(andThen(point)(toString(some(digit))));
const exponentPart = toString(chain([e, optPlusMinus, toString(some(digit))]));

const numberParser = addLabel("number")(
  mapResult((result) => Number(result))(
    toString(chain([optSign, intPart, opt(fractionPart), opt(exponentPart)]))
  )
);

Enter fullscreen mode Exit fullscreen mode

Conclusion

The code so far:

  • We changed our basic characterParser to a more generic satisfy function, and build a new characterParser upon that.
  • We added more tools to the parsing family:
    • some (one or more)
    • many (zero or more)
    • opt (zero or one)
    • anyOf (choice of character parsers)
  • We can now parse all ‘simple’ types in JSON. null, true, false, stringand number.
  • We refactored reduce into a curried function
  • Our parsers now produce a result based on the dynamic input, and of the proper type.
  • Check the full code here

What I learned so far:

  • Changing earlier basic functions around is quite easy
  • We are mainly building ‘tools’ to make a parser, instead of building the actual parser. The JSON specific part is still really small

What did I like:

  • When building it you really feel a gain in speed. We are working with bigger and bigger parts. It feels like ages when we first started with parsing a single character
  • The structure of the JSON part implementation is quite on-par with the EBNF definition
  • Changing the reduce into a curried version was really nice. If actually benefitted readability when being able to write: reduce(orElse)
  • It starts to feel more and more normal to only have a single argument for a function

What did I not like:

  • We should also update the .map and .join to curried versions

Next time:

  • Enforcing the single argument rule
  • Parsing arrays of values and objects

Discussion (0)