DEV Community

Cover image for I made my own JSON Parser
Hetarth Shah
Hetarth Shah

Posted on

I made my own JSON Parser

✏️Preface

I was always interested in how programming languages worked. To be specific, how what we write in a human readable way is interpreted my the machine. I did have an bird's eye view of what would happen, but it was and still kind of is like black magic to me. So, now I have decided to learn these magic verses and this is my first step towards it.

ℹ️First Steps

To not overwhelm myself, I wanted to start with something easy and which is used almost everyday. This way, I will be able to understand, solve it and actually have motivation to move forward rather than just learning amidst the hype is going on in my mind and then when it shimmers down I would eventually drop it.

So, I decided to write a JSON parser.

The goal was simple:

  • Input a string

  • Output should be JSON or Error (depending on the input)

I wanted to start with golang but eventually chose to go with javascript, the reason being simply that I do not wanted to divide my focus on learning many things at once. Hence, javascript it is.

⚠️I had no idea

Well, at first I tried doing it myself without having any idea of how to do. Pretty quickly, I was deep into for loops and regex territory. My intention was never to be able to solve this on my own but to actually learn where and what things I get stuck on. This way, when I actually refer something I can make a bridge from my thoughts and what the actual logic is.

After struggling for a while I started to notice what problems needed to be solved.

  1. I needed some way to proper differentiate different kinds of words/symbols/values.

  2. After differentiating them, I needed to validate each of them and put them back together.

👾Wow, this is interesting as hell!

Ok, now I had a basic understanding of what I needed to do and my mind was warmed up enough. Alright, let's start digging up the internet. After a while, I stumbled upon an article from Oz named Write Your Own JSON Parser with Node and Typescript.

This article does a fantastic job to decode the logic in a very step by step manner. I encourage you to give this one a read.

So basically, we need a tokenizer and parser . Tokenizer will identify the different types of token. Hey, this solves one of my problems. Ok, so rather than using different terms like words, symbols, characters we call them tokens. Seeing an example json there are many different kinds of tokens which we need to identify.

{
  "id": "647ceaf3657eade56f8224eb",
  "index": 0,
  "something": [],
  "boolean": true,
  "nullValue": null
}
Enter fullscreen mode Exit fullscreen mode

There is { (Open Brace), } (Close Brace), [ (Open Bracket), ] (Close Bracket), strings like "id" , : colon, , commas and much more.

Alright, now if we were to tokenize this we should have information for basically two things,

  1. Type of Token

  2. Value of Token

We would store this array of tokens and then feed this to the parser which will then take it forward. So our end result should look something like this,

[
  { type: "BraceOpen", value: "{" },
  { type: "String", value: "id" },
  { type: "Colon", value: ":" },
  { type: "String", value: "647ceaf3657eade56f8224eb" },
  { type: "Comma", value: "," },
  { type: "String", value: "index" },
  { type: "Colon", value: ":" },
  { type: "Number", value: "0" },
  { type: "Comma", value: "," },
  { type: "String", value: "something" },
  { type: "Colon", value: ":" },
  { type: "BracketOpen", value: "[" },
  { type: "BracketClose", value: "]" },
  { type: "Comma", value: "," },
  { type: "String", value: "boolean" },
  { type: "Colon", value: ":" },
  { type: "True", value: "true" },
  { type: "Comma", value: "," },
  { type: "String", value: "nullValue" },
  { type: "Colon", value: ":" },
  { type: "Null", value: "null" },
  { type: "BraceClose", value: "}" },
];
Enter fullscreen mode Exit fullscreen mode

🏷️Tokenizing

We will make a function named tokenizer which will take a string as an input and give us an array of token objects. We will iterate the string character by character and classify it based on some logic.

Oz's article does a very nice job of explaining the logic for token classification. However, it still skips out on somethings. The main fun which I had in making this was tackling the issues which the article didn't handled.

But let's start with the basics here, first we need a variable to track our position in the string and an array to store different token objects.

/**
 * JSON Tokenizer
 *
 * @param { String } string
 * @returns { any[] }
 */
function tokenizer(string) {
    let current = 0;
    const tokens = [];

    while (current < string.length) {
        let char = string[current];
        current++;
    }

    return tokens;
}
Enter fullscreen mode Exit fullscreen mode

Now we will add some basic conditions to extract of tokens like { , }, etc.

/**
 * JSON Tokenizer
 *
 * @param { String } string
 * @returns { any[] }
 */
function tokenizer(string) {
    let current = 0;
    const tokens = [];

    while (current < string.length) {
        let char = string[current];

        if (char === "{") {
            tokens.push({ type: "BraceOpen", value: char });
            current++;
            continue;
        }

        if (char === "}") {
            tokens.push({ type: "BraceClose", value: char });
            current++;
            continue;
        }

        if (char === "[") {
            tokens.push({ type: "BracketOpen", value: char });
            current++;
            continue;
        }

        if (char === "]") {
            tokens.push({ type: "BracketClose", value: char });
            current++;
            continue;
        }

        if (char === ":") {
            tokens.push({ type: "Colon", value: char });
            current++;
            continue;
        }

        if (char === ",") {
            tokens.push({ type: "Comma", value: char });
            current++;
            continue;
        }

        // If it whitespace, ignore it.
        if (/\s/.test(char)) {
            current++;
            continue;
        }

        // For all other characters throw error
        throw new Error("Unexpected character: " + char);
    }

    return tokens;
}
Enter fullscreen mode Exit fullscreen mode

All right, let's now tackle the trickiest things. Strings, Numbers, Boolean and Null values.

Let's start with Strings. So, in json each string starts and ends with " . This is the key part in our logic. When we encounter a " , we then start iteraing and storing all the values after it until we encounter another " . Then we store that as a value and increment our current position in the string.

if (char === '"') {
    let value = "";
    char = string[++current];
    while (char !== '"') {
        value += char;
        char = string[++current];
    }

    current++;
    tokens.push({ type: "String", value: value });
    continue;
}
Enter fullscreen mode Exit fullscreen mode

For Boolean and Null values the logic is almost the same, just the condition for entering and breaking the loop is different.

// Test if it starts with a character
if (/[a-zA-z]/.test(char)) {
    let value = "";

    // Loop until it is not a character
    while (/[a-zA-Z]/.test(char)) {
        value += char;
        char = string[++current];
    }

    if (value === "true") {
        tokens.push({ type: "True", value });
    } else if (value === "false") {
        tokens.push({ type: "False", value });
    } else if (value === "null") {
        tokens.push({ type: "Null", value });
    } else {
        throw new Error("Unexpected value: " + value);
    }

    continue;
}
Enter fullscreen mode Exit fullscreen mode

Now, for the last type Number . This is was the most fun part for me because the article which I was referring only dealt with Int numbers, if I tried a Float , or -ve number it didn't work. For recognizing something as a number there are some rules in json .

Basically, 69, -69 , 6.9 , -6.9 , 6e9 , -6e9 , 6.e9 , -6.e9 are all valid numbers. We will first see if the current character is - or a number. If it is then we will again repeat the same thing as we did for strings, boolean and null. We will start capturing the value, unless it is not a digit, not an e , - , . , + . But I hear you say, would this be wrong, this could parse 1.1.1 as number. You are right! So we will then add a check if it truly is a Number or not.

if (char === "-" || /\d/.test(char)) {
    let value = "";
    while (
        char === "+" ||
        char === "-" ||
        char === "e" ||
        char === "." ||
        /\d/.test(char)
    ) {
        value += char;
        char = string[++current];
    }

    if (isNumber(value)) {
        tokens.push({ type: "Number", value });
    } else {
        throw new Error("Unexpected value: " + value);
    }

    continue;
}

/**
 * Checks for Number
 *
 * @param { String } value
 * @returns { Boolean }
 */
function isNumber(value) {
    return !isNaN(Number(value));
}
Enter fullscreen mode Exit fullscreen mode

The final version of tokenizer function looks like this.

/**
 * Checks for Number
 *
 * @param { String } value
 * @returns { Boolean }
 */
function isNumber(value) {
    return !isNaN(Number(value));
}

/**
 * JSON Tokenizer
 *
 * @param { String } string
 * @returns { any[] }
 */
function tokenizer(string) {
    let current = 0;
    const tokens = [];

    while (current < string.length) {
        let char = string[current];

        if (char === "{") {
            tokens.push({ type: "BraceOpen", value: char });
            current++;
            continue;
        }

        if (char === "}") {
            tokens.push({ type: "BraceClose", value: char });
            current++;
            continue;
        }

        if (char === "[") {
            tokens.push({ type: "BracketOpen", value: char });
            current++;
            continue;
        }

        if (char === "]") {
            tokens.push({ type: "BracketClose", value: char });
            current++;
            continue;
        }

        if (char === ":") {
            tokens.push({ type: "Colon", value: char });
            current++;
            continue;
        }

        if (char === ",") {
            tokens.push({ type: "Comma", value: char });
            current++;
            continue;
        }

        if (char === '"') {
            let value = "";
            char = string[++current];
            while (char !== '"') {
                value += char;
                char = string[++current];
            }

            current++;
            tokens.push({ type: "String", value: value });
            continue;
        }

        if (char === "-" || /\d/.test(char)) {
            let value = "";
            while (
                char === "+" ||
                char === "-" ||
                char === "e" ||
                char === "." ||
                /\d/.test(char)
            ) {
                value += char;
                char = string[++current];
            }

            if (isNumber(value)) {
                tokens.push({ type: "Number", value });
            } else {
                throw new Error("Unexpected value: " + value);
            }

            continue;
        }

        if (/[a-zA-z]/.test(char)) {
            let value = "";

            while (/[a-zA-Z]/.test(char)) {
                value += char;
                char = string[++current];
            }

            if (value === "true") {
                tokens.push({ type: "True", value });
            } else if (value === "false") {
                tokens.push({ type: "False", value });
            } else if (value === "null") {
                tokens.push({ type: "Null", value });
            } else {
                throw new Error("Unexpected value: " + value);
            }

            continue;
        }

        if (/\s/.test(char)) {
            current++;
            continue;
        }

        throw new Error("Unexpected character: " + char);
    }

    return tokens;
}
Enter fullscreen mode Exit fullscreen mode

🖨️Parsing

The parser is where we make sense out of our tokens. Now we have to build our Abstract Syntax Tree (AST). The AST represents the structure and meaning of the code in a hierarchical tree-like structure. It captures the relationships between different elements of the code, such as statements, expressions, and declarations.

Every language or format you can think of uses some form of AST based on grammar rules of the programming language or data format being parsed.

Alright, so we will code for both things, to output the AST representation from the tokens and also an actual json object.

The logic again is kinda similar, we will iterate through the tokens and map it based on their type one by one.

/**
 * Parser + AST for tokens
 *
 * @param { any[] } tokens
 * @returns { any }
 */
function parser(tokens) {
    if (!tokens.length) {
        throw new Error("Nothing to parse. Exiting!");
    }

    let current = 0;

    function advance() {
        return tokens[++current];
    }

    function parseValue() {
        const token = tokens[current];

        // Gives ASTNode
        switch (token.type) {
            case "String":
                return { type: "String", value: token.value };
            case "Number":
                return { type: "Number", value: Number(token.value) };
            case "True":
                return { type: "Boolean", value: true };
            case "False":
                return { type: "Boolean", value: false };
            case "Null":
                return { type: "Null" };
            case "BraceOpen":
                return parseObject(); // To be implemented
            case "BracketOpen":
                return parseArray(); // To be implemented
            default:
                throw new Error(`Unexpected token type: ${token.type}`);
        }
    }

    return parseValue();
}
Enter fullscreen mode Exit fullscreen mode

Alright, but whenever we see { or [ we need to handle the nested objects and arrays recursively which means calling parseValue() again from parseObject() and parseArray() .

function parseObject() {
    const node = { type: "Object", value: {} };

    let token = advance();

    // Iterate through the tokens until we reach a BraceClose '}'
    while (token.type !== "BraceClose") {
        if (token.type === "String") {
            const key = token.value;
            token = advance();

            if (token.type !== "Colon") {
                throw new Error("Expected : in key-value pair");
            }
            token = advance(); // Skip ':'

            const value = parseValue(); // Recursively parse the value
            node.value[key] = value;
        } else {
            throw new Error(`Unexpected key. Token type: ${token.type}`);
        }

        token = advance(); // Increment one step
        if (token.type === "Comma") token = advance(); // Skip ','
    }

    // Gives ASTNode
    return node;
}

function parseArray() {
    const node = { type: "Array", value: [] };
    let token = advance(); // Skip '['

    while (token.type !== "BracketClose") {
        const value = parseValue();
        node.value.push(value);

        token = advance();  // Increment one step
        if (token.type === "Comma") token = advance();  // Skip ','
    }

    // Gives ASTNode
    return node;
}
Enter fullscreen mode Exit fullscreen mode

Now our parser is ready, here is what the final implementation looks like.

/**
 * Checks for Number
 *
 * @param { String } value
 * @returns { Boolean }
 */
function isNumber(value) {
    return !isNaN(Number(value));
}

/**
 * JSON Tokenizer
 *
 * @param { String } string
 * @returns { any[] }
 */
function tokenizer(string) {
    let current = 0;
    const tokens = [];

    while (current < string.length) {
        let char = string[current];

        if (char === "{") {
            tokens.push({ type: "BraceOpen", value: char });
            current++;
            continue;
        }

        if (char === "}") {
            tokens.push({ type: "BraceClose", value: char });
            current++;
            continue;
        }

        if (char === "[") {
            tokens.push({ type: "BracketOpen", value: char });
            current++;
            continue;
        }

        if (char === "]") {
            tokens.push({ type: "BracketClose", value: char });
            current++;
            continue;
        }

        if (char === ":") {
            tokens.push({ type: "Colon", value: char });
            current++;
            continue;
        }

        if (char === ",") {
            tokens.push({ type: "Comma", value: char });
            current++;
            continue;
        }

        if (char === '"') {
            let value = "";
            char = string[++current];
            while (char !== '"') {
                value += char;
                char = string[++current];
            }

            current++;
            tokens.push({ type: "String", value: value });
            continue;
        }

        if (char === "-" || /\d/.test(char)) {
            let value = "";
            while (
                char === "+" ||
                char === "-" ||
                char === "e" ||
                char === "." ||
                /\d/.test(char)
            ) {
                value += char;
                char = string[++current];
            }

            if (isNumber(value)) {
                tokens.push({ type: "Number", value });
            } else {
                throw new Error("Unexpected value: " + value);
            }

            continue;
        }

        if (/[a-zA-z]/.test(char)) {
            let value = "";

            while (/[a-zA-Z]/.test(char)) {
                value += char;
                char = string[++current];
            }

            if (value === "true") {
                tokens.push({ type: "True", value });
            } else if (value === "false") {
                tokens.push({ type: "False", value });
            } else if (value === "null") {
                tokens.push({ type: "Null", value });
            } else {
                throw new Error("Unexpected value: " + value);
            }

            continue;
        }

        if (/\s/.test(char)) {
            current++;
            continue;
        }

        throw new Error("Unexpected character: " + char);
    }

    return tokens;
}

/**
 * Parser + AST for tokens
 *
 * @param { any[] } tokens
 * @returns { any }
 */
function parser(tokens) {
    if (!tokens.length) {
        throw new Error("Nothing to parse. Exiting!");
    }

    let current = 0;

    function advance() {
        return tokens[++current];
    }

    function parseValue() {
        const token = tokens[current];

        // Gives ASTNode
        // switch (token.type) {
        //     case "String":
        //         return { type: "String", value: token.value };
        //     case "Number":
        //         return { type: "Number", value: Number(token.value) };
        //     case "True":
        //         return { type: "Boolean", value: true };
        //     case "False":
        //         return { type: "Boolean", value: false };
        //     case "Null":
        //         return { type: "Null" };
        //     case "BraceOpen":
        //         return parseObject();
        //     case "BracketOpen":
        //         return parseArray();
        //     default:
        //         throw new Error(`Unexpected token type: ${token.type}`);
        // }

        // Gives value
        switch (token.type) {
            case "String":
                return token.value;
            case "Number":
                return Number(token.value);
            case "True":
                return true;
            case "False":
                return false;
            case "Null":
                return null;
            case "BraceOpen":
                return parseObject();
            case "BracketOpen":
                return parseArray();
            default:
                throw new Error(`Unexpected token type: ${token.type}`);
        }
    }

    function parseObject() {
        const node = { type: "Object", value: {} };

        let token = advance();

        while (token.type !== "BraceClose") {
            if (token.type === "String") {
                const key = token.value;
                token = advance();

                if (token.type !== "Colon") {
                    throw new Error("Expected : in key-value pair");
                }
                token = advance();

                const value = parseValue();
                node.value[key] = value;
            } else {
                throw new Error(`Expected String key in object. Token type: ${token.type}`);
            }

            token = advance();
            if (token.type === "Comma") token = advance();
        }

        // Gives ASTNode
        // return node;

        // Gives Value
        return node.value;
    }

    function parseArray() {
        const node = { type: "Array", value: [] };
        let token = advance();

        while (token.type !== "BracketClose") {
            const value = parseValue();
            node.value.push(value);

            token = advance();
            if (token.type === "Comma") token = advance();
        }

        // Gives ASTNode
        // return node;

        // Gives Value
        return node.value;
    }

    return parseValue();
}

let output = parser(
    tokenizer(
        JSON.stringify({
            products: [
                {
                    id: 1,
                    title: "Essence Mascara Lash Princess",
                    category: "beauty",
                    price: 9.99,
                    discountPercentage: 7.17,
                    rating: 4.94,
                    stock: 5,
                    tags: ["beauty", "mascara"],
                    brand: "Essence",
                    sku: "RCH45Q1A",
                    weight: 2,
                    dimensions: {
                        width: 23.17,
                        height: 14.43,
                        depth: 28.01,
                    },
                    warrantyInformation: "1 month warranty",
                    shippingInformation: "Ships in 1 month",
                    availabilityStatus: "Low Stock",
                    reviews: [
                        {
                            rating: 5,
                            comment: "Very satisfied!",
                            date: "2024-05-23T08:56:21.618Z",
                            reviewerName: "Scarlett Wright",
                            reviewerEmail: "scarlett.wright@x.dummyjson.com",
                        },
                    ],
                    returnPolicy: "30 days return policy",
                    minimumOrderQuantity: 24,
                    meta: {
                        createdAt: "2024-05-23T08:56:21.618Z",
                        updatedAt: "2024-05-23T08:56:21.618Z",
                        barcode: "9164035109868",
                        qrCode: "https://dummyjson.com/public/qr-code.png",
                    },
                    images: [
                        "https://cdn.dummyjson.com/products/images/beauty/Essence%20Mascara%20Lash%20Princess/1.png",
                    ],
                    thumbnail:
                        "https://cdn.dummyjson.com/products/images/beauty/Essence%20Mascara%20Lash%20Princess/thumbnail.png",
                },
            ],
            total: 194,
            skip: 0,
            limit: 30,
        })
    )
);

console.log(JSON.stringify(output, "", 2));
Enter fullscreen mode Exit fullscreen mode

🔭Conclusion

This wasn't even that time consuming and I really learned some useful concepts in during this. Now, since I have a understanding how this works; I can take up any new language which I want to learn and implement this as a proof of concept in it.


If you’d like to know more about my journey, please feel free to reach out or follow me!

Github: Hetarth02

LinkedIn: Hetarth Shah

Website: Portfolio

💖Thank you for joining me on this journey, and look forward to more interesting articles!

Credits:

Cover Image from, Image Source.

Article Images from, json.org.

References:

Top comments (1)

Collapse
 
marcisbee profile image
Marcis Bergmanis

Very nice! Just one thing caught my eye, the part that tokenizes string will break on something like this: {"foo":"ba\"r"}.
Other than that, nice work!