Ahmed Hany Gamal
Implementing a JSON Schema Validator from Scratch - Week 1

Going into this week, the plan was to go through the "Getting started" page on the JSON Schema website and read the draft 2020-12 specs. How naively optimistic of me.

I went through the "Getting started" page to refresh my memory of JSON Schema and most of its keywords, and then I started reading the specs.

It turned out that there are two separate documents for the draft 2020-12 specs: one covering the keywords and details relevant to the architecture and design of JSON Schema as a whole (JSON Schema Core), and another covering the validation keywords (JSON Schema Validation). There's also the Relative JSON Pointers document, but I'm not sure where it fits into the whole picture; I'm guessing it just explains how relative JSON pointers work in greater detail.

I ended up only reading the first 8 chapters of the Core specs.
In hindsight, this should have been expected, especially since this is my first time dealing with something like this.

Initial thoughts on the specs

It was surprising to see how a document could explain the proper way to implement an entire system without leaving any ambiguities; every minute detail is mentioned, along with the reasoning behind it.

That being said, the nature of what is required of a specification makes it genuinely hard to read or write one. By that I mean that explaining a system as sophisticated and elegant as JSON Schema in only a few words is a difficult task, and even though all of the system's details are mentioned, some of them are just hard to put into words. That made some parts relatively hard to understand, and other parts incredibly easy to misunderstand (I'll get to all of that in a second).

The difficulties of the specs

Most of the topics weren't that difficult individually (though, as we'll see later, some certainly were), but the process of reading and understanding the specs as a whole certainly is.
The difficulty comes from the fact that the specification describes an entire system, an entire world if you will, a world you've just found yourself in with no prior experience of what's in it or how things work there. The real challenge is the sheer number of things you need to learn and connect together before you understand how that system/world works.

Topics I misunderstood

These are topics that I thought I understood, only to later realize that my understanding was completely incorrect.

meta-schemas

The first time I saw the term "meta-schema" was in the JSON Schema Glossary. There it said that a meta-schema is simply a schema that is used to validate another schema, meaning that the other schema is treated as an instance in the validation process. When I read the first few chapters of the specs, I believe I saw that definition again, but then later the $vocabulary keyword was mentioned and it became clear that something was off.

What I later came to realize is that while that definition isn't wrong, it leaves out some crucial details, which makes it quite misleading.

When I read this definition, I thought that if I use a schema to validate an instance, and that instance happens to also be a schema, then the schema I'm using for validation is a meta-schema.
It turns out that it's much more complicated than that.
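To make that concrete, the picture I had in my head was roughly the following. This is just my own illustration, not anything from the specs, and validate() is a placeholder for the validator I haven't written yet:

```python
# The naive picture: any schema used to validate another schema would be a
# "meta-schema". Here person_schema is treated as an ordinary JSON instance.
naive_meta_schema = {
    "type": "object",
    "required": ["type"],
}

person_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
}

# Under my first reading, this alone would make naive_meta_schema a
# meta-schema (validate() is a placeholder, not an actual function):
# validate(instance=person_schema, schema=naive_meta_schema)
```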

When you initially use/load a schema, there are some steps that need to happen before you actually validate any instances. First, you take the URI in $schema (for example https://json-schema.org/draft/2020-12/schema in the case of draft 2020-12) and fetch the schema that lives at that URI. You then check the vocabularies listed in that schema's $vocabulary and make sure your implementation supports the required ones. If it does, you run that schema against your original schema to make sure the schema you're using complies with the draft/dialect it declares. Only if that validation succeeds do you actually start using your original schema to validate JSON instances.
In that whole process, the schema we fetched from the $schema URI is the meta-schema.
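To keep that flow straight in my own head, here's a rough sketch of it in Python. None of this is my actual implementation (that doesn't exist yet); resolve_uri() and validate() are stand-ins, and the set of supported vocabularies is just an example:

```python
# A sketch of the schema-loading steps described above, assuming the
# implementation keeps some registry of meta-schemas it can resolve.

SUPPORTED_VOCABULARIES = {
    # Example set; a real implementation would list every vocabulary it handles.
    "https://json-schema.org/draft/2020-12/vocab/core",
    "https://json-schema.org/draft/2020-12/vocab/applicator",
    "https://json-schema.org/draft/2020-12/vocab/validation",
}


def resolve_uri(uri: str) -> dict:
    # Stand-in: look the URI up in a local registry of known meta-schemas
    # (or fetch and cache it). Not implemented yet.
    raise NotImplementedError


def validate(instance, schema: dict) -> bool:
    # Stand-in: the actual validator, to be written in the coming weeks.
    raise NotImplementedError


def load_schema(schema: dict) -> dict:
    # 1. Fetch the meta-schema from the URI in $schema.
    meta_schema = resolve_uri(schema["$schema"])

    # 2. Check the meta-schema's $vocabulary: a value of True means the
    #    vocabulary is required, so refuse to proceed if it isn't supported.
    for uri, required in meta_schema.get("$vocabulary", {}).items():
        if required and uri not in SUPPORTED_VOCABULARIES:
            raise NotImplementedError(f"unsupported required vocabulary: {uri}")

    # 3. Validate the schema itself against the meta-schema; here the
    #    schema plays the role of the instance.
    if not validate(instance=schema, schema=meta_schema):
        raise ValueError("schema does not conform to its declared dialect")

    # 4. Only now is the schema ready to validate ordinary JSON instances.
    return schema
```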

So the original definition isn't wrong, but it is misleading, and it doesn't help that the specs never spell this out explicitly; they just give you bits and pieces in different parts of the document and let you figure it out on your own.

Topics I don't understand yet

These are topics that I still haven't fully grasped. I have a general idea of what they're about, but I don't truly understand them yet.

lexical vs dynamic scopes/keywords

From what I understand, this is one of the most difficult topics in the specs, so I think it's normal not to get it right away.
I think I may already understand what lexical scope is and how it works; I assume it's basically the default/intuitive way of handling scope, while dynamic scope is different and more complex/unintuitive, made for very specific cases with very specific needs.
I'll do my best to understand this next week.

Conclusion

There are more topics that I didn't mention in this post, like the different keyword types (Annotations, Assertions, Applicators, etc.) and how they function, or the differences between the different types of schemas (resource schema, sub-schema, embedded schema, etc.). But those were all relatively straightforward, and this post would be far too long if I covered them; it's already pretty long.

I usually don't enjoy reading; as a matter of fact, I usually hate it. But this was a surprisingly enjoyable experience. I'm not saying it was easy, far from it, but you get a weird dopamine rush when you finally understand a topic you've spent 5 consecutive hours trying to figure out.

I'll probably start actually writing code for this project either by the end of week 2 or the beginning of week 3.

I'll be posting weekly updates on my journey here.
The code (or lack thereof at this point in time) can be found on GitHub.
