The king is dead, long live the king! Years later, JSON is beginning to lose its appeal because of generative AI. The problem is cost: tokens have revolutionized the way we think about IT spending, and so one of the most beloved formats of recent times has become a liability.
JSON: From Riches to Rags
Of course, JSON is not only used to exchange structured data with generative AI agents. In every other case, the characteristics and usefulness of the format remain unchanged. The problem with LLMs is cost, not usefulness: JSON is simply too verbose.
In a world forced to redistribute investments from on-premise servers to the cloud to tokens, JSON is starting to become a problem. The paradox is that, while software development moves ever closer to natural language, data formats are drifting back toward machine code.
One might wonder whether it would not be better to keep investing in programming languages written by humans and designed around how machines work rather than how the human brain does. But the industry is moving in the opposite direction, and it is advisable to adapt to new ways of writing code.
The Rise of Toon
The JSON “case” has been a trending topic in recent months. It was on these very pages that I discovered Toon: a sort of CSV on steroids that aims to replace JSON. A format that delivers exactly what I anticipated: lower costs at the expense of readability.
Yes, because it essentially converts JSON objects into CSV-like structures topped with custom headers that describe their content. Anyone used to working with Excel or some other form of structured data should have no major trouble deciphering it, but compared to JSON its readability is poor.
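To make the idea concrete, here is a minimal sketch of the tabular style Toon uses for uniform arrays: a header line declaring the length and field names, followed by one CSV-like row per record. This is a hand-rolled illustration with a made-up payload, not the official Toon library, and character counts are only a rough proxy for tokens.

```python
import json

# Hypothetical payload: uniform records, the best case for a tabular encoding.
users = [
    {"id": 1, "name": "Alice", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "editor"},
    {"id": 3, "name": "Carol", "role": "viewer"},
]

as_json = json.dumps(users)

# Toon-style sketch: declare the field names once in a header,
# then emit each record as a bare comma-separated row.
fields = list(users[0])
header = f"users[{len(users)}]{{{','.join(fields)}}}:"
rows = ["  " + ",".join(str(u[f]) for f in fields) for u in users]
as_toon = "\n".join([header, *rows])

print(len(as_json), len(as_toon))  # the Toon-style text is noticeably shorter
```

The saving comes entirely from not repeating the keys `id`, `name`, and `role` for every record, which is also why the format shines on uniform arrays and less so on deeply nested, irregular data.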
Toon is not suitable for every case. The official project repository on GitHub includes a series of benchmarks showing how it performs in the most common scenarios. Personally, I think I might opt for this format in my own implementations, though I have not yet tried the format discussed below.
MessagePack: The Outsider
Here is another format that is emerging. Compared to Toon, it takes things to an even higher level (or a lower one, depending on how you look at it). MessagePack has been around for over 13 years, but only recently has it begun to find a certain success with the general public, so to speak.
The credit goes to LLM prompting: MessagePack is “a binary serialization format” that can represent the same kinds of data as JSON, along with other structures such as arrays. It follows that, when it comes to generative AI, it is proposed as a valid alternative to JSON for talking to agents.
I have yet to try it, but it could become my favorite, since it is even more efficient than Toon. The problem is that it gives up data readability entirely. It is no coincidence that it was conceived years before the release of the GPT models: MessagePack goes in the opposite direction to natural language, turning information back into machine code.
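To see what “machine code” means here, below is a tiny message hand-encoded according to the public MessagePack spec, using only the standard library (a real project would use an actual MessagePack library instead). The payload is made up for illustration; the point is the byte count, and the total unreadability of the result.

```python
import json

payload = {"id": 7}

# Compact JSON, no whitespace: 8 bytes.
as_json = json.dumps(payload, separators=(",", ":")).encode()  # b'{"id":7}'

# The same payload hand-encoded per the MessagePack spec: 5 bytes.
msgpack_bytes = bytes([
    0x80 | 1,          # fixmap header: a map with 1 key/value pair
    0xA0 | len("id"),  # fixstr header: a string of length 2
    *b"id",            # the key bytes themselves
    7,                 # positive fixint: small ints are a single byte
])

print(len(as_json), len(msgpack_bytes))  # 8 vs 5 bytes
```

Three of the five bytes are the raw content (`id` and `7`); the other two are single-byte type headers. That is where the efficiency comes from, and also why no one can skim the result the way they can skim JSON or even Toon.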
Natural Language vs. Machine Code
It is useful to ask ourselves a serious question. Considering that Markdown is the preferred format for writing prompts (and the only alternative seems to be XML), does it make sense to exchange data in machine code? I mean, it seems counterintuitive to me.
Where I work, they would like to delegate some of the tasks assigned to developers to people who have never written a single line of code. It is a ploy to cut salaries and hire unskilled workers, and it has never been successful. How could such people work efficiently with formats like Toon or MessagePack?
At the very least, they would have to be given encoding and decoding tools that work in the background, without requiring any interaction on their part. I can assure you that this hypothesis has never worked at my company: every attempt has ended up back on the developers' desks.
An Impossible Replacement
OK, let’s assume that a CSV-like format is understandable to most employees. In Italy this is not the case, but let’s pretend it is! What about MessagePack? I don’t think a format like this can be handed to people without technical training. The substitution hypothesis simply fails.
Of course, generative AI will eventually cost less, and a format such as JSON will become attractive again. But that could take decades, and I’m not convinced LLMs will still work as they do today. In the meantime? Let’s look at the job ads.
Leaving aside those that require ten years of experience (remember that GPT was released in 2018), we still find ads asking for bachelor’s or master’s degrees in STEM disciplines and knowledge of languages such as Python. I really don’t think our work will end in 2026, as many people claim.
If you like, follow me on Bluesky and/or GitHub for more content. I enjoy networking.