As a developer, you’ve most likely encountered serialization, perhaps without realizing it. It happens whenever you read config files, use APIs, or save application state.
So what is it exactly? Serialization is the process of converting data from its in-memory representation into a format that can be stored or transmitted. The reverse, reading that stored format back into usable data, is called deserialization.
A common example of this process is copying text to your clipboard from a rich text editor. The editor has all sorts of in-memory data about that text: font size, bold state, color, line height, etc. When you paste it into a plain text app like Notepad, it gets stripped down to just the characters. That stripping process is a crude form of serialization, converting a rich in-memory object into a simpler transmittable format.
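In code, the round trip is usually a one-liner in each direction. Here is a minimal Python sketch using the stdlib json module (the same idea applies to any format):

```python
import json

# An in-memory Python object
user = {"name": "John", "age": 30, "active": True}

# Serialization: in-memory object -> string that can be stored or sent
payload = json.dumps(user)

# Deserialization: string -> usable in-memory object again
restored = json.loads(payload)
assert restored == user
```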
The format you choose matters. It affects how readable your code and config files are, how easy it is to debug issues, how well your tools and libraries support it, and sometimes even performance at scale. In this post, I review popular serialization formats such as JSON, YAML, and TOML, along with other noteworthy options, evaluating the advantages and disadvantages of each.
JSON
JSON (JavaScript Object Notation) is probably the most widely used serialization format in software today. It started as a subset of JavaScript but has since become the standard for data exchange across virtually every programming language and platform.
This is what it looks like:
{
  "name": "John",
  "age": 30,
  "email": "john@example.com",
  "roles": ["admin", "editor"],
  "address": {
    "city": "Lagos",
    "country": "Nigeria"
  },
  "active": true
}
JSON's popularity stems from its intentional simplicity, supporting six data types: strings, numbers, booleans, null, arrays, and objects.
JSON is the standard format utilized for REST APIs, browser storage such as localStorage, and configuration files within the JavaScript ecosystem, including package.json. When two services require data exchange over HTTP, JSON is typically the preferred option.
Pros
- Every major programming language has a JSON parser, either built in or easily available
- Easy to read and understand
- Lightweight with minimal overhead
- Excellent tooling support, including formatters, validators, and editor plugins
Cons
- No support for comments, which is a real limitation when using it for configuration files
- Strict syntax means a single trailing comma or missing quote can break parsing
- Limited data types with no native support for dates or binary data
- Deeply nested structures can get hard to read quickly
Worth knowing: Two unofficial extensions of JSON exist to address the lack of comments. JSONC (JSON with Comments) adds support for single-line // and multi-line /* */ comments. It's what VS Code uses for its settings.json and what TypeScript uses for tsconfig.json, so most developers encounter it without realizing it has a distinct name. JSON5 goes further, also allowing trailing commas and unquoted keys. Both require parsers that explicitly support them, since standard JSON parsers will reject the extra syntax.
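You can see the strictness of standard JSON for yourself. This Python snippet feeds a JSONC-style document (comment plus trailing comma) to the stdlib parser, which rejects it:

```python
import json

jsonc_text = """{
  // comments are not part of standard JSON
  "editor.fontSize": 14,
}"""

try:
    json.loads(jsonc_text)
except json.JSONDecodeError as e:
    # A standard parser fails on the first piece of extra syntax
    print("rejected:", e.msg)
```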
YAML
YAML (YAML Ain’t Markup Language) was built with human readability as the main priority. It uses indentation instead of brackets and braces, which makes it feel closer to plain text than most other formats.
What it looks like:
name: John
age: 30
email: john@example.com
roles:
  - admin
  - editor
address:
  city: Lagos
  country: Nigeria
active: true

# You can also include a comment
database:
  host: localhost
  port: 5432
  name: myapp_db
Almost the same data as the JSON example above, but noticeably less visual noise.
YAML is the standard in the DevOps world. Docker Compose files, Kubernetes manifests, GitHub Actions workflows, Ansible playbooks, and most CI/CD configuration files are written in YAML. It also shows up in frameworks like Ruby on Rails and static site generators like Hugo and Jekyll.
Pros
- Very readable, especially for people who may not have extensive technical knowledge
- Supports comments, which makes it much more practical for configuration files
- Handles a wider range of data types natively, including dates and multi-line strings
- Less syntactic clutter compared to JSON
Cons
- Indentation-sensitive, so a misplaced space can cause silent bugs or confusing parse errors
- The full spec has a surprising number of edge cases. One well-known example is the “Norway problem”, where the string NO gets parsed as false in some parsers
- Generally slower to parse than JSON
- Some YAML parsers support object deserialization features that can introduce security vulnerabilities if you’re not careful
Worth knowing: YAML is actually a superset of JSON. Any valid JSON document is also valid YAML. This means YAML parsers can read JSON files, though you’d rarely want to mix the two intentionally.
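Both the Norway problem and the superset relationship are easy to demonstrate. This sketch assumes the third-party PyYAML package (a YAML 1.1 parser) is installed:

```python
import yaml  # third-party PyYAML, a YAML 1.1 parser

# Any valid JSON document is also valid YAML
print(yaml.safe_load('{"name": "John", "age": 30}'))  # {'name': 'John', 'age': 30}

# The "Norway problem": YAML 1.1 resolves NO as a boolean, not a string
print(yaml.safe_load("country: NO"))  # {'country': False}

# Quoting forces the string interpretation
print(yaml.safe_load('country: "NO"'))  # {'country': 'NO'}
```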
TOML
TOML (Tom’s Obvious, Minimal Language) was created by Tom Preston-Werner as a configuration format that is easy to read and has predictable, unambiguous behavior. It maps cleanly to a hash table, which makes it straightforward to parse and work with programmatically.
How it looks:
# Application configuration (Also supports comments)
[app]
name = "MyApp"
version = "1.0.0"
debug = false
[database]
host = "localhost"
port = 5432
name = "myapp_db"
[server]
host = "0.0.0.0"
port = 8080
allowed_origins = ["https://example.com", "https://app.example.com"]
[[users]]
name = "Alice"
role = "admin"
[[users]]
name = "Bob"
role = "editor"
The [[double bracket]] syntax is how TOML handles arrays of tables, which looks a little unusual at first but becomes intuitive quickly.
TOML has become the standard for Rust projects through Cargo.toml and Python packaging through pyproject.toml. It's also picking up adoption in Go and other ecosystems. It works best for application configuration where you want clarity and correctness without a lot of ambiguity.
Pros
- Very few surprises. The spec is strict enough that parsing behavior is consistent across different implementations
- Supports comments
- Types are explicit. Dates, integers, floats, and strings are all clearly typed with no implicit coercion
- The section-based structure feels natural for grouping related settings
Cons
- Not as universally supported as JSON or YAML across all languages and frameworks
- Gets awkward with deeply nested structures
- Not really designed for general-purpose data exchange. It’s a configuration format and works best when treated as one
Other Formats Worth Knowing
XML
This was the dominant format used for REST APIs before JSON took over. It’s verbose and harder to read, but it’s still widely used in enterprise systems, SOAP APIs, and document formats like SVG and Microsoft Office files.
<user>
  <name>John</name>
  <age>30</age>
  <active>true</active>
</user>
CSV (Comma-Separated Values)
CSV’s simplicity and ubiquity have made it the most common format for tabular data, spreadsheets, and data pipelines, despite its limitations: no type information, no nesting, and no consistent standard for escaping special characters. Its widespread adoption stems from universal accessibility and broad tool support.
id,name,email,role,active
1,Alice,alice@example.com,admin,true
2,John,john@example.com,editor,true
3,Carol,carol@example.com,viewer,false
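The lack of type information shows up immediately when you parse CSV. In this Python stdlib sketch, every value comes back as a string, including the numeric id and the boolean flag:

```python
import csv
import io

raw = "id,name,active\n1,Alice,true\n2,John,false"
rows = list(csv.DictReader(io.StringIO(raw)))

# CSV carries no types: "1" and "true" are just strings,
# and converting them back is the application's job
assert rows[0] == {"id": "1", "name": "Alice", "active": "true"}
assert rows[1]["active"] == "false"  # not the boolean False
```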
Protocol Buffers (Protobuf)
Built by Google, Protobuf is a binary serialization format focused on performance and efficiency. It’s not human-readable, but it produces significantly smaller payloads and parses faster than text-based formats. It’s commonly used in gRPC services and high-throughput microservice architectures. You define your data structure in a .proto schema file first, then Protobuf generates the serialization code for your language.
syntax = "proto3";

message User {
  string name = 1;
  int32 age = 2;
  string email = 3;
  repeated string roles = 4;
}
Running protoc --go_out=. user.proto against that file generates a user.pb.go file containing the User struct and all the binary encoding logic. You never write or touch that file. What you do write is your own application code that imports and uses the generated struct.
// This is inside the auto-generated user.pb.go - you don't write this
type User struct {
    Name  string   `protobuf:"bytes,1,opt,name=name"`
    Age   int32    `protobuf:"varint,2,opt,name=age"`
    Email string   `protobuf:"bytes,3,opt,name=email"`
    Roles []string `protobuf:"bytes,4,rep,name=roles"`
}
Those struct tags carry the field number information at runtime so proto.Marshal knows how to encode each field into binary. In your own code, you import the generated package and use it like this:
// This is your application code - you write this
user := &pb.User{
    Name:  "John",
    Age:   30,
    Email: "john@example.com",
    Roles: []string{"admin", "editor"},
}

// Serialize to binary
data, err := proto.Marshal(user)

// Deserialize back
user2 := &pb.User{}
err = proto.Unmarshal(data, user2)
fmt.Println(user2.Name) // John
It is important to understand that the field numbers in the .proto file (the = 1, = 2 next to each field) are what Protobuf actually writes to the binary output, not the field names. Each field is encoded as a tag followed by its value. The tag packs two things together: the field number and a wire type. For field numbers up to 15, the whole tag fits in a single byte.
The wire type tells the decoder how to read the bytes that follow. For example, wire type 0 means read a variable-length integer, wire type 2 means read a length (itself a varint, a single byte for short values) then read that many bytes (used for strings and arrays), and wire type 1 means always read 8 bytes (used for doubles). So from a single tag byte the decoder knows both which field it's reading and how many bytes to consume for the value.
This is why Protobuf payloads are so compact. Instead of writing "name": "John" as a string key-value pair like JSON, it writes a tag byte followed directly by the value with almost no structural overhead. For the name field, the actual bytes on the wire look like this:
0x0a 0x04 0x4a 0x6f 0x68 0x6e
0x0a is the tag byte encoding field number 1 and wire type 2 (length-delimited). 0x04 is the string length (4 bytes for "John"). The remaining four bytes are the UTF-8 characters J, o, h, n. That's the entire field, six bytes total. The equivalent in JSON, "name": "John", is 14 characters. It also means field numbers are permanent. If you change one after data has already been serialized, you'll break deserialization of existing data because the tags no longer match.
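You can verify the byte layout above by hand. This is a toy Python sketch of the wire format for that one field, not a real Protobuf library, and it assumes a short string so the length fits in one byte:

```python
FIELD_NUMBER = 1  # from `string name = 1;` in the .proto file
WIRE_TYPE = 2     # length-delimited (strings, bytes, nested messages)

# The tag byte: field number in the high bits, wire type in the low 3 bits
tag = (FIELD_NUMBER << 3) | WIRE_TYPE
assert tag == 0x0A

value = "John".encode("utf-8")
encoded = bytes([tag, len(value)]) + value

# Exactly the six bytes shown above: 0x0a 0x04 'J' 'o' 'h' 'n'
assert encoded == bytes.fromhex("0a044a6f686e")
```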
This is also why the .proto file needs to be shared between services. The binary output is not self-describing, so without the schema, the bytes are meaningless. JSON carries its field names in the payload itself, which is why any parser can read it without prior knowledge of the structure. In practice, teams using Protobuf keep their .proto files in a shared repository that all services reference so everyone stays in sync.
Pardon me for rambling so much about Protobuf. It is not the most common format you’ll encounter day to day, but I got carried away learning more about it during research and felt compelled to share.
MessagePack
It takes the same structure as JSON but serializes it as binary. The result is a much smaller payload with faster parsing. It shows up in real-time applications and anywhere that bandwidth or latency is a serious concern. The API feels familiar to anyone used to JSON.
import { encode, decode } from "@msgpack/msgpack";

const data = { name: "John", age: 30, active: true };

// Serialize
const encoded: Uint8Array = encode(data);

// Deserialize
const decoded = decode(encoded);
console.log(decoded); // { name: "John", age: 30, active: true }
For the name field with value "John", the MessagePack binary output looks like this:
0xa4 0x4a 0x6f 0x68 0x6e
Compare that to Protobuf’s encoding of the same value which was six bytes: 0x0a 0x04 0x4a 0x6f 0x68 0x6e. Protobuf needed two bytes of overhead (the tag byte and the length byte) before the value. MessagePack only needed one because it packs the type and length into a single byte for short strings.
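That single-byte header is MessagePack's "fixstr" type, which folds the type marker and the length into one byte for strings up to 31 bytes. A toy Python sketch of that encoding, not a real MessagePack library:

```python
s = "John"

# fixstr: high bits 101, low 5 bits carry the string length
header = 0xA0 | len(s)
assert header == 0xA4

encoded = bytes([header]) + s.encode("utf-8")

# Exactly the five bytes shown above: 0xa4 'J' 'o' 'h' 'n'
assert encoded == bytes.fromhex("a44a6f686e")
```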
The more important difference between the two is that MessagePack is self-describing. The encoded bytes carry enough information to deserialize the data without any external schema. Anyone with a MessagePack decoder can read the bytes and get back the original structure. Protobuf cannot do this. Without the .proto file, the binary output is meaningless. This makes MessagePack more flexible for ad hoc data exchange, while Protobuf is better suited for structured service-to-service communication where both sides share a schema and need strict type contracts.
Many Tools Accept Multiple Formats
Some developers may be surprised that popular tools like Prettier accept multiple serialization formats, inferring the configuration format from the file extension. Valid configuration methods include:
.prettierrc (JSON or YAML, no extension)
.prettierrc.json
.prettierrc.yaml
.prettierrc.yml
.prettierrc.toml
prettier.config.js
The same is true for ESLint (.eslintrc.json, .eslintrc.yaml, .eslintrc.js), Babel, Stylelint, and many other tools in the JavaScript ecosystem. The data inside is structurally the same regardless of which format you use. It's just a matter of preference.
This is worth keeping in mind when setting up a project. If your team is more comfortable with YAML than JSON, or you want to be able to leave comments in your config, you often have that option without having to change tools.
Converting Between Formats
Sometimes, you might take over a project with one format and need to switch it, or you may have to convert a configuration file to align with the requirements of a specific tool. Fortunately, there are tools available that can handle this for you, saving you from the hassle of manual rewriting.
For online converters, sites like transform.tools are very handy. You can paste JSON and get YAML, TOML, CSV, and several other formats back in seconds. It covers a wide range of conversions and is probably the quickest option for one-off tasks.
Transforming JSON to YAML in transform.tools
If you prefer working in the terminal or writing scripts, most programming languages offer libraries that simplify data format conversion. For instance, in Python, converting a YAML file to JSON takes just a few lines.
import json
import yaml

with open("config.yaml") as f:
    data = yaml.safe_load(f)

with open("config.json", "w") as f:
    json.dump(data, f, indent=2)
Using the yaml library, you can safely load the contents of a YAML file, and with the json library, you can write the data back out as a JSON file with proper formatting. This approach can also be reversed or applied to any two formats supported by parser libraries available in your chosen language.
Remember that conversions are not always lossless. YAML and TOML support comments, whereas JSON does not. Consequently, converting a well-commented YAML or TOML file to JSON will result in the loss of those comments. Additionally, TOML includes explicit date and time types that lack a direct equivalent in JSON, which means such values might be converted into plain strings depending on the tool used. Always carefully review the output before finalizing or committing to it.
How to Choose the Right Format
There’s no one-size-fits-all answer, but some practical defaults can guide your decision:
- For APIs or data exchange over HTTP, JSON is the best starting point. It’s universally supported and works seamlessly with any programming language.
- For tabular data or files that non-developers need to access and edit, CSV remains the most practical and accessible choice.
- For CI/CD pipelines, container orchestration, or infrastructure tooling, YAML is the industry standard. Familiarize yourself with its quirks and always use a linter to avoid errors.
- For application configuration, especially in languages like Rust or Python, consider TOML. It’s clean, explicit, and avoids many of YAML’s pitfalls.
- For high-volume, low-latency communication between internal services, explore Protocol Buffers or MessagePack. While you lose human-readability, you gain significant performance advantages.
Conclusion
Serialization formats are something most developers pick up implicitly over time. You use JSON because the tutorial used JSON. You use YAML because Kubernetes uses YAML. That’s fine as a starting point, but understanding why each format exists and what it’s optimized for makes you more deliberate about these choices.
JSON is universal and simple, making it the default for data exchange. YAML prioritizes readability and is well-suited for configuration-heavy tools. TOML offers clarity and consistency for application config. Binary formats step in when performance becomes a real constraint.
None of them are perfect for every situation. Knowing the tradeoffs means you can pick the right tool rather than defaulting to whatever was used in the last project.
This is also by no means an exhaustive list. There are formats I didn’t cover here, like Avro, BSON, INI, HCL, and others that have their own niches and communities around them. If you work with one regularly or think it deserved a mention, feel free to drop it in the comments. I’d love to hear which formats others are using and what problems they’re solving with them.