Arch-AI-tech

Posted on Jun 11

Ktav: I got fed up with every config format, so I built one with no quotes, no commas, no indentation hell

#rust #opensource #configuration #programming

This started as a rage-quit from config files.

I was hacking on a hobby project — a SOCKS5 proxy rotator — and every time I needed to tweak launch configs, I'd lose time fighting the format itself instead of the actual problem. After years of being mildly annoyed at every config format I tried, I finally snapped and built my own.

It's called Ktav (כְּתָב, Hebrew for "script/writing"). This article is the long version of what it is, why every existing format made me give up, and the design decisions that went into the parser, the FFI strategy across seven languages, and the editor tooling. It's open-source, dual-licensed MIT OR Apache-2.0, and the playground is in the browser if you'd rather just try it.

The graveyard of formats I tried

Before writing one more line of code, let me explain why none of the existing options worked.

.env

Fine for flat key=value, useless the moment you need a nested object or an array. I ended up writing things like:

UPSTREAM_0_HOST=a.example
UPSTREAM_0_PORT=1080
UPSTREAM_0_WEIGHT=0.7
UPSTREAM_1_HOST=b.example
UPSTREAM_1_PORT=1080
UPSTREAM_1_WEIGHT=0.3

It's not even configuration — it's a hand-rolled index encoding. Adding a third upstream means inventing a new convention every time.

INI / TOML

Sections help, but TOML's [[array.of.tables]] syntax for arrays of tables always made me pause and re-read the docs. Inline tables can't span multiple lines. Every time I thought "this should be simple" it wasn't.

[[upstreams]]
host = "a.example"
port = 1080
weight = 0.7

[[upstreams]]
host = "b.example"
port = 1080
weight = 0.3

TOML is great for Cargo.toml-style configs where you know the schema and the structure is mostly flat. For hand-edited configs with arbitrary nesting, it always felt like I was working around the syntax instead of with it.

JSON

JSON's data model is exactly what I want. Scalars, arrays, objects, null, booleans. Composable. Unambiguous. But typing it by hand?

{
    "port": 20082,
    "log_level": "info",
    "upstreams": [
        {
            "host": "a.example",
            "port": 1080,
            "weight": 0.7
        },
        {
            "host": "b.example",
            "port": 1080,
            "weight": 0.3
        }
    ]
}

Quotes around every key. Quotes around every string. Commas after every line. Trailing comma equals parse error. Forget a comma anywhere and the error message is on the wrong line. I was spending more time on punctuation than on config values.

JSON5

Better — relaxed quoting, trailing commas allowed, comments. But strings with special characters still need quoting, and in most tooling commas are still mandatory between items. It softens the pain without removing it.

YAML

I genuinely tried YAML. Multiple times. But I would constantly lose track of where I was in the indentation. A misaligned space silently changes the structure. I'd paste a block, the indent would shift, and suddenly my array was a string. Apparently I don't have the spatial reasoning for YAML, and judging by how often I see YAML horror stories, neither does half the industry.

The Norway problem (country: NO parsed as false), the sexagesimal floats, the implicit type system that makes 123 an int but 123.0.0 a string — YAML's flexibility is its own footgun.

So what is Ktav?

Take JSON's data model. Strip the ceremony.

## A SOCKS5 rotator config.
port: 20082
log_level: info
debug: true

upstreams: [
    {
        host: a.example
        port: 1080
        weight: 0.7
    }
    {
        host: b.example
        port: 1080
        weight: 0.3
    }
]

## '::' forces a literal string — "true" stays a String, not a Bool.
feature_flag:: true
zip_code:: 00544

## Multiline strings — leading indent is auto-trimmed.
motd: (
    Welcome to the node.
    Please behave.
)

That's it. No quotes around keys. No quotes around strings. No commas between items. No indentation-sensitivity (the indent above is cosmetic; the parser ignores it). Bare numbers auto-type as Integer or Float; true/false/null are keywords; everything else is a String, verbatim.

The data model is exactly JSON's. Anything you can express in JSON, you can express in Ktav. The reverse is also true — you can roundtrip JSON ⇄ Ktav cleanly.

Design decisions worth explaining

Every absent feature is a decision. Here's why the format looks the way it does.

1. Typing by lexical form, not by quoting

A bare number that looks like an integer (20082) becomes Integer. A bare number that looks like a float (0.7) becomes Float. Everything else — including digit-ish content like a version 1.2.3-rc1, a regex \d+\.example, or info — is a String, verbatim.

Trade-off: simple to reason about (no need for type hints), but it introduces the need for an explicit "I want this as a string" marker for ambiguous cases — see (3).
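
As a rough illustration of the rule above, here is a minimal Rust sketch of lexical-form typing. It is not the reference parser: the `Scalar` enum, the `classify` function, and the exact number grammar (optional minus sign, digits, at most one dot with digits on both sides) are assumptions made for this example.

```rust
/// Hypothetical scalar type for this sketch (not the crate's real Value type).
#[derive(Debug, PartialEq)]
enum Scalar {
    Null,
    Bool(bool),
    Integer(i64),
    Float(f64),
    String(String),
}

/// Classify a bare value purely by how it looks.
/// Assumed grammar: keywords, `-?digits`, `-?digits.digits`; anything else is a String.
fn classify(raw: &str) -> Scalar {
    match raw {
        "null" => return Scalar::Null,
        "true" => return Scalar::Bool(true),
        "false" => return Scalar::Bool(false),
        _ => {}
    }

    let unsigned = raw.strip_prefix('-').unwrap_or(raw);
    let all_digits = |s: &str| !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit());

    if all_digits(unsigned) {
        // Values that overflow i64 fall through to String in this sketch.
        if let Ok(n) = raw.parse::<i64>() {
            return Scalar::Integer(n);
        }
    } else if let Some((int_part, frac_part)) = unsigned.split_once('.') {
        if all_digits(int_part) && all_digits(frac_part) {
            if let Ok(x) = raw.parse::<f64>() {
                return Scalar::Float(x);
            }
        }
    }

    // Everything else is kept verbatim.
    Scalar::String(raw.to_string())
}
```

With this sketch, `classify("20082")` yields `Scalar::Integer(20082)`, `classify("0.7")` yields `Scalar::Float(0.7)`, and `classify("1.2.3-rc1")` falls through to `Scalar::String`.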

2. `##` for comments instead of `#`

Single # is too common as content — hex colors, issue references, channel names, shebangs. Requiring ## for comments means color: #ff5577 parses without escaping.

Trade-off: two characters instead of one, but zero ambiguity between content and comment.
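
To make the comment rule concrete, here is a small Rust sketch that splits a line at the first `##`. The function name `split_comment` is made up for this example, and the sketch assumes `##` starts a comment wherever it appears on a line; the real grammar may be stricter about that.

```rust
/// Split one line into (content, comment) at the first `##`.
/// Assumption for this sketch: `##` opens a comment anywhere on the line.
fn split_comment(line: &str) -> (&str, Option<&str>) {
    match line.find("##") {
        Some(idx) => (line[..idx].trim_end(), Some(line[idx + 2..].trim())),
        None => (line.trim_end(), None),
    }
}
```

A single `#` never matches, so `split_comment("color: #ff5577")` returns the line untouched with no comment attached.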

3. `::` as a "forced literal string" marker

When the lexical-form typing would mis-classify, :: says "the entire value, as-is, is a String":

feature_flag:: true       ## "true": String, not Bool
zip_code:: 00544          ## "00544": String, preserves leading zero
version:: 1.20            ## "1.20": String, not the Float 1.2

Trade-off: it's a second sigil to learn, but it's a clean escape hatch without re-introducing JSON-style quoting for every value.
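
One way a parser might tell the two separators apart is sketched below in Rust. The `Pair` struct and `split_pair` function are hypothetical names, and the sketch assumes that keys never contain a colon and that any comment has already been removed from the line.

```rust
/// Hypothetical view of one `key: value` or `key:: value` line.
#[derive(Debug, PartialEq)]
struct Pair<'a> {
    key: &'a str,
    /// True when the separator was `::`, so the value must stay a String.
    forced_string: bool,
    raw_value: &'a str,
}

/// Split at the first colon; a second colon right after it marks a forced string.
fn split_pair(line: &str) -> Option<Pair<'_>> {
    let (key, rest) = line.split_once(':')?;
    let key = key.trim();
    if key.is_empty() {
        return None;
    }
    let (forced_string, value) = match rest.strip_prefix(':') {
        Some(after_marker) => (true, after_marker),
        None => (false, rest),
    };
    Some(Pair { key, forced_string, raw_value: value.trim() })
}
```

When `forced_string` is set, a caller would skip type inference and keep `raw_value` as a String; otherwise the value goes through the lexical-form rule from (1).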

4. Multi-line strings via `(` ... `)` with auto-dedent

YAML's | and > block scalars are powerful but I find them under-discoverable. Ktav uses parentheses, and the common leading indent of the block is auto-trimmed so you can indent the body of the string to match its surroundings:

motd: (
    Welcome to the node.
    Please behave.
)

The value is "Welcome to the node.\nPlease behave." — the four-space indent is recognized as cosmetic and stripped.

Trade-off: less expressive than YAML's full block-scalar grammar (no |-, |+, fold/strip variants), but covers the 90% case with one rule.
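
The auto-dedent rule can be sketched in a few lines of Rust. This is an illustration, not the reference implementation: `dedent` is a made-up name, it receives only the lines between `(` and `)`, it counts spaces and tabs as one indent unit each, and it also trims trailing whitespace on every line, which is an assumption.

```rust
/// Remove the common leading indent from the body of a `( ... )` block.
fn dedent(body: &str) -> String {
    let is_indent = |c: char| c == ' ' || c == '\t';

    // Smallest indent among non-blank lines; blank lines do not count.
    let indent = body
        .lines()
        .filter(|line| !line.trim().is_empty())
        .map(|line| line.len() - line.trim_start_matches(is_indent).len())
        .min()
        .unwrap_or(0);

    body.lines()
        .map(|line| {
            if line.trim().is_empty() {
                "" // blank lines become empty lines
            } else {
                line[indent..].trim_end()
            }
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Feeding it the two indented `motd` lines from the example gives back `"Welcome to the node.\nPlease behave."`, and any extra indent beyond the common prefix is preserved.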

5. Dotted keys

a.b.c: value is sugar for {a: {b: {c: value}}}. Optional — you can always use explicit {} if you prefer:

node.host: a.example
node.port: 1080
node.auth.user: alice

It's the only place where the format adds a convenience that isn't strictly necessary. I went back and forth on whether to include it.
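
For readers wondering how much complexity dotted keys add, here is a minimal Rust sketch of the expansion. The `Value` enum is cut down to two variants, `insert_dotted` is a hypothetical helper, and treating a clash (a duplicate key, or a path that runs through a scalar) as an error is an assumption of this sketch rather than documented behavior.

```rust
use std::collections::BTreeMap;

/// Cut-down value type for this sketch: only strings and objects.
#[derive(Debug, PartialEq)]
enum Value {
    String(String),
    Object(BTreeMap<String, Value>),
}

/// Insert `value` at the path described by `dotted_key`, creating objects on the way.
fn insert_dotted(
    root: &mut BTreeMap<String, Value>,
    dotted_key: &str,
    value: Value,
) -> Result<(), String> {
    let (parents, leaf) = match dotted_key.rsplit_once('.') {
        Some((parents, leaf)) => (Some(parents), leaf),
        None => (None, dotted_key),
    };

    let mut current = root;
    if let Some(parents) = parents {
        for segment in parents.split('.') {
            let entry = current
                .entry(segment.to_string())
                .or_insert_with(|| Value::Object(BTreeMap::new()));
            current = match entry {
                Value::Object(map) => map,
                _ => return Err(format!("`{segment}` is already a scalar")),
            };
        }
    }

    // Note: on a duplicate the old value has already been replaced when we report it.
    if current.insert(leaf.to_string(), value).is_some() {
        return Err(format!("duplicate key `{dotted_key}`"));
    }
    Ok(())
}
```

Inserting `node.host` and `node.auth.user` into an empty map produces the same tree as writing the nested `{}` form by hand.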

What's deliberately not in the format

No anchors / references (YAML's & and *)
No type tags
No expressions, interpolation, or includes
No schema language
No "JSON super-set" claim — Ktav is its own format with its own parser

Every absent feature is one I considered and rejected. Most of them push the parser past the "one evening to implement" complexity I wanted preserved, which matters for the next part.

It's Rust all the way down

The reference parser is in Rust. Hand-written recursive descent, no parser generator, zero-copy where possible. Speed is comparable to serde_json on typical config-sized inputs.

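use serde::Deserialize;

#[derive(Deserialize)]
struct Config {
    port: u16,
    log_level: String,
    upstreams: Vec<Upstream>,
}

#[derive(Deserialize)]
struct Upstream {
    host: String,
    port: u16,
    weight: f64,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "config.ktav" is a placeholder path; load the text however your app does.
    let text = std::fs::read_to_string("config.ktav")?;
    // `?` assumes the crate's error type implements std::error::Error.
    let config: Config = ktav::from_str(&text)?;
    println!("port {} with {} upstreams", config.port, config.upstreams.len());
    Ok(())
}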

Serde support is native — no separate serializer crate, no glue code.

The FFI strategy across seven languages

This is the part I'm most curious to hear feedback on.

Bindings for JavaScript / TypeScript, Python, Go, PHP, Java, and C# all wrap the same Rust core via FFI. One parser implementation, one behavior, seven languages. Each binding ships prebuilt binaries for Linux/macOS/Windows so consumers don't have to compile anything.

Language	FFI mechanism	Distribution
JS / TS	N-API (native) + WebAssembly (fallback)	`npm install @ktav-lang/ktav`
Python	PyO3 + `abi3` wheels	`pip install ktav`
Go	`purego` (no cgo for consumers)	`go get github.com/ktav-lang/golang`
PHP	FFI (PHP 7.4+ `ext-ffi`)	`composer require ktav-lang/ktav`
Java	JNA (no JNI for consumers)	Maven Central: `io.github.ktav-lang:ktav`
C# / .NET	P/Invoke	`dotnet add package Ktav`

The hard parts were:

Designing a stable C ABI that doesn't leak Rust types. Strings cross the boundary as length-prefixed byte slices; everything else is opaque handles with explicit _free functions.
Memory ownership semantics — every binding had to learn the same rule: "the parser allocates, the language frees via the free function". Documented once, repeated everywhere.
Error propagation — Rust's Result becomes a tagged union at the FFI layer, then each language wraps it in its idiomatic equivalent (exceptions in Python/Java/C#, errors in Go, rejected promises in JS).
Conformance tests — ~180 tests that every binding runs against. If a binding diverges from the spec, CI catches it. This was the single most valuable investment in the project.
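
To give a feel for the first three points, here is a minimal Rust sketch of an opaque-handle C ABI with a tagged result and an explicit free function. Every name in it (`KtavDoc`, `KtavStatus`, `KtavResult`, `ktav_sketch_parse`, `ktav_sketch_doc_len`, `ktav_sketch_free`) is invented for illustration and is not the project's real ABI. The sketch passes strings as a pointer plus a length, stores the text instead of actually parsing it, and leaves out the symbol-export attribute a real library would need.

```rust
use std::{ptr, slice, str};

/// Opaque handle: foreign code only ever holds a pointer to it.
pub struct KtavDoc {
    // Stand-in for a parsed document; this sketch just keeps the text.
    text: String,
}

/// Tag of the C-compatible result.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum KtavStatus {
    Ok = 0,
    InvalidUtf8 = 1,
    NullInput = 2,
}

/// Tagged result: `doc` is non-null only when `status` is `Ok`.
#[repr(C)]
pub struct KtavResult {
    pub status: KtavStatus,
    pub doc: *mut KtavDoc,
}

/// Safety: `data` must point to `len` readable bytes, or be null.
pub unsafe extern "C" fn ktav_sketch_parse(data: *const u8, len: usize) -> KtavResult {
    if data.is_null() {
        return KtavResult { status: KtavStatus::NullInput, doc: ptr::null_mut() };
    }
    let bytes = unsafe { slice::from_raw_parts(data, len) };
    match str::from_utf8(bytes) {
        Ok(text) => {
            // The Rust side allocates; ownership moves to the caller as a raw pointer.
            let doc = Box::new(KtavDoc { text: text.to_owned() });
            KtavResult { status: KtavStatus::Ok, doc: Box::into_raw(doc) }
        }
        Err(_) => KtavResult { status: KtavStatus::InvalidUtf8, doc: ptr::null_mut() },
    }
}

/// Safety: `doc` must be a live handle returned by `ktav_sketch_parse`.
pub unsafe extern "C" fn ktav_sketch_doc_len(doc: *const KtavDoc) -> usize {
    unsafe { (*doc).text.len() }
}

/// Safety: `doc` must be null or a handle from `ktav_sketch_parse` that has not been freed.
pub unsafe extern "C" fn ktav_sketch_free(doc: *mut KtavDoc) {
    if !doc.is_null() {
        // Rebuild the Box so Rust's allocator releases the memory.
        drop(unsafe { Box::from_raw(doc) });
    }
}
```

Each binding would then wrap the status tag in its own idiom (an exception, an error value, a rejected promise) and make sure the free function runs exactly once per handle.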

The WebAssembly build of the same Rust crate also powers the online playground — you can paste JSON, YAML, TOML, or INI and see the Ktav equivalent in your browser. Everything runs locally; nothing is sent to a server.

Editor tooling

Because a config format without editor support is just a frustration generator.

LSP server in Rust (separate ktav-lsp crate) — diagnostics, completions, hover info, go-to-definition for dotted keys.
VS Code plugin — bundles the LSP, syntax highlighting via TextMate grammar.
JetBrains plugin (IntelliJ, CLion, RustRover, WebStorm, PyCharm, GoLand, PhpStorm, Rider) — bundles the LSP, syntax highlighting, indentation support.
tree-sitter grammar — drop into Neovim, Helix, Zed, or anything else with tree-sitter support.

The LSP catches the things the parser would catch at runtime, but inline as you type: type mismatches at the lexical level, unmatched brackets, malformed multi-line strings.

Honest caveats

I should be upfront about what this is and isn't.

It's young. The spec is at 0.6.x and the format is still evolving, though I don't expect breaking changes from here.
No production users I know of besides me. I built it because I wanted it. The ecosystem grew because wrapping one Rust core in FFI turned out to be much more tractable than I expected.
It doesn't replace TOML for Cargo. It doesn't replace YAML for Kubernetes. Those formats have entire ecosystems built on them and Ktav has zero. I'm not trying to displace anything — I'm offering a different trade-off for people who, like me, want JSON's flexibility without JSON's ceremony.
Parsing speed is comparable to serde_json for typical configs (< 100 KB), but I'm not claiming it beats simd-json on 500 MB inputs. It's a config format, not a data interchange format.

What exists today

All open-source, dual-licensed MIT OR Apache-2.0.

Specification: ktav-lang/spec — formal grammar, semantics, ~180 conformance tests
Rust reference: ktav on crates.io, with serde support
JS / TS: @ktav-lang/ktav on npm
Python: ktav on PyPI
Go: github.com/ktav-lang/golang
PHP: ktav-lang/ktav on Packagist
Java: io.github.ktav-lang:ktav on Maven Central
C# / .NET: Ktav on NuGet
LSP server: ktav-lsp on crates.io
Editor plugins: ktav-lang/editor — VS Code + JetBrains
Tree-sitter: ktav-lang/tree-sitter-ktav
Online playground: ktav-lang.github.io

Everything lives under the ktav-lang organization on GitHub.

What I'd love feedback on

Is :: the right shape for "forced literal string"? Alternatives I considered: :' (looks like a quote), :" (same), :s (explicit type), := (looks like assignment in some languages). I picked :: because it feels like "the same marker, doubled" — but I'm not sure.
Should multi-line strings preserve trailing whitespace? Right now they're trimmed. I went back and forth.
Are dotted keys worth their complexity? They're nice when present but they're the only "two ways to do the same thing" feature in the format.
The FFI-everywhere strategy — am I underestimating maintenance cost? Right now ~180 conformance tests catch divergences in CI. Is there a scale at which this breaks down?

If you read this far, thank you. Issues and PRs are welcome anywhere in the org, and so is any feedback that helps me understand what trade-offs to make next.

(Drafted with assistance from an LLM editor)

Top comments (6)

Nicholas Franklin • Jun 19

What if you just use = for strings? You could even have a rule like "= for strings : for everything else" which, if you forbid strings to use :, could catch some errors where you meant something to be not a string but it accidentally became one due to a typo. Maybe like if you typoed 1.0 as 1..0, this would catch that and provide a more useful error.

Arch-AI-tech • Jul 3

Thanks for the suggestion, and for the interesting angle on catching typos through separator choice!

I don't think I can take the = for strings idea, though. Splitting the pair separator into two symbols (= for strings, : for everything else) would break every existing document, and it moves away from the core idea of the format: one separator, with the type inferred from the shape of the value - not from which symbol you typed.

You're right about the typo-catching, to be fair: today v: 1..0 silently becomes the string "1..0" rather than an error, and under your scheme it would fail loudly. That's a real trade-off I'm consciously accepting - Ktav treats "anything that isn't a number/keyword is a string" as a feature (it's what makes unquoted strings possible at all), and pushes type validation to the consumer: an app expecting a float will reject "1..0" with a clear error at load time. A schema layer on top (JSON Schema-style) is the more likely future answer for catching this earlier, without splitting the separator.

Appreciate you taking the time to think through the design - feedback like this is exactly what makes the project better.

Phil S • Jun 19

I love what you've done and the distance you've covered to do it. The "7 FFI" is great.

Re: #1: Consider ":=" for explicit strings. I do appreciate your defaults-to-string logic.

Re: #2: Perhaps a number before the open paren can say how many spaces to trim? Default to all, but allow "0" to mean "don't touch this" and "4" to mean trim the first four columns but no more.

## Takes 2 spaces off, leaving 2
motd: 2 (
    Welcome to the node.
    Please behave.
)

I'd also consider adding escaping, unicode, etc since these are commonly used ("\u1234", "\n", "\", etc).

More importantly, the big problem in complex configuration files is repeating content. Please add some mechanism to allow variables and variable expansion, for both fields and hierarchies. Something like:

MINE=mine
DESTDIR=/opt/local/${MINE}
SHAREDIR=/opt/share/local/${MINE}

## The top-level braces are not considered part of the value
PORT_INFO= {
    port: 1080
    weight: 0.7
}

## Here the inner level of braces is kept as part of the value
STREAM = {
    {
        foo: 4
        bar: 5
    }
}

upstreams: [
    {
        host: a.example
        ${PORT_INFO}
    }
    {
        host: b.example
        ${PORT_INFO}
    }
]

Without some sort of substitution scheme, a lot of information is repeated or one has to depend on some application-specific inheritance or substitution scheme (e.g. .ssh/config). These schemes are often not flexible or exact enough for users' needs.

I'd also consider keeping some sort of syntax for giving directives to ktav itself, à la pragmas. While I like your simplicity and don't want to see you following the "keep adding everything" path other config languages have taken [1], having a "version" pragma becomes a get-out-of-jail card for future-proofing your work.

## this is a comment

## explicit version number
#version 0.6

## Hate IEEE floats?  Use bigfloat
#use bigfloat

## Hate 64-bit limitations?  Use decimal
#use decimal

This keeps your "Single # is too common" idea, but repurposes a single "#" (in column zero) as a feature/behavior-triggering mechanism.

Again, what you have is simple, readable, and human oriented, with precise yet flexible typing for numbers/floats/strings, which is a step forward that I appreciate.

Thanks,
Phil

[1]: Yes, I know I'm asking you to add variables, but no "&"s and "*"s please ;^)

Arch-AI-tech • Jul 3 • Edited

Thanks for such a thoughtful, detailed comment - really appreciate the time and the concrete examples you put into it!

On the \uXXXX unicode escape - this one's genuinely useful and doesn't conflict with anything in the current design, it's a pure addition to the escape table, not a behavior change. I've taken it into the spec scope for the next revision: github.com/ktav-lang/spec/issues/1...

On := as an explicit-string marker and the numeric trim-count on the multi-line opener - I like the thinking, but I don't think I can take either. := would break every existing document for a stylistic gain I'm not sure earns it, and the trim-count adds a grammar parameter for a case already covered by choosing stripped vs. verbatim form.

On variables and pragma directives - this is the big one, and it's genuinely interesting, but it goes against the core idea of the project right now: every line's meaning should be self-evident from what's visible on it, nothing computed or resolved from somewhere else in the document. Variable expansion and directives are exactly the kind of implicit, non-local behavior Ktav was built to get away from - the moment a value can depend on something defined elsewhere, you lose the "just read it top to bottom" property that's the whole point.

That said, I don't want to just wave it off, because the underlying need is real - repeated content in large configs is a genuine pain, and inheritance schemes like .ssh/config's often aren't flexible enough, as you said. I could see this living as a separate, more powerful language built on top of Ktav - something like "KtavPlus" - that compiles down to plain Ktav (or the same Value model) and adds variables, substitution, maybe pragmas, as an explicit opt-in layer, while the base format stays a pure data format with no surprises. That way people who want a config language get one, and people who just want a config format aren't forced to reason about resolution order.

If that idea interests you, I'd genuinely like to keep discussing it - please open an issue (or comment on an existing design discussion) at github.com/ktav-lang/spec so it's tracked properly and other people can weigh in too.

Thanks again for engaging with this so seriously - it's exactly the kind of feedback that makes the project better.

Berkus Decker • Jul 5

So how is it different from KDL?

Arch-AI-tech • Jul 6 • Edited

Good question - KDL is genuinely the closest thing to Ktav I've seen, but the core model is different.

Document shape: KDL's basic unit is a node - a name followed by positional arguments and/or properties, with optional child nodes in { }. It's closer to XML/S-expressions with a friendlier face than to JSON. Ktav's basic unit is a key: value pair - the document is JSON's data model (scalars, arrays, objects, null, bool), just written without punctuation. There's no "positional argument vs. property" distinction in Ktav, because there's no node concept - everything is a plain key-value tree or array.

Strings: KDL still requires quotes for anything with whitespace or that looks like a number/keyword ("hello world"), with a separate raw-string form (#"..."#) for no-escape content. Ktav drops quotes entirely - key: hello world is valid as-is, and the rare "force this to be a string" case uses :: instead of quotes.

Comments: KDL has //, /* */, and the "slashdash" /- for commenting out a node. Ktav has one form, ##, line-only, no block comments.

Multi-line strings: KDL's triple-quoted strings auto-dedent to the closing """'s indent. Ktav's ( … ) dedents to the minimum indent across lines and (after a fix I'm making) also strips trailing whitespace; (( … )) is fully verbatim if you need every byte exact.

Typed values: KDL lets you tag a value with a type annotation like (date)"2021-02-03" - genuinely more expressive than anything Ktav has; Ktav only infers the four/five JSON-ish types from lexical form.

Basically: if you want a format that can express document-like/markup-like structures (nodes with mixed args+children, like a UI tree or CLI-arg style config), KDL's model fits that better. If your data is fundamentally JSON-shaped - objects, arrays, scalars - and you just want the punctuation gone, that's exactly Ktav's one job.

That's really the whole philosophy behind Ktav: keep it as simple as possible, one shape, no ceremony. It's not a perfect or maximally-expressive format - KDL's type annotations and node model can do things Ktav deliberately can't. But for the thing Ktav is actually for - plain JSON-shaped config, written by hand - I think it's the best tool I know of for that job, mine included, because I use it daily and it's exactly what I wanted when nothing else fit.