Verse Markup Language (VML): When Poetry Meets Code
Have you ever tried to make a computer understand a poem? Not just find rhymes, but parse where the author is, where the title is, where the epigraph is, and where the actual verses are? If so, you know how quickly “plain text” becomes a headache.
I’m Boris Orekhov, and I created VML (Verse Markup Language) — a minimalist markup language that turns the chaos of poetic text into a machine-readable structure. And yes, it’s already working on real‑world corpora (e.g., the Bashkir Poetry Corpus with >10,000 poems).
A note on language: The full VML specification is written in Russian. However, given the current state of machine translation (DeepL, GPT-4, etc.), this is hardly a barrier. Anyone can obtain an accurate, readable English version in seconds. The tag names and syntax are language-agnostic anyway — <a>, <&>, <n> work the same in any idiom.
Why not TEI, JSON, or Markdown?
- TEI – a beast. It can handle anything, but marking up a single poem feels like writing a new gospel in XML. That makes it impractical for many humanists.
- JSON – great for machines, but a humanist will close their laptop in horror.
- Markdown – fine for documentation, but not for strict metadata markup.
VML is a text format that stays human‑readable and is easy to parse. Its tags are short (<a>, <&>, <n>, <rm>), the hierarchy is strict, and learning it takes 5 minutes.
Minimal example
<a> William Blake
<&> Tyger Tyger, burning bright,
In the forests of the night,
What immortal hand or eye,
Could frame thy fearful symmetry?
- <a> – author (stays active until the next <a>)
- <&> – the first line of the poem (incipit). Without it, the document is invalid.
From this you can already extract the author, the title (none here), the list of lines, and even count them.
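To make that extraction concrete, here is a minimal Python sketch. It is not the repository's parser: the function name `parse_minimal`, the regular expression, and the output shape are assumptions made for illustration, and it relies only on the two tags described above (`<a>` and `<&>` at the start of a line).

```python
import re

# Assumption: a tag sits at the start of a line and looks like <a> or <&>.
TAG_RE = re.compile(r"^<([^<>\s]+)>\s*(.*)$")


def parse_minimal(text):
    """Return a list of poems, each with its author and its lines."""
    poems = []
    author = None
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = TAG_RE.match(line)
        if match is None:
            # An untagged line continues the poem opened by the last <&>.
            if current is not None:
                current["lines"].append(line)
            continue
        tag, rest = match.groups()
        if tag == "a":
            author = rest  # stays active until the next <a>
            current = None
        elif tag == "&":
            current = {"author": author, "lines": [rest]}
            poems.append(current)
        # Every other tag is ignored in this sketch.
    return poems


SAMPLE = """<a> William Blake
<&> Tyger Tyger, burning bright,
In the forests of the night,
What immortal hand or eye,
Could frame thy fearful symmetry?"""

for poem in parse_minimal(SAMPLE):
    print(poem["author"], len(poem["lines"]))
```

For the Blake sample this yields one poem by William Blake with four lines. A real parser would also have to handle the remaining tags, which this sketch simply drops.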
Adding structure
<a> Emily Dickinson
<rm> Hope is the thing with feathers <-- epigraph (prose insert)
<&> “Hope” is the thing with feathers -
That perches in the soul -
And sings the tune without the words -
And never stops - at all -
<*> And sweetest - in the Gale - is heard -
And sore must be the storm -
That could abash the little Bird
That kept so many warm -
- <rm> – prose insert (epigraph, dedication, stage direction). Each line gets its own <rm>. The poetry parser skips it.
- <*> – start of a stanza. Stanzas are separated by repeated <*>.
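A rough Python sketch of how a consumer might separate prose inserts from stanzas follows. The helper name `split_stanzas` is hypothetical, and the sketch assumes that the incipit `<&>` opens the first stanza and each `<*>` opens the next one; the specification may define further cases that are not covered here.

```python
def split_stanzas(text):
    """Return (prose_lines, stanzas) for a single poem."""
    prose = []
    stanzas = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("<a>"):
            continue
        if line.startswith("<rm>"):
            # Prose inserts are kept apart and never counted as verse.
            prose.append(line[len("<rm>"):].strip())
        elif line.startswith("<&>") or line.startswith("<*>"):
            # Both tags are three characters long and open a new stanza here.
            stanzas.append([line[3:].strip()])
        elif stanzas:
            stanzas[-1].append(line)
    return prose, stanzas


DICKINSON = """<a> Emily Dickinson
<rm> Hope is the thing with feathers
<&> "Hope" is the thing with feathers -
That perches in the soul -
And sings the tune without the words -
And never stops - at all -
<*> And sweetest - in the Gale - is heard -
And sore must be the storm -
That could abash the little Bird
That kept so many warm -"""

prose, stanzas = split_stanzas(DICKINSON)
print(len(prose), [len(stanza) for stanza in stanzas])
```

On this sample the function finds one prose line and two stanzas of four lines each.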
Poetry cycles
<nn> Sonnets to the Young Man
<n> Sonnet 1
<&> From fairest creatures we desire increase...
<n> Sonnet 2
<&> When forty winters shall besiege thy brow...
</nn>
<nn> ... </nn> – a cycle. Inside are several poems, each with its own title <n> and incipit <&>.
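As an illustration, a cycle could be read into a nested structure along these lines. The function `parse_cycles` and the dictionary layout are assumptions, not part of the VML tools, and the sketch takes the text after `<nn>` to be the cycle title, as in the example above.

```python
def parse_cycles(text):
    """Return a list of cycles, each holding titled poems."""
    cycles = []
    cycle = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("<nn>"):
            cycle = {"title": line[len("<nn>"):].strip(), "poems": []}
            cycles.append(cycle)
        elif line.startswith("</nn>"):
            cycle = None
        elif cycle is None:
            continue  # poems outside a cycle are out of scope for this sketch
        elif line.startswith("<n>"):
            title = line[len("<n>"):].strip()
            cycle["poems"].append({"title": title, "lines": []})
        elif cycle["poems"]:
            # Strip the incipit tag; other tags are not handled here.
            if line.startswith("<&>"):
                line = line[len("<&>"):].strip()
            cycle["poems"][-1]["lines"].append(line)
    return cycles


SONNETS = """<nn> Sonnets to the Young Man
<n> Sonnet 1
<&> From fairest creatures we desire increase...
<n> Sonnet 2
<&> When forty winters shall besiege thy brow...
</nn>"""

for found in parse_cycles(SONNETS):
    print(found["title"], [poem["title"] for poem in found["poems"]])
```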
What if an epigraph appears before an incipit without a title?
Suppose two poems appear in a row and the second has an epigraph. How do you avoid attaching the epigraph to the first poem? VML provides <&&> – an explicit start of a new poem, placed before the incipit:
<a> John Keats
<&> A thing of beauty is a joy for ever...
...
<&&>
<rm> Epigraph to “Ode to a Nightingale”
<rm> “Thou wast not born for death, immortal Bird!”
<&> My heart aches, and a drowsy numbness pains...
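One way a parser could honour that boundary is sketched below. The name `attach_prose` and the rule it applies are assumptions for illustration: `<rm>` lines seen after `<&&>` (or before any poem) are held back and attached to the next incipit, while any other `<rm>` line goes to the poem already open.

```python
def attach_prose(text):
    """Return poems with their prose inserts attached to the right poem."""
    poems = []
    pending = []     # <rm> lines waiting for the next incipit
    awaiting = True  # True before the first poem and after <&&>
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("<a>"):
            continue
        if line.startswith("<&&>"):
            awaiting = True
        elif line.startswith("<rm>"):
            body = line[len("<rm>"):].strip()
            if awaiting:
                pending.append(body)
            else:
                poems[-1]["prose"].append(body)
        elif line.startswith("<&>"):
            first = line[len("<&>"):].strip()
            poems.append({"prose": pending, "lines": [first]})
            pending = []
            awaiting = False
        elif poems and not awaiting:
            poems[-1]["lines"].append(line)
    return poems


KEATS = """<a> John Keats
<&> A thing of beauty is a joy for ever...
<&&>
<rm> Epigraph to "Ode to a Nightingale"
<rm> "Thou wast not born for death, immortal Bird!"
<&> My heart aches, and a drowsy numbness pains..."""

for poem in attach_prose(KEATS):
    print(len(poem["prose"]), poem["lines"][0])
```

Without the `<&&>` line, the same code would attach both `<rm>` lines to the first poem, which is exactly the ambiguity the tag removes.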
For the geeks: “ladder” layout
If you’re a verse scholar, you might need explicit “ladder” markup, for example for Mayakovsky:
<&> Я волком бы <l-2> выгрыз <l-3> бюрократизм.
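The post does not spell out the exact semantics of `<l-N>`, so the following Python sketch rests on a guess: each `<l-N>` opens step N of the ladder, and the text before the first such tag is step 1. The function name `ladder_steps` is hypothetical, and the incipit tag is assumed to have been stripped already.

```python
import re

# Assumption: ladder tags have the form <l-2>, <l-3>, ...
STEP_RE = re.compile(r"<l-(\d+)>")


def ladder_steps(line):
    """Split one verse line into (step_number, text) pairs."""
    # With one capturing group, re.split alternates text and step numbers.
    parts = STEP_RE.split(line)
    steps = [(1, parts[0].strip())]
    for i in range(1, len(parts), 2):
        steps.append((int(parts[i]), parts[i + 1].strip()))
    return steps


for number, fragment in ladder_steps("Я волком бы <l-2> выгрыз <l-3> бюрократизм."):
    # Indent each step to imitate the printed ladder.
    print("    " * (number - 1) + fragment)
```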
Tools
The repository contains:
- Validator (vml_validator.py) – checks syntax, nesting, escaping.
- Counter (vml_counter.py) – outputs statistics: authors, poems, lines, cycles.
python3 vml_validator.py poem.vml
python3 vml_counter.py poem.vml
Where is it already used?
Bashkir Poetry Corpus (http://web-corpora.net/bashcorpus/) – more than 10,000 poems marked up in VML. Based on this corpus, a monograph (Orekhov, 2019) and several articles on quantitative verse studies have been published.
Who is it for?
- Humanists who want to build their own corpus without learning XML/JSON.
- Developers who need to extract metadata from poems for NLP tasks (author, date, metre, stanza structure).
- Digital Humanities enthusiasts looking for a simple, open standard.
Links
- Full specification – README.md
- Repository with code and examples – github.com/nevmenandr/VML
- Language DOI – 10.5281/zenodo.20100191
- License – Apache 2.0 (free to use, modify, and distribute, including commercial use)
Biblio
Orekhov, B. V. (2019). Bashkirskiy stikh XX veka. Korpusnoye issledovaniye [Bashkir Verse of the 20th Century: A Corpus Study]. Saint Petersburg: Aleteya. 344 p. ISBN 978-5-907189-29-4. (In Russian)
P.S. What about escaping?
In real poems, sequences like <a> almost never appear as part of the text. But just in case, escaping exists: \<a> will be treated as plain <a> and won’t break the parser.
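A small Python sketch of how escaping could be handled on the reading side is shown below. The helper names `starts_with_tag` and `unescape` are assumptions and do not come from the repository; the only behaviour taken from the text above is that a backslash before a tag-like sequence turns it into plain text.

```python
import re

# Assumption: a tag-like sequence is "<", some non-space characters, ">".
TAG_AT_START = re.compile(r"<[^<>\s]+>")
ESCAPED_TAG = re.compile(r"\\(<[^<>\s]+>)")


def starts_with_tag(line):
    """True only for an unescaped tag at the very start of the line."""
    return TAG_AT_START.match(line) is not None


def unescape(line):
    """Drop the backslash in front of an escaped tag-like sequence."""
    return ESCAPED_TAG.sub(r"\1", line)


verse = r"\<a> is just text in this line"
print(starts_with_tag(verse))  # the backslash keeps it from being read as a tag
print(unescape(verse))
```

The check runs on the raw line first, and the backslash is removed only when the text is handed on, so an escaped `<a>` never changes the current author.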
Discussion in the comments is welcome. Do you use any formats for poetry markup? Would you like to try VML? Feel free to open issues on GitHub – any feedback is valuable.
Boris Orekhov, 2026
