jk-kaluga

Posted on May 29

Preserving semantic styles from DOCX/ODT in Go

#go #epub #publishing #opensource

Most manuscript conversion tools are very good at moving text from one format to another.

That sounds like enough until you work with a manuscript where styles are not just decoration.

In many Word or LibreOffice files, a paragraph style called Poem does not mean “make this text indented”. It means “this is a poem”. A character style called Foreign - Latin does not only mean “italic text”. It means “this span is foreign-language text, probably with language-specific handling later”.

That distinction is easy to lose.

Once it is gone, every later step has to guess. Is this italic text a thought, a book title, a foreign phrase, emphasis, or just an accident from the editor? Is this indented block a quote, a poem, a letter, or a layout workaround?

I wanted a converter that treats named styles as manuscript semantics first, and visual output second.

That is the idea behind Tessera.

The problem with “just convert it”

DOCX and ODT are not pleasant formats, but they do carry useful information. Authors and editors already use named paragraph and character styles in tools they understand.

The problem is that a lot of conversion workflows flatten those styles into presentation:

Poem becomes indented paragraphs.
Epigraph becomes a styled quote.
Foreign - Latin becomes italic text.
Direct Thought becomes italic text too.

After conversion, two very different meanings can look identical.

This may be fine for a one-off document.

It is less fine for book production, where the same source may need to produce EPUB, PDF, LaTeX, test fixtures, and review artifacts. At that point, preserving meaning becomes more useful than preserving whatever the original document happened to look like on one machine.

Why not just use Pandoc or Calibre?

Pandoc and Calibre are excellent tools. I use them, and I do not think Tessera should be described as a replacement for either of them.

They solve broader problems.

Tessera is narrower. It is built around one specific assumption: in a manuscript, named styles are the source of truth.

That means I do not want to infer that a paragraph is a poem because it is indented or italic. I want the manuscript to say it is a poem through a style name, then carry that role through the whole pipeline.

For generic conversion, broad format support is a strength. For this workflow, a stricter model is useful:

parse DOCX or ODT;
map named styles to known roles;
build a semantic intermediate representation;
render EPUB, LaTeX, and PDF from that representation;
make the output reproducible enough to test.

That is a smaller problem than “convert anything to anything”, but it gives more control over the book-specific cases I care about.

The pipeline

The pipeline is:

DOCX / ODT -> semantic IR -> LaTeX + EPUB

The input parser reads the document package, extracts text, metadata, styles, images, notes, and structure, then maps known style names into semantic roles.
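As a rough illustration of that mapping step, here is a minimal sketch in Go. The `Role` type, the `Resolve` function, and the language table are hypothetical names made up for this post, not Tessera's actual API. It assumes style names are matched case-insensitively and that a `Foreign - X` style carries its language in the suffix.

```go
package main

import (
	"fmt"
	"strings"
)

// Role is a semantic role in the manuscript (hypothetical type for this sketch).
type Role string

const (
	RoleUnknown       Role = "unknown"
	RolePoem          Role = "poem"
	RoleEpigraph      Role = "epigraph"
	RoleForeign       Role = "foreign"
	RoleDirectThought Role = "direct-thought"
)

// Resolved is what a style name maps to: a role plus an optional language tag.
type Resolved struct {
	Role Role
	Lang string // language tag, only set for foreign-language spans
}

// langByName is an illustrative lookup from the suffix of "Foreign - X" to a language tag.
var langByName = map[string]string{"latin": "la", "french": "fr", "german": "de"}

// normalize lowercases a style name and collapses runs of whitespace.
func normalize(name string) string {
	return strings.Join(strings.Fields(strings.ToLower(name)), " ")
}

// Resolve maps a visible style name to a semantic role.
// Unknown names are reported as RoleUnknown instead of being guessed from formatting.
func Resolve(styleName string) Resolved {
	n := normalize(styleName)
	switch n {
	case "poem":
		return Resolved{Role: RolePoem}
	case "epigraph":
		return Resolved{Role: RoleEpigraph}
	case "direct thought":
		return Resolved{Role: RoleDirectThought}
	}
	const prefix = "foreign - "
	if strings.HasPrefix(n, prefix) {
		return Resolved{Role: RoleForeign, Lang: langByName[strings.TrimPrefix(n, prefix)]}
	}
	return Resolved{Role: RoleUnknown}
}

func main() {
	for _, name := range []string{"Poem", "Foreign - Latin", "Direct Thought", "Body Text"} {
		r := Resolve(name)
		fmt.Printf("%q -> role=%s lang=%q\n", name, r.Role, r.Lang)
	}
}
```

The design choice worth noting is the explicit unknown role: a style the mapper does not recognize stays visibly unrecognized, so nothing downstream has to infer meaning from indentation or italics.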

The intermediate representation is the important layer. It is not HTML, and it is not LaTeX. It is closer to “what the manuscript means”.

From there, Tessera can render different outputs:

EPUB XHTML for ebooks;
LaTeX for print-oriented workflows;
PDF through a TeX engine;
canonical IR JSON for tests and debugging.

That IR layer makes the rest of the system easier to reason about. If a parser bug changes document meaning, it shows up in the IR, before any rendering happens. If an EPUB renderer and a LaTeX renderer disagree, they are disagreeing over the same structured input.

A small example

Imagine a manuscript with these styles:

Paragraph style: Poem
Paragraph style: Epigraph
Character style: Foreign - Latin
Character style: Direct Thought

A generic conversion pipeline might turn several of those into italics and indentation.

Tessera keeps them separate.

A phrase marked as Foreign - Latin can become language-aware text:

veritas

A span marked as Direct Thought can remain a thought role:

a private thought

A paragraph marked as Poem can become a verse block instead of just a visually indented paragraph:

\begin{verse}
First semantic line \\
Second semantic line
\end{verse}

An Epigraph can remain an epigraph in both EPUB and LaTeX output, instead of becoming just another quote-like block.

The point is not that these exact tags or macros are magical. The point is that the decision is made once, from manuscript intent, and then each output format gets a suitable representation.
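Here is a minimal sketch of "decide once, render per format" for character-level roles. Everything in it is an assumption for illustration: the `Span` type, the CSS class names, and the LaTeX macros `\foreignspan` and `\thought` are hypothetical and would need to be defined in the stylesheet and preamble. It is not the markup Tessera actually emits.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// Span is a run of text with a semantic role (hypothetical shape).
type Span struct {
	Role string // "foreign", "direct-thought", or "" for plain text
	Lang string // language tag for foreign spans
	Text string
}

// RenderXHTML turns a span into an EPUB-style XHTML fragment.
func RenderXHTML(s Span) string {
	text := html.EscapeString(s.Text)
	switch s.Role {
	case "foreign":
		lang := html.EscapeString(s.Lang)
		return fmt.Sprintf(`<i class="foreign" xml:lang="%s" lang="%s">%s</i>`, lang, lang, text)
	case "direct-thought":
		return fmt.Sprintf(`<span class="direct-thought">%s</span>`, text)
	default:
		return text
	}
}

// latexEscaper escapes characters that are special in LaTeX text mode.
// strings.Replacer works in a single pass, so inserted braces are not escaped again.
var latexEscaper = strings.NewReplacer(
	`\`, `\textbackslash{}`,
	`&`, `\&`, `%`, `\%`, `$`, `\$`, `#`, `\#`, `_`, `\_`,
	`{`, `\{`, `}`, `\}`,
	`~`, `\textasciitilde{}`, `^`, `\textasciicircum{}`,
)

// RenderLaTeX turns the same span into a LaTeX fragment using hypothetical macros.
func RenderLaTeX(s Span) string {
	text := latexEscaper.Replace(s.Text)
	switch s.Role {
	case "foreign":
		return fmt.Sprintf(`\foreignspan{%s}{%s}`, s.Lang, text)
	case "direct-thought":
		return fmt.Sprintf(`\thought{%s}`, text)
	default:
		return text
	}
}

func main() {
	spans := []Span{
		{Role: "foreign", Lang: "la", Text: "veritas"},
		{Role: "direct-thought", Text: "a private thought"},
	}
	for _, s := range spans {
		fmt.Println(RenderXHTML(s))
		fmt.Println(RenderLaTeX(s))
	}
}
```

Both renderers may well end up producing italics, but they get there from the role rather than from the formatting, so a later stylesheet or macro change can treat thoughts and foreign phrases differently.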

The annoying parts

The hardest parts were not the obvious ones.

Parsing XML is expected. Working with ZIP-based document formats is expected. The more annoying work was making the pipeline boring and testable.

DOCX and ODT are packages, not single files. The useful data is spread across XML files, relationships, metadata, media, and style definitions. Small differences between Word and LibreOffice matter. A single style can have a visible name, an internal ID, a localized name, and behavior inherited from a parent style.
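The gap between internal IDs and visible names is easy to show for DOCX. In WordprocessingML, paragraphs and runs reference a style by its `w:styleId`, while the name the author sees lives in a `w:name` child inside `word/styles.xml`. The sketch below builds an ID-to-name table from a hand-written fragment using only the standard library; the `StyleNames` function and the sample XML are illustrative, and real files contain far more detail.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// sampleStyles is a trimmed, hand-written stand-in for word/styles.xml.
const sampleStyles = `<?xml version="1.0" encoding="UTF-8"?>
<w:styles xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:style w:type="paragraph" w:styleId="Poem">
    <w:name w:val="Poem"/>
    <w:basedOn w:val="Normal"/>
  </w:style>
  <w:style w:type="character" w:styleId="ForeignLatin">
    <w:name w:val="Foreign - Latin"/>
  </w:style>
</w:styles>`

// valAttr captures elements of the form <w:name w:val="..."/>.
type valAttr struct {
	Val string `xml:"val,attr"`
}

// style mirrors the parts of <w:style> this sketch cares about.
type style struct {
	Type    string  `xml:"type,attr"`
	ID      string  `xml:"styleId,attr"`
	Name    valAttr `xml:"name"`
	BasedOn valAttr `xml:"basedOn"`
}

type styles struct {
	Styles []style `xml:"style"`
}

// StyleNames returns a map from internal style ID to visible style name.
func StyleNames(data []byte) (map[string]string, error) {
	var doc styles
	if err := xml.Unmarshal(data, &doc); err != nil {
		return nil, err
	}
	names := make(map[string]string, len(doc.Styles))
	for _, s := range doc.Styles {
		names[s.ID] = s.Name.Val
	}
	return names, nil
}

func main() {
	names, err := StyleNames([]byte(sampleStyles))
	if err != nil {
		panic(err)
	}
	fmt.Println(names["ForeignLatin"])
}
```

The `BasedOn` field is parsed but unused here; a fuller version would follow it to resolve inherited behavior.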

Then there is deterministic output.

For normal use, a generated EPUB just needs to open. For tests and CI, it is much better if the same input produces the same output. That means paying attention to timestamps, ordering, metadata, ZIP entries, generated identifiers, and canonical JSON.

EPUB linting was another useful constraint. It is easy to generate XHTML that looks fine in one reader and is still structurally wrong. A built-in lint pass catches project-specific mistakes early, and external tools like epubcheck can still be used in stricter workflows.
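A lint pass of this kind does not need much machinery to be useful. The sketch below is a hypothetical, much-reduced version: it checks that a chapter is well-formed XML and applies one illustrative rule, that every `img` element has an `alt` attribute. The `LintXHTML` function and the rule itself are assumptions for this post, not a description of Tessera's built-in checks, and it is no substitute for epubcheck.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// LintXHTML returns a list of problems found in an XHTML document.
// An empty result means the two checks in this sketch passed.
func LintXHTML(src string) []string {
	var problems []string
	dec := xml.NewDecoder(strings.NewReader(src))
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			// The strict decoder reports mismatched or unclosed tags here.
			problems = append(problems, "not well-formed: "+err.Error())
			break
		}
		el, ok := tok.(xml.StartElement)
		if !ok || el.Name.Local != "img" {
			continue
		}
		hasAlt := false
		for _, a := range el.Attr {
			if a.Name.Local == "alt" {
				hasAlt = true
			}
		}
		if !hasAlt {
			problems = append(problems, "img without alt attribute")
		}
	}
	return problems
}

func main() {
	chapter := `<html xmlns="http://www.w3.org/1999/xhtml"><body>
<p>Text</p><img src="cover.png"/></body></html>`
	for _, p := range LintXHTML(chapter) {
		fmt.Println(p)
	}
}
```

Because the rules run on generated output, they can be as opinionated as the project needs without affecting what the parser accepts.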

None of this is glamorous, but it is the difference between a demo converter and something I can trust in a repeatable publishing pipeline.

Current status

Tessera is still early.

The core shape is in place:

DOCX and ODT input;
semantic IR;
EPUB output;
LaTeX output;
PDF builds through TeX;
CLI commands for build, inspect, lint, and doctor;
Docker and GitHub Action support;
tests around the parser and renderers.

The project is intentionally not a GUI, not a SaaS uploader, and not a general-purpose Markdown converter. It is a command-line tool for manuscripts where named styles carry the structure of the book.

If that sounds close to a problem you have, the repository is here:

https://github.com/balyakin/tessera