From Leather-Bound Notebooks to .DOCX
Remember the days of walking into a meeting with an expensive pen and a leather-bound notebook? Then came the pandemic. Virtual meetings became the default workplace interaction, and AI-driven natural language models made meeting transcription mainstream.
For platforms like Zoom and Microsoft Teams, the default format for those transcriptions is often Microsoft’s .DOCX. That’s convenient for an end user who just wants to open and read the transcript. But if another machine — like another AI engine — wants to read and process that transcript, things get much trickier.
⸻
.DOCX Backgrounder
With Office 2007, Microsoft shifted the default Word format from .doc (a proprietary binary format) to .docx, part of the Office Open XML (OOXML) specification. OOXML was standardized as ECMA-376 in 2006 and later as ISO/IEC 29500, and the move was meant to make Word files more interoperable with other systems.
Technically speaking, .docx files are ZIP archives containing XML documents (plus supporting assets like styles, metadata, and embedded media). Inside, you’ll find word/document.xml, which contains the actual text and formatting information, along with other XML parts describing relationships, numbering, and styles.
So yes, "a .docx is literally just XML hiding inside a ZIP file." On Linux or macOS, one unzip command reveals its innards; on Windows, you can rename .docx to .zip and browse it like a folder.
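To make that concrete, here is a minimal sketch that opens a .docx with nothing but the Python standard library. It assumes the common "transitional" WordprocessingML namespace and that the body text lives in word/document.xml; the function names list_parts and extract_paragraphs are made up for this illustration. In production, a dedicated library such as python-docx is usually the safer choice.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (assumption: a "transitional" OOXML file)
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def list_parts(docx_file):
    """Return the names of every file stored inside the .docx ZIP archive."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.namelist()


def extract_paragraphs(docx_file):
    """Return the plain text of each non-empty paragraph in word/document.xml."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is usually split across several <w:t> runs.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs
```

Both functions accept a file path or a file-like object. Note how much is thrown away here: styles, numbering, and relationships are ignored entirely, which is exactly where vendor differences tend to hide.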
This was a major step forward for document portability — but it’s not the same as true composability. That’s the bigger issue at hand.
⸻
Structured Inputs and Outputs
For developers working with AI APIs, the key superpower comes from turning messy, unstructured data into structured outputs.
Example: You can feed an LLM a transcript of weather reports from 25 cities and ask it to return recommended activities. The model can output a structured JSON list like:
[
{"city": "Seattle", "activity": "indoor sporting event"},
{"city": "Denver", "activity": "snowball fight"},
{"city": "Miami", "activity": "walk in the park"}
]
That structure makes it trivial to plug into downstream systems. But this only works if the inputs are consistent. If your transcript source (say from Zoom vs. Teams vs. Otter.ai) structures speaker labels, timestamps, or section headers differently, you’re back to brittle preprocessing.
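On the output side, it is worth checking that the model really returned the shape you asked for before handing it downstream. The following is a rough sketch, assuming the model's reply arrives as a raw JSON string shaped like the list above; parse_recommendations and REQUIRED_KEYS are hypothetical names for this example.

```python
import json

# Keys every record is expected to carry (taken from the example above)
REQUIRED_KEYS = {"city", "activity"}


def parse_recommendations(raw):
    """Parse model output; raise ValueError if it is not a list of valid records."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of recommendations")
    records = []
    for item in data:
        if not isinstance(item, dict) or not REQUIRED_KEYS.issubset(item):
            raise ValueError(f"malformed record: {item!r}")
        # Keep only the expected keys and coerce their values to strings.
        records.append({"city": str(item["city"]), "activity": str(item["activity"])})
    return records
```

Failing loudly on a malformed record is a design choice; a pipeline that prefers partial results could log and skip instead.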
Unlike JSON, .docx is meant for human readability, not machine composability. Styles, headings, and semantic tags vary wildly between vendors. That’s why every transcription pipeline ends up reinventing the wheel.
⸻
5 Considerations for Developers
When unpacking transcriptions and other content from .docx into machine-usable formats, keep in mind:
- Prefer specialized libraries for Microsoft formats where they are available; otherwise, parse the XML programmatically and convert to JSON as early as possible.
- Modularize transformations so you can handle differences across platforms.
- Normalize speaker names and other categorical information early — account for different naming conventions and quirks.
- Chunk transcripts post-conversion to optimize AI token usage.
- Expect drift — platforms update their formatting schemas frequently.
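Several of these considerations can be sketched together: one parser per platform, early speaker normalization, and chunking after conversion. The two line formats below ("platform_a" and "platform_b") are invented for illustration and are not the actual Zoom, Teams, or Otter.ai layouts. Character count is also only a rough stand-in for token count.

```python
import json
import re

# Hypothetical line formats; real exports differ by platform and change over time.
PARSERS = {
    # e.g. "00:01:15 Alice Smith: Good morning"
    "platform_a": re.compile(
        r"^(?P<ts>\d{2}:\d{2}:\d{2})\s+(?P<speaker>[^:]+):\s*(?P<text>.*)$"
    ),
    # e.g. "[Smith, Alice] 00:01:15 - Good morning"
    "platform_b": re.compile(
        r"^\[(?P<speaker>[^\]]+)\]\s+(?P<ts>\d{2}:\d{2}:\d{2})\s+-\s*(?P<text>.*)$"
    ),
}


def normalize_speaker(name):
    """Collapse whitespace and turn "Last, First" into "First Last"."""
    name = " ".join(name.split())
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        name = f"{first} {last}"
    return name


def parse_lines(lines, platform):
    """Convert raw transcript lines into uniform dict records."""
    pattern = PARSERS[platform]
    records = []
    for line in lines:
        match = pattern.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the expected format
        records.append({
            "timestamp": match["ts"],
            "speaker": normalize_speaker(match["speaker"]),
            "text": match["text"],
        })
    return records


def chunk_records(records, max_chars=2000):
    """Group records so the summed per-record JSON length stays within max_chars.

    A single record longer than max_chars still gets a chunk of its own.
    """
    chunks, current, size = [], [], 0
    for record in records:
        length = len(json.dumps(record))
        if current and size + length > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(record)
        size += length
    if current:
        chunks.append(current)
    return chunks
```

Keeping the platform-specific regexes in one table means that when a vendor changes its layout, only that entry needs to change, while normalization and chunking stay untouched.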
⸻
The Case for Composability
In Unix, and later Linux, composability became a design imperative starting in the 1970s: each program did one thing well and could be piped into the next. That philosophy treated humans and machines as equal first-class users.
In the document world, progress has been slower. We’ve gone from proprietary formats to portable, ISO-driven formats. That was great for portability, but it’s still a far cry from composability.
Now, with AI agents consuming more of our knowledge work, composability is no longer optional. Transcriptions — and documents generally — need to be machine-friendly by default, not as an afterthought.
JSON (or other structured, machine-readable formats) is the natural target. But at a minimum, standardized schemas for human-readable documents would be a big step forward.
⸻
Summary
So where we started — with .docx as a zipped XML file — is a perfect metaphor for today’s challenge. On the surface, we have “open” formats, but under the hood, they’re not composable enough for the AI-first era.
If Unix taught us anything, it’s that composability unlocks ecosystems. Knowledge worker apps must embrace that same principle — because the users of the future are not just humans but AI agents that need to plug in seamlessly.
Derick Schaefer
Author of >CLI: A Practical Guide to Creating Modern Command-Line Interfaces