Time to build a markdown parser and processor (MDL Log #1)

edA‑qa mort‑ora‑y ・3 min read

I need to write a markdown parser and processor. My writing projects have exceeded the abilities of the tools I currently have. There's also a dearth of quality writing tools -- something I discovered while working on my book. I've finally decided I have to fix the situation for myself, and hopefully, somebody else can use it as well.

I figured the best way to start this was with a post on dev.to. Perhaps I'm procrastinating slightly, but there's a reason for this post. I want to provide the opportunity to follow the development from the start. So far I have a repository, but it's empty.

I encourage you to ask questions and to question anything you see in the project.

Requirements

The correct place to start would be a user story. However, it's a bit awkward to talk about myself in a user story. Also, my needs are somewhat clear, so the requirements are quite strict. Nonetheless, I will get back to a proper user story as I move along, because the motivation for some of the features is still missing.

Here are some of the key things I need:

  • All the things I do in my technical blog articles. This includes the standard formatting, such as code and images, as well as LaTeX equations. (Replacing LaTeX with something better is a long-term feature.)
  • Multiple targets. I post on my own site, here on dev.to, on Medium, on my cooking site, and some writing sites. These all have different formatting and encoding requirements.
  • eBook and print ready. I was aghast at the tools available in the writing sector. It should have been possible to use basic Markdown and publish a lovely-looking book.

I'll expand these with user stories as time permits. They'll help you understand why I need the features and how they should work. I'm also working on a course at Skillshare for writing user stories; I'll get back to you on that.

Architecture

I've written more parsers and tree processors than I can count. I will not be evaluating any pre-existing solutions -- I figure I've done this for over 20 years now and am still unhappy with what's there. My most recent work on Leaf had an exemplary parser structure, and I think I will mimic that.

This document system will work much like a compiler, and has these phases:

  • Tree Parser: Parse the raw document into a tree of nodes. This takes care of low-level source details and partially processes syntax. By sticking with a generic tree, we keep a lot of language details out of the parser; thus it's simpler.
  • Parse Tree Converter: From here, the parse tree is scanned and converted into an abstract syntax tree (AST). This lifts the low-level constructs into high-level syntax.
  • AST Processing: This is where a lot of the tools will be built. The input tree is annotated with more information, things like user templates are resolved, and the pieces are pulled together. While the export mode can influence this stage, it remains an abstract tree.
  • Lowering/Export: The AST is exported to the final document format. This includes processing features like syntax highlighting, LaTeX graphic creation, and uploading source code to a gist.
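
The four phases above can be sketched as a chain of functions. This is only a minimal sketch under my own assumptions, not the project's code: every name in it (`Node`, `parse_tree`, `convert`, `process`, `export_html`) is hypothetical, and the "syntax" it understands is limited to paragraphs and `# ` headings.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Generic tree node shared by every phase (hypothetical structure)."""
    kind: str
    text: str = ""
    children: list = field(default_factory=list)
    notes: dict = field(default_factory=dict)


def parse_tree(source: str) -> Node:
    """Phase 1: split raw text into generic 'block' nodes, no language knowledge."""
    blocks = [b.strip() for b in source.split("\n\n") if b.strip()]
    return Node("root", children=[Node("block", text=b) for b in blocks])


def convert(tree: Node) -> Node:
    """Phase 2: lift generic blocks into high-level AST nodes."""
    ast = Node("document")
    for block in tree.children:
        if block.text.startswith("# "):
            ast.children.append(Node("heading", text=block.text[2:]))
        else:
            ast.children.append(Node("paragraph", text=block.text))
    return ast


def process(ast: Node) -> Node:
    """Phase 3: annotate the AST; here, number the headings."""
    count = 0
    for node in ast.children:
        if node.kind == "heading":
            count += 1
            node.notes["number"] = count
    return ast


def export_html(ast: Node) -> str:
    """Phase 4: lower the AST to one concrete format (no escaping in this sketch)."""
    out = []
    for node in ast.children:
        if node.kind == "heading":
            out.append(f"<h1>{node.notes['number']}. {node.text}</h1>")
        else:
            out.append(f"<p>{node.text}</p>")
    return "\n".join(out)


result = export_html(process(convert(parse_tree("# Intro\n\nHello world."))))
print(result)
```

Because each phase takes and returns a plain tree, an extension can slot in between any two calls without the other phases knowing about it.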

I'll provide more information on each phase as I work on it. This structure provides several distinct layers where extensions can be added.

While I'll be coding in Python, the tree parser will ultimately end up in C++. It's the most costly part of the processing -- scanning through characters one at a time is hard on interpreted or dynamic languages. My initial needs, however, are for individual docs, so, for now, the speed is of little concern.

First goal

My first goal is to get a rudimentary tree parser. This is a testable component. It's also the first in the chain and the entry point to all other components.
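
To show why a tree parser is easy to test in isolation, here is a rough sketch. The post does not describe the actual syntax rules, so this assumes, purely for illustration, that nesting comes from indentation; `TreeNode`, `parse_indented`, and `dump` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    """A generic parse-tree node: raw text plus nested children (hypothetical)."""
    text: str
    children: list = field(default_factory=list)


def parse_indented(source: str) -> TreeNode:
    """Build a generic tree where deeper indentation means a child node."""
    root = TreeNode("")
    # Stack of (indent, node); the root sits below every real indent level.
    stack = [(-1, root)]
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines carry no structure in this sketch
        indent = len(line) - len(line.lstrip(" "))
        node = TreeNode(line.strip())
        # Pop back to the nearest ancestor that is less indented than this line.
        while stack[-1][0] >= indent:
            stack.pop()
        stack[-1][1].children.append(node)
        stack.append((indent, node))
    return root


def dump(node: TreeNode, depth: int = 0) -> list:
    """Flatten the tree into (depth, text) pairs, handy for test assertions."""
    rows = [] if depth == 0 else [(depth, node.text)]
    for child in node.children:
        rows.extend(dump(child, depth + 1))
    return rows


tree = parse_indented("list\n  item one\n  item two\n    nested\nparagraph")
print(dump(tree))
```

A flat `dump` of the tree gives a test something simple to compare against, which is most of what "testable component" needs at this stage.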

I'll give an update when I've completed that. In the meantime, feel free to ask questions. You can look at the repository, but it'll be empty for a while, then chaotic. I believe refactoring is more important than design.

Discussion

I have a bunch of unfinished projects along the same lines. I kept getting frustrated with my tools and I wanted to build something better.

Every time, I went down a rabbit hole and eventually woke up to realize I was just yak-shaving.

These days I've settled on just using MDX for my Markdown, even though it's not perfect. I decided to stick with it as a good-enough solution, and that was a good move: I feel like I can finally relax and focus on my content.

 

I have a fairly clear set of requirements, so at least I won't be chasing a ghost. I've been disappointed with the other tools. Primarily, I need to be able to customize the syntax by adding extensions. I also need advanced output capability -- I don't want things looking like basic Markdown-generated documents.

 

Cool, looking forward to following your progress.

 

Looks cool.
List somewhere what features and stuff you want eventually?

I'll get back with user stories. I'm going to do it the proper way, as an example.

 

Regarding converting markdown to an e-book, have you tried using pandoc? I found it very useful for converting to and from various publishing formats.

 

I've used pandoc a few times. I'm looking for something I can customize and extend. I didn't dig too deeply into what pandoc supports, but from my initial look, it wasn't the type of tool to support my needs. I think I still use it to generate some Sphinx docs from Markdown for a Python project.

 

There’s some power in Pandoc, as it lets you access its AST and modify it before feeding it to the output converter. It’s actually pretty nice, but the options in the AST are quite limited (you can’t add a class to a list, for example). The AST is also a pain in the neck to read and write, and a lot of sample code and libraries are outdated.

I wrote a filter that lets me write the ingredients of a recipe for my cooking cards as a list, with the quantities in italics, and output it as a table, for easier formatting:

* _1 cup_ milk
* _1 cup_ flour
* _2 tbsp_ baking powder

is much easier to type than

|        |               |
|--------|---------------|
| 1 cup  | milk          |
| 1 cup  | flour         |
| 2 tbsp | baking powder |

If your project allows the user to modify the AST like that, it could be very powerful and customisable.
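
As a rough illustration of that list-to-table idea, the transformation can be sketched on plain text. This is not the pandoc filter described above (a real filter would work on pandoc's AST rather than on strings), and `ITEM` and `ingredients_to_table` are hypothetical names.

```python
import re

# Matches lines such as "* _1 cup_ milk" (quantity in italics, then the name).
ITEM = re.compile(r"^\*\s+_(?P<qty>[^_]+)_\s+(?P<name>.+)$")


def ingredients_to_table(markdown_list: str) -> str:
    """Turn an italic-quantity bullet list into a two-column pipe table."""
    rows = []
    for line in markdown_list.splitlines():
        match = ITEM.match(line.strip())
        if match:
            rows.append((match["qty"].strip(), match["name"].strip()))
    if not rows:
        return ""
    # Pad every cell to the widest entry in its column.
    qty_w = max(len(q) for q, _ in rows)
    name_w = max(len(n) for _, n in rows)
    lines = [
        f"| {'':<{qty_w}} | {'':<{name_w}} |",
        f"|{'-' * (qty_w + 2)}|{'-' * (name_w + 2)}|",
    ]
    lines += [f"| {q:<{qty_w}} | {n:<{name_w}} |" for q, n in rows]
    return "\n".join(lines)


table = ingredients_to_table(
    "* _1 cup_ milk\n* _1 cup_ flour\n* _2 tbsp_ baking powder"
)
print(table)
```

Lines that do not match the quantity pattern are silently skipped here; a real tool would probably want to report them instead.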

Supporting a recipe integration is a high-level feature I need for my recipe site. It currently uses an external YAML file and I combine the bits together with some Python code.

This will be integrated by allowing custom sections in the markdown file. Those sections can have their own parser, or if simple enough, options on the default ones. They can produce custom entries in the AST.

Table support in Markdown is currently atrocious. I'll provide a YAML-like syntax that generates tables.

The AST will allow custom translations/visitors as well. Each stage will be well defined with a clear format. My project will be all about customization and extension.
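
A custom visitor over such a tree could look roughly like the sketch below. It follows the dispatch-by-name pattern of Python's own `ast.NodeVisitor`; the `Node`, `Visitor`, and `HeadingCollector` classes are hypothetical and say nothing about how the project will actually define its AST.

```python
class Node:
    """Minimal AST node for illustration (hypothetical, not the project's type)."""

    def __init__(self, kind, text="", children=None):
        self.kind = kind
        self.text = text
        self.children = children or []


class Visitor:
    """Dispatches to visit_<kind> when defined, else just walks the children."""

    def visit(self, node):
        handler = getattr(self, f"visit_{node.kind}", self.generic_visit)
        return handler(node)

    def generic_visit(self, node):
        for child in node.children:
            self.visit(child)


class HeadingCollector(Visitor):
    """A user-written extension: gather heading text for a table of contents."""

    def __init__(self):
        self.headings = []

    def visit_heading(self, node):
        self.headings.append(node.text)


doc = Node("document", children=[
    Node("heading", "Requirements"),
    Node("paragraph", "Some text."),
    Node("section", children=[Node("heading", "Architecture")]),
])
collector = HeadingCollector()
collector.visit(doc)
print(collector.headings)
```

An extension only has to name the node kinds it cares about; everything else falls through to the default walk.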

 

You can load it as a library if you dig into it, or it has been incorporated into Hakyll with a framework for producing multiple outputs and extracting document information from a bunch of documents. It's probably worth a look if you're already comfortable in Haskell.

 

Hi,

I was checking your code and I couldn't find a way to run it. I am a newbie in Python.
It would be great if you could update the readme file so that I can set it up and understand the flow better.

 

I've updated the readme. I only have a test program at the moment. I'll make it a priority to produce some kind of simple CLI.

 
 

It would be great if you could make a tutorial of it. It would be a great learning experience for beginner programmers like me.
Any help will be appreciated. Any past open-source parser project would also help.
Thanks

 

I'll keep posting log updates that say what I've done. There's a lot to cover, so if you have specific questions, you'll be able to ask and I can answer.

 

sure.
That will be helpful.
Thanks

 

I guess the choice of C++ is because you already know it, but if you are looking for a modern, fast language, I would suggest taking a look at Rust.
I think it's worth the effort.

 

I did a lot of Rust programming while doing AI competitions. Unfortunately, manipulation of trees, which is what I did then and what I'm doing now, is a weak point for Rust. I had too many questions that the community was unable to answer.

Though, in this case, the component I'd externalize wouldn't be doing much tree manipulation, so perhaps Rust would be an option.

 

Very nice experience.

I'm still a rookie at Rust, and I was surprised you didn't pick it over C++. Now it all makes sense.

 

You should definitely take a look at markdown-it. It lets you write plugins through which you can create syntax extensions, access the AST, and from there you can basically make it do anything.

 

It does not appear to support the multiple output case that I want. It's focused on rendering to HTML. I really want an accessible high-level tree where I can do abstract operations and lower to any output format.
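
To make "lower to any output format" concrete, here is a small sketch where one abstract tree feeds two exporters. The tuple-based `AST`, the `to_html` and `to_markdown` functions, and the `EXPORTERS` registry are all hypothetical; only `html.escape` is a real standard-library call.

```python
import html

# A tiny AST as (kind, text) pairs. Hypothetical shape, for illustration only.
AST = [("heading", "Results"), ("paragraph", "Speed < quality"), ("code", "x = 1")]


def to_html(ast):
    """Lower the tree to HTML, escaping text as that target requires."""
    tags = {"heading": "h1", "paragraph": "p", "code": "pre"}
    return "\n".join(f"<{tags[k]}>{html.escape(t)}</{tags[k]}>" for k, t in ast)


def to_markdown(ast):
    """Lower the same tree to Markdown, using that target's own conventions."""
    out = []
    for kind, text in ast:
        if kind == "heading":
            out.append(f"# {text}")
        elif kind == "code":
            out.append("    " + text)  # indented code block
        else:
            out.append(text)
    return "\n\n".join(out)


# One registry entry per target; adding a target does not touch the tree.
EXPORTERS = {"html": to_html, "markdown": to_markdown}

for target, export in EXPORTERS.items():
    print(f"--- {target} ---")
    print(export(AST))
```

The escaping difference between the two targets is the kind of per-format encoding detail that stays out of the abstract tree.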

Note that I currently use pyMarkdown which also offers extensions. And I've used other packages. I'm not keen on going down another path which isn't guaranteed to do what I want.

 

I got a big chunk of the low-level parsing done today. It's probably enough for me to move on to the tree converter.