It's been a while since my last update as I've not had a lot of time to work on the parser. So is the life of side-projects. :/
This time around I've made changes to support lists. In the spirit of the feature the changes are:
- List parsing syntax, those are lines that begin with
-
- Adding a
_NodeIterator
toparse_to_doc.py
to deal with nested iterating - Adding backtick support, because I needed to write
`_NodeIterator`
above - Independent anchor parsing
The parser is nearing a useful stage for me, and I'll attempt to use it for my regular documents now. The design is also becoming stable enough that I can start adding unit tests soon.
I'm keeping the technical descriptions short here, as I don't know for sure what is most relevant or of most interest. Ask if you'd like more detail.
Lists
Unlike in markdown, the MDL lists can be separated over multiple lines. This makes it easier to work with longer sentences and markup inside those bullet points.
- You can't see this.
- But there was extra space between these points.
Markdown supports indented lists, which might be easier to read in the source. This isn't a concern for me now, and I'll leave out support for them. I'll also ignore embedded lists until I have a decent use-case for it.
Splitting anchor parsing
I've made a change to how the parsing handles anchors. I'm deviating a bit from what HTML says an anchor is. In MDL an anchor is only a piece of text that could have something associated with it -- almost what <a>
used to mean in HTML prior to version 5, the version which made it exclusive for linking.
The parse change is that [anchor blocks]
are parsed independently of any trailing bits now, such as a full link [anchor](link)
. This new parsing is easier and more flexible. The conversion into a link is done by the tree conversion. It has a better ability to detect flow like this.
An [anchor bit]
without a trailing link raises an error now. Eventually, for a markdown compatible mode, that might need to be reverted to plain text.
This change, along with the backtick support, allowed me to switch to an unparsed segment for the URL.
Backtick support
Given how much I'm using a backtick `
in this document, it's surprising I didn't add support for it earlier.
Unlike markdown, escaping in MDL is done with the backslash \
character. It's one of the characters that is recognized even inside escaped sections, like the backticks.
Escaping backticks inside backticks is a bit of an issue. Consider the `_NodeIterator`
bit that needs to look like
`` `_NodeIterator` ``
in GitHub flavoured markdown. I had to count the ticks in the collapsed text, add the count of enclosing ticks, and optional spaces if starts/ends with ticks!
_NodeIterator
The doc_tree converter is no longer a recursive tree converter. Instead, each function takes the new _NodeIterator
and is capable of moving through the tree. This allows features to depend on items around them.
For lists, it was essential as list items are independent blocks in the main document. The parser sees no difference between a list item and a paragraph. The doc_tree converter must reassemble the pieces into a single list block. During the block parsing, subsequent list items are appended to the same list.
This also made it easier to do the anchor splitting. The anchor now looks towards the following node and expects it to be a link style node. Then it merges the two together to produce a single output element.
This is a development log entry for the MDL Document Processor Project.
Top comments (2)
DEV's markdown processing appears to have a defect. This part is supposed to be a single paragraph (it is on Gist/Github).
My guess is that the DEV parser is seeing a triple backtick as a block-level code block always, even if inline to the paragraph.
This is what the source of that bit looks like:
Very interesting. I need to get close look into this.