Convert HTML to Anything You Want!

#javascript #parser #html #webdev

If you feel it's too long to read, here's the repo: https://github.com/huozhi/html2any

Inspiration

The task was to build a fully functional FAQ site that provides users with help information.

Designer: First, we have a search bar 🔍 that can retrieve every indexed page. Pages are in rich text.
Dev: Yep, sounds good, not difficult. (I guess markdown can handle it all.)
Designer: Rich text needs to support videos, GIFs, inline images, block images, blah blah blah... We hope it can be aligned with our main site, with the same theme and animations.
Dev: Emmm... this is a brand-new project. Can we just use the default video controls?
Designer: The videos/GIFs need to be the same as on the main site. The basic controls are not enough for users.
Dev: And where does this content come from?
Designer: Maybe an editor in the CMS to publish new pages?
Dev: In a hurry?
Designer: Yep! We hope it will be ready soon!

** WHAT THE HELLLLL... **

It looked impossible to finish this work in such a short time with markdown. However, it would be insane to hard-code all the static pages in React or other JS code. The point is, the RichText component in the existing project can't easily be migrated right now, and it also carries other logic, such as text collapsing and metrics collection, that we don't really need.

For us, we just want a static page. That's it.

The only things I could decouple from the existing project were the Video, Image and Gif components. The CMS always gives me an HTML string for the rich text content, so I had to figure out a way to replace the native image/video tags with custom React components.

Editor & RichText

When you type in a rich text editor, make text bold or italic, and insert some images, you have already done a round of rich text editing. Because this content is more than plain text, it needs more complicated composition with HTML and CSS, and sometimes JavaScript for interaction.

There are two kinds of editors:

Stateful editors, such as Draft.js and Slate. They convert HTML to an intermediate state, then serialize that state to the final HTML.
Non-stateful editors, such as Medium.js. They don't need state and may rely only on contenteditable, with a thin wrapper on top.

There are two common approaches to saving edited content:

Use a stateful editor and sync its state to the database. Restore from the stored state when you display the content. This feels natural.
Use any editor you like, and exchange HTML between client and storage.

Saving state has potential traps. Say you want to migrate from the Google Closure editor to Draft.js. There was no stored state before, so the newcomer breaks the old rules and content saved the previous way becomes hard to handle. Migration takes effort and carries risk.

If you save HTML strings with a stateful editor, you need a serializer and deserializer to convert between HTML and state. Draft.js requires a library like draft-convert, while Slate has a built-in serializer and deserializer that are convenient to use.

We've Come This Far. How Does It Relate to Our Problem?

The first time I tried the Slate editor, its HTML conversion felt liberating:

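// JSX in `serialize` needs React in scope.
import React from 'react'
// As in the Slate version used here; later releases ship this
// serializer as the separate `slate-html-serializer` package.
import { Html } from 'slate'

const rules = [
  {
    // Convert a <p> element into a Slate paragraph block.
    deserialize(el, next) {
      if (el.tagName.toLowerCase() === 'p') {
        return {
          kind: 'block',
          type: 'paragraph',
          nodes: next(el.childNodes)
        }
      }
    },
    // Convert a Slate paragraph block back into a <p> element.
    serialize(object, children) {
      if (object.kind === 'block' && object.type === 'paragraph') {
        return <p>{children}</p>
      }
    }
  }
]

// Create a new serializer instance with our `rules` from above.
const html = new Html({ rules })

// HTML string -> editor state (`htmlString` comes from your storage).
const state = html.deserialize(htmlString)

// Editor state -> HTML string.
const output = html.serialize(state)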

Isn't it interesting? Once you define a de/serialization rule, you can switch between state and HTML freely. COOL!

See where this is going? What we need is something with no editor functionality that can convert HTML to and from a structured state, so that we can render complicated content from that state.

LET'S DO IT

Still remember how a compiler works? It consumes a code string and outputs machine code in stages:

tokenize: extract tokens from the source string
parse: build the tokens into an AST
transform: transform the AST into dest code

Converting between our HTML and state follows exactly the same process. The dest code is our final visual form: it could be a component, an HTML string, even a JSON object, whatever.

We are going to follow these 3 steps:

Tokenize the HTML into proper HTML tags
Build a tree in which each node is an HTML tag holding its information and children
Traverse the tree, replacing each node with your own
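
To make the three steps concrete, here is a minimal sketch of how they might look in plain JavaScript. It is not the html2any source: the function names (tokenize, parse, transform, toHtml) and the node shape { name, attributes, children } are assumptions made for illustration, and the tokenizer only copes with simple, well-formed HTML.

```javascript
// Hypothetical helper names for illustration; this is not the html2any source.
const VOID_TAGS = new Set(['br', 'hr', 'img']);

// Step 1: split the HTML string into tag tokens and text tokens.
// Whitespace-only text is dropped to keep the sketch short.
function tokenize(html) {
  return html.split(/(<[^>]+>)/).filter((token) => token.trim() !== '');
}

// Read `key="value"` pairs out of the inside of an opening tag.
function parseAttributes(source) {
  const attributes = {};
  for (const [, key, value] of source.matchAll(/([\w-]+)="([^"]*)"/g)) {
    attributes[key] = value;
  }
  return attributes;
}

// Step 2: build a tree of { name, attributes, children } nodes.
function parse(tokens) {
  const root = { name: 'root', attributes: {}, children: [] };
  const stack = [root];
  for (const token of tokens) {
    const parent = stack[stack.length - 1];
    if (token.startsWith('</')) {
      if (stack.length > 1) stack.pop(); // closing tag: go back up one level
    } else if (token.startsWith('<')) {
      const match = /^<([a-zA-Z][\w-]*)([^>]*?)\/?>$/.exec(token);
      if (!match) continue; // skip comments, doctype and anything unexpected
      const name = match[1].toLowerCase();
      const node = { name, attributes: parseAttributes(match[2]), children: [] };
      parent.children.push(node);
      const selfClosing = token.endsWith('/>') || VOID_TAGS.has(name);
      if (!selfClosing) stack.push(node);
    } else {
      parent.children.push(token); // text node
    }
  }
  return root;
}

// Step 3: walk the tree bottom-up and let `rule` replace each node.
function transform(node, rule) {
  if (typeof node === 'string') return node;
  const children = node.children.map((child) => transform(child, rule));
  return rule(node, children);
}

// Example rule: emit an HTML string again, but swap <b> for <strong>.
// Attributes are dropped here for brevity.
function toHtml(node, children) {
  if (node.name === 'root') return children.join('');
  const name = node.name === 'b' ? 'strong' : node.name;
  if (VOID_TAGS.has(name)) return `<${name}/>`;
  return `<${name}>${children.join('')}</${name}>`;
}

const tree = parse(tokenize('<p>Hello <b>world</b><br/></p>'));
console.log(transform(tree, toHtml));
// prints: <p>Hello <strong>world</strong><br/></p>
```

The rule receives the already transformed children, so it only has to decide what the current node becomes. Swapping toHtml for a rule that returns components or plain objects changes the output format without touching the tokenizer or the parser.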

Introducing html2any

Check out my final implementation: https://github.com/huozhi/html2any

Run on React Native

Here is how it renders on React Native:

A paragraph containing bold text and images was converted to React Native components. Here's the screenshot on iOS:

Of course, the component nesting rules on React Native are much stricter, e.g. Text needs to sit under a View with a specified size. Styles also don't cascade the way they do in CSS: a Text does not inherit text styles from a parent View.

Run on Web with React

Click Here!

I made a simple set of transform rules for the web:

Convert br tags to hr tags
Replace GIFs with a GIF player that includes a loading phase
Replace the native video tag with a React video player

Want more? You can design a more complicated rule function, then leave it to html2any to handle.
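
As a rough illustration of what such a rule function could look like, here is a sketch under a few assumptions: each node has the shape { name, attributes, children }, GifPlayer and VideoPlayer are placeholder component names, and the small applyRule walker stands in for whatever traversal the library does. To keep it runnable without React, the rule returns plain objects instead of elements.

```javascript
// Assumed node shape: { name, attributes, children }.
// GifPlayer and VideoPlayer are placeholder names, not real packages.
function webRule(node, children) {
  const { name, attributes } = node;
  // br becomes hr
  if (name === 'br') return { type: 'hr', props: {}, children: [] };
  // images whose source ends in .gif become a GIF player
  if (name === 'img' && /\.gif$/i.test(attributes.src || '')) {
    return { type: 'GifPlayer', props: { src: attributes.src }, children: [] };
  }
  // the native video tag becomes a custom video player
  if (name === 'video') {
    return { type: 'VideoPlayer', props: { src: attributes.src }, children: [] };
  }
  // everything else is kept as it is
  return { type: name, props: attributes, children };
}

// Hypothetical walker: transform the children first, then the node itself.
function applyRule(node, rule) {
  if (typeof node === 'string') return node;
  return rule(node, node.children.map((child) => applyRule(child, rule)));
}

const page = {
  name: 'p',
  attributes: {},
  children: [
    'Look: ',
    { name: 'img', attributes: { src: 'cat.gif' }, children: [] },
    { name: 'br', attributes: {}, children: [] },
  ],
};

const result = applyRule(page, webRule);
console.log(JSON.stringify(result, null, 2));
```

In a React project the same rule could return React.createElement(...) calls with your real components in place of the plain objects, and on React Native it could return Text and View elements instead.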

Reference and Comparison

Actually, the community already has lots of HTML parsers. The most familiar ones are parse5 and htmlparser2. Even cheerio uses htmlparser2, so why reinvent the wheel?

My reasons are:

html2any is really small. It's worth a try if you want to display content generated from Slate or Draft.js.
Many parsers are SAX-style: they parse from top to bottom and expose a few APIs for hooking into the intermediate processing phase. For usage like ours, we don't need that much. They also do a lot of compatibility work for cases we will never hit.
The most important reason: most parsers are built specifically for the web. Their output may be a DOM tree, which is not our desired dest code. Remember the examples above? What we are doing is Universal HTML! Render Everywhere! Haha

My slide