DEV Community

Ajah Chukwuemeka

Beautiful-dom: an HTML parser built with TypeScript

Beautiful-dom

Beautiful-dom is a lightweight library that mirrors the parts of the HTML DOM API needed for parsing crawled HTML/XML pages. It models the methods and properties of HTML nodes that are relevant for extracting data from them. It is written in TypeScript and can be used as a CommonJS library.

What you get

  • The ability to parse HTML documents as if you were working with them in a live browser
  • Fast queries that return essential data from HTML nodes
  • HTML nodes that keep their in-place order after searching and parsing
  • Complex queries with CSS selectors

How to use

npm install --save beautiful-dom

const BeautifulDom = require('beautiful-dom');
const document = `
<p class="paragraph highlighted-text" >
  My name is <b> Ajah, C.S. </b> and I am a <span class="work"> software developer </span>
</p>
<div class = "container" id="container" >
 <b> What is the name of this module </b>
 <p> What is the name of this library </p>
 <a class="myWebsite" href="https://www.ajah.xyz" > My website </a>
</div>
<form>
  <label for="name"> What's your name? </label>
  <input type="text" id="name" name="name" />
</form>
`;
const dom = new BeautifulDom(document);

API

Methods on the document object

  • document.getElementsByTagName()
  • document.getElementsByClassName()
  • document.getElementsByName()
  • document.getElementById()
  • document.querySelectorAll()
  • document.querySelector()

Methods on the HTML node object

  • node.getElementsByClassName()
  • node.getElementsByTagName()
  • node.querySelector()
  • node.querySelectorAll()
  • node.getAttribute()

Properties of the HTML node object

  • node.outerHTML
  • node.innerHTML
  • node.textContent
  • node.innerText

They are used the same way as their counterparts in a browser's HTML DOM and take the same parameters.
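Because the interface follows the DOM, extraction helpers can be written against the method names alone. The following is a minimal sketch, not part of beautiful-dom: `collectAttribute`, `makeStubNode` and `stubDom` are hypothetical names, and the stub only imitates `querySelectorAll()` and `getAttribute()` so the snippet runs without installing anything. In real use you would presumably pass the `dom` object created above instead of the stub.

```javascript
// Hypothetical helper: works with any object exposing the DOM-style
// querySelectorAll() and getAttribute() methods listed above.
function collectAttribute(root, selector, attributeName) {
  const values = [];
  const nodes = root.querySelectorAll(selector);
  // Index-based loop: only assumes the result is list-like (length + indices).
  for (let i = 0; i < nodes.length; i += 1) {
    const value = nodes[i].getAttribute(attributeName);
    // Skip nodes that do not carry the attribute.
    if (value !== null && value !== undefined) {
      values.push(value);
    }
  }
  return values;
}

// Stand-in node so the sketch is self-contained.
// This is NOT beautiful-dom; it only mimics getAttribute().
function makeStubNode(attributes) {
  return {
    getAttribute: (name) => (name in attributes ? attributes[name] : null),
  };
}

// Stand-in document that answers only the 'a' selector.
const stubDom = {
  querySelectorAll: (selector) => {
    if (selector !== 'a') {
      return [];
    }
    return [
      makeStubNode({ href: 'https://www.ajah.xyz', class: 'myWebsite' }),
      makeStubNode({ class: 'no-href' }),
    ];
  },
};

const links = collectAttribute(stubDom, 'a', 'href');
console.log(links); // [ 'https://www.ajah.xyz' ]
```

The helper deliberately avoids array methods on the query result, since the post only shows that the returned list can be indexed.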

Examples for document object


let paragraphNodes = dom.getElementsByTagName('p');
// returns a list of node objects with node name 'p'

let nodesWithSpecificClass = dom.getElementsByClassName('work');
// returns a list of node objects with class name 'work'

let nodeWithSpecificId = dom.getElementById('container');
// returns a node with id 'container'

let complexQueryNodes = dom.querySelectorAll('p.paragraph b');
// returns a list of nodes that satisfy the complex query of CSS selectors

let nodesWithSpecificName = dom.getElementsByName('name');
// returns a list of nodes with the specific 'name'

let linkNode = dom.querySelector('a.myWebsite');
// returns the first node object that matches the CSS selector

let linkHref = linkNode.getAttribute('href');
// returns the value of the attribute, e.g. 'https://www.ajah.xyz'

let linkInnerHTML = linkNode.innerHTML;
// returns the innerHTML of a node object, e.g. ' My website '

let linkTextContent = linkNode.textContent;
// returns the textContent of a node object, e.g. ' My website '

let linkInnerText = linkNode.innerText;
// returns the innerText of a node object, e.g. ' My website '

let linkOuterHTML = linkNode.outerHTML;
// returns the outerHTML of a node object, i.e. '<a class="myWebsite" href="https://www.ajah.xyz" > My website </a>'

Examples for a node object


let paragraphNodes = dom.getElementsByTagName('p');
// returns a list of node objects with node name 'p'

let nodesWithSpecificClass = paragraphNodes[0].getElementsByClassName('work');
// returns a list of node objects inside the first paragraph node with class name 'work' 


let complexQueryNodes = paragraphNodes[0].querySelectorAll('span.work');
// returns a list of nodes in the paragraph node that satisfy the complex query of CSS selectors


let linkNode = dom.querySelector('a.myWebsite');
// returns the first node object that matches the CSS selector

let linkHref = linkNode.getAttribute('href');
// returns the value of the attribute, e.g. 'https://www.ajah.xyz'

let linkInnerHTML = linkNode.innerHTML;
// returns the innerHTML of a node object, e.g. ' My website '

let linkTextContent = linkNode.textContent;
// returns the textContent of a node object, e.g. ' My website '

let linkInnerText = linkNode.innerText;
// returns the innerText of a node object, e.g. ' My website '

let linkOuterHTML = linkNode.outerHTML;
// returns the outerHTML of a node object, i.e. '<a class="myWebsite" href="https://www.ajah.xyz" > My website </a>'
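The text values in the examples above keep the whitespace from the markup (for instance ' My website '), so crawled text usually needs tidying before it is stored. A minimal sketch of that step follows; `cleanText` is a hypothetical helper and not part of beautiful-dom, and the input strings are copied by hand from the sample document rather than read from the library.

```javascript
// Hypothetical helper: collapse runs of whitespace into single spaces
// and trim both ends of the extracted text.
function cleanText(rawText) {
  return String(rawText).replace(/\s+/g, ' ').trim();
}

// Value shaped like the textContent shown in the examples above.
const rawLinkText = ' My website ';
console.log(cleanText(rawLinkText)); // 'My website'

// Line breaks and indentation inside the markup are collapsed as well.
const rawSpanText = '  software\n   developer ';
console.log(cleanText(rawSpanText)); // 'software developer'
```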

Contributing

If you have ideas, features you would like included, or bug fixes, you can send a PR.

(Requires Node v6 or above)

  • Clone the repo
git clone https://github.com/ChukwuEmekaAjah/beautiful-dom.git

It was exciting to build this Node.js module with TypeScript. I recently learned how to use TypeScript, and what better way to practice and experiment with new knowledge?

I would appreciate comments and contributions to the project, as well as issues reporting edge cases I may not have anticipated and errors you encounter while using the module.

Top comments (6)

Adam Crockett 🌀

Why typescript and not a faster lower level language? (I love ts btw)

Ajah Chukwuemeka

Remember, TypeScript is mostly transpiled down to JavaScript. Also, the project is for Node.js, which runs JavaScript. Moreover, I was learning TypeScript recently and decided to build this project with the newly acquired knowledge.

Adam Crockett 🌀

I'm talking about WebAssembly: you can write programs in Rust, C++, C, AssemblyScript (TypeScript-like) and others to achieve near-native speeds for JavaScript. I was a regular TypeScript user, but Node supports WASM, which meant I set out to learn C++ and now Rust. It's enormous fun.

Rasmus Schultz

WASM doesn't really seem like a real candidate for something like this - not if you want an API you can consume from JS, anyhow.

Most likely the amount of work you'd be able to outsource to WASM, is more or less the same work you're already outsourcing to highly optimized C code with the standard (String, RegExp, etc.) JS APIs - so I don't think there's a whole lot to gain with WASM here?

Michael Oshosanya

Good one, man. I recently started learning TypeScript too. It's a good language.

Ajah Chukwuemeka

Thanks brother. I appreciate your shout out. I wouldn't mind collaborating with you to speed it up.