Alvison Hunter Arnuero | Front-End Web Developer

Posted on Jul 13, 2022 • Updated on Mar 25, 2023

Removing HTML tags in JavaScript using RegEx

#javascript #regex #programming #webdev

Howdy folks! In my current project, a translation app, we send data from the Zendesk platform to our app and then on to the DeepL Translator API, in order to get an English translation of whatever language arrives in the Zendesk payload.

As a result, the string received from Zendesk and the one returned by the DeepL Translator API contain the same HTML tags, but not always written the same way: sometimes the angle brackets appear as literal characters (for example < and >) and sometimes as HTML entities (for example &lt; and &gt;).

In principle this shouldn't be an issue, since both forms represent the same thing. However, we compare each string with the previous one to avoid repetitions when populating the translated-text container in the UI, and that comparison becomes a nightmare because the raw strings are in fact different.

There are several ways to make the comparison possible, but there is a simpler one, and after all, programming is pretty much about keeping things simple, so I will share my approach to this matter.

I basically extract only the text from the string received from Zendesk, since all I need is the plain text to add to the translations container. In other scenarios the tags should stay in order to keep the HTML formatting, but in this particular case the client only wants to display a plain-text version of the text being translated.

By using the replace method on this string, I can remove the HTML tags from it. Here is the helper function I built for this:

/**
 * Returns the given string as plain text, with HTML tags removed.
 * @param {string} strToSanitize - String to be sanitized
 * @returns {string} The string without HTML tags
 */
export const clearHTMLTags = (strToSanitize) => {
  return strToSanitize.replace(/(<([^>]+)>)/gi, '');
};

Some might say: what's the big deal with this function? Well, let me tell ya. First of all, I am using /(<([^>]+)>)/gi to match every opening and closing tag the given string might contain: a <, followed by one or more characters that are not >, followed by a >.

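Next, we add the 'gi' RegEx modifiers: g makes the search global, so every occurrence of the pattern in the string is matched, and i makes it case-insensitive. Since this pattern contains no letters, the i flag has no effect here.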

Last, we pass this regex to the replace method with an empty string as the replacement, which removes the tags from the string. The result of the function is plain text, free of HTML tags.
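To tie the helper back to the comparison problem described earlier, here is a minimal sketch. It assumes one string arrives with entity-escaped angle brackets and the other with literal ones. The sample strings, decodeAngleBrackets and toPlainText are hypothetical names made up for illustration, not part of the real project.

```javascript
// Same regex helper as above, repeated so this snippet runs on its own.
const clearHTMLTags = (strToSanitize) =>
  strToSanitize.replace(/(<([^>]+)>)/gi, '');

// Hypothetical helper: turns &lt; and &gt; back into < and >.
// It only handles these two entities, nothing else.
const decodeAngleBrackets = (str) =>
  str.replace(/&lt;/g, '<').replace(/&gt;/g, '>');

// Hypothetical helper: normalize the brackets first, then strip the tags.
const toPlainText = (str) => clearHTMLTags(decodeAngleBrackets(str)).trim();

// Made-up sample strings, not real Zendesk or DeepL payloads.
const entityForm = '&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;';
const literalForm = '<p>Hello <b>world</b></p>';

console.log(entityForm === literalForm); // false
console.log(toPlainText(entityForm)); // Hello world
console.log(toPlainText(entityForm) === toPlainText(literalForm)); // true
```

Comparing the plain-text versions rather than the raw strings is what makes the two forms line up.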

Now, as usual, this approach is not bullet-proof. In some cases the regex will misread where a tag starts or ends, which can leave extra characters in the result, such as a stray < or >, or remove text that should have stayed. Hence, I am sharing a more robust way to do this, one that copes with more complicated scenarios and heavier HTML markup.
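To make that failure mode concrete, here is a small sketch with two hypothetical inputs, invented for illustration, where the pattern misreads the tag boundaries.

```javascript
// Same regex helper as above, repeated so this snippet runs on its own.
const clearHTMLTags = (strToSanitize) =>
  strToSanitize.replace(/(<([^>]+)>)/gi, '');

// 1) A ">" inside an attribute value ends the match too early,
//    so part of the attribute leaks into the output.
const withAttribute = '<a title="a > b">link</a>';
console.log(JSON.stringify(clearHTMLTags(withAttribute))); // " b\">link"

// 2) Plain text that happens to contain "<" and ">" is treated as a tag,
//    so real content is removed.
const withComparison = '1 < 2 and 3 > 1';
console.log(JSON.stringify(clearHTMLTags(withComparison))); // "1  1"
```

In both cases the pattern has no notion of attributes or text nodes; it only looks for the next > after a <.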

Please find below the approach suggested by Frank Wisniewski as a better way to perform this task:

const betterClearHTMLTags = (strToSanitize) => {
  const myHTML = new DOMParser()
    .parseFromString(strToSanitize, 'text/html');
  return myHTML.body.textContent || '';
};

let myHTML = `<!--  don't > use Regex --><h1>Test</h1>`

console.log(
  betterClearHTMLTags(myHTML)
)
 // output: Test
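One thing to keep in mind is that DOMParser is a browser API and is not defined as a global in Node.js. If the same helper has to run in both places, a guarded wrapper is one option. The sketch below is an assumption of mine, not part of Frank's suggestion, and clearHTMLTagsWithFallback is a hypothetical name. The fallback is the regex helper, with all the caveats already mentioned.

```javascript
// Regex helper from earlier in the post, used here only as a fallback.
const clearHTMLTags = (strToSanitize) =>
  strToSanitize.replace(/(<([^>]+)>)/gi, '');

// Hypothetical wrapper: uses DOMParser when the runtime provides it
// and falls back to the regex helper otherwise.
const clearHTMLTagsWithFallback = (strToSanitize) => {
  // Anything that is not a string yields an empty string.
  if (typeof strToSanitize !== 'string') return '';

  // No DOMParser available (for example in Node.js): use the regex helper.
  if (typeof DOMParser === 'undefined') {
    return clearHTMLTags(strToSanitize);
  }

  const doc = new DOMParser().parseFromString(strToSanitize, 'text/html');
  return doc.body.textContent || '';
};

console.log(clearHTMLTagsWithFallback('<h1>Test</h1>')); // Test
```

The two branches can disagree on tricky input, such as the comment example above, so the fallback is best kept for simple markup.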

Well, folks, I certainly hope you find both of these functions useful when you need to clean up HTML tags and keep just the plain-text content. Happy reading, buddies!

Thanks for reading this article. I hope you enjoyed it as much as I enjoyed writing it. Until next time, dear readers!

❤️ Your enjoyment of this article encourages me to write further.
💬 Kindly share your valuable opinion by leaving a comment.
🔖 Bookmark this article for future reference.
🔗 If this article has truly helped you, please share it.

Top comments (4)

Frank Wisniewski • Jul 13 '22 • Edited

Never use Regex to parse HTML

const clearHTMLTags = (strToSanitize) => {
  return strToSanitize.replace(/(<([^>]+)>)/gi, '');
}
let myHTML = `<!--  don't > use Regex --><h1>Test</h1>`

console.log(
  clearHTMLTags(myHTML)
)
// output: use Regex -->Test


// The right way

const betterClearHTMLTags = (strToSanitize) => {
  let myHTML = new DOMParser()
    .parseFromString(strToSanitize, 'text/html');
  return myHTML.body.textContent || '';
}
console.log(
  betterClearHTMLTags(myHTML)
)
 // output: Test

Alvison Hunter Arnuero | Front-End Web Developer • Jul 13 '22 • Edited

Awesome! This is an excellent approach. Let me share it in the post, if you don't mind, and credit your profile as the reference. Thanks for this, pal!

Samuel Eiche • Jun 7 '23 • Edited

That won't work for something like

betterClearHTMLTags(`\"><script>document.write('<img src=//X55.is onload=import(src)>');</script>`)

John Peters • Jul 13 '22

Not for me

DEV Community
