One-liner: remove HTML tags from a string

#javascript #html #webdev #programming

var plainText = content.replace(/<[^>]*>/g, '');

Refer: https://stackoverflow.com/questions/51195143/is-there-a-way-to-remove-html-tags-from-a-string-in-javascript/51195294#51195294
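As a rough illustration of what the one-liner does, here is a minimal sketch that wraps it in a function. The name `stripTags` and the sample string are assumptions made for this example, not part of the original post. The regex matches a `<`, then any run of characters other than `>`, then the closing `>`, and the `g` flag applies that to every match in the string.

```javascript
// Hypothetical helper name; the regex is the one from the post.
function stripTags(content) {
  // "<", then zero or more characters that are not ">", then ">".
  return content.replace(/<[^>]*>/g, '');
}

// Sample input invented for this sketch.
const sample = '<p>Hello <b>World</b>!</p>';
console.log(stripTags(sample)); // Hello World!
```

This only deletes things that look like tags; it does not decode entities or understand attributes, which is what the comments below pick up on.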

Top comments (9)

Frank Wisniewski • Jun 10 '22 • Edited

Why so complicated? Just use textContent.

<!DOCTYPE html>
<html lang=de>
  <meta charset=UTF-8>
  <title>delete tags</title>
  <div id="myContainer">
    <h1>myHeader</h1>
    <p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. Incidunt, vitae.</p>
    <p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. Incidunt, vitae.</p>
  </div>
  <script>
  "use strict";
  let myContainerText = myContainer.textContent,
      plainText = myContainer.innerHTML.replace(/<[^>]*>/g, '');
  console.log(plainText===myContainerText) // true
  </script>

Surjeet Bhadauriya • Jun 10 '22

To be honest, it is all about this line:

.replace(/<[^>]*>/g, '');

Frank Wisniewski • Jun 10 '22 • Edited

It's really not a good idea to parse HTML with regex...

Look at the following sample:

<!DOCTYPE html>
<html lang=de>
  <meta charset=UTF-8>
  <title>delete tags</title>
    <div id="c">
      <p>This is a &lt;H1&gt; tag</p>
    </div>
  <script>
  "use strict";
  const extractPlainText = str => 
    new DOMParser()
      .parseFromString(str, "text/html")
      .documentElement.textContent
  console.log(
    extractPlainText(c.innerHTML)
  )
  // This is a <H1> tag
  console.log(
    c.innerHTML.replace(/<[^>]*>/g, '')
  )
  //     This is a &lt;H1&gt; tag
  </script>

HTML entities cause problems...
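The entity problem can also be reproduced outside the browser. The sketch below is only an illustration under stated assumptions: `decodeBasicEntities` and `BASIC_ENTITIES` are hypothetical names, and the helper covers just five common entities, whereas a real HTML parser such as DOMParser handles the full set. It shows that the regex leaves `&lt;` and `&gt;` untouched, so a separate decoding step would be needed after stripping.

```javascript
// Hypothetical, deliberately tiny entity table (assumption: only these five).
const BASIC_ENTITIES = { lt: '<', gt: '>', amp: '&', quot: '"', '#39': "'" };

// Decode in a single pass so "&amp;lt;" becomes "&lt;" and is not decoded twice.
function decodeBasicEntities(text) {
  return text.replace(/&(lt|gt|amp|quot|#39);/g, (match, name) => BASIC_ENTITIES[name]);
}

const html = '<p>This is a &lt;H1&gt; tag</p>';

// Step 1: the regex removes the real tags but keeps the entities as-is.
const stripped = html.replace(/<[^>]*>/g, '');
console.log(stripped); // This is a &lt;H1&gt; tag

// Step 2: decode afterwards to get the text a reader would see.
console.log(decodeBasicEntities(stripped)); // This is a <H1> tag
```

The order matters here: decoding first and stripping second would turn `&lt;H1&gt;` into something the regex then deletes as if it were a tag.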

FJones • Jun 10 '22

Don't parse HTML with regex. That's the golden rule of regex on the web. HTML is a context-free language, not a regular one, so a finite automaton isn't going to cover all of its intricacies.
Extended regex features mitigate some of those issues, but in JS in particular, .textContent is the superior choice.

Randall • Jun 10 '22

Doesn't work. For example:

'<a href="abc>xyz" />'.replace(/<[^>]*>/g, ''); -> xyz" />

Don't use regex to parse HTML. It may work in most cases, but you'll be caught flat-footed by corner cases.
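A couple of these corner cases can be checked directly in a console. This is a small sketch, with `stripTags` as a hypothetical wrapper around the regex from the post and the second input invented for illustration. It shows two ways the pattern misfires: a `>` inside a quoted attribute value ends the match too early, and a bare `<` in ordinary text is treated as the start of a tag.

```javascript
// Hypothetical wrapper around the regex from the post.
const stripTags = (html) => html.replace(/<[^>]*>/g, '');

// Case 1: the ">" inside the attribute value closes the match early,
// so the rest of the tag leaks into the output.
console.log(stripTags('<a href="abc>xyz" />')); // xyz" />

// Case 2: text containing "<" and ">" that is not markup at all.
// Everything from "<" to the next ">" is deleted.
console.log(stripTags('1 < 2 and 3 > 2')); // 1  2
```

Neither input is exotic, which is why the advice in this thread is to hand the string to a real parser when correctness matters.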

Frank Wisniewski • Jun 11 '22 • Edited

The topic is more complex than expected: leading whitespace, comments, and so on all matter.
Taking textContent alone is not enough; the result still needs its whitespace cleaned up.
The following example also shows the regex problem.

<!DOCTYPE html>
<html lang=de>
  <meta charset=UTF-8>
  <title>delete tags</title>
    <div id="c">
      <p>vote for</p>
      <!-- >Trump not for-->
      <p>Biden</p>
    </div>
  <script>
  "use strict";
  const extractPlainText = str => 
    new DOMParser()
      .parseFromString( str, "text/html" )
      .documentElement.textContent
      .split( '\n' )
      .map( el => el.trim() )
      .filter( x => x.length > 0 )
      .join( '\n' )

  console.log(
    extractPlainText(c.innerHTML)
  )
// vote for
// Biden

  console.log(
    c.innerHTML.replace(/<[^>]*>/g, '')
  )
//     vote for
//     Trump not for-->
//     Biden
  </script>
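The whitespace cleanup in that snippet is independent of the DOM, so it can be pulled out and tried on its own. Below is a minimal sketch, assuming a hypothetical function name `normalizeLines` and an input string shaped roughly like the text content of the example above (indented lines with a blank line where the comment was).

```javascript
// Hypothetical helper: same split / trim / filter / join chain as above.
function normalizeLines(text) {
  return text
    .split('\n')                       // one entry per source line
    .map((line) => line.trim())        // drop indentation and trailing spaces
    .filter((line) => line.length > 0) // discard lines that are now empty
    .join('\n');                       // reassemble as clean lines
}

// Assumed input, made up to resemble the indented text content.
const raw = '\n      vote for\n      \n      Biden\n    ';
console.log(normalizeLines(raw));
// vote for
// Biden
```

Keeping this step separate makes it reusable after either approach, although it cannot repair text the regex has already mangled.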

NunoA21 • Jun 13 '22 • Edited

Nice, never thought about that one :)
The other day I was following a React tutorial and found that you can do something like this in React (in case you're interested):

<p dangerouslySetInnerHTML={{ __html: details.instructions }} />

The "dangerouslySetInnerHTML" prop comes from React, and you can use it when you receive text that contains HTML markup, like the following:

{
    "text": "<p> Hello <b>World</b>!</p>"
}

FJones • Jun 10 '22

(To be fair, web crawling and the like may in fact require parsing HTML like that, but at that point actually parsing it is the preferable solution.)