Shuaib Yusuf Shuaib

RegExp based HTML modification in PHP

This is the approach I use to modify HTML with regular expressions in PHP.

HTML Tag Regex

/\<(?<tag>[a-z][a-z0-9\-]*)(\s+([\s\S]*?))?\/?\>(([\s\S]*?)\<\/(?P=tag)\>)?/

The above RegExp can be broken down as:

  1. /: The regex opening delimiter.
  2. \<: Matches the character < of an opening tag.
  3. (?<tag>[a-z][a-z0-9\-]*): Captures the tag name in the named group tag. The name must start with a letter from a to z and may continue with letters a to z, digits 0 to 9, and hyphens (-). Because the pattern has no i modifier, only lowercase tag names match.
  4. (\s+([\s\S]*?))?: Matches the tag's attributes, including the whitespace before and between them, if any are present. The match is lazy and stops at the first >, so an attribute value that contains > will cut the tag short.
  5. \/?: Matches the optional / of a self-closing tag.
  6. \>: Matches >, the character that ends the opening tag.
  7. (([\s\S]*?)\<\/(?P=tag)\>)?: Matches the content inside the element together with its closing tag. The group is optional, so self-closing tags and void tags such as <img>, which have no closing tag, still match. Because the match is lazy, it ends at the first closing tag with the same name, so nested elements of the same name are not handled.
  8. /: The regex closing delimiter.
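
To make the group numbering concrete, here is a small sketch that runs the pattern against two sample strings. The sample inputs and variable names are my own assumptions for illustration; only the pattern comes from above.

```php
<?php

// The tag pattern from above, unchanged.
$tags_regexp = '/\<(?<tag>[a-z][a-z0-9\-]*)(\s+([\s\S]*?))?\/?\>(([\s\S]*?)\<\/(?P=tag)\>)?/';

// Hypothetical sample input: a paired element with one attribute.
$html = '<a href="/home">Home</a>';

if ( preg_match( $tags_regexp, $html, $m ) ) {
  $tag_name   = $m['tag']; // named group "tag" (also available as $m[1])
  $attributes = $m[3];     // attribute string without the leading whitespace
  $inner_html = $m[5];     // everything between the opening and closing tag
}

// Hypothetical sample input: a void element with no closing tag.
preg_match( $tags_regexp, '<img src="a.png">', $void );

// preg_match() drops trailing groups that did not take part in the match,
// so the content group is simply missing here.
$has_content = isset( $void[5] );
```

Group 2 holds the attributes with their leading whitespace and group 3 holds them without it, so either can be passed on to the attribute pattern.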

HTML Tag Attributes Regex

/([\w\-]+)(\s*\=\s*(?|(?<quot>[\'"])([\s\S]*?)(?P=quot)|(?<quot>)([\w\-]+)))?/

The above RegExp can be broken down as:

  1. /: The regex opening delimiter.
  2. ([\w\-]+): Matches the attribute name.
  3. (\s*\=\s*(?|(?<quot>[\'"])([\s\S]*?)(?P=quot)|(?<quot>)([\w\-]+)))?: Matches the = sign and the attribute value. The value can be anything wrapped in single quotes (') or double quotes ("), or it can be unquoted. An unquoted value may contain only letters (a to z, A to Z), digits (0 to 9), underscores (_), and hyphens (-). The whole group is optional, so boolean attributes with no value also match.
  4. /: The regex closing delimiter.
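
As a quick check of the four cases described above, this sketch parses a made-up attribute string into an associative array. The sample string and the $parsed variable are assumptions for illustration; the pattern is the one above.

```php
<?php

// The attribute pattern from above, unchanged.
$atts_regexp = '/([\w\-]+)(\s*\=\s*(?|(?<quot>[\'"])([\s\S]*?)(?P=quot)|(?<quot>)([\w\-]+)))?/';

// Hypothetical attribute string covering all four cases:
// double-quoted, single-quoted, unquoted, and boolean.
$attributes = 'src="a.png" alt=\'A cat\' width=200 hidden';

// Default PREG_PATTERN_ORDER: $m[1] holds every name, $m[4] every value.
// Groups that did not match (the boolean attribute) come back as ''.
preg_match_all( $atts_regexp, $attributes, $m );

// Pair each attribute name with its value.
$parsed = array_combine( $m[1], $m[4] );

print_r( $parsed );
```

Both alternatives sit inside a branch reset group, (?|...), so the value is always in group 4 whether or not it was quoted.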

Example Usage

<?php

// HTML elements
$content = <<<EOL

<p>Text paragraph.</p>
<img src="http://example.com/image-200x320.png" width="200" height="320">

EOL;

// Tags matching RegExp
$tags_regexp = '/\<(?<tag>[a-z][a-z0-9\-]*)(\s+([\s\S]*?))?\/?\>(([\s\S]*?)\<\/(?P=tag)\>)?/';

// Attributes matching RegExp
$atts_regexp = '/([\w\-]+)(\s*\=\s*(?|(?<quot>[\'"])([\s\S]*?)(?P=quot)|(?<quot>)([\w\-]+)))?/';

// Match all the valid elements in the HTML
preg_match_all( $tags_regexp, $content, $matches, PREG_SET_ORDER );

// Loop through and make the necessary changes
foreach ( $matches as $match ) {

  // We are going to modify only image tags
  if ( 'img' !== $match[ 'tag' ] ) continue;

  // Match all the attributes (group 2 is absent when the tag has none)
  preg_match_all( $atts_regexp, $match[2] ?? '', $atts_match );

  // Combine the keys and the values
  $atts_match = array_combine( $atts_match[1], $atts_match[4] );

  // Rebuild the attributes as valid HTML, escaping double quotes in values
  $atts = '';
  foreach ( $atts_match as $name => $value ) {
    $atts .= sprintf( ' %s="%s"', $name, str_replace( '"', '&quot;', $value ) );
  }

  // Replacement for the tag
  $amp = sprintf( '<amp-img%s></amp-img>', $atts );

  // Replace the complete tag match with the new replacement
  $content = str_replace( $match[0], $amp, $content );

}


// The AMPified HTML
/**
 * <p>Text paragraph.</p>
 * <amp-img src="http://example.com/image-200x320.png" width="200" height="320"></amp-img>
 */
echo $content;



The same approach could be extended into a complete HTML-to-AMP converter for simple pages.
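
One possible first step in that direction, sketched here under my own assumptions, is to wrap the logic in a reusable function built on preg_replace_callback(). The function name img_to_amp_img is hypothetical, and this is only one way to structure it, not a finished converter.

```php
<?php

// Hypothetical helper: rewrites every top-level <img> in $html as <amp-img>.
function img_to_amp_img( string $html ): string {
  // Same two patterns as in the example above.
  $tags_regexp = '/\<(?<tag>[a-z][a-z0-9\-]*)(\s+([\s\S]*?))?\/?\>(([\s\S]*?)\<\/(?P=tag)\>)?/';
  $atts_regexp = '/([\w\-]+)(\s*\=\s*(?|(?<quot>[\'"])([\s\S]*?)(?P=quot)|(?<quot>)([\w\-]+)))?/';

  return preg_replace_callback( $tags_regexp, function ( array $match ) use ( $atts_regexp ) {
    // Return every tag other than <img> exactly as it was matched.
    if ( 'img' !== $match['tag'] ) {
      return $match[0];
    }

    // Group 3 is missing when the tag has no attributes.
    preg_match_all( $atts_regexp, $match[3] ?? '', $atts, PREG_SET_ORDER );

    $out = '';
    foreach ( $atts as $att ) {
      // Group 4 is missing for boolean attributes; escape quotes in values.
      $value = str_replace( '"', '&quot;', $att[4] ?? '' );
      $out  .= sprintf( ' %s="%s"', $att[1], $value );
    }

    return sprintf( '<amp-img%s></amp-img>', $out );
  }, $html );
}

$content = "<p>Text paragraph.</p>\n<img src=\"http://example.com/image-200x320.png\" width=\"200\" height=\"320\">";

echo img_to_amp_img( $content );
```

One limitation to keep in mind: matches do not overlap, so an image nested inside another matched element, for example <p><img src="a.png"></p>, is consumed as part of the outer match and left unconverted. A real converter would need to recurse into the inner content or use an HTML parser.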
