Shuaib Yusuf Shuaib

RegExp based HTML modification in PHP

This is roughly how I modify HTML with RegExp in PHP.

HTML Tag Regex

/\<(?<tag>[a-z][a-z0-9\-]*)(\s+([\s\S]*?))?\/?\>(([\s\S]*?)\<\/(?P=tag)\>)?/

The above RegExp can be broken down as:

  1. /: The regex opening delimiter.
  2. \<: Matches the character < of an opening tag.
  3. (?<tag>[a-z][a-z0-9\-]*): Matches a valid HTML tag name and captures it as the named group tag. The name must start with a letter from a to z and may continue with letters a to z, digits 0 to 9, and hyphens (-). Without the i flag, only lowercase tag names match.
  4. (\s+([\s\S]*?))?: Matches the tag's whole attribute string, including the whitespace between attributes, but only if attributes are present. The match is lazy, so it stops at the first >, and an attribute value that contains > will cut it short.
  5. \/?: Matches the optional / of a self-closing tag.
  6. \>: Matches the character > that closes the opening tag.
  7. (([\s\S]*?)\<\/(?P=tag)\>)?: Matches the content (text or HTML) inside the tag together with the closing tag. The group is optional, so tags without a closing tag, such as <img>, still match. The content match is lazy too, so it ends at the first closing tag with the same name.
  8. /: The regex closing delimiter.
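
To make the breakdown above concrete, here is a minimal sketch of which capture group holds what after a match. The sample HTML and the variable names are my own assumptions for illustration; only the pattern itself comes from this post.

```php
<?php

// The tag pattern from this post, unchanged.
$tags_regexp = '/\<(?<tag>[a-z][a-z0-9\-]*)(\s+([\s\S]*?))?\/?\>(([\s\S]*?)\<\/(?P=tag)\>)?/';

// Hypothetical sample input, only for illustration.
$html = '<a href="/home" class="nav">Home</a>';

preg_match( $tags_regexp, $html, $m );

$tag_name   = $m['tag']; // named group, also available as $m[1]
$attributes = $m[3];     // attribute string without the leading whitespace
$inner_html = $m[5];     // content between the opening and the closing tag

echo $tag_name, "\n";   // a
echo $attributes, "\n"; // href="/home" class="nav"
echo $inner_html, "\n"; // Home
```

Group 2 holds the same attribute string with its leading whitespace, and group 4 holds the content plus the closing tag.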

HTML Tag Attributes Regex

/([\w\-]+)(\s*\=\s*(?|(?<quot>[\'"])([\s\S]*?)(?P=quot)|(?<quot>)([\w\-]+)))?/

The above RegExp can be broken down as:

  1. /: The regex opening delimiter.
  2. ([\w\-]+): Matches the attribute's key (name).
  3. (\s*\=\s*(?|(?<quot>[\'"])([\s\S]*?)(?P=quot)|(?<quot>)([\w\-]+)))?: Matches the optional value of the attribute. A quoted value can contain anything and is wrapped in single quotes (') or double quotes ("). An unquoted value may only contain letters a to z and A to Z, digits 0 to 9, _ (underscore), and - (hyphen). The (?| branch reset makes both alternatives share the same group numbers, so the value always ends up in group 4. Because the whole group is optional, boolean attributes with no value match as well.
  4. /: The regex closing delimiter.
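
As a quick check of the three cases described above (quoted, unquoted, and boolean), here is a small sketch that parses an attribute string into an array. The attribute string and the $parsed variable are assumptions made up for this example, not something from the original code.

```php
<?php

// The attribute pattern from this post, unchanged.
$atts_regexp = '/([\w\-]+)(\s*\=\s*(?|(?<quot>[\'"])([\s\S]*?)(?P=quot)|(?<quot>)([\w\-]+)))?/';

// Hypothetical attribute string covering quoted, unquoted and boolean attributes.
$atts = 'class="btn primary" data-id=42 title=\'Save\' disabled';

preg_match_all( $atts_regexp, $atts, $sets, PREG_SET_ORDER );

$parsed = array();
foreach ( $sets as $set ) {
  // Group 2 is empty or missing when the attribute has no "=value" part.
  // Otherwise group 4 holds the value without its quotes.
  $parsed[ $set[1] ] = empty( $set[2] ) ? true : $set[4];
}

// $parsed now holds:
//   'class'    => 'btn primary'
//   'data-id'  => '42'
//   'title'    => 'Save'
//   'disabled' => true
```

Storing true for a boolean attribute is just one possible choice. The example further down uses an empty string instead, because that is what array_combine receives from preg_match_all in its default mode.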

Example Usage

<?php

// HTML elements
$content = <<<EOL

<p>Text paragraph.</p>
<img src="http://example.com/image-200x320.png" width="200" height="320">

EOL;

// Tags matching RegExp
$tags_regexp = '/\<(?<tag>[a-z][a-z0-9\-]*)(\s+([\s\S]*?))?\/?\>(([\s\S]*?)\<\/(?P=tag)\>)?/';

// Attributes matching RegExp
$atts_regexp = '/([\w\-]+)(\s*\=\s*(?|(?<quot>[\'"])([\s\S]*?)(?P=quot)|(?<quot>)([\w\-]+)))?/';

// Match all the valid elements in the HTML
preg_match_all( $tags_regexp, $content, $matches, PREG_SET_ORDER );

// Loop through and make the necessary changes
foreach ( $matches as $match ) {

  // We are going to modify only image tags
  if ( 'img' !== $match[ 'tag' ] ) continue;

  // Match all the attributes (group 2 is missing when the tag has none)
  preg_match_all( $atts_regexp, isset( $match[2] ) ? $match[2] : '', $atts_match );

  // Combine the keys and the values
  $atts_match = array_combine( $atts_match[1], $atts_match[4] );

  // Rebuild the attributes as a valid HTML string
  $atts = '';
  foreach ( $atts_match as $name => $value ) {
    $atts .= sprintf( ' %s="%s"', $name, $value );
  }

  // Replacement for the tag
  $amp = sprintf( '<amp-img%s></amp-img>', $atts );

  // Replace the complete tag match with the new replacement
  $content = str_replace( $match[0], $amp, $content );

}


// The AMPified HTML
/**
 * <p>Text paragraph.</p>
 * <amp-img src="http://example.com/image-200x320.png" width="200" height="320"></amp-img>
 */
echo $content;



The above could also be extended into a complete HTML-to-AMP converter for simple pages.
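
One possible first step in that direction, sketched under my own assumptions: the same conversion wrapped in a reusable function built on preg_replace_callback, so each match is replaced in place instead of going through str_replace. The function name ampify_images and the sample input are hypothetical. This is not a finished converter.

```php
<?php

// Both patterns are copied from this post, unchanged.
$tags_regexp = '/\<(?<tag>[a-z][a-z0-9\-]*)(\s+([\s\S]*?))?\/?\>(([\s\S]*?)\<\/(?P=tag)\>)?/';
$atts_regexp = '/([\w\-]+)(\s*\=\s*(?|(?<quot>[\'"])([\s\S]*?)(?P=quot)|(?<quot>)([\w\-]+)))?/';

// Hypothetical helper: converts top-level <img> tags to <amp-img>.
function ampify_images( $html, $tags_regexp, $atts_regexp ) {
  return preg_replace_callback( $tags_regexp, function ( $match ) use ( $atts_regexp ) {
    // Leave every tag other than <img> untouched.
    if ( 'img' !== $match['tag'] ) {
      return $match[0];
    }

    // Group 3 is missing when the tag has no attributes.
    $raw_atts = isset( $match[3] ) ? $match[3] : '';
    preg_match_all( $atts_regexp, $raw_atts, $sets, PREG_SET_ORDER );

    $atts = '';
    foreach ( $sets as $set ) {
      if ( empty( $set[2] ) ) {
        $atts .= ' ' . $set[1]; // boolean attribute, keep it bare
      } else {
        $atts .= sprintf( ' %s="%s"', $set[1], $set[4] );
      }
    }

    return sprintf( '<amp-img%s></amp-img>', $atts );
  }, $html );
}

// Hypothetical sample input.
$html = '<p>Text</p><img src="a.png" width=10 hidden>';

echo ampify_images( $html, $tags_regexp, $atts_regexp );
// <p>Text</p><amp-img src="a.png" width="10" hidden></amp-img>
```

Keep in mind that the tag pattern consumes an element together with its content, so an <img> nested inside another matched element (for example inside a <p>) is not reached by this sketch, nor by the loop above. A real converter would need to run the replacement again on the inner content.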
