This is roughly how I modify HTML with regular expressions in PHP.
HTML Tag Regex
/\<(?<tag>[a-z][a-z0-9\-]*)(\s+([\s\S]*?))?\/?\>(([\s\S]*?)\<\/(?P=tag)\>)?/
The above RegExp can be broken down as:
- /: The regex opening delimiter.
- \<: Matches the character < that starts an opening tag.
- (?<tag>[a-z][a-z0-9\-]*): Matches the tag name and captures it as tag. The name must start with a lowercase letter from a to z, and may continue with lowercase letters, digits from 0 to 9, and hyphens (-). The pattern has no i flag, so uppercase tag names are not matched.
- (\s+([\s\S]*?))?: Matches the whole attribute string, including the whitespace before and between attributes, if the tag has any.
- \/?: Optionally matches the / of a self-closing tag.
- \>: Matches the character >, which ends the opening tag.
- (([\s\S]*?)\<\/(?P=tag)\>)?: Matches the content inside the element followed by its closing tag, if there is one. For self-closing tags this group is left unmatched. The match is lazy, so it stops at the first closing tag with the same name.
- /: The regex closing delimiter.
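To see what each capture group holds, here is a small sketch that runs the tag regex against a made-up snippet. The sample markup and the variable names are assumptions for illustration only.

```php
<?php
// Same tag pattern as above
$tags_regexp = '/\<(?<tag>[a-z][a-z0-9\-]*)(\s+([\s\S]*?))?\/?\>(([\s\S]*?)\<\/(?P=tag)\>)?/';

// Sample markup, made up for this illustration
$html = '<p class="intro">Hello <b>world</b></p><br/>';

preg_match_all( $tags_regexp, $html, $matches, PREG_SET_ORDER );

foreach ( $matches as $match ) {
    // Group 3 is the attribute string, group 5 is the inner content.
    // Trailing groups that did not take part in the match are absent,
    // hence the ?? '' fallbacks.
    printf(
        "tag=%s | attributes=%s | content=%s\n",
        $match['tag'],
        $match[3] ?? '',
        $match[5] ?? ''
    );
}
// tag=p | attributes=class="intro" | content=Hello <b>world</b>
// tag=br | attributes= | content=
```

Note that the nested <b> element is not reported as its own match: it is consumed as part of the content of <p>, so nested tags need a second pass over group 5.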
HTML Tag Attributes Regex
/([\w\-]+)(\s*\=\s*(?|(?<quot>[\'"])([\s\S]*?)(?P=quot)|(?<quot>)([\w\-]+)))?/
The above RegExp can be broken down as:
- /: The regex opening delimiter.
- ([\w\-]+): Matches the attribute name (key).
- (\s*\=\s*(?|(?<quot>[\'"])([\s\S]*?)(?P=quot)|(?<quot>)([\w\-]+)))?: Matches the attribute value, if there is one. The value can be anything wrapped in single quotes (') or double quotes ("), or it can be unquoted. An unquoted value may only contain letters from a to z and A to Z, digits from 0 to 9, underscores (_), and hyphens (-). The whole group is optional, so boolean attributes without a value also match.
- /: The regex closing delimiter.
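The three value styles described above (quoted, unquoted, and boolean) can be checked with a short sketch. The attribute string below is invented for the example, and $parsed is just an illustrative variable name.

```php
<?php
// Same attribute pattern as above
$atts_regexp = '/([\w\-]+)(\s*\=\s*(?|(?<quot>[\'"])([\s\S]*?)(?P=quot)|(?<quot>)([\w\-]+)))?/';

// Made-up attribute string: double-quoted, unquoted, single-quoted, boolean
$attributes = 'src="a b.png" width=200 alt=\'it is "ok"\' hidden';

preg_match_all( $atts_regexp, $attributes, $m );

// Group 1 holds the names, group 4 the values (without the quotes).
// In the default PREG_PATTERN_ORDER mode, unmatched groups become ''.
$parsed = array_combine( $m[1], $m[4] );

foreach ( $parsed as $name => $value ) {
    printf( "%s => [%s]\n", $name, $value );
}
// src => [a b.png]
// width => [200]
// alt => [it is "ok"]
// hidden => []
```

Because the names become array keys, a repeated attribute name would overwrite the earlier value.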
Example Usage
<?php
// HTML elements
$content = <<<EOL
<p>Text paragraph.</p>
<img src="http://example.com/image-200x320.png" width="200" height="320">
EOL;

// Tag-matching RegExp
$tags_regexp = '/\<(?<tag>[a-z][a-z0-9\-]*)(\s+([\s\S]*?))?\/?\>(([\s\S]*?)\<\/(?P=tag)\>)?/';

// Attribute-matching RegExp
$atts_regexp = '/([\w\-]+)(\s*\=\s*(?|(?<quot>[\'"])([\s\S]*?)(?P=quot)|(?<quot>)([\w\-]+)))?/';

// Match all the valid elements in the HTML
preg_match_all( $tags_regexp, $content, $matches, PREG_SET_ORDER );

// Loop through and make the necessary changes
foreach ( $matches as $match ) {
    // We are going to modify only image tags
    if ( 'img' !== $match['tag'] ) {
        continue;
    }

    // Match all the attributes (group 2 is absent when the tag has none)
    preg_match_all( $atts_regexp, $match[2] ?? '', $atts_match );

    // Combine the names (group 1) and the values (group 4)
    $atts_match = array_combine( $atts_match[1], $atts_match[4] );

    // Rebuild a valid HTML attribute string
    $atts = '';
    foreach ( $atts_match as $name => $value ) {
        $atts .= sprintf( ' %s="%s"', $name, $value );
    }

    // Replacement for the tag
    $amp = sprintf( '<amp-img%s></amp-img>', $atts );

    // Replace the complete tag match with the new replacement
    $content = str_replace( $match[0], $amp, $content );
}

// The AMPified HTML
/**
 * <p>Text paragraph.</p>
 * <amp-img src="http://example.com/image-200x320.png" width="200" height="320"></amp-img>
 */
echo $content;
The above could also be extended into a complete HTML-to-AMP converter for simple pages.
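As one possible first step in that direction, the loop can be wrapped in a reusable function built on preg_replace_callback, which rewrites each match in place instead of searching for it again with str_replace. This is only a sketch: the function name ampify_images is hypothetical, and it shares the limits of the regexes above (for example, an <img> nested inside another matched element is not seen).

```php
<?php
/**
 * Hypothetical helper (the name is an assumption, not from the original code):
 * converts top-level <img> tags in $html to <amp-img> elements.
 */
function ampify_images( string $html ): string {
    $tags_regexp = '/\<(?<tag>[a-z][a-z0-9\-]*)(\s+([\s\S]*?))?\/?\>(([\s\S]*?)\<\/(?P=tag)\>)?/';
    $atts_regexp = '/([\w\-]+)(\s*\=\s*(?|(?<quot>[\'"])([\s\S]*?)(?P=quot)|(?<quot>)([\w\-]+)))?/';

    return preg_replace_callback(
        $tags_regexp,
        function ( array $match ) use ( $atts_regexp ) {
            // Leave every tag other than <img> untouched
            if ( 'img' !== $match['tag'] ) {
                return $match[0];
            }

            // Group 2 is absent when the tag has no attributes
            preg_match_all( $atts_regexp, $match[2] ?? '', $atts_match );

            // Rebuild the attribute string from names (1) and values (4)
            $atts = '';
            foreach ( $atts_match[1] as $i => $name ) {
                $atts .= sprintf( ' %s="%s"', $name, $atts_match[4][ $i ] );
            }

            return sprintf( '<amp-img%s></amp-img>', $atts );
        },
        $html
    );
}

$result = ampify_images( "<p>Text paragraph.</p>\n<img src=\"http://example.com/image-200x320.png\" width=\"200\" height=\"320\">" );
echo $result;
```

A real converter would also have to handle nested elements and the other AMP components, which regular expressions alone handle poorly; a DOM parser is usually the safer tool once pages stop being simple.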