Disclaimer: This post assumes some knowledge of regular expressions.
Recently I was trying to capture an HTML attribute in sed. For example, let’s say I want to extract the href attribute from the following tag:
<a href="https://brandonrozek.com" rel="me"></a>
Advice you commonly see on the Internet is to use a capture group for anything between the quotes of the href.
In regular expression land, we can represent anything as .* and define a capture group of some regular expression X as \(X\).
sed "s/.*href=\"\(.*\)\".*/\1/g"
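To see the capture group and the \1 backreference in isolation, here is a minimal sketch on a simpler input. The name=brandon line and the variable names are made up for illustration and are not part of the original example.

```shell
# Hypothetical input line, chosen only to illustrate the capture group.
line='name=brandon'

# \(.*\) captures everything after "name=", and \1 in the
# replacement refers back to that captured text.
captured=$(echo "$line" | sed 's/name=\(.*\)/\1/')
echo "$captured"
```

With only one place for .* to stop, the capture is exactly the text after the equals sign.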
What does this look like for our input?
echo \<a href=\"https://brandonrozek.com\" rel=\"me\"\>\</a\> |\
sed "s/.*href=\"\(.*\)\".*/\1/g"
https://brandonrozek.com" rel="me
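It matches all the way to the last "! This happens because .* is greedy: it consumes as many characters as it can, quotation marks included. What we want is not to match any character within the quotations, but to match any character that is not a quotation mark itself: [^\"]*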
sed "s/.*href=\"\([^\"]*\)\".*/\1/g"
This then works for our example:
echo \<a href=\"https://brandonrozek.com\" rel=\"me\"\>\</a\> |\
sed "s/.*href=\"\([^\"]*\)\".*/\1/g"
https://brandonrozek.com
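The escaped double quotes make the expression harder to read than it needs to be. As a minimal sketch, the same expression can be written inside single quotes so the inner double quotes need no backslashes. This sketch also assumes you would rather get no output for lines that contain no href, so it adds sed's -n option and the p flag, neither of which appears in the commands above. The function name extract_href is hypothetical.

```shell
# extract_href is a hypothetical helper name used for illustration.
# Single quotes let us write " inside the expression without escaping.
# -n suppresses sed's default printing; the trailing p prints only
# the lines where the substitution actually happened.
extract_href() {
  sed -n 's/.*href="\([^"]*\)".*/\1/p'
}

echo '<a href="https://brandonrozek.com" rel="me"></a>' | extract_href
echo '<p>no link here</p>' | extract_href
```

The first command prints the URL. The second prints nothing, whereas the plain s/// form would echo the unmatched line back unchanged.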
Within a bash script, we can make this a little more readable by using multiple variables.
QUOTED_STR="\"\([^\"]*\)\""
BEFORE_TEXT=".*href=$QUOTED_STR.*"
AFTER_TEXT="\1"
REPLACE_EXPR="s/$BEFORE_TEXT/$AFTER_TEXT/g"
INPUT="<a href=\"https://brandonrozek.com\" rel=\"me\"></a>"
echo "$INPUT" | sed "$REPLACE_EXPR"
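Splitting the expression into variables also makes it easy to swap out the attribute name. The following is a rough sketch of that idea, not a general HTML parser. The function extract_attr is hypothetical, and it assumes the attribute name contains only letters, digits, and hyphens and that the value is wrapped in double quotes. It uses sed -n with the p flag so that lines without the attribute produce no output.

```shell
# extract_attr is a hypothetical helper. It assumes the attribute name
# contains only letters, digits, and hyphens, and that the value is
# wrapped in double quotes.
extract_attr() {
  attr_name="$1"
  quoted_str="\"\([^\"]*\)\""
  before_text=".*${attr_name}=${quoted_str}.*"
  after_text="\1"
  # -n plus the p flag prints only lines where the attribute was found.
  sed -n "s/${before_text}/${after_text}/p"
}

input='<a href="https://brandonrozek.com" rel="me"></a>'
echo "$input" | extract_attr href
echo "$input" | extract_attr rel
```

Because the leading .* is greedy, a line that contains the same attribute several times yields the last occurrence, so this approach fits best when each line holds a single tag.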