As a developer you’re constantly working with large amounts of text like source code, logs, and data files. Often you need to extract, replace, or manipulate that text, and regex can help you.
Today I present variants of a real case study where I personally used regex. These examples build upon my last post so check that out too.
reddit_id,colorblind_comment,score,title,url,created
...
8cwcbu,False,101457,"Cause of Death - Reality vs. Google vs. Media [OC]",https://i.imgur.com/GtIzEok.gif,1523970172
8bzdr8,False,99626,"Gaze and foot placement when walking over rough terrain (article link in comments) [OC]",https://v.redd.it/h0f0m4v5nor01,1523628194
fpga3f,False,99488,"[OC] To show just how insane this week's unemployment numbers are, I animated initial unemployment insurance claims from 1967 until now. These numbers are just astonishing.",https://i.redd.it/tch0t0is32p41.gif,1585245693
i2vx78,True,98638,"The environmental impact of Beyond Meat and a beef patty [OC]",https://i.redd.it/jskjkodg3se51.png,1596456703
fxoxti,False,98067,"Coronavirus Deaths vs Other Epidemics From Day of First Death (Since 2000) [OC]",https://v.redd.it/yemjrb1p9rr41,1586422082
...
(Top 100 posts from r/dataisbeautiful on Reddit where 'colorblind' is mentioned in the comments)
There are a ton of things you could extract here but I’ll go with this:
Extract image urls from posts with colorblind comments
You could use Excel or write a script here, but this is why I would use regex:
- Quick and easy to do from your IDE
- Not limited to CSVs: this could be source code, which isn’t so easy to handle via Excel or scripts
- Took me under a minute to do (with prior regex knowledge)
Step #1: Select all lines where the colorblind column is True
- `True` match the word ‘True’
- `,True,` match the surrounding commas (so we don’t match ‘True’ in the title text)
- `.*,True,.*` match everything before and after, so the whole line is selected
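If you want to sanity-check the step 1 pattern outside the editor, here is a minimal Python sketch. The first two rows are copied from the sample above; the third row is a made-up one (not in the dataset) that only exists to show why the commas matter.

```python
import re

rows = [
    '8cwcbu,False,101457,"Cause of Death - Reality vs. Google vs. Media [OC]",https://i.imgur.com/GtIzEok.gif,1523970172',
    'i2vx78,True,98638,"The environmental impact of Beyond Meat and a beef patty [OC]",https://i.redd.it/jskjkodg3se51.png,1596456703',
    # Hypothetical row: 'True' appears in the title, but the column is False.
    'abc123,False,1,"True story [OC]",https://example.com/a.png,0',
]

# Step 1 regex: the commas pin 'True' to a whole CSV field.
step_1 = re.compile(r".*,True,.*")
colorblind_rows = [row for row in rows if step_1.fullmatch(row)]

# Without the commas, the made-up row is matched by mistake.
naive_rows = [row for row in rows if re.search(r"True", row)]

print(len(colorblind_rows), len(naive_rows))
```

Only the `i2vx78` row survives the comma version, while the bare `True` pattern also picks up the made-up title.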
Step #2: Limit to lines that have an image url
- `http.*` match start of url (not necessary but needed for the next step)
- `http.*\.(png|jpg)` combine with an extension match so only urls with an image extension match
- `.*,True,.*http.*\.(png|jpg).*` combine the step 1 regex, the image url, and everything after. Now it matches lines with colorblind comments AND image urls
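As a rough check of the combined step 2 pattern, this Python sketch runs it over two rows from the sample plus two made-up rows (labelled in the comments) that each fail exactly one of the two conditions.

```python
import re

rows = [
    'i2vx78,True,98638,"The environmental impact of Beyond Meat and a beef patty [OC]",https://i.redd.it/jskjkodg3se51.png,1596456703',
    '8bzdr8,False,99626,"Gaze and foot placement when walking over rough terrain (article link in comments) [OC]",https://v.redd.it/h0f0m4v5nor01,1523628194',
    # Hypothetical row: colorblind comment, but the url is not a png/jpg.
    'xxxxx1,True,10,"Some video post [OC]",https://v.redd.it/abcdef,1600000000',
    # Hypothetical row: image url, but no colorblind comment.
    'xxxxx2,False,10,"Some image post [OC]",https://i.redd.it/abcdef.jpg,1600000000',
]

# Step 1 regex + image url + everything after.
step_2 = re.compile(r".*,True,.*http.*\.(png|jpg).*")
matching_rows = [row for row in rows if step_2.fullmatch(row)]

print(matching_rows)
```

Only the `i2vx78` row satisfies both conditions, so it is the only one left in `matching_rows`.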
Step #3: Match just the url:
Let’s break up our regex so far into parts:
part_1 | Before the url |
part_2 | The url |
part_3 | After the url |
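We want to isolate part_2, so we must exclude part_1 and part_3 from the match like this:
- `(?<= ... ) ...` a lookbehind excludes stuff before a match
- `... (?= ... )` a lookahead excludes stuff after a match
- `(?<= part_1 ) part_2 (?= part_3 )` excludes part_1 and part_3, leaving only part_2 in the match
- `(?<=.*,True,.*)http.*\.(png|jpg)(?=.*)` substitute in our actual regex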
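One caveat if you try this outside the editor: not every regex engine accepts a lookbehind of variable length. Python's built-in `re` module, for example, only allows fixed-width lookbehinds, so the sketch below swaps in a different technique: it matches part_1 normally and pulls part_2 out with a named group. The group name `url` and the lazy `.*?` before it are my own choices, not part of the regex above.

```python
import re

csv_text = """\
8cwcbu,False,101457,"Cause of Death - Reality vs. Google vs. Media [OC]",https://i.imgur.com/GtIzEok.gif,1523970172
i2vx78,True,98638,"The environmental impact of Beyond Meat and a beef patty [OC]",https://i.redd.it/jskjkodg3se51.png,1596456703
fxoxti,False,98067,"Coronavirus Deaths vs Other Epidemics From Day of First Death (Since 2000) [OC]",https://v.redd.it/yemjrb1p9rr41,1586422082
"""

# Python's `re` refuses the variable-width lookbehind at compile time.
try:
    re.compile(r"(?<=.*,True,.*)http.*\.(png|jpg)(?=.*)")
    lookbehind_supported = True
except re.error:
    lookbehind_supported = False

# Capture-group equivalent: part_1 is matched but only part_2 is captured.
# part_3 needs no pattern at all because we simply stop matching after the url.
pattern = re.compile(r"^.*,True,.*?(?P<url>http.*\.(?:png|jpg))", re.MULTILINE)
urls = [m.group("url") for m in pattern.finditer(csv_text)]

print(urls)
```

The capture group does the same job as the two lookarounds: the surrounding text still has to be there, but only the url ends up in the result.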
Step #4: Extract the text
A modern IDE should now let you select your matches. In VS Code you can select and copy your matches like this:
- `Alt + Enter` to select matches
- `Ctrl + c` to copy matches
- `Ctrl + v` to paste matches (in a separate file)
Urls successfully extracted! You could take this further by running this over your entire repo using global search (`Ctrl + Shift + f` by default in VS Code). Show me an Excel script that does that!
Sometimes we don’t want to extract text but rather update it. For this example we’ll update all image urls in the file like so:
- from this:
- to this:
Step #1: Match image urls (with ID match)
- This is the url regex we made before, but with an image ID match that we’ll use in the next step
- `http.*` match start of url and everything after
- `/.*` match forward slash and everything after (matches the image ID)
Step #2: Create regex groups
Everything wrapped with `()` creates a regex group which we can reference in our replace command.
Given the url `https://i.redd.it/jskjkodg3se51.png` we must create groups for the following parts to transform them as desired:
- `https://i.redd.it/` the domain part
- `jskjkodg3se51` the image ID part
- `png` the image extension part
Let’s group those parts of our regex:
- `(http.*)` groups the domain part
- `/(.*)` groups the image ID part (note the slash is excluded)
- `\.(png|jpg)` groups the image extension part (note this was already grouped and that
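To see what each group actually captures, here is a small Python sketch that runs the three grouped parts, joined into one regex, against the example url. The variable names are mine.

```python
import re

# The three grouped parts from above, joined into one regex.
grouped = re.compile(r"(http.*)/(.*)\.(png|jpg)")

match = grouped.search("https://i.redd.it/jskjkodg3se51.png")
domain, image_id, extension = match.groups()

print(domain)     # https://i.redd.it
print(image_id)   # jskjkodg3se51
print(extension)  # png
```

Because `(http.*)` is greedy, it backtracks to the last `/` in the url, so the slash and the dot sit between the groups rather than inside them.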
Step #3: Replace
Open the replace UI (`Ctrl + h` in most IDEs)
We can now reference groups in replace commands as follows:
- `$0` is the entire match
- `$1`, `$2`, `$3` are the first, second, third groups, etc.
Our search and replace commands will therefore look like this:
That regex replace will change urls like so:
Nice huh? You can also do this globally via `Ctrl + Shift + h` (the VS Code default) to search and replace over your whole repo. Very useful.
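For a feel of how group references behave in a replace, here is a hedged Python sketch. The replacement template (inserting an `/images/` path segment) is purely hypothetical and is not the transformation used in this post; it only shows the groups being put back together. Note that Python writes group references as `\1`, `\2`, `\3` and `\g<0>`, where the editor's replace box uses `$1`, `$2`, `$3` and `$0`.

```python
import re

grouped = re.compile(r"(http.*)/(.*)\.(png|jpg)")

line = 'i2vx78,True,98638,"The environmental impact of Beyond Meat and a beef patty [OC]",https://i.redd.it/jskjkodg3se51.png,1596456703'

# Hypothetical template: domain, a made-up path segment, image ID, extension.
updated = grouped.sub(r"\1/images/\2.\3", line)

# \g<0> is the whole match, like $0 in the editor.
wrapped = grouped.sub(r"<\g<0>>", "https://i.redd.it/jskjkodg3se51.png")

print(updated)
print(wrapped)
```

Only the matched url is rewritten; the rest of the CSV line, including the trailing timestamp, is left untouched.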
Powerful stuff eh? Once you learn how to build up a regex like this (hint: practice makes perfect) you can do this kind of stuff in seconds. Searching, extracting, and transforming text will become second nature.
Well that’s it for now! I’ve got 2 more sections of this case study ready to go but this article was getting long. So next week we’ll learn how to use multi-cursors to augment regex even further.
It’s seriously powerful stuff!