As a developer you’re constantly working with large amounts of text like source code, logs, and data files. Often you need to extract, replace, or manipulate that text and regex can help you.
Today I present variants of a real case study where I personally used regex. These examples build upon my last post so check that out too.
Case study example
This is a snippet of this csv file which you can also play with here.
reddit_id,colorblind_comment,score,title,url,created
...
8cwcbu,False,101457,"Cause of Death - Reality vs. Google vs. Media [OC]",https://i.imgur.com/GtIzEok.gif,1523970172
8bzdr8,False,99626,"Gaze and foot placement when walking over rough terrain (article link in comments) [OC]",https://v.redd.it/h0f0m4v5nor01,1523628194
fpga3f,False,99488,"[OC] To show just how insane this week's unemployment numbers are, I animated initial unemployment insurance claims from 1967 until now. These numbers are just astonishing.",https://i.redd.it/tch0t0is32p41.gif,1585245693
i2vx78,True,98638,"The environmental impact of Beyond Meat and a beef patty [OC]",https://i.redd.it/jskjkodg3se51.png,1596456703
fxoxti,False,98067,"Coronavirus Deaths vs Other Epidemics From Day of First Death (Since 2000) [OC]",https://v.redd.it/yemjrb1p9rr41,1586422082
...
(Top 100 posts from r/dataisbeautiful on Reddit where 'colorblind' is mentioned in the comments)
Text extraction: Extract urls from data
There are a ton of things you could extract here but I’ll go with this:
Extract image urls from posts with colorblind comments
You could use excel or write a script here but this is why I would use regex:
- Quick and easily done from your IDE
- Not limited to CSVs - this could be source code which isn’t so easy via excel/scripts
- Took me under a minute to do (with prior regex knowledge)
Step #1: Select all lines where the colorblind column is True
.*,True,.*
-
True
match the word ‘True’ -
,
match commas (so we don’t match ‘True’ in the title text) -
.*
match everything before and after
Step #2: Limit to lines that have an image url
.*,True,.*http.*\.(png|jpg).*
-
http.*
match start of url (not necessary but needed for the next step) -
\.(png|jpg)
match.png
or.jpg
extensions -
http.*\.(png|jpg)
combine to match urls with an image extension -
.*,True,.*
+http.*\.(png|jpg)
+.*
combine step 1 regex, the image url, and everything after. Now it matches lines with colorblind comments AND image urls
Step #3: Match just the url:
(?<=.*,True,.*,)http.*\.(png|jpg)(?=.*)
Lets break up our regex so far into parts:
| part_1
| Before the url | .*,True,.*
|
| part_2
| The url | http.*\.(png|jpg)
|
| part_3
| After the url | .*
|
We want to isolate part_2
so must exclude part_1
and part_3
like this:
-
(?<= ... ) ...
exclude stuff before a match -
... (?= ... )
exclude stuff after a match -
(?<= part_1 )
+part_2
+(?= part_3 )
excludepart_1
andpart_3
-
(?<=.*,True,.*)http.*\.(png|jpg)(?=.*)
substitute in our actual regex
Step #4: Extract the text
A modern IDE should now let you select your matches. In VS Code you can select and copy your matches like this:
-
Alt + Enter
to select matches -
Ctrl + c
to copy matches -
Ctrl + v
to paste matches (in a separate file)
Urls successfully extracted! You could take this further by running this over your entire repo using global search (Alt + Shift + f
). Show me an excel script that does that!
Text replace: Reformat url structure
Sometimes we don’t want to extract text but rather update it. For this example we’ll update all image urls in the file like so:
- from this:
https://i.redd.it/jskjkodg3se51.png
- to this:
https://i.redd.it/png/jskjkodg3se51
Step #1: Match image urls (with ID match)
http.*/.*\.(png|jpg)
- This the url regex we made before but with an image ID match that we’ll use in the next step
-
http.*
match start of url and everything after -
/.*
match forward slash and everything after (matches the image ID) -
\.(png|jpg)
match.png
or.jpg
extensions
Step #2: Create regex groups
(http.*)/(.*)\.(png|jpg)
Everything wrapped with ()
creates a regex group which we can reference in our replace command.
Given the url https://i.redd.it/jskjkodg3se51.png
we must create groups for the following parts to transform them as desired:
-
https://i.redd.it/
the domain part -
jskjkodg3se51
the image ID part -
png
the image extension part
Let’s group those parts of our regex:
-
(http.*)
groups the domain part -
/(.*)
groups the image ID part (note the slash is excluded) -
\.(png|jpg)
groups the image extension part (note this was already grouped and that.
is excluded)
Step #3: Replace
(http.*)/(.*)\.(png|jpg)
+ $1/$3/$2
Open the replace UI (Ctrl + h
in most IDEs)
We can now reference groups in replace commands as follows:
-
$0
is the entire match -
$1
,$2
,$3
is the first, second, third group etc
Our search and replace commands will therefore look like this:
- Search:
(http.*)/(.*)\.(png|jpg)
- Replace:
$1/$3/$2
That regex replace will change urls like so:
-
https://i.redd.it/jskjkodg3se51.png
->https://i.redd.it/png/jskjkodg3se51
-
https://i.redd.it/mkdikcce8gh31.jpg
->https://i.redd.it/jpg/mkdikcce8gh31
Nice huh? You can also do this globally via Alt + shift + h
to search and replace over your whole repo. Very useful.
Powerful stuff eh? Once you learn how to build up a regex like this (hint: practice makes perfect) you can do this kind of stuff in seconds. Searching, extracting, and transforming text will become second nature.
Well that’s it for now! I’ve got 2 more sections of this case study ready to go but this article was getting long. So next week we’ll learn how to use multi-cursors to augment regex even further.
It’s seriously powerful stuff!
Found this useful? Tweet about it or follow me on Twitter. Until next time 👋
Top comments (0)