I have been working on a particular task that required me to extract dates from a pool of text that was stored Big query. Trust me when I say that the data was quite messy and every date format you can imagine was in attendance.
Hence I got to work. Went through the big query docs and the function that was more suitable for my task was
REGEXP_EXTRACT. How it was supposed to work was that I would specify the date format I wanted to extract in the regex function and it would extract it for me.
Something like this:
In this case, the date was
1/18/2021 and the job of the regex function was to extract the date.
The issue I encountered was that the
regex_extract function doesn't work with more than one capturing group. What I mean by capturing group is mainly due to the fact that I could not predict the date format in the data and each part of the date needed a different capturing group. Something like
DD/MM/YY or MM/DD/YY
Unfortunately, the regex extract function only accepted one capturing group.
Regular expressions passed into extraction functions must not have more than 1 capturing group
Did so some google and Stackoverflow searches before was able to figure out a solution.
The solution was that since
REGEXP_EXTRACT only takes one capturing group you had to make the other groups non-capturing using
?:. This allows the function to capture more than one group.
The code below shows this in practice.
The code above would perfectly extract the date
1/18/2021 as needed.
In conclusion, if you ever encounter a task that requires you to extract more than one capturing group using regex on BigQuery remember that you have to make the other groups non-capturing using the
?: notation and you are good to go.