I have been working on a particular task that required me to extract dates from a pool of text that was stored Big query. Trust me when I say that the data was quite messy and every date format you can imagine was in attendance.
Hence I got to work. Went through the big query docs and the function that was more suitable for my task was REGEXP_EXTRACT
. How it was supposed to work was that I would specify the date format I wanted to extract in the regex function and it would extract it for me.
Something like this:
SELECT REGEXP_EXTRACT('210118164541-1/18/2021',r"(0[1-9]|1[0-2])(/)(0[1-9]|[12]\d|30|31)(/)(\d{4})")
In this case, the date was 1/18/2021
and the job of the regex function was to extract the date.
The issue I encountered was that the regex_extract
function doesn't work with more than one capturing group. What I mean by capturing group is mainly due to the fact that I could not predict the date format in the data and each part of the date needed a different capturing group. Something like
DD/MM/YY or MM/DD/YY
Unfortunately, the regex extract function only accepted one capturing group.
Citing that Regular expressions passed into extraction functions must not have more than 1 capturing group
Did so some google and Stackoverflow searches before was able to figure out a solution.
The solution was that since REGEXP_EXTRACT
only takes one capturing group you had to make the other groups non-capturing using ?:
. This allows the function to capture more than one group.
The code below shows this in practice.
SELECT REGEXP_EXTRACT('210118164541-1/18/2021',r"(?:0[1-9]|[12]\d|30|31)(?:.)(?:0[1-9]|1[0-2])(?:/)(?:\d{4}|\d{2}")
The code above would perfectly extract the date 1/18/2021
as needed.
In conclusion, if you ever encounter a task that requires you to extract more than one capturing group using regex on BigQuery remember that you have to make the other groups non-capturing using the ?:
notation and you are good to go.
Top comments (0)