DEV Community

Oluwabukunmi Ige
Oluwabukunmi Ige

Posted on

How To Capture More than Group When Using BigQuery Regex

I have been working on a particular task that required me to extract dates from a pool of text that was stored Big query. Trust me when I say that the data was quite messy and every date format you can imagine was in attendance.

Hence I got to work. Went through the big query docs and the function that was more suitable for my task was REGEXP_EXTRACT. How it was supposed to work was that I would specify the date format I wanted to extract in the regex function and it would extract it for me.
Something like this:

SELECT REGEXP_EXTRACT('210118164541-1/18/2021',r"(0[1-9]|1[0-2])(/)(0[1-9]|[12]\d|30|31)(/)(\d{4})")

In this case, the date was 1/18/2021 and the job of the regex function was to extract the date.

The issue I encountered was that the regex_extract function doesn't work with more than one capturing group. What I mean by capturing group is mainly due to the fact that I could not predict the date format in the data and each part of the date needed a different capturing group. Something like
DD/MM/YY or MM/DD/YY

Unfortunately, the regex extract function only accepted one capturing group.

Citing that Regular expressions passed into extraction functions must not have more than 1 capturing group

Did so some google and Stackoverflow searches before was able to figure out a solution.

The solution was that since REGEXP_EXTRACT only takes one capturing group you had to make the other groups non-capturing using ?:. This allows the function to capture more than one group.

The code below shows this in practice.

SELECT REGEXP_EXTRACT('210118164541-1/18/2021',r"(?:0[1-9]|[12]\d|30|31)(?:.)(?:0[1-9]|1[0-2])(?:/)(?:\d{4}|\d{2}")

The code above would perfectly extract the date 1/18/2021 as needed.

In conclusion, if you ever encounter a task that requires you to extract more than one capturing group using regex on BigQuery remember that you have to make the other groups non-capturing using the ?: notation and you are good to go.

Discussion (0)