Adam.S

Posted on Feb 4, 2021 • Edited on Jun 28, 2021 • Originally published at bas-man.dev

Processing Email - Being Selective

#google #python #api #email

Part 5 in a series of articles on implementing a notification system using Gmail and Line Bot

Today I will be looking at a regular expression, and making a generic function for searching an email for specific content.

In my previous post I referred you to "Working with double-byte regex expressions with Python3".

I will now just quickly go over a generic function I wrote to deal with my use case. I did this because initially I had to write three different functions that all did basically the same thing, the only difference being the number of named groups that each method returned.

To avoid making too many typing mistakes, we should store our regex strings as constants.

# Bus: named groups for the bus name, its destination, and the boarding stop
BUS_DATA = (
    r"^「(?P<busname>[一-龯]\d{1,2})\s(?P<destination>[一-龯]+)行き・"
    r"(?P<stop>[一-龯]+)」"
)

Things to take note of:

[一-龯] works the same way as [a-z], which accepts any single character between a and z. Here the range runs from 一 to 龯. This is not perfect, but covers all the characters that I need to be concerned with in Japanese.

Here we have [一-龯]\d{1,2}. All bus names in Tokyo are made up of one full kanji and a number. So this regular expression says: match one kanji followed by either 1 or 2 digits. Example: 渋11, 渋61

This is then surrounded with parentheses, creating a capture group. The group is also named using the ?P<name> syntax, giving us one complete capture of (?P<busname>[一-龯]\d{1,2}).

Let's break down the other two capture groups.

(?P<destination>[一-龯]+)行き This captures one or more kanji: the final destination of the bus, where it will terminate service. The literal 行き that follows is not part of the group.
(?P<stop>[一-龯]+) This again captures one or more kanji: the name of the bus stop where the bus was boarded.
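
As a quick check of the three groups, the pattern can be run against the bracketed line from a sample notification. This is only a minimal sketch: the sample text is taken from the example email body used later in this post, and the variable names line and match are just for illustration.

```python
import re

BUS_DATA = (
    r"^「(?P<busname>[一-龯]\d{1,2})\s(?P<destination>[一-龯]+)行き・"
    r"(?P<stop>[一-龯]+)」"
)

# Sample line; \u3000 is the full-width (ideographic) space after the bus name.
line = "「渋11\u3000田園調布駅行き・昭和女子大」"

match = re.search(BUS_DATA, line)
print(match.group("busname"))      # 渋11
print(match.group("destination"))  # 田園調布駅
print(match.group("stop"))         # 昭和女子大

# The range only covers kanji, so hiragana such as き is not matched by it.
print(re.fullmatch(r"[一-龯]", "き"))  # None
```

Note that the destination group stops before 行き: the greedy [一-龯]+ first swallows 行 as well, then backtracks one character so that the literal 行き can match.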

Let's take a look at the generic function.

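import re
from typing import Optional


def findMatches(string, regex) -> Optional[dict]:
    # MULTILINE lets ^ match at the start of every line of the email body.
    match = re.search(regex, string, re.UNICODE | re.MULTILINE)
    if match:
        # Copy each named group into a plain dictionary.
        matches = dict()
        for key in match.groupdict():
            matches[key] = match.group(key)
        return matches
    # No matches
    return None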

What is this doing? We are taking in the text of an email as a single multi-line string, along with the regular expression to search it with.

The function call looks like this:

matches = findMatches(body, BUS_DATA)

Where body might be:

'02月03日 18時15分\nNAME_OF_PASMO_USER\n「渋11\u3000田園調布駅行き・昭和女子大」でタッチしました。\n\n東急セキュリティ\n'

First we create a match object using re.search().

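We pass two flags to re.search(). re.MULTILINE makes ^ match at the start of every line rather than only at the start of the string, which matters because the line we want sits in the middle of the email body. re.UNICODE asks for Unicode-aware matching of classes such as \s and \d; in Python 3 this is already the default for str patterns, so the flag is redundant but harmless.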
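
To see what re.MULTILINE changes, here is a small sketch that runs the same search with and without the flag. The body and pattern are shortened versions of the ones in this post, and the names without_flag and with_flag are only illustrative.

```python
import re

# Shortened email body: the line we want is the third one.
body = "02月03日 18時15分\nNAME_OF_PASMO_USER\n「渋11\u3000田園調布駅行き"
# Shortened pattern: only the bus name group.
pattern = r"^「(?P<busname>[一-龯]\d{1,2})"

# Without MULTILINE, ^ only matches at the very start of the string.
without_flag = re.search(pattern, body)
# With MULTILINE, ^ also matches right after each newline.
with_flag = re.search(pattern, body, re.MULTILINE)

print(without_flag)                # None
print(with_flag.group("busname"))  # 渋11
```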

If we find a match using the regex, we are going to create a dictionary object.

We then go over the match object getting the key names (busname, destination, stop). We get the values associated with those keys using match.group(key) and assign these values to their keys in the new dictionary object. We finally return that object.

However, there is a chance there will be no match. In that case we return None, which we can test for later in our code.
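
A sketch of what that later test could look like is shown here. The findMatches copy is condensed so the example runs on its own, and the unrelated string is a made-up email used only to trigger the None case.

```python
import re

BUS_DATA = (
    r"^「(?P<busname>[一-龯]\d{1,2})\s(?P<destination>[一-龯]+)行き・"
    r"(?P<stop>[一-龯]+)」"
)


def findMatches(string, regex):
    # Condensed copy of the function from this post.
    match = re.search(regex, string, re.UNICODE | re.MULTILINE)
    if match:
        return {key: match.group(key) for key in match.groupdict()}
    return None


body = (
    "02月03日 18時15分\nNAME_OF_PASMO_USER\n"
    "「渋11\u3000田園調布駅行き・昭和女子大」でタッチしました。\n\n東急セキュリティ\n"
)
unrelated = "This email is not a bus notification.\n"  # hypothetical input

for text in (body, unrelated):
    matches = findMatches(text, BUS_DATA)
    if matches is None:
        print("No bus data found")
    else:
        print(matches["busname"], matches["destination"], matches["stop"])
```

Comparing with `is None` rather than relying on truthiness keeps the check explicit about the "no match" case.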

This is probably not the best solution. I could, of course, simply return the match object without creating a dictionary.
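
One possible middle ground, sketched here as an assumption rather than what the series actually uses, is to let the match object build the dictionary itself with groupdict(). The name find_matches_short and the tiny test pattern are hypothetical. re.UNICODE is left out because it is already the default for str patterns in Python 3.

```python
import re
from typing import Optional


def find_matches_short(string: str, regex: str) -> Optional[dict]:
    """Hypothetical shorter variant: groupdict() already returns a dict."""
    match = re.search(regex, string, re.MULTILINE)
    return match.groupdict() if match else None


# Small illustrative pattern with two named groups.
print(find_matches_short("渋11", r"(?P<kanji>[一-龯])(?P<number>\d{1,2})"))
# {'kanji': '渋', 'number': '11'}
```

The caller still gets a plain dictionary or None, so code that tests for None does not need to change.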