LGTM Devlog 11: Writing the Serverless Function for receiving GitHub webhooks with Pydantic validation

#devjournal #webdev #github #python

Now that we know what GitHub's fork webhooks look like, it's time to write the function for receiving them. You can see the code matching this post at this commit: 40ac1367

Data validation with Pydantic

First, validating the data and extracting it. I need to pull out four or five pieces of data from the JSON being sent. While this is a relatively simple case of reading various nested keys from the JSON, there are problems with that: what if some of the keys don't exist? We need to throw an error, ideally one that explains what the problem is. Suddenly, pulling out four values becomes having to write four if statements and four error messages.
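
To make that concrete, here is a rough sketch of what the manual approach might look like using only the standard library. The function name extract_fork_fields and the trimmed example payload are hypothetical, made up for illustration, and it only checks two of the fields per repo:

```python
import json


def extract_fork_fields(raw: bytes) -> dict:
    """Manually pull a few fields out of a fork webhook payload.

    Raises ValueError with a descriptive message when a key is missing
    or has the wrong type.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("payload is not a JSON object")

    result = {}
    for repo_key in ("forkee", "repository"):
        repo = data.get(repo_key)
        if not isinstance(repo, dict):
            raise ValueError(f"'{repo_key}' is missing or not an object")
        full_name = repo.get("full_name")
        if not isinstance(full_name, str):
            raise ValueError(f"'{repo_key}.full_name' is missing or not a string")
        owner = repo.get("owner")
        if not isinstance(owner, dict):
            raise ValueError(f"'{repo_key}.owner' is missing or not an object")
        login = owner.get("login")
        if not isinstance(login, str):
            raise ValueError(f"'{repo_key}.owner.login' is missing or not a string")
        result[repo_key] = {"full_name": full_name, "owner_login": login}
    return result


# Hypothetical, heavily trimmed payload; real GitHub payloads carry many more keys.
example = json.dumps(
    {
        "forkee": {"full_name": "someuser/lgtm", "owner": {"login": "someuser"}},
        "repository": {"full_name": "meseta/lgtm", "owner": {"login": "meseta"}},
    }
).encode("utf-8")

fields = extract_fork_fields(example)
```

Every extra field adds another lookup, another type check, and another error message, which is the boilerplate a validation library takes off your hands.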

While in this particular instance we don't expect the GitHub API to omit values, it's still good to do things rigorously. One of the best options for this validation problem is to use a module like Pydantic. Not only does it do this validation for you, it also generates meaningful errors and works with static type annotations, and there's a whole bunch of other very wholesome reasons why you'd use Pydantic over manually decoding JSON.

In fact, it's so useful that Python API frameworks like FastAPI (which is built on Starlette) have built-in support for Pydantic models for defining what API endpoints require as payload, or output. Because we're running on Google's serverless functions, we don't have that built-in support for Pydantic. We can still make use of it, though; we just have to run the validation manually.

So first, we define the data model(s). I noticed that the hook JSON payload has two members: forkee and repository. They both describe a repo. And inside each repo is an owner that contains the details of the person who owns the repo. So I decided to build my data models like this:

(This is the entire contents of the app/utils/models.py file)

""" Models for validation of Github hooks """
from pydantic import BaseModel, Field, Extra  # pylint: disable=no-name-in-module

# pylint: disable=too-few-public-methods,missing-class-docstring
class GitHubUser(BaseModel):
    login: str = Field(..., title="User's user login name")
    id: int = Field(..., title="User's ID")

    class Config:
        extra = Extra.ignore


class GitHubRepository(BaseModel):
    id: int = Field(..., title="Repo's ID")
    full_name: str = Field(..., title="Repo's full name (including owner)")
    owner: GitHubUser = Field(..., title="Owner of repo")
    url: str = Field(..., title="API URL of the repo")

    class Config:
        extra = Extra.ignore


class GitHubHookFork(BaseModel):
    forkee: GitHubRepository = Field(..., title="The fork created")
    repository: GitHubRepository = Field(..., title="The repository being forked")

    class Config:
        extra = Extra.ignore

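The extra config tells Pydantic not to error out when it receives more data than the model defines, since I don't want to fully define the GitHub object, just the fields I want to access later. (Ignoring extra fields is also Pydantic's default behaviour, so this mostly makes the intent explicit.)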

Now in my main function, I can tell it to parse the raw data from the request and return this object:

    try:
        hook_fork = GitHubHookFork.parse_raw(request.data)
    except ValidationError as err:
        logger.err("Validation error", err=err)
        return abort(400, "Validation error")

This hook_fork object has the properties we defined. For example, we can check the username of the person who forked our repo using hook_fork.forkee.owner.login. And if validation fails, we'll get an exception describing which part of the data didn't match.

Signature validation

We discovered in the last post that GitHub can secure webhooks using a signature that is calculated from the payload and a pre-shared secret value that you tell it to use. I found an example of this implementation on Google, but updated it to use the newer SHA-256 signature, which GitHub wants you to use. I'm going to store the pre-shared secret in an environment variable called SECRET (may rename later), so the code looks like this (contents of app/utils/verify.py):

""" Verify the GitHub webhook secret """
import os
import hmac
import hashlib

from flask import Request

SECRET = bytes(os.environ["SECRET"], "utf-8")


def verify_signature(request: Request) -> bool:
    """ Validates the github webhook secret. Will return false if secret not provided """
    expected_signature = hmac.new(
        key=SECRET, msg=request.data, digestmod=hashlib.sha256
    ).hexdigest()
    incoming_signature = request.headers.get("X-Hub-Signature-256", "").removeprefix(
        "sha256="
    )
    return hmac.compare_digest(incoming_signature, expected_signature)

A couple of things to note here: the new Python 3.9 .removeprefix() method for strings, and using hmac.compare_digest() to compare the two digests. This gives the same result as ==, but its running time doesn't depend on where the first mismatching character is, which prevents timing analysis and makes it a bit more secure (though in our case that is unlikely to be a major consideration, it's still best practice).
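
To see the signature scheme in isolation, here is a small standalone sketch that doesn't need Flask. The sign and is_valid helpers, the secret, and the body are all placeholders invented for this example. Unlike the function above, it compares the whole header value including the sha256= prefix, which is one way to sidestep .removeprefix() on Python versions older than 3.9:

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Build a header value in the 'sha256=<hexdigest>' form."""
    digest = hmac.new(key=secret, msg=body, digestmod=hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def is_valid(secret: bytes, body: bytes, header_value: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    return hmac.compare_digest(sign(secret, body), header_value)


secret = b"example-secret"  # placeholder, not a real secret
body = b'{"zen": "hello"}'  # placeholder request body
header = sign(secret, body)

print(is_valid(secret, body, header))         # True: body and signature match
print(is_valid(secret, body + b" ", header))  # False: body changed by one byte
print(is_valid(secret, body, ""))             # False: header missing
```

Changing even one byte of the body changes the digest, which is why the raw request bytes have to be used for the check rather than re-serialized JSON.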

Main function

So finally, the main function that we will deploy looks like this (full contents of app/main.py):

""" Listens to webhooks from GitHub """

import structlog  # type: ignore
from flask import Request, abort
from pydantic import ValidationError

from utils.verify import verify_signature
from utils.models import GitHubHookFork

logger = structlog.get_logger()

OUR_REPO = "meseta/lgtm"


def github_webhook_listener(request: Request):
    """ A listener for github webhooks """

    # verify
    if not verify_signature(request):
        return abort(403, "Invalid signature")

    # decode
    try:
        hook_fork = GitHubHookFork.parse_raw(request.data)
    except ValidationError as err:
        logger.err("Validation error", err=err)
        return abort(400, "Validation error")

    # output
    if hook_fork.repository.full_name == OUR_REPO:
        logger.info("Got fork", data=hook_fork.dict())

    return "OK"

Note that at this point, I'm not actually doing anything with the data, just decoding it and logging it. Next sprint's task will be to actually take this data and use it to create accounts and whatever.

Tests

To ensure we have full test coverage, I wrote a couple of tests, as well as downloaded the payload and hash for a valid hook, and fabricated an invalid one. The tests/conftest.py file now looks like this, with fixtures to provide the two test payloads as well as the test client:

""" Setup for tests """

import os
import json
import pytest
from functions_framework import create_app  # type: ignore

TEST_FILES = os.path.join(
    os.path.dirname(os.path.realpath(__file__)),
    "test_files",
)


class Payload:  # pylint: disable=too-few-public-methods
    """ Container for holding header/payload pairs during testing"""

    def __init__(self, header_path, payload_path):
        with open(os.path.join(TEST_FILES, header_path)) as fp:
            self.headers = json.load(fp)
        with open(os.path.join(TEST_FILES, payload_path), "rb") as fp:
            self.payload = fp.read()


@pytest.fixture()
def good_fork():
    """ A payload containing (raw) data, this is recorded from GitHub """
    return Payload("good_fork_headers.json", "good_fork.bin")


@pytest.fixture()
def bad_fork():
    """A payload containing (raw) data, that has been edited to be missing stuff
    but paired with a valid signature"""
    return Payload("bad_fork_headers.json", "bad_fork.bin")


@pytest.fixture(scope="package")
def client():
    """ Test client """
    return create_app(
        "github_webhook_listener", os.environ["FUNCTION_SOURCE"]
    ).test_client()
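
For anyone wondering how an edited payload can still carry a valid signature, one possible way is to re-sign the edited bytes with the same secret the tests use. The sketch below is an assumption about how such fixture files could be produced, not a record of how these ones were made. The write_fixture helper, the trimmed payload, and the test secret are all hypothetical:

```python
import hashlib
import hmac
import json
import os
import tempfile


def write_fixture(directory: str, name: str, payload: dict, secret: bytes) -> None:
    """Write <name>.bin and <name>_headers.json so the signature matches the body."""
    body = json.dumps(payload).encode("utf-8")
    digest = hmac.new(key=secret, msg=body, digestmod=hashlib.sha256).hexdigest()
    headers = {
        "Content-Type": "application/json",
        "X-Hub-Signature-256": f"sha256={digest}",
    }
    with open(os.path.join(directory, f"{name}.bin"), "wb") as fp:
        fp.write(body)
    with open(os.path.join(directory, f"{name}_headers.json"), "w") as fp:
        json.dump(headers, fp)


# Hypothetical trimmed payload with "forkee" deliberately left out,
# so that model validation fails while the signature check passes.
broken_payload = {"repository": {"id": 1, "full_name": "meseta/lgtm"}}

with tempfile.TemporaryDirectory() as tmp:
    write_fixture(tmp, "bad_fork", broken_payload, secret=b"test-secret")
    created = sorted(os.listdir(tmp))
```

The payload is written in binary mode so the bytes on disk are exactly the bytes that were signed, matching how the Payload class reads them back with "rb".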

And to test our model (contents of tests/test_model.py), we simply ask it to parse the good and the bad payloads to see if it'll work or raise the appropriate exceptions:

""" Tests for pydantic models """

import pytest
from pydantic import ValidationError
from app.utils.models import GitHubHookFork

# pylint: disable=redefined-outer-name
def test_model(good_fork):
    """ Test that our validator works """

    hook_fork = GitHubHookFork.parse_raw(good_fork.payload)
    assert hook_fork.forkee.owner.login


def test_model_invalid(bad_fork):
    """ Test that our validator fails correctly """

    with pytest.raises(ValidationError):
        GitHubHookFork.parse_raw(bad_fork.payload)

And to test our main function, there are four tests that cover various combinations of valid/invalid/missing signatures and valid/invalid payloads (contents of tests/test_main.py):

""" Tests for main.py """

# pylint: disable=redefined-outer-name
def test_good_fork(client, good_fork):
    """ For a good fork that's working fine """

    res = client.post("/", headers=good_fork.headers, data=good_fork.payload)
    assert res.status_code == 200


def test_model_validation_fail(client, bad_fork):
    """ Test model validation failure with bad payload but correct signature for it"""

    res = client.post("/", headers=bad_fork.headers, data=bad_fork.payload)
    assert res.status_code == 400


def test_signature_fail(client, good_fork, bad_fork):
    """ Test signature validation failure with good payload but incorrect signature for it"""

    res = client.post("/", headers=bad_fork.headers, data=good_fork.payload)
    assert res.status_code == 403


def test_no_signature(client, good_fork):
    """ When signatures are not supplied """

    res = client.post(
        "/", headers={"Content-Type": "application/json"}, data=good_fork.payload
    )
    assert res.status_code == 403

Our final test results show all tests passing with 100% code coverage. That doesn't prove our code is bug-free, but it gives us a little confidence, and helps us make sure there are no regressions when we edit the code later.

Running a real test

To run a real test, I fired up the ngrok endpoint I set up previously (by running ngrok http 5000), started the test server using pipenv run serve github_webhook_listener, updated the GitHub webhook URL to the new temporary ngrok path, and told GitHub to re-send the previous webhook.

The server logs show that a webhook is received, and you can read the decoded variables that are now in the hook_fork object!

For those interested, all of this is running inside VSCode's Terminal pane, opened into WSL, and inside a tmux session. It's a nice way to run code while developing.

This is all that's needed for a basic serverless function that receives GitHub webhooks, built in a reasonably robust way. Next we need to upload it, but I will set up CI to do this instead of me having to upload it every time.