The exercise below comes from Quest Of Python - a little side project of mine where I share Python challenges and exercises with exemplary solutions. If you've enjoyed it and would like to practice more, go check out the Quest Of Python website.
Introduction
You've come up with a great side-project idea - let's analyze information about the IT job market in Poland!
While browsing the JustJoin.it job board, you noticed that all job offers are served as JSON through an HTTP API. Since it's publicly available, you decided to create a small application to fetch & store this data.
You decide to go with the AWS cloud to host your app.
Your workload is a short Lambda function written in Python which fetches the data from the job offers API endpoint and persists the JSON into an S3 bucket. It runs on a daily schedule (through an AWS EventBridge trigger). Each successful run of the function creates a new object in S3, following the `s3://some-s3-bucket-name/justjoinit-data/<year>/<month>/<day>.json` naming convention.
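The fetch-and-store Lambda might look roughly like the sketch below. The endpoint URL, bucket name and handler code are illustrative assumptions - the project's real code isn't shown in this post:

```python
import urllib.request
from datetime import date


def object_key(day: date) -> str:
    # Follows the s3://<bucket>/justjoinit-data/<year>/<month>/<day>.json convention.
    return f"justjoinit-data/{day.year}/{day.month:02}/{day.day:02}.json"


def handler(event, context):
    # boto3 is available out of the box in the AWS Lambda Python runtime.
    import boto3

    # Hypothetical endpoint - the real JustJoin.it offers URL isn't part of this text.
    with urllib.request.urlopen("https://example.com/api/offers", timeout=30) as response:
        body = response.read()

    boto3.client("s3").put_object(
        Bucket="some-s3-bucket-name",
        Key=object_key(date.today()),
        Body=body,
    )
```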
You quickly test it and everything seems fine. Then you deploy the resources to your AWS account and forget about the whole thing for a long time.
Recently you decided to revive this project and try to extract something meaningful from the data. You quickly realize there are gaps in it (some days are missing). It turns out you were so confident about your code that you did not include any retry logic in case of an HTTP request failure. Shame on you!
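A small retry wrapper along these lines would have prevented most of those gaps. The function name, attempt count and backoff factor are illustrative assumptions - in practice a library such as tenacity or urllib3's built-in `Retry` would work just as well:

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url: str, attempts: int = 3, backoff: float = 2.0) -> bytes:
    """Retry a flaky HTTP GET a few times before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                return response.read()
        except urllib.error.URLError:
            if attempt == attempts:
                raise  # out of attempts - surface the failure to the caller
            time.sleep(backoff ** attempt)  # simple exponential backoff
```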
Your task
Clone the challenges blueprints repository and navigate to the `0005_justjoinit_data_finding_the_gaps` directory.
It contains a directory called `justjoinit_data`, which mimics the structure of the original S3 bucket with the raw data - each year of data is a separate directory containing subdirectories for months, and each month directory contains multiple JSON files, each representing a single day of data.
Here's the output of the `tree` command on this directory:
```
justjoinit_data
├── 2021
│   ├── 10
│   │   ├── 23.json
│   │   ├── 24.json
│   │   ├── 25.json
│   │   ├── 26.json
│   │   ├── 27.json
│   │   ├── 28.json
│   │   ├── 29.json
│   │   ├── 30.json
│   │   └── 31.json
│   ├── 11
│   │   ├── 01.json
│   │   ├── 02.json
│   │   ├── 03.json
...
```
Your task is to find out which dates (JSON files) are missing from the `justjoinit_data` directory (these would be the days when our small AWS job failed for some reason). Put your logic into the `find_missing_dates` function (inside the `missing_dates.py` file). Missing dates should be returned as a string of dates joined by a comma and a space character. If `2021-01-01`, `2021-03-05` and `2022-05-10` were the missing dates, the result string would look like the following:

`"2021-01-01, 2021-03-05, 2022-05-10"`

You can assume that directories will always be named after a valid month (`1 <= month <= 12`) or day (`1 <= day <= 31`), and that days within specific months are correct (for example, there are no dates like February 31st).
You can use the test from the `test_missing_dates.py` file to check whether your solution is correct. Run the command below (while in the `0005_justjoinit_data_finding_the_gaps` directory) to run the test suite:

`python -m unittest`

or

`python test_missing_dates.py`
P.S. I plan to share this JustJoin.it job offers dataset publicly (probably on Kaggle). Once this is done, I'll update this page and provide the link to the dataset.
Exemplary solution
Note: you'll find the detailed explanation of the solution below the code snippet.
```python
import pathlib
from datetime import date, timedelta


def find_missing_dates(input_directory: pathlib.Path):
    dates_from_disk = set()
    for file in input_directory.glob("**/*.json"):
        *_, year, month, day = file.parts
        dates_from_disk.add(
            date(
                year=int(year),
                month=int(month),
                day=int(day.replace(".json", "")),
            )
        )

    start_date = min(dates_from_disk)
    end_date = max(dates_from_disk)
    difference_in_days = (end_date - start_date).days
    expected_dates = {start_date + timedelta(days=i) for i in range(difference_in_days + 1)}

    missing_dates = expected_dates - dates_from_disk

    return ", ".join(x.strftime("%Y-%m-%d") for x in sorted(missing_dates))
```
Our solution for this challenge leverages sets and the operations they provide (set difference). The steps we'll take:

- create a set of all dates existing within the `justjoinit_data` directory (the `dates_from_disk` set)
- calculate the earliest and latest dates from the `dates_from_disk` set
- create a set of expected dates, `expected_dates`, containing all dates in the range between the earliest and latest dates calculated in the previous step
- calculate the difference between `expected_dates` and `dates_from_disk` (dates existing in `expected_dates` but missing from `dates_from_disk`)
- sort the dates chronologically and transform them to conform to the expected string format
We start by defining an empty set, `dates_from_disk`.

The glob pattern `**/*.json` allows us to iterate over all files with the `.json` extension (`**` means traversing the `justjoinit_data` directory and all its subdirectories recursively).
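To see the recursive glob in action, here's a tiny throwaway tree (built just for illustration) scanned with the same pattern:

```python
import pathlib
import tempfile

# Build a miniature stand-in for the real data directory.
root = pathlib.Path(tempfile.mkdtemp()) / "justjoinit_data"
for rel in ("2021/10/23.json", "2021/11/01.json"):
    path = root / rel
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()

# The recursive pattern finds every .json file, however deeply nested.
found = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.json"))
print(found)  # ['2021/10/23.json', '2021/11/01.json']
```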
To extract the year, month and day info, we leverage the path's `.parts` attribute - a tuple containing the individual components of the path:

```python
>>> pathlib.Path("/some/path/justjoinit_data/2022/10/01.json").parts
('/', 'some', 'path', 'justjoinit_data', '2022', '10', '01.json')
```
Tuple unpacking lets us conveniently capture the year, month and day variables in a single line. Every part that comes before the year is captured in the `_` variable (it's the Pythonic way of saying that you don't care about something). We also combine it with an asterisk (`*`), which means that the `_` variable can hold multiple elements.

```python
*_, year, month, day = file.parts

>>> year
'2022'
>>> month
'10'
>>> day
'01.json'
>>> _
['/', 'some', 'path', 'justjoinit_data']
```
After a small cleanup (removing the `.json` suffix from `day` and converting `day`, `month` and `year` to integers), we're able to construct a valid `date` object and add it to the `dates_from_disk` set:

```python
dates_from_disk.add(
    date(
        year=int(year),
        month=int(month),
        day=int(day.replace(".json", "")),
    )
)
```
After the for loop is done, `dates_from_disk` contains all the dates existing in the `justjoinit_data` directory.

We use the built-in `min` and `max` functions to find the earliest and latest dates. From these we calculate a helper variable called `difference_in_days`, which is then used for generating the range of expected dates between `start_date` and `end_date`:

```python
expected_dates = {start_date + timedelta(days=i) for i in range(difference_in_days + 1)}
```
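As a quick sanity check, here's that comprehension applied to a three-day range (the dates are made up for the example):

```python
from datetime import date, timedelta

start_date = date(2021, 10, 30)
end_date = date(2021, 11, 1)
difference_in_days = (end_date - start_date).days  # 2

# One date per day, inclusive of both endpoints.
expected_dates = {start_date + timedelta(days=i) for i in range(difference_in_days + 1)}
print(sorted(expected_dates))
# [datetime.date(2021, 10, 30), datetime.date(2021, 10, 31), datetime.date(2021, 11, 1)]
```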
To find the missing dates within the `justjoinit_data` directory, we simply calculate the difference between `expected_dates` and `dates_from_disk`:

```python
missing_dates = expected_dates - dates_from_disk
```
The last thing we do is sort the dates (`sorted(missing_dates)`), transform them into strings with the `.strftime("%Y-%m-%d")` method and join them with the `", "` string (so the result matches the expected format from the task description).
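Applying that final step to the three example dates from the task description gives exactly the expected output:

```python
from datetime import date

# Sets are unordered, so sorting is what guarantees chronological output.
missing_dates = {date(2022, 5, 10), date(2021, 1, 1), date(2021, 3, 5)}
result = ", ".join(x.strftime("%Y-%m-%d") for x in sorted(missing_dates))
print(result)  # 2021-01-01, 2021-03-05, 2022-05-10
```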
Summary
I hope you enjoyed this little exercise. I encourage you to check out Quest Of Python for more :-)!