The exercise below comes from Quest Of Python - a little side project of mine where I share Python challenges and exercises with exemplary solutions. If you've enjoyed it and would like to practice more, go check out the Quest Of Python website.
Introduction
You've come up with a great side-project idea - let's analyze information about the IT job market in Poland!
While browsing the JustJoin.it job board, you noticed that all job offers are served as JSON through an HTTP API. Since it's publicly available, you decided to create a small application to fetch & store this data.
You decide to go with the AWS cloud to host your app.
Your workload is a short Lambda function written in Python which fetches the data from the job offers API endpoint and persists the JSON into an S3 bucket. It runs on a daily schedule (through an AWS EventBridge trigger). Each successful run of the function creates a new object in S3, following the `s3://some-s3-bucket-name/justjoinit-data/<year>/<month>/<day>.json` naming convention.
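The fetch-and-store Lambda might look roughly like the sketch below. The endpoint URL, bucket name and handler code are illustrative assumptions - the project's real code isn't shown in this post:

```python
import urllib.request
from datetime import date


def object_key(day: date) -> str:
    # Follows the s3://<bucket>/justjoinit-data/<year>/<month>/<day>.json convention.
    return f"justjoinit-data/{day.year}/{day.month:02}/{day.day:02}.json"


def handler(event, context):
    # boto3 is available out of the box in the AWS Lambda Python runtime.
    import boto3

    # Hypothetical endpoint - the real JustJoin.it offers URL isn't part of this text.
    with urllib.request.urlopen("https://example.com/api/offers", timeout=30) as response:
        body = response.read()

    boto3.client("s3").put_object(
        Bucket="some-s3-bucket-name",
        Key=object_key(date.today()),
        Body=body,
    )
```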
You quickly test it and everything seems fine. Then you deploy the resources to your AWS account and forget about the whole thing for a long time.
Recently you decided to revive this project and try to extract something meaningful from the data. You quickly realize there are gaps in it (some days are missing). It turns out you were so confident about your code that you did not include any retry logic in case of an HTTP request failure. Shame on you!
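A small retry wrapper along these lines would have prevented most of those gaps. The function name, attempt count and backoff factor are illustrative assumptions - in practice a library such as tenacity or urllib3's built-in `Retry` would work just as well:

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url: str, attempts: int = 3, backoff: float = 2.0) -> bytes:
    """Retry a flaky HTTP GET a few times before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                return response.read()
        except urllib.error.URLError:
            if attempt == attempts:
                raise  # out of attempts - surface the failure to the caller
            time.sleep(backoff ** attempt)  # simple exponential backoff
```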
Your task
Clone the challenges blueprints repository and navigate to the `0005_justjoinit_data_finding_the_gaps` directory.
It contains a directory called `justjoinit_data`, which mimics the structure of the original S3 bucket with the raw data - each year of data is a separate directory containing subdirectories for months, and each month directory contains multiple JSON files, each representing a single day of data.
Here's the output of the `tree` command on this directory:
```
justjoinit_data
├── 2021
│   ├── 10
│   │   ├── 23.json
│   │   ├── 24.json
│   │   ├── 25.json
│   │   ├── 26.json
│   │   ├── 27.json
│   │   ├── 28.json
│   │   ├── 29.json
│   │   ├── 30.json
│   │   └── 31.json
│   ├── 11
│   │   ├── 01.json
│   │   ├── 02.json
│   │   ├── 03.json
...
```
Your task is to find out which dates (JSON files) are missing from the `justjoinit_data` directory (these would be the days when our small AWS job failed for some reason). Put your logic into the `find_missing_dates` function (inside the `missing_dates.py` file). Missing dates should be returned as a string of dates joined by a comma and a space character. If `2021-01-01`, `2021-03-05` and `2022-05-10` were the missing dates, the result string would look like the following:

`"2021-01-01, 2021-03-05, 2022-05-10"`

You can assume that directories will always be named after a valid month (`1 <= month <= 12`) or day (`1 <= day <= 31`), and that days within specific months are correct (for example, there are no dates like February 31st).
You can use the test from the `test_missing_dates.py` file to check whether your solution is correct. Run the command below (while in the `0005_justjoinit_data_finding_the_gaps` directory) to run the test suite:

`python -m unittest`

or

`python test_missing_dates.py`
P.S. I plan to share this JustJoin.it job offers dataset publicly (probably on Kaggle). Once this is done, I'll update this page and provide the link to the dataset.
Exemplary solution
Note: you'll find the detailed explanation of the solution below the code snippet.
```python
import pathlib
from datetime import date, timedelta


def find_missing_dates(input_directory: pathlib.Path):
    dates_from_disk = set()
    for file in input_directory.glob("**/*.json"):
        *_, year, month, day = file.parts
        dates_from_disk.add(
            date(
                year=int(year),
                month=int(month),
                day=int(day.replace(".json", "")),
            )
        )

    start_date = min(dates_from_disk)
    end_date = max(dates_from_disk)
    difference_in_days = (end_date - start_date).days
    expected_dates = {start_date + timedelta(days=i) for i in range(difference_in_days + 1)}

    missing_dates = expected_dates - dates_from_disk

    return ", ".join(x.strftime("%Y-%m-%d") for x in sorted(missing_dates))
```
Our solution for this challenge leverages sets and the operations they provide (set difference). The steps we'll take:

- create a set of all dates existing within the `justjoinit_data` directory (the `dates_from_disk` set)
- calculate the earliest and latest dates from the `dates_from_disk` set
- create a set of expected dates, `expected_dates`, containing all dates in the range between the earliest and latest dates calculated in the previous step
- calculate the difference between `expected_dates` and `dates_from_disk` (dates existing in `expected_dates` but missing from `dates_from_disk`)
- sort the dates chronologically and transform them to conform to the expected string format
We start by defining an empty set, `dates_from_disk`.

The glob pattern `**/*.json` allows us to iterate over all files with the `.json` extension (`**` means traversing the `justjoinit_data` directory and all its subdirectories recursively).
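To see the recursive glob in action, here's a tiny throwaway tree (built just for illustration) scanned with the same pattern:

```python
import pathlib
import tempfile

# Build a miniature stand-in for the real data directory.
root = pathlib.Path(tempfile.mkdtemp()) / "justjoinit_data"
for rel in ("2021/10/23.json", "2021/11/01.json"):
    path = root / rel
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()

# The recursive pattern finds every .json file, however deeply nested.
found = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.json"))
print(found)  # ['2021/10/23.json', '2021/11/01.json']
```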
To extract the year, month and day info, we leverage the path's `.parts` attribute - a tuple containing the individual components of the path:

```python
>>> pathlib.Path("/some/path/justjoinit_data/2022/10/01.json").parts
('/', 'some', 'path', 'justjoinit_data', '2022', '10', '01.json')
```
Tuple unpacking lets us conveniently capture the year, month and day variables in a single line. Every part that comes before the year is captured in the `_` variable (it's the Pythonic way of saying that you don't care about something). We also combine it with an asterisk (`*`), which means that the `_` variable can hold multiple elements.

```python
*_, year, month, day = file.parts

>>> year
'2022'
>>> month
'10'
>>> day
'01.json'
>>> _
['/', 'some', 'path', 'justjoinit_data']
```
After a small cleanup (removing the `.json` suffix from `day` and converting `day`, `month` and `year` to integers), we're able to construct a valid `date` object and add it to the `dates_from_disk` set:

```python
dates_from_disk.add(
    date(
        year=int(year),
        month=int(month),
        day=int(day.replace(".json", "")),
    )
)
```
After the for loop is done, `dates_from_disk` contains all the dates existing in the `justjoinit_data` directory.

We use the built-in `min` and `max` functions to find the earliest and latest dates. From these we calculate a helper variable called `difference_in_days`, which is then used for generating the range of expected dates between `start_date` and `end_date`:

```python
expected_dates = {start_date + timedelta(days=i) for i in range(difference_in_days + 1)}
```
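As a quick sanity check, here's that comprehension applied to a three-day range (the dates are made up for the example):

```python
from datetime import date, timedelta

start_date = date(2021, 10, 30)
end_date = date(2021, 11, 1)
difference_in_days = (end_date - start_date).days  # 2

# One date per day, inclusive of both endpoints.
expected_dates = {start_date + timedelta(days=i) for i in range(difference_in_days + 1)}
print(sorted(expected_dates))
# [datetime.date(2021, 10, 30), datetime.date(2021, 10, 31), datetime.date(2021, 11, 1)]
```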
To find the missing dates within the `justjoinit_data` directory, we simply calculate the difference between `expected_dates` and `dates_from_disk`:

```python
missing_dates = expected_dates - dates_from_disk
```
The last thing we do is sort the dates (`sorted(missing_dates)`), transform them into strings with the `.strftime("%Y-%m-%d")` method and join them with the `", "` string (so the result matches the expected format from the task description).
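Applying that final step to the three example dates from the task description gives exactly the expected output:

```python
from datetime import date

# Sets are unordered, so sorting is what guarantees chronological output.
missing_dates = {date(2022, 5, 10), date(2021, 1, 1), date(2021, 3, 5)}
result = ", ".join(x.strftime("%Y-%m-%d") for x in sorted(missing_dates))
print(result)  # 2021-01-01, 2021-03-05, 2022-05-10
```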
Summary
I hope you enjoyed this little exercise. I encourage you to check out Quest Of Python for more :-)!