The Starting Point:
First, I wrote a basic script for the project. It uses requests to fetch data from an open API, cleans each record and streams the cleaned rows through a generator function, and saves them to a database using sqlite3. I also used argparse so that the URL and database name can be passed as arguments when running the script.
Here is the code for main.py:
import requests
import sqlite3
import argparse
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")

def fetch_data(url):
    """Fetch JSON from the API; return an empty list on any network or parse error."""
    try:
        r = requests.get(url, timeout=10)  # timeout so the CLI never hangs forever
        r.raise_for_status()
        return r.json()
    except (requests.RequestException, ValueError) as e:
        # RequestException covers connection/HTTP errors; ValueError covers bad JSON
        logging.error(f"Network error: {e}")
        return []

def clean_data(item):
    """Normalize one record into a (userId, id, title, body) tuple, or None if unparseable."""
    try:
        user_id = int(item.get("userId", 0))
        pid = int(item.get("id", 0))
        title = item.get("title", "").strip()
        body = item.get("body", "").strip()
        return (user_id, pid, title, body)
    except (TypeError, ValueError):
        return None

def data_generator(data):
    """Yield cleaned rows one at a time, skipping records that fail cleaning."""
    for item in data:
        row = clean_data(item)
        if row is not None:
            yield row

def create_table(conn):
    with conn:
        conn.execute('''
            CREATE TABLE IF NOT EXISTS posts(
                userId INTEGER,
                id INTEGER PRIMARY KEY,
                title TEXT,
                body TEXT
            )
        ''')

def store_data(conn, data_gen):
    # executemany iterates the generator, so rows are never all held in memory
    with conn:
        conn.executemany('INSERT OR REPLACE INTO posts(userId, id, title, body) VALUES(?,?,?,?)', data_gen)

def main():
    parser = argparse.ArgumentParser(description="A CLI tool to fetch data from an API and store it in a SQLite database")
    parser.add_argument("--url", default="https://jsonplaceholder.typicode.com/posts", help="API URL to fetch data from.")
    parser.add_argument("--db", default="posts.db", help="Path to the database file.")
    args = parser.parse_args()

    data = fetch_data(args.url)
    if data:
        conn = sqlite3.connect(args.db)
        try:
            create_table(conn)
            store_data(conn, data_generator(data))
            logging.info(f"Data stored to {args.db}")
        finally:
            conn.close()

if __name__ == "__main__":
    main()
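With argparse wired in, the script can be run with its defaults or with the URL and database path overridden on the command line, for example:

python main.py --url https://jsonplaceholder.typicode.com/posts --db posts.db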
Introducing Testing:
I then created my first tests using pytest, exercising the clean_data function with valid, invalid, and incomplete input to make sure it is robust. Luckily, my function passed all the tests.
Here is the code for test_main.py:
from src.etl_cli.main import clean_data

def test_cleandata_valid():
    # A well-formed record should pass through unchanged
    item = {"userId": 1, "id": 10, "title": "Some title", "body": "some body"}
    expected_output = (1, 10, "Some title", "some body")
    assert clean_data(item) == expected_output

def test_cleandata_invalid():
    # A userId that cannot be parsed as an int should be rejected with None
    item = {"userId": "Some userID", "id": 10, "title": "Some title", "body": "some body"}
    assert clean_data(item) is None

def test_cleandata_missing():
    # Missing fields should fall back to the defaults defined in clean_data
    item = {"title": "Some title"}
    expected_output = (0, 0, "Some title", "")
    assert clean_data(item) == expected_output
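The suite can then be run from the project root (assuming the package layout matches the src.etl_cli import above, e.g. after a poetry install):

pytest -v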
Packaging with Poetry:
I used Poetry to package the project so that it could be distributed and installed properly. Managing dependencies and packaging with Poetry was a straightforward and clean process.
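My exact configuration isn't reproduced here, but a minimal pyproject.toml sketch for this kind of src layout might look like the following; the package name etl-cli, the version numbers, and the dependency pins are assumptions inferred from the import path in test_main.py:

# pyproject.toml (sketch; names and version pins are assumptions)
[tool.poetry]
name = "etl-cli"
version = "0.1.0"
description = "A CLI tool to fetch data from an API and store it in SQLite"
authors = ["Nilesh Sharma"]
packages = [{ include = "etl_cli", from = "src" }]

[tool.poetry.dependencies]
python = "^3.10"
# requests is the only runtime dependency; sqlite3, argparse and logging are stdlib
requests = "^2.31"

[tool.poetry.group.dev.dependencies]
pytest = "^7.4"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"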
Automating with GitHub Actions:
Finally, I automated the testing process using GitHub Actions. I set up a workflow to run pytest every time I push code to GitHub. This ensures that any new changes are automatically tested and don't break the project. I learned a lot by debugging a case-sensitivity issue with the README.md file to get the build to pass.
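The workflow file itself isn't shown in this post, so here is a minimal sketch of the kind of workflow I mean, assuming it lives at .github/workflows/tests.yml and the project is installed with Poetry:

# .github/workflows/tests.yml (sketch; the file path and versions are assumptions)
name: Tests

on: [push]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      # Check out the repository and set up a Python interpreter
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # pipx comes preinstalled on GitHub-hosted runners
      - name: Install Poetry
        run: pipx install poetry
      - name: Install dependencies
        run: poetry install
      # Fail the build if any test fails
      - name: Run tests
        run: poetry run pytest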
Just 1% better every day
Nilesh Sharma