The Starting Point:
First, I wrote a basic script for the project. It uses requests to fetch data from an open API, cleans each record and streams the cleaned rows through a generator function, and saves them to a database using sqlite3. I also used argparse so that the URL and database name can be passed as arguments when running the script.
Here is the code for main.py:
import requests
import sqlite3
import argparse
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")

def fetch_data(url):
    """Fetch JSON from the API; return an empty list on any network or parse error."""
    try:
        r = requests.get(url, timeout=10)  # timeout so the CLI never hangs forever
        r.raise_for_status()
        return r.json()
    except (requests.RequestException, ValueError) as e:
        # RequestException covers connection/HTTP errors; ValueError covers bad JSON
        logging.error(f"Network error: {e}")
        return []

def clean_data(item):
    """Normalize one record into a (userId, id, title, body) tuple, or None if unparseable."""
    try:
        user_id = int(item.get("userId", 0))
        pid = int(item.get("id", 0))
        title = item.get("title", "").strip()
        body = item.get("body", "").strip()
        return (user_id, pid, title, body)
    except (TypeError, ValueError):
        return None

def data_generator(data):
    """Yield cleaned rows one at a time, skipping records that fail cleaning."""
    for item in data:
        row = clean_data(item)
        if row is not None:
            yield row

def create_table(conn):
    with conn:
        conn.execute('''
            CREATE TABLE IF NOT EXISTS posts(
                userId INTEGER,
                id INTEGER PRIMARY KEY,
                title TEXT,
                body TEXT
            )
        ''')

def store_data(conn, data_gen):
    # executemany iterates the generator, so rows are never all held in memory
    with conn:
        conn.executemany('INSERT OR REPLACE INTO posts(userId, id, title, body) VALUES(?,?,?,?)', data_gen)

def main():
    parser = argparse.ArgumentParser(description="A CLI tool to fetch data from an API and store it in a SQLite database")
    parser.add_argument("--url", default="https://jsonplaceholder.typicode.com/posts", help="API URL to fetch data from.")
    parser.add_argument("--db", default="posts.db", help="Path to the database file.")
    args = parser.parse_args()

    data = fetch_data(args.url)
    if data:
        conn = sqlite3.connect(args.db)
        try:
            create_table(conn)
            store_data(conn, data_generator(data))
            logging.info(f"Data stored to {args.db}")
        finally:
            conn.close()

if __name__ == "__main__":
    main()
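With argparse wired in, the script can be run with its defaults or with the URL and database path overridden on the command line, for example:

python main.py --url https://jsonplaceholder.typicode.com/posts --db posts.db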
Introducing Testing:
I then created my first tests using pytest, exercising the clean_data function with valid, invalid, and incomplete input to make sure it is robust. Luckily, my function passed all the tests.
Here is the code for test_main.py:
from src.etl_cli.main import clean_data

def test_cleandata_valid():
    # A well-formed record should pass through unchanged
    item = {"userId": 1, "id": 10, "title": "Some title", "body": "some body"}
    expected_output = (1, 10, "Some title", "some body")
    assert clean_data(item) == expected_output

def test_cleandata_invalid():
    # A userId that cannot be parsed as an int should be rejected with None
    item = {"userId": "Some userID", "id": 10, "title": "Some title", "body": "some body"}
    assert clean_data(item) is None

def test_cleandata_missing():
    # Missing fields should fall back to the defaults defined in clean_data
    item = {"title": "Some title"}
    expected_output = (0, 0, "Some title", "")
    assert clean_data(item) == expected_output
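The suite can then be run from the project root (assuming the package layout matches the src.etl_cli import above, e.g. after a poetry install):

pytest -v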
Packaging with Poetry:
I used Poetry to package the project so that it could be distributed and installed properly. Managing dependencies and packaging with Poetry was a straightforward and clean process.
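My exact configuration isn't reproduced here, but a minimal pyproject.toml sketch for this kind of src layout might look like the following; the package name etl-cli, the version numbers, and the dependency pins are assumptions inferred from the import path in test_main.py:

# pyproject.toml (sketch; names and version pins are assumptions)
[tool.poetry]
name = "etl-cli"
version = "0.1.0"
description = "A CLI tool to fetch data from an API and store it in SQLite"
authors = ["Nilesh Sharma"]
packages = [{ include = "etl_cli", from = "src" }]

[tool.poetry.dependencies]
python = "^3.10"
# requests is the only runtime dependency; sqlite3, argparse and logging are stdlib
requests = "^2.31"

[tool.poetry.group.dev.dependencies]
pytest = "^7.4"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"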
Automating with GitHub Actions:
Finally, I automated the testing process using GitHub Actions. I set up a workflow to run pytest every time I push code to GitHub. This ensures that any new changes are automatically tested and don't break the project. I learned a lot by debugging a case-sensitivity issue with the README.md file to get the build to pass.
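The workflow file itself isn't shown in this post, so here is a minimal sketch of the kind of workflow I mean, assuming it lives at .github/workflows/tests.yml and the project is installed with Poetry:

# .github/workflows/tests.yml (sketch; the file path and versions are assumptions)
name: Tests

on: [push]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      # Check out the repository and set up a Python interpreter
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # pipx comes preinstalled on GitHub-hosted runners
      - name: Install Poetry
        run: pipx install poetry
      - name: Install dependencies
        run: poetry install
      # Fail the build if any test fails
      - name: Run tests
        run: poetry run pytest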
Just 1% better every day
Nilesh Sharma