Fetching and Loading Data from GitHub

#githubcopilot #ai #discuss

It's usually preferable to write a function that downloads and decompresses data from GitHub (or any other online source) rather than doing it manually.

This is especially true if the data changes frequently: you can write a small script that uses the function to retrieve the most recent data (or you can set up a scheduled job to do that automatically at regular intervals). Automating the data retrieval process is also useful if you need to install the dataset on multiple machines.

Here is the function to fetch and load data:

# ----- Libraries ---------
import tarfile
import urllib.request
from pathlib import Path

import pandas as pd
# --------------------------

# Function to fetch and load data --->
def function_name():
    zipped_path = Path("datasets/files.tgz")

    # Download and extract only if the archive is not already on disk
    if not zipped_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://github.com/****/Datasets/raw/main/files.tgz"
        urllib.request.urlretrieve(url, zipped_path)

        # Unpack the archive into the datasets directory
        with tarfile.open(zipped_path) as tarball:
            tarball.extractall(path="datasets")

    # Load the extracted CSV file into a DataFrame
    return pd.read_csv(Path("datasets/files/file.csv"))

data_file = function_name()

When function_name() is called, it looks for the archive at datasets/files.tgz. If it does not find it, it creates a datasets directory inside your working directory, downloads files.tgz from https://github.com/****/Datasets/raw/main/files.tgz, and extracts it into datasets. The archive contains file.csv, which ends up at datasets/files/file.csv.

The function then loads the CSV file into a pandas DataFrame containing all the data and returns it.
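Because the function skips the download whenever the archive already exists, it will not pick up newer data on its own. Here is a minimal sketch of one way to handle that, assuming you want to pass the URL and paths in as arguments. The name fetch_archive and the force parameter are hypothetical, not part of the function above, and the sketch uses only the standard library, so loading the CSV with pandas is left to the caller.

```python
import tarfile
import urllib.request
from pathlib import Path


def fetch_archive(url, archive_path, extract_dir, force=False):
    """Download a .tgz archive and unpack it into extract_dir.

    Hypothetical helper: the download is skipped when the archive is
    already on disk, unless force=True asks for a fresh copy.
    """
    archive_path = Path(archive_path)
    extract_dir = Path(extract_dir)

    if force or not archive_path.is_file():
        # Make sure the target directories exist before writing to them
        archive_path.parent.mkdir(parents=True, exist_ok=True)
        extract_dir.mkdir(parents=True, exist_ok=True)

        # Fetch the archive (overwrites any older copy)
        urllib.request.urlretrieve(url, archive_path)

        # Unpack everything into the extraction directory
        with tarfile.open(archive_path) as tarball:
            tarball.extractall(path=extract_dir)

    return extract_dir
```

With this version, a scheduled script could call fetch_archive(url, "datasets/files.tgz", "datasets", force=True) to refresh the data, and then read the result with pd.read_csv("datasets/files/file.csv") as before.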

You can check your data with:

print(data_file[:10])
# Or
print(data_file.head())

The first call displays the first 10 rows of your data; the second displays the first 5.
