In this blog, we will demonstrate a method for pulling GitHub data in several formats into Databricks. This is a frequent request from Databricks users because it makes large existing GitHub datasets available for developing and training AI and ML models, brings reference datasets such as US Zip Code data into Unity Catalog, and supports working with unstructured data such as JSON logs. By linking GitHub and Databricks, you can improve your workflows and access critical data.
The first step is to select the data that you would like to bring into the Databricks environment to analyze. For this example, we will be looking at US Census baby name data. Before starting, you should create a catalog, schema, and volume to pull the data into – this process has been covered in prior blogs.
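If you have not created these yet, here is a minimal sketch of the one-time setup, run as Spark SQL from Python. It assumes you have privileges to create catalogs in your workspace, and the names match the variables defined in the next section; you can also create the same objects in the UI.
# One-time setup: create the catalog, schema, and volume used below.
# Assumes you have CREATE CATALOG privileges; adjust names as needed.
spark.sql("CREATE CATALOG IF NOT EXISTS github_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS github_catalog.github_schema")
spark.sql("CREATE VOLUME IF NOT EXISTS github_catalog.github_schema.github_volume")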
Define your variables
You must define your variables before you start pulling in data, because you will reference a catalog, schema, and volume when writing the GitHub data into Databricks Unity Catalog. Additionally, you need the raw link to the GitHub data, which you can generate by selecting 'View Raw' on the GitHub file page and copying the contents of your browser's address bar. The code for defining your variables should look like the following:
# Define the variables you are going to use to save the Github data to Unity Catalog.
# Before starting, you can create the catalog, etc. in the UI or with SQL code.
# For download_url, go to GitHub file you would like to download, select view raw, and copy the address from your browser's address bar.
catalog = "github_catalog"
schema = "github_schema"
volume = "github_volume"
download_url = "https://raw.githubusercontent.com/dxdc/babynames/refs/heads/main/all-names.csv"
file_name = "github_new_baby_names.csv"
table_name = "github_table"
path_volume = f"/Volumes/{catalog}/{schema}/{volume}"
path_table = f"{catalog}.{schema}"
print(path_table) # Show the complete path
print(path_volume) # Show the complete path
Import GitHub data to Databricks utilizing dbutils
Databricks Utilities (dbutils) provide commands that enable you to work with your Databricks environment from notebooks. The commands are wide-ranging, but we will focus on the dbutils.fs module, which covers the utilities used for accessing the Databricks File System. To write the GitHub CSV to Unity Catalog, utilize the following code:
# Import the CSV file from Github into your Unity Catalog Volume utilizing the Databricks dbutils command
dbutils.fs.cp(download_url, f"{path_volume}/{file_name}")
The f" strings in the above provide a concise way to embed expresisons and variables directly into strings, replacing str.format(). You can read more about f-strings in Python here. The .fs.cp module (.fs) and command (.cp) serve to copy the file to the specified volume with the specified file name.
Load the volume data into a DataFrame and a table in Unity Catalog
As a next step, you need to read the volume data into a Spark DataFrame so it can subsequently be saved as a table in Unity Catalog. At this point, we could drop columns or rename headers as needed (a sketch follows the read example below), but the data we are utilizing for this example does not require any adjustments.
df = spark.read.csv(f"{path_volume}/{file_name}",
    header=True,
    inferSchema=True,
    sep=",")
Note that we are using a CSV here, but several other file formats are supported by the spark.read command, including JSON, text, Parquet, ORC, XML, Avro, and more. spark.read can do some pretty cool stuff, like inferring tables from semi-structured JSON data. We will cover these more advanced applications in future blogs.
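For illustration, here is a sketch of how a couple of those other formats would be read; the file names are hypothetical and assume you have already copied such files into your volume:
# Hypothetical examples of reading other formats from the same volume
# (commented out because these files do not exist in this walkthrough).
# df_json = spark.read.json(f"{path_volume}/logs.json")
# df_parquet = spark.read.parquet(f"{path_volume}/data.parquet")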
Before saving the dataframe to Unity Catalog, you should review the headers and data, and check for anything else within the dataframe that needs to be cleansed.
display(df)  # Visually inspect the rows and headers
df.describe().show()  # Summary statistics for each column
f_count = df.filter(df.sex == 'F').count()
print(f_count)
m_count = df.filter(df.sex == 'M').count()
print(m_count)
total_count = df.count()
print(total_count)
# The female and male counts should add up to the total row count.
check_total_count = "ERROR: The total count does not match the sum of the female and male counts" if f_count + m_count - total_count != 0 else "OK: the counts match"
print(check_total_count)
Once you are happy with the dataframe and are ready to commit it to a table in Unity Catalog, you can save it with the Apache Spark function df.write.
df.write.mode("overwrite").saveAsTable(f"{path_table}.{table_name}")
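As an optional sanity check (our addition, not a required step), you can read the table back from Unity Catalog to confirm the write succeeded:
# Optional: read the new table back and peek at a few rows.
spark.table(f"{path_table}.{table_name}").limit(5).show()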
Pulling in data from GitHub can be a great first step in training your AI and ML models and developing experiments and use cases for ML. In subsequent blogs, we will walk through how to utilize this Databricks data to create ML models.