<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jordan Smith</title>
    <description>The latest articles on DEV Community by Jordan Smith (@encorepartners).</description>
    <link>https://dev.to/encorepartners</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2859604%2Fc9a627e8-b655-4840-9b8f-a009dabee58c.png</url>
      <title>DEV Community: Jordan Smith</title>
      <link>https://dev.to/encorepartners</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/encorepartners"/>
    <language>en</language>
    <item>
      <title>Accessing HuggingFace ML datasets in Databricks</title>
      <dc:creator>Jordan Smith</dc:creator>
      <pubDate>Fri, 14 Feb 2025 01:02:25 +0000</pubDate>
      <link>https://dev.to/encorepartners/accessing-huggingface-datasets-in-databricks-4k10</link>
      <guid>https://dev.to/encorepartners/accessing-huggingface-datasets-in-databricks-4k10</guid>
      <description>&lt;p&gt;As a supplement to our blog on pulling GitHub datasets into Databricks, many users may find that the dataset that they require for their project is located in HuggingFace. HuggingFace is a prominent platform in the AI and machine learning community, known for its extensive library of pre-trained models and datasets. It provides tools for natural language processing (NLP), computer vision, audio, and multimodal tasks, making it a versatile resource for developers and researchers. &lt;/p&gt;

&lt;p&gt;The HuggingFace platform fosters collaboration by allowing users to share and discover models, datasets, and applications, thereby accelerating the development and deployment of AI solutions. HuggingFace's open-source stack supports various modalities, including text, image, video, and audio, and offers both free and enterprise solutions to cater to different needs. &lt;/p&gt;

&lt;p&gt;HuggingFace has several premade integrations with Databricks that allow for straightforward ingestion of existing datasets and ML models into your Unity Catalog. To begin pulling in data, we can utilize HuggingFace's datasets library. Run the following to import the required modules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dataset&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once we are set up, we need to define a persistent cache directory. Caching is an essential technique for avoiding the need to fetch the same data multiple times. Pointing the datasets library's cache at persistent storage means the files do not have to be re-downloaded on every run or after a cluster restart, which speeds up execution and reduces compute usage and cost.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Define a persistent cache directory
&lt;/span&gt;&lt;span class="n"&gt;cache_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbfs/cache/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you have defined a cache directory, insert the code to load a dataset that you have selected from HuggingFace. Here I'm pulling in a &lt;a href="https://huggingface.co/datasets/wykonos/movies" rel="noopener noreferrer"&gt;movies dataset&lt;/a&gt; of ~723,000 entries with genre, language, and popularity scores. If compute cost is a concern for this demo, you can use the split argument to pull in only a slice of the dataset rather than all of it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dataset = load_dataset("wykonos/movies", cache_dir=cache_dir, split="train[:25%]")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
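
&lt;p&gt;Before going further, it is worth a quick sanity check that the download worked. A minimal sketch using the standard datasets API (the row count will reflect the split you chose):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Inspect the loaded dataset: column schema, row count, and a sample record
print(dataset.features)
print(dataset.num_rows)
print(dataset[0])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;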



&lt;p&gt;Once you have loaded the dataset, you can convert it into a DataFrame and perform any desired Apache Spark manipulations or analysis of the data. When you're happy with the data, save it as a table in your Unity Catalog so it's ready for further ML analysis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;path_table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
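
&lt;p&gt;As an example of the kind of Apache Spark manipulation you might run between creating the DataFrame and saving it, here is a minimal sketch using the functions module we imported as F. The popularity and title column names are assumptions based on this dataset's card, so adjust them to your own data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Keep only rows that have a popularity score, and add a derived column
df_clean = (
    df.filter(F.col("popularity").isNotNull())
      .withColumn("title_length", F.length(F.col("title")))
)
display(df_clean)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;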



&lt;p&gt;Pulling in data from Hugging Face is just the start. The real value to be unlocked from the Databricks platform comes from the machine learning experiments we'll run on this data. With HuggingFace's extensive library of pre-trained models and datasets, we can explore new possibilities in AI and machine learning. By integrating HuggingFace with Databricks, we can easily ingest datasets and ML models into our Unity Catalog, paving the way for innovative experiments and impactful results.&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>snowflake</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Pull GitHub data into Databricks with dbutils</title>
      <dc:creator>Jordan Smith</dc:creator>
      <pubDate>Thu, 13 Feb 2025 23:00:29 +0000</pubDate>
      <link>https://dev.to/encorepartners/pull-github-data-into-databricks-with-dbutils-3196</link>
      <guid>https://dev.to/encorepartners/pull-github-data-into-databricks-with-dbutils-3196</guid>
      <description>&lt;p&gt;In this blog, we will demonstrate a method that can be used to pull GitHub data across several formats into Databricks. This is a frequent request from Databricks users because it allows for the utilization of large existing GitHub datasets for developing and training AI and ML models, enabling Unity Catalog to access github repositories like US Zip Code data, and working with unstructured data such as JSON logs. By linking GitHub and Databricks, you can improve your workflows and access critical data. &lt;/p&gt;

&lt;p&gt;The first step is to select the data that you would like to bring into the Databricks environment to analyze. For this example, we will be looking at US Census &lt;a href="https://github.com/dxdc/babynames/blob/main/all-names.csv" rel="noopener noreferrer"&gt;baby name data&lt;/a&gt;. Before starting, you should create a catalog, schema, and volume to pull the data into – this process has been covered in prior blogs. &lt;/p&gt;

&lt;h2&gt;
  
  
  Define your variables
&lt;/h2&gt;

&lt;p&gt;You must define your variables before you start the process of pulling in data, because you will reference a catalog, schema, and volume when writing the GitHub data into Databricks Unity Catalog. Additionally, you need the raw link to the GitHub data, which can be generated by selecting 'View Raw' on the GitHub page and copying the contents of your address bar. The code for defining your variables should look like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Define the variables you are going to use to save the Github data to Unity Catalog.
# Before starting, you can create the catalog, etc. in the UI or with SQL code.
# For download_url, go to GitHub file you would like to download, select view raw, and copy the address from your browser's address bar.
&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;github_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;github_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;volume&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;github_volume&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;download_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://raw.githubusercontent.com/dxdc/babynames/refs/heads/main/all-names.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;file_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;github_new_baby_names.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;github_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;path_volume&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/Volumes/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;catalog&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;volume&lt;/span&gt;
&lt;span class="n"&gt;path_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;catalog&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path_table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Show the complete path
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path_volume&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Show the complete path
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Import GitHub to Databricks utilizing dbutils
&lt;/h2&gt;

&lt;p&gt;Databricks Utilities (dbutils) provide commands that enable you to work with your Databricks environment from notebooks. The commands are wide-ranging, but we will focus on the &lt;strong&gt;dbutils.fs&lt;/strong&gt; module, which covers the utilities used for accessing the Databricks &lt;strong&gt;F&lt;/strong&gt;ile &lt;strong&gt;S&lt;/strong&gt;ystem. To write the GitHub CSV to Unity Catalog, utilize the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Import the CSV file from Github into your Unity Catalog Volume utilizing the Databricks dbutils command
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;download_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;path_volume&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
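
&lt;p&gt;To confirm the file landed where you expect, you can list the volume with another dbutils.fs command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# List the contents of the volume to verify the copy succeeded
display(dbutils.fs.ls(path_volume))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;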



&lt;p&gt;The f" strings in the above provide a concise way to embed expresisons and variables directly into strings, replacing str.format(). You can read more about f-strings in Python &lt;a href="https://www.geeksforgeeks.org/formatted-string-literals-f-strings-python/" rel="noopener noreferrer"&gt;here&lt;/a&gt;. The .fs.cp module (.fs) and command (.cp) serve to copy the file to the specified volume with the specified file name.&lt;/p&gt;

&lt;h2&gt;
  
  
  Load volume to dataframe and table in Unity Catalog
&lt;/h2&gt;

&lt;p&gt;As a next step, you need to load the volume data into a Spark DataFrame so it can subsequently be saved as a table in Unity Catalog. At this point, we could drop columns or change headers as needed, but the data we are utilizing for this example does not require any adjustments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;path_volume&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;inferSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that we are using a CSV here, but several other file formats are supported by the spark.read command, including JSON, txt, Parquet, ORC, XML, Avro, and more. Spark.read can do some pretty cool stuff, like infer tables from semi-structured JSON data. We will cover these more advanced applications in future blogs.&lt;/p&gt;
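
&lt;p&gt;For instance, if you had copied a JSON log file into the volume instead of a CSV, the read would look something like the following sketch (logs.json is a hypothetical file name):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# spark.read infers a tabular schema from semi-structured JSON
df_logs = spark.read.json(f"{path_volume}/logs.json")
df_logs.printSchema()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;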

&lt;p&gt;Before saving the dataframe to Unity Catalog, you should review the headers and data, and check for anything else within the dataframe that needs to be cleansed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;f_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sex&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;F&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;m_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sex&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;M&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;total_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;check_total_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ERROR: The total count does not match the sum of the female and male counts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;f_count&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;m_count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;total_count&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;f_count&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;m_count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;total_count&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;check_total_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you are happy with the dataframe and are ready to commit it to a table in Unity Catalog, you can save the dataframe as a table with the Apache Spark df.write function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;path_table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
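
&lt;p&gt;As an optional final check, you can read the table back out of Unity Catalog to confirm the write succeeded:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Read the newly created table back from Unity Catalog
saved_df = spark.table(f"{path_table}.{table_name}")
display(saved_df)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;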



&lt;p&gt;Pulling in data from GitHub can be a great first step in training your AI and ML models and developing experiments and use cases for ML. In subsequent blogs, we will walk through how to utilize this Databricks data to create ML models.&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>snowflake</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Creating your first catalog, schema and tables in Databricks</title>
      <dc:creator>Jordan Smith</dc:creator>
      <pubDate>Thu, 13 Feb 2025 20:17:10 +0000</pubDate>
      <link>https://dev.to/encorepartners/creating-your-first-catalog-schema-and-tables-in-databricks-20p3</link>
      <guid>https://dev.to/encorepartners/creating-your-first-catalog-schema-and-tables-in-databricks-20p3</guid>
      <description>&lt;p&gt;Working in Databricks, it is key to harness a foundational understanding of Catalogs, Schemas, and Tables before moving on to advanced AI and ML use cases. The traditional database workflow of setting up a data environment is rapidly scalable within the Databricks platform like never before, but nonetheless, and the platform makes database development more streamlined than ever.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1k0byvz2kj7hzfdl1ral.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1k0byvz2kj7hzfdl1ral.png" alt="Image description" width="800" height="288"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Catalog overview and default catalogs
&lt;/h2&gt;

&lt;p&gt;A Catalog is the primary unit of data organization in the Databricks Unity Catalog data governance model, and Catalogs are the first layer in Unity Catalog's three-level namespace (for example, catalog.schema.table). A catalog can only contain schemas, but schemas can subsequently contain several disparate types of data (we will only cover volumes and tables in this blog).&lt;/p&gt;

&lt;p&gt;When you design your data governance model, you should give careful thought to the catalogs that you create. As the highest level in your organization’s data governance model, each catalog should represent a logical unit of data isolation and a logical category of data access, allowing an efficient hierarchy of grants to flow down to schemas and the data objects that they contain.&lt;/p&gt;

&lt;p&gt;A default catalog is configured for each workspace that is enabled for Unity Catalog. The default catalog lets you perform data operations without specifying a catalog. If you omit the top-level catalog name when you perform data operations, the default catalog is assumed.&lt;/p&gt;

&lt;p&gt;If your workspace was enabled for Unity Catalog automatically, the pre-provisioned workspace catalog is specified as the default catalog. A workspace admin can change the default catalog as needed.&lt;/p&gt;
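
&lt;p&gt;If you want to verify or change the default from a notebook, one minimal sketch (using the first_catalog we create below; these statements can also be run in a %sql cell) is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Show the current default catalog, then switch the session to another catalog
print(spark.sql("SELECT current_catalog()").first()[0])
spark.sql("USE CATALOG first_catalog")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;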

&lt;p&gt;Even though most of the work described in this blog can be completed via point-and-click within the Databricks UI, it is important to understand the SQL code behind the workflows, as SQL might be required for more advanced actions such as JOINs. To create a new Catalog, you can use the following SQL code in a Databricks Notebook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="k"&gt;sql&lt;/span&gt;
&lt;span class="c1"&gt;-- Find the below Managed Location URL by going to Catalog &amp;gt;&amp;gt; Create New Catalog &amp;gt;&amp;gt; Storage Location&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;first_catalog&lt;/span&gt;
&lt;span class="n"&gt;MANAGED&lt;/span&gt; &lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'abfss://unity-catalog-storage@dbstoragewe2nak3uyjbts.dfs.core.windows.net/3297083325245759'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are several additional arguments that can be added when creating a catalog, which can be reviewed on the Databricks Documentation website. The only argument we will discuss here is MANAGED LOCATION, which is required if your Databricks account does not have a metastore-level storage location specified. Demo and trial users just learning the platform might not have metastore-level storage set up. To work around this, find your account's default storage URL by navigating to Catalog in the left-hand sidebar, selecting Create New Catalog, and copying the default storage location.&lt;/p&gt;

&lt;h2&gt;
  
  
  Schema Overview and Code
&lt;/h2&gt;

&lt;p&gt;In Unity Catalog, a schema is a child of a catalog and can contain tables, views, volumes, models, and functions. A schema organizes data and AI assets into logical categories that are more granular than catalogs. Typically a schema represents a single use case, project, or team sandbox. Regardless of category type, schemas are a useful tool for managing data access control and improving data discoverability.&lt;/p&gt;

&lt;p&gt;We can create a schema within the first Catalog that we set up earlier in this blog. Notice that two of the three levels of the catalog.schema.table namespace are used in the command below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="k"&gt;sql&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="k"&gt;first&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;first&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;schema&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Volumes and Tables
&lt;/h2&gt;

&lt;p&gt;While there are several objects that can sit below Schemas in Databricks, Volumes and Tables are the key objects for new users of the platform to understand.&lt;/p&gt;

&lt;p&gt;While tables provide governance over tabular datasets, volumes add governance over non-tabular datasets. &lt;em&gt;You can use volumes to store and access files in any format, including structured, semi-structured, and unstructured data.&lt;/em&gt; Another way to understand this is that volumes are the precursor to tables, where we might land bronze-level data and perform transformation and ETL steps (former Excel users, think Power Query). One example of semi-structured data that would need to be imported as a volume is JSON log data. Once imported as a volume, JSON data can be quickly converted to a table with spark.read functions. To create a volume, use the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="k"&gt;sql&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;VOLUME&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;first_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_volume&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
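
&lt;p&gt;To make the volume-to-table conversion concrete, here is a minimal Python sketch of the spark.read pattern mentioned above (events.json and first_table are hypothetical names; the GitHub and HuggingFace blogs in this series walk through real examples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Read semi-structured JSON from the volume and save it as a governed table
df = spark.read.json("/Volumes/first_catalog/first_schema/first_volume/events.json")
df.write.mode("overwrite").saveAsTable("first_catalog.first_schema.first_table")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;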



&lt;p&gt;This has served as an introduction to setting up a preliminary data environment in Databricks. Check out the next blogs in this series for an overview of ingesting raw data from the internet (GitHub and HuggingFace) into the volume you created, and transforming the volume data into a table that we can perform AI and ML analysis on.&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>snowflake</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
