As a supplement to our blog on pulling GitHub datasets into Databricks, many users may find that the dataset that they require for their project is located in HuggingFace. HuggingFace is a prominent platform in the AI and machine learning community, known for its extensive library of pre-trained models and datasets. It provides tools for natural language processing (NLP), computer vision, audio, and multimodal tasks, making it a versatile resource for developers and researchers.
The HuggingFace platform fosters collaboration by allowing users to share and discover models, datasets, and applications, thereby accelerating the development and deployment of AI solutions. HuggingFace's open-source stack supports various modalities, including text, image, video, and audio, and offers both free and enterprise solutions to cater to different needs.
HuggingFace has several premade integrations with Databricks that make it straightforward to ingest existing datasets and ML models into your Unity Catalog. To pull in data, we can use HuggingFace's `datasets` library. Run the following to import the required Hugging Face and PySpark modules:
from datasets import load_dataset
from pyspark.sql import functions as F
Once we are set up, we need to define a persistent cache directory. The `datasets` library caches downloaded files and any processed versions of a dataset on disk, so pointing the cache at DBFS means reruns of the notebook reuse data that was already downloaded instead of fetching it from HuggingFace again. This saves both time and compute, and keeps the cache around even if the cluster restarts.
# Define a persistent cache directory (the /dbfs/ prefix is the DBFS FUSE mount)
cache_dir = "/dbfs/cache/"
Once you have defined a cache directory, insert the code to load the dataset you selected from HuggingFace. Here I'm pulling in a movies dataset with genre, language, and popularity columns and roughly 723,000 rows. If compute cost is a concern for this demo, you can use the split argument to pull in only a percentage of the dataset rather than all of it.
dataset = load_dataset("wykonos/movies", cache_dir=cache_dir, split="train[:25%]")
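As a rough sanity check on what the split argument buys you (assuming the ~723,000-row figure mentioned above), a `train[:25%]` slice keeps about a quarter of the rows:

```python
# Approximate row count pulled in by split="train[:25%]".
# total_rows is the assumed size of the full dataset from the article.
total_rows = 723_000
fraction = 0.25  # the 25% slice requested in the split string

subset_rows = int(total_rows * fraction)
print(subset_rows)  # 180750
```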
Once you have loaded the dataset, you can convert it to a Spark DataFrame and perform any desired Apache Spark manipulations or analysis. Once you're happy with the data, save it as a table in your Unity Catalog so we can follow up with further ML analysis.
df = spark.createDataFrame(dataset)
df.write.mode("overwrite").saveAsTable(f"{path_table}.{table_name}")
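Here `path_table` and `table_name` are placeholders you define yourself. As a hypothetical example, if `path_table` holds the `catalog.schema` prefix of Unity Catalog's three-level namespace, the table identifier resolves like this:

```python
# Hypothetical values -- substitute your own catalog, schema, and table name.
path_table = "main.default"  # catalog.schema prefix in Unity Catalog
table_name = "hf_movies"

# Equivalent to the concatenation passed to saveAsTable above:
full_name = f"{path_table}.{table_name}"
print(full_name)  # main.default.hf_movies
```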
Pulling in data from Hugging Face is just the start. The real value to be unlocked from the Databricks platform comes from the machine learning experiments we'll run on this data. With HuggingFace's extensive library of pre-trained models and datasets, we can explore new possibilities in AI and machine learning. By integrating HuggingFace with Databricks, we can easily ingest datasets and ML models into our Unity Catalog, paving the way for innovative experiments and impactful results.