Jordan Smith

Accessing HuggingFace ML datasets in Databricks

As a supplement to our blog on pulling GitHub datasets into Databricks, this post covers the case where the dataset you require for your project lives on HuggingFace. HuggingFace is a prominent platform in the AI and machine learning community, known for its extensive library of pre-trained models and datasets. It provides tools for natural language processing (NLP), computer vision, audio, and multimodal tasks, making it a versatile resource for developers and researchers.

The HuggingFace platform fosters collaboration by allowing users to share and discover models, datasets, and applications, thereby accelerating the development and deployment of AI solutions. HuggingFace's open-source stack supports various modalities, including text, image, video, and audio, and offers both free and enterprise solutions to cater to different needs.

HuggingFace has several premade integrations with Databricks that allow for ultra-straightforward ingestion of existing datasets and ML models into your Unity Catalog. To begin pulling in data, we can use the load_dataset function from HuggingFace's datasets library. Run the following to import the required HuggingFace and PySpark modules:

# Hugging Face function for downloading datasets from the Hub
from datasets import load_dataset
# Spark SQL functions for transformations later on
from pyspark.sql import functions as F
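
Note that the datasets library comes pre-installed on recent versions of the Databricks Runtime for Machine Learning; if you are on a standard runtime, you may need to install it in the notebook first. A minimal example, assuming no cluster-level library is already configured:

%pip install datasets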

Once we are set up, we need to define a persistent cache directory. Caching is an essential technique for improving performance because it avoids recomputing or re-fetching the same data multiple times. Here, pointing load_dataset at a persistent cache directory means the dataset files are downloaded once and reused on subsequent runs, which speeds up execution, minimizes cluster usage, and results in lower costs and more efficient resource utilization.

# Define a persistent cache directory on DBFS so the download is reused across runs
cache_dir = "/dbfs/cache/"

Once you have defined a cache directory, insert the code to load the dataset you have selected from HuggingFace. Here I'm pulling in a movies dataset with genre, language, and popularity scores that has roughly 723,000 entries. If compute cost is a concern for this demo, you can use the split argument to pull in only a portion of the dataset rather than the full 100%.

# Load the first 25% of the training split, caching the files in the directory defined above
dataset = load_dataset("wykonos/movies", cache_dir=cache_dir, split="train[:25%]")
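
Before converting anything, it can help to confirm what was actually downloaded. This is an optional check using the standard datasets API:

# Quick sanity check on the loaded split
print(dataset.num_rows)       # number of rows pulled in
print(dataset.column_names)   # columns available in the dataset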

Once you have loaded the dataset, you can convert it into a Spark DataFrame and perform any desired Apache Spark manipulations or analysis of the data. Once you're good with the data, go ahead and save it down as a table in your Unity Catalog so we can run further ML analysis on it.

# Convert the Hugging Face dataset into a Spark DataFrame
df = spark.createDataFrame(dataset)

# Write the DataFrame to Unity Catalog; path_table holds the target catalog and schema, table_name the table name
df.write.mode("overwrite").saveAsTable(f"{path_table}.{table_name}")
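
If you want to explore the data before (or after) saving it, the functions module imported earlier as F covers most common transformations. The snippet below is a minimal sketch; the genre and popularity column names are assumptions based on the dataset description above, so check df.printSchema() and substitute the actual column names from your dataset:

# Inspect the inferred schema to confirm column names
df.printSchema()

# Example transformation: average popularity and movie count per genre
# (column names "genre" and "popularity" are assumed -- adjust to match your schema)
genre_popularity = (
    df.groupBy("genre")
      .agg(
          F.avg("popularity").alias("avg_popularity"),
          F.count("*").alias("num_movies"),
      )
      .orderBy(F.desc("avg_popularity"))
)
genre_popularity.show(10, truncate=False)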

Pulling in data from Hugging Face is just the start. The real value to be unlocked from the Databricks platform comes from the machine learning experiments we'll run on this data. With HuggingFace's extensive library of pre-trained models and datasets, we can explore new possibilities in AI and machine learning. By integrating HuggingFace with Databricks, we can easily ingest datasets and ML models into our Unity Catalog, paving the way for innovative experiments and impactful results.
