<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jordan Smith</title>
    <description>The latest articles on DEV Community by Jordan Smith (@encorepartners).</description>
    <link>https://dev.to/encorepartners</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2859604%2Fc9a627e8-b655-4840-9b8f-a009dabee58c.png</url>
      <title>DEV Community: Jordan Smith</title>
      <link>https://dev.to/encorepartners</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/encorepartners"/>
    <language>en</language>
    <item>
      <title>Accessing HuggingFace ML datasets in Databricks</title>
      <dc:creator>Jordan Smith</dc:creator>
      <pubDate>Fri, 14 Feb 2025 01:02:25 +0000</pubDate>
      <link>https://dev.to/encorepartners/accessing-huggingface-datasets-in-databricks-4k10</link>
      <guid>https://dev.to/encorepartners/accessing-huggingface-datasets-in-databricks-4k10</guid>
      <description>&lt;p&gt;As a supplement to our blog on pulling GitHub datasets into Databricks, many users may find that the dataset that they require for their project is located in HuggingFace. HuggingFace is a prominent platform in the AI and machine learning community, known for its extensive library of pre-trained models and datasets. It provides tools for natural language processing (NLP), computer vision, audio, and multimodal tasks, making it a versatile resource for developers and researchers. &lt;/p&gt;

&lt;p&gt;The HuggingFace platform fosters collaboration by allowing users to share and discover models, datasets, and applications, thereby accelerating the development and deployment of AI solutions. HuggingFace's open-source stack supports various modalities, including text, image, video, and audio, and offers both free and enterprise solutions to cater to different needs. &lt;/p&gt;

&lt;p&gt;HuggingFace has several premade integrations with Databricks that allow for straightforward ingestion of existing datasets and ML models into your Unity Catalog. To begin pulling in data, we can utilize HuggingFace's datasets library. Run the following to import the required modules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dataset&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once we are set up, we need to define a persistent cache directory. Caching is an essential technique for avoiding the need to fetch the same data multiple times. Pointing the datasets library's cache at persistent storage means the files do not have to be re-downloaded on every run or after a cluster restart, which speeds up execution and reduces compute usage and cost.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Define a persistent cache directory
&lt;/span&gt;&lt;span class="n"&gt;cache_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbfs/cache/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you have defined a cache directory, insert the code to load a dataset that you have selected from HuggingFace. Here I'm pulling in a &lt;a href="https://huggingface.co/datasets/wykonos/movies" rel="noopener noreferrer"&gt;movies dataset&lt;/a&gt; of ~723,000 entries with genre, language, and popularity scores. If compute cost is a concern for this demo, you can use the split argument to pull in only a slice of the dataset rather than all of it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dataset = load_dataset("wykonos/movies", cache_dir=cache_dir, split="train[:25%]")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
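
&lt;p&gt;Before going further, it is worth a quick sanity check that the download worked. A minimal sketch using the standard datasets API (the row count will reflect the split you chose):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Inspect the loaded dataset: column schema, row count, and a sample record
print(dataset.features)
print(dataset.num_rows)
print(dataset[0])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;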



&lt;p&gt;Once you have loaded the dataset, you can convert it into a DataFrame and perform any desired Apache Spark manipulations or analysis of the data. When you're happy with the data, save it as a table in your Unity Catalog so it's ready for further ML analysis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;path_table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
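
&lt;p&gt;As an example of the kind of Apache Spark manipulation you might run between creating the DataFrame and saving it, here is a minimal sketch using the functions module we imported as F. The popularity and title column names are assumptions based on this dataset's card, so adjust them to your own data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Keep only rows that have a popularity score, and add a derived column
df_clean = (
    df.filter(F.col("popularity").isNotNull())
      .withColumn("title_length", F.length(F.col("title")))
)
display(df_clean)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;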



&lt;p&gt;Pulling in data from Hugging Face is just the start. The real value to be unlocked from the Databricks platform comes from the machine learning experiments we'll run on this data. With HuggingFace's extensive library of pre-trained models and datasets, we can explore new possibilities in AI and machine learning. By integrating HuggingFace with Databricks, we can easily ingest datasets and ML models into our Unity Catalog, paving the way for innovative experiments and impactful results.&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>snowflake</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Pull GitHub data into Databricks with dbutils</title>
      <dc:creator>Jordan Smith</dc:creator>
      <pubDate>Thu, 13 Feb 2025 23:00:29 +0000</pubDate>
      <link>https://dev.to/encorepartners/pull-github-data-into-databricks-with-dbutils-3196</link>
      <guid>https://dev.to/encorepartners/pull-github-data-into-databricks-with-dbutils-3196</guid>
      <description>&lt;p&gt;In this blog, we will demonstrate a method that can be used to pull GitHub data across several formats into Databricks. This is a frequent request from Databricks users because it allows for the utilization of large existing GitHub datasets for developing and training AI and ML models, enabling Unity Catalog to access github repositories like US Zip Code data, and working with unstructured data such as JSON logs. By linking GitHub and Databricks, you can improve your workflows and access critical data. &lt;/p&gt;

&lt;p&gt;The first step is to select the data that you would like to bring into the Databricks environment to analyze. For this example, we will be looking at US Census &lt;a href="https://github.com/dxdc/babynames/blob/main/all-names.csv" rel="noopener noreferrer"&gt;baby name data&lt;/a&gt;. Before starting, you should create a catalog, schema, and volume to pull the data into – this process has been covered in prior blogs. &lt;/p&gt;

&lt;h2&gt;
  
  
  Define your variables
&lt;/h2&gt;

&lt;p&gt;You must define your variables before you start the process of pulling in data, because you will reference a catalog, schema, and volume when writing the GitHub data into Databricks Unity Catalog. Additionally, you need the raw link to the GitHub data, which can be generated by selecting 'View Raw' on the GitHub page and copying the contents of your address bar. The code for defining your variables should look like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Define the variables you are going to use to save the Github data to Unity Catalog.
# Before starting, you can create the catalog, etc. in the UI or with SQL code.
# For download_url, go to GitHub file you would like to download, select view raw, and copy the address from your browser's address bar.
&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;github_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;github_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;volume&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;github_volume&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;download_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://raw.githubusercontent.com/dxdc/babynames/refs/heads/main/all-names.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;file_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;github_new_baby_names.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;github_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;path_volume&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/Volumes/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;catalog&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;volume&lt;/span&gt;
&lt;span class="n"&gt;path_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;catalog&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path_table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Show the complete path
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path_volume&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Show the complete path
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Import GitHub to Databricks utilizing dbutils
&lt;/h2&gt;

&lt;p&gt;Databricks Utilities (dbutils) provide commands that enable you to work with your Databricks environment from notebooks. The commands are wide-ranging, but we will focus on the &lt;strong&gt;dbutils.fs&lt;/strong&gt; module, which covers the utilities used for accessing the Databricks &lt;strong&gt;F&lt;/strong&gt;ile &lt;strong&gt;S&lt;/strong&gt;ystem. To write the GitHub CSV to Unity Catalog, utilize the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Import the CSV file from Github into your Unity Catalog Volume utilizing the Databricks dbutils command
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;download_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;path_volume&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
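
&lt;p&gt;To confirm the file landed where you expect, you can list the volume with another dbutils.fs command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# List the contents of the volume to verify the copy succeeded
display(dbutils.fs.ls(path_volume))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;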



&lt;p&gt;The f" strings in the above provide a concise way to embed expresisons and variables directly into strings, replacing str.format(). You can read more about f-strings in Python &lt;a href="https://www.geeksforgeeks.org/formatted-string-literals-f-strings-python/" rel="noopener noreferrer"&gt;here&lt;/a&gt;. The .fs.cp module (.fs) and command (.cp) serve to copy the file to the specified volume with the specified file name.&lt;/p&gt;

&lt;h2&gt;
  
  
  Load volume to dataframe and table in Unity Catalog
&lt;/h2&gt;

&lt;p&gt;As a next step, you need to load the volume data into a Spark DataFrame so it can subsequently be saved as a table in Unity Catalog. At this point, we could drop columns or change headers as needed, but the data we are utilizing for this example does not require any adjustments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;path_volume&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;inferSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that we are using a CSV here, but several other file formats are supported by the spark.read command, including JSON, txt, Parquet, ORC, XML, Avro, and more. Spark.read can do some pretty cool stuff, like infer tables from semi-structured JSON data. We will cover these more advanced applications in future blogs.&lt;/p&gt;
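
&lt;p&gt;For instance, if you had copied a JSON log file into the volume instead of a CSV, the read would look something like the following sketch (logs.json is a hypothetical file name):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# spark.read infers a tabular schema from semi-structured JSON
df_logs = spark.read.json(f"{path_volume}/logs.json")
df_logs.printSchema()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;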

&lt;p&gt;Before saving the dataframe to Unity Catalog, you should review the headers and data, and check for anything else within the dataframe that needs to be cleansed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;f_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sex&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;F&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;m_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sex&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;M&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;total_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;check_total_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ERROR: The total count does not match the sum of the female and male counts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;f_count&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;m_count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;total_count&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;f_count&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;m_count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;total_count&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;check_total_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you are happy with the dataframe and are ready to commit it to a table in Unity Catalog, you can save the dataframe as a table with the Apache Spark df.write function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;path_table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
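
&lt;p&gt;As an optional final check, you can read the table back out of Unity Catalog to confirm the write succeeded:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Read the newly created table back from Unity Catalog
saved_df = spark.table(f"{path_table}.{table_name}")
display(saved_df)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;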



&lt;p&gt;Pulling in data from GitHub can be a great first step in training your AI and ML models and developing experiments and use cases for ML. In subsequent blogs, we will walk through how to utilize this Databricks data to create ML models.&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>snowflake</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Creating your first catalog, schema and tables in Databricks</title>
      <dc:creator>Jordan Smith</dc:creator>
      <pubDate>Thu, 13 Feb 2025 20:17:10 +0000</pubDate>
      <link>https://dev.to/encorepartners/creating-your-first-catalog-schema-and-tables-in-databricks-20p3</link>
      <guid>https://dev.to/encorepartners/creating-your-first-catalog-schema-and-tables-in-databricks-20p3</guid>
      <description>&lt;p&gt;Working in Databricks, it is key to harness a foundational understanding of Catalogs, Schemas, and Tables before moving on to advanced AI and ML use cases. The traditional database workflow of setting up a data environment is rapidly scalable within the Databricks platform like never before, but nonetheless, and the platform makes database development more streamlined than ever.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1k0byvz2kj7hzfdl1ral.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1k0byvz2kj7hzfdl1ral.png" alt="Image description" width="800" height="288"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Catalog overview and default catalogs
&lt;/h2&gt;

&lt;p&gt;A Catalog is the primary unit of data organization in the Databricks Unity Catalog data governance model, and Catalogs are the first layer in Unity Catalog's three-level namespace (for example, catalog.schema.table). A catalog can only contain schemas, but schemas can subsequently contain several disparate types of data (we will only cover volumes and tables in this blog).&lt;/p&gt;

&lt;p&gt;When you design your data governance model, you should give careful thought to the catalogs that you create. As the highest level in your organization’s data governance model, each catalog should represent a logical unit of data isolation and a logical category of data access, allowing an efficient hierarchy of grants to flow down to schemas and the data objects that they contain.&lt;/p&gt;

&lt;p&gt;A default catalog is configured for each workspace that is enabled for Unity Catalog. The default catalog lets you perform data operations without specifying a catalog. If you omit the top-level catalog name when you perform data operations, the default catalog is assumed.&lt;/p&gt;

&lt;p&gt;If your workspace was enabled for Unity Catalog automatically, the pre-provisioned workspace catalog is specified as the default catalog. A workspace admin can change the default catalog as needed.&lt;/p&gt;
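
&lt;p&gt;If you want to verify or change the default from a notebook, one minimal sketch (using the first_catalog we create below; these statements can also be run in a %sql cell) is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Show the current default catalog, then switch the session to another catalog
print(spark.sql("SELECT current_catalog()").first()[0])
spark.sql("USE CATALOG first_catalog")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;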

&lt;p&gt;Even though most of the work described in this blog can be completed via point-and-click within the Databricks UI, it is important to understand the SQL code behind the workflows, as SQL might be required for more advanced actions such as JOINs. To create a new Catalog, you can use the following SQL code in a Databricks Notebook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="k"&gt;sql&lt;/span&gt;
&lt;span class="c1"&gt;-- Find the below Managed Location URL by going to Catalog &amp;gt;&amp;gt; Create New Catalog &amp;gt;&amp;gt; Storage Location&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;first_catalog&lt;/span&gt;
&lt;span class="n"&gt;MANAGED&lt;/span&gt; &lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'abfss://unity-catalog-storage@dbstoragewe2nak3uyjbts.dfs.core.windows.net/3297083325245759'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are several additional arguments that can be added when creating a catalog, which can be reviewed on the Databricks Documentation website. The only argument we will discuss here is MANAGED LOCATION, which is required if your Databricks account does not have a metastore-level storage location specified. Demo and trial users just learning the platform might not have metastore-level storage set up. To work around this, find your account's default storage URL by navigating to Catalog in the left-hand sidebar, selecting Create New Catalog, and copying the default storage location.&lt;/p&gt;

&lt;h2&gt;
  
  
  Schema Overview and Code
&lt;/h2&gt;

&lt;p&gt;In Unity Catalog, a schema is a child of a catalog and can contain tables, views, volumes, models, and functions. A schema organizes data and AI assets into logical categories that are more granular than catalogs. Typically a schema represents a single use case, project, or team sandbox. Regardless of category type, schemas are a useful tool for managing data access control and improving data discoverability.&lt;/p&gt;

&lt;p&gt;We can create a schema within the first Catalog that we set up earlier in this blog. Notice that two of the three levels of the catalog.schema.table namespace are used in the command below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="k"&gt;sql&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="k"&gt;first&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;first&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;schema&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Volumes and Tables
&lt;/h2&gt;

&lt;p&gt;While there are several objects that can sit below Schemas in Databricks, Volumes and Tables are the key objects for new users of the platform to understand.&lt;/p&gt;

&lt;p&gt;While tables provide governance over tabular datasets, volumes add governance over non-tabular datasets. &lt;em&gt;You can use volumes to store and access files in any format, including structured, semi-structured, and unstructured data.&lt;/em&gt; Another way to understand this is that volumes are the precursor to tables, where we might land bronze-level data and perform transformation and ETL steps (former Excel users, think Power Query). One example of semi-structured data that would need to be imported as a volume is JSON log data. Once imported as a volume, JSON data can be quickly converted to a table with spark.read functions. To create a volume, use the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="k"&gt;sql&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;VOLUME&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;first_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_volume&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
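
&lt;p&gt;To make the volume-to-table conversion concrete, here is a minimal Python sketch of the spark.read pattern mentioned above (events.json and first_table are hypothetical names; the GitHub and HuggingFace blogs in this series walk through real examples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Read semi-structured JSON from the volume and save it as a governed table
df = spark.read.json("/Volumes/first_catalog/first_schema/first_volume/events.json")
df.write.mode("overwrite").saveAsTable("first_catalog.first_schema.first_table")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;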



&lt;p&gt;This has served as an introduction to setting up a preliminary data environment in Databricks. Check out the next blogs in this series for an overview of ingesting raw data from the internet (GitHub and HuggingFace) into the volume you created, and transforming the volume data into a table that we can perform AI and ML analysis on.&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>snowflake</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
