Jordan Smith
Creating your first catalog, schema and tables in Databricks

Working in Databricks, it is key to build a foundational understanding of Catalogs, Schemas, and Tables before moving on to advanced AI and ML use cases. The traditional database workflow of setting up a data environment scales within the Databricks platform like never before, and the platform makes database development more streamlined than ever.


Catalog Overview and Default Catalogs

A Catalog is the primary unit of data organization in the Databricks Unity Catalog data governance model, and Catalogs are the first layer in Unity Catalog's three-level namespace (for example, catalog.schema.table). A catalog can only contain schemas, but those schemas can in turn contain several different types of data objects (we will only cover volumes and tables in this blog).

When you design your data governance model, you should give careful thought to the catalogs that you create. As the highest level in your organization’s data governance model, each catalog should represent a logical unit of data isolation and a logical category of data access, allowing an efficient hierarchy of grants to flow down to schemas and the data objects that they contain.

A default catalog is configured for each workspace that is enabled for Unity Catalog. The default catalog lets you perform data operations without specifying a catalog. If you omit the top-level catalog name when you perform data operations, the default catalog is assumed.

If your workspace was enabled for Unity Catalog automatically, the pre-provisioned workspace catalog is specified as the default catalog. A workspace admin can change the default catalog as needed.
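
If you want to confirm or change the catalog your session is using, you can do so in SQL. The snippet below is a minimal sketch; main is simply a placeholder for whatever default catalog your workspace is configured with.

%sql
-- Show the catalog that unqualified names currently resolve against
SELECT current_catalog();

-- Point the current session at a different catalog ("main" is a placeholder name)
USE CATALOG main;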

Even though most of the work described in this blog can be completed via point-and-click within the Databricks UI, it is important to understand the SQL code behind the workflows, as SQL might be required for more advanced actions such as JOINs. To create a new Catalog, you can use the following SQL code in a Databricks Notebook:

%sql
-- Find the below Managed Location URL by going to Catalog >> Create New Catalog >> Storage Location
CREATE CATALOG IF NOT EXISTS first_catalog
MANAGED LOCATION 'abfss://unity-catalog-storage@dbstoragewe2nak3uyjbts.dfs.core.windows.net/3297083325245759'

There are several additional arguments that can be added when creating a catalog, which can be reviewed on the Databricks documentation website. The only argument we will discuss here is MANAGED LOCATION, which is required if your Databricks account does not have a metastore-level storage location specified. For demo and trial users of Databricks just learning the platform, you might not have metastore-level storage set up. We can work around this by finding the URL of our account's default Unity Catalog storage: navigate to Catalog in the left-hand sidebar, select Create New Catalog, and copy the default storage location shown there.
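
Once the catalog exists, you can sanity-check its metadata, including the storage root it was created with. The following is a minimal sketch (the output columns can vary by Databricks Runtime version):

%sql
-- Inspect catalog metadata such as owner and storage root
DESCRIBE CATALOG EXTENDED first_catalog;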

Schema Overview and Code

In Unity Catalog, a schema is a child of a catalog and can contain tables, views, volumes, models, and functions. A schema organizes data and AI assets into logical categories that are more granular than catalogs. Typically a schema represents a single use case, project, or team sandbox. Regardless of category type, schemas are a useful tool for managing data access control and improving data discoverability.

We can create a schema within the first Catalog that we set up earlier in this blog. Notice that two of the three components of the catalog.schema.table namespace are used in the command below.

%sql
CREATE SCHEMA IF NOT EXISTS first_catalog.first_schema
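
Because schemas are a common level at which to manage access control, a typical next step is granting privileges on the new schema. The following is a minimal sketch, assuming a group named data_analysts already exists in your workspace:

%sql
-- Let the (hypothetical) data_analysts group reach the catalog and read tables in the schema
GRANT USE CATALOG ON CATALOG first_catalog TO `data_analysts`;
GRANT USE SCHEMA ON SCHEMA first_catalog.first_schema TO `data_analysts`;
GRANT SELECT ON SCHEMA first_catalog.first_schema TO `data_analysts`;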

Volumes and Tables

While there are several objects that can sit below Schemas in Databricks, Volumes and Tables are the key objects for new users of the platform to understand.

While tables provide governance over tabular datasets, volumes add governance over non-tabular datasets. You can use volumes to store and access files in any format, including structured, semi-structured, and unstructured data. Another way to understand this is that volumes are the precursor to tables: the place where we might land bronze-level data and perform transformation and ETL steps (former Excel users, think Power Query). One example of semi-structured data that would need to be imported as a volume is JSON log data. Once imported into a volume, JSON data can be quickly converted to a table with spark.read functions. To create a volume, use the following code:

%sql
CREATE VOLUME IF NOT EXISTS first_catalog.first_schema.first_volume
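
To illustrate the JSON-to-table conversion mentioned above without leaving SQL, Databricks also provides a read_files table-valued function that can read files stored in a volume. The following is a minimal sketch, assuming JSON logs have already been uploaded to a hypothetical logs/ folder inside the volume:

%sql
-- Create a managed table from JSON files previously uploaded to the volume
CREATE TABLE IF NOT EXISTS first_catalog.first_schema.first_table AS
SELECT *
FROM read_files(
  '/Volumes/first_catalog/first_schema/first_volume/logs/',
  format => 'json'
);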

This has served as an introduction to setting up a preliminary data environment in Databricks. Check out the next blogs in this series for an overview of ingesting raw data from the internet (GitHub and HuggingFace) into the volume you created, and transforming the volume data into a table that we can perform AI and ML on.
