This is to share my experience around building an enterprise data platform powered by Databricks on AWS. It’s not about the data side of things but purely about the platform architecture and wider configuration aspects.
While Databricks provides its customers with many different capabilities and features to fulfill a spectrum of needs, there are still common elements and considerations every Databricks user has to make decisions on, regardless of individual requirements.
Just look out for the 💡 markers below, as you may find the information they highlight relevant or even get a head start with using Databricks on AWS.
Account and workspaces
It’s not about AWS accounts and Amazon Workspaces. When working with Databricks on AWS, you should quickly learn to be precise when discussing architectures and configurations to avoid unnecessary confusion about what’s what. You’ll see why… mark my words!
This time it’s about Databricks accounts and workspaces. With the Enterprise Edition (E2) model, Databricks introduced a highly scalable multi-tenant environment, a successor to previous deployment options that have been deprecated. By the way, it’s running on Kubernetes.
Here’s what it looks like on a very high level:
Source: Databricks
💡 The account console is hosted in the US West (Oregon) AWS region, while you choose which AWS region (15 currently) you want to have your workspaces deployed into.
A Databricks account is used to manage:
- Users and their access to objects and resources,
- Workspaces and cloud resources,
- Metastores — the top-level container for catalogs in Unity Catalog (💡 There can only be a single metastore per account per region),
- Other account-level settings like SSO, SCIM, security controls, various optional features, etc.
A Databricks workspace, on the other hand, is a Databricks deployment that can be considered an environment for data engineers to access Databricks assets. To configure a workspace, the following cloud-relevant information must be provided:
- Credentials ~ an IAM role
- Storage ~ an S3 bucket
- Storage CMK ~ a KMS key
- Network ~ a VPC with subnets and security group(s)
Now, it’s up to you how you want to manage the workspaces, but we went with a workspace per environment, meaning every SDLC environment is represented by a distinct Databricks workspace and its corresponding AWS account, having all the necessary services and resources configured accordingly. All that configuration was covered with IaC.
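To give an idea of how those four pieces come together, here's a minimal Terraform sketch using the account-level (MWS) resources of the Databricks provider. All names, ARNs, and region values are placeholders, and exact arguments may differ between provider versions, so treat it as an illustration rather than a ready-to-apply module.

```hcl
# Assumes a databricks provider configured against the account console
# (https://accounts.cloud.databricks.com) with your account ID.

resource "databricks_mws_credentials" "this" {
  credentials_name = "dev-credentials"               # placeholder
  role_arn         = aws_iam_role.cross_account.arn  # the cross-account IAM role
}

resource "databricks_mws_storage_configurations" "this" {
  account_id                 = var.databricks_account_id
  storage_configuration_name = "dev-root-storage"
  bucket_name                = aws_s3_bucket.root.bucket  # workspace root S3 bucket
}

resource "databricks_mws_customer_managed_keys" "storage" {
  account_id = var.databricks_account_id
  use_cases  = ["STORAGE"]
  aws_key_info {
    key_arn   = aws_kms_key.workspace.arn
    key_alias = aws_kms_alias.workspace.name
  }
}

resource "databricks_mws_networks" "this" {
  account_id         = var.databricks_account_id
  network_name       = "dev-network"
  vpc_id             = module.vpc.vpc_id
  subnet_ids         = module.vpc.private_subnets
  security_group_ids = [aws_security_group.databricks.id]
}

resource "databricks_mws_workspaces" "dev" {
  account_id                      = var.databricks_account_id
  workspace_name                  = "dev"
  aws_region                      = "eu-west-2"  # placeholder region
  credentials_id                  = databricks_mws_credentials.this.credentials_id
  storage_configuration_id        = databricks_mws_storage_configurations.this.storage_configuration_id
  network_id                      = databricks_mws_networks.this.network_id
  storage_customer_managed_key_id = databricks_mws_customer_managed_keys.storage.customer_managed_key_id
}
```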
Databricks REST APIs
Databricks exposes two APIs, the Account API and the Workspace API, and depending on which resource you're configuring, you interact with one or the other. Many of the methods are listed as Public Preview; however, when you go with IaC, you effectively have to treat them as Public/GA, as you have no choice but to use and rely on them.
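In practice, this split is reflected directly in your tooling. With Terraform, for instance, you typically end up with two provider configurations, one pointed at the account console and one at a workspace host (the aliases, account ID variable, and workspace URL below are purely illustrative):

```hcl
# Account API: account-level objects (workspaces, metastores, account settings).
provider "databricks" {
  alias      = "account"
  host       = "https://accounts.cloud.databricks.com"
  account_id = var.databricks_account_id
}

# Workspace API: workspace-level objects (clusters, jobs, grants, ...).
provider "databricks" {
  alias = "workspace"
  host  = "https://dbc-12345678-9abc.cloud.databricks.com"  # placeholder workspace URL
}
```

Each resource then picks the right API simply by referencing `provider = databricks.account` or `provider = databricks.workspace`.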
💡 Not everything makes perfect sense when it comes to what is managed through which API, but maybe that's just me. One example is the Artifacts Allowlist. It is configured with the Workspace API, yet the configuration is effectively global because it's tied to the Unity Catalog metastore, i.e., it applies to all workspaces in your account attached to that metastore. Now, having a workspace per environment, you may have different artifacts per environment that need to be whitelisted. In that case, you must bring them together and maintain them as a single list. Moreover, when interacting with the Workspace API, you must provide a workspace host URL. Say you don't have any workspaces yet or, just like in our case, you have workspaces representing distinct environments, and you want to set up the allow list — which workspace URL would you use to configure that global setting?
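For what it's worth, here's roughly how such an aggregated allowlist can be expressed in Terraform with the `databricks_artifact_allowlist` resource (available in recent provider versions); the variables and volume paths are made up for illustration:

```hcl
# One global allowlist per metastore, assembled from per-environment entries.
locals {
  allowed_init_script_prefixes = concat(
    var.dev_init_script_prefixes,  # e.g. ["/Volumes/dev/ops/init_scripts/"]
    var.prd_init_script_prefixes,  # e.g. ["/Volumes/prd/ops/init_scripts/"]
  )
}

resource "databricks_artifact_allowlist" "init_scripts" {
  artifact_type = "INIT_SCRIPT"

  dynamic "artifact_matcher" {
    for_each = toset(local.allowed_init_script_prefixes)
    content {
      artifact   = artifact_matcher.value
      match_type = "PREFIX_MATCH"
    }
  }
}
```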
Architecture
On a high level, the following diagram visualizes what the core components are and how they are spread across the AWS accounts, where one belongs to Databricks and another one belongs to the customer.
As you can imagine, making both sides interact securely relies on trust between the two AWS accounts, which is established with a cross-account IAM role that is granted permissions to launch clusters or manage data in S3 in the customer AWS account. It's always the same 414351767826 AWS account and the arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL IAM role that will keep popping up in Security Hub findings if not properly marked as trusted. When using serverless compute, there's yet another combination of an AWS account ID and IAM role.
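As an illustration, the trust relationship of that cross-account role in the customer account typically looks something like the sketch below (following the pattern from the Databricks onboarding docs; the role name is a placeholder and the external ID is your Databricks account ID, so double-check the current policy in the official documentation):

```hcl
# Cross-account role assumed by the Databricks control plane to launch
# clusters (EC2 instances) in the customer AWS account.
data "aws_iam_policy_document" "databricks_assume_role" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::414351767826:root"]  # Databricks AWS account
    }

    condition {
      test     = "StringEquals"
      variable = "sts:ExternalId"
      values   = [var.databricks_account_id]  # your Databricks account ID
    }
  }
}

resource "aws_iam_role" "databricks_cross_account" {
  name               = "databricks-cross-account"  # placeholder name
  assume_role_policy = data.aws_iam_policy_document.databricks_assume_role.json
}
```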
Then, at the network layer, there are whitelisting mechanisms, covered in the following sections down below.
Authentication and access control
While Databricks comes with its own user store, it's a common pattern to leverage SSO and configure Databricks with an IdP that is already used in your organization. The unified login option, now enabled by default, lets you manage a single SSO configuration in your account that is used for both the account console and the Databricks workspaces.
💡 Unified login does not yet support workspaces with public access completely disabled; the only way around that is to contact Databricks to have unified login disabled for your account and then configure SSO on a per-workspace basis.
The SCIM provisioning feature allows you to sync groups and users from your IdP, such as Microsoft Entra ID, and use them in Databricks to grant permissions while following least-privilege principles. Here, make sure you grant permissions to groups and not to individual user accounts.
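For example, a Unity Catalog grant in Terraform would then reference a SCIM-synced group rather than a user; the catalog and group names here are made up:

```hcl
# Grant catalog privileges to an IdP-synced group, never to individuals.
resource "databricks_grants" "analytics_catalog" {
  catalog = "analytics"  # placeholder catalog name

  grant {
    principal  = "data-engineers"  # SCIM-synced group from the IdP
    privileges = ["USE_CATALOG", "USE_SCHEMA", "SELECT"]
  }
}
```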
Service principals, on the other hand, must be managed locally in Databricks, although they can still be members of the SCIM-synced groups.
For automation, service principals can use personal access tokens (PATs) or OAuth, and it's recommended to go with the latter wherever you can. Make sure you verify the available authentication types for your use case.
💡 For example, currently, Qlik does not support OAuth for Databricks hosted in AWS, while it does for Databricks hosted in Azure.
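A locally managed service principal with an OAuth (machine-to-machine) secret can also be kept in Terraform, roughly like this (account-level resources; the display name is arbitrary):

```hcl
# Service principal for automation, authenticating with OAuth instead of a PAT.
resource "databricks_service_principal" "etl" {
  display_name = "etl-automation"  # placeholder
}

# OAuth secret used for machine-to-machine authentication.
resource "databricks_service_principal_secret" "etl" {
  service_principal_id = databricks_service_principal.etl.id
}
```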
Networking
It's important to understand that in the E2 Databricks deployment model, the control plane is in the hands of Databricks, while you use their web applications to access the account console and your workspaces. That architecture requires all those endpoints to be accessible over the public Internet or, more precisely, publicly resolvable.
For the Databricks account console endpoint, there's only a single option for restricting access at the network level, and that is the IP access list.
For the workspace endpoints, apart from the IP access list, you can leverage AWS PrivateLink. Whether you use it for front-end access is up to your requirements, but you should definitely use it for back-end access, i.e., secure cluster connectivity. We went for both!
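As a reference, a workspace-level IP access list is a couple of lines of Terraform, while the account-level one is configured in the account console or through the Account API. The CIDR ranges below are obviously placeholders:

```hcl
# IP access lists must be enabled on the workspace first.
resource "databricks_workspace_conf" "this" {
  custom_config = {
    "enableIpAccessLists" = true
  }
}

# Only the listed CIDR ranges can reach the workspace from public IP addresses.
resource "databricks_ip_access_list" "corporate" {
  label        = "corporate-egress"
  list_type    = "ALLOW"
  ip_addresses = ["203.0.113.0/24", "198.51.100.10/32"]

  depends_on = [databricks_workspace_conf.this]
}
```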
Private-only connectivity
The following diagram nicely visualizes the concept.
Source: Databricks
You can see that Databricks exposes two APIs privately (the SCC Relay API and the Workspace API) through VPC endpoint services, which you connect to using the VPC Interface Endpoints you configure in your own VPCs and whitelist in the Databricks account console.
💡 The Workspace endpoint is not only used for front-end access but also for REST API and ODBC/JDBC connections; hence, you must realize that by closing the front-end door, you're restricting all types of connectivity to only whitelisted networks.
💡 VPC Interface Endpoints don't have to live in the VPC your workspace is configured with. This means that as long as a VPC Interface Endpoint is whitelisted (by ID), it can belong to any AWS account and any VPC, which enables centralization. However, you can still control which endpoints can be used to access a given workspace.
💡 Whether or not you disable public access to your workspace, the access verification process is always the same and starts with authentication. Yes, the IP access list and AWS PrivateLink checks come second, and the IP access list only applies when accessing workspaces from public IP addresses. All of that stems from the Databricks control plane architecture.
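Putting the PrivateLink pieces together, registering your VPC Interface Endpoints and locking a workspace down to them looks roughly like this in Terraform (endpoint references, names, and the region are placeholders; the per-region VPC endpoint service names come from the Databricks documentation):

```hcl
# Register the customer-managed VPC Interface Endpoints with the Databricks account.
resource "databricks_mws_vpc_endpoint" "workspace" {
  account_id          = var.databricks_account_id
  vpc_endpoint_name   = "front-end-and-rest-api"
  aws_vpc_endpoint_id = aws_vpc_endpoint.workspace.id  # endpoint to the Workspace service
  region              = "eu-west-2"
}

resource "databricks_mws_vpc_endpoint" "relay" {
  account_id          = var.databricks_account_id
  vpc_endpoint_name   = "scc-relay"
  aws_vpc_endpoint_id = aws_vpc_endpoint.relay.id  # endpoint to the SCC Relay service
  region              = "eu-west-2"
}

# Disable public access and restrict the workspace to the whitelisted endpoint(s).
resource "databricks_mws_private_access_settings" "this" {
  private_access_settings_name = "private-only"
  region                       = "eu-west-2"
  public_access_enabled        = false
  private_access_level         = "ENDPOINT"
  allowed_vpc_endpoint_ids     = [databricks_mws_vpc_endpoint.workspace.vpc_endpoint_id]
}

# Back-end (secure cluster connectivity) endpoints are attached to the network object.
resource "databricks_mws_networks" "private" {
  account_id         = var.databricks_account_id
  network_name       = "private-network"
  vpc_id             = module.vpc.vpc_id
  subnet_ids         = module.vpc.private_subnets
  security_group_ids = [aws_security_group.databricks.id]

  vpc_endpoints {
    rest_api        = [databricks_mws_vpc_endpoint.workspace.vpc_endpoint_id]
    dataplane_relay = [databricks_mws_vpc_endpoint.relay.vpc_endpoint_id]
  }
}
```

The resulting private access settings and network objects are then referenced from the workspace resource shown earlier.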
Delta Sharing
Delta Sharing is an open protocol for data sharing across data, analytics, and AI, developed and implemented by Databricks. Among other things, it allows data to be shared between Databricks customers who have their own accounts and workspaces.
💡 Delta Sharing happens privately over the AWS backbone and the Databricks AWS account, which means that going fully private, i.e., disabling public access to your workspaces, does not affect the ability to use this feature.
Analytics tools
Nowadays, when every tool has its cloud-based option, various analytics tools like Power BI or Qlik Sense are no exception. I don't know all of them, but at least these two provide a solution for establishing private connectivity between them and Databricks. That is with the use of gateways, the Power BI Gateway and the Qlik Data Gateway respectively. Such a gateway must be deployed in a VPC in the customer AWS account in a way that it can connect to Databricks privately through the front-end VPC Interface Endpoint, while the connection to the analytics tool itself is established from the gateway using encryption and whitelisting mechanisms, so no endpoint on the AWS side is exposed publicly.
AWS Graviton
Why pay more if you can pay less?
💡 AWS Graviton-powered EC2 instances are supported for Databricks clusters; however, limitations may make them unsuitable for some use cases, so make sure you know what those limitations are before you calculate your ARR on the assumption that you can use them everywhere for anything.
Logging and monitoring with CloudWatch
To make sure all relevant logs go to CloudWatch, where they can be retained and analyzed (especially since cluster nodes are usually transient), and also to make use of non-default EC2 performance metrics, you can leverage cluster-scoped init scripts to install and configure the CloudWatch agent.
Those init scripts must be whitelisted with the aforementioned Artifacts Allowlist before they can be used.
💡 Considering that the compute plane can be configured with an instance profile, you can leverage that fact to configure a custom logger and, with properly set-up IAM policies, stream logs directly to custom log groups.
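A cluster definition wiring those pieces together could look roughly like this; the runtime version, node type, volume path, and instance profile ARN are all placeholders, and the init script itself is what installs and configures the CloudWatch agent:

```hcl
resource "databricks_cluster" "etl" {
  cluster_name            = "etl-cluster"       # placeholder
  spark_version           = "15.4.x-scala2.12"  # pick a current LTS runtime
  node_type_id            = "m6gd.xlarge"       # Graviton instance type, where supported
  autotermination_minutes = 30

  autoscale {
    min_workers = 1
    max_workers = 4
  }

  aws_attributes {
    # Instance profile granting access to CloudWatch Logs / custom log groups.
    instance_profile_arn = "arn:aws:iam::123456789012:instance-profile/databricks-cloudwatch"
  }

  # Cluster-scoped init script installing the CloudWatch agent.
  # Its path must be covered by the Artifacts Allowlist mentioned above.
  init_scripts {
    volumes {
      destination = "/Volumes/ops/default/init_scripts/install-cloudwatch-agent.sh"
    }
  }
}
```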
Databricks also produces audit logs that can be stored in S3. From there they can be pushed to CloudWatch (or elsewhere) for analysis and event management purposes.
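The delivery itself can be set up with the account-level Terraform resources, roughly as sketched below; the config name, path prefix, and the referenced credentials and storage configuration are placeholders:

```hcl
# Account-level audit log delivery to an S3 bucket in the customer account.
resource "databricks_mws_log_delivery" "audit" {
  account_id               = var.databricks_account_id
  config_name              = "audit-logs"
  log_type                 = "AUDIT_LOGS"
  output_format            = "JSON"
  credentials_id           = databricks_mws_credentials.log_delivery.credentials_id
  storage_configuration_id = databricks_mws_storage_configurations.audit_logs.storage_configuration_id
  delivery_path_prefix     = "audit-logs"
}
```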
Caveats 💡
Here are some things it’s good to know sooner rather than later to avoid headaches or other surprises when using Databricks (in general):
- It’s constantly evolving, so pay attention to the release notes.
- For the same reason, the documentation is sometimes unclear, as it tries to cover both the new things and features that have since been deprecated but are still in use by some customers.
- Not every feature is available in all regions so make sure you know the coverage before picking one. For example, currently, serverless compute features are not yet available in the London region.
- Make sure you know the difference between roles, entitlements, permissions, and grants — it can become confusing sometimes.
- Using SQL to grant a given principal permissions on Databricks objects requires running compute resources, as every statement must be executed somewhere. Don't expect to issue direct queries to the control plane; it's not Kubernetes.
- Follow Databricks security best practices and analyze carefully any weaknesses in your architecture that allow for data exfiltration. For example, make sure any egress traffic is controlled and inspected by using AWS Network Firewall or any other next-generation firewall solution.
- In case of issues or doubts, don’t hesitate to contact Databricks and get a hold of a Solution Architect who can help you understand things and even give a hint on how other customers go about some aspects.
- When using Terraform, refer to the guides available in the registry; just don't use them blindly, and adapt them to your design and standards.
- It's not always a good idea to manage Databricks classic compute clusters themselves with IaC unless their configuration can remain unchanged. Instead, it's worth keeping their definitions in source control and deploying them with CI/CD running the Databricks CLI, while still managing all their dependencies in IaC.
Finally, I'm not planning to keep this post updated with feature availability changes, so everything here can only be considered accurate at the time of its publication 😉
Lastly…
There’s more to comprehend for a platform solution architect than you can imagine. To make it right, I’d suggest having an AWS expert working alongside a data architect, as that’s the best way to get all the different requirements and best practices considered and implemented properly.
To help with that journey, you can have a look at the Databricks certification program and Partner Academy to help you learn Databricks and optionally even get certified or achieve an accreditation.