Jesse P. Johnson

Applying DevSecOps within Databricks

Databricks is a data processing platform that combines both the processing and storage of data to support many business use cases. Traditionally this has been the role of data warehousing, but it can also include Business Intelligence (BI) and Artificial Intelligence (AI). Implementing these use cases, though, requires access to many data sources and software libraries. This combination of data and processing capabilities makes Databricks a target for exploits if proper precautions are not taken. This article delves into the evolving landscape of DevSecOps within Databricks.

Note: This document will discuss PySpark as it is the most commonly used library within Databricks.

Attack Vectors

Databricks primarily uses the powerful Extract, Transform, and Load (ETL/ELT) pattern along with the medallion architecture to simultaneously process and store data. This makes it much more robust than an average data layer, but it can also expose the system to a wider variety of exploits than other designs. There are three attack vectors related to development within this design that I will discuss.

The first potential issue with this design is that it is intended to allow consumption of data from nearly any source. This differs from many data stores, where the source data is usually local. ETL/ELT designs primarily center around big data and handling large amounts of it, including data from third parties or even untrusted sources. The more sources there are, the more likely an exploit becomes.

The second potential issue is all the powerful capabilities Databricks can provide through its orchestration of clusters. Each of these capabilities utilizes additional libraries, thereby increasing access to both a larger amount of data and more processing capability. In contrast, a typical web application is split into multiple separate layers to provide separation of concerns. By siloing data access, business logic, and presentation, each layer relies on a much smaller number of libraries and limits what each library within that layer can access. This layered approach simplifies development and improves security when done correctly.

Finally, because of the inherent power and capabilities that Databricks provides, developers are afforded a very high level of trust and power. Insider threats are just as dangerous as those outside an organization, if not more so. If developers are given too much access, it could be misused and/or abused. Depending on the size and scope of that access, the potential damage could be quite large and extensive.

Continuous Integration / Continuous Deployment (CI/CD)

The first important development milestone, I believe, is establishing a CI/CD process. So, we'll start there.

Development in Databricks mostly revolves around the use of developer notebooks that can utilize many libraries. Notebooks provide a powerful Graphical User Interface (GUI); they are modular, executable, and can be combined into workflows. The development of these notebooks, however, doesn't fit neatly into the DevOps ecosystem. This section will cover the phases of implementing CI/CD laid out by Databricks here and explain how to integrate DevSecOps practices to secure your data pipelines.

Store

Modern software development centers around the use of a Version Control System (VCS). This practice allows multiple collaborators to develop features and propose changes to the main code branch. It enables efficient management of changes over time while maintaining the overall health of the codebase. The VCS is most often Git, but there are a few outliers such as Mercurial or even Subversion.

Databricks provides two implementations for VCS integration: Git Repos and Git Folders. Git Repos has limited integration with only a few providers and has been replaced by Git Folders. With Git Folders, a provider such as GitHub or GitLab can then be set up to track changes to the workflow being developed.

Note: I won't cover this step here, as there are too many considerations.

Code

Once the VCS is set up, development can be moved from the Databricks GUI to a local IDE environment. This provides a much-improved development experience along with additional integrations for SAST and SCA from systems such as SonarQube, Checkmarx, or Fortify.

Install Spark locally (macOS):

brew install --cask visual-studio-code
brew install openjdk@11 python@3.11 apache-spark
pip3 install jupyterlab pyspark

# Homebrew installs Spark under its own prefix; libexec is the Spark home
echo "export SPARK_HOME=$(brew --prefix apache-spark)/libexec" >> ~/.bashrc
echo "export PYSPARK_PYTHON=$(brew --prefix python@3.11)/bin/python3.11" >> ~/.bashrc

code --install-extension ms-toolsai.jupyter

Note: The above example uses VSCode, but any IDE that supports plugins for Jupyter notebooks will most likely work.
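
To verify the local setup, a quick PySpark smoke test (a minimal sketch; the session and column names are arbitrary) should run entirely locally without any cluster:

from pyspark.sql import SparkSession

# Build a throwaway local session and show a one-row dataframe
spark = SparkSession.builder.master("local[1]").appName("smoke-test").getOrCreate()
spark.createDataFrame([(1, "ok")], ["id", "status"]).show()
spark.stop()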

Build

The main capability that makes the CI/CD process worthwhile with Databricks is the new inclusion of Databricks Asset Bundles (DAB). A DAB is a type of Infrastructure as Code (IaC) that can provision notebooks, libraries, workflows, data pipelines, and infrastructure.

It is recommended that a DAB first be built locally and then ported to CI/CD. This also helps when troubleshooting deployments.

brew tap databricks/tap
brew install databricks

databricks auth login \
  --host <account-console-url> \
  --account-id <account-id>

databricks bundle init

After the process has been validated to function with your cluster setup, a CI/CD process can be put in place.

Example DAB configuration to package up an example users pipeline:

---
targets:
  dev:
    mode: development
  prod:
    mode: production
    git:
      branch: main
resources:
  ...
  pipelines:
    users_pipeline:
      name: test-pipeline-{{ .unique_id }}
      libraries:
        - notebook:
            path: ./users.ipynb
      development: true
      catalog: main
      target: ${resources.schemas.users_schema.id}
  schemas:
    users_schema:
      name: test-schema-{{ .unique_id }}
      catalog_name: main
      comment: This schema was created by DABs.

Example GitLab CI build job:

---
build-dab:
  image:
    name: ghcr.io/databricks/cli:v0.218.0
    entrypoint: ['']
  stage: build
  variables:
    DATABRICKS_HOST: "$DATABRICKS_HOST_ENVAR"
    DATABRICKS_TOKEN: "$DATABRICKS_TOKEN_ENVAR"
  script:
    - /app/databricks --workdir ./ bundle deploy

Note: Make sure to sign all packages including these. See your respective CI/CD environment for details.

Deploy

Here the DAB built in the previous section (or by the CI job) can be deployed to a development environment for testing.

---
deploy-dab-dev:
  image:
    name: ghcr.io/databricks/cli:v0.218.0
    entrypoint: ['']
  stage: deploy
  variables:
    DATABRICKS_HOST: "$DATABRICKS_HOST_ENVAR"
    DATABRICKS_TOKEN: "$DATABRICKS_TOKEN_ENVAR"
  script:
    - /app/databricks --workdir ./ bundle deploy -t dev

Test

Unit and integration testing are pivotal practices in software engineering. There are two ways to perform unit testing within Databricks. The first revolves around using one notebook to test another. The second requires packaging the source as a library and running pytest normally.
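
As a quick illustration of the second approach, a minimal pytest sketch might look like the following (the users module and its add_full_name transformation are hypothetical placeholders for your packaged library code):

# test_users.py - unit test for a packaged transformation
import pytest
from pyspark.sql import SparkSession

from users import add_full_name  # hypothetical function under test


@pytest.fixture(scope="session")
def spark():
    # Local SparkSession; no Databricks cluster required for unit tests
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()


def test_add_full_name(spark):
    df = spark.createDataFrame([("Ada", "Lovelace")], ["first", "last"])
    result = add_full_name(df)
    assert result.collect()[0]["full_name"] == "Ada Lovelace"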

Note: This is relatively easy to figure out, so I will skip a deeper treatment for now and decide if I need to elaborate more at a later time.

Static Application Security Testing (SAST)

Implementing SAST should utilize some form of source scanning that is able to pick up vulnerabilities in your code. Additionally, this should include secrets scanning. Don't skip scanning Infrastructure as Code (IaC) either.

---
sast:
  image: "$SF_PYTHON_IMAGE"
  stage: test
  before_script:
    - pip install semgrep==1.101.0
  script:
    - |
      semgrep ci \
        --config=auto \
        --gitlab-sast \
        --no-suppress-errors

Software Composition Analysis (SCA)

Vulnerabilities within the supply chain have been some of the most devastating in recent memory. Implementing SCA helps secure the environment from libraries with known vulnerabilities. This works in tandem with a cyber team that performs risk analysis on which packages are even allowed in the environment.

Unfortunately, this stage really should use a manifest to determine if there are any vulnerable dependencies, and currently no such file is created. To complete this task, it would be possible to search out the pre-processing commands that install packages from PyPI, Maven, or CRAN. That is unfortunately outside the scope of this article; just be aware that it is a requirement.
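
As a starting point, a rough sketch (the ./src path and output filename are assumptions) could scrape %pip install lines from exported notebook source and emit a requirements file for an SCA scanner to consume:

# Collect `%pip install` lines from exported notebook source files and
# write a requirements manifest for the SCA scanner.
import re
from pathlib import Path

PIP_LINE = re.compile(r"^\s*%pip install\s+(.+)$", re.MULTILINE)

packages = set()
for notebook in Path("./src").rglob("*.py"):
    for match in PIP_LINE.findall(notebook.read_text()):
        packages.update(match.split())

Path("requirements-notebooks.txt").write_text("\n".join(sorted(packages)) + "\n")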

Note: See Clean and Validate Data for why this should be considered a good thing.

Run

This step is similar to the deploy stage covered before. The main difference is that there is also a run associated with it. This stage represents the Continuous Deployment (CD) phase.

---
run-dab-prod:
  image:
    name: ghcr.io/databricks/cli:v0.218.0
    entrypoint: ['']
  stage: deploy
  variables:
    DATABRICKS_HOST: "$DATABRICKS_HOST_ENVAR"
    DATABRICKS_TOKEN: "$DATABRICKS_TOKEN_ENVAR"
  script:
    - /app/databricks --workdir ./ bundle deploy -t prod
    - /app/databricks --workdir ./ bundle run -t prod

Monitor

The last DevSecOps capability I will discuss is the Security Analysis Tool (SAT) provided by Databricks. This tool provides a very useful feature to track the command logs executed by notebooks. When this is implemented, tracing what happened and how becomes much easier.

See here for additional details.

Warning: The example utilizes Pyre instead of Mypy, which conflicts with one of the tools I would like to suggest for a different purpose.

Iterate

I think almost anyone willing to get this far in this article is probably aware of the Software Development Lifecycle (SDLC). An SDLC is a well-established process for developing high-quality software. Development doesn't need to move from notebooks to a local IDE, or even have CI/CD set up, to implement this process, but they do complement each other. If anything, I would recommend spending time to determine whether your team is hierarchical in nature or flat. If it is the former, I would recommend implementing just Scrum; if it is the latter, I would recommend looking into Extreme Programming (XP).

Clean and Validate Data

One of the primary uses of Databricks is to clean and validate data utilizing the aforementioned medallion architecture. The system is, however, designed to support a polyglot of languages. To do so it provides various types of dataframes (typically backed by Parquet) that utilize a common set of primitive and complex types. This type system is then organized through the use of a schema to help structure these dataframes. The schema can be either manually specified or automatically inferred. It ensures the data loaded into a dataframe matches the declared types but provides no additional validation capabilities.

Example schema using PySpark:

from pyspark.sql.types import (
    IntegerType,
    StringType,
    StructField,
    StructType,
)

user_schema = StructType(
    [
        StructField('name', StringType(), False),
        StructField('age', IntegerType(), True),
    ]
)

dataframe = (
    spark
    .read
    .schema(user_schema)
    .option("header", "false")
    .option("mode", "DROPMALFORMED")
    .csv("users.csv")
)


Additional validation can be provided through multiple third-party modules. It is also desirable to be able to share this schema with any middleware and/or presentation layer where possible. The most popular libraries for this task are typically pydantic and marshmallow, but neither supports dataframes natively. There are two promising libraries that extend this style of validation to dataframes: great_expectations and pandera. I will review pandera here, as I have not been able to get any version of great_expectations to pass a cyber review (possibly due to the mistune dependency utilizing regex to parse Markdown the Cthulhu Way).

%pip install pandera==0.22.1

# pandera's pyspark integration validates Spark dataframes directly;
# the pandas-oriented DataFrameModel will not accept them
import pyspark.sql.types as T
from pandera.pyspark import DataFrameModel, Field

class UserModel(DataFrameModel):
    name: T.StringType() = Field(nullable=False)
    age: T.IntegerType() = Field(ge=16, le=125)

# With the pyspark backend, validation errors are collected on the
# returned dataframe rather than raised immediately
validated = UserModel.validate(dataframe)
print(validated.pandera.errors)

Using Parameterized Queries

Data persistence is provided in Databricks via the SQL Warehouse (formerly SQL endpoint) capability. In earlier versions, concatenation and interpolation were the only approaches available for passing parameters to SQL queries. This is unfortunately the primary cause of SQL injection attacks and is considered bad security practice regardless of the language in which it is implemented. The attack is possible due to the recursive nature of SQL statements and the mishandling of untrusted inputs.

There are two ways to mitigate this attack. The first would be to sanitize all inputs. This is still considered insufficient, though, and more of a naive approach: there is no guarantee that some unsanitized input won't still be processed incorrectly by a statement susceptible to SQL injection. The preferred approach is to use prepared statements, which are now supported by Databricks (or Named Parameter Markers if using pure SQL).

Example parameterized query using PySpark:

query = "SELECT * FROM example_table WHERE id = {id};"
spark.sql(query, id=1).show()
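Named parameter markers follow a similar pattern; a small sketch (assuming Spark 3.4+, which current Databricks runtimes include) binds the values through the args mapping instead:

# Named parameter markers (:id) with values bound via the args mapping
query = "SELECT * FROM example_table WHERE id = :id"
spark.sql(query, args={"id": 1}).show()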
