
Tips for your Databricks development workflow

Francisco Ruiz A ・ 3 min read

As a data engineer I'm always looking for ways to automate things and be more efficient in my day-to-day development workflow. Here are some tips that might help improve yours.

1. Sync your Notebooks between Databricks and Git

You can integrate with GitHub or Azure DevOps directly from your Databricks workspace, but what if you work with multiple git remote origins? Or maybe your data project also includes other stacks, such as Apache Airflow or scripting files, and you are working on a feature that spans all of them. Databricks has an excellent command line interface that exposes a great set of API endpoints to help you manage this stage of your development workflow.

Exporting your Databricks workspace into your local branch:

$ databricks workspace export_dir /Users/example@databricks.com/notebooks .

Let's save a few keystrokes with a bash script, download_notebooks.sh, that leverages your local git configuration:

#!/bin/bash
MSYS_NO_PATHCONV=1 databricks workspace export_dir -o /Users/$(git config user.email | awk '{print tolower($0)}')/notebooks .

Now you can easily run it:

$ ./download_notebooks.sh
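The interesting piece of the script is how the workspace folder is derived from your git identity. Here is that derivation in isolation, with a sample address standing in for the output of `git config user.email`:

```shell
# Lowercase the configured email, exactly as the awk pipeline in the script does
email=$(printf 'Example@Databricks.com' | awk '{print tolower($0)}')
echo "/Users/${email}/notebooks"
# → /Users/example@databricks.com/notebooks
```

Lowercasing matters because Databricks workspace user folders are keyed by the email address, while local git configs often carry mixed case.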

Do the reverse to upload your notebooks to your Databricks workspace with upload_notebooks.sh:

#!/bin/bash
MSYS_NO_PATHCONV=1 databricks workspace import_dir -o . /Users/$(git config user.email | awk '{print tolower($0)}')/notebooks

Then run:

$ ./upload_notebooks.sh

This gives you the flexibility to commit and push other file changes you may have in your feature branch up to your origin.

2. Parallel automated tests

You write automated tests for your workloads, right? :-)

Let's say you have three test notebooks covering different parts of your workload. Typically you would have one devops/cd pipeline task per test notebook (because you run them in an integration environment), and these tasks run in sequence: assuming each test notebook takes 5 minutes, your test pipeline takes 15 minutes in total.

You can run these test Notebooks in parallel using the following:

from concurrent.futures import ThreadPoolExecutor

class TestNotebook:
  def __init__(self, path, timeout = 0, parameters = None, retry = 0):
    self.path = path
    self.timeout = timeout
    self.parameters = parameters
    self.retry = retry

def submitNotebook(notebook):
  print("Running notebook %s" % notebook.path)
  try:
    if notebook.parameters:
      return dbutils.notebook.run(notebook.path, notebook.timeout, notebook.parameters)
    else:
      return dbutils.notebook.run(notebook.path, notebook.timeout)
  except Exception:
    if notebook.retry < 1:
      raise
    print("Retrying notebook %s" % notebook.path)
    notebook.retry = notebook.retry - 1
    return submitNotebook(notebook)

def parallelNotebooks(notebooks, numInParallel):
  '''
  If you submit too many notebooks at once the driver may crash.
  This bounded thread pool limits the number of notebooks running in parallel.
  '''
  with ThreadPoolExecutor(max_workers = numInParallel) as ec:
    return [ec.submit(submitNotebook, notebook) for notebook in notebooks]

Then, have your test pipeline run the following:

test_notebooks = [
  TestNotebook("test_notebook_1", timeout = 0, parameters = None, retry = 0),
  TestNotebook("test_notebook_2", timeout = 0, parameters = None, retry = 0),
  TestNotebook("test_notebook_3", timeout = 0, parameters = None, retry = 0),
]

results = parallelNotebooks(test_notebooks, 3)
# .result() blocks until each notebook finishes and re-raises any test failure
[result.result() for result in results]

And now your tests finish in 5 minutes ... or so :-)
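You can see the same pattern outside Databricks with plain functions standing in for notebooks. A minimal sketch, using short sleeps in place of the 5-minute test runs (`fake_test` is a hypothetical stand-in for `dbutils.notebook.run`):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_test(name, seconds):
    """Hypothetical stand-in for dbutils.notebook.run: sleep, then report success."""
    time.sleep(seconds)
    return f"{name}: passed"

start = time.monotonic()
with ThreadPoolExecutor(max_workers=3) as ec:
    # Submit all three "tests" at once; the pool runs them concurrently
    futures = [ec.submit(fake_test, f"test_notebook_{i}", 0.2) for i in (1, 2, 3)]
    results = [f.result() for f in futures]
elapsed = time.monotonic() - start

print(results)
print(f"wall-clock: {elapsed:.2f}s")  # close to one test's duration, not three
```

Because the three sleeps overlap, the wall-clock time is roughly that of a single test, which is exactly the saving the notebook version gives you.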

3. Cluster environment variables

Have you ever had file "conflicts" because other people on your team were manipulating the same files? Let's say you and your teammates are working in the same data lake container and doing I/O on the same files. Nightmare!

You can use a "development" location without mangling the official location variables in your code base by adding an environment variable to your cluster. Databricks makes this easy: go to your cluster configuration's advanced options and set the following under Environment Variables:

DEV_DATA=/john

Then on your workload configuration do:

import os

try:
  LAKE_PATH = LAKE_PATH + os.environ["DEV_DATA"]
except KeyError:
  pass # the env var is not set (e.g. on production clusters); keep the official path

Now you have the flexibility to work on a dev location in your lake without impacting others.
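The same fallback can be written without the try/except by using `os.environ.get` with a default. A minimal sketch, where the base path `/mnt/lake/sales` is a hypothetical placeholder for your official location:

```python
import os

BASE_LAKE_PATH = "/mnt/lake/sales"  # hypothetical official location

def lake_path():
    # Append the per-developer suffix only when the cluster defines DEV_DATA;
    # on clusters without it (e.g. production) the official path is returned.
    return BASE_LAKE_PATH + os.environ.get("DEV_DATA", "")

os.environ.pop("DEV_DATA", None)   # simulate a production cluster
print(lake_path())                 # /mnt/lake/sales

os.environ["DEV_DATA"] = "/john"   # simulate a dev cluster
print(lake_path())                 # /mnt/lake/sales/john
```

The default of an empty string means production code paths never change, while any cluster that defines the variable transparently redirects I/O to the developer's own folder.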

I hope these tips help a bit or a lot. Do you have other tips to share? Let me know in the comments, I'm always keen to hear what other teams are doing.

Posted on Jun 4 by Francisco Ruiz A (@fruiza)

Applying Software Engineering Practices to Big Data