<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Francisco Ruiz A</title>
    <description>The latest articles on DEV Community by Francisco Ruiz A (@fruiza).</description>
    <link>https://dev.to/fruiza</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F168961%2Fe6330d6f-8f4a-45d6-b577-64747c3e0c9f.jpg</url>
      <title>DEV Community: Francisco Ruiz A</title>
      <link>https://dev.to/fruiza</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fruiza"/>
    <language>en</language>
    <item>
      <title>Build a Python Flask API for your Delta Lake</title>
      <dc:creator>Francisco Ruiz A</dc:creator>
      <pubDate>Thu, 24 Dec 2020 16:55:03 +0000</pubDate>
      <link>https://dev.to/fruiza/build-a-python-flask-api-for-your-delta-lake-3bgf</link>
      <guid>https://dev.to/fruiza/build-a-python-flask-api-for-your-delta-lake-3bgf</guid>
      <description>&lt;p&gt;After the recent announcement on the Databricks blog about &lt;a href="https://databricks.com/blog/2020/12/22/natively-query-your-delta-lake-with-scala-java-and-python.html" rel="noopener noreferrer"&gt;querying your Delta Lake natively with Python&lt;/a&gt; (and other languages) without Apache Spark, I got curious about what a Flask API endpoint would look like, so here it is.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from deltalake import DeltaTable
from flask import request, jsonify

app = flask.Flask(__name__)
app.config["DEBUG"] = True

@app.route('/read-delta-table', methods=['GET'])
def home():
    dt = DeltaTable("/tmp/delta/students-delta-table/")

    pd = dt.to_pyarrow_dataset().to_table().to_pandas()

    json_str = pd.to_json(orient = "records")

    parsed = json.loads(json_str)  

    return jsonify(parsed)

app.run()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://gist.github.com/francisco-ltech/05e944e52f7ca5c51bce927773a91dd0#file-deltalake-api-endpoint-py" rel="noopener noreferrer"&gt;Gist&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running the API&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fraw797x8lskcvnw6zsuz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fraw797x8lskcvnw6zsuz.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-requisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;To compile the code you need the nightly version of Rust&lt;/p&gt;

&lt;p&gt;[to install]&lt;br&gt;
&lt;code&gt;$ rustup toolchain install nightly&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;[to use]&lt;br&gt;
&lt;code&gt;$ cd ~/projects/needs-nightly&lt;/code&gt;&lt;br&gt;
&lt;code&gt;$ rustup override set nightly&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use the maturin package to build the .whl&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ pip install maturin&lt;/code&gt;&lt;br&gt;
&lt;code&gt;$ maturin build&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is still an experimental interface to Delta Lake for Rust with native bindings for Python, so proceed with caution: you wouldn't want to expose an ocean of data through an endpoint.&lt;/p&gt;
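&lt;p&gt;One simple mitigation, sketched below with a hypothetical &lt;code&gt;page&lt;/code&gt; helper that is not part of the endpoint above, is to cap how many records a single request can return:&lt;/p&gt;

```python
def page(records, offset=0, limit=100):
    # Hypothetical guard: cap how many rows one request can return
    return records[offset:offset + limit]

# Made-up rows standing in for the records produced by to_json(orient="records")
rows = [{"id": i} for i in range(1000)]
first_page = page(rows)
second_page = page(rows, offset=100)
```

&lt;p&gt;The endpoint could read &lt;code&gt;offset&lt;/code&gt; and &lt;code&gt;limit&lt;/code&gt; from query parameters before calling &lt;code&gt;jsonify&lt;/code&gt;.&lt;/p&gt;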

&lt;p&gt;I'm excited about this project; being able to query Delta tables from front-end apps (without going via Apache Spark) was a missing piece in the Delta Lake puzzle.&lt;/p&gt;

&lt;p&gt;Fantastic effort by the delta-rs contributors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/houqp" rel="noopener noreferrer"&gt;QP Hou&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/rtyler" rel="noopener noreferrer"&gt;R. Tyler Croy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/sd2k" rel="noopener noreferrer"&gt;Ben Sully&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The delta-rs Git repo is &lt;a href="https://github.com/delta-io/delta-rs" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>deltalake</category>
      <category>deltars</category>
      <category>rust</category>
      <category>python</category>
    </item>
    <item>
      <title>Tips for your Databricks development workflow</title>
      <dc:creator>Francisco Ruiz A</dc:creator>
      <pubDate>Thu, 04 Jun 2020 19:39:16 +0000</pubDate>
      <link>https://dev.to/fruiza/tips-for-your-databricks-development-workflow-pfc</link>
      <guid>https://dev.to/fruiza/tips-for-your-databricks-development-workflow-pfc</guid>
      <description>&lt;p&gt;As a data engineer I'm always looking for ways to automate things and be more efficient in my day-to-day development workflow. Here are some tips that might help improve yours.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Sync your Notebooks between Databricks and Git
&lt;/h2&gt;

&lt;p&gt;You can integrate with GitHub or Azure DevOps directly from your Databricks workspace, but what if you work with multiple git remote origins? Or maybe your data project also includes other stacks, such as Apache Airflow or scripting files, and you are working on a feature that spans all of them? The Databricks command line interface exposes a great set of API endpoints that can help you manage this stage of your development workflow.&lt;/p&gt;

&lt;p&gt;Exporting your Databricks workspace into your local branch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ databricks workspace export_dir /Users/example@databricks.com/notebooks .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Let's save a few key strokes with a bash script that leverages your local git configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;MSYS_NO_PATHCONV&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 databricks workspace export_dir &lt;span class="nt"&gt;-o&lt;/span&gt; /Users/&lt;span class="sb"&gt;`&lt;/span&gt;git config user.email | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print tolower($0)}'&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;/notebooks &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now you can easily run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./download_notebooks.sh
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Do the reverse to upload your Notebooks to your Databricks workspace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;MSYS_NO_PATHCONV&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 databricks workspace import_dir &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; /Users/&lt;span class="sb"&gt;`&lt;/span&gt;git config user.email | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print tolower($0)}'&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;/notebooks
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./upload_notebooks.sh
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This gives you the flexibility to commit and push other file changes you may have in your feature branch up to your origin.&lt;/p&gt;
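&lt;p&gt;If you prefer Python for this kind of glue, the path construction from the bash one-liner can be sketched like this; the &lt;code&gt;notebooks_path&lt;/code&gt; helper and its empty-email fallback are my own assumptions, not part of the Databricks CLI:&lt;/p&gt;

```python
import subprocess

def notebooks_path(email=None):
    # Hypothetical helper mirroring the bash one-liner: lowercase the
    # local git email and build the workspace notebooks path from it
    if email is None:
        try:
            email = subprocess.run(["git", "config", "user.email"],
                                   capture_output=True, text=True).stdout.strip()
        except OSError:
            email = ""
    return "/Users/%s/notebooks" % email.lower()
```

&lt;p&gt;You could then pass the result to &lt;code&gt;databricks workspace export_dir&lt;/code&gt; from a Python wrapper script instead of bash.&lt;/p&gt;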

&lt;h2&gt;
  
  
  2. Parallel automated tests
&lt;/h2&gt;

&lt;p&gt;You write automated tests for your workloads, right? :-)&lt;/p&gt;

&lt;p&gt;Let's say you have three test Notebooks covering different parts of your workload. Typically you would have one devops/cd pipeline task per test Notebook (because you run them in an integration environment). These tasks run in sequence, so if each test Notebook takes 5 minutes, your test pipeline takes 15 minutes in total.&lt;/p&gt;

&lt;p&gt;You can run these test Notebooks in parallel using the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class TestNotebook:
  def __init__(self, path, timeout = 0, parameters = None, retry = 0):
    self.path = path
    self.timeout = timeout
    self.parameters = parameters
    self.retry = retry

def submitNotebook(notebook):
  print("Running notebook %s" % notebook.path + "\r")
  try:
    if (notebook.parameters):
      return dbutils.notebook.run(notebook.path, notebook.timeout, notebook.parameters)
    else:
      return dbutils.notebook.run(notebook.path, notebook.timeout)
  except Exception:
    if notebook.retry &amp;lt; 1:
      raise
    print("Retrying notebook %s" % notebook.path)
    notebook.retry = notebook.retry - 1
    submitNotebook(notebook)

def parallelNotebooks(notebooks, numInParallel):
  '''
  If you create too many notebooks in parallel the driver may crash when you submit all of the jobs at once. 
  This code limits the number of parallel notebooks.
  '''
  with ThreadPoolExecutor(max_workers = numInParallel) as ec:
    return [ec.submit(submitNotebook, notebook) for notebook in notebooks]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then, have your test pipeline to run the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;test_notebooks = [
 TestNotebook("test_notebook_1", timeout = 0, parameters = None, retry = 0),
  TestNotebook("test_notebook_2", timeout = 0, parameters = None, retry = 0),
  TestNotebook("test_notebook_3", timeout = 0, parameters = None, retry = 0),
]

parallelNotebooks(test_notebooks, 3)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And now your tests are finishing in 5 minutes ... or so :-)&lt;/p&gt;
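&lt;p&gt;To see why the wall-clock time collapses to roughly the longest single test, here is a stdlib-only sketch you can run anywhere; the &lt;code&gt;fake_test&lt;/code&gt; function is a stand-in for &lt;code&gt;dbutils.notebook.run&lt;/code&gt;, with short sleeps instead of real notebook runs:&lt;/p&gt;

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_test(name, seconds):
    # Stand-in for a test Notebook run: sleep, then report the name
    time.sleep(seconds)
    return name

start = time.monotonic()
with ThreadPoolExecutor(max_workers=3) as ec:
    futures = [ec.submit(fake_test, n, 0.2) for n in ("t1", "t2", "t3")]
    results = [f.result() for f in futures]
elapsed = time.monotonic() - start  # roughly 0.2s, not 0.6s
```

&lt;p&gt;Three 0.2-second "tests" finish in about 0.2 seconds of wall-clock time, mirroring how three 5-minute notebooks finish in about 5 minutes when run with three workers.&lt;/p&gt;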

&lt;h2&gt;
  
  
  3. Cluster environment variables
&lt;/h2&gt;

&lt;p&gt;Have you ever had file "conflicts" because other people in your team are also manipulating the same file? Let's say you and your teammates are working in the same data lake container and doing I/O on the same files. Nightmare!&lt;/p&gt;

&lt;p&gt;You can have a "development" location without mangling the official location variables in your code base by adding an environment variable to your cluster. Databricks makes this easy, go to your configuration advanced options and set the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DEV_DATA = "/john"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then on your workload configuration do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;try:
  LAKE_PATH = LAKE_PATH + os.environ["DEV_DATA"]
except Exception:
  pass # intentionally pass for when env var is not available
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now you have the flexibility to work on a dev location in your lake without impacting others.&lt;/p&gt;
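&lt;p&gt;The same pattern as a small self-contained sketch you can run locally; the &lt;code&gt;dev_lake_path&lt;/code&gt; helper and the &lt;code&gt;/mnt/lake&lt;/code&gt; base path are made up for illustration:&lt;/p&gt;

```python
import os

def dev_lake_path(base):
    # Append the developer suffix only when DEV_DATA is set on the cluster
    return base + os.environ.get("DEV_DATA", "")

os.environ["DEV_DATA"] = "/john"       # simulate the cluster env var
dev_path = dev_lake_path("/mnt/lake")
os.environ.pop("DEV_DATA", None)       # simulate a cluster without it
prod_path = dev_lake_path("/mnt/lake")
```

&lt;p&gt;With the variable set you read and write under your own prefix; without it, the official location is used untouched.&lt;/p&gt;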

&lt;p&gt;I hope these tips help a bit or a lot. Do you have other tips to share? Let me know in the comments, I'm always keen to hear what other teams are doing.&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>git</category>
      <category>bash</category>
      <category>datalake</category>
    </item>
    <item>
      <title>My Databricks article compilation of 2019</title>
      <dc:creator>Francisco Ruiz A</dc:creator>
      <pubDate>Tue, 31 Dec 2019 11:53:41 +0000</pubDate>
      <link>https://dev.to/fruiza/my-databricks-article-compilation-of-2019-2kg7</link>
      <guid>https://dev.to/fruiza/my-databricks-article-compilation-of-2019-2kg7</guid>
      <description>&lt;p&gt;As we come to the end of the year I thought I should share some of the articles that I bookmarked this year during my journey in learning and working with Databricks, a Unified Data Analytics Platform.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://databricks.com/blog/2019/12/03/jupyterlab-databricks-integration-bridge-local-and-remote-workflows.html"&gt;Using a remote Databricks cluster from a local Jupyter notebook&lt;/a&gt;. This article shows how Data Scientists can work in their familiar local environments with JupyterLab and access remote data and remote clusters in a consistent way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://databricks.com/blog/2018/07/02/build-a-mobile-gaming-events-data-pipeline-with-databricks-delta.html"&gt;Data pipeline with Structured Streaming&lt;/a&gt;. This article illustrates how to build data pipelines for high volume streaming use cases like mobile game analytics using Databricks Delta.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://databricks.com/blog/2019/08/14/productionizing-machine-learning-with-delta-lake.html"&gt;Building a Machine Learning Data Pipeline with Delta Lake&lt;/a&gt;. This article demonstrates how Delta Lake is the ideal platform for the machine learning life cycle because it offers tools and features that unify data science, data engineering, and production workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://databricks.com/blog/2019/09/24/diving-into-delta-lake-schema-enforcement-evolution.html"&gt;Schema enforcement is the yin to schema evolution’s yang&lt;/a&gt;. This article shows how Delta Lake uses schema validation on write to keep compatibility with the target table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://databricks.com/blog/2019/07/15/migrating-transactional-data-to-a-delta-lake-using-aws-dms.html"&gt;Migrating Transactional Data to a Delta Lake&lt;/a&gt;. This article explains how to tackle some of the challenges with moving data from databases to data lakes. In this example they use the AWS Database Migration Service.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://databricks.com/blog/2019/11/27/migration-from-hadoop-to-modern-cloud-platforms-the-case-for-hadoop-alternatives.html"&gt;Migrating from Hadoop to modern cloud platforms&lt;/a&gt;. This article talks about the challenges with Hadoop architectures and how to move towards modern cloud data platforms.&lt;/p&gt;

&lt;p&gt;I hope these articles are also useful to you. Keep an eye on the Databricks blog for 2020 and if you need help solving a big data problem please reach out, I might be able to help. Happy New Year! &lt;/p&gt;

</description>
      <category>databricks</category>
      <category>delta</category>
      <category>spark</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>PowerApps BuildTools for Azure DevOps</title>
      <dc:creator>Francisco Ruiz A</dc:creator>
      <pubDate>Thu, 03 Oct 2019 12:10:17 +0000</pubDate>
      <link>https://dev.to/fruiza/powerapps-buildtools-for-azure-devops-184o</link>
      <guid>https://dev.to/fruiza/powerapps-buildtools-for-azure-devops-184o</guid>
      <description>&lt;p&gt;For quite a long time I have been trying to automate the deployment of PowerApps solutions/apps. &lt;a href="https://docs.microsoft.com/en-us/powerapps/developer/common-data-service/build-tools-overview" rel="noopener noreferrer"&gt;PowerApps build tools&lt;/a&gt; is still in &lt;strong&gt;Preview&lt;/strong&gt;, but this collection of Azure DevOps build tasks will help you eliminate all the manual intervention that is currently needed to manage the lifecycle of your application.&lt;/p&gt;

&lt;p&gt;Normally, you would manually export your zip file from your development environment and import it into your production environment. You can now automate these tasks using Azure DevOps. To make use of these build tools tasks you must install the &lt;a href="https://marketplace.visualstudio.com/items?itemName=microsoft-IsvExpTools.PowerApps-BuildTools" rel="noopener noreferrer"&gt;extension&lt;/a&gt; from the Visual Studio Marketplace into your Azure DevOps instance.&lt;/p&gt;

&lt;p&gt;This is an example of a very simple Build pipeline that publishes your zip file as an artefact for deployment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fz41juybamk0qiu368mwf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fz41juybamk0qiu368mwf.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, you can use these two PowerApps DevOps tasks in a Release pipeline to deploy to an environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fgsq1nwq84rjtdeqbybz7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fgsq1nwq84rjtdeqbybz7.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cool huh?!&lt;/p&gt;

&lt;p&gt;But that's not all. You can also automate the export of the zip file with this task.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fdxza99xcjzwlmkohgbtn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fdxza99xcjzwlmkohgbtn.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And you can take it even further: since zip files make it hard to track changes, you can source control the contents of the zip files by using the unpack task.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fdyzd5ifus9jb24ibmmjm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fdyzd5ifus9jb24ibmmjm.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This set of DevOps tasks will definitely help you manage the lifecycle of your PowerApps going forward. There are also tasks to automate common environment management functions that you can use.&lt;/p&gt;

&lt;p&gt;If like me, you also like to get ahead of the game, go start automating your apps now :-)&lt;/p&gt;

</description>
      <category>powerapps</category>
      <category>devops</category>
      <category>deployment</category>
      <category>azure</category>
    </item>
    <item>
      <title>Connect to Azure Data Warehouse from Azure Databricks</title>
      <dc:creator>Francisco Ruiz A</dc:creator>
      <pubDate>Fri, 28 Jun 2019 14:43:51 +0000</pubDate>
      <link>https://dev.to/fruiza/connect-to-azure-data-warehouse-from-azure-databricks-jl2</link>
      <guid>https://dev.to/fruiza/connect-to-azure-data-warehouse-from-azure-databricks-jl2</guid>
      <description>&lt;h2&gt;
  
  
  Create a secret scope
&lt;/h2&gt;

&lt;p&gt;A secret scope allows you to use Azure KeyVault to retrieve all the secret information needed to connect to Azure Data Warehouse, e.g. username/password. In our example, this will be the whole connection string.&lt;/p&gt;

&lt;p&gt;Navigate to https://{region}.azuredatabricks.net/?o={object_id}#secrets/createScope&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fuiq4eiihcf9ekbv3hha9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fuiq4eiihcf9ekbv3hha9.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Grab the DNS and Resource ID from your Azure KeyVault Properties&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ferftnman1x64roqfxjhm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ferftnman1x64roqfxjhm.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Get required secrets from Azure KeyVault and set Spark configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;blob_storage_url = dbutils.secrets.get(scope = "databricks_scope", key = "blobstorageurl")

blob_storage_key = dbutils.secrets.get(scope = "databricks_scope", key = "blobstoragekey")

adw_connection = dbutils.secrets.get(scope = "databricks_scope", key = "dwconnection")

blob_storage_temp_dir = dbutils.secrets.get(scope = "databricks_scope", key = "blobstoragetempdir")

spark.conf.set(blob_storage_url, blob_storage_key)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Spark driver to SQL DW
&lt;/h2&gt;

&lt;p&gt;The Spark driver connects to SQL DW via JDBC using a username and password. We recommend that you use the connection string provided by the Azure portal, which enables Secure Sockets Layer (SSL) encryption for all data sent between the Spark driver and the SQL DW instance through the JDBC connection. To verify that SSL encryption is enabled, search for encrypt=true in the connection string. To allow the Spark driver to reach SQL DW, we recommend that you set Allow access to Azure services to ON on the firewall pane of the SQL DW server in the Azure portal. This setting allows communications from all Azure IP addresses and all Azure subnets, which lets Spark drivers reach the SQL DW instance.&lt;/p&gt;
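&lt;p&gt;The encrypt=true check can be sketched as a tiny parser; the &lt;code&gt;ssl_enabled&lt;/code&gt; helper and the sample connection string below are illustrative assumptions, not part of any Databricks API:&lt;/p&gt;

```python
def ssl_enabled(jdbc_url):
    # Parse the "key=value" properties after the host portion of a JDBC URL
    props = dict(p.split("=", 1) for p in jdbc_url.split(";")[1:] if "=" in p)
    return props.get("encrypt", "false").lower() == "true"

# Hypothetical connection string in the shape the Azure portal provides
conn = ("jdbc:sqlserver://myserver.database.windows.net:1433;"
        "database=mydw;user=admin@myserver;encrypt=true")
```

&lt;p&gt;A quick check like this can run in a notebook cell before you hand the secret-scope connection string to the Spark driver.&lt;/p&gt;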

&lt;p&gt;For more info see &lt;a href="https://docs.databricks.com/spark/latest/data-sources/azure/sql-data-warehouse.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Load data from a SQL DW query into a Spark DataFrame
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = (spark
      .read
      .format("com.databricks.spark.sqldw")
      .option("url", adw_connection)
      .option("tempDir", blob_storage_temp_dir)
      .option("forwardSparkAzureStorageCredentials", "true")
      .option("query", "select count(*) as counter from my_table")).load()

df.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it! Quick and easy.&lt;/p&gt;

</description>
      <category>azure</category>
      <category>databricks</category>
      <category>keyvault</category>
      <category>adw</category>
    </item>
  </channel>
</rss>
