<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Francisco Ruiz A</title>
    <description>The latest articles on DEV Community by Francisco Ruiz A (@fruiza).</description>
    <link>https://dev.to/fruiza</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F168961%2Fe6330d6f-8f4a-45d6-b577-64747c3e0c9f.jpg</url>
      <title>DEV Community: Francisco Ruiz A</title>
      <link>https://dev.to/fruiza</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fruiza"/>
    <language>en</language>
    <item>
      <title>Build a Python Flask API for your Delta Lake</title>
      <dc:creator>Francisco Ruiz A</dc:creator>
      <pubDate>Thu, 24 Dec 2020 16:55:03 +0000</pubDate>
      <link>https://dev.to/fruiza/build-a-python-flask-api-for-your-delta-lake-3bgf</link>
      <guid>https://dev.to/fruiza/build-a-python-flask-api-for-your-delta-lake-3bgf</guid>
      <description>&lt;p&gt;After the recent announcement on the Databricks blog about &lt;a href="https://databricks.com/blog/2020/12/22/natively-query-your-delta-lake-with-scala-java-and-python.html" rel="noopener noreferrer"&gt;querying your Delta Lake natively with Python&lt;/a&gt; (and other languages) without Apache Spark, I got curious about what a Flask API endpoint would look like, so here it is.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from deltalake import DeltaTable
from flask import request, jsonify

app = flask.Flask(__name__)
app.config["DEBUG"] = True

@app.route('/read-delta-table', methods=['GET'])
def home():
    dt = DeltaTable("/tmp/delta/students-delta-table/")

    pd = dt.to_pyarrow_dataset().to_table().to_pandas()

    json_str = pd.to_json(orient = "records")

    parsed = json.loads(json_str)  

    return jsonify(parsed)

app.run()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://gist.github.com/francisco-ltech/05e944e52f7ca5c51bce927773a91dd0#file-deltalake-api-endpoint-py" rel="noopener noreferrer"&gt;Gist&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running the API&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fraw797x8lskcvnw6zsuz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fraw797x8lskcvnw6zsuz.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-requisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;To compile the code you need the nightly version of Rust&lt;/p&gt;

&lt;p&gt;[to install]&lt;br&gt;
&lt;code&gt;$ rustup toolchain install nightly&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;[to use]&lt;br&gt;
&lt;code&gt;$ cd ~/projects/needs-nightly&lt;/code&gt;&lt;br&gt;
&lt;code&gt;$ rustup override set nightly&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use the maturin package to build the .whl&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ pip install maturin&lt;/code&gt;&lt;br&gt;
&lt;code&gt;$ maturin build&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is still an experimental interface to Delta Lake for Rust with native bindings for Python, so proceed with caution: you wouldn't want to expose an ocean of data through an endpoint.&lt;/p&gt;
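&lt;p&gt;One simple mitigation, sketched below with a hypothetical &lt;code&gt;page&lt;/code&gt; helper that is not part of the endpoint above, is to cap how many records a single request can return:&lt;/p&gt;

```python
def page(records, offset=0, limit=100):
    # Hypothetical guard: cap how many rows one request can return
    return records[offset:offset + limit]

# Made-up rows standing in for the records produced by to_json(orient="records")
rows = [{"id": i} for i in range(1000)]
first_page = page(rows)
second_page = page(rows, offset=100)
```

&lt;p&gt;The endpoint could read &lt;code&gt;offset&lt;/code&gt; and &lt;code&gt;limit&lt;/code&gt; from query parameters before calling &lt;code&gt;jsonify&lt;/code&gt;.&lt;/p&gt;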

&lt;p&gt;I'm excited about this project; being able to query Delta tables from front-end apps (without going via Apache Spark) was a missing piece in the Delta Lake puzzle.&lt;/p&gt;

&lt;p&gt;Fantastic effort by the delta-rs contributors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/houqp" rel="noopener noreferrer"&gt;QP Hou&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/rtyler" rel="noopener noreferrer"&gt;R. Tyler Croy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/sd2k" rel="noopener noreferrer"&gt;Ben Sully&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The delta-rs Git repo is &lt;a href="https://github.com/delta-io/delta-rs" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>deltalake</category>
      <category>deltars</category>
      <category>rust</category>
      <category>python</category>
    </item>
    <item>
      <title>Tips for your Databricks development workflow</title>
      <dc:creator>Francisco Ruiz A</dc:creator>
      <pubDate>Thu, 04 Jun 2020 19:39:16 +0000</pubDate>
      <link>https://dev.to/fruiza/tips-for-your-databricks-development-workflow-pfc</link>
      <guid>https://dev.to/fruiza/tips-for-your-databricks-development-workflow-pfc</guid>
      <description>&lt;p&gt;As a data engineer I'm always looking for ways to automate things and be more efficient in my day-to-day development workflow. Here are some tips that might help improve yours.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Sync your Notebooks between Databricks and Git
&lt;/h2&gt;

&lt;p&gt;You can integrate with GitHub or Azure DevOps directly from your Databricks workspace, but what if you work with multiple git remote origins? Or maybe your data project also includes other stacks, such as Apache Airflow or scripting files, and you are working on a feature that spans all of them? The Databricks command line interface exposes a great set of API endpoints that can help you manage this stage of your development workflow.&lt;/p&gt;

&lt;p&gt;Exporting your Databricks workspace into your local branch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ databricks workspace export_dir /Users/example@databricks.com/notebooks .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Let's save a few key strokes with a bash script that leverages your local git configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;MSYS_NO_PATHCONV&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 databricks workspace export_dir &lt;span class="nt"&gt;-o&lt;/span&gt; /Users/&lt;span class="sb"&gt;`&lt;/span&gt;git config user.email | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print tolower($0)}'&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;/notebooks &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now you can easily run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./download_notebooks.sh
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Do the reverse to upload your Notebooks to your Databricks workspace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;MSYS_NO_PATHCONV&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 databricks workspace import_dir &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; /Users/&lt;span class="sb"&gt;`&lt;/span&gt;git config user.email | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print tolower($0)}'&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;/notebooks
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./upload_notebooks.sh
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This gives you the flexibility to commit and push other file changes you may have in your feature branch up to your origin.&lt;/p&gt;
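&lt;p&gt;If you prefer Python for this kind of glue, the path construction from the bash one-liner can be sketched like this; the &lt;code&gt;notebooks_path&lt;/code&gt; helper and its empty-email fallback are my own assumptions, not part of the Databricks CLI:&lt;/p&gt;

```python
import subprocess

def notebooks_path(email=None):
    # Hypothetical helper mirroring the bash one-liner: lowercase the
    # local git email and build the workspace notebooks path from it
    if email is None:
        try:
            email = subprocess.run(["git", "config", "user.email"],
                                   capture_output=True, text=True).stdout.strip()
        except OSError:
            email = ""
    return "/Users/%s/notebooks" % email.lower()
```

&lt;p&gt;You could then pass the result to &lt;code&gt;databricks workspace export_dir&lt;/code&gt; from a Python wrapper script instead of bash.&lt;/p&gt;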

&lt;h2&gt;
  
  
  2. Parallel automated tests
&lt;/h2&gt;

&lt;p&gt;You write automated tests for your workloads, right? :-)&lt;/p&gt;

&lt;p&gt;Let's say you have three test Notebooks covering different parts of your workload. Typically you would have one devops/cd pipeline task per test Notebook (because you run them in an integration environment). These tasks run in sequence, so if each test Notebook takes 5 minutes, your test pipeline takes 15 minutes in total.&lt;/p&gt;

&lt;p&gt;You can run these test Notebooks in parallel using the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class TestNotebook:
  def __init__(self, path, timeout = 0, parameters = None, retry = 0):
    self.path = path
    self.timeout = timeout
    self.parameters = parameters
    self.retry = retry

def submitNotebook(notebook):
  print("Running notebook %s" % notebook.path + "\r")
  try:
    if (notebook.parameters):
      return dbutils.notebook.run(notebook.path, notebook.timeout, notebook.parameters)
    else:
      return dbutils.notebook.run(notebook.path, notebook.timeout)
  except Exception:
    if notebook.retry &amp;lt; 1:
      raise
    print("Retrying notebook %s" % notebook.path)
    notebook.retry = notebook.retry - 1
    submitNotebook(notebook)

def parallelNotebooks(notebooks, numInParallel):
  '''
  If you create too many notebooks in parallel the driver may crash when you submit all of the jobs at once. 
  This code limits the number of parallel notebooks.
  '''
  with ThreadPoolExecutor(max_workers = numInParallel) as ec:
    return [ec.submit(submitNotebook, notebook) for notebook in notebooks]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then, have your test pipeline to run the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;test_notebooks = [
 TestNotebook("test_notebook_1", timeout = 0, parameters = None, retry = 0),
  TestNotebook("test_notebook_2", timeout = 0, parameters = None, retry = 0),
  TestNotebook("test_notebook_3", timeout = 0, parameters = None, retry = 0),
]

parallelNotebooks(test_notebooks, 3)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And now your tests are finishing in 5 minutes ... or so :-)&lt;/p&gt;
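&lt;p&gt;To see why the wall-clock time collapses to roughly the longest single test, here is a stdlib-only sketch you can run anywhere; the &lt;code&gt;fake_test&lt;/code&gt; function is a stand-in for &lt;code&gt;dbutils.notebook.run&lt;/code&gt;, with short sleeps instead of real notebook runs:&lt;/p&gt;

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_test(name, seconds):
    # Stand-in for a test Notebook run: sleep, then report the name
    time.sleep(seconds)
    return name

start = time.monotonic()
with ThreadPoolExecutor(max_workers=3) as ec:
    futures = [ec.submit(fake_test, n, 0.2) for n in ("t1", "t2", "t3")]
    results = [f.result() for f in futures]
elapsed = time.monotonic() - start  # roughly 0.2s, not 0.6s
```

&lt;p&gt;Three 0.2-second "tests" finish in about 0.2 seconds of wall-clock time, mirroring how three 5-minute notebooks finish in about 5 minutes when run with three workers.&lt;/p&gt;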

&lt;h2&gt;
  
  
  3. Cluster environment variables
&lt;/h2&gt;

&lt;p&gt;Have you ever had file "conflicts" because other people in your team are also manipulating the same file? Let's say you and your teammates are working in the same data lake container and doing I/O on the same files. Nightmare!&lt;/p&gt;

&lt;p&gt;You can have a "development" location without mangling the official location variables in your code base by adding an environment variable to your cluster. Databricks makes this easy, go to your configuration advanced options and set the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DEV_DATA = "/john"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then on your workload configuration do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;try:
  LAKE_PATH = LAKE_PATH + os.environ["DEV_DATA"]
except Exception:
  pass # intentionally pass for when env var is not available
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now you have the flexibility to work on a dev location in your lake without impacting others.&lt;/p&gt;
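&lt;p&gt;The same pattern as a small self-contained sketch you can run locally; the &lt;code&gt;dev_lake_path&lt;/code&gt; helper and the &lt;code&gt;/mnt/lake&lt;/code&gt; base path are made up for illustration:&lt;/p&gt;

```python
import os

def dev_lake_path(base):
    # Append the developer suffix only when DEV_DATA is set on the cluster
    return base + os.environ.get("DEV_DATA", "")

os.environ["DEV_DATA"] = "/john"       # simulate the cluster env var
dev_path = dev_lake_path("/mnt/lake")
os.environ.pop("DEV_DATA", None)       # simulate a cluster without it
prod_path = dev_lake_path("/mnt/lake")
```

&lt;p&gt;With the variable set you read and write under your own prefix; without it, the official location is used untouched.&lt;/p&gt;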

&lt;p&gt;I hope these tips help a bit or a lot. Do you have other tips to share? Let me know in the comments, I'm always keen to hear what other teams are doing.&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>git</category>
      <category>bash</category>
      <category>datalake</category>
    </item>
    <item>
      <title>My Databricks article compilation of 2019</title>
      <dc:creator>Francisco Ruiz A</dc:creator>
      <pubDate>Tue, 31 Dec 2019 11:53:41 +0000</pubDate>
      <link>https://dev.to/fruiza/my-databricks-article-compilation-of-2019-2kg7</link>
      <guid>https://dev.to/fruiza/my-databricks-article-compilation-of-2019-2kg7</guid>
      <description>&lt;p&gt;As we come to the end of the year I thought I should share some of the articles that I bookmarked this year during my journey in learning and working with Databricks, a Unified Data Analytics Platform.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://databricks.com/blog/2019/12/03/jupyterlab-databricks-integration-bridge-local-and-remote-workflows.html"&gt;Using a remote Databricks cluster from a local Jupyter notebook&lt;/a&gt;. This article shows how Data Scientists can work in their familiar local environments with JupyterLab and access remote data and remote clusters in a consistent way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://databricks.com/blog/2018/07/02/build-a-mobile-gaming-events-data-pipeline-with-databricks-delta.html"&gt;Data pipeline with Structured Streaming&lt;/a&gt;. This article illustrates how to build data pipelines for high volume streaming use cases like mobile game analytics using Databricks Delta.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://databricks.com/blog/2019/08/14/productionizing-machine-learning-with-delta-lake.html"&gt;Building a Machine Learning Data Pipeline with Delta Lake&lt;/a&gt;. This article demonstrates how Delta Lake is the ideal platform for the machine learning life cycle because it offers tools and features that unify data science, data engineering, and production workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://databricks.com/blog/2019/09/24/diving-into-delta-lake-schema-enforcement-evolution.html"&gt;Schema enforcement is the yin to schema evolution’s yang&lt;/a&gt;. This article shows how Delta Lake uses schema validation on write to keep compatibility with the target table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://databricks.com/blog/2019/07/15/migrating-transactional-data-to-a-delta-lake-using-aws-dms.html"&gt;Migrating Transactional Data to a Delta Lake&lt;/a&gt;. This article explains how to tackle some of the challenges with moving data from databases to data lakes. In this example they use the AWS Database Migration Service.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://databricks.com/blog/2019/11/27/migration-from-hadoop-to-modern-cloud-platforms-the-case-for-hadoop-alternatives.html"&gt;Migrating from Hadoop to modern cloud platforms&lt;/a&gt;. This article talks about the challenges with Hadoop architectures and how to move towards modern cloud data platforms.&lt;/p&gt;

&lt;p&gt;I hope these articles are also useful to you. Keep an eye on the Databricks blog for 2020 and if you need help solving a big data problem please reach out, I might be able to help. Happy New Year! &lt;/p&gt;

</description>
      <category>databricks</category>
      <category>delta</category>
      <category>spark</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>PowerApps BuildTools for Azure DevOps</title>
      <dc:creator>Francisco Ruiz A</dc:creator>
      <pubDate>Thu, 03 Oct 2019 12:10:17 +0000</pubDate>
      <link>https://dev.to/fruiza/powerapps-buildtools-for-azure-devops-184o</link>
      <guid>https://dev.to/fruiza/powerapps-buildtools-for-azure-devops-184o</guid>
      <description>&lt;p&gt;For quite a long time I have been trying to automate the deployment of PowerApps solutions/apps. &lt;a href="https://docs.microsoft.com/en-us/powerapps/developer/common-data-service/build-tools-overview" rel="noopener noreferrer"&gt;PowerApps build tools&lt;/a&gt; is still in &lt;strong&gt;Preview&lt;/strong&gt;, but this collection of Azure DevOps build tasks will help you eliminate all the manual intervention that is currently needed to manage the lifecycle of your application.&lt;/p&gt;

&lt;p&gt;Normally, you would manually export your zip file from your development environment and import it into your production environment. You can now automate these tasks using Azure DevOps. To make use of these build tools tasks you must install the &lt;a href="https://marketplace.visualstudio.com/items?itemName=microsoft-IsvExpTools.PowerApps-BuildTools" rel="noopener noreferrer"&gt;extension&lt;/a&gt; from the Visual Studio Marketplace into your Azure DevOps instance.&lt;/p&gt;

&lt;p&gt;This is an example of a very simple Build pipeline that publishes your zip file as an artefact for deployment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fz41juybamk0qiu368mwf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fz41juybamk0qiu368mwf.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, you can use these two PowerApps DevOps tasks in a Release pipeline to deploy to an environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fgsq1nwq84rjtdeqbybz7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fgsq1nwq84rjtdeqbybz7.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cool huh?!&lt;/p&gt;

&lt;p&gt;But that's not all. You can also automate the export of the zip file with this task.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fdxza99xcjzwlmkohgbtn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fdxza99xcjzwlmkohgbtn.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And you can take it even further: since zip files make it hard to track changes, you can source control the contents of the zip files by using the unpack task.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fdyzd5ifus9jb24ibmmjm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fdyzd5ifus9jb24ibmmjm.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This set of DevOps tasks will definitely help you manage the lifecycle of your PowerApps going forward. There are also tasks to automate common environment management functions that you can use.&lt;/p&gt;

&lt;p&gt;If like me, you also like to get ahead of the game, go start automating your apps now :-)&lt;/p&gt;

</description>
      <category>powerapps</category>
      <category>devops</category>
      <category>deployment</category>
      <category>azure</category>
    </item>
    <item>
      <title>Connect to Azure Data Warehouse from Azure Databricks</title>
      <dc:creator>Francisco Ruiz A</dc:creator>
      <pubDate>Fri, 28 Jun 2019 14:43:51 +0000</pubDate>
      <link>https://dev.to/fruiza/connect-to-azure-data-warehouse-from-azure-databricks-jl2</link>
      <guid>https://dev.to/fruiza/connect-to-azure-data-warehouse-from-azure-databricks-jl2</guid>
      <description>&lt;h2&gt;
  
  
  Create a secret scope
&lt;/h2&gt;

&lt;p&gt;A secret scope allows you to use Azure KeyVault to retrieve all the secret information needed to connect to Azure Data Warehouse, e.g. username/password. In our example, this will be the whole connection string.&lt;/p&gt;

&lt;p&gt;Navigate to https://{region}.azuredatabricks.net/?o={object_id}#secrets/createScope&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fuiq4eiihcf9ekbv3hha9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fuiq4eiihcf9ekbv3hha9.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Grab the DNS and Resource ID from your Azure KeyVault Properties&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ferftnman1x64roqfxjhm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ferftnman1x64roqfxjhm.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Get required secrets from Azure KeyVault and set Spark configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;blob_storage_url = dbutils.secrets.get(scope = "databricks_scope", key = "blobstorageurl")

blob_storage_key = dbutils.secrets.get(scope = "databricks_scope", key = "blobstoragekey")

adw_connection = dbutils.secrets.get(scope = "databricks_scope", key = "dwconnection")

blob_storage_temp_dir = dbutils.secrets.get(scope = "databricks_scope", key = "blobstoragetempdir")

spark.conf.set(blob_storage_url, blob_storage_key)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Spark driver to SQL DW
&lt;/h2&gt;

&lt;p&gt;The Spark driver connects to SQL DW via JDBC using a username and password. We recommend that you use the connection string provided by the Azure portal, which enables Secure Sockets Layer (SSL) encryption for all data sent between the Spark driver and the SQL DW instance through the JDBC connection. To verify that SSL encryption is enabled, search for encrypt=true in the connection string. To allow the Spark driver to reach SQL DW, we recommend that you set Allow access to Azure services to ON on the firewall pane of the SQL DW server in the Azure portal. This setting allows communications from all Azure IP addresses and all Azure subnets, which lets Spark drivers reach the SQL DW instance.&lt;/p&gt;
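&lt;p&gt;The encrypt=true check can be sketched as a tiny parser; the &lt;code&gt;ssl_enabled&lt;/code&gt; helper and the sample connection string below are illustrative assumptions, not part of any Databricks API:&lt;/p&gt;

```python
def ssl_enabled(jdbc_url):
    # Parse the "key=value" properties after the host portion of a JDBC URL
    props = dict(p.split("=", 1) for p in jdbc_url.split(";")[1:] if "=" in p)
    return props.get("encrypt", "false").lower() == "true"

# Hypothetical connection string in the shape the Azure portal provides
conn = ("jdbc:sqlserver://myserver.database.windows.net:1433;"
        "database=mydw;user=admin@myserver;encrypt=true")
```

&lt;p&gt;A quick check like this can run in a notebook cell before you hand the secret-scope connection string to the Spark driver.&lt;/p&gt;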

&lt;p&gt;For more info see &lt;a href="https://docs.databricks.com/spark/latest/data-sources/azure/sql-data-warehouse.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Load data from a SQL DW query into a Spark DataFrame
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = (spark
      .read
      .format("com.databricks.spark.sqldw")
      .option("url", adw_connection)
      .option("tempDir", blob_storage_temp_dir)
      .option("forwardSparkAzureStorageCredentials", "true")
      .option("query", "select count(*) as counter from my_table")).load()

df.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it! Quick and easy.&lt;/p&gt;

</description>
      <category>azure</category>
      <category>databricks</category>
      <category>keyvault</category>
      <category>adw</category>
    </item>
  </channel>
</rss>
