Darren Fuller

Querying SQL from Databricks without PyODBC

Okay, so this is probably a bit of a niche post but still.

Something I see a lot is people asking how to do things like run arbitrary SQL statements against SQL databases from Databricks, or how to execute stored procedures. Often this leads to the same follow-ups: how to install PyODBC along with the Unix ODBC drivers, add Microsoft's packages repo, accept the EULAs, and so on...

It works, don't get me wrong, but it's a faff, especially if you don't have permission to do those things. Plus Microsoft's packages repo has had... issues in the past, and suddenly production jobs fail because they can't connect. I've posted about this before, so there are ways around it, but still, it's a faff.

So, what if you want to connect to SQL and installing PyODBC isn't an option?

Using what's already available

Yep, you can do this using only what's already available, and pretty easily, all thanks to Py4J. Py4J matters in Spark because Spark runs on the JVM: every PySpark call is translated into calls on the underlying Java/Scala methods through Py4J. And that bridge isn't limited to Spark's own classes.

To make things like the Azure SQL Databricks connector work, the runtime ships with the Microsoft SQL Server JDBC driver (along with others, such as PostgreSQL), and we can access these from Python.

To keep things as secure as possible I'm going to show how to do this using a service principal for access, but it works just as well with SQL-based authentication (there's a quick sketch of that below).

How it works

So first I'm going to assume some things.

  1. You have a SQL database you can connect to
  2. A service principal (or SQL account) exists which has permissions
  3. You have access to Databricks
  4. The credentials are in a secret scope (if not then why not!)

The first thing we need is a reference to SQLServerDataSource.

SQLServerDataSource = spark._sc._gateway.jvm.com.microsoft.sqlserver.jdbc.SQLServerDataSource

And that's the magic line which lets us access something in the JVM from Python. We now have a Python variable which references the JVM class, so we can use it.

client_id = dbutils.secrets.get(secret_scope, '<secret name>')
client_secret = dbutils.secrets.get(secret_scope, '<secret name>')

datasource = SQLServerDataSource()
datasource.setServerName(f'{sql_server}.database.windows.net')
datasource.setDatabaseName(database_name)
datasource.setAuthentication('ActiveDirectoryServicePrincipal')
datasource.setAADSecurePrincipalId(client_id)
datasource.setAADSecurePrincipalSecret(client_secret)

Here we are getting the service principal's application id and client secret from our secret scope. Then we're creating an instance of SQLServerDataSource and configuring it to connect to our database using AAD-based authentication (for other options see the connection string settings documentation).
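
If you're using SQL-based authentication instead, the same data source works. Here's a minimal sketch, assuming the SQL login and password are also stored in the secret scope (the secret names below are just placeholders):

# SQL authentication variant (sketch; the secret names are placeholders)
datasource = SQLServerDataSource()
datasource.setServerName(f'{sql_server}.database.windows.net')
datasource.setDatabaseName(database_name)
datasource.setUser(dbutils.secrets.get(secret_scope, '<sql user secret>'))
datasource.setPassword(dbutils.secrets.get(secret_scope, '<sql password secret>'))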

Now we're ready to execute something, so let's run a simple query to get the list of users.

connection = datasource.getConnection()
statement = connection.createStatement()

try:
  results = statement.executeQuery('SELECT name FROM sysusers')
  while results.next():
    print(results.getString('name'))
except Exception as err:
  # Py4J surfaces JDBC errors as Python exceptions, so we can report them here
  print(f'Query failed: {err}')

So, we get a connection from the data source, get a statement object, execute a query, and then iterate on the results.

And that's pretty much it!

There are other methods which let you prepare statements that take parameters in a secure way, and you can use the execute method instead if you're not expecting results, such as when calling a stored procedure (there's a sketch of both after the next snippet). Or maybe you want to make sure a database schema exists before writing your DataFrame to SQL.

statement.execute("IF NOT EXISTS (SELECT 1 FROM sys.schemas WHERE [name] = 'MyCoolSchema') BEGIN EXEC('CREATE SCHEMA [MyCoolSchema]') END")
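
For example, a parameterised query can go through a prepared statement, and a stored procedure can be fired with execute. This is only a sketch reusing the connection and statement objects from above; the LIKE filter and the dbo.usp_RefreshStats procedure name are made up for illustration.

# Parameterised query via a JDBC prepared statement (the filter value is illustrative)
prepared = connection.prepareStatement('SELECT name FROM sysusers WHERE name LIKE ?')
prepared.setString(1, 'db_%')

results = prepared.executeQuery()
while results.next():
  print(results.getString('name'))

prepared.close()

# Calling a stored procedure when no results are expected (hypothetical procedure name)
statement.execute('EXEC dbo.usp_RefreshStats')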

There are samples of how to do this in the data source sample documentation.

But we haven't had to install any new drivers or packages to make this happen.

Any issues?

Yes. Don't go crazy with this. Spark is a Big Data platform used for processing huge amounts of data; it's not intended for making lots of small queries, and doing so can reduce connection availability for Spark's own JDBC operations. But as with anything, if you use the tools the way they're intended to be used then you shouldn't have any issues.
