Darren Fuller

Querying SQL from Databricks without PyODBC

Okay, so this is probably a bit of a niche post but still.

Something I see a lot is people asking how to do things like run arbitrary SQL statements against SQL databases from Databricks, or how to execute stored procedures. Often this leads to the same follow-ups: how to install PyODBC along with the Unix ODBC drivers, add Microsoft's packages repo, accept the EULAs, and so on...

It works, don't get me wrong, but it's a faff, especially if you don't have permission to do those things. Plus Microsoft's packages repo has had... issues in the past, and suddenly production jobs fail because they can't connect. I've posted about this before, so there are ways around it, but still, it's a faff.

So, what if you want to connect to SQL and installing PyODBC isn't an option?

Using what's already available

Yep, you can do this using only what's already available, and pretty easily, all thanks to Py4J. Py4J matters in Spark because Spark runs on the JVM: every PySpark call is translated into calls on the underlying Java/Scala methods through Py4J. And that bridge isn't limited to Spark's own classes.

To make things like the Azure SQL Databricks connector work, the runtime ships with the Microsoft SQL Server JDBC driver (along with others, such as PostgreSQL), and we can access these from Python.

To keep things as secure as possible I'm going to show how to do this using a service principal for access, but it works just as well with SQL-based authentication (there's a quick sketch of that below).

How it works

So first I'm going to assume some things.

  1. You have a SQL database you can connect to
  2. A service principal (or SQL account) exists which has permissions
  3. You have access to Databricks
  4. The credentials are in a secret scope (if not then why not!)

The first thing we need is a reference to SQLServerDataSource.

SQLServerDataSource = spark._sc._gateway.jvm.com.microsoft.sqlserver.jdbc.SQLServerDataSource

And that's the magic line which lets us access something in the JVM from Python. We now have a Python variable which references the JVM class, so we can use it.

client_id = dbutils.secrets.get(secret_scope, '<secret name>')
client_secret = dbutils.secrets.get(secret_scope, '<secret name>')

datasource = SQLServerDataSource()
datasource.setServerName(f'{sql_server}.database.windows.net')
datasource.setDatabaseName(database_name)
datasource.setAuthentication('ActiveDirectoryServicePrincipal')
datasource.setAADSecurePrincipalId(client_id)
datasource.setAADSecurePrincipalSecret(client_secret)

Here we are getting the service principal's application id and client secret from our secret scope. Then we're creating an instance of SQLServerDataSource and configuring it to connect to our database using AAD-based authentication (for other options see the connection string settings documentation).
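
If you're using SQL-based authentication instead, the same data source works. Here's a minimal sketch, assuming the SQL login and password are also stored in the secret scope (the secret names below are just placeholders):

# SQL authentication variant (sketch; the secret names are placeholders)
datasource = SQLServerDataSource()
datasource.setServerName(f'{sql_server}.database.windows.net')
datasource.setDatabaseName(database_name)
datasource.setUser(dbutils.secrets.get(secret_scope, '<sql user secret>'))
datasource.setPassword(dbutils.secrets.get(secret_scope, '<sql password secret>'))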

Now we're ready to execute something, so let's run a simple query to get the list of users.

connection = datasource.getConnection()
statement = connection.createStatement()

try:
  results = statement.executeQuery('SELECT name FROM sysusers')
  while results.next():
    print(results.getString('name'))
except Exception as err:
  # Py4J surfaces JDBC errors as Python exceptions, so we can report them here
  print(f'Query failed: {err}')

So, we get a connection from the data source, get a statement object, execute a query, and then iterate on the results.

And that's pretty much it!

There are other methods which let you prepare statements that take parameters in a secure way, and you can use the execute method instead if you're not expecting results, such as when calling a stored procedure (there's a sketch of both after the next snippet). Or maybe you want to make sure a database schema exists before writing your DataFrame to SQL.

statement.execute("IF NOT EXISTS (SELECT 1 FROM sys.schemas WHERE [name] = 'MyCoolSchema') BEGIN EXEC('CREATE SCHEMA [MyCoolSchema]') END")
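
For example, a parameterised query can go through a prepared statement, and a stored procedure can be fired with execute. This is only a sketch reusing the connection and statement objects from above; the LIKE filter and the dbo.usp_RefreshStats procedure name are made up for illustration.

# Parameterised query via a JDBC prepared statement (the filter value is illustrative)
prepared = connection.prepareStatement('SELECT name FROM sysusers WHERE name LIKE ?')
prepared.setString(1, 'db_%')

results = prepared.executeQuery()
while results.next():
  print(results.getString('name'))

prepared.close()

# Calling a stored procedure when no results are expected (hypothetical procedure name)
statement.execute('EXEC dbo.usp_RefreshStats')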

There are samples of how to do this in the data source sample documentation.

But we haven't had to install any new drivers or packages to make this happen.

Any issues?

Yes. Don't go crazy with this. Spark is a Big Data platform used for processing huge amounts of data; it's not intended for making lots of small queries, and doing so can reduce connection availability for Spark's own JDBC operations. But as with anything, if you use the tools the way they're intended to be used then you shouldn't have any issues.
