The elephant in the room
Unlike newer projects in the database space, Microsoft's SQL Server database management system hardly requires a lengthy introduction. Since the release of SQL Server 1.0 in 1989 [1], it has remained one of the most widely used database management systems, coming in 3rd in DB-Engines' August 2022 ranking [2]. While being established has its benefits, it also poses certain challenges when it comes to effectively integrating SQL Server with modern data analytics workflows.
This article aims to shed some light on three useful ways of connecting SQL Server with notebook-based analytics workflows in Python. While you could also use the pyodbc package for any of these, the examples below use Deepnote data notebooks, since they allow for SQL editing with auto-complete through SQL blocks and in-notebook schemas.
1. Execute Python scripts on SQL Server as stored procedures
The first and potentially most obvious way to bring SQL Server workflows closer to your analytics team's setup is to run some Python code on SQL Server itself. If you're using SQL Server 2017 or later [3], you're in luck: SQL Server lets you run Python scripts as stored procedures, and all you need to do to unlock this capability is specify the @language input. Let's take a look at an example:
EXECUTE sp_execute_external_script
    @language = N'Python',
    @script = N'
a = 1
b = 2
c = a / b
d = a * b
print(c, d)
';
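When you run this, the printed values (0.5 2 in this case) land in SQL Server's messages output rather than in a result set; to hand tabular data back to the caller, you'd assign a Pandas DataFrame to OutputDataSet instead.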
While this example helps grasp the general idea, it's not really a candidate for any mission-critical workflows. However, SQL Server can actually do way more than the above. In the example below, I'm running a query that feeds a Python script, with Python variables passed in as inputs (via JinjaSQL), and getting the result back as a Pandas DataFrame. How cool is that?
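The embedded example isn't reproduced here, but here is a minimal plain-Python sketch of the idea. Instead of JinjaSQL templating, it uses sp_execute_external_script's own @params mechanism to hand the notebook-side variable max_weight into the script; the connection string, database name, and max_weight are placeholders you'd adjust. The @input_data_1 query arrives inside the script as the Pandas DataFrame InputDataSet, and whatever you assign to OutputDataSet comes back as the result set.

import pandas as pd
import pyodbc

# Placeholder connection string -- adjust driver, server, and credentials
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=AdventureWorksLT;Trusted_Connection=yes;"
)

max_weight = 1000.0  # notebook-side Python variable

tsql = """
EXECUTE sp_execute_external_script
    @language = N'Python',
    @script = N'
# max_weight arrives as a Python variable inside SQL Server;
# InputDataSet is the @input_data_1 query result as a pandas DataFrame
OutputDataSet = InputDataSet[InputDataSet["Weight"] < max_weight]
',
    @input_data_1 = N'SELECT Name, Weight, ListPrice FROM [SalesLT].[Product]',
    @params = N'@max_weight FLOAT',
    @max_weight = ?
WITH RESULT SETS ((Name NVARCHAR(50), Weight FLOAT, ListPrice FLOAT));
"""

result_df = pd.read_sql(tsql, conn, params=[max_weight])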
2. Mix-and-match database tables and Pandas DataFrames
There might come a time when you have to work with data that is scattered across your SQL Server database, Pandas DataFrames, and local files like CSVs. Instead of having to bulk insert all your CSVs into your database, what if there were a way to reference Pandas DataFrames (and, therefore, CSVs) as if they were tables in your database?
Imagine you're a data analyst at an e-commerce company and have a baseline product table in your SQL Server database that looks something like this:
SELECT Name,
       Color,
       Weight,
       ListPrice
FROM [SalesLT].[Product]
Since your company is planning to run a marketing campaign with discounts, the head of marketing sent you a CSV called discounts.csv with the planned discount for each product and asked you to run some analyses for products with a planned discount of more than 15%. The CSV contains a ProductID and a Discount column.
After reading the CSV into a Pandas DataFrame, you'd now like to use it in your SQL Server query. Using JinjaSQL, you can do the following to filter for products to include in your analysis based on your manager's request:
SELECT Name,
       Color,
       Weight,
       ListPrice
FROM [SalesLT].[Product]
WHERE ProductID IN {{
    pd.read_csv('discounts.csv')
      .query('Discount > 0.15')['ProductID']
    | inclause
}}
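Deepnote's SQL blocks handle the templating above for you. Outside of them, a rough plain-Python equivalent could look like the sketch below; it assumes an open pyodbc connection named conn (a hypothetical setup) and relies on JinjaSQL's inclause filter, which expands a list into a parameterized IN (...) clause:

import pandas as pd
from jinjasql import JinjaSql

# ProductIDs with a planned discount of more than 15%
discounted_ids = (
    pd.read_csv("discounts.csv")
      .query("Discount > 0.15")["ProductID"]
      .tolist()
)

template = """
SELECT Name, Color, Weight, ListPrice
FROM [SalesLT].[Product]
WHERE ProductID IN {{ product_ids | inclause }}
"""

# inclause renders the list as IN (?, ?, ...) with matching bind parameters
query, bind_params = JinjaSql(param_style="qmark").prepare_query(
    template, {"product_ids": discounted_ids}
)
df = pd.read_sql(query, conn, params=list(bind_params))  # conn: open pyodbc connection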
3. Leveraging T-SQL-specific functions in your analytics workflows
Thanks to its 33-year-long history, SQL Server comes with several useful features out of the box that are not supported in the same way in, e.g., PostgreSQL or MySQL [4]. Some of them, like the PIVOT and UNPIVOT relational operators, allow you to perform table manipulations in a manner that is very readable for analysts familiar with SQL and won't cause the machine you're running on to overheat. Let's run through an example.
Assume that we're back in our fictional e-commerce company. This time, however, we need to transform our database table by pivoting it and computing the mean product list price of each color type for the product categories 5, 6, 7, and 12. The image below shows our database table on the left and the desired target state on the right.
To pivot the sample Product table using T-SQL as illustrated in the image above, you could run the following query:
SELECT Color, [5] AS Cat5, [6] AS Cat6, [7] AS Cat7, [12] AS Cat12
FROM (
    SELECT ListPrice, Color, ProductCategoryID
    FROM [SalesLT].[Product]
) p
PIVOT (
    AVG(ListPrice)
    FOR ProductCategoryID IN ([5], [6], [7], [12])
) AS pvt
WHERE Color IS NOT NULL;
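One thing to note: the WHERE clause is evaluated against the pivoted result, so products without a color are dropped after the aggregation, and each Cat column contains the average list price for that color (or NULL where a color has no products in the category).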
The neat thing about this query is that it's very readable for anyone familiar with SQL and, thus, more accessible across analytics teams in your organization. To really bring this point home, take a look at the Pandas code to produce a similar result:
import numpy as np
import pandas as pd

# product_df: the [SalesLT].[Product] table loaded into a DataFrame
pd.pivot_table(product_df, values="ListPrice", index="Color",
               columns=["ProductCategoryID"], aggfunc=np.mean,
               fill_value=np.nan) \
  .reset_index()[["Color", 5, 6, 7, 12]]
While performing more or less the same actions, this code is harder to understand for someone not familiar with Python, Pandas, and NumPy.
Summary
Age is, as the saying goes, just a number. With SQL and Python workflows becoming increasingly intertwined, knowing how to integrate the two effectively becomes more and more important. As the examples above demonstrate, it's worth remembering that this process is very much bidirectional: Python-heavy workflows can benefit from some nifty SQL tricks just as much as SQL workflows can benefit from Python.
Sources:
[1] "Microsoft SQL Server." Wikipedia, 28 July 2022. Wikipedia, https://en.wikipedia.org/w/index.php?title=Microsoft_SQL_Server&oldid=1100983318.
[2] "DB-Engines Ranking." DB-Engines, https://db-engines.com/en/ranking. Accessed 3 Aug. 2022.
[3] WilliamDAssafMSFT. Quickstart: Run Python Scripts - SQL Machine Learning. https://docs.microsoft.com/en-us/sql/machine-learning/tutorials/quickstart-python-create-script. Accessed 4 Aug. 2022.
[4] SQL Feature Comparison. https://www.sql-workbench.eu/dbms_comparison.html. Accessed 3 Aug. 2022.