Introduction
You wake up. You look at your to-do list, and there's a task from a team member. They want you to do an analysis of some sort and produce a report. The request itself is relatively complex, meaning that you might need to break it down using a problem statement, pull data from a data source, explore it, gather your insights, and then tell a story of your findings to the team member.
Here's how solving this problem might go:
- Build a problem statement in a task-tracking platform (e.g. Trello, Jira)
- Pull data from a data source (e.g. a SQL database, using an interface like SSMS or Power Query)
- Perform exploratory data analysis and complex transformations in an analytical tool (Python, R, Excel)
- Identify insights, then present them in a digestible manner to muggles (a PowerPoint deck, or if you are spicy, a visualization tool like Tableau or Power BI)
Sounds like a standard day in the life of a data professional, right? I don't know about you, but moving data across that many tools can be quite disorienting, and it makes for a very inefficient workflow. But what if I told you there was a better way? That you could prepare your problem statement, write your SQL query, perform your analysis AND present your findings all in one place?

If you work with data and Python, you're probably already familiar with the Jupyter Notebook as an excellent tool for ad hoc analytics. The Notebook's interactive environment allows for sequential analysis and storytelling. But the Notebook's beauty doesn't end at its marriage with IPython. The Jupyter Notebook is essentially built with HTML, CSS and JavaScript, which means we can use those same technologies to expand the Notebook's capabilities. In this article, I'll show you how we can take advantage of the Jupyter Notebook's web interface and Python's rich data libraries to build an end-to-end reporting solution.
To follow along, you are going to need the following software and/or packages to be installed on your system:
SQL Server: This is the database engine we are going to be interacting with for our data extraction
AdventureWorks Database: This is the collection of sample databases we are going to be querying our data from. If AdventureWorks didn't come with your SQL Server installation, click on the link and follow the instructions on how to load it into your instance.
Python and Jupyter: The easiest way to install both Python and Jupyter at the same time is with the Anaconda distribution. If you already have Python and only want to install Jupyter, use this link.
Python libraries:
- pyodbc
- pandas
- sqlalchemy
- plotly
- mlxtend
- networkx
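With those packages installed, the first technical step is connecting the notebook to SQL Server. Here is a minimal sketch of building a `pyodbc` connection string; the server name, database name, and ODBC driver version are assumptions you should adjust to match your own installation:

```python
# Build a pyodbc connection string for SQL Server.
# NOTE: "localhost", "AdventureWorks2019", and the driver version below are
# assumptions -- swap in the values that match your environment.

def make_conn_str(server: str, database: str,
                  driver: str = "ODBC Driver 17 for SQL Server") -> str:
    """Assemble a connection string using Windows (trusted) authentication."""
    return (
        f"DRIVER={{{driver}}};"
        f"SERVER={server};"
        f"DATABASE={database};"
        "Trusted_Connection=yes;"
    )

conn_str = make_conn_str("localhost", "AdventureWorks2019")

# In a notebook cell you would then pull a query straight into a DataFrame:
#
#   import pyodbc
#   import pandas as pd
#
#   with pyodbc.connect(conn_str) as conn:
#       df = pd.read_sql("SELECT TOP 10 * FROM Sales.SalesOrderHeader", conn)
```

From there, `pd.read_sql` gives you a DataFrame, and everything downstream (exploration, transformation, plotting) happens in the same notebook.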
Now that we've gotten that out of the way, let's dive right into our Jupyter Notebook. You can find it embedded below:
NB: The above embed is from a GitHub repo, which you can clone here. However, because it's hosted on GitHub, all the JS interactivity is disabled, so the "Toggle Button" will not be rendered. If you clone the repo to your machine and run the notebook locally, you will see that the button looks something like this:
And by clicking the "Toggle Button", the input cells are hidden and you are left with something like this:
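This kind of toggle is possible precisely because the notebook front end is a web page. A sketch of one common approach is below: a cell emits plain HTML/JS, and the script toggles the visibility of the input cells. The jQuery selector on `div.input` assumes the classic Jupyter Notebook interface (JupyterLab uses a different DOM), and the element IDs are my own made-up names:

```python
# HTML/JS for a "hide the code" button. Rendering it from a cell works because
# Jupyter happily executes script tags in HTML output. The selector `div.input`
# targets classic-Notebook input cells; adjust for other front ends.
toggle_html = """
<script>
var code_shown = true;
function toggle_code() {
    if (code_shown) {
        $('div.input').hide();           // hide every input (code) cell
        $('#toggle-btn').val('Show code');
    } else {
        $('div.input').show();           // bring the code back
        $('#toggle-btn').val('Hide code');
    }
    code_shown = !code_shown;
}
</script>
<input id="toggle-btn" type="button" value="Hide code" onclick="toggle_code()">
"""

# In a notebook cell you would render it with:
#   from IPython.display import HTML
#   HTML(toggle_html)
```

Placed near the top of the notebook, one click leaves only the markdown, outputs, and charts visible to your reader.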
And that's it. In summary, here's what we were able to do all within one environment (the Notebook):
- define a Problem Statement and expand on the objectives
- connect to and query data directly from a SQL database
- perform our analysis
- plot visualizations
- format our sections, layouts and results using markdown to tell a story and make our report readable by a non-technical audience
- provide functionality that hides our technical inputs, leaving only our story and insights
A complete end-to-end analytics workflow, all within one environment. The next step would be building a framework, technology and/or pipeline that allows us to seamlessly share our reports with stakeholders. If this article gets enough attention, I'll write about that next. 😉