Kay Wilson
See you soon? (Predictive Modeling using Machine Learning and Data Analysis)


Machine Learning and AI have always perplexed me. I always figured predictions were somehow based on past data but never understood exactly how. This project offered the perfect opportunity to learn something new and gain hands-on practice with Power BI, which helped me pass Microsoft's Power BI Data Analyst certification exam.

In this project, I cleansed a dataset and created visuals and machine learning models to predict the readmission of diabetic patients. Microsoft offers many data analysis tools; of these, I used Databricks with Apache Spark and Power BI. The dataset was provided by VCU's Center for Clinical and Translational Research.

Databricks vs Power BI

Databricks and Power BI differ in their user-friendliness. Databricks relies on programming languages for both data analysis and machine learning; I opted to use Python and PySpark (the Apache Spark API for Python). Power BI, however, is little-to-no code and allows you to perform data analysis via a very user-friendly GUI that looks similar to Excel. Essentially, Databricks is for developers and Power BI is for everyone else.

Both Databricks and Power BI allow SQL queries. Within Databricks, SQL queries are what allow you to create visuals. Within Power BI, DAX and SQL can both be used for queries but are not required to create visuals; you can simply specify the data you want to use in the Fields pane. You can even create or import custom visuals, including animated ones, using the Python and R languages.
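To make the Databricks side concrete, here is the kind of Spark SQL query that can feed a visual. The table name cleansed_patients and the readmission_score column are placeholder names for illustration:

```sql
-- Average readmission score and patient count by race.
-- "cleansed_patients" and "readmission_score" are placeholder names.
SELECT race,
       COUNT(*) AS patients,
       AVG(readmission_score) AS avg_readmission_score
FROM cleansed_patients
GROUP BY race
ORDER BY avg_readmission_score DESC;
```

Running a query like this in a Databricks notebook cell produces a result table that can be switched to a bar chart directly in the output pane.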

Preparing Data

Importing Data

Databricks: I created a blob storage account and a container in Azure, then uploaded the dataset file in CSV format as a blob. I then mounted the storage account to my Databricks notebook, imported my CSV blob, and created a dataframe. One mistake I made was not enabling Data Lake Gen 2 on my storage account, which was needed to use it in Databricks. This gave me a headache in the beginning because I kept receiving an error message when attempting to mount and could not figure out why my blob would not load.

Mounting my blob storage account:

# container and storage account names below are placeholders
dbutils.fs.mount(
  source = "wasbs://mycontainer@mystorageaccount.blob.core.windows.net",
  mount_point = "/mnt/mymountpoint",
  extra_configs = {"fs.azure.account.key.mystorageaccount.blob.core.windows.net": "myaccountkey123"})

Creating a dataframe from my blob:

df = (spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/mnt/mymountpoint/diabetic_data.csv"))  # file name is a placeholder

Power BI: The CSV or flat file was imported locally to Power BI Desktop.

Cleansing Data

Only part of the raw dataset was needed. I cleaned the data according to several criteria:

  • Columns deemed potential indicators of diabetes readmission were kept for analysis: race, whether the patient was on any diabetes medications, previous number of readmissions, number of diagnoses, and time spent in the hospital.
  • Rows with certain discharge disposition IDs were dropped to exclude the patients who are no longer with us.
  • Patient number was included as a partition key.
  • An additional integer column called "Readmission Score" was created for use in certain calculations.
  • Columns that could have served as potential indicators were deleted if too many of their rows had missing values (e.g., medical specialty).

Databricks: I created a new, cleansed dataframe containing only the needed columns from my raw dataframe:

cleansed_df = df.select("patient_nbr", "race", "admission_type_id", "discharge_disposition_id", "number_diagnoses", "readmitted", "num_medications", "diabetesMed", "gender", "age")

Power BI: During the initial import of my data, the Transform Data option was used to only include the necessary columns and rows.

Modeling Data

After cleansing, tables were created to determine what correlation, if any, existed in the data.

Factors analyzed:

  • number of diagnoses
  • number of medications
  • race
  • sex
  • admission type
  • age

Choosing my aggregate function

There were a couple of functions I used in my queries: count, avg, kurtosis, and skewness.

Average: used to find the "readmission score" of a group
Count: used to show the distribution of data
Kurtosis: the readmission data showed a platykurtic distribution with a kurtosis of about -1.7, meaning it was flatter than a normal distribution.
Skewness: the readmission data had a skewness score of 0.38, showing that its distribution is fairly symmetrical.
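For intuition, the skewness and excess-kurtosis figures that Spark reports can be reproduced with the standard moment formulas. Here is a minimal pure-Python sketch; the sample scores are made up:

```python
from statistics import mean

def central_moment(xs, k):
    # k-th central moment: mean of (x - mean)^k
    mu = mean(xs)
    return mean((x - mu) ** k for x in xs)

def skewness(xs):
    # Fisher-Pearson skewness: m3 / m2^(3/2); 0 for symmetric data
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    # Excess kurtosis: m4 / m2^2 - 3; negative values are platykurtic (flat)
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3

scores = [0, 1, 1, 2, 2, 3, 3, 4]  # made-up readmission scores
print(skewness(scores), excess_kurtosis(scores))
```

A symmetric sample like this one gives a skewness of zero and a negative excess kurtosis, matching the "fairly symmetrical but flat" shape described above.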


Graphs and charts were created to visualize some of the relationships between data. It is much easier to read a chart than to just look at a table. The type of visual used depended on the parameter being shown.

  • Bar graphs for average and count functions
  • Pie charts for percentages
  • Cards for single values

Power BI has a neat key influencers visual that analyzes your data to determine which factors influence a chosen metric; in my case, readmission scores.

Analysis Results

Once I created my measures and visuals, I looked at my data to see what the results were. Here is a breakdown of what I saw. Using Power BI, I was able to create a nice report from my visuals, which you can view here.

Race and Gender

This was an obvious choice of data to examine for trends. I used count and average to study readmission history based on the readmission score: count showed the number of patients of a given demographic with a specific score, and average gave the mean readmission score for each demographic.
This is what I saw:

  • Caucasians and Blacks had higher readmission scores than other races.
  • Asians had the lowest readmission score of any race.
  • Women had a higher readmission score than men.
  • The gap between Asian women's and Asian men's readmission scores was even larger than the gap between women and men of other races.
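The count and average breakdowns above can be sketched with pandas on toy data; every value below is made up for illustration:

```python
import pandas as pd

# Toy patient table (all values are made up)
patients = pd.DataFrame({
    "race": ["Caucasian", "Caucasian", "AfricanAmerican", "Asian", "Asian"],
    "gender": ["Female", "Male", "Female", "Female", "Male"],
    "readmission_score": [2, 1, 2, 0, 1],
})

# Patient count and average readmission score per race
by_race = patients.groupby("race")["readmission_score"].agg(["count", "mean"])

# The same breakdown by gender
by_gender = patients.groupby("gender")["readmission_score"].agg(["count", "mean"])
print(by_race)
```

In Power BI, the equivalent grouping happens implicitly when you drop race or gender and the score measure onto a bar chart.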

Age really did not show a correlation, which surprised me. I expected older patients to have a higher readmission rate, but the average patient age was 71 regardless of readmission score. The age data was skewed, making it unreliable.

Number of Diagnoses

I was very confident that the number of diagnoses a patient had was going to greatly impact their readmissions. I expected to see that the higher the number of diagnoses, the higher the readmission score. The data showed that there was little to no correlation.
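"Little to no correlation" can be quantified with a Pearson coefficient (PySpark offers this as DataFrame.stat.corr). Here is a pure-Python sketch of the formula on made-up numbers:

```python
from statistics import mean

def pearson_corr(xs, ys):
    # Pearson correlation: covariance over the product of standard deviations
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Made-up diagnosis counts and readmission scores
diagnoses = [3, 5, 7, 9, 4, 8]
scores = [1, 0, 2, 1, 1, 0]
r = pearson_corr(diagnoses, scores)  # close to zero: little to no correlation
```

Values near zero mean no linear relationship, while values near 1 or -1 indicate a strong one.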

Number of medications

I expected to see that the higher the number of medications, the lower the readmission score, as those patients would be managing their health at home. The data showed little to no correlation: patients with no readmissions took, on average, only one more medication than those who had been readmitted.

Admission Type

I was most surprised to see that admission type showed the largest influence on readmissions. Patients with an admission type ID of 6 had the highest readmission score, and the most common admission type was type 1.

Machine Learning (ML)

Power BI

Power BI has a built-in no-code ML feature called AI Insights. To use it, I created a dataflow in my Power BI workspace and added a machine learning model with the data I wanted to include. The model was then applied to my table.


Databricks

I used Databricks AutoML in Python to create a machine learning model.

from databricks import automl

# Split the cleansed data into training and test sets
# (an 80/20 split is assumed here)
train_df, test_df = cleansed_df.randomSplit([0.8, 0.2], seed=42)

summary = automl.classify(train_df, target_col="readmitted", timeout_minutes=15)

model_uri = summary.best_trial.model_path

import mlflow

# Prepare the test dataset
test_pdf = test_df.toPandas()
y_test = test_pdf["readmitted"]
X_test = test_pdf.drop(["readmitted"], axis=1)

# Run inference using the best model
model = mlflow.pyfunc.load_model(model_uri)
predictions = model.predict(X_test)
test_pdf["readmitted_pred"] = predictions

import sklearn.metrics

# Plot a confusion matrix for the best model
model = mlflow.sklearn.load_model(model_uri)
sklearn.metrics.ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
