<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Cristopher Delgado </title>
    <description>The latest articles on DEV Community by Cristopher Delgado  (@cristopherddelgado).</description>
    <link>https://dev.to/cristopherddelgado</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1102298%2F3da19e34-4bfe-46c0-b69b-c95f91d31ce6.png</url>
      <title>DEV Community: Cristopher Delgado </title>
      <link>https://dev.to/cristopherddelgado</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cristopherddelgado"/>
    <language>en</language>
    <item>
      <title>Pneumonia Detection</title>
      <dc:creator>Cristopher Delgado </dc:creator>
      <pubDate>Thu, 11 Jan 2024 01:14:14 +0000</pubDate>
      <link>https://dev.to/cristopherddelgado/pneumonia-detection-5gg8</link>
      <guid>https://dev.to/cristopherddelgado/pneumonia-detection-5gg8</guid>
      <description>&lt;p&gt;Contacts: &lt;a href="https://github.com/cristopher-d-delgado"&gt;Github&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/cristopher-d-delgado/"&gt;LinkedIn&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/cristopher-d-delgado/image_classification_pneumonia"&gt;Github Repository containing Project&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Overview&lt;/h2&gt;

&lt;p&gt;Throughout my journey in data science, I finally decided to dive into its biomedical side. Machine learning is taking over analytics in medical devices, where it is beginning to provide a vast range of opportunities. &lt;/p&gt;

&lt;p&gt;In this hypothetical scenario, I developed a deep-learning model that classifies pneumonia X-ray images with 88% accuracy. Data science is a comprehensive field, and in this story I got the chance to showcase my work in developing a deep-learning model. &lt;/p&gt;

&lt;h2&gt;Data Understanding&lt;/h2&gt;

&lt;p&gt;The X-ray images used in this git repo are of pediatric patients, and the classification is a binary case: whether or not a patient has pneumonia. The dataset comes from Kermany et al. on &lt;a href="https://data.mendeley.com/datasets/rscbjbr9sj/3"&gt;Mendeley&lt;/a&gt;. The dataset on &lt;a href="https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia"&gt;Kaggle&lt;/a&gt; is taken from that original source, using Version 2, published on January 01, 2018. The provided validation folder consisted of only 16 images, which in my opinion was insufficient to judge whether the deep learning model was overfitting. Instead, I built a custom validation set by moving the 16 images in the val folder back into the train folder and then randomly selecting 20% of the training images from each category.&lt;/p&gt;
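&lt;p&gt;That re-split can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the per-category file lists have already been read from the train folder; the function name and the 80/20 fraction mirror the text, everything else is hypothetical.&lt;/p&gt;

```python
import random

def make_validation_split(train_files, val_fraction=0.2, seed=42):
    """Move a random fraction of each category's training files into a
    custom validation set, category by category, as described above."""
    rng = random.Random(seed)
    split = {"train": {}, "val": {}}
    for label, files in train_files.items():
        shuffled = sorted(files)       # deterministic base order
        rng.shuffle(shuffled)          # then a seeded shuffle
        n_val = int(len(shuffled) * val_fraction)
        split["val"][label] = shuffled[:n_val]
        split["train"][label] = shuffled[n_val:]
    return split
```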

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tk8tMU-j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9nojom37sb11dsoty13i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tk8tMU-j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9nojom37sb11dsoty13i.png" alt="data_distribution" width="800" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Best Model&lt;/h2&gt;

&lt;h3&gt;Base Convolutional Model&lt;/h3&gt;

&lt;p&gt;The model architecture was simple: one convolutional layer with 16 filters, one hidden dense layer with 256 neurons, and an output layer consisting of a single neuron for binary classification. This model did well in training, with promising generalization as seen in the epoch-vs-loss graph, but on unknown data it classifies Pneumonia images very well and Normal images poorly. The metrics show this: in training, both precision and recall were high on the training and validation sets, while on the test set only recall was high, at 99%, with precision low at 71%. This is due to the small number of Normal images in the training set compared to Pneumonia images. &lt;/p&gt;
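&lt;p&gt;For readers who want to see the shape of such a network, here is a minimal Keras sketch of the architecture just described (one Conv2D layer with 16 filters, a 256-neuron dense layer, and a single sigmoid output). The input size, pooling layer, and optimizer are my assumptions, not details from the project.&lt;/p&gt;

```python
from tensorflow.keras import layers, models

def build_base_cnn(input_shape=(128, 128, 3)):
    """Sketch of the base model: a Conv2D(16) feature extractor, one
    hidden Dense(256) layer, and one sigmoid unit for binary output."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, (3, 3), activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # probability of pneumonia
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```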

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Set&lt;/th&gt;
&lt;th&gt;Loss&lt;/th&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;th&gt;Recall&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Train&lt;/td&gt;
&lt;td&gt;0.020&lt;/td&gt;
&lt;td&gt;100.00%&lt;/td&gt;
&lt;td&gt;100.00%&lt;/td&gt;
&lt;td&gt;100.00%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test&lt;/td&gt;
&lt;td&gt;1.336&lt;/td&gt;
&lt;td&gt;71.85%&lt;/td&gt;
&lt;td&gt;99.84%&lt;/td&gt;
&lt;td&gt;75.32%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Validation&lt;/td&gt;
&lt;td&gt;0.097&lt;/td&gt;
&lt;td&gt;98.56%&lt;/td&gt;
&lt;td&gt;97.68%&lt;/td&gt;
&lt;td&gt;97.22%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JLzW3KUT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6gpff8f8dhb1j9uh31t8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JLzW3KUT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6gpff8f8dhb1j9uh31t8.png" alt="cnn_metrics" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Augmentation Model&lt;/h3&gt;

&lt;p&gt;To combat the low availability of Normal images, data augmentation was utilized to create synthetic examples for the model to train on. Taking the pre-trained model, I introduced data augmentation generators whose randomly transformed outputs act as brand-new images to learn from. Data augmentation resulted in better performance on unknown data: the training metrics on the augmented data show the model beginning to do better on Normal images, as seen in the high precision scores on the train and validation sets. The test set used a data generator that supplied unaltered images, so that performance was measured on the actual test set.  &lt;/p&gt;
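&lt;p&gt;The generator setup might look like the following sketch. The specific transform ranges are illustrative assumptions; the key point, as in the text, is that the test generator only rescales, so evaluation sees unaltered images.&lt;/p&gt;

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Training generator: random rotations, shifts, and zooms act as
# brand-new images each epoch. The exact ranges here are assumptions.
train_aug = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
)

# Test generator: rescale only, so metrics reflect the unaltered test set.
test_gen = ImageDataGenerator(rescale=1.0 / 255)
```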

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Set&lt;/th&gt;
&lt;th&gt;Loss&lt;/th&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;th&gt;Recall&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Train&lt;/td&gt;
&lt;td&gt;0.228&lt;/td&gt;
&lt;td&gt;98.64%&lt;/td&gt;
&lt;td&gt;88.79%&lt;/td&gt;
&lt;td&gt;90.78%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test&lt;/td&gt;
&lt;td&gt;0.360&lt;/td&gt;
&lt;td&gt;88.43%&lt;/td&gt;
&lt;td&gt;94.10%&lt;/td&gt;
&lt;td&gt;88.62%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Validation&lt;/td&gt;
&lt;td&gt;0.217&lt;/td&gt;
&lt;td&gt;98.85%&lt;/td&gt;
&lt;td&gt;89.04%&lt;/td&gt;
&lt;td&gt;91.10%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1TQ1PapV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9i8dz9euuooq35xuewmq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1TQ1PapV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9i8dz9euuooq35xuewmq.png" alt="cnn_aug_metrics" width="800" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;The best model was the augmented model, as it was the best-performing model across the board. It is slightly more complex than its base version, but because it learned from augmented data it performs better on Normal images. Please check out the entire project in the GitHub &lt;a href="https://github.com/cristopher-d-delgado/image_classification_pneumonia"&gt;repository&lt;/a&gt;, where I try many more iterations, such as transfer learning!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Heart Disease Prediction</title>
      <dc:creator>Cristopher Delgado </dc:creator>
      <pubDate>Mon, 25 Sep 2023 21:23:02 +0000</pubDate>
      <link>https://dev.to/cristopherddelgado/heart-disease-prediction-hgg</link>
      <guid>https://dev.to/cristopherddelgado/heart-disease-prediction-hgg</guid>
      <description>&lt;p&gt;&lt;strong&gt;Media Profiles:&lt;/strong&gt;&lt;br&gt;
&lt;a href="//www.linkedin.com/in/cristopher-d-delgado"&gt;LinkedIn&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/cristopher-d-delgado"&gt;Github&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;Overview:&lt;/h2&gt;

&lt;p&gt;Continuing my data science journey eventually meant I would reach a crossroads with machine learning. I am proud to say I have completed a third project that showcases my understanding of machine learning models. In this project, which I want to share with everyone, I utilized six different classification models: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;LogisticRegression()&lt;/li&gt;
&lt;li&gt;RandomForestClassifier()&lt;/li&gt;
&lt;li&gt;KNeighborsClassifier()&lt;/li&gt;
&lt;li&gt;AdaBoostClassifier()&lt;/li&gt;
&lt;li&gt;GradientBoostingClassifier()&lt;/li&gt;
&lt;li&gt;XGBClassifier()&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With these six classification models, I created baseline models and their optimized versions. In the end, I chose a single model that aligns the best with the project's main objective. &lt;/p&gt;
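&lt;p&gt;As a rough sketch, the baseline versions can be collected in a dictionary so each one is fit and scored with the same loop. The hyperparameters shown are library defaults, not the tuned values from the project; XGBClassifier comes from the separate xgboost package and would join the dictionary the same way.&lt;/p&gt;

```python
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Baseline classifiers, all with default hyperparameters.
# XGBClassifier (from the xgboost package) is added the same way.
baselines = {
    "logistic": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(random_state=42),
    "adaboost": AdaBoostClassifier(random_state=42),
    "gradient_boost": GradientBoostingClassifier(random_state=42),
}

def fit_and_score(models, X_train, y_train, X_test, y_test):
    """Fit every baseline and report its test accuracy for comparison."""
    return {name: clf.fit(X_train, y_train).score(X_test, y_test)
            for name, clf in models.items()}
```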

&lt;h2&gt;Business Problem:&lt;/h2&gt;

&lt;p&gt;The main goal of this project was to address diagnostics for cardiovascular disease (CVD). The stakeholder is a diagnostics-based medical device company that wants to integrate machine learning into new or existing medical devices to provide continuous monitoring and early detection of CVD-related symptoms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stakeholder:&lt;/strong&gt; Diagnostic Medical Device Company&lt;/p&gt;

&lt;p&gt;The stakeholder wants to determine which CVD-related features are the most important to monitor. With that information, they would like to develop medical devices for either at-home patient use or clinical use, potentially incorporating machine learning algorithms into their diagnostic devices and creating software as a medical device.&lt;/p&gt;

&lt;h2&gt;Methodology&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Perform data cleaning, which consists of casting columns to correct data types.&lt;/li&gt;
&lt;li&gt;Deal with missing values accordingly. &lt;/li&gt;
&lt;li&gt;Perform data exploration and view correlations.&lt;/li&gt;
&lt;li&gt;Normalize continuous data so all features are on the same scale.&lt;/li&gt;
&lt;li&gt;Create training and testing sets to train classification models and validate their performance.&lt;/li&gt;
&lt;li&gt;Use the recall scoring metric to optimize models.&lt;/li&gt;
&lt;li&gt;Observe important features from the top-performing model.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I optimized for the best recall score because I wanted the model to correctly identify positive instances and minimize false negatives. The worst-case scenario for this model would be to classify an individual as healthy when in reality they did have heart disease. In summary, I wanted the model to capture all positive cases and minimize false negatives.&lt;/p&gt;
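&lt;p&gt;In scikit-learn, that optimization step can be sketched as a grid search scored on recall, so the search prefers parameter sets that minimize false negatives. The grid values below are illustrative, not the ones used in the project.&lt;/p&gt;

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def tune_for_recall(X_train, y_train):
    """Grid-search a random forest with recall as the selection metric."""
    param_grid = {
        "n_estimators": [50, 100],   # illustrative values only
        "max_depth": [None, 5, 10],
    }
    search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid,
        scoring="recall",  # select the fold-averaged recall winner
        cv=5,
    )
    search.fit(X_train, y_train)
    return search.best_estimator_
```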

&lt;h2&gt;Results:&lt;/h2&gt;

&lt;p&gt;Baseline Receiver Operating Characteristic curve:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cKJo_VQl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7ky4lg34x3wq21s2jtja.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cKJo_VQl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7ky4lg34x3wq21s2jtja.png" alt="base_roc" width="626" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Optimized Receiver Operating Characteristic curve:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OIzm4Kc4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wn37o06pjfdyqn4g8hab.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OIzm4Kc4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wn37o06pjfdyqn4g8hab.png" alt="op_roc" width="626" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The best-performing model was the Random Forest model, which yielded the following top feature importances and confusion matrix:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VT5vrdqG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5pe1masanj5iwfqekssj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VT5vrdqG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5pe1masanj5iwfqekssj.png" alt="feature_importance" width="675" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--I9KbBVU6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/csdk8onvwej20jg5k7s8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--I9KbBVU6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/csdk8onvwej20jg5k7s8.png" alt="confusion_matrix" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;
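&lt;p&gt;Charts like the feature-importance ranking come straight from the fitted forest. A minimal sketch, assuming hypothetical feature names, of ranking &lt;code&gt;feature_importances_&lt;/code&gt;:&lt;/p&gt;

```python
import pandas as pd

def rank_features(fitted_forest, feature_names):
    """Pair a fitted RandomForestClassifier's feature_importances_ with
    column names and sort from most to least important."""
    importances = pd.Series(fitted_forest.feature_importances_,
                            index=feature_names)
    return importances.sort_values(ascending=False)
```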

&lt;p&gt;In this project, I concluded the following evaluations for the obtained results from my analysis:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;I recommend incorporating the Random Forest model into a diagnostic medical device as software for monitoring symptoms of cardiovascular disease.&lt;/li&gt;
&lt;li&gt;I recommend developing medical devices geared toward monitoring the slope of the electrocardiogram's S-T segment, exercise-induced angina, and S-T segment depression.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;Overall Experience&lt;/h2&gt;

&lt;p&gt;This was no easy task, and there were times when I definitely got 'writer's block' due to the complexity of the task and the number of models I needed to understand. Much research was needed to use these classifiers. Now that I understand them, I can proudly say I could reproduce a project like this one using different features and targets. &lt;/p&gt;

&lt;p&gt;My next step would be to look into image processing and deep learning so I can analyze medical imaging and create diagnostics using that knowledge. This project was just the stepping stone to reach my next goal. &lt;/p&gt;




&lt;p&gt;To look more into this project, please look at my git repository which includes more analysis of the data and the data sources. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/cristopher-d-delgado/heart_failure"&gt;Github Repo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>codenewbie</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Home Value Regression Analysis</title>
      <dc:creator>Cristopher Delgado </dc:creator>
      <pubDate>Wed, 02 Aug 2023 22:01:23 +0000</pubDate>
      <link>https://dev.to/cristopherddelgado/home-value-regression-analysis-2fl7</link>
      <guid>https://dev.to/cristopherddelgado/home-value-regression-analysis-2fl7</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/cristopher-d-delgado"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;As I continue my journey into data science and keep learning new data analysis methods, I am proud to have completed my second project showcasing my skills. This project is all about predicting the values of homes located in King County, Washington within the years 2021-2022. &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Methodology:&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Perform data cleaning which consisted of changing data types to appropriate/expected types.&lt;/li&gt;
&lt;li&gt;Normalize data and linearize continuous data accordingly.&lt;/li&gt;
&lt;li&gt;Perform exploratory data analysis to understand the correlations of the features with the price of a home.&lt;/li&gt;
&lt;li&gt;Take on an iterative approach to creating prediction models using Linear Regression.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Data Normalization:&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;I prepared the continuous data by standardizing each distribution so it could be compared to a z-distribution. In doing so, I was able to remove any outliers more than 3 standard deviations from the sample mean and then convert the values back to their original scale.
&lt;/li&gt;
&lt;/ul&gt;
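&lt;p&gt;That filtering step can be sketched with pandas. This is a minimal version, assuming one continuous column is handled at a time; the 3-standard-deviation cutoff matches the text.&lt;/p&gt;

```python
import pandas as pd

def drop_outliers(series, n_std=3.0):
    """Standardize a column, drop rows whose z-score magnitude exceeds
    n_std, and return the surviving original (un-standardized) values."""
    z = (series - series.mean()) / series.std()
    return series[z.abs().le(n_std)]
```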

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---UC1v756--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r7g8zcg3x2036e2jdhg5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---UC1v756--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r7g8zcg3x2036e2jdhg5.png" alt="normalization" width="800" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;General Trends:&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;After removing outliers, correcting datatypes, and manipulating data representations, I was able to begin viewing general trends in the data, with my target variable being 'price'. Price in this context refers to the home sale price. &lt;/li&gt;
&lt;li&gt;Viewing general trends then allowed me to see the features that most impact a home value.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PqXa1MIV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8o6zlyttlokc5cs58ru8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PqXa1MIV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8o6zlyttlokc5cs58ru8.png" alt="bath" width="440" height="275"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MKkc8EgF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a9kqd7owpenfh7c5tg5m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MKkc8EgF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a9kqd7owpenfh7c5tg5m.png" alt="cat" width="800" height="1338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Best Model:&lt;/h2&gt;

&lt;p&gt;This model has the best interpretability and the lowest error of all the models I generated for this project. It was achieved using polynomial regression. &lt;/p&gt;
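&lt;p&gt;In scikit-learn, polynomial regression is just a linear model fit on expanded features. A minimal sketch, assuming degree 2; the actual degree and features used are in the repository:&lt;/p&gt;

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def build_poly_model(degree=2):
    """Expand features with polynomial and interaction terms, then fit
    ordinary least squares on the expanded design matrix."""
    return make_pipeline(PolynomialFeatures(degree=degree),
                         LinearRegression())
```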

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--f_5hX80d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/knmn3k13ztxrhla92197.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--f_5hX80d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/knmn3k13ztxrhla92197.png" alt="model" width="800" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Conclusions:&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Location can make up for most of the price in a home's value. &lt;/li&gt;
&lt;li&gt;Home aspects such as bedrooms, bathrooms, and square footage matter. &lt;/li&gt;
&lt;li&gt;A home's construction and state matter.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Challenges:&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Part of my data preparation was zip code extraction, so I could visualize in more depth how location impacts price. Unfortunately, I was not able to develop a clear graph showcasing that impact because there were so many unique zip codes. I still included zip code as a feature in my models. &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Learning Outcomes:&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;I managed to create predictive models and learned all about the statistics used in data science. &lt;/li&gt;
&lt;li&gt;I improved my data visualization abilities. &lt;/li&gt;
&lt;li&gt;Learned to draw insights from regression modeling.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Please feel free to review the entire project in my GitHub &lt;a href="https://github.com/cristopher-d-delgado/Project_Linear_Regression_King_County"&gt;repository&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>python</category>
      <category>beginners</category>
      <category>learning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Movie Analysis for Movie Production</title>
      <dc:creator>Cristopher Delgado </dc:creator>
      <pubDate>Thu, 15 Jun 2023 23:04:02 +0000</pubDate>
      <link>https://dev.to/cristopherddelgado/movie-analysis-for-movie-production-k3f</link>
      <guid>https://dev.to/cristopherddelgado/movie-analysis-for-movie-production-k3f</guid>
      <description>&lt;p&gt;Just recently I finished up my first data science driven project at &lt;a href="https://flatironschool.com/"&gt;Flatiron School&lt;/a&gt;. It was a big learning experience along with many ups and downs in tackling this project. I am so glad to say that I completed this project and created actionable insights. In this blog post I want to take the time to explain my findings. &lt;/p&gt;

&lt;h2&gt;Business Problem&lt;/h2&gt;

&lt;p&gt;The problem at hand involves Microsoft deciding to create a new movie studio. Microsoft does not know anything about creating movies. Overall, Microsoft would like findings on what films are currently doing the best at the box office, and then translate those findings into actionable insights that the head of Microsoft's new movie studio can use to help decide what type of films to create.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Business Understanding&lt;br&gt;
Microsoft wants to enter the movie production business; however, as a rookie in this business, they need insight into audience opinions about the characteristics of movies. Audience opinions impact ROI. In addition to learning what an audience enjoys, Microsoft must also consider when to release their movies to get the most ROI.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Data&lt;/h2&gt;

&lt;p&gt;The data is based on datasets from Box Office Mojo, IMDB, Rotten Tomatoes, TheMovieDB, and The Numbers. The Box Office Mojo dataset contains information about movies' domestic and worldwide gross. The IMDB database covers each movie's production team, cast, basic information, and ratings. The Rotten Tomatoes dataset contains critic reviews along with their rating and whether the movie was fresh or rotten. The TheMovieDB dataset contains each movie's genre, vote average, vote count, and popularity score. The Numbers dataset contains each movie's production budget, domestic gross, worldwide gross, and release date.&lt;/p&gt;

&lt;h2&gt;Results&lt;/h2&gt;

&lt;p&gt;TMDB has a unique attribute for movies known as the popularity score. &lt;/p&gt;

&lt;p&gt;The popularity score is the lifelong culmination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of votes for the day&lt;/li&gt;
&lt;li&gt;Number of views for the day&lt;/li&gt;
&lt;li&gt;Number of users who marked it as a 'favorite' for the day&lt;/li&gt;
&lt;li&gt;Release date&lt;/li&gt;
&lt;li&gt;Number of total votes&lt;/li&gt;
&lt;li&gt;The previous day's score&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It was essential to consider this score from the TMDB dataset because it folds in many factors already calculated by TMDB. The score is a good representation of a movie's popularity over its lifetime since the release date. Taking this score into consideration, we can determine which genres are the most popular by creating a visualization, as shown below.  &lt;/p&gt;
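&lt;p&gt;The ranking behind that visualization reduces to a groupby over the movie table. A sketch, assuming hypothetical column names rather than the actual TMDB schema:&lt;/p&gt;

```python
import pandas as pd

def popularity_by_genre(movies):
    """Average the TMDB popularity score per genre and rank genres
    from most to least popular."""
    return (movies.groupby("genre")["popularity"]
                  .mean()
                  .sort_values(ascending=False))
```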

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--G4anm3Op--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pb0m2mz24x91nm47lzsp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--G4anm3Op--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pb0m2mz24x91nm47lzsp.png" alt="Popularity Score Based on Genre" width="800" height="609"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sales are important in movie production, as they determine the Return on Investment (ROI). A movie's characteristics, such as its intended audience and release date, can impact its box office sales. As we can see, PG-13 and PG movies make the most box office sales, which can yield the most ROI. Releasing a movie at a certain time of the year can also impact box office sales, as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PRF5Ewqk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iqofkmvpack81i4kkr9y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PRF5Ewqk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iqofkmvpack81i4kkr9y.png" alt="Sale Based on Appropriate Movie Audience" width="800" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZdEikl_---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yb8hpnvc7furph3nitck.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZdEikl_---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yb8hpnvc7furph3nitck.png" alt="Impact of Month on Box Office Revenue" width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Cbbm4Job--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vh83q9ov1zicreua7kpt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Cbbm4Job--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vh83q9ov1zicreua7kpt.png" alt="Avg ROI Domestic" width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--G6CHDGDI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/15sl1ox1fqjj33sr9p51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--G6CHDGDI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/15sl1ox1fqjj33sr9p51.png" alt="Avg ROI Worldwide" width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;This analysis has led to three recommendations for Microsoft's new movie production studio:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Make movies with the following genres: Western, Romance, History, Animation, Family, Mystery, Thriller, Science Fiction, War, Crime, Fantasy, Action, Adventure.&lt;/strong&gt; These genres had a popularity score of 40 percent or higher, which according to TMDB is considered good through excellent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make movies with the intended audience of PG-13 and PG.&lt;/strong&gt; These intended-audience ratings proved to have the highest box office sales in comparison to R-rated and G-rated movies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Release movies in the summertime as well as at the end of the year.&lt;/strong&gt; The time of year matters when releasing movies. We can see this occur in the months of May, June, July, November, December, and February.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Self Assessment&lt;/h2&gt;

&lt;p&gt;This project has taught me the fundamentals of data analysis and data science. I learned so much about data cleaning and preparation. I am glad to say that I have learned about Pandas, SQLite3, Matplotlib, Seaborn, JSON, and APIs, all of which I utilized in this project.&lt;/p&gt;

&lt;p&gt;As great as this project is, there is always room for improvement. I wish I had had time to merge the Box Office Mojo data with the data from The Numbers, since these datasets contained similar information regarding ROI. My code as of right now is very linear and not a plug-and-play file: the same charts cannot be produced if different datasets are used. If possible, I would have liked to create multiple functions that clean the data so others can reuse them. Not only would this look more professional, it would also make the code more readable. &lt;/p&gt;

&lt;p&gt;Feel free to check out my &lt;a href="https://github.com/cristopher-d-delgado/dsc-phase-1-project-v2-4.git"&gt;repository&lt;/a&gt;, which contains all the exploratory analysis for this project.&lt;/p&gt;

&lt;h2&gt;Next Steps&lt;/h2&gt;

&lt;p&gt;Further analysis can provide additional insights for Microsoft and their new movie production studio. Other exploratory routes could take the following directions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explore the ROI based on the director of the movie. Directors are important to any movie production, as they organize and envision movie scenes and guide production. The main goal is to pick the best director.&lt;/li&gt;
&lt;li&gt;Create a model where you can predict the best years to release movies. Eventually any movie studio will have to create an organized timeline for movie releases especially if sequels are made. Predicting a year to release movies can be crucial to obtain the most ROI.&lt;/li&gt;
&lt;li&gt;Merge Box Office Mojo data with The Number since they contain similar information and look at the domestic gross made on average throughout the year.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>learning</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
