Create account

DEV Community

Maksim Ekin Eren

Posted on May 23, 2020

COVID-19 Literature Clustering

#octograd2020

My Final Project

A novel approach to document exploration using Machine Learning. Let's cluster similar research articles together to make it easier for health professionals to find relevant research articles, and respond to rapidly spreading COVID-19 promptly.

Demo Link

https://maksimekin.github.io/COVID19-Literature-Clustering/plots/t-sne_covid-19_interactive.html

Link to Code

MaksimEkin / COVID19-Literature-Clustering

A novel approach to document exploration using Machine Learning. Let's cluster similar research articles together to make it easier for health professionals to find relevant research articles, and responde to rapidly spreading COVID-19 promptly.

COVID-19 Literature Clustering

Goal

Given the large number of literature and the rapid spread of COVID-19, it is difficult for health professionals to keep up with new information on the virus. Can clustering similar research articles together simplify the search for related publications? How can the content of the clusters be qualified?

By using clustering for labelling in combination with dimensionality reduction for visualization, the collection of literature can be represented by a scatter plot. On this plot, publications of highly similar topic will share a label and will be plotted near each other. In order, to find meaning in the clusters, topic modelling will be performed to find the keywords of each cluster.

By using Bokeh, the plot will be interactive. User’s will have the option of seeing the plot as a whole or filtering the data by cluster. If a narrower scope is required, the plot will also…

View on GitHub

How I built it

Parse the text from the body of each document using Natural Language Processing (NLP).
Turn each document instance di into a feature vector Xi using Term Frequency–inverse Document Frequency (TF-IDF).
Apply Dimensionality Reduction to each feature vector Xi using t-Distributed Stochastic Neighbor Embedding (t-SNE) to cluster similar research articles in the two-dimensional plane X embedding Y1.
Use Principal Component Analysis (PCA) to project down the dimensions of X to a number of dimensions that will keep .95 variance while removing noise and outliers in embedding Y2.
Apply k-means clustering on Y2, where k is 20, to label each cluster on Y1.
Apply Topic Modeling on X using Latent Dirichlet Allocation (LDA) to discover keywords from each cluster.
Investigate the clusters visually on the plot, zooming down to specific articles as needed, and via classification using Stochastic Gradient Descent (SGD).

Additional Thoughts / Feelings / Stories

You can see the analysis Jupyter Notebook here: https://maksimekin.github.io/COVID19-Literature-Clustering/COVID19_literature_clustering.html

The news video that featured the project can be found here: https://www.youtube.com/watch?v=vyOrM8zC_Iw

Do your career a big favor. Join DEV. (The website you're on right now)

It takes one minute, it's free, and is worth it for your career.

Get started

Community matters

Top comments (0)

See why 4M developers consider Sentry, “not bad.”

Fixing code doesn’t have to be the worst part of your day. Learn how Sentry can help.

Learn more

DEV Community

COVID-19 Literature Clustering

My Final Project

Demo Link

Link to Code

MaksimEkin / COVID19-Literature-Clustering

A novel approach to document exploration using Machine Learning. Let's cluster similar research articles together to make it easier for health professionals to find relevant research articles, and responde to rapidly spreading COVID-19 promptly.

COVID-19 Literature Clustering

Goal

How I built it

Additional Thoughts / Feelings / Stories

Top comments (0)

See why 4M developers consider Sentry, “not bad.”

Read next

Supporting Open Source: The Importance of Developer Patronage

Rippling vs Justworks Comparison

🚀 Node.js Performance: Essential Tips and Tricks for Developers

Perplexity AI’s Deep Research Tool: How to Access and Use It for Free

AWS GenAI LIVE!