DEV Community

Cover image for COVID-19 Literature Clustering
Maksim Ekin Eren
Maksim Ekin Eren

Posted on

COVID-19 Literature Clustering

My Final Project

A novel approach to document exploration using Machine Learning. Let's cluster similar research articles together to make it easier for health professionals to find relevant research articles, and respond to rapidly spreading COVID-19 promptly.

Demo Link

https://maksimekin.github.io/COVID19-Literature-Clustering/plots/t-sne_covid-19_interactive.html

Link to Code

GitHub logo MaksimEkin / COVID19-Literature-Clustering

A novel approach to document exploration using Machine Learning. Let's cluster similar research articles together to make it easier for health professionals to find relevant research articles, and responde to rapidly spreading COVID-19 promptly.

COVID-19 Literature Clustering

Goal

Given the large number of literature and the rapid spread of COVID-19, it is difficult for health professionals to keep up with new information on the virus. Can clustering similar research articles together simplify the search for related publications? How can the content of the clusters be qualified?

By using clustering for labelling in combination with dimensionality reduction for visualization, the collection of literature can be represented by a scatter plot. On this plot, publications of highly similar topic will share a label and will be plotted near each other. In order, to find meaning in the clusters, topic modelling will be performed to find the keywords of each cluster.

By using Bokeh, the plot will be interactive. User’s will have the option of seeing the plot as a whole or filtering the data by cluster. If a narrower scope is required, the plot will also…

How I built it

  • Parse the text from the body of each document using Natural Language Processing (NLP).
  • Turn each document instance di into a feature vector Xi using Term Frequency–inverse Document Frequency (TF-IDF).
  • Apply Dimensionality Reduction to each feature vector Xi using t-Distributed Stochastic Neighbor Embedding (t-SNE) to cluster similar research articles in the two-dimensional plane X embedding Y1.
  • Use Principal Component Analysis (PCA) to project down the dimensions of X to a number of dimensions that will keep .95 variance while removing noise and outliers in embedding Y2.
  • Apply k-means clustering on Y2, where k is 20, to label each cluster on Y1.
  • Apply Topic Modeling on X using Latent Dirichlet Allocation (LDA) to discover keywords from each cluster.
  • Investigate the clusters visually on the plot, zooming down to specific articles as needed, and via classification using Stochastic Gradient Descent (SGD).

Additional Thoughts / Feelings / Stories

You can see the analysis Jupyter Notebook here: https://maksimekin.github.io/COVID19-Literature-Clustering/COVID19_literature_clustering.html

The news video that featured the project can be found here: https://www.youtube.com/watch?v=vyOrM8zC_Iw

Top comments (0)