DEV Community: Guilherme Bauer-Negrini

Best biomedical and health data science books and resources

Guilherme Bauer-Negrini — Thu, 22 Apr 2021 00:00:00 +0000

What is biomedical data science?

Biomedical data science spans a range of biological and medical research challenges that are data-intensive and focused on the creation of novel methodologies to advance biomedical science discovery. - Annual Review of Biomedical Data Science

Here is a list of resources that I have found while researching and studying biomedical data science and analytics. Unfortunately, many of the books and courses listed here are paid, but I have tried my best to include some free and open-source resources too. Let’s get to them!

What is biomedical data science?
Notice of non-affiliation and disclaimer
Statistics and math
- Modern Statistics for Modern Biology
- Statistics for Biomedical Engineers and Scientists
- Applied Mathematics for the Analysis of Biomedical Data: Models, Methods, and MATLAB
- Data-Handling in Biomedical Science
Data engineering
- Data Warehousing for Biomedical Informatics
- Big Biomedical Data Engineering
Data manipulation, data analysis, and machine learning
- Data Science and Predictive Analytics: Biomedical and Health Applications using R
- Computational Learning Approaches to Data Analytics in Biomedical Applications
- Statistical Learning for Biomedical Data
- Case Studies in Neural Data Analysis
- Neural Data Science: A Primer with MATLAB and Python
- Computational Genomics with R
- Bioinformatics: The Machine Learning Approach
- Biomedical Image Analysis in Python
Datasets
- Synthea: Synthetic Patient Generation
- PhysioNet: The Research Resource for Complex Physiologic Signals
- Computational Biology Datasets Suitable For Machine Learning
- Kaggle: Healthcare tag
- NIH: Data Sharing Resources
Conclusions

Notice of non-affiliation and disclaimer

I am not the author of any of the resources mentioned here, nor am I associated with any of their authors, publishing companies, or digital platforms. I was not paid or compensated in any way for this post. Any reference in this post is for the information and convenience of the public and does not constitute an endorsement, recommendation, or favoring.

Statistics and math

A good understanding of statistics and mathematics is fundamental to any data science or machine learning analysis. The most basic and key concepts include probability distributions, statistical significance, hypothesis testing, and regression. Here are some resources dedicated to teaching you all of that (and more) with examples from biomedical sciences.

Modern Statistics for Modern Biology

Susan Holmes, Wolfgang Huber

Book 📘 | Code: R | Free: ✅ | Link ↗️

The aim of this book is to enable scientists working in biological research to quickly learn many of the important ideas and methods that they need to make the best of their experiments and of other available data. The book takes a hands-on approach.

This book is not heavy on mathematics. It goes straight to the core concepts and has a lot of R code examples and exercises! It ranges from the basics of data distributions and hypothesis testing to more advanced topics like multivariate analysis and supervised learning.

Statistics for Biomedical Engineers and Scientists

Andrew King, Robert Eckersley

Book 📘 | Code: MATLAB | Free: ❌ | Link ↗️

Readers will learn how to understand the fundamental concepts of descriptive and inferential statistics, analyze data and choose an appropriate hypothesis test to answer a given question, compute numerical statistical measures and perform hypothesis tests “by hand”, and visualize data and perform statistical analysis using MATLAB.

This is just what you would expect from a regular undergraduate-level book about probability and statistics. It is not heavy on math and has a lot of exercises.

Applied Mathematics for the Analysis of Biomedical Data: Models, Methods, and MATLAB

Peter J. Costa

Book 📘 | Code: MATLAB | Free: ❌ | Link ↗️

Features a practical approach to the analysis of biomedical data via mathematical methods and provides a MATLAB® toolbox for the collection, visualization, and evaluation of experimental and real-life data

This one is heavier on maths and assumes you are familiar with elementary differential equations, linear algebra, and statistics.

Data-Handling in Biomedical Science

Peter White

Book 📘 | Code: ❌ | Free: ❌ | Link ↗️

Packed with worked examples and problems, this book will help the reader improve their confidence and skill in data-handling.

This one is a little different from the previous ones, but it is worth listing. The book has no code examples and it is not about computational methods of data handling and analysis. It teaches basic math and statistics needed for biochemistry and microbiology experiments.

Data engineering

Knowing how to design and maintain data pipelines is just as important as analyzing data. Biomedical data can be messy, heterogeneous, and big, but fortunately, these authors are here to help us!

Data Warehousing for Biomedical Informatics

Richard E. Biehl

Book 📘 | Code: SQL | Free: ❌ | Link ↗️

A step-by-step how-to guide for designing and building an enterprise-wide data warehouse across a biomedical or healthcare institution, using a four-iteration lifecycle and standardized design pattern.

This book is a gem. Classical content about data warehousing and ETL pipelines, but really focused on biomedical and healthcare data. Lots of SQL code snippets!

Big Biomedical Data Engineering

Ripon Patgiri, Sabuzima Nayak

Book chapter 📄 | Code: ❌ | Free: ✅ | Link ↗️

This chapter exploits the role of Big Data in biomedical data engineering and its storage dilemma.

A short book chapter that discusses some scenarios of biomedical big data applications and possible future directions.

Data manipulation, data analysis, and machine learning

This is where most people have fun. Let’s see how to handle, clean, analyze and extract insights from biomedical data.

Data Science and Predictive Analytics: Biomedical and Health Applications using R

Ivo D. Dinov

Book and MOOC 📘 💻 | Code: R | Free: ✅ ❌ | Link ↗️ | Free online material ↗️

Complete and self-contained treatment of the theory, experimental modeling, system development, and validation of predictive health analytics.

A comprehensive data science book: introduction to R, data manipulation, data visualization, classification, regression, NLP, and even a little Deep Learning! All of this with well-documented R code. The book is not free, but you can find the videos, class notes, and R code on the author’s page linked above.

Computational Learning Approaches to Data Analytics in Biomedical Applications

Khalid Al-Jabery, Tayo Obafemi-Ajayi, Gayla Olbricht, Donald Wunsch

Book 📘 | Code: Python, MATLAB | Free: ❌ | Link ↗️

It presents insights on biomedical data processing, innovative clustering algorithms and techniques, and connections between statistical analysis and clustering.

An interesting and more theoretical approach to data preprocessing and clustering algorithms. Examples are given in pseudocode and some math knowledge is required. The last chapter takes a hands-on approach using MATLAB and Python code.

Statistical Learning for Biomedical Data

James D. Malley, Karen G. Malley, Sinisa Pajevic

Book 📘 | Code: MATLAB | Free: ❌ | Link ↗️

This book is for anyone who has biomedical data and needs to identify variables that predict an outcome, for two-group outcomes such as tumor/not-tumor, survival/death, or response from treatment.

Not heavy on math and does not have many code examples. Great theoretical explanations covering regression, single decision trees, and Random Forests.

Case Studies in Neural Data Analysis

Mark Kramer, Uri Eden

Book 📘 | Code: Python | Free: ✅ | Link ↗️

The intended audience is the practicing neuroscientist - e.g., the students, researchers, and clinicians collecting neuronal data in the hospital or lab. The material can get pretty math-heavy, but we’ve tried to outline the main concepts as directly as possible, with hands-on implementations of all concepts.

Great hands-on material for neuroscientists interested in analyzing spike trains and electric fields. All notebooks are in Python and have a little explanation about the concepts and goal of the analysis.

Neural Data Science: A Primer with MATLAB and Python

Erik Lee Nylen, Pascal Wallisch

Book 📘 | Code: Python, MATLAB | Free: ❌ | Link ↗️

A beginner’s introduction to the principles of computation and data analysis in neuroscience, using both Python and MATLAB, giving readers the ability to transcend platform tribalism and enable coding versatility.

This book is beautifully organized and filled with images. The coolest thing about it is the MATLAB and Python code written side-by-side. The content ranges from the basics of programming to advanced techniques such as analog signal processing, biophysical modeling, clustering, and classification.

Computational Genomics with R

Altuna Akalin

Book 📘 | Code: R | Free: ✅ | Link ↗️

The aim of this book is to provide the fundamentals for data analysis for genomics. We want this book to be a starting point for computational genomics students and a guide for further data analysis in more specific topics in genomics.

This book has a great introduction to genomics that will help a lot if you are not coming from a biology-related field. It covers many topics such as introduction to R, statistics, exploratory data analysis, supervised learning, RNA-Seq, and more!

Bioinformatics: The Machine Learning Approach

Pierre Baldi, Søren Brunak

Book 📘 | Code: ❌ | Free: ❌ | Link ↗️

The book is aimed both at biologists and biochemists who need to understand new data-driven algorithms and at those with a primary background in physics, mathematics, statistics, or computer science who need to know more about applications in molecular biology.

This one is a little heavy on math; you will probably need some calculus, algebra, and probability theory. The book is really about the theoretical aspects of machine learning applied to bioinformatics, including definitions of main concepts and proofs of main theorems.

Biomedical Image Analysis in Python

DataCamp

Videos and interactive code 💻 | Code: Python | Free: ❌ | Link ↗️

In this introductory course, you’ll learn the fundamentals of image analysis using NumPy, SciPy, and Matplotlib. You’ll navigate through a whole-body CT scan, segment a cardiac MRI time series, and determine whether Alzheimer’s disease changes brain structure.

Great content and it follows the DataCamp course structure: short videos and hands-on coding exercises directly in the browser!

Datasets

Here are some places where you can find datasets to explore and exercise your skills:

Synthea: Synthetic Patient Generation

MITRE Corporation

Link ↗️

Synthea™ is an open-source, synthetic patient generator that models the medical history of synthetic patients. The resulting data is free from cost, privacy, and security restrictions, enabling research with Health IT data that is otherwise legally or practically unavailable.

PhysioNet: The Research Resource for Complex Physiologic Signals

MIT Laboratory for Computational Physiology

Link ↗️

PhysioNet is a repository of freely-available medical research data, managed by the MIT Laboratory for Computational Physiology.

Computational Biology Datasets Suitable For Machine Learning

Ben Lengerich

Link ↗️

This is a curated list of computational biology datasets that have been pre-processed for machine learning.

Kaggle: Healthcare tag

Link ↗️

Kaggle is the world’s largest data science community with powerful tools and public datasets.

NIH: Data Sharing Resources

Trans-NIH BioMedical Informatics Coordinating Committee

Link ↗️

To help researchers locate an appropriate resource for sharing their data, as well as to promote awareness of resources where datasets can be located for reuse, BMIC maintains lists of several types of data sharing resources.

Conclusions

That’s it! This comprehensive list covers many areas of biomedical data science and analytics, but there are many more great resources out there! Do you think I might have left out something important? Share with us in the comments!

Biomedical text natural language processing (BioNLP) using scispaCy: NER and rule-based matching

Guilherme Bauer-Negrini — Thu, 15 Apr 2021 15:40:55 +0000

Biomedical text mining and natural language processing (BioNLP) is an interesting research domain that deals with processing data from journals, medical records, and other biomedical documents. Considering the availability of biomedical literature, there has been an increasing interest in extracting information, relationships, and insights from text data. However, the unstructured organization and the domain complexity of biomedical documents make these tasks hard. Fortunately, some cool NLP Python packages can help us with that!

scispaCy is a Python package containing spaCy models for processing biomedical, scientific or clinical text. spaCy's most mindblowing features are neural network models for tagging, parsing, named entity recognition (NER), text classification, and more. Add scispaCy models on top of it and we can do all that in the biomedical domain!

Here we are going to see how to use scispaCy NER models to identify drug and disease names mentioned in a medical transcription dataset. Moreover, we are going to combine NER and rule-based matching to extract the drug names and dosages reported in each transcription.

Requirements
Dataset
Named entity recognition
Rule-based matching
Conclusions
References

Requirements

Python 3
pandas
spacy>=3.0
scispacy

You can simply pip install all of them.

We also need to download and install the NER model from scispaCy. To install the en_ner_bc5cdr_md model use the following command:

pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_bc5cdr_md-0.4.0.tar.gz

For updated versions or other models, please check the scispaCy page.

Dataset

Unstructured medical data, like medical transcriptions, are pretty hard to find. Here we are using a medical transcription dataset scraped from the MTSamples website by Tara Boyle and made available at Kaggle.

import pandas as pd

med_transcript = pd.read_csv("mtsamples.csv", index_col=0)
med_transcript.info()
med_transcript.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4999 entries, 0 to 4998
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   description        4999 non-null   object
 1   medical_specialty  4999 non-null   object
 2   sample_name        4999 non-null   object
 3   transcription      4966 non-null   object
 4   keywords           3931 non-null   object
dtypes: object(5)
memory usage: 234.3+ KB

	description	medical_specialty	sample_name	transcription	keywords
0	A 23-year-old white female presents with comp...	Allergy / Immunology	Allergic Rhinitis	SUBJECTIVE:, This 23-year-old white female pr...	allergy / immunology, allergic rhinitis, aller...
1	Consult for laparoscopic gastric bypass.	Bariatrics	Laparoscopic Gastric Bypass Consult - 2	PAST MEDICAL HISTORY:, He has difficulty climb...	bariatrics, laparoscopic gastric bypass, weigh...
2	Consult for laparoscopic gastric bypass.	Bariatrics	Laparoscopic Gastric Bypass Consult - 1	HISTORY OF PRESENT ILLNESS: , I have seen ABC ...	bariatrics, laparoscopic gastric bypass, heart...
3	2-D M-Mode. Doppler.	Cardiovascular / Pulmonary	2-D Echocardiogram - 1	2-D M-MODE: , ,1. Left atrial enlargement wit...	cardiovascular / pulmonary, 2-d m-mode, dopple...
4	2-D Echocardiogram	Cardiovascular / Pulmonary	2-D Echocardiogram - 2	1. The left ventricular cavity size and wall ...	cardiovascular / pulmonary, 2-d, doppler, echo...

The dataset has almost 5000 records, but let's work with a small random subsample so it doesn't take too long to process. We also have to drop any rows whose transcriptions are missing.

med_transcript.dropna(subset=['transcription'], inplace=True)
med_transcript_small = med_transcript.sample(n=100, replace=False, random_state=42)
med_transcript_small.info()
med_transcript_small.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 3162 to 3581
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   description        100 non-null    object
 1   medical_specialty  100 non-null    object
 2   sample_name        100 non-null    object
 3   transcription      100 non-null    object
 4   keywords           78 non-null     object
dtypes: object(5)
memory usage: 4.7+ KB

	description	medical_specialty	sample_name	transcription	keywords
3162	Markedly elevated PT INR despite stopping Cou...	Hematology - Oncology	Hematology Consult - 1	HISTORY OF PRESENT ILLNESS:, The patient is w...	NaN
1981	Intercostal block from fourth to tenth interc...	Pain Management	Intercostal block - 1	PREPROCEDURE DIAGNOSIS:, Chest pain secondary...	pain management, xylocaine, marcaine, intercos...
1361	The patient is a 65-year-old female who under...	SOAP / Chart / Progress Notes	Lobectomy - Followup	HISTORY OF PRESENT ILLNESS: , The patient is a...	soap / chart / progress notes, non-small cell ...
3008	Construction of right upper arm hemodialysis ...	Nephrology	Hemodialysis Fistula Construction	PREOPERATIVE DIAGNOSIS: , End-stage renal dise...	nephrology, end-stage renal disease, av dialys...
4943	Bronchoscopy with brush biopsies. Persistent...	Cardiovascular / Pulmonary	Bronchoscopy - 8	PREOPERATIVE DIAGNOSIS: , Persistent pneumonia...	cardiovascular / pulmonary, persistent pneumon...

Let's take one transcription to see how we can work with NER:

sample_transcription = med_transcript_small['transcription'].iloc[0]
print(sample_transcription[:1000]) # prints just the first 1000 characters

HISTORY OF PRESENT ILLNESS:,  The patient is well known to me for a history of iron-deficiency anemia due to chronic blood loss from colitis.  We corrected her hematocrit last year with intravenous (IV) iron.  Ultimately, she had a total proctocolectomy done on 03/14/2007 to treat her colitis.  Her course has been very complicated since then with needing multiple surgeries for removal of hematoma.  This is partly because she was on anticoagulation for a right arm deep venous thrombosis (DVT) she had early this year, complicated by septic phlebitis.,Chart was reviewed, and I will not reiterate her complex history.,I am asked to see the patient again because of concerns for coagulopathy.,She had surgery again last month to evacuate a pelvic hematoma, and was found to have vancomycin resistant enterococcus, for which she is on multiple antibiotics and followed by infectious disease now.,She is on total parenteral nutrition (TPN) as well.,LABORATORY DATA:,  Labs today showed a white blood

So, we can see a lot of entities in this transcription. There are drug, disease, and exam names for example.
The text was scraped from a web page and we can identify the different sections from the medical record like "HISTORY OF PRESENT ILLNESS" and "LABORATORY DATA", but this varies from record to record.
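If you want to work with those sections separately, a regular expression can be a reasonable starting point. The sketch below is only a heuristic: it assumes that a section heading is a run of capital letters (plus spaces or slashes) followed by a colon, which holds for the sample above but will certainly miss or mislabel things in other records. The `split_sections` helper and the shortened example text are mine, for illustration only.

```python
import re

# Assumption: headings look like "HISTORY OF PRESENT ILLNESS:" (capitals, spaces, slashes)
HEADING = re.compile(r"([A-Z][A-Z /]{2,}):,?\s*")

def split_sections(text):
    """Return a list of (heading, body) pairs found in a transcription."""
    matches = list(HEADING.finditer(text))
    sections = []
    for i, m in enumerate(matches):
        # A section body runs until the next heading, or to the end of the text
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[m.end():end].strip(" ,")
        sections.append((m.group(1).strip(), body))
    return sections

# Shortened excerpt in the same style as the transcription printed above
example = ("HISTORY OF PRESENT ILLNESS:,  The patient is well known to me.,"
           "LABORATORY DATA:,  Labs today showed a white blood")

for heading, body in split_sections(example):
    print(heading, "->", body)
```

Any text before the first heading is ignored by this sketch, so check a few records by eye before trusting it.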

Named entity recognition

Named entity recognition (NER) is a subtask of natural language processing used to identify and classify named entities mentioned in unstructured text into pre-defined categories. scispaCy has different models to identify different entity types and you can check them here.

We are going to use the NER model trained on the BC5CDR corpus (en_ner_bc5cdr_md). This corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases, and 3116 chemical-disease interactions. Don't forget to download and install the model.

import scispacy
import spacy
nlp = spacy.load("en_ner_bc5cdr_md")

spacy.load will return a Language object containing all components and data needed to process text. This object is usually called nlp in the documentation and tutorials. Calling the nlp object on a string of text will return a processed Doc object with the text split into words and annotated.

Let's get all identified entities and print their text, start position, end position, and type:

doc = nlp(sample_transcription)
print("TEXT", "START", "END", "ENTITY TYPE")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

TEXT START END ENTITY TYPE
iron-deficiency anemia 79 101 DISEASE
chronic blood loss 109 127 DISEASE
colitis 133 140 DISEASE
iron 203 207 CHEMICAL
...
vancomycin 781 791 CHEMICAL
infectious disease 873 891 DISEASE
improved.,PT 1348 1360 CHEMICAL
vitamin K 1503 1512 CHEMICAL
uric acid 1830 1839 CHEMICAL
bilirubin 1853 1862 CHEMICAL
Creatinine 1911 1921 CHEMICAL
...
Compazine 2474 2483 CHEMICAL
Zofran 2487 2493 CHEMICAL
epistaxis 2629 2638 DISEASE
bleeding 3057 3065 DISEASE
edema.,CARDIAC 3109 3123 CHEMICAL
adenopathy 3156 3166 DISEASE
...

We can see that the model correctly identified and labeled diseases such as iron-deficiency anemia and chronic blood loss, among many others. Several drugs were also identified, such as vancomycin, Compazine, and Zofran. The model can also identify molecules commonly measured in laboratory tests, such as creatinine, iron, bilirubin, and uric acid.
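A quick way to get an overview is to group the entity texts by their label. The sketch below uses a handful of (text, label) pairs copied by hand from the output above so that it runs without the model; with a real `Doc` you could build the same list with `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict

# (text, label) pairs copied from the printed output above
entities = [
    ("iron-deficiency anemia", "DISEASE"),
    ("chronic blood loss", "DISEASE"),
    ("colitis", "DISEASE"),
    ("iron", "CHEMICAL"),
    ("vancomycin", "CHEMICAL"),
    ("Compazine", "CHEMICAL"),
    ("Zofran", "CHEMICAL"),
    ("epistaxis", "DISEASE"),
]

# Collect the unique, lowercased entity texts under each label
by_label = defaultdict(set)
for text, label in entities:
    by_label[label].add(text.lower())

for label, names in sorted(by_label.items()):
    print(label, sorted(names))
```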

Not everything is perfect though. See how some tokens are weirdly classified as chemicals, possibly due to punctuation marks:

improved.,PT 1348 1360 CHEMICAL
edema.,CARDIAC 3109 3123 CHEMICAL

Punctuation marks are usually removed in NLP preprocessing steps, but we can't remove all of them here: doing so could make us miss chemical names and would corrupt quantities such as drug dosages. However, we can solve this particular problem by replacing the ".," marks that appear to separate some sections of the transcription. It is important to know your data and its domain to better understand your results.

import re

med_transcript_small['transcription'] = med_transcript_small['transcription'].apply(lambda x: re.sub(r'\.,', '. ', x))
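To see what this substitution does, here is the same regular expression applied to two short strings. The first is one of the odd entities from the output above; the second is a made-up dosage fragment, included to show that decimal points and ordinary commas are left alone. The `clean_separators` name is just for this illustration.

```python
import re

def clean_separators(text):
    """Replace the '.,' section separators with a period followed by a space."""
    return re.sub(r"\.,", ". ", text)

print(clean_separators("improved.,PT"))          # separator between two sections
print(clean_separators("Xanax 0.25 mg, daily"))  # no '.,' here, so nothing changes
```

Keep in mind that the entities have to be recomputed by calling nlp on the cleaned text; a doc created before the cleanup still reflects the original string.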

We can also check the entities using the displacy visualizer:

from spacy import displacy
displacy.render(doc[:100], style='ent', jupyter=True) # here I am printing just the first 100 tokens

Rule-based matching

Rule-based matching is similar to regular expressions, but spaCy’s rule-based matcher engines and components give you access to the tokens within the document and their relationships. We can combine this with the NER models to identify some pattern that includes our entities.

Let's extract from the text the drug names and their reported dosages. This could be of real use to identify possible medication errors by checking if the dosages are in accordance with standards and guidelines.

from spacy.matcher import Matcher

pattern = [{'ENT_TYPE':'CHEMICAL'}, {'LIKE_NUM': True}, {'IS_ASCII': True}]
matcher = Matcher(nlp.vocab)
matcher.add("DRUG_DOSE", [pattern])

The code above creates a pattern to identify a sequence of three tokens:

A token whose entity type is CHEMICAL (drug name)
A token that resembles a number (dosage)
A token that consists of ASCII characters (units, like mg or mL)

Then we initialize the Matcher with a vocabulary. The matcher must always share the same vocab with the documents it will operate on, so we use the nlp object vocab. We then add this pattern to the matcher and give it an ID.
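To build some intuition for what this three-token pattern accepts, here is a toy stand-in written in plain Python. It is not how spaCy implements the Matcher: the tokens are hand-made (text, entity type) pairs, `like_num` is a crude approximation of the LIKE_NUM attribute, and the function names are hypothetical. What it does show is that the third condition is very permissive, since a comma is also an ASCII token.

```python
NUMBER_WORDS = {"one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"}

def like_num(text):
    """Crude approximation of LIKE_NUM: digits, decimals, or a few number words."""
    cleaned = text.replace(",", "").replace(".", "", 1)
    return cleaned.isdigit() or text.lower() in NUMBER_WORDS

def find_drug_doses(tokens):
    """tokens is a list of (text, ent_type) pairs; return the matching 3-token windows."""
    hits = []
    for i in range(len(tokens) - 2):
        (t0, e0), (t1, _), (t2, _) = tokens[i:i + 3]
        # Same three conditions as the pattern: CHEMICAL, number-like, ASCII
        if e0 == "CHEMICAL" and like_num(t1) and t2.isascii():
            hits.append(" ".join([t0, t1, t2]))
    return hits

# Hand-made tokens; in the real pipeline these come from the Doc
tokens = [("Aspirin", "CHEMICAL"), ("81", ""), ("mg", ""), ("daily", ""),
          ("Altrua", "CHEMICAL"), ("60", ""), (",", "")]
print(find_drug_doses(tokens))
```

Both windows match here, the second one with a comma in the "unit" position, so expect a few matches that are not real dosages.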

Now we can loop through all transcriptions and extract the text matching this pattern:

for transcription in med_transcript_small['transcription']:
    doc = nlp(transcription)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # get string representation
        span = doc[start:end]  # the matched span
        print(string_id, start, end, span.text)

DRUG_DOSE 137 140 Xylocaine 20 mL
DRUG_DOSE 141 144 Marcaine 0.25%
DRUG_DOSE 208 211 Aspirin 81 mg
DRUG_DOSE 216 219 Spiriva 10 mcg
DRUG_DOSE 399 402 nifedipine 10 mg
DRUG_DOSE 226 229 aspirin one tablet
DRUG_DOSE 245 248 Warfarin 2.5 mg
DRUG_DOSE 67 70 Topamax 100 mg
...
DRUG_DOSE 193 196 Metamucil one pack
DRUG_DOSE 207 210 Nexium 40 mg
DRUG_DOSE 1133 1136 Naprosyn one p.o
DRUG_DOSE 290 293 Lidocaine 1%
DRUG_DOSE 37 40 Altrua 60,
...
DRUG_DOSE 74 77 Lidocaine 1.5%
DRUG_DOSE 209 212 Dilantin 300 mg
DRUG_DOSE 217 220 Haloperidol 1 mg
DRUG_DOSE 225 228 Dexamethasone 4 mg
DRUG_DOSE 234 237 Docusate 100 mg
DRUG_DOSE 250 253 Ibuprofen 600 mg
DRUG_DOSE 258 261 Zantac 150 mg
...
DRUG_DOSE 204 207 epinephrine 7 ml
DRUG_DOSE 214 217 Percocet 5,
DRUG_DOSE 55 58 . 4.
DRUG_DOSE 146 149 . 4.
DRUG_DOSE 2409 2412 Naprosyn 375 mg
DRUG_DOSE 141 144 Wellbutrin 300 mg
DRUG_DOSE 146 149 Xanax 0.25 mg
DRUG_DOSE 158 161 omeprazole 20 mg
...

Nice, we did it!

We successfully extracted drugs and dosages, including different kinds of units like mg, mL, %, packs.
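A few of the matches above are clearly not dosages, for example "Altrua 60," and ". 4.". One possible follow-up is a light post-filter that turns each matched span into a (drug, value, unit) record and drops anything that does not fit. The sketch below is only one way to do it: the unit whitelist and the `parse_dose` helper are hypothetical, and it deliberately discards word-based quantities such as "one tablet", which you may well want to keep.

```python
import re

# Hypothetical unit whitelist for illustration; extend it for your own data
KNOWN_UNITS = {"mg", "mcg", "g", "ml", "%"}

# One token for the drug, a number, then the unit (possibly glued to the number)
DOSE = re.compile(r"^(?P<drug>\S+)\s+(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>\S+)$")

def parse_dose(span_text):
    """Return (drug, value, unit), or None when the span does not look like a dose."""
    m = DOSE.match(span_text.strip())
    if m is None:
        return None
    unit = m.group("unit").lower()
    if unit not in KNOWN_UNITS:
        return None
    return m.group("drug").lower(), float(m.group("value")), unit

# Span texts copied from the matcher output above
spans = ["Xylocaine 20 mL", "Marcaine 0.25%", "aspirin one tablet", "Altrua 60,", ". 4."]
for s in spans:
    print(repr(s), "->", parse_dose(s))
```

Records in this shape are easier to compare against a reference table of recommended dosages, which is the medication-error use case mentioned earlier.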

Conclusions

Here we learned how to use some features of scispaCy and spaCy, like NER and rule-based matching. We used one NER model, but there are lots of others and you should totally check them out. For instance, the en_ner_bionlp13cg_md model can identify anatomical parts, tissues, cell types, and more. Imagine what else you could do with that!

We also didn't focus too much on preprocessing steps, but they are fundamental to get better results. Don't forget to explore your data and adapt the preprocessing steps to the NLP tasks you want to do.

References

Neumann, M., King, D., Beltagy, I., & Ammar, W. (2019). Scispacy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669.

Honnibal, M., Montani, I., Van Landeghem, S., & Boyd, A. (2020). spaCy: Industrial-strength Natural Language Processing in Python.

Tolkien character or prescription drug name? Classification using character-level Long Short-Term Memory (LSTM) neural networks

Guilherme Bauer-Negrini — Thu, 08 Apr 2021 18:43:25 +0000

After trying to read J.R.R. Tolkien's The Silmarillion again for the millionth time, I remembered a funny tweet that has been around for a while:

[Embedded tweet, ID 977387234226855936]

Even though I was a casual fan of The Lord of the Rings and had already taken two pharmacology courses in college, I had no idea who or what Narmacil was. Should we fear Narmacil for sword skills or for dangerous side effects?

This little trivia prompted me to ask if an artificial neural network (ANN) could succeed where I and many more have failed. Here, I show you how to build a special type of ANN called Long Short-Term Memory (LSTM) to classify Tolkien characters and prescription drug names using Keras.

Dataset

The first step was to build from scratch a combined dataset with names of Tolkien characters and prescription drugs (a bunch of them, not just antidepressants).

Tolkien characters

Lucky for us, the Behind the Name website has a database of the first names of Tolkien characters that we can directly read from the page's HTML using pandas.

import pandas as pd

raw_tolkien_chars = pd.read_html('https://www.behindthename.com/namesakes/list/tolkien/name')
raw_tolkien_chars[2].head()

	Name	Gender	Details	Total
0	Adalbert	m	1 character	1
1	Adaldrida	f	1 character	1
2	Adalgar	m	1 character	1
3	Adalgrim	m	1 character	1
4	Adamanta	f	1 character	1

tolkien_names = raw_tolkien_chars[2]['Name']
tolkien_names.iloc[350:355]

350           Gethron
351    Ghân-buri-Ghân
352            Gildis
353            Gildor
354         Gil-galad
Name: Name, dtype: object

We can see that some names are hyphenated and have accented letters. To simplify the analysis, I converted Unicode characters to ASCII, lowercased the names, replaced hyphens with spaces, kept only the first word of each name, and removed duplicates.

import unidecode

processed_tolkien_names = tolkien_names.apply(unidecode.unidecode).str.lower().str.replace('-', ' ')
processed_tolkien_names = [name[0] for name in processed_tolkien_names.str.split()]
processed_tolkien_names = pd.DataFrame(processed_tolkien_names, columns=['name']).sort_values('name').drop_duplicates()
processed_tolkien_names['tolkien'] = 1

processed_tolkien_names['name'].iloc[350:355]

473    gethron
439       ghan
109        gil
341     gildis
324     gildor
Name: name, dtype: object

processed_tolkien_names.shape

(746, 2)

Done! Now we have 746 different character names.
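If you would rather not install unidecode, a similar cleanup can be sketched with the standard library alone. Note that this swaps in a different technique: Unicode NFKD normalization followed by dropping non-ASCII bytes. It handles accented letters like the ones shown above, but characters without a decomposition are simply removed instead of being transliterated, so the results can differ from unidecode for some names. The `first_ascii_token` helper is hypothetical.

```python
import unicodedata

def first_ascii_token(name):
    """Strip accents, lowercase, split on hyphens/spaces, and keep the first token."""
    decomposed = unicodedata.normalize("NFKD", name)           # "â" becomes "a" + combining mark
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower().replace("-", " ").split()[0]

for raw in ["Ghân-buri-Ghân", "Gil-galad", "Gildor"]:
    print(raw, "->", first_ascii_token(raw))
```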

Prescription drugs

To get a comprehensive list of drug names, I downloaded the medication guide of the U.S. Food & Drug Administration (FDA).

raw_medication_guide = pd.read_csv('data/raw/medication_guides.csv')
raw_medication_guide.head()

	Drug Name	Active Ingredient	Form;Route	Appl. No.	Company	Date	Link
0	Abilify	Aripiprazole	TABLET, ORALLY DISINTEGRATING;ORAL	21729	OTSUKA	02/05/2020	https://www.accessdata.fda.gov/drugsatfda_docs...
1	Abilify	Aripiprazole	TABLET;ORAL	21436	OTSUKA	02/05/2020	https://www.accessdata.fda.gov/drugsatfda_docs...
2	Abilify	Aripiprazole	SOLUTION;ORAL	21713	OTSUKA	02/05/2020	https://www.accessdata.fda.gov/drugsatfda_docs...
3	Abilify	Aripiprazole	SOLUTION;ORAL	21713	OTSUKA	02/05/2020	https://www.accessdata.fda.gov/drugsatfda_docs...
4	Abilify	Aripiprazole	INJECTABLE;INTRAMUSCULAR	21866	OTSUKA	02/05/2020	https://www.accessdata.fda.gov/drugsatfda_docs...

drug_names = raw_medication_guide['Drug Name']
drug_names.iloc[160:165]

160                                             Chantix
161         Children's Cetirizine Hydrochloride Allergy
162    Chlordiazepoxide and Amitriptyline Hydrochloride
163                                              Cimzia
164                                              Cimzia
Name: Drug Name, dtype: object

A similar preprocessing step was repeated for this dataset too:

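processed_drug_names = drug_names.str.lower()
# regex=False treats '.' as a literal dot rather than a regex wildcard
for char, repl in [('.', ''), ('-', ' '), ('/', ' '), ("'", ' '), (',', ' ')]:
    processed_drug_names = processed_drug_names.str.replace(char, repl, regex=False)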
processed_drug_names = [name[0] for name in processed_drug_names.str.split()]
processed_drug_names = pd.DataFrame(processed_drug_names, columns=['name']).sort_values('name').drop_duplicates()
processed_drug_names['tolkien'] = 0

processed_drug_names['name'].iloc[84:89]

373             chantix
448            children
395    chlordiazepoxide
185              cimzia
292               cipro
Name: name, dtype: object

processed_drug_names.shape

(611, 2)

Done, 611 different drug names!

We can finally combine the two datasets and move on.

dataset = pd.concat([processed_tolkien_names, processed_drug_names], ignore_index=True)

Data transformation

So now we have a bunch of names, but machine learning models don't work with raw characters. We need to convert them into a numerical format that can be processed by our soon-to-be-built model.

Using the Tokenizer class from Keras, we set char_level=True to process each word at character-level. The fit_on_texts() method will update the tokenizer internal vocabulary based on our dataset names and then texts_to_sequences() will transform each name into a sequence of integers.

from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(dataset['name'])
char_index = tokenizer.texts_to_sequences(dataset['name'])

Look how our beloved Bilbo is now:

print(dataset['name'][134])
print(char_index[134])

bilbo
[16, 3, 6, 16, 5]
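To make the integer mapping less of a black box, here is a rough plain-Python sketch of the idea: count how often each character occurs, give the most frequent character index 1, the next one index 2, and so on, leaving 0 free for padding. This is my own simplified illustration rather than the Keras implementation, and the three names are a tiny made-up sample, so the integers differ from the ones printed above.

```python
from collections import Counter

def fit_char_index(names):
    """Map each character to an integer, most frequent first, starting at 1."""
    counts = Counter(ch for name in names for ch in name)
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(name, char_to_int):
    """Replace every character of a name with its integer index."""
    return [char_to_int[ch] for ch in name]

names = ["bilbo", "frodo", "cipro"]  # tiny made-up sample, not the real dataset
char_to_int = fit_char_index(names)
print(char_to_int)
print(to_sequence("bilbo", char_to_int))
```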

Yet, this representation is not ideal. Having integers to represent letters could lead the ANN to assume that the characters have an ordinal scale. To solve this problem we have to:

Set all names to have the length of the longest name (17 characters here). We use pad_sequences to add 0's to the end of names shorter than 17 letters.
Convert each integer representation to its one-hot encoded vector representation. The vector consists of 0s in all cells except for a single 1 in a cell to identify the letter.

from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import numpy as np

char_index = pad_sequences(char_index, maxlen=dataset['name'].apply(len).max(), padding="post")
x = to_categorical(char_index)  # onehot encoding
y = np.array(dataset['tolkien'])

x.shape

(1357, 17, 27)

We have 1357 names. Each name is padded to 17 characters, and each character is a one-hot encoded vector of size 27 (26 letters of the Latin alphabet plus the padding index 0).
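
The two steps above (padding, then one-hot encoding) can be sketched without any library as follows. This is only an illustration of the idea, not a reimplementation of pad_sequences or to_categorical; the function names and toy sizes are assumptions chosen for the example.

```python
def pad_post(sequence, maxlen, value=0):
    """Right-pad a sequence of indices with `value` up to maxlen items.

    Slicing also cuts anything longer than maxlen, which never triggers
    when maxlen is the length of the longest name.
    """
    return (list(sequence) + [value] * maxlen)[:maxlen]

def one_hot(index, num_classes):
    """Return a vector of zeros with a single 1 at position `index`."""
    return [1 if position == index else 0 for position in range(num_classes)]

def encode(sequences, maxlen, num_classes):
    """Pad every sequence, then one-hot encode every index."""
    return [[one_hot(i, num_classes) for i in pad_post(seq, maxlen)]
            for seq in sequences]

# Two toy names with 7 distinct characters (indices 1..7) plus padding (0)
toy = encode([[2, 3, 4, 2, 1], [5, 1]], maxlen=6, num_classes=8)
toy_shape = (len(toy), len(toy[0]), len(toy[0][0]))
print(toy_shape)  # (names, characters per name, vocabulary size incl. padding)
```

The resulting nesting mirrors the (samples, timesteps, features) layout that the LSTM layer expects.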

Data split

I split the data into train, validation, and test sets with a 60/20/20 ratio using a custom function, since scikit-learn's train_test_split only produces two sets per call.

from sklearn.model_selection import train_test_split

def data_split(data, labels, train_ratio=0.5, rand_seed=42):

    x_train, x_temp, y_train, y_temp = train_test_split(data,
                                                        labels,
                                                        train_size=train_ratio,
                                                        random_state=rand_seed)

    x_val, x_test, y_val, y_test = train_test_split(x_temp,
                                                    y_temp,
                                                    train_size=0.5,
                                                    random_state=rand_seed)

    return x_train, x_val, x_test, y_train, y_val, y_test

x_train, x_val, x_test, y_train, y_val, y_test = data_split(x, y, train_ratio=0.6)
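
The arithmetic behind the 60/20/20 split can be checked with a small standard-library sketch: take 60% for training, then cut the remaining 40% in half. The shuffling here uses Python's random module rather than scikit-learn, so the membership of each set will differ from the real split; only the set sizes are meant to be comparable. The function name is hypothetical.

```python
import math
import random

def three_way_split(items, train_ratio=0.6, seed=42):
    """Shuffle, take `train_ratio` for training, halve the rest into val/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = math.floor(train_ratio * len(shuffled))
    rest = shuffled[n_train:]
    n_val = math.floor(0.5 * len(rest))
    return shuffled[:n_train], rest[:n_val], rest[n_val:]

train_part, val_part, test_part = three_way_split(range(1357))
split_sizes = (len(train_part), len(val_part), len(test_part))
print(split_sizes)
```

With 1357 samples this gives 814, 271, and 272 items, which matches the totals per split reported further down.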

Let's take a look at the splits:

from collections import Counter
import matplotlib.pyplot as plt

dataset_count = pd.DataFrame([Counter(y_train), Counter(y_val), Counter(y_test)],
                                index=["train", "val", "test"])
dataset_count.plot(kind='bar')
plt.xticks(rotation=0)
plt.show()

print(f"Total number of samples: \n{dataset_count.sum(axis=0).sum()}")
print(f"Class/Samples: \n{dataset_count.sum(axis=0)}")
print(f"Split/Class/Samples: \n{dataset_count}")

Total number of samples: 
1357
Class/Samples: 
1    746
0    611
dtype: int64
Split/Class/Samples: 
        1    0
train  451  363
val    149  122
test   146  126

There are more Tolkien characters than drug names, but it seems like a decent balance.
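
To put a number on "decent balance", the class shares can be computed from the counts printed above. The majority share is also the accuracy a trivial classifier would reach by always answering "Tolkien", which gives a useful floor for judging the model later. The variable names are chosen just for this sketch.

```python
class_counts = {"tolkien": 746, "drug": 611}  # totals printed above

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# Accuracy of a classifier that always predicts the larger class
majority_baseline = max(shares.values())

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

Roughly 55% of the names are Tolkien characters, so any model worth keeping should score clearly above that.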

LSTM model

Long Short-Term Memory is a type of Recurrent Neural Network proposed by Hochreiter S. & Schmidhuber J. (1997) to store information over extended time intervals. Names are just sequences of characters in which the order is important, so LSTM networks are a great choice for our name prediction task. You can read more about LSTMs in this awesome illustrated guide written by Michael Phi.
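
For reference, one commonly used formulation of a single LSTM step (the variant with a forget gate, which is what most deep learning libraries implement) is written out below. The notation is an assumption introduced here, not taken from the cited paper: x_t is the one-hot vector of the character at position t, h_{t-1} the previous hidden state, c_t the cell state, \sigma the logistic sigmoid, and \odot element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The cell state c_t is what lets information from early characters survive until the end of the name.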

Training

Keras was used to build this simple LSTM model after some tests and hyperparameter tuning. It is just a hidden layer with 8 LSTM blocks, one dropout layer to prevent overfitting, and one output neuron with a sigmoid activation function to make a binary classification.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout, LSTM
from tensorflow.random import set_seed
set_seed(23)

model = Sequential()
model.add(LSTM(8, return_sequences=False,
               input_shape=(x.shape[1], x.shape[2])))
model.add(Dropout(0.3))
model.add(Dense(units=1))
model.add(Activation('sigmoid'))
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm (LSTM)                  (None, 8)                 1152      
_________________________________________________________________
dropout (Dropout)            (None, 8)                 0         
_________________________________________________________________
dense (Dense)                (None, 1)                 9         
_________________________________________________________________
activation (Activation)      (None, 1)                 0         
=================================================================
Total params: 1,161
Trainable params: 1,161
Non-trainable params: 0
_________________________________________________________________
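
The parameter counts in the summary can be verified by hand. The sketch below assumes a standard LSTM layer with biases, which holds four sets of weights (three gates plus the candidate cell state), each with an input matrix, a recurrent matrix, and a bias vector. The function names are made up for this check.

```python
def lstm_param_count(units, input_dim):
    """4 weight sets, each: input weights + recurrent weights + bias."""
    return 4 * (units * input_dim + units * units + units)

def dense_param_count(units, input_dim):
    """One weight per input per unit, plus one bias per unit."""
    return units * input_dim + units

lstm_params = lstm_param_count(units=8, input_dim=27)   # 27 = one-hot vector size
dense_params = dense_param_count(units=1, input_dim=8)  # fed by the 8 LSTM outputs
total_params = lstm_params + dense_params
print(lstm_params, dense_params, total_params)
```

The result agrees with the 1,152 + 9 = 1,161 trainable parameters reported by model.summary().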

Adam is a good default optimizer that often produces good results in deep learning applications. Binary cross-entropy is the default loss function for binary classification problems, and it is compatible with our single-neuron output architecture.
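
As a rough illustration of what the loss measures, here is a plain-Python sketch of mean binary cross-entropy. It is not the Keras implementation; the clipping constant and the toy predictions are assumptions made for the example.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Labels: 1 = Tolkien, 0 = drug; probabilities are the sigmoid outputs
confident_loss = binary_cross_entropy([1, 0], [0.9, 0.1])
unsure_loss = binary_cross_entropy([1, 0], [0.5, 0.5])
print(confident_loss, unsure_loss)
```

Confident correct predictions yield a loss near zero, while a model that always outputs 0.5 sits at log(2), about 0.69.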

from tensorflow.keras.optimizers import Adam

model.compile(loss="binary_crossentropy",
              optimizer=Adam(learning_rate=1e-3), metrics=['accuracy'])

Two callbacks were implemented. EarlyStopping stops the training process after 20 epochs without an improvement in the validation loss, and ModelCheckpoint saves the model every time the validation loss reaches a new minimum.

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

es = EarlyStopping(monitor='val_loss', verbose=1, patience=20)
mc = ModelCheckpoint("best_model.h5", monitor='val_loss',
                     verbose=1, save_best_only=True)

history = model.fit(x_train, y_train, batch_size=32, epochs=100,
                    validation_data=(x_val, y_val), callbacks=[es, mc])

Epoch 00071: val_loss did not improve from 0.34949
Epoch 72/100
26/26 [==============================] - 1s 24ms/step - loss: 0.3085 - accuracy: 0.8836 - val_loss: 0.3861 - val_accuracy: 0.8487

Epoch 00072: val_loss did not improve from 0.34949
Epoch 00072: early stopping

val_loss_per_epoch = history.history['val_loss']
best_epoch = val_loss_per_epoch.index(min(val_loss_per_epoch)) + 1
print(f"Best epoch: {best_epoch}")

Best epoch: 52
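
The relationship between the best epoch (52), the patience (20), and the stopping epoch (72) follows from how patience-based early stopping counts epochs. Below is a minimal simulation of that logic on a made-up loss curve; it is a sketch of the idea, not the Keras callback itself, and the function name and loss values are hypothetical.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based."""
    best_loss = float("inf")
    best_epoch_seen = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New minimum: this is where a checkpoint would be saved
            best_loss, best_epoch_seen, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch_seen
    return len(val_losses), best_epoch_seen

# Hypothetical curve: improves for 52 epochs, then stays worse
fake_losses = [1.0 - 0.01 * i for i in range(52)] + [0.9] * 48
stop_epoch, best_epoch_sim = simulate_early_stopping(fake_losses, patience=20)
print(stop_epoch, best_epoch_sim)
```

With a patience of 20, training halts exactly 20 epochs after the last improvement, which is why the run above ended at epoch 72.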

Let's plot the accuracy and loss values per epoch to see the progression of these metrics.

def plot_metrics(history):
    """Plot training and validation accuracy and loss per epoch."""

    plt.figure(figsize=(12, 6))

    # Left panel: accuracy
    plt.subplot(1, 2, 1)
    plt.plot(history.history['accuracy'], label='Training')
    plt.plot(history.history['val_accuracy'], label='Validation')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend(loc='lower right')
    plt.grid(True)

    # Right panel: loss
    plt.subplot(1, 2, 2)
    plt.plot(history.history['loss'], label='Training')
    plt.plot(history.history['val_loss'], label='Validation')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend(loc='upper right')
    plt.grid(True)

    plt.show()

plot_metrics(history)

We can see that the accuracy quickly reaches a good plateau around 80%. Visually the model appears to start overfitting after epoch 50. It shouldn't be a problem to use the version saved by ModelCheckpoint at epoch 52.

Performance evaluation

Finally, let's see how our model does with the test dataset.

from tensorflow.keras.models import load_model

model = load_model("best_model.h5")
metrics = model.evaluate(x=x_test, y=y_test)

9/9 [==============================] - 1s 7ms/step - loss: 0.4595 - accuracy: 0.8125

print("Accuracy: {0:.2f} %".format(metrics[1]*100))

Accuracy: 81.25 %


Not bad!

We can explore the results a little more with the confusion matrix and classification report:

from sklearn.metrics import confusion_matrix
from seaborn import heatmap

def plot_confusion_matrix(y_true, y_pred, labels):

    cm = confusion_matrix(y_true, y_pred)
    heatmap(cm, annot=True, fmt="d", cmap="rocket_r", xticklabels=labels, yticklabels=labels)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()

predictions = model.predict(x_test)
threshold = 0.5
y_pred = (predictions > threshold).astype(int).ravel()  # flatten (n, 1) probabilities into 1-D class labels
plot_confusion_matrix(y_test, y_pred, labels=['Drug','Tolkien'])

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, target_names=['Drug', 'Tolkien']))

              precision    recall  f1-score   support

        Drug       0.80      0.80      0.80       126
     Tolkien       0.83      0.82      0.82       146

    accuracy                           0.81       272
   macro avg       0.81      0.81      0.81       272
weighted avg       0.81      0.81      0.81       272

Besides the good accuracy, the model makes almost the same number of false positives and false negatives, which is reflected in the balanced precision and recall.
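
The report values can be reproduced from the four confusion-matrix counts. The counts below are inferred rather than read from the plot: the test set has 272 names (126 drugs, 146 Tolkien characters), 81.25% accuracy means 221 correct predictions, and this post lists 26 Tolkien names that were predicted as drugs. Treat them as an assumption that happens to be consistent with the report.

```python
# Counts inferred from the post's own numbers (assumption, see text above)
tp, fn = 120, 26  # Tolkien names predicted as Tolkien / as drug
tn, fp = 101, 25  # drug names predicted as drug / as Tolkien

def safe_ratio(numerator, denominator):
    """Avoid division by zero for empty classes."""
    return numerator / denominator if denominator else 0.0

tolkien_precision = safe_ratio(tp, tp + fp)
tolkien_recall = safe_ratio(tp, tp + fn)
drug_precision = safe_ratio(tn, tn + fn)
drug_recall = safe_ratio(tn, tn + fp)
accuracy = safe_ratio(tp + tn, tp + tn + fp + fn)

print(f"Tolkien: precision={tolkien_precision:.2f} recall={tolkien_recall:.2f}")
print(f"Drug:    precision={drug_precision:.2f} recall={drug_recall:.2f}")
print(f"Accuracy: {accuracy:.4f}")
```

With 25 errors in one direction and 26 in the other, neither class is systematically favored.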

So, just out of curiosity, which Tolkien characters could pass as prescription drugs?

def onehot_to_text(onehot_word):
    """Reverse one-hot encoded words to strings"""

    char_index = [[np.argmax(char) for char in onehot_word]]
    word = tokenizer.sequences_to_texts(char_index)
    return ''.join(word[0].split())

test_result = pd.DataFrame()
test_result['true'] = y_test
test_result['prediction'] = y_pred.astype(int)
test_result['name'] = [onehot_to_text(name) for name in x_test]
test_result.head()

	true	prediction	name
0	0	0	supprelin
1	1	1	bingo
2	0	0	ponstel
3	0	1	elidel
4	0	0	aubagio
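
The decoding step in onehot_to_text boils down to taking the argmax of each one-hot row and mapping it back to a character, skipping the padding index. A library-free sketch of that idea, with a hypothetical four-character vocabulary:

```python
def decode_one_hot(one_hot_word, index_to_char):
    """Argmax each one-hot row, drop padding (index 0), and join the characters."""
    chars = []
    for row in one_hot_word:
        index = max(range(len(row)), key=row.__getitem__)  # position of the 1
        if index != 0:
            chars.append(index_to_char[index])
    return "".join(chars)

index_to_char = {1: "o", 2: "b", 3: "i", 4: "l"}  # hypothetical vocabulary
padded_indices = [2, 3, 4, 2, 1, 0, 0]            # "bilbo" plus two padding steps
encoded = [[1 if j == i else 0 for j in range(5)] for i in padded_indices]

decoded_name = decode_one_hot(encoded, index_to_char)
print(decoded_name)
```

Getting the original string back confirms that the one-hot representation loses no information about the name.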

test_result['name'].loc[(test_result['true']==1) & (test_result['prediction']==0)]

13              ivy
17         camellia
44      celebrindor
47         meriadoc
63        vanimelde
64        finduilas
75        eglantine
84             ruby
87            poppy
89             otto
100           tanta
102          myrtle
108          prisca
132          cottar
151          stybba
171            este
175           daisy
189          tulkas
195        arciryas
205        odovacar
206          tarcil
207    hyarmendacil
229            jago
230            tata
240           ponto
271       landroval
Name: name, dtype: object

Conclusion

So, here we covered how to work with character-level one-hot encodings and build a simple LSTM model capable of telling Tolkien character names apart from prescription drug names. Full code, including requirements, dataset, a Jupyter Notebook version, and a script version, can be found in my GitHub repo.

You can also play around with this popular interactive quiz found on the web: Antidepressant or Tolkien? I only got 70.8% right! Can you guess better than the LSTM network?

References

Hu, Y., Hu, C., Tran, T., Kasturi, T., Joseph, E., & Gillingham, M. (2021). What's in a Name?--Gender Classification of Names with Character Based Machine Learning Models. arXiv preprint arXiv:2102.03692.

Bhagvati, C. (2018). Word representations for gender classification using deep learning. Procedia computer science, 132, 614-622.

Liang, X. (2018). How to Preprocess Character Level Text with Keras.
