

Biomedical text natural language processing (BioNLP) using scispaCy: NER and rule-based matching

Guilherme Bauer-Negrini

Biomedical text mining and natural language processing (BioNLP) is an interesting research domain that deals with processing data from journals, medical records, and other biomedical documents. Considering the availability of biomedical literature, there has been an increasing interest in extracting information, relationships, and insights from text data. However, the unstructured organization and the domain complexity of biomedical documents make these tasks hard. Fortunately, some cool NLP Python packages can help us with that!

scispaCy is a Python package containing spaCy models for processing biomedical, scientific or clinical text. spaCy's most mindblowing features are neural network models for tagging, parsing, named entity recognition (NER), text classification, and more. Add scispaCy models on top of it and we can do all that in the biomedical domain!

Here we are going to see how to use scispaCy NER models to identify drug and disease names mentioned in a medical transcription dataset. Moreover, we are going to combine NER and rule-based matching to extract the drug names and dosages reported in each transcription.



  • Python 3
  • pandas
  • spacy>=3.0
  • scispacy

You can simply pip install all of them.
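
Before going further, it can help to confirm what is actually installed, since the code in this post assumes spaCy 3.x. Here is a minimal sketch using only the standard library. The helper name `installed_versions` is made up for this example, and the distribution names are assumed to match the list above:

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment
            found[name] = None
    return found

print(installed_versions(["pandas", "spacy", "scispacy"]))
```

Any entry reported as None still needs a pip install.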

We also need to download and install the NER model from scispaCy. To install the en_ner_bc5cdr_md model use the following command:

pip install

For updated versions or other models, please check the scispaCy page.


Unstructured medical data, like medical transcriptions, are pretty hard to find. Here we are using a medical transcription dataset scraped from the MTSamples website by Tara Boyle and made available on Kaggle.

import pandas as pd

med_transcript = pd.read_csv("mtsamples.csv", index_col=0)
med_transcript.info()  # column names, non-null counts and dtypes
med_transcript.head()  # first five rows
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4999 entries, 0 to 4998
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   description        4999 non-null   object
 1   medical_specialty  4999 non-null   object
 2   sample_name        4999 non-null   object
 3   transcription      4966 non-null   object
 4   keywords           3931 non-null   object
dtypes: object(5)
memory usage: 234.3+ KB

  | description | medical_specialty | sample_name | transcription | keywords
0 | A 23-year-old white female presents with comp... | Allergy / Immunology | Allergic Rhinitis | SUBJECTIVE:, This 23-year-old white female pr... | allergy / immunology, allergic rhinitis, aller...
1 | Consult for laparoscopic gastric bypass. | Bariatrics | Laparoscopic Gastric Bypass Consult - 2 | PAST MEDICAL HISTORY:, He has difficulty climb... | bariatrics, laparoscopic gastric bypass, weigh...
2 | Consult for laparoscopic gastric bypass. | Bariatrics | Laparoscopic Gastric Bypass Consult - 1 | HISTORY OF PRESENT ILLNESS: , I have seen ABC ... | bariatrics, laparoscopic gastric bypass, heart...
3 | 2-D M-Mode. Doppler. | Cardiovascular / Pulmonary | 2-D Echocardiogram - 1 | 2-D M-MODE: , ,1. Left atrial enlargement wit... | cardiovascular / pulmonary, 2-d m-mode, dopple...
4 | 2-D Echocardiogram | Cardiovascular / Pulmonary | 2-D Echocardiogram - 2 | 1. The left ventricular cavity size and wall ... | cardiovascular / pulmonary, 2-d, doppler, echo...

The dataset has almost 5000 records, but let's work with a small random subsample so it doesn't take too long to process. We also have to drop any rows whose transcriptions are missing.

med_transcript.dropna(subset=['transcription'], inplace=True)
med_transcript_small = med_transcript.sample(n=100, replace=False, random_state=42)
med_transcript_small.info()
med_transcript_small.head()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 3162 to 3581
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   description        100 non-null    object
 1   medical_specialty  100 non-null    object
 2   sample_name        100 non-null    object
 3   transcription      100 non-null    object
 4   keywords           78 non-null     object
dtypes: object(5)
memory usage: 4.7+ KB

     | description | medical_specialty | sample_name | transcription | keywords
3162 | Markedly elevated PT INR despite stopping Cou... | Hematology - Oncology | Hematology Consult - 1 | HISTORY OF PRESENT ILLNESS:, The patient is w... | NaN
1981 | Intercostal block from fourth to tenth interc... | Pain Management | Intercostal block - 1 | PREPROCEDURE DIAGNOSIS:, Chest pain secondary... | pain management, xylocaine, marcaine, intercos...
1361 | The patient is a 65-year-old female who under... | SOAP / Chart / Progress Notes | Lobectomy - Followup | HISTORY OF PRESENT ILLNESS: , The patient is a... | soap / chart / progress notes, non-small cell ...
3008 | Construction of right upper arm hemodialysis ... | Nephrology | Hemodialysis Fistula Construction | PREOPERATIVE DIAGNOSIS: , End-stage renal dise... | nephrology, end-stage renal disease, av dialys...
4943 | Bronchoscopy with brush biopsies. Persistent... | Cardiovascular / Pulmonary | Bronchoscopy - 8 | PREOPERATIVE DIAGNOSIS: , Persistent pneumonia... | cardiovascular / pulmonary, persistent pneumon...

Let's take one transcription to see how we can work with NER:

sample_transcription = med_transcript_small['transcription'].iloc[0]
print(sample_transcription[:1000]) # prints just the first 1000 characters
HISTORY OF PRESENT ILLNESS:,  The patient is well known to me for a history of iron-deficiency anemia due to chronic blood loss from colitis.  We corrected her hematocrit last year with intravenous (IV) iron.  Ultimately, she had a total proctocolectomy done on 03/14/2007 to treat her colitis.  Her course has been very complicated since then with needing multiple surgeries for removal of hematoma.  This is partly because she was on anticoagulation for a right arm deep venous thrombosis (DVT) she had early this year, complicated by septic phlebitis.,Chart was reviewed, and I will not reiterate her complex history.,I am asked to see the patient again because of concerns for coagulopathy.,She had surgery again last month to evacuate a pelvic hematoma, and was found to have vancomycin resistant enterococcus, for which she is on multiple antibiotics and followed by infectious disease now.,She is on total parenteral nutrition (TPN) as well.,LABORATORY DATA:,  Labs today showed a white blood 

This transcription contains many entities, for example drug, disease, and exam names. The text was scraped from a web page, and we can identify sections of the medical record such as "HISTORY OF PRESENT ILLNESS" and "LABORATORY DATA", but these vary from record to record.
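
If you want to look at those sections separately, a rough heuristic is to split on upper-case headers followed by a colon. The sketch below assumes that header style; the `HEADER` regular expression and the `split_sections` helper are made up for this example, and records that format their sections differently will not split cleanly.

```python
import re

# Assumption: a header is a run of capital letters and spaces followed by a colon
HEADER = re.compile(r"\b([A-Z][A-Z ]{3,}[A-Z]):,?\s*")

def split_sections(text):
    """Return {header: body} for every header-like marker found in text."""
    parts = HEADER.split(text)
    # parts = [text before the first header, header, body, header, body, ...]
    headers = parts[1::2]
    bodies = (body.strip(" ,") for body in parts[2::2])
    return dict(zip(headers, bodies))

snippet = (
    "HISTORY OF PRESENT ILLNESS:,  The patient is well known to me.,"
    "She is on total parenteral nutrition (TPN) as well.,"
    "LABORATORY DATA:,  Labs today showed a white blood"
)

for header, body in split_sections(snippet).items():
    print(header, "->", body[:40])
```

Abbreviations such as (TPN) are left alone because no colon follows them. A header that appears twice in one record would overwrite the earlier entry, so a list of pairs may be a better return type for real data.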

Named entity recognition

Named entity recognition (NER) is a subtask of natural language processing used to identify and classify named entities mentioned in unstructured text into pre-defined categories. scispaCy provides several models that identify different entity types; they are listed on the scispaCy page.

We are going to use the NER model trained on the BC5CDR corpus (en_ner_bc5cdr_md). This corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases, and 3116 chemical-disease interactions. Don't forget to download and install the model.

import scispacy
import spacy
nlp = spacy.load("en_ner_bc5cdr_md")

spacy.load will return a Language object containing all components and data needed to process text. This object is usually called nlp in the documentation and tutorials. Calling the nlp object on a string of text will return a processed Doc object with the text split into words and annotated.

Let's get all identified entities and print their text, start position, end position, and type:

doc = nlp(sample_transcription)
print("TEXT", "START", "END", "ENTITY TYPE")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
TEXT START END ENTITY TYPE
iron-deficiency anemia 79 101 DISEASE
chronic blood loss 109 127 DISEASE
colitis 133 140 DISEASE
iron 203 207 CHEMICAL
vancomycin 781 791 CHEMICAL
infectious disease 873 891 DISEASE
improved.,PT 1348 1360 CHEMICAL
vitamin K 1503 1512 CHEMICAL
uric acid 1830 1839 CHEMICAL
bilirubin 1853 1862 CHEMICAL
Creatinine 1911 1921 CHEMICAL
Compazine 2474 2483 CHEMICAL
Zofran 2487 2493 CHEMICAL
epistaxis 2629 2638 DISEASE
bleeding 3057 3065 DISEASE
edema.,CARDIAC 3109 3123 CHEMICAL
adenopathy 3156 3166 DISEASE
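
The start and end values printed above are character offsets into the original string, so you can slice the text with them directly. Here is a small sketch that tags the first three entities in place. The text is the opening sentence of the sample transcription, the offsets are copied from the output above, and the `annotate` helper is made up for this example:

```python
text = (
    "HISTORY OF PRESENT ILLNESS:,  The patient is well known to me for a "
    "history of iron-deficiency anemia due to chronic blood loss from colitis."
)

# (start_char, end_char, label) triples copied from the NER output
spans = [(79, 101, "DISEASE"), (109, 127, "DISEASE"), (133, 140, "DISEASE")]

def annotate(text, spans):
    """Wrap each span as [text|LABEL], working right to left so offsets stay valid."""
    out = text
    for start, end, label in sorted(spans, reverse=True):
        out = out[:start] + "[" + out[start:end] + "|" + label + "]" + out[end:]
    return out

for start, end, label in spans:
    print(text[start:end], label)

print(annotate(text, spans))
```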

We can see the model correctly identified and labeled diseases such as iron-deficiency anemia, chronic blood loss, and many more. Lots of drugs were also identified, like vancomycin, Compazine, and Zofran. The model can also identify molecules commonly measured in laboratory tests, such as creatinine, iron, bilirubin, and uric acid.
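
For a quick overview, it can be handy to collect the unique entity names under each label. This is a minimal sketch: the `Entity` tuple is a stand-in (an assumption for this example) for the spans in doc.ents, which expose the same text and label_ attributes, and `group_entities` is a made-up helper name.

```python
from collections import defaultdict, namedtuple

# Stand-in for spaCy spans: only the two attributes used below
Entity = namedtuple("Entity", ["text", "label_"])

def group_entities(ents):
    """Collect unique, lower-cased entity names under each label."""
    grouped = defaultdict(set)
    for ent in ents:
        grouped[ent.label_].add(ent.text.lower())
    return {label: sorted(names) for label, names in grouped.items()}

ents = [
    Entity("colitis", "DISEASE"),
    Entity("iron", "CHEMICAL"),
    Entity("vancomycin", "CHEMICAL"),
    Entity("Creatinine", "CHEMICAL"),
    Entity("epistaxis", "DISEASE"),
]

print(group_entities(ents))
```

With a real document you would pass doc.ents instead of the hand-written list.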

Not everything is perfect though. See how some tokens are weirdly classified as chemicals, possibly due to punctuation marks:

  • improved.,PT 1348 1360 CHEMICAL
  • edema.,CARDIAC 3109 3123 CHEMICAL

Punctuation marks are usually removed in NLP preprocessing steps, but we can't remove all of them here: we might miss chemical names and would corrupt quantities like drug dosages. However, we can solve this problem by replacing the ".," sequences that appear to separate some sections of the transcription. It is important to know your data and its domain to better understand your results.

import re

# Replace the ".," section separators with ". " (the raw string avoids an invalid escape sequence warning)
med_transcript_small['transcription'] = med_transcript_small['transcription'].apply(lambda x: re.sub(r'\.,', '. ', x))
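
To convince yourself that this substitution leaves decimals and comma-separated values alone, you can try it on a short string first. The sentence below is made up for illustration, and `clean_section_breaks` is a hypothetical helper name:

```python
import re

SECTION_BREAK = re.compile(r"\.,")

def clean_section_breaks(text):
    """Replace the '.,' section separators with a period and a space."""
    return SECTION_BREAK.sub(". ", text)

sample = "Her hematocrit improved.,PT was 1.5, INR 2.5.,CARDIAC: Normal."
print(clean_section_breaks(sample))
```

Only a period immediately followed by a comma is touched, so "1.5, INR" keeps its comma.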

We can also check the entities using the displacy visualizer:

from spacy import displacy
displacy.render(doc[:100], style='ent', jupyter=True) # here I am printing just the first 100 tokens


Rule-based matching

Rule-based matching is similar to regular expressions, but spaCy’s rule-based matcher engines and components give you access to the tokens within the document and their relationships. We can combine this with the NER models to identify patterns that include our entities.

Let's extract from the text the drug names and their reported dosages. This could be of real use to identify possible medication errors by checking if the dosages are in accordance with standards and guidelines.

from spacy.matcher import Matcher

pattern = [{'ENT_TYPE':'CHEMICAL'}, {'LIKE_NUM': True}, {'IS_ASCII': True}]
matcher = Matcher(nlp.vocab)
matcher.add("DRUG_DOSE", [pattern])

The code above creates a pattern to identify a sequence of three tokens:

  • A token whose entity type is CHEMICAL (drug name)
  • A token that resembles a number (dosage)
  • A token that consists of ASCII characters (units, like mg or mL)

Then we initialize the Matcher with a vocabulary. The matcher must always share the same vocab with the documents it will operate on, so we use the nlp object vocab. We then add this pattern to the matcher and give it an ID.
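
To get a feel for what the three-token rule accepts, here is a plain-Python imitation of it. This is not spaCy's Matcher: tokens are assumed to be (text, entity type) pairs, `find_drug_doses` is a made-up helper, and `like_num` is a much simplified version of what LIKE_NUM checks.

```python
NUMBER_WORDS = {"one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"}

def like_num(text):
    """Simplified check: digits with an optional decimal point, or a number word."""
    cleaned = text.replace(",", "")
    return cleaned.replace(".", "", 1).isdigit() or text.lower() in NUMBER_WORDS

def find_drug_doses(tokens):
    """Slide a three-token window and keep CHEMICAL + number-like + ASCII."""
    hits = []
    for i in range(len(tokens) - 2):
        (drug, ent_type), (amount, _), (unit, _) = tokens[i:i + 3]
        if ent_type == "CHEMICAL" and like_num(amount) and unit.isascii():
            hits.append((i, i + 3, " ".join([drug, amount, unit])))
    return hits

tokens = [
    ("Aspirin", "CHEMICAL"), ("81", ""), ("mg", ""), ("daily", ""), (",", ""),
    ("Percocet", "CHEMICAL"), ("5", ""), (",", ""),
]

for hit in find_drug_doses(tokens):
    print(hit)
```

The second hit ends in a comma: the ASCII check is true for punctuation as well as for units, so the third slot accepts almost anything.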

Now we can loop through all transcriptions and extract the text matching this pattern:

for transcription in med_transcript_small['transcription']:
    doc = nlp(transcription)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # get string representation
        span = doc[start:end]  # the matched span
        print(string_id, start, end, span.text)
DRUG_DOSE 137 140 Xylocaine 20 mL
DRUG_DOSE 141 144 Marcaine 0.25%
DRUG_DOSE 208 211 Aspirin 81 mg
DRUG_DOSE 216 219 Spiriva 10 mcg
DRUG_DOSE 399 402 nifedipine 10 mg
DRUG_DOSE 226 229 aspirin one tablet
DRUG_DOSE 245 248 Warfarin 2.5 mg
DRUG_DOSE 67 70 Topamax 100 mg
DRUG_DOSE 193 196 Metamucil one pack
DRUG_DOSE 207 210 Nexium 40 mg
DRUG_DOSE 1133 1136 Naprosyn one p.o
DRUG_DOSE 290 293 Lidocaine 1%
DRUG_DOSE 37 40 Altrua 60,
DRUG_DOSE 74 77 Lidocaine 1.5%
DRUG_DOSE 209 212 Dilantin 300 mg
DRUG_DOSE 217 220 Haloperidol 1 mg
DRUG_DOSE 225 228 Dexamethasone 4 mg
DRUG_DOSE 234 237 Docusate 100 mg
DRUG_DOSE 250 253 Ibuprofen 600 mg
DRUG_DOSE 258 261 Zantac 150 mg
DRUG_DOSE 204 207 epinephrine 7 ml
DRUG_DOSE 214 217 Percocet 5,
DRUG_DOSE 55 58 . 4.
DRUG_DOSE 146 149 . 4.
DRUG_DOSE 2409 2412 Naprosyn 375 mg
DRUG_DOSE 141 144 Wellbutrin 300 mg
DRUG_DOSE 146 149 Xanax 0.25 mg
DRUG_DOSE 158 161 omeprazole 20 mg

Nice, we did it!

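We extracted drugs and dosages with several kinds of units, like mg, mL, %, and packs. A few matches are false positives, such as ". 4." and the ones ending in a comma: the ASCII check in the third slot also accepts punctuation, and the NER model occasionally labels a non-drug token as CHEMICAL.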
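
One way to tidy up the matches is to turn each one into a small record and drop the ones that don't look like a dosage. This is a sketch under stated assumptions: each match is given as a list of three token strings (for a real span, [token.text for token in span] would produce it), the token lists below are assumed tokenizations of a few matches from the output above, and the `Dose` type, `UNIT` rule, and `to_dose` helper are made up for this example.

```python
import re
from typing import List, NamedTuple, Optional

class Dose(NamedTuple):
    drug: str
    amount: str
    unit: str

# Assumption: a plausible unit is "%" or starts with a letter (mg, mL, p.o, tablet)
UNIT = re.compile(r"%|[A-Za-z][A-Za-z.]*")

def to_dose(tokens: List[str]) -> Optional[Dose]:
    """Return a Dose for a three-token match, or None if it looks like noise."""
    if len(tokens) != 3:
        return None
    drug, amount, unit = tokens
    if not any(ch.isalpha() for ch in drug):
        return None  # the "drug" is punctuation only
    if UNIT.fullmatch(unit) is None:
        return None  # the "unit" is punctuation only
    return Dose(drug, amount, unit)

matches = [
    ["Aspirin", "81", "mg"],
    ["Marcaine", "0.25", "%"],
    ["aspirin", "one", "tablet"],
    ["Altrua", "60", ","],
    [".", "4", "."],
]

doses = [dose for dose in map(to_dose, matches) if dose is not None]
for dose in doses:
    print(dose)
```

The filter only rejects obvious noise. It does not check that the drug name is real or that the amount makes clinical sense.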


Here we learned how to use some features of scispaCy and spaCy like NER and rule-based matching. We used one NER model, but there are lots of others and you should totally check them out. For instance, the en_ner_bionlp13cg_md model can identify anatomical parts, tissues, cell types, and more. Imagine what else you could do with that!

We also didn't focus too much on preprocessing steps, but they are fundamental to get better results. Don't forget to explore your data and adapt the preprocessing steps to the NLP tasks you want to do.


Neumann, M., King, D., Beltagy, I., & Ammar, W. (2019). Scispacy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669.

Honnibal, M., Montani, I., Van Landeghem, S., & Boyd, A. (2020). spaCy: Industrial-strength natural language processing in Python.
