<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Guilherme Bauer-Negrini</title>
    <description>The latest articles on DEV Community by Guilherme Bauer-Negrini (@gbnegrini).</description>
    <link>https://dev.to/gbnegrini</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F475676%2F88bb5578-a1ef-4569-888a-f88e965887d0.jpeg</url>
      <title>DEV Community: Guilherme Bauer-Negrini</title>
      <link>https://dev.to/gbnegrini</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gbnegrini"/>
    <language>en</language>
    <item>
      <title>Best biomedical and health data science books and resources</title>
      <dc:creator>Guilherme Bauer-Negrini</dc:creator>
      <pubDate>Thu, 22 Apr 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/gbnegrini/best-biomedical-and-health-data-science-books-and-resources-4n1k</link>
      <guid>https://dev.to/gbnegrini/best-biomedical-and-health-data-science-books-and-resources-4n1k</guid>
      <description>&lt;h2&gt;
  
  
  What is biomedical data science?
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Biomedical data science spans a range of biological and medical research challenges that are data-intensive and focused on the creation of novel methodologies to advance biomedical science discovery. - &lt;em&gt;&lt;a href="https://www.annualreviews.org/doi/full/10.1146/annurev-bd-01-041718-100001" rel="noopener noreferrer"&gt;Annual Review of Biomedical Data Science&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here is a listing of some resources that I have found while researching and studying the field of biomedical data science and analytics. Unfortunately many books and courses listed here are paid, but I have tried my best to list some free and open-sourced resources too. Let’s go to them!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2Fgiphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2Fgiphy.gif" alt="giphy"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Table of Contents
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;What is biomedical data science?&lt;/li&gt;
&lt;li&gt;Notice of non-affiliation and disclaimer&lt;/li&gt;
&lt;li&gt;
Statistics and math
&lt;ul&gt;
&lt;li&gt;Modern Statistics for Modern Biology&lt;/li&gt;
&lt;li&gt;Statistics for Biomedical Engineers and Scientists&lt;/li&gt;
&lt;li&gt;Applied Mathematics for the Analysis of Biomedical Data: Models, Methods, and MATLAB&lt;/li&gt;
&lt;li&gt;Data-Handling in Biomedical Science&lt;/li&gt;
&lt;/ul&gt;




&lt;/li&gt;


&lt;li&gt;


Data engineering

&lt;ul&gt;
&lt;li&gt;Data Warehousing for Biomedical Informatics&lt;/li&gt;
&lt;li&gt;Big Biomedical Data Engineering&lt;/li&gt;
&lt;/ul&gt;




&lt;/li&gt;


&lt;li&gt;


Data manipulation, data analysis, and machine learning

&lt;ul&gt;
&lt;li&gt;Data Science and Predictive Analytics: Biomedical and Health Applications using R&lt;/li&gt;
&lt;li&gt;Computational Learning Approaches to Data Analytics in Biomedical Applications&lt;/li&gt;
&lt;li&gt;Statistical Learning for Biomedical Data&lt;/li&gt;
&lt;li&gt;Case Studies in Neural Data Analysis&lt;/li&gt;
&lt;li&gt;Neural Data Science: A Primer with MATLAB and Python&lt;/li&gt;
&lt;li&gt;Computational Genomics with R&lt;/li&gt;
&lt;li&gt;Bioinformatics: The Machine Learning Approach&lt;/li&gt;
&lt;li&gt;Biomedical Image Analysis in Python&lt;/li&gt;
&lt;/ul&gt;




&lt;/li&gt;


&lt;li&gt;


Datasets

&lt;ul&gt;
&lt;li&gt;Synthea: Synthetic Patient Generation&lt;/li&gt;
&lt;li&gt;PhysioNet: The Research Resource for Complex Physiologic Signals&lt;/li&gt;
&lt;li&gt;Computational Biology Datasets Suitable For Machine Learning&lt;/li&gt;
&lt;li&gt;Kaggle: Healthcare tag&lt;/li&gt;
&lt;li&gt;NIH: Data Sharing Resources&lt;/li&gt;
&lt;/ul&gt;




&lt;/li&gt;


&lt;li&gt;Conclusions&lt;/li&gt;


&lt;/ul&gt;
&lt;br&gt;

&lt;h2&gt;
  
  
  Notice of non-affiliation and disclaimer
&lt;/h2&gt;

&lt;p&gt;I am not the author nor I am associated with any author, publishing company, or digital platform of the resources mentioned here. I also was not paid, endorsed, or compensated in any way for this post. Any reference in this post is for the information and convenience of the public and does not constitute an endorsement, recommendation, or favoring.&lt;/p&gt;

&lt;h2&gt;
  
  
  Statistics and math
&lt;/h2&gt;

&lt;p&gt;A good understanding of statistics and mathematics is fundamental to any data science or machine learning analysis. The most basic and key concepts include probability distributions, statistical significance, hypothesis testing, and regression. Here are some resources dedicated to teaching you all of that (and more) with examples from biomedical sciences.&lt;/p&gt;

&lt;h3&gt;
  
  
  Modern Statistics for Modern Biology
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Susan Holmes, Wolfgang Huber
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F10modernstats.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F10modernstats.jpg" alt="modernstats"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book 📘 | Code: R | Free: ✅ | &lt;a href="https://www.huber.embl.de/msmb/" rel="noopener noreferrer"&gt;Link&lt;/a&gt; ↗️&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The aim of this book is to enable scientists working in biological research to quickly learn many of the important ideas and methods that they need to make the best of their experiments and of other available data. The book takes a hands-on approach.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This book is not heavy on mathematics, it goes straight to the core concepts and has a lot of R code examples and exercises! It ranges from the basics of data distributions and hypothesis testing to more advanced topics like multivariate analysis and supervised learning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Statistics for Biomedical Engineers and Scientists
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Andrew King, Robert Eckersley
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F11statisticsforbiomed.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F11statisticsforbiomed.jpg" alt="statsbiomed"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book 📘 | Code: MATLAB | Free: ❌ | &lt;a href="https://www.sciencedirect.com/book/9780081029398/statistics-for-biomedical-engineers-and-scientists" rel="noopener noreferrer"&gt;Link&lt;/a&gt; ↗️&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Readers will learn how to understand the fundamental concepts of descriptive and inferential statistics, analyze data and choose an appropriate hypothesis test to answer a given question, compute numerical statistical measures and perform hypothesis tests “by hand”, and visualize data and perform statistical analysis using MATLAB.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is just what you would expect from a regular undergraduate level book about probability and statistics. Not heavy on math and it has a lot of exercises.&lt;/p&gt;

&lt;h3&gt;
  
  
  Applied Mathematics for the Analysis of Biomedical Data: Models, Methods, and MATLAB
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Peter J. Costa
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F14appliedmath.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F14appliedmath.jpg" alt="appliedmath"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book 📘 | Code: MATLAB | Free: ❌ | &lt;a href="https://www.wiley.com/en-cg/Applied+Mathematics+for+the+Analysis+of+Biomedical+Data:+Models,+Methods,+and+MATLAB-p-9781119269496" rel="noopener noreferrer"&gt;Link&lt;/a&gt; ↗️&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Features a practical approach to the analysis of biomedical data via mathematical methods and provides a MATLAB® toolbox for the collection, visualization, and evaluation of experimental and real-life data&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This one is heavier on maths and assumes you are familiar with elementary differential equations, linear algebra, and statistics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data-Handling in Biomedical Science
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Peter White
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F13datahandling.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F13datahandling.jpg" alt="datahandling"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book 📘 | Code: ❌ | Free: ❌ | &lt;a href="https://www.cambridge.org/core/books/datahandling-in-biomedical-science/838B39D42427AFB1CEB2D707A1BB0114#fndtn-information" rel="noopener noreferrer"&gt;Link&lt;/a&gt; ↗️&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Packed with worked examples and problems, this book will help the reader improve their confidence and skill in data-handling.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This one is a little different from the previous ones, but it is worth listing. The book has no code examples and it is not about computational methods of data handling and analysis. It teaches basic math and statistics needed for biochemistry and microbiology experiments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data engineering
&lt;/h2&gt;

&lt;p&gt;As important as analyzing data, we also need to know how to design and maintain data pipelines. Biomedical data can be messy, heterogenous, and big, but fortunately, these authors are here to help us!&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Warehousing for Biomedical Informatics
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Richard E. Biehl
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F20dataware.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F20dataware.jpg" alt="datawarehouse"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book 📘 | Code: SQL | Free: ❌ | &lt;a href="https://www.routledge.com/Data-Warehousing-for-Biomedical-Informatics/Biehl/p/book/9781482215212" rel="noopener noreferrer"&gt;Link&lt;/a&gt; ↗️&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A step-by-step how-to guide for designing and building an enterprise-wide data warehouse across a biomedical or healthcare institution, using a four-iteration lifecycle and standardized design pattern.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This book is a gem. Classical content about data warehousing and ETL pipelines, but really focused on biomedical and healthcare data. Lots of SQL code snippets!&lt;/p&gt;

&lt;h3&gt;
  
  
  Big Biomedical Data Engineering
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Ripon Patgiri, Sabuzima Nayak
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Book chapter 📄 | Code: ❌ | Free: ✅ | &lt;a href="https://link.springer.com/chapter/10.1007/978-3-030-43981-1_3" rel="noopener noreferrer"&gt;Link&lt;/a&gt; ↗️&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This chapter exploits the role of Big Data in biomedical data engineering and its storage dilemma.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A short book chapter that discusses some scenarios of biomedical big data applications and possible future.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data manipulation, data analysis, and machine learning
&lt;/h2&gt;

&lt;p&gt;This is where most people have fun. Let’s see how to handle, clean, analyze and extract insights from biomedical data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Science and Predictive Analytics: Biomedical and Health Applications using R
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Ivo D. Dinov
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F30datapred.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F30datapred.png" alt="datapred"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book and MOOC 📘 💻 | Code: R | Free: ✅ ❌ | &lt;a href="https://www.springer.com/us/book/9783319723464" rel="noopener noreferrer"&gt;Link&lt;/a&gt; ↗️ | &lt;a href="https://www.socr.umich.edu/people/dinov/courses/DSPA_Topics.html" rel="noopener noreferrer"&gt;Free online material&lt;/a&gt; ↗️&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Complete and self-contained treatment of the theory, experimental modeling, system development, and validation of predictive health analytics.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A comprehensive data science book: introduction to R, data manipulation, data visualization, classification, regression, NLP, and even a little Deep Learning! All of this with well-documented R code. The book is not free, but you can find the videos, class notes, and R code on the author’s page linked above.&lt;/p&gt;

&lt;h3&gt;
  
  
  Computational Learning Approaches to Data Analytics in Biomedical Applications
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Khalid Al-Jabery Tayo Obafemi-Ajayi Gayla Olbricht Donald Wunsch
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F31companalytics.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F31companalytics.jpg" alt="companalytics"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book 📘 | Code: Python, MATLAB | Free: ❌ | &lt;a href="https://www.elsevier.com/books/computational-learning-approaches-to-data-analytics-in-biomedical-applications/al-jabery/978-0-12-814482-4" rel="noopener noreferrer"&gt;Link&lt;/a&gt; ↗️&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It presents insights on biomedical data processing, innovative clustering algorithms and techniques, and connections between statistical analysis and clustering.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An interesting and more theoretical approach to data preprocessing and clustering algorithms. Examples are given in pseudocode and some math knowledge is required. The last chapter has a hands-on approach using MATLAB and Python codes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Statistical Learning for Biomedical Data
&lt;/h3&gt;

&lt;h4&gt;
  
  
  James D. Malley, Karen G. Malley, Sinisa Pajevic
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F32statlearn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F32statlearn.jpg" alt="statlearn"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book 📘 | Code: MATLAB | Free: ❌ | &lt;a href="https://doi.org/10.1017/CBO9780511975820" rel="noopener noreferrer"&gt;Link&lt;/a&gt; ↗️&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This book is for anyone who has biomedical data and needs to identify variables that predict an outcome, for two-group outcomes such as tumor/not-tumor, survival/death, or response from treatment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not heavy on math and does not have many code examples. Great theoretical explanations covering regression, single decision trees, and Random Forests.&lt;/p&gt;

&lt;h3&gt;
  
  
  Case Studies in Neural Data Analysis
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Mark Kramer, Uri Eden
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F34casestud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F34casestud.png" alt="casestud"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book 📘 | Code: Python | Free: ✅ | &lt;a href="https://mark-kramer.github.io/Case-Studies-Python/intro.html" rel="noopener noreferrer"&gt;Link&lt;/a&gt; ↗️&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The intended audience is the practicing neuroscientist - e.g., the students, researchers, and clinicians collecting neuronal data in the hospital or lab. The material can get pretty math-heavy, but we’ve tried to outline the main concepts as directly as possible, with hands-on implementations of all concepts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Great hands-on material for neuroscientists interested in analyzing spike trains and electric fields. All notebooks are in Python and have a little explanation about the concepts and goal of the analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Neural Data Science: A Primer with MATLAB and Python
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Erik Lee Nylen, Pascal Wallisch
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F35neuraldata.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F35neuraldata.png" alt="neuraldata"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book 📘 | Code: Python, MATLAB | Free: ❌ | &lt;a href="https://www.sciencedirect.com/book/9780128040430/neural-data-science" rel="noopener noreferrer"&gt;Link&lt;/a&gt; ↗️&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A beginner’s introduction to the principles of computation and data analysis in neuroscience, using both Python and MATLAB, giving readers the ability to transcend platform tribalism and enable coding versatility.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This book is beautifully organized and filled with images. The coolest thing about it is the MATLAB and Python code written side-by-side. The content ranges from the basics of programming to advanced techniques such as analog signal processing, biophysical modeling, clustering, and classification.&lt;/p&gt;

&lt;h3&gt;
  
  
  Computational Genomics with R
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Altuna Akalin
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F36compgenom.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F36compgenom.jpg" alt="compgenom"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book 📘 | Code: R | Free: ✅ | &lt;a href="http://compgenomr.github.io/book/" rel="noopener noreferrer"&gt;Link&lt;/a&gt; ↗️&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The aim of this book is to provide the fundamentals for data analysis for genomics. We want this book to be a starting point for computational genomics students and a guide for further data analysis in more specific topics in genomics.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This book has a great introduction to genomics that will help a lot if you are not coming from a biological related field. It covers many topics such as introduction to R, statistics, exploratory data analysis, supervised learning, RNA-Seq, and more!&lt;/p&gt;

&lt;h3&gt;
  
  
  Bioinformatics: The Machine Learning Approach
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Pierre Baldi, Søren Brunak
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F37bioinfoml.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F37bioinfoml.jpg" alt="bioingoml"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book 📘 | Code: ❌ | Free: ❌ | &lt;a href="https://mitpress.mit.edu/books/bioinformatics-second-edition" rel="noopener noreferrer"&gt;Link&lt;/a&gt; ↗️&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The book is aimed both at biologists and biochemists who need to understand new data-driven algorithms and at those with a primary background in physics, mathematics, statistics, or computer science who need to know more about applications in molecular biology.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This one is a little heavy on math, you will probably need some calculus, algebra, and probability theory. The book is really about the theoretical aspects of machine learning applied to bionformatics, including definitions of main concepts and proofs of main theorems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Biomedical Image Analysis in Python
&lt;/h3&gt;

&lt;h4&gt;
  
  
  DataCamp
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F38datacamp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F38datacamp.png" alt="datacamp"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Videos and interactive code 💻 | Code: Python | Free: ❌ | &lt;a href="https://www.datacamp.com/courses/biomedical-image-analysis-in-python" rel="noopener noreferrer"&gt;Link&lt;/a&gt; ↗️&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In this introductory course, you’ll learn the fundamentals of image analysis using NumPy, SciPy, and Matplotlib. You’ll navigate through a whole-body CT scan, segment a cardiac MRI time series, and determine whether Alzheimer’s disease changes brain structure.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Great content and it follows the DataCamp course structure: short videos and hands-on coding exercises directly in the browser!&lt;/p&gt;

&lt;h2&gt;
  
  
  Datasets
&lt;/h2&gt;

&lt;p&gt;Here are some places where you can find datasets to explore and exercise your skills:&lt;/p&gt;

&lt;h3&gt;
  
  
  Synthea: Synthetic Patient Generation
&lt;/h3&gt;

&lt;h4&gt;
  
  
  MITRE Corporation
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F90synthea.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F90synthea.png" alt="synthea"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://synthetichealth.github.io/synthea/" rel="noopener noreferrer"&gt;Link&lt;/a&gt; ↗️&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;SyntheaTM is an open-source, synthetic patient generator that models the medical history of synthetic patients. The resulting data is free from cost, privacy, and security restrictions, enabling research with Health IT data that is otherwise legally or practically unavailable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  PhysioNet: The Research Resource for Complex Physiologic Signals
&lt;/h3&gt;

&lt;h4&gt;
  
  
  MIT Laboratory for Computational Physiology
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F91physionet.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F91physionet.png" alt="physionet"&gt;&lt;/a&gt;&lt;strong&gt;&lt;a href="https://physionet.org/about/database/" rel="noopener noreferrer"&gt;Link&lt;/a&gt; ↗️&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;PhysioNet is a repository of freely-available medical research data, managed by the MIT Laboratory for Computational Physiology.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Computational Biology Datasets Suitable For Machine Learning
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Ben Lengerich
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/blengerich/CompBioDatasetsForMachineLearning" rel="noopener noreferrer"&gt;Link&lt;/a&gt; ↗️&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is a curated list of computational biology datasets that have been pre-processed for machine learning.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Kaggle: Healthcare tag
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F92kaggle.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgbnegrini.com%2Fpost%2Fbest-biomedical-health-data-science-resources%2F92kaggle.png" alt="kaggle"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.kaggle.com/search?q=tag%3A%22healthcare%22+in%3Adatasets" rel="noopener noreferrer"&gt;Link&lt;/a&gt; ↗️&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Kaggle is the world’s largest data science community with powerful tools and public datasets.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  NIH: Data Sharing Resources
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Trans-NIH BioMedical Informatics Coordinating Committee
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html" rel="noopener noreferrer"&gt;Link&lt;/a&gt; ↗️&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To help researchers locate an appropriate resource for sharing  their data, as well as to promote awareness of resources where datasets can be located for reuse, BMIC maintains lists of several types of data sharing resources.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;That’s it! This comprehensive list covers many areas of biomedical data science and analytics, but there are many more great resources out there! Do you think I might have left out something important? Share with us in the comments!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Biomedical text natural language processing (BioNLP) using scispaCy: NER and rule-based matching</title>
      <dc:creator>Guilherme Bauer-Negrini</dc:creator>
      <pubDate>Thu, 15 Apr 2021 15:40:55 +0000</pubDate>
      <link>https://dev.to/gbnegrini/biomedical-text-natural-language-processing-bionlp-using-scispacy-ner-and-rule-based-matching-504p</link>
      <guid>https://dev.to/gbnegrini/biomedical-text-natural-language-processing-bionlp-using-scispacy-ner-and-rule-based-matching-504p</guid>
      <description>&lt;p&gt;Biomedical text mining and natural language processing (BioNLP) is an interesting research domain that deals with processing data from journals, medical records, and other biomedical documents. Considering the availability of biomedical literature, there has been an increasing interest in extracting information, relationships, and insights from text data. However, the unstructured organization and the domain complexity of biomedical documents make these tasks hard. Fortunately, some cool NLP Python packages can help us with that! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://allenai.github.io/scispacy/" rel="noopener noreferrer"&gt;scispaCy&lt;/a&gt; is a Python package containing &lt;a href="https://spacy.io/" rel="noopener noreferrer"&gt;spaCy&lt;/a&gt; models for processing biomedical, scientific or clinical text. spaCy's most mindblowing features are neural network models for tagging, parsing, named entity recognition (NER), text classification, and more. Add scispaCy models on top of it and we can do all that in the biomedical domain!&lt;/p&gt;

&lt;p&gt;Here we are going to see how to use scispaCy &lt;strong&gt;NER models to identify drug and disease names mentioned in a medical transcription dataset&lt;/strong&gt;. Moreover, we are going to &lt;strong&gt;combine NER and rule-based matching to extract the drug names and dosages&lt;/strong&gt; reported in each transcription. &lt;/p&gt;

&lt;h2&gt;
  
  
  Table of contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Requirements&lt;/li&gt;
&lt;li&gt;Dataset&lt;/li&gt;
&lt;li&gt;Named entity recognition&lt;/li&gt;
&lt;li&gt;Rule-based matching&lt;/li&gt;
&lt;li&gt;Conclusions&lt;/li&gt;
&lt;li&gt;References&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python 3&lt;/li&gt;
&lt;li&gt;pandas&lt;/li&gt;
&lt;li&gt;spacy&amp;gt;=3.0&lt;/li&gt;
&lt;li&gt;scispacy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can simply &lt;code&gt;pip install&lt;/code&gt; all of them.&lt;/p&gt;

&lt;p&gt;We also need to download and install the NER model from scispaCy. To install the &lt;code&gt;en_ner_bc5cdr_md&lt;/code&gt; model use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_bc5cdr_md-0.4.0.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For updated versions or other models, please check &lt;a href="https://allenai.github.io/scispacy/" rel="noopener noreferrer"&gt;scispaCy page&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dataset
&lt;/h2&gt;

&lt;p&gt;Unstructured medical data, like medical transcriptions, are pretty hard to find. Here we are using a &lt;a href="https://www.kaggle.com/tboyle10/medicaltranscriptions" rel="noopener noreferrer"&gt;medical transcription dataset&lt;/a&gt; scraped from the &lt;a href="https://mtsamples.com/" rel="noopener noreferrer"&gt;MTSamples website&lt;/a&gt; by Tara Boyle and made available at Kaggle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;med_transcript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mtsamples.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index_col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;med_transcript&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;med_transcript&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;class 'pandas.core.frame.DataFrame'&amp;gt;
Int64Index: 4999 entries, 0 to 4998
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   description        4999 non-null   object
 1   medical_specialty  4999 non-null   object
 2   sample_name        4999 non-null   object
 3   transcription      4966 non-null   object
 4   keywords           3931 non-null   object
dtypes: object(5)
memory usage: 234.3+ KB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;description&lt;/th&gt;
      &lt;th&gt;medical_specialty&lt;/th&gt;
      &lt;th&gt;sample_name&lt;/th&gt;
      &lt;th&gt;transcription&lt;/th&gt;
      &lt;th&gt;keywords&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;A 23-year-old white female presents with comp...&lt;/td&gt;
      &lt;td&gt;Allergy / Immunology&lt;/td&gt;
      &lt;td&gt;Allergic Rhinitis&lt;/td&gt;
      &lt;td&gt;SUBJECTIVE:,  This 23-year-old white female pr...&lt;/td&gt;
      &lt;td&gt;allergy / immunology, allergic rhinitis, aller...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;Consult for laparoscopic gastric bypass.&lt;/td&gt;
      &lt;td&gt;Bariatrics&lt;/td&gt;
      &lt;td&gt;Laparoscopic Gastric Bypass Consult - 2&lt;/td&gt;
      &lt;td&gt;PAST MEDICAL HISTORY:, He has difficulty climb...&lt;/td&gt;
      &lt;td&gt;bariatrics, laparoscopic gastric bypass, weigh...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;Consult for laparoscopic gastric bypass.&lt;/td&gt;
      &lt;td&gt;Bariatrics&lt;/td&gt;
      &lt;td&gt;Laparoscopic Gastric Bypass Consult - 1&lt;/td&gt;
      &lt;td&gt;HISTORY OF PRESENT ILLNESS: , I have seen ABC ...&lt;/td&gt;
      &lt;td&gt;bariatrics, laparoscopic gastric bypass, heart...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;2-D M-Mode. Doppler.&lt;/td&gt;
      &lt;td&gt;Cardiovascular / Pulmonary&lt;/td&gt;
      &lt;td&gt;2-D Echocardiogram - 1&lt;/td&gt;
      &lt;td&gt;2-D M-MODE: , ,1.  Left atrial enlargement wit...&lt;/td&gt;
      &lt;td&gt;cardiovascular / pulmonary, 2-d m-mode, dopple...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;2-D Echocardiogram&lt;/td&gt;
      &lt;td&gt;Cardiovascular / Pulmonary&lt;/td&gt;
      &lt;td&gt;2-D Echocardiogram - 2&lt;/td&gt;
      &lt;td&gt;1.  The left ventricular cavity size and wall ...&lt;/td&gt;
      &lt;td&gt;cardiovascular / pulmonary, 2-d, doppler, echo...&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The dataset has almost 5000 records, but let's work with a small random subsample so it doesn't take too long to process. We also have to drop any rows whose transcriptions are missing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;med_transcript&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;transcription&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;med_transcript_small&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;med_transcript&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;med_transcript_small&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;med_transcript_small&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;class 'pandas.core.frame.DataFrame'&amp;gt;
Int64Index: 100 entries, 3162 to 3581
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   description        100 non-null    object
 1   medical_specialty  100 non-null    object
 2   sample_name        100 non-null    object
 3   transcription      100 non-null    object
 4   keywords           78 non-null     object
dtypes: object(5)
memory usage: 4.7+ KB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;description&lt;/th&gt;
      &lt;th&gt;medical_specialty&lt;/th&gt;
      &lt;th&gt;sample_name&lt;/th&gt;
      &lt;th&gt;transcription&lt;/th&gt;
      &lt;th&gt;keywords&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;3162&lt;/th&gt;
      &lt;td&gt;Markedly elevated PT INR despite stopping Cou...&lt;/td&gt;
      &lt;td&gt;Hematology - Oncology&lt;/td&gt;
      &lt;td&gt;Hematology Consult - 1&lt;/td&gt;
      &lt;td&gt;HISTORY OF PRESENT ILLNESS:,  The patient is w...&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1981&lt;/th&gt;
      &lt;td&gt;Intercostal block from fourth to tenth interc...&lt;/td&gt;
      &lt;td&gt;Pain Management&lt;/td&gt;
      &lt;td&gt;Intercostal block - 1&lt;/td&gt;
      &lt;td&gt;PREPROCEDURE DIAGNOSIS:,  Chest pain secondary...&lt;/td&gt;
      &lt;td&gt;pain management, xylocaine, marcaine, intercos...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1361&lt;/th&gt;
      &lt;td&gt;The patient is a 65-year-old female who under...&lt;/td&gt;
      &lt;td&gt;SOAP / Chart / Progress Notes&lt;/td&gt;
      &lt;td&gt;Lobectomy - Followup&lt;/td&gt;
      &lt;td&gt;HISTORY OF PRESENT ILLNESS: , The patient is a...&lt;/td&gt;
      &lt;td&gt;soap / chart / progress notes, non-small cell ...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3008&lt;/th&gt;
      &lt;td&gt;Construction of right upper arm hemodialysis ...&lt;/td&gt;
      &lt;td&gt;Nephrology&lt;/td&gt;
      &lt;td&gt;Hemodialysis Fistula Construction&lt;/td&gt;
      &lt;td&gt;PREOPERATIVE DIAGNOSIS: , End-stage renal dise...&lt;/td&gt;
      &lt;td&gt;nephrology, end-stage renal disease, av dialys...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4943&lt;/th&gt;
      &lt;td&gt;Bronchoscopy with brush biopsies.  Persistent...&lt;/td&gt;
      &lt;td&gt;Cardiovascular / Pulmonary&lt;/td&gt;
      &lt;td&gt;Bronchoscopy - 8&lt;/td&gt;
      &lt;td&gt;PREOPERATIVE DIAGNOSIS: , Persistent pneumonia...&lt;/td&gt;
      &lt;td&gt;cardiovascular / pulmonary, persistent pneumon...&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Let's take one transcription to see how we can work with NER:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sample_transcription&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;med_transcript_small&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;transcription&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample_transcription&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;# prints just the first 1000 characters
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HISTORY OF PRESENT ILLNESS:,  The patient is well known to me for a history of iron-deficiency anemia due to chronic blood loss from colitis.  We corrected her hematocrit last year with intravenous (IV) iron.  Ultimately, she had a total proctocolectomy done on 03/14/2007 to treat her colitis.  Her course has been very complicated since then with needing multiple surgeries for removal of hematoma.  This is partly because she was on anticoagulation for a right arm deep venous thrombosis (DVT) she had early this year, complicated by septic phlebitis.,Chart was reviewed, and I will not reiterate her complex history.,I am asked to see the patient again because of concerns for coagulopathy.,She had surgery again last month to evacuate a pelvic hematoma, and was found to have vancomycin resistant enterococcus, for which she is on multiple antibiotics and followed by infectious disease now.,She is on total parenteral nutrition (TPN) as well.,LABORATORY DATA:,  Labs today showed a white blood 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;So, we can see a lot of entities in this transcription. There are drug, disease, and exam names for example.&lt;br&gt;
The text was scraped from a web page and we can identify the different sections from the medical record like "HISTORY OF PRESENT ILLNESS" and "LABORATORY DATA", but this varies from record to record.&lt;/p&gt;
&lt;h2&gt;
  
  
  Named entity recognition
&lt;/h2&gt;

&lt;p&gt;Named entity recognition (NER) is a subtask of natural language processing used to identify and classify named entities mentioned in unstructured text into pre-defined categories. &lt;code&gt;scispaCy&lt;/code&gt; has different models to identify different entity types and you can check them &lt;a href="https://allenai.github.io/scispacy/" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;We are going to use the NER model trained on the BC5CDR corpus (&lt;code&gt;en_ner_bc5cdr_md&lt;/code&gt;). This corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases, and 3116 chemical-disease interactions. Don't forget to download and install the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scispacy&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;spacy&lt;/span&gt;
&lt;span class="n"&gt;nlp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spacy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en_ner_bc5cdr_md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;spacy.load&lt;/code&gt; will return a &lt;code&gt;Language&lt;/code&gt; object containing all components and data needed to process text. This object is usually called &lt;code&gt;nlp&lt;/code&gt; in the documentation and tutorials. Calling the &lt;code&gt;nlp&lt;/code&gt; object on a string of text will return a processed &lt;code&gt;Doc&lt;/code&gt; object with the text split into words and annotated.&lt;/p&gt;

&lt;p&gt;Let's get all identified entities and print their text, start position, end position, and type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;nlp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample_transcription&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TEXT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;START&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;END&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ENTITY TYPE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ent&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_char&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_char&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TEXT START END ENTITY TYPE
iron-deficiency anemia 79 101 DISEASE
chronic blood loss 109 127 DISEASE
colitis 133 140 DISEASE
iron 203 207 CHEMICAL
...
vancomycin 781 791 CHEMICAL
infectious disease 873 891 DISEASE
improved.,PT 1348 1360 CHEMICAL
vitamin K 1503 1512 CHEMICAL
uric acid 1830 1839 CHEMICAL
bilirubin 1853 1862 CHEMICAL
Creatinine 1911 1921 CHEMICAL
...
Compazine 2474 2483 CHEMICAL
Zofran 2487 2493 CHEMICAL
epistaxis 2629 2638 DISEASE
bleeding 3057 3065 DISEASE
edema.,CARDIAC 3109 3123 CHEMICAL
adenopathy 3156 3166 DISEASE
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We can see the model correctly identified and label diseases such as &lt;em&gt;iron-deficiency anemia&lt;/em&gt;, &lt;em&gt;chronic blood loss&lt;/em&gt;, and many more. Lots of drugs were also identified, like &lt;em&gt;vancomycin&lt;/em&gt;, &lt;em&gt;Compazine&lt;/em&gt;, &lt;em&gt;Zofran&lt;/em&gt;. The model can also identify common laboratory tested molecules such as &lt;em&gt;creatinine&lt;/em&gt;, &lt;em&gt;iron&lt;/em&gt;, &lt;em&gt;bilirubin&lt;/em&gt;, &lt;em&gt;uric acid&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Not everything is perfect though. See how some tokens are weirdly classified as chemicals, possibly due to punctuation marks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;improved.,PT 1348 1360 CHEMICAL&lt;/li&gt;
&lt;li&gt;edema.,CARDIAC 3109 3123 CHEMICAL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Punctuation marks are usually removed in NLP preprocessing steps, but we can't remove all of them here, otherwise, we may miss chemical names and would screw up quantities like drug dosage. However, we can solve this problem by removing the "&lt;strong&gt;.,&lt;/strong&gt;" marks that appear to separate some sections of the transcription. It is important to know your data and your data's domain to have a better comprehension of your results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="n"&gt;med_transcript_small&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;transcription&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;med_transcript_small&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;transcription&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;(\.,)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also check the entities using the &lt;code&gt;displacy&lt;/code&gt; visualizer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;spacy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;displacy&lt;/span&gt;
&lt;span class="n"&gt;displacy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;render&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jupyter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# here I am printing just the first 100 tokens
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyonopth8vf0a2z88t4y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyonopth8vf0a2z88t4y.png" alt="displacy"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Rule-based matching
&lt;/h2&gt;

&lt;p&gt;Rule-based matching is similar to regular expressions, but spaCy’s rule-based matcher engines and components give you access to the tokens within the document and their relationships. We can combine this with the NER models to identify some pattern that includes our entities.&lt;/p&gt;

&lt;p&gt;Let's extract from the text the drug names and their reported dosages. This could be of real use to identify possible medication errors by checking if the dosages are in accordance with standards and guidelines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;spacy.matcher&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Matcher&lt;/span&gt;

&lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ENT_TYPE&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CHEMICAL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;LIKE_NUM&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;IS_ASCII&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="n"&gt;matcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Matcher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nlp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vocab&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;matcher&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DRUG_DOSE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code above creates a pattern to identify a sequence of three tokens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A token whose entity type is CHEMICAL (drug name)&lt;/li&gt;
&lt;li&gt;A token that resembles a number (dosage)&lt;/li&gt;
&lt;li&gt;A token that consists of ASCII characters (units, like mg or mL)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then we initialize the Matcher with a vocabulary. The matcher must always share the same vocab with the documents it will operate on, so we use the &lt;code&gt;nlp&lt;/code&gt; object vocab. We then add this pattern to the matcher and give it an ID.&lt;/p&gt;

&lt;p&gt;Now we can loop through all transcriptions and extract the text matching this pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;transcription&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;med_transcript_small&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;transcription&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;nlp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcription&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;matches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;matcher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;match_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;string_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nlp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vocab&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;match_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# get string representation
&lt;/span&gt;        &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# the matched span
&lt;/span&gt;        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;string_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DRUG_DOSE 137 140 Xylocaine 20 mL
DRUG_DOSE 141 144 Marcaine 0.25%
DRUG_DOSE 208 211 Aspirin 81 mg
DRUG_DOSE 216 219 Spiriva 10 mcg
DRUG_DOSE 399 402 nifedipine 10 mg
DRUG_DOSE 226 229 aspirin one tablet
DRUG_DOSE 245 248 Warfarin 2.5 mg
DRUG_DOSE 67 70 Topamax 100 mg
...
DRUG_DOSE 193 196 Metamucil one pack
DRUG_DOSE 207 210 Nexium 40 mg
DRUG_DOSE 1133 1136 Naprosyn one p.o
DRUG_DOSE 290 293 Lidocaine 1%
DRUG_DOSE 37 40 Altrua 60,
...
DRUG_DOSE 74 77 Lidocaine 1.5%
DRUG_DOSE 209 212 Dilantin 300 mg
DRUG_DOSE 217 220 Haloperidol 1 mg
DRUG_DOSE 225 228 Dexamethasone 4 mg
DRUG_DOSE 234 237 Docusate 100 mg
DRUG_DOSE 250 253 Ibuprofen 600 mg
DRUG_DOSE 258 261 Zantac 150 mg
...
DRUG_DOSE 204 207 epinephrine 7 ml
DRUG_DOSE 214 217 Percocet 5,
DRUG_DOSE 55 58 . 4.
DRUG_DOSE 146 149 . 4.
DRUG_DOSE 2409 2412 Naprosyn 375 mg
DRUG_DOSE 141 144 Wellbutrin 300 mg
DRUG_DOSE 146 149 Xanax 0.25 mg
DRUG_DOSE 158 161 omeprazole 20 mg
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Nice, we did it!&lt;/p&gt;

&lt;p&gt;We successfully extracted drugs and dosages, including different kinds of units like &lt;em&gt;mg, mL, %, packs&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;Here we learned how to use some features of scispaCy and spaCy like NER and rule-base matching. We used one NER model, but there lots of others and you should totally check them out. For instance, the &lt;code&gt;en_ner_bionlp13cg_md&lt;/code&gt; model can identify anatomical parts, tissues, cell types, and more. Imagine what else you could do with that!&lt;/p&gt;

&lt;p&gt;We also didn't focus too much on preprocessing steps, but they are fundamental to get better results. Don't forget to explore your data and adapt the preprocessing steps to the NLP tasks you want to do.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Neumann, M., King, D., Beltagy, I., &amp;amp; Ammar, W. (2019). &lt;a href="https://arxiv.org/abs/1902.07669" rel="noopener noreferrer"&gt;Scispacy: Fast and robust models for biomedical natural language processing&lt;/a&gt;. arXiv preprint arXiv:1902.07669.&lt;/p&gt;

&lt;p&gt;Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A. (2020). &lt;a href="https://spacy.io/" rel="noopener noreferrer"&gt;spaCy: Industrial-strength Natural Language Processing in Python&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>python</category>
      <category>nlp</category>
      <category>biomedical</category>
      <category>datamining</category>
    </item>
    <item>
      <title>Tolkien character or prescription drug name? Classification using character-level Long Short-Term Memory (LSTM) neural networks</title>
      <dc:creator>Guilherme Bauer-Negrini</dc:creator>
      <pubDate>Thu, 08 Apr 2021 18:43:25 +0000</pubDate>
      <link>https://dev.to/gbnegrini/tolkien-character-or-prescription-drug-name-classification-using-character-level-long-short-term-memory-lstm-neural-networks-1o4m</link>
      <guid>https://dev.to/gbnegrini/tolkien-character-or-prescription-drug-name-classification-using-character-level-long-short-term-memory-lstm-neural-networks-1o4m</guid>
      <description>&lt;p&gt;After trying to read J.R.R. Tolkien's &lt;em&gt;The Silmarillion&lt;/em&gt; again for the millionth time, I remembered a funny tweet that has been around for a while:&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-977387234226855936-610" src="https://platform.twitter.com/embed/Tweet.html?id=977387234226855936"&gt;
&lt;/iframe&gt;

  // Detect dark theme
  var iframe = document.getElementById('tweet-977387234226855936-610');
  if (document.body.className.includes('dark-theme')) {
    iframe.src = "https://platform.twitter.com/embed/Tweet.html?id=977387234226855936&amp;amp;theme=dark"
  }



&lt;/p&gt;



&lt;p&gt;Even though I was a casual fan of &lt;em&gt;The Lord of the Rings&lt;/em&gt; and having already taken two pharmacology courses in college, I had no idea who or what a &lt;em&gt;Narmacil&lt;/em&gt; was. Should we fear him/her by its sword skills or by its dangerous side effects?&lt;/p&gt;

&lt;p&gt;This little trivia prompted me to ask if an artificial neural network (ANN) could succeed where I and many more have failed. Here, I show you how to build a special type of ANN called Long Short-Term Memory (LSTM) to classify Tolkien characters and prescription drug names using &lt;code&gt;Keras&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnw5k8rxopc0zwxhyqiq6.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnw5k8rxopc0zwxhyqiq6.gif" alt="begins"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dataset
&lt;/h2&gt;

&lt;p&gt;The first step was to build from scratch a combined dataset with names of Tolkien characters and prescription drugs (a bunch of them, not just antidepressants).&lt;/p&gt;

&lt;h3&gt;
  
  
  Tolkien characters
&lt;/h3&gt;

&lt;p&gt;Lucky for us, the &lt;a href="https://www.behindthename.com" rel="noopener noreferrer"&gt;&lt;em&gt;Behind the Name&lt;/em&gt;&lt;/a&gt; website has a database of the first names of Tolkien characters that we can directly read from the page's HTML using &lt;code&gt;pandas&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;raw_tolkien_chars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_html&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://www.behindthename.com/namesakes/list/tolkien/name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;raw_tolkien_chars&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;Name&lt;/th&gt;
      &lt;th&gt;Gender&lt;/th&gt;
      &lt;th&gt;Details&lt;/th&gt;
      &lt;th&gt;Total&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;Adalbert&lt;/td&gt;
      &lt;td&gt;m&lt;/td&gt;
      &lt;td&gt;1 character&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;Adaldrida&lt;/td&gt;
      &lt;td&gt;f&lt;/td&gt;
      &lt;td&gt;1 character&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;Adalgar&lt;/td&gt;
      &lt;td&gt;m&lt;/td&gt;
      &lt;td&gt;1 character&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;Adalgrim&lt;/td&gt;
      &lt;td&gt;m&lt;/td&gt;
      &lt;td&gt;1 character&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;Adamanta&lt;/td&gt;
      &lt;td&gt;f&lt;/td&gt;
      &lt;td&gt;1 character&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tolkien_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw_tolkien_chars&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;tolkien_names&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;350&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;355&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;350           Gethron
351    Ghân-buri-Ghân
352            Gildis
353            Gildor
354         Gil-galad
Name: Name, dtype: object
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We can see that some names are hyphenated and have accented letters. To simplify the analysis I transformed unicode characters to ASCII, removed punctuation marks, transformed them to lowercase, and removed any possible duplicates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;unidecode&lt;/span&gt;

&lt;span class="n"&gt;processed_tolkien_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tolkien_names&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unidecode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unidecode&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;processed_tolkien_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;processed_tolkien_names&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;span class="n"&gt;processed_tolkien_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processed_tolkien_names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;processed_tolkien_names&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tolkien&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;processed_tolkien_names&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;350&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;355&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;473    gethron
439       ghan
109        gil
341     gildis
324     gildor
Name: name, dtype: object
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;processed_tolkien_names&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(746,2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Done! Now we have 746 different character names.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prescription drugs
&lt;/h3&gt;

&lt;p&gt;To get a comprehensive list of drug names, I downloaded the &lt;a href="https://www.accessdata.fda.gov/scripts/cder/daf/index.cfm?event=medguide.page" rel="noopener noreferrer"&gt;medication guide&lt;/a&gt; of the U.S. Food &amp;amp; Drug Administration (FDA).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;raw_medication_guide&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data/raw/medication_guides.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;raw_medication_guide&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;Drug Name&lt;/th&gt;
      &lt;th&gt;Active Ingredient&lt;/th&gt;
      &lt;th&gt;Form;Route&lt;/th&gt;
      &lt;th&gt;Appl. No.&lt;/th&gt;
      &lt;th&gt;Company&lt;/th&gt;
      &lt;th&gt;Date&lt;/th&gt;
      &lt;th&gt;Link&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;Abilify&lt;/td&gt;
      &lt;td&gt;Aripiprazole&lt;/td&gt;
      &lt;td&gt;TABLET, ORALLY DISINTEGRATING;ORAL&lt;/td&gt;
      &lt;td&gt;21729&lt;/td&gt;
      &lt;td&gt;OTSUKA&lt;/td&gt;
      &lt;td&gt;02/05/2020&lt;/td&gt;
      &lt;td&gt;https://www.accessdata.fda.gov/drugsatfda_docs...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;Abilify&lt;/td&gt;
      &lt;td&gt;Aripiprazole&lt;/td&gt;
      &lt;td&gt;TABLET;ORAL&lt;/td&gt;
      &lt;td&gt;21436&lt;/td&gt;
      &lt;td&gt;OTSUKA&lt;/td&gt;
      &lt;td&gt;02/05/2020&lt;/td&gt;
      &lt;td&gt;https://www.accessdata.fda.gov/drugsatfda_docs...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;Abilify&lt;/td&gt;
      &lt;td&gt;Aripiprazole&lt;/td&gt;
      &lt;td&gt;SOLUTION;ORAL&lt;/td&gt;
      &lt;td&gt;21713&lt;/td&gt;
      &lt;td&gt;OTSUKA&lt;/td&gt;
      &lt;td&gt;02/05/2020&lt;/td&gt;
      &lt;td&gt;https://www.accessdata.fda.gov/drugsatfda_docs...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;Abilify&lt;/td&gt;
      &lt;td&gt;Aripiprazole&lt;/td&gt;
      &lt;td&gt;SOLUTION;ORAL&lt;/td&gt;
      &lt;td&gt;21713&lt;/td&gt;
      &lt;td&gt;OTSUKA&lt;/td&gt;
      &lt;td&gt;02/05/2020&lt;/td&gt;
      &lt;td&gt;https://www.accessdata.fda.gov/drugsatfda_docs...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;Abilify&lt;/td&gt;
      &lt;td&gt;Aripiprazole&lt;/td&gt;
      &lt;td&gt;INJECTABLE;INTRAMUSCULAR&lt;/td&gt;
      &lt;td&gt;21866&lt;/td&gt;
      &lt;td&gt;OTSUKA&lt;/td&gt;
      &lt;td&gt;02/05/2020&lt;/td&gt;
      &lt;td&gt;https://www.accessdata.fda.gov/drugsatfda_docs...&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;drug_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw_medication_guide&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Drug Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;drug_names&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;160&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;165&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;160                                             Chantix
161         Children's Cetirizine Hydrochloride Allergy
162    Chlordiazepoxide and Amitriptyline Hydrochloride
163                                              Cimzia
164                                              Cimzia
Name: Drug Name, dtype: object
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A similar preprocessing step was repeated for this dataset too:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;processed_drug_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;drug_names&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;processed_drug_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;processed_drug_names&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;span class="n"&gt;processed_drug_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processed_drug_names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;processed_drug_names&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tolkien&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;processed_drug_names&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;84&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;89&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;373             chantix
448            children
395    chlordiazepoxide
185              cimzia
292               cipro
Name: name, dtype: object
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;processed_drug_names&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(611,2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Done, 611 different drug names!&lt;/p&gt;

&lt;p&gt;We can finally combine the two datasets and move on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;processed_tolkien_names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processed_drug_names&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ignore_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Data transformation
&lt;/h2&gt;

&lt;p&gt;So now we have a bunch of names, but machine learning models don't work with raw characters. We need to convert them into a numerical format that can be processed by our soon-to-be-built model.&lt;/p&gt;

&lt;p&gt;Using the &lt;code&gt;Tokenizer&lt;/code&gt; class from Keras, we set &lt;code&gt;char_level=True&lt;/code&gt; to process each word at character-level. The &lt;code&gt;fit_on_texts()&lt;/code&gt; method will update the tokenizer internal vocabulary based on our dataset names and then &lt;code&gt;texts_to_sequences()&lt;/code&gt; will transform each name into a sequence of integers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;keras.preprocessing.text&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Tokenizer&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;char_level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_on_texts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;char_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;texts_to_sequences&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look how our beloved Bilbo is now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;134&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;char_index&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;134&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bilbo
[16, 3, 6, 16, 5]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Yet, this representation is not ideal. Having integers to represent letters could lead the ANN to assume that the characters have an ordinal scale. To solve this problem we have to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Set all names to have the length of the longest name (17 characters here). We use &lt;code&gt;pad_sequences&lt;/code&gt; to add 0's to the end of names shorter than 17 letters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Convert each integer representation to its one-hot encoded vector representation. The vector consists of 0s in all cells except for a single 1 in a cell to identify the letter.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;keras.preprocessing.sequence&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pad_sequences&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;keras.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;to_categorical&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;char_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pad_sequences&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;char_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maxlen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;padding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;post&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;to_categorical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;char_index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# onehot encoding
&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tolkien&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(1357, 17, 27)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We have 1357 names. Each name has 17 letters and each letter is a one-hot encoded vector of size 27 (26 letters of the Latin alphabet + padding character).&lt;/p&gt;

&lt;h2&gt;
  
  
  Data split
&lt;/h2&gt;

&lt;p&gt;I split the data into train, validation, and test sets with a 60/20/20 ratio using a custom function since &lt;code&gt;sklearn train_test_split&lt;/code&gt; only outputs two sets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;data_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_ratio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rand_seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_temp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                        &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                        &lt;span class="n"&gt;train_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_ratio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                        &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rand_seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;x_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                    &lt;span class="n"&gt;y_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                    &lt;span class="n"&gt;train_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                    &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rand_seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;

&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;data_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_ratio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's take a look at the splits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;dataset_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_val&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
                                &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;val&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;dataset_count&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bar&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total number of samples: &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dataset_count&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Class/Samples: &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dataset_count&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Split/Class/Samples: &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dataset_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6jedlm5u88inh2nlecf1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6jedlm5u88inh2nlecf1.png" alt="datasplit"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total number of samples: 
1357
Class/Samples: 
1    746
0    611
dtype: int64
Split/Class/Samples: 
        1    0
train  451  363
val    149  122
test   146  126
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;There are more Tolkien characters than drug names, but it seems like a decent balance.&lt;/p&gt;

&lt;h2&gt;
  
  
  LSTM model
&lt;/h2&gt;

&lt;p&gt;Long Short-Term Memory is a type of Recurrent Neural Network proposed by Hochreiter S. &amp;amp; Schmidhuber J. (1997) to store information over extended time intervals. Names are just sequences of characters in which the order is important, so LSTM networks are a great choice for our name prediction task. You can read more about LSTMs in this &lt;a href="https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21" rel="noopener noreferrer"&gt;awesome illustrated guide&lt;/a&gt; written by Michael Phi.&lt;/p&gt;

&lt;h3&gt;
  
  
  Training
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;Keras&lt;/code&gt; was used to build this simple LSTM model after some tests and hyperparameter tuning. It is just a hidden layer with 8 LSTM blocks, one dropout layer to prevent overfitting, and one output neuron with a sigmoid activation function to make a binary classification.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tensorflow.keras&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Sequential&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tensorflow.keras.layers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Activation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dropout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LSTM&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tensorflow.random&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;set_seed&lt;/span&gt;
&lt;span class="nf"&gt;set_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LSTM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_sequences&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;input_shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])))&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Dropout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Activation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sigmoid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm (LSTM)                  (None, 8)                 1152      
_________________________________________________________________
dropout (Dropout)            (None, 8)                 0         
_________________________________________________________________
dense (Dense)                (None, 1)                 9         
_________________________________________________________________
activation (Activation)      (None, 1)                 0         

=================================================================
Total params: 1,161
Trainable params: 1,161
Non-trainable params: 0
_________________________________________________________________
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Adam is a good default optimizer and produces great results in deep learning applications. Binary cross-entropy is the default loss function to binary classification problems and it is compatible with our single neuron output architecture.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tensorflow.keras.optimizers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Adam&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;binary_crossentropy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two callbacks were implemented. &lt;code&gt;EarlyStopping&lt;/code&gt; to stop the training process after 20 epochs without reducing the validation loss and &lt;code&gt;ModelCheckpoint&lt;/code&gt; to always save the model when the validation loss drops.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tensorflow.keras.callbacks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;EarlyStopping&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ModelCheckpoint&lt;/span&gt;

&lt;span class="n"&gt;es&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EarlyStopping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monitor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;val_loss&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;patience&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ModelCheckpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;best_model.h5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;monitor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;val_loss&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;save_best_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;validation_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_val&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;es&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mc&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Epoch 00071: val_loss did not improve from 0.34949
Epoch 72/100
26/26 [==============================] - 1s 24ms/step - loss: 0.3085 - accuracy: 0.8836 - val_loss: 0.3861 - val_accuracy: 0.8487

Epoch 00072: val_loss did not improve from 0.34949
Epoch 00072: early stopping
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;val_loss_per_epoch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;val_loss&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;best_epoch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val_loss_per_epoch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_loss_per_epoch&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Best epoch: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;best_epoch&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Best epoch: 52
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Let's plot the accuracy and loss values per epoch to see the progression of these metrics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Training&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;val_accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Validation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Epoch&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lower right&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;loss&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Training&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;val_loss&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Validation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Epoch&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Loss&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;upper right&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;plot_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffq3sv55u77edzuv8jlea.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffq3sv55u77edzuv8jlea.png" alt="accloss"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that the accuracy quickly reaches a good plateau around 80%. Visually the model appears to start overfitting after epoch 50. It shouldn't be a problem to use the version saved by &lt;code&gt;ModelCheckpoint&lt;/code&gt; at epoch 52. &lt;/p&gt;

&lt;h3&gt;
  
  
  Performance evaluation
&lt;/h3&gt;

&lt;p&gt;Finally, let's see how our model does with the test dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tensorflow.keras.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_model&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;best_model.h5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;9/9 [==============================] - 1s 7ms/step - loss: 0.4595 - accuracy: 0.8125
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accuracy: {0:.2f} %&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Accuracy: 81.25 %
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;81.25 %&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not bad!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsa7rko8hsciw6ibba7t1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsa7rko8hsciw6ibba7t1.gif" alt="hobbits"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can explore the results a little more with the confusion matrix and classification report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;confusion_matrix&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;heatmap&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="n"&gt;cm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;annot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rocket_r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xticklabels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;yticklabels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;True label&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Predicted label&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;
&lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;
&lt;span class="nf"&gt;plot_confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Drug&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Tolkien&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F96yhtu8yxgmobktj4qrl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F96yhtu8yxgmobktj4qrl.png" alt="confusionmatrix"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;classification_report&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Drug&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Tolkien&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;              precision    recall  f1-score   support

        Drug       0.80      0.80      0.80       126
     Tolkien       0.83      0.82      0.82       146

    accuracy                           0.81       272
   macro avg       0.81      0.81      0.81       272
weighted avg       0.81      0.81      0.81       272
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Besides the good accuracy, the model has almost the same number of false positives and false negatives. We can see this reflecting in a balanced precision and recall.&lt;/p&gt;

&lt;p&gt;So, just out of curiosity, &lt;strong&gt;which Tolkien characters could pass as prescription drugs?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;onehot_to_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;onehot_word&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Reverse one-hot encoded words to strings&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;char_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;char&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;char&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;onehot_word&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
    &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sequences_to_texts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;char_index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;test_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;test_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;
&lt;span class="n"&gt;test_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;test_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;onehot_to_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;test_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;true&lt;/th&gt;
      &lt;th&gt;prediction&lt;/th&gt;
      &lt;th&gt;name&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;supprelin&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;bingo&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;ponstel&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;elidel&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;aubagio&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;test_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;test_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;13              ivy&lt;br&gt;
17         camellia&lt;br&gt;
44      celebrindor&lt;br&gt;
47         meriadoc&lt;br&gt;
63        vanimelde&lt;br&gt;
64        finduilas&lt;br&gt;
75        eglantine&lt;br&gt;
84             ruby&lt;br&gt;
87            poppy&lt;br&gt;
89             otto&lt;br&gt;
100           tanta&lt;br&gt;
102          myrtle&lt;br&gt;
108          prisca&lt;br&gt;
132          cottar&lt;br&gt;
151          stybba&lt;br&gt;
171            este&lt;br&gt;
175           daisy&lt;br&gt;
189          tulkas&lt;br&gt;
195        arciryas&lt;br&gt;
205        odovacar&lt;br&gt;
206          tarcil&lt;br&gt;
207    hyarmendacil&lt;br&gt;
229            jago&lt;br&gt;
230            tata&lt;br&gt;
240           ponto&lt;br&gt;
271       landroval&lt;br&gt;
Name: name, dtype: object&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
&lt;br&gt;
  &lt;br&gt;
  &lt;br&gt;
  Conclusion&lt;br&gt;
&lt;/h2&gt;

&lt;p&gt;So, here we covered how to work with character embeddings and build a simple LSTM model capable of telling apart Tolkien character names from prescription drug names. Full code, including requirements, dataset, a Jupyter Notebook code version, and a script version, can be found at my &lt;a href="https://github.com/gbnegrini/tolkien-char-prescription-drug" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can also play around with this popular interactive quiz found on the web: &lt;a href="https://antidepressantsortolkien.now.sh/" rel="noopener noreferrer"&gt;&lt;em&gt;Antidepressant or Tolkien?&lt;/em&gt;&lt;/a&gt;. &lt;strong&gt;I only got 70.8% right!&lt;/strong&gt; Can you guess better than the LSTM network?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkypjumkf3f9i6aoyfpt9.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkypjumkf3f9i6aoyfpt9.gif" alt="over"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Hu, Y., Hu, C., Tran, T., Kasturi, T., Joseph, E., &amp;amp; Gillingham, M. (2021). &lt;a href="https://arxiv.org/pdf/2102.03692.pdf" rel="noopener noreferrer"&gt;What's in a Name?--Gender Classification of Names with Character Based Machine Learning Models.&lt;/a&gt; arXiv preprint arXiv:2102.03692.&lt;/p&gt;

&lt;p&gt;Bhagvati, C. (2018). &lt;a href="https://www.sciencedirect.com/science/article/pii/S1877050918307476" rel="noopener noreferrer"&gt;Word representations for gender classification using deep learning.&lt;/a&gt; Procedia computer science, 132, 614-622.&lt;/p&gt;

&lt;p&gt;Liang, X. (2018). &lt;a href="https://towardsdatascience.com/how-to-preprocess-character-level-text-with-keras-349065121089" rel="noopener noreferrer"&gt;How to Preprocess Character Level Text with Keras.&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>python</category>
      <category>keras</category>
      <category>lstm</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
