From Concept to Code: Building an AI-Based Adverse Drug Reaction Detection System

Our approach to tackling a real-world healthcare challenge using practical machine learning and fundamental NLP techniques.

By Rishika Chanda, Thiruveedhula Niharika, Shiva Keerthana, Karanam Sreeja
This project was developed under the guidance and mentorship of Professor Chanda Rajkumar


The Idea Behind the Project:

What if a single sentence in a medical report subtly indicates a harmful drug reaction—but goes unnoticed?
This was the question that stayed with us.

Healthcare data is often unstructured. Clinical notes, prescriptions, and reports frequently contain critical insights embedded within plain text. Identifying adverse drug reactions (ADRs) manually in such data is not only time-intensive but also prone to oversight.

Instead of pursuing a highly complex solution, we focused on a simpler question:

Can a system be designed to automatically analyze medical text and flag potential ADRs using straightforward NLP and machine learning techniques?


Why This Problem Matters:

In real-world healthcare settings, a significant portion of clinical data exists as unstructured text, where valuable insights are often embedded within narrative notes rather than neatly organized fields. Within this complexity, adverse drug reactions (ADRs) can be subtle, expressed in indirect or ambiguous language, making them easy to overlook. Relying on manual detection in such scenarios not only introduces the risk of missing critical signals but also becomes increasingly impractical as the volume of healthcare data continues to grow.
Against this backdrop, this project explores how far a lightweight NLP pipeline can go in addressing the challenge—focusing on building a solution that is both practical and scalable without relying on overly complex approaches.


How We Set Up the Project:

The project focuses on creating a complete, working system that:
1. Accepts medical text as input
2. Processes and cleans it
3. Extracts meaningful features
4. Predicts whether an ADR is present


Technology Stack:

To keep the system practical, scalable, and easy to extend, we built it using a lightweight yet complete stack, from NLP processing to backend APIs and database storage.

Python — Core Engine
The entire pipeline is powered by Python for its strong ML + NLP ecosystem.
Data Handling
Pandas → dataset processing
NumPy → numerical operations

A typical prediction call chains these pieces together:

text = clean_text(input_text)            # normalize the raw medical text
vector = vectorizer.transform([text])    # convert to TF-IDF features
prediction = model.predict(vector)       # ADR / Non-ADR label

NLP with NLTK
Used for preparing raw medical text:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stop-word corpus
stop_words = set(stopwords.words('english'))
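For illustration, here is a minimal sketch of the `clean_text` helper used earlier. The stop-word set below is a small hand-picked sample standing in for NLTK's full English list, so the function stays self-contained:

```python
import re

# Small illustrative stop-word sample; the project uses NLTK's
# full English list via stopwords.words('english').
STOP_WORDS = {"i", "the", "a", "an", "and", "was", "after", "my"}

def clean_text(text: str) -> str:
    """Lowercase, strip punctuation, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove punctuation/symbols
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean_text("After taking the drug, I felt severe dizziness!"))
# taking drug felt severe dizziness
```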

Machine Learning — Scikit-learn
Scikit-learn was used to build the core classification pipeline, enabling efficient training and evaluation of models such as Logistic Regression and Naïve Bayes. It also supports feature extraction through techniques like TF-IDF, making it a reliable choice for text-based classification tasks.


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = TfidfVectorizer(max_features=5000)  # keep the top 5,000 TF-IDF terms
model = LogisticRegression()                     # binary ADR / Non-ADR classifier
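Putting the vectorizer and classifier together, a minimal training sketch might look like this. The four labeled sentences are invented placeholders standing in for the project's real dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny toy dataset for illustration only; the real project
# trains on a labeled corpus of medical text.
texts = [
    "severe rash and itching after taking the antibiotic",
    "felt dizzy and nauseous after the new medication",
    "the tablet worked well with no side effects",
    "recovered quickly, medication was effective",
]
labels = [1, 1, 0, 0]  # 1 = ADR present, 0 = no ADR

# Chain TF-IDF feature extraction and the classifier into one object.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", LogisticRegression()),
])
pipeline.fit(texts, labels)

# Predict on an unseen sentence (returns an array of 0/1 labels).
print(pipeline.predict(["terrible headache after taking the pills"]))
```

Using a `Pipeline` keeps the vectorizer and model in sync, so the same transformation applied at training time is reused at prediction time.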

Backend — Flask API
A lightweight backend built with Flask makes the model accessible via APIs.

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    text = request.json['text']          # raw medical text from the client
    result = predict_adr(text)           # run the trained pipeline
    return jsonify({"prediction": result})

if __name__ == '__main__':
    app.run(debug=True)
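To show the endpoint in action without a running server, here is a self-contained sketch that exercises the route with Flask's built-in test client. The `predict_adr` function is stubbed with a trivial keyword check standing in for the trained model:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict_adr(text: str) -> str:
    # Stub standing in for the real pipeline, which runs
    # clean_text -> vectorizer.transform -> model.predict.
    return "ADR Detected" if "rash" in text.lower() else "No ADR Detected"

@app.route('/predict', methods=['POST'])
def predict():
    text = request.json['text']
    return jsonify({"prediction": predict_adr(text)})

# Exercise the endpoint in-process using Flask's test client.
with app.test_client() as client:
    resp = client.post('/predict', json={"text": "Developed a rash after the dose"})
    print(resp.get_json())  # {'prediction': 'ADR Detected'}
```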

MongoDB Integration for Database:

Medical text data (like patient reviews and clinical notes) is highly unstructured.
A NoSQL database like MongoDB is a good fit because it offers:
A flexible schema (no rigid tables)
Efficient handling of document-based data
Easy cloud deployment with MongoDB Atlas
The system follows a straightforward flow from input to insight. The backend, built with Flask, receives user input in the form of medical text, which is then processed by the trained model to generate predictions. These results are converted into structured insights and stored in MongoDB for easy access and further use.
Data Stored in MongoDB
Each user submission is saved as a document containing:
Drug name
Extracted adverse reaction
Temporal indicators (if any)
Severity level
Prediction label (ADR / Non-ADR)
Timestamp

from nlp_pipeline import compare_side_effects, known_side_effects
from pymongo import MongoClient
import os
import re

# Read the Atlas connection string from an environment variable
# rather than hard-coding credentials in source control.
client = MongoClient(os.environ["MONGODB_URI"])
db = client["adr_database"]
reviews_collection = db["reviews"]



System Architecture:

The system follows a modular pipeline architecture, where raw medical text is progressively transformed into meaningful predictions.
1. Input
The system accepts medical text such as patient reviews or clinical descriptions.
2. Preprocessing
Text is cleaned by converting to lowercase, removing punctuation, and eliminating stopwords.
3. Feature Extraction
The cleaned text is transformed into numerical features using TF-IDF vectorization.
4. Prediction
A trained Logistic Regression model classifies whether an ADR is present.
5. Storage
The input and prediction are stored in MongoDB for future analysis.

Each component is independent, making the system easy to scale, debug, and improve.
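As a sketch of how the five stages compose, the function below wires them together. The `vectorizer`, `model`, and `collection` arguments stand in for the fitted TF-IDF vectorizer, trained classifier, and MongoDB collection described above; the preprocessing step is simplified here:

```python
def predict_adr(text, vectorizer, model, collection=None):
    """Run the modular pipeline: preprocess -> vectorize -> classify -> store."""
    cleaned = text.lower().strip()              # stage 2: preprocessing (simplified)
    features = vectorizer.transform([cleaned])  # stage 3: TF-IDF feature extraction
    label = int(model.predict(features)[0])     # stage 4: classification
    result = "ADR Detected" if label == 1 else "No ADR Detected"
    if collection is not None:                  # stage 5: optional MongoDB storage
        collection.insert_one({"text": text, "prediction": result})
    return result
```

Because each stage is passed in as a dependency, any single component (for example, swapping Logistic Regression for another classifier) can be replaced without touching the rest of the pipeline.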

Pipeline Flow
Input → Medical text
Preprocessing → Clean and normalize
Feature Extraction → Convert text into vectors
Model → Classification
Output → ADR detected or not


Data Storage Logic:

reviews_collection.insert_one({
    "text": text,
    "prediction": prediction,
    "drug": drug if drug else "Unknown",
    "reaction": reaction if reaction else "None",
    "time": time if time else "Not detected",
    "severity": severity if severity else "Not categorized",
    "expected_effects": expected_effects,
    "unexpected_effects": unexpected_effects,
})
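Once documents are stored, they can be filtered for later review. The helper below builds a MongoDB query document for ADR-positive records; the "ADR" label value is an assumption based on the prediction labels listed above:

```python
def adr_query(drug=None, severity=None):
    """Build a MongoDB filter for ADR-positive records,
    optionally narrowed by drug name or severity level."""
    query = {"prediction": "ADR"}  # label value assumed from the schema above
    if drug:
        query["drug"] = drug
    if severity:
        query["severity"] = severity
    return query

# Usage against the collection defined earlier:
# reviews_collection.find(adr_query(drug="Ibuprofen")).limit(10)
print(adr_query(drug="Ibuprofen"))
# {'prediction': 'ADR', 'drug': 'Ibuprofen'}
```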

System Demonstration:

The developed system allows users to input a sentence describing a medical experience. The system cleans the input, converts it into TF-IDF features, and predicts whether it contains an adverse drug reaction.
The output is displayed as either “ADR Detected” or “No ADR Detected,” demonstrating how the model can be used as a decision-support tool.



Results and Insights:

Some example predictions produced by the system:

(Screenshot: example predictions)

(Figure: results graph)


Future Improvements:

Future work can focus on fine-tuning domain-specific models such as BioBERT to improve performance on medical data. Additional improvements include using ensemble methods, enhancing model interpretability, and deploying the system as a real-time application.


Conclusion
This project demonstrates how AI and NLP can be used to build a scalable system for detecting adverse drug reactions from unstructured text data. By integrating machine learning models with MongoDB, the system evolves from a standalone model into a practical application capable of real-world deployment.
Such systems have the potential to improve patient safety and support healthcare professionals in identifying risks more efficiently.


Project Links

GitHub Repository:

Live Demo:

Demo Video:
