<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: TAQİ EDDİNE EL MAMOUNİ</title>
    <description>The latest articles on DEV Community by TAQİ EDDİNE EL MAMOUNİ (@taqiddin).</description>
    <link>https://dev.to/taqiddin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3268560%2F8713d2e7-c6f3-4554-9c61-34568d45b7e4.jpg</url>
      <title>DEV Community: TAQİ EDDİNE EL MAMOUNİ</title>
      <link>https://dev.to/taqiddin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/taqiddin"/>
    <language>en</language>
    <item>
      <title>🧠 NLP: From Tokenization to Vectorization (with Practical Insights)</title>
      <dc:creator>TAQİ EDDİNE EL MAMOUNİ</dc:creator>
      <pubDate>Thu, 19 Jun 2025 20:01:47 +0000</pubDate>
      <link>https://dev.to/taqiddin/nlp-from-tokenization-to-vectorization-with-practical-insights-2he1</link>
      <guid>https://dev.to/taqiddin/nlp-from-tokenization-to-vectorization-with-practical-insights-2he1</guid>
      <description>&lt;p&gt;Natural Language Processing (NLP) bridges the gap between human language and machine intelligence. In this blog, we’ll explore foundational steps like tokenization, stemming, lemmatization, vectorization, and modern tools like Transformers. Whether you're just starting or want a refresher, this is your guide to transforming raw text into machine-readable format.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;🔤 Tokenization
Tokenization breaks text into smaller units, called tokens, which can be words, subwords, or sentences.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🔹 Word Tokenization&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input: "Natural Language Processing"
Tokens: ["Natural", "Language", "Processing"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🔹 Sentence Tokenization&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input: "NLP is fascinating. It has endless applications!"
Tokens: ["NLP is fascinating.", "It has endless applications!"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
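In practice you would call nltk.word_tokenize / nltk.sent_tokenize or use spaCy; as a minimal dependency-free sketch, the two examples above can be reproduced with regular expressions (the naive sentence splitter below would break on abbreviations like "Dr."):

```python
import re

def word_tokenize(text):
    # Words stay whole; punctuation marks become their own tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def sent_tokenize(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(word_tokenize("Natural Language Processing"))
# ['Natural', 'Language', 'Processing']
print(sent_tokenize("NLP is fascinating. It has endless applications!"))
# ['NLP is fascinating.', 'It has endless applications!']
```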



&lt;ol start="2"&gt;
&lt;li&gt;✂️ Stemming
Stemming reduces words to their root by stripping prefixes/suffixes — but it may not always produce real words.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Words: "running", "runs", "runner"
Stems: "run", "run", "runner"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🟢 Use Case: Fast text indexing and search systems.&lt;/p&gt;
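To see why stems need not be real words, here is a toy stemmer with two hand-picked rules (strip a common suffix, then undouble a trailing consonant). It reproduces the example above; real stemmers such as NLTK's PorterStemmer apply dozens of ordered rules:

```python
def simple_stem(word):
    # Toy stemmer: strip one common suffix, then undouble a trailing
    # consonant ("runn" becomes "run"). Purely illustrative.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            if len(stem) >= 3 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

print([simple_stem(w) for w in ["running", "runs", "runner"]])
# ['run', 'run', 'runner']
```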

&lt;ol start="3"&gt;
&lt;li&gt;🧬 Lemmatization
Lemmatization brings words to their proper dictionary root (lemma) using morphological analysis.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Words: "running", "ran", "runs"
Lemmas: "run", "run", "run"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🟢 Use Case: Sentiment analysis, text classification.&lt;/p&gt;
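Unlike stemming, lemmatization can map irregular forms such as "ran" to "run", which no suffix rule recovers. A dictionary-lookup sketch of the idea; real tools like NLTK's WordNetLemmatizer or spaCy use full morphological dictionaries plus the word's POS tag:

```python
# Tiny hand-made lemma dictionary, for illustration only.
LEMMAS = {"running": "run", "ran": "run", "runs": "run",
          "better": "good", "mice": "mouse"}

def lemmatize(word):
    # Look the word up; fall back to the word itself if unknown.
    return LEMMAS.get(word.lower(), word)

print([lemmatize(w) for w in ["running", "ran", "runs"]])
# ['run', 'run', 'run']
```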

&lt;ol start="4"&gt;
&lt;li&gt;🛑 Stop Word Removal
Stop words are common words like “the”, “is”, “and” that are usually removed before analysis.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input: "AI is transforming the world."
Output: "AI transforming world"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
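A sketch with a tiny hand-picked stop list (real lists, e.g. NLTK's English set, run to a few hundred entries, and whether to remove stop words at all depends on the task):

```python
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "it"}

def remove_stop_words(text):
    # Strip sentence punctuation so "world." matches "world", then filter.
    tokens = [t.strip(".,!?") for t in text.split()]
    return " ".join(t for t in tokens if t.lower() not in STOP_WORDS)

print(remove_stop_words("AI is transforming the world."))
# "AI transforming world"
```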



&lt;ol start="5"&gt;
&lt;li&gt;🏷️ Part-of-Speech (POS) Tagging
This tags each word with its grammatical role: noun, verb, adjective, etc.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input: "AI transforms industries."
Output: [('AI', 'NNP'), ('transforms', 'VBZ'), ('industries', 'NNS')]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="6"&gt;
&lt;li&gt;🔢 Text Normalization (Often Skipped but Important!)
Before further processing, normalize the text:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Lowercasing&lt;/p&gt;

&lt;p&gt;Removing punctuation/numbers&lt;/p&gt;

&lt;p&gt;Removing extra spaces&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re
text = "AI is Changing the WORLD! 2025."
clean = re.sub(r"[^a-zA-Z\s]", "", text.lower())
# Result: "ai is changing the world"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="7"&gt;
&lt;li&gt;🔠 TF-IDF (Vectorization)
TF-IDF weighs words by importance across documents.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["AI is the future", "AI transforms industries"]
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
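The weighting itself is easy to compute by hand. A sketch of the classic unsmoothed formula, term frequency multiplied by log(N/df); note that scikit-learn smooths the IDF and L2-normalizes each row, so its numbers differ slightly:

```python
import math

# Classic TF-IDF: frequency in this document (TF) times rarity across
# the corpus (IDF = log(N / document frequency)).
docs = [["ai", "is", "the", "future"], ["ai", "transforms", "industries"]]
N = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)
    df = sum(term in d for d in docs)
    return tf * math.log(N / df)

print(tf_idf("ai", docs[0]))       # 0.0: "ai" appears in every document
print(tf_idf("future", docs[0]))   # positive: "future" is distinctive
```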



&lt;ol start="8"&gt;
&lt;li&gt;🌐 Word Embeddings (Word2Vec, GloVe, FastText)
These convert words to dense vectors with semantic meaning.&lt;/li&gt;
&lt;/ol&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Model&lt;/th&gt;&lt;th&gt;Description&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Word2Vec&lt;/td&gt;&lt;td&gt;Learns word vectors from surrounding context windows&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;GloVe&lt;/td&gt;&lt;td&gt;Combines local context with global co-occurrence statistics&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;FastText&lt;/td&gt;&lt;td&gt;Captures subword information (e.g., prefixes and suffixes)&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
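What "dense vectors with semantic meaning" buys you is a usable notion of distance: related words end up close together. A sketch using made-up 3-dimensional vectors (real embeddings have hundreds of dimensions; the values below are invented for illustration):

```python
import math

def cosine(u, v):
    # Cosine similarity: the cosine of the angle between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical embeddings: "king" and "queen" should be near each other.
king  = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.15]
apple = [0.1, 0.2, 0.9]

print(cosine(king, queen) > cosine(king, apple))  # True
```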

&lt;ol start="9"&gt;
&lt;li&gt;🤖 Transformers (BERT, RoBERTa, GPT)
Modern NLP uses transformer-based models that understand context much better than traditional methods.
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import pipeline
clf = pipeline("sentiment-analysis")
print(clf("I love NLP and transformers!"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;🟢 Use Cases: Sentiment analysis, question answering, summarization, translation, etc.&lt;/p&gt;

&lt;p&gt;🔧 10. Build a Simple NLP Pipeline (Practical Example)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

model = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])

X = ["I love this product", "This is terrible"]
y = [1, 0]

model.fit(X, y)
print(model.predict(["Awesome experience"]))  # e.g. [1]; with only two training samples, treat this as a toy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🚀 Where to Go From Here?&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Topic&lt;/th&gt;&lt;th&gt;Description&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;🔍 NER&lt;/td&gt;&lt;td&gt;Recognize names, organizations, locations&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;🧩 Dependency Parsing&lt;/td&gt;&lt;td&gt;Understand how words relate&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;🏷️ Text Classification&lt;/td&gt;&lt;td&gt;Categorize emails, reviews, etc.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;📚 Topic Modeling&lt;/td&gt;&lt;td&gt;Discover themes in documents&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;🤖 Transformers&lt;/td&gt;&lt;td&gt;BERT, GPT for deep understanding&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;📝 Summarization&lt;/td&gt;&lt;td&gt;Shorten long documents&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;💬 Chatbots&lt;/td&gt;&lt;td&gt;Build intelligent assistants&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;🛠 NLP Project&lt;/td&gt;&lt;td&gt;Use spaCy, NLTK, or HuggingFace to combine all steps&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

</description>
      <category>webdev</category>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Medical Chatbot</title>
      <dc:creator>TAQİ EDDİNE EL MAMOUNİ</dc:creator>
      <pubDate>Thu, 19 Jun 2025 10:11:06 +0000</pubDate>
      <link>https://dev.to/taqiddin/medical-chat-bot-3k50</link>
      <guid>https://dev.to/taqiddin/medical-chat-bot-3k50</guid>
      <description>&lt;p&gt;🚀 Launching Our AI-Powered Turkish Health Support Chatbot! 🇹🇷💬&lt;br&gt;
In regions where healthcare access is limited, we built a 24/7 Turkish-language chatbot that provides users with fast, reliable answers to basic health-related questions using cutting-edge LLM and NLP technologies.&lt;/p&gt;

&lt;p&gt;🧠 🔹 Project Overview:&lt;br&gt;
Users can ask natural questions like: “I have a headache” or “I feel nauseous”, and the bot replies with possible causes and suggestions.&lt;br&gt;
Designed for native Turkish speakers and optimized to improve health literacy and reduce unnecessary hospital visits.&lt;br&gt;
Accessible, real-time, and developed with a focus on public benefit.&lt;/p&gt;

&lt;p&gt;📊 🔹 Dataset Information:&lt;br&gt;
Format: CSV file with 15,000 question-answer pairs.&lt;br&gt;
Source: Translated from SQuAD (Stanford Question Answering Dataset).&lt;br&gt;
Sample:&lt;br&gt;
 Q: “How can I relieve a headache?”&lt;br&gt;
 A: “Rest, drink plenty of water, and take painkillers if needed.”&lt;/p&gt;

&lt;p&gt;🧹 🔹 Data Preprocessing Steps:&lt;br&gt;
Text cleaning: Removed HTML tags, links, and special characters.&lt;br&gt;
Normalization: Lowercasing, punctuation handling, whitespace trimming.&lt;br&gt;
Tokenization: Used meta-llama/Llama-3.2-1B-Instruct tokenizer for LLM compatibility.&lt;br&gt;
Libraries: transformers, datasets, os, torch, Flask.&lt;/p&gt;

&lt;p&gt;🔧 🔹 Model Development:&lt;br&gt;
Model: meta-llama/Llama-3.2-1B-Instruct, a compact 1B-parameter instruction-tuned LLaMA 3.2 model.&lt;br&gt;
Architecture: Causal decoder, fine-tuned on domain-specific healthcare QA data.&lt;br&gt;
Training Configuration:&lt;br&gt;
Epochs: 3&lt;br&gt;
Batch Size: 8&lt;br&gt;
Learning Rate: 2e-5&lt;br&gt;
Optimizer: AdamW&lt;br&gt;
Results:&lt;br&gt;
Initial: Loss = 2.78, Accuracy = 11%&lt;br&gt;
Final: Loss = 0.12, Accuracy = 73%&lt;br&gt;
Training loss decreased steadily, indicating strong learning performance.&lt;br&gt;
🌐 🔹 Web Interface:&lt;/p&gt;

&lt;p&gt;Built with Flask for seamless user interaction.&lt;br&gt;
Users submit questions through a simple HTML interface.&lt;br&gt;
Backend:&lt;br&gt;
Checks if the question was asked before.&lt;br&gt;
If new, the model generates and stores the answer.&lt;br&gt;
Responses are returned in JSON.&lt;br&gt;
“Clear Chat” button allows resetting the session.&lt;/p&gt;
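The cache-then-generate flow described above can be sketched as follows, with the Flask routing and the real model call stubbed out (handle_question, fake_model_generate, and answer_cache are illustrative names, not the project's actual code):

```python
# Answers are cached by normalized question text, so the model is only
# called for questions that have not been asked before.
answer_cache = {}

def fake_model_generate(question):
    # Stand-in for the fine-tuned LLaMA model's generation step.
    return f"Answer to: {question}"

def handle_question(question):
    key = question.strip().lower()      # normalize so repeats hit the cache
    if key not in answer_cache:         # new question: generate and store
        answer_cache[key] = fake_model_generate(question)
    # Returned as a dict, which Flask would serialize to JSON.
    return {"question": question, "answer": answer_cache[key]}

print(handle_question("I have a headache"))
```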

&lt;p&gt;💡 🔹 Project Impact:&lt;br&gt;
 ✅ Promotes Turkish-language NLP applications&lt;br&gt;
 ✅ Real-world health chatbot use-case using LLaMA 3&lt;br&gt;
 ✅ End-to-end AI integration (data, training, deployment)&lt;br&gt;
 ✅ Fully functional Flask web app with real-time responses&lt;/p&gt;

&lt;p&gt;👨‍💻 Developer: Taqi Eddine El Mamouni&lt;br&gt;
 👥 Teammate: ILYASS ELMAMOUNI&lt;br&gt;
 🎓 Advisor: Dr. Kadir TOHMA&lt;br&gt;
 📅 Project Date: May 29, 2025&lt;/p&gt;

&lt;p&gt;#AI #HealthcareAI #LLM #LLaMA3 #NLP #TurkishLanguage #DeepLearning #Chatbot #MachineLearning #Flask #OpenSource #TaqiEddineElMamouni #HealthTech #DataScience&lt;/p&gt;

</description>
    </item>
    <item>
      <title>🤖 What is Artificial Intelligence (AI)?</title>
      <dc:creator>TAQİ EDDİNE EL MAMOUNİ</dc:creator>
      <pubDate>Mon, 16 Jun 2025 13:20:40 +0000</pubDate>
      <link>https://dev.to/taqiddin/what-is-artificial-intelligence-ai-4b22</link>
      <guid>https://dev.to/taqiddin/what-is-artificial-intelligence-ai-4b22</guid>
      <description>&lt;p&gt;Artificial Intelligence (AI) is a branch of computer science focused on building machines and systems that can perform tasks that typically require human intelligence. These tasks include problem-solving, learning, understanding language, recognizing patterns, and even making decisions.&lt;/p&gt;

&lt;p&gt;🧠 Types of AI&lt;br&gt;
Narrow AI:&lt;br&gt;
Designed for a specific task (e.g., voice assistants like Siri, or Netflix's recommendation engine).&lt;/p&gt;

&lt;p&gt;General AI:&lt;br&gt;
A theoretical form of AI that can perform any intellectual task that a human can do.&lt;/p&gt;

&lt;p&gt;Superintelligent AI:&lt;br&gt;
A hypothetical future AI that surpasses human intelligence across all domains.&lt;/p&gt;

&lt;p&gt;🔍 How Does AI Work?&lt;br&gt;
AI systems work by processing large amounts of data using algorithms that find patterns and make predictions or decisions. Some key concepts include:&lt;/p&gt;

&lt;p&gt;Machine Learning (ML) – AI that learns from data.&lt;/p&gt;

&lt;p&gt;Natural Language Processing (NLP) – Understanding and generating human language.&lt;/p&gt;

&lt;p&gt;Computer Vision – Interpreting and understanding visual information from the world.&lt;/p&gt;

&lt;p&gt;🛠️ Real-World Applications&lt;br&gt;
Healthcare: Diagnosing diseases, analyzing medical images.&lt;/p&gt;

&lt;p&gt;Finance: Fraud detection, algorithmic trading.&lt;/p&gt;

&lt;p&gt;Transportation: Self-driving cars.&lt;/p&gt;

&lt;p&gt;Customer Service: Chatbots and virtual assistants.&lt;/p&gt;

&lt;p&gt;Creativity: Generating art, music, and even writing.&lt;/p&gt;

&lt;p&gt;💡 Why It Matters&lt;br&gt;
AI is transforming industries and changing how we live and work. From simplifying daily tasks to solving complex global challenges, AI has become one of the most important technologies of the 21st century.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
