DEV Community

Ananya S


🌱 NLP for Beginners: Understanding the Basics of Natural Language Processing

Natural Language Processing (NLP) is one of the most exciting areas of Artificial Intelligence today. From chatbots and search engines to spam detection and sentiment analysis, NLP helps machines understand human language.

If you’re just starting out and feel confused by terms like tokenization or lemmatization, this post will give you a clear and gentle introduction.


📌 What is Natural Language Processing (NLP)?

Natural Language Processing (NLP) is a subfield of Artificial Intelligence that enables computers to understand, analyze, and generate human language.

In simple terms:

NLP allows machines to work with text and speech in a meaningful way.

Real-world applications of NLP

  • Chatbots and virtual assistants
  • Google Search and autocomplete
  • Spam email detection
  • Sentiment analysis of reviews
  • Language translation

🗺️ A Beginner-Friendly Roadmap to Learn NLP

Before diving into complex models, it’s important to understand how text is processed.

A simple conceptual roadmap

  • Text Preprocessing

    • Tokenization
    • Stop words removal
    • Stemming
    • Lemmatization
  • Text Representation

    • Bag of Words
    • TF-IDF
    • Word Embeddings
  • Classical NLP Tasks

    • Text classification
    • Sentiment analysis
    • Named Entity Recognition
  • Advanced NLP (Later Stage)

    • Transformers
    • BERT
    • GPT
    • Large Language Models

🧹 Why Text Preprocessing is Important

Machines don’t understand language like humans do.

Example sentence: "I am learning Natural Language Processing!"

To a machine, this is just a sequence of characters.

Text preprocessing helps convert raw text into a format that machine learning models can understand.
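
To make this concrete, here is a minimal sketch in plain Python that turns the example sentence into lowercase words and then into word counts (a very small Bag of Words). The helper name `preprocess` is made up for this illustration, and the regex keeps only the letters a to z, which is a simplifying assumption, not how production libraries do it.

```python
import re
from collections import Counter

def preprocess(text):
    """Lowercase the text and keep only runs of the letters a-z as words."""
    return re.findall(r"[a-z]+", text.lower())

sentence = "I am learning Natural Language Processing!"
words = preprocess(sentence)   # punctuation such as "!" is dropped
counts = Counter(words)        # maps each word to how often it appears

print(words)
# ['i', 'am', 'learning', 'natural', 'language', 'processing']
print(counts["learning"])
# 1
```

The word counts are numbers, and numbers are something a machine learning model can work with.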


✂️ Tokenization

Tokenization is the process of breaking text into smaller units called tokens.

Example

Sentence:

"I love learning NLP"


After tokenization:

["I", "love", "learning", "NLP"]


Types of tokenization

  • Word tokenization
  • Sentence tokenization
  • Subword tokenization (used in transformers)
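
The first two types can be sketched with nothing more than Python's `re` module. The function names below are hypothetical and the rules are deliberately naive (for example, the sentence splitter would break on abbreviations like "Dr."). Real tokenizers handle many more edge cases, and subword tokenization needs a learned vocabulary, so it is not shown here.

```python
import re

def word_tokenize_simple(text):
    """Split text into words and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def sentence_tokenize_simple(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(word_tokenize_simple("I love learning NLP"))
# ['I', 'love', 'learning', 'NLP']
print(sentence_tokenize_simple("I love NLP. It is fun!"))
# ['I love NLP.', 'It is fun!']
```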

🛑 Stop Words

Stop words are commonly used words that usually don’t add much meaning to the text.

Examples:

is, am, are, the, a, an, in, on, and

Why remove stop words?

  • They add noise
  • They increase dimensionality
  • They often don’t help in tasks like classification
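
Here is a small sketch of stop word removal using only the example words listed above. The set `STOP_WORDS` and the function `remove_stop_words` are names made up for this illustration; in practice you would usually start from a longer ready-made list, such as the one in NLTK's `stopwords` corpus.

```python
# Tiny stop word list taken from the examples in this post (not a complete list)
STOP_WORDS = {"is", "am", "are", "the", "a", "an", "in", "on", "and"}

def remove_stop_words(tokens):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["I", "am", "learning", "NLP", "in", "the", "evening"]
filtered = remove_stop_words(tokens)
print(filtered)
# ['I', 'learning', 'NLP', 'evening']
```

Comparing in lowercase means "The" and "the" are treated the same way.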

🌿 Stemming

Stemming reduces words to their root form by removing suffixes.

  • Fast
  • Not always linguistically correct

Common stemming algorithms:

  1. PorterStemmer() : strips suffixes using fixed rules, without understanding context.
  2. SnowballStemmer() : an improved version of the Porter algorithm that also supports many languages.
  3. RegexpStemmer() : removes whatever part of the word matches the regular expression you give it.

from nltk.stem import PorterStemmer

words = ['eating', 'eaten', 'eat', 'write', 'writes', 'history', 'mysterious', 'mystery', 'finally', 'finalised', 'historical']

# Create the stemmer once and reuse it for every word
stemmer = PorterStemmer()
for word in words:
    print(word + "------>" + stemmer.stem(word))


OUTPUT:
eating------>eat
eaten------>eaten
eat------>eat
write------>write
writes------>write
history------>histori
mysterious------>mysteri
mystery------>mysteri
finally------>final
finalised------>finalis
historical------>histor

Stemming simply strips suffixes by rule, so the result is not always a meaningful word (for example, "histori" and "mysteri" above).
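
To see why rule-based stripping can go wrong, here is a toy stemmer in plain Python, similar in spirit to a regex-based stemmer. The suffix list and the `min_length` cutoff are assumptions chosen for this illustration; they are not the rules used by Porter or Snowball.

```python
import re

# Hypothetical suffix rule for illustration only: strip one of these endings
SUFFIX_PATTERN = re.compile(r"(ing|es|s|ed|ly)$")

def simple_stem(word, min_length=4):
    """Strip a known suffix from words that are at least min_length long."""
    if len(word) < min_length:
        return word  # leave very short words untouched
    return SUFFIX_PATTERN.sub("", word)

for word in ["eating", "finally", "writes", "eat"]:
    print(word + "------>" + simple_stem(word))
# eating------>eat
# finally------>final
# writes------>writ
# eat------>eat
```

Notice that "writes" becomes "writ", which is not a real word. The rule has no idea what the word means; it only looks at the ending.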


🍃 Lemmatization

Lemmatization converts words into their dictionary base form, called a lemma.
NLTK provides the WordNetLemmatizer class, which is a thin wrapper around the WordNet corpus.

import nltk
from nltk.stem import WordNetLemmatizer

# WordNet is a dictionary-like dataset that maps words to their base forms.
# It must be downloaded once before WordNetLemmatizer can be used.
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('going'))           # output: going (treated as a noun by default)
print(lemmatizer.lemmatize('going', pos='v'))  # output: go

# The pos argument tells the lemmatizer the part of speech (verb, noun, adjective, etc.),
# which helps it find the right base form.
# Part-of-speech tags: noun 'n', verb 'v', adverb 'r', adjective 'a'. The default is 'n'.

  • Uses grammar (the word's part of speech) to choose the base form
  • Produces meaningful words
  • More accurate but slower than stemming
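
If you are wondering why the part of speech matters so much, this toy lookup in plain Python may help. The table `LEMMA_TABLE` and the function `simple_lemmatize` are hand-made for this illustration and hold only four entries; a real lemmatizer relies on a full dictionary such as WordNet.

```python
# Tiny hand-made lookup table keyed by (word, part of speech); illustration only
LEMMA_TABLE = {
    ("going", "v"): "go",
    ("went", "v"): "go",
    ("better", "a"): "good",
    ("mice", "n"): "mouse",
}

def simple_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(simple_lemmatize("going"))           # going (no noun entry, so unchanged)
print(simple_lemmatize("going", pos="v"))  # go
print(simple_lemmatize("better", pos="a")) # good
```

The same word gives different answers depending on the tag, which is why passing the right `pos` value matters.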

⚖️ Stemming vs Lemmatization

| Feature | Stemming | Lemmatization |
| --- | --- | --- |
| Speed | Fast | Slower |
| Accuracy | Lower | Higher |
| Output | May not be a real word | A real dictionary word |
| Grammar-aware | ❌ | ✅ |

🧠 Final Thoughts

NLP is not magic; it's structured text processing combined with machine learning.
Which is your favorite concept in NLP?
Drop a comment down below!
