TF-IDF Vectorization

TF-IDF vectorization is a technique for turning natural-language text into numeric features for machine learning. It rests on the idea that words that are rare across a collection of documents tell us more about a document's contents than words that show up everywhere. The focus on rare words grew out of work with simpler word counters, which were made to ignore common words so they could concentrate on what makes documents different from one another.

TF-IDF is a combination of two separate metrics: term frequency and inverse document frequency. Term frequency is the easier of the two to understand. It is the number of times a word appears in a document divided by the total number of words in that document.

Inverse document frequency is a little more complicated: it is the log of the total number of documents in your dataset divided by the number of documents that contain the word. The vectorizer multiplies the two values together to get a score indicating how important a word is to a document, and repeats this process for every word in every document.
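To make the two definitions concrete, here is a minimal from-scratch sketch in plain Python. The three-sentence corpus and the function names are made up for illustration, and the code follows the textbook formulas described above, not any particular library's implementation.

```python
import math
from collections import Counter

# Toy corpus, invented for this example
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [doc.split() for doc in docs]


def term_frequency(term, tokens):
    # Times the term appears in the document / total words in the document
    return Counter(tokens)[term] / len(tokens)


def inverse_document_frequency(term, corpus):
    # log(total documents / documents containing the term)
    # Only valid for terms that occur somewhere in the corpus
    n_containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / n_containing)


def tf_idf(term, tokens, corpus):
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)


# Score every distinct word in the first document
scores = {term: tf_idf(term, tokenized[0], tokenized) for term in set(tokenized[0])}
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{term}: {score:.4f}")
```

Because "the" appears in all three documents, its inverse document frequency is log(3/3) = 0, so its score is zero no matter how often it is repeated. "mat" appears in only one document, so it gets the highest score for the first document.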

Short Intro to Using the 'TfidfVectorizer' in 'sklearn'

For this example, my data is loaded into a pandas DataFrame, with the column 'documents' holding the documents I am vectorizing.

Import Libraries

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

Import Data

data = pd.read_csv('path_to_csv_file.csv')

Initialize TfidfVectorizer

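stopwords_list = stopwords.words('english')  # assumes English text and an already-downloaded NLTK stopwords corpus
vectorizer = TfidfVectorizer(strip_accents='unicode', stop_words=stopwords_list)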

Compute Values

tf_idf = vectorizer.fit_transform(data['documents'])
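It helps to look at what 'fit_transform' actually returns. The sketch below uses a made-up three-document list in place of the CSV column and a default 'TfidfVectorizer'. With its default settings ('smooth_idf=True', 'norm="l2"'), scikit-learn uses a smoothed variant of the inverse document frequency, ln((1 + n) / (1 + df)) + 1, and scales each row to unit length, so the numbers will not match a by-hand calculation with the plain formula.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for data['documents'] (invented for this example)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

vectorizer = TfidfVectorizer()
tf_idf = vectorizer.fit_transform(docs)

# The result is a SciPy sparse matrix: one row per document, one column per term
print(type(tf_idf), tf_idf.shape)

# Number of documents containing each term (non-zero entries per column)
n_docs = tf_idf.shape[0]
df = tf_idf.getnnz(axis=0)

# Smoothed IDF used by scikit-learn's defaults
expected_idf = np.log((1 + n_docs) / (1 + df)) + 1
print(np.allclose(vectorizer.idf_, expected_idf))

# Each row is L2-normalized by default
row_norms = np.sqrt(np.asarray(tf_idf.multiply(tf_idf).sum(axis=1)).ravel())
print(row_norms)
```

The matrix stays sparse because most words do not appear in most documents. Calling 'toarray()' is fine for small datasets but can use a lot of memory on large ones.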

Create a New DataFrame with the Vectorized Words as Column Names

nlp_name = pd.DataFrame(tf_idf.toarray(), columns=vectorizer.get_feature_names_out())

DEV Community