NLP Terminology in 5 Minutes

#nlp #datascience #bigdata

Term	Meaning
Weights and Vectors
TF-IDF	Term Frequency-Inverse Document Frequency. A word's weight in a doc is higher the more often it appears in that doc and the less often it appears in the rest of the corpus (the other docs): a word that is special to our doc gets a higher weight in our doc.
length(TF-IDF, doc)	The number of distinct words in the doc: the vector holds one number (the weight) for each word.
Word Vectors	To compute word vectors: for each word w1 and each word w2 within a 5-word window around it, move v[w1] and v[w2] a little closer together. The result captures relations such as king - queen ~ man - woman, and it finds that for you! You can even download ready-made word vectors.
Google Word Vectors	Ready-made word vectors trained by Google, available for download.
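The TF-IDF rows above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name `tf_idf`, the whitespace tokenizer, the three toy documents, and the exact formula (relative term frequency times the log of inverse document frequency, one of several common variants) are all assumptions made for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document.

    tf  = count of the word in the doc / number of words in the doc
    idf = log(number of docs / number of docs containing the word)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many docs does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

# Toy corpus, made up for this sketch.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
weights = tf_idf(docs)

# "the" appears in every doc, so its weight is zero;
# "mat" appears only in the first doc, so it gets the highest weight there.
print(weights[0]["the"], weights[0]["mat"])
# One weight per distinct word in the doc.
print(len(weights[0]))
```

Each dict has one entry per distinct word in its doc, which matches the length(TF-IDF, doc) row.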
Text Structure
Part-Of-Speech Tagging	Word roles: is it a verb, a noun, …? It's not always obvious.
Head of sentence	head(sentence) is the most important word. It is not necessarily the first word; it is the root of the sentence: she hit the wall => hit. You build a graph (tree) for the sentence and the head is its root.
Named entities	People, companies, locations, …: a quick way to know what a text is about.
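To make the "Head of sentence" row concrete, here is a small sketch that finds the root of a dependency tree. The parse of "she hit the wall" is written by hand for illustration (it is not the output of any parser), and the `(word, head_index)` representation and the name `head_of_sentence` are assumptions of this example.

```python
# Hand-written dependency parse of "she hit the wall".
# Each token stores the index of its head; the root has no head (None).
sentence = [
    ("she", 1),     # subject of "hit"
    ("hit", None),  # root of the sentence
    ("the", 3),     # determiner of "wall"
    ("wall", 1),    # object of "hit"
]

def head_of_sentence(parse):
    """Return the word whose head is None, i.e. the root of the tree."""
    roots = [word for word, head in parse if head is None]
    if len(roots) != 1:
        raise ValueError("expected exactly one root")
    return roots[0]

print(head_of_sentence(sentence))  # hit
```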
Sentiment Analysis
Sentiment Dictionary	love: +2.9, hated: -3.2. "I loved you but now I hate you" => 2.9 - 3.2 = -0.3
Sentiment Entities	Which entity is the sentiment about: the movie, or the cinema?
Sentiment Features	Sentiment per product feature, e.g. Camera/Resolution, Camera/Convenience
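The Sentiment Dictionary row can be sketched as a lookup and a sum. The two scores come from the table; giving every inflected form (love/loved, hate/hated) the same score, the regex tokenizer, and the name `sentiment_score` are assumptions made for this example.

```python
import re

# Scores from the table; listing both inflected forms with the same
# score is an assumption of this sketch (a real system might stem instead).
SENTIMENT = {
    "love": 2.9, "loved": 2.9,
    "hate": -3.2, "hated": -3.2,
}

def sentiment_score(text):
    """Sum the dictionary scores of all words; unknown words count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(SENTIMENT.get(word, 0.0) for word in words)

score = sentiment_score("I loved you but now I hate you")
print(round(score, 1))  # -0.3
```

A plain sum ignores word order and negation ("not love"), which is one reason the table goes on to sentiment per entity and per feature.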
Text Classification	Decisions, decisions: what's the topic? Is the writer happy? A native English speaker? Mostly supervised training: we have labeled examples, then map new text to labels.
Supervised Learning	We have 3 sets: Train Set, Dev Set, Test Set.
Train Set	Fit the model
Dev(=Validation) Set	Tune the model's parameters (also helps prevent overfitting)
Test Set	Final check of your model on data it has not seen
Text Features	Convert the documents to be classified into features: bag of words or word vectors; can use TF-IDF weights.
LDA	Latent Dirichlet Allocation: LDA(Documents) => Topics. Technology topic: Scala, Programming, Machine Learning. Sport topic: Football, Basketball, Skateboards (the 3 most important words of each). Pick the number of topics ahead of time, e.g. 5. Doc = Distribution(topics): a probability for each topic. Topic = Distribution(words): the technology topic gives a higher probability to the word cpu. Unsupervised: it finds what topic patterns are there. Good for getting a sense of what a doc is about.
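The two distributions in the LDA row (Doc = Distribution(topics), Topic = Distribution(words)) combine as shown in the sketch below. It does not fit an LDA model; it only shows how an already fitted model would be read. All probabilities are made up for illustration, and the names `topics`, `doc_topics`, `top_words`, and `word_probability` are assumptions of this example.

```python
# Topic = Distribution(words). Made-up numbers; each topic sums to 1.
topics = {
    "technology": {"scala": 0.35, "programming": 0.30, "cpu": 0.25,
                   "football": 0.04, "basketball": 0.03, "skateboards": 0.03},
    "sport":      {"football": 0.40, "basketball": 0.30, "skateboards": 0.20,
                   "scala": 0.04, "programming": 0.03, "cpu": 0.03},
}

# Doc = Distribution(topics). Made-up numbers for one document.
doc_topics = {"technology": 0.8, "sport": 0.2}

def top_words(topic, n=3):
    """The n most probable words of a topic, most probable first."""
    words = topics[topic]
    return sorted(words, key=words.get, reverse=True)[:n]

def word_probability(word, doc_topics, topics):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(p_topic * topics[topic].get(word, 0.0)
               for topic, p_topic in doc_topics.items())

print(top_words("technology"))  # ['scala', 'programming', 'cpu']
print(top_words("sport"))       # ['football', 'basketball', 'skateboards']
# A mostly-technology doc makes "cpu" more likely than "football".
print(word_probability("cpu", doc_topics, topics) >
      word_probability("football", doc_topics, topics))  # True
```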
Machine Reading
Entity Extraction	EntityRecognition(text) => (EntityName -> EntityType). EntityRecognition("paul newman is a great actor") => [(PaulNewman -> Person)]
Entity Linking	EntityLinking(Entity) => FixedMeaning. EntityLinking("PaulNewman") => "http://wikipedia../paul_newman_the_actor" (and not the other Paul Newman, decided based on the text)
DBpedia	A database built from Wikipedia, so machines can read it. Query DBpedia with SPARQL.
FRED (lib) / Pikes	FRED(natural-language) => formal-structure
Resources	https://www.youtube.com/watch?v=FcOH_2UxwRg https://tinyurl.com/word-vectors


DEV Community
