DEV Community

Tomer Ben David

NLP Terminology in 5 Minutes

Term Meaning
Weights and Vectors
TF-IDF A word's weight is higher the more often it appears in our doc and the less often it appears in the rest of the corpus (the other docs). A word that is special to our doc gets a higher weight in our doc!
TF-IDF is an abbreviation for Term Frequency-Inverse Document Frequency.
length(TF-IDF, doc) Number of distinct words in the doc; the vector holds one number (the weight) per word.
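
To make the TF-IDF rows concrete, here is a minimal sketch in plain Python. The three-document corpus is made up, and the formula used (`tf * log(N / df)`) is only one common variant; libraries often add smoothing, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration): each doc is a list of lowercase tokens.
docs = {
    "doc1": "the cat sat on the mat".split(),
    "doc2": "the dog sat on the log".split(),
    "doc3": "the cat chased the dog".split(),
}

def tf_idf(doc_tokens, corpus):
    """Return {word: weight} for one doc, using tf * log(N / df)."""
    n_docs = len(corpus)
    counts = Counter(doc_tokens)
    weights = {}
    for word, count in counts.items():
        tf = count / len(doc_tokens)                        # frequency inside this doc
        df = sum(1 for tokens in corpus if word in tokens)  # number of docs containing the word
        weights[word] = tf * math.log(n_docs / df)
    return weights

weights = tf_idf(docs["doc1"], list(docs.values()))
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:4s} {weight:.3f}")
```

In this toy corpus "mat" only appears in doc1, so it gets the highest weight, while "the" appears in every doc and ends up with weight 0. The result has one entry per distinct word in the doc.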
Word Vectors Calculate word vectors:
for each word w1 => for each word w2 within a 5-word window, move the vectors a little closer together,
v[w1] closer to v[w2]
king - queen ~ man - woman // wow, it will find that for you!
You can even download ready-made word vectors.
Google Word Vectors You can download ready-made word vectors trained by Google.
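
The window idea can be sketched in a few lines of Python. This is only a picture of the loop, not a real training algorithm: `context_pairs` and `nudge_closer` are hypothetical helper names, and actual word-vector training uses a learned objective (and also pushes unrelated words apart) rather than a plain step toward the neighbor.

```python
def context_pairs(tokens, window=5):
    """Yield (w1, w2) for every w2 within `window` positions of w1."""
    for i, w1 in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield w1, tokens[j]

def nudge_closer(v1, v2, rate=0.1):
    """Move v1 a small step toward v2 (a toy stand-in for a real training update)."""
    return [a + rate * (b - a) for a, b in zip(v1, v2)]

tokens = "the king spoke to the queen".split()
pairs = list(context_pairs(tokens, window=2))
print(pairs[:4])
print(nudge_closer([0.0, 0.0], [1.0, 2.0]))
```

A window of 2 is used here only because the example sentence is short. Words that keep showing up in the same windows get nudged toward each other again and again, which is how similar words end up with similar vectors.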
Text Structure
Part-Of-Speech Tagging Word roles: is it a verb, a noun, …? It's not always obvious.
Head of sentence head(sentence): the most important word. It's not necessarily the first
word; it's the root of the sentence.
she hit the wall => hit
You build a graph for the sentence, and the head is its root.
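
As a small illustration of "the head is the root of the graph", the sketch below uses a hand-written parse of the example sentence. The `parents` table is an assumption typed in by hand; a real parser would produce it for you.

```python
# Hand-written parse of "she hit the wall" (an assumption, not parser output).
# Each entry maps a word to the word it depends on; the root has no parent.
parents = {
    "she": "hit",    # subject of "hit"
    "hit": None,     # root of the sentence
    "the": "wall",   # determiner of "wall"
    "wall": "hit",   # object of "hit"
}

def head_of_sentence(parents):
    """Return the single word that has no parent, i.e. the root of the graph."""
    roots = [word for word, parent in parents.items() if parent is None]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {roots}")
    return roots[0]

print(head_of_sentence(parents))
```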
Named entities People, companies, locations, …: a quick way to know what the text is about.
Sentiment Analysis
Sentiment Dictionary love: +2.9, hate: -3.2, "I loved you but now I hate you" => 2.9 - 3.2 = -0.3
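
A dictionary-based score is just a sum of lookups. The sketch below is a minimal version: the scores follow the Sentiment Dictionary entry above, and giving inflected forms (love/loved, hate/hated) the same score is an assumption made for the example.

```python
import re

# Scores follow the Sentiment Dictionary entry; treating inflected forms
# (love/loved, hate/hated) as having the same score is an assumption.
LEXICON = {"love": 2.9, "loved": 2.9, "hate": -3.2, "hated": -3.2}

def sentiment_score(text, lexicon=LEXICON):
    """Sum the dictionary scores of all known words; unknown words count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0.0) for word in words)

score = sentiment_score("I loved you but now I hate you")
print(round(score, 1))
```

Note that a plain sum ignores word order and negation ("not love" still scores +2.9), which is one reason the later rows look at entities and features.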
Sentiment Entities Is the sentiment about the movie or about the cinema itself?
Sentiment Features Camera/Resolution, Camera/Convenience
Text Classification Decisions, decisions: what's the topic, is the author happy, is the author a native English speaker?
Mostly supervised training: we have labeled examples, then we map new text to those labels.
Supervised Learning We have 3 sets: Train Set, Dev Set, Test Set.
Train Set Fit the model on labeled examples
Dev (= Validation) Set Tune the model's parameters (this also helps prevent overfitting)
Test Set Check your final model on data it has not seen
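
One simple way to get the three sets is to shuffle once and cut. The sketch below assumes an 80/10/10 split and a hypothetical list of `(text, label)` pairs; both are choices made for illustration, not a rule.

```python
import random

def split_dataset(examples, train=0.8, dev=0.1, seed=0):
    """Shuffle once, then cut into train / dev / test (test gets the remainder)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_dev = int(len(shuffled) * dev)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])

# Hypothetical labeled examples: (text, label) pairs.
examples = [(f"text {i}", i % 2) for i in range(100)]
train_set, dev_set, test_set = split_dataset(examples)
print(len(train_set), len(dev_set), len(test_set))
```

The fixed seed keeps the split reproducible, so the test set stays untouched while you tune on the dev set.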
Text Features Convert the documents to be classified into features:
bags of words or word vectors; can use TF-IDF weights
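
A bag of words is simply a count per vocabulary word, in a fixed order. A minimal sketch, with a hypothetical four-word vocabulary (in practice the vocabulary would be collected from the training set):

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text, in a fixed order."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary, chosen only for this example.
vocabulary = ["scala", "football", "great", "code"]
features = bag_of_words("Scala code is great code", vocabulary)
print(features)  # [1, 0, 1, 2]
```

Words outside the vocabulary ("is" here) are dropped, and word order is lost, which is why it is called a bag.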
LDA Latent Dirichlet Allocation: LDA(Documents) => Topics
Technology Topic: Scala, Programming, Machine Learning
Sport Topic: Football, Basketball, Skateboards (the 3 most important words)
Pick the number of topics ahead of time, e.g. 5 topics
Doc = Distribution(topics): a probability for each topic
Topic = Distribution(words): the technology topic gives a higher probability to the word "cpu"
Unsupervised: it finds what topic patterns are there. Good for getting a sense of what the doc is about.
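
The two distributions can be pictured as two small tables. The sketch below does not train anything: all the numbers are made up, and it only shows how "Doc = Distribution(topics)" and "Topic = Distribution(words)" combine into the probability of seeing a word in a doc.

```python
# Made-up numbers for illustration; a real LDA run would learn both tables
# from a collection of documents.
topic_words = {
    "technology": {"cpu": 0.5, "scala": 0.4, "football": 0.1},
    "sport":      {"cpu": 0.05, "scala": 0.15, "football": 0.8},
}
doc_topics = {"technology": 0.75, "sport": 0.25}

def word_probability(word, doc_topics, topic_words):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(p_topic * topic_words[topic].get(word, 0.0)
               for topic, p_topic in doc_topics.items())

def top_words(topic, topic_words, n=3):
    """The n most probable words of a topic."""
    dist = topic_words[topic]
    return sorted(dist, key=dist.get, reverse=True)[:n]

print(top_words("sport", topic_words, n=2))
print(word_probability("cpu", doc_topics, topic_words))
```

Listing the top words of each topic is usually how you decide what to call it ("technology", "sport"), since LDA itself only gives the topics numbers.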
Machine Reading
Entity Extraction EntityRecognition(text) => (EntityName -> EntityType)
("paul newman is a great actor") => [(PaulNewman -> Person)]
Entity Linking EntityLinking(Entity) => FixedMeaning
EntityLinking("PaulNewman") => "http://wikipedia../paul_newman_the_actor"
(and not the other Paul Newman; the choice is made based on the text)
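
Here is a toy sketch of both steps. Everything in `KNOWLEDGE_BASE` is hypothetical (the ids, the second entry, and the context words), and real systems use trained models rather than a dictionary lookup, but the shape of the problem is the same: find the name, then use the surrounding text to pick which meaning it has.

```python
# Hypothetical knowledge base: two entries share a name and differ by context words.
KNOWLEDGE_BASE = {
    "paul newman": [
        {"id": "paul_newman_the_actor", "type": "Person",
         "context": {"actor", "film", "movie", "oscar"}},
        {"id": "paul_newman_the_other", "type": "Person",
         "context": {"band", "music", "guitar"}},
    ],
}

def recognize(text, kb=KNOWLEDGE_BASE):
    """Return the known entity names that occur in the text (dictionary lookup)."""
    lowered = text.lower()
    return [name for name in kb if name in lowered]

def link(name, text, kb=KNOWLEDGE_BASE):
    """Pick the candidate whose context words overlap most with the text."""
    words = set(text.lower().split())
    return max(kb[name], key=lambda cand: len(cand["context"] & words))["id"]

text = "paul newman is a great actor"
for name in recognize(text):
    print(name, "->", link(name, text))
```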
DBpedia A database built from Wikipedia; machines can read it because it is a DB. Query DBpedia with SPARQL.
FRED (lib) / PIKES FRED(natural-language) => formal-structure
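
A SPARQL query against DBpedia can look roughly like the sketch below. It only builds the request URL and does not send it. The resource name `dbr:Paul_Newman` and the `format` parameter are assumptions for this example, so check them against the endpoint before relying on them.

```python
from urllib.parse import urlencode

ENDPOINT = "https://dbpedia.org/sparql"  # DBpedia's public SPARQL endpoint

# The resource name dbr:Paul_Newman is an assumption for this example.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?abstract WHERE {
  dbr:Paul_Newman dbo:abstract ?abstract .
  FILTER (lang(?abstract) = "en")
}
"""

def build_request_url(query, endpoint=ENDPOINT):
    """Encode the query as a GET URL asking for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{endpoint}?{urlencode(params)}"

url = build_request_url(QUERY)
print(url)
```

Fetching that URL (for example with `urllib.request.urlopen`) should return the matching rows as JSON, which is what makes the data easy for machines to read.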
Resources https://www.youtube.com/watch?v=FcOH_2UxwRg
https://tinyurl.com/word-vectors
