|Weights and Vectors|
|TF-IDF||A word's weight is higher the more often it appears in our doc and the less often it appears in the rest of the corpus (the other docs). A word that is special to our doc gets a higher weight in our doc!
TF-IDF stands for Term Frequency-Inverse Document Frequency
|length(TF-IDF, doc)||Number of distinct words in the doc; the vector holds one number per word.|
|Word Vectors||Calculating word vectors:
for each word w1 => for each word w2 in a 5-word window around it, nudge the two vectors
closer together: v[w1] closer to v[w2]
king - queen ~ man - woman // wow, it will find that for you!
You can even download ready-made word vectors
|Google Word Vectors||You can download ready-made word vectors trained by Google|
|Part-Of-Speech Tagging||Word roles: is it a verb, a noun, …? It's not always obvious|
|Head of sentence||head(sentence) is the most important word. It's not necessarily the first
word; it's the root of the sentence.
she hit the wall => hit
You build a graph for the sentence, and the head becomes its root.
|Named entities||People, Companies, Locations, …, quick way to know what text is about.|
|Sentiment Dictionary||love: +2.9, hated: -3.2, "I loved you but now I hate you" => 2.9 - 3.2 = -0.3|
|Sentiment Entities||Is it about the movie or about the cinema place?|
|Sentiment Features||Camera/Resolution, Camera/Convenience|
|Text Classification||Decisions, decisions: what's the topic, is the writer happy, is the writer a native English speaker?
Mostly supervised training: we have labeled examples, then map new text to labels
|Supervised Learning||We have 3 sets: Train Set, Dev Set, Test Set.|
|Dev(=Validation) Set||Used to tune the model's parameters (and to help prevent overfitting)|
|Test Set||Final check of your model on data it has not seen|
|Text Features||Convert the documents to be classified into features:
bag of words, word vectors; can use TF-IDF
|LDA||Latent Dirichlet Allocation: LDA(Documents) => Topics
Technology Topic: Scala, Programming, Machine Learning
Sport Topic: Football, Basketball, Skateboards (3 most important words)
Pick the number of topics ahead of time, e.g. 5 topics
Doc = Distribution(topics): a probability for each topic
Topic = Distribution(words): e.g. the technology topic gives a higher probability to the word cpu
Unsupervised: it finds what topic patterns are there. Good for getting a sense of what the doc is about.
|Entity Extraction||EntityRecognition(text) => (EntityName -> EntityType)
("paul newman is a great actor") => [(PaulNewman -> Person)]
|Entity Linking||EntityLinking(Entity) => FixedMeaning
EntityLinking("PaulNewman") => "http://wikipedia../paul_newman_the_actor"
(and not the other Paul Newman; the choice is based on the surrounding text)
|DBpedia||A database built from Wikipedia; machines can read it because it's a DB. Query DBpedia with SPARQL|
|FRED (lib) / Pikes||FRED(natural-language) => formal-structure|
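
The TF-IDF rows at the top of this table can be sketched in a few lines of Python. This is only a minimal illustration, assuming whitespace tokenization and the plain `log(N / df)` form of inverse document frequency (libraries often use smoothed variants). The function name `tf_idf` and the sample docs are made up for this example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document (hypothetical helper)."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer here.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many docs does each word occur at least once?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # term frequency in this doc * inverse document frequency in the corpus
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

docs = [
    "scala is a programming language",
    "football is a sport",
    "basketball is a sport too",
]
weights = tf_idf(docs)
# One number per distinct word in each doc; compare a rare and a common word.
print(len(weights[0]), weights[0]["scala"], weights[0]["is"])
```

Words that occur in every doc ("is", "a") end up with weight 0, while a word that occurs only in one doc ("scala") gets the highest weight there, which matches the intuition in the TF-IDF row.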