Generally writing this for my own benefit as I'm diving into Natural Language Processing (NLP) for a current project. In the era of AI, folks throw around the term "model" and my mind (even as a certified math person™) replaces that with <vague mathy, computer-sciencey magic thingamajig>.
But I wanted to understand it a little more and did a little digging. My current understanding can be narrowed down to:
- a set of training data (examples for the model to learn from)
- a set of features (things about the data - like "is capitalized")
- a set of weights (numbers between 0 and 1) for each feature
- a loop where the program makes a guess, changes the weights, and tries again - millions of times until it gets it right enough that it's worthwhile to keep around
Concretely, if you were implementing NLP you might have categories that label a word as a person, an organization, or a location.
So you'd get some basic features like the ones below:
```js
word_features = {
  "is_capitalized": true,
  "previous_word": "new",
  "next_word": "announced",
  "is_followed_by_Inc": true,
}
```
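Where do those features come from? Some code just looks at the word and its neighbors and writes down what it sees. Here's a rough sketch of what that could look like (the function name and the exact checks are things I made up for illustration; real NLP tooling uses many more, and much cleverer, features):

```js
// Sketch only: a made-up feature extractor for the word at position i.
function extractWordFeatures(words, i) {
  const next = words[i + 1];
  return {
    is_capitalized: /^[A-Z]/.test(words[i]),
    previous_word: i > 0 ? words[i - 1].toLowerCase() : "<start>",
    next_word: next ? next.toLowerCase() : "<end>",
    is_followed_by_Inc: next === "Inc" || next === "Inc.",
  };
}

// e.g. extractWordFeatures(["Acme", "Inc.", "just", "announced", "layoffs"], 0)
// -> { is_capitalized: true, previous_word: "<start>",
//      next_word: "inc.", is_followed_by_Inc: true }
```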
And you might start off with random weights and then, through the loops (this is the "training" part of creating a model), they'd eventually settle into something like this:
```js
weights = {
  "is_capitalized": {
    "ORG": 0.8,    // High, most organizations are capitalized
    "PERSON": 0.7, // ...same for person names
    "LOC": 0.6     // Somewhat high for locations, since some are capitalized and some aren't ("school" vs. "Fred Meyer")
  },
  "previous_word": {
    "new": {
      "ORG": 0.5,  // etc... for the rest of the categories and features
    },
  },
}
```
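For the curious, here's a very rough sketch of what that guess-and-nudge loop could look like. It's one classic style of update (perceptron-ish), and every name, number, and training example below is invented just to show the shape of the idea, not how a real NLP library does it:

```js
// Sketch only: a tiny, perceptron-style training loop.
// To keep the weights table flat, each feature is a string like
// "is_capitalized=true", and weights[feature][category] is a number.
// (Unlike the 0-to-1 weights above, these aren't clamped; real setups vary.)

const CATEGORIES = ["ORG", "PERSON", "LOC"];

// Made-up labeled examples: the features a word had, plus the right answer.
const trainingData = [
  { features: ["is_capitalized=true", "is_followed_by_Inc=true"], label: "ORG" },
  { features: ["is_capitalized=true", "previous_word=mr"], label: "PERSON" },
  { features: ["is_capitalized=false", "previous_word=the"], label: "LOC" },
];

const weights = {}; // start empty; random starting values would also work

// Add up the weights of the features that are present, for one category.
function score(features, category) {
  let total = 0;
  for (const f of features) {
    total += (weights[f] && weights[f][category]) || 0;
  }
  return total;
}

// The model's "guess": whichever category scores highest right now.
function guess(features) {
  return CATEGORIES.reduce((best, c) =>
    score(features, c) > score(features, best) ? c : best
  );
}

const LEARNING_RATE = 0.1;

// The loop: guess, compare to the right answer, nudge the weights, repeat.
for (let pass = 0; pass < 1000; pass++) {
  for (const { features, label } of trainingData) {
    const predicted = guess(features);
    if (predicted === label) continue; // already right, leave the weights alone
    for (const f of features) {
      weights[f] = weights[f] || { ORG: 0, PERSON: 0, LOC: 0 };
      weights[f][label] += LEARNING_RATE;     // reward the correct category
      weights[f][predicted] -= LEARNING_RATE; // punish the wrong guess
    }
  }
}
```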
Then, of course, there's some probability mathy mathness in there that looks at all the weights across all the features and decides which category is most probable.
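From what I can tell, that's often a "softmax": add up the weights for each category, squash the totals into probabilities that sum to 1, and pick the biggest. A toy version, with totals I made up, might look like this:

```js
// Sketch only: turn summed weight totals into probabilities with a softmax.
const totals = { ORG: 1.3, PERSON: 0.7, LOC: 0.6 }; // made-up summed scores

function softmax(scores) {
  const exps = Object.fromEntries(
    Object.entries(scores).map(([cat, s]) => [cat, Math.exp(s)])
  );
  const sum = Object.values(exps).reduce((a, b) => a + b, 0);
  return Object.fromEntries(
    Object.entries(exps).map(([cat, e]) => [cat, e / sum])
  );
}

const probabilities = softmax(totals);
// -> roughly { ORG: 0.49, PERSON: 0.27, LOC: 0.24 }

const winner = Object.entries(probabilities)
  .sort((a, b) => b[1] - a[1])[0][0]; // "ORG"
```

Different kinds of models (logistic regression, neural networks, etc.) handle this step differently, but the flavor is the same: scores in, probabilities out, highest one wins.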
Makes me think of those personality tests I took in middle school: "if you answered mostly C, you're a sporty tortoise"! Though I suspect it's more complicated than that.
Happy to hear corrections, as long as they're kind and not jargon-y.