⚠️ This chapter is a LOT more theoretical than the others but set up the theoretical foundations for machine learning - skip or persevere, your choice but be warned
We've learned about various machine learning algorithms - but what do we know about evaluating algorithms and the theory of learning itself? Here, we want to:
- define learning problems
- show whether specific algorithms are effective/ineffective with regard to the problem
- show some problems are fundamentally hard (aka unsolvable)
What's the Big O of machine learning?
We know about Big O of regular algorithms from Complexity Theory, to estimate how resource consumption scales with the problem size (e.g. for space and time). With ML, those constraints are also true, but the new limited resource is data.
Much like regular algorithms have a Big(O), ML algorithms also have Big(O)s for time and space, but also data consumption.
When we evaluate algorithms, we might want to know things like:
- probability of successful training (
1 - delta)
- number of examples to train on (
- complexity of hypothesis class (
complexity of H)
- Accuracy we are approximating target concept (
- presenting training examples (big batch, or incremental/iterative)
- selecting training examples (explained below)
Having control over the questions being asked (and therefore the examples to train on) can vary data consumption wildly. For example:
- if the person who knows the answer controls the questions (and therefore examples), they can ask very specific questions to get you to the right concept based on limited examples
- if the learner controls the questions, then they can design questions (and therefore examples) to eliminate as much of the hypothesis space as possible each time (like in a decision tree
- However if the data is controlled by nature (aka randomness, or sampling) then a lot more examples (how many?) will be needed to infer the right concept
- You can also consider an adversarial teacher that deliberately shows you real but unrepresentative examples to mislead you. FAR more data would be required in this case.
In practice, most training examples are randomly distributed. But it is worth considering how efficiently we can learn from other distributions.
An additional consideration is balance. We learn most from positive examples ("Absence of evidence is not evidence of absence"), so if those are few and far between then we will need a LOT of examples to learn the right concept (aka falsify wrong hypotheses).
Instead of asking how accurate our learning is to the "real" underlying concept, we can flip the script and try to set an upper bound on the mistakes we make while learning. This is a fascinating idea for linear binary classification models with no noise.
For example, let's say we have a possibility space of K bits (
0). So 10 bits have 1024 (
2 ^10) possibilities. We start by extreme overfitting to the first example we see (so that the first example we saw was the ONLY possible result from our current hypothesis). Then we continually relax the conditions (widening the hypothesis) as we come across more examples. Because each bit only has three possible states (positive, negative, and absent), new examples which conflict with our hypothesis must not matter to the answer, and so we can switch them to
In this way we guarantee that our learning will never make more than
K+1 mistakes, for a
2^K possibility space. A cute trick, but it doesn't quite put upper bounds on how many examples we still need, which is the original question.
We now have fleshed out some competing measures of "big O for Machine Learning":
- Computational Complexity: How much computational effort is needed for a learner to converge? (closest to the classical "big O")
- Sample Complexity: How many training examples in a batch are needed for a learner to create a successful hypothesis?
- Mistake bounds: How many mistakes can a learner make over an infinite run?
For practical purposes we will ultimately just focus on Sample Complexity, but it is worth considering these other forms.
- The hypothesis space
His the set of all possible hypotheses we are willing to consider
- The True hypothesis
cis the "real" one (aka the "target concept")
- The candidate hypothesis
his what we're currently considering
- The training set
Sis example data (for example uniformly randomly drawn from the population)
- A consistent learner is one that produces a hypothesis that is consistent with the data its seen so far
- The version space is the set of all hypotheses that are consistent with the data.
Basically, its "what we haven't ruled out yet". Silly to define, but worth having a term for.
- The Training error is the fraction of training examples misclassified by
h. The true hypothesis would have a training error of 0.
- The True error is the fraction of examples that would be misclassified on sample data.
C: Concept Class
H: Hypothesis space
n: size of hypothesis space
D: distribution over inputs
episilon: error goal (less than or equal to 0.5)
delta: certainty goal (with probability
1-delta, the algorithm will produce a true error less than or equal to
Because we are always learning from samples, we can never reduce our uncertainty to zero, nor our error, and thus our goal is better phrased as looking for a Probably Approximately Correct hypothesis.
Cis said to be
Hif and only if
Lwill, with probability
1-delta, output a
error(h) < epsilonin time and samples polynomial in
1/epsilon, 1/delta, and n.
A version space is epsilon exhausted iff all remaining hypothesis have low error. Then you can choose any of the hypothesis and be acceptably fine according to your chosen error rate.
In other words, it is Probably Approximately Correct.
This is a theorem for bounding the true error as a function of the number of examples that are drawn. We'll leave the derivation to a video if you're interested:
The TL;DR conclusion is that the upper bound for a version space not being epsilon exhausted after
m samples is:
m >= 1/epsilon * (ln n + ln (1/delta))
So we can work out, for example, that for an input space of 10 bits (1024 possibilities), target epsilon of 0.1, delta of 0.2, we will need at least
m = 1/10 * (ln 10 + ln (1/0.2)) ~= 40 training examples, or less than 4% of the possible space.
The theorem also shows what levers to pull and the consequences of those levers. For example, to reduce epsilon down to 1%, we'd then need 400 examples, or 40% of the total possible space.
We are now able to place sample complexity bounds for PAC learning with this equation.
Further notes on this topic:
Hopefully that was a good introduction to Computational Learning Theory. I am planning more primers and would love your feedback and questions on:
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
- Markov Decision Processes - week of Mar 25
- "True" RL - week of Apr 1
- Game Theory - week of Apr 15