Yesterday, I read about the model of accuracy, the difference between variance (how much the estimated f will change given a different data set) and bias (inaccuracy that stems from assuming a certain shape of the true f), how they contribute to the test MSE, whose lower bound is the irreducible error term.
The discussion so far has been on the regression setting, but this actually applies to the classification too. The difference is that instead of using MSE, which doesn’t make sense since the responses are not numeric, we’re using something called an error rate:
That is, out of n observations, how many are incorrectly classified? Of course, just as with the regression case, there’s a difference between training error rate and test error rate. We want to have a classifier that works well against the observations we don’t use for training.
Unlike the regression case though, there is one classifier that apparently will minimize the test error rate on average: the Bayes Classifier. Just like its name suggests, it assign a test observation with predictor vector x0 to j, where the conditional probability:
is the largest.
This classifier produces the lowest possible error rate, which is called the Bayes error rate.
K-Nearest Neighbors (KNN)
The Bayes classifier is good in theory, but of course in the real world we don’t know the conditional of Y being j given that X is x0. So instead of being used in real life, the Bayes classifier is the theoretical gold standard that real world methods try to get to.
The obvious approach is then to estimate the conditional probability and go from there. The KNN method does just that, it estimates the conditional probability using the following:
In other words, “the conditional probability of Y being j, is given by how many out of the nearest K neighbours are in category j”. So if K = 3, and all 3 nearest neighbours are of category j, then we can conclude that there’s 100% probability that the response of x0 also falls under category j.
KNN is simple but surprisingly can produce classifiers that are pretty close to the theoretical Bayes classifier. The choice of K is analogous to the smoothness parameter for the spline approach. When K = 1, the classifier is very flexible, but it overfits. When K is too large, it becomes too coarse — the analog of assuming that the f is a linear function.
Like its regression counterpart, the choice lies somewhere in the middle as well.