(This post is part of my 20 Minutes of Reading Every Day project. My current book of choice is An Introduction to Statistical Learning, and today I’m continuing with Chapter 2: Statistical Learning.)
Regression vs Classification
Variables (both predictors and response) can be categorized as either quantitative or qualitative.
Quantitative means they take on numerical values, for example, someone’s height, age, or income. Qualitative means that they take on values in one of several classes or categories, for example, gender, brand, the answer to a yes/no question, cancer type, and so on.
Problems with a quantitative response are referred to as regression problems, and those with a qualitative response are classification problems. Note that it’s the response that determines whether it’s a regression or a classification problem, not the predictors! Qualitative predictors can be coded into quantitative ones before analysis.
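The post doesn’t say how the coding is done, so here is just one common option as a rough sketch: indicator (dummy) variables. The function name `dummy_code` and the brand names are made up for illustration.

```python
def dummy_code(values, baseline=None):
    """Turn a list of category labels into 0/1 indicator columns.

    One level (the baseline) gets no column of its own; it is represented
    by all the other indicators being 0.
    """
    levels = sorted(set(values))
    if baseline is None:
        baseline = levels[0]
    columns = [lvl for lvl in levels if lvl != baseline]
    rows = [[1 if v == lvl else 0 for lvl in columns] for v in values]
    return columns, rows


brands = ["acme", "zenith", "acme", "orbit"]  # made-up qualitative predictor
columns, rows = dummy_code(brands)
print(columns)  # ['orbit', 'zenith']
for brand, row in zip(brands, rows):
    print(brand, row)
```

With three brands we only need two columns, because "acme" is the case where both indicators are 0.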
As usual, the distinction between the two is not that crisp. Some methods, such as K-nearest neighbours, can be used with either a quantitative or a qualitative response.
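To see why K-nearest neighbours works for both, here is a minimal sketch (one-dimensional inputs, toy data I made up): the neighbour lookup is identical, and only the last step differs. Average the neighbours for a quantitative response, take a majority vote for a qualitative one.

```python
from collections import Counter


def k_nearest(train_x, query, k):
    """Indices of the k training points closest to `query` (1-D inputs)."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return order[:k]


def knn_regress(train_x, train_y, query, k):
    """Quantitative response: average the neighbours' values."""
    idx = k_nearest(train_x, query, k)
    return sum(train_y[i] for i in idx) / k


def knn_classify(train_x, train_labels, query, k):
    """Qualitative response: majority vote among the neighbours."""
    idx = k_nearest(train_x, query, k)
    return Counter(train_labels[i] for i in idx).most_common(1)[0][0]


# Toy data, made up for illustration.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
heights = [150.0, 152.0, 154.0, 180.0, 182.0, 184.0]          # quantitative
groups = ["short", "short", "short", "tall", "tall", "tall"]  # qualitative

print(knn_regress(x, heights, 2.2, k=3))   # 152.0
print(knn_classify(x, groups, 10.5, k=3))  # tall
```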
Now, for something more interesting.
Assessing Model Accuracy
How do we know how good our models are? In my first post about this book, I mentioned that there’s no single best method in statistical learning. On a particular data set, one method may work better than the others, while on another data set a different method may win.
But then… how do we know that? That is, how do we know that method A is better than method B for this particular data set, really? This section is about answering that question, which is more complicated than it looks. First of all, even after we decide on a way to measure how “good” a method is, the data set against which we’re measuring it also matters.
If we measure a method against the same training data that was used to fit it, then we know what happens: the more flexible the method is, the lower its error on that data set. Of course this might just mean that the method is overly flexible, so it matches the training data set perfectly but works horribly on any other data set in the real world. In other words, useless.
So, back to the same question: how do we measure how good a method is? Let’s pick a method to measure the error first — Mean Squared Error:
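(For reference, here is the definition written out in LaTeX, as I understand it from the book: n is the number of observations, y_i is the observed response, and \hat{f}(x_i) is the prediction for the i-th observation.)

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - \hat{f}(x_i) \bigr)^2
```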
(By the way, I really hate the fact that it’s so freaking hard to insert an equation into a WordPress post. I know, I know you can use LaTeX to do it. It’s just that it’s way more troublesome than it should be, you know? Microsoft Equation Editor is really the way to go here.)
A method is good if it produces a low test MSE, that is, the MSE we see when the method is applied to previously unseen data. Note: NOT the training data!!!
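Here is a small sketch of the difference between training and test MSE. Everything in it is an assumption for illustration: the "true" f is sin(x), the noise has standard deviation 0.5, and the method is K-nearest-neighbours regression, where a smaller k means a more flexible fit.

```python
import math
import random


def knn_predict(train, x0, k):
    """Average response of the k training points nearest to x0."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def mse(train, data, k):
    """Mean squared error of the K-NN fit (built on `train`) over `data`."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)


def simulate(n, rng):
    """Hypothetical truth: y = sin(x) + Gaussian noise with sd 0.5."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0.0, 0.5)) for x in xs]


rng = random.Random(0)
train = simulate(200, rng)
test = simulate(1000, rng)  # previously unseen data

results = {}
for k in (1, 10, 50):  # smaller k = more flexible fit
    results[k] = (mse(train, train, k), mse(train, test, k))
    print(f"k={k:>2}  training MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

With k=1 every training point is its own nearest neighbour, so the training MSE is exactly zero, yet the test MSE should come out clearly worse than for k=10. That is the "matches the training data perfectly, useless elsewhere" situation.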
When test data is not available, this is of course a problem. In general a method will try to minimize the MSE on the training data, but a low training MSE is no guarantee of a low test MSE.
One way to estimate the test MSE using only the training data is cross-validation, which is discussed in one of the upcoming chapters.
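I haven’t reached that chapter yet, so take this as a rough sketch of the general idea rather than the book’s exact procedure: split the data into folds, fit on all but one fold, measure the squared error on the held-out fold, and repeat. The data, the sin(x) truth, and the choice of K-nearest neighbours are all assumptions for illustration.

```python
import math
import random


def knn_predict(train, x0, k):
    """Average response of the k training points nearest to x0."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def cv_mse(data, k, n_folds=5):
    """Estimate the test MSE of K-NN by n_folds-fold cross-validation."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    squared_errors = []
    for i, held_out in enumerate(folds):
        # Fit on every fold except the i-th, then score on the i-th.
        fit_on = [p for j, fold in enumerate(folds) if j != i for p in fold]
        squared_errors += [(y - knn_predict(fit_on, x, k)) ** 2 for x, y in held_out]
    return sum(squared_errors) / len(squared_errors)


rng = random.Random(1)
data = []
for _ in range(300):
    x = rng.uniform(0.0, 2.0 * math.pi)
    data.append((x, math.sin(x) + rng.gauss(0.0, 0.5)))  # made-up truth + noise

scores = {k: cv_mse(data, k) for k in (1, 5, 15)}
best_k = min(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k:>2}  cross-validated MSE={score:.3f}")
print("k with the lowest estimate:", best_k)
```

No point is ever scored by a fit that saw it, which is why the estimate behaves like a test MSE rather than a training MSE.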
The Bias-Variance Trade-Off
This is still about the test MSE. Without proof, the book says that the expected test MSE can be decomposed into 3 components: the variance of the estimated f, the squared bias of the estimated f, and the variance of the irreducible error term ε.
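(Written out in LaTeX, as I understand it from the book, for a test point x_0 with response y_0:)

```latex
E\bigl(y_0 - \hat{f}(x_0)\bigr)^2
  = \mathrm{Var}\bigl(\hat{f}(x_0)\bigr)
  + \bigl[\mathrm{Bias}\bigl(\hat{f}(x_0)\bigr)\bigr]^2
  + \mathrm{Var}(\varepsilon)
```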
The left-hand side of the equation is the expected test MSE, which is the quantity that we actually want to minimize (rather than the training MSE).
A few observations:
- As already discussed, the last term is irreducible. So we can only focus on the first two, namely the variance, and the bias.
- Both terms are non-negative (one is a variance, the other is a square), so the expected test MSE can never fall below the variance of the irreducible error. This is consistent with what was said earlier about the irreducible error putting an upper bound on the prediction accuracy. You really can’t get better than that.
Variance is how much the estimated f would change if the estimation were done using a different training data set. So those very flexible methods that overfit? They have high variance. Bias is the error that we get from assuming that f is simpler than it actually is, for example, assuming that it’s linear. So if we have a very non-linear true f, then using linear regression on it will introduce high bias.
In general, as we use more flexible methods, bias initially goes down faster than variance goes up, which means that the expected test MSE declines. But at some point we start to overfit: the bias doesn’t go down much more while the variance keeps increasing, so the expected test MSE goes back up.
This is the trade-off we’re referring to: for the lowest test MSE we want low bias AND low variance, but making a method more flexible lowers one while raising the other. And there lies the main challenge: how do we find a method that keeps both variance and bias low?
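To make the trade-off a bit more concrete, here is a small simulation sketch. All of it is assumed for illustration: the true f is x squared, the noise has standard deviation 0.3, and the method is K-nearest-neighbours regression. It refits on many fresh training sets and looks at how the prediction at one point x0 behaves.

```python
import random
import statistics


def true_f(x):
    """Hypothetical non-linear truth, chosen only for this illustration."""
    return x * x


def knn_predict(train, x0, k):
    """Average response of the k training points nearest to x0."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def bias2_and_variance(k, x0=0.9, n=30, noise_sd=0.3, n_sets=500, seed=0):
    """Refit K-NN on many fresh training sets; summarise predictions at x0."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_sets):
        xs = [rng.random() for _ in range(n)]
        train = [(x, true_f(x) + rng.gauss(0.0, noise_sd)) for x in xs]
        predictions.append(knn_predict(train, x0, k))
    bias = statistics.fmean(predictions) - true_f(x0)
    return bias ** 2, statistics.pvariance(predictions)


for k in (1, 5, 30):  # k=30 averages every training point: the least flexible fit
    b2, var = bias2_and_variance(k)
    print(f"k={k:>2}  bias^2={b2:.4f}  variance={var:.4f}  sum={b2 + var:.4f}")
```

In this setup the most flexible fit (k=1) should show tiny bias but large variance, the least flexible one (k=30) the opposite, and the in-between k the smallest sum of the two. The irreducible error is the same for all three, so it is left out of the sum.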
Geez. It takes way longer to write this post than to read the book!