Trade-Off between Prediction Accuracy and Model Interpretability
Yesterday I covered the parametric and non-parametric approaches to estimating f. Today, I’m reading about the trade-off between prediction accuracy and model interpretability. So far, I’ve read about how some methods are more flexible and some are more restrictive. For example, the parametric approach of assuming that f is linear is restrictive, because it can only produce linear functions, no matter how curved the true f is.
A non-parametric approach such as the thin-plate spline is a lot more flexible, since it can follow the curves that the training data present.
So why don’t we just use the spline all the time? After all, you can always adjust the smoothness, right?
Because it can be hard to interpret. This is especially true if you’re concerned about inference. With a linear function, it’s very easy to tell how much an independent variable contributes to a dependent variable. With splines, the estimated f sometimes turns out to be so complicated that it’s difficult to understand this relationship. In general, there’s an inverse relationship between flexibility and interpretability.
So now, let’s turn the question around: why don’t we just use an inflexible model all the time?
Because, remember, besides inference we also use statistical learning for prediction. For prediction, it might not matter as much that we can’t interpret the estimated f that well, as long as the prediction’s good!
Note that, as with the point about spline smoothness, even for prediction it isn’t always better to use the most flexible method, because highly flexible methods tend to overfit: they follow the noise in the training data, not just the underlying f.
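To make the interpretability point concrete, here is a minimal sketch of a one-predictor least-squares fit in plain Python. The data and the fit_line helper are made up for illustration (they aren’t from the book). The point is only that the fitted slope reads directly as “one more unit of x goes with this much more y.”

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical, noise-free data generated from y = 3 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

intercept, slope = fit_line(xs, ys)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

The whole estimated f here is two numbers, and each has a plain reading. A flexible fit has no such short summary.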
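Here is a small sketch of what overfitting looks like in numbers. A few assumptions of mine, none of them from the book: I’m using 1-nearest-neighbor regression as a stand-in for “a very flexible method” (it’s much easier to write by hand than a spline), the true f is 2x, and the “noise” is a hand-made alternating +1 / -1 so the result is reproducible.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def one_nn_predict(train_x, train_y, x):
    """Predict with the y of the single nearest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def mse(preds, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def true_f(x):
    # Assumed "true" relationship for this toy example.
    return 2.0 * x


# Training data: true_f plus hand-made noise alternating +1 / -1.
train_x = [float(i) for i in range(10)]
train_y = [true_f(x) + (1.0 if i % 2 == 0 else -1.0)
           for i, x in enumerate(train_x)]

# New points between the training points, scored against true_f itself.
test_x = [i + 0.25 for i in range(9)]
test_y = [true_f(x) for x in test_x]

intercept, slope = fit_line(train_x, train_y)
linear_train_mse = mse([intercept + slope * x for x in train_x], train_y)
linear_test_mse = mse([intercept + slope * x for x in test_x], test_y)
knn_train_mse = mse([one_nn_predict(train_x, train_y, x) for x in train_x], train_y)
knn_test_mse = mse([one_nn_predict(train_x, train_y, x) for x in test_x], test_y)

print(f"linear: train MSE={linear_train_mse:.3f}, test MSE={linear_test_mse:.3f}")
print(f"1-NN:   train MSE={knn_train_mse:.3f}, test MSE={knn_test_mse:.3f}")
```

On this toy data the flexible method reproduces the training points perfectly (training error of zero), but it does worse than the straight line on the new points, because it memorized the noise along with f.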
Supervised vs. Unsupervised Learning
Most of the book is about supervised learning. That is, you have n observations, each with measurements of p predictors and a response Y, and we estimate the f that gives back an estimated Y, which is hopefully close enough to the actual Y.
Unsupervised learning is different, in that there’s no Y. That right there already kicks linear regression out the window: there’s no Y against which to regress.
So what can we do when there’s no Y? The only thing we can do is to learn something about the relationships between the observations. One stat learning tool that we use here is called cluster analysis. It’s a bit like how Google Inbox groups your emails together, for example, or how Google News groups related stories.
Looking at the difference between the two, I thought that the line was always very clear, i.e., either a problem is a supervised learning problem, or it isn’t. But it turns out that there are cases in which this line is not as sharp as we might think. For example, you might have Y for some observations but not all of them, because Y is really expensive to measure. For these cases, a so-called “semi-supervised learning” method is appropriate: a method that takes advantage of both the observations that have a corresponding Y and those that don’t.
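As a rough illustration of cluster analysis, here is a minimal k-means sketch in plain Python. K-means is just one clustering method that I picked because it’s short to write. The points, the starting centroids, and the kmeans helper are all hypothetical. Notice that no Y appears anywhere: the groups come purely from how close the observations are to each other.

```python
import math


def kmeans(points, centroids, n_iter=10):
    """Plain k-means: alternate an assignment step and an update step."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[k])
        centroids = new_centroids
    return labels, centroids


# Hypothetical observations: two predictors each, no response Y.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]

# Starting centroids picked by hand (one point from each visible group).
labels, centroids = kmeans(points, [points[0], points[3]])
print(labels)
```

The first three points end up in one cluster and the last three in the other. With a worse starting guess or messier data the grouping can come out differently, which is part of why unsupervised results are harder to check than supervised ones.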
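To get a feel for how labeled and unlabeled observations can be combined, here is a sketch of one possible approach, usually called self-training. This is my own assumption for illustration and not necessarily what the book has in mind: fit a very simple classifier (nearest class centroid, on one predictor) to the few labeled points, use it to guess labels for the unlabeled points, then refit on everything. The data and helper names are hypothetical.

```python
def centroid(values):
    """Mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def self_train(labeled, unlabeled, n_rounds=5):
    """Self-training with a nearest-centroid classifier (1-D, classes 0 and 1)."""
    pseudo = []
    # Start from the truly labeled points only.
    groups = {k: [x for x, y in labeled if y == k] for k in (0, 1)}
    for _ in range(n_rounds):
        c0, c1 = centroid(groups[0]), centroid(groups[1])
        # Guess a label for each unlabeled point: class of the nearer centroid.
        pseudo = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in unlabeled]
        # Refit the centroids on labeled and pseudo-labeled points together.
        groups = {
            k: [x for x, y in labeled if y == k]
               + [x for x, p in zip(unlabeled, pseudo) if p == k]
            for k in (0, 1)
        }
    return pseudo, (centroid(groups[0]), centroid(groups[1]))


# Hypothetical data: Y was measured for only two observations.
labeled = [(1.0, 0), (9.0, 1)]
unlabeled = [1.5, 2.0, 2.5, 7.5, 8.0, 8.5, 4.0]

pseudo_labels, (c0, c1) = self_train(labeled, unlabeled)
print(pseudo_labels, c0, c1)
```

The class centroids start at the two labeled points (1.0 and 9.0) and shift once the unlabeled points are folded in, so the observations without Y do influence the final fit.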