Reading An Introduction To Statistical Learning: Day 9

(This post is part of my 20 Minutes of Reading Every Day project. My current book of choice is An Introduction to Statistical Learning, and today I’m going a bit outside of the book.)

So far in the book we’ve seen two measures of closeness: MSE (Mean Squared Error) and RSS (Residual Sum of Squares). How do they differ, exactly? Why do we need to use two different measures?

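To recap, here’s what MSE looks like:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{f}(x_i) \right)^2$$

So given n observations, it sums up the squared difference between y and the estimate of y for each observation, and averages it over n. That is, it’s the average squared distance between the responses and the predictions, which serves as an estimate of the expected squared distance.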

Whereas RSS is just the sum, without the division by n. It might sound trivial, but MSE and RSS play different roles. If you remember from Chapter 2, MSE is used as a theoretical measure for comparing various models: we have not picked a model yet, and the MSE is used to pick one.

RSS, on the other hand, in the context of Chapter 3, is used to pick the coefficients of a model that’s already chosen (in this case, a linear model).

I’m not clear yet on a deeper significance than this. That is, I get that they’re used for different purposes, but it’s still not clear to me why. The mathematical relationship between the two seems trivial. Minimizing one will minimize the other.
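To convince myself of the "minimizing one will minimize the other" bit, here is a minimal sketch in Python (my choice of language, not the book's, which uses R). The numbers and the names `pred_a` and `pred_b` are made up for illustration. For a fixed n, RSS is just n times MSE, so whichever set of predictions wins on one measure wins on the other.

```python
def rss(y, y_hat):
    """Residual sum of squares: the sum of squared differences."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))


def mse(y, y_hat):
    """Mean squared error: RSS averaged over the n observations."""
    return rss(y, y_hat) / len(y)


# Made-up responses and two made-up sets of predictions.
y = [3.0, 5.0, 7.0, 9.0]
pred_a = [2.5, 5.5, 6.5, 9.5]  # off by 0.5 everywhere
pred_b = [1.0, 5.0, 7.0, 9.0]  # perfect except for one big miss

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(name, "RSS =", rss(y, pred), "MSE =", mse(y, pred))

# Same n, so the ranking of A versus B is identical under both measures.
assert rss(y, pred_a) == len(y) * mse(y, pred_a)
```

Predictions A come out ahead of B on both measures, which is all the "trivial" relationship says.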

I feel that I might get a better understanding of this question by reading this PDF here. Something to do over the weekend…


Reading An Introduction To Statistical Learning: Day 8

(This post is part of my 20 Minutes of Reading Every Day project. My current book of choice is An Introduction to Statistical Learning, and today I’m continuing with Chapter 3: Linear Regression.)

Linear Regression is one of the simplest approaches for supervised learning. It’s one of the major features of the CFA Level 2 material too, so it’s something with which I’m rather familiar. From my reading so far, I think this book does a much better job of explaining linear regression, especially the intuition, than the CFA text, though. I wish I’d come across this book first!

The book also makes a point that despite its simplicity, having a good understanding of linear regression is very important. Many of the fancier approaches that we’ll see in the later chapters are generalizations or extensions of the ideas of linear regression.

In particular, in the previous chapter the book points out that the assumption of linearity brings better interpretability. For example, it’s relatively straightforward to answer questions such as the following:

  1. Is there a relationship between the predictors and the response?
  2. How strong is the relationship, if it does exist?
  3. Which predictor contributes the most to the response?
  4. How accurate is our estimation of the effect of each predictor?
  5. How accurately can we predict future responses?
  6. Is the relationship linear?
  7. Is there synergy among the advertising media?

Simple Linear Regression

This is a very simple case of one predictor and one response, i.e.:

Y ≈ ß0 + ß1X

(Ha! I just found out that in Mac OS you can type ≈ using Option + x, and ß using Option + s.)

The ß0 is the intercept, and ß1 is the slope of the line. By carrying out the linear regression, we’re estimating both, i.e., we come up with estimated values of ß0 and ß1. Our goal is of course to come up with estimates that produce a line that matches the data points as closely as possible.

We’ve seen two measures of closeness so far — MSE (Mean Squared Error) for regression, and the Error Rate for classification. For this simple linear regression, we’re using a measure called Residual Sum of Squares (RSS).
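Here is a rough sketch of what "picking the estimates that minimize RSS" can look like for one predictor. I'm using the standard closed-form least squares formulas, written in Python rather than the book's R. The data points and the function names `fit_simple_linear` and `line_rss` are made up for illustration.

```python
def fit_simple_linear(x, y):
    """Least squares estimates of the intercept (b0) and slope (b1)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


def line_rss(x, y, b0, b1):
    """RSS of the line b0 + b1 * x against the observed points."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up data that sits roughly on the line y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_linear(x, y)
print("intercept:", round(b0, 4), "slope:", round(b1, 4))
print("RSS at the fitted line:", line_rss(x, y, b0, b1))
print("RSS with a nudged slope:", line_rss(x, y, b0, b1 + 0.1))
```

Nudging either estimate away from the fitted values makes the RSS go up, which is what "matches the data points as closely as possible" means here.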

Reading An Introduction To Statistical Learning: Day 7

(This post is part of my 20 Minutes of Reading Every Day project. My current book of choice is An Introduction to Statistical Learning, and today I’m finishing with Chapter 2: Statistical Learning.)

Today I’m using the 20 minutes to go over the final section of Chapter 2, namely, Introduction to R. Not much to say here except that I’m playing around with RStudio, which can be downloaded here, instead of the console that comes with the standard R distro.

Tomorrow I’m going to get into Chapter 3, since the Lab can take quite a bit of time to go through. I’d like to finish the book to get a good overview, then go back to the lab at a more leisurely pace.

Reading An Introduction To Statistical Learning: Day 6

(This post is part of my 20 Minutes of Reading Every Day project. My current book of choice is An Introduction to Statistical Learning, and today I’m continuing with Chapter 2: Statistical Learning.)

Yesterday, I read about model accuracy: the difference between variance (how much the estimated f will change given a different data set) and bias (inaccuracy that stems from assuming a certain shape of the true f), and how they contribute to the test MSE, whose lower bound is the irreducible error term.

The discussion so far has been on the regression setting, but this actually applies to classification too. The difference is that instead of using MSE, which doesn’t make sense since the responses are not numeric, we’re using something called an error rate:

$$\frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i)$$
That is, out of n observations, how many are incorrectly classified? Of course, just as with the regression case, there’s a difference between training error rate and test error rate. We want to have a classifier that works well against the observations we don’t use for training.
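A minimal sketch of the error rate in Python (my choice of language, not the book's). The labels and the name `error_rate` are made up for illustration.

```python
def error_rate(y_true, y_pred):
    """Fraction of observations whose predicted class is wrong."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# Made-up labels: 2 of the 5 predictions are wrong.
y_true = ["spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "spam", "spam", "spam"]

print(error_rate(y_true, y_pred))  # 0.4
```

Whether this is a training or a test error rate depends only on which observations get passed in.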

Unlike the regression case though, there is one classifier that apparently will minimize the test error rate on average: the Bayes Classifier. Just like its name suggests, it relies on conditional probability: it assigns a test observation with predictor vector x0 to the class j for which the conditional probability

$$\Pr(Y = j \mid X = x_0)$$

is the largest.

This classifier produces the lowest possible error rate, which is called the Bayes error rate.

K-Nearest Neighbors (KNN)

The Bayes classifier is good in theory, but of course in the real world we don’t know the conditional probability of Y being j given that X is x0. So instead of being used in real life, the Bayes classifier is the theoretical gold standard that real world methods try to get to.

The obvious approach is then to estimate the conditional probability and go from there. The KNN method does just that. It estimates the conditional probability using the following, where N0 is the set of the K training points nearest to x0:

$$\Pr(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in N_0} I(y_i = j)$$

In other words, “the conditional probability of Y being j, is given by how many out of the nearest K neighbours are in category j”. So if K = 3, and all 3 nearest neighbours are of category j, then KNN estimates a 100% probability that the response of x0 also falls under category j.

KNN is simple but surprisingly can produce classifiers that are pretty close to the theoretical Bayes classifier. The choice of K is analogous to the smoothness parameter for the spline approach. When K = 1, the classifier is very flexible, but it overfits. When K is too large, it becomes too coarse — the analog of assuming that the f is a linear function.

Like its regression counterpart, the choice lies somewhere in the middle as well.
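Here is a small sketch of KNN classification as I understand it from the description above, in Python rather than the book's R. The training points, the class labels and the function names are all made up for illustration, and I'm assuming plain Euclidean distance.

```python
import math
from collections import Counter


def knn_probabilities(train_x, train_y, x0, k):
    """Estimate Pr(Y = j | X = x0) as the share of the k nearest
    training points that belong to class j."""
    by_distance = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], x0),  # Euclidean distance (an assumption)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    counts = Counter(nearest_labels)
    return {label: count / k for label, count in counts.items()}


def knn_classify(train_x, train_y, x0, k):
    """Assign x0 to the class with the largest estimated probability."""
    probs = knn_probabilities(train_x, train_y, x0, k)
    return max(probs, key=probs.get)


# Made-up training data: two loose groups of points.
train_x = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (5.0, 5.0), (6.0, 5.5), (5.5, 6.5)]
train_y = ["blue", "blue", "blue", "orange", "orange", "orange"]
x0 = (2.5, 2.5)

print(knn_probabilities(train_x, train_y, x0, k=3))  # all 3 neighbours are blue
print(knn_probabilities(train_x, train_y, x0, k=5))  # 3 blue, 2 orange
print(knn_classify(train_x, train_y, x0, k=5))
```

With K = 3 the estimate is a flat 100% for "blue", and with K = 5 it softens to 60/40. That's the flexibility knob in action.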

Reading An Introduction To Statistical Learning: Day 5

(This post is part of my 20 Minutes of Reading Every Day project. My current book of choice is An Introduction to Statistical Learning, and today I’m continuing with Chapter 2: Statistical Learning.)

Regression vs Classification

Variables (both predictors and response) can be categorized as either quantitative or qualitative.

Quantitative means they take on numerical values, for example, someone’s height, age, income, etc. Qualitative means that they belong to different classes or categories, for example gender, brands, yes/no questions, cancer categories, and so on.

Problems with a quantitative response are referred to as regression problems, and those with a qualitative response are classification problems. Note that it’s the response that determines whether it’s a regression or a classification problem, not the predictors! Qualitative predictors can be coded into quantitative ones before analysis.
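One common way to code a qualitative predictor into numbers is with 0/1 indicator ("dummy") columns. Here's a minimal sketch in Python (my choice of language, not the book's). The brand names and the `dummy_code` helper are made up, and treating the alphabetically first level as the baseline is just an assumption of this sketch.

```python
def dummy_code(values):
    """Turn a qualitative variable into 0/1 indicator columns.

    The first level (alphabetically) is treated as the baseline and gets
    no column of its own; that choice is an assumption of this sketch.
    """
    levels = sorted(set(values))
    columns = levels[1:]
    rows = [[1 if value == level else 0 for level in columns] for value in values]
    return columns, rows


brands = ["acme", "zeta", "acme", "bolt"]  # made-up data
columns, rows = dummy_code(brands)

print(columns)  # ['bolt', 'zeta']
for brand, row in zip(brands, rows):
    print(brand, row)
```

The baseline brand shows up as a row of all zeros, so three brands need only two numeric columns.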

As usual, the distinction between the two is not that crisp. Some methods such as K-nearest neighbours can be used for either qualitative or quantitative responses.

Now, for something more interesting.

Assessing Model Accuracy

How do we know how good our models are? In my first post about this book, I mentioned that there’s no single best method in statistical learning. On a particular data set, one method may work better than the others, and it might be different for another data set.

But then… how do we know that? That is, how do we know that method A is better than method B for this particular data set, really? This section is about answering that question, which is more complicated than it looks. First of all, even after we decide on a way to measure how “good” a method is, the data set against which we’re measuring it also matters.

If we measure a method against the training data that’s used to fit the response, then we know what happens: the more flexible your method is, the lower your error is against the same data set. Of course this might just mean that the method is overly flexible, so it matches the training data set perfectly, but works horribly for any other data set in the real world. In other words, useless.

So, back to the same question: how do we measure how good a method is? Let’s pick a method to measure the error first, namely the Mean Squared Error:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{f}(x_i) \right)^2$$

(By the way, I really hate the fact that it’s so freaking hard to insert an equation into a WordPress post. I know, I know you can use LaTeX to do it. It’s just that it’s way more troublesome than it should be, you know? Microsoft Equation Editor is really the way to go here.)

A method is good if it produces a low test MSE. That is, the MSE value that we see when the method is applied against previously unseen data. Note: NOT the training data!!!

When the test data is not available, then of course this is a problem. In general a method will attempt to minimize MSE for the training data, but a low training MSE is no guarantee that it will work well for the test data too.

One way to estimate the test MSE using only the training data is cross-validation, which is gonna be discussed in one of the upcoming chapters.
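To see the training-versus-test gap for myself, here is a rough simulation in Python (my choice of language, not the book's). Everything in it is an assumption made for illustration: the true f is a sine curve, the noise is Gaussian, and I'm using K-nearest-neighbours regression as a stand-in for "a method with adjustable flexibility" (small K means very flexible).

```python
import math
import random


def true_f(x):
    """The made-up true function behind the simulated data."""
    return math.sin(x)


def make_data(n, rng):
    """Simulate n observations of y = f(x) + noise."""
    xs = [rng.uniform(0, 6) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


def knn_predict(train_x, train_y, x0, k):
    """Average the responses of the k training points nearest to x0."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def mse(xs, ys, train_x, train_y, k):
    """MSE of the KNN fit (built on the training data) against (xs, ys)."""
    errors = [(y - knn_predict(train_x, train_y, x, k)) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


rng = random.Random(0)
train_x, train_y = make_data(50, rng)
test_x, test_y = make_data(200, rng)

results = {}
for k in (1, 5, 50):  # from most flexible to least flexible
    train_mse = mse(train_x, train_y, train_x, train_y, k)
    test_mse = mse(test_x, test_y, train_x, train_y, k)
    results[k] = (train_mse, test_mse)
    print(f"K={k:2d}  training MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

With K = 1 every training point is its own nearest neighbour, so the training MSE is exactly zero, yet the test MSE is not. That is the "matches the training data perfectly, useless elsewhere" situation in miniature.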

The Bias-Variance Trade-Off

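This is still about the test MSE. Without proof, the book says that the expected test MSE at a point x0 can be decomposed into 3 components: the variance of the estimated f, the squared bias of the estimated f, and the variance of the irreducible error term ε.

$$E\left( y_0 - \hat{f}(x_0) \right)^2 = \mathrm{Var}\left( \hat{f}(x_0) \right) + \left[ \mathrm{Bias}\left( \hat{f}(x_0) \right) \right]^2 + \mathrm{Var}(\epsilon)$$

The left-hand side of the equation is the expected test MSE, which is the one that we actually want to minimize, rather than the training MSE.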

A few observations:

  1. As already discussed, the last term is irreducible. So we can only focus on the first two, namely the variance, and the bias.
  2. Both terms are the result of squaring, so neither of them can ever be negative. This is consistent with what is said earlier: the expected test MSE can never go below the irreducible error, which puts an upper bound on the prediction accuracy. You really can’t get better than that.

Variance is how much the estimated f would change if the estimation is done using a different data set. So those very flexible methods that overfit? They have high variance. Bias is the error that we get from assuming that f is simpler than it actually is, for example, assuming that it’s linear. So if we have a very non-linear true f, then using linear regression on it will introduce high bias.

In general, as we use more flexible methods, initially bias will go down faster than variance’s going up, which means that the expected test MSE will decline. But at some point, we start to overfit, and this is where the bias doesn’t go down much more, and the variance keeps increasing.

This is the trade off we’re referring to. Because for the lowest test MSE, we want to have lowest bias AND lowest variance. But there lies the main challenge – how do we find a method that yields the lowest variance and bias?
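The book gives the decomposition without proof, so here is a rough simulation sketch instead, in Python (my choice of language). All of it is assumption for illustration: a sine curve as the true f, Gaussian noise, and two deliberately extreme methods, one rigid (always predict the mean of y) and one very flexible (copy the single nearest neighbour). The idea is to refit each method on many fresh training sets and look at how its prediction at one point x0 behaves.

```python
import math
import random
from statistics import mean, pvariance

NOISE_SD = 0.3  # standard deviation of the irreducible error (an assumption)


def true_f(x):
    """The made-up true function behind the simulated data."""
    return math.sin(x)


def make_training_set(n, rng):
    xs = [rng.uniform(0, 6) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, NOISE_SD) for x in xs]
    return xs, ys


def predict_mean(xs, ys, x0):
    """Rigid method: ignore x0 and predict the average response."""
    return mean(ys)


def predict_1nn(xs, ys, x0):
    """Flexible method: copy the response of the nearest training point."""
    return min(zip(xs, ys), key=lambda p: abs(p[0] - x0))[1]


def bias_variance(predict, x0, n_sets=500, n=50, seed=0):
    """Estimate squared bias and variance of predict(x0) over many training sets."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_sets):
        xs, ys = make_training_set(n, rng)
        preds.append(predict(xs, ys, x0))
    bias_sq = (mean(preds) - true_f(x0)) ** 2
    variance = pvariance(preds)
    return bias_sq, variance


x0 = 1.5
for name, method in [("mean", predict_mean), ("1-NN", predict_1nn)]:
    bias_sq, variance = bias_variance(method, x0)
    expected_mse = variance + bias_sq + NOISE_SD ** 2
    print(f"{name:5s} bias^2={bias_sq:.3f}  variance={variance:.3f}  "
          f"expected test MSE ~ {expected_mse:.3f}")
```

The rigid method barely moves between training sets but is far from the truth at x0, while the flexible one is right on average but jumps around. Neither extreme gets both terms low at once.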

Geez. It takes way longer to write this post than to read the book!

Reading An Introduction To Statistical Learning: Day 4

(This post is part of my 20 Minutes of Reading Every Day project. My current book of choice is An Introduction to Statistical Learning, and today I’m continuing with Chapter 2: Statistical Learning.)

Trade-Off between Prediction Accuracy and Model Interpretability

Yesterday I covered the parametric and non-parametric approaches of estimating f. Today, I’m reading about the trade-off between prediction accuracy and model interpretability. So far, I’ve read about how some methods are more flexible, and some are more restrictive. For example, the parametric method of assuming that f is linear is a restrictive method, because it can only approximate f so much.

The non-parametric approach such as the thin-plate spline is obviously a lot more flexible, since it can take into account the curves that the training data present.

So why don’t we just use the spline all the time? After all, you can always adjust the smoothness, right?

Because it can be hard to interpret. This is especially a problem if you’re concerned about inference. With a linear function, it’s very easy to tell how much an independent variable contributes to a dependent variable. With the splines, the estimated f sometimes turns out to be so complicated that it’s difficult to understand this relationship. In general, there’s an inverse relationship between flexibility and interpretability.

So now, let’s turn the question around: why don’t we just use an inflexible model all the time?

Because remember that other than inference, we also use statistical learning for prediction. For the latter, it might not matter as much that we can’t interpret the estimated f that well, as long as the prediction’s good!

Note that, just as per the point about spline smoothness, even for prediction purposes, it might not always be better to use the most flexible method, because it tends to overfit.

Supervised vs. Unsupervised Learning

Most of the book is about supervised learning. That is, you have a number of observations n, and each observation has the measurements of p predictors, and then we estimate the f that gives back an estimated Y, which is hopefully close enough to the actual y.

Unsupervised learning is different, in that there’s no Y. That right there already kicks linear regression out of the window: there’s no Y against which to regress.

So what can we do when there’s no Y? The only thing we can do is to learn something about the relationships between the variables or between the observations. One statistical learning tool that we can use here is called cluster analysis. It’s a bit like how Google Inbox groups your emails together, for example. Or Google News.

Looking at the difference between the two, I thought that the line is always very clear, i.e.: either a problem is that of supervised learning, or not. But it turned out that there are cases in which this line is not as sharp as we think. For example, cases where you have Y for some observations, but not all of them, and it’s really expensive to measure Y. For these cases, the so-called “semi-supervised learning” method is appropriate: a method that takes advantage of observations that do have the corresponding Y, and those that don’t.


Reading An Introduction To Statistical Learning: Day 3

(This post is part of my 20 Minutes of Reading Every Day project. My current book of choice is An Introduction to Statistical Learning, and today I’m continuing with Chapter 2: Statistical Learning.)

On Day 2, I learned about what Statistical Learning is about. That is, estimating f, a function that maps independent variables X to dependent variable Y, for the purpose of either inference, or prediction.

Today we look into more details of the estimation approaches. Broadly speaking, they can be categorized into two groups: parametric and non-parametric.


The parametric approach means that before we estimate f, we make an assumption first of its shape. A common assumption is that f is a linear function of X. That is,

f(X) = B0 + B1X1 + B2X2 +…+BpXp

Which simplifies our task a lot, since estimating a potentially arbitrary function f has been reduced to estimating the coefficients of the linear equation.
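To make the "reduced to estimating the coefficients" point concrete, here is a tiny sketch in Python (my choice of language, not the book's). The function name `linear_f` and the numbers are made up. Once we assume the linear form, the whole of f is described by just p + 1 numbers.

```python
def linear_f(x, coefficients):
    """Evaluate f(X) = B0 + B1*X1 + ... + Bp*Xp for one observation x."""
    b0, *slopes = coefficients
    return b0 + sum(b * xi for b, xi in zip(slopes, x))


# Made-up coefficients for p = 2 predictors: B0 = 1.0, B1 = 2.0, B2 = -0.5.
coefficients = [1.0, 2.0, -0.5]

print(linear_f([3.0, 4.0], coefficients))  # 1.0 + 2.0*3.0 - 0.5*4.0 = 5.0
```

Estimating f now means finding good values for that short list, rather than searching over arbitrary functions.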

There are many ways to do this, the most popular being Ordinary Least Squares, which will be discussed in the next chapter.

Does this approach have disadvantages? You bet it does. Assuming that f is linear might give us a very poor estimate if the true function deviates too much from it.


The non-parametric approach is exactly what the name suggests. Instead of making assumptions about the shape of f, this approach makes no explicit assumption. This has the advantage of not mismatching the shape of the true function f. However, it also has the disadvantage of requiring far more observations (i.e., a much larger n) compared to the parametric approach.

One of the approaches is something called the thin-plate spline, which will be discussed in one of the later chapters. In this approach, we calibrate something that’s called smoothness. A spline with a lower level of smoothness (i.e., a rougher one) can fit the training data perfectly at the risk of overfitting. A higher smoothness may not fit the training data as well, but it has a lower variance (we’ll get to this variance bit soon).