Unit 03: A statistical perspective on learning


Required Readings:

What is statistical learning?

Suppose that there is some relationship as follows

$$Y = f(X) + \epsilon,$$

where $X$ is a set of input variables (also called features or independent variables), $X_1, X_2, \dots, X_p$, $Y$ is the output variable (also called the dependent variable or response), and $\epsilon$ is noise in the relationship. This is called supervised learning because we are provided with an output variable that we must model based on the input variables.

We never truly know $f$. Statistical learning refers to a set of approaches for estimating $f$ that considers the uncertainty in such estimation.

In statistical learning, $\epsilon$ is a random variable with mean zero, which induces randomness in $Y$. For example, if $\epsilon$ were normally distributed with mean zero and standard deviation $\sigma_\epsilon$, then

$$Y \mid X \sim \mathcal{N}\left(f(X), \sigma_\epsilon^2\right).$$

Why estimate $f$?

There are two main reasons: prediction and inference.


Prediction

Let’s assume we have an estimate of $f$, denoted $\hat{f}$, and therefore we can estimate the value of $Y$ as

$$\hat{Y} = \hat{f}(X).$$

Example $Y$ is the progression of diabetes in a patient and $X$ is a set of measurements that we can draw from them. For example, $X = (X_1, X_2)$, where $X_1$ is the age and $X_2$ is the sex of the patient. In general, we do not treat $X$ as an estimate. In the prediction setting, we might be interested in predicting the progression of diabetes for a patient we have never seen before by using their age and sex.

The accuracy of such a prediction depends on the reducible error and the irreducible error. Reducible error comes from our imperfect estimation of $f$, and irreducible error is the error $\epsilon$ inherent in the relationship. There are many reasons why $\epsilon$ exists (e.g., unmeasured quantities that would improve our prediction, true variability in the real world).

Question If we measure the error in our estimate as the mean squared error $E(Y - \hat{Y})^2$, then we can derive that

$$E(Y - \hat{Y})^2 = \underbrace{\left[f(X) - \hat{f}(X)\right]^2}_{\text{reducible}} + \underbrace{\operatorname{Var}(\epsilon)}_{\text{irreducible}}$$

In this course, we will estimate $f$ so as to minimize the reducible error.
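One way to see the split into reducible and irreducible error is to expand the square, treating $X$ and $\hat{f}$ as fixed and using $E[\epsilon] = 0$:

```latex
\begin{aligned}
E(Y - \hat{Y})^2
  &= E\left[f(X) + \epsilon - \hat{f}(X)\right]^2 \\
  &= \left[f(X) - \hat{f}(X)\right]^2
     + 2\left[f(X) - \hat{f}(X)\right] E[\epsilon]
     + E[\epsilon^2] \\
  &= \left[f(X) - \hat{f}(X)\right]^2 + \operatorname{Var}(\epsilon),
\end{aligned}
```

where the cross term vanishes because $E[\epsilon] = 0$, and $E[\epsilon^2] = \operatorname{Var}(\epsilon)$ for a mean-zero random variable.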


Inference

These are cases in which we want to understand the relationship between $X$ and $Y$, and therefore we are interested in looking at $f$ itself instead of treating it as a black box.

Example questions are:

Example For the diabetes prediction, suppose that we propose $f$ to be a linear relationship,

$$f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2.$$

In inference, after estimating $f$ (that is, estimating the parameters $\beta_0, \beta_1, \beta_2$), we might be interested in checking whether $\beta_2$ (the effect of sex) is significantly different from 0 (no effect), or in the sign of $\beta_1$ (the effect of age).

How do we estimate $f$?

We estimate the model using training data. Assume we have $n$ training datapoints, where $x_{ij}$ is the value of variable $j$ for datapoint $i$, and $y_i$ is the dependent variable for that datapoint. Assume we have $p$ different variables and $x_i = [x_{i1}\; x_{i2}\; x_{i3} \dots x_{ip}]^T$. Our training dataset consists of $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$.

Parametric methods

  1. Define the form of $f$ (e.g., a linear model)
  2. Use a procedure to fit or train the model. In the case of a linear model, we need to estimate the parameters $\beta$ so as to minimize the squared error. The squared error is one error function among many alternatives
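As a sketch of this two-step workflow, the snippet below assumes a linear form for $f$ and fits it by minimizing the squared error with NumPy's least-squares solver (the data and variable names are my own illustration):

```python
import numpy as np

# Toy training data: n = 4 datapoints, p = 1 feature, roughly y = 1 + 2x
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.1, 5.0, 6.9, 9.1])

# Step 1: assume a form for f, here linear: f(x) = b0 + b1 * x.
# Prepend a column of ones so the intercept b0 is fit like any other coefficient.
A = np.hstack([np.ones((X.shape[0], 1)), X])

# Step 2: train the model by minimizing the squared error (least squares).
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
b0, b1 = beta
print(b0, b1)  # should be close to 1 and 2
```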

Non-parametric methods

These methods do not make assumptions about the form of $f$. Informally, they try to get as close to the training data as possible, but not too close. In general, they need more data to work, and they are harder to interpret. However, since they make no assumptions about $f$, they can fit a much wider range of problems.

An example of a non-parametric method is nearest neighbors. It is a very simple concept:

  1. Task: Predict $Y$ for a new $X^*$
  2. The nearest neighbor method looks at the $k$ closest datapoints to $X^*$ in the training data $X$.
  3. The prediction $\hat{Y}$ will be the average $Y$ of those closest points in $X$.
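The three steps above can be sketched in a few lines of NumPy (a minimal illustration; a real application would typically scale the features and use a library implementation):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k):
    """Predict y for x_new as the average y of the k nearest training points."""
    # Euclidean distance from x_new to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([1.0, 2.0, 3.0, 10.0])
pred = knn_predict(X_train, y_train, np.array([2.1]), k=2)
print(pred)  # averages the y values at x=2 and x=3, giving 2.5
```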

Accuracy vs interpretability tradeoff

Typically, the models examined in this course have different levels of predictive accuracy and model interpretability. In general, complex models are more accurate but less interpretable, and simple models are less accurate but more interpretable. One of the primary goals of this course is to help you select the appropriate model from this tradeoff landscape.

Example Deep Learning can fit very general relationships between $X$ (e.g., pixels in an image) and $Y$ (e.g., an object category). However, interpreting $f$ in Deep Learning is the focus of active research. A linear model $f$ is not as accurate for such a task, but it is very interpretable, and many statistical methods have been developed to support that interpretability.

Question Assuming the following data for the progress of diabetes dataset

| age | sex | progress of diabetes |
|-----|-----|----------------------|
| 30  | 0   | 10                   |
| 40  | 0   | 20                   |
| 50  | 0   | 30                   |
| 30  | 1   | 5                    |
| 40  | 1   | 9                    |
| 50  | 1   | 14                   |

For age=50 and sex=0, find the squared error using

  1. A linear model with $\beta_0 = -9, \beta_1 = 0.725, \beta_2 = -10$
  2. A nearest neighbor model with $k=1$
  3. Which is more accurate?
  4. Which is more interpretable?
  5. What about $k=2$ for NN?
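One possible way to check your answers numerically (my own sketch; it assumes unscaled Euclidean distance for the neighbor search and that the query point is part of the training data):

```python
import numpy as np

X = np.array([[30, 0], [40, 0], [50, 0], [30, 1], [40, 1], [50, 1]], dtype=float)
y = np.array([10, 20, 30, 5, 9, 14], dtype=float)
x_new, y_true = np.array([50.0, 0.0]), 30.0

# 1. Linear model with the given coefficients
b0, b1, b2 = -9.0, 0.725, -10.0
y_lin = b0 + b1 * x_new[0] + b2 * x_new[1]  # -9 + 0.725*50 + 0 = 27.25
err_lin = (y_true - y_lin) ** 2             # 2.75**2 = 7.5625

# 2. Nearest neighbor with k=1: (50, 0) is itself a training point
dists = np.linalg.norm(X - x_new, axis=1)
y_nn1 = y[np.argsort(dists)[:1]].mean()     # 30.0, so zero error
err_nn1 = (y_true - y_nn1) ** 2

# 5. With k=2, the second-closest point is (50, 1) at distance 1, because
#    age and sex live on very different scales (a feature-scaling issue)
y_nn2 = y[np.argsort(dists)[:2]].mean()     # (30 + 14) / 2 = 22.0
err_nn2 = (y_true - y_nn2) ** 2             # 64.0

print(err_lin, err_nn1, err_nn2)
```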

Supervised vs unsupervised learning

Statistical learning methods fall mostly into supervised and unsupervised approaches. So far we have discussed supervised learning, because for each observation $X$ there is an associated value $Y$. In unsupervised learning, we have the observations $X$ but no associated output $Y$.

In unsupervised learning, we want to understand the relationship between the variables $X$. For example, for the diabetes data shown before, we could learn that there are two clusters of people, those that are 50 years old and those who aren’t. We are not looking at $Y$ here, but rather the data in $X$. A challenge with unsupervised learning is that we do not have a clear way of evaluating our method. How do we measure accuracy here?

Some problems do not fall exactly into these two kinds. For example, reinforcement learning is the problem of learning which actions to take in a sequential decision-making setting when we only observe rewards in certain states. For example, once you pass a course with an A, what sequence of actions should you take in your next class to get an A? Another related class of problems is semi-supervised learning, where only some of the observations have an associated output $Y$.

Regression vs classification

When the variable that we are trying to predict is quantitative (progress, age, height, etc.), we talk about regression. When that variable is categorical (gender, product, animal type, etc.), we talk about classification.

When there are only two categories (male or female), a common classification method is logistic regression, where the statistical model is different from what we explored before. Let’s suppose we are trying to predict two classes, $C=1$ or $C=0$. In logistic regression, the statistical model takes the form

$$P(C = 1 \mid X) = \frac{1}{1 + \exp\left(-(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)\right)}.$$
While this form may look arbitrary for now, it is actually very convenient for estimating the parameters $\beta$.
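For instance, with some hypothetical coefficients (the values below are my own, purely for illustration), the model turns a linear score into a probability between 0 and 1:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for two features X1 and X2
b0, b1, b2 = -1.0, 0.5, 2.0
x1, x2 = 2.0, 0.0

# P(C=1 | X): the linear score is -1 + 0.5*2 + 0 = 0, and sigmoid(0) = 0.5
p = sigmoid(b0 + b1 * x1 + b2 * x2)
print(p)  # 0.5
```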


Stochastic gradient descent

Since this course is primarily concerned with big data environments, we will usually be unable to learn the parameters of our models while looking at all of our data at once. One approach that helps overcome this limitation is stochastic gradient descent. This algorithm is a very simple method for finding models that minimize complex non-linear cost functions.

To understand stochastic gradient descent, we first need to take a look at gradient descent. Gradient descent is a method to minimize a function by moving the parameters of that function in the direction opposite to the gradient.

First, we need to define the cost function. Let’s assume a simple linear model with the mean squared error as the loss function:

$$L(\beta) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \beta^T x_i\right)^2$$

The goal now is to move a current estimate of $\beta$ in a direction such that the loss function is minimized:

Image from http://sebastianraschka.com/faq/docs/closed-form-vs-gd.html

For a particular $j$,

$$\frac{\partial L}{\partial \beta_j} = -\frac{2}{n} \sum_{i=1}^{n} \left(y_i - \beta^T x_i\right) x_{ij}$$

The algorithm simply moves the current estimate $\beta^t$ using the following equation

$$\beta_j^{t+1} = \beta_j^t - \eta \left.\frac{\partial L}{\partial \beta_j}\right|_{\beta = \beta^t},$$

where $\eta$ controls the speed at which the estimate changes from iteration to iteration. This is known as the learning rate.
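Putting the pieces together, a minimal gradient-descent loop for the linear model with MSE loss might look like this (a sketch; the learning rate and iteration count below are arbitrary choices that happen to work for this toy data):

```python
import numpy as np

# Toy data exactly following y = 1 + 2x
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])
A = np.hstack([np.ones((X.shape[0], 1)), X])  # column of ones for the intercept
n = len(y)

beta = np.zeros(2)  # initial estimate beta^0
eta = 0.05          # learning rate

for t in range(2000):
    residuals = y - A @ beta
    grad = -(2.0 / n) * A.T @ residuals  # dL/dbeta_j for the MSE loss
    beta = beta - eta * grad             # beta^{t+1} = beta^t - eta * grad

print(beta)  # should approach [1, 2]
```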

Question Can you derive gradient descent for logistic regression? Use the fact that $\frac{d}{dz} \frac{1}{1+\exp(-z)} = \frac{1}{1+\exp(-z)} \left(1-\frac{1}{1+\exp(-z)}\right)$

Now, notice that the gradient descent update requires all the training data $i=1,\dots,n$. However, if we make the learning rate small enough, we can achieve a similar result by looking at batches of data, and therefore we do not need all the data available at each iteration. This is an important advantage that is exploited in big data machine learning and deep learning.