Unit 03 Lab 1: Estimating a parametric model

Part 1: Overview

About This Lab

In this laboratory, we will learn how to define, fit, and use a model in Python.

Learning Outcomes

Upon completing this lab you will be able to:

  - define a parametric model and a mean-squared-error loss function for it
  - fit a model by minimizing its loss with scipy.optimize
  - define, fit, and use models in scikit-learn (linear regression and nearest neighbors)
  - evaluate a model's mean squared error on the training data

Requirements

To complete this lab you will need:

  - a working Python environment (the walk-through assumes a Jupyter notebook)
  - the packages pandas, numpy, scipy, matplotlib, seaborn, and scikit-learn
  - an internet connection, since the dataset is downloaded from its original source
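If any of these packages are missing, here is a minimal setup sketch, assuming a Jupyter environment with pip available (use a plain pip install from a shell otherwise):

%pip install pandas numpy scipy matplotlib seaborn scikit-learn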

Part 2: Walk-Through

The dataset

scikit-learn provides many sample datasets through its datasets module, among them the diabetes dataset we use here. This dataset contains 442 observations with 10 features; we will load it directly from its original source (the URL in the code below). According to the original source, the following is the description of the dataset:

Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline
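As an aside, the same dataset also ships with scikit-learn itself. A minimal sketch, for reference only (note that load_diabetes returns a mean-centered, rescaled version of the features, so its values differ from the raw source we load below):

from sklearn.datasets import load_diabetes

diabetes = load_diabetes()    # a Bunch object with .data and .target
print(diabetes.data.shape)    # (442, 10) -- features are standardized here
print(diabetes.target.shape)  # (442,)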

# Make plots nice
%matplotlib inline
import seaborn as sns
sns.set_style("white")
import matplotlib as mpl
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
# we will use pandas to display the data nicely
import pandas as pd
datasource = pd.read_csv('http://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt', delimiter='\t')

X = pd.DataFrame(datasource.iloc[:, :-1].to_numpy(),
                 columns=['age', 'sex', 'bmi', 'map', 'tc',
                          'ldl', 'hdl', 'tch', 'ltg', 'glu'])
y = pd.DataFrame(datasource.iloc[:, [-1]].to_numpy(),
                 columns=['disease_progression'])
Xy = pd.concat((X, y), axis=1)
print("Number of observations: {}\nNumber of features: {}".format(X.shape[0], X.shape[1]))
Number of observations: 442
Number of features: 10

Let’s take a look at a sample of the observations

X.head()

    age  sex   bmi    map     tc    ldl   hdl  tch     ltg   glu
0  59.0  2.0  32.1  101.0  157.0   93.2  38.0  4.0  4.8598  87.0
1  48.0  1.0  21.6   87.0  183.0  103.2  70.0  3.0  3.8918  69.0
2  72.0  2.0  30.5   93.0  156.0   93.6  41.0  4.0  4.6728  85.0
3  24.0  1.0  25.3   84.0  198.0  131.4  40.0  5.0  4.8903  89.0
4  50.0  1.0  23.0  101.0  192.0  125.4  52.0  4.0  4.2905  80.0

y.head()

   disease_progression
0                  151
1                   75
2                  141
3                  206
4                  135

Let’s estimate the mean age of the sample. First, a look at its distribution:

Xy[['age']].plot(kind='hist');

(Figure: histogram of age)

The maximum likelihood estimate (MLE) of the mean of the distribution that generated age is simply the sample mean:

Xy[['age']].mean()
age    48.5181
dtype: float64
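To see why the sample mean is the MLE here: if we model age as Gaussian, maximizing the likelihood is equivalent to minimizing the sum of squared deviations from the mean parameter. Here is a quick numerical check, a sketch of my own assuming a Gaussian model with fixed variance:

from scipy.optimize import minimize_scalar

ages = Xy['age'].values

def neg_log_likelihood(mu):
    # Up to additive and multiplicative constants, the Gaussian negative
    # log-likelihood in mu is the sum of squared deviations from mu.
    return ((ages - mu) ** 2).sum()

print(minimize_scalar(neg_log_likelihood).x)  # ~48.5181, the sample mean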

Now, let’s plot the relationship between bmi and the outcome, disease progression:

Xy.plot(x='bmi', y='disease_progression', kind='scatter');

(Figure: scatter plot of bmi vs. disease_progression)

Let’s define a parametric model of the relationship between bmi and disease progression as follows:

$$y_i = \beta_0 + \beta_1 x_i \label{eq:disease-equation}$$

where $\beta_0$ and $\beta_1$ are two parameters.

Given these parameters, we can define a loss function that represents how badly we are fitting \eqref{eq:disease-equation} to our dataset. The simplest such loss function is the mean squared error:

$$L(\beta_0, \beta_1) = \frac{1}{n} \sum_{i=1}^{n} \left( \beta_0 + \beta_1 x_i - y_i \right)^2$$

where in our case $y_i$ is disease progression and $x_i$ is bmi.

Let’s define these functions:

def f1(betas, x):
    # Linear model: beta0 + beta1 * x
    beta0, beta1 = betas
    return beta0 + beta1*x

def loss1(betas, x, y):
    # Mean squared error of the model's predictions against y
    return ((f1(betas, x) - y)**2.).mean()

Let’s check that our functions are working properly

f1([0.1, 0.2], Xy.age).head()
0    11.9
1     9.7
2    14.5
3     4.9
4    10.1
Name: age, dtype: float64
loss1([0.1, 0.2], Xy.bmi, Xy.disease_progression)
27388.951443438906

At this point, we can use an optimization procedure to find the optimal $\beta_0$ and $\beta_1$.

import numpy as np
from scipy.optimize import minimize
# define initial solution
betas0 = np.array([0., 0.])
result = minimize(lambda betas: loss1(betas, Xy.bmi, Xy.disease_progression),
        betas0)
print(result.x)
[-117.77561483   10.23321071]

According to our optimization, the best possible solution is

$$\widehat{\text{disease progression}} \approx -117.78 + 10.23 \times \text{bmi}$$

with mean squared error

loss1(result.x, Xy.bmi, Xy.disease_progression)
3890.4565855989276
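As a sanity check (an addition of mine, not part of the original lab): the MSE-minimizing line also has a closed-form solution, so an iterative optimizer is not strictly necessary. numpy's polyfit recovers essentially the same coefficients:

# Degree-1 polynomial fit; returns [slope, intercept]
np.polyfit(Xy.bmi, Xy.disease_progression, 1)  # close to [10.233, -117.773]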

Now let’s plot this solution, along with the data

Xy.plot(x='bmi', y='disease_progression', kind='scatter');
spaced_bmi = np.linspace(Xy.bmi.min(), Xy.bmi.max())
p0 = plt.plot(spaced_bmi, f1(result.x, spaced_bmi), 'r--',
              label='best fit');
plt.legend();

(Figure: scatter of bmi vs. disease_progression with the best-fit line)

Some questions:

Now, let’s explore creating a polynomial expansion of bmi and seeing how much we can reduce the mean squared error. To do this, we need a more general form of the function and of the error computation, using matrix algebra.

Let’s define the vector $\beta$ to be simply the set of parameters; then we can define the linear function $f_\beta$ in terms of matrix multiplication:

$$f_\beta(x) = x^\top \beta \label{eq:linear}$$

where typically $x$ has unity as its first element, acting as the bias. We can define the loss function as

$$L(\beta) = \frac{1}{n} \left\lVert X \beta - y \right\rVert^2 \label{eq:loss}$$

where $X$ is the training set (each row is one observation) and $y$ is a vector with the quantity we want to predict.

Warning: Make sure you understand \eqref{eq:linear} and \eqref{eq:loss}
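To make the matrix form concrete, here is a sketch of how \eqref{eq:linear} and \eqref{eq:loss} translate into code, generalizing f1 and loss1 above (the names f, loss, and X2 are my own; the closed-form lstsq fit is equivalent to minimizing \eqref{eq:loss}):

def f(betas, X):
    # X is an n-by-p design matrix whose first column is all ones (the bias)
    return X @ betas

def loss(betas, X, y):
    # Mean squared error: the matrix form of loss1 above
    return ((f(betas, X) - y) ** 2).mean()

# Example: a degree-2 polynomial expansion of bmi
X2 = np.column_stack([np.ones(len(Xy)), Xy.bmi, Xy.bmi**2])
betas2, *_ = np.linalg.lstsq(X2, Xy.disease_progression, rcond=None)
print(loss(betas2, X2, Xy.disease_progression))  # at most the linear fit's MSE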

Scikit-learn

Now, we will use scikit-learn to perform the same task. Scikit-learn is a general-purpose machine learning package for Python that supports supervised and unsupervised learning. It also has functionality for cleaning up data, for model validation, and for parallel computing.

Many of the models in scikit-learn work as follows:

  1. Define model
  2. Fit data
  3. Predict

Define model

from sklearn.linear_model import LinearRegression
linear_regression_model = LinearRegression()

Fit data

linear_regression_model.fit(X[['bmi']], y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

We can examine the parameters of the model (notice that scikit-learn automatically adds an intercept):

linear_regression_model.intercept_
array([-117.77336657])
linear_regression_model.coef_
array([[ 10.23312787]])

Compared to what we previously found, it is essentially the same (up to optimizer tolerance):

result.x
array([-117.77561483,   10.23321071])

Predict data

(For now, we are predicting on the training data, which is almost always useless as a measure of real performance.)

yhat = linear_regression_model.predict(X[['bmi']])

Now we can compute the model's error (the mean squared error):

((yhat - y)**2).mean()
disease_progression    3890.456585
dtype: float64

This is again the same value we found before:

loss1(result.x, Xy.bmi, Xy.disease_progression)
3890.4565855989276
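For completeness, scikit-learn also ships a helper that computes the same quantity, so you do not have to hand-roll the MSE:

from sklearn.metrics import mean_squared_error

mean_squared_error(y, yhat)  # ~3890.46, same value as loss1 above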

Nearest neighbors with scikit-learn

The same define/fit/predict pattern applies to other models, such as a k-nearest-neighbors regressor, which predicts the average outcome of the k most similar observations:

from sklearn.neighbors import KNeighborsRegressor
nn_model = KNeighborsRegressor(n_neighbors=5)
nn_model.fit(X, y)
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=5, p=2,
          weights='uniform')
yhat2 = nn_model.predict(X)
((yhat2 - y)**2).mean()
disease_progression    3019.888507
dtype: float64
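The training error of the nearest-neighbor model is lower than that of the linear model. To see what the model is doing, here is a small sketch of my own verifying that the prediction for an observation is just the average outcome of its 5 nearest neighbors in feature space (which, on training data, include the observation itself):

# Distances and row indices of the 5 nearest neighbors of the first observation
distances, indices = nn_model.kneighbors(X.iloc[[0]])
print(y.iloc[indices[0]].mean().values)  # average outcome of the neighbors
print(nn_model.predict(X.iloc[[0]]))     # the model's prediction: same value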

Section 2 of the Walk-Through

Test Yourself

Using linear regression, use all the variables in the data and see how much you can reduce the mean squared error.

linear_regression_model2 = LinearRegression()
linear_regression_model2.fit(X, y)
MSE = ?? # <- compute the mean squared error and compare it to the simpler model above

Compare the bmi coefficient of the simpler model with the one from the more complex model.

simple_lr = LinearRegression().fit(X[['bmi']], y)
complex_lr = LinearRegression().fit(X, y)
X.head()

    age  sex   bmi    map     tc    ldl   hdl  tch     ltg   glu
0  59.0  2.0  32.1  101.0  157.0   93.2  38.0  4.0  4.8598  87.0
1  48.0  1.0  21.6   87.0  183.0  103.2  70.0  3.0  3.8918  69.0
2  72.0  2.0  30.5   93.0  156.0   93.6  41.0  4.0  4.6728  85.0
3  24.0  1.0  25.3   84.0  198.0  131.4  40.0  5.0  4.8903  89.0
4  50.0  1.0  23.0  101.0  192.0  125.4  52.0  4.0  4.2905  80.0
bmi_simple = simple_lr.coef_
bmi_complex = complex_lr.coef_[0, 2] # <- bmi is the third column in X

What can you say about that difference? How can you interpret it?

Using nearest neighbors, plot the performance on the training data as you increase the number of neighbors k.
performance = []
k_set = [2, 3, 4, 5, 6, 7, 8, 9, 10]
for k in k_set:
    nn_model = KNeighborsRegressor(n_neighbors=k).fit(X[['bmi']], y)
    yhat = nn_model.predict(X[['bmi']])
    MSE = ((y-yhat)**2).mean()
    performance.append(MSE)
plt.plot(k_set, performance);
plt.xlabel('k');
plt.ylabel('MSE');

(Figure: training MSE as a function of k)

Now, try the same but with all variables

performance = []
k_set = [2, 3, 4, 5, 6, 7, 8, 9, 10]
for k in k_set:
    nn_model = KNeighborsRegressor(n_neighbors=k).fit(X, y)
    yhat = nn_model.predict(X)
    MSE = ((y-yhat)**2).mean()
    performance.append(MSE)
# plot here

Part 3: On Your Own

  1. What is the MSE on the training data of the diabetes dataset when the variables bmi and age are used in a linear regression?
  2. Run several linear regressions using increasingly more variables from X (e.g., starting with the first column, then the first and second, then the first, second, and third, and so on). Plot the training MSE as a function of the number of variables.
  3. Show, with code or a mathematical proof, that nearest neighbors with k=1 always achieves MSE = 0 on the training data.
  4. Can you think of a reason why assessing the accuracy of a model on the training data is a bad idea?