# Unit 03 Lab 1: Estimating a Parametric Model

# Part 1: Overview

## About

In this laboratory, we will learn how to define, fit, and use a model in Python.

## Learning Outcomes

Upon completing this lab you will be able to:

- Define a parametric statistical model
- Define a loss function
- Find the parameters that minimize a loss function
- Interpret the results
- Define a non-parametric model (nearest neighbors)
- Interpret the results of a non-parametric model

## Requirements

To complete this lab you will need:

- Jupyter Notebook
- Scikit-learn

# Part 2: Walk-Through

## The dataset

`sklearn` provides many datasets through its `datasets` module. We will use it to load a sample dataset on diabetes. This dataset contains 442 observations with 10 features (the description of this dataset can be found here). According to the original source, the following is the description of the dataset:

> Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.
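As an aside, the same data can also be loaded directly from the `sklearn.datasets` module; a minimal sketch (note that, in recent versions of scikit-learn, `load_diabetes` returns features that are already centered and scaled, unlike the raw tab-separated file we load below):

```python
from sklearn.datasets import load_diabetes

# load_diabetes bundles the feature matrix and the response together;
# by default the features come back already standardized
diabetes = load_diabetes()
print(diabetes.data.shape)    # → (442, 10): one row per patient
print(diabetes.target.shape)  # → (442,): disease progression after one year
```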

```
# Make plots nice
%matplotlib inline
import seaborn as sns
sns.set_style("white")
import matplotlib as mpl
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
```

```
# we will use pandas to display the data nicely
import pandas as pd

datasource = pd.read_csv('http://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt',
                         delimiter='\t')
X = pd.DataFrame(datasource.iloc[:, :-1].values,
                 columns=['age', 'sex', 'bmi', 'map', 'tc',
                          'ldl', 'hdl', 'tch', 'ltg', 'glu'])
y = pd.DataFrame(datasource.iloc[:, [-1]].values,
                 columns=['disease_progression'])
Xy = pd.concat((X, y), axis=1)
print("Number of observations: {}\nNumber of features {}".format(X.shape[0], X.shape[1]))
```

```
Number of observations: 442
Number of features 10
```

Let’s take a look at a sample of the observations

```
X.head()
```

| | age | sex | bmi | map | tc | ldl | hdl | tch | ltg | glu |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59.0 | 2.0 | 32.1 | 101.0 | 157.0 | 93.2 | 38.0 | 4.0 | 4.8598 | 87.0 |
| 1 | 48.0 | 1.0 | 21.6 | 87.0 | 183.0 | 103.2 | 70.0 | 3.0 | 3.8918 | 69.0 |
| 2 | 72.0 | 2.0 | 30.5 | 93.0 | 156.0 | 93.6 | 41.0 | 4.0 | 4.6728 | 85.0 |
| 3 | 24.0 | 1.0 | 25.3 | 84.0 | 198.0 | 131.4 | 40.0 | 5.0 | 4.8903 | 89.0 |
| 4 | 50.0 | 1.0 | 23.0 | 101.0 | 192.0 | 125.4 | 52.0 | 4.0 | 4.2905 | 80.0 |

```
y.head()
```

| | disease_progression |
|---|---|
| 0 | 151 |
| 1 | 75 |
| 2 | 141 |
| 3 | 206 |
| 4 | 135 |

Let’s look at the distribution of `age` in the sample

```
Xy[['age']].plot(kind='hist');
```

The maximum likelihood estimate (MLE) of the mean of the distribution that generated `age` is the sample mean:

```
Xy[['age']].mean()
```

```
age 48.5181
dtype: float64
```
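To see why the sample mean is the MLE here: under a Gaussian model, maximizing the likelihood is equivalent to minimizing the mean squared deviation from the data, so a numerical optimizer recovers the sample mean. A small sketch on synthetic ages (the values are made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# synthetic ages for illustration only; under a Gaussian model the MLE
# of the location parameter minimizes the mean squared deviation
rng = np.random.default_rng(0)
ages = rng.normal(48.5, 13.0, size=442)

result = minimize(lambda mu: ((ages - mu)**2).mean(), x0=np.array([0.0]))
print(result.x[0], ages.mean())  # the two values should agree closely
```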

Now, let’s plot the relationship between `bmi` and the outcome

```
Xy.plot(x='bmi', y='disease_progression', kind='scatter');
```

Let’s define a parametric model of the relationship between `bmi` and `disease_progression` as follows:

$$\text{disease progression} = f_\beta(\text{bmi}) = \beta_0 + \beta_1 \cdot \text{bmi} \label{eq:disease-equation}$$

where $\beta_0$ and $\beta_1$ are two parameters.

Given these parameters, we can define a loss function that represents how badly we are fitting \eqref{eq:disease-equation} to our dataset. The simplest such loss function is the **mean squared error**:

$$\mathcal{L}(\beta_0, \beta_1) = \frac{1}{n} \sum_{i=1}^{n} \left( f_\beta(x_i) - y_i \right)^2$$

where in our case $y_i$ is `disease_progression` and $x_i$ is `bmi`.

Let’s define this function

```
def f1(betas, x):
    beta0, beta1 = betas
    return beta0 + beta1*x

def loss1(betas, x, y):
    return ((f1(betas, x) - y)**2.).mean()
```

Let’s check that our functions are working properly (here we evaluate `f1` on the `age` column simply as a sanity check)

```
f1([0.1, 0.2], Xy.age).head()
```

```
0 11.9
1 9.7
2 14.5
3 4.9
4 10.1
Name: age, dtype: float64
```

```
loss1([0.1, 0.2], Xy.bmi, Xy.disease_progression)
```

```
27388.951443438906
```

At this point, we can use an optimization procedure to find the optimal $\beta_0$ and $\beta_1$.

```
import numpy as np
from scipy.optimize import minimize

# define initial solution
betas0 = np.array([0., 0.])
result = minimize(lambda betas: loss1(betas, Xy.bmi, Xy.disease_progression),
                  betas0)
print(result.x)
```

```
[-117.77561483 10.23321071]
```

According to our optimization, the best possible solution is

$$\beta_0 \approx -117.78, \qquad \beta_1 \approx 10.23$$

with **mean squared error**

```
loss1(result.x, Xy.bmi, Xy.disease_progression)
```

```
3890.4565855989276
```
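As a sanity check on the optimization route, the same least-squares problem has a closed-form solution; a sketch on synthetic data (the linear relationship and noise level here are made up) comparing `scipy.optimize.minimize` with `numpy.polyfit`:

```python
import numpy as np
from scipy.optimize import minimize

def f1(betas, x):
    beta0, beta1 = betas
    return beta0 + beta1 * x

def loss1(betas, x, y):
    return ((f1(betas, x) - y)**2).mean()

# synthetic data with a known linear relationship plus noise
rng = np.random.default_rng(1)
x = rng.uniform(18, 42, size=200)
y = -100.0 + 10.0 * x + rng.normal(0, 5, size=200)

betas_opt = minimize(lambda b: loss1(b, x, y), np.zeros(2)).x
beta1_cf, beta0_cf = np.polyfit(x, y, deg=1)  # closed-form least squares
print(betas_opt, (beta0_cf, beta1_cf))        # both routes should agree closely
```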

Now let’s plot this solution, along with the data

```
Xy.plot(x='bmi', y='disease_progression', kind='scatter');
spaced_bmi = np.linspace(Xy.bmi.min(), Xy.bmi.max())
p0 = plt.plot(spaced_bmi, f1(result.x, spaced_bmi), 'r--',
              label='best fit');
plt.legend();
```

Some questions:

- Are we sure that the $\beta$s we found are the best possible for the mean squared error?

Now, let’s explore creating a polynomial expansion of `bmi` to see how much we can reduce the mean squared error. To do this, we need a more general form of the function and of the error computation, using matrix algebra.

Let’s define the vector $\beta$ to be the set of parameters; then we can write the linear function $f_\beta$ in terms of matrix multiplication:

$$f_\beta(x) = x^\top \beta \label{eq:linear}$$

where typically the first element of $x$ is 1, acting as the bias. We can define the loss function as

$$\mathcal{L}(\beta) = \frac{1}{n} \lVert X \beta - y \rVert^2 \label{eq:loss}$$

where $X$ is the training set (each row is one observation) and $y$ is a vector with the quantity we want to predict.

Warning
*Make sure you understand \eqref{eq:linear} and \eqref{eq:loss}*
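A minimal sketch of the linear function and loss in matrix form, assuming the design matrix carries a first column of ones for the bias:

```python
import numpy as np

def f_linear(beta, X):
    # f_beta(x) = X beta, with the first column of X equal to 1 (bias)
    return X @ beta

def loss(beta, X, y):
    # mean squared error in matrix form
    residuals = f_linear(beta, X) - y
    return (residuals**2).mean()

# tiny illustration: 3 observations, bias column plus one feature
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0]])
y = np.array([5.0, 7.0, 11.0])           # exactly y = 1 + 2x
print(loss(np.array([1.0, 2.0]), X, y))  # → 0.0 (perfect fit)
```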

## Scikit-learn

Now, we will use `scikit-learn` to perform the same task. `Scikit-learn` is a general machine learning Python package that allows you to do supervised and unsupervised learning. It also has functionality for cleaning up data, for model validation, and for parallel computing.

Many of the models in `scikit-learn` work as follows:

- Define model
- Fit data
- Predict

### Define model

```
from sklearn.linear_model import LinearRegression
```

```
linear_regression_model = LinearRegression()
```

### Fit data

```
linear_regression_model.fit(X[['bmi']], y)
```

```
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
```

We can examine the parameters of the model (notice that `scikit-learn` automatically adds an intercept):

```
linear_regression_model.intercept_
```

```
array([-117.77336657])
```

```
linear_regression_model.coef_
```

```
array([[ 10.23312787]])
```

This is the same as what we previously found with `scipy.optimize.minimize`!

```
result.x
```

```
array([-117.77561483, 10.23321071])
```
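The small remaining discrepancy is numerical: `minimize` stops at a gradient tolerance, whereas `scikit-learn` solves the least-squares problem directly. A quick sketch checking that the two solutions agree up to that tolerance (using the values printed above):

```python
import numpy as np

# parameters from the two fits above
scipy_betas = np.array([-117.77561483, 10.23321071])
sklearn_betas = np.array([-117.77336657, 10.23312787])

# agreement up to optimizer tolerance
print(np.allclose(scipy_betas, sklearn_betas, atol=1e-2))  # → True
```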

### Predict data

(for now, we are predicting on the training data, which is **almost always useless** for judging a model)

```
yhat = linear_regression_model.predict(X[['bmi']])
```

Now we can compute the model’s mean squared error

```
((yhat - y)**2).mean()
```

```
disease_progression 3890.456585
dtype: float64
```

This is again the same value that we found before

```
loss1(result.x, Xy.bmi, Xy.disease_progression)
```

```
3890.4565855989276
```
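To make the earlier caveat about predicting on training data concrete, the usual practice is to hold out a test set; a sketch using `train_test_split` on synthetic data (the data-generating process here is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# synthetic regression data for illustration: y = 3x plus unit-variance noise
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=200)

# fit on one part of the data, evaluate on the held-out part
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(((model.predict(X_test) - y_test)**2).mean())  # held-out MSE, near the noise variance
```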

### Nearest neighbors with scikit-learn

```
from sklearn.neighbors import KNeighborsRegressor
```

```
nn_model = KNeighborsRegressor(n_neighbors=5)
```

```
nn_model.fit(X, y)
```

```
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform')
```

```
yhat2 = nn_model.predict(X)
```

```
((yhat2 - y)**2).mean()
```

```
disease_progression 3019.888507
dtype: float64
```
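Incidentally, the manual MSE computation used throughout matches `sklearn.metrics.mean_squared_error`; a quick sketch on made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# made-up targets and predictions for illustration
y_true = np.array([151.0, 75.0, 141.0])
y_pred = np.array([150.0, 80.0, 140.0])

manual = ((y_true - y_pred)**2).mean()
print(manual, mean_squared_error(y_true, y_pred))  # → 9.0 9.0
```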

## Section 2 of Walk-Through

## Test Yourself

##### Using linear regression with all variables in the data, see how much you can reduce the mean squared error

```
linear_regression_model2 = LinearRegression()
linear_regression_model2.fit(X, y)
MSE = ?? # <- compute mean squared error and compare to simpler model above
```

```
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
```

##### Compare the bmi coefficient of the simpler model with the one from the more complex model

```
simple_lr = LinearRegression().fit(X[['bmi']], y)
complex_lr = LinearRegression().fit(X, y)
```

```
X.head()
```

| | age | sex | bmi | map | tc | ldl | hdl | tch | ltg | glu |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59.0 | 2.0 | 32.1 | 101.0 | 157.0 | 93.2 | 38.0 | 4.0 | 4.8598 | 87.0 |
| 1 | 48.0 | 1.0 | 21.6 | 87.0 | 183.0 | 103.2 | 70.0 | 3.0 | 3.8918 | 69.0 |
| 2 | 72.0 | 2.0 | 30.5 | 93.0 | 156.0 | 93.6 | 41.0 | 4.0 | 4.6728 | 85.0 |
| 3 | 24.0 | 1.0 | 25.3 | 84.0 | 198.0 | 131.4 | 40.0 | 5.0 | 4.8903 | 89.0 |
| 4 | 50.0 | 1.0 | 23.0 | 101.0 | 192.0 | 125.4 | 52.0 | 4.0 | 4.2905 | 80.0 |

```
bmi_simple = simple_lr.coef_[0, 0]   # <- the simple model has bmi as its only feature
bmi_complex = complex_lr.coef_[0, 2] # <- bmi is the third column in X
```

What can you say about that difference? How can you interpret it?

##### Using nearest neighbors, plot the performance on the training data as you increase k

```
performance = []
k_set = [2, 3, 4, 5, 6, 7, 8, 9, 10]
for k in k_set:
    nn_model = KNeighborsRegressor(n_neighbors=k).fit(X[['bmi']], y)
    yhat = nn_model.predict(X[['bmi']])
    MSE = ((y - yhat)**2).mean()
    performance.append(MSE)
```

```
plt.plot(k_set, performance);
plt.xlabel('k');
plt.ylabel('MSE');
```

Now, try the same but with all variables

```
performance = []
k_set = [2, 3, 4, 5, 6, 7, 8, 9, 10]
for k in k_set:
    nn_model = KNeighborsRegressor(n_neighbors=k).fit(X, y)
    yhat = nn_model.predict(X)
    MSE = ((y - yhat)**2).mean()
    performance.append(MSE)
```

```
# plot here
```

# Part 3 On Your Own

- What is the MSE in the training data of the diabetes dataset when variables BMI and AGE are used in linear regression?
- Run several linear regressions using increasingly many variables from X (e.g., start with the first column only, then the first and second, then the first, second, and third, and so on). Plot the training MSE as a function of the number of variables.
- Show with code or a mathematical proof that nearest neighbors with k=1 always has MSE = 0 on the training data (assuming no two observations share the same features but have different outcomes)
- Can you think of a reason why assessing the accuracy of a model in the training data is a bad idea?