Unit 04: Lab: Assessing accuracy

Topics

Introduction

In this lab, we will use the diabetes dataset to practice assessing the accuracy of regression and classification models with training, validation, and test sets

# Make plots nice
%matplotlib inline
import seaborn as sns
sns.set_style("white")
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings("ignore")

# we will use pandas to display the data nicely
import pandas as pd
datasource = pd.read_csv('http://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt', delimiter='\t')

X = pd.DataFrame(datasource.iloc[:, :-1].values,
                 columns = ['age', 'sex', 'bmi', 'map', 'tc', 
                            'ldl', 'hdl', 'tch', 'ltg', 'glu'])
y = pd.DataFrame(datasource.iloc[:, [-1]].values,
                 columns=['disease_progression'])
Xy = pd.concat((X, y), axis=1)
print ("Number of observations: {}\nNumber of features {}".\
    format(X.shape[0], X.shape[1]))
Number of observations: 442
Number of features 10
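
The combined frame Xy is handy for a quick look at the features alongside the target:

Xy.head()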

Data preparation

First, we will use linear regression with MSE as our generalization measure
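
As a reminder, for n observations with true values y_i and predictions ŷ_i, the mean squared error is MSE = (1/n) Σ_i (y_i − ŷ_i)², the average squared prediction error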

from sklearn.linear_model import LinearRegression

Let’s split the data into three parts: training, validation, and test. Sklearn doesn’t have a function to split into three parts at once, so we will need to apply train_test_split twice

from sklearn.model_selection import train_test_split
X_train_validation, X_test, y_train_validation, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0)

Now let’s split the combined training+validation data into separate training and validation sets

X_train, X_validation, y_train, y_validation = train_test_split(X_train_validation, 
                                                                y_train_validation, 
                                                                test_size = 0.2, 
                                                                random_state = 0)
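
As a quick sanity check, the two-stage split leaves roughly 72% of the observations for training, 18% for validation, and 10% for testing:

# Sizes of the train / validation / test splits
print(len(X_train), len(X_validation), len(X_test))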

1) Model fitting

First, let’s compare two models: one with BMI and another one with BMI and age

model1 = LinearRegression().fit(X_train[['bmi']], y_train)
model2 = LinearRegression().fit(X_train[['bmi', 'age']], y_train)
# Let's compute the training MSE
model1_train_err = ((model1.predict(X_train[['bmi']]) - y_train)**2).mean()[0]
model2_train_err = ((model2.predict(X_train[['bmi', 'age']]) - y_train)**2).mean()[0]

Obviously, with more variables we are able to fit the training data better

[model1_train_err, model2_train_err]
[3723.4087871665097, 3680.6865634023352]

2) Model selection

What about validation MSE?

model1_validation_err = ((model1.predict(X_validation[['bmi']]) - y_validation)**2).mean()[0]
model2_validation_err = ((model2.predict(X_validation[['bmi', 'age']]) - y_validation)**2).mean()[0]

Interestingly, we found that adding age makes the validation error worse

[model1_validation_err, model2_validation_err]
[4582.2887294802758, 4618.3730480775757]

Therefore, the model selection step chooses model 1 (the simpler model!)

3) Model assessment

Finally, we assess the performance of the selected model on the held-out test set

model1_test_err = ((model1.predict(X_test[['bmi']]) - y_test)**2).mean()[0]
model1_test_err
3925.4228455751909

This means that we estimate the expected test error to be MSE = 3925
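
To put this on the scale of the response, it helps to look at the root mean squared error, i.e. a typical prediction error in units of disease progression:

# RMSE: typical size of a prediction error on the test set
print("RMSE: ", np.sqrt(model1_test_err))   # roughly 63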

More sophisticated variable selection

Instead of comparing a couple of models by hand, we can greedily add, one at a time, the variable that most reduces the validation MSE (forward stepwise selection), stopping as soon as no remaining variable improves it:

# Compute null MSE
current_MSE = ((y_train - y_train.mean())**2).mean()[0]
current_model_vars = set()
p = X_train.shape[1]

for k in range(p):
    possible_next_vars = list(set(range(p)) - current_model_vars)
    next_model_vars = current_model_vars
    for next_var in possible_next_vars:
        candidate_vars = list(current_model_vars) + [next_var]
        model = LinearRegression().fit(X_train.iloc[:, candidate_vars], y_train)
        model_validation_err = \
            ((model.predict(X_validation.iloc[:, candidate_vars]) - y_validation)**2).mean()[0]
        if model_validation_err < current_MSE:
            next_model_vars = current_model_vars.union(set([next_var]))
            current_MSE = model_validation_err
    if next_model_vars == current_model_vars:
        break
    else:        
        print("Selecting ", str(X_train.columns[list(next_model_vars - current_model_vars)]))
        current_model_vars = next_model_vars

Selecting  Index(['ltg'], dtype='object')
Selecting  Index(['bmi'], dtype='object')
Selecting  Index(['map'], dtype='object')
Selecting  Index(['tc'], dtype='object')
Selecting  Index(['glu'], dtype='object')
Selecting  Index(['age'], dtype='object')

Final model

X_train.columns[list(current_model_vars)]
Index(['age', 'bmi', 'map', 'tc', 'ltg', 'glu'], dtype='object')

Evaluate performance of model

model = LinearRegression().fit(X_train.iloc[:, list(current_model_vars)], y_train)
((model.predict(X_test.iloc[:, list(current_model_vars)]) - y_test)**2).mean()[0]
3324.2972729191706

Compared to the test error of the simpler model, this is noticeably smaller!

model1_test_err
3925.4228455751909
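
As an aside, newer versions of scikit-learn (0.24+) provide a SequentialFeatureSelector that automates a similar greedy forward search, scoring candidates by cross-validation on the training data rather than a single validation split. A minimal sketch (the number of features to keep is set here just to mirror the search above):

from sklearn.feature_selection import SequentialFeatureSelector

# Greedy forward selection scored by (negative) MSE under 5-fold cross-validation
selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=6,
                                     direction='forward',
                                     scoring='neg_mean_squared_error',
                                     cv=5)
selector.fit(X_train, y_train.values.ravel())
print(X_train.columns[selector.get_support()])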

Bias and variance decomposition

First, we will show that as the model becomes more complex (here, as we add more variables), the training error decreases, but the test error does not necessarily decrease. We will use sklearn built-in helpers to split the data into training and testing, leaving 10% of the data for testing

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
# We will create spurious features
Xpoly = StandardScaler().fit_transform(PolynomialFeatures(degree=2).fit_transform(X))

Now, let’s increase the complexity of the model by fitting it on larger and larger random subsets of the polynomial features

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
all_mse_train = []
all_mse_test = []
for _ in range(100):
    X_train, X_test, y_train, y_test = train_test_split(Xpoly, y.values, test_size = 0.1)
    mse_train = []
    mse_test = []
    for k in range(X_train.shape[1]):
        idx_vars = np.random.choice(range(X_train.shape[1]), size=k+1, replace=False)
        lm = LinearRegression().fit(X_train[:, idx_vars], y_train)
        mse_train.append(((y_train - lm.predict(X_train[:, idx_vars]))**2).mean())
        mse_test.append(((y_test - lm.predict(X_test[:, idx_vars]))**2).mean())
    all_mse_train.append(mse_train)
    all_mse_test.append(mse_test)
all_mse_train = np.array(all_mse_train)
all_mse_test = np.array(all_mse_test)
plt.plot(range(X_train.shape[1]), all_mse_train.mean(0), label='train');
plt.plot(range(X_train.shape[1]), all_mse_test.mean(0), label='test');
plt.legend();
plt.xlabel('variables');
plt.ylabel('MSE');

[Figure: training and test MSE as the number of variables increases]

We can also study how the training and test errors behave when we fix the model and increase the training size (a learning curve)

from sklearn.model_selection import learning_curve
def MSE_score(estimator, X, y):
    return ((estimator.predict(X) - y)**2).mean()[0]
train_sizes, train_scores, valid_scores = learning_curve(LinearRegression(), X[['bmi']], y, scoring=MSE_score, cv  = 10)
train_sizes2, train_scores2, valid_scores2 = learning_curve(LinearRegression(), X, y, scoring=MSE_score, cv  = 10)
plt.plot(train_sizes, train_scores.mean(1), 'b', label='train (only bmi)');
plt.plot(train_sizes, valid_scores.mean(1), 'g', label='test (only bmi)');
plt.plot(train_sizes2, train_scores2.mean(1), 'b--', label='train (all)');
plt.plot(train_sizes2, valid_scores2.mean(1), 'g--', label='test (all)');
plt.legend();
plt.xlabel('training size');
plt.ylabel('MSE');

[Figure: learning curves: training and test MSE vs. training size, for the bmi-only model and the model with all variables]

Classification

We will now change the problem to a classification setting. Let’s suppose that a disease progression of 260 or more is really bad. Then, let’s build a classifier that predicts whether a person will have a really bad disease progression

y_class = y >= 260

Unfortunately, the classes are quite unbalanced: only about 12% of the people have a bad disease progression

print("People with bad disease progression ", y_class.mean()[0])
People with bad disease progression  0.119909502262

Therefore, a dummy predictor that always predicts the majority class is expected to reach about 88% accuracy!
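
We can confirm this baseline with scikit-learn's DummyClassifier, which simply predicts the most frequent class (a sketch; the split below mirrors the one used for the logistic regression next):

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Baseline that always predicts the most frequent class ("not bad progression")
X_tr, X_te, y_tr, y_te = train_test_split(X, y_class, test_size=0.1, random_state=0)
dummy = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr.values.ravel())
print("Dummy accuracy: ", dummy.score(X_te, y_te.values.ravel()))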

Let’s use a logistic regression model to predict bad disease progression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
model = LogisticRegression(C=1000.) # a large C means very weak regularization; don't worry about it for now
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y_class, test_size = 0.1, random_state=0)
model.fit(X_train, y_train.values.ravel());  # ravel() avoids the column-vector warning
probability_sick = model.predict_proba(X_test)[:, 1]

Let’s play with a threshold of 50%

prediction_thresh_50 = probability_sick > 0.5
confusion_matrix(y_test, prediction_thresh_50)[::-1, ::-1]  # reordered so the layout is [[TP, FN], [FP, TN]]
array([[ 3,  2],
       [ 2, 38]])

What about a threshold of 1%? Can we capture all bad disease progressions?

prediction_thresh_1 = probability_sick > 0.01
# yes, we can!
confusion_matrix(y_test, prediction_thresh_1)[::-1, ::-1]
array([[ 5,  0],
       [30, 10]])
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, probability_sick)
plt.plot(fpr, tpr);
plt.xlabel('False positive rate (1 - specificity)');
plt.ylabel('True positive rate (sensitivity)');

[Figure: ROC curve (true positive rate vs. false positive rate)]

# area under the curve
from sklearn.metrics import auc
print("Area under the curve: ", auc(fpr, tpr))
Area under the curve:  0.9
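
Equivalently, roc_auc_score computes the same area directly from the labels and the predicted probabilities:

from sklearn.metrics import roc_auc_score
print("Area under the curve: ", roc_auc_score(y_test.values.ravel(), probability_sick))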

Bayesian classifier

We will estimate the specificity and sensitivity of the threshold-50 classifier

confusion_matrix(y_test, prediction_thresh_50)[::-1, ::-1]
array([[ 3,  2],
       [ 2, 38]])
specificity = (prediction_thresh_50[(y_test.values == False).flatten()] == False).sum()/(y_test == False).sum()[0]
sensitivity = prediction_thresh_50[y_test.values.flatten()].sum()/y_test.sum()[0]
print("Specificity: ", specificity, "Sensitivity: ", sensitivity)
Specificity:  0.95 Sensitivity:  0.6
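
The same quantities can be read directly off the (unreversed) confusion matrix, which is a bit less error-prone than the boolean indexing above:

# confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels
tn, fp, fn, tp = confusion_matrix(y_test, prediction_thresh_50).ravel()
print("Specificity: ", tn / (tn + fp), "Sensitivity: ", tp / (tp + fn))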

Very high specificity! If we predict that someone will have bad disease progression, they probably will, right?

If we assume that the prevalence rate is 10%, then not necessarily: by Bayes’ theorem, the probability of disease given a positive prediction is only about 57%, quite different from “specificity 95%”!
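
A quick Bayes’ theorem calculation makes this concrete (a sketch using the sensitivity and specificity estimated above and the assumed 10% prevalence):

# P(disease | positive) = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
prevalence = 0.10               # assumed prevalence of bad disease progression
sens, spec = 0.6, 0.95          # estimated above from the threshold-50 classifier
ppv = sens * prevalence / (sens * prevalence + (1 - spec) * (1 - prevalence))
print("P(disease | positive prediction): ", ppv)   # about 0.57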