Default of Credit Card Clients


https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

Abstract

This research examined the case of customers' default payments in Taiwan and compared the predictive accuracy of the probability of default among six data mining methods.

Attributes: 24

Date Donated: 2016-01-26

Associated Tasks: Classification

Missing Values? N/A

Number of Web Hits: 29657

Data Set Information:

This research examined the case of customers' default payments in Taiwan and compared the predictive accuracy of the probability of default among six data mining methods. From the perspective of risk management, the predictive accuracy of the estimated probability of default is more valuable than the binary classification result - credible or not credible clients. Because the real probability of default is unknown, this study presented the novel Sorting Smoothing Method to estimate the real probability of default. With the real probability of default as the response variable (Y), and the predictive probability of default as the independent variable (X), the simple linear regression result (Y = A + BX) shows that the forecasting model produced by the artificial neural network has the highest coefficient of determination; its regression intercept (A) is close to zero, and its regression coefficient (B) close to one. Therefore, among the six data mining techniques, the artificial neural network is the only one that can accurately estimate the real probability of default.
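The regression check described above is easy to reproduce in miniature. The sketch below is illustrative only: the predicted probabilities X and the "real" probabilities Y are synthetic stand-ins (in the paper, Y comes from the Sorting Smoothing Method); it fits the simple linear regression Y = A + BX and reports how close A is to zero and B to one.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

# Hypothetical probabilities, only to illustrate the evaluation scheme:
# X = a model's predicted probability of default, Y = the "real" probability
X_prob = rng.uniform(0, 1, size=(200, 1))
Y_prob = 0.02 + 0.95 * X_prob[:, 0] + rng.normal(0, 0.05, size=200)

lin = LinearRegression().fit(X_prob, Y_prob)
print("intercept A = %.3f" % lin.intercept_)    # close to 0 for a good model
print("coefficient B = %.3f" % lin.coef_[0])    # close to 1 for a good model
print("R^2 = %.3f" % lin.score(X_prob, Y_prob))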

Attribute Information:

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:

X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit. 
X2: Gender (1 = male; 2 = female). 
X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others). 
X4: Marital status (1 = married; 2 = single; 3 = others). 
X5: Age (year). 
X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .; X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above. 
X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005. 
X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .; X23 = amount paid in April, 2005.
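For convenience, the X1-X23 codes above can be tied back to the column names used in the .xls file. The mapping below is written out here for reference (it is not part of the original data); note that the spreadsheet names the repayment-status columns PAY_0, PAY_2, ..., PAY_6, with no PAY_1.

# Map the paper's attribute codes to the spreadsheet's column names
attribute_map = {'X1': 'LIMIT_BAL', 'X2': 'SEX', 'X3': 'EDUCATION',
                 'X4': 'MARRIAGE', 'X5': 'AGE'}
pay_cols = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
attribute_map.update({'X%d' % (6 + i): c for i, c in enumerate(pay_cols)})
attribute_map.update({'X%d' % (12 + i): 'BILL_AMT%d' % (1 + i) for i in range(6)})
attribute_map.update({'X%d' % (18 + i): 'PAY_AMT%d' % (1 + i) for i in range(6)})

print(attribute_map['X6'])   # PAY_0, the September 2005 repayment status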
In [1]:
import os
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import pandas as pd
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn import cross_validation, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from time import time
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import (roc_auc_score, classification_report,
                             precision_score, recall_score, accuracy_score)
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
In [2]:
# read .csv from provided dataset
xls_filename="default of credit card clients.xls"

# df=pd.read_csv(csv_filename,index_col=0)
df=pd.read_excel(xls_filename, skiprows=1)
In [3]:
df.head()
Out[3]:
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
0 1 20000 2 2 1 24 2 2 -1 -1 ... 0 0 0 0 689 0 0 0 0 1
1 2 120000 2 2 2 26 -1 2 0 0 ... 3272 3455 3261 0 1000 1000 1000 0 2000 1
2 3 90000 2 2 2 34 0 0 0 0 ... 14331 14948 15549 1518 1500 1000 1000 1000 5000 0
3 4 50000 2 2 1 37 0 0 0 0 ... 28314 28959 29547 2000 2019 1200 1100 1069 1000 0
4 5 50000 1 2 1 57 -1 0 -1 0 ... 20940 19146 19131 2000 36681 10000 9000 689 679 0

5 rows × 25 columns

In [4]:
df.columns
Out[4]:
Index([u'ID', u'LIMIT_BAL', u'SEX', u'EDUCATION', u'MARRIAGE', u'AGE',
       u'PAY_0', u'PAY_2', u'PAY_3', u'PAY_4', u'PAY_5', u'PAY_6',
       u'BILL_AMT1', u'BILL_AMT2', u'BILL_AMT3', u'BILL_AMT4', u'BILL_AMT5',
       u'BILL_AMT6', u'PAY_AMT1', u'PAY_AMT2', u'PAY_AMT3', u'PAY_AMT4',
       u'PAY_AMT5', u'PAY_AMT6', u'default payment next month'],
      dtype='object')
In [5]:
features=list(df.columns[1:-1])
X=df[features]
y = df['default payment next month']

# split dataset to 60% training and 40% testing
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.4, random_state=0)
In [6]:
print X_train.shape, y_train.shape
(18000, 23) (18000L,)

Feature importances with forests of trees

This example shows the use of forests of trees to evaluate the importance of features on the classification task. The red bars are the feature importances of the forest, along with their inter-tree variability.

In [7]:
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

from sklearn.ensemble import ExtraTreesClassifier

# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=0)

forest.fit(X, y)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. feature %d - %s (%f) " % (f + 1, indices[f], features[indices[f]], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure(num=None, figsize=(14, 10), dpi=80, facecolor='w', edgecolor='k')
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()
Feature ranking:
1. feature 5 - PAY_0 (0.096054) 
2. feature 4 - AGE (0.066378) 
3. feature 0 - LIMIT_BAL (0.065619) 
4. feature 6 - PAY_2 (0.050899) 
5. feature 11 - BILL_AMT1 (0.050399) 
6. feature 12 - BILL_AMT2 (0.046844) 
7. feature 22 - PAY_AMT6 (0.046461) 
8. feature 13 - BILL_AMT3 (0.045059) 
9. feature 14 - BILL_AMT4 (0.044771) 
10. feature 16 - BILL_AMT6 (0.044064) 
11. feature 17 - PAY_AMT1 (0.043888) 
12. feature 15 - BILL_AMT5 (0.043550) 
13. feature 21 - PAY_AMT5 (0.042619) 
14. feature 18 - PAY_AMT2 (0.042169) 
15. feature 19 - PAY_AMT3 (0.042101) 
16. feature 20 - PAY_AMT4 (0.040959) 
17. feature 7 - PAY_3 (0.033914) 
18. feature 2 - EDUCATION (0.032599) 
19. feature 8 - PAY_4 (0.030316) 
20. feature 10 - PAY_6 (0.029489) 
21. feature 9 - PAY_5 (0.028565) 
22. feature 3 - MARRIAGE (0.021776) 
23. feature 1 - SEX (0.011505) 
In [8]:
importances[indices[:5]]
Out[8]:
array([ 0.09605437,  0.06637819,  0.06561883,  0.05089928,  0.0503994 ])
In [9]:
for f in range(5):
    print("%d. feature %d - %s (%f)" % (f + 1, indices[f], features[indices[f]] ,importances[indices[f]]))
1. feature 5 - PAY_0 (0.096054)
2. feature 4 - AGE (0.066378)
3. feature 0 - LIMIT_BAL (0.065619)
4. feature 6 - PAY_2 (0.050899)
5. feature 11 - BILL_AMT1 (0.050399)
In [10]:
best_features = []
for i in indices[:5]:
    best_features.append(features[i])
In [11]:
# Plot the top 5 feature importances of the forest
plt.figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
plt.title("Feature importances")
plt.bar(range(5), importances[indices][:5], 
       color="r",  yerr=std[indices][:5], align="center")
plt.xticks(range(5), best_features)
plt.xlim([-1, 5])
plt.show()

Decision Tree accuracy and time elapsed calculation

In [12]:
t0=time()
print "DecisionTree"

#dt = DecisionTreeClassifier(min_samples_split=1,random_state=99)
dt = DecisionTreeClassifier(min_samples_split=20,max_depth=5,random_state=99)

clf_dt=dt.fit(X_train,y_train)

print "Acurracy: ", clf_dt.score(X_test,y_test)
t1=time()
print "time elapsed: ", t1-t0
DecisionTree
Acurracy:  0.8205
time elapsed:  0.273999929428

cross validation for DT

In [13]:
tt0=time()
print "cross result========"
scores = cross_validation.cross_val_score(dt, X,y, cv=5)
print scores
print scores.mean()
tt1=time()
print "time elapsed: ", tt1-tt0
print "\n"
cross result========
[ 0.80736544  0.81233333  0.82033333  0.82616667  0.82763794]
0.818767342417
time elapsed:  2.09200000763


Tuning our hyperparameters using GridSearch

In [14]:
from sklearn.metrics import classification_report

pipeline = Pipeline([
    ('clf', DecisionTreeClassifier(criterion='entropy'))
])

parameters = {
    'clf__max_depth': (5, 25 , 50),
    'clf__min_samples_split': (1, 5, 10),
    'clf__min_samples_leaf': (1, 2, 3)
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='f1')
grid_search.fit(X_train, y_train)

print 'Best score: %0.3f' % grid_search.best_score_
print 'Best parameters set:'

best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print '\t%s: %r' % (param_name, best_parameters[param_name])

predictions = grid_search.predict(X_test)

print classification_report(y_test, predictions)
Fitting 3 folds for each of 27 candidates, totalling 81 fits
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  81 out of  81 | elapsed:  1.2min finished
Best score: 0.468
Best parameters set:
	clf__max_depth: 5
	clf__min_samples_leaf: 1
	clf__min_samples_split: 1
             precision    recall  f1-score   support

          0       0.84      0.95      0.89      9408
          1       0.67      0.33      0.44      2592

avg / total       0.80      0.82      0.80     12000

Random Forest accuracy and time elapsed calculation

In [15]:
t2=time()
print "RandomForest"
rf = RandomForestClassifier(n_estimators=100,n_jobs=-1)
clf_rf = rf.fit(X_train,y_train)
print "Acurracy: ", clf_rf.score(X_test,y_test)
t3=time()
print "time elapsed: ", t3-t2
RandomForest
Acurracy:  0.820083333333
time elapsed:  3.0

cross validation for RF

In [16]:
tt2=time()
print "cross result========"
scores = cross_validation.cross_val_score(rf, X,y, cv=5)
print scores
print scores.mean()
tt3=time()
print "time elapsed: ", tt3-tt2
print "\n"
cross result========
[ 0.80619897  0.80716667  0.81816667  0.82633333  0.8198033 ]
0.815533786811
time elapsed:  19.9739999771


Receiver Operating Characteristic (ROC) curve

In [17]:
roc_auc_score(y_test,rf.predict(X_test))
Out[17]:
0.6550807823129251
In [18]:
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

predictions = rf.predict_proba(X_test)

false_positive_rate, recall, thresholds = roc_curve(y_test, predictions[:, 1])
roc_auc = auc(false_positive_rate, recall)
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')

plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()

Tuning Models using GridSearch

In [19]:
pipeline2 = Pipeline([
('clf', RandomForestClassifier(criterion='entropy'))
])

parameters = {
    'clf__n_estimators': (25, 50, 100),
    'clf__max_depth': (5, 25 , 50),
    'clf__min_samples_split': (1, 5, 10),
    'clf__min_samples_leaf': (1, 2, 3)
}

grid_search = GridSearchCV(pipeline2, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)

grid_search.fit(X_train, y_train)

print 'Best score: %0.3f' % grid_search.best_score_

print 'Best parameters set:'
best_parameters = grid_search.best_estimator_.get_params()

for param_name in sorted(parameters.keys()):
    print '\t%s: %r' % (param_name, best_parameters[param_name])

predictions = grid_search.predict(X_test)
print 'Accuracy:', accuracy_score(y_test, predictions)
print classification_report(y_test, predictions)
    
Fitting 3 folds for each of 81 candidates, totalling 243 fits
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   34.2s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  3.6min
[Parallel(n_jobs=-1)]: Done 243 out of 243 | elapsed:  5.1min finished
Best score: 0.815
Best parameters set:
	clf__max_depth: 25
	clf__min_samples_leaf: 2
	clf__min_samples_split: 5
	clf__n_estimators: 100
Accuracy: 0.822833333333
             precision    recall  f1-score   support

          0       0.84      0.95      0.89      9408
          1       0.66      0.36      0.47      2592

avg / total       0.81      0.82      0.80     12000

OOB Errors for Random Forests

In [62]:
import matplotlib.pyplot as plt

from collections import OrderedDict
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

RANDOM_STATE = 123

# NOTE: Setting the `warm_start` construction parameter to `True` disables
# support for parallelised ensembles but is necessary for tracking the OOB
# error trajectory during training.

ensemble_clfs = [
    ("RandomForestClassifier, max_features='sqrt'",
        RandomForestClassifier(warm_start=True, oob_score=True,
                               max_features="sqrt",
                               random_state=RANDOM_STATE)),
    ("RandomForestClassifier, max_features='log2'",
        RandomForestClassifier(warm_start=True, max_features='log2',
                               oob_score=True,
                               random_state=RANDOM_STATE)),
    ("RandomForestClassifier, max_features=None",
        RandomForestClassifier(warm_start=True, max_features=None,
                               oob_score=True,
                               random_state=RANDOM_STATE))
]

# Map a classifier name to a list of (<n_estimators>, <error rate>) pairs.
error_rate = OrderedDict((label, []) for label, _ in ensemble_clfs)

# Range of `n_estimators` values to explore.
min_estimators = 15
max_estimators = 175

for label, clf in ensemble_clfs:
    for i in range(min_estimators, max_estimators + 1):
        clf.set_params(n_estimators=i)
        clf.fit(X, y)

        # Record the OOB error for each `n_estimators=i` setting.
        oob_error = 1 - clf.oob_score_
        error_rate[label].append((i, oob_error))

# Generate the "OOB error rate" vs. "n_estimators" plot.
for label, clf_err in error_rate.items():
    xs, ys = zip(*clf_err)
    plt.plot(xs, ys, label=label)

plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.legend(loc="upper right")
plt.show()
C:\Miniconda2\lib\site-packages\sklearn\ensemble\forest.py:403: UserWarning: Some inputs do not have OOB scores. This probably means too few trees were used to compute any reliable oob estimates.
  warn("Some inputs do not have OOB scores. "

Naive Bayes accuracy and time elapsed calculation

In [22]:
t4=time()
print "NaiveBayes"
nb = BernoulliNB()
clf_nb=nb.fit(X_train,y_train)
print "Acurracy: ", clf_nb.score(X_test,y_test)
t5=time()
print "time elapsed: ", t5-t4
NaiveBayes
Acurracy:  0.78775
time elapsed:  0.0349998474121

cross-validation for NB

In [23]:
tt4=time()
print "cross result========"
scores = cross_validation.cross_val_score(nb, X,y, cv=5)
print scores
print scores.mean()
tt5=time()
print "time elapsed: ", tt5-tt4
print "\n"
cross result========
[ 0.76703883  0.75633333  0.76783333  0.7835      0.79346558]
0.773634214225
time elapsed:  0.482000112534


KNN accuracy and time elapsed calculation

In [24]:
t6=time()
print "KNN"
# knn = KNeighborsClassifier(n_neighbors=3)
knn = KNeighborsClassifier()
clf_knn=knn.fit(X_train, y_train)
print "Acurracy: ", clf_knn.score(X_test,y_test) 
t7=time()
print "time elapsed: ", t7-t6
KNN
Acurracy:  0.75575
time elapsed:  2.82899999619

cross validation for KNN

In [25]:
tt6=time()
print "cross result========"
scores = cross_validation.cross_val_score(knn, X,y, cv=5)
print scores
print scores.mean()
tt7=time()
print "time elapsed: ", tt7-tt6
print "\n"
cross result========
[ 0.74987502  0.75016667  0.752       0.76166667  0.76312719]
0.755367108406
time elapsed:  8.23200011253


In [26]:
from sklearn.cross_validation import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn import grid_search

knn = KNeighborsClassifier()

parameters = {'n_neighbors': (10, 15, 25)}

grid = grid_search.GridSearchCV(knn, parameters, n_jobs=-1, verbose=1, scoring='accuracy')


grid.fit(X_train, y_train)

print 'Best score: %0.3f' % grid.best_score_

print 'Best parameters set:'
best_parameters = grid.best_estimator_.get_params()

for param_name in sorted(parameters.keys()):
    print '\t%s: %r' % (param_name, best_parameters[param_name])
    
predictions = grid.predict(X_test)
print classification_report(y_test, predictions)
Fitting 3 folds for each of 3 candidates, totalling 9 fits
Best score: 0.770
Best parameters set:
	n_neighbors: 25
             precision    recall  f1-score   support

          0       0.80      0.97      0.87      9408
          1       0.46      0.10      0.16      2592

avg / total       0.72      0.78      0.72     12000

[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:   21.8s finished

Ensemble Learning

In [27]:
from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin
from sklearn.preprocessing import LabelEncoder
from sklearn.externals import six
from sklearn.base import clone
from sklearn.pipeline import _name_estimators
import numpy as np
import operator


class MajorityVoteClassifier(BaseEstimator, 
                             ClassifierMixin):
    """ A majority vote ensemble classifier

    Parameters
    ----------
    classifiers : array-like, shape = [n_classifiers]
      Different classifiers for the ensemble

    vote : str, {'classlabel', 'probability'} (default='classlabel')
      If 'classlabel' the prediction is based on the argmax of
        class labels. Else if 'probability', the argmax of
        the sum of probabilities is used to predict the class label
        (recommended for calibrated classifiers).

    weights : array-like, shape = [n_classifiers], optional (default=None)
      If a list of `int` or `float` values are provided, the classifiers
      are weighted by importance; Uses uniform weights if `weights=None`.

    """
    def __init__(self, classifiers, vote='classlabel', weights=None):

        self.classifiers = classifiers
        self.named_classifiers = {key: value for key, value
                                  in _name_estimators(classifiers)}
        self.vote = vote
        self.weights = weights

    def fit(self, X, y):
        """ Fit classifiers.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Matrix of training samples.

        y : array-like, shape = [n_samples]
            Vector of target class labels.

        Returns
        -------
        self : object

        """
        if self.vote not in ('probability', 'classlabel'):
            raise ValueError("vote must be 'probability' or 'classlabel'"
                             "; got (vote=%r)"
                             % self.vote)

        if self.weights and len(self.weights) != len(self.classifiers):
            raise ValueError('Number of classifiers and weights must be equal'
                             '; got %d weights, %d classifiers'
                             % (len(self.weights), len(self.classifiers)))

        # Use LabelEncoder to ensure class labels start with 0, which
        # is important for np.argmax call in self.predict
        self.lablenc_ = LabelEncoder()
        self.lablenc_.fit(y)
        self.classes_ = self.lablenc_.classes_
        self.classifiers_ = []
        for clf in self.classifiers:
            fitted_clf = clone(clf).fit(X, self.lablenc_.transform(y))
            self.classifiers_.append(fitted_clf)
        return self

    def predict(self, X):
        """ Predict class labels for X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Matrix of training samples.

        Returns
        ----------
        maj_vote : array-like, shape = [n_samples]
            Predicted class labels.
            
        """
        if self.vote == 'probability':
            maj_vote = np.argmax(self.predict_proba(X), axis=1)
        else:  # 'classlabel' vote

            #  Collect results from clf.predict calls
            predictions = np.asarray([clf.predict(X)
                                      for clf in self.classifiers_]).T

            maj_vote = np.apply_along_axis(
                                      lambda x:
                                      np.argmax(np.bincount(x,
                                                weights=self.weights)),
                                      axis=1,
                                      arr=predictions)
        maj_vote = self.lablenc_.inverse_transform(maj_vote)
        return maj_vote

    def predict_proba(self, X):
        """ Predict class probabilities for X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.

        Returns
        ----------
        avg_proba : array-like, shape = [n_samples, n_classes]
            Weighted average probability for each class per sample.

        """
        probas = np.asarray([clf.predict_proba(X)
                             for clf in self.classifiers_])
        avg_proba = np.average(probas, axis=0, weights=self.weights)
        return avg_proba

    def get_params(self, deep=True):
        """ Get classifier parameter names for GridSearch"""
        if not deep:
            return super(MajorityVoteClassifier, self).get_params(deep=False)
        else:
            out = self.named_classifiers.copy()
            for name, step in six.iteritems(self.named_classifiers):
                for key, value in six.iteritems(step.get_params(deep=True)):
                    out['%s__%s' % (name, key)] = value
            return out

Combining different algorithms for classification with majority vote

In [28]:
from sklearn.cross_validation import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.pipeline import Pipeline
import numpy as np

clf1 = LogisticRegression(penalty='l2', 
                          C=0.001, 
                          random_state=0)

clf2 = DecisionTreeClassifier(max_depth=5,
                              min_samples_leaf=5,
                              min_samples_split=1,
                              criterion='entropy', 
                              random_state=0)

clf3 = KNeighborsClassifier(n_neighbors=1, 
                            p=2, 
                            metric='minkowski')

pipe1 = Pipeline([['sc', StandardScaler()],
                  ['clf', clf1]])
pipe3 = Pipeline([['sc', StandardScaler()],
                  ['clf', clf3]])

clf_labels = ['Logistic Regression', 'Decision Tree', 'KNN']

print('10-fold cross validation:\n')
for clf, label in zip([pipe1, clf2, pipe3], clf_labels):
    scores = cross_val_score(estimator=clf, 
                             X=X_train, 
                             y=y_train, 
                             cv=10, 
                             scoring='roc_auc')
    print("ROC AUC: %0.2f (+/- %0.2f) [%s]" 
               % (scores.mean(), scores.std(), label))
10-fold cross validation:

ROC AUC: 0.72 (+/- 0.01) [Logistic Regression]
ROC AUC: 0.75 (+/- 0.01) [Decision Tree]
ROC AUC: 0.60 (+/- 0.01) [KNN]

You may be wondering why we trained the logistic regression and k-nearest neighbors classifiers as part of a pipeline. The reason is that both logistic regression and k-nearest neighbors (using the Euclidean distance metric) are not scale-invariant, in contrast with decision trees.
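A quick sanity check makes the point concrete. The sketch below (an assumption: it reuses the X_train/X_test split from above, and the accuracy numbers will vary) compares k-nearest neighbors on raw and on standardized features.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# LIMIT_BAL and the BILL_AMT columns dwarf AGE on the raw scale,
# which distorts the Euclidean distances KNN relies on
raw_knn = KNeighborsClassifier().fit(X_train, y_train)
scaled_knn = Pipeline([('sc', StandardScaler()),
                       ('clf', KNeighborsClassifier())]).fit(X_train, y_train)

print("KNN accuracy, raw features:    %.3f" % raw_knn.score(X_test, y_test))
print("KNN accuracy, scaled features: %.3f" % scaled_knn.score(X_test, y_test))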

Now let's move on to the more exciting part and combine the individual classifiers for majority rule voting in our MajorityVoteClassifier:

In [29]:
# Majority Rule (hard) Voting

mv_clf = MajorityVoteClassifier(
                classifiers=[pipe1, clf2,pipe3])

clf_labels = ['Logistic Regression', 'Decision Tree', 'K Nearest Neighbours', 'Majority Voting'] 
all_clf = [pipe1, clf2, pipe3, mv_clf]

for clf, label in zip(all_clf, clf_labels):
    scores = cross_val_score(estimator=clf, 
                             X=X_train, 
                             y=y_train, 
                             cv=10, 
                             scoring='roc_auc')
    print("ROC AUC: %0.2f (+/- %0.2f) [%s]" 
               % (scores.mean(), scores.std(), label))
ROC AUC: 0.72 (+/- 0.01) [Logistic Regression]
ROC AUC: 0.75 (+/- 0.01) [Decision Tree]
ROC AUC: 0.60 (+/- 0.01) [K Nearest Neighbours]
ROC AUC: 0.73 (+/- 0.01) [Majority Voting]

Evaluating and tuning the ensemble classifier

In [30]:
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve
from sklearn.metrics import auc

colors = ['black', 'orange', 'blue', 'green']
linestyles = [':', '--', '-.', '-']
for clf, label, clr, ls in zip(all_clf, clf_labels, colors, linestyles):

    # assuming the label of the positive class is 1
    y_pred = clf.fit(X_train, 
                     y_train).predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_true=y_test, 
                                     y_score=y_pred)
    roc_auc = auc(x=fpr, y=tpr)
    plt.plot(fpr, tpr, 
             color=clr, 
             linestyle=ls, 
             label='%s (auc = %0.2f)' % (label, roc_auc))

plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 
         linestyle='--', 
         color='gray', 
         linewidth=2)

plt.xlim([-0.1, 1.1])
plt.ylim([-0.1, 1.1])
plt.grid()
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

plt.tight_layout()
# plt.savefig('./figures/roc.png', dpi=300)
plt.show()

Plot class probabilities calculated by the VotingClassifier

In [33]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier

clf1 = LogisticRegression(random_state=123)
clf2 = RandomForestClassifier(random_state=123)
clf3 = GaussianNB()

eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
                        voting='soft',
                        weights=[1, 1, 5])

# predict class probabilities for all classifiers
probas = [c.fit(X, y).predict_proba(X) for c in (clf1, clf2, clf3, eclf)]

# get class probabilities for the first sample in the dataset
class1_1 = [pr[0, 0] for pr in probas]
class2_1 = [pr[0, 1] for pr in probas]


# plotting

N = 4  # number of groups
ind = np.arange(N)  # group positions
width = 0.35  # bar width

fig, ax = plt.subplots()

# bars for classifier 1-3
p1 = ax.bar(ind, np.hstack(([class1_1[:-1], [0]])), width, color='green')
p2 = ax.bar(ind + width, np.hstack(([class2_1[:-1], [0]])), width, color='lightgreen')

# bars for VotingClassifier
p3 = ax.bar(ind, [0, 0, 0, class1_1[-1]], width, color='blue')
p4 = ax.bar(ind + width, [0, 0, 0, class2_1[-1]], width, color='steelblue')

# plot annotations
plt.axvline(2.8, color='k', linestyle='dashed')
ax.set_xticks(ind + width)
ax.set_xticklabels(['LogisticRegression\nweight 1',
                    'RandomForestClassifier\nweight 1',
                    'GaussianNB\nweight 5',
                    'VotingClassifier\n(average probabilities)'],
                   rotation=40,
                   ha='right')
plt.ylim([0, 1])
plt.title('Class probabilities for sample 1 by different classifiers')
plt.legend([p1[0], p2[0]], ['class 0', 'class 1'], loc='upper left')
plt.show()

Bagging -- Building an ensemble of classifiers from bootstrap samples

Bagging is an ensemble learning technique that is closely related to the MajorityVoteClassifier. However, instead of using the same training set to fit the individual classifiers in the ensemble, we draw bootstrap samples (random samples with replacement) from the initial training set, which is why bagging is also known as bootstrap aggregating.
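The bootstrap draw itself can be illustrated in a few lines, independently of any classifier. A sketch (the seed is arbitrary): sampling n row indices with replacement leaves roughly a third of the training rows out of each sample, which is what makes out-of-bag estimates possible.

import numpy as np

rng = np.random.RandomState(1)
n = len(X_train)

# One bootstrap sample: n draws with replacement from the n training rows
boot_idx = rng.randint(0, n, size=n)

# About 63.2% of rows appear at least once; the rest are "out-of-bag"
print("distinct rows in the bootstrap sample: %.3f"
      % (len(np.unique(boot_idx)) / float(n)))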

In [34]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion='entropy', 
                              max_depth=None)

bag = BaggingClassifier(base_estimator=tree,
                        n_estimators=500, 
                        max_samples=1.0, 
                        max_features=1.0, 
                        bootstrap=True, 
                        bootstrap_features=False, 
                        n_jobs=-1, 
                        random_state=1)
In [35]:
from sklearn.metrics import accuracy_score

tree = tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)

tree_train = accuracy_score(y_train, y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)
print('Decision tree train/test accuracies %.3f/%.3f'
      % (tree_train, tree_test))

bag = bag.fit(X_train, y_train)
y_train_pred = bag.predict(X_train)
y_test_pred = bag.predict(X_test)

bag_train = accuracy_score(y_train, y_train_pred) 
bag_test = accuracy_score(y_test, y_test_pred) 
print('Bagging train/test accuracies %.3f/%.3f'
      % (bag_train, bag_test))
Decision tree train/test accuracies 1.000/0.729
Bagging train/test accuracies 1.000/0.821
C:\Miniconda2\lib\site-packages\sklearn\externals\joblib\hashing.py:197: DeprecationWarning: Changing the shape of non-C contiguous array by
descriptor assignment is deprecated. To maintain
the Fortran contiguity of a multidimensional Fortran
array, use 'a.T.view(...).T' instead
  obj_bytes_view = obj.view(self.np.uint8)

Leveraging weak learners via adaptive boosting

In [10]:
from sklearn.ensemble import AdaBoostClassifier

tree = DecisionTreeClassifier(criterion='entropy', 
                              max_depth=1)

ada = AdaBoostClassifier(base_estimator=tree,
                         n_estimators=500, 
                         learning_rate=0.1,
                         random_state=0)
In [11]:
tree = tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)

tree_train = accuracy_score(y_train, y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)
print('Decision tree train/test accuracies %.3f/%.3f'
      % (tree_train, tree_test))

ada = ada.fit(X_train, y_train)
y_train_pred = ada.predict(X_train)
y_test_pred = ada.predict(X_test)

ada_train = accuracy_score(y_train, y_train_pred) 
ada_test = accuracy_score(y_test, y_test_pred) 
print('AdaBoost train/test accuracies %.3f/%.3f'
      % (ada_train, ada_test))
Decision tree train/test accuracies 0.596/0.597
AdaBoost train/test accuracies 0.675/0.662

Two-class AdaBoost

In [63]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# A 2-D decision boundary can only be drawn for two features, so fit the
# visualisation model on LIMIT_BAL and AGE (columns 0 and 4) only
X_vis = X.values[:, [0, 4]]
y_vis = y.values

# Create and fit an AdaBoosted decision tree
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=5),
                         algorithm="SAMME",
                         n_estimators=200)

bdt.fit(X_vis, y_vis)

plot_colors = "br"
class_names = "AB"

plt.figure(figsize=(10, 5))

# Plot the decision boundaries on a 200 x 200 grid
plt.subplot(121)
x_min, x_max = X_vis[:, 0].min() - 1, X_vis[:, 0].max() + 1
y_min, y_max = X_vis[:, 1].min() - 1, X_vis[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                     np.linspace(y_min, y_max, 200))

Z = bdt.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.axis("tight")

# Plot the training points
for i, n, c in zip(range(2), class_names, plot_colors):
    idx = np.where(y_vis == i)[0]
    plt.scatter(X_vis[idx, 0], X_vis[idx, 1],
                c=c, cmap=plt.cm.Paired,
                label="Class %s" % n)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.legend(loc='upper right')
plt.xlabel('LIMIT_BAL')
plt.ylabel('AGE')
plt.title('Decision Boundary')

# Plot the two-class decision scores
twoclass_output = bdt.decision_function(X_vis)
plot_range = (twoclass_output.min(), twoclass_output.max())
plt.subplot(122)
for i, n, c in zip(range(2), class_names, plot_colors):
    plt.hist(twoclass_output[y_vis == i],
             bins=10,
             range=plot_range,
             facecolor=c,
             label='Class %s' % n,
             alpha=.5)
x1, x2, y1, y2 = plt.axis()
plt.axis((x1, x2, y1, y2 * 1.2))
plt.legend(loc='upper right')
plt.ylabel('Samples')
plt.xlabel('Score')
plt.title('Decision Scores')

plt.tight_layout()
plt.subplots_adjust(wspace=0.35)
plt.show()

Discrete versus Real AdaBoost

In [64]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import zero_one_loss
from sklearn.ensemble import AdaBoostClassifier


n_estimators = 400
# A learning rate of 1. may not be optimal for both SAMME and SAMME.R
learning_rate = 1.


dt_stump = DecisionTreeClassifier(max_depth=1, min_samples_leaf=1)
dt_stump.fit(X_train, y_train)
dt_stump_err = 1.0 - dt_stump.score(X_test, y_test)

dt = DecisionTreeClassifier(max_depth=9, min_samples_leaf=1)
dt.fit(X_train, y_train)
dt_err = 1.0 - dt.score(X_test, y_test)

ada_discrete = AdaBoostClassifier(
    base_estimator=dt_stump,
    learning_rate=learning_rate,
    n_estimators=n_estimators,
    algorithm="SAMME")
ada_discrete.fit(X_train, y_train)

ada_real = AdaBoostClassifier(
    base_estimator=dt_stump,
    learning_rate=learning_rate,
    n_estimators=n_estimators,
    algorithm="SAMME.R")
ada_real.fit(X_train, y_train)

fig = plt.figure()
ax = fig.add_subplot(111)

ax.plot([1, n_estimators], [dt_stump_err] * 2, 'k-',
        label='Decision Stump Error')
ax.plot([1, n_estimators], [dt_err] * 2, 'k--',
        label='Decision Tree Error')

ada_discrete_err = np.zeros((n_estimators,))
for i, y_pred in enumerate(ada_discrete.staged_predict(X_test)):
    ada_discrete_err[i] = zero_one_loss(y_pred, y_test)

ada_discrete_err_train = np.zeros((n_estimators,))
for i, y_pred in enumerate(ada_discrete.staged_predict(X_train)):
    ada_discrete_err_train[i] = zero_one_loss(y_pred, y_train)

ada_real_err = np.zeros((n_estimators,))
for i, y_pred in enumerate(ada_real.staged_predict(X_test)):
    ada_real_err[i] = zero_one_loss(y_pred, y_test)

ada_real_err_train = np.zeros((n_estimators,))
for i, y_pred in enumerate(ada_real.staged_predict(X_train)):
    ada_real_err_train[i] = zero_one_loss(y_pred, y_train)

ax.plot(np.arange(n_estimators) + 1, ada_discrete_err,
        label='Discrete AdaBoost Test Error',
        color='red')
ax.plot(np.arange(n_estimators) + 1, ada_discrete_err_train,
        label='Discrete AdaBoost Train Error',
        color='blue')
ax.plot(np.arange(n_estimators) + 1, ada_real_err,
        label='Real AdaBoost Test Error',
        color='orange')
ax.plot(np.arange(n_estimators) + 1, ada_real_err_train,
        label='Real AdaBoost Train Error',
        color='green')

ax.set_ylim((0.0, 0.5))
ax.set_xlabel('n_estimators')
ax.set_ylabel('error rate')

leg = ax.legend(loc='upper right', fancybox=True)
leg.get_frame().set_alpha(0.7)

plt.show()

Feature transformations with ensembles of trees

In [65]:
import numpy as np
np.random.seed(10)

import matplotlib.pyplot as plt

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomTreesEmbedding, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.preprocessing import OneHotEncoder
from sklearn.cross_validation import train_test_split
from sklearn.metrics import roc_curve
from sklearn.pipeline import make_pipeline

n_estimator = 10
# It is important to train the ensemble of trees on a different subset
# of the training data than the logistic regression model to avoid
# overfitting, in particular if the total number of leaves is
# similar to the number of training samples
X_train, X_train_lr, y_train, y_train_lr = train_test_split(X_train,
                                                            y_train,
                                                            test_size=0.5)

# Unsupervised transformation based on totally random trees
rt = RandomTreesEmbedding(max_depth=3, n_estimators=n_estimator,
                          random_state=0)

rt_lm = LogisticRegression()
pipeline = make_pipeline(rt, rt_lm)
pipeline.fit(X_train, y_train)
y_pred_rt = pipeline.predict_proba(X_test)[:, 1]
fpr_rt_lm, tpr_rt_lm, _ = roc_curve(y_test, y_pred_rt)

# Supervised transformation based on random forests
rf = RandomForestClassifier(max_depth=3, n_estimators=n_estimator)
rf_enc = OneHotEncoder()
rf_lm = LogisticRegression()
rf.fit(X_train, y_train)
rf_enc.fit(rf.apply(X_train))
rf_lm.fit(rf_enc.transform(rf.apply(X_train_lr)), y_train_lr)

y_pred_rf_lm = rf_lm.predict_proba(rf_enc.transform(rf.apply(X_test)))[:, 1]
fpr_rf_lm, tpr_rf_lm, _ = roc_curve(y_test, y_pred_rf_lm)

grd = GradientBoostingClassifier(n_estimators=n_estimator)
grd_enc = OneHotEncoder()
grd_lm = LogisticRegression()
grd.fit(X_train, y_train)
grd_enc.fit(grd.apply(X_train)[:, :, 0])
grd_lm.fit(grd_enc.transform(grd.apply(X_train_lr)[:, :, 0]), y_train_lr)

y_pred_grd_lm = grd_lm.predict_proba(
    grd_enc.transform(grd.apply(X_test)[:, :, 0]))[:, 1]
fpr_grd_lm, tpr_grd_lm, _ = roc_curve(y_test, y_pred_grd_lm)


# The gradient boosted model by itself
y_pred_grd = grd.predict_proba(X_test)[:, 1]
fpr_grd, tpr_grd, _ = roc_curve(y_test, y_pred_grd)


# The random forest model by itself
y_pred_rf = rf.predict_proba(X_test)[:, 1]
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_rf)

plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_rt_lm, tpr_rt_lm, label='RT + LR')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_rf_lm, tpr_rf_lm, label='RF + LR')
plt.plot(fpr_grd, tpr_grd, label='GBT')
plt.plot(fpr_grd_lm, tpr_grd_lm, label='GBT + LR')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()

plt.figure(2)
plt.xlim(0, 0.2)
plt.ylim(0.8, 1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_rt_lm, tpr_rt_lm, label='RT + LR')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_rf_lm, tpr_rf_lm, label='RF + LR')
plt.plot(fpr_grd, tpr_grd, label='GBT')
plt.plot(fpr_grd_lm, tpr_grd_lm, label='GBT + LR')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve (zoomed in at top left)')
plt.legend(loc='best')
plt.show()

PCA Decomposition

In [19]:
target_names = ['No Default', 'Default']
In [42]:
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

pca = PCA(n_components=2)
reduced_X = pca.fit_transform(X)
In [48]:
red_x, red_y = [], []
blue_x, blue_y = [], []

for i in range(len(reduced_X)):
    if y[i] == 0:
        red_x.append(reduced_X[i][0])
        red_y.append(reduced_X[i][1])
    elif y[i] == 1:
        blue_x.append(reduced_X[i][0])
        blue_y.append(reduced_X[i][1])

plt.scatter(red_x, red_y, c='r', marker='x')
plt.scatter(blue_x, blue_y, c='b', marker='.')
plt.show()
In [49]:
for a in [red_x, red_y, blue_x, blue_y]:
    print len(a)
In [33]:
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
"""
pca = PCA(n_components=2)
X_r = pca.fit(X.values).transform(X.values)

lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X.values, y.values).transform(X.values)
"""
pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)

lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X, y).transform(X)

# Percentage of variance explained for each component
print('explained variance ratio (first two components): %s'
      % str(pca.explained_variance_ratio_))


plt.figure()

colors = ['blue','red']
for i in xrange(len(colors)):
    px = X_r[:, 0][y == i]
    py = X_r[:, 1][y == i]
    plt.scatter(px, py, c=colors[i])
plt.legend(target_names)
plt.title('PCA')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')

plt.figure()

colors = ['blue','red']
# With only two classes, LDA yields a single discriminant axis,
# so plot it against zero rather than a (non-existent) second component
for i in xrange(len(colors)):
    px = X_r2[:, 0][y == i]
    py = np.zeros(len(px))
    plt.scatter(px, py, c=colors[i])
plt.legend(target_names)
plt.title('LDA')
plt.xlabel('First Discriminant')

plt.show()

"""

for c, i, target_name in zip("rb", [0, 1], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], c=c, label=target_name)
plt.legend()
plt.title('PCA')

plt.figure()
for c, i, target_name in zip("rb", [0, 1], target_names):
    plt.scatter(X_r2[y == i, 0], X_r2[y == i, 1], c=c, label=target_name)
plt.legend()
plt.title('LDA')

plt.show()
"""
C:\Miniconda2\lib\site-packages\ipykernel\__main__.py:28: FutureWarning: in the future, boolean array-likes will be handled as a boolean array index
C:\Miniconda2\lib\site-packages\ipykernel\__main__.py:29: FutureWarning: in the future, boolean array-likes will be handled as a boolean array index
C:\Miniconda2\lib\site-packages\ipykernel\__main__.py:40: FutureWarning: in the future, boolean array-likes will be handled as a boolean array index
C:\Miniconda2\lib\site-packages\ipykernel\__main__.py:41: FutureWarning: in the future, boolean array-likes will be handled as a boolean array index
explained variance ratio (first two components): [ 0.7628241   0.16383047]
In [25]:
plt.figure()
def plot_pca_scatter():
    colors = ['blue','red']
    for i in xrange(len(colors)):
        px = X_pca[:, 0][y == i]
        py = X_pca[:, 1][y == i]
        plt.scatter(px, py, c=colors[i])
    plt.legend(target_names)
    plt.xlabel('First Principal Component')
    plt.ylabel('Second Principal Component')
In [26]:
from sklearn.decomposition import PCA


estimator = PCA(n_components=2)
X_pca = estimator.fit_transform(X.values)
plot_pca_scatter() # Note that we only plot the first and second principal component
C:\Miniconda2\lib\site-packages\ipykernel\__main__.py:5: FutureWarning: in the future, boolean array-likes will be handled as a boolean array index
C:\Miniconda2\lib\site-packages\ipykernel\__main__.py:6: FutureWarning: in the future, boolean array-likes will be handled as a boolean array index

Optimization

Concatenating multiple feature extraction methods

In [11]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.grid_search import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

# This dataset is way too high-dimensional. Better do PCA:
pca = PCA(n_components=2)

# Maybe some original features were good, too?
selection = SelectKBest(k=1)

# Build estimator from PCA and Univariate selection:

combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)

dt = DecisionTreeClassifier(min_samples_split=1,max_depth=5,min_samples_leaf=5,random_state=99)

# Do grid search over k, n_components and max_depth:

pipeline = Pipeline([("features", combined_features), ("dt", dt)])

param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  dt__max_depth=[3, 5, 7])

grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10)
grid_search.fit(X, y)
print(grid_search.best_estimator_)
print(grid_search.best_score_)
Fitting 3 folds for each of 18 candidates, totalling 54 fits
(verbose per-fold [CV] output omitted: 54 fits in all; individual fold scores ranged from 0.532 to 0.620, with the best folds at features__pca__n_components=3, dt__max_depth=7, features__univ_select__k=2)
[Parallel(n_jobs=1)]: Done  54 out of  54 | elapsed:   28.9s finished
Pipeline(steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('pca', PCA(copy=True, n_components=3, whiten=False)), ('univ_select', SelectKBest(k=2, score_func=<function f_classif at 0x0000000006D019E8>))],
       transformer_weights=None)), ('dt', DecisionTreeClassifier(class_weight=None, cr...plit=1, min_weight_fraction_leaf=0.0,
            presort=False, random_state=99, splitter='best'))])
0.605337503784
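
The truncated repr above hides part of the winning configuration. A minimal follow-up sketch, assuming the grid_search object fitted above and the pre-0.18 grid_scores_ attribute exposed by the sklearn.grid_search module this notebook imports:

print(grid_search.best_params_)

# Top three candidates by mean cross-validated score
for candidate in sorted(grid_search.grid_scores_,
                        key=lambda s: s.mean_validation_score,
                        reverse=True)[:3]:
    print(candidate)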

Comparison of Calibration of Classifiers

A reliability curve bins the held-out predictions and plots the observed fraction of defaulters against the mean predicted probability in each bin; a well-calibrated classifier tracks the diagonal.

In [27]:
import numpy as np
np.random.seed(0)
%matplotlib inline
import matplotlib.pyplot as plt

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.calibration import calibration_curve

# Create classifiers
lr = LogisticRegression()
gnb = GaussianNB()
knn = KNeighborsClassifier(n_neighbors=25)
rfc = RandomForestClassifier(n_estimators=100)


###############################################################################
# Plot calibration plots

plt.figure(figsize=(10, 10))
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
ax2 = plt.subplot2grid((3, 1), (2, 0))

ax1.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
for clf, name in [(lr, 'Logistic'),
                  (gnb, 'Naive Bayes'),
                  (knn, 'K Neighbors Classifier'),
                  (rfc, 'Random Forest')]:
    clf.fit(X_train, y_train)
    if hasattr(clf, "predict_proba"):
        prob_pos = clf.predict_proba(X_test)[:, 1]
    else:  # use decision function
        prob_pos = clf.decision_function(X_test)
        prob_pos = \
            (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())
    fraction_of_positives, mean_predicted_value = \
        calibration_curve(y_test, prob_pos, n_bins=10)

    ax1.plot(mean_predicted_value, fraction_of_positives, "s-",
             label="%s" % (name, ))

    ax2.hist(prob_pos, range=(0, 1), bins=10, label=name,
             histtype="step", lw=2)

ax1.set_ylabel("Fraction of positives")
ax1.set_ylim([-0.05, 1.05])
ax1.legend(loc="lower right")
ax1.set_title('Calibration plots  (reliability curve)')

ax2.set_xlabel("Mean predicted value")
ax2.set_ylabel("Count")
ax2.legend(loc="upper center", ncol=2)

plt.tight_layout()
plt.show()
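
To make the binning behind these curves concrete, here is a toy calibration_curve sketch on made-up labels and probabilities (not data from this notebook):

import numpy as np
from sklearn.calibration import calibration_curve

y_true = np.array([0, 0, 1, 0, 1, 1])
y_prob = np.array([0.10, 0.20, 0.30, 0.80, 0.70, 0.90])

# Three equal-width bins over [0, 1]; empty bins are dropped
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=3)
print(frac_pos)   # observed fraction of positives per bin -> [0.333, 0.667]
print(mean_pred)  # mean predicted probability per bin     -> [0.2, 0.8]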

Probability Calibration curves

CalibratedClassifierCV wraps a base estimator and refits its probabilities with either isotonic regression or Platt's sigmoid; the Brier score printed below (mean squared error of the predicted probability, lower is better) summarizes each curve.

In [28]:
import matplotlib.pyplot as plt

from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (brier_score_loss, precision_score, recall_score,
                             f1_score)
from sklearn.calibration import CalibratedClassifierCV, calibration_curve



def plot_calibration_curve(est, name, fig_index):
    """Plot calibration curve for est w/o and with calibration. """
    # Calibrated with isotonic calibration
    isotonic = CalibratedClassifierCV(est, cv=2, method='isotonic')

    # Calibrated with sigmoid calibration
    sigmoid = CalibratedClassifierCV(est, cv=2, method='sigmoid')

    # Logistic regression with no calibration as baseline
    lr = LogisticRegression(C=1., solver='lbfgs')

    fig = plt.figure(fig_index, figsize=(10, 10))
    ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
    ax2 = plt.subplot2grid((3, 1), (2, 0))

    ax1.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
    for clf, name in [(lr, 'Logistic'),
                      (est, name),
                      (isotonic, name + ' + Isotonic'),
                      (sigmoid, name + ' + Sigmoid')]:
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        if hasattr(clf, "predict_proba"):
            prob_pos = clf.predict_proba(X_test)[:, 1]
        else:  # use decision function
            prob_pos = clf.decision_function(X_test)
            prob_pos = \
                (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())

        clf_score = brier_score_loss(y_test, prob_pos, pos_label=y.max())
        print("%s:" % name)
        print("\tBrier: %1.3f" % (clf_score))
        print("\tPrecision: %1.3f" % precision_score(y_test, y_pred))
        print("\tRecall: %1.3f" % recall_score(y_test, y_pred))
        print("\tF1: %1.3f\n" % f1_score(y_test, y_pred))

        fraction_of_positives, mean_predicted_value = \
            calibration_curve(y_test, prob_pos, n_bins=10)

        ax1.plot(mean_predicted_value, fraction_of_positives, "s-",
                 label="%s (%1.3f)" % (name, clf_score))

        ax2.hist(prob_pos, range=(0, 1), bins=10, label=name,
                 histtype="step", lw=2)

    ax1.set_ylabel("Fraction of positives")
    ax1.set_ylim([-0.05, 1.05])
    ax1.legend(loc="lower right")
    ax1.set_title('Calibration plots  (reliability curve)')

    ax2.set_xlabel("Mean predicted value")
    ax2.set_ylabel("Count")
    ax2.legend(loc="upper center", ncol=2)

    plt.tight_layout()

# Plot calibration curve for Gaussian Naive Bayes
plot_calibration_curve(GaussianNB(), "Naive Bayes", 1)

# Plot calibration curve for a Decision Tree
plot_calibration_curve(DecisionTreeClassifier(), "Decision Tree", 2)

plt.show()
Logistic:
	Brier: 0.238
	Precision: 0.598
	Recall: 0.690
	F1: 0.641

Naive Bayes:
	Brier: 0.324
	Precision: 0.676
	Recall: 0.223
	F1: 0.335

Naive Bayes + Isotonic:
	Brier: 0.236
	Precision: 0.610
	Recall: 0.714
	F1: 0.658

Naive Bayes + Sigmoid:
	Brier: 0.240
	Precision: 0.629
	Recall: 0.603
	F1: 0.615

Logistic:
	Brier: 0.238
	Precision: 0.598
	Recall: 0.690
	F1: 0.641

Decision Tree:
	Brier: 0.422
	Precision: 0.605
	Recall: 0.593
	F1: 0.599

Decision Tree + Isotonic:
	Brier: 0.241
	Precision: 0.581
	Recall: 0.819
	F1: 0.679

Decision Tree + Sigmoid:
	Brier: 0.241
	Precision: 0.579
	Recall: 0.811
	F1: 0.675
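
Isotonic calibration rescues the overfit tree (Brier falls from 0.422 to 0.241) at a small cost in precision. The wrapper is also usable outside the plotting helper; a minimal sketch, assuming the X_train/X_test split from the earlier cells (the depth cap is a hypothetical choice, not one made above):

from sklearn.tree import DecisionTreeClassifier
from sklearn.calibration import CalibratedClassifierCV

# Two-fold isotonic calibration around a depth-capped tree (max_depth=7 is illustrative)
calibrated_dt = CalibratedClassifierCV(DecisionTreeClassifier(max_depth=7),
                                       cv=2, method='isotonic')
calibrated_dt.fit(X_train, y_train)
calibrated_probs = calibrated_dt.predict_proba(X_test)[:, 1]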

Plot classification probability

Each classifier is refit on the first two principal components, and its predicted probability for each class is drawn over the projected plane.

In [55]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

# Project the features onto the first two principal components
pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)

n_features = X_r.shape[1]

C = 1.0

# Create different classifiers. The target is binary here, so the two
# logistic regressions differ only in their penalty term.
classifiers = {'L1 logistic': LogisticRegression(C=C, penalty='l1'),
               'L2 logistic': LogisticRegression(C=C, penalty='l2'),
               'Decision Tree': DecisionTreeClassifier(max_depth=5,min_samples_leaf=5,min_samples_split=1,random_state=99),
               'K Nearest Neighbors': KNeighborsClassifier(n_neighbors=3),
               'Random Forest' : RandomForestClassifier(max_depth=25,min_samples_leaf=2,
                                                        min_samples_split=10,n_estimators=100,n_jobs=-1)
              }

n_classifiers = len(classifiers)

plt.figure(figsize=(2 * 2, n_classifiers * 2))
plt.subplots_adjust(bottom=.2, top=.95)

# Evaluation mesh spanning the range of the projected data
x_min, x_max = X_r[:, 0].min(), X_r[:, 0].max()
y_min, y_max = X_r[:, 1].min(), X_r[:, 1].max()
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))
Xfull = np.c_[xx.ravel(), yy.ravel()]


for index, (name, classifier) in enumerate(classifiers.items()):
    classifier.fit(X_r, y)

    y_pred = classifier.predict(X_r)
    classif_rate = np.mean(y_pred.ravel() == y.ravel()) * 100
    print("classif_rate for %s : %f " % (name, classif_rate))

    # Predicted probabilities over the evaluation mesh
    probas = classifier.predict_proba(Xfull)
    n_classes = np.unique(y_pred).size
    for k in range(n_classes):
        plt.subplot(n_classifiers, n_classes, index * n_classes + k + 1)
        plt.title("Class %d" % k)
        if k == 0:
            plt.ylabel(name)
        imshow_handle = plt.imshow(probas[:, k].reshape((100, 100)),
                                   extent=(x_min, x_max, y_min, y_max), origin='lower')
        plt.xticks(())
        plt.yticks(())
        idx = (y_pred == k)
        if idx.any():
            plt.scatter(X_r[idx, 0], X_r[idx, 1], marker='o', c='k')

ax = plt.axes([0.15, 0.04, 0.7, 0.05])
plt.title("Probability")
plt.colorbar(imshow_handle, cax=ax, orientation='horizontal')

plt.show()
classif_rate for K Nearest Neighbors : 76.077086 
classif_rate for L1 logistic : 53.309454 
classif_rate for L2 logistic : 52.103723 
classif_rate for Random Forest : 82.065382 
classif_rate for Decision Tree : 55.047422 
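
These rates hinge on how much structure the projection keeps, so it is worth checking the variance retained by the two components (a quick sketch using the pca object fit in the cell above; with unscaled NT-dollar features the first component will typically dominate):

print(pca.explained_variance_ratio_)        # variance share of each component
print(pca.explained_variance_ratio_.sum())  # total share kept in two dimensions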

Recursive feature elimination with cross-validation

RFECV drops the least important feature one step at a time and keeps the subset with the best cross-validated accuracy.

In [66]:
import matplotlib.pyplot as plt
from sklearn.cross_validation import StratifiedKFold
from sklearn.feature_selection import RFECV

# Create the RFE object and compute a cross-validated score.

rf = RandomForestClassifier(max_depth=25,min_samples_leaf=2,min_samples_split=10,n_estimators=100,n_jobs=-1)

# The "accuracy" scoring is proportional to the number of correct classifications
rfecv = RFECV(estimator=rf, step=1, cv=StratifiedKFold(y, 2),
              scoring='accuracy')
rfecv.fit(X, y)

print("Optimal number of features : %d" % rfecv.n_features_)

# Plot number of features VS. cross-validation scores
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()
Optimal number of features : 2
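
To see which two variables survive the elimination, inspect the fitted selector (a sketch assuming the rfecv object above; the column lookup assumes X is still a pandas DataFrame rather than a raw array):

# support_ is a boolean mask over the input features; ranking_ assigns
# 1 to every kept feature and larger numbers to eliminated ones
print(rfecv.support_)
print(rfecv.ranking_)
print([col for col, keep in zip(X.columns, rfecv.support_) if keep])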