Adult Income Classification


http://archive.ics.uci.edu/ml/datasets/Adult

Listing of attributes:

Class: >50K, <=50K.

age: continuous.

workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

fnlwgt: continuous.

education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

education-num: continuous.

marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

sex: Female, Male.

capital-gain: continuous.

capital-loss: continuous.

hours-per-week: continuous.

native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

In [1]:
import os
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import pandas as pd
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn import cross_validation, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from time import time
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score , classification_report
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, accuracy_score, classification_report
In [2]:
# read .csv from provided dataset
csv_filename="datasets/adult.data"

df = pd.read_csv(csv_filename, header=None,
                names=["Age", "Work-Class", "fnlwgt",
                "Education", "Education-Num",
                "Marital-Status", "Occupation",
                "Relationship", "Race", "Sex",
                "Capital-gain", "Capital-loss",
                "Hours-per-week", "Native-Country",
                "Earnings-Raw"])
In [3]:
df.replace([' <=50K',' >50K'],[0,1],inplace=True)
In [4]:
df.tail()
Out[4]:
Age Work-Class fnlwgt Education Education-Num Marital-Status Occupation Relationship Race Sex Capital-gain Capital-loss Hours-per-week Native-Country Earnings-Raw
32556 27 Private 257302 Assoc-acdm 12 Married-civ-spouse Tech-support Wife White Female 0 0 38 United-States 0
32557 40 Private 154374 HS-grad 9 Married-civ-spouse Machine-op-inspct Husband White Male 0 0 40 United-States 1
32558 58 Private 151910 HS-grad 9 Widowed Adm-clerical Unmarried White Female 0 0 40 United-States 0
32559 22 Private 201490 HS-grad 9 Never-married Adm-clerical Own-child White Male 0 0 20 United-States 0
32560 52 Self-emp-inc 287927 HS-grad 9 Married-civ-spouse Exec-managerial Wife White Female 15024 0 40 United-States 1

The adult file itself contains two blank lines at the end of the file. By default, pandas will interpret the penultimate new line to be an empty (but valid) row. To remove this, we remove any line with invalid numbers (the use of inplace just makes sure the same Dataframe is affected, rather than creating a new one):

In [5]:
df.dropna(how='all', inplace=True)
In [6]:
feature_names = df.columns

The results show each of the feature names that are stored inside an Index object from pandas.

The Adult dataset contains several categorical features, with Work-Class being one example. While we could argue that some values are of higher rank than others (for instance, a person with a job is likely to have a better income than a person without), it doesn't make sense for all values. For example, a person working for the state government is not more or less likely to have a higher income than someone working in the private sector. We can view the unique values for this feature in the dataset using the unique() function:

In [7]:
df["Work-Class"].unique()
Out[7]:
array([' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov',
       ' Local-gov', ' ?', ' Self-emp-inc', ' Without-pay', ' Never-worked'], dtype=object)

There are some missing values in the preceding dataset, but they won't affect our computations in this example.

Selecting the best individual features

If we have a number of features, the problem of finding the best subset is a difficult task.

The scikit-learn package has a number of transformers for performing univariate feature selection. They include SelectKBest, which returns the k best performing features, and SelectPercentile, which returns the top r% of features. In both cases, there are a number of methods of computing the quality of a feature.

There are many different methods to compute how effectively a single feature correlates with a class value. A commonly used method is the chi-squared (χ2) test. Other methods include mutual information and entropy.

We can observe single-feature tests in action using our Adult dataset. First, we extract a dataset and class values from our pandas DataFrame. We get a selection of the features:

In [8]:
X = df[["Age", "Education-Num", "Capital-gain", "Capital-loss", "Hours-per-week"]].values

We will also create a target class array by testing whether the Earnings-Raw value is above $50,000 or not. If it is, the class will be True. Otherwise, it will be False. Let's look at the code:

In [9]:
y = (df["Earnings-Raw"]).values
In [10]:
y
Out[10]:
array([0, 0, 0, ..., 0, 0, 1], dtype=int64)

Next, we create our transformer using the chi2 function and a SelectKBest transformer:

In [11]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
transformer = SelectKBest(score_func=chi2, k=3)

Running fit_transform will call fit and then transform with the same dataset. The result will create a new dataset, choosing only the best three features. Let's look at the code:

In [12]:
Xt_chi2 = transformer.fit_transform(X, y)

The resulting matrix now only contains three features. We can also get the scores for each column, allowing us to find out which features were used. Let's look at the code:

In [13]:
print(transformer.scores_)
[  8.60061182e+03   2.40142178e+03   8.21924671e+07   1.37214589e+06
   6.47640900e+03]

The highest values are for the first, third, and fourth columns Correlates to the Age, Capital-Gain, and Capital-Loss features. Based on a univariate feature selection, these are the best features to choose.

We could also implement other correlations, such as the Pearson's correlation coefficient. This is implemented in SciPy, a library used for scientific computing (scikit-learn uses it as a base).

In [14]:
from scipy.stats import pearsonr

The preceding function almost fits the interface needed to be used in scikit-learn's univariate transformers. The function needs to accept two arrays (x and y in our example) as parameters and returns two arrays, the scores for each feature and the corresponding p-values. The chi2 function we used earlier only uses the required interface, which allowed us to just pass it directly to SelectKBest.

The pearsonr function in SciPy accepts two arrays; however, the X array it accepts is only one dimension. We will write a wrapper function that allows us to use this for multivariate arrays like the one we have. Let's look at the code:

In [15]:
def multivariate_pearsonr(X, y):
    scores, pvalues = [], []
    for column in range(X.shape[1]):
        cur_score, cur_p = pearsonr(X[:,column], y)
        scores.append(abs(cur_score))
        pvalues.append(cur_p)
    return (np.array(scores), np.array(pvalues))

The Pearson value could be between -1 and 1. A value of 1 implies a perfect correlation between two variables, while a value of -1 implies a perfect negative correlation, that is, high values in one variable give low values in the other and vice versa. Such features are really useful to have, but would be discarded. For this reason, we have stored the absolute value in the scores array, rather than the original signed value.

In [16]:
transformer = SelectKBest(score_func=multivariate_pearsonr, k=3)
Xt_pearson = transformer.fit_transform(X, y)
print(transformer.scores_)
[ 0.2340371   0.33515395  0.22332882  0.15052631  0.22968907]

This returns a different set of features! The features chosen this way are the first, second, and fifth columns: the Age, Education, and Hours-per-week worked. This shows that there is not a definitive answer to what the best features are— it depends on the metric.

We can see which feature set is better by running them through a classifier. Keep in mind that the results only indicate which subset is better for a particular classifier and/or feature combination—there is rarely a case in data mining where one method is strictly better than another in all cases! Let's look at the code:

In [17]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import cross_val_score
clf = DecisionTreeClassifier(random_state=14)
scores_chi2 = cross_val_score(clf, Xt_chi2, y, scoring='accuracy')
scores_pearson = cross_val_score(clf, Xt_pearson, y, scoring='accuracy')
In [18]:
print("Chi2 performance: {0:.3f}".format(scores_chi2.mean()))
print("Pearson performance: {0:.3f}".format(scores_pearson.mean()))
Chi2 performance: 0.829
Pearson performance: 0.771

It is worth remembering the goal of this data mining activity: predicting wealth. Using a combination of good features and feature selection, we can achieve 83 percent accuracy using just three features of a person!

Principal Components Analysis

In [19]:
from sklearn.decomposition import PCA
pca = PCA(n_components=5)
Xd = pca.fit_transform(X)
In [20]:
np.set_printoptions(precision=3, suppress=True)
pca.explained_variance_ratio_
Out[20]:
array([ 0.997,  0.003,  0.   ,  0.   ,  0.   ])

The result, array([ 0.997, 0.003, 0. , 0. , 0. ]), shows us that the first feature accounts for 99.7 percent of the variance in the dataset, the second accounts for 0.3 percent, and so on. By the fourth feature, less than one-tenth of a percent of the variance is contained in the feature. The other features explain even less.

In [21]:
clf = DecisionTreeClassifier(random_state=14)
original_scores = cross_val_score(clf, X, y, scoring='accuracy')
In [23]:
print("The average score from the original dataset is {:.4f}".format(np.mean(original_scores)))
The average score from the original dataset is 0.8111
In [24]:
clf = DecisionTreeClassifier(random_state=14)
scores_reduced = cross_val_score(clf, Xd, y, scoring='accuracy')
In [25]:
print("The average score from the reduced dataset is {:.4f}".format(np.mean(scores_reduced)))
The average score from the reduced dataset is 0.8136

In [26]:
df.replace(' ?',np.nan,inplace=True)
In [27]:
df.loc[26,:]
Out[27]:
Age                           19
Work-Class               Private
fnlwgt                    168294
Education                HS-grad
Education-Num                  9
Marital-Status     Never-married
Occupation          Craft-repair
Relationship           Own-child
Race                       White
Sex                         Male
Capital-gain                   0
Capital-loss                   0
Hours-per-week                40
Native-Country     United-States
Earnings-Raw                   0
Name: 26, dtype: object
In [28]:
df.shape
Out[28]:
(32561, 15)
In [29]:
#We can also specify to only drop rows that are complete missing all data
df.dropna(inplace=True)
In [30]:
df.shape
Out[30]:
(30162, 15)

In [31]:
df.head()
Out[31]:
Age Work-Class fnlwgt Education Education-Num Marital-Status Occupation Relationship Race Sex Capital-gain Capital-loss Hours-per-week Native-Country Earnings-Raw
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States 0
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States 0
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States 0
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States 0
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba 0
In [32]:
for col in df.columns:
    if df[col].dtype != 'int64':
        print "For: {}, Total unique values are {} - \n {}".format(col ,
                                                        len(pd.Series(df[col].values.ravel()).unique()),
                                                        pd.Series(df[col].values.ravel()).unique())
        print '\n'
For: Work-Class, Total unique values are 7 - 
 [' State-gov' ' Self-emp-not-inc' ' Private' ' Federal-gov' ' Local-gov'
 ' Self-emp-inc' ' Without-pay']


For: Education, Total unique values are 16 - 
 [' Bachelors' ' HS-grad' ' 11th' ' Masters' ' 9th' ' Some-college'
 ' Assoc-acdm' ' 7th-8th' ' Doctorate' ' Assoc-voc' ' Prof-school'
 ' 5th-6th' ' 10th' ' Preschool' ' 12th' ' 1st-4th']


For: Marital-Status, Total unique values are 7 - 
 [' Never-married' ' Married-civ-spouse' ' Divorced'
 ' Married-spouse-absent' ' Separated' ' Married-AF-spouse' ' Widowed']


For: Occupation, Total unique values are 14 - 
 [' Adm-clerical' ' Exec-managerial' ' Handlers-cleaners' ' Prof-specialty'
 ' Other-service' ' Sales' ' Transport-moving' ' Farming-fishing'
 ' Machine-op-inspct' ' Tech-support' ' Craft-repair' ' Protective-serv'
 ' Armed-Forces' ' Priv-house-serv']


For: Relationship, Total unique values are 6 - 
 [' Not-in-family' ' Husband' ' Wife' ' Own-child' ' Unmarried'
 ' Other-relative']


For: Race, Total unique values are 5 - 
 [' White' ' Black' ' Asian-Pac-Islander' ' Amer-Indian-Eskimo' ' Other']


For: Sex, Total unique values are 2 - 
 [' Male' ' Female']


For: Native-Country, Total unique values are 41 - 
 [' United-States' ' Cuba' ' Jamaica' ' India' ' Mexico' ' Puerto-Rico'
 ' Honduras' ' England' ' Canada' ' Germany' ' Iran' ' Philippines'
 ' Poland' ' Columbia' ' Cambodia' ' Thailand' ' Ecuador' ' Laos' ' Taiwan'
 ' Haiti' ' Portugal' ' Dominican-Republic' ' El-Salvador' ' France'
 ' Guatemala' ' Italy' ' China' ' South' ' Japan' ' Yugoslavia' ' Peru'
 ' Outlying-US(Guam-USVI-etc)' ' Scotland' ' Trinadad&Tobago' ' Greece'
 ' Nicaragua' ' Vietnam' ' Hong' ' Ireland' ' Hungary'
 ' Holand-Netherlands']


Now all of these features are categorical classes, but most scikit-learn classifiers (in particular the Decision Trees we plan to use), expect real-valued attributes. We can easily convert sex to a binary value (0=female,1=male). We will use the LabelEncoder class from scikit-learn:

In [33]:
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
label_encoder = enc.fit(df['Sex'])
print "Categorical classes:", label_encoder.classes_
integer_classes = label_encoder.transform(label_encoder.classes_)
print "Integer classes:", integer_classes
t = label_encoder.transform(df['Sex'])
df['Sex'] = t
print 'Feature names:',feature_names
print 'Features:',df['Sex'][10:16]
Categorical classes: [' Female' ' Male']
Integer classes: [0 1]
Feature names: Index([u'Age', u'Work-Class', u'fnlwgt', u'Education', u'Education-Num',
       u'Marital-Status', u'Occupation', u'Relationship', u'Race', u'Sex',
       u'Capital-gain', u'Capital-loss', u'Hours-per-week', u'Native-Country',
       u'Earnings-Raw'],
      dtype='object')
Features: 10    1
11    1
12    0
13    1
15    1
16    1
Name: Sex, dtype: int64

Now, we have to convert other categorical features. Since we have three different classes, we cannot convert to binary values (and using 0/1/2 values would imply an order, something we do not want). We use pandas get_dummies method for it or we can also use the OneHotEncoder to get three different attributes:

In [34]:
categorical_features = []
for col in df.columns:
    if df[col].dtype != 'int64':
        categorical_features.append(col)
categorical_features
Out[34]:
['Work-Class',
 'Education',
 'Marital-Status',
 'Occupation',
 'Relationship',
 'Race',
 'Native-Country']
In [35]:
df.shape[0]
Out[35]:
30162
In [36]:
onehot_df = pd.get_dummies(df)
In [37]:
onehot_df.head()
Out[37]:
Age fnlwgt Education-Num Sex Capital-gain Capital-loss Hours-per-week Earnings-Raw Work-Class_ Federal-gov Work-Class_ Local-gov ... Native-Country_ Portugal Native-Country_ Puerto-Rico Native-Country_ Scotland Native-Country_ South Native-Country_ Taiwan Native-Country_ Thailand Native-Country_ Trinadad&Tobago Native-Country_ United-States Native-Country_ Vietnam Native-Country_ Yugoslavia
0 39 77516 13 1 2174 0 40 0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
1 50 83311 13 1 0 0 13 0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2 38 215646 9 1 0 0 40 0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
3 53 234721 7 1 0 0 40 0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
4 28 338409 13 0 0 0 40 0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 104 columns

In [38]:
onehot_df.columns
Out[38]:
Index([u'Age', u'fnlwgt', u'Education-Num', u'Sex', u'Capital-gain',
       u'Capital-loss', u'Hours-per-week', u'Earnings-Raw',
       u'Work-Class_ Federal-gov', u'Work-Class_ Local-gov',
       ...
       u'Native-Country_ Portugal', u'Native-Country_ Puerto-Rico',
       u'Native-Country_ Scotland', u'Native-Country_ South',
       u'Native-Country_ Taiwan', u'Native-Country_ Thailand',
       u'Native-Country_ Trinadad&Tobago', u'Native-Country_ United-States',
       u'Native-Country_ Vietnam', u'Native-Country_ Yugoslavia'],
      dtype='object', length=104)
In [39]:
"""
from sklearn.preprocessing import OneHotEncoder

enc = LabelEncoder()

for f in categorical_features:

    label_encoder = enc.fit(df[f])
    print "Categorical classes:", label_encoder.classes_
    
    integer_classes = label_encoder.transform(label_encoder.classes_).reshape(
        len(pd.Series(df[f].values.ravel()).unique()), 1)
    print "Integer classes:", integer_classes
    
    enc = OneHotEncoder()
    one_hot_encoder = enc.fit(integer_classes)
    
    # First, convert clases to 0-(N-1) integers using label_encoder
    num_of_rows = df.shape[0]
    t = label_encoder.transform(df[f]).reshape(num_of_rows, 1)
    
    # Second, create a sparse matrix with three columns, each one indicating if the instance belongs to the class
    new_features = one_hot_encoder.transform(t)
    
    # Add the new features to df
    df = np.concatenate([df, new_features.toarray()], axis = 1)
    
    #Eliminate converted columns
    df.drop(f, axis=1, inplace=True)
    
    # Update feature names
    feature_names = ['age', 'sex', 'first_class', 'second_class', 'third_class']
    # Convert to numerical values
    df = df.astype(float)
    titanic_y = titanic_y.astype(float)
"""
Out[39]:
'\nfrom sklearn.preprocessing import OneHotEncoder\n\nenc = LabelEncoder()\n\nfor f in categorical_features:\n\n    label_encoder = enc.fit(df[f])\n    print "Categorical classes:", label_encoder.classes_\n    \n    integer_classes = label_encoder.transform(label_encoder.classes_).reshape(\n        len(pd.Series(df[f].values.ravel()).unique()), 1)\n    print "Integer classes:", integer_classes\n    \n    enc = OneHotEncoder()\n    one_hot_encoder = enc.fit(integer_classes)\n    \n    # First, convert clases to 0-(N-1) integers using label_encoder\n    num_of_rows = df.shape[0]\n    t = label_encoder.transform(df[f]).reshape(num_of_rows, 1)\n    \n    # Second, create a sparse matrix with three columns, each one indicating if the instance belongs to the class\n    new_features = one_hot_encoder.transform(t)\n    \n    # Add the new features to df\n    df = np.concatenate([df, new_features.toarray()], axis = 1)\n    \n    #Eliminate converted columns\n    df.drop(f, axis=1, inplace=True)\n    \n    # Update feature names\n    feature_names = [\'age\', \'sex\', \'first_class\', \'second_class\', \'third_class\']\n    # Convert to numerical values\n    df = df.astype(float)\n    titanic_y = titanic_y.astype(float)\n'
In [40]:
features_list = list(onehot_df.columns)
In [41]:
features_list.remove('Earnings-Raw')
In [42]:
X = onehot_df[features_list]
y= onehot_df['Earnings-Raw']

Feature importances with forests of trees

This examples shows the use of forests of trees to evaluate the importance of features on an artificial classification task. The red bars are the feature importances of the forest, along with their inter-trees variability.

In [43]:
# split dataset to 60% training and 40% testing
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.4, random_state=0)
In [44]:
print X_train.shape, y_train.shape
(18097, 103) (18097L,)

feature_importances_

In [47]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import ExtraTreesClassifier

# Build a classification task using 3 informative features

# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=0)

forest.fit(X, y)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]
In [48]:
importances[indices[:5]]
Out[48]:
array([ 0.17 ,  0.153,  0.093,  0.071,  0.066])
In [50]:
features = features_list
for f in range(5):
    print("%d. feature %d - %s (%f)" % (f + 1, indices[f], features[indices[f]] ,importances[indices[f]]))
1. feature 1 - fnlwgt (0.170258)
2. feature 0 - Age (0.152898)
3. feature 6 - Hours-per-week (0.093478)
4. feature 32 - Marital-Status_ Married-civ-spouse (0.070931)
5. feature 4 - Capital-gain (0.065981)
In [51]:
best_features = []
for i in indices[:5]:
    best_features.append(features[i])
In [52]:
# Plot the top 5 feature importances of the forest
plt.figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
plt.title("Feature importances")
plt.bar(range(5), importances[indices][:5], 
       color="r",  yerr=std[indices][:5], align="center")
plt.xticks(range(5), best_features)
plt.xlim([-1, 5])
plt.show()

Decision Tree accuracy and time elapsed caculation

In [324]:
t0=time()
print "DecisionTree"

dt = DecisionTreeClassifier(min_samples_split=20,random_state=99)
# dt = DecisionTreeClassifier(min_samples_split=20,max_depth=5,random_state=99)

clf_dt=dt.fit(X_train,y_train)

print "Acurracy: ", clf_dt.score(X_test,y_test)
t1=time()
print "time elapsed: ", t1-t0
DecisionTree
Acurracy:  0.832399232246
time elapsed:  0.586999893188

cross validation for DT

In [327]:
tt0=time()
print "cross result========"
scores = cross_validation.cross_val_score(dt, X, y, cv=5)
print scores
print scores.mean()
tt1=time()
print "time elapsed: ", tt1-tt0
print "\n"
cross result========
[ 0.828  0.833  0.837  0.843  0.84 ]
0.836399625621
time elapsed:  4.68400001526


Tuning our hyperparameters using GridSearch

In [146]:
from sklearn.metrics import classification_report

pipeline = Pipeline([
    ('clf', DecisionTreeClassifier(criterion='entropy'))
])

parameters = {
    'clf__max_depth': (5, 25 , 50),
    'clf__min_samples_split': (1, 5, 10),
    'clf__min_samples_leaf': (1, 2, 3)
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='f1')
grid_search.fit(X_train, y_train)

print 'Best score: %0.3f' % grid_search.best_score_
print 'Best parameters set:'

best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print '\t%s: %r' % (param_name, best_parameters[param_name])

predictions = grid_search.predict(X_test)

print classification_report(y_test, predictions)
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   22.7s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 288 out of 288 | elapsed:  4.5min finished
Fitting 3 folds for each of 96 candidates, totalling 288 fits
Best score: 0.662
Best parameters set:
	clf__max_depth: 5
	clf__min_samples_leaf: 5
	clf__min_samples_split: 1
             precision    recall  f1-score   support

          0       0.62      0.59      0.60      7427
          1       0.65      0.68      0.67      8431

avg / total       0.64      0.64      0.64     15858

Random Forest accuracy and time elapsed caculation

In [329]:
t2=time()
print "RandomForest"
rf = RandomForestClassifier(n_estimators=100,n_jobs=-1)
clf_rf = rf.fit(X_train,y_train)
print "Acurracy: ", clf_rf.score(X_test,y_test)
t3=time()
print "time elapsed: ", t3-t2
RandomForest
Acurracy:  0.857504798464
time elapsed:  2.65199995041

cross validation for RF

In [330]:
tt2=time()
print "cross result========"
scores = cross_validation.cross_val_score(dt, X, y, cv=5)
print scores
print scores.mean()
tt3=time()
print "time elapsed: ", tt3-tt2
print "\n"
cross result========
[ 0.828  0.833  0.837  0.843  0.84 ]
0.836399625621
time elapsed:  4.94099998474


Receiver Operating Characteristic (ROC) curve

In [331]:
roc_auc_score(y_test,rf.predict(X_test))
Out[331]:
0.77685469678109964
In [332]:
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

predictions = rf.predict_proba(X_test)

false_positive_rate, recall, thresholds = roc_curve(y_test, predictions[:, 1])
roc_auc = auc(false_positive_rate, recall)
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')

plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()

Tuning Models using GridSearch

In [ ]:
pipeline2 = Pipeline([
('clf', RandomForestClassifier(criterion='entropy'))
])

parameters = {
    'clf__n_estimators': (5, 25, 50, 100),
    'clf__max_depth': (5, 25 , 50),
    'clf__min_samples_split': (1, 5, 10),
    'clf__min_samples_leaf': (1, 2, 3)
}

grid_search = GridSearchCV(pipeline2, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)

grid_search.fit(X_train, y_train)

print 'Best score: %0.3f' % grid_search.best_score_

print 'Best parameters set:'
best_parameters = grid_search.best_estimator_.get_params()

for param_name in sorted(parameters.keys()):
    print '\t%s: %r' % (param_name, best_parameters[param_name])

predictions = grid_search.predict(X_test)
print 'Accuracy:', accuracy_score(y_test, predictions)
print classification_report(y_test, predictions)
    

Naive Bayes accuracy and time elapsed caculation

In [333]:
t4=time()
print "NaiveBayes"
nb = BernoulliNB()
clf_nb=nb.fit(X_train,y_train)
print "Acurracy: ", clf_nb.score(X_test,y_test)
t5=time()
print "time elapsed: ", t5-t4
NaiveBayes
Acurracy:  0.772591170825
time elapsed:  0.168999910355

cross-validation for NB

In [335]:
tt4=time()
print "cross result========"
scores = cross_validation.cross_val_score(dt, X, y, cv=5)
print scores
print scores.mean()
tt5=time()
print "time elapsed: ", tt5-tt4
print "\n"
cross result========
[ 0.828  0.833  0.837  0.843  0.84 ]
0.836399625621
time elapsed:  4.72499990463


KNN accuracy and time elapsed caculation

In [336]:
t6=time()
print "KNN"
# knn = KNeighborsClassifier(n_neighbors=3)
knn = KNeighborsClassifier()
clf_knn=knn.fit(X_train, y_train)
print "Acurracy: ", clf_knn.score(X_test,y_test) 
t7=time()
print "time elapsed: ", t7-t6
KNN
Acurracy:  0.774587332054
time elapsed:  3.53800010681

cross validation for KNN

In [337]:
tt6=time()
print "cross result========"
scores = cross_validation.cross_val_score(dt, X, y, cv=5)
print scores
print scores.mean()
tt7=time()
print "time elapsed: ", tt7-tt6
print "\n"
cross result========
[ 0.828  0.833  0.837  0.843  0.84 ]
0.836399625621
time elapsed:  4.67000007629