Student-Performance-Evaluation using Classification-Regression


http://archive.ics.uci.edu/ml/datasets/Student+Performance

Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:

  1. school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
  2. sex - student's sex (binary: 'F' - female or 'M' - male)
  3. age - student's age (numeric: from 15 to 22)
  4. address - student's home address type (binary: 'U' - urban or 'R' - rural)
  5. famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
  6. Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
  7. Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
  8. Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
  9. Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
  10. Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
  11. reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
  12. guardian - student's guardian (nominal: 'mother', 'father' or 'other')
  13. traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - \>1 hour)
  14. studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
  15. failures - number of past class failures (numeric: n if 1<=n\<3, else 4)
  16. schoolsup - extra educational support (binary: yes or no)
  17. famsup - family educational support (binary: yes or no)
  18. paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
  19. activities - extra-curricular activities (binary: yes or no)
  20. nursery - attended nursery school (binary: yes or no)
  21. higher - wants to take higher education (binary: yes or no)
  22. internet - Internet access at home (binary: yes or no)
  23. romantic - with a romantic relationship (binary: yes or no)
  24. famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
  25. freetime - free time after school (numeric: from 1 - very low to 5 - very high)
  26. goout - going out with friends (numeric: from 1 - very low to 5 - very high)
  27. Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
  28. Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
  29. health - current health status (numeric: from 1 - very bad to 5 - very good)
  30. absences - number of school absences (numeric: from 0 to 93)

  31. G1 - first period grade (numeric: from 0 to 20)

  32. G2 - second period grade (numeric: from 0 to 20)
  33. G3 - final grade (numeric: from 0 to 20, output target)

these grades are related with the course subject, Math or Portuguese:

In [28]:
import os
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.cross_validation import train_test_split
from sklearn import cross_validation, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from time import time
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score , classification_report
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, accuracy_score, classification_report
In [4]:
# read .csv from provided dataset
csv_filename="student/student-mat.csv"

# df=pd.read_csv(csv_filename,index_col=0)
df=pd.read_csv(csv_filename, sep=";")
In [5]:
df.head()
Out[5]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel freetime goout Dalc Walc health absences G1 G2 G3
0 GP F 18 U GT3 A 4 4 at_home teacher ... 4 3 4 1 1 3 6 5 6 6
1 GP F 17 U GT3 T 1 1 at_home other ... 5 3 3 1 1 3 4 5 5 6
2 GP F 15 U LE3 T 1 1 at_home other ... 4 3 2 2 3 3 10 7 8 10
3 GP F 15 U GT3 T 4 2 health services ... 3 2 2 1 1 5 2 15 14 15
4 GP F 16 U GT3 T 3 3 other other ... 4 3 2 1 2 5 4 6 10 10

5 rows × 33 columns

In [6]:
df.describe()
Out[6]:
age Medu Fedu traveltime studytime failures famrel freetime goout Dalc Walc health absences G1 G2 G3
count 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000
mean 16.696203 2.749367 2.521519 1.448101 2.035443 0.334177 3.944304 3.235443 3.108861 1.481013 2.291139 3.554430 5.708861 10.908861 10.713924 10.415190
std 1.276043 1.094735 1.088201 0.697505 0.839240 0.743651 0.896659 0.998862 1.113278 0.890741 1.287897 1.390303 8.003096 3.319195 3.761505 4.581443
min 15.000000 0.000000 0.000000 1.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 3.000000 0.000000 0.000000
25% 16.000000 2.000000 2.000000 1.000000 1.000000 0.000000 4.000000 3.000000 2.000000 1.000000 1.000000 3.000000 0.000000 8.000000 9.000000 8.000000
50% 17.000000 3.000000 2.000000 1.000000 2.000000 0.000000 4.000000 3.000000 3.000000 1.000000 2.000000 4.000000 4.000000 11.000000 11.000000 11.000000
75% 18.000000 4.000000 3.000000 2.000000 2.000000 0.000000 5.000000 4.000000 4.000000 2.000000 3.000000 5.000000 8.000000 13.000000 13.000000 14.000000
max 22.000000 4.000000 4.000000 4.000000 4.000000 3.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 75.000000 19.000000 19.000000 20.000000

CASE 1: Binary Classification : G3>10: 1 else 0

In [7]:
df.G3.describe()
Out[7]:
count    395.000000
mean      10.415190
std        4.581443
min        0.000000
25%        8.000000
50%       11.000000
75%       14.000000
max       20.000000
Name: G3, dtype: float64
In [9]:
# handle G3 attrubte to binary
high = df.G3 >= 10
low = df.G3 < 10
df.loc[high,'G3'] = 1
df.loc[low,'G3'] = 0
In [10]:
df.head()
Out[10]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel freetime goout Dalc Walc health absences G1 G2 G3
0 GP F 18 U GT3 A 4 4 at_home teacher ... 4 3 4 1 1 3 6 5 6 0
1 GP F 17 U GT3 T 1 1 at_home other ... 5 3 3 1 1 3 4 5 5 0
2 GP F 15 U LE3 T 1 1 at_home other ... 4 3 2 2 3 3 10 7 8 1
3 GP F 15 U GT3 T 4 2 health services ... 3 2 2 1 1 5 2 15 14 1
4 GP F 16 U GT3 T 3 3 other other ... 4 3 2 1 2 5 4 6 10 1

5 rows × 33 columns

In [11]:
df.G3.describe()
Out[11]:
count    395.000000
mean       0.670886
std        0.470487
min        0.000000
25%        0.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: G3, dtype: float64
In [13]:
cols = list(df.columns)
In [18]:
categorical_features = []
for f in cols:
    if df[f].dtype != 'int64':
        categorical_features.append(f)
categorical_features
Out[18]:
['school',
 'sex',
 'address',
 'famsize',
 'Pstatus',
 'Mjob',
 'Fjob',
 'reason',
 'guardian',
 'schoolsup',
 'famsup',
 'paid',
 'activities',
 'nursery',
 'higher',
 'internet',
 'romantic']
In [19]:
for f in categorical_features:

    #Get binarized columns
    df[f] = pd.get_dummies(df[f])
In [20]:
df.head()
Out[20]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel freetime goout Dalc Walc health absences G1 G2 G3
0 1.0 1.0 18 0.0 1.0 1.0 4 4 1.0 0.0 ... 4 3 4 1 1 3 6 5 6 0
1 1.0 1.0 17 0.0 1.0 0.0 1 1 1.0 0.0 ... 5 3 3 1 1 3 4 5 5 0
2 1.0 1.0 15 0.0 0.0 0.0 1 1 1.0 0.0 ... 4 3 2 2 3 3 10 7 8 1
3 1.0 1.0 15 0.0 1.0 0.0 4 2 0.0 0.0 ... 3 2 2 1 1 5 2 15 14 1
4 1.0 1.0 16 0.0 1.0 0.0 3 3 0.0 0.0 ... 4 3 2 1 2 5 4 6 10 1

5 rows × 33 columns

In [21]:
features=list(df.columns[:-1])
In [23]:
X = df[features]
y = df['G3']
In [24]:
# split dataset to 60% training and 40% testing
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,y, test_size=0.4, random_state=0)
In [25]:
print X_train.shape, y_train.shape
(237, 32) (237L,)

Feature importances with forests of trees

This examples shows the use of forests of trees to evaluate the importance of features on an artificial classification task. The red bars are the feature importances of the forest, along with their inter-trees variability.

In [29]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import ExtraTreesClassifier

# Build a classification task using 3 informative features


# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=0)

forest.fit(X, y)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. feature %d - %s (%f) " % (f + 1, indices[f], features[indices[f]], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure(num=None, figsize=(14, 10), dpi=80, facecolor='w', edgecolor='k')
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()
Feature ranking:
1. feature 31 - G2 (0.240635) 
2. feature 30 - G1 (0.180379) 
3. feature 14 - failures (0.058718) 
4. feature 29 - absences (0.032248) 
5. feature 25 - goout (0.031282) 
6. feature 2 - age (0.031181) 
7. feature 6 - Medu (0.023217) 
8. feature 7 - Fedu (0.023036) 
9. feature 28 - health (0.022799) 
10. feature 13 - studytime (0.022735) 
11. feature 23 - famrel (0.021943) 
12. feature 24 - freetime (0.021443) 
13. feature 27 - Walc (0.020626) 
14. feature 15 - schoolsup (0.018678) 
15. feature 22 - romantic (0.017761) 
16. feature 10 - reason (0.016906) 
17. feature 18 - activities (0.016887) 
18. feature 26 - Dalc (0.016689) 
19. feature 3 - address (0.016217) 
20. feature 1 - sex (0.016042) 
21. feature 12 - traveltime (0.015887) 
22. feature 4 - famsize (0.015885) 
23. feature 17 - paid (0.015443) 
24. feature 16 - famsup (0.014640) 
25. feature 11 - guardian (0.014492) 
26. feature 21 - internet (0.013992) 
27. feature 8 - Mjob (0.013320) 
28. feature 19 - nursery (0.011929) 
29. feature 20 - higher (0.010266) 
30. feature 5 - Pstatus (0.008672) 
31. feature 9 - Fjob (0.008311) 
32. feature 0 - school (0.007740) 
In [30]:
importances[indices[:5]]
Out[30]:
array([ 0.24063511,  0.18037904,  0.05871839,  0.03224845,  0.03128223])
In [31]:
for f in range(5):
    print("%d. feature %d - %s (%f)" % (f + 1, indices[f], features[indices[f]] ,importances[indices[f]]))
1. feature 31 - G2 (0.240635)
2. feature 30 - G1 (0.180379)
3. feature 14 - failures (0.058718)
4. feature 29 - absences (0.032248)
5. feature 25 - goout (0.031282)
In [32]:
best_features = []
for i in indices[:5]:
    best_features.append(features[i])
In [34]:
# Plot the top 5 feature importances of the forest
plt.figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
plt.title("Feature importances")
plt.bar(range(5), importances[indices][:5], 
       color="r",  yerr=std[indices][:5], align="center")
plt.xticks(range(5), best_features)
plt.xlim([-1, 5])
plt.show()

Decision Tree accuracy and time elapsed caculation

In [35]:
t0=time()
print "DecisionTree"

dt = DecisionTreeClassifier(min_samples_split=20,random_state=99)
# dt = DecisionTreeClassifier(min_samples_split=20,max_depth=5,random_state=99)

clf_dt=dt.fit(X_train,y_train)

print "Acurracy: ", clf_dt.score(X_test,y_test)
t1=time()
print "time elapsed: ", t1-t0
DecisionTree
Acurracy:  0.886075949367
time elapsed:  0.00399994850159

cross validation for DT

In [36]:
tt0=time()
print "cross result========"
scores = cross_validation.cross_val_score(dt, X,y, cv=5)
print scores
print scores.mean()
tt1=time()
print "time elapsed: ", tt1-tt0
print "\n"
cross result========
[ 0.89873418  0.86075949  0.84810127  0.82278481  0.86075949]
0.858227848101
time elapsed:  0.148000001907


Tuning our hyperparameters using GridSearch

In [38]:
from sklearn.metrics import classification_report

pipeline = Pipeline([
    ('clf', DecisionTreeClassifier(criterion='entropy'))
])

parameters = {
    'clf__max_depth': (5, 25 , 50),
    'clf__min_samples_split': (1, 5, 10),
    'clf__min_samples_leaf': (1, 2, 3)
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='f1')
grid_search.fit(X_train, y_train)

print 'Best score: %0.3f' % grid_search.best_score_
print 'Best parameters set:'

best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print '\t%s: %r' % (param_name, best_parameters[param_name])

predictions = grid_search.predict(X_test)

print classification_report(y_test, predictions)
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   19.6s
[Parallel(n_jobs=-1)]: Done  81 out of  81 | elapsed:   20.2s finished
Fitting 3 folds for each of 27 candidates, totalling 81 fits
Best score: 0.933
Best parameters set:
	clf__max_depth: 50
	clf__min_samples_leaf: 2
	clf__min_samples_split: 5
             precision    recall  f1-score   support

          0       0.90      0.73      0.80        59
          1       0.85      0.95      0.90        99

avg / total       0.87      0.87      0.86       158

Random Forest accuracy and time elapsed caculation

In [39]:
t2=time()
print "RandomForest"
rf = RandomForestClassifier(n_estimators=100,n_jobs=-1)
clf_rf = rf.fit(X_train,y_train)
print "Acurracy: ", clf_rf.score(X_test,y_test)
t3=time()
print "time elapsed: ", t3-t2
RandomForest
Acurracy:  0.911392405063
time elapsed:  0.871000051498

cross validation for RF

In [40]:
tt2=time()
print "cross result========"
scores = cross_validation.cross_val_score(dt, X,y, cv=5)
print scores
print scores.mean()
tt3=time()
print "time elapsed: ", tt3-tt2
print "\n"
cross result========
[ 0.89873418  0.86075949  0.84810127  0.82278481  0.86075949]
0.858227848101
time elapsed:  0.0499999523163


Receiver Operating Characteristic (ROC) curve

In [41]:
roc_auc_score(y_test,rf.predict(X_test))
Out[41]:
0.89162814586543404
In [42]:
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

predictions = rf.predict_proba(X_test)

false_positive_rate, recall, thresholds = roc_curve(y_test, predictions[:, 1])
roc_auc = auc(false_positive_rate, recall)
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')

plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()

Tuning Models using GridSearch

In [43]:
pipeline2 = Pipeline([
('clf', RandomForestClassifier(criterion='entropy'))
])

parameters = {
    'clf__n_estimators': (5, 25, 50, 100),
    'clf__max_depth': (5, 25 , 50),
    'clf__min_samples_split': (1, 5, 10),
    'clf__min_samples_leaf': (1, 2, 3)
}

grid_search = GridSearchCV(pipeline2, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)

grid_search.fit(X_train, y_train)

print 'Best score: %0.3f' % grid_search.best_score_

print 'Best parameters set:'
best_parameters = grid_search.best_estimator_.get_params()

for param_name in sorted(parameters.keys()):
    print '\t%s: %r' % (param_name, best_parameters[param_name])

predictions = grid_search.predict(X_test)
print 'Accuracy:', accuracy_score(y_test, predictions)
print classification_report(y_test, predictions)
    
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   21.5s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   36.3s
[Parallel(n_jobs=-1)]: Done 324 out of 324 | elapsed:   58.4s finished
Fitting 3 folds for each of 108 candidates, totalling 324 fits
Best score: 0.916
Best parameters set:
	clf__max_depth: 5
	clf__min_samples_leaf: 3
	clf__min_samples_split: 1
	clf__n_estimators: 50
Accuracy: 0.905063291139
             precision    recall  f1-score   support

          0       0.94      0.80      0.86        59
          1       0.89      0.97      0.93        99

avg / total       0.91      0.91      0.90       158

Naive Bayes accuracy and time elapsed caculation

In [44]:
t4=time()
print "NaiveBayes"
nb = BernoulliNB()
clf_nb=nb.fit(X_train,y_train)
print "Acurracy: ", clf_nb.score(X_test,y_test)
t5=time()
print "time elapsed: ", t5-t4
NaiveBayes
Acurracy:  0.70253164557
time elapsed:  0.166000127792

cross-validation for NB

In [45]:
tt4=time()
print "cross result========"
scores = cross_validation.cross_val_score(dt, X,y, cv=5)
print scores
print scores.mean()
tt5=time()
print "time elapsed: ", tt5-tt4
print "\n"
cross result========
[ 0.89873418  0.86075949  0.84810127  0.82278481  0.86075949]
0.858227848101
time elapsed:  0.0549998283386


KNN accuracy and time elapsed caculation

In [46]:
t6=time()
print "KNN"
# knn = KNeighborsClassifier(n_neighbors=3)
knn = KNeighborsClassifier()
clf_knn=knn.fit(X_train, y_train)
print "Acurracy: ", clf_knn.score(X_test,y_test) 
t7=time()
print "time elapsed: ", t7-t6
KNN
Acurracy:  0.892405063291
time elapsed:  0.00899982452393

cross validation for KNN

In [47]:
tt6=time()
print "cross result========"
scores = cross_validation.cross_val_score(dt, X,y, cv=5)
print scores
print scores.mean()
tt7=time()
print "time elapsed: ", tt7-tt6
print "\n"
cross result========
[ 0.89873418  0.86075949  0.84810127  0.82278481  0.86075949]
0.858227848101
time elapsed:  0.0460000038147


SVM accuracy and time elapsed caculation

In [48]:
t7=time()
print "SVM"

svc = SVC()
clf_svc=svc.fit(X_train, y_train)
print "Acurracy: ", clf_svc.score(X_test,y_test) 
t8=time()
print "time elapsed: ", t8-t7
SVM
Acurracy:  0.867088607595
time elapsed:  0.010999917984

cross validation for SVM

In [49]:
tt7=time()
print "cross result========"
scores = cross_validation.cross_val_score(dt, X,y, cv=5)
print scores
print scores.mean()
tt8=time()
print "time elapsed: ", tt7-tt6
print "\n"
cross result========
[ 0.89873418  0.86075949  0.84810127  0.82278481  0.86075949]
0.858227848101
time elapsed:  7.27399992943


In [50]:
from sklearn.svm import SVC
from sklearn.cross_validation import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn import grid_search

svc = SVC()

parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}

grid = grid_search.GridSearchCV(svc, parameters, n_jobs=-1, verbose=1, scoring='accuracy')


grid.fit(X_train, y_train)

print 'Best score: %0.3f' % grid.best_score_

print 'Best parameters set:'
best_parameters = grid.best_estimator_.get_params()

for param_name in sorted(parameters.keys()):
    print '\t%s: %r' % (param_name, best_parameters[param_name])
    
predictions = grid.predict(X_test)
print classification_report(y_test, predictions)
Fitting 3 folds for each of 4 candidates, totalling 12 fits
Best score: 0.907
Best parameters set:
	C: 10
	kernel: 'rbf'
             precision    recall  f1-score   support

          0       0.92      0.76      0.83        59
          1       0.87      0.96      0.91        99

avg / total       0.89      0.89      0.88       158

[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:   18.2s finished
In [51]:
pipeline = Pipeline([
    ('clf', SVC(kernel='rbf', gamma=0.01, C=100))
])

parameters = {
    'clf__gamma': (0.01, 0.03, 0.1, 0.3, 1),
    'clf__C': (0.1, 0.3, 1, 3, 10, 30),
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy')

grid_search.fit(X_train, y_train)

print 'Best score: %0.3f' % grid_search.best_score_

print 'Best parameters set:'
best_parameters = grid_search.best_estimator_.get_params()

for param_name in sorted(parameters.keys()):
    print '\t%s: %r' % (param_name, best_parameters[param_name])
    
predictions = grid_search.predict(X_test)
print classification_report(y_test, predictions)
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   19.4s
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:   20.2s finished
Fitting 3 folds for each of 30 candidates, totalling 90 fits
Best score: 0.907
Best parameters set:
	clf__C: 3
	clf__gamma: 0.01
             precision    recall  f1-score   support

          0       0.92      0.83      0.88        59
          1       0.90      0.96      0.93        99

avg / total       0.91      0.91      0.91       158


In [100]:
# read .csv from provided dataset
csv_filename="student/student-mat.csv"

# df=pd.read_csv(csv_filename,index_col=0)
df=pd.read_csv(csv_filename, sep=";")
In [101]:
df.head()
Out[101]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel freetime goout Dalc Walc health absences G1 G2 G3
0 GP F 18 U GT3 A 4 4 at_home teacher ... 4 3 4 1 1 3 6 5 6 6
1 GP F 17 U GT3 T 1 1 at_home other ... 5 3 3 1 1 3 4 5 5 6
2 GP F 15 U LE3 T 1 1 at_home other ... 4 3 2 2 3 3 10 7 8 10
3 GP F 15 U GT3 T 4 2 health services ... 3 2 2 1 1 5 2 15 14 15
4 GP F 16 U GT3 T 3 3 other other ... 4 3 2 1 2 5 4 6 10 10

5 rows × 33 columns

In [102]:
df.describe()
Out[102]:
age Medu Fedu traveltime studytime failures famrel freetime goout Dalc Walc health absences G1 G2 G3
count 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000
mean 16.696203 2.749367 2.521519 1.448101 2.035443 0.334177 3.944304 3.235443 3.108861 1.481013 2.291139 3.554430 5.708861 10.908861 10.713924 10.415190
std 1.276043 1.094735 1.088201 0.697505 0.839240 0.743651 0.896659 0.998862 1.113278 0.890741 1.287897 1.390303 8.003096 3.319195 3.761505 4.581443
min 15.000000 0.000000 0.000000 1.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 3.000000 0.000000 0.000000
25% 16.000000 2.000000 2.000000 1.000000 1.000000 0.000000 4.000000 3.000000 2.000000 1.000000 1.000000 3.000000 0.000000 8.000000 9.000000 8.000000
50% 17.000000 3.000000 2.000000 1.000000 2.000000 0.000000 4.000000 3.000000 3.000000 1.000000 2.000000 4.000000 4.000000 11.000000 11.000000 11.000000
75% 18.000000 4.000000 3.000000 2.000000 2.000000 0.000000 5.000000 4.000000 4.000000 2.000000 3.000000 5.000000 8.000000 13.000000 13.000000 14.000000
max 22.000000 4.000000 4.000000 4.000000 4.000000 3.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 75.000000 19.000000 19.000000 20.000000

CASE 2: Multi Class Classification :

In [103]:
df.G3.describe()
Out[103]:
count    395.000000
mean      10.415190
std        4.581443
min        0.000000
25%        8.000000
50%       11.000000
75%       14.000000
max       20.000000
Name: G3, dtype: float64
In [104]:
for i in range(len(df.G3)):
    if df.G3.loc[i] < 10:
        df.G3.loc[i] = 5
    elif df.G3.loc[i] < 12:
        df.G3.loc[i] = 4
    elif df.G3.loc[i] < 14:
        df.G3.loc[i] = 3
    elif df.G3.loc[i] < 16:
        df.G3.loc[i] = 2
    elif df.G3.loc[i] < 21:
        df.G3.loc[i] = 1
In [105]:
df.G3.unique()
Out[105]:
array([5, 4, 2, 1, 3], dtype=int64)
In [106]:
df.head()
Out[106]:
ClassG3Label
I (excellent/very good)16-20A
II (good)14-15B
III (satisfactory)12-13C
IV (sufficient)10-11D
V (fail)0-9E
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel freetime goout Dalc Walc health absences G1 G2 G3
0 GP F 18 U GT3 A 4 4 at_home teacher ... 4 3 4 1 1 3 6 5 6 5
1 GP F 17 U GT3 T 1 1 at_home other ... 5 3 3 1 1 3 4 5 5 5
2 GP F 15 U LE3 T 1 1 at_home other ... 4 3 2 2 3 3 10 7 8 4
3 GP F 15 U GT3 T 4 2 health services ... 3 2 2 1 1 5 2 15 14 2
4 GP F 16 U GT3 T 3 3 other other ... 4 3 2 1 2 5 4 6 10 4

5 rows × 33 columns

In [107]:
df.G3.describe()
Out[107]:
count    395.000000
mean       3.564557
std        1.349096
min        1.000000
25%        2.000000
50%        4.000000
75%        5.000000
max        5.000000
Name: G3, dtype: float64
In [108]:
cols = list(df.columns)
In [109]:
categorical_features = []
for f in cols:
    if df[f].dtype != 'int64':
        categorical_features.append(f)
categorical_features
Out[109]:
['school',
 'sex',
 'address',
 'famsize',
 'Pstatus',
 'Mjob',
 'Fjob',
 'reason',
 'guardian',
 'schoolsup',
 'famsup',
 'paid',
 'activities',
 'nursery',
 'higher',
 'internet',
 'romantic']
In [110]:
for f in categorical_features:

    #Get binarized columns
    df[f] = pd.get_dummies(df[f])
In [111]:
df.head()
Out[111]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel freetime goout Dalc Walc health absences G1 G2 G3
0 1.0 1.0 18 0.0 1.0 1.0 4 4 1.0 0.0 ... 4 3 4 1 1 3 6 5 6 5
1 1.0 1.0 17 0.0 1.0 0.0 1 1 1.0 0.0 ... 5 3 3 1 1 3 4 5 5 5
2 1.0 1.0 15 0.0 0.0 0.0 1 1 1.0 0.0 ... 4 3 2 2 3 3 10 7 8 4
3 1.0 1.0 15 0.0 1.0 0.0 4 2 0.0 0.0 ... 3 2 2 1 1 5 2 15 14 2
4 1.0 1.0 16 0.0 1.0 0.0 3 3 0.0 0.0 ... 4 3 2 1 2 5 4 6 10 4

5 rows × 33 columns

In [112]:
features=list(df.columns[:-1])
In [113]:
X = df[features]
y = df['G3']
In [114]:
# split dataset to 60% training and 40% testing
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,y, test_size=0.4, random_state=0)
In [115]:
print X_train.shape, y_train.shape
(237, 32) (237L,)

Feature importances with forests of trees

This examples shows the use of forests of trees to evaluate the importance of features on an artificial classification task. The red bars are the feature importances of the forest, along with their inter-trees variability.

In [116]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import ExtraTreesClassifier

# Build a classification task using 3 informative features


# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=0)

forest.fit(X, y)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. feature %d - %s (%f) " % (f + 1, indices[f], features[indices[f]], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure(num=None, figsize=(14, 10), dpi=80, facecolor='w', edgecolor='k')
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()
Feature ranking:
1. feature 31 - G2 (0.195782) 
2. feature 30 - G1 (0.145836) 
3. feature 29 - absences (0.035497) 
4. feature 25 - goout (0.032551) 
5. feature 2 - age (0.032141) 
6. feature 6 - Medu (0.031506) 
7. feature 28 - health (0.030933) 
8. feature 24 - freetime (0.029983) 
9. feature 27 - Walc (0.029495) 
10. feature 13 - studytime (0.029247) 
11. feature 23 - famrel (0.029207) 
12. feature 14 - failures (0.028804) 
13. feature 7 - Fedu (0.028431) 
14. feature 12 - traveltime (0.022976) 
15. feature 26 - Dalc (0.022761) 
16. feature 18 - activities (0.021505) 
17. feature 1 - sex (0.021448) 
18. feature 16 - famsup (0.021390) 
19. feature 17 - paid (0.020667) 
20. feature 10 - reason (0.020605) 
21. feature 4 - famsize (0.020506) 
22. feature 22 - romantic (0.019582) 
23. feature 19 - nursery (0.018253) 
24. feature 11 - guardian (0.017402) 
25. feature 3 - address (0.016361) 
26. feature 15 - schoolsup (0.014624) 
27. feature 21 - internet (0.014290) 
28. feature 5 - Pstatus (0.012596) 
29. feature 8 - Mjob (0.012404) 
30. feature 0 - school (0.010073) 
31. feature 9 - Fjob (0.007019) 
32. feature 20 - higher (0.006123) 
In [117]:
importances[indices[:5]]
Out[117]:
array([ 0.19578233,  0.14583578,  0.03549731,  0.03255128,  0.03214145])
In [118]:
for f in range(5):
    print("%d. feature %d - %s (%f)" % (f + 1, indices[f], features[indices[f]] ,importances[indices[f]]))
1. feature 31 - G2 (0.195782)
2. feature 30 - G1 (0.145836)
3. feature 29 - absences (0.035497)
4. feature 25 - goout (0.032551)
5. feature 2 - age (0.032141)
In [119]:
best_features = []
for i in indices[:5]:
    best_features.append(features[i])
In [120]:
# Plot the top 5 feature importances of the forest
plt.figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
plt.title("Feature importances")
plt.bar(range(5), importances[indices][:5], 
       color="r",  yerr=std[indices][:5], align="center")
plt.xticks(range(5), best_features)
plt.xlim([-1, 5])
plt.show()

Decision Tree accuracy and time elapsed caculation

In [121]:
t0=time()
print "DecisionTree"

dt = DecisionTreeClassifier(min_samples_split=20,random_state=99)
# dt = DecisionTreeClassifier(min_samples_split=20,max_depth=5,random_state=99)

clf_dt=dt.fit(X_train,y_train)

print "Acurracy: ", clf_dt.score(X_test,y_test)
t1=time()
print "time elapsed: ", t1-t0
DecisionTree
Acurracy:  0.708860759494
time elapsed:  0.00399994850159

cross validation for DT

In [122]:
tt0=time()
print "cross result========"
scores = cross_validation.cross_val_score(dt, X,y, cv=5)
print scores
print scores.mean()
tt1=time()
print "time elapsed: ", tt1-tt0
print "\n"
cross result========
[ 0.725       0.7625      0.64556962  0.56410256  0.66666667]
0.672767770204
time elapsed:  0.0529999732971


Tuning our hyperparameters using GridSearch

In [123]:
from sklearn.metrics import classification_report

pipeline = Pipeline([
    ('clf', DecisionTreeClassifier(criterion='entropy'))
])

parameters = {
    'clf__max_depth': (5, 25 , 50),
    'clf__min_samples_split': (1, 5, 10),
    'clf__min_samples_leaf': (1, 2, 3)
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='f1')
grid_search.fit(X_train, y_train)

print 'Best score: %0.3f' % grid_search.best_score_
print 'Best parameters set:'

best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print '\t%s: %r' % (param_name, best_parameters[param_name])

predictions = grid_search.predict(X_test)

print classification_report(y_test, predictions)
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   22.4s
[Parallel(n_jobs=-1)]: Done  81 out of  81 | elapsed:   23.2s finished
Fitting 3 folds for each of 27 candidates, totalling 81 fits
Best score: 0.718
Best parameters set:
	clf__max_depth: 50
	clf__min_samples_leaf: 3
	clf__min_samples_split: 10
             precision    recall  f1-score   support

          1       0.75      0.94      0.83        16
          2       0.68      0.71      0.69        24
          3       0.65      0.46      0.54        24
          4       0.56      0.71      0.63        35
          5       0.90      0.78      0.84        59

avg / total       0.74      0.72      0.72       158

Random Forest accuracy and time elapsed caculation

In [124]:
t2=time()
print "RandomForest"
rf = RandomForestClassifier(n_estimators=100,n_jobs=-1)
clf_rf = rf.fit(X_train,y_train)
print "Acurracy: ", clf_rf.score(X_test,y_test)
t3=time()
print "time elapsed: ", t3-t2
RandomForest
Acurracy:  0.715189873418
time elapsed:  1.20000004768

cross validation for RF

In [125]:
tt2=time()
print "cross result========"
scores = cross_validation.cross_val_score(dt, X,y, cv=5)
print scores
print scores.mean()
tt3=time()
print "time elapsed: ", tt3-tt2
print "\n"
cross result========
[ 0.725       0.7625      0.64556962  0.56410256  0.66666667]
0.672767770204
time elapsed:  0.0699999332428


Tuning Models using GridSearch

In [127]:
pipeline2 = Pipeline([
('clf', RandomForestClassifier(criterion='entropy'))
])

parameters = {
    'clf__n_estimators': (5, 25, 50, 100),
    'clf__max_depth': (5, 25 , 50),
    'clf__min_samples_split': (1, 5, 10),
    'clf__min_samples_leaf': (1, 2, 3)
}

grid_search = GridSearchCV(pipeline2, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)

grid_search.fit(X_train, y_train)

print 'Best score: %0.3f' % grid_search.best_score_

print 'Best parameters set:'
best_parameters = grid_search.best_estimator_.get_params()

for param_name in sorted(parameters.keys()):
    print '\t%s: %r' % (param_name, best_parameters[param_name])

predictions = grid_search.predict(X_test)
print 'Accuracy:', accuracy_score(y_test, predictions)
print classification_report(y_test, predictions)
    
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   23.1s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   37.6s
[Parallel(n_jobs=-1)]: Done 324 out of 324 | elapsed:  1.1min finished
Fitting 3 folds for each of 108 candidates, totalling 324 fits
Best score: 0.713
Best parameters set:
	clf__max_depth: 50
	clf__min_samples_leaf: 3
	clf__min_samples_split: 5
	clf__n_estimators: 100
Accuracy: 0.753164556962
             precision    recall  f1-score   support

          1       0.80      0.50      0.62        16
          2       0.61      0.83      0.70        24
          3       0.65      0.46      0.54        24
          4       0.67      0.74      0.70        35
          5       0.92      0.92      0.92        59

avg / total       0.76      0.75      0.75       158

Naive Bayes accuracy and time elapsed caculation

In [128]:
t4=time()
print "NaiveBayes"
nb = BernoulliNB()
clf_nb=nb.fit(X_train,y_train)
print "Acurracy: ", clf_nb.score(X_test,y_test)
t5=time()
print "time elapsed: ", t5-t4
NaiveBayes
Acurracy:  0.379746835443
time elapsed:  0.105000019073

cross-validation for NB

In [129]:
tt4=time()
print "cross result========"
scores = cross_validation.cross_val_score(dt, X,y, cv=5)
print scores
print scores.mean()
tt5=time()
print "time elapsed: ", tt5-tt4
print "\n"
cross result========
[ 0.725       0.7625      0.64556962  0.56410256  0.66666667]
0.672767770204
time elapsed:  0.0550000667572


KNN accuracy and time elapsed caculation

In [130]:
t6=time()
print "KNN"
# knn = KNeighborsClassifier(n_neighbors=3)
knn = KNeighborsClassifier()
clf_knn=knn.fit(X_train, y_train)
print "Acurracy: ", clf_knn.score(X_test,y_test) 
t7=time()
print "time elapsed: ", t7-t6
KNN
Acurracy:  0.626582278481
time elapsed:  0.00899982452393

cross validation for KNN

In [131]:
tt6=time()
print "cross result========"
scores = cross_validation.cross_val_score(dt, X,y, cv=5)
print scores
print scores.mean()
tt7=time()
print "time elapsed: ", tt7-tt6
print "\n"
cross result========
[ 0.725       0.7625      0.64556962  0.56410256  0.66666667]
0.672767770204
time elapsed:  0.0540001392365


SVM accuracy and time elapsed caculation

In [132]:
t7=time()
print "SVM"

svc = SVC()
clf_svc=svc.fit(X_train, y_train)
print "Acurracy: ", clf_svc.score(X_test,y_test) 
t8=time()
print "time elapsed: ", t8-t7
SVM
Acurracy:  0.721518987342
time elapsed:  0.0210001468658

cross validation for SVM

In [133]:
tt7=time()
print "cross result========"
scores = cross_validation.cross_val_score(dt, X,y, cv=5)
print scores
print scores.mean()
tt8=time()
print "time elapsed: ", tt7-tt6
print "\n"
cross result========
[ 0.725       0.7625      0.64556962  0.56410256  0.66666667]
0.672767770204
time elapsed:  4.89100003242


In [134]:
from sklearn.svm import SVC
from sklearn.cross_validation import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn import grid_search

svc = SVC()

parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}

grid = grid_search.GridSearchCV(svc, parameters, n_jobs=-1, verbose=1, scoring='accuracy')


grid.fit(X_train, y_train)

print 'Best score: %0.3f' % grid.best_score_

print 'Best parameters set:'
best_parameters = grid.best_estimator_.get_params()

for param_name in sorted(parameters.keys()):
    print '\t%s: %r' % (param_name, best_parameters[param_name])
    
predictions = grid.predict(X_test)
print classification_report(y_test, predictions)
Fitting 3 folds for each of 4 candidates, totalling 12 fits
Best score: 0.679
Best parameters set:
	C: 1
	kernel: 'linear'
             precision    recall  f1-score   support

          1       0.67      0.62      0.65        16
          2       0.59      0.54      0.57        24
          3       0.52      0.62      0.57        24
          4       0.67      0.63      0.65        35
          5       0.92      0.92      0.92        59

avg / total       0.73      0.72      0.72       158

[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:   17.0s finished
In [135]:
pipeline = Pipeline([
    ('clf', SVC(kernel='rbf', gamma=0.01, C=100))
])

parameters = {
    'clf__gamma': (0.01, 0.03, 0.1, 0.3, 1),
    'clf__C': (0.1, 0.3, 1, 3, 10, 30),
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy')

grid_search.fit(X_train, y_train)

print 'Best score: %0.3f' % grid_search.best_score_

print 'Best parameters set:'
best_parameters = grid_search.best_estimator_.get_params()

for param_name in sorted(parameters.keys()):
    print '\t%s: %r' % (param_name, best_parameters[param_name])
    
predictions = grid_search.predict(X_test)
print classification_report(y_test, predictions)
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   19.1s
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:   19.8s finished
Fitting 3 folds for each of 30 candidates, totalling 90 fits
Best score: 0.700
Best parameters set:
	clf__C: 3
	clf__gamma: 0.01
             precision    recall  f1-score   support

          1       0.79      0.69      0.73        16
          2       0.73      0.79      0.76        24
          3       0.62      0.67      0.64        24
          4       0.61      0.66      0.63        35
          5       0.93      0.85      0.88        59

avg / total       0.76      0.75      0.76       158


Case 3 : Regression

In [149]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

from sklearn.cross_validation import train_test_split
from sklearn. cross_validation import cross_val_score

from sklearn.feature_selection import *
from sklearn import metrics
In [136]:
# read .csv from provided dataset
csv_filename="student/student-mat.csv"

# df=pd.read_csv(csv_filename,index_col=0)
df=pd.read_csv(csv_filename,sep=";")
In [137]:
df.head()
Out[137]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel freetime goout Dalc Walc health absences G1 G2 G3
0 GP F 18 U GT3 A 4 4 at_home teacher ... 4 3 4 1 1 3 6 5 6 6
1 GP F 17 U GT3 T 1 1 at_home other ... 5 3 3 1 1 3 4 5 5 6
2 GP F 15 U LE3 T 1 1 at_home other ... 4 3 2 2 3 3 10 7 8 10
3 GP F 15 U GT3 T 4 2 health services ... 3 2 2 1 1 5 2 15 14 15
4 GP F 16 U GT3 T 3 3 other other ... 4 3 2 1 2 5 4 6 10 10

5 rows × 33 columns

In [141]:
cols = list(df.columns)
In [143]:
categorical_features = []
for f in cols:
    if df[f].dtype != 'int64':
        categorical_features.append(f)
categorical_features
Out[143]:
['school',
 'sex',
 'address',
 'famsize',
 'Pstatus',
 'Mjob',
 'Fjob',
 'reason',
 'guardian',
 'schoolsup',
 'famsup',
 'paid',
 'activities',
 'nursery',
 'higher',
 'internet',
 'romantic']
In [144]:
for f in categorical_features:

    #Get binarized columns
    df[f] = pd.get_dummies(df[f])
In [155]:
df.head()
Out[155]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel freetime goout Dalc Walc health absences G1 G2 G3
0 1.0 1.0 18 0.0 1.0 1.0 4 4 1.0 0.0 ... 4 3 4 1 1 3 6 5 6 6
1 1.0 1.0 17 0.0 1.0 0.0 1 1 1.0 0.0 ... 5 3 3 1 1 3 4 5 5 6
2 1.0 1.0 15 0.0 0.0 0.0 1 1 1.0 0.0 ... 4 3 2 2 3 3 10 7 8 10
3 1.0 1.0 15 0.0 1.0 0.0 4 2 0.0 0.0 ... 3 2 2 1 1 5 2 15 14 15
4 1.0 1.0 16 0.0 1.0 0.0 3 3 0.0 0.0 ... 4 3 2 1 2 5 4 6 10 10

5 rows × 33 columns

In [156]:
features=list(df.columns[:-1])
In [157]:
X = df[features]
y = df['G3']
In [158]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
In [169]:
from sklearn.feature_selection import *
fs=SelectKBest(score_func=f_regression,k=5)
X_new=fs.fit_transform(X_train,y_train)
z =  zip(fs.get_support(),features)
print z

x_min, x_max = X_new[:,0].min() - .5, X_new[:, 0].max() + .5
y_min, y_max = y_train.min() - .5, y_train.max() + .5
#fig=plt.figure()
#fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# Two subplots, unpack the axes array immediately
fig, axes = plt.subplots(1,5)
fig.set_size_inches(12,12)

for i in range(5):
    axes[i].set_aspect('equal')
    axes[i].set_title('Feature {}'.format(i))
    axes[i].set_xlabel('Feature')
    axes[i].set_ylabel('Grades')
    axes[i].set_xlim(x_min, x_max)
    axes[i].set_ylim(y_min, y_max)
    plt.sca(axes[i])
    plt.scatter(X_new[:,i],y_train)
[(False, 'school'), (False, 'sex'), (False, 'age'), (False, 'address'), (False, 'famsize'), (False, 'Pstatus'), (True, 'Medu'), (True, 'Fedu'), (False, 'Mjob'), (False, 'Fjob'), (False, 'reason'), (False, 'guardian'), (False, 'traveltime'), (False, 'studytime'), (True, 'failures'), (False, 'schoolsup'), (False, 'famsup'), (False, 'paid'), (False, 'activities'), (False, 'nursery'), (False, 'higher'), (False, 'internet'), (False, 'romantic'), (False, 'famrel'), (False, 'freetime'), (False, 'goout'), (False, 'Dalc'), (False, 'Walc'), (False, 'health'), (False, 'absences'), (True, 'G1'), (True, 'G2')]
In [172]:
best_features = []
for bool,feature in z:
    if bool:
        best_features.append(feature)
In [176]:
correlated = best_features + ['G3']
In [177]:
correlated
Out[177]:
['Medu', 'Fedu', 'failures', 'G1', 'G2', 'G3']
In [179]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid', context='notebook')

sns.pairplot(df[correlated], size=2.0);
plt.tight_layout()
# plt.savefig('./figures/scatter.png', dpi=300)
plt.show()
In [181]:
import numpy as np
cm = np.corrcoef(df[correlated].values.T)
sns.set(font_scale=1.5)
hm = sns.heatmap(cm, 
            cbar=True,
            annot=True, 
            square=True,
            fmt='.2f',
            annot_kws={'size': 15},
            yticklabels=correlated,
            xticklabels=correlated)

plt.tight_layout()
# plt.savefig('./figures/corr_mat.png', dpi=300)
plt.show()
In [182]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.scatter(df['failures'], df['G3'])
plt.xlabel('Failures')
plt.ylabel('G3')
plt.title('Failures Against G3')
plt.show()
In [196]:
from sklearn.cross_validation import train_test_split

X = df[features].values
y = df['G3'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
In [185]:
slr = LinearRegression()

slr.fit(X_train, y_train)
y_train_pred = slr.predict(X_train)
y_test_pred = slr.predict(X_test)
In [187]:
plt.scatter(y_train_pred,  y_train_pred - y_train, c='blue', marker='o', label='Training data')
plt.scatter(y_test_pred,  y_test_pred - y_test, c='lightgreen', marker='s', label='Test data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.legend(loc='upper left')
plt.hlines(y=0, xmin=0, xmax=20, lw=2, color='red')
plt.xlim([0, 20])
plt.tight_layout()

# plt.savefig('./figures/slr_residuals.png', dpi=300)
plt.show()
In [188]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred),
        mean_squared_error(y_test, y_test_pred)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(y_train, y_train_pred),
        r2_score(y_test, y_test_pred)))
MSE train: 2.486, test: 5.642
R^2 train: 0.859, test: 0.798

Using regularized methods for regression

A Lasso Regression model can be initialized as follows:

In [189]:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_train_pred = lasso.predict(X_train)
y_test_pred = lasso.predict(X_test)
print(lasso.coef_)
[-0.         -0.         -0.14097623 -0.         -0.          0.          0.
 -0.          0.         -0.          0.          0.          0.          0.
 -0.         -0.         -0.         -0.          0.          0.         -0.
  0.          0.          0.02030755  0.          0.         -0.
  0.05042061  0.01666828  0.03440757  0.09543018  0.98554785]
In [190]:
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred),
        mean_squared_error(y_test, y_test_pred)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(y_train, y_train_pred),
        r2_score(y_test, y_test_pred)))
MSE train: 2.666, test: 5.868
R^2 train: 0.849, test: 0.790

Similiarly Ridge regression can be used:

In [191]:
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
y_train_pred = ridge.predict(X_train)
y_test_pred = ridge.predict(X_test)
print(ridge.coef_)
[-0.35325243 -0.20836897 -0.21409541 -0.14297151 -0.09381759  0.27131556
  0.17368312 -0.12499116  0.21471346  0.15413121 -0.01400552  0.18185536
  0.06190845  0.07503396  0.00234185 -0.44425912 -0.13290367 -0.10680339
  0.30051497  0.32123389 -0.23413804  0.20521655  0.25409234  0.15749261
  0.02279431  0.00097435 -0.19989632  0.1902062   0.07026242  0.04768856
  0.13956252  0.95878554]

Lastly, the ElasticNet implementation allows us to vary the L1 to L2 ratio:

In [192]:
from sklearn.linear_model import ElasticNet
en = ElasticNet(alpha=1.0, l1_ratio=0.5)
en.fit(X_train, y_train)
y_train_pred = en.predict(X_train)
y_test_pred = en.predict(X_test)
print(en.coef_)
[ 0.         -0.         -0.         -0.         -0.          0.          0.
  0.         -0.         -0.         -0.          0.          0.          0.
 -0.         -0.         -0.         -0.          0.          0.         -0.
 -0.          0.          0.          0.          0.         -0.          0.
  0.          0.01991938  0.13861396  0.89133515]

For example, if we set l1_ratio to 1.0, the ElasticNet regressor would be equal to LASSO regression.

Decision tree regression

In [201]:
X = df[features].values
y = df['G3'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
In [202]:
from sklearn.tree import DecisionTreeRegressor


tree = DecisionTreeRegressor(max_depth=3)
tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)

print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred),
        mean_squared_error(y_test, y_test_pred)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(y_train, y_train_pred),
        r2_score(y_test, y_test_pred)))
MSE train: 2.519, test: 4.577
R^2 train: 0.857, test: 0.836

Random forest regression

In [204]:
X = df[features].values
y = df['G3'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
In [205]:
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=1000, 
                               criterion='mse', 
                               random_state=1, 
                               n_jobs=-1)
forest.fit(X_train, y_train)
y_train_pred = forest.predict(X_train)
y_test_pred = forest.predict(X_test)

print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred),
        mean_squared_error(y_test, y_test_pred)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(y_train, y_train_pred),
        r2_score(y_test, y_test_pred)))
MSE train: 0.250, test: 3.806
R^2 train: 0.986, test: 0.864
In [206]:
plt.scatter(y_train_pred,  
            y_train_pred - y_train, 
            c='black', 
            marker='o', 
            s=35,
            alpha=0.5,
            label='Training data')
plt.scatter(y_test_pred,  
            y_test_pred - y_test, 
            c='lightgreen', 
            marker='s', 
            s=35,
            alpha=0.7,
            label='Test data')

plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.legend(loc='upper left')
plt.hlines(y=0, xmin=0, xmax=22, lw=2, color='red')
plt.xlim([0, 22])
plt.tight_layout()

# plt.savefig('./figures/slr_residuals.png', dpi=300)
plt.show()

Linear Regression

In [207]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_predictions = regressor.predict(X_test)
print 'R-squared:', regressor.score(X_test, y_test)
R-squared: 0.780914360184

Cross Validation

In [208]:
scores = cross_val_score(regressor, X, y, cv=5)
print "Average of scores: ", scores.mean()
print "Cross validation scores: ", scores
Average of scores:  0.794475403313
Cross validation scores:  [ 0.81813068  0.88591594  0.77938816  0.78516026  0.70378198]
In [209]:
plt.scatter(y_test,y_predictions)
plt.xlabel('True Quality')
plt.ylabel('Predicted Quality')
plt.title('Predicted Quality Against True Quality')
plt.show()

Fitting models with gradient descent

SGDRegressor

In [210]:
# Scaling the features using StandardScaler:
X_scaler = StandardScaler()
y_scaler = StandardScaler()
X_train = X_scaler.fit_transform(X_train)
y_train = y_scaler.fit_transform(y_train)
X_test = X_scaler.transform(X_test)
y_test = y_scaler.transform(y_test)
C:\Miniconda2\lib\site-packages\sklearn\utils\validation.py:420: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)
C:\Miniconda2\lib\site-packages\sklearn\preprocessing\data.py:583: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
C:\Miniconda2\lib\site-packages\sklearn\utils\validation.py:420: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)
C:\Miniconda2\lib\site-packages\sklearn\preprocessing\data.py:646: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
C:\Miniconda2\lib\site-packages\sklearn\utils\validation.py:420: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)
C:\Miniconda2\lib\site-packages\sklearn\preprocessing\data.py:646: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
In [211]:
regressor = SGDRegressor(loss='squared_loss')
scores = cross_val_score(regressor, X_train, y_train, cv=5)
print 'Cross validation r-squared scores:', scores
print 'Average cross validation r-squared score:', np.mean(scores)
regressor.fit_transform(X_train, y_train)
print 'Test set r-squared score', regressor.score(X_test, y_test)
Cross validation r-squared scores: [ 0.75096581  0.86510753  0.76876374  0.7369485   0.7987154 ]
Average cross validation r-squared score: 0.784100195438
Test set r-squared score 0.773056308523
C:\Miniconda2\lib\site-packages\sklearn\utils\__init__.py:93: DeprecationWarning: Function transform is deprecated; Support to use estimators as feature selectors will be removed in version 0.19. Use SelectFromModel instead.
  warnings.warn(msg, category=DeprecationWarning)
In [213]:
from sklearn.cross_validation import *
def train_and_evaluate(clf, X_train, y_train):
    
    clf.fit(X_train, y_train)
    
    print "Coefficient of determination on training set:",clf.score(X_train, y_train)
    
    # create a k-fold croos validation iterator of k=5 folds
    cv = KFold(X_train.shape[0], 5, shuffle=True, random_state=33)
    scores = cross_val_score(clf, X_train, y_train, cv=cv)
    print "Average coefficient of determination using 5-fold crossvalidation:",np.mean(scores)

Support Vector Machines for regression

The regression version of SVM can be used instead to find the hyperplane (note how easy is to change the classification method in scikit-learn!). We will try a linear kernel, a polynomial kernel, and finally, a rbf kernel. For more information on kernels, see http://scikit-learn.org/stable/modules/svm.html#svm-kernels

In [217]:
from sklearn import svm
clf_svr= svm.SVR(kernel='linear')
train_and_evaluate(clf_svr,X_train,y_train)
Coefficient of determination on training set: 0.849157093784
Average coefficient of determination using 5-fold crossvalidation: 0.842490490255
In [218]:
clf_svr_poly= svm.SVR(kernel='poly')
train_and_evaluate(clf_svr_poly,X_train,y_train)
Coefficient of determination on training set: 0.991247519011
Average coefficient of determination using 5-fold crossvalidation: -92.8405695274
In [219]:
clf_svr_rbf= svm.SVR(kernel='rbf')
train_and_evaluate(clf_svr_rbf,X_train,y_train)
Coefficient of determination on training set: 0.76854274345
Average coefficient of determination using 5-fold crossvalidation: 0.637749259288
In [220]:
clf_svr_poly2= svm.SVR(kernel='poly',degree=2)
train_and_evaluate(clf_svr_poly2,X_train,y_train)
Coefficient of determination on training set: 0.875086659141
Average coefficient of determination using 5-fold crossvalidation: 0.577096119973

Random Forests for Regression

Finally, let's try again Random Forests, in their Extra Trees, and Regression version

In [221]:
from sklearn import ensemble
clf_et=ensemble.ExtraTreesRegressor(n_estimators=10,random_state=42)
train_and_evaluate(clf_et,X_train,y_train)
Coefficient of determination on training set: 1.0
Average coefficient of determination using 5-fold crossvalidation: 0.834438823442

An interesting side effect of random forest classification, is that you can measure how 'important' each feature is when predicting the final result

In [222]:
print np.sort(zip(clf_et.feature_importances_,features),axis=0)
[['0.000414630355532' 'Dalc']
 ['0.000514007655848' 'Fedu']
 ['0.000616839340144' 'Fjob']
 ['0.000812859348316' 'G1']
 ['0.000894604750305' 'G2']
 ['0.000964541098088' 'Medu']
 ['0.0018054901535' 'Mjob']
 ['0.00181670492675' 'Pstatus']
 ['0.00185431089792' 'Walc']
 ['0.00216206896705' 'absences']
 ['0.00220641658743' 'activities']
 ['0.00239183938313' 'address']
 ['0.0024096646334' 'age']
 ['0.00249476017392' 'failures']
 ['0.00250597775474' 'famrel']
 ['0.00255391358532' 'famsize']
 ['0.00279086333248' 'famsup']
 ['0.0031383209147' 'freetime']
 ['0.00360150739838' 'goout']
 ['0.00431525126952' 'guardian']
 ['0.00445755164541' 'health']
 ['0.0048610587442' 'higher']
 ['0.00532054800371' 'internet']
 ['0.00710396827835' 'nursery']
 ['0.00841846425943' 'paid']
 ['0.00925093324961' 'reason']
 ['0.00956984349144' 'romantic']
 ['0.0100368287457' 'school']
 ['0.0144620072805' 'schoolsup']
 ['0.0435626263148' 'sex']
 ['0.330796342132' 'studytime']
 ['0.511895255329' 'traveltime']]

Finally, evaluate our classifiers on the testing set

In [223]:
from sklearn import metrics
def measure_performance(X,y,clf, show_accuracy=True,
                        show_classification_report=True,
                        show_confusion_matrix=True,
                        show_r2_score=False):
    y_pred=clf.predict(X)   
    if show_accuracy:
        print "Accuracy:{0:.3f}".format(metrics.accuracy_score(y,y_pred)),"\n"

    if show_classification_report:
        print "Classification report"
        print metrics.classification_report(y,y_pred),"\n"
        
    if show_confusion_matrix:
        print "Confusion matrix"
        print metrics.confusion_matrix(y,y_pred),"\n"
        
    if show_r2_score:
        print "Coefficient of determination:{0:.3f}".format(metrics.r2_score(y,y_pred)),"\n"

        
measure_performance(X_test,y_test,clf_et,
                    show_accuracy=False,
                    show_classification_report=False,
                    show_confusion_matrix=False,
                    show_r2_score=True)
Coefficient of determination:0.766