Student-Performance-Evaluation using Classification-Regression¶

http://archive.ics.uci.edu/ml/datasets/Student+Performance

Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:¶

school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
sex - student's sex (binary: 'F' - female or 'M' - male)
age - student's age (numeric: from 15 to 22)
address - student's home address type (binary: 'U' - urban or 'R' - rural)
famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
guardian - student's guardian (nominal: 'mother', 'father' or 'other')
traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
failures - number of past class failures (numeric: n if 1<=n<3, else 4)
schoolsup - extra educational support (binary: yes or no)
famsup - family educational support (binary: yes or no)
paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
activities - extra-curricular activities (binary: yes or no)
nursery - attended nursery school (binary: yes or no)
higher - wants to take higher education (binary: yes or no)
internet - Internet access at home (binary: yes or no)
romantic - with a romantic relationship (binary: yes or no)
famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
freetime - free time after school (numeric: from 1 - very low to 5 - very high)
goout - going out with friends (numeric: from 1 - very low to 5 - very high)
Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
health - current health status (numeric: from 1 - very bad to 5 - very good)
absences - number of school absences (numeric: from 0 to 93)
G1 - first period grade (numeric: from 0 to 20)
G2 - second period grade (numeric: from 0 to 20)
G3 - final grade (numeric: from 0 to 20, output target)

These grades relate to the course subject, Math or Portuguese.

import os
from time import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import cross_validation, metrics
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_score, recall_score, roc_auc_score)
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# read .csv from provided dataset
csv_filename="student/student-mat.csv"

# df=pd.read_csv(csv_filename,index_col=0)
df=pd.read_csv(csv_filename, sep=";")

df.head()

df.describe()
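The `sep=";"` argument matters because the UCI files are semicolon-delimited. The following is a minimal sketch of that behaviour, written for Python 3 and a recent pandas. It uses an in-memory sample instead of the real file: the three rows are taken from the first rows of student-mat.csv, reduced to six columns and with quoting simplified. The names `sample` and `sample_df` are illustrative only.

```python
import io

import pandas as pd

# Three rows and six columns taken from the first rows of student-mat.csv;
# the real file has 395 rows and 33 columns.
sample = io.StringIO(
    "school;sex;age;G1;G2;G3\n"
    "GP;F;18;5;6;6\n"
    "GP;F;17;5;5;6\n"
    "GP;F;15;7;8;10\n"
)

# The file is semicolon-delimited, so sep=";" is required; with the default
# comma separator each row would be read as one single column.
sample_df = pd.read_csv(sample, sep=";")

print(sample_df.shape)  # (3, 6)
print(sample_df["G3"].tolist())  # [6, 6, 10]
```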

CASE 1: Binary Classification: G3 >= 10: 1, else 0¶

df.G3.describe()

count    395.000000
mean      10.415190
std        4.581443
min        0.000000
25%        8.000000
50%       11.000000
75%       14.000000
max       20.000000
Name: G3, dtype: float64

# convert the G3 attribute to a binary target: 1 if G3 >= 10, else 0
high = df.G3 >= 10
low = df.G3 < 10
df.loc[high,'G3'] = 1
df.loc[low,'G3'] = 0

df.head()

df.G3.describe()

count    395.000000
mean       0.670886
std        0.470487
min        0.000000
25%        0.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: G3, dtype: float64
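The same binary target can also be built in one vectorized step. This is a small sketch on hypothetical grades (the values in `grades` are made up for illustration), written for Python 3. It also shows why the mean of the binary column in the `describe()` output above (0.670886) can be read as the share of students with G3 >= 10.

```python
import pandas as pd

# Hypothetical final grades on the 0-20 scale.
grades = pd.Series([6, 9, 10, 11, 15, 20], name="G3")

# The comparison yields booleans; astype(int) turns True/False into 1/0.
passed = (grades >= 10).astype(int)

print(passed.tolist())  # [0, 0, 1, 1, 1, 1]
print(passed.mean())    # share of passing students in this toy sample
```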

cols = list(df.columns)

categorical_features = []
for f in cols:
    if df[f].dtype != 'int64':
        categorical_features.append(f)
categorical_features

['school',
 'sex',
 'address',
 'famsize',
 'Pstatus',
 'Mjob',
 'Fjob',
 'reason',
 'guardian',
 'schoolsup',
 'famsup',
 'paid',
 'activities',
 'nursery',
 'higher',
 'internet',
 'romantic']

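for f in categorical_features:
    # pd.get_dummies returns one indicator column per category. Keep only the
    # first one (the alphabetically first category) so each feature stays a
    # single 0/1 column; for features with more than two categories
    # (Mjob, Fjob, reason, guardian) this discards the other categories.
    df[f] = pd.get_dummies(df[f]).iloc[:, 0].astype(float)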

df.head()
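For nominal attributes with more than two values, such as Mjob, a full one-hot encoding keeps one indicator column per category. The sketch below shows this on a toy frame, assuming Python 3 and a recent pandas; the rows in `toy` are invented, and this is an alternative encoding rather than the one used to produce the results in this notebook.

```python
import pandas as pd

# Toy rows using category values from the attribute list.
toy = pd.DataFrame({
    "age": [18, 17, 15],
    "sex": ["F", "M", "F"],
    "Mjob": ["at_home", "teacher", "other"],
})

# columns=[...] expands each listed column into one indicator column per
# category and leaves the remaining columns untouched.
encoded = pd.get_dummies(toy, columns=["sex", "Mjob"], dtype=int)

print(list(encoded.columns))
# ['age', 'sex_F', 'sex_M', 'Mjob_at_home', 'Mjob_other', 'Mjob_teacher']
```

With this encoding the feature matrix gets wider than 32 columns, so feature indices and importances would no longer line up with the listings in this notebook.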

features=list(df.columns[:-1])

X = df[features]
y = df['G3']

# split dataset to 60% training and 40% testing
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,y, test_size=0.4, random_state=0)

print X_train.shape, y_train.shape

(237, 32) (237L,)
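The `sklearn.cross_validation` module used here was deprecated in scikit-learn 0.18 and removed in 0.20; its functions now live in `sklearn.model_selection`. The sketch below repeats the 60/40 split with the newer import on placeholder arrays that have the same number of rows as student-mat.csv. The `stratify` argument is an optional addition, not part of the original cell, and `X_demo` and `y_demo` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as student-mat.csv (395).
X_demo = np.arange(395 * 2).reshape(395, 2)
y_demo = np.arange(395) % 2

# stratify=y_demo keeps the class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # (237, 2) (158, 2)
```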

Feature importances with forests of trees¶

This example uses a forest of randomized trees (ExtraTreesClassifier) to estimate the importance of each feature for the binary G3 classification task. The red bars show the forest's feature importances, and the error bars show their variability across trees.

import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import ExtraTreesClassifier

# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=0)

forest.fit(X, y)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. feature %d - %s (%f) " % (f + 1, indices[f], features[indices[f]], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure(num=None, figsize=(14, 10), dpi=80, facecolor='w', edgecolor='k')
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()

Feature ranking:
1. feature 31 - G2 (0.240635) 
2. feature 30 - G1 (0.180379) 
3. feature 14 - failures (0.058718) 
4. feature 29 - absences (0.032248) 
5. feature 25 - goout (0.031282) 
6. feature 2 - age (0.031181) 
7. feature 6 - Medu (0.023217) 
8. feature 7 - Fedu (0.023036) 
9. feature 28 - health (0.022799) 
10. feature 13 - studytime (0.022735) 
11. feature 23 - famrel (0.021943) 
12. feature 24 - freetime (0.021443) 
13. feature 27 - Walc (0.020626) 
14. feature 15 - schoolsup (0.018678) 
15. feature 22 - romantic (0.017761) 
16. feature 10 - reason (0.016906) 
17. feature 18 - activities (0.016887) 
18. feature 26 - Dalc (0.016689) 
19. feature 3 - address (0.016217) 
20. feature 1 - sex (0.016042) 
21. feature 12 - traveltime (0.015887) 
22. feature 4 - famsize (0.015885) 
23. feature 17 - paid (0.015443) 
24. feature 16 - famsup (0.014640) 
25. feature 11 - guardian (0.014492) 
26. feature 21 - internet (0.013992) 
27. feature 8 - Mjob (0.013320) 
28. feature 19 - nursery (0.011929) 
29. feature 20 - higher (0.010266) 
30. feature 5 - Pstatus (0.008672) 
31. feature 9 - Fjob (0.008311) 
32. feature 0 - school (0.007740)

importances[indices[:5]]

array([ 0.24063511,  0.18037904,  0.05871839,  0.03224845,  0.03128223])

for f in range(5):
    print("%d. feature %d - %s (%f)" % (f + 1, indices[f], features[indices[f]] ,importances[indices[f]]))

1. feature 31 - G2 (0.240635)
2. feature 30 - G1 (0.180379)
3. feature 14 - failures (0.058718)
4. feature 29 - absences (0.032248)
5. feature 25 - goout (0.031282)

best_features = []
for i in indices[:5]:
    best_features.append(features[i])

# Plot the top 5 feature importances of the forest
plt.figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
plt.title("Feature importances")
plt.bar(range(5), importances[indices][:5], 
       color="r",  yerr=std[indices][:5], align="center")
plt.xticks(range(5), best_features)
plt.xlim([-1, 5])
plt.show()
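The ranking step can be looked at in isolation. The sketch below uses five name and importance pairs copied from the feature ranking above and shows how `np.argsort` produces the descending order; `names`, `importances_demo` and `top3` are illustrative names.

```python
import numpy as np

# Five (name, importance) pairs taken from the feature ranking above.
names = np.array(["school", "G1", "absences", "G2", "failures"])
importances_demo = np.array([0.007740, 0.180379, 0.032248, 0.240635, 0.058718])

# argsort is ascending, so reverse it to rank from most to least important.
order = np.argsort(importances_demo)[::-1]
top3 = names[order[:3]].tolist()

print(top3)  # ['G2', 'G1', 'failures']
```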

Decision Tree accuracy and time elapsed calculation¶

t0=time()
print "DecisionTree"

dt = DecisionTreeClassifier(min_samples_split=20,random_state=99)
# dt = DecisionTreeClassifier(min_samples_split=20,max_depth=5,random_state=99)

clf_dt=dt.fit(X_train,y_train)

print "Accuracy: ", clf_dt.score(X_test,y_test)
t1=time()
print "time elapsed: ", t1-t0

DecisionTree
Accuracy:  0.886075949367
time elapsed:  0.00399994850159

cross validation for DT¶

tt0=time()
print "cross result========"
scores = cross_validation.cross_val_score(dt, X,y, cv=5)
print scores
print scores.mean()
tt1=time()
print "time elapsed: ", tt1-tt0
print "\n"

cross result========
[ 0.89873418  0.86075949  0.84810127  0.82278481  0.86075949]
0.858227848101
time elapsed:  0.148000001907
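Since the same cross-validation pattern is repeated for each model in this notebook, one way to keep every score tied to the right estimator is to loop over a dictionary of models. This is a sketch under stated assumptions: it targets Python 3 and a recent scikit-learn (`sklearn.model_selection`), and it runs on synthetic data from `make_classification` rather than the student dataset, so the printed scores are not comparable to the results shown here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (X, y); the sizes are arbitrary.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

models = {
    "DecisionTree": DecisionTreeClassifier(min_samples_split=20, random_state=99),
    "NaiveBayes": BernoulliNB(),
    "KNN": KNeighborsClassifier(),
}

# Each estimator is passed to cross_val_score under its own name, so the
# scores cannot be attributed to the wrong model.
cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_means[name] = scores.mean()
    print("%s: %.3f" % (name, cv_means[name]))
```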

Tuning our hyperparameters using GridSearch¶

from sklearn.metrics import classification_report

pipeline = Pipeline([
    ('clf', DecisionTreeClassifier(criterion='entropy'))
])

parameters = {
    'clf__max_depth': (5, 25 , 50),
    'clf__min_samples_split': (1, 5, 10),
    'clf__min_samples_leaf': (1, 2, 3)
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='f1')
grid_search.fit(X_train, y_train)

print 'Best score: %0.3f' % grid_search.best_score_
print 'Best parameters set:'

best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print '\t%s: %r' % (param_name, best_parameters[param_name])

predictions = grid_search.predict(X_test)

print classification_report(y_test, predictions)

[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   19.6s
[Parallel(n_jobs=-1)]: Done  81 out of  81 | elapsed:   20.2s finished

Fitting 3 folds for each of 27 candidates, totalling 81 fits
Best score: 0.933
Best parameters set:
	clf__max_depth: 50
	clf__min_samples_leaf: 2
	clf__min_samples_split: 5
             precision    recall  f1-score   support

          0       0.90      0.73      0.80        59
          1       0.85      0.95      0.90        99

avg / total       0.87      0.87      0.86       158
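Note that "Best score" above is the mean cross-validated F1 on the training split, while the classification report is computed on the held-out test split. The sketch below shows the same search with the `sklearn.model_selection` API of recent scikit-learn releases. It is a reduced, hypothetical grid on synthetic data, not a rerun of the search above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the student data.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.4, random_state=0)

param_grid = {
    "max_depth": [3, 5],
    "min_samples_split": [2, 10],  # recent releases reject integer values below 2
}

search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    param_grid,
    scoring="f1",
    cv=3,
)
search.fit(X_tr, y_tr)

# best_score_ is the mean cross-validated F1 of the best candidate.
print(search.best_params_)
print("cv f1: %.3f" % search.best_score_)
print("held-out accuracy: %.3f" % search.best_estimator_.score(X_te, y_te))
```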

Random Forest accuracy and time elapsed calculation¶

t2=time()
print "RandomForest"
rf = RandomForestClassifier(n_estimators=100,n_jobs=-1)
clf_rf = rf.fit(X_train,y_train)
print "Accuracy: ", clf_rf.score(X_test,y_test)
t3=time()
print "time elapsed: ", t3-t2

RandomForest
Accuracy:  0.911392405063
time elapsed:  0.871000051498

cross validation for RF¶

tt2=time()
print "cross result========"
# Cross-validate the random forest (rf); passing dt here would only repeat the decision-tree scores
scores = cross_validation.cross_val_score(rf, X,y, cv=5)
print scores
print scores.mean()
tt3=time()
print "time elapsed: ", tt3-tt2
print "\n"
time elapsed:  0.0499999523163

Receiver Operating Characteristic (ROC) curve¶

roc_auc_score(y_test,rf.predict(X_test))

0.89162814586543404

%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

predictions = rf.predict_proba(X_test)

false_positive_rate, recall, thresholds = roc_curve(y_test, predictions[:, 1])
roc_auc = auc(false_positive_rate, recall)
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')

plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()
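The `roc_auc_score` call above is given hard class labels from `rf.predict`, while the plotted curve uses probabilities from `rf.predict_proba`, so the two AUC values need not agree. The sketch below illustrates why, using the pairwise definition of AUC on a handful of hypothetical labels and probabilities. `pairwise_auc` is a helper written for this illustration, not a scikit-learn function.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1, 1]
proba = [0.2, 0.6, 0.55, 0.7, 0.9]
hard = [1 if p >= 0.5 else 0 for p in proba]  # labels after a 0.5 cut-off

auc_from_proba = pairwise_auc(y_true, proba)  # 5 of 6 pairs ordered correctly
auc_from_hard = pairwise_auc(y_true, hard)    # ties discard ranking information

print(round(auc_from_proba, 3))  # 0.833
print(round(auc_from_hard, 3))   # 0.75
```

Passing `predict_proba(X_test)[:, 1]` to `roc_auc_score` would give the value that matches the plotted curve.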

Tuning Models using GridSearch¶

pipeline2 = Pipeline([
('clf', RandomForestClassifier(criterion='entropy'))
])

parameters = {
    'clf__n_estimators': (5, 25, 50, 100),
    'clf__max_depth': (5, 25 , 50),
    'clf__min_samples_split': (1, 5, 10),
    'clf__min_samples_leaf': (1, 2, 3)
}

grid_search = GridSearchCV(pipeline2, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)

grid_search.fit(X_train, y_train)

print 'Best score: %0.3f' % grid_search.best_score_

print 'Best parameters set:'
best_parameters = grid_search.best_estimator_.get_params()

for param_name in sorted(parameters.keys()):
    print '\t%s: %r' % (param_name, best_parameters[param_name])

predictions = grid_search.predict(X_test)
print 'Accuracy:', accuracy_score(y_test, predictions)
print classification_report(y_test, predictions)

[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   21.5s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   36.3s
[Parallel(n_jobs=-1)]: Done 324 out of 324 | elapsed:   58.4s finished

Fitting 3 folds for each of 108 candidates, totalling 324 fits
Best score: 0.916
Best parameters set:
	clf__max_depth: 5
	clf__min_samples_leaf: 3
	clf__min_samples_split: 1
	clf__n_estimators: 50
Accuracy: 0.905063291139
             precision    recall  f1-score   support

          0       0.94      0.80      0.86        59
          1       0.89      0.97      0.93        99

avg / total       0.91      0.91      0.90       158

Naive Bayes accuracy and time elapsed calculation¶

t4=time()
print "NaiveBayes"
nb = BernoulliNB()
clf_nb=nb.fit(X_train,y_train)
print "Accuracy: ", clf_nb.score(X_test,y_test)
t5=time()
print "time elapsed: ", t5-t4

NaiveBayes
Accuracy:  0.70253164557
time elapsed:  0.166000127792

cross-validation for NB¶

tt4=time()
print "cross result========"
# Cross-validate the naive Bayes model (nb); passing dt here would only repeat the decision-tree scores
scores = cross_validation.cross_val_score(nb, X,y, cv=5)
print scores
print scores.mean()
tt5=time()
print "time elapsed: ", tt5-tt4
print "\n"
time elapsed:  0.0549998283386

KNN accuracy and time elapsed calculation¶

t6=time()
print "KNN"
# knn = KNeighborsClassifier(n_neighbors=3)
knn = KNeighborsClassifier()
clf_knn=knn.fit(X_train, y_train)
print "Accuracy: ", clf_knn.score(X_test,y_test)
t7=time()
print "time elapsed: ", t7-t6

KNN
Accuracy:  0.892405063291
time elapsed:  0.00899982452393

cross validation for KNN¶

tt6=time()
print "cross result========"
# Cross-validate the KNN model (knn); passing dt here would only repeat the decision-tree scores
scores = cross_validation.cross_val_score(knn, X,y, cv=5)
print scores
print scores.mean()
tt7=time()
print "time elapsed: ", tt7-tt6
print "\n"
time elapsed:  0.0460000038147

SVM accuracy and time elapsed calculation¶

t7=time()
print "SVM"

svc = SVC()
clf_svc=svc.fit(X_train, y_train)
print "Accuracy: ", clf_svc.score(X_test,y_test)
t8=time()
print "time elapsed: ", t8-t7

SVM
Accuracy:  0.867088607595
time elapsed:  0.010999917984

cross validation for SVM¶

tt7=time()
print "cross result========"
# Cross-validate the SVM (svc); passing dt here would only repeat the decision-tree scores
scores = cross_validation.cross_val_score(svc, X,y, cv=5)
print scores
print scores.mean()
tt8=time()
# time this cell only: tt8-tt7 (tt6 belongs to the KNN cell)
print "time elapsed: ", tt8-tt7
print "\n"
time elapsed:  7.27399992943

from sklearn.svm import SVC
from sklearn.cross_validation import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn import grid_search

svc = SVC()

parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}

grid = grid_search.GridSearchCV(svc, parameters, n_jobs=-1, verbose=1, scoring='accuracy')


grid.fit(X_train, y_train)

print 'Best score: %0.3f' % grid.best_score_

print 'Best parameters set:'
best_parameters = grid.best_estimator_.get_params()

for param_name in sorted(parameters.keys()):
    print '\t%s: %r' % (param_name, best_parameters[param_name])
    
predictions = grid.predict(X_test)
print classification_report(y_test, predictions)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
Best score: 0.907
Best parameters set:
	C: 10
	kernel: 'rbf'
             precision    recall  f1-score   support

          0       0.92      0.76      0.83        59
          1       0.87      0.96      0.91        99

avg / total       0.89      0.89      0.88       158

[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:   18.2s finished

pipeline = Pipeline([
    ('clf', SVC(kernel='rbf', gamma=0.01, C=100))
])

parameters = {
    'clf__gamma': (0.01, 0.03, 0.1, 0.3, 1),
    'clf__C': (0.1, 0.3, 1, 3, 10, 30),
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy')

grid_search.fit(X_train, y_train)

print 'Best score: %0.3f' % grid_search.best_score_

print 'Best parameters set:'
best_parameters = grid_search.best_estimator_.get_params()

for param_name in sorted(parameters.keys()):
    print '\t%s: %r' % (param_name, best_parameters[param_name])
    
predictions = grid_search.predict(X_test)
print classification_report(y_test, predictions)

[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   19.4s
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:   20.2s finished

Fitting 3 folds for each of 30 candidates, totalling 90 fits
Best score: 0.907
Best parameters set:
	clf__C: 3
	clf__gamma: 0.01
             precision    recall  f1-score   support

          0       0.92      0.83      0.88        59
          1       0.90      0.96      0.93        99

avg / total       0.91      0.91      0.91       158

CASE 2: Multi Class Classification :¶

# read .csv from provided dataset
csv_filename="student/student-mat.csv"

# df=pd.read_csv(csv_filename,index_col=0)
df=pd.read_csv(csv_filename, sep=";")

df.head()

df.describe()

df.G3.describe()

count    395.000000
mean      10.415190
std        4.581443
min        0.000000
25%        8.000000
50%       11.000000
75%       14.000000
max       20.000000
Name: G3, dtype: float64

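# Map the 0-20 final grade to five classes:
# 5 = fail (0-9), 4 = sufficient (10-11), 3 = satisfactory (12-13),
# 2 = good (14-15), 1 = excellent/very good (16-20)
for i in range(len(df.G3)):
    g = df.loc[i, 'G3']
    # df.loc[i, 'G3'] writes to the frame itself and avoids chained assignment
    if g < 10:
        df.loc[i, 'G3'] = 5
    elif g < 12:
        df.loc[i, 'G3'] = 4
    elif g < 14:
        df.loc[i, 'G3'] = 3
    elif g < 16:
        df.loc[i, 'G3'] = 2
    elif g < 21:
        df.loc[i, 'G3'] = 1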

df.G3.unique()

array([5, 4, 2, 1, 3], dtype=int64)
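The grade-to-class mapping in the loop above can also be written as a lookup over the class boundaries. The following is a minimal sketch using only the standard library. `THRESHOLDS`, `CLASSES` and `grade_to_class` are names introduced here for illustration, and the boundaries are the ones used in the loop.

```python
from bisect import bisect_right

# Lower bounds of classes 4, 3, 2 and 1; anything below 10 is class 5 (fail).
THRESHOLDS = [10, 12, 14, 16]
CLASSES = [5, 4, 3, 2, 1]


def grade_to_class(g3):
    """Map a 0-20 final grade to the five-level class used in the loop above."""
    return CLASSES[bisect_right(THRESHOLDS, g3)]


# The first five G3 values in student-mat.csv are 6, 6, 10, 15 and 10.
print([grade_to_class(g) for g in [6, 6, 10, 15, 10]])  # [5, 5, 4, 2, 4]
```

Keeping the boundaries in one list makes it easier to check them against the class table than a chain of `elif` branches.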

df.head()

Class	G3	Label
I (excellent/very good)	16-20	A
II (good)	14-15	B
III (satisfactory)	12-13	C
IV (sufficient)	10-11	D
V (fail)	0-9	E

	school	sex	age	address	famsize	Pstatus	Medu	Fedu	Mjob	Fjob	...	famrel	freetime	goout	Dalc	Walc	health	absences	G1	G2	G3
0	GP	F	18	U	GT3	A	4	4	at_home	teacher	...	4	3	4	1	1	3	6	5	6	6
1	GP	F	17	U	GT3	T	1	1	at_home	other	...	5	3	3	1	1	3	4	5	5	6
2	GP	F	15	U	LE3	T	1	1	at_home	other	...	4	3	2	2	3	3	10	7	8	10
3	GP	F	15	U	GT3	T	4	2	health	services	...	3	2	2	1	1	5	2	15	14	15
4	GP	F	16	U	GT3	T	3	3	other	other	...	4	3	2	1	2	5	4	6	10	10

	age	Medu	Fedu	traveltime	studytime	failures	famrel	freetime	goout	Dalc	Walc	health	absences	G1	G2	G3
count	395.000000	395.000000	395.000000	395.000000	395.000000	395.000000	395.000000	395.000000	395.000000	395.000000	395.000000	395.000000	395.000000	395.000000	395.000000	395.000000
mean	16.696203	2.749367	2.521519	1.448101	2.035443	0.334177	3.944304	3.235443	3.108861	1.481013	2.291139	3.554430	5.708861	10.908861	10.713924	10.415190
std	1.276043	1.094735	1.088201	0.697505	0.839240	0.743651	0.896659	0.998862	1.113278	0.890741	1.287897	1.390303	8.003096	3.319195	3.761505	4.581443
min	15.000000	0.000000	0.000000	1.000000	1.000000	0.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	0.000000	3.000000	0.000000	0.000000
25%	16.000000	2.000000	2.000000	1.000000	1.000000	0.000000	4.000000	3.000000	2.000000	1.000000	1.000000	3.000000	0.000000	8.000000	9.000000	8.000000
50%	17.000000	3.000000	2.000000	1.000000	2.000000	0.000000	4.000000	3.000000	3.000000	1.000000	2.000000	4.000000	4.000000	11.000000	11.000000	11.000000
75%	18.000000	4.000000	3.000000	2.000000	2.000000	0.000000	5.000000	4.000000	4.000000	2.000000	3.000000	5.000000	8.000000	13.000000	13.000000	14.000000
max	22.000000	4.000000	4.000000	4.000000	4.000000	3.000000	5.000000	5.000000	5.000000	5.000000	5.000000	5.000000	75.000000	19.000000	19.000000	20.000000

	school	sex	age	famsize	Pstatus	Medu	Fedu	Mjob	...	famrel	freetime	goout	Dalc	Walc	health	absences	G1	G2	G3
0	1.0	1.0	18	1.0	1.0	4	4	1.0	...	4	3	4	1	1	3	6	5	6	0
1	1.0	1.0	17	1.0	0.0	1	1	1.0	...	5	3	3	1	1	3	4	5	5	0
2	1.0	1.0	15	0.0	0.0	1	1	1.0	...	4	3	2	2	3	3	10	7	8	1
3	1.0	1.0	15	1.0	0.0	4	2	0.0	...	3	2	2	1	1	5	2	15	14	1
4	1.0	1.0	16	1.0	0.0	3	3	0.0	...	4	3	2	1	2	5	4	6	10	1

	school	sex	age	famsize	Pstatus	Medu	Fedu	Mjob	...	famrel	freetime	goout	Dalc	Walc	health	absences	G1	G2	G3
0	1.0	1.0	18	1.0	1.0	4	4	1.0	...	4	3	4	1	1	3	6	5	6	5
1	1.0	1.0	17	1.0	0.0	1	1	1.0	...	5	3	3	1	1	3	4	5	5	5
2	1.0	1.0	15	0.0	0.0	1	1	1.0	...	4	3	2	2	3	3	10	7	8	4
3	1.0	1.0	15	1.0	0.0	4	2	0.0	...	3	2	2	1	1	5	2	15	14	2
4	1.0	1.0	16	1.0	0.0	3	3	0.0	...	4	3	2	1	2	5	4	6	10	4

	school	sex	age	famsize	Pstatus	Medu	Fedu	Mjob	...	famrel	freetime	goout	Dalc	Walc	health	absences	G1	G2	G3
0	1.0	1.0	18	1.0	1.0	4	4	1.0	...	4	3	4	1	1	3	6	5	6	6
1	1.0	1.0	17	1.0	0.0	1	1	1.0	...	5	3	3	1	1	3	4	5	5	6
2	1.0	1.0	15	0.0	0.0	1	1	1.0	...	4	3	2	2	3	3	10	7	8	10
3	1.0	1.0	15	1.0	0.0	4	2	0.0	...	3	2	2	1	1	5	2	15	14	15
4	1.0	1.0	16	1.0	0.0	3	3	0.0	...	4	3	2	1	2	5	4	6	10	10
