Intro to scikit-learn

Reading the data again

import pandas as pd
golf = pd.read_csv('golf.csv')

Different (better) way to encode attributes

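Install the category_encoders package:
- conda install -c conda-forge category_encoders
When this notebook was written, scikit-learn planned to ship this functionality itself:
- it was to be called CategoricalEncoder, see https://github.com/scikit-learn/scikit-learn/pull/9151
- scikit-learn 0.20 and later provide it as OrdinalEncoder and OneHotEncoder in sklearn.preprocessing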

print(golf.head())

from category_encoders.ordinal import OrdinalEncoder  # choose one encoder
encoder = OrdinalEncoder()

#from category_encoders.one_hot import OneHotEncoder
#encoder = OneHotEncoder()

golf_encoded = encoder.fit_transform(golf[['Outlook', 'Temperature', 'Humidity', 'Wind']])
print(golf_encoded.head())

    Outlook  Temperature  Humidity   Wind Play
0     sunny         85.0      85.0  False   no
1     sunny         80.0      90.0   True   no
2  overcast         83.0      78.0  False  yes
3      rain         70.0      96.0  False  yes
4      rain         68.0      80.0  False  yes
   Temperature  Humidity   Wind  Outlook
0         85.0      85.0  False        0
1         80.0      90.0   True        0
2         83.0      78.0  False        1
3         70.0      96.0  False        2
4         68.0      80.0  False        2
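
To make the difference between the two encoders concrete, here is a minimal sketch that reproduces both kinds of encoding with plain pandas. The five Outlook values are copied from the printout above. Using pd.factorize and pd.get_dummies as stand-ins is an assumption of this sketch, not a description of how category_encoders works internally.

```python
import pandas as pd

# First five Outlook values, copied from the golf.head() printout
outlook = pd.Series(["sunny", "sunny", "overcast", "rain", "rain"], name="Outlook")

# Ordinal encoding: each category becomes one integer (numbered in order of appearance)
ordinal_codes, categories = pd.factorize(outlook)

# One-hot encoding: each category becomes its own indicator column
one_hot = pd.get_dummies(outlook, prefix="Outlook")

print(list(ordinal_codes))
print(list(categories))
print(list(one_hot.columns))
```

Ordinal encoding implies an order between the categories (here sunny < overcast < rain) that may not exist in the data, which matters for distance-based methods such as k-nearest-neighbor. One-hot encoding avoids that at the cost of more columns.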

Applying ML Approaches

Naive Bayes

from sklearn.naive_bayes import GaussianNB
naive_bayes = GaussianNB()
naive_bayes.fit(golf_encoded, golf['Play'])

GaussianNB(priors=None)

k-nearest-neighbor

from sklearn.neighbors import KNeighborsClassifier
knn_estimator = KNeighborsClassifier(3)
#knn_estimator.fit....

nearest centroid

from sklearn.neighbors import NearestCentroid
nearest_centroid_estimator = NearestCentroid()
#nearest_centroid_estimator.fit....
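
All three estimators share the same fit / predict interface. The sketch below shows the pattern on a small made-up dataset; the points and the labels "a" and "b" are invented for illustration and are not the golf data.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid

# Hypothetical toy data: two well-separated clusters
X_train = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]]
y_train = ["a", "a", "a", "a", "b", "b", "b", "b"]
X_new = [[0.5, 0.5], [5.5, 5.5]]

estimators = {
    "naive bayes": GaussianNB(),
    "3-nearest-neighbor": KNeighborsClassifier(3),
    "nearest centroid": NearestCentroid(),
}

predictions = {}
for name, estimator in estimators.items():
    estimator.fit(X_train, y_train)                      # learn from labeled data
    predictions[name] = list(estimator.predict(X_new))   # classify unseen points
    print(name, predictions[name])
```

Because the interface is identical, swapping one classifier for another only changes the line that constructs the estimator.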

Evaluation

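Three important functions:

- confusion_matrix (returns the confusion matrix as a 2D NumPy array; rows are the ground truth, columns are the predictions)
- classification_report (returns a string with precision, recall, F1-score and support per class)
- accuracy_score (returns the fraction of correct predictions)

All of them take the same two inputs, in this order:

ground truth, prediction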

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

golf_prediction = ['yes','no','yes','yes','no','yes','yes','no','no','no','no','no','no','yes']

print(confusion_matrix(golf['Play'], golf_prediction))
print(accuracy_score(golf['Play'], golf_prediction))
print(classification_report(golf['Play'], golf_prediction))

[[2 3]
 [6 3]]
0.357142857143
             precision    recall  f1-score   support

         no       0.25      0.40      0.31         5
        yes       0.50      0.33      0.40         9

avg / total       0.41      0.36      0.37        14
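
The numbers in the report follow directly from the confusion matrix. As a sanity check, this sketch recomputes them in plain Python. The ground-truth list is transcribed from the Play column as it appears in this notebook's printouts, and the variable names are made up for this example.

```python
from collections import Counter

# Ground truth, transcribed from the Play column printed in this notebook (rows 0..13)
golf_truth = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes',
              'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
golf_prediction = ['yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes',
                   'no', 'no', 'no', 'no', 'no', 'no', 'yes']

# Count (truth, prediction) pairs: these are the cells of the confusion matrix
cells = Counter(zip(golf_truth, golf_prediction))
labels = ['no', 'yes']
matrix = [[cells[(t, p)] for p in labels] for t in labels]

# accuracy: correct predictions (the diagonal) / all predictions
accuracy = sum(cells[(c, c)] for c in labels) / len(golf_truth)
# precision: correct predictions of a class / all predictions of that class
precision = {c: cells[(c, c)] / sum(cells[(t, c)] for t in labels) for c in labels}
# recall: correct predictions of a class / all true members of that class
recall = {c: cells[(c, c)] / sum(cells[(c, p)] for p in labels) for c in labels}

print(matrix)
print(round(accuracy, 2), precision, recall)
```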

Pretty-print the confusion matrix

from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels

def confusion_matrix_report(y_true, y_pred):
    """Return the confusion matrix as a text table (rows: ground truth, columns: prediction)."""
    cm = confusion_matrix(y_true, y_pred)
    labels = unique_labels(y_true, y_pred)
    column_width = max([len(str(x)) for x in labels] + [5])  # at least 5 characters per value
    # Header: centered "Prediction" title, then one right-aligned column per label
    report = " " * column_width + " " + "{:_^{}}".format("Prediction", column_width * len(labels)) + "\n"
    report += " " * column_width + " ".join(["{:>{}}".format(label, column_width) for label in labels]) + "\n"
    # Body: one row per true label, starting with the label itself
    for i, label1 in enumerate(labels):
        report += "{:>{}}".format(label1, column_width) + " ".join(["{:{}d}".format(cm[i, j], column_width) for j in range(len(labels))]) + "\n"
    return report

print(confusion_matrix_report(golf['Play'], golf_prediction))

      Prediction
        no   yes
   no    2     3
  yes    6     3

Train / Test split

Parameters: data, target, test_size, random_state, stratify (pass the target again to keep the class proportions in both parts)

from sklearn.model_selection import train_test_split
golf_data = golf_encoded
golf_target = golf['Play']
data_train, data_test, target_train, target_test = train_test_split(
    golf_data, golf_target, test_size=0.2, random_state=42, stratify=golf_target)
print("=======TRAIN=========")
print(data_train)
print(target_train)

=======TRAIN=========
    Temperature  Humidity   Wind  Outlook
1          80.0      90.0   True        0
6          64.0      65.0   True        1
4          68.0      80.0  False        2
13         71.0      80.0   True        2
3          70.0      96.0  False        2
2          83.0      78.0  False        1
5          65.0      70.0   True        2
12         81.0      75.0  False        1
0          85.0      85.0  False        0
9          75.0      80.0  False        2
10         75.0      70.0   True        0
1      no
6     yes
4     yes
13     no
3     yes
2     yes
5      no
12    yes
0      no
9     yes
10    yes
Name: Play, dtype: object

print("=======TEST=========")
print(data_test)
print(target_test)

=======TEST=========
    Temperature  Humidity   Wind  Outlook
7          72.0      95.0  False        0
11         72.0      90.0   True        1
8          69.0      70.0  False        0
7      no
11    yes
8     yes
Name: Play, dtype: object
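
What stratify does is easiest to see by counting labels. Here is a small sketch on made-up labels (12 'yes' and 8 'no', chosen so that the proportions divide evenly; this is not the golf data):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 60% 'yes' and 40% 'no'
data = [[i] for i in range(20)]
target = ['yes'] * 12 + ['no'] * 8

data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.25, random_state=42, stratify=target)

# With stratify, both parts keep the 60/40 class ratio of the full data
train_counts = Counter(target_train)
test_counts = Counter(target_test)
print(train_counts, test_counts)
```

Without stratify the split is purely random, so a small test set can end up with a very different class ratio, or even with only one class.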

Formatting

The easiest way is to use the format method of strings.
Use "{}" as a placeholder; each one is replaced, in order, by the corresponding argument.

print("hello {} you are {} years old".format("heinz", 30))

hello heinz you are 30 years old
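
The pretty-print function for the confusion matrix relies on a few more features of format. A format spec after the colon controls alignment, width and number formatting, and the width itself can be filled in from another argument with a nested {}. A short sketch of the kinds of specs used in this notebook (the example values are arbitrary):

```python
# Right-align text in a field of width 5
right = "{:>5}".format("no")
# The width can itself be an argument (nested placeholder)
nested = "{:>{}}".format("yes", 5)
# Center text and pad with underscores
centered = "{:_^14}".format("Prediction")
# Integer in a field of width 5 (numbers are right-aligned by default)
number = "{:5d}".format(3)
# Float rounded to two decimals
rounded = "{:.2f}".format(0.357142857143)

for text in (right, nested, centered, number, rounded):
    print(repr(text))
```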