Intro to scikit-learn

Reading the data again

In [7]:
import pandas as pd
golf = pd.read_csv('golf.csv')

Different (better) way to encode attributes

In [8]:
print(golf.head())

from category_encoders.ordinal import OrdinalEncoder  # choose one encoder
encoder = OrdinalEncoder()

#from category_encoders.one_hot import OneHotEncoder
#encoder = OneHotEncoder()

golf_encoded = encoder.fit_transform(golf[['Outlook', 'Temperature', 'Humidity', 'Wind']])
print(golf_encoded.head())
    Outlook  Temperature  Humidity   Wind Play
0     sunny         85.0      85.0  False   no
1     sunny         80.0      90.0   True   no
2  overcast         83.0      78.0  False  yes
3      rain         70.0      96.0  False  yes
4      rain         68.0      80.0  False  yes
   Temperature  Humidity   Wind  Outlook
0         85.0      85.0  False        0
1         80.0      90.0   True        0
2         83.0      78.0  False        1
3         70.0      96.0  False        2
4         68.0      80.0  False        2
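
To make the difference between the two encoders concrete, the following sketch reproduces both encodings with plain pandas instead of category_encoders. The five Outlook values are copied from the head of the table above; pd.factorize and pd.get_dummies are stand-ins chosen for illustration, not what the encoders use internally.

```python
import pandas as pd

# The first five Outlook values from the golf table above
outlook = pd.Series(['sunny', 'sunny', 'overcast', 'rain', 'rain'], name='Outlook')

# Ordinal encoding: one integer per category, numbered in order of first appearance
codes, categories = pd.factorize(outlook)
print(list(codes))        # integer code per row
print(list(categories))   # category belonging to each code

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(outlook)
print(one_hot)
```

Ordinal codes imply an order between categories that the data does not have, which can mislead distance-based methods such as k-nearest-neighbor. One-hot encoding avoids this at the cost of more columns.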

Applying ML Approaches

  • Naive Bayes
In [9]:
from sklearn.naive_bayes import GaussianNB
naive_bayes = GaussianNB()
naive_bayes.fit(golf_encoded, golf['Play'])
Out[9]:
GaussianNB(priors=None)
  • k-nearest-neighbor
In [10]:
from sklearn.neighbors import KNeighborsClassifier
knn_estimator = KNeighborsClassifier(n_neighbors=3)
#knn_estimator.fit....
  • nearest centroid
In [11]:
from sklearn.neighbors import NearestCentroid
nearest_centroid_estimator = NearestCentroid()
#nearest_centroid_estimator.fit....
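
All three estimators share the same fit / predict interface, so the commented-out calls follow the Naive Bayes pattern. A minimal sketch on a tiny made-up data set (the values in X, y and X_new are invented for illustration and are not the golf data):

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid

# Hypothetical toy data: two well-separated groups
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
y = ['no', 'no', 'no', 'yes', 'yes', 'yes']
X_new = [[0.5, 0.5], [10.5, 10.5]]

estimators = {
    'naive bayes': GaussianNB(),
    'k-nearest-neighbor': KNeighborsClassifier(n_neighbors=3),
    'nearest centroid': NearestCentroid(),
}

predictions = {}
for name, estimator in estimators.items():
    estimator.fit(X, y)                                  # learn from the labeled rows
    predictions[name] = list(estimator.predict(X_new))   # classify unseen rows
    print(name, predictions[name])
```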

Evaluation

Three important functions

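  • confusion_matrix (returns the confusion matrix as a 2D array; rows are the true classes, columns the predicted classes)
  • classification_report (returns a string with precision, recall, f1-score and support per class)
  • accuracy_score (returns the fraction of correct predictions)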

All of them take the same two inputs, in this order:

  • ground truth, prediction
In [12]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

golf_prediction = ['yes','no','yes','yes','no','yes','yes','no','no','no','no','no','no','yes']

print(confusion_matrix(golf['Play'], golf_prediction))
print(accuracy_score(golf['Play'], golf_prediction))
print(classification_report(golf['Play'], golf_prediction))
[[2 3]
 [6 3]]
0.357142857143
             precision    recall  f1-score   support

         no       0.25      0.40      0.31         5
        yes       0.50      0.33      0.40         9

avg / total       0.41      0.36      0.37        14
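
Every number in the report can be derived from the confusion matrix, where rows are the true classes and columns the predicted classes. A short sketch of the arithmetic in plain Python, using the matrix printed above:

```python
# Confusion matrix from the output above: rows = truth, columns = prediction
labels = ['no', 'yes']
cm = [[2, 3],
      [6, 3]]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(labels)))   # diagonal entries
accuracy = correct / total
print("accuracy: {:.2f}".format(accuracy))

metrics = {}
for i, label in enumerate(labels):
    predicted_as_label = sum(cm[r][i] for r in range(len(labels)))  # column sum
    truly_label = sum(cm[i])                                        # row sum (support)
    precision = cm[i][i] / predicted_as_label
    recall = cm[i][i] / truly_label
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = (precision, recall, f1, truly_label)
    print("{:>3}  precision {:.2f}  recall {:.2f}  f1 {:.2f}  support {}".format(
        label, precision, recall, f1, truly_label))
```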

Pretty Print confusion matrix

In [13]:
from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels

def confusion_matrix_report(y_true, y_pred):
    """Return the confusion matrix as a labeled, fixed-width text table."""
    cm, labels = confusion_matrix(y_true, y_pred), unique_labels(y_true, y_pred)
    column_width = max([len(str(x)) for x in labels] + [5])  # at least 5 characters per cell
    # Header: the word "Prediction" centered above the predicted-class columns
    report = " " * column_width + " " + "{:_^{}}".format("Prediction", column_width * len(labels)) + "\n"
    # Column labels (predicted classes)
    report += " " * column_width + " ".join(["{:>{}}".format(label, column_width) for label in labels]) + "\n"
    # One row per true class: row label followed by the counts
    for i, label1 in enumerate(labels):
        report += "{:>{}}".format(label1, column_width) + " ".join(["{:{}d}".format(cm[i, j], column_width) for j in range(len(labels))]) + "\n"
    return report

print(confusion_matrix_report(golf['Play'], golf_prediction))
      Prediction
        no   yes
   no    2     3
  yes    6     3
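
An alternative to formatting the table by hand is to wrap the matrix in a pandas DataFrame, which prints row and column labels without extra work. A sketch, assuming the true labels are the Play values as they appear in the train/test printouts of this notebook (listed inline here so the snippet runs without golf.csv):

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Play column (rows 0..13) and the prediction list used above
y_true = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes',
          'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
y_pred = ['yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes',
          'no', 'no', 'no', 'no', 'no', 'no', 'yes']

labels = ['no', 'yes']
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Rows are the true classes, columns the predicted classes
table = pd.DataFrame(cm, index=labels, columns=labels)
table.index.name = 'truth'
table.columns.name = 'prediction'
print(table)
```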

Train / Test split

  • parameters: data, target, test_size, random_state, stratify (pass the target again to keep the class proportions equal in both splits)
In [14]:
from sklearn.model_selection import train_test_split
golf_data = golf_encoded
golf_target = golf['Play']
data_train, data_test, target_train, target_test = train_test_split(
    golf_data, golf_target, test_size=0.2, random_state=42, stratify=golf_target)
print("=======TRAIN=========")
print(data_train)
print(target_train)
=======TRAIN=========
    Temperature  Humidity   Wind  Outlook
1          80.0      90.0   True        0
6          64.0      65.0   True        1
4          68.0      80.0  False        2
13         71.0      80.0   True        2
3          70.0      96.0  False        2
2          83.0      78.0  False        1
5          65.0      70.0   True        2
12         81.0      75.0  False        1
0          85.0      85.0  False        0
9          75.0      80.0  False        2
10         75.0      70.0   True        0
1      no
6     yes
4     yes
13     no
3     yes
2     yes
5      no
12    yes
0      no
9     yes
10    yes
Name: Play, dtype: object
In [15]:
print("=======TEST=========")
print(data_test)
print(target_test)
=======TEST=========
    Temperature  Humidity   Wind  Outlook
7          72.0      95.0  False        0
11         72.0      90.0   True        1
8          69.0      70.0  False        0
7      no
11    yes
8     yes
Name: Play, dtype: object
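
The split is normally followed by fitting on the training part and scoring on the held-out part. A minimal sketch of that workflow on invented data (20 one-dimensional samples, 10 per class; the data and the choice of k-nearest-neighbor are assumptions for illustration):

```python
from collections import Counter

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical, well-separated toy data: 10 samples per class
data = [[i] for i in range(10)] + [[i + 100] for i in range(10)]
target = ['no'] * 10 + ['yes'] * 10

data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.2, random_state=42, stratify=target)

# stratify keeps the class ratio identical in both parts
train_counts = Counter(target_train)
test_counts = Counter(target_test)
print(train_counts, test_counts)

# Fit on the training part only, then evaluate on the unseen test rows
estimator = KNeighborsClassifier(n_neighbors=3)
estimator.fit(data_train, target_train)
test_accuracy = accuracy_score(target_test, estimator.predict(data_test))
print(test_accuracy)
```

With only 14 rows, as in the golf data, a single split leaves just 3 test samples, so the resulting accuracy is a very rough estimate.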

Formatting

  • the easiest way is to use the str.format method
  • use "{}" as a placeholder; each one is replaced, in order, by the corresponding argument
In [16]:
print("hello {} you are {} years old".format("heinz", 30))
hello heinz you are 30 years old
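
A placeholder can also carry a format specification after a colon, which is what the pretty-print function earlier in this notebook relies on. A few examples of the specifications used there, plus one for rounding floats:

```python
# Right-align a string in a field of width 5
right = "{:>5}".format("no")

# The width itself can be a nested field filled from the next argument
nested = "{:>{}}".format("yes", 5)

# Integer ("d") padded to width 5
number = "{:{}d}".format(6, 5)

# Center the text and fill the remaining space with underscores
centered = "{:_^{}}".format("Prediction", 14)

# Fixed number of decimals for floats
rounded = "{:.3f}".format(5 / 14)

for value in (right, nested, number, centered, rounded):
    print(repr(value))
```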