Python Functions

A function is defined with the keyword "def", followed by the function's name, parentheses, and a colon. The indented lines that follow form the function body.

For example:

In [1]:
def my_function():
    print("Hello From My Function!")

Calling this function is done by:

In [2]:
my_function()
Hello From My Function!

Arguments for Functions

In [3]:
def my_function_with_args(username, greeting):
    print("Hello, {}, I wish you {}".format(username, greeting))

my_function_with_args("Mike", "good day")
Hello, Mike, I wish you good day
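
Arguments can also have default values and can be passed by name. The following is a minimal sketch; the function name greet and its default text are made up for illustration and are not part of the notebook above.

```python
# Hypothetical variant of my_function_with_args with a default value
def greet(username, greeting="a good day"):
    """Return the greeting text instead of printing it, so it can be reused."""
    return "Hello, {}, I wish you {}".format(username, greeting)

# Positional call that falls back to the default greeting
message_default = greet("Mike")

# Keyword arguments name each value, so their order does not matter
message_keyword = greet(greeting="a nice evening", username="Anna")

print(message_default)
print(message_keyword)
```

Parameters with defaults have to come after the parameters without defaults in the def line.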

Return values

Use the return statement to hand a value back to the caller. A function that never reaches a return statement gives back None.

In [4]:
def sum_two_numbers(a, b):
    return a + b

sum_two_numbers(4,5)
Out[4]:
9

Multiple return values

A function can return several values at once by packing them into a tuple.

In [5]:
def sum_and_diff(a, b):
    return (a+b, a-b) # the parentheses are optional; return a+b, a-b is equivalent

my_tuple = sum_and_diff(4,6)
print(my_tuple) # prints (10, -2)
print(my_tuple[0]) # prints 10
print(my_tuple[1]) # prints -2
(10, -2)
10
-2

Assign return values directly

In [6]:
(my_sum, my_diff) = sum_and_diff(4,6) # the parentheses are optional here as well
print(my_sum)
print(my_diff)
10
-2
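
Unpacking works for any number of returned values. The sketch below uses a made-up helper, min_max_mean, to show plain unpacking and the starred form that collects the remaining values.

```python
# Hypothetical helper that returns three results at once
def min_max_mean(values):
    """Python packs the three comma-separated results into one tuple."""
    return min(values), max(values), sum(values) / len(values)

result = min_max_mean([4, 6, 8])    # a single tuple with three entries
lowest, highest, mean = result      # one name per entry
lowest_only, *rest = result         # the starred name collects the remainder in a list

print(result)
print(lowest, highest, mean)
print(rest)
```

The number of names on the left has to match the length of the tuple unless a starred name is used.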

Dictionaries

  • like a map (maps keys to values)
  • initialised by dict() or {}
In [7]:
population = {} # or population = dict()
population["Mannheim"] = 305780
population["Ludwigshafen"] = 164718
population["Heidelberg"] = 156267
print(population)
print(population["Heidelberg"])
{'Mannheim': 305780, 'Ludwigshafen': 164718, 'Heidelberg': 156267}
156267

Direct initialisation

In [8]:
population = {
    "Mannheim" : 305780,
    "Ludwigshafen" : 164718,
    "Heidelberg" : 156267
}
print(population)
{'Mannheim': 305780, 'Ludwigshafen': 164718, 'Heidelberg': 156267}

Iterating over dictionaries

In [9]:
for name, count in population.items():
    print("The population of {} is {}".format(name, count))
The population of Mannheim is 305780
The population of Ludwigshafen is 164718
The population of Heidelberg is 156267

Removing elements

In [10]:
del population["Heidelberg"]
print(population)
{'Mannheim': 305780, 'Ludwigshafen': 164718}
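
Accessing a key that has been removed raises a KeyError. The following sketch shows three standard ways to deal with missing keys; the key "Speyer" is only an illustrative example of an absent entry.

```python
population = {"Mannheim": 305780, "Ludwigshafen": 164718}

# population["Heidelberg"] would raise KeyError because the key is gone
has_heidelberg = "Heidelberg" in population        # membership test on the keys
heidelberg = population.get("Heidelberg", 0)       # fallback value instead of an exception

# pop removes a key and returns its value; the second argument is the
# default that is returned when the key does not exist
removed = population.pop("Ludwigshafen", None)
missing = population.pop("Speyer", None)

print(has_heidelberg, heidelberg, removed, missing)
print(population)
```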

Binning

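  • not directly included in older scikit-learn versions (since version 0.20 there is sklearn.preprocessing.KBinsDiscretizer); pandas is used here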

Equal-width binning

In [11]:
import pandas as pd
items = [0,4,12,16,16,18,24,26,28]
pd.cut(items, bins=3)
Out[11]:
[(-0.028, 9.333], (-0.028, 9.333], (9.333, 18.667], (9.333, 18.667], (9.333, 18.667], (9.333, 18.667], (18.667, 28.0], (18.667, 28.0], (18.667, 28.0]]
Categories (3, interval[float64]): [(-0.028, 9.333] < (9.333, 18.667] < (18.667, 28.0]]
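
To see where the interval borders come from, equal-width binning can be sketched by hand: the range from minimum to maximum is split into bins of identical width. The helper equal_width_bins below is a hypothetical illustration, not how pandas implements pd.cut. pandas additionally widens the range by 0.1% so that the minimum falls inside the first interval, which is where the -0.028 in the output comes from.

```python
from bisect import bisect_left

# Hypothetical helper, written only to illustrate the idea
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins right-closed intervals of equal width."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n_bins
    edges = [lowest + i * width for i in range(n_bins + 1)]
    indices = []
    for value in values:
        # bisect_left finds the first edge >= value, so a value in
        # (edges[i], edges[i + 1]] ends up with index i
        index = bisect_left(edges, value) - 1
        # the minimum sits exactly on the first edge; keep it in bin 0
        # (the upper clip guards against floating point rounding of the last edge)
        indices.append(min(max(index, 0), n_bins - 1))
    return edges, indices

items = [0, 4, 12, 16, 16, 18, 24, 26, 28]
edges, bin_indices = equal_width_bins(items, 3)
print([round(edge, 3) for edge in edges])
print(bin_indices)
```

The bin sizes are 2, 4 and 3 items, which matches the pd.cut result: equal width does not imply equal frequency.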

Equal-frequency binning

In [12]:
pd.qcut(items, q=3)
Out[12]:
[(-0.001, 14.667], (-0.001, 14.667], (-0.001, 14.667], (14.667, 20.0], (14.667, 20.0], (14.667, 20.0], (20.0, 28.0], (20.0, 28.0], (20.0, 28.0]]
Categories (3, interval[float64]): [(-0.001, 14.667] < (14.667, 20.0] < (20.0, 28.0]]

Apply it to a dataset

In [13]:
iris = pd.read_csv("iris.csv")
pd.cut(iris['SepalLength'], bins=3, labels=['low', 'middle', 'high'])
Out[13]:
0         low
1         low
2         low
3         low
4         low
5         low
6         low
7         low
8         low
9         low
10        low
11        low
12        low
13        low
14     middle
15     middle
16        low
17        low
18     middle
19        low
20        low
21        low
22        low
23        low
24        low
25        low
26        low
27        low
28        low
29        low
        ...  
120      high
121    middle
122      high
123    middle
124    middle
125      high
126    middle
127    middle
128    middle
129      high
130      high
131      high
132    middle
133    middle
134    middle
135      high
136    middle
137    middle
138    middle
139      high
140    middle
141      high
142    middle
143      high
144    middle
145    middle
146    middle
147    middle
148    middle
149    middle
Name: SepalLength, Length: 150, dtype: category
Categories (3, object): [high < low < middle]

Create a binned dataset

  • idea:
    • create a new dataframe
    • initialise it with processed values (as dict)
In [14]:
iris_binned = pd.DataFrame(dict(
    SepalLength = pd.cut(iris['SepalLength'], bins=3, labels=['low', 'middle', 'high']),
    SepalWidth = pd.cut(iris['SepalWidth'], bins=3, labels=['low', 'middle', 'high'])
))
iris_binned.head()
Out[14]:
SepalLength SepalWidth
0 low middle
1 low middle
2 low middle
3 low middle
4 low middle

Another way of encoding (third possibility)

  • pandas also offers a way to one-hot encode the values
  • it differs from the one-hot encoder in category_encoders in that it produces nicer column names
  • use the function get_dummies
In [15]:
iris_binned_and_encoded = pd.get_dummies(iris_binned)
iris_binned_and_encoded.head()
Out[15]:
SepalLength_high SepalLength_low SepalLength_middle SepalWidth_high SepalWidth_low SepalWidth_middle
0 0 1 0 0 0 1
1 0 1 0 0 0 1
2 0 1 0 0 0 1
3 0 1 0 0 0 1
4 0 1 0 0 0 1
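
What get_dummies does is easiest to see on a tiny example. The sketch below uses a made-up four-row column in place of the binned iris data. Newer pandas versions return boolean indicator columns, so the result is cast to int to obtain the 0/1 values shown above.

```python
import pandas as pd

# Hypothetical toy column, standing in for one binned iris feature
toy = pd.DataFrame({"SepalLength": ["low", "high", "low", "middle"]})

# One indicator column per distinct value, named <column>_<value>
encoded = pd.get_dummies(toy).astype(int)
print(encoded)

# Every row has exactly one active indicator column
row_sums = encoded.sum(axis=1).tolist()
print(row_sums)
```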

Applying ML Approaches

In [16]:
from sklearn import tree
decision_tree = tree.DecisionTreeClassifier()
decision_tree
Out[16]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

Plot the decision tree

  • first install two additional packages (run both commands in a console)
    • conda install -c conda-forge graphviz
    • pip install graphviz
In [17]:
decision_tree = tree.DecisionTreeClassifier(max_depth=2) # max_depth=2 keeps the tree small enough to read
decision_tree.fit(iris_binned_and_encoded, iris['Name'])
Out[17]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
In [18]:
# run the following two commands in console:
# conda install -c conda-forge graphviz
# pip install graphviz

import graphviz 
from sklearn.utils.multiclass import unique_labels

dot_data = tree.export_graphviz(decision_tree,
                         feature_names=iris_binned_and_encoded.columns.values,
                         class_names=unique_labels(iris['Name']),  
                         filled=True, rounded=True,special_characters=True,out_file=None)
graphviz.Source(dot_data)
Out[18]:
[Figure: the fitted decision tree; node contents as rendered by Graphviz]
Node 0 (root): SepalLength_low ≤ 0.5, gini = 0.6667, samples = 150, value = [50, 50, 50], class = Iris-setosa
  Node 1 (True branch of node 0): SepalLength_middle ≤ 0.5, gini = 0.5253, samples = 91, value = [3, 39, 49], class = Iris-virginica
    Node 2 (leaf, child of node 1): gini = 0.255, samples = 20, value = [0, 3, 17], class = Iris-virginica
    Node 3 (leaf, child of node 1): gini = 0.538, samples = 71, value = [3, 36, 32], class = Iris-versicolor
  Node 4 (False branch of node 0): SepalWidth_low ≤ 0.5, gini = 0.3304, samples = 59, value = [47, 11, 1], class = Iris-setosa
    Node 5 (leaf, child of node 4): gini = 0.0416, samples = 47, value = [46, 1, 0], class = Iris-setosa
    Node 6 (leaf, child of node 4): gini = 0.2917, samples = 12, value = [1, 10, 1], class = Iris-versicolor

Get the number of nodes in the tree

In [19]:
decision_tree.tree_.node_count
Out[19]:
7
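
If Graphviz is not installed, a fitted tree can also be printed as indented text rules with tree.export_text (available in scikit-learn 0.21 and later). The sketch below is self-contained and therefore makes an assumption: it trains on scikit-learn's bundled iris data with the four raw numeric features, not on the binned and encoded frame used above, so the resulting tree differs from the plotted one.

```python
from sklearn import tree
from sklearn.datasets import load_iris

# Assumption: the bundled iris data with raw numeric features,
# not the binned, one-hot encoded frame from this notebook
iris_bunch = load_iris()

small_tree = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(iris_bunch.data, iris_bunch.target)

# export_text renders the fitted tree as indented if/else rules
rules = tree.export_text(small_tree, feature_names=list(iris_bunch.feature_names))
print(rules)
print("nodes:", small_tree.tree_.node_count, "depth:", small_tree.get_depth())
```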

Model evaluation

Train test split

In [20]:
from sklearn.model_selection import train_test_split
data_train, data_test, target_train, target_test = train_test_split(
    iris_binned_and_encoded, iris['Name'],test_size=0.2, random_state=42, stratify=iris['Name'])
print(data_train.head())
print(target_train.head())
     SepalLength_high  SepalLength_low  SepalLength_middle  SepalWidth_high  \
8                   0                1                   0                0   
106                 0                1                   0                0   
76                  1                0                   0                0   
9                   0                1                   0                0   
89                  0                1                   0                0   

     SepalWidth_low  SepalWidth_middle  
8                 0                  1  
106               1                  0  
76                1                  0  
9                 0                  1  
89                1                  0  
8          Iris-setosa
106     Iris-virginica
76     Iris-versicolor
9          Iris-setosa
89     Iris-versicolor
Name: Name, dtype: object
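
The stratify argument keeps the class proportions identical in the training and the test part. A minimal sketch with made-up toy data (150 samples, three equally frequent classes) makes this visible by counting the labels on both sides of the split.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 150 samples, three equally frequent classes
features = [[i] for i in range(150)]
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50

_, _, toy_target_train, toy_target_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels)

# With stratification each class contributes 20% of its samples to the test part
train_counts = Counter(toy_target_train)
test_counts = Counter(toy_target_test)
print(train_counts)
print(test_counts)
```

Without stratify the counts per class would vary with the random seed.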

Cross Validation

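  • see also the scikit-learn documentation on cross-validation
  • to compute cross-validated metrics, use the cross_val_score function
  • several scoring options are available (see the overview in the scikit-learn documentation), for example
    • "accuracy"
    • "precision"
    • "recall"
    • "roc_auc"
  • the plain "precision", "recall" and "roc_auc" scorers expect a binary target; multi-class problems such as iris need an averaged variant like "precision_macro"
  • the return value is an array with one score per fold (partition)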
In [21]:
from sklearn.model_selection import cross_val_score
accuracy_iris = cross_val_score(decision_tree, iris_binned_and_encoded, iris['Name'], cv=10, scoring='accuracy')
accuracy_iris
Out[21]:
array([ 0.6       ,  0.86666667,  0.66666667,  0.66666667,  0.8       ,
        0.73333333,  0.73333333,  0.8       ,  0.73333333,  0.66666667])

Average the scores to get one value

In [22]:
accuracy_iris.mean()
Out[22]:
0.72666666666666668
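
The mean alone hides how much the folds differ, so it is common to report the standard deviation next to it. The sketch below reuses the ten fold accuracies printed in Out[21]; it assumes nothing beyond NumPy.

```python
import numpy as np

# Fold accuracies copied from Out[21]
accuracy_iris = np.array([0.6, 0.86666667, 0.66666667, 0.66666667, 0.8,
                          0.73333333, 0.73333333, 0.8, 0.73333333, 0.66666667])

mean_accuracy = accuracy_iris.mean()
std_accuracy = accuracy_iris.std()   # spread of the scores across the folds

print("accuracy: {:.3f} +/- {:.3f}".format(mean_accuracy, std_accuracy))
```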

Stratified

  • if you want a stratified version (and also set the random seed) you can do the following
In [23]:
from sklearn.model_selection import StratifiedKFold

cross_val = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
acc_each_split = cross_val_score(decision_tree, iris_binned_and_encoded, iris['Name'], cv=cross_val, scoring='accuracy')
acc_each_split.mean()
Out[23]:
0.72666666666666668

Obtaining predictions by cross-validation

In [24]:
from sklearn.model_selection import cross_val_predict
predicted = cross_val_predict(decision_tree, iris_binned_and_encoded, iris['Name'], cv=10)
print(predicted)
['Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-setosa' 'Iris-setosa' 'Iris-versicolor'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-versicolor' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-virginica' 'Iris-versicolor' 'Iris-virginica'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-setosa'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-virginica' 'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor'
 'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-virginica' 'Iris-virginica' 'Iris-versicolor' 'Iris-virginica'
 'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor'
 'Iris-virginica' 'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor']
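
A long list of predicted labels is hard to read; a confusion matrix summarises it per class. The sketch below uses short made-up label lists in place of iris['Name'] and predicted, so the numbers are illustrative only.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, standing in for iris['Name'] and `predicted`
true_labels = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
predicted_labels = ["setosa", "setosa", "versicolor", "virginica", "versicolor", "virginica"]
class_order = ["setosa", "versicolor", "virginica"]

# Rows are the true classes, columns the predicted classes, both in class_order
matrix = confusion_matrix(true_labels, predicted_labels, labels=class_order)
print(matrix)
```

Entries on the diagonal are correct predictions; everything else is a confusion between two classes.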

Cross-validation iterators

  • if you want to iterate over each fold with a for loop you can do this with the following snippet
In [25]:
# the fold indices are positions, so work on the raw arrays rather than the pandas objects (access them with .values)
data = iris_binned_and_encoded.values
target = iris['Name'].values

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

for train_indices, test_indices in cv.split(data, target):
    train_data = data[train_indices]
    train_target = target[train_indices]
    
    decision_tree.fit(train_data, train_target)

    test_data = data[test_indices]
    test_target = target[test_indices]
    
    test_prediction = decision_tree.predict(test_data)
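
The loop above computes the predictions of each fold but does not evaluate them. A possible continuation, sketched here under the assumption that scikit-learn's bundled iris data is used instead of the binned frame, collects one accuracy value per fold.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Assumption: raw iris features from scikit-learn, returned as NumPy arrays
data, target = load_iris(return_X_y=True)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
estimator = DecisionTreeClassifier(max_depth=2, random_state=0)

fold_accuracies = []
for train_indices, test_indices in cv.split(data, target):
    # fit on the training part of this fold only
    estimator.fit(data[train_indices], target[train_indices])
    test_prediction = estimator.predict(data[test_indices])
    # compare the predictions with the held-out labels
    fold_accuracies.append(accuracy_score(target[test_indices], test_prediction))

print(fold_accuracies)
print(np.mean(fold_accuracies))
```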

Increase the number of examples in a training set

  • you can use the np.append function to add more training indices
In [26]:
import numpy as np

for train_indices, test_indices in cv.split(data, target):
    train_indices = np.append(train_indices, np.flatnonzero(target == 'Iris-setosa')) # add the positions of all setosa samples (Series.nonzero was removed in pandas 1.0)
    print(train_indices[:10])
[0 1 2 3 4 5 6 7 8 9]
[0 1 2 3 4 5 6 7 8 9]
[ 0  1  2  5  6  7  9 10 11 13]
[ 0  1  2  3  4  5  7  8  9 10]
[ 0  1  2  3  4  5  6  7  8 10]
[ 1  2  3  4  6  7  8  9 10 11]
[ 0  2  3  4  5  6  7  8  9 10]
[ 0  1  3  4  5  6  7  8  9 10]
[0 1 2 3 4 5 6 7 8 9]
[ 0  1  2  3  4  5  6  8  9 10]
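
Appending indices that are already part of the training set duplicates those samples, which acts like giving them a higher weight. A small sketch with a made-up four-element target shows the effect.

```python
import numpy as np

# Hypothetical toy target and training indices
target = np.array(["setosa", "versicolor", "setosa", "virginica"])
train_indices = np.array([0, 1, 3])

# positions of all setosa samples, whether or not they are already in the training set
setosa_indices = np.flatnonzero(target == "setosa")

# np.append returns a new array; index 0 now occurs twice
augmented = np.append(train_indices, setosa_indices)
print(augmented)
print(target[augmented])
```

Note that this can also pull samples from the test fold into the training set, so the test indices may need to be filtered accordingly.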

Load Arff and postprocess

In [27]:
from scipy.io import arff
credit_arff_data, credit_arff_meta = arff.loadarff('credit-g.arff') # loadarff accepts a file name and then opens and closes the file itself
credit = pd.DataFrame(credit_arff_data)
credit.head()
Out[27]:
checking_status duration credit_history purpose credit_amount savings_status employment installment_commitment personal_status other_parties ... property_magnitude age other_payment_plans housing existing_credits job num_dependents own_telephone foreign_worker class
0 b"'<0'" 6.0 b"'critical/other existing credit'" b'radio/tv' 1169.0 b"'no known savings'" b"'>=7'" 4.0 b"'male single'" b'none' ... b"'real estate'" 67.0 b'none' b'own' 2.0 b'skilled' 1.0 b'yes' b'yes' b'good'
1 b"'0<=X<200'" 48.0 b"'existing paid'" b'radio/tv' 5951.0 b"'<100'" b"'1<=X<4'" 2.0 b"'female div/dep/mar'" b'none' ... b"'real estate'" 22.0 b'none' b'own' 1.0 b'skilled' 1.0 b'none' b'yes' b'bad'
2 b"'no checking'" 12.0 b"'critical/other existing credit'" b'education' 2096.0 b"'<100'" b"'4<=X<7'" 2.0 b"'male single'" b'none' ... b"'real estate'" 49.0 b'none' b'own' 1.0 b"'unskilled resident'" 2.0 b'none' b'yes' b'good'
3 b"'<0'" 42.0 b"'existing paid'" b'furniture/equipment' 7882.0 b"'<100'" b"'4<=X<7'" 2.0 b"'male single'" b'guarantor' ... b"'life insurance'" 45.0 b'none' b"'for free'" 1.0 b'skilled' 2.0 b'none' b'yes' b'good'
4 b"'<0'" 24.0 b"'delayed previously'" b"'new car'" 4870.0 b"'<100'" b"'1<=X<4'" 3.0 b"'male single'" b'none' ... b"'no known property'" 53.0 b'none' b"'for free'" 2.0 b'skilled' 2.0 b'none' b'yes' b'bad'

5 rows × 21 columns

Postprocess

  • the text values in the loaded dataset are bytes objects; to decode them you can use the following snippet
In [28]:
credit_target = np.array([x.decode('ascii') for x in credit['class'].values]) # list comprehension: a for loop written inline inside []
print(credit_target[:10])

credit_data = pd.get_dummies(credit.drop('class', axis=1)) # drop returns all columns except the given one (here the target column)
credit_data.head()
['good' 'bad' 'good' 'good' 'bad' 'good' 'good' 'good' 'good' 'bad']
Out[28]:
duration credit_amount installment_commitment residence_since age existing_credits num_dependents checking_status_b"'0<=X<200'" checking_status_b"'<0'" checking_status_b"'>=200'" ... housing_b'own' housing_b'rent' job_b"'high qualif/self emp/mgmt'" job_b"'unemp/unskilled non res'" job_b"'unskilled resident'" job_b'skilled' own_telephone_b'none' own_telephone_b'yes' foreign_worker_b'no' foreign_worker_b'yes'
0 6.0 1169.0 4.0 4.0 67.0 2.0 1.0 0 1 0 ... 1 0 0 0 0 1 0 1 0 1
1 48.0 5951.0 2.0 2.0 22.0 1.0 1.0 1 0 0 ... 1 0 0 0 0 1 1 0 0 1
2 12.0 2096.0 2.0 3.0 49.0 1.0 2.0 0 0 0 ... 1 0 0 0 1 0 1 0 0 1
3 42.0 7882.0 2.0 4.0 45.0 1.0 2.0 0 1 0 ... 0 0 0 0 0 1 1 0 0 1
4 24.0 4870.0 3.0 4.0 53.0 2.0 2.0 0 1 0 ... 0 0 0 0 0 1 1 0 0 1

5 rows × 61 columns
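
The column names above still contain the b'...' prefix because only the class column was decoded. One possible way to decode every text column before encoding is sketched below; the two-row frame is a made-up excerpt in the shape returned by arff.loadarff, and stripping the extra single quotes is an optional cosmetic step.

```python
import pandas as pd

# Hypothetical two-row excerpt in the shape produced by arff.loadarff
credit_raw = pd.DataFrame({
    "purpose": [b"radio/tv", b"'new car'"],
    "duration": [6.0, 24.0],
    "class": [b"good", b"bad"],
})

credit_decoded = credit_raw.copy()
for column in credit_decoded.columns:
    # only columns holding bytes need decoding; numeric columns stay untouched
    if isinstance(credit_decoded[column].iloc[0], bytes):
        decoded = credit_decoded[column].str.decode("utf-8")
        credit_decoded[column] = decoded.str.strip("'")   # drop the quotes kept from the ARFF file

print(credit_decoded)
```

Applying get_dummies to a frame decoded this way yields column names such as purpose_new car instead of purpose_b"'new car'".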

Plot ROC curve

In [29]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

knn_estimator = KNeighborsClassifier(3)

data_train, data_test, target_train, target_test = train_test_split(credit_data, credit_target)
knn_estimator.fit(data_train, target_train)
proba_for_each_class = knn_estimator.predict_proba(data_test) # roc_curve needs scores rather than hard labels, so use predict_proba or decision_function

fpr, tpr, thresholds = roc_curve(target_test, proba_for_each_class[:,1], pos_label='good') # column 1 belongs to the second entry of knn_estimator.classes_, here 'good'

plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', label='Luck', alpha=.8) # draw diagonal
plt.plot(fpr,tpr,label='K-NN')

plt.legend()
plt.show() 
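
A ROC curve is often summarised by the area under it (AUC). The sketch below uses four made-up samples with scores for the class 'good' and computes the area with sklearn.metrics.auc; a value of 0.5 corresponds to the diagonal labelled 'Luck' in the plot.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Hypothetical labels and scores for the positive class 'good'
true_labels = np.array(["bad", "bad", "good", "good"])
scores_for_good = np.array([0.1, 0.4, 0.35, 0.8])

# one (fpr, tpr) point per threshold on the scores
fpr, tpr, thresholds = roc_curve(true_labels, scores_for_good, pos_label="good")

# area under the piecewise linear curve through those points
area = auc(fpr, tpr)
print(area)
```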

Average ROC curves (cross-validation)

  • here is a snippet of python code you can use
In [30]:
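from numpy import interp # scipy.interp was deprecated and is missing from recent SciPy releases; numpy.interp is the same function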
from sklearn.metrics import roc_curve, auc

def avg_roc(cv, estimator, data, target, pos_label):
    mean_fpr = np.linspace(0, 1, 100) # = [0.0, 0.01, 0.02, 0.03, ... , 0.99, 1.0]
    tprs = []
    aucs = []    
    for train_indices, test_indices in cv.split(data, target):
        train_data, train_target = data[train_indices], target[train_indices]
        estimator.fit(train_data, train_target)
        
        test_data, test_target = data[test_indices], target[test_indices]
        decision_for_each_class = estimator.predict_proba(test_data) # roc_curve needs scores, so use predict_proba or decision_function
    
        fpr, tpr, thresholds = roc_curve(test_target, decision_for_each_class[:,1], pos_label=pos_label)
        tprs.append(interp(mean_fpr, fpr, tpr))
        tprs[-1][0] = 0.0 # tprs[-1] is the curve just appended; force it to start at a TPR of 0
        aucs.append(auc(fpr, tpr))        
        #plt.plot(fpr, tpr)# plot for each fold
        
    mean_tpr = np.mean(tprs, axis=0)
    mean_tpr[-1] = 1.0 # set the last tpr to 1
    mean_auc = auc(mean_fpr, mean_tpr)
    std_auc = np.std(aucs)    
    return mean_fpr, mean_tpr, mean_auc, std_auc
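
The key step in avg_roc is the interpolation: every fold produces its ROC points at different false positive rates, so each curve is resampled on the common grid mean_fpr before averaging. The sketch below shows this step in isolation with two made-up curves and a coarse grid of five points.

```python
import numpy as np

# Two hypothetical fold curves with different fpr positions
fold_curves = [
    (np.array([0.0, 0.5, 1.0]), np.array([0.0, 1.0, 1.0])),
    (np.array([0.0, 0.25, 1.0]), np.array([0.0, 0.5, 1.0])),
]

mean_fpr = np.linspace(0, 1, 5)   # common grid: 0, 0.25, 0.5, 0.75, 1

# np.interp evaluates each curve at the grid positions (linear interpolation)
resampled = [np.interp(mean_fpr, fpr, tpr) for fpr, tpr in fold_curves]

# now the curves share their x positions and can be averaged point by point
mean_tpr = np.mean(resampled, axis=0)
print(resampled)
print(mean_tpr)
```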

Using the function to plot a ROC curve

In [31]:
knn_estimator = KNeighborsClassifier(3)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

mean_fpr, mean_tpr, mean_auc, std_auc = avg_roc(cv, knn_estimator, credit_data.values, credit_target, 'good')

plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', label='Luck', alpha=.8) # draw diagonal
plt.plot(mean_fpr,mean_tpr,label='K-NN')

plt.legend()
plt.show()

Add noise to the labels

  • not included in scikit-learn, NumPy, or SciPy
    • hence a custom approach
In [32]:
import random 
from sklearn.utils.multiclass import unique_labels
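def add_noise(raw_target, percentage):
    """Return a copy of raw_target in which roughly `percentage` percent of the labels are changed."""
    labels = unique_labels(raw_target)
    target_with_noise = []
    for one_target_label in raw_target:
        # randint(1, 100) is uniform over 1..100, so this is true with probability percentage/100
        if random.randint(1,100) <= percentage:
            # replace the label by the first class that differs from it
            # (with two classes this simply flips the label)
            target_with_noise.append(next(l for l in labels if l != one_target_label))
        else:
            target_with_noise.append(one_target_label)
    return target_with_noise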

Apply the noise function

In [33]:
credit_target_with_noise = add_noise(credit_target, 10)
for i in range(20):
    print("{:10} - {:10} - {}".format(credit_target[i], credit_target_with_noise[i], credit_target[i]==credit_target_with_noise[i]))
good       - good       - True
bad        - bad        - True
good       - good       - True
good       - good       - True
bad        - bad        - True
good       - good       - True
good       - good       - True
good       - good       - True
good       - good       - True
bad        - bad        - True
bad        - bad        - True
bad        - bad        - True
good       - good       - True
bad        - bad        - True
good       - good       - True
bad        - bad        - True
good       - good       - True
good       - good       - True
bad        - bad        - True
good       - good       - True
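
Because add_noise decides per sample, the number of changed labels varies from run to run, and the first 20 samples above happen to be untouched. A possible variant, sketched here as an alternative and not taken from the notebook, changes an exact share of the labels and uses a seed so that the result is repeatable. The name add_noise_exact and the seed parameter are made up for illustration.

```python
import random

# Hypothetical alternative to add_noise with an exact number of flips
def add_noise_exact(raw_target, percentage, seed=0):
    """Change the labels of exactly round(len * percentage / 100) randomly chosen samples."""
    rng = random.Random(seed)        # private generator, so results are repeatable
    labels = sorted(set(raw_target))
    n_flips = round(len(raw_target) * percentage / 100)
    flip_positions = set(rng.sample(range(len(raw_target)), n_flips))
    noisy = []
    for position, label in enumerate(raw_target):
        if position in flip_positions:
            # pick uniformly among the other labels
            noisy.append(rng.choice([l for l in labels if l != label]))
        else:
            noisy.append(label)
    return noisy

toy_target = ["good"] * 70 + ["bad"] * 30
noisy_target = add_noise_exact(toy_target, 10)
n_changed = sum(a != b for a, b in zip(toy_target, noisy_target))
print(n_changed)
```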