Python Functions¶

Functions are defined with the keyword "def", followed by the function's name, a pair of parentheses and a colon.

For example:

def my_function():
    print("Hello From My Function!")

Calling this function is done by:

my_function()

Hello From My Function!

Arguments for Functions¶

def my_function_with_args(username, greeting):
    print("Hello, {}, I wish you {}".format(username, greeting))

my_function_with_args("Mike", "good day")

Hello, Mike, I wish you good day
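
Arguments can be passed by position, as above, or by name. A minimal sketch of both call styles follows. The helper format_greeting is hypothetical (it is not part of the original notebook) and returns the text instead of printing it, so the two results can be compared.

```python
def format_greeting(username, greeting):
    """Build the greeting text and return it instead of printing it."""
    return "Hello, {}, I wish you {}".format(username, greeting)

# Positional arguments are matched by their order.
by_position = format_greeting("Mike", "good day")

# Keyword arguments are matched by name, so their order does not matter.
by_keyword = format_greeting(greeting="good day", username="Mike")

print(by_position)
print(by_position == by_keyword)
```

Keyword arguments tend to make calls with several parameters easier to read.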

Return values¶

Use the return statement to hand a value back to the caller; the function ends at that point.

def sum_two_numbers(a, b):
    return a + b

sum_two_numbers(4,5)

9

Multiple return values¶

A function can return several values at once by packing them into a tuple.

def sum_and_diff(a, b):
    return (a+b, a-b)  # the parentheses are optional; "return a+b, a-b" works as well

my_tuple = sum_and_diff(4,6)
print(my_tuple) # prints (10, -2)
print(my_tuple[0]) # prints 10
print(my_tuple[1]) # prints -2

(10, -2)
10
-2

Assign return values directly¶

(my_sum, my_diff) = sum_and_diff(4,6)  # the parentheses are optional here as well
print(my_sum)
print(my_diff)

10
-2

Dictionaries¶

a dictionary works like a map: it maps keys to values
it is initialised with dict() or {}

population = {} # or population = dict()
population["Mannheim"] = 305780
population["Ludwigshafen"] = 164718
population["Heidelberg"] = 156267
print(population)
print(population["Heidelberg"])

{'Mannheim': 305780, 'Ludwigshafen': 164718, 'Heidelberg': 156267}
156267
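
Looking up a key that is not in the dictionary with square brackets raises a KeyError. A minimal sketch of two common ways to guard against that: the membership test "in" and the get method. The city "Speyer" and the fallback value 0 are made up for illustration.

```python
population = {"Mannheim": 305780, "Heidelberg": 156267}

# "in" tests whether a key exists without raising an error.
has_speyer = "Speyer" in population

# get() returns the stored value, or the given default if the key is missing.
heidelberg = population.get("Heidelberg", 0)
speyer = population.get("Speyer", 0)

print(has_speyer, heidelberg, speyer)
```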

Direct initialisation¶

population = {
    "Mannheim" : 305780,
    "Ludwigshafen" : 164718,
    "Heidelberg" : 156267
}
print(population)

{'Mannheim': 305780, 'Ludwigshafen': 164718, 'Heidelberg': 156267}

Iterating over dictionaries¶

for name, count in population.items():
    print("The population of {} is {}".format(name, count))

The population of Mannheim is 305780
The population of Ludwigshafen is 164718
The population of Heidelberg is 156267

Removing elements¶

del population["Heidelberg"]
print(population)

{'Mannheim': 305780, 'Ludwigshafen': 164718}

Binning¶

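older scikit-learn versions do not include binning directly (releases from 0.20 on ship sklearn.preprocessing.KBinsDiscretizer); the examples here use pandas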

Equal-width binning¶

splits the value range into intervals of equal length; done with the pandas cut function

import pandas as pd
items = [0,4,12,16,16,18,24,26,28]
pd.cut(items, bins=3)

[(-0.028, 9.333], (-0.028, 9.333], (9.333, 18.667], (9.333, 18.667], (9.333, 18.667], (9.333, 18.667], (18.667, 28.0], (18.667, 28.0], (18.667, 28.0]]
Categories (3, interval[float64]): [(-0.028, 9.333] < (9.333, 18.667] < (18.667, 28.0]]
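
To see where the interval borders in the output come from, the following sketch asks pd.cut for the computed edges via retbins=True and compares them with a plain equal-width split of the range. It assumes the behaviour described in the pandas documentation, namely that the range is extended slightly so that the minimum falls into the first bin.

```python
import numpy as np
import pandas as pd

items = [0, 4, 12, 16, 16, 18, 24, 26, 28]

# retbins=True additionally returns the bin edges that pandas computed.
binned, edges = pd.cut(items, bins=3, retbins=True)

# Equal-width edges: split [min, max] into 3 intervals of the same length.
expected = np.linspace(min(items), max(items), 4)

print(edges)     # the lowest edge lies slightly below the minimum
print(expected)
```

Apart from the lowered first edge, both arrays should agree.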

Equal-frequency binning¶

chooses the edges so that every bin holds roughly the same number of values; done with the pandas qcut function

pd.qcut(items, q=3)

[(-0.001, 14.667], (-0.001, 14.667], (-0.001, 14.667], (14.667, 20.0], (14.667, 20.0], (14.667, 20.0], (20.0, 28.0], (20.0, 28.0], (20.0, 28.0]]
Categories (3, interval[float64]): [(-0.001, 14.667] < (14.667, 20.0] < (20.0, 28.0]]
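
The inner borders 14.667 and 20.0 are the 1/3 and 2/3 quantiles of the data. A short sketch that checks this with np.quantile and counts the items per bin (the variable names are chosen here for illustration):

```python
import numpy as np
import pandas as pd

items = [0, 4, 12, 16, 16, 18, 24, 26, 28]

# retbins=True additionally returns the bin edges that pandas computed.
binned, edges = pd.qcut(items, q=3, retbins=True)

# The inner edges should match the 1/3 and 2/3 quantiles of the data.
quantiles = np.quantile(items, [1 / 3, 2 / 3])

# Count how many items ended up in each bin.
counts = pd.Series(binned).value_counts()

print(edges)
print(quantiles)
print(counts)
```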

Apply it to a dataset¶

iris = pd.read_csv("iris.csv")
pd.cut(iris['SepalLength'], bins=3, labels=['low', 'middle', 'high'])

0         low
1         low
2         low
3         low
4         low
5         low
6         low
7         low
8         low
9         low
10        low
11        low
12        low
13        low
14     middle
15     middle
16        low
17        low
18     middle
19        low
20        low
21        low
22        low
23        low
24        low
25        low
26        low
27        low
28        low
29        low
        ...  
120      high
121    middle
122      high
123    middle
124    middle
125      high
126    middle
127    middle
128    middle
129      high
130      high
131      high
132    middle
133    middle
134    middle
135      high
136    middle
137    middle
138    middle
139      high
140    middle
141      high
142    middle
143      high
144    middle
145    middle
146    middle
147    middle
148    middle
149    middle
Name: SepalLength, Length: 150, dtype: category
Categories (3, object): [high < low < middle]

Create a binned dataset¶

idea:
- create a new dataframe
- initialise it with processed values (as dict)

iris_binned = pd.DataFrame(dict(
    SepalLength = pd.cut(iris['SepalLength'], bins=3, labels=['low', 'middle', 'high']),
    SepalWidth = pd.cut(iris['SepalWidth'], bins=3, labels=['low', 'middle', 'high'])
))
iris_binned.head()

Another way of encoding (third possibility)¶

pandas can also one-hot encode the values with the function get_dummies
it differs from the one-hot encoder in category_encoders in that it produces more readable column names

iris_binned_and_encoded = pd.get_dummies(iris_binned)
iris_binned_and_encoded.head()
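
A small sketch of what get_dummies produces. The frame toy_binned is a hypothetical stand-in for iris_binned, so the example runs without iris.csv.

```python
import pandas as pd

# Hypothetical toy frame standing in for iris_binned.
toy_binned = pd.DataFrame({
    "SepalLength": ["low", "high", "middle"],
    "SepalWidth": ["middle", "middle", "low"],
})

toy_encoded = pd.get_dummies(toy_binned)

# Every new column is named <original column>_<value>.
print(list(toy_encoded.columns))

# Each row has exactly one active indicator per original column.
print(toy_encoded.sum(axis=1).tolist())
```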

Applying ML Approaches¶

Decision Tree
also have a look at the possible parameters in the documentation of DecisionTreeClassifier

from sklearn import tree
decision_tree = tree.DecisionTreeClassifier()
decision_tree

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

Plot the decision tree¶

first install two additional packages (run both commands in a console)
- conda install -c conda-forge graphviz
- pip install graphviz

decision_tree = tree.DecisionTreeClassifier(max_depth=2)  # max_depth=2 keeps the tree small enough to read
decision_tree.fit(iris_binned_and_encoded, iris['Name'])

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

# run the following two commands in console:
# conda install -c conda-forge graphviz
# pip install graphviz

import graphviz 
from sklearn.utils.multiclass import unique_labels

dot_data = tree.export_graphviz(decision_tree,
                         feature_names=iris_binned_and_encoded.columns.values,
                         class_names=unique_labels(iris['Name']),  
                         filled=True, rounded=True,special_characters=True,out_file=None)
graphviz.Source(dot_data)

Get the number of nodes in the tree¶

decision_tree.tree_.node_count

7

Model evaluation¶

Train test split¶

from sklearn.model_selection import train_test_split
data_train, data_test, target_train, target_test = train_test_split(
    iris_binned_and_encoded, iris['Name'],test_size=0.2, random_state=42, stratify=iris['Name'])
print(data_train.head())
print(target_train.head())

     SepalLength_high  SepalLength_low  SepalLength_middle  SepalWidth_high  \
8                   0                1                   0                0   
106                 0                1                   0                0   
76                  1                0                   0                0   
9                   0                1                   0                0   
89                  0                1                   0                0   

     SepalWidth_low  SepalWidth_middle  
8                 0                  1  
106               1                  0  
76                1                  0  
9                 0                  1  
89                1                  0  
8          Iris-setosa
106     Iris-virginica
76     Iris-versicolor
9          Iris-setosa
89     Iris-versicolor
Name: Name, dtype: object
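
The stratify argument asks train_test_split to keep the class proportions the same in both parts. A minimal sketch on made-up data (40 samples of class "a", 10 of class "b"); the names features and labels are chosen here for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80% of the samples belong to "a", 20% to "b".
features = [[i] for i in range(50)]
labels = ["a"] * 40 + ["b"] * 10

_, _, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels)

# Both parts should show the same 80/20 ratio.
print(Counter(train_labels))
print(Counter(test_labels))
```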

Cross Validation¶

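also have a look at the documentation about cross validation
to compute cross-validated metrics, use the cross_val_score function
different possibilities for scoring (overview)
- "accuracy"
- "precision"
- "recall"
- "roc_auc"
note that "precision", "recall" and "roc_auc" assume a binary target by default; for a multi-class target such as iris, use an averaged variant like "precision_macro" or "recall_macro"
the return value is an array with one score per fold (partition)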

from sklearn.model_selection import cross_val_score
accuracy_iris = cross_val_score(decision_tree, iris_binned_and_encoded, iris['Name'], cv=10, scoring='accuracy')
accuracy_iris

array([ 0.6       ,  0.86666667,  0.66666667,  0.66666667,  0.8       ,
        0.73333333,  0.73333333,  0.8       ,  0.73333333,  0.66666667])

Average the scores to get one value¶

accuracy_iris.mean()

0.72666666666666668
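
To make "one score per fold" more concrete, here is a minimal sketch of how k-fold cross validation partitions the sample indices. It uses only NumPy; the function name k_fold_indices is made up here, and it deliberately ignores the shuffling and stratification that scikit-learn offers.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    folds = np.array_split(np.arange(n_samples), n_folds)
    for i, test_indices in enumerate(folds):
        # All folds except the i-th one form the training set.
        train_indices = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_indices, test_indices

splits = list(k_fold_indices(n_samples=10, n_folds=5))
for train_indices, test_indices in splits:
    print(train_indices, test_indices)
```

Every sample lands in the test part of exactly one fold, so each of the scores is computed on data the model has not seen during that fit.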

Stratified¶

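passing cv=10 together with a classifier already makes cross_val_score use stratified folds, but without shuffling; if you also want to shuffle the data and set the random seed you can pass a StratifiedKFold object as follows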

from sklearn.model_selection import StratifiedKFold

cross_val = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
acc_each_split = cross_val_score(decision_tree, iris_binned_and_encoded, iris['Name'], cv=cross_val, scoring='accuracy')
acc_each_split.mean()

0.72666666666666668
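
What "stratified" means can be checked by counting the classes in the test part of every fold. A sketch on a made-up target with three balanced classes (toy_target and toy_data are hypothetical names):

```python
from collections import Counter
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy target: 3 classes with 10 samples each.
toy_target = np.array(["a"] * 10 + ["b"] * 10 + ["c"] * 10)
toy_data = np.zeros((30, 1))  # the feature values are irrelevant for the split itself

cross_val = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_counts = [Counter(toy_target[test_indices].tolist())
               for _, test_indices in cross_val.split(toy_data, toy_target)]

# Every test fold should contain the same number of samples per class.
for counts in fold_counts:
    print(counts)
```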

Obtaining predictions by cross-validation¶

from sklearn.model_selection import cross_val_predict
predicted = cross_val_predict(decision_tree, iris_binned_and_encoded, iris['Name'], cv=10)
print(predicted)

['Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-setosa' 'Iris-setosa' 'Iris-versicolor'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-versicolor' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-virginica' 'Iris-versicolor' 'Iris-virginica'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-setosa'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-virginica' 'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor'
 'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-virginica' 'Iris-virginica' 'Iris-versicolor' 'Iris-virginica'
 'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor'
 'Iris-virginica' 'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor']
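
The predictions can be compared with the true labels, for instance to count which classes get confused with each other. A sketch in plain Python; the two short label lists are made up and stand in for iris['Name'] and predicted.

```python
from collections import Counter

# Made-up labels for illustration only.
actual = ["setosa", "setosa", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "versicolor", "versicolor", "versicolor", "virginica"]

# Count every (actual, predicted) pair: a confusion matrix in dictionary form.
confusion = Counter(zip(actual, predicted))

# Accuracy is the share of pairs in which both labels agree.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

print(confusion)
print(accuracy)
```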

Cross-validation iterators¶

if you want to iterate over each fold with a for loop you can do this with the following snippet

# indexing with an array of positions needs the raw NumPy arrays, not the pandas objects (access them with .values)
data = iris_binned_and_encoded.values
target = iris['Name'].values

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

for train_indices, test_indices in cv.split(data, target):
    train_data = data[train_indices]
    train_target = target[train_indices]
    
    decision_tree.fit(train_data, train_target)

    test_data = data[test_indices]
    test_target = target[test_indices]
    
    test_prediction = decision_tree.predict(test_data)

Increase the number of examples in a training set¶

you can use the np.append function to add more training indices

import numpy as np

for train_indices, test_indices in cv.split(data, target):
    train_indices = np.append(train_indices, np.flatnonzero(np.asarray(target) == 'Iris-setosa'))  # add the positions of all setosa examples
    print(train_indices[:10])

[0 1 2 3 4 5 6 7 8 9]
[0 1 2 3 4 5 6 7 8 9]
[ 0  1  2  5  6  7  9 10 11 13]
[ 0  1  2  3  4  5  7  8  9 10]
[ 0  1  2  3  4  5  6  7  8 10]
[ 1  2  3  4  6  7  8  9 10 11]
[ 0  2  3  4  5  6  7  8  9 10]
[ 0  1  3  4  5  6  7  8  9 10]
[0 1 2 3 4 5 6 7 8 9]
[ 0  1  2  3  4  5  6  8  9 10]
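
Two properties of np.append are worth keeping in mind here: it returns a new array instead of changing its input, and it keeps duplicates. The sketch below uses small made-up index arrays. It also shows one possible way, assumed here and not part of the original notebook, to leave out indices that belong to the test fold via np.setdiff1d, so that test examples do not end up in the training set.

```python
import numpy as np

train_indices = np.array([0, 1, 2, 5])
test_indices = np.array([7, 8])
extra_indices = np.array([2, 7])

# np.append does not modify its input; it returns a new, longer array.
extended = np.append(train_indices, extra_indices)

# Drop the extra indices that are part of the test fold before appending.
safe_extra = np.setdiff1d(extra_indices, test_indices)
extended_safe = np.append(train_indices, safe_extra)

print(train_indices)   # unchanged
print(extended)        # index 2 appears twice, test index 7 is included
print(extended_safe)   # index 2 appears twice, test index 7 is left out
```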

Load Arff and postprocess¶

from scipy.io import arff
credit_arff_data, credit_arff_meta = arff.loadarff('credit-g.arff')  # loadarff also accepts a file name and then closes the file itself
credit = pd.DataFrame(credit_arff_data)
credit.head()

Postprocess¶

the nominal values in the loaded dataset are bytes objects; to decode them into text you can use the following snippet

credit_target = np.array([x.decode('ascii') for x in credit['class'].values])  # list comprehension: decode every label
print(credit_target[:10])

credit_data = pd.get_dummies(credit.drop('class', axis=1))  # drop returns all columns except the given one (here the target column)
credit_data.head()

['good' 'bad' 'good' 'good' 'bad' 'good' 'good' 'good' 'good' 'bad']
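
The same decoding can be applied to every bytes column of the data frame before calling get_dummies, which would give column names like housing_own instead of housing_b'own'. A sketch on a hypothetical miniature of the credit data (toy_credit is made up; the real file has many more columns):

```python
import pandas as pd

# Hypothetical miniature of the loaded ARFF data: nominal columns hold bytes.
toy_credit = pd.DataFrame({
    "housing": [b"own", b"rent"],
    "duration": [6.0, 48.0],
    "class": [b"good", b"bad"],
})

decoded = toy_credit.copy()
for column in decoded.columns:
    # Only columns whose values are bytes objects need decoding.
    if isinstance(decoded[column].iloc[0], bytes):
        decoded[column] = decoded[column].str.decode("utf-8")

print(decoded)
```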

Plot ROC curve¶

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

knn_estimator = KNeighborsClassifier(3)

data_train, data_test, target_train, target_test = train_test_split(credit_data, credit_target)
knn_estimator.fit(data_train, target_train)
proba_for_each_class = knn_estimator.predict_proba(data_test)  # roc_curve needs scores, not hard predictions, so use predict_proba (or decision_function where available)

fpr, tpr, thresholds = roc_curve(target_test, proba_for_each_class[:,1], pos_label='good')  # column 1 belongs to knn_estimator.classes_[1], here 'good'

plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', label='Luck', alpha=.8) # draw diagonal
plt.plot(fpr,tpr,label='K-NN')

plt.legend()
plt.show()
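
A ROC curve is often summarised by the area under it (AUC). A tiny sketch with four made-up examples shows roc_curve and auc without needing the credit data; toy_target and toy_scores are hypothetical.

```python
from sklearn.metrics import roc_curve, auc

# Made-up true labels and predicted probabilities for the class "good".
toy_target = ["bad", "bad", "good", "good"]
toy_scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(toy_target, toy_scores, pos_label="good")
toy_auc = auc(fpr, tpr)

print(fpr)
print(tpr)
print(toy_auc)
```

In three of the four possible good/bad pairs the "good" example receives the higher score, which corresponds to an AUC of 0.75.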

Average ROC curves (cross validation)¶

here is a snippet of python code you can use

from numpy import interp  # scipy.interp was only an alias of numpy.interp and has been removed from recent SciPy releases
from sklearn.metrics import roc_curve, auc

def avg_roc(cv, estimator, data, target, pos_label):
    mean_fpr = np.linspace(0, 1, 100)  # 100 evenly spaced values from 0.0 to 1.0 (step 1/99, about 0.0101)
    tprs = []
    aucs = []    
    for train_indices, test_indices in cv.split(data, target):
        train_data, train_target = data[train_indices], target[train_indices]
        estimator.fit(train_data, train_target)
        
        test_data, test_target = data[test_indices], target[test_indices]
        decision_for_each_class = estimator.predict_proba(test_data)#have to use predict_proba or decision_function 
    
        fpr, tpr, thresholds = roc_curve(test_target, decision_for_each_class[:,1], pos_label=pos_label)
        tprs.append(interp(mean_fpr, fpr, tpr))
        tprs[-1][0] = 0.0 # tprs[-1] access the last element
        aucs.append(auc(fpr, tpr))        
        #plt.plot(fpr, tpr)# plot for each fold
        
    mean_tpr = np.mean(tprs, axis=0)
    mean_tpr[-1] = 1.0 # set the last tpr to 1
    mean_auc = auc(mean_fpr, mean_tpr)
    std_auc = np.std(aucs)    
    return mean_fpr, mean_tpr, mean_auc, std_auc
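
The key step in avg_roc is the interpolation: every fold yields a ROC curve with its own fpr values, so the tpr values are first moved onto a common grid before they are averaged. A minimal sketch of that step with np.interp on a made-up three-point curve:

```python
import numpy as np

# A made-up ROC curve with only three points.
fpr = np.array([0.0, 0.5, 1.0])
tpr = np.array([0.0, 0.8, 1.0])

# Common grid of false positive rates shared by all folds.
mean_fpr = np.linspace(0, 1, 5)  # [0.0, 0.25, 0.5, 0.75, 1.0]

# Linear interpolation of the tpr at each grid point.
tpr_on_grid = np.interp(mean_fpr, fpr, tpr)

print(tpr_on_grid)
```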

Using the function to plot a ROC curve¶

knn_estimator = KNeighborsClassifier(3)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

mean_fpr, mean_tpr, mean_auc, std_auc = avg_roc(cv, knn_estimator, credit_data.values, credit_target, 'good')

plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', label='Luck', alpha=.8) # draw diagonal
plt.plot(mean_fpr,mean_tpr,label='K-NN')

plt.legend()
plt.show()

Add noise to the labels¶

not included in scikit-learn, NumPy or SciPy
- thus a custom approach

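import random
from sklearn.utils.multiclass import unique_labels

def add_noise(raw_target, percentage):
    """Replace roughly `percentage` percent of the labels with a different label."""
    labels = unique_labels(raw_target)
    target_with_noise = []
    for one_target_label in raw_target:
        # randint(1, 100) is uniform on 1..100, so this holds with probability percentage/100
        if random.randint(1, 100) <= percentage:
            # take the first label that differs from the original one
            # (with two classes this is simply the other class)
            target_with_noise.append(next(l for l in labels if l != one_target_label))
        else:
            target_with_noise.append(one_target_label)
    return target_with_noise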
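
Because add_noise draws from the global random generator, every run flips different labels. The following variant is a sketch under two assumptions that go beyond the notebook: it takes a seed so that results can be reproduced, and it picks the replacement uniformly among all other labels, which matters once there are more than two classes. The name add_noise_seeded is made up.

```python
import random

def add_noise_seeded(raw_target, percentage, seed=0):
    """Variant of add_noise that uses its own seeded random generator."""
    rng = random.Random(seed)  # private generator, leaves the global one untouched
    labels = sorted(set(raw_target))
    noisy = []
    for label in raw_target:
        if rng.randint(1, 100) <= percentage:
            # Pick uniformly among all labels that differ from the current one.
            noisy.append(rng.choice([l for l in labels if l != label]))
        else:
            noisy.append(label)
    return noisy

toy_target = ["good", "bad", "good", "good", "bad"] * 20
first = add_noise_seeded(toy_target, 10, seed=42)
second = add_noise_seeded(toy_target, 10, seed=42)
print(first == second)  # the same seed gives the same noisy labels
```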

Apply the noise function¶

credit_target_with_noise = add_noise(credit_target, 10)
for i in range(20):
    print("{:10} - {:10} - {}".format(credit_target[i], credit_target_with_noise[i], credit_target[i]==credit_target_with_noise[i]))

good       - good       - True
bad        - bad        - True
good       - good       - True
good       - good       - True
bad        - bad        - True
good       - good       - True
good       - good       - True
good       - good       - True
good       - good       - True
bad        - bad        - True
bad        - bad        - True
bad        - bad        - True
good       - good       - True
bad        - bad        - True
good       - good       - True
bad        - bad        - True
good       - good       - True
good       - good       - True
bad        - bad        - True
good       - good       - True

Output of credit.head() (section "Load Arff and postprocess"):

	checking_status	duration	credit_history	purpose	credit_amount	savings_status	employment	installment_commitment	personal_status	other_parties	...	property_magnitude	age	other_payment_plans	housing	existing_credits	job	num_dependents	own_telephone	foreign_worker	class
0	b"'<0'"	6.0	b"'critical/other existing credit'"	b'radio/tv'	1169.0	b"'no known savings'"	b"'>=7'"	4.0	b"'male single'"	b'none'	...	b"'real estate'"	67.0	b'none'	b'own'	2.0	b'skilled'	1.0	b'yes'	b'yes'	b'good'
1	b"'0<=X<200'"	48.0	b"'existing paid'"	b'radio/tv'	5951.0	b"'<100'"	b"'1<=X<4'"	2.0	b"'female div/dep/mar'"	b'none'	...	b"'real estate'"	22.0	b'none'	b'own'	1.0	b'skilled'	1.0	b'none'	b'yes'	b'bad'
2	b"'no checking'"	12.0	b"'critical/other existing credit'"	b'education'	2096.0	b"'<100'"	b"'4<=X<7'"	2.0	b"'male single'"	b'none'	...	b"'real estate'"	49.0	b'none'	b'own'	1.0	b"'unskilled resident'"	2.0	b'none'	b'yes'	b'good'
3	b"'<0'"	42.0	b"'existing paid'"	b'furniture/equipment'	7882.0	b"'<100'"	b"'4<=X<7'"	2.0	b"'male single'"	b'guarantor'	...	b"'life insurance'"	45.0	b'none'	b"'for free'"	1.0	b'skilled'	2.0	b'none'	b'yes'	b'good'
4	b"'<0'"	24.0	b"'delayed previously'"	b"'new car'"	4870.0	b"'<100'"	b"'1<=X<4'"	3.0	b"'male single'"	b'none'	...	b"'no known property'"	53.0	b'none'	b"'for free'"	2.0	b'skilled'	2.0	b'none'	b'yes'	b'bad'

Output of credit_data.head() (section "Postprocess"):

	duration	credit_amount	installment_commitment	residence_since	age	existing_credits	num_dependents	checking_status_b"'0<=X<200'"	checking_status_b"'<0'"	...	housing_b'own'	job_b"'unskilled resident'"	job_b'skilled'	own_telephone_b'none'	own_telephone_b'yes'	foreign_worker_b'yes'
0	6.0	1169.0	4.0	4.0	67.0	2.0	1.0	0	1	...	1	0	1	0	1	1
1	48.0	5951.0	2.0	2.0	22.0	1.0	1.0	1	0	...	1	0	1	1	0	1
2	12.0	2096.0	2.0	3.0	49.0	1.0	2.0	0	0	...	1	1	0	1	0	1
3	42.0	7882.0	2.0	4.0	45.0	1.0	2.0	0	1	...	0	0	1	1	0	1
4	24.0	4870.0	3.0	4.0	53.0	2.0	2.0	0	1	...	0	0	1	1	0	1