Python Functions

Functions are defined with the keyword "def", followed by the function's name, parentheses, and a colon.

For example:

def my_function():
    print("Hello From My Function!")

Calling this function is done by:

my_function()
Hello From My Function!

Arguments for Functions

def my_function_with_args(username, greeting):
    print("Hello, {}, I wish you {}".format(username, greeting))

my_function_with_args("Mike", "good day")
Hello, Mike, I wish you good day

Return values

Use the return statement whenever the function should hand a value back to its caller

def sum_two_numbers(a, b):
    return a + b
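
As a small usage sketch (the variable name total is only illustrative), the returned value can be stored in a variable or used directly in another expression:

```python
def sum_two_numbers(a, b):
    return a + b

total = sum_two_numbers(3, 4)      # store the returned value
print(total)                       # prints 7
print(sum_two_numbers(total, 1))   # prints 8
```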


Multiple return values

Several values can be returned at once by packing them into a tuple

def sum_and_diff(a, b):
    return (a+b, a-b) # the brackets here are not necessary, one can also write return a+b, a-b

my_tuple = sum_and_diff(4,6)
print(my_tuple) # prints (10, -2)
print(my_tuple[0]) # prints 10
print(my_tuple[1]) # prints -2
(10, -2)
10
-2

Assign return values directly

(my_sum, my_diff) = sum_and_diff(4,6) # here also the brackets are not essential
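
A minimal sketch of what the direct assignment produces. The name only_diff and the use of "_" for an ignored value are illustrative conventions, not part of the original notes:

```python
def sum_and_diff(a, b):
    return a + b, a - b

my_sum, my_diff = sum_and_diff(4, 6)   # tuple unpacking without parentheses
print(my_sum)    # prints 10
print(my_diff)   # prints -2

_, only_diff = sum_and_diff(4, 6)      # "_" is a common name for a value that is not needed
print(only_diff) # prints -2
```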


Dictionaries

  • like a map (maps keys to values)
  • initialised by dict() or {}
population = {} # or population = dict()
population["Mannheim"] = 305780
population["Ludwigshafen"] = 164718
population["Heidelberg"] = 156267
{'Mannheim': 305780, 'Ludwigshafen': 164718, 'Heidelberg': 156267}

Direct initialisation

population = {
    "Mannheim" : 305780,
    "Ludwigshafen" : 164718,
    "Heidelberg" : 156267
}
{'Mannheim': 305780, 'Ludwigshafen': 164718, 'Heidelberg': 156267}

Iterating over dictionaries

for name, count in population.items():
    print("The population of {} is {}".format(name, count))
The population of Mannheim is 305780
The population of Ludwigshafen is 164718
The population of Heidelberg is 156267

Removing elements

del population["Heidelberg"]
{'Mannheim': 305780, 'Ludwigshafen': 164718}
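
A few related operations are often useful next to del. The following is a sketch only; the city "Speyer" is a made-up key that is deliberately missing from the dictionary:

```python
population = {"Mannheim": 305780, "Ludwigshafen": 164718, "Heidelberg": 156267}

print("Mannheim" in population)       # prints True (membership test on the keys)
print(population.get("Speyer", 0))    # prints 0 (default value, because the key is missing)

# pop removes a key like del, but returns the value and accepts a default
removed = population.pop("Heidelberg", None)
print(removed)                        # prints 156267
print(len(population))                # prints 2
```

Unlike del, pop with a default does not raise a KeyError when the key is absent.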


Binning

  • not directly included in scikit-learn (the examples below use pandas)

Equal-width binning

import pandas as pd
items = [0,4,12,16,16,18,24,26,28]
pd.cut(items, bins=3)
[(-0.028, 9.333], (-0.028, 9.333], (9.333, 18.667], (9.333, 18.667], (9.333, 18.667], (9.333, 18.667], (18.667, 28.0], (18.667, 28.0], (18.667, 28.0]]
Categories (3, interval[float64]): [(-0.028, 9.333] < (9.333, 18.667] < (18.667, 28.0]]

Equal-frequency binning

pd.qcut(items, q=3)
[(-0.001, 14.667], (-0.001, 14.667], (-0.001, 14.667], (14.667, 20.0], (14.667, 20.0], (14.667, 20.0], (20.0, 28.0], (20.0, 28.0], (20.0, 28.0]]
Categories (3, interval[float64]): [(-0.001, 14.667] < (14.667, 20.0] < (20.0, 28.0]]
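
To make the difference between the two strategies visible, one can count how many items land in each bin. This is a sketch under the assumption that labels=False (which returns bin indices instead of intervals) is acceptable for the purpose:

```python
import numpy as np
import pandas as pd

items = [0, 4, 12, 16, 16, 18, 24, 26, 28]

# labels=False returns the bin index (0, 1, 2) of each item instead of an interval
width_bins = pd.cut(items, bins=3, labels=False)
freq_bins = pd.qcut(items, q=3, labels=False)

print(np.bincount(width_bins))  # items per equal-width bin: [2 4 3]
print(np.bincount(freq_bins))   # items per equal-frequency bin: [3 3 3]
```

Equal-width bins cover ranges of the same size but may hold different numbers of items, while equal-frequency bins hold (roughly) the same number of items.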

Apply it to a dataset

iris = pd.read_csv("iris.csv")
pd.cut(iris['SepalLength'], bins=3, labels=['low', 'middle', 'high'])
120      high
Categories (3, object): [high < low < middle]

Create a binned dataset

  • idea:
    • create a new dataframe
    • initialise it with processed values (as dict)
iris_binned = pd.DataFrame(dict(
    SepalLength = pd.cut(iris['SepalLength'], bins=3, labels=['low', 'middle', 'high']),
    SepalWidth = pd.cut(iris['SepalWidth'], bins=3, labels=['low', 'middle', 'high'])
))
SepalLength SepalWidth
0 low middle
1 low middle
2 low middle
3 low middle
4 low middle

Another way of encoding (third possibility)

  • pandas also has a way of one-hot encoding the values
  • differs from the category_encoders one-hot encoder in that it produces nicer column names
  • use the function get_dummies
iris_binned_and_encoded = pd.get_dummies(iris_binned)
SepalLength_high SepalLength_low SepalLength_middle SepalWidth_high SepalWidth_low SepalWidth_middle
0 0 1 0 0 0 1
1 0 1 0 0 0 1
2 0 1 0 0 0 1
3 0 1 0 0 0 1
4 0 1 0 0 0 1
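
The naming scheme of get_dummies can be seen on a tiny example. The frame toy and its column size are hypothetical; dtype=int is passed because recent pandas versions otherwise return True/False instead of 0/1:

```python
import pandas as pd

toy = pd.DataFrame({"size": ["low", "high", "middle", "low"]})  # hypothetical toy column

# each distinct value becomes its own column named <column>_<value>
encoded = pd.get_dummies(toy, dtype=int)

print(list(encoded.columns))          # ['size_high', 'size_low', 'size_middle']
print(encoded.sum(axis=1).tolist())   # [1, 1, 1, 1]: exactly one column is set per row
```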

Applying ML Approaches

from sklearn import tree
decision_tree = tree.DecisionTreeClassifier()
decision_tree.fit(iris_binned_and_encoded, iris['Name'])
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

Plot the decision tree

  • first install two additional packages (run both commands in a console)
    • conda install -c conda-forge graphviz
    • pip install graphviz
decision_tree = tree.DecisionTreeClassifier(max_depth=2)  # max_depth=2 to keep the plotted tree small
decision_tree.fit(iris_binned_and_encoded, iris['Name'])
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
# run the following two commands in console:
# conda install -c conda-forge graphviz
# pip install graphviz

import graphviz 
from sklearn.utils.multiclass import unique_labels

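dot_data = tree.export_graphviz(decision_tree,
                         feature_names=list(iris_binned_and_encoded.columns), class_names=list(unique_labels(iris['Name'])),
                         filled=True, rounded=True, special_characters=True, out_file=None)
graphviz.Source(dot_data)  # renders the tree when it is the last expression in a notebook cell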
[Figure: rendered decision tree. Root node: SepalLength_low ≤ 0.5, gini = 0.6667, samples = 150, value = [50, 50, 50], class = Iris-setosa]

Get the number of nodes in the tree
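
A minimal sketch of how the node count can be read from a fitted tree. It assumes the iris data bundled with scikit-learn as a stand-in for iris.csv, and the name small_tree is illustrative:

```python
from sklearn import tree
from sklearn.datasets import load_iris

# Assumption: the bundled iris data replaces the binned iris.csv used in these notes
features, labels = load_iris(return_X_y=True)

small_tree = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(features, labels)

n_nodes = small_tree.tree_.node_count   # inner nodes plus leaves
n_leaves = small_tree.get_n_leaves()    # leaves only
print(n_nodes, n_leaves)
```

The tree_ attribute is only available after fit has been called.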

Model evaluation

Train test split

from sklearn.model_selection import train_test_split
data_train, data_test, target_train, target_test = train_test_split(
    iris_binned_and_encoded, iris['Name'],test_size=0.2, random_state=42, stratify=iris['Name'])
     SepalLength_high  SepalLength_low  SepalLength_middle  SepalWidth_high  \
8                   0                1                   0                0   
106                 0                1                   0                0   
76                  1                0                   0                0   
9                   0                1                   0                0   
89                  0                1                   0                0   

     SepalWidth_low  SepalWidth_middle  
8                 0                  1  
106               1                  0  
76                1                  0  
9                 0                  1  
89                1                  0  
8          Iris-setosa
106     Iris-virginica
76     Iris-versicolor
9          Iris-setosa
89     Iris-versicolor
Name: Name, dtype: object

Cross Validation

  • also have a look at documentation about cross validation
  • for computing cross-validated metrics use cross_val_score function
  • different possibilities for scoring (overview)
    • "accuracy"
    • "precision"
    • "recall"
    • "roc_auc"
  • the return value will be the score for each fold (partition)
from sklearn.model_selection import cross_val_score
accuracy_iris = cross_val_score(decision_tree, iris_binned_and_encoded, iris['Name'], cv=10, scoring='accuracy')
array([ 0.6       ,  0.86666667,  0.66666667,  0.66666667,  0.8       ,
        0.73333333,  0.73333333,  0.8       ,  0.73333333,  0.66666667])

Average the scores to get one value
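
Since cross_val_score returns a NumPy array, averaging is a one-liner. The sketch below reuses the fold scores printed above rather than recomputing them:

```python
import numpy as np

# fold scores copied from the cross_val_score output above
accuracy_iris = np.array([0.6, 0.86666667, 0.66666667, 0.66666667, 0.8,
                          0.73333333, 0.73333333, 0.8, 0.73333333, 0.66666667])

mean_accuracy = accuracy_iris.mean()   # one value summarising all folds
print("accuracy: {:.3f} +/- {:.3f}".format(mean_accuracy, accuracy_iris.std()))
```

Reporting the standard deviation next to the mean gives a rough idea of how much the score varies between folds.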

  • if you want a stratified version (and also set the random seed) you can do the following
from sklearn.model_selection import StratifiedKFold

cross_val = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
acc_each_split = cross_val_score(decision_tree, iris_binned_and_encoded, iris['Name'], cv=cross_val, scoring='accuracy')

Obtaining predictions by cross-validation

from sklearn.model_selection import cross_val_predict
predicted = cross_val_predict(decision_tree, iris_binned_and_encoded, iris['Name'], cv=10)
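
The predictions can then be compared with the true labels. A sketch, assuming the iris data bundled with scikit-learn instead of the binned frame used above:

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict

features, labels = load_iris(return_X_y=True)   # assumption: stand-in for iris.csv
decision_tree = tree.DecisionTreeClassifier(max_depth=2, random_state=0)

# every sample is predicted by a model that did not see it during training
predicted = cross_val_predict(decision_tree, features, labels, cv=10)

matrix = confusion_matrix(labels, predicted)   # rows: true class, columns: predicted class
print(accuracy_score(labels, predicted))
print(matrix)
```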

Cross-validation iterators

  • if you want to iterate over each fold with a for loop you can do this with the following snippet
# sometimes you have to use the raw array and not the pandas dataframe (access it with .values)
data = iris_binned_and_encoded.values 
target = iris['Name']

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

for train_indices, test_indices in cv.split(data, target):
    train_data = data[train_indices]
    train_target = target[train_indices]
    decision_tree.fit(train_data, train_target)

    test_data = data[test_indices]
    test_target = target[test_indices]
    test_prediction = decision_tree.predict(test_data)
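
What the iterator yields can be checked on toy data. The arrays below are hypothetical; the sketch only verifies that train and test indices never overlap and that each sample is tested exactly once:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# hypothetical toy data: 20 samples, two balanced classes
data = np.arange(40).reshape(20, 2)
target = np.array(["a"] * 10 + ["b"] * 10)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

times_tested = np.zeros(len(target), dtype=int)
for train_indices, test_indices in cv.split(data, target):
    assert len(np.intersect1d(train_indices, test_indices)) == 0  # folds never overlap
    times_tested[test_indices] += 1

print(times_tested)   # every entry is 1: each sample is tested exactly once
```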

Increase the number of examples in a training set

  • you can use the np.append function to add more training indices
import numpy as np

for train_indices, test_indices in cv.split(data, target):
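    # np.flatnonzero also works when target is a pandas Series (Series.nonzero was removed from pandas)
    train_indices = np.append(train_indices, np.flatnonzero(target == 'Iris-setosa'))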
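
How the index arithmetic works can be seen on a four-element toy array (the values below are made up for illustration):

```python
import numpy as np

target = np.array(["Iris-setosa", "Iris-virginica", "Iris-setosa", "Iris-versicolor"])
train_indices = np.array([1, 3])

setosa_indices = (target == "Iris-setosa").nonzero()[0]   # positions where the mask is True
print(setosa_indices)                                     # prints [0 2]

train_indices = np.append(train_indices, setosa_indices)
print(train_indices)                                      # prints [1 3 0 2]
```

Note that this appends every matching index. If some of them also belong to the current test fold, the evaluation is no longer independent, so filtering with np.setdiff1d(setosa_indices, test_indices) may be the safer choice.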

Load Arff and postprocess

from scipy.io import arff
credit_arff_data, credit_arff_meta = arff.loadarff(open('credit-g.arff', 'r'))
credit = pd.DataFrame(credit_arff_data)
checking_status duration credit_history purpose credit_amount savings_status employment installment_commitment personal_status other_parties ... property_magnitude age other_payment_plans housing existing_credits job num_dependents own_telephone foreign_worker class
0 b"'<0'" 6.0 b"'critical/other existing credit'" b'radio/tv' 1169.0 b"'no known savings'" b"'>=7'" 4.0 b"'male single'" b'none' ... b"'real estate'" 67.0 b'none' b'own' 2.0 b'skilled' 1.0 b'yes' b'yes' b'good'
1 b"'0<=X<200'" 48.0 b"'existing paid'" b'radio/tv' 5951.0 b"'<100'" b"'1<=X<4'" 2.0 b"'female div/dep/mar'" b'none' ... b"'real estate'" 22.0 b'none' b'own' 1.0 b'skilled' 1.0 b'none' b'yes' b'bad'
2 b"'no checking'" 12.0 b"'critical/other existing credit'" b'education' 2096.0 b"'<100'" b"'4<=X<7'" 2.0 b"'male single'" b'none' ... b"'real estate'" 49.0 b'none' b'own' 1.0 b"'unskilled resident'" 2.0 b'none' b'yes' b'good'
3 b"'<0'" 42.0 b"'existing paid'" b'furniture/equipment' 7882.0 b"'<100'" b"'4<=X<7'" 2.0 b"'male single'" b'guarantor' ... b"'life insurance'" 45.0 b'none' b"'for free'" 1.0 b'skilled' 2.0 b'none' b'yes' b'good'
4 b"'<0'" 24.0 b"'delayed previously'" b"'new car'" 4870.0 b"'<100'" b"'1<=X<4'" 3.0 b"'male single'" b'none' ... b"'no known property'" 53.0 b'none' b"'for free'" 2.0 b'skilled' 2.0 b'none' b'yes' b'bad'

5 rows × 21 columns


  • the text values of the loaded dataset are byte strings; to decode them you can use the following snippet
credit_target = np.array([x.decode('ascii') for x in credit['class'].values]) # "inline" for loop with []

credit_data = pd.get_dummies(credit.drop('class', axis=1)) # drop can be used to get all columns but removing one (the target column)
['good' 'bad' 'good' 'good' 'bad' 'good' 'good' 'good' 'good' 'bad']
duration credit_amount installment_commitment residence_since age existing_credits num_dependents checking_status_b"'0<=X<200'" checking_status_b"'<0'" checking_status_b"'>=200'" ... housing_b'own' housing_b'rent' job_b"'high qualif/self emp/mgmt'" job_b"'unemp/unskilled non res'" job_b"'unskilled resident'" job_b'skilled' own_telephone_b'none' own_telephone_b'yes' foreign_worker_b'no' foreign_worker_b'yes'
0 6.0 1169.0 4.0 4.0 67.0 2.0 1.0 0 1 0 ... 1 0 0 0 0 1 0 1 0 1
1 48.0 5951.0 2.0 2.0 22.0 1.0 1.0 1 0 0 ... 1 0 0 0 0 1 1 0 0 1
2 12.0 2096.0 2.0 3.0 49.0 1.0 2.0 0 0 0 ... 1 0 0 0 1 0 1 0 0 1
3 42.0 7882.0 2.0 4.0 45.0 1.0 2.0 0 1 0 ... 0 0 0 0 0 1 1 0 0 1
4 24.0 4870.0 3.0 4.0 53.0 2.0 2.0 0 1 0 ... 0 0 0 0 0 1 1 0 0 1

5 rows × 61 columns
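
The decoding step in isolation, as a sketch on a few hand-written byte strings that mimic the loaded values. Stripping the extra single quotes is an optional addition, not something the snippet above does:

```python
raw_values = [b"'<0'", b"good", b"'no checking'"]   # byte strings as in the loaded frame

decoded = [x.decode("ascii") for x in raw_values]
print(decoded)    # prints ["'<0'", 'good', "'no checking'"]

cleaned = [s.strip("'") for s in decoded]   # optional: drop the surrounding quotes
print(cleaned)    # prints ['<0', 'good', 'no checking']
```

Decoding the feature columns before calling get_dummies would also avoid column names such as housing_b'own'.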

Plot ROC curve

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

knn_estimator = KNeighborsClassifier(3)

data_train, data_test, target_train, target_test = train_test_split(credit_data, credit_target)
knn_estimator.fit(data_train, target_train)
proba_for_each_class = knn_estimator.predict_proba(data_test)  # have to use predict_proba or decision_function

fpr, tpr, thresholds = roc_curve(target_test, proba_for_each_class[:,1], pos_label='good')#choose the second class

plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', label='Luck', alpha=.8) # draw diagonal
plt.plot(fpr, tpr, label='ROC')  # draw the ROC curve itself
plt.legend()
plt.show()
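
What roc_curve returns is easier to follow on a tiny example. The labels and scores below are hypothetical toy values, not taken from the credit dataset:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

true_labels = np.array(["bad", "bad", "good", "good"])   # hypothetical toy labels
good_scores = np.array([0.1, 0.4, 0.35, 0.8])            # predicted probability of "good"

fpr, tpr, thresholds = roc_curve(true_labels, good_scores, pos_label="good")
roc_auc = auc(fpr, tpr)

print(fpr)       # false positive rate at each threshold
print(tpr)       # true positive rate at each threshold
print(roc_auc)   # area under the curve: 0.75
```

Three of the four (good, bad) pairs are ranked correctly by the scores, which is where the value 0.75 comes from.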


Avg roc curves (cross validation)

  • here is a snippet of python code you can use
import numpy as np
from sklearn.metrics import roc_curve, auc

def avg_roc(cv, estimator, data, target, pos_label):
    mean_fpr = np.linspace(0, 1, 100)  # 100 evenly spaced points from 0.0 to 1.0
    tprs = []
    aucs = []
    for train_indices, test_indices in cv.split(data, target):
        train_data, train_target = data[train_indices], target[train_indices]
        estimator.fit(train_data, train_target)
        test_data, test_target = data[test_indices], target[test_indices]
        decision_for_each_class = estimator.predict_proba(test_data)  # have to use predict_proba or decision_function
        fpr, tpr, thresholds = roc_curve(test_target, decision_for_each_class[:,1], pos_label=pos_label)
        tprs.append(np.interp(mean_fpr, fpr, tpr))  # np.interp replaces the deprecated scipy.interp
        tprs[-1][0] = 0.0 # tprs[-1] access the last element
        aucs.append(auc(fpr, tpr))        
        #plt.plot(fpr, tpr)# plot for each fold
    mean_tpr = np.mean(tprs, axis=0)
    mean_tpr[-1] = 1.0 # set the last tpr to 1
    mean_auc = auc(mean_fpr, mean_tpr)
    std_auc = np.std(aucs)    
    return mean_fpr, mean_tpr, mean_auc, std_auc
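
The key step in the function is the interpolation onto a common FPR grid, because each fold produces its curve at different FPR points. A NumPy-only sketch with two made-up fold curves:

```python
import numpy as np

mean_fpr = np.linspace(0, 1, 5)   # common grid: [0.0, 0.25, 0.5, 0.75, 1.0]

# two hypothetical folds whose ROC curves are defined at different FPR points
fold_curves = [
    (np.array([0.0, 0.5, 1.0]), np.array([0.0, 1.0, 1.0])),
    (np.array([0.0, 0.25, 1.0]), np.array([0.0, 0.5, 1.0])),
]

# resample every curve on the common grid, then average point by point
tprs = [np.interp(mean_fpr, fpr, tpr) for fpr, tpr in fold_curves]
mean_tpr = np.mean(tprs, axis=0)
print(mean_tpr)
```

Without the resampling step the TPR arrays would have different lengths and could not be averaged element-wise.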

Using the function to plot a ROC curve

knn_estimator = KNeighborsClassifier(3)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

mean_fpr, mean_tpr, mean_auc, std_auc = avg_roc(cv, knn_estimator, credit_data.values, credit_target, 'good')

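plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', label='Luck', alpha=.8) # draw diagonal
plt.plot(mean_fpr, mean_tpr, label='Mean ROC (AUC = {:.2f} +/- {:.2f})'.format(mean_auc, std_auc))
plt.legend()
plt.show()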


Add noise to the labels

  • not included in scikit-learn, NumPy, or SciPy
    • hence a custom approach
import random 
from sklearn.utils.multiclass import unique_labels
def add_noise(raw_target, percentage):    
    labels = unique_labels(raw_target)
    target_with_noise = []
    for one_target_label in raw_target:
        if random.randint(1,100) <= percentage:
            target_with_noise.append(next(l for l in labels if l != one_target_label))
        else:
            target_with_noise.append(one_target_label)  # keep the original label
    return target_with_noise
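
A quick way to sanity-check such a function is to measure the fraction of changed labels. The sketch below is a standard-library variant: sorted(set(...)) stands in for unique_labels, and the rng parameter is an added assumption that makes the result reproducible:

```python
import random

def add_noise(raw_target, percentage, rng=random):
    labels = sorted(set(raw_target))   # stdlib stand-in for unique_labels
    target_with_noise = []
    for one_target_label in raw_target:
        if rng.randint(1, 100) <= percentage:
            # flip to the first label that differs from the original
            target_with_noise.append(next(l for l in labels if l != one_target_label))
        else:
            target_with_noise.append(one_target_label)
    return target_with_noise

toy_target = ["good", "bad"] * 500
noisy = add_noise(toy_target, 10, rng=random.Random(42))   # seeded for reproducibility
flipped = sum(a != b for a, b in zip(toy_target, noisy))
print(flipped / len(toy_target))   # fraction of changed labels, roughly 0.1
```

With only 20 samples, as in the printout below, it is quite possible that no label is flipped at all, so a larger sample gives a more reliable check.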

Apply the noise function

credit_target_with_noise = add_noise(credit_target, 10)
for i in range(20):
    print("{:10} - {:10} - {}".format(credit_target[i], credit_target_with_noise[i], credit_target[i]==credit_target_with_noise[i]))
good       - good       - True
bad        - bad        - True
good       - good       - True
good       - good       - True
bad        - bad        - True
good       - good       - True
good       - good       - True
good       - good       - True
good       - good       - True
bad        - bad        - True
bad        - bad        - True
bad        - bad        - True
good       - good       - True
bad        - bad        - True
good       - good       - True
bad        - bad        - True
good       - good       - True
good       - good       - True
bad        - bad        - True
good       - good       - True