Intro to scikit-learn

Machine learning: the problem setting

outlook   temperature  humidity  wind   play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cold         normal    false  yes
rainy     cold         normal    true   no
overcast  cold         normal    true   yes
sunny     mild         high      false  ?
  • a learning problem considers a set of n samples of data, each of which has m features (e.g. outlook, temperature, etc.; see the DataFrame sketch after this list)
  • it tries to predict properties of unknown data (the ? in the last row above)
  • we can separate learning problems into two categories:
    • supervised learning: predict additional attributes
      • classification: discrete values (e.g. play)
      • regression: continuous values (e.g. temperature in degrees Celsius)
    • unsupervised learning: e.g. find clusters of "similar" days
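
As a concrete illustration (a sketch, not part of the original notebook), the weather table above can be stored as a pandas DataFrame, with the play column as the classification target:

import pandas as pd

# the weather data from the table above; the last row has an unknown target
weather = pd.DataFrame({
    'outlook':     ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 'sunny'],
    'temperature': ['hot', 'hot', 'hot', 'mild', 'cold', 'cold', 'cold', 'mild'],
    'humidity':    ['high', 'high', 'high', 'high', 'normal', 'normal', 'normal', 'high'],
    'wind':        [False, True, False, False, False, True, True, False],
    'play':        ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', None],
})

X = weather.drop(columns='play')  # n samples x m features
y = weather['play']               # target for classification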

Structure of the data

  • the data is an array of shape (n_samples, n_features)
      [[  0.   0.   5. ...,   0.   0.   0.]
       [  0.   0.   0. ...,  10.   0.   0.]
       [  0.   0.   0. ...,  16.   9.   0.]
       ...,
       [  0.   0.   1. ...,   6.   0.   0.]
       [  0.   0.   2. ...,  12.   0.   0.]
       [  0.   0.  10. ...,  12.   1.   0.]]
  • in case of supervised learning, we also have a target variable of shape (n_samples,)
      [True, False, True, True, False, ...]

Learning and predicting (supervised)

  • two important methods:
    • fit(X, y): learn a model from the training data X and the targets y
    • predict(T): predict the targets of new data T
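A minimal supervised sketch (not part of the original notebook; the classifier choice and the toy data are assumptions):

from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]  # 4 samples, 2 features
y = [0, 0, 1, 1]                      # known class labels

clf = DecisionTreeClassifier()
clf.fit(X, y)                     # learn a model from the labelled samples
print(clf.predict([[0.9, 0.2]]))  # predict the class of unseen data -> [1]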

Clustering (unsupervised)

  • one important method:
    • fit_predict(X) (fits the model and then directly predicts the cluster of each sample)

Load data
In [1]:
import pandas as pd
customer_data = pd.read_excel('CustomerDataSet.xls')
customer_data.head()
Out[1]:
   Customer ID  ItemsBought  ItemsReturned  ZipCode  Product
0            4           45             10        2     1365
1            5           42             18        5     2764
2            6           50              0        1     1343
3            8           13             12        4     2435
4            9           10              7        3     2435

KMeans clustering:

  • import KMeans from sklearn.cluster
  • important parameter: n_clusters (the number of clusters)
  • the cluster number of each sample is available via the .labels_ attribute
In [2]:
from sklearn.cluster import KMeans
estimator = KMeans(n_clusters = 2)
labels = estimator.fit_predict(customer_data[['ItemsBought', 'ItemsReturned']])
print(labels)

# OR

estimator.fit(customer_data[['ItemsBought', 'ItemsReturned']])
print(estimator.labels_)
[0 0 0 1 1 0 0 0 1 0 0 0 0]
[1 1 1 0 0 1 1 1 0 1 1 1 1]
(both runs find the same grouping; only the numbering of the clusters differs, because KMeans starts from a random initialisation)

Plotting clusters:

For plotting the clusters, pass the cluster labels to the c parameter of plt.scatter, which colours each point by its cluster

In [3]:
import matplotlib.pyplot as plt

plt.title("KMeans #cluster = 2")
plt.xlabel('ItemsBought')
plt.ylabel('ItemsReturned')
plt.scatter(customer_data['ItemsBought'], customer_data['ItemsReturned'], c=estimator.labels_)
plt.show()

Calculating the silhouette score

  • measures how well each sample fits its own cluster compared to the next-closest cluster (range -1 to 1; higher is better)

In [4]:
from sklearn.metrics import silhouette_score
silhouette = silhouette_score(customer_data[['ItemsBought', 'ItemsReturned']], labels)
silhouette
Out[4]:
0.67031934343920685
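
A small sketch (not in the original notebook) that uses the score to compare several cluster counts on the same columns:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# higher silhouette = better separated clusters
for k in [2, 3, 4, 5]:
    labels_k = KMeans(n_clusters=k).fit_predict(customer_data[['ItemsBought', 'ItemsReturned']])
    print(k, silhouette_score(customer_data[['ItemsBought', 'ItemsReturned']], labels_k))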

Preprocess / Encoding

  • most machine learning algorithms can only work with numerical data
  • to "translate" text data into numerical data, one can use the LabelEncoder
In [5]:
print(customer_data[['Product', 'ZipCode']])

from sklearn import preprocessing
customer_data_encoded = customer_data[['Product', 'ZipCode']].apply(preprocessing.LabelEncoder().fit_transform)

print(customer_data_encoded[['Product', 'ZipCode']]) 
    Product  ZipCode
0      1365        2
1      2764        5
2      1343        1
3      2435        4
4      2435        3
5      2896        6
6      2869        8
7      1236        2
8      2435        8
9      1764        2
10     1547        1
11     1265        1
12     2465        9
    Product  ZipCode
0         3        1
1         8        4
2         2        0
3         6        3
4         6        2
5        10        5
6         9        6
7         0        1
8         6        6
9         5        1
10        4        0
11        1        0
12        7        7
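
If the original values are needed later, keep a fitted LabelEncoder per column so its inverse_transform is available (a sketch, not in the original notebook; the .apply variant above discards the fitted encoders):

from sklearn import preprocessing

product_encoder = preprocessing.LabelEncoder()
product_codes = product_encoder.fit_transform(customer_data['Product'])
print(product_codes)                                     # numeric codes
print(product_encoder.inverse_transform(product_codes))  # back to the original values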

Preprocess / Normalise

  • possibilities for normalisation:
    • StandardScaler (transforms to zero mean and unit variance)
    • MinMaxScaler (e.g. scales to [0, 1])
In [6]:
from sklearn import preprocessing

# in place (overwrites the original columns):
#customer_data[['ItemsBought', 'ItemsReturned']] = preprocessing.MinMaxScaler().fit_transform(customer_data[['ItemsBought', 'ItemsReturned']])

# creating a new dataframe:
customer_data_normalised = pd.DataFrame(
    preprocessing.MinMaxScaler().fit_transform(customer_data[['ItemsBought', 'ItemsReturned']]),
    columns=['ItemsBought', 'ItemsReturned']
)
In [7]:
print(customer_data[['ItemsBought', 'ItemsReturned']])
print(customer_data_normalised)
    ItemsBought  ItemsReturned
0            45             10
1            42             18
2            50              0
3            13             12
4            10              7
5            34             17
6            40             20
7            40              8
8             9              9
9            36              7
10           42              1
11           46              1
12           41             22
    ItemsBought  ItemsReturned
0      0.878049       0.454545
1      0.804878       0.818182
2      1.000000       0.000000
3      0.097561       0.545455
4      0.024390       0.318182
5      0.609756       0.772727
6      0.756098       0.909091
7      0.756098       0.363636
8      0.000000       0.409091
9      0.658537       0.318182
10     0.804878       0.045455
11     0.902439       0.045455
12     0.780488       1.000000
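
StandardScaler is used in exactly the same way (a sketch, not in the original notebook):

from sklearn import preprocessing

customer_data_standardised = pd.DataFrame(
    preprocessing.StandardScaler().fit_transform(customer_data[['ItemsBought', 'ItemsReturned']]),
    columns=['ItemsBought', 'ItemsReturned']
)
print(customer_data_standardised.mean())  # approximately 0 per column
print(customer_data_standardised.std())   # approximately 1 per column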

Pandas

  • extending the data with your own values is done with assign
In [8]:
customer_data_with_cluster = customer_data.assign(cluster=estimator.labels_)
customer_data_with_cluster
Out[8]:
    Customer ID  ItemsBought  ItemsReturned  ZipCode  Product  cluster
0             4           45             10        2     1365        1
1             5           42             18        5     2764        1
2             6           50              0        1     1343        1
3             8           13             12        4     2435        0
4             9           10              7        3     2435        0
5            10           34             17        6     2896        1
6            11           40             20        8     2869        1
7            12           40              8        2     1236        1
8            14            9              9        8     2435        0
9            15           36              7        2     1764        1
10           16           42              1        1     1547        1
11           17           46              1        1     1265        1
12           21           41             22        9     2465        1
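
With the cluster column in place, pandas can summarise the clusters (a sketch, not in the original notebook):

# mean ItemsBought / ItemsReturned per cluster
print(customer_data_with_cluster.groupby('cluster')[['ItemsBought', 'ItemsReturned']].mean())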

Agglomerative Hierarchical Clustering (for nice diagrams):

  • using the linkage function from scipy
    • linkage(data, method, metric)
    • the method parameter can be one of "single", "complete", "average", "weighted", "centroid", "median", "ward"
    • the metric parameter (the default is okay for us) can be one of "euclidean", "minkowski", "cityblock", "seuclidean", "hamming", "jaccard" and others (see the pdist function)
  • plotting is done with the dendrogram function
    • dendrogram(Z), where Z is the result of the linkage function
    • nice labels are added with
      • dendrogram(Z, labels=data['column'].values)
In [9]:
from scipy.cluster.hierarchy import dendrogram, linkage

Z = linkage(customer_data[['ItemsBought', 'ItemsReturned']], 'ward')

plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Customer IDs')
plt.ylabel('distance')
dendrogram(Z, labels=customer_data['Customer ID'].values)
plt.show()

Truncate the dendrogram

  • set truncate_mode to 'lastp'
  • set the p parameter to the number of clusters to show
In [10]:
plt.title('Dendrogram - 3 clusters')
plt.xlabel('Count of Customers')
plt.ylabel('distance')
dendrogram(Z,
        truncate_mode='lastp',
        p=3)
plt.show()

Changing the orientation of the dendrogram is done with the orientation parameter

In [11]:
dendrogram(
    Z,
    orientation='right'
)  # dendrogram returns a dict of plot data, not an Axes object
plt.show()
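
Flat cluster labels can also be extracted directly from the linkage result Z with scipy's fcluster (a sketch, not in the original notebook; criterion='maxclust' cuts the tree into the requested number of clusters):

from scipy.cluster.hierarchy import fcluster

flat_labels = fcluster(Z, t=3, criterion='maxclust')  # 3 clusters, labelled 1..3
print(flat_labels)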

Agglomerative Hierarchical Clustering (for cluster values):

  • usage is like KMeans: fit_predict returns the cluster number of each sample
In [12]:
from sklearn.cluster import AgglomerativeClustering

agg = AgglomerativeClustering(n_clusters = 3)
agg_predictions = agg.fit_predict(customer_data[['ItemsBought', 'ItemsReturned']])

plt.scatter(customer_data['ItemsBought'], customer_data['ItemsReturned'], c=agg_predictions)
plt.show()

DBSCAN:

  • usage is nearly the same as KMeans, but the number of clusters does not have to be given; the important parameters are eps and min_samples (see the sketch after the plot below)
In [13]:
from sklearn.cluster import DBSCAN
db = DBSCAN().fit(customer_data[['ItemsBought', 'ItemsReturned']])
plt.scatter(customer_data['ItemsBought'], customer_data['ItemsReturned'], c=db.labels_)
plt.show()
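
The defaults (eps=0.5, min_samples=5) rarely suit unscaled data, so the parameters usually need tuning; a sketch with explicit values (the numbers here are assumptions for this small dataset, not from the original notebook):

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=10, min_samples=3).fit(customer_data[['ItemsBought', 'ItemsReturned']])
print(db.labels_)  # noise points, if any, get the label -1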

Load Data (arff)

In [14]:
from scipy.io import arff
zoo_arff_data, zoo_arff_meta = arff.loadarff(open('zoo.arff', 'r'))
zoo_data = pd.DataFrame(zoo_arff_data)
zoo_data.head()
Out[14]:
animal hair feathers eggs milk airborne aquatic predator toothed backbone breathes venomous fins legs tail domestic catsize type
0 b'aardvark' b'true' b'false' b'false' b'true' b'false' b'false' b'true' b'true' b'true' b'true' b'false' b'false' 4.0 b'false' b'false' b'true' b'mammal'
1 b'antelope' b'true' b'false' b'false' b'true' b'false' b'false' b'false' b'true' b'true' b'true' b'false' b'false' 4.0 b'true' b'false' b'true' b'mammal'
2 b'bass' b'false' b'false' b'true' b'false' b'false' b'true' b'true' b'true' b'true' b'false' b'false' b'true' 0.0 b'true' b'false' b'false' b'fish'
3 b'bear' b'true' b'false' b'false' b'true' b'false' b'false' b'true' b'true' b'true' b'true' b'false' b'false' 4.0 b'false' b'false' b'true' b'mammal'
4 b'boar' b'true' b'false' b'false' b'true' b'false' b'false' b'true' b'true' b'true' b'true' b'false' b'false' 4.0 b'true' b'false' b'true' b'mammal'
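
loadarff returns nominal attributes as byte strings (the b'...' values above); a sketch for decoding them into regular strings (an assumption here: every object column holds bytes):

# decode all byte-string columns in place
for col in zoo_data.select_dtypes(include='object').columns:
    zoo_data[col] = zoo_data[col].str.decode('utf-8')
zoo_data.head()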

Creating multiple plots

In [16]:
#for i in range(2, 5):  # equivalent range-based loop
for i in [2, 3, 4]:
    estimator = KMeans(n_clusters = i)
    estimator.fit(customer_data[['ItemsBought', 'ItemsReturned']])

    plt.title("#cluster = {}".format(i))
    plt.scatter(customer_data['ItemsBought'], customer_data['ItemsReturned'], c=estimator.labels_)
    plt.show() 

Subplots (side by side)

In [34]:
plt.figure(1, figsize=(10, 10))  # otherwise the figure is really small

counter = 1  # subplot indices start at 1, so we keep our own counter
for i in [2, 3, 4, 5, 6]:
    plt.subplot(3, 2, counter)  # plt.subplot(rows, columns, current_index)
    # plt.tight_layout()  # sometimes needed when the plots overlap a bit
    counter += 1
    
    estimator = KMeans(n_clusters = i)
    estimator.fit(customer_data[['ItemsBought', 'ItemsReturned']])
    
    plt.title("#cluster = {}".format(i))
    plt.scatter(customer_data['ItemsBought'], customer_data['ItemsReturned'], c=estimator.labels_)
plt.show()
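
An alternative (not in the original notebook) is matplotlib's object-oriented interface, which creates the whole grid up front and avoids the manual counter:

fig, axes = plt.subplots(3, 2, figsize=(10, 10))
for ax, i in zip(axes.ravel(), [2, 3, 4, 5, 6]):
    estimator = KMeans(n_clusters=i).fit(customer_data[['ItemsBought', 'ItemsReturned']])
    ax.set_title("#cluster = {}".format(i))
    ax.scatter(customer_data['ItemsBought'], customer_data['ItemsReturned'], c=estimator.labels_)
plt.show()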