Intro to scikit-learn

Machine learning: the problem setting

outlook   temperature  humidity  wind   play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cold         normal    false  yes
rainy     cold         normal    true   no
overcast  cold         normal    true   yes
sunny     mild         high      false  ?
  • a learning problem considers a set of n samples of data, each of which has m features (e.g. outlook, temperature, etc.; see the DataFrame sketch after this list)
  • it tries to predict properties of unknown data (the ? in the last row above)
  • we can separate learning problems into two categories:
    • supervised learning: predict additional attributes
      • classification: discrete values (e.g. play)
      • regression: continuous values (e.g. temperature in degrees Celsius)
    • unsupervised learning: e.g. find clusters of "similar" days
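
As a concrete illustration (a sketch, not part of the original notebook), the weather table above can be stored as a pandas DataFrame, with the play column as the classification target:

import pandas as pd

# the weather data from the table above; the last row has an unknown target
weather = pd.DataFrame({
    'outlook':     ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 'sunny'],
    'temperature': ['hot', 'hot', 'hot', 'mild', 'cold', 'cold', 'cold', 'mild'],
    'humidity':    ['high', 'high', 'high', 'high', 'normal', 'normal', 'normal', 'high'],
    'wind':        [False, True, False, False, False, True, True, False],
    'play':        ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', None],
})

X = weather.drop(columns='play')  # n samples x m features
y = weather['play']               # target for classification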

Structure of the data

  • the data is an array of shape (n_samples, n_features)
      [[  0.   0.   5. ...,   0.   0.   0.]
       [  0.   0.   0. ...,  10.   0.   0.]
       [  0.   0.   0. ...,  16.   9.   0.]
       ...,
       [  0.   0.   1. ...,   6.   0.   0.]
       [  0.   0.   2. ...,  12.   0.   0.]
       [  0.   0.  10. ...,  12.   1.   0.]]
  • in case of supervised learning, we also have a target variable of shape (n_samples,)
      [True, False, True, True, False, ...]

Learning and predicting (supervised)

  • two important methods:
    • fit(X, y): learn a model from the training data X and the targets y
    • predict(T): predict the targets of new data T
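A minimal supervised sketch (not part of the original notebook; the classifier choice and the toy data are assumptions):

from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]  # 4 samples, 2 features
y = [0, 0, 1, 1]                      # known class labels

clf = DecisionTreeClassifier()
clf.fit(X, y)                     # learn a model from the labelled samples
print(clf.predict([[0.9, 0.2]]))  # predict the class of unseen data -> [1]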

Clustering (unsupervised)

  • one important method:
    • fit_predict(X) (fits the model and then directly predicts the cluster of each sample)

Load data
In [1]:
import pandas as pd
customer_data = pd.read_excel('CustomerDataSet.xls')
customer_data.head()
Out[1]:
   Customer ID  ItemsBought  ItemsReturned  ZipCode  Product
0            4           45             10        2     1365
1            5           42             18        5     2764
2            6           50              0        1     1343
3            8           13             12        4     2435
4            9           10              7        3     2435

KMeans clustering:

  • import KMeans from sklearn.cluster
  • important parameter: n_clusters (the number of clusters)
  • the cluster number of each sample is available via the .labels_ attribute
In [2]:
from sklearn.cluster import KMeans
estimator = KMeans(n_clusters = 2)
labels = estimator.fit_predict(customer_data[['ItemsBought', 'ItemsReturned']])
print(labels)

# OR

estimator.fit(customer_data[['ItemsBought', 'ItemsReturned']])
print(estimator.labels_)
[0 0 0 1 1 0 0 0 1 0 0 0 0]
[1 1 1 0 0 1 1 1 0 1 1 1 1]
(both runs find the same grouping; only the numbering of the clusters differs, because KMeans starts from a random initialisation)

Plotting clusters:

For plotting the clusters, pass the cluster labels to the c parameter of plt.scatter, which colours each point by its cluster

In [3]:
import matplotlib.pyplot as plt

plt.title("KMeans #cluster = 2")
plt.xlabel('ItemsBought')
plt.ylabel('ItemsReturned')
plt.scatter(customer_data['ItemsBought'], customer_data['ItemsReturned'], c=estimator.labels_)
plt.show()

Calculating the silhouette score

  • measures how well each sample fits its own cluster compared to the next-closest cluster (range -1 to 1; higher is better)

In [4]:
from sklearn.metrics import silhouette_score
silhouette = silhouette_score(customer_data[['ItemsBought', 'ItemsReturned']], labels)
silhouette
Out[4]:
0.67031934343920685
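
A small sketch (not in the original notebook) that uses the score to compare several cluster counts on the same columns:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# higher silhouette = better separated clusters
for k in [2, 3, 4, 5]:
    labels_k = KMeans(n_clusters=k).fit_predict(customer_data[['ItemsBought', 'ItemsReturned']])
    print(k, silhouette_score(customer_data[['ItemsBought', 'ItemsReturned']], labels_k))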

Preprocess / Encoding

  • most machine learning algorithms can only work with numerical data
  • to "translate" text data into numerical data, one can use the LabelEncoder
In [5]:
print(customer_data[['Product', 'ZipCode']])

from sklearn import preprocessing
customer_data_encoded = customer_data[['Product', 'ZipCode']].apply(preprocessing.LabelEncoder().fit_transform)

print(customer_data_encoded[['Product', 'ZipCode']]) 
    Product  ZipCode
0      1365        2
1      2764        5
2      1343        1
3      2435        4
4      2435        3
5      2896        6
6      2869        8
7      1236        2
8      2435        8
9      1764        2
10     1547        1
11     1265        1
12     2465        9
    Product  ZipCode
0         3        1
1         8        4
2         2        0
3         6        3
4         6        2
5        10        5
6         9        6
7         0        1
8         6        6
9         5        1
10        4        0
11        1        0
12        7        7
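
If the original values are needed later, keep a fitted LabelEncoder per column so its inverse_transform is available (a sketch, not in the original notebook; the .apply variant above discards the fitted encoders):

from sklearn import preprocessing

product_encoder = preprocessing.LabelEncoder()
product_codes = product_encoder.fit_transform(customer_data['Product'])
print(product_codes)                                     # numeric codes
print(product_encoder.inverse_transform(product_codes))  # back to the original values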

Preprocess / Normalise

  • possibilities for normalisation:
    • StandardScaler (transforms to zero mean and unit variance)
    • MinMaxScaler (e.g. scales to [0, 1])
In [6]:
from sklearn import preprocessing

# in place (overwrites the original columns):
#customer_data[['ItemsBought', 'ItemsReturned']] = preprocessing.MinMaxScaler().fit_transform(customer_data[['ItemsBought', 'ItemsReturned']])

# creating a new dataframe:
customer_data_normalised = pd.DataFrame(
    preprocessing.MinMaxScaler().fit_transform(customer_data[['ItemsBought', 'ItemsReturned']]),
    columns=['ItemsBought', 'ItemsReturned']
)
In [7]:
print(customer_data[['ItemsBought', 'ItemsReturned']])
print(customer_data_normalised)
    ItemsBought  ItemsReturned
0            45             10
1            42             18
2            50              0
3            13             12
4            10              7
5            34             17
6            40             20
7            40              8
8             9              9
9            36              7
10           42              1
11           46              1
12           41             22
    ItemsBought  ItemsReturned
0      0.878049       0.454545
1      0.804878       0.818182
2      1.000000       0.000000
3      0.097561       0.545455
4      0.024390       0.318182
5      0.609756       0.772727
6      0.756098       0.909091
7      0.756098       0.363636
8      0.000000       0.409091
9      0.658537       0.318182
10     0.804878       0.045455
11     0.902439       0.045455
12     0.780488       1.000000
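
StandardScaler is used in exactly the same way (a sketch, not in the original notebook):

from sklearn import preprocessing

customer_data_standardised = pd.DataFrame(
    preprocessing.StandardScaler().fit_transform(customer_data[['ItemsBought', 'ItemsReturned']]),
    columns=['ItemsBought', 'ItemsReturned']
)
print(customer_data_standardised.mean())  # approximately 0 per column
print(customer_data_standardised.std())   # approximately 1 per column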

Pandas

  • extending the data with your own values is done with assign
In [8]:
customer_data_with_cluster = customer_data.assign(cluster=estimator.labels_)
customer_data_with_cluster
Out[8]:
    Customer ID  ItemsBought  ItemsReturned  ZipCode  Product  cluster
0             4           45             10        2     1365        1
1             5           42             18        5     2764        1
2             6           50              0        1     1343        1
3             8           13             12        4     2435        0
4             9           10              7        3     2435        0
5            10           34             17        6     2896        1
6            11           40             20        8     2869        1
7            12           40              8        2     1236        1
8            14            9              9        8     2435        0
9            15           36              7        2     1764        1
10           16           42              1        1     1547        1
11           17           46              1        1     1265        1
12           21           41             22        9     2465        1
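
With the cluster column in place, pandas can summarise the clusters (a sketch, not in the original notebook):

# mean ItemsBought / ItemsReturned per cluster
print(customer_data_with_cluster.groupby('cluster')[['ItemsBought', 'ItemsReturned']].mean())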

Agglomerative Hierarchical Clustering (for nice diagrams):

  • using the linkage function from scipy
    • linkage(data, method, metric)
    • the method parameter can be one of "single", "complete", "average", "weighted", "centroid", "median", "ward"
    • the metric parameter (the default is okay for us) can be one of "euclidean", "minkowski", "cityblock", "seuclidean", "hamming", "jaccard" and others (see the pdist function)
  • plotting is done with the dendrogram function
    • dendrogram(Z), where Z is the result of the linkage function
    • nice labels are added with
      • dendrogram(Z, labels=data['column'].values)
In [9]:
from scipy.cluster.hierarchy import dendrogram, linkage

Z = linkage(customer_data[['ItemsBought', 'ItemsReturned']], 'ward')

plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Customer IDs')
plt.ylabel('distance')
dendrogram(Z, labels=customer_data['Customer ID'].values)
plt.show()

Truncate the dendrogram

  • set truncate_mode to 'lastp'
  • set the p parameter to the number of clusters to show
In [10]:
plt.title('Dendrogram - 3 clusters')
plt.xlabel('Count of Customers')
plt.ylabel('distance')
dendrogram(Z,
        truncate_mode='lastp',
        p=3)
plt.show()

Changing the orientation of the dendrogram is done with the orientation parameter

In [11]:
dendrogram(
    Z,
    orientation='right'
)  # dendrogram returns a dict of plot data, not an Axes object
plt.show()
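
Flat cluster labels can also be extracted directly from the linkage result Z with scipy's fcluster (a sketch, not in the original notebook; criterion='maxclust' cuts the tree into the requested number of clusters):

from scipy.cluster.hierarchy import fcluster

flat_labels = fcluster(Z, t=3, criterion='maxclust')  # 3 clusters, labelled 1..3
print(flat_labels)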

Agglomerative Hierarchical Clustering (for cluster values):

  • usage is like KMeans: fit_predict returns the cluster number of each sample
In [12]:
from sklearn.cluster import AgglomerativeClustering

agg = AgglomerativeClustering(n_clusters = 3)
agg_predictions = agg.fit_predict(customer_data[['ItemsBought', 'ItemsReturned']])

plt.scatter(customer_data['ItemsBought'], customer_data['ItemsReturned'], c=agg_predictions)
plt.show()

DBSCAN:

  • usage is nearly the same as KMeans, but the number of clusters does not have to be given; the important parameters are eps and min_samples (see the sketch after the plot below)
In [13]:
from sklearn.cluster import DBSCAN
db = DBSCAN().fit(customer_data[['ItemsBought', 'ItemsReturned']])
plt.scatter(customer_data['ItemsBought'], customer_data['ItemsReturned'], c=db.labels_)
plt.show()
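
The defaults (eps=0.5, min_samples=5) rarely suit unscaled data, so the parameters usually need tuning; a sketch with explicit values (the numbers here are assumptions for this small dataset, not from the original notebook):

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=10, min_samples=3).fit(customer_data[['ItemsBought', 'ItemsReturned']])
print(db.labels_)  # noise points, if any, get the label -1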

Load Data (arff)

In [14]:
from scipy.io import arff
zoo_arff_data, zoo_arff_meta = arff.loadarff(open('zoo.arff', 'r'))
zoo_data = pd.DataFrame(zoo_arff_data)
zoo_data.head()
Out[14]:
animal hair feathers eggs milk airborne aquatic predator toothed backbone breathes venomous fins legs tail domestic catsize type
0 b'aardvark' b'true' b'false' b'false' b'true' b'false' b'false' b'true' b'true' b'true' b'true' b'false' b'false' 4.0 b'false' b'false' b'true' b'mammal'
1 b'antelope' b'true' b'false' b'false' b'true' b'false' b'false' b'false' b'true' b'true' b'true' b'false' b'false' 4.0 b'true' b'false' b'true' b'mammal'
2 b'bass' b'false' b'false' b'true' b'false' b'false' b'true' b'true' b'true' b'true' b'false' b'false' b'true' 0.0 b'true' b'false' b'false' b'fish'
3 b'bear' b'true' b'false' b'false' b'true' b'false' b'false' b'true' b'true' b'true' b'true' b'false' b'false' 4.0 b'false' b'false' b'true' b'mammal'
4 b'boar' b'true' b'false' b'false' b'true' b'false' b'false' b'true' b'true' b'true' b'true' b'false' b'false' 4.0 b'true' b'false' b'true' b'mammal'
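
loadarff returns nominal attributes as byte strings (the b'...' values above); a sketch for decoding them into regular strings (an assumption here: every object column holds bytes):

# decode all byte-string columns in place
for col in zoo_data.select_dtypes(include='object').columns:
    zoo_data[col] = zoo_data[col].str.decode('utf-8')
zoo_data.head()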

Creating multiple plots

In [16]:
#for i in range(2, 5):  # equivalent range-based loop
for i in [2, 3, 4]:
    estimator = KMeans(n_clusters = i)
    estimator.fit(customer_data[['ItemsBought', 'ItemsReturned']])

    plt.title("#cluster = {}".format(i))
    plt.scatter(customer_data['ItemsBought'], customer_data['ItemsReturned'], c=estimator.labels_)
    plt.show() 

Subplots (side by side)

In [34]:
plt.figure(1, figsize=(10, 10))  # otherwise the figure is really small

counter = 1  # subplot indices start at 1, so we keep our own counter
for i in [2, 3, 4, 5, 6]:
    plt.subplot(3, 2, counter)  # plt.subplot(rows, columns, current_index)
    # plt.tight_layout()  # sometimes needed when the plots overlap a bit
    counter += 1
    
    estimator = KMeans(n_clusters = i)
    estimator.fit(customer_data[['ItemsBought', 'ItemsReturned']])
    
    plt.title("#cluster = {}".format(i))
    plt.scatter(customer_data['ItemsBought'], customer_data['ItemsReturned'], c=estimator.labels_)
plt.show()
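
An alternative (not in the original notebook) is matplotlib's object-oriented interface, which creates the whole grid up front and avoids the manual counter:

fig, axes = plt.subplots(3, 2, figsize=(10, 10))
for ax, i in zip(axes.ravel(), [2, 3, 4, 5, 6]):
    estimator = KMeans(n_clusters=i).fit(customer_data[['ItemsBought', 'ItemsReturned']])
    ax.set_title("#cluster = {}".format(i))
    ax.scatter(customer_data['ItemsBought'], customer_data['ItemsReturned'], c=estimator.labels_)
plt.show()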